

CSE6339 Computational Journalism Chengkai Li University of Texas at Arlington Spring 2015

Big Data http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Big Data: The 4 Vs
o Volume
o Variety
o Velocity
o Veracity

Volume: How much data is out there?
http://www.sciencedaily.com/releases/2013/05/130522085217.htm
http://www.storagenewsletter.com/rubriques/market-reportsresearch/ibm-cmo-study/

Variety: Types of Data
Structured Data
o (relational) database tables
o CSV/TSV files
Semi-structured Data
o XML
o JSON
o RDF
Unstructured Data
o text data (documents, Web pages, short texts (e.g., social media))
Multimedia Data
o images, videos, audio
Other types of data
o matrices, graphs, sequences, time-series, spatiotemporal
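To make the structured versus semi-structured distinction concrete, here is a minimal sketch that reads one record from CSV, JSON, and XML using only the Python standard library. The sample record and the field names (`name`, `state`) are made up for illustration and do not come from any dataset in these slides.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample record, serialized three ways.
CSV_TEXT = "name,state\nDowntown Market,TX\n"
JSON_TEXT = '{"markets": [{"name": "Downtown Market", "state": "TX"}]}'
XML_TEXT = (
    "<markets><market><name>Downtown Market</name>"
    "<state>TX</state></market></markets>"
)


def from_csv(text):
    # The header row supplies the field names; each later row is one record.
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_json(text):
    # JSON carries its own nesting; here the records sit under "markets".
    return json.loads(text)["markets"]


def from_xml(text):
    # Each <market> element becomes a dict of child tag -> child text.
    root = ET.fromstring(text)
    return [
        {child.tag: child.text for child in market}
        for market in root.findall("market")
    ]


if __name__ == "__main__":
    for parse, text in ((from_csv, CSV_TEXT), (from_json, JSON_TEXT), (from_xml, XML_TEXT)):
        print(parse.__name__, parse(text))
```

All three parsers yield the same list of dictionaries for this toy record; real files usually need extra handling for types, encodings, and missing fields.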

Velocity: Streaming Data
o Stock Trades
o Highway Sensors
o Weather Data
o Social Media
o Telephone Calls
o Video Streaming
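Streaming sources like the ones above are typically processed one item at a time, without storing the full history. As a rough sketch of that idea, the generator below keeps a running average over the most recent readings. The function name, window size, and sample values are illustrative assumptions, not part of the course material.

```python
from collections import deque


def sliding_average(stream, window=3):
    """Yield the mean of the most recent `window` values after each reading."""
    recent = deque(maxlen=window)  # the oldest value is evicted automatically
    total = 0.0
    for value in stream:
        if len(recent) == window:
            total -= recent[0]  # subtract the value about to be evicted
        recent.append(value)
        total += value
        yield total / len(recent)


if __name__ == "__main__":
    readings = [10, 20, 30, 40]  # hypothetical sensor readings
    print(list(sliding_average(readings, window=3)))
```

Memory use stays bounded by the window size no matter how long the stream runs, which is the main point of the pattern.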

http://mashable.com/2012/06/22/data-created-

Datasets
o Amazon Public Data Sets
o Data.gov
o Linked Open Data
o Knowledge Bases, Encyclopedia
o Yahoo! Webscope
o Bibliography Databases
o Network/Graph Datasets
o UCI Machine Learning Repository
o UCR Time Series Classification/Clustering
o Time Series Data Library
o KDnuggets Dataset List
o KDD Cup Datasets

Amazon Public Data Sets http://aws.amazon.com/public-datasets/
o NASA NEX: A collection of Earth science data sets maintained by NASA, including climate change projections and satellite images of the Earth's surface
o Common Crawl Corpus: A corpus of web crawl data composed of over 5 billion web pages
o 1000 Genomes Project: A detailed map of human genetic variation
o Google Books Ngrams: A data set containing Google Books n-gram corpuses
o US Census Data: US demographic data from 1980, 1990, and 2000 US Censuses
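For readers unfamiliar with the term "n-gram" used in the Google Books entry above: an n-gram is a run of n consecutive tokens. The sketch below counts n-grams in a short string using whitespace tokenization. This is a simplifying assumption for illustration and is not how the Google Books corpus itself was built.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-grams in text, using lowercase whitespace-separated tokens."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    counts = ngram_counts("to be or not to be", 2)
    print(counts.most_common(3))
```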

Data.gov http://www.data.gov/ (137,608 datasets)
o Consumer Complaint Database
o U.S. International Trade in Goods and Services: Monthly report that provides national trade data including imports, exports, and balance of payments for goods and services.
o DTV Reception Maps
o Climate Data Online
o Food Access Research Atlas: presents a spatial overview of food access indicators for low-income and other census tracts using different measures of supermarket accessibility.
o U.S. Hourly Precipitation Data
o Great Chile Earthquake of May 22, 1960
o Consumer Expenditure Survey
o Campus Security Data
o Farmers Markets Geographic Data: longitude and latitude, state, address, name, and zip code of Farmers Markets in the United States

Government Data
Government spending
o http://www.usaspending.gov/
Campaign finance
o http://www.fec.gov/disclosure.shtml
o http://www.opensecrets.org/
Congress voting record
o http://www.govtrack.us/ (Members of Congress, Bills & Resolutions, Voting Records, Committees)
Census
o http://www.census.gov/main/www/access.html

Linked Data http://linkeddata.org/ (hundreds of datasets, billions of RDF triples)
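An RDF triple is a (subject, predicate, object) statement. The toy sketch below stores a few triples as Python tuples and answers simple pattern queries, where `None` acts as a wildcard. The `ex:` identifiers and the `match` helper are hypothetical; real Linked Data uses full IRIs and is normally queried with SPARQL against a triple store.

```python
# Hypothetical triples; "ex:" stands in for a made-up namespace prefix.
TRIPLES = [
    ("ex:Arlington", "ex:locatedIn", "ex:Texas"),
    ("ex:Austin", "ex:locatedIn", "ex:Texas"),
    ("ex:Texas", "ex:partOf", "ex:USA"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples that agree with every non-None part of the pattern."""
    pattern = (s, p, o)
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


if __name__ == "__main__":
    # Which subjects are located in ex:Texas?
    for subject, _, _ in match(TRIPLES, p="ex:locatedIn", o="ex:Texas"):
        print(subject)
```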

Knowledge Bases, Encyclopedia
o Wikipedia, DBpedia
o Freebase/Google Knowledge Graph
o YAGO
o Probase
o LibraryThing

Yahoo! Webscope Datasets
o Language Data
o Graph and Social Data
o Ratings and Classification Data
o Advertising and Market Data
o Competition Data
o Computing Systems Data
o Image Data

Bibliography Databases
o Google Scholar, Microsoft Academic Search, DBLP, arXiv.org, CiteSeer, ArnetMiner
Drug and Disease Databases
o DrugBank, DailyMed, OMIM, KEGG Drug
Gene and Protein Databases
o UniProt, Protein Data Bank, GenBank

Stanford Large Network Dataset Collection http://snap.stanford.edu/data/
o Social networks: online social networks, edges represent interactions between people
o Networks with ground-truth communities: ground-truth network communities in social and information networks
o Communication networks: email communication networks with edges representing communication
o Citation networks: nodes represent papers, edges represent citations
o Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)
o Web graphs: nodes represent webpages and edges are hyperlinks
o Amazon networks: nodes represent products and edges link commonly co-purchased products
o Internet networks: nodes represent computers and edges communication
o Road networks: nodes represent intersections and edges represent the roads connecting them
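Network datasets of this kind are commonly distributed as plain-text edge lists. The sketch below assumes one whitespace-separated pair of node IDs per line, with `#` marking comment lines, and treats every edge as undirected. Both are assumptions to check against the documentation of the specific dataset, and the sample data here is made up.

```python
import io
from collections import defaultdict


def read_edge_list(lines):
    """Build an undirected adjacency map from 'src dst' lines."""
    adjacency = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        src, dst = line.split()[:2]
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return adjacency


# Hypothetical four-node graph standing in for a downloaded file.
SAMPLE = "# FromNodeId ToNodeId\n0\t1\n0\t2\n1\t2\n2\t3\n"

graph = read_edge_list(io.StringIO(SAMPLE))
degrees = {node: len(neighbors) for node, neighbors in graph.items()}

if __name__ == "__main__":
    print(degrees)
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper; for directed datasets such as citation networks, drop the second `add` call.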

Time Series Data Library http://robjhyndman.com/TSDL/

KDnuggets Dataset List http://www.kdnuggets.com/datasets/index.html

KDD Cup Datasets http://www.sigkdd.org/kddcup/index.php

Data Mining Software
Free, open-source
o RapidMiner
o Weka: data mining tool in Java
o SCaVis: scientific computation and visualization, Java
o Orange: Python suite
o Scikit-learn: Python machine learning library
o NumPy/SciPy/IPython/mlpy (Python modules for scientific computing, scientific library, interactive computing, machine learning)
o R: statistical computing and graphics
o RattleGUI: data mining GUI using R
o Octave: numerical analysis
o Shogun: machine learning toolkit in C++
Text Mining Tools
o NLTK (NLP Toolkit): NLP suite for Python
o SenticNet API: sentiment analysis
o Stanford NLP software
o UIMA
Large-Scale Data Processing, Machine Learning
o Apache Mahout
o GraphLab
o MapReduce/Hadoop
o Spark
o Pregel/Giraph
Commercial
o Matlab
o Oracle Data Mining
o SAS
o IBM SPSS
o Microsoft SQL Server Analysis Services
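The MapReduce model listed above under large-scale data processing can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the map, shuffle, and reduce phases. The function names are illustrative and are not the Hadoop or Spark API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big graphs"]))
```

In a real cluster the map and reduce calls run in parallel on different machines and the framework performs the shuffle; the program structure stays the same.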
