Annotated-directory-big-data
This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.
Google ngrams
- URL - http://books.google.com/ngrams/datasets
- Description - The ngram data sets on which Google's Ngram Viewer is built. A variety of corpora are available, e.g. by language, the "Google Million", English fiction, etc. Each set lists ngrams together with their frequency and date information.
- Curator - CharlieP
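Since each set pairs ngrams with frequency and date information, a small reader can sum counts per ngram. The following is a minimal sketch only: it assumes each line is tab-separated as ngram, year, match count, volume count, which may not match every release of the data, so check the format notes on the download page first. The names `NgramRecord`, `parse_line`, and `total_matches` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    match_count: int
    volume_count: int


def parse_line(line: str) -> NgramRecord:
    # ASSUMPTION: four tab-separated fields per line
    # (ngram, year, match_count, volume_count).
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(match_count), int(volume_count))


def total_matches(lines: Iterable[str]) -> Dict[str, int]:
    # Sum the match counts over all years for each ngram.
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_line(line)
        totals[record.ngram] += record.match_count
    return dict(totals)


# Two made-up sample lines in the assumed layout.
sample = [
    "circumvallate\t1978\t335\t91\n",
    "circumvallate\t1979\t261\t91\n",
]
totals = total_matches(sample)
```

Because `total_matches` accepts any iterable of lines, an open file handle can be passed directly, so a large file is read one line at a time rather than loaded into memory.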
MusicBrainz
- URL - http://musicbrainz.org/doc/MusicBrainz_Database
- Description - In a nutshell, the musical equivalent of IMDb.
- Curator - Jahelton07
World Cube Association Database
- Browse - http://worldcubeassociation.org/results/
- Download Database - http://www.worldcubeassociation.org/results/misc/export.html
- Description - All times, competitions, and competitors from WCA competitions, from 1982 to the present.
- Curator - Twright09
Large Data Sets on AWS
- URL - http://aws.amazon.com/publicdatasets/#1
- Description - A list of large data sets hosted on Amazon's AWS. More data sets can be found by following the four links in the list.
- Curator - Twright09
Starcraft 2 Hit Analysis
- URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)
- Description - Analysis of the number of hits any given unit can sustain from any other unit
- Curator - rdbean08
Starcraft 2 Combat Analysis
- URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)
- Description - Analysis of the percent chance of victory for any given unit versus any other unit
- Curator - rdbean08
Twitter Users by Location
- URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077
- Description - Twitter Census: Twitter Users by Location
- Curator - ibabic09
The AOL Search Data
- URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079
- Description - The AOL Search Data is a collection of real query logs from real users. The data set consists of 20M web queries collected from 650k users over three months.
- Curator - ibabic09
Research and Innovative Technology Administration
- URL - http://www.rita.dot.gov/
- Description - RITA coordinates the U.S. Department of Transportation's research and education programs. RITA also offers vital transportation statistics and analysis, and supports national efforts to improve education and training in transportation-related fields.
- Curator - eosergi10
DBpedia
- URL - http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/
- Description - DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. The linked release is based on Wikipedia dumps dating from late July 2011.
- Curator - eosergi10
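As an illustration of what a query against DBpedia might look like, the sketch below builds a request URL for a SPARQL query. It is not taken from the release notes: the endpoint address `https://dbpedia.org/sparql`, the `format` parameter, and the example query about films by one director are all assumptions, and `build_query_url` is a hypothetical helper. The code only constructs the URL and sends nothing over the network.

```python
from urllib.parse import urlencode

# ASSUMPTION: the public SPARQL endpoint lives at this address.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Example query (illustrative only): up to ten films by one director.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Stanley_Kubrick .
} LIMIT 10
"""


def build_query_url(sparql: str, endpoint: str = DBPEDIA_ENDPOINT) -> str:
    # Encode the query as URL parameters and ask for JSON results.
    # ASSUMPTION: the endpoint accepts "query" and "format" parameters.
    params = {
        "query": sparql,
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


url = build_query_url(QUERY)
```

The resulting `url` can be opened in a browser or fetched with any HTTP client. Downloading the dump files from the release page remains the better option for bulk analysis.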
IMDb
- URL - http://www.imdb.com/interfaces
- Description - All the data used to create IMDb, available from any of the three FTP sites listed under "Plain Text Data Files".
- Curator - gaschue08