Difference between revisions of "Annotated-directory-big-data"
Revision as of 11:21, 7 October 2011
This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.
Google ngrams
- URL - http://books.google.com/ngrams/datasets
- Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the "Google Million", English fiction, etc. Each set contains a list of ngrams, frequency, and date information.
- Curator - CharlieP
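As a rough illustration of working with such a file, the sketch below sums the frequency of each ngram across all dates. It assumes a tab-separated layout in which the first three columns are the ngram, the year, and the match count; that layout, the function name, and the sample rows are assumptions made for illustration and should be checked against the dataset's own documentation before use.

```python
from collections import defaultdict
from typing import Dict, Iterable


def total_counts_by_ngram(lines: Iterable[str]) -> Dict[str, int]:
    """Sum the per-year match counts for each ngram.

    Assumed layout (not confirmed by this page): tab-separated columns
    ngram, year, match count, followed by optional extra columns that
    are ignored here.
    """
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, match_count = fields[0], fields[2]
        totals[ngram] += int(match_count)
    return dict(totals)


# Made-up sample rows, purely illustrative.
sample = [
    "example\t1990\t10\t8\t5\n",
    "example\t1991\t15\t12\t7\n",
    "another\t1990\t3\t3\t2\n",
]
result = total_counts_by_ngram(sample)
print(result)
```

For a real file, the same function can be given an open file handle instead of the sample list, so the data is read line by line rather than loaded into memory at once.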
MusicBrainz
- URL - http://musicbrainz.org/doc/MusicBrainz_Database
- Description - In a nutshell, the musical equivalent of IMDb.
- Curator - Jahelton07
World Cube Association Database
- Browse - http://worldcubeassociation.org/results/
- Download Database - http://www.worldcubeassociation.org/results/misc/export.html
- Description - All times, competitions, and competitors from WCA competitions from 1984 until now.
- Curator - Twright09
Large Data Sets on AWS
- URL - http://aws.amazon.com/publicdatasets/#1
- Description - A list of large data sets hosted on Amazon's AWS. Further data sets can be found within the four links in the list.
- Curator - Twright09
Starcraft 2 Hit Analysis
- URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)
- Description - Analysis of the number of hits any given unit can sustain from any other unit
- Curator - rdbean08
Starcraft 2 Combat Analysis
- URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)
- Description - Analysis of the percent chance of victory for any given unit versus any other unit
- Curator - rdbean08
Twitter Users by Location
- URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077
- Description - Twitter Census: Twitter Users by Location
- Curator - ibabic09
The AOL Search Data
- URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079
- Description - The AOL Search Data is a collection of query logs from real users. The data set consists of 20M web queries collected from 650k users over three months.
- Curator - ibabic09
CGI 60 Genomes
- URL: http://data.bionimbus.org/60-genome-data-set/
- Description: A set of public genome sequences. There are four sets of data: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations.
- Curator: eosergi10
Dataset for "Statistics and Social Network of YouTube Videos"
- URL: http://netsg.cs.sfu.ca/youtubedata/
- Description: Datasets from normal and updating crawls of YouTube
- Curator: eosergi10
IMDb
- URL: http://www.imdb.com/interfaces
- Description: All the data used to create IMDb, available from any of the three FTP sites listed under "Plain Text Data Files"
- Curator: gaschue08