Annotated-directory-big-data

From Earlham CS Department
Revision as of 00:40, 14 October 2011 by Eosergi10 (talk | contribs) (Dataset for "Statistics and Social Network of YouTube Videos")

This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.

Google ngrams

  • URL: http://books.google.com/ngrams/datasets
  • Description: The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the "Google Million", English fiction, etc. Each set lists ngrams together with their frequency and date information.
  • Curator: CharlieP
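
As a rough illustration of working with a list of ngrams, frequencies, and dates, the sketch below sums the frequency of each ngram across all years. It assumes each row is tab-separated with the ngram, the year, and a match count in the first three columns; that layout, the function name total_counts, and the sample rows are assumptions made for illustration, so check the format notes on the download page before relying on it.

```python
from collections import defaultdict
from typing import Dict, Iterable


def total_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Sum the match counts of each ngram across all years.

    Assumes tab-separated rows: ngram, year, match_count, then any
    further columns (which are ignored).
    """
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        ngram, match_count = fields[0], fields[2]
        totals[ngram] += int(match_count)
    return dict(totals)


# Made-up sample rows for illustration, not real corpus data.
sample = [
    "big data\t2007\t12\t10\t8\n",
    "big data\t2008\t30\t25\t20\n",
    "data set\t2008\t5\t5\t4\n",
]
print(total_counts(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, which avoids loading a whole multi-gigabyte file into memory.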

MusicBrainz

World Cubing Association Database

Large Data Sets on AWS

Starcraft 2 Hit Analysis

Starcraft 2 Combat Analysis

Twitter Users by Location

The AOL Search Data

CGI 60 Genomes

  • URL: http://data.bionimbus.org/60-genome-data-set/
  • Description: A set of public genome sequences. There are four sets of data: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations.
  • Curator: eosergi10

DBpedia

  • URL: http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/
  • Description: DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. It allows you to ask sophisticated queries against Wikipedia and to link other data sets on the Web to Wikipedia data. This release is based on Wikipedia dumps dating from late July 2011.
  • Curator: eosergi10
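
To give a feel for what querying DBpedia can look like, the sketch below builds a request URL for a small SPARQL query without sending it. The endpoint address http://dbpedia.org/sparql, the format parameter, the ontology class in the query, and the helper name build_query_url are all assumptions made for illustration; consult the DBpedia documentation for the current endpoint and options.

```python
from urllib.parse import urlencode

# Assumed public SPARQL endpoint; verify before use.
ENDPOINT = "http://dbpedia.org/sparql"


def build_query_url(sparql: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL that submits a SPARQL query to the endpoint."""
    params = {
        "query": sparql,
        # Assumed parameter asking the server for JSON results.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


# Illustrative query: up to five resources typed as films.
QUERY = """
SELECT ?film WHERE {
  ?film a <http://dbpedia.org/ontology/Film> .
} LIMIT 5
"""

url = build_query_url(QUERY)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; building it separately keeps the example runnable offline.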

IMDB

  • URL: http://www.imdb.com/interfaces
  • Description: All the data used to create IMDB, available from any of the three FTP sites listed under "Plain Text Data Files".
  • Curator: gaschue08