Annotated-directory-big-data

From Earlham CS Department
Revision as of 05:32, 7 October 2011 by Eosergi10 (talk | contribs) (CGI 60 Genomes)

This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.

Google ngrams

  • URL - http://books.google.com/ngrams/datasets
  • Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the "Google Million", English fiction, etc. Each set contains a list of ngrams, frequency, and date information.
  • Curator - CharlieP
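As a rough illustration of working with one of these files, the sketch below sums the frequency of each ngram across all years. It assumes each line is tab-separated with the ngram first, the year second, and the match count third (any further columns are ignored); that layout is an assumption here and should be checked against the dataset's own documentation. The function name `total_counts` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def total_counts(lines):
    """Sum the match count of each ngram across all years.

    Assumes tab-separated rows: ngram, year, match_count, [other columns].
    """
    totals = defaultdict(int)
    # QUOTE_NONE so that quote characters inside an ngram are kept literally
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        ngram, match_count = row[0], row[2]
        totals[ngram] += int(match_count)
    return dict(totals)


# Made-up sample rows, for illustration only (not real dataset values)
sample = io.StringIO(
    "example\t1990\t10\t8\t5\n"
    "example\t1991\t12\t9\t6\n"
    "other\t1990\t3\t3\t2\n"
)
print(total_counts(sample))
```

For a real file, an open file object can be passed in place of the `io.StringIO` sample; the function reads one line at a time, though the totals dictionary itself still grows with the number of distinct ngrams.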

MusicBrainz

World Cubing Association Database

Large Data Sets on AWS

Another Data Set

Twitter Users by Location

The AOL Search Data

CGI 60 Genomes

  • URL - http://data.bionimbus.org/60-genome-data-set/
  • Description - A set of public genome sequences. There are four sets of data: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations.
  • Curator - eosergi10