From Earlham CS Department
Revision as of 11:26, 7 October 2011 by Rdbean08 (talk | contribs) (Another Data Set)

This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.

Google ngrams

  • URL -
  • Description - The ngram databases on which Google's ngram viewer is built. Several corpora are available, e.g. per-language sets, the "Google Million", and English fiction. Each set lists ngrams together with their frequency and date information.
  • Curator - CharlieP
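
As a rough illustration of working with ngram, frequency, and date records like those described above, here is a minimal Python sketch that sums the frequency of each ngram across years. The column layout (tab-separated, with the ngram first, the year second, and the match count third) is an assumption not stated on this page, so check the documentation of the data set you download. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def total_counts_by_ngram(lines):
    """Sum match counts per ngram across all years.

    ASSUMPTION: each row is tab-separated as ngram, year, match count,
    optionally followed by further columns, which are ignored here.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        ngram, match_count = fields[0], fields[2]
        totals[ngram] += int(match_count)
    return dict(totals)


# Made-up sample rows for illustration only (not real data).
sample = [
    "data set\t1990\t5\t4\t3\n",
    "data set\t1991\t7\t6\t2\n",
    "large data\t1991\t2\t2\t1\n",
]
totals = total_counts_by_ngram(sample)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the sample list, so a large file is read one row at a time rather than loaded into memory.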


World Cubing Association Database

Large Data Sets on AWS

Starcraft 2 Unit Strength Comparisons

Twitter Users by Location

The AOL Search Data

CGI 60 Genomes

  • URL:
  • Description: A set of public genome sequences. There are four sets of data: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations.
  • Curator: eosergi10

Dataset for "Statistics and Social Network of YouTube Videos"