From Earlham CS Department
Revision as of 01:19, 14 October 2011 by Eosergi10 (talk | contribs) (Research and Innovative Technology Administration)

This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.

Google ngrams

  • URL -
  • Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the "Google Million", English fiction, etc. Each set lists ngrams along with their frequency and date information.
  • Curator - CharlieP
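
As a rough illustration of working with records like those described above, here is a minimal Python sketch that sums each ngram's frequency over all dates. It assumes a tab-separated line layout of ngram, year, match count, followed by optional extra count columns; that layout is an assumption and should be checked against the documentation of the corpus actually downloaded. The function name `total_counts` and the sample records are hypothetical, made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable


def total_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Sum the match counts of each ngram over all years.

    Assumed (not confirmed) layout per line, tab-separated:
    ngram, year, match_count, then any further count columns.
    """
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, match_count = fields[0], fields[2]
        totals[ngram] += int(match_count)
    return dict(totals)


# Made-up sample records, for illustration only (not real corpus values).
sample = [
    "data set\t1990\t12\t3",
    "data set\t1991\t8\t2",
    "big data\t1991\t5\t1",
]
print(total_counts(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a large file is processed one line at a time rather than loaded into memory.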


World Cubing Association Database

Large Data Sets on AWS

Starcraft 2 Hit Analysis

Starcraft 2 Combat Analysis

Twitter Users by Location

The AOL Search Data


  • URL:
  • Description: Full data dumps of every fact and assertion in Freebase, an open database of the world's information, covering millions of topics in hundreds of categories.
  • Curator: eosergi10


  • URL:
  • Description: The dataset release is based on Wikipedia dumps dating from late July 2011. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.
  • Curator: eosergi10


  • URL:
  • Description: All the data used to create IMDB, available from any of the three FTP sites listed under "Plain Text Data Files".
  • Curator: gaschue08