Difference between revisions of "Annotated-directory-big-data"

From Earlham CS Department
(Starcraft 2 Hit Analysis)
Line 47:
  *Description: Datasets of normal and updating crawl for YouTube
  *Curator: eosergi10
+
+ ==== Starcraft 2 Combat Analysis ====
+ * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)
+ * Description - Analysis of the percent chance of victory for any given unit versus any other unit
+ * Curator - rdbean08

Revision as of 10:35, 7 October 2011

This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.

Google ngrams

  • URL - http://books.google.com/ngrams/datasets
  • Description - The ngram databases on which Google's ngram viewer is built. Several corpora are available, e.g. by language, the "Google Million", and English fiction. Each set lists ngrams together with their frequency and date information.
  • Curator - CharlieP
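
As a rough illustration of working with the "ngrams, frequency, and date information" described above, here is a minimal Python sketch. It assumes each row is tab-separated, with the ngram first, then a year, then a match count, then possibly further count columns. That layout is an assumption, not something this page states, so check the dataset's own documentation before relying on it. The helper names (`NgramRow`, `parse_ngram_line`, `total_matches_by_year`) and the sample rows are made up for the example.

```python
from typing import NamedTuple


class NgramRow(NamedTuple):
    ngram: str
    year: int
    match_count: int
    extra_counts: tuple  # any further count columns, kept uninterpreted


def parse_ngram_line(line: str) -> NgramRow:
    """Parse one row, ASSUMED to be: ngram <TAB> year <TAB> match_count [<TAB> more counts]."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated fields, got {len(fields)}")
    ngram, year, match_count, *rest = fields
    return NgramRow(ngram, int(year), int(match_count), tuple(int(x) for x in rest))


def total_matches_by_year(lines):
    """Sum the match counts per year over an iterable of rows, skipping blank lines."""
    totals = {}
    for line in lines:
        if not line.strip():
            continue
        row = parse_ngram_line(line)
        totals[row.year] = totals.get(row.year, 0) + row.match_count
    return totals


# Invented sample rows, for illustration only (not real counts from the dataset).
sample = [
    "big data\t2007\t12\t10\t8\n",
    "big data\t2008\t30\t25\t20\n",
    "data set\t2008\t5\t5\t4\n",
]
totals = total_matches_by_year(sample)
```

Because `total_matches_by_year` accepts any iterable of lines, an open file handle can be passed directly, so a large file is read one row at a time rather than loaded whole.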

MusicBrainz

World Cubing Association Database

Large Data Sets on AWS

Starcraft 2 Hit Analysis

Twitter Users by Location

The AOL Search Data

CGI 60 Genomes

  • URL: http://data.bionimbus.org/60-genome-data-set/
  • Description: A set of public genome sequences. There are four sets of data: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations.
  • Curator: eosergi10

Dataset for "Statistics and Social Network of YouTube Videos"

Starcraft 2 Combat Analysis