Difference between revisions of "Annotated-directory-big-data"

From Earlham CS Department
Revision as of 01:04, 14 October 2011

This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.

Google ngrams

  • URL - http://books.google.com/ngrams/datasets
  • Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g., by language, the "Google Million", and English fiction. Each set lists ngrams together with their frequency and date information.
  • Curator - CharlieP
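
As a rough illustration of working with the "ngram, frequency, date" records described above, here is a minimal Python sketch. It assumes each line is tab-separated with the ngram first, then a year, then a match count, and it ignores any further columns. That layout, the function name `total_counts_by_year`, and the sample rows are assumptions made for illustration only; check the format notes on the download page before relying on them.

```python
from collections import defaultdict
from typing import Dict, Iterable


def total_counts_by_year(lines: Iterable[str]) -> Dict[int, int]:
    """Sum match counts per year across tab-separated ngram records.

    Assumed (hypothetical) layout: ngram, year, match count, then any
    further columns, which are ignored.
    """
    totals: Dict[int, int] = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        year, count = int(fields[1]), int(fields[2])
        totals[year] += count
    return dict(totals)


# Made-up sample rows, not real data from the corpus.
sample = [
    "big data\t2007\t12\t9\t7\n",
    "big data\t2008\t30\t21\t15\n",
    "large data\t2008\t5\t5\t4\n",
]
print(total_counts_by_year(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large download can be processed one line at a time without loading it into memory.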

MusicBrainz

World Cubing Association Database

Large Data Sets on AWS

Starcraft 2 Hit Analysis

Starcraft 2 Combat Analysis

Twitter Users by Location

The AOL Search Data

Research and Innovative Technology Administration

  • URL: http://www.rita.dot.gov/
  • Description: RITA coordinates the U.S. Department of Transportation's research and education programs. RITA also offers vital transportation statistics and analysis, and supports national efforts to improve education and training in transportation-related fields.
  • Curator: eosergi10

DBpedia

  • URL: http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/
  • Description: This dataset release is based on Wikipedia dumps dating from late July 2011. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.
  • Curator: eosergi10

IMDB

  • URL: http://www.imdb.com/interfaces
  • Description: All the data used to create IMDB, available from any of the three FTP sites listed under "Plain Text Data Files".
  • Curator: gaschue08