Difference between revisions of "Annotated-directory-big-data"
Revision as of 03:28, 14 October 2011
This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.
Google ngrams
- URL - http://books.google.com/ngrams/datasets
- Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the "Google Million", English fiction, etc. Each set contains a list of ngrams, frequency, and date information.
- Curator - CharlieP
MusicBrainz
- URL - http://musicbrainz.org/doc/MusicBrainz_Database
- Description - In a nutshell, the musical equivalent of IMDb.
- Curator - Jahelton07
World Cube Association Database
- Browse - http://worldcubeassociation.org/results/
- Download Database - http://www.worldcubeassociation.org/results/misc/export.html
- Description - All times, competitions, and competitors from WCA competitions from 1984 until now.
- Curator - Twright09
Large Data Sets on AWS
- URL - http://aws.amazon.com/publicdatasets/#1
- Description - A list of large data sets hosted on Amazon's AWS. More data sets can be found by following the four links in the list.
- Curator - Twright09
Starcraft 2 Hit Analysis
- URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)
- Description - Analysis of the number of hits any given unit can sustain from any other unit
- Curator - rdbean08
Starcraft 2 Combat Analysis
- URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)
- Description - Analysis of the percent chance of victory for any given unit versus any other unit
- Curator - rdbean08
Twitter Users by Location
- URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077
- Description - Twitter Census: Twitter Users by Location
- Curator - ibabic09
The AOL Search Data
- URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079
- Description - The AOL Search Data is a collection of real query logs from real users. The data set consists of 20M web queries collected from 650k users over three months.
- Curator - ibabic09
Freebase
- URL: http://wiki.freebase.com/wiki/Data_dumps
- Description: Full data dumps of every fact and assertion in Freebase, an open database of the world's information, covering millions of topics in hundreds of categories.
  - Set - Quad dump: a full dump of Freebase assertions as tab-separated UTF-8 text.
  - Set - Simple Topic Dump: a tab-separated file containing basic identifying data about every topic in Freebase.
  - Set - TSV per Freebase type: a tab-separated file for each type in Freebase, suitable for loading into spreadsheets. Each line represents an instance of a Freebase type, and the columns represent the properties available for that type.
- Size: 6.0 GB in total, compressed with bzip2.
  - Set - Quad dump: the Link Export is approximately 3.5 GB compressed with bzip2 (35 GB uncompressed).
  - Set - Simple Topic Dump: approximately 1.2 GB compressed with bzip2 (5 GB uncompressed). In June 2011, there were over 22 million rows.
  - Set - TSV per Freebase type: the full download is approximately 1.3 GB compressed with bzip2. The browseable set contains approximately 7500 TSV files in 100 domains.
- Format: This is a complete "low level" dump of data, suitable for post-processing into RDF or XML datasets.
- Schema:
  - Set - Quad dump: the link export is a series of lines, one assertion per line. Each line is a tab-separated quadruple: <source> (mid, a machine-generated id), <property> (a particular kind of quality of the entity mentioned in the "source" column), <destination> (holds the name of a namespace), <value> (a key within that namespace).
  - Set - Simple Topic Dump: mid, English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, short text description.
  - Set - TSV per Freebase type: type, type's description.
- Curator: eosergi10
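The quad dump layout described in this entry (one assertion per line, four tab-separated columns) can be streamed straight from the bzip2 archive. The following is a minimal sketch, not the official Freebase tooling: the names `Quad`, `parse_quads`, `read_quad_dump`, and `count_properties` are made up for illustration, and it assumes every well-formed line has exactly four fields, with empty fields still delimited by tabs.

```python
import bz2
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Quad(NamedTuple):
    """One assertion from the quad dump: four tab-separated columns."""
    source: str       # mid, a machine-generated id
    prop: str         # the property being asserted
    destination: str  # name of a namespace (may be empty)
    value: str        # key within that namespace (may be empty)


def parse_quads(lines: Iterable[str]) -> Iterator[Quad]:
    """Yield a Quad for every line that has exactly four tab-separated fields."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) == 4:  # assumption: anything else is blank or malformed
            yield Quad(*fields)


def read_quad_dump(path: str) -> Iterator[Quad]:
    """Stream quads from a bzip2-compressed dump without unpacking it to disk."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        yield from parse_quads(handle)


def count_properties(quads: Iterable[Quad]) -> Counter:
    """Count how often each property appears in a stream of quads."""
    return Counter(quad.prop for quad in quads)
```

For example, `count_properties(read_quad_dump("quad-dump.tsv.bz2")).most_common(10)` would list the ten most frequent properties, where the file name is a placeholder for wherever the download was saved. Streaming matters here because the uncompressed dump is roughly 35 GB.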
DBpedia
- URL: http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/
- Description: The dataset release is based on Wikipedia dumps dating from late July 2011. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.
- Size: The dataset consists of 1 billion pieces of information, of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets. The total size is approximately 2.5 GB.
- Format: RDF triples
- Schema: Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available). If a thing exists in multiple language versions of Wikipedia, then short and long abstracts in these languages and links to the different language Wikipedia pages are added to the description.
- Curator: eosergi10
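As a rough illustration of what working with RDF triples looks like, the sketch below splits lines into subject, predicate, and object. It assumes the files are in the line-based N-Triples serialization (one triple per line, terminated by a period), which this entry does not state explicitly. `TRIPLE_RE` and `parse_triples` are hypothetical names, and the regular expression is a simplification rather than a full N-Triples parser.

```python
import re
from typing import Iterable, Iterator, Tuple

# Simplified pattern: subject is an IRI or blank node, predicate is an IRI,
# and the object is whatever precedes the terminating " ." on the line.
TRIPLE_RE = re.compile(r"^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$")


def parse_triples(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) for each line that looks like a triple."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        match = TRIPLE_RE.match(line)
        if match:
            yield match.groups()
```

Filtering the resulting tuples on the predicate is one way to pull out only the labels or only the abstracts mentioned in the schema. For anything beyond exploration, a dedicated RDF library is likely a better choice than this regular expression.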
IMDb
- URL: http://www.imdb.com/interfaces
- Description: All the data used to create IMDb, available from any of the three FTP sites listed under "Plain Text Data Files".
- Curator: gaschue08
IES
- URL: http://nces.ed.gov/ipeds/datacenter/
- Description: Data for all US colleges since 1980, available in different sizes based on how much data you wish to retrieve.
- Curator: gaespin07
Freebase
- URL: http://download.freebase.com/datadumps/
- Description: A database of publicly aggregated data. Data sets vary but include collections of entries on films, computers, and many other subjects.
- Curator: gaespin07