Annotated-directory-big-data
This is an annotated directory of public, freely available, "large" data sets. For now they are in no particular order.
Metadata Sources
http://www.drewsullivan.com/database.html (Last updated 2000, last checked 2015-03)
High Frequency Trading/Predicting Stocks
- Info URLs: http://www.youtube.com/watch?v=p40Kpmu60YM and http://en.wikipedia.org/wiki/High-frequency_trading
- Data URLs: http://finance.yahoo.com/
- Description: In high-frequency trading, programs analyze market data to capture opportunities to trade securities such as stocks or options. The database would automatically download and update information from the online stock market on an hourly basis. After the data is updated, the computer predicts which stocks to buy using a combination of mathematical models such as the Black-Scholes formula, probability, and statistics. It could also be programmed to buy the predicted stocks or options automatically, since humans are too slow to act on these signals. A minimal sketch of the Black-Scholes pricing step follows this list.
- Creator: mmludin08
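The Black-Scholes formula mentioned in the description can be illustrated with a short Python sketch. This is only an illustration of the pricing step, not part of the dataset; the inputs in the example call are made-up placeholders.

    # Minimal sketch of the Black-Scholes price for a European call option.
    # All inputs below are illustrative placeholders, not real market data.
    from math import log, sqrt, exp
    from statistics import NormalDist

    def black_scholes_call(spot, strike, rate, volatility, years_to_expiry):
        """Return the Black-Scholes price of a European call option."""
        d1 = (log(spot / strike) + (rate + 0.5 * volatility ** 2) * years_to_expiry) / (
            volatility * sqrt(years_to_expiry))
        d2 = d1 - volatility * sqrt(years_to_expiry)
        cdf = NormalDist().cdf  # standard normal cumulative distribution function
        return spot * cdf(d1) - strike * exp(-rate * years_to_expiry) * cdf(d2)

    # Example: stock at $100, strike $105, 1% risk-free rate, 20% volatility, 6 months out.
    print(round(black_scholes_call(100.0, 105.0, 0.01, 0.20, 0.5), 2))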
Wikipedia Page Traffic Statistics
- URL: http://aws.amazon.com/datasets/2596?_encoding=UTF8&jiveRedirect=1
- Description: This dataset contains a 320 GB sample of the data used to power trendingtopics.org. It includes 7 months of hourly page traffic statistics for over 2.5 million Wikipedia articles (~1 TB uncompressed), along with the associated Wikipedia content, link graph, and metadata. The project consists of using this real-world data, with its queries and access patterns, to design and develop benchmarks based on real-world data, which lets users compare the performance of different DBMSs and of different storage and access techniques. A minimal sketch for reading the hourly page-view files follows this list.
- Creator: mmludin08
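As a starting point, here is a minimal sketch for tallying views from one hourly page-view file, assuming the standard space-separated pagecounts layout (project code, page title, view count, bytes transferred); the file name is a hypothetical example.

    # Sketch: tally hourly views per English Wikipedia article from one pagecounts file.
    # Assumes space-separated fields: project, page title, view count, bytes.
    from collections import Counter

    views = Counter()
    with open("pagecounts-20090601-000000", encoding="utf-8", errors="replace") as f:  # hypothetical name
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != 4:
                continue  # skip malformed lines
            project, title, count, _size = parts
            if project == "en":
                views[title] += int(count)

    print(views.most_common(10))  # the ten most-viewed English articles in that hour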
Google ngrams
- URL - http://books.google.com/ngrams/datasets
- Description - The ngram databases on which Google's Ngram Viewer is built. A variety of corpora are available, e.g. by language, the "Google Million", English fiction, etc. Each set contains a list of ngrams with frequency and date information; a minimal parsing sketch follows this list.
- Curator - CharlieP
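A minimal parsing sketch, assuming the tab-separated layout of ngram, year, match count, and volume count; the file name and the phrase searched for are placeholders.

    # Sketch: total yearly occurrences of one phrase in a Google ngram file.
    # Assumes tab-separated columns: ngram, year, match_count, volume_count.
    from collections import defaultdict

    yearly = defaultdict(int)
    with open("googlebooks-eng-all-2gram-sample.tsv", encoding="utf-8") as f:  # placeholder name
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
            if ngram.lower() == "big data":  # placeholder phrase
                yearly[year] += match_count

    for year in sorted(yearly):
        print(year, yearly[year])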
MusicBrainz
- URL - http://musicbrainz.org/doc/MusicBrainz_Database
- Description - In a nutshell, the musical equivalent of IMDb, except editable by anyone.
- Complete Size - 3.47 GB, excluding information about editors
- Format - PostgreSQL "COPY TO" format; a minimal loading sketch follows this list
- Curator - Jahelton07
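Because the dump is in PostgreSQL COPY format (tab-separated, with \N for NULL), one way to load a single dump file is psycopg2's copy_from, sketched below. The connection settings, dump file path, and table name are assumptions for illustration, and the target table schema must already exist in the database.

    # Sketch: bulk-load one MusicBrainz dump file (COPY format) into an existing table.
    # Connection settings, file path, and table name are illustrative assumptions.
    import psycopg2

    conn = psycopg2.connect(dbname="musicbrainz", user="postgres")
    with conn, conn.cursor() as cur, open("mbdump/artist", encoding="utf-8") as dump:
        # COPY-format dumps are tab-separated with \N for NULL, which copy_from expects.
        cur.copy_from(dump, "artist", sep="\t", null="\\N")
    conn.close()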
freedb
- URL - http://www.freedb.org/en/download__database.10.html
- Description - A music-tagging service similar to MusicBrainz.
- Complete Size - 734 MB compressed for the latest complete set
- Curator - Jahelton07
World Cube Association Database
- Browse - http://worldcubeassociation.org/results/
- Download Database - http://www.worldcubeassociation.org/results/misc/export.html
- Description - All times, competitions, and competitors from WCA competitions from 1984 until now.
- Size: 5 MB, SQL export
- Curator - Twright09
Large Data Sets on AWS
- URL - http://aws.amazon.com/publicdatasets/#1
- Description - A list of large data sets hosted on Amazon's AWS; more data sets are available within the four links in the list.
- Download: 2-250 GB, various formats
- Curator - Twright09
Starcraft 2 Hit Analysis
- URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)
- Description - Analysis of the number of hits any given unit can sustain from any other unit
- Curator - rdbean08
Starcraft 2 Combat Analysis
- URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)
- Description - Analysis of the percent chance of victory for any given unit versus any other unit
- Curator - rdbean08
Twitter Users by Location
- URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077
- Description - Twitter Census: Twitter Users by Location
- Curator - ibabic09
The AOL Search Data
- URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079
- Description - A collection of real query-log data from real users. The data set consists of 20 million web queries collected from about 650,000 users over three months. A minimal sketch for reading the log files follows this list.
- Curator - ibabic09
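A minimal reading sketch, assuming the tab-separated columns AnonID, Query, QueryTime, ItemRank, and ClickURL described in the collection's documentation; the file name is a placeholder.

    # Sketch: count queries per anonymous user in one AOL query-log file.
    # Assumes tab-separated columns: AnonID, Query, QueryTime, ItemRank, ClickURL.
    import csv
    from collections import Counter

    queries_per_user = Counter()
    with open("user-ct-test-collection-01.txt", encoding="utf-8", newline="") as f:  # placeholder name
        for row in csv.DictReader(f, delimiter="\t"):
            queries_per_user[row["AnonID"]] += 1

    print(queries_per_user.most_common(5))  # the five most active users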
Center for Disease Control Data
- URL - http://www.cdc.gov/nchs/
- Description - The National Center for Health Statistics, under the CDC, has many downloadable datasets on mortality and health in its data warehouse. The complete 1998 and 2000 ICD-9 (the coding manual for cause of death used by many state and federal agencies) can also be downloaded here, along with a guide to the ICD-9. Data is in Lotus 1-2-3 and ASCII formats.
- Curator - jrhurst08
IRS Statistics
- URL - http://www.irs.treas.gov/tax_stats/index.html
- Description - The IRS Statistics of Income program tracks all sorts of data, but always in summary form. You can find information on non-profits (including the database of tax data for approved non-profits, downloadable in ASCII fixed-length form) and other statistics on income earned, migration, and foreign taxes paid. All are downloadable, often in spreadsheet form. Some databases are very large and the site is very slow.
- Curator - jrhurst08
National Oceanographic and Atmospheric Administration - Storm Prediction Center
- URL - http://www.spc.noaa.gov/climo/
- Description - This NOAA site includes an archive of downloadable files on tornadoes and tornado deaths since 1950. There is also data on hail and wind damage. Unfortunately, as far as I can find right now, data can only be downloaded one day's worth of events at a time.
- Downloadable formats: CSV and KML
- Size: On average, one day's worth of events is roughly 15 KB
- Curator - jrhurst08
Ensembl Genome Data
- URL: http://useast.ensembl.org/info/data/ftp/index.html
- Description: The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.
- Format: MySQL
- Curator: eosergi10
Freebase
- URL: http://wiki.freebase.com/wiki/Data_dumps
- Description: Full data dumps of every fact and assertion in Freebase, an open database of the world's information covering millions of topics in hundreds of categories. There are three sets: the Quad dump is a full dump of Freebase assertions as tab-separated UTF-8 text; the Simple Topic Dump is a tab-separated file containing basic identifying data about every topic in Freebase; and the TSV per Freebase type set is a tab-separated file for each Freebase type, suitable for loading into spreadsheets, where each line represents an instance of the type and the columns represent its available properties.
- Size: 6.0 GB total, compressed with bzip2. The Quad dump (link export) is approximately 3.5 GB compressed (35 GB uncompressed); the Simple Topic Dump is approximately 1.2 GB compressed (5 GB uncompressed), with over 22 million rows as of June 2011; the TSV per Freebase type download is approximately 1.3 GB compressed, and its browsable set contains approximately 7,500 TSV files across 100 domains.
- Format: A complete "low-level" dump of data, suitable for post-processing into RDF or XML datasets.
- Schema: The Quad dump has one assertion per line as a tab-separated quadruple: <source> (a machine-generated id, or mid), <property> (a particular quality of the entity in the source column), <destination> (the name of a namespace), and <value> (a key within that namespace). The Simple Topic Dump columns are mid, English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, and a short text description. The TSV-per-type files list each type and the type's description. A minimal parsing sketch for the Quad dump follows this list.
- Curator: eosergi10
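Following the quadruple schema above, a minimal sketch for scanning the Quad dump for a single property; the dump file name and the property id are illustrative.

    # Sketch: print (source, value) pairs for one property from the Freebase quad dump.
    # Each line is a tab-separated quadruple: source (mid), property, destination, value.
    import bz2

    WANTED_PROPERTY = "/type/object/name"  # illustrative property id

    with bz2.open("freebase-quad-dump.tsv.bz2", mode="rt", encoding="utf-8") as f:  # placeholder name
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4:
                continue  # skip malformed lines
            source, prop, destination, value = fields
            if prop == WANTED_PROPERTY:
                print(source, value)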
DBpedia
- URL: http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/
- Description: This dataset release is based on Wikipedia dumps from late July 2011. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia data and to link other data sets on the Web to it.
- Size: The dataset consists of 1 billion pieces of information, of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets. The total is approximately 2.5 GB.
- Format: RDF triples; a minimal sketch for reading them follows this list
- Schema: Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available). If a thing exists in multiple language versions of Wikipedia, then short and long abstracts in those languages and links to the different language Wikipedia pages are added to the description.
- Curator: eosergi10
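Since the release is distributed as RDF triples (N-Triples files), here is a minimal line-oriented reading sketch; the file name and the predicate matched are assumptions, and a proper RDF parser such as rdflib would be more robust.

    # Sketch: naive scan of a DBpedia N-Triples file for abstract triples.
    # Each line is roughly: <subject> <predicate> object .
    with open("long_abstracts_en.nt", encoding="utf-8") as f:  # placeholder file name
        for line in f:
            if line.startswith("#"):
                continue  # skip comments
            parts = line.split(" ", 2)
            if len(parts) != 3:
                continue
            subject, predicate, rest = parts
            if predicate.endswith("/abstract>"):  # assumed predicate for abstracts
                literal = rest.rsplit(" .", 1)[0]  # strip the trailing " ."
                print(subject, literal[:80])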
CGI 60 Genomes
- URL: http://data.bionimbus.org/60-genome-data-set/
- Description: A set of public genome sequences. There are four sets of data: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations.
- Curator: eosergi10
Dataset for "Statistics and Social Network of YouTube Videos"
- URL: http://netsg.cs.sfu.ca/youtubedata/
- Description: Datasets from normal and updating crawls of YouTube videos
- Curator: eosergi10
IMDb
- URL: http://www.imdb.com/interfaces
- Description: All the data used to create IMDb, available from any of the three FTP sites listed under "Plain Text Data Files".
- Curator: gaschue08
IES (IPEDS Data Center)
- URL: http://nces.ed.gov/ipeds/datacenter/
- Description: Data for all US colleges since 1980, available in different sizes based on how much data you wish to retrieve.
- Curator: gaespin07
Enron Email Dataset
- URL: http://www.cs.cmu.edu/~enron/
- Download: http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz
- Size: 423 MB, gzipped
- Description: Contains emails from about 150 Enron employees, mostly senior management. A minimal sketch for parsing the messages follows this list.
- Curator: Unknown
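A minimal sketch for walking the unpacked folders and parsing each message with Python's email module; it assumes the archive has been extracted to a local maildir directory, and that path is a placeholder.

    # Sketch: count messages per sender across the extracted Enron mail folders.
    # Assumes the tarball has been unpacked to ./maildir (path is a placeholder).
    import os
    from collections import Counter
    from email import message_from_binary_file

    senders = Counter()
    for dirpath, _dirnames, filenames in os.walk("maildir"):
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as fh:
                msg = message_from_binary_file(fh)
            senders[msg.get("From", "unknown")] += 1

    print(senders.most_common(10))  # the ten most prolific senders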
US Census Data for 2000
- URL: http://factfinder.census.gov/servlet/DatasetMainPageServlet
- Curator: stahlbr
Project Gutenberg
- URL: http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages#Getting_an_Offline_Version_of_our_Site
- Size: 14.5 GB
- Format: unstructured plain text; a minimal sketch for stripping the boilerplate headers follows this list
- Curator: stahlbr
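A minimal sketch for stripping the boilerplate from a single plain-text book, assuming the conventional "*** START OF" / "*** END OF" marker lines (older texts vary); the file name is a placeholder.

    # Sketch: keep only the body of a Project Gutenberg plain-text file,
    # assuming the usual "*** START OF" / "*** END OF" marker lines.
    def gutenberg_body(path):
        keep, body = False, []
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                if line.startswith("*** START OF"):
                    keep = True
                    continue
                if line.startswith("*** END OF"):
                    break
                if keep:
                    body.append(line)
        return "".join(body)

    print(gutenberg_body("pg1342.txt")[:500])  # placeholder file name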
Amazon Product/User Data
- Description: A large database of product information and reviews, as well as data on user profiles
- URL: http://131.193.40.52/data/
- Size: 1.866 GB zipped
- Format: Unstructured plain text
- Curator: Rdbean08
- Schemas:
- memberinfo-locations
- member-shortSummary
- reviewed-Products
- reviewed-AudioCDs
- reviewsNew
- productinfo
- Booksinfo
New York Public Transportation Data
- Description: Lots of data about public transportation in New York, such as schedules, average wait times, and usage; a minimal sketch for reading one of the CSV files follows this list.
- URL: http://www.mta.info/developers/download.html
- Size: Varies depending on which feeds you download; there are many options.
- Format: CSV
- Curator: gaschue08
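A minimal sketch for reading one of the schedule CSV files, assuming a GTFS-style stops.txt with stop_id, stop_name, stop_lat, and stop_lon columns; check the downloaded feed for the actual file and column names.

    # Sketch: list stops from a GTFS-style stops.txt in one of the MTA CSV downloads.
    # The column names used here are assumptions; verify them against the feed.
    import csv

    with open("stops.txt", encoding="utf-8-sig", newline="") as f:
        for row in csv.DictReader(f):
            print(row["stop_id"], row["stop_name"], row["stop_lat"], row["stop_lon"])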
National Polar-Orbiting Operational Environmental Satellite System (NPOESS) Preparatory Project - NPP
- Description:
- URL: http://www.class.ngdc.noaa.gov/data_available/npp/index.htm
- Curator: CharlieP