https://wiki.cs.earlham.edu/api.php?action=feedcontributions&user=Eosergi10&feedformat=atom Earlham CS Department - User contributions [en] 2024-03-29T11:59:56Z User contributions MediaWiki 1.32.1 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12582 Elena-big-data 2011-12-08T08:40:15Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: With a dataset on immigrant populations, I had the chance to build different population profiles, aiming to verify or disprove certain stereotypes about immigrants and about different nations. This involved looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and covers only OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. 
I removed the ^M (carriage return) characters with the command dos2unix file1 &gt; file2<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants work the most as managers (as an example). The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
Later, the result for each nation was divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people with a particular field of study are unemployed<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Finding the largest female population working in agriculture across countries of birth (as an example)<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The full presentation can be viewed here: http://prezi.com/1vstya3qtmwy/big-data-project/<br /> The following 3 graphs were used to recreate the USA population profile (graphs 1-3). The first graph shows that 10% of the USA population are people born outside of the US, 5% of which obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|graph 1]]<br /> [[File:Picture2.jpg|200px|thumb|left|graph 2]]<br /> [[File:Picture3.jpg|200px|thumb|left|graph 3]]<br /> Next, I decided to focus on one nation specifically (Chinese) and determine which work positions they tend to hold most often, and in which countries (from the given OECD range). 
I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once (graph 4).<br /> [[File:Picture4.jpg|200px|thumb|left|graph 4]]<br /> I returned to the USA profile and established which professions are occupied the most by which immigrants (graph 5).<br /> [[File:Picture5.jpg|200px|thumb|left|graph 5]]<br /> Next, I looked at employment status for all immigrants and for female immigrants in the USA, and compared the resulting unemployment rates with the average unemployment rate for the USA in 2000 (graph 6)<br /> [[File:Picture6.jpg|200px|thumb|left|graph 6]]<br /> My next step was to compare the popularity of fields of study with how many people in each field were currently unemployed (graphs 7-8).<br /> [[File:Picture10.jpg|200px|thumb|left|graph 7]]<br /> [[File:Picture8.jpg|200px|thumb|left|graph 8]]<br /> The last thing I compared was the number of men versus the number of women in each occupation (graph 9).<br /> [[File:Picture11.jpg|200px|thumb|left|graph 9]]<br /> ===== Results =====<br /> Sometimes I got predictable, expected results from my visualisations, such as that the percentage of Mexican immigrants exceeds that of any other group in the USA, or that people with a degree in arts and humanities have problems finding jobs. However, some of the information I got surprised me.<br /> When comparing the unemployment rates of different immigrant groups in the USA, I saw a particular pattern: the highest unemployment rates belonged only to Latin American countries, while European and Asian countries had quite low unemployment rates. This didn't depend on the size of the foreign population from that region in the US, since the countries in the chart were ordered by the size of their immigrant population in the US. This graph made me think that I had just visualised the concept of discrimination in the USA. 
No countries from other regions of the world showed the same pattern, and the differences in unemployment rate varied hugely compared to other groups of countries.<br /> It was surprising for me to see from graph 9 that the overall number of women involved in technical jobs exceeds that of men in that sphere enormously. There is a lot of talk about the lack of women in science, and it was quite a discovery for me to see results that break the stereotype. Also, I was surprised to see that the number of men in agriculture exceeds the number of women, which I expected to be the opposite.<br /> I never expected such a small percentage of Chinese immigrants to work in agriculture. Also, the fact that the Chinese immigrants most involved in market and shop sales are in Hungary is very unexpected. The chart also shows that these results are the highest compared to the rest of the countries. Does it mean that Hungary has better economic or social conditions for Chinese immigrants to have a high ratio of involvement in that sphere, or, perhaps, that this labour market doesn't have as much competition as anywhere else in the world?<br /> From graph 5 I was able to see that the largest share of immigrants in the US work in the computer/math science, management and healthcare spheres. This graph can be used in everyday situations too. For example, if I am seeking a healthcare practitioner, I will know that, among foreign applicants, most applications will come from people from the Philippines. 
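The ratios discussed above (for example, unemployed female immigrants divided by employed female immigrants from the same country of birth) could also be computed in a single query instead of dividing afterwards. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module in place of the Postgres instance, a cut-down labour_status table filled with invented rows, and it assumes that lfs_lfs=1 means employed and lfs_lfs=2 means unemployed. The names ratio_by_birth_country and SAMPLE_ROWS are hypothetical.

```python
import sqlite3

# Invented sample rows, NOT real OECD figures.
# Columns: country, coub, fborn, sex, lfs_lfs, number
SAMPLE_ROWS = [
    ("USA", "MEX", 1, 2, 1, 80),  # foreign-born women, employed (assumed code 1)
    ("USA", "MEX", 1, 2, 2, 20),  # foreign-born women, unemployed (code 2)
    ("USA", "PHL", 1, 2, 1, 90),
    ("USA", "PHL", 1, 2, 2, 9),
    ("USA", "CHN", 1, 2, 2, 4),   # no employed row: ratio is undefined
    ("USA", "MEX", 1, 1, 2, 50),  # men: filtered out
    ("USA", "MEX", 0, 2, 2, 7),   # native-born: filtered out
    ("CAN", "MEX", 1, 2, 2, 5),   # other host country: filtered out
]


def ratio_by_birth_country(rows):
    """Return {coub: unemployed / employed} for foreign-born women in the USA."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table labour_status "
        "(country text, coub text, fborn int, sex int, lfs_lfs int, number int)"
    )
    conn.executemany("insert into labour_status values (?, ?, ?, ?, ?, ?)", rows)
    query = """
        select coub,
               sum(case when lfs_lfs = 2 then number else 0 end) * 1.0
               / nullif(sum(case when lfs_lfs = 1 then number else 0 end), 0)
        from labour_status
        where country = 'USA' and fborn = 1 and sex = 2
        group by coub
    """
    # nullif(..., 0) turns a zero denominator into NULL, so the ratio
    # comes back as None instead of raising a division error.
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    for coub, ratio in sorted(ratio_by_birth_country(SAMPLE_ROWS).items()):
        print(coub, ratio)
```

Putting both sums in one grouped query keeps the numerator and denominator under the same filters (host country, foreign-born, sex), which is easy to get wrong when the two totals come from separate queries.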
<br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Picture11.jpg&diff=12579 File:Picture11.jpg 2011-12-08T07:15:25Z <p>Eosergi10: Number of men in the occupation fields vs number of women taking the same work positions</p> <hr /> <div>Number of men in the occupation fields vs number of women taking the same work positions</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Picture8.jpg&diff=12577 File:Picture8.jpg 2011-12-08T07:12:02Z <p>Eosergi10: My next step was to compare the popularity of fields of study and how many people with the particular field were currently unemployed (graph 7 - 8).</p> <hr /> <div>My next step was to compare the popularity of fields of study and how many people with the particular field were currently unemployed (graph 7 - 8).</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Picture10.jpg&diff=12576 File:Picture10.jpg 2011-12-08T07:11:38Z <p>Eosergi10: My next step was to compare the popularity of fields of study and how many people with the particular field were currently unemployed (graph 7 - 8).</p> 
<hr /> <div>My next step was to compare the popularity of fields of study and how many people with the particular field were currently unemployed (graph 7 - 8).</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12575 Elena-big-data 2011-12-08T07:11:15Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: With a dataset on immigrant populations, I had the chance to build different population profiles, aiming to verify or disprove certain stereotypes about immigrants and about different nations. This involved looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and covers only OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. 
I erased ^M by using the dos2unix &lt; file1 &gt; file2 command<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants work the most as managers (as an example). The returned result for each of the positions was further divided by the total Chinese immigrant population of that country in order to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
Later, the result for each nation was divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the overall population of females that work in agriculture (as an example)<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile (graphs 1-3). From the first graph, it is visible that 10% of the USA population are people born outside the US, 5% of whom obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|graph 1]]<br /> [[File:Picture2.jpg|200px|thumb|left|graph 2]]<br /> [[File:Picture3.jpg|200px|thumb|left|graph 3]]<br /> Next, I decided to focus on one specific nation (Chinese) and determine which positions they tend to work in most, and in which countries (within the given OECD range). 
I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once (graph 4).<br /> [[File:Picture4.jpg|200px|thumb|left|graph 4]]<br /> I returned to the USA profile and established which professions are most occupied by which immigrant groups (graph 5).<br /> [[File:Picture5.jpg|200px|thumb|left|graph 5]]<br /> Next, I looked at employment status for all immigrants and for female immigrants in the USA, and compared the resulting unemployment rates with the average unemployment rate for the USA in 2000 (graph 6).<br /> [[File:Picture6.jpg|200px|thumb|left|graph 6]]<br /> My next step was to compare the popularity of fields of study with how many people in each field were currently unemployed (graphs 7-8).<br /> [[File:Picture10.jpg|200px|thumb|left|graph 7]]<br /> [[File:Picture8.jpg|200px|thumb|left|graph 8]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Picture6.jpg&diff=12574 File:Picture6.jpg 2011-12-08T07:07:37Z <p>Eosergi10: Compared unemployment rates for each of the immigrant populations in the USA, for females of different nations, and the average USA unemployment rate for the year 2000.</p> <hr /> <div>Compared unemployment rates for each of the immigrant populations in the USA, for females of different nations, and the average USA unemployment rate for the year 2000.</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12573 Elena-big-data 2011-12-08T07:05:16Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: With a dataset on immigrant populations, I was able to create different population profiles, with the aim to verify and/or 
disprove certain stereotypes about immigrants, as well as about different nations. This included looking at occupations, countries of birth, and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I erased ^M by using the dos2unix &lt; file1 &gt; file2 command<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants work the most as managers (as an example). 
The returned result for each of the positions was further divided by the total Chinese immigrant population of that country in order to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. Later, the result for each nation was divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the overall population of females that work in agriculture (as an example)<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile (graphs 1-3). 
From the first graph, it is visible that 10% of the USA population are people born outside the US, 5% of whom obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|graph 1]]<br /> [[File:Picture2.jpg|200px|thumb|left|graph 2]]<br /> [[File:Picture3.jpg|200px|thumb|left|graph 3]]<br /> Next, I decided to focus on one specific nation (Chinese) and determine which positions they tend to work in most, and in which countries (within the given OECD range). I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once (graph 4).<br /> [[File:Picture4.jpg|200px|thumb|left|graph 4]]<br /> I returned to the USA profile and established which professions are most occupied by which immigrant groups (graph 5).<br /> [[File:Picture5.jpg|200px|thumb|left|graph 5]]<br /> Next, I looked at employment status for immigrants in the USA, and compared the resulting unemployment rates with the average unemployment rate for the USA in 2000 (graph 6).<br /> [[File:Picture6.jpg|200px|thumb|left|graph 6]]<br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12572 Elena-big-data 2011-12-08T07:04:55Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: With a dataset on immigrant populations, I was able to create different population profiles, with the aim to verify and/or disprove certain stereotypes about immigrants, as well as about different nations. 
This included looking at occupations, countries of birth, and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I erased ^M by using the dos2unix &lt; file1 &gt; file2 command<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants work the most as managers (as an example). 
The returned result for each of the positions was further divided by the total Chinese immigrant population of that country in order to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. Later, the result for each nation was divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the overall population of females that work in agriculture (as an example)<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile (graphs 1-3). 
From the first graph, it is visible that 10% of the USA population are people born outside the US, 5% of whom obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|graph 1]]<br /> [[File:Picture2.jpg|200px|thumb|left|graph 2]]<br /> [[File:Picture3.jpg|200px|thumb|left|graph 3]]<br /> Next, I decided to focus on one specific nation (Chinese) and determine which positions they tend to work in most, and in which countries (within the given OECD range). I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once (graph 4).<br /> [[File:Picture4.jpg|200px|thumb|left|graph 4]]<br /> I returned to the USA profile and established which professions are most occupied by which immigrant groups (graph 5).<br /> [[File:Picture5.jpg|200px|thumb|left|graph 5]]<br /> Next, I looked at employment status for immigrants in the USA, and compared the resulting unemployment rates with the average unemployment rate for the USA in 2000 (graph 6).<br /> [[File:Picture5.jpg|200px|thumb|left|graph 6]]<br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Picture5.jpg&diff=12571 File:Picture5.jpg 2011-12-08T06:59:58Z <p>Eosergi10: I returned to the USA profile and established which professions are most occupied by which immigrant groups (graph 5).</p> <hr /> <div>I returned to the USA profile and established which professions are most occupied by which immigrant groups (graph 5).</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12570 Elena-big-data 2011-12-08T06:58:56Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> 
*Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: With a dataset on immigrant populations, I was able to create different population profiles, with the aim to verify and/or disprove certain stereotypes about immigrants, as well as about different nations. This included looking at occupations, countries of birth, and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. 
I erased ^M by using the dos2unix &lt; file1 &gt; file2 command<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants work the most as managers (as an example). The returned result for each of the positions was further divided by the total Chinese immigrant population of that country in order to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
Later, the result for each nation was divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the overall population of females that work in agriculture (as an example)<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile (graphs 1-3). From the first graph, it is visible that 10% of the USA population are people born outside the US, 5% of whom obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|graph 1]]<br /> [[File:Picture2.jpg|200px|thumb|left|graph 2]]<br /> [[File:Picture3.jpg|200px|thumb|left|graph 3]]<br /> Next, I decided to focus on one specific nation (Chinese) and determine which positions they tend to work in most, and in which countries (within the given OECD range). 
I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once (graph 4).<br /> [[File:Picture4.jpg|200px|thumb|left|graph 4]]<br /> I returned to the USA profile and established which professions are most occupied by which immigrant groups (graph 5).<br /> [[File:Picture5.jpg|200px|thumb|left|graph 5]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12569 Elena-big-data 2011-12-08T06:47:22Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: With a dataset on immigrant populations, I was able to create different population profiles, with the aim to verify and/or disprove certain stereotypes about immigrants, as well as about different nations. This included looking at occupations, countries of birth, and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. 
I erased ^M by using the dos2unix &lt; file1 &gt; file2 command<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants work the most as managers (as an example). The returned result for each of the positions was further divided by the total Chinese immigrant population of that country in order to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
Later, the result for each nation was divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the overall population of females that work in agriculture (as an example)<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile (graphs 1-3). From the first graph, it is visible that 10% of the USA population are people born outside the US, 5% of whom obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|graph 1]]<br /> [[File:Picture2.jpg|200px|thumb|left|graph 2]]<br /> [[File:Picture3.jpg|200px|thumb|left|graph 3]]<br /> Next, I decided to focus on one specific nation and determine which positions they tend to work in most, and in which countries (within the given OECD range). 
I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once (graph 4).<br /> [[File:Picture4.jpg|200px|thumb|left|graph 4]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12568 Elena-big-data 2011-12-08T06:44:54Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: With a dataset on immigrant populations, I was able to create different population profiles, with the aim to verify and/or disprove certain stereotypes about immigrants, as well as about different nations. This included looking at occupations, countries of birth, and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. 
I erased ^M by using the dos2unix &lt; file1 &gt; file2 command<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants work the most as managers (as an example). The returned result for each of the positions was further divided by the total Chinese immigrant population of that country in order to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
Later, the result for each nation was divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the overall population of females that work in agriculture (as an example)<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile (graphs 1-3). From the first graph, it is visible that 10% of the USA population are people born outside the US, 5% of whom obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|alt text]]<br /> [[File:Picture2.jpg|200px|thumb|left|alt text]]<br /> [[File:Picture3.jpg|200px|thumb|left|alt text]]<br /> Next, I decided to focus on one specific nation and determine which positions they tend to work in most, and in which countries (within the given OECD range). 
I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once (graph 4).<br /> [[File:Picture4.jpg|200px|thumb|left|alt text]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12567 Elena-big-data 2011-12-08T06:44:13Z <p>Eosergi10: Undo revision 12566 by Eosergi10 (talk)</p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: With a dataset on immigrant populations, I was able to create different population profiles, with the aim to verify and/or disprove certain stereotypes about immigrants, as well as about different nations. This included looking at occupations, countries of birth, and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. 
I erased ^M by using the dos2unix &lt; file1 &gt; file2 command<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants work the most as managers (as an example). The returned result for each of the positions was further divided by the total Chinese immigrant population of that country in order to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture across countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile. The first graph shows that 10% of the USA population was born outside the US, 5% of which obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|alt text]]<br /> [[File:Picture2.jpg|200px|thumb|left|alt text]]<br /> [[File:Picture3.jpg|200px|thumb|left|alt text]]<br /> Next, I decided to focus on one specific nation and determine which occupations its people tend to work in most, and in which countries (from the given OECD range). 
I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once.<br /> [[File:Picture4.jpg|200px|thumb|left|alt text]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12566 Elena-big-data 2011-12-08T06:40:36Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying or disproving certain stereotypes about the immigrants, as well as about different nations. This included looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted to run. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. 
I removed the ^M (carriage return) characters using the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> CREATE TABLE citizenship_age (country CHAR(5),coub CHAR(10),fborn INT, edu_lfs INT, edu_cen INT, age_lfs INT, age_cen INT, nat INT, number INT, reg_oecd INT, reg_regions CHAR(30));<br /> The command used for the import is: <br /> COPY citizenship_age FROM '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, SUM(number) FROM (SELECT * FROM citizenship_age WHERE country='USA' AND fborn=1) AS p1 GROUP BY coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> SELECT country,occupation,SUM(number) FROM occupations WHERE fborn=1 AND coub='CHN' AND occupation&gt;='10' AND occupation&lt;'20' GROUP BY country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> SELECT coub,SUM(number) FROM occupations WHERE country='USA' AND occupation='USA_02' AND fborn=1 GROUP BY occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> SELECT country,coub,SUM(number) FROM labour_status WHERE fborn=1 AND lfs_lfs=2 AND sex=2 AND country='USA' GROUP BY country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> SELECT SUM(number) FROM fields_study WHERE field_edu=1 AND lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture across countries of birth (as an example):<br /> SELECT MAX(sum) FROM (SELECT coub,SUM(number) FROM occupations WHERE occupation&gt;='60' AND occupation&lt;'70' AND sex=2 GROUP BY coub) AS p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> I drew the following 3 graphs to recreate the USA population profile. The first graph shows that 10% of the USA population was born in other countries, and half of those obtained USA citizenship.<br /> [[File:Picture1.jpg]]<br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Picture4.jpg&diff=12565 File:Picture4.jpg 2011-12-08T06:38:50Z <p>Eosergi10: I decided to focus on specifically one nation (Chinese) and determine which work positions they tempt to work most of all and in which countries (from the given OECD range). I particularly wanted to see if preferences for one occupation varied from other </p> <hr /> <div>I decided to focus on one specific nation (Chinese) and determine which occupations they tend to work in most, and in which countries (from the given OECD range). 
I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once.</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12564 Elena-big-data 2011-12-08T06:38:08Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying or disproving certain stereotypes about the immigrants, as well as about different nations. This included looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted to run. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. 
I removed the ^M (carriage return) characters using the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture across countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile. The first graph shows that 10% of the USA population was born outside the US, 5% of which obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|alt text]]<br /> [[File:Picture2.jpg|200px|thumb|left|alt text]]<br /> [[File:Picture3.jpg|200px|thumb|left|alt text]]<br /> Next, I decided to focus on one specific nation and determine which occupations its people tend to work in most, and in which countries (from the given OECD range). 
I particularly wanted to see whether preferences varied from one occupation to another, and whether some countries appeared more than once.<br /> [[File:Picture4.jpg|200px|thumb|left|alt text]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Picture3.jpg&diff=12563 File:Picture3.jpg 2011-12-08T06:27:49Z <p>Eosergi10: Shows how many of the immigrants obtained US citizenship</p> <hr /> <div>Shows how many of the immigrants obtained US citizenship</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12562 Elena-big-data 2011-12-08T06:26:23Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying or disproving certain stereotypes about the immigrants, as well as about different nations. This included looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted to run. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. 
I had to eliminate a few characters. I removed the ^M (carriage return) characters using the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture across countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile. The first graph shows that 10% of the USA population was born outside the US, 5% of which obtained USA citizenship. The second shows where most of the immigrants come from, and the third shows how many of those immigrants obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|alt text]]<br /> [[File:Picture2.jpg|200px|thumb|left|alt text]]<br /> [[File:Picture3.jpg|200px|thumb|left|alt text]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12561 Elena-big-data 2011-12-08T06:20:51Z <p>Eosergi10: /* Project Tasks */</p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying or disproving certain stereotypes about the 
immigrants, as well as about different nations. This included looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted to run. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M (carriage return) characters using the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). 
The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture across countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile. 
The first graph shows that 10% of the USA population was born outside the US, 5% of which obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|alt text]]<br /> [[File:Picture2.jpg|200px|thumb|left|alt text]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Picture2.jpg&diff=12560 File:Picture2.jpg 2011-12-08T06:19:45Z <p>Eosergi10: The graph shows where the majority of immigrants come from.</p> <hr /> <div>The graph shows where the majority of immigrants come from.</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12559 Elena-big-data 2011-12-08T06:12:06Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying or disproving certain stereotypes about the immigrants, as well as about different nations. This included looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted to run. Also, the data didn't cover a range of years, which limited the ways I could explore it. 
When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M (carriage return) characters using the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture across countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> The following 3 graphs were used to recreate the USA population profile. The first graph shows that 10% of the USA population was born outside the US, 5% of which obtained USA citizenship.<br /> [[File:pop_prof1.jpg|200px|thumb|left|alt text]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12558 Elena-big-data 2011-12-08T06:06:38Z <p>Eosergi10: /* Project Tasks */</p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying or disproving certain stereotypes about the immigrants, as well as about different nations. 
This included looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted to run. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M (carriage return) characters using the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). 
The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture across countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> [[File:pop_prof1.jpg|200px|thumb|left|alt text]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Pop_prof1.jpg&diff=12557 File:Pop prof1.jpg 2011-12-08T06:05:18Z <p>Eosergi10: USA population profile</p> <hr /> <div>USA population profile</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12556 Elena-big-data 2011-12-08T06:03:14Z 
<p>Eosergi10: /* Project Tasks */</p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying or disproving certain stereotypes about the immigrants, as well as about different nations. This included looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted to run. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. 
I removed the ^M (carriage return) characters using the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of some of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which countries' immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture across countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> [[File:pop_prof1.jpg]]<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12555 Elena-big-data 2011-12-08T06:02:20Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying or disproving certain stereotypes about the immigrants, as well as about different nations. This included looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted to run. Also, the data didn't cover a range of years, which limited the ways I could explore it. 
When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M characters with the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which country's immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture among countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> [[File:graph1.jpg]]<br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12554 Elena-big-data 2011-12-08T05:59:20Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying and/or disproving certain stereotypes about immigrants, as well as different nations. This includes looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. 
When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M characters with the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which country's immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture among countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> [[File:Picture1.jpg]]<br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12553 Elena-big-data 2011-12-08T05:58:01Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying and/or disproving certain stereotypes about immigrants, as well as different nations. This includes looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. 
When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M characters with the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which country's immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. 
The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture among countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> [[File:Graph1.jpg]]<br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=File:Picture1.jpg&diff=12552 File:Picture1.jpg 2011-12-08T05:56:05Z <p>Eosergi10: </p> <hr /> <div></div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12551 Elena-big-data 2011-12-08T05:26:18Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying and/or disproving certain stereotypes about immigrants, as well as different nations. This includes looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. 
Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> *Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M characters with the dos2unix file1 &gt; file2 command.<br /> *Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> *Develop queries to explore your ideas in the data<br /> These are examples of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). 
The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which country's immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture among countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> *Develop and document the model function you are exploring in the data<br /> *Develop a visualization to show the model/patterns in the data<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12550 Elena-big-data 2011-12-08T05:24:28Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create 
different population profiles, with the aim of verifying and/or disproving certain stereotypes about immigrants, as well as different nations. This includes looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> *Identifying and downloading the target data set<br /> -- The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> #Data cleaning and pre-processing:<br /> -- The data is in CSV format. I had to eliminate a few characters. I removed the ^M characters with the dos2unix file1 &gt; file2 command.<br /> #Load the data into your Postgres instance:<br /> -- Example of creating a table:<br /> -- create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> -- The command used for the import is: <br /> -- copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> #Develop queries to explore your ideas in the data<br /> These are examples of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). 
The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which country's immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture among countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> #Develop and document the model function you are exploring in the data<br /> #Develop a visualization to show the model/patterns in the data<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12549 Elena-big-data 2011-12-08T05:24:12Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create 
different population profiles, with the aim of verifying and/or disproving certain stereotypes about immigrants, as well as different nations. This includes looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> #Identifying and downloading the target data set<br /> -- The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> #Data cleaning and pre-processing:<br /> -- The data is in CSV format. I had to eliminate a few characters. I removed the ^M characters with the dos2unix file1 &gt; file2 command.<br /> #Load the data into your Postgres instance:<br /> -- Example of creating a table:<br /> -- create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> -- The command used for the import is: <br /> -- copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> #Develop queries to explore your ideas in the data<br /> These are examples of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). 
The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which country's immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture among countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> #Develop and document the model function you are exploring in the data<br /> #Develop a visualization to show the model/patterns in the data<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12548 Elena-big-data 2011-12-08T02:38:35Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create 
different population profiles, with the aim of verifying and/or disproving certain stereotypes about immigrants, as well as different nations. This includes looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> #Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> #Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M characters with the dos2unix file1 &gt; file2 command.<br /> #Load the data into your Postgres instance:<br /> Example of creating a table:<br /> create table citizenship_age (country char(5),coub char(10),fborn int, edu_lfs int, edu_cen int, age_lfs int, age_cen int, nat int, number int, reg_oecd int, reg_regions char(30));<br /> The command used for the import is: <br /> copy citizenship_age from '/path/to/the/file/FILE.csv' DELIMITER ',' CSV;<br /> #Develop queries to explore your ideas in the data<br /> These are examples of the queries I used to investigate different areas of my analysis:<br /> Viewing the total immigrant population in the USA, grouped by country of birth:<br /> SELECT coub, sum(number) from (select * from citizenship_age where country='USA' and fborn=1) as p1 group by coub;<br /> Determining where in the world Chinese immigrants most often work as managers (as an example). 
The result for each position was then divided by the total Chinese immigrant population of that country to get a comparable ratio.<br /> select country,occupation,sum(number) from occupations where fborn=1 and coub='CHN' and occupation&gt;='10' and occupation&lt;'20' group by country, occupation;<br /> Determining which country's immigrants work in business (as an example) more than others in the USA.<br /> select coub,sum(number) from occupations where country='USA' and occupation='USA_02' and fborn=1 group by occupation,coub;<br /> Looking at the unemployed female immigrant population in the USA. The result for each nation was later divided by the total employed female population of that nation in the USA.<br /> select country,coub,sum(number) from labour_status where fborn=1 and lfs_lfs=2 and sex=2 and country='USA' group by country,coub;<br /> Viewing how many people in a particular field of study are unemployed:<br /> select sum(number) from fields_study where field_edu=1 and lfs_lfs!=1;<br /> Viewing the largest female population working in agriculture among countries of birth (as an example):<br /> select max(sum) from (select coub,sum(number) from occupations where occupation&gt;='60' and occupation&lt;'70' and sex=2 group by coub) as p1;<br /> #Develop and document the model function you are exploring in the data<br /> #Develop a visualization to show the model/patterns in the data<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12547 Elena-big-data 2011-12-08T01:11:18Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create 
different population profiles, with the aim of verifying and/or disproving certain stereotypes about immigrants, as well as different nations. This includes looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> #Identifying and downloading the target data set<br /> The dataset can be downloaded from here: http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html<br /> #Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M characters using dos2unix file1 &gt; file2<br /> #Load the data into your Postgres instance:<br /> <br /> #Develop queries to explore your ideas in the data <br /> #Develop and document the model function you are exploring in the data<br /> #Develop a visualization to show the model/patterns in the data<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12546 Elena-big-data 2011-12-08T01:10:05Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries<br /> *Aims and Ideas: Having a dataset about immigrant populations, I had a chance to create different population profiles, with the aim of verifying and/or disproving certain stereotypes about immigrants, as well 
as different nations. This includes looking at occupations, countries of birth and labour force status.<br /> *Complications: Unfortunately, my data didn't include any unique identifiers, which made the dataset hard to work with and made it impossible to answer some of the queries I wanted. Also, the data didn't cover a range of years, which limited the ways I could explore it. When viewing my results, please keep in mind that the data was collected for the year 2000 and is limited to OECD countries.<br /> <br /> ===== Project Tasks =====<br /> #Identifying and downloading the target data set<br /> The dataset can be downloaded from here: [http://www.oecd.org/document/51/0,3746,en_2649_33931_40644339_1_1_1_1,00.html]<br /> #Data cleaning and pre-processing:<br /> The data is in CSV format. I had to eliminate a few characters. I removed the ^M characters using dos2unix file1 &gt; file2<br /> #Load the data into your Postgres instance:<br /> <br /> #Develop queries to explore your ideas in the data <br /> #Develop and document the model function you are exploring in the data<br /> #Develop a visualization to show the model/patterns in the data<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12545 Elena-big-data 2011-12-07T02:15:17Z <p>Eosergi10: </p> <hr /> <div>*Title: '''Stereotypes and Discrimination Through Statistics'''<br /> *Dataset used: A Profile of Immigrant Population in the 21st century in OECD Countries <br /> <br /> ===== Project Tasks =====<br /> #Identifying and downloading the target data set--------&gt;DONE<br /> #Data cleaning and pre-processing<br /> #Load the data into your Postgres instance <br /> #Develop queries to explore your ideas in the data <br /> #Develop and document the model function you are exploring in the 
data<br /> #Develop a visualization to show the model/patterns in the data<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12544 Elena-big-data 2011-12-07T02:03:04Z <p>Eosergi10: </p> <hr /> <div>* Stereotypes and Discrimination Through Statistics<br /> * A Profile of Immigrant Population in the 21st century in OECD Countries <br /> <br /> ===== Project Tasks =====<br /> #Identifying and downloading the target data set--------&gt;DONE<br /> #Data cleaning and pre-processing<br /> #Load the data into your Postgres instance <br /> #Develop queries to explore your ideas in the data <br /> #Develop and document the model function you are exploring in the data<br /> #Develop a visualization to show the model/patterns in the data<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Elena-big-data&diff=12461 Elena-big-data 2011-11-29T17:44:43Z <p>Eosergi10: </p> <hr /> <div>* Project title<br /> * Statistics and Social Network of YouTube Videos <br /> <br /> ===== Project Tasks =====<br /> #Identifying and downloading the target data set--------&gt;DONE<br /> #Data cleaning and pre-processing <br /> #Load the data into your Postgres instance <br /> #Develop queries to explore your ideas in the data <br /> #Develop and document the model function you are exploring in the data<br /> #Develop a visualization to show the model/patterns in the data<br /> <br /> ===== Tech Details =====<br /> * Node: as5<br /> * Path to storage space: /scratch/big-data/elena<br /> <br /> ===== Results =====<br /> * The visualization(s)<br /> * The story</div> Eosergi10 
https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12370 Annotated-directory-big-data 2011-10-16T08:28:08Z <p>Eosergi10: /* Freebase */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. For now they are in no particular order.<br /> <br /> ==== Metadata ====<br /> http://www.drewsullivan.com/database.html<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the &quot;Google Million&quot;, English fiction, etc. Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb, except editable by anyone<br /> * Complete Size - Without information about editors, 3.47 GB<br /> * Format - PostgreSQL &quot;COPY TO&quot; format<br /> * Curator - Jahelton07<br /> <br /> ==== freedb ====<br /> * URL - http://www.freedb.org/en/download__database.10.html<br /> * Description - A similar music-tagging service to MusicBrainz.<br /> * Complete Size - Of the latest complete set, 734MB compressed<br /> * Curator - Jahelton07<br /> <br /> ==== World Cubing Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, competitors of WCA competitions from 1984 until now. 
<br /> * Size: 5 MB, SQL export<br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS; more data sets are available within the four links in the list.<br /> * Download: 2 - 250 GB, Various formats<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Center for Disease Control Data ====<br /> * URL - http://www.cdc.gov/nchs/<br /> * Description - The National Center for Health Statistics under the CDC has a lot of nice downloadable datasets on mortality and health in its data warehouse. You can download the complete 1998 ICD-9 and 2000 ICD-9 here as well (the coding manual for cause of death used by many state and federal agencies) along with a guide to the ICD-9. 
Data is in Lotus 1-2-3 and ASCII formats.<br /> * Curator - jrhurst08<br /> <br /> ==== IRS Statistics ====<br /> * URL - http://www.irs.treas.gov/tax_stats/index.html<br /> * Description - The IRS Statistics of Income program tracks all sorts of data but always in summary form. You can find all sorts of information on non-profits (including the database of tax data for approved non-profits - downloadable in ASCII fixed-length form) and other stats on income earned, migration and foreign taxes paid. All are downloadable, often in spreadsheet form. Some databases are VERY big and the site is VERY slow.<br /> * Curator - jrhurst08<br /> <br /> ==== National Oceanic and Atmospheric Administration - Storm Prediction Center ====<br /> * URL - http://www.spc.noaa.gov/climo/<br /> * Description - This NOAA site includes a nice archive with downloadable files on tornadoes and tornado deaths since 1950. There's also data on hail and wind damage. <br /> * Curator - jrhurst08<br /> <br /> <br /> ==== Ensembl Genome Data ====<br /> *URL:http://useast.ensembl.org/info/data/ftp/index.html<br /> *Description:The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online<br /> *Format:MySQL<br /> *Curator: eosergi10<br /> <br /> <br /> ==== Freebase ====<br /> *URL: http://wiki.freebase.com/wiki/Data_dumps<br /> *Description: Full data dumps of every fact and assertion in Freebase, an open database of the world's information, covering millions of topics in hundreds of categories. <br /> Set - Quad dump is a full dump of Freebase assertions (quad dump) as tab separated utf8 text.<br /> Set - Simple Topic Dump is a tab-separated file containing basic identifying data about every topic in Freebase. 
<br /> Set - TSV per Freebase type is a tab-separated file for each type in Freebase, suitable for loading into spreadsheets.Each line represents an instance of a Freebase type and columns represent the available properties for the type. <br /> *Size: Total- 6.0 Gb compressed with bzip2<br /> Set - Quad dump: The Link Export is approximately 3.5 Gbytes compressed with bzip2 (35 GB uncompressed) <br /> Set - Simple Topic Dump: approximately 1.2 Gbyte compressed with bzip2 (5 GB uncompressed). In June 2011, there were over 22 million rows. <br /> Set - TSV per Freebase type: The full download is approximately 1.3 Gbytes compressed with bzip2.The browseable set contains approximately 7500 TSV files in 100 domains. <br /> *Format: This is a complete &quot;low level&quot; dump of data which is suitable for post processing into RDF or XML datasets.<br /> *Schema: <br /> Set - Quad dump:The format of the link export is a series of lines, one assertion per line.The lines are tab separated quadruples, &lt;source&gt; (mid - a machine-generated id), &lt;property&gt; (a particular kind of quality of the entity mentioned in the &quot;source&quot; column), &lt;destination&gt; (holds the name of a namespace), &lt;value&gt; (a key within that namespace). <br /> Set - Simple Topic Dump: mid,English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, short text description<br /> Set - TSV per Freebase type: type, type's description<br /> *Curator: eosergi10<br /> <br /> ====&quot;DBpedia&quot; ====<br /> *URL:http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011.DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. 
DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.<br /> *Size:The dataset consists of 1 billion pieces of information, of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets. The total is approximately 2.5 GB.<br /> *Format:RDF triples<br /> *Schema:Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).<br /> If a thing exists in multiple language versions of Wikipedia, then short and long abstracts within these languages and links to the different language Wikipedia pages are added to the description.<br /> *Curator: eosergi10<br /> <br /> ==== CGI 60 Genomes ====<br /> *URL: http://data.bionimbus.org/60-genome-data-set/<br /> *Description: A set of public genome sequences. There are four sets of data: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations. 
<br /> *Curator: eosergi10 <br /> <br /> ==== Dataset for &quot;Statistics and Social Network of YouTube Videos&quot; ====<br /> *URL: http://netsg.cs.sfu.ca/youtubedata/ <br /> *Description: Datasets of normal and updating crawl for YouTube <br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 ftp sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08<br /> <br /> ==== ies ====<br /> *URL: http://nces.ed.gov/ipeds/datacenter/<br /> *Description: Data for all US colleges since 1980, available in different sizes based on how much data you wish to retrieve.<br /> *Curator: gaespin07<br /> <br /> ====Enron Email Dataset====<br /> *URL: http://www.cs.cmu.edu/~enron/<br /> *Download: http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz<br /> *Size: 423 MB gzipped<br /> *Contains emails from about 150 Enron employees, mostly senior management.<br /> <br /> ====US Census Data for 2000====<br /> * URL: http://factfinder.census.gov/servlet/DatasetMainPageServlet<br /> * Curator: stahlbr<br /> <br /> ====Project Gutenberg====<br /> * URL: http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages#Getting_an_Offline_Version_of_our_Site<br /> * Size: 14.5GB<br /> * Format: unstructured plain text<br /> * Curator: stahlbr</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12369 Annotated-directory-big-data 2011-10-16T07:32:06Z <p>Eosergi10: /* Surveillance, Epidemiology and End Results */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. 
For now they are in no particular order.<br /> <br /> ==== Metadata ====<br /> http://www.drewsullivan.com/database.html<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the &quot;Google Million&quot;, English fiction, etc. Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb, except editable by anyone<br /> * Complete Size - Without information about editors, 3.47 GB<br /> * Format - PostgreSQL &quot;COPY TO&quot; format<br /> * Curator - Jahelton07<br /> <br /> ==== freedb ====<br /> * URL - http://www.freedb.org/en/download__database.10.html<br /> * Description - A similar music-tagging service to MusicBrainz.<br /> * Complete Size - Of the latest complete set, 734MB compressed<br /> * Curator - Jahelton07<br /> <br /> ==== World Cubing Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, competitors of WCA competitions from 1984 until now. 
<br /> * Size: 5mb, SQL export<br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS, more data sets within the four links in the list.<br /> * Download: 2 - 250Gb, Various formats<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Center for Disease Contol Data ====<br /> * URL - http://www.cdc.gov/nchs/<br /> * Description - The National Center for Health Statistics under the CDC has a lot of nice downloadable datasets on mortality and health in its data warehouse. You can download the complete 1998 ICD-9 and 2000 ICD-9 here as well (the coding manual for cause of death used by many state and federal agencies) along with a guide to the ICD-9. 
Data is in Lotus 1-2-3 and ASCII formats.<br /> * Curator - jrhurst08<br /> <br /> ==== IRS Statistics ====<br /> * URL - http://www.irs.treas.gov/tax_stats/index.html<br /> * Description - The IRS Statistics of Income program tracks all sorts of data but always in summary form. You can find all sorts of information on non-profits (including the database of tax data for approved non-profits - downloadable in ASCII fixed-length form) and other stats on income earned, migration and foreign taxes paid. All are downloadble often in spreadsheet form. Some databases are VERY big and the site is VERY slow.<br /> * Curator - jrhurst08<br /> <br /> ==== National Oceanographic and Atmospheric Administration - Storm Prediction Center ====<br /> * URL - http://www.spc.noaa.gov/climo/<br /> * Description - This NOAA site includes a nice archive with downloadable files on tornadoes and tornado deaths since 1950 . There's also data on hail and wind damage data. <br /> * Curator - jrhurst08<br /> <br /> <br /> ==== Freebase ====<br /> *URL: http://wiki.freebase.com/wiki/Data_dumps<br /> *Description: Full data dumps of every fact and assertion in Freebase,an open database of the world's information, covering millions of topics in hundreds of categories. <br /> Set - Quad dump is a full dump of Freebase assertions (quad dump) as tab separated utf8 text.<br /> Set - Simple Topic Dump is a tab-separated file containing basic identifying data about every topic in Freebase. <br /> Set - TSV per Freebase type is a tab-separated file for each type in Freebase, suitable for loading into spreadsheets.Each line represents an instance of a Freebase type and columns represent the available properties for the type. <br /> *Size: Total- 6.0 Gb compressed with bzip2<br /> Set - Quad dump: The Link Export is approximately 3.5 Gbytes compressed with bzip2 (35 GB uncompressed) <br /> Set - Simple Topic Dump: approximately 1.2 Gbyte compressed with bzip2 (5 GB uncompressed). 
In June 2011, there were over 22 million rows. <br /> Set - TSV per Freebase type: The full download is approximately 1.3 Gbytes compressed with bzip2.The browseable set contains approximately 7500 TSV files in 100 domains. <br /> *Format: This is a complete &quot;low level&quot; dump of data which is suitable for post processing into RDF or XML datasets.<br /> *Schema: <br /> Set - Quad dump:The format of the link export is a series of lines, one assertion per line.The lines are tab separated quadruples, &lt;source&gt; (mid - a machine-generated id), &lt;property&gt; (a particular kind of quality of the entity mentioned in the &quot;source&quot; column), &lt;destination&gt; (holds the name of a namespace), &lt;value&gt; (a key within that namespace). <br /> Set - Simple Topic Dump: mid,English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, short text description<br /> Set - TSV per Freebase type: type, type's description<br /> *Curator: eosergi10<br /> <br /> ====&quot;DBpedia&quot; ====<br /> *URL:http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011.DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.<br /> *Size:The dataset consists of 1 billion pieces of information out of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets. 
The total is approximately 2.5 GB.<br /> *Format:RDF triples<br /> *Schema:Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).<br /> If a thing exists in multiple language versions of Wikipedia, then short and long abstracts within these languages and links to the different language Wikipedia pages are added to the description.<br /> *Curator: eosergi10<br /> <br /> ==== CGI 60 Genomes ====<br /> *URL: http://data.bionimbus.org/60-genome-data-set/<br /> *Description: A set of public genome sequences. There are four sets of data: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations. <br /> *Curator: eosergi10 <br /> <br /> ==== Dataset for &quot;Statistics and Social Network of YouTube Videos&quot; ====<br /> *URL: http://netsg.cs.sfu.ca/youtubedata/ <br /> *Description: Datasets of normal and updating crawl for YouTube <br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 ftp sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08<br /> <br /> ==== ies ====<br /> *URL: http://nces.ed.gov/ipeds/datacenter/<br /> *Description: Data for all US colleges since 1980, available in different sizes based on how much data you wish to retrieve.<br /> *Curator: gaespin07<br /> <br /> ====Enron Email Dataset====<br /> *URL: http://www.cs.cmu.edu/~enron/<br /> *Download: http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz<br /> *Size: 423 MB gzipped<br /> *Contains emails from about 150 Enron employees, mostly senior management.<br /> <br /> ====US Census Data for 2000====<br /> * URL: http://factfinder.census.gov/servlet/DatasetMainPageServlet<br /> * Curator: stahlbr<br /> <br /> ====Project Gutenberg====<br /> * URL: 
http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages#Getting_an_Offline_Version_of_our_Site<br /> * Size: 14.5GB<br /> * Format: unstructured plain text<br /> * Curator: stahlbr</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12367 Annotated-directory-big-data 2011-10-14T19:08:04Z <p>Eosergi10: /* &quot;DBpedia&quot; */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. For now they are in no particular order.<br /> <br /> ==== Metadata ====<br /> http://www.drewsullivan.com/database.html<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the &quot;Google Million&quot;, English fiction, etc. Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb, except editable by anyone<br /> * Complete Size - Without information about editors, 3.47 GB<br /> * Format - PostgreSQL &quot;COPY TO&quot; format<br /> * Curator - Jahelton07<br /> <br /> ==== freedb ====<br /> * URL - http://www.freedb.org/en/download__database.10.html<br /> * Description - A similar music-tagging service to MusicBrainz.<br /> * Complete Size - Of the latest complete set, 734MB compressed<br /> * Curator - Jahelton07<br /> <br /> ==== World Cubing Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, competitors of WCA competitions from 1984 until now. 
<br /> * Size: 5mb, SQL export<br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS, more data sets within the four links in the list.<br /> * Download: 2 - 250Gb, Various formats<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Center for Disease Contol Data ====<br /> * URL - http://www.cdc.gov/nchs/<br /> * Description - The National Center for Health Statistics under the CDC has a lot of nice downloadable datasets on mortality and health in its data warehouse. You can download the complete 1998 ICD-9 and 2000 ICD-9 here as well (the coding manual for cause of death used by many state and federal agencies) along with a guide to the ICD-9. 
Data is in Lotus 1-2-3 and ASCII formats.<br /> * Curator - jrhurst08<br /> <br /> ==== IRS Statistics ====<br /> * URL - http://www.irs.treas.gov/tax_stats/index.html<br /> * Description - The IRS Statistics of Income program tracks all sorts of data but always in summary form. You can find all sorts of information on non-profits (including the database of tax data for approved non-profits - downloadable in ASCII fixed-length form) and other stats on income earned, migration and foreign taxes paid. All are downloadble often in spreadsheet form. Some databases are VERY big and the site is VERY slow.<br /> * Curator - jrhurst08<br /> <br /> ==== National Oceanographic and Atmospheric Administration - Storm Prediction Center ====<br /> * URL - http://www.spc.noaa.gov/climo/<br /> * Description - This NOAA site includes a nice archive with downloadable files on tornadoes and tornado deaths since 1950 . There's also data on hail and wind damage data. <br /> * Curator - jrhurst08<br /> <br /> <br /> ==== Surveillance, Epidemiology and End Results ====<br /> *URL:http://seer.cancer.gov/<br /> *Description: premier source for cancer statistics in the United States<br /> *Curator: eosergi10<br /> <br /> ==== Freebase ====<br /> *URL: http://wiki.freebase.com/wiki/Data_dumps<br /> *Description: Full data dumps of every fact and assertion in Freebase,an open database of the world's information, covering millions of topics in hundreds of categories. <br /> Set - Quad dump is a full dump of Freebase assertions (quad dump) as tab separated utf8 text.<br /> Set - Simple Topic Dump is a tab-separated file containing basic identifying data about every topic in Freebase. <br /> Set - TSV per Freebase type is a tab-separated file for each type in Freebase, suitable for loading into spreadsheets.Each line represents an instance of a Freebase type and columns represent the available properties for the type. 
<br /> *Size: Total- 6.0 Gb compressed with bzip2<br /> Set - Quad dump: The Link Export is approximately 3.5 Gbytes compressed with bzip2 (35 GB uncompressed) <br /> Set - Simple Topic Dump: approximately 1.2 Gbyte compressed with bzip2 (5 GB uncompressed). In June 2011, there were over 22 million rows. <br /> Set - TSV per Freebase type: The full download is approximately 1.3 Gbytes compressed with bzip2.The browseable set contains approximately 7500 TSV files in 100 domains. <br /> *Format: This is a complete &quot;low level&quot; dump of data which is suitable for post processing into RDF or XML datasets.<br /> *Schema: <br /> Set - Quad dump:The format of the link export is a series of lines, one assertion per line.The lines are tab separated quadruples, &lt;source&gt; (mid - a machine-generated id), &lt;property&gt; (a particular kind of quality of the entity mentioned in the &quot;source&quot; column), &lt;destination&gt; (holds the name of a namespace), &lt;value&gt; (a key within that namespace). <br /> Set - Simple Topic Dump: mid,English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, short text description<br /> Set - TSV per Freebase type: type, type's description<br /> *Curator: eosergi10<br /> <br /> ====&quot;DBpedia&quot; ====<br /> *URL:http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011.DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.<br /> *Size:The dataset consists of 1 billion pieces of information out of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets. 
The total is approximately 2.5 GB.<br /> *Format:RDF triples<br /> *Schema:Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).<br /> If a thing exists in multiple language versions of Wikipedia, then short and long abstracts within these languages and links to the different language Wikipedia pages are added to the description.<br /> *Curator: eosergi10<br /> <br /> ==== CGI 60 Genomes ====<br /> *URL: http://data.bionimbus.org/60-genome-data-set/<br /> *Description: A set of public genome sequences. There are four sets of data: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations. <br /> *Curator: eosergi10 <br /> <br /> ==== Dataset for &quot;Statistics and Social Network of YouTube Videos&quot; ====<br /> *URL: http://netsg.cs.sfu.ca/youtubedata/ <br /> *Description: Datasets of normal and updating crawl for YouTube <br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 ftp sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08<br /> <br /> ==== ies ====<br /> *URL: http://nces.ed.gov/ipeds/datacenter/<br /> *Description: Data for all US colleges since 1980, available in different sizes based on how much data you wish to retrieve.<br /> *Curator: gaespin07<br /> <br /> ====Enron Email Dataset====<br /> *URL: http://www.cs.cmu.edu/~enron/<br /> *Download: http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz<br /> *Size: 423 MB gzipped<br /> *Contains emails from about 150 Enron employees, mostly senior management.<br /> <br /> ====US Census Data for 2000====<br /> * URL: http://factfinder.census.gov/servlet/DatasetMainPageServlet<br /> * Curator: stahlbr</div> Eosergi10 
https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12366 Annotated-directory-big-data 2011-10-14T18:57:12Z <p>Eosergi10: /* Freebase */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. For now they are in no particular order.<br /> <br /> ==== Metadata ====<br /> http://www.drewsullivan.com/database.html<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the &quot;Google Million&quot;, English fiction, etc. Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb, except editable by anyone<br /> * Complete Size - Without information about editors, 3.47 GB<br /> * Format - PostgreSQL &quot;COPY TO&quot; format<br /> * Curator - Jahelton07<br /> <br /> ==== freedb ====<br /> * URL - http://www.freedb.org/en/download__database.10.html<br /> * Description - A similar music-tagging service to MusicBrainz.<br /> * Complete Size - Of the latest complete set, 734MB compressed<br /> * Curator - Jahelton07<br /> <br /> ==== World Cubing Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, competitors of WCA competitions from 1984 until now. 
<br /> * Size: 5mb, SQL export<br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS, more data sets within the four links in the list.<br /> * Download: 2 - 250Gb, Various formats<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Center for Disease Contol Data ====<br /> * URL - http://www.cdc.gov/nchs/<br /> * Description - The National Center for Health Statistics under the CDC has a lot of nice downloadable datasets on mortality and health in its data warehouse. You can download the complete 1998 ICD-9 and 2000 ICD-9 here as well (the coding manual for cause of death used by many state and federal agencies) along with a guide to the ICD-9. 
Data is in Lotus 1-2-3 and ASCII formats.<br /> * Curator - jrhurst08<br /> <br /> ==== IRS Statistics ====<br /> * URL - http://www.irs.treas.gov/tax_stats/index.html<br /> * Description - The IRS Statistics of Income program tracks all sorts of data but always in summary form. You can find all sorts of information on non-profits (including the database of tax data for approved non-profits - downloadable in ASCII fixed-length form) and other stats on income earned, migration and foreign taxes paid. All are downloadable, often in spreadsheet form. Some databases are VERY big and the site is VERY slow.<br /> * Curator - jrhurst08<br /> <br /> ==== National Oceanic and Atmospheric Administration - Storm Prediction Center ====<br /> * URL - http://www.spc.noaa.gov/climo/<br /> * Description - This NOAA site includes a nice archive with downloadable files on tornadoes and tornado deaths since 1950. There is also data on hail and wind damage. <br /> * Curator - jrhurst08<br /> <br /> <br /> ==== Surveillance, Epidemiology and End Results ====<br /> *URL: http://seer.cancer.gov/<br /> *Description: Premier source for cancer statistics in the United States<br /> *Curator: eosergi10<br /> <br /> ==== Freebase ====<br /> *URL: http://wiki.freebase.com/wiki/Data_dumps<br /> *Description: Full data dumps of every fact and assertion in Freebase, an open database of the world's information, covering millions of topics in hundreds of categories. <br /> Set - Quad dump is a full dump of Freebase assertions (quad dump) as tab-separated UTF-8 text.<br /> Set - Simple Topic Dump is a tab-separated file containing basic identifying data about every topic in Freebase. <br /> Set - TSV per Freebase type is a tab-separated file for each type in Freebase, suitable for loading into spreadsheets. Each line represents an instance of a Freebase type and columns represent the available properties for the type. 
<br /> *Size: Total- 6.0 Gb compressed with bzip2<br /> Set - Quad dump: The Link Export is approximately 3.5 Gbytes compressed with bzip2 (35 GB uncompressed) <br /> Set - Simple Topic Dump: approximately 1.2 Gbyte compressed with bzip2 (5 GB uncompressed). In June 2011, there were over 22 million rows. <br /> Set - TSV per Freebase type: The full download is approximately 1.3 Gbytes compressed with bzip2.The browseable set contains approximately 7500 TSV files in 100 domains. <br /> *Format: This is a complete &quot;low level&quot; dump of data which is suitable for post processing into RDF or XML datasets.<br /> *Schema: <br /> Set - Quad dump:The format of the link export is a series of lines, one assertion per line.The lines are tab separated quadruples, &lt;source&gt; (mid - a machine-generated id), &lt;property&gt; (a particular kind of quality of the entity mentioned in the &quot;source&quot; column), &lt;destination&gt; (holds the name of a namespace), &lt;value&gt; (a key within that namespace). <br /> Set - Simple Topic Dump: mid,English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, short text description<br /> Set - TSV per Freebase type: type, type's description<br /> *Curator: eosergi10<br /> <br /> ====&quot;DBpedia&quot; ====<br /> *URL:http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011.DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.<br /> *Size:The dataset consists of 1 billion pieces of information out of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets. 
The total is approximately 2.5 GB.<br /> *Format: RDF triples<br /> *Schema: Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).<br /> If a thing exists in multiple language versions of Wikipedia, then short and long abstracts within these languages and links to the different language Wikipedia pages are added to the description.<br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 FTP sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08<br /> <br /> ==== ies ====<br /> *URL: http://nces.ed.gov/ipeds/datacenter/<br /> *Description: Data for all US colleges since 1980, available in different sizes based on how much data you wish to retrieve.<br /> *Curator: gaespin07<br /> <br /> ====Enron Email Dataset====<br /> *URL: http://www.cs.cmu.edu/~enron/<br /> *Download: http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz<br /> *Size: 423 MB gzipped<br /> *Contains emails from about 150 Enron employees, mostly senior management.<br /> <br /> ====US Census Data for 2000====<br /> * URL: http://factfinder.census.gov/servlet/DatasetMainPageServlet<br /> * Curator: stahlbr</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12365 Annotated-directory-big-data 2011-10-14T18:56:36Z <p>Eosergi10: /* Freebase */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. For now they are in no particular order.<br /> <br /> ==== Metadata ====<br /> http://www.drewsullivan.com/database.html<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. 
by language, the &quot;Google Million&quot;, English fiction, etc. Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb, except editable by anyone<br /> * Complete Size - Without information about editors, 3.47 GB<br /> * Format - PostgreSQL &quot;COPY TO&quot; format<br /> * Curator - Jahelton07<br /> <br /> ==== freedb ====<br /> * URL - http://www.freedb.org/en/download__database.10.html<br /> * Description - A similar music-tagging service to MusicBrainz.<br /> * Complete Size - Of the latest complete set, 734MB compressed<br /> * Curator - Jahelton07<br /> <br /> ==== World Cubing Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, competitors of WCA competitions from 1984 until now. 
<br /> * Size: 5 MB, SQL export<br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS; more data sets are available within the four links in the list.<br /> * Download: 2-250 GB, various formats<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Center for Disease Control Data ====<br /> * URL - http://www.cdc.gov/nchs/<br /> * Description - The National Center for Health Statistics under the CDC has a lot of nice downloadable datasets on mortality and health in its data warehouse. You can download the complete 1998 ICD-9 and 2000 ICD-9 here as well (the coding manual for cause of death used by many state and federal agencies) along with a guide to the ICD-9. 
Data is in Lotus 1-2-3 and ASCII formats.<br /> * Curator - jrhurst08<br /> <br /> ==== IRS Statistics ====<br /> * URL - http://www.irs.treas.gov/tax_stats/index.html<br /> * Description - The IRS Statistics of Income program tracks all sorts of data but always in summary form. You can find all sorts of information on non-profits (including the database of tax data for approved non-profits - downloadable in ASCII fixed-length form) and other stats on income earned, migration and foreign taxes paid. All are downloadable, often in spreadsheet form. Some databases are VERY big and the site is VERY slow.<br /> * Curator - jrhurst08<br /> <br /> ==== National Oceanic and Atmospheric Administration - Storm Prediction Center ====<br /> * URL - http://www.spc.noaa.gov/climo/<br /> * Description - This NOAA site includes a nice archive with downloadable files on tornadoes and tornado deaths since 1950. There is also data on hail and wind damage. <br /> * Curator - jrhurst08<br /> <br /> <br /> ==== Freebase ====<br /> *URL: http://seer.cancer.gov/<br /> *Description: Premier source for cancer statistics in the United States<br /> *Curator: eosergi10<br /> <br /> ==== Freebase ====<br /> *URL: http://wiki.freebase.com/wiki/Data_dumps<br /> *Description: Full data dumps of every fact and assertion in Freebase, an open database of the world's information, covering millions of topics in hundreds of categories. <br /> Set - Quad dump is a full dump of Freebase assertions (quad dump) as tab-separated UTF-8 text.<br /> Set - Simple Topic Dump is a tab-separated file containing basic identifying data about every topic in Freebase. <br /> Set - TSV per Freebase type is a tab-separated file for each type in Freebase, suitable for loading into spreadsheets. Each line represents an instance of a Freebase type and columns represent the available properties for the type. 
<br /> *Size: Total- 6.0 Gb compressed with bzip2<br /> Set - Quad dump: The Link Export is approximately 3.5 Gbytes compressed with bzip2 (35 GB uncompressed) <br /> Set - Simple Topic Dump: approximately 1.2 Gbyte compressed with bzip2 (5 GB uncompressed). In June 2011, there were over 22 million rows. <br /> Set - TSV per Freebase type: The full download is approximately 1.3 Gbytes compressed with bzip2.The browseable set contains approximately 7500 TSV files in 100 domains. <br /> *Format: This is a complete &quot;low level&quot; dump of data which is suitable for post processing into RDF or XML datasets.<br /> *Schema: <br /> Set - Quad dump:The format of the link export is a series of lines, one assertion per line.The lines are tab separated quadruples, &lt;source&gt; (mid - a machine-generated id), &lt;property&gt; (a particular kind of quality of the entity mentioned in the &quot;source&quot; column), &lt;destination&gt; (holds the name of a namespace), &lt;value&gt; (a key within that namespace). <br /> Set - Simple Topic Dump: mid,English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, short text description<br /> Set - TSV per Freebase type: type, type's description<br /> *Curator: eosergi10<br /> <br /> ====&quot;DBpedia&quot; ====<br /> *URL:http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011.DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.<br /> *Size:The dataset consists of 1 billion pieces of information out of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets. 
The total is approximately 2.5 GB.<br /> *Format: RDF triples<br /> *Schema: Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).<br /> If a thing exists in multiple language versions of Wikipedia, then short and long abstracts within these languages and links to the different language Wikipedia pages are added to the description.<br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 FTP sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08<br /> <br /> ==== ies ====<br /> *URL: http://nces.ed.gov/ipeds/datacenter/<br /> *Description: Data for all US colleges since 1980, available in different sizes based on how much data you wish to retrieve.<br /> *Curator: gaespin07<br /> <br /> ====Enron Email Dataset====<br /> *URL: http://www.cs.cmu.edu/~enron/<br /> *Download: http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz<br /> *Size: 423 MB gzipped<br /> *Contains emails from about 150 Enron employees, mostly senior management.<br /> <br /> ====US Census Data for 2000====<br /> * URL: http://factfinder.census.gov/servlet/DatasetMainPageServlet<br /> * Curator: stahlbr</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12355 Annotated-directory-big-data 2011-10-14T07:49:24Z <p>Eosergi10: /* &quot;DBpedia&quot; */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. For now they are in no particular order.<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the &quot;Google Million&quot;, English fiction, etc. 
Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb.<br /> * Curator - Jahelton07<br /> <br /> ==== World Cubing Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, competitors of WCA competitions from 1984 until now. <br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS, more data sets within the four links in the list.<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. 
The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Freebase ====<br /> *URL: http://wiki.freebase.com/wiki/Data_dumps<br /> *Description: Full data dumps of every fact and assertion in Freebase,an open database of the world's information, covering millions of topics in hundreds of categories. <br /> Set - Quad dump is a full dump of Freebase assertions (quad dump) as tab separated utf8 text.<br /> Set - Simple Topic Dump is a tab-separated file containing basic identifying data about every topic in Freebase. <br /> Set - TSV per Freebase type is a tab-separated file for each type in Freebase, suitable for loading into spreadsheets.Each line represents an instance of a Freebase type and columns represent the available properties for the type. <br /> *Size: Total- 6.0 Gb compressed with bzip2<br /> Set - Quad dump: The Link Export is approximately 3.5 Gbytes compressed with bzip2 (35 GB uncompressed) <br /> Set - Simple Topic Dump: approximately 1.2 Gbyte compressed with bzip2 (5 GB uncompressed). In June 2011, there were over 22 million rows. <br /> Set - TSV per Freebase type: The full download is approximately 1.3 Gbytes compressed with bzip2.The browseable set contains approximately 7500 TSV files in 100 domains. <br /> *Format: This is a complete &quot;low level&quot; dump of data which is suitable for post processing into RDF or XML datasets.<br /> *Schema: <br /> Set - Quad dump:The format of the link export is a series of lines, one assertion per line.The lines are tab separated quadruples, &lt;source&gt; (mid - a machine-generated id), &lt;property&gt; (a particular kind of quality of the entity mentioned in the &quot;source&quot; column), &lt;destination&gt; (holds the name of a namespace), &lt;value&gt; (a key within that namespace). 
<br /> Set - Simple Topic Dump: mid,English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, short text description<br /> Set - TSV per Freebase type: type, type's description<br /> *Curator: eosergi10<br /> <br /> ====&quot;DBpedia&quot; ====<br /> *URL:http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011.DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.<br /> *Size:The dataset consists of 1 billion pieces of information out of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets. Totoal is approximatly: 2.5Gb.<br /> *Format:RDF triples<br /> *Schema:Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).<br /> If a thing exists in multiple language versions of Wikipedia, then short and long abstracts within these languages and links to the different language Wikipedia pages are added to the description.<br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 ftp sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12353 Annotated-directory-big-data 2011-10-14T07:04:05Z <p>Eosergi10: /* Freebase */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. 
For now they are in no particular order.<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the &quot;Google Million&quot;, English fiction, etc. Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb.<br /> * Curator - Jahelton07<br /> <br /> ==== World Cubing Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, competitors of WCA competitions from 1984 until now. <br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS, more data sets within the four links in the list.<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - 
ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Freebase ====<br /> *URL: http://wiki.freebase.com/wiki/Data_dumps<br /> *Description: Full data dumps of every fact and assertion in Freebase,an open database of the world's information, covering millions of topics in hundreds of categories. <br /> Set - Quad dump is a full dump of Freebase assertions (quad dump) as tab separated utf8 text.<br /> Set - Simple Topic Dump is a tab-separated file containing basic identifying data about every topic in Freebase. <br /> Set - TSV per Freebase type is a tab-separated file for each type in Freebase, suitable for loading into spreadsheets.Each line represents an instance of a Freebase type and columns represent the available properties for the type. <br /> *Size: Total- 6.0 Gb compressed with bzip2<br /> Set - Quad dump: The Link Export is approximately 3.5 Gbytes compressed with bzip2 (35 GB uncompressed) <br /> Set - Simple Topic Dump: approximately 1.2 Gbyte compressed with bzip2 (5 GB uncompressed). In June 2011, there were over 22 million rows. <br /> Set - TSV per Freebase type: The full download is approximately 1.3 Gbytes compressed with bzip2.The browseable set contains approximately 7500 TSV files in 100 domains. 
<br /> *Format: This is a complete &quot;low level&quot; dump of data which is suitable for post processing into RDF or XML datasets.<br /> *Schema: <br /> Set - Quad dump:The format of the link export is a series of lines, one assertion per line.The lines are tab separated quadruples, &lt;source&gt; (mid - a machine-generated id), &lt;property&gt; (a particular kind of quality of the entity mentioned in the &quot;source&quot; column), &lt;destination&gt; (holds the name of a namespace), &lt;value&gt; (a key within that namespace). <br /> Set - Simple Topic Dump: mid,English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, short text description<br /> Set - TSV per Freebase type: type, type's description<br /> *Curator: eosergi10<br /> <br /> ====&quot;DBpedia&quot; ====<br /> *URL:http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011.DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. <br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 ftp sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12352 Annotated-directory-big-data 2011-10-14T07:03:41Z <p>Eosergi10: /* Freebase */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. 
For now they are in no particular order.<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the &quot;Google Million&quot;, English fiction, etc. Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb.<br /> * Curator - Jahelton07<br /> <br /> ==== World Cubing Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, competitors of WCA competitions from 1984 until now. <br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS, more data sets within the four links in the list.<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - 
ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Freebase ====<br /> *URL: http://wiki.freebase.com/wiki/Data_dumps<br /> *Description: Full data dumps of every fact and assertion in Freebase,an open database of the world's information, covering millions of topics in hundreds of categories. <br /> Set - Quad dump is a full dump of Freebase assertions (quad dump) as tab separated utf8 text.<br /> Set - Simple Topic Dump is a tab-separated file containing basic identifying data about every topic in Freebase. <br /> Set - TSV per Freebase type is a tab-separated file for each type in Freebase, suitable for loading into spreadsheets.Each line represents an instance of a Freebase type and columns represent the available properties for the type. <br /> *Curator: eosergi10<br /> *Size: Total- 6.0 Gb compressed with bzip2<br /> Set - Quad dump: The Link Export is approximately 3.5 Gbytes compressed with bzip2 (35 GB uncompressed) <br /> Set - Simple Topic Dump: approximately 1.2 Gbyte compressed with bzip2 (5 GB uncompressed). In June 2011, there were over 22 million rows. <br /> Set - TSV per Freebase type: The full download is approximately 1.3 Gbytes compressed with bzip2.The browseable set contains approximately 7500 TSV files in 100 domains. 
<br /> *Format: This is a complete &quot;low level&quot; dump of data which is suitable for post processing into RDF or XML datasets.<br /> *Schema: <br /> Set - Quad dump:The format of the link export is a series of lines, one assertion per line.The lines are tab separated quadruples, &lt;source&gt; (mid - a machine-generated id), &lt;property&gt; (a particular kind of quality of the entity mentioned in the &quot;source&quot; column), &lt;destination&gt; (holds the name of a namespace), &lt;value&gt; (a key within that namespace). <br /> Set - Simple Topic Dump: mid,English display name, Freebase /en keys, numeric English Wikipedia keys, Freebase types, short text description<br /> Set - TSV per Freebase type: type, type's description<br /> <br /> ====&quot;DBpedia&quot; ====<br /> *URL:http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011.DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. <br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 ftp sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12351 Annotated-directory-big-data 2011-10-14T06:19:24Z <p>Eosergi10: /* Research and Innovative Technology Administration */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. 
For now they are in no particular order.<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. by language, the &quot;Google Million&quot;, English fiction, etc. Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb.<br /> * Curator - Jahelton07<br /> <br /> ==== World Cubing Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, competitors of WCA competitions from 1984 until now. <br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS, more data sets within the four links in the list.<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - 
ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Freebase ====<br /> *URL: http://wiki.freebase.com/wiki/Data_dumps<br /> *Description: Full data dumps of every fact and assertion in Freebase, an open database of the world's information, covering millions of topics in hundreds of categories. <br /> *Curator: eosergi10<br /> <br /> ==== &quot;DBpedia&quot; ====<br /> *URL: http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. <br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 ftp sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08</div> Eosergi10 https://wiki.cs.earlham.edu/index.php?title=Annotated-directory-big-data&diff=12350 Annotated-directory-big-data 2011-10-14T06:04:01Z <p>Eosergi10: /* CGI 60 Genomes */</p> <hr /> <div>__NOTOC__<br /> This is an annotated directory of public, freely available, &quot;large&quot; data sets. For now they are in no particular order.<br /> <br /> ==== Google ngrams ====<br /> * URL - http://books.google.com/ngrams/datasets<br /> * Description - The ngram databases on which Google's ngram viewer is built. A variety of corpora are available, e.g. 
by language, the &quot;Google Million&quot;, English fiction, etc. Each set contains a list of ngrams, frequency, and date information.<br /> * Curator - CharlieP<br /> <br /> ==== MusicBrainz ====<br /> * URL - http://musicbrainz.org/doc/MusicBrainz_Database<br /> * Description - In a nutshell, the musical equivalent of IMDb.<br /> * Curator - Jahelton07<br /> <br /> ==== World Cube Association Database ====<br /> * Browse - http://worldcubeassociation.org/results/<br /> * Download Database - http://www.worldcubeassociation.org/results/misc/export.html<br /> * Description - All times, competitions, and competitors from WCA competitions, from 1984 until now. <br /> * Curator - Twright09<br /> <br /> ==== Large Data Sets on AWS ====<br /> * URL - http://aws.amazon.com/publicdatasets/#1<br /> * Description - A list of large data sets on Amazon's AWS; more data sets are available within the four links in the list.<br /> * Curator - Twright09<br /> <br /> ==== Starcraft 2 Hit Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (second dataset from top)<br /> * Description - Analysis of the number of hits any given unit can sustain from any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Starcraft 2 Combat Analysis ====<br /> * URL - http://www.teamliquid.net/forum/viewmessage.php?topic_id=116789 (first dataset from top)<br /> * Description - Analysis of the percent chance of victory for any given unit versus any other unit<br /> * Curator - rdbean08<br /> <br /> ==== Twitter Users by Location ====<br /> * URL - http://www.infochimps.com/datasets/twitter-census-twitter-users-by-location/downloads/70077<br /> * Description - Twitter Census: Twitter Users by Location<br /> * Curator - ibabic09<br /> <br /> ==== The AOL Search Data ====<br /> * URL - http://www.infochimps.com/datasets/aol-search-data/downloads/70079<br /> * Description - The AOL Search Data is a collection of real query log data that is based on real users. 
The data set consists of 20M web queries collected from 650k users over three months.<br /> * Curator - ibabic09<br /> <br /> ==== Research and Innovative Technology Administration ====<br /> *URL: http://www.rita.dot.gov/<br /> *Description: RITA coordinates the U.S. Department of Transportation's research and education programs. RITA also offers vital transportation statistics and analysis, and supports national efforts to improve education and training in transportation-related fields.<br /> *Curator: eosergi10<br /> <br /> ==== &quot;DBpedia&quot; ====<br /> *URL: http://blog.dbpedia.org/2011/09/11/dbpedia-37-released-including-15-localized-editions/ <br /> *Description: The dataset release is based on Wikipedia dumps dating from late July 2011. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. <br /> *Curator: eosergi10<br /> <br /> ==== IMDB ====<br /> *URL: http://www.imdb.com/interfaces<br /> *Description: All the data used to create IMDB, available from any of the 3 ftp sites listed under &quot;Plain Text Data Files&quot;<br /> *Curator: gaschue08</div> Eosergi10