Difference between revisions of "Ivan-big-data"

From Earlham CS Department
Jump to navigation Jump to search
(Develop queries to explore your ideas in the data)
Line 27: Line 27:
 
==Develop queries to explore your ideas in the data==
 
==Develop queries to explore your ideas in the data==
  
Query to connect all data sets with country as a unique key:
+
*Query to connect all data sets with country as a unique key:
  
 
select * from (((gdp inner join education_cost ON (gdp.country = education_cost.country)) <br/>
 
select * from (((gdp inner join education_cost ON (gdp.country = education_cost.country)) <br/>
Line 34: Line 34:
 
inner join abortion ON (gdp.country = abortion.country)<br/>
 
inner join abortion ON (gdp.country = abortion.country)<br/>
  
This is what I used to export result from PSQL to CSV:
+
*This is what I used to export result from PSQL to CSV:
  
 
Copy (some_query) to '/home/postgres/something.csv' with CSV;
 
Copy (some_query) to '/home/postgres/something.csv' with CSV;

Revision as of 18:53, 4 December 2011

  • Project title: Relationship between Homicide, Education, Abortion, HIV Incidence, Population and GDP for countries around the globe
  • Project data set: United Nations DB (UNdata)
Project Tasks

Identifying and downloading the target data set

Data sets can be founded here:

Data cleaning and pre-processing

The first obstacle I faced with cleaning and pre-processing was inconsistency in countries naming. For example name China in education and name People's Republic of China in homicide... So when I did full join of country columns I realized that not all of them are in one line (things that are supposed to be in one line). So I changed names and made it unique through all 6 data sets.   

Load the data into your Postgres instance

Data-sets I downloaded were in CSV files.

Here is an example for inserting data-set homicide into my PQSL:

  • drop table homicide;
  • create TABLE homicide (COUNTRY varchar primary key, YEAR int, RATE float);
  • COPY homicide FROM '/home/postgres/HOMICIDE.csv' DELIMITER ';' CSV;

Develop queries to explore your ideas in the data

  • Query to connect all data sets with country as a unique key:

select * from (((gdp inner join education_cost ON (gdp.country = education_cost.country))
inner join population ON (gdp.country = population.country))
inner join homicide ON (gdp.country = homicide.country))
inner join abortion ON (gdp.country = abortion.country)

  • This is what I used to export result from PSQL to CSV:

Copy (some_query) to '/home/postgres/something.csv' with CSV;

Develop and document the model function you are exploring in the data

Develop a visualization to show the model/patterns in the data

Tech Details
  • Node: as2
  • Path to storage space: /scratch/big-data/ivan
Results
  • The visualization(s)
  • The story