Difference between revisions of "Ivan-big-data"

From Earlham CS Department
Line 3:
  ===== Project Tasks =====
- **Identifying and downloading the target data set
+ ##Identifying and downloading the target data set
  Data sets can be found here:
  * http://data.un.org/Data.aspx?q=gdp&d=SNAAMA&f=grID%3a101%3bcurrID%3aUSD%3bpcFlag%3a1
Line 12:
  * http://data.un.org/Data.aspx?d=UNODC&f=tableCode%3a1
- **Data cleaning and pre-processing
+ ##Data cleaning and pre-processing
  The first obstacle I faced in cleaning and pre-processing was inconsistent country naming. For example, the education data set uses the name China, while the homicide data set uses People's Republic of China. When I did a full join on the country columns, I saw that rows that should have been combined into one line were not. So I renamed the countries so that each one has the same name across all 6 data sets.

Revision as of 17:05, 4 December 2011

  • Project title: Relationship between Homicide, Education, Abortion, HIV Incidence, Population and GDP for countries around the globe
  • Project data set: United Nations DB (UNdata)
Project Tasks
    1. Identifying and downloading the target data set

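Data sets can be found here:

  • http://data.un.org/Data.aspx?q=gdp&d=SNAAMA&f=grID%3a101%3bcurrID%3aUSD%3bpcFlag%3a1
  • http://data.un.org/Data.aspx?d=UNODC&f=tableCode%3a1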

    2. Data cleaning and pre-processing

The first obstacle I faced in cleaning and pre-processing was inconsistent country naming. For example, the education data set uses the name China, while the homicide data set uses People's Republic of China. When I did a full join on the country columns, I saw that rows that should have been combined into one line were not. So I renamed the countries so that each one has the same name across all 6 data sets.
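The renaming step can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure actually used for the project: the alias table `CANONICAL_NAMES`, the helpers `normalize` and `full_join`, and all numeric values are made up for the example. Only the China / People's Republic of China pair comes from the data sets described here.

```python
# Hypothetical alias table: maps each variant spelling to one canonical name.
# Only the China pair is taken from the text; extend it as mismatches turn up.
CANONICAL_NAMES = {
    "People's Republic of China": "China",
}


def normalize(country):
    """Return the canonical name for a country, or the trimmed name itself."""
    name = country.strip()
    return CANONICAL_NAMES.get(name, name)


def full_join(left, right):
    """Full outer join of two {country: value} mappings on the normalized name."""
    left = {normalize(k): v for k, v in left.items()}
    right = {normalize(k): v for k, v in right.items()}
    countries = sorted(left.keys() | right.keys())
    return {c: (left.get(c), right.get(c)) for c in countries}


# Placeholder values for illustration only, not real UNdata figures.
education = {"China": 1.0, "France": 2.0}
homicide = {"People's Republic of China": 3.0, "France": 4.0}

joined = full_join(education, homicide)

# Countries present in only one data set; after renaming there should be none.
unmatched = [c for c, (a, b) in joined.items() if a is None or b is None]
```

Listing `unmatched` before and after adding an alias is a quick way to see which names still differ between two data sets.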

    3. Load the data into your Postgres instance

The data sets I downloaded were CSV files.
Here is an example of loading the homicide data set into my PostgreSQL instance:

  • DROP TABLE IF EXISTS homicide;
  • CREATE TABLE homicide (COUNTRY varchar PRIMARY KEY, YEAR int, RATE float);
  • COPY homicide FROM '/home/postgres/HOMICIDE.csv' DELIMITER ';' CSV;
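Before running COPY, it can help to check that the file matches the table definition. The sketch below is a rough approximation in Python, not part of the original workflow: `check_rows` is a hypothetical helper, the sample rows are invented, and Python's `int` and `float` parsing is not identical to PostgreSQL's (for instance, COPY treats an empty unquoted field as NULL, which this check would flag). It looks for three things that would make the load fail: a wrong field count, a non-numeric YEAR or RATE, and a repeated COUNTRY, which would violate the primary key.

```python
import csv
import io


def check_rows(lines):
    """Return a list of problems likely to make COPY into homicide fail."""
    problems = []
    seen = set()
    reader = csv.reader(lines, delimiter=";")  # same delimiter as the COPY command
    for lineno, row in enumerate(reader, start=1):
        if len(row) != 3:
            problems.append(f"line {lineno}: expected 3 fields, got {len(row)}")
            continue
        country, year, rate = row
        try:
            int(year)
            float(rate)
        except ValueError:
            problems.append(f"line {lineno}: YEAR or RATE is not numeric")
        if country in seen:
            # COUNTRY is the primary key, so each country may appear only once.
            problems.append(f"line {lineno}: duplicate COUNTRY {country!r}")
        seen.add(country)
    return problems


# Invented sample rows; in practice, pass an open file handle instead.
sample = io.StringIO(
    "China;2008;1.0\n"
    "France;2008;2.0\n"
    "China;2009;1.5\n"
    "Spain;n/a;3.0\n"
)
problems = check_rows(sample)
```

A header row would be reported as non-numeric by this check, which is a reminder that COPY without the HEADER option tries to load the header as data.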
    4. Develop queries to explore your ideas in the data
    5. Develop and document the model function you are exploring in the data
    6. Develop a visualization to show the model/patterns in the data
Tech Details
  • Node: as2
  • Path to storage space: /scratch/big-data/ivan
Results
  • The visualization(s)
  • The story