Difference between revisions of "Ivan-big-data"

Revision as of 19:03, 4 December 2011

Project title: Relationship between Homicide, Education, Abortion, HIV Incidence, Population and GDP for countries around the globe
Project data set: United Nations DB (UNdata)

1 Project Tasks
2 Identifying and downloading the target data set
3 Metadata
4 Data cleaning and pre-processing
5 Load the data into your Postgres instance
6 Develop queries to explore your ideas in the data
7 Develop and document the model function you are exploring in the data
8 Develop a visualization to show the model/patterns in the data
- 8.1 Tech Details
- 8.2 Results

Project Tasks

Identifying and downloading the target data set

Data sets can be founded here:

Metadata

Homicide - rate per 100,000 population

-- Data for 195 countries
-- Min: 0.373944299393987
-- Max: 60.8707260176513

Education_cost - expenditure on education as % total government expenditure

-- Data for 170 countries
-- Min: 10.13355
-- Max: 53.99897

Abortion - Abortions per 1,000 women (woman 15-44)

-- Data for 61 countries
-- Min: 0.1
-- Max: 53.7

Total Population - both sexes combined (thousands)

-- Data for 263 countries
-- Min: 0.146
-- Max: 6895889.018

GDP - at current prices - US dollars

-- Data for 209 countries
-- Min: 150.617985349753
-- Max: 186174.902651821

Data cleaning and pre-processing

The first obstacle I faced with cleaning and pre-processing was inconsistency in countries naming. For example name China in education and name People's Republic of China in homicide... So when I did full join of country columns I realized that not all of them are in one line (things that are supposed to be in one line). So I changed names and made it unique through all 6 data sets.

Load the data into your Postgres instance

Data-sets I downloaded were in CSV files.

Here is an example for inserting data-set homicide into my PQSL:

drop table homicide;
create TABLE homicide (COUNTRY varchar primary key, YEAR int, RATE float);
COPY homicide FROM '/home/postgres/HOMICIDE.csv' DELIMITER ';' CSV;

Develop queries to explore your ideas in the data

Query to connect all data sets with country as a unique key and to see the overlap (Use full join to see all):

select * from (((gdp inner join education_cost ON (gdp.country = education_cost.country))
inner join population ON (gdp.country = population.country))
inner join homicide ON (gdp.country = homicide.country))
inner join abortion ON (gdp.country = abortion.country)

This is what I used to export result from PSQL to CSV:

Copy (some_query) to '/home/postgres/something.csv' with CSV;

Develop and document the model function you are exploring in the data

Develop a visualization to show the model/patterns in the data

Tech Details

Node: as2
Path to storage space: /scratch/big-data/ivan

Results

The visualization(s)
The story

@@ Line 17: / Line 17: @@
 -- Data for 195 countries<br/>
 -- Min: 0.373944299393987<br/>
--- Max: 60.8707260176513<br/>
+-- Max: 60.8707260176513
 *Education_cost - expenditure on education as % total government expenditure
 -- Data for 170 countries<br/>
 -- Min: 10.13355<br/>
--- Max: 53.99897<br/>
+-- Max: 53.99897
 *Abortion - Abortions per 1,000 women (woman 15-44)

Difference between revisions of "Ivan-big-data"

Revision as of 19:03, 4 December 2011

Contents

Project Tasks

Identifying and downloading the target data set

Metadata

Data cleaning and pre-processing

Load the data into your Postgres instance

Develop queries to explore your ideas in the data

Develop and document the model function you are exploring in the data

Develop a visualization to show the model/patterns in the data

Tech Details

Results

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

websites

wiki

applied groups

Tools