Difference between revisions of "Mobeen-big-data"

From Earlham CS Department
Latest revision as of 02:32, 18 December 2011

MovieLens Data Sets Project

Name: Mobeen Ludin

Class: Database Management System

H.W.: Final Project

Tech Details:

  • Node: as7
  • Path to storage space: /scratch/big-data/mobeen

Project data set

  • This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens.
  • Link to data set: http://www.grouplens.org/node/12

Project Tasks

1. Identifying and downloading the target data set

  • The downloaded data is on cluster at: /cluster/home/mmludin08/Big-Data-M

2. Data cleaning and pre-processing

The original data was in the .dat format. A Perl script and a Python script were written to change the format and clean the data. The Big-Data-M directory contains the following directories and files:

  • Directories:

-- Backupfiles: The Backupfiles directory contains the data set that was downloaded from MovieLens.

-- Clean_Data: The Clean_Data directory has all the data files that were formatted using the Perl/Python scripts.

-- Q_results: The Q_results directory has the output files for each query that was passed to Postgres from the bigdata.sql file.

-- Scripts: This directory contains all the scripts written for this project. Some of the scripts were written for test purposes.

  -- bashscript.sh         -- Cleans the data (this was the script used most).
  -- perl_script.pl        -- Didn't quite work the way I wanted; it attempted conversion and cleaning together.
  -- conversion_script.pl  -- Works fine for converting epoch to date only (test version).
  -- py_conv_script.py     -- Works fine for converting epoch to date and time.
  -- test.txt              -- Created for small-scale conversion and cleaning tests.
  -- python_script.py      -- Works fine for conversion to date; this was used for the actual data conversion.
  -- insert_script.pl      -- Works fine for converting the .csv files to .psql insertion statements so I could upload to bigfe.
  • Files:

-- bigdata.sql: This file has the queries that were written to generate results for my project. Most of them work fine; some didn't, or took a long time even after using indexes.

These files contain the actual data for the project; they were uploaded to the database and used for queries.

-- movies.csv

-- ratings.csv

-- tags.csv
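
The conversion step described above (from .dat to .csv, with epoch timestamps turned into dates) could look roughly like the sketch below. It is not the project's actual python_script.py. It assumes rating lines of the form UserID::MovieID::Rating::Timestamp with the timestamp in epoch seconds, and it assumes UTC for the date conversion. The function name convert_ratings is hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def convert_ratings(dat_lines, out_file):
    """Write '::'-delimited rating lines to out_file as CSV rows.

    Assumed input layout: UserID::MovieID::Rating::Timestamp (epoch seconds).
    """
    writer = csv.writer(out_file, lineterminator="\n")
    for line in dat_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, rating, epoch = line.split("::")
        # Convert epoch seconds to a calendar date (UTC is an assumption).
        rated_on = datetime.fromtimestamp(int(epoch), tz=timezone.utc).date()
        writer.writerow([user_id, movie_id, rating, rated_on.isoformat()])


# Small in-memory demonstration with made-up rows.
sample = ["1::122::5::838985046", "", "1::185::5::838983525"]
buffer = io.StringIO()
convert_ratings(sample, buffer)
print(buffer.getvalue(), end="")
```

For the real files, dat_lines would be an open file handle on the .dat file and out_file an open handle on the target .csv file.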

3. Load the data into your Postgres instance

  • After cleaning, the data was uploaded to the cluster and to a laptop machine.
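
One way to turn the cleaned .csv files into insertion statements, as insert_script.pl is described as doing, is sketched below in Python. This is an illustration under assumptions, not the original Perl script: the table name, the column names, and the helper names sql_literal and csv_to_inserts are all hypothetical. Every field is written as a quoted literal with single quotes doubled.

```python
import csv
import io


def sql_literal(value):
    """Quote a CSV field as a SQL string literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def csv_to_inserts(csv_file, table, columns):
    """Yield one INSERT statement per non-empty CSV row."""
    column_list = ", ".join(columns)
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip empty lines
        values = ", ".join(sql_literal(field) for field in row)
        yield f"INSERT INTO {table} ({column_list}) VALUES ({values});"


# Made-up rows standing in for movies.csv; column names are assumptions.
sample = io.StringIO(
    "1,Toy Story (1995),Adventure|Animation\n"
    "2,Child's Play (1988),Horror\n"
)
for statement in csv_to_inserts(sample, "movies", ["movie_id", "title", "genres"]):
    print(statement)
```

For files of this size, Postgres's COPY command is usually a faster way to load CSV data than row-by-row INSERT statements.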

4. Develop queries to explore your ideas in the data

  • SQL statements with results are on cluster: /cluster/home/mmludin08/Big-Data-M

5. Results

I. Develop and document the model function you are exploring in the data
  • For this project my aim was to discover the movie genre timeline. In other words, I wanted to find out which types of movies people watch in which periods of time. I also looked for patterns in how users rate movies. For example, I tried to find out whether the same users rated a movie highly, and whether they watch the same type of movies in the same period of time. The queries developed to do this can be found at /cluster/home/mmludin08/Big-Data-M/Clean-Data/bigdata.sql.
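
A genre timeline of the kind described above can be sketched as a count of ratings per genre per year. The example below is a minimal illustration, not one of the queries in bigdata.sql. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for Postgres, and the table names, column names, and sample rows are all made up. It also assumes that the genres of a movie are stored as a "|"-separated string, which is split into one row per genre before querying.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie_genres (movie_id INTEGER, genre TEXT);
    CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER,
                          rating REAL, rated_on TEXT);
""")

# Split each '|'-separated genre string into one row per (movie, genre).
movies = [(1, "Adventure|Animation"), (2, "Horror")]
for movie_id, genres in movies:
    conn.executemany(
        "INSERT INTO movie_genres VALUES (?, ?)",
        [(movie_id, genre) for genre in genres.split("|")],
    )

# Made-up ratings with dates already converted from epoch seconds.
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", [
    (10, 1, 5.0, "1996-08-02"),
    (11, 1, 4.0, "1996-09-15"),
    (10, 2, 3.0, "2001-03-04"),
])


def genre_timeline(connection):
    """Return (year, genre, number of ratings, average rating) rows."""
    return connection.execute("""
        SELECT strftime('%Y', r.rated_on) AS year,
               g.genre,
               COUNT(*) AS num_ratings,
               AVG(r.rating) AS avg_rating
        FROM ratings AS r
        JOIN movie_genres AS g ON g.movie_id = r.movie_id
        GROUP BY year, g.genre
        ORDER BY year, g.genre
    """).fetchall()


for row in genre_timeline(conn):
    print(row)
```

In Postgres the year would be extracted differently (for example with EXTRACT(YEAR FROM ...) on a date column), but the join and grouping have the same shape.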
II. Develop a visualization to show the model/patterns in the data
  • The visualization(s)
  • The story