Difference between revisions of "Mobeen-big-data"

From Earlham CS Department

Revision as of 08:51, 14 December 2011

MovieLens Data Sets Project

Name: Mobeen Ludin

Class: Database Management System

H.W.: Final Project

Tech Details:

  • Node: as7
  • Path to storage space: /scratch/big-data/mobeen

Project data set

  • This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens.
  • Link to data set: http://www.grouplens.org/node/12

Project Tasks

1. Identifying and downloading the target data set

  • The downloaded data is on the cluster at: /cluster/home/mmludin08/Big-Data-M

2. Data cleaning and pre-processing

The original data was in the .dat format. One Perl script and one Python script were written to change the format and clean the data. The Big-Data-M directory contains the following directories and files:

  • Directories:

-- Backupfiles: The Backupfiles directory contains the data set as it was downloaded from MovieLens.

-- Clean_Data: The Clean_Data directory has all the data files that were formatted using the Perl/Python scripts.

-- Q_results: The Q_results directory has the output files for each of the queries that were passed to Postgres from the bigdata.sql file.

-- Scripts: This directory contains all the scripts written for this project. Some of the scripts were written for test purposes.

  -- bashscript.sh         -- Cleans the data (this was the script used most often).
  -- perl_script.pl        -- Attempts conversion and cleaning together; did not quite work the way I wanted.
  -- conversion_script.pl  -- Works fine for epoch conversion to date only (test version).
  -- py_conv_script.py     -- Works fine for epoch conversion to date and time.
  -- test.txt              -- Created for testing conversion and cleaning on a small scale.
  -- python_script.py      -- Works fine for conversion to date; this was used for the actual data conversion.
  • Files:
-- bigdata.sql: This file has the queries that were written to generate results for my project. Most of them work fine; some did not, or ran slowly even after adding indexes.
-- movies.csv
-- ratings.csv
-- tags.csv: These three files contain the actual data for the project; they were loaded into the database and used for the queries.
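
The conversion step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual python_script.py: it assumes the MovieLens ratings layout `UserID::MovieID::Rating::Timestamp` with epoch seconds, and the names `convert_rating_line` and `convert_file` are hypothetical.

```python
import csv
from datetime import datetime, timezone


def convert_rating_line(line):
    """Split one '::'-delimited rating record and turn its epoch timestamp into a date."""
    user_id, movie_id, rating, epoch = line.rstrip("\n").split("::")
    # Interpret the epoch seconds as UTC and keep only the calendar date.
    date = datetime.fromtimestamp(int(epoch), tz=timezone.utc).date().isoformat()
    return [user_id, movie_id, rating, date]


def convert_file(dat_path, csv_path):
    """Write a CSV version of a ratings .dat file (both paths are placeholders)."""
    with open(dat_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for line in src:
            if line.strip():  # skip blank lines
                writer.writerow(convert_rating_line(line))
```

Using UTC keeps the resulting dates the same on the cluster and on a laptop; a script that relies on local time could shift some ratings by a day.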

3. Load the data into your Postgres instance

  • After cleaning, the data was uploaded to the cluster and to a laptop machine.
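
As a rough illustration of the load step, the sketch below bulk-inserts a cleaned ratings CSV into a table and runs one check query. It uses Python's built-in sqlite3 as a stand-in so that it runs without a server; it is not the Postgres setup used in the project. The table layout, the column names, and the sample rows are all assumptions made for the example.

```python
import csv
import io
import sqlite3


def load_ratings(conn, csv_file):
    """Create a ratings table and bulk-insert rows from an open CSV file object."""
    # Column names are assumed; the project's real schema is not shown in this page.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings ("
        "user_id INTEGER, movie_id INTEGER, rating REAL, rated_on TEXT)"
    )
    rows = [tuple(r) for r in csv.reader(csv_file) if r]
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Tiny made-up sample standing in for ratings.csv.
sample = io.StringIO("1,122,5,1996-08-02\n1,185,4.5,1996-08-02\n2,122,3,1997-07-07\n")
conn = sqlite3.connect(":memory:")
loaded = load_ratings(conn, sample)
avg = conn.execute("SELECT AVG(rating) FROM ratings WHERE movie_id = 122").fetchone()[0]
print(loaded, avg)
```

In Postgres itself, a bulk load of a CSV file is normally done with the COPY command (or \copy from psql) rather than row-by-row inserts.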

4. Develop queries to explore your ideas in the data

  • The SQL statements and their results are on the cluster at: /cluster/home/mmludin08/Big-Data-M

5. Results

I. Develop and document the model function you are exploring in the data
  • For this project, my aim was to discover a timeline of movie genres. In other words, I wanted to find out which types of movies people watch in which periods of time. I also tried to look for the pattern
II. Develop a visualization to show the model/patterns in the data
  • The visualization(s)
  • The story
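
One way to approximate the genre timeline described in item I is to count ratings per year and genre. The sketch below is only an illustration under stated assumptions: it is not taken from bigdata.sql, it assumes genres are stored as a pipe-separated string per movie (as in the MovieLens movies file) and that rating dates are ISO strings, and the function name `genre_counts_by_year` and the sample data are made up.

```python
from collections import Counter


def genre_counts_by_year(movies, ratings):
    """Count ratings per (year, genre).

    movies  : dict mapping movie id -> pipe-separated genre string
    ratings : iterable of (movie_id, iso_date) pairs
    """
    counts = Counter()
    for movie_id, iso_date in ratings:
        year = int(iso_date[:4])  # the first four characters of an ISO date are the year
        for genre in movies.get(movie_id, "").split("|"):
            if genre:  # ignore movies with no genre information
                counts[(year, genre)] += 1
    return counts


# Made-up sample data, for illustration only.
movies = {1: "Comedy|Romance", 2: "Action"}
ratings = [(1, "1996-08-02"), (2, "1996-09-01"), (1, "1997-01-15")]
timeline = genre_counts_by_year(movies, ratings)
for (year, genre), n in sorted(timeline.items()):
    print(year, genre, n)
```

The resulting (year, genre) counts could feed the visualization in item II directly, for example as one line per genre over the years.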