Difference between revisions of "Mobeen-big-data"
Revision as of 08:50, 14 December 2011
MovieLens Data Sets Project
Name: Mobeen Ludin
Class: Database Management System
H.W.: Final Project
Tech Details:
- Node: as7
- Path to storage space: /scratch/big-data/mobeen
Project data set
- This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens.
- Link to data set: http://www.grouplens.org/node/12
Project Tasks
1. Identifying and downloading the target data set
- The downloaded data is on cluster at: /cluster/home/mmludin08/Big-Data-M
2. Data cleaning and pre-processing
The original data was in the .dat format. One Perl script and one Python script were written to change the format and clean the data. The Big-Data-M directory contains the following directories and files:
- Directories:
--. Backupfiles: The Backupfiles directory contains the data set that was downloaded from MovieLens.
--. Clean_Data: The Clean_Data directory has all the data files that were formatted using the Perl/Python scripts.
--. Q_results: The Q_results directory has the output files for each query that was passed to Postgres from the bigdata.sql file.
--. Scripts: This directory contains all the scripts written for this project. Some of the scripts were written for test purposes.
-- bashscript.sh -- Works for cleaning the data (this script was used the most).
-- perl_script.pl -- Did not quite work the way I wanted. Does conversion and cleaning together.
-- conversion_script.pl -- Works fine for epoch conversion to date only (test version).
-- py_conv_script.py -- Works fine for epoch conversion to date and time.
-- test.txt -- Created for testing conversion and cleaning on a small scale.
-- python_script.py -- Works fine for conversion to date. This script was used for the actual data conversion.
- Files:
--. bigdata.sql: This file has the queries that were written to generate results for my project. Most of them work fine; some did not, or took a long time even with indexes.
--. movies.csv
--. ratings.csv
--. tags.csv
These files contain the actual data for the project; they were uploaded to the database and used for the queries.
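The conversion scripts themselves are not reproduced on this page. As a minimal sketch of the kind of cleaning described above, the following assumes the raw ratings file uses the `::` field separator documented for the MovieLens 10M release and stores timestamps as Unix epoch seconds. The function names `convert_rating_line` and `convert_lines`, the UTC time zone, and the output column order are illustrative assumptions, not taken from the project's scripts.

```python
import csv
import io
from datetime import datetime, timezone


def convert_rating_line(line):
    """Split one '::'-delimited ratings line and replace the epoch with a UTC date.

    Assumed input layout: UserID::MovieID::Rating::Timestamp
    """
    user_id, movie_id, rating, epoch = line.strip().split("::")
    date = datetime.fromtimestamp(int(epoch), tz=timezone.utc).date().isoformat()
    return [user_id, movie_id, rating, date]


def convert_lines(lines):
    """Return CSV text for an iterable of raw ratings lines, skipping blank lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        if line.strip():
            writer.writerow(convert_rating_line(line))
    return out.getvalue()


if __name__ == "__main__":
    # Two made-up sample lines in the assumed layout.
    sample = ["1::122::5::838985046", "1::185::5::838983525"]
    print(convert_lines(sample), end="")
```

Using the `csv` module rather than a plain string replace matters more for the movies file, where titles can contain commas and need quoting; the same pattern applies there with a different column layout.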
3. Load the data into your Postgres instance
- After cleaning, the data was uploaded to the cluster and to a laptop machine.
4. Develop queries to explore your ideas in the data
- SQL statements with results are on cluster: /cluster/home/mmludin08/Big-Data-M
5. Results
I. Develop and document the model function you are exploring in the data
- For this project my aim was to discover the movie genre timeline. In other words, I wanted to find out during which periods people watch which types of movies. I also tried to look for the pattern
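The queries in bigdata.sql are not shown on this page. As a hedged illustration of the genre timeline idea, the sketch below counts ratings per genre per year in plain Python. It assumes genres are stored as a `|`-separated string (as in the MovieLens movies file) and that each rating carries an ISO-style date string; the page says timestamps were converted to dates, but the exact format is an assumption. The function names and the tiny in-memory sample are made up for the example.

```python
from collections import Counter


def genre_counts_by_year(movies, ratings):
    """Count ratings per (year, genre).

    movies:  dict mapping movie_id -> '|'-separated genre string (assumed layout)
    ratings: iterable of (movie_id, iso_date) pairs, e.g. ("1", "1996-08-02")
    """
    counts = Counter()
    for movie_id, iso_date in ratings:
        genres = movies.get(movie_id)
        if genres is None:
            continue  # rating refers to a movie missing from the movies data
        year = int(iso_date[:4])
        for genre in genres.split("|"):
            counts[(year, genre)] += 1
    return counts


def top_genre_per_year(counts):
    """Return {year: genre} with the most-rated genre; ties go to the alphabetically first."""
    best = {}
    for (year, genre), n in sorted(counts.items()):
        if year not in best or n > best[year][1]:
            best[year] = (genre, n)
    return {year: genre for year, (genre, _) in best.items()}


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real data set.
    movies = {"1": "Adventure|Comedy", "2": "Drama"}
    ratings = [("1", "1996-08-02"), ("2", "1996-09-10"), ("2", "2001-01-05")]
    counts = genre_counts_by_year(movies, ratings)
    for year, genre in sorted(top_genre_per_year(counts).items()):
        print(year, genre)
```

Note that this measures when movies were rated, which is only a proxy for when they were watched. In the database the same grouping would be a join between the ratings and movies tables followed by a GROUP BY on year and genre.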
II. Develop a visualization to show the model/patterns in the data
- The visualization(s)
- The story