Difference between revisions of "Mobeen-big-data"
Revision as of 07:57, 14 December 2011
Contents
- 1 MovieLens Data Sets Project
- 1.1 Project data set
- 1.2 Project Tasks
- 1.2.1 1. Identifying and downloading the target data set
- 1.2.2 2. Data cleaning and pre-processing
- 1.2.3 3. Load the data into your Postgres instance
- 1.2.4 4. Develop queries to explore your ideas in the data
- 1.2.5 5. Develop and document the model function you are exploring in the data
- 1.2.6 6. Develop a visualization to show the model/patterns in the data
MovieLens Data Sets Project
Project data set
- This data set contains 10,000,054 ratings and 95,580 tags applied to 10,681 movies by 71,567 users of the online movie recommender service MovieLens.
- Link to data set: http://www.grouplens.org/node/12
Project Tasks
1. Identifying and downloading the target data set
- The downloaded data is on the cluster at: /cluster/home/mmludin08/Big-Data-M
The Big-Data-M directory contains the following directories and files:
- Directories:
# Backupfiles: The Backupfiles directory contains the data set that was downloaded from MovieLens.
# Clean_Data: The Clean_Data directory has all the data files that were formatted using the Perl/Python scripts.
# Q_results: The Q_results directory has
- Scripts
- Files: bigdata.sql movies.csv ratings.csv tags.csv
2. Data cleaning and pre-processing
- The original data was in .dat format. One Perl script and one Python script were written to convert the format and clean the data.
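The conversion step can be sketched roughly as follows. This is not the project's actual script, only a minimal illustration that assumes the MovieLens .dat files separate fields with `::` and that the target is CSV. The function name `dat_to_csv`, the column names, and the choice to drop rows with the wrong field count are all assumptions made for the example.

```python
import csv
import io


def dat_to_csv(src, dst, header):
    """Copy '::'-delimited lines from src to dst as CSV rows.

    src and dst are open text file objects; header is the list of
    column names (hypothetical names chosen for this sketch).
    """
    writer = csv.writer(dst)
    writer.writerow(header)
    for line in src:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("::")
        if len(fields) != len(header):
            continue  # assumed cleaning rule: drop malformed rows
        # csv.writer quotes fields that contain commas, e.g. movie titles
        writer.writerow(fields)


# Small in-memory demonstration with a made-up sample line
sample = io.StringIO("1::Toy Story (1995)::Adventure|Animation|Children\n")
out = io.StringIO()
dat_to_csv(sample, out, ["movie_id", "title", "genres"])
```

For real files, the same function could be called with `open("movies.dat")` and `open("movies.csv", "w", newline="")` in place of the in-memory objects.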
3. Load the data into your Postgres instance
- After cleaning, the data was uploaded to the cluster and to a laptop machine.
4. Develop queries to explore your ideas in the data
- SQL statements with results are on the cluster at: /cluster/home/mmludin08/Big-Data-M
5. Develop and document the model function you are exploring in the data
- For this project, my aim was to discover the timeline of movie genres. In other words, I wanted to find out which types of movies people watch during which periods of time. I also tried to look for the pattern
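One way to picture the genre timeline idea is to count ratings per genre per year. The sketch below is only an illustration under stated assumptions, not the model function used in the project: it assumes each rating carries a Unix timestamp, that a movie's genres are stored as a `|`-separated string, and that a rating stands in for "watching" a movie. The function name `genre_counts_by_year` and the input shapes are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def genre_counts_by_year(ratings, movie_genres):
    """Count ratings per (year, genre).

    ratings: iterable of (movie_id, unix_timestamp) pairs.
    movie_genres: dict mapping movie_id to a '|'-separated genre string.
    Both shapes are assumptions made for this sketch.
    """
    counts = Counter()
    for movie_id, timestamp in ratings:
        # Interpret the timestamp as UTC so results do not depend on locale
        year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
        for genre in movie_genres.get(movie_id, "").split("|"):
            if genre:  # ignore movies with no known genres
                counts[(year, genre)] += 1
    return counts


# Made-up sample data for illustration
genres = {1: "Comedy|Drama", 2: "Drama"}
ratings = [(1, 978300760), (2, 978300760), (1, 1000000000)]
timeline = genre_counts_by_year(ratings, genres)
```

The resulting counts can be sorted by year to see how the share of each genre changes over time.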
6. Develop a visualization to show the model/patterns in the data
Tech Details
- Node: as7
- Path to storage space: /scratch/big-data/mobeen
Results
- The visualization(s)
- The story