Leif-big-data

From Earlham CS Department

Latest revision as of 12:43, 12 December 2011

  • Project title: Stories in Words
  • Project data set: Google Ngrams - 1gram (English)
Project Tasks
  1. Identifying and downloading the target data set
    • This project uses the Google Ngrams - 1gram (English) data set, CSV files 0-10, which can be downloaded from Google Books at http://books.google.com/ngrams/datasets
  2. Data cleaning and pre-processing
    • The values in the raw CSV files are separated by tabs, so I used a script to replace the tabs with commas: tr '\t' ',' < input_file.csv > output_file.csv
  3. Load the data into your Postgres instance
    • I used a script that, when piped into Postgres, drops any existing tables, creates the tables, copies the data in, and then indexes the tables.
  4. Develop queries to explore your ideas in the data
    • I wrote a script that queries my database for the data I specify; it is included in my shared directory.
  5. Develop and document the model function you are exploring in the data
    • Exploring what stories I can tell by graphing key words over time
  6. Develop a visualization to show the model/patterns in the data
    • I have included a Keynote presentation in my public directory.
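
Steps 2 and 3 above could be sketched in Python roughly as follows. This is not the original script, only an illustration under assumptions. The function names, the table name, and the column layout (ngram, year, match_count, page_count, volume_count) are hypothetical and should be checked against the downloaded files. Unlike the plain tr command, the csv module quotes any field that already contains a comma, so such rows are not split into extra columns.

```python
import csv


def tabs_to_commas(src, dst):
    """Copy tab-separated rows from src to dst as comma-separated rows.

    src and dst are open text file objects. Returns the number of rows copied.
    """
    # QUOTE_NONE: treat quote characters in the raw data as ordinary text.
    reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The writer quotes a field only when it contains a comma or a quote.
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


def build_load_sql(table, csv_path):
    """Return a script for psql: drop, create, copy in, then index.

    The column names and types are assumptions about the 1-gram layout,
    and the table name is whatever the caller chooses.
    """
    statements = [
        f"DROP TABLE IF EXISTS {table};",
        f"CREATE TABLE {table} ("
        "ngram text, year integer, match_count bigint, "
        "page_count bigint, volume_count bigint);",
        # \copy is a psql meta-command that reads the file on the client side.
        f"\\copy {table} FROM '{csv_path}' WITH (FORMAT csv)",
        f"CREATE INDEX {table}_ngram_idx ON {table} (ngram);",
    ]
    return "\n".join(statements) + "\n"
```

One possible use is to open each downloaded file, pass it through tabs_to_commas, and then print the result of build_load_sql and pipe that output into psql. Indexing after the copy, as in the list above, generally avoids updating the index row by row during the load.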
Tech Details
  • Node: as6
  • Path to storage space: local machine
  • Path to project files: ~lnulric09/public/big_data/
Results
  • The visualization(s)
  • The story