Tristan-big-data
Latest revision as of 11:30, 3 December 2011
- Examining Trends in a Performance Sport
- Data set: WCA Database
Question
- What can we discover about how people improve in a field over time? To explore this I looked through the World Cube Association (WCA) database, which holds tens of thousands of Rubik's cube solves by thousands of people.
Project Tasks
- Identifying and downloading the target data set
- The WCA data set was easy to obtain as a set of SQL inserts. The file can be downloaded from http://worldcubeassociation.org/results/misc/export.html.
- Data cleaning and pre-processing
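- The .sql file was not written for Postgres (I took it to be MS-SQL or Oracle SQL, although the backtick-quoted names and smallint(n) types are MySQL syntax), so some mass modifications had to be made to the file. The main ones were changing smallint(n) to int and removing the backticks around `tablename`.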
- Load the data into your Postgres instance
- It took a few attempts to get everything in the script working, but the script eventually ran successfully in my directory on BigFe.
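The cleaning step described above can be sketched in a few lines of Python. This is only an illustration of the two edits named in the text, not the exact procedure I used: the function name `mysql_to_postgres` and the sample statement are made up for the example, and it assumes that no backticks appear inside the data values themselves.

```python
import re


def mysql_to_postgres(sql: str) -> str:
    """Apply the two mass edits from the text to the contents of a dump."""
    # Remove the backticks around identifiers: `Results` -> Results.
    # Assumption: backticks never occur inside the inserted data.
    sql = sql.replace("`", "")
    # Replace smallint(n) with int for any display width n.
    sql = re.sub(r"\bsmallint\(\d+\)", "int", sql, flags=re.IGNORECASE)
    return sql


# Hypothetical sample statement in the style of the export file.
sample = "CREATE TABLE `Results` (`pos` smallint(6) NOT NULL);"
print(mysql_to_postgres(sample))
```

For a real dump the same function could be applied to the whole file's text before the result is fed to Postgres. Other dialect differences may still need manual fixes.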
SQL Queries
The data was gathered with queries identical or close to the ones below.
- People with more than 100 3x3 averages of 5.
SELECT personname, count(average) FROM results
WHERE eventid = '333' GROUP BY personname HAVING count(average) > 100
ORDER BY count(average)
- Average of the nth solve in an average of 5. An average or solve with a value less than 0 is a did not finish or a did not start, so the query is a little long in order to keep those times out of the averages.
SELECT avg(value1) AS one, avg(value2) AS two, avg(value3) AS three, avg(value4) AS four, avg(value5) AS five FROM results
WHERE eventid = '333' AND formatid = 'a' AND average > 0
AND value1 > 0 AND value2 > 0 AND value3 > 0 AND value4 > 0 AND value5 > 0;
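To see why the extra conditions matter, the nth-solve query can be tried on a tiny made-up table. The sketch below uses Python's built-in sqlite3 module in place of Postgres, and the three rows are invented for illustration, with -1 standing in for a did not finish as an assumption consistent with the "less than 0" rule above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (personname TEXT, eventid TEXT, formatid TEXT, "
    "average INT, value1 INT, value2 INT, value3 INT, value4 INT, value5 INT)"
)

# Invented rows, times in hundredths of a second.
rows = [
    ("A", "333", "a", 1200, 1300, 1200, 1100, 1250, 1150),
    ("B", "333", "a", 1400, 1500, 1400, 1300, 1450, 1350),
    # Third solve is a DNF (-1), so the filter drops this whole row.
    ("C", "333", "a", 1600, 1700, 1600, -1, 1650, 1500),
]
conn.executemany("INSERT INTO results VALUES (?,?,?,?,?,?,?,?,?)", rows)

query = """
SELECT avg(value1), avg(value2), avg(value3), avg(value4), avg(value5)
FROM results
WHERE eventid = '333' AND formatid = 'a' AND average > 0
  AND value1 > 0 AND value2 > 0 AND value3 > 0 AND value4 > 0 AND value5 > 0
"""
per_solve_means = conn.execute(query).fetchone()
print(per_solve_means)  # (1400.0, 1300.0, 1200.0, 1350.0, 1250.0)
```

Without the value conditions, the -1 in row C would be averaged in as if it were a very fast solve and pull the third column's mean far below the real times.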
- Develop and document the model function you are exploring in the data
- Develop a visualization to show the model/patterns in the data
Tech Details
- Node: as3
- Path to storage space: /scratch/big-data/tristan
Results and Discussion
- The visualization(s)
From the results gathered we can only begin to extrapolate two things:
- If you have multiple tries at a particular task in which you want to do well, the first will be the worst and the last will be the best.
- The younger you start something, the more you'll be able to improve over time, whereas if you start later in life you are less likely to improve as drastically, as quickly, or as much.
The two tidbits extrapolated above are by no means proven; they are only patterns spotted in a cubing database. To examine this further, other performance sports should be investigated.