Difference between revisions of "Cluster Information"

From Earlham CS Department
((Old) To Do)
 
(35 intermediate revisions by 5 users not shown)
Line 1: Line 1:
 +
== Summer 2011 (draft) ==
 +
Calendar: May 9th through August 19th
 +
 +
Events:
 +
* Undergraduate Petascale Institute @ NCSA - 29 May through 11 June (tentative)
 +
* Intermediate Parallel Programming and Distributed Computing workshop @ OU - 30 July through 6 August (tentative)
 +
 +
Projects:
 +
* LittleFe Build-out @ Earlham - n weeks of x people's time
 +
* New cluster assembly - n weeks of x people's time
 +
* Petascale project work - n weeks of x people's time
 +
 +
Personal Schedules:
 +
* Charlie: Out 3-12 June (beach),
 +
* Ivan:
 +
* Aaron:
 +
* Fitz:
 +
* Mobeen:
 +
* Brad:
 +
 +
== Al-Salam/CCG Downtime tasks ==
 +
* LDAP Server migration from bs-new -> hopper
 +
* yum update over all nodes
 +
* Turn HT off
 +
* PVFS
 +
* PBS server on Hopper
 +
 +
== How to use the PetaKit ==
 +
#  If you do not already have it, obtain the source for the PetaKit from the CVS repository on hopper (curriculum-modules/PetaKit).
 +
#  cd to the Subkits directory of PetaKit and run the area-subkit.sh to make an area subkit tarball or GalaxSee-subkit.sh to make a GalaxSee subkit tarball.
 +
#  scp the tarball to the target resource and unpack it.
 +
#  cd into the directory and run ./configure --with-mpi --with-openmp
 +
#  Use stat.pl and args_man to make an appropriate statistics run.  See args_man for a description of predicates.  Example:
 +
perl -w stat.pl --program area --style serial,mpi,openmp,hybrid --scheduler lsf --user leemasa --problem_size 200000000000
 +
--processes 1,2,3,4,5,6,7,8-16-64 --repetitions 10 -m -tag Sooner-strongest-newest --mpirun mpirun.lsf --ppn 8
 +
[[Modifying Programs for Use with PetaKit]]
 +
 
== Current To-Do ==
 
Date represents last meeting where we discussed the item
  
* Writeups for Jeff Krause  (3/Nov/09)
+
* Brad's Graphing Tool  (28/Feb/10)
* Brad's Graphing Tool  (19/Oct/09)
+
   * Nice new functionality, see
   * 2 y-axis scales (runtime and problem size)
+
  * Waiting on clean data to finish multiple resource displays
   * error bars for left and right y-axes with checkboxes for each
+
   * Error bars for left and right y-axes with checkboxes for each
  * eps file for the poster
+
* TeraGrid Runs  (28/Feb/10)
* TeraGrid Runs  (1/Dec/09)
+
 
  * Check new GalaxSee on BobSCEd
+
In first box:  Put initial of who is doing run
  * Update configure.ac based on Fitz's notes from Kraken
+
 
  * Status on pople?
+
In second box:  B = builds, R = runs, D = reports back to database, S = there is a good set of runs (10 per data point) for strong scaling in the database that appear on a graph, W = there is a good set of runs (10 per data point) for weak scaling in the database that appear on a graph
  * Remember Big Red has a debug queue
+
{| class="wikitable" border="1"
  * GalaxSee -- after stats changes, GalaxSee slowed down by a lot.
+
! rowspan="2" |
    * Double check on BobSCEd, find the algorithm problem
+
! colspan="8" | area under curve
* New LittleFe Boards  (19/Oct/09)
+
! colspan="8" | GalaxSee
* New cluster (9/Dec/09)
+
|-
   * Better to not spend $$$ on next-day business response.  Losing a node is not a big deal.
+
!  colspan="2" | Serial
   * BobSCEd already has IB.  Any value to having CUDA and IB in the same nodes?
+
!  colspan="2" | MPI
* BCCD Testing  (9/Dec/09)
+
!  colspan="2" | OpenMP
 +
!  colspan="2" | Hybrid
 +
!  colspan="2" | Serial
 +
!  colspan="2" | MPI
 +
!  colspan="2" | OpenMP
 +
!  colspan="2" | Hybrid
 +
|-
 +
! [http://cs.earlham.edu/~amweeden06/memory-heap-summary.pdf ACLs]
 +
|  Sam
 +
 +
|  Sam
 +
 +
|  Sam
 +
 +
|  Sam
 +
|
 +
|  AW
 +
 +
|  AW
 +
 +
|  AW
 +
 +
|  AW
 +
|
 +
|-
 +
! [http://cs.earlham.edu/~amweeden06/memory-heap-summary.pdf BobSCEd]
 +
 +
 +
 +
 +
|
 +
 +
 +
|
 +
 +
 +
 +
 +
|
 +
 +
 +
|
 +
|-
 +
!  [http://rc.uits.iu.edu/kb/index.php?kbID=aueo BigRed]
 +
| Sam
 +
 +
| Sam
 +
 +
| Sam
 +
|
 +
| Sam
 +
|
 +
|
 +
 +
|
 +
 +
|
 +
|
 +
|
 +
|
 +
|-
 +
! [http://www.oscer.ou.edu/resources.php#sooner Sooner]
 +
| Sam
 +
 +
|  Sam
 +
 +
|Sam
 +
|
 +
|  Sam
 +
|
 +
|
 +
 +
|
 +
 +
|
 +
|
 +
|
 +
|
 +
|-
 +
! [http://www.psc.edu/machines/sgi/altix/pople.php#hardware pople]
 +
|
 +
|
 +
|
 +
|
 +
|
 +
|
 +
|
 +
|
 +
| AW/CP
 +
 +
| AW/CP
 +
 +
| AW/CP
 +
|
 +
| AW/CP 
 +
|
 +
|}
 +
 
 +
Problem-sizes
 +
*Big Red: 750000000000
 +
*Sooner: 200000000000
 +
*BobSCEd: 18000000000
 +
*ACL: 18000000000
 +
 
 +
 
 +
* New cluster (5/Feb/10)
 +
  * [https://wiki.cs.earlham.edu/index.php/Al-salam wiki page]
 +
  * Decommission Cairo
 +
   * Figure out how to mount on Telco Rack
 +
   * Get pdfs of all materials -- post them on wiki
 +
* BCCD Testing  (5/Feb/10)
 
   * Get Fitz's liberation instructions into wiki
 
   * Get Kevin's VirtualBox instructions into wiki
 
   * pxe booting -- see if they booted, if you can ssh to them, if the run matrix works
  * Test all boot options
 
  * CUDA LittleFe is now online -- test CUDA boot options
 
   * Send /etc/bccd-revision with each email
 
   * Send output of netstat -rn and /sbin/ifconfig -a with each email
 
   * [http://bccd.net/ver3/wiki/index.php/Tests Run Matrix]
 
   * For the future:  scripts to boot & change bios, watchdog timer, 'test' mode in bccd, send emails about errors
 +
  * USB scripts -- we don't need the "copy" script
 +
* SIGCSE Conference -- March 10-13 (28/Feb/10)
 +
  * Leaving 8:00 Wednesday
 +
  * Brad, Sam, or Gus pick up the van around 7, bring it by loading dock outside Noyes
 +
  * Posters -- new area runs for graphs, start implementing stats collection and OpenMP, print at small size (what is that?)
 +
  * Take 2 LittleFes, small switch, monitor/kyb/mouse (wireless), printed matter
 +
* Spring Cleaning (Noyes Basement) (5/Feb/10)
 +
  * Next meeting: Saturday 6/Feb @ 3 pm
 +
 +
== Generalized, Modular Parallel Framework ==
 +
 +
=== 10,000 foot view of problems ===
 +
{| class="wikitable" border="1"
 +
|+ this conceptual view may not reflect current code
 +
|-
 +
! || Parent Process Sends Out || Children Send Back || Results Compiled By
 +
|-
 +
! Area
 +
|function, bounds, segment size or count || sum of area for specified bounds || sum
 +
|-
 +
! GalaxSee
 +
| complete array of stars, bounds (which stars to compute) || an array containing the computed stars|| construct a new array of stars and repeat for next time step
 +
|-
 +
! Matrix x Matrix
 +
| n rows from Matrix A and n columns from Matrix B, location of rows and cols || n resulting matrix position values, their location in results matrix || construct new result array
 +
|}
 +
 +
=== Visualizing Parallel Framework ===
 +
http://cs.earlham.edu/~carrick/parallel/parallelism-approaches.png
 +
 +
=== Parallel Problem Space ===
 +
* Dwarf (algorithm family)
 +
* Style of parallelism (shared, distributed, GPGPU, hybrid)
 +
* Tiling (mapping problem to work units to workers)
 +
* Distribution algorithm (getting work units to workers)
  
 
== Summer of Fun (2009) ==
Line 64: Line 244:
 
What works so far?  B = builds, R = runs, W = works
 
{| class="wikitable" border="1"
 
{| class="wikitable" border="1"
! rowspan="2" | B-builds, R-runs
+
! rowspan="2" |  
 
! colspan="4" | area under curve
 
! colspan="4" | area under curve
 
! colspan="4" | GalaxSee (standalone)
 
! colspan="4" | GalaxSee (standalone)
Line 83: Line 263:
 
|  BRW
 
|  BRW
 
|   
 
|   
BRW
+
BR
 
|   
 
|   
 
|
 
|
Line 93: Line 273:
 
|  BRW
 
|  BRW
 
|
 
|
BRW
+
BR
 
|   
 
|   
 
|
 
|
Line 103: Line 283:
 
|
 
|
 
|
 
|
 +
|  BR
 +
|
 +
|
 +
|-
 +
!  BigRed
 +
|  BRW
 +
|  BRW
 +
|  BRW
 
|  BRW
 
|  BRW
 
|
 
|
 +
|
 +
 +
|
 +
|-
 +
!  Sooner
 +
|  BRW
 +
|  BRW
 +
|  BRW
 +
|  BRW
 +
|
 +
|
 +
 
|
 
|
 
|-
 
|-
Line 123: Line 323:
 
|
 
|
 
|
 
|
BRW
+
BR
 
|
 
|
 
|
 
|
Line 249: Line 449:
 
== Items Particular to a Specific Cluster ==
 
* [[ACL Cluster|ACL]]
 +
* [[Al-salam|Al-Salam]]
 
* [[Athena Cluster|Athena]]
 
* [[Bazaar Cluster|Bazaar]]
Line 255: Line 456:
  
 
== Curriculum Modules ==
 +
* [[Cluster:Gprof|gprof]] - statistical source code profiler
 
* [[Cluster:Curriculum|Curriculum]]
 
* [[Cluster:Fluid Dynamics|Fluid Dynamics]]
Line 260: Line 462:
 
* [[Cluster:GROMACS Web Interface|GROMACS Web Interface]]
 
* [[Cluster:Wiki|Wiki Life for Academics]]
 +
* [[Cluster:PetaKit|PetaKit]]
  
 
== Possible Future Projects ==

Latest revision as of 07:29, 11 February 2011

Summer 2011 (draft)

Calendar: May 9th through August 19th

Events:

  • Undergraduate Petascale Institute @ NCSA - 29 May through 11 June (tentative)
  • Intermediate Parallel Programming and Distributed Computing workshop @ OU - 30 July through 6 August (tentative)

Projects:

  • LittleFe Build-out @ Earlham - n weeks of x people's time
  • New cluster assembly - n weeks of x people's time
  • Petascale project work - n weeks of x people's time

Personal Schedules:

  • Charlie: Out 3-12 June (beach)
  • Ivan:
  • Aaron:
  • Fitz:
  • Mobeen:
  • Brad:

Al-Salam/CCG Downtime tasks

  • LDAP Server migration from bs-new -> hopper
  • yum update over all nodes
  • Turn HT off
  • PVFS
  • PBS server on Hopper

How to use the PetaKit

  1. If you do not already have it, check out the PetaKit source from the CVS repository on hopper (curriculum-modules/PetaKit).
  2. cd to the Subkits directory of PetaKit and run area-subkit.sh to make an area subkit tarball, or GalaxSee-subkit.sh to make a GalaxSee subkit tarball.
  3. scp the tarball to the target resource and unpack it.
  4. cd into the unpacked directory and run ./configure --with-mpi --with-openmp
  5. Use stat.pl and args_man to set up an appropriate statistics run. See args_man for a description of predicates. Example:
perl -w stat.pl --program area --style serial,mpi,openmp,hybrid --scheduler lsf --user leemasa --problem_size 200000000000 \
  --processes 1,2,3,4,5,6,7,8-16-64 --repetitions 10 -m -tag Sooner-strongest-newest --mpirun mpirun.lsf --ppn 8
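
When the same statistics run has to be repeated on several resources, it can help to assemble the stat.pl command line from a small table of settings. The sketch below is illustrative only: build_stat_command is a hypothetical helper that is not part of PetaKit, the option names and values are copied from the Sooner example above, and it assumes stat.pl does not care about option order. The command is printed, not executed.

```python
import shlex


def build_stat_command(options, flags=()):
    """Return an argv list for stat.pl (hypothetical helper, not part of PetaKit)."""
    argv = ["perl", "-w", "stat.pl"]
    for name, value in options.items():
        argv += ["--" + name, str(value)]
    # Flags that do not follow the "--name value" pattern are appended verbatim.
    argv += list(flags)
    return argv


# Values copied from the Sooner example above.
sooner = {
    "program": "area",
    "style": "serial,mpi,openmp,hybrid",
    "scheduler": "lsf",
    "user": "leemasa",
    "problem_size": 200000000000,
    "processes": "1,2,3,4,5,6,7,8-16-64",
    "repetitions": 10,
    "mpirun": "mpirun.lsf",
    "ppn": 8,
}

command = build_stat_command(sooner, flags=["-m", "-tag", "Sooner-strongest-newest"])
print(shlex.join(command))
```

Keeping one dictionary per resource (Big Red, Sooner, BobSCEd, ACL) would make the differing problem sizes and schedulers easy to compare side by side.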

Modifying Programs for Use with PetaKit

Current To-Do

Date represents last meeting where we discussed the item

  • Brad's Graphing Tool (28/Feb/10)
    • Nice new functionality, see
    • Waiting on clean data to finish multiple resource displays
    • Error bars for left and right y-axes with checkboxes for each
  • TeraGrid Runs (28/Feb/10)

In first box: Put initial of who is doing run

In second box: B = builds, R = runs, D = reports back to database, S = there is a good set of runs (10 per data point) for strong scaling in the database that appear on a graph, W = there is a good set of runs (10 per data point) for weak scaling in the database that appear on a graph

area under curve GalaxSee
Serial MPI OpenMP Hybrid Serial MPI OpenMP Hybrid
ACLs Sam Sam Sam Sam AW AW AW AW
BobSCEd
BigRed Sam Sam Sam Sam
Sooner Sam Sam Sam Sam
pople AW/CP AW/CP AW/CP AW/CP

Problem-sizes

  • Big Red: 750000000000
  • Sooner: 200000000000
  • BobSCEd: 18000000000
  • ACL: 18000000000


  • New cluster (5/Feb/10)
    • wiki page
    • Decommission Cairo
    • Figure out how to mount on Telco Rack
    • Get PDFs of all materials and post them on the wiki
  • BCCD Testing (5/Feb/10)
    • Get Fitz's liberation instructions into the wiki
    • Get Kevin's VirtualBox instructions into the wiki
    • PXE booting: see if they booted, if you can ssh to them, if the run matrix works
    • Send /etc/bccd-revision with each email
    • Send output of netstat -rn and /sbin/ifconfig -a with each email
    • Run Matrix
    • For the future: scripts to boot & change BIOS, watchdog timer, 'test' mode in bccd, send emails about errors
    • USB scripts: we don't need the "copy" script
  • SIGCSE Conference, March 10-13 (28/Feb/10)
    • Leaving 8:00 Wednesday
    • Brad, Sam, or Gus picks up the van around 7 and brings it by the loading dock outside Noyes
    • Posters: new area runs for graphs, start implementing stats collection and OpenMP, print at small size (what is that?)
    • Take 2 LittleFes, small switch, monitor/keyboard/mouse (wireless), printed matter
  • Spring Cleaning (Noyes Basement) (5/Feb/10)
    • Next meeting: Saturday 6/Feb @ 3 pm

Generalized, Modular Parallel Framework

10,000 foot view of problems

{| class="wikitable" border="1"
|+ this conceptual view may not reflect current code
|-
! || Parent Process Sends Out || Children Send Back || Results Compiled By
|-
! Area
| function, bounds, segment size or count || sum of area for specified bounds || sum
|-
! GalaxSee
| complete array of stars, bounds (which stars to compute) || an array containing the computed stars || construct a new array of stars and repeat for next time step
|-
! Matrix x Matrix
| n rows from Matrix A and n columns from Matrix B, location of rows and cols || n resulting matrix position values, their location in results matrix || construct new result array
|}
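
The Area row of this view can be sketched in a few lines. This is not the PetaKit code: the function names are hypothetical, the midpoint rule is an assumed integration method, and Python threads stand in for the MPI ranks or OpenMP threads a real run would use. The parent splits the bounds, each child returns the area for its slice, and the parent sums the results.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_area(func, lower, upper, segments):
    """Child: midpoint-rule area of func over [lower, upper]."""
    width = (upper - lower) / segments
    return sum(func(lower + (i + 0.5) * width) for i in range(segments)) * width


def parallel_area(func, lower, upper, workers=4, segments_per_worker=1000):
    """Parent: split the bounds, hand one slice to each child, sum what comes back."""
    span = (upper - lower) / workers
    bounds = [(lower + k * span, lower + (k + 1) * span) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(partial_area, func, lo, hi, segments_per_worker)
            for lo, hi in bounds
        ]
        return sum(f.result() for f in futures)


def square(x):
    return x * x


# Area under x^2 on [0, 1]; the exact value is 1/3.
print(parallel_area(square, 0.0, 1.0))
```

What each child receives (function, bounds, segment count) and sends back (one partial sum) matches the three columns of the Area row.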

Visualizing Parallel Framework

http://cs.earlham.edu/~carrick/parallel/parallelism-approaches.png

Parallel Problem Space

  • Dwarf (algorithm family)
  • Style of parallelism (shared, distributed, GPGPU, hybrid)
  • Tiling (mapping problem to work units to workers)
  • Distribution algorithm (getting work units to workers)
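
To make the last two axes concrete, here is a minimal sketch that treats tiling and distribution as separate steps for a matrix product. Everything in it is an assumption for illustration: the function names are hypothetical, work units are blocks of rows only (simpler than the rows-and-columns split described above), and round-robin is just one possible distribution algorithm.

```python
def tile_rows(n_rows, rows_per_unit):
    """Tiling: cut the row range [0, n_rows) into (start, stop) work units."""
    return [
        (start, min(start + rows_per_unit, n_rows))
        for start in range(0, n_rows, rows_per_unit)
    ]


def distribute_round_robin(units, n_workers):
    """Distribution: deal work units to workers in turn."""
    assignment = {worker: [] for worker in range(n_workers)}
    for i, unit in enumerate(units):
        assignment[i % n_workers].append(unit)
    return assignment


def multiply_rows(a, b, start, stop):
    """Worker: compute rows start..stop-1 of the product A x B."""
    columns = list(zip(*b))
    return [
        [sum(x * y for x, y in zip(row, col)) for col in columns]
        for row in a[start:stop]
    ]


a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 0], [0, 1]]  # identity, so the product should equal a

units = tile_rows(len(a), rows_per_unit=1)
plan = distribute_round_robin(units, n_workers=2)

# The parent places each returned block at its location in the result matrix.
result = [None] * len(a)
for worker, assigned in plan.items():
    for start, stop in assigned:
        result[start:stop] = multiply_rows(a, b, start, stop)

print(plan)
print(result)
```

Because the tiling function knows nothing about workers and the distribution function knows nothing about matrices, either one can be swapped without touching the other.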

Summer of Fun (2009)

An external doc for GalaxSee
Documentation for OpenSim GalaxSee

What's in the database?

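{| class="wikitable" border="1"
!
! colspan="3" | GalaxSee (MPI)
! colspan="3" | area-under-curve (MPI, openmpi)
! colspan="3" | area-under-curve (Hybrid, openmpi)
|-
!
! acl0-5 || bs0-5 GigE || bs0-5 IB || acl0-5 || bs0-5 GigE || bs0-5 IB || acl0-5 || bs0-5 GigE || bs0-5 IB
|-
! np X-XX
| 2-20 || 2-48 || 2-48 || 2-12 || 2-48 || 2-48 || 2-20 || 2-48 || 2-48
|}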

What works so far? B = builds, R = runs, W = works

area under curve GalaxSee (standalone)
Serial MPI OpenMP Hybrid Serial MPI OpenMP Hybrid
acls BRW BRW BRW BRW BR
bobsced0 BRW BRW BRW BRW BR
c13 BR
BigRed BRW BRW BRW BRW
Sooner BRW BRW BRW BRW
pople
Charlie's laptop BR

To Do

  • Fitz/Charlie's message
  • Petascale review
  • BobSCEd stress test

Implementations of area under the curve

  • Serial
  • OpenMP (shared)
  • MPI (message passing)
  • MPI (hybrid mp and shared)
  • OpenMP + MPI (hybrid)

GalaxSee Goals

  • Good piece of code, serves as teaching example for n-body problems in petascale.
  • Dials, knobs, etc. in place to easily control how work is distributed when running in parallel.
  • Architecture generally supports hybrid model running on large-scale constellations.
  • Produces runtime data that enables nice comparisons across multiple resources (scaling, speedup, efficiency).
  • Render in BCCD, metaverse, and /dev/null environments.
  • Serial version
  • Improve performance on math?
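
The cross-resource comparisons mentioned in the goals (scaling, speedup, efficiency) come down to two ratios over wall-clock times. The sketch below uses the standard definitions; the function names are hypothetical and the timings are made-up placeholders, not measurements from any of the resources listed on this page.

```python
def speedup(serial_time, parallel_time):
    """Speedup: how many times faster the parallel run is than the serial run."""
    return serial_time / parallel_time


def efficiency(serial_time, parallel_time, processes):
    """Efficiency: speedup divided by the number of processes used."""
    return speedup(serial_time, parallel_time) / processes


# Made-up wall times in seconds, keyed by process count (not measured data).
wall_times = {1: 100.0, 2: 52.0, 4: 27.0, 8: 15.0}

serial = wall_times[1]
for processes, seconds in sorted(wall_times.items()):
    print(
        processes,
        round(speedup(serial, seconds), 2),
        round(efficiency(serial, seconds, processes), 2),
    )
```

For strong scaling the problem size stays fixed as the process count grows, so these ratios apply directly; for weak scaling the problem size grows with the process count, and the wall time itself is the quantity to compare.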

GalaxSee - scale to petascale with MPI and OpenMP hybrid.

  • GalaxSee - render in-world and steer from in-world.
  • Area under a curve - serial, MPI, and OpenMP implementations.
  • OpenMPI - testing, performance.
  • Start May 11th

LittleFe

  • Testing
  • Documentation
  • Touch screen interface

Notes from May 21, 2009 Review

  • Combined Makefiles with defines to build on a particular platform
  • Write a driver script for GalaxSee, à la the area under the curve script; consider combining the two
  • Schema
    • date, program_name, program_version, style, command line, compute_resource, NP, wall_time
  • Document the process from start to finish
  • Consider how we might iterate over e.g. number of stars, number of segments, etc.
  • Command line option to stat.pl that provides a Torque wrapper for the scripts.
  • Lint all code, consistent formatting
  • Install latest and greatest Intel compiler in /cluster/bobsced
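
The schema item from the review can be sketched as a single table. This is an assumption-laden illustration, not the group's actual database: the table name runs, the column types, and the use of SQLite are all hypothetical; only the column names come from the notes above. The inserted rows are placeholders, not real results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names follow the review notes; the types are assumed.
conn.execute(
    """
    CREATE TABLE runs (
        date             TEXT,
        program_name     TEXT,
        program_version  TEXT,
        style            TEXT,
        command_line     TEXT,
        compute_resource TEXT,
        np               INTEGER,
        wall_time        REAL
    )
    """
)

# Placeholder rows: two repetitions of the same configuration.
placeholder_rows = [
    ("2009-05-21", "area", "placeholder", "mpi", "placeholder", "bobsced", 8, 12.0),
    ("2009-05-21", "area", "placeholder", "mpi", "placeholder", "bobsced", 8, 13.0),
]
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?, ?)", placeholder_rows)

# Average wall time per resource and process count, the kind of data a graph needs.
rows = conn.execute(
    """
    SELECT compute_resource, np, AVG(wall_time)
    FROM runs
    GROUP BY compute_resource, np
    """
).fetchall()
print(rows)
```

Storing one row per repetition, rather than a pre-computed average, keeps enough data to draw error bars later.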

BobSCEd Upgrade

Build a new image for BobSCEd:

  1. One of the Suse versions supported for Gaussian09 on EM64T [v11.1] - Red Hat Enterprise Linux 5.3; SuSE Linux 9.3, 10.3, 11.1; or SuSE Linux Enterprise 10 (see G09 platform list) <-- CentOS 5.3 runs Gaussian binaries for RHEL ok
  2. Firmware update?
  3. C3 tools and configuration [v4.0.1]
  4. Ganglia and configuration [v3.1.2]
  5. PBS and configuration [v2.3.16]
  6. /cluster/bobsced local to bs0
  7. /cluster/... passed-through to compute nodes
  8. Large local scratch space on each node
  9. Gaussian09
  10. WebMO and configuration [v9.1] - Gamess, Gaussian, Mopac, Tinker
  11. Infiniband and configuration
  12. GNU toolchain with OpenMPI and MPICH [GCC v4.4.0], [OpenMPI v1.3.2] [MPICH v1.2.7p1]
  13. Intel toolchain with OpenMPI and native libraries
  14. Sage with do-dads (see Charlie)
  15. Systemimager for the client nodes?

Installed:

Fix the broken nodes.

(Old) To Do

BCCD Liberation

  • v1.1 release - upgrade procedures

Curriculum Modules

  • POVRay
  • GROMACS
  • Energy and Weather
  • Dave's math modules
  • Standard format, templates, how-to for V and V

LittleFe

Infrastructure

  • Masa's GROMACS interface on Cairo
  • gridgate configuration, Open Science Grid peering
  • hopper'

SC Education

Current Projects

Past Projects

General Stuff

Items Particular to a Specific Cluster

Curriculum Modules

Possible Future Projects

Archive