Difference between revisions of "Cluster:Todo"


Revision as of 13:36, 21 September 2005

(Need a notation for relative priority. Please don't delete anything unless we're updating this during a meeting.)


Current Items (updated September 21, 2005)

  • Little-Fe
    • Fossilize BCCD onto Little-Fe, making progress, see BCCD/PPC wiki for details (Kevin and Toby)
    • Measure load with 4 nodes, 2 disk drives, CD drive; order transformer (Charlie)
  • LLK (see Cluster:LowLatency for the details)
    • Read the STP paper, emulate his test methodology/program? (Everyone)
    • Is there a 2.6 version of STP? (Only from us.) SGI? (Not likely.) (Skylar)
    • Measure latency in kernel and on wire using either kperf or tp_timer - (Alex and Toby)
      • Test accuracy by loading one of the nodes with CPU and disk traffic (lots)
      • Set up and document this, along with kernel building/loading/starting, so that any of us can make a change and a measurement.
      • Measure packet loss rate at each node and the switches and hopper using SNMP/Cricket (Skylar)
      • Measure bit error rate (Skylar)
      • Use a structure/array for tp_ routines (Alex and Toby)
    • No word from MAGNET on 2.6.x
    • Real-time kernels and TCP/IP book _late next week_.
    • Update the list of references, llk.bib (Everyone)
    • Literature search (Skylar, JoshM)
      • Document all sysctls, build options, etc. that affect network performance, particularly latency
      • Look for papers, etc. that discuss network performance/latency under Linux 2.6.x. For each paper, write a paragraph describing what's useful and what we should try, add a BibTeX entry, and put an electronic copy in /cluster/project/llk/references.
  • Folding@Clusters
    • Build a 3.1.4 CVS export with instructions and park it in ~pande for Guha (JoshM)
    • Test a range of molecules, clusters, and sizes with Alex's scripts and PBS/Maui (Skylar)
  • Plumbing
    • on bazaar: ccache & distcc with wiki howto (Skylar, done)
    • on cairo: distcc (currently installed but not running) with wiki howto (Skylar, done)
    • Nate Bass's problems (JoshM)
  • Other
    • Dinner at the Ranch on September 30th (Everyone, including SOs)
    • There may be a student program at the Oberlin Computational Science conference (November 4-6)
    • Merck/AAAS Poster Session. Wednesday October 26th. Folding@Clusters and Little-Fe. We need to remember how to do smallish posters. (JoshM)

Aug 26, 2005 Meeting Minutes

  • Charlie, Toby, Alex (3 credits), Kevin (2 credits), Josh, Skylar (3 credits)
  • SIAM PP06
    • September 30, 2005 abstracts due, conference is February 22-24.
    • Low Latency Kernel
      • Collect papers, read, discuss next Wednesday. Wiki entry, papers, Beowulf posts, kernel source, in central place. Literature search, scholar.google.com, kernel development How-Tos, kernel mailing list, Beowulf mailing list, Toby to ask.
    • Little-Fe
      • BCCD with scripts to do mods for diskless booting
      • Write-up with pedagogical stuff and curriculum modules (list-packages)
  • Other
    • Clean and organize Wiki
    • Clean and organize Recompute/CCG lab

Plumbing

  • Get PBS/Maui working on bazaar (done), cairo (done), athena (done), and ACL (done); write wiki pages describing usage and setup; set up node0 exclude lists for startup files, or figure out the particular files per machine with SystemImager. We can't get Lori's molecular system to parallelize even outside of PBS/Maui; Josh can give us a pointer to a test suite molecule that will parallelize (Skylar, done)
  • Set up AMANDA, follow up with Dan (Skylar)
    • Finish setting up weekly tape backups with AMANDA (Skylar). Stalled on broken tape drive.
  • Fiber uplink for bazaar (Charlie)
  • Return GBIC module (Alex)
  • Cool Athena in the display cabinet
  • Construct shelving for Athena
  • Figure out why CVS commit emails don't always appear
  • Set up POVray on Athena (Skylar)
  • Protect F@C source and molecular systems, open http, ftp?, ssh? at cluster.earlham.edu (JoshM and Skylar)
    • Protect folding-at-clusters/articles/dr-dobbs (Charlie)
    • Can the script find out (via an environment variable?) where the checkout went so that it can protect those files?
  • Updates to http://cluster.earlham.edu
    • add link to cluster wiki page
    • lose the RSS feed, leave a single link to MT, last MT entry
    • news - siam posters, bccd.cs.uni.edu, others?
    • General and Resources link sets horizontal instead of vertical.
    • Add link to Resources called documents (add static link to last MT entry and link to wiki)
    • Update weatherduck link
    • Add link to wiki doc under tools
    • Update preset query link
    • Overview and Press prose update (Charlie)
  • Update speedup and speedup/efficiency within DVT for endnodes (Alex)
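
For reference when updating the DVT numbers, the usual definitions are speedup = T(1 node) / T(p nodes) and efficiency = speedup / p. The sketch below is a minimal illustration of those two formulas only. The function names and the sample walltimes are hypothetical, and nothing here reflects how DVT actually stores or computes its values.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S = T1 / Tp, from walltimes in the same units."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("walltimes must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, nodes: int) -> float:
    """Parallel efficiency E = S / p, where p is the node count."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speedup(t_serial, t_parallel) / nodes


if __name__ == "__main__":
    # Hypothetical walltimes in seconds (not measured data): 1 node vs. 4 nodes.
    t1, t4 = 100.0, 25.0
    print(speedup(t1, t4), efficiency(t1, t4, 4))
```

An efficiency near 1.0 means the run scales almost linearly; values well below 1.0 on the endnodes would point at communication or I/O overhead.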

LittleFe

Folding@Clusters

  • Work with Betsy Ward to get the plumbing for F@C set up on the D224 OSX machines. Local user; document the setup with a Wiki entry. Modify run-fatc.pl to add a command line option that produces raw SQL. (JoshM and Alex)
    • Stalled on NetPIPE capability discovery errors from fatc.
  • Console (JoshM)
    • Environment variable called $FATCHOME
    • Command line
    • Sockets to communicate with mother
    • Variable in mother.conf for console port
    • Mother listens and responds to commands on the console port
    • Command list: status [(running|paused|stopped), molecular system, x out of y steps completed, estimated time remaining, # nodes started, # of nodes current], checkpoint, pause, resume, stop.
    • Command line option -nn interval for compact, refreshed display.
    • First version of console has to be supplied with a hostname and port number.
    • Future versions (possibly when we introduce the grandmother) can take a $FATCHOME environment variable that points to a mother.conf file (to get a port number) as a discovery mechanism.
      • Can happen now: status(x out of y, stopped or running), stop
      • After mods to mdrun and F@C updates: status(paused, molecular system, estimated time remaining, nodes started, nodes current), pause, resume, checkpoint
    • Give the console the ability to trigger the checkpoint and quit mechanism.
  • Checkpointing hook in GROMACS - change mdrun data structures (checkpoint frequency variable) instead of using SIGUSR1, and which files we are using. Figure out how we can start, checkpoint, and recover using TPR and TOP files as input (JoshM)
  • Develop test canon (Alex and Charlie)
  • Document pval_report.pl and compare_walltime.pl (done) in Wiki (headings for each are already under HowTos) (Alex)
  • Supervise test runs, non-nfs, a2.7, all molecules, 1-4 nodes, bazaar and cairo, separate table (Alex)
  • Rerun the following configurations and compare nfs/non-nfs (Alex)
    • bazaar proteasome
    • bazaar villin-urea
    • cairo methanol 1-8 nodes
    • cairo mixed
    • cairo proteasome
    • cairo water 1-8 nodes
    • bazaar water
  • Check copyright headers, see Adam's message of March 11, 2005
  • Test with PBS/Maui (Skylar)
  • Re-write run-fatc to take PBS (no LAM) into account. (Skylar)
  • Get PBS to allow unlimited walltime. (Skylar)
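
The first-version console described in the Console item above (supplied with a hostname and port, sending one of the listed commands to the mother over a socket) could look roughly like the sketch below. It is only an illustration under stated assumptions: the wire format (one newline-terminated ASCII command, one newline-terminated reply) and the names `COMMANDS` and `send_command` are hypothetical, since the list does not specify a protocol.

```python
import socket

# Command names taken from the console command list; the wire format is an assumption.
COMMANDS = {"status", "checkpoint", "pause", "resume", "stop"}


def send_command(host: str, port: int, command: str, timeout: float = 5.0) -> str:
    """Send one console command to the mother and return its one-line reply.

    Assumes newline-terminated ASCII text in both directions (hypothetical protocol).
    """
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((command + "\n").encode("ascii"))
        # Read a single reply line from the mother.
        with sock.makefile("r", encoding="ascii", newline="\n") as reply:
            return reply.readline().rstrip("\n")
```

A call such as `send_command(host, port, "status")` would then return whatever status line the mother sends back. Validating the command before connecting keeps typos from ever reaching the mother.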

Curriculum Modules

  • Producing a cluster/distro specific set of modules out of one base unit
  • Generating a wiki entry and repository entry from one base unit
  • Population ecology module, start by finding what packages are available and making a list. (Skylar)

Recompute

  • Set up room in permanent configuration.
  • Accept next shipment from ECS.

Green Science

  • Track down current and archival weather data (wind, temperature, others) for this area, Muncie airport, RP&L, other sources? (Mary)