Cluster:Todo
(Need a notation for relative priority. Please don't delete anything unless we're updating this during a meeting.)
Current Items (updated September 28, 2005)
- Little-Fe
- Upgrade firmware on broken LF node (Skylar)
- Implement boot-off-zip
- Fossilize BCCD onto Little-Fe, making progress, see BCCD/PPC wiki for details (Kevin and Toby)
- Make backup dump of current LittleFe to admin (Kevin)
- Measure load with 4 nodes, 2 disk drives, CD drive; order transformer (Charlie)
- Send email to Paul about BCCD changes (Toby)
- Return lf2 with proper processor for 2.6 (Skylar)
- LLK (see Cluster:LowLatency for the details)
- Verify and validate content, and define high- and low-level benchmarks
- Condense diagram
- Make diagram vertical
- Make bar lengths relative to size
- Check Alteon drivers for STP, do they support cairo?
- Read the STP paper, emulate his test methodology/program? (Everyone)
- Is there a 2.6 version of STP? (Only from us.) SGI? (Not likely.) (Skylar)
- Look at separate socket implementations (Skylar)
- Look at Netpipe calls for STP help (Skylar)
- Measure latency in kernel and on wire using either kperf or tp_timer (Alex and Toby)
- Put together HOWTO using JoshM's diagram and timing structure, with both Netpipe and packgen (Alex, Toby)
- Use latest Netpipe for packet generation, specify payload, validate (Toby, Alex)
- Investigate tp_timer instabilities (Toby, Alex)
- Test accuracy by loading one of the nodes with CPU and disk traffic (lots)
- Set up and document this, along with kernel building/loading/starting, so that any of us can make a change and take a measurement.
- Measure packet loss rate at each node and the switches and hopper using SNMP/Cricket (Skylar)
- Measure bit error rate (Skylar)
- Use a structure/array for tp_ routines (Alex and Toby)
- No word from MAGNET on 2.6.x
- Real time kernels and TCP/IP book _late next week_.
- Update the list of references, llk.bib (Everyone)
- Literature search (Skylar, JoshM)
- Document all sysctls, build options, etc. that affect network performance, particularly latency
- Look for papers, etc. that discuss network performance/latency under Linux 2.6.x. For each paper, write a paragraph describing what's useful and what we should try, add a BibTeX entry, and put an electronic copy in /cluster/project/llk/references.
- Folding@Clusters
- Build a 3.1.4 CVS export with instructions and park it in ~pande for Guha (JoshM)
- Test a range of molecules, clusters, and sizes with Alex's scripts and PBS/Maui (Skylar)
- Plumbing
- Figure out how to get poster in one go (Skylar) LaTeX?
- on bazaar: ccache & distcc with wiki howto (Skylar, done)
- on cairo: distcc (currently installed but not running) with wiki howto (Skylar, done). Pull image too (Skylar)
- Investigate cairo's network delays. Switches? sshd? Timer reset? (Skylar, Toby)
- Set up test flows (UDP, ICMP) between hopper and cairo to measure latency. (Skylar)
- Other
- There may be a student program at the Oberlin Computational Science conference (November 4-6)
- Merck/AAAS Poster Session. Wednesday October 26th. Folding@Clusters and Little-Fe. We need to remember how to do smallish posters. (JoshM)
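For the UDP test flows between hopper and cairo listed under Plumbing above, a user-space round-trip probe might look like the following. This is a rough sketch, not the group's tooling: the function names and the 4-byte sequence header are assumptions made for illustration, and ICMP is left out because it needs raw sockets and root. Timings taken this way include scheduler and socket-layer overhead, so they complement kperf or tp_timer rather than replace them.

```python
import socket
import threading
import time


def run_echo_server(sock, stop_event):
    """Echo every UDP datagram back to its sender until stop_event is set.

    `sock` must already be bound; run this on the far end (e.g. cairo).
    """
    sock.settimeout(0.2)  # wake up regularly to check stop_event
    while not stop_event.is_set():
        try:
            data, addr = sock.recvfrom(65535)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def measure_rtt(host, port, count=10, payload_size=64, timeout=1.0):
    """Send `count` datagrams to host:port and return round-trip times.

    Returns a list of floats (seconds). A lost or late reply is recorded
    as None so that loss can be counted separately from latency.
    """
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for seq in range(count):
            # Hypothetical framing: 4-byte sequence number, zero padded.
            payload = seq.to_bytes(4, "big").ljust(payload_size, b"\0")
            start = time.perf_counter()
            s.sendto(payload, (host, port))
            try:
                while True:
                    reply, _ = s.recvfrom(65535)
                    if reply[:4] == payload[:4]:  # skip stale replies
                        break
                rtts.append(time.perf_counter() - start)
            except socket.timeout:
                rtts.append(None)
    return rtts
```

One possible use: bind a UDP socket on cairo and hand it to run_echo_server, then call measure_rtt("cairo", port) from hopper with whatever port was chosen, and compare the spread of the results with and without CPU and disk load on the node.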
Aug 26, 2005 Meeting Minutes
- Charlie, Toby, Alex (3 credits), Kevin (2 credits), Josh, Skylar (3 credits)
- SIAM PP06
- September 30, 2005 abstracts due, conference is February 22-24.
- Low Latency Kernel
- Collect papers, read them, and discuss next Wednesday. Keep the wiki entry, papers, Beowulf posts, and kernel source in a central place. Literature search sources: scholar.google.com, kernel development How-Tos, kernel mailing list, Beowulf mailing list; Toby to ask.
- Little-Fe
- BCCD with scripts to do mods for diskless booting
- Write-up with pedagogical stuff and curriculum modules (list-packages)
- Other
- Clean and organize Wiki
- Clean and organize Recompute/CCG lab
Plumbing
- Whiteboard(s) for 4th (Charlie)
- Change Vijay's password (JoshM)
- Fix Watts up?, possibly battery (Skylar)
- Actually needs replacing (Charlie)
- Set up AMANDA, follow up with Dan (Skylar)
- Finish setting up weekly tape backups with AMANDA (Skylar). Stalled on broken tape drive.
- Fiber uplink for bazaar (Charlie)
- Return GBIC module (Alex)
- Cool Athena in the display cabinet
- Construct shelving for Athena
- Figure out why CVS commit emails don't always appear
- Set up POV-Ray on Athena (Skylar)
- Protect F@C source and molecular systems, open http, ftp?, ssh? at cluster.earlham.edu (JoshM and Skylar)
- Protect folding-at-clusters/articles/dr-dobbs (Charlie)
- Can the script find out (environment variable?) where the checkout went so that it can protect those files?
- Updates to http://cluster.earlham.edu
- add link to cluster wiki page
- lose rss feed, leave a single link to mt (last mt entry)
- news - siam posters, bccd.cs.uni.edu, others?
- General and Resources link sets horizontal instead of vertical.
- Add link to Resources called documents (add static link to last MT entry and link to wiki)
- Update weatherduck link
- Add link to wiki doc under tools
- Update preset query link
- Overview and Press prose update (Charlie)
- Update speedup and speedup/efficiency within DVT for endnodes (Alex)
LittleFe
Folding@Clusters
- Work with Betsy Ward to get the plumbing for F@C set up on the D224 OSX machines. Use a local user and document the setup with a Wiki entry. Modify run-fatc.pl: add a command line option to produce raw SQL. (JoshM and Alex)
- Stalled on NetPIPE capability discovery errors from fatc.
- Console (JoshM)
- Environment variable called $FATCHOME
- Command line
- Sockets to communicate with mother
- Variable in mother.conf for console port
- Mother listens and responds to commands on the console port
- Command list: status [(running|paused|stopped), molecular system, x out of y steps completed, estimated time remaining, number of nodes started, number of nodes current], checkpoint, pause, resume, stop.
- Command line option -nn interval for compact, refreshed display.
- First version of console has to be supplied with a hostname and port number.
- Future versions (possibly when we introduce the grandmother) can take a $FATCHOME environment variable that points to a mother.conf file (to get a port number) as a discovery mechanism.
- Can happen now: status(x out of y, stopped or running), stop
- After mods to mdrun and F@C updates: status(paused, molecular system, estimated time remaining, nodes started, nodes current), pause, resume, checkpoint
- Give the console the ability to trigger the checkpoint and quit mechanism.
- Checkpointing hook in GROMACS - change mdrun data structures (checkpoint frequency variable) instead of using SIGUSR1, and which files we are using. Figure out how we can start, checkpoint, and recover using TPR and TOP files as input (JoshM)
- Develop test canon (Alex and Charlie)
- Document pval_report.pl and compare_walltime.pl (done) in Wiki (headings for each are already under HowTos) (Alex)
- Supervise test runs, non-nfs, a2.7, all molecules, 1-4 nodes, bazaar and cairo, separate table (Alex)
- Rerun the following configurations and compare nfs/non-nfs (Alex)
- bazaar proteasome
- bazaar villin-urea
- cairo methanol 1-8 nodes
- cairo mixed
- cairo proteasome
- cairo water 1-8 nodes
- bazaar water
- Check copyright headers, see Adam's message of March 11, 2005
- Test with PBS/Maui (Skylar)
- Re-write run-fatc to take PBS (no LAM) into account. (Skylar)
- Get PBS to allow unlimited walltime. (Skylar)
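The Console items above describe a client that is given a hostname and port and sends commands (status, checkpoint, pause, resume, stop) to mother over a socket. The wire format is not specified in these notes, so the sketch below assumes one newline-terminated ASCII command per connection and a one-line reply. StubMotherHandler, send_command, and the reply strings are all hypothetical stand-ins, not F@C code.

```python
import socket
import socketserver

# Command names taken from the Console list above.
COMMANDS = {"status", "checkpoint", "pause", "resume", "stop"}


class StubMotherHandler(socketserver.StreamRequestHandler):
    """Stand-in for mother: one reply line per command line (assumed format)."""

    def handle(self):
        for raw in self.rfile:  # iterate until the console closes the socket
            command = raw.decode("ascii", "replace").strip()
            if command == "status":
                # Placeholder text only; real fields would come from mother.
                reply = "running 0/0 steps"
            elif command in COMMANDS:
                reply = "ok " + command
            else:
                reply = "error unknown command"
            self.wfile.write((reply + "\n").encode("ascii"))


def send_command(host, port, command, timeout=5.0):
    """Connect to mother's console port, send one command, return the reply."""
    if command not in COMMANDS:
        raise ValueError("unknown console command: " + command)
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall((command + "\n").encode("ascii"))
        with conn.makefile("r", encoding="ascii", newline="\n") as reader:
            return reader.readline().rstrip("\n")
```

In this arrangement the first console version would simply call send_command with the hostname and port given on the command line; reading the port from mother.conf via $FATCHOME could later replace those two arguments without changing the exchange itself.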
Curriculum Modules
- Producing a cluster/distro specific set of modules out of one base unit
- Generating a wiki entry and repository entry from one base unit
- Population ecology module, start by finding what packages are available and making a list. (Skylar)
Recompute
- Set up room in permanent configuration.
- Accept next shipment from ECS.
Green Science
- Track down current and archival weather data (wind, temperature, others) for this area, Muncie airport, RP&L, other sources? (Mary)