Difference between revisions of "Cluster: New BobSCEd Install Log"
Revision as of 19:15, 21 April 2010
The source code for *anything* installed locally is in /usr/local/src. The source for *anything* installed on NFS is in /mounts/bobsced/usr/local/src.
Modules Software
Intel Compilers
- C/C++ Compiler Release Notes (PDF)
- Installed both as custom installs, *not from RPM*, so that they could go in a different install location (on NFS):
- /mounts/bobsced/usr/local/modules-sw/intel/cce/11.1/
- /mounts/bobsced/usr/local/modules-sw/intel/fce/11.1/
Openmpi
- 1.3.1 - configured with
./configure --prefix=/mounts/bobsced/usr/local/modules-sw/openmpi/1.3.1/ --enable-mpi-threads --with-openib
- 1.3.3 same
- 1.4.1 same
MPICH
- MPICH1 installed with
./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1
- Make note: to uninstall - /mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/sbin/mpiuninstall
- When run without a machine file, it reads /mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/share/machines.ARCH instead of just running on the host machine. I removed this file to force people to use a machine file, which will have to be generated from PBS as part of the qsub script. Gave up and added it back in until I finish writing a qsub script that incorporates it.
- MPICH2 installed with
./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich2/2.1.2 --enable-cxx --enable-f90 --enable-f77 --enable-threads=multiple --with-thread-package=posix
- Threw OSC's Torque functionality mpiexec on top of it (originally this was package mpich2, now it's mpich2-osc):
- ./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich2/2.1.2 --with-pbs=/var/spool/pbs/ --with-default-comm=mpich2-pmi
- See Debian Clusters: MPICH with Torque
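A minimal sketch of how a PBS-generated machine file could be fed to MPICH1 from a qsub script. It assumes Torque sets $PBS_NODEFILE to a file with one line per allocated slot and that the MPICH1 mpirun accepts -machinefile. The count_slots helper, the job name, and the ./a.out binary are hypothetical placeholders, not part of this install.

```shell
#!/bin/bash
#PBS -N mpich1-sketch
#PBS -l nodes=2:ppn=4

# Hypothetical helper: count the non-empty lines in a node file
# (Torque is assumed to write one line per allocated slot).
count_slots() {
    grep -c . "$1"
}

# Only launch when running under Torque, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    NP=$(count_slots "$PBS_NODEFILE")
    # Use the job's own node list instead of share/machines.ARCH.
    mpirun -np "$NP" -machinefile "$PBS_NODEFILE" ./a.out
fi
```

With something like this in place, share/machines.ARCH would no longer be consulted for jobs submitted through qsub, though interactive runs would still fall back to it.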
Using Modules
Important commands -
. /etc/profile.d/modules.sh
module load modules modules-init modules-bobsced
module avail
module load x
Log
Green color indicates something that still needs to be done.
Cloning
- Download the udpcast rpm from http://udpcast.linux.lu/source.html
- Install with
yum --nogpgcheck localinstall udpcast-20081213-1.i386.rpm
- On hopper, installed the syslinux and tftpd-hpa ports
- Enable tftpd in
/etc/inetd.conf
by removing the comments, restart inetd with
/etc/rc.d/inetd restart
and then also run the command listed on that line to start tftpd
- The following lines were already in /usr/local/etc/dhcpd.conf:
allow booting;
allow bootp;
- Put the filename in the particular group (see Debian Clusters)
- Copy the boot loader into place (the /tftpboot directory needs to be created):
cp /usr/local/share/syslinux/pxelinux.0 /tftpboot/
- Download linux, initrd, and default from the udpcast site into /tftpboot
- Move default into /tftpboot/pxelinux.cfg
- Restart dhcpd:
killall -KILL dhcpd
/usr/local/sbin/dhcpd -q -cf /usr/local/etc/dhcpd.conf -lf /var/db/dhcpd/dhcpd.leases -pf /var/run/dhcpd/dhcpd.pid -user dhcpd -group dhcpd
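The file staging steps above could be collected into one small function, sketched here. The name stage_pxe and its two optional arguments are hypothetical. The defaults are the paths from this log, and it assumes pxelinux.cfg is a directory and that default has already been downloaded into the TFTP root.

```shell
#!/bin/sh
# Hypothetical helper: stage the PXE boot files under a TFTP root.
# Both arguments are optional and default to the paths used in this log.
stage_pxe() {
    tftp_root="${1:-/tftpboot}"
    syslinux_dir="${2:-/usr/local/share/syslinux}"

    # The TFTP root has to be created; pxelinux.cfg holds the boot config.
    mkdir -p "$tftp_root/pxelinux.cfg" || return 1
    cp "$syslinux_dir/pxelinux.0" "$tftp_root/" || return 1

    # 'default' is assumed to have been downloaded into the TFTP root already.
    if [ -f "$tftp_root/default" ]; then
        mv "$tftp_root/default" "$tftp_root/pxelinux.cfg/default"
    fi
}
```

Running stage_pxe with no arguments as root on hopper would use the real paths. Passing two scratch directories gives a harmless dry run.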
Head Node
Yum installed:
- gcc.x86_64, gcc-c++.x86_64
- for Ganglia:
- apr.x86_64 and apr-devel.x86_64
- libconfuse-2.5-4.el5.x86_64.rpm, libconfuse-devel-2.5-4.el5.x86_64.rpm (from Fedora repositories)
- expat-devel.x86_64
- for Intel updates:
- compat-libstdc++-33.i386
- blas.x86_64 (on all nodes)
Install C3 tools from http://www.csm.ornl.gov/torc/C3/C3softwarepage.shtml
- Downloaded full install rpm on bs0, installed with
yum --nogpgcheck localinstall c3-4.0.1-1.noarch.rpm
- See C3 Tools README and C3 Tools INSTALL
- Put root's keys in the home directory and authorized itself, then copied that to the worker node image
Ganglia
- On hopper, added the data_source line for bs0 to /usr/local/etc/gmetad.conf and restarted it with
/usr/local/etc/rc.d/gmetad restart
- Downloaded tar ball from http://sourceforge.net/projects/ganglia/
- See Ganglia README
./configure --prefix=/cluster
- The head node uses a different Ganglia gmond.conf in /etc/ganglia/gmond.conf and the workers just have theirs symlinked to /cluster/etc/gmond.conf
- By default, iptables is running on the CentOS install and blocks hopper's Ganglia requests
- Turned off by clearing it and then running
/sbin/service iptables save
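A sketch of what the iptables step might look like as a script. It assumes "clearing" meant flushing all rules with iptables -F; the log does not say which command was used. The run wrapper and the DRY_RUN variable are hypothetical additions so the commands can be previewed before touching the firewall.

```shell
#!/bin/bash
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Flush all rules (an assumption about what "clearing" meant), then save the
# now-empty rule set so it is what gets restored at boot.
clear_iptables() {
    run /sbin/iptables -F
    run /sbin/service iptables save
}
```

Setting DRY_RUN=1 before calling clear_iptables prints the two commands without running them.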
Networking
- Shorewall, see /etc/shorewall/params for almost all of the important definitions
- Natting is done through /etc/shorewall/masq
- DHCP relay (installed as part of the dhcp yum package), added to boot with chkconfig on, set for hopper
- See /etc/sysconfig/dhcrelay
- This means that a dhcp server is also installed, but it is not set to run and is not configured, either
- Hopper needs to have a static route added in order to have the responses return; these are in /etc/rc.conf:
static_routes="bs0"
route_bs0="192.168.0.1 159.28.234.200"
Modules
- Installed environment-modules from http://download.fedora.redhat.com/pub/epel/5/x86_64/repoview/environment-modules.html
- Important directories:
/usr/share/Modules/
Torque
- Installed from source with
./configure --with-default-server=bs0.bobsced.loc --with-rc=scp --disable-mom --with-server-home=/var/spool/pbs
("clients" is what installs qmgr)
- Installs to /usr/local/
- Set up according to Debian Clusters setup
- Reran the ./configure but without --disable-mom, then ran
make packages
and copied the result to the worker node
Maui
- installed Maui according to same link as above
Intel Firmware Updates
- Configured sendmail by adding bs0.bobsced.loc to /etc/mail/local-host-names
NFS
- The actual filesystem on bs0-new is at /mounts/bobsced. The nodes all mount this in the same place.
- It's mounted on hopper at /mounts/bobsced, with the symlink (currently) at /cluster/bobscednew
WebMO
- yum installed httpd
- Installed on bs0 with the following params:
Path to perl: /usr/bin/perl
Webserver name: bs0-new.cluster.earlham.edu
HTML directory: /var/www/webmo
HTML URL: /webmo
CGI script directory: /var/www/cgi-bin
CGI script URL: /cgi-bin
User files directory: /mounts/bobsced/WebMO
- Get this error when authing with LDAP:
Can't locate Authen/Simple/LDAP.pm
- yum installed perl-LDAP.noarch, didn't work, so used CPAN to install Authen::Simple::LDAP
- edited /var/www/cgi-bin/interfaces/authen.conf for our LDAP settings
- Before externally authenticated users can use it, you have to go in as administrator and check the box to allow them in the Webmo group (or whatever other group)
- Gamess:
- yum install compat-gcc-34-g77.x86_64 and gfortran
- Followed directions from Webmo site
- Added the following line to httpd.conf:
SuexecUserGroup bob users
- Gaussian 09 not supported, though it's installed in /mounts/bobsced/usr/local/g09
- Installed g03, except get errors:
Erroneous write during file extend. write 160 instead of 4096
Probably out of disk space.
Write error in NtrExt1: No such file or directory
or
Write error in NtrExt1: Bad address
- To fix this, do
echo 0 > /proc/sys/kernel/randomize_va_space
- This needs to be set to happen all the time on boot
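One way to make the randomize_va_space setting survive a reboot is to record it in /etc/sysctl.conf, sketched below. The persist_sysctl helper is hypothetical and takes the config file as an optional second argument so it can be tried against a scratch file first.

```shell
#!/bin/sh
# Hypothetical helper: append a sysctl setting to a config file unless the
# exact line is already present, so reruns do not duplicate it.
persist_sysctl() {
    setting="$1"
    conf="${2:-/etc/sysctl.conf}"
    grep -qxF "$setting" "$conf" 2>/dev/null || echo "$setting" >> "$conf"
}

# Usage (as root): persist the fix, then reload settings without rebooting.
# persist_sysctl "kernel.randomize_va_space = 0" && /sbin/sysctl -p
```

The echo into /proc only lasts until the next boot, while the sysctl.conf entry is applied at every boot.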
Infiniband
- Drivers downloaded from here - the Red Hat 5.3 ones
- Mount the ISO as a loopback somewhere, e.g.
mount -o loop /mounts/bobsced/usr/src/MLNX_OFED_LINUX-1.4-rhel5.3.iso /media/
- Run the installer with --msm, e.g.
/media/mlnxofedinstall --msm
- Need to boot into the kernel that came in the original install (2.6.18-128.el5), otherwise get a message like this:
The 2.6.18-164.el5 kernel is installed, but do not have drivers available. Cannot continue.
- Then (before rebooting back to old kernel), run
mst start
- Then symlink to the current kernel version, like this (see the Rocks discussion about it):
- The version number to use appears in the error that mst start gives you
ln -s /usr/mst/lib/2.6.18-128.el5/ /usr/mst/lib/2.6.18-164.2.1.el5.plus
- Success looks like this:
[root@bs0-new ~]# mst start
Starting MST (Mellanox Software Tools) driver set:
Loading MST PCI module [ OK ]
Loading MST PCI configuration module [ OK ]
Saving configuration for PCI device 01:00.0 [ OK ]
Create devices
- Does this need to be done every time? It looks like yes.
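If the symlink does have to be recreated for each new kernel, a small helper run at boot might cover it. This is only a sketch: the name link_mst_modules is hypothetical, the defaults are the paths from this log, and it assumes the running kernel version from uname -r is the directory name mst start looks for.

```shell
#!/bin/sh
# Hypothetical helper: point the running kernel's MST module directory at the
# one built for the original kernel. Defaults follow the paths in this log.
link_mst_modules() {
    mst_lib="${1:-/usr/mst/lib}"
    built_for="${2:-2.6.18-128.el5}"
    running="${3:-$(uname -r)}"

    # Nothing to do if the directory or link is already present.
    if [ ! -e "$mst_lib/$running" ]; then
        ln -s "$mst_lib/$built_for" "$mst_lib/$running"
    fi
}

# Usage (as root, for example from /etc/rc.local): link_mst_modules && mst start
```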
Cluster Image Review
- LDAP needs to be installed on hopper
- Need to test if these are working
- mpich1
- mpich2
- openmpi
- Intel MPI needs to be installed (maybe?)
- Init scripts for pbs and maui need to be set up
Implementation | Command Line -np 8 | Command Line -np 9 | Torque -np 8 | Torque -np 9 | Machinefile | Notes |
mpich1 | OK, except allocates 1 process to bs0 | OK, except allocates 1 process to bs0 | OK | OK | /mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/share/machines.LINUX | Creates temporary file PIxxxxx while running under qsub; MPI does not respect qsub # of nodes given |
mpich2-osc | N/A | N/A | OK, MUST specify # nodes, ppn | OK, MUST specify # nodes, ppn | Generated by PBS | Currently uses OSC's pbs-specific mpiexec (cannot be run outside of qsub) |
mpich2 | OK, must set up mpd ring | OK, must set up mpd ring | OK, must set up mpd ring | OK, must set up mpd ring | N/A (mpd ring) | Requires chmod 600 .mpd.conf in home directory with MPD_SECRETWORD=somevalue; set up the ring with the commands listed under this table |
openmpi 1.* | OK for running on one node | OK for running on one node | Uses PBS_NODES automatically | Uses PBS_NODES automatically | | |
mpd ring setup for mpich2:
sort $PBS_NODEFILE | uniq -c | awk '{print $2":"$1}' > /cluster/home/kwanous/tmp/mpd.nodes
mpdboot -f /cluster/home/kwanous/tmp/mpd.nodes -n 2
Submission scripts:
Mpich1
Mpich2
#!/bin/bash
#PBS -N testmpich2
#PBS -l cput=00:60:00
#PBS -l nodes=2:ppn=4
. /etc/profile.d/modules.sh
module load modules modules-init modules-bobsced
module load mpich2
sort $PBS_NODEFILE | uniq -c | awk '{print $2":"$1}' > /cluster/home/kwanous/tmp/mpd.nodes
mpdboot -f /cluster/home/kwanous/tmp/mpd.nodes -n 2
mpiexec -np 9 /cluster/home/kwanous/a.out
mpdallexit
rm -rf /cluster/home/kwanous/tmp/mpd.nodes
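The node-file pipeline in the Mpich2 script can be pulled out into helpers to make its effect easier to see. In this sketch, mpd_hosts and mpd_host_count are hypothetical names, and the per-job file name built from $PBS_JOBID is an assumption intended to keep concurrent jobs from overwriting one shared mpd.nodes file.

```shell
#!/bin/bash
# Hypothetical helper: turn a Torque node file (one line per slot) into the
# host:count lines that mpdboot reads.
mpd_hosts() {
    sort "$1" | uniq -c | awk '{print $2":"$1}'
}

# Hypothetical helper: number of distinct hosts, suitable for mpdboot -n.
mpd_host_count() {
    sort -u "$1" | grep -c .
}

# Under Torque, a per-job file name avoids two jobs sharing one node list.
if [ -n "${PBS_NODEFILE:-}" ]; then
    hosts="$HOME/mpd.nodes.$PBS_JOBID"
    mpd_hosts "$PBS_NODEFILE" > "$hosts"
    mpdboot -f "$hosts" -n "$(mpd_host_count "$PBS_NODEFILE")"
fi
```

For a node file listing bs1 three times and bs2 twice, mpd_hosts prints bs1:3 and bs2:2 on separate lines, and mpd_host_count prints 2, which replaces the hard-coded -n 2.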
Mpich2 - OSC
#!/bin/bash
#PBS -N testmpich1
#PBS -l cput=00:60:00
#PBS -l nodes=2:ppn=4
hostname
. /etc/profile.d/modules.sh
module load modules modules-init modules-bobsced
module load mpich2
mpirun -np 8 /cluster/home/kwanous/a.out
Openmpi
#!/bin/bash
#PBS -N testopenmpi
#PBS -l cput=00:60:00
#PBS -l nodes=2:ppn=4
hostname
. /etc/profile.d/modules.sh
module load modules modules-init modules-bobsced
module load openmpi/1.3.1
mpirun -np 9 /cluster/home/kwanous/a.out