Difference between revisions of "Cluster: New BobSCEd Install Log"

From Earlham CS Department
Jump to navigation Jump to search
(Cluster Image Review)
(Modules Software)
Line 11: Line 11:
 
* 1.3.1 - configured with <code>./configure --prefix=/mounts/bobsced/usr/local/modules-sw/openmpi/1.3.1/ --enable-mpi-threads --with-openib</code>
 
* 1.3.1 - configured with <code>./configure --prefix=/mounts/bobsced/usr/local/modules-sw/openmpi/1.3.1/ --enable-mpi-threads --with-openib</code>
 
* 1.3.3 same
 
* 1.3.3 same
 +
* 1.4.1 same
  
 
'''MPICH'''
 
'''MPICH'''

Revision as of 14:10, 13 April 2010

The source code for *anything* installed locally is in /usr/local/src. The source for *anything* installed on NFS is in /mounts/bobsced/usr/local/src.

Modules Software

Intel Compilers

  • C/C++ Compiler Release Notes (PDF)
  • Installed both custom and *not from RPM* so that I could put it in a different install location (on NFS):
    • /mounts/bobsced/usr/local/modules-sw/intel/cce/11.1/
    • /mounts/bobsced/usr/local/modules-sw/intel/fce/11.1/

Openmpi

  • 1.3.1 - configured with ./configure --prefix=/mounts/bobsced/usr/local/modules-sw/openmpi/1.3.1/ --enable-mpi-threads --with-openib
  • 1.3.3 same
  • 1.4.1 same

MPICH

  • MPICH1 installed with ./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1
    • Make note: to uninstall - /mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/sbin/mpiuninstall
    • It looks to /mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/share/machines.ARCH instead of just running on the host machine if run without a machine file. I removed this file to force people to use a machinesfile... that's going to have to be generated from PBS as part of the qsub script.
  • MPICH2 installed with
./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich2/2.1.2 --enable-cxx --enable-f90 --enable-f77 --enable-threads=multiple --with-thread-package=posix
    • Threw OSC's Torque functionality mpiexec on top of it (originally this was package mpich2, now it's mpich2-osc):

Using Modules

Important commands -

  • . /etc/profile.d/modules.sh
  • module load modules modules-init modules-bobsced
  • module avail
  • module load x

Log

Green color indicates something that still needs to be done.

Cloning

  • Download the udpcast rpm from http://udpcast.linux.lu/source.html
    • Install with yum --nogpgcheck localinstall udpcast-20081213-1.i386.rpm
    • On hopper, installed the syslinux and tftpd-hpa ports
      • Enable tftpd in /etc/inetd.conf by removing the comments and restart inetd with /etc/rc.d/inetd restart, and then also run the command listed on that line to start tftpd
      • The following lines were already in /usr/local/etc/dhcpd.conf: allow booting; allow bootp;, put the filename in the particular group (see Debian Clusters)
      • cp /usr/local/share/syslinux/pxelinux.0 /tftpboot/ (the /tftpboot directory needs to be created)
      • Download linux, initrd, and default from the udpcast site into /tftpboot
      • Move default into /tftpboot/pxelinux.cfg
      • Restart dhcpd (killall -KILL dhcpd and /usr/local/sbin/dhcpd -q -cf /usr/local/etc/dhcpd.conf -lf /var/db/dhcpd/dhcpd.leases -pf /var/run/dhcpd/dhcpd.pid -user dhcpd -group dhcpd

Head Node

Yum installed:

  • gcc.x86_64, gcc-c++.x86_64
  • for Ganglia:
    • apr.x86_64 and apr-devel.x86_64
    • libconfuse-2.5-4.el5.x86_64.rpm, libconfuse-devel-2.5-4.el5.x86_64.rpm (from Fedora repositories)
    • expat-devel.x86_64
  • for Intel updates:
    • compat-libstdc++-33.i386
  • blas.x86_64 (on all nodes)

Install C3 tools from http://www.csm.ornl.gov/torc/C3/C3softwarepage.shtml

  • Downloaded full install rpm on bs0, installed with yum --nogpgcheck localinstall c3-4.0.1-1.noarch.rpm

Ganglia

  • On hopper, added the data_source line for bs0 to /usr/local/etc/gmetad.conf and restarted it with /usr/local/etc/rc.d/gmetad restart
  • Downloaded tar ball from http://sourceforge.net/projects/ganglia/
    • See Ganglia README
    • ./configure --prefix=/cluster
    • The head node uses a different Ganglia gmond.conf in /etc/ganglia/gmond.conf and the workers just have theirs symlinked to /cluster/etc/gmond.conf
    • By default, iptables is running on the CentOS install and blocks hopper's Ganglia requests
      • Turned off by clearing it and then running /sbin/service iptables save

Networking

  • Shorewall, see /etc/shorewall/params for almost all of the important definitions
    • Natting is done through /etc/shorewall/masq
  • DHCP relay, added to boot with chkconfig on, set for hopper (installed as part of dhcp yum package)
    • See /etc/sysconfig/dhcrelay
    • This means that a dhcp server is also installed, but it is not set to run and is not configured, either
    • Hopper needs to have a static route added in order to have the responses return, these are in /etc/rc.conf:
static_routes="bs0"
route_bs0="192.168.0.1 159.28.234.200"

Modules

Torque

  • Installed from source with ./configure --with-default-server=bs0.bobsced.loc --with-rc=scp --disable-mom --with-server-home=/var/spool/pbs ("clients" is what installs qmgr)
  • Installs to /usr/local/
  • Set up according to Debian Clusters setup
  • Reran the ./configure but without --disable-moms, then ran make packages, copied this to worker node

Maui

  • installed Maui according to same link as above

Intel Firmware Updates

Mail

  • Configured sendmail by adding bs0.bobsced.loc to /etc/mail/local-host-names

NFS

  • The actual filesystem on bs0-new is at /mounts/bobsced. The nodes all mount this in the same place.
  • It's mounted on hopper at /mounts/bobsced, with the symlink (currently) at /cluster/bobscednew

WebMO

  • yum installed httpd
  • Installed on bs0 with the following params:
Path to perl:         /usr/bin/perl
Webserver name:       bs0-new.cluster.earlham.edu
HTML directory:       /var/www/webmo
HTML URL:             /webmo
CGI script directory: /var/www/cgi-bin
CGI script URL:       /cgi-bin
User files directory: /mounts/bobsced/WebMO
  • Get this error when authing with LDAP: Can't locate Authen/Simple/LDAP.pm
  • yum installed perl-LDAP.noarch, didn't work, so used CPAN to install Authen::Simple::LDAP
  • edited /var/www/cgi-bin/interfaces/authen.conf for our LDAP settings
  • Before externally authenticated users can use it, you have to go in as administrator and check the box to allow them in the Webmo group (or whatever other group)
  • Gamess:
    • yum install compat-gcc-34-g77.x86_64 and gfortran
    • Followed directions from Webmo site
  • Added the following line to httpd.conf:
SuexecUserGroup bob users
  • Gaussian 09 not supported, though it's installed in /mounts/bobsced/usr/local/g09
  • Installed g03, except get errors:
Erroneous write during file extend. write 160 instead of 4096
Probably out of disk space.
Write error in NtrExt1: No such file or directory

or

Write error in NtrExt1: Bad address
    • To fix this, do echo 0 > /proc/sys/kernel/randomize_va_space
    • This needs to be set to happen all the time on boot

Infiniband

  • Drivers downloaded from here - the Red Hat 5.3 ones
    • mount the ISO as a loopback somewhere (ie mount -o loop /mounts/bobsced/usr/src/MLNX_OFED_LINUX-1.4-rhel5.3.iso /media/)
    • run with -msm (ie /media/mlnxofedinstall --msm
  • Need to boot into the kernel that came in the original install (2.6.18-128.el5), otherwise get a message like this:
The 2.6.18-164.el5 kernel is installed, but do not have drivers available. 
Cannot continue.
  • Then (before rebooting back to old kernel), run mst start
  • Then, sym link to current kernel version, like this: (see Rocks discussion here about it)
    • You should get the version number in the error mst start will give you
ln -s /usr/mst/lib/2.6.18-128.el5/ /usr/mst/lib/2.6.18-164.2.1.e15.plus
  • Success looks like this:
[root@bs0-new ~]# mst start
    Starting MST (Mellanox Software Tools) driver set: 
Loading MST PCI module                                     [  OK  ]
Loading MST PCI configuration module                       [  OK  ]
Saving configuration for PCI device 01:00.0                [  OK  ]
Create devices
  • Does this need to be done every time? It looks like yes.

Cluster Image Review

  • LDAP needs to be installed on hopper
  • Need to test if these are working
    • mpich1
    • mpich2
    • openmpi
  • Intel MPI needs to be installed (maybe?)
  • Init scripts for pbs and maui need to be setup
' Command Line -np 8 Command Line -np 9 Torque -np 8 Torque -np 9 Machinefile Notes
mpich1 OK, except allocates 1 process to bs0 OK, except allocates 1 process to bs0 OK OK /mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/share/machines.LINUX Creates temporary file PIxxxxx while running under qsub; MPI does not respect qsub # of nodes given
mpich2-osc N/A N/A OK, MUST specify # nodes, ppn OK, MUST specify # nodes, ppn Generated by PBS Currently uses OSC's pbs-specific mpiexec (cannot be run outside of qsub)
mpich2 OK, must setup mpd ring OK, must set up mdp ring OK, must set up mdp ring OK, must set up mdp ring N/A (mpd ring) Requires chmod 600 .mpd.conf in home directory with MPD_SECRETWORD=somevalue; setup ring with sort $PBS_NODEFILE | uniq -c | awk '{print $2":"$1}' > /cluster/home/kwanous/tmp/mpd.nodes; mpdboot -f /cluster/home/kwanous/tmp/mpd.nodes -n 2
openmpi 1.3.1
openmpi 1.3.3