Cluster: New BobSCEd Install Log

From Earlham CS Department
Revision as of 12:15, 12 April 2010 by Kay (talk | contribs) (Cluster Image Review)

The source code for *anything* installed locally is in /usr/local/src. The source for *anything* installed on NFS is in /mounts/bobsced/usr/local/src.

Modules Software

Intel Compilers

  • C/C++ Compiler Release Notes (PDF)
  • Installed both as custom installs, *not from RPM*, so that I could put them in a different install location (on NFS):
    • /mounts/bobsced/usr/local/modules-sw/intel/cce/11.1/
    • /mounts/bobsced/usr/local/modules-sw/intel/fce/11.1/

Openmpi

  • 1.3.1 - configured with ./configure --prefix=/mounts/bobsced/usr/local/modules-sw/openmpi/1.3.1/ --enable-mpi-threads --with-openib
  • 1.3.3 - configured the same way
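
Since 1.3.3 was configured the same way as 1.3.1, the configure line can be written once with the version as a parameter. This is a minimal sketch, assuming only the prefix changes between versions. The log does not show the 1.3.3 prefix, so that directory name is an assumption, and openmpi_configure_cmd is a hypothetical helper that only prints the command.

```shell
# Hypothetical helper: print the configure command for a given Open MPI version.
# Assumes the install prefix follows the same pattern as the 1.3.1 entry.
openmpi_configure_cmd() {
    version=$1
    prefix=/mounts/bobsced/usr/local/modules-sw/openmpi/$version/
    printf './configure --prefix=%s --enable-mpi-threads --with-openib\n' "$prefix"
}

# Print the command for each version in the log; the output would be run
# from the matching unpacked source tree.
for v in 1.3.1 1.3.3; do
    openmpi_configure_cmd "$v"
done
```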

MPICH

  • MPICH1 installed with ./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1
    • Note: to uninstall, run /mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/sbin/mpiuninstall
    • If run without a machine file, it looks to /mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/share/machines.ARCH instead of just running on the host machine. I removed this file to force people to use a machine file, which will have to be generated from PBS as part of the qsub script.
  • MPICH2 installed with
./configure --prefix=/mounts/bobsced/usr/local/modules-sw/mpich2/2.1.2 --enable-cxx --enable-f90 --enable-f77 --enable-threads=multiple --with-thread-package=posix
    • Layered OSC's mpiexec (which provides the Torque functionality) on top of it
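
The machine file that the MPICH1 note says must be generated from PBS could be built along these lines. This is a sketch, not the script actually used on the cluster: make_machinefile and my_mpi_program are hypothetical names, and it assumes Torque's usual PBS_NODEFILE, PBS_O_WORKDIR and PBS_JOBID variables are set inside the job.

```shell
# Hypothetical helper: build a machine file from the node list Torque provides.
# Usage: make_machinefile NODEFILE OUTFILE
# Copies the node list to OUTFILE and prints the number of process slots.
make_machinefile() {
    nodefile=$1
    outfile=$2
    # Torque writes one line per allocated processor slot.
    cp "$nodefile" "$outfile" || return 1
    # The number of lines is the number of MPI processes to start.
    wc -l < "$outfile" | tr -d ' '
}

# Inside a qsub script this might be used roughly as follows:
#   MACHINES="$PBS_O_WORKDIR/machines.$PBS_JOBID"
#   NP=$(make_machinefile "$PBS_NODEFILE" "$MACHINES")
#   mpirun -np "$NP" -machinefile "$MACHINES" ./my_mpi_program
```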

Using Modules

Important commands -

  • . /etc/profile.d/modules.sh
  • module load modules modules-init modules-bobsced
  • module avail
  • module load x
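
The commands above can be wrapped so that a login or job script loads everything in one step. This is only a sketch: load_cluster_modules and the MODULES_INIT override are hypothetical, and the module name in the usage comment is a guess at how the installed software is named (check module avail for the real names).

```shell
# Location of the Modules init script; the override exists only for testing.
MODULES_INIT=${MODULES_INIT:-/etc/profile.d/modules.sh}

# Hypothetical helper: initialise Modules, load the base module files,
# then load each module named on the command line.
load_cluster_modules() {
    if [ ! -r "$MODULES_INIT" ]; then
        echo "cannot read $MODULES_INIT" >&2
        return 1
    fi
    . "$MODULES_INIT"
    module load modules modules-init modules-bobsced || return 1
    for m in "$@"; do
        module load "$m" || return 1
    done
}

# Example (module name is an assumption):
#   load_cluster_modules openmpi/1.3.3
```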

Log

Green color indicates something that still needs to be done.

Cloning

  • Download the udpcast rpm from http://udpcast.linux.lu/source.html
    • Install with yum --nogpgcheck localinstall udpcast-20081213-1.i386.rpm
    • On hopper, installed the syslinux and tftpd-hpa ports
      • Enable tftpd in /etc/inetd.conf by uncommenting its line, restart inetd with /etc/rc.d/inetd restart, and then also run the command listed on that line to start tftpd
      • The lines allow booting; and allow bootp; were already in /usr/local/etc/dhcpd.conf. Put the filename in the particular group (see Debian Clusters)
      • cp /usr/local/share/syslinux/pxelinux.0 /tftpboot/ (the /tftpboot directory needs to be created)
      • Download linux, initrd, and default from the udpcast site into /tftpboot
      • Move default into /tftpboot/pxelinux.cfg
      • Restart dhcpd (killall -KILL dhcpd, then /usr/local/sbin/dhcpd -q -cf /usr/local/etc/dhcpd.conf -lf /var/db/dhcpd/dhcpd.leases -pf /var/run/dhcpd/dhcpd.pid -user dhcpd -group dhcpd)
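
The file layout that the tftp steps above produce can be sketched as a small function. stage_tftpboot is a hypothetical helper, and it assumes the udpcast files (linux, initrd, default) were first downloaded into a separate directory rather than straight into /tftpboot as in the log.

```shell
# Hypothetical helper: lay out a PXE boot directory for udpcast.
# Usage: stage_tftpboot DOWNLOAD_DIR PXELINUX_FILE TFTP_ROOT
stage_tftpboot() {
    downloads=$1   # directory holding linux, initrd and default
    pxelinux=$2    # e.g. /usr/local/share/syslinux/pxelinux.0
    tftproot=$3    # e.g. /tftpboot
    # Create the tftp root and the pxelinux.cfg directory inside it.
    mkdir -p "$tftproot/pxelinux.cfg" || return 1
    # The boot loader, kernel and initrd sit at the top of the tftp root.
    cp "$pxelinux" "$tftproot/" || return 1
    cp "$downloads/linux" "$downloads/initrd" "$tftproot/" || return 1
    # The PXE configuration file goes under pxelinux.cfg/.
    cp "$downloads/default" "$tftproot/pxelinux.cfg/default"
}

# Example (the download directory is an assumption):
#   stage_tftpboot /root/udpcast /usr/local/share/syslinux/pxelinux.0 /tftpboot
```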

Head Node

Yum installed:

  • gcc.x86_64, gcc-c++.x86_64
  • for Ganglia:
    • apr.x86_64 and apr-devel.x86_64
    • libconfuse-2.5-4.el5.x86_64.rpm, libconfuse-devel-2.5-4.el5.x86_64.rpm (from Fedora repositories)
    • expat-devel.x86_64
  • for Intel updates:
    • compat-libstdc++-33.i386
  • blas.x86_64 (on all nodes)

Install C3 tools from http://www.csm.ornl.gov/torc/C3/C3softwarepage.shtml

  • Downloaded full install rpm on bs0, installed with yum --nogpgcheck localinstall c3-4.0.1-1.noarch.rpm

Ganglia

  • On hopper, added the data_source line for bs0 to /usr/local/etc/gmetad.conf and restarted it with /usr/local/etc/rc.d/gmetad restart
  • Downloaded tar ball from http://sourceforge.net/projects/ganglia/
    • See Ganglia README
    • ./configure --prefix=/cluster
    • The head node uses a different Ganglia gmond.conf in /etc/ganglia/gmond.conf and the workers just have theirs symlinked to /cluster/etc/gmond.conf
    • By default, iptables is running on the CentOS install and blocks hopper's Ganglia requests
      • Turned off by clearing its rules and then running /sbin/service iptables save

Networking

  • Shorewall: see /etc/shorewall/params for almost all of the important definitions
    • NAT is done through /etc/shorewall/masq
  • DHCP relay (installed as part of the dhcp yum package): added to boot with chkconfig on and pointed at hopper
    • See /etc/sysconfig/dhcrelay
    • This means that a DHCP server is also installed, but it is neither configured nor set to run
    • Hopper needs a static route added so that the responses can return. These are in /etc/rc.conf:
static_routes="bs0"
route_bs0="192.168.0.1 159.28.234.200"

Modules

Torque

  • Installed from source with ./configure --with-default-server=bs0.bobsced.loc --with-rcp=scp --disable-mom --with-server-home=/var/spool/pbs (the "clients" component is what installs qmgr)
  • Installs to /usr/local/
  • Set up according to Debian Clusters setup
  • Reran ./configure, but without --disable-mom, then ran make packages and copied the result to the worker node

Maui

  • Installed Maui following the same Debian Clusters setup as above

Intel Firmware Updates

Mail

  • Configured sendmail by adding bs0.bobsced.loc to /etc/mail/local-host-names

NFS

  • The actual filesystem on bs0-new is at /mounts/bobsced. The nodes all mount this in the same place.
  • It's mounted on hopper at /mounts/bobsced, with the symlink (currently) at /cluster/bobscednew

WebMO

  • yum installed httpd
  • Installed on bs0 with the following params:
Path to perl:         /usr/bin/perl
Webserver name:       bs0-new.cluster.earlham.edu
HTML directory:       /var/www/webmo
HTML URL:             /webmo
CGI script directory: /var/www/cgi-bin
CGI script URL:       /cgi-bin
User files directory: /mounts/bobsced/WebMO
  • Got this error when authenticating with LDAP: Can't locate Authen/Simple/LDAP.pm
  • yum installing perl-LDAP.noarch didn't fix it, so used CPAN to install Authen::Simple::LDAP
  • Edited /var/www/cgi-bin/interfaces/authen.conf for our LDAP settings
  • Before externally authenticated users can use it, you have to go in as administrator and check the box to allow them in the WebMO group (or whatever other group)
  • Gamess:
    • yum install compat-gcc-34-g77.x86_64 and gfortran
    • Followed the directions from the WebMO site
  • Added the following line to httpd.conf:
SuexecUserGroup bob users
  • Gaussian 09 not supported, though it's installed in /mounts/bobsced/usr/local/g09
  • Installed g03, but it fails with errors like:
Erroneous write during file extend. write 160 instead of 4096
Probably out of disk space.
Write error in NtrExt1: No such file or directory

or

Write error in NtrExt1: Bad address
    • To fix this, run echo 0 > /proc/sys/kernel/randomize_va_space
    • This does not survive a reboot, so it still needs to be set up to happen on every boot
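
One common way to make the setting persistent is an entry in /etc/sysctl.conf, which is applied at boot. The log does not say which method was eventually used, so the following is only a sketch; persist_sysctl is a hypothetical helper that adds or updates a single key in a sysctl-style file.

```shell
# Hypothetical helper: set KEY = VALUE in a sysctl configuration file.
# Usage: persist_sysctl KEY VALUE [FILE]   (FILE defaults to /etc/sysctl.conf)
persist_sysctl() {
    key=$1
    value=$2
    conf=${3:-/etc/sysctl.conf}
    # Escape dots so the key is matched literally in the regular expressions.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^[[:space:]]*${pattern}[[:space:]]*=" "$conf" 2>/dev/null; then
        # The key is already present: rewrite that line in place.
        tmp=$(mktemp) || return 1
        sed "s|^[[:space:]]*${pattern}[[:space:]]*=.*|$key = $value|" "$conf" > "$tmp" \
            && cat "$tmp" > "$conf"
        status=$?
        rm -f "$tmp"
        return $status
    else
        # The key is absent: append it.
        printf '%s = %s\n' "$key" "$value" >> "$conf"
    fi
}

# Example, run as root; sysctl -p then applies the file without a reboot:
#   persist_sysctl kernel.randomize_va_space 0
#   sysctl -p
```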

Infiniband

  • Drivers downloaded from here - the Red Hat 5.3 ones
    • Mount the ISO as a loopback somewhere (e.g. mount -o loop /mounts/bobsced/usr/src/MLNX_OFED_LINUX-1.4-rhel5.3.iso /media/)
    • Run the installer with --msm (e.g. /media/mlnxofedinstall --msm)
  • Need to boot into the kernel that came in the original install (2.6.18-128.el5), otherwise get a message like this:
The 2.6.18-164.el5 kernel is installed, but do not have drivers available. 
Cannot continue.
  • Then (before rebooting back to old kernel), run mst start
  • Then symlink the driver directory to the current kernel version, like this (see Rocks discussion here about it):
    • The version number to use appears in the error that mst start gives you
ln -s /usr/mst/lib/2.6.18-128.el5/ /usr/mst/lib/2.6.18-164.2.1.el5.plus
  • Success looks like this:
[root@bs0-new ~]# mst start
    Starting MST (Mellanox Software Tools) driver set: 
Loading MST PCI module                                     [  OK  ]
Loading MST PCI configuration module                       [  OK  ]
Saving configuration for PCI device 01:00.0                [  OK  ]
Create devices
  • Does this need to be done every time? It looks like yes.
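
If the symlink does have to be recreated whenever the kernel changes, the step can be scripted. A minimal sketch, assuming the drivers stay under /usr/mst/lib in a directory named after the kernel they were built for; link_mst_modules is a hypothetical helper, not part of the Mellanox tools.

```shell
# Hypothetical helper: point the running kernel's MST directory at the
# directory of the kernel the drivers were built for.
# Usage: link_mst_modules BASE_DIR BUILT_FOR_VERSION [RUNNING_VERSION]
link_mst_modules() {
    base=$1
    built_for=$2
    # Default to the kernel that is currently running.
    running=${3:-$(uname -r)}
    if [ ! -d "$base/$built_for" ]; then
        echo "no drivers found at $base/$built_for" >&2
        return 1
    fi
    # Nothing to do if a directory or link for this kernel already exists.
    if [ -e "$base/$running" ]; then
        return 0
    fi
    ln -s "$base/$built_for" "$base/$running"
}

# Example, run as root before mst start:
#   link_mst_modules /usr/mst/lib 2.6.18-128.el5
```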

Cluster Image Review

  • LDAP needs to be installed on hopper
  • Need to test if these are working
    • mpich1
    • mpich2
    • openmpi
  • Intel MPI needs to be installed (maybe?)
  • Init scripts for pbs and maui need to be set up
| | Command Line -np 8 | Command Line -np 9 | Torque -np 8 | Torque -np 9 | Machinefile | Notes |
|---|---|---|---|---|---|---|
| mpich1 | OK, except allocates 1 process to bs0 | OK, except allocates 1 process to bs0 | OK | OK | /mounts/bobsced/usr/local/modules-sw/mpich1/1.2.7p1/share/machines.LINUX | Creates temporary file PIxxxxx while running under qsub |
| mpich2 | | | | | | |
| openmpi 1.3.1 | | | | | | |
| openmpi 1.3.3 | | | | | | |
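
The "allocates 1 process to bs0" observation from the -np 8 and -np 9 runs can be checked by having every process print its hostname and counting the lines per host. A sketch, assuming the launcher can start a non-MPI program such as hostname (Open MPI's mpirun can; other launchers may need a small MPI program that prints its host instead). count_by_host is a hypothetical helper.

```shell
# Hypothetical helper: read one hostname per line on stdin and print
# "host: count" for each distinct host, sorted by hostname.
count_by_host() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Example:
#   mpirun -np 8 hostname | count_by_host
# A line for bs0 in the output means a process landed on the head node.
```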