Cluster: New BobSCEd Install Log

From Earlham CS Department
Revision as of 19:24, 17 October 2009 by Kay (talk | contribs) (Infiniband)

Scratch Space

Log

Green color indicates something that still needs to be done.

Cloning

  • Download the udpcast rpm from http://udpcast.linux.lu/source.html
    • Install with yum --nogpgcheck localinstall udpcast-20081213-1.i386.rpm
    • On hopper, installed the syslinux and tftpd-hpa ports
      • Enable tftpd in /etc/inetd.conf by uncommenting its line, restart inetd with /etc/rc.d/inetd restart, and then also run the command listed on that line to start tftpd
      • The lines allow booting; and allow bootp; were already in /usr/local/etc/dhcpd.conf. Put the filename in the group for the nodes being booted (see Debian Clusters)
      • cp /usr/local/share/syslinux/pxelinux.0 /tftpboot/ (the /tftpboot directory needs to be created)
      • Download linux, initrd, and default from the udpcast site into /tftpboot
      • Move default into /tftpboot/pxelinux.cfg
      • Restart dhcpd (killall -KILL dhcpd, then /usr/local/sbin/dhcpd -q -cf /usr/local/etc/dhcpd.conf -lf /var/db/dhcpd/dhcpd.leases -pf /var/run/dhcpd/dhcpd.pid -user dhcpd -group dhcpd)
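
A quick way to confirm the TFTP side of the steps above before PXE-booting a node is to check that the expected files are in place. This is only a sketch: the function name check_tftp_root is made up for illustration, and it assumes the udpcast files kept the names linux, initrd, and default.

```shell
#!/bin/sh
# Check that a TFTP root holds the files the PXE steps above put there.
# Usage: check_tftp_root [DIR]   (DIR defaults to /tftpboot)
# check_tftp_root is a hypothetical helper, not part of syslinux or udpcast.
check_tftp_root() {
    root="${1:-/tftpboot}"
    missing=0
    for f in pxelinux.0 linux initrd pxelinux.cfg/default; do
        if [ -f "$root/$f" ]; then
            echo "ok       $root/$f"
        else
            echo "MISSING  $root/$f"
            missing=1
        fi
    done
    # Exit status 0 only when every file was found.
    return "$missing"
}
```

Running check_tftp_root with no argument on hopper inspects /tftpboot; any MISSING line points at a step that was skipped.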

Head Node

Yum installed:

  • gcc.x86_64, gcc-c++.x86_64
  • for Ganglia:
    • apr.x86_64 and apr-devel.x86_64
    • libconfuse-2.5-4.el5.x86_64.rpm, libconfuse-devel-2.5-4.el5.x86_64.rpm (from Fedora repositories)
    • expat-devel.x86_64
  • for Intel updates:
    • compat-libstdc++-33.i386
  • blas.x86_64 (on all nodes)

Install C3 tools from http://www.csm.ornl.gov/torc/C3/C3softwarepage.shtml

  • Downloaded full install rpm on bs0, installed with yum --nogpgcheck localinstall c3-4.0.1-1.noarch.rpm

Ganglia

  • On hopper, added the data_source line for bs0 to /usr/local/etc/gmetad.conf and restarted it with /usr/local/etc/rc.d/gmetad restart
  • Downloaded tar ball from http://sourceforge.net/projects/ganglia/
    • See Ganglia README
    • ./configure --prefix=/cluster
    • The head node uses its own gmond.conf at /etc/ganglia/gmond.conf, while the workers just have theirs symlinked to /cluster/etc/gmond.conf
    • By default, iptables is running on the CentOS install and blocks hopper's Ganglia requests
      • Turned off by clearing the rules and then running /sbin/service iptables save
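
For reference, the data_source line added on hopper has roughly the following shape. This is a sketch, not the line actually used: the helper name gmetad_source and the cluster label "BobSCEd" are assumptions, and 8649 is gmond's default port.

```shell
#!/bin/sh
# Print a gmetad.conf data_source line.
# Usage: gmetad_source LABEL HOST [PORT]
# gmetad_source is a hypothetical helper; LABEL is whatever name should
# appear in the Ganglia web front end (the label below is an assumption).
gmetad_source() {
    printf 'data_source "%s" %s:%s\n' "$1" "$2" "${3:-8649}"
}

gmetad_source "BobSCEd" bs0
```

The printed line would go into /usr/local/etc/gmetad.conf before restarting gmetad.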

Networking

  • Shorewall, see /etc/shorewall/params for almost all of the important definitions
    • Natting is done through /etc/shorewall/masq
  • DHCP relay (installed as part of the dhcp yum package): enabled at boot with chkconfig on and set to relay to hopper
    • See /etc/sysconfig/dhcrelay
    • The package also installs a DHCP server, but it is neither configured nor set to run
    • Hopper needs a static route so that the responses can get back. These lines are in /etc/rc.conf:
static_routes="bs0"
route_bs0="192.168.0.1 159.28.234.200"
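
To see what those two rc.conf lines amount to, the sketch below expands them the way the FreeBSD rc scripts are assumed to: each name in static_routes selects a route_NAME variable whose value is handed to route add. It only prints the command, and preview_static_routes is a made-up name.

```shell
#!/bin/sh
# rc.conf-style settings copied from the log.
static_routes="bs0"
route_bs0="192.168.0.1 159.28.234.200"

# Print (do not run) the route command for each name in static_routes.
# Assumption: the rc scripts pass the route_NAME value to "route add" as-is.
preview_static_routes() {
    for name in $static_routes; do
        eval "args=\$route_$name"
        echo "route add $args"
    done
}

preview_static_routes
```

Running the printed command by hand as root on hopper should add the route without waiting for a reboot.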

Modules

Torque

  • Installed from source with ./configure --with-default-server=bs0.bobsced.loc --with-rc=scp --disable-mom --with-server-home=/var/spool/pbs ("clients" is what installs qmgr)
  • Installs to /usr/local/
  • Set up according to Debian Clusters setup
  • Reran ./configure without --disable-mom, then ran make packages and copied the result to the worker node

Maui

  • Installed Maui following the same Debian Clusters guide as above

Intel Firmware Updates

Mail

  • Configured sendmail by adding bs0.bobsced.loc to /etc/mail/local-host-names
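
If this step ever needs repeating, appending the name only when it is missing keeps the file free of duplicates. A minimal sketch; add_local_host is a hypothetical helper name.

```shell
#!/bin/sh
# Append HOST to FILE unless an identical line is already there.
# Usage: add_local_host HOST [FILE]
# add_local_host is a hypothetical helper; FILE defaults to sendmail's list.
add_local_host() {
    host="$1"
    file="${2:-/etc/mail/local-host-names}"
    # -x matches whole lines, -F treats the host name as a literal string.
    grep -qxF "$host" "$file" 2>/dev/null || echo "$host" >> "$file"
}
```

For example, add_local_host bs0.bobsced.loc run as root; sendmail probably needs a restart before it picks up the change.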

NFS

  • The actual filesystem on bs0-new is at /mounts/bobsced. The nodes all mount this in the same place.
  • It's mounted on hopper at /mounts/bobsced, with the symlink (currently) at /cluster/bobscednew
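
A node's /etc/fstab entry for this mount would look something like the line printed below. This is a sketch under assumptions: the nodes are taken to reach the server by the name bs0-new, the mount options are left at defaults, and nfs_fstab_line is a made-up helper.

```shell
#!/bin/sh
# Print an fstab line that mounts SERVER:PATH at the same PATH locally.
# Usage: nfs_fstab_line SERVER PATH [OPTIONS]
# nfs_fstab_line is a hypothetical helper; OPTIONS defaults to "defaults".
nfs_fstab_line() {
    printf '%s:%s %s nfs %s 0 0\n' "$1" "$2" "$2" "${3:-defaults}"
}

nfs_fstab_line bs0-new /mounts/bobsced
```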

WebMO

  • yum installed httpd
  • Installed on bs0 with the following params:
Path to perl:         /usr/bin/perl
Webserver name:       bs0-new.cluster.earlham.edu
HTML directory:       /var/www/webmo
HTML URL:             /webmo
CGI script directory: /var/www/cgi-bin
CGI script URL:       /cgi-bin
User files directory: /mounts/bobsced/WebMO
  • Authenticating with LDAP gives this error: Can't locate Authen/Simple/LDAP.pm
  • yum installing perl-LDAP.noarch didn't fix it, so used CPAN to install Authen::Simple::LDAP
  • edited /var/www/cgi-bin/interfaces/authen.conf for our LDAP settings
  • Before externally authenticated users can use it, you have to go in as administrator and check the box to allow them in the WebMO group (or whatever other group)
  • GAMESS:
    • yum installed compat-gcc-34-g77.x86_64 and gfortran
    • Followed directions from the WebMO site
  • Added the following line to httpd.conf:
SuexecUserGroup bob users
  • Gaussian 09 not supported, though it's installed in /mounts/bobsced/usr/local/g09
  • Installed g03, but it fails with errors like:
Erroneous write during file extend. write 160 instead of 4096
Probably out of disk space.
Write error in NtrExt1: No such file or directory

or

Write error in NtrExt1: Bad address
    • To fix this, turn off address space randomization by running echo 0 > /proc/sys/kernel/randomize_va_space as root
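
Since the echo above does not survive a reboot, it helps to be able to check the setting before running g03. The sketch below only reads the value; aslr_status is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether address space randomization is off, as g03 needed here.
# Usage: aslr_status [FILE]   (FILE defaults to the /proc entry)
# aslr_status is a hypothetical helper, not part of WebMO or Gaussian.
aslr_status() {
    file="${1:-/proc/sys/kernel/randomize_va_space}"
    value=$(cat "$file") || return 2
    if [ "$value" = "0" ]; then
        echo "randomize_va_space=0 (disabled)"
    else
        echo "randomize_va_space=$value (enabled, g03 may fail as above)"
        return 1
    fi
}
```

To keep the setting across reboots, adding kernel.randomize_va_space = 0 to /etc/sysctl.conf should work, though that was not part of the original log.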

Infiniband

  • Drivers downloaded from here - the Red Hat 5.3 ones
    • Mount the ISO as a loopback somewhere (e.g. mount -o loop /mounts/bobsced/usr/src/MLNX_OFED_LINUX-1.4-rhel5.3.iso /media/)
    • Run the installer with --msm (e.g. /media/mlnxofedinstall --msm)
  • Need to boot into the kernel that came in the original install (2.6.18-128.el5), otherwise the installer stops with a message like this:
The 2.6.18-164.el5 kernel is installed, but do not have drivers available. 
Cannot continue.
  • Then (before rebooting back to old kernel), run mst start
  • Then symlink the MST directory for the original kernel to the current kernel version, like this (see the Rocks discussion about it):
ln -s /usr/mst/lib/2.6.18-128.el5/ /usr/mst/lib/2.6.18-164.2.1.el5.plus
    • The error that mst start gives should show the version number to use
  • Success looks like this:
[root@bs0-new ~]# mst start
    Starting MST (Mellanox Software Tools) driver set: 
Loading MST PCI module                                     [  OK  ]
Loading MST PCI configuration module                       [  OK  ]
Saving configuration for PCI device 01:00.0                [  OK  ]
Create devices
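
The symlink step above can be wrapped so it is safe to rerun after a kernel update. A sketch only: link_mst_modules is a made-up name, and it assumes the MST modules live under /usr/mst/lib/KERNEL-VERSION as in the ln -s command in the log.

```shell
#!/bin/sh
# Link the MST directory built for one kernel to the name of another kernel.
# Usage: link_mst_modules BUILT_FOR [RUNNING] [BASE]
# link_mst_modules is a hypothetical helper. RUNNING defaults to `uname -r`,
# BASE to /usr/mst/lib (the location used in the log).
link_mst_modules() {
    built_for="$1"
    running="${2:-$(uname -r)}"
    base="${3:-/usr/mst/lib}"
    if [ ! -d "$base/$built_for" ]; then
        echo "no such directory: $base/$built_for" >&2
        return 1
    fi
    # Leave an existing directory or link alone.
    if [ -e "$base/$running" ] || [ -L "$base/$running" ]; then
        echo "$base/$running already exists"
        return 0
    fi
    ln -s "$base/$built_for" "$base/$running"
}
```

For example, link_mst_modules 2.6.18-128.el5 run as root on the new kernel, followed by mst start.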