Difference between revisions of "Bobsced Cluster"
(added info on disabling kickstart after hard reset) |
(→RLIMIT_MEMLOCK) |
||
(8 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | + | =Todo= | |
− | + | * 411 tools | |
− | + | ** fix ganglia to recognize broadcasts & update | |
− | * 411 tools | + | * Naming scheme |
− | * Naming scheme bs* vs compute-*-* vs c*-* | + | ** bs* vs compute-*-* vs c*-* |
** This is terrible it needs work | ** This is terrible it needs work | ||
− | * Updating bobsced0's RPM repo | + | *Updating bobsced0's RPM repo |
** yum-- free | ** yum-- free | ||
** up2date-- RHEL | ** up2date-- RHEL | ||
** "Aborting the rocks-update tool while the tool is downloading RPMs might produce corrupted RPM packages (SDSC Toolkit)" from pr_troubleshooting.doc | ** "Aborting the rocks-update tool while the tool is downloading RPMs might produce corrupted RPM packages (SDSC Toolkit)" from pr_troubleshooting.doc | ||
− | * NIS map | + | *NIS map |
− | * What broke <code>cluster-fork</code>? | + | **<code>/etc/passwd & /etc/group</code> permissions |
+ | **An architecture without a variable amount of delay before BobSCEd is updated would be nice. | ||
+ | *What broke <code>cluster-fork</code>? | ||
+ | *NAT & NFS | ||
+ | **[https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-October/021958.html Mailing list] | ||
+ | **[https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-May/018503.html Mailing list] | ||
+ | **We should consider flattening the network, that is moving everything into the 159.28.234/24 subnet. | ||
+ | *Stop the hourly cron from producing output on stdout unless there is an error | ||
+ | *Setup and test Infiniband fabric | ||
+ | *Why does hopper see an interface flap from bobsced0? | ||
+ | *bobsced0 wants to be the DNS server for the compute nodes | ||
− | + | =Howtos= | |
− | + | ==Updating nodes to be kickstarted & adding new packages== | |
* bobsced0 can be updated by just installing rpms | * bobsced0 can be updated by just installing rpms | ||
* Check for an RPM in: <code>/state/partition1/home/install/rocks-dist/lan/x86_64/RedHat/RPMS/</code> | * Check for an RPM in: <code>/state/partition1/home/install/rocks-dist/lan/x86_64/RedHat/RPMS/</code> | ||
Line 29: | Line 39: | ||
** <code>ssh -p 2200 compute-x-x</code> | ** <code>ssh -p 2200 compute-x-x</code> | ||
− | + | == Adding post install scripts to kickstart == | |
* Edit <code>/home/install/site-profiles/4.1.1/nodes/extend-compute.xml</code> | * Edit <code>/home/install/site-profiles/4.1.1/nodes/extend-compute.xml</code> | ||
* Add a <code><post arch="x86_64"></code> entry i.e.: | * Add a <code><post arch="x86_64"></code> entry i.e.: | ||
** <code><post arch="x86_64">cp /cluster/ganglia/gmond.conf /etc/gmond.conf</post></code> | ** <code><post arch="x86_64">cp /cluster/ganglia/gmond.conf /etc/gmond.conf</post></code> | ||
− | + | ||
+ | == Using 411 tools == | ||
* make -C /var/411 on bobsced0 | * make -C /var/411 on bobsced0 | ||
** Copies the files to /etc/411.d/ using 411put | ** Copies the files to /etc/411.d/ using 411put | ||
Line 39: | Line 50: | ||
** The files that are watched can be updated by changing the makefiles in /var/411/ | ** The files that are watched can be updated by changing the makefiles in /var/411/ | ||
* <code>cluster-fork /opt/rocks/bin/411get --all</code> | * <code>cluster-fork /opt/rocks/bin/411get --all</code> | ||
− | + | ||
+ | == cluster-fork == | ||
* Used to run commands on all cluster nodes like the c3tools | * Used to run commands on all cluster nodes like the c3tools | ||
** Broken, see todo | ** Broken, see todo | ||
** Temporary fix: <code>cluster-fork --nodes="compute-0-%d:0-14" <command> </code> | ** Temporary fix: <code>cluster-fork --nodes="compute-0-%d:0-14" <command> </code> | ||
− | + | == disabling reinstall (kickstart) after hard reset == | |
* [http://www.rocksclusters.org/rocks-documentation/4.2.1/faq-configuration.html#DISABLE-REINSTALL Official documentation] | * [http://www.rocksclusters.org/rocks-documentation/4.2.1/faq-configuration.html#DISABLE-REINSTALL Official documentation] | ||
* [https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-December/022969.html From the mailing list] | * [https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-December/022969.html From the mailing list] | ||
− | ==General Info== | + | == Generating an up-to-date machinefile == |
− | + | Ethernet: | |
− | + | cluster-fork /sbin/ifconfig -a | grep -1 Ethernet | awk '{printf("%s slots=4\n",$2)}' | cut -d : -f 2 > bs-eth-hosts | |
− | + | Infiniband: | |
− | + | cluster-fork /sbin/ifconfig -a | grep -1 UNSPEC | awk '{printf("%s slots=4\n",$2)}' | cut -d : -f 2 > bs-ib-hosts | |
− | + | ||
− | + | Notice that the only difference is the search field in the first grep command. UNSPEC here refers to Infiniband. | |
− | + | ||
− | + | =General Info= | |
− | + | ==NIS Importing== | |
+ | * <code>/etc/cron.hourly/importNIS.sh</code> | ||
+ | * This comes from the rocks users guide & a mailing list thread. | ||
+ | ==http== | ||
+ | * <code>/cluster/www/bobsced/</code> | ||
+ | ==<code>/cluster</code>== | ||
+ | * Mounted using <code>/etc/rc.local</code> | ||
+ | ==<code>/cluster/bobsced/etc/</code>== | ||
+ | * What's in here? Things for client or bobsced0? | ||
+ | |||
+ | =Known Error Messages= | ||
+ | Please refer here if you encounter an error message on BobSCEd that you cannot handle. | ||
+ | ==RLIMIT_MEMLOCK== | ||
+ | $./area_mpi | ||
+ | libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. | ||
+ | This will severely limit memory registrations.-------------------------------------------------------------------------- | ||
+ | The OpenFabrics (openib) BTL failed to initialize while trying to | ||
+ | allocate some locked memory. This typically can indicate that the | ||
+ | memlock limits are set too low. For most HPC installations, the | ||
+ | memlock limits should be set to "unlimited". The failure occured | ||
+ | here: | ||
+ | Local host: bobsced0 | ||
+ | OMPI source: btl_openib_component.c:1040 | ||
+ | Function: ompi_free_list_init_ex_new() | ||
+ | Device: mthca0 | ||
+ | Memlock limit: 32768 | ||
+ | You may need to consult with your system administrator to get this | ||
+ | problem fixed. This FAQ entry on the Open MPI web site may also be | ||
+ | helpful: | ||
+ | http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages | ||
+ | -------------------------------------------------------------------------- | ||
+ | -------------------------------------------------------------------------- | ||
+ | WARNING: There was an error initializing an OpenFabrics device. | ||
+ | Local host: bobsced0 | ||
+ | Local device: mthca0 | ||
+ | -------------------------------------------------------------------------- | ||
+ | This error is specifically related to infiniband on BobSCEd. If you're using ethernet, specify as such by creating a file, | ||
+ | ~/.openmpi/mca-params.conf | ||
+ | with the first line being | ||
+ | btl = ^openib | ||
+ | This will tell BobSCEd not to try infiniband, and will stop the error. | ||
+ | If you are using Infiniband and find a solution to this error, please place it in the wiki. | ||
− | == | + | =References= |
− | + | ==Rocks Documentation== | |
− | + | * [http://www.rocksclusters.org/rocks-documentation/4.1/rocks-usersguide-4.1.pdf Rocks users guide pdf] | |
− | + | * [http://www.rocksclusters.org/rocks-documentation/4.1/ Online version] | |
* [http://www.dell.com/downloads/global/power/ps4q05-20050227-Ali.pdf Platform rocks] | * [http://www.dell.com/downloads/global/power/ps4q05-20050227-Ali.pdf Platform rocks] | ||
+ | ==Troubleshooting Platform Open Cluster Stack (OCS) and Platform Lava== | ||
* pr_troubleshooting.doc | * pr_troubleshooting.doc | ||
+ | ==411 Tools== | ||
* [http://www.rocksclusters.org/rocks-doc/papers/hpdc2005/hpdc2005-411.pdf 411tools] | * [http://www.rocksclusters.org/rocks-doc/papers/hpdc2005/hpdc2005-411.pdf 411tools] | ||
− | * [http://www.centos.org/docs/4/pdf/rhel-ig-x8664-multi-en.pdf RHEL] | + | ==RHEL== |
+ | * [http://www.centos.org/docs/4/pdf/rhel-ig-x8664-multi-en.pdf RHEL] | ||
+ | * [http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/pdf/rhel-isa-en.pdf More RHEL] |
Latest revision as of 11:51, 4 January 2010
Todo
- 411 tools
- fix ganglia to recognize broadcasts & update
- Naming scheme
- bs* vs compute-*-* vs c*-*
- This is terrible; it needs work
- Updating bobsced0's RPM repo
- yum: free
- up2date: RHEL
- "Aborting the rocks-update tool while the tool is downloading RPMs might produce corrupted RPM packages (SDSC Toolkit)" from pr_troubleshooting.doc
- NIS map
- /etc/passwd & /etc/group permissions
- An architecture without a variable amount of delay before BobSCEd is updated would be nice.
- What broke cluster-fork?
- NAT & NFS
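- Mailing list: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-October/021958.html
- Mailing list: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-May/018503.html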
- We should consider flattening the network, that is moving everything into the 159.28.234/24 subnet.
- Stop the hourly cron from producing output on stdout unless there is an error
- Set up and test Infiniband fabric
- Why does hopper see an interface flap from bobsced0?
- bobsced0 wants to be the DNS server for the compute nodes
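One way the hourly cron item above could be approached is a small wrapper that stays silent on success and replays the job's output only when it fails. This is a minimal sketch, not something taken from the cluster: `quiet_unless_error` is a hypothetical helper name, and the job being wrapped is left to the caller.

```shell
#!/bin/sh
# quiet_unless_error is a hypothetical helper, not part of Rocks.
# It runs the given command, discards its output on success, and
# replays the captured output on stderr when the command fails.
quiet_unless_error() {
    log=$(mktemp) || return 1
    if "$@" >"$log" 2>&1; then
        status=0
    else
        status=$?          # exit status of the wrapped command
        cat "$log" >&2     # only a failing run produces any output
    fi
    rm -f "$log"
    return "$status"
}

# Usage (inside a cron script): quiet_unless_error /path/to/job arg1 arg2
```

Because cron only sends mail when a job prints something, wrapping the job this way should leave the mailbox quiet unless something actually broke.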
Howtos
Updating nodes to be kickstarted & adding new packages
- bobsced0 can be updated by just installing RPMs
- Check for an RPM in: /state/partition1/home/install/rocks-dist/lan/x86_64/RedHat/RPMS/
- Edit /home/install/site-profiles/4.1.1/nodes/extend-compute.xml
- Add a package, e.g. <package arch="x86_64">libgfortran</package>
- Update the files that get loaded on kickstart:
cd /home/install
rocks-dist dist
- Check the kickstart file: dbreport kickstart c0-0
- If there were no errors, kickstart the node, e.g.: shoot-node c0-0
- Check the progress of a kickstart: ssh -p 2200 compute-x-x
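The rebuild, check, and reinstall steps above can be strung together for one node. The sketch below uses only the commands already listed; `rekickstart_node`, `run`, and the `DRY_RUN` switch are hypothetical names added for illustration, and it assumes (unverified) that `dbreport` exits non-zero when the kickstart file has errors.

```shell
#!/bin/sh
# run: execute a command, or just print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rekickstart_node NODE: rebuild the distribution, check the node's
# kickstart file, then reinstall the node. Hypothetical helper name.
rekickstart_node() {
    node=$1
    [ -n "$node" ] || { echo "usage: rekickstart_node <node>" >&2; return 2; }
    # Subshell so the cd does not leak into the caller's shell.
    ( run cd /home/install && run rocks-dist dist ) || return 1
    # Assumption: a broken kickstart file makes dbreport fail.
    run dbreport kickstart "$node" || return 1
    run shoot-node "$node"
}

# Usage: DRY_RUN=1 rekickstart_node c0-0   (prints the steps without running them)
```

Running it with `DRY_RUN=1` first is a cheap way to confirm the node name before anything is reinstalled.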
Adding post install scripts to kickstart
- Edit /home/install/site-profiles/4.1.1/nodes/extend-compute.xml
- Add a <post arch="x86_64"> entry, e.g.: <post arch="x86_64">cp /cluster/ganglia/gmond.conf /etc/gmond.conf</post>
Using 411 tools
- Run make -C /var/411 on bobsced0
- Copies the files to /etc/411.d/ using 411put
- Notifies client nodes to run 411get using ganglia
- The files that are watched can be updated by changing the makefiles in /var/411/
- To fetch the files on every node by hand: cluster-fork /opt/rocks/bin/411get --all
cluster-fork
- Runs a command on all cluster nodes, similar to the c3tools
- Broken, see Todo
- Temporary fix: cluster-fork --nodes="compute-0-%d:0-14" <command>
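To make the node pattern in the temporary fix concrete, the sketch below prints the host names that a pattern such as "compute-0-%d:0-14" is assumed to cover, that is a printf-style format followed by a first-last range. `expand_nodes` is a hypothetical helper written for illustration; it handles a single range only and is not how cluster-fork itself is implemented.

```shell
#!/bin/sh
# expand_nodes PATTERN: print one host name per line for a pattern of the
# form "format:first-last", e.g. "compute-0-%d:0-14". Hypothetical helper.
expand_nodes() {
    fmt=${1%%:*}          # "compute-0-%d"
    range=${1#*:}         # "0-14"
    first=${range%-*}     # "0"
    last=${range#*-}      # "14"
    i=$first
    while [ "$i" -le "$last" ]; do
        # The format string comes from the pattern on purpose.
        printf "$fmt\n" "$i"
        i=$((i + 1))
    done
}

# Usage: for h in $(expand_nodes "compute-0-%d:0-14"); do ssh "$h" uptime; done
```

The loop in the usage comment is also a fallback for running a command on every node while cluster-fork remains broken.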
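Disabling reinstall (kickstart) after hard reset
- Official documentation: http://www.rocksclusters.org/rocks-documentation/4.2.1/faq-configuration.html#DISABLE-REINSTALL
- From the mailing list: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2006-December/022969.html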
Generating an up-to-date machinefile
Ethernet:
cluster-fork /sbin/ifconfig -a | grep -1 Ethernet | awk '{printf("%s slots=4\n",$2)}' | cut -d : -f 2 > bs-eth-hosts
Infiniband:
cluster-fork /sbin/ifconfig -a | grep -1 UNSPEC | awk '{printf("%s slots=4\n",$2)}' | cut -d : -f 2 > bs-ib-hosts
The two commands differ only in the pattern given to grep. UNSPEC is the link type that ifconfig reports for the Infiniband interfaces.
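The same extraction can be sketched as a single awk filter that takes the link type as a parameter. It assumes the old net-tools ifconfig layout implied by the grep patterns above ("Link encap:..." on the first line of an interface, "inet addr:..." on a following line). `addrs_for_encap` is a hypothetical name; `slots=4` is taken from the commands above.

```shell
#!/bin/sh
# addrs_for_encap ENCAP: read ifconfig -a output on stdin and print
# "ADDRESS slots=4" for every interface whose link type matches ENCAP.
# Hypothetical helper; assumes old-style net-tools ifconfig output.
addrs_for_encap() {
    awk -v encap="$1" '
        # A "Link encap:" line starts a new interface stanza.
        /Link encap:/ { want = ($0 ~ ("Link encap:" encap)) }
        want && /inet addr:/ {
            split($2, parts, ":")        # $2 looks like "addr:10.1.255.254"
            print parts[2] " slots=4"
            want = 0                     # one address per interface
        }
    '
}

# Usage:
#   cluster-fork /sbin/ifconfig -a | addrs_for_encap Ethernet > bs-eth-hosts
#   cluster-fork /sbin/ifconfig -a | addrs_for_encap UNSPEC   > bs-ib-hosts
```

Matching on the stanza header instead of using grep context lines keeps loopback addresses and separator lines out of the machinefile.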
General Info
NIS Importing
/etc/cron.hourly/importNIS.sh
- This comes from the Rocks users guide & a mailing list thread.
http
/cluster/www/bobsced/
/cluster
- Mounted using /etc/rc.local
/cluster/bobsced/etc/
- What's in here? Things for client or bobsced0?
Known Error Messages
Check here if you hit an error message on BobSCEd that you cannot resolve.
RLIMIT_MEMLOCK
$ ./area_mpi
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory. This typically can indicate that the
memlock limits are set too low. For most HPC installations, the
memlock limits should be set to "unlimited". The failure occured
here:
Local host: bobsced0
OMPI source: btl_openib_component.c:1040
Function: ompi_free_list_init_ex_new()
Device: mthca0
Memlock limit: 32768
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: bobsced0
Local device: mthca0
--------------------------------------------------------------------------
This error is specific to Infiniband on BobSCEd. If you are using Ethernet, say so by creating the file
~/.openmpi/mca-params.conf
with the first line being
btl = ^openib
This tells Open MPI not to try the openib (Infiniband) transport, which stops the error. If you are using Infiniband and find a solution to this error, please add it to the wiki.
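Creating the parameter file can be scripted so that it is safe to run more than once. The sketch below is one possible way to do it: `disable_openib` is a hypothetical helper name, and it appends the line instead of forcing it to be first, on the assumption that the order of lines in the file does not matter.

```shell
#!/bin/sh
# disable_openib [DIR]: make sure DIR/mca-params.conf (default ~/.openmpi)
# contains "btl = ^openib". Hypothetical helper; it leaves the file alone
# when any btl setting is already present.
disable_openib() {
    dir=${1:-$HOME/.openmpi}
    conf=$dir/mca-params.conf
    mkdir -p "$dir" || return 1
    if [ -f "$conf" ] && grep -q '^[[:space:]]*btl[[:space:]]*=' "$conf"; then
        return 0    # an existing btl line wins; do not add a second one
    fi
    echo 'btl = ^openib' >> "$conf"
}

# Usage: disable_openib            (writes ~/.openmpi/mca-params.conf)
```

For anyone debugging the Infiniband case itself, `ulimit -l` prints the locked-memory limit of the current shell, which is the limit the error message is complaining about.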
References
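Rocks Documentation
- Rocks users guide pdf: http://www.rocksclusters.org/rocks-documentation/4.1/rocks-usersguide-4.1.pdf
- Online version: http://www.rocksclusters.org/rocks-documentation/4.1/
- Platform rocks: http://www.dell.com/downloads/global/power/ps4q05-20050227-Ali.pdf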
Troubleshooting Platform Open Cluster Stack (OCS) and Platform Lava
- pr_troubleshooting.doc