Sysadmin:Old:Start/Shutdown
The machine room is located behind the NoYes lab. It houses 2 main infrastructures, the CS servers and the cluster, which are spread over 3 racks. The CS servers and systems are on the Arctic rack, and the cluster servers and systems are on the Equatorial and Antarctica racks.
Here is the list of all servers and nodes. A server shown in green is on UPS backup power.
| Subnet | Servers and nodes |
|---|---|
| CS Subnet | murphy, 41 (quark), elwood, elwood', babbage, hopper, sage, bestey |
| Cluster Subnet | as0 - as12, bigFe, dali, lo0 - lo4, fatboy, t-voc, bs0 - bs11 |
Shutting down / start up of the CS servers
Shutting Down
The order of shutdown and startup is very important. When shutting down the CS servers, make sure that murphy goes down last and babbage goes down just before murphy. Everything else goes down before those two, in any order.
Starting up
When bringing the CS servers up, make sure that murphy comes up first and babbage second, followed by all other servers in no particular order.
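The ordering rules for shutdown and startup can be sketched as a small dry-run shell helper. The host names are taken from the server list on this page; the function names and the CS_OTHERS variable are hypothetical, and elwood' is left out because its actual hostname is not given here. The script only prints the order and does not contact any machine.

```shell
#!/bin/sh
# Hypothetical helper: print the CS servers in a safe order.
# CS_OTHERS holds every CS server except babbage and murphy;
# these may go down (or come up) in any order relative to each other.
CS_OTHERS="41 elwood hopper sage bestey"

cs_shutdown_order() {
    # everything else first, then babbage, then murphy last
    for host in $CS_OTHERS; do
        echo "$host"
    done
    echo babbage
    echo murphy
}

cs_startup_order() {
    # murphy first, babbage second, then the rest
    echo murphy
    echo babbage
    for host in $CS_OTHERS; do
        echo "$host"
    done
}

# Dry run: print the shutdown order only.
cs_shutdown_order
```

Calling cs_startup_order instead prints the reverse priority, with murphy and babbage at the top of the list.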
Other notes:
- to start up murphy after an unsuccessful boot, use the fsck command:
  - go to single user mode
  - df -h
  - cat /etc/fstab (look for the matching mount names of the partitions; fsck should be issued in the right order: first on the '/' partition, then '/var', and then the rest)
  - fsck -y /dev/mfid0s1d (example for /var)
  - after the fsck runs are done, press Ctrl-D
- when 41 is up, log in and start the quark VM:
  - localhost:8080 (in the browser)
  - log in as root with the root password
  - in the VMware console, press power on
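The fsck ordering from the murphy recovery steps above ('/' first, then '/var', then the rest) can be sketched as a filter over fstab-formatted text. This is a minimal sketch, not the procedure actually used on murphy: the function names, the sample device names other than /dev/mfid0s1d, and the assumption that only ufs partitions need checking are all illustrative. It prints the commands and runs nothing.

```shell
#!/bin/sh
# Read fstab-formatted text on stdin and print "fsck -y <device>" lines
# in the order: '/' first, then '/var', then every other ufs partition.
fsck_order() {
    awk '
        $1 ~ /^#/ || NF < 3 { next }   # skip comments and short lines
        $3 != "ufs"         { next }   # assumption: only ufs is checked
        $2 == "/"    { root = $1; next }
        $2 == "/var" { var = $1; next }
        { rest[++n] = $1 }
        END {
            if (root != "") print "fsck -y " root
            if (var != "")  print "fsck -y " var
            for (i = 1; i <= n; i++) print "fsck -y " rest[i]
        }
    '
}

# Illustrative fstab; only the /var device name comes from this page.
sample_fstab() {
    cat <<'EOF'
# Device        Mountpoint  FStype  Options  Dump  Pass#
/dev/mfid0s1b   none        swap    sw       0     0
/dev/mfid0s1e   /tmp        ufs     rw       2     2
/dev/mfid0s1d   /var        ufs     rw       2     2
/dev/mfid0s1a   /           ufs     rw       1     1
EOF
}

sample_fstab | fsck_order
```

On the real machine the input would be /etc/fstab itself (fsck_order < /etc/fstab), and the printed commands should be reviewed before they are typed in single user mode.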
Shutting down / start up of the cluster servers
Shutting Down
The first step is to shut down as1 - as12 (working nodes). To do so, ssh from hopper to as0 and then become root on as0:
as0$ sudo su - root
Now write a message to all users that the system is going down (good sysadmin practice):
# wall this system is going down in 5 minutes because of...
To see who is on the system, just type "who" in the shell.
To shut down all working nodes (as1 - as12), use the cexecs command:
# cexecs shutdown -h now
To check whether the working nodes are still up:
# cexecs uptime
Once all working nodes are down, you can shut down as0 (the server).
Now you can shut down the rest of the cluster servers.
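Waiting for "all working nodes are down" can also be sketched as a polling loop rather than rerunning the check by hand. This is only a sketch under stated assumptions: using ping as the probe is not something this page prescribes, and the wait_for_down function and the PROBE and DELAY variables are hypothetical names.

```shell
#!/bin/sh
# Wait until every listed node stops answering a probe, or give up.
# PROBE is the command used to test a host; ping is an assumption here.
PROBE=${PROBE:-"ping -c 1"}

# usage: wait_for_down <max tries per host> <host>...
wait_for_down() {
    tries=$1; shift
    for host in "$@"; do
        n=0
        # loop while the probe still succeeds, i.e. the host is up
        while $PROBE "$host" >/dev/null 2>&1; do
            n=$((n + 1))
            if [ "$n" -ge "$tries" ]; then
                echo "$host is still up"
                return 1
            fi
            sleep "${DELAY:-5}"
        done
        echo "$host is down"
    done
}

# Example (not run here): wait_for_down 60 as1 as2 as3
```

The function stops at the first node that is still answering after the given number of tries, so as0 is not shut down while a working node is still up.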
Starting up
When bringing the cluster servers up, it is important to bring as0 up first. Once as0 is up, bring the working nodes up. Also bring the other cluster servers up.
Planned Building/Campus Power Outages
This is easy-peasy now. Most of the work is setting up lights so you can work in the lab while the power is out. Run an extension cord from one of the free wall outlets in the machine room into the lab, put a power strip on it, and feed the desk lamp, floor lamps, power bricks, etc.
- Power Down
hopper# ssh al-salam.cluster.earlham.edu "cexecs shutdown -h now"
hopper# ssh bobsced.cluster.earlham.edu "cexecs shutdown -h now"
hopper# ssh layout.cluster.earlham.edu "cexecs shutdown -h now"
in the future - positron# cexecs acls: shutdown -h now
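The three power-down commands above differ only in the head node name, so they can be generated from a list. A minimal dry-run sketch, assuming it is run from hopper as in the commands above; the HEADS and DOMAIN variables and the shutdown_cmds function are hypothetical names, and the script only prints the commands.

```shell
#!/bin/sh
# Head nodes and domain taken from the power-down commands on this page.
HEADS="al-salam bobsced layout"
DOMAIN="cluster.earlham.edu"

# Print one ssh command per head node; nothing is executed.
shutdown_cmds() {
    for head in $HEADS; do
        echo "ssh $head.$DOMAIN \"cexecs shutdown -h now\""
    done
}

shutdown_cmds
```

Piping the output to sh would actually run the commands, so review the printed lines first.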
- Power Up
Press the power buttons on the machines shut down above, wait 5 minutes, then check them with cexec.