Difference between revisions of "ShutdownProcedure"
Latest revision as of 09:04, 19 August 2019
This is a general document outlining how to do a controlled system shutdown of all CS and Cluster servers, including bringing them back up. We try to do one controlled shutdown per semester.
For the shutdown script that runs as a daemon, see https://gitlab.cluster.earlham.edu/sysadmin/safe-shutdown and the wiki page "Power down before outages".
(Note that we sometimes have reason to shut down individual machines, which is very different from shutting down every server on every rack. For restarting an individual server, see the wiki page "Shut down one server".)
Before
- Read this document.
- If there's anything specific you want to do, know that going in - don't decide in the moment.
- One week in advance, notify:
  - admins
  - faculty
  - SciDiv
  - CS-students
- Make sure you can ssh to at least one sysadmin account.
- If you're doing lunch, get the credit card and collect orders early.
- Back up critical wiki pages: There is a script in sysadmin@home.cs.earlham.edu:~/wiki_critical/ called send_wiki.sh that specifies which pages to pull down and send out via email. This is important to do before anyone starts shutting down the machines because the wiki will go offline.
During
This is the procedure for the day of a shutdown. The basics:
1. Shut everything down.
2. Tidy things up.
3. Bring everything back up.
Suggestion: Don't do OS upgrades or similar during this process. Don't entangle startup issues with upgrade issues.
General Info
- Take notes. This cannot be emphasized enough: take notes. Keep it simple. During the Fall 2018 shutdown we used a yellow notepad and a cheap pen to record in the moment, and then put those notes into text in a Google Doc after the fact. But above all, take notes, and then share them. Notes should include: problems encountered, anything you had to start manually which should have started automatically at boot, etc.
- babbage should be the very last machine brought down
- To get to sysadmin@smiley, first ssh into home or hopper.
- Make sure all virtual machines are shut down before restarting the bare metal hardware.
Cluster
These are all (or almost all) physical machines, not virtual machines.
Order of shutdowns
1. all compute nodes: (layout, alsalam, whedon) and t-voc, bigfe, elwood
2. all head nodes: (layout, alsalam, whedon)
3. pollock
4. shinken
5. bronte + disk array
6. wait until everything up to this point has shut down
7. dali
8. kahlo
9. wait until everything up to this point has shut down
10. hopper
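The ordering above, including its two wait points, can be sketched as a small dry-run script. This is an illustration rather than the script we actually use: the function names, the `WAIT` marker, the `DRY_RUN` switch, and the idea of halting hosts with `ssh sysadmin@<host> sudo shutdown -h now` are all assumptions, and the per-cluster compute nodes are left out because their hostnames are not listed here.

```shell
#!/bin/sh
# Sketch only: walks the Cluster shutdown order with its two wait points.
# DRY_RUN=1 (the default) prints the plan instead of touching any machine.
DRY_RUN="${DRY_RUN:-1}"

# Assumption: each name is reachable over ssh as sysadmin with sudo rights.
halt_host() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would halt: $1"
  else
    ssh "sysadmin@$1" sudo shutdown -h now
  fi
}

# Block until none of the given hosts answers ping any more.
wait_for_down() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would wait for: $*"
    return 0
  fi
  for h in "$@"; do
    while ping -c 1 -W 2 "$h" >/dev/null 2>&1; do
      sleep 5
    done
  done
}

# Walk the plan in order; WAIT means "wait until the hosts halted since the
# previous WAIT are down" (everything before that was already waited for).
run_plan() {
  pending=""
  for step in "$@"; do
    if [ "$step" = "WAIT" ]; then
      wait_for_down $pending
      pending=""
    else
      halt_host "$step"
      pending="${pending:+$pending }$step"
    fi
  done
}

# Per-cluster compute nodes come first in practice; their hostnames are not
# listed in this document, so they are left out of this plan.
run_plan t-voc bigfe elwood layout alsalam whedon pollock shinken bronte WAIT \
         dali kahlo WAIT hopper
```

Run as written, it only prints the plan. Setting `DRY_RUN=0` would make it act on real machines, so read it carefully and adapt the host list before doing that.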
File system mounting
We consistently run into file system mounting problems on shutdown day. Mostly this relates to mounts from dali and kahlo to hopper, and from hopper to dali or kahlo.
Order to bring up
The reverse of shutdown. Again, make sure to wait before proceeding at the appropriate steps.
It is important for file system reasons to make sure hopper is up and stable before proceeding to the other machines. Likewise, make sure dali and kahlo are stable before proceeding to clusters, utilities, and phat nodes.
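One way to make "up and stable" less of a judgment call is to poll each machine until it accepts an ssh login before moving on. The helper below is a minimal sketch: `probe` and `wait_until_up` are hypothetical names, and it assumes key-based login as sysadmin works from wherever you run it. Answering ssh does not prove the file systems are mounted, so still check those by hand.

```shell
#!/bin/sh
# Sketch only: wait for a host to accept ssh before bringing up the next tier.

# Returns 0 when the host accepts a non-interactive ssh login.
# Assumption: key-based login as sysadmin is set up for this host.
probe() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "sysadmin@$1" true 2>/dev/null
}

# wait_until_up HOST [TRIES] [DELAY_SECONDS]
wait_until_up() {
  host=$1
  tries=${2:-60}
  delay=${3:-10}
  n=0
  while [ "$n" -lt "$tries" ]; do
    if probe "$host"; then
      echo "$host is up"
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  echo "$host did not come up after $tries tries" >&2
  return 1
}
```

For example, `wait_until_up hopper && wait_until_up dali && wait_until_up kahlo` stops at the first machine that never answers, which is the point where someone should walk to the machine room.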
Hadoop
Hadoop runs on whedon and might also need to be restarted manually.
    sysadmin@hopper$ ssh w0
    sysadmin@w0$ sudo su hadoop
    hadoop@w0$ cd $HADOOP_HOME
    hadoop@w0$ ./sbin/start-all.sh
CS
There are virtual machines here.
Shutdown process
If hopper is back online, ssh sysadmin@cluster.cs.earlham.edu and then ssh sysadmin@control.cs.earlham.edu. This way we can shut down all the VMs directly without being knocked offline or being in the machine room.
Recipe for shutting down a machine on smiley:
    ssh sysadmin@tools.cs.earlham.edu
    ssh sysadmin@smiley.cs.earlham.edu
    sudo su -
    smiley-# xl destroy <hostname>.cs.earlham.edu
List running VMs
smiley-# xl list
Order of shutdowns
1. tools
2. web
3. net
4. smiley (tools, web, net are VMs run on smiley's hardware)
5. babbage (firewall)
Make sure all virtual machines are shut down before restarting the bare metal hardware.
Ideally the VMs should be shut down from inside (by ssh'ing into them and running shutdown). After that, run xl list to see if they're still listed as domains, then run the xl destroy commands as needed.
    # xl destroy tools.cs.earlham.edu
    # xl destroy web.cs.earlham.edu
    # xl destroy net.cs.earlham.edu
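The "check xl list, then destroy as needed" step can be sketched as a loop that only destroys domains that are still listed. The function names are hypothetical, and the sketch assumes it runs as root on smiley and that the domain name is the first column of `xl list` output after the header row.

```shell
#!/bin/sh
# Sketch only: destroy a VM only if xl still lists it as a domain.

# Returns 0 if the named domain appears in `xl list` (header row skipped).
domain_listed() {
  xl list | awk -v d="$1" 'NR > 1 && $1 == d { found = 1 } END { exit !found }'
}

# destroy_if_listed SHORTNAME... (e.g. tools web net)
destroy_if_listed() {
  for vm in "$@"; do
    d="$vm.cs.earlham.edu"
    if domain_listed "$d"; then
      echo "destroying $d"
      xl destroy "$d"
    else
      echo "$d already gone"
    fi
  done
}
```

Calling `destroy_if_listed tools web net` follows the shutdown order above. Give each VM time to finish its own shutdown first, since destroy is the equivalent of pulling the power.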
Start up again
Because of virtual machines this is a little more complex.
Mounting Logical Volumes
When you reboot, the LVM volume groups and logical volumes may not be automatically enabled. To bring them back, run:
    console-# lvscan
    console-# vgscan
    console-# vgchange -a y
This should be done at boot using /etc/rc.local. Verify this by running lvdisplay and creating the VMs.
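A quick way to see whether the boot-time activation worked is to look for logical volumes that `lvscan` reports as inactive, and only run `vgchange -a y` when there are some. This is a sketch with hypothetical function names. It assumes root on the hypervisor and that `lvscan` prints one volume per line, starting with `ACTIVE` or `inactive` followed by the quoted device path.

```shell
#!/bin/sh
# Sketch only: report inactive logical volumes and activate them if needed.

# Print the device path of every logical volume lvscan reports as inactive.
inactive_lvs() {
  lvscan 2>/dev/null | awk '$1 == "inactive" { print $2 }' | tr -d "'"
}

# Activate all volume groups only when something is actually inactive.
activate_if_needed() {
  missing="$(inactive_lvs)"
  if [ -n "$missing" ]; then
    echo "inactive logical volumes:"
    echo "$missing"
    vgchange -a y
  else
    echo "all logical volumes active"
  fi
}
```

If `activate_if_needed` has to activate anything after a reboot, the boot-time activation did not run, which is worth writing down in the shutdown notes.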
File system mounting
File system mounting is sometimes a problem on the CS side too, but for different reasons. The problem here only appears when we start up again.
In order:
1. Make sure /smiley-home-disk is mounted on smiley. (If not, run mount --source=/dev/vmdata/eccs-home-disk --target=/smiley-eccs-home-disk)
   - If you get the message "requested nfs version or transport protocol not supported" at any point when running mount -a, run service nfs-server start on smiley and try again.
2. Launch net (see below). On the Xen console, check that all file systems are correct there. Do not move on to other VMs until net is up, running, and mounting file systems.
3. Bring up tools and web. Make sure they're stable and have each mounted file systems.
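The NFS workaround in step 1 can be sketched as a wrapper that retries `mount -a` once after starting the NFS server, but only when that specific error shows up. The function name is hypothetical, and the sketch assumes it runs as root on smiley and that the service is called `nfs-server`, as in the step above.

```shell
#!/bin/sh
# Sketch only: run `mount -a`; on the known NFS error, start nfs-server and
# try once more. Any other failure is reported and left for a human.
mount_all_with_nfs_retry() {
  if out="$(mount -a 2>&1)"; then
    return 0
  fi
  case "$out" in
    *"requested nfs version or transport protocol not supported"*)
      echo "starting nfs-server and retrying"
      service nfs-server start && mount -a
      ;;
    *)
      echo "$out" >&2
      return 1
      ;;
  esac
}
```

The retry is deliberately limited to one attempt. If the second `mount -a` also fails, something else is wrong and repeating it will not help.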
Starting VMs
The VMs on smiley should be brought up in the reverse order they were shut down.
It is important to bring up net first because it runs DNS, DHCP, and LDAP.
    smiley-# xl create -c ~sysadmin/xen-configs/eccs-<hostname>.cfg
    # To exit to the hypervisor shell you can press Ctrl + ]
To start them up without going into the console:
# xl create ~sysadmin/xen-configs/eccs-<hostname>.cfg
Connect to VM console after the VM is running:
smiley-# xl console <hostname>.cs.earlham.edu
The different VMs mount from each other, so be patient and hopefully everything will work out.
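Starting the VMs one at a time, net first, could look roughly like the sketch below. It starts each VM without attaching to the console and waits for the domain to show up in `xl list` before moving on. The function names, the `CFG_DIR` variable, and the assumption that ~sysadmin is /home/sysadmin are all illustrative. A domain being listed only means Xen created it, so still check on the console that net is up and mounting file systems before starting the others.

```shell
#!/bin/sh
# Sketch only: start VMs in the given order, waiting for each to be listed.

# Assumption: ~sysadmin is /home/sysadmin; override CFG_DIR if it is not.
CFG_DIR="${CFG_DIR:-/home/sysadmin/xen-configs}"

# Returns 0 if the named domain appears in `xl list` (header row skipped).
domain_up() {
  xl list | awk -v d="$1" 'NR > 1 && $1 == d { found = 1 } END { exit !found }'
}

# start_vms SHORTNAME... (e.g. net web tools)
start_vms() {
  for vm in "$@"; do
    echo "starting $vm"
    xl create "$CFG_DIR/eccs-$vm.cfg" || return 1
    tries=0
    until domain_up "$vm.cs.earlham.edu"; do
      tries=$((tries + 1))
      if [ "$tries" -ge 30 ]; then
        echo "$vm never appeared in xl list" >&2
        return 1
      fi
      sleep 2
    done
    echo "$vm is listed"
  done
}
```

In practice it is safer to call `start_vms net` on its own, check net by hand, and only then run `start_vms web tools`.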
Tools
We may have to restart nginx, jupyter, and sage by hand. Using history | grep <command> is helpful here. (Make sure to grab the entire command, including the ampersand.)
Jupyter
eccs-tools# nohup su -c "/mnt/lovelace/software/anaconda/envs/py35/bin/jupyterhub -f /etc/jupyterhub/jupyterhub_config.py --no-ssl" &
Sage
eccs-tools# nohup /home/sage/sage-6.8/sage --notebook=sagenb accounts=False automatic_login=False interface= port=8080 &
Troubleshooting
If things aren't going well, it's possible to start the VMs in a pseudo single-user mode:
    xm create -c eccs-home.cfg extra="init=/bin/bash"   # start and leave it in single user mode with the console
    # (from within the vm)
    mount -o remount,rw /
    service networking start   # ignore the upstart errors
    mount /eccs/users
    mount /eccs/clients
    mount /mnt/lovelace/software
If you exit that shell the kernel will panic; if you leave it with ^] it seems to stay stable.
After
Share all those notes from the shutdown. A Drive Doc is good.
Make sure all the machines are actually back up. :)
Discuss any issues and assign tasks based on discoveries during shutdown at the next meeting.