Difference between revisions of "ShutdownProcedure"

From Earlham CS Department

Revision as of 11:36, 5 November 2018

These are the shutdown and boot up instructions for CS and Cluster servers. This page also has the reboot procedure.

Before

  • Read this document.
  • If there's anything specific you want to do, know that going in - don't decide in the moment.
  • One week in advance, notify:
    • admins
    • faculty
    • SciDiv
    • CS-students
  • Make sure you can ssh to at least one sysadmin account.
  • If you're doing lunch, get the credit card and collect orders early.
  • Back up critical wiki pages: There is a script in sysadmin@home.cs.earlham.edu:~/wiki_critical/ called send_wiki.sh that specifies which pages to pull down and send out via email. This is important to do before anyone starts shutting down the machines because the wiki will go offline.

During

This is the procedure for the day of a shutdown. The basics:

  1. Shut everything down.
  2. Tidy things up.
  3. Bring everything back up.
  4. Eat lunch.

Suggestion: Don't do OS upgrades or anything during this process. Don't entangle startup issues with upgrade issues.

General Info

  • Take notes. This cannot be emphasized enough: take notes. Keep it simple. During the Fall 2018 shutdown we used a yellow notepad and a cheap pen to record in the moment, and then put those notes into text in a Google Doc after the fact. But above all, take notes, and then share them. Notes should include: problems encountered, anything you had to start manually which should have started automatically at boot, etc.
  • babbage should be the very last machine brought down
  • to get to sysadmin@smiley first ssh into home or hopper.
  • Make sure all virtual machines are shut down before restarting the bare metal hardware

Cluster

These are all (or almost all) physical machines, not virtual machines.

Order of shutdowns

  1. all compute nodes: (layout, alsalam, whedon) and t-voc, bigfe, elwood
  2. all head nodes: (layout, alsalam, whedon)
  3. pollock
  4. bronte + disk array
  5. wait until everything up to this point has shut down
  6. dali
  7. kahlo
  8. wait until everything up to this point has shut down
  9. hopper
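
The "wait until everything up to this point has shut down" steps can be sketched as a small polling helper. This is only an illustration, not a script that exists on any of these machines: the function name wait_until_down, the PING_CMD and POLL_SECONDS variables, and the use of ping as the liveness probe (with Linux-style -W seconds) are all assumptions.

```shell
# Hypothetical helper: block until none of the given hosts answers a probe.
# PING_CMD is an assumption so the probe can be swapped out; the default
# uses Linux ping options (-c 1: one packet, -W 2: two second timeout).
PING_CMD=${PING_CMD:-"ping -c 1 -W 2"}

wait_until_down() {
    for host in "$@"; do
        # Keep polling while the host still responds.
        while $PING_CMD "$host" >/dev/null 2>&1; do
            echo "still up: $host"
            sleep "${POLL_SECONDS:-5}"
        done
        echo "down: $host"
    done
}
```

For example, running wait_until_down pollock bronte before touching dali and kahlo would cover step 5, assuming the short hostnames resolve from wherever the helper is run.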

Order to bring up

The reverse of the shutdown order. Again, make sure to wait before proceeding at the appropriate steps.

Hadoop

Hadoop runs on whedon and might also need to be restarted manually.

sysadmin@hopper$ ssh w0
sysadmin@w0$ sudo su hadoop
hadoop@w0$ cd $HADOOP_HOME
hadoop@w0$ ./sbin/start-all.sh

CS

There are virtual machines here.

Shutdown process

If hopper is back online, ssh sysadmin@cluster.cs.earlham.edu and then ssh sysadmin@control.cs.earlham.edu. This way we can shut down all the VMs directly without being knocked offline or having to be in the machine room.

Recipe for shutting down a machine on smiley:

ssh sysadmin@tools.cs.earlham.edu
ssh sysadmin@smiley.cs.earlham.edu
sudo su -
smiley-# xl destroy <hostname>.cs.earlham.edu

List running VMs

smiley-# xl list

Order of shutdowns

  1. proto (lives separately, ssh admin@proto.cs.earlham.edu)
  2. tools
  3. web
  4. net
  5. smiley (tools, web, net are VMs running on smiley's hardware)
  6. babbage (firewall)
  • Make sure all virtual machines are shut down before restarting the bare metal hardware.

Ideally the VMs should be shut down from inside (by ssh'ing into them and running shutdown). After that, run "xl list" to see if they're still listed as domains, then run the "xl destroy" commands as needed.

# xl destroy tools.cs.earlham.edu
# xl destroy web.cs.earlham.edu
# xl destroy net.cs.earlham.edu
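
The "check xl list, then destroy as needed" step could be sketched like this. The helper name domain_listed is hypothetical, and the sketch assumes the usual xl list layout of one header line followed by one row per domain with the domain name in the first column.

```shell
# Hypothetical helper: read `xl list` output on stdin and succeed only if
# the named domain still appears in the first column (header row skipped).
domain_listed() {
    awk -v d="$1" 'NR > 1 && $1 == d { found = 1 } END { exit !found }'
}

# Possible use on smiley, as root (left commented out so that sourcing this
# file never destroys anything by accident):
# for vm in tools web net; do
#     if xl list | domain_listed "$vm.cs.earlham.edu"; then
#         xl destroy "$vm.cs.earlham.edu"
#     fi
# done
```

This way xl destroy is only issued for domains that did not go away after the in-guest shutdown.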

Start up again

Because of virtual machines this is a little more complex.

Mounting Logical Volumes

When you reboot, the LVM volume groups and logical volumes may not be automatically enabled. To bring them back do

console-# lvscan
console-# vgscan
console-# vgchange -a y

This should be done at boot using /etc/init.d/rc.sysinit but there still might be some subtleties there.
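
One way to tell whether the manual vgchange step is needed is to count inactive logical volumes first. This is a sketch only: the helper name count_inactive_lvs is made up, and it assumes lvscan prints one line per logical volume beginning with either ACTIVE or inactive.

```shell
# Hypothetical helper: read `lvscan` output on stdin and print how many
# logical volumes are reported as inactive.
count_inactive_lvs() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is normalised with `|| true`.
    grep -c '^[[:space:]]*inactive' || true
}

# Possible use on the console, as root (commented out on purpose):
# if [ "$(lvscan | count_inactive_lvs)" -gt 0 ]; then
#     vgchange -a y
# fi
```

If the count is zero after boot, the volumes came up on their own, which is worth recording in the shutdown notes.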

Starting VMs

The VMs on smiley should be brought up in the reverse of the order in which they were shut down.

It is important to bring up net first because it runs DNS, DHCP, and LDAP.

smiley-# xl create -c ~sysadmin/xen-configs/eccs-<hostname>.cfg
# To exit to the hypervisor shell you can press Ctrl + ]

To start them up without going into the console:

# xl create ~sysadmin/xen-configs/eccs-<hostname>.cfg

Connect to VM console after the VM is running:

smiley-# xl console <hostname>.cs.earlham.edu

The different VMs mount from each other, so just be patient and hopefully everything will work out.
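
The reverse-order startup can be sketched as a dry run that only prints the commands, so they can be reviewed and pasted one at a time. The function and variable names are hypothetical; the VM list and the config path pattern are the ones given on this page.

```shell
# Shutdown order of the VMs on smiley, as listed on this page.
SHUTDOWN_ORDER="tools web net"

# Print the VM names in reverse shutdown order (net ends up first).
startup_order() {
    reversed=""
    for vm in $SHUTDOWN_ORDER; do
        reversed="$vm${reversed:+ $reversed}"
    done
    echo "$reversed"
}

# Dry run: print one `xl create` command per VM without executing anything.
print_start_commands() {
    for vm in $(startup_order); do
        echo "xl create ~sysadmin/xen-configs/eccs-$vm.cfg"
    done
}
```

Printing rather than executing leaves room to wait between VMs, which matters because net has to be serving DNS, DHCP, and LDAP before the others come up.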

Tools

We may have to restart nginx, jupyter, and sage by hand. Using history | grep <command> is helpful here. (Make sure to grab the entire command, including the trailing ampersand.)

Jupyter

eccs-tools# nohup su -c "/mnt/lovelace/software/anaconda/envs/py35/bin/jupyterhub -f /etc/jupyterhub/jupyterhub_config.py --no-ssl" &

Sage

eccs-tools# nohup /home/sage/sage-6.8/sage --notebook=sagenb accounts=False automatic_login=False interface= port=8080 &

Troubleshooting

If things aren't going well, it's possible to start the VMs in a pseudo single-user mode:

xm create -c eccs-home.cfg extra="init=/bin/bash" # start and leave it in single user mode with the console
# (the remaining commands are run from within the VM)
mount -o remount,rw /
service networking start # ignore the upstart errors
mount /eccs/users
mount /eccs/clients
mount /mnt/lovelace/software

If you exit that shell, the kernel will panic; if you leave it with ^], it seems to stay stable.

After

Share all those notes from the shutdown. A Drive Doc is good.

Make sure all the machines are actually back up. :)

Discuss any issues and assign tasks based on discoveries during shutdown at the next meeting.