Difference between revisions of "Sysadmin"

From Earlham CS Department
(To do - updated 2018-02-15 (22))
(Last updated 12 April 2018)
(18 intermediate revisions by 5 users not shown)
Line 37: Line 37:
 
| style="height:55px; width:150px; text-align:center; background-color:#EEDC82; border-left:solid 5px #EEDC82; border-top:solid 5px #EEDC82; border-bottom:solid 1px white; border-right:solid 5px #EEDC82; font-size:120%;" | [[Sysadmin:Servers:Proto | PROTO]]
 
| style="height:55px; width:150px; text-align:center; background-color:#EEDC82; border-left:solid 5px #EEDC82; border-top:solid 5px #EEDC82; border-bottom:solid 1px white; border-right:solid 5px #EEDC82; font-size:120%;" | [[Sysadmin:Servers:Proto | PROTO]]
 
|-
 
|-
| style="height:210px; width:150px; background-color: #EEDC82; border-left:solid 5px #EEDC82; border-bottom:solid 5px #EEDC82; border-right:solid 5px #EEDC82;" | Weather Monitoring <br> GPS/NTP <br> Energy Monitoring
+
| style="height:210px; width:150px; background-color: #EEDC82; border-left:solid 5px #EEDC82; border-bottom:solid 5px #EEDC82; border-right:solid 5px #EEDC82;" | Weather Monitoring <br> GPS/NTP <br> Energy Monitoring <br><br> Backup to Dali: etc, var
 
|}
 
|}
  
Line 43: Line 43:
 
| style="height:40px; width:150px; text-align:center; background-color:#FF7E6D; border-left:solid 5px #FF7E6D; border-top:solid 5px #FF7E6D; border-bottom:solid 1px white; border-right:solid 5px      #FF7E6D; font-size:120%;" | CONTROL  
 
| style="height:40px; width:150px; text-align:center; background-color:#FF7E6D; border-left:solid 5px #FF7E6D; border-top:solid 5px #FF7E6D; border-bottom:solid 1px white; border-right:solid 5px      #FF7E6D; font-size:120%;" | CONTROL  
 
|-
 
|-
| style="height:210px; width:150px; background-color:#FF7E6D; border-left:solid 5px #FF7E6D; border-bottom:solid 5px #FF7E6D; border-right:solid 5px #FF7E6D;" | Users <br> SSH <br> HOME <br> TOOLS
+
| style="height:210px; width:150px; background-color:#FF7E6D; border-left:solid 5px #FF7E6D; border-bottom:solid 5px #FF7E6D; border-right:solid 5px #FF7E6D;" | Users <br> SSH <br> HOME <br> TOOLS <br><br> Backup to Dali: etc, var
 
|}  
 
|}  
  
Line 49: Line 49:
 
| style="height:40px; width:150px; text-align:center; background-color:#54C571; border-left:solid 5px #54C571; border-top:solid 5px #54C571; border-bottom:solid 1px white; border-right:solid 5px      #54C571; font-size:120%;" | SMILEY  
 
| style="height:40px; width:150px; text-align:center; background-color:#54C571; border-left:solid 5px #54C571; border-top:solid 5px #54C571; border-bottom:solid 1px white; border-right:solid 5px      #54C571; font-size:120%;" | SMILEY  
 
|-
 
|-
| style="height:210px; width:150px; background-color:#54C571; border-left:solid 5px #54C571; border-bottom:solid 5px #54C571; border-right:solid 5px #54C571;" | [[Sysadmin:XenDocs]] <br> NET <br> WEB
+
| style="height:210px; width:150px; background-color:#54C571; border-left:solid 5px #54C571; border-bottom:solid 5px #54C571; border-right:solid 5px #54C571;" | [[Sysadmin:XenDocs]] <br> NET <br> WEB <br><br> Backup to Dali: etc, var
 
|}  
 
|}  
  
Line 77: Line 77:
 
| style="height:55px; width:150px; text-align:center; background-color:#ffdb4d; border-left:solid 5px #ffdb4d; border-top:solid 5px #ffdb4d; border-bottom:solid 1px white; border-right:solid 5px #ffdb4d; font-size:120%;" | DALI
 
| style="height:55px; width:150px; text-align:center; background-color:#ffdb4d; border-left:solid 5px #ffdb4d; border-top:solid 5px #ffdb4d; border-bottom:solid 1px white; border-right:solid 5px #ffdb4d; font-size:120%;" | DALI
 
|-
 
|-
| style="height:300px; width:150px; background-color:#ffdb4d; border-left:solid 5px #ffdb4d; border-bottom:solid 5px #ffdb4d; border-right:solid 5px #ffdb4d;" | Storage Server <br>[[Sysadmin:Gitlab | Gitlab]] <br> Backups <br> NginX <br><br> No backup (storage)
+
| style="height:300px; width:150px; background-color:#ffdb4d; border-left:solid 5px #ffdb4d; border-bottom:solid 5px #ffdb4d; border-right:solid 5px #ffdb4d;" | Storage Server <br>[[Sysadmin:Gitlab | Gitlab]] <br> Backups <br> NginX <br><br> Backup to Dali (/media/r10_vol/backups/): etc, var/opt/gitlab/backups
 
|}
 
|}
  
Line 83: Line 83:
 
| style="height:55px; width:150px; text-align:center; background-color:#ff4d94; border-left:solid 5px #ff4d94; border-top:solid 5px #ff4d94; border-bottom:solid 1px white; border-right:solid 5px #ff4d94; font-size:120%;" | AL-SALAM
 
| style="height:55px; width:150px; text-align:center; background-color:#ff4d94; border-left:solid 5px #ff4d94; border-top:solid 5px #ff4d94; border-bottom:solid 1px white; border-right:solid 5px #ff4d94; font-size:120%;" | AL-SALAM
 
|-
 
|-
| style="height:300px; width:150px; background-color:#ff4d94; border-left:solid 5px #ff4d94; border-bottom:solid 5px #ff4d94; border-right:solid 5px #ff4d94;" | WebMO <br> [[Sysadmin:Software Modules | Software Modules]] <br> Apache2 <br><br> No backup
+
| style="height:300px; width:150px; background-color:#ff4d94; border-left:solid 5px #ff4d94; border-bottom:solid 5px #ff4d94; border-right:solid 5px #ff4d94;" | WebMO <br> [[Sysadmin:Software Modules | Software Modules]] <br> Apache2 <br><br> Backup to Dali: etc, var
 
|}
 
|}
  
Line 101: Line 101:
 
| style="height:55px; width:150px; text-align:center; background-color:#0099cc; border-left:solid 5px #0099cc; border-top:solid 5px #0099cc; border-bottom:solid 1px white; border-right:solid 5px #0099cc; font-size:120%;" | POLLOCK
 
| style="height:55px; width:150px; text-align:center; background-color:#0099cc; border-left:solid 5px #0099cc; border-top:solid 5px #0099cc; border-bottom:solid 1px white; border-right:solid 5px #0099cc; font-size:120%;" | POLLOCK
 
|-
 
|-
| style="height:300px; width:150px; background-color:#0099cc; border-left:solid 5px #0099cc; border-bottom:solid 5px #0099cc; border-right:solid 5px #0099cc;" |  [[Sysadmin:Software Modules | Software Modules]] <br> WebMO <br> NginX <br><br> No backup
+
| style="height:300px; width:150px; background-color:#0099cc; border-left:solid 5px #0099cc; border-bottom:solid 5px #0099cc; border-right:solid 5px #0099cc;" |  [[Sysadmin:Software Modules | Software Modules]] <br> WebMO <br> NginX <br><br> Backup to Dali: etc, var
 
|}
 
|}
  
Line 107: Line 107:
 
| style="height:55px; width:150px; text-align:center; background-color:#ffdb4d; border-left:solid 5px #ffdb4d; border-top:solid 5px #ffdb4d; border-bottom:solid 1px white; border-right:solid 5px #ffdb4d; font-size:120%;" | KAHLO
 
| style="height:55px; width:150px; text-align:center; background-color:#ffdb4d; border-left:solid 5px #ffdb4d; border-top:solid 5px #ffdb4d; border-bottom:solid 1px white; border-right:solid 5px #ffdb4d; font-size:120%;" | KAHLO
 
|-
 
|-
| style="height:300px; width:150px; background-color:#ffdb4d; border-left:solid 5px #ffdb4d; border-bottom:solid 5px #ffdb4d; border-right:solid 5px #ffdb4d;" | Storage Server <br>Backups <br> NginX <br><br> No backup
+
| style="height:300px; width:150px; background-color:#ffdb4d; border-left:solid 5px #ffdb4d; border-bottom:solid 5px #ffdb4d; border-right:solid 5px #ffdb4d;" | Storage Server <br>Backups <br> NginX <br><br> Backup to Dali: etc, var
 
|}
 
|}
  
Line 344: Line 344:
 
|}
 
|}
  
== Current Projects ==  
+
== Current Projects ==
=== To do - updated 2018-03-01 ===
+
=== Last updated 9 May 2018 ===
* UPS Shinken monitoring - Aleks and Vitalii
+
* 159.28.23.26 is a ghost machine - it responds to ping and purportedly exists, but we are unsure where it is or what it is.
* UPS load monitoring - Eli  
+
* <s>Noyes room cleaning and reboot on Saturday, April 14 at 13:30 </s>
 +
* Adam and Ahsan will be there for Noyes room tour on Friday, April 20 at 08:00
 +
* Setup and install new machines in Lovelace - Eli
 +
** New machine has Debian installed, needs ethernet driver installed though
 +
* 10 Gbps for dali and kahlo, hopper, etc (nfs mounts in /etc/fstab) - Adam
 +
** real close, reboot test 10a 8 April
 +
* Chau & Laurence will be brought up to speed for Layout - Adam
 +
** "Layout needs to be brought out of surgery" - Charlie
 +
** Layout - head node swap, check disk space on /scratch - Adam
 +
* APC is on Shinken now - Aleks
 +
* Postgres connection monitoring on Shinken - Vitalii
 +
* Figure out the cs web production and test solution - Chau
 +
* Upgrade t-voc - ?
 +
 
 +
* -------------------------
 +
* Backup - max disk capacity, scripts on all machines backing-up at least /etc; babbage - Chau
 +
* Ganglia on Hopper - Ahsan
 +
* <s> UPS load monitoring </s> - Eli  
 +
** One UPS is being monitored, more can be added and stuff can still be cleared up
 +
* Temperature and humidity monitoring in the machine room (new item)
 +
* Fixing the user create script
 +
* Password auditing script
 +
 
 +
* -------------------------
 
* FIFO for requests rather than ad-hoc
 
* FIFO for requests rather than ad-hoc
 
* Accounting for hours logged  
 
* Accounting for hours logged  
* PBS Shinken monitoring - Aleks and Vitalii
+
* PBS Shinken monitoring  
 
* Power layout - legend, color-code servers by type, how are servers with 2x power supplies plumbed?
 
* Power layout - legend, color-code servers by type, how are servers with 2x power supplies plumbed?
* Bringing the new people on-board - Aleks (Vitalii to add them to the listserv)
+
 
 
* -------------------------
 
* -------------------------
 
* Shinken - Vitalii and Aleks (documentation, monitoring webmo and pm8)  
 
* Shinken - Vitalii and Aleks (documentation, monitoring webmo and pm8)  
Line 359: Line 382:
 
* Gaussian & WebMO on Whedon - Ahsan and Eli (stuck on firewall)
 
* Gaussian & WebMO on Whedon - Ahsan and Eli (stuck on firewall)
 
* Backup - Ch'''â'''u (moving along, setup backup.cs.e.e next)
 
* Backup - Ch'''â'''u (moving along, setup backup.cs.e.e next)
* Installing power monitor, etc. and rack cleanup - TO BE ASSIGNED (eli and charlie)
+
** Needs to setup a machine with CentOs in Noyes basement, eventually run it over to Lilly - Laurence
** switch switch (charlie to check inventory)
+
* <s> Installing power monitor, etc. and rack cleanup - TO BE ASSIGNED (eli and charlie) </s>
 +
** <s> switch switch (charlie to check inventory) </s>
 
* Mothur - Ahsan  
 
* Mothur - Ahsan  
* password policy, force change and random initial
+
* <s> password policy, force change and random initial </s>
** for now notify people with default and then change after a couple of days; script will generate random string
+
** <s> for now notify people with default and then change after a couple of days; script will generate random string </s>
 
* Talk about at next meeting:  
 
* Talk about at next meeting:  
 
** Spring break and summer people (important)
 
** Spring break and summer people (important)
Line 373: Line 397:
 
** Record consistent & thorough documentation, especially concerning the startup and shutdown of the VMs
 
** Record consistent & thorough documentation, especially concerning the startup and shutdown of the VMs
 
* Setup graceful shutdown when we detect to be running solely off UPS
 
* Setup graceful shutdown when we detect to be running solely off UPS
** Additionally, setup clean shutdown and startup for VMs on <s>smiley</s> control (?)
+
** Additionally, setup clean shutdown and startup for VMs on smiley and control (?)
* Fix reverse lookup error for mail.cs.earlham.edu
+
* <s> Fix reverse lookup error for mail.cs.earlham.edu
 
** Should consistently refer to 159.28.22.2 (web.cs.earlham.edu)
 
** Should consistently refer to 159.28.22.2 (web.cs.earlham.edu)
** It's possible that this isn't actually broken.
+
** It's possible that this isn't actually broken. </s>
 
* Layout infiniband subnet manager
 
* Layout infiniband subnet manager
 
* Layout disk swap, new lo0
 
* Layout disk swap, new lo0

Revision as of 16:26, 9 May 2018


Machines and Brief Descriptions of Services

CS Machines

Server layout as of May 2017
HOME
(vm0)
Users
SSH
NFS

Backup to Dali: eccs, etc, var
NET
(vm1)
LDAP server
DNS
DHCP

Backup to Dali: etc, var
WEB
(vm2)
Mailman
Mail Stack
Apache2
PostgreSQL
MySQL
Wiki

Backup to Dali: etc, var
TOOLS
(vm3)
SageNB Server
Jupyterhub Server
Software Modules
NginX

Backup to Dali: etc, var, mnts, sage
BABBAGE
Firewall
PROTO
Weather Monitoring
GPS/NTP
Energy Monitoring

Backup to Dali: etc, var
CONTROL
Users
SSH
HOME
TOOLS

Backup to Dali: etc, var
SMILEY
Sysadmin:XenDocs
NET
WEB

Backup to Dali: etc, var
SHINKEN
Users
SSH
Add machines
MURPHY
Elderly email stack
Users
SSH


Cluster Machines

HOPPER
Users
SSH
NFS server
LDAP server
Software Modules
PostgreSQL
Wiki
Apache2
DNS
DHCP

Backup to Dali: etc, var, cluster
DALI
Storage Server
Gitlab
Backups
NginX

Backup to Dali (/media/r10_vol/backups/): etc, var/opt/gitlab/backups
AL-SALAM
WebMO
Software Modules
Apache2

Backup to Dali: etc, var
LAYOUT
Jupyterhub Server
Software Modules
NginX
Apache2
WebMO

Backup to Dali: etc, var
BRONTE
Software Modules

Backup to Dali: etc, var, nbserver
POLLOCK
Software Modules
WebMO
NginX

Backup to Dali: etc, var
KAHLO
Storage Server
Backups
NginX

Backup to Dali: etc, var
BIGFE
Software Modules
T-VOC
Software Modules
ELWOOD
Software Modules
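
The "Backup to Dali" entries above name the directories each machine copies to Dali. As a rough sketch only, the plan could be turned into one rsync call per directory. The page does not say which tool the backup scripts use, so rsync, the SSH host alias dali, the per-machine subdirectory layout, and treating each listed name as a top-level directory are all assumptions; /media/r10_vol/backups comes from the Dali entry.

```python
import shlex

# Directories per machine, copied from the listing above (a few machines only).
BACKUP_PLAN = {
    "home": ["eccs", "etc", "var"],
    "net": ["etc", "var"],
    "tools": ["etc", "var", "mnts", "sage"],
    "hopper": ["etc", "var", "cluster"],
}

# Assumptions: "dali" is reachable over SSH under that name, and every
# machine gets its own subdirectory under the backup root.
DEST_HOST = "dali"
DEST_ROOT = "/media/r10_vol/backups"


def rsync_commands(machine, plan=BACKUP_PLAN):
    """Return one rsync argument list per directory backed up on `machine`."""
    commands = []
    for directory in plan[machine]:
        source = f"/{directory}/"  # assumes the listed name is a top-level directory
        target = f"{DEST_HOST}:{DEST_ROOT}/{machine}/{directory}/"
        # -a keeps permissions, ownership and timestamps; --delete mirrors removals.
        commands.append(["rsync", "-a", "--delete", source, target])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so the plan can be reviewed first.
    for command in rsync_commands("tools"):
        print(shlex.join(command))
```

Building argument lists rather than shell strings avoids quoting problems if a path ever contains spaces.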

Switches

SG538SF02J
  • Model: HP Procurve 3400cl
  • Ports: 24
  • Backplane bandwidth:
    • 88 Gbps
    • 64 million pps
  • Memory:
    • 2 MB packet buffer
    • 16 MB dual flash
    • 128 MB SDRAM
  • Cut-through switching: No
  • Unused as of May 12, 2017
CN63FP762S
  • Model: HP 2530-24G
  • Ports: 24
  • Switching Capacity:
    • 56 Gbps
    • 41.6 million pps
  • Memory:
    • 1.5 MB packet buffer
    • 256 MB flash
    • 128 MB DDR3 DIMM
  • Cut-through switching: No
  • Connected to Al-Salam as of May 12, 2017
SG525SG025
  • Model: HP Procurve 3400cl
  • Ports: 24
  • Backplane bandwidth:
    • 88 Gbps
    • 64 million pps
  • Memory:
    • 2 MB packet buffer
    • 16 MB dual flash
    • 128 MB SDRAM
  • Cut-through switching: No
  • Connected to layout and whedon as of May 12, 2017
Netgear JGS524
  • Current cluster head-node
  • Unmanaged (no console/configuration)
  • Ports: 24
  • Switching bandwidth:
    • 48 Gbps
    • 1.5 million pps
  • Memory:
    • 2 MB packet buffer
  • Cut-through switching: No
  • Connected to Al-Salam, Hopper, Pollock, Nagios, Dali, Kahlo, Bronte as of May 12, 2017
cs-main
  • Model: HP 5920AF-24XG
  • Ports: 24
  • Backplane bandwidth:
    • 480 Gbps
    • 367 million pps
  • Memory:
    • 3.6 GB packet buffer
    • 256 MB dual flash
    • 2 GB SDRAM
  • Cut-through switching: Yes
  • IP Address: 159.28.31.66
  • Connected to layout, kahlo, and dali as of May 12, 2017
5500denniscs-sw1
  • Model: HP 5500 JG542A
  • Ports: 24
  • Backplane bandwidth:
    • 224 Gbps
    • 166.6 million pps
  • Memory:
    • 6 MB packet buffer
    • 512 MB dual flash
    • 1 GB SDRAM
  • Cut-through switching: No
  • IP Address: 159.28.31.67
  • Connected to Babbage, Control, Nagios, and the cluster's netgear switch (via port 14) as of May 12, 2017
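
The packets-per-second figures above can be sanity-checked against Ethernet line rate. A minimum-size frame is 64 bytes, and each frame also occupies 20 bytes of preamble, start delimiter, and inter-frame gap on the wire. The sketch below is only that arithmetic; the function name and the port counts passed to it are illustrative assumptions, not values taken from the switch configurations.

```python
# Per-frame wire overhead: 7-byte preamble + 1-byte start delimiter
# + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20
MIN_FRAME_BYTES = 64


def line_rate_pps(link_gbps, ports=1, frame_bytes=MIN_FRAME_BYTES):
    """Maximum frames per second for `ports` links of `link_gbps` each."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return ports * link_gbps * 1e9 / bits_per_frame


if __name__ == "__main__":
    # One gigabit port at minimum frame size: about 1.49 million pps.
    print(f"{line_rate_pps(1) / 1e6:.2f} Mpps")
    # 28 gigabit ports: about 41.67 million pps.
    print(f"{line_rate_pps(1, ports=28) / 1e6:.2f} Mpps")
```

One gigabit port works out to about 1.49 million pps, which suggests the Netgear's 1.5 million pps is a per-port figure. The 41.6 million pps listed for the HP 2530-24G is close to the result for 28 gigabit ports, which would also match its 56 Gbps capacity (28 ports × 1 Gbps × 2 directions). The 28-port count is an inference from those numbers, since the listing above gives 24 ports.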

Systems Administration Documentation

For old documentation, see: Old Wiki Information

Current Projects

Last updated 9 May 2018

  • 159.28.23.26 is a ghost machine - it responds to ping and purportedly exists, but we are unsure where it is or what it is.
  • Noyes room cleaning and reboot on Saturday, April 14 at 13:30
  • Adam and Ahsan will be there for Noyes room tour on Friday, April 20 at 08:00
  • Set up and install new machines in Lovelace - Eli
    • New machine has Debian installed but still needs an Ethernet driver
  • 10 Gbps for dali and kahlo, hopper, etc (nfs mounts in /etc/fstab) - Adam
    • Nearly done; reboot test at 10 a.m. on 8 April
  • Chau & Laurence will be brought up to speed for Layout - Adam
    • "Layout needs to be brought out of surgery" - Charlie
    • Layout - head node swap, check disk space on /scratch - Adam
  • APC is on Shinken now - Aleks
  • Postgres connection monitoring on Shinken - Vitalii
  • Figure out the cs web production and test solution - Chau
  • Upgrade t-voc - ?
  • -------------------------
  • Backup - max disk capacity, scripts on all machines backing up at least /etc; babbage - Chau
  • Ganglia on Hopper - Ahsan
  • UPS load monitoring - Eli
    • One UPS is being monitored; more can be added, and the setup can still be tidied up
  • Temperature and humidity monitoring in the machine room (new item)
  • Fixing the user create script
  • Password auditing script
  • -------------------------
  • FIFO for requests rather than ad-hoc
  • Accounting for hours logged
  • PBS Shinken monitoring
  • Power layout - legend, color-code servers by type, how are servers with 2x power supplies plumbed?
  • -------------------------
  • Shinken - Vitalii and Aleks (documentation, monitoring webmo and pm8)
  • Hadoop on Whedon - Vitalii and Adam (stuck on ?)
  • Layout - Adam (stuck on ldap)
  • Gaussian & WebMO on Whedon - Ahsan and Eli (stuck on firewall)
  • Backup - Châu (moving along, set up backup.cs.e.e next)
    • Need to set up a machine with CentOS in the Noyes basement, eventually move it over to Lilly - Laurence
  • Installing power monitor, etc. and rack cleanup - TO BE ASSIGNED (eli and charlie)
    • switch switch (charlie to check inventory)
  • Mothur - Ahsan
  • password policy, force change and random initial
    • for now notify people with default and then change after a couple of days; script will generate random string
  • Talk about at next meeting:
    • Spring break and summer people (important)
    • Jon's user and Postgres database
    • investigate tools /clients/ directory with what looks like duplicate user directories
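
The password policy item above says a script will generate a random string for the initial password. A minimal sketch of that step follows, assuming a 16-character alphanumeric password is acceptable; the function name, the length, and the alphabet are illustrative choices and not the team's actual script.

```python
import secrets
import string

# Letters and digits only, so the password is easy to read out or type.
ALPHABET = string.ascii_letters + string.digits


def random_initial_password(length=16):
    """Return a random initial password drawn from a cryptographic RNG."""
    if length < 8:
        raise ValueError("length must be at least 8")
    # secrets.choice uses the operating system's secure random source,
    # unlike random.choice, which is predictable.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(random_initial_password())
```

Forcing a change at first login is a separate step that depends on where the accounts live (local shadow files or LDAP), so it is left out here.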

(list from 2017-10-26)

  • Finish migrating tools and home to smiley; migrate web and net back to control
    • Record consistent & thorough documentation, especially concerning the startup and shutdown of the VMs
  • Set up graceful shutdown when we detect that we are running solely on UPS power
    • Additionally, set up clean shutdown and startup for VMs on smiley and control (?)
  • Fix reverse lookup error for mail.cs.earlham.edu
    • Should consistently refer to 159.28.22.2 (web.cs.earlham.edu)
    • It's possible that this isn't actually broken.
  • Layout infiniband subnet manager
  • Layout disk swap, new lo0
    • Redo /scratch for mglerner group on /media/r10_vol ?
  • Migrate Elwood, BigFe, t-voc to repurposed Lovelace Machines (Eli)
  • HP Al-Salam switch: enable jumbo frames?
  • Strike unused lovelace machine addresses from CS DNS file
    • Perhaps there's a python file in root's home somewhere that checks for unused DNS/DHCP addresses?
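
For the reverse lookup item above, one way to see whether anything is actually broken is to compare the forward (A) and reverse (PTR) answers for a name. The sketch below assumes the check runs from a machine using the department's resolvers. The function name and the shape of the returned dictionary are hypothetical, and the resolver functions are parameters so the logic can be exercised without network access.

```python
import socket


def check_lookup(hostname, expected_ip,
                 forward=socket.gethostbyname,
                 reverse=socket.gethostbyaddr):
    """Compare a host's forward lookup with the PTR names of its address."""
    forward_ip = forward(hostname)
    # gethostbyaddr returns (primary name, alias list, address list).
    primary, aliases, _addresses = reverse(forward_ip)
    ptr_names = sorted({name.lower().rstrip(".") for name in [primary, *aliases]})
    return {
        "forward_ip": forward_ip,
        "forward_ok": forward_ip == expected_ip,
        "ptr_names": ptr_names,
        "ptr_names_host": hostname.lower().rstrip(".") in ptr_names,
    }


if __name__ == "__main__":
    # Needs working DNS; the name and address are the ones given in the list above.
    print(check_lookup("mail.cs.earlham.edu", "159.28.22.2"))
```

If the forward lookup matches but the PTR record names only web.cs.earlham.edu, the result shows forward_ok as True and ptr_names_host as False, which separates the two questions the list item raises.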

Ongoing Projects (Spring 2017)

TODO

  • EMAILING ALL THE USERS https://wiki.cs.earlham.edu/index.php/Sysadmin:Old:Contacting_All_Users
  • SHUTDOWN SCHEDULED FOR SUNDAY (APRIL 16)
  • Fix certs for gitlab, etc.
  • Secure 1-2 admins for the summer
  • Prep layout for May-June usage
  • Practice shutdown-startup procedure (with Michael)
  • Nsswitch consistency across all machines
  • Document tools: startup / shutdown - Charlie
  • Use Sysadmin namespace for all our pages - All
    • Testing usefulness of documentation - Dave
  • Al Salam: configure switch, re-rack. - Vitalii
    • HP switch should be reset and tested.
  • LDAP cleanup of system users / old groups - James
  • Layout - Nirdesh
    • Lo0 RAID (mdadm)
    • 10 Gb from Dali to lo0 (adding rules on compute node routing tables as a possible fix)
    • BIOS reset
  • 10Gb, perfsonar, ...
  • Monitoring: (Ganglia, Shinken)
    • Getting consistency among all the machines (check_nrpe regularly stops working).
  • Whedon: configured and available
  • Change passwords (on everything). Postgres, Shinken, ...
  • Webcam on office whiteboard (new office location?)
  • Learn virtual machine architecture and modules - Dave
    • Document in a format for future admin training?
    • Find existing introduction material
  • Mirror control for testing, swapping, etc.
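
The nsswitch consistency item above amounts to comparing /etc/nsswitch.conf across machines. A small sketch of that comparison follows, assuming the file contents have already been collected into a dictionary keyed by machine name; how they are gathered (scp, ssh, or a config tool) is left open, and both function names are hypothetical.

```python
def parse_nsswitch(text):
    """Map each database (passwd, group, hosts, ...) to its ordered source list."""
    entries = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line or ":" not in line:
            continue
        database, sources = line.split(":", 1)
        entries[database.strip()] = sources.split()
    return entries


def nsswitch_differences(configs):
    """Return {database: {machine: sources}} for databases that are not identical.

    `configs` maps a machine name to the text of its nsswitch.conf.
    A machine that lacks a database entirely is reported with None.
    """
    parsed = {machine: parse_nsswitch(text) for machine, text in configs.items()}
    databases = sorted({db for entries in parsed.values() for db in entries})
    differences = {}
    for database in databases:
        per_machine = {m: entries.get(database) for m, entries in parsed.items()}
        distinct = {None if v is None else tuple(v) for v in per_machine.values()}
        if len(distinct) > 1:
            differences[database] = per_machine
    return differences
```

Source order is compared as well as membership, because "files ldap" and "ldap files" resolve names differently.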

DONE (19 Jan 2017)

  • Examine extra "layout" node. - Adam
    • Differences are: Single PSU, Single GPGPU, No VGA.
    • It has Infiniband and 10 Gb cards installed.
  • Networking - Adam, Charlie
    • IP over Infiniband working on layout
      • Resolved by resetting IB switch configuration: ibwarn: [3349] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1)

FUTURE

  • Centralized password database / manager / location

Current Projects (updated 13 Oct 16)

  • Groups and LDAP and sudo - James
  • Amber - James
  • Edward's setup - Vitalii
  • WebDev access - Nirdesh
  • Puppet - James and Vitalii
  • Bacula - Nirdesh
  • SSL certificate upgrade and documentation - Kristin
  • Listserv merging with archives preserved - Nirdesh
  • Ganglia - Bret
  • Shinken - Vitalii
    • latency, UPS
  • New Layout node - ? and ?
  • Provision Sappho (compute) - after Puppet
  • Provision Kahlo (storage) -
    • replace broken drive
  • I2 setup
    • DTN, storage nodes, head nodes, ports in CST
  • Provision Whedon (compute) - after Puppet
  • Shutdown and startup test - scheduled for Sunday 27 November
  • Disk cleaning - Charlie
  • Password changing in the CS and cluster domains - Vitalii and James
  • Proto setup and maintenance with HIP/Green Science