Difference between revisions of "Sysadmin"

From Earlham CS Department
(Current Projects)
 
** Hamilton: installed, can ssh, so can do whatever else we need
** Lovelace machine: still no Internet
* shinken: web is back, need to check status
* layout node outside machine room - Eli interested
* layout: make different node the head node?
* passwords
* email to users (automated)
* check Lovelace DNS file - check what machines we have vs. what's in the DNS file?
* check backup statuses of machines, see the wiki page on this - also backup.cs.e.e (indiana?)
* Password auditing script
* FIFO for requests rather than ad-hoc
* Accounting for hours logged
  
=== Elderly, to be sorted through ===
=== From the summer, possible followup ===
=== 16 July 2018 ===
* 159.28.23.26 is a ghost machine - it responds to ping and purportedly exists, but we are unsure where it is or what it is. - ping -f - Eli looking into it in the context of al-salam
* Power map additions and updates
* nbgrader - setup continues
* indiana config
* shinken - didn't boot, and the whole two sets of processes bit
* Hadoop on Whedon - Vitalii and Adam (stuck on ?) - talk to Ajit
* [[File:Manual TeraStation5010.pdf]]
 
 
 
===Spring 2018===
 
* Adam and Ahsan will be there for Noyes room tour on Friday, April 20 at 08:00
 
* Setup and install new machines in Lovelace - Eli
 
** New machine has Debian installed, needs ethernet driver installed though
 
* 10 Gbps for dali and kahlo, hopper, etc (nfs mounts in /etc/fstab) - Adam
 
** real close, reboot test 10a 8 April
 
* Chau & Laurence will be brought up to speed for Layout - Adam
 
** "Layout needs to be brought out of surgery" - Charlie
 
** Layout - head node swap, check disk space on /scratch - Adam
 
* APC is on Shinken now - Alek
 
* Postgres connection monitoring on Shinken - Vitalii
 
* Figure out the cs web production and test solution - Chau
 
* Upgrade t-voc - ?
 
 
 
* -------------------------
 
* Backup - max disk capacity, scripts on all machines backing-up at least /etc; babbage - Chau
 
* Ganglia on Hopper - Ahsan
 
* <s> UPS load monitoring </s> - Eli
 
** One UPS is being monitored, more can be added and stuff can still be cleared up
 
* Temperature and humidity monitoring in the machine room (new item)
 
* Fixing the user create script
 
* Password auditing script
 
 
 
* -------------------------
 
* FIFO for requests rather than ad-hoc
 
* Accounting for hours logged
 
* PBS Shinken monitoring
 
* Power layout - legend, color-code servers by type, how are servers with 2x power supplies plumbed?
 
 
 
* -------------------------
 
* Shinken - Vitalii and Aleks (documentation, monitoring webmo and pm8)
 
* Hadoop on Whedon - Vitalii and Adam (stuck on ?)
 
* Layout - Adam (stuck on ldap)
 
* Gaussian & WebMO on Whedon - Ahsan and Eli (stuck on firewall)
 
* Backup - Ch'''â'''u (moving along, setup backup.cs.e.e next)
 
** Need to set up a machine with CentOS in Noyes basement, eventually run it over to Lilly - Laurence
 
* <s> Installing power monitor, etc. and rack cleanup - TO BE ASSIGNED (eli and charlie) </s>
 
** <s> switch switch (charlie to check inventory) </s>
 
* Mothur - Ahsan
 
* <s> password policy, force change and random initial </s>
 
** <s> for now notify people with default and then change after a couple of days; script will generate random string </s>
 
* Talk about at next meeting:
 
** Spring break and summer people (important)
 
** Jon's user and Postgres database
 
** investigate tools /clients/ directory with what looks like duplicate user directories
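
The struck-through password policy item above mentions a script that generates a random string for initial passwords. A minimal sketch of that one step, assuming Python 3 and an alphanumeric alphabet; the function name, alphabet, and default length are illustrative choices, not taken from the actual script:

```python
import secrets
import string

# Assumption: letters and digits only, so the password is easy to type.
ALPHABET = string.ascii_letters + string.digits


def random_initial_password(length=16):
    """Return a random string drawn from ALPHABET using the OS CSPRNG."""
    if length < 1:
        raise ValueError("length must be positive")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Prints a different value on every run.
    print(random_initial_password())
```

The <code>secrets</code> module is used rather than <code>random</code> because it draws from the operating system's cryptographic random source.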
 
 
 
=== (list from 2017-10-26) ===
 
* <s>Finish migrating tools and home to smiley</s> migrate web and net back to control
 
** Record consistent & thorough documentation, especially concerning the startup and shutdown of the VMs
 
* Set up graceful shutdown when we detect that we are running solely on UPS power
 
** Additionally, set up clean shutdown and startup for VMs on smiley and control (?)
 
* <s> Fix reverse lookup error for mail.cs.earlham.edu
 
** Should consistently refer to 159.28.22.2 (web.cs.earlham.edu)
 
** It's possible that this isn't actually broken. </s>
 
* Layout infiniband subnet manager
 
* Layout disk swap, new lo0
 
** <s>Redo /scratch for mglerner group on /media/r10_vol</s> ?
 
* Migrate Elwood, BigFe, t-voc to repurposed Lovelace Machines (Eli)
 
* <s>HP Al-Salam switch enable jumboframes</s> ?
 
* Strike unused lovelace machine addresses from CS DNS file
 
** Perhaps there's a python file in root's home somewhere that checks for unused DNS/DHCP addresses?
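
The graceful-shutdown item earlier in this list needs some way to decide that a machine is running on battery. A rough sketch of that decision, assuming the UPS is monitored with apcupsd and that status text in the <code>KEY : value</code> style printed by <code>apcaccess</code> is available; the 10-minute threshold and both function names are hypothetical, and the sketch only makes the decision, it does not shut anything down:

```python
def parse_apcaccess(text):
    """Parse 'KEY : value' lines into a dict, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def should_shut_down(fields, min_minutes=10.0):
    """Return True when on battery and the remaining runtime is low or unknown."""
    if "ONBATT" not in fields.get("STATUS", ""):
        return False
    try:
        minutes = float(fields.get("TIMELEFT", "").split()[0])
    except (IndexError, ValueError):
        # On battery with no usable runtime estimate: be conservative.
        return True
    return minutes < min_minutes


if __name__ == "__main__":
    # Hypothetical sample status text, not captured from a real UPS.
    sample = "STATUS   : ONBATT\nTIMELEFT : 7.5 Minutes\n"
    print(should_shut_down(parse_apcaccess(sample)))
```

A real deployment would still need to order the shutdown of the VMs on smiley and control before the hosts themselves.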
 
 
 
=== Ongoing Projects (Spring 2017) ===
 
=== TODO ===
 
* EMAILING ALL THE USERS https://wiki.cs.earlham.edu/index.php/Sysadmin:Old:Contacting_All_Users
 
* SHUTDOWN SCHEDULED FOR SUNDAY (APRIL 16)
 
** Check/update instructions - one version is at https://wiki.cs.earlham.edu/index.php/Sysadmin:ImportantInfo:PowerFailure, there are others too
 
** Notify users
 
* Fix certs for gitlab, etc.
 
* Secure 1-2 admins for the summer
 
* Prep layout for May-June usage
 
* Practice shutdown-startup procedure (with Michael)
 
* Nsswitch consistency across all machines
 
* Document tools: startup / shutdown - Charlie
 
* Use Sysadmin namespace for all our pages - All
 
** Testing usefulness of documentation - Dave
 
* Al Salam: configure switch, re-rack. - Vitalii
 
** HP switch should be reset and tested.
 
* LDAP cleanup of system users / old groups - James
 
* Layout - Nirdesh
 
** Lo0 RAID (mdadm)
 
** 10Gb from Dali to lo0 (adding rules on compute node routing tables as a possible fix)
 
** BIOS reset
 
* 10Gb, perfsonar, ...
 
* Monitoring: (Ganglia, Shinken)
 
** Getting consistency among all the machines (check_nrpe regularly stops working).
 
* Whedon: configured and available
 
* Change passwords (on everything). Postgres, Shinken, ...
 
* Webcam on office whiteboard (new office location?)
 
* Learn virtual machine architecture and modules - Dave
 
** Document in a format for future admin training?
 
** Find existing introduction material
 
* Mirror ''control'' for testing, swapping, etc.
 
 
 
=== DONE (19 Jan 2017) ===
 
* Examine extra "layout" node. - Adam
 
** Differences are: Single PSU, Single GPGPU, No VGA.
 
** It has Infiniband and 10Gb cards installed.
 
* Networking - Adam, Charlie
 
** IP over Infiniband working on layout
 
*** Resolved by resetting IB switch configuration: <code>ibwarn: [3349] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1)</code>
 
 
 
=== FUTURE ===
 
* Centralized password database / manager / location
 
 
 
=== Current Projects (updated 13 Oct 16) ===
 
* '''Groups and LDAP and sudo - James'''
 
* <s>Amber - James</s>
 
* <s>Edward's setup - Vitalii</s>
 
* <s>WebDev access - Nirdesh</s>
 
* Puppet - James and Vitalii
 
* '''Bacula - Nirdesh'''
 
* SSL certificate upgrade and documentation - Kristin
 
* <s>Listserv merging with archives preserved - Nirdesh </s>
 
* '''Ganglia - Bret'''
 
* '''Shinken - Vitalii'''
 
** latency, UPS
 
* New Layout node - ? and ?
 
* Provision Sappho (compute) - after Puppet
 
* Provision Kahlo (storage) -
 
** replace broken drive
 
* I2 setup
 
** DTN, storage nodes, head nodes, ports in CST
 
* [[Sysadmin:WhedonProvisioning|Provision Whedon]] (compute) - after Puppet
 
* '''Shutdown and startup test - scheduled for Sunday 27 November'''
 
* Disk cleaning - Charlie
 
* <s>Password changing in the CS and cluster domains - Vitalii and James</s>
 
* Proto setup and maintenance with HIP/Green Science
 

Revision as of 13:52, 6 August 2018


Machines and Brief Descriptions of Services

CS Machines

Server layout as of May 2017

NET (vm1)
  • LDAP Server
  • DNS
  • DHCP
  • Backup to Dali: etc, var
WEB (vm2)
  • Mailman
  • Mail Stack
  • Apache2
  • PostgreSQL
  • MySQL
  • Wiki
  • Backup to Dali: etc, var
TOOLS (vm3)
  • SageNB Server
  • Jupyterhub Server
  • Software Modules
  • NginX
  • SSH
  • Users
  • Backup to Dali: etc, var, mnts, sage
BABBAGE
  • Firewall
PROTO
  • Weather Monitoring
  • GPS/NTP
  • Energy Monitoring
  • Backup to Dali: etc, var
CONTROL
  • Users
  • SSH
  • HOME
  • TOOLS
  • Backup to Dali: etc, var
SMILEY
  • XenDocs
  • NET
  • WEB
  • NFS
  • Backup to Dali: etc, var
SHINKEN
  • Users
  • SSH
  • Add machines
MURPHY
  • Elderly email stack
  • Users
  • SSH
HOME (vm0)
  • SSH
  • NFS
  • Backup to Dali: eccs, etc, var
  • deprecated 07-2018

Cluster Machines

HOPPER
  • Users
  • SSH
  • NFS server
  • LDAP server
  • Software Modules
  • PostgreSQL
  • Wiki
  • Apache2
  • DNS
  • DHCP
  • Backup to Dali: etc, var, cluster
INDIANA
  • New Storage Server
DALI
  • Storage Server
  • Gitlab
  • Backups
  • NginX
  • Backup to Dali (/media/r10_vol/backups/): etc, var/opt/gitlab/backups
AL-SALAM
  • WebMO
  • Software Modules
  • Apache2
  • Backup to Dali: etc, var
LAYOUT
  • Jupyterhub Server
  • Software Modules
  • NginX
  • Apache2
  • WebMO
  • Backup to Dali: etc, var
BRONTE
  • Software Modules
  • Backup to Dali: etc, var, nbserver
POLLOCK
  • Software Modules
  • WebMO
  • NginX
  • Backup to Dali: etc, var
KAHLO
  • Storage Server
  • Backups
  • NginX
  • Backup to Dali: etc, var
BIGFE
  • Software Modules
  • Hosts BCCD-related repositories and distributions.
T-VOC
  • Software Modules
ELWOOD
  • Software Modules
  • Used by BCCD to host www.bccd.net and www.littlefe.net. Will be deprecated when the BCCD project offloads its sites onto cloud-based hosting platforms.
krasner
  • Docker platform on an old Lovelace machine upgraded to have 16 GB of RAM.

Switches

SG538SF02J
  • Model: HP Procurve 3400cl
  • Ports: 24
  • Backplane bandwidth:
    • 88 Gbps
    • 64 million pps
  • Memory:
    • 2MB packet buffer
    • 16 MB dual flash
    • 128 MB SDRAM
  • Cut-through switching: No
  • Unused as of May 12, 2017
CN63FP762S
  • Model: HP 2530-24G
  • Ports: 24
  • Switching Capacity:
    • 56 Gbps
    • 41.6 million pps
  • Memory:
    • 1.5 MB packet buffer
    • 256 MB flash
    • 128 MB DDR3 DIMM
  • Cut-through switching: No
  • Connected to Al-Salam as of May 12, 2017
SG525SG025
  • Model: HP Procurve 3400cl
  • Ports: 24
  • Backplane bandwidth:
    • 88 Gbps
    • 64 million pps
  • Memory:
    • 2MB packet buffer
    • 16 MB dual flash
    • 128 MB SDRAM
  • Cut-through switching: No
  • Connected to layout and whedon as of May 12, 2017
Netgear JGS524
  • Current cluster head-node
  • Unmanaged (no console/configuration)
  • Ports: 24
  • Switching bandwidth:
    • 48 Gbps
    • 1.5 million pps
  • Memory:
    • 2MB packet buffer
  • Cut-through switching: No
  • Connected to Al-Salam, Hopper, Pollock, Nagios, Dali, Kahlo, Bronte as of May 12, 2017
cs-main
  • Model: HP 5920AF-24XG
  • Ports: 24
  • Backplane bandwidth:
    • 480 Gbps
    • 367 million pps
  • Memory:
    • 3.6 GB packet buffer
    • 256 MB dual flash
    • 2 GB SDRAM
  • Cut-through switching: Yes
  • IP Address: 159.28.31.66
  • Connected to layout, kahlo, and dali as of May 12, 2017
5500denniscs-sw1
  • Model: HP 5500 JG542A
  • Ports: 24
  • Backplane bandwidth:
    • 224 Gbps
    • 166.6 million pps
  • Memory:
    • 6 MB packet buffer
    • 512 MB dual flash
    • 1 GB SDRAM
  • Cut-through switching: No
  • IP Address: 159.28.31.67
  • Connected to Babbage, Control, Nagios, and the cluster's netgear switch (via port 14) as of May 12, 2017
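
The bandwidth figures listed for the switches above can be sanity-checked with the usual rule of thumb that a non-blocking switch needs ports × port speed × 2 (full duplex) of switching capacity. A short sketch, assuming the Netgear JGS524 has 24 ports at 1 Gbps and the HP 5920AF-24XG has 24 ports at 10 Gbps; the port speeds are assumptions, since only the port counts are recorded here:

```python
def full_duplex_capacity_gbps(ports, port_speed_gbps):
    """Capacity needed to run every port at line rate in both directions."""
    return ports * port_speed_gbps * 2


# Assumed port speeds; only the port counts come from the list above.
print(full_duplex_capacity_gbps(24, 1))   # Netgear JGS524 -> 48
print(full_duplex_capacity_gbps(24, 10))  # HP 5920AF-24XG -> 480
```

Both results match the 48 Gbps and 480 Gbps listed for those two switches, which suggests they are non-blocking under the assumed port speeds. The other switches have uplink or module ports that this simple formula does not account for.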

Systems Administration Documentation

For old documentation, see: Old Wiki Information

Current Projects

This is the list we will work from in addition to service requests.

  • Possibly: update CentOS on Pollock and Bronte
  • Mailing list archives are gone
  • Graceful shutdown during outage
  • Web logins
  • Reinstall OSes on Lovelace and Hamilton machines
    • Hamilton: installed, can ssh, so can do whatever else we need
    • Lovelace machine: still no Internet
  • shinken: web is back, need to check status
  • layout node outside machine room - Eli interested
  • layout: make different node the head node?
  • passwords
  • email to users (automated)
  • check Lovelace DNS file - check what machines we have vs. what's in the DNS file?
  • check backup statuses of machines, see the wiki page on this - also backup.cs.e.e (indiana?)
  • Password auditing script
  • FIFO for requests rather than ad-hoc
  • Accounting for hours logged
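
The DNS item in the list above asks which machines we have versus what is in the DNS file. One way that comparison could be sketched, assuming the DNS file uses simple BIND-style A records and that a list of physically present machines is kept somewhere; the host names, the addresses (taken from the 192.0.2.0/24 documentation range), and the function names are all hypothetical:

```python
import re

# Matches a simple A record: "name [ttl] [IN] A address [; comment]".
A_RECORD = re.compile(
    r"^(\S+)\s+(?:\d+\s+)?(?:IN\s+)?A\s+(\d{1,3}(?:\.\d{1,3}){3})\s*(?:;.*)?$"
)


def hosts_in_zone(zone_text):
    """Return {hostname: address} for every A record found in the text."""
    hosts = {}
    for raw in zone_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        match = A_RECORD.match(line)
        if match:
            hosts[match.group(1).lower()] = match.group(2)
    return hosts


def compare(zone_hosts, inventory):
    """Return (names only in DNS, names only in the inventory), both sorted."""
    in_dns = set(zone_hosts)
    on_floor = {name.lower() for name in inventory}
    return sorted(in_dns - on_floor), sorted(on_floor - in_dns)


if __name__ == "__main__":
    # Hypothetical zone fragment and inventory, for illustration only.
    zone = (
        "; lab machines\n"
        "hamilton1   IN  A   192.0.2.10\n"
        "lovelace1   IN  A   192.0.2.11\n"
        "lovelace2   IN  A   192.0.2.12\n"
    )
    stale, missing = compare(hosts_in_zone(zone), ["hamilton1", "lovelace1", "lovelace3"])
    print(stale)    # ['lovelace2']
    print(missing)  # ['lovelace3']
```

Names that appear only in DNS are candidates for striking from the file; names that appear only in the inventory still need records.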

From the summer, possible followup

  • 159.28.23.26 is a ghost machine - it responds to ping and purportedly exists, but we are unsure where it is or what it is. - ping -f - Eli looking into it in the context of al-salam
  • Power map additions and updates
  • nbgrader - setup continues
  • indiana config
  • Hadoop on Whedon - Vitalii and Adam (stuck on ?) - talk to Ajit