Difference between revisions of "Sysadmin"

From Earlham CS Department

Revision as of 10:24, 20 July 2018


Machines and Brief Descriptions of Services

CS Machines

Server layout as of May 2017
NET (vm1)
  • LDAP Server
  • DNS
  • DHCP
  • Backup to Dali: etc, var
WEB (vm2)
  • Mailman
  • Mail Stack
  • Apache2
  • PostgreSQL
  • MySQL
  • Wiki
  • Backup to Dali: etc, var
TOOLS (vm3)
  • SageNB Server
  • Jupyterhub Server
  • Software Modules
  • NginX
  • SSH
  • Users
  • Backup to Dali: etc, var, mnts, sage
BABBAGE
  • Firewall
PROTO
  • Weather Monitoring
  • GPS/NTP
  • Energy Monitoring
  • Backup to Dali: etc, var
CONTROL
  • Users
  • SSH
  • HOME
  • TOOLS
  • Backup to Dali: etc, var
SMILEY
  • XenDocs
  • NET
  • WEB
  • NFS
  • Backup to Dali: etc, var
SHINKEN
  • Users
  • SSH
  • Add machines
MURPHY
  • Elderly email stack
  • Users
  • SSH
HOME (vm0)
  • SSH
  • NFS
  • Backup to Dali: eccs, etc, var
  • deprecated 07-2018

Cluster Machines

HOPPER
  • Users
  • SSH
  • NFS server
  • LDAP server
  • Software Modules
  • PostgreSQL
  • Wiki
  • Apache2
  • DNS
  • DHCP
  • Backup to Dali: etc, var, cluster
INDIANA
  • New Storage Server
DALI
  • Storage Server
  • Gitlab
  • Backups
  • NginX
  • Backup to Dali (/media/r10_vol/backups/): etc, var/opt/gitlab/backups
AL-SALAM
  • WebMO
  • Software Modules
  • Apache2
  • Backup to Dali: etc, var
LAYOUT
  • Jupyterhub Server
  • Software Modules
  • NginX
  • Apache2
  • WebMO
  • Backup to Dali: etc, var
BRONTE
  • Software Modules
  • Backup to Dali: etc, var, nbserver
POLLOCK
  • Software Modules
  • WebMO
  • NginX
  • Backup to Dali: etc, var
KAHLO
  • Storage Server
  • Backups
  • NginX
  • Backup to Dali: etc, var
BIGFE
  • Software Modules
  • Hosts BCCD-related repositories and distributions.
T-VOC
  • Software Modules
ELWOOD
  • Software Modules
  • Used by BCCD to host www.bccd.net and www.littlefe.net. Will be deprecated when the BCCD project moves its sites to cloud-based hosting platforms.
KRASNER
  • Docker platform on an old Lovelace machine upgraded to 16 GB of RAM.

Switches

SG538SF02J
  • Model: HP Procurve 3400cl
  • Ports: 24
  • Backplane bandwidth:
    • 88 Gbps
    • 64 million pps
  • Memory:
    • 2 MB packet buffer
    • 16 MB dual flash
    • 128 MB SDRAM
  • Cut-through switching: No
  • Unused as of May 12, 2017
CN63FP762S
  • Model: HP 2530-24G
  • Ports: 24
  • Switching Capacity:
    • 56 Gbps
    • 41.6 million pps
  • Memory:
    • 1.5 MB packet buffer
    • 256 MB flash
    • 128 MB DDR3 DIMM
  • Cut-through switching: No
  • Connected to Al-Salam as of May 12, 2017
SG525SG025
  • Model: HP Procurve 3400cl
  • Ports: 24
  • Backplane bandwidth:
    • 88 Gbps
    • 64 million pps
  • Memory:
    • 2 MB packet buffer
    • 16 MB dual flash
    • 128 MB SDRAM
  • Cut-through switching: No
  • Connected to layout and whedon as of May 12, 2017
Netgear JGS524
  • Current cluster head-node
  • Unmanaged (no console/configuration)
  • Ports: 24
  • Switching bandwidth:
    • 48 Gbps
    • 1.5 million pps
  • Memory:
    • 2 MB packet buffer
  • Cut-through switching: No
  • Connected to Al-Salam, Hopper, Pollock, Nagios, Dali, Kahlo, Bronte as of May 12, 2017
cs-main
  • Model: HP 5920AF-24XG
  • Ports: 24
  • Backplane bandwidth:
    • 480 Gbps
    • 367 million pps
  • Memory:
    • 3.6 GB packet buffer
    • 256 MB dual flash
    • 2 GB SDRAM
  • Cut-through switching: Yes
  • IP Address: 159.28.31.66
  • Connected to layout, kahlo, and dali as of May 12, 2017
5500denniscs-sw1
  • Model: HP 5500 JG542A
  • Ports: 24
  • Backplane bandwidth:
    • 224 Gbps
    • 166.6 million pps
  • Memory:
    • 6 MB packet buffer
    • 512 MB dual flash
    • 1 GB SDRAM
  • Cut-through switching: No
  • IP Address: 159.28.31.67
  • Connected to Babbage, Control, Nagios, and the cluster's Netgear switch (via port 14) as of May 12, 2017
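
The packets-per-second figures listed for the switches can be sanity-checked against Ethernet line rate. The following is a minimal sketch, assuming minimum-size 64-byte frames plus 8 bytes of preamble and a 12-byte inter-frame gap on the wire; the function names are illustrative and not taken from any vendor tool.

```python
# Line-rate packet throughput for minimum-size Ethernet frames.
# Assumption: 64-byte frame + 8-byte preamble + 12-byte inter-frame gap
# = 84 bytes occupied on the wire per packet.
WIRE_BYTES_PER_MIN_FRAME = 64 + 8 + 12


def line_rate_pps(link_bps: float) -> float:
    """Packets per second that one direction of a link carries at line rate."""
    return link_bps / (WIRE_BYTES_PER_MIN_FRAME * 8)


def aggregate_mpps(ports: int, link_bps: float) -> float:
    """Millions of packets per second if every port forwards at line rate."""
    return ports * line_rate_pps(link_bps) / 1e6
```

For example, `line_rate_pps(1e9)` is about 1.49 million pps for a single 1 Gbps port, and `aggregate_mpps(24, 1e9)` is about 35.7 million pps for 24 such ports. The Netgear's listed 1.5 million pps is close to the single-port value, so it may be a per-port number rather than an aggregate; vendor aggregates may also count uplink ports not listed above.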

Systems Administration Documentation

For old documentation, see: Old Wiki Information

Current Projects

Last updated 16 July

  • 159.28.23.26 is a ghost machine: it responds to ping and purportedly exists, but we do not know where it is or what it is (ping -f)
  • Power map additions and updates
  • nbgrader - setup continues
  • indiana config
  • shinken - didn't boot, and the whole two sets of processes bit
  • File:Manual TeraStation5010.pdf
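
One way to start tracing the ghost machine is to find its MAC address from a host on the same subnet and then look that MAC up in the managed switches' address tables. This is a minimal sketch, assuming a Linux host with iproute2 whose `ip neigh show` output has the usual `<ip> dev <iface> lladdr <mac> <state>` shape; the helper names are hypothetical.

```python
import subprocess
from typing import Optional


def mac_for_ip(neigh_output: str, ip: str) -> Optional[str]:
    """Return the link-layer address recorded for `ip`, or None if absent."""
    for line in neigh_output.splitlines():
        fields = line.split()
        # Entries without a resolved MAC (e.g. FAILED) have no "lladdr" field.
        if fields and fields[0] == ip and "lladdr" in fields:
            idx = fields.index("lladdr")
            if idx + 1 < len(fields):
                return fields[idx + 1]
    return None


def read_neighbor_table() -> str:
    """Read the kernel neighbor (ARP) table; ping the address first to fill it."""
    result = subprocess.run(
        ["ip", "neigh", "show"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Typical use would be `mac_for_ip(read_neighbor_table(), "159.28.23.26")` right after pinging the address from a machine on the same subnet.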

Elderly, to be sorted through

  • Adam and Ahsan will be there for Noyes room tour on Friday, April 20 at 08:00
  • Set up and install new machines in Lovelace - Eli
    • New machine has Debian installed but still needs an Ethernet driver
  • 10 Gbps for dali and kahlo, hopper, etc (nfs mounts in /etc/fstab) - Adam
    • Very close; reboot test at 10 a.m. on 8 April
  • Chau & Laurence will be brought up to speed for Layout - Adam
    • "Layout needs to be brought out of surgery" - Charlie
    • Layout - head node swap, check disk space on /scratch - Adam
  • APC is on Shinken now - Alek
  • Postgres connection monitoring on Shinken - Vitalii
  • Figure out the cs web production and test solution - Chau
  • Upgrade t-voc - ?
  • -------------------------
  • Backup - max disk capacity; scripts on all machines backing up at least /etc; babbage - Chau
  • Ganglia on Hopper - Ahsan
  • UPS load monitoring - Eli
    • One UPS is being monitored; more can be added, and some details still need to be cleaned up
  • Temperature and humidity monitoring in the machine room (new item)
  • Fixing the user create script
  • Password auditing script
  • -------------------------
  • FIFO for requests rather than ad-hoc
  • Accounting for hours logged
  • PBS Shinken monitoring
  • Power layout - legend, color-code servers by type, how are servers with 2x power supplies plumbed?
  • -------------------------
  • Shinken - Vitalii and Aleks (documentation, monitoring webmo and pm8)
  • Hadoop on Whedon - Vitalii and Adam (stuck on ?)
  • Layout - Adam (stuck on ldap)
  • Gaussian & WebMO on Whedon - Ahsan and Eli (stuck on firewall)
  • Backup - Châu (moving along, setup backup.cs.e.e next)
    • Need to set up a machine with CentOS in Noyes basement, eventually run it over to Lilly - Laurence
  • Installing power monitor, etc. and rack cleanup - TO BE ASSIGNED (eli and charlie)
    • switch switch (charlie to check inventory)
  • Mothur - Ahsan
  • password policy, force change and random initial
    • For now, notify people who still have the default password, then change it after a couple of days; a script will generate a random string
  • Talk about at next meeting:
    • Spring break and summer people (important)
    • Jon's user and Postgres database
    • investigate tools /clients/ directory with what looks like duplicate user directories
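
The random initial password mentioned in the password policy item could be generated along these lines. This is a minimal sketch rather than the script actually in use; the 16-character default, the letters-and-digits alphabet, and the function name are assumptions.

```python
import secrets
import string

# Assumed alphabet: letters and digits only, so passwords are easy to relay.
ALPHABET = string.ascii_letters + string.digits


def random_initial_password(length: int = 16) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    if length < 8:
        raise ValueError("initial passwords should be at least 8 characters")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

The `secrets` module is used instead of `random` because it draws from the operating system's secure random source, which matters for anything used as a credential.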

(list from 2017-10-26)

  • Finish migrating tools and home to smiley; migrate web and net back to control
    • Record consistent & thorough documentation, especially concerning the startup and shutdown of the VMs
  • Set up graceful shutdown when we detect that we are running solely off UPS power
    • Additionally, set up clean shutdown and startup for VMs on smiley and control (?)
  • Fix reverse lookup error for mail.cs.earlham.edu
    • Should consistently refer to 159.28.22.2 (web.cs.earlham.edu)
    • It's possible that this isn't actually broken.
  • Layout infiniband subnet manager
  • Layout disk swap, new lo0
    • Redo /scratch for mglerner group on /media/r10_vol ?
  • Migrate Elwood, BigFe, t-voc to repurposed Lovelace Machines (Eli)
  • Enable jumbo frames on the HP Al-Salam switch?
  • Strike unused lovelace machine addresses from CS DNS file
    • Perhaps there's a python file in root's home somewhere that checks for unused DNS/DHCP addresses?
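
A check for unused DNS addresses could look roughly like the sketch below. It assumes the CS DNS file uses simple BIND-style `name [TTL] [IN] A address` lines and that a list of addresses known to be in use (for example from DHCP leases or a ping sweep) is collected separately; both the input format and the function names are assumptions, and the regular expression deliberately handles only that simple record shape.

```python
import ipaddress
import re

# Simplified matcher for lines such as "lovelace01 IN A 192.0.2.11 ; note".
A_RECORD = re.compile(
    r"^\s*(?P<name>[\w.-]+)\s+(?:\d+\s+)?(?:IN\s+)?A\s+"
    r"(?P<addr>\d+\.\d+\.\d+\.\d+)\s*(?:;.*)?$"
)


def parse_a_records(zone_text):
    """Map each hostname to its IPv4 address; other lines are ignored."""
    records = {}
    for line in zone_text.splitlines():
        match = A_RECORD.match(line)
        if match:
            records[match["name"]] = ipaddress.ip_address(match["addr"])
    return records


def unused_entries(zone_text, seen_addresses):
    """Return hostnames whose address is not among the addresses seen in use."""
    seen = {ipaddress.ip_address(addr) for addr in seen_addresses}
    records = parse_a_records(zone_text)
    return sorted(name for name, addr in records.items() if addr not in seen)
```

Entries returned by `unused_entries` are only candidates for removal; a machine that is powered off will look unused too, so the list should be reviewed by hand.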

Ongoing Projects (Spring 2017)

TODO

  • EMAILING ALL THE USERS https://wiki.cs.earlham.edu/index.php/Sysadmin:Old:Contacting_All_Users
  • SHUTDOWN SCHEDULED FOR SUNDAY (APRIL 16)
  • Fix certs for gitlab, etc.
  • Secure 1-2 admins for the summer
  • Prep layout for May-June usage
  • Practice shutdown-startup procedure (with Michael)
  • Nsswitch consistency across all machines
  • Document tools: startup / shutdown - Charlie
  • Use Sysadmin namespace for all our pages - All
    • Testing usefulness of documentation - Dave
  • Al Salam: configure switch, re-rack. - Vitalii
    • HP switch should be reset and tested.
  • LDAP cleanup of system users / old groups - James
  • Layout - Nirdesh
    • Lo0 RAID (mdadm)
    • 10Gb from Dali to lo0 (adding rules on compute node routing tables as a possible fix)
    • BIOS reset
  • 10Gb, perfsonar, ...
  • Monitoring: (Ganglia, Shinken)
    • Getting consistency among all the machines (check_nrpe regularly stops working).
  • Whedon: configured and available
  • Change passwords (on everything): Postgres, Shinken, ...
  • Webcam on office whiteboard (new office location?)
  • Learn virtual machine architecture and modules - Dave
    • Document in a format for future admin training?
    • Find existing introduction material
  • Mirror control for testing, swapping, etc.
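
The nsswitch consistency item in the list above could be checked with a small comparison script. This is a minimal sketch, assuming the contents of each machine's /etc/nsswitch.conf have already been gathered into a dictionary keyed by hostname (how they are collected is left open); the function names are illustrative.

```python
def parse_nsswitch(text):
    """Parse nsswitch.conf text into {database: [sources...]}."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line or ":" not in line:
            continue
        database, sources = line.split(":", 1)
        config[database.strip()] = sources.split()
    return config


def inconsistent_databases(configs):
    """Given {host: nsswitch text}, return {database: {host: sources}}
    for every database whose source list differs between hosts."""
    parsed = {host: parse_nsswitch(text) for host, text in configs.items()}
    databases = set().union(*(p.keys() for p in parsed.values()))
    diffs = {}
    for db in sorted(databases):
        per_host = {host: tuple(p.get(db, ())) for host, p in parsed.items()}
        if len(set(per_host.values())) > 1:
            diffs[db] = per_host
    return diffs
```

An empty result means every host lists the same sources, in the same order, for every database; order matters because sources are tried in sequence.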

DONE (19 Jan 2017)

  • Examine extra "layout" node. - Adam
    • Differences are: Single PSU, Single GPGPU, No VGA.
    • It has Infiniband and 10Gb cards installed.
  • Networking - Adam, Charlie
    • IP over Infiniband working on layout
      • Resolved by resetting IB switch configuration: ibwarn: [3349] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1)

FUTURE

  • Centralized password database / manager / location

Current Projects (updated 13 Oct 16)

  • Groups and LDAP and sudo - James
  • Amber - James
  • Edward's setup - Vitalii
  • WebDev access - Nirdesh
  • Puppet - James and Vitalii
  • Bacula - Nirdesh
  • SSL certificate upgrade and documentation - Kristin
  • Listserv merging with archives preserved - Nirdesh
  • Ganglia - Bret
  • Shinken - Vitalii
    • latency, UPS
  • New Layout node - ? and ?
  • Provision Sappho (compute) - after Puppet
  • Provision Kahlo (storage) -
    • replace broken drive
  • I2 setup
    • DTN, storage nodes, head nodes, ports in CST
  • Provision Whedon (compute) - after Puppet
  • Shutdown and startup test - scheduled for Sunday 27 November
  • Disk cleaning - Charlie
  • Password changing in the CS and cluster domains - Vitalii and James
  • Proto setup and maintenance with HIP/Green Science