Difference between revisions of "Sysadmin"
Revision as of 00:48, 13 April 2017
Machines and Brief Descriptions of Services
Machine | Services
HOME (vm0) | Users, SSH, NFS
NET (vm1) | LDAP server, DNS, DHCP
WEB (vm2) | Mailman, Mail Stack, Apache2, PostgreSQL, MySQL, Wiki
TOOLS (vm3) | SageNB Server, Jupyterhub Server, Software Modules, NginX
BABBAGE | Firewall
PROTO | Weather Monitoring, GPS/NTP, Energy Monitoring
HOPPER | Users, SSH, NFS, Software Modules, PostgreSQL, Wiki, Apache2, DNS, DHCP
DALI | Gitlab, Backups, NginX
AL-SALAM | WebMO, Software Modules, Apache2
LAYOUT | Jupyterhub Server, Software Modules, NginX, Apache2, WebMO
BRONTE | Software Modules
POLLOCK | Software Modules, WebMO, NginX
Systems Administration Documentation
For old documentation, see: Old Wiki Information
Services |
Current Projects (updated 15 Jan 2017)
TODO
- Fix certs for gitlab, etc.
- Secure 1-2 admins for the summer
- Prep layout for May-June usage
- Practice shutdown-startup procedure (with Michael)
- Nsswitch consistency across all machines
- Document tools: startup / shutdown - Charlie
- Use Sysadmin namespace for all our pages - All
- Testing usefulness of documentation - Dave
- Al Salam: configure switch, re-rack. - Vitalii
  - HP switch should be reset and tested.
- LDAP cleanup of system users / old groups - James
- Layout - Nirdesh
  - Lo0 RAID (mdadm)
  - 10Gb from Dali to lo0 (adding rules on compute node routing tables as a possible fix)
  - BIOS reset
- 10Gb, perfSONAR, ...
- Monitoring: (Ganglia, Shinken)
  - Getting consistency among all the machines (check_nrpe regularly stops working).
- Whedon: configured and available
- Change passwords (on everything). Postgres, Shinken, ...
- Webcam on office whiteboard (new office location?)
- Learn virtual machine architecture and modules - Dave
- Document in a format for future admin training?
- Find existing introduction material
- Mirror control for testing, swapping, etc.
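For the "Nsswitch consistency across all machines" item above, one possible way to spot drift is to collect each machine's /etc/nsswitch.conf and compare the entries per database. The following is only a minimal sketch: the function names are hypothetical, the sample file contents are invented for illustration, and how the files get collected (scp, Puppet, etc.) is left open.

```python
from collections import defaultdict


def parse_nsswitch(text):
    """Return {database: [sources, ...]} from nsswitch.conf text.

    Comments (everything after '#') and blank lines are ignored.
    """
    entries = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        database, sources = line.split(":", 1)
        entries[database.strip()] = sources.split()
    return entries


def find_inconsistencies(configs):
    """Compare nsswitch.conf contents across hosts.

    configs maps hostname -> file text. The result maps each database
    whose entry differs between hosts to {sources tuple: [hostnames]}.
    A host that lacks a database entirely is reported with an empty tuple.
    """
    parsed = {host: parse_nsswitch(text) for host, text in configs.items()}
    databases = {db for entries in parsed.values() for db in entries}
    by_db = defaultdict(lambda: defaultdict(list))
    for db in sorted(databases):
        for host in sorted(parsed):
            by_db[db][tuple(parsed[host].get(db, ()))].append(host)
    return {db: dict(v) for db, v in by_db.items() if len(v) > 1}


if __name__ == "__main__":
    # Invented sample contents; real input would be the files from each machine.
    sample = {
        "hopper": "passwd: files ldap\ngroup: files ldap\n",
        "layout": "# local override\npasswd:   files ldap\ngroup: files\n",
    }
    for db, variants in find_inconsistencies(sample).items():
        for sources, hosts in variants.items():
            print(db, " ".join(sources) or "(missing)", "->", ", ".join(hosts))
```

Comparing parsed entries rather than raw file text means differences in comments or spacing are not reported as inconsistencies.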
DONE (19 Jan 2017)
- Examine extra "layout" node. - Adam
  - Differences are: Single PSU, Single GPGPU, No VGA.
  - It has InfiniBand and 10Gb cards installed.
- Networking - Adam, Charlie
  - IP over InfiniBand working on layout
    - Resolved by resetting IB switch configuration:
      ibwarn: [3349] mad_rpc_rmpp: _do_madrpc failed; dport (Lid 1)
FUTURE
- Centralized password database / manager / location
Current Projects (updated 13 Oct 16)
- Groups and LDAP and sudo - James
- Amber - James
- Edward's setup - Vitalii
- WebDev access - Nirdesh
- Puppet - James and Vitalii
- Bacula - Nirdesh
- SSL certificate upgrade and documentation - Kristin
- Listserv merging with archives preserved - Nirdesh
- Ganglia - Bret
- Shinken - Vitalii
  - latency, UPS
- New Layout node - ? and ?
- Provision Sappho (compute) - after Puppet
- Provision Kahlo (storage) -
  - replace broken drive
- I2 setup
  - DTN, storage nodes, head nodes, ports in CST
- Provision Whedon (compute) - after Puppet
- Shutdown and startup test - scheduled for Sunday 27 November
- Disk cleaning - Charlie
- Password changing in the CS and cluster domains - Vitalii and James
- Proto setup and maintenance with HIP/Green Science