Difference between revisions of "Sysadmin"

From Earlham CS Department
m (Services)
m
Line 1: Line 1:
 
 
This is the hub for the CS sysadmins on the wiki.
 
This is the hub for the CS sysadmins on the wiki.
  
== Common Tasks ==
+
= Overview =
 
 
* [[Sysadmin:New Sysadmins | Welcoming a new sysadmin ]] <- log in here with wiki credentials to begin learning to be an admin
 
* [[Sysadmin:Troubleshooting|General troubleshooting tips for admins]]
 
* [[Sysadmin:SSH|Useful ssh information for admins]]
 
* [[Sysadmin:Recurring Tasks | Recurring tasks - e.g. software updates, hardware replacements]]
 
* [[Sysadmin:SlackAndGitLab | Slack and GitLab integration]]
 
* [https://code.cs.earlham.edu/sysadmin/ticket-tracker Ticket tracking for current projects]
 
* [[Sysadmin:User Management | User Management]]
 
** [[Sysadmin:Contacting all users|Contacting all users]]
 
** [[Reset password]]
 
* [[Sysadmin:Software installation | Software installation]]
 
** [[Modules | Installing software under modules ]]
 
* [[Sysadmin:Monitoring | Monitoring ]]
 
* [[Sysadmin:Backup|Backup]]
 
* [[Sysadmin:AddComputer|Add a computer]]
 
** [[Sysadmin:Setting up Lovelace Lab Machines | Setting up Lovelace Lab Machines]]
 
* [[Senior projects]]
 
* [[ShutdownProcedure| Shutdown and Boot up]]
 
** [[Sysadmin:TestingServices | Testing services]] (After a reboot, upgrade, change in the phase of the moon, etc.)
 
* [[Password managers]]
 
* [[Server safety]]
 
* [[Sysadmin:Upgrading SSL Certificate | Upgrading SSL Certificates ]]
 
** [[Sysadmin:ImportantInfo:SSLcerts| Generating SSL Certificates]]
 
* [[Sysadmin:Launch at startup|Launch a process at startup]]
 
* [[Sysadmin:CS-ITS Interoperability|Working with ITS]]
 
* [[Sysadmin:Recurring spending | Recurring spending ]]
 
* [[Sandbox Notes|Sandbox Notes]]
 
 
 
== Services ==
 
* [[Sysadmin:Services:ClusterOverview|Cluster Overview]] and [[Sysadmin:Ccg-admin|additional details]]
 
* [[Sysadmin:Jupyterhub Notebook Server|Jupyterhub]] and [[Nbgrader notes|NBGrader]]
 
* [[Sysadmin:Web Servers|Web Servers and Websites]]
 
* [[Sysadmin:Services:Databases|Databases]]
 
* [[Sysadmin:DNS & DHCP|DNS and DHCP]]
 
* [[Sysadmin:VirtualBox | VirtualBox]]
 
* [[Sysadmin:XenDocs | Xen Server]]
 
* [[X Applications]]
 
* [[Bash_start_up_script|Bash startup scripts]]
 
* [[Sysadmin:AWS|AWS]]
 
For old documentation, see: [[Sysadmin:Old | Old Wiki Information]]
 
 
 
= Machines and Brief Descriptions of Services =
 
  
 
[https://docs.google.com/drawings/d/1XaULz5IxXV_BZQjrko3QJ8wV5aXsSTYcSWxxT49OyZk/edit If you're visually inclined, we have a colorful and easy-to-edit map of our servers here!]
 
[https://docs.google.com/drawings/d/1XaULz5IxXV_BZQjrko3QJ8wV5aXsSTYcSWxxT49OyZk/edit If you're visually inclined, we have a colorful and easy-to-edit map of our servers here!]
  
=== Compute (servers and clusters) ===
+
== Compute (servers and clusters) ==
  
 
We have CS and cluster machines.
 
We have CS and cluster machines.
Line 61: Line 18:
  
 
Cluster machines:
 
Cluster machines:
* hopper: landing server
+
* hopper: landing server, NFS host for cluster
 
* bronte, pollock, lovelace: large compute servers
 
* bronte, pollock, lovelace: large compute servers
* layout, whedon: clusters of multiple nodes linked together through a switch and managed through a headnode
+
* layout, wachowski: clusters of multiple nodes linked together through a switch and managed through a headnode
* sakurai: big data storage and exports
+
* meier, miyamoto, sakurai: backup servers
* meier, miyamoto: backup servers
 
 
* monitor: server monitoring
 
* monitor: server monitoring
  
 
We have spare nodes on the old al-salam cluster’s rack. These should be used for services that can handle minutes to hours of downtime, as they only have one power supply.
 
We have spare nodes on the old al-salam cluster’s rack. These should be used for services that can handle minutes to hours of downtime, as they only have one power supply.
  
==== Specialized resources ====
+
=== Specialized resources ===
  
 
Specialized computing applications are supported on the following machines:
 
Specialized computing applications are supported on the following machines:
Line 78: Line 34:
 
* containers: bowie
 
* containers: bowie
  
=== Network ===
+
== Network ==
  
 
We have two network fabrics linking the machines together. There are three subdomains.
 
We have two network fabrics linking the machines together. There are three subdomains.
  
==== 10 Gb ====
+
=== 10 Gb ===
  
 
We have a 10Gb fabric for mounting files over NFS. Machines with 10Gb support have an IP address in the private /24 (class C-sized) range 10.10.10.0/24, and we would like to add DNS records for these addresses.
 
We have a 10Gb fabric for mounting files over NFS. Machines with 10Gb support have an IP address in the private /24 (class C-sized) range 10.10.10.0/24, and we would like to add DNS records for these addresses.
  
==== 1 Gb (cluster, cs) ====
+
=== 1 Gb (cluster, cs) ===
  
 
We have two /24 (class C-sized) subnets on the 1Gb fabric: 159.28.22.0/24 (CS) and 159.28.23.0/24 (cluster). This gives the 1Gb fabric twice as many IP addresses as the 10Gb fabric.
 
We have two /24 (class C-sized) subnets on the 1Gb fabric: 159.28.22.0/24 (CS) and 159.28.23.0/24 (cluster). This gives the 1Gb fabric twice as many IP addresses as the 10Gb fabric.
Line 92: Line 48:
 
Any user accessing *.cluster.earlham.edu or *.cs.earlham.edu is communicating over the 1Gb network.
 
Any user accessing *.cluster.earlham.edu or *.cs.earlham.edu is communicating over the 1Gb network.
  
==== Intra-cluster fabrics ====
+
=== Intra-cluster fabrics ===
  
The layout cluster has an InfiniBand infrastructure. Whedon has only a 1Gb infrastructure.
+
The layout cluster has an InfiniBand infrastructure. Wachowski has only a 1Gb infrastructure.
  
=== Power ===
+
== Power ==
  
 
We have a backup power supply, with batteries last upgraded in 2019 (?). We’ve had a few outages since then and power has held up well.
 
We have a backup power supply, with batteries last upgraded in 2019 (?). We’ve had a few outages since then and power has held up well.
  
=== HVAC ===
+
== HVAC ==
  
 
HVAC systems are static and are largely managed by Facilities.
 
HVAC systems are static and are largely managed by Facilities.
Line 107: Line 63:
  
 
[[Sysadmin:Layers of abstraction for filesystems|A word about what's happening between files and the drives they live on.]]
 
[[Sysadmin:Layers of abstraction for filesystems|A word about what's happening between files and the drives they live on.]]
 +
 +
 +
= New sysadmins =
 +
 +
These pages will be helpful for you if you're just starting in the group:
 +
 +
* [[Sysadmin:New Sysadmins | Welcoming a new sysadmin ]]
 +
* [[Sysadmin:Troubleshooting|General troubleshooting tips for admins]]
 +
* [[Sandbox Notes|Sandbox Notes]]
 +
* [[Password managers]]
 +
* [[Server safety]]
 +
* [https://code.cs.earlham.edu/sysadmin/ticket-tracker Ticket tracking for current projects]
 +
 +
Note: you'll need to log in with wiki credentials to see most Sysadmin pages.
 +
 +
= Additional information =
 +
 +
These pages contain a lot of the most important information about our systems and how we operate.
 +
 +
===Technical docs===
 +
 +
* [https://code.cs.earlham.edu/sysadmin/ticket-tracker Ticket tracking for current projects]
 +
* [[Server safety]]
 +
* [[Sysadmin:Backup|Backup]]
 +
* [[Sysadmin:Monitoring | Monitoring ]]
 +
* [[Sysadmin:SSH|SSH info relevant to admins]]
 +
* [[Sysadmin:User Management | User Management]] and [[Sysadmin:LDAP|LDAP]] generally
 +
* [[Sysadmin:Jupyterhub Notebook Server|Jupyterhub]] and [[Nbgrader notes|NBGrader]]
 +
* [[Sysadmin:MailStack|Email service]]
 +
* [[Sysadmin:XenDocs | Xen Server]]
 +
* [[Sysadmin:NFS|Network File System (NFS)]]
 +
* [[Sysadmin:Web Servers|Web Servers and Websites]]
 +
* [[Sysadmin:Services:Databases|Databases]]
 +
* [[Sysadmin:DNS & DHCP|DNS and DHCP]]
 +
* [[Sysadmin:AWS|AWS]]
 +
* [[Bash_start_up_script|Bash startup scripts]]
 +
* [[Sysadmin:VirtualBox | VirtualBox]]
 +
* [[X Applications]]
 +
* [[Sysadmin:Services:ClusterOverview|Cluster Overview]] and [[Sysadmin:Ccg-admin|additional details]]
 +
 +
===Common tasks===
 +
* [[Sysadmin:Recurring Tasks | Recurring tasks - e.g. software updates, hardware replacements]]
 +
* [[Sysadmin:Contacting all users|Contacting all users]]
 +
* [[Reset password]]
 +
* [[Sysadmin:Software installation | Software installation]]
 +
* [[Modules | Installing software under modules ]]
 +
* [[Sysadmin:AddComputer|Add a computer to CS or cluster domains]]
 +
* [[Senior projects|Supporting senior projects]]
 +
* [[ShutdownProcedure|How to do a planned shutdown and reboot of the system]]
 +
** [[Sysadmin:TestingServices | Testing services]] (after a reboot, upgrade, change in the phase of the moon, etc.)
 +
* [[Sysadmin:Upgrading SSL Certificate | Upgrading SSL Certificates ]]
 +
* [[Sysadmin:Launch at startup|Launch a process at startup]]
 +
 +
===Group and institution information===
 +
* [[Sysadmin:CS-ITS Interoperability|Working with ITS]]
 +
* [[Sysadmin:Recurring spending | Recurring spending ]]
 +
* [[Sysadmin:SlackAndGitLab | Slack and GitLab integration]]

Revision as of 13:29, 15 July 2021

This is the hub for the CS sysadmins on the wiki.

Overview

If you're visually inclined, we have a colorful and easy-to-edit map of our servers here!

Compute (servers and clusters)

We have CS and cluster machines.

CS machines:

  • bowie: hosts and exports user files; JupyterHub; landing server
  • smiley: VM host, not accessible to regular users
  • web: website host
  • net: network administration host for CS
  • code: GitLab host
  • auth: host of the LDAP user database

Cluster machines:

  • hopper: landing server, NFS host for cluster
  • bronte, pollock, lovelace: large compute servers
  • layout, wachowski: clusters of multiple nodes linked together through a switch and managed through a headnode
  • meier, miyamoto, sakurai: backup servers
  • monitor: server monitoring

We have spare nodes on the old al-salam cluster’s rack. These should be used for services that can handle minutes to hours of downtime, as they only have one power supply.

Specialized resources

Specialized computing applications are supported on the following machines:

Network

We have two network fabrics linking the machines together. There are three subdomains.

10 Gb

We have a 10Gb fabric for mounting files over NFS. Machines with 10Gb support have an IP address in the private /24 (class C-sized) range 10.10.10.0/24, and we would like to add DNS records for these addresses.

1 Gb (cluster, cs)

We have two /24 (class C-sized) subnets on the 1Gb fabric: 159.28.22.0/24 (CS) and 159.28.23.0/24 (cluster). This gives the 1Gb fabric twice as many IP addresses as the 10Gb fabric.

Any user accessing *.cluster.earlham.edu or *.cs.earlham.edu is communicating over the 1Gb network.
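
The address arithmetic for the two fabrics can be checked with a short script. This is only a sketch using Python's standard ipaddress module: the three subnets are the ones listed on this page, while the names FABRICS, fabric_for and total_addresses are hypothetical helpers invented for illustration, not part of any existing tooling.

```python
import ipaddress

# (fabric, subdomain label, subnet) triples. The subnets are the ones named on
# this page; the labels are descriptive strings chosen for this sketch.
FABRICS = [
    ("10Gb", "nfs", ipaddress.ip_network("10.10.10.0/24")),
    ("1Gb", "cs", ipaddress.ip_network("159.28.22.0/24")),
    ("1Gb", "cluster", ipaddress.ip_network("159.28.23.0/24")),
]


def fabric_for(address: str) -> str:
    """Return 'fabric/label' for the subnet containing address, or 'unknown'."""
    ip = ipaddress.ip_address(address)
    for fabric, label, network in FABRICS:
        if ip in network:
            return f"{fabric}/{label}"
    return "unknown"


def total_addresses(fabric: str) -> int:
    """Count all addresses on a fabric.

    num_addresses includes the network and broadcast addresses, so the number
    of usable host addresses is two fewer per subnet.
    """
    return sum(net.num_addresses for fab, _, net in FABRICS if fab == fabric)


if __name__ == "__main__":
    print("1Gb addresses:", total_addresses("1Gb"))
    print("10Gb addresses:", total_addresses("10Gb"))
    print("159.28.23.17 is on", fabric_for("159.28.23.17"))
```

Each /24 contributes 256 addresses, so the two 1Gb subnets together hold twice as many as the single 10Gb subnet. The example address 159.28.23.17 is arbitrary and does not refer to a particular machine.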

Intra-cluster fabrics

The layout cluster has an InfiniBand infrastructure. Wachowski has only a 1Gb infrastructure.

Power

We have a backup power supply, with batteries last upgraded in 2019 (?). We’ve had a few outages since then and power has held up well.

HVAC

HVAC systems are static and are largely managed by Facilities.

See full topology diagrams here.

A word about what's happening between files and the drives they live on.


New sysadmins

These pages will be helpful for you if you're just starting in the group:

Note: you'll need to log in with wiki credentials to see most Sysadmin pages.

Additional information

These pages contain a lot of the most important information about our systems and how we operate.

Technical docs

Common tasks

Group and institution information