Latest revision as of 14:15, 28 June 2019

Al-Salam is the name for the Earlham Computer Science Department's venerable 13-node computing cluster.

Software

Al-salam was upgraded to CentOS 7 in summer 2019. The notes below, in "Archive", contain useful information, but the specifics may not be current to the cluster as it exists now.

Hardware notes

  • As{0,9,10} each have a 500GB hard drive in addition to the 80GB drive; remember this during configuration
  • As10: the 500GB drive appears when you run the installer but not in df -h; the 80GB drive has the OS but can't see the 500GB drive
  • As12 has two 1TB hard drives in software RAID in addition to the 80GB drive

Archive

A few extra notes beyond these are available here (http://cluster.earlham.edu/wiki/index.php/Al-salam), so check there first before moving forward with big changes.

Installation Notes

headnode

I'll be maintaining a script, /root/install/al-salam.sh, that will also serve as a log. Also following along with BobSCEd-new logs for consistency between clusters.

TODO

  • MPI
  • Software installations into /cluster/al-salam
  • User auth via bs0-new's ldap
  • torque/maui
  • Ganglia
  • shorewall
  • modules

Have done

  • yum install:
gcc.x86_64 gcc-c++.x86_64 gcc-gfortran.x86_64 \
gcc44.x86_64 gcc44-c++.x86_64 gcc44-gfortran.x86_64 \
apr.x86_64 apr-devel.x86_64 expat-devel.x86_64 \
blas.x86_64 dhcp.x86_64
  • rpm install:
    • c3
    • libconfuse
    • libconfuse-devel
  • /etc/c3.conf:
cluster al-salam {
    as0.cluster.earlham.edu:as0.al-salam.loc
    as[1-12]
}
  • in hopper:/etc/rc.conf:
static_routes="bs0 as0"
route_as0="192.168.1.1 159.28.234.150"
  • hopper:/etc/namedb/master/cluster.zone:
as0.cluster.earlham.edu.      IN  A 159.28.234.150
as.cluster.earlham.edu.       IN  CNAME as0
al-salam.cluster.earlham.edu. IN  CNAME as0
  • hopper:/etc/namedb/master/al-salam.loc
    • copy from bobsced.loc, amend as necessary
  • hopper:/etc/namedb/master/1.168.192.in-addr.arpa
    • copy from 0.168.192.in-addr.arpa, amend as necessary
  • hopper:/etc/namedb/master/159.28.234.zone
150 IN  PTR as0.cluster.earlham.edu.
  • hopper:/usr/local/etc/dhcpd.conf
  subnet 192.168.1.0 netmask 255.255.255.0 {

    option routers      192.168.1.1;
    option subnet-mask    255.255.255.0;
    option domain-name    "al-salam.loc";
    option domain-name-servers  159.28.234.1;

    next-server     159.28.234.1;
    filename "pxelinux.0";

    range 192.168.1.100 192.168.1.254;
  }
  • as0:/etc/dhcrelay
# Command line options here
INTERFACES="eth1"
DHCPSERVERS="cluster.earlham.edu"
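
The as[1-12] line in /etc/c3.conf is shorthand for the twelve compute nodes; together with the head node as0 that gives the 13 nodes of the cluster. As a minimal sketch of what that shorthand stands for (this is not part of C3 itself; the helper name expand_c3_range is hypothetical and it only handles the simple prefix[lo-hi] form used here):

```python
import re


def expand_c3_range(spec):
    """Expand a simple C3-style range such as 'as[1-12]' into hostnames.

    Hypothetical helper for illustration only; it handles just the
    'prefix[lo-hi]' form that appears in the c3.conf shown above.
    """
    match = re.fullmatch(r"([A-Za-z0-9_-]+)\[(\d+)-(\d+)\]", spec)
    if match is None:
        # Not a range: treat the spec as a single hostname.
        return [spec]
    prefix = match.group(1)
    lo, hi = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{i}" for i in range(lo, hi + 1)]


nodes = expand_c3_range("as[1-12]")
print(len(nodes), nodes[0], nodes[-1])  # 12 as1 as12
```

A list like this is handy for loops that need to touch every compute node outside of the C3 tools.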

compute nodes

  • clone from bs1-new using udpcast
  • modify network, etc. settings from bobsced to al-salam
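
Cloned compute nodes pick up their addresses from the dhcpd.conf pool listed in the headnode notes. A quick sanity check of that subnet block, sketched with Python's standard ipaddress module (the variable names are illustrative; the addresses are copied from the config):

```python
import ipaddress

# Values copied from the subnet block in hopper:/usr/local/etc/dhcpd.conf
subnet = ipaddress.ip_network("192.168.1.0/255.255.255.0")
router = ipaddress.ip_address("192.168.1.1")    # option routers
first = ipaddress.ip_address("192.168.1.100")   # start of "range"
last = ipaddress.ip_address("192.168.1.254")    # end of "range"

# Number of addresses the dynamic pool can hand out (inclusive range).
pool_size = int(last) - int(first) + 1

print(first in subnet, last in subnet, router in subnet)  # True True True
print(pool_size)                                          # 155
print(first <= router <= last)                            # False: router is outside the pool
```

The pool is far larger than the 13 nodes need, so address exhaustion should not be a concern during cloning.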

Latest Overarching Questions

  • Should we build this machine ourselves?
    1. Are we wasting our money and a learning opportunity by letting the vendor do the building for us?
    2. If it is cheaper, would it be a useful experience for the students this coming semester to take a large collection of hardware and make it into a cluster?
    3. Yes, this was pretty clear from the email thread in early December.
  • How much if any GPGPU hardware do we want? 0, 1 or 2 nodes worth?
  • Do we want a high bandwidth/low latency network?
    • We do not. More expensive than it is worth.
  • What software stack do we want to run? Vendor supplied or the BCCD?
    • Both. Vendor-supplied base with a BCCD virtual machine
      • Will the virtual machine support CUDA?
  • Do compute nodes have spinning disk?
    • Compute nodes have a spinning disk. Solid state is still expensive
  • What's on the local persistent store? /tmp? An entire OS?
  • Support
    • Consider getting the cheapest hardware support; losing a node isn't critical as long as they send a replacement quickly.

Parts List

  1. Nodes - case, motherboard(s), power supply, CPU, RAM, GPGPU cards
  2. Switch - managed, cut-through
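    1. Fitz: Having a hard time finding anyone who sells cut-through switches
      1. How about this store-and-forward switch from HP? http://www.tigerdirect.com/applications/searchtools/item-Details.asp?sku=H24-J9021A&SRCCODE=CHANNELINC&cisrccode=cii_7240393&cpncode=20-3634289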
  3. Power distribution - rack-mount PDUs

Tentative Specifications

Budget

  • $35,000 (leaving $5,000 for discretionary spending)

Nodes

  • Intel Nehalem processors
  • 4 core processors minimum
    • Six cores still expensive
  • 1.5GB RAM per core

Specialty Nodes

  • Two nodes should support CUDA GPGPU

Educationally, we could expect to get significant use out of GPGPUs, although their production use is limited. Adding variety to the set of architectures available would be a bonus for teaching.

Network

  • Gigabit Ethernet fabric with switch

Disk

  • Spinning Disk

OS

  • Virtual BCCD on top of built-in OS.

Quick breakdown

Nodes

Quote | ION #61116 | ION #61164 | SM #174536 | Newegg #1 | Newegg #2 | Intel List #1 | AMD List #1 | AMD List #2
CPU (total cores) | 72 2.4GHz Intel E5530 | 80 2.4GHz Intel E5530 | 80 2.4GHz Intel E5530 | 128 2.4GHz Intel E5530 | 112 2.4GHz Intel E5530 | 100 2.4GHz Intel E5530 | 156 2.0GHz AMD Opteron 2350 | 126 2.6GHz AMD Opteron 2435
RAM | 108GB PC3-10600 | 120GB PC3-10600 | 120GB DDR3-1333 | 192GB DDR3-1333 | 168GB DDR3-1333 | 144GB DDR3-1333 | 160GB DDR2-800 | 120GB DDR2-800
GPU | 2 Tesla C1060 | 2 Tesla C1060 | 2 Tesla C1060 | None | 4 Tesla C1060 | 2 Tesla C1060 | 2 Tesla C1060 | 2 Tesla C1060
Local disk | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Shared chassis | No | Yes | Yes | No | No | No | No | No
Remote mgmt | No | No | IPMI | No | IPMI on GPU nodes | IPMI | IPMI | IPMI
Size (just nodes) | 9U | 6U | 6U | 16U | 12U | 12U | 20U | 10U
Price | $33,173.20 | $33,054.30 | $30,078.00 | $32,910.56 | $34,696.78 | $35,846.00 | $35,275.00 | $33,755.00
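
One way to compare the options in the breakdown is to normalize by core count. The sketch below does that with the core, RAM, and price figures copied from the table; it is a rough comparison only, since it ignores the GPUs, clock speed, and the differences between the Intel and AMD architectures.

```python
# (total cores, total RAM in GB, price in USD), copied from the breakdown
quotes = {
    "ION #61116": (72, 108, 33173.20),
    "ION #61164": (80, 120, 33054.30),
    "SM #174536": (80, 120, 30078.00),
    "Newegg #1": (128, 192, 32910.56),
    "Newegg #2": (112, 168, 34696.78),
    "Intel List #1": (100, 144, 35846.00),
    "AMD List #1": (156, 160, 35275.00),
    "AMD List #2": (126, 120, 33755.00),
}


def per_core(cores, ram_gb, price):
    """Return (GB of RAM per core, dollars per core)."""
    return ram_gb / cores, price / cores


for name, (cores, ram_gb, price) in quotes.items():
    ram_pc, price_pc = per_core(cores, ram_gb, price)
    print(f"{name:14s} {ram_pc:5.2f} GB/core  ${price_pc:7.2f}/core")

# Option with the lowest price per core.
cheapest = min(quotes, key=lambda n: quotes[n][2] / quotes[n][0])
print(cheapest)  # AMD List #1
```

On these numbers the Intel options sit at 1.44 to 1.5GB of RAM per core, while the AMD lists trade memory per core for a lower price per core.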

Power distribution

Model | PDU1220 | PDUMH20 | AP9563 | AP7801
Vendor | TrippLite | TrippLite | APC | APC
Size | 1U | 1U | 1U | 1U
Capabilities | Dumb | Metered | Dumb | Metered
Input power | 20A, 1x NEMA 5-20P | 20A, 1x NEMA L5-20P w/ NEMA 5-20P adapter | 20A, 1x NEMA 5-20P | 20A, 1x NEMA 5-20P
Output power | 13x NEMA 5-20R | 12x NEMA 5-20R | 10x NEMA 5-20R | 8x NEMA 5-20R
Price | $195 | $230 | $120 | $380
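
A rough way to compare the PDUs is by outlet count alone. The sketch below assumes 13 devices with one plug each (the node count; a switch would add one more) and ignores the 20A input limit, which in practice may be the tighter constraint.

```python
import math

# model: (outlets, price in USD), copied from the table above
pdus = {
    "PDU1220": (13, 195),
    "PDUMH20": (12, 230),
    "AP9563": (10, 120),
    "AP7801": (8, 380),
}


def units_and_cost(outlets, price, devices):
    """Return (PDUs needed, total cost) to give every device one outlet."""
    units = math.ceil(devices / outlets)
    return units, units * price


for model, (outlets, price) in pdus.items():
    units, cost = units_and_cost(outlets, price, devices=13)
    print(f"{model}: {units} unit(s), ${cost}, ${price / outlets:.2f} per outlet")
```

With 13 plugs, only the PDU1220 covers everything in a single unit; the other three models need two.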

ION Computer Systems Quotation #61116

  • 2 ION G10 Server with GPU: $4,972.00 each
    • (2) Intel® Quad-Core Xeon® processor E5530 (2.40GHz, 8MB Cache, 5.86GT/s, 80W)
    • 12GB RAM [Bank 1 of 2: (6) 2GB ECC PC3-10600 1333MHz 2rank DDR3 RDIMM Modules][Smart]
    • Total memory: 12GB DDR3_1333
    • Configure 1 RAID sets / arrays.
    • Seagate SV35.3 250GB, 7200RPM, SATA 3Gb for SDVR 3.5" Disk
    • (1) NVidia Tesla C1060 w. 4GB DDR3
    • Dual Intel Gigabit Server NICs with IOAT2 Integrated
  • 7 ION G10 Server without GPU: $3,697.00 each
    • (2) Intel® Quad-Core Xeon® processor E5530 (2.40GHz, 8MB Cache, 5.86GT/s, 80W)
    • 12GB RAM [Bank 1 of 2: (6) 2GB ECC PC3-10600 1333MHz 2rank DDR3 RDIMM Modules][Smart]
    • Total memory: 12GB DDR3_1333
    • Configure 1 RAID sets / arrays.
    • Seagate SV35.3 250GB, 7200RPM, SATA 3Gb for SDVR 3.5" Disk
    • Dual Intel Gigabit Server NICs with IOAT2 Integrated
  • Networking Fabric
    • Network not included
  • Other stuff
    • scorpion: ION bootable USB Flash device for troubleshooting.
    • 3 year Next Business Day response Onsite Repair Service by Source Support
    • Default load for testing (Service Partition + CentOS 5.3 for Intel64)
  • Price tag: $33,173.20

ION Computer Systems Quotation #61164

  • 2 ION G10 Server with GPU: $4,972.00 each
    • (2) Intel® Quad-Core Xeon® processor E5530 (2.40GHz, 8MB Cache, 5.86GT/s, 80W)
    • 12GB RAM [Bank 1 of 2: (6) 2GB ECC PC3-10600 1333MHz 2rank DDR3 RDIMM Modules][Smart]
    • Total memory: 12GB DDR3_1333
    • Configure 1 RAID sets / arrays.
    • Seagate SV35.3 250GB, 7200RPM, SATA 3Gb for SDVR 3.5" Disk
    • (1) NVidia Tesla C1060 w. 4GB DDR3
    • Dual Intel Gigabit Server NICs with IOAT2 Integrated
  • 4 ION T11 DualNode: $6,477.00 each
    • (2x2) Intel® Quad-Core Xeon® processor E5530 (2.40GHz, 8MB Cache, 5.86GT/s, 80W)
    • Total memory: 12GB DDR3_1333 per node
    • No RAID, Separate disks (NO redundancy)
    • Configure 1 RAID sets / arrays.
    • Seagate Constellation 160GB, 7200RPM, SATA 3Gb NCQ 2.5" Disk
    • Dual Intel Gigabit Server NICs with IOAT2 Integrated
    • These nodes are modular. One can be unplugged and worked on while the others remain running.
  • Network
    • Network Not Included
  • Other stuff
    • scorpion: ION bootable USB Flash device for troubleshooting.
    • 3 year Next Business Day response Onsite Repair Service by Source Support
    • Default load for testing (Service Partition + CentOS 5.3 for Intel64)
  • Price Tag: $33,054.30

Silicon Mechanics Quote #174536

  • 2x Rackform iServ R4410: $11043.00 each ($10601.00 each with education) link
    • Shared Chassis: The following chassis resources are shared by all 4 compute nodes
    • External Optical Drive: No Item Selected
    • Power Supply: Shared, Redundant 1400W Power Supply with PMBus - 80 PLUS Gold Certified
    • Rail Kit: Quick-Release Rail Kit for Square Holes, 26.5 - 36.4 inches
    • Compute Nodes x4
      • CPU: 2 x Intel Xeon E5530 Quad-Core 2.40GHz, 8MB Cache, 5.86GT/s QPI
      • RAM: 12GB (6 x 2GB) Operating at 1333MHz Max (DDR3-1333 ECC Unbuffered DIMMs)
      • NIC: Intel 82576 Dual-Port Gigabit Ethernet Controller - Integrated
      • Management: Integrated IPMI with KVM over LAN
      • Hot-Swap Drive - 1: 250GB Western Digital RE3 (3.0Gb/s, 7.2Krpm, 16MB Cache) SATA
  • 2x Rackform iServ R350-GPU: $5196.00 each ($4433.00 each with education) link
    • CPU: 2 x Intel Xeon E5530 Quad-Core 2.40GHz, 8MB Cache, 5.86GT/s QPI
    • RAM: 12GB (6 x 2GB) Operating at 1333MHz Max (DDR3-1333 ECC Unbuffered DIMMs)
    • NIC: Intel 82576 Dual-Port Gigabit Ethernet Controller - Integrated
    • Management: Integrated IPMI 2.0 & KVM with Dedicated LAN
    • GPU: 1U System with 1 x Tesla C1060 GPU, Actively Cooled
    • LP PCIe x4 2.0 (x16 Slot): No Item Selected
    • Hot-Swap Drive - 1: 250GB Seagate Barracuda ES.2 (3Gb/s, 7.2Krpm, 32MB Cache, NCQ) SATA
    • Power Supply: 1400W Power Supply with PMBus - 80 PLUS Gold Certified
    • Rail Kit: 1U Rail Kit
  • Price Tag: $32,478 ($30,078 with education)
  • Questions
    • Can we lose the hot-swappability to save money?
    • Do we need to get a Gig-Switch?
      • Would Cairo do?

Newegg Quote #1

  • 16x Newegg list
    • 1U link
    • 2x Intel Xeon (Nehalem) E5530, Quad-Core, 2.4GHz, 80 Watt link
    • Slim CD/DVD Drive
    • 4x Gigabit ethernet motherboard
    • 500W non-redundant power supply
    • 160GB 7200RPM Seagate link
    • 12GB RAM (240-pin DDR3 1333 ECC, unbuffered)
  • Price Tag: $32,910.56

Newegg Quote #2

  • 2x Newegg list
    • 1U
    • 2x Intel Xeon (Nehalem) E5530, Quad-Core, 2.4GHz, 80 Watt
    • 2x Gigabit ethernet
    • IPMI
    • 1400W non-redundant power supply
    • 2x C1060 Tesla
    • 160GB 7200RPM Seagate
    • 12GB RAM (240-pin DDR3 1333 ECC, unbuffered)
  • 12x Newegg list
    • 1U link
    • 2x Intel Xeon (Nehalem) E5530, Quad-Core, 2.4GHz, 80 Watt
    • Slim CD/DVD Drive
    • 4x Gigabit ethernet
    • 500W non-redundant power supply
    • 160GB 7200RPM Seagate
    • 12GB RAM (240-pin DDR3 1333 ECC, unbuffered)
  • Price tag: $34,696.78

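Intel List #1

  • 13x Chassis + mainboard: http://www.provantage.com/supermicro-sys-6016t-gtf~7SUP91FA.htm
  • 25x CPU: http://www.newegg.com/Product/Product.aspx?Item=N82E16819117184
  • 39x RAM: http://www.newegg.com/Product/Product.aspx?Item=N82E16820139041
  • 14x HDD: http://www.newegg.com/Product/Product.aspx?Item=N82E16822136280
  • 2x Tesla card: http://www.tigerdirect.com/applications/searchtools/item-details.asp?EdpNo=4259469&SRCCODE=GOOGLEBASE&cm_mmc_o=VRqCjC7BBTkwCjCECjCE
  • Notes:
    • ~$2600/node without Tesla
    • ~$3850/node with Tesla
    • 1.5GB RAM/core
    • 2 dies/node (8 cores/node)
    • IPMI: yes
    • 1 headnode (a compute node minus one die, plus one HDD) + 11 compute nodes + 2 Tesla nodes ~= $35,846

AMD List #1

  • 20x Chassis: http://www.newegg.com/Product/Product.aspx?Item=N82E16811152128
  • 20x Mainboard: http://www.newegg.com/Product/Product.aspx?Item=N82E16813182108
  • 39x CPU: http://www.newegg.com/Product/Product.aspx?Item=N82E16819105189
  • 40x RAM: http://www.newegg.com/Product/Product.aspx?Item=N82E16820134936
  • 21x HDD: http://www.newegg.com/Product/Product.aspx?Item=N82E16822136280
  • 2x Tesla: http://www.tigerdirect.com/applications/searchtools/item-details.asp?EdpNo=4259469&SRCCODE=GOOGLEBASE&cm_mmc_o=VRqCjC7BBTkwCjCECjCE
  • Notes:
    • $1650/node without Tesla
    • $2900/node with Tesla
    • 1GB RAM/core
    • 2 dies/node (8 cores/node)
    • IPMI: yes
    • 1 headnode (a compute node minus one die, plus one HDD) + 17 compute nodes + 2 Tesla nodes ~= $35,275

AMD List #2

  • 10x Chassis: http://www.newegg.com/Product/Product.aspx?Item=N82E16811152128
  • 10x Mainboard: http://www.newegg.com/Product/Product.aspx?Item=N82E16813182108
  • 19x CPU: http://www.newegg.com/Product/Product.aspx?Item=N82E16819105189
  • 30x RAM: http://www.newegg.com/Product/Product.aspx?Item=N82E16820134936
  • 11x HDD: http://www.newegg.com/Product/Product.aspx?Item=N82E16822136280
  • 2x Tesla: http://www.tigerdirect.com/applications/searchtools/item-details.asp?EdpNo=4259469&SRCCODE=GOOGLEBASE&cm_mmc_o=VRqCjC7BBTkwCjCECjCE
  • Notes:
    • $3130/node without Tesla
    • $4350/node with Tesla
    • 1GB RAM/core
    • 2 dies/node (12 cores/node)
    • IPMI: yes
    • 1 headnode (a compute node minus one die, plus one HDD) + 7 compute nodes + 2 Tesla nodes ~= $33,755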