Difference between revisions of "Al-salam"
Latest revision as of 15:15, 28 June 2019
Al-Salam is the name for the Earlham Computer Science Department's venerable 13-node computing cluster.
Contents
- 1 Software
- 2 Hardware notes
- 3 Archive
- 3.1 Installation Notes
- 3.2 Latest Overarching Questions
- 3.3 Parts List
- 3.4 Tentative Specifications
- 3.5 Quick breakdown
- 3.6 ION Computer Systems Quotation #61116
- 3.7 ION Computer Systems Quotation #61164
- 3.8 Silicon Mechanics Quote #174536
- 3.9 Newegg Quote #1
- 3.10 Newegg Quote #2
- 3.11 Intel List #1
- 3.12 AMD List #1
- 3.13 AMD List #2
Software
Al-salam was upgraded to CentOS 7 in summer 2019. The notes below, in "Archive", contain useful information, but the specifics may not be current to the cluster as it exists now.
Hardware notes
- As0, As9, and As10 each have a 500GB hard drive in addition to the 80GB drive; remember this during configuration
- As10: the 500GB drive appears when you run the installer but not in df -h; the 80GB drive holds the OS, but the OS can't see the 500GB drive
- As12 has two 1TB hard drives in software RAID in addition to the 80GB drive
Archive
A few extra notes beyond these are available at http://cluster.earlham.edu/wiki/index.php/Al-salam, so check there first before moving forward with big changes.
Installation Notes
headnode
I'll be maintaining a script, /root/install/al-salam.sh, that will also serve as a log. I'm also following along with the BobSCEd-new logs for consistency between clusters.
TODO
- MPI
- Software installations into /cluster/al-salam
- User auth via bs0-new's ldap
- torque/maui
- Ganglia
- shorewall
- modules
Have done
- yum install:
gcc.x86_64 gcc-c++.x86_64 gcc-gfortran.x86_64 \
gcc44.x86_64 gcc44-c++.x86_64 gcc44-gfortran.x86_64 \
apr.x86_64 apr-devel.x86_64 expat-devel.x86_64 \
blas.x86_64 dhcp.x86_64
- rpm install:
- c3
- libconfuse
- libconfuse-devel
- /etc/c3.conf:
cluster al-salam {
    as0.cluster.earlham.edu:as0.al-salam.loc
    as[1-12]
}
- in hopper:/etc/rc.conf:
static_routes="bs0 as0"
route_as0="192.168.1.1 159.28.234.150"
- hopper:/etc/namedb/master/cluster.zone:
as0.cluster.earlham.edu. IN A 159.28.234.150
as.cluster.earlham.edu. IN CNAME as0
al-salam.cluster.earlham.edu. IN CNAME as0
- hopper:/etc/namedb/master/al-salam.loc
- copy from bobsced.loc, amend as necessary
- hopper:/etc/namedb/master/1.168.192.in-addr.arpa
- copy from 0.168.192.in-addr.arpa, amend as necessary
- hopper:/etc/namedb/master/159.28.234.zone
150 IN PTR as0.cluster.earlham.edu.
- hopper:/usr/local/etc/dhcpd.conf
subnet 192.168.1.0 netmask 255.255.255.0 {
    option routers 192.168.1.1;
    option subnet-mask 255.255.255.0;
    option domain-name "al-salam.loc";
    option domain-name-servers 159.28.234.1;
    next-server 159.28.234.1;
    filename "pxelinux.0";
    range 192.168.1.100 192.168.1.254;
}
- as0:/etc/dhcrelay
# Command line options here
INTERFACES="eth1"
DHCPSERVERS="cluster.earlham.edu"
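The addresses in the dhcpd.conf stanza above can be sanity-checked before restarting the DHCP server. The following is a minimal sketch using Python's standard ipaddress module; the values are copied from the stanza, while the function name pool_summary is hypothetical and not part of the original install notes.

```python
import ipaddress

# Values copied from the dhcpd.conf stanza for the al-salam subnet.
SUBNET = ipaddress.ip_network("192.168.1.0/255.255.255.0")
ROUTER = ipaddress.ip_address("192.168.1.1")
RANGE_START = ipaddress.ip_address("192.168.1.100")
RANGE_END = ipaddress.ip_address("192.168.1.254")


def pool_summary(subnet, router, start, end):
    """Return (all_inside, pool_size) for a DHCP range on a subnet.

    all_inside is True when the router and both ends of the range
    belong to the subnet and the range is not reversed.
    pool_size counts the addresses in the range, inclusive.
    """
    all_inside = all(addr in subnet for addr in (router, start, end))
    all_inside = all_inside and start <= end
    pool_size = int(end) - int(start) + 1
    return all_inside, pool_size


if __name__ == "__main__":
    inside, size = pool_summary(SUBNET, ROUTER, RANGE_START, RANGE_END)
    print(f"router and range inside {SUBNET}: {inside}")
    print(f"dynamic addresses available: {size}")
```

With the values above, the dynamic pool comfortably exceeds the 13 nodes the cluster needs, so the range itself should not be a limiting factor when PXE-booting the compute nodes.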
compute nodes
- clone from bs1-new using udpcast
- modify network, etc. settings from bobsced to al-salam
Latest Overarching Questions
- Should we build this machine ourselves?
- Are we wasting our money and learning opportunity letting them do the building for us?
- If it is cheaper, would it be a useful experience for the students this coming semester to take a large collection of hardware and make it into a cluster?
- Yes, this was pretty clear from the email thread in early December.
- How much if any GPGPU hardware do we want? 0, 1 or 2 nodes worth?
- Do we want a high bandwidth/low latency network?
- We do not. More expensive than it is worth.
- What software stack do we want to run? Vendor supplied or the BCCD?
- Both. Vendor-supplied base with a BCCD virtual machine
- Will the virtual machine support CUDA?
- Do compute nodes have spinning disk?
- Compute nodes have a spinning disk. Solid state is still expensive
- What's on the local persistent store? /tmp? An entire OS?
- Support
- Consider getting the cheapest hardware support. Losing a node isn't critical as long as they send a replacement quickly.
Parts List
- Nodes - case, motherboard(s), power supply, CPU, RAM, GPGPU cards
- Switch - managed, cut-through
- Fitz: Having a hard time finding anyone who sells cut-through switches
- How about this store-and-forward switch from hp?
- Power distribution - rack-mount PDUs
Tentative Specifications
Budget
- $35,000 (leaving $5,000 for discretionary spending)
Nodes
- Intel Nehalem processors
- 4 core processors minimum
- Six cores still expensive
- 1.5GB RAM per core
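The RAM target above translates directly into a per-node memory size. A small worked example, assuming two quad-core sockets per node (the configuration used in the vendor quotes later on this page, not something this section states); the function name ram_per_node_gb is hypothetical.

```python
def ram_per_node_gb(sockets, cores_per_socket, gb_per_core):
    """Memory needed per node to meet a GB-per-core target."""
    return sockets * cores_per_socket * gb_per_core


if __name__ == "__main__":
    # Assumed layout: 2 sockets x 4 cores, at the 1.5GB/core target.
    print(ram_per_node_gb(sockets=2, cores_per_socket=4, gb_per_core=1.5))
```

Under that assumption the target works out to 12GB per node, which is the memory size that appears in the quoted configurations.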
Specialty Nodes
- Two nodes should support CUDA GPGPU
Educationally, we could expect to get significant use out of GPGPUs, but their production use is limited. Increasing the variety of architectures available would be a bonus to education.
Network
- Gigabit Ethernet fabric with switch
Disk
- Spinning Disk
OS
- Virtual BCCD on top of built-in OS.
Quick breakdown
Nodes
Spec | ION #61116 | ION #61164 | SM #174536 | Newegg #1 | Newegg #2 | Intel List #1 | AMD List #1 | AMD List #2 |
---|---|---|---|---|---|---|---|---|
CPU cores | 72 2.4GHz Intel E5530 | 80 2.4GHz Intel E5530 | 80 2.4GHz Intel E5530 | 128 2.4GHz Intel E5530 | 112 2.4GHz Intel E5530 | 100 2.4GHz Intel E5530 | 156 2.0GHz AMD Opteron 2350 | 126 2.6GHz AMD Opteron 2435 |
RAM | 108GB PC3-10600 | 120GB PC3-10600 | 120GB DDR3-1333 | 192GB DDR3-1333 | 168GB DDR3-1333 | 144GB DDR3-1333 | 160GB DDR2-800 | 120GB DDR2-800 |
GPU | 2 Tesla C1060 | 2 Tesla C1060 | 2 Tesla C1060 | None | 4 Tesla C1060 | 2 Tesla C1060 | 2 Tesla C1060 | 2 Tesla C1060 |
Local disk | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Shared chassis | No | Yes | Yes | No | No | No | No | No |
Remote mgmt | No | No | IPMI | No | IPMI on GPU nodes | IPMI | IPMI | IPMI |
Size (just nodes) | 9U | 6U | 6U | 16U | 12U | 12U | 20U | 10U |
Price | $33,173.20 | $33,054.30 | $30,078.00 | $32,910.56 | $34,696.78 | $35,846.00 | $35,275.00 | $33,755.00 |
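One way to compare the options in the table above is price per core. The sketch below uses only the core counts and prices from the table; the names OPTIONS and price_per_core are hypothetical. It is a rough comparison only: it ignores GPU count, RAM type, remote management, and rack space, all of which differ between options.

```python
# (core count, total price in USD), copied from the breakdown table.
OPTIONS = {
    "ION #61116": (72, 33173.20),
    "ION #61164": (80, 33054.30),
    "SM #174536": (80, 30078.00),
    "Newegg #1": (128, 32910.56),
    "Newegg #2": (112, 34696.78),
    "Intel List #1": (100, 35846.00),
    "AMD List #1": (156, 35275.00),
    "AMD List #2": (126, 33755.00),
}


def price_per_core(options):
    """Map each option name to its total price divided by core count."""
    return {name: price / cores for name, (cores, price) in options.items()}


if __name__ == "__main__":
    per_core = price_per_core(OPTIONS)
    # Print cheapest per core first.
    for name, value in sorted(per_core.items(), key=lambda item: item[1]):
        print(f"{name:14s} ${value:7.2f} per core")
```

Note that Newegg #1 includes no GPUs, so its per-core figure is not directly comparable with the options that include Tesla cards.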
Power distribution
Model | PDU1220 | PDUMH20 | AP9563 | AP7801 |
---|---|---|---|---|
Vendor | TrippLite | TrippLite | APC | APC |
Size | 1U | 1U | 1U | 1U |
Capabilities | Dumb | Metered | Dumb | Metered |
Input power | 20A, 1x NEMA 5-20P | 20A, 1x NEMA L5-20P w/ NEMA 5-20P adapter | 20A, 1x NEMA 5-20P | 20A, 1x NEMA 5-20P |
Output power | 13x NEMA 5-20R | 12x NEMA 5-20R | 10x NEMA 5-20R | 8x NEMA 5-20R |
Price | $195 | $230 | $120 | $380 |
ION Computer Systems Quotation #61116
- 2 ION G10 Server with GPU: $4,972.00 each
- (2) Intel® Quad-Core Xeon® processor E5530 (2.40GHz, 8MB Cache, 5.86GT/s, 80W)
- 12GB RAM [Bank 1 of 2: (6) 2GB ECC PC3-10600 1333MHz 2rank DDR3 RDIMM Modules][Smart]
- Total memory: 12GB DDR3_1333
- Configure 1 RAID sets / arrays.
- Seagate SV35.3 250GB, 7200RPM, SATA 3Gb for SDVR 3.5" Disk
- (1) NVidia Tesla C1060 w. 4GB DDR3
- Dual Intel Gigabit Server NICs with IOAT2 Integrated
- 7 ION G10 Server without GPU: $3,697.00 each
- (2) Intel® Quad-Core Xeon® processor E5530 (2.40GHz, 8MB Cache, 5.86GT/s, 80W)
- 12GB RAM [Bank 1 of 2: (6) 2GB ECC PC3-10600 1333MHz 2rank DDR3 RDIMM Modules][Smart]
- Total memory: 12GB DDR3_1333
- Configure 1 RAID sets / arrays.
- Seagate SV35.3 250GB, 7200RPM, SATA 3Gb for SDVR 3.5" Disk
- Dual Intel Gigabit Server NICs with IOAT2 Integrated
- Networking Fabric
- Network not included
- Other stuff
- scorpion: ION bootable USB Flash device for troubleshooting.
- 3 year Next Business Day response Onsite Repair Service by Source Support
- Default load for testing (Service Partition + CentOS 5.3 for Intel64)
- Price tag: $33,173.20
ION Computer Systems Quotation #61164
- 2 ION G10 Server with GPU: $4,972.00 each
- (2) Intel® Quad-Core Xeon® processor E5530 (2.40GHz, 8MB Cache, 5.86GT/s, 80W)
- 12GB RAM [Bank 1 of 2: (6) 2GB ECC PC3-10600 1333MHz 2rank DDR3 RDIMM Modules][Smart]
- Total memory: 12GB DDR3_1333
- Configure 1 RAID sets / arrays.
- Seagate SV35.3 250GB, 7200RPM, SATA 3Gb for SDVR 3.5" Disk
- (1) NVidia Tesla C1060 w. 4GB DDR3
- Dual Intel Gigabit Server NICs with IOAT2 Integrated
- 4 ION T11 DualNode: $6,477.00 each
- (2x2) Intel® Quad-Core Xeon® processor E5530 (2.40GHz, 8MB Cache, 5.86GT/s, 80W)
- Total memory: 12GB DDR3_1333 per node
- No RAID, Separate disks (NO redundancy)
- Configure 1 RAID sets / arrays.
- Seagate Constellation 160GB, 7200RPM, SATA 3Gb NCQ 2.5" Disk
- Dual Intel Gigabit Server NICs with IOAT2 Integrated
- These nodes are modular. One can be unplugged and worked on while the others remain running.
- Network
- Network Not Included
- Other stuff
- scorpion: ION bootable USB Flash device for troubleshooting.
- 3 year Next Business Day response Onsite Repair Service by Source Support
- Default load for testing (Service Partition + CentOS 5.3 for Intel64)
- Price Tag: $33,054.30
Silicon Mechanics Quote #174536
- 2x Rackform iServ R4410: $11043.00 each ($10601.00 each with education) link
- Shared Chassis: The following chassis resources are shared by all 4 compute nodes
- External Optical Drive: No Item Selected
- Power Supply: Shared, Redundant 1400W Power Supply with PMBus - 80 PLUS Gold Certified
- Rail Kit: Quick-Release Rail Kit for Square Holes, 26.5 - 36.4 inches
- Compute Nodes x4
- CPU: 2 x Intel Xeon E5530 Quad-Core 2.40GHz, 8MB Cache, 5.86GT/s QPI
- RAM: 12GB (6 x 2GB) Operating at 1333MHz Max (DDR3-1333 ECC Unbuffered DIMMs)
- NIC: Intel 82576 Dual-Port Gigabit Ethernet Controller - Integrated
- Management: Integrated IPMI with KVM over LAN
- Hot-Swap Drive - 1: 250GB Western Digital RE3 (3.0Gb/s, 7.2Krpm, 16MB Cache) SATA
- 2x Rackform iServ R350-GPU: $5196.00 each ($4433.00 each with education) link
- CPU: 2 x Intel Xeon E5530 Quad-Core 2.40GHz, 8MB Cache, 5.86GT/s QPI
- RAM: 12GB (6 x 2GB) Operating at 1333MHz Max (DDR3-1333 ECC Unbuffered DIMMs)
- NIC: Intel 82576 Dual-Port Gigabit Ethernet Controller - Integrated
- Management: Integrated IPMI 2.0 & KVM with Dedicated LAN
- GPU: 1U System with 1 x Tesla C1060 GPU, Actively Cooled
- LP PCIe x4 2.0 (x16 Slot): No Item Selected
- Hot-Swap Drive - 1: 250GB Seagate Barracuda ES.2 (3Gb/s, 7.2Krpm, 32MB Cache, NCQ) SATA
- Power Supply: 1400W Power Supply with PMBus - 80 PLUS Gold Certified
- Rail Kit: 1U Rail Kit
- Price Tag: $32,478 ($30,078 with education)
- Questions
- Can we lose the hot-swappability to save money?
- Do we need to get a Gig-Switch?
- Would Cairo do?
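The price tag for this quote can be cross-checked against its line items. A minimal sketch using only the quantities and per-unit prices listed above; the names LINE_ITEMS and totals are hypothetical.

```python
# (description, quantity, list price each, education price each),
# copied from the Silicon Mechanics quote line items.
LINE_ITEMS = [
    ("Rackform iServ R4410", 2, 11043.00, 10601.00),
    ("Rackform iServ R350-GPU", 2, 5196.00, 4433.00),
]


def totals(items):
    """Return (list_total, education_total) summed over all line items."""
    list_total = sum(qty * list_each for _, qty, list_each, _ in items)
    edu_total = sum(qty * edu_each for _, qty, _, edu_each in items)
    return list_total, edu_total


if __name__ == "__main__":
    list_total, edu_total = totals(LINE_ITEMS)
    print(f"list total:      ${list_total:,.2f}")
    print(f"education total: ${edu_total:,.2f}")
```

The list total computed this way matches the quoted $32,478. The education total comes to $30,068, which is $10 below the quoted $30,078; the notes do not say where the difference comes from, so it is worth confirming against the original quote.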
Newegg Quote #1
- 16x Newegg list
- 1U link
- 2x Intel Xeon (Nehalem) E5530, Quad-Core, 2.4GHz, 80 Watt link
- Slim CD/DVD Drive
- 4x Gigabit ethernet motherboard
- 500W non-redundant power supply
- 160GB 7200RPM Seagate link
- 12GB RAM (240-pin DDR3 1333 ECC, unbuffered)
- Price Tag: $32,910.56
Newegg Quote #2
- 2x Newegg list
- 1U
- 2x Intel Xeon (Nehalem) E5530, Quad-Core, 2.4GHz, 80 Watt
- 2x Gigabit ethernet
- IPMI
- 1400W non-redundant power supply
- 2x C1060 Tesla
- 160GB 7200RPM Seagate
- 12GB RAM (240-pin DDR3 1333 ECC, unbuffered)
- 12x Newegg list
- 1U link
- 2x Intel Xeon (Nehalem) E5530, Quad-Core, 2.4GHz, 80 Watt
- Slim CD/DVD Drive
- 4x Gigabit ethernet
- 500W non-redundant power supply
- 160GB 7200RPM Seagate
- 12GB RAM (240-pin DDR3 1333 ECC, unbuffered)
- Price tag: $34,696.78
Intel List #1
- 13x Chassis + mainboard: http://www.provantage.com/supermicro-sys-6016t-gtf~7SUP91FA.htm
- 25x CPU: http://www.newegg.com/Product/Product.aspx?Item=N82E16819117184
- 39x RAM: http://www.newegg.com/Product/Product.aspx?Item=N82E16820139041
- 14x HDD: http://www.newegg.com/Product/Product.aspx?Item=N82E16822136280
- 2x Tesla card: http://www.tigerdirect.com/applications/searchtools/item-details.asp?EdpNo=4259469&SRCCODE=GOOGLEBASE&cm_mmc_o=VRqCjC7BBTkwCjCECjCE
- Notes:
- ~$2600/node without Tesla
- ~$3850/node with Tesla
- 1.5GB RAM/core
- 2 dies/node (8 cores/node)
- Yes IPMI
- 1 headnode (a compute node minus one die, plus one HDD) + 11 compute nodes + 2 Tesla nodes ~= $35,846
AMD List #1
- 20x Chassis: http://www.newegg.com/Product/Product.aspx?Item=N82E16811152128
- 20x Mainboard: http://www.newegg.com/Product/Product.aspx?Item=N82E16813182108
- 39x CPU: http://www.newegg.com/Product/Product.aspx?Item=N82E16819105189
- 40x RAM: http://www.newegg.com/Product/Product.aspx?Item=N82E16820134936
- 21x HDD: http://www.newegg.com/Product/Product.aspx?Item=N82E16822136280
- 2x Tesla: http://www.tigerdirect.com/applications/searchtools/item-details.asp?EdpNo=4259469&SRCCODE=GOOGLEBASE&cm_mmc_o=VRqCjC7BBTkwCjCECjCE
- Notes:
- $1650/node without Tesla
- $2900/node with Tesla
- 1GB RAM/core
- 2 dies/node (8 cores/node)
- Yes IPMI
- 1 headnode (a compute node minus one die, plus one HDD) + 17 compute nodes + 2 Tesla nodes ~= $35,275
AMD List #2
- 10x Chassis: http://www.newegg.com/Product/Product.aspx?Item=N82E16811152128
- 10x Mainboard: http://www.newegg.com/Product/Product.aspx?Item=N82E16813182108
- 19x CPU: http://www.newegg.com/Product/Product.aspx?Item=N82E16819105189
- 30x RAM: http://www.newegg.com/Product/Product.aspx?Item=N82E16820134936
- 11x HDD: http://www.newegg.com/Product/Product.aspx?Item=N82E16822136280
- 2x Tesla: http://www.tigerdirect.com/applications/searchtools/item-details.asp?EdpNo=4259469&SRCCODE=GOOGLEBASE&cm_mmc_o=VRqCjC7BBTkwCjCECjCE
- Notes:
- $3130/node without Tesla
- $4350/node with Tesla
- 1GB RAM/core
- 2 dies/node (12 cores/node)
- Yes IPMI
- 1 headnode (a compute node minus one die, plus one HDD) + 7 compute nodes + 2 Tesla nodes ~= $33,755
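The CPU and HDD counts in the Intel and AMD parts lists follow one pattern: every node is dual-socket with one hard drive, except the headnode, which gives up one CPU and gains one extra drive. A short sketch of that arithmetic; the function name parts_for is hypothetical, and RAM module counts are left out because they depend on module size and do not follow the same rule across the lists.

```python
def parts_for(chassis):
    """Part counts for a cluster of dual-socket nodes.

    The headnode has one CPU fewer and one HDD more than a compute
    node, so the totals are 2n - 1 CPUs and n + 1 HDDs for n chassis.
    """
    return {
        "chassis": chassis,
        "cpus": 2 * chassis - 1,
        "hdds": chassis + 1,
    }


if __name__ == "__main__":
    # Chassis counts taken from the three parts lists.
    for name, chassis in [("Intel List #1", 13),
                          ("AMD List #1", 20),
                          ("AMD List #2", 10)]:
        print(name, parts_for(chassis))
```

For 13, 20, and 10 chassis this gives 25, 39, and 19 CPUs and 14, 21, and 11 HDDs, matching the quantities in the three lists.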