This lab was not assembled to look tasteful. It was created for specific reasons:
- Run chess workloads that are impossible with consumer gear
- Provide the biggest bang for the buck in compute and storage
- Learn how to set up an entire domain from scratch
- Gain experience with virtualization automation and enterprise storage
- Provide a platform for distributed computing development
Servers
The servers are a collection of hardware that I procured at crazy-low prices from datacenter liquidations and castoffs. Most of the fleet (10 machines) is HPE DL360 and DL380 Gen9 hardware, each with dual Xeons (112 total threads) and 256 GB of RAM. Two of those carry NVIDIA Tesla P40s and 22 x 2 TB SSDs running Storage Spaces Direct. They are joined by a dual-Xeon Dell system with 1.5 TB of memory and a Supermicro dual-EPYC 7742 box with 256 GB of DDR4-3200. All of the servers run Windows Server 2022. The point is to have enough stable compute to run long validation jobs, data-generation passes, and distributed experiments without pretending one desktop is a cluster.
Network
The network is split the way it should be when the machines are expected to do real work. The Ethernet side is built around an Arista 48 x 10GbE switch with 40GbE uplinks. The high-speed fabric is a Mellanox 6036 InfiniBand switch. Management is kept separate through the usual iLO/IPMI path. That separation is not there for drama. It keeps ordinary host traffic, bulk data movement, and out-of-band recovery from tripping over one another, which is a more useful trait in a lab than elegance.
Workstations
My first workstation was a 5950X / 64 GB RAM / RTX 3090 mid-tower. It served faithfully for many years, training neural networks and acting as a dev box. Ultimately, though, I needed an upgrade…
The new workstation is the other half of the story. It’s a fully stuffed HAF 700 EVO case with a Threadripper 9980X, 128 GB of ECC DDR5-7200, a Gigabyte TRX50 AI TOP board, and an MSI RTX 5090 SUPRIM LIQUID, running Windows 11 Pro for Workstations. Storage is an 8 TB PCIe 5.0 RAID 0 boot-and-dev drive, plus 16 TB of PCIe 4.0 RAID 0 SSD serving as an L2 cache for 12 x 26 TB 3.5″ HDDs coughing up ~240 TB of RAID 6 storage via an LSI 9460 controller with cache and battery backup. It is intentionally overqualified for ordinary desktop duty, which is exactly the point: it’s a workstation. This is the box where code gets written, profilers get read, binaries get built, and new ideas earn the right to waste server time.
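The array math above checks out with the standard RAID 6 rule: two drives’ worth of capacity go to parity, and the rest is usable. A quick sketch of that arithmetic (the helper name is mine, not from any tool):

```python
def raid6_usable_tb(drives: int, drive_tb: float) -> float:
    """RAID 6 spends two drives' worth of capacity on parity,
    so usable space is (n - 2) * per-drive capacity."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drives - 2) * drive_tb

# 12 x 26 TB drives in RAID 6:
usable_tb = raid6_usable_tb(12, 26)     # 260 TB (decimal, as drives are sold)
usable_tib = usable_tb * 1e12 / 2**40   # ~236.5 TiB (binary, as an OS reports it)
print(usable_tb, round(usable_tib, 1))
```

The decimal/binary gap is why a 260 TB raw-usable array shows up in the neighborhood of the quoted ~240 TB once the OS and filesystem have their say.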
GPUPerft, FENmaster, the statistics pipeline, and the larger distributed perft/upc work all benefit from having multiple classes of machine available at once: a rack full of reliable server nodes, a faster workstation for development and local proof work, a real Ethernet fabric, and a separate InfiniBand path when the workload justifies it. Some of the work is embarrassingly parallel. Some of it is bandwidth-sensitive. Some of it is validation-heavy and simply wants the environment to stop moving around underneath it. The lab is shaped around that reality.
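The embarrassingly parallel part of perft comes from the fact that a position’s node count is just the sum of the counts under each root move, so root moves can be farmed out to independent workers. A minimal sketch of that root split, with a toy move generator standing in for real chess movegen (all names here are illustrative, not the actual GPUPerft/FENmaster code):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy game: a "position" is just the last move played, and a legal
# move is any of three tokens except an immediate repeat.
MOVES = ("a", "b", "c")

def legal_moves(pos):
    return [m for m in MOVES if m != pos]

def perft(pos, depth):
    """Count leaf nodes reachable in exactly `depth` plies."""
    if depth == 0:
        return 1
    return sum(perft(m, depth - 1) for m in legal_moves(pos))

def perft_root_split(pos, depth, workers=4):
    """Distributed flavor: each root move becomes an independent job.
    In the lab, these jobs would be dispatched to separate server nodes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda m: perft(m, depth - 1), legal_moves(pos))
    return sum(counts)

# Both paths must agree; the split version just distributes the work.
print(perft(None, 4), perft_root_split(None, 4))
```

Because each root subtree is independent, the only coordination cost is handing out jobs and summing counts, which is why a rack of modest nodes works so well for this.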
If there is a design philosophy here, it is probably this: build enough cost-effective infrastructure to provide meaningful capacity for work at scale. This makes it possible to ask larger chess-computation questions without every answer being distorted by a flimsy environment.
This lab has provided all of the cores needed for Elo evaluation and testing of multiple chess engines.