We are attempting to scale application codes from small clusters of 8-16 processors to clusters with one or two hundred processors. The programs of interest are: SUSP3D, a lattice-Boltzmann code for simulating the dynamics of suspensions and polymer solutions; GROMACS and DLPOLY, classical molecular dynamics codes; and VASP, a quantum-mechanical molecular dynamics code. We pooled our resources to purchase a 192-node dual-core P4D cluster from Dell, using 48-port edge switches from Extreme Networks for the interconnect. Single-node benchmarks suggested that the dual-core P4D processor was comparable in performance to an Opteron 175 and considerably cheaper. With the help of generous discounts from the vendors, we were able to purchase the cluster for just under $205,000, including all switches, racks, cables and power strips. We have compared network and application benchmarks with the Opteron cluster at the University of Florida's HPC center, which uses an InfiniBand interconnect. These pages document our experiences and results. We gratefully acknowledge access to the HPC cluster, and technical assistance in running our benchmarks.
Hardware | Single node application benchmarks | Network Benchmarks | Parallel application benchmarks | Network Adapters | Switches | MPIGamma
Our hardware choices were primarily determined by price-performance at the time of purchase. In the most recent instance (December 2005), we decided on a single-CPU, dual-core rack server: dual-core Xeons (Paxville) were eliminated by their inadequate memory bus, while dual-core Opteron 270/275 processors were eliminated by price (about $1000 per processor). Because of the size of the combined purchase, we were offered substantial discounts by Dell and Sun Microsystems. Our single-node benchmarks indicated that the Opteron 175 (Sun Fire X2100) was about 10% faster than a 3.0GHz P4D (Dell PE 850), but 40% more expensive at the best discounts we could negotiate. Moreover, the Dell units came with rails, cable management arms, and an IPMI-compliant network card, all of which were expensive extras with Sun. Although this did not play a significant part in our decision, I have subsequently come to appreciate the value of these items, particularly the IPMI network card, which allows remote power cycling independent of the operating system.
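To make the price-performance argument concrete, the figures quoted above can be turned into a performance-per-dollar ratio. The following is a minimal sketch: the function name `perf_per_cost` is ours, and the inputs are just the approximate 10% and 40% figures from our benchmarks and quotes, normalised to the P4D.

```python
def perf_per_cost(relative_perf: float, relative_cost: float) -> float:
    """Return performance per unit cost, relative to a baseline of 1.0."""
    return relative_perf / relative_cost


# Baseline: 3.0GHz P4D (Dell PE 850), normalised to 1.0 for both quantities.
p4d = perf_per_cost(1.00, 1.00)

# Opteron 175 (Sun Fire X2100): about 10% faster, about 40% more expensive.
opteron = perf_per_cost(1.10, 1.40)

print(f"P4D performance per dollar:     {p4d:.2f}")
print(f"Opteron performance per dollar: {opteron:.2f}")
print(f"P4D advantage: {(p4d / opteron - 1) * 100:.0f}%")
```

On these rough numbers the P4D delivers about 27% more performance per dollar, before counting the rails, cable management arms and IPMI card.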
The interprocessor communication rate depends on the performance of a number of components: network adapter, switch, PCI bus, and software layer. In particular, we have found that the performance of different Gigabit switches can vary by orders of magnitude under certain conditions. A good choice of hardware will almost certainly require hands-on benchmarking; fortunately, many vendors are willing to cooperate by loaning hardware for evaluation. Pre-sale, Dell had told us that the onboard NICs on the PE850 (Broadcom 5721J) supported Jumbo frames; when it transpired that they did not, Dell gave us 194 Intel PRO 1000 PCI-X adapters to make up for it, which was nice. Unfortunately, although the Intel NICs have a higher bandwidth than the Broadcom 5721J, they also have a much higher latency in their TCP driver. We therefore experimented with the MPIGAMMA software layer, which has an MPI latency of about 9 microseconds. MPIGAMMA makes a substantial difference to application benchmarks, particularly for large numbers of processors.
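Why latency matters so much can be seen from the usual linear cost model for a single message, time = latency + size / bandwidth. The sketch below is illustrative only: the 9 microsecond figure is the MPIGAMMA latency quoted above, the bandwidth is the theoretical Gigabit peak, and `TCP_LATENCY` is a hypothetical placeholder rather than a measured value. The model also assumes both software layers reach the same bandwidth, which need not be true in practice.

```python
# Theoretical peak of Gigabit Ethernet: 1 Gbit/s = 125 million bytes/s.
GIGABIT_BYTES_PER_SEC = 125e6

GAMMA_LATENCY = 9e-6   # about 9 microseconds, as quoted in the text
TCP_LATENCY = 50e-6    # HYPOTHETICAL value, chosen only for illustration


def transfer_time(size_bytes: float, latency_s: float,
                  bandwidth: float = GIGABIT_BYTES_PER_SEC) -> float:
    """Estimated time in seconds to send one message of size_bytes."""
    return latency_s + size_bytes / bandwidth


def half_bandwidth_size(latency_s: float,
                        bandwidth: float = GIGABIT_BYTES_PER_SEC) -> float:
    """Message size at which latency equals the transmission time."""
    return latency_s * bandwidth


for size in (64, 1024, 65536, 1048576):
    t_gamma = transfer_time(size, GAMMA_LATENCY)
    t_tcp = transfer_time(size, TCP_LATENCY)
    print(f"{size:>8} bytes: low-latency {t_gamma * 1e6:8.1f} us, "
          f"high-latency {t_tcp * 1e6:8.1f} us, ratio {t_tcp / t_gamma:.2f}")

print(f"Crossover size at 9 us: {half_bandwidth_size(GAMMA_LATENCY):.0f} bytes")
```

In this model small messages are dominated by latency and large ones by bandwidth, which is consistent with a low-latency layer helping most when many processors exchange many short messages.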
We chose edge switches rather than a modular switch on grounds of cost. As yet we have not assessed the consequences of the limited bandwidth between switches, but the cost of a 192-port chassis switch was well outside our budget. We decided on the Extreme Networks Summit 400-48t, which outperformed the Force10 S50 in head-to-head comparisons. Subsequently it transpired that the Summit 400-48t also suffers from serious performance loss in certain traffic patterns. We discuss this in more detail here. However, Extreme Networks offered us a free exchange for the new x450a-48t, which uses the same hardware as the Black Diamond chassis switch modules. The x450a is close to flawless in our tests.
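A rough sense of how much traffic could be exposed to the inter-switch links comes from counting node pairs. The sketch below assumes, purely for illustration, that every one of the 48 ports on each switch carries a node, that nodes are spread evenly across switches, and that communication partners are chosen uniformly at random; none of these is a statement about our actual wiring or traffic.

```python
from math import ceil

NODES = 192            # cluster size, from the text
PORTS_PER_SWITCH = 48  # edge switch port count, from the text


def min_switches(nodes: int, ports: int) -> int:
    """Smallest number of switches needed if every port carries a node."""
    return ceil(nodes / ports)


def cross_switch_fraction(nodes: int, per_switch: int) -> float:
    """Fraction of distinct node pairs that sit on different switches.

    Assumes nodes is a multiple of per_switch, so all switches are full.
    """
    same_switch_peers = per_switch - 1
    return 1 - same_switch_peers / (nodes - 1)


switches = min_switches(NODES, PORTS_PER_SWITCH)
fraction = cross_switch_fraction(NODES, PORTS_PER_SWITCH)

print(f"Switches needed: {switches}")
print(f"Node pairs on different switches: {fraction:.1%}")
```

Under these assumptions about three quarters of all node pairs communicate across a switch boundary, so the inter-switch bandwidth is likely to matter for jobs that span more than 48 nodes.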