Parallel Application benchmarks

We compared the multi-node performance of dual-core P4 nodes with that of dual Opteron 275 nodes (4 cores per node in total). We again used our 4 key application codes (SUSP3D, DLPOLY, GROMACS, and VASP) as benchmarks, on up to 96 processors. We plot the inverse time for each benchmark as a function of the number of processors. For SUSP3D, the problem size scales with the number of processors; for the other codes the problem size is fixed. In addition to the benchmarks used in the single-node tests, we added DLPOLY Test 2 and Test 8 (8 times as many atoms as Test 1 and Test 7, respectively; 200 steps for Test 2 and 50 for Test 8) and a large 63-atom VASP test. The scaling for SUSP3D-2 is very similar to that for SUSP3D-1 and is not shown. A complete summary of the benchmarks, including raw wall-clock times, can be downloaded here.
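As a reading aid for the plots, the following minimal sketch shows how inverse time and parallel efficiency can be computed for the two cases described above: fixed problem size (the DLPOLY, GROMACS, and VASP runs) and problem size that grows with the processor count (SUSP3D). The function names and the timings are hypothetical placeholders, not values from the benchmark summary.

```python
def inverse_time(wall_seconds):
    """Metric plotted in the text: 1 / wall-clock time."""
    return 1.0 / wall_seconds


def fixed_size_efficiency(t_ref, p_ref, t, p):
    """Fixed problem size: ideal wall-clock time falls as 1/p."""
    return (t_ref * p_ref) / (t * p)


def scaled_size_efficiency(t_ref, t):
    """Problem size grows with p: ideal wall-clock time stays constant."""
    return t_ref / t


# Placeholder timings in seconds (NOT measured values from these benchmarks).
placeholder_times = {8: 100.0, 16: 52.0, 32: 29.0}

for procs, seconds in placeholder_times.items():
    eff = fixed_size_efficiency(placeholder_times[8], 8, seconds, procs)
    print(procs, inverse_time(seconds), round(eff, 3))
```

For a fixed-size benchmark, ideal scaling appears as a straight line through the origin in a plot of inverse time against processor count; for a scaled-size benchmark such as SUSP3D, whether the plotted curve should be flat or rising depends on whether the times are normalized by problem size.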

SUSP3D

The SUSP3D-1 benchmark involves minimal collective communications and scalable point-to-point communications; the results show the expected linear scaling with the number of processors, regardless of the interconnect. SUSP3D-3, on the other hand, has significant collective communications that grow linearly with the number of processors. With TCP sockets the scaling is poor beyond 32 processors, whereas MPI/GAMMA continues to scale beyond 64 processors, with a peak throughput more than twice that of TCP. On this problem 64 processors run significantly faster than 80 processors. This is because SUSP3D uses a very fast version of Allreduce whose cost scales asymptotically as 2M (where M is the message size) and which is faster even than the Allreduce in MPICH; our implementation currently works only for processor counts that are powers of 2, so the 80-processor run presumably does not benefit from it.
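The text does not say which algorithm the SUSP3D Allreduce uses. One well-known scheme with the same two properties (per-process traffic approaching 2M, and a natural restriction to power-of-2 process counts) is a reduce-scatter by recursive halving followed by an allgather by recursive doubling. The sketch below simulates that scheme in a single Python process, purely as an illustration under that assumption; the function name `allreduce_sum` and its interface are hypothetical and are not taken from SUSP3D or MPICH.

```python
def allreduce_sum(vectors):
    """Simulate a sum-Allreduce over len(vectors) ranks.

    Returns (result_per_rank, elements_sent_per_rank).
    """
    p = len(vectors)
    if p == 0 or p & (p - 1):
        raise ValueError("process count must be a power of 2")
    m = len(vectors[0])
    if m % p:
        raise ValueError("vector length must be divisible by process count")

    data = [list(v) for v in vectors]
    lo, hi = [0] * p, [m] * p      # segment each rank is responsible for
    sent = [0] * p                 # elements sent by each rank

    # Phase 1: reduce-scatter by recursive halving.
    d = p // 2
    while d >= 1:
        before = [row[:] for row in data]   # values prior to this exchange
        new_lo, new_hi = lo[:], hi[:]
        for r in range(p):
            partner = r ^ d
            mid = (lo[r] + hi[r]) // 2
            keep = (lo[r], mid) if (r & d) == 0 else (mid, hi[r])
            for i in range(*keep):
                data[r][i] = before[r][i] + before[partner][i]
            sent[r] += (hi[r] - lo[r]) - (keep[1] - keep[0])
            new_lo[r], new_hi[r] = keep
        lo, hi = new_lo, new_hi
        d //= 2

    # Phase 2: allgather by recursive doubling.
    d = 1
    while d < p:
        before = [row[:] for row in data]
        new_lo, new_hi = lo[:], hi[:]
        for r in range(p):
            partner = r ^ d
            for i in range(lo[partner], hi[partner]):
                data[r][i] = before[partner][i]
            sent[r] += hi[r] - lo[r]
            new_lo[r] = min(lo[r], lo[partner])
            new_hi[r] = max(hi[r], hi[partner])
        lo, hi = new_lo, new_hi
        d *= 2

    return data, sent


vectors = [[10 * rank + i for i in range(8)] for rank in range(4)]
result, sent = allreduce_sum(vectors)
print(result[0], sent)
```

In this scheme each rank sends M(1 - 1/P) elements in each phase, so the total approaches 2M as P grows. The pairing of ranks by flipping one bit of the rank number is what ties the scheme to power-of-2 process counts; a count such as 80 is rejected by the sketch.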

DLPOLY

These tests are standard DLPOLY benchmarks: Tests 1 and 2 are an ionic melt with 27,000 and 216,000 ions; Tests 7 and 8 are simulations of Gramicidin A in water with 99,120 and 792,960 atoms. For a sufficiently coarse-grained simulation (Tests 2 and 8) DLPOLY scales well up to 64 nodes, even with a TCP interconnect, but for finer-grained parallelism (Tests 1 and 7) TCP is insufficient to scale beyond about 16 processors. GAMMA, however, scales well up to 64 processors even in these cases. Rather remarkably, MPI/GAMMA is about 20% faster than the HPC cluster with 64 processors and in fact scales significantly better at the higher processor counts. We discuss possible reasons for this here. GAMMA also scales slightly better than an IBM p690, as indicated by a comparison with these benchmarks.

GROMACS

Here we consider the two most common GROMACS benchmarks, Villin and DPPC. In both cases GAMMA outperforms TCP by a significant margin, but HPC is 50% faster than GAMMA on the Villin benchmark. This is partly because the scalar performance of the Opteron is better than that of the P4D for GROMACS, and partly because the Villin benchmark reaches peak speed at only 8 processors. With the more scalable DPPC benchmark, GAMMA and HPC are essentially identical, despite the scalar performance advantage of the Opterons.

VASP

Since there are no standard benchmarks for VASP, we ran simulations of Pt crystals with 11-atom (858 electrons) and 63-atom (4914 electrons) unit cells. For the medium benchmark (11 atoms) we timed one full relaxation cycle (39 iterations), but for the large benchmark (63 atoms) we timed only 5 iterations. We did not pursue many of the possible optimizations within VASP but used the recommended settings for large problems (IALGO=48, LREAL=Auto) for all tests. VASP uses complex collective communications, but GAMMA nevertheless scales well up to 80 processors. For some reason VASP seems to run best with processor counts that are multiples of 10. With TCP, VASP does not scale beyond 40 processors. Again, GAMMA performance equals or exceeds that of the HPC Opteron cluster.
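As a quick consistency check on the cell sizes quoted above, the electron counts match 78 electrons per atom, the atomic number of Pt. The short sketch below only reproduces that arithmetic; the helper name `electrons_per_atom` is hypothetical.

```python
PT_ATOMIC_NUMBER = 78  # electrons in a neutral Pt atom


def electrons_per_atom(atoms, electrons):
    """Return electrons per atom, requiring an exact integer ratio."""
    per_atom, remainder = divmod(electrons, atoms)
    if remainder:
        raise ValueError("electron count is not a multiple of atom count")
    return per_atom


# (atoms, electrons) pairs as quoted for the medium and large benchmarks.
for atoms, electrons in [(11, 858), (63, 4914)]:
    print(atoms, electrons_per_atom(atoms, electrons) == PT_ATOMIC_NUMBER)
```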