Single-Node Application Benchmarks

We compared the single-node performance of a dual-core P4 node with that of a dual Opteron 275 node with a total of 4 cores. We used our four key application codes, SUSP3D, DLPOLY, GROMACS, and VASP, as benchmarks; the timings in seconds for the individual tests are recorded in the table below.

SUSP3D simulates the dynamics of particles suspended in a viscous fluid. It uses a lattice-Boltzmann equation (LBE) model to follow the fluid motion, and couples the particles and fluid through boundary conditions on the particle surfaces. The table reports the time for 100 steps with 3 million grid points per processor. SUSP3D-1 has very few particles (2 per processor), so the computation time is entirely that of the LBE model itself. SUSP3D-2 has a small number (128 per processor) of large particles and includes a significant but scalable computation time for the boundary conditions. SUSP3D-3 has a large number of small particles (3456 per processor) and includes a significant serial computation time for the particle motion. The LBE code and the boundary conditions are fully parallelized by domain decomposition, but the particle dynamics resembles the "replicated data" scheme common in molecular dynamics codes. This scheme scales poorly to large numbers of particles and processors; the deficiency will show up in parallel benchmarks for SUSP3D-3.

DLPOLY, GROMACS, and VASP are well-established codes for classical and quantum mechanical molecular dynamics. For DLPOLY we ran test 1, a 27,000-ion simulation of NaCl, and test 7, a 99,120-atom simulation of Gramicidin A in water; the times reported are for 500 and 100 steps respectively. For GROMACS we ran the standard Villin and DPPC benchmarks. The VASP problem is an 11-atom Pt unit cell with a total of 858 electrons. The times (in seconds) are for the complete relaxation of the wavefunction using 39 cycles and a total of approximately 25,000 conjugate gradient steps. Note that there are small differences in the iteration process in VASP, depending on the processor count.

                      3.0GHz P4D  2 x Opteron 275 2.2GHz
Processes              1       2       1       2       4
SUSP3D-1              93     100      80      83      91
SUSP3D-2             143     160     130     135     150
SUSP3D-3             172     186     160     166     182
DLPOLY-1            1276     635    1103     779     410
DLPOLY-7             972     560     929     619     338
GROMACS-Villin        90      44      72      37      22
GROMACS-DPPC        4127    1758    3401    1459     775
VASP-Pt             1487     872    1521     836     499
MEAN (GM)            446     233     398     217     121

The results for SUSP3D are for a problem size that scales with the number of processors, while for the other benchmarks the problem size remains the same. We take this into account when calculating the geometric mean (GM).
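As an illustration, the geometric means in the table can be reproduced to within about one second by the Python sketch below. It assumes that the SUSP3D times, whose problem size grows with the process count, are divided by the number of processes before averaging, while the fixed-size benchmarks are used as measured. This normalization is inferred from the tabulated numbers and is not a documented procedure; the names PROCESSES, TIMINGS, geometric_mean and column_gm are hypothetical.

```python
import math

# Timings in seconds, copied from the table.
# Column order: P4D with 1 and 2 processes, Opteron node with 1, 2 and 4.
PROCESSES = [1, 2, 1, 2, 4]
TIMINGS = {
    "SUSP3D-1":       [93, 100, 80, 83, 91],
    "SUSP3D-2":       [143, 160, 130, 135, 150],
    "SUSP3D-3":       [172, 186, 160, 166, 182],
    "DLPOLY-1":       [1276, 635, 1103, 779, 410],
    "DLPOLY-7":       [972, 560, 929, 619, 338],
    "GROMACS-Villin": [90, 44, 72, 37, 22],
    "GROMACS-DPPC":   [4127, 1758, 3401, 1459, 775],
    "VASP-Pt":        [1487, 872, 1521, 836, 499],
}


def geometric_mean(values):
    """Geometric mean computed through logarithms to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def column_gm(col):
    """Geometric mean of one table column.

    Assumption: SUSP3D is weak-scaled (work grows with the process count),
    so its time is divided by the number of processes to give a time per
    unit of work. The other codes solve a fixed-size problem and are left
    unchanged.
    """
    n = PROCESSES[col]
    normalized = []
    for name, row in TIMINGS.items():
        t = row[col]
        if name.startswith("SUSP3D"):
            t = t / n
        normalized.append(t)
    return geometric_mean(normalized)


gms = [column_gm(c) for c in range(len(PROCESSES))]
for n, gm in zip(PROCESSES, gms):
    print(f"{n} process(es): GM = {gm:.1f} s")
```

Under this assumption the computed means agree with the MEAN (GM) row to within a second; the small residual differences presumably come from rounding in the published table.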

The timings suggest that a 3.0GHz P4 (Nocona) core and a 2.2GHz Opteron core perform similarly, with a 12% edge to the Opteron in the geometric mean; this is consistent with the SPEC benchmarks.

A comparison of geometric means for different numbers of processes suggests a 96% efficiency overall for the P4D in dual-core mode, but only 92% for two Opteron cores and 82% for four cores. Note that on a dual-processor Opteron node, the workload is distributed over both processors first and then over both cores. When running dual-core applications, the P4D is equivalent overall to an Opteron processor on these benchmarks.
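The efficiency figures can be checked with a few lines of Python. The sketch below assumes the usual definition of parallel efficiency, T(1) / (n * T(n)), applied to the geometric-mean times from the table; the names GM_P4D, GM_OPTERON and parallel_efficiency are hypothetical.

```python
# Geometric-mean times in seconds from the MEAN (GM) row of the table,
# keyed by the number of processes.
GM_P4D = {1: 446, 2: 233}
GM_OPTERON = {1: 398, 2: 217, 4: 121}


def parallel_efficiency(gm_by_procs, n):
    """Parallel efficiency T(1) / (n * T(n)) for n processes."""
    return gm_by_procs[1] / (n * gm_by_procs[n])


for label, gms in (("P4D", GM_P4D), ("Opteron", GM_OPTERON)):
    for n in sorted(gms):
        if n > 1:
            eff = 100 * parallel_efficiency(gms, n)
            print(f"{label}, {n} processes: {eff:.0f}% efficiency")

# Single-core comparison: how much longer the P4 core takes than the Opteron core.
edge = 100 * (GM_P4D[1] / GM_OPTERON[1] - 1)
print(f"Opteron single-core edge: {edge:.0f}%")
```

With the tabulated means this gives 96% for the P4D, 92% and 82% for two and four Opteron cores, and a 12% single-core edge for the Opteron, matching the figures quoted in the text.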

We used the Intel 9.0 compiler suite with MKL libraries and the Pathscale compiler with ACML libraries. Compiler optimizations were fairly standard: -O3 -xP -tpp7 on the P4 and -O3 -OPT:Ofast on the Opterons. On the P4 we included the -no-prec-div flag (roughly analogous to GCC's -ffast-math) with SUSP3D, where it gives a substantial performance boost (a factor of 2); in the other applications it makes a difference of only 1% or so and was not used.