Network Benchmarks

We used netbench to compare the performance of Gigabit Ethernet with that of an Infiniband interconnect. Gigabit Ethernet was tested with two protocol stacks: TCP with OpenMPI (v1.2), and GAMMA (v06-09-14) with MPIGamma (v06-10-06). Infiniband was tested with OpenMPI (v1.0). The Gigabit cluster uses Intel PRO 1000 PCI-X network cards and an Extreme Networks x450a-48t switch. The University of Florida HPC cluster uses a Topspin 4X Infiniband interconnect with a 50% blocking fat-tree topology.

We consider four message-passing patterns for N processes (N even): (i) N/2 pairs of edge exchanges, with each pair exchanging data independently of the other processes; (ii) an N-process ring, where each process receives from the rank below and sends to the rank above; (iii) MPI_Allreduce; (iv) MPI_Barrier. We report the throughput for each test as R = datasize/time, where datasize is the total amount of data sent by one processor (in MBytes). Note that this is half the bandwidth reported for point-to-point communications by the IMB/Pallas benchmark. We have also observed that the IMB/Pallas benchmark gives artificially high throughputs for medium-sized messages, because it reuses the same buffer and subsequent tests benefit from caching; netbench uses a new buffer for every trial.
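The throughput definition above can be sketched in a few lines of Python. This is only an illustration, not netbench's actual code: the names throughput and time_trials, the transfer callable, and the choice of 1 MByte = 2**20 bytes are all assumptions made for the example. The loop allocates a new buffer for every trial, mirroring the point about avoiding cache reuse.

```python
import time

MBYTE = 1024 * 1024  # assumption: 1 MByte = 2**20 bytes


def throughput(nbytes, seconds, mbyte=MBYTE):
    """R = datasize / time, in MBytes/sec, for the data sent by one process."""
    return (nbytes / mbyte) / seconds


def time_trials(transfer, nbytes, trials=5):
    """Time a hypothetical transfer(buffer) callable, one fresh buffer per trial."""
    rates = []
    for _ in range(trials):
        buf = bytearray(nbytes)  # new buffer each trial, never reused
        start = time.perf_counter()
        transfer(buf)
        elapsed = time.perf_counter() - start
        if elapsed > 0:  # skip trials too short for the timer to resolve
            rates.append(throughput(nbytes, elapsed))
    return rates
```

In a real benchmark the transfer callable would wrap the MPI send and receive calls; here any function taking a buffer will do, for example time_trials(lambda b: bytes(b), 1 << 20).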

Exchange

We measured the one-way data throughput for pairs of bidirectional edge exchanges, as shown below; first for a single pair and then for 16 pairs of nodes. For a single pair, we measure asymptotic throughputs (i.e. for message sizes up to 1 MByte) of about 60 MBytes/sec for TCP, 110 MBytes/sec for GAMMA, and 500 MBytes/sec for Infiniband. The Gigabit Ethernet throughput is essentially unaffected by adding more nodes, due to the excellent performance of the x450a switch, but the bandwidth limitations of the fat-tree topology reduce the Infiniband throughput to 210 MBytes/sec. In addition, the performance of the Infiniband cluster becomes a little erratic, due to fluctuating loads at the switches. For small messages, GAMMA is about 3.5 times faster than TCP, while Infiniband is 4-5 times faster than GAMMA.
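To make the pairwise exchange pattern concrete, the following sketch simulates it in plain Python, without MPI. The pairing rule (rank 0 with 1, rank 2 with 3, and so on) is an assumption made for illustration; the text does not say how netbench assigns partners, and the function names are hypothetical.

```python
def exchange_partner(rank):
    """Hypothetical pairing: ranks (0,1), (2,3), ... exchange with each other."""
    return rank ^ 1  # flip the lowest bit to find the partner


def exchange(values):
    """Simulate one bidirectional exchange: each rank receives its partner's value."""
    if len(values) % 2 != 0:
        raise ValueError("the pattern requires an even number of processes")
    return [values[exchange_partner(rank)] for rank in range(len(values))]
```

Each pair communicates independently of every other pair, which is why a non-blocking switch can sustain the single-pair throughput as more pairs are added.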

It is more realistic in HPC applications to use all the processors on a single node. This tends to oversubscribe the network card, as several processes try to send data at once. Here we compare the bidirectional edge exchange rate, with 2 processes per node for the P4D and 4 processes per node for the dual Opterons; we use the same number of processors in each case. For the P4D nodes, 50% of the messages are oversubscribed by a factor of 2 and 50% of the messages are transferred via shared memory copies. Since the shared memory copies in MPI run at about 360 MBytes/sec, we expect a 10-20% reduction in measured throughput, which is roughly what we observe. In the case of the dual-processor Opteron, at least 2 processes are always using shared memory in a pairwise exchange, so the oversubscription is again a factor of 2 on 50% of the messages. However, the shared-memory copies are no faster than the Infiniband interconnect, so the Infiniband cluster is more affected by the oversubscription; with 2 nodes and 8 processors it has an asymptotic throughput of 260 MBytes/sec. With 64 processors the GAMMA throughput drops to about 90-95 MBytes/sec, in comparison with 55 MBytes/sec for OpenMPI. The HPC cluster is hit by oversubscription at the network card and at the switch, and its throughput drops to about 220 MBytes/sec.
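As a quick arithmetic check on the Gigabit Ethernet figures, the relative drop from the single-pair throughputs quoted earlier (110 MBytes/sec for GAMMA, 60 MBytes/sec for TCP) to the 64-processor values can be computed directly. The helper name percent_reduction is ours, and comparing these particular pairs of numbers is an assumption about which measurements correspond.

```python
def percent_reduction(baseline, measured):
    """Relative drop from baseline to measured throughput, in percent."""
    return 100.0 * (baseline - measured) / baseline


# Throughputs in MBytes/sec, taken from the text above.
gamma_low = percent_reduction(110, 95)   # GAMMA, upper end of the 90-95 range
gamma_high = percent_reduction(110, 90)  # GAMMA, lower end of the 90-95 range
tcp = percent_reduction(60, 55)          # TCP with OpenMPI

print(f"GAMMA: {gamma_low:.0f}-{gamma_high:.0f}% reduction")
print(f"TCP:   {tcp:.0f}% reduction")
```

On these numbers the GAMMA reduction falls inside the expected 10-20% band, while the TCP reduction comes out somewhat below it.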

Ring

In a message-passing ring, each processor receives from the rank below and sends to the rank above; all the data then flows in the same direction. Again we use all the available processors on each node, but in this case there is no oversubscription, so the throughput for large messages should be higher. MPIGamma obtains reduced latency and increased throughput from the shared memory and reaches 50 MBytes/sec for small (1500 byte) messages. MPIGamma performance is essentially independent of the number of processors, but in several of these tests we see a slow drop-off for messages larger than 128 KBytes, where its credit limit becomes exhausted. It may be possible to improve the performance of GAMMA for large messages at the cost of additional memory. Surprisingly, Infiniband performance suffers from a large number of dropouts in this pattern, which constantly synchronizes all the processors, not just pairs of processors. Very low data rates, of order 10 MBytes/sec, are frequently observed, especially on startup. This emphasizes the need for flat switching, even in high-performance networks. Overall, the Infiniband network is only about twice as fast as MPIGamma in this pattern.
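The ring pattern reduces to modular arithmetic on the ranks. The sketch below is a plain Python simulation rather than MPI code; the function names are hypothetical, and it assumes the ring wraps around so that rank 0 receives from rank N-1.

```python
def ring_neighbors(rank, nprocs):
    """Return (source, destination) ranks for one process in an N-process ring."""
    source = (rank - 1) % nprocs       # receive from the rank below
    destination = (rank + 1) % nprocs  # send to the rank above
    return source, destination


def ring_shift(values):
    """Simulate one ring step: each rank ends up with its lower neighbor's value."""
    nprocs = len(values)
    return [values[ring_neighbors(rank, nprocs)[0]] for rank in range(nprocs)]
```

Because every rank both sends and receives in each step, a delay at any one process or switch port holds up its neighbors too, which fits the observation that this pattern synchronizes all the processors rather than just pairs.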

Allreduce

The most widely used collective communication is MPI_Allreduce, which can be used to sum the contents of equal-length vectors on every processor and then broadcast the result to all the other processors. The standard MPI implementation uses a binary tree, which makes for a multiplicative factor of 2 log2 N in the communication time compared with an edge exchange of similar size. However, MPICH includes a more efficient Allreduce, utilizing a divide and conquer strategy with an asymptotic communication time of 4N. This even enables GAMMA to beat out the Infiniband cluster, which is hampered by the less than optimal collective communications in OpenMPI 1.1. TCP lags well behind GAMMA in this test. The smoothness of GAMMA's communication, due to its simple flow control and constant polling, makes it much more efficient for large numbers of processors.
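The 2 log2 N factor for a tree-based Allreduce can be illustrated with a small simulation. The sketch below reduces the vectors to rank 0 over a binomial tree and then broadcasts the sum back down, counting communication rounds. It is pure Python with no MPI, the function name tree_allreduce is hypothetical, and the binomial-tree schedule is one common choice assumed for illustration, not necessarily what any of the benchmarked MPI libraries do.

```python
def tree_allreduce(vectors):
    """Simulate reduce-to-root then broadcast over a tree.

    Returns (per-rank results, number of communication rounds).
    """
    nprocs = len(vectors)
    data = [list(v) for v in vectors]
    rounds = 0

    # Reduce phase: at each stride, the upper rank of every pair sends
    # its partial sum to the lower rank.
    stride = 1
    while stride < nprocs:
        for dst in range(0, nprocs, 2 * stride):
            src = dst + stride
            if src < nprocs:
                data[dst] = [a + b for a, b in zip(data[dst], data[src])]
        rounds += 1
        stride *= 2

    # Broadcast phase: the same tree in reverse, copying the total from rank 0.
    stride //= 2
    while stride >= 1:
        for src in range(0, nprocs, 2 * stride):
            dst = src + stride
            if dst < nprocs:
                data[dst] = list(data[src])
        rounds += 1
        stride //= 2

    return data, rounds
```

For N a power of two the round count is exactly 2 log2 N (for example, 4 rounds for 4 processes), and each round moves a full-length vector, which is where the multiplicative factor relative to a single edge exchange comes from.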

Barrier

Finally, we consider the average time for a call to MPI_Barrier. GAMMA has an order of magnitude advantage over TCP in this test, and is even competitive with Infiniband, especially for larger numbers of processors.

Sources and documentation

LAM v7.1: http://www.lam-mpi.org
OpenMPI: http://www.open-mpi.org
MPIGamma: http://www.disi.unige.it/project/gamma/mpigamma
University of Florida HPC: http://www.hpc.ufl.edu