Netbench

 

Netbench tests several key MPI routines: MPI_Barrier, MPI_Send/Recv, MPI_Allreduce, and MPI_Alltoall.

Syntax: mpirun -np procs -machinefile nodefile netbench datasize buffersize

Here datasize and buffersize are in kilobytes; typical values are 1024 and 1.

Netbench calculates the following:
1) Average time to synchronize the processors (using MPI_Barrier).
2) Throughput in a periodic ring (using MPI_Sendrecv). Process p
receives from p-1 and sends to p+1; the direction reverses on successive calls.
3) Throughput of a pairwise exchange (using MPI_Isend, MPI_Recv and MPI_Wait,
or MPI_Irecv, MPI_Send and MPI_Wait [default]).
4) Throughput of a call to MPI_Allreduce.
5) Throughput of a call to MPI_Alltoall.
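
The ring pattern in item 2 can be sketched without MPI as a small helper that works out which neighbours a process talks to on a given call. This is only an illustration: the function name ring_partners and the choice that even-numbered calls send to p+1 are assumptions, not taken from the netbench source.

```python
def ring_partners(rank, nprocs, call_index):
    """Return (source, dest) ranks for one step of a periodic ring.

    Assumption: even-numbered calls send "up" the ring (to rank+1) and
    odd-numbered calls send "down" (to rank-1). The text only says that
    the direction reverses on successive calls.
    """
    step = 1 if call_index % 2 == 0 else -1
    dest = (rank + step) % nprocs    # wraps around: the ring is periodic
    source = (rank - step) % nprocs
    return source, dest
```

With 4 processes, rank 0 would receive from rank 3 and send to rank 1 on the first call, then receive from rank 1 and send to rank 3 on the next. In an MPI program these two values would be the source and destination arguments of MPI_Sendrecv.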
 

Netbench reports unidirectional throughput R in Mbytes (1 Mbyte = 1048576 bytes) per second. Thus R = buffersize/(1024*time), where buffersize is in kilobytes and time is in seconds. For rings and exchanges this is half the throughput reported by the Intel MPI Benchmark (IMB) suite:
http://www.intel.com/cd/software/products/asmo-na/eng/cluster/244171.htm
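
As a worked example of the formula above, here is a minimal sketch of the arithmetic. The function names are hypothetical and are not part of netbench; the factor of two for IMB is taken directly from the statement above and applies only to rings and exchanges.

```python
def throughput_mbytes_per_s(buffersize_kb, seconds):
    """Unidirectional throughput R = buffersize/(1024*time), in Mbytes/s.

    buffersize_kb is in kilobytes (as on the netbench command line),
    seconds is the measured time for one transfer.
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return buffersize_kb / (1024.0 * seconds)


def imb_equivalent(r):
    """Figure IMB would report for a ring or exchange: twice netbench's R."""
    return 2.0 * r
```

For instance, a 2048-kilobyte buffer transferred in 0.5 seconds gives R = 4 Mbytes per second, which would correspond to 8 Mbytes per second in IMB's ring and exchange results.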


Why netbench?

1) It's simple and compact.
2) I have found it to be more reliable for my purposes than some well-established benchmarks. In particular, I discovered what I consider to be flaws in both Netpipe and IMB.
Netpipe times the completion of the send by the transmitter rather than the completion of the receive. Thus data can be buffered on the transmitter, and the time it takes to cross the network is not recorded. This is particularly problematic with large numbers of processors, when oversubscription of the switch is a distinct possibility.
IMB times the completion of the receive, but reuses the same buffer for multiple passes, which allows data to remain cached. This can give a higher throughput for intermediate message sizes than if new buffers are used each time, particularly with multiprocessor nodes. Netbench uses a new buffer for every run and therefore gives a more realistic timing.

Download netbench