Netbench tests several key MPI routines: MPI_Barrier, MPI_Send/Recv,
MPI_Allreduce, and MPI_Alltoall.
Syntax: mpirun -np procs -machinefile nodefile netbench datasize buffersize
Here datasize and buffersize are in kilobytes; typical values are
1024 and 1, respectively.
Netbench calculates the following:
1) Average time to synchronize the processors (using MPI_Barrier).
2) Throughput in a periodic ring (using MPI_Sendrecv). Process p
receives from p-1 and sends to p+1; the direction reverses on successive calls.
3) Throughput of a pairwise exchange (using MPI_Isend, MPI_Recv and MPI_Wait,
or MPI_Irecv, MPI_Send and MPI_Wait [default])
4) Throughput of a call to MPI_Allreduce
5) Throughput of a call to MPI_Alltoall
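The periodic ring in item 2 can be sketched as a small partner calculation. This is only an illustration, not netbench's own code: the helper name ring_partners and the rule that even-numbered calls send "up" the ring while odd-numbered calls send "down" are assumptions made here to show one way the direction could reverse on successive calls.

```c
#include <assert.h>

/* Hypothetical helper, not taken from netbench.
 * Computes the ring partners of `rank` among `nprocs` processes.
 * Assumption: even-numbered calls send up the ring (to rank+1) and
 * odd-numbered calls send down the ring (to rank-1); call must be >= 0. */
static void ring_partners(int rank, int nprocs, int call, int *src, int *dst)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs && call >= 0);

    int up   = (rank + 1) % nprocs;           /* wraps nprocs-1 -> 0 */
    int down = (rank + nprocs - 1) % nprocs;  /* wraps 0 -> nprocs-1 */

    if (call % 2 == 0) {
        *src = down;  /* receive from p-1 */
        *dst = up;    /* send to p+1      */
    } else {
        *src = up;    /* reversed: receive from p+1 */
        *dst = down;  /* reversed: send to p-1      */
    }
}
```

In an MPI program the resulting src and dst would be passed as the source and dest arguments of MPI_Sendrecv.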
Netbench reports unidirectional throughput R in Mbytes (1048576 bytes)
per second. Thus R = buffersize/(1024*time), with buffersize in kilobytes
and time in seconds; for rings and exchanges this is half the throughput
reported by the Intel MPI Benchmarks (IMB) suite:
http://www.intel.com/cd/software/products/asmo-na/eng/cluster/244171.htm
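As a worked illustration of the formula, the following sketch computes R and the corresponding IMB-style figure. The function names netbench_rate and imb_equivalent_rate are hypothetical, and the sketch assumes that time is the elapsed time in seconds for moving one buffer of buffersize kilobytes.

```c
#include <assert.h>

/* Hypothetical helper: unidirectional throughput R in Mbytes/s
 * (1 Mbyte = 1048576 bytes). buffersize_kb is in kilobytes (1024 bytes);
 * seconds is assumed to be the elapsed time for one buffer. */
static double netbench_rate(double buffersize_kb, double seconds)
{
    assert(seconds > 0.0);
    return buffersize_kb / (1024.0 * seconds);
}

/* For rings and exchanges, the figure reported by IMB is twice R. */
static double imb_equivalent_rate(double r)
{
    return 2.0 * r;
}
```

For example, a 1 kilobyte buffer delivered in 10 microseconds gives R = 1/(1024*0.00001), or about 97.7 Mbytes per second; for a ring or exchange, IMB would report roughly twice that.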
Why netbench?
1) It is simple and compact.
2) I have found it to be more reliable for my purposes than some
well-established benchmarks. In particular, I discovered what I consider
to be flaws in both Netpipe and IMB.
Netpipe times the completion of the send by the transmitter rather than
the completion of the receive. Thus data can be buffered on the
transmitter, and the time it takes to cross the network is not recorded.
This is particularly problematic with large numbers of processors, when
oversubscription of the switch is a distinct possibility.
IMB times the completion of the receive, but reuses the same
buffer for multiple passes, which allows data to remain cached. This can
give a higher throughput for intermediate message sizes than if new
buffers are used each time, particularly with multiprocessor nodes.
Netbench uses a new buffer for every run and therefore gives a more
realistic timing.
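One way to give every run its own buffer is sketched below. It is an assumption-laden illustration rather than netbench's actual allocation scheme: the names alloc_run_buffers and free_run_buffers are hypothetical, and the choice to allocate all buffers up front (so that no two runs can share an address) is made here for clarity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: release `count` buffers and the pointer table. */
static void free_run_buffers(char **bufs, int count)
{
    if (!bufs) return;
    for (int i = 0; i < count; i++)
        free(bufs[i]);
    free(bufs);
}

/* Hypothetical helper: allocate one distinct buffer of `bytes` bytes per run.
 * All buffers stay allocated at the same time, so run i never reuses the
 * memory that run i-1 sent or received. Returns NULL on failure. */
static char **alloc_run_buffers(int runs, size_t bytes)
{
    assert(runs > 0 && bytes > 0);

    char **bufs = calloc((size_t)runs, sizeof *bufs);
    if (!bufs) return NULL;

    for (int i = 0; i < runs; i++) {
        bufs[i] = malloc(bytes);
        if (!bufs[i]) {
            free_run_buffers(bufs, i);  /* frees the i buffers already made */
            return NULL;
        }
        memset(bufs[i], i & 0xff, bytes);  /* touch the memory before timing */
    }
    return bufs;
}
```

A timing loop would then pass bufs[i] to the communication call on run i. Freeing and reallocating a single buffer between runs would not be a safe substitute, since the allocator may hand back the same address.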