Network Adapter Performance
We used NetPIPE to test the performance of several Gigabit Ethernet network cards. The Intel/MPIGAMMA combination has the lowest latency, outperforming the proprietary RDMA cards. We use bidirectional messages rather than the standard one-way test, since this more closely resembles real-world applications, in which nodes exchange data rather than just send a message from A to B. We compare different network cards, different software layers, and switches vs. crossover cables; details are given below. The one-way data rate is reported in MBytes/sec; the theoretical maximum for Gigabit Ethernet is 125 MBytes/sec. Note that the NetPIPE output file reports the total message buffer size, which is twice the size of an individual message; the sizes reported below are individual message sizes.
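
To illustrate what a bidirectional exchange means in practice, here is a minimal Python sketch in which both endpoints send and receive a message of the same size in every round. It is not NetPIPE and not the code used for these measurements: it runs over a local socket pair, so it exercises no network card, and the function names `exchange` and `bidirectional_rate` are hypothetical.

```python
import socket
import threading
import time


def exchange(sock, payload, rounds):
    """Send `payload` and receive the same number of bytes, `rounds` times."""
    size = len(payload)
    for _ in range(rounds):
        # Send in a helper thread so that both endpoints can transmit
        # at the same time without blocking each other.
        sender = threading.Thread(target=sock.sendall, args=(payload,))
        sender.start()
        received = 0
        while received < size:
            chunk = sock.recv(size - received)
            if not chunk:
                raise ConnectionError("peer closed the connection")
            received += len(chunk)
        sender.join()


def bidirectional_rate(msg_bytes, rounds=50):
    """Return the one-way data rate in MBytes/sec (1 MByte = 10**6 bytes)."""
    a, b = socket.socketpair()  # local stand-in for two cluster nodes
    payload = bytes(msg_bytes)
    peer = threading.Thread(target=exchange, args=(b, payload, rounds))
    start = time.perf_counter()
    peer.start()
    exchange(a, payload, rounds)
    peer.join()
    elapsed = time.perf_counter() - start
    a.close()
    b.close()
    return msg_bytes * rounds / elapsed / 1e6


if __name__ == "__main__":
    for size in (1024, 4096, 16384, 65536):
        print(size, "bytes:", round(bidirectional_rate(size), 1), "MBytes/sec")
```

The rate is computed from the individual message size, not from the combined buffer, which matches the convention used for the numbers on this page.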
| | B'Com A | B'Com B | Intel A | Intel B | Intel C | Intel D | L5 | A'sso* |
|---|---|---|---|---|---|---|---|---|
| Latency | 59μs | 20μs | 63μs | 32μs | 8.8μs | 12μs | 20μs | 30μs |
| Peak rate (MB/s) | 99 | 90 | 107 | 113 | 116 | 110 | 102 | 74 |
| Rate at 64KB (MB/s) | 89 | 81 | 95 | 105 | 116 | 110 | 102 | 69 |
| Rate at 16KB (MB/s) | 74 | 84 | 66 | 82 | 99 | 82 | 80 | 55 |
| Rate at 4KB (MB/s) | 31 | 51 | 33 | 43 | 62 | 40 | 50 | 39 |
| Rate at 1KB (MB/s) | 13 | 22 | 12 | 22 | 45 | 31 | 26 | 16 |
| Message size at max | 256KB | 4096KB | 512KB | 256KB | 64KB | 96KB | 64KB | 128KB |
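
As a reading aid, the peak rates above can be expressed as a fraction of the 125 MBytes/sec Gigabit Ethernet limit. The sketch below assumes 1 MByte = 10^6 bytes (which is what makes the limit exactly 125) and also shows the conversion from a NetPIPE buffer size to an individual message size. The helper names are hypothetical; the numbers are copied from the table.

```python
GIGABIT_BITS_PER_SEC = 1_000_000_000

# Peak one-way rates in MBytes/sec, copied from the table above.
PEAK_RATE_MB_S = {
    "B'Com A": 99, "B'Com B": 90,
    "Intel A": 107, "Intel B": 113, "Intel C": 116, "Intel D": 110,
    "L5": 102, "A'sso": 74,
}


def theoretical_max_mb_s(bits_per_sec=GIGABIT_BITS_PER_SEC):
    """Line rate in MBytes/sec, assuming 1 MByte = 10**6 bytes."""
    return bits_per_sec / 8 / 1e6


def link_efficiency(rate_mb_s):
    """Fraction of the theoretical maximum reached by a measured rate."""
    return rate_mb_s / theoretical_max_mb_s()


def message_size_from_buffer(buffer_bytes):
    """NetPIPE reports the total buffer, twice the individual message size."""
    return buffer_bytes // 2


if __name__ == "__main__":
    for name, rate in PEAK_RATE_MB_S.items():
        print(f"{name:8s} {100 * link_efficiency(rate):5.1f}% of line rate")
```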
Hardware:
- Dell PE850: single P4D (3.0GHz dual-core) and Dell PE1750*: dual Xeon (2.8GHz)
- OS: CentOS 4.2 (Linux 2.6.9 and Linux 2.6.12-gamma) and NPACI Rocks (Linux 2.4)*
- Switches: Crossover cable, Extreme Networks Summit400-t48, Cisco chassis switch (model 6509)*
| Network card | Driver | MTU | MPI | Switch |
|---|---|---|---|---|
| Broadcom 5721 (A) | tg3-3.43 | 1500 | LAM-7.1 | 400-t48 |
| Broadcom 5721 (B) | GAMMA-06-08-08 | 1500 | MPIGAMMA-06-07-17 | 400-t48 |
| Intel 82545GM (A) | e1000-6.0.54-k2-NAPI | 4120 | LAM-7.1 | Crossover |
| Intel 82545GM (B) | ITR=0 TID=128 | 4120 | LAM-7.1 | Crossover |
| Intel 82545GM (C) | GAMMA-06-02-17 | 4120 | MPIGAMMA-06-02-09 | Crossover |
| Intel 82545GM (D) | GAMMA-06-02-17 | 4120 | MPIGAMMA-06-02-09 | 400-t48 |
| Level 5 EF1-21022T | Proprietary | N/A | Proprietary | Crossover |
| Ammasso 1100 | Proprietary | N/A | Proprietary | Cisco 6509 |
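
The Intel rows above use an MTU of 4120 bytes (4096+24), and the TCP tests used 1 MByte buffers. The sketch below shows one way an application could request buffers of that size on a socket. This is an assumption made for illustration: the page does not say how the buffers were configured in these tests, and the kernel may cap or adjust the requested value. The constant and function names are hypothetical.

```python
import socket

PAYLOAD_BYTES = 4096
EXTRA_BYTES = 24                      # the "+24" quoted on this page
MTU = PAYLOAD_BYTES + EXTRA_BYTES     # 4120, as in the MTU column

TCP_BUFFER_BYTES = 1 << 20            # 1 MByte, assumed here to mean 2**20 bytes


def make_tuned_socket(buffer_bytes=TCP_BUFFER_BYTES):
    """Create a TCP socket and request send/receive buffers of the given size."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


def effective_buffers(sock):
    """Return the (send, receive) buffer sizes the kernel actually granted."""
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )


if __name__ == "__main__":
    s = make_tuned_socket()
    print("MTU:", MTU, "granted buffers:", effective_buffers(s))
    s.close()
```

Comparing the requested size with the granted size is worthwhile, because system-wide limits can silently reduce the request.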
Some observations:
- The Broadcom 5721 (B'Com A) performs well when used with a recent tg3 driver (from Broadcom); the 3.10 driver in the 2.6.9 kernel is not as good. The Intel PRO 1000 (Intel A) has a high bandwidth but the large TCP driver latency reduces the performance for small messages.
- The TCP performance of the Intel NIC (82545GM) can be improved by tuning the driver parameters. In case B we used InterruptThrottleRate=0 and TxIntDelay=128. The ITR=0 setting reduces the latency by a factor of 2; TID=128 allows for reduced CPU load with no measurable effect on performance. The effects of driver tuning may be more noticeable in practice than these limited results suggest. We notice that the throughput of the PRO 1000 with default TCP driver settings is quite erratic, and these dropouts may substantially reduce performance with large numbers of nodes.
- Tuning the TCP buffer size increases performance. Generally, larger TCP buffers increase performance for large messages at the cost of a small performance penalty for smaller messages. We used 1 MByte buffers, which we found to be a good compromise. Increasing the frame size (MTU) has a similar effect, with the added benefit of a reduction in CPU usage; we used MTU=4120 (4096+24) in these tests.
- MPIGAMMA is a port of MPI to the GAMMA Ethernet driver; it reduces the latency of the default Intel adapter by nearly an order of magnitude (from 63μs to 8.8μs), which is lower than the latency of the expensive proprietary interfaces from Level 5 and Ammasso. It is only compatible with the 2.6.12 and 2.6.18 kernels at present. The hardware latencies of the Broadcom and Intel NICs wired back to back, as measured by GAMMA, are 14μs and 6.5μs respectively. There is an additional latency of 2.3μs from the MPIGAMMA software layer and 3.3μs from the Summit400-t48 switch. The Broadcom NIC failed to reach its asymptotic bandwidth because we did not adjust the credit limit to account for the smaller packet size (1500 bytes); the Intel NICs used the optimum 4120-byte frame size.
- With a low-latency network card the switch latency becomes critical, and the Summit400-t48 performs very well in this regard: the additional latency from the switch is only 3.3μs.
- The switch reduces the maximum throughput under MPIGAMMA because the added latency requires a larger message size to saturate. GAMMA was configured to accept up to 32 4KB packets before sending an acknowledgement packet, and MPIGAMMA therefore takes a small (<10%) performance hit around 128KBytes; slightly better performance can be obtained at the cost of more memory allocated to GAMMA. In bidirectional mode MPI typically reaches a peak performance for messages of around 64KBytes, when it switches from "eager" to "rendezvous" protocols. LAM has a user interface (SSI) for choosing the transition message size; our results with LAM used the eager protocol throughout, which was found to deliver the maximum throughput.
- The overall performance of a GAMMA-enabled Intel PRO NIC is remarkable; real-world applications can pass quite small bidirectional messages (32KBytes) at rates in excess of 100 MB/s each way.
- The proprietary RDMA cards from Level 5 and Ammasso (retail price around $500) used to outperform TCP/IP-based NICs, but with newer TCP drivers the Intel NIC outperforms both proprietary cards, except the L5 at small packet sizes. The Intel+MPIGAMMA combination is faster than the proprietary cards.
- The Ammasso and Level 5 NICs substantially reduce CPU usage, which can increase performance if computation and communication are overlapped; GAMMA requires 100% CPU utilization and cannot profit from overlapping. The TCP driver (e1000-6.0.54-k2-NAPI) uses about 30% of the CPU; the tuned driver uses 40-50% CPU, depending on TID and MTU.
- The Ammasso cards were tested on older hardware (2.8GHz Xeons) with a 2.4 kernel; the other tests were with 3GHz P4D and the 2.6 kernel.
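
The latency figures quoted in the observations add up to the MPIGAMMA latencies in the results table, and the GAMMA acknowledgement window explains why the performance dip sits near 128KBytes. The sketch below is only an arithmetic check: it assumes the three latency contributions simply add, and the function names are hypothetical.

```python
# Values quoted in the observations above, in microseconds.
NIC_LATENCY_US = {"Broadcom 5721": 14.0, "Intel 82545GM": 6.5}
MPIGAMMA_LAYER_US = 2.3
SWITCH_US = 3.3   # Summit400-t48


def expected_latency_us(nic, through_switch):
    """Sum the hardware, software-layer and (optional) switch latencies."""
    total = NIC_LATENCY_US[nic] + MPIGAMMA_LAYER_US
    if through_switch:
        total += SWITCH_US
    return round(total, 1)


def ack_window_bytes(packets=32, packet_bytes=4096):
    """Data GAMMA accepts before sending an acknowledgement packet."""
    return packets * packet_bytes


if __name__ == "__main__":
    print("Intel, crossover:", expected_latency_us("Intel 82545GM", False), "us")
    print("Intel, switch:   ", expected_latency_us("Intel 82545GM", True), "us")
    print("Broadcom, switch:", expected_latency_us("Broadcom 5721", True), "us")
    print("Ack window:", ack_window_bytes() // 1024, "KBytes")
```

Under this additive assumption the sums are 8.8μs, 12.1μs and 19.6μs, which compare with the measured 8.8μs (Intel C), 12μs (Intel D) and 20μs (B'Com B); the window of 32 packets of 4KB is 128KBytes.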
Sources and documentation
LAM v7.1: http://www.lam-mpi.org
MPI/GAMMA: http://www.disi.unige.it/project/gamma/mpigamma
Acknowledgements: We thank Giuseppe Ciaccio for extensive help in setting up our MPIGAMMA installation and also for eliminating a number of obscure bugs. We thank the University of Florida High-Performance Computer Center (http://www.hpc.ufl.edu) for access to the Ammasso and Cisco hardware.