MPI/GAMMA

MPI/GAMMA offers a high-performance alternative to TCP sockets. Our tests with a number of application codes show that MPI/GAMMA can give a substantial performance boost in comparison to TCP-based message passing. In fact, our results show that an optimized, GAMMA-equipped Gigabit Ethernet cluster can compete with much more expensive proprietary interconnects, at least up to ~100 processors.

I should emphasize that hardware selection plays an important role in the performance we obtained. The Dell units perform very well individually; however, the onboard Broadcom network adapters were supplemented by PCI-X Intel PRO 1000 cards, which have better throughput and significantly lower latency. The importance of the switch to parallel performance cannot be overemphasized; of the several switches we have investigated, the Extreme Networks x450a-48t is the only one that can support wire-speed throughput under full load. We have measured throughputs that differ by more than a factor of 100 between the x450a-48t and its predecessor, the s400-48t. Other Gigabit switches I have tested do not approach even the s400-48t in performance.

GAMMA is the base driver that replaces the TCP sockets. It provides a low-level application layer that is patched directly into the kernel. Each GAMMA release requires a specific kernel version (currently 2.6.18.1), which must be recompiled to include the GAMMA driver. We use CentOS 4.2 and I have had no problems with vanilla 2.6.x kernels (x >= 12). There is a detailed set of installation instructions on the MPI/GAMMA website. Once GAMMA is installed, it is useful to test it with its ping-pong and barrier call applications. The ping-pong test checks the raw bandwidth between the nodes, which should be around 123 MBytes/sec for large messages, and the latency (6.5 microseconds for two Intel PRO 1000 cards wired back-to-back). The barrier call test confirms that all nodes are communicating. MPI/GAMMA is installed over MPICH-1.2.7p1 in the normal way, and applications can then be compiled as with MPICH. Note that MPI/GAMMA does not process the -nolocal option.
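
To get a feel for what the ping-pong figures imply for smaller messages, the sketch below plugs them into a simple linear cost model (time = latency + size / bandwidth). The model itself is an assumption and not something GAMMA reports; it also assumes that the 6.5 microsecond figure is a one-way latency and that 1 MByte means 10**6 bytes. The function names are purely illustrative.

```python
# Linear cost model: an assumption for illustration, not GAMMA's measured curve.
LATENCY_S = 6.5e-6       # latency quoted above, treated as one-way (assumption)
BANDWIDTH_BPS = 123e6    # large-message bandwidth quoted above, in bytes/sec
                         # (assumes 1 MByte = 10**6 bytes)


def transfer_time(size_bytes):
    """Estimated time in seconds to deliver one message of the given size."""
    return LATENCY_S + size_bytes / BANDWIDTH_BPS


def effective_bandwidth(size_bytes):
    """Estimated throughput in bytes/sec for messages of the given size."""
    return size_bytes / transfer_time(size_bytes)


def half_performance_size():
    """Message size at which the model reaches half the peak bandwidth."""
    return LATENCY_S * BANDWIDTH_BPS


if __name__ == "__main__":
    for size in (64, 1024, 65536, 1048576):
        mbytes = effective_bandwidth(size) / 1e6
        print(f"{size:>8} bytes: {mbytes:6.1f} MBytes/sec")
    print(f"half-performance size: {half_performance_size():.0f} bytes")
```

Under these assumptions, messages of roughly 800 bytes already reach half of the peak bandwidth, which is one way to see why low latency matters as much as raw throughput for codes that exchange many small messages.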

Applications running under MPI/GAMMA scale extremely well, frequently beating the InfiniBand HPC cluster in both raw speed and scalability. The high level of performance of GAMMA is a little mysterious to me; in particular, how does it compensate for the extra bandwidth of the InfiniBand interconnect? Even in the most favorable cases, the InfiniBand bandwidth is at least double that available to GAMMA. Perhaps the answer is in GAMMA's simple but effective flow control algorithm. Certainly one notices how steady the communications are with MPI/GAMMA. Results from run to run, and even repetitions within a run, are more reproducible and stable than with either TCP or InfiniBand. With a large number of processes, fluctuations in throughput at a local level can have a significant impact on the overall performance. It should also be noted that while the MPI/GAMMA setup is as ideal as we can make it from a performance point of view, the HPC cluster makes a lot of compromises due to its multi-user environment. The HPC cluster is also hindered by performance issues with OpenMPI collectives, especially with the current version (1.0). Nevertheless, the GAMMA performance is remarkable and suggests that well-optimized Ethernet configurations can contribute to high-performance computing.
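
The remark about local fluctuations can be illustrated with a toy model. In a bulk-synchronous code, every step finishes only when the slowest process finishes, so the step time is the maximum over all processes. The sketch below is not a measurement of GAMMA, TCP, or InfiniBand; the function name, the base time, and the jitter range are hypothetical values chosen only to show the trend.

```python
import random


def mean_step_time(n_procs, n_steps=2000, base=1.0, jitter=0.05, seed=0):
    """Average time per synchronized step in a toy model.

    Each process takes `base` plus a uniform random delay in [0, jitter];
    a step ends when the slowest process is done. All parameters are
    hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_steps):
        total += max(base + rng.uniform(0.0, jitter) for _ in range(n_procs))
    return total / n_steps


if __name__ == "__main__":
    for n in (1, 16, 128):
        print(f"{n:>4} processes: mean step time {mean_step_time(n):.4f}")
```

In this model the average step time creeps toward the worst case (base + jitter) as the process count grows, so an interconnect with steadier delivery loses less time to waiting even if its peak bandwidth is lower.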

There are still a number of issues to address before MPI/GAMMA could move from a research to a production environment. My personal ranking, in descending order of importance:

  1. Defining a GAMMA Virtual Machine (VM) at the initiation of the parallel job rather than on boot up. This would eliminate interaction between different jobs; at present any user can reset any other user's GAMMA VM. My understanding is that this requires some changes at quite a fundamental level.
  2. Redesign of the flow-control algorithm. The current credit-based algorithm has difficulty scaling beyond about 100 processors; it requires more receive buffer pointers than are available on the network adapter.
  3. Integrating GAMMA with other MPI implementations; OpenMPI is the obvious candidate.
  4. A simpler installation, perhaps as a Dynamic Kernel Module, that does not require a kernel recompilation.
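
To make the scaling concern in item 2 concrete, here is a minimal sketch of a generic credit-based scheme. It is not GAMMA's actual implementation, which is not described here. It assumes that each receiver reserves a fixed number of buffers (credits) for every peer, so the buffers needed per node grow linearly with the number of processes. The class, the function names, and the numbers in the usage example are all hypothetical.

```python
class CreditSender:
    """Sender side of a generic credit scheme (illustrative, not GAMMA's code)."""

    def __init__(self, peers, credits_per_peer):
        # One credit corresponds to one receive buffer reserved at that peer.
        self.credits = {peer: credits_per_peer for peer in peers}

    def try_send(self, peer):
        """Consume a credit and send; refuse if the peer has no free buffer."""
        if self.credits[peer] == 0:
            return False
        self.credits[peer] -= 1
        return True

    def credit_returned(self, peer, count=1):
        """Called when the peer reports that it has freed `count` buffers."""
        self.credits[peer] += count


def receive_buffers_needed(n_procs, credits_per_peer):
    """Buffers one node must reserve so that every peer holds its credits."""
    return (n_procs - 1) * credits_per_peer


def max_procs(ring_size, credits_per_peer):
    """Largest job that fits in a receive ring of `ring_size` buffers."""
    return ring_size // credits_per_peer + 1


if __name__ == "__main__":
    sender = CreditSender(peers=[1, 2], credits_per_peer=2)
    print([sender.try_send(1) for _ in range(3)])   # third send is refused
    sender.credit_returned(1)
    print(sender.try_send(1))                       # allowed again
    # Hypothetical sizes: 4 credits per peer, 256-entry receive ring.
    print(receive_buffers_needed(100, 4), max_procs(256, 4))
```

With a fixed per-peer reservation, the only ways to support more processes are to give each peer fewer credits or to find more receive buffers, which is why a redesign rather than a tuning change would be needed.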

Sources and documentation

MPI/GAMMA http://www.disi.unige.it/project/gamma/mpigamma