Bridging MPI-level Collective Primitives and Network/System Capabilities: A Case Study with Modern InfiniBand Multicore Clusters

Colloq: Speaker: 
Amith R. Mamidala
Colloq: Speaker Institution: 
The Ohio State University
Colloq: Date and Time: 
Fri, 2008-02-15 10:00
Colloq: Location: 
Building 5100, Room 128 (JICS Auditorium)
Colloq: Host: 
Jeff Vetter
Colloq: Host Email: 
vetter@ornl.gov
Colloq: Abstract: 
The InfiniBand (IBA) interconnect is becoming ubiquitous in the High Performance Computing arena due to its superior performance capabilities. It also offers features such as H/W Multicast and Remote Direct Memory Access (RDMA). The Message Passing Interface (MPI) has become an efficient programming model for scaling parallel applications to thousands of nodes. In this context, scaling MPI primitives becomes very important, particularly MPI collective operations, which are widely used in many scientific applications. At the same time, multicore systems are rapidly gaining ground: 16-core Barcelona-based systems are already available, and core counts are expected to increase rapidly in the near future. Optimizing MPI collective communication on these modern systems is a challenging task.

In my talk, I focus on three broad directions for effectively bridging the gap between MPI collectives and network/system capabilities. In the first part of the talk, I propose new communication protocols for utilizing IBA H/W Multicast support for different collectives. Currently, H/W Multicast is not reliable, and packets can be dropped in the network; I explain a distributed and scalable approach to providing reliability for InfiniBand clusters. In the second part of the talk, I focus on architecture-driven optimizations for collectives. Current-generation IBA systems come in two broad flavors: off-loaded adapters and on-loaded adapters, and collectives must be designed appropriately for the characteristics of each.

For off-loaded adapters, the key is to choose the correct message transport and communication semantics, such as RDMA, to provide performance and resource scalability. For on-loaded adapters, the key is to use the correct number of cores to push data into the network. I explain these different optimizations with respect to MPI_Alltoall, a widely used collective. The new algorithms improve CPMD application performance by more than 33% on a 512-core system. I also propose a new primitive, "RDMA over Unreliable Datagram," which can significantly enhance performance and resource scalability for collectives as well as one-sided operations.

In the last part of my talk, I explain new collective algorithms that leverage H/W Multicast to tolerate process skew.
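For readers less familiar with the collective named above, the following minimal C/MPI sketch shows what an MPI_Alltoall exchange looks like at the application level. It is purely illustrative: it uses only the standard MPI API and does not reflect the multicast-, RDMA-, or multicore-specific optimizations discussed in the talk.

```c
/* Minimal sketch of an MPI_Alltoall exchange: every process sends one
 * integer to, and receives one integer from, every other process.
 * Illustrative only; it does not implement the optimizations in the talk. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes one integer destined for every rank. */
    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        sendbuf[i] = rank * 100 + i;   /* data intended for rank i */

    /* All-to-all personalized exchange: one int per process pair. */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("Rank %d received element %d from rank 0\n", rank, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```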
Colloq: Speaker Bio: 
My research focuses on communication architecture in the High Performance Computing domain. This includes diverse topics such as communication algorithms and optimizations for multicore architectures; user-level protocols over connection-oriented and connectionless transports (e.g., InfiniBand); and high-performance cluster interconnects.

I work on the MVAPICH project, an open-source distribution of MPI over InfiniBand. My current focus is designing a scalable and efficient collective communication subsystem for MVAPICH.