Automatic Empirical Techniques for Developing Efficient MPI Collective Communication Routines

Colloq: Speaker: 
Ahmad Faraj
Colloq: Speaker Institution: 
Computer Science Department Florida State University
Colloq: Date and Time: 
Tue, 2006-09-19 10:00
Colloq: Location: 
ORNL 5100-Auditorium
Colloq: Host: 
Jeff Vetter
Colloq: Host Email:
Colloq: Abstract: 
Because collective operations are widely used in Message Passing Interface (MPI) applications, developing efficient collective communication routines is essential. Despite numerous research efforts to optimize MPI collective operations, it is still not clear how to obtain MPI collective routines that achieve high performance across platforms and applications. In particular, while it may not be extremely difficult to develop an efficient communication algorithm for a given platform and a given application, including such an algorithm in an MPI library poses a significant challenge: the communication library is general-purpose and must provide efficient routines for different platforms and applications. In this research, a new library implementation paradigm called delayed finalization of MPI collective communication routines (DF) is proposed for realizing efficient MPI collective routines across platforms and applications. The idea is to postpone the decision of which algorithm to use for a collective operation until the platform and/or application are known. Using the DF approach, the MPI library can maintain, for each communication operation, an extensive set of algorithms, and use an automatic algorithm selection mechanism to decide the appropriate algorithm for a given platform and a given application. Hence, a DF-based library can adapt to platforms and applications. To verify that the DF approach is effective and practical, Ethernet switched clusters are selected as the experimental platform, and two DF-based MPI libraries, STAGE-MPI and STAR-MPI, are developed and evaluated. In the development of the DF-based libraries, topology-specific algorithms for all-to-all, all-gather, and broadcast operations are designed for Ethernet switched clusters.
The experimental results indicate that both STAGE-MPI and STAR-MPI significantly outperform traditional MPI libraries, including LAM/MPI and MPICH, in many cases. This demonstrates that the performance of MPI collective library routines can be significantly improved by (1) incorporating platform- and application-specific communication algorithms in the MPI library, and (2) making the library adaptable to platforms and applications.
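The idea of postponing the algorithm choice until the platform or application is known can be illustrated with a minimal sketch. It is not the STAGE-MPI or STAR-MPI implementation: the class `DelayedSelector`, the two stand-in "broadcast" functions, and the `FakeClock` used to make the timings deterministic are all hypothetical names introduced here, and no real MPI calls are made. The sketch only assumes that each operation keeps several candidate algorithms, measures each one during the first few invocations, and then commits to the fastest.

```python
import time
from typing import Callable, Dict, List, Optional

# Hypothetical type: an "algorithm" takes a message and returns the delivered message.
Algorithm = Callable[[List[int]], List[int]]


class DelayedSelector:
    """Defer the choice among candidate algorithms until they have been measured."""

    def __init__(self, candidates: Dict[str, Algorithm],
                 timer: Callable[[], float] = time.perf_counter,
                 trials_per_candidate: int = 2) -> None:
        self.candidates = candidates
        self.timer = timer
        self.trials = trials_per_candidate
        self.samples: Dict[str, List[float]] = {name: [] for name in candidates}
        self.chosen: Optional[str] = None  # set once measurement is complete

    def __call__(self, data: List[int]) -> List[int]:
        if self.chosen is not None:
            # Selection is final: always use the chosen algorithm.
            return self.candidates[self.chosen](data)
        # Measurement phase: run the first candidate that still lacks samples.
        name = next(n for n in self.candidates if len(self.samples[n]) < self.trials)
        start = self.timer()
        result = self.candidates[name](data)
        self.samples[name].append(self.timer() - start)
        if all(len(s) >= self.trials for s in self.samples.values()):
            # Commit to the candidate with the lowest mean measured time.
            self.chosen = min(self.samples,
                              key=lambda n: sum(self.samples[n]) / len(self.samples[n]))
        return result


class FakeClock:
    """Deterministic stand-in for a wall clock (assumption, for illustration only)."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, dt: float) -> None:
        self.now += dt


clock = FakeClock()


def flat_broadcast(data: List[int]) -> List[int]:
    clock.advance(3.0)  # pretend this algorithm costs 3 time units on this platform
    return list(data)


def tree_broadcast(data: List[int]) -> List[int]:
    clock.advance(1.0)  # pretend this algorithm costs 1 time unit on this platform
    return list(data)


bcast = DelayedSelector({"flat": flat_broadcast, "tree": tree_broadcast},
                        timer=clock, trials_per_candidate=2)
for _ in range(4):  # two measured calls per candidate
    out = bcast([1, 2, 3])
```

After the four measured calls, `bcast.chosen` holds the name of the cheaper candidate, and every later call goes straight to it. A real library would time actual communication on the target cluster rather than a simulated clock.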
Colloq: Speaker Bio: 
Ahmad Faraj received his BS, MS, and Ph.D. degrees in Computer Science from Florida State University in 2000, 2002, and 2006, respectively. His Ph.D. work with Prof. Xin Yuan focused on how to achieve an efficient library implementation of the Message Passing Interface (MPI), and in particular, how to realize collective communication routines that can deliver high performance across platforms and applications. His research interests include MPI implementation, communication optimizations and communication algorithms, performance analysis/optimization/tuning, empirical optimization techniques, parallel programming and computing, high performance computing, clustering, and compilers.