ABSTRACT:
Due to the wide use of collective operations in Message Passing Interface (MPI) applications, developing efficient
collective communication routines is essential. Despite numerous research efforts for optimizing MPI collective
operations, it is still not clear how to obtain MPI collective routines that can achieve high performance across
platforms and applications. In particular, while it may not be extremely difficult to develop an efficient communication
algorithm for a given platform and a given application, including such an algorithm in an MPI library poses a significant
challenge: the communication library is general-purpose and must provide efficient routines for different platforms
and applications.
In this research, a new library implementation paradigm called delayed finalization of MPI collective communication
routines (DF) is proposed for realizing efficient MPI collective routines across platforms and applications. The idea
is to postpone the decision of which algorithm to use for a collective operation until the platform and/or application
are known. Using the DF approach, the MPI library can maintain, for each communication operation, an extensive set of
algorithms, and use an automatic algorithm selection mechanism to decide the appropriate algorithm for a given platform
and a given application. Hence, a DF-based library can adapt to platforms and applications.
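The selection idea described above can be illustrated with a minimal Python sketch. It is not the STAGE-MPI or STAR-MPI interface: the names DelayedCollective, allgather_direct, and allgather_ring are hypothetical, processes are simulated as in-memory lists rather than MPI ranks, and the cost function is a caller-supplied stand-in for whatever measurement a real library would take on the target platform or application.

```python
from typing import Callable, Dict, List, Optional

# All names here are hypothetical illustrations, not a real MPI library API.
# A "collective" takes one send buffer per simulated process and returns
# one receive buffer per simulated process.
Algorithm = Callable[[List[List[int]]], List[List[int]]]


def allgather_direct(bufs: List[List[int]]) -> List[List[int]]:
    """Every simulated process obtains every block in a single logical step."""
    gathered = [x for block in bufs for x in block]
    return [list(gathered) for _ in bufs]


def allgather_ring(bufs: List[List[int]]) -> List[List[int]]:
    """Ring all-gather: p-1 steps, each rank forwards one block to its right neighbor."""
    p = len(bufs)
    # blocks[r] maps block index -> data currently held by rank r.
    blocks = [{r: list(bufs[r])} for r in range(p)]
    for step in range(p - 1):
        # Collect all sends first so the step behaves as a simultaneous exchange.
        sends = [((r + 1) % p, (r - step) % p, blocks[r][(r - step) % p])
                 for r in range(p)]
        for dst, idx, data in sends:
            blocks[dst][idx] = list(data)
    return [[x for i in range(p) for x in blocks[r][i]] for r in range(p)]


class DelayedCollective:
    """Keeps several candidate algorithms and picks one only on first use."""

    def __init__(self, candidates: Dict[str, Algorithm],
                 cost: Callable[[str, Algorithm], float]) -> None:
        self.candidates = candidates
        self.cost = cost  # assumed to reflect the platform/application at hand
        self.chosen: Optional[str] = None

    def finalize(self) -> str:
        """Select the candidate with the lowest reported cost."""
        self.chosen = min(self.candidates,
                          key=lambda name: self.cost(name, self.candidates[name]))
        return self.chosen

    def __call__(self, bufs: List[List[int]]) -> List[List[int]]:
        if self.chosen is None:  # the decision is delayed until the first call
            self.finalize()
        return self.candidates[self.chosen](bufs)
```

A caller would construct DelayedCollective with a dictionary of candidates and a cost function, then invoke it like an ordinary collective. In a real setting the cost would come from timing the candidates on the actual cluster; any fixed cost table used with this sketch is a placeholder, not measured data.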
To verify that the DF approach is effective and practical, Ethernet switched clusters are selected as the experimental
platform and two DF-based MPI libraries, STAGE-MPI and STAR-MPI, are developed and evaluated. In the development of the
DF-based libraries, topology-specific algorithms for all-to-all, all-gather, and broadcast operations are designed for
Ethernet switched clusters. The experimental results indicate that both STAGE-MPI and STAR-MPI significantly outperform
traditional MPI libraries, including LAM/MPI and MPICH, in many cases. This demonstrates that the performance of MPI
collective library routines can be significantly improved by (1) incorporating platform- and application-specific communication
algorithms in the MPI library, and (2) making the library adaptable to platforms and applications.
BIO:
Ahmad Faraj received his BS, MS, and Ph.D. in Computer Science from Florida State University in 2000, 2002, and 2006,
respectively. His Ph.D. work with Prof. Xin Yuan focused on how to achieve efficient library implementations of the Message
Passing Interface (MPI), and in particular, how to realize collective communication routines that can deliver high performance
across platforms and applications. His research interests include MPI implementation, communication optimizations and
communication algorithms, performance analysis/optimization/tuning, empirical optimization techniques, parallel programming and
computing, high performance computing, clustering, and compilers.
# # #