Colloq: Date and Time:
Fri, 2011-01-07 10:00
5100, Room 128 JICS Lecture Hall
The recent shift in computer technology has introduced a range of diverse computing resources, such as multicore architectures and hardware accelerators, opening a new blue ocean for general-purpose high-performance computing. With the increased complexity of these new computing environments, however, finding efficient and convenient ways to program these resources while achieving reasonable performance is one of the most challenging issues. In particular, hardware accelerators such as General-Purpose Graphics Processing Units (GPGPUs) provide inexpensive, highly parallel systems to application developers, but their programming complexity poses a significant challenge, and complex interactions among limited hardware resources make it even harder to achieve good performance. This dissertation explores compile-time and runtime techniques to improve programmability and to enable adaptive execution of programs on such architectures.

First, this dissertation examines the possibility of exploiting the OpenMP shared-memory programming model on stream architectures such as GPGPUs. It presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into GPGPU applications. The goal of this translation is to improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. A preliminary study identifies several key transformation techniques that enable efficient memory access on general stream architectures. Preliminary experiments on both real benchmarks and kernels show that the proposed framework improves performance by up to 50X (12X on average) over unoptimized translations.

Second, this dissertation studies runtime tuning systems that adapt applications dynamically. In preliminary work, an adaptive runtime tuning system with an emphasis on parallel irregular applications has been proposed.
Preliminary experiments on 26 real sparse matrices from various applications show that the tuning system reduces execution time by up to 68.8% (30.9% on average) over a base parallel SpMV algorithm on a 32-node platform.

Current work focuses on creating an integrated framework in which the compiler framework and the tuning system are synergistically combined, so that compiler-translated GPGPU applications are seamlessly adapted and tuned according to the characteristics of the underlying system. Toward this goal, we propose a new programming interface, called OpenMPC (OpenMP extended for CUDA). OpenMPC provides an abstraction of the complex CUDA programming model and offers high-level control over the parameters and optimizations involved. We have developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC, including capabilities for generating, pruning, and navigating the search space of compilation variants.

Our results demonstrate that OpenMPC offers both programmability and tunability; our system achieves 88% of the performance of hand-tuned CUDA programs.
Colloq: Speaker Bio:
Seyong Lee is a Ph.D. student in the School of Electrical and Computer Engineering at Purdue University. His research interests include parallel programming and performance optimization in heterogeneous computing environments, program analysis, and optimizing compilers. He received an M.S. in Electrical and Computer Engineering from Purdue University in 2004.