Scalable and Automated GPU Kernel Transformations in Production Stencil Applications

Colloq: Speaker: 
Mohamed Wahib
Colloq: Speaker Institution: 
RIKEN Advanced Institute for Computational Science, Japan
Colloq: Date and Time: 
Mon, 2015-06-22 10:00
Colloq: Location: 
Building 5700, Room MS-A106
Colloq: Host: 
Seyong Lee
Colloq: Host Email: 
lees2@ornl.gov
Colloq: Abstract: 
We present a scalable method for exposing and exploiting hidden localities in production GPU stencil applications. Exploiting inter-kernel localities amounts to finding the best permutation of kernel fusions, that is, the one that minimizes redundant memory accesses. To achieve this, we first expose the hidden localities by analyzing inter-kernel data dependencies and order of execution. Next, we use a scalable search heuristic that relies on a lightweight performance model to identify the best candidate kernel fusions. Experiments with two real-world applications demonstrate the effectiveness of manual kernel fusion. To make kernel fusion a practical choice, we further introduce an end-to-end method for automated transformation. A CUDA-to-CUDA transformation collectively replaces the user-written kernels with auto-generated kernels optimized for data reuse. Moreover, the automated method allows us to improve the search process by enabling kernel fission and thread block tuning. We demonstrate the practicality and effectiveness of the proposed end-to-end automated method. With minimal intervention from the user, we improved the performance of six applications, with speedups ranging from 1.12x to 1.76x.
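As a rough illustration of why kernel fusion reduces redundant memory accesses, the sketch below models two dependent stencil kernels in plain Python rather than CUDA. It is not the speaker's method: the kernels smooth and combine, the fused version, and the array_traffic counter are hypothetical names, and the traffic model assumes each array is moved to or from global memory once per kernel that touches it.

```python
# Hypothetical illustration only: kernel names and the traffic model are
# assumptions made for this sketch, not details taken from the talk.


def smooth(a):
    """Kernel 1: 3-point average stencil; boundary values are copied."""
    n = len(a)
    b = list(a)
    for i in range(1, n - 1):
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return b


def combine(a, b):
    """Kernel 2: pointwise update that consumes the output of kernel 1."""
    return [2.0 * b[i] + a[i] for i in range(len(a))]


def fused(a):
    """Fused kernel: the intermediate value never leaves a local variable."""
    n = len(a)
    c = [0.0] * n
    for i in range(n):
        if 0 < i < n - 1:
            tmp = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        else:
            tmp = a[i]  # boundary: same copy rule as smooth()
        c[i] = 2.0 * tmp + a[i]
    return c


def array_traffic(n, fuse):
    """Elements moved to/from global memory under the assumed model."""
    if fuse:
        return 2 * n  # read a, write c
    return 5 * n  # kernel 1: read a, write b; kernel 2: read a, read b, write c
```

Under these assumptions the fused version produces the same result as running the two kernels back to back while moving 2n instead of 5n elements, because the intermediate array b is never written out and a is read only once. Real fusions are constrained by data dependencies, halo regions, and on-chip resources, which is where a search guided by a performance model comes in.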
Colloq: Speaker Bio: 
Mohamed Wahib is currently a postdoctoral researcher in the “HPC Programming Framework Research Team” at RIKEN Advanced Institute for Computational Science (RIKEN AICS). He joined RIKEN AICS in 2012 after years at Hokkaido University, Japan, where he received a Ph.D. in Computer Science in 2012. Prior to his graduate studies, he worked as a researcher at Texas Instruments (TI) R&D for four years.