- Directive-based GPU programming models are gaining momentum since they transparently relieve programmers from dealing with complex language syntax of low-level GPU programming, which often involves diverse architectural details.
- However, too much abstraction in directive models puts a significant burden on programmers in terms of debugging and performance optimizations.
- To improve debuggability/tunability for the directive-based GPU programming models, we must have a systematic approach that exposes more information to programmers while still maintaining high-level abstraction.
- To facilitate research on other important features, such as resilience and scalability, more intuitive and extensible compiler framework will be need.
Open Accelerator Research Compiler (OpenARC)
OpenARC is a new, open source compiler framework, which provides extensible environment, where various performance optimizations, traceability mechanisms, fault tolerance techniques, etc., can be built for better debuggability/performance/resilience on the complex accelerator computing. OpenARC has the following salient features:
- OpenARC is the first open source compiler supporting full features of OpenACC specification v1.0 and subset of v2.0, which allows full research contexts on directive-based GPU computing.
- OpenARC is designed with extensibility in mind. Built on top of the Cetus compiler infrastructure, OpenARC is equipped with various advanced compile-time analysis and transformation techniques, which provide basic building blocks to easily create more advanced compiler passes. Combined with its very high-level intermediate representation (IR) augmented with a rich set of directives, OpenARC provides a powerful research framework for various source-to-source translation and instrumentation experiments, even for porting various Domain-Specific Languages (DSLs).
- Within OpenARC, various traceability mechanisms can be implemented to keep the connection between input directive models and output codes/performance, because OpenARC has clear separation between analysis passes and transformation passes, and all compiler passes communicate with each other through annotations.
- As an OpenACC directive compiler, OpenARC has various additional directives/environment variables for internal tracing and GPU-specific optimizations. Combined with its built-in tuning tools, OpenARC allows users to control overall OpenACC-to-GPU translation and optimization in a fine-grained, but still abstract manner, offering very high tunability.
Extensible Program Representation in OpenARC
Built on top of the Cetus compiler infrastructure, OpenARC's program representation inherits several of its predecessor's salient features. OpenARC's IR is implemented in the form of a Java class hierarchy, and it provides an Abstract Syntax Tree (AST)-like syntactic view of the input program that makes it easy for compiler-pass writers to analyze and transform the input program. As shown in the above figure, the Program class type represents the entire program that may consist of multiple source files, and the TranslationUnit class type represents each of the source files. Each TranslationUnit object contains IR objects of two base types: Declaration and Statement. Each Declaration object contains Declarator IR objects, and each Statement object consists Expression IR objects. Other IR class types for specific language constructs are derived from these base classes (e.g., DeclarationStatement represents a Statement that contains a Declaration).
This hierarchical IR class structure provides complete data abstraction such that compiler-pass writers need only manipulate the objects through access functions, thus the AST-like view. OpenARC supports the following important features derived from Cetus.
- Traversable IR objects. All OpenARC IR objects extend a base class Traversable, which provides the basic functionality to iterate over IR objects. Combined with various built-in iterators (e.g., BreadthFirst, DepthFirst, Flat, etc.), these provide easy traversal and search over the AST-like, n-ary tree structures of the program representation.
- Rich Annotations. Annotation is a base class type to represent any type of annotations used in OpenARC. By deriving this base class, various types of information, such as comments, directives, raw codes to be inlined, etc., can be added to the OpenARC IRs. An annotation can be associated with any Annotatable IR objects (e.g., OpenMP/OpenACC directives attached to a ForLoop statement) or can be stand-alone like a comment statement.
- Flexible Printing. The printing functions in each IR class type allow flexible rendering of the program representation, depending on the target languages and translation goals.
- Symbol table. OpenARC's symbol table functionality provides information about identifiers and data types by directly accessing the information stored in declaration statements without using separate and redundant symbol table storage.
OpenARC Compilation Flow
The OpenARC compiler consists of the following major passes, each of which provides one or more checkpoints. Within checkpoints, intermediate results can be saved as output codes with annotations. This is useful for manual debugging or for implementing traceability mechanisms.
- Cetus parser calls C preprocessor to handle header files, macro expansion, etc., and converts the preprocessed OpenACC program into an internal representation (OpenARC IR).
- Input preprocessor parses OpenACC directives and performs initial code transformations for later passes, including selective procedure cloning to enable context-sensitive, interprocedural analyses/transformations.
- OpenACC loop-directive preprocessor interprets loop directives, extracts necessary implicit information from the loop constructs and stores them as internal/external annotations, and performs initial loop transformations according to explicit/implicit rules.
- OpenACC analysis checks the correctness of the overall OpenACC directives and derives sharing rules for the data not explicitly specified by programmers.
- User-directive handler interprets additional annotations provided as a separate file, and stores them into IR.
- Optimization pass performs various optimizations such as privatization, reduction recognition, locality analysis, etc. All the optimization results are stored as annotations to inform later transformation passes.
- Transformation pass conducts several pre-transformations according to the results passed from the optimization pass.
- OpenACC-to-Accelerator translation generates output accelerator codes with post-transformations that are possible only at output codes.
OpenARC Tuning Framework
|Overall Tuning Framework. In the figure, input OpenARC code is an output IR from OpenACC annotation parser in the compilation flow.|
The overall tuning process is as follows:
- The search space pruner analyzes an input OpenARC program plus optional user settings, which exist as annotations in the input program, and suggests application tuning parameters.
- The tuning configuration generator builds a search space, further prunes the space using the optimization space setup file if user-provided, and generates tuning configuration files for the given search space.
- For each tuning configuration, the A2G translator generates an output accelerator program.
- The tuning engine procudes executables from the generated accelerator programs and measures the performance of the output programs by running the executables.
- The tuning engine decides a direction to the next search and requests to generate new configurations.
- The last three steps are repeated, as needed.
In the the example tuning framework, a programmer can replace the tuning engine with any custom engine; all the other steps from finding tunable parameters to complex code changes for each tuning configuration are automatically handled by the compilation system in OpenARC.
The following figure shows the preliminary performance of various applications translated by OpenARC:
|Normalized execution time of benchmarks translated by OpenARC and PGI-OpenACC compilers relative to manual CUDA versions
(LUDOpenARC = 31, LUDPGI = 13). Lower is better.
|Performance portability of the OpenACC programming model as obtained by OpenARC|
To cite OpenARC, please use the following papers (you can download bibtex files from each link):
Seyong Lee and Jeffrey S. Vetter, OpenARC: Open Accelerator Research Compiler for Directive-Based, Efficient Heterogeneous Computing, HPDC: International ACM Symposium on High-Performance Parallel and Distributed Computing, Short Paper, June 2014
Seyong Lee and Jeffrey S. Vetter, OpenARC: Extensible OpenACC Compiler Framework for Directive-Based Accelerator Programming Study, WACCPD: Workshop on Accelerator Programming Using Directives in Conjunction with SC'14, November 2014
To cite OpenARC-related research, please use the following papers:
Joel E. Denny, Seyong Lee, and Jeffrey S. Vetter. NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems. International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2016.
Jungwon Kim, Seyong Lee, and Jeffrey S. Vetter. IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism. International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2016.
Seyong Lee, Jungwon Kim, and Jeffrey S. Vetter. OpenACC to FPGA: A Framework for Directive-based High-Performance Reconfigurable Computing, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016
Joel E. Denny, Seyong Lee, and Jeffrey S. Vetter. FITL: Extending LLVM for the Translation of Fault-Injection Directives. LLVM-HPC2: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, 2015
Jungwon Kim, Seyong Lee, and Jeffrey S. Vetter, An OpenACC-based Unified Programming Model for Multi-accelerator Systems, Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Poster, 2015
Seyong Lee, Dong Li, and Jeffrey S. Vetter, Interactive Program Debugging and Optimization for Directive-Based, Efficient GPU Computing, IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2014
Seyong Lee and Jeffrey S. Vetter, Early Evaluation of Directive-Based GPU Programming Models for Productive Exascale Computing, SC12: ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, November 2012
Seyong Lee and Jeffrey S. Vetter, Moving Heterogeneous GPU Computing into the Mainstream with Directive-Based, High-Level Programming Models, DOE Exascale Research Conference,Position Paper, April 2012
- OpenARC is currently in a closed beta; to test the beta version, please contact us.
- Seyong Lee (Email: lees2 AT ornl DOT gov)