Evaluation of Directive-Based Performance Portable Programming Models

TitleEvaluation of Directive-Based Performance Portable Programming Models
Publication TypeJournal Article
Year of Publication2019
AuthorsLopez, M. G., W. Joubert, V. Larrea, O. Hernandez, A. Haidar, S. Tomov, and J. Dongarra
JournalInternational Journal of High Performance Computing and Networking
Volume14
Issue2
Pagination165-182
Date Published2019–07
KeywordsOpenACC, OpenMP 4, performance portability, Programming models
Abstract

We present an extended exploration of the performance portability of directives provided by OpenMP 4 and OpenACC to program various types of node architecture with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are for moving codes between architectures, and we document how much tuning might be required and what lessons we can learn from these experiences. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. To better understand fundamental compute vs. bandwidth bound characteristics, we add the compute-bound Level 3 BLAS GEMM kernel to our linear algebra evaluation. We implement the kernels of interest using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including both x86_64 and Power8 with attached NVIDIA GPUs, x86_64 multicores, self-hosted Intel Xeon Phi KNL, as well as an x86_64 host system with Intel Xeon Phi coprocessors. We update these evaluations with the newest version of the NVIDIA Pascal architecture (P100), Intel KNL 7230, Power8+, and the newest supporting compiler implementations. Furthermore, we present in detail what factors affected the performance portability, including how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimise and target multiple platforms.

DOI10.1504/IJHPCN.2017.10009064
External Publication Flag: