Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices

Title	Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices
Publication Type	Conference Paper
Year of Publication	2017
Authors	Gates, M., J. Kurzak, P. Luszczek, Y. Pei, and J. Dongarra
Conference Name	Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Date Published	2017-06
Publisher	IEEE
Conference Location	Orlando, FL
Keywords	batch computation, Cholesky factorization, data layout, GPU computing, numerical linear algebra
Abstract	Batch matrix operations address the case of solving the same linear algebra problem for a very large number of very small matrices. In this paper, we focus on implementing the batch Cholesky factorization in CUDA, in single precision arithmetic, for NVIDIA GPUs. Specifically, we look into the benefits of using noncanonical data layouts, where consecutive memory locations store elements with the same row and column index in a set of consecutive matrices. We discuss a number of different implementation options and tuning parameters. We demonstrate superior performance to traditional implementations for the case of very small matrices.
DOI	10.1109/IPDPSW.2017.18

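Illustration of the interleaved layout described in the abstract. The following is a minimal host-side C++ sketch, not the paper's CUDA implementation. It assumes square n-by-n matrices stored so that element (i, j) of consecutive matrices occupies consecutive memory locations, and it applies a plain unblocked Cholesky factorization to the whole batch. The names interleaved_index and batch_cholesky_interleaved are hypothetical, and the column-major ordering of (i, j) is an assumption made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: offset of element (i, j) of matrix b in an interleaved
// batch of n-by-n matrices. The matrix index b varies fastest, so the same
// (i, j) entry of consecutive matrices sits in consecutive memory locations.
inline std::size_t interleaved_index(std::size_t i, std::size_t j, std::size_t b,
                                     std::size_t n, std::size_t batch) {
    return (j * n + i) * batch + b;
}

// Unblocked lower Cholesky factorization (A = L * L^T), applied in place to
// every matrix of the batch. Only the lower triangle is read and written.
// Returns false as soon as a non-positive pivot is found in any matrix.
bool batch_cholesky_interleaved(std::vector<float>& a, std::size_t n,
                                std::size_t batch) {
    assert(a.size() == n * n * batch);
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal entry of column j, for each matrix in the batch.
        for (std::size_t b = 0; b < batch; ++b) {
            float d = a[interleaved_index(j, j, b, n, batch)];
            for (std::size_t k = 0; k < j; ++k) {
                const float l = a[interleaved_index(j, k, b, n, batch)];
                d -= l * l;
            }
            if (!(d > 0.0f)) return false;  // not positive definite
            a[interleaved_index(j, j, b, n, batch)] = std::sqrt(d);
        }
        // Entries below the diagonal in column j. The loop over b is innermost,
        // so it walks contiguous memory; on a GPU, one thread per matrix would
        // turn these accesses into coalesced loads and stores.
        for (std::size_t i = j + 1; i < n; ++i) {
            for (std::size_t b = 0; b < batch; ++b) {
                float s = a[interleaved_index(i, j, b, n, batch)];
                for (std::size_t k = 0; k < j; ++k) {
                    s -= a[interleaved_index(i, k, b, n, batch)] *
                         a[interleaved_index(j, k, b, n, batch)];
                }
                a[interleaved_index(i, j, b, n, batch)] =
                    s / a[interleaved_index(j, j, b, n, batch)];
            }
        }
    }
    return true;
}
```

In this sketch the loop over the batch index plays the role that a thread index would play in a CUDA kernel. In the canonical layout, where each matrix is stored contiguously, the same loop would stride through memory by n*n elements per step.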