Matrix Multiplication on Batches of Small Matrices in Half and Half-Complex Precisions

Title: Matrix Multiplication on Batches of Small Matrices in Half and Half-Complex Precisions
Publication Type: Journal Article
Year of Publication: 2020
Authors: Abdelfattah, A., S. Tomov, and J. Dongarra
Journal: Journal of Parallel and Distributed Computing
Volume: 145
Pagination: 188-201
Date Published: 2020-11
Abstract

Machine learning and artificial intelligence (AI) applications often rely on performing many small matrix operations, in particular the general matrix-matrix multiplication (GEMM). These operations are usually performed in a reduced precision, such as the 16-bit floating-point format (i.e., half precision or FP16). The GEMM operation is also central to dense linear algebra algorithms, and half-precision GEMM operations can be used in mixed-precision linear solvers. High-performance batched GEMM in reduced precision is therefore important not only for deep learning frameworks, but also for scientific applications that rely on batched linear algebra, such as tensor contractions and sparse direct solvers.
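
For illustration, the sketch below shows what a batched FP16 GEMM call looks like through cuBLAS, the vendor library that the paper benchmarks against. This is not the paper's kernel; cublasHgemmBatched is an existing cuBLAS routine, while the wrapper name, the square m-by-m shapes, and the unit alpha/beta values are arbitrary choices made for this example.

    // Minimal sketch (illustration only, not the paper's kernels) of a
    // batched FP16 GEMM through cuBLAS: C[i] = alpha * A[i] * B[i] + beta * C[i]
    // for a batch of small m x m matrices already resident on the GPU.
    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // d_Aarray, d_Barray, d_Carray are device arrays of device pointers,
    // one pointer per matrix in the batch.
    void batched_hgemm(cublasHandle_t handle,
                       const __half *const d_Aarray[],
                       const __half *const d_Barray[],
                       __half *const d_Carray[],
                       int m, int batch_count) {
        const __half alpha = __float2half(1.0f);
        const __half beta  = __float2half(0.0f);
        cublasHgemmBatched(handle,
                           CUBLAS_OP_N, CUBLAS_OP_N,
                           m, m, m,          // square m x m matrices (example)
                           &alpha,
                           d_Aarray, m,      // leading dimension = m
                           d_Barray, m,
                           &beta,
                           d_Carray, m,
                           batch_count);
    }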

This paper presents optimized batched GEMM kernels for graphics processing units (GPUs) in FP16 arithmetic, addressing both real and complex half-precision computations. The proposed design takes advantage of the Tensor Core technology that was recently introduced in CUDA-enabled GPUs. With eight tuning parameters introduced in the design, the developed kernels have a high degree of flexibility that overcomes the limitations imposed by the hardware and software (in the form of discrete configurations for the Tensor Core APIs). For real FP16 arithmetic, performance speedups over cuBLAS are observed for sizes up to 128. For the complex FP16 GEMM kernel, speedups are achieved thanks to a design that uses the standard interleaved matrix layout, in contrast with the planar layout required by the vendor’s solution. The paper also discusses special optimizations for extremely small matrices, where even higher performance gains are achievable.
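
To illustrate the "discrete configurations" constraint mentioned above, the following sketch uses the CUDA WMMA API to have one warp compute a single 16x16x16 Tensor Core tile in FP16. It is a generic illustration of the API's fixed fragment shapes, not the paper's kernel design; the kernel name, the row-major layouts, and the assumption that K is a multiple of 16 are choices made for this example only.

    // Minimal sketch of the CUDA WMMA Tensor Core API: one warp computes one
    // 16x16 tile of C = A*B for row-major FP16 inputs and an FP16 accumulator.
    // Fragment shapes are restricted to a few discrete (M, N, K) combinations.
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void wmma_tile_16x16x16(const half *A, const half *B, half *C,
                                       int lda, int ldb, int ldc, int K) {
        // Fixed-shape register tiles ("fragments") required by the API.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

        wmma::fill_fragment(c_frag, __float2half(0.0f));

        // Accumulate over the K dimension in steps of 16 (K assumed to be a
        // multiple of 16 in this example).
        for (int k = 0; k < K; k += 16) {
            wmma::load_matrix_sync(a_frag, A + k, lda);
            wmma::load_matrix_sync(b_frag, B + k * ldb, ldb);
            wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        }
        wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
    }

The eight tuning parameters described in the abstract are presented as a way to gain flexibility around exactly this kind of fixed-shape constraint when the batched matrices are small or oddly sized.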

DOI: 10.1016/j.jpdc.2020.07.001