The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale

Jack Dongarra; Mark Gates; Azzam Haidar; Jakub Kurzak; Piotr Luszczek; Stanimire Tomov; Ichitaro Yamazaki

Submitted by claxton on Wed, 11/28/2018 - 14:17

Title	The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale
Publication Type	Journal Article
Year of Publication	2018
Authors	Dongarra, J., M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki
Journal	SIAM Review
Volume	60
Issue	4
Pagination	808–865
Date Published	2018-11
ISSN	0036-1445
Keywords	bidiagonal matrix, bisection, divide and conquer, Hestenes method, Jacobi method, Kogbetliantz method, MRRR, QR iteration, singular value decomposition, SVD
Abstract	The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EISPACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.
URL	https://epubs.siam.org/doi/10.1137/17M1117732
DOI	10.1137/17M1117732
Short Title	SIAM Rev.


File:

icl-utk-1341-2018.pdf
