Sparse Matrix-Matrix Multiplication
on Modern Multi-GPU Systems

Markus Vieth

Outline

  1. What
  2. Why
  3. How
  4. Related Work
  5. SWOT Analysis and Action Plan
  6. Sources

What

  • General sparse matrix-matrix multiplication (SpGEMM) on multiple GPUs

Sparse matrix

Kolodziej2019
Example matrices from the SuiteSparse Matrix Collection:

  • combinatorial optimization as polynomial equations
  • heat exchanger network
  • co-appearances of characters in Les Misérables
  • biochemical network

SpGEMM

Srivastava2020

Inner Product

  • Classic matrix multiplication: C(i, j) is the dot product of row i of A and column j of B
  • Few useful calculations: most row/column pairs share no matching nonzeros
  • Low memory requirements
  • Input should be row major (A) and column major (B); see the sketch below
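A minimal host-side C++ sketch of the inner-product dataflow, assuming A in CSR and B in CSC; the Csr/Csc structs and all names are illustrative, not taken from any library:

```cpp
#include <cstdio>
#include <vector>

// Inner-product SpGEMM sketch: C(i,j) is the dot product of sparse row
// A(i,:) (CSR) and sparse column B(:,j) (CSC), computed by intersecting
// two sorted index lists.
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int rows; };
struct Csc { std::vector<int> ptr, idx; std::vector<double> val; int cols; };

double sparseDot(const Csr& A, int i, const Csc& B, int j) {
    double sum = 0.0;
    int a = A.ptr[i], b = B.ptr[j];
    while (a < A.ptr[i + 1] && b < B.ptr[j + 1]) {
        if (A.idx[a] == B.idx[b])     sum += A.val[a++] * B.val[b++];
        else if (A.idx[a] < B.idx[b]) ++a;  // no partner nonzero: wasted work
        else                          ++b;
    }
    return sum;
}

int main() {
    // A = [[1,0],[0,2]] as CSR, B = [[3,0],[0,4]] as CSC
    Csr A{{0, 1, 2}, {0, 1}, {1.0, 2.0}, 2};
    Csc B{{0, 1, 2}, {0, 1}, {3.0, 4.0}, 2};
    for (int i = 0; i < A.rows; ++i)
        for (int j = 0; j < B.cols; ++j)   // every (i,j) pair is probed
            if (double c = sparseDot(A, i, B, j))
                std::printf("C(%d,%d) = %g\n", i, j, c);
}
```

For sparse inputs, most of the probed (i, j) pairs never find a matching nonzero, which is exactly the "few useful calculations" drawback listed above.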

Outer Product

  • Produces partial sums that must be merged into final entries
  • Only calculates non-zero entries
  • High memory requirements
  • Input should be column major (A) and row major (B); see the sketch below
  • Needs synchronization
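The same conventions give a minimal sketch of the outer-product dataflow, assuming A in CSC and B in CSR; the std::map is only a stand-in for the merge machinery a real implementation would use:

```cpp
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

// Outer-product SpGEMM sketch: column k of A (CSC) times row k of B (CSR)
// yields a rank-1 batch of partial products; every partial product must be
// merged into C afterwards, hence the memory and synchronization cost.
struct Csc { std::vector<int> ptr, idx; std::vector<double> val; int n; };
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int n; };

int main() {
    // A = [[1,0],[0,2]] as CSC, B = [[3,0],[0,4]] as CSR
    Csc A{{0, 1, 2}, {0, 1}, {1.0, 2.0}, 2};
    Csr B{{0, 1, 2}, {0, 1}, {3.0, 4.0}, 2};
    std::map<std::pair<int, int>, double> C;  // accumulator for partial sums
    for (int k = 0; k < A.n; ++k)             // one outer product per k
        for (int a = A.ptr[k]; a < A.ptr[k + 1]; ++a)
            for (int b = B.ptr[k]; b < B.ptr[k + 1]; ++b)
                C[{A.idx[a], B.idx[b]}] += A.val[a] * B.val[b];
    for (const auto& [rc, v] : C)
        std::printf("C(%d,%d) = %g\n", rc.first, rc.second, v);
}
```

On a GPU the k-loop runs in parallel, so concurrent updates to the shared accumulator are what make synchronization necessary.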

Row-Wise Product

  • Produces complete rows
  • Only calculates non-zero entries
  • Low memory requirements
  • Both inputs should be row major (Gustavson's algorithm); see the sketch below
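A minimal sketch of the row-wise (Gustavson) dataflow with both matrices in CSR; the dense accumulator corresponds to the "dense accumulation" strategy listed under "How" below:

```cpp
#include <cstdio>
#include <vector>

// Row-wise (Gustavson) SpGEMM sketch: row i of C is a linear combination of
// the rows of B selected by the nonzeros of A(i,:). A dense accumulator of
// length n produces each output row completely before moving on.
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int rows, cols; };

int main() {
    // A = [[1,0],[0,2]], B = [[3,0],[0,4]], both CSR
    Csr A{{0, 1, 2}, {0, 1}, {1.0, 2.0}, 2, 2};
    Csr B{{0, 1, 2}, {0, 1}, {3.0, 4.0}, 2, 2};
    std::vector<double> acc(B.cols, 0.0);  // dense accumulator ("SPA")
    std::vector<char> seen(B.cols, 0);     // which columns row i touches
    for (int i = 0; i < A.rows; ++i) {
        std::vector<int> touched;
        for (int a = A.ptr[i]; a < A.ptr[i + 1]; ++a) {
            int k = A.idx[a];              // A(i,k) selects row k of B
            for (int b = B.ptr[k]; b < B.ptr[k + 1]; ++b) {
                int j = B.idx[b];
                if (!seen[j]) { seen[j] = 1; touched.push_back(j); }
                acc[j] += A.val[a] * B.val[b];
            }
        }
        for (int j : touched) {            // emit and reset row i of C
            std::printf("C(%d,%d) = %g\n", i, j, acc[j]);
            acc[j] = 0.0;
            seen[j] = 0;
        }
    }
}
```

Because each output row is owned by one worker, no synchronization between rows is needed, which is one reason this dataflow is popular in GPU SpGEMM implementations.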

Column-Wise Product

  • Produces complete columns
  • Only calculates non-zero entries
  • Low memory requirements
  • Both inputs should be column major
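The column-wise product is the exact mirror of the row-wise dataflow: column j of C is a linear combination of the columns of A selected by the nonzeros of B(:,j), so swapping the roles of the two operands in the row-wise sketch above and storing both matrices in CSC yields a column-wise implementation.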

Sparse storage formats

  • Coordinate list
  • Compressed Sparse Row
  • ELLPACK
  • Hybrid
  • Packet
  • Bitmap
King2016

1 0 3 4 0
0 9 5 0 0
0 0 0 7 8
7 2 0 8 6
2 0 1 0 0
Example Matrix
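For comparison with the ELL and HYB layouts below, the same example matrix in the COO and CSR formats listed above (arrays derived by hand):

COO

0 0 0 1 1 2 2 3 3 3 3 4 4
Row Indices

0 2 3 1 2 3 4 0 1 3 4 0 2
Column Indices

1 3 4 9 5 7 8 7 2 8 6 2 1
Values

CSR

0 3 5 7 11 13
Row Pointers

0 2 3 1 2 3 4 0 1 3 4 0 2
Column Indices

1 3 4 9 5 7 8 7 2 8 6 2 1
Values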

ELL

Each matrix row is padded to the length of the longest row (here 4 nonzeros); '*' marks padding. The arrays are shown transposed, one column per matrix row.

0 1 3 0 0
2 2 4 1 2
3 * * 3 *
* * * 4 *
Column Indices

1 9 7 7 2
3 5 8 2 1
4 * * 8 *
* * * 6 *
Values

HYB

The first nonzeros of each row (here up to two) are stored in a fixed-width ELL part; the remaining entries overflow into a COO part.

ELL

0 1 3 0 0
2 2 4 1 2
Column Indices

1 9 7 7 2
3 5 8 2 1
Values

COO

0 3 3
Row Indices

3 3 4
Column Indices

4 8 6
Values

GPU systems

NVIDIA2022

Multi-GPU systems

Tsu2022
HGX H100, 4 GPUs
HGX H100, 8 GPUs
HGX H100 Network

Why

Use cases

  • Graph algorithms
  • Bioinformatics
    • protein complex detection
    • disease module detection
    • gene function prediction

Graph algorithms

Shah_2011
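A standard example of the linear-algebra view of graph algorithms: for the adjacency matrix A of an unweighted graph, SpGEMM yields walk counts,

$$(A^2)_{ij} = \sum_k A_{ik} A_{kj} = \text{number of walks of length two from } i \text{ to } j,$$

and evaluating the same product over the (min, +) semiring turns repeated squaring into an all-pairs shortest-path computation (cf. Chan2010).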

Bioinformatics

Protein Clustering

Dongen2000
Markov Cluster Algorithm (MCL): alternate expansion (matrix squaring, an SpGEMM) and inflation (element-wise powering followed by column re-normalization) until the iteration converges to a clustering.
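Written out (following Dongen2000), one MCL iteration on a column-stochastic matrix $M$ with inflation parameter $r > 1$ is

$$M \leftarrow M \cdot M \quad \text{(expansion, an SpGEMM)}, \qquad (\Gamma_r M)_{ij} = \frac{M_{ij}^{\,r}}{\sum_k M_{kj}^{\,r}} \quad \text{(inflation)},$$

so the expensive step of every iteration is exactly the sparse matrix-matrix product.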

How

  • Expand, sort and compress: materialize all partial products, sort them by (row, column), then reduce duplicates (sketch below)
  • Hashing: accumulate partial products in hash tables keyed by column index (WarpCore sketch below)
  • Merging: merge pre-sorted partial rows into final rows
  • Dense accumulation: accumulate each output row in a dense array of length n, as in the row-wise sketch above
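A minimal C++ sketch of the first strategy, expand, sort and compress; std::sort stands in for the device-wide sort a GPU implementation would use, and all names and data are illustrative:

```cpp
#include <algorithm>
#include <cstdio>
#include <tuple>
#include <vector>

// Expand-sort-compress (ESC) sketch: 1) expand every partial product
// A(i,k)*B(k,j) into a (row, col, val) triple, 2) sort the triples by
// coordinate, 3) compress runs with equal coordinates by summation.
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int rows; };

int main() {
    // A = [[1,2],[0,3]], B = [[0,1],[4,5]], both CSR
    Csr A{{0, 2, 3}, {0, 1, 1}, {1.0, 2.0, 3.0}, 2};
    Csr B{{0, 1, 3}, {1, 0, 1}, {1.0, 4.0, 5.0}, 2};
    std::vector<std::tuple<int, int, double>> t;  // expanded triples
    for (int i = 0; i < A.rows; ++i)
        for (int a = A.ptr[i]; a < A.ptr[i + 1]; ++a)
            for (int b = B.ptr[A.idx[a]]; b < B.ptr[A.idx[a] + 1]; ++b)
                t.emplace_back(i, B.idx[b], A.val[a] * B.val[b]);
    std::sort(t.begin(), t.end());                // order by (row, col)
    for (size_t s = 0; s < t.size(); ) {          // compress equal coordinates
        auto [i, j, v] = t[s];
        for (++s; s < t.size() && std::get<0>(t[s]) == i
                               && std::get<1>(t[s]) == j; ++s)
            v += std::get<2>(t[s]);
        std::printf("C(%d,%d) = %g\n", i, j, v);
    }
}
```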

Hashing with WarpCore

Juenger2020
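WarpCore provides fast hash tables for GPUs (Juenger2020). The CPU-side sketch below only illustrates the role such a table plays in a hashed SpGEMM; std::unordered_map stands in for the device-side table, and nothing here is the actual WarpCore API:

```cpp
#include <cstdio>
#include <unordered_map>
#include <vector>

// Hashed SpGEMM sketch: for each output row i, partial products
// A(i,k)*B(k,j) are inserted into a hash table keyed by the column index j;
// colliding contributions are added in place ("insert-or-add").
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int rows; };

int main() {
    // A = [[1,2],[0,3]], B = [[0,1],[4,5]], both CSR
    Csr A{{0, 2, 3}, {0, 1, 1}, {1.0, 2.0, 3.0}, 2};
    Csr B{{0, 1, 3}, {1, 0, 1}, {1.0, 4.0, 5.0}, 2};
    for (int i = 0; i < A.rows; ++i) {
        std::unordered_map<int, double> row;  // hash accumulator for C(i,:)
        for (int a = A.ptr[i]; a < A.ptr[i + 1]; ++a) {
            int k = A.idx[a];
            for (int b = B.ptr[k]; b < B.ptr[k + 1]; ++b)
                row[B.idx[b]] += A.val[a] * B.val[b];  // insert-or-add
        }
        for (const auto& [j, v] : row)   // entries come out unsorted
            std::printf("C(%d,%d) = %g\n", i, j, v);
    }
}
```

On the GPU the table lives in shared or global memory and updates use atomic operations, which is where a specialized library like WarpCore comes in.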


Adaptive algorithm

Related Work

State of the art on a single GPU

  • cuSPARSE
  • CUSP
  • RMerge2
  • nsparse
  • AC-SpGEMM
  • spECK
  • tSparse

spECK

Parger2020

Procedure used in spECK: a lightweight analysis of the input matrices selects per-row strategies (Parger2020)

CombBLAS 2.0

Azad2021
Using 6 V100 GPUs

SWOT

Strengths

  • Focus on HPC and hardware
  • Experienced seniors
  • Prepared working environment

Weaknesses

  • No working naive GPU implementation yet
  • Missing knowledge of advanced memory management
  • No prior experience with multi-GPU systems

Opportunities

  • New CUDA and C++ compilers
  • New Hopper Features

Threats

  • No significant scaling across multiple GPUs
  • No access to new hardware

Where I am now

  • Unit tests
  • Profiling workflow
  • Prepared benchmark
  • Naive hash-based CPU implementation
  • Single-GPU cuSPARSE implementation
  • Reading COO and CSR files and converting between them

Action plan

Priority A

  • Implement SpGEMM based on WarpCore
  • Profile and improve the implementation
  • Benchmark and compare against the state of the art
  • Document the implementation
  • Keep it compatible with typical workflows

Action plan

Priority B

  • Implement a naive adaptive algorithm
  • Benchmark energy consumption
  • Compare the scaling of different implementations
  • Improve performance with a Bloom filter
  • Use advanced features of the new architectures

Sources

article (Zachariadis2020)
Zachariadis, O., Satpute, N., Gómez-Luna, J. and Olivares, J.
Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Computers & Electrical Engineering, Elsevier BV, 2020, Vol. 88, pp. 106848
inproceedings (Winter2019)
Winter, M., Mlakar, D., Zayer, R., Seidel, H.-P. and Steinberger, M.
Adaptive sparse matrix-matrix multiplication on the GPU
Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
ACM, 2019
article (Dalton2015)
Dalton, S., Olson, L. and Bell, N.
Optimizing Sparse Matrix-Matrix Multiplication for the GPU
ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM), 2015, Vol. 41(4), pp. 1-20
article (Liu2015)
Liu, W. and Vinter, B.
A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors
Journal of Parallel and Distributed Computing, Elsevier BV, 2015, Vol. 85, pp. 47-61

Sources

inproceedings (Anh2016)
Anh, P. N. Q., Fan, R. and Wen, Y.
Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication
Proceedings of the 2016 International Conference on Supercomputing
ACM, 2016
inproceedings (Kunchum2017)
Kunchum, R., Chaudhry, A., Sukumaran-Rajam, A., Niu, Q., Nisa, I. and Sadayappan, P.
On improving performance of sparse matrix-matrix multiplication on GPUs
Proceedings of the International Conference on Supercomputing
ACM, 2017
article (Juenger2020)
Jünger, D., Kobus, R., Müller, A., Hundt, C., Xu, K., Liu, W. and Schmidt, B.
WarpCore: A Library for Fast Hash Tables on GPUs
2020
inproceedings (Koide2018)
Koide, S., Kawano, K. and Kutsuna, T.
Neural Edit Operations for Biological Sequences
Advances in Neural Information Processing Systems
Curran Associates, Inc., 2018, Vol. 31

Sources

article (Guo2010)
Guo, P. and Wang, L.
Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
2010 International Conference on Computational and Information Sciences
IEEE, 2010
misc (2020)
Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt
NVIDIA Technical Blog, 2020
misc (2022)
CUTLASS 2.8
2022
article (Pearson2022)
Pearson, C., Javeed, A. and Devine, K.
Machine Learning for CUDA+MPI Design Rules
2022
techreport (Bell2008)
Bell, N. and Garland, M.
Efficient Sparse Matrix-Vector Multiplication on CUDA
NVIDIA Corporation, 2008 (NVR-2008-004)

Sources

inproceedings (Gubner2019)
Gubner, T., Tomé, D., Lang, H. and Boncz, P.
Fluid Co-processing
Proceedings of the 15th International Workshop on Data Management on New Hardware - DaMoN'19
ACM Press, 2019
article (Chan2010)
Chan, T. M.
More Algorithms for All-Pairs Shortest Paths in Weighted Graphs
SIAM Journal on Computing, Society for Industrial and Applied Mathematics, 2010, Vol. 39(5), pp. 2075-2089
inproceedings (Gilbert2007)
Gilbert, J. R., Reinhardt, S. and Shah, V. B.
High-Performance Graph Algorithms from Parallel Sparse Matrices
Kågström, B., Elmroth, E., Dongarra, J. & Waśniewski, J. (ed.)
Applied Parallel Computing. State of the Art in Scientific Computing
Springer, 2007, pp. 260-269

Sources

inproceedings (Niu2014)
Niu, Q., Lai, P.-W., Faisal, S. M., Parthasarathy, S. and Sadayappan, P.
A fast implementation of MLR-MCL algorithm on multi-core processors
2014 21st International Conference on High Performance Computing (HiPC)
2014, pp. 1-10
article (Buluc2012)
Buluç, A. and Gilbert, J. R.
Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
SIAM Journal on Scientific Computing, Society for Industrial and Applied Mathematics, 2012, Vol. 34(4), pp. C170-C191
article (Lim2019)
Lim, Y., Yu, I., Seo, D., Kang, U. and Sael, L.
PS-MCL: parallel shotgun coarsened Markov clustering of protein interaction networks
BMC Bioinformatics, 2019, Vol. 20(Suppl 13), pp. 381
article (Vassilevska2006)
Vassilevska, V., Williams, R. and Yuster, R.
Finding heaviest H-subgraphs in real weighted graphs, with applications
2006

Sources

misc (Olson2015)
Olson, L.
Algebraic Multigrid Methods
2015
article (Xu2016)
Xu, J. and Zikatanov, L. T.
Algebraic Multigrid Methods
2016
inproceedings (Srivastava2020)
Srivastava, N., Jin, H., Liu, J., Albonesi, D. and Zhang, Z.
MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
2020, pp. 766-780
incollection (Shah_2011)
Shah, V. B., Gilbert, J. and Reinhardt, S.
Some Graph Algorithms in an Array-Based Language
Graph Algorithms in the Language of Linear Algebra
Society for Industrial and Applied Mathematics, 2011, pp. 29-43

Sources

article (Kolodziej2019)
Kolodziej, S., Aznaveh, M., Bullock, M., David, J., Davis, T., Henderson, M., Hu, Y. and Sandstrom, R.
The SuiteSparse Matrix Collection Website Interface
Journal of Open Source Software, The Open Journal, 2019, Vol. 4(35), pp. 1244
online (Tsu2022)
Tsu, W.
Introducing NVIDIA HGX H100: An Accelerated Server Platform for AI and High-Performance Computing
2022
incollection (King2016)
King, J., Gilray, T., Kirby, R. M. and Might, M.
Dynamic Sparse-Matrix Allocation on GPUs
High Performance Computing
Springer, 2016
techreport (NVIDIA2022)
NVIDIA
NVIDIA H100 Tensor Core GPU Architecture
NVIDIA Corporation, 2022

Sources

inproceedings (Parger2020)
Parger, M., Winter, M., Mlakar, D. and Steinberger, M.
spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis
Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Association for Computing Machinery, 2020, pp. 362–375
article (Azad2021)
Azad, A., Selvitopi, O., Hussain, M. T., Gilbert, J. R. and Buluc, A.
Combinatorial BLAS 2.0: Scaling combinatorial algorithms on distributed-memory systems
2021
report (Buluc2018)
Buluç, A.
Large-scale parallel computing for computational genomics
Computational Research Division, LBNL / EECS Department, UC Berkeley, 2018
thesis (Dongen2000)
Dongen, S.M. van
Graph clustering by flow simulation
Utrecht University, 2000