Sparse Matrix-Matrix Multiplication
on modern multi-GPU systems

Markus Vieth

Outline

  1. The problem
  2. Related work
  3. Methodology
  4. Results
  5. Further work
  6. Discussion

The Problem

SpGEMM

Kolodziej2019
Combinatorial optimization as polynomial equations
heat exchanger network
co-appearances of characters in Les Misérables
Asia OpenStreetMap
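
SpGEMM computes the product of two sparse matrices, C = AB, with A, B, and C all held in a sparse format. Element-wise:

```latex
C_{ij} = \sum_{k} A_{ik} B_{kj}
```

Only indices k where both A_{ik} and B_{kj} are nonzero contribute, so the sparsity pattern (and size) of C is not known in advance; this unknown output structure is the central difficulty of SpGEMM.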

SpGEMM

Srivastava2020
Row-Wise Product
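
The row-wise product formulation (Gustavson's algorithm, the basis of MatRaptor in Srivastava2020) builds each row of C as a sparse combination of rows of B selected by the nonzeros of the corresponding row of A. A minimal CPU sketch over CSR, using std::map as the sparse accumulator (struct and function names are illustrative; GPU implementations replace the map with hash tables or sorting):

```cpp
#include <cassert>
#include <map>
#include <vector>

// Minimal CSR representation (illustrative, not from the talk).
struct Csr {
    int rows = 0, cols = 0;
    std::vector<int> row_ptr;   // size rows + 1
    std::vector<int> col_idx;   // size nnz
    std::vector<double> val;    // size nnz
};

// Row-wise product: row i of C accumulates rows of B,
// each scaled by a nonzero A(i, k) of row i of A.
Csr spgemm_row_wise(const Csr& a, const Csr& b) {
    assert(a.cols == b.rows);
    Csr c;
    c.rows = a.rows;
    c.cols = b.cols;
    c.row_ptr.push_back(0);
    for (int i = 0; i < a.rows; ++i) {
        std::map<int, double> acc;  // sparse accumulator for row i of C
        for (int p = a.row_ptr[i]; p < a.row_ptr[i + 1]; ++p) {
            const int k = a.col_idx[p];
            const double a_ik = a.val[p];
            for (int q = b.row_ptr[k]; q < b.row_ptr[k + 1]; ++q)
                acc[b.col_idx[q]] += a_ik * b.val[q];
        }
        for (const auto& [j, v] : acc) {  // map keeps columns sorted
            c.col_idx.push_back(j);
            c.val.push_back(v);
        }
        c.row_ptr.push_back(static_cast<int>(c.col_idx.size()));
    }
    return c;
}
```

This formulation parallelizes naturally over output rows, which is why it maps well onto GPU thread blocks.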

Multi-GPU systems

Tsu2022
HGX H100, 4 GPUs
HGX H100, 8 GPUs
HGX H100 Network

Use cases

Stueben2001
Calculated airflow on a Mercedes-Benz E-Class
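
In algebraic multigrid (Stueben2001), SpGEMM appears in the setup phase as the Galerkin triple product that builds each coarse-grid operator:

```latex
A_{c} = R \, A \, P, \qquad R = P^{T}
```

Here P interpolates from the coarse grid to the fine grid; since R, A, and P are all sparse, forming A_c amounts to two sparse matrix-matrix multiplications per level.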

Related Work

AC-SpGEMM

Winter2019

Methodology

Setup

  • Package manager: Spack
  • Benchmarking: JUBE
  • Toolchain: CMake
  • Debugging: cuda-gdb
  • Profiling: Nsight Compute and Nsight Systems
  • Machines: Monster1-3 and Elwetritsch (DGX-2)
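
The toolchain above can be tied together with a minimal CMakeLists.txt; a sketch under assumed names (target, source file, and architecture list are illustrative, not from the project):

```cmake
cmake_minimum_required(VERSION 3.18)
project(spgemm_bench LANGUAGES CXX CUDA)

# Build the benchmark executable from a CUDA source file.
add_executable(spgemm_bench src/spgemm.cu)

# Compute capabilities 8.0 (A100) and 9.0 (H100), assuming those targets.
set_target_properties(spgemm_bench PROPERTIES CUDA_ARCHITECTURES "80;90")
target_compile_features(spgemm_bench PRIVATE cxx_std_17)
```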

Workflow

Benchmark

Luehrs2018
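
A JUBE benchmark (Luehrs2018) is driven by an XML (or YAML) description of the parameter space and run steps, which JUBE expands into one run per parameter combination. An illustrative sketch; matrix names, executable, and flags are assumptions:

```xml
<jube>
  <benchmark name="spgemm" outpath="bench_run">
    <!-- Parameter space: JUBE runs every combination. -->
    <parameterset name="matrices">
      <parameter name="matrix">cage12,nlpkkt80</parameter>
      <parameter name="gpus">1,2,4</parameter>
    </parameterset>
    <step name="run">
      <use>matrices</use>
      <do>./spgemm_bench --input $matrix --gpus $gpus</do>
    </step>
  </benchmark>
</jube>
```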

Results

Benchmark

Code

Further Work

Benchmark

Algorithm

Winter2019

Discussion

Results

Results

Sources

article (Zachariadis2020)
Zachariadis, O., Satpute, N., Gómez-Luna, J. and Olivares, J.
Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Computers &amp; Electrical Engineering, Elsevier BV, 2020, Vol. 88, p. 106848
inproceedings (Winter2019)
Winter, M., Mlakar, D., Zayer, R., Seidel, H.-P. and Steinberger, M.
Adaptive sparse matrix-matrix multiplication on the GPU
Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
ACM, 2019
article (Dalton2015)
Dalton, S., Olson, L. and Bell, N.
Optimizing Sparse Matrix-Matrix Multiplication for the GPU
ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM), 2015, Vol. 41(4), pp. 1-20
article (Liu2015)
Liu, W. and Vinter, B.
A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors
Journal of Parallel and Distributed Computing, Elsevier BV, 2015, Vol. 85, pp. 47-61

inproceedings (Anh2016)
Anh, P. N. Q., Fan, R. and Wen, Y.
Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication
Proceedings of the 2016 International Conference on Supercomputing
ACM, 2016
inproceedings (Kunchum2017)
Kunchum, R., Chaudhry, A., Sukumaran-Rajam, A., Niu, Q., Nisa, I. and Sadayappan, P.
On improving performance of sparse matrix-matrix multiplication on GPUs
Proceedings of the International Conference on Supercomputing
ACM, 2017
article (Juenger2020)
Jünger, D., Kobus, R., Müller, A., Hundt, C., Xu, K., Liu, W. and Schmidt, B.
WarpCore: A Library for Fast Hash Tables on GPUs
2020
inproceedings (Koide2018)
Koide, S., Kawano, K. and Kutsuna, T.
Neural Edit Operations for Biological Sequences
Advances in Neural Information Processing Systems
Curran Associates, Inc., 2018, Vol. 31

article (Guo2010)
Guo, P. and Wang, L.
Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
2010 International Conference on Computational and Information Sciences
IEEE, 2010
misc (2020)
Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt
NVIDIA Technical Blog, 2020
misc (2022)
CUTLASS 2.8
2022
article (Pearson2022)
Pearson, C., Javeed, A. and Devine, K.
Machine Learning for CUDA+MPI Design Rules
2022
techreport (Bell2008)
Bell, N. and Garland, M.
Efficient Sparse Matrix-Vector Multiplication on CUDA
NVIDIA Corporation, 2008 (NVR-2008-004)

inproceedings (Gubner2019)
Gubner, T., Tomé, D., Lang, H. and Boncz, P.
Fluid Co-processing
Proceedings of the 15th International Workshop on Data Management on New Hardware - DaMoN'19
ACM Press, 2019
article (Chan2010)
Chan, T. M.
More Algorithms for All-Pairs Shortest Paths in Weighted Graphs
SIAM Journal on Computing, Society for Industrial and Applied Mathematics, 2010, Vol. 39(5), pp. 2075-2089
inproceedings (Gilbert2007)
Gilbert, J. R., Reinhardt, S. and Shah, V. B.
High-Performance Graph Algorithms from Parallel Sparse Matrices
Kågström, B., Elmroth, E., Dongarra, J. & Waśniewski, J. (ed.)
Applied Parallel Computing. State of the Art in Scientific Computing
Springer, 2007, pp. 260-269

inproceedings (Niu2014)
Niu, Q., Lai, P.-W., Faisal, S. M., Parthasarathy, S. and Sadayappan, P.
A fast implementation of MLR-MCL algorithm on multi-core processors
2014 21st International Conference on High Performance Computing (HiPC)
2014, pp. 1-10
article (Buluc2012)
Buluç, A. and Gilbert, J. R.
Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
SIAM Journal on Scientific Computing, Society for Industrial and Applied Mathematics, 2012, Vol. 34(4), pp. C170-C191
article (Lim2019)
Lim, Y., Yu, I., Seo, D., Kang, U. and Sael, L.
PS-MCL: parallel shotgun coarsened Markov clustering of protein interaction networks
BMC Bioinformatics, 2019, Vol. 20(Suppl 13), pp. 381
article (Vassilevska2006)
Vassilevska, V., Williams, R. and Yuster, R.
Finding heaviest H-subgraphs in real weighted graphs, with applications
2006

misc (Olson2015)
Olson, L.
Algebraic Multigrid Methods
2015
article (Xu2016)
Xu, J. and Zikatanov, L. T.
Algebraic Multigrid Methods
2016
inproceedings (Srivastava2020)
Srivastava, N., Jin, H., Liu, J., Albonesi, D. and Zhang, Z.
MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
2020, pp. 766-780
incollection (Shah_2011)
Shah, V. B., Gilbert, J. and Reinhardt, S.
Some Graph Algorithms in an Array-Based Language
Graph Algorithms in the Language of Linear Algebra
Society for Industrial and Applied Mathematics, 2011, pp. 29-43

article (Kolodziej2019)
Kolodziej, S., Aznaveh, M., Bullock, M., David, J., Davis, T., Henderson, M., Hu, Y. and Sandstrom, R.
The SuiteSparse Matrix Collection Website Interface
Journal of Open Source Software, The Open Journal, 2019, Vol. 4(35), pp. 1244
online (Tsu2022)
Tsu, W.
Introducing NVIDIA HGX H100: An Accelerated Server Platform for AI and High-Performance Computing
2022
incollection (King2016)
King, J., Gilray, T., Kirby, R. M. and Might, M.
Dynamic Sparse-Matrix Allocation on GPUs
High Performance Computing
Springer, 2016
techreport (NVIDIA2022)
NVIDIA
NVIDIA H100 Tensor Core GPU Architecture
NVIDIA Corporation, 2022

inproceedings (Parger2020)
Parger, M., Winter, M., Mlakar, D. and Steinberger, M.
spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis
Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Association for Computing Machinery, 2020, pp. 362-375
article (Azad2021)
Azad, A., Selvitopi, O., Hussain, M. T., Gilbert, J. R. and Buluc, A.
Combinatorial BLAS 2.0: Scaling combinatorial algorithms on distributed-memory systems
2021
report (Buluc2018)
Buluç, A.
Large-scale parallel computing for computational genomics
Computational Research Division, LBNL / EECS Department, UC Berkeley, 2018

thesis (Dongen2000)
van Dongen, S. M.
Graph clustering by flow simulation
Utrecht University, 2000
misc (Stueben2001)
Stüben, K.
A review of algebraic multigrid
2001, pp. 281-309
inproceedings (Luehrs2018)
Lührs, S.
Benchmarking and parameter studies with JUBE
NEST Conference, 2018