Sparse Matrix-Matrix Multiplication
on modern multi-GPU systems

Markus Vieth

Outline

  1. The problem
  2. Related work
  3. Methodology
  4. Results
  5. Further work
  6. Discussion

The Problem

SpGEMM

Kolodziej2019
Combinatorial optimization as polynomial equations
heat exchanger network
co-appearances of characters in Les Misérables
Asia OpenStreetMap
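
SpGEMM computes the product of two sparse matrices, C = AB, with A, B, and C all held in a sparse format. Element-wise:

```latex
C_{ij} = \sum_{k} A_{ik} B_{kj}
```

Only indices k where both A_{ik} and B_{kj} are nonzero contribute, so the sparsity pattern (and size) of C is not known in advance; this unknown output structure is the central difficulty of SpGEMM.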

SpGEMM

Srivastava2020
Row-Wise Product
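
The row-wise product formulation (Gustavson's algorithm, the basis of MatRaptor in Srivastava2020) builds each row of C as a sparse combination of rows of B selected by the nonzeros of the corresponding row of A. A minimal CPU sketch over CSR, using std::map as the sparse accumulator (struct and function names are illustrative; GPU implementations replace the map with hash tables or sorting):

```cpp
#include <cassert>
#include <map>
#include <vector>

// Minimal CSR representation (illustrative, not from the talk).
struct Csr {
    int rows = 0, cols = 0;
    std::vector<int> row_ptr;   // size rows + 1
    std::vector<int> col_idx;   // size nnz
    std::vector<double> val;    // size nnz
};

// Row-wise product: row i of C accumulates rows of B,
// each scaled by a nonzero A(i, k) of row i of A.
Csr spgemm_row_wise(const Csr& a, const Csr& b) {
    assert(a.cols == b.rows);
    Csr c;
    c.rows = a.rows;
    c.cols = b.cols;
    c.row_ptr.push_back(0);
    for (int i = 0; i < a.rows; ++i) {
        std::map<int, double> acc;  // sparse accumulator for row i of C
        for (int p = a.row_ptr[i]; p < a.row_ptr[i + 1]; ++p) {
            const int k = a.col_idx[p];
            const double a_ik = a.val[p];
            for (int q = b.row_ptr[k]; q < b.row_ptr[k + 1]; ++q)
                acc[b.col_idx[q]] += a_ik * b.val[q];
        }
        for (const auto& [j, v] : acc) {  // map keeps columns sorted
            c.col_idx.push_back(j);
            c.val.push_back(v);
        }
        c.row_ptr.push_back(static_cast<int>(c.col_idx.size()));
    }
    return c;
}
```

This formulation parallelizes naturally over output rows, which is why it maps well onto GPU thread blocks.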

Multi-GPU systems

Tsu2022
HGX H100, 4 GPUs
HGX H100, 8 GPUs
HGX H100 Network

Use cases

Stueben2001
Calculated airflow on a Mercedes-Benz E-Class
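
In algebraic multigrid (Stueben2001), SpGEMM appears in the setup phase as the Galerkin triple product that builds each coarse-grid operator:

```latex
A_{c} = R \, A \, P, \qquad R = P^{T}
```

Here P interpolates from the coarse grid to the fine grid; since R, A, and P are all sparse, forming A_c amounts to two sparse matrix-matrix multiplications per level.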

Related Work

AC-SpGEMM

Winter2019

Methodology

Setup

  • Package manager: Spack
  • Benchmarking: JUBE
  • Toolchain: CMake
  • Debugging: cuda-gdb
  • Profiling: Nsight Compute and Nsight Systems
  • Machines: Monster1-3 and Elwetritsch (DGX-2)
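
The toolchain above can be tied together with a minimal CMakeLists.txt; a sketch under assumed names (target, source file, and architecture list are illustrative, not from the project):

```cmake
cmake_minimum_required(VERSION 3.18)
project(spgemm_bench LANGUAGES CXX CUDA)

# Build the benchmark executable from a CUDA source file.
add_executable(spgemm_bench src/spgemm.cu)

# Compute capabilities 8.0 (A100) and 9.0 (H100), assuming those targets.
set_target_properties(spgemm_bench PROPERTIES CUDA_ARCHITECTURES "80;90")
target_compile_features(spgemm_bench PRIVATE cxx_std_17)
```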

Workflow

Benchmark

Luehrs2018
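
A JUBE benchmark (Luehrs2018) is driven by an XML (or YAML) description of the parameter space and run steps, which JUBE expands into one run per parameter combination. An illustrative sketch; matrix names, executable, and flags are assumptions:

```xml
<jube>
  <benchmark name="spgemm" outpath="bench_run">
    <!-- Parameter space: JUBE runs every combination. -->
    <parameterset name="matrices">
      <parameter name="matrix">cage12,nlpkkt80</parameter>
      <parameter name="gpus">1,2,4</parameter>
    </parameterset>
    <step name="run">
      <use>matrices</use>
      <do>./spgemm_bench --input $matrix --gpus $gpus</do>
    </step>
  </benchmark>
</jube>
```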

Results

Benchmark

Code

Further Work

Benchmark

Algorithm

Winter2019

Discussion

Results

Results

Sources

article (Zachariadis2020)
Zachariadis, O., Satpute, N., Gómez-Luna, J. and Olivares, J.
Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Computers &amp; Electrical Engineering, Elsevier BV, 2020, Vol. 88, p. 106848
inproceedings (Winter2019)
Winter, M., Mlakar, D., Zayer, R., Seidel, H.-P. and Steinberger, M.
Adaptive sparse matrix-matrix multiplication on the GPU
Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
ACM, 2019
article (Dalton2015)
Dalton, S., Olson, L. and Bell, N.
Optimizing Sparse Matrix-Matrix Multiplication for the GPU
ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM), 2015, Vol. 41(4), pp. 1-20
article (Liu2015)
Liu, W. and Vinter, B.
A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors
Journal of Parallel and Distributed Computing, Elsevier BV, 2015, Vol. 85, pp. 47-61

inproceedings (Anh2016)
Anh, P. N. Q., Fan, R. and Wen, Y.
Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication
Proceedings of the 2016 International Conference on Supercomputing
ACM, 2016
inproceedings (Kunchum2017)
Kunchum, R., Chaudhry, A., Sukumaran-Rajam, A., Niu, Q., Nisa, I. and Sadayappan, P.
On improving performance of sparse matrix-matrix multiplication on GPUs
Proceedings of the International Conference on Supercomputing
ACM, 2017
article (Juenger2020)
Jünger, D., Kobus, R., Müller, A., Hundt, C., Xu, K., Liu, W. and Schmidt, B.
WarpCore: A Library for Fast Hash Tables on GPUs
2020
inproceedings (Koide2018)
Koide, S., Kawano, K. and Kutsuna, T.
Neural Edit Operations for Biological Sequences
Advances in Neural Information Processing Systems
Curran Associates, Inc., 2018, Vol. 31

article (Guo2010)
Guo, P. and Wang, L.
Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
2010 International Conference on Computational and Information Sciences
IEEE, 2010
misc (2020)
Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt
NVIDIA Technical Blog, 2020
misc (2022)
CUTLASS 2.8
2022
article (Pearson2022)
Pearson, C., Javeed, A. and Devine, K.
Machine Learning for CUDA+MPI Design Rules
2022
techreport (Bell2008)
Bell, N. and Garland, M.
Efficient Sparse Matrix-Vector Multiplication on CUDA
NVIDIA Corporation, 2008 (NVR-2008-004)

inproceedings (Gubner2019)
Gubner, T., Tomé, D., Lang, H. and Boncz, P.
Fluid Co-processing
Proceedings of the 15th International Workshop on Data Management on New Hardware - DaMoN'19
ACM Press, 2019
article (Chan2010)
Chan, T. M.
More Algorithms for All-Pairs Shortest Paths in Weighted Graphs
SIAM Journal on Computing, Society for Industrial and Applied Mathematics, 2010, Vol. 39(5), pp. 2075-2089
inproceedings (Gilbert2007)
Gilbert, J. R., Reinhardt, S. and Shah, V. B.
High-Performance Graph Algorithms from Parallel Sparse Matrices
Kågström, B., Elmroth, E., Dongarra, J. & Waśniewski, J. (ed.)
Applied Parallel Computing. State of the Art in Scientific Computing
Springer, 2007, pp. 260-269

inproceedings (Niu2014)
Niu, Q., Lai, P.-W., Faisal, S. M., Parthasarathy, S. and Sadayappan, P.
A fast implementation of MLR-MCL algorithm on multi-core processors
2014 21st International Conference on High Performance Computing (HiPC)
2014, pp. 1-10
article (Buluc2012)
Buluç, A. and Gilbert, J. R.
Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
SIAM Journal on Scientific Computing, Society for Industrial and Applied Mathematics, 2012, Vol. 34(4), pp. C170-C191
article (Lim2019)
Lim, Y., Yu, I., Seo, D., Kang, U. and Sael, L.
PS-MCL: parallel shotgun coarsened Markov clustering of protein interaction networks
BMC Bioinformatics, 2019, Vol. 20(Suppl 13), pp. 381
article (Vassilevska2006)
Vassilevska, V., Williams, R. and Yuster, R.
Finding heaviest H-subgraphs in real weighted graphs, with applications
2006

misc (Olson2015)
Olson, L.
Algebraic Multigrid Methods
2015
article (Xu2016)
Xu, J. and Zikatanov, L. T.
Algebraic Multigrid Methods
2016
inproceedings (Srivastava2020)
Srivastava, N., Jin, H., Liu, J., Albonesi, D. and Zhang, Z.
MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
2020, pp. 766-780
incollection (Shah_2011)
Shah, V. B., Gilbert, J. and Reinhardt, S.
Some Graph Algorithms in an Array-Based Language
Graph Algorithms in the Language of Linear Algebra
Society for Industrial and Applied Mathematics, 2011, pp. 29-43

article (Kolodziej2019)
Kolodziej, S., Aznaveh, M., Bullock, M., David, J., Davis, T., Henderson, M., Hu, Y. and Sandstrom, R.
The SuiteSparse Matrix Collection Website Interface
Journal of Open Source Software, The Open Journal, 2019, Vol. 4(35), pp. 1244
online (Tsu2022)
Tsu, W.
Introducing NVIDIA HGX H100: An Accelerated Server Platform for AI and High-Performance Computing
2022
incollection (King2016)
King, J., Gilray, T., Kirby, R. M. and Might, M.
Dynamic Sparse-Matrix Allocation on GPUs
High Performance Computing
Springer, 2016
techreport (NVIDIA2022)
NVIDIA
NVIDIA H100 Tensor Core GPU Architecture
NVIDIA Corporation, 2022

inproceedings (Parger2020)
Parger, M., Winter, M., Mlakar, D. and Steinberger, M.
spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis
Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Association for Computing Machinery, 2020, pp. 362-375
article (Azad2021)
Azad, A., Selvitopi, O., Hussain, M. T., Gilbert, J. R. and Buluc, A.
Combinatorial BLAS 2.0: Scaling combinatorial algorithms on distributed-memory systems
2021
report (Buluc2018)
Buluç, A.
Large-scale parallel computing for computational genomics
Computational Research Division, LBNL / EECS Department, UC Berkeley, 2018

thesis (Dongen2000)
van Dongen, S. M.
Graph clustering by flow simulation
Utrecht University, 2000
misc (Stueben2001)
Stüben, K.
A review of algebraic multigrid
2001, pp. 281-309
inproceedings (Luehrs2018)
Lührs, S.
Benchmarking and parameter studies with JUBE
NEST Conference, 2018