Sparse Matrix-Matrix Multiplication
on Modern Multi-GPU Systems

Markus Vieth

Outline

  1. What
  2. Why
  3. How
  4. Related Work
  5. SWOT Analysis and Action Plan
  6. Sources

What

  • General sparse matrix-matrix multiplication (SpGEMM) on multiple GPUs

Sparse matrix

Kolodziej2019
Example matrices from the SuiteSparse Matrix Collection:

  • combinatorial optimization as polynomial equations
  • heat exchanger network
  • co-appearances of characters in Les Misérables
  • biochemical network

SpGEMM

Srivastava2020

Inner Product

  • Classic matrix multiplication: C(i, j) is the dot product of row i of A and column j of B
  • Few useful calculations: most row/column pairs share no matching nonzeros
  • Low memory requirements
  • Input should be row major (A) and column major (B); see the sketch below
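A minimal host-side C++ sketch of the inner-product dataflow, assuming A in CSR and B in CSC; the Csr/Csc structs and all names are illustrative, not taken from any library:

```cpp
#include <cstdio>
#include <vector>

// Inner-product SpGEMM sketch: C(i,j) is the dot product of sparse row
// A(i,:) (CSR) and sparse column B(:,j) (CSC), computed by intersecting
// two sorted index lists.
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int rows; };
struct Csc { std::vector<int> ptr, idx; std::vector<double> val; int cols; };

double sparseDot(const Csr& A, int i, const Csc& B, int j) {
    double sum = 0.0;
    int a = A.ptr[i], b = B.ptr[j];
    while (a < A.ptr[i + 1] && b < B.ptr[j + 1]) {
        if (A.idx[a] == B.idx[b])     sum += A.val[a++] * B.val[b++];
        else if (A.idx[a] < B.idx[b]) ++a;  // no partner nonzero: wasted work
        else                          ++b;
    }
    return sum;
}

int main() {
    // A = [[1,0],[0,2]] as CSR, B = [[3,0],[0,4]] as CSC
    Csr A{{0, 1, 2}, {0, 1}, {1.0, 2.0}, 2};
    Csc B{{0, 1, 2}, {0, 1}, {3.0, 4.0}, 2};
    for (int i = 0; i < A.rows; ++i)
        for (int j = 0; j < B.cols; ++j)   // every (i,j) pair is probed
            if (double c = sparseDot(A, i, B, j))
                std::printf("C(%d,%d) = %g\n", i, j, c);
}
```

For sparse inputs, most of the probed (i, j) pairs never find a matching nonzero, which is exactly the "few useful calculations" drawback listed above.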

Outer Product

  • Produces partial sums that must be merged into final entries
  • Only calculates non-zero entries
  • High memory requirements
  • Input should be column major (A) and row major (B); see the sketch below
  • Needs synchronization
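The same conventions give a minimal sketch of the outer-product dataflow, assuming A in CSC and B in CSR; the std::map is only a stand-in for the merge machinery a real implementation would use:

```cpp
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

// Outer-product SpGEMM sketch: column k of A (CSC) times row k of B (CSR)
// yields a rank-1 batch of partial products; every partial product must be
// merged into C afterwards, hence the memory and synchronization cost.
struct Csc { std::vector<int> ptr, idx; std::vector<double> val; int n; };
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int n; };

int main() {
    // A = [[1,0],[0,2]] as CSC, B = [[3,0],[0,4]] as CSR
    Csc A{{0, 1, 2}, {0, 1}, {1.0, 2.0}, 2};
    Csr B{{0, 1, 2}, {0, 1}, {3.0, 4.0}, 2};
    std::map<std::pair<int, int>, double> C;  // accumulator for partial sums
    for (int k = 0; k < A.n; ++k)             // one outer product per k
        for (int a = A.ptr[k]; a < A.ptr[k + 1]; ++a)
            for (int b = B.ptr[k]; b < B.ptr[k + 1]; ++b)
                C[{A.idx[a], B.idx[b]}] += A.val[a] * B.val[b];
    for (const auto& [rc, v] : C)
        std::printf("C(%d,%d) = %g\n", rc.first, rc.second, v);
}
```

On a GPU the k-loop runs in parallel, so concurrent updates to the shared accumulator are what make synchronization necessary.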

Row-Wise Product

  • Produces complete rows
  • Only calculates non-zero entries
  • Low memory requirements
  • Both inputs should be row major (Gustavson's algorithm); see the sketch below
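A minimal sketch of the row-wise (Gustavson) dataflow with both matrices in CSR; the dense accumulator corresponds to the "dense accumulation" strategy listed under "How" below:

```cpp
#include <cstdio>
#include <vector>

// Row-wise (Gustavson) SpGEMM sketch: row i of C is a linear combination of
// the rows of B selected by the nonzeros of A(i,:). A dense accumulator of
// length n produces each output row completely before moving on.
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int rows, cols; };

int main() {
    // A = [[1,0],[0,2]], B = [[3,0],[0,4]], both CSR
    Csr A{{0, 1, 2}, {0, 1}, {1.0, 2.0}, 2, 2};
    Csr B{{0, 1, 2}, {0, 1}, {3.0, 4.0}, 2, 2};
    std::vector<double> acc(B.cols, 0.0);  // dense accumulator ("SPA")
    std::vector<char> seen(B.cols, 0);     // which columns row i touches
    for (int i = 0; i < A.rows; ++i) {
        std::vector<int> touched;
        for (int a = A.ptr[i]; a < A.ptr[i + 1]; ++a) {
            int k = A.idx[a];              // A(i,k) selects row k of B
            for (int b = B.ptr[k]; b < B.ptr[k + 1]; ++b) {
                int j = B.idx[b];
                if (!seen[j]) { seen[j] = 1; touched.push_back(j); }
                acc[j] += A.val[a] * B.val[b];
            }
        }
        for (int j : touched) {            // emit and reset row i of C
            std::printf("C(%d,%d) = %g\n", i, j, acc[j]);
            acc[j] = 0.0;
            seen[j] = 0;
        }
    }
}
```

Because each output row is owned by one worker, no synchronization between rows is needed, which is one reason this dataflow is popular in GPU SpGEMM implementations.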

Column-Wise Product

  • Produces complete columns
  • Only calculates non-zero entries
  • Low memory requirements
  • Both inputs should be column major
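The column-wise product is the exact mirror of the row-wise dataflow: column j of C is a linear combination of the columns of A selected by the nonzeros of B(:,j), so swapping the roles of the two operands in the row-wise sketch above and storing both matrices in CSC yields a column-wise implementation.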

Sparse storage formats

  • Coordinate list
  • Compressed Sparse Row
  • ELLPACK
  • Hybrid
  • Packet
  • Bitmap
King2016

1 0 3 4 0
0 9 5 0 0
0 0 0 7 8
7 2 0 8 6
2 0 1 0 0
Example Matrix
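For comparison with the ELL and HYB layouts below, the same example matrix in the COO and CSR formats listed above (arrays derived by hand):

COO

0 0 0 1 1 2 2 3 3 3 3 4 4
Row Indices

0 2 3 1 2 3 4 0 1 3 4 0 2
Column Indices

1 3 4 9 5 7 8 7 2 8 6 2 1
Values

CSR

0 3 5 7 11 13
Row Pointers

0 2 3 1 2 3 4 0 1 3 4 0 2
Column Indices

1 3 4 9 5 7 8 7 2 8 6 2 1
Values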

ELL

Each matrix row is padded to the length of the longest row (here 4 nonzeros); '*' marks padding. The arrays are shown transposed, one column per matrix row.

0 1 3 0 0
2 2 4 1 2
3 * * 3 *
* * * 4 *
Column Indices

1 9 7 7 2
3 5 8 2 1
4 * * 8 *
* * * 6 *
Values

HYB

The first nonzeros of each row (here up to two) are stored in a fixed-width ELL part; the remaining entries overflow into a COO part.

ELL

0 1 3 0 0
2 2 4 1 2
Column Indices

1 9 7 7 2
3 5 8 2 1
Values

COO

0 3 3
Row Indices

3 3 4
Column Indices

4 8 6
Values

GPU systems

NVIDIA2022

Multi-GPU systems

Tsu2022
HGX H100, 4 GPUs
HGX H100, 8 GPUs
HGX H100 Network

Why

Use cases

  • Graph algorithms
  • Bioinformatics
    • protein complex detection
    • disease module detection
    • gene function prediction

Graph algorithms

Shah_2011
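A standard example of the linear-algebra view of graph algorithms: for the adjacency matrix A of an unweighted graph, SpGEMM yields walk counts,

$$(A^2)_{ij} = \sum_k A_{ik} A_{kj} = \text{number of walks of length two from } i \text{ to } j,$$

and evaluating the same product over the (min, +) semiring turns repeated squaring into an all-pairs shortest-path computation (cf. Chan2010).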

Bioinformatics

Protein Clustering

Dongen2000
Markov Cluster Algorithm (MCL): alternate expansion (matrix squaring, an SpGEMM) and inflation (element-wise powering followed by column re-normalization) until the iteration converges to a clustering.
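Written out (following Dongen2000), one MCL iteration on a column-stochastic matrix $M$ with inflation parameter $r > 1$ is

$$M \leftarrow M \cdot M \quad \text{(expansion, an SpGEMM)}, \qquad (\Gamma_r M)_{ij} = \frac{M_{ij}^{\,r}}{\sum_k M_{kj}^{\,r}} \quad \text{(inflation)},$$

so the expensive step of every iteration is exactly the sparse matrix-matrix product.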

How

  • Expand, sort and compress: materialize all partial products, sort them by (row, column), then reduce duplicates (sketch below)
  • Hashing: accumulate partial products in hash tables keyed by column index (WarpCore sketch below)
  • Merging: merge pre-sorted partial rows into final rows
  • Dense accumulation: accumulate each output row in a dense array of length n, as in the row-wise sketch above
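A minimal C++ sketch of the first strategy, expand, sort and compress; std::sort stands in for the device-wide sort a GPU implementation would use, and all names and data are illustrative:

```cpp
#include <algorithm>
#include <cstdio>
#include <tuple>
#include <vector>

// Expand-sort-compress (ESC) sketch: 1) expand every partial product
// A(i,k)*B(k,j) into a (row, col, val) triple, 2) sort the triples by
// coordinate, 3) compress runs with equal coordinates by summation.
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int rows; };

int main() {
    // A = [[1,2],[0,3]], B = [[0,1],[4,5]], both CSR
    Csr A{{0, 2, 3}, {0, 1, 1}, {1.0, 2.0, 3.0}, 2};
    Csr B{{0, 1, 3}, {1, 0, 1}, {1.0, 4.0, 5.0}, 2};
    std::vector<std::tuple<int, int, double>> t;  // expanded triples
    for (int i = 0; i < A.rows; ++i)
        for (int a = A.ptr[i]; a < A.ptr[i + 1]; ++a)
            for (int b = B.ptr[A.idx[a]]; b < B.ptr[A.idx[a] + 1]; ++b)
                t.emplace_back(i, B.idx[b], A.val[a] * B.val[b]);
    std::sort(t.begin(), t.end());                // order by (row, col)
    for (size_t s = 0; s < t.size(); ) {          // compress equal coordinates
        auto [i, j, v] = t[s];
        for (++s; s < t.size() && std::get<0>(t[s]) == i
                               && std::get<1>(t[s]) == j; ++s)
            v += std::get<2>(t[s]);
        std::printf("C(%d,%d) = %g\n", i, j, v);
    }
}
```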

Hashing with WarpCore

Juenger2020
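WarpCore provides fast hash tables for GPUs (Juenger2020). The CPU-side sketch below only illustrates the role such a table plays in a hashed SpGEMM; std::unordered_map stands in for the device-side table, and nothing here is the actual WarpCore API:

```cpp
#include <cstdio>
#include <unordered_map>
#include <vector>

// Hashed SpGEMM sketch: for each output row i, partial products
// A(i,k)*B(k,j) are inserted into a hash table keyed by the column index j;
// colliding contributions are added in place ("insert-or-add").
struct Csr { std::vector<int> ptr, idx; std::vector<double> val; int rows; };

int main() {
    // A = [[1,2],[0,3]], B = [[0,1],[4,5]], both CSR
    Csr A{{0, 2, 3}, {0, 1, 1}, {1.0, 2.0, 3.0}, 2};
    Csr B{{0, 1, 3}, {1, 0, 1}, {1.0, 4.0, 5.0}, 2};
    for (int i = 0; i < A.rows; ++i) {
        std::unordered_map<int, double> row;  // hash accumulator for C(i,:)
        for (int a = A.ptr[i]; a < A.ptr[i + 1]; ++a) {
            int k = A.idx[a];
            for (int b = B.ptr[k]; b < B.ptr[k + 1]; ++b)
                row[B.idx[b]] += A.val[a] * B.val[b];  // insert-or-add
        }
        for (const auto& [j, v] : row)   // entries come out unsorted
            std::printf("C(%d,%d) = %g\n", i, j, v);
    }
}
```

On the GPU the table lives in shared or global memory and updates use atomic operations, which is where a specialized library like WarpCore comes in.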


Adaptive algorithm

Related Work

State of the art on a single GPU

  • cuSPARSE
  • CUSP
  • RMerge2
  • nsparse
  • AC-SpGEMM
  • spECK
  • tSparse

spECK

Parger2020

Procedure used in spECK: a lightweight analysis of the input matrices selects per-row strategies (Parger2020)

CombBLAS 2.0

Azad2021
Using 6 V100 GPUs

SWOT

Strengths

  • Focus on HPC and hardware
  • Experienced seniors
  • Prepared working environment

Weaknesses

  • No working naive GPU implementation yet
  • Missing knowledge of advanced memory management
  • No prior experience with multi-GPU systems

Opportunities

  • New CUDA and C++ compilers
  • New Hopper Features

Threats

  • No significant scaling across multiple GPUs
  • No access to new hardware

Where I am now

  • Unit tests
  • Profiling workflow
  • Prepared benchmark
  • Naive hash-based CPU implementation
  • Single-GPU cuSPARSE implementation
  • Reading COO and CSR files and converting between them

Action plan

Priority A

  • Implement SpGEMM based on WarpCore
  • Profile and improve the implementation
  • Benchmark and compare against the state of the art
  • Document the implementation
  • Keep it compatible with typical workflows

Action plan

Priority B

  • Implement a naive adaptive algorithm
  • Benchmark energy consumption
  • Compare the scaling of different implementations
  • Improve performance with a Bloom filter
  • Use advanced features of the new architectures

Sources

article (Zachariadis2020)
Zachariadis, O., Satpute, N., Gómez-Luna, J. and Olivares, J.
Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
Computers & Electrical Engineering, Elsevier BV, 2020, Vol. 88, pp. 106848
inproceedings (Winter2019)
Winter, M., Mlakar, D., Zayer, R., Seidel, H.-P. and Steinberger, M.
Adaptive sparse matrix-matrix multiplication on the GPU
Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
ACM, 2019
article (Dalton2015)
Dalton, S., Olson, L. and Bell, N.
Optimizing Sparse Matrix-Matrix Multiplication for the GPU
ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM), 2015, Vol. 41(4), pp. 1-20
article (Liu2015)
Liu, W. and Vinter, B.
A framework for general sparse matrix–matrix multiplication on GPUs and heterogeneous processors
Journal of Parallel and Distributed Computing, Elsevier BV, 2015, Vol. 85, pp. 47-61

Sources

inproceedings (Anh2016)
Anh, P. N. Q., Fan, R. and Wen, Y.
Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication
Proceedings of the 2016 International Conference on Supercomputing
ACM, 2016
inproceedings (Kunchum2017)
Kunchum, R., Chaudhry, A., Sukumaran-Rajam, A., Niu, Q., Nisa, I. and Sadayappan, P.
On improving performance of sparse matrix-matrix multiplication on GPUs
Proceedings of the International Conference on Supercomputing
ACM, 2017
article (Juenger2020)
Jünger, D., Kobus, R., Müller, A., Hundt, C., Xu, K., Liu, W. and Schmidt, B.
WarpCore: A Library for Fast Hash Tables on GPUs
2020
inproceedings (Koide2018)
Koide, S., Kawano, K. and Kutsuna, T.
Neural Edit Operations for Biological Sequences
Advances in Neural Information Processing Systems
Curran Associates, Inc., 2018, Vol. 31

Sources

article (Guo2010)
Guo, P. and Wang, L.
Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
2010 International Conference on Computational and Information Sciences
IEEE, 2010
misc (2020)
Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt
NVIDIA Technical Blog, 2020
misc (2022)
CUTLASS 2.8
2022
article (Pearson2022)
Pearson, C., Javeed, A. and Devine, K.
Machine Learning for CUDA+MPI Design Rules
2022
techreport (Bell2008)
Bell, N. and Garland, M.
Efficient Sparse Matrix-Vector Multiplication on CUDA
NVIDIA Corporation, 2008 (NVR-2008-004)

Sources

inproceedings (Gubner2019)
Gubner, T., Tomé, D., Lang, H. and Boncz, P.
Fluid Co-processing
Proceedings of the 15th International Workshop on Data Management on New Hardware - DaMoN'19
ACM Press, 2019
article (Chan2010)
Chan, T. M.
More Algorithms for All-Pairs Shortest Paths in Weighted Graphs
SIAM Journal on Computing, Society for Industrial and Applied Mathematics, 2010, Vol. 39(5), pp. 2075-2089
inproceedings (Gilbert2007)
Gilbert, J. R., Reinhardt, S. and Shah, V. B.
High-Performance Graph Algorithms from Parallel Sparse Matrices
Kågström, B., Elmroth, E., Dongarra, J. & Waśniewski, J. (ed.)
Applied Parallel Computing. State of the Art in Scientific Computing
Springer, 2007, pp. 260-269

Sources

inproceedings (Niu2014)
Niu, Q., Lai, P.-W., Faisal, S. M., Parthasarathy, S. and Sadayappan, P.
A fast implementation of MLR-MCL algorithm on multi-core processors
2014 21st International Conference on High Performance Computing (HiPC)
2014, pp. 1-10
article (Buluc2012)
Buluç, A. and Gilbert, J. R.
Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
SIAM Journal on Scientific Computing, Society for Industrial and Applied Mathematics, 2012, Vol. 34(4), pp. C170-C191
article (Lim2019)
Lim, Y., Yu, I., Seo, D., Kang, U. and Sael, L.
PS-MCL: parallel shotgun coarsened Markov clustering of protein interaction networks
BMC Bioinformatics, 2019, Vol. 20(Suppl 13), pp. 381
article (Vassilevska2006)
Vassilevska, V., Williams, R. and Yuster, R.
Finding heaviest H-subgraphs in real weighted graphs, with applications
2006

Sources

misc (Olson2015)
Olson, L.
Algebraic Multigrid Methods
2015
article (Xu2016)
Xu, J. and Zikatanov, L. T.
Algebraic Multigrid Methods
2016
inproceedings (Srivastava2020)
Srivastava, N., Jin, H., Liu, J., Albonesi, D. and Zhang, Z.
MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
2020, pp. 766-780
incollection (Shah_2011)
Shah, V. B., Gilbert, J. and Reinhardt, S.
Some Graph Algorithms in an Array-Based Language
Graph Algorithms in the Language of Linear Algebra
Society for Industrial and Applied Mathematics, 2011, pp. 29-43

Sources

article (Kolodziej2019)
Kolodziej, S., Aznaveh, M., Bullock, M., David, J., Davis, T., Henderson, M., Hu, Y. and Sandstrom, R.
The SuiteSparse Matrix Collection Website Interface
Journal of Open Source Software, The Open Journal, 2019, Vol. 4(35), pp. 1244
online (Tsu2022)
Tsu, W.
Introducing NVIDIA HGX H100: An Accelerated Server Platform for AI and High-Performance Computing
2022
incollection (King2016)
King, J., Gilray, T., Kirby, R. M. and Might, M.
Dynamic Sparse-Matrix Allocation on GPUs
High Performance Computing
Springer, 2016
techreport (NVIDIA2022)
NVIDIA
NVIDIA H100 Tensor Core GPU Architecture
NVIDIA Corporation, 2022

Sources

inproceedings (Parger2020)
Parger, M., Winter, M., Mlakar, D. and Steinberger, M.
spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis
Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Association for Computing Machinery, 2020, pp. 362–375
article (Azad2021)
Azad, A., Selvitopi, O., Hussain, M. T., Gilbert, J. R. and Buluc, A.
Combinatorial BLAS 2.0: Scaling combinatorial algorithms on distributed-memory systems
2021
report (Buluc2018)
Buluç, A.
Large-scale parallel computing for computational genomics
Computational Research Division, LBNL / EECS Department, UC Berkeley, 2018
thesis (Dongen2000)
Dongen, S.M. van
Graph clustering by flow simulation
Utrecht University, 2000