Sparse Matrix Matrix
Multiplication
on
modern
multi-GPU systems
Markus Vieth
Outline
- The problem
- Related work
- Methodology
- Results
- Further work
- Discussion
SpGEMM
Kolodziej2019
SpGEMM
Srivastava2020
Multi GPU systems
Tsu2022
Use cases
Stueben2001
AC-SpGEMM
Winter2019
Setup
- Package manager: Spack
- Benchmarking: JUBE
- Toolchain: CMake
- Debugging: cuda-gdb
- Profiling: nsight-compute and nsight-system
- Machines: Monster1-3 and Elwetritsch (GDX-2)
Workflow
Benchmark
Luehrs2018
Benchmark
Algorithm
Winter2019
Sources
- article (Zachariadis2020)
- Zachariadis, O., Satpute, N., Gómez-Luna, J. and Olivares, J.
- Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores
- Comput. Electr. Eng. 88 (2020) 106848, Elsevier BV, 2020, Vol. 88, pp. 106848
- inproceedings (Winter2019)
- Winter, M., Mlakar, D., Zayer, R., Seidel, H.-P. and Steinberger, M.
- Adaptive sparse matrix-matrix multiplication on the GPU
- Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
- ACM, 2019
- article (Dalton2015)
- Dalton, S., Olson, L. and Bell, N.
- Optimizing Sparse Matrix—Matrix Multiplication for the GPU
- ACM Transactions on Mathematical Software, Association for Computing Machinery (ACM),
2015, Vol. 41(4),
pp. 1-20
- article (Liu2015)
- Liu, W. and Vinter, B.
- A framework for general sparse matrix–matrix multiplication on GPUs and
heterogeneous processors
- Journal of Parallel and Distributed Computing, Elsevier BV, 2015, Vol. 85, pp. 47-61
Sources
- inproceedings (Anh2016)
- Anh, P. N. Q., Fan, R. and Wen, Y.
- Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication
- Proceedings of the 2016 International Conference on Supercomputing
- ACM, 2016
- inproceedings (Kunchum2017)
- Kunchum, R., Chaudhry, A., Sukumaran-Rajam, A., Niu, Q., Nisa, I. and Sadayappan, P.
- On improving performance of sparse matrix-matrix multiplication on GPUs
- Proceedings of the International Conference on Supercomputing
- ACM, 2017
- article (Juenger2020)
- Jünger, D., Kobus, R., Müller, A., Hundt, C., Xu, K., Liu, W. and Schmidt, B.
- WarpCore: A Library for fast Hash Tables on GPUs
- 2020
- inproceedings (Koide2018)
- Koide, S., Kawano, K. and Kutsuna, T.
- Neural Edit Operations for Biological Sequences
- Advances in Neural Information Processing Systems
- Curran Associates, Inc., 2018, Vol. 31
Sources
- article (Guo2010)
- Guo, P. and Wang, L.
- Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
- 2010 International Conference on Computational and Information Sciences
- IEEE, 2010
- misc (2020)
- Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt
- NVIDIA Technical Blog, 2020
- misc (2022)
- CUTLASS 2.8
- 2022
- article (Pearson2022)
- Pearson, C., Javeed, A. and Devine, K.
- Machine Learning for CUDA+MPI Design Rules
- 2022
- techreport (Bell2008)
- Bell, N. and Garland, M.
- Efficient Sparse Matrix-Vector Multiplication on CUDA
- NVIDIA Corporation, 2008(NVR-2008-004)
Sources
- inproceedings (Gubner2019)
- Gubner, T., Tomé, D., Lang, H. and Boncz, P.
- Fluid Co-processing
- Proceedings of the 15th International Workshop on Data Management on New Hardware -
DaMoN'19
- ACM Press, 2019
- article (Chan2010)
- Chan, T. M.
- More Algorithms for All-Pairs Shortest Paths in Weighted Graphs
- SIAM Journal on Computing, Society for Industrial and Applied Mathematics, 2010, Vol.
39(5), pp.
2075-2089
- inproceedings (Gilbert2007)
- Gilbert, J. R., Reinhardt, S. and Shah, V. B.
- High-Performance Graph Algorithms from Parallel Sparse Matrices
- Kågström, B., Elmroth, E., Dongarra, J. & Waśniewski, J. (ed.)
- Applied Parallel Computing. State of the Art in Scientific Computing
- Springer, 2007, pp. 260-269
Sources
- inproceedings (Niu2014)
- Niu, Q., Lai, P.-W., Faisal, S. M., Parthasarathy, S. and Sadayappan, P.
- A fast implementation of MLR-MCL algorithm on multi-core processors
- 2014 21st International Conference on High Performance Computing (HiPC)
- 2014, pp. 1-10
- article (Buluc2012)
- Buluç, A. and Gilbert, J. R.
- Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and
Experiments
- SIAM Journal on Scientific Computing, Society for Industrial and Applied Mathematics,
2012, Vol. 34(4),
pp. C170-C191
- article (Lim2019)
- Lim, Y., Yu, I., Seo, D., Kang, U. and Sael, L.
- PS-MCL: parallel shotgun coarsened Markov clustering of protein interaction networks
- BMC Bioinformatics, 2019, Vol. 20(Suppl 13), pp. 381
- article (Vassilevska2006)
- Vassilevska, V., Williams, R. and Yuster, R.
- Finding heaviest H-subgraphs in real weighted graphs, with applications
- 2006
Sources
- misc (Olson2015)
- Olson, L.
- Algebraic Multigrid Methods
- 2015
- article (Xu2016)
- Xu, J. and Zikatanov, L. T.
- Algebraic Multigrid Methods
- 2016
- inproceedings (Srivastava2020)
- Srivastava, N., Jin, H., Liu, J., Albonesi, D. and Zhang, Z.
- MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise
Product
- 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
- 2020, pp. 766-780
- incollection (Shah_2011)
- Shah, V. B., Gilbert, J. and Reinhardt, S.
- Some Graph Algorithms in an Array-Based Language
- Graph Algorithms in the Language of Linear Algebra
- Society for Industrial and Applied Mathematics, 2011, pp. 29-43
Sources
- article (Kolodziej2019)
- Kolodziej, S., Aznaveh, M., Bullock, M., David, J., Davis, T., Henderson, M., Hu, Y. and
Sandstrom, R.
- The SuiteSparse Matrix Collection Website Interface
- Journal of Open Source Software, The Open Journal, 2019, Vol. 4(35), pp. 1244
- online (Tsu2022)
- Tsu, W.
- Introducing NVIDIA HGX H100: An Accelerated Server Platform for AI and High-Performance
Computing
- 2022
- incollection (King2016)
- King, J., Gilray, T., Kirby, R. M. and Might, M.
- Dynamic Sparse-Matrix Allocation on GPUs
- High Performance Computing
- Springer, 2016
- techreport (NVIDIA2022)
- NVIDIA
- NVIDIA H100 Tensor Core GPU Architecture
- NVIDIA Corporation, 2022
Sources
- article (Parger2020)
- Parger, M., Winter, M., Mlakar, D. and Steinberger, M.
- spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight
analysis
- Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming
- Association for Computing Machinery, 2020, pp. 362–375
- article (Azad2021)
- Azad, A., Selvitopi, O., Hussain, M. T., Gilbert, J. R. and Buluc, A.
- Combinatorial BLAS 2.0: Scaling combinatorial algorithms on distributed-memory
systems
- 2021
- report (Buluc2018)
- Buluç, A.
- Large-scale parallel computing forcomputational genomics
- Computational Research Division, LBNLEECS Department, UC Berkeley, 2018
Sources
- thesis (Dongen2000)
- Dongen, S.M. van
- Graph clustering by flow simulation
- Utrecht University, 2000
- misc (Stueben2001)
- Stüben, K.
- A review of algebraic multigrid
- 2001, pp. 281-309
- inproceedings (Luehrs2018)
- Lührs, S.
- Benchmarking and parameter studies with JUBE
- NEST Conference, 2018