A modular CUDA implementation comparing sparse matrix multiplication kernels against dense baselines.
```
SNNSparseInfer/
├── include/
│   ├── common.h                  # Common definitions, macros, and inline functions
│   ├── kernels.h                 # Kernel interface declarations
│   └── benchmark.h               # Benchmark utilities interface
├── src/
│   ├── kernels/
│   │   ├── kernel_utils.cu       # Common host functions (frand, etc.)
│   │   ├── sparse_mm_basic.cu    # Basic sparse implementation (mask + compute)
│   │   └── sparse_mm_pipeline.cu # Optimized pipeline implementation
│   ├── benchmark/
│   │   ├── matrix_generator.cu   # Sparse matrix generation
│   │   ├── dense_baselines.cu    # cuBLAS/cuBLASLt baselines
│   │   └── benchmark_utils.cu    # Timing and error calculation
│   └── main.cu                   # Main benchmark program
├── scripts/
│   └── profile_sparse_mm.sh      # Nsight Compute profiling script
├── Makefile                      # Build system
├── sparse_mm.cu                  # Legacy monolithic version
└── new.cu                        # Original pipeline implementation
```
- CUDA 12.6+ (for TF32 Tensor Core support)
- GCC 11+
- cuBLAS and cuBLASLt libraries
```bash
# Set CUDA paths (if not already done)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Build
make

# Test
make test
```

Build targets:
- `make` or `make all` - Build main modular benchmark
- `make sparse_mm_legacy` - Build legacy monolithic version
- `make pipeline` - Build pipeline-only version
- `make test` - Run basic test (1024×1024, 90% sparsity)
- `make test-variants` - Test different sparsity levels
- `make debug` - Build with debug symbols
- `make clean` - Remove build artifacts
```
./sparse_mm_benchmark [M] [K] [N] [tile] [sparsity]
```

Parameters:
- `M, K, N`: Matrix dimensions (W: M×K, A: K×N, result: M×N)
- `tile`: Tile size for sparse computation (default: 32)
- `sparsity`: Fraction of zero tiles (0.0-1.0, default: 0.9)
Example:
```
./sparse_mm_benchmark 2048 2048 2048 32 0.8
```

- Two-phase approach: mask generation + computation
- Based on the original implementation
- Simpler but less efficient
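The two-phase idea can be sketched as follows. This is a minimal, hypothetical illustration — the kernel and variable names (`build_tile_mask`, `sparse_mm_masked`, `TILE`) are invented here and do not match the real code in `sparse_mm_basic.cu`. Phase 1 records which TILE×TILE tiles of A contain any non-zero; phase 2 is a standard tiled GEMM that skips masked-out tiles, since their contribution to the product is exactly zero.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Phase 1: one thread block per TILE×TILE tile of A (K×N).
// mask must be zero-initialized (e.g. with cudaMemset) before launch.
__global__ void build_tile_mask(const float *A, int K, int N,
                                unsigned char *mask) {
    int r = blockIdx.y * TILE + threadIdx.y;
    int c = blockIdx.x * TILE + threadIdx.x;
    int nz = 0;
    if (r < K && c < N) nz = (A[r * N + c] != 0.0f);
    // Tile is non-empty if any thread in the block saw a non-zero.
    if (__syncthreads_or(nz))
        mask[blockIdx.y * gridDim.x + blockIdx.x] = 1;
}

// Phase 2: tiled GEMM C = W * A that skips empty tiles of A.
// tilesPerRow = number of TILE-wide tiles across N.
__global__ void sparse_mm_masked(const float *W, const float *A, float *C,
                                 int M, int K, int N,
                                 const unsigned char *mask, int tilesPerRow) {
    __shared__ float Ws[TILE][TILE], As[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // All threads in the block read the same mask bit: no divergence.
        if (!mask[t * tilesPerRow + blockIdx.x]) continue;
        Ws[threadIdx.y][threadIdx.x] =
            (row < M && t * TILE + threadIdx.x < K)
                ? W[row * K + t * TILE + threadIdx.x] : 0.0f;
        As[threadIdx.y][threadIdx.x] =
            (t * TILE + threadIdx.y < K && col < N)
                ? A[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += Ws[threadIdx.y][k] * As[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```

The mask lookup costs one cached byte-load per tile, so the payoff depends on the fraction of tiles that can actually be skipped.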
- Triple-stream pipeline: mask generation, async loading, computation
- Double buffering for overlapped execution
- cp.async support for sm_80+
- 4-way vectorized output
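The double-buffered cp.async pattern behind the pipeline version can be sketched with the CUDA pipeline primitives. This is a hedged illustration, not the actual `sparse_mm_pipeline.cu` code: the compute step is a stand-in, `K` is assumed to be a multiple of `TILE`, and on pre-sm_80 targets `cuda_pipeline.h` falls back to synchronous copies.

```cuda
#include <cuda_pipeline.h>   // __pipeline_memcpy_async etc., CUDA 11+

#define TILE 32

__global__ void pipelined_tiles(const float *A, int K, int N, float *out) {
    __shared__ float buf[2][TILE];          // double buffer
    int col = blockIdx.x * TILE + threadIdx.x;
    int numTiles = K / TILE;                // assume K % TILE == 0 for brevity

    // Prefetch tile 0 into buffer 0 (one float per thread).
    __pipeline_memcpy_async(&buf[0][threadIdx.x], &A[col], sizeof(float));
    __pipeline_commit();

    float acc = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < numTiles) {             // kick off the next tile's copy
            __pipeline_memcpy_async(&buf[nxt][threadIdx.x],
                                    &A[(t + 1) * TILE * N + col],
                                    sizeof(float));
            __pipeline_commit();
        }
        // Wait until the *current* tile's copy has landed; the next
        // tile's copy (if any) keeps flowing in the background.
        __pipeline_wait_prior(t + 1 < numTiles ? 1 : 0);
        __syncthreads();
        acc += buf[cur][threadIdx.x];       // stand-in for the real compute
        __syncthreads();
    }
    if (col < N) out[col] = acc;
}
```

The key invariant: at most two copies are ever in flight, and `__pipeline_wait_prior(1)` blocks only until the older of the two has completed, overlapping the newer copy with compute.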
- cuBLAS SGEMM: Traditional FP32 GEMM
- cuBLASLt TF32: TensorFloat-32 with Tensor Cores
- cuBLASLt Optimal: Algorithm selection with workspace
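A typical way to set up the TF32 baseline with cuBLASLt looks like the sketch below — the project's `dense_baselines.cu` may differ in details, and `gemm_tf32` is a name invented here. Inputs and outputs stay FP32; `CUBLAS_COMPUTE_32F_FAST_TF32` lets the Tensor Cores round operands to TF32 internally. Since cuBLASLt layouts are column-major by default, row-major C = W·A is computed as Cᵀ = Aᵀ·Wᵀ, which occupies the same memory.

```cuda
#include <cublasLt.h>
#include <cuda_runtime.h>

// C (M×N) = W (M×K) * A (K×N), all row-major FP32 on the device.
void gemm_tf32(cublasLtHandle_t lt, const float *W, const float *A,
               float *C, int M, int K, int N, cudaStream_t stream) {
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F_FAST_TF32, CUDA_R_32F);

    // Column-major views of A^T, W^T, C^T over the row-major buffers.
    cublasLtMatrixLayout_t la, lw, lc;
    cublasLtMatrixLayoutCreate(&la, CUDA_R_32F, N, K, N);  // A^T: N×K, ld=N
    cublasLtMatrixLayoutCreate(&lw, CUDA_R_32F, K, M, K);  // W^T: K×M, ld=K
    cublasLtMatrixLayoutCreate(&lc, CUDA_R_32F, N, M, N);  // C^T: N×M, ld=N

    const float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(lt, op, &alpha, A, la, W, lw, &beta,
                   C, lc, C, lc, /*algo=*/nullptr,
                   /*workspace=*/nullptr, /*workspaceSize=*/0, stream);

    cublasLtMatrixLayoutDestroy(la);
    cublasLtMatrixLayoutDestroy(lw);
    cublasLtMatrixLayoutDestroy(lc);
    cublasLtMatmulDescDestroy(op);
}
```

The "Optimal" variant additionally runs `cublasLtMatmulAlgoGetHeuristic` with a non-zero workspace and passes the returned algorithm instead of `nullptr`.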
The benchmark prints a side-by-side performance comparison; sample output from an RTX 4090:
```
=== Dense Baseline Comparison ===
cuBLAS SGEMM     : 0.051 ms
cuBLASLt TF32    : 0.036 ms  ← Best dense baseline
cuBLASLt Optimal : 0.036 ms

=== Sparse Performance ===
Sparse Basic     : 0.293 ms
Sparse Pipeline  : 0.899 ms

=== Speedup Analysis ===
Basic vs Best Dense    : 0.12x (sparse slower than dense)
Pipeline vs Best Dense : 0.04x (pipeline needs optimization)
Pipeline vs Basic      : 0.33x (basic faster than pipeline)
```
- TF32 Tensor Cores are extremely efficient for dense GEMM on RTX 4090
- High sparsity (90%) creates overhead that exceeds computation savings
- Pipeline version needs tuning - currently slower than basic version
- Matrix size matters - larger matrices may favor sparse approaches
Use Nsight Compute for detailed analysis:
```
./scripts/profile_sparse_mm.sh [M] [K] [N] [tile] [sparsity]
```

Results are saved in `profile_results/`:
- `sparse_mm_detailed.ncu-rep` - Complete profiling data
- `sparse_mm_metrics.csv` - Key performance metrics
- `analysis_summary.txt` - Profiling guide
For comprehensive automated benchmarking with visualization:
```bash
cd python_benchmark
pip install -r requirements.txt
./run_benchmark.py
```

Features:
- 🎯 Automated testing across multiple matrix sizes and sparsity levels
- 📊 Professional visualizations: Sparsity curves, workload performance bars, algorithm comparisons
- 🔧 YAML configuration for easy customization
- 📈 Statistical analysis with error bars and efficiency metrics
- 💾 Export results in CSV/JSON formats
Generated Plots:
- `sparsity_curves.png` - Performance vs. sparsity analysis with efficiency metrics
- `workload_size_performance.png` - Bar charts comparing matrix sizes with GFLOPS/bandwidth
- `algorithm_comparison_summary.png` - Overall algorithm comparison with recommendations
See python_benchmark/README.md for detailed usage.
- Optimize pipeline kernel - reduce overhead, improve memory access
- Add more sparsity patterns - structured sparsity, block sparsity
- Multi-GPU support - distributed sparse computation
- FP16 variants - half-precision implementations
- Adaptive tile sizing - dynamic tile selection based on sparsity
- CI/CD integration - automated performance regression testing
- sm_89 (RTX 4090): Primary target, TF32 + cp.async
- sm_80 (A100): TF32 + cp.async support
- sm_75 (RTX 2080): No TF32; cp.async falls back to manual loads
- sm_70 (V100): Manual loading only
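The per-architecture dispatch can be handled at compile time. A minimal sketch of the idea — these macro names are invented for illustration and are not the project's actual `common.h`: sm_80+ compiles the cp.async path, older targets get a plain load/store with a no-op wait.

```cuda
#include <cuda_pipeline.h>

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
// Ampere and newer: true asynchronous global→shared copy.
#define ASYNC_COPY(dst, src)                          \
    do {                                              \
        __pipeline_memcpy_async((dst), (src), sizeof(float)); \
        __pipeline_commit();                          \
    } while (0)
#define ASYNC_WAIT() __pipeline_wait_prior(0)
#else
// Pre-Ampere fallback: ordinary load/store, nothing to wait on.
#define ASYNC_COPY(dst, src) (*(dst) = *(src))
#define ASYNC_WAIT() ((void)0)
#endif

__global__ void load_tile(const float *g, float *out) {
    __shared__ float s[32];
    ASYNC_COPY(&s[threadIdx.x], &g[threadIdx.x]);
    ASYNC_WAIT();
    __syncthreads();
    out[threadIdx.x] = s[threadIdx.x] * 2.0f;
}
```

Building with `-gencode` entries for each listed sm_XX produces the right variant per architecture from a single source.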
- `sparse_mm.cu` - Original monolithic implementation with multiple baselines
- `new.cu` - Pipeline implementation before modularization
These are kept for reference and comparison purposes.