A high-performance GPU-accelerated linear algebra toolkit implementing dense matrix operations, sparse matrix operations, and iterative linear solvers for circuit simulation workloads.
This project provides a CUDA-based linear algebra library featuring:
- Dense Matrix Operations: Optimized tiled matrix multiplication with multiple kernel implementations
- Sparse Matrix Operations: CSR-format sparse matrix-vector multiplication (SpMV) with custom kernels and cuSPARSE integration
- Iterative Linear Solver: Conjugate Gradient method with GPU preconditioning
- Performance Benchmarking: CPU vs GPU performance comparisons and custom vs library implementations
- Multiple optimization strategies: Naive, Coalesced, Shared Memory, 2D Tiled, Vectorized, Warp Tiled
- GEMM operations with alpha/beta scaling: C = α*A*B + β*C (a minimal tiling sketch follows this list)
- Performance comparison against cuBLAS
- Support for large matrices (2048×2048 and beyond)
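The GEMM form above maps naturally onto a shared-memory tiled kernel. Below is a minimal sketch of that idea for square, row-major matrices; the TILE size and names are illustrative, and the actual kernels in denseMatMul.cu layer register blocking, vectorized loads, and warp tiling on top of this baseline.

```cuda
#define TILE 32

// Minimal shared-memory tiled GEMM sketch: C = alpha*A*B + beta*C for
// N x N row-major matrices. Launch with dim3(TILE, TILE) threads per block
// and ceil(N/TILE) x ceil(N/TILE) blocks. Illustrative only.
__global__ void tiledGemm(const float* A, const float* B, float* C,
                          int N, float alpha, float beta) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Stage one tile of A and one tile of B in shared memory (zero-pad out of range).
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
```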
- CSR (Compressed Sparse Row) format support
- Multiple SpMV kernel optimizations: Basic, Warp-reduction, Cache-optimized, Vectorized, Shared-memory, Prefetch, Adaptive
- Performance comparison against cuSPARSE
- Configurable sparsity patterns for different problem sizes
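For reference, the CSR layout and the baseline kernel shape look roughly like the following one-thread-per-row sketch; names are illustrative, and the kernels in sparseMatMul.cu build warp reductions, caching, prefetching, and adaptive row handling on top of this.

```cuda
// Minimal one-thread-per-row CSR SpMV sketch: y = A*x.
// rowPtr has numRows+1 entries; colIdx/vals hold the non-zeros row by row.
__global__ void csrSpMVBasic(int numRows, const int* rowPtr, const int* colIdx,
                             const float* vals, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += vals[j] * x[colIdx[j]];   // accumulate this row's non-zeros
        y[row] = sum;
    }
}
```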
- GPU-accelerated iterative linear solver
- Support for symmetric positive definite matrices
- Custom CUDA kernels vs cuBLAS + cuSPARSE implementation comparison
- Configurable convergence tolerance and maximum iterations
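The solver follows the textbook CG recurrence; the host-side outline below is a hypothetical sketch (preconditioning omitted for brevity) in which CsrMatrix, spmv(), dot(), axpy(), and xpay() are placeholders for the actual structures and device routines (custom kernels or cuBLAS/cuSPARSE calls) used in cgSolver.cu.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical host-side outline of the CG iteration for A x = b (A SPD).
// CsrMatrix, spmv(), dot(), axpy(), and xpay() stand in for the structures
// and device routines actually used in cgSolver.cu; all vectors are device pointers.
void conjugateGradient(const CsrMatrix& A, const float* b, float* x, int n,
                       float tolerance, int maxIterations) {
    float *r, *p, *Ap;
    cudaMalloc(&r,  n * sizeof(float));
    cudaMalloc(&p,  n * sizeof(float));
    cudaMalloc(&Ap, n * sizeof(float));

    cudaMemset(x, 0, n * sizeof(float));                            // x0 = 0
    cudaMemcpy(r, b, n * sizeof(float), cudaMemcpyDeviceToDevice);  // r0 = b - A*x0 = b
    cudaMemcpy(p, r, n * sizeof(float), cudaMemcpyDeviceToDevice);  // p0 = r0

    float rsOld = dot(r, r, n);
    for (int k = 0; k < maxIterations && std::sqrt(rsOld) > tolerance; ++k) {
        spmv(A, p, Ap, n);                      // Ap = A * p
        float alpha = rsOld / dot(p, Ap, n);    // step length along p
        axpy( alpha, p,  x, n);                 // x += alpha * p
        axpy(-alpha, Ap, r, n);                 // r -= alpha * Ap
        float rsNew = dot(r, r, n);
        xpay(r, rsNew / rsOld, p, n);           // p = r + (rsNew/rsOld) * p
        rsOld = rsNew;
    }
    cudaFree(r); cudaFree(p); cudaFree(Ap);
}
```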
- CUDA Toolkit (11.0 or later)
- CMake (3.18 or later)
- C++17 compatible compiler
- NVIDIA GPU with Compute Capability 7.0 or higher
- Clone the repository and navigate to the project directory:
  cd CUDA
- Create and enter the build directory:
  mkdir build && cd build
- Configure with CMake:
  cmake ..
- Build the project:
  cmake --build . --config Release
- Run all benchmarks:
  cmake --build . --target run_benchmarks

Alternatively, compile the individual executables directly with NVCC:
# Dense matrix multiplication
nvcc -o denseMatMul.exe denseMatMulMain.cu denseMatMul.cu -lcublas
# Sparse matrix multiplication
nvcc -o sparseMatMul.exe sparseMatMulMain.cu sparseMatMul.cu -lcusparse
# Conjugate gradient solver
nvcc -o cgSolver.exe cgSolverMain.cu cgSolver.cu sparseMatMul.cu denseMatMul.cu -lcusparse -lcublas
All benchmarks were performed on the following system configuration:
- GPU: NVIDIA GeForce RTX 4070 Laptop GPU
- Compute Capability: 8.9 (Ada Lovelace architecture)
- VRAM: 8,187 MB (~8 GB GDDR6)
- Shared Memory per Block: 48 KB
- Warp Size: 32 threads
- Max Threads per Block: 1,024
- CPU: Intel Core i9-14900HX (24 cores, 32 threads)
- RAM: 16GB DDR5
- Platform: Windows with CUDA Toolkit
- CUDA Toolkit: 12.x
- Compiler: NVCC with Visual Studio 2022 Community
- Build System: CMake 3.18+ / Direct NVCC compilation
- Libraries: cuBLAS, cuSPARSE
Note: All benchmarks and this README were generated using the Claude AI assistant
Configuration: 2048×2048 matrices, RTX 4070 Laptop GPU
| Kernel Implementation | Performance (GFLOPS) | Relative to cuBLAS |
|---|---|---|
| cuBLAS (Reference) | 12,994.6 | 100.0% |
| Warp Tiled | 11,502.1 | 88.5% |
| Vectorized | 9,986.8 | 76.8% |
| 2D Tiled | 5,565.6 | 42.8% |
| Shared Memory | 1,898.3 | 14.6% |
| Coalesced | 1,461.1 | 11.2% |
| Naive | 1,047.2 | 8.1% |
Achievement: the custom Warp Tiled kernel reaches 88.5% of cuBLAS performance.
Small Matrix (500×500, 4,955 non-zeros):
| Implementation | Time (ms) | Performance (GFLOPS) | Speedup vs cuSPARSE |
|---|---|---|---|
| Basic | 0.013 | 0.8 | 3.0x |
| Warp-reduction | 0.013 | 0.8 | 3.0x |
| Prefetch | 0.013 | 0.8 | 3.0x |
| Adaptive | 0.013 | 0.8 | 3.0x |
| cuSPARSE | 0.040 | 0.251 | 1.0x (baseline) |
Problem: 1024×1024 symmetric positive definite matrix, tolerance: 1e-10
| Implementation | Total Time (ms) | Iterations | Final Residual | Accuracy |
|---|---|---|---|---|
| Custom CUDA | 4.38 | 20 | 2.862e-11 | 4.69e-07 |
| cuBLAS + cuSPARSE | 10.62 | 20 | 2.862e-11 | 4.69e-07 |
Results: Custom implementation is 2.4× faster while maintaining identical numerical accuracy.
CUDA/
├── CMakeLists.txt # CMake build configuration
├── README.md # This file
├── cgSolver.cu # CG solver implementation
├── cgSolver.cuh # CG solver header
├── cgSolverMain.cu # CG solver main program
├── cgSolverRes.txt # CG solver benchmark results
├── denseMatMul.cu # Dense matrix multiplication kernels
├── denseMatMul.cuh # Dense matrix multiplication header
├── denseMatMulMain.cu # Dense matrix main program
├── denseMatMulRes.txt # Dense matrix benchmark results
├── sparseMatMul.cu # Sparse matrix multiplication kernels
├── sparseMatMul.cuh # Sparse matrix multiplication header
├── sparseMatMulMain.cu # Sparse matrix main program
└── sparseMatMulRes.txt # Sparse matrix benchmark results
- Dense Matrix Multiplication:
  ./bin/denseMatMul
- Sparse Matrix Operations:
  ./bin/sparseMatMul
- Conjugate Gradient Solver:
  ./bin/cgSolver
Each component supports configuration through source code modification:
- Matrix sizes: Modify the `N` or `problemSize` constants
- Convergence criteria: Adjust `tolerance` and `maxIterations`
- Sparsity patterns: Customize the matrix generation functions
Dense Matrix Multiplication:
- Memory coalescing for optimal global memory access
- Shared memory tiling to reduce global memory traffic
- Register blocking and vectorized loads
- Warp-level optimizations for maximum throughput
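As a concrete illustration of the coalescing and vectorized-load points, the sketch below moves one float4 (128 bits) per thread so each warp issues wide, fully coalesced memory transactions; the Vectorized and Warp Tiled kernels apply the same idea when staging tiles of A and B. The kernel name and the divisibility assumption are illustrative.

```cuda
// Sketch of a coalesced, vectorized copy: one float4 (128 bits) per thread.
// Assumes n is a multiple of 4 and cudaMalloc-aligned pointers.
__global__ void copyFloat4(const float* __restrict__ src,
                           float* __restrict__ dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 4) {
        float4 v = reinterpret_cast<const float4*>(src)[i];  // single 128-bit load
        reinterpret_cast<float4*>(dst)[i] = v;                // single 128-bit store
    }
}
```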
Sparse Matrix Operations:
- CSR format for efficient sparse storage
- Warp-level reductions for irregular memory patterns
- Cache optimization for repeated vector access
- Adaptive algorithms based on sparsity patterns
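A stripped-down version of the warp-level reduction idea is sketched below: 32 lanes stride across one row's non-zeros and combine their partial sums with shuffle intrinsics, which handles long, irregular rows better than one thread per row. Names are illustrative; the production kernels in sparseMatMul.cu are more elaborate.

```cuda
// Warp-per-row CSR SpMV sketch: y = A*x. Launch with blockDim.x a multiple of 32.
__global__ void csrSpMVWarp(int numRows, const int* rowPtr, const int* colIdx,
                            const float* vals, const float* x, float* y) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
    int lane   = threadIdx.x % 32;
    if (warpId < numRows) {
        float sum = 0.0f;
        for (int j = rowPtr[warpId] + lane; j < rowPtr[warpId + 1]; j += 32)
            sum += vals[j] * x[colIdx[j]];                      // strided partial sums
        for (int offset = 16; offset > 0; offset >>= 1)         // warp tree reduction
            sum += __shfl_down_sync(0xffffffff, sum, offset);
        if (lane == 0) y[warpId] = sum;                         // lane 0 writes the row
    }
}
```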
Conjugate Gradient Solver:
- GPU-parallel dot products and vector operations
- Sparse matrix-vector multiplication integration (custom vs cuSPARSE)
- Memory-efficient temporary storage management
- Numerical stability optimizations
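The GPU-parallel dot product can be pictured as a block-level shared-memory reduction followed by one atomic accumulation per block, as in the sketch below (assuming a 256-thread block and a zero-initialized result); this shows the general shape of the primitive, not the exact code in cgSolver.cu.

```cuda
// Block-level dot product sketch: launch with 256 threads per block and
// *result initialized to zero on the device before the call.
__global__ void dotKernel(const float* a, const float* b, float* result, int n) {
    __shared__ float partial[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? a[i] * b[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {              // shared-memory tree reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);        // combine block sums
}
```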
- Minimum: NVIDIA GPU with Compute Capability 7.0 (Volta architecture)
- Recommended: RTX 30/40 series or A100/H100 for optimal performance
- Memory: 4GB+ GPU memory for large problem sizes
This project is provided for educational and research purposes. Please ensure compliance with CUDA SDK license terms when using NVIDIA libraries.
- Benchmarks and documentation generated using Claude AI assistant
- NVIDIA CUDA Toolkit and libraries (cuBLAS, cuSPARSE)
- Optimization techniques based on CUDA programming best practices
- Comparative analysis includes both custom implementations and NVIDIA's optimized libraries
For questions or contributions, please refer to the source code comments for detailed implementation explanations.