Several optimization methods for half-precision general matrix multiplication (HGEMM) on Tensor Cores, using the WMMA API and MMA PTX instructions.
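For context, a minimal WMMA sketch looks like the following: one warp computes a 16x16 tile of C using `nvcuda::wmma` fragments. The kernel name and the row-major-A / column-major-B layout are illustrative assumptions, not this repo's actual code.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B.
// Assumes row-major A (MxK), column-major B (KxN), row-major C (MxN),
// with M, N, K all multiples of 16 (illustrative assumptions).
__global__ void wmma_hgemm(const half *A, const half *B, half *C,
                           int M, int N, int K) {
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
    wmma::fill_fragment(c_frag, __float2half(0.0f));

    // March along K, accumulating 16x16x16 matrix-multiply steps.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}
```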
Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Multiple GEMM operators constructed with CUTLASS to support LLM inference.
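As a rough illustration of what a CUTLASS-built GEMM operator looks like, here is a sketch using the CUTLASS 2.x device-level template; the element types, layouts, and helper name are assumptions, not this repo's actual configuration.

```cuda
#include "cutlass/gemm/device/gemm.h"

// Half-precision inputs and outputs, float accumulation (assumed).
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B
    cutlass::half_t, cutlass::layout::RowMajor,     // C / D
    float>;                                         // accumulator

cutlass::Status run_hgemm(int M, int N, int K,
                          cutlass::half_t const *A, int lda,
                          cutlass::half_t const *B, int ldb,
                          cutlass::half_t *C, int ldc) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},           // problem size
                         {A, lda}, {B, ldb},  // input tensor refs
                         {C, ldc},            // source C
                         {C, ldc},            // destination D
                         {1.0f, 0.0f});       // epilogue: alpha, beta
    return gemm_op(args);                     // launches the kernel
}
```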
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
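The MMA PTX path typically boils down to inline-assembly wrappers like the sketch below, which issues one `mma.sync.aligned.m16n8k16` f16 instruction per warp. The surrounding `ldmatrix` loads and the per-thread fragment layout (documented in the PTX ISA manual) are omitted, and the wrapper name is mine.

```cuda
#include <cstdint>

// D = A * B + C for a 16x8x16 f16 tile; each uint32_t packs two halves
// (f16x2), and registers are distributed across the 32 threads of the
// warp per the PTX ISA fragment layout.
__device__ void mma_m16n8k16_f16(uint32_t &d0, uint32_t &d1,
                                 uint32_t a0, uint32_t a1,
                                 uint32_t a2, uint32_t a3,
                                 uint32_t b0, uint32_t b1,
                                 uint32_t c0, uint32_t c1) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(d0), "=r"(d1)
        : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
          "r"(b0), "r"(b1), "r"(c0), "r"(c1));
}
```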
Code for DTC-SpMM (ASPLOS'24).
Lab assignments from CS4302 Parallel and Distributed Programming (Fall 2022), with my solutions.
A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. This project blends performance engineering with cinematic storytelling: NVTX-tagged training loops (sketched below), fused CuPy kernels, and a profiler-driven README that narrates the GPU's inner workings frame by frame.
🎬 Explore GPU training efficiency with FP32 vs FP16 in this modular lab, utilizing Tensor Core acceleration for deep learning insights.
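The lab itself is PyTorch-based, but the NVTX tagging it describes amounts to the pattern below, shown here from CUDA C++ with illustrative names. Ranges pushed this way appear as labeled spans in Nsight Systems timelines; link with `-lnvToolsExt`.

```cuda
#include <nvToolsExt.h>

// Placeholder for one training step's GPU work (illustrative).
__global__ void train_step_kernel() {}

void train(int epochs) {
    for (int epoch = 0; epoch < epochs; ++epoch) {
        nvtxRangePushA("epoch");             // named range visible in Nsight
        nvtxRangePushA("forward+backward");  // nested range around the step
        train_step_kernel<<<1, 1>>>();
        nvtxRangePop();
        nvtxRangePop();
    }
    cudaDeviceSynchronize();
}
```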