cuBLASLt Grouped GEMM Documentation

NVIDIA reports speedups of up to 1.2x in MoE generation phases when using grouped APIs over standard batched alternatives.

To execute a grouped GEMM, the user typically provides arrays of pointers to the matrices, with one entry per problem for each of the A, B, and C operands.
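The pointer-array layout described above can be illustrated with a small CPU reference model. This is only a sketch of the semantics, not the library's implementation: nested Python lists stand in for device pointers, and the names `matmul` and `grouped_gemm` are hypothetical helpers introduced for this example.

```python
def matmul(a, b):
    """Plain row-major product of two matrices stored as lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def grouped_gemm(a_ptrs, b_ptrs, c_ptrs, alpha=1.0, beta=0.0):
    """Reference semantics: C[i] = alpha * A[i] @ B[i] + beta * C[i] for each i.

    Each list plays the role of one pointer array; entry i describes problem i.
    The problems are independent and may all have different shapes.
    """
    if not (len(a_ptrs) == len(b_ptrs) == len(c_ptrs)):
        raise ValueError("pointer arrays must have one entry per problem")
    for a, b, c in zip(a_ptrs, b_ptrs, c_ptrs):
        if len(a[0]) != len(b):
            raise ValueError("inner dimensions do not match")
        product = matmul(a, b)
        # Write the result back in place, as a GPU kernel would through C's pointer.
        for i, row in enumerate(product):
            for j, value in enumerate(row):
                c[i][j] = alpha * value + beta * c[i][j]
    return c_ptrs


# Two problems with different shapes: (2x2)@(2x1) and (1x3)@(3x2).
a_ptrs = [[[1, 2], [3, 4]], [[1, 0, 2]]]
b_ptrs = [[[5], [6]], [[1, 1], [2, 2], [3, 3]]]
c_ptrs = [[[0], [0]], [[0, 0]]]
grouped_gemm(a_ptrs, b_ptrs, c_ptrs)
```

A single call covers both problems even though their dimensions differ, which is the property that separates a grouped GEMM from a uniform batched GEMM.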

Call cublasLtMatmul or the specialized cublasGemmGroupedBatchedEx routine from the cuBLAS API. Ensure you provide a workspace buffer at least as large as the size requested by the library's heuristics.
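The workspace requirement can be pictured as a selection problem. In cuBLASLt, cublasLtMatmulAlgoGetHeuristic returns candidate algorithms, each reporting a workspaceSize. The sketch below models only that idea in plain Python: `HeuristicResult` and `pick_algo` are hypothetical stand-ins, not library types, and the candidate list is assumed to be ordered fastest first.

```python
from dataclasses import dataclass


@dataclass
class HeuristicResult:
    """Hypothetical stand-in for one heuristic candidate."""
    algo_id: int
    workspace_size: int  # bytes of scratch memory this candidate needs


def pick_algo(candidates, workspace_budget):
    """Return the first candidate whose workspace fits the budget, else None.

    Assumes `candidates` is ordered from fastest to slowest, so the first
    fit is the best choice available under the given budget.
    """
    for candidate in candidates:
        if candidate.workspace_size <= workspace_budget:
            return candidate
    return None


candidates = [
    HeuristicResult(algo_id=0, workspace_size=64 << 20),  # fastest, needs 64 MiB
    HeuristicResult(algo_id=1, workspace_size=4 << 20),   # needs 4 MiB
    HeuristicResult(algo_id=2, workspace_size=0),         # needs no workspace
]
chosen = pick_algo(candidates, workspace_budget=32 << 20)
```

With a 32 MiB budget the model skips the fastest candidate and falls back to the second, which shows why an undersized workspace can cost performance.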

This report details the functionality, API structure, and usage of the Grouped GEMM operation within the NVIDIA cuBLASLt library. Grouped GEMM allows for the execution of multiple independent General Matrix Multiply (GEMM) operations in a single API call. This capability is critical for optimizing deep learning workloads, such as Multi-Head Attention or Mixture of Experts (MoE), where many small matrix multiplications can be batched together to saturate GPU throughput and reduce kernel launch overhead.


Advanced Optimization: The Scheduler

For users requiring even more control, NVIDIA's CUTLASS library (which often powers cuBLAS kernels) uses a grouped kernel scheduler. This scheduler assigns work to threadblocks in a round-robin fashion, so that even if some GEMMs in your group are significantly larger than others, the work stays balanced across the GPU's Streaming Multiprocessors (SMs).
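The round-robin idea can be simulated on the host. The sketch below is a simplified model, not the actual scheduler code: it assumes each GEMM's output is split into fixed-size tiles, that tiles from all problems are numbered consecutively, and that tile t goes to threadblock t modulo the block count. The function names and the 128x128 tile size are illustrative choices.

```python
def tiles_per_problem(problems, tile_m, tile_n):
    """Number of output tiles for each (m, n) problem, rounding partial tiles up."""
    return [((m + tile_m - 1) // tile_m) * ((n + tile_n - 1) // tile_n)
            for m, n in problems]


def round_robin_schedule(problems, tile_m, tile_n, num_blocks):
    """Assign tiles to threadblocks: global tile t goes to block t % num_blocks.

    Returns one list per block holding (problem_index, local_tile_index) pairs.
    """
    schedule = [[] for _ in range(num_blocks)]
    tile_id = 0
    for problem_idx, count in enumerate(tiles_per_problem(problems, tile_m, tile_n)):
        for local_tile in range(count):
            schedule[tile_id % num_blocks].append((problem_idx, local_tile))
            tile_id += 1
    return schedule


# One large GEMM output (512x512) next to two small ones.
problems = [(512, 512), (128, 128), (128, 256)]
schedule = round_robin_schedule(problems, tile_m=128, tile_n=128, num_blocks=4)
loads = [len(block) for block in schedule]  # tiles handled by each block
```

Because tiles rather than whole GEMMs are the unit of assignment, the per-block tile counts in this model differ by at most one, even though one problem contributes 16 of the 19 tiles.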
