cutlass/CHANGELOG.md

40 KiB

NVIDIA CUTLASS Changelog

3.5.0 (2024-04-09)

3.4.1 (2024-02-14)

3.4.0 (2024-01-12)

3.3 (2023-10-31)

  • Mixed-input Hopper GEMMs support covering 16-bit x 8-bit input operand types.
  • Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operandB {fp16, bf16} x {s8, u8}, and upcast on operandA {s8, u8} x {fp16, bf16}.
  • Copy Async based Hopper GEMMs - which support lower than 16B aligned input tensors.
  • Kernel schedules and Builder support for mixed precision and Copy Async GEMMs with < 16B aligned input tensors.
  • Profiler support for lower-aligned Hopper GEMMs.
  • Performance Improvements to Scatter-Gather Hopper Example.
  • Sub-Byte type fixes and improvements.
  • EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
  • Fusion support for backprop fusions including drelu, dgelu, and dbias.
  • Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface

3.2.2 (2023-10-25)

  • Minor patch for issue/1138

3.2.1 (2023-09-22)

  • Python support SM90 Epilogue Visitor Tree (EVT) on top of the C++ support released in 3.2.0.
  • SM80 EVT support in C++ and Python.
  • Other SM90 epilogue improvements.
  • Splitting CUTLASS library into smaller units based on operation, arch and datatypes. See 1105 for details.
  • Making tools/library/scripts packageable - tools/library/scripts is now moving to python/cutlass_library. See the Python README for details.
  • SM90 TF32 kernel improvements for all layouts.
  • SM90 rasterization direction support in the CUTLASS profiler.
  • Improvement for CUTLASS profiler build times.
  • Remove Python-C++ bindings.

3.2.0 (2023-08-03)

  • New warp-specialized persistent FP8 GEMM kernel kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. An example showcasing Hopper warp-specialized FP8 GEMMs. FP8 GEMMs come with a fast accumulation mode. When enabled, problem execution might be faster but at the cost of lower accuracy because intermediate results will not periodically be promoted to a higher precision.
  • New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allows for user-defined customized epilogue fusion patterns without having to write a new epilogue.
  • Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
  • Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler.
  • Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
  • Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
  • New CUTLASS 2D Convolution Python interface. New example here.
  • Support for Windows (MSVC) builds. Tested with Visual Studio 2019 v16.11.27 on Windows 10.0.
  • Optimal performance using CUDA 12.2u1
  • Updates and bugfixes from the community (thanks!)

3.1.0 (2023-04-14)

  • New CUTLASS Python interface that aims to provide an ease-of-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python. More details here and new examples.
  • New efficient epilogues using TMA for Hopper.
  • Support for fused epilogues, such Bias, ReLU and GELU, using the new efficient epilogues.
  • New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
  • New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
  • An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
  • Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization.
  • Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
  • Performance optimizations for the warp-specialized persistent ping-pong kernel.
  • Changes to the GEMM API 3.x, involving the host-facing arguments and the underlying Params structs.
  • FMHA Backward Pass from Meta xFormers.
  • Streamk GEMM with Broadcast enables epilogue broadcast with StreamK GEMM.
  • Batched B2B GEMM now can run multiple Back-to-Back GEMM with the same problem size in parallel.
  • Batched Strided GEMV support both row major and column major input matrix.
  • Permute + GEMM fusion can fuse Permute with following GEMM now. Before, we only support fusing GEMM with Permute in the epilogue.
  • Row Broadcast can be fused in the epilogue.
  • The GitHub branch is renamed from master to main in this release.
  • Optimal performance using CUDA 12.1
  • Updates and bugfixes from the community (thanks!)

3.0.0 (2023-01-23)

2.11.0 (2022-11-19)

  • Stream-K, which is a new general way to do split-K. It can not only improve performance, but can also significantly reduce the number of tile sizes that need to be profiled to find the best one.

  • Fused multi-head attention Kernel. It has two variants: one uses batched GEMM for the fixed sequence length, and the other one uses group GEMM for the variable sequence length. Both versions just need one kernel.

  • Dual GEMM, which can fuse A x B and A x C into one kernel. Two GEMMs has no producer-consumer dependency.

  • Hopper improves double precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8.

  • BLAS3 functions with Hoppers new double precision matrix multiplication instructions.

  • ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of A matrix. B and output matrices are still dense. The block size can be arbitary.

  • Optimized Group Conv for SingleGroup mode, which requires that the output channel per group is a multiple of Threadblock tile N.

  • Optimized DepthWise Conv. Two new modes are added

    • kOptimized - use direct conv to compute instead of implicit GEMM.
      • The restrictions are: 1) input ,output channel and group number should be multiple of (128 / sizeof(input element)). 2) The input filter size should be the same as the template parameter configuration.
    • kFixedStrideDilation - which puts stride and dilation into templates to further improve the performance. In this mode, kernel persistents some inputs into register to squeeze more performance, so large filter/stride/dilation is not recommanded.
      • The restrictions are: 1) input, output channel and group number should be multiple of (128 / sizeof(input element)). 2) input filter size, stride, dilation should same as the template parameter configuration.
  • Scripts to fuse multiple back-to-back GEMM. Its implementation was discussed in a GTC'22 Spring talk.

  • FP8 data type definition and conversion routines.

  • Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.

  • Deprecation announcement: CUTLASS plans to deprecate the following:

    • Maxwell and Pascal GPU architectures
    • Ubuntu 16.04
    • CUDA 10.2

2.10.0 (2022-08-23)

  • CUTLASS Python now supports GEMM, CONV, Group GEMM for different data types as well as different epilogue flavours.
  • Optimizations for CUTLASS's Grouped GEMM kernel. Threadblock scheduling part is improved. Some computation can be moved to the host side if applicable. Grouped Syr2k kernels are added, too.
  • Optimizations for GEMM+Softmax. All the reduction computation is fused into the previous GEMM. More template arguments are provided to fine tune the performance.
  • Grouped GEMM for Multihead Attention. This general group gemm based MHA does not require the sequence length of all GEMMs to be the same which makes it most useful for natural language processing.
  • GEMM + Layer norm fusion for Ampere splits the layernorm into two parts and both of them can be fused into the GEMMs before and after separately. In addition to use square sum to compute variance of layernorm, Shift-K is provided if square sum raise numerical issues.
  • GEMM Epilogue Permutation Fusion can apply user provided permutation layout mapping in the GEMM epilogue.
  • Grouped convolution targeting implicit GEMM introduces the first group convolution implementation to CUTLASS. It is an Analytical implementation, not an Optimized. The restrictions are: 1) input and output channel number should be multiple of group number. 2) split-K is not supported. The implementation has 2 modes:
    • kSingleGroup: output channel per group is multiple of Threadblock tile N.
    • kMultipleGroup: Threadblock tile N is multiple of output channel per group.
  • Depthwise separable convolution introduces the first depthwise convolution which is also Analytical for now. The restrictions are: 1) SIMT only 2) No split-K 3) input channel equals to output channel equals to group number.
  • Standalone Layernorm and Pooling kernels.
  • Back-to-back GEMM/CONV relaxes the requirement that the first GEMM K dimension needs to be the multiple of Threadblock Tile K dimension.
  • Optimal performance using CUDA 11.6u2
  • Updates and bugfixes from the community (thanks!)

2.9.0 (2022-04-21)

  • First layer Convolution kernels specialized for small channel counts and reduced alignment
  • BLAS3 operators accelerated by Tensor Cores
  • CUTLASS Python demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using CUDA Python
  • GEMM + Softmax example
  • Gather and Scatter Fusion with GEMM can gather inputs and scatters outputs based on indices vectors in the same GEMM kernel.
    • It can select random rows in a row major matrix.
    • It can select random columns in a column major matrix.
  • Back-to-back GEMM/CONV fully supports buffering the first GEMM/CONV results in the shared memory for the latter one to use. It can eliminate register spill when the tile size is big. Additionally, bias vector add is supported in the first GEMM/CONV.
    • Supported kernels: GEMM and CONV.
    • Supported types: fp16 and int8.
    • Supported architectures: Turing and Ampere.
  • Transposed Convolution (a.k.a Deconvolution) support which reuses Dgrad implementation.
  • Utility functions that can pad NHWC and convert between NCHW and NHWC.
  • Small alignment implicit gemm support for Fprop/Dgrad/Wgrad so that padding is no longer mandated to use tensor cores in these kernels.
  • Epilogue enhancement:
    • Eliminate bank conflicts in int8 tensor core kernels.
    • Half2 usage if epilogue compute type is fp16.
    • More activation functions: Silu, Hardswish, Leaky Relu.
    • New elementwise fusion pattern for residual block.
  • Group GEMM thread block number calculation fix which helps to launch the intended number of threadblocks to fully occupy the GPUs.
  • Parallel GEMM splitk support in the CUTLASS profiler.
  • Optimal performance using CUDA 11.6u2
  • Updates and bugfixes from the community (thanks!)

2.8.0 (2021-11-19)

2.7.0 (2021-09-24)

2.6.1 (2021-09-03)

  • Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators)
  • Tuning for GEMMs fused with partial reductions
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!

2.6.0 (2021-07-22)

  • Optimal performance when compiled with the CUDA 11.4 Toolkit
  • Fused operators with GEMM and Convolution
  • 64b tensor strides and leading dimensions support for GEMMs
  • Affine rank=2 matrix layouts
  • Batched GEMV preview implementation
  • New strided Dgrad implementation
    • Accelerates over previous implementation by cutting down redundant math by 4x
    • Support using new Dy and w analytic iterators and existing cutlass::conv::device::ImplicitGemmConvolution interface
  • Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores)
  • Many improvements to the epilogue.
    • Provide an option to not fully unroll the epilogue to reduce the code size and improve the performance when using complicated elementwise operations
    • Performance improvement for FP16 tensor core kernels
    • Bug fixes
  • Enhanced Clang support and the combination of Clang 13 and CUDA 11.4 can build and run kernels from Pascal and Ampere.
  • Updated minimum CUDA Toolkit requirement to 10.2
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!

2.5.0 (2021-02-26)

  • Tensor reductions
    • m-to-n reductions of tensors with affine layout
    • Specializations for reductions including contiguous dimension
    • Specializations for reductions excluding contiguous dimension
    • Custom reduction functors such as cutlass::logical_and
    • Large tensor support, up to 2^63 elements (however, each dimension is limited to an extent of 2^31)
  • Optimizations for 3-D convolution
  • Fused Convolution+Convolution example
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!

2.4.0 (2020-11-19)

  • Implicit GEMM convolution kernels supporting CUDA and Tensor Cores on NVIDIA GPUs
    • Operators: forward (Fprop), backward data gradient (Dgrad), and backward weight gradient (Wgrad) convolution
    • Data type: FP32, complex, Tensor Float 32 (TF32), BFloat16 (BF16), Float16, Int4, Int8, Int32
    • Spatial dimensions: 1-D, 2-D, and 3-D
    • Layout: NHWC, NCxHWx
  • Implicit GEMM convolution components:
    • Global memory iterators supporting Fprop, Dgrad, and Wgrad
    • MmaMultistage for implicit GEMM convolution for NVIDIA Ampere architecture
    • MmaPipeline for implicit GEMM convolution for NVIDIA Volta and Turing architectures
    • Documentation describing Implicit GEMM Convolution algorithm and implementation

2.3.0 (2020-09-23)

2.2.0 (2020-06-08)

  • NVIDIA Ampere Architecture features
    • Fast Tensor Core operations:
    • Maximum performance via mma.sync
    • Tensor Float 32, BFloat16, and double-precision data types
    • Mixed integer data types (int8, int4, bin1)
    • Asynchronous copy for deep software pipelines via cp.async
    • Described in GTC 2020 Webinar (SR 21745) (free registration required)
  • Features:
    • SDK examples showing GEMM fused with bias+relu and fused GEMM+GEMM
    • Complex-valued GEMMs targeting NVIDIA Ampere Tensor Cores in double-precision and Tensor Float 32
    • Gaussian complex GEMMs using 3m complex multiply algorithm
    • Universal GEMM kernel supporting two batch modes and two algorithms for parallel reductions
  • Policy updates:
    • CUDA 11 Toolkit needed to enable NVIDIA Ampere Architecture features
    • Disabled F16C by default for compatibility - enable on cmake command line with -DCUTLASS_ENABLE_F16C=ON

2.1.0 (2020-04-06)

  • BLAS-style host-side API added to CUTLASS Library
    • API to launch compiled kernel instances for GEMM and planar complex GEMM
  • Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
  • Minor enhancements and bug fixes

2.0.0 (2019-11-19)

  • Substantially refactored for
    • Better performance, particularly for native Turing Tensor Cores
    • Robust and durable templates spanning the design space
    • Encapsulated functionality embodying modern C++11 programming techniques
    • Optimized containers and data types for efficient, generic, portable device code
  • Updates to:
  • Native Turing Tensor Cores
    • Efficient GEMM kernels targeting Turing Tensor Cores
    • Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
  • Coverage of existing CUTLASS functionality
    • GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
    • Volta Tensor Cores through native mma.sync and through WMMA API
    • Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
    • Batched GEMM operations
    • Complex-valued GEMMs
  • Note: a host compiler supporting C++11 or greater is required.

CUTLASS 1.x

1.3.2 (2019-07-09)

  • Performance improvement for Volta Tensor Cores TN and TT layouts.

1.3.1 (2019-04-09)

  • Corrected NVRTC unit tests.

1.3.0 (2019-03-20)

  • Efficient GEMM kernel targeting Volta Tensor Cores via mma.sync instruction added in CUDA 10.1.

1.2.0 (2018-10-26)

  • Parallelized reductions across threadblocks ("Split-K")
    • Improved IGEMM performance
  • Batched strided WMMA GEMMs

1.1.0 (2018-09-19)

  • Turing Features
    • WMMA GEMM targeting TensorCores - INT8, INT4, 1-bit
  • Batched Strided GEMM
  • Threadblock rasterization strategies
    • Improved performance for adverse problem sizes and data layouts
  • Extended CUTLASS Core comonents
    • Tensor views support arbitrary matrix and tensor layouts
    • Zip iterators for structuring multiple data streams
  • Enhanced CUTLASS utilities
    • Reference code for tensor operations in host and device code
    • Added HostMatrix<> for simplified matrix creation
  • Examples
    • Basic GEMM, tensor views, CUTLASS utilities, batched GEMM, WMMA GEMM

1.0.1 (2018-06-11)

  • Intra-threadblock reduction added for small threadblock tile sizes
    • sgemm_64x128x16, sgemm_128x128x16, sgemm_128x64x16, sgemm_128x32x16, sgemm_64x64x16, sgemm_64x32x16
    • igemm_32x32x128
  • GEMM K residue handled during prologue prior to mainloop
  • Replaced Google Test copy with submodule. Use git submodule init --recursive --update

1.0.0 (2018-05-16)

  • Substantial rewrite to accommodate new architecture
  • Kernels: SGEMM, DGEMM, IGEMM, HGEMM, WMMA GEMM
  • Unit and performance tests

0.0.1 (2017-12-04)

  • Initial release

Copyright (c) 2017 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: BSD-3-Clause

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.