# Example 59: Ampere gather/scatter convolution

CuTe and CUTLASS 3.x based Ampere convolution forward propagation kernel capable of operating on both affine and gather/scatter tensors.
Example executions:

```sh
./59_ampere_gather_scatter_conv
./59_ampere_gather_scatter_conv --n=108
./59_ampere_gather_scatter_conv --n=4096 --i=1
./59_ampere_gather_scatter_conv --n=1080 --i=1000
./59_ampere_gather_scatter_conv --n=131072 --i=1000 --no-check
```
This example demonstrates a few super cool features of CUTLASS and CuTe. It shows off

1. A dense conv 3D fprop kernel written as a single file ...
2. ... that leverages off-the-shelf CUTLASS collectives to show how custom kernels can use collectives ...
3. ... and uses the exact same templated kernel to also stamp out a gather/scatter 3D fprop conv ...
4. ... while getting near peak performance of the Ampere-class tensor cores on Ampere and Ada GPUs ...
5. ... by using static cute shapes and strides when problem shapes are known at compile time.
## A dense conv 3D fprop kernel written in CUTLASS 3.x and CuTe

The most common strategy for implementing high performance convolution kernels on the GPU is to transform
the activation tensor in such a way that we can perform the computation as a GEMM. This is called the
image to column (im2col) transformation. The [CUTLASS 2.x implementation of im2col based convolutions is
documented separately](../../media/docs/implicit_gemm_convolution.md), and here we consider a fresh approach for CuTe.
A 3D convolution has the following input tensors:

- Activation tensor (Act): `((N,(D,H,W)), (C,(1,1,1)))`
- Filter tensor (Flt): `( K, (C,(T,R,S)))`
- Output tensor (Out): `((N,(Z,P,Q)), K )`
Where

- N := number of images
- DHW := spatial dimensions of the activation tensor
- C := channel dimension of the activation tensor
- K := channel dimension of the filter and output tensor
- TRS := spoke dimensions of the filter tensor
- ZPQ := spatial dimensions of the output tensor
As is evident in the tensor shapes, these cannot be issued to a GEMM just yet, since there are no
logical M, N, and K modes we can group the tensor modes into.
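
As a concrete illustration (a sketch, not the example's exact code; it assumes a fully packed
NDHWC memory layout and that `N, D, H, W, C` hold the tensor extents), the activation tensor above
can be written down directly in CuTe. The size-1 spoke modes already carry the DHW strides, which
is what lets im2col later widen them to `(T,R,S)` without touching the data:

```cpp
// Activation layout ((N,(D,H,W)), (C,(1,1,1))) over packed NDHWC memory.
// The three size-1 modes are placeholders for the filter spokes (T,R,S).
auto act_layout = make_layout(
  make_shape (make_shape (      N,     D,   H, W), make_shape (  C,  _1{}, _1{}, _1{})),
  make_stride(make_stride(D*H*W*C, H*W*C, W*C, C), make_stride(_1{}, H*W*C, W*C,    C)));
```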
Notice that every spoke of the filter tensor (TRS) will be applied to some (offset) view of the
activation tensor, thus expanding the logical size of the activation tensor.
Additionally, a similar logical transform of the spatial dimensions can be encoded as a function of the
padding, dilations, traversal strides, and filter spokes. This gets us to our im2col transform.
The im2col transform affects the component shapes/strides of the activation tensor in the following way:

- ZPQ Shape   : changes the DHW domain with the formula `(1 + (DHW + pad - (((TRS-1) * dilation) + 1)) / traversal_stride)`
- TRS Shape   : TRS domain instead of `(1,1,1)`
- ZPQ Strides : Original DHW strides get `elem_scale()`-ed by the traversal strides DHW
- TRS Strides : Original DHW strides get `elem_scale()`-ed by the dilation DHW
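
As a sketch of the ZPQ shape formula above (assuming `pad` here denotes the total padding added
along a dimension, i.e. both sides combined):

```cpp
// Illustrative helper evaluating the ZPQ extent formula from the list above.
constexpr int output_extent(int DHW, int pad, int TRS, int dilation, int traversal_stride) {
  return 1 + (DHW + pad - (((TRS - 1) * dilation) + 1)) / traversal_stride;
}
// e.g. output_extent(/*DHW=*/8, /*pad=*/2, /*TRS=*/3, /*dilation=*/1, /*traversal_stride=*/1) == 8
```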
With this transform applied, we end up with a set of input and output tensors that
are logically consistent in their MNK dimensions, thus allowing us to dispatch to a GEMM.

```
im2col activation layout: ((N,(Z,P,Q)), (C,(T,R,S))) // logical (M,K)
filter layout           : ( K,          (C,(T,R,S))) // logical (N,K)
output layout           : ((N,(Z,P,Q)),  K          ) // logical (M,N)
```
CuTe's layout representation and algebra make these folded tensors easy to represent and manipulate.
This is most evident in the reference check code used in this example:
```cpp
// Reference check: a plain GEMM-style triple loop over the logical (M, N, K)
// modes of the im2col-transformed tensors.
for (size_t logical_m = 0; logical_m < size<0>(mOutputRef); ++logical_m) {
  for (size_t logical_n = 0; logical_n < size<1>(mOutputRef); ++logical_n) {
    auto accumulator = float(0);
    for (size_t logical_k = 0; logical_k < size<1>(mStencil); ++logical_k) {
      accumulator += mStencil(logical_m, logical_k) * mActivation(logical_n, logical_k);
    }
    mOutputRef(logical_m, logical_n) = accumulator;
  }
}
```
This succinctly demonstrates how the im2col transform allows us to implement convolutions
as GEMMs with special layout transformations on the input tensor.

Note: in the example kernel's implementation we treat activations as the B tensor
and filter as the A tensor, thus making their logical dimensions NK and MK respectively.
## Leveraging CUTLASS collectives off the shelf in a custom kernel

Now that we have transformed our problem in a way that allows us to dispatch to a GEMM,
we can reuse much of the machinery CUTLASS offers to implement this forward pass convolution
operator. CUTLASS decomposes the "moving parts" of GPU linear algebra into reusable,
modular software components abstracted by C++ template classes. This example
demonstrates how some of the lower layers of the hierarchy can be re-used for custom kernels
by writing a custom kernel for convolution that re-uses the Ampere/Ada GEMM collectives
from CUTLASS 3.
A kernel author is free to compose their custom components with any of the existing templates
in the CUTLASS hierarchy to leverage existing high performance implementations from the CUTLASS
team. In this example, we write a custom kernel layer and compose it with an existing collective.
However, any of the CUTLASS kernels can be composed with bespoke collectives if the desired
customization is a mainloop or epilogue fusion without changes to the grid planning,
tile scheduling, load balancing, or thread marshalling.
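
As a rough sketch of this layering (the type and member names below are placeholders, not the real
CUTLASS 3 interfaces), the kernel layer owns grid planning and tile scheduling while delegating the
inner loop to an off-the-shelf collective:

```cpp
// Simplified sketch only: a custom kernel-layer class that drives reusable collectives.
template <class CollectiveMainloop, class CollectiveEpilogue>
struct CustomConvKernel {
  template <class TensorA, class TensorB, class TensorC>
  CUTLASS_DEVICE void operator()(TensorA gA, TensorB gB, TensorC gC, char* smem_buf) {
    // Kernel layer: decide which output tile this CTA computes
    // (grid planning, tile scheduling, and load balancing live here).
    auto cta_coord = make_coord(int(blockIdx.x), int(blockIdx.y));

    // Collective layer: a highly tuned, CTA-wide temporal micro-kernel that
    // runs the mainloop over K tiles for the selected output tile.
    auto accumulators = CollectiveMainloop{}.run(gA, gB, cta_coord, smem_buf);

    // Epilogue collective handles the writeback (and any fusions).
    CollectiveEpilogue{}.run(accumulators, gC, cta_coord);
  }
};
```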
## Implementing gather/scatter and dense convolution with the same kernel

Functionality and correctness of the implemented kernel, by virtue of using
CuTe and off-the-shelf CUTLASS collectives, rely only on the logical consistency of
the layouts of the input and output tensors. This means that we can freely change how
the logical coordinates of the tensors map into the index space, and even how these dereferences
happen. [CUTLASS example 52](../52_hopper_gather_scatter_fusion/) demonstrates this by implementing a custom stride that
supports indexed indirection for tensor data accesses. This allows example 52
to implement a GEMM where the inputs are gathered and the output is scattered based on an index buffer.
We re-use the same custom stride utilities in this example to implement a convolution kernel
that gathers along the NDHW dimensions of the activation tensor and scatters the output along the
NZPQ dimensions of the output tensor, treating the channel dimensions as the dense vectors.

Our dense, affine, im2col-transformed activation tensor:
```cpp
// im2col transformed activation layout: ((nzpq), (ctrs)) => idx
auto xformed_act_layout = make_layout(
  make_shape (make_shape (      N,     Z,   P, Q), make_shape (  C,     T,   R, S)),
  make_stride(make_stride(D*H*W*C, H*W*C, W*C, C), make_stride(_1{}, H*W*C, W*C, C)));
```
now becomes a composed layout that uses `IndexedGather`:
```cpp
// Inner layout of the composition:
// ((nzpq), (ctrs)) => (idx_buffer_idx, dense_offset)
auto EG = E<0>{};  // Gather basis     (1,0) (idx_buffer_idx)
auto EC = E<1>{};  // Contiguous basis (0,1) (dense_offset)
auto xformed_act_logical_inner = make_layout(
  make_shape (make_shape (       N,      Z,    P, Q ), make_shape ( C,      T,    R, S )),
  make_stride(make_stride(D*H*W*EG, H*W*EG, W*EG, EG), make_stride(EC, H*W*EG, W*EG, EG)));

// Outer layout of the composition:
// (idx_buffer_idx, dense_offset) => idx
// IndexedGather obtains idx by applying (gmem_base_ptr + gather_idx_buf[idx_buffer_idx] + dense_offset)
auto xformed_act_gather_outer = make_layout(
  make_shape(_1{},_1{}),
  make_stride(CustomStride{IndexedGather{gather_idx_buf}, C}, _1{}));

// Compose the inner and outer layouts
// ((nzpq), (ctrs)) => idx
auto xformed_act_composed_layout = composition(
  xformed_act_gather_outer,
  make_arithmetic_tuple(_0{}, _0{}),
  xformed_act_logical_inner);
```
Here, we create a composed layout whose inner layout has the same logical MK shape as earlier,
but whose outer layout uses the custom strides with an index buffer to access memory through an
indirection. A custom stride requires two inputs to compute the index that a certain coordinate maps to:
the index buffer offset and the dense offset into the vector. This means that our inner layout
(the one with the logical MK shape) has a rank-2 codomain `(idx_buffer_idx, dense_offset)`.
We can set up such a layout with scaled basis strides, which allow us to map a domain onto a
codomain with multiple orthogonal bases. The two codomain bases are the
index buffer offsets (rank 0 basis) and the dense vector offsets (rank 1 basis).
A similar composed layout is set up for the output scatter tensor, sketched below.
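
A hypothetical sketch of that output layout, mirroring the activation gather layout above (the
`scatter_idx_buf` index buffer name is illustrative, not the example's actual variable):

```cpp
// ((nzpq), k) => (idx_buffer_idx, dense_offset)
auto out_scatter_inner = make_layout(
  make_shape (make_shape (       N,      Z,    P, Q ), K ),
  make_stride(make_stride(Z*P*Q*EG, P*Q*EG, Q*EG, EG), EC));

// (idx_buffer_idx, dense_offset) => idx, indirected through the scatter index buffer
auto out_scatter_outer = make_layout(
  make_shape(_1{},_1{}),
  make_stride(CustomStride{IndexedGather{scatter_idx_buf}, K}, _1{}));

// ((nzpq), k) => idx
auto out_scatter_layout = composition(
  out_scatter_outer,
  make_arithmetic_tuple(_0{}, _0{}),
  out_scatter_inner);
```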
This tensor still has a logical MK shape and is backed by a CuTe layout, which means we can still
tile, partition, and otherwise manipulate it with CuTe's layout algebra in the same way we would any
other tensor. Swapping the activation tensor's affine layout for this gather layout requires
no changes to the implementation of the kernel whatsoever. Everything composes. This example
stamps out a dense 3D convolution as well as a gather/scatter 3D convolution using the same kernel template,
with the only difference between them being the layouts of the input and output tensors.
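
For instance (a sketch; `act_ptr` is an illustrative pointer to the activation data in global memory),
both variants bind a layout to a CuTe tensor in exactly the same way, and it is that tensor which the
kernel template is handed:

```cpp
// Dense/affine variant vs. gather variant: same construction, different layout.
auto mAct_dense  = make_tensor(make_gmem_ptr(act_ptr), xformed_act_layout);
auto mAct_gather = make_tensor(make_gmem_ptr(act_ptr), xformed_act_composed_layout);
```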
Convolutions are just a special case of tensor contractions, and as [example 51](../51_hopper_gett)
demonstrates, the exact same collective used in this example can also be used to implement arbitrary GETTs.
Of course, this also means that the same kernel can implement gather/scatter GETTs as well!
This demonstrates the composition power of not just CuTe, but also CUTLASS 3's two-level
micro-kernel abstraction. A single highly tuned temporal micro-kernel (collective) can be implemented once
and applied to compute dense GETTs, gather/scatter GETTs, dense convolutions, and gather/scatter convolutions.

## Peak performance on Ampere and Ada GPUs by leveraging domain-specific knowledge
Often, when implementing custom kernels, a user has more knowledge of the problem domain that can be
exploited to deliver higher performance than would otherwise be possible with general-purpose kernels. In this example
we presume that the shapes of the images (DHWC dimensions) as well as of the filter (TRS) are known
a priori and that the tile shape evenly divides the problem. The number of images (N) is still left as a runtime
parameter.
Knowing the extents of our tensors at compile time allows us to encode them as static cute shapes rather than
a dynamic problem shape, eliminating most of the index computation instructions such as
expensive div/mods. Knowing that the problem shape is divisible by the tile shape allows us to use the
Ampere collective that does not perform predication on global memory loads, further reducing overheads
and allowing us to achieve near peak performance on RTX Ampere and Ada GPUs.
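
In CuTe terms, "static shapes" simply means the extents are compile-time integral constants
(the concrete values below are examples, not the ones hard-coded by this kernel):

```cpp
// Compile-time extents: layout arithmetic over these folds into constants,
// eliminating runtime div/mod instructions from index computation.
auto C = cute::Int<64>{};
auto D = cute::Int<8>{};
auto H = cute::Int<8>{};
auto W = cute::Int<8>{};

// The image count remains a runtime value (supplied via --n in this example).
int  N = 131072;
```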
Running this example on an RTX 3080Ti prints the following performance numbers (some output culled for brevity):

```
$> ./examples/59_ampere_gather_scatter_conv/59_ampere_gather_scatter_conv --n=131072 --i=128 --no-check
Ampere convolution forward propogation kernel supporting both affine and gather/scatter tensors.

Allocating tensors ... done.
Initializing data ... done.
Initializing gather/scatter index buffers ... done.

Running dense fprop kernel
Conv TFLOP count = 0.927713
Conv dense perf: 31.027376ms | TFLOP/s = 29.899819

Running gather/scatter fprop kernel
Conv TFLOP count = 0.927713
Conv gather/scatter perf: 28.973721ms | TFLOP/s = 32.019117
```
With this in mind, this example kernel has the following limitations:

- This example kernel only supports a dynamic image count; all other conv problem shapes must be defined as `cute::Constant<>`s
- Problem shapes (including the dynamic image count `N`) must be evenly divisible by the tile shape
- It does not perform fp32->tf32 numeric conversion; gmem inputs must already be rounded to tf32