mirror of https://github.com/tracel-ai/burn.git
Module weight quantization (#2000)
* Add q_into_data and q_reshape
* Fix tch quantize f16 and q_into_data
* Convert to actual dtype/kind in dequantize
* Add module quantization and q_from_data
* Fix clippy
* Add documentation
* Handle deserialize data conversion
* Fix typo
* Add calibration tests
* Fix clippy precision
* Add QTensorOps require_grad methods to avoid dequantizing
* Add Dequantize mapper docs
* Remove dead code
This commit is contained in:
parent a4123f6c2e
commit 3afff434bd
@@ -25,6 +25,7 @@
- [ONNX Model](./import/onnx-model.md)
- [PyTorch Model](./import/pytorch-model.md)
- [Models & Pre-Trained Weights](./models-and-pretrained-weights.md)
- [Quantization (Beta)](./quantization.md)
- [Advanced](./advanced/README.md)
- [Backend Extension](./advanced/backend-extension/README.md)
- [Custom WGPU Kernel](./advanced/backend-extension/custom-wgpu-kernel.md)
@@ -0,0 +1,122 @@
# Quantization (Beta)

Quantization techniques perform computations and store tensors in lower precision data types like
8-bit integer instead of floating point precision. There are multiple approaches to quantize a deep
learning model, categorized as:

- Post-training quantization (PTQ)
- Quantization aware training (QAT)

In post-training quantization, the model is trained in floating point precision and later converted
to the lower precision data type.

There are two types of post-training quantization:

1. Static quantization: quantizes the weights and activations of the model. Quantizing the
   activations statically requires calibration (i.e., recording the activation values on
   representative data to compute the optimal quantization parameters).
2. Dynamic quantization: quantizes the weights ahead of time (like static quantization) but the
   activations are quantized dynamically at runtime.

Sometimes post-training quantization is not able to achieve acceptable task accuracy. This is where
quantization aware training comes into play, as it models the effects of quantization during
training. Quantization errors are thus modeled in the forward and backward passes using fake
quantization modules, which helps the model learn representations that are more robust to the
reduction in precision.

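To give a concrete picture of what a fake quantization module does, here is a minimal,
framework-agnostic sketch (not Burn's API): a value is quantized to 8-bit and immediately
dequantized, so the rest of the forward pass observes the rounding error introduced by the reduced
precision. The symmetric scale used below is just an assumed example value.

```rust
/// Minimal illustration of fake quantization (not Burn's implementation):
/// quantize to int8 and immediately dequantize so downstream computations
/// see the rounding error introduced by the reduced precision.
fn fake_quantize(x: f32, scale: f32) -> f32 {
    // Symmetric per-tensor int8 mapping.
    let q = (x / scale).round().clamp(-127.0, 127.0);
    q * scale
}

fn main() {
    let scale = 1.8 / 127.0; // assumed scale, e.g. max absolute value / 127
    for x in [-1.8f32, -1.0, 0.0, 0.5] {
        println!("{x} -> {}", fake_quantize(x, scale));
    }
}
```
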
<div class="warning">

Quantization support in Burn is currently in active development.

It supports the following modes on some backends:

- Static per-tensor quantization to signed 8-bit integer (`i8`)

No integer operations are currently supported, which means tensors are dequantized to perform the
operations in floating point precision.

</div>

## Module Quantization

Quantizing the weights of your model after training is quite simple. We have access to the weight
tensors and can collect their statistics, such as the min and max value when using
`MinMaxCalibration`, to compute the quantization parameters.

```rust, ignore
# use burn::quantization::{MinMaxCalibration, QuantizationScheme, QuantizationType, Quantizer};
#
// Quantization config
let mut quantizer = Quantizer {
    calibration: MinMaxCalibration {
        scheme: QuantizationScheme::PerTensorSymmetric(QuantizationType::QInt8),
    },
};

// Quantize the weights
let model = model.quantize_weights(&mut quantizer);
```

> Given that all operations are currently performed in floating point precision, it might be wise
> to dequantize the module parameters before inference. This allows us to save disk space by
> storing the model in reduced precision while preserving the inference speed.
>
> This can easily be implemented with a `ModuleMapper`.
>
> ```rust, ignore
> # use burn::module::{ModuleMapper, ParamId};
> # use burn::tensor::{backend::Backend, Tensor};
> #
> /// Module mapper used to dequantize the model params being loaded.
> pub struct Dequantize {}
>
> impl<B: Backend> ModuleMapper<B> for Dequantize {
>     fn map_float<const D: usize>(
>         &mut self,
>         _id: &ParamId,
>         tensor: Tensor<B, D>,
>     ) -> Tensor<B, D> {
>         tensor.dequantize()
>     }
> }
>
> // Load saved quantized model in floating point precision
> model = model
>     .load_file(file_path, recorder, &device)
>     .expect("Should be able to load the quantized model weights")
>     .map(&mut Dequantize {});
> ```

### Calibration

Calibration is the step during quantization where the range of all floating-point tensors is
computed. This is pretty straightforward for weights since the actual range is known at
_quantization-time_ (weights are static), but activations require more attention.

To compute the quantization parameters, Burn supports the following `Calibration` methods.

| Method              | Description                                                                        |
| :------------------ | :--------------------------------------------------------------------------------- |
| `MinMaxCalibration` | Computes the quantization range mapping based on the running min and max values.  |

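To make the calibration step concrete, the standalone sketch below computes per-tensor `i8`
quantization parameters from an observed min/max range, using the same example values as the
calibration tests added in this commit. The formulas (symmetric scale from the largest absolute
value, affine scale and zero point from the full range) follow a common int8 convention and are
shown only as an illustration, not as a copy of Burn's internals.

```rust
// Illustration of computing per-tensor int8 quantization parameters from the
// observed min/max range (what `MinMaxCalibration` records for a weight tensor).
fn main() {
    let values = [-1.8f32, -1.0, 0.0, 0.5];
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);

    // Symmetric: the zero point is fixed at 0 and the scale covers the largest magnitude.
    let symmetric_scale = min.abs().max(max.abs()) / 127.0;

    // Affine: the scale covers the full [min, max] range and a zero point shifts the
    // mapping so that 0.0 lands exactly on an integer value.
    let affine_scale = (max - min) / 255.0;
    let zero_point = (-128.0 - (min / affine_scale)).round() as i8;

    println!("symmetric scale: {symmetric_scale}"); // ~0.0142
    println!("affine scale: {affine_scale}, zero point: {zero_point}"); // ~0.0090, 72
}
```
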
### Quantization Scheme

A quantization scheme defines the quantized type, quantization granularity and range mapping
technique.

Burn currently supports the following `QuantizationType` variants.

| Type    | Description                        |
| :------ | :--------------------------------- |
| `QInt8` | 8-bit signed integer quantization. |

Quantization parameters are defined based on the range of values to represent and can typically be
calculated for the layer's entire weight tensor with per-tensor quantization or separately for each
channel with per-channel quantization (commonly used with CNNs).

Burn currently supports the following `QuantizationScheme` variants.

| Variant              | Description                                                                                                      |
| :------------------- | :--------------------------------------------------------------------------------------------------------------- |
| `PerTensorAffine`    | Computes the quantization parameters for the whole tensor and applies an affine range mapping with zero point.  |
| `PerTensorSymmetric` | Computes the quantization parameters for the whole tensor and applies a scale range mapping centered around 0.  |

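To illustrate the difference between the two range mappings, here is a small standalone sketch
(an assumed convention for illustration, not Burn's actual `QuantizationStrategy` code) that
quantizes and dequantizes a value with per-tensor affine and symmetric parameters like the ones
computed above.

```rust
// Simplified per-tensor int8 range mappings (illustration only).
fn affine_quantize(x: f32, scale: f32, zero_point: i8) -> i8 {
    ((x / scale).round() as i32 + zero_point as i32).clamp(-128, 127) as i8
}

fn affine_dequantize(q: i8, scale: f32, zero_point: i8) -> f32 {
    (q as i32 - zero_point as i32) as f32 * scale
}

fn symmetric_quantize(x: f32, scale: f32) -> i8 {
    // The symmetric mapping is the affine mapping with the zero point fixed at 0.
    (x / scale).round().clamp(-127.0, 127.0) as i8
}

fn symmetric_dequantize(q: i8, scale: f32) -> f32 {
    q as f32 * scale
}

fn main() {
    let (scale, zero_point) = (0.009_019_608, 72);
    let q = affine_quantize(0.5, scale, zero_point);
    println!("affine: 0.5 -> {q} -> {}", affine_dequantize(q, scale, zero_point));

    let scale = 0.014_173_228;
    let q = symmetric_quantize(0.5, scale);
    println!("symmetric: 0.5 -> {q} -> {}", symmetric_dequantize(q, scale));
}
```
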
@@ -1,12 +1,19 @@
use burn_tensor::{
    backend::Backend,
    ops::{FloatTensor, QTensorOps, QuantizedTensor},
    Device, QuantizationStrategy, Shape,
    Device, QuantizationStrategy, Shape, TensorData,
};

use crate::{checkpoint::strategy::CheckpointStrategy, Autodiff};

impl<B: Backend, C: CheckpointStrategy> QTensorOps<Self> for Autodiff<B, C> {
    fn q_from_data<const D: usize>(
        _data: TensorData,
        _device: &Device<Self>,
    ) -> QuantizedTensor<Self, D> {
        todo!()
    }

    fn quantize<const D: usize>(
        _tensor: FloatTensor<Self, D>,
        _strategy: &QuantizationStrategy,

@@ -28,4 +35,18 @@ impl<B: Backend, C: CheckpointStrategy> QTensorOps<Self> for Autodiff<B, C> {
    fn q_device<const D: usize>(tensor: &QuantizedTensor<Self, D>) -> Device<Self> {
        B::q_device(tensor)
    }

    fn q_reshape<const D1: usize, const D2: usize>(
        tensor: QuantizedTensor<Self, D1>,
        shape: Shape<D2>,
    ) -> QuantizedTensor<Self, D2> {
        B::q_reshape(tensor, shape)
    }

    async fn q_into_data<const D: usize>(
        tensor: QuantizedTensor<Self, D>,
        strategy: QuantizationStrategy,
    ) -> TensorData {
        B::q_into_data(tensor, strategy).await
    }
}

@@ -1,7 +1,7 @@
use burn_tensor::{
    backend::Backend,
    ops::{FloatTensor, QTensorOps, QuantizedTensor},
    Device, QuantizationStrategy, Shape,
    DType, Device, QuantizationStrategy, Shape, TensorData,
};

use crate::{

@@ -10,6 +10,13 @@ use crate::{
};

impl<F: FloatCandleElement, I: IntCandleElement> QTensorOps<Self> for Candle<F, I> {
    fn q_from_data<const D: usize>(
        data: TensorData,
        device: &Device<Self>,
    ) -> QuantizedTensor<Self, D> {
        unimplemented!() // no i8 support
    }

    fn quantize<const D: usize>(
        _tensor: FloatTensor<Self, D>,
        _strategy: &QuantizationStrategy,

@@ -31,4 +38,18 @@ impl<F: FloatCandleElement, I: IntCandleElement> QTensorOps<Self> for Candle<F,
    fn q_device<const D: usize>(tensor: &QuantizedTensor<Self, D>) -> Device<Self> {
        super::base::device(tensor)
    }

    fn q_reshape<const D1: usize, const D2: usize>(
        tensor: QuantizedTensor<Self, D1>,
        shape: Shape<D2>,
    ) -> QuantizedTensor<Self, D2> {
        super::base::reshape(tensor, shape)
    }

    async fn q_into_data<const D: usize>(
        tensor: QuantizedTensor<Self, D>,
        strategy: QuantizationStrategy,
    ) -> TensorData {
        super::base::into_data(tensor)
    }
}

@@ -33,6 +33,9 @@ pub mod module;
/// Neural network module.
pub mod nn;

/// Quantization module.
pub mod quantization;

/// Module for the recorder.
pub mod record;

@@ -1,5 +1,6 @@
use super::ParamId;
use crate::{
    quantization::{Calibration, Quantizer},
    record::Record,
    tensor::backend::{AutodiffBackend, Backend},
};

@@ -202,6 +203,11 @@ pub trait Module<B: Backend>: Clone + Send + core::fmt::Debug {

        Ok(self.load_record(record))
    }

    /// Quantize the weights of the module.
    fn quantize_weights<C: Calibration>(self, quantizer: &mut Quantizer<C>) -> Self {
        self.map(quantizer)
    }
}

/// Module visitor trait.

@@ -0,0 +1,80 @@
use burn_tensor::{
    backend::Backend, AffineQuantization, ElementConversion, Quantization, QuantizationStrategy,
    SymmetricQuantization, Tensor,
};

use super::{QuantizationScheme, QuantizationType};

/// Calibration method used to compute the quantization range mapping.
pub trait Calibration {
    /// Configure the quantization strategy.
    fn configure<B: Backend, const D: usize>(&self, tensor: &Tensor<B, D>) -> QuantizationStrategy;
}

/// Computes the quantization range mapping based on the running min and max values.
pub struct MinMaxCalibration {
    /// Quantization scheme to be used.
    pub scheme: QuantizationScheme,
}

impl Calibration for MinMaxCalibration {
    fn configure<B: Backend, const D: usize>(&self, tensor: &Tensor<B, D>) -> QuantizationStrategy {
        let min = tensor.clone().min().into_scalar().elem::<f32>();
        let max = tensor.clone().max().into_scalar().elem::<f32>();

        match &self.scheme {
            QuantizationScheme::PerTensorAffine(dtype) => match dtype {
                QuantizationType::QInt8 => {
                    QuantizationStrategy::PerTensorAffineInt8(AffineQuantization::new(min, max))
                }
            },
            QuantizationScheme::PerTensorSymmetric(dtype) => match dtype {
                QuantizationType::QInt8 => QuantizationStrategy::PerTensorSymmetricInt8(
                    SymmetricQuantization::new(min, max),
                ),
            },
        }
    }
}

#[cfg(test)]
mod tests {

    use super::*;
    use crate::TestBackend;

    #[test]
    fn min_max_calibration_per_tensor_affine_int8() {
        let device = <TestBackend as Backend>::Device::default();
        let tensor = Tensor::<TestBackend, 1>::from_floats([-1.8, -1.0, 0.0, 0.5], &device);
        let calibration = MinMaxCalibration {
            scheme: QuantizationScheme::PerTensorAffine(QuantizationType::QInt8),
        };

        let strategy = calibration.configure(&tensor);

        if let QuantizationStrategy::PerTensorAffineInt8(q) = strategy {
            assert_eq!(q.scale, 0.009_019_608);
            assert_eq!(q.offset, 72);
        } else {
            panic!("Wrong quantization strategy");
        }
    }

    #[test]
    fn min_max_calibration_per_tensor_symmetric_int8() {
        let device = <TestBackend as Backend>::Device::default();
        let tensor = Tensor::<TestBackend, 1>::from_floats([-1.8, -1.0, 0.0, 0.5], &device);
        let calibration = MinMaxCalibration {
            scheme: QuantizationScheme::PerTensorSymmetric(QuantizationType::QInt8),
        };

        let strategy = calibration.configure(&tensor);

        if let QuantizationStrategy::PerTensorSymmetricInt8(q) = strategy {
            assert_eq!(q.scale, 0.014_173_228);
        } else {
            panic!("Wrong quantization strategy");
        }
    }
}

@@ -0,0 +1,7 @@
mod calibration;
mod quantize;
mod scheme;

pub use calibration::*;
pub use quantize::*;
pub use scheme::*;

@@ -0,0 +1,18 @@
use burn_tensor::{backend::Backend, Tensor};

use crate::module::{ModuleMapper, ParamId};

use super::Calibration;

/// Describes how to quantize a module.
pub struct Quantizer<C: Calibration> {
    /// The calibration method used in quantization.
    pub calibration: C,
}

impl<B: Backend, C: Calibration> ModuleMapper<B> for Quantizer<C> {
    fn map_float<const D: usize>(&mut self, _id: &ParamId, tensor: Tensor<B, D>) -> Tensor<B, D> {
        let strategy = self.calibration.configure(&tensor);
        tensor.quantize(strategy)
    }
}

@@ -0,0 +1,17 @@
/// Quantization data type.
pub enum QuantizationType {
    /// 8-bit signed integer.
    QInt8,
}

/// Quantization scheme.
pub enum QuantizationScheme {
    /// Per-tensor affine/asymmetric quantization.
    PerTensorAffine(QuantizationType),
    /// Per-tensor symmetric quantization.
    PerTensorSymmetric(QuantizationType),
    // /// Per-channel affine/asymmetric quantization.
    // PerChannelAffine,
    // /// Per-channel symmetric quantization.
    // PerChannelSymmetric,
}

@@ -1,7 +1,7 @@
use core::marker::PhantomData;

use super::{PrecisionSettings, Record};
use burn_tensor::{backend::Backend, Bool, Element, Int, Tensor, TensorData};
use burn_tensor::{backend::Backend, Bool, DType, Element, Int, Tensor, TensorData};
use serde::{Deserialize, Serialize};

#[cfg(not(feature = "record-backward-compat"))]

@@ -43,7 +43,12 @@ where
            e
        ))
    })?;
    Ok(data.convert::<E>())
    let data = if let DType::QFloat(_) = data.dtype {
        data // do not convert quantized tensors
    } else {
        data.convert::<E>()
    };
    Ok(data)
}
}

@@ -137,15 +142,25 @@ impl<B: Backend, const D: usize> Record<B> for Tensor<B, D> {
    type Item<S: PrecisionSettings> = FloatTensorSerde<S>;

    fn into_item<S: PrecisionSettings>(self) -> Self::Item<S> {
        FloatTensorSerde::new(self.into_data().convert::<S::FloatElem>())
        let data = self.into_data();
        let data = if let DType::QFloat(_) = data.dtype {
            data // do not convert quantized tensors
        } else {
            data.convert::<S::FloatElem>()
        };
        FloatTensorSerde::new(data)
    }

    fn from_item<S: PrecisionSettings>(item: Self::Item<S>, device: &B::Device) -> Self {
        Tensor::from_data(item.data.convert::<B::FloatElem>(), device)
        let data = if let DType::QFloat(_) = item.data.dtype {
            item.data // do not convert quantized tensors
        } else {
            item.data.convert::<B::FloatElem>()
        };
        Tensor::from_data(data, device)
    }
}

#[allow(deprecated)]
impl<B: Backend, const D: usize> Record<B> for Tensor<B, D, Int> {
    type Item<S: PrecisionSettings> = IntTensorSerde<S>;

@@ -158,7 +173,6 @@ impl<B: Backend, const D: usize> Record<B> for Tensor<B, D, Int> {
    }
}

#[allow(deprecated)]
impl<B: Backend, const D: usize> Record<B> for Tensor<B, D, Bool> {
    type Item<S: PrecisionSettings> = BoolTensorSerde;

@@ -1,12 +1,19 @@
use burn_tensor::{
    backend::Backend,
    ops::{QTensorOps, QuantizedTensor},
    Device, QuantizationStrategy, Shape,
    Device, QuantizationStrategy, Shape, TensorData,
};

use crate::{client::FusionClient, Fusion, FusionBackend};

impl<B: FusionBackend> QTensorOps<Self> for Fusion<B> {
    fn q_from_data<const D: usize>(
        _data: TensorData,
        _device: &Device<Self>,
    ) -> QuantizedTensor<Self, D> {
        unimplemented!()
    }

    fn quantize<const D: usize>(
        _tensor: <Self as Backend>::FloatTensorPrimitive<D>,
        _strategy: &QuantizationStrategy,

@@ -28,4 +35,18 @@ impl<B: FusionBackend> QTensorOps<Self> for Fusion<B> {
    fn q_device<const D: usize>(tensor: &QuantizedTensor<Self, D>) -> Device<Self> {
        tensor.client.device().clone()
    }

    fn q_reshape<const D1: usize, const D2: usize>(
        _tensor: QuantizedTensor<Self, D1>,
        _shape: Shape<D2>,
    ) -> QuantizedTensor<Self, D2> {
        unimplemented!()
    }

    async fn q_into_data<const D: usize>(
        _tensor: QuantizedTensor<Self, D>,
        _strategy: QuantizationStrategy,
    ) -> TensorData {
        unimplemented!()
    }
}

@@ -1,6 +1,6 @@
use burn_tensor::{
    ops::{FloatTensor, QTensorOps, QuantizedTensor},
    Device, QuantizationStrategy, Shape,
    Device, QuantizationStrategy, Shape, TensorData,
};

use crate::{FloatElement, IntElement, JitBackend, JitRuntime};

@@ -11,6 +11,13 @@ where
    F: FloatElement,
    I: IntElement,
{
    fn q_from_data<const D: usize>(
        _data: TensorData,
        _device: &Device<Self>,
    ) -> QuantizedTensor<Self, D> {
        todo!()
    }

    fn quantize<const D: usize>(
        _tensor: FloatTensor<Self, D>,
        _strategy: &QuantizationStrategy,

@@ -32,4 +39,18 @@ where
    fn q_device<const D: usize>(tensor: &QuantizedTensor<Self, D>) -> Device<Self> {
        tensor.device.clone()
    }

    fn q_reshape<const D1: usize, const D2: usize>(
        tensor: QuantizedTensor<Self, D1>,
        shape: Shape<D2>,
    ) -> QuantizedTensor<Self, D2> {
        super::reshape(tensor, shape)
    }

    async fn q_into_data<const D: usize>(
        _tensor: QuantizedTensor<Self, D>,
        _strategy: QuantizationStrategy,
    ) -> TensorData {
        unimplemented!()
    }
}

@@ -1,10 +1,12 @@
use burn_tensor::{
    ops::{FloatTensor, QTensorOps, QuantizedTensor},
    Quantization, QuantizationStrategy, Shape, TensorData,
    DType, Quantization, QuantizationStrategy, Shape, TensorData,
};

use crate::{element::NdArrayElement, FloatNdArrayElement, NdArray, NdArrayDevice, NdArrayTensor};

use super::NdArrayOps;

fn into_data<E: NdArrayElement, const D: usize>(tensor: NdArrayTensor<E, D>) -> TensorData {
    let shape = tensor.shape();
    let values = tensor.array.into_iter().collect();

@@ -12,6 +14,28 @@ fn into_data<E: NdArrayElement, const D: usize>(tensor: NdArrayTensor<E, D>) ->
}

impl<E: FloatNdArrayElement> QTensorOps<Self> for NdArray<E> {
    fn q_from_data<const D: usize>(
        data: TensorData,
        _device: &NdArrayDevice,
    ) -> QuantizedTensor<Self, D> {
        match data.dtype {
            DType::QFloat(strategy) => match strategy {
                QuantizationStrategy::PerTensorAffineInt8(_) => {
                    let data = data.convert::<i8>();
                    NdArrayTensor::<i8, D>::from_data(data)
                }
                QuantizationStrategy::PerTensorSymmetricInt8(_) => {
                    let data = data.convert::<i8>();
                    NdArrayTensor::<i8, D>::from_data(data)
                }
            },
            _ => panic!(
                "Invalid dtype (expected DType::QFloat, got {:?})",
                data.dtype
            ),
        }
    }

    fn quantize<const D: usize>(
        tensor: FloatTensor<Self, D>,
        strategy: &QuantizationStrategy,

@@ -41,4 +65,20 @@ impl<E: FloatNdArrayElement> QTensorOps<Self> for NdArray<E> {
    fn q_device<const D: usize>(_tensor: &QuantizedTensor<Self, D>) -> NdArrayDevice {
        NdArrayDevice::Cpu
    }

    fn q_reshape<const D1: usize, const D2: usize>(
        tensor: QuantizedTensor<Self, D1>,
        shape: Shape<D2>,
    ) -> QuantizedTensor<Self, D2> {
        NdArrayOps::reshape(tensor, shape)
    }

    async fn q_into_data<const D: usize>(
        tensor: QuantizedTensor<Self, D>,
        strategy: QuantizationStrategy,
    ) -> TensorData {
        let shape = tensor.shape();
        let values = tensor.array.into_iter().collect();
        TensorData::quantized(values, shape, strategy)
    }
}

@@ -1,4 +1,4 @@
use burn_tensor::Shape;
use burn_tensor::{QuantizationStrategy, Shape};
use tch::Scalar;

use crate::{LibTorchDevice, TchShape, TchTensor};

@@ -512,4 +512,30 @@ impl<E: tch::kind::Element + Copy + Default> TchOps<E> {
    ) -> TchTensor<i64, D> {
        TchTensor::new(tensor.tensor.argsort(dim as i64, descending))
    }

    pub fn quantize<const D: usize, I: tch::kind::Element>(
        tensor: TchTensor<E, D>,
        strategy: &QuantizationStrategy,
    ) -> TchTensor<I, D> {
        let mut tensor = tensor;
        // Quantize only works on Float Tensor
        if tensor.tensor.kind() == tch::Kind::Half {
            tensor.tensor = tensor.tensor.to_kind(tch::Kind::Float);
        }

        match strategy {
            QuantizationStrategy::PerTensorAffineInt8(ref q) => {
                TchTensor::new(tensor.tensor.quantize_per_tensor(
                    q.scale.into(),
                    q.offset.into(),
                    tch::Kind::QInt8,
                ))
            }
            QuantizationStrategy::PerTensorSymmetricInt8(ref q) => TchTensor::new(
                tensor
                    .tensor
                    .quantize_per_tensor(q.scale.into(), 0, tch::Kind::QInt8),
            ),
        }
    }
}

@@ -1,15 +1,63 @@
use burn_tensor::{
    ops::{FloatTensor, QTensorOps, QuantizedTensor},
    QuantizationStrategy, Shape,
    DType, Quantization, QuantizationStrategy, Shape, TensorData,
};

use crate::{LibTorch, LibTorchDevice, TchElement, TchTensor};
use crate::{LibTorch, LibTorchDevice, TchElement, TchShape, TchTensor};

use super::TchOps;

impl<E: TchElement> QTensorOps<Self> for LibTorch<E> {
    fn q_from_data<const D: usize>(
        data: TensorData,
        device: &LibTorchDevice,
    ) -> QuantizedTensor<Self, D> {
        let shape_tch = TchShape::<D>::from(data.shape.as_slice());
        let device = (*device).into();

        // NOTE: tch-rs doesn't have `from_blob_quantized_*` APIs
        // https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/quantized/Quantizer.cpp#L322
        // So for now we have to load the dequantized values to quantize them back since the dequantization
        // methods take the values provided when quantizing.
        let tensor = match data.dtype {
            DType::QFloat(strategy) => match strategy {
                QuantizationStrategy::PerTensorAffineInt8(q) => {
                    let values = q.dequantize(&data.iter::<i8>().collect::<Vec<_>>());
                    let tensor = tch::Tensor::from_slice(&values).to(device);
                    TchOps::<E>::quantize::<D, i8>(
                        TchTensor::new(tensor.reshape(shape_tch.dims)),
                        &strategy,
                    )
                    .tensor
                }
                QuantizationStrategy::PerTensorSymmetricInt8(q) => {
                    let values = q.dequantize(&data.iter::<i8>().collect::<Vec<_>>());
                    let tensor = tch::Tensor::from_slice(&values).to(device);
                    TchOps::<E>::quantize::<D, i8>(
                        TchTensor::new(tensor.reshape(shape_tch.dims)),
                        &strategy,
                    )
                    .tensor
                }
            },
            _ => panic!(
                "Invalid dtype (expected DType::QFloat, got {:?})",
                data.dtype
            ),
        };
        TchTensor::new(tensor)
    }

    fn quantize<const D: usize>(
        tensor: FloatTensor<Self, D>,
        strategy: &QuantizationStrategy,
    ) -> QuantizedTensor<Self, D> {
        let mut tensor = tensor;
        // Quantize only works on Float Tensor
        if E::dtype() == DType::F16 {
            tensor.tensor = tensor.tensor.to_kind(tch::Kind::Float);
        }

        match strategy {
            QuantizationStrategy::PerTensorAffineInt8(ref q) => {
                TchTensor::new(tensor.tensor.quantize_per_tensor(

@@ -30,7 +78,7 @@ impl<E: TchElement> QTensorOps<Self> for LibTorch<E> {
        tensor: QuantizedTensor<Self, D>,
        _strategy: &QuantizationStrategy,
    ) -> FloatTensor<Self, D> {
        TchTensor::new(tensor.tensor.dequantize())
        TchTensor::new(tensor.tensor.dequantize().to_kind(E::KIND))
    }

    fn q_shape<const D: usize>(tensor: &QuantizedTensor<Self, D>) -> Shape<D> {

@@ -40,4 +88,23 @@ impl<E: TchElement> QTensorOps<Self> for LibTorch<E> {
    fn q_device<const D: usize>(tensor: &QuantizedTensor<Self, D>) -> LibTorchDevice {
        tensor.tensor.device().into()
    }

    fn q_reshape<const D1: usize, const D2: usize>(
        tensor: QuantizedTensor<Self, D1>,
        shape: Shape<D2>,
    ) -> QuantizedTensor<Self, D2> {
        TchOps::reshape(tensor, shape)
    }

    async fn q_into_data<const D: usize>(
        tensor: QuantizedTensor<Self, D>,
        strategy: QuantizationStrategy,
    ) -> TensorData {
        let shape = Self::q_shape(&tensor);
        let tensor = Self::q_reshape(tensor.clone(), Shape::new([shape.num_elements()]));
        // To get the integer values we have to call `int_repr()`
        let values: Result<Vec<i8>, tch::TchError> = tensor.tensor.int_repr().try_into();

        TensorData::quantized(values.unwrap(), shape, strategy)
    }
}

@@ -18,7 +18,7 @@ use crate::check::TensorCheck;
use crate::tensor::api::chunk::chunk;
use crate::tensor::api::narrow::narrow;
use crate::{backend::Backend, check, Bool, Float, Int, Shape, TensorData, TensorKind};
use crate::{Element, TensorPrimitive};
use crate::{DType, Element, TensorPrimitive};

/// A tensor with a given backend, shape and data type.
#[derive(new, Clone, Debug)]

@@ -1697,7 +1697,15 @@ impl<B: Backend> BasicOps<B> for Float {
        tensor: Self::Primitive<D1>,
        shape: Shape<D2>,
    ) -> Self::Primitive<D2> {
        TensorPrimitive::Float(B::float_reshape(tensor.tensor(), shape))
        match tensor {
            TensorPrimitive::Float(tensor) => {
                TensorPrimitive::Float(B::float_reshape(tensor, shape))
            }
            TensorPrimitive::QFloat { tensor, strategy } => TensorPrimitive::QFloat {
                tensor: B::q_reshape(tensor, shape),
                strategy,
            },
        }
    }

    fn transpose<const D: usize>(tensor: Self::Primitive<D>) -> Self::Primitive<D> {

@@ -1750,11 +1758,20 @@ impl<B: Backend> BasicOps<B> for Float {
    }

    async fn into_data_async<const D: usize>(tensor: Self::Primitive<D>) -> TensorData {
        B::float_into_data(tensor.tensor()).await
        match tensor {
            TensorPrimitive::Float(tensor) => B::float_into_data(tensor).await,
            TensorPrimitive::QFloat { tensor, strategy } => B::q_into_data(tensor, strategy).await,
        }
    }

    fn from_data<const D: usize>(data: TensorData, device: &B::Device) -> Self::Primitive<D> {
        TensorPrimitive::Float(B::float_from_data(data, device))
        match data.dtype {
            DType::QFloat(strategy) => TensorPrimitive::QFloat {
                tensor: B::q_from_data(data, device),
                strategy,
            },
            _ => TensorPrimitive::Float(B::float_from_data(data, device)),
        }
    }

    fn repeat<const D: usize>(

@@ -271,9 +271,9 @@ where
        match &self.primitive {
            TensorPrimitive::Float(tensor) => B::float_is_require_grad(tensor),
            TensorPrimitive::QFloat {
                tensor: _,
                tensor,
                strategy: _,
            } => B::float_is_require_grad(&self.primitive.clone().tensor()),
            } => B::q_is_require_grad(tensor),
        }
    }

@@ -282,10 +282,16 @@ where
    ///
    /// This function does nothing when autodiff is not enabled.
    pub fn set_require_grad(self, require_grad: bool) -> Self {
        Self::new(TensorPrimitive::Float(B::float_set_require_grad(
            self.primitive.tensor(),
            require_grad,
        )))
        let primitive = match self.primitive {
            TensorPrimitive::Float(tensor) => {
                TensorPrimitive::Float(B::float_set_require_grad(tensor, require_grad))
            }
            TensorPrimitive::QFloat { tensor, strategy } => TensorPrimitive::QFloat {
                tensor: B::q_set_require_grad(tensor, require_grad),
                strategy,
            },
        };
        Self::new(primitive)
    }

    /// Applies the relu function to the tensor.

@@ -48,6 +48,15 @@ impl TensorData {
        Self::init(value, shape, E::dtype())
    }

    /// Creates a new quantized tensor data structure.
    pub fn quantized<E: Element, S: Into<Vec<usize>>>(
        value: Vec<E>,
        shape: S,
        strategy: QuantizationStrategy,
    ) -> Self {
        Self::init(value, shape, DType::QFloat(strategy))
    }

    /// Initializes a new tensor data structure from the provided values.
    fn init<E: Element, S: Into<Vec<usize>>>(value: Vec<E>, shape: S, dtype: DType) -> Self {
        Self {

@@ -258,15 +267,15 @@ impl TensorData {
            "Only f32 data type can be quantized"
        );
        match &quantization {
            QuantizationStrategy::PerTensorAffineInt8(strategy) => TensorData::init(
            QuantizationStrategy::PerTensorAffineInt8(strategy) => TensorData::quantized(
                strategy.quantize(self.as_slice().unwrap()),
                self.shape,
                DType::QFloat(quantization),
                quantization,
            ),
            QuantizationStrategy::PerTensorSymmetricInt8(strategy) => TensorData::init(
            QuantizationStrategy::PerTensorSymmetricInt8(strategy) => TensorData::quantized(
                strategy.quantize(self.as_slice().unwrap()),
                self.shape,
                DType::QFloat(quantization),
                quantization,
            ),
        }
    }

@@ -1,10 +1,24 @@
use crate::{backend::Backend, Device, QuantizationStrategy, Shape};
use core::future::Future;

use crate::{backend::Backend, Device, QuantizationStrategy, Shape, TensorData};

use super::{FloatTensor, QuantizedTensor};

/// Quantized Tensor API for basic operations, see [tensor](crate::Tensor)
/// for documentation on each function.
pub trait QTensorOps<B: Backend> {
    /// Creates a new tensor from the data structure.
    ///
    /// # Arguments
    ///
    /// * `data` - The data structure.
    /// * `device` - The device to create the tensor on.
    ///
    /// # Returns
    ///
    /// The tensor with the given data.
    fn q_from_data<const D: usize>(data: TensorData, device: &Device<B>) -> QuantizedTensor<B, D>;

    /// Convert the tensor to a lower precision data type based on the quantization strategy.
    fn quantize<const D: usize>(
        tensor: FloatTensor<B, D>,

@@ -38,4 +52,48 @@ pub trait QTensorOps<B: Backend> {
    ///
    /// The device of the tensor.
    fn q_device<const D: usize>(tensor: &QuantizedTensor<B, D>) -> Device<B>;

    /// Reshapes a tensor.
    ///
    /// # Arguments
    ///
    /// * `tensor` - The tensor to reshape.
    /// * `shape` - The new shape of the tensor.
    ///
    /// # Returns
    ///
    /// The tensor with the new shape.
    fn q_reshape<const D1: usize, const D2: usize>(
        tensor: QuantizedTensor<B, D1>,
        shape: Shape<D2>,
    ) -> QuantizedTensor<B, D2>;

    /// Converts the tensor to a data structure.
    ///
    /// # Arguments
    ///
    /// * `tensor` - The tensor.
    ///
    /// # Returns
    ///
    /// The data structure with the tensor's data.
    fn q_into_data<const D: usize>(
        tensor: QuantizedTensor<B, D>,
        strategy: QuantizationStrategy,
    ) -> impl Future<Output = TensorData> + Send;

    /// Sets the `require_grad` flag of a tensor.
    fn q_set_require_grad<const D: usize>(
        tensor: QuantizedTensor<B, D>,
        _require_grad: bool,
    ) -> QuantizedTensor<B, D> {
        // Should only be overridden by autodiff backends.
        tensor
    }

    /// Returns the `require_grad` flag of a tensor.
    fn q_is_require_grad<const D: usize>(_tensor: &QuantizedTensor<B, D>) -> bool {
        // Should only be overridden by autodiff backends.
        false
    }
}

@@ -8,7 +8,7 @@ use burn_common::{iter_par, run_par};
use num_traits::{Float, PrimInt};
use serde::{Deserialize, Serialize};

/// Quantization scheme/strategy.
/// Quantization strategy.
#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, Serialize, Deserialize)]
pub enum QuantizationStrategy {
    /// Per-tensor `int8` affine/asymmetric quantization.

@@ -47,6 +47,18 @@
//! - Autodiff: Backend decorator that brings backpropagation to any backend
//! - Fusion: Backend decorator that brings kernel fusion to backends that support it
//!
//! # Quantization (Beta)
//!
//! Quantization techniques perform computations and store tensors in lower precision data types like 8-bit integer
//! instead of floating point precision. There are multiple approaches to quantize a deep learning model. In most cases,
//! the model is trained in floating point precision and later converted to the lower precision data type. This is called
//! post-training quantization (PTQ). On the other hand, quantization aware training (QAT) models the effects of quantization
//! during training. Quantization errors are thus modeled in the forward and backward passes, which helps the model learn
//! representations that are more robust to the reduction in precision.
//!
//! Quantization support in Burn is currently in active development. It supports the following modes on some backends:
//! - Static per-tensor quantization to signed 8-bit integer (`i8`)
//!
//! ## Feature Flags
//!
//! The following feature flags are available.