Release: v4.41.2

Fix faulty rstrip in module loading (#31108)
fix from_pretrained in offline mode when model is preloaded in cache (#31010)
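
The patch for #31108 is not shown in this section, so the following is only a sketch of the general pitfall the commit title points at, under the assumption that the bug came from `str.rstrip` being used to drop a `.py` extension. `str.rstrip` treats its argument as a set of characters rather than a suffix, so it can eat into the module name itself. The helper name `strip_py_suffix` is hypothetical and is not taken from the library.

```python
def strip_py_suffix(filename: str) -> str:
    """Remove a trailing '.py' extension, and nothing else.

    Hypothetical helper for illustration; not the library's actual code.
    """
    suffix = ".py"
    if filename.endswith(suffix):
        return filename[: -len(suffix)]
    return filename


# rstrip strips any trailing run of the characters '.', 'p' and 'y',
# so names ending in those letters are silently truncated.
faulty = "modeling_copy.py".rstrip(".py")
correct = strip_py_suffix("modeling_copy.py")

print(faulty)   # modeling_co
print(correct)  # modeling_copy
```

On Python 3.9 and later, `str.removesuffix(".py")` does the same job as the helper above.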
2024-05-30 13:28:00 -04:00 · 2024-05-30 13:25:10 -04:00 · 2024-05-30 13:25:09 -04:00 · 2024-05-24 05:02:39 -04:00
1512 changed files with 16187 additions and 14868 deletions
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -31,7 +31,6 @@ jobs:
        steps:
            - checkout
            - run: uv pip install -U -e .
-            - run: echo 'export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)"' >> "$BASH_ENV" && source "$BASH_ENV"
            - run: mkdir -p test_preparation
            - run: python utils/tests_fetcher.py | tee tests_fetched_summary.txt
            - store_artifacts:
@@ -81,7 +80,7 @@ jobs:
                  path: ~/transformers/test_preparation/filtered_test_list.txt
            - store_artifacts:
                  path: test_preparation/examples_test_list.txt
-            - run: export "GIT_COMMIT_MESSAGE=$(git show -s --format=%s)" && echo $GIT_COMMIT_MESSAGE && python .circleci/create_circleci_config.py --fetcher_folder test_preparation
+            - run: python .circleci/create_circleci_config.py --fetcher_folder test_preparation
            - run: |
                  if [ ! -s test_preparation/generated_config.yml ]; then
                      echo "No tests to run, exiting early!"
@@ -98,7 +97,7 @@ jobs:
    fetch_all_tests:
        working_directory: ~/transformers
        docker:
-            - image: huggingface/transformers-quality
+            - image: huggingface/transformers-consistency
        parallelism: 1
        steps:
            - checkout
--- a/.circleci/create_circleci_config.py
+++ b/.circleci/create_circleci_config.py
@@ -72,12 +72,6 @@ class CircleCIJob:
        if self.docker_image is None:
            # Let's avoid changing the default list and make a copy.
            self.docker_image = copy.deepcopy(DEFAULT_DOCKER_IMAGE)
-        else:
-            # BIG HACK WILL REMOVE ONCE FETCHER IS UPDATED
-            print(os.environ.get("GIT_COMMIT_MESSAGE"))
-            if "[build-ci-image]" in os.environ.get("GIT_COMMIT_MESSAGE", "") or os.environ.get("GIT_COMMIT_MESSAGE", "") == "dev-ci":
-                self.docker_image[0]["image"] = f"{self.docker_image[0]['image']}:dev"
-            print(f"Using {self.docker_image} docker image")
        if self.install_steps is None:
            self.install_steps = []
        if self.pytest_options is None:
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -17,50 +17,50 @@ body:
      description: |
        Your issue will be replied to more quickly if you can figure out the right person to tag with @
        If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
-
+        
        All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and
        a core maintainer will ping the right person.
-
+        
        Please tag fewer than 3 people.
-
+        
        Models:

          - text models: @ArthurZucker and @younesbelkada
          - vision models: @amyeroberts
          - speech models: @sanchit-gandhi
          - graph models: @clefourrier
-
+        
        Library:
-
+        
          - flax: @sanchit-gandhi
-          - generate: @zucchini-nlp (visual-language models) or @gante (all others)
+          - generate: @gante
          - pipelines: @Narsil
          - tensorflow: @gante and @Rocketknight1
          - tokenizers: @ArthurZucker
-          - trainer: @muellerzr @SunMarc
+          - trainer: @muellerzr and @pacman100
        
        Integrations:
        
-          - deepspeed: HF Trainer/Accelerate: @muellerzr
+          - deepspeed: HF Trainer/Accelerate: @pacman100
          - ray/raytune: @richardliaw, @amogkam
          - Big Model Inference: @SunMarc
          - quantization (bitsandbytes, autogpt): @SunMarc and @younesbelkada
-
+        
        Documentation: @stevhliu
-
+        
        Model hub:

          - for issues with a model, report at https://discuss.huggingface.co/ and tag the model's creator.
-
+        
        HF projects:
-
+        
          - accelerate: [different repo](https://github.com/huggingface/accelerate)
          - datasets: [different repo](https://github.com/huggingface/datasets)
          - diffusers: [different repo](https://github.com/huggingface/diffusers)
          - rust tokenizers: [different repo](https://github.com/huggingface/tokenizers)
-
+        
        Maintained examples (not research project or legacy):
-
+        
          - Flax: @sanchit-gandhi
          - PyTorch: See Models above and tag the person corresponding to the modality of the example.
          - TensorFlow: @Rocketknight1
@@ -101,11 +101,11 @@ body:

      placeholder: |
        Steps to reproduce the behavior:
-
+          
          1.
          2.
          3.
-
+          

  - type: textarea
    id: expected-behavior
--- a/.github/ISSUE_TEMPLATE/feature-request.yml
+++ b/.github/ISSUE_TEMPLATE/feature-request.yml
@@ -1,6 +1,6 @@
 name: "\U0001F680 Feature request"
 description: Submit a proposal/request for a new transformers feature
-labels: [ "Feature request" ]
+labels: [ "feature" ]
 body:
  - type: textarea
    id: feature-request
@@ -19,7 +19,7 @@ body:
      label: Motivation
      description: |
        Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too.
-
+        

  - type: textarea
    id: contribution
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -47,15 +47,15 @@ Models:
 Library:

 - flax: @sanchit-gandhi
-- generate: @zucchini-nlp (visual-language models) or @gante (all others)
+- generate: @gante
 - pipelines: @Narsil
 - tensorflow: @gante and @Rocketknight1
 - tokenizers: @ArthurZucker
-- trainer: @muellerzr and @SunMarc
+- trainer: @muellerzr and @pacman100

 Integrations:

-- deepspeed: HF Trainer/Accelerate: @muellerzr
+- deepspeed: HF Trainer/Accelerate: @pacman100
 - ray/raytune: @richardliaw, @amogkam
 - Big Model Inference: @SunMarc
 - quantization (bitsandbytes, autogpt): @SunMarc and @younesbelkada
--- a/.github/workflows/build-ci-docker-images.yml
+++ b/.github/workflows/build-ci-docker-images.yml
@@ -3,7 +3,7 @@ name: Build pr ci-docker
 on:
  push:
    branches:
-      - push-ci-image # for now let's only build on this branch
+      - change-ci # for now let's only build on this branch
  repository_dispatch:
  workflow_call:
    inputs:
@@ -22,7 +22,7 @@ jobs:
  build:
    runs-on: ubuntu-22.04

-    if: ${{ contains(github.event.head_commit.message, '[build-ci-image]') || contains(github.event.head_commit.message, '[push-ci-image]') && '!cancelled()' || github.event_name == 'schedule' }}
+    if: ${{ contains(github.event.head_commit.message, '[push-ci-image]') && '!cancelled()' }}

    strategy:
      matrix:
@@ -30,16 +30,6 @@ jobs:
    continue-on-error: true 

    steps:
-      - 
-        name: Set tag
-        run: |
-              if ${{contains(github.event.head_commit.message, '[build-ci-image]')}}; then
-                  echo "TAG=huggingface/transformers-${{ matrix.file }}:dev" >> "$GITHUB_ENV" 
-                  echo "setting it to DEV!"
-              else
-                  echo "TAG=huggingface/transformers-${{ matrix.file }}" >> "$GITHUB_ENV"
-                  
-              fi
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
@@ -60,5 +50,5 @@ jobs:
          build-args: |
            REF=${{ github.sha }}
          file: "./docker/${{ matrix.file }}.dockerfile"
-          push: ${{ contains(github.event.head_commit.message, 'ci-image]') ||  github.event_name == 'schedule' }}
-          tags: ${{ env.TAG }}
+          push: true
+          tags: huggingface/transformers-${{ matrix.file }}
--- a/.github/workflows/build-docker-images.yml
+++ b/.github/workflows/build-docker-images.yml
@@ -57,19 +57,20 @@ jobs:
          push: true
          tags: huggingface/transformers-all-latest-gpu-push-ci

-      - name: Post to Slack
-        if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
-        with:
-          slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
-          title: 🤗 Results of the transformers-all-latest-gpu-push-ci docker build 
-          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
-
  latest-torch-deepspeed-docker:
    name: "Latest PyTorch + DeepSpeed"
    runs-on: [intel-cpu, 8-cpu, ci]
    steps:
+      - name: Cleanup disk
+        run: |
+          sudo ls -l /usr/local/lib/
+          sudo ls -l /usr/share/
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/
+          sudo rm -rf /usr/local/lib/android
+          sudo rm -rf /usr/share/dotnet
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
@@ -92,20 +93,21 @@ jobs:
          push: true
          tags: huggingface/transformers-pytorch-deepspeed-latest-gpu${{ inputs.image_postfix }}

-      - name: Post to Slack
-        if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
-        with:
-          slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER}}
-          title: 🤗 Results of the transformers-pytorch-deepspeed-latest-gpu docker build 
-          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
-
  # Can't build 2 images in a single job `latest-torch-deepspeed-docker` (for `nvcr.io/nvidia`)
  latest-torch-deepspeed-docker-for-push-ci-daily-build:
    name: "Latest PyTorch + DeepSpeed (Push CI - Daily Build)"
    runs-on: [intel-cpu, 8-cpu, ci]
    steps:
+      - name: Cleanup disk
+        run: |
+          sudo ls -l /usr/local/lib/
+          sudo ls -l /usr/share/
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/
+          sudo rm -rf /usr/local/lib/android
+          sudo rm -rf /usr/share/dotnet
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
@@ -132,15 +134,6 @@ jobs:
          push: true
          tags: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci

-      - name: Post to Slack
-        if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
-        with:
-          slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
-          title: 🤗 Results of the transformers-pytorch-deepspeed-latest-gpu-push-ci docker build 
-          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
-
  doc-builder:
    name: "Doc builder"
    # Push CI doesn't need this image
@@ -167,21 +160,22 @@ jobs:
          push: true
          tags: huggingface/transformers-doc-builder

-      - name: Post to Slack
-        if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
-        with:
-          slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
-          title: 🤗 Results of the huggingface/transformers-doc-builder docker build 
-          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
-
  latest-pytorch:
    name: "Latest PyTorch [dev]"
    # Push CI doesn't need this image
    if: inputs.image_postfix != '-push-ci'
    runs-on: [intel-cpu, 8-cpu, ci]
    steps:
+      - name: Cleanup disk
+        run: |
+          sudo ls -l /usr/local/lib/
+          sudo ls -l /usr/share/
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/
+          sudo rm -rf /usr/local/lib/android
+          sudo rm -rf /usr/share/dotnet
+          sudo du -sh /usr/local/lib/
+          sudo du -sh /usr/share/
      -
        name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
@@ -204,15 +198,6 @@ jobs:
          push: true
          tags: huggingface/transformers-pytorch-gpu

-      - name: Post to Slack
-        if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
-        with:
-          slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
-          title: 🤗 Results of the huggingface/transformers-pytorch-gpudocker build 
-          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
-
  latest-pytorch-amd:
    name: "Latest PyTorch (AMD) [dev]"
    runs-on: [intel-cpu, 8-cpu, ci]
@@ -252,15 +237,6 @@ jobs:
          push: true
          tags: huggingface/transformers-pytorch-amd-gpu-push-ci

-      - name: Post to Slack
-        if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
-        with:
-          slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
-          title: 🤗 Results of the huggingface/transformers-pytorch-amd-gpu-push-ci build 
-          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
-
  latest-tensorflow:
    name: "Latest TensorFlow [dev]"
    # Push CI doesn't need this image
@@ -289,15 +265,6 @@ jobs:
          push: true
          tags: huggingface/transformers-tensorflow-gpu

-      - name: Post to Slack
-        if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
-        with:
-          slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
-          title: 🤗 Results of the huggingface/transformers-tensorflow-gpu build 
-          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
-
  latest-pytorch-deepspeed-amd:
    name: "PyTorch + DeepSpeed (AMD) [dev]"
    runs-on: [intel-cpu, 8-cpu, ci]
@@ -337,15 +304,6 @@ jobs:
          push: true
          tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci

-      - name: Post to Slack
-        if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
-        with:
-          slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
-          title: 🤗 Results of the transformers-pytorch-deepspeed-amd-gpu build 
-          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
-
  latest-quantization-torch-docker:
    name: "Latest Pytorch + Quantization [dev]"
     # Push CI doesn't need this image
@@ -372,13 +330,4 @@ jobs:
          build-args: |
            REF=main
          push: true
-          tags: huggingface/transformers-quantization-latest-gpu${{ inputs.image_postfix }}
-
-      - name: Post to Slack
-        if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
-        with:
-          slack_channel: ${{ secrets.CI_SLACK_CHANNEL_DOCKER }}
-          title: 🤗 Results of the transformers-quantization-latest-gpu build 
-          status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+          tags: huggingface/transformers-quantization-latest-gpu${{ inputs.image_postfix }}
--- a/.github/workflows/build-nightly-ci-docker-images.yml
+++ b/.github/workflows/build-nightly-ci-docker-images.yml
@@ -13,7 +13,7 @@ concurrency:
 jobs:
  latest-with-torch-nightly-docker:
    name: "Nightly PyTorch + Stable TensorFlow"
-    runs-on: [intel-cpu, 8-cpu, ci]
+    runs-on: ubuntu-22.04
    steps:
      - name: Cleanup disk
        run: |
@@ -50,7 +50,7 @@ jobs:

  nightly-torch-deepspeed-docker:
    name: "Nightly PyTorch + DeepSpeed"
-    runs-on: [intel-cpu, 8-cpu, ci]
+    runs-on: ubuntu-22.04
    steps:
      - name: Cleanup disk
        run: |
--- a/.github/workflows/build-past-ci-docker-images.yml
+++ b/.github/workflows/build-past-ci-docker-images.yml
@@ -16,7 +16,7 @@ jobs:
      fail-fast: false
      matrix:
        version: ["1.13", "1.12", "1.11"]
-    runs-on: [intel-cpu, 8-cpu, ci]
+    runs-on: ubuntu-22.04
    steps:
      -
        name: Set up Docker Buildx
@@ -60,7 +60,7 @@ jobs:
      fail-fast: false
      matrix:
        version: ["2.11", "2.10", "2.9", "2.8", "2.7", "2.6", "2.5"]
-    runs-on: [intel-cpu, 8-cpu, ci]
+    runs-on: ubuntu-22.04
    steps:
      -
        name: Set up Docker Buildx
--- a/.github/workflows/model_jobs.yml
+++ b/.github/workflows/model_jobs.yml
@@ -80,7 +80,7 @@ jobs:

      - name: Run all tests on GPU
        working-directory: /transformers
-        run: python3 -m pytest -rsfE -v --make-reports=${{ inputs.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports tests/${{ matrix.folders }}
+        run: python3 -m pytest -rs -v --make-reports=${{ inputs.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports tests/${{ matrix.folders }}

      - name: Failure short reports
        if: ${{ failure() }}
--- a/.github/workflows/push-important-models.yml
+++ b/.github/workflows/push-important-models.yml
@@ -5,6 +5,7 @@ on:
    branches: [ main ]

 env:
+  IS_GITHUB_CI: "1"
  OUTPUT_SLACK_CHANNEL_ID: "C06L2SGMEEA"
  HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
  HF_HOME: /mnt/cache 
@@ -85,7 +86,7 @@ jobs:
      - name: Run FA2 tests
        id: run_fa2_tests
        run:
-          pytest -rsfE -m "flash_attn_test" --make-reports=${{ matrix.model-name }}_fa2_tests/ tests/${{ matrix.model-name }}/test_modeling_*
+          pytest -rs -m "flash_attn_test" --make-reports=${{ matrix.model-name }}_fa2_tests/ tests/${{ matrix.model-name }}/test_modeling_*
      
      - name: "Test suite reports artifacts: ${{ matrix.model-name }}_fa2_tests"
        if: ${{ always() }}
@@ -107,7 +108,7 @@ jobs:
        id: run_integration_tests
        if: always()
        run:
-          pytest -rsfE -k "IntegrationTest"  --make-reports=tests_integration_${{ matrix.model-name }} tests/${{ matrix.model-name }}/test_modeling_*
+          pytest -rs -k "IntegrationTest"  --make-reports=tests_integration_${{ matrix.model-name }} tests/${{ matrix.model-name }}/test_modeling_*
      
      - name: "Test suite reports artifacts: tests_integration_${{ matrix.model-name }}"
        if: ${{ always() }}
--- a/.github/workflows/self-pr-slow-ci.yml
+++ b/.github/workflows/self-pr-slow-ci.yml
@@ -110,7 +110,7 @@ jobs:

      - name: Run all tests on GPU
        working-directory: /transformers
-        run: python3 -m pytest -v -rsfE --make-reports=${{ matrix.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports tests/${{ matrix.folders }}
+        run: python3 -m pytest -v -rs --make-reports=${{ matrix.machine_type }}_run_models_gpu_${{ matrix.folders }}_test_reports tests/${{ matrix.folders }}

      - name: Failure short reports
        if: ${{ failure() }}
--- a/.github/workflows/ssh-runner.yml
+++ b/.github/workflows/ssh-runner.yml
@@ -9,11 +9,9 @@ on:
      docker_image:
        description: 'Name of the Docker image'
        required: true
-      num_gpus:
-        description: 'Type of the number of gpus to use (`single` or `multi`)'
-        required: true

 env:
+  IS_GITHUB_CI: "1"
  HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
  HF_HOME: /mnt/cache 
  TRANSFORMERS_IS_CI: yes 
@@ -22,13 +20,12 @@ env:
  RUN_SLOW: yes # For gated repositories, we still need to agree to share information on the Hub repo. page in order to get access. # This token is created under the bot `hf-transformers-bot`. 
  SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }} 
  TF_FORCE_GPU_ALLOW_GROWTH: true 
-  CUDA_VISIBLE_DEVICES: 0,1
  RUN_PT_TF_CROSS_TESTS: 1

 jobs:
  ssh_runner:
    name: "SSH"
-    runs-on: ["${{ github.event.inputs.num_gpus }}-gpu", nvidia-gpu, "${{ github.event.inputs.runner_type }}", ci]
+    runs-on: [single-gpu, nvidia-gpu, "${{ github.event.inputs.runner_type }}", ci]
    container:
      image: ${{ github.event.inputs.docker_image }}
      options: --gpus all --privileged --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
@@ -55,7 +52,7 @@ jobs:
          nvidia-smi
      
      - name: Tailscale # In order to be able to SSH when a test fails
-        uses: huggingface/tailscale-action@main
+        uses: huggingface/tailscale-action@v1
        with:
          authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }}
          slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
--- a/9
+++ b/9
@@ -1,11 +1,11 @@
-.PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples benchmark
+.PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples

 # make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!)
 export PYTHONPATH = src

 check_dirs := examples tests src utils

-exclude_folders :=  ""
+exclude_folders := examples/research_projects

 modified_only_fixup:
 	$(eval modified_py_files := $(shell python utils/get_modified_files.py $(check_dirs)))
@@ -96,11 +96,6 @@ test:
 test-examples:
 	python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/

-# Run benchmark
-
-benchmark:
-	python3 benchmark/benchmark.py --config-dir benchmark/config --config-name generation --commit=diff backend.model=google/gemma-2b backend.cache_implementation=null,static backend.torch_compile=false,true --multirun
-
 # Run tests for SageMaker DLC release

 test-sagemaker: # install sagemaker dependencies in advance with pip install .[sagemaker]
--- a/benchmark/benchmark.py
+++ b/benchmark/benchmark.py
@@ -1,310 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""
-Run benchmark using the `optimum-benchmark` library with some customization in `transformers`.
-
-Assume we are under `transformers` root directory: (make sure the commits are valid commits)
-```bash
-python benchmark/benchmark.py --config-dir benchmark/config --config-name generation --commit=9b9c7f03da625b13643e99205c691fe046461724 --metrics=decode.latency.mean,per_token.latency.mean,per_token.throughput.value backend.model=google/gemma-2b benchmark.input_shapes.sequence_length=5,7 benchmark.input_shapes.batch_size=1,2 --multirun
-```
-"""
-
-import argparse
-import glob
-import json
-import os.path
-import re
-import tempfile
-from contextlib import contextmanager
-from pathlib import Path
-
-from git import Repo
-
-from optimum_benchmark import Benchmark
-from optimum_benchmark_wrapper import main
-
-
-PATH_TO_REPO = Path(__file__).parent.parent.resolve()
-
-
-@contextmanager
-def checkout_commit(repo: Repo, commit_id: str):
-    """
-    Context manager that checks out a given commit when entered, but gets back to the reference it was at on exit.
-    Args:
-        repo (`git.Repo`): A git repository (for instance the Transformers repo).
-        commit_id (`str`): The commit reference to checkout inside the context manager.
-    """
-    current_head = repo.head.commit if repo.head.is_detached else repo.head.ref
-
-    try:
-        repo.git.checkout(commit_id)
-        yield
-
-    finally:
-        repo.git.checkout(current_head)
-
-
-def summarize(run_dir, metrics, expand_metrics=False):
-    """Produce a summary for each optimum-benchmark launched job's output directory found in `run_dir`.
-
-    Each summary's format is as follows (for `expand_metrics=False`):
-    ```
-    {
-        "model": "google/gemma-2b",
-        "commit": "3cd6ed22e4d49219f300f5055e71e3929aba20d7",
-        "config": "benchmark.input_shapes.batch_size=1,benchmark.input_shapes.sequence_length=5",
-        "metrics": {
-            "decode.latency.mean": 1.624666809082031,
-            "per_token.latency.mean": 0.012843788806628804,
-            "per_token.throughput.value": 77.85864553330948
-        }
-    }
-    ```
-    """
-    reports = glob.glob(os.path.join(run_dir, "**/benchmark_report.json"), recursive=True)
-    report_dirs = [str(Path(report).parent) for report in reports]
-
-    summaries = []
-    for report_dir in report_dirs:
-        commit = re.search(r"/commit=([^/]+)", report_dir).groups()[0]
-
-        if not os.path.isfile(os.path.join(report_dir, "benchmark.json")):
-            continue
-        benchmark = Benchmark.from_json(os.path.join(report_dir, "benchmark.json"))
-        report = benchmark.report
-
-        model = benchmark.config.backend["model"]
-
-        # Ths looks like `benchmark.input_shapes.batch_size=1,benchmark.input_shapes.sequence_length=5`.
-        # (we rely on the usage of hydra's `${hydra.job.override_dirname}`.)
-        benchmark_name = re.sub(f"backend.model={model},*", "", report_dir)
-        benchmark_name = str(Path(benchmark_name).parts[-1])
-        if benchmark_name.startswith("commit="):
-            benchmark_name = benchmark.config.name
-
-        metrics_values = {}
-        # post-processing of report: show a few selected/important metric
-        for metric in metrics:
-            keys = metric.split(".")
-            value = report
-            current = metrics_values
-            for key in keys:
-                # Avoid KeyError when a user's specified metric has typo.
-                # TODO: Give warnings.
-                if key not in value:
-                    continue
-                value = value[key]
-
-                if expand_metrics:
-                    if isinstance(value, dict):
-                        if key not in current:
-                            current[key] = {}
-                            current = current[key]
-                    else:
-                        current[key] = value
-
-            if not expand_metrics:
-                metrics_values[metric] = value
-
-        # show some config information
-        print(f"model: {model}")
-        print(f"commit: {commit}")
-        print(f"config: {benchmark_name}")
-        if len(metrics_values) > 0:
-            print("metrics:")
-            if expand_metrics:
-                print(metrics_values)
-            else:
-                for metric, value in metrics_values.items():
-                    print(f"  - {metric}: {value}")
-        print("-" * 80)
-
-        summary = {
-            "model": model,
-            "commit": commit,
-            "config": benchmark_name,
-            "metrics": metrics_values,
-        }
-        summaries.append(summary)
-
-        with open(os.path.join(report_dir, "summary.json"), "w") as fp:
-            json.dump(summary, fp, indent=4)
-
-    # TODO: upload to Hub
-    return summaries
-
-
-def combine_summaries(summaries):
-    """Combine a list of summary obtained from the function `summarize`.
-
-    The combined summary's format is as follows:
-    ```
-    "google/gemma-2b": {
-        "benchmark.input_shapes.batch_size=1,benchmark.input_shapes.sequence_length=5": {
-            "3cd6ed22e4d49219f300f5055e71e3929aba20d7": {
-                "metrics": {"decode.latency.mean": 1.624666809082031}
-            },
-            "c97ee28b117c0abe8e08891f402065e4df6d72aa": {
-                "metrics": {"decode.latency.mean": 1.6278163452148438}
-            }
-        },
-        "benchmark.input_shapes.batch_size=2,benchmark.input_shapes.sequence_length=5": {
-            "3cd6ed22e4d49219f300f5055e71e3929aba20d7": {
-                "metrics": {"decode.latency.mean": 1.6947791748046876}
-            },
-            "c97ee28b117c0abe8e08891f402065e4df6d72aa": {
-                "metrics": {
-                    "decode.latency.mean": 1.6980519409179688}
-            }
-        }
-    }
-    ```
-    """
-    combined = {}
-    for summary in summaries:
-        model = summary["model"]
-        config = summary["config"]
-        commit = summary["commit"]
-
-        if model not in combined:
-            combined[model] = {}
-
-        if config not in combined[model]:
-            combined[model][config] = {}
-
-        if commit not in combined[model][config]:
-            combined[model][config][commit] = {"metrics": summary["metrics"]}
-
-    with open(os.path.join(exp_run_dir, "summary.json"), "w") as fp:
-        json.dump(combined, fp, indent=4)
-
-    # TODO: upload to Hub
-    print(json.dumps(combined, indent=4))
-
-    return combined
-
-
-if __name__ == "__main__":
-
-    def list_str(values):
-        return values.split(",")
-
-    parser = argparse.ArgumentParser()
-
-    parser.add_argument("--config-dir", type=str, required=True, help="The path to the config directory.")
-    parser.add_argument("--config-name", type=str, required=True, help="The config name.")
-
-    # arguments specific to this wrapper for our own customization
-    parser.add_argument("--ensure_empty", type=bool, default=True, help="If to create a temporary directory.")
-    parser.add_argument(
-        "--commit",
-        type=list_str,
-        default="",
-        help="Comma-separated list of branch names and/or commit sha values on which the benchmark will run. If `diff` is specified, it will run on both the current head and the `main` branch.",
-    )
-    parser.add_argument("--metrics", type=str, help="The metrics to be included in the summary.")
-    args, optimum_benchmark_args = parser.parse_known_args()
-
-    repo = Repo(PATH_TO_REPO)
-
-    metrics = [
-        "prefill.latency.mean",
-        "prefill.throughput.value",
-        "decode.latency.mean",
-        "decode.throughput.value",
-        "per_token.latency.mean",
-        "per_token.throughput.value",
-    ]
-    if args.metrics is not None:
-        metrics = args.metrics.split(",")
-
-    # Get `backend.model` in a hacky way: We want to control the experiment flow manually.
-    models = [""]
-    for idx, arg in enumerate(optimum_benchmark_args):
-        if arg.startswith("backend.model="):
-            models = arg[len("backend.model=") :]
-            models = models.split(",")
-            break
-    optimum_benchmark_args = [arg for arg in optimum_benchmark_args if not arg.startswith("backend.model=")]
-
-    # Get the commit(s)
-    current_head = str(repo.head.commit) if repo.head.is_detached else str(repo.head.ref)
-    commits = [x for x in args.commit if x != ""]
-    if len(commits) == 0:
-        commits = [current_head]
-    elif len(commits) == 1 and commits[0] == "diff":
-        # compare to `main`
-        commits = ["main", current_head]
-
-    # Get the specified run directory
-    run_dir_arg_idx, run_dir = -1, None
-    sweep_dir_arg_idx, sweep_dir = -1, None
-    for idx, arg in enumerate(optimum_benchmark_args):
-        if arg.startswith("hydra.run.dir="):
-            run_dir = arg[len("hydra.run.dir=") :]
-            run_dir_arg_idx = idx
-        elif arg.startswith("hydra.sweep.dir="):
-            sweep_dir = arg[len("hydra.sweep.dir=") :]
-            sweep_dir_arg_idx = idx
-    exp_run_dir, arg_dix, arg_name = (
-        (sweep_dir, sweep_dir_arg_idx, "hydra.sweep.dir")
-        if "--multirun" in optimum_benchmark_args
-        else (run_dir, run_dir_arg_idx, "hydra.run.dir")
-    )
-
-    # TODO: not hardcoded
-    if exp_run_dir is None and args.ensure_empty:
-        exp_run_dir = "_benchmark"
-
-    if args.ensure_empty:
-        os.makedirs(exp_run_dir, exist_ok=True)
-        exp_run_dir = tempfile.mkdtemp(dir=exp_run_dir)
-
-    run_summaries = []
-    for commit in commits:
-        with checkout_commit(repo, commit):
-            commit = str(repo.head.commit)
-
-            commit_run_dir = exp_run_dir
-            if exp_run_dir is not None:
-                commit_run_dir = os.path.join(exp_run_dir, rf"commit\={commit}")
-
-            print(f"Run benchmark on commit: {commit}")
-
-            for model in models:
-                model_arg = [f"backend.model={model}"] if model != "" else []
-                dir_args = []
-                if commit_run_dir is not None:
-                    if arg_dix > -1:
-                        optimum_benchmark_args[arg_dix] = f"{arg_name}={commit_run_dir}"
-                    else:
-                        dir_args = [
-                            f"hydra.sweep.dir={commit_run_dir}",
-                            f"hydra.run.dir={commit_run_dir}/" + "${hydra.job.override_dirname}",
-                        ]
-                main(args.config_dir, args.config_name, model_arg + dir_args + optimum_benchmark_args)
-
-            if commit_run_dir is not None:
-                # Need to remove the `\` character
-                summaries = summarize(commit_run_dir.replace("\\", ""), metrics)
-                run_summaries.extend(summaries)
-
-    # aggregate the information across the commits
-    if exp_run_dir is not None:
-        with open(os.path.join(exp_run_dir, "summaries.json"), "w") as fp:
-            json.dump(run_summaries, fp, indent=4)
-
-        combined_summary = combine_summaries(run_summaries)
--- a/benchmark/config/generation.yaml
+++ b/benchmark/config/generation.yaml
@ -1,57 +0,0 @@
-defaults:
-  - benchmark # inheriting benchmark schema
-  - scenario: inference
-  - launcher: process
-  - backend: pytorch
-  - _self_ # for hydra 1.1 compatibility
-
-name: pytorch_generate
-
-launcher:
-  start_method: spawn
-  device_isolation: true
-  device_isolation_action: warn
-
-backend:
-  device: cuda
-  device_ids: 0
-  no_weights: true
-  model: meta-llama/Llama-2-7b-hf
-  cache_implementation: static
-  torch_compile: true
-  torch_dtype: float16
-  torch_compile_config:
-    backend: inductor
-    mode: reduce-overhead
-    fullgraph: true
-
-scenario:
-  input_shapes:
-    batch_size: 1
-    sequence_length: 7
-  generate_kwargs:
-    max_new_tokens: 128
-    min_new_tokens: 128
-    do_sample: false
-  memory: true
-  latency: true
-  iterations: 2
-  duration: 0
-
-
-# hydra/cli specific settings
-hydra:
-  run:
-    # where to store run results
-    dir: runs/${name}
-  job:
-    # change working directory to the run directory
-    chdir: true
-    env_set:
-      # set environment variable OVERRIDE_BENCHMARKS to 1
-      # to not skip benchmarks that have been run before
-      OVERRIDE_BENCHMARKS: 1
-      LOG_LEVEL: WARN
-  sweep:
-    dir: multirun
-    subdir: ${hydra.job.override_dirname}
--- a/benchmark/optimum_benchmark_wrapper.py
+++ b/benchmark/optimum_benchmark_wrapper.py
@ -1,16 +0,0 @@
-import argparse
-import subprocess
-
-
-def main(config_dir, config_name, args):
-    subprocess.run(["optimum-benchmark", "--config-dir", f"{config_dir}", "--config-name", f"{config_name}"] + ["hydra/job_logging=disabled", "hydra/hydra_logging=disabled"] + args)
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-
-    parser.add_argument("--config-dir", type=str, required=True, help="The path to the config directory.")
-    parser.add_argument("--config-name", type=str, required=True, help="The config name.")
-    args, unknown = parser.parse_known_args()
-
-    main(args.config_dir, args.config_name, unknown)
--- a/docker/consistency.dockerfile
+++ b/docker/consistency.dockerfile
@ -1,13 +1,12 @@
 FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 USER root
-ARG REF=main
 RUN apt-get update && apt-get install -y time git pkg-config make git-lfs
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools GitPython
 RUN uv pip install --no-cache-dir --upgrade 'torch' --index-url https://download.pytorch.org/whl/cpu
 RUN uv pip install --no-cache-dir tensorflow-cpu tf-keras
-RUN uv pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[flax,quality,vision,testing]" 
+RUN uv pip install --no-cache-dir "transformers[flax,quality,vision,testing]"
 RUN git lfs install

 RUN pip uninstall -y transformers
--- a/docker/custom-tokenizers.dockerfile
+++ b/docker/custom-tokenizers.dockerfile
@ -2,7 +2,7 @@ FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 USER root
 RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git cmake wget xz-utils build-essential g++5 libprotobuf-dev protobuf-compiler
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools

 RUN wget https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz
--- a/docker/examples-tf.dockerfile
+++ b/docker/examples-tf.dockerfile
@ -3,7 +3,7 @@ ENV PYTHONDONTWRITEBYTECODE=1
 USER root
 RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git
 RUN apt-get install -y g++ cmake
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv && uv venv
 RUN uv pip install --no-cache-dir -U pip setuptools albumentations seqeval
 RUN pip install  --upgrade --no-cache-dir "transformers[tf-cpu,sklearn,testing,sentencepiece,tf-speech,vision]"
--- a/docker/examples-torch.dockerfile
+++ b/docker/examples-torch.dockerfile
@ -2,7 +2,7 @@ FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
 USER root
 RUN apt-get update &&  apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
 RUN pip install --no-cache-dir 'torch' 'torchvision' 'torchaudio' --index-url https://download.pytorch.org/whl/cpu
 RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu 
--- a/docker/exotic-models.dockerfile
+++ b/docker/exotic-models.dockerfile
@ -3,7 +3,7 @@ ENV PYTHONDONTWRITEBYTECODE=1
 ARG REF=main
 USER root
 RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git libgl1-mesa-glx libgl1 g++ tesseract-ocr
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv &&  uv venv && uv pip install --no-cache-dir -U pip setuptools
 RUN pip install --no-cache-dir 'torch' 'torchvision' 'torchaudio' --index-url https://download.pytorch.org/whl/cpu
 RUN uv pip install --no-cache-dir  --no-deps timm accelerate
--- a/docker/jax-light.dockerfile
+++ b/docker/jax-light.dockerfile
@ -1,10 +1,9 @@
 FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
-ARG REF=main
 USER root
 RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git g++ cmake
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv &&  uv venv && uv pip install --no-cache-dir -U pip setuptools
-RUN pip install --no-cache-dir "scipy<1.13" "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[flax,testing,sentencepiece,flax-speech,vision]"
+RUN pip install --no-cache-dir "scipy<1.13" "transformers[flax,testing,sentencepiece,flax-speech,vision]"
 RUN pip uninstall -y transformers
 RUN apt-get clean && rm -rf /var/lib/apt/lists/* && apt-get autoremove && apt-get autoclean
--- a/docker/pipeline-tf.dockerfile
+++ b/docker/pipeline-tf.dockerfile
@ -1,10 +1,9 @@
 FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
-ARG REF=main
 USER root
 RUN apt-get update && apt-get install -y libsndfile1-dev espeak-ng time git cmake g++
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
-RUN pip install --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,tf-cpu,testing,sentencepiece,tf-speech,vision]"
+RUN pip install --no-cache-dir "transformers[sklearn,tf-cpu,testing,sentencepiece,tf-speech,vision]"
 RUN uv pip install --no-cache-dir  "protobuf==3.20.3" tensorflow_probability
 RUN apt-get clean && rm -rf /var/lib/apt/lists/*
--- a/docker/pipeline-torch.dockerfile
+++ b/docker/pipeline-torch.dockerfile
@ -1,11 +1,10 @@
 FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
-ARG REF=main
 USER root
 RUN apt-get update &&  apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git pkg-config openssh-client git
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
 RUN pip install --no-cache-dir 'torch' 'torchvision' 'torchaudio' --index-url https://download.pytorch.org/whl/cpu
 RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu 
-RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]"
+RUN uv pip install --no-cache-dir librosa "transformers[sklearn,sentencepiece,vision,testing]"
 RUN pip uninstall -y transformers
--- a/docker/quality.dockerfile
+++ b/docker/quality.dockerfile
@ -1,9 +1,8 @@
 FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
-ARG REF=main
 USER root
 RUN apt-get update && apt-get install -y time git 
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip install uv &&  uv venv
-RUN uv pip install --no-cache-dir -U pip setuptools GitPython "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[ruff]" urllib3
+RUN uv pip install --no-cache-dir -U pip setuptools GitPython transformers "ruff==0.1.5" urllib3
 RUN apt-get install -y jq curl && apt-get clean && rm -rf /var/lib/apt/lists/*
--- a/docker/tf-light.dockerfile
+++ b/docker/tf-light.dockerfile
@ -1,12 +1,11 @@
 FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
-ARG REF=main
 USER root
 RUN apt-get update &&  apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ pkg-config openssh-client git
 RUN apt-get install -y  cmake
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
-RUN pip install  --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[tf-cpu,sklearn,testing,sentencepiece,tf-speech,vision]"
+RUN pip install  --upgrade --no-cache-dir "transformers[tf-cpu,sklearn,testing,sentencepiece,tf-speech,vision]"
 RUN uv pip install --no-cache-dir  "protobuf==3.20.3" 
 RUN pip uninstall -y transformers
 RUN apt-get clean && rm -rf /var/lib/apt/lists/* && apt-get autoremove && apt-get autoclean
--- a/docker/torch-jax-light.dockerfile
+++ b/docker/torch-jax-light.dockerfile
@ -1,13 +1,12 @@
 FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
-ARG REF=main
 USER root
 RUN apt-get update &&  apt-get install -y libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
 RUN uv pip install --no-deps accelerate
 RUN pip install --no-cache-dir 'torch' 'torchvision' 'torchaudio' --index-url https://download.pytorch.org/whl/cpu
-RUN pip install --no-cache-dir "scipy<1.13" "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[flax,audio,sklearn,sentencepiece,vision,testing]"
+RUN pip install --no-cache-dir "scipy<1.13" "transformers[flax, audio, sklearn,sentencepiece,vision,testing]"


 # RUN pip install --no-cache-dir "scipy<1.13" "transformers[flax,testing,sentencepiece,flax-speech,vision]"
--- a/docker/torch-light.dockerfile
+++ b/docker/torch-light.dockerfile
@ -1,11 +1,10 @@
 FROM python:3.10-slim
 ENV PYTHONDONTWRITEBYTECODE=1
-ARG REF=main
 USER root
 RUN apt-get update &&  apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git git-lfs
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
 RUN pip install --no-cache-dir 'torch' 'torchvision' 'torchaudio' --index-url https://download.pytorch.org/whl/cpu
 RUN uv pip install --no-deps timm accelerate --extra-index-url https://download.pytorch.org/whl/cpu 
-RUN uv pip install --no-cache-dir librosa "git+https://github.com/huggingface/transformers.git@${REF}#egg=transformers[sklearn,sentencepiece,vision,testing]"
+RUN uv pip install --no-cache-dir librosa "transformers[sklearn,sentencepiece,vision,testing]"
 RUN pip uninstall -y transformers
--- a/docker/torch-tf-light.dockerfile
+++ b/docker/torch-tf-light.dockerfile
@ -4,7 +4,7 @@ ARG REF=main
 RUN echo ${REF}
 USER root
 RUN apt-get update &&  apt-get install -y --no-install-recommends libsndfile1-dev espeak-ng time git g++ cmake pkg-config openssh-client git git-lfs
-ENV UV_PYTHON=/usr/local/bin/python
+ENV VIRTUAL_ENV=/usr/local
 RUN pip --no-cache-dir install uv && uv venv && uv pip install --no-cache-dir -U pip setuptools
 RUN uv pip install --no-cache-dir  --no-deps accelerate --extra-index-url https://download.pytorch.org/whl/cpu 
 RUN pip install --no-cache-dir 'torch' 'torchvision' 'torchaudio' --index-url https://download.pytorch.org/whl/cpu
--- a/docker/transformers-all-latest-gpu/Dockerfile
+++ b/docker/transformers-all-latest-gpu/Dockerfile
@ -45,16 +45,12 @@ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/opt
 # For video model testing
 RUN python3 -m pip install --no-cache-dir decord av==9.2.0

+# For GGUF tests
+RUN python3 -m pip install --no-cache-dir gguf
+
 # Some slow tests require bnb
 RUN python3 -m pip install --no-cache-dir bitsandbytes

-# Some tests require quanto
-RUN python3 -m pip install --no-cache-dir quanto
-
-# `quanto` will install `ninja` which leads to many `CUDA error: an illegal memory access ...` in some model tests
-# (`deformable_detr`, `rwkv`, `mra`)
-RUN python3 -m pip uninstall -y ninja
-
 # For `dinat` model
 # The `XXX` part in `torchXXX` needs to match `PYTORCH` (to some extent)
 RUN python3 -m pip install --no-cache-dir natten==0.15.1+torch220$CUDA -f https://shi-labs.com/natten/wheels
--- a/docker/transformers-quantization-latest-gpu/Dockerfile
+++ b/docker/transformers-quantization-latest-gpu/Dockerfile
@ -1,4 +1,4 @@
-FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
+FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04
 LABEL maintainer="Hugging Face"

 ARG DEBIAN_FRONTEND=noninteractive
@ -48,9 +48,6 @@ RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.2
 # Add hqq for quantization testing
 RUN python3 -m pip install --no-cache-dir hqq

-# For GGUF tests
-RUN python3 -m pip install --no-cache-dir gguf
-
 # Add autoawq for quantization testing
 # >=v0.2.3 needed for compatibility with torch 2.2.1
 RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.3/autoawq-0.2.3+cu118-cp38-cp38-linux_x86_64.whl
--- a/docs/source/de/installation.md
+++ b/docs/source/de/installation.md
@ -162,7 +162,7 @@ Transformers verwendet die Shell-Umgebungsvariablen `PYTORCH_TRANSFORMERS_CACHE`

 ## Offline Modus

-Transformers ist in der Lage, in einer Firewall- oder Offline-Umgebung zu laufen, indem es nur lokale Dateien verwendet. Setzen Sie die Umgebungsvariable `HF_HUB_OFFLINE=1`, um dieses Verhalten zu aktivieren.
+Transformers ist in der Lage, in einer Firewall- oder Offline-Umgebung zu laufen, indem es nur lokale Dateien verwendet. Setzen Sie die Umgebungsvariable `TRANSFORMERS_OFFLINE=1`, um dieses Verhalten zu aktivieren.

 <Tip>

@ -179,7 +179,7 @@ python examples/pytorch/translation/run_translation.py --model_name_or_path goog
 Führen Sie das gleiche Programm in einer Offline-Instanz mit aus:

 ```bash
-HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \
+HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
 python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```

--- a/docs/source/de/peft.md
+++ b/docs/source/de/peft.md
@ -86,10 +86,10 @@ model.load_adapter(peft_model_id)
 Die `bitsandbytes`-Integration unterstützt Datentypen mit 8bit und 4bit Genauigkeit, was für das Laden großer Modelle nützlich ist, weil es Speicher spart (lesen Sie den `bitsandbytes`-Integrations [guide](./quantization#bitsandbytes-integration), um mehr zu erfahren). Fügen Sie die Parameter `load_in_8bit` oder `load_in_4bit` zu [`~PreTrainedModel.from_pretrained`] hinzu und setzen Sie `device_map="auto"`, um das Modell effektiv auf Ihre Hardware zu verteilen:

 ```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+from transformers import AutoModelForCausalLM, AutoTokenizer

 peft_model_id = "ybelkada/opt-350m-lora"
-model = AutoModelForCausalLM.from_pretrained(peft_model_id, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
+model = AutoModelForCausalLM.from_pretrained(peft_model_id, device_map="auto", load_in_8bit=True)
 ```

 ## Einen neuen Adapter hinzufügen
--- a/docs/source/en/_redirects.yml
+++ b/docs/source/en/_redirects.yml
@ -2,4 +2,3 @@

 perf_infer_gpu_many: perf_infer_gpu_one
 transformers_agents: agents
-quantization: quantization/overview
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@ -135,36 +135,18 @@
    title: Community resources
  - local: troubleshooting
    title: Troubleshoot
+  - local: hf_quantizer
+    title: Contribute new quantization method
  - local: gguf
    title: Interoperability with GGUF files
  title: Developer guides
- sections:
-  - local: quantization/overview
-    title: Getting started
-  - local: quantization/bitsandbytes
-    title: bitsandbytes
-  - local: quantization/gptq
-    title: GPTQ
-  - local: quantization/awq
-    title: AWQ
-  - local: quantization/aqlm
-    title: AQLM
-  - local: quantization/quanto
-    title: Quanto
-  - local: quantization/eetq
-    title: EETQ
-  - local: quantization/hqq
-    title: HQQ
-  - local: quantization/optimum
-    title: Optimum
-  - local: quantization/contribute
-    title: Contribute new quantization method
-  title: Quantization Methods
 - sections:
  - local: performance
    title: Overview
  - local: llm_optims
    title: LLM inference optimization
+  - local: quantization
+    title: Quantization
  - sections:
    - local: perf_train_gpu_one
      title: Methods and tools for efficient training on a single GPU
--- a/docs/source/en/agents.md
+++ b/docs/source/en/agents.md
@ -28,8 +28,8 @@ An agent is a system that uses an LLM as its engine, and it has access to functi
 These *tools* are functions for performing a task, and they contain all necessary description for the agent to properly use them.

 The agent can be programmed to:
- devise a series of actions/tools and run them all at once like the [`CodeAgent`] for example
- plan and execute actions/tools one by one and wait for the outcome of each action before launching the next one like the [`ReactJsonAgent`] for example
+- devise a series of actions/tools and run them all at once like the `CodeAgent` for example
+- plan and execute actions/tools one by one and wait for the outcome of each action before launching the next one like the `ReactJsonAgent` for example

 ### Types of agents

@ -42,8 +42,8 @@ This agent has a planning step, then generates python code to execute all its ac
 This is the go-to agent to solve reasoning tasks, since the ReAct framework ([Yao et al., 2022](https://huggingface.co/papers/2210.03629)) makes it really efficient to think on the basis of its previous observations.

 We implement two versions of ReactJsonAgent: 
- [`ReactJsonAgent`] generates tool calls as a JSON in its output.
- [`ReactCodeAgent`] is a new type of ReactJsonAgent that generates its tool calls as blobs of code, which works really well for LLMs that have strong coding performance.
+- [`~ReactJsonAgent`] generates tool calls as a JSON in its output.
+- [`~ReactCodeAgent`] is a new type of ReactJsonAgent that generates its tool calls as blobs of code, which works really well for LLMs that have strong coding performance.

 > [!TIP]
 > Read the [Open-source LLMs as LangChain Agents](https://huggingface.co/blog/open-source-llms-as-agents) blog post to learn more about the ReAct agent.
@ -124,7 +124,7 @@ You could use any `llm_engine` method as long as:

 You also need a `tools` argument which accepts a list of `Tools`. You can provide an empty list for `tools`, but use the default toolbox with the optional argument `add_base_tools=True`.

-Now you can create an agent, like [`CodeAgent`], and run it. For convenience, we also provide the [`HfEngine`] class that uses `huggingface_hub.InferenceClient` under the hood.
+Now you can create an agent, like `CodeAgent`, and run it. For convenience, we also provide the `HfEngine` class that uses `huggingface_hub.InferenceClient` under the hood.

 ```python
 from transformers import CodeAgent, HfEngine
@ -139,7 +139,7 @@ agent.run(
 ```

 This will be handy in case of emergency baguette need!
-You can even leave the argument `llm_engine` undefined, and an [`HfEngine`] will be created by default.
+You can even leave the argument `llm_engine` undefined, and an [~HfEngine] will be created by default.

 ```python
 from transformers import CodeAgent
@ -181,27 +181,13 @@ You can also run an agent consecutively for different tasks: each time the attri
 A Python interpreter executes the code on a set of inputs passed along with your tools.
 This should be safe because the only functions that can be called are the tools you provided (especially if it's only tools by Hugging Face) and the print function, so you're already limited in what can be executed.

-The Python interpreter also doesn't allow imports by default outside of a safe list, so all the most obvious attacks shouldn't be an issue.
-You can still authorize additional imports by passing the authorized modules as a list of strings in argument `additional_authorized_imports` upon initialization of your [`ReactCodeAgent`] or [`CodeAgent`]:
-
-```py
->>> from transformers import ReactCodeAgent
-
->>> agent = ReactCodeAgent(tools=[], additional_authorized_imports=['requests', 'bs4'])
->>>agent.run("Could you get me the title of the page at url 'https://huggingface.co/blog'?")
-
-(...)
-'Hugging Face – Blog'
-```
+The Python interpreter also doesn't allow any attribute lookup or imports (which shouldn't be needed for passing inputs/outputs to a small set of functions) so all the most obvious attacks shouldn't be an issue.

 The execution will stop at any code trying to perform an illegal operation or if there is a regular Python error with the code generated by the agent.

-> [!WARNING]
-> The LLM can generate arbitrary code that will then be executed: do not add any unsafe imports!
-
 ### The system prompt

-An agent, or rather the LLM that drives the agent, generates an output based on the system prompt. The system prompt can be customized and tailored to the intended task. For example, check the system prompt for the [`ReactCodeAgent`] (below version is slightly simplified).
+An agent, or rather the LLM that drives the agent, generates an output based on the system prompt. The system prompt can be customized and tailored to the intended task. For example, check the system prompt for the `ReactCodeAgent` (below version is slightly simplified).

 ```text
 You will be given a task to solve as best you can.
@ -260,7 +246,7 @@ of the available tools.

 A tool is an atomic function to be used by an agent.

-You can for instance check the [`PythonInterpreterTool`]: it has a name, a description, input descriptions, an output type, and a `__call__` method to perform the action.
+You can for instance check the [~PythonInterpreterTool]: it has a name, a description, input descriptions, an output type, and a `__call__` method to perform the action.

 When the agent is initialized, the tool attributes are used to generate a tool description which is baked into the agent's system prompt. This lets the agent know which tools it can use and why.

@ -273,7 +259,7 @@ Transformers comes with a default toolbox for empowering agents, that you can ad
 - **Speech to text**: given an audio recording of a person talking, transcribe the speech into text ([Whisper](./model_doc/whisper))
 - **Text to speech**: convert text to speech ([SpeechT5](./model_doc/speecht5))
 - **Translation**: translates a given sentence from source language to target language.
- **Python code interpreter**: runs your LLM-generated Python code in a secure environment. This tool will only be added to [`ReactJsonAgent`] if you use `add_base_tools=True`, since code-based tools can already execute Python code
+- **Python code interpreter**: runs your LLM-generated Python code in a secure environment. This tool will only be added to [~ReactJsonAgent] if you use `add_base_tools=True`, since code-based tools can already execute Python code


 You can manually use a tool by calling the [`load_tool`] function and a task to perform.
--- a/docs/source/en/generation_strategies.md
+++ b/docs/source/en/generation_strategies.md
@ -174,43 +174,6 @@ An increasing sequence: one, two, three, four, five, six, seven, eight, nine, te
 ```


-## KV Cache Quantization
-
-The `generate()` method supports caching keys and values to enhance efficiency and avoid re-computations. However the key and value
-cache can occupy a large portion of memory, becoming a bottleneck for long-context generation, especially for Large Language Models.
-Quantizing the cache when using `generate()` can significantly reduce memory requirements at the cost of speed. 
-
-KV Cache quantization in `transformers` is largely inspired by the paper [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache]
-(https://arxiv.org/abs/2402.02750) and currently supports `quanto` and `HQQ` as backends. For more information on the inner workings see the paper.
-
-To enable quantization of the key-value cache, one needs to indicate `cache_implementation="quantized"` in the `generation_config`.
-Quantization related arguments should be passed to the `generation_config` either as a `dict` or an instance of a [`QuantizedCacheConfig`] class.
-One has to indicate which quantization backend to use in the [`QuantizedCacheConfig`], the default is `quanto`.
-
-<Tip warning={true}>
-
-Cache quantization can be detrimental if the context length is short and there is enough GPU VRAM available to run without cache quantization.
-
-</Tip>
-
-
-```python
->>> import torch
->>> from transformers import AutoTokenizer, AutoModelForCausalLM
-
->>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
->>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
->>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"})
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-I like rock music because it's loud and energetic. It's a great way to express myself and rel
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-I like rock music because it's loud and energetic. I like to listen to it when I'm feeling
-```
-
 ## Watermarking

 The `generate()` method supports watermarking the generated text by randomly marking a portion of tokens as "green". 
--- a/docs/source/en/quantization/contribute.md
+++ b/docs/source/en/quantization/contribute.md
@ -64,6 +64,6 @@ For some quantization methods, they may require "pre-quantizing" the models thro

 6. Write the `_process_model_after_weight_loading` method. This method enables implementing additional features that require manipulating the model after loading the weights.

-7. Document everything! Make sure your quantization method is documented by adding a new file under `docs/source/en/quantization` and adding a new row in the table in `docs/source/en/quantization/overview.md`.
+7. Document everything! Make sure your quantization method is documented in the [`docs/source/en/quantization.md`](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/docs/source/en/quantization.md) file.

 8. Add tests! You should add tests by first adding the package in our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out how it is implemented for other quantization methods.
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@ -200,7 +200,7 @@ Flax), PyTorch, and/or TensorFlow.
 |                 [Megatron-BERT](model_doc/megatron-bert)                 |       ✅        |         ❌         |      ❌      |
 |                 [Megatron-GPT2](model_doc/megatron_gpt2)                 |       ✅        |         ✅         |      ✅      |
 |                       [MGP-STR](model_doc/mgp-str)                       |       ✅        |         ❌         |      ❌      |
-|                       [Mistral](model_doc/mistral)                       |       ✅        |         ✅         |      ✅      |
+|                       [Mistral](model_doc/mistral)                       |       ✅        |         ❌         |      ✅      |
 |                       [Mixtral](model_doc/mixtral)                       |       ✅        |         ❌         |      ❌      |
 |                         [mLUKE](model_doc/mluke)                         |       ✅        |         ❌         |      ❌      |
 |                           [MMS](model_doc/mms)                           |       ✅        |         ✅         |      ✅      |
--- a/docs/source/en/installation.md
+++ b/docs/source/en/installation.md
@ -169,7 +169,7 @@ Pretrained models are downloaded and locally cached at: `~/.cache/huggingface/hu

 ## Offline mode

-Run 🤗 Transformers in a firewalled or offline environment with locally cached files by setting the environment variable `HF_HUB_OFFLINE=1`.
+Run 🤗 Transformers in a firewalled or offline environment with locally cached files by setting the environment variable `TRANSFORMERS_OFFLINE=1`.

 <Tip>

@ -178,7 +178,7 @@ Add [🤗 Datasets](https://huggingface.co/docs/datasets/) to your offline train
 </Tip>

 ```bash
-HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \
+HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
 python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```

--- a/docs/source/en/internal/generation_utils.md
+++ b/docs/source/en/internal/generation_utils.md
@ -360,12 +360,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens
 [[autodoc]] Cache
    - update

-[[autodoc]] CacheConfig
-	- update
-
-[[autodoc]] QuantizedCacheConfig
-	- validate
-
 [[autodoc]] DynamicCache
    - update
    - get_seq_length
@ -373,14 +367,6 @@ A [`Constraint`] can be used to force the generation to include specific tokens
    - to_legacy_cache
    - from_legacy_cache

-[[autodoc]] QuantizedCache
-    - update
-    - get_seq_length
-
-[[autodoc]] QuantoQuantizedCache
-
-[[autodoc]] HQQQuantizedCache
-
 [[autodoc]] SinkCache
    - update
    - get_seq_length
@ -389,7 +375,7 @@ A [`Constraint`] can be used to force the generation to include specific tokens
 [[autodoc]] StaticCache
    - update
    - get_seq_length
-    - reset
+    - reorder_cache


 ## Watermark Utils
--- a/docs/source/en/llm_optims.md
+++ b/docs/source/en/llm_optims.md
@ -29,7 +29,7 @@ To optimize this, you can use a kv-cache to store the past keys and values inste
 The *static kv-cache* solves this issue by pre-allocating the kv-cache size to a maximum value which allows you to combine it with torch.compile for up to a 4x speed up.

 > [!WARNING]
-> Currently, only [Llama](./model_doc/llama2) and a few other models support static kv-cache and torch.compile. Check [this issue](https://github.com/huggingface/transformers/issues/28981) for a live model compatibility list.
+> Currently, only [Command R](./model_doc/cohere), [Gemma](./model_doc/gemma) and [Llama](./model_doc/llama2) models support static kv-cache and torch.compile.

 For this example, let's load the [Gemma](https://hf.co/google/gemma-2b) model.

--- a/docs/source/en/model_doc/dbrx.md
+++ b/docs/source/en/model_doc/dbrx.md
@ -31,7 +31,8 @@ We used curriculum learning for pretraining, changing the data mix during traini

 More detailed information about DBRX Instruct and DBRX Base can be found in our [technical blog post](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).

-This model was contributed by [eitan-turok](https://huggingface.co/eitanturok) and [abhi-db](https://huggingface.co/abhi-db). The original code can be found [here](https://github.com/databricks/dbrx-instruct), though this may not be up to date.
+
+This model was contributed by [eitan-turok](https://huggingface.co/eitanturok) and [abhi-db](https://huggingface.co/abhi-db). The original code can be found [here](https://github.com/databricks/dbrx), though this may not be up to date.

 ## Usage Examples

--- a/docs/source/en/model_doc/deta.md
+++ b/docs/source/en/model_doc/deta.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # DETA

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The DETA model was proposed in [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl.
--- a/docs/source/en/model_doc/efficientformer.md
+++ b/docs/source/en/model_doc/efficientformer.md
@ -16,36 +16,28 @@ rendered properly in your Markdown viewer.

 # EfficientFormer

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

-The EfficientFormer model was proposed in [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191)
+The EfficientFormer model was proposed in [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191) 
 by Yanyu Li, Geng Yuan, Yang Wen, Eric Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren.  EfficientFormer proposes a
 dimension-consistent pure transformer that can be run on mobile devices for dense prediction tasks like image classification, object
 detection and semantic segmentation.

 The abstract from the paper is the following:

-*Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
-However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally
-times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly
-challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation
-complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still
-unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance?
-To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs.
-Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm.
-Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer.
-Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices.
-Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on
-iPhone 12 (compiled with CoreML), which { runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1),} and our largest model,
-EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can
+*Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. 
+However, due to the massive number of parameters and model design, e.g., attention mechanism, ViT-based models are generally 
+times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly 
+challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation 
+complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still 
+unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? 
+To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. 
+Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. 
+Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. 
+Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. 
+Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on 
+iPhone 12 (compiled with CoreML), which { runs as fast as MobileNetV2×1.4 (1.6 ms, 74.7% top-1),} and our largest model, 
+EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can 
 reach extremely low latency on mobile devices while maintaining high performance.*

 This model was contributed by [novice03](https://huggingface.co/novice03) and [Bearnardd](https://huggingface.co/Bearnardd).
@ -101,4 +93,4 @@ The original code can be found [here](https://github.com/snap-research/Efficient
    - call

 </tf>
-</frameworkcontent>
+</frameworkcontent>
--- a/docs/source/en/model_doc/ernie_m.md
+++ b/docs/source/en/model_doc/ernie_m.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # ErnieM

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The ErnieM model was proposed in [ERNIE-M: Enhanced Multilingual Representation by Aligning
--- a/docs/source/en/model_doc/gemma.md
+++ b/docs/source/en/model_doc/gemma.md
@ -60,11 +60,6 @@ This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [
 [[autodoc]] GemmaForSequenceClassification
    - forward

-## GemmaForTokenClassification
-
-[[autodoc]] GemmaForTokenClassification
-    - forward
-
 ## FlaxGemmaModel

 [[autodoc]] FlaxGemmaModel
--- a/docs/source/en/model_doc/gptsan-japanese.md
+++ b/docs/source/en/model_doc/gptsan-japanese.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # GPTSAN-japanese

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The GPTSAN-japanese model was released in the repository by Toshiyuki Sakamoto (tanreinama).
--- a/docs/source/en/model_doc/graphormer.md
+++ b/docs/source/en/model_doc/graphormer.md
@ -1,7 +1,7 @@
 <!--Copyright 2022 The HuggingFace Team and Microsoft. All rights reserved.

 Licensed under the MIT License; you may not use this file except in compliance with
-the License.
+the License. 

 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
@ -14,17 +14,9 @@ rendered properly in your Markdown viewer.

 # Graphormer

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

-The Graphormer model was proposed in [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234)  by
+The Graphormer model was proposed in [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234)  by 
 Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen and Tie-Yan Liu. It is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention.

 The abstract from the paper is the following:
--- a/docs/source/en/model_doc/idefics2.md
+++ b/docs/source/en/model_doc/idefics2.md
@ -87,58 +87,6 @@ generated_text = processor.batch_decode(generated_text, skip_special_tokens=True
 print("Generated text:", generated_text)
 ```

- During training, it's important to determine which tokens the model should not learn. For Idefics2, this typically comes down to the image and padding tokens. This means that one can create the labels as follows:
-
-```python
-import requests
-from PIL import Image
-from transformers import Idefics2Processor, Idefics2ForConditionalGeneration
-import torch
-
-url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
-url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"
-
-image_1 = Image.open(requests.get(url_1, stream=True).raw)
-image_2 = Image.open(requests.get(url_2, stream=True).raw)
-images = [image_1, image_2]
-
-messages = [{
-    "role": "user",
-    "content": [
-        {"type": "text", "text": "What’s the difference between these two images?"},
-        {"type": "image"},
-        {"type": "image"},
-    ],
-},
-{
-    "role": "assistant",
-    "content": [
-        {"type": "text", "text": "The difference is that one image is about dogs and the other one about cats."},
-    ],
-}]
-
-device = "cuda" if torch.cuda.is_available() else "cpu"
-
-processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
-model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")
-model.to(device)
-
-text = processor.apply_chat_template(messages, add_generation_prompt=False)
-inputs = processor(images=images, text=text, return_tensors="pt").to(device)
-
-labels = inputs.input_ids.clone()
-labels[labels == processor.tokenizer.pad_token_id] = -100
-labels[labels == model.config.image_token_id] = -100
-
-inputs["labels"] = labels
-
-outputs = model(**inputs)
-loss = outputs.loss
-loss.backward()
-```
-
-Do note that when training Idefics2 on multi-turn conversations between a user and an assistant, one typically also sets all the tokens corresponding to the user messages to -100.
-
 ## Model optimizations: Flash Attention

 The code snippets above showcase inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
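
The label-masking step in the removed Idefics2 training snippet earlier in this hunk (padding and image tokens set to -100) can be illustrated without loading a model. This is a minimal sketch on plain Python lists rather than tensors; `mask_labels` is a hypothetical helper and the token ids below are made-up placeholders, not values read from a real tokenizer or config.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_labels(input_ids, ignored_token_ids):
    """Copy `input_ids` into labels, replacing ignored token ids with -100."""
    ignored = set(ignored_token_ids)
    return [IGNORE_INDEX if tok in ignored else tok for tok in input_ids]


# Hypothetical ids: 0 stands in for padding, 32001 for the image token.
PAD_ID, IMAGE_ID = 0, 32001
input_ids = [1, 32001, 32001, 523, 88, 2, 0, 0]
labels = mask_labels(input_ids, [PAD_ID, IMAGE_ID])
print(labels)  # [1, -100, -100, 523, 88, 2, -100, -100]
```

Positions labeled -100 contribute nothing to the cross-entropy loss, which is why the same trick is typically applied to user-turn tokens in multi-turn conversations.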
--- a/docs/source/en/model_doc/jukebox.md
+++ b/docs/source/en/model_doc/jukebox.md
@ -15,14 +15,6 @@ rendered properly in your Markdown viewer.
 -->
 # Jukebox

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The Jukebox model was proposed in [Jukebox: A generative model for music](https://arxiv.org/pdf/2005.00341.pdf)
@ -35,7 +27,7 @@ The abstract from the paper is the following:
 *We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multiscale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples, along with model weights and code.*

 As shown on the following figure, Jukebox is made of 3 `priors` which are decoder only models. They follow the architecture described in [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509), modified to support longer context length.
-First, a autoencoder is used to encode the text lyrics. Next, the first (also called `top_prior`) prior attends to the last hidden states extracted from the lyrics encoder. The priors are linked to the previous priors respectively via an `AudioConditioner` module. The`AudioConditioner` upsamples the outputs of the previous prior to raw tokens at a certain audio frame per second resolution.
+First, a autoencoder is used to encode the text lyrics. Next, the first (also called `top_prior`) prior attends to the last hidden states extracted from the lyrics encoder. The priors are linked to the previous priors respectively via an `AudioConditioner` module. The`AudioConditioner` upsamples the outputs of the previous prior to raw tokens at a certain audio frame per second resolution. 
 The metadata such as *artist, genre and timing* are passed to each prior, in the form of a start token and positional embedding for the timing data.  The hidden states are mapped to the closest codebook vector from the VQVAE in order to convert them to raw audio.

 ![JukeboxModel](https://gist.githubusercontent.com/ArthurZucker/92c1acaae62ebf1b6a951710bdd8b6af/raw/c9c517bf4eff61393f6c7dec9366ef02bdd059a3/jukebox.svg)
--- a/docs/source/en/model_doc/llama.md
+++ b/docs/source/en/model_doc/llama.md
@ -121,11 +121,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] LlamaForQuestionAnswering
    - forward

-## LlamaForTokenClassification
-
-[[autodoc]] LlamaForTokenClassification
-    - forward
-
 ## FlaxLlamaModel

 [[autodoc]] FlaxLlamaModel
--- a/docs/source/en/model_doc/llava_next.md
+++ b/docs/source/en/model_doc/llava_next.md
@ -68,8 +68,6 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/

 ## Usage example

-### Single image inference
-
 Here's how to load the model and perform inference in half-precision (`torch.float16`):

 ```python
@ -96,45 +94,6 @@ output = model.generate(**inputs, max_new_tokens=100)
 print(processor.decode(output[0], skip_special_tokens=True))
 ```

-### Multi image inference
-
-LLaVa-Next can perform inference with multiple images as input, where images either belong to the same prompt or different prompts (in batched inference). Here is how you can do it:
-
-```python
-import requests
-from PIL import Image
-import torch
-from transformers import AutoProcessor, LlavaNextForConditionalGeneration
-
-# Load the model in half-precision
-model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, device_map="auto")
-processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
-
-# Get three different images
-url = "https://www.ilankelman.org/stopsigns/australia.jpg"
-image_stop = Image.open(requests.get(url, stream=True).raw)
-
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-image_cats = Image.open(requests.get(url, stream=True).raw)
-
-url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
-image_snowman = Image.open(requests.get(url, stream=True).raw)
-
-# Prepare a batched prompt, where the first one is a multi-turn conversation and the second is not
-prompt = [
-    "[INST] <image>\nWhat is shown in this image? [/INST] There is a red stop sign in the image. [INST] <image>\nWhat about this image? How many cats do you see [/INST]",
-    "[INST] <image>\nWhat is shown in this image? [/INST]"
-]
-
-# We can simply feed images in the order they have to be used in the text prompt
-# Each "<image>" token uses one image leaving the next for the subsequent "<image>" tokens
-inputs = processor(text=prompt, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device)
-
-# Generate
-generate_ids = model.generate(**inputs, max_new_tokens=30)
-processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
-```
-
 ## Model optimization

 ### Quantization using Bitsandbytes
--- a/docs/source/en/model_doc/mask2former.md
+++ b/docs/source/en/model_doc/mask2former.md
@ -41,7 +41,6 @@ This model was contributed by [Shivalika Singh](https://huggingface.co/shivi) an
 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Mask2Former.

 - Demo notebooks regarding inference + fine-tuning Mask2Former on custom data can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Mask2Former).
-- Scripts for finetuning [`Mask2Former`] with [`Trainer`] or [Accelerate](https://huggingface.co/docs/accelerate/index) can be found [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/instance-segmentation).

 If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it.
 The resource should ideally demonstrate something new instead of duplicating an existing resource.
--- a/docs/source/en/model_doc/maskformer.md
+++ b/docs/source/en/model_doc/maskformer.md
@ -51,7 +51,6 @@ This model was contributed by [francesco](https://huggingface.co/francesco). The
 <PipelineTag pipeline="image-segmentation"/>

 - All notebooks that illustrate inference as well as fine-tuning on custom data with MaskFormer can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/MaskFormer).
-- Scripts for finetuning [`MaskFormer`] with [`Trainer`] or [Accelerate](https://huggingface.co/docs/accelerate/index) can be found [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/instance-segmentation).

 ## MaskFormer specific outputs

--- a/docs/source/en/model_doc/mega.md
+++ b/docs/source/en/model_doc/mega.md
@ -16,20 +16,12 @@ rendered properly in your Markdown viewer.

 # MEGA

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The MEGA model was proposed in [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer.
-MEGA proposes a new approach to self-attention with each encoder layer having a multi-headed exponential moving average in addition to a single head of standard dot-product attention, giving the attention mechanism
-stronger positional biases. This allows MEGA to perform competitively to Transformers on standard benchmarks including LRA
-while also having significantly fewer parameters. MEGA's compute efficiency allows it to scale to very long sequences, making it an
+MEGA proposes a new approach to self-attention with each encoder layer having a multi-headed exponential moving average in addition to a single head of standard dot-product attention, giving the attention mechanism 
+stronger positional biases. This allows MEGA to perform competitively to Transformers on standard benchmarks including LRA 
+while also having significantly fewer parameters. MEGA's compute efficiency allows it to scale to very long sequences, making it an 
 attractive option for long-document NLP tasks.

 The abstract from the paper is the following:
@ -42,8 +34,8 @@ The original code can be found [here](https://github.com/facebookresearch/mega).

 ## Usage tips

-- MEGA can perform quite well with relatively few parameters. See Appendix D in the MEGA paper for examples of architectural specs which perform well in various settings. If using MEGA as a decoder, be sure to set `bidirectional=False` to avoid errors with default bidirectional.
-- Mega-chunk is a variant of mega that reduces time and spaces complexity from quadratic to linear. Utilize chunking with MegaConfig.use_chunking and control chunk size with MegaConfig.chunk_size
+- MEGA can perform quite well with relatively few parameters. See Appendix D in the MEGA paper for examples of architectural specs which perform well in various settings. If using MEGA as a decoder, be sure to set `bidirectional=False` to avoid errors with default bidirectional. 
+- Mega-chunk is a variant of mega that reduces time and spaces complexity from quadratic to linear. Utilize chunking with MegaConfig.use_chunking and control chunk size with MegaConfig.chunk_size 


 ## Implementation Notes
--- a/docs/source/en/model_doc/mistral.md
+++ b/docs/source/en/model_doc/mistral.md
@ -203,11 +203,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 [[autodoc]] MistralForSequenceClassification
    - forward

-## MistralForTokenClassification
-
-[[autodoc]] MistralForTokenClassification
-    - forward
-
 ## FlaxMistralModel

 [[autodoc]] FlaxMistralModel
@ -216,19 +211,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
 ## FlaxMistralForCausalLM

 [[autodoc]] FlaxMistralForCausalLM
-    - __call__
-
-## TFMistralModel
-
-[[autodoc]] TFMistralModel
-    - call
-
-## TFMistralForCausalLM
-
-[[autodoc]] TFMistralForCausalLM
-    - call
-
-## TFMistralForSequenceClassification
-
-[[autodoc]] TFMistralForSequenceClassification
-    - call
+    - __call__
--- a/docs/source/en/model_doc/mixtral.md
+++ b/docs/source/en/model_doc/mixtral.md
@ -204,8 +204,3 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

 [[autodoc]] MixtralForSequenceClassification
    - forward
-
-## MixtralForTokenClassification
-
-[[autodoc]] MixtralForTokenClassification
-    - forward
--- a/docs/source/en/model_doc/nat.md
+++ b/docs/source/en/model_doc/nat.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # Neighborhood Attention Transformer

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 NAT was proposed in [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143)
--- a/docs/source/en/model_doc/nezha.md
+++ b/docs/source/en/model_doc/nezha.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # Nezha

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The Nezha model was proposed in [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei et al.
@ -33,8 +25,8 @@ The abstract from the paper is the following:
 *The pre-trained language models have achieved great successes in various natural language understanding (NLU) tasks
 due to its capacity to capture the deep contextualized information in text by pre-training on large-scale corpora.
 In this technical report, we present our practice of pre-training language models named NEZHA (NEural contextualiZed
-representation for CHinese lAnguage understanding) on Chinese corpora and finetuning for the Chinese NLU tasks.
-The current version of NEZHA is based on BERT with a collection of proven improvements, which include Functional
+representation for CHinese lAnguage understanding) on Chinese corpora and finetuning for the Chinese NLU tasks. 
+The current version of NEZHA is based on BERT with a collection of proven improvements, which include Functional 
 Relative Positional Encoding as an effective positional encoding scheme, Whole Word Masking strategy,
 Mixed Precision Training and the LAMB Optimizer in training the models. The experimental results show that NEZHA
 achieves the state-of-the-art performances when finetuned on several representative Chinese tasks, including
@ -93,4 +85,4 @@ This model was contributed by [sijunhe](https://huggingface.co/sijunhe). The ori
 ## NezhaForQuestionAnswering

 [[autodoc]] NezhaForQuestionAnswering
-    - forward
+    - forward
--- a/docs/source/en/model_doc/paligemma.md
+++ b/docs/source/en/model_doc/paligemma.md
@ -18,51 +18,11 @@ rendered properly in your Markdown viewer.

 ## Overview

-The PaliGemma model was proposed in [PaliGemma – Google's Cutting-Edge Open Vision Language Model](https://huggingface.co/blog/paligemma) by Google. It is a 3B vision-language model composed by a [SigLIP](siglip) vision encoder and a [Gemma](gemma) language decoder linked by a multimodal linear projection. It cuts an image into a fixed number of VIT tokens and prepends it to an optional prompt. One particularity is that the model uses full block attention on all the image tokens plus the input text tokens. It comes in 3 resolutions, 224x224, 448x448 and 896x896 with 3 base models, with 55 fine-tuned versions for different tasks, and 2 mix models.
+The PaliGemma model was proposed by Google. It is a 3B VLM composed by a Siglip-400m vision encoder and a Gemma-2B decoder linked by a multimodal linear projection. It is not a chat model with images. It cuts an image into a fixed number of VIT tokens and prepends it to an optional prompt. One particularity is that the model uses full block attention on all the image tokens plus the input text tokens. It comes in 3 resolutions, 224x224, 448x448 and 896x896 with 3 base models, with 55 fine-tuned versions for different tasks, and 2 mix models.

-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/paligemma/paligemma_arch.png"
-alt="drawing" width="600"/>
-
-<small> PaliGemma architecture. Taken from the <a href="https://huggingface.co/blog/paligemma">blog post.</a> </small>

 This model was contributed by [Molbap](https://huggingface.co/Molbap).

-## Usage tips
-
-Inference with PaliGemma can be performed as follows:
-
-```python
-from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
-
-model_id = "google/paligemma-3b-mix-224"
-model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
-processor = AutoProcessor.from_pretrained(model_id)
-
-prompt = "What is on the flower?"
-image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
-raw_image = Image.open(requests.get(image_file, stream=True).raw)
-inputs = processor(prompt, raw_image, return_tensors="pt")
-output = model.generate(**inputs, max_new_tokens=20)
-
-print(processor.decode(output[0], skip_special_tokens=True)[len(prompt):])
-```
-
-- PaliGemma is not meant for conversational use, and it works best when fine-tuning to a specific use case. Some downstream tasks on which PaliGemma can be fine-tuned include image captioning, visual question answering (VQA), object detection, referring expression segmentation and document understanding.
-- One can use `PaliGemmaProcessor` to prepare images, text and optional labels for the model. When fine-tuning a PaliGemma model, the `suffix` argument can be passed to the processor which creates the `labels` for the model:
-
-```python
-prompt = "What is on the flower?"
-answer = "a bee"
-inputs = processor(text=prompt, images=raw_image, suffix=answer, return_tensors="pt")
-```
-
-## Resources
-
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with PaliGemma. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
-
-- A blog post introducing all the features of PaliGemma can be found [here](https://huggingface.co/blog/paligemma).
-- Demo notebooks on how to fine-tune PaliGemma for VQA with the Trainer API along with inference can be found [here](https://github.com/huggingface/notebooks/tree/main/examples/paligemma).
-- Demo notebooks on how to fine-tune PaliGemma on a custom dataset (receipt image -> JSON) along with inference can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/PaliGemma). 🌎

 ## PaliGemmaConfig

--- a/docs/source/en/model_doc/persimmon.md
+++ b/docs/source/en/model_doc/persimmon.md
@ -96,8 +96,3 @@ The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece. T

 [[autodoc]] PersimmonForSequenceClassification
    - forward
-
-## PersimmonForTokenClassification
-
-[[autodoc]] PersimmonForTokenClassification
-    - forward
--- a/docs/source/en/model_doc/qdqbert.md
+++ b/docs/source/en/model_doc/qdqbert.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # QDQBERT

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The QDQBERT model can be referenced in [Integer Quantization for Deep Learning Inference: Principles and Empirical
--- a/docs/source/en/model_doc/qwen2.md
+++ b/docs/source/en/model_doc/qwen2.md
@ -80,8 +80,3 @@ In the following, we demonstrate how to use `Qwen2-7B-Chat-beta` for the inferen

 [[autodoc]] Qwen2ForSequenceClassification
    - forward
-
-## Qwen2ForTokenClassification
-
-[[autodoc]] Qwen2ForTokenClassification
-    - forward
--- a/docs/source/en/model_doc/qwen2_moe.md
+++ b/docs/source/en/model_doc/qwen2_moe.md
@ -75,8 +75,3 @@ In the following, we demonstrate how to use `Qwen1.5-MoE-A2.7B-Chat` for the inf

 [[autodoc]] Qwen2MoeForSequenceClassification
    - forward
-
-## Qwen2MoeForTokenClassification
-
-[[autodoc]] Qwen2MoeForTokenClassification
-    - forward
--- a/docs/source/en/model_doc/realm.md
+++ b/docs/source/en/model_doc/realm.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # REALM

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The REALM model was proposed in [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang. It's a
@ -94,4 +86,4 @@ This model was contributed by [qqaatw](https://huggingface.co/qqaatw). The origi

 [[autodoc]] RealmForOpenQA
    - block_embedding_to
-    - forward
+    - forward
--- a/docs/source/en/model_doc/sam.md
+++ b/docs/source/en/model_doc/sam.md
@ -81,10 +81,10 @@ processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
 img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
 raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
 mask_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
-segmentation_map = Image.open(requests.get(mask_url, stream=True).raw).convert("1")
+segmentation_map = Image.open(requests.get(mask_url, stream=True).raw).convert("RGB")
 input_points = [[[450, 600]]]  # 2D location of a window in the image

-inputs = processor(raw_image, input_points=input_points, segmentation_maps=segmentation_map, return_tensors="pt").to(device)
+inputs = processor(raw_image, input_points=input_points, segmentation_maps=mask, return_tensors="pt").to(device)
 with torch.no_grad():
    outputs = model(**inputs)

--- a/docs/source/en/model_doc/speech_to_text_2.md
+++ b/docs/source/en/model_doc/speech_to_text_2.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # Speech2Text2

-  <Tip warning={true}>
-
-  This model is in maintenance mode only, we don't accept any new PRs changing its code.
-  If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-  You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-  </Tip>
-
 ## Overview

 The Speech2Text2 model is used together with [Wav2Vec2](wav2vec2) for Speech Translation models proposed in
--- a/docs/source/en/model_doc/stablelm.md
+++ b/docs/source/en/model_doc/stablelm.md
@ -104,8 +104,3 @@ Now, to run the model with Flash Attention 2, refer to the snippet below:

 [[autodoc]] StableLmForSequenceClassification
    - forward
-
-## StableLmForTokenClassification
-
-[[autodoc]] StableLmForTokenClassification
-    - forward
--- a/docs/source/en/model_doc/starcoder2.md
+++ b/docs/source/en/model_doc/starcoder2.md
@ -66,8 +66,3 @@ These ready-to-use checkpoints can be downloaded and used via the HuggingFace Hu

 [[autodoc]] Starcoder2ForSequenceClassification
    - forward
-
-## Starcoder2ForTokenClassification
-
-[[autodoc]] Starcoder2ForTokenClassification
-    - forward
--- a/docs/source/en/model_doc/superpoint.md
+++ b/docs/source/en/model_doc/superpoint.md
@ -38,17 +38,12 @@ to repeatedly detect a much richer set of interest points than the initial pre-a
 traditional corner detector. The final system gives rise to state-of-the-art homography estimation results on HPatches
 when compared to LIFT, SIFT and ORB.*

-<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/superpoint_architecture.png"
-alt="drawing" width="500"/>
-
-<small> SuperPoint overview. Taken from the <a href="https://arxiv.org/abs/1712.07629v4">original paper.</a> </small>
-
-## Usage tips
+## How to use

 Here is a quick example of using the model to detect interest points in an image:

 ```python
-from transformers import AutoImageProcessor, SuperPointForKeypointDetection
+from transformers import AutoImageProcessor, AutoModel
 import torch
 from PIL import Image
 import requests
@ -57,7 +52,7 @@ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 image = Image.open(requests.get(url, stream=True).raw)

 processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
-model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")
+model = AutoModel.from_pretrained("magic-leap-community/superpoint")

 inputs = processor(image, return_tensors="pt")
 outputs = model(**inputs)
@ -69,7 +64,7 @@ You can also feed multiple images to the model. Due to the nature of SuperPoint,
 you will need to use the mask attribute to retrieve the respective information:

 ```python
-from transformers import AutoImageProcessor, SuperPointForKeypointDetection
+from transformers import AutoImageProcessor, AutoModel
 import torch
 from PIL import Image
 import requests
@ -82,7 +77,7 @@ image_2 = Image.open(requests.get(url_image_2, stream=True).raw)
 images = [image_1, image_2]

 processor = AutoImageProcessor.from_pretrained("magic-leap-community/superpoint")
-model = SuperPointForKeypointDetection.from_pretrained("magic-leap-community/superpoint")
+model = AutoModel.from_pretrained("magic-leap-community/superpoint")

 inputs = processor(images, return_tensors="pt")
 outputs = model(**inputs)
@ -108,12 +103,6 @@ cv2.imwrite("output_image.png", image)
 This model was contributed by [stevenbucaille](https://huggingface.co/stevenbucaille).
 The original code can be found [here](https://github.com/magicleap/SuperPointPretrainedNetwork).

-## Resources
-
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SuperPoint. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
-
- A notebook showcasing inference and visualization with SuperPoint can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SuperPoint/Inference_with_SuperPoint_to_detect_interest_points_in_an_image.ipynb). 🌎
-
 ## SuperPointConfig

 [[autodoc]] SuperPointConfig
--- a/docs/source/en/model_doc/tvlt.md
+++ b/docs/source/en/model_doc/tvlt.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # TVLT

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The TVLT model was proposed in [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156)
@ -68,7 +60,7 @@ The original code can be found [here](https://github.com/zinengtang/TVLT). This

 [[autodoc]] TvltFeatureExtractor
    - __call__
-
+    
 ## TvltModel

 [[autodoc]] TvltModel
--- a/docs/source/en/model_doc/video_llava.md
+++ b/docs/source/en/model_doc/video_llava.md
@ -42,28 +42,21 @@ a unified visual representation, outperforming models designed specifically for
 work to provide modest insights into the multi-modal inputs
 for the LLM*

-## Usage tips:
+Tips:

 - We advise users to use padding_side="left" when computing batched generation as it leads to more accurate results. Simply make sure to call processor.tokenizer.padding_side = "left" before generating.

 - Note that the model has not been explicitly trained to process multiple images/videos in the same prompt. Although this is technically possible, you may experience inaccurate results.

- Note that the video inputs should have exactly 8 frames at the input, since the models were trained in that setting. 
+- For better results, we recommend users prompt the model with the correct prompt format:

-This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
-The original code can be found [here](https://github.com/PKU-YuanGroup/Video-LLaVA).
-
-
-## Usage example
-
-### Single Media Mode
-
-The model can accept both images and videos as input. Here's an example code for inference in half-precision (`torch.float16`):

 ```python
 import av
 import torch
 import numpy as np
+import requests
+from PIL import Image
 from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

 def read_video_pyav(container, indices):
@ -86,99 +79,36 @@ def read_video_pyav(container, indices):
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

-# Load the model in half-precision
-model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf", torch_dtype=torch.float16, device_map="auto")
+
+model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf", device_map="auto")
 processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

-# Load the video as an np.arrau, sampling uniformly 8 frames
 video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
+
 container = av.open(video_path)
 total_frames = container.streams.video[0].frames
 indices = np.arange(0, total_frames, total_frames / 8).astype(int)
 video = read_video_pyav(container, indices)

-# For better results, we recommend to prompt the model in the following format
 prompt = "USER: <video>Why is this funny? ASSISTANT:"
 inputs = processor(text=prompt, videos=video, return_tensors="pt")

-out = model.generate(**inputs, max_new_tokens=60)
-processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)
+out = model.generate(**inputs, max_new_tokens=40)
+print(processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True))
 ```

-For multiple turns conversation change the prompt format to:
+For multi-turn conversations, change the prompt to:

 ```bash
 "USER: <video>What do you see in this video? ASSISTANT: A baby reading a book. USER: Why is the it funny? ASSISTANT:"
 ```

-### Mixed Media Mode
-
-The model can also generate from an interleaved image-video inputs. However note, that it was not trained in interleaved image-video setting which might affect the performance. Below is an example usage for mixed media input, add the following lines to the above code snippet: 
-
-```python
-from PIL import Image
-import requests
-
-# Generate from image and video mixed inputs
-# Load and image and write a new prompt
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-image = Image.open(requests.get(url, stream=True).raw)
-prompt = "USER: <image> How many cats are there in the image? ASSISTANT: There are two cats. USER: <video>Why is this video funny? ASSISTANT:"
-
-inputs = processor(text=prompt, images=image, videos=clip, padding=True, return_tensors="pt")
-
-# Generate
-generate_ids = model.generate(**inputs, max_length=50)
-processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
-
-```
-
-## Model optimization
-
-### Quantization using Bitsandbytes for memory efficiency
-
-The model can be loaded in lower bits, significantly reducing memory burden while maintaining the performance of the original model. his allows for efficient deployment on resource-constrained cases. 
-
-First make sure to install bitsandbytes by running `pip install bitsandbytes` and to have access to a CUDA compatible GPU device. Load the quantized model by simply adding [`BitsAndBytesConfig`](../main_classes/quantization#transformers.BitsAndBytesConfig) as shown below:
+- Note that video inputs should have exactly 8 frames, since the models were trained in that setting.


-```python
-from transformers import VideoLlavaForConditionalGeneration, BitsAndBytesConfig

-# specify how to quantize the model
-quantization_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="nf4",
-    bnb_4bit_compute_dtype=torch.float16,
-)
-
-model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf", quantization_config=quantization_config, device_map="auto")
-```
-
-
-### Flash-Attention 2 to speed-up generation
-
-Additionally, we can greatly speed-up model inference by using [Flash Attention](../perf_train_gpu_one.md#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
-
-First, make sure to install the latest version of Flash Attention 2:
-
-```bash
-pip install -U flash-attn --no-build-isolation
-```
-
-Also, you should have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
-
-To load and run a model using Flash Attention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:
-
-```python
-from transformers import VideoLlavaForConditionalGeneration
-
-model = VideoLlavaForConditionalGeneration.from_pretrained(
-    "LanguageBind/Video-LLaVA-7B-hf", 
-    torch_dtype=torch.float16, 
-    attn_implementation="flash_attention_2",
-).to(0)
-```
+This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
+The original code can be found [here](https://github.com/PKU-YuanGroup/Video-LLaVA).


 ## VideoLlavaConfig
--- a/docs/source/en/model_doc/vit_hybrid.md
+++ b/docs/source/en/model_doc/vit_hybrid.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # Hybrid Vision Transformer (ViT Hybrid)

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 ## Overview

 The hybrid Vision Transformer (ViT) model was proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition
--- a/docs/source/en/model_doc/whisper.md
+++ b/docs/source/en/model_doc/whisper.md
@ -78,8 +78,6 @@ Here is a step-by-step guide to transcribing an audio sample using a pre-trained

 A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Whisper. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

- [Fine-tune Whisper](https://huggingface.co/blog/fine-tune-whisper) on your own dataset for better downstream performance.
- [Distil-Whisper](https://huggingface.co/distil-whisper): Upto 6x faster, 2x smaller distilled Whisper models for English. We release the [model checkpoints](https://huggingface.co/distil-whisper), and [distillation code](https://github.com/huggingface/distil-whisper).
 - A fork with a script to [convert a Whisper model in Hugging Face format to OpenAI format](https://github.com/zuazo-forks/transformers/blob/convert_hf_to_openai/src/transformers/models/whisper/convert_hf_to_openai.py). 🌎
 Usage example:
 ```bash
--- a/docs/source/en/model_doc/xclip.md
+++ b/docs/source/en/model_doc/xclip.md
@ -30,7 +30,7 @@ Tips:
 - Usage of X-CLIP is identical to [CLIP](clip).

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png"
-alt="drawing" width="600"/>
+alt="drawing" width="600"/> 

 <small> X-CLIP architecture. Taken from the <a href="https://arxiv.org/abs/2208.02816">original paper.</a> </small>

--- a/docs/source/en/model_doc/xlm-prophetnet.md
+++ b/docs/source/en/model_doc/xlm-prophetnet.md
@ -16,14 +16,6 @@ rendered properly in your Markdown viewer.

 # XLM-ProphetNet

-<Tip warning={true}>
-
-This model is in maintenance mode only, we don't accept any new PRs changing its code.
-If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
-You can do so by running the following command: `pip install -U transformers==4.40.2`.
-
-</Tip>
-
 <div class="flex flex-wrap space-x-1">
 <a href="https://huggingface.co/models?filter=xprophetnet">
 <img alt="Models" src="https://img.shields.io/badge/All_model_pages-xprophetnet-blueviolet">
--- a/docs/source/en/peft.md
+++ b/docs/source/en/peft.md
@ -81,17 +81,15 @@ model = AutoModelForCausalLM.from_pretrained(model_id)
 model.load_adapter(peft_model_id)
 ```

-Check out the [API documentation](#transformers.integrations.PeftAdapterMixin) section below for more details.
-
 ## Load in 8bit or 4bit

 The `bitsandbytes` integration supports 8bit and 4bit precision data types, which are useful for loading large models because they save memory (see the `bitsandbytes` integration [guide](./quantization#bitsandbytes-integration) to learn more). Add the `load_in_8bit` or `load_in_4bit` parameters to [`~PreTrainedModel.from_pretrained`] and set `device_map="auto"` to effectively distribute the model to your hardware:

 ```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+from transformers import AutoModelForCausalLM, AutoTokenizer

 peft_model_id = "ybelkada/opt-350m-lora"
-model = AutoModelForCausalLM.from_pretrained(peft_model_id, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
+model = AutoModelForCausalLM.from_pretrained(peft_model_id, device_map="auto", load_in_8bit=True)
 ```

 ## Add a new adapter
@ -229,19 +227,6 @@ lora_config = LoraConfig(
 model.add_adapter(lora_config)
 ```

-## API docs
-
-[[autodoc]] integrations.PeftAdapterMixin
-    - load_adapter
-    - add_adapter
-    - set_adapter
-    - disable_adapters
-    - enable_adapters
-    - active_adapters
-    - get_adapter_state_dict
-
-
-

 <!--
 TODO: (@younesbelkada @stevhliu)
--- a/docs/source/en/perf_infer_gpu_one.md
+++ b/docs/source/en/perf_infer_gpu_one.md
@ -354,20 +354,20 @@ If you're curious and interested in learning more about the concepts underlying
 To load a model in 8-bit for inference, use the `load_in_8bit` parameter. The `device_map` parameter is optional, but we recommend setting it to `"auto"` to allow 🤗 Accelerate to automatically and efficiently allocate the model given the available resources in the environment:

 ```py
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+from transformers import AutoModelForCausalLM

 model_name = "bigscience/bloom-2b5"
-model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
 ```

 If you're loading a model in 8-bit for text generation, you should use the [`~transformers.GenerationMixin.generate`] method instead of the [`Pipeline`] function which is not optimized for 8-bit models and will be slower. Some sampling strategies, like nucleus sampling, are also not supported by the [`Pipeline`] for 8-bit models. You should also place all inputs on the same device as the model:

 ```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+from transformers import AutoModelForCausalLM, AutoTokenizer

 model_name = "bigscience/bloom-2b5"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
-model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

 prompt = "Hello, my llama is cute"
 inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
--- a/docs/source/en/quantization.md
+++ b/docs/source/en/quantization.md
@ -0,0 +1,818 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# Quantization
+
+Quantization techniques represent data with less information while trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating-point values and they're quantized to 16-bit floating-point values, the model size is halved, which makes the model easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.
+
+Transformers supports several quantization schemes to help you run inference with large language models (LLMs) and finetune adapters on quantized models. This guide will show you how to use Activation-aware Weight Quantization (AWQ), AutoGPTQ, and bitsandbytes.
+
+<Tip>
+
+Interested in adding a new quantization method to Transformers? Read the [HfQuantizer](./hf_quantizer) guide to learn how!
+
+</Tip>
+
+## Quanto
+
+<Tip>
+
+Try Quanto + transformers with this [notebook](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing)!
+
+</Tip>
+
+
+The [🤗 Quanto](https://github.com/huggingface/quanto) library is a versatile PyTorch quantization toolkit. It uses linear quantization and provides several unique features, such as:
+
+- weight quantization (`float8`, `int8`, `int4`, `int2`)
+- activation quantization (`float8`, `int8`)
+- modality agnostic (e.g. CV, LLM)
+- device agnostic (e.g. CUDA, MPS, CPU)
+- compatibility with `torch.compile`
+- easy to add a custom kernel for a specific device
+- support for quantization-aware training
+<!-- Add link to the blogpost -->
+
+Before you begin, make sure the following libraries are installed:
+
+```bash
+pip install quanto
+pip install git+https://github.com/huggingface/accelerate.git
+pip install git+https://github.com/huggingface/transformers.git
+```
+
+Now you can quantize a model by passing a [`QuantoConfig`] object to the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it contains `torch.nn.Linear` layers.
+
+The integration with Transformers only supports weight quantization. For more complex use cases such as activation quantization, calibration, and quantization-aware training, you should use the [quanto](https://github.com/huggingface/quanto) library instead.
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
+
+model_id = "facebook/opt-125m"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+quantization_config = QuantoConfig(weights="int8")
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)
+```
+
+Note that serialization is not yet supported with Transformers, but it is coming soon! If you want to save the model, you can use the quanto library instead.
+
+The Quanto library uses a linear quantization algorithm. Even though this is a basic quantization technique, we get very good results! Have a look at the following benchmark (llama-2-7b on the perplexity metric). You can find more benchmarks [here](https://github.com/huggingface/quanto/tree/main/bench/generation).
+
+<div class="flex gap-4">
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/NousResearch-Llama-2-7b-hf_Perplexity.png" alt="llama-2-7b-quanto-perplexity" />
+  </div>
+</div>
+
+The library is versatile enough to be compatible with most PTQ optimization algorithms. The plan is to integrate the most popular algorithms (AWQ, SmoothQuant) as seamlessly as possible.
+
+## AQLM
+
+
+
+Try AQLM on [Google Colab](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing)!
+
+Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a compression method for large language models. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.
+
+Inference support for AQLM is realised in the `aqlm` library. Make sure to install it to run the models (note that `aqlm` only works with Python >= 3.10):
+```bash
+pip install aqlm[gpu,cpu]
+```
+
+The library provides efficient kernels for both GPU and CPU inference and training.
+
+The instructions on how to quantize models yourself, as well as all the relevant code can be found in the corresponding GitHub [repository](https://github.com/Vahe1994/AQLM).
+
+### PEFT
+
+Starting with version `aqlm 1.0.2`, AQLM supports Parameter-Efficient Fine-Tuning in the form of [LoRA](https://huggingface.co/docs/peft/package_reference/lora) integrated into the [PEFT](https://huggingface.co/blog/peft) library.
+
+### AQLM configurations
+
+AQLM quantization setups vary mainly in the number of codebooks used and the codebook size in bits. The most popular setups, along with the inference kernels they support, are:
+ 
+| Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup     | Fast GPU inference | Fast CPU inference |
+|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------|
+| Triton | K                   | N                  | KxN     | -        | Up to ~0.7x | ✅                  | ❌                  |
+| CUDA | 1                   | 16                  | 1x16     | Best        | Up to ~1.3x | ✅                  | ❌                  |
+| CUDA | 2                   | 8                   | 2x8      | OK          | Up to ~3.0x | ✅                  | ❌                  |
+| Numba | K                   | 8                   | Kx8      | Good        | Up to ~4.0x | ❌                  | ✅                  |
+
+## AWQ
+
+<Tip>
+
+Try AWQ quantization with this [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY)!
+
+</Tip>
+
+[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) doesn't quantize all the weights in a model, and instead, it preserves a small percentage of weights that are important for LLM performance. This significantly reduces quantization loss such that you can run models in 4-bit precision without experiencing any performance degradation.
+
+There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ) or [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide will show you how to load models quantized with autoawq, but the process is similar for llm-awq quantized models.
+
+Make sure you have autoawq installed:
+
+```bash
+pip install autoawq
+```
+
+AWQ-quantized models can be identified by checking the `quantization_config` attribute in the model's [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file:
+
+```json
+{
+  "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source",
+  "architectures": [
+    "MistralForCausalLM"
+  ],
+  ...
+  ...
+  ...
+  "quantization_config": {
+    "quant_method": "awq",
+    "zero_point": true,
+    "group_size": 128,
+    "bits": 4,
+    "version": "gemm"
+  }
+}
+```
+
+A quantized model is loaded with the [`~PreTrainedModel.from_pretrained`] method. If you loaded your model on the CPU, make sure to move it to a GPU device first. Use the `device_map` parameter to specify where to place the model:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "TheBloke/zephyr-7B-alpha-AWQ"
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
+```
+
+Loading an AWQ-quantized model automatically sets other weights to fp16 by default for performance reasons. If you want to load these other weights in a different format, use the `torch_dtype` parameter:
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "TheBloke/zephyr-7B-alpha-AWQ"
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
+```
+
+AWQ quantization can also be combined with [FlashAttention-2](perf_infer_gpu_one#flashattention-2) to further accelerate inference:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0")
+```
+
+### Fused modules
+
+Fused modules offer improved accuracy and performance. They are supported out of the box for AWQ modules for [Llama](https://huggingface.co/meta-llama) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) architectures, but you can also fuse AWQ modules for unsupported architectures.
+
+<Tip warning={true}>
+
+Fused modules cannot be combined with other optimization techniques such as FlashAttention-2.
+
+</Tip>
+
+<hfoptions id="fuse">
+<hfoption id="supported architectures">
+
+To enable fused modules for supported architectures, create an [`AwqConfig`] and set the parameters `fuse_max_seq_len` and `do_fuse=True`. The `fuse_max_seq_len` parameter is the total sequence length and it should include the context length and the expected generation length. You can set it to a larger value to be safe.
+
+For example, the following code fuses the AWQ modules of the [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model:
+
+```python
+import torch
+from transformers import AwqConfig, AutoModelForCausalLM
+
+model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"
+
+quantization_config = AwqConfig(
+    bits=4,
+    fuse_max_seq_len=512,
+    do_fuse=True,
+)
+
+model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)
+```
+
+</hfoption>
+<hfoption id="unsupported architectures">
+
+For architectures that don't support fused modules yet, you need to create a custom fusing mapping with the `modules_to_fuse` parameter to define which modules need to be fused. For example, the following code fuses the AWQ modules of the [TheBloke/Yi-34B-AWQ](https://huggingface.co/TheBloke/Yi-34B-AWQ) model:
+
+```python
+import torch
+from transformers import AwqConfig, AutoModelForCausalLM
+
+model_id = "TheBloke/Yi-34B-AWQ"
+
+quantization_config = AwqConfig(
+    bits=4,
+    fuse_max_seq_len=512,
+    modules_to_fuse={
+        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
+        "layernorm": ["ln1", "ln2", "norm"],
+        "mlp": ["gate_proj", "up_proj", "down_proj"],
+        "use_alibi": False,
+        "num_attention_heads": 56,
+        "num_key_value_heads": 8,
+        "hidden_size": 7168
+    }
+)
+
+model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)
+```
+
+The parameter `modules_to_fuse` should include:
+
+- `"attention"`: The names of the attention layers to fuse in the following order: query, key, value and output projection layer. If you don't want to fuse these layers, pass an empty list.
+- `"layernorm"`: The names of all the LayerNorm layers you want to replace with a custom fused LayerNorm. If you don't want to fuse these layers, pass an empty list.
+- `"mlp"`: The names of the MLP layers you want to fuse into a single MLP layer, in the following order: gate (dense layer, post-attention), up, and down layers.
+- `"use_alibi"`: Whether your model uses ALiBi positional embeddings.
+- `"num_attention_heads"`: The number of attention heads.
+- `"num_key_value_heads"`: The number of key value heads used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model uses Multi Head Attention (MHA); if `num_key_value_heads=1`, the model uses Multi Query Attention (MQA); otherwise, GQA is used.
+- `"hidden_size"`: The dimension of the hidden representations.
+
+</hfoption>
+</hfoptions>
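
The `num_key_value_heads` rule described for `modules_to_fuse` can be sketched in a few lines of plain Python. The function `attention_kind` is a hypothetical helper written for illustration, and the divisibility check is an assumption about well-formed configurations; the head counts are the ones from the Yi-34B example.

```python
def attention_kind(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Hypothetical helper: classify the attention variant from the head counts."""
    if num_attention_heads % num_key_value_heads != 0:
        # Assumption: query heads are split evenly across key/value heads.
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


# Values taken from the Yi-34B fusing mapping.
yi_34b = {"num_attention_heads": 56, "num_key_value_heads": 8, "hidden_size": 7168}

kind = attention_kind(yi_34b["num_attention_heads"], yi_34b["num_key_value_heads"])
head_dim = yi_34b["hidden_size"] // yi_34b["num_attention_heads"]
queries_per_kv_head = yi_34b["num_attention_heads"] // yi_34b["num_key_value_heads"]
print(kind, head_dim, queries_per_kv_head)
```

Running a check like this before building the mapping is a cheap way to catch a mistyped head count.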
+
+### Exllama-v2 support
+
+Recent versions of `autoawq` support exllama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of `autoawq` by running:
+
+```bash
+pip install git+https://github.com/casper-hansen/AutoAWQ.git
+```
+
+Get started by passing an `AwqConfig()` with `version="exllama"`.
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
+
+quantization_config = AwqConfig(version="exllama")
+
+model = AutoModelForCausalLM.from_pretrained(
+    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
+    quantization_config=quantization_config,
+    device_map="auto",
+)
+
+input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
+output = model(input_ids)
+print(output.logits)
+
+tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
+input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
+output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=tokenizer.eos_token_id)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+<Tip warning={true}>
+
+Note this feature is supported on AMD GPUs.
+
+</Tip>
+
+
+## AutoGPTQ
+
+<Tip>
+
+Try GPTQ quantization with PEFT in this [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) and learn more about its details in this [blog post](https://huggingface.co/blog/gptq-integration)!
+
+</Tip>
+
+The [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. This can reduce your memory usage by 4x because the int4 weights are dequantized in a fused kernel rather than in a GPU's global memory. You can also expect a speedup in inference because a lower bitwidth takes less time to communicate.
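
The 4x figure follows from the bit widths alone. The arithmetic below is a back-of-the-envelope sketch: the helper `weight_memory_gb` and the 7B parameter count are illustrative assumptions, and the estimate ignores the small overhead of quantization scales and zero points.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Hypothetical helper: weight storage in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 7e9  # assumed model size, for illustration only
fp16_gb = weight_memory_gb(num_params, 16)
int4_gb = weight_memory_gb(num_params, 4)
print(f"fp16: {fp16_gb:.1f} GB, int4: {int4_gb:.1f} GB, ratio: {fp16_gb / int4_gb:.0f}x")
```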
+
+Before you begin, make sure the following libraries are installed:
+
+```bash
+pip install auto-gptq
+pip install git+https://github.com/huggingface/optimum.git
+pip install git+https://github.com/huggingface/transformers.git
+pip install --upgrade accelerate
+```
+
+To quantize a model (currently only supported for text models), you need to create a [`GPTQConfig`] class and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
+
+model_id = "facebook/opt-125m"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
+```
+
+You could also pass your own dataset as a list of strings, but it is highly recommended to use the same dataset as the GPTQ paper.
+
+```py
+dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
+gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
+```
+
+Load a model to quantize and pass the `gptq_config` to the [`~AutoModelForCausalLM.from_pretrained`] method. Set `device_map="auto"` to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization.
+
+```py
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
+```
+
+Disk offloading is not supported. If you're running out of memory because a dataset is too large, try passing the `max_memory` parameter to set the amount of memory to use on each device (GPU and CPU):
+
+```py
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config)
+```
+
+<Tip warning={true}>
+
+Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on an NVIDIA A100. Before you quantize a model, it is a good idea to check whether a GPTQ-quantized version of the model already exists on the Hub.
+
+</Tip>
+
+Once your model is quantized, you can push the model and tokenizer to the Hub where it can be easily shared and accessed. Use the [`~PreTrainedModel.push_to_hub`] method to save the [`GPTQConfig`]:
+
+```py
+quantized_model.push_to_hub("opt-125m-gptq")
+tokenizer.push_to_hub("opt-125m-gptq")
+```
+
+You could also save your quantized model locally with the [`~PreTrainedModel.save_pretrained`] method. If the model was quantized with the `device_map` parameter, make sure to move the entire model to a GPU or CPU before saving it. For example, to save the model on a CPU:
+
+```py
+quantized_model.save_pretrained("opt-125m-gptq")
+tokenizer.save_pretrained("opt-125m-gptq")
+
+# if quantized with device_map set
+quantized_model.to("cpu")
+quantized_model.save_pretrained("opt-125m-gptq")
+```
+
+Reload a quantized model with the [`~PreTrainedModel.from_pretrained`] method, and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed.
+
+```py
+from transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")
+```
+
+### ExLlama
+
+[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object. To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter:
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, GPTQConfig
+
+gptq_config = GPTQConfig(bits=4, exllama_config={"version":2})
+model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)
+```
+
+<Tip warning={true}>
+
+Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT.
+
+</Tip>
+
+The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ (version > 0.4.2), then you'll need to disable the ExLlama kernel. This overwrites the attributes related to the ExLlama kernels in the quantization config of the config.json file.
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, GPTQConfig
+gptq_config = GPTQConfig(bits=4, use_exllama=False)
+model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="cpu", quantization_config=gptq_config)
+```
+
+## bitsandbytes
+
+[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) is the easiest option for quantizing a model to 8-bit and 4-bit. 8-bit quantization multiplies outliers in fp16 and non-outliers in int8, converts the non-outlier results back to fp16, and then adds them together to return the output in fp16. This reduces the degradative effect outlier values have on a model's performance. 4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
+
+To use bitsandbytes, make sure you have the following libraries installed:
+
+<hfoptions id="bnb">
+<hfoption id="8-bit">
+
+```bash
+pip install transformers accelerate "bitsandbytes>0.37.0"
+```
+
+</hfoption>
+<hfoption id="4-bit">
+
+```bash
+pip install "bitsandbytes>=0.39.0"
+pip install --upgrade accelerate
+pip install --upgrade transformers
+```
+
+</hfoption>
+</hfoptions>
+
+Now you can quantize a model with the `load_in_8bit` or `load_in_4bit` parameters in the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers.
+
+<hfoptions id="bnb">
+<hfoption id="8-bit">
+
+Quantizing a model in 8-bit halves the memory usage. For large models, set `device_map="auto"` to efficiently use the available GPUs:
+
+```py
+from transformers import AutoModelForCausalLM
+
+model_8bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7", device_map="auto", load_in_8bit=True)
+```
+
+By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:
+
+```py
+import torch
+from transformers import AutoModelForCausalLM
+
+model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, torch_dtype=torch.float32)
+model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
+```
+
+Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest version of Transformers and bitsandbytes. If you have the latest versions, then you can push the 8-bit model to the Hub with the [`~PreTrainedModel.push_to_hub`] method. The quantization config.json file is pushed first, followed by the quantized model weights.
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", device_map="auto", load_in_8bit=True)
+tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
+
+model.push_to_hub("bloom-560m-8bit")
+```
+
+</hfoption>
+<hfoption id="4-bit">
+
+Quantizing a model in 4-bit reduces your memory usage by 4x. For large models, set `device_map="auto"` to efficiently use the available GPUs:
+
+```py
+from transformers import AutoModelForCausalLM
+
+model_4bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7", device_map="auto", load_in_4bit=True)
+```
+
+By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:
+
+```py
+import torch
+from transformers import AutoModelForCausalLM
+
+model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True, torch_dtype=torch.float32)
+model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
+```
+
+If you have `bitsandbytes>=0.41.3`, you can serialize 4-bit models and push them to the Hugging Face Hub. Simply call `model.push_to_hub()` after loading the model in 4-bit precision. You can also save the serialized 4-bit model locally with the `model.save_pretrained()` method.
+
+</hfoption>
+</hfoptions>
+
+<Tip warning={true}>
+
+Training with 8-bit and 4-bit weights is only supported for training *extra* parameters.
+
+</Tip>
+
+You can check your memory footprint with the `get_memory_footprint` method:
+
+```py
+print(model.get_memory_footprint())
+```
+
+Quantized models can be loaded with the [`~PreTrainedModel.from_pretrained`] method without needing to specify the `load_in_8bit` or `load_in_4bit` parameters:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto")
+```
+
+### 8-bit
+
+<Tip>
+
+Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!
+
+</Tip>
+
+This section explores some of the specific features of 8-bit models, such as offloading, outlier thresholds, skipping module conversion, and finetuning.
+
+#### Offloading
+
+8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in **float32**, and aren't converted to 8-bit. For example, to enable offloading for the [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) model, start by creating a [`BitsAndBytesConfig`]:
+
+```py
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
+```
+
+Design a custom device map to fit everything on your GPU except for the `lm_head`, which you'll dispatch to the CPU:
+
+```py
+device_map = {
+    "transformer.word_embeddings": 0,
+    "transformer.word_embeddings_layernorm": 0,
+    "lm_head": "cpu",
+    "transformer.h": 0,
+    "transformer.ln_f": 0,
+}
+```
+
+Now load your model with the custom `device_map` and `quantization_config`:
+
+```py
+model_8bit = AutoModelForCausalLM.from_pretrained(
+    "bigscience/bloom-1b7",
+    device_map=device_map,
+    quantization_config=quantization_config,
+)
+```
+
+#### Outlier threshold
+
+An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).
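
To make the role of the threshold concrete, here is a simplified pure-Python sketch of how hidden-state feature columns could be split into outlier columns (kept in fp16) and regular columns (quantized to int8). The function `split_outlier_columns` and the toy values are assumptions for illustration; the real bitsandbytes kernels operate on GPU tensors and are not shown here.

```python
def split_outlier_columns(hidden_states, threshold=6.0):
    """Hypothetical helper: return (outlier_cols, regular_cols) by max |value| per column."""
    num_cols = len(hidden_states[0])
    outliers, regular = [], []
    for col in range(num_cols):
        col_max = max(abs(row[col]) for row in hidden_states)
        # A column with any value above the threshold is treated as an outlier feature.
        (outliers if col_max > threshold else regular).append(col)
    return outliers, regular


# Toy hidden states: two tokens, four feature dimensions.
hidden_states = [
    [0.5, -1.2, 45.0, 2.0],
    [1.1, 0.3, -38.0, -0.7],
]
outliers, regular = split_outlier_columns(hidden_states, threshold=6.0)
print("fp16 columns:", outliers, "int8 columns:", regular)
```

Lowering the threshold moves more columns into the fp16 path, which trades memory savings and speed for stability.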
+
+To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:
+
+```py
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+model_id = "bigscience/bloom-1b7"
+
+quantization_config = BitsAndBytesConfig(
+    llm_int8_threshold=10,
+)
+
+model_8bit = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    quantization_config=quantization_config,
+)
+```
+
+#### Skip module conversion
+
+For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit, and doing so can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+model_id = "bigscience/bloom-1b7"
+
+quantization_config = BitsAndBytesConfig(
+    llm_int8_skip_modules=["lm_head"],
+)
+
+model_8bit = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    quantization_config=quantization_config,
+)
+```
+
+#### Finetuning
+
+With the [PEFT](https://github.com/huggingface/peft) library, you can finetune large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because it'll automatically load your model on a GPU. However, you can still customize the device map with the `device_map` parameter if you want to (`device_map="auto"` should only be used for inference).
+
+### 4-bit
+
+<Tip>
+
+Try 4-bit quantization in this [notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) and learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
+
+</Tip>
+
+This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.
+
+
+#### Compute data type
+
+To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:
+
+```py
+import torch
+from transformers import BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
+```
+
+#### Normal Float 4 (NF4)
+
+NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:
+
+```py
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+model_id = "bigscience/bloom-1b7"
+
+nf4_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+)
+
+model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
+```
+
+For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.
+
+#### Nested quantization
+
+Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and gradient accumulation with 4 steps.
+
+```py
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+double_quant_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_use_double_quant=True,
+)
+
+model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config)
+```
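
The roughly 0.4 bits/parameter saving can be reproduced with the overhead arithmetic from the QLoRA paper. The block sizes below (64 weights per first-level constant, 256 constants per second-level constant, 32-bit and 8-bit constants) are the paper's settings and are assumed here to match the configuration in use; the 13B parameter count is illustrative.

```python
block_size = 64          # weights sharing one first-level quantization constant
nested_block_size = 256  # first-level constants sharing one second-level constant

# Without nested quantization: one 32-bit constant per block of weights.
plain_overhead = 32 / block_size

# With nested quantization: 8-bit first-level constants plus 32-bit second-level constants.
nested_overhead = 8 / block_size + 32 / (block_size * nested_block_size)

saving = plain_overhead - nested_overhead  # bits saved per parameter

num_params = 13e9  # assumed model size, for illustration only
saved_gb = num_params * saving / 8 / 1e9
print(f"{saving:.3f} bits/param saved, about {saved_gb:.2f} GB for 13B parameters")
```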
+
+### Dequantizing `bitsandbytes` models
+
+Once quantized, you can dequantize the model to the original precision, but this might result in a small quality loss. Also make sure you have enough GPU memory to fit the dequantized model.
+The following example shows how to dequantize a 4-bit `bitsandbytes` model.
+
+```python
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
+
+model_id = "facebook/opt-125m"
+
+model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+model.dequantize()
+
+text = tokenizer("Hello my name is", return_tensors="pt").to(0)
+
+out = model.generate(**text)
+print(tokenizer.decode(out[0]))
+```
+
+## EETQ
+
+The [EETQ](https://github.com/NetEase-FuXi/EETQ) library supports int8 per-channel weight-only quantization for NVIDIA GPUs. The high-performance GEMM and GEMV kernels are from FasterTransformer and TensorRT-LLM. It requires no calibration dataset, and you don't need to pre-quantize your model. Moreover, the accuracy degradation is negligible owing to the per-channel quantization.
+
+Make sure you have EETQ installed from the [release page](https://github.com/NetEase-FuXi/EETQ/releases):
+```
+pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl
+```
+EETQ requires CUDA compute capability >= 7.0 and <= 8.9. Alternatively, install it from the [source code](https://github.com/NetEase-FuXi/EETQ):
+```
+git clone https://github.com/NetEase-FuXi/EETQ.git
+cd EETQ/
+git submodule update --init --recursive
+pip install .
+```
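
The compute capability requirement can be checked before installing. The sketch below is an illustration only: `eetq_supports` is a hypothetical helper, and it takes the `(major, minor)` tuple that `torch.cuda.get_device_capability()` returns so the comparison itself stays independent of PyTorch.

```python
def eetq_supports(capability):
    """Hypothetical helper: True if a (major, minor) compute capability is within 7.0 to 8.9."""
    return (7, 0) <= tuple(capability) <= (8, 9)


# On a machine with PyTorch and a GPU, the tuple could come from:
#   import torch
#   capability = torch.cuda.get_device_capability()
for capability in [(6, 1), (7, 0), (8, 6), (8, 9), (9, 0)]:
    print(capability, eetq_supports(capability))
```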
+
+An unquantized model can be quantized with `from_pretrained`:
+```py
+from transformers import AutoModelForCausalLM, EetqConfig
+path = "/path/to/model"
+quantization_config = EetqConfig("int8")
+model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", quantization_config=quantization_config)
+```
+
+A quantized model can be saved with `save_pretrained` and reloaded with `from_pretrained`:
+
+```py
+quant_path = "/path/to/save/quantized/model"
+model.save_pretrained(quant_path)
+model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
+```
+
+## Optimum
+
+The [Optimum](https://huggingface.co/docs/optimum/index) library supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. Consider using Optimum for quantization if you're using specific and optimized hardware like Intel CPUs, Furiosa NPUs or a model accelerator like ONNX Runtime.
+
+## Benchmarks
+
+To compare the speed, throughput, and latency of each quantization scheme, check the following benchmarks obtained from the [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) library. The benchmark was run on an NVIDIA A1000 for the [TheBloke/Mistral-7B-v0.1-AWQ](https://huggingface.co/TheBloke/Mistral-7B-v0.1-AWQ) and [TheBloke/Mistral-7B-v0.1-GPTQ](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ) models. These were also tested against the bitsandbytes quantization methods as well as a native fp16 model.
+
+<div class="flex gap-4">
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/forward_memory_plot.png" alt="forward peak memory per batch size" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">forward peak memory/batch size</figcaption>
+  </div>
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/generate_memory_plot.png" alt="generate peak memory per batch size" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">generate peak memory/batch size</figcaption>
+  </div>
+</div>
+
+<div class="flex gap-4">
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/generate_throughput_plot.png" alt="generate throughput per batch size" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">generate throughput/batch size</figcaption>
+  </div>
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/forward_latency_plot.png" alt="forward latency per batch size" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">forward latency/batch size</figcaption>
+  </div>
+</div>
+
+The benchmarks indicate AWQ quantization is the fastest for inference and text generation, and has the lowest peak memory for text generation. However, AWQ has the largest forward latency per batch size. For a more detailed discussion about the pros and cons of each quantization method, read the [Overview of natively supported quantization schemes in 🤗 Transformers](https://huggingface.co/blog/overview-quantization-transformers) blog post.
+
+### Fused AWQ modules
+
+The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model was benchmarked with `batch_size=1` with and without fused modules.
+
+<figcaption class="text-center text-gray-500 text-lg">Unfused module</figcaption>
+
+|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)   |
+|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
+|            1 |               32 |              32 |            60.0984 |           38.4537 | 4.50 GB (5.68%) |
+|            1 |               64 |              64 |          1333.67   |           31.6604 | 4.50 GB (5.68%) |
+|            1 |              128 |             128 |          2434.06   |           31.6272 | 4.50 GB (5.68%) |
+|            1 |              256 |             256 |          3072.26   |           38.1731 | 4.50 GB (5.68%) |
+|            1 |              512 |             512 |          3184.74   |           31.6819 | 4.59 GB (5.80%) |
+|            1 |             1024 |            1024 |          3148.18   |           36.8031 | 4.81 GB (6.07%) |
+|            1 |             2048 |            2048 |          2927.33   |           35.2676 | 5.73 GB (7.23%) |
+
+<figcaption class="text-center text-gray-500 text-lg">Fused module</figcaption>
+
+|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)   |
+|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
+|            1 |               32 |              32 |            81.4899 |           80.2569 | 4.00 GB (5.05%) |
+|            1 |               64 |              64 |          1756.1    |          106.26   | 4.00 GB (5.05%) |
+|            1 |              128 |             128 |          2479.32   |          105.631  | 4.00 GB (5.06%) |
+|            1 |              256 |             256 |          1813.6    |           85.7485 | 4.01 GB (5.06%) |
+|            1 |              512 |             512 |          2848.9    |           97.701  | 4.11 GB (5.19%) |
+|            1 |             1024 |            1024 |          3044.35   |           87.7323 | 4.41 GB (5.57%) |
+|            1 |             2048 |            2048 |          2715.11   |           89.4709 | 5.57 GB (7.04%) |
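
To summarize the two tables, the short script below computes the decode speedup of fused over unfused modules at each sequence length. The numbers are copied from the tables above; the variable names are illustrative.

```python
# Decode tokens/s from the tables above, keyed by prefill/decode length: (unfused, fused).
decode_tokens_per_s = {
    32: (38.4537, 80.2569),
    64: (31.6604, 106.26),
    128: (31.6272, 105.631),
    256: (38.1731, 85.7485),
    512: (31.6819, 97.701),
    1024: (36.8031, 87.7323),
    2048: (35.2676, 89.4709),
}

speedups = {length: fused / unfused for length, (unfused, fused) in decode_tokens_per_s.items()}

for length, speedup in speedups.items():
    print(f"length {length:>4}: {speedup:.2f}x faster decoding with fused modules")
```

In this benchmark, fused modules decode more than twice as fast at every sequence length, while prefill throughput is mixed.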
+
+The speed and throughput of fused and unfused modules were also tested with the [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) library.
+
+<div class="flex gap-4">
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_forward_memory_plot.png" alt="forward peak memory per batch size" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">forward peak memory/batch size</figcaption>
+  </div>
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_generate_throughput_plot.png" alt="generate throughput per batch size" />
+    <figcaption class="mt-2 text-center text-sm text-gray-500">generate throughput/batch size</figcaption>
+  </div>
+</div>
+
+## HQQ
+
+Half-Quadratic Quantization (HQQ) implements on-the-fly quantization via fast robust optimization. It doesn't require calibration data and can be used to quantize any model.
+Refer to the [official package](https://github.com/mobiusml/hqq/) for more details.
+
+For installation, we recommend you use the following approach to get the latest version and build its corresponding CUDA kernels:
+```
+pip install hqq
+```
+
+To quantize a model, you need to create an [`HqqConfig`]. There are two ways of doing it:
+``` Python
+from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
+
+# Method 1: all linear layers will use the same quantization config
+quant_config  = HqqConfig(nbits=8, group_size=64, quant_zero=False, quant_scale=False, axis=0) #axis=0 is used by default
+```
+
+``` Python
+# Method 2: each linear layer with the same tag will use a dedicated quantization config
+q4_config = {'nbits':4, 'group_size':64, 'quant_zero':False, 'quant_scale':False}
+q3_config = {'nbits':3, 'group_size':32, 'quant_zero':False, 'quant_scale':False}
+quant_config  = HqqConfig(dynamic_config={
+  'self_attn.q_proj':q4_config,
+  'self_attn.k_proj':q4_config,
+  'self_attn.v_proj':q4_config,
+  'self_attn.o_proj':q4_config,
+
+  'mlp.gate_proj':q3_config,
+  'mlp.up_proj'  :q3_config,
+  'mlp.down_proj':q3_config,
+})
+```
+
+The second approach is especially interesting for quantizing Mixture-of-Experts (MoEs) because the experts are less affected by lower quantization settings.
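
When many layers share a setting, the per-layer mapping can also be built programmatically. This is only a sketch in plain Python: the layer tags are the ones used in the example above, the variable names are illustrative, and the resulting dictionary would be passed as `dynamic_config` to `HqqConfig`.

```python
q4_config = {"nbits": 4, "group_size": 64, "quant_zero": False, "quant_scale": False}
q3_config = {"nbits": 3, "group_size": 32, "quant_zero": False, "quant_scale": False}

attention_layers = ["q_proj", "k_proj", "v_proj", "o_proj"]
mlp_layers = ["gate_proj", "up_proj", "down_proj"]

# Attention projections get 4 bits, MLP projections get 3 bits.
dynamic_config = {f"self_attn.{name}": q4_config for name in attention_layers}
dynamic_config.update({f"mlp.{name}": q3_config for name in mlp_layers})

for tag, config in dynamic_config.items():
    print(f"{tag}: {config['nbits']}-bit, group size {config['group_size']}")

# quant_config = HqqConfig(dynamic_config=dynamic_config)
```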
+
+
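+Then quantize the model as follows:
+``` Python
+import torch
+from transformers import AutoModelForCausalLM
+
+# model_id is the Hub ID or local path of the model you want to quantize
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16,
+    device_map="cuda",
+    quantization_config=quant_config,
+)
+```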
+
+### Optimized Runtime
+
+HQQ supports various backends, including pure PyTorch and custom dequantization CUDA kernels. These backends are suitable for older GPUs and PEFT/QLoRA training.
+For faster inference, HQQ supports 4-bit fused kernels (TorchAO and Marlin), reaching up to 200 tokens/sec on a single 4090.
+For more details on how to use the backends, refer to the [HQQ backend documentation](https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend).
--- a/docs/source/en/quantization/aqlm.md
+++ b/docs/source/en/quantization/aqlm.md
@ -1,57 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# AQLM
-
-> [!TIP]
-> Try AQLM on [Google Colab](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing)!
-
-Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a Large Language Models compression method. It quantizes multiple weights together and take advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.
-
-Inference support for AQLM is realised in the `aqlm` library. Make sure to install it to run the models (note aqlm works only with python>=3.10):
-```bash
-pip install aqlm[gpu,cpu]
-```
-
-The library provides efficient kernels for both GPU and CPU inference and training.
-
-The instructions on how to quantize models yourself, as well as all the relevant code can be found in the corresponding GitHub [repository](https://github.com/Vahe1994/AQLM). To run AQLM models simply load a model that has been quantized with AQLM:
-
-```python
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-quantized_model = AutoModelForCausalLM.from_pretrained(
-    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
-    torch_dtype="auto", 
-    device_map="auto"
-)
-tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")
-```
-
-## PEFT
-
-Starting with version `aqlm 1.0.2`, AQLM supports Parameter-Efficient Fine-Tuning in a form of [LoRA](https://huggingface.co/docs/peft/package_reference/lora) integrated into the [PEFT](https://huggingface.co/blog/peft) library.
-
-## AQLM configurations
-
-AQLM quantization setups vary mainly on the number of codebooks used as well as codebook sizes in bits. The most popular setups, as well as inference kernels they support are:
- 
-| Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup     | Fast GPU inference | Fast CPU inference |
-|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------|
-| Triton | K                   | N                  | KxN     | -        | Up to ~0.7x | ✅                  | ❌                  |
-| CUDA | 1                   | 16                  | 1x16     | Best        | Up to ~1.3x | ✅                  | ❌                  |
-| CUDA | 2                   | 8                   | 2x8      | OK          | Up to ~3.0x | ✅                  | ❌                  |
-| Numba | K                   | 8                   | Kx8      | Good        | Up to ~4.0x | ❌                  | ✅                  |
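As a rough sketch of how the notation in the table maps to a bit budget: each group of weights stores one index per codebook, so a KxN setup spends about K*N bits per group. The helper below is illustrative only. Its name and the group size of 8 are assumptions (the text above gives a range of 8-16 weights per group), and it ignores the storage needed for the codebooks themselves and for any scales.

```python
def approx_bits_per_weight(num_codebooks: int, codebook_bits: int, group_size: int = 8) -> float:
    """Approximate index bits spent per weight for a KxN AQLM setup.

    group_size=8 is an assumed value; codebook and scale storage are ignored.
    """
    return num_codebooks * codebook_bits / group_size


# Both popular CUDA setups land on the same index budget under this assumption.
for notation, (k, n) in {"1x16": (1, 16), "2x8": (2, 8)}.items():
    print(notation, approx_bits_per_weight(k, n))
```

Under this assumption the 1x16 setup matches the "2Bit" label in the checkpoint name used earlier.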
--- a/docs/source/en/quantization/awq.md
+++ b/docs/source/en/quantization/awq.md
@ -1,232 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# AWQ
-
-<Tip>
-
-Try AWQ quantization with this [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY)!
-
-</Tip>
-
-[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) doesn't quantize all the weights in a model, and instead, it preserves a small percentage of weights that are important for LLM performance. This significantly reduces quantization loss such that you can run models in 4-bit precision without experiencing any performance degradation.
-
-There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ) or [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide will show you how to load models quantized with autoawq, but the process is similar for llm-awq quantized models.
-
-Make sure you have autoawq installed:
-
-```bash
-pip install autoawq
-```
-
-AWQ-quantized models can be identified by checking the `quantization_config` attribute in the model's [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file:
-
-```json
-{
-  "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source",
-  "architectures": [
-    "MistralForCausalLM"
-  ],
-  ...
-  ...
-  ...
-  "quantization_config": {
-    "quant_method": "awq",
-    "zero_point": true,
-    "group_size": 128,
-    "bits": 4,
-    "version": "gemm"
-  }
-}
-```
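The check described above can be sketched in a few lines of plain Python. This is a minimal illustration that works on an already-parsed `config.json`; the helper name `is_awq_quantized` is hypothetical and not part of Transformers.

```python
import json


def is_awq_quantized(config: dict) -> bool:
    """Return True if a parsed config.json declares AWQ quantization."""
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method") == "awq"


# Trimmed-down stand-in for the config.json shown above.
example = json.loads(
    '{"architectures": ["MistralForCausalLM"],'
    ' "quantization_config": {"quant_method": "awq", "bits": 4, "version": "gemm"}}'
)
print(is_awq_quantized(example))
print(is_awq_quantized({"architectures": ["MistralForCausalLM"]}))
```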
-
-A quantized model is loaded with the [`~PreTrainedModel.from_pretrained`] method. If you loaded your model on the CPU, make sure to move it to a GPU device first. Use the `device_map` parameter to specify where to place the model:
-
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model_id = "TheBloke/zephyr-7B-alpha-AWQ"
-model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
-```
-
-Loading an AWQ-quantized model automatically sets other weights to fp16 by default for performance reasons. If you want to load these other weights in a different format, use the `torch_dtype` parameter:
-
-```py
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model_id = "TheBloke/zephyr-7B-alpha-AWQ"
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
-```
-
-AWQ quantization can also be combined with [FlashAttention-2](perf_infer_gpu_one#flashattention-2) to further accelerate inference:
-
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0")
-```
-
-## Fused modules
-
-Fused modules offer improved accuracy and performance. They are supported out-of-the-box for AWQ modules for [Llama](https://huggingface.co/meta-llama) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) architectures, but you can also fuse AWQ modules for unsupported architectures.
-
-<Tip warning={true}>
-
-Fused modules cannot be combined with other optimization techniques such as FlashAttention-2.
-
-</Tip>
-
-<hfoptions id="fuse">
-<hfoption id="supported architectures">
-
-To enable fused modules for supported architectures, create an [`AwqConfig`] and set the parameters `fuse_max_seq_len` and `do_fuse=True`. The `fuse_max_seq_len` parameter is the total sequence length and it should include the context length and the expected generation length. You can set it to a larger value to be safe.
-
-For example, to fuse the AWQ modules of the [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model:
-
-```python
-import torch
-from transformers import AwqConfig, AutoModelForCausalLM
-
-model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"
-
-quantization_config = AwqConfig(
-    bits=4,
-    fuse_max_seq_len=512,
-    do_fuse=True,
-)
-
-model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)
-```
-
-The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model was benchmarked with `batch_size=1` with and without fused modules.
-
-<figcaption class="text-center text-gray-500 text-lg">Unfused module</figcaption>
-
-|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)   |
-|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
-|            1 |               32 |              32 |            60.0984 |           38.4537 | 4.50 GB (5.68%) |
-|            1 |               64 |              64 |          1333.67   |           31.6604 | 4.50 GB (5.68%) |
-|            1 |              128 |             128 |          2434.06   |           31.6272 | 4.50 GB (5.68%) |
-|            1 |              256 |             256 |          3072.26   |           38.1731 | 4.50 GB (5.68%) |
-|            1 |              512 |             512 |          3184.74   |           31.6819 | 4.59 GB (5.80%) |
-|            1 |             1024 |            1024 |          3148.18   |           36.8031 | 4.81 GB (6.07%) |
-|            1 |             2048 |            2048 |          2927.33   |           35.2676 | 5.73 GB (7.23%) |
-
-<figcaption class="text-center text-gray-500 text-lg">Fused module</figcaption>
-
-|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)   |
-|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
-|            1 |               32 |              32 |            81.4899 |           80.2569 | 4.00 GB (5.05%) |
-|            1 |               64 |              64 |          1756.1    |          106.26   | 4.00 GB (5.05%) |
-|            1 |              128 |             128 |          2479.32   |          105.631  | 4.00 GB (5.06%) |
-|            1 |              256 |             256 |          1813.6    |           85.7485 | 4.01 GB (5.06%) |
-|            1 |              512 |             512 |          2848.9    |           97.701  | 4.11 GB (5.19%) |
-|            1 |             1024 |            1024 |          3044.35   |           87.7323 | 4.41 GB (5.57%) |
-|            1 |             2048 |            2048 |          2715.11   |           89.4709 | 5.57 GB (7.04%) |
-
-The speed and throughput of fused and unfused modules were also tested with the [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) library.
-
-<div class="flex gap-4">
-  <div>
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_forward_memory_plot.png" alt="forward peak memory per batch size" />
-    <figcaption class="mt-2 text-center text-sm text-gray-500">forward peak memory/batch size</figcaption>
-  </div>
-  <div>
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_generate_throughput_plot.png" alt="generate throughput per batch size" />
-    <figcaption class="mt-2 text-center text-sm text-gray-500">generate throughput/batch size</figcaption>
-  </div>
-</div>
-
-</hfoption>
-<hfoption id="unsupported architectures">
-
-For architectures that don't support fused modules yet, you need to create a custom fusing mapping with the `modules_to_fuse` parameter to define which modules need to be fused. For example, to fuse the AWQ modules of the [TheBloke/Yi-34B-AWQ](https://huggingface.co/TheBloke/Yi-34B-AWQ) model:
-
-```python
-import torch
-from transformers import AwqConfig, AutoModelForCausalLM
-
-model_id = "TheBloke/Yi-34B-AWQ"
-
-quantization_config = AwqConfig(
-    bits=4,
-    fuse_max_seq_len=512,
-    modules_to_fuse={
-        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
-        "layernorm": ["ln1", "ln2", "norm"],
-        "mlp": ["gate_proj", "up_proj", "down_proj"],
-        "use_alibi": False,
-        "num_attention_heads": 56,
-        "num_key_value_heads": 8,
-        "hidden_size": 7168
-    }
-)
-
-model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)
-```
-
-The parameter `modules_to_fuse` should include:
-
-- `"attention"`: The names of the attention layers to fuse in the following order: query, key, value and output projection layer. If you don't want to fuse these layers, pass an empty list.
-- `"layernorm"`: The names of all the LayerNorm layers you want to replace with a custom fused LayerNorm. If you don't want to fuse these layers, pass an empty list.
-- `"mlp"`: The names of the MLP layers you want to fuse into a single MLP layer in the order: (gate (dense, layer, post-attention) / up / down layers).
-- `"use_alibi"`: Whether your model uses ALiBi positional embeddings.
-- `"num_attention_heads"`: The number of attention heads.
-- `"num_key_value_heads"`: The number of key value heads that should be used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used.
-- `"hidden_size"`: The dimension of the hidden representations.
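The rule for `"num_key_value_heads"` described above can be written out as a small helper. This is only a sketch of the stated rule; the function name `attention_variant` is hypothetical and not part of `AwqConfig`.

```python
def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify the attention scheme implied by the two head counts."""
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # one key/value head per query head
    if num_key_value_heads == 1:
        return "MQA"  # a single key/value head shared by all query heads
    return "GQA"  # query heads are grouped over a few key/value heads


# Head counts taken from the Yi-34B example above.
print(attention_variant(num_attention_heads=56, num_key_value_heads=8))
```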
-
-</hfoption>
-</hfoptions>
-
-
-
-## ExLlama-v2 support
-
-Recent versions of `autoawq` support ExLlama-v2 kernels for faster prefill and decoding. To get started, first install the latest version of `autoawq` by running:
-
-```bash
-pip install git+https://github.com/casper-hansen/AutoAWQ.git
-```
-
-Get started by passing an `AwqConfig()` with `version="exllama"`.
-
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig
-
-quantization_config = AwqConfig(version="exllama")
-
-model = AutoModelForCausalLM.from_pretrained(
-    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
-    quantization_config=quantization_config,
-    device_map="auto",
-)
-
-input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
-output = model(input_ids)
-print(output.logits)
-
-tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-AWQ")
-input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
-output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=tokenizer.eos_token_id)
-print(tokenizer.decode(output[0], skip_special_tokens=True))
-```
-
-<Tip warning={true}>
-
-Note this feature is supported on AMD GPUs.
-
-</Tip>
--- a/docs/source/en/quantization/bitsandbytes.md
+++ b/docs/source/en/quantization/bitsandbytes.md
@ -1,308 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# bitsandbytes
-
-[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) is the easiest option for quantizing a model to 8-bit and 4-bit. 8-bit quantization multiplies outliers in fp16 and non-outliers in int8, converts the non-outlier results back to fp16, and then adds them together to return the output in fp16. This reduces the degradative effect outlier values have on a model's performance. 4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.
-
-To use bitsandbytes, make sure you have the following libraries installed:
-
-<hfoptions id="bnb">
-<hfoption id="8-bit">
-
-```bash
-pip install transformers accelerate "bitsandbytes>0.37.0"
-```
-
-</hfoption>
-<hfoption id="4-bit">
-
-```bash
-pip install "bitsandbytes>=0.39.0"
-pip install --upgrade accelerate transformers
-```
-
-</hfoption>
-</hfoptions>
-
-Now you can quantize a model by passing a `BitsAndBytesConfig` to the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers.
-
-<hfoptions id="bnb">
-<hfoption id="8-bit">
-
-Quantizing a model in 8-bit halves the memory usage. For large models, set `device_map="auto"` to efficiently use the available GPUs:
-
-```py
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(load_in_8bit=True)
-
-model_8bit = AutoModelForCausalLM.from_pretrained(
-    "bigscience/bloom-1b7", 
-    quantization_config=quantization_config
-)
-```
-
-By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:
-
-```py
-import torch
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(load_in_8bit=True)
-
-model_8bit = AutoModelForCausalLM.from_pretrained(
-    "facebook/opt-350m", 
-    quantization_config=quantization_config, 
-    torch_dtype=torch.float32
-)
-model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
-```
-
-Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest version of Transformers and bitsandbytes. If you have the latest versions, then you can push the 8-bit model to the Hub with the [`~PreTrainedModel.push_to_hub`] method. The quantization config.json file is pushed first, followed by the quantized model weights.
-
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(load_in_8bit=True)
-
-model = AutoModelForCausalLM.from_pretrained(
-    "bigscience/bloom-560m", 
-    quantization_config=quantization_config
-)
-tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
-
-model.push_to_hub("bloom-560m-8bit")
-```
-
-</hfoption>
-<hfoption id="4-bit">
-
-Quantizing a model in 4-bit reduces your memory usage by 4x. For large models, set `device_map="auto"` to efficiently use the available GPUs:
-
-```py
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(load_in_4bit=True)
-
-model_4bit = AutoModelForCausalLM.from_pretrained(
-    "bigscience/bloom-1b7",
-    quantization_config=quantization_config
-)
-```
-
-By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:
-
-```py
-import torch
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(load_in_4bit=True)
-
-model_4bit = AutoModelForCausalLM.from_pretrained(
-    "facebook/opt-350m",
-    quantization_config=quantization_config, 
-    torch_dtype=torch.float32
-)
-model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
-```
-
-If you have `bitsandbytes>=0.41.3`, you can serialize 4-bit models and push them to the Hugging Face Hub. Simply call `model.push_to_hub()` after loading the model in 4-bit precision. You can also save the serialized 4-bit models locally with the `model.save_pretrained()` method.
-
-</hfoption>
-</hfoptions>
-
-<Tip warning={true}>
-
-Training with 8-bit and 4-bit weights is only supported for training *extra* parameters.
-
-</Tip>
-
-You can check your memory footprint with the `get_memory_footprint` method:
-
-```py
-print(model.get_memory_footprint())
-```
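For a rough expectation of what `get_memory_footprint` might report, the weight storage can be estimated from the parameter count and the bit width. The sketch below is a back-of-the-envelope estimate only. The helper name and the parameter count are hypothetical, and real footprints differ because some modules are kept in higher precision and quantization adds its own constants.

```python
def approx_weight_memory_gb(num_parameters: int, bits_per_weight: float) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes), ignoring all overheads."""
    return num_parameters * bits_per_weight / 8 / 1e9


num_parameters = 1_700_000_000  # hypothetical round figure for a ~1.7B-parameter model
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: {approx_weight_memory_gb(num_parameters, bits):.2f} GB")
```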
-
-Quantized models can be loaded with the [`~PreTrainedModel.from_pretrained`] method without needing to specify the `load_in_8bit` or `load_in_4bit` parameters:
-
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto")
-```
-
-## 8-bit (LLM.int8() algorithm)
-
-<Tip>
-
-Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!
-
-</Tip>
-
-This section explores some of the specific features of 8-bit models, such as offloading, outlier thresholds, skipping module conversion, and finetuning.
-
-### Offloading
-
-8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in **float32**, and aren't converted to 8-bit. For example, to enable offloading for the [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) model, start by creating a [`BitsAndBytesConfig`]:
-
-```py
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
-```
-
-Design a custom device map to fit everything on your GPU except for the `lm_head`, which you'll dispatch to the CPU:
-
-```py
-device_map = {
-    "transformer.word_embeddings": 0,
-    "transformer.word_embeddings_layernorm": 0,
-    "lm_head": "cpu",
-    "transformer.h": 0,
-    "transformer.ln_f": 0,
-}
-```
-
-Now load your model with the custom `device_map` and `quantization_config`:
-
-```py
-model_8bit = AutoModelForCausalLM.from_pretrained(
-    "bigscience/bloom-1b7",
-    device_map=device_map,
-    quantization_config=quantization_config,
-)
-```
-
-### Outlier threshold
-
-An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed (most fall in [-3.5, 3.5]), this distribution can be very different for large models (outliers in [-60, -6] or [6, 60]). 8-bit quantization works well for values with a magnitude of up to ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).
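To make the role of the threshold concrete, here is a toy, pure-Python sketch of the idea: features whose magnitude exceeds the threshold stay in floating point, and the rest go through int8 absmax quantization. It is not the bitsandbytes implementation. All function names are hypothetical, and it works on a single vector, whereas the real kernels operate on whole feature dimensions of a matrix.

```python
def split_outliers(hidden_state: list[float], threshold: float = 6.0):
    """Split feature indices by whether |value| exceeds the threshold."""
    outliers = [i for i, v in enumerate(hidden_state) if abs(v) > threshold]
    regular = [i for i, v in enumerate(hidden_state) if abs(v) <= threshold]
    return outliers, regular


def absmax_int8(values: list[float]):
    """Symmetric absmax quantization to the int8 range [-127, 127]."""
    scale = max((abs(v) for v in values), default=0.0) / 127 or 1.0
    return [round(v / scale) for v in values], scale


def mixed_precision_dot(hidden_state, weights, threshold=6.0):
    """Dot product with outlier features in float and the rest in int8."""
    outliers, regular = split_outliers(hidden_state, threshold)
    # Outlier features are multiplied without quantization.
    high = sum(hidden_state[i] * weights[i] for i in outliers)
    # Remaining features are quantized, multiplied as integers, then rescaled.
    q_x, s_x = absmax_int8([hidden_state[i] for i in regular])
    q_w, s_w = absmax_int8([weights[i] for i in regular])
    low = sum(a * b for a, b in zip(q_x, q_w)) * s_x * s_w
    return high + low


x = [0.5, -1.0, 40.0, 2.0]  # the value 40.0 exceeds the default threshold of 6
w = [0.1, 0.2, 0.3, -0.4]
print(mixed_precision_dot(x, w), sum(a * b for a, b in zip(x, w)))
```

Keeping the large value out of the int8 path stops it from inflating the quantization scale shared by the small values.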
-
-To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:
-
-```py
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-model_id = "bigscience/bloom-1b7"
-
-quantization_config = BitsAndBytesConfig(
-    llm_int8_threshold=10,
-)
-
-model_8bit = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    device_map="auto",
-    quantization_config=quantization_config,
-)
-```
-
-### Skip module conversion
-
-For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit, and doing so can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:
-
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-
-model_id = "bigscience/bloom-1b7"
-
-quantization_config = BitsAndBytesConfig(
-    llm_int8_skip_modules=["lm_head"],
-)
-
-model_8bit = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    device_map="auto",
-    quantization_config=quantization_config,
-)
-```
-
-### Finetuning
-
-With the [PEFT](https://github.com/huggingface/peft) library, you can finetune large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because it'll automatically load your model on a GPU. However, you can still customize the device map with the `device_map` parameter if you want to (`device_map="auto"` should only be used for inference).
-
-## 4-bit (QLoRA algorithm)
-
-<Tip>
-
-Try 4-bit quantization in this [notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) and learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
-
-</Tip>
-
-This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.
-
-
-### Compute data type
-
-To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:
-
-```py
-import torch
-from transformers import BitsAndBytesConfig
-
-quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
-```
-
-### Normal Float 4 (NF4)
-
-NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:
-
-```py
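-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-model_id = "bigscience/bloom-1b7"  # same checkpoint as in the earlier examples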
-
-nf4_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="nf4",
-)
-
-model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
-```
-
-For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.
-
-### Nested quantization
-
-Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and gradient accumulation with 4 steps.
-
-```py
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-
-double_quant_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_use_double_quant=True,
-)
-
-model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config)
-```
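The 0.4 bits/parameter figure can be sanity-checked with back-of-the-envelope arithmetic. The block sizes below (64 weights per first-level block, 256 constants per second-level block, 32-bit and 8-bit constants) follow the description in the QLoRA paper and are assumptions here, not values read from `BitsAndBytesConfig`. The helper name is hypothetical.

```python
def constant_overhead_bits(block_size: int = 64, nested: bool = False,
                           nested_block_size: int = 256) -> float:
    """Bits per parameter spent on quantization constants (assumed block sizes)."""
    if not nested:
        # One 32-bit constant per block of weights.
        return 32 / block_size
    # One 8-bit constant per block, plus one 32-bit constant per group of constants.
    return 8 / block_size + 32 / (block_size * nested_block_size)


plain = constant_overhead_bits()
nested = constant_overhead_bits(nested=True)
print(f"saved: {plain - nested:.3f} bits/parameter")
```

Under these assumptions the saving is a little under 0.4 bits per parameter, consistent with the rounded figure quoted above.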
-
-## Dequantizing `bitsandbytes` models
-
-Once quantized, you can dequantize the model to the original precision but this might result in a small quality loss of the model. Make sure you have enough GPU RAM to fit the dequantized model. 
-
-```python
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
-
-model_id = "facebook/opt-125m"
-
-model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-
-model.dequantize()
-
-text = tokenizer("Hello my name is", return_tensors="pt").to(0)
-
-out = model.generate(**text)
-print(tokenizer.decode(out[0]))
-```
--- a/docs/source/en/quantization/eetq.md
+++ b/docs/source/en/quantization/eetq.md
@ -1,47 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# EETQ
-
-The [EETQ](https://github.com/NetEase-FuXi/EETQ) library supports int8 per-channel weight-only quantization for NVIDIA GPUs. The high-performance GEMM and GEMV kernels are from FasterTransformer and TensorRT-LLM. It requires no calibration dataset, and you do not need to pre-quantize your model. Moreover, the accuracy degradation is negligible owing to the per-channel quantization.
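To illustrate what per-channel weight-only quantization means, here is a toy pure-Python sketch that gives every row of a weight matrix its own int8 scale. It is not EETQ's kernel code. The function names are hypothetical, and treating each row as one channel is an assumption made for the example.

```python
def quantize_per_channel(weight: list[list[float]]):
    """Symmetric int8 quantization with one scale per channel (row)."""
    quantized, scales = [], []
    for row in weight:
        scale = max(abs(v) for v in row) / 127 or 1.0
        quantized.append([round(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map int8 values back to floats using each channel's scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two channels with very different magnitudes each keep their own resolution.
weight = [[0.02, -0.01, 0.005], [3.0, -1.5, 0.75]]
q, s = quantize_per_channel(weight)
restored = dequantize(q, s)
print(q)
print(restored)
```

With a single scale for the whole matrix, the small first row would collapse to a handful of integer levels; per-channel scales avoid that.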
-
-Make sure you have eetq installed from the [release page](https://github.com/NetEase-FuXi/EETQ/releases):
-```
-pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl
-```
-or from the source code at https://github.com/NetEase-FuXi/EETQ. EETQ requires a CUDA compute capability between 7.0 and 8.9 (inclusive):
-```
-git clone https://github.com/NetEase-FuXi/EETQ.git
-cd EETQ/
-git submodule update --init --recursive
-pip install .
-```
-
-An unquantized model can be quantized via `from_pretrained`:
-```py
-from transformers import AutoModelForCausalLM, EetqConfig
-path = "/path/to/model"
-quantization_config = EetqConfig("int8")
-model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", quantization_config=quantization_config)
-```
-
-A quantized model can be saved via `save_pretrained` and loaded again via `from_pretrained`:
-
-```py
-quant_path = "/path/to/save/quantized/model"
-model.save_pretrained(quant_path)
-model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
-```
--- a/docs/source/en/quantization/gptq.md
+++ b/docs/source/en/quantization/gptq.md
@ -1,120 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# GPTQ
-
-<Tip>
-
-Try GPTQ quantization with PEFT in this [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) and learn more about its details in this [blog post](https://huggingface.co/blog/gptq-integration)!
-
-</Tip>
-
-The [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. This can reduce your memory usage by 4x because the int4 weights are dequantized in a fused kernel rather than in a GPU's global memory, and you can also expect a speedup in inference because a lower bitwidth takes less time to communicate.
-
-Before you begin, make sure the following libraries are installed:
-
-```bash
-pip install auto-gptq
-pip install --upgrade accelerate optimum transformers
-```
-
-To quantize a model (currently only supported for text models), you need to create a [`GPTQConfig`] and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.
-
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
-
-model_id = "facebook/opt-125m"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
-```
-
-You could also pass your own dataset as a list of strings, but it is highly recommended to use the same dataset from the GPTQ paper.
-
-```py
-dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
-gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
-```
-
-Load a model to quantize and pass the `gptq_config` to the [`~AutoModelForCausalLM.from_pretrained`] method. Set `device_map="auto"` to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization.
-
-```py
-quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
-```
-
-Disk offloading is not supported. If you're running out of memory because a dataset is too large, try passing the `max_memory` parameter to allocate the amount of memory to use on your device (GPU and CPU):
-
-```py
-quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config)
-```
-
-<Tip warning={true}>
-
-Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on an NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub to see if a GPTQ-quantized version of the model already exists.
-
-</Tip>
-
-Once your model is quantized, you can push the model and tokenizer to the Hub where it can be easily shared and accessed. Use the [`~PreTrainedModel.push_to_hub`] method to save the [`GPTQConfig`]:
-
-```py
-quantized_model.push_to_hub("opt-125m-gptq")
-tokenizer.push_to_hub("opt-125m-gptq")
-```
-
-You could also save your quantized model locally with the [`~PreTrainedModel.save_pretrained`] method. If the model was quantized with the `device_map` parameter, make sure to move the entire model to a GPU or CPU before saving it. For example, to save the model on a CPU:
-
-```py
-quantized_model.save_pretrained("opt-125m-gptq")
-tokenizer.save_pretrained("opt-125m-gptq")
-
-# if quantized with device_map set
-quantized_model.to("cpu")
-quantized_model.save_pretrained("opt-125m-gptq")
-```
-
-Reload a quantized model with the [`~PreTrainedModel.from_pretrained`] method, and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed.
-
-```py
-from transformers import AutoModelForCausalLM
-
-model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")
-```
-
-## ExLlama
-
-[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object. To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter:
-
-```py
-import torch
-from transformers import AutoModelForCausalLM, GPTQConfig
-
-gptq_config = GPTQConfig(bits=4, exllama_config={"version":2})
-model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)
-```
-
-<Tip warning={true}>
-
-Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT.
-
-</Tip>
-
-The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ (version > 0.4.2), then you'll need to disable the ExLlama kernel. This overwrites the attributes related to the ExLlama kernels in the quantization config of the config.json file.
-
-```py
-import torch
-from transformers import AutoModelForCausalLM, GPTQConfig
-gptq_config = GPTQConfig(bits=4, use_exllama=False)
-model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="cpu", quantization_config=gptq_config)
-```
--- a/docs/source/en/quantization/hqq.md
+++ b/docs/source/en/quantization/hqq.md
@ -1,69 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-
-# HQQ 
-
-Half-Quadratic Quantization (HQQ) implements on-the-fly quantization via fast robust optimization. It doesn't require calibration data and can be used to quantize any model.  
-Please refer to the <a href="https://github.com/mobiusml/hqq/">official package</a> for more details.
-
-For installation, we recommend you use the following approach to get the latest version and build its corresponding CUDA kernels:
-```
-pip install hqq
-```
-
-To quantize a model, you need to create an [`HqqConfig`]. There are two ways of doing it:
-``` Python
-from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
-
-# Method 1: all linear layers will use the same quantization config
-quant_config  = HqqConfig(nbits=8, group_size=64, quant_zero=False, quant_scale=False, axis=0) #axis=0 is used by default
-```
-
-``` Python
-# Method 2: each linear layer with the same tag will use a dedicated quantization config
-q4_config = {'nbits':4, 'group_size':64, 'quant_zero':False, 'quant_scale':False}
-q3_config = {'nbits':3, 'group_size':32, 'quant_zero':False, 'quant_scale':False}
-quant_config  = HqqConfig(dynamic_config={
-  'self_attn.q_proj':q4_config,
-  'self_attn.k_proj':q4_config,
-  'self_attn.v_proj':q4_config,
-  'self_attn.o_proj':q4_config,
-
-  'mlp.gate_proj':q3_config,
-  'mlp.up_proj'  :q3_config,
-  'mlp.down_proj':q3_config,
-})
-```
-
-The second approach is especially interesting for quantizing Mixture-of-Experts (MoEs) because the experts are less affected by lower quantization settings.
-
-
-Then quantize the model as follows:
-``` Python
-import torch
-
-model = AutoModelForCausalLM.from_pretrained(
-    model_id, 
-    torch_dtype=torch.float16, 
-    device_map="cuda", 
-    quantization_config=quant_config
-)
-```
-
-## Optimized Runtime
-
-HQQ supports various backends, including pure PyTorch and custom dequantization CUDA kernels. These backends are suitable for older GPUs and PEFT/QLoRA training.
-For faster inference, HQQ supports 4-bit fused kernels (TorchAO and Marlin), reaching up to 200 tokens/sec on a single 4090.
-For more details on how to use the backends, please refer to https://github.com/mobiusml/hqq/?tab=readme-ov-file#backend
--- a/docs/source/en/quantization/optimum.md
+++ b/docs/source/en/quantization/optimum.md
@ -1,19 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# Optimum
-
-The [Optimum](https://huggingface.co/docs/optimum/index) library supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. Consider using Optimum for quantization if you're using specific and optimized hardware like Intel CPUs, Furiosa NPUs or a model accelerator like ONNX Runtime.
--- a/docs/source/en/quantization/overview.md
+++ b/docs/source/en/quantization/overview.md
@ -1,58 +0,0 @@
-<!--Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# Quantization
-
-Quantization techniques represent data with less information while trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.
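
As a rough illustration of the size arithmetic above, the sketch below estimates the memory needed for the weights alone at several bit widths. The helper `weight_memory_gib` and the 125M parameter count are illustrative assumptions, not part of Transformers, and real checkpoints carry extra overhead (quantization scales, buffers, layers left unquantized).

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB (hypothetical helper)."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumed parameter count, roughly the scale of a 125M-parameter model.
num_params = 125_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(num_params, bits):.3f} GiB")
```

Each halving of the bit width halves the estimate, which is the effect described for the 32-bit to 16-bit case.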
-
-<Tip>
-
-Interested in adding a new quantization method to Transformers? Read the [HfQuantizer](./contribute) guide to learn how!
-
-</Tip>
-
-<Tip>
-
-If you are new to the quantization field, we recommend checking out these beginner-friendly courses about quantization, made in collaboration with DeepLearning.AI:
-
-* [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/)
-* [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth/)
-
-</Tip>
-
-## When to use what?
-
-The community has developed many quantization methods for various use cases. With Transformers, you can run any of these integrated methods depending on your use case because each method has its own pros and cons.
-
-For example, some quantization methods require calibrating the model with a dataset for more accurate and "extreme" compression (up to 1-2 bits quantization), while other methods work out of the box with on-the-fly quantization.
-
-Another parameter to consider is compatibility with your target device. Do you want to quantize on a CPU, GPU, or Apple silicon?
-
-In short, supporting a wide range of quantization methods allows you to pick the best quantization method for your specific use case.
-
-Use the table below to help you decide which quantization method to use.
-
-| Quantization method                 | On the fly quantization | CPU | CUDA GPU | ROCm GPU (AMD) | Metal (Apple Silicon) | torch.compile() support | Number of bits | Supports fine-tuning (through PEFT) | Serializable with 🤗 transformers | 🤗 transformers support | Link to library                             |
-|-------------------------------------|-------------------------|-----|----------|----------------|-----------------------|-------------------------|----------------|-------------------------------------|--------------|------------------------|---------------------------------------------|
-| [AQLM](./aqlm)                                | 🔴                       |  🟢   |     🟢     | 🔴              | 🔴                     | ?                       | 1 / 2          | 🟢                                   | 🟢            | 🟢                      | https://github.com/Vahe1994/AQLM            |
-| [AWQ](./awq) | 🔴                       | 🔴   | 🟢        | 🟢              | 🔴                     | ?                       | 4              | 🟢                                   | 🟢            | 🟢                      | https://github.com/casper-hansen/AutoAWQ    |
-| [bitsandbytes](./bitsandbytes)                        | 🟢                       | 🔴   |     🟢     | 🔴              | 🔴                     | 🔴                       | 4 / 8          | 🟢                                   | 🟢            | 🟢                      | https://github.com/TimDettmers/bitsandbytes |
-| [EETQ](./eetq)                                | 🟢                       | 🔴   | 🟢        | 🔴              | 🔴                     | ?                       | 8              | 🟢                                   | 🟢            | 🟢                      | https://github.com/NetEase-FuXi/EETQ        |
-| GGUF / GGML (llama.cpp)             | 🟢                       | 🟢   | 🟢        | 🔴              | 🟢                     | 🔴                       | 1 - 8          | 🔴                                   | [See GGUF section](../gguf)                | [See GGUF section](../gguf)                      | https://github.com/ggerganov/llama.cpp      |
-| [GPTQ](./gptq)                                | 🔴                       | 🔴   | 🟢        | 🟢              | 🔴                     | 🔴                       | 2 - 3 - 4 - 8          | 🟢                                   | 🟢            | 🟢                      | https://github.com/AutoGPTQ/AutoGPTQ        |
-| [HQQ](./hqq)                                 | 🟢                       | 🟢    | 🟢        | 🔴              | 🔴                     | 🟢                       | 1 - 8          | 🟢                                   | 🔴            | 🟢                      | https://github.com/mobiusml/hqq/            |
-| [Quanto](./quanto)                              | 🟢                       | 🟢   | 🟢        | 🔴              | 🟢                     | 🟢                       | 2 / 4 / 8      | 🔴                                   | 🔴            | 🟢                      | https://github.com/huggingface/quanto       |
-
--- a/docs/source/en/quantization/quanto.md
+++ b/docs/source/en/quantization/quanto.md
@ -1,66 +0,0 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
-the License. You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
-an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-specific language governing permissions and limitations under the License.
-
-⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
-rendered properly in your Markdown viewer.
-
-->
-
-# Quanto
-
-<Tip>
-
-Try Quanto + transformers with this [notebook](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing)!
-
-</Tip>
-
-
-[🤗 Quanto](https://github.com/huggingface/quanto) is a versatile PyTorch quantization toolkit that uses linear quantization. Quanto provides several unique features such as:
-
- weights quantization (`float8`,`int8`,`int4`,`int2`)
- activation quantization (`float8`,`int8`)
- modality agnostic (e.g CV,LLM)
- device agnostic (e.g CUDA,MPS,CPU)
- compatibility with `torch.compile`
- easy to add custom kernel for specific device
- supports quantization aware training
-<!-- Add link to the blogpost -->
-
-Before you begin, make sure the following libraries are installed:
-
-```bash
-pip install quanto accelerate transformers
-```
-
-Now you can quantize a model by passing a [`QuantoConfig`] object to the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it contains `torch.nn.Linear` layers.
-
-The integration with transformers only supports weights quantization. For more complex use cases such as activation quantization, calibration, and quantization-aware training, you should use the [quanto](https://github.com/huggingface/quanto) library instead.
-
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
-
-model_id = "facebook/opt-125m"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-quantization_config = QuantoConfig(weights="int8")
-quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)
-```
-
-Note that serialization is not yet supported with transformers, but it is coming soon! If you want to save the model, you can use the quanto library instead.
-
-The Quanto library uses a linear quantization algorithm. Even though this is a basic quantization technique, we get very good results! Have a look at the following benchmark (llama-2-7b on perplexity metric). You can find more benchmarks [here](https://github.com/huggingface/quanto/tree/main/bench/generation).
-
-<div class="flex gap-4">
-  <div>
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/NousResearch-Llama-2-7b-hf_Perplexity.png" alt="llama-2-7b-quanto-perplexity" />
-  </div>
-</div>
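
To make the idea of linear quantization concrete, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization. It is a generic illustration under stated assumptions and not Quanto's implementation: the function names `quantize_int8` and `dequantize`, the per-tensor scale, and the sample weights are all hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor linear quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor (an assumption; real libraries may use
    # per-channel or per-group scales instead).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integer codes back to approximate floating-point values."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 0.8]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The round-trip error of each weight is bounded by half the scale, which is why a tensor with a few large outliers quantizes less accurately under a single per-tensor scale.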
-
-The library is versatile enough to be compatible with most PTQ optimization algorithms. The plan is to integrate the most popular algorithms (AWQ, SmoothQuant) as seamlessly as possible.
--- a/docs/source/en/quicktour.md
+++ b/docs/source/en/quicktour.md
@ -204,7 +204,7 @@ Pass your text to the tokenizer:
 The tokenizer returns a dictionary containing:

 * [input_ids](./glossary#input-ids): numerical representations of your tokens.
-* [attention_mask](./glossary#attention-mask): indicates which tokens should be attended to.
+* [attention_mask](.glossary#attention-mask): indicates which tokens should be attended to.

 A tokenizer can also accept a list of inputs, and pad and truncate the text to return a batch with uniform length:

--- a/docs/source/en/tasks/object_detection.md
+++ b/docs/source/en/tasks/object_detection.md
@ -206,10 +206,10 @@ Instantiate the image processor from the same checkpoint as the model you want t

 >>> image_processor = AutoImageProcessor.from_pretrained(
 ...     MODEL_NAME,
-...     do_resize=True,
-...     size={"max_height": MAX_SIZE, "max_width": MAX_SIZE},
-...     do_pad=True,
-...     pad_size={"height": MAX_SIZE, "width": MAX_SIZE},
+...     # For now, we recommend using an external transform to pad and resize images.
+...     # It's faster and yields better results for object-detection models.
+...     do_pad=False,
+...     do_resize=False,
 ... )
 ```

@ -217,28 +217,38 @@ Before passing the images to the `image_processor`, apply two preprocessing tran
 - Augmenting images
 - Reformatting annotations to meet DETR expectations

-First, to make sure the model does not overfit on the training data, you can apply image augmentation with any data augmentation library. Here we use [Albumentations](https://albumentations.ai/docs/).
+First, to make sure the model does not overfit on the training data, you can apply image augmentation with any data augmentation library. Here we use [Albumentations](https://albumentations.ai/docs/) ...
 This library ensures that transformations affect the image and update the bounding boxes accordingly.
 The 🤗 Datasets library documentation has a detailed [guide on how to augment images for object detection](https://huggingface.co/docs/datasets/object_detection),
-and it uses the exact same dataset as an example. Apply some geometric and color transformations to the image. For additional augmentation options, explore the [Albumentations Demo Space](https://huggingface.co/spaces/qubvel-hf/albumentations-demo).
+and it uses the exact same dataset as an example. Apply the same approach here, resize each image to (480, 480),
+flip it horizontally, and brighten it. For additional augmentation options, explore the [Albumentations Demo Space](https://huggingface.co/spaces/qubvel-hf/albumentations-demo).

 ```py
 >>> import albumentations as A

 >>> max_size = IMAGE_SIZE

+>>> # Resize image longest edge to 480 and then pad image to square 480x480.
+>>> # This padding and resizing strategy gives better results, see
+>>> # https://github.com/huggingface/transformers/pull/30422#discussion_r1584647408
+>>> basic_transforms = [
+...     A.LongestMaxSize(max_size=max_size),
+...     A.PadIfNeeded(max_size, max_size, border_mode=0, value=(128, 128, 128), position="top_left"),
+... ]
+
 >>> train_augment_and_transform = A.Compose(
 ...     [
 ...         A.Perspective(p=0.1),
 ...         A.HorizontalFlip(p=0.5),
 ...         A.RandomBrightnessContrast(p=0.5),
 ...         A.HueSaturationValue(p=0.1),
+...         *basic_transforms,
 ...     ],
 ...     bbox_params=A.BboxParams(format="coco", label_fields=["category"], clip=True, min_area=25),
 ... )

 >>> validation_transform = A.Compose(
-...     [A.NoOp()],
+...     basic_transforms,
 ...     bbox_params=A.BboxParams(format="coco", label_fields=["category"], clip=True),
 ... )
 ```
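
The resize-then-pad geometry used by the added transforms can be sketched in plain Python. This is only an illustration of the arithmetic, assuming the longest edge is scaled to `max_size`, dimensions are rounded to the nearest integer, and padding goes on the bottom and right (the `top_left` position); the helper names are hypothetical and Albumentations' exact rounding may differ.

```python
def resize_then_pad_shape(height, width, max_size):
    """Return the resized (h, w) and the (bottom, right) padding to reach a square."""
    scale = max_size / max(height, width)
    new_h = round(height * scale)
    new_w = round(width * scale)
    # With the image anchored at the top-left, padding is added below and to the right.
    return (new_h, new_w), (max_size - new_h, max_size - new_w)


def scale_coco_box(box, scale):
    """Scale a COCO box [x, y, w, h]; top-left anchoring adds no offset."""
    return [value * scale for value in box]


(new_h, new_w), (pad_bottom, pad_right) = resize_then_pad_shape(480, 640, 480)
print(new_h, new_w, pad_bottom, pad_right)
print(scale_coco_box([100, 40, 200, 80], 480 / 640))
```

Because the image stays at the top-left corner, box coordinates only need scaling, which keeps predictions on the padded image directly comparable with the resized, unpadded image.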
@ -284,7 +294,7 @@ The `image_processor` expects the annotations to be in the following format: `{'
 Now you can combine the image and annotation transformations to use on a batch of examples:

 ```py
->>> def augment_and_transform_batch(examples, transform, image_processor, return_pixel_mask=False):
+>>> def augment_and_transform_batch(examples, transform, image_processor):
 ...     """Apply augmentations and format annotations in COCO format for object detection task"""

 ...     images = []
@ -305,9 +315,6 @@ Now you can combine the image and annotation transformations to use on a batch o
 ...     # Apply the image processor transformations: resizing, rescaling, normalization
 ...     result = image_processor(images=images, annotations=annotations, return_tensors="pt")

-...     if not return_pixel_mask:
-...         result.pop("pixel_mask", None)
-
 ...     return result
 ```

@ -1478,12 +1485,25 @@ Now that you have finetuned a model, evaluated it, and uploaded it to the Huggin
 ```py
 >>> import torch
 >>> import requests
+>>> import numpy as np
+>>> import albumentations as A

->>> from PIL import Image, ImageDraw
+>>> from PIL import Image
 >>> from transformers import AutoImageProcessor, AutoModelForObjectDetection

 >>> url = "https://images.pexels.com/photos/8413299/pexels-photo-8413299.jpeg?auto=compress&cs=tinysrgb&w=630&h=375&dpr=2"
 >>> image = Image.open(requests.get(url, stream=True).raw)
+
+>>> # Define transformations for inference
+>>> resize_and_pad = A.Compose([
+...     A.LongestMaxSize(max_size=max_size),
+...     A.PadIfNeeded(max_size, max_size, border_mode=0, value=(128, 128, 128), position="top_left"),
+... ])
+
+>>> # This one is for visualization with no padding
+>>> resize_only = A.Compose([
+...     A.LongestMaxSize(max_size=max_size),
+... ])
 ```

 Load model and image processor from the Hugging Face Hub (skip to use already trained in this session):
@ -1499,11 +1519,12 @@ Load model and image processor from the Hugging Face Hub (skip to use already tr
 And detect bounding boxes:

 ```py
+>>> np_preprocessed_image = resize_and_pad(image=np.array(image))["image"]

 >>> with torch.no_grad():
-...     inputs = image_processor(images=[image], return_tensors="pt")
-...     outputs = model(**inputs.to(device))
-...     target_sizes = torch.tensor([[image.size[1], image.size[0]]])
+...     inputs = image_processor(images=[np_preprocessed_image], return_tensors="pt")
+...     outputs = model(inputs["pixel_values"].to(device))
+...     target_sizes = torch.tensor([np_preprocessed_image.shape[:2]])
 ...     results = image_processor.post_process_object_detection(outputs, threshold=0.3, target_sizes=target_sizes)[0]

 >>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
@ -1522,7 +1543,9 @@ Detected Coverall with confidence 0.391 at location [68.61, 126.66, 309.03, 318.
 Let's plot the result:

 ```py
->>> draw = ImageDraw.Draw(image)
+>>> resized_image = resize_only(image=np.array(image))["image"]
+>>> resized_image = Image.fromarray(resized_image)
+>>> draw = ImageDraw.Draw(resized_image)

 >>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
 ...     box = [round(i, 2) for i in box.tolist()]
@ -1530,7 +1553,7 @@ Let's plot the result:
 ...     draw.rectangle((x, y, x2, y2), outline="red", width=1)
 ...     draw.text((x, y), model.config.id2label[label.item()], fill="white")

->>> image
+>>> resized_image
 ```

 <div class="flex justify-center">
--- a/docs/source/en/trainer.md
+++ b/docs/source/en/trainer.md
@ -382,56 +382,6 @@ trainer.train()

 Note that layerwise optimization is a bit experimental and does not support DDP (Distributed Data Parallel), so you can run the training script only on a single GPU. Please see [this appropriate section](https://github.com/jiaweizzhao/GaLore?tab=readme-ov-file#train-7b-model-with-a-single-gpu-with-24gb-memory) for more details. Other features such as gradient clipping, DeepSpeed, etc. might not be supported out of the box. Please [raise an issue on GitHub](https://github.com/huggingface/transformers/issues) if you encounter such an issue.

-## LOMO optimizer
-
-The LOMO optimizers were introduced in [Full Parameter Fine-Tuning for Large Language Models with Limited Resources](https://hf.co/papers/2306.09782) and [AdaLomo: Low-memory Optimization with Adaptive Learning Rate](https://hf.co/papers/2310.10195).
-Both are efficient full-parameter fine-tuning methods. These optimizers fuse the gradient computation and the parameter update in one step to reduce memory usage. Supported optimizers for LOMO are `"lomo"` and `"adalomo"`. First, either install LOMO from PyPI with `pip install lomo-optim` or install it from source with `pip install git+https://github.com/OpenLMLab/LOMO.git`.
-
-<Tip>
-
-According to the authors, it is recommended to use `AdaLomo` without `grad_norm` to get better performance and higher throughput.
-
-</Tip>
-
-Below is a simple script to demonstrate how to fine-tune [google/gemma-2b](https://huggingface.co/google/gemma-2b) on the IMDB dataset in full precision:
-
-```python
-import torch
-import datasets
-from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
-import trl
-
-train_dataset = datasets.load_dataset('imdb', split='train')
-
-args = TrainingArguments(
-    output_dir="./test-lomo",
-    max_steps=1000,
-    per_device_train_batch_size=4,
-    optim="adalomo",
-    gradient_checkpointing=True,
-    logging_strategy="steps",
-    logging_steps=1,
-    learning_rate=2e-6,
-    save_strategy="no",
-    run_name="lomo-imdb",
-)
-
-model_id = "google/gemma-2b"
-
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True).to(0)
-
-trainer = trl.SFTTrainer(
-    model=model, 
-    args=args,
-    train_dataset=train_dataset,
-    dataset_text_field='text',
-    max_seq_length=1024,
-)
-
-trainer.train()
-```
-
 ## Accelerate and Trainer

 The [`Trainer`] class is powered by [Accelerate](https://hf.co/docs/accelerate), a library for easily training PyTorch models in distributed environments with support for integrations such as [FullyShardedDataParallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) and [DeepSpeed](https://www.deepspeed.ai/).
--- a/docs/source/es/_toctree.yml
+++ b/docs/source/es/_toctree.yml
@ -102,6 +102,4 @@
    title: Perplejidad de los modelos de longitud fija
  - local: pipeline_webserver
    title: Flujo de trabajo para la inferencia de los servidores web
-  - local: model_memory_anatomy
-    title: Anatomía del entrenamiento de los modelos
  title: Guías conceptuales
--- a/docs/source/es/installation.md
+++ b/docs/source/es/installation.md
@ -154,7 +154,7 @@ Los modelos preentrenados se descargan y almacenan en caché localmente en: `~/.

 ## Modo Offline

-🤗 Transformers puede ejecutarse en un entorno con firewall o fuera de línea (offline) usando solo archivos locales. Configura la variable de entorno `HF_HUB_OFFLINE=1` para habilitar este comportamiento.
+🤗 Transformers puede ejecutarse en un entorno con firewall o fuera de línea (offline) usando solo archivos locales. Configura la variable de entorno `TRANSFORMERS_OFFLINE=1` para habilitar este comportamiento.

 <Tip>

@ -171,7 +171,7 @@ python examples/pytorch/translation/run_translation.py --model_name_or_path goog
 Ejecuta este mismo programa en una instancia offline con el siguiente comando:

 ```bash
-HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \
+HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
 python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```

--- a/docs/source/es/model_memory_anatomy.md
+++ b/docs/source/es/model_memory_anatomy.md
@ -1,239 +0,0 @@
-<!---
-Copyright 2023 The HuggingFace Team. All rights reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Anatomía del entrenamiento de los modelos
-
-Para entender las técnicas de optimización del rendimiento que se pueden aplicar para mejorar la eficiencia en la velocidad del entrenamiento de los modelos y la utilización de la memoria, es útil familiarizarse con cómo se utiliza la GPU durante el entrenamiento y cómo varía la intensidad de cálculo según la operación realizada.
-
-Empecemos explorando un ejemplo enfocado en la utilización de la GPU y la ejecución del entrenamiento de un modelo. Para la demostración, necesitaremos instalar algunas bibliotecas:
-
-```bash
-pip install transformers datasets accelerate nvidia-ml-py3
-```
-
-La biblioteca `nvidia-ml-py3` nos permite monitorear la utilización de memoria de los modelos desde Python. Es posible que estés familiarizado con el comando `nvidia-smi` en la terminal, esta biblioteca nos permite acceder a la misma información en Python directamente.
-
-Luego, creamos algunos datos ficticios: IDs de tokens aleatorios entre 100 y 30000 y etiquetas binarias para un clasificador. En total, obtenemos 512 secuencias cada una con longitud 512 y las almacenamos en un [`~datasets.Dataset`] con formato PyTorch.
-
-
-```py
->>> import numpy as np
->>> from datasets import Dataset
-
-
->>> seq_len, dataset_size = 512, 512
->>> dummy_data = {
-...     "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
-...     "labels": np.random.randint(0, 2, dataset_size),  # upper bound is exclusive, so 2 yields binary labels
-... }
->>> ds = Dataset.from_dict(dummy_data)
->>> ds.set_format("pt")
-```
-
-Para imprimir estadísticas resumidas para la utilización de la GPU y la ejecución del entrenamiento con [`Trainer`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer), definimos dos funciones auxiliares:
-
-```py
->>> from pynvml import *
-
-
->>> def print_gpu_utilization():
-...     nvmlInit()
-...     handle = nvmlDeviceGetHandleByIndex(0)
-...     info = nvmlDeviceGetMemoryInfo(handle)
-...     print(f"GPU memory occupied: {info.used//1024**2} MB.")
-
-
->>> def print_summary(result):
-...     print(f"Time: {result.metrics['train_runtime']:.2f}")
-...     print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
-...     print_gpu_utilization()
-```
-
-Comencemos comprobando que la memoria GPU esté libre:
-
-```py
->>> print_gpu_utilization()
-GPU memory occupied: 0 MB.
-```
-
-Parece estar bien: la memoria de la GPU no está ocupada como esperaríamos antes de cargar cualquier modelo. Si no es el caso en tu máquina, asegúrate de detener todos los procesos que estén utilizando la memoria de la GPU. Sin embargo, no toda la memoria libre de la GPU puede ser utilizada por el usuario. Cuando se carga un modelo en la GPU, también se cargan los kernels, lo que puede ocupar 1-2GB de memoria. Para ver cuánta memoria será ocupada por defecto, carguemos un tensor diminuto en la GPU, lo que también desencadena la carga de los kernels.
-
-```py
->>> import torch
-
-
->>> torch.ones((1, 1)).to("cuda")
->>> print_gpu_utilization()
-GPU memory occupied: 1343 MB.
-```
-
-Vemos que los kernels solos ocupan 1,3GB de memoria de la GPU. Ahora, veamos cuánto espacio ocupa el modelo.
-
-## Cargar el Modelo
-
-Primero, cargamos el modelo `google-bert/bert-large-uncased`. Los pesos del modelo son cargados directamente en la GPU para que podamos verificar cuánto espacio ocupan solo los pesos.
-
-```py
->>> from transformers import AutoModelForSequenceClassification
-
-
->>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-large-uncased").to("cuda")
->>> print_gpu_utilization()
-GPU memory occupied: 2631 MB.
-```
-
-Podemos ver que los pesos del modelo solos ocupan 1,3 GB de memoria de la GPU. El número exacto depende de la GPU específica que estés utilizando. Ten en cuenta que en GPUs más modernas, un modelo puede ocupar más espacio ya que los pesos se cargan de manera optimizada lo cual acelera el uso del modelo. Ahora también podemos verificar rápidamente si obtenemos el mismo resultado que con la CLI de `nvidia-smi`:
-
-```bash
-nvidia-smi
-```
-
-```bash
-Tue Jan 11 08:58:05 2022
-+-----------------------------------------------------------------------------+
-| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
-|-------------------------------+----------------------+----------------------+
-| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
-| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
-|                               |                      |               MIG M. |
-|===============================+======================+======================|
-|   0  Tesla V100-SXM2...  On   | 00000000:00:04.0 Off |                    0 |
-| N/A   37C    P0    39W / 300W |   2631MiB / 16160MiB |      0%      Default |
-|                               |                      |                  N/A |
-+-------------------------------+----------------------+----------------------+
-
-+-----------------------------------------------------------------------------+
-| Processes:                                                                  |
-|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
-|        ID   ID                                                   Usage      |
-|=============================================================================|
-|    0   N/A  N/A      3721      C   ...nvs/codeparrot/bin/python     2629MiB |
-+-----------------------------------------------------------------------------+
-```
-
-Obtenemos el mismo número que antes y también puedes ver que estamos utilizando una GPU V100 con 16GB de memoria. Ahora podemos empezar a entrenar el modelo y ver cómo cambia el consumo de memoria de la GPU. Primero, configuramos algunos argumentos de entrenamiento estándar:
-
-```py
-default_args = {
-    "output_dir": "tmp",
-    "eval_strategy": "steps",
-    "num_train_epochs": 1,
-    "log_level": "error",
-    "report_to": "none",
-}
-```
-
-<Tip>
-
-Si planeas ejecutar varias pruebas, reinicie el kernel de Python entre cada prueba para borrar correctamente la memoria.
-
-</Tip>
-
-## Utilización de la memoria en el entrenamiento
-
-Vamos a utilizar el [`Trainer`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer) y entrenar el modelo sin utilizar ninguna técnica de optimización del rendimiento de la GPU y un tamaño de lote de 4:
-
-```py
->>> from transformers import TrainingArguments, Trainer, logging
-
->>> logging.set_verbosity_error()
-
-
->>> training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
->>> trainer = Trainer(model=model, args=training_args, train_dataset=ds)
->>> result = trainer.train()
->>> print_summary(result)
-```
-
-```
-Time: 57.82
-Samples/second: 8.86
-GPU memory occupied: 14949 MB.
-```
-
-Vemos que incluso un tamaño de lote relativamente pequeño casi llena toda la memoria de nuestra GPU. Sin embargo, un tamaño de lote más grande a menudo puede resultar en una convergencia del modelo más rápida o un mejor rendimiento final. Así que idealmente queremos ajustar el tamaño del lote a las necesidades del modelo y no a las limitaciones de la GPU. Lo interesante es que utilizamos mucha más memoria que el tamaño del modelo. 
-Para entender un poco mejor por qué es el caso, echemos un vistazo a las operaciones y necesidades de memoria de un modelo.
-
-## Anatomía de las Operaciones del Modelo
-
-La arquitectura de los transformers incluye 3 grupos principales de operaciones agrupadas a continuación por intensidad de cálculo.
-
-1. **Contracciones de Tensores**
-
-    Las capas lineales y componentes de la Atención Multi-Head realizan **multiplicaciones matriciales por lotes**. Estas operaciones son la parte más intensiva en cálculo del entrenamiento de los transformers.
-
-2. **Normalizaciones Estadísticas**
-
-    Softmax y normalización de capas son menos intensivas en cálculo que las contracciones de tensores, e implican una o más **operaciones de reducción**, cuyo resultado se aplica luego mediante un mapa.
-
-3. **Operadores por Elemento**
-
-    Estos son los operadores restantes: **sesgos, dropout, activaciones y conexiones residuales**. Estas son las operaciones menos intensivas en cálculo.
-
-Este conocimiento puede ser útil al analizar cuellos de botella de rendimiento.
-
-Este resumen se deriva de [Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020](https://arxiv.org/abs/2007.00072)
-
-
-## Anatomía de la Memoria del Modelo
-
-Hemos visto que al entrenar un modelo se utiliza mucha más memoria que solo poner el modelo en la GPU. Esto se debe a que hay muchos componentes durante el entrenamiento que utilizan memoria de la GPU. Los componentes en memoria de la GPU son los siguientes:
-
-1. pesos del modelo
-2. estados del optimizador
-3. gradientes
-4. activaciones hacia adelante guardadas para el cálculo del gradiente
-5. buffers temporales
-6. memoria específica de funcionalidad
-
-Un modelo típico entrenado en precisión mixta con AdamW requiere 18 bytes por parámetro del modelo más memoria de activación. Para la inferencia no hay estados del optimizador ni gradientes, por lo que podemos restarlos. Y así terminamos con 6 bytes por parámetro del modelo para la inferencia en precisión mixta, más la memoria de activación.
-
-Veámoslo a detalle:
-
-**Pesos del Modelo:**
-
- 4 bytes por número de parámetros para entrenamiento en fp32
- 6 bytes por número de parámetros para entrenamiento en precisión mixta (mantiene un modelo en fp32 y uno en fp16 en memoria)
-
-**Estados del Optimizador:**
-
- 8 bytes por número de parámetros para un AdamW normal (mantiene 2 estados)
- 2 bytes por número de parámetros para optimizadores de 8 bits como [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- 4 bytes por número de parámetros para optimizadores como SGD con momentum (mantiene solo 1 estado)
-
-**Gradientes**
-
- 4 bytes por número de parámetros para entrenamiento en fp32 o precisión mixta (los gradientes siempre se mantienen en fp32)
-
-**Activaciones hacia Adelante**
-
- El tamaño depende de muchos factores, los principales siendo la longitud de la secuencia, el tamaño oculto y el tamaño de lote.
-
-Hay entradas y salidas que se pasan y se devuelven por las funciones hacia adelante y hacia atrás, y las activaciones hacia adelante (*forward activations*) guardadas para el cálculo del gradiente.
-
-**Memoria Temporal**
-
-Además, hay todas clases de variables temporales que se liberan una vez que se completa el cálculo, pero en el momento podrían requerir memoria adicional y podrían provocar un error de memoria insuficiente. Por lo tanto, al codificar es crucial pensar estratégicamente sobre tales variables temporales y a veces liberarlas explícitamente tan pronto como ya no se necesitan.
-
-**Memoria Específica de Funcionalidad**
-
-Entonces, su software podría tener necesidades especiales de memoria. Por ejemplo, al generar texto mediante la búsqueda por haz, el software necesita mantener múltiples copias de las entradas y salidas.
-
-**Velocidad de Ejecución `forward` vs `backward`**
-
-Para convoluciones y capas lineales, hay 2x flops en la ejecución hacia atrás (`backward`) en comparación con la ejecución hacia adelante (`forward`), lo que generalmente se traduce en ~2x más lento (a veces más, porque los tamaños en la ejecución hacia atrás tienden a ser más complejos). Las activaciones suelen ser limitadas por ancho de banda, y es típico que una activación tenga que leer más datos en la ejecución hacia atrás que en la ejecución hacia adelante (por ejemplo, la activación hacia adelante lee una vez, escribe una vez, la activación hacia atrás lee dos veces, gradOutput y salida de la ejecución hacia adelante, y escribe una vez, gradInput).
-
-Como puedes ver, hay potencialmente unos pocos lugares donde podríamos ahorrar memoria de la GPU o acelerar operaciones. Ahora que entiendes qué afecta la utilización de la GPU y la velocidad de cálculo, consulta la página de documentación [Métodos y herramientas para entrenamiento eficiente en una sola GPU](https://huggingface.co/docs/transformers/perf_train_gpu_one) para aprender sobre técnicas de optimización del rendimiento.
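
The per-parameter byte counts in the removed page above can be turned into a quick estimate. The following is a minimal sketch, not part of the original document: it only adds up the figures quoted there (weights, optimizer states, gradients) and ignores activations, temporary buffers, and the memory taken by CUDA kernels. The function names and the parameter count are hypothetical, chosen for illustration.

```python
# Bytes per parameter, taken from the figures quoted in the removed page.
WEIGHT_BYTES = {"fp32": 4, "mixed": 6}  # mixed keeps an fp32 and an fp16 copy
OPTIMIZER_BYTES = {"adamw": 8, "adamw_8bit": 2, "sgd_momentum": 4}
GRADIENT_BYTES = 4  # the page states gradients are always kept in fp32


def bytes_per_param(precision="mixed", optimizer="adamw", training=True):
    """Per-parameter memory, excluding activations and temporary buffers."""
    total = WEIGHT_BYTES[precision]
    if training:
        # Inference has neither optimizer states nor gradients.
        total += OPTIMIZER_BYTES[optimizer] + GRADIENT_BYTES
    return total


def estimate_gib(num_params, **kwargs):
    """Lower-bound estimate in GiB for a model with num_params parameters."""
    return num_params * bytes_per_param(**kwargs) / 2**30


# Hypothetical parameter count, used only to illustrate the arithmetic.
num_params = 336_000_000
print(f"training:  {estimate_gib(num_params):.2f} GiB")
print(f"inference: {estimate_gib(num_params, training=False):.2f} GiB")
```

With the defaults this reproduces the 18 bytes per parameter for mixed-precision AdamW training and 6 bytes per parameter for mixed-precision inference. Activation memory comes on top and depends on sequence length, hidden size, and batch size.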
--- a/docs/source/fr/installation.md
+++ b/docs/source/fr/installation.md
@ -171,7 +171,7 @@ Les modèles pré-entraînés sont téléchargés et mis en cache localement dan

 ## Mode hors ligne

-🤗 Transformers peut fonctionner dans un environnement cloisonné ou hors ligne en n'utilisant que des fichiers locaux. Définissez la variable d'environnement `HF_HUB_OFFLINE=1` pour activer ce mode.
+🤗 Transformers peut fonctionner dans un environnement cloisonné ou hors ligne en n'utilisant que des fichiers locaux. Définissez la variable d'environnement `TRANSFORMERS_OFFLINE=1` pour activer ce mode.

 <Tip>

@ -180,7 +180,7 @@ Ajoutez [🤗 Datasets](https://huggingface.co/docs/datasets/) à votre processu
 </Tip>

 ```bash
-HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \
+HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
 python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```

--- a/docs/source/it/installation.md
+++ b/docs/source/it/installation.md
@ -152,7 +152,7 @@ I modelli pre-allenati sono scaricati e memorizzati localmente nella cache in: `

 ## Modalità Offline

-🤗 Transformers può essere eseguita in un ambiente firewalled o offline utilizzando solo file locali. Imposta la variabile d'ambiente `HF_HUB_OFFLINE=1` per abilitare questo comportamento.
+🤗 Transformers può essere eseguita in un ambiente firewalled o offline utilizzando solo file locali. Imposta la variabile d'ambiente `TRANSFORMERS_OFFLINE=1` per abilitare questo comportamento.

 <Tip>

@ -169,7 +169,7 @@ python examples/pytorch/translation/run_translation.py --model_name_or_path goog
 Esegui lo stesso programma in un'istanza offline con:

 ```bash
-HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \
+HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
 python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```

--- a/docs/source/it/perf_infer_gpu_one.md
+++ b/docs/source/it/perf_infer_gpu_one.md
@ -55,10 +55,10 @@ Di seguito sono riportate alcune note per aiutarvi a utilizzare questo modulo, o
 Dopo aver installato le librerie necessarie, per caricare il tuo modello mixed 8-bit è il seguente:

 ```py
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+from transformers import AutoModelForCausalLM

 model_name = "bigscience/bloom-2b5"
-model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
 ```

 Per la generazione di testo, si consiglia di:
@ -69,11 +69,11 @@ Per la generazione di testo, si consiglia di:
 Ecco un semplice esempio:

 ```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+from transformers import AutoModelForCausalLM, AutoTokenizer

 model_name = "bigscience/bloom-2b5"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
-model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

 text = "Hello, my llama is cute"
 inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
@ -87,7 +87,7 @@ outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
 Usare il seguente modo caricare il modello mixed-8bit su più GPU (stesso comando della configurazione a GPU singola):
 ```py
 model_name = "bigscience/bloom-2b5"
-model_8bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
 ```
 Puoi controllare la RAM della GPU che si vuole allocare su ogni GPU usando `accelerate`. Utilizzare l'argomento `max_memory` come segue:

--- a/docs/source/ja/installation.md
+++ b/docs/source/ja/installation.md
@ -157,7 +157,7 @@ conda install conda-forge::transformers

 ## オフラインモード

-🤗 Transformersはローカルファイルのみを使用することでファイアウォールやオフラインの環境でも動作させることができます。この動作を有効にするためには、環境変数`HF_HUB_OFFLINE=1`を設定します。
+🤗 Transformersはローカルファイルのみを使用することでファイアウォールやオフラインの環境でも動作させることができます。この動作を有効にするためには、環境変数`TRANSFORMERS_OFFLINE=1`を設定します。

 <Tip>

@ -174,7 +174,7 @@ python examples/pytorch/translation/run_translation.py --model_name_or_path goog
 オフラインインスタンスでこの同じプログラムを実行します:

 ```bash
-HF_DATASETS_OFFLINE=1 HF_HUB_OFFLINE=1 \
+HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
 python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ...
 ```

Author	SHA1	Message	Date
ArthurZucker	ab0f050b42	Release: v4.41.2	2024-05-30 13:28:00 -04:00
Matt	57f5553d2e	Fix faulty rstrip in module loading (#31108)	2024-05-30 13:25:10 -04:00
oOraph	73b180c2be	fix from_pretrained in offline mode when model is preloaded in cache (#31010) * Unit test to verify fix Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> * fix from_pretrained in offline mode when model is preloaded in cache Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> * minor: fmt Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> --------- Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>	2024-05-30 13:25:10 -04:00
Aymeric Roucher	a6325a77b2	Redirect transformers_agents doc to agents (#31054)	2024-05-30 13:25:10 -04:00
Pablo Montalvo	9ccdc84cb2	Paligemma- fix devices and dtype assignments (#31008) * fix devices and dtype assignments * [run-slow]paligemma	2024-05-30 13:25:09 -04:00
Lucain	12aa3167b4	Do not trigger autoconversion if local_files_only (#31004)	2024-05-24 05:02:39 -04:00
ArthurZucker	75f15f39a0	Release: v4.41.1	2024-05-22 13:40:40 -04:00
Pablo Montalvo	8282db5cc9	Paligemma causal attention mask (#30967) * PaliGemma working causal attention * Formatting * Style * Docstrings + remove commented code * Update docstring for PaliGemma Config * PaliGemma - add separator ind to model/labels * Refactor + docstring paligemma processor method * Style * return token type ids when tokenizing labels * use token type ids when building causal mask * add token type ids to tester * remove separator from config * fix style * don't ignore separator * add processor documentation * simplify tokenization * fix causal mask * style * fix label propagation, revert suffix naming * fix style * fix labels tokenization * [run-slow]paligemma * add eos if suffixes are present * [run-slow]paligemma * [run-slow]paligemma * add misssing tokens to fast version * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix style * [run-slow]paligemma --------- Co-authored-by: Peter Robicheaux <peter@roboflow.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>	2024-05-22 13:39:52 -04:00
ArthurZucker	e5b788ade3	Revert "feat: Upgrade Weights & Biases callback (#30135)" This reverts commit `4ab7a28216`.	2024-05-22 12:39:27 -04:00
Raushan Turganbay	9d054596e7	Generation: get special tokens from model config (#30899) * fix * let's do this way? * codestyle * update * add tests	2024-05-22 12:37:27 -04:00
hoshi-hiyouga	e5d174f12a	PaliGemma - fix processor with no input text (#30916) Update processing_paligemma.py	2024-05-22 12:37:15 -04:00
Arthur	04141855bd	legacy to init the slow tokenizer when converting from slow was wrong (#30972)	2024-05-22 12:37:07 -04:00
Arthur	6d2439a126	`tokenizer_class = "AutoTokenizer"` Llava Family (#30912) propagate changes to more models	2024-05-22 12:36:58 -04:00
ArthurZucker	4c6c45ba13	Release: v4.41.0	2024-05-17 11:11:44 -04:00