<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# I-BERT

## Overview

The I-BERT model was proposed in [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by
Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer. It's a quantized version of RoBERTa running
inference up to four times faster.

The abstract from the paper is the following:

*Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language
Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for
efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this,
previous work on quantizing Transformer based models use floating-point arithmetic during inference, which cannot
efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM
processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes
the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for
nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT
inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using
RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to
the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4 - 4.0x for
INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has
been open-sourced.*

This model was contributed by [kssteven](https://huggingface.co/kssteven). The original code can be found [here](https://github.com/kssteven418/I-BERT).
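
I-BERT has no dedicated tokenizer class and reuses RoBERTa's vocabulary, so `AutoTokenizer` and the pipelines resolve to the RoBERTa tokenizer. Below is a minimal quick-start sketch; the `kssteven/ibert-roberta-base` checkpoint name (the author's Hub namespace) is an assumption, so substitute whichever I-BERT checkpoint you actually use.

```python
from transformers import pipeline

# Checkpoint name is an assumption; swap in the I-BERT checkpoint you use.
fill_mask = pipeline("fill-mask", model="kssteven/ibert-roberta-base")

# The tokenizer is RoBERTa's, so the mask token is written as "<mask>".
predictions = fill_mask("The goal of quantization is to make models <mask>.")
print(predictions[0]["token_str"], predictions[0]["score"])
```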
## Resources

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
- [Multiple choice task guide](../tasks/multiple_choice)
## IBertConfig

[[autodoc]] IBertConfig
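
As an illustrative sketch (not taken from the generated reference), the configuration exposes a `quant_mode` flag that controls whether the model is built with the integer-only operators described in the paper or with the regular full-precision layers:

```python
from transformers import IBertConfig, IBertModel

# quant_mode=True builds the model with the integer-only (INT8) operators;
# quant_mode=False keeps the standard full-precision RoBERTa-style layers.
configuration = IBertConfig(quant_mode=True)

# Instantiate a randomly initialized model from the configuration.
model = IBertModel(configuration)

# The configuration can be read back from the model.
print(model.config.quant_mode)
```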
## IBertModel

[[autodoc]] IBertModel
    - forward
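
A short, hedged example of running the bare encoder to obtain contextual embeddings; the checkpoint name is assumed as above:

```python
import torch
from transformers import AutoTokenizer, IBertModel

checkpoint = "kssteven/ibert-roberta-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = IBertModel.from_pretrained(checkpoint)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per input token.
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```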
## IBertForMaskedLM

[[autodoc]] IBertForMaskedLM
    - forward
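
A hedged sketch of masked-token prediction with the pretrained language modeling head (checkpoint name assumed as above):

```python
import torch
from transformers import AutoTokenizer, IBertForMaskedLM

checkpoint = "kssteven/ibert-roberta-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = IBertForMaskedLM.from_pretrained(checkpoint)

inputs = tokenizer("The capital of France is <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode its most likely replacement token.
mask_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```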
## IBertForSequenceClassification

[[autodoc]] IBertForSequenceClassification
    - forward
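
A hedged sketch of the sequence classification head; this head is not part of the released pretrained weights, so it is randomly initialized and only produces meaningful predictions after fine-tuning:

```python
import torch
from transformers import AutoTokenizer, IBertForSequenceClassification

checkpoint = "kssteven/ibert-roberta-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = IBertForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])  # e.g. 1 = positive
outputs = model(**inputs, labels=labels)

print(outputs.loss, outputs.logits.shape)  # scalar loss, (batch_size, num_labels)
```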
## IBertForMultipleChoice

[[autodoc]] IBertForMultipleChoice
    - forward
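
A hedged sketch of how inputs are shaped for the multiple choice head: the prompt is paired with every candidate and the tensors are reshaped to (batch_size, num_choices, sequence_length). The head itself is randomly initialized until fine-tuned:

```python
import torch
from transformers import AutoTokenizer, IBertForMultipleChoice

checkpoint = "kssteven/ibert-roberta-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = IBertForMultipleChoice.from_pretrained(checkpoint)

prompt = "The quantized model runs on"
choices = ["integer-only hardware.", "a floating-point unit only."]

# Encode the prompt against each choice, then add a batch dimension so the
# tensors have shape (batch_size, num_choices, sequence_length).
encoding = tokenizer([prompt, prompt], choices, return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}

outputs = model(**inputs, labels=torch.tensor([0]))
print(outputs.logits.shape)  # (batch_size, num_choices)
```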
## IBertForTokenClassification

[[autodoc]] IBertForTokenClassification
    - forward
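
A hedged sketch of the token classification head, which predicts one label per input token and is randomly initialized until fine-tuned (e.g. for NER):

```python
import torch
from transformers import AutoTokenizer, IBertForTokenClassification

checkpoint = "kssteven/ibert-roberta-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = IBertForTokenClassification.from_pretrained(checkpoint, num_labels=5)

inputs = tokenizer("HuggingFace is based in New York City", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label id per token (special tokens included).
predictions = logits.argmax(dim=-1)
print(predictions.shape)  # (batch_size, sequence_length)
```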
## IBertForQuestionAnswering

[[autodoc]] IBertForQuestionAnswering
    - forward
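
A hedged sketch of extractive question answering with the span classification head, which is randomly initialized and only returns sensible spans after fine-tuning on a QA dataset:

```python
import torch
from transformers import AutoTokenizer, IBertForQuestionAnswering

checkpoint = "kssteven/ibert-roberta-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = IBertForQuestionAnswering.from_pretrained(checkpoint)

question = "What does I-BERT quantize?"
context = "I-BERT quantizes the entire inference with integer-only arithmetic."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end of the answer span within the context.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs.input_ids[0, start : end + 1]))
```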