<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# BERT

<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=bert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-bert-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/bert-base-uncased">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>

## Overview

The BERT model was proposed in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. It's a
bidirectional transformer pretrained with a combination of the masked language modeling (MLM) and next sentence
prediction (NSP) objectives on a large corpus comprising the Toronto Book Corpus and Wikipedia.

The abstract from the paper is the following:

*We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations
from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional
representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result,
the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models
for a wide range of tasks, such as question answering and language inference, without substantial task-specific
architecture modifications.*

*BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural
language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI
accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute
improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).*

This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/bert).

## Usage tips

- BERT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather than
  the left.
- BERT was trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It is
  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation (see the example
  after this list).
- BERT corrupts the inputs by random masking: during pretraining, a given percentage of tokens (usually 15%) is masked by

  * a special mask token with probability 0.8
  * a random token different from the one masked with probability 0.1
  * the same, unchanged token with probability 0.1

- The model must predict the original tokens at the masked positions, but it also has a second objective: the inputs are two sentences A and B (with a separation token in between). With probability 50% the sentences are consecutive in the corpus; in the remaining 50% they are unrelated. The model has to predict whether the sentences are consecutive or not.

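As a quick, concrete illustration of the MLM objective described above, here is a minimal sketch using the `fill-mask` pipeline with the `bert-base-uncased` checkpoint (the checkpoint and example sentence are illustrative choices, not requirements):

```python
from transformers import pipeline

# Ask BERT's MLM head to fill in the [MASK] token and print the top candidates.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

When pretraining or fine-tuning with the MLM objective, the 80%/10%/10% masking scheme above is the one applied by `DataCollatorForLanguageModeling`.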

### Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of `torch.nn.functional`. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
[official documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html)
or the [GPU Inference](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention)
page for more information.

SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set
`attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA to be used.

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased", torch_dtype=torch.float16, attn_implementation="sdpa")
...
```

For the best speedups, we recommend loading the model in half-precision (e.g. `torch.float16` or `torch.bfloat16`).

On a local benchmark (A100-80GB, CPUx12, RAM 96.6GB, PyTorch 2.2.0, OS Ubuntu 22.04) with `float16`, we saw the
following speedups during training and inference.

#### Training

|batch_size|seq_len|Time per batch (eager - s)|Time per batch (SDPA - s)|Speedup (%)|Eager peak mem (MB)|SDPA peak mem (MB)|Mem saving (%)|
|----------|-------|--------------------------|-------------------------|-----------|-------------------|------------------|--------------|
|4         |256    |0.023                     |0.017                    |35.472     |939.213            |764.834           |22.800        |
|4         |512    |0.023                     |0.018                    |23.687     |1970.447           |1227.162          |60.569        |
|8         |256    |0.023                     |0.018                    |23.491     |1594.295           |1226.114          |30.028        |
|8         |512    |0.035                     |0.025                    |43.058     |3629.401           |2134.262          |70.054        |
|16        |256    |0.030                     |0.024                    |25.583     |2874.426           |2134.262          |34.680        |
|16        |512    |0.064                     |0.044                    |46.223     |6964.659           |3961.013          |75.830        |

#### Inference

|batch_size|seq_len|Per token latency eager (ms)|Per token latency SDPA (ms)|Speedup (%)|Mem eager (MB)|Mem SDPA (MB)|Mem saved (%)|
|----------|-------|----------------------------|---------------------------|-----------|--------------|-------------|-------------|
|1         |128    |5.736                       |4.987                      |15.022     |282.661       |282.924      |-0.093       |
|1         |256    |5.689                       |4.945                      |15.055     |298.686       |298.948      |-0.088       |
|2         |128    |6.154                       |4.982                      |23.521     |314.523       |314.785      |-0.083       |
|2         |256    |6.201                       |4.949                      |25.303     |347.546       |347.033      |0.148        |
|4         |128    |6.049                       |4.987                      |21.305     |378.895       |379.301      |-0.107       |
|4         |256    |6.285                       |5.364                      |17.166     |443.209       |444.382      |-0.264       |

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

<PipelineTag pipeline="text-classification"/>

- A blog post on [BERT Text Classification in a different language](https://www.philschmid.de/bert-text-classification-in-a-different-language).
- A notebook for [Finetuning BERT (and friends) for multi-label text classification](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb).
- A notebook on how to [Finetune BERT for multi-label classification using PyTorch](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb). 🌎
- A notebook on how to [warm-start an EncoderDecoder model with BERT for summarization](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb).
- [`BertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb).
- [`TFBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb).
- [`FlaxBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb).
- [Text classification task guide](../tasks/sequence_classification)

<PipelineTag pipeline="token-classification"/>
|
|
|
|
- A blog post on how to use [Hugging Face Transformers with Keras: Fine-tune a non-English BERT for Named Entity Recognition](https://www.philschmid.de/huggingface-transformers-keras-tf).
|
|
- A notebook for [Finetuning BERT for named-entity recognition](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb) using only the first wordpiece of each word in the word label during tokenization. To propagate the label of the word to all wordpieces, see this [version](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT.ipynb) of the notebook instead.
|
|
- [`BertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb).
|
|
- [`TFBertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb).
|
|
- [`FlaxBertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification).
|
|
- [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course.
|
|
- [Token classification task guide](../tasks/token_classification)
|
|
|
|
<PipelineTag pipeline="fill-mask"/>
|
|
|
|
- [`BertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
|
|
- [`TFBertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb).
|
|
- [`FlaxBertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb).
|
|
- [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course.
|
|
- [Masked language modeling task guide](../tasks/masked_language_modeling)
|
|
|
|
<PipelineTag pipeline="question-answering"/>
|
|
|
|
- [`BertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).
|
|
- [`TFBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb).
|
|
- [`FlaxBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering).
|
|
- [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course.
|
|
- [Question answering task guide](../tasks/question_answering)
|
|
|
|
**Multiple choice**
- [`BertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb).
- [`TFBertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb).
- [Multiple choice task guide](../tasks/multiple_choice)

⚡️ **Inference**
- A blog post on how to [Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia](https://huggingface.co/blog/bert-inferentia-sagemaker).
- A blog post on how to [Accelerate BERT inference with DeepSpeed-Inference on GPUs](https://www.philschmid.de/bert-deepspeed-inference).

⚙️ **Pretraining**
- A blog post on [Pre-Training BERT with Hugging Face Transformers and Habana Gaudi](https://www.philschmid.de/pre-training-bert-habana).

🚀 **Deploy**
- A blog post on how to [Convert Transformers to ONNX with Hugging Face Optimum](https://www.philschmid.de/convert-transformers-to-onnx).
- A blog post on how to [Setup Deep Learning environment for Hugging Face Transformers with Habana Gaudi on AWS](https://www.philschmid.de/getting-started-habana-gaudi#conclusion).
- A blog post on [Autoscaling BERT with Hugging Face Transformers, Amazon SageMaker and Terraform module](https://www.philschmid.de/terraform-huggingface-amazon-sagemaker-advanced).
- A blog post on [Serverless BERT with HuggingFace, AWS Lambda, and Docker](https://www.philschmid.de/serverless-bert-with-huggingface-aws-lambda-docker).
- A blog post on [Hugging Face Transformers BERT fine-tuning using Amazon SageMaker and Training Compiler](https://www.philschmid.de/huggingface-amazon-sagemaker-training-compiler).
- A blog post on [Task-specific knowledge distillation for BERT using Transformers & Amazon SageMaker](https://www.philschmid.de/knowledge-distillation-bert-transformers).

## BertConfig

[[autodoc]] BertConfig
    - all

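As a brief, illustrative sketch of the usual configuration workflow (instantiating a config with default values yields an architecture similar to `bert-base-uncased`):

```python
from transformers import BertConfig, BertModel

# Build a randomly initialized BERT model from a default configuration.
configuration = BertConfig()
model = BertModel(configuration)

# The configuration can be read back from the model.
configuration = model.config
```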

## BertTokenizer

[[autodoc]] BertTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

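As a quick sketch of how the tokenizer assembles the sentence-pair inputs described in the usage tips (special tokens plus `token_type_ids` distinguishing sentence A from sentence B), assuming the `bert-base-uncased` checkpoint:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encoding a pair adds [CLS] and [SEP] and marks sentence A tokens with 0 and sentence B tokens with 1.
encoding = tokenizer("The man went to the store.", "He bought a gallon of milk.")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["token_type_ids"])
```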

<frameworkcontent>
<pt>

## BertTokenizerFast

[[autodoc]] BertTokenizerFast

</pt>
<tf>

## TFBertTokenizer

[[autodoc]] TFBertTokenizer

</tf>
</frameworkcontent>

## Bert specific outputs

[[autodoc]] models.bert.modeling_bert.BertForPreTrainingOutput

[[autodoc]] models.bert.modeling_tf_bert.TFBertForPreTrainingOutput

[[autodoc]] models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput

<frameworkcontent>
<pt>

## BertModel

[[autodoc]] BertModel
    - forward

## BertForPreTraining

[[autodoc]] BertForPreTraining
    - forward

## BertLMHeadModel

[[autodoc]] BertLMHeadModel
    - forward

## BertForMaskedLM

[[autodoc]] BertForMaskedLM
    - forward

## BertForNextSentencePrediction

[[autodoc]] BertForNextSentencePrediction
    - forward

## BertForSequenceClassification

[[autodoc]] BertForSequenceClassification
    - forward

## BertForMultipleChoice

[[autodoc]] BertForMultipleChoice
    - forward

## BertForTokenClassification

[[autodoc]] BertForTokenClassification
    - forward

## BertForQuestionAnswering

[[autodoc]] BertForQuestionAnswering
    - forward

</pt>
<tf>

## TFBertModel

[[autodoc]] TFBertModel
    - call

## TFBertForPreTraining

[[autodoc]] TFBertForPreTraining
    - call

## TFBertLMHeadModel

[[autodoc]] TFBertLMHeadModel
    - call

## TFBertForMaskedLM

[[autodoc]] TFBertForMaskedLM
    - call

## TFBertForNextSentencePrediction

[[autodoc]] TFBertForNextSentencePrediction
    - call

## TFBertForSequenceClassification

[[autodoc]] TFBertForSequenceClassification
    - call

## TFBertForMultipleChoice

[[autodoc]] TFBertForMultipleChoice
    - call

## TFBertForTokenClassification

[[autodoc]] TFBertForTokenClassification
    - call

## TFBertForQuestionAnswering

[[autodoc]] TFBertForQuestionAnswering
    - call

</tf>
<jax>

## FlaxBertModel

[[autodoc]] FlaxBertModel
    - __call__

## FlaxBertForPreTraining

[[autodoc]] FlaxBertForPreTraining
    - __call__

## FlaxBertForCausalLM

[[autodoc]] FlaxBertForCausalLM
    - __call__

## FlaxBertForMaskedLM

[[autodoc]] FlaxBertForMaskedLM
    - __call__

## FlaxBertForNextSentencePrediction

[[autodoc]] FlaxBertForNextSentencePrediction
    - __call__

## FlaxBertForSequenceClassification

[[autodoc]] FlaxBertForSequenceClassification
    - __call__

## FlaxBertForMultipleChoice

[[autodoc]] FlaxBertForMultipleChoice
    - __call__

## FlaxBertForTokenClassification

[[autodoc]] FlaxBertForTokenClassification
    - __call__

## FlaxBertForQuestionAnswering

[[autodoc]] FlaxBertForQuestionAnswering
    - __call__

</jax>
</frameworkcontent>