<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# LayoutLMv3

## Overview

The LayoutLMv3 model was proposed in [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.
LayoutLMv3 simplifies [LayoutLMv2](layoutlmv2) by using patch embeddings (as in [ViT](vit)) instead of a CNN backbone, and pre-trains the model on 3 objectives: masked language modeling (MLM), masked image modeling (MIM),
and word-patch alignment (WPA).

The abstract from the paper is the following:

*Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/layoutlmv3_architecture.png"
alt="drawing" width="600"/>

<small> LayoutLMv3 architecture. Taken from the <a href="https://arxiv.org/abs/2204.08387">original paper</a>. </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [chriskoo](https://huggingface.co/chriskoo), [tokec](https://huggingface.co/tokec), and [lre](https://huggingface.co/lre). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/layoutlmv3).

## Usage tips

- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
    - images need to be resized and normalized with channels in regular RGB format, whereas LayoutLMv2 normalizes images internally and expects the channels in BGR format;
    - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
  Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`], which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model, as shown in the sketch after this list.
- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
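
Below is a minimal sketch of preparing a document image for the model with [`LayoutLMv3Processor`]. The image path is a placeholder; by default, the image processor applies Tesseract OCR (which requires `pytesseract` to be installed) to obtain the words and normalized bounding boxes.

```python
from PIL import Image
from transformers import LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")

# "document.png" is a placeholder for a local document image
image = Image.open("document.png").convert("RGB")

# By default the image processor runs Tesseract OCR to extract words and boxes,
# so only the image needs to be passed
encoding = processor(image, return_tensors="pt")
print(encoding.keys())  # input_ids, attention_mask, bbox, pixel_values
```

If you already have your own words and (0-1000 normalized) bounding boxes, instantiate the image processor with `apply_ocr=False` and pass them explicitly, e.g. `processor(image, words, boxes=boxes, return_tensors="pt")`.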
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

<Tip>

LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2 resources you can adapt for LayoutLMv3 tasks. For these notebooks, take care to use [`LayoutLMv2Processor`] instead when preparing data for the model!

</Tip>

- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).

<PipelineTag pipeline="text-classification"/>

- [`LayoutLMv2ForSequenceClassification`] is supported by this [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb).
- [Text classification task guide](../tasks/sequence_classification)

<PipelineTag pipeline="token-classification"/>

- [`LayoutLMv3ForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3) and [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb).
- A [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Inference_with_LayoutLMv2ForTokenClassification.ipynb) for how to perform inference with [`LayoutLMv2ForTokenClassification`] and a [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/True_inference_with_LayoutLMv2ForTokenClassification_%2B_Gradio_demo.ipynb) for how to perform inference when no labels are available with [`LayoutLMv2ForTokenClassification`].
- A [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Fine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD_using_HuggingFace_Trainer.ipynb) for how to fine-tune [`LayoutLMv2ForTokenClassification`] with the 🤗 Trainer.
- [Token classification task guide](../tasks/token_classification)
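
Alongside the notebooks above, here is a minimal inference sketch with [`LayoutLMv3ForTokenClassification`]. The checkpoint used is the base model, whose token classification head is randomly initialized (in practice you would load a checkpoint fine-tuned on e.g. FUNSD), and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
# num_labels=7 matches the FUNSD label set used in the notebooks above;
# the classification head stays randomly initialized until fine-tuned
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)

image = Image.open("form.png").convert("RGB")  # placeholder path to a scanned form
encoding = processor(image, return_tensors="pt")  # OCR runs by default

with torch.no_grad():
    outputs = model(**encoding)

# one predicted label id per token (special tokens included)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
```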
<PipelineTag pipeline="question-answering"/>

- [`LayoutLMv2ForQuestionAnswering`] is supported by this [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/DocVQA/Fine_tuning_LayoutLMv2ForQuestionAnswering_on_DocVQA.ipynb).
- [Question answering task guide](../tasks/question_answering)

**Document question answering**

- [Document question answering task guide](../tasks/document_question_answering)
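
For document question answering, the sketch below pairs [`LayoutLMv3Processor`] with [`LayoutLMv3ForQuestionAnswering`]. The QA head of the base checkpoint is randomly initialized, so the extracted span is only meaningful with a checkpoint fine-tuned on e.g. DocVQA; the image path and question are placeholders.

```python
import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForQuestionAnswering

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")

image = Image.open("invoice.png").convert("RGB")  # placeholder document image
question = "What is the invoice number?"  # placeholder question

# the processor tokenizes the question and pairs it with the OCR'd words and boxes
encoding = processor(image, question, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# decode the highest-scoring answer span from the input tokens
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding["input_ids"][0, start : end + 1])
```

Note that for extractive QA the start/end indices should be decoded jointly (start ≤ end) in a real pipeline; the argmax shortcut above is just to keep the sketch short.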
## LayoutLMv3Config

[[autodoc]] LayoutLMv3Config

## LayoutLMv3FeatureExtractor

[[autodoc]] LayoutLMv3FeatureExtractor
    - __call__

## LayoutLMv3ImageProcessor

[[autodoc]] LayoutLMv3ImageProcessor
    - preprocess

## LayoutLMv3Tokenizer

[[autodoc]] LayoutLMv3Tokenizer
    - __call__
    - save_vocabulary

## LayoutLMv3TokenizerFast

[[autodoc]] LayoutLMv3TokenizerFast
    - __call__

## LayoutLMv3Processor

[[autodoc]] LayoutLMv3Processor
    - __call__

<frameworkcontent>
<pt>

## LayoutLMv3Model

[[autodoc]] LayoutLMv3Model
    - forward

## LayoutLMv3ForSequenceClassification

[[autodoc]] LayoutLMv3ForSequenceClassification
    - forward

## LayoutLMv3ForTokenClassification

[[autodoc]] LayoutLMv3ForTokenClassification
    - forward

## LayoutLMv3ForQuestionAnswering

[[autodoc]] LayoutLMv3ForQuestionAnswering
    - forward

</pt>
<tf>

## TFLayoutLMv3Model

[[autodoc]] TFLayoutLMv3Model
    - call

## TFLayoutLMv3ForSequenceClassification

[[autodoc]] TFLayoutLMv3ForSequenceClassification
    - call

## TFLayoutLMv3ForTokenClassification

[[autodoc]] TFLayoutLMv3ForTokenClassification
    - call

## TFLayoutLMv3ForQuestionAnswering

[[autodoc]] TFLayoutLMv3ForQuestionAnswering
    - call

</tf>
</frameworkcontent>