TrOCR
Overview
The TrOCR model was proposed in TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform optical character recognition (OCR).
The abstract from the paper is the following:
Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
TrOCR architecture. Taken from the original paper.
Please refer to the [VisionEncoderDecoder] class on how to use this model.
This model was contributed by nielsr. The original code can be found here.
Usage tips
- The quickest way to get started with TrOCR is by checking the tutorial notebooks, which show how to use the model at inference time as well as fine-tuning on custom data.
- TrOCR is pre-trained in 2 stages before being fine-tuned on downstream datasets. It achieves state-of-the-art results on both printed (e.g. the SROIE dataset) and handwritten (e.g. the IAM Handwriting dataset) text recognition tasks. For more information, see the official models.
- TrOCR is always used within the VisionEncoderDecoder framework; a minimal sketch of this pairing follows this list.
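To make the last tip concrete, the framework can be instantiated explicitly from its two halves. Below is a minimal sketch that pairs a randomly initialized ViT encoder with a randomly initialized TrOCR decoder; it is for illustration only, since in practice a trained checkpoint would be loaded with from_pretrained:

>>> from transformers import TrOCRConfig, TrOCRForCausalLM, ViTConfig, ViTModel, VisionEncoderDecoderModel
>>> # randomly initialized image Transformer encoder
>>> encoder = ViTModel(ViTConfig())
>>> # randomly initialized autoregressive text Transformer decoder
>>> decoder = TrOCRForCausalLM(TrOCRConfig())
>>> # pair the two inside the VisionEncoderDecoder framework
>>> model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder)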
Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with TrOCR. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
- A blog post on Accelerating Document AI with TrOCR.
- A blog post on how to use TrOCR for Document AI.
- A notebook on how to fine-tune TrOCR on the IAM Handwriting Database using Seq2SeqTrainer (a minimal training sketch follows this list).
- A notebook on inference with TrOCR and Gradio demo.
- A notebook on fine-tuning TrOCR on the IAM Handwriting Database using native PyTorch.
- A notebook on evaluating TrOCR on the IAM test set.
- Causal language modeling task guide.
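As a rough companion to the fine-tuning notebooks listed above, the sketch below outlines what Seq2SeqTrainer-based fine-tuning looks like. It is an assumption-laden outline rather than a recipe: train_dataset is a hypothetical dataset whose items contain "pixel_values" and "labels", and the hyperparameters are placeholders.

>>> from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, TrOCRProcessor, VisionEncoderDecoderModel
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
>>> # the decoder needs explicit start and padding token ids for training
>>> model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
>>> model.config.pad_token_id = processor.tokenizer.pad_token_id
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="./trocr-finetuned",  # placeholder
...     per_device_train_batch_size=8,  # placeholder
...     num_train_epochs=3,  # placeholder
...     predict_with_generate=True,
... )
>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=train_dataset,  # hypothetical dataset yielding pixel_values + labels
...     tokenizer=processor.image_processor,
... )
>>> trainer.train()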
⚡️ Inference
- An interactive demo of TrOCR handwritten character recognition.
Inference
TrOCR's [VisionEncoderDecoder] model accepts images as input and makes use of [~generation.GenerationMixin.generate] to autoregressively generate text given the input image. The [ViTImageProcessor/DeiTImageProcessor] class is responsible for preprocessing the input image and [RobertaTokenizer/XLMRobertaTokenizer] decodes the generated target tokens to the target string. The [TrOCRProcessor] wraps [ViTImageProcessor/DeiTImageProcessor] and [RobertaTokenizer/XLMRobertaTokenizer] into a single instance to both extract the input features and decode the predicted token ids.
- Step-by-step Optical Character Recognition (OCR)
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> import requests
>>> from PIL import Image
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
>>> # load image from the IAM dataset
>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> # preprocess the image into the pixel values expected by the encoder
>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> # autoregressively generate token ids, then decode them into a string
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
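Note that [TrOCRProcessor] merely wraps the image processor and the tokenizer, so the preprocessing and decoding steps above can equivalently be written against the wrapped components directly. A small sketch, assuming a transformers version where the wrapped attributes are exposed as image_processor and tokenizer:

>>> # equivalent explicit calls through the wrapped components
>>> pixel_values = processor.image_processor(image, return_tensors="pt").pixel_values
>>> generated_text = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]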
See the model hub to look for TrOCR checkpoints.
TrOCRConfig
autodoc TrOCRConfig
TrOCRProcessor
autodoc TrOCRProcessor - __call__ - from_pretrained - save_pretrained - batch_decode - decode
TrOCRForCausalLM
autodoc TrOCRForCausalLM - forward