Add ViLT (#14895)

* First commit
* Add conversion script
* Make conversion script work for base model
* More improvements
* Update conversion script, works for vqa
* Add indexing argument to meshgrid
* Make conversion script work for ViltForPreTraining
* Add ViltForPreTraining to docs
* Fix device issue
* Add processor
* Add MinMaxResize to feature extractor
* Implement call method of ViltProcessor
* Fix tests
* Add integration test
* Add loss calculation for VQA
* Improve tests
* Improve some more tests
* Debug tests
* Small improvements
* Add support for attention_mask
* Remove mask_it
* Add pixel_mask
* Add tests for ViltFeatureExtractor
* Improve tests
* Add ViltForNaturalLanguageVisualReasoning
* Add ViltForNaturalLanguageVisualReasoning to conversion script
* Minor fixes
* Add support for image_embeds, update docstrings to markdown
* Update docs to markdown
* Improve conversion script
* Rename ViltForPreTraining to ViltForMaskedLM
* Improve conversion script
* Convert docstrings to markdown
* Fix code example of retrieval model
* Properly convert masked language model
* Add integration test for nlvr
* Fix code quality
* Apply suggestions from code review
* Add copied from statements
* Fix pretrained_config_archive_map
* Fix docs
* Add model to README
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Apply more suggestions from code review
* Make code more readable
* Add ViltForNaturalLanguageVisualReasoning to the tests
* Rename ViltForVisualQuestionAnswering to ViltForQuestionAnswering
* Replace pixel_values_2 by single tensor
* Add hidden_states and attentions
* Fix one more test
* Fix all tests
* Update year
* Fix rebase issues
* Fix another rebase issue
* Remove ViltForPreTraining from auto mapping
* Rename ViltForImageRetrievalTextRetrieval to ViltForImageAndTextRetrieval
* Make it possible to use BertTokenizerFast in the processor
* Use BertTokenizerFast by default
* Rename ViltForNaturalLanguageVisualReasoning, define custom model output

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
parent
691878ee2f
commit
ac227093e4
@@ -311,6 +311,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
@@ -289,6 +289,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
@@ -313,6 +313,7 @@ conda install -c huggingface transformers
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (来自 Microsoft) 伴随论文 [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) 由 Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei 发布。
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (来自 Microsoft Research) 伴随论文 [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 由 Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang 发布。
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (来自 Microsoft Research) 伴随论文 [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) 由 Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu 发布。
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (来自 NAVER AI Lab/Kakao Enterprise/Kakao Brain) 伴随论文 [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) 由 Wonjae Kim, Bokyung Son, Ildoo Kim 发布。
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (来自 Google AI) 伴随论文 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 由 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 发布。
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (来自 UCLA NLP) 伴随论文 [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) 由 Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang 发布。
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (来自 Meta AI) 伴随论文 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 由 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 发布。
@@ -325,6 +325,7 @@ conda install -c huggingface transformers
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft) released with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](https://huggingface.co/docs/transformers/master/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
1. **[ViTMAE](https://huggingface.co/docs/transformers/master/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
@@ -282,6 +282,8 @@
    title: UniSpeech
  - local: model_doc/unispeech-sat
    title: UniSpeech-SAT
  - local: model_doc/vilt
    title: ViLT
  - local: model_doc/vision-encoder-decoder
    title: Vision Encoder Decoder Models
  - local: model_doc/vision-text-dual-encoder
@@ -170,6 +170,7 @@ conversion utilities for the following models.
1. **[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[ViLT](model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
@@ -266,6 +267,7 @@ Flax), PyTorch, and/or TensorFlow.
| TrOCR | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeechSat | ❌ | ❌ | ✅ | ❌ | ❌ |
| ViLT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Vision Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ |
| VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | ✅ |
| VisualBert | ❌ | ❌ | ✅ | ❌ | ❌ |
@@ -0,0 +1,87 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ViLT

## Overview

The ViLT model was proposed in [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334)
by Wonjae Kim, Bokyung Son, Ildoo Kim. ViLT incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design
for Vision-and-Language Pre-training (VLP).

The abstract from the paper is the following:

*Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks.
Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision
(e.g., object detection) and the convolutional architecture (e.g., ResNet). Although disregarded in the literature, we
find it problematic in terms of both (1) efficiency/speed, that simply extracting input features requires much more
computation than the multimodal interaction steps; and (2) expressive power, as it is upper bounded to the expressive
power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model,
Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically
simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to tens of
times faster than previous VLP models, yet with competitive or better downstream task performance.*

Tips:

- ViLT is a model that takes both `pixel_values` and `input_ids` as input. One can use [`ViltProcessor`] to prepare data for the model.
  This processor wraps a feature extractor (for the image modality) and a tokenizer (for the language modality) into one.
- ViLT is trained with images of various sizes: the authors resize the shorter edge of input images to 384 and limit the longer edge to
  under 640 while preserving the aspect ratio. To make batching of images possible, the authors use a `pixel_mask` that indicates
  which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you.
- The design of ViLT is very similar to that of a standard Vision Transformer (ViT). The only difference is that the model includes
  additional embedding layers for the language modality.
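The padding-plus-mask batching scheme described in the tips can be sketched as follows. This is a minimal illustration with a hypothetical `pad_and_mask` helper, not the actual `ViltProcessor` internals:

```python
import torch


def pad_and_mask(images):
    """Zero-pad a list of (C, H, W) tensors to a common size and build a pixel mask."""
    channels = images[0].shape[0]
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    pixel_values = torch.zeros(len(images), channels, max_h, max_w)
    pixel_mask = torch.zeros(len(images), max_h, max_w, dtype=torch.long)
    for i, img in enumerate(images):
        _, h, w = img.shape
        pixel_values[i, :, :h, :w] = img  # copy the real pixels into the top-left corner
        pixel_mask[i, :h, :w] = 1  # 1 = real pixel, 0 = padding
    return pixel_values, pixel_mask
```

The mask lets the model ignore padded regions when computing attention over image patches, so images of different sizes can share one batch.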
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vilt_architecture.jpg"
alt="drawing" width="600"/>

<small> ViLT architecture. Taken from the <a href="https://arxiv.org/abs/2102.03334">original paper</a>. </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/dandelin/ViLT).
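As a quick smoke test, the base model can be run end to end with a small, randomly initialized configuration. The sizes below are illustrative only, not a pretrained checkpoint:

```python
import torch
from transformers import ViltConfig, ViltModel

# small, randomly initialized model (illustrative hyperparameters)
config = ViltConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    image_size=64,
    patch_size=32,
)
model = ViltModel(config)
model.eval()

input_ids = torch.tensor([[101, 2054, 102]])  # a tiny dummy token sequence
pixel_values = torch.randn(1, 3, 64, 64)  # one 64x64 RGB image

with torch.no_grad():
    outputs = model(input_ids=input_ids, pixel_values=pixel_values)
print(outputs.last_hidden_state.shape)  # (batch, text + image tokens, hidden_size)
```

For real predictions, load pretrained weights (e.g. via `from_pretrained`) and prepare inputs with [`ViltProcessor`] instead of raw tensors.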
## ViltConfig

[[autodoc]] ViltConfig

## ViltFeatureExtractor

[[autodoc]] ViltFeatureExtractor
    - __call__
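The shorter-edge/longer-edge resizing rule mentioned in the tips can be written out explicitly. This is a hypothetical `min_max_resize` helper; the actual feature extractor may differ in rounding details, and snapping to multiples of the 32-pixel patch size is an assumption here:

```python
def min_max_resize(height, width, shorter=384, longer=640, size_divisor=32):
    # Scale so the shorter edge becomes `shorter`, then rescale if the
    # longer edge would exceed `longer`; finally snap both edges down to
    # a multiple of the patch size so the image tiles evenly into patches.
    scale = shorter / min(height, width)
    if max(height, width) * scale > longer:
        scale = longer / max(height, width)
    new_h = int(height * scale) // size_divisor * size_divisor
    new_w = int(width * scale) // size_divisor * size_divisor
    return new_h, new_w


print(min_max_resize(480, 640))  # (384, 512): shorter edge hits 384
print(min_max_resize(400, 1000))  # (256, 640): longer edge capped at 640
```

Either the shorter-edge target or the longer-edge cap binds, never both, so the aspect ratio is preserved up to the final snapping step.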
## ViltProcessor

[[autodoc]] ViltProcessor
    - __call__

## ViltModel

[[autodoc]] ViltModel
    - forward

## ViltForMaskedLM

[[autodoc]] ViltForMaskedLM
    - forward

## ViltForQuestionAnswering

[[autodoc]] ViltForQuestionAnswering
    - forward

## ViltForImagesAndTextClassification

[[autodoc]] ViltForImagesAndTextClassification
    - forward

## ViltForImageAndTextRetrieval

[[autodoc]] ViltForImageAndTextRetrieval
    - forward
@@ -308,6 +308,7 @@ _import_structure = {
        "UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "UniSpeechSatConfig",
    ],
    "models.vilt": ["VILT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViltConfig", "ViltFeatureExtractor", "ViltProcessor"],
    "models.vision_encoder_decoder": ["VisionEncoderDecoderConfig"],
    "models.vision_text_dual_encoder": ["VisionTextDualEncoderConfig", "VisionTextDualEncoderProcessor"],
    "models.visual_bert": ["VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VisualBertConfig"],
@@ -514,6 +515,8 @@ if is_vision_available():
    _import_structure["models.layoutxlm"].append("LayoutXLMProcessor")
    _import_structure["models.perceiver"].append("PerceiverFeatureExtractor")
    _import_structure["models.segformer"].append("SegformerFeatureExtractor")
    _import_structure["models.vilt"].append("ViltFeatureExtractor")
    _import_structure["models.vilt"].append("ViltProcessor")
    _import_structure["models.vit"].append("ViTFeatureExtractor")
else:
    from .utils import dummy_vision_objects
@@ -629,7 +632,6 @@ if is_torch_available():
    _import_structure["modeling_utils"] = ["Conv1D", "PreTrainedModel", "apply_chunking_to_forward", "prune_layer"]

    # PyTorch models structure

    _import_structure["models.albert"].extend(
        [
            "ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -1382,6 +1384,18 @@ if is_torch_available():
            "UniSpeechSatPreTrainedModel",
        ]
    )
    _import_structure["models.vilt"].extend(
        [
            "VILT_PRETRAINED_MODEL_ARCHIVE_LIST",
            "ViltForImageAndTextRetrieval",
            "ViltForImagesAndTextClassification",
            "ViltForMaskedLM",
            "ViltForQuestionAnswering",
            "ViltLayer",
            "ViltModel",
            "ViltPreTrainedModel",
        ]
    )
    _import_structure["models.vision_encoder_decoder"].extend(["VisionEncoderDecoderModel"])
    _import_structure["models.vision_text_dual_encoder"].extend(["VisionTextDualEncoderModel"])
    _import_structure["models.visual_bert"].extend(
@@ -2409,6 +2423,7 @@ if TYPE_CHECKING:
    from .models.trocr import TROCR_PRETRAINED_CONFIG_ARCHIVE_MAP, TrOCRConfig, TrOCRProcessor
    from .models.unispeech import UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechConfig
    from .models.unispeech_sat import UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechSatConfig
    from .models.vilt import VILT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViltConfig, ViltFeatureExtractor, ViltProcessor
    from .models.vision_encoder_decoder import VisionEncoderDecoderConfig
    from .models.vision_text_dual_encoder import VisionTextDualEncoderConfig, VisionTextDualEncoderProcessor
    from .models.visual_bert import VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, VisualBertConfig
@@ -2585,6 +2600,7 @@ if TYPE_CHECKING:
    from .models.layoutxlm import LayoutXLMProcessor
    from .models.perceiver import PerceiverFeatureExtractor
    from .models.segformer import SegformerFeatureExtractor
    from .models.vilt import ViltFeatureExtractor, ViltProcessor
    from .models.vit import ViTFeatureExtractor
else:
    from .utils.dummy_vision_objects import *
@@ -3302,6 +3318,16 @@ if TYPE_CHECKING:
        UniSpeechSatModel,
        UniSpeechSatPreTrainedModel,
    )
    from .models.vilt import (
        VILT_PRETRAINED_MODEL_ARCHIVE_LIST,
        ViltForImageAndTextRetrieval,
        ViltForImagesAndTextClassification,
        ViltForMaskedLM,
        ViltForQuestionAnswering,
        ViltLayer,
        ViltModel,
        ViltPreTrainedModel,
    )
    from .models.vision_encoder_decoder import VisionEncoderDecoderModel
    from .models.vision_text_dual_encoder import VisionTextDualEncoderModel
    from .models.visual_bert import (
@@ -104,6 +104,7 @@ from . import (
    trocr,
    unispeech,
    unispeech_sat,
    vilt,
    vision_encoder_decoder,
    vision_text_dual_encoder,
    visual_bert,
@@ -30,6 +30,7 @@ logger = logging.get_logger(__name__)
CONFIG_MAPPING_NAMES = OrderedDict(
    [
        # Add configs here
        ("vilt", "ViltConfig"),
        ("vit_mae", "ViTMAEConfig"),
        ("realm", "RealmConfig"),
        ("nystromformer", "NystromformerConfig"),
@@ -119,6 +120,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
    [
        # Add archive maps here
        ("vilt", "VILT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("realm", "REALM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("nystromformer", "NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -196,6 +198,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
MODEL_NAMES_MAPPING = OrderedDict(
    [
        # Add full (and cased) model names here
        ("vilt", "ViLT"),
        ("vit_mae", "ViTMAE"),
        ("realm", "Realm"),
        ("nystromformer", "Nystromformer"),
@@ -28,6 +28,7 @@ logger = logging.get_logger(__name__)
MODEL_MAPPING_NAMES = OrderedDict(
    [
        # Base model mapping
        ("vilt", "ViltModel"),
        ("vit_mae", "ViTMAEModel"),
        ("nystromformer", "NystromformerModel"),
        ("imagegpt", "ImageGPTModel"),
@@ -297,12 +297,6 @@ class DeiTLayer(nn.Module):

        # in DeiT, layernorm is also applied after self-attention
        layer_output = self.layernorm_after(hidden_states)

        # TODO feedforward chunking not working for now
        # layer_output = apply_chunking_to_forward(
        #     self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, layer_output
        # )

        layer_output = self.intermediate(layer_output)

        # second residual connection is done here
@@ -312,11 +306,6 @@ class DeiTLayer(nn.Module):

        return outputs

    def feed_forward_chunk(self, attention_output):
        intermediate_output = self.intermediate(attention_output)
        layer_output = self.output(intermediate_output)
        return layer_output


# Copied from transformers.models.vit.modeling_vit.ViTEncoder with ViT->DeiT
class DeiTEncoder(nn.Module):
@@ -0,0 +1,68 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

# rely on isort to merge the imports
from ...file_utils import _LazyModule, is_torch_available, is_vision_available


_import_structure = {
    "configuration_vilt": ["VILT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViltConfig"],
}

if is_vision_available():
    _import_structure["feature_extraction_vilt"] = ["ViltFeatureExtractor"]
    _import_structure["processing_vilt"] = ["ViltProcessor"]

if is_torch_available():
    _import_structure["modeling_vilt"] = [
        "VILT_PRETRAINED_MODEL_ARCHIVE_LIST",
        "ViltForImageAndTextRetrieval",
        "ViltForImagesAndTextClassification",
        "ViltForMaskedLM",
        "ViltForQuestionAnswering",
        "ViltLayer",
        "ViltModel",
        "ViltPreTrainedModel",
    ]


if TYPE_CHECKING:
    from .configuration_vilt import VILT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViltConfig

    if is_vision_available():
        from .feature_extraction_vilt import ViltFeatureExtractor
        from .processing_vilt import ViltProcessor

    if is_torch_available():
        from .modeling_vilt import (
            VILT_PRETRAINED_MODEL_ARCHIVE_LIST,
            ViltForImageAndTextRetrieval,
            ViltForImagesAndTextClassification,
            ViltForMaskedLM,
            ViltForQuestionAnswering,
            ViltLayer,
            ViltModel,
            ViltPreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
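The `_LazyModule` registration above defers heavy submodule imports until an attribute is first accessed. A standalone sketch of the idea (not the actual `transformers._LazyModule` implementation, which also handles `__file__`, `dir()`, and relative imports):

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Resolve attributes to their defining submodule on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # map each exported attribute to the module that defines it
        self._attr_to_module = {
            attr: mod for mod, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        if attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._attr_to_module[attr])
        return getattr(module, attr)


# demo with stdlib modules standing in for submodules
demo = LazyModule("demo", {"math": ["sqrt"], "json": ["dumps"]})
print(demo.sqrt(9))  # math is only imported here, on first use
```

Assigning such an object to `sys.modules[__name__]`, as the real file does, makes `from transformers.models.vilt import ViltModel` trigger the torch-dependent import only when needed.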
@ -0,0 +1,148 @@
|
|||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" VilT model configuration"""
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
VILT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
# TODO
|
||||
}
|
||||
|
||||
|
||||
class ViltConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`ViLTModel`]. It is used to instantiate an ViLT
|
||||
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
|
||||
defaults will yield a similar configuration to that of the ViLT
|
||||
[google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
Args:
|
||||
vocab_size (`int`, *optional*, defaults to 30522):
|
||||
Vocabulary size of the text part of the model. Defines the number of different tokens that can be
|
||||
represented by the `inputs_ids` passed when calling [`ViltModel`].
|
||||
type_vocab_size (`int`, *optional*, defaults to 2):
|
||||
The vocabulary size of the `token_type_ids` passed when calling [`ViltModel`]. This is used when encoding
|
||||
text.
|
||||
modality_type_vocab_size (`int`, *optional*, defaults to 2):
|
||||
The vocabulary size of the modalities passed when calling [`ViltModel`]. This is used after concatening the
|
||||
embeddings of the text and image modalities.
|
||||
max_position_embeddings (`int`, *optional*, defaults to 40):
|
||||
The maximum sequence length that this model might ever be used with.
|
||||
hidden_size (`int`, *optional*, defaults to 768):
|
||||
Dimensionality of the encoder layers and the pooler layer.
|
||||
num_hidden_layers (`int`, *optional*, defaults to 12):
|
||||
Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads (`int`, *optional*, defaults to 12):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
intermediate_size (`int`, *optional*, defaults to 3072):
|
||||
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||
`"relu"`, `"selu"` and `"gelu_new"` are supported.
|
||||
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
|
||||
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
|
||||
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
|
||||
The dropout ratio for the attention probabilities.
|
||||
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        image_size (`int`, *optional*, defaults to 384):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 32):
            The size (resolution) of each patch.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        qkv_bias (`bool`, *optional*, defaults to `True`):
            Whether to add a bias to the queries, keys and values.
        max_image_length (`int`, *optional*, defaults to -1):
            The maximum number of patches to take as input for the Transformer encoder. If set to a positive integer,
            the encoder will sample `max_image_length` patches at maximum. If set to -1, will not be taken into
            account.
        num_images (`int`, *optional*, defaults to -1):
            The number of images to use for natural language visual reasoning. If set to a positive integer, will be
            used by [`ViltForImagesAndTextClassification`] for defining the classifier head.

    Example:

    ```python
    >>> from transformers import ViltModel, ViltConfig

    >>> # Initializing a ViLT dandelin/vilt-b32-mlm style configuration
    >>> configuration = ViltConfig()

    >>> # Initializing a model from the dandelin/vilt-b32-mlm style configuration
    >>> model = ViltModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
    model_type = "vilt"

    def __init__(
        self,
        vocab_size=30522,
        type_vocab_size=2,
        modality_type_vocab_size=2,
        max_position_embeddings=40,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.0,
        attention_probs_dropout_prob=0.0,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        is_encoder_decoder=False,
        image_size=384,
        patch_size=32,
        num_channels=3,
        qkv_bias=True,
        max_image_length=-1,
        tie_word_embeddings=False,
        num_images=-1,
        **kwargs
    ):
        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)

        self.vocab_size = vocab_size
        self.type_vocab_size = type_vocab_size
        self.modality_type_vocab_size = modality_type_vocab_size
        self.max_position_embeddings = max_position_embeddings

        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps

        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.qkv_bias = qkv_bias
        self.max_image_length = max_image_length
        self.num_images = num_images
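With the defaults above (`image_size=384`, `patch_size=32`), the ViT backbone sees (384 / 32)² = 144 patch tokens per image, which is the budget that `max_image_length` can cap. A quick sanity check of that arithmetic:

```python
# Patch-token count implied by the default config values above.
image_size = 384
patch_size = 32
num_patches = (image_size // patch_size) ** 2
print(num_patches)  # 144
```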

@@ -0,0 +1,297 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert ViLT checkpoints from the original Github repository."""


import argparse
import json
from pathlib import Path

import torch
from PIL import Image

import requests
from huggingface_hub import cached_download, hf_hub_url
from transformers import (
    BertTokenizer,
    ViltConfig,
    ViltFeatureExtractor,
    ViltForImageAndTextRetrieval,
    ViltForImagesAndTextClassification,
    ViltForMaskedLM,
    ViltForQuestionAnswering,
    ViltProcessor,
)
from transformers.utils import logging


logging.set_verbosity_info()
logger = logging.get_logger(__name__)


# here we list all keys to be renamed (original name on the left, our name on the right)
def create_rename_keys(config, vqa_model=False, nlvr_model=False, irtr_model=False):
    rename_keys = []
    for i in range(config.num_hidden_layers):
        # encoder layers: output projection, 2 feedforward neural networks and 2 layernorms
        rename_keys.append((f"transformer.blocks.{i}.norm1.weight", f"vilt.encoder.layer.{i}.layernorm_before.weight"))
        rename_keys.append((f"transformer.blocks.{i}.norm1.bias", f"vilt.encoder.layer.{i}.layernorm_before.bias"))
        rename_keys.append(
            (f"transformer.blocks.{i}.attn.proj.weight", f"vilt.encoder.layer.{i}.attention.output.dense.weight")
        )
        rename_keys.append(
            (f"transformer.blocks.{i}.attn.proj.bias", f"vilt.encoder.layer.{i}.attention.output.dense.bias")
        )
        rename_keys.append((f"transformer.blocks.{i}.norm2.weight", f"vilt.encoder.layer.{i}.layernorm_after.weight"))
        rename_keys.append((f"transformer.blocks.{i}.norm2.bias", f"vilt.encoder.layer.{i}.layernorm_after.bias"))
        rename_keys.append(
            (f"transformer.blocks.{i}.mlp.fc1.weight", f"vilt.encoder.layer.{i}.intermediate.dense.weight")
        )
        rename_keys.append((f"transformer.blocks.{i}.mlp.fc1.bias", f"vilt.encoder.layer.{i}.intermediate.dense.bias"))
        rename_keys.append((f"transformer.blocks.{i}.mlp.fc2.weight", f"vilt.encoder.layer.{i}.output.dense.weight"))
        rename_keys.append((f"transformer.blocks.{i}.mlp.fc2.bias", f"vilt.encoder.layer.{i}.output.dense.bias"))

    # embeddings
    rename_keys.extend(
        [
            # text embeddings
            ("text_embeddings.word_embeddings.weight", "vilt.embeddings.text_embeddings.word_embeddings.weight"),
            (
                "text_embeddings.position_embeddings.weight",
                "vilt.embeddings.text_embeddings.position_embeddings.weight",
            ),
            ("text_embeddings.position_ids", "vilt.embeddings.text_embeddings.position_ids"),
            (
                "text_embeddings.token_type_embeddings.weight",
                "vilt.embeddings.text_embeddings.token_type_embeddings.weight",
            ),
            ("text_embeddings.LayerNorm.weight", "vilt.embeddings.text_embeddings.LayerNorm.weight"),
            ("text_embeddings.LayerNorm.bias", "vilt.embeddings.text_embeddings.LayerNorm.bias"),
            # patch embeddings
            ("transformer.cls_token", "vilt.embeddings.cls_token"),
            ("transformer.patch_embed.proj.weight", "vilt.embeddings.patch_embeddings.projection.weight"),
            ("transformer.patch_embed.proj.bias", "vilt.embeddings.patch_embeddings.projection.bias"),
            ("transformer.pos_embed", "vilt.embeddings.position_embeddings"),
            # token type embeddings
            ("token_type_embeddings.weight", "vilt.embeddings.token_type_embeddings.weight"),
        ]
    )

    # final layernorm + pooler
    rename_keys.extend(
        [
            ("transformer.norm.weight", "vilt.layernorm.weight"),
            ("transformer.norm.bias", "vilt.layernorm.bias"),
            ("pooler.dense.weight", "vilt.pooler.dense.weight"),
            ("pooler.dense.bias", "vilt.pooler.dense.bias"),
        ]
    )

    # classifier head(s)
    if vqa_model:
        # classification head
        rename_keys.extend(
            [
                ("vqa_classifier.0.weight", "classifier.0.weight"),
                ("vqa_classifier.0.bias", "classifier.0.bias"),
                ("vqa_classifier.1.weight", "classifier.1.weight"),
                ("vqa_classifier.1.bias", "classifier.1.bias"),
                ("vqa_classifier.3.weight", "classifier.3.weight"),
                ("vqa_classifier.3.bias", "classifier.3.bias"),
            ]
        )
    elif nlvr_model:
        # classification head
        rename_keys.extend(
            [
                ("nlvr2_classifier.0.weight", "classifier.0.weight"),
                ("nlvr2_classifier.0.bias", "classifier.0.bias"),
                ("nlvr2_classifier.1.weight", "classifier.1.weight"),
                ("nlvr2_classifier.1.bias", "classifier.1.bias"),
                ("nlvr2_classifier.3.weight", "classifier.3.weight"),
                ("nlvr2_classifier.3.bias", "classifier.3.bias"),
            ]
        )

    return rename_keys


# we split up the matrix of each encoder layer into queries, keys and values
def read_in_q_k_v(state_dict, config):
    for i in range(config.num_hidden_layers):
        prefix = "vilt."
        # read in weights + bias of input projection layer (in timm, this is a single matrix + bias)
        in_proj_weight = state_dict.pop(f"transformer.blocks.{i}.attn.qkv.weight")
        in_proj_bias = state_dict.pop(f"transformer.blocks.{i}.attn.qkv.bias")
        # next, add query, keys and values (in that order) to the state dict
        state_dict[f"{prefix}encoder.layer.{i}.attention.attention.query.weight"] = in_proj_weight[
            : config.hidden_size, :
        ]
        state_dict[f"{prefix}encoder.layer.{i}.attention.attention.query.bias"] = in_proj_bias[: config.hidden_size]
        state_dict[f"{prefix}encoder.layer.{i}.attention.attention.key.weight"] = in_proj_weight[
            config.hidden_size : config.hidden_size * 2, :
        ]
        state_dict[f"{prefix}encoder.layer.{i}.attention.attention.key.bias"] = in_proj_bias[
            config.hidden_size : config.hidden_size * 2
        ]
        state_dict[f"{prefix}encoder.layer.{i}.attention.attention.value.weight"] = in_proj_weight[
            -config.hidden_size :, :
        ]
        state_dict[f"{prefix}encoder.layer.{i}.attention.attention.value.bias"] = in_proj_bias[-config.hidden_size :]
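The slicing above can be illustrated on a toy fused bias: the first `hidden_size` entries become the query bias, the next `hidden_size` the key bias, and the last `hidden_size` the value bias. A minimal sketch with plain lists (a hidden size of 4 is an arbitrary toy value; ViLT-B/32 uses 768):

```python
hidden_size = 4  # toy value for illustration
fused_qkv_bias = list(range(3 * hidden_size))  # stand-in for the fused qkv bias tensor
query = fused_qkv_bias[:hidden_size]
key = fused_qkv_bias[hidden_size : hidden_size * 2]
value = fused_qkv_bias[-hidden_size:]
# the three slices partition the fused tensor exactly, with nothing lost or duplicated
assert query + key + value == fused_qkv_bias
```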


def remove_classification_head_(state_dict):
    ignore_keys = ["head.weight", "head.bias"]
    for k in ignore_keys:
        state_dict.pop(k, None)


def rename_key(dct, old, new):
    val = dct.pop(old)
    dct[new] = val
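`rename_key` is applied pair-by-pair over the list produced by `create_rename_keys`; a quick sketch on a hypothetical two-entry state dict:

```python
def rename_key(dct, old, new):
    # move the value stored under `old` to the key `new`
    val = dct.pop(old)
    dct[new] = val

# hypothetical miniature state dict using original ViLT key names
state_dict = {"transformer.norm.weight": 1.0, "transformer.norm.bias": 0.0}
for src, dest in [
    ("transformer.norm.weight", "vilt.layernorm.weight"),
    ("transformer.norm.bias", "vilt.layernorm.bias"),
]:
    rename_key(state_dict, src, dest)
print(sorted(state_dict))  # ['vilt.layernorm.bias', 'vilt.layernorm.weight']
```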


@torch.no_grad()
def convert_vilt_checkpoint(checkpoint_url, pytorch_dump_folder_path):
    """
    Copy/paste/tweak model's weights to our ViLT structure.
    """

    # define configuration and initialize HuggingFace model
    config = ViltConfig(image_size=384, patch_size=32, tie_word_embeddings=False)
    mlm_model = False
    vqa_model = False
    nlvr_model = False
    irtr_model = False
    if "vqa" in checkpoint_url:
        vqa_model = True
        config.num_labels = 3129
        repo_id = "datasets/huggingface/label-files"
        filename = "vqa2-id2label.json"
        id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename)), "r"))
        id2label = {int(k): v for k, v in id2label.items()}
        config.id2label = id2label
        config.label2id = {v: k for k, v in id2label.items()}
        model = ViltForQuestionAnswering(config)
    elif "nlvr" in checkpoint_url:
        nlvr_model = True
        config.num_labels = 2
        config.id2label = {0: "False", 1: "True"}
        config.label2id = {v: k for k, v in config.id2label.items()}
        config.modality_type_vocab_size = 3
        model = ViltForImagesAndTextClassification(config)
    elif "irtr" in checkpoint_url:
        irtr_model = True
        model = ViltForImageAndTextRetrieval(config)
    elif "mlm_itm" in checkpoint_url:
        mlm_model = True
        model = ViltForMaskedLM(config)
    else:
        raise ValueError("Unknown model type")

    # load state_dict of original model, remove and rename some keys
    state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")["state_dict"]
    rename_keys = create_rename_keys(config, vqa_model, nlvr_model, irtr_model)
    for src, dest in rename_keys:
        rename_key(state_dict, src, dest)
    read_in_q_k_v(state_dict, config)
    if mlm_model or irtr_model:
        ignore_keys = ["itm_score.fc.weight", "itm_score.fc.bias"]
        for k in ignore_keys:
            state_dict.pop(k, None)

    # load state dict into HuggingFace model
    model.eval()
    if mlm_model:
        missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
        assert missing_keys == ["mlm_score.decoder.bias"]
    else:
        model.load_state_dict(state_dict)

    # Define processor
    feature_extractor = ViltFeatureExtractor(size=384)
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    processor = ViltProcessor(feature_extractor, tokenizer)

    # Forward pass on example inputs (image + text)
    if nlvr_model:
        image1 = Image.open(requests.get("https://lil.nlp.cornell.edu/nlvr/exs/ex0_0.jpg", stream=True).raw)
        image2 = Image.open(requests.get("https://lil.nlp.cornell.edu/nlvr/exs/ex0_0.jpg", stream=True).raw)
        text = "The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing."
        encoding_1 = processor(image1, text, return_tensors="pt")
        encoding_2 = processor(image2, text, return_tensors="pt")
        outputs = model(
            input_ids=encoding_1.input_ids,
            pixel_values=encoding_1.pixel_values,
            pixel_values_2=encoding_2.pixel_values,
        )
    else:
        image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
        if mlm_model:
            text = "a bunch of [MASK] laying on a [MASK]."
        else:
            text = "How many cats are there?"
        encoding = processor(image, text, return_tensors="pt")
        outputs = model(**encoding)

    # Verify outputs
    if mlm_model:
        expected_shape = torch.Size([1, 11, 30522])
        expected_slice = torch.tensor([-12.5061, -12.5123, -12.5174])
        assert outputs.logits.shape == expected_shape
        assert torch.allclose(outputs.logits[0, 0, :3], expected_slice, atol=1e-4)

        # verify masked token prediction equals "cats"
        predicted_id = outputs.logits[0, 4, :].argmax(-1).item()
        assert tokenizer.decode([predicted_id]) == "cats"
    elif vqa_model:
        expected_shape = torch.Size([1, 3129])
        expected_slice = torch.tensor([-15.9495, -18.1472, -10.3041])
        assert torch.allclose(outputs.logits[0, :3], expected_slice, atol=1e-4)
        assert outputs.logits.shape == expected_shape

        # verify vqa prediction equals "2"
        predicted_idx = outputs.logits.argmax(-1).item()
        assert model.config.id2label[predicted_idx] == "2"
    elif nlvr_model:
        expected_shape = torch.Size([1, 2])
        expected_slice = torch.tensor([-2.8721, 2.1291])
        assert torch.allclose(outputs.logits[0, :2], expected_slice, atol=1e-4)
        assert outputs.logits.shape == expected_shape

    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
    print(f"Saving model and processor to {pytorch_dump_folder_path}")
    model.save_pretrained(pytorch_dump_folder_path)
    processor.save_pretrained(pytorch_dump_folder_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument(
        "--checkpoint_url",
        default="https://github.com/dandelin/ViLT/releases/download/200k/vilt_200k_mlm_itm.ckpt",
        type=str,
        help="URL of the checkpoint you'd like to convert.",
    )
    parser.add_argument(
        "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
    )

    args = parser.parse_args()
    convert_vilt_checkpoint(args.checkpoint_url, args.pytorch_dump_folder_path)

@@ -0,0 +1,292 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for ViLT."""

from typing import List, Optional, Union

import numpy as np
from PIL import Image

from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
from ...file_utils import TensorType, is_torch_available
from ...image_utils import (
    IMAGENET_STANDARD_MEAN,
    IMAGENET_STANDARD_STD,
    ImageFeatureExtractionMixin,
    ImageInput,
    is_torch_tensor,
)
from ...utils import logging


if is_torch_available():
    import torch

logger = logging.get_logger(__name__)


class ViltFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
    r"""
    Constructs a ViLT feature extractor.

    This feature extractor inherits from [`FeatureExtractionMixin`] which contains most of the main methods. Users
    should refer to this superclass for more information regarding those methods.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the input based on `size`.
        size (`int`, *optional*, defaults to 384):
            Resize the shorter side of the input to the given size. Should be an integer. The longer side will be
            limited to under `int((1333 / 800) * size)` while preserving the aspect ratio. Only has an effect if
            `do_resize` is set to `True`.
        size_divisor (`int`, *optional*, defaults to 32):
            The size by which to make sure both the height and width can be divided.
        resample (`int`, *optional*, defaults to `PIL.Image.BICUBIC`):
            An optional resampling filter. This can be one of `PIL.Image.NEAREST`, `PIL.Image.BOX`,
            `PIL.Image.BILINEAR`, `PIL.Image.HAMMING`, `PIL.Image.BICUBIC` or `PIL.Image.LANCZOS`. Only has an effect
            if `do_resize` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether or not to normalize the input with mean and standard deviation.
        image_mean (`List[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
            The sequence of means for each channel, to be used when normalizing images.
        image_std (`List[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
            The sequence of standard deviations for each channel, to be used when normalizing images.
    """

    model_input_names = ["pixel_values", "pixel_mask"]

    def __init__(
        self,
        do_resize=True,
        size=384,
        size_divisor=32,
        resample=Image.BICUBIC,
        do_normalize=True,
        image_mean=None,
        image_std=None,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.do_resize = do_resize
        self.size = size
        self.size_divisor = size_divisor
        self.resample = resample
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD

    def _resize(self, image, shorter=800, longer=1333, size_divisor=32, resample=Image.BICUBIC):
        """
        Resizes the shorter edge of `image` to `shorter` and limits the longer edge to under `longer`, while preserving
        the aspect ratio. Also makes sure that both the height and width can be divided by `size_divisor`.

        Based on original implementation:
        https://github.com/dandelin/ViLT/blob/3db8b5035464afee84d951bf6322e1b27f1d072d/vilt/transforms/utils.py#L5

        Args:
            image (`PIL.Image`):
                The image to resize.
            shorter (`int`, *optional*, defaults to `800`):
                The size to which to resize the shorter side of the image.
            longer (`int`, *optional*, defaults to `1333`):
                The size by which to limit the longer side of the image, while preserving the aspect ratio.
            size_divisor (`int`, *optional*, defaults to `32`):
                The size by which both the height and the width must be divisible.
            resample (`int`, *optional*, defaults to `PIL.Image.BICUBIC`):
                An optional resampling filter.
        """
        if not isinstance(image, Image.Image):
            image = self.to_pil_image(image)

        w, h = image.size
        min_size = shorter
        max_size = longer
        scale = min_size / min(w, h)
        if h < w:
            newh, neww = min_size, scale * w
        else:
            newh, neww = scale * h, min_size

        if max(newh, neww) > max_size:
            scale = max_size / max(newh, neww)
            newh = newh * scale
            neww = neww * scale

        newh, neww = int(newh + 0.5), int(neww + 0.5)
        newh, neww = newh // size_divisor * size_divisor, neww // size_divisor * size_divisor

        return self.resize(image, size=(neww, newh), resample=resample)
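The sizing arithmetic above can be checked in isolation. The following standalone re-implementation (a sketch, independent of PIL and of this class) mirrors the shorter/longer/`size_divisor` logic for the default `size=384`:

```python
def resize_dims(w, h, shorter=384, longer=int((1333 / 800) * 384), size_divisor=32):
    # Mirror of the arithmetic in `_resize`: scale the shorter side to `shorter`,
    # cap the longer side at `longer`, then round both to a multiple of `size_divisor`.
    scale = shorter / min(w, h)
    if h < w:
        newh, neww = shorter, scale * w
    else:
        newh, neww = scale * h, shorter
    if max(newh, neww) > longer:
        rescale = longer / max(newh, neww)
        newh, neww = newh * rescale, neww * rescale
    newh, neww = int(newh + 0.5), int(neww + 0.5)
    return newh // size_divisor * size_divisor, neww // size_divisor * size_divisor

print(resize_dims(640, 480))  # (384, 512)
```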

    def _max_by_axis(self, the_list):
        # type: (List[List[int]]) -> List[int]
        maxes = the_list[0]
        for sublist in the_list[1:]:
            for index, item in enumerate(sublist):
                maxes[index] = max(maxes[index], item)
        return maxes
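A minimal sketch of what `_max_by_axis` computes, on two hypothetical (C, H, W) shapes:

```python
def max_by_axis(the_list):
    # element-wise maximum across a list of equal-length lists, as in `_max_by_axis`
    maxes = list(the_list[0])
    for sublist in the_list[1:]:
        for index, item in enumerate(sublist):
            maxes[index] = max(maxes[index], item)
    return maxes

# two images of shape (C, H, W): the common padded size is the per-axis maximum
print(max_by_axis([[3, 384, 512], [3, 448, 480]]))  # [3, 448, 512]
```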

    def pad_and_create_pixel_mask(
        self, pixel_values_list: List["torch.Tensor"], return_tensors: Optional[Union[str, TensorType]] = None
    ):
        """
        Pad images up to the largest image in a batch and create a corresponding `pixel_mask`.

        Args:
            pixel_values_list (`List[torch.Tensor]`):
                List of images (pixel values) to be padded. Each image should be a tensor of shape (C, H, W).
            return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
                If set, will return tensors instead of NumPy arrays. If set to `'pt'`, return PyTorch `torch.Tensor`
                objects.

        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **pixel_values** -- Pixel values to be fed to a model.
            - **pixel_mask** -- Pixel mask to be fed to a model (when `pad_and_return_pixel_mask=True` or if
              *"pixel_mask"* is in `self.model_input_names`).
        """
        max_size = self._max_by_axis([list(image.shape) for image in pixel_values_list])
        c, h, w = max_size
        padded_images = []
        pixel_mask = []
        for image in pixel_values_list:
            # create padded image
            padded_image = np.zeros((c, h, w), dtype=np.float32)
            padded_image[: image.shape[0], : image.shape[1], : image.shape[2]] = np.copy(image)
            padded_images.append(padded_image)
            # create pixel mask
            mask = np.zeros((h, w), dtype=np.int64)
            mask[: image.shape[1], : image.shape[2]] = True
            pixel_mask.append(mask)

        # return as BatchFeature
        data = {"pixel_values": padded_images, "pixel_mask": pixel_mask}
        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)

        return encoded_inputs
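The padding loop above, sketched with NumPy on two tiny hypothetical images: both are zero-padded to the batch-wise maximum (C, H, W), and the mask marks real pixels with 1 and padding with 0.

```python
import numpy as np

images = [np.ones((3, 2, 4), dtype=np.float32), np.ones((3, 3, 2), dtype=np.float32)]
# per-axis maximum over the batch, as _max_by_axis computes
c, h, w = (max(dims) for dims in zip(*(img.shape for img in images)))
padded_images, pixel_mask = [], []
for image in images:
    padded = np.zeros((c, h, w), dtype=np.float32)
    padded[:, : image.shape[1], : image.shape[2]] = image
    padded_images.append(padded)
    mask = np.zeros((h, w), dtype=np.int64)
    mask[: image.shape[1], : image.shape[2]] = 1
    pixel_mask.append(mask)

print(padded_images[0].shape, int(pixel_mask[1].sum()))  # (3, 3, 4) 6
```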

    def __call__(
        self,
        images: ImageInput,
        pad_and_return_pixel_mask: Optional[bool] = True,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs
    ) -> BatchFeature:
        """
        Main method to prepare one or several image(s) for the model.

        <Tip warning={true}>

        NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so it is most efficient to pass
        PIL images.

        </Tip>

        Args:
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is
                the number of channels, and H and W are image height and width.

            pad_and_return_pixel_mask (`bool`, *optional*, defaults to `True`):
                Whether or not to pad images up to the largest image in a batch and create a pixel mask.

                If left to the default, will return a pixel mask that is:

                - 1 for pixels that are real (i.e. **not masked**),
                - 0 for pixels that are padding (i.e. **masked**).

            return_tensors (`str` or [`~file_utils.TensorType`], *optional*, defaults to `'np'`):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **pixel_values** -- Pixel values to be fed to a model, of shape (batch_size, num_channels, height,
              width).
            - **pixel_mask** -- Pixel mask to be fed to a model (when `pad_and_return_pixel_mask=True` or if
              *"pixel_mask"* is in `self.model_input_names`).
        """
        # Input type checking for clearer error
        valid_images = False

        # Check that images has a valid type
        if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
            valid_images = True
        elif isinstance(images, (list, tuple)):
            if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
                valid_images = True

        if not valid_images:
            raise ValueError(
                "Images must be of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example), "
                "`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
            )

        is_batched = bool(
            isinstance(images, (list, tuple))
            and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
        )

        if not is_batched:
            images = [images]

        # transformations (resizing + normalization)
        if self.do_resize and self.size is not None:
            longer = int((1333 / 800) * self.size)
            images = [
                self._resize(
                    image=image,
                    shorter=self.size,
                    longer=longer,
                    size_divisor=self.size_divisor,
                    resample=self.resample,
                )
                for image in images
            ]
        if self.do_normalize:
            images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]

        if pad_and_return_pixel_mask:
            # pad images up to largest image in batch and create pixel_mask
            max_size = self._max_by_axis([list(image.shape) for image in images])
            c, h, w = max_size
            padded_images = []
            pixel_mask = []
            for image in images:
                # create padded image
                padded_image = np.zeros((c, h, w), dtype=np.float32)
                padded_image[: image.shape[0], : image.shape[1], : image.shape[2]] = np.copy(image)
                padded_images.append(padded_image)
                # create pixel mask
                mask = np.zeros((h, w), dtype=np.int64)
                mask[: image.shape[1], : image.shape[2]] = True
                pixel_mask.append(mask)
            images = padded_images

        # return as BatchFeature
        data = {}
        data["pixel_values"] = images
        if pad_and_return_pixel_mask:
            data["pixel_mask"] = pixel_mask
        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)

        return encoded_inputs
@ -0,0 +1,172 @@
|
|||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""
|
||||
Processor class for ViLT.
|
||||
"""
|
||||
|
||||
from typing import List, Optional, Union
|
||||
|
||||
from transformers import BertTokenizerFast
|
||||
|
||||
from ...file_utils import TensorType
|
||||
from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
|
||||
from .feature_extraction_vilt import ViltFeatureExtractor
|
||||
|
||||
|
||||
class ViltProcessor:
    r"""
    Constructs a ViLT processor which wraps a BERT tokenizer and ViLT feature extractor into a single processor.

    [`ViltProcessor`] offers all the functionalities of [`ViltFeatureExtractor`] and [`BertTokenizerFast`]. See the
    docstring of [`~ViltProcessor.__call__`] and [`~ViltProcessor.decode`] for more information.

    Args:
        feature_extractor (`ViltFeatureExtractor`):
            An instance of [`ViltFeatureExtractor`]. The feature extractor is a required input.
        tokenizer (`BertTokenizerFast`):
            An instance of [`BertTokenizerFast`]. The tokenizer is a required input.
    """

    def __init__(self, feature_extractor, tokenizer):
        if not isinstance(feature_extractor, ViltFeatureExtractor):
            raise ValueError(
                f"`feature_extractor` has to be of type {ViltFeatureExtractor.__name__}, but is {type(feature_extractor)}"
            )
        if not isinstance(tokenizer, BertTokenizerFast):
            raise ValueError(f"`tokenizer` has to be of type {BertTokenizerFast.__name__}, but is {type(tokenizer)}")

        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.current_processor = self.feature_extractor

    def save_pretrained(self, save_directory):
        """
        Save a ViLT feature_extractor object and BERT tokenizer object to the directory `save_directory`, so that it
        can be re-loaded using the [`~ViltProcessor.from_pretrained`] class method.

        <Tip>

        This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and
        [`~tokenization_utils_base.PreTrainedTokenizer.save_pretrained`]. Please refer to the docstrings of the methods
        above for more information.

        </Tip>

        Args:
            save_directory (`str` or `os.PathLike`):
                Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
                be created if it does not exist).
        """
        self.feature_extractor.save_pretrained(save_directory)
        self.tokenizer.save_pretrained(save_directory)

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
        r"""
        Instantiate a [`ViltProcessor`] from a pretrained ViLT processor.

        <Tip>

        This class method is simply calling ViltFeatureExtractor's
        [`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`] and BertTokenizerFast's
        [`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`]. Please refer to the docstrings of the methods
        above for more information.

        </Tip>

        Args:
            pretrained_model_name_or_path (`str` or `os.PathLike`):
                This can be either:

                - a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on
                  huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
                  namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
                - a path to a *directory* containing a feature extractor file saved using the
                  [`~SequenceFeatureExtractor.save_pretrained`] method, e.g., `./my_model_directory/`.
                - a path or url to a saved feature extractor JSON *file*, e.g.,
                  `./my_model_directory/preprocessor_config.json`.
            **kwargs
                Additional keyword arguments passed along to both [`SequenceFeatureExtractor`] and
                [`PreTrainedTokenizer`].
        """
        feature_extractor = ViltFeatureExtractor.from_pretrained(pretrained_model_name_or_path, **kwargs)
        tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)

        return cls(feature_extractor=feature_extractor, tokenizer=tokenizer)
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
images,
|
||||
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
|
||||
add_special_tokens: bool = True,
|
||||
padding: Union[bool, str, PaddingStrategy] = False,
|
||||
truncation: Union[bool, str, TruncationStrategy] = False,
|
||||
max_length: Optional[int] = None,
|
||||
stride: int = 0,
|
||||
pad_to_multiple_of: Optional[int] = None,
|
||||
return_token_type_ids: Optional[bool] = None,
|
||||
return_attention_mask: Optional[bool] = None,
|
||||
return_overflowing_tokens: bool = False,
|
||||
return_special_tokens_mask: bool = False,
|
||||
return_offsets_mapping: bool = False,
|
||||
return_length: bool = False,
|
||||
verbose: bool = True,
|
||||
return_tensors: Optional[Union[str, TensorType]] = None,
|
||||
**kwargs
|
||||
) -> BatchEncoding:
|
||||
"""
|
||||
This method uses [`ViltFeatureExtractor.__call__`] method to prepare image(s) for the model, and
|
||||
[`BertTokenizerFast.__call__`] to prepare text for the model.
|
||||
|
||||
Please refer to the docstring of the above two methods for more information.
|
||||
"""
|
||||
encoding = self.tokenizer(
|
||||
text=text,
|
||||
add_special_tokens=add_special_tokens,
|
||||
padding=padding,
|
||||
truncation=truncation,
|
||||
max_length=max_length,
|
||||
stride=stride,
|
||||
pad_to_multiple_of=pad_to_multiple_of,
|
||||
return_token_type_ids=return_token_type_ids,
|
||||
return_attention_mask=return_attention_mask,
|
||||
return_overflowing_tokens=return_overflowing_tokens,
|
||||
return_special_tokens_mask=return_special_tokens_mask,
|
||||
return_offsets_mapping=return_offsets_mapping,
|
||||
return_length=return_length,
|
||||
verbose=verbose,
|
||||
return_tensors=return_tensors,
|
||||
**kwargs,
|
||||
)
|
||||
# add pixel_values + pixel_mask
|
||||
encoding_feature_extractor = self.feature_extractor(images, return_tensors=return_tensors)
|
||||
encoding.update(encoding_feature_extractor)
|
||||
|
||||
return encoding
|
||||
|
||||
def batch_decode(self, *args, **kwargs):
|
||||
"""
|
||||
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
|
||||
refer to the docstring of this method for more information.
|
||||
"""
|
||||
return self.tokenizer.batch_decode(*args, **kwargs)
|
||||
|
||||
def decode(self, *args, **kwargs):
|
||||
"""
|
||||
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
|
||||
the docstring of this method for more information.
|
||||
"""
|
||||
return self.tokenizer.decode(*args, **kwargs)
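A minimal sketch of the composition pattern `ViltProcessor.__call__` uses: the tokenizer produces the text encoding, the feature extractor produces `pixel_values` and `pixel_mask`, and the two dicts are merged. The stub classes below are illustrative stand-ins, not the real `BertTokenizerFast` or `ViltFeatureExtractor`:

```python
# Stub components standing in for the real tokenizer and feature extractor,
# used only to illustrate how the processor merges the two encodings.
class StubTokenizer:
    def __call__(self, text, **kwargs):
        # pretend every word maps to the id 1, wrapped in [CLS]/[SEP] ids
        return {"input_ids": [[101] + [1] * len(text.split()) + [102]]}


class StubFeatureExtractor:
    def __call__(self, images, **kwargs):
        # pretend each image is already a flat list of pixel values
        return {"pixel_values": images, "pixel_mask": [[1] * len(img) for img in images]}


class StubProcessor:
    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def __call__(self, images, text):
        encoding = self.tokenizer(text)
        # add pixel_values + pixel_mask, mirroring ViltProcessor.__call__
        encoding.update(self.feature_extractor(images))
        return encoding


processor = StubProcessor(StubFeatureExtractor(), StubTokenizer())
encoding = processor(images=[[0.1, 0.2]], text="a cat")
```

The resulting `encoding` carries both modalities' inputs in one dict, which is exactly what the ViLT model's `forward` expects to unpack.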

@@ -326,12 +326,6 @@ class ViTLayer(nn.Module):

        # in ViT, layernorm is also applied after self-attention
        layer_output = self.layernorm_after(hidden_states)

        # TODO feedforward chunking not working for now
        # layer_output = apply_chunking_to_forward(
        #     self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, layer_output
        # )

        layer_output = self.intermediate(layer_output)

        # second residual connection is done here

@@ -341,11 +335,6 @@ class ViTLayer(nn.Module):

        return outputs

    def feed_forward_chunk(self, attention_output):
        intermediate_output = self.intermediate(attention_output)
        layer_output = self.output(intermediate_output)
        return layer_output


class ViTEncoder(nn.Module):
    def __init__(self, config):
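The `apply_chunking_to_forward` call in the TODO above refers to feed-forward chunking: because the feed-forward sub-layer is position-wise, it can be applied to slices of the sequence and the results concatenated, trading peak memory for extra calls. A simplified sketch (the names here are illustrative, not the transformers API):

```python
def apply_chunking(forward_fn, chunk_size, sequence):
    """Apply a position-wise function chunk by chunk along the sequence."""
    if chunk_size <= 0 or chunk_size >= len(sequence):
        return forward_fn(sequence)
    out = []
    for i in range(0, len(sequence), chunk_size):
        # each chunk is processed independently, then results are concatenated
        out.extend(forward_fn(sequence[i : i + chunk_size]))
    return out


def feed_forward(chunk):
    # toy position-wise transformation standing in for intermediate + output
    return [2 * x + 1 for x in chunk]


seq = list(range(10))
# chunked application equals applying the function to the whole sequence
assert apply_chunking(feed_forward, 3, seq) == feed_forward(seq)
```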

@@ -3540,6 +3540,58 @@ class UniSpeechSatPreTrainedModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


VILT_PRETRAINED_MODEL_ARCHIVE_LIST = None


class ViltForImageAndTextRetrieval(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ViltForImagesAndTextClassification(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ViltForMaskedLM(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ViltForQuestionAnswering(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ViltLayer(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ViltModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ViltPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class VisionEncoderDecoderModel(metaclass=DummyObject):
    _backends = ["torch"]
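The dummy objects above exist so that `transformers` stays importable when an optional backend (here, torch) is missing: any use of the class fails loudly instead of at import time. A hedged sketch of the pattern (`ViltModelStub` and this `requires_backends` are simplified stand-ins, not the actual transformers utilities):

```python
def requires_backends(obj, backends):
    # In this sketch we pretend none of the backends are installed.
    name = getattr(obj, "__name__", obj.__class__.__name__)
    raise ImportError(f"{name} requires the {backends} backend(s).")


class DummyObject(type):
    # attribute access on the class itself (e.g. .from_pretrained) also fails loudly
    def __getattr__(cls, key):
        requires_backends(cls, cls._backends)


class ViltModelStub(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])
```

Defining the class costs nothing; only instantiating it or touching a class attribute raises, which is why these dummies can safely populate the package namespace.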

@@ -87,6 +87,20 @@ class SegformerFeatureExtractor(metaclass=DummyObject):
        requires_backends(self, ["vision"])


class ViltFeatureExtractor(metaclass=DummyObject):
    _backends = ["vision"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["vision"])


class ViltProcessor(metaclass=DummyObject):
    _backends = ["vision"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["vision"])


class ViTFeatureExtractor(metaclass=DummyObject):
    _backends = ["vision"]
@@ -0,0 +1,251 @@
# coding=utf-8
# Copyright 2021 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import unittest

import numpy as np

from transformers.file_utils import is_torch_available, is_vision_available
from transformers.testing_utils import require_torch, require_vision

from .test_feature_extraction_common import FeatureExtractionSavingTestMixin, prepare_image_inputs


if is_torch_available():
    import torch

if is_vision_available():
    from PIL import Image

    from transformers import ViltFeatureExtractor


class ViltFeatureExtractionTester(unittest.TestCase):
    def __init__(
        self,
        parent,
        batch_size=7,
        num_channels=3,
        image_size=18,
        min_resolution=30,
        max_resolution=400,
        do_resize=True,
        size=30,
        size_divisor=2,
        do_normalize=True,
        image_mean=[0.5, 0.5, 0.5],
        image_std=[0.5, 0.5, 0.5],
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.num_channels = num_channels
        self.image_size = image_size
        self.min_resolution = min_resolution
        self.max_resolution = max_resolution
        self.do_resize = do_resize
        self.size = size
        self.size_divisor = size_divisor
        self.do_normalize = do_normalize
        self.image_mean = image_mean
        self.image_std = image_std

    def prepare_feat_extract_dict(self):
        return {
            "image_mean": self.image_mean,
            "image_std": self.image_std,
            "do_normalize": self.do_normalize,
            "do_resize": self.do_resize,
            "size": self.size,
            "size_divisor": self.size_divisor,
        }

    def get_expected_values(self, image_inputs, batched=False):
        """
        This function computes the expected height and width when providing images to ViltFeatureExtractor,
        assuming do_resize is set to True with a scalar size and size_divisor.
        """
        if not batched:
            image = image_inputs[0]
            if isinstance(image, Image.Image):
                w, h = image.size
            else:
                h, w = image.shape[1], image.shape[2]
            scale = self.size / min(w, h)
            if h < w:
                newh, neww = self.size, scale * w
            else:
                newh, neww = scale * h, self.size

            max_size = int((1333 / 800) * self.size)
            if max(newh, neww) > max_size:
                scale = max_size / max(newh, neww)
                newh = newh * scale
                neww = neww * scale

            newh, neww = int(newh + 0.5), int(neww + 0.5)
            expected_height, expected_width = (
                newh // self.size_divisor * self.size_divisor,
                neww // self.size_divisor * self.size_divisor,
            )

        else:
            expected_values = []
            for image in image_inputs:
                expected_height, expected_width = self.get_expected_values([image])
                expected_values.append((expected_height, expected_width))
            expected_height = max(expected_values, key=lambda item: item[0])[0]
            expected_width = max(expected_values, key=lambda item: item[1])[1]

        return expected_height, expected_width
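The single-image branch above is ViLT's min-max resize rule: the shorter side is scaled to `size`, the longer side is capped at `size * 1333 / 800`, and both are rounded down to a multiple of `size_divisor`. Factored into a standalone helper (a sketch mirroring the test logic; the defaults `shorter=384`, `size_divisor=32` are ViLT's checkpoint defaults, not part of this test class):

```python
def min_max_resize_shape(w, h, shorter=384, size_divisor=32):
    """Compute ViLT's target (height, width) for an input of width w, height h."""
    scale = shorter / min(w, h)
    if h < w:
        newh, neww = shorter, scale * w
    else:
        newh, neww = scale * h, shorter

    # cap the longer side, preserving the aspect ratio
    max_size = int((1333 / 800) * shorter)
    if max(newh, neww) > max_size:
        scale = max_size / max(newh, neww)
        newh, neww = newh * scale, neww * scale

    # round to nearest int, then snap down to a multiple of size_divisor
    newh, neww = int(newh + 0.5), int(neww + 0.5)
    return newh // size_divisor * size_divisor, neww // size_divisor * size_divisor
```

For a 640x480 landscape image this yields a 384x512 target; a square image of any size maps to 384x384.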


@require_torch
@require_vision
class ViltFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestCase):

    feature_extraction_class = ViltFeatureExtractor if is_vision_available() else None

    def setUp(self):
        self.feature_extract_tester = ViltFeatureExtractionTester(self)

    @property
    def feat_extract_dict(self):
        return self.feature_extract_tester.prepare_feat_extract_dict()

    def test_feat_extract_properties(self):
        feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
        self.assertTrue(hasattr(feature_extractor, "image_mean"))
        self.assertTrue(hasattr(feature_extractor, "image_std"))
        self.assertTrue(hasattr(feature_extractor, "do_normalize"))
        self.assertTrue(hasattr(feature_extractor, "do_resize"))
        self.assertTrue(hasattr(feature_extractor, "size"))
        self.assertTrue(hasattr(feature_extractor, "size_divisor"))

    def test_batch_feature(self):
        pass

    def test_call_pil(self):
        # Initialize feature_extractor
        feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
        # create random PIL images
        image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False)
        for image in image_inputs:
            self.assertIsInstance(image, Image.Image)

        # Test not batched input
        encoded_images = feature_extractor(image_inputs[0], return_tensors="pt").pixel_values

        expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs)
        self.assertEqual(
            encoded_images.shape,
            (1, self.feature_extract_tester.num_channels, expected_height, expected_width),
        )

        # Test batched
        encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values

        expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs, batched=True)
        self.assertEqual(
            encoded_images.shape,
            (
                self.feature_extract_tester.batch_size,
                self.feature_extract_tester.num_channels,
                expected_height,
                expected_width,
            ),
        )

    def test_call_numpy(self):
        # Initialize feature_extractor
        feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
        # create random numpy tensors
        image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, numpify=True)
        for image in image_inputs:
            self.assertIsInstance(image, np.ndarray)

        # Test not batched input
        encoded_images = feature_extractor(image_inputs[0], return_tensors="pt").pixel_values

        expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs)
        self.assertEqual(
            encoded_images.shape,
            (1, self.feature_extract_tester.num_channels, expected_height, expected_width),
        )

        # Test batched
        encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values

        expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs, batched=True)
        self.assertEqual(
            encoded_images.shape,
            (
                self.feature_extract_tester.batch_size,
                self.feature_extract_tester.num_channels,
                expected_height,
                expected_width,
            ),
        )

    def test_call_pytorch(self):
        # Initialize feature_extractor
        feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
        # create random PyTorch tensors
        image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, torchify=True)
        for image in image_inputs:
            self.assertIsInstance(image, torch.Tensor)

        # Test not batched input
        encoded_images = feature_extractor(image_inputs[0], return_tensors="pt").pixel_values

        expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs)
        self.assertEqual(
            encoded_images.shape,
            (1, self.feature_extract_tester.num_channels, expected_height, expected_width),
        )

        # Test batched
        encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values

        expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs, batched=True)
        self.assertEqual(
            encoded_images.shape,
            (
                self.feature_extract_tester.batch_size,
                self.feature_extract_tester.num_channels,
                expected_height,
                expected_width,
            ),
        )

    def test_equivalence_pad_and_create_pixel_mask(self):
        # Initialize feature_extractors
        feature_extractor_1 = self.feature_extraction_class(**self.feat_extract_dict)
        feature_extractor_2 = self.feature_extraction_class(do_resize=False, do_normalize=False)
        # create random PyTorch tensors
        image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, torchify=True)
        for image in image_inputs:
            self.assertIsInstance(image, torch.Tensor)

        # Test whether the method "pad_and_create_pixel_mask" and calling the feature extractor return the same tensors
        encoded_images_with_method = feature_extractor_1.pad_and_create_pixel_mask(image_inputs, return_tensors="pt")
        encoded_images = feature_extractor_2(image_inputs, return_tensors="pt")

        self.assertTrue(
            torch.allclose(encoded_images_with_method["pixel_values"], encoded_images["pixel_values"], atol=1e-4)
        )
        self.assertTrue(
            torch.allclose(encoded_images_with_method["pixel_mask"], encoded_images["pixel_mask"], atol=1e-4)
        )

@@ -0,0 +1,607 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch ViLT model. """

import unittest

from datasets import load_dataset

from transformers import ViltConfig, is_torch_available, is_vision_available
from transformers.file_utils import cached_property
from transformers.models.auto import get_values
from transformers.testing_utils import require_torch, require_vision, slow, torch_device

from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask


if is_torch_available():
    import torch

    from transformers import (
        MODEL_MAPPING,
        ViltForImageAndTextRetrieval,
        ViltForImagesAndTextClassification,
        ViltForMaskedLM,
        ViltForQuestionAnswering,
        ViltModel,
    )
    from transformers.models.vilt.modeling_vilt import VILT_PRETRAINED_MODEL_ARCHIVE_LIST

if is_vision_available():
    from PIL import Image

    from transformers import ViltProcessor


class ViltModelTester:
    def __init__(
        self,
        parent,
        batch_size=13,
        seq_length=7,
        image_size=30,
        patch_size=2,
        num_channels=3,
        is_training=True,
        use_input_mask=True,
        use_token_type_ids=True,
        use_labels=True,
        vocab_size=99,
        hidden_size=32,
        num_hidden_layers=5,
        num_attention_heads=4,
        intermediate_size=37,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=16,
        type_sequence_label_size=2,
        initializer_range=0.02,
        num_labels=3,
        scope=None,
        modality_type_vocab_size=2,
        add_multiple_images=False,
        num_images=-1,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.is_training = is_training
        self.use_input_mask = use_input_mask
        self.use_token_type_ids = use_token_type_ids
        self.use_labels = use_labels
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.type_sequence_label_size = type_sequence_label_size
        self.initializer_range = initializer_range
        self.num_labels = num_labels
        self.scope = scope
        self.modality_type_vocab_size = modality_type_vocab_size
        self.add_multiple_images = add_multiple_images
        self.num_images = num_images
        # we set the expected sequence length (which is used in several tests)
        # this is equal to the seq length of the text tokens + number of image patches + 1 for the CLS token
        self.expected_seq_len = self.seq_length + (self.image_size // self.patch_size) ** 2 + 1

    def prepare_config_and_inputs(self):
        input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
        if self.add_multiple_images:
            pixel_values = floats_tensor([self.batch_size, 2, self.num_channels, self.image_size, self.image_size])
        else:
            pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])

        input_mask = None
        if self.use_input_mask:
            input_mask = random_attention_mask([self.batch_size, self.seq_length])

        token_type_ids = None
        if self.use_token_type_ids:
            token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)

        token_labels = None
        if self.use_labels:
            token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)

        config = self.get_config()

        return (config, input_ids, token_type_ids, input_mask, pixel_values, token_labels)

    def get_config(self):
        return ViltConfig(
            image_size=self.image_size,
            patch_size=self.patch_size,
            num_channels=self.num_channels,
            vocab_size=self.vocab_size,
            hidden_size=self.hidden_size,
            num_hidden_layers=self.num_hidden_layers,
            num_attention_heads=self.num_attention_heads,
            intermediate_size=self.intermediate_size,
            hidden_act=self.hidden_act,
            hidden_dropout_prob=self.hidden_dropout_prob,
            attention_probs_dropout_prob=self.attention_probs_dropout_prob,
            max_position_embeddings=self.max_position_embeddings,
            type_vocab_size=self.type_vocab_size,
            is_decoder=False,
            initializer_range=self.initializer_range,
            num_labels=self.num_labels,
            modality_type_vocab_size=self.modality_type_vocab_size,
            num_images=self.num_images,
        )

    def create_and_check_model(
        self,
        config,
        input_ids,
        token_type_ids,
        input_mask,
        pixel_values,
        token_labels,
    ):
        model = ViltModel(config=config)
        model.to(torch_device)
        model.eval()
        result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, pixel_values=pixel_values)
        result = model(input_ids, token_type_ids=token_type_ids, pixel_values=pixel_values)
        result = model(input_ids, pixel_values=pixel_values)
        self.parent.assertEqual(
            result.last_hidden_state.shape, (self.batch_size, self.expected_seq_len, self.hidden_size)
        )

    def prepare_config_and_inputs_for_common(self):
        config_and_inputs = self.prepare_config_and_inputs()
        (
            config,
            input_ids,
            token_type_ids,
            input_mask,
            pixel_values,
            token_labels,
        ) = config_and_inputs
        inputs_dict = {
            "input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": input_mask,
            "pixel_values": pixel_values,
        }
        return config, inputs_dict

    def prepare_pixel_values(self):
        return floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
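The `expected_seq_len` the tester computes above drives the shape checks in the attention and hidden-state tests: text tokens plus the number of image patches plus one CLS token. A worked check of that formula:

```python
def expected_seq_len(seq_length, image_size, patch_size):
    # text tokens + number of image patches + 1 for the CLS token
    return seq_length + (image_size // patch_size) ** 2 + 1


# With the tester defaults seq_length=7, image_size=30, patch_size=2,
# there are 15 x 15 = 225 patches, so the expected length is 7 + 225 + 1.
assert expected_seq_len(7, 30, 2) == 233
```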
|
||||
|
||||
|
||||
@require_torch
|
||||
class ViltModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
|
||||
all_model_classes = (
|
||||
(
|
||||
ViltModel,
|
||||
ViltForQuestionAnswering,
|
||||
ViltForImageAndTextRetrieval,
|
||||
ViltForMaskedLM,
|
||||
)
|
||||
if is_torch_available()
|
||||
else ()
|
||||
)
|
||||
test_pruning = False
|
||||
test_headmasking = False
|
||||
test_torchscript = False
|
||||
|
||||
# ViltForMaskedLM, ViltForQuestionAnswering and ViltForImagesAndTextClassification require special treatment
|
||||
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
|
||||
inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
|
||||
|
||||
# if model_class.__name__ == "ViltForNaturalLanguageVisualReasonining":
|
||||
# inputs_dict["pixel_values"] = floats_tensor([self.model_tester.batch_size, self.model_tester.num_images, self.model_tester.num_channels, self.model_tester.image_size, self.model_tester.image_size])
|
||||
|
||||
if return_labels:
|
||||
if model_class.__name__ == "ViltForQuestionAnswering":
|
||||
inputs_dict["labels"] = torch.zeros(
|
||||
self.model_tester.batch_size, self.model_tester.num_labels, device=torch_device
|
||||
)
|
||||
elif model_class.__name__ == "ViltForMaskedLM":
|
||||
inputs_dict["labels"] = torch.zeros(
|
||||
(self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
|
||||
)
|
||||
elif model_class.__name__ == "ViltForImagesAndTextClassification":
|
||||
inputs_dict["labels"] = torch.zeros(
|
||||
self.model_tester.batch_size, dtype=torch.long, device=torch_device
|
||||
)
|
||||
|
||||
return inputs_dict
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = ViltModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=ViltConfig, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_training(self):
|
||||
if not self.model_tester.is_training:
|
||||
return
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
config.return_dict = True
|
||||
|
||||
if model_class.__name__ == "ViltForImagesAndTextClassification":
|
||||
config.modality_type_vocab_size = 3
|
||||
|
||||
# ViltForImageAndTextRetrieval doesn't support training for now
|
||||
if model_class in [*get_values(MODEL_MAPPING), ViltForImageAndTextRetrieval]:
|
||||
continue
|
||||
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.train()
|
||||
inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||
for k, v in inputs.items():
|
||||
print(k, v.shape)
|
||||
loss = model(**inputs).loss
|
||||
loss.backward()
|
||||
|
||||
def test_training_gradient_checkpointing(self):
|
||||
if not self.model_tester.is_training:
|
||||
return
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
config.use_cache = False
|
||||
config.return_dict = True
|
||||
|
||||
# ViltForImageAndTextRetrieval doesn't support training for now
|
||||
if (
|
||||
model_class in [*get_values(MODEL_MAPPING), ViltForImageAndTextRetrieval]
|
||||
or not model_class.supports_gradient_checkpointing
|
||||
):
|
||||
continue
|
||||
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.gradient_checkpointing_enable()
|
||||
model.train()
|
||||
inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||
loss = model(**inputs).loss
|
||||
loss.backward()
|
||||
|
||||
@unittest.skip(
|
||||
reason="""VilT samples image tokens from a multinomial distribution, resulting in not deterministic
|
||||
hidden states"""
|
||||
)
|
||||
def test_save_load(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
reason="""VilT samples image tokens from a multinomial distribution, resulting in not deterministic
|
||||
hidden states"""
|
||||
)
|
||||
def test_determinism(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
reason="""VilT samples image tokens from a multinomial distribution, resulting in not deterministic
|
||||
hidden states"""
|
||||
)
|
||||
def test_model_outputs_equivalence(self):
|
||||
pass
|
||||
|
||||
def test_attention_outputs(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
config.return_dict = True
|
||||
|
||||
seq_len = getattr(self.model_tester, "expected_seq_len", None)
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
inputs_dict["output_attentions"] = True
|
||||
inputs_dict["output_hidden_states"] = False
|
||||
config.return_dict = True
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||
attentions = outputs.attentions
|
||||
if model_class.__name__ == "ViltForImagesAndTextClassification":
|
||||
# attentions are a list of length num_images
|
||||
# each element contains the attentions of a particular image index
|
||||
self.assertEqual(len(attentions), self.model_tester.num_images)
|
||||
self.assertEqual(len(attentions[0]), self.model_tester.num_hidden_layers)
|
||||
else:
|
||||
                self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)

            # check that output_attentions also work using config
            del inputs_dict["output_attentions"]
            config.output_attentions = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.attentions
            if model_class.__name__ == "ViltForImagesAndTextClassification":
                # attentions are a list of length num_images
                # each element contains the attentions of a particular image index
                self.assertEqual(len(attentions), self.model_tester.num_images)
                self.assertEqual(len(attentions[0]), self.model_tester.num_hidden_layers)
            else:
                self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)

            if model_class.__name__ == "ViltForImagesAndTextClassification":
                self.assertListEqual(
                    list(attentions[0][0].shape[-3:]),
                    [self.model_tester.num_attention_heads, seq_len, seq_len],
                )
            else:
                self.assertListEqual(
                    list(attentions[0].shape[-3:]),
                    [self.model_tester.num_attention_heads, seq_len, seq_len],
                )
            out_len = len(outputs)

            # Check attention is always last and order is fine
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            self.assertEqual(out_len + 1, len(outputs))

            self_attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions

            if model_class.__name__ == "ViltForImagesAndTextClassification":
                self.assertEqual(len(self_attentions), self.model_tester.num_images)
                self.assertEqual(len(self_attentions[0]), self.model_tester.num_hidden_layers)
                self.assertListEqual(
                    list(self_attentions[0][0].shape[-3:]),
                    [self.model_tester.num_attention_heads, seq_len, seq_len],
                )
            else:
                self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
                self.assertListEqual(
                    list(self_attentions[0].shape[-3:]),
                    [self.model_tester.num_attention_heads, seq_len, seq_len],
                )

    def test_hidden_states_output(self):
        def check_hidden_states_output(inputs_dict, config, model_class):
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            hidden_states = outputs.encoder_hidden_states if config.is_encoder_decoder else outputs.hidden_states

            expected_num_layers = getattr(
                self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
            )
            if model_class.__name__ == "ViltForImagesAndTextClassification":
                # hidden_states are a list of length num_images
                # each element contains the hidden states of a particular image index
                self.assertEqual(len(hidden_states), self.model_tester.num_images)
                self.assertEqual(len(hidden_states[0]), expected_num_layers)
            else:
                self.assertEqual(len(hidden_states), expected_num_layers)

            seq_length = self.model_tester.expected_seq_len

            if model_class.__name__ == "ViltForImagesAndTextClassification":
                self.assertListEqual(
                    list(hidden_states[0][0].shape[-2:]),
                    [seq_length, self.model_tester.hidden_size],
                )
            else:
                self.assertListEqual(
                    list(hidden_states[0].shape[-2:]),
                    [seq_length, self.model_tester.hidden_size],
                )

        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            print("Model class:", model_class)
            inputs_dict["output_hidden_states"] = True
            check_hidden_states_output(inputs_dict, config, model_class)

            # check that output_hidden_states also work using config
            del inputs_dict["output_hidden_states"]
            config.output_hidden_states = True

            check_hidden_states_output(inputs_dict, config, model_class)

    def test_retain_grad_hidden_states_attentions(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.output_hidden_states = True
        config.output_attentions = True

        # no need to test all models as different heads yield the same functionality
        model_class = self.all_model_classes[0]
        model = model_class(config)
        model.to(torch_device)

        inputs = self._prepare_for_class(inputs_dict, model_class)

        outputs = model(**inputs)

        output = outputs[0]

        # Encoder-/Decoder-only models
        hidden_states = outputs.hidden_states[0]
        attentions = outputs.attentions[0]

        if model_class.__name__ == "ViltForImagesAndTextClassification":
            # hidden_states are a list of length num_images
            # each element contains the hidden states of a particular image index
            hidden_states[0].retain_grad()
            attentions[0].retain_grad()
        else:
            hidden_states.retain_grad()
            attentions.retain_grad()

        output.flatten()[0].backward(retain_graph=True)

        if model_class.__name__ == "ViltForImagesAndTextClassification":
            # hidden_states are a list of length num_images
            # each element contains the hidden states of a particular image index
            self.assertIsNotNone(hidden_states[0].grad)
            self.assertIsNotNone(attentions[0].grad)
        else:
            self.assertIsNotNone(hidden_states.grad)
            self.assertIsNotNone(attentions.grad)

    @slow
    def test_model_from_pretrained(self):
        for model_name in VILT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
            model = ViltModel.from_pretrained(model_name)
            self.assertIsNotNone(model)


@require_torch
class ViltForImagesAndTextClassificationModelTest(ViltModelTest, unittest.TestCase):

    all_model_classes = (ViltForImagesAndTextClassification,) if is_torch_available() else ()

    def setUp(self):
        self.model_tester = ViltModelTester(self, modality_type_vocab_size=3, add_multiple_images=True, num_images=2)
        self.config_tester = ConfigTester(self, config_class=ViltConfig, hidden_size=37)

    @unittest.skip("We only test the model that takes in multiple images")
    def test_model(self):
        pass


# We will verify our results on an image of cute cats
def prepare_img():
    image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
    return image


@require_torch
@require_vision
class ViltModelIntegrationTest(unittest.TestCase):
    @cached_property
    def default_processor(self):
        return ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa") if is_vision_available() else None

    @slow
    def test_inference_masked_lm(self):
        model = ViltForMaskedLM.from_pretrained("dandelin/vilt-b32-mlm").to(torch_device)

        processor = self.default_processor
        image = prepare_img()
        text = "a bunch of [MASK] laying on a [MASK]."
        inputs = processor(image, text, return_tensors="pt").to(torch_device)

        # forward pass
        with torch.no_grad():
            outputs = model(**inputs)

        # verify the logits
        expected_shape = torch.Size([1, 11, 30522])
        self.assertEqual(outputs.logits.shape, expected_shape)

        expected_slice = torch.tensor([-12.5061, -12.5123, -12.5174]).to(torch_device)
        self.assertTrue(torch.allclose(outputs.logits[0, 0, :3], expected_slice, atol=1e-4))

        # verify masked token prediction equals "cats"
        predicted_id = outputs.logits[0, 4, :].argmax(-1).item()
        assert processor.decode([predicted_id]) == "cats"

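The masked-LM check above picks the predicted token by taking the argmax over the vocabulary axis of the logits at the `[MASK]` position and decoding that single id. A minimal standalone sketch of that step, using a toy two-token, three-entry logits matrix instead of the real `(11, 30522)` output:

```python
# Toy sketch of the prediction step in test_inference_masked_lm.
# `logits` stands in for outputs.logits[0] (seq_len x vocab_size);
# the real vocabulary has 30522 entries (bert-base-uncased).
def argmax(row):
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=row.__getitem__)

logits = [[0.1, 2.5, -1.0], [0.0, -3.0, 4.2]]  # toy (seq_len=2, vocab=3)
predicted_ids = [argmax(row) for row in logits]
# predicted_ids -> [1, 2]; the test then decodes the id at the [MASK] index
```

In the test itself this is done with `outputs.logits[0, 4, :].argmax(-1).item()`, i.e. position 4 is the `[MASK]` slot that should decode to "cats".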
    @slow
    def test_inference_visual_question_answering(self):
        model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa").to(torch_device)

        processor = self.default_processor
        image = prepare_img()
        text = "How many cats are there?"
        inputs = processor(image, text, return_tensors="pt").to(torch_device)

        # forward pass
        with torch.no_grad():
            outputs = model(**inputs)

        # verify the logits
        expected_shape = torch.Size((1, 3129))
        self.assertEqual(outputs.logits.shape, expected_shape)

        expected_slice = torch.tensor([-15.9495, -18.1472, -10.3041]).to(torch_device)
        self.assertTrue(torch.allclose(outputs.logits[0, :3], expected_slice, atol=1e-4))

        # compute loss
        vqa_labels = [[2, 3, 155, 800]]
        vqa_scores = [[1.0, 0.3, 0.3, 0.3]]
        labels = torch.zeros(1, model.config.num_labels).to(torch_device)

        for i, (labels_example, scores_example) in enumerate(zip(vqa_labels, vqa_scores)):
            for l, s in zip(labels_example, scores_example):
                labels[i, l] = s

        # forward pass
        outputs = model(**inputs, labels=labels)

        # verify we have a positive loss
        self.assertTrue(outputs.loss > 0)

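The loss check above builds VQAv2-style soft targets: each question has several annotated answers, each with a score in [0, 1], scattered into a dense `(batch, num_labels)` tensor. A pure-Python sketch of that scatter (toy `num_labels`; the finetuned VQA head uses 3129 answer classes, and the test feeds the dense tensor to the model as `labels`):

```python
# Sketch of the soft-target construction used in
# test_inference_visual_question_answering, on plain lists instead of
# a torch tensor. num_labels=10 is a toy size, not the real 3129.
def build_vqa_targets(vqa_labels, vqa_scores, num_labels):
    """Scatter sparse (label, score) pairs into dense per-example rows."""
    targets = [[0.0] * num_labels for _ in vqa_labels]
    for i, (labels_example, scores_example) in enumerate(zip(vqa_labels, vqa_scores)):
        for label, score in zip(labels_example, scores_example):
            targets[i][label] = score
    return targets

targets = build_vqa_targets([[2, 3]], [[1.0, 0.3]], num_labels=10)
# targets[0] has 1.0 at index 2, 0.3 at index 3, zeros elsewhere
```

Because the targets are soft rather than one-hot, the model computes a binary-cross-entropy-style loss over all answer classes, which is why the test only asserts that the loss is positive.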
    @slow
    def test_inference_natural_language_visual_reasoning(self):
        model = ViltForImagesAndTextClassification.from_pretrained("dandelin/vilt-b32-finetuned-nlvr2").to(
            torch_device
        )

        processor = self.default_processor

        dataset = load_dataset("hf-internal-testing/fixtures_nlvr2", split="test")
        image1 = Image.open(dataset[0]["file"]).convert("RGB")
        image2 = Image.open(dataset[1]["file"]).convert("RGB")

        text = "The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing."
        encoding_1 = processor(image1, text, return_tensors="pt")
        encoding_2 = processor(image2, text, return_tensors="pt")

        pixel_values = torch.stack([encoding_1.pixel_values, encoding_2.pixel_values], dim=1)

        # forward pass
        outputs = model(
            input_ids=encoding_1.input_ids,
            pixel_values=pixel_values,
        )

        # verify the logits
        expected_shape = torch.Size([1, 2])
        self.assertEqual(outputs.logits.shape, expected_shape)

        expected_slice = torch.tensor([-2.4013, 2.9342]).to(torch_device)
        # the logits have shape (1, 2), so compare the two entries explicitly
        self.assertTrue(torch.allclose(outputs.logits[0, :2], expected_slice, atol=1e-4))
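The NLVR2 test above feeds two images per example: each processor call returns `pixel_values` of shape `(1, C, H, W)`, and `torch.stack(..., dim=1)` combines them into the `(batch, num_images, C, H, W)` layout that `ViltForImagesAndTextClassification` expects. A pure-Python sketch of that stacking on a new axis 1, with toy stand-ins instead of real tensors:

```python
# Toy illustration of torch.stack([pv1, pv2], dim=1): pair up the
# per-image entries of each batch item along a new second axis.
def stack_axis1(tensors):
    """Pure-Python stand-in for torch.stack(tensors, dim=1) on nested lists."""
    batch = len(tensors[0])
    return [[t[b] for t in tensors] for b in range(batch)]

pv1 = [[["img1-row"]]]  # toy stand-in for a (1, C, H, W) pixel_values tensor
pv2 = [[["img2-row"]]]
pixel_values = stack_axis1([pv1, pv2])
# batch size stays 1; each batch item now holds both images
```

With the real tensors, the stacked result has `pixel_values.shape[1] == 2`, matching the `num_images=2` the NLVR2 head was configured with.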
@@ -108,6 +108,10 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
 # should **not** be the rule.
 IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
     # models to ignore for model xxx mapping
+    "ViltForQuestionAnswering",
+    "ViltForImagesAndTextClassification",
+    "ViltForImageAndTextRetrieval",
+    "ViltForMaskedLM",
     "PerceiverForMultimodalAutoencoding",
     "PerceiverForOpticalFlow",
     "SegformerDecodeHead",