Add DETR (#11653)
* Squash all commits of modeling_detr_v7 branch into one
* Improve docs
* Fix tests
* Style
* Improve docs some more and fix most tests
* Fix slow tests of ViT, DeiT and DETR
* Improve replacement of batch norm
* Restructure timm backbone forward
* Make DetrForSegmentation support any timm backbone
* Fix name of output
* Address most comments by @LysandreJik
* Give better names for variables
* Conditional imports + timm in setup.py
* Address additional comments by @sgugger
* Make style, add require_timm and require_vision to tests
* Remove train_backbone attribute of DetrConfig, add methods to freeze/unfreeze backbone
* Add png files to fixtures
* Fix type hint
* Add timm to workflows
* Add `BatchNorm2d` to the weight initialization
* Fix retain_grad test
* Replace model checkpoints by Facebook namespace
* Fix name of checkpoint in test
* Add user-friendly message when scipy is not available
* Address most comments by @patrickvonplaten
* Remove return_intermediate_layers attribute of DetrConfig and simplify Joiner
* Better initialization
* Scipy is necessary to get sklearn metrics
* Rename TimmBackbone to DetrTimmConvEncoder and rename DetrJoiner to DetrConvModel
* Make style
* Improve docs and add 2 community notebooks

Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
This commit is contained in:
parent d14e0af274
commit d3eacbb829
@@ -139,7 +139,7 @@ jobs:
          - v0.4-{{ checksum "setup.py" }}
      - run: sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev
      - run: pip install --upgrade pip
      - run: pip install .[sklearn,torch,testing,sentencepiece,speech,vision]
      - run: pip install .[sklearn,torch,testing,sentencepiece,speech,vision,timm]
      - run: pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cpu.html
      - save_cache:
          key: v0.4-torch-{{ checksum "setup.py" }}

@@ -37,7 +37,7 @@ jobs:
        run: |
          apt -y update && apt install -y libsndfile1-dev
          pip install --upgrade pip
          pip install .[sklearn,testing,onnxruntime,sentencepiece,speech]
          pip install .[sklearn,testing,onnxruntime,sentencepiece,speech,vision,timm]

      - name: Are GPUs recognized by our DL frameworks
        run: |

@@ -121,7 +121,7 @@ jobs:
        run: |
          apt -y update && apt install -y libsndfile1-dev
          pip install --upgrade pip
          pip install .[sklearn,testing,onnxruntime,sentencepiece,speech]
          pip install .[sklearn,testing,onnxruntime,sentencepiece,speech,vision,timm]

      - name: Are GPUs recognized by our DL frameworks
        run: |

@@ -33,7 +33,7 @@ jobs:
        run: |
          apt -y update && apt install -y libsndfile1-dev
          pip install --upgrade pip
          pip install .[sklearn,testing,onnxruntime,sentencepiece,speech,integrations]
          pip install .[sklearn,testing,onnxruntime,sentencepiece,speech,vision,timm]

      - name: Are GPUs recognized by our DL frameworks
        run: |

@@ -155,7 +155,7 @@ jobs:
        run: |
          apt -y update && apt install -y libsndfile1-dev
          pip install --upgrade pip
          pip install .[sklearn,testing,onnxruntime,sentencepiece,speech,integrations]
          pip install .[sklearn,testing,onnxruntime,sentencepiece,speech,vision,timm]

      - name: Are GPUs recognized by our DL frameworks
        run: |
@@ -215,6 +215,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeBERTa-v2](https://huggingface.co/transformers/model_doc/deberta_v2.html)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
1. **[DeiT](https://huggingface.co/transformers/model_doc/deit.html)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
1. **[DETR](https://huggingface.co/transformers/model_doc/detr.html)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.
1. **[DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
1. **[DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/master/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT.
1. **[DPR](https://huggingface.co/transformers/model_doc/dpr.html)** (from Facebook) released with the paper [Dense Passage Retrieval

@@ -59,3 +59,5 @@ This page regroups resources around 🤗 Transformers developed by the community
| [Evaluate LUKE on CoNLL-2003, an important NER benchmark](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | How to evaluate *LukeForEntitySpanClassification* on the CoNLL-2003 dataset | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) |
| [Evaluate BigBird-Pegasus on PubMed dataset](https://github.com/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | How to evaluate *BigBirdPegasusForConditionalGeneration* on PubMed dataset | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) |
| [Speech Emotion Classification with Wav2Vec2](https://github.com/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | How to leverage a pretrained Wav2Vec2 model for Emotion Classification on the MEGA dataset | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) |
| [Detect objects in an image with DETR](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | How to use a trained *DetrForObjectDetection* model to detect objects in an image and visualize attention | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) |
| [Fine-tune DETR on a custom object detection dataset](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | How to fine-tune *DetrForObjectDetection* on a custom object detection dataset | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) |
@@ -153,128 +153,131 @@ Supported models
|
|||
19. :doc:`DeiT <model_doc/deit>` (from Facebook) released with the paper `Training data-efficient image transformers &
|
||||
distillation through attention <https://arxiv.org/abs/2012.12877>`__ by Hugo Touvron, Matthieu Cord, Matthijs
|
||||
Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
|
||||
20. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
|
||||
20. :doc:`DETR <model_doc/detr>` (from Facebook) released with the paper `End-to-End Object Detection with Transformers
|
||||
<https://arxiv.org/abs/2005.12872>`__ by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier,
|
||||
Alexander Kirillov, Sergey Zagoruyko.
|
||||
21. :doc:`DialoGPT <model_doc/dialogpt>` (from Microsoft Research) released with the paper `DialoGPT: Large-Scale
|
||||
Generative Pre-training for Conversational Response Generation <https://arxiv.org/abs/1911.00536>`__ by Yizhe
|
||||
Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan.
|
||||
21. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
|
||||
22. :doc:`DistilBERT <model_doc/distilbert>` (from HuggingFace), released together with the paper `DistilBERT, a
|
||||
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__ by Victor
|
||||
Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into `DistilGPT2
|
||||
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, RoBERTa into `DistilRoBERTa
|
||||
<https://github.com/huggingface/transformers/tree/master/examples/distillation>`__, Multilingual BERT into
|
||||
`DistilmBERT <https://github.com/huggingface/transformers/tree/master/examples/distillation>`__ and a German
|
||||
version of DistilBERT.
|
||||
22. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
|
||||
23. :doc:`DPR <model_doc/dpr>` (from Facebook) released with the paper `Dense Passage Retrieval for Open-Domain
|
||||
Question Answering <https://arxiv.org/abs/2004.04906>`__ by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick
|
||||
Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
|
||||
23. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
|
||||
24. :doc:`ELECTRA <model_doc/electra>` (from Google Research/Stanford University) released with the paper `ELECTRA:
|
||||
Pre-training text encoders as discriminators rather than generators <https://arxiv.org/abs/2003.10555>`__ by Kevin
|
||||
Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
|
||||
24. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
|
||||
25. :doc:`FlauBERT <model_doc/flaubert>` (from CNRS) released with the paper `FlauBERT: Unsupervised Language Model
|
||||
Pre-training for French <https://arxiv.org/abs/1912.05372>`__ by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne,
|
||||
Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
|
||||
25. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
|
||||
26. :doc:`Funnel Transformer <model_doc/funnel>` (from CMU/Google Brain) released with the paper `Funnel-Transformer:
|
||||
Filtering out Sequential Redundancy for Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__ by
|
||||
Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
|
||||
26. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
|
||||
27. :doc:`GPT <model_doc/gpt>` (from OpenAI) released with the paper `Improving Language Understanding by Generative
|
||||
Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans
|
||||
and Ilya Sutskever.
|
||||
27. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
|
||||
28. :doc:`GPT-2 <model_doc/gpt2>` (from OpenAI) released with the paper `Language Models are Unsupervised Multitask
|
||||
Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford*, Jeffrey Wu*, Rewon Child, David
|
||||
Luan, Dario Amodei** and Ilya Sutskever**.
|
||||
28. :doc:`GPT Neo <model_doc/gpt_neo>` (from EleutherAI) released in the repository `EleutherAI/gpt-neo
|
||||
29. :doc:`GPT Neo <model_doc/gpt_neo>` (from EleutherAI) released in the repository `EleutherAI/gpt-neo
|
||||
<https://github.com/EleutherAI/gpt-neo>`__ by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
|
||||
29. :doc:`I-BERT <model_doc/ibert>` (from Berkeley) released with the paper `I-BERT: Integer-only BERT Quantization
|
||||
30. :doc:`I-BERT <model_doc/ibert>` (from Berkeley) released with the paper `I-BERT: Integer-only BERT Quantization
|
||||
<https://arxiv.org/abs/2101.01321>`__ by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
|
||||
30. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
|
||||
31. :doc:`LayoutLM <model_doc/layoutlm>` (from Microsoft Research Asia) released with the paper `LayoutLM: Pre-training
|
||||
of Text and Layout for Document Image Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li,
|
||||
Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou.
|
||||
31. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
|
||||
32. :doc:`LED <model_doc/led>` (from AllenAI) released with the paper `Longformer: The Long-Document Transformer
|
||||
<https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
|
||||
32. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
|
||||
33. :doc:`Longformer <model_doc/longformer>` (from AllenAI) released with the paper `Longformer: The Long-Document
|
||||
Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.
|
||||
33. :doc:`LUKE <model_doc/luke>` (from Studio Ousia) released with the paper `LUKE: Deep Contextualized Entity
|
||||
34. :doc:`LUKE <model_doc/luke>` (from Studio Ousia) released with the paper `LUKE: Deep Contextualized Entity
|
||||
Representations with Entity-aware Self-attention <https://arxiv.org/abs/2010.01057>`__ by Ikuya Yamada, Akari Asai,
|
||||
Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
|
||||
34. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
|
||||
35. :doc:`LXMERT <model_doc/lxmert>` (from UNC Chapel Hill) released with the paper `LXMERT: Learning Cross-Modality
|
||||
Encoder Representations from Transformers for Open-Domain Question Answering <https://arxiv.org/abs/1908.07490>`__
|
||||
by Hao Tan and Mohit Bansal.
|
||||
35. :doc:`M2M100 <model_doc/m2m_100>` (from Facebook) released with the paper `Beyond English-Centric Multilingual
|
||||
36. :doc:`M2M100 <model_doc/m2m_100>` (from Facebook) released with the paper `Beyond English-Centric Multilingual
|
||||
Machine Translation <https://arxiv.org/abs/2010.11125>`__ by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi
|
||||
Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman
|
||||
Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
|
||||
36. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
|
||||
37. :doc:`MarianMT <model_doc/marian>` Machine translation models trained using `OPUS <http://opus.nlpl.eu/>`__ data by
|
||||
Jörg Tiedemann. The `Marian Framework <https://marian-nmt.github.io/>`__ is being developed by the Microsoft
|
||||
Translator Team.
|
||||
37. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
|
||||
38. :doc:`MBart <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Denoising Pre-training for
|
||||
Neural Machine Translation <https://arxiv.org/abs/2001.08210>`__ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li,
|
||||
Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
|
||||
38. :doc:`MBart-50 <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Translation with Extensible
|
||||
39. :doc:`MBart-50 <model_doc/mbart>` (from Facebook) released with the paper `Multilingual Translation with Extensible
|
||||
Multilingual Pretraining and Finetuning <https://arxiv.org/abs/2008.00401>`__ by Yuqing Tang, Chau Tran, Xian Li,
|
||||
Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan.
|
||||
39. :doc:`Megatron-BERT <model_doc/megatron_bert>` (from NVIDIA) released with the paper `Megatron-LM: Training
|
||||
40. :doc:`Megatron-BERT <model_doc/megatron_bert>` (from NVIDIA) released with the paper `Megatron-LM: Training
|
||||
Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
|
||||
Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
|
||||
40. :doc:`Megatron-GPT2 <model_doc/megatron_gpt2>` (from NVIDIA) released with the paper `Megatron-LM: Training
|
||||
41. :doc:`Megatron-GPT2 <model_doc/megatron_gpt2>` (from NVIDIA) released with the paper `Megatron-LM: Training
|
||||
Multi-Billion Parameter Language Models Using Model Parallelism <https://arxiv.org/abs/1909.08053>`__ by Mohammad
|
||||
Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
|
||||
41. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
|
||||
42. :doc:`MPNet <model_doc/mpnet>` (from Microsoft Research) released with the paper `MPNet: Masked and Permuted
|
||||
Pre-training for Language Understanding <https://arxiv.org/abs/2004.09297>`__ by Kaitao Song, Xu Tan, Tao Qin,
|
||||
Jianfeng Lu, Tie-Yan Liu.
|
||||
42. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
|
||||
43. :doc:`MT5 <model_doc/mt5>` (from Google AI) released with the paper `mT5: A massively multilingual pre-trained
|
||||
text-to-text transformer <https://arxiv.org/abs/2010.11934>`__ by Linting Xue, Noah Constant, Adam Roberts, Mihir
|
||||
Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
|
||||
43. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
|
||||
44. :doc:`Pegasus <model_doc/pegasus>` (from Google) released with the paper `PEGASUS: Pre-training with Extracted
|
||||
Gap-sentences for Abstractive Summarization <https://arxiv.org/abs/1912.08777>`__ by Jingqing Zhang, Yao Zhao,
|
||||
Mohammad Saleh and Peter J. Liu.
|
||||
44. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
|
||||
45. :doc:`ProphetNet <model_doc/prophetnet>` (from Microsoft Research) released with the paper `ProphetNet: Predicting
|
||||
Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan, Weizhen Qi,
|
||||
Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
||||
45. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
|
||||
46. :doc:`Reformer <model_doc/reformer>` (from Google Research) released with the paper `Reformer: The Efficient
|
||||
Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya.
|
||||
46. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
|
||||
47. :doc:`RoBERTa <model_doc/roberta>` (from Facebook), released together with the paper a `Robustly Optimized BERT
|
||||
Pretraining Approach <https://arxiv.org/abs/1907.11692>`__ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
|
||||
Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||
47. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper a `RoFormer:
|
||||
48. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper a `RoFormer:
|
||||
Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
|
||||
Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
||||
48. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
|
||||
49. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
|
||||
`fairseq S2T: Fast Speech-to-Text Modeling with fairseq <https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun
|
||||
Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
||||
49. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
|
||||
50. :doc:`SqueezeBert <model_doc/squeezebert>` released with the paper `SqueezeBERT: What can computer vision teach NLP
|
||||
about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola, Albert E. Shaw, Ravi
|
||||
Krishna, and Kurt W. Keutzer.
|
||||
50. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
|
||||
51. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
|
||||
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
|
||||
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
|
||||
51. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
|
||||
52. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
|
||||
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
|
||||
Francesco Piccinno and Julian Martin Eisenschlos.
|
||||
52. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
|
||||
53. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
|
||||
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
|
||||
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
|
||||
53. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
|
||||
54. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
|
||||
Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
|
||||
Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
|
||||
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
|
||||
54. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
|
||||
55. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
|
||||
Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
|
||||
Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
|
||||
55. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
||||
56. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
||||
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
|
||||
Zhou, Abdelrahman Mohamed, Michael Auli.
|
||||
56. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
||||
57. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
||||
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
|
||||
57. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
||||
58. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
||||
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
|
||||
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
||||
58. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
||||
59. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
||||
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
|
||||
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
|
||||
Zettlemoyer and Veselin Stoyanov.
|
||||
59. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
||||
60. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
||||
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
|
||||
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
||||
60. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
|
||||
61. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
|
||||
Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
|
||||
Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
|
||||
|
||||
|
@ -318,6 +321,8 @@ Flax), PyTorch, and/or TensorFlow.
|
|||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| ConvBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| DPR | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| DeBERTa | ✅ | ✅ | ✅ | ❌ | ❌ |
|
||||
|
@ -502,6 +507,7 @@ Flax), PyTorch, and/or TensorFlow.
|
|||
model_doc/deberta
|
||||
model_doc/deberta_v2
|
||||
model_doc/deit
|
||||
model_doc/detr
|
||||
model_doc/dialogpt
|
||||
model_doc/distilbert
|
||||
model_doc/dpr
|
||||
|
|
|
@@ -0,0 +1,202 @@
|
|||
..
|
||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
DETR
|
||||
-----------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
Overview
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The DETR model was proposed in `End-to-End Object Detection with Transformers <https://arxiv.org/abs/2005.12872>`__ by
|
||||
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko. DETR
|
||||
consists of a convolutional backbone followed by an encoder-decoder Transformer which can be trained end-to-end for
|
||||
object detection. It greatly simplifies much of the complexity of models like Faster R-CNN and Mask R-CNN, which rely
on things like region proposals, a non-maximum suppression procedure and anchor generation. Moreover, DETR can also be
|
||||
naturally extended to perform panoptic segmentation, by simply adding a mask head on top of the decoder outputs.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the
|
||||
detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression
|
||||
procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the
|
||||
new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via
|
||||
bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries,
|
||||
DETR reasons about the relations of the objects and the global image context to directly output the final set of
|
||||
predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many
|
||||
other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and
|
||||
highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily
|
||||
generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive
|
||||
baselines.*
|
||||
|
||||
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
|
||||
<https://github.com/facebookresearch/detr>`__.
|
||||
|
||||
Here's a TLDR explaining how :class:`~transformers.DetrForObjectDetection` works:
|
||||
|
||||
First, an image is sent through a pre-trained convolutional backbone (in the paper, the authors use
|
||||
ResNet-50/ResNet-101). Let's assume we also add a batch dimension. This means that the input to the backbone is a
|
||||
tensor of shape :obj:`(batch_size, 3, height, width)`, assuming the image has 3 color channels (RGB). The CNN backbone
|
||||
outputs a new lower-resolution feature map, typically of shape :obj:`(batch_size, 2048, height/32, width/32)`. This is
|
||||
then projected to match the hidden dimension of the Transformer of DETR, which is :obj:`256` by default, using a
|
||||
:obj:`nn.Conv2D` layer. So now, we have a tensor of shape :obj:`(batch_size, 256, height/32, width/32).` Next, the
|
||||
feature map is flattened and transposed to obtain a tensor of shape :obj:`(batch_size, seq_len, d_model)` =
|
||||
:obj:`(batch_size, width/32*height/32, 256)`. So a difference with NLP models is that the sequence length is actually
|
||||
longer than usual, but with a smaller :obj:`d_model` (which in NLP is typically 768 or higher).
|
||||
|
||||
Next, this is sent through the encoder, outputting :obj:`encoder_hidden_states` of the same shape (you can consider
|
||||
these as image features). Next, so-called **object queries** are sent through the decoder. This is a tensor of shape
|
||||
:obj:`(batch_size, num_queries, d_model)`, with :obj:`num_queries` typically set to 100 and initialized with zeros.
|
||||
These input embeddings are learnt positional encodings that the authors refer to as object queries, and similarly to
|
||||
the encoder, they are added to the input of each attention layer. Each object query will look for a particular object
|
||||
in the image. The decoder updates these embeddings through multiple self-attention and encoder-decoder attention layers
|
||||
to output :obj:`decoder_hidden_states` of the same shape: :obj:`(batch_size, num_queries, d_model)`. Next, two heads
|
||||
are added on top for object detection: a linear layer for classifying each object query into one of the objects or "no
|
||||
object", and a MLP to predict bounding boxes for each query.
|
||||
|
||||
The model is trained using a **bipartite matching loss**: so what we actually do is compare the predicted classes +
|
||||
bounding boxes of each of the N = 100 object queries to the ground truth annotations, padded up to the same length N
|
||||
(so if an image only contains 4 objects, 96 annotations will just have a "no object" as class and "no bounding box" as
|
||||
bounding box). The `Hungarian matching algorithm <https://en.wikipedia.org/wiki/Hungarian_algorithm>`__ is used to find
|
||||
an optimal one-to-one mapping of each of the N queries to each of the N annotations. Next, standard cross-entropy (for
|
||||
the classes) and a linear combination of the L1 and `generalized IoU loss <https://giou.stanford.edu/>`__ (for the
|
||||
bounding boxes) are used to optimize the parameters of the model.
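
As a rough illustration of the matching step, the sketch below builds a toy cost matrix from class probabilities only
and solves it with SciPy's Hungarian algorithm (the actual loss also adds L1 and generalized IoU costs on the boxes):

.. code-block:: python

    import torch
    from scipy.optimize import linear_sum_assignment

    num_queries, num_classes = 100, 91
    logits = torch.randn(num_queries, num_classes + 1)  # last class index means "no object"
    target_classes = torch.tensor([17, 17, 63, 75])     # 4 annotated objects in this image

    # cost of assigning query i to annotation j: the negative probability of the annotation's class
    probs = logits.softmax(-1)
    cost_matrix = -probs[:, target_classes]             # shape (100, 4)

    query_indices, target_indices = linear_sum_assignment(cost_matrix.numpy())
    # each annotation is now matched to exactly one query; the remaining 96 queries
    # are supervised to predict the "no object" class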
|
||||
|
||||
DETR can be naturally extended to perform panoptic segmentation (which unifies semantic segmentation and instance
|
||||
segmentation). :class:`~transformers.DetrForSegmentation` adds a segmentation mask head on top of
|
||||
:class:`~transformers.DetrForObjectDetection`. The mask head can be trained either jointly, or in a two-step process,
|
||||
where one first trains a :class:`~transformers.DetrForObjectDetection` model to detect bounding boxes around both
|
||||
"things" (instances) and "stuff" (background things like trees, roads, sky), then freeze all the weights and train only
|
||||
the mask head for 25 epochs. Experimentally, these two approaches give similar results. Note that predicting boxes is
|
||||
required for the training to be possible, since the Hungarian matching is computed using distances between boxes.
|
||||
|
||||
Tips:
|
||||
|
||||
- DETR uses so-called **object queries** to detect objects in an image. The number of queries determines the maximum
|
||||
number of objects that can be detected in a single image, and is set to 100 by default (see parameter
|
||||
:obj:`num_queries` of :class:`~transformers.DetrConfig`). Note that it's good to have some slack (in COCO, the
|
||||
authors used 100, while the maximum number of objects in a COCO image is ~70).
|
||||
- The decoder of DETR updates the query embeddings in parallel. This is different from language models like GPT-2,
|
||||
which use autoregressive decoding instead of parallel. Hence, no causal attention mask is used.
|
||||
- DETR adds position embeddings to the hidden states at each self-attention and cross-attention layer before projecting
|
||||
to queries and keys. For the position embeddings of the image, one can choose between fixed sinusoidal or learned
|
||||
absolute position embeddings. By default, the parameter :obj:`position_embedding_type` of
|
||||
:class:`~transformers.DetrConfig` is set to :obj:`"sine"`.
|
||||
- During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help
|
||||
the model output the correct number of objects of each class. If you set the parameter :obj:`auxiliary_loss` of
|
||||
:class:`~transformers.DetrConfig` to :obj:`True`, then prediction feedforward neural networks and Hungarian losses
|
||||
are added after each decoder layer (with the FFNs sharing parameters).
|
||||
- If you want to train the model in a distributed environment across multiple nodes, then one should update the
|
||||
`num_boxes` variable in the `DetrLoss` class of `modeling_detr.py`. When training on multiple nodes, this should be
|
||||
set to the average number of target boxes across all nodes, as can be seen in the original implementation `here
|
||||
<https://github.com/facebookresearch/detr/blob/a54b77800eb8e64e3ad0d8237789fcbf2f8350c5/models/detr.py#L227-L232>`__.
|
||||
- :class:`~transformers.DetrForObjectDetection` and :class:`~transformers.DetrForSegmentation` can be initialized with
|
||||
any convolutional backbone available in the `timm library <https://github.com/rwightman/pytorch-image-models>`__.
|
||||
Initializing with a MobileNet backbone for example can be done by setting the :obj:`backbone` attribute of
|
||||
:class:`~transformers.DetrConfig` to :obj:`"tf_mobilenetv3_small_075"`, and then initializing the model with that
|
||||
config.
|
||||
- DETR resizes the input images such that the shortest side is at least a certain amount of pixels while the longest is
|
||||
at most 1333 pixels. At training time, scale augmentation is used such that the shortest side is randomly set to at
|
||||
least 480 and at most 800 pixels. At inference time, the shortest side is set to 800. One can use
|
||||
:class:`~transformers.DetrFeatureExtractor` to prepare images (and optional annotations in COCO format) for the
|
||||
model. Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the
|
||||
largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding.
|
||||
Alternatively, one can also define a custom :obj:`collate_fn` in order to batch images together, using
:meth:`~transformers.DetrFeatureExtractor.pad_and_create_pixel_mask` (a sketch of such a collate function is shown
after this list).
|
||||
- The size of the images will determine the amount of memory being used, and will thus determine the :obj:`batch_size`.
|
||||
It is advised to use a batch size of 2 per GPU. See `this Github thread
|
||||
<https://github.com/facebookresearch/detr/issues/150>`__ for more info.
|
||||
|
||||
As a summary, consider the following table:
|
||||
|
||||
+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
|
||||
| **Task** | **Object detection** | **Instance segmentation** | **Panoptic segmentation** |
|
||||
+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
|
||||
| **Description** | Predicting bounding boxes and class labels around | Predicting masks around objects (i.e. instances) in an image | Predicting masks around both objects (i.e. instances) as well as |
|
||||
| | objects in an image | | "stuff" (i.e. background things like trees and roads) in an image |
|
||||
+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
|
||||
| **Model** | :class:`~transformers.DetrForObjectDetection` | :class:`~transformers.DetrForSegmentation` | :class:`~transformers.DetrForSegmentation` |
|
||||
+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
|
||||
| **Example dataset** | COCO detection | COCO detection, | COCO panoptic |
|
||||
| | | COCO panoptic | |
|
||||
+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
|
||||
| **Format of annotations to provide to** | {‘image_id’: int, | {‘image_id’: int, | {‘file_name’: str, |
| :class:`~transformers.DetrFeatureExtractor` | ‘annotations’: List[Dict]}, each Dict being a COCO | ‘annotations’: [List[Dict]] } (in case of COCO detection) | ‘image_id’: int, |
|
||||
| | object annotation (containing keys "image_id", | | ‘segments_info’: List[Dict] } |
|
||||
| | | or | |
|
||||
| | | | and masks_path (path to directory containing PNG files of the masks) |
|
||||
| | | {‘file_name’: str, | |
|
||||
| | | ‘image_id’: int, | |
|
||||
| | | ‘segments_info’: List[Dict]} (in case of COCO panoptic) | |
|
||||
+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
|
||||
| **Postprocessing** (i.e. converting the | :meth:`~transformers.DetrFeatureExtractor.post_process` | :meth:`~transformers.DetrFeatureExtractor.post_process_segmentation` | :meth:`~transformers.DetrFeatureExtractor.post_process_segmentation`, |
|
||||
| output of the model to COCO API) | | | :meth:`~transformers.DetrFeatureExtractor.post_process_panoptic` |
|
||||
+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
|
||||
| **Evaluators** | :obj:`CocoEvaluator` with iou_types = “bbox” | :obj:`CocoEvaluator` with iou_types = “bbox”, “segm” | :obj:`CocoEvaluator` with iou_types = “bbox”, “segm” |
|
||||
| | | | |
|
||||
| | | | :obj:`PanopticEvaluator` |
|
||||
+---------------------------------------------+---------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------------+
|
||||
|
||||
In short, one should prepare the data either in COCO detection or COCO panoptic format, then use
|
||||
:class:`~transformers.DetrFeatureExtractor` to create :obj:`pixel_values`, :obj:`pixel_mask` and optional
|
||||
:obj:`labels`, which can then be used to train (or fine-tune) a model. For evaluation, one should first convert the
|
||||
outputs of the model using one of the postprocessing methods of :class:`~transformers.DetrFeatureExtractor`. These can
|
||||
be provided to either :obj:`CocoEvaluator` or :obj:`PanopticEvaluator`, which allow you to calculate metrics like
|
||||
mean Average Precision (mAP) and Panoptic Quality (PQ). The latter objects are implemented in the `original repository
|
||||
<https://github.com/facebookresearch/detr>`__. See the example notebooks for more info regarding evaluation.
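
As a minimal example, an object-detection inference pass followed by post-processing could look like this (a sketch:
it assumes a trained checkpoint such as :obj:`facebook/detr-resnet-50` is available on the hub, :obj:`cats.png` stands
in for any local image, and :meth:`~transformers.DetrFeatureExtractor.post_process` takes the raw model outputs
together with the original image sizes):

.. code-block:: python

    import torch
    from PIL import Image
    from transformers import DetrFeatureExtractor, DetrForObjectDetection

    feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

    image = Image.open("cats.png")
    encoding = feature_extractor(images=image, return_tensors="pt")  # pixel_values and pixel_mask

    with torch.no_grad():
        outputs = model(**encoding)

    # convert the raw outputs to COCO-style detections for the original image size
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = feature_extractor.post_process(outputs, target_sizes)[0]
    print(results["scores"].shape, results["labels"].shape, results["boxes"].shape)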
|
||||
|
||||
|
||||
DETR specific outputs
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.models.detr.modeling_detr.DetrModelOutput
|
||||
:members:
|
||||
|
||||
.. autoclass:: transformers.models.detr.modeling_detr.DetrObjectDetectionOutput
|
||||
:members:
|
||||
|
||||
.. autoclass:: transformers.models.detr.modeling_detr.DetrSegmentationOutput
|
||||
:members:
|
||||
|
||||
|
||||
DetrConfig
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.DetrConfig
|
||||
:members:
|
||||
|
||||
|
||||
DetrFeatureExtractor
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.DetrFeatureExtractor
|
||||
:members: __call__, pad_and_create_pixel_mask, post_process, post_process_segmentation, post_process_panoptic
|
||||
|
||||
|
||||
DetrModel
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.DetrModel
|
||||
:members: forward
|
||||
|
||||
|
||||
DetrForObjectDetection
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.DetrForObjectDetection
|
||||
:members: forward
|
||||
|
||||
|
||||
DetrForSegmentation
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.DetrForSegmentation
|
||||
:members: forward
|
setup.py
@@ -142,6 +142,7 @@ _deps = [
    "tensorflow-cpu>=2.3",
    "tensorflow>=2.3",
    "timeout-decorator",
    "timm",
    "tokenizers>=0.10.1,<0.11",
    "torch>=1.0",
    "torchaudio",

@@ -249,6 +250,7 @@ extras["integrations"] = extras["optuna"] + extras["ray"]
extras["serving"] = deps_list("pydantic", "uvicorn", "fastapi", "starlette")
extras["speech"] = deps_list("soundfile", "torchaudio")
extras["vision"] = deps_list("Pillow")
extras["timm"] = deps_list("timm")

extras["sentencepiece"] = deps_list("sentencepiece", "protobuf")
extras["testing"] = (

@@ -270,6 +272,7 @@ extras["all"] = (
    + extras["speech"]
    + extras["vision"]
    + extras["integrations"]
    + extras["timm"]
)

extras["docs_specific"] = deps_list(
|
|
|
@ -47,6 +47,7 @@ from .file_utils import (
|
|||
is_sentencepiece_available,
|
||||
is_speech_available,
|
||||
is_tf_available,
|
||||
is_timm_available,
|
||||
is_tokenizers_available,
|
||||
is_torch_available,
|
||||
is_vision_available,
|
||||
|
@ -101,10 +102,12 @@ _import_structure = {
|
|||
"is_flax_available",
|
||||
"is_psutil_available",
|
||||
"is_py3nvml_available",
|
||||
"is_scipy_available",
|
||||
"is_sentencepiece_available",
|
||||
"is_sklearn_available",
|
||||
"is_speech_available",
|
||||
"is_tf_available",
|
||||
"is_timm_available",
|
||||
"is_tokenizers_available",
|
||||
"is_torch_available",
|
||||
"is_torch_tpu_available",
|
||||
|
@ -180,6 +183,7 @@ _import_structure = {
|
|||
"models.deberta": ["DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaConfig", "DebertaTokenizer"],
|
||||
"models.deberta_v2": ["DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaV2Config"],
|
||||
"models.deit": ["DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "DeiTConfig"],
|
||||
"models.detr": ["DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "DetrConfig"],
|
||||
"models.distilbert": ["DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "DistilBertConfig", "DistilBertTokenizer"],
|
||||
"models.dpr": [
|
||||
"DPR_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
||||
|
@ -405,6 +409,7 @@ if is_vision_available():
|
|||
_import_structure["models.clip"].append("CLIPFeatureExtractor")
|
||||
_import_structure["models.clip"].append("CLIPProcessor")
|
||||
_import_structure["models.deit"].append("DeiTFeatureExtractor")
|
||||
_import_structure["models.detr"].append("DetrFeatureExtractor")
|
||||
_import_structure["models.vit"].append("ViTFeatureExtractor")
|
||||
else:
|
||||
from .utils import dummy_vision_objects
|
||||
|
@ -413,6 +418,23 @@ else:
|
|||
name for name in dir(dummy_vision_objects) if not name.startswith("_")
|
||||
]
|
||||
|
||||
# Timm-backed objects
|
||||
if is_timm_available() and is_vision_available():
|
||||
_import_structure["models.detr"].extend(
|
||||
[
|
||||
"DETR_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"DetrForObjectDetection",
|
||||
"DetrForSegmentation",
|
||||
"DetrModel",
|
||||
]
|
||||
)
|
||||
else:
|
||||
from .utils import dummy_timm_objects
|
||||
|
||||
_import_structure["utils.dummy_timm_objects"] = [
|
||||
name for name in dir(dummy_timm_objects) if not name.startswith("_")
|
||||
]
|
||||
|
||||
# PyTorch-backed objects
|
||||
if is_torch_available():
|
||||
_import_structure["benchmark.benchmark"] = ["PyTorchBenchmark"]
|
||||
|
@ -489,6 +511,7 @@ if is_torch_available():
|
|||
"MODEL_FOR_MASKED_LM_MAPPING",
|
||||
"MODEL_FOR_MULTIPLE_CHOICE_MAPPING",
|
||||
"MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING",
|
||||
"MODEL_FOR_OBJECT_DETECTION_MAPPING",
|
||||
"MODEL_FOR_PRETRAINING_MAPPING",
|
||||
"MODEL_FOR_QUESTION_ANSWERING_MAPPING",
|
||||
"MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING",
|
||||
|
@ -1587,10 +1610,12 @@ if TYPE_CHECKING:
|
|||
is_flax_available,
|
||||
is_psutil_available,
|
||||
is_py3nvml_available,
|
||||
is_scipy_available,
|
||||
is_sentencepiece_available,
|
||||
is_sklearn_available,
|
||||
is_speech_available,
|
||||
is_tf_available,
|
||||
is_timm_available,
|
||||
is_tokenizers_available,
|
||||
is_torch_available,
|
||||
is_torch_tpu_available,
|
||||
|
@ -1666,6 +1691,7 @@ if TYPE_CHECKING:
|
|||
from .models.deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig, DebertaTokenizer
|
||||
from .models.deberta_v2 import DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaV2Config
|
||||
from .models.deit import DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, DeiTConfig
|
||||
from .models.detr import DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, DetrConfig
|
||||
from .models.distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig, DistilBertTokenizer
|
||||
from .models.dpr import (
|
||||
DPR_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
|
@ -1863,13 +1889,23 @@ if TYPE_CHECKING:
|
|||
from .image_utils import ImageFeatureExtractionMixin
|
||||
from .models.clip import CLIPFeatureExtractor, CLIPProcessor
|
||||
from .models.deit import DeiTFeatureExtractor
|
||||
from .models.detr import DetrFeatureExtractor
|
||||
from .models.vit import ViTFeatureExtractor
|
||||
else:
|
||||
from .utils.dummy_vision_objects import *
|
||||
|
||||
# Modeling
|
||||
if is_torch_available():
|
||||
if is_timm_available() and is_vision_available():
|
||||
from .models.detr import (
|
||||
DETR_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
DetrForObjectDetection,
|
||||
DetrForSegmentation,
|
||||
DetrModel,
|
||||
)
|
||||
else:
|
||||
from .utils.dummy_timm_objects import *
|
||||
|
||||
if is_torch_available():
|
||||
# Benchmarks
|
||||
from .benchmark.benchmark import PyTorchBenchmark
|
||||
from .benchmark.benchmark_args import PyTorchBenchmarkArguments
|
||||
|
@ -1939,6 +1975,7 @@ if TYPE_CHECKING:
|
|||
MODEL_FOR_MASKED_LM_MAPPING,
|
||||
MODEL_FOR_MULTIPLE_CHOICE_MAPPING,
|
||||
MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING,
|
||||
MODEL_FOR_OBJECT_DETECTION_MAPPING,
|
||||
MODEL_FOR_PRETRAINING_MAPPING,
|
||||
MODEL_FOR_QUESTION_ANSWERING_MAPPING,
|
||||
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
|
||||
|
|
|
@ -59,6 +59,7 @@ deps = {
|
|||
"tensorflow-cpu": "tensorflow-cpu>=2.3",
|
||||
"tensorflow": "tensorflow>=2.3",
|
||||
"timeout-decorator": "timeout-decorator",
|
||||
"timm": "timm",
|
||||
"tokenizers": "tokenizers>=0.10.1,<0.11",
|
||||
"torch": "torch>=1.0",
|
||||
"torchaudio": "torchaudio",
|
||||
|
|
|
@ -174,6 +174,14 @@ except importlib_metadata.PackageNotFoundError:
|
|||
_soundfile_available = False
|
||||
|
||||
|
||||
_timm_available = importlib.util.find_spec("timm") is not None
|
||||
try:
|
||||
_timm_version = importlib_metadata.version("timm")
|
||||
logger.debug(f"Successfully imported timm version {_timm_version}")
|
||||
except importlib_metadata.PackageNotFoundError:
|
||||
_timm_available = False
|
||||
|
||||
|
||||
_torchaudio_available = importlib.util.find_spec("torchaudio") is not None
|
||||
try:
|
||||
_torchaudio_version = importlib_metadata.version("torchaudio")
|
||||
|
@ -317,12 +325,14 @@ def is_faiss_available():
|
|||
return _faiss_available
|
||||
|
||||
|
||||
def is_scipy_available():
|
||||
return importlib.util.find_spec("scipy") is not None
|
||||
|
||||
|
||||
def is_sklearn_available():
|
||||
if importlib.util.find_spec("sklearn") is None:
|
||||
return False
|
||||
if importlib.util.find_spec("scipy") is None:
|
||||
return False
|
||||
return importlib.util.find_spec("sklearn.metrics") and importlib.util.find_spec("scipy.stats")
|
||||
return is_scipy_available() and importlib.util.find_spec("sklearn.metrics")
|
||||
|
||||
|
||||
def is_sentencepiece_available():
|
||||
|
@ -411,6 +421,10 @@ def is_soundfile_availble():
|
|||
return _soundfile_available
|
||||
|
||||
|
||||
def is_timm_available():
|
||||
return _timm_available
|
||||
|
||||
|
||||
def is_torchaudio_available():
|
||||
return _torchaudio_available
|
||||
|
||||
|
@ -536,12 +550,24 @@ explained here: https://pandas.pydata.org/pandas-docs/stable/getting_started/ins
|
|||
"""
|
||||
|
||||
|
||||
# docstyle-ignore
|
||||
SCIPY_IMPORT_ERROR = """
|
||||
{0} requires the scipy library but it was not found in your environment. You can install it with pip:
|
||||
`pip install scipy`
|
||||
"""
|
||||
|
||||
|
||||
# docstyle-ignore
|
||||
SPEECH_IMPORT_ERROR = """
|
||||
{0} requires the torchaudio library but it was not found in your environment. You can install it with pip:
|
||||
`pip install torchaudio`
|
||||
"""
|
||||
|
||||
# docstyle-ignore
|
||||
TIMM_IMPORT_ERROR = """
|
||||
{0} requires the timm library but it was not found in your environment. You can install it with pip:
|
||||
`pip install timm`
|
||||
"""
|
||||
|
||||
# docstyle-ignore
|
||||
VISION_IMPORT_ERROR = """
|
||||
|
@ -562,9 +588,11 @@ BACKENDS_MAPPING = OrderedDict(
|
|||
("sklearn", (is_sklearn_available, SKLEARN_IMPORT_ERROR)),
|
||||
("speech", (is_speech_available, SPEECH_IMPORT_ERROR)),
|
||||
("tf", (is_tf_available, TENSORFLOW_IMPORT_ERROR)),
|
||||
("timm", (is_timm_available, TIMM_IMPORT_ERROR)),
|
||||
("tokenizers", (is_tokenizers_available, TOKENIZERS_IMPORT_ERROR)),
|
||||
("torch", (is_torch_available, PYTORCH_IMPORT_ERROR)),
|
||||
("vision", (is_vision_available, VISION_IMPORT_ERROR)),
|
||||
("scipy", (is_scipy_available, SCIPY_IMPORT_ERROR)),
|
||||
]
|
||||
)
|
||||
|
||||
|
|
|
@ -36,6 +36,7 @@ from . import (
|
|||
ctrl,
|
||||
deberta,
|
||||
deit,
|
||||
detr,
|
||||
dialogpt,
|
||||
distilbert,
|
||||
dpr,
|
||||
|
|
|
@ -35,6 +35,7 @@ if is_torch_available():
|
|||
"MODEL_FOR_MASKED_LM_MAPPING",
|
||||
"MODEL_FOR_MULTIPLE_CHOICE_MAPPING",
|
||||
"MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING",
|
||||
"MODEL_FOR_OBJECT_DETECTION_MAPPING",
|
||||
"MODEL_FOR_PRETRAINING_MAPPING",
|
||||
"MODEL_FOR_QUESTION_ANSWERING_MAPPING",
|
||||
"MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING",
|
||||
|
@ -119,6 +120,7 @@ if TYPE_CHECKING:
|
|||
MODEL_FOR_MASKED_LM_MAPPING,
|
||||
MODEL_FOR_MULTIPLE_CHOICE_MAPPING,
|
||||
MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING,
|
||||
MODEL_FOR_OBJECT_DETECTION_MAPPING,
|
||||
MODEL_FOR_PRETRAINING_MAPPING,
|
||||
MODEL_FOR_QUESTION_ANSWERING_MAPPING,
|
||||
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
|
||||
|
|
|
@ -39,6 +39,7 @@ from ..ctrl.configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLCo
|
|||
from ..deberta.configuration_deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig
|
||||
from ..deberta_v2.configuration_deberta_v2 import DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaV2Config
|
||||
from ..deit.configuration_deit import DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, DeiTConfig
|
||||
from ..detr.configuration_detr import DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, DetrConfig
|
||||
from ..distilbert.configuration_distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig
|
||||
from ..dpr.configuration_dpr import DPR_PRETRAINED_CONFIG_ARCHIVE_MAP, DPRConfig
|
||||
from ..electra.configuration_electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig
|
||||
|
@ -99,6 +100,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
|
|||
BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
LUKE_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
DETR_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
BIG_BIRD_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
|
@ -156,6 +158,7 @@ CONFIG_MAPPING = OrderedDict(
|
|||
("bigbird_pegasus", BigBirdPegasusConfig),
|
||||
("deit", DeiTConfig),
|
||||
("luke", LukeConfig),
|
||||
("detr", DetrConfig),
|
||||
("gpt_neo", GPTNeoConfig),
|
||||
("big_bird", BigBirdConfig),
|
||||
("speech_to_text", Speech2TextConfig),
|
||||
|
@ -219,6 +222,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
|
|||
("bigbird_pegasus", "BigBirdPegasus"),
|
||||
("deit", "DeiT"),
|
||||
("luke", "LUKE"),
|
||||
("detr", "DETR"),
|
||||
("gpt_neo", "GPT Neo"),
|
||||
("big_bird", "BigBird"),
|
||||
("speech_to_text", "Speech2Text"),
|
||||
|
|
|
@ -106,6 +106,7 @@ from ..deberta_v2.modeling_deberta_v2 import (
|
|||
DebertaV2Model,
|
||||
)
|
||||
from ..deit.modeling_deit import DeiTForImageClassification, DeiTForImageClassificationWithTeacher, DeiTModel
|
||||
from ..detr.modeling_detr import DetrForObjectDetection, DetrModel
|
||||
from ..distilbert.modeling_distilbert import (
|
||||
DistilBertForMaskedLM,
|
||||
DistilBertForMultipleChoice,
|
||||
|
@ -316,6 +317,7 @@ from .configuration_auto import (
|
|||
DebertaConfig,
|
||||
DebertaV2Config,
|
||||
DeiTConfig,
|
||||
DetrConfig,
|
||||
DistilBertConfig,
|
||||
DPRConfig,
|
||||
ElectraConfig,
|
||||
|
@ -372,6 +374,7 @@ MODEL_MAPPING = OrderedDict(
|
|||
(BigBirdPegasusConfig, BigBirdPegasusModel),
|
||||
(DeiTConfig, DeiTModel),
|
||||
(LukeConfig, LukeModel),
|
||||
(DetrConfig, DetrModel),
|
||||
(GPTNeoConfig, GPTNeoModel),
|
||||
(BigBirdConfig, BigBirdModel),
|
||||
(Speech2TextConfig, Speech2TextModel),
|
||||
|
@ -586,6 +589,13 @@ MODEL_FOR_MASKED_LM_MAPPING = OrderedDict(
|
|||
]
|
||||
)
|
||||
|
||||
MODEL_FOR_OBJECT_DETECTION_MAPPING = OrderedDict(
|
||||
[
|
||||
# Model for Object Detection mapping
|
||||
(DetrConfig, DetrForObjectDetection),
|
||||
]
|
||||
)
|
||||
|
||||
MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING = OrderedDict(
|
||||
[
|
||||
# Model for Seq2Seq Causal LM mapping
|
||||
|
|
|
@ -0,0 +1,72 @@
|
|||
# flake8: noqa
|
||||
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||
# module, but to preserve other warnings. So, don't check this module at all.
|
||||
|
||||
# Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ...file_utils import _BaseLazyModule, is_timm_available, is_vision_available
|
||||
|
||||
|
||||
_import_structure = {
|
||||
"configuration_detr": ["DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "DetrConfig"],
|
||||
}
|
||||
|
||||
if is_vision_available():
|
||||
_import_structure["feature_extraction_detr"] = ["DetrFeatureExtractor"]
|
||||
|
||||
if is_timm_available():
|
||||
_import_structure["modeling_detr"] = [
|
||||
"DETR_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"DetrForObjectDetection",
|
||||
"DetrForSegmentation",
|
||||
"DetrModel",
|
||||
"DetrPreTrainedModel",
|
||||
]
|
||||
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .configuration_detr import DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, DetrConfig
|
||||
|
||||
if is_vision_available():
|
||||
from .feature_extraction_detr import DetrFeatureExtractor
|
||||
|
||||
if is_timm_available():
|
||||
from .modeling_detr import (
|
||||
DETR_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
DetrForObjectDetection,
|
||||
DetrForSegmentation,
|
||||
DetrModel,
|
||||
DetrPreTrainedModel,
|
||||
)
|
||||
|
||||
else:
|
||||
import importlib
|
||||
import os
|
||||
import sys
|
||||
|
||||
class _LazyModule(_BaseLazyModule):
|
||||
"""
|
||||
Module class that surfaces all objects but only performs associated imports when the objects are requested.
|
||||
"""
|
||||
|
||||
__file__ = globals()["__file__"]
|
||||
__path__ = [os.path.dirname(__file__)]
|
||||
|
||||
def _get_module(self, module_name: str):
|
||||
return importlib.import_module("." + module_name, self.__name__)
|
||||
|
||||
sys.modules[__name__] = _LazyModule(__name__, _import_structure)
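A hedged usage sketch of the lazy module above: submodules are only imported when one of their attributes is first accessed, so the configuration stays importable even without timm or Pillow.

from transformers.models import detr

config = detr.DetrConfig()  # only configuration_detr is imported at this point
# DetrModel / DetrForObjectDetection resolve only when timm is installed,
# DetrFeatureExtractor only when the vision backend (Pillow) is available.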
|
|
@ -0,0 +1,205 @@
|
|||
# coding=utf-8
|
||||
# Copyright 2021 Facebook AI Research and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" DETR model configuration """
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
DETR_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
"facebook/detr-resnet-50": "https://huggingface.co/facebook/detr-resnet-50/resolve/main/config.json",
|
||||
# See all DETR models at https://huggingface.co/models?filter=detr
|
||||
}
|
||||
|
||||
|
||||
class DetrConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a :class:`~transformers.DetrModel`. It is used to
|
||||
instantiate a DETR model according to the specified arguments, defining the model architecture. Instantiating a
|
||||
configuration with the defaults will yield a similar configuration to that of the DETR `facebook/detr-resnet-50
|
||||
<https://huggingface.co/facebook/detr-resnet-50>`__ architecture.
|
||||
|
||||
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
|
||||
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
|
||||
|
||||
|
||||
Args:
|
||||
num_queries (:obj:`int`, `optional`, defaults to 100):
|
||||
Number of object queries, i.e. detection slots. This is the maximal number of objects
|
||||
:class:`~transformers.DetrModel` can detect in a single image. For COCO, we recommend 100 queries.
|
||||
d_model (:obj:`int`, `optional`, defaults to 256):
|
||||
Dimension of the layers.
|
||||
encoder_layers (:obj:`int`, `optional`, defaults to 6):
|
||||
Number of encoder layers.
|
||||
decoder_layers (:obj:`int`, `optional`, defaults to 6):
|
||||
Number of decoder layers.
|
||||
encoder_attention_heads (:obj:`int`, `optional`, defaults to 8):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
decoder_attention_heads (:obj:`int`, `optional`, defaults to 8):
|
||||
Number of attention heads for each attention layer in the Transformer decoder.
|
||||
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048):
|
||||
Dimension of the "intermediate" (often named feed-forward) layer in decoder.
|
||||
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048):
|
||||
Dimension of the "intermediate" (often named feed-forward) layer in decoder.
|
||||
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"relu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string,
|
||||
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
|
||||
dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
|
||||
The dropout ratio for the attention probabilities.
|
||||
activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
|
||||
The dropout ratio for activations inside the fully connected layer.
|
||||
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
|
||||
The dropout ratio for the classifier.
|
||||
max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
|
||||
The maximum sequence length that this model might ever be used with. Typically set this to something large
|
||||
just in case (e.g., 512 or 1024 or 2048).
|
||||
init_std (:obj:`float`, `optional`, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
init_xavier_std (:obj:`float`, `optional`, defaults to 1):
|
||||
The scaling factor used for the Xavier initialization gain in the HM Attention map module.
|
||||
encoder_layerdrop (:obj:`float`, `optional`, defaults to 0.0):
|
||||
The LayerDrop probability for the encoder. See the `LayerDrop paper <
|
||||
https://arxiv.org/abs/1909.11556>`__ for more details.
|
||||
decoder_layerdrop (:obj:`float`, `optional`, defaults to 0.0):
|
||||
The LayerDrop probability for the decoder. See the `LayerDrop paper <
|
||||
https://arxiv.org/abs/1909.11556>`__ for more details.
|
||||
auxiliary_loss (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether auxiliary decoding losses (loss at each decoder layer) are to be used.
|
||||
position_embedding_type (:obj:`str`, `optional`, defaults to :obj:`"sine"`):
|
||||
Type of position embeddings to be used on top of the image features. One of :obj:`"sine"` or
|
||||
:obj:`"learned"`.
|
||||
backbone (:obj:`str`, `optional`, defaults to :obj:`"resnet50"`):
|
||||
Name of convolutional backbone to use. Supports any convolutional backbone from the timm package. For a
|
||||
list of all available models, see `this page
|
||||
<https://rwightman.github.io/pytorch-image-models/#load-a-pretrained-model>`__.
|
||||
dilation (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether to replace stride with dilation in the last convolutional block (DC5).
|
||||
class_cost (:obj:`float`, `optional`, defaults to 1):
|
||||
Relative weight of the classification error in the Hungarian matching cost.
|
||||
bbox_cost (:obj:`float`, `optional`, defaults to 5):
|
||||
Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
|
||||
giou_cost (:obj:`float`, `optional`, defaults to 2):
|
||||
Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
|
||||
mask_loss_coefficient (:obj:`float`, `optional`, defaults to 1):
|
||||
Relative weight of the Focal loss in the panoptic segmentation loss.
|
||||
dice_loss_coefficient (:obj:`float`, `optional`, defaults to 1):
|
||||
Relative weight of the DICE/F-1 loss in the panoptic segmentation loss.
|
||||
bbox_loss_coefficient (:obj:`float`, `optional`, defaults to 5):
|
||||
Relative weight of the L1 bounding box loss in the object detection loss.
|
||||
giou_loss_coefficient (:obj:`float`, `optional`, defaults to 2):
|
||||
Relative weight of the generalized IoU loss in the object detection loss.
|
||||
eos_coefficient (:obj:`float`, `optional`, defaults to 0.1):
|
||||
Relative classification weight of the 'no-object' class in the object detection loss.
|
||||
|
||||
Examples::
|
||||
|
||||
>>> from transformers import DetrModel, DetrConfig
|
||||
|
||||
>>> # Initializing a DETR facebook/detr-resnet-50 style configuration
|
||||
>>> configuration = DetrConfig()
|
||||
|
||||
>>> # Initializing a model from the facebook/detr-resnet-50 style configuration
|
||||
>>> model = DetrModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
"""
|
||||
model_type = "detr"
|
||||
keys_to_ignore_at_inference = ["past_key_values"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
num_queries=100,
|
||||
max_position_embeddings=1024,
|
||||
encoder_layers=6,
|
||||
encoder_ffn_dim=2048,
|
||||
encoder_attention_heads=8,
|
||||
decoder_layers=6,
|
||||
decoder_ffn_dim=2048,
|
||||
decoder_attention_heads=8,
|
||||
encoder_layerdrop=0.0,
|
||||
decoder_layerdrop=0.0,
|
||||
is_encoder_decoder=True,
|
||||
activation_function="relu",
|
||||
d_model=256,
|
||||
dropout=0.1,
|
||||
attention_dropout=0.0,
|
||||
activation_dropout=0.0,
|
||||
init_std=0.02,
|
||||
init_xavier_std=1.0,
|
||||
classifier_dropout=0.0,
|
||||
scale_embedding=False,
|
||||
auxiliary_loss=False,
|
||||
position_embedding_type="sine",
|
||||
backbone="resnet50",
|
||||
dilation=False,
|
||||
class_cost=1,
|
||||
bbox_cost=5,
|
||||
giou_cost=2,
|
||||
mask_loss_coefficient=1,
|
||||
dice_loss_coefficient=1,
|
||||
bbox_loss_coefficient=5,
|
||||
giou_loss_coefficient=2,
|
||||
eos_coefficient=0.1,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
|
||||
|
||||
self.num_queries = num_queries
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.d_model = d_model
|
||||
self.encoder_ffn_dim = encoder_ffn_dim
|
||||
self.encoder_layers = encoder_layers
|
||||
self.encoder_attention_heads = encoder_attention_heads
|
||||
self.decoder_ffn_dim = decoder_ffn_dim
|
||||
self.decoder_layers = decoder_layers
|
||||
self.decoder_attention_heads = decoder_attention_heads
|
||||
self.dropout = dropout
|
||||
self.attention_dropout = attention_dropout
|
||||
self.activation_dropout = activation_dropout
|
||||
self.activation_function = activation_function
|
||||
self.init_std = init_std
|
||||
self.init_xavier_std = init_xavier_std
|
||||
self.encoder_layerdrop = encoder_layerdrop
|
||||
self.decoder_layerdrop = decoder_layerdrop
|
||||
self.classifier_dropout = classifier_dropout
|
||||
self.num_hidden_layers = encoder_layers
|
||||
self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True
|
||||
self.auxiliary_loss = auxiliary_loss
|
||||
self.position_embedding_type = position_embedding_type
|
||||
self.backbone = backbone
|
||||
self.dilation = dilation
|
||||
# Hungarian matcher
|
||||
self.class_cost = class_cost
|
||||
self.bbox_cost = bbox_cost
|
||||
self.giou_cost = giou_cost
|
||||
# Loss coefficients
|
||||
self.mask_loss_coefficient = mask_loss_coefficient
|
||||
self.dice_loss_coefficient = dice_loss_coefficient
|
||||
self.bbox_loss_coefficient = bbox_loss_coefficient
|
||||
self.giou_loss_coefficient = giou_loss_coefficient
|
||||
self.eos_coefficient = eos_coefficient
|
||||
|
||||
@property
|
||||
def num_attention_heads(self) -> int:
|
||||
return self.encoder_attention_heads
|
||||
|
||||
@property
|
||||
def hidden_size(self) -> int:
|
||||
return self.d_model
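A hedged configuration sketch building on the arguments documented above; for instance, the DC5 / ResNet-101 variant corresponds to `backbone="resnet101"` with `dilation=True`:

from transformers import DetrConfig

config = DetrConfig(backbone="resnet101", dilation=True)
print(config.hidden_size)          # 256, an alias for d_model
print(config.num_attention_heads)  # 8, an alias for encoder_attention_heads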
|
|
@ -0,0 +1,273 @@
|
|||
# coding=utf-8
|
||||
# Copyright 2020 The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Convert DETR checkpoints."""
|
||||
|
||||
|
||||
import argparse
|
||||
from collections import OrderedDict
|
||||
from pathlib import Path
|
||||
|
||||
import torch
|
||||
from PIL import Image
|
||||
|
||||
import requests
|
||||
from transformers import DetrConfig, DetrFeatureExtractor, DetrForObjectDetection, DetrForSegmentation
|
||||
from transformers.utils import logging
|
||||
from transformers.utils.coco_classes import id2label
|
||||
|
||||
|
||||
logging.set_verbosity_info()
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
# here we list all keys to be renamed (original name on the left, our name on the right)
|
||||
rename_keys = []
|
||||
for i in range(6):
|
||||
# encoder layers: output projection, 2 feedforward neural networks and 2 layernorms
|
||||
rename_keys.append(
|
||||
(f"transformer.encoder.layers.{i}.self_attn.out_proj.weight", f"encoder.layers.{i}.self_attn.out_proj.weight")
|
||||
)
|
||||
rename_keys.append(
|
||||
(f"transformer.encoder.layers.{i}.self_attn.out_proj.bias", f"encoder.layers.{i}.self_attn.out_proj.bias")
|
||||
)
|
||||
rename_keys.append((f"transformer.encoder.layers.{i}.linear1.weight", f"encoder.layers.{i}.fc1.weight"))
|
||||
rename_keys.append((f"transformer.encoder.layers.{i}.linear1.bias", f"encoder.layers.{i}.fc1.bias"))
|
||||
rename_keys.append((f"transformer.encoder.layers.{i}.linear2.weight", f"encoder.layers.{i}.fc2.weight"))
|
||||
rename_keys.append((f"transformer.encoder.layers.{i}.linear2.bias", f"encoder.layers.{i}.fc2.bias"))
|
||||
rename_keys.append(
|
||||
(f"transformer.encoder.layers.{i}.norm1.weight", f"encoder.layers.{i}.self_attn_layer_norm.weight")
|
||||
)
|
||||
rename_keys.append((f"transformer.encoder.layers.{i}.norm1.bias", f"encoder.layers.{i}.self_attn_layer_norm.bias"))
|
||||
rename_keys.append((f"transformer.encoder.layers.{i}.norm2.weight", f"encoder.layers.{i}.final_layer_norm.weight"))
|
||||
rename_keys.append((f"transformer.encoder.layers.{i}.norm2.bias", f"encoder.layers.{i}.final_layer_norm.bias"))
|
||||
# decoder layers: 2 times output projection, 2 feedforward neural networks and 3 layernorms
|
||||
rename_keys.append(
|
||||
(f"transformer.decoder.layers.{i}.self_attn.out_proj.weight", f"decoder.layers.{i}.self_attn.out_proj.weight")
|
||||
)
|
||||
rename_keys.append(
|
||||
(f"transformer.decoder.layers.{i}.self_attn.out_proj.bias", f"decoder.layers.{i}.self_attn.out_proj.bias")
|
||||
)
|
||||
rename_keys.append(
|
||||
(
|
||||
f"transformer.decoder.layers.{i}.multihead_attn.out_proj.weight",
|
||||
f"decoder.layers.{i}.encoder_attn.out_proj.weight",
|
||||
)
|
||||
)
|
||||
rename_keys.append(
|
||||
(
|
||||
f"transformer.decoder.layers.{i}.multihead_attn.out_proj.bias",
|
||||
f"decoder.layers.{i}.encoder_attn.out_proj.bias",
|
||||
)
|
||||
)
|
||||
rename_keys.append((f"transformer.decoder.layers.{i}.linear1.weight", f"decoder.layers.{i}.fc1.weight"))
|
||||
rename_keys.append((f"transformer.decoder.layers.{i}.linear1.bias", f"decoder.layers.{i}.fc1.bias"))
|
||||
rename_keys.append((f"transformer.decoder.layers.{i}.linear2.weight", f"decoder.layers.{i}.fc2.weight"))
|
||||
rename_keys.append((f"transformer.decoder.layers.{i}.linear2.bias", f"decoder.layers.{i}.fc2.bias"))
|
||||
rename_keys.append(
|
||||
(f"transformer.decoder.layers.{i}.norm1.weight", f"decoder.layers.{i}.self_attn_layer_norm.weight")
|
||||
)
|
||||
rename_keys.append((f"transformer.decoder.layers.{i}.norm1.bias", f"decoder.layers.{i}.self_attn_layer_norm.bias"))
|
||||
rename_keys.append(
|
||||
(f"transformer.decoder.layers.{i}.norm2.weight", f"decoder.layers.{i}.encoder_attn_layer_norm.weight")
|
||||
)
|
||||
rename_keys.append(
|
||||
(f"transformer.decoder.layers.{i}.norm2.bias", f"decoder.layers.{i}.encoder_attn_layer_norm.bias")
|
||||
)
|
||||
rename_keys.append((f"transformer.decoder.layers.{i}.norm3.weight", f"decoder.layers.{i}.final_layer_norm.weight"))
|
||||
rename_keys.append((f"transformer.decoder.layers.{i}.norm3.bias", f"decoder.layers.{i}.final_layer_norm.bias"))
|
||||
|
||||
# convolutional projection + query embeddings + layernorm of decoder + class and bounding box heads
|
||||
rename_keys.extend(
|
||||
[
|
||||
("input_proj.weight", "input_projection.weight"),
|
||||
("input_proj.bias", "input_projection.bias"),
|
||||
("query_embed.weight", "query_position_embeddings.weight"),
|
||||
("transformer.decoder.norm.weight", "decoder.layernorm.weight"),
|
||||
("transformer.decoder.norm.bias", "decoder.layernorm.bias"),
|
||||
("class_embed.weight", "class_labels_classifier.weight"),
|
||||
("class_embed.bias", "class_labels_classifier.bias"),
|
||||
("bbox_embed.layers.0.weight", "bbox_predictor.layers.0.weight"),
|
||||
("bbox_embed.layers.0.bias", "bbox_predictor.layers.0.bias"),
|
||||
("bbox_embed.layers.1.weight", "bbox_predictor.layers.1.weight"),
|
||||
("bbox_embed.layers.1.bias", "bbox_predictor.layers.1.bias"),
|
||||
("bbox_embed.layers.2.weight", "bbox_predictor.layers.2.weight"),
|
||||
("bbox_embed.layers.2.bias", "bbox_predictor.layers.2.bias"),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def rename_key(state_dict, old, new):
|
||||
val = state_dict.pop(old)
|
||||
state_dict[new] = val
|
||||
|
||||
|
||||
def rename_backbone_keys(state_dict):
|
||||
new_state_dict = OrderedDict()
|
||||
for key, value in state_dict.items():
|
||||
if "backbone.0.body" in key:
|
||||
new_key = key.replace("backbone.0.body", "backbone.conv_encoder.model")
|
||||
new_state_dict[new_key] = value
|
||||
else:
|
||||
new_state_dict[key] = value
|
||||
|
||||
return new_state_dict
|
||||
|
||||
|
||||
def read_in_q_k_v(state_dict, is_panoptic=False):
|
||||
prefix = ""
|
||||
if is_panoptic:
|
||||
prefix = "detr."
|
||||
|
||||
# first: transformer encoder
|
||||
for i in range(6):
|
||||
# read in weights + bias of input projection layer (in PyTorch's MultiHeadAttention, this is a single matrix + bias)
|
||||
in_proj_weight = state_dict.pop(f"{prefix}transformer.encoder.layers.{i}.self_attn.in_proj_weight")
|
||||
in_proj_bias = state_dict.pop(f"{prefix}transformer.encoder.layers.{i}.self_attn.in_proj_bias")
|
||||
# next, add query, keys and values (in that order) to the state dict
|
||||
state_dict[f"encoder.layers.{i}.self_attn.q_proj.weight"] = in_proj_weight[:256, :]
|
||||
state_dict[f"encoder.layers.{i}.self_attn.q_proj.bias"] = in_proj_bias[:256]
|
||||
state_dict[f"encoder.layers.{i}.self_attn.k_proj.weight"] = in_proj_weight[256:512, :]
|
||||
state_dict[f"encoder.layers.{i}.self_attn.k_proj.bias"] = in_proj_bias[256:512]
|
||||
state_dict[f"encoder.layers.{i}.self_attn.v_proj.weight"] = in_proj_weight[-256:, :]
|
||||
state_dict[f"encoder.layers.{i}.self_attn.v_proj.bias"] = in_proj_bias[-256:]
|
||||
# next: transformer decoder (which is a bit more complex because it also includes cross-attention)
|
||||
for i in range(6):
|
||||
# read in weights + bias of input projection layer of self-attention
|
||||
in_proj_weight = state_dict.pop(f"{prefix}transformer.decoder.layers.{i}.self_attn.in_proj_weight")
|
||||
in_proj_bias = state_dict.pop(f"{prefix}transformer.decoder.layers.{i}.self_attn.in_proj_bias")
|
||||
# next, add query, keys and values (in that order) to the state dict
|
||||
state_dict[f"decoder.layers.{i}.self_attn.q_proj.weight"] = in_proj_weight[:256, :]
|
||||
state_dict[f"decoder.layers.{i}.self_attn.q_proj.bias"] = in_proj_bias[:256]
|
||||
state_dict[f"decoder.layers.{i}.self_attn.k_proj.weight"] = in_proj_weight[256:512, :]
|
||||
state_dict[f"decoder.layers.{i}.self_attn.k_proj.bias"] = in_proj_bias[256:512]
|
||||
state_dict[f"decoder.layers.{i}.self_attn.v_proj.weight"] = in_proj_weight[-256:, :]
|
||||
state_dict[f"decoder.layers.{i}.self_attn.v_proj.bias"] = in_proj_bias[-256:]
|
||||
# read in weights + bias of input projection layer of cross-attention
|
||||
in_proj_weight_cross_attn = state_dict.pop(
|
||||
f"{prefix}transformer.decoder.layers.{i}.multihead_attn.in_proj_weight"
|
||||
)
|
||||
in_proj_bias_cross_attn = state_dict.pop(f"{prefix}transformer.decoder.layers.{i}.multihead_attn.in_proj_bias")
|
||||
# next, add query, keys and values (in that order) of cross-attention to the state dict
|
||||
state_dict[f"decoder.layers.{i}.encoder_attn.q_proj.weight"] = in_proj_weight_cross_attn[:256, :]
|
||||
state_dict[f"decoder.layers.{i}.encoder_attn.q_proj.bias"] = in_proj_bias_cross_attn[:256]
|
||||
state_dict[f"decoder.layers.{i}.encoder_attn.k_proj.weight"] = in_proj_weight_cross_attn[256:512, :]
|
||||
state_dict[f"decoder.layers.{i}.encoder_attn.k_proj.bias"] = in_proj_bias_cross_attn[256:512]
|
||||
state_dict[f"decoder.layers.{i}.encoder_attn.v_proj.weight"] = in_proj_weight_cross_attn[-256:, :]
|
||||
state_dict[f"decoder.layers.{i}.encoder_attn.v_proj.bias"] = in_proj_bias_cross_attn[-256:]
|
||||
|
||||
|
||||
# We will verify our results on an image of cute cats
|
||||
def prepare_img():
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
im = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
return im
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def convert_detr_checkpoint(model_name, pytorch_dump_folder_path):
|
||||
"""
|
||||
Copy/paste/tweak model's weights to our DETR structure.
|
||||
"""
|
||||
|
||||
# load default config
|
||||
config = DetrConfig()
|
||||
# set backbone and dilation attributes
|
||||
if "resnet101" in model_name:
|
||||
config.backbone = "resnet101"
|
||||
if "dc5" in model_name:
|
||||
config.dilation = True
|
||||
is_panoptic = "panoptic" in model_name
|
||||
if is_panoptic:
|
||||
config.num_labels = 250
|
||||
else:
|
||||
config.num_labels = 91
|
||||
config.id2label = id2label
|
||||
config.label2id = {v: k for k, v in id2label.items()}
|
||||
|
||||
# load feature extractor
|
||||
format = "coco_panoptic" if is_panoptic else "coco_detection"
|
||||
feature_extractor = DetrFeatureExtractor(format=format)
|
||||
|
||||
# prepare image
|
||||
img = prepare_img()
|
||||
encoding = feature_extractor(images=img, return_tensors="pt")
|
||||
pixel_values = encoding["pixel_values"]
|
||||
|
||||
logger.info(f"Converting model {model_name}...")
|
||||
|
||||
# load original model from torch hub
|
||||
detr = torch.hub.load("facebookresearch/detr", model_name, pretrained=True).eval()
|
||||
state_dict = detr.state_dict()
|
||||
# rename keys
|
||||
for src, dest in rename_keys:
|
||||
if is_panoptic:
|
||||
src = "detr." + src
|
||||
rename_key(state_dict, src, dest)
|
||||
state_dict = rename_backbone_keys(state_dict)
|
||||
# query, key and value matrices need special treatment
|
||||
read_in_q_k_v(state_dict, is_panoptic=is_panoptic)
|
||||
# important: we need to prepend a prefix to each of the base model keys as the head models use different attributes for them
|
||||
prefix = "detr.model." if is_panoptic else "model."
|
||||
for key in state_dict.copy().keys():
|
||||
if is_panoptic:
|
||||
if (
|
||||
key.startswith("detr")
|
||||
and not key.startswith("class_labels_classifier")
|
||||
and not key.startswith("bbox_predictor")
|
||||
):
|
||||
val = state_dict.pop(key)
|
||||
state_dict["detr.model" + key[4:]] = val
|
||||
elif "class_labels_classifier" in key or "bbox_predictor" in key:
|
||||
val = state_dict.pop(key)
|
||||
state_dict["detr." + key] = val
|
||||
elif key.startswith("bbox_attention") or key.startswith("mask_head"):
|
||||
continue
|
||||
else:
|
||||
val = state_dict.pop(key)
|
||||
state_dict[prefix + key] = val
|
||||
else:
|
||||
if not key.startswith("class_labels_classifier") and not key.startswith("bbox_predictor"):
|
||||
val = state_dict.pop(key)
|
||||
state_dict[prefix + key] = val
|
||||
# finally, create HuggingFace model and load state dict
|
||||
model = DetrForSegmentation(config) if is_panoptic else DetrForObjectDetection(config)
|
||||
model.load_state_dict(state_dict)
|
||||
model.eval()
|
||||
# verify our conversion
|
||||
original_outputs = detr(pixel_values)
|
||||
outputs = model(pixel_values)
|
||||
assert torch.allclose(outputs.logits, original_outputs["pred_logits"], atol=1e-4)
|
||||
assert torch.allclose(outputs.pred_boxes, original_outputs["pred_boxes"], atol=1e-4)
|
||||
if is_panoptic:
|
||||
assert torch.allclose(outputs.pred_masks, original_outputs["pred_masks"], atol=1e-4)
|
||||
|
||||
# Save model and feature extractor
|
||||
logger.info(f"Saving PyTorch model and feature extractor to {pytorch_dump_folder_path}...")
|
||||
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
|
||||
model.save_pretrained(pytorch_dump_folder_path)
|
||||
feature_extractor.save_pretrained(pytorch_dump_folder_path)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
parser.add_argument(
|
||||
"--model_name", default="detr_resnet50", type=str, help="Name of the DETR model you'd like to convert."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the folder to output PyTorch model."
|
||||
)
|
||||
args = parser.parse_args()
|
||||
convert_detr_checkpoint(args.model_name, args.pytorch_dump_folder_path)
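For reference, a hedged invocation sketch of the conversion function above (requires network access for `torch.hub` plus the timm backend; the output folder name is arbitrary):

convert_detr_checkpoint(model_name="detr_resnet50", pytorch_dump_folder_path="./detr-resnet-50-converted")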
|
|
@ -0,0 +1,890 @@
|
|||
# coding=utf-8
|
||||
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Feature extractor class for DETR."""
|
||||
|
||||
import io
|
||||
import pathlib
|
||||
from collections import defaultdict
|
||||
from typing import Dict, List, Optional, Union
|
||||
|
||||
import numpy as np
|
||||
from PIL import Image
|
||||
|
||||
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
|
||||
from ...file_utils import TensorType, is_torch_available
|
||||
from ...image_utils import ImageFeatureExtractionMixin, is_torch_tensor
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
import torch.nn.functional as F
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
ImageInput = Union[Image.Image, np.ndarray, "torch.Tensor", List[Image.Image], List[np.ndarray], List["torch.Tensor"]]
|
||||
|
||||
|
||||
# 2 functions below inspired by https://github.com/facebookresearch/detr/blob/master/util/box_ops.py
|
||||
def center_to_corners_format(x):
|
||||
"""
|
||||
Converts a PyTorch tensor of bounding boxes from center format (center_x, center_y, width, height) to corners format
|
||||
(x_0, y_0, x_1, y_1).
|
||||
"""
|
||||
x_c, y_c, w, h = x.unbind(-1)
|
||||
b = [(x_c - 0.5 * w), (y_c - 0.5 * h), (x_c + 0.5 * w), (y_c + 0.5 * h)]
|
||||
return torch.stack(b, dim=-1)
|
||||
|
||||
|
||||
def corners_to_center_format(x):
|
||||
"""
|
||||
Converts a NumPy array of bounding boxes of shape (number of bounding boxes, 4) from corners format (x_0, y_0, x_1,
|
||||
y_1) to center format (center_x, center_y, width, height).
|
||||
"""
|
||||
x_transposed = x.T
|
||||
x0, y0, x1, y1 = x_transposed[0], x_transposed[1], x_transposed[2], x_transposed[3]
|
||||
b = [(x0 + x1) / 2, (y0 + y1) / 2, (x1 - x0), (y1 - y0)]
|
||||
return np.stack(b, axis=-1)
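The two helpers above are inverses of each other; a quick sanity-check sketch, assuming both functions are in scope:

import torch

corners = torch.tensor([[10.0, 20.0, 50.0, 80.0]])                     # (x_0, y_0, x_1, y_1)
centers = torch.from_numpy(corners_to_center_format(corners.numpy()))  # (30.0, 50.0, 40.0, 60.0)
assert torch.allclose(center_to_corners_format(centers), corners)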
|
||||
|
||||
|
||||
def masks_to_boxes(masks):
|
||||
"""
|
||||
Compute the bounding boxes around the provided panoptic segmentation masks.
|
||||
|
||||
The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions.
|
||||
|
||||
Returns a [N, 4] tensor, with the boxes in corner (xyxy) format.
|
||||
"""
|
||||
if masks.size == 0:
|
||||
return np.zeros((0, 4))
|
||||
|
||||
h, w = masks.shape[-2:]
|
||||
|
||||
y = np.arange(0, h, dtype=np.float32)
|
||||
x = np.arange(0, w, dtype=np.float32)
|
||||
# see https://github.com/pytorch/pytorch/issues/50276
|
||||
y, x = np.meshgrid(y, x, indexing="ij")
|
||||
|
||||
x_mask = masks * np.expand_dims(x, axis=0)
|
||||
x_max = x_mask.reshape(x_mask.shape[0], -1).max(-1)
|
||||
x = np.ma.array(x_mask, mask=~(np.array(masks, dtype=bool)))
|
||||
x_min = x.filled(fill_value=1e8)
|
||||
x_min = x_min.reshape(x_min.shape[0], -1).min(-1)
|
||||
|
||||
y_mask = masks * np.expand_dims(y, axis=0)
|
||||
y_max = y_mask.reshape(x_mask.shape[0], -1).max(-1)
|
||||
y = np.ma.array(y_mask, mask=~(np.array(masks, dtype=bool)))
|
||||
y_min = y.filled(fill_value=1e8)
|
||||
y_min = y_min.reshape(y_min.shape[0], -1).min(-1)
|
||||
|
||||
return np.stack([x_min, y_min, x_max, y_max], 1)
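A tiny worked example of the mask-to-box computation above (assuming the function is in scope):

import numpy as np

masks = np.zeros((1, 2, 4), dtype=np.uint8)
masks[0, 0, 1:3] = 1                 # a single mask covering columns 1-2 of row 0
print(masks_to_boxes(masks))         # [[1. 0. 2. 0.]], i.e. (x_min, y_min, x_max, y_max)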
|
||||
|
||||
|
||||
# 2 functions below copied from https://github.com/cocodataset/panopticapi/blob/master/panopticapi/utils.py
|
||||
# Copyright (c) 2018, Alexander Kirillov
|
||||
# All rights reserved.
|
||||
def rgb_to_id(color):
|
||||
if isinstance(color, np.ndarray) and len(color.shape) == 3:
|
||||
if color.dtype == np.uint8:
|
||||
color = color.astype(np.int32)
|
||||
return color[:, :, 0] + 256 * color[:, :, 1] + 256 * 256 * color[:, :, 2]
|
||||
return int(color[0] + 256 * color[1] + 256 * 256 * color[2])
|
||||
|
||||
|
||||
def id_to_rgb(id_map):
|
||||
if isinstance(id_map, np.ndarray):
|
||||
id_map_copy = id_map.copy()
|
||||
rgb_shape = tuple(list(id_map.shape) + [3])
|
||||
rgb_map = np.zeros(rgb_shape, dtype=np.uint8)
|
||||
for i in range(3):
|
||||
rgb_map[..., i] = id_map_copy % 256
|
||||
id_map_copy //= 256
|
||||
return rgb_map
|
||||
color = []
|
||||
for _ in range(3):
|
||||
color.append(id_map % 256)
|
||||
id_map //= 256
|
||||
return color
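A small worked example of the base-256 encoding used by the two panoptic helpers above:

segment_id = rgb_to_id([12, 34, 5])   # 12 + 256 * 34 + 256 * 256 * 5 = 336396
assert id_to_rgb(segment_id) == [12, 34, 5]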
|
||||
|
||||
|
||||
class DetrFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
|
||||
r"""
|
||||
Constructs a DETR feature extractor.
|
||||
|
||||
This feature extractor inherits from :class:`~transformers.FeatureExtractionMixin` which contains most of the main
|
||||
methods. Users should refer to this superclass for more information regarding those methods.
|
||||
|
||||
|
||||
Args:
|
||||
format (:obj:`str`, `optional`, defaults to :obj:`"coco_detection"`):
|
||||
Data format of the annotations. One of "coco_detection" or "coco_panoptic".
|
||||
do_resize (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether to resize the input to a certain :obj:`size`.
|
||||
size (:obj:`int`, `optional`, defaults to 800):
|
||||
Resize the input to the given size. Only has an effect if :obj:`do_resize` is set to :obj:`True`. If size
|
||||
is a sequence like :obj:`(width, height)`, output size will be matched to this. If size is an int, smaller
|
||||
edge of the image will be matched to this number, i.e. if :obj:`height > width`, then the image will be
|
||||
rescaled to :obj:`(size * height / width, size)`.
|
||||
max_size (:obj:`int`, `optional`, defaults to :obj:`1333`):
|
||||
The largest size an image dimension can have (otherwise it's capped). Only has an effect if
|
||||
:obj:`do_resize` is set to :obj:`True`.
|
||||
do_normalize (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not to normalize the input with mean and standard deviation.
|
||||
image_mean (:obj:`int`, `optional`, defaults to :obj:`[0.485, 0.456, 0.406]`):
|
||||
The sequence of means for each channel, to be used when normalizing images. Defaults to the ImageNet mean.
|
||||
image_std (:obj:`int`, `optional`, defaults to :obj:`[0.229, 0.224, 0.225]`):
|
||||
The sequence of standard deviations for each channel, to be used when normalizing images. Defaults to the
|
||||
ImageNet std.
|
||||
"""
|
||||
|
||||
model_input_names = ["pixel_values", "pixel_mask"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
format="coco_detection",
|
||||
do_resize=True,
|
||||
size=800,
|
||||
max_size=1333,
|
||||
do_normalize=True,
|
||||
image_mean=None,
|
||||
image_std=None,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(**kwargs)
|
||||
self.format = self._is_valid_format(format)
|
||||
self.do_resize = do_resize
|
||||
self.size = size
|
||||
self.max_size = max_size
|
||||
self.do_normalize = do_normalize
|
||||
self.image_mean = image_mean if image_mean is not None else [0.485, 0.456, 0.406] # ImageNet mean
|
||||
self.image_std = image_std if image_std is not None else [0.229, 0.224, 0.225] # ImageNet std
|
||||
|
||||
def _is_valid_format(self, format):
|
||||
if format not in ["coco_detection", "coco_panoptic"]:
|
||||
raise ValueError(f"Format {format} not supported")
|
||||
return format
|
||||
|
||||
def prepare(self, image, target, return_segmentation_masks=False, masks_path=None):
|
||||
if self.format == "coco_detection":
|
||||
image, target = self.prepare_coco_detection(image, target, return_segmentation_masks)
|
||||
return image, target
|
||||
elif self.format == "coco_panoptic":
|
||||
image, target = self.prepare_coco_panoptic(image, target, masks_path)
|
||||
return image, target
|
||||
else:
|
||||
raise ValueError(f"Format {self.format} not supported")
|
||||
|
||||
# inspired by https://github.com/facebookresearch/detr/blob/master/datasets/coco.py#L33
|
||||
def convert_coco_poly_to_mask(self, segmentations, height, width):
|
||||
|
||||
try:
|
||||
from pycocotools import mask as coco_mask
|
||||
except ImportError:
|
||||
raise ImportError("Pycocotools is not installed in your environment.")
|
||||
|
||||
masks = []
|
||||
for polygons in segmentations:
|
||||
rles = coco_mask.frPyObjects(polygons, height, width)
|
||||
mask = coco_mask.decode(rles)
|
||||
if len(mask.shape) < 3:
|
||||
mask = mask[..., None]
|
||||
mask = np.asarray(mask, dtype=np.uint8)
|
||||
mask = np.any(mask, axis=2)
|
||||
masks.append(mask)
|
||||
if masks:
|
||||
masks = np.stack(masks, axis=0)
|
||||
else:
|
||||
masks = np.zeros((0, height, width), dtype=np.uint8)
|
||||
|
||||
return masks
|
||||
|
||||
# inspired by https://github.com/facebookresearch/detr/blob/master/datasets/coco.py#L50
|
||||
def prepare_coco_detection(self, image, target, return_segmentation_masks=False):
|
||||
"""
|
||||
Convert the target in COCO format into the format expected by DETR.
|
||||
"""
|
||||
w, h = image.size
|
||||
|
||||
image_id = target["image_id"]
|
||||
image_id = np.asarray([image_id], dtype=np.int64)
|
||||
|
||||
# get all COCO annotations for the given image
|
||||
anno = target["annotations"]
|
||||
|
||||
anno = [obj for obj in anno if "iscrowd" not in obj or obj["iscrowd"] == 0]
|
||||
|
||||
boxes = [obj["bbox"] for obj in anno]
|
||||
# guard against no boxes via resizing
|
||||
boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
|
||||
boxes[:, 2:] += boxes[:, :2]
|
||||
boxes[:, 0::2] = boxes[:, 0::2].clip(min=0, max=w)
|
||||
boxes[:, 1::2] = boxes[:, 1::2].clip(min=0, max=h)
|
||||
|
||||
classes = [obj["category_id"] for obj in anno]
|
||||
classes = np.asarray(classes, dtype=np.int64)
|
||||
|
||||
if return_segmentation_masks:
|
||||
segmentations = [obj["segmentation"] for obj in anno]
|
||||
masks = self.convert_coco_poly_to_mask(segmentations, h, w)
|
||||
|
||||
keypoints = None
|
||||
if anno and "keypoints" in anno[0]:
|
||||
keypoints = [obj["keypoints"] for obj in anno]
|
||||
keypoints = np.asarray(keypoints, dtype=np.float32)
|
||||
num_keypoints = keypoints.shape[0]
|
||||
if num_keypoints:
|
||||
keypoints = keypoints.reshape((-1, 3))
|
||||
|
||||
keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])
|
||||
boxes = boxes[keep]
|
||||
classes = classes[keep]
|
||||
if return_segmentation_masks:
|
||||
masks = masks[keep]
|
||||
if keypoints is not None:
|
||||
keypoints = keypoints[keep]
|
||||
|
||||
target = {}
|
||||
target["boxes"] = boxes
|
||||
target["class_labels"] = classes
|
||||
if return_segmentation_masks:
|
||||
target["masks"] = masks
|
||||
target["image_id"] = image_id
|
||||
if keypoints is not None:
|
||||
target["keypoints"] = keypoints
|
||||
|
||||
# for conversion to coco api
|
||||
area = np.asarray([obj["area"] for obj in anno], dtype=np.float32)
|
||||
iscrowd = np.asarray([obj["iscrowd"] if "iscrowd" in obj else 0 for obj in anno], dtype=np.int64)
|
||||
target["area"] = area[keep]
|
||||
target["iscrowd"] = iscrowd[keep]
|
||||
|
||||
target["orig_size"] = np.asarray([int(h), int(w)], dtype=np.int64)
|
||||
target["size"] = np.asarray([int(h), int(w)], dtype=np.int64)
|
||||
|
||||
return image, target
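For reference, a hedged sketch of the annotation dictionary this method expects (the field values are made up; the keys mirror the accesses above):

coco_detection_annotation = {
    "image_id": 39769,
    "annotations": [
        {
            "bbox": [10.0, 20.0, 30.0, 40.0],   # COCO format: (x, y, width, height)
            "category_id": 17,
            "area": 1200.0,
            "iscrowd": 0,
            "segmentation": [[10.0, 20.0, 40.0, 20.0, 40.0, 60.0, 10.0, 60.0]],
        }
    ],
}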
|
||||
|
||||
def prepare_coco_panoptic(self, image, target, masks_path, return_masks=True):
|
||||
w, h = image.size
|
||||
ann_info = target.copy()
|
||||
ann_path = pathlib.Path(masks_path) / ann_info["file_name"]
|
||||
|
||||
if "segments_info" in ann_info:
|
||||
masks = np.asarray(Image.open(ann_path), dtype=np.uint32)
|
||||
masks = rgb_to_id(masks)
|
||||
|
||||
ids = np.array([ann["id"] for ann in ann_info["segments_info"]])
|
||||
masks = masks == ids[:, None, None]
|
||||
masks = np.asarray(masks, dtype=np.uint8)
|
||||
|
||||
labels = np.asarray([ann["category_id"] for ann in ann_info["segments_info"]], dtype=np.int64)
|
||||
|
||||
target = {}
|
||||
target["image_id"] = np.asarray(
|
||||
[ann_info["image_id"] if "image_id" in ann_info else ann_info["id"]], dtype=np.int64
|
||||
)
|
||||
if return_masks:
|
||||
target["masks"] = masks
|
||||
target["class_labels"] = labels
|
||||
|
||||
target["boxes"] = masks_to_boxes(masks)
|
||||
|
||||
target["size"] = np.asarray([int(h), int(w)], dtype=np.int64)
|
||||
target["orig_size"] = np.asarray([int(h), int(w)], dtype=np.int64)
|
||||
if "segments_info" in ann_info:
|
||||
target["iscrowd"] = np.asarray([ann["iscrowd"] for ann in ann_info["segments_info"]], dtype=np.int64)
|
||||
target["area"] = np.asarray([ann["area"] for ann in ann_info["segments_info"]], dtype=np.float32)
|
||||
|
||||
return image, target
|
||||
|
||||
def _resize(self, image, size, target=None, max_size=None):
|
||||
"""
|
||||
Resize the image to the given size. Size can be min_size (scalar) or (w, h) tuple. If size is an int, smaller
|
||||
edge of the image will be matched to this number.
|
||||
|
||||
If given, also resize the target accordingly.
|
||||
"""
|
||||
if not isinstance(image, Image.Image):
|
||||
image = self.to_pil_image(image)
|
||||
|
||||
def get_size_with_aspect_ratio(image_size, size, max_size=None):
|
||||
w, h = image_size
|
||||
if max_size is not None:
|
||||
min_original_size = float(min((w, h)))
|
||||
max_original_size = float(max((w, h)))
|
||||
if max_original_size / min_original_size * size > max_size:
|
||||
size = int(round(max_size * min_original_size / max_original_size))
|
||||
|
||||
if (w <= h and w == size) or (h <= w and h == size):
|
||||
return (h, w)
|
||||
|
||||
if w < h:
|
||||
ow = size
|
||||
oh = int(size * h / w)
|
||||
else:
|
||||
oh = size
|
||||
ow = int(size * w / h)
|
||||
|
||||
return (oh, ow)
|
||||
|
||||
def get_size(image_size, size, max_size=None):
|
||||
if isinstance(size, (list, tuple)):
|
||||
return size
|
||||
else:
|
||||
# size returned must be (w, h) since we use PIL to resize images
|
||||
# so we reverse the tuple
|
||||
return get_size_with_aspect_ratio(image_size, size, max_size)[::-1]
|
||||
|
||||
size = get_size(image.size, size, max_size)
|
||||
rescaled_image = self.resize(image, size=size)
|
||||
|
||||
if target is None:
|
||||
return rescaled_image, None
|
||||
|
||||
ratios = tuple(float(s) / float(s_orig) for s, s_orig in zip(rescaled_image.size, image.size))
|
||||
ratio_width, ratio_height = ratios
|
||||
|
||||
target = target.copy()
|
||||
if "boxes" in target:
|
||||
boxes = target["boxes"]
|
||||
scaled_boxes = boxes * np.asarray([ratio_width, ratio_height, ratio_width, ratio_height], dtype=np.float32)
|
||||
target["boxes"] = scaled_boxes
|
||||
|
||||
if "area" in target:
|
||||
area = target["area"]
|
||||
scaled_area = area * (ratio_width * ratio_height)
|
||||
target["area"] = scaled_area
|
||||
|
||||
w, h = size
|
||||
target["size"] = np.asarray([h, w], dtype=np.int64)
|
||||
|
||||
if "masks" in target:
|
||||
# use PyTorch as current workaround
|
||||
# TODO replace by self.resize
|
||||
masks = torch.from_numpy(target["masks"][:, None]).float()
|
||||
interpolated_masks = F.interpolate(masks, size=(h, w), mode="nearest")[:, 0] > 0.5
|
||||
target["masks"] = interpolated_masks.numpy()
|
||||
|
||||
return rescaled_image, target
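Two worked examples of the sizing rule above, with the defaults `size=800` and `max_size=1333`: a 640x480 image has its short side scaled to 800 and its long side follows to int(800 * 640 / 480) = 1066, giving 1066x800; a 2000x500 image would overshoot `max_size` (2000 / 500 * 800 = 3200 > 1333), so the target is first capped to int(round(1333 * 500 / 2000)) = 333 and the image becomes 1332x333.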
|
||||
|
||||
def _normalize(self, image, mean, std, target=None):
|
||||
"""
|
||||
Normalize the image with a certain mean and std.
|
||||
|
||||
If given, also normalize the target bounding boxes based on the size of the image.
|
||||
"""
|
||||
|
||||
image = self.normalize(image, mean=mean, std=std)
|
||||
if target is None:
|
||||
return image, None
|
||||
|
||||
target = target.copy()
|
||||
h, w = image.shape[-2:]
|
||||
|
||||
if "boxes" in target:
|
||||
boxes = target["boxes"]
|
||||
boxes = corners_to_center_format(boxes)
|
||||
boxes = boxes / np.asarray([w, h, w, h], dtype=np.float32)
|
||||
target["boxes"] = boxes
|
||||
|
||||
return image, target
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
images: ImageInput,
|
||||
annotations: Union[List[Dict], List[List[Dict]]] = None,
|
||||
return_segmentation_masks: Optional[bool] = False,
|
||||
masks_path: Optional[pathlib.Path] = None,
|
||||
pad_and_return_pixel_mask: Optional[bool] = True,
|
||||
return_tensors: Optional[Union[str, TensorType]] = None,
|
||||
**kwargs,
|
||||
) -> BatchFeature:
|
||||
"""
|
||||
Main method to prepare for the model one or several image(s) and optional annotations. Images are by default
|
||||
padded up to the largest image in a batch, and a pixel mask is created that indicates which pixels are
|
||||
real/which are padding.
|
||||
|
||||
.. warning::
|
||||
|
||||
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so it is most efficient to pass
|
||||
PIL images.
|
||||
|
||||
Args:
|
||||
images (:obj:`PIL.Image.Image`, :obj:`np.ndarray`, :obj:`torch.Tensor`, :obj:`List[PIL.Image.Image]`, :obj:`List[np.ndarray]`, :obj:`List[torch.Tensor]`):
|
||||
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
|
||||
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
|
||||
number of channels, H and W are image height and width.
|
||||
|
||||
annotations (:obj:`Dict`, :obj:`List[Dict]`, `optional`):
|
||||
The corresponding annotations in COCO format.
|
||||
|
||||
In case :class:`~transformers.DetrFeatureExtractor` was initialized with :obj:`format =
|
||||
"coco_detection"`, the annotations for each image should have the following format: {'image_id': int,
|
||||
'annotations': [annotation]}, with the annotations being a list of COCO object annotations.
|
||||
|
||||
In case :class:`~transformers.DetrFeatureExtractor` was initialized with :obj:`format =
|
||||
"coco_panoptic"`, the annotations for each image should have the following format: {'image_id': int,
|
||||
'file_name': str, 'segments_info': [segment_info]} with segments_info being a list of COCO panoptic
|
||||
annotations.
|
||||
|
||||
return_segmentation_masks (:obj:`Dict`, :obj:`List[Dict]`, `optional`, defaults to :obj:`False`):
|
||||
Whether to also return instance segmentation masks in case :obj:`format = "coco_detection"`.
|
||||
|
||||
masks_path (:obj:`pathlib.Path`, `optional`):
|
||||
Path to the directory containing the PNG files that store the class-agnostic image segmentations. Only
|
||||
relevant in case :class:`~transformers.DetrFeatureExtractor` was initialized with :obj:`format =
|
||||
"coco_panoptic"`.
|
||||
|
||||
pad_and_return_pixel_mask (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether or not to pad images up to the largest image in a batch and create a pixel mask.
|
||||
|
||||
If left to the default, will return a pixel mask that is:
|
||||
|
||||
- 1 for pixels that are real (i.e. **not masked**),
|
||||
- 0 for pixels that are padding (i.e. **masked**).
|
||||
|
||||
return_tensors (:obj:`str` or :class:`~transformers.file_utils.TensorType`, `optional`):
|
||||
If set, will return tensors instead of NumPy arrays. If set to :obj:`'pt'`, return PyTorch
|
||||
:obj:`torch.Tensor` objects.
|
||||
|
||||
Returns:
|
||||
:class:`~transformers.BatchFeature`: A :class:`~transformers.BatchFeature` with the following fields:
|
||||
|
||||
- **pixel_values** -- Pixel values to be fed to a model.
|
||||
- **pixel_mask** -- Pixel mask to be fed to a model (when :obj:`pad_and_return_pixel_mask=True` or if
|
||||
`"pixel_mask"` is in :obj:`self.model_input_names`).
|
||||
"""
|
||||
# Input type checking for clearer error
|
||||
|
||||
valid_images = False
|
||||
valid_annotations = False
|
||||
valid_masks_path = False
|
||||
|
||||
# Check that images has a valid type
|
||||
if isinstance(images, (Image.Image, np.ndarray)) or is_torch_tensor(images):
|
||||
valid_images = True
|
||||
elif isinstance(images, (list, tuple)):
|
||||
if len(images) == 0 or isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]):
|
||||
valid_images = True
|
||||
|
||||
if not valid_images:
|
||||
raise ValueError(
|
||||
"Images must of type `PIL.Image.Image`, `np.ndarray` or `torch.Tensor` (single example),"
|
||||
"`List[PIL.Image.Image]`, `List[np.ndarray]` or `List[torch.Tensor]` (batch of examples)."
|
||||
)
|
||||
|
||||
is_batched = bool(
|
||||
isinstance(images, (list, tuple))
|
||||
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
|
||||
)
|
||||
|
||||
# Check that annotations has a valid type
|
||||
if annotations is not None:
|
||||
if not is_batched:
|
||||
if self.format == "coco_detection":
|
||||
if isinstance(annotations, dict) and "image_id" in annotations and "annotations" in annotations:
|
||||
if isinstance(annotations["annotations"], (list, tuple)):
|
||||
# an image can have no annotations
|
||||
if len(annotations["annotations"]) == 0 or isinstance(annotations["annotations"][0], dict):
|
||||
valid_annotations = True
|
||||
elif self.format == "coco_panoptic":
|
||||
if isinstance(annotations, dict) and "image_id" in annotations and "segments_info" in annotations:
|
||||
if isinstance(annotations["segments_info"], (list, tuple)):
|
||||
# an image can have no segments (?)
|
||||
if len(annotations["segments_info"]) == 0 or isinstance(
|
||||
annotations["segments_info"][0], dict
|
||||
):
|
||||
valid_annotations = True
|
||||
else:
|
||||
if isinstance(annotations, (list, tuple)):
|
||||
assert len(images) == len(annotations), "There must be as many annotations as there are images"
|
||||
if isinstance(annotations[0], Dict):
|
||||
if self.format == "coco_detection":
|
||||
if isinstance(annotations[0]["annotations"], (list, tuple)):
|
||||
valid_annotations = True
|
||||
elif self.format == "coco_panoptic":
|
||||
if isinstance(annotations[0]["segments_info"], (list, tuple)):
|
||||
valid_annotations = True
|
||||
|
||||
if not valid_annotations:
|
||||
raise ValueError(
|
||||
"""
|
||||
Annotations must be of type `Dict` (single image) or `List[Dict]` (batch of images). In case of object
|
||||
detection, each dictionary should contain the keys 'image_id' and 'annotations', with the latter
|
||||
being a list of annotations in COCO format. In case of panoptic segmentation, each dictionary
|
||||
should contain the keys 'file_name', 'image_id' and 'segments_info', with the latter being a list
|
||||
of annotations in COCO format.
|
||||
"""
|
||||
)
|
||||
|
||||
# Check that masks_path has a valid type
|
||||
if masks_path is not None:
|
||||
if self.format == "coco_panoptic":
|
||||
if isinstance(masks_path, pathlib.Path):
|
||||
valid_masks_path = True
|
||||
if not valid_masks_path:
|
||||
raise ValueError(
|
||||
"The path to the directory containing the mask PNG files should be provided as a `pathlib.Path` object."
|
||||
)
|
||||
|
||||
if not is_batched:
|
||||
images = [images]
|
||||
if annotations is not None:
|
||||
annotations = [annotations]
|
||||
|
||||
# prepare (COCO annotations as a list of Dict -> DETR target as a single Dict per image)
|
||||
if annotations is not None:
|
||||
for idx, (image, target) in enumerate(zip(images, annotations)):
|
||||
if not isinstance(image, Image.Image):
|
||||
image = self.to_pil_image(image)
|
||||
image, target = self.prepare(image, target, return_segmentation_masks, masks_path)
|
||||
images[idx] = image
|
||||
annotations[idx] = target
|
||||
|
||||
# transformations (resizing + normalization)
|
||||
if self.do_resize and self.size is not None:
|
||||
if annotations is not None:
|
||||
for idx, (image, target) in enumerate(zip(images, annotations)):
|
||||
image, target = self._resize(image=image, target=target, size=self.size, max_size=self.max_size)
|
||||
images[idx] = image
|
||||
annotations[idx] = target
|
||||
else:
|
||||
for idx, image in enumerate(images):
|
||||
images[idx] = self._resize(image=image, target=None, size=self.size, max_size=self.max_size)[0]
|
||||
|
||||
if self.do_normalize:
|
||||
if annotations is not None:
|
||||
for idx, (image, target) in enumerate(zip(images, annotations)):
|
||||
image, target = self._normalize(
|
||||
image=image, mean=self.image_mean, std=self.image_std, target=target
|
||||
)
|
||||
images[idx] = image
|
||||
annotations[idx] = target
|
||||
else:
|
||||
images = [
|
||||
self._normalize(image=image, mean=self.image_mean, std=self.image_std)[0] for image in images
|
||||
]
|
||||
|
||||
if pad_and_return_pixel_mask:
|
||||
# pad images up to largest image in batch and create pixel_mask
|
||||
max_size = self._max_by_axis([list(image.shape) for image in images])
|
||||
c, h, w = max_size
|
||||
padded_images = []
|
||||
pixel_mask = []
|
||||
for image in images:
|
||||
# create padded image
|
||||
padded_image = np.zeros((c, h, w), dtype=np.float32)
|
||||
padded_image[: image.shape[0], : image.shape[1], : image.shape[2]] = np.copy(image)
|
||||
padded_images.append(padded_image)
|
||||
# create pixel mask
|
||||
mask = np.zeros((h, w), dtype=np.int64)
|
||||
mask[: image.shape[1], : image.shape[2]] = True
|
||||
pixel_mask.append(mask)
|
||||
images = padded_images
|
||||
|
||||
# return as BatchFeature
|
||||
data = {}
|
||||
data["pixel_values"] = images
|
||||
if pad_and_return_pixel_mask:
|
||||
data["pixel_mask"] = pixel_mask
|
||||
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
|
||||
|
||||
if annotations is not None:
|
||||
# Convert to TensorType
|
||||
tensor_type = return_tensors
|
||||
if not isinstance(tensor_type, TensorType):
|
||||
tensor_type = TensorType(tensor_type)
|
||||
|
||||
if not tensor_type == TensorType.PYTORCH:
|
||||
raise ValueError("Only PyTorch is supported for the moment.")
|
||||
else:
|
||||
if not is_torch_available():
|
||||
raise ImportError("Unable to convert output to PyTorch tensors format, PyTorch is not installed.")
|
||||
|
||||
encoded_inputs["target"] = [
|
||||
{k: torch.from_numpy(v) for k, v in target.items()} for target in annotations
|
||||
]
|
||||
|
||||
return encoded_inputs
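A hedged end-to-end sketch of the call above for a single image (the COCO image URL is the one used in the conversion script; requires Pillow, requests and network access):

import requests
from PIL import Image
from transformers import DetrFeatureExtractor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = DetrFeatureExtractor()
encoding = feature_extractor(images=image, return_tensors="pt")
print(encoding["pixel_values"].shape)  # (1, 3, height, width) after resizing and padding
print(encoding["pixel_mask"].shape)    # (1, height, width)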
|
||||
|
||||
def _max_by_axis(self, the_list):
|
||||
# type: (List[List[int]]) -> List[int]
|
||||
maxes = the_list[0]
|
||||
for sublist in the_list[1:]:
|
||||
for index, item in enumerate(sublist):
|
||||
maxes[index] = max(maxes[index], item)
|
||||
return maxes
|
||||
|
||||
def pad_and_create_pixel_mask(
|
||||
self, pixel_values_list: List["torch.Tensor"], return_tensors: Optional[Union[str, TensorType]] = None
|
||||
):
|
||||
"""
|
||||
Pad images up to the largest image in a batch and create a corresponding :obj:`pixel_mask`.
|
||||
|
||||
Args:
|
||||
pixel_values_list (:obj:`List[torch.Tensor]`):
|
||||
List of images (pixel values) to be padded. Each image should be a tensor of shape (C, H, W).
|
||||
return_tensors (:obj:`str` or :class:`~transformers.file_utils.TensorType`, `optional`):
|
||||
If set, will return tensors instead of NumPy arrays. If set to :obj:`'pt'`, return PyTorch
|
||||
:obj:`torch.Tensor` objects.
|
||||
|
||||
Returns:
|
||||
:class:`~transformers.BatchFeature`: A :class:`~transformers.BatchFeature` with the following fields:
|
||||
|
||||
- **pixel_values** -- Pixel values to be fed to a model.
|
||||
- **pixel_mask** -- Pixel mask to be fed to a model (when :obj:`pad_and_return_pixel_mask=True` or if
|
||||
`"pixel_mask"` is in :obj:`self.model_input_names`).
|
||||
|
||||
"""
|
||||
|
||||
max_size = self._max_by_axis([list(image.shape) for image in pixel_values_list])
|
||||
c, h, w = max_size
|
||||
padded_images = []
|
||||
pixel_mask = []
|
||||
for image in pixel_values_list:
|
||||
# create padded image
|
||||
padded_image = np.zeros((c, h, w), dtype=np.float32)
|
||||
padded_image[: image.shape[0], : image.shape[1], : image.shape[2]] = np.copy(image)
|
||||
padded_images.append(padded_image)
|
||||
# create pixel mask
|
||||
mask = np.zeros((h, w), dtype=np.int64)
|
||||
mask[: image.shape[1], : image.shape[2]] = True
|
||||
pixel_mask.append(mask)
|
||||
|
||||
# return as BatchFeature
|
||||
data = {"pixel_values": padded_images, "pixel_mask": pixel_mask}
|
||||
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
|
||||
|
||||
return encoded_inputs
|
||||
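# Usage sketch for the padding helper above (illustrative only; the tensor sizes below
# are arbitrary assumptions, not part of this diff):
#
#   import torch
#   from transformers import DetrFeatureExtractor
#
#   feature_extractor = DetrFeatureExtractor(do_resize=False, do_normalize=False)
#   images = [torch.rand(3, 480, 640), torch.rand(3, 400, 500)]
#   encoding = feature_extractor.pad_and_create_pixel_mask(images, return_tensors="pt")
#   # both images are padded up to the largest size in the batch, (3, 480, 640);
#   # pixel_mask is 1 on real pixels and 0 on padding
#   assert encoding["pixel_values"].shape == (2, 3, 480, 640)
#   assert encoding["pixel_mask"].shape == (2, 480, 640)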
|
||||
# POSTPROCESSING METHODS
|
||||
# inspired by https://github.com/facebookresearch/detr/blob/master/models/detr.py#L258
|
||||
def post_process(self, outputs, target_sizes):
|
||||
"""
|
||||
Converts the output of :class:`~transformers.DetrForObjectDetection` into the format expected by the COCO api.
|
||||
Only supports PyTorch.
|
||||
|
||||
Args:
|
||||
outputs (:class:`~transformers.DetrObjectDetectionOutput`):
|
||||
Raw outputs of the model.
|
||||
target_sizes (:obj:`torch.Tensor` of shape :obj:`(batch_size, 2)`, `optional`):
|
||||
Tensor containing the size (h, w) of each image of the batch. For evaluation, this must be the original
|
||||
image size (before any data augmentation). For visualization, this should be the image size after data
|
||||
augmentation, but before padding.
|
||||
|
||||
Returns:
|
||||
:obj:`List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an
|
||||
image in the batch as predicted by the model.
|
||||
"""
|
||||
out_logits, out_bbox = outputs.logits, outputs.pred_boxes
|
||||
|
||||
assert len(out_logits) == len(
|
||||
target_sizes
|
||||
), "Make sure that you pass in as many target sizes as the batch dimension of the logits"
|
||||
assert (
|
||||
target_sizes.shape[1] == 2
|
||||
), "Each element of target_sizes must contain the size (h, w) of each image of the batch"
|
||||
|
||||
prob = F.softmax(out_logits, -1)
|
||||
scores, labels = prob[..., :-1].max(-1)
|
||||
|
||||
# convert to [x0, y0, x1, y1] format
|
||||
boxes = center_to_corners_format(out_bbox)
|
||||
# and from relative [0, 1] to absolute [0, height] coordinates
|
||||
img_h, img_w = target_sizes.unbind(1)
|
||||
scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
|
||||
boxes = boxes * scale_fct[:, None, :]
|
||||
|
||||
results = [{"scores": s, "labels": l, "boxes": b} for s, l, b in zip(scores, labels, boxes)]
|
||||
|
||||
return results
|
||||
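# Usage sketch for post_process (illustrative only; the checkpoint name is an assumption
# and not part of this diff):
#
#   import torch
#   from PIL import Image
#   from transformers import DetrFeatureExtractor, DetrForObjectDetection
#
#   image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
#   feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
#   model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
#
#   inputs = feature_extractor(images=image, return_tensors="pt")
#   outputs = model(**inputs)
#
#   # rescale the normalized (center_x, center_y, w, h) boxes to absolute (x0, y0, x1, y1) pixels
#   target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
#   results = feature_extractor.post_process(outputs, target_sizes)
#   # results[0] contains "scores", "labels" and "boxes" for the first image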
|
||||
# inspired by https://github.com/facebookresearch/detr/blob/master/models/segmentation.py#L218
|
||||
def post_process_segmentation(self, results, outputs, orig_target_sizes, max_target_sizes, threshold=0.5):
|
||||
"""
|
||||
Converts the output of :class:`~transformers.DetrForSegmentation` into actual instance segmentation
|
||||
predictions. Only supports PyTorch.
|
||||
|
||||
Args:
|
||||
results (:obj:`List[Dict]`):
|
||||
Results list obtained by :meth:`~transformers.DetrFeatureExtractor.post_process`, to which "masks"
|
||||
results will be added.
|
||||
outputs (:class:`~transformers.DetrSegmentationOutput`):
|
||||
Raw outputs of the model.
|
||||
orig_target_sizes (:obj:`torch.Tensor` of shape :obj:`(batch_size, 2)`):
|
||||
Tensor containing the size (h, w) of each image of the batch. For evaluation, this must be the original
|
||||
image size (before any data augmentation).
|
||||
max_target_sizes (:obj:`torch.Tensor` of shape :obj:`(batch_size, 2)`):
|
||||
Tensor containing the maximum size (h, w) of each image of the batch. For evaluation, this must be the
|
||||
original image size (before any data augmentation).
|
||||
threshold (:obj:`float`, `optional`, defaults to 0.5):
|
||||
Threshold to use when turning the predicted masks into binary values.
|
||||
|
||||
Returns:
|
||||
:obj:`List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels, boxes and masks
|
||||
for an image in the batch as predicted by the model.
|
||||
"""
|
||||
|
||||
assert len(orig_target_sizes) == len(
|
||||
max_target_sizes
|
||||
), "Make sure to pass in as many orig_target_sizes as max_target_sizes"
|
||||
max_h, max_w = max_target_sizes.max(0)[0].tolist()
|
||||
outputs_masks = outputs.pred_masks.squeeze(2)
|
||||
outputs_masks = F.interpolate(outputs_masks, size=(max_h, max_w), mode="bilinear", align_corners=False)
|
||||
outputs_masks = (outputs_masks.sigmoid() > threshold).cpu()
|
||||
|
||||
for i, (cur_mask, t, tt) in enumerate(zip(outputs_masks, max_target_sizes, orig_target_sizes)):
|
||||
img_h, img_w = t[0], t[1]
|
||||
results[i]["masks"] = cur_mask[:, :img_h, :img_w].unsqueeze(1)
|
||||
results[i]["masks"] = F.interpolate(
|
||||
results[i]["masks"].float(), size=tuple(tt.tolist()), mode="nearest"
|
||||
).byte()
|
||||
|
||||
return results
|
||||
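# Usage sketch for post_process_segmentation (illustrative only; it continues the object
# detection sketch above, but assumes `outputs` now comes from DetrForSegmentation and
# `results` from post_process):
#
#   orig_target_sizes = torch.tensor([image.size[::-1]])                  # original (h, w)
#   max_target_sizes = torch.tensor([inputs["pixel_values"].shape[-2:]])  # (h, w) fed to the model
#   results = feature_extractor.post_process_segmentation(results, outputs, orig_target_sizes, max_target_sizes)
#   # each result now additionally contains a binary "masks" tensor for the kept queries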
|
||||
# inspired by https://github.com/facebookresearch/detr/blob/master/models/segmentation.py#L241
|
||||
def post_process_panoptic(self, outputs, processed_sizes, target_sizes=None, is_thing_map=None, threshold=0.85):
|
||||
"""
|
||||
Converts the output of :class:`~transformers.DetrForSegmentation` into actual panoptic predictions. Only
|
||||
supports PyTorch.
|
||||
|
||||
Parameters:
|
||||
outputs (:class:`~transformers.DetrSegmentationOutput`):
|
||||
Raw outputs of the model.
|
||||
processed_sizes (:obj:`torch.Tensor` of shape :obj:`(batch_size, 2)` or :obj:`List[Tuple]` of length :obj:`batch_size`):
|
||||
Torch Tensor (or list) containing the size (h, w) of each image of the batch, i.e. the size after data
|
||||
augmentation but before batching.
|
||||
target_sizes (:obj:`torch.Tensor` of shape :obj:`(batch_size, 2)` or :obj:`List[Tuple]` of length :obj:`batch_size`, `optional`):
|
||||
Torch Tensor (or list) corresponding to the requested final size (h, w) of each prediction. If left to
|
||||
None, it will default to the :obj:`processed_sizes`.
|
||||
is_thing_map (:obj:`Dict[int, bool]`, `optional`):
|
||||
Dictionary mapping class indices to either True or False, depending on whether or not they are a thing.
|
||||
If not set, defaults to the :obj:`is_thing_map` of COCO panoptic.
|
||||
threshold (:obj:`float`, `optional`, defaults to 0.85):
|
||||
Threshold to use to filter out queries.
|
||||
|
||||
Returns:
|
||||
:obj:`List[Dict]`: A list of dictionaries, each dictionary containing a PNG string and segments_info values
|
||||
for an image in the batch as predicted by the model.
|
||||
"""
|
||||
if target_sizes is None:
|
||||
target_sizes = processed_sizes
|
||||
assert len(processed_sizes) == len(
|
||||
target_sizes
|
||||
), "Make sure to pass in as many processed_sizes as target_sizes"
|
||||
|
||||
if is_thing_map is None:
|
||||
# default to is_thing_map of COCO panoptic
|
||||
is_thing_map = {i: i <= 90 for i in range(201)}
|
||||
|
||||
out_logits, raw_masks, raw_boxes = outputs.logits, outputs.pred_masks, outputs.pred_boxes
|
||||
assert (
|
||||
len(out_logits) == len(raw_masks) == len(target_sizes)
|
||||
), "Make sure that you pass in as many target sizes as the batch dimension of the logits and masks"
|
||||
preds = []
|
||||
|
||||
def to_tuple(tup):
|
||||
if isinstance(tup, tuple):
|
||||
return tup
|
||||
return tuple(tup.cpu().tolist())
|
||||
|
||||
for cur_logits, cur_masks, cur_boxes, size, target_size in zip(
|
||||
out_logits, raw_masks, raw_boxes, processed_sizes, target_sizes
|
||||
):
|
||||
# we filter out empty queries and detections below the threshold
|
||||
scores, labels = cur_logits.softmax(-1).max(-1)
|
||||
keep = labels.ne(outputs.logits.shape[-1] - 1) & (scores > threshold)
|
||||
cur_scores, cur_classes = cur_logits.softmax(-1).max(-1)
|
||||
cur_scores = cur_scores[keep]
|
||||
cur_classes = cur_classes[keep]
|
||||
cur_masks = cur_masks[keep]
|
||||
cur_masks = F.interpolate(cur_masks[:, None], to_tuple(size), mode="bilinear").squeeze(1)
|
||||
cur_boxes = center_to_corners_format(cur_boxes[keep])
|
||||
|
||||
h, w = cur_masks.shape[-2:]
|
||||
assert len(cur_boxes) == len(cur_classes), "Not as many boxes as there are classes"
|
||||
|
||||
# It may be that we have several predicted masks for the same stuff class.
|
||||
# In the following, we track the list of mask ids for each stuff class (they are merged later on)
|
||||
cur_masks = cur_masks.flatten(1)
|
||||
stuff_equiv_classes = defaultdict(lambda: [])
|
||||
for k, label in enumerate(cur_classes):
|
||||
if not is_thing_map[label.item()]:
|
||||
stuff_equiv_classes[label.item()].append(k)
|
||||
|
||||
def get_ids_area(masks, scores, dedup=False):
|
||||
# This helper function creates the final panoptic segmentation image
|
||||
# It also returns the area of the masks that appear on the image
|
||||
|
||||
m_id = masks.transpose(0, 1).softmax(-1)
|
||||
|
||||
if m_id.shape[-1] == 0:
|
||||
# We didn't detect any mask :(
|
||||
m_id = torch.zeros((h, w), dtype=torch.long, device=m_id.device)
|
||||
else:
|
||||
m_id = m_id.argmax(-1).view(h, w)
|
||||
|
||||
if dedup:
|
||||
# Merge the masks corresponding to the same stuff class
|
||||
for equiv in stuff_equiv_classes.values():
|
||||
if len(equiv) > 1:
|
||||
for eq_id in equiv:
|
||||
m_id.masked_fill_(m_id.eq(eq_id), equiv[0])
|
||||
|
||||
final_h, final_w = to_tuple(target_size)
|
||||
|
||||
seg_img = Image.fromarray(id_to_rgb(m_id.view(h, w).cpu().numpy()))
|
||||
seg_img = seg_img.resize(size=(final_w, final_h), resample=Image.NEAREST)
|
||||
|
||||
np_seg_img = torch.ByteTensor(torch.ByteStorage.from_buffer(seg_img.tobytes()))
|
||||
np_seg_img = np_seg_img.view(final_h, final_w, 3)
|
||||
np_seg_img = np_seg_img.numpy()
|
||||
|
||||
m_id = torch.from_numpy(rgb_to_id(np_seg_img))
|
||||
|
||||
area = []
|
||||
for i in range(len(scores)):
|
||||
area.append(m_id.eq(i).sum().item())
|
||||
return area, seg_img
|
||||
|
||||
area, seg_img = get_ids_area(cur_masks, cur_scores, dedup=True)
|
||||
if cur_classes.numel() > 0:
|
||||
# We now filter out empty masks, as long as we find some
|
||||
while True:
|
||||
filtered_small = torch.as_tensor(
|
||||
[area[i] <= 4 for i, c in enumerate(cur_classes)], dtype=torch.bool, device=keep.device
|
||||
)
|
||||
if filtered_small.any().item():
|
||||
cur_scores = cur_scores[~filtered_small]
|
||||
cur_classes = cur_classes[~filtered_small]
|
||||
cur_masks = cur_masks[~filtered_small]
|
||||
area, seg_img = get_ids_area(cur_masks, cur_scores)
|
||||
else:
|
||||
break
|
||||
|
||||
else:
|
||||
cur_classes = torch.ones(1, dtype=torch.long, device=cur_classes.device)
|
||||
|
||||
segments_info = []
|
||||
for i, a in enumerate(area):
|
||||
cat = cur_classes[i].item()
|
||||
segments_info.append({"id": i, "isthing": is_thing_map[cat], "category_id": cat, "area": a})
|
||||
del cur_classes
|
||||
|
||||
with io.BytesIO() as out:
|
||||
seg_img.save(out, format="PNG")
|
||||
predictions = {"png_string": out.getvalue(), "segments_info": segments_info}
|
||||
preds.append(predictions)
|
||||
return preds
|
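# Usage sketch for post_process_panoptic (illustrative only; the checkpoint name is an
# assumption and not part of this diff):
#
#   import io
#   from PIL import Image
#   from transformers import DetrFeatureExtractor, DetrForSegmentation
#
#   image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
#   feature_extractor = DetrFeatureExtractor(format="coco_panoptic")
#   model = DetrForSegmentation.from_pretrained("facebook/detr-resnet-50-panoptic")
#
#   inputs = feature_extractor(images=image, return_tensors="pt")
#   outputs = model(**inputs)
#
#   processed_sizes = [tuple(inputs["pixel_values"].shape[-2:])]  # (h, w) after resizing, before padding
#   predictions = feature_extractor.post_process_panoptic(outputs, processed_sizes, target_sizes=[image.size[::-1]])
#   panoptic_png = Image.open(io.BytesIO(predictions[0]["png_string"]))
#   # predictions[0]["segments_info"] lists id, category_id, isthing and area per segment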
File diff suppressed because it is too large
|
@ -39,6 +39,7 @@ from .file_utils import (
|
|||
is_sentencepiece_available,
|
||||
is_soundfile_availble,
|
||||
is_tf_available,
|
||||
is_timm_available,
|
||||
is_tokenizers_available,
|
||||
is_torch_available,
|
||||
is_torch_tpu_available,
|
||||
|
@ -229,6 +230,19 @@ def require_onnx(test_case):
|
|||
return test_case
|
||||
|
||||
|
||||
def require_timm(test_case):
|
||||
"""
|
||||
Decorator marking a test that requires Timm.
|
||||
|
||||
These tests are skipped when Timm isn't installed.
|
||||
|
||||
"""
|
||||
if not is_timm_available():
|
||||
return unittest.skip("test requires Timm")(test_case)
|
||||
else:
|
||||
return test_case
|
||||
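# Usage sketch (illustrative; the test name below is hypothetical):
#
#   @require_timm
#   def test_detr_backbone(self):
#       ...  # only runs when timm is installed, otherwise reported as skipped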
|
||||
|
||||
def require_torch(test_case):
|
||||
"""
|
||||
Decorator marking a test that requires PyTorch.
|
||||
|
|
|
@ -0,0 +1,94 @@
|
|||
# COCO object detection IDs to class names
|
||||
id2label = {
|
||||
0: "N/A",
|
||||
1: "person",
|
||||
2: "bicycle",
|
||||
3: "car",
|
||||
4: "motorcycle",
|
||||
5: "airplane",
|
||||
6: "bus",
|
||||
7: "train",
|
||||
8: "truck",
|
||||
9: "boat",
|
||||
10: "traffic light",
|
||||
11: "fire hydrant",
|
||||
12: "N/A",
|
||||
13: "stop sign",
|
||||
14: "parking meter",
|
||||
15: "bench",
|
||||
16: "bird",
|
||||
17: "cat",
|
||||
18: "dog",
|
||||
19: "horse",
|
||||
20: "sheep",
|
||||
21: "cow",
|
||||
22: "elephant",
|
||||
23: "bear",
|
||||
24: "zebra",
|
||||
25: "giraffe",
|
||||
26: "N/A",
|
||||
27: "backpack",
|
||||
28: "umbrella",
|
||||
29: "N/A",
|
||||
30: "N/A",
|
||||
31: "handbag",
|
||||
32: "tie",
|
||||
33: "suitcase",
|
||||
34: "frisbee",
|
||||
35: "skis",
|
||||
36: "snowboard",
|
||||
37: "sports ball",
|
||||
38: "kite",
|
||||
39: "baseball bat",
|
||||
40: "baseball glove",
|
||||
41: "skateboard",
|
||||
42: "surfboard",
|
||||
43: "tennis racket",
|
||||
44: "bottle",
|
||||
45: "N/A",
|
||||
46: "wine glass",
|
||||
47: "cup",
|
||||
48: "fork",
|
||||
49: "knife",
|
||||
50: "spoon",
|
||||
51: "bowl",
|
||||
52: "banana",
|
||||
53: "apple",
|
||||
54: "sandwich",
|
||||
55: "orange",
|
||||
56: "broccoli",
|
||||
57: "carrot",
|
||||
58: "hot dog",
|
||||
59: "pizza",
|
||||
60: "donut",
|
||||
61: "cake",
|
||||
62: "chair",
|
||||
63: "couch",
|
||||
64: "potted plant",
|
||||
65: "bed",
|
||||
66: "N/A",
|
||||
67: "dining table",
|
||||
68: "N/A",
|
||||
69: "N/A",
|
||||
70: "toilet",
|
||||
71: "N/A",
|
||||
72: "tv",
|
||||
73: "laptop",
|
||||
74: "mouse",
|
||||
75: "remote",
|
||||
76: "keyboard",
|
||||
77: "cell phone",
|
||||
78: "microwave",
|
||||
79: "oven",
|
||||
80: "toaster",
|
||||
81: "sink",
|
||||
82: "refrigerator",
|
||||
83: "N/A",
|
||||
84: "book",
|
||||
85: "clock",
|
||||
86: "vase",
|
||||
87: "scissors",
|
||||
88: "teddy bear",
|
||||
89: "hair drier",
|
||||
90: "toothbrush",
|
||||
}
|
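# Usage sketch (illustrative only; assumes `results` produced by
# DetrFeatureExtractor.post_process as sketched earlier):
#
#   for score, label in zip(results[0]["scores"], results[0]["labels"]):
#       if score > 0.9:
#           print(f"{id2label[label.item()]}: {score.item():.2f}")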
|
@ -334,6 +334,9 @@ MODEL_FOR_MULTIPLE_CHOICE_MAPPING = None
|
|||
MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING = None
|
||||
|
||||
|
||||
MODEL_FOR_OBJECT_DETECTION_MAPPING = None
|
||||
|
||||
|
||||
MODEL_FOR_PRETRAINING_MAPPING = None
|
||||
|
||||
|
||||
|
|
|
@ -0,0 +1,24 @@
|
|||
# This file is autogenerated by the command `make fix-copies`, do not edit.
|
||||
from ..file_utils import requires_backends
|
||||
|
||||
|
||||
DETR_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
class DetrForObjectDetection:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["timm", "vision"])
|
||||
|
||||
|
||||
class DetrForSegmentation:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["timm", "vision"])
|
||||
|
||||
|
||||
class DetrModel:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["timm", "vision"])
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(self, *args, **kwargs):
|
||||
requires_backends(self, ["timm", "vision"])
|
|
@ -0,0 +1,24 @@
|
|||
# This file is autogenerated by the command `make fix-copies`, do not edit.
|
||||
from ..file_utils import requires_backends
|
||||
|
||||
|
||||
DETR_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
class DetrForObjectDetection:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["timm"])
|
||||
|
||||
|
||||
class DetrForSegmentation:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["timm"])
|
||||
|
||||
|
||||
class DetrModel:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["timm"])
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(self, *args, **kwargs):
|
||||
requires_backends(self, ["timm"])
|
|
@ -22,6 +22,11 @@ class DeiTFeatureExtractor:
|
|||
requires_backends(self, ["vision"])
|
||||
|
||||
|
||||
class DetrFeatureExtractor:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["vision"])
|
||||
|
||||
|
||||
class ViTFeatureExtractor:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["vision"])
|
||||
|
|
Binary file not shown (86 KiB image removed).
|
@ -1,4 +1,3 @@
|
|||
*.*
|
||||
cache*
|
||||
temp*
|
||||
!*.txt
|
||||
|
|
Binary image file (678 KiB before and after).
|
@ -0,0 +1 @@
|
|||
[{"segmentation": [[333.96, 175.14, 338.26, 134.33, 342.55, 95.67, 348.99, 79.57, 368.32, 80.64, 371.54, 91.38, 364.03, 106.41, 356.51, 145.07, 351.14, 166.55, 350.07, 184.8, 345.77, 185.88, 332.89, 178.36, 332.89, 172.99]], "area": 2120.991099999999, "iscrowd": 0, "image_id": 39769, "bbox": [332.89, 79.57, 38.65, 106.31], "category_id": 75, "id": 1108446}, {"segmentation": [[44.03, 86.01, 112.75, 74.2, 173.96, 77.42, 175.03, 89.23, 170.74, 98.9, 147.11, 102.12, 54.77, 119.3, 53.69, 119.3, 44.03, 113.93, 41.88, 94.6, 41.88, 94.6]], "area": 4052.607, "iscrowd": 0, "image_id": 39769, "bbox": [41.88, 74.2, 133.15, 45.1], "category_id": 75, "id": 1110067}, {"segmentation": [[1.08, 473.53, 633.17, 473.53, 557.66, 376.45, 535.01, 366.74, 489.71, 305.26, 470.29, 318.2, 456.27, 351.64, 413.12, 363.51, 376.45, 358.11, 348.4, 350.56, 363.51, 331.15, 357.03, 288.0, 353.8, 257.8, 344.09, 190.92, 333.3, 177.98, 345.17, 79.82, 284.76, 130.52, 265.35, 151.01, 308.49, 189.84, 317.12, 215.73, 293.39, 243.78, 269.66, 212.49, 235.15, 199.55, 214.65, 193.08, 187.69, 217.89, 159.64, 278.29, 135.91, 313.89, 169.35, 292.31, 203.87, 281.53, 220.04, 292.31, 220.04, 307.42, 175.82, 345.17, 155.33, 360.27, 105.71, 363.51, 85.21, 374.29, 74.43, 366.74, 70.11, 465.98, 42.07, 471.37, 33.44, 457.35, 34.52, 414.2, 29.12, 368.9, 9.71, 291.24, 46.38, 209.26, 99.24, 128.36, 131.6, 107.87, 50.7, 117.57, 40.99, 103.55, 40.99, 85.21, 60.4, 77.66, 141.3, 70.11, 173.66, 72.27, 174.74, 92.76, 204.94, 72.27, 225.44, 62.56, 262.11, 56.09, 292.31, 53.93, 282.61, 81.98, 298.79, 96.0, 310.65, 102.47, 348.4, 74.43, 373.21, 81.98, 430.38, 35.6, 484.31, 23.73, 540.4, 46.38, 593.26, 66.88, 638.56, 80.9, 632.09, 145.62, 581.39, 118.65, 543.64, 130.52, 533.93, 167.19, 512.36, 197.39, 498.34, 218.97, 529.62, 253.48, 549.03, 273.98, 584.63, 276.13, 587.87, 293.39, 566.29, 305.26, 531.78, 298.79, 549.03, 319.28, 576.0, 358.11, 560.9, 376.45, 639.64, 471.37, 639.64, 2.16, 1.08, 0.0]], "area": 176277.55269999994, "iscrowd": 0, "image_id": 39769, "bbox": [1.08, 0.0, 638.56, 473.53], "category_id": 63, "id": 1605237}, {"segmentation": [[1.07, 1.18, 640.0, 3.33, 638.93, 472.59, 4.3, 479.03]], "area": 301552.6694999999, "iscrowd": 0, "image_id": 39769, "bbox": [1.07, 1.18, 638.93, 477.85], "category_id": 65, "id": 1612051}, {"segmentation": [[138.75, 319.38, 148.75, 294.38, 165.0, 246.87, 197.5, 205.63, 247.5, 203.13, 268.75, 216.88, 280.0, 239.38, 293.75, 244.38, 303.75, 241.88, 307.5, 228.13, 318.75, 220.63, 315.0, 200.63, 291.25, 171.88, 265.0, 156.88, 258.75, 148.13, 262.5, 135.63, 282.5, 123.13, 292.5, 115.63, 311.25, 108.13, 313.75, 106.88, 296.25, 93.13, 282.5, 84.38, 292.5, 64.38, 288.75, 60.63, 266.25, 54.38, 232.5, 63.12, 206.25, 70.63, 170.0, 100.63, 136.25, 114.38, 101.25, 138.13, 56.25, 194.38, 27.5, 259.38, 17.5, 299.38, 32.5, 378.13, 31.25, 448.13, 41.25, 469.38, 66.25, 466.88, 70.0, 419.38, 71.25, 391.88, 77.5, 365.63, 113.75, 364.38, 145.0, 360.63, 168.75, 349.38, 191.25, 330.63, 212.5, 319.38, 223.75, 305.63, 206.25, 286.88, 172.5, 288.13]], "area": 53301.618749999994, "iscrowd": 0, "image_id": 39769, "bbox": [17.5, 54.38, 301.25, 415.0], "category_id": 17, "id": 2190839}, {"segmentation": [[543.75, 136.88, 570.0, 114.38, 591.25, 123.13, 616.25, 140.63, 640.0, 143.13, 636.25, 124.37, 605.0, 103.13, 640.0, 103.13, 633.75, 86.88, 587.5, 73.13, 548.75, 49.38, 505.0, 35.63, 462.5, 25.63, 405.0, 48.13, 362.5, 111.88, 347.5, 179.38, 355.0, 220.63, 356.25, 230.63, 365.0, 264.38, 358.75, 266.88, 358.75, 270.63, 356.25, 291.88, 356.25, 
325.63, 355.0, 338.13, 350.0, 348.13, 365.0, 354.38, 396.25, 351.88, 423.75, 355.63, 446.25, 350.63, 460.0, 345.63, 462.5, 321.88, 468.75, 306.88, 481.25, 299.38, 516.25, 341.88, 536.25, 368.13, 570.0, 369.38, 578.75, 359.38, 555.0, 330.63, 532.5, 298.13, 563.75, 299.38, 582.5, 298.13, 586.25, 286.88, 578.75, 278.13, 548.75, 269.38, 525.0, 256.88, 505.0, 206.88, 536.25, 161.88, 540.0, 149.38]], "area": 59700.95625, "iscrowd": 0, "image_id": 39769, "bbox": [347.5, 25.63, 292.5, 343.75], "category_id": 17, "id": 2190842}]
|
Binary file not shown (new 8.1 KiB image).
|
@ -0,0 +1 @@
|
|||
[{"id": 8222595, "category_id": 17, "iscrowd": 0, "bbox": [18, 54, 301, 415], "area": 53306}, {"id": 8225432, "category_id": 17, "iscrowd": 0, "bbox": [349, 26, 291, 343], "area": 59627}, {"id": 8798150, "category_id": 63, "iscrowd": 0, "bbox": [1, 0, 639, 474], "area": 174579}, {"id": 14466198, "category_id": 75, "iscrowd": 0, "bbox": [42, 74, 133, 45], "area": 4068}, {"id": 12821912, "category_id": 75, "iscrowd": 0, "bbox": [333, 80, 38, 106], "area": 2118}, {"id": 10898909, "category_id": 93, "iscrowd": 0, "bbox": [0, 0, 640, 480], "area": 2750}]
|
|
@ -18,6 +18,57 @@ import json
|
|||
import os
|
||||
import tempfile
|
||||
|
||||
from transformers.file_utils import is_torch_available, is_vision_available
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import numpy as np
|
||||
import torch
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
|
||||
def prepare_image_inputs(feature_extract_tester, equal_resolution=False, numpify=False, torchify=False):
|
||||
"""This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True,
|
||||
or a list of PyTorch tensors if one specifies torchify=True.
|
||||
"""
|
||||
|
||||
assert not (numpify and torchify), "You cannot specify both numpy and PyTorch tensors at the same time"
|
||||
|
||||
if equal_resolution:
|
||||
image_inputs = []
|
||||
for i in range(feature_extract_tester.batch_size):
|
||||
image_inputs.append(
|
||||
np.random.randint(
|
||||
255,
|
||||
size=(
|
||||
feature_extract_tester.num_channels,
|
||||
feature_extract_tester.max_resolution,
|
||||
feature_extract_tester.max_resolution,
|
||||
),
|
||||
dtype=np.uint8,
|
||||
)
|
||||
)
|
||||
else:
|
||||
image_inputs = []
|
||||
for i in range(feature_extract_tester.batch_size):
|
||||
width, height = np.random.choice(
|
||||
np.arange(feature_extract_tester.min_resolution, feature_extract_tester.max_resolution), 2
|
||||
)
|
||||
image_inputs.append(
|
||||
np.random.randint(255, size=(feature_extract_tester.num_channels, width, height), dtype=np.uint8)
|
||||
)
|
||||
|
||||
if not numpify and not torchify:
|
||||
# PIL expects the channel dimension as last dimension
|
||||
image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
|
||||
|
||||
if torchify:
|
||||
image_inputs = [torch.from_numpy(x) for x in image_inputs]
|
||||
|
||||
return image_inputs
|
||||
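# Usage sketch (illustrative only; DetrFeatureExtractionTester is defined in
# test_feature_extraction_detr.py, and any tester with batch_size, num_channels,
# min_resolution and max_resolution attributes would work):
#
#   tester = DetrFeatureExtractionTester(None)
#   pil_images = prepare_image_inputs(tester, equal_resolution=False)                # list of PIL.Image.Image
#   np_images = prepare_image_inputs(tester, equal_resolution=False, numpify=True)   # list of np.ndarray, (C, H, W)
#   pt_images = prepare_image_inputs(tester, equal_resolution=False, torchify=True)  # list of torch.Tensor, (C, H, W)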
|
||||
|
||||
class FeatureExtractionSavingTestMixin:
|
||||
def test_feat_extract_to_json_string(self):
|
||||
|
|
|
@ -21,7 +21,7 @@ import numpy as np
|
|||
from transformers.file_utils import is_torch_available, is_vision_available
|
||||
from transformers.testing_utils import require_torch, require_vision
|
||||
|
||||
from .test_feature_extraction_common import FeatureExtractionSavingTestMixin
|
||||
from .test_feature_extraction_common import FeatureExtractionSavingTestMixin, prepare_image_inputs
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
|
@ -75,36 +75,6 @@ class DeiTFeatureExtractionTester(unittest.TestCase):
|
|||
"image_std": self.image_std,
|
||||
}
|
||||
|
||||
def prepare_inputs(self, equal_resolution=False, numpify=False, torchify=False):
|
||||
"""This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True,
|
||||
or a list of PyTorch tensors if one specifies torchify=True.
|
||||
"""
|
||||
|
||||
assert not (numpify and torchify), "You cannot specify both numpy and PyTorch tensors at the same time"
|
||||
|
||||
if equal_resolution:
|
||||
image_inputs = []
|
||||
for i in range(self.batch_size):
|
||||
image_inputs.append(
|
||||
np.random.randint(
|
||||
255, size=(self.num_channels, self.max_resolution, self.max_resolution), dtype=np.uint8
|
||||
)
|
||||
)
|
||||
else:
|
||||
image_inputs = []
|
||||
for i in range(self.batch_size):
|
||||
width, height = np.random.choice(np.arange(self.min_resolution, self.max_resolution), 2)
|
||||
image_inputs.append(np.random.randint(255, size=(self.num_channels, width, height), dtype=np.uint8))
|
||||
|
||||
if not numpify and not torchify:
|
||||
# PIL expects the channel dimension as last dimension
|
||||
image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
|
||||
|
||||
if torchify:
|
||||
image_inputs = [torch.from_numpy(x) for x in image_inputs]
|
||||
|
||||
return image_inputs
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_vision
|
||||
|
@ -136,7 +106,7 @@ class DeiTFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestC
|
|||
# Initialize feature_extractor
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
# create random PIL images
|
||||
image_inputs = self.feature_extract_tester.prepare_inputs(equal_resolution=False)
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, Image.Image)
|
||||
|
||||
|
@ -168,7 +138,7 @@ class DeiTFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestC
|
|||
# Initialize feature_extractor
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
# create random numpy tensors
|
||||
image_inputs = self.feature_extract_tester.prepare_inputs(equal_resolution=False, numpify=True)
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, numpify=True)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, np.ndarray)
|
||||
|
||||
|
@ -200,7 +170,7 @@ class DeiTFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestC
|
|||
# Initialize feature_extractor
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
# create random PyTorch tensors
|
||||
image_inputs = self.feature_extract_tester.prepare_inputs(equal_resolution=False, torchify=True)
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, torchify=True)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, torch.Tensor)
|
||||
|
||||
|
|
|
@ -0,0 +1,339 @@
|
|||
# coding=utf-8
|
||||
# Copyright 2021 HuggingFace Inc.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
import json
|
||||
import pathlib
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
|
||||
from transformers.file_utils import is_torch_available, is_vision_available
|
||||
from transformers.testing_utils import require_torch, require_vision, slow
|
||||
|
||||
from .test_feature_extraction_common import FeatureExtractionSavingTestMixin, prepare_image_inputs
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
from transformers import DetrFeatureExtractor
|
||||
|
||||
|
||||
class DetrFeatureExtractionTester(unittest.TestCase):
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=7,
|
||||
num_channels=3,
|
||||
min_resolution=30,
|
||||
max_resolution=400,
|
||||
do_resize=True,
|
||||
size=18,
|
||||
max_size=1333,  # by setting max_size > max_resolution, we're effectively not testing max_size here
|
||||
do_normalize=True,
|
||||
image_mean=[0.5, 0.5, 0.5],
|
||||
image_std=[0.5, 0.5, 0.5],
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.num_channels = num_channels
|
||||
self.min_resolution = min_resolution
|
||||
self.max_resolution = max_resolution
|
||||
self.do_resize = do_resize
|
||||
self.size = size
|
||||
self.max_size = max_size
|
||||
self.do_normalize = do_normalize
|
||||
self.image_mean = image_mean
|
||||
self.image_std = image_std
|
||||
|
||||
def prepare_feat_extract_dict(self):
|
||||
return {
|
||||
"do_resize": self.do_resize,
|
||||
"size": self.size,
|
||||
"max_size": self.max_size,
|
||||
"do_normalize": self.do_normalize,
|
||||
"image_mean": self.image_mean,
|
||||
"image_std": self.image_std,
|
||||
}
|
||||
|
||||
def get_expected_values(self, image_inputs, batched=False):
|
||||
"""
|
||||
This function computes the expected height and width when providing images to DetrFeatureExtractor,
|
||||
assuming do_resize is set to True with a scalar size.
|
||||
"""
|
||||
if not batched:
|
||||
image = image_inputs[0]
|
||||
if isinstance(image, Image.Image):
|
||||
w, h = image.size
|
||||
else:
|
||||
h, w = image.shape[1], image.shape[2]
|
||||
if w < h:
|
||||
expected_height = int(self.size * h / w)
|
||||
expected_width = self.size
|
||||
elif w > h:
|
||||
expected_height = self.size
|
||||
expected_width = int(self.size * w / h)
|
||||
else:
|
||||
expected_height = self.size
|
||||
expected_width = self.size
|
||||
|
||||
else:
|
||||
expected_values = []
|
||||
for image in image_inputs:
|
||||
expected_height, expected_width = self.get_expected_values([image])
|
||||
expected_values.append((expected_height, expected_width))
|
||||
expected_height = max(expected_values, key=lambda item: item[0])[0]
|
||||
expected_width = max(expected_values, key=lambda item: item[1])[1]
|
||||
|
||||
return expected_height, expected_width
|
||||
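# Worked example (illustrative): with the default size=18 and an input image of
# width 30 and height 40 (w < h), the shorter side is resized to 18, so
# expected_width = 18 and expected_height = int(18 * 40 / 30) = 24.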
|
||||
|
||||
@require_torch
|
||||
@require_vision
|
||||
class DetrFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestCase):
|
||||
|
||||
feature_extraction_class = DetrFeatureExtractor if is_vision_available() else None
|
||||
|
||||
def setUp(self):
|
||||
self.feature_extract_tester = DetrFeatureExtractionTester(self)
|
||||
|
||||
@property
|
||||
def feat_extract_dict(self):
|
||||
return self.feature_extract_tester.prepare_feat_extract_dict()
|
||||
|
||||
def test_feat_extract_properties(self):
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
self.assertTrue(hasattr(feature_extractor, "image_mean"))
|
||||
self.assertTrue(hasattr(feature_extractor, "image_std"))
|
||||
self.assertTrue(hasattr(feature_extractor, "do_normalize"))
|
||||
self.assertTrue(hasattr(feature_extractor, "do_resize"))
|
||||
self.assertTrue(hasattr(feature_extractor, "size"))
|
||||
self.assertTrue(hasattr(feature_extractor, "max_size"))
|
||||
|
||||
def test_batch_feature(self):
|
||||
pass
|
||||
|
||||
def test_call_pil(self):
|
||||
# Initialize feature_extractor
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
# create random PIL images
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, Image.Image)
|
||||
|
||||
# Test not batched input
|
||||
encoded_images = feature_extractor(image_inputs[0], return_tensors="pt").pixel_values
|
||||
|
||||
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs)
|
||||
|
||||
self.assertEqual(
|
||||
encoded_images.shape,
|
||||
(1, self.feature_extract_tester.num_channels, expected_height, expected_width),
|
||||
)
|
||||
|
||||
# Test batched
|
||||
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs, batched=True)
|
||||
|
||||
encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values
|
||||
self.assertEqual(
|
||||
encoded_images.shape,
|
||||
(
|
||||
self.feature_extract_tester.batch_size,
|
||||
self.feature_extract_tester.num_channels,
|
||||
expected_height,
|
||||
expected_width,
|
||||
),
|
||||
)
|
||||
|
||||
def test_call_numpy(self):
|
||||
# Initialize feature_extractor
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
# create random numpy tensors
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, numpify=True)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, np.ndarray)
|
||||
|
||||
# Test not batched input
|
||||
encoded_images = feature_extractor(image_inputs[0], return_tensors="pt").pixel_values
|
||||
|
||||
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs)
|
||||
|
||||
self.assertEqual(
|
||||
encoded_images.shape,
|
||||
(1, self.feature_extract_tester.num_channels, expected_height, expected_width),
|
||||
)
|
||||
|
||||
# Test batched
|
||||
encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values
|
||||
|
||||
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs, batched=True)
|
||||
|
||||
self.assertEqual(
|
||||
encoded_images.shape,
|
||||
(
|
||||
self.feature_extract_tester.batch_size,
|
||||
self.feature_extract_tester.num_channels,
|
||||
expected_height,
|
||||
expected_width,
|
||||
),
|
||||
)
|
||||
|
||||
def test_call_pytorch(self):
|
||||
# Initialize feature_extractor
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
# create random PyTorch tensors
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, torchify=True)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, torch.Tensor)
|
||||
|
||||
# Test not batched input
|
||||
encoded_images = feature_extractor(image_inputs[0], return_tensors="pt").pixel_values
|
||||
|
||||
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs)
|
||||
|
||||
self.assertEqual(
|
||||
encoded_images.shape,
|
||||
(1, self.feature_extract_tester.num_channels, expected_height, expected_width),
|
||||
)
|
||||
|
||||
# Test batched
|
||||
encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values
|
||||
|
||||
expected_height, expected_width = self.feature_extract_tester.get_expected_values(image_inputs, batched=True)
|
||||
|
||||
self.assertEqual(
|
||||
encoded_images.shape,
|
||||
(
|
||||
self.feature_extract_tester.batch_size,
|
||||
self.feature_extract_tester.num_channels,
|
||||
expected_height,
|
||||
expected_width,
|
||||
),
|
||||
)
|
||||
|
||||
def test_equivalence_pad_and_create_pixel_mask(self):
|
||||
# Initialize feature_extractors
|
||||
feature_extractor_1 = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
feature_extractor_2 = self.feature_extraction_class(do_resize=False, do_normalize=False)
|
||||
# create random PyTorch tensors
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, torchify=True)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, torch.Tensor)
|
||||
|
||||
# Test whether the method "pad_and_create_pixel_mask" and calling the feature extractor return the same tensors
|
||||
encoded_images_with_method = feature_extractor_1.pad_and_create_pixel_mask(image_inputs, return_tensors="pt")
|
||||
encoded_images = feature_extractor_2(image_inputs, return_tensors="pt")
|
||||
|
||||
assert torch.allclose(encoded_images_with_method["pixel_values"], encoded_images["pixel_values"], atol=1e-4)
|
||||
assert torch.allclose(encoded_images_with_method["pixel_mask"], encoded_images["pixel_mask"], atol=1e-4)
|
||||
|
||||
@slow
|
||||
def test_call_pytorch_with_coco_detection_annotations(self):
|
||||
# prepare image and target
|
||||
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
|
||||
with open("./tests/fixtures/tests_samples/COCO/coco_annotations.txt", "r") as f:
|
||||
target = json.loads(f.read())
|
||||
|
||||
target = {"image_id": 39769, "annotations": target}
|
||||
|
||||
# encode them
|
||||
# TODO replace by facebook/detr-resnet-50
|
||||
feature_extractor = DetrFeatureExtractor.from_pretrained("nielsr/detr-resnet-50")
|
||||
encoding = feature_extractor(images=image, annotations=target, return_tensors="pt")
|
||||
|
||||
# verify pixel values
|
||||
expected_shape = torch.Size([1, 3, 800, 1066])
|
||||
self.assertEqual(encoding["pixel_values"].shape, expected_shape)
|
||||
|
||||
expected_slice = torch.tensor([0.2796, 0.3138, 0.3481])
|
||||
assert torch.allclose(encoding["pixel_values"][0, 0, 0, :3], expected_slice, atol=1e-4)
|
||||
|
||||
# verify area
|
||||
expected_area = torch.tensor([5887.9600, 11250.2061, 489353.8438, 837122.7500, 147967.5156, 165732.3438])
|
||||
assert torch.allclose(encoding["target"][0]["area"], expected_area)
|
||||
# verify boxes
|
||||
expected_boxes_shape = torch.Size([6, 4])
|
||||
self.assertEqual(encoding["target"][0]["boxes"].shape, expected_boxes_shape)
|
||||
expected_boxes_slice = torch.tensor([0.5503, 0.2765, 0.0604, 0.2215])
|
||||
assert torch.allclose(encoding["target"][0]["boxes"][0], expected_boxes_slice, atol=1e-3)
|
||||
# verify image_id
|
||||
expected_image_id = torch.tensor([39769])
|
||||
assert torch.allclose(encoding["target"][0]["image_id"], expected_image_id)
|
||||
# verify is_crowd
|
||||
expected_is_crowd = torch.tensor([0, 0, 0, 0, 0, 0])
|
||||
assert torch.allclose(encoding["target"][0]["iscrowd"], expected_is_crowd)
|
||||
# verify class_labels
|
||||
expected_class_labels = torch.tensor([75, 75, 63, 65, 17, 17])
|
||||
assert torch.allclose(encoding["target"][0]["class_labels"], expected_class_labels)
|
||||
# verify orig_size
|
||||
expected_orig_size = torch.tensor([480, 640])
|
||||
assert torch.allclose(encoding["target"][0]["orig_size"], expected_orig_size)
|
||||
# verify size
|
||||
expected_size = torch.tensor([800, 1066])
|
||||
assert torch.allclose(encoding["target"][0]["size"], expected_size)
|
||||
|
||||
@slow
|
||||
def test_call_pytorch_with_coco_panoptic_annotations(self):
|
||||
# prepare image, target and masks_path
|
||||
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
|
||||
with open("./tests/fixtures/tests_samples/COCO/coco_panoptic_annotations.txt", "r") as f:
|
||||
target = json.loads(f.read())
|
||||
|
||||
target = {"file_name": "000000039769.png", "image_id": 39769, "segments_info": target}
|
||||
|
||||
masks_path = pathlib.Path("./tests/fixtures/tests_samples/COCO/coco_panoptic")
|
||||
|
||||
# encode them
|
||||
# TODO replace by .from_pretrained facebook/detr-resnet-50-panoptic
|
||||
feature_extractor = DetrFeatureExtractor(format="coco_panoptic")
|
||||
encoding = feature_extractor(images=image, annotations=target, masks_path=masks_path, return_tensors="pt")
|
||||
|
||||
# verify pixel values
|
||||
expected_shape = torch.Size([1, 3, 800, 1066])
|
||||
self.assertEqual(encoding["pixel_values"].shape, expected_shape)
|
||||
|
||||
expected_slice = torch.tensor([0.2796, 0.3138, 0.3481])
|
||||
assert torch.allclose(encoding["pixel_values"][0, 0, 0, :3], expected_slice, atol=1e-4)
|
||||
|
||||
# verify area
|
||||
expected_area = torch.tensor([147979.6875, 165527.0469, 484638.5938, 11292.9375, 5879.6562, 7634.1147])
|
||||
assert torch.allclose(encoding["target"][0]["area"], expected_area)
|
||||
# verify boxes
|
||||
expected_boxes_shape = torch.Size([6, 4])
|
||||
self.assertEqual(encoding["target"][0]["boxes"].shape, expected_boxes_shape)
|
||||
expected_boxes_slice = torch.tensor([0.2625, 0.5437, 0.4688, 0.8625])
|
||||
assert torch.allclose(encoding["target"][0]["boxes"][0], expected_boxes_slice, atol=1e-3)
|
||||
# verify image_id
|
||||
expected_image_id = torch.tensor([39769])
|
||||
assert torch.allclose(encoding["target"][0]["image_id"], expected_image_id)
|
||||
# verify is_crowd
|
||||
expected_is_crowd = torch.tensor([0, 0, 0, 0, 0, 0])
|
||||
assert torch.allclose(encoding["target"][0]["iscrowd"], expected_is_crowd)
|
||||
# verify class_labels
|
||||
expected_class_labels = torch.tensor([17, 17, 63, 75, 75, 93])
|
||||
assert torch.allclose(encoding["target"][0]["class_labels"], expected_class_labels)
|
||||
# verify masks
|
||||
expected_masks_sum = 822338
|
||||
self.assertEqual(encoding["target"][0]["masks"].sum().item(), expected_masks_sum)
|
||||
# verify orig_size
|
||||
expected_orig_size = torch.tensor([480, 640])
|
||||
assert torch.allclose(encoding["target"][0]["orig_size"], expected_orig_size)
|
||||
# verify size
|
||||
expected_size = torch.tensor([800, 1066])
|
||||
assert torch.allclose(encoding["target"][0]["size"], expected_size)
|
|
@ -21,7 +21,7 @@ import numpy as np
|
|||
from transformers.file_utils import is_torch_available, is_vision_available
|
||||
from transformers.testing_utils import require_torch, require_vision
|
||||
|
||||
from .test_feature_extraction_common import FeatureExtractionSavingTestMixin
|
||||
from .test_feature_extraction_common import FeatureExtractionSavingTestMixin, prepare_image_inputs
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
|
@ -69,36 +69,6 @@ class ViTFeatureExtractionTester(unittest.TestCase):
|
|||
"size": self.size,
|
||||
}
|
||||
|
||||
def prepare_inputs(self, equal_resolution=False, numpify=False, torchify=False):
|
||||
"""This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True,
|
||||
or a list of PyTorch tensors if one specifies torchify=True.
|
||||
"""
|
||||
|
||||
assert not (numpify and torchify), "You cannot specify both numpy and PyTorch tensors at the same time"
|
||||
|
||||
if equal_resolution:
|
||||
image_inputs = []
|
||||
for i in range(self.batch_size):
|
||||
image_inputs.append(
|
||||
np.random.randint(
|
||||
255, size=(self.num_channels, self.max_resolution, self.max_resolution), dtype=np.uint8
|
||||
)
|
||||
)
|
||||
else:
|
||||
image_inputs = []
|
||||
for i in range(self.batch_size):
|
||||
width, height = np.random.choice(np.arange(self.min_resolution, self.max_resolution), 2)
|
||||
image_inputs.append(np.random.randint(255, size=(self.num_channels, width, height), dtype=np.uint8))
|
||||
|
||||
if not numpify and not torchify:
|
||||
# PIL expects the channel dimension as last dimension
|
||||
image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
|
||||
|
||||
if torchify:
|
||||
image_inputs = [torch.from_numpy(x) for x in image_inputs]
|
||||
|
||||
return image_inputs
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_vision
|
||||
|
@ -128,7 +98,7 @@ class ViTFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestCa
|
|||
# Initialize feature_extractor
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
# create random PIL images
|
||||
image_inputs = self.feature_extract_tester.prepare_inputs(equal_resolution=False)
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, Image.Image)
|
||||
|
||||
|
@ -160,7 +130,7 @@ class ViTFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestCa
|
|||
# Initialize feature_extractor
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
# create random numpy tensors
|
||||
image_inputs = self.feature_extract_tester.prepare_inputs(equal_resolution=False, numpify=True)
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, numpify=True)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, np.ndarray)
|
||||
|
||||
|
@ -192,7 +162,7 @@ class ViTFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestCa
|
|||
# Initialize feature_extractor
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
# create random PyTorch tensors
|
||||
image_inputs = self.feature_extract_tester.prepare_inputs(equal_resolution=False, torchify=True)
|
||||
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, torchify=True)
|
||||
for image in image_inputs:
|
||||
self.assertIsInstance(image, torch.Tensor)
|
||||
|
||||
|
|
|
@ -21,7 +21,7 @@ import random
|
|||
import tempfile
|
||||
import unittest
|
||||
import warnings
|
||||
from typing import List, Tuple
|
||||
from typing import Dict, List, Tuple
|
||||
|
||||
from huggingface_hub import HfApi
|
||||
from requests.exceptions import HTTPError
|
||||
|
@ -982,7 +982,6 @@ class ModelTesterMixin:
|
|||
|
||||
outputs = model(**inputs)
|
||||
|
||||
print(outputs)
|
||||
output = outputs[0]
|
||||
|
||||
if config.is_encoder_decoder:
|
||||
|
@ -1236,6 +1235,11 @@ class ModelTesterMixin:
|
|||
if isinstance(tuple_object, (List, Tuple)):
|
||||
for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object):
|
||||
recursive_check(tuple_iterable_value, dict_iterable_value)
|
||||
elif isinstance(tuple_object, Dict):
|
||||
for tuple_iterable_value, dict_iterable_value in zip(
|
||||
tuple_object.values(), dict_object.values()
|
||||
):
|
||||
recursive_check(tuple_iterable_value, dict_iterable_value)
|
||||
elif tuple_object is None:
|
||||
return
|
||||
else:
|
||||
|
|
|
@ -360,7 +360,7 @@ class DeiTModelTest(ModelTesterMixin, unittest.TestCase):
|
|||
|
||||
# We will verify our results on an image of cute cats
|
||||
def prepare_img():
|
||||
image = Image.open("./tests/fixtures/tests_samples/COCO/cats.png")
|
||||
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
|
||||
return image
|
||||
|
||||
|
||||
|
|
|
@ -0,0 +1,527 @@
|
|||
# coding=utf-8
|
||||
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Testing suite for the PyTorch DETR model. """
|
||||
|
||||
|
||||
import inspect
|
||||
import math
|
||||
import unittest
|
||||
|
||||
from transformers import is_timm_available, is_vision_available
|
||||
from transformers.file_utils import cached_property
|
||||
from transformers.testing_utils import require_timm, require_vision, slow, torch_device
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_generation_utils import GenerationTesterMixin
|
||||
from .test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor
|
||||
|
||||
|
||||
if is_timm_available():
|
||||
import torch
|
||||
|
||||
from transformers import DetrConfig, DetrForObjectDetection, DetrForSegmentation, DetrModel
|
||||
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
from transformers import DetrFeatureExtractor
|
||||
|
||||
|
||||
@require_timm
|
||||
class DetrModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=8,
|
||||
is_training=True,
|
||||
use_labels=True,
|
||||
hidden_size=256,
|
||||
num_hidden_layers=2,
|
||||
num_attention_heads=8,
|
||||
intermediate_size=4,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
num_queries=12,
|
||||
num_channels=3,
|
||||
min_size=200,
|
||||
max_size=200,
|
||||
n_targets=8,
|
||||
num_labels=91,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.is_training = is_training
|
||||
self.use_labels = use_labels
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.num_queries = num_queries
|
||||
self.num_channels = num_channels
|
||||
self.min_size = min_size
|
||||
self.max_size = max_size
|
||||
self.n_targets = n_targets
|
||||
self.num_labels = num_labels
|
||||
|
||||
# we also set the expected seq length for both encoder and decoder
|
||||
self.encoder_seq_length = math.ceil(self.min_size / 32) * math.ceil(self.max_size / 32)
|
||||
self.decoder_seq_length = self.num_queries
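# e.g. with the defaults min_size = max_size = 200 and a backbone stride of 32:
# encoder_seq_length = ceil(200 / 32) * ceil(200 / 32) = 7 * 7 = 49, and
# decoder_seq_length = num_queries = 12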
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.min_size, self.max_size])
|
||||
|
||||
pixel_mask = torch.ones([self.batch_size, self.min_size, self.max_size], device=torch_device)
|
||||
|
||||
labels = None
|
||||
if self.use_labels:
|
||||
# labels is a list of Dict (each Dict being the labels for a given example in the batch)
|
||||
labels = []
|
||||
for i in range(self.batch_size):
|
||||
target = {}
|
||||
target["class_labels"] = torch.randint(
|
||||
high=self.num_labels, size=(self.n_targets,), device=torch_device
|
||||
)
|
||||
target["boxes"] = torch.rand(self.n_targets, 4, device=torch_device)
|
||||
target["masks"] = torch.rand(self.n_targets, self.min_size, self.max_size, device=torch_device)
|
||||
labels.append(target)
|
||||
|
||||
config = DetrConfig(
|
||||
d_model=self.hidden_size,
|
||||
encoder_layers=self.num_hidden_layers,
|
||||
decoder_layers=self.num_hidden_layers,
|
||||
encoder_attention_heads=self.num_attention_heads,
|
||||
decoder_attention_heads=self.num_attention_heads,
|
||||
encoder_ffn_dim=self.intermediate_size,
|
||||
decoder_ffn_dim=self.intermediate_size,
|
||||
dropout=self.hidden_dropout_prob,
|
||||
attention_dropout=self.attention_probs_dropout_prob,
|
||||
num_queries=self.num_queries,
|
||||
num_labels=self.num_labels,
|
||||
)
|
||||
return config, pixel_values, pixel_mask, labels
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config, pixel_values, pixel_mask, labels = self.prepare_config_and_inputs()
|
||||
inputs_dict = {"pixel_values": pixel_values, "pixel_mask": pixel_mask}
|
||||
return config, inputs_dict
|
||||
|
||||
def create_and_check_detr_model(self, config, pixel_values, pixel_mask, labels):
|
||||
model = DetrModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
result = model(pixel_values=pixel_values, pixel_mask=pixel_mask)
|
||||
result = model(pixel_values)
|
||||
|
||||
self.parent.assertEqual(
|
||||
result.last_hidden_state.shape, (self.batch_size, self.decoder_seq_length, self.hidden_size)
|
||||
)
|
||||
|
||||
def create_and_check_detr_object_detection_head_model(self, config, pixel_values, pixel_mask, labels):
|
||||
model = DetrForObjectDetection(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
result = model(pixel_values=pixel_values, pixel_mask=pixel_mask)
|
||||
result = model(pixel_values)
|
||||
|
||||
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_queries, self.num_labels + 1))
|
||||
self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.num_queries, 4))
|
||||
|
||||
result = model(pixel_values=pixel_values, pixel_mask=pixel_mask, labels=labels)
|
||||
|
||||
self.parent.assertEqual(result.loss.shape, ())
|
||||
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_queries, self.num_labels + 1))
|
||||
self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.num_queries, 4))
|
||||
|
||||
|
||||
@require_timm
|
||||
class DetrModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
|
||||
all_model_classes = (
|
||||
(
|
||||
DetrModel,
|
||||
DetrForObjectDetection,
|
||||
DetrForSegmentation,
|
||||
)
|
||||
if is_timm_available()
|
||||
else ()
|
||||
)
|
||||
is_encoder_decoder = True
|
||||
test_torchscript = False
|
||||
test_pruning = False
|
||||
test_head_masking = False
|
||||
test_missing_keys = False
|
||||
|
||||
# special case for head models
|
||||
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
|
||||
inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
|
||||
|
||||
if return_labels:
|
||||
if model_class.__name__ in ["DetrForObjectDetection", "DetrForSegmentation"]:
|
||||
labels = []
|
||||
for i in range(self.model_tester.batch_size):
|
||||
target = {}
|
||||
target["class_labels"] = torch.ones(
|
||||
size=(self.model_tester.n_targets,), device=torch_device, dtype=torch.long
|
||||
)
|
||||
target["boxes"] = torch.ones(
|
||||
self.model_tester.n_targets, 4, device=torch_device, dtype=torch.float
|
||||
)
|
||||
target["masks"] = torch.ones(
|
||||
self.model_tester.n_targets,
|
||||
self.model_tester.min_size,
|
||||
self.model_tester.max_size,
|
||||
device=torch_device,
|
||||
dtype=torch.float,
|
||||
)
|
||||
labels.append(target)
|
||||
inputs_dict["labels"] = labels
|
||||
|
||||
return inputs_dict
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = DetrModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=DetrConfig, has_text_modality=False)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_detr_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_detr_model(*config_and_inputs)
|
||||
|
||||
def test_detr_object_detection_head_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_detr_object_detection_head_model(*config_and_inputs)
|
||||
|
||||
@unittest.skip(reason="DETR does not use inputs_embeds")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="DETR does not have a get_input_embeddings method")
|
||||
def test_model_common_attributes(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="DETR is not a generative model")
|
||||
def test_generate_without_input_ids(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="DETR does not use token embeddings")
|
||||
def test_resize_tokens_embeddings(self):
|
||||
pass
|
||||
|
||||
    def test_attention_outputs(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.return_dict = True

        decoder_seq_length = self.model_tester.decoder_seq_length
        encoder_seq_length = self.model_tester.encoder_seq_length
        decoder_key_length = self.model_tester.decoder_seq_length
        encoder_key_length = self.model_tester.encoder_seq_length

        for model_class in self.all_model_classes:
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = False
            config.return_dict = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)

            # check that output_attentions also work using config
            del inputs_dict["output_attentions"]
            config.output_attentions = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)

            self.assertListEqual(
                list(attentions[0].shape[-3:]),
                [self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
            )
            out_len = len(outputs)

            if self.is_encoder_decoder:
                correct_outlen = 5

                # loss is at first position
                if "labels" in inputs_dict:
                    correct_outlen += 1  # loss is added to beginning
                # Object Detection model returns pred_logits and pred_boxes
                if model_class.__name__ == "DetrForObjectDetection":
                    correct_outlen += 2
                # Panoptic Segmentation model returns pred_logits, pred_boxes, pred_masks
                if model_class.__name__ == "DetrForSegmentation":
                    correct_outlen += 3
                if "past_key_values" in outputs:
                    correct_outlen += 1  # past_key_values have been returned

                self.assertEqual(out_len, correct_outlen)

                # decoder attentions
                decoder_attentions = outputs.decoder_attentions
                self.assertIsInstance(decoder_attentions, (list, tuple))
                self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
                self.assertListEqual(
                    list(decoder_attentions[0].shape[-3:]),
                    [self.model_tester.num_attention_heads, decoder_seq_length, decoder_key_length],
                )

                # cross attentions
                cross_attentions = outputs.cross_attentions
                self.assertIsInstance(cross_attentions, (list, tuple))
                self.assertEqual(len(cross_attentions), self.model_tester.num_hidden_layers)
                self.assertListEqual(
                    list(cross_attentions[0].shape[-3:]),
                    [
                        self.model_tester.num_attention_heads,
                        decoder_seq_length,
                        encoder_key_length,
                    ],
                )

            # Check attention is always last and order is fine
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            if hasattr(self.model_tester, "num_hidden_states_types"):
                added_hidden_states = self.model_tester.num_hidden_states_types
            elif self.is_encoder_decoder:
                added_hidden_states = 2
            else:
                added_hidden_states = 1
            self.assertEqual(out_len + added_hidden_states, len(outputs))

            self_attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions

            self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
            self.assertListEqual(
                list(self_attentions[0].shape[-3:]),
                [self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
            )

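# Minimal usage sketch of what the test above exercises (not from the original file; it
# assumes timm, vision support and the facebook/detr-resnet-50 checkpoint are available):
# passing output_attentions=True makes the model return encoder, decoder and cross
# attention weights alongside its usual outputs.
import torch
from PIL import Image
from transformers import DetrFeatureExtractor, DetrModel

feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
model = DetrModel.from_pretrained("facebook/detr-resnet-50")

image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
encoding = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding, output_attentions=True)

# one tensor per layer; each has shape (batch, num_heads, query_length, key_length)
print(len(outputs.encoder_attentions), outputs.encoder_attentions[0].shape)
print(len(outputs.decoder_attentions), outputs.decoder_attentions[0].shape)
print(len(outputs.cross_attentions), outputs.cross_attentions[0].shape)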
    def test_retain_grad_hidden_states_attentions(self):
        # removed retain_grad and grad on decoder_hidden_states, as queries don't require grad

        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.output_hidden_states = True
        config.output_attentions = True

        # no need to test all models as different heads yield the same functionality
        model_class = self.all_model_classes[0]
        model = model_class(config)
        model.to(torch_device)

        inputs = self._prepare_for_class(inputs_dict, model_class)

        outputs = model(**inputs)

        output = outputs[0]

        encoder_hidden_states = outputs.encoder_hidden_states[0]
        encoder_attentions = outputs.encoder_attentions[0]
        encoder_hidden_states.retain_grad()
        encoder_attentions.retain_grad()

        decoder_attentions = outputs.decoder_attentions[0]
        decoder_attentions.retain_grad()

        cross_attentions = outputs.cross_attentions[0]
        cross_attentions.retain_grad()

        output.flatten()[0].backward(retain_graph=True)

        self.assertIsNotNone(encoder_hidden_states.grad)
        self.assertIsNotNone(encoder_attentions.grad)
        self.assertIsNotNone(decoder_attentions.grad)
        self.assertIsNotNone(cross_attentions.grad)

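# Plain-PyTorch sketch (not DETR-specific, not part of the original file) of the mechanism
# the test above relies on: intermediate (non-leaf) tensors only keep their gradients after
# backward() if retain_grad() was called on them first.
import torch

x = torch.randn(4, 8, requires_grad=True)
hidden = x * 2          # non-leaf tensor, its .grad would normally stay None
hidden.retain_grad()    # ask autograd to keep this tensor's gradient
out = hidden.sum()
out.backward()

assert hidden.grad is not None  # populated only because retain_grad() was called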
    def test_forward_signature(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            signature = inspect.signature(model.forward)
            # signature.parameters is an OrderedDict => so arg_names order is deterministic
            arg_names = [*signature.parameters.keys()]

            if model.config.is_encoder_decoder:
                expected_arg_names = ["pixel_values", "pixel_mask"]
                expected_arg_names.extend(
                    ["head_mask", "decoder_head_mask", "encoder_outputs"]
                    if "head_mask" in arg_names and "decoder_head_mask" in arg_names
                    else []
                )
                self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
            else:
                expected_arg_names = ["pixel_values", "pixel_mask"]
                self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)

    def test_different_timm_backbone(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        # let's pick a random timm backbone
        config.backbone = "tf_mobilenetv3_small_075"

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            if model_class.__name__ == "DetrForObjectDetection":
                expected_shape = (
                    self.model_tester.batch_size,
                    self.model_tester.num_queries,
                    self.model_tester.num_labels + 1,
                )
                self.assertEqual(outputs.logits.shape, expected_shape)

            self.assertTrue(outputs)

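# Rough sketch of what the test above checks from a user's perspective (not part of the
# original file; requires timm): DetrConfig exposes a `backbone` field naming a timm model,
# so a randomly initialized DETR can be built on top of a different convolutional encoder.
# The backbone name below is just an example choice.
from transformers import DetrConfig, DetrForObjectDetection

config = DetrConfig()
config.backbone = "resnet101"            # any timm backbone with feature extraction should work in principle
model = DetrForObjectDetection(config)   # randomly initialized, with the custom conv encoder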
    def test_initialization(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        configs_no_init = _config_zero_init(config)
        configs_no_init.init_xavier_std = 1e9

        for model_class in self.all_model_classes:
            model = model_class(config=configs_no_init)
            for name, param in model.named_parameters():
                if param.requires_grad:
                    if "bbox_attention" in name and "bias" not in name:
                        self.assertLess(
                            100000,
                            abs(param.data.max().item()),
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )
                    else:
                        self.assertIn(
                            ((param.data.mean() * 1e9).round() / 1e9).item(),
                            [0.0, 1.0],
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )


TOLERANCE = 1e-4


# We will verify our results on an image of cute cats
def prepare_img():
    image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
    return image


@require_timm
@require_vision
@slow
class DetrModelIntegrationTests(unittest.TestCase):
    @cached_property
    def default_feature_extractor(self):
        return DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50") if is_vision_available() else None

    def test_inference_no_head(self):
        model = DetrModel.from_pretrained("facebook/detr-resnet-50").to(torch_device)

        feature_extractor = self.default_feature_extractor
        image = prepare_img()
        encoding = feature_extractor(images=image, return_tensors="pt").to(torch_device)

        with torch.no_grad():
            outputs = model(**encoding)

        expected_shape = torch.Size((1, 100, 256))
        assert outputs.last_hidden_state.shape == expected_shape
        expected_slice = torch.tensor(
            [[0.0616, -0.5146, -0.4032], [-0.7629, -0.4934, -1.7153], [-0.4768, -0.6403, -0.7826]]
        ).to(torch_device)
        self.assertTrue(torch.allclose(outputs.last_hidden_state[0, :3, :3], expected_slice, atol=1e-4))

    def test_inference_object_detection_head(self):
        model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").to(torch_device)

        feature_extractor = self.default_feature_extractor
        image = prepare_img()
        encoding = feature_extractor(images=image, return_tensors="pt").to(torch_device)
        pixel_values = encoding["pixel_values"].to(torch_device)
        pixel_mask = encoding["pixel_mask"].to(torch_device)

        with torch.no_grad():
            outputs = model(pixel_values, pixel_mask)

        expected_shape_logits = torch.Size((1, model.config.num_queries, model.config.num_labels + 1))
        self.assertEqual(outputs.logits.shape, expected_shape_logits)
        expected_slice_logits = torch.tensor(
            [[-19.1194, -0.0893, -11.0154], [-17.3640, -1.8035, -14.0219], [-20.0461, -0.5837, -11.1060]]
        ).to(torch_device)
        self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_slice_logits, atol=1e-4))

        expected_shape_boxes = torch.Size((1, model.config.num_queries, 4))
        self.assertEqual(outputs.pred_boxes.shape, expected_shape_boxes)
        expected_slice_boxes = torch.tensor(
            [[0.4433, 0.5302, 0.8853], [0.5494, 0.2517, 0.0529], [0.4998, 0.5360, 0.9956]]
        ).to(torch_device)
        self.assertTrue(torch.allclose(outputs.pred_boxes[0, :3, :3], expected_slice_boxes, atol=1e-4))

    def test_inference_panoptic_segmentation_head(self):
        model = DetrForSegmentation.from_pretrained("facebook/detr-resnet-50-panoptic").to(torch_device)

        feature_extractor = self.default_feature_extractor
        image = prepare_img()
        encoding = feature_extractor(images=image, return_tensors="pt").to(torch_device)
        pixel_values = encoding["pixel_values"].to(torch_device)
        pixel_mask = encoding["pixel_mask"].to(torch_device)

        with torch.no_grad():
            outputs = model(pixel_values, pixel_mask)

        expected_shape_logits = torch.Size((1, model.config.num_queries, model.config.num_labels + 1))
        self.assertEqual(outputs.logits.shape, expected_shape_logits)
        expected_slice_logits = torch.tensor(
            [[-18.1565, -1.7568, -13.5029], [-16.8888, -1.4138, -14.1028], [-17.5709, -2.5080, -11.8654]]
        ).to(torch_device)
        self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_slice_logits, atol=1e-4))

        expected_shape_boxes = torch.Size((1, model.config.num_queries, 4))
        self.assertEqual(outputs.pred_boxes.shape, expected_shape_boxes)
        expected_slice_boxes = torch.tensor(
            [[0.5344, 0.1789, 0.9285], [0.4420, 0.0572, 0.0875], [0.6630, 0.6887, 0.1017]]
        ).to(torch_device)
        self.assertTrue(torch.allclose(outputs.pred_boxes[0, :3, :3], expected_slice_boxes, atol=1e-4))

        expected_shape_masks = torch.Size((1, model.config.num_queries, 200, 267))
        self.assertEqual(outputs.pred_masks.shape, expected_shape_masks)
        expected_slice_masks = torch.tensor(
            [[-7.7558, -10.8788, -11.9797], [-11.8881, -16.4329, -17.7451], [-14.7316, -19.7383, -20.3004]]
        ).to(torch_device)
        self.assertTrue(torch.allclose(outputs.pred_masks[0, 0, :3, :3], expected_slice_masks, atol=1e-4))
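# Usage sketch building on the integration tests above (not part of the test file): turning
# raw DetrForObjectDetection outputs into (score, label, box) triples with plain torch ops.
# The last logit index is the "no object" class, so it is dropped before scoring; 0.9 is an
# arbitrary confidence threshold chosen for illustration.
import torch
from PIL import Image
from transformers import DetrFeatureExtractor, DetrForObjectDetection

feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
encoding = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

probs = outputs.logits.softmax(-1)[0, :, :-1]   # (num_queries, num_labels), "no object" removed
scores, classes = probs.max(-1)
keep = scores > 0.9
for score, cls, box in zip(scores[keep], classes[keep], outputs.pred_boxes[0][keep]):
    # boxes are normalized (center_x, center_y, width, height) in [0, 1]
    print(f"label id {cls.item()} with score {score.item():.3f} at {box.tolist()}")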
@ -322,7 +322,7 @@ class ViTModelTest(ModelTesterMixin, unittest.TestCase):

# We will verify our results on an image of cute cats
def prepare_img():
    image = Image.open("./tests/fixtures/tests_samples/COCO/cats.png")
    image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
    return image

@ -47,11 +47,26 @@ class ImageClassificationPipelineTests(unittest.TestCase):
                    "http://images.cocodataset.org/val2017/000000039769.jpg",
                ]
            },
            {"images": "tests/fixtures/coco.jpg"},
            {"images": ["tests/fixtures/coco.jpg", "tests/fixtures/coco.jpg"]},
            {"images": Image.open("tests/fixtures/coco.jpg")},
            {"images": [Image.open("tests/fixtures/coco.jpg"), Image.open("tests/fixtures/coco.jpg")]},
            {"images": [Image.open("tests/fixtures/coco.jpg"), "tests/fixtures/coco.jpg"]},
            {"images": "./tests/fixtures/tests_samples/COCO/000000039769.png"},
            {
                "images": [
                    "./tests/fixtures/tests_samples/COCO/000000039769.png",
                    "./tests/fixtures/tests_samples/COCO/000000039769.png",
                ]
            },
            {"images": Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")},
            {
                "images": [
                    Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png"),
                    Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png"),
                ]
            },
            {
                "images": [
                    Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png"),
                    "./tests/fixtures/tests_samples/COCO/000000039769.png",
                ]
            },
        ]

    def test_small_model_from_factory(self):
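# Rough sketch of the input formats the updated pipeline test above enumerates (not from
# this diff): the image-classification pipeline is expected to accept a single path, a PIL
# image, or (possibly mixed) lists of both. The task string and checkpoint below are
# assumed example values chosen for illustration.
from PIL import Image
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")  # assumed checkpoint

path = "./tests/fixtures/tests_samples/COCO/000000039769.png"
pil_image = Image.open(path)

print(classifier(images=path))                  # single path
print(classifier(images=pil_image))             # single PIL image
print(classifier(images=[path, pil_image]))     # mixed list of paths and images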
@ -38,6 +38,9 @@ IGNORE_NON_TESTED = [
    "BigBirdPegasusEncoder",  # Building part of bigger (tested) model.
    "BigBirdPegasusDecoder",  # Building part of bigger (tested) model.
    "BigBirdPegasusDecoderWrapper",  # Building part of bigger (tested) model.
    "DetrEncoder",  # Building part of bigger (tested) model.
    "DetrDecoder",  # Building part of bigger (tested) model.
    "DetrDecoderWrapper",  # Building part of bigger (tested) model.
    "M2M100Encoder",  # Building part of bigger (tested) model.
    "M2M100Decoder",  # Building part of bigger (tested) model.
    "Speech2TextEncoder",  # Building part of bigger (tested) model.
@ -95,6 +98,7 @@ IGNORE_NON_AUTO_CONFIGURED = [
    "CLIPVisionModel",
    "FlaxCLIPTextModel",
    "FlaxCLIPVisionModel",
    "DetrForSegmentation",
    "DPRReader",
    "DPRSpanPredictor",
    "FlaubertForQuestionAnswering",