Add fuyu model (#26911)
* initial commit
* add processor, add fuyu naming
* add draft processor
* fix processor
* remove dropout to fix loading of weights
* add image processing fixes from Pedro
* fix
* fix processor
* add basic processing fuyu test
* add documentation and TODO
* address comments, add tests, add doc
* replace assert with torch asserts
* add Mixins and fix tests
* clean imports
* add model tester, clean imports
* fix embedding test
* add updated tests from pre-release model
* Processor: return input_ids used for inference
* separate processing and model tests
* relax test tolerance for embeddings
* add test for logit comparison
* make sure fuyu image processor is imported in the init
* fix formatting
* more formatting issues
* and more
* fixups
* remove some stuff
* nits
* update init
* remove the fuyu file
* Update integration test with release model
* Update conversion script. The projection is not used, as confirmed by the authors.
* improve generation
* Remove duplicate function
* Trickle down patches to model call
* processing fuyu updates
* remove things
* fix prepare_inputs_for_generation to fix generate()
* remove model_input
* update
* add generation tests
* nits
* draft leverage automodel and autoconfig
* nits
* fix dtype patch
* address comments, update READMEs and doc, include tests
* add working processing test, remove refs to subsequences
* add tests, remove Sequence classification
* processing
* update
* update the conversion script
* more processing cleanup
* safe import
* take out ModelTesterMixin for early release
* more cleanup
* more cleanup
* more cleanup
* more cleanup
* and more
* register a buffer
* nits
* add postprocessing of generate output
* nits
* updates
* add one working test
* fix test
* make fixup work
* fixup
* Arthur's updates
* nits
* update
* update
* fix processor
* update tests
* pass more fixups
* fix
* nits
* don't import torch
* skip fuyu config for now
* fixup done
* fixup
* update
* oops
* nits
* Use input embeddings
* no buffer
* update
* styling processing fuyu
* fix test
* update licence
* protect torch import
* fixup and update not doctested
* kwargs should be passed
* updates
* update the imports in the test
* protect import
* protecting imports
* protect imports in type checking
* add testing decorators
* protect top level import structure
* fix typo
* fix check init
* move requires_backend to functions
* Imports
* Protect types

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: ArthurZucker <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre@huggingface.co>
This commit is contained in:

parent 5a73316bed
commit caa0ff0bf1
@@ -363,6 +363,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the paper [blog post](https://www.adept.ai/blog/fuyu-8b)
1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
@@ -338,6 +338,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the paper [blog post](https://www.adept.ai/blog/fuyu-8b)
1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
@@ -310,6 +310,7 @@ conda install -c huggingface transformers
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the [blog post](https://www.adept.ai/blog/fuyu-8b)
1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
@@ -372,6 +372,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the [blog post](https://www.adept.ai/blog/fuyu-8b)
1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
@@ -287,6 +287,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the [blog post](https://www.adept.ai/blog/fuyu-8b)
1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
@@ -361,6 +361,7 @@ conda install -c huggingface transformers
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the paper [blog post](https://www.adept.ai/blog/fuyu-8b)
1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
@@ -311,6 +311,7 @@ conda install -c huggingface transformers
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) released with the [blog post](https://www.adept.ai/blog/fuyu-8b) by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.
1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
@@ -323,6 +323,7 @@ conda install -c huggingface transformers
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao.
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the paper [blog post](https://www.adept.ai/blog/fuyu-8b)
1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
@@ -342,6 +342,8 @@
      title: FSMT
    - local: model_doc/funnel
      title: Funnel Transformer
    - local: model_doc/fuyu
      title: Fuyu
    - local: model_doc/openai-gpt
      title: GPT
    - local: model_doc/gpt_neo
@@ -138,6 +138,7 @@ Flax), PyTorch, and/or TensorFlow.
| [FNet](model_doc/fnet) | ✅ | ❌ | ❌ |
| [FocalNet](model_doc/focalnet) | ✅ | ❌ | ❌ |
| [Funnel Transformer](model_doc/funnel) | ✅ | ✅ | ❌ |
| [Fuyu](model_doc/fuyu) | ✅ | ❌ | ❌ |
| [GIT](model_doc/git) | ✅ | ❌ | ❌ |
| [GLPN](model_doc/glpn) | ✅ | ❌ | ❌ |
| [GPT Neo](model_doc/gpt_neo) | ✅ | ❌ | ✅ |
@@ -0,0 +1,115 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Fuyu

## Overview

The Fuyu model was created by [ADEPT](https://www.adept.ai/blog/fuyu-8b), and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.

The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.

By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under Apache, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.

<Tip warning={true}>

The `Fuyu` models were trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the hub use `torch_dtype = 'float16'`, which the `AutoModel` API uses to cast the checkpoints from `torch.float32` to `torch.float16`.

The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model with `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto")`. The reason is that the model will first be downloaded (using the `dtype` of the checkpoints online) and then cast to the default `dtype` of `torch` (`torch.float32`). Users should specify the `torch_dtype` they want; if they don't, it will be `torch.float32`.

Finetuning the model in `float16` is not recommended and is known to produce `nan`; the model should instead be fine-tuned in `bfloat16`.

</Tip>
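
For instance, a minimal sketch of the casting behaviour described above; `/output/path` is the hypothetical local directory produced by the conversion step further down this page, not a published checkpoint:

```py
import torch

from transformers import AutoModelForCausalLM

# Explicitly requesting bfloat16 overrides whatever torch_dtype is stored in the checkpoint.
model = AutoModelForCausalLM.from_pretrained("/output/path", torch_dtype=torch.bfloat16)
print(model.dtype)  # torch.bfloat16
```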

Tips:

- To convert the model, you need to clone the original repository using `git clone https://github.com/persimmon-ai-labs/adept-inference`, then get the checkpoints:

```bash
git clone https://github.com/persimmon-ai-labs/adept-inference
wget path/to/fuyu-8b-model-weights.tar
tar -xvf fuyu-8b-model-weights.tar
python src/transformers/models/fuyu/convert_fuyu_weights_to_hf.py --input_dir /path/to/downloaded/fuyu/weights/ --output_dir /output/path \
    --pt_model_path /path/to/fuyu_8b_release/iter_0001251/mp_rank_00/model_optim_rng.pt \
    --ada_lib_path /path/to/adept-inference
```

For the chat model:

```bash
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
tar -xvf 8b_chat_model_release.tar
```

Then, the model can be loaded via:

```py
from transformers import FuyuConfig, FuyuForCausalLM

model_config = FuyuConfig()
model = FuyuForCausalLM.from_pretrained('/output/path', config=model_config)
```

Inputs need to be passed through a specific Processor to have the correct formats.
A processor requires an image_processor and a tokenizer. Hence, inputs can be loaded via:

```py
import io

import requests
from PIL import Image

from transformers import AutoTokenizer
from transformers.models.fuyu.processing_fuyu import FuyuProcessor
from transformers.models.fuyu.image_processing_fuyu import FuyuImageProcessor


tokenizer = AutoTokenizer.from_pretrained('adept-hf-collab/fuyu-8b')
image_processor = FuyuImageProcessor()

processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)
text_prompt = "Generate a coco-style caption.\n"

bus_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
bus_image_pil = Image.open(io.BytesIO(requests.get(bus_image_url).content))
inputs_to_model = processor(text=text_prompt, images=bus_image_pil)
```
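
A sketch of how generation could then be run on top of these processed inputs; the checkpoint id, the `max_new_tokens` value, and the use of `processor.batch_decode` are illustrative assumptions rather than part of this diff:

```py
from transformers import FuyuForCausalLM

model = FuyuForCausalLM.from_pretrained("adept-hf-collab/fuyu-8b")

generated_ids = model.generate(**inputs_to_model, max_new_tokens=16)
# Decode only the newly generated tokens; the prompt and image placeholder ids come first.
caption = processor.batch_decode(generated_ids[:, -16:], skip_special_tokens=True)[0]
print(caption)
```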

This model was contributed by [Molbap](https://huggingface.co/Molbap).
The original code can be found [here](https://github.com/persimmon-ai-labs/adept-inference).

- Fuyu uses a `sentencepiece` based tokenizer, with a `Unigram` model. It supports bytefallback, which is only available in `tokenizers==0.14.0` for the fast tokenizer.
The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece.

- The authors suggest using the following prompt for image captioning: `"Generate a coco-style caption.\n"`

## FuyuConfig

[[autodoc]] FuyuConfig

## FuyuForCausalLM

[[autodoc]] FuyuForCausalLM
    - forward

## FuyuImageProcessor

[[autodoc]] FuyuImageProcessor
    - __call__

## FuyuProcessor

[[autodoc]] FuyuProcessor
    - __call__
@@ -37,7 +37,7 @@ You can finetune other architectures for causal language modeling following the
Choose one of the following architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->
[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod)
[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [Fuyu](../model_doc/fuyu), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod)
@@ -343,6 +343,7 @@ _import_structure = {
    "models.focalnet": ["FOCALNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "FocalNetConfig"],
    "models.fsmt": ["FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP", "FSMTConfig", "FSMTTokenizer"],
    "models.funnel": ["FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP", "FunnelConfig", "FunnelTokenizer"],
    "models.fuyu": ["FUYU_PRETRAINED_CONFIG_ARCHIVE_MAP", "FuyuConfig", "FuyuProcessor"],
    "models.git": ["GIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "GitConfig", "GitProcessor", "GitVisionConfig"],
    "models.glpn": ["GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP", "GLPNConfig"],
    "models.gpt2": ["GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPT2Config", "GPT2Tokenizer"],
@@ -972,6 +973,7 @@ else:
    _import_structure["models.efficientformer"].append("EfficientFormerImageProcessor")
    _import_structure["models.efficientnet"].append("EfficientNetImageProcessor")
    _import_structure["models.flava"].extend(["FlavaFeatureExtractor", "FlavaImageProcessor", "FlavaProcessor"])
    _import_structure["models.fuyu"].append("FuyuImageProcessor")
    _import_structure["models.glpn"].extend(["GLPNFeatureExtractor", "GLPNImageProcessor"])
    _import_structure["models.idefics"].extend(["IdeficsImageProcessor"])
    _import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"])
@@ -1864,6 +1866,7 @@ else:
            "load_tf_weights_in_funnel",
        ]
    )
    _import_structure["models.fuyu"].extend(["FuyuForCausalLM", "FuyuPreTrainedModel"])
    _import_structure["models.git"].extend(
        [
            "GIT_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4489,6 +4492,7 @@ if TYPE_CHECKING:
    from .models.focalnet import FOCALNET_PRETRAINED_CONFIG_ARCHIVE_MAP, FocalNetConfig
    from .models.fsmt import FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP, FSMTConfig, FSMTTokenizer
    from .models.funnel import FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP, FunnelConfig, FunnelTokenizer
    from .models.fuyu import FUYU_PRETRAINED_CONFIG_ARCHIVE_MAP, FuyuConfig, FuyuProcessor
    from .models.git import GIT_PRETRAINED_CONFIG_ARCHIVE_MAP, GitConfig, GitProcessor, GitVisionConfig
    from .models.glpn import GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP, GLPNConfig
    from .models.gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config, GPT2Tokenizer
@@ -5053,6 +5057,7 @@ if TYPE_CHECKING:
    from .models.efficientformer import EfficientFormerImageProcessor
    from .models.efficientnet import EfficientNetImageProcessor
    from .models.flava import FlavaFeatureExtractor, FlavaImageProcessor, FlavaProcessor
    from .models.fuyu import FuyuImageProcessor
    from .models.glpn import GLPNFeatureExtractor, GLPNImageProcessor
    from .models.idefics import IdeficsImageProcessor
    from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor
@@ -5807,6 +5812,10 @@ if TYPE_CHECKING:
        FunnelPreTrainedModel,
        load_tf_weights_in_funnel,
    )
    from .models.fuyu import (
        FuyuForCausalLM,
        FuyuPreTrainedModel,
    )
    from .models.git import (
        GIT_PRETRAINED_MODEL_ARCHIVE_LIST,
        GitForCausalLM,
@@ -88,6 +88,7 @@ from . import (
    focalnet,
    fsmt,
    funnel,
    fuyu,
    git,
    glpn,
    gpt2,
@@ -97,6 +97,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("focalnet", "FocalNetConfig"),
        ("fsmt", "FSMTConfig"),
        ("funnel", "FunnelConfig"),
        ("fuyu", "FuyuConfig"),
        ("git", "GitConfig"),
        ("glpn", "GLPNConfig"),
        ("gpt-sw3", "GPT2Config"),
@@ -310,6 +311,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("focalnet", "FOCALNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("fsmt", "FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("funnel", "FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("fuyu", "FUYU_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("git", "GIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("glpn", "GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("gpt2", "GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -521,6 +523,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("focalnet", "FocalNet"),
        ("fsmt", "FairSeq Machine-Translation"),
        ("funnel", "Funnel Transformer"),
        ("fuyu", "Fuyu"),
        ("git", "GIT"),
        ("glpn", "GLPN"),
        ("gpt-sw3", "GPT-Sw3"),
@@ -64,6 +64,7 @@ IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("efficientnet", "EfficientNetImageProcessor"),
        ("flava", "FlavaImageProcessor"),
        ("focalnet", "BitImageProcessor"),
        ("fuyu", "FuyuImageProcessor"),
        ("git", "CLIPImageProcessor"),
        ("glpn", "GLPNImageProcessor"),
        ("groupvit", "CLIPImageProcessor"),
@@ -400,6 +400,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
        ("electra", "ElectraForCausalLM"),
        ("ernie", "ErnieForCausalLM"),
        ("falcon", "FalconForCausalLM"),
        ("fuyu", "FuyuForCausalLM"),
        ("git", "GitForCausalLM"),
        ("gpt-sw3", "GPT2LMHeadModel"),
        ("gpt2", "GPT2LMHeadModel"),
@@ -54,6 +54,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("clip", "CLIPProcessor"),
        ("clipseg", "CLIPSegProcessor"),
        ("flava", "FlavaProcessor"),
        ("fuyu", "FuyuProcessor"),
        ("git", "GitProcessor"),
        ("groupvit", "CLIPProcessor"),
        ("hubert", "Wav2Vec2Processor"),
@@ -0,0 +1,73 @@
# Copyright 2023 AdeptAI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available


_import_structure = {
    "configuration_fuyu": ["FUYU_PRETRAINED_CONFIG_ARCHIVE_MAP", "FuyuConfig"],
}


try:
    if not is_vision_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["image_processing_fuyu"] = ["FuyuImageProcessor"]
    _import_structure["processing_fuyu"] = ["FuyuProcessor"]


try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_fuyu"] = [
        "FuyuForCausalLM",
        "FuyuPreTrainedModel",
    ]


if TYPE_CHECKING:
    from .configuration_fuyu import FUYU_PRETRAINED_CONFIG_ARCHIVE_MAP, FuyuConfig

    try:
        if not is_vision_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .image_processing_fuyu import FuyuImageProcessor
        from .processing_fuyu import FuyuProcessor

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_fuyu import (
            FuyuForCausalLM,
            FuyuPreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
@@ -0,0 +1,211 @@
# coding=utf-8
# Copyright 2023 Adept AI and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Fuyu model configuration"""

from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ..auto import CONFIG_MAPPING


logger = logging.get_logger(__name__)

FUYU_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "adept/fuyu-8b-base": "https://huggingface.co/adept/fuyu-8b-base/resolve/main/config.json",
}
class FuyuConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`FuyuForCausalLM`]. It is used to instantiate a
    Fuyu model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the
    [adept/fuyu-8b-base](https://huggingface.co/adept/fuyu-8b-base).

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 262144):
            Vocabulary size of the Fuyu model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`FuyuForCausalLM`].
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 16384):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 36):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 64):
            Number of attention heads for each attention layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"relu2"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 16384):
            The maximum sequence length that this model might ever be used with.
        image_size (`int`, *optional*, defaults to 300):
            The input image size.
        patch_size (`int`, *optional*, defaults to 30):
            The input vision transformer encoding patch size.
        num_channels (`int`, *optional*, defaults to 3):
            The input image number of channels.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie input and output embeddings.
        rope_theta (`float`, *optional*, defaults to 25000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format
            is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
            `max_position_embeddings` to the expected new maximum. See the following thread for more information on
            how these scaling strategies behave:
            https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is
            an experimental feature, subject to breaking API changes in future versions.
        qk_layernorm (`bool`, *optional*, defaults to `True`):
            Whether or not to normalize the Queries and Keys after projecting the hidden states.
        hidden_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio after applying the MLP to the hidden states.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio after computing the attention scores.
        partial_rotary_factor (`float`, *optional*, defaults to 0.5):
            Percentage of the query and keys which will have rotary embedding.
        pad_token_id (`int`, *optional*):
            The id of the *padding* token.
        bos_token_id (`int`, *optional*, defaults to 1):
            The id of the *beginning-of-sequence* token.
        eos_token_id (`Union[int, List[int]]`, *optional*, defaults to 2):
            The id of the *end-of-sequence* token. Optionally, use a list to set multiple *end-of-sequence* tokens.
        text_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize the underlying language model backbone (a
            `persimmon`-type config by default).

    ```python
    >>> from transformers import FuyuConfig

    >>> # Initializing a Fuyu fuyu-8b style configuration
    >>> configuration = FuyuConfig()
    ```"""
    model_type = "fuyu"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=262144,
        hidden_size=4096,
        intermediate_size=16384,
        num_hidden_layers=36,
        num_attention_heads=64,
        hidden_act="relu2",
        max_position_embeddings=16384,
        image_size=300,
        patch_size=30,
        num_channels=3,
        initializer_range=0.02,
        layer_norm_eps=1e-5,
        use_cache=True,
        tie_word_embeddings=False,
        rope_theta=25000.0,
        rope_scaling=None,
        qk_layernorm=True,
        hidden_dropout=0.0,
        attention_dropout=0.0,
        partial_rotary_factor=0.5,
        pad_token_id=None,
        bos_token_id=1,
        eos_token_id=2,
        text_config=None,
        **kwargs,
    ):
        if text_config is None:
            text_config = {
                "vocab_size": vocab_size,
                "max_position_embeddings": max_position_embeddings,
                "hidden_size": hidden_size,
                "intermediate_size": intermediate_size,
                "num_hidden_layers": num_hidden_layers,
                "num_attention_heads": num_attention_heads,
                "hidden_act": hidden_act,
                "initializer_range": initializer_range,
                "layer_norm_eps": layer_norm_eps,
                "use_cache": use_cache,
                "rope_theta": rope_theta,
                "rope_scaling": rope_scaling,
                "qk_layernorm": qk_layernorm,
                "hidden_dropout": hidden_dropout,
                "attention_dropout": attention_dropout,
                "partial_rotary_factor": partial_rotary_factor,
                "pad_token_id": pad_token_id,
                "bos_token_id": bos_token_id,
                "eos_token_id": eos_token_id,
                "tie_word_embeddings": tie_word_embeddings,
            }
            logger.info("text_config is None. initializing the text model with default values.")
        text_model_type = text_config["model_type"] if "model_type" in text_config else "persimmon"
        self.text_config = CONFIG_MAPPING[text_model_type](**text_config)
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.qk_layernorm = qk_layernorm
        self.hidden_dropout = hidden_dropout
        self.attention_dropout = attention_dropout
        self.partial_rotary_factor = partial_rotary_factor
        self._rope_scaling_validation()

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
    def _rope_scaling_validation(self):
        """
        Validate the `rope_scaling` configuration.
        """
        if self.rope_scaling is None:
            return

        if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
            raise ValueError(
                "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
                f"got {self.rope_scaling}"
            )
        rope_scaling_type = self.rope_scaling.get("type", None)
        rope_scaling_factor = self.rope_scaling.get("factor", None)
        if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
            raise ValueError(
                f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
            )
        if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
            raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
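
As a small illustration of the `rope_scaling` format validated above (the scaling values are only examples, not defaults taken from this diff):

```py
from transformers import FuyuConfig

# Default configuration, similar to adept/fuyu-8b-base.
config = FuyuConfig()

# Linear RoPE scaling; the factor must be a float strictly greater than 1.0.
scaled_config = FuyuConfig(rope_scaling={"type": "linear", "factor": 2.0})
```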
@@ -0,0 +1,134 @@
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import sys
import warnings

import flatdict
import torch

from transformers import FuyuConfig, FuyuForCausalLM, LlamaTokenizer


try:
    from transformers import LlamaTokenizerFast

    tokenizer_class = LlamaTokenizerFast
except ImportError as e:
    warnings.warn(e)
    warnings.warn(
        "The converted tokenizer will be the `slow` tokenizer. To use the fast, update your `tokenizers` library and re-run the tokenizer conversion"
    )
    tokenizer_class = LlamaTokenizer
"""
|
||||
Sample usage: # TODO fix clone links from persimmon to fuyu
|
||||
```
|
||||
git clone https://github.com/adept-ai-labs/adept-inference
|
||||
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_base_model_release.tar
|
||||
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
|
||||
python src/transformers/models/fuyu/convert_fuyu_weights_to_hf.py --input_dir /path/to/downloaded/fuyu/weights/ --output_dir /output/path
|
||||
```
|
||||
|
||||
Thereafter, models can be loaded via:
|
||||
|
||||
```py
|
||||
from transformers import FuyuForCausalLM, FuyuTokenizer
|
||||
|
||||
model = FuyuForCausalLM.from_pretrained("/output/path")
|
||||
tokenizer = FuyuTokenizer.from_pretrained("/output/path")
|
||||
```
|
||||
|
||||
Important note: you need to be able to host the whole model in RAM to execute this script (even if the biggest versions
|
||||
come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM).
|
||||
"""
|
||||
|
||||
|
||||

KEYS_TO_MODIFY_MAPPING = {
    "self_attention": "self_attn",
    "language_model.encoder": "language_model.model",
    "word_embeddings_for_head": "language_model.lm_head",
    "language_model.embedding.word_embeddings": "language_model.model.embed_tokens",
    "vit_encoder.linear_encoder": "vision_embed_tokens",
}

KEYS_TO_REMOVE = {
    "rotary_emb.inv_freq",
    "image_patch_projection",
    "image_patch_projection.weight",
    "image_patch_projection.bias",
}


def rename_state_dict(state_dict):
    model_state_dict = {}
    for key, value in state_dict.items():
        for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
            if key_to_modify in key:
                key = key.replace(key_to_modify, new_key)
        # Drop keys that are not used by the Hugging Face implementation (e.g. the unused image patch projection).
        if key in KEYS_TO_REMOVE:
            continue
        model_state_dict[key] = value
    return model_state_dict

def convert_fuyu_checkpoint(pytorch_dump_folder_path, ada_lib_path, pt_model_path, safe_serialization=False):
    sys.path.insert(0, ada_lib_path)
    model_state_dict_base = torch.load(pt_model_path, map_location="cpu")
    state_dict = flatdict.FlatDict(model_state_dict_base["model"], ".")
    state_dict = rename_state_dict(state_dict)

    transformers_config = FuyuConfig()
    model = FuyuForCausalLM(transformers_config).to(torch.bfloat16)
    model.load_state_dict(state_dict)
    model.save_pretrained(pytorch_dump_folder_path, safe_serialization=safe_serialization)
    transformers_config.save_pretrained(pytorch_dump_folder_path)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_dir",
        help="Location of Fuyu weights, which contains tokenizer.model and model folders",
    )
    parser.add_argument(
        "--pt_model_path",
        help="Location of Fuyu `model_optim_rng.pt`",
    )
    parser.add_argument(
        "--output_dir",
        help="Location to write HF model and tokenizer",
    )
    parser.add_argument(
        "--ada_lib_path",
        help="Location of original source code from adept to deserialize .pt checkpoint",
    )
    parser.add_argument("--safe_serialization", type=bool, help="Whether or not to save using `safetensors`.")
    args = parser.parse_args()
    spm_path = os.path.join(args.input_dir, "adept_vocab.model")

    convert_fuyu_checkpoint(
        pytorch_dump_folder_path=args.output_dir,
        pt_model_path=args.pt_model_path,
        safe_serialization=args.safe_serialization,
        ada_lib_path=args.ada_lib_path,
    )
    tokenizer = tokenizer_class(spm_path, bos_token="|ENDOFTEXT|", eos_token="|ENDOFTEXT|")
    tokenizer.save_pretrained(args.output_dir)


if __name__ == "__main__":
    main()
@@ -0,0 +1,254 @@
import math
from typing import List, Union

import numpy as np

from ...image_processing_utils import BaseImageProcessor
from ...image_transforms import (
    normalize,
    pad,
    resize,
)
from ...image_utils import to_numpy_array
from ...utils import is_torch_available, is_vision_available, logging, requires_backends


if is_vision_available():
    import PIL

if is_torch_available():
    import torch

logger = logging.get_logger(__name__)
class FuyuImageProcessor(BaseImageProcessor):
    """
    This class should handle the image processing part before the main FuyuForCausalLM. In particular, it should
    handle:

    - Processing Images:
        Taking a batch of images as input. If the images are variable-sized, it resizes them based on the desired
        patch dimensions. The image output size is always img_h = 1080 and img_w = 1920. Then, it patches up these
        images using the patchify_image function.

    - Creating Image Input IDs:
        For each patch, a placeholder ID is given to identify where these patches belong in a token sequence. For
        variable-sized images, each line of patches is terminated with a newline ID.

    - Image Patch Indices:
        For each image patch, the code maintains an index where these patches should be inserted in a token stream.

    """

    model_input_names = [
        "images",
        "image_input_ids",
        "image_patches",
        "image_patch_indices_per_batch",
        "image_patch_indices_per_subsequence",
    ]

    def __init__(
        self, target_height=1080, target_width=1920, padding_value=1.0, padding_mode: str = "constant", **kwargs
    ):
        super().__init__(**kwargs)
        self.target_width = target_width
        self.target_height = target_height
        self.padding_value = padding_value
        self.padding_mode = padding_mode
    def get_num_patches(self, img_h: int, img_w: int, patch_dim_h: int, patch_dim_w: int) -> int:
        """Calculate number of patches required to encode an image."""
        if img_h % patch_dim_h != 0:
            raise ValueError(f"{img_h=} must be divisible by {patch_dim_h=}")
        if img_w % patch_dim_w != 0:
            raise ValueError(f"{img_w=} must be divisible by {patch_dim_w=}")

        num_patches_per_dim_h = img_h // patch_dim_h
        num_patches_per_dim_w = img_w // patch_dim_w
        num_patches = num_patches_per_dim_h * num_patches_per_dim_w

        return num_patches
def patchify_image(self, image: "torch.Tensor", patch_dim_h: int, patch_dim_w: int) -> "torch.Tensor":
|
||||
"""
|
||||
Convert an image into a tensor of patches.
|
||||
|
||||
Args:
|
||||
image: Image to convert. Shape: [batch, channels, height, width]
|
||||
patch_dim_h: Height of each patch.
|
||||
patch_dim_w: Width of each patch.
|
||||
"""
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
# TODO refer to https://github.com/ArthurZucker/transformers/blob/0f0a3fe5ca5697ee58faeb5b53f049af720b5e98/src/transformers/models/vit_mae/modeling_vit_mae.py#L871
|
||||
# torch implementation is faster but does not handle non-squares
|
||||
|
||||
batch_size, channels, height, width = image.shape
|
||||
unfolded_along_height = image.unfold(2, patch_dim_h, patch_dim_h)
|
||||
patches = unfolded_along_height.unfold(3, patch_dim_w, patch_dim_w)
|
||||
|
||||
patches_reshaped = patches.contiguous().view(batch_size, channels, -1, patch_dim_h, patch_dim_w)
|
||||
|
||||
patches_final = patches_reshaped.permute(0, 2, 3, 4, 1).reshape(
|
||||
batch_size, -1, channels * patch_dim_h * patch_dim_w
|
||||
)
|
||||
|
||||
return patches_final
|
||||
|
||||
def process_images_for_model_input(
|
||||
self,
|
||||
image_input: "torch.Tensor",
|
||||
image_present: "torch.Tensor",
|
||||
image_unpadded_h: "torch.Tensor",
|
||||
image_unpadded_w: "torch.Tensor",
|
||||
image_patch_dim_h: int,
|
||||
image_patch_dim_w: int,
|
||||
image_placeholder_id: int,
|
||||
image_newline_id: int,
|
||||
variable_sized: bool,
|
||||
) -> dict:
|
||||
"""Process images for model input. In particular, variable-sized images are handled here.
|
||||
|
||||
Args:
|
||||
image_input: [batch_size, 1, c, h, w] tensor of images padded to model input size.
|
||||
image_present: [batch_size, 1] tensor of 1s and 0s indicating whether an image is present.
|
||||
image_unpadded_h: [batch_size, 1] tensor of unpadded image heights.
|
||||
image_unpadded_w: [batch_size, 1] tensor of unpadded image widths.
|
||||
image_patch_dim_h: The height of the image patches.
|
||||
image_patch_dim_w: The width of the image patches.
|
||||
image_placeholder_id: The id of the image placeholder token.
|
||||
image_newline_id: The id of the image newline token.
|
||||
variable_sized: Whether to process images as variable-sized.
|
||||
"""
|
||||
requires_backends(self, ["torch"])
|
||||
# Only images that are present.
|
||||
images: List[List[torch.Tensor]] = []
|
||||
image_patches: List[List[torch.Tensor]] = []
|
||||
# Image input ids for every subsequence, including ones with no image present.
|
||||
image_input_ids: List[List[torch.Tensor]] = []
|
||||
for bi in range(image_input.shape[0]):
|
||||
images.append([])
|
||||
image_input_ids.append([])
|
||||
image_patches.append([])
|
||||
for si in range(image_input.shape[1]):
|
||||
if image_present[bi, si]:
|
||||
image = image_input[bi, si]
|
||||
if variable_sized:
|
||||
# The min() is required here due to floating point issues:
|
||||
# math.ceil(torch.tensor(300).cuda() / 30) == 11
|
||||
new_h = min(
|
||||
image.shape[1], math.ceil(image_unpadded_h[bi, si] / image_patch_dim_h) * image_patch_dim_h
|
||||
)
|
||||
new_w = min(
|
||||
image.shape[2], math.ceil(image_unpadded_w[bi, si] / image_patch_dim_w) * image_patch_dim_w
|
||||
)
|
||||
image = image[:, :new_h, :new_w]
|
||||
images[bi].append(image)
|
||||
num_patches = self.get_num_patches(
|
||||
img_h=image.shape[1],
|
||||
img_w=image.shape[2],
|
||||
patch_dim_h=image_patch_dim_h,
|
||||
patch_dim_w=image_patch_dim_w,
|
||||
)
|
||||
ids = torch.full([num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device)
|
||||
patches = self.patchify_image(
|
||||
image=image.unsqueeze(0), patch_dim_h=image_patch_dim_h, patch_dim_w=image_patch_dim_w
|
||||
).squeeze(0)
|
||||
if variable_sized:
|
||||
# Now terminate each line with |NEWLINE|.
|
||||
ids = ids.reshape(-1, new_w // image_patch_dim_w)
|
||||
ids = torch.cat(
|
||||
[
|
||||
ids,
|
||||
torch.full(
|
||||
[ids.shape[0], 1], image_newline_id, dtype=torch.int32, device=image_input.device
|
||||
),
|
||||
],
|
||||
dim=1,
|
||||
)
|
||||
ids = ids.reshape(-1)
|
||||
image_input_ids[bi].append(ids)
|
||||
image_patches[bi].append(patches)
|
||||
else:
|
||||
image_input_ids[bi].append(torch.tensor([], dtype=torch.int32, device=image_input.device))
|
||||
|
||||
# Create image_patch_input_indices, where non-negative values correspond to image patches to be inserted in
|
||||
# the stream.
|
||||
image_patch_indices_per_batch: List[List[torch.Tensor]] = []
|
||||
image_patch_indices_per_subsequence: List[List[torch.Tensor]] = []
|
||||
for bi in range(len(image_input_ids)):
|
||||
image_patch_indices_per_batch.append([])
|
||||
image_patch_indices_per_subsequence.append([])
|
||||
index_offset = 0
|
||||
for si in range(len(image_input_ids[bi])):
|
||||
# Indices of image patches.
|
||||
num_patches = torch.count_nonzero(image_input_ids[bi][si] == image_placeholder_id)
|
||||
indices = torch.arange(
|
||||
num_patches,
|
||||
dtype=image_input_ids[bi][si].dtype,
|
||||
device=image_input_ids[bi][si].device,
|
||||
)
|
||||
|
||||
# Place those indices in the image input ids token stream, with -1 representing non-index tokens.
|
||||
indices_in_stream_per_batch = torch.full_like(image_input_ids[bi][si], -1)
|
||||
indices_in_stream_per_subsequence = torch.full_like(image_input_ids[bi][si], -1)
|
||||
indices_in_stream_per_batch[
|
||||
torch.nonzero(image_input_ids[bi][si] == image_placeholder_id, as_tuple=True)[0]
|
||||
] = (indices + index_offset)
|
||||
indices_in_stream_per_subsequence[
|
||||
torch.nonzero(image_input_ids[bi][si] == image_placeholder_id, as_tuple=True)[0]
|
||||
] = indices
|
||||
|
||||
image_patch_indices_per_batch[bi].append(indices_in_stream_per_batch)
|
||||
image_patch_indices_per_subsequence[bi].append(indices_in_stream_per_subsequence)
|
||||
index_offset += num_patches
|
||||
|
||||
return {
|
||||
"images": images,
|
||||
"image_input_ids": image_input_ids,
|
||||
"image_patches": image_patches,
|
||||
"image_patch_indices_per_batch": image_patch_indices_per_batch,
|
||||
"image_patch_indices_per_subsequence": image_patch_indices_per_subsequence,
|
||||
}
|
||||
|
||||
def _scale_to_target_aspect_ratio(self, image: np.ndarray) -> np.ndarray:
|
||||
image_height, image_width, _ = image.shape
|
||||
if image_width <= self.target_width and image_height <= self.target_height:
|
||||
return image
|
||||
|
||||
height_scale_factor = self.target_height / image_height
|
||||
width_scale_factor = self.target_width / image_width
|
||||
optimal_scale_factor = min(height_scale_factor, width_scale_factor)
|
||||
|
||||
new_height = int(image_height * optimal_scale_factor)
|
||||
new_width = int(image_width * optimal_scale_factor)
|
||||
|
||||
scaled_image = resize(image=image, size=(new_width, new_height))
|
||||
return np.array(scaled_image)
|
||||
|
||||
def _pad_to_target_size(self, image: np.ndarray) -> np.ndarray:
|
||||
image_height, image_width, _ = image.shape
|
||||
|
||||
padding_top = 0
|
||||
padding_left = 0
|
||||
padding_bottom = self.target_height - image_height
|
||||
padding_right = self.target_width - image_width
|
||||
|
||||
padded_image = pad(
|
||||
image,
|
||||
((padding_top, padding_bottom), (padding_left, padding_right)),
|
||||
mode=self.padding_mode,
|
||||
constant_values=self.padding_value,
|
||||
)
|
||||
return padded_image
|
||||
|
||||
def apply_transformation(self, image: Union[np.ndarray, PIL.Image.Image]) -> np.ndarray:
|
||||
if isinstance(image, PIL.Image.Image):
|
||||
image = to_numpy_array(image)
|
||||
scaled_image = self._scale_to_target_aspect_ratio(image)
|
||||
padded_image = self._pad_to_target_size(scaled_image)
|
||||
normalized_padded_image = normalize(padded_image, 0.5, 0.5)
|
||||
return normalized_padded_image
@ -0,0 +1,323 @@
# coding=utf-8
|
||||
# Copyright 2023 HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" PyTorch Fuyu model."""
|
||||
from typing import List, Optional, Tuple, Union
|
||||
|
||||
import torch
|
||||
import torch.utils.checkpoint
|
||||
from torch import nn
|
||||
|
||||
from ...modeling_outputs import BaseModelOutputWithPast
|
||||
from ...modeling_utils import PreTrainedModel
|
||||
from ...models.auto.modeling_auto import AutoModelForCausalLM
|
||||
from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging
|
||||
from .configuration_fuyu import FuyuConfig
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
_CONFIG_FOR_DOC = "FuyuConfig"
|
||||
|
||||
|
||||
FUYU_START_DOCSTRING = r"""
|
||||
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
|
||||
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
|
||||
etc.)
|
||||
|
||||
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
|
||||
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
|
||||
and behavior.
|
||||
|
||||
Parameters:
|
||||
config ([`FuyuConfig`]):
|
||||
Model configuration class with all the parameters of the model. Initializing with a config file does not
|
||||
load the weights associated with the model, only the configuration. Check out the
|
||||
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
|
||||
"""
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
"The bare Fuyu Model outputting raw hidden-states without any specific head on top.",
|
||||
FUYU_START_DOCSTRING,
|
||||
)
|
||||
class FuyuPreTrainedModel(PreTrainedModel):
|
||||
config_class = FuyuConfig
|
||||
base_model_prefix = "fuyu"
|
||||
supports_gradient_checkpointing = True
|
||||
_no_split_modules = []
|
||||
_skip_keys_device_placement = "past_key_values"
|
||||
|
||||
def _init_weights(self, module):
|
||||
std = self.config.initializer_range
|
||||
if isinstance(module, nn.Linear):
|
||||
module.weight.data.normal_(mean=0.0, std=std)
|
||||
if module.bias is not None:
|
||||
module.bias.data.zero_()
|
||||
elif isinstance(module, nn.Embedding):
|
||||
module.weight.data.normal_(mean=0.0, std=std)
|
||||
if module.padding_idx is not None:
|
||||
module.weight.data[module.padding_idx].zero_()
|
||||
|
||||
def _set_gradient_checkpointing(self, module, value=False):
|
||||
if isinstance(module, FuyuForCausalLM):
|
||||
module.gradient_checkpointing = value
|
||||
|
||||
|
||||
FUYU_INPUTS_DOCSTRING = r"""
|
||||
Args:
|
||||
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
|
||||
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
|
||||
it.
|
||||
|
||||
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
|
||||
[`PreTrainedTokenizer.__call__`] for details.
|
||||
|
||||
[What are input IDs?](../glossary#input-ids)
|
||||
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
|
||||
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
|
||||
|
||||
- 1 for tokens that are **not masked**,
|
||||
- 0 for tokens that are **masked**.
|
||||
|
||||
[What are attention masks?](../glossary#attention-mask)
|
||||
|
||||
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
|
||||
[`PreTrainedTokenizer.__call__`] for details.
|
||||
|
||||
If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
|
||||
`past_key_values`).
|
||||
|
||||
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
|
||||
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
|
||||
information on the default strategy.
|
||||
|
||||
- 1 indicates the head is **not masked**,
|
||||
- 0 indicates the head is **masked**.
|
||||
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
|
||||
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
|
||||
config.n_positions - 1]`.
|
||||
|
||||
[What are position IDs?](../glossary#position-ids)
|
||||
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
|
||||
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
|
||||
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
|
||||
`(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
|
||||
|
||||
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
|
||||
blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
|
||||
|
||||
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
|
||||
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
|
||||
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
|
||||
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
|
||||
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
|
||||
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
|
||||
model's internal embedding lookup matrix.
|
||||
use_cache (`bool`, *optional*):
|
||||
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
|
||||
`past_key_values`).
|
||||
output_attentions (`bool`, *optional*):
|
||||
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
|
||||
tensors for more detail.
|
||||
output_hidden_states (`bool`, *optional*):
|
||||
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
|
||||
more detail.
|
||||
return_dict (`bool`, *optional*):
|
||||
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
|
||||
"""
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
"The bare Fuyu Model outputting raw hidden-states without any specific head on top.",
|
||||
FUYU_START_DOCSTRING,
|
||||
)
|
||||
class FuyuForCausalLM(FuyuPreTrainedModel):
|
||||
"""
|
||||
Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`FuyuDecoderLayer`]
|
||||
|
||||
Args:
|
||||
config: FuyuConfig
|
||||
"""
|
||||
|
||||
def __init__(self, config: FuyuConfig):
|
||||
super().__init__(config)
|
||||
self.padding_idx = config.pad_token_id
|
||||
self.vocab_size = config.vocab_size
|
||||
self.language_model = AutoModelForCausalLM.from_config(config.text_config)
|
||||
|
||||
self.vision_embed_tokens = nn.Linear(
|
||||
config.patch_size * config.patch_size * config.num_channels, config.hidden_size
|
||||
)
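# For the released 8B checkpoint (patch_size=30, num_channels=3, hidden_size=4096) this is a 2700 -> 4096
# projection; the exact sizes depend on the config and are noted here only as an illustration.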
|
||||
|
||||
self.gradient_checkpointing = False
|
||||
# Initialize weights and apply final processing
|
||||
self.post_init()
|
||||
|
||||
def get_input_embeddings(self):
|
||||
return self.language_model.get_input_embeddings()
|
||||
|
||||
def set_input_embeddings(self, value):
|
||||
self.language_model.set_input_embeddings(value)
|
||||
|
||||
def gather_continuous_embeddings(
|
||||
self,
|
||||
word_embeddings: torch.Tensor,
|
||||
continuous_embeddings: List[torch.Tensor],
|
||||
image_patch_input_indices: torch.Tensor,
|
||||
) -> torch.Tensor:
|
||||
"""This function places the continuous_embeddings into the word_embeddings at the locations
|
||||
indicated by image_patch_input_indices. Different batch elements can have different numbers of continuous
|
||||
embeddings.
|
||||
|
||||
Args:
|
||||
word_embeddings: Tensor of word embeddings. Shape: [b, s, h]
|
||||
continuous_embeddings:
|
||||
Tensor of continuous embeddings. The length of the list is the batch size. Each entry is
|
||||
shape [num_image_embeddings, hidden], and num_image_embeddings needs to match the number of non-negative
|
||||
indices in image_patch_input_indices for that batch element.
|
||||
image_patch_input_indices: Tensor of indices of the image patches in the input_ids tensor. Shape: [b, s]
|
||||
"""
|
||||
if not (word_embeddings.shape[0] == len(continuous_embeddings)):
|
||||
raise ValueError(
|
||||
f"Batch sizes must match! Got {len(continuous_embeddings)=} and {word_embeddings.shape[0]=}"
|
||||
)
|
||||
|
||||
output_embeddings = word_embeddings.clone()
|
||||
for batch_idx in range(word_embeddings.shape[0]):
|
||||
# First, find the positions of all the non-negative values in image_patch_input_indices, those are the
|
||||
# positions in word_embeddings that we want to replace with content from continuous_embeddings.
|
||||
dst_indices = torch.nonzero(image_patch_input_indices[batch_idx] >= 0, as_tuple=True)[0]
|
||||
# Next look up those indices in image_patch_input_indices to find the indices in continuous_embeddings that we
|
||||
# want to use to replace the values in word_embeddings.
|
||||
src_indices = image_patch_input_indices[batch_idx][dst_indices]
|
||||
# Check if we have more indices than embeddings. Note that we could have fewer indices if images got truncated.
|
||||
if src_indices.shape[0] > continuous_embeddings[batch_idx].shape[0]:
|
||||
raise ValueError(
|
||||
f"Number of continuous embeddings {continuous_embeddings[batch_idx].shape=} does not match "
|
||||
f"number of continuous token ids {src_indices.shape=} in batch element {batch_idx}."
|
||||
)
|
||||
output_embeddings[batch_idx, dst_indices] = continuous_embeddings[batch_idx][src_indices]
|
||||
return output_embeddings
|
||||
|
||||
@add_start_docstrings_to_model_forward(FUYU_INPUTS_DOCSTRING)
|
||||
def forward(
|
||||
self,
|
||||
input_ids: torch.LongTensor = None,
|
||||
image_patches: torch.Tensor = None, # [batch_size, num_total_patches, patch_size x patch_size x num_channels]
|
||||
image_patches_indices: torch.Tensor = None,
|
||||
attention_mask: Optional[torch.Tensor] = None,
|
||||
position_ids: Optional[torch.LongTensor] = None,
|
||||
past_key_values: Optional[List[torch.FloatTensor]] = None,
|
||||
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||
use_cache: Optional[bool] = None,
|
||||
output_attentions: Optional[bool] = None,
|
||||
output_hidden_states: Optional[bool] = None,
|
||||
return_dict: Optional[bool] = None,
|
||||
) -> Union[Tuple, BaseModelOutputWithPast]:
|
||||
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
|
||||
output_hidden_states = (
|
||||
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
|
||||
)
|
||||
use_cache = use_cache if use_cache is not None else self.config.use_cache
|
||||
|
||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||
|
||||
# retrieve input_ids and inputs_embeds
|
||||
if input_ids is not None and inputs_embeds is not None:
|
||||
raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
|
||||
elif input_ids is not None:
|
||||
batch_size, seq_length = input_ids.shape
|
||||
elif inputs_embeds is not None:
|
||||
batch_size, seq_length, _ = inputs_embeds.shape
|
||||
else:
|
||||
raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
|
||||
|
||||
seq_length_with_past = seq_length
|
||||
past_key_values_length = 0
|
||||
|
||||
if past_key_values is not None:
|
||||
past_key_values_length = past_key_values[0][0].shape[2]
|
||||
seq_length_with_past = seq_length_with_past + past_key_values_length
|
||||
|
||||
if position_ids is None:
|
||||
device = input_ids.device if input_ids is not None else inputs_embeds.device
|
||||
position_ids = torch.arange(
|
||||
past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
|
||||
)
|
||||
position_ids = position_ids.unsqueeze(0)
|
||||
|
||||
if inputs_embeds is None:
|
||||
inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
|
||||
if image_patches is not None and past_key_values is None:
|
||||
patch_embeddings = self.vision_embed_tokens(image_patches.to(self.vision_embed_tokens.weight.dtype))
|
||||
inputs_embeds = self.gather_continuous_embeddings(
|
||||
word_embeddings=inputs_embeds,
|
||||
continuous_embeddings=patch_embeddings,
|
||||
image_patch_input_indices=image_patches_indices,
|
||||
)
|
||||
|
||||
outputs = self.language_model(
|
||||
inputs_embeds=inputs_embeds,
|
||||
attention_mask=attention_mask,
|
||||
position_ids=position_ids,
|
||||
past_key_values=past_key_values,
|
||||
output_attentions=output_attentions,
|
||||
use_cache=use_cache,
|
||||
)
|
||||
if not return_dict:
|
||||
return tuple(v for v in outputs if v is not None)
|
||||
return outputs
|
||||
|
||||
def prepare_inputs_for_generation(
|
||||
self,
|
||||
input_ids,
|
||||
past_key_values=None,
|
||||
attention_mask=None,
|
||||
inputs_embeds=None,
|
||||
image_patches=None,
|
||||
image_patches_indices=None,
|
||||
**kwargs,
|
||||
):
|
||||
if past_key_values:
|
||||
input_ids = input_ids[:, -1:]
|
||||
|
||||
position_ids = kwargs.get("position_ids", None)
|
||||
if attention_mask is not None and position_ids is None:
|
||||
# create position_ids on the fly for batch generation
|
||||
position_ids = attention_mask.long().cumsum(-1) - 1
|
||||
position_ids.masked_fill_(attention_mask == 0, 1)
|
||||
if past_key_values:
|
||||
position_ids = position_ids[:, -1].unsqueeze(-1)
|
||||
|
||||
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
|
||||
if inputs_embeds is not None and past_key_values is None:
|
||||
model_inputs = {"inputs_embeds": inputs_embeds}
|
||||
else:
|
||||
model_inputs = {"input_ids": input_ids}
|
||||
|
||||
if image_patches_indices is not None:
|
||||
model_inputs["image_patches_indices"] = image_patches_indices
|
||||
|
||||
model_inputs.update(
|
||||
{
|
||||
"position_ids": position_ids,
|
||||
"past_key_values": past_key_values,
|
||||
"use_cache": kwargs.get("use_cache"),
|
||||
"attention_mask": attention_mask,
|
||||
"image_patches_indices": image_patches_indices if past_key_values is None else None,
|
||||
"image_patches": image_patches if past_key_values is None else None,
|
||||
}
|
||||
)
|
||||
return model_inputs
@ -0,0 +1,562 @@
import re
|
||||
from typing import Any, Iterable, List, Optional, Tuple, Union
|
||||
|
||||
import numpy as np
|
||||
|
||||
from ...image_utils import (
|
||||
ChannelDimension,
|
||||
get_image_size,
|
||||
infer_channel_dimension_format,
|
||||
is_scaled_image,
|
||||
to_numpy_array,
|
||||
)
|
||||
from ...processing_utils import ProcessorMixin
|
||||
from ...utils import is_torch_available, is_vision_available, logging
|
||||
|
||||
|
||||
if is_torch_available() and is_vision_available():
|
||||
from .image_processing_fuyu import FuyuImageProcessor
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
if is_vision_available():
|
||||
import PIL
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
BBOX_OPEN_STRING = "<0x00>" # <bbox>
|
||||
BBOX_CLOSE_STRING = "<0x01>" # </bbox>
|
||||
POINT_OPEN_STRING = "<0x02>" # <point>
|
||||
POINT_CLOSE_STRING = "<0x03>" # </point>
|
||||
|
||||
TEXT_REPR_BBOX_OPEN = "<box>"
|
||||
TEXT_REPR_BBOX_CLOSE = "</box>"
|
||||
TEXT_REPR_POINT_OPEN = "<point>"
|
||||
TEXT_REPR_POINT_CLOSE = "</point>"
|
||||
|
||||
TOKEN_BBOX_OPEN_STRING = BBOX_OPEN_STRING  # <bbox>
TOKEN_BBOX_CLOSE_STRING = BBOX_CLOSE_STRING  # </bbox>
TOKEN_POINT_OPEN_STRING = POINT_OPEN_STRING  # <point>
TOKEN_POINT_CLOSE_STRING = POINT_CLOSE_STRING  # </point>
BEGINNING_OF_ANSWER_STRING = "<0x04>"  # <boa>
|
||||
|
||||
|
||||
def full_unpacked_stream_to_tensor(
|
||||
all_bi_tokens_to_place: List[int],
|
||||
full_unpacked_stream: List["torch.Tensor"],
|
||||
fill_value: int,
|
||||
batch_size: int,
|
||||
new_seq_len: int,
|
||||
offset: int,
|
||||
) -> "torch.Tensor":
|
||||
"""Takes an unpacked stream of tokens (i.e. a list of tensors, one for each item in the batch) and does
|
||||
the required padding to create a single tensor for the batch of shape batch_size x new_seq_len.
|
||||
"""
|
||||
|
||||
assert len(all_bi_tokens_to_place) == batch_size
|
||||
assert len(full_unpacked_stream) == batch_size
|
||||
|
||||
# Create padded tensors for the full batch.
|
||||
new_padded_tensor = torch.full(
|
||||
[batch_size, new_seq_len],
|
||||
fill_value=fill_value,
|
||||
dtype=full_unpacked_stream[0].dtype,
|
||||
device=full_unpacked_stream[0].device,
|
||||
)
|
||||
|
||||
# Place each batch entry into the batch tensor.
|
||||
for bi in range(batch_size):
|
||||
tokens_to_place = all_bi_tokens_to_place[bi]
|
||||
new_padded_tensor[bi, :tokens_to_place] = full_unpacked_stream[bi][offset : tokens_to_place + offset]
|
||||
|
||||
return new_padded_tensor
|
||||
|
||||
|
||||
def construct_full_unpacked_stream(
|
||||
num_real_text_tokens: Union[List[List[int]], "torch.Tensor"],
|
||||
input_stream: "torch.Tensor",
|
||||
image_tokens: List[List["torch.Tensor"]],
|
||||
batch_size: int,
|
||||
num_sub_sequences: int,
|
||||
) -> List["torch.Tensor"]:
|
||||
"""Takes an input_stream tensor of shape B x S x ?. For each subsequence, adds any required
|
||||
padding to account for images and then unpacks the subsequences to create a single sequence per item in the batch.
|
||||
Returns a list of tensors, one for each item in the batch."""
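# Illustrative example: for a batch item with one subsequence, 300 image tokens and 20 real text tokens, the
# returned stream is torch.cat([image_tokens, text_tokens])[:320].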
|
||||
|
||||
all_bi_stream = []
|
||||
|
||||
for bi in range(batch_size):
|
||||
all_si_stream = []
|
||||
|
||||
# First, construct full token stream (including image placeholder tokens) and loss mask for each subsequence
|
||||
# and append to lists. We use lists rather than tensors because each subsequence is variable-sized.
|
||||
for si in range(num_sub_sequences):
|
||||
image_adjustment = image_tokens[bi][si]
|
||||
si_stream = torch.cat([image_adjustment, input_stream[bi, si]], dim=0)
|
||||
num_real_tokens = image_adjustment.shape[0] + num_real_text_tokens[bi][si]
|
||||
|
||||
all_si_stream.append(si_stream[:num_real_tokens])
|
||||
# Combine all subsequences for this batch entry. Still using a list because each batch entry is variable-sized.
|
||||
all_bi_stream.append(torch.cat(all_si_stream, dim=0))
|
||||
|
||||
return all_bi_stream
|
||||
|
||||
|
||||
def _replace_string_repr_with_token_tags(prompt: str) -> str:
|
||||
prompt = prompt.replace(TEXT_REPR_POINT_OPEN, TOKEN_POINT_OPEN_STRING)
|
||||
prompt = prompt.replace(TEXT_REPR_POINT_CLOSE, TOKEN_POINT_CLOSE_STRING)
|
||||
prompt = prompt.replace(TEXT_REPR_BBOX_OPEN, TOKEN_BBOX_OPEN_STRING)
|
||||
prompt = prompt.replace(TEXT_REPR_BBOX_CLOSE, TOKEN_BBOX_CLOSE_STRING)
|
||||
return prompt
|
||||
|
||||
|
||||
def _segment_prompt_into_text_token_conversions(prompt: str) -> List:
|
||||
"""
|
||||
Given a string prompt, converts the prompt into a list of TextTokenConversions.
|
||||
"""
|
||||
# Wherever we notice one of the TOKEN_*_OPEN_STRING / TOKEN_*_CLOSE_STRING tags, we split the prompt
|
||||
prompt_text_list: List = []
|
||||
regex_pattern = re.compile(
|
||||
f"({TOKEN_BBOX_OPEN_STRING}|{TOKEN_BBOX_CLOSE_STRING}|{TOKEN_POINT_OPEN_STRING}|{TOKEN_POINT_CLOSE_STRING})"
|
||||
)
|
||||
# Split by the regex pattern
|
||||
prompt_split = regex_pattern.split(prompt)
|
||||
for i, elem in enumerate(prompt_split):
|
||||
if len(elem) == 0 or elem in [
|
||||
TOKEN_BBOX_OPEN_STRING,
|
||||
TOKEN_BBOX_CLOSE_STRING,
|
||||
TOKEN_POINT_OPEN_STRING,
|
||||
TOKEN_POINT_CLOSE_STRING,
|
||||
]:
|
||||
continue
|
||||
prompt_text_list.append(
|
||||
(elem, i > 1 and prompt_split[i - 1] in [TOKEN_BBOX_OPEN_STRING, TOKEN_POINT_OPEN_STRING])
|
||||
)
|
||||
return prompt_text_list
|
||||
|
||||
|
||||
def _transform_coordinates_and_tokenize(prompt: str, transformed_image, tokenizer) -> List[int]:
|
||||
"""
|
||||
This function transforms the prompt in the following fashion:
|
||||
- <box> <point> and </box> </point> to their respective token mappings
|
||||
- extract the coordinates from the tag
|
||||
- transform the coordinates into the transformed image space
|
||||
- return the prompt tokens with the transformed coordinates and new tags
|
||||
|
||||
Bounding boxes and points MUST be in the following format: <box>y1, x1, y2, x2</box> <point>x, y</point> The spaces
|
||||
and punctuation added above are NOT optional.
|
||||
"""
|
||||
# Make a namedtuple that stores "text" and "is_bbox"
|
||||
|
||||
# We want to do the following: Tokenize the code normally -> when we see a point or box, tokenize using the tokenize_within_tag function
|
||||
# When point or box close tag, continue tokenizing normally
|
||||
# First, we replace the point and box tags with their respective tokens
|
||||
prompt = _replace_string_repr_with_token_tags(prompt)
|
||||
# Tokenize the prompt
|
||||
# Convert prompt into a list split
|
||||
prompt_text_list = _segment_prompt_into_text_token_conversions(prompt)
|
||||
transformed_prompt_tokens: List[int] = []
|
||||
for elem in prompt_text_list:
|
||||
if elem[1]:
|
||||
# This is a location, we need to tokenize it
|
||||
within_tag_tokenized = _transform_within_tags(elem[0], transformed_image, tokenizer)
|
||||
# Surround the text with the open and close tags
|
||||
transformed_prompt_tokens.extend(within_tag_tokenized)
|
||||
else:
|
||||
transformed_prompt_tokens.extend(tokenizer(elem[0], add_special_tokens=False).input_ids)
|
||||
return transformed_prompt_tokens
|
||||
|
||||
|
||||
def _transform_within_tags(text: str, transformed_image, tokenizer) -> List[int]:
|
||||
"""
|
||||
Given a bounding box of the fashion <box>1, 2, 3, 4</box> | <point>1, 2</point> This function is responsible for
|
||||
converting 1, 2, 3, 4 into tokens of 1 2 3 4 without any commas.
|
||||
"""
|
||||
# Convert the text into a list of strings.
|
||||
num_int_strs = text.split(",")
|
||||
if len(num_int_strs) == 2:
|
||||
# Two coordinates denote a point: use the point open and close token ids.
|
||||
token_space_open_string = tokenizer.vocab[TOKEN_POINT_OPEN_STRING]
|
||||
token_space_close_string = tokenizer.vocab[TOKEN_POINT_CLOSE_STRING]
|
||||
else:
|
||||
token_space_open_string = tokenizer.vocab[TOKEN_BBOX_OPEN_STRING]
|
||||
token_space_close_string = tokenizer.vocab[TOKEN_BBOX_CLOSE_STRING]
|
||||
|
||||
# Remove all spaces from num_ints
|
||||
num_ints = [float(num.strip()) for num in num_int_strs]
|
||||
# Scale to the transformed image size
|
||||
if len(num_ints) == 2:
|
||||
num_ints_translated = scale_point_to_transformed_image(
|
||||
x=num_ints[0], y=num_ints[1], transformed_image=transformed_image
|
||||
)
|
||||
elif len(num_ints) == 4:
|
||||
num_ints_translated = scale_bbox_to_transformed_image(
|
||||
top=num_ints[0],
|
||||
left=num_ints[1],
|
||||
bottom=num_ints[2],
|
||||
right=num_ints[3],
|
||||
transformed_image=transformed_image,
|
||||
)
|
||||
else:
|
||||
raise ValueError(f"Invalid number of ints: {len(num_ints)}")
|
||||
# Map each scaled coordinate directly to its vocabulary token (no commas or spaces).
|
||||
tokens = [tokenizer.vocab[str(num)] for num in num_ints_translated]
|
||||
return [token_space_open_string] + tokens + [token_space_close_string]
|
||||
|
||||
|
||||
def _tokenize_prompts_with_image_and_batch(
|
||||
tokenizer,
|
||||
prompts: List[List[str]],
|
||||
transformed_images: Optional[List[List["torch.Tensor"]]],
|
||||
max_tokens_to_generate: int,
|
||||
max_position_embeddings: int,
|
||||
add_BOS: bool, # Same issue with types as above
|
||||
add_beginning_of_answer_token: bool,
|
||||
) -> Tuple["torch.Tensor", "torch.Tensor"]:
|
||||
"""
|
||||
Given a set of prompts and number of tokens to generate:
|
||||
- tokenize prompts
|
||||
- set the sequence length to be the max of length of prompts plus the number of tokens we would like to generate
|
||||
- pad all the sequences to this length so we can convert them into a 3D tensor.
|
||||
"""
|
||||
|
||||
# If not tool use, transform the coordinates while tokenizing
|
||||
if transformed_images is not None:
|
||||
transformed_prompt_tokens = []
|
||||
for prompt_seq, transformed_image_seq in zip(prompts, transformed_images):
|
||||
transformed_prompt_tokens.append(
|
||||
[
|
||||
_transform_coordinates_and_tokenize(prompt, transformed_image, tokenizer)
|
||||
for prompt, transformed_image in zip(prompt_seq, transformed_image_seq)
|
||||
]
|
||||
)
|
||||
else:
|
||||
transformed_prompt_tokens = [[tokenizer.tokenize(prompt) for prompt in prompt_seq] for prompt_seq in prompts]
|
||||
|
||||
prompts_tokens = transformed_prompt_tokens
|
||||
|
||||
if add_BOS:
|
||||
bos_token = tokenizer.vocab["<s>"]
|
||||
else:
|
||||
bos_token = tokenizer.vocab["|ENDOFTEXT|"]
|
||||
prompts_tokens = [[[bos_token] + x for x in prompt_seq] for prompt_seq in prompts_tokens]
|
||||
if add_beginning_of_answer_token:
|
||||
boa = tokenizer.vocab[BEGINNING_OF_ANSWER_STRING]
|
||||
# Only add bbox open token to the last subsequence since that is what will be completed
|
||||
for token_seq in prompts_tokens:
|
||||
token_seq[-1].append(boa)
|
||||
|
||||
# Now we have a list of list of tokens which each list has a different
|
||||
# size. We want to extend this list to:
|
||||
# - incorporate the tokens that need to be generated
|
||||
# - make all the sequences equal length.
|
||||
# Get the prompts length.
|
||||
|
||||
prompts_length = [[len(x) for x in prompts_tokens_seq] for prompts_tokens_seq in prompts_tokens]
|
||||
# Get the max prompts length.
|
||||
max_prompt_len: int = np.max(prompts_length)
|
||||
# Number of tokens in each sample of the batch.
|
||||
samples_length = min(max_prompt_len + max_tokens_to_generate, max_position_embeddings)
|
||||
if max_prompt_len + max_tokens_to_generate > max_position_embeddings:
|
||||
print(
|
||||
f"Max subsequence prompt length of {max_prompt_len} + max tokens to generate {max_tokens_to_generate}",
|
||||
f"exceeds context length of {max_position_embeddings}. Will generate as many tokens as possible.",
|
||||
)
|
||||
# Now update the list of list to be of the same size: samples_length.
|
||||
for prompt_tokens_seq, prompts_length_seq in zip(prompts_tokens, prompts_length):
|
||||
for prompt_tokens, prompt_length in zip(prompt_tokens_seq, prompts_length_seq):
|
||||
if len(prompt_tokens) > samples_length:
|
||||
raise ValueError("Length of subsequence prompt exceeds sequence length.")
|
||||
padding_size = samples_length - prompt_length
|
||||
prompt_tokens.extend([tokenizer.vocab["|ENDOFTEXT|"]] * padding_size)
|
||||
|
||||
# Now we are in a structured format, we can convert to tensors.
|
||||
prompts_tokens_tensor = torch.tensor(prompts_tokens, dtype=torch.int64)
|
||||
prompts_length_tensor = torch.tensor(prompts_length, dtype=torch.int64)
|
||||
|
||||
return prompts_tokens_tensor, prompts_length_tensor
|
||||
|
||||
|
||||
def original_to_transformed_h_coords(self, original_coords):
|
||||
# apply crop
|
||||
cropped_coords = (
|
||||
self._clamp_coords(original_coords, min_value=self.crop_top, max_value=self.crop_bottom) - self.crop_top
|
||||
)
|
||||
# apply scale
|
||||
scaled_coords = self._scale_coords(cropped_coords, scale=self.scaled_h / self.original_h)
|
||||
# apply pad
|
||||
return scaled_coords + self.padding_top
|
||||
|
||||
|
||||
def original_to_transformed_w_coords(self, original_coords):
|
||||
# apply crop
|
||||
cropped_coords = (
|
||||
self._clamp_coords(original_coords, min_value=self.crop_left, max_value=self.crop_right) - self.crop_left
|
||||
)
|
||||
# apply scale
|
||||
scaled_coords = self._scale_coords(cropped_coords, scale=self.scaled_w / self.original_w)
|
||||
# apply pad
|
||||
return scaled_coords + self.padding_left
|
||||
|
||||
|
||||
def scale_point_to_transformed_image(x: float, y: float) -> List[int]:
|
||||
x_scaled = original_to_transformed_w_coords(np.array([x / 2]))[0]
|
||||
y_scaled = original_to_transformed_h_coords(np.array([y / 2]))[0]
|
||||
return [x_scaled, y_scaled]
|
||||
|
||||
|
||||
def scale_bbox_to_transformed_image(top: float, left: float, bottom: float, right: float) -> List[int]:
|
||||
top_scaled = original_to_transformed_w_coords(np.array([top / 2]))[0]
|
||||
left_scaled = original_to_transformed_h_coords(np.array([left / 2]))[0]
|
||||
bottom_scaled = original_to_transformed_w_coords(np.array([bottom / 2]))[0]
|
||||
right_scaled = original_to_transformed_h_coords(np.array([right / 2]))[0]
|
||||
return [top_scaled, left_scaled, bottom_scaled, right_scaled]
|
||||
|
||||
|
||||
# Copied from transformers.models.detr.image_processing_detr.max_across_indices
|
||||
def max_across_indices(values: Iterable[Any]) -> List[Any]:
|
||||
"""
|
||||
Return the maximum value across all indices of an iterable of values.
|
||||
"""
|
||||
return [max(values_i) for values_i in zip(*values)]
|
||||
|
||||
|
||||
# Copied from transformers.models.detr.image_processing_detr.get_max_height_width
|
||||
def get_max_height_width(
|
||||
images: List[np.ndarray], input_data_format: Optional[Union[str, ChannelDimension]] = None
|
||||
) -> List[int]:
|
||||
"""
|
||||
Get the maximum height and width across all images in a batch.
|
||||
"""
|
||||
if input_data_format is None:
|
||||
input_data_format = infer_channel_dimension_format(images[0])
|
||||
|
||||
if input_data_format == ChannelDimension.FIRST:
|
||||
_, max_height, max_width = max_across_indices([img.shape for img in images])
|
||||
elif input_data_format == ChannelDimension.LAST:
|
||||
max_height, max_width, _ = max_across_indices([img.shape for img in images])
|
||||
else:
|
||||
raise ValueError(f"Invalid channel dimension format: {input_data_format}")
|
||||
return (max_height, max_width)
|
||||
|
||||
|
||||
# Copied from transformers.models.detr.image_processing_detr.make_pixel_mask
|
||||
def make_pixel_mask(
|
||||
image: np.ndarray, output_size: Tuple[int, int], input_data_format: Optional[Union[str, ChannelDimension]] = None
|
||||
) -> np.ndarray:
|
||||
"""
|
||||
Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding.
|
||||
|
||||
Args:
|
||||
image (`np.ndarray`):
|
||||
Image to make the pixel mask for.
|
||||
output_size (`Tuple[int, int]`):
|
||||
Output size of the mask.
|
||||
"""
|
||||
input_height, input_width = get_image_size(image, channel_dim=input_data_format)
|
||||
mask = np.zeros(output_size, dtype=np.int64)
|
||||
mask[:input_height, :input_width] = 1
|
||||
return mask
|
||||
|
||||
|
||||
class FuyuProcessor(ProcessorMixin):
|
||||
r"""
|
||||
Constructs a Fuyu processor which wraps a Fuyu image processor and a Llama tokenizer into a single processor.
|
||||
|
||||
[`FuyuProcessor`] offers all the functionalities of [`FuyuImageProcessor`] and [`LlamaTokenizerFast`]. See the
|
||||
[`~FuyuProcessor.__call__`] and [`~FuyuProcessor.decode`] for more information.
|
||||
|
||||
Args:
|
||||
image_processor ([`FuyuImageProcessor`]):
|
||||
The image processor is a required input.
|
||||
tokenizer ([`LlamaTokenizerFast`]):
|
||||
The tokenizer is a required input.
|
||||
"""
|
||||
attributes = ["image_processor", "tokenizer"]
|
||||
image_processor_class = "FuyuImageProcessor"
|
||||
tokenizer_class = "AutoTokenizer"
|
||||
|
||||
def __init__(self, image_processor, tokenizer):
|
||||
super().__init__(image_processor=image_processor, tokenizer=tokenizer)
|
||||
self.image_processor = image_processor
|
||||
self.tokenizer = tokenizer
|
||||
self.max_tokens_to_generate = 10
|
||||
self.max_position_embeddings = 16384 # TODO Can't derive this from model files: where to set it?
|
||||
self.image_processor = FuyuImageProcessor()
|
||||
|
||||
def _process_images(self, images):
|
||||
"""Utility function to preprocess the images and extract necessary information about original formats."""
|
||||
batch_images = []
|
||||
image_unpadded_heights = []
|
||||
image_unpadded_widths = []
|
||||
|
||||
for image in images:
|
||||
image = to_numpy_array(image)
|
||||
if not is_scaled_image(image):
|
||||
image = image / 255.0
|
||||
channel_dimension = infer_channel_dimension_format(image, 3)
|
||||
if channel_dimension == ChannelDimension.FIRST:
|
||||
width_index = 2
|
||||
height_index = 1
|
||||
elif channel_dimension == ChannelDimension.LAST:
|
||||
width_index = 1
|
||||
height_index = 0
|
||||
|
||||
image_unpadded_widths.append([image.shape[width_index]])
|
||||
image_unpadded_heights.append([image.shape[height_index]])
|
||||
|
||||
# Reproduce the Adept padding sampler
|
||||
padded_image = self.image_processor.apply_transformation(image)
|
||||
|
||||
tensor_img = torch.Tensor(padded_image).permute(2, 0, 1)
|
||||
batch_images.append([tensor_img])
|
||||
|
||||
return batch_images, torch.Tensor(image_unpadded_heights), torch.Tensor(image_unpadded_widths)
|
||||
|
||||
def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
|
||||
"""
|
||||
Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to
encode the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
FuyuImageProcessor's [`~FuyuImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
of the above two methods for more information.
|
||||
|
||||
Args:
|
||||
text (`str`, `List[str]`):
|
||||
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
|
||||
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
|
||||
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
|
||||
images (`PIL.Image.Image`, `List[PIL.Image.Image]`):
|
||||
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
|
||||
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
|
||||
number of channels, H and W are image height and width.
|
||||
|
||||
return_tensors (`str` or [`~utils.TensorType`], *optional*):
|
||||
If set, will return tensors of a particular framework. Acceptable values are:
|
||||
|
||||
- `'tf'`: Return TensorFlow `tf.constant` objects.
|
||||
- `'pt'`: Return PyTorch `torch.Tensor` objects.
|
||||
- `'np'`: Return NumPy `np.ndarray` objects.
|
||||
- `'jax'`: Return JAX `jnp.ndarray` objects.
|
||||
|
||||
Returns:
|
||||
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
|
||||
|
||||
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
|
||||
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
|
||||
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
|
||||
`None`).
|
||||
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
|
||||
"""
|
||||
if text is None and images is None:
|
||||
raise ValueError("You have to specify either text or images. Both cannot be none.")
|
||||
if text is not None and images is not None:
|
||||
if isinstance(text, str):
|
||||
prompts = [[text]]
|
||||
elif isinstance(text, list):
|
||||
prompts = [[text_seq] for text_seq in text]
|
||||
batch_images = []
|
||||
if isinstance(images, PIL.Image.Image):
|
||||
images = [images]
|
||||
if isinstance(images, list):
|
||||
batch_images, image_unpadded_heights, image_unpadded_widths = self._process_images(images)
|
||||
# image_unpadded_heights = image_unpadded_heights.unsqueeze(0)
|
||||
# image_unpadded_widths = image_unpadded_widths.unsqueeze(0)
|
||||
else:
|
||||
raise ValueError("images must be a list of ndarrays or PIL Images to be processed.")
|
||||
|
||||
# Note: the original adept code has a handling of image_unpadded_h and w, but it doesn't seem to hold
|
||||
# when there are several different size subsequences per batch. The current implementation reflects
|
||||
# that limitation and should be documented.
|
||||
#
|
||||
self.subsequence_length = 1 # Each batch contains only one sequence.
|
||||
self.batch_size = len(batch_images)
|
||||
# FIXME max_tokens_to_generate is embedded into this processor's call.
|
||||
prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
|
||||
tokenizer=self.tokenizer,
|
||||
prompts=prompts,
|
||||
transformed_images=batch_images,
|
||||
max_tokens_to_generate=self.max_tokens_to_generate,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
add_BOS=True,
|
||||
add_beginning_of_answer_token=True,
|
||||
)
|
||||
# same so far
|
||||
|
||||
# This is 1 if there is an image per subsequence, else 0. [batch, 1, presence]
# The remainder of the current image processing logic assumes subsequence_size = 1.
# Here that is fine because the model cannot handle more than one subsequence.
# The image could be absent, however, and image presence should be inferred from the user batch input;
# this code currently assumes the images are always present.
|
||||
|
||||
image_present = torch.ones(self.batch_size, 1, 1)
|
||||
|
||||
image_placeholder_id = self.tokenizer("|SPEAKER|", add_special_tokens=False)["input_ids"][1]
|
||||
image_newline_id = self.tokenizer("|NEWLINE|", add_special_tokens=False)["input_ids"][1]
|
||||
tensor_batch_images = torch.stack([img[0] for img in batch_images]).unsqueeze(1)
|
||||
model_image_input = self.image_processor.process_images_for_model_input(
|
||||
image_input=tensor_batch_images,
|
||||
image_present=image_present,
|
||||
image_unpadded_h=image_unpadded_heights,
|
||||
image_unpadded_w=image_unpadded_widths,
|
||||
image_patch_dim_h=30,
|
||||
image_patch_dim_w=30,
|
||||
image_placeholder_id=image_placeholder_id,
|
||||
image_newline_id=image_newline_id,
|
||||
variable_sized=True,
|
||||
)
|
||||
|
||||
image_padded_unpacked_tokens = construct_full_unpacked_stream(
|
||||
num_real_text_tokens=prompts_length,
|
||||
input_stream=prompt_tokens,
|
||||
image_tokens=model_image_input["image_input_ids"],
|
||||
batch_size=self.batch_size,
|
||||
num_sub_sequences=self.subsequence_length,
|
||||
)
|
||||
# Construct inputs for image patch indices.
|
||||
unpacked_image_patch_indices_per_batch = construct_full_unpacked_stream(
|
||||
num_real_text_tokens=prompts_length,
|
||||
input_stream=torch.full_like(prompt_tokens, -1),
|
||||
image_tokens=model_image_input["image_patch_indices_per_batch"],
|
||||
batch_size=self.batch_size,
|
||||
num_sub_sequences=self.subsequence_length,
|
||||
)
|
||||
max_prompt_length = max(x.shape[-1] for x in image_padded_unpacked_tokens)
|
||||
max_seq_len_batch = min(max_prompt_length + self.max_tokens_to_generate, self.max_position_embeddings)
|
||||
all_bi_tokens_to_place = []
|
||||
for bi in range(self.batch_size):
|
||||
tokens_to_place = min(max_seq_len_batch, max(0, image_padded_unpacked_tokens[bi].shape[0]))
|
||||
all_bi_tokens_to_place.append(tokens_to_place)
|
||||
|
||||
# Use same packing logic for the image patch indices.
|
||||
image_patch_input_indices = full_unpacked_stream_to_tensor(
|
||||
all_bi_tokens_to_place=all_bi_tokens_to_place,
|
||||
full_unpacked_stream=unpacked_image_patch_indices_per_batch,
|
||||
fill_value=-1,
|
||||
batch_size=self.batch_size,
|
||||
new_seq_len=max_seq_len_batch,
|
||||
offset=0,
|
||||
)
|
||||
|
||||
image_patches_tensor = torch.stack([img[0] for img in model_image_input["image_patches"]]).unsqueeze(1)
|
||||
return {
|
||||
"input_ids": image_padded_unpacked_tokens[0].unsqueeze(0),
|
||||
"image_patches": image_patches_tensor[0][0].unsqueeze(0),
|
||||
"image_patches_indices": image_patch_input_indices,
|
||||
}
|
||||
|
||||
def batch_decode(self, *args, **kwargs):
|
||||
"""
|
||||
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
|
||||
refer to the docstring of this method for more information.
|
||||
"""
|
||||
return self.tokenizer.batch_decode(*args, **kwargs)
|
||||
|
||||
def decode(self, *args, **kwargs):
|
||||
"""
|
||||
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
|
||||
the docstring of this method for more information.
|
||||
"""
|
||||
return self.tokenizer.decode(*args, **kwargs)
@ -3614,6 +3614,20 @@ def load_tf_weights_in_funnel(*args, **kwargs):
requires_backends(load_tf_weights_in_funnel, ["torch"])
|
||||
|
||||
|
||||
class FuyuForCausalLM(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class FuyuPreTrainedModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
GIT_PRETRAINED_MODEL_ARCHIVE_LIST = None
@ -219,6 +219,13 @@ class FlavaProcessor(metaclass=DummyObject):
requires_backends(self, ["vision"])
|
||||
|
||||
|
||||
class FuyuImageProcessor(metaclass=DummyObject):
|
||||
_backends = ["vision"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["vision"])
|
||||
|
||||
|
||||
class GLPNFeatureExtractor(metaclass=DummyObject):
|
||||
_backends = ["vision"]
@ -0,0 +1,65 @@
import unittest
|
||||
|
||||
import numpy as np
|
||||
|
||||
from transformers import is_torch_available, is_vision_available
|
||||
from transformers.testing_utils import (
|
||||
require_torch,
|
||||
require_torchvision,
|
||||
require_vision,
|
||||
)
|
||||
|
||||
|
||||
if is_torch_available() and is_vision_available():
|
||||
import torch
|
||||
|
||||
from transformers import FuyuImageProcessor
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_vision
|
||||
@require_torchvision
|
||||
class TestFuyuImageProcessor(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.processor = FuyuImageProcessor(target_height=160, target_width=320, padding_value=1.0)
|
||||
self.batch_size = 3
|
||||
self.channels = 3
|
||||
self.height = 300
|
||||
self.width = 300
|
||||
|
||||
self.image_input = torch.rand(self.batch_size, self.channels, self.height, self.width)
|
||||
|
||||
self.image_patch_dim_h = 30
|
||||
self.image_patch_dim_w = 30
|
||||
self.sample_image = np.zeros((450, 210, 3), dtype=np.uint8)
|
||||
self.sample_image_pil = Image.fromarray(self.sample_image)
|
||||
|
||||
def test_patches(self):
|
||||
expected_num_patches = self.processor.get_num_patches(
|
||||
img_h=self.height, img_w=self.width, patch_dim_h=self.image_patch_dim_h, patch_dim_w=self.image_patch_dim_w
|
||||
)
|
||||
|
||||
patches_final = self.processor.patchify_image(
|
||||
image=self.image_input, patch_dim_h=self.image_patch_dim_h, patch_dim_w=self.image_patch_dim_w
|
||||
)
|
||||
assert (
|
||||
patches_final.shape[1] == expected_num_patches
|
||||
), f"Expected {expected_num_patches} patches, got {patches_final.shape[1]}."
|
||||
|
||||
def test_scale_to_target_aspect_ratio(self):
|
||||
scaled_image = self.processor._scale_to_target_aspect_ratio(self.sample_image)
|
||||
self.assertEqual(scaled_image.shape[0], 74)
|
||||
self.assertEqual(scaled_image.shape[1], 160)
|
||||
|
||||
def test_apply_transformation_numpy(self):
|
||||
transformed_image = self.processor.apply_transformation(self.sample_image)
|
||||
self.assertEqual(transformed_image.shape[0], 160)
|
||||
self.assertEqual(transformed_image.shape[1], 320)
|
||||
|
||||
def test_apply_transformation_pil(self):
|
||||
transformed_image = self.processor.apply_transformation(self.sample_image_pil)
|
||||
self.assertEqual(transformed_image.shape[0], 160)
|
||||
self.assertEqual(transformed_image.shape[1], 320)
@ -0,0 +1,362 @@
import io
|
||||
import unittest
|
||||
|
||||
import requests
|
||||
|
||||
from transformers import AutoTokenizer, FuyuConfig, is_torch_available, is_vision_available
|
||||
from transformers.testing_utils import require_torch, require_torch_gpu, slow, torch_device
|
||||
|
||||
from ...test_modeling_common import ids_tensor, random_attention_mask
|
||||
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
|
||||
if is_torch_available() and is_vision_available():
|
||||
from transformers import FuyuImageProcessor, FuyuProcessor
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from transformers import FuyuForCausalLM
|
||||
|
||||
|
||||
# Copied from transformers.tests.llama.test_modeling_llama.LlamaModelTester with Llama->Fuyu
|
||||
class FuyuModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
seq_length=7,
|
||||
image_size=300,
|
||||
patch_size=30,
|
||||
num_channels=3,
|
||||
is_training=True,
|
||||
use_input_mask=True,
|
||||
use_token_type_ids=False,
|
||||
use_labels=True,
|
||||
vocab_size=99,
|
||||
hidden_size=32,
|
||||
num_hidden_layers=2,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
max_position_embeddings=512,
|
||||
type_vocab_size=16,
|
||||
type_sequence_label_size=2,
|
||||
initializer_range=0.02,
|
||||
num_labels=3,
|
||||
num_choices=4,
|
||||
pad_token_id=0,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.seq_length = seq_length
|
||||
self.image_size = image_size
|
||||
self.patch_size = patch_size
|
||||
self.num_channels = num_channels
|
||||
self.is_training = is_training
|
||||
self.use_input_mask = use_input_mask
|
||||
self.use_token_type_ids = use_token_type_ids
|
||||
self.use_labels = use_labels
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.type_vocab_size = type_vocab_size
|
||||
self.type_sequence_label_size = type_sequence_label_size
|
||||
self.initializer_range = initializer_range
|
||||
self.num_labels = num_labels
|
||||
self.num_choices = num_choices
|
||||
self.pad_token_id = pad_token_id
|
||||
self.scope = scope
|
||||
|
||||

    def prepare_config_and_inputs(self):
        input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)

        input_mask = None
        if self.use_input_mask:
            input_mask = random_attention_mask([self.batch_size, self.seq_length])

        token_type_ids = None
        if self.use_token_type_ids:
            token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)

        sequence_labels = None
        token_labels = None
        choice_labels = None
        if self.use_labels:
            sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
            token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
            choice_labels = ids_tensor([self.batch_size], self.num_choices)

        config = self.get_config()

        return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels

    def get_config(self):
        return FuyuConfig(
            vocab_size=self.vocab_size,
            hidden_size=self.hidden_size,
            num_hidden_layers=self.num_hidden_layers,
            num_attention_heads=self.num_attention_heads,
            intermediate_size=self.intermediate_size,
            hidden_act=self.hidden_act,
            hidden_dropout_prob=self.hidden_dropout_prob,
            attention_probs_dropout_prob=self.attention_probs_dropout_prob,
            max_position_embeddings=self.max_position_embeddings,
            type_vocab_size=self.type_vocab_size,
            is_decoder=False,
            initializer_range=self.initializer_range,
            pad_token_id=self.pad_token_id,
        )

    def create_and_check_model(
        self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
    ):
        model = FuyuForCausalLM(config=config)
        model.to(torch_device)
        model.eval()
        result = model(input_ids, attention_mask=input_mask)
        result = model(input_ids)
        self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))

    def create_and_check_model_as_decoder(
        self,
        config,
        input_ids,
        token_type_ids,
        input_mask,
        sequence_labels,
        token_labels,
        choice_labels,
        encoder_hidden_states,
        encoder_attention_mask,
    ):
        config.add_cross_attention = True
        model = FuyuForCausalLM(config)
        model.to(torch_device)
        model.eval()
        result = model(
            input_ids,
            attention_mask=input_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
        )
        result = model(
            input_ids,
            attention_mask=input_mask,
            encoder_hidden_states=encoder_hidden_states,
        )
        result = model(input_ids, attention_mask=input_mask)
        self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))

    def create_and_check_for_causal_lm(
        self,
        config,
        input_ids,
        token_type_ids,
        input_mask,
        sequence_labels,
        token_labels,
        choice_labels,
        encoder_hidden_states,
        encoder_attention_mask,
    ):
        model = FuyuForCausalLM(config=config)
        model.to(torch_device)
        model.eval()
        result = model(input_ids, attention_mask=input_mask, labels=token_labels)
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))

    def create_and_check_decoder_model_past_large_inputs(
        self,
        config,
        input_ids,
        token_type_ids,
        input_mask,
        sequence_labels,
        token_labels,
        choice_labels,
        encoder_hidden_states,
        encoder_attention_mask,
    ):
        config.is_decoder = True
        config.add_cross_attention = True
        model = FuyuForCausalLM(config=config)
        model.to(torch_device)
        model.eval()

        # first forward pass
        outputs = model(
            input_ids,
            attention_mask=input_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            use_cache=True,
        )
        past_key_values = outputs.past_key_values

        # create hypothetical multiple next tokens and extend to next_input_ids
        next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
        next_mask = ids_tensor((self.batch_size, 3), vocab_size=2)

        # append to next input_ids and attention_mask
        next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
        next_attention_mask = torch.cat([input_mask, next_mask], dim=-1)

        output_from_no_past = model(
            next_input_ids,
            attention_mask=next_attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            output_hidden_states=True,
        )["hidden_states"][0]
        output_from_past = model(
            next_tokens,
            attention_mask=next_attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            past_key_values=past_key_values,
            output_hidden_states=True,
        )["hidden_states"][0]

        # select random slice
        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()

        self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])

        # test that outputs are equal for slice
        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))

    def prepare_config_and_inputs_for_common(self):
        config_and_inputs = self.prepare_config_and_inputs()
        (
            config,
            input_ids,
            token_type_ids,
            input_mask,
            sequence_labels,
            token_labels,
            choice_labels,
        ) = config_and_inputs
        inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
        return config, inputs_dict


@require_torch
@require_torch_gpu
@slow
class FuyuIntegrationTest(unittest.TestCase):  # , ModelTesterMixin)
    """
    Currently, all these tests depend on a value of max_tokens_to_generate of 10.
    """

    all_model_classes = ("FuyuForCausalLM",) if is_torch_available() else ()

    def setUp(self):
        self.pretrained_model_name = "huggingface/new_model_release_weights"
        tokenizer = AutoTokenizer.from_pretrained(self.pretrained_model_name)
        image_processor = FuyuImageProcessor()

        self.processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)
        self.model = FuyuForCausalLM.from_pretrained(self.pretrained_model_name)
        self.bus_image_url = (
            "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
        )
        self.bus_image_pil = Image.open(io.BytesIO(requests.get(self.bus_image_url).content))

    @slow
    @require_torch_gpu
    def test_model_8b_chat_greedy_generation_bus_captioning(self):
        EXPECTED_TEXT_COMPLETION = """A bus parked on the side of a road.|ENDOFTEXT|"""
        text_prompt_coco_captioning = "Generate a coco-style caption.\n"

        model_inputs_bus_captioning = self.processor(text=text_prompt_coco_captioning, images=self.bus_image_pil)
        generated_tokens = self.model.generate(**model_inputs_bus_captioning, max_new_tokens=10)
        text = self.processor.tokenizer.batch_decode(generated_tokens)
        end_sequence = text[0].split("\x04")[1]
        clean_sequence = (
            end_sequence[: end_sequence.find("|ENDOFTEXT|") + len("|ENDOFTEXT|")]
            if "|ENDOFTEXT|" in end_sequence
            else end_sequence
        )
        self.assertEqual(EXPECTED_TEXT_COMPLETION, clean_sequence[1:])
"""
|
||||
@slow
|
||||
@require_torch_gpu
|
||||
def test_model_8b_chat_greedy_generation_bus_color(self):
|
||||
EXPECTED_TEXT_COMPLETION = "The bus is blue.\n|ENDOFTEXT|"
|
||||
text_prompt_bus_color = "What color is the bus?\n"
|
||||
model_inputs_bus_color = self.processor(text=text_prompt_bus_color, images=self.bus_image_pil)
|
||||
|
||||
generated_tokens = self.model.generate(**model_inputs_bus_color, max_new_tokens=10)
|
||||
text = self.processor.tokenizer.batch_decode(generated_tokens)
|
||||
end_sequence = text[0].split("\x04")[1]
|
||||
clean_sequence = (
|
||||
end_sequence[: end_sequence.find("|ENDOFTEXT|") + len("|ENDOFTEXT|")]
|
||||
if "|ENDOFTEXT|" in end_sequence
|
||||
else end_sequence
|
||||
)
|
||||
self.assertEqual(EXPECTED_TEXT_COMPLETION, clean_sequence)
|
||||
|
||||
@slow
|
||||
@require_torch_gpu
|
||||
def test_model_8b_chat_greedy_generation_chart_vqa(self):
|
||||
# fmt: off
|
||||
EXPECTED_TEXT_TOKENS = ["The","life expectancy","at","birth","of male","s in","","20","18","is","","80",".","7",".","\n","|ENDOFTEXT|",]
|
||||
# fmt: on
|
||||
expected_text_completion = " ".join(EXPECTED_TEXT_TOKENS) # TODO make sure the end string matches
|
||||
|
||||
text_prompt_chart_vqa = "What is the highest life expectancy at birth of male?\n"
|
||||
|
||||
chart_image_url = (
|
||||
"https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/chart.png"
|
||||
)
|
||||
chart_image_pil = Image.open(io.BytesIO(requests.get(chart_image_url).content))
|
||||
|
||||
model_inputs_chart_vqa = self.processor(text=text_prompt_chart_vqa, images=chart_image_pil)
|
||||
generated_tokens = self.model.generate(**model_inputs_chart_vqa, max_new_tokens=10)
|
||||
text = self.processor.tokenizer.batch_decode(generated_tokens)
|
||||
end_sequence = text[0].split("\x04")[1]
|
||||
clean_sequence = (
|
||||
end_sequence[: end_sequence.find("|ENDOFTEXT|") + len("|ENDOFTEXT|")]
|
||||
if "|ENDOFTEXT|" in end_sequence
|
||||
else end_sequence
|
||||
)
|
||||
self.assertEqual(expected_text_completion, clean_sequence)
|
||||
|
||||
@slow
|
||||
@require_torch_gpu
|
||||
def test_model_8b_chat_greedy_generation_bounding_box(self):
|
||||
EXPECTED_TEXT_COMPLETION = "\x00194213202244\x01|ENDOFTEXT|"
|
||||
text_prompt_bbox = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\nWilliams" # noqa: E231
|
||||
|
||||
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.png"
|
||||
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))
|
||||
|
||||
model_inputs_bbox = self.processor(text=text_prompt_bbox, images=bbox_image_pil)
|
||||
generated_tokens = self.model.generate(**model_inputs_bbox, max_new_tokens=10)
|
||||
text = self.processor.tokenizer.batch_decode(generated_tokens)
|
||||
end_sequence = text[0].split("\x04")[1]
|
||||
clean_sequence = (
|
||||
end_sequence[: end_sequence.find("|ENDOFTEXT|") + len("|ENDOFTEXT|")]
|
||||
if "|ENDOFTEXT|" in end_sequence
|
||||
else end_sequence
|
||||
)
|
||||
self.assertEqual(EXPECTED_TEXT_COMPLETION, clean_sequence)
|
||||
"""
|
|
@ -0,0 +1,126 @@
|
|||

import io
import unittest

import requests

from transformers import AutoTokenizer, is_torch_available, is_vision_available
from transformers.testing_utils import require_torch, require_torch_gpu, slow


if is_vision_available():
    from PIL import Image

if is_vision_available() and is_torch_available():
    from transformers import FuyuImageProcessor, FuyuProcessor

if is_torch_available():
    import torch

    from transformers.models.fuyu.processing_fuyu import construct_full_unpacked_stream, full_unpacked_stream_to_tensor


@require_torch
@require_torch_gpu
@slow
class FuyuProcessingTest(unittest.TestCase):  # TODO Which mixins do we add here?
    """ """

    def setUp(self):
        pretrained_model_name = "huggingface/pre_release_model"
        tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
        image_processor = FuyuImageProcessor()

        processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)
        text_prompt = "Generate a coco-style caption.\\n"
        bus_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
        bus_image_pil = Image.open(io.BytesIO(requests.get(bus_image_url).content))

        self.one_image_bus_model_inputs = processor(text=text_prompt, images=bus_image_pil)

    def test_fuyu_processing(self):
        """
        Test to ensure that the standard processing on a gold example matches adept's code.
        """
        # fmt: off
EXPECTED_IMAGE_PATCH_INPUTS = torch.Tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, -1, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, -1, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, -1, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, -1, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, -1, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, -1, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, -1, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, -1, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, -1, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, -1, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, -1, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, -1, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, -1, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,]]).to(torch.int64)
EXPECTED_PADDED_UNPACKED_TOKEN_INPUTS = torch.Tensor([[71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71011, 71019, 1, 128340, 71374, 71389, 120412, 71377, 71835, 71374, 73615, 71375, 71399, 71435, 71122,]]).to(torch.int64)
        # fmt: on
        torch.testing.assert_close(
            self.one_image_bus_model_inputs["image_patches_indices"], EXPECTED_IMAGE_PATCH_INPUTS
        )
        torch.testing.assert_close(self.one_image_bus_model_inputs["input_ids"], EXPECTED_PADDED_UNPACKED_TOKEN_INPUTS)
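
A note on the expected tensors above: EXPECTED_IMAGE_PATCH_INPUTS appears to map every sequence position to the index of the image patch embedded there, with -1 at non-patch positions; the repeating runs of 22 indices followed by -1 mirror the runs of 22 occurrences of 71011 terminated by 71019 in EXPECTED_PADDED_UNPACKED_TOKEN_INPUTS, and the trailing -1 block lines up with the text prompt tokens. A small sanity check, derived only from the expected tensors themselves:

# Assumes -1 marks non-patch positions; the counts below are read off the expected tensor.
num_patch_positions = (EXPECTED_IMAGE_PATCH_INPUTS != -1).sum().item()
assert num_patch_positions == 308  # indices 0..307, i.e. 14 runs of 22 patch indices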


@require_torch
class TestImageTextProcessingUtils(unittest.TestCase):
    def setUp(self):
        self.batch_size = 2
        self.new_seq_len = 8
        self.num_sub_sequences = 1

        self.all_bi_tokens_to_place = [4, 6]
        self.full_unpacked_stream = [torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6, 7, 8, 9, 10])]
        self.fill_value = 0

        self.num_real_text_tokens = [[3, 2], [2, 4]]
        # Here the input stream is padded to avoid inconsistencies (current model release matches)
        self.input_stream = torch.tensor([[[1, 2, 3], [4, 5, 0]], [[6, 7, 0], [8, 9, 10]]])
        self.image_tokens = [
            [torch.tensor([1, 2]), torch.tensor([3])],
            [torch.tensor([4, 5, 6]), torch.tensor([7, 8])],
        ]

    def test_full_unpacked_stream_to_tensor(self):
        result = full_unpacked_stream_to_tensor(
            self.all_bi_tokens_to_place,
            self.full_unpacked_stream,
            self.fill_value,
            self.batch_size,
            self.new_seq_len,
            offset=0,
        )
        EXPECTED_TENSOR = torch.tensor([[1, 2, 3, 4, 0, 0, 0, 0], [5, 6, 7, 8, 9, 10, 0, 0]])
        self.assertTrue(torch.equal(result, EXPECTED_TENSOR))
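
The packing behavior this assertion expects can be restated as a short sketch (an illustrative equivalent of what the test checks, not the library implementation): each variable-length stream is written into one row of a [batch_size, new_seq_len] tensor, placing all_bi_tokens_to_place[i] tokens from offset onwards and padding the rest with fill_value.

import torch

def pack_streams(streams, tokens_to_place, fill_value, batch_size, new_seq_len, offset=0):
    # Sketch: one padded row per batch entry.
    out = torch.full((batch_size, new_seq_len), fill_value, dtype=streams[0].dtype)
    for i, stream in enumerate(streams):
        n = tokens_to_place[i]
        out[i, offset : offset + n] = stream[:n]
    return out

# pack_streams([torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6, 7, 8, 9, 10])], [4, 6], 0, 2, 8)
# reproduces EXPECTED_TENSOR above.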

    def test_construct_full_unpacked_stream(self):
        result = construct_full_unpacked_stream(
            self.num_real_text_tokens, self.input_stream, self.image_tokens, self.batch_size, self.num_sub_sequences
        )
        EXPECTED_UNPACKED_STREAM = [torch.tensor([1, 2, 1, 2, 3]), torch.tensor([4, 5, 6, 6, 7])]
        for i in range(len(result)):
            self.assertTrue(torch.equal(result[i], EXPECTED_UNPACKED_STREAM[i]))
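
Reading the expected values in this test (an interpretation of the test data above, not of the library internals): for each batch entry, the first sub-sequence's image tokens come first, followed by the first num_real_text_tokens entries of that sub-sequence's text stream.

# batch 0: image tokens [1, 2] + first 3 real text tokens [1, 2, 3] -> [1, 2, 1, 2, 3]
expected_batch_0 = torch.cat([torch.tensor([1, 2]), torch.tensor([1, 2, 3])])
# batch 1: image tokens [4, 5, 6] + first 2 real text tokens [6, 7] -> [4, 5, 6, 6, 7]
expected_batch_1 = torch.cat([torch.tensor([4, 5, 6]), torch.tensor([6, 7])])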


@require_torch
class TestProcessImagesForModelInput(unittest.TestCase):
    def setUp(self):
        """
        Adding a mix of present and absent images.
        """
        self.image_processor = FuyuImageProcessor()

        self.image_input = torch.randn([1, 1, 3, 64, 64])
        self.image_present = torch.tensor([[1]])
        self.image_unpadded_h = torch.tensor([[45]])  # Adjusted for subsequence of 1
        self.image_unpadded_w = torch.tensor([[50]])  # Adjusted for subsequence of 1
        self.image_patch_dim_h = 16
        self.image_patch_dim_w = 16
        self.image_placeholder_id = 999
        self.image_newline_id = 888
        self.variable_sized = True

    def test_process_images_for_model_input_fixed_sized(self):
        self.variable_sized = False
        result = self.image_processor.process_images_for_model_input(
            image_input=self.image_input,
            image_present=self.image_present,
            image_unpadded_h=self.image_unpadded_h,
            image_unpadded_w=self.image_unpadded_w,
            image_patch_dim_h=self.image_patch_dim_h,
            image_patch_dim_w=self.image_patch_dim_w,
            image_placeholder_id=self.image_placeholder_id,
            image_newline_id=self.image_newline_id,
            variable_sized=self.variable_sized,
        )
        print(result["images"][0][0])
        self.assertEqual(result["images"][0][0].shape, torch.Size([3, 64, 64]))
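
A quick aside on the numbers used in this test: the dummy image is 64x64 with 16x16 patches, so the fixed-size path keeps the padded image as-is (hence the [3, 64, 64] shape check) and would tile it into (64 // 16) * (64 // 16) = 16 patches. The variable names below exist only to show that arithmetic; they are not APIs from the diff.

image_height, image_width = 64, 64
patch_height, patch_width = 16, 16
num_patches = (image_height // patch_height) * (image_width // patch_width)
assert num_patches == 16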

@ -36,6 +36,7 @@ SPECIAL_CASES_TO_ALLOW = {
    "EncodecConfig": ["overlap"],
    # used as `self.bert_model = BertModel(config, ...)`
    "DPRConfig": True,
    "FuyuConfig": True,
    # not used in modeling files, but it's an important information
    "FSMTConfig": ["langs"],
    # used internally in the configuration class file

@ -79,6 +79,7 @@ PRIVATE_MODELS = [
# Being in this list is an exception and should **not** be the rule.
IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
    # models to ignore for not tested
    "FuyuForCausalLM",  # Not tested for now
    "InstructBlipQFormerModel",  # Building part of bigger (tested) model.
    "UMT5EncoderModel",  # Building part of bigger (tested) model.
    "Blip2QFormerModel",  # Building part of bigger (tested) model.

@ -566,6 +566,7 @@ src/transformers/models/funnel/configuration_funnel.py
src/transformers/models/funnel/convert_funnel_original_tf_checkpoint_to_pytorch.py
src/transformers/models/funnel/modeling_funnel.py
src/transformers/models/funnel/modeling_tf_funnel.py
src/transformers/models/fuyu/convert_fuyu_model_weights_to_hf.py
src/transformers/models/git/configuration_git.py
src/transformers/models/git/convert_git_to_pytorch.py
src/transformers/models/glpn/configuration_glpn.py