[CLAP] Add CLAP to the library (#21370)

* add model like clip

* update

* text model ok

* clap text works

* some refactor

- `CLAPVision` to `CLAPAudio`
- refactor kwargs of audio modules

* more refactor

* more refactor

* more refactor

* correct fusion

* more refactor

* new modules

* add basic processor

* fixup

* remove whisper copied from

* audio logits match

* add doc

* correct mel filters and add max length

* style

* few fixes

* forward passes

* fixup

* fixup

* some clean up

* remove mels from the dictionary

* pad after the repeat

* update padding when smaller

* fix padding

* style

* use swin patch merging

* use copied from swin

* processor with any tokenizer

* more copied from

* some clean up

* more refactor

* fix mel when rand_trunc

* style

* remove unused imports

* update processing

* remove image processing tests

* add testing file

* fix modeling issues

* replace with `is_longer`

* clap in serialization

* more refactor

* `make fixup`

* make fixup

* fix feature extractor

* update test feature extractor

* `make fixup`

* clean up config

* more clean up

* more cleanup

* update tests

* refactor tests and inits

* remove CLAP vision config

* remove CLAP from image processing auto and dummy vision objects

* update inits

* style

* reorder classes in modeling clap

* Use roberta tokenizer as the other weights are not open sourced

* small cleanup

* remove tokenization CLAP

* processor tokenizer is roberta

* update feature extraction doc

* remove vclap from model zero shot

* update f_min and f_max to frequency_xx

* some changes

- fix modeling keys
- add `is_longer` in the forward pass
- make fixup

* make fixup

* consistent behavior between rand_crop and fusion

* add numpy resize and bilinear and documentation

* move resizing to image utils

* clean feature extraction

* import resize from correct file

* resize in image transforms

* update

* style

* style

* nit

* remove unused arguments from the feature extractor

* style

* few fixes + make fixup

* oops

* fix more tests

* add zero shot audio classification pipeline

* update zeroshot classification pipeline

* fixup

* fix copies

* all CI tests pass

* make fixup + fix docs

* fix docs

* fix docs

* update pipeline tests

* update zero shot pipeline

* update feature extraction clap

* update tokenization auto

* use nested simplify

* update pipeline tests

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* split in two lines

* fixes

* refactor

* clean up

* add integration tests

* update config docstring

* style

* update processor

* fix processor test

* fix feat extractor tests

* update docs

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix readmes

* fix tips

* Update src/transformers/models/auto/configuration_auto.py

* update doc and remove todo -> properly explained

* fix idx and typo

* typo

* cleanup config

* cleanup tests, styles and doc

* ignore docstyle on image transform

* add conversion script

* remove the `clap` index in favor of `CLAP`

* update __init

* nits

* Update src/transformers/pipelines/__init__.py

* fix bug

* clarify config

* fix copy

* fix init

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix model output

* fix comment

* make fixup

* make fixup

* rename to `Clap`

* replace to `Clap`

* replace to `Clap`

* repo consistency

* again repo-consistency

* make fixup

* Apply suggestions from code review

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* add config

* changes

* update conversion

* Apply suggestions from code review

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* remove unused function

* update based on code reviews

* style

* more comments

* cleanup

* clean up

* style

* apply suggestions

* Empty commit

* pipeline will be added in a different PR

* update calls to audio utils functions

* update pipeline init

* style

* style

* styling again

* use pad

* fix repo-consistency

* update utils and add doc for audio utils

* clean up resize by using torch. update inits accordingly

* style

* Clap's tokenizer is RoBERTa

* add audio utils to internal toctreee

* update toctree

* style

* update documentation and normalize naming across audio utils and feature extraction clap

* style

* clean up

* update doc and typos

* fix doctest

* update modeling code, got rid of a lot of reshaping

* style on added doc audio utils

* update modeling clap

* style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* docstring variables with CLAP

* rename key

* update modeling CLAP

* update audio utils docstring

* update processing clap

* fix readmes

* fix toctree

* update configuration clap

* fix init

* make fixup

* fix

* fix

* update naming

* update

* update checkpoint path

* Apply suggestions from code review

* Major refactoring

* Update src/transformers/models/clap/configuration_clap.py

* merge

---------

Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Arthur 2023-02-16 20:59:27 +01:00 committed by GitHub
parent 6b0257de42
commit c236a62172
33 changed files with 5104 additions and 15 deletions


@@ -295,6 +295,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.


@@ -288,6 +288,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.


@@ -260,6 +260,7 @@ conda install -c huggingface transformers
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (इनरिया/फेसबुक/सोरबोन से) साथ में कागज [CamemBERT: एक टेस्टी फ्रेंच लैंग्वेज मॉडल](https:// arxiv.org/abs/1911.03894) लुई मार्टिन*, बेंजामिन मुलर*, पेड्रो जेवियर ऑर्टिज़ सुआरेज़*, योआन ड्यूपॉन्ट, लॉरेंट रोमरी, एरिक विलेमोन्टे डे ला क्लर्जरी, जैमे सेडाह और बेनोइट सगोट द्वारा।
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google रिसर्च से) साथ में दिया गया पेपर [कैनाइन: प्री-ट्रेनिंग ए एफिशिएंट टोकनाइजेशन-फ्री एनकोडर फॉर लैंग्वेज रिप्रेजेंटेशन]( https://arxiv.org/abs/2103.06874) जोनाथन एच क्लार्क, डैन गैरेट, यूलिया टर्क, जॉन विएटिंग द्वारा।
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (LAION-AI से) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. द्वाराअनुसंधान पत्र [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) के साथ जारी किया गया
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI से) साथ वाला पेपर [लर्निंग ट्रांसफरेबल विजुअल मॉडल फ्रॉम नेचुरल लैंग्वेज सुपरविजन](https://arxiv.org /abs/2103.00020) एलेक रैडफोर्ड, जोंग वूक किम, क्रिस हैलासी, आदित्य रमेश, गेब्रियल गोह, संध्या अग्रवाल, गिरीश शास्त्री, अमांडा एस्केल, पामेला मिश्किन, जैक क्लार्क, ग्रेचेन क्रुएगर, इल्या सुत्स्केवर द्वारा।
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (सेल्सफोर्स से) साथ में पेपर [प्रोग्राम सिंथेसिस के लिए एक संवादात्मक प्रतिमान](https://arxiv.org/abs/2203.13474) एरिक निजकैंप, बो पैंग, हिरोआकी हयाशी, लिफू तू, हुआन वांग, यिंगबो झोउ, सिल्वियो सावरेस, कैमिंग जिओंग रिलीज।


@@ -322,6 +322,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (Inria/Facebook/Sorbonne から) Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot から公開された研究論文: [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894)
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google Research から) Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting から公開された研究論文: [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874)
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (OFA-Sys から) An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou から公開された研究論文: [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335)
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (LAION-AI から) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. から公開された研究論文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687)
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI から) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever から公開された研究論文: [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (University of Göttingen から) Timo Lüddecke and Alexander Ecker から公開された研究論文: [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003)
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (Salesforce から) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong から公開された研究論文: [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474)


@@ -237,6 +237,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (Inria/Facebook/Sorbonne 에서) Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot 의 [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) 논문과 함께 발표했습니다.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google Research 에서) Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting 의 [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) 논문과 함께 발표했습니다.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (OFA-Sys 에서) An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou 의 [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) 논문과 함께 발표했습니다.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (LAION-AI 에서 제공)은 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.의 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687)논문과 함께 발표했습니다.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI 에서) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever 의 [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) 논문과 함께 발표했습니다.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (University of Göttingen 에서) Timo Lüddecke and Alexander Ecker 의 [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) 논문과 함께 발표했습니다.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (Salesforce 에서) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong 의 [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) 논문과 함께 발표했습니다.


@@ -261,6 +261,7 @@ conda install -c huggingface transformers
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (来自 Inria/Facebook/Sorbonne) 伴随论文 [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) 由 Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot 发布。
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (来自 Google Research) 伴随论文 [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) 由 Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting 发布。
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (来自 OFA-Sys) 伴随论文 [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) 由 An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou 发布。
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (来自 LAION-AI) 伴随论文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) 由 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov 发布。
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (来自 OpenAI) 伴随论文 [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) 由 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever 发布。
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (来自 University of Göttingen) 伴随论文 [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) 由 Timo Lüddecke and Alexander Ecker 发布。
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (来自 Salesforce) 伴随论文 [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) 由 Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong 发布。


@@ -273,6 +273,7 @@ conda install -c huggingface transformers
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.


@@ -495,6 +495,8 @@
sections:
- local: model_doc/audio-spectrogram-transformer
title: Audio Spectrogram Transformer
- local: model_doc/clap
title: CLAP
- local: model_doc/hubert
title: Hubert
- local: model_doc/mctct
@@ -622,6 +624,8 @@
title: Utilities for Generation
- local: internal/image_processing_utils
title: Utilities for Image Processors
- local: internal/audio_utils
title: Utilities for Audio processing
- local: internal/file_utils
title: General Utilities
title: Internal Helpers


@@ -74,6 +74,7 @@ The documentation is organized into five sections:
1. **[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -263,6 +264,7 @@ Flax), PyTorch, and/or TensorFlow.
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
| CANINE | ✅ | ❌ | ✅ | ❌ | ❌ |
| Chinese-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
| CLAP | ❌ | ❌ | ✅ | ❌ | ❌ |
| CLIP | ✅ | ✅ | ✅ | ✅ | ✅ |
| CLIPSeg | ❌ | ❌ | ✅ | ❌ | ❌ |
| CodeGen | ✅ | ✅ | ✅ | ❌ | ❌ |


@@ -0,0 +1,34 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Utilities for `FeatureExtractors`
This page lists all the utility functions that can be used by the audio [`FeatureExtractor`] in order to compute special features from raw audio using common algorithms such as the *Short Time Fourier Transform* or the *log mel spectrogram*.
Most of those are only useful if you are studying the code of the audio processors in the library.
## Audio Transformations
[[autodoc]] audio_utils.hertz_to_mel
[[autodoc]] audio_utils.mel_to_hertz
[[autodoc]] audio_utils.get_mel_filter_banks
[[autodoc]] audio_utils.stft
[[autodoc]] audio_utils.power_to_db
[[autodoc]] audio_utils.fram_wave
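
Below is a rough end-to-end sketch of how these helpers combine into a log-mel spectrogram. The specific values (a 16 kHz dummy waveform, 400-sample analysis windows, a 160-sample hop and 64 mel filters) are illustrative assumptions, not requirements of any particular model:

```python
import numpy as np

from transformers.audio_utils import fram_wave, get_mel_filter_banks, power_to_db, stft

# dummy 1-second waveform sampled at 16 kHz, standing in for real audio
waveform = np.random.rand(16000)

fft_window_size = 400  # (1 + fft_window_size) // 2 = 201 frequency bins
frames = fram_wave(waveform, hop_length=160, fft_window_size=fft_window_size)
spectrogram = stft(frames, np.hanning(fft_window_size), fft_window_size=fft_window_size)
power_spectrogram = np.abs(spectrogram) ** 2

# project the 201 linear-frequency bins onto 64 mel bands, then compress to the decibel scale
mel_filters = get_mel_filter_banks(
    nb_frequency_bins=201,
    nb_mel_filters=64,
    frequency_min=0.0,
    frequency_max=8000.0,
    sample_rate=16000,
)
log_mel_spectrogram = power_to_db(power_spectrogram @ mel_filters)
```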


@@ -0,0 +1,77 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# CLAP
## Overview
The CLAP model was proposed in [Large-scale Contrastive Language-Audio Pretraining with
Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/pdf/2211.06687.pdf) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed to predict the most relevant text snippet for a given audio clip, without directly optimizing for the task. The CLAP model uses a Swin Transformer to get audio features from a log-mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected into a latent space of identical dimension. The dot product between the projected audio and text features is used as a similarity score.
The abstract from the paper is the following:
*Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zeroshot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-6*
This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker).
The original code can be found [here](https://github.com/LAION-AI/Clap).
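
A minimal zero-shot similarity sketch (the checkpoint name, the dummy waveform and the candidate captions below are illustrative assumptions, not values fixed by this PR):

```python
import numpy as np

from transformers import ClapModel, ClapProcessor

# "laion/clap-htsat-unfused" is an assumed checkpoint name; any CLAP checkpoint works the same way
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

audio = np.random.rand(480_000)  # stand-in for a 10 s waveform at the feature extractor's default 48 kHz
texts = ["the sound of a dog barking", "the sound of a vacuum cleaner"]

inputs = processor(text=texts, audios=audio, return_tensors="pt", padding=True)
outputs = model(**inputs)

# one row per audio clip, one column per caption; softmax gives zero-shot label probabilities
probs = outputs.logits_per_audio.softmax(dim=-1)
```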
## ClapConfig
[[autodoc]] ClapConfig
- from_text_audio_configs
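
As a quick sketch of how the composite config is assembled from its sub-configs (default hyperparameter values are assumed here):

```python
from transformers import ClapAudioConfig, ClapConfig, ClapTextConfig

# build a ClapConfig from separately instantiated text and audio sub-configs (defaults assumed)
text_config = ClapTextConfig()
audio_config = ClapAudioConfig()
config = ClapConfig.from_text_audio_configs(text_config, audio_config)
```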
## ClapTextConfig
[[autodoc]] ClapTextConfig
## ClapAudioConfig
[[autodoc]] ClapAudioConfig
## ClapFeatureExtractor
[[autodoc]] ClapFeatureExtractor
## ClapProcessor
[[autodoc]] ClapProcessor
## ClapModel
[[autodoc]] ClapModel
- forward
- get_text_features
- get_audio_features
## ClapTextModel
[[autodoc]] ClapTextModel
- forward
## ClapTextModelWithProjection
[[autodoc]] ClapTextModelWithProjection
- forward
## ClapAudioModel
[[autodoc]] ClapAudioModel
- forward
## ClapAudioModelWithProjection
[[autodoc]] ClapAudioModelWithProjection
- forward


@@ -47,6 +47,7 @@ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
# Base objects, independent of any specific backend
_import_structure = {
"audio_utils": [],
"benchmark": [],
"commands": [],
"configuration_utils": ["PretrainedConfig"],
@@ -206,6 +207,13 @@ _import_structure = {
"ChineseCLIPTextConfig",
"ChineseCLIPVisionConfig",
],
"models.clap": [
"CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
"ClapAudioConfig",
"ClapConfig",
"ClapProcessor",
"ClapTextConfig",
],
"models.clip": [ "models.clip": [
"CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"CLIPConfig", "CLIPConfig",
@ -1231,6 +1239,18 @@ else:
"ChineseCLIPVisionModel", "ChineseCLIPVisionModel",
] ]
) )
_import_structure["models.clap"].extend(
[
"CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
"ClapAudioModel",
"ClapAudioModelWithProjection",
"ClapFeatureExtractor",
"ClapModel",
"ClapPreTrainedModel",
"ClapTextModel",
"ClapTextModelWithProjection",
]
)
_import_structure["models.clip"].extend( _import_structure["models.clip"].extend(
[ [
"CLIP_PRETRAINED_MODEL_ARCHIVE_LIST", "CLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
@ -3730,6 +3750,13 @@ if TYPE_CHECKING:
ChineseCLIPTextConfig, ChineseCLIPTextConfig,
ChineseCLIPVisionConfig, ChineseCLIPVisionConfig,
) )
from .models.clap import (
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
ClapAudioConfig,
ClapConfig,
ClapProcessor,
ClapTextConfig,
)
from .models.clip import (
CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
CLIPConfig,
@@ -4623,6 +4650,16 @@ if TYPE_CHECKING:
ChineseCLIPTextModel,
ChineseCLIPVisionModel,
)
from .models.clap import (
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
ClapAudioModel,
ClapAudioModelWithProjection,
ClapFeatureExtractor,
ClapModel,
ClapPreTrainedModel,
ClapTextModel,
ClapTextModelWithProjection,
)
from .models.clip import (
CLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
CLIPModel,


@@ -0,0 +1,359 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Audio processing functions to extract features from raw audio. These should all be in numpy to support all frameworks
and remove unnecessary dependencies.
"""
import math
import warnings
from typing import Optional
import numpy as np
from numpy.fft import fft
def hertz_to_mel(freq: float, mel_scale: str = "htk") -> float:
"""Convert Hertz to Mels.
Args:
freq (`float`):
Frequency in Hertz
mel_scale (`str`, *optional*, defaults to `"htk"`):
Scale to use, `htk` or `slaney`.
Returns:
mels (`float`):
Frequency in Mels
"""
if mel_scale not in ["slaney", "htk"]:
raise ValueError('mel_scale should be one of "htk" or "slaney".')
if mel_scale == "htk":
return 2595.0 * math.log10(1.0 + (freq / 700.0))
# Fill in the linear part
frequency_min = 0.0
f_sp = 200.0 / 3
mels = (freq - frequency_min) / f_sp
# Fill in the log-scale part
min_log_hertz = 1000.0
min_log_mel = (min_log_hertz - frequency_min) / f_sp
logstep = math.log(6.4) / 27.0
if freq >= min_log_hertz:
mels = min_log_mel + math.log(freq / min_log_hertz) / logstep
return mels
def mel_to_hertz(mels: np.array, mel_scale: str = "htk") -> np.array:
"""Convert mel bin numbers to frequencies.
Args:
mels (`np.array`):
Mel frequencies
mel_scale (`str`, *optional*, defaults to `"htk"`):
Scale to use: `htk` or `slaney`.
Returns:
freqs (`np.array`):
Mels converted to Hertz
"""
if mel_scale not in ["slaney", "htk"]:
raise ValueError('mel_scale should be one of "htk" or "slaney".')
if mel_scale == "htk":
return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)
# Fill in the linear scale
frequency_min = 0.0
f_sp = 200.0 / 3
freqs = frequency_min + f_sp * mels
# And now the nonlinear scale
min_log_hertz = 1000.0
min_log_mel = (min_log_hertz - frequency_min) / f_sp
logstep = math.log(6.4) / 27.0
log_t = mels >= min_log_mel
freqs[log_t] = min_log_hertz * np.exp(logstep * (mels[log_t] - min_log_mel))
return freqs
def _create_triangular_filterbank(
all_freqs: np.array,
f_pts: np.array,
) -> np.array:
"""Create a triangular filter bank.
Args:
all_freqs (`np.array` of shape (`nb_frequency_bins`, )):
Discrete frequencies used when the STFT was computed.
f_pts (`np.array`, of shape (`nb_mel_filters`, )):
Coordinates of the middle points of the triangular filters to create.
Returns:
fb (np.array):
The filter bank of size (`nb_frequency_bins`, `nb_mel_filters`).
"""
# Adapted from Librosa
# calculate the difference between each filter mid point and each stft freq point in hertz
f_diff = f_pts[1:] - f_pts[:-1] # (n_filter + 1)
slopes = np.expand_dims(f_pts, 0) - np.expand_dims(all_freqs, 1) # (nb_frequency_bins, n_filter + 2)
# create overlapping triangles
zero = np.zeros(1)
down_slopes = (-1.0 * slopes[:, :-2]) / f_diff[:-1] # (nb_frequency_bins, n_filter)
up_slopes = slopes[:, 2:] / f_diff[1:] # (nb_frequency_bins, n_filter)
fb = np.maximum(zero, np.minimum(down_slopes, up_slopes))
return fb
def get_mel_filter_banks(
nb_frequency_bins: int,
nb_mel_filters: int,
frequency_min: float,
frequency_max: float,
sample_rate: int,
norm: Optional[str] = None,
mel_scale: str = "htk",
) -> np.array:
"""
Create a frequency bin conversion matrix used to obtain the Mel Spectrogram. This is called a *mel filter bank*,
and various implementations exist, which differ in the number of filters, the shape of the filters, the way the
filters are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. The goal of these
features is to approximate the non-linear human perception of the variation in pitch with respect to the frequency.
This code is heavily inspired from the *torchaudio* implementation, see
[here](https://pytorch.org/audio/stable/transforms.html) for more details.
Tips:
- Different banks of Mel filters were introduced in the literature. The following variations are supported:
- MFCC FB-20: introduced in 1980 by Davis and Mermelstein, it assumes a sampling frequency of 10 kHertz
and a speech bandwidth of `[0, 4600]` Hertz
- MFCC FB-24 HTK: from the Cambridge HMM Toolkit (HTK) (1995) uses a filter bank of 24 filters for a
speech bandwidth `[0, 8000]` Hertz (sampling rate 16 kHertz).
- MFCC FB-40: from the Auditory Toolbox for MATLAB written by Slaney in 1998, assumes a sampling rate
of 16 kHertz, and speech bandwidth [133, 6854] Hertz. This version also includes an area normalization.
- HFCC-E FB-29 (Human Factor Cepstral Coefficients) of Skowronski and Harris (2004), assumes sampling
rate of 12.5 kHertz and speech bandwidth [0, 6250] Hertz
- The default parameters of `torchaudio`'s mel filterbanks implement the `"htk"` filters while `torchlibrosa`
uses the `"slaney"` implementation.
Args:
nb_frequency_bins (`int`):
Number of frequencies used to compute the spectrogram (should be the same as in `stft`).
nb_mel_filters (`int`):
Number of Mel filters to generate.
frequency_min (`float`):
Minimum frequency of interest (Hertz).
frequency_max (`float`):
Maximum frequency of interest (Hertz).
sample_rate (`int`):
Sample rate of the audio waveform.
norm (`str`, *optional*):
If "slaney", divide the triangular Mel weights by the width of the mel band (area normalization).
mel_scale (`str`, *optional*, defaults to `"htk"`):
Scale to use: `"htk"` or `"slaney"`.
Returns:
`np.ndarray`: Triangular filter banks (fb matrix) of shape (`nb_frequency_bins`, `nb_mel_filters`). This matrix
is a projection matrix to go from a spectrogram to a Mel Spectrogram.
"""
if norm is not None and norm != "slaney":
raise ValueError('norm must be one of None or "slaney"')
# frequency bins
all_freqs = np.linspace(0, sample_rate // 2, nb_frequency_bins)
# Compute min and max frequencies in mel scale
m_min = hertz_to_mel(frequency_min, mel_scale=mel_scale)
m_max = hertz_to_mel(frequency_max, mel_scale=mel_scale)
# create the centers of the triangular mel filters.
m_pts = np.linspace(m_min, m_max, nb_mel_filters + 2)
f_pts = mel_to_hertz(m_pts, mel_scale=mel_scale)
# create the filterbank
filterbank = _create_triangular_filterbank(all_freqs, f_pts)
if norm is not None and norm == "slaney":
# Slaney-style mel is scaled to be approx constant energy per channel
enorm = 2.0 / (f_pts[2 : nb_mel_filters + 2] - f_pts[:nb_mel_filters])
filterbank *= np.expand_dims(enorm, 0)
if (filterbank.max(axis=0) == 0.0).any():
warnings.warn(
"At least one mel filterbank has all zero values. "
f"The value for `nb_mel_filters` ({nb_mel_filters}) may be set too high. "
f"Or, the value for `nb_frequency_bins` ({nb_frequency_bins}) may be set too low."
)
return filterbank
def power_to_db(mel_spectrogram, top_db=None, a_min=1e-10, ref=1.0):
"""
Convert a mel spectrogram from power to dB scale. This function is the numpy implementation of `librosa.power_to_db`.
It computes `10 * log10(mel_spectrogram / ref)`, using basic log properties for stability.
Tips:
- The motivation behind applying the log function on the mel spectrogram is that humans do not hear loudness on a
linear scale. Generally, to double the perceived volume of a sound we need to put 8 times as much energy into it.
- This means that large variations in energy may not sound all that different if the sound is loud to begin
with. This compression operation makes the mel features match more closely what humans actually hear.
Args:
mel_spectrogram (`np.array`):
Input mel spectrogram.
top_db (`int`, *optional*):
The maximum decibel value.
a_min (`float`, *optional*, defaults to 1e-10):
Minimum value to use when clipping the mel spectrogram.
ref (`float`, *optional*, defaults to 1.0):
Maximum reference value used to scale the mel_spectrogram.
"""
log_spec = 10 * np.log10(np.clip(mel_spectrogram, a_min=a_min, a_max=None))
log_spec -= 10.0 * np.log10(np.maximum(a_min, ref))
if top_db is not None:
if top_db < 0:
raise ValueError("top_db must be non-negative")
log_spec = np.clip(log_spec, a_min=log_spec.max() - top_db, a_max=None)
return log_spec
# TODO @ArthurZucker: This method does not support batching yet as we mainly focus on inference.
def fram_wave(waveform: np.array, hop_length: int = 160, fft_window_size: int = 400, center: bool = True):
"""
In order to compute the short time fourier transform, the waveform needs to be split in overlapping windowed
segments called `frames`.
The window length (`fft_window_size`) defines how much of the signal is contained in each frame, while the hop length
defines the step between the beginning of each new frame.
Args:
waveform (`np.array` of shape `(sample_length,)`):
The raw waveform which will be split into smaller chunks.
hop_length (`int`, *optional*, defaults to 160):
Step between each window of the waveform.
fft_window_size (`int`, *optional*, defaults to 400):
Defines the size of the window.
center (`bool`, defaults to `True`):
Whether or not to center each frame on its corresponding timestamp. Centering is done by reflecting the
waveform on the left and on the right.
Return:
framed_waveform (`np.array` of shape `(waveform.shape[0] // hop_length + 1, fft_window_size)`):
The framed waveforms that can be fed to `np.fft`.
"""
frames = []
for i in range(0, waveform.shape[0] + 1, hop_length):
if center:
half_window = (fft_window_size - 1) // 2 + 1
start = i - half_window if i > half_window else 0
end = i + half_window if i < waveform.shape[0] - half_window else waveform.shape[0]
frame = waveform[start:end]
if start == 0:
padd_width = (-i + half_window, 0)
frame = np.pad(frame, pad_width=padd_width, mode="reflect")
elif end == waveform.shape[0]:
padd_width = (0, (i - waveform.shape[0] + half_window))
frame = np.pad(frame, pad_width=padd_width, mode="reflect")
else:
frame = waveform[i : i + fft_window_size]
frame_width = frame.shape[0]
if frame_width < fft_window_size:
frame = np.lib.pad(
frame, pad_width=(0, fft_window_size - frame_width), mode="constant", constant_values=0
)
frames.append(frame)
frames = np.stack(frames, 0)
return frames
# TODO @ArthurZucker: This method does not support batching yet as we are mainly focused on inference.
def stft(frames: np.array, windowing_function: np.array, fft_window_size: int = None):
"""
Calculates the complex Short-Time Fourier Transform (STFT) of the given framed signal. Should give the same results
as `torch.stft`.
Args:
frames (`np.array` of dimension `(num_frames, fft_window_size)`):
A framed audio signal obtained using `audio_utils.fram_wave`.
windowing_function (`np.array` of dimension `(fft_window_size,)`):
An array representing the window function that will be used to reduce the amplitude of the discontinuities at the
boundaries of each frame when computing the STFT. Each frame will be multiplied by the windowing_function.
For more information on the discontinuities, called *spectral leakage*, refer to [this
tutorial](https://download.ni.com/evaluation/pxi/Understanding%20FFTs%20and%20Windowing.pdf).
fft_window_size (`int`, *optional*):
Size of the window on which the Fourier transform is applied. This controls the frequency resolution of the
spectrogram. 400 means that the Fourier transform is computed on windows of 400 samples. The number of
frequency bins (`nb_frequency_bins`) used to divide the window into equal strips is equal to
`(1+fft_window_size)//2`. An increase of the fft_window_size slows the calculation time proportionally.
Example:
```python
>>> from transformers.audio_utils import stft, fram_wave
>>> import numpy as np
>>> audio = np.random.rand(50)
>>> fft_window_size = 10
>>> hop_length = 2
>>> framed_audio = fram_wave(audio, hop_length, fft_window_size)
>>> spectrogram = stft(framed_audio, np.hanning(fft_window_size + 1)[:-1])
```
Returns:
spectrogram (`np.ndarray`):
A spectrogram of shape `(nb_frequency_bins, num_frames)` obtained using the STFT algorithm
"""
frame_size = frames.shape[1]
if fft_window_size is None:
fft_window_size = frame_size
if fft_window_size < frame_size:
raise ValueError("FFT size must greater or equal the frame size")
# number of FFT bins to store
nb_frequency_bins = (fft_window_size >> 1) + 1
spectrogram = np.empty((len(frames), nb_frequency_bins), dtype=np.complex64)
fft_signal = np.zeros(fft_window_size)
for f, frame in enumerate(frames):
if windowing_function is not None:
np.multiply(frame, windowing_function, out=fft_signal[:frame_size])
else:
fft_signal[:frame_size] = frame
spectrogram[f] = fft(fft_signal, axis=0)[:nb_frequency_bins]
return spectrogram.T
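Putting the four helpers together, here is a rough sketch of the log-mel pipeline they are built for; it mirrors what the CLAP feature extractor below does, and the parameter values are only illustrative.

```python
import numpy as np
from transformers.audio_utils import fram_wave, get_mel_filter_banks, power_to_db, stft

waveform = np.random.rand(48_000)  # one second of fake 48 kHz audio
fft_window_size, hop_length, n_mels = 1024, 480, 64

frames = fram_wave(waveform, hop_length=hop_length, fft_window_size=fft_window_size)
window = np.hanning(fft_window_size + 1)[:-1]
spectrogram = stft(frames, window, fft_window_size=fft_window_size)
magnitudes = np.abs(spectrogram) ** 2  # (nb_frequency_bins, num_frames)

mel_filters = get_mel_filter_banks(
    nb_frequency_bins=(fft_window_size >> 1) + 1,
    nb_mel_filters=n_mels,
    frequency_min=0,
    frequency_max=14_000,
    sample_rate=48_000,
    norm=None,
    mel_scale="htk",
)
log_mel = power_to_db(np.matmul(mel_filters.T, magnitudes)).T  # (num_frames, n_mels)
print(log_mel.shape)
```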

View File

@ -235,11 +235,13 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
Pad inputs (on left/right and up to predefined length or max length in the batch)
Args:
processed_features (`Union[Dict[str, np.ndarray], BatchFeature]`):
Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch
of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see below)
padding_strategy (`PaddingStrategy`, *optional*, defaults to `PaddingStrategy.DO_NOT_PAD`):
PaddingStrategy to use for padding.
- PaddingStrategy.LONGEST Pad to the longest sequence in the batch
- PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
@ -248,11 +250,12 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
- 'left': pads on the left of the sequences
- 'right': pads on the right of the sequences
pad_to_multiple_of (`int`, *optional*):
Integer if set will pad the sequence to a multiple of the provided value. This is especially useful to
enable the use of Tensor Core on NVIDIA hardware with compute capability `>= 7.5` (Volta), or on TPUs
which benefit from having sequence lengths be a multiple of 128.
return_attention_mask (`bool`, *optional*):
Set to False to avoid returning attention mask (default: set to model specifics)
"""
required_input = processed_features[self.model_input_names[0]]
@ -303,15 +306,17 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
Truncate inputs to predefined length or max length in the batch
Args:
processed_features (`Union[Dict[str, np.ndarray], BatchFeature]`):
Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch
of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see below)
pad_to_multiple_of (`int`, *optional*):
Integer if set will pad the sequence to a multiple of the provided value. This is especially useful to
enable the use of Tensor Core on NVIDIA hardware with compute capability `>= 7.5` (Volta), or on TPUs
which benefit from having sequence lengths be a multiple of 128.
truncation (`bool`, *optional*):
Activates truncation to cut input sequences longer than `max_length` to `max_length`.
"""
if not truncation:
return processed_features

View File

@ -40,6 +40,7 @@ from . import (
camembert,
canine,
chinese_clip,
clap,
clip,
clipseg,
codegen,

View File

@ -49,6 +49,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("camembert", "CamembertConfig"), ("camembert", "CamembertConfig"),
("canine", "CanineConfig"), ("canine", "CanineConfig"),
("chinese_clip", "ChineseCLIPConfig"), ("chinese_clip", "ChineseCLIPConfig"),
("clap", "ClapConfig"),
("clip", "CLIPConfig"), ("clip", "CLIPConfig"),
("clipseg", "CLIPSegConfig"), ("clipseg", "CLIPSegConfig"),
("codegen", "CodeGenConfig"), ("codegen", "CodeGenConfig"),
@ -222,6 +223,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("camembert", "CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("camembert", "CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("canine", "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("canine", "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("chinese_clip", "CHINESE_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("chinese_clip", "CHINESE_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("clap", "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST"),
("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("clipseg", "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("clipseg", "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("codegen", "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("codegen", "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@ -385,6 +387,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("camembert", "CamemBERT"), ("camembert", "CamemBERT"),
("canine", "CANINE"), ("canine", "CANINE"),
("chinese_clip", "Chinese-CLIP"), ("chinese_clip", "Chinese-CLIP"),
("clap", "CLAP"),
("clip", "CLIP"), ("clip", "CLIP"),
("clipseg", "CLIPSeg"), ("clipseg", "CLIPSeg"),
("codegen", "CodeGen"), ("codegen", "CodeGen"),

View File

@ -40,6 +40,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
("audio-spectrogram-transformer", "ASTFeatureExtractor"), ("audio-spectrogram-transformer", "ASTFeatureExtractor"),
("beit", "BeitFeatureExtractor"), ("beit", "BeitFeatureExtractor"),
("chinese_clip", "ChineseCLIPFeatureExtractor"), ("chinese_clip", "ChineseCLIPFeatureExtractor"),
("clap", "ClapFeatureExtractor"),
("clip", "CLIPFeatureExtractor"), ("clip", "CLIPFeatureExtractor"),
("clipseg", "ViTFeatureExtractor"), ("clipseg", "ViTFeatureExtractor"),
("conditional_detr", "ConditionalDetrFeatureExtractor"), ("conditional_detr", "ConditionalDetrFeatureExtractor"),

View File

@ -47,6 +47,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("camembert", "CamembertModel"), ("camembert", "CamembertModel"),
("canine", "CanineModel"), ("canine", "CanineModel"),
("chinese_clip", "ChineseCLIPModel"), ("chinese_clip", "ChineseCLIPModel"),
("clap", "ClapModel"),
("clip", "CLIPModel"), ("clip", "CLIPModel"),
("clipseg", "CLIPSegModel"), ("clipseg", "CLIPSegModel"),
("codegen", "CodeGenModel"), ("codegen", "CodeGenModel"),

View File

@ -46,6 +46,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("blip-2", "Blip2Processor"), ("blip-2", "Blip2Processor"),
("bridgetower", "BridgeTowerProcessor"), ("bridgetower", "BridgeTowerProcessor"),
("chinese_clip", "ChineseCLIPProcessor"), ("chinese_clip", "ChineseCLIPProcessor"),
("clap", "ClapProcessor"),
("clip", "CLIPProcessor"), ("clip", "CLIPProcessor"),
("clipseg", "CLIPSegProcessor"), ("clipseg", "CLIPSegProcessor"),
("flava", "FlavaProcessor"), ("flava", "FlavaProcessor"),

View File

@ -91,6 +91,13 @@ else:
),
("canine", ("CanineTokenizer", None)),
("chinese_clip", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
(
"clap",
(
"RobertaTokenizer",
"RobertaTokenizerFast" if is_tokenizers_available() else None,
),
),
(
"clip",
(

View File

@ -0,0 +1,76 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
_import_structure = {
"configuration_clap": [
"CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
"ClapAudioConfig",
"ClapConfig",
"ClapTextConfig",
],
"processing_clap": ["ClapProcessor"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_clap"] = [
"CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
"ClapModel",
"ClapPreTrainedModel",
"ClapTextModel",
"ClapTextModelWithProjection",
"ClapAudioModel",
"ClapAudioModelWithProjection",
]
_import_structure["feature_extraction_clap"] = ["ClapFeatureExtractor"]
if TYPE_CHECKING:
from .configuration_clap import (
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
ClapAudioConfig,
ClapConfig,
ClapTextConfig,
)
from .processing_clap import ClapProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .feature_extraction_clap import ClapFeatureExtractor
from .modeling_clap import (
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
ClapAudioModel,
ClapAudioModelWithProjection,
ClapModel,
ClapPreTrainedModel,
ClapTextModel,
ClapTextModelWithProjection,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
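A minimal sketch of what this lazy-init structure exposes once PyTorch is installed: the classes become importable from the top-level `transformers` namespace.

```python
from transformers import ClapConfig, ClapModel, ClapProcessor  # exposed by the lazy module above

model = ClapModel(ClapConfig())  # randomly initialized model from the default configuration
```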

View File

@ -0,0 +1,450 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" CLAP model configuration"""
import copy
import os
from typing import Union
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST = {
"laion/clap-htsat-fused": "https://huggingface.co/laion/clap-htsat-fused/resolve/main/config.json",
"laion/clap-htsat-unfused": "https://huggingface.co/laion/clap-htsat-unfused/resolve/main/config.json",
}
class ClapTextConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`ClapTextModel`]. It is used to instantiate a CLAP
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the CLAP
[laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 50265):
Vocabulary size of the CLAP model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`ClapTextModel`].
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (`int`, *optional*, defaults to 3072):
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
max_position_embeddings (`int`, *optional*, defaults to 514):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (`int`, *optional*, defaults to 1):
The vocabulary size of the `token_type_ids` passed when calling [`ClapTextModel`].
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
The epsilon used by the layer normalization layers.
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
[Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
is_decoder (`bool`, *optional*, defaults to `False`):
Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
classifier_dropout (`float`, *optional*):
The dropout ratio for the classification head.
projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
projection_dim (`int`, *optional*, defaults to 512):
Dimension of the projection head of the `ClapTextModelWithProjection`.
Examples:
```python
>>> from transformers import ClapTextConfig, ClapTextModel
>>> # Initializing a CLAP text configuration
>>> configuration = ClapTextConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = ClapTextModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "clap_text_model"
def __init__(
self,
vocab_size=50265,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=514,
type_vocab_size=1,
initializer_range=0.02,
initializer_factor=1.0,
layer_norm_eps=1e-12,
projection_dim=512,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
position_embedding_type="absolute",
use_cache=True,
classifier_dropout=None,
projection_hidden_act="relu",
**kwargs,
):
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.hidden_act = hidden_act
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.initializer_range = initializer_range
self.initializer_factor = initializer_factor
self.layer_norm_eps = layer_norm_eps
self.position_embedding_type = position_embedding_type
self.use_cache = use_cache
self.classifier_dropout = classifier_dropout
self.projection_hidden_act = projection_hidden_act
self.projection_dim = projection_dim
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the text config dict if we are loading from ClapConfig
if config_dict.get("model_type") == "clap":
config_dict = config_dict["text_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
class ClapAudioConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`ClapAudioModel`]. It is used to instantiate a
CLAP audio encoder according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the audio encoder of the CLAP
[laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
window_size (`int`, *optional*, defaults to 8):
Window size of the Swin attention layers applied to the spectrogram patches
num_mel_bins (`int`, *optional*, defaults to 64):
Number of mel features used per frame. Should correspond to the value used in the `ClapProcessor` class.
spec_size (`int`, *optional*, defaults to 256):
Desired input size of the spectrogram that the model supports. It can be different from the output of the
`ClapFeatureExtractor`, in which case the input features will be resized. Corresponds to the `image_size`
of the audio models.
hidden_act (`str`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
patch_size (`int`, *optional*, defaults to 4):
Patch size for the audio spectrogram
patch_stride (`list`, *optional*, defaults to `[4, 4]`):
Patch stride for the audio spectrogram
num_classes (`int`, *optional*, defaults to 527):
Number of classes used for the head training
hidden_size (`int`, *optional*, defaults to 768):
Hidden size of the output of the audio encoder. Corresponds to the dimension of the penultimate layer's
output, which is sent to the projection MLP layer.
projection_dim (`int`, *optional*, defaults to 512):
Hidden size of the projection layer.
depths (`list`, *optional*, defaults to `[2, 2, 6, 2]`):
Depths used for the Swin Layers of the audio model
num_attention_heads (`list`, *optional*, defaults to `[4, 8, 16, 32]`):
Number of attention heads used for the Swin Layers of the audio model
enable_fusion (`bool`, *optional*, defaults to `False`):
Whether or not to enable patch fusion. This is the main contribution of the authors, and should give the
best results.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout probability for all fully connected layers in the encoder.
fusion_type (`str`, *optional*):
Fusion type used for the patch fusion.
patch_embed_input_channels (`int`, *optional*, defaults to 1):
Number of channels used for the input spectrogram
flatten_patch_embeds (`bool`, *optional*, defaults to `True`):
Whether or not to flatten the patch embeddings
patch_embeds_hidden_size (`int`, *optional*, defaults to 96):
Hidden size of the patch embeddings. It is used as the number of output channels.
enable_patch_layer_norm (`bool`, *optional*, defaults to `True`):
Whether or not to enable layer normalization for the patch embeddings
drop_path_rate (`float`, *optional*, defaults to 0.0):
Drop path (stochastic depth) rate for the encoder layers
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
qkv_bias (`bool`, *optional*, defaults to `True`):
Whether or not to add a bias to the query, key, value projections.
mlp_ratio (`float`, *optional*, defaults to 4.0):
Ratio of the mlp hidden dim to embedding dim.
aff_block_r (`int`, *optional*, defaults to 4):
Downsize ratio used in the attentional feature fusion (AFF) block
num_hidden_layers (`int`, *optional*, defaults to 4):
Number of hidden layers in the Transformer encoder.
projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
layer_norm_eps (`float`, *optional*, defaults to `1e-5`):
The epsilon used by the layer normalization layers.
initializer_factor (`float`, *optional*, defaults to 1.0):
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
testing).
Example:
```python
>>> from transformers import ClapAudioConfig, ClapAudioModel
>>> # Initializing a ClapAudioConfig with laion/clap-htsat-fused style configuration
>>> configuration = ClapAudioConfig()
>>> # Initializing a ClapAudioModel (with random weights) from the laion/clap-htsat-fused style configuration
>>> model = ClapAudioModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "clap_audio_model"
def __init__(
self,
window_size=8,
num_mel_bins=64,
spec_size=256,
hidden_act="gelu",
patch_size=4,
patch_stride=[4, 4],
num_classes=527,
hidden_size=768,
projection_dim=512,
depths=[2, 2, 6, 2],
num_attention_heads=[4, 8, 16, 32],
enable_fusion=False,
hidden_dropout_prob=0.1,
fusion_type=None,
patch_embed_input_channels=1,
flatten_patch_embeds=True,
patch_embeds_hidden_size=96,
enable_patch_layer_norm=True,
drop_path_rate=0.0,
attention_probs_dropout_prob=0.0,
qkv_bias=True,
mlp_ratio=4.0,
aff_block_r=4,
num_hidden_layers=4,
projection_hidden_act="relu",
layer_norm_eps=1e-5,
initializer_factor=1.0,
**kwargs,
):
super().__init__(**kwargs)
self.window_size = window_size
self.num_mel_bins = num_mel_bins
self.spec_size = spec_size
self.patch_size = patch_size
self.patch_stride = patch_stride
self.num_classes = num_classes
self.hidden_size = hidden_size
self.depths = depths
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.window_size = window_size
self.enable_fusion = enable_fusion
self.fusion_type = fusion_type
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.projection_dim = projection_dim
self.flatten_patch_embeds = flatten_patch_embeds
self.patch_embeds_hidden_size = patch_embeds_hidden_size
self.enable_patch_layer_norm = enable_patch_layer_norm
self.drop_path_rate = drop_path_rate
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.qkv_bias = qkv_bias
self.mlp_ratio = mlp_ratio
self.patch_embed_input_channels = patch_embed_input_channels
self.aff_block_r = aff_block_r
self.layer_norm_eps = layer_norm_eps
self.initializer_factor = initializer_factor
self.projection_hidden_act = projection_hidden_act
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the audio config dict if we are loading from ClapConfig
if config_dict.get("model_type") == "clap":
config_dict = config_dict["audio_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
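A short, hedged illustration of what the override above enables: the audio sub-configuration can be loaded straight from a full CLAP checkpoint, since the nested `audio_config` is extracted automatically (assuming the `laion/clap-htsat-fused` checkpoint is available on the Hub).

```python
from transformers import ClapAudioConfig

audio_config = ClapAudioConfig.from_pretrained("laion/clap-htsat-fused")
print(audio_config.enable_fusion)  # expected to be True for the fused checkpoint
```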
class ClapConfig(PretrainedConfig):
r"""
[`ClapConfig`] is the configuration class to store the configuration of a [`ClapModel`]. It is used to instantiate
a CLAP model according to the specified arguments, defining the text model and audio model configs. Instantiating a
configuration with the defaults will yield a similar configuration to that of the CLAP
[laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
text_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`ClapTextConfig`].
audio_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`ClapAudioConfig`].
projection_dim (`int`, *optional*, defaults to 512):
Dimensionality of the text and audio projection layers.
logit_scale_init_value (`float`, *optional*, defaults to `1 / 0.07`):
The initial value of the *logit_scale* parameter, used as per the original CLAP implementation.
projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
Activation function for the projection layers.
initializer_factor (`float`, *optional*, defaults to 1.0):
Factor to scale the initialization of the model weights.
kwargs (*optional*):
Dictionary of keyword arguments.
Example:
```python
>>> from transformers import ClapConfig, ClapModel
>>> # Initializing a ClapConfig with laion-ai/base style configuration
>>> configuration = ClapConfig()
>>> # Initializing a ClapModel (with random weights) from the laion-ai/base style configuration
>>> model = ClapModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
>>> # We can also initialize a ClapConfig from a ClapTextConfig and a ClapAudioConfig
>>> from transformers import ClapTextConfig, ClapAudioConfig
>>> # Initializing a ClapText and ClapAudioConfig configuration
>>> config_text = ClapTextConfig()
>>> config_audio = ClapAudioConfig()
>>> config = ClapConfig.from_text_audio_configs(config_text, config_audio)
```"""
model_type = "clap"
is_composition = True
def __init__(
self,
text_config=None,
audio_config=None,
logit_scale_init_value=(1 / 0.07),
projection_dim=512,
projection_hidden_act="relu",
initializer_factor=1.0,
**kwargs,
):
super().__init__(**kwargs)
if text_config is None:
text_config = {}
logger.info("text_config is None. Initializing the ClapTextConfig with default values.")
if audio_config is None:
audio_config = {}
logger.info("audio_config is None. initializing the ClapAudioConfig with default values.")
self.text_config = ClapTextConfig(**text_config)
self.audio_config = ClapAudioConfig(**audio_config)
self.text_config.projection_dim = projection_dim
self.audio_config.projection_dim = projection_dim
self.text_config.projection_hidden_act = projection_hidden_act
self.audio_config.projection_hidden_act = projection_hidden_act
self.projection_dim = projection_dim
self.projection_hidden_act = projection_hidden_act
self.hidden_size = self.text_config.hidden_size
self.logit_scale_init_value = logit_scale_init_value
self.initializer_factor = initializer_factor
self.num_hidden_layers = self.text_config.num_hidden_layers + len(self.audio_config.depths)
@classmethod
def from_text_audio_configs(cls, text_config: ClapTextConfig, audio_config: ClapAudioConfig, **kwargs):
r"""
Instantiate a [`ClapConfig`] (or a derived class) from clap text model configuration and clap audio model
configuration.
Returns:
[`ClapConfig`]: An instance of a configuration object
"""
return cls(text_config=text_config.to_dict(), audio_config=audio_config.to_dict(), **kwargs)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
"""
output = copy.deepcopy(self.__dict__)
output["text_config"] = self.text_config.to_dict()
output["audio_config"] = self.audio_config.to_dict()
output["model_type"] = self.__class__.model_type
return output
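As an illustrative aside (not part of the file), the nested configurations can also be overridden with plain dicts, and `to_dict` round-trips them:

```python
from transformers import ClapConfig

config = ClapConfig(audio_config={"enable_fusion": True})  # only override the fusion flag of the audio tower
print(config.audio_config.enable_fusion)                   # True
print(config.to_dict()["audio_config"]["enable_fusion"])   # True: nested configs are serialized back to dicts
```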

View File

@ -0,0 +1,123 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import re
import torch
from CLAP import create_model
from transformers import AutoFeatureExtractor, ClapConfig, ClapModel
KEYS_TO_MODIFY_MAPPING = {
"text_branch": "text_model",
"audio_branch": "audio_model.audio_encoder",
"attn": "attention.self",
"self.proj": "output.dense",
"attention.self_mask": "attn_mask",
"mlp.fc1": "intermediate.dense",
"mlp.fc2": "output.dense",
"norm1": "layernorm_before",
"norm2": "layernorm_after",
"bn0": "batch_norm",
}
processor = AutoFeatureExtractor.from_pretrained("laion/clap-htsat-unfused", truncation="rand_trunc")
def init_clap(checkpoint_path, enable_fusion=False):
model, model_cfg = create_model(
"HTSAT-tiny",
"roberta",
checkpoint_path,
precision="fp32",
device="cuda:0" if torch.cuda.is_available() else "cpu",
enable_fusion=enable_fusion,
fusion_type="aff_2d" if enable_fusion else None,
)
return model, model_cfg
def rename_state_dict(state_dict):
model_state_dict = {}
sequential_layers_pattern = r".*sequential.(\d+).*"
text_projection_pattern = r".*_projection.(\d+).*"
for key, value in state_dict.items():
# check if any key needs to be modified
for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
if key_to_modify in key:
key = key.replace(key_to_modify, new_key)
if re.match(sequential_layers_pattern, key):
# replace sequential layers with list
sequential_layer = re.match(sequential_layers_pattern, key).group(1)
key = key.replace(f"sequential.{sequential_layer}.", f"layers.{int(sequential_layer)//3}.linear.")
elif re.match(text_projection_pattern, key):
projecton_layer = int(re.match(text_projection_pattern, key).group(1))
# Because in CLAP they use `nn.Sequential`...
transformers_projection_layer = 1 if projecton_layer == 0 else 2
key = key.replace(f"_projection.{projecton_layer}.", f"_projection.linear{transformers_projection_layer}.")
if "audio" and "qkv" in key:
# split qkv into query key and value
mixed_qkv = value
qkv_dim = mixed_qkv.size(0) // 3
query_layer = mixed_qkv[:qkv_dim]
key_layer = mixed_qkv[qkv_dim : qkv_dim * 2]
value_layer = mixed_qkv[qkv_dim * 2 :]
model_state_dict[key.replace("qkv", "query")] = query_layer
model_state_dict[key.replace("qkv", "key")] = key_layer
model_state_dict[key.replace("qkv", "value")] = value_layer
else:
model_state_dict[key] = value
return model_state_dict
def convert_clap_checkpoint(checkpoint_path, pytorch_dump_folder_path, config_path, enable_fusion=False):
clap_model, clap_model_cfg = init_clap(checkpoint_path, enable_fusion=enable_fusion)
clap_model.eval()
state_dict = clap_model.state_dict()
state_dict = rename_state_dict(state_dict)
transformers_config = ClapConfig()
transformers_config.audio_config.enable_fusion = enable_fusion
model = ClapModel(transformers_config)
# ignore the spectrogram embedding layer
model.load_state_dict(state_dict, strict=False)
model.save_pretrained(pytorch_dump_folder_path)
transformers_config.save_pretrained(pytorch_dump_folder_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint")
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
parser.add_argument("--enable_fusion", action="store_true", help="Whether to enable fusion or not")
args = parser.parse_args()
convert_clap_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.enable_fusion)
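To make the key remapping concrete, here is a small illustrative run of `rename_state_dict` on two made-up checkpoint entries (assuming it is executed in the same module as the script above; the layer names are invented for the example):

```python
import torch

toy_state_dict = {
    "audio_branch.bn0.weight": torch.zeros(64),
    # a fused qkv projection with hidden size 4: its 3 * 4 rows get split below
    "audio_branch.layers.0.blocks.0.attn.qkv.weight": torch.randn(12, 4),
}
renamed = rename_state_dict(toy_state_dict)
# "audio_branch" -> "audio_model.audio_encoder", "bn0" -> "batch_norm", "attn" -> "attention.self",
# and the fused qkv tensor is split into separate query / key / value tensors.
print(sorted(renamed.keys()))
```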

View File

@ -0,0 +1,356 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Feature extractor class for CLAP."""
import copy
from typing import Any, Dict, List, Optional, Union
import numpy as np
import torch
from ...audio_utils import fram_wave, get_mel_filter_banks, power_to_db, stft
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import TensorType, logging
logger = logging.get_logger(__name__)
class ClapFeatureExtractor(SequenceFeatureExtractor):
r"""
Constructs a CLAP feature extractor.
This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
most of the main methods. Users should refer to this superclass for more information regarding those methods.
This class extracts mel-filter bank features from raw speech using a custom numpy implementation of the *Short Time
Fourier Transform* (STFT) which should match pytorch's `torch.stft` equivalent.
Args:
feature_size (`int`, defaults to 64):
The feature dimension of the extracted Mel spectrograms. This corresponds to the number of mel filters
(`n_mels`).
sampling_rate (`int`, defaults to 48_000):
The sampling rate at which the audio files should be digitalized expressed in hertz (Hz). This only serves
to warn users if the audio fed to the feature extractor does not have the same sampling rate.
hop_length (`int`, defaults to 480):
Length of the overlapping windows for the STFT used to obtain the Mel Spectrogram. The audio will be split
into smaller `frames` with a step of `hop_length` between each frame.
max_length_s (`int`, defaults to 10):
The maximum input length of the model in seconds. This is used to pad the audio.
fft_window_size (`int`, defaults to 1024):
Size of the window (in samples) on which the Fourier transform is applied. This controls the frequency
resolution of the spectrogram. For example, 400 means that the Fourier transform is computed on windows of 400 samples.
padding_value (`float`, *optional*, defaults to 0.0):
Padding value used to pad the audio. Should correspond to silences.
return_attention_mask (`bool`, *optional*, defaults to `False`):
Whether or not the model should return the attention masks corresponding to the input.
frequency_min (`float`, *optional*, defaults to 0):
The lowest frequency of interest. The STFT will not be computed for values below this.
frequency_max (`float`, *optional*, defaults to 14_000):
The highest frequency of interest. The STFT will not be computed for values above this.
top_db (`float`, *optional*):
The highest decibel value used to convert the mel spectrogram to the log scale. For more details see the
`audio_utils.power_to_db` function
truncation (`str`, *optional*, default to `"fusions"`):
Truncation pattern for long audio inputs. Two patterns are available:
- `fusion` will use `_random_mel_fusion`, which stacks 3 random crops from the mel spectrogram and a
downsampled version of the entire mel spectrogram.
If `config.fusion` is set to True, shorter audios also need to return 4 mels, which will just be a copy
of the original mel obtained from the padded audio.
- `rand_trunc` will select a random crop of the mel spectrogram.
padding (`str`, *optional*, defaults to `"repeatpad"`):
Padding pattern for shorter audio inputs. Three patterns were originally implemented:
- `repeatpad`: the audio is repeated, and then padded to fit the `max_length`.
- `repeat`: the audio is repeated and then cut to fit the `max_length`
- `pad`: the audio is padded.
"""
model_input_names = ["input_features", "is_longer"]
def __init__(
self,
feature_size=64,
sampling_rate=48_000,
hop_length=480,
max_length_s=10,
fft_window_size=1024,
padding_value=0.0,
return_attention_mask=False, # pad inputs to max length with silence token (zero) and no attention mask
frequency_min: float = 0,
frequency_max: float = 14_000,
top_db: int = None,
truncation: str = "fusion",
padding: str = "repeatpad",
**kwargs,
):
super().__init__(
feature_size=feature_size,
sampling_rate=sampling_rate,
padding_value=padding_value,
return_attention_mask=return_attention_mask,
**kwargs,
)
self.top_db = top_db
self.truncation = truncation
self.padding = padding
self.fft_window_size = fft_window_size
self.nb_frequency_bins = (fft_window_size >> 1) + 1
self.hop_length = hop_length
self.max_length_s = max_length_s
self.nb_max_samples = max_length_s * sampling_rate
self.sampling_rate = sampling_rate
self.frequency_min = frequency_min
self.frequency_max = frequency_max
self.mel_filters = get_mel_filter_banks(
nb_frequency_bins=self.nb_frequency_bins,
nb_mel_filters=feature_size,
frequency_min=frequency_min,
frequency_max=frequency_max,
sample_rate=sampling_rate,
norm=None,
mel_scale="htk",
)
self.mel_filters_slaney = get_mel_filter_banks(
nb_frequency_bins=self.nb_frequency_bins,
nb_mel_filters=feature_size,
frequency_min=frequency_min,
frequency_max=frequency_max,
sample_rate=sampling_rate,
norm="slaney",
mel_scale="slaney",
)
def to_dict(self) -> Dict[str, Any]:
"""
Serializes this instance to a Python dictionary.
Returns:
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance, except for the
mel filter banks, which do not need to be saved or printed as they are too long.
"""
output = copy.deepcopy(self.__dict__)
output["feature_extractor_type"] = self.__class__.__name__
if "mel_filters" in output:
del output["mel_filters"]
if "mel_filters_slaney" in output:
del output["mel_filters_slaney"]
return output
def _np_extract_fbank_features(self, waveform: np.array, mel_filters: Optional[np.array] = None) -> np.ndarray:
"""
Compute the log-Mel spectrogram of the provided `waveform` using the `hanning` window. In CLAP, two different
filter banks are used depending on the truncation pattern:
- `self.mel_filters`: they correspond to the default parameters of `torchaudio` which can be obtained from
calling `torchaudio.transforms.MelSpectrogram().mel_scale.fb`. These filters are used when `truncation`
is set to `"fusion"`.
- `self.mel_filters_slaney`: they correspond to the default parameters of `torchlibrosa`, which uses
`librosa.filters.mel` when computing the mel spectrogram. These filters were only used in the original
implementation when the truncation mode is not `"fusion"`.
"""
window = np.hanning(self.fft_window_size + 1)[:-1]
frames = fram_wave(waveform, self.hop_length, self.fft_window_size)
spectrogram = stft(frames, window, fft_window_size=self.fft_window_size)
magnitudes = np.abs(spectrogram) ** 2
mel_spectrogram = np.matmul(mel_filters.T, magnitudes)
log_mel_spectrogram = power_to_db(mel_spectrogram).T
log_mel_spectrogram = np.asarray(log_mel_spectrogram, np.float32)
return log_mel_spectrogram
def _random_mel_fusion(self, mel, total_frames, chunk_frames):
ranges = np.array_split(list(range(0, total_frames - chunk_frames + 1)), 3)
if len(ranges[1]) == 0:
# if the audio is too short, we just use the first chunk
ranges[1] = [0]
if len(ranges[2]) == 0:
# if the audio is too short, we just use the first chunk
ranges[2] = [0]
# randomly choose index for each part
idx_front = np.random.choice(ranges[0])
idx_middle = np.random.choice(ranges[1])
idx_back = np.random.choice(ranges[2])
mel_chunk_front = mel[idx_front : idx_front + chunk_frames, :]
mel_chunk_middle = mel[idx_middle : idx_middle + chunk_frames, :]
mel_chunk_back = mel[idx_back : idx_back + chunk_frames, :]
mel = torch.tensor(mel[None, None, :])
mel_shrink = torch.nn.functional.interpolate(
mel, size=[chunk_frames, 64], mode="bilinear", align_corners=False, antialias=False
)
mel_shrink = mel_shrink[0][0].numpy()
mel_fusion = np.stack([mel_chunk_front, mel_chunk_middle, mel_chunk_back, mel_shrink], axis=0)
return mel_fusion
def _get_input_mel(self, waveform: np.array, max_length, truncation, padding) -> np.array:
"""
Extracts the mel spectrogram and prepares it for the model based on the `truncation` and `padding` arguments.
Four different paths are possible:
- `truncation="fusion"` and the length of the waveform is greater than the max length: the mel spectrogram
will be computed on the entire audio. 3 random crops and a downsampled version of the full mel spectrogram
are then stacked together. They will later be used for `feature_fusion`.
- `truncation="rand_trunc"` and the length of the waveform is smaller than the max length: the audio is
padded based on `padding`.
- `truncation="fusion"` and the length of the waveform is smaller than the max length: the audio is padded
based on `padding`, and is repeated `4` times.
- `truncation="rand_trunc"` and the length of the waveform is greater than the max length: the mel
spectrogram will be computed on a random crop of the waveform.
"""
if waveform.shape[0] > max_length:
if truncation == "rand_trunc":
longer = True
# random crop to max_length (for compatibility) -> this should be handled by self.pad
overflow = len(waveform) - max_length
idx = np.random.randint(0, overflow + 1)
waveform = waveform[idx : idx + max_length]
input_mel = self._np_extract_fbank_features(waveform, self.mel_filters_slaney)[None, :]
elif truncation == "fusion":
mel = self._np_extract_fbank_features(waveform, self.mel_filters)
chunk_frames = max_length // self.hop_length + 1 # the +1 related to how the spectrogram is computed
total_frames = mel.shape[0]
if chunk_frames == total_frames:
# there is a corner case where the audio length is larger than max_length but smaller than max_length+hop_length.
# In this case, we just use the whole audio.
input_mel = np.stack([mel, mel, mel, mel], axis=0)
longer = False
else:
input_mel = self._random_mel_fusion(mel, total_frames, chunk_frames)
longer = True
else:
raise NotImplementedError(f"data_truncating {truncation} not implemented")
else:
longer = False
# only use repeat as a new possible value for padding. you repeat the audio before applying the usual max_length padding
if waveform.shape[0] < max_length:
if padding == "repeat":
n_repeat = int(max_length / len(waveform))
waveform = np.stack(np.tile(waveform, n_repeat + 1))[:max_length]
if padding == "repeatpad":
n_repeat = int(max_length / len(waveform))
waveform = np.stack(np.tile(waveform, n_repeat))
waveform = np.pad(waveform, (0, max_length - waveform.shape[0]), mode="constant", constant_values=0)
if truncation == "fusion":
input_mel = self._np_extract_fbank_features(waveform, self.mel_filters)
input_mel = np.stack([input_mel, input_mel, input_mel, input_mel], axis=0)
else:
input_mel = self._np_extract_fbank_features(waveform, self.mel_filters_slaney)[None, :]
return input_mel, longer
def __call__(
self,
raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
truncation: str = None,
padding: Optional[str] = None,
max_length: Optional[int] = None,
sampling_rate: Optional[int] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs,
) -> BatchFeature:
"""
Main method to featurize and prepare for the model one or several sequence(s).
Args:
raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
values, a list of numpy arrays or a list of list of float values.
truncation (`str`, *optional*):
Truncation pattern for long audio inputs. Two patterns are available:
- `fusion` will use `_random_mel_fusion`, which stacks 3 random crops from the mel spectrogram and
a downsampled version of the entire mel spectrogram.
If `config.fusion` is set to True, shorter audios also need to return 4 mels, which will just be a
copy of the original mel obtained from the padded audio.
- `rand_trunc` will select a random crop of the mel spectrogram.
padding (`str`, *optional*):
Padding pattern for shorter audio inputs. Three patterns were originally implemented:
- `repeatpad`: the audio is repeated, and then padded to fit the `max_length`.
- `repeat`: the audio is repeated and then cut to fit the `max_length`
- `pad`: the audio is padded.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors instead of list of python integers. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return Numpy `np.ndarray` objects.
sampling_rate (`int`, *optional*):
The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
`sampling_rate` at the forward call to prevent silent errors and allow automatic speech recognition
pipeline.
"""
truncation = truncation if truncation is not None else self.truncation
padding = padding if padding else self.padding
if sampling_rate is not None:
if sampling_rate != self.sampling_rate:
raise ValueError(
f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a"
f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input"
f" was sampled with {self.sampling_rate} and not {sampling_rate}."
)
else:
logger.warning(
"It is strongly recommended to pass the `sampling_rate` argument to this function. "
"Failing to do so can result in silent errors that might be hard to debug."
)
is_batched = bool(
isinstance(raw_speech, (list, tuple))
and (isinstance(raw_speech[0], np.ndarray) or isinstance(raw_speech[0], (tuple, list)))
)
if is_batched:
raw_speech = [np.asarray(speech, dtype=np.float64) for speech in raw_speech]
elif not is_batched and not isinstance(raw_speech, np.ndarray):
raw_speech = np.asarray(raw_speech, dtype=np.float64)
elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
raw_speech = raw_speech.astype(np.float64)
# always return batch
if not is_batched:
raw_speech = [np.asarray(raw_speech)]
# convert to mel spectrogram, truncate and pad if needed.
padded_inputs = [
self._get_input_mel(waveform, max_length if max_length else self.nb_max_samples, truncation, padding)
for waveform in raw_speech
]
input_mel = []
is_longer = []
for mel, longer in padded_inputs:
input_mel.append(mel)
is_longer.append(longer)
if truncation == "fusion" and sum(is_longer) == 0:
# if no audio is longer than 10s, then randomly select one audio to be longer
rand_idx = np.random.randint(0, len(input_mel))
is_longer[rand_idx] = True
if isinstance(input_mel[0], List):
input_mel = [np.asarray(feature, dtype=np.float64) for feature in input_mel]
input_features = {"input_features": input_mel, "is_longer": is_longer}
input_features = BatchFeature(input_features)
if return_tensors is not None:
input_features = input_features.convert_to_tensors(return_tensors)
return input_features
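A hedged usage sketch (not part of the file): running the extractor on one second of fake audio with the default `fusion` truncation stacks four mel views per example.

```python
import numpy as np
from transformers import ClapFeatureExtractor

feature_extractor = ClapFeatureExtractor()  # defaults: 48 kHz, truncation="fusion", padding="repeatpad"
audio = np.random.rand(48_000)              # one second of fake audio
inputs = feature_extractor(audio, sampling_rate=48_000, return_tensors="np")
print(inputs["input_features"].shape)  # roughly (1, 4, 1001, 64) with the default settings
print(inputs["is_longer"])             # with truncation="fusion", at least one entry is forced to True
```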

File diff suppressed because it is too large

View File

@ -0,0 +1,116 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Audio/Text processor class for CLAP
"""
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding
class ClapProcessor(ProcessorMixin):
r"""
Constructs a CLAP processor which wraps a CLAP feature extractor and a RoBERTa tokenizer into a single processor.
[`ClapProcessor`] offers all the functionalities of [`ClapFeatureExtractor`] and [`RobertaTokenizerFast`]. See the
[`~ClapProcessor.__call__`] and [`~ClapProcessor.decode`] for more information.
Args:
feature_extractor ([`ClapFeatureExtractor`]):
The audio processor is a required input.
tokenizer ([`RobertaTokenizerFast`]):
The tokenizer is a required input.
"""
feature_extractor_class = "ClapFeatureExtractor"
tokenizer_class = ("RobertaTokenizer", "RobertaTokenizerFast")
def __init__(self, feature_extractor, tokenizer):
super().__init__(feature_extractor, tokenizer)
def __call__(self, text=None, audios=None, return_tensors=None, **kwargs):
"""
Main method to prepare for the model one or several sequences(s) and audio(s). This method forwards the `text`
and `kwargs` arguments to RobertaTokenizerFast's [`~RobertaTokenizerFast.__call__`] if `text` is not `None` to
encode the text. To prepare the audio(s), this method forwards the `audios` and `kwargs` arguments to
ClapFeatureExtractor's [`~ClapFeatureExtractor.__call__`] if `audios` is not `None`. Please refer to the
docstring of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
audios (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
The audio or batch of audios to be prepared. Each audio can be a NumPy array or a PyTorch tensor. In case
of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is the number of channels
and T the sample length of the audio.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **input_features** -- Audio features to be fed to a model. Returned when `audios` is not `None`.
"""
sampling_rate = kwargs.pop("sampling_rate", None)
if text is None and audios is None:
raise ValueError("You have to specify either text or audios. Both cannot be none.")
if text is not None:
encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)
if audios is not None:
audio_features = self.feature_extractor(
audios, sampling_rate=sampling_rate, return_tensors=return_tensors, **kwargs
)
if text is not None and audios is not None:
encoding["input_features"] = audio_features.input_features
return encoding
elif text is not None:
return encoding
else:
return BatchEncoding(data=dict(**audio_features), tensor_type=return_tensors)
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer
to the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
feature_extractor_input_names = self.feature_extractor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + feature_extractor_input_names))
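A brief usage sketch (assuming the `laion/clap-htsat-unfused` checkpoint referenced earlier in this PR is available on the Hub):

```python
import numpy as np
from transformers import ClapProcessor

processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
audio = np.random.rand(48_000)  # one second of fake 48 kHz audio
inputs = processor(text=["the sound of a cat"], audios=audio, sampling_rate=48_000, return_tensors="pt")
print(sorted(inputs.keys()))  # ['attention_mask', 'input_features', 'input_ids']
```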


@@ -1464,6 +1464,58 @@ class ChineseCLIPVisionModel(metaclass=DummyObject):
requires_backends(self, ["torch"])
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST = None
class ClapAudioModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapAudioModelWithProjection(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapFeatureExtractor(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapTextModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class ClapTextModelWithProjection(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
CLIP_PRETRAINED_MODEL_ARCHIVE_LIST = None
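For context (not part of the diff): each dummy class above raises an informative ImportError through `requires_backends` when torch is missing, so importing `transformers` without torch still works. A simplified sketch of the idea; `_backend_available` is a hypothetical stand-in for the real availability checks in `transformers.utils`:

def requires_backends(obj, backends):
    # simplified sketch: check each requested backend and fail loudly if one is missing
    missing = [name for name in backends if not _backend_available(name)]  # hypothetical helper
    if missing:
        raise ImportError(f"{type(obj).__name__} requires the following backend(s): {', '.join(missing)}")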



@@ -0,0 +1,267 @@
# coding=utf-8
# Copyright 2023 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import random
import unittest
import numpy as np
from transformers import ClapFeatureExtractor
from transformers.testing_utils import require_torch, require_torchaudio
from transformers.utils.import_utils import is_torch_available
from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin
if is_torch_available():
import torch
global_rng = random.Random()
# Copied from tests.models.whisper.test_feature_extraction_whisper.floats_list
def floats_list(shape, scale=1.0, rng=None, name=None):
"""Creates a random float32 tensor"""
if rng is None:
rng = global_rng
values = []
for batch_idx in range(shape[0]):
values.append([])
for _ in range(shape[1]):
values[-1].append(rng.random() * scale)
return values
@require_torch
@require_torchaudio
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester with Whisper->Clap
class ClapFeatureExtractionTester(unittest.TestCase):
def __init__(
self,
parent,
batch_size=7,
min_seq_length=400,
max_seq_length=2000,
feature_size=10,
hop_length=160,
chunk_length=8,
padding_value=0.0,
sampling_rate=4_000,
return_attention_mask=False,
do_normalize=True,
):
self.parent = parent
self.batch_size = batch_size
self.min_seq_length = min_seq_length
self.max_seq_length = max_seq_length
self.seq_length_diff = (self.max_seq_length - self.min_seq_length) // (self.batch_size - 1)
self.padding_value = padding_value
self.sampling_rate = sampling_rate
self.return_attention_mask = return_attention_mask
self.do_normalize = do_normalize
self.feature_size = feature_size
self.chunk_length = chunk_length
self.hop_length = hop_length
def prepare_feat_extract_dict(self):
return {
"feature_size": self.feature_size,
"hop_length": self.hop_length,
"chunk_length": self.chunk_length,
"padding_value": self.padding_value,
"sampling_rate": self.sampling_rate,
"return_attention_mask": self.return_attention_mask,
"do_normalize": self.do_normalize,
}
def prepare_inputs_for_common(self, equal_length=False, numpify=False):
def _flatten(list_of_lists):
return list(itertools.chain(*list_of_lists))
if equal_length:
speech_inputs = [floats_list((self.max_seq_length, self.feature_size)) for _ in range(self.batch_size)]
else:
# make sure that inputs increase in size
speech_inputs = [
floats_list((x, self.feature_size))
for x in range(self.min_seq_length, self.max_seq_length, self.seq_length_diff)
]
if numpify:
speech_inputs = [np.asarray(x) for x in speech_inputs]
return speech_inputs
@require_torch
@require_torchaudio
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest with Whisper->Clap
class ClapFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
feature_extraction_class = ClapFeatureExtractor
def setUp(self):
self.feat_extract_tester = ClapFeatureExtractionTester(self)
def test_call(self):
# Tests that all calls wrap to encode_plus and batch_encode_plus
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
# create three inputs of length 800, 1000, and 1200
speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
# Test feature size
input_features = feature_extractor(np_speech_inputs, padding="max_length", return_tensors="np").input_features
self.assertTrue(input_features.ndim == 4)
# Test not batched input
encoded_sequences_1 = feature_extractor(speech_inputs[0], return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").input_features
self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))
# Test batched
encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
def test_double_precision_pad(self):
import torch
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
np_speech_inputs = np.random.rand(100, 32).astype(np.float64)
py_speech_inputs = np_speech_inputs.tolist()
for inputs in [py_speech_inputs, np_speech_inputs]:
np_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="np")
self.assertTrue(np_processed.input_features.dtype == np.float32)
pt_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="pt")
self.assertTrue(pt_processed.input_features.dtype == torch.float32)
def _load_datasamples(self, num_samples):
from datasets import load_dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# automatic decoding with librispeech
speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
return [x["array"] for x in speech_samples]
def integration_test_fusion(self):
# fmt: off
EXPECTED_INPUT_FEATURES = torch.tensor(
[
[
-30.2194, -22.4424, -18.6442, -17.2452, -22.7392, -32.2576, -36.1404,
-35.6120, -29.6229, -29.0454, -32.2157, -36.7664, -29.4436, -26.7825,
-31.1811, -38.3918, -38.8749, -43.4485, -47.6236, -38.7528, -31.8574,
-39.0591, -41.3190, -32.3319, -31.4699, -33.4502, -36.7412, -34.5265,
-35.1091, -40.4518, -42.7346, -44.5909, -44.9747, -45.8328, -47.0772,
-46.2723, -44.3613, -48.6253, -44.9551, -43.8700, -44.6104, -48.0146,
-42.7614, -47.3587, -47.4369, -45.5018, -47.0198, -42.8759, -47.5056,
-47.1567, -49.2621, -49.5643, -48.4330, -48.8495, -47.2512, -40.8439,
-48.1234, -49.1218, -48.7222, -50.2399, -46.8487, -41.9921, -50.4015,
-50.7827
],
[
-89.0141, -89.1411, -88.8096, -88.5480, -88.3481, -88.2038,
-88.1105, -88.0647, -88.0636, -88.1051, -88.1877, -88.1110,
-87.8613, -88.6679, -88.2685, -88.9684, -88.7977, -89.6264,
-89.9299, -90.3184, -91.1446, -91.9265, -92.7267, -93.6099,
-94.6395, -95.3243, -95.5923, -95.5773, -95.0889, -94.3354,
-93.5746, -92.9287, -92.4525, -91.9798, -91.8852, -91.7500,
-91.7259, -91.7561, -91.7959, -91.7070, -91.6914, -91.5019,
-91.0640, -90.0807, -88.7102, -87.0826, -85.5956, -84.4441,
-83.8461, -83.8605, -84.6702, -86.3900, -89.3073, -93.2926,
-96.3813, -97.3529, -100.0000, -99.6942, -92.2851, -87.9588,
-85.7214, -84.6807, -84.1940, -84.2021
],
[
-51.6882, -50.6852, -50.8198, -51.7428, -53.0325, -54.1619, -56.4903,
-59.0314, -60.7996, -60.5164, -59.9680, -60.5393, -62.5796, -65.4166,
-65.6149, -65.1409, -65.7226, -67.9057, -72.5089, -82.3530, -86.3189,
-83.4241, -79.1279, -79.3384, -82.7335, -79.8316, -80.2167, -74.3638,
-71.3930, -75.3849, -74.5381, -71.4504, -70.3791, -71.4547, -71.8820,
-67.3885, -69.5686, -71.9852, -71.0307, -73.0053, -80.8802, -72.9227,
-63.8526, -60.3260, -59.6012, -57.8316, -61.0603, -67.3403, -67.1709,
-60.4967, -60.5079, -68.3345, -67.5213, -70.6416, -79.6219, -78.2198,
-74.6851, -69.5718, -69.4968, -70.6882, -66.8175, -73.8558, -74.3855,
-72.9405
]
]
)
# fmt: on
MEL_BIN = [963, 963, 161]
input_speech = self._load_datasamples(1)
feature_extractor = ClapFeatureExtractor()
for padding, EXPECTED_VALUES, idx_in_mel in zip(
["repeat", "repeatpad", None], EXPECTED_INPUT_FEATURES, MEL_BIN
):
input_features = feature_extractor(input_speech, return_tensors="pt", padding=padding).input_features
self.assertTrue(torch.allclose(input_features[0, idx_in_mel], EXPECTED_VALUES, atol=1e-4))
def integration_test_rand_trunc(self):
# TODO: in this case we should set the seed and use a longer audio sample to properly see the random truncation
# fmt: off
EXPECTED_INPUT_FEATURES = torch.tensor(
[
[
-42.3330, -36.2735, -35.9231, -43.5947, -48.4525, -46.5227, -42.6477,
-47.2740, -51.4336, -50.0846, -51.8711, -50.4232, -47.4736, -54.2275,
-53.3947, -55.4904, -54.8750, -54.5510, -55.4156, -57.4395, -51.7385,
-55.9118, -57.7800, -63.2064, -67.0651, -61.4379, -56.4268, -54.8667,
-52.3487, -56.4418, -57.1842, -55.1005, -55.6366, -59.4395, -56.8604,
-56.4949, -61.6573, -61.0826, -60.3250, -63.7876, -67.4882, -60.2323,
-54.6886, -50.5369, -47.7656, -45.8909, -49.1273, -57.4141, -58.3201,
-51.9862, -51.4897, -59.2561, -60.4730, -61.2203, -69.3174, -69.7464,
-65.5861, -58.9921, -59.5610, -61.0584, -58.1149, -64.4045, -66.2622,
-64.4610
],
[
-41.2298, -38.4211, -39.8834, -45.9950, -47.3839, -43.9849, -46.0371,
-52.5490, -56.6912, -51.8794, -50.1284, -49.7506, -53.9422, -63.2854,
-56.5754, -55.0469, -55.3181, -55.8115, -56.0058, -57.9215, -58.7597,
-59.1994, -59.2141, -64.4198, -73.5138, -64.4647, -59.3351, -54.5626,
-54.7508, -65.0230, -60.0270, -54.7644, -56.0108, -60.1531, -57.6879,
-56.3766, -63.3395, -65.3032, -61.5202, -63.0677, -68.4217, -60.6868,
-54.4619, -50.8533, -47.7200, -45.9197, -49.0961, -57.7621, -59.0750,
-51.9122, -51.4332, -59.4132, -60.3415, -61.6558, -70.7049, -69.7905,
-66.9104, -59.0324, -59.6138, -61.2023, -58.2169, -65.3837, -66.4425,
-64.4142
],
[
-51.6882, -50.6852, -50.8198, -51.7428, -53.0325, -54.1619, -56.4903,
-59.0314, -60.7996, -60.5164, -59.9680, -60.5393, -62.5796, -65.4166,
-65.6149, -65.1409, -65.7226, -67.9057, -72.5089, -82.3530, -86.3189,
-83.4241, -79.1279, -79.3384, -82.7335, -79.8316, -80.2167, -74.3638,
-71.3930, -75.3849, -74.5381, -71.4504, -70.3791, -71.4547, -71.8820,
-67.3885, -69.5686, -71.9852, -71.0307, -73.0053, -80.8802, -72.9227,
-63.8526, -60.3260, -59.6012, -57.8316, -61.0603, -67.3403, -67.1709,
-60.4967, -60.5079, -68.3345, -67.5213, -70.6416, -79.6219, -78.2198,
-74.6851, -69.5718, -69.4968, -70.6882, -66.8175, -73.8558, -74.3855,
-72.9405
]
]
)
# fmt: on
input_speech = self._load_datasamples(1)
feature_extractor = ClapFeatureExtractor()
for padding, EXPECTED_VALUES in zip(["repeat", "repeatpad", None], EXPECTED_INPUT_FEATURES):
input_features = feature_extractor(
input_speech, return_tensors="pt", truncation="rand_trunc", padding=padding
).input_features
self.assertTrue(torch.allclose(input_features[0, 0, :30], EXPECTED_VALUES, atol=1e-4))
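Reviewer note (not part of the diff): a small sketch of what these integration tests exercise, using the default constructor as above; the resulting shapes are printed rather than asserted since they depend on the configuration:

import numpy as np
from transformers import ClapFeatureExtractor

feature_extractor = ClapFeatureExtractor()  # default values, as in the tests above
waveform = np.random.rand(48_000).astype(np.float32)

for padding in ["repeat", "repeatpad", None]:
    # each strategy handles clips shorter than the maximum length differently
    input_features = feature_extractor(waveform, return_tensors="pt", padding=padding).input_features
    print(padding, input_features.shape)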


@@ -0,0 +1,665 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch CLAP model. """
import inspect
import os
import tempfile
import unittest
import numpy as np
from datasets import load_dataset
from transformers import ClapAudioConfig, ClapConfig, ClapProcessor, ClapTextConfig
from transformers.testing_utils import require_torch, slow, torch_device
from transformers.utils import is_torch_available
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import (
ModelTesterMixin,
_config_zero_init,
floats_tensor,
ids_tensor,
random_attention_mask,
)
if is_torch_available():
import torch
from torch import nn
from transformers import (
ClapAudioModel,
ClapAudioModelWithProjection,
ClapModel,
ClapTextModel,
ClapTextModelWithProjection,
)
from transformers.models.clap.modeling_clap import CLAP_PRETRAINED_MODEL_ARCHIVE_LIST
class ClapAudioModelTester:
def __init__(
self,
parent,
batch_size=12,
image_size=60,
num_mel_bins=16,
window_size=4,
spec_size=64,
patch_size=2,
patch_stride=2,
seq_length=16,
freq_ratio=2,
num_channels=3,
is_training=True,
hidden_size=256,
patch_embeds_hidden_size=32,
projection_dim=32,
num_hidden_layers=4,
num_heads=[2, 2, 2, 2],
intermediate_size=37,
dropout=0.1,
attention_dropout=0.1,
initializer_range=0.02,
scope=None,
):
self.parent = parent
self.batch_size = batch_size
self.image_size = image_size
self.num_mel_bins = num_mel_bins
self.window_size = window_size
self.patch_size = patch_size
self.num_channels = num_channels
self.is_training = is_training
self.hidden_size = hidden_size
self.projection_dim = projection_dim
self.num_hidden_layers = num_hidden_layers
self.num_heads = num_heads
self.num_attention_heads = num_heads[0]
self.seq_length = seq_length
self.spec_size = spec_size
self.freq_ratio = freq_ratio
self.patch_stride = patch_stride
self.patch_embeds_hidden_size = patch_embeds_hidden_size
self.intermediate_size = intermediate_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.initializer_range = initializer_range
self.scope = scope
def prepare_config_and_inputs(self):
input_features = floats_tensor([self.batch_size, 1, self.hidden_size, self.num_mel_bins])
config = self.get_config()
return config, input_features
def get_config(self):
return ClapAudioConfig(
image_size=self.image_size,
patch_size=self.patch_size,
num_mel_bins=self.num_mel_bins,
window_size=self.window_size,
num_channels=self.num_channels,
hidden_size=self.hidden_size,
patch_stride=self.patch_stride,
projection_dim=self.projection_dim,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_heads,
intermediate_size=self.intermediate_size,
dropout=self.dropout,
attention_dropout=self.attention_dropout,
initializer_range=self.initializer_range,
spec_size=self.spec_size,
freq_ratio=self.freq_ratio,
patch_embeds_hidden_size=self.patch_embeds_hidden_size,
)
def create_and_check_model(self, config, input_features):
model = ClapAudioModel(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(input_features)
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
def create_and_check_model_with_projection(self, config, input_features):
model = ClapAudioModelWithProjection(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(input_features)
self.parent.assertEqual(result.audio_embeds.shape, (self.batch_size, self.projection_dim))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_features = config_and_inputs
inputs_dict = {"input_features": input_features}
return config, inputs_dict
@require_torch
class ClapAudioModelTest(ModelTesterMixin, unittest.TestCase):
"""
Here we also overwrite some of the tests of test_modeling_common.py, as CLAP does not use input_ids, inputs_embeds,
attention_mask and seq_length.
"""
all_model_classes = (ClapAudioModel, ClapAudioModelWithProjection) if is_torch_available() else ()
fx_compatible = False
test_pruning = False
test_resize_embeddings = False
test_head_masking = False
def setUp(self):
self.model_tester = ClapAudioModelTester(self)
self.config_tester = ConfigTester(self, config_class=ClapAudioConfig, has_text_modality=False, hidden_size=37)
def test_config(self):
self.config_tester.run_common_tests()
@unittest.skip(reason="ClapAudioModel does not use inputs_embeds")
def test_inputs_embeds(self):
pass
def test_model_common_attributes(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
x = model.get_output_embeddings()
self.assertTrue(x is None or isinstance(x, nn.Linear))
def test_hidden_states_output(self):
def check_hidden_states_output(inputs_dict, config, model_class):
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
hidden_states = outputs.hidden_states
expected_num_layers = getattr(
self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
)
self.assertEqual(len(hidden_states), expected_num_layers)
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[self.model_tester.patch_embeds_hidden_size, self.model_tester.patch_embeds_hidden_size],
)
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
inputs_dict["output_hidden_states"] = True
check_hidden_states_output(inputs_dict, config, model_class)
# check that output_hidden_states also work using config
del inputs_dict["output_hidden_states"]
config.output_hidden_states = True
check_hidden_states_output(inputs_dict, config, model_class)
@unittest.skip(reason="ClapAudioModel does not output any loss term in the forward pass")
def test_retain_grad_hidden_states_attentions(self):
pass
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["input_features"]
self.assertListEqual(arg_names[:1], expected_arg_names)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_model_with_projection(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model_with_projection(*config_and_inputs)
@unittest.skip(reason="ClapAudioModel does not output any loss term in the forward pass")
def test_training(self):
pass
@unittest.skip(reason="ClapAudioModel does not output any loss term in the forward pass")
def test_training_gradient_checkpointing(self):
pass
@unittest.skip(reason="ClapAudioModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_from_base(self):
pass
@unittest.skip(reason="ClapAudioModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_to_base(self):
pass
@slow
def test_model_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapAudioModel.from_pretrained(model_name)
self.assertIsNotNone(model)
@slow
def test_model_with_projection_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapAudioModelWithProjection.from_pretrained(model_name)
self.assertIsNotNone(model)
self.assertTrue(hasattr(model, "visual_projection"))
class ClapTextModelTester:
def __init__(
self,
parent,
batch_size=12,
seq_length=7,
is_training=True,
use_input_mask=True,
use_labels=True,
vocab_size=99,
hidden_size=32,
projection_dim=32,
num_hidden_layers=5,
num_attention_heads=4,
intermediate_size=37,
dropout=0.1,
attention_dropout=0.1,
max_position_embeddings=512,
initializer_range=0.02,
scope=None,
projection_hidden_act="relu",
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.projection_dim = projection_dim
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.max_position_embeddings = max_position_embeddings
self.initializer_range = initializer_range
self.scope = scope
self.projection_hidden_act = projection_hidden_act
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
if input_mask is not None:
batch_size, seq_length = input_mask.shape
rnd_start_indices = np.random.randint(1, seq_length - 1, size=(batch_size,))
for batch_idx, start_index in enumerate(rnd_start_indices):
input_mask[batch_idx, :start_index] = 1
input_mask[batch_idx, start_index:] = 0
config = self.get_config()
return config, input_ids, input_mask
def get_config(self):
return ClapTextConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
projection_dim=self.projection_dim,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
dropout=self.dropout,
attention_dropout=self.attention_dropout,
max_position_embeddings=self.max_position_embeddings,
initializer_range=self.initializer_range,
projection_hidden_act=self.projection_hidden_act,
)
def create_and_check_model(self, config, input_ids, input_mask):
model = ClapTextModel(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(input_ids, attention_mask=input_mask)
result = model(input_ids)
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
def create_and_check_model_with_projection(self, config, input_ids, input_mask):
model = ClapTextModelWithProjection(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(input_ids, attention_mask=input_mask)
result = model(input_ids)
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
self.parent.assertEqual(result.text_embeds.shape, (self.batch_size, self.projection_dim))
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, input_mask = config_and_inputs
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
return config, inputs_dict
@require_torch
class ClapTextModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (ClapTextModel, ClapTextModelWithProjection) if is_torch_available() else ()
fx_compatible = False
test_pruning = False
test_head_masking = False
def setUp(self):
self.model_tester = ClapTextModelTester(self)
self.config_tester = ConfigTester(self, config_class=ClapTextConfig, hidden_size=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_model_with_projection(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model_with_projection(*config_and_inputs)
@unittest.skip(reason="ClapTextModel does not output any loss term in the forward pass")
def test_training(self):
pass
@unittest.skip(reason="ClapTextModel does not output any loss term in the forward pass")
def test_training_gradient_checkpointing(self):
pass
@unittest.skip(reason="ClapTextModel does not use inputs_embeds")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="ClapTextModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_from_base(self):
pass
@unittest.skip(reason="ClapTextModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_to_base(self):
pass
@slow
def test_model_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapTextModel.from_pretrained(model_name)
self.assertIsNotNone(model)
@slow
def test_model_with_projection_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapTextModelWithProjection.from_pretrained(model_name)
self.assertIsNotNone(model)
self.assertTrue(hasattr(model, "text_projection"))
class ClapModelTester:
def __init__(self, parent, text_kwargs=None, audio_kwargs=None, is_training=True):
if text_kwargs is None:
text_kwargs = {}
if audio_kwargs is None:
audio_kwargs = {}
self.parent = parent
self.text_model_tester = ClapTextModelTester(parent, **text_kwargs)
self.audio_model_tester = ClapAudioModelTester(parent, **audio_kwargs)
self.is_training = is_training
def prepare_config_and_inputs(self):
_, input_ids, attention_mask = self.text_model_tester.prepare_config_and_inputs()
_, input_features = self.audio_model_tester.prepare_config_and_inputs()
config = self.get_config()
return config, input_ids, attention_mask, input_features
def get_config(self):
return ClapConfig.from_text_audio_configs(
self.text_model_tester.get_config(), self.audio_model_tester.get_config(), projection_dim=64
)
def create_and_check_model(self, config, input_ids, attention_mask, input_features):
model = ClapModel(config).to(torch_device).eval()
with torch.no_grad():
result = model(input_ids, input_features, attention_mask)
self.parent.assertEqual(
result.logits_per_audio.shape, (self.audio_model_tester.batch_size, self.text_model_tester.batch_size)
)
self.parent.assertEqual(
result.logits_per_text.shape, (self.text_model_tester.batch_size, self.audio_model_tester.batch_size)
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, input_ids, attention_mask, input_features = config_and_inputs
inputs_dict = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"input_features": input_features,
"return_loss": True,
}
return config, inputs_dict
@require_torch
class ClapModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (ClapModel,) if is_torch_available() else ()
fx_compatible = False
test_head_masking = False
test_pruning = False
test_resize_embeddings = False
test_attention_outputs = False
def setUp(self):
self.model_tester = ClapModelTester(self)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
@unittest.skip(reason="Hidden_states is tested in individual model tests")
def test_hidden_states_output(self):
pass
@unittest.skip(reason="Inputs_embeds is tested in individual model tests")
def test_inputs_embeds(self):
pass
@unittest.skip(reason="Retain_grad is tested in individual model tests")
def test_retain_grad_hidden_states_attentions(self):
pass
@unittest.skip(reason="ClapModel does not have input/output embeddings")
def test_model_common_attributes(self):
pass
# override as the `logit_scale` parameter initialization is different for CLAP
def test_initialization(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
configs_no_init = _config_zero_init(config)
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
for name, param in model.named_parameters():
if param.requires_grad:
# check if `logit_scale` is initialized as per the original implementation
if name == "logit_scale":
self.assertAlmostEqual(
param.data.item(),
np.log(1 / 0.07),
delta=1e-3,
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
else:
self.assertIn(
((param.data.mean() * 1e9).round() / 1e9).item(),
[0.0, 1.0],
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
def _create_and_check_torchscript(self, config, inputs_dict):
if not self.test_torchscript:
return
configs_no_init = _config_zero_init(config) # To be sure we have no Nan
configs_no_init.torchscript = True
configs_no_init.return_dict = False
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
model.to(torch_device)
model.eval()
try:
input_ids = inputs_dict["input_ids"]
input_features = inputs_dict["input_features"] # CLAP needs input_features
traced_model = torch.jit.trace(model, (input_ids, input_features))
except RuntimeError:
self.fail("Couldn't trace module.")
with tempfile.TemporaryDirectory() as tmp_dir_name:
pt_file_name = os.path.join(tmp_dir_name, "traced_model.pt")
try:
torch.jit.save(traced_model, pt_file_name)
except Exception:
self.fail("Couldn't save module.")
try:
loaded_model = torch.jit.load(pt_file_name)
except Exception:
self.fail("Couldn't load module.")
model.to(torch_device)
model.eval()
loaded_model.to(torch_device)
loaded_model.eval()
model_state_dict = model.state_dict()
loaded_model_state_dict = loaded_model.state_dict()
self.assertEqual(set(model_state_dict.keys()), set(loaded_model_state_dict.keys()))
models_equal = True
for layer_name, p1 in model_state_dict.items():
p2 = loaded_model_state_dict[layer_name]
if p1.data.ne(p2.data).sum() > 0:
models_equal = False
self.assertTrue(models_equal)
def test_load_audio_text_config(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
# Save ClapConfig and check if we can load ClapAudioConfig from it
with tempfile.TemporaryDirectory() as tmp_dir_name:
config.save_pretrained(tmp_dir_name)
audio_config = ClapAudioConfig.from_pretrained(tmp_dir_name)
self.assertDictEqual(config.audio_config.to_dict(), audio_config.to_dict())
# Save ClapConfig and check if we can load ClapTextConfig from it
with tempfile.TemporaryDirectory() as tmp_dir_name:
config.save_pretrained(tmp_dir_name)
text_config = ClapTextConfig.from_pretrained(tmp_dir_name)
self.assertDictEqual(config.text_config.to_dict(), text_config.to_dict())
@slow
def test_model_from_pretrained(self):
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = ClapModel.from_pretrained(model_name)
self.assertIsNotNone(model)
@slow
@require_torch
class ClapModelIntegrationTest(unittest.TestCase):
paddings = ["repeatpad", "repeat", "pad"]
def test_integration_unfused(self):
EXPECTED_MEANS_UNFUSED = {
"repeatpad": 0.0024,
"pad": 0.0020,
"repeat": 0.0023,
}
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[-1]
model_id = "laion/clap-htsat-unfused"
model = ClapModel.from_pretrained(model_id).to(torch_device)
processor = ClapProcessor.from_pretrained(model_id)
for padding in self.paddings:
inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt", padding=padding).to(
torch_device
)
audio_embed = model.get_audio_features(**inputs)
expected_mean = EXPECTED_MEANS_UNFUSED[padding]
self.assertTrue(
torch.allclose(audio_embed.cpu().mean(), torch.tensor([expected_mean]), atol=1e-3, rtol=1e-3)
)
def test_integration_fused(self):
EXPECTED_MEANS_FUSED = {
"repeatpad": 0.00069,
"repeat": 0.00196,
"pad": -0.000379,
}
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[-1]
model_id = "laion/clap-htsat-fused"
model = ClapModel.from_pretrained(model_id).to(torch_device)
processor = ClapProcessor.from_pretrained(model_id)
for padding in self.paddings:
inputs = processor(
audios=audio_sample["audio"]["array"], return_tensors="pt", padding=padding, truncation="fusion"
).to(torch_device)
audio_embed = model.get_audio_features(**inputs)
expected_mean = EXPECTED_MEANS_FUSED[padding]
self.assertTrue(
torch.allclose(audio_embed.cpu().mean(), torch.tensor([expected_mean]), atol=1e-3, rtol=1e-3)
)
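Reviewer note (not part of the diff): the tests above check embedding means; below is a hedged sketch of the end-to-end similarity use case built on the same checkpoint and dataset. The candidate prompts and the softmax over `logits_per_audio` are illustrative, not taken from this PR:

import torch
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

model_id = "laion/clap-htsat-unfused"
model = ClapModel.from_pretrained(model_id)
processor = ClapProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = dataset[-1]["audio"]["array"]

candidate_labels = ["Sound of a dog", "Sound of a person speaking"]  # illustrative prompts
text_inputs = processor(text=candidate_labels, return_tensors="pt", padding=True)
audio_inputs = processor(audios=audio, return_tensors="pt")

with torch.no_grad():
    outputs = model(
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
        input_features=audio_inputs["input_features"],
    )

# logits_per_audio has shape (num_audios, num_texts), as checked in ClapModelTester above
probs = outputs.logits_per_audio.softmax(dim=-1)
print(dict(zip(candidate_labels, probs[0].tolist())))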


@@ -0,0 +1,125 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import shutil
import tempfile
import unittest
from transformers import ClapFeatureExtractor, ClapProcessor, RobertaTokenizer, RobertaTokenizerFast
from transformers.testing_utils import require_sentencepiece, require_torchaudio
from .test_feature_extraction_clap import floats_list
@require_torchaudio
@require_sentencepiece
class ClapProcessorTest(unittest.TestCase):
def setUp(self):
self.checkpoint = "laion/clap-htsat-unfused"
self.tmpdirname = tempfile.mkdtemp()
def get_tokenizer(self, **kwargs):
return RobertaTokenizer.from_pretrained(self.checkpoint, **kwargs)
def get_feature_extractor(self, **kwargs):
return ClapFeatureExtractor.from_pretrained(self.checkpoint, **kwargs)
def tearDown(self):
shutil.rmtree(self.tmpdirname)
def test_save_load_pretrained_default(self):
tokenizer = self.get_tokenizer()
feature_extractor = self.get_feature_extractor()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
processor.save_pretrained(self.tmpdirname)
processor = ClapProcessor.from_pretrained(self.tmpdirname)
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())
self.assertIsInstance(processor.tokenizer, RobertaTokenizerFast)
self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor.to_json_string())
self.assertIsInstance(processor.feature_extractor, ClapFeatureExtractor)
def test_save_load_pretrained_additional_features(self):
processor = ClapProcessor(tokenizer=self.get_tokenizer(), feature_extractor=self.get_feature_extractor())
processor.save_pretrained(self.tmpdirname)
tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
feature_extractor_add_kwargs = self.get_feature_extractor(do_normalize=False, padding_value=1.0)
processor = ClapProcessor.from_pretrained(
self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
)
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
self.assertIsInstance(processor.tokenizer, RobertaTokenizerFast)
self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor_add_kwargs.to_json_string())
self.assertIsInstance(processor.feature_extractor, ClapFeatureExtractor)
def test_feature_extractor(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
raw_speech = floats_list((3, 1000))
input_feat_extract = feature_extractor(raw_speech, return_tensors="np")
input_processor = processor(audios=raw_speech, return_tensors="np")
for key in input_feat_extract.keys():
self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)
def test_tokenizer(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
input_str = "This is a test string"
encoded_processor = processor(text=input_str)
encoded_tok = tokenizer(input_str)
for key in encoded_tok.keys():
self.assertListEqual(encoded_tok[key], encoded_processor[key])
def test_tokenizer_decode(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
decoded_processor = processor.batch_decode(predicted_ids)
decoded_tok = tokenizer.batch_decode(predicted_ids)
self.assertListEqual(decoded_tok, decoded_processor)
def test_model_input_names(self):
feature_extractor = self.get_feature_extractor()
tokenizer = self.get_tokenizer()
processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)
self.assertListEqual(
processor.model_input_names[2:],
feature_extractor.model_input_names,
msg="`processor` and `feature_extractor` model input names do not match",
)


@@ -172,6 +172,10 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
# should **not** be the rule.
IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
# models to ignore for model xxx mapping
"ClapTextModel",
"ClapTextModelWithProjection",
"ClapAudioModel",
"ClapAudioModelWithProjection",
"Blip2ForConditionalGeneration", "Blip2ForConditionalGeneration",
"Blip2QFormerModel", "Blip2QFormerModel",
"Blip2VisionModel", "Blip2VisionModel",


@@ -40,6 +40,8 @@ src/transformers/models/bloom/configuration_bloom.py
src/transformers/models/camembert/configuration_camembert.py
src/transformers/models/canine/configuration_canine.py
src/transformers/models/canine/modeling_canine.py
src/transformers/models/clap/configuration_clap.py
src/transformers/models/clap/modeling_clap.py
src/transformers/models/clip/configuration_clip.py
src/transformers/models/clipseg/modeling_clipseg.py
src/transformers/models/codegen/configuration_codegen.py