[CLAP] Add CLAP to the library (#21370)
* add model like clip * update * text model ok * clap text works * some refactor - `CLAPVision` to `CLAPAudio` - refactor kwargs of audio modules * more refactor * more refactor * more refactor * correct fusion * more refactor * new modules * add basic processor * fixup * remove whisper copioed from * audio logits match * add doc * correct filters mel and add maxlength * style * few fixes * forward passes * fixup * fixup * some clean up * remove mels form the dictionnary * pad after the repeat * update padding when dsmaller * fix padding * style * use swin patch merging * use copied from swin * processor with any tokenizer * more copied from * some clean up * more refactor * fix mel when rand_trunc * style * remove unused imports * update processing * remove image processing tests * add testing fiel * fixmodeling issues * replace with `is_longer` * clap in serialization * more refactor * `make fixup` * make fixup * fix feature extractor * update test feature extractor * `make fixup` * clean up config * more clean up * more cleanup * update tests * refactor tests and inits * removeCLAP vision config * remove CLAP from image procssing auto and dummy vision objects * update inits * style * re order classes in modeling clap * Use roberta tokenizer as the other weights are not open sourced * small cleaup * remove tokenization CLAP * processor tokenizr is roberta * update feature extraction doc * remove vclap from model zero shot * update f_min and f_max to frequency_xx * some changes - fix modeling keys - add `is_longer` in the forward pass - make fixup * make fixup * consistent behavior ebtween rand_crop and fusion * add numpy resize and bilinear and documentation * move resizing to image utils * clean feature extraction * import resize from correct file * resize in image transforms * update * style * style * nit * remove unused arguments form the feature extractor * style * few fixes + make fixup * oops * fix more tests * add zero shot audio classification pipeline * update zeroshot classification pipeline * fixup * fix copies * all CI tests pass * make fixup + fix docs * fix docs * fix docs * update tests pip;eline * update zero shot pipeline * update feature extraction clap * update tokenization auto * use nested simplify * update pipeline tests * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * split in two lines * fixes * refactor * clean up * add integration tests * update config docstring * style * update processor * fix processor test * fix feat extractor tests * update docs * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix readmes * fix tips * Update src/transformers/models/auto/configuration_auto.py * update doc and remove todo -> properly explained * fix idx and typo * typoe * cleanup config * cleanup tests, styles and doc * ignore docstyle on image transform * add conversion script * remove the `clap` indx in favor of `CLAP` * update __init * nits * Update src/transformers/pipelines/__init__.py * fix bug * clarifiy config * fix copy * fix init * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * fix model output * fix comment * make fixup * make fixup * rename to `Clap` * replace to `Clap` * replace to `Clap` * repo consistency * again repo-consistency * make fixup * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * add config * 
changes * update conversion * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * remove unused function * update based on code reviews * style * more comments * cleanup * clean up * style * apply suggestions * Empty commit * pipeline will be added in a different PR * update calls to audio utils functions * update pipeline init * style * style * styling again * use pad * fix repo-consistency * update utils and add doc for audio utils * clean up resize by using torch. update inits accordingly * style * CLap's tokenizer is RobertA * add audio utils to internal toctreee * update totctree * style * update documentation and normalize naming accross audio utils and feature extraction clap * style * clean up * update doc and typos * fix doctest * update modelin code, got rid of a lot of reshaping * style on added doc audio utils * update modeling clap * style * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * docstringvariables with CLAP * rename key * update modeling CLAP * update audio utils docstring * update processing clap * fix readmes * fix toctree * udpate configuration clap * fix init * make fixup * fix * fix * update naming * update * update checkpoint path * Apply suggestions from code review * Major refactoring * Update src/transformers/models/clap/configuration_clap.py * merge --------- Co-authored-by: younesbelkada <younesbelkada@gmail.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
parent 6b0257de42
commit c236a62172
@@ -295,6 +295,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -288,6 +288,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -260,6 +260,7 @@ conda install -c huggingface transformers
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -322,6 +322,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -237,6 +237,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -261,6 +261,7 @@ conda install -c huggingface transformers
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -273,6 +273,7 @@ conda install -c huggingface transformers
1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -495,6 +495,8 @@
    sections:
    - local: model_doc/audio-spectrogram-transformer
      title: Audio Spectrogram Transformer
    - local: model_doc/clap
      title: CLAP
    - local: model_doc/hubert
      title: Hubert
    - local: model_doc/mctct
@@ -622,6 +624,8 @@
      title: Utilities for Generation
    - local: internal/image_processing_utils
      title: Utilities for Image Processors
    - local: internal/audio_utils
      title: Utilities for Audio processing
    - local: internal/file_utils
      title: General Utilities
    title: Internal Helpers
@@ -74,6 +74,7 @@ The documentation is organized into five sections:
1. **[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
1. **[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
1. **[Chinese-CLIP](model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou.
1. **[CLAP](model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.
1. **[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.
1. **[CLIPSeg](model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker.
1. **[CodeGen](model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong.
@@ -263,6 +264,7 @@ Flax), PyTorch, and/or TensorFlow.
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
| CANINE | ✅ | ❌ | ✅ | ❌ | ❌ |
| Chinese-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
| CLAP | ❌ | ❌ | ✅ | ❌ | ❌ |
| CLIP | ✅ | ✅ | ✅ | ✅ | ✅ |
| CLIPSeg | ❌ | ❌ | ✅ | ❌ | ❌ |
| CodeGen | ✅ | ✅ | ✅ | ❌ | ❌ |
@@ -0,0 +1,34 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Utilities for `FeatureExtractors`

This page lists all the utility functions that can be used by the audio [`FeatureExtractor`] classes to compute special features from raw audio, using common algorithms such as the *Short Time Fourier Transform* or the *log mel spectrogram*. A short usage sketch is given at the end of this page.

Most of those are only useful if you are studying the code of the audio processors in the library.

## Audio Transformations

[[autodoc]] audio_utils.hertz_to_mel

[[autodoc]] audio_utils.mel_to_hertz

[[autodoc]] audio_utils.get_mel_filter_banks

[[autodoc]] audio_utils.stft

[[autodoc]] audio_utils.power_to_db

[[autodoc]] audio_utils.fram_wave
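As a rough illustration of how these helpers fit together, the sketch below builds a mel filter bank and converts a power spectrogram to a log (decibel) scale. The FFT size, sampling rate, and number of mel filters are assumptions chosen for the example, and the spectrogram is a random stand-in for the output of the STFT utilities above.

```python
import numpy as np

from transformers.audio_utils import get_mel_filter_banks, power_to_db

# Assumed analysis settings: 16 kHz audio, 1024-point FFT (513 frequency bins),
# 64 mel filters spanning 0 Hz to 8 kHz.
mel_filters = get_mel_filter_banks(
    nb_frequency_bins=513,
    nb_mel_filters=64,
    frequency_min=0.0,
    frequency_max=8000.0,
    sample_rate=16000,
    norm="slaney",
    mel_scale="slaney",
)  # shape: (513, 64)

# Random stand-in for a power spectrogram of 100 frames; in practice this would
# come from framing the waveform and applying the STFT.
power_spectrogram = np.random.rand(100, 513)

# Project onto the mel filters, then compress to decibels.
mel_spectrogram = power_spectrogram @ mel_filters  # shape: (100, 64)
log_mel_spectrogram = power_to_db(mel_spectrogram, top_db=80.0)
```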
@@ -0,0 +1,77 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# CLAP

## Overview

The CLAP model was proposed in [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/pdf/2211.06687.pdf) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.

CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed to predict the most relevant text snippet for a given audio clip, without directly optimizing for the task. The CLAP model uses a Swin Transformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected into a latent space with identical dimensions. The dot product between the projected audio and text features is then used as a similarity score.

The abstract from the paper is the following:

*Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zeroshot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-6*

This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker).
The original code can be found [here](https://github.com/LAION-AI/Clap).
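The snippet below is a minimal sketch of zero-shot audio-text matching with [`ClapModel`] and [`ClapProcessor`]. The checkpoint name `laion/clap-htsat-unfused`, the 48 kHz input, and the `logits_per_audio` output attribute are assumptions based on the CLIP-style API; substitute whichever CLAP checkpoint and sampling rate you actually use.

```python
import numpy as np
import torch

from transformers import ClapModel, ClapProcessor

# Hypothetical checkpoint name, used here only for illustration.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# One second of dummy audio stands in for a real recording (48 kHz assumed).
audio = np.random.rand(48_000).astype(np.float32)
candidate_texts = ["the sound of a dog barking", "the sound of rain"]

inputs = processor(text=candidate_texts, audios=audio, sampling_rate=48_000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Audio-text similarity scores, turned into probabilities over the candidate texts.
probs = outputs.logits_per_audio.softmax(dim=-1)
print(probs)
```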
## ClapConfig

[[autodoc]] ClapConfig
    - from_text_audio_configs

## ClapTextConfig

[[autodoc]] ClapTextConfig

## ClapAudioConfig

[[autodoc]] ClapAudioConfig

## ClapFeatureExtractor

[[autodoc]] ClapFeatureExtractor

## ClapProcessor

[[autodoc]] ClapProcessor

## ClapModel

[[autodoc]] ClapModel
    - forward
    - get_text_features
    - get_audio_features

## ClapTextModel

[[autodoc]] ClapTextModel
    - forward

## ClapTextModelWithProjection

[[autodoc]] ClapTextModelWithProjection
    - forward

## ClapAudioModel

[[autodoc]] ClapAudioModel
    - forward

## ClapAudioModelWithProjection

[[autodoc]] ClapAudioModelWithProjection
    - forward
@@ -47,6 +47,7 @@ logger = logging.get_logger(__name__)  # pylint: disable=invalid-name

# Base objects, independent of any specific backend
_import_structure = {
    "audio_utils": [],
    "benchmark": [],
    "commands": [],
    "configuration_utils": ["PretrainedConfig"],
@@ -206,6 +207,13 @@ _import_structure = {
        "ChineseCLIPTextConfig",
        "ChineseCLIPVisionConfig",
    ],
    "models.clap": [
        "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
        "ClapAudioConfig",
        "ClapConfig",
        "ClapProcessor",
        "ClapTextConfig",
    ],
    "models.clip": [
        "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "CLIPConfig",
@@ -1231,6 +1239,18 @@ else:
            "ChineseCLIPVisionModel",
        ]
    )
    _import_structure["models.clap"].extend(
        [
            "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
            "ClapAudioModel",
            "ClapAudioModelWithProjection",
            "ClapFeatureExtractor",
            "ClapModel",
            "ClapPreTrainedModel",
            "ClapTextModel",
            "ClapTextModelWithProjection",
        ]
    )
    _import_structure["models.clip"].extend(
        [
            "CLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -3730,6 +3750,13 @@ if TYPE_CHECKING:
        ChineseCLIPTextConfig,
        ChineseCLIPVisionConfig,
    )
    from .models.clap import (
        CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
        ClapAudioConfig,
        ClapConfig,
        ClapProcessor,
        ClapTextConfig,
    )
    from .models.clip import (
        CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
        CLIPConfig,
@@ -4623,6 +4650,16 @@ if TYPE_CHECKING:
        ChineseCLIPTextModel,
        ChineseCLIPVisionModel,
    )
    from .models.clap import (
        CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
        ClapAudioModel,
        ClapAudioModelWithProjection,
        ClapFeatureExtractor,
        ClapModel,
        ClapPreTrainedModel,
        ClapTextModel,
        ClapTextModelWithProjection,
    )
    from .models.clip import (
        CLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
        CLIPModel,
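With the entries above registered in `_import_structure` and mirrored in the `TYPE_CHECKING` branch, the new CLAP classes become importable from the top level of the package. A minimal sketch of what this enables, assuming `from_text_audio_configs` accepts the two sub-configurations in the same way CLIP's `from_text_vision_configs` does:

```python
from transformers import ClapAudioConfig, ClapConfig, ClapModel, ClapTextConfig

# Compose a CLAP configuration from its text and audio sub-configurations
# and build a randomly initialized model from it.
config = ClapConfig.from_text_audio_configs(ClapTextConfig(), ClapAudioConfig())
model = ClapModel(config)
```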
@@ -0,0 +1,359 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Audio processing functions to extract features from raw audio. These should all be in numpy to support all
frameworks, and remove unnecessary dependencies.
"""
import math
import warnings
from typing import Optional

import numpy as np
from numpy.fft import fft

def hertz_to_mel(freq: float, mel_scale: str = "htk") -> float:
    """Convert Hertz to Mels.

    Args:
        freq (`float`):
            Frequency in Hertz.
        mel_scale (`str`, *optional*, defaults to `"htk"`):
            Scale to use, `htk` or `slaney`.

    Returns:
        mels (`float`):
            Frequency in Mels.
    """

    if mel_scale not in ["slaney", "htk"]:
        raise ValueError('mel_scale should be one of "htk" or "slaney".')

    if mel_scale == "htk":
        return 2595.0 * math.log10(1.0 + (freq / 700.0))

    # Fill in the linear part
    frequency_min = 0.0
    f_sp = 200.0 / 3

    mels = (freq - frequency_min) / f_sp

    # Fill in the log-scale part
    min_log_hertz = 1000.0
    min_log_mel = (min_log_hertz - frequency_min) / f_sp
    logstep = math.log(6.4) / 27.0

    if freq >= min_log_hertz:
        mels = min_log_mel + math.log(freq / min_log_hertz) / logstep

    return mels

def mel_to_hertz(mels: np.array, mel_scale: str = "htk") -> np.array:
    """Convert mel bin numbers to frequencies.

    Args:
        mels (`np.array`):
            Mel frequencies.
        mel_scale (`str`, *optional*, defaults to `"htk"`):
            Scale to use: `htk` or `slaney`.

    Returns:
        freqs (`np.array`):
            Mels converted to Hertz.
    """

    if mel_scale not in ["slaney", "htk"]:
        raise ValueError('mel_scale should be one of "htk" or "slaney".')

    if mel_scale == "htk":
        return 700.0 * (10.0 ** (mels / 2595.0) - 1.0)

    # Fill in the linear scale
    frequency_min = 0.0
    f_sp = 200.0 / 3
    freqs = frequency_min + f_sp * mels

    # And now the nonlinear scale
    min_log_hertz = 1000.0
    min_log_mel = (min_log_hertz - frequency_min) / f_sp
    logstep = math.log(6.4) / 27.0

    log_t = mels >= min_log_mel
    freqs[log_t] = min_log_hertz * np.exp(logstep * (mels[log_t] - min_log_mel))

    return freqs

def _create_triangular_filterbank(
    all_freqs: np.array,
    f_pts: np.array,
) -> np.array:
    """Create a triangular filter bank.

    Args:
        all_freqs (`np.array` of shape (`nb_frequency_bins`, )):
            Discrete frequencies used when the STFT was computed.
        f_pts (`np.array` of shape (`nb_mel_filters`, )):
            Coordinates of the middle points of the triangular filters to create.

    Returns:
        fb (`np.array`):
            The filter bank of size (`nb_frequency_bins`, `nb_mel_filters`).
    """
    # Adapted from Librosa
    # calculate the difference between each filter mid point and each stft freq point in hertz
    f_diff = f_pts[1:] - f_pts[:-1]  # (n_filter + 1)
    slopes = np.expand_dims(f_pts, 0) - np.expand_dims(all_freqs, 1)  # (nb_frequency_bins, n_filter + 2)
    # create overlapping triangles
    zero = np.zeros(1)
    down_slopes = (-1.0 * slopes[:, :-2]) / f_diff[:-1]  # (nb_frequency_bins, n_filter)
    up_slopes = slopes[:, 2:] / f_diff[1:]  # (nb_frequency_bins, n_filter)
    fb = np.maximum(zero, np.minimum(down_slopes, up_slopes))

    return fb

def get_mel_filter_banks(
|
||||||
|
nb_frequency_bins: int,
|
||||||
|
nb_mel_filters: int,
|
||||||
|
frequency_min: float,
|
||||||
|
frequency_max: float,
|
||||||
|
sample_rate: int,
|
||||||
|
norm: Optional[str] = None,
|
||||||
|
mel_scale: str = "htk",
|
||||||
|
) -> np.array:
|
||||||
|
"""
|
||||||
|
Create a frequency bin conversion matrix used to obtain the Mel Spectrogram. This is called a *mel filter bank*,
|
||||||
|
and various implementation exist, which differ in the number of filters, the shape of the filters, the way the
|
||||||
|
filters are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. The goal of these
|
||||||
|
features is to approximate the non-linear human perception of the variation in pitch with respect to the frequency.
|
||||||
|
This code is heavily inspired from the *torchaudio* implementation, see
|
||||||
|
[here](https://pytorch.org/audio/stable/transforms.html) for more details.
|
||||||
|
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
- Different banks of Mel filters were introduced in the litterature. The following variation are supported:
|
||||||
|
- MFCC FB-20: introduced in 1980 by Davis and Mermelstein, it assumes a sampling frequency of 10 kHertz
|
||||||
|
and a speech bandwidth of `[0, 4600]` Hertz
|
||||||
|
- MFCC FB-24 HTK: from the Cambridge HMM Toolkit (HTK) (1995) uses a filter bank of 24 filters for a
|
||||||
|
speech bandwidth `[0, 8000]` Hertz (sampling rate ≥ 16 kHertz).
|
||||||
|
- MFCC FB-40: from the Auditory Toolbox for MATLAB written by Slaney in 1998, assumes a sampling rate
|
||||||
|
of 16 kHertz, and speech bandwidth [133, 6854] Hertz. This version also includes an area normalization.
|
||||||
|
- HFCC-E FB-29 (Human Factor Cepstral Coefficients) of Skowronski and Harris (2004), assumes sampling
|
||||||
|
rate of 12.5 kHertz and speech bandwidth [0, 6250] Hertz
|
||||||
|
- The default parameters of `torchaudio`'s mel filterbanks implement the `"htk"` filers while `torchlibrosa`
|
||||||
|
uses the `"slaney"` implementation.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
nb_frequency_bins (`int`):
|
||||||
|
Number of frequencies used to compute the spectrogram (should be the same as in `stft`).
|
||||||
|
nb_mel_filters (`int`):
|
||||||
|
Number of Mel filers to generate.
|
||||||
|
frequency_min (`float`):
|
||||||
|
Minimum frequency of interest(Hertz).
|
||||||
|
frequency_max (`float`):
|
||||||
|
Maximum frequency of interest(Hertz).
|
||||||
|
sample_rate (`int`):
|
||||||
|
Sample rate of the audio waveform.
|
||||||
|
norm (`str`, *optional*):
|
||||||
|
If "slaney", divide the triangular Mel weights by the width of the mel band (area normalization).
|
||||||
|
mel_scale (`str`, *optional*, defaults to `"htk"`):
|
||||||
|
Scale to use: `"htk"` or `"slaney"`.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
`np.ndarray`: Triangular filter banks (fb matrix) of shape (`nb_frequency_bins`, `nb_mel_filters`). This matrix
|
||||||
|
is a projection matrix to go from a spectrogram to a Mel Spectrogram.
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
if norm is not None and norm != "slaney":
|
||||||
|
raise ValueError('norm must be one of None or "slaney"')
|
||||||
|
|
||||||
|
# freqency bins
|
||||||
|
all_freqs = np.linspace(0, sample_rate // 2, nb_frequency_bins)
|
||||||
|
|
||||||
|
# Compute mim and max frequencies in mel scale
|
||||||
|
m_min = hertz_to_mel(frequency_min, mel_scale=mel_scale)
|
||||||
|
m_max = hertz_to_mel(frequency_max, mel_scale=mel_scale)
|
||||||
|
|
||||||
|
# create the centers of the triangular mel filters.
|
||||||
|
m_pts = np.linspace(m_min, m_max, nb_mel_filters + 2)
|
||||||
|
f_pts = mel_to_hertz(m_pts, mel_scale=mel_scale)
|
||||||
|
|
||||||
|
# create the filterbank
|
||||||
|
filterbank = _create_triangular_filterbank(all_freqs, f_pts)
|
||||||
|
|
||||||
|
if norm is not None and norm == "slaney":
|
||||||
|
# Slaney-style mel is scaled to be approx constant energy per channel
|
||||||
|
enorm = 2.0 / (f_pts[2 : nb_mel_filters + 2] - f_pts[:nb_mel_filters])
|
||||||
|
filterbank *= np.expand_dims(enorm, 0)
|
||||||
|
|
||||||
|
if (filterbank.max(axis=0) == 0.0).any():
|
||||||
|
warnings.warn(
|
||||||
|
"At least one mel filterbank has all zero values. "
|
||||||
|
f"The value for `nb_mel_filters` ({nb_mel_filters}) may be set too high. "
|
||||||
|
f"Or, the value for `nb_frequency_bins` ({nb_frequency_bins}) may be set too low."
|
||||||
|
)
|
||||||
|
|
||||||
|
return filterbank
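As a quick sanity check, the sketch below builds the filter bank with the same parameters the CLAP feature extractor uses further down in this PR (1024-sample FFT window, 64 mel filters, 0 to 14 kHz of interest, 48 kHz audio); the numbers are only illustrative:

```python
>>> from transformers.audio_utils import get_mel_filter_banks

>>> fft_window_size = 1024
>>> mel_filters = get_mel_filter_banks(
...     nb_frequency_bins=(fft_window_size >> 1) + 1,  # 513, as computed in `stft`
...     nb_mel_filters=64,
...     frequency_min=0,
...     frequency_max=14_000,
...     sample_rate=48_000,
...     norm=None,
...     mel_scale="htk",
... )
>>> mel_filters.shape  # (nb_frequency_bins, nb_mel_filters)
(513, 64)
```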


def power_to_db(mel_spectrogram, top_db=None, a_min=1e-10, ref=1.0):
    """
    Convert a mel spectrogram from power to db scale. This function is the numpy implementation of
    `librosa.power_to_db`. It computes `10 * log10(mel_spectrogram / ref)`, using basic log properties for numerical
    stability.

    Tips:
        - The motivation behind applying the log function on the mel spectrogram is that humans do not hear loudness
          on a linear scale. Generally, to double the perceived volume of a sound we need to put 8 times as much
          energy into it.
        - This means that large variations in energy may not sound all that different if the sound is loud to begin
          with. This compression operation makes the mel features match more closely what humans actually hear.

    Args:
        mel_spectrogram (`np.array`):
            Input mel spectrogram.
        top_db (`int`, *optional*):
            The maximum decibel value.
        a_min (`int`, *optional*, defaults to 1e-10):
            Minimum value to use when clipping the mel spectrogram.
        ref (`float`, *optional*, defaults to 1.0):
            Maximum reference value used to scale the mel_spectrogram.

    """
    log_spec = 10 * np.log10(np.clip(mel_spectrogram, a_min=a_min, a_max=None))
    log_spec -= 10.0 * np.log10(np.maximum(a_min, ref))
    if top_db is not None:
        if top_db < 0:
            raise ValueError("top_db must be non-negative")
        log_spec = np.clip(log_spec, a_min=log_spec.max() - top_db, a_max=np.inf)
    return log_spec
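A minimal sketch of the conversion (the values are chosen by hand, not taken from a real spectrogram):

```python
>>> import numpy as np
>>> from transformers.audio_utils import power_to_db

>>> power_mel = np.array([[1e-3, 1.0, 10.0]])
>>> db = power_to_db(power_mel)                       # 10 * log10(power): about [-30., 0., 10.]
>>> db_clipped = power_to_db(power_mel, top_db=20.0)  # values more than 20 dB below the peak clip to -10.
```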


# TODO @ArthurZucker: This method does not support batching yet as we are mainly focused on inference.
def fram_wave(waveform: np.array, hop_length: int = 160, fft_window_size: int = 400, center: bool = True):
    """
    In order to compute the short time fourier transform, the waveform needs to be split in overlapping windowed
    segments called `frames`.

    The window length (`fft_window_size`) defines how much of the signal is contained in each frame, while the hop
    length defines the step between the beginning of each new frame.

    Args:
        waveform (`np.array` of shape `(sample_length,)`):
            The raw waveform which will be split into smaller chunks.
        hop_length (`int`, *optional*, defaults to 160):
            Step between each window of the waveform.
        fft_window_size (`int`, *optional*, defaults to 400):
            Defines the size of the window.
        center (`bool`, defaults to `True`):
            Whether or not to center each frame around the middle of the frame. Centering is done by reflecting the
            waveform on the left and on the right.

    Return:
        framed_waveform (`np.array` of shape `(waveform.shape // hop_length, fft_window_size)`):
            The framed waveforms that can be fed to `np.fft`.
    """
    frames = []
    for i in range(0, waveform.shape[0] + 1, hop_length):
        if center:
            half_window = (fft_window_size - 1) // 2 + 1
            start = i - half_window if i > half_window else 0
            end = i + half_window if i < waveform.shape[0] - half_window else waveform.shape[0]
            frame = waveform[start:end]
            if start == 0:
                padd_width = (-i + half_window, 0)
                frame = np.pad(frame, pad_width=padd_width, mode="reflect")

            elif end == waveform.shape[0]:
                padd_width = (0, (i - waveform.shape[0] + half_window))
                frame = np.pad(frame, pad_width=padd_width, mode="reflect")

        else:
            frame = waveform[i : i + fft_window_size]
            frame_width = frame.shape[0]
            if frame_width < waveform.shape[0]:
                frame = np.lib.pad(
                    frame, pad_width=(0, fft_window_size - frame_width), mode="constant", constant_values=0
                )
        frames.append(frame)

    frames = np.stack(frames, 0)
    return frames
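A short sketch of the framing behaviour, assuming a dummy 16 kHz waveform and the default hop and window sizes:

```python
>>> import numpy as np
>>> from transformers.audio_utils import fram_wave

>>> waveform = np.random.rand(16_000)  # 1 second of dummy audio at 16 kHz
>>> frames = fram_wave(waveform, hop_length=160, fft_window_size=400)
>>> frames.shape  # one frame every 10 ms, each 400 samples long
(101, 400)
```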


# TODO @ArthurZucker: This method does not support batching yet as we are mainly focused on inference.


def stft(frames: np.array, windowing_function: np.array, fft_window_size: int = None):
    """
    Calculates the complex Short-Time Fourier Transform (STFT) of the given framed signal. Should give the same
    results as `torch.stft`.

    Args:
        frames (`np.array` of dimension `(num_frames, fft_window_size)`):
            A framed audio signal obtained using `audio_utils.fram_wav`.
        windowing_function (`np.array` of dimension `(fft_window_size,)`):
            An array representing the function that will be used to reduce the amplitude of the discontinuities at
            the boundaries of each frame when computing the STFT. Each frame will be multiplied by the
            windowing_function. For more information on the discontinuities, called *Spectral leakage*, refer to
            [this tutorial](https://download.ni.com/evaluation/pxi/Understanding%20FFTs%20and%20Windowing.pdf).
        fft_window_size (`int`, *optional*):
            Size of the window on which the Fourier transform is applied. This controls the frequency resolution of
            the spectrogram. 400 means that the Fourier transform is computed on windows of 400 samples. The number
            of frequency bins (`nb_frequency_bins`) used to divide the window into equal strips is equal to
            `(1+fft_window_size)//2`. An increase of the fft_window_size slows the calculation time proportionally.

    Example:

    ```python
    >>> from transformers.audio_utils import stft, fram_wave
    >>> import numpy as np

    >>> audio = np.random.rand(50)
    >>> fft_window_size = 10
    >>> hop_length = 2
    >>> framed_audio = fram_wave(audio, hop_length, fft_window_size)
    >>> spectrogram = stft(framed_audio, np.hanning(fft_window_size + 1)[:-1])
    ```

    Returns:
        spectrogram (`np.ndarray`):
            A spectrogram of shape `(nb_frequency_bins, num_frames)` obtained using the STFT algorithm.
    """
    frame_size = frames.shape[1]

    if fft_window_size is None:
        fft_window_size = frame_size

    if fft_window_size < frame_size:
        raise ValueError("FFT size must be greater or equal to the frame size")
    # number of FFT bins to store
    nb_frequency_bins = (fft_window_size >> 1) + 1

    spectrogram = np.empty((len(frames), nb_frequency_bins), dtype=np.complex64)
    fft_signal = np.zeros(fft_window_size)

    for f, frame in enumerate(frames):
        if windowing_function is not None:
            np.multiply(frame, windowing_function, out=fft_signal[:frame_size])
        else:
            fft_signal[:frame_size] = frame
        spectrogram[f] = fft(fft_signal, axis=0)[:nb_frequency_bins]
    return spectrogram.T
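Putting the helpers above together, a hedged sketch of the full log-mel pipeline used by the CLAP feature extractor introduced later in this PR (the constants mirror its defaults; the waveform is random dummy data):

```python
>>> import numpy as np
>>> from transformers.audio_utils import fram_wave, get_mel_filter_banks, power_to_db, stft

>>> sampling_rate, fft_window_size, hop_length, n_mels = 48_000, 1024, 480, 64
>>> waveform = np.random.rand(sampling_rate)  # 1 second of dummy audio

>>> # frame the waveform and take the complex STFT with a Hann window
>>> window = np.hanning(fft_window_size + 1)[:-1]
>>> frames = fram_wave(waveform, hop_length, fft_window_size)
>>> spectrogram = stft(frames, window, fft_window_size=fft_window_size)

>>> # project the power spectrogram onto the mel filters, then move to the log scale
>>> magnitudes = np.abs(spectrogram) ** 2
>>> mel_filters = get_mel_filter_banks((fft_window_size >> 1) + 1, n_mels, 0, 14_000, sampling_rate)
>>> log_mel = power_to_db(np.matmul(mel_filters.T, magnitudes))
>>> log_mel.shape  # (n_mels, num_frames)
(64, 101)
```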

@@ -235,11 +235,13 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
        Pad inputs (on left/right and up to predefined length or max length in the batch)

        Args:
-            processed_features:
+            processed_features (`Union[Dict[str, np.ndarray], BatchFeature]`):
                Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch
                of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
-            max_length: maximum length of the returned list and optionally padding length (see below)
-            padding_strategy: PaddingStrategy to use for padding.
+            max_length (`int`, *optional*):
+                Maximum length of the returned list and optionally padding length (see below)
+            padding_strategy (`PaddingStrategy`, *optional*, defaults to `PaddingStrategy.DO_NOT_PAD`):
+                PaddingStrategy to use for padding.

            - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
            - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)

@@ -248,11 +250,12 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):

                - 'left': pads on the left of the sequences
                - 'right': pads on the right of the sequences
-            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
-                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
-                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
-            return_attention_mask:
-                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
+            pad_to_multiple_of (`int`, *optional*):
+                Integer if set will pad the sequence to a multiple of the provided value. This is especially useful
+                to enable the use of Tensor Core on NVIDIA hardware with compute capability `>= 7.5` (Volta), or on
+                TPUs which benefit from having sequence lengths be a multiple of 128.
+            return_attention_mask (`bool`, *optional*):
+                Set to False to avoid returning attention mask (default: set to model specifics)
        """
        required_input = processed_features[self.model_input_names[0]]

@@ -303,15 +306,17 @@ class SequenceFeatureExtractor(FeatureExtractionMixin):
        Truncate inputs to predefined length or max length in the batch

        Args:
-            processed_features:
+            processed_features (`Union[Dict[str, np.ndarray], BatchFeature]`):
                Dictionary of input values (`np.ndarray[float]`) / input vectors (`List[np.ndarray[float]]`) or batch
                of inputs values (`List[np.ndarray[int]]`) / input vectors (`List[np.ndarray[int]]`)
-            max_length: maximum length of the returned list and optionally padding length (see below)
-            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
-                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
-                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
-            truncation:
-                (optional) Activates truncation to cut input sequences longer than `max_length` to `max_length`.
+            max_length (`int`, *optional*):
+                Maximum length of the returned list and optionally padding length (see below)
+            pad_to_multiple_of (`int`, *optional*):
+                Integer if set will pad the sequence to a multiple of the provided value. This is especially useful
+                to enable the use of Tensor Core on NVIDIA hardware with compute capability `>= 7.5` (Volta), or on
+                TPUs which benefit from having sequence lengths be a multiple of 128.
+            truncation (`bool`, *optional*):
+                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
        """
        if not truncation:
            return processed_features

@@ -40,6 +40,7 @@ from . import (
    camembert,
    canine,
    chinese_clip,
+    clap,
    clip,
    clipseg,
    codegen,

@@ -49,6 +49,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("camembert", "CamembertConfig"),
        ("canine", "CanineConfig"),
        ("chinese_clip", "ChineseCLIPConfig"),
+        ("clap", "ClapConfig"),
        ("clip", "CLIPConfig"),
        ("clipseg", "CLIPSegConfig"),
        ("codegen", "CodeGenConfig"),

@@ -222,6 +223,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("camembert", "CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("canine", "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("chinese_clip", "CHINESE_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("clap", "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST"),
        ("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("clipseg", "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("codegen", "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"),

@@ -385,6 +387,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("camembert", "CamemBERT"),
        ("canine", "CANINE"),
        ("chinese_clip", "Chinese-CLIP"),
+        ("clap", "CLAP"),
        ("clip", "CLIP"),
        ("clipseg", "CLIPSeg"),
        ("codegen", "CodeGen"),

@@ -40,6 +40,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
        ("audio-spectrogram-transformer", "ASTFeatureExtractor"),
        ("beit", "BeitFeatureExtractor"),
        ("chinese_clip", "ChineseCLIPFeatureExtractor"),
+        ("clap", "ClapFeatureExtractor"),
        ("clip", "CLIPFeatureExtractor"),
        ("clipseg", "ViTFeatureExtractor"),
        ("conditional_detr", "ConditionalDetrFeatureExtractor"),

@@ -47,6 +47,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("camembert", "CamembertModel"),
        ("canine", "CanineModel"),
        ("chinese_clip", "ChineseCLIPModel"),
+        ("clap", "ClapModel"),
        ("clip", "CLIPModel"),
        ("clipseg", "CLIPSegModel"),
        ("codegen", "CodeGenModel"),

@@ -46,6 +46,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("blip-2", "Blip2Processor"),
        ("bridgetower", "BridgeTowerProcessor"),
        ("chinese_clip", "ChineseCLIPProcessor"),
+        ("clap", "ClapProcessor"),
        ("clip", "CLIPProcessor"),
        ("clipseg", "CLIPSegProcessor"),
        ("flava", "FlavaProcessor"),

@@ -91,6 +91,13 @@ else:
            ),
            ("canine", ("CanineTokenizer", None)),
            ("chinese_clip", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
+            (
+                "clap",
+                (
+                    "RobertaTokenizer",
+                    "RobertaTokenizerFast" if is_tokenizers_available() else None,
+                ),
+            ),
            (
                "clip",
                (

@@ -0,0 +1,76 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available


_import_structure = {
    "configuration_clap": [
        "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
        "ClapAudioConfig",
        "ClapConfig",
        "ClapTextConfig",
    ],
    "processing_clap": ["ClapProcessor"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_clap"] = [
        "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST",
        "ClapModel",
        "ClapPreTrainedModel",
        "ClapTextModel",
        "ClapTextModelWithProjection",
        "ClapAudioModel",
        "ClapAudioModelWithProjection",
    ]
    _import_structure["feature_extraction_clap"] = ["ClapFeatureExtractor"]

if TYPE_CHECKING:
    from .configuration_clap import (
        CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
        ClapAudioConfig,
        ClapConfig,
        ClapTextConfig,
    )
    from .processing_clap import ClapProcessor

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .feature_extraction_clap import ClapFeatureExtractor
        from .modeling_clap import (
            CLAP_PRETRAINED_MODEL_ARCHIVE_LIST,
            ClapAudioModel,
            ClapAudioModelWithProjection,
            ClapModel,
            ClapPreTrainedModel,
            ClapTextModel,
            ClapTextModelWithProjection,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
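With these imports registered, the new classes are reachable from the top-level `transformers` namespace. A rough usage sketch follows; the exact processor keyword arguments should be checked against the processor docstring, and the zero waveform is only a placeholder:

```python
>>> import numpy as np
>>> import torch
>>> from transformers import ClapModel, ClapProcessor

>>> model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
>>> processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

>>> # text side: the processor wraps the Roberta tokenizer
>>> text_inputs = processor(text=["a dog barking", "rain falling"], return_tensors="pt")
>>> with torch.no_grad():
...     text_embeds = model.get_text_features(**text_inputs)

>>> # audio side: one second of dummy 48 kHz audio through the feature extractor
>>> audio_inputs = processor(audios=np.zeros(48_000), sampling_rate=48_000, return_tensors="pt")
>>> with torch.no_grad():
...     audio_embeds = model.get_audio_features(**audio_inputs)
```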
|
|
@ -0,0 +1,450 @@
|
||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" CLAP model configuration"""
|
||||||
|
|
||||||
|
import copy
|
||||||
|
import os
|
||||||
|
from typing import Union
|
||||||
|
|
||||||
|
from ...configuration_utils import PretrainedConfig
|
||||||
|
from ...utils import logging
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
CLAP_PRETRAINED_MODEL_ARCHIVE_LIST = {
|
||||||
|
"laion/clap-htsat-fused": "https://huggingface.co/laion/clap-htsat-fused/resolve/main/config.json",
|
||||||
|
"laion/clap-htsat-unfused": "https://huggingface.co/laion/clap-htsat-unfused/resolve/main/config.json",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class ClapTextConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a [`ClapTextModel`]. It is used to instantiate a CLAP
|
||||||
|
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
|
||||||
|
defaults will yield a similar configuration to that of the CLAP
|
||||||
|
[clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_size (`int`, *optional*, defaults to 50265):
|
||||||
|
Vocabulary size of the CLAP model. Defines the number of different tokens that can be represented by the
|
||||||
|
`inputs_ids` passed when calling [`ClapTextModel`].
|
||||||
|
hidden_size (`int`, *optional*, defaults to 768):
|
||||||
|
Dimensionality of the encoder layers and the pooler layer.
|
||||||
|
num_hidden_layers (`int`, *optional*, defaults to 12):
|
||||||
|
Number of hidden layers in the Transformer encoder.
|
||||||
|
num_attention_heads (`int`, *optional*, defaults to 12):
|
||||||
|
Number of attention heads for each attention layer in the Transformer encoder.
|
||||||
|
intermediate_size (`int`, *optional*, defaults to 3072):
|
||||||
|
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
|
||||||
|
hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
`"relu"`, `"silu"` and `"gelu_new"` are supported.
|
||||||
|
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
|
||||||
|
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||||
|
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
max_position_embeddings (`int`, *optional*, defaults to 514):
|
||||||
|
The maximum sequence length that this model might ever be used with. Typically set this to something large
|
||||||
|
just in case (e.g., 512 or 1024 or 2048).
|
||||||
|
type_vocab_size (`int`, *optional*, defaults to 1):
|
||||||
|
The vocabulary size of the `token_type_ids` passed when calling [`ClapTextModel`].
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
|
||||||
|
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
|
||||||
|
positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
|
||||||
|
[Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
|
||||||
|
For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
|
||||||
|
with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
|
||||||
|
is_decoder (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.
|
||||||
|
use_cache (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models). Only
|
||||||
|
relevant if `config.is_decoder=True`.
|
||||||
|
classifier_dropout (`float`, *optional*):
|
||||||
|
The dropout ratio for the classification head.
|
||||||
|
projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
|
||||||
|
The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`,
|
||||||
|
`"relu"`, `"silu"` and `"gelu_new"` are supported.
|
||||||
|
projection_dim (`int`, *optional*, defaults to 512):
|
||||||
|
Dimension of the projection head of the `ClapTextModelWithProjection`.
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import ClapTextConfig, ClapTextModel
|
||||||
|
|
||||||
|
>>> # Initializing a CLAP text configuration
|
||||||
|
>>> configuration = ClapTextConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a model (with random weights) from the configuration
|
||||||
|
>>> model = ClapTextModel(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```"""
|
||||||
|
model_type = "clap_text_model"
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
vocab_size=50265,
|
||||||
|
hidden_size=768,
|
||||||
|
num_hidden_layers=12,
|
||||||
|
num_attention_heads=12,
|
||||||
|
intermediate_size=3072,
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
attention_probs_dropout_prob=0.1,
|
||||||
|
max_position_embeddings=514,
|
||||||
|
type_vocab_size=1,
|
||||||
|
initializer_range=0.02,
|
||||||
|
initializer_factor=1.0,
|
||||||
|
layer_norm_eps=1e-12,
|
||||||
|
projection_dim=512,
|
||||||
|
pad_token_id=1,
|
||||||
|
bos_token_id=0,
|
||||||
|
eos_token_id=2,
|
||||||
|
position_embedding_type="absolute",
|
||||||
|
use_cache=True,
|
||||||
|
classifier_dropout=None,
|
||||||
|
projection_hidden_act="relu",
|
||||||
|
**kwargs,
|
||||||
|
):
|
||||||
|
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
|
||||||
|
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.initializer_factor = initializer_factor
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.position_embedding_type = position_embedding_type
|
||||||
|
self.use_cache = use_cache
|
||||||
|
self.classifier_dropout = classifier_dropout
|
||||||
|
self.projection_hidden_act = projection_hidden_act
|
||||||
|
self.projection_dim = projection_dim
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
|
||||||
|
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||||
|
|
||||||
|
# get the text config dict if we are loading from ClapConfig
|
||||||
|
if config_dict.get("model_type") == "clap":
|
||||||
|
config_dict = config_dict["text_config"]
|
||||||
|
|
||||||
|
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
|
||||||
|
logger.warning(
|
||||||
|
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
|
||||||
|
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
|
||||||
|
)
|
||||||
|
|
||||||
|
return cls.from_dict(config_dict, **kwargs)
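This override lets the text configuration be pulled out of a full CLAP checkpoint (the audio config class below does the same for `audio_config`). A small sketch, assuming the `laion/clap-htsat-unfused` checkpoint referenced elsewhere in this PR:

```python
>>> from transformers import ClapTextConfig

>>> # only the `text_config` sub-dict of the full CLAP config is kept
>>> text_config = ClapTextConfig.from_pretrained("laion/clap-htsat-unfused")
>>> text_config.model_type
'clap_text_model'
```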
|
||||||
|
|
||||||
|
|
||||||
|
class ClapAudioConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a [`ClapAudioModel`]. It is used to instantiate a
|
||||||
|
CLAP audio encoder according to the specified arguments, defining the model architecture. Instantiating a
|
||||||
|
configuration with the defaults will yield a similar configuration to that of the audio encoder of the CLAP
|
||||||
|
[laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
window_size (`int`, *optional*, defaults to 8):
|
||||||
|
Image size of the spectrogram
|
||||||
|
num_mel_bins (`int`, *optional*, defaults to 64):
|
||||||
|
Number of mel features used per frame. Should correspond to the value used in the `ClapProcessor` class.
|
||||||
|
spec_size (`int`, *optional*, defaults to 256):
|
||||||
|
Desired input size of the spectrogram that the model supports. It can be different from the output of the
|
||||||
|
`ClapFeatureExtractor`, in which case the input features will be resized. Corresponds to the `image_size`
|
||||||
|
of the audio models.
|
||||||
|
hidden_act (`str`, *optional*, defaults to `"gelu"`):
|
||||||
|
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||||
|
`"relu"`, `"silu"` and `"gelu_new"` are supported.
|
||||||
|
patch_size (`int`, *optional*, defaults to 4):
|
||||||
|
Patch size for the audio spectrogram
|
||||||
|
patch_stride (`list`, *optional*, defaults to `[4, 4]`):
|
||||||
|
Patch stride for the audio spectrogram
|
||||||
|
num_classes (`int`, *optional*, defaults to 527):
|
||||||
|
Number of classes used for the head training
|
||||||
|
hidden_size (`int`, *optional*, defaults to 768):
|
||||||
|
Hidden size of the output of the audio encoder. Corresponds to the dimension of the penultimate layer's
output, which is sent to the projection MLP layer.
|
||||||
|
projection_dim (`int`, *optional*, defaults to 512):
|
||||||
|
Hidden size of the projection layer.
|
||||||
|
depths (`list`, *optional*, defaults to `[2, 2, 6, 2]`):
|
||||||
|
Depths used for the Swin Layers of the audio model
|
||||||
|
num_attention_heads (`list`, *optional*, defaults to `[4, 8, 16, 32]`):
|
||||||
|
Number of attention heads used for the Swin Layers of the audio model
|
||||||
|
enable_fusion (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether or not to enable patch fusion. This is the main contribution of the authors, and should give the
|
||||||
|
best results.
|
||||||
|
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
|
||||||
|
The dropout probability for all fully connected layers in the encoder.
|
||||||
|
fusion_type (`str`, *optional*):
|
||||||
|
Fusion type used for the patch fusion.
|
||||||
|
patch_embed_input_channels (`int`, *optional*, defaults to 1):
|
||||||
|
Number of channels used for the input spectrogram
|
||||||
|
flatten_patch_embeds (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not to flatten the patch embeddings
|
||||||
|
patch_embeds_hidden_size (`int`, *optional*, defaults to 96):
|
||||||
|
Hidden size of the patch embeddings. It is used as the number of output channels.
|
||||||
|
enable_patch_layer_norm (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not to enable layer normalization for the patch embeddings
|
||||||
|
drop_path_rate (`float`, *optional*, defaults to 0.0):
|
||||||
|
Drop path rate for the patch fusion
|
||||||
|
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
qkv_bias (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not to add a bias to the query, key, value projections.
|
||||||
|
mlp_ratio (`float`, *optional*, defaults to 4.0):
|
||||||
|
Ratio of the mlp hidden dim to embedding dim.
|
||||||
|
aff_block_r (`int`, *optional*, defaults to 4):
|
||||||
|
downsize_ratio used in the AudioFF block
|
||||||
|
num_hidden_layers (`int`, *optional*, defaults to 4):
|
||||||
|
Number of hidden layers in the Transformer encoder.
|
||||||
|
projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
|
||||||
|
The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`,
|
||||||
|
`"relu"`, `"silu"` and `"gelu_new"` are supported.
|
||||||
|
layer_norm_eps (`float`, *optional*, defaults to `1e-5`):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
initializer_factor (`float`, *optional*, defaults to 1.0):
|
||||||
|
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
|
||||||
|
testing).
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import ClapAudioConfig, ClapAudioModel
|
||||||
|
|
||||||
|
>>> # Initializing a ClapAudioConfig with laion/clap-htsat-fused style configuration
|
||||||
|
>>> configuration = ClapAudioConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a ClapAudioModel (with random weights) from the laion/clap-htsat-fused style configuration
|
||||||
|
>>> model = ClapAudioModel(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```"""
|
||||||
|
|
||||||
|
model_type = "clap_audio_model"
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
window_size=8,
|
||||||
|
num_mel_bins=64,
|
||||||
|
spec_size=256,
|
||||||
|
hidden_act="gelu",
|
||||||
|
patch_size=4,
|
||||||
|
patch_stride=[4, 4],
|
||||||
|
num_classes=527,
|
||||||
|
hidden_size=768,
|
||||||
|
projection_dim=512,
|
||||||
|
depths=[2, 2, 6, 2],
|
||||||
|
num_attention_heads=[4, 8, 16, 32],
|
||||||
|
enable_fusion=False,
|
||||||
|
hidden_dropout_prob=0.1,
|
||||||
|
fusion_type=None,
|
||||||
|
patch_embed_input_channels=1,
|
||||||
|
flatten_patch_embeds=True,
|
||||||
|
patch_embeds_hidden_size=96,
|
||||||
|
enable_patch_layer_norm=True,
|
||||||
|
drop_path_rate=0.0,
|
||||||
|
attention_probs_dropout_prob=0.0,
|
||||||
|
qkv_bias=True,
|
||||||
|
mlp_ratio=4.0,
|
||||||
|
aff_block_r=4,
|
||||||
|
num_hidden_layers=4,
|
||||||
|
projection_hidden_act="relu",
|
||||||
|
layer_norm_eps=1e-5,
|
||||||
|
initializer_factor=1.0,
|
||||||
|
**kwargs,
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
self.window_size = window_size
|
||||||
|
self.num_mel_bins = num_mel_bins
|
||||||
|
self.spec_size = spec_size
|
||||||
|
self.patch_size = patch_size
|
||||||
|
self.patch_stride = patch_stride
|
||||||
|
self.num_classes = num_classes
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.depths = depths
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.window_size = window_size
|
||||||
|
self.enable_fusion = enable_fusion
|
||||||
|
self.fusion_type = fusion_type
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.projection_dim = projection_dim
|
||||||
|
self.flatten_patch_embeds = flatten_patch_embeds
|
||||||
|
self.patch_embeds_hidden_size = patch_embeds_hidden_size
|
||||||
|
self.enable_patch_layer_norm = enable_patch_layer_norm
|
||||||
|
self.drop_path_rate = drop_path_rate
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.qkv_bias = qkv_bias
|
||||||
|
self.mlp_ratio = mlp_ratio
|
||||||
|
self.patch_embed_input_channels = patch_embed_input_channels
|
||||||
|
self.aff_block_r = aff_block_r
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.initializer_factor = initializer_factor
|
||||||
|
self.projection_hidden_act = projection_hidden_act
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
|
||||||
|
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||||
|
|
||||||
|
# get the audio config dict if we are loading from ClapConfig
|
||||||
|
if config_dict.get("model_type") == "clap":
|
||||||
|
config_dict = config_dict["audio_config"]
|
||||||
|
|
||||||
|
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
|
||||||
|
logger.warning(
|
||||||
|
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
|
||||||
|
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
|
||||||
|
)
|
||||||
|
|
||||||
|
return cls.from_dict(config_dict, **kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
class ClapConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
[`ClapConfig`] is the configuration class to store the configuration of a [`ClapModel`]. It is used to instantiate
|
||||||
|
a CLAP model according to the specified arguments, defining the text model and audio model configs. Instantiating a
|
||||||
|
configuration with the defaults will yield a similar configuration to that of the CLAP
|
||||||
|
[laion/clap-htsat-fused](https://huggingface.co/laion/clap-htsat-fused) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text_config (`dict`, *optional*):
|
||||||
|
Dictionary of configuration options used to initialize [`ClapTextConfig`].
|
||||||
|
audio_config (`dict`, *optional*):
|
||||||
|
Dictionary of configuration options used to initialize [`ClapAudioConfig`].
|
||||||
|
projection_dim (`int`, *optional*, defaults to 512):
|
||||||
|
Dimensionality of text and audio projection layers.
|
||||||
|
logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
|
||||||
|
The initial value of the *logit_scale* parameter. Default is used as per the original CLAP implementation.
|
||||||
|
projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
|
||||||
|
Activation function for the projection layers.
|
||||||
|
initializer_factor (`float`, *optional*, defaults to 1.0):
|
||||||
|
Factor to scale the initialization of the model weights.
|
||||||
|
kwargs (*optional*):
|
||||||
|
Dictionary of keyword arguments.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import ClapConfig, ClapModel
|
||||||
|
|
||||||
|
>>> # Initializing a ClapConfig with laion-ai/base style configuration
|
||||||
|
>>> configuration = ClapConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a ClapModel (with random weights) from the laion-ai/base style configuration
|
||||||
|
>>> model = ClapModel(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
|
||||||
|
>>> # We can also initialize a ClapConfig from a ClapTextConfig and a ClapAudioConfig
|
||||||
|
>>> from transformers import ClapTextConfig, ClapAudioConfig
|
||||||
|
|
||||||
|
>>> # Initializing a ClapText and ClapAudioConfig configuration
|
||||||
|
>>> config_text = ClapTextConfig()
|
||||||
|
>>> config_audio = ClapAudioConfig()
|
||||||
|
|
||||||
|
>>> config = ClapConfig.from_text_audio_configs(config_text, config_audio)
|
||||||
|
```"""
|
||||||
|
|
||||||
|
model_type = "clap"
|
||||||
|
is_composition = True
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
text_config=None,
|
||||||
|
audio_config=None,
|
||||||
|
logit_scale_init_value=(1 / 0.07),
|
||||||
|
projection_dim=512,
|
||||||
|
projection_hidden_act="relu",
|
||||||
|
initializer_factor=1.0,
|
||||||
|
**kwargs,
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
|
||||||
|
if text_config is None:
|
||||||
|
text_config = {}
|
||||||
|
logger.info("text_config is None. Initializing the ClapTextConfig with default values.")
|
||||||
|
|
||||||
|
if audio_config is None:
|
||||||
|
audio_config = {}
|
||||||
|
logger.info("audio_config is None. initializing the ClapAudioConfig with default values.")
|
||||||
|
|
||||||
|
self.text_config = ClapTextConfig(**text_config)
|
||||||
|
self.audio_config = ClapAudioConfig(**audio_config)
|
||||||
|
self.text_config.projection_dim = projection_dim
|
||||||
|
self.audio_config.projection_dim = projection_dim
|
||||||
|
|
||||||
|
self.text_config.projection_hidden_act = projection_hidden_act
|
||||||
|
self.audio_config.projection_hidden_act = projection_hidden_act
|
||||||
|
|
||||||
|
self.projection_dim = projection_dim
|
||||||
|
self.projection_hidden_act = projection_hidden_act
|
||||||
|
self.hidden_size = self.text_config.hidden_size
|
||||||
|
|
||||||
|
self.logit_scale_init_value = logit_scale_init_value
|
||||||
|
self.initializer_factor = initializer_factor
|
||||||
|
self.num_hidden_layers = self.text_config.num_hidden_layers + len(self.audio_config.depths)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_text_audio_configs(cls, text_config: ClapTextConfig, audio_config: ClapAudioConfig, **kwargs):
|
||||||
|
r"""
|
||||||
|
Instantiate a [`ClapConfig`] (or a derived class) from clap text model configuration and clap audio model
|
||||||
|
configuration.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
[`ClapConfig`]: An instance of a configuration object
|
||||||
|
"""
|
||||||
|
|
||||||
|
return cls(text_config=text_config.to_dict(), audio_config=audio_config.to_dict(), **kwargs)
|
||||||
|
|
||||||
|
def to_dict(self):
|
||||||
|
"""
|
||||||
|
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
|
||||||
|
"""
|
||||||
|
output = copy.deepcopy(self.__dict__)
|
||||||
|
output["text_config"] = self.text_config.to_dict()
|
||||||
|
output["audio_config"] = self.audio_config.to_dict()
|
||||||
|
output["model_type"] = self.__class__.model_type
|
||||||
|
return output
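A short round-trip sketch of the composite config (the `from_dict` call is the generic `PretrainedConfig` helper, not something added by this PR):

```python
>>> from transformers import ClapConfig

>>> config = ClapConfig(projection_dim=256)
>>> config_dict = config.to_dict()            # contains nested dicts for `text_config` and `audio_config`
>>> restored = ClapConfig.from_dict(config_dict)
>>> restored.projection_dim, restored.audio_config.projection_dim
(256, 256)
```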
|
|
@ -0,0 +1,123 @@
|
||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import re
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from CLAP import create_model
|
||||||
|
|
||||||
|
from transformers import AutoFeatureExtractor, ClapConfig, ClapModel
|
||||||
|
|
||||||
|
|
||||||
|
KEYS_TO_MODIFY_MAPPING = {
|
||||||
|
"text_branch": "text_model",
|
||||||
|
"audio_branch": "audio_model.audio_encoder",
|
||||||
|
"attn": "attention.self",
|
||||||
|
"self.proj": "output.dense",
|
||||||
|
"attention.self_mask": "attn_mask",
|
||||||
|
"mlp.fc1": "intermediate.dense",
|
||||||
|
"mlp.fc2": "output.dense",
|
||||||
|
"norm1": "layernorm_before",
|
||||||
|
"norm2": "layernorm_after",
|
||||||
|
"bn0": "batch_norm",
|
||||||
|
}
|
||||||
|
|
||||||
|
processor = AutoFeatureExtractor.from_pretrained("laion/clap-htsat-unfused", truncation="rand_trunc")
|
||||||
|
|
||||||
|
|
||||||
|
def init_clap(checkpoint_path, enable_fusion=False):
|
||||||
|
model, model_cfg = create_model(
|
||||||
|
"HTSAT-tiny",
|
||||||
|
"roberta",
|
||||||
|
checkpoint_path,
|
||||||
|
precision="fp32",
|
||||||
|
device="cuda:0" if torch.cuda.is_available() else "cpu",
|
||||||
|
enable_fusion=enable_fusion,
|
||||||
|
fusion_type="aff_2d" if enable_fusion else None,
|
||||||
|
)
|
||||||
|
return model, model_cfg
|
||||||
|
|
||||||
|
|
||||||
|
def rename_state_dict(state_dict):
|
||||||
|
model_state_dict = {}
|
||||||
|
|
||||||
|
sequential_layers_pattern = r".*sequential.(\d+).*"
|
||||||
|
text_projection_pattern = r".*_projection.(\d+).*"
|
||||||
|
|
||||||
|
for key, value in state_dict.items():
|
||||||
|
# check if any key needs to be modified
|
||||||
|
for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
|
||||||
|
if key_to_modify in key:
|
||||||
|
key = key.replace(key_to_modify, new_key)
|
||||||
|
|
||||||
|
if re.match(sequential_layers_pattern, key):
|
||||||
|
# replace sequential layers with list
|
||||||
|
sequential_layer = re.match(sequential_layers_pattern, key).group(1)
|
||||||
|
|
||||||
|
key = key.replace(f"sequential.{sequential_layer}.", f"layers.{int(sequential_layer)//3}.linear.")
|
||||||
|
elif re.match(text_projection_pattern, key):
projection_layer = int(re.match(text_projection_pattern, key).group(1))

# Because in CLAP they use `nn.Sequential`...
transformers_projection_layer = 1 if projection_layer == 0 else 2

key = key.replace(f"_projection.{projection_layer}.", f"_projection.linear{transformers_projection_layer}.")
|
||||||
|
|
||||||
|
if "audio" and "qkv" in key:
|
||||||
|
# split qkv into query key and value
|
||||||
|
mixed_qkv = value
|
||||||
|
qkv_dim = mixed_qkv.size(0) // 3
|
||||||
|
|
||||||
|
query_layer = mixed_qkv[:qkv_dim]
|
||||||
|
key_layer = mixed_qkv[qkv_dim : qkv_dim * 2]
|
||||||
|
value_layer = mixed_qkv[qkv_dim * 2 :]
|
||||||
|
|
||||||
|
model_state_dict[key.replace("qkv", "query")] = query_layer
|
||||||
|
model_state_dict[key.replace("qkv", "key")] = key_layer
|
||||||
|
model_state_dict[key.replace("qkv", "value")] = value_layer
|
||||||
|
else:
|
||||||
|
model_state_dict[key] = value
|
||||||
|
|
||||||
|
return model_state_dict
|
||||||
|
|
||||||
|
|
||||||
|
def convert_clap_checkpoint(checkpoint_path, pytorch_dump_folder_path, config_path, enable_fusion=False):
|
||||||
|
clap_model, clap_model_cfg = init_clap(checkpoint_path, enable_fusion=enable_fusion)
|
||||||
|
|
||||||
|
clap_model.eval()
|
||||||
|
state_dict = clap_model.state_dict()
|
||||||
|
state_dict = rename_state_dict(state_dict)
|
||||||
|
|
||||||
|
transformers_config = ClapConfig()
|
||||||
|
transformers_config.audio_config.enable_fusion = enable_fusion
|
||||||
|
model = ClapModel(transformers_config)
|
||||||
|
|
||||||
|
# ignore the spectrogram embedding layer
|
||||||
|
model.load_state_dict(state_dict, strict=False)
|
||||||
|
|
||||||
|
model.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
transformers_config.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
|
||||||
|
parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint")
|
||||||
|
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
|
||||||
|
parser.add_argument("--enable_fusion", action="store_true", help="Whether to enable fusion or not")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
convert_clap_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.enable_fusion)
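Once converted, the dump folder can be loaded like any other checkpoint. In the sketch below, `./clap_converted` is a hypothetical path standing in for whatever `--pytorch_dump_folder_path` was passed:

```python
>>> from transformers import ClapModel, ClapProcessor

>>> # "./clap_converted" is a placeholder for the --pytorch_dump_folder_path used above
>>> model = ClapModel.from_pretrained("./clap_converted")
>>> # the processor (feature extractor + Roberta tokenizer) is not written by this script,
>>> # so it is loaded from the hub checkpoint instead
>>> processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
```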
|
|
@ -0,0 +1,356 @@
|
||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Feature extractor class for CLAP."""
|
||||||
|
|
||||||
|
|
||||||
|
import copy
|
||||||
|
from typing import Any, Dict, List, Optional, Union
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from ...audio_utils import fram_wave, get_mel_filter_banks, power_to_db, stft
|
||||||
|
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
|
||||||
|
from ...feature_extraction_utils import BatchFeature
|
||||||
|
from ...utils import TensorType, logging
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class ClapFeatureExtractor(SequenceFeatureExtractor):
|
||||||
|
r"""
|
||||||
|
Constructs a CLAP feature extractor.
|
||||||
|
|
||||||
|
This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
|
||||||
|
most of the main methods. Users should refer to this superclass for more information regarding those methods.
|
||||||
|
|
||||||
|
This class extracts mel-filter bank features from raw speech using a custom numpy implementation of the *Short Time
|
||||||
|
Fourier Transform* (STFT) which should match pytorch's `torch.stft` equivalent.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
feature_size (`int`, defaults to 64):
|
||||||
|
The feature dimension of the extracted Mel spectrograms. This corresponds to the number of mel filters
|
||||||
|
(`n_mels`).
|
||||||
|
sampling_rate (`int`, defaults to 48_000):
|
||||||
|
The sampling rate at which the audio files should be digitized, expressed in hertz (Hz). This only serves
|
||||||
|
to warn users if the audio fed to the feature extractor does not have the same sampling rate.
|
||||||
|
hop_length (`int`, defaults to 480):
|
||||||
|
Length of the overlapping windows for the STFT used to obtain the Mel Spectrogram. The audio will be split
|
||||||
|
in smaller `frames` with a step of `hop_length` between each frame.
|
||||||
|
max_length_s (`int`, defaults to 10):
|
||||||
|
The maximum input length of the model in seconds. This is used to pad the audio.
|
||||||
|
fft_window_size (`int`, defaults to 1024):
|
||||||
|
Size of the window (in samples) on which the Fourier transform is applied. This controls the frequency
|
||||||
|
resolution of the spectrogram. For example, 400 means that the Fourier transform is computed on windows of 400 samples.
|
||||||
|
padding_value (`float`, *optional*, defaults to 0.0):
|
||||||
|
Padding value used to pad the audio. Should correspond to silences.
|
||||||
|
return_attention_mask (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether or not the model should return the attention masks corresponding to the input.
|
||||||
|
frequency_min (`float`, *optional*, default to 0):
|
||||||
|
The lowest frequency of interest. The STFT will not be computed for values below this.
|
||||||
|
frequency_max (`float`, *optional*, default to 14_000):
|
||||||
|
The highest frequency of interest. The STFT will not be computed for values above this.
|
||||||
|
top_db (`float`, *optional*):
|
||||||
|
The highest decibel value used to convert the mel spectrogram to the log scale. For more details see the
|
||||||
|
`audio_utils.power_to_db` function
|
||||||
|
truncation (`str`, *optional*, defaults to `"fusion"`):
|
||||||
|
Truncation pattern for long audio inputs. Two patterns are available:
|
||||||
|
- `fusion` will use `_random_mel_fusion`, which stacks 3 random crops from the mel spectrogram and a
|
||||||
|
downsampled version of the entire mel spectrogram.
|
||||||
|
If `config.fusion` is set to True, shorter audios also need to return 4 mels, which will just be a copy
|
||||||
|
of the original mel obtained from the padded audio.
|
||||||
|
- `rand_trunc` will select a random crop of the mel spectrogram.
|
||||||
|
padding (`str`, *optional*, defaults to `"repeatpad"`):
|
||||||
|
Padding pattern for shorter audio inputs. Three patterns were originally implemented:
|
||||||
|
- `repeatpad`: the audio is repeated, and then padded to fit the `max_length`.
|
||||||
|
- `repeat`: the audio is repeated and then cut to fit the `max_length`
|
||||||
|
- `pad`: the audio is padded.
|
||||||
|
"""
|
||||||
|
|
||||||
|
    model_input_names = ["input_features", "is_longer"]

    def __init__(
        self,
        feature_size=64,
        sampling_rate=48_000,
        hop_length=480,
        max_length_s=10,
        fft_window_size=1024,
        padding_value=0.0,
        return_attention_mask=False,  # pad inputs to max length with silence token (zero) and no attention mask
        frequency_min: float = 0,
        frequency_max: float = 14_000,
        top_db: int = None,
        truncation: str = "fusion",
        padding: str = "repeatpad",
        **kwargs,
    ):
        super().__init__(
            feature_size=feature_size,
            sampling_rate=sampling_rate,
            padding_value=padding_value,
            return_attention_mask=return_attention_mask,
            **kwargs,
        )
        self.top_db = top_db
        self.truncation = truncation
        self.padding = padding
        self.fft_window_size = fft_window_size
        self.nb_frequency_bins = (fft_window_size >> 1) + 1
        self.hop_length = hop_length
        self.max_length_s = max_length_s
        self.nb_max_samples = max_length_s * sampling_rate
        self.sampling_rate = sampling_rate
        self.frequency_min = frequency_min
        self.frequency_max = frequency_max
        self.mel_filters = get_mel_filter_banks(
            nb_frequency_bins=self.nb_frequency_bins,
            nb_mel_filters=feature_size,
            frequency_min=frequency_min,
            frequency_max=frequency_max,
            sample_rate=sampling_rate,
            norm=None,
            mel_scale="htk",
        )
        self.mel_filters_slaney = get_mel_filter_banks(
            nb_frequency_bins=self.nb_frequency_bins,
            nb_mel_filters=feature_size,
            frequency_min=frequency_min,
            frequency_max=frequency_max,
            sample_rate=sampling_rate,
            norm="slaney",
            mel_scale="slaney",
        )

    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes this instance to a Python dictionary.

        Returns:
            `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance, except for
            the mel filter banks, which do not need to be saved or printed as they are too long.
        """
        output = copy.deepcopy(self.__dict__)
        output["feature_extractor_type"] = self.__class__.__name__
        if "mel_filters" in output:
            del output["mel_filters"]
        if "mel_filters_slaney" in output:
            del output["mel_filters_slaney"]
        return output

    def _np_extract_fbank_features(self, waveform: np.array, mel_filters: Optional[np.array] = None) -> np.ndarray:
        """
        Compute the log-Mel spectrogram of the provided `waveform` using the `hanning` window. In CLAP, two different
        filter banks are used depending on the truncation pattern:
            - `self.mel_filters`: they correspond to the default parameters of `torchaudio` which can be obtained from
              calling `torchaudio.transforms.MelSpectrogram().mel_scale.fb`. These filters are used when `truncation`
              is set to `"fusion"`.
            - `self.mel_filters_slaney`: they correspond to the default parameters of `torchlibrosa` which used
              `librosa.filters.mel` when computing the mel spectrogram. These filters were only used in the original
              implementation when the truncation mode is not `"fusion"`.
        """
        window = np.hanning(self.fft_window_size + 1)[:-1]
        frames = fram_wave(waveform, self.hop_length, self.fft_window_size)
        spectrogram = stft(frames, window, fft_window_size=self.fft_window_size)

        magnitudes = np.abs(spectrogram) ** 2
        mel_spectrogram = np.matmul(mel_filters.T, magnitudes)
        log_mel_spectrogram = power_to_db(mel_spectrogram).T
        log_mel_spectrogram = np.asarray(log_mel_spectrogram, np.float32)
        return log_mel_spectrogram

    def _random_mel_fusion(self, mel, total_frames, chunk_frames):
        ranges = np.array_split(list(range(0, total_frames - chunk_frames + 1)), 3)
        if len(ranges[1]) == 0:
            # if the audio is too short, we just use the first chunk
            ranges[1] = [0]
        if len(ranges[2]) == 0:
            # if the audio is too short, we just use the first chunk
            ranges[2] = [0]
        # randomly choose index for each part
        idx_front = np.random.choice(ranges[0])
        idx_middle = np.random.choice(ranges[1])
        idx_back = np.random.choice(ranges[2])

        mel_chunk_front = mel[idx_front : idx_front + chunk_frames, :]
        mel_chunk_middle = mel[idx_middle : idx_middle + chunk_frames, :]
        mel_chunk_back = mel[idx_back : idx_back + chunk_frames, :]

        mel = torch.tensor(mel[None, None, :])
        mel_shrink = torch.nn.functional.interpolate(
            mel, size=[chunk_frames, 64], mode="bilinear", align_corners=False, antialias=False
        )
        mel_shrink = mel_shrink[0][0].numpy()
        mel_fusion = np.stack([mel_chunk_front, mel_chunk_middle, mel_chunk_back, mel_shrink], axis=0)
        return mel_fusion

    def _get_input_mel(self, waveform: np.array, max_length, truncation, padding) -> np.array:
        """
        Extracts the mel spectrogram and prepares it for the model based on the `truncation` and `padding` arguments.
        Four different paths are possible:
            - `truncation="fusion"` and the length of the waveform is greater than the max length: the mel spectrogram
              will be computed on the entire audio. 3 random crops and a downsampled version of the full mel
              spectrogram are then stacked together. They will later be used for `feature_fusion`.
            - `truncation="rand_trunc"` and the length of the waveform is smaller than the max length: the audio is
              padded based on `padding`.
            - `truncation="fusion"` and the length of the waveform is smaller than the max length: the audio is padded
              based on `padding`, and is repeated `4` times.
            - `truncation="rand_trunc"` and the length of the waveform is greater than the max length: the mel
              spectrogram will be computed on a random crop of the waveform.
        """
        if waveform.shape[0] > max_length:
            if truncation == "rand_trunc":
                longer = True
                # random crop to max_length (for compatibility) -> this should be handled by self.pad
                overflow = len(waveform) - max_length
                idx = np.random.randint(0, overflow + 1)
                waveform = waveform[idx : idx + max_length]
                input_mel = self._np_extract_fbank_features(waveform, self.mel_filters_slaney)[None, :]
            elif truncation == "fusion":
                mel = self._np_extract_fbank_features(waveform, self.mel_filters)
                chunk_frames = max_length // self.hop_length + 1  # the +1 related to how the spectrogram is computed
                total_frames = mel.shape[0]
                if chunk_frames == total_frames:
                    # there is a corner case where the audio length is larger than max_length but smaller than
                    # max_length + hop_length. In this case, we just use the whole audio.
                    input_mel = np.stack([mel, mel, mel, mel], axis=0)
                    longer = False
                else:
                    input_mel = self._random_mel_fusion(mel, total_frames, chunk_frames)
                    longer = True
            else:
                raise NotImplementedError(f"data_truncating {truncation} not implemented")

        else:
            longer = False
            # only use repeat as a new possible value for padding. you repeat the audio before applying the usual
            # max_length padding
            if waveform.shape[0] < max_length:
                if padding == "repeat":
                    n_repeat = int(max_length / len(waveform))
                    waveform = np.stack(np.tile(waveform, n_repeat + 1))[:max_length]
                if padding == "repeatpad":
                    n_repeat = int(max_length / len(waveform))
                    waveform = np.stack(np.tile(waveform, n_repeat))
                waveform = np.pad(waveform, (0, max_length - waveform.shape[0]), mode="constant", constant_values=0)

            if truncation == "fusion":
                input_mel = self._np_extract_fbank_features(waveform, self.mel_filters)
                input_mel = np.stack([input_mel, input_mel, input_mel, input_mel], axis=0)
            else:
                input_mel = self._np_extract_fbank_features(waveform, self.mel_filters_slaney)[None, :]

        return input_mel, longer
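
    # Worked example of the four paths handled by `_get_input_mel` above, assuming the default
    # configuration (48 kHz audio, `hop_length=480`, `max_length_s=10`, 64 mel bins); the exact
    # frame counts are illustrative rather than guaranteed:
    #   - a 15 s clip (720_000 samples > 480_000) with `truncation="fusion"` gives
    #     chunk_frames = 480_000 // 480 + 1 = 1001 and a (4, 1001, 64) stack of mels,
    #   - the same clip with `truncation="rand_trunc"` is first randomly cropped to 480_000 samples
    #     and yields a single (1, frames, 64) mel,
    #   - a 3 s clip with `padding="repeatpad"` is tiled 3 times, then zero-padded to 480_000 samples,
    #   - a 3 s clip with `truncation="fusion"` additionally has its (padded) mel copied 4 times.
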
    def __call__(
        self,
        raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
        truncation: str = None,
        padding: Optional[str] = None,
        max_length: Optional[int] = None,
        sampling_rate: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs,
    ) -> BatchFeature:
        """
        Main method to featurize and prepare for the model one or several sequence(s).

        Args:
            raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
                values, a list of numpy arrays or a list of list of float values.
            truncation (`str`, *optional*):
                Truncation pattern for long audio inputs. Two patterns are available:
                    - `fusion` will use `_random_mel_fusion`, which stacks 3 random crops from the mel spectrogram and
                      a downsampled version of the entire mel spectrogram.
                      If `config.fusion` is set to True, shorter audios also need to return 4 mels, which will just be
                      a copy of the original mel obtained from the padded audio.
                    - `rand_trunc` will select a random crop of the mel spectrogram.
            padding (`str`, *optional*):
                Padding pattern for shorter audio inputs. Three patterns were originally implemented:
                    - `repeatpad`: the audio is repeated, and then padded to fit the `max_length`.
                    - `repeat`: the audio is repeated and then cut to fit the `max_length`.
                    - `pad`: the audio is padded.
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return Numpy `np.ndarray` objects.
            sampling_rate (`int`, *optional*):
                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
                `sampling_rate` at the forward call to prevent silent errors and allow automatic speech recognition
                pipelines to work correctly.
        """
        truncation = truncation if truncation is not None else self.truncation
        padding = padding if padding else self.padding

        if sampling_rate is not None:
            if sampling_rate != self.sampling_rate:
                raise ValueError(
                    f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a"
                    f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input"
                    f" was sampled with {self.sampling_rate} and not {sampling_rate}."
                )
        else:
            logger.warning(
                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
                "Failing to do so can result in silent errors that might be hard to debug."
            )

        is_batched = bool(
            isinstance(raw_speech, (list, tuple))
            and (isinstance(raw_speech[0], np.ndarray) or isinstance(raw_speech[0], (tuple, list)))
        )

        if is_batched:
            raw_speech = [np.asarray(speech, dtype=np.float64) for speech in raw_speech]
        elif not is_batched and not isinstance(raw_speech, np.ndarray):
            raw_speech = np.asarray(raw_speech, dtype=np.float64)
        elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
            raw_speech = raw_speech.astype(np.float64)

        # always return batch
        if not is_batched:
            raw_speech = [np.asarray(raw_speech)]

        # convert to mel spectrogram, truncate and pad if needed.
        padded_inputs = [
            self._get_input_mel(waveform, max_length if max_length else self.nb_max_samples, truncation, padding)
            for waveform in raw_speech
        ]

        input_mel = []
        is_longer = []
        for mel, longer in padded_inputs:
            input_mel.append(mel)
            is_longer.append(longer)

        if truncation == "fusion" and sum(is_longer) == 0:
            # if no audio is longer than 10s, then randomly select one audio to be longer
            rand_idx = np.random.randint(0, len(input_mel))
            is_longer[rand_idx] = True

        if isinstance(input_mel[0], List):
            input_mel = [np.asarray(feature, dtype=np.float64) for feature in input_mel]

        input_features = {"input_features": input_mel, "is_longer": is_longer}
        input_features = BatchFeature(input_features)

        if return_tensors is not None:
            input_features = input_features.convert_to_tensors(return_tensors)

        return input_features
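
To make the behaviour of the feature extractor above concrete, here is a minimal usage sketch. The shapes in the comments are derived from the defaults in `__init__` (48 kHz, `hop_length=480`, `max_length_s=10`, 64 mel bins) and should be read as expectations under those assumptions, not as tested guarantees.

import numpy as np

from transformers import ClapFeatureExtractor

feature_extractor = ClapFeatureExtractor()  # defaults: 48 kHz, 64 mel bins, truncation="fusion", padding="repeatpad"

# a 3 second clip (shorter than max_length_s=10) and a 15 second clip (longer)
short_clip = np.sin(2 * np.pi * 440 * np.arange(3 * 48_000) / 48_000).astype(np.float64)
long_clip = np.sin(2 * np.pi * 440 * np.arange(15 * 48_000) / 48_000).astype(np.float64)

inputs = feature_extractor([short_clip, long_clip], sampling_rate=48_000, return_tensors="np")

# with truncation="fusion", each example is a stack of 4 mel spectrograms; the expected shape is
# (batch, 4, max_length // hop_length + 1, feature_size) = (2, 4, 1001, 64)
print(inputs["input_features"].shape)
# the short clip was padded (is_longer False), the long clip was fused (is_longer True)
print(inputs["is_longer"])
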
@@ -0,0 +1,116 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Audio/Text processor class for CLAP
"""

from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding


class ClapProcessor(ProcessorMixin):
    r"""
    Constructs a CLAP processor which wraps a CLAP feature extractor and a RoBERTa tokenizer into a single processor.

    [`ClapProcessor`] offers all the functionalities of [`ClapFeatureExtractor`] and [`RobertaTokenizerFast`]. See the
    [`~ClapProcessor.__call__`] and [`~ClapProcessor.decode`] for more information.

    Args:
        feature_extractor ([`ClapFeatureExtractor`]):
            The audio processor is a required input.
        tokenizer ([`RobertaTokenizerFast`]):
            The tokenizer is a required input.
    """
    feature_extractor_class = "ClapFeatureExtractor"
    tokenizer_class = ("RobertaTokenizer", "RobertaTokenizerFast")

    def __init__(self, feature_extractor, tokenizer):
        super().__init__(feature_extractor, tokenizer)

    def __call__(self, text=None, audios=None, return_tensors=None, **kwargs):
        """
        Main method to prepare for the model one or several sequence(s) and audio(s). This method forwards the `text`
        and `kwargs` arguments to RobertaTokenizerFast's [`~RobertaTokenizerFast.__call__`] if `text` is not `None` to
        encode the text. To prepare the audio(s), this method forwards the `audios` and `kwargs` arguments to
        ClapFeatureExtractor's [`~ClapFeatureExtractor.__call__`] if `audios` is not `None`. Please refer to the
        docstring of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            audios (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The audio or batch of audios to be prepared. Each audio can be a NumPy array or a PyTorch tensor. In
                case of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is the number of
                channels and T the sample length of the audio.

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is
              not `None`).
            - **audio_features** -- Audio features to be fed to a model. Returned when `audios` is not `None`.
        """
        sampling_rate = kwargs.pop("sampling_rate", None)

        if text is None and audios is None:
            raise ValueError("You have to specify either text or audios. Both cannot be None.")

        if text is not None:
            encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

        if audios is not None:
            audio_features = self.feature_extractor(
                audios, sampling_rate=sampling_rate, return_tensors=return_tensors, **kwargs
            )

        if text is not None and audios is not None:
            encoding["input_features"] = audio_features.input_features
            return encoding
        elif text is not None:
            return encoding
        else:
            return BatchEncoding(data=dict(**audio_features), tensor_type=return_tensors)

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer
        to the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        feature_extractor_input_names = self.feature_extractor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + feature_extractor_input_names))
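
To make the intended workflow concrete, here is a small usage sketch of the processor defined above. It is illustrative only: the `roberta-base` tokenizer checkpoint is an assumption made for this example (any RoBERTa tokenizer matching `tokenizer_class` works), and the audio is random noise rather than a real recording.

import numpy as np

from transformers import ClapFeatureExtractor, ClapProcessor, RobertaTokenizerFast

# assumption: the processor can be backed by any RoBERTa tokenizer, e.g. the `roberta-base` checkpoint
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
processor = ClapProcessor(feature_extractor=ClapFeatureExtractor(), tokenizer=tokenizer)

text = "the sound of a dog barking"
audio = np.random.randn(48_000 * 5)  # 5 seconds of noise at 48 kHz

inputs = processor(text=text, audios=audio, sampling_rate=48_000, return_tensors="pt")
# the text goes through the tokenizer, the audio through the feature extractor
print(sorted(inputs.keys()))  # expected: ['attention_mask', 'input_features', 'input_ids']
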
@@ -1464,6 +1464,58 @@ class ChineseCLIPVisionModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


CLAP_PRETRAINED_MODEL_ARCHIVE_LIST = None


class ClapAudioModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ClapAudioModelWithProjection(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ClapFeatureExtractor(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ClapModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ClapPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ClapTextModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class ClapTextModelWithProjection(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


CLIP_PRETRAINED_MODEL_ARCHIVE_LIST = None

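The dummy classes above follow the placeholder pattern used for optional backends: touching or instantiating the class without `torch` installed raises an informative error instead of a cryptic import failure. The following is a minimal, self-contained illustration of that pattern written for this review; it is not the actual `transformers.utils` implementation, and `ClapModelPlaceholder` is a made-up name.

import importlib.util


def requires_backends(obj, backends):
    """Raise an informative ImportError if any required package is missing."""
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    missing = [backend for backend in backends if importlib.util.find_spec(backend) is None]
    if missing:
        raise ImportError(f"{name} requires the following backends to be installed: {missing}")


class DummyObject(type):
    """Metaclass that guards attribute access on the placeholder class behind the backend check."""

    def __getattribute__(cls, key):
        if key.startswith("_"):
            return super().__getattribute__(key)
        requires_backends(cls, cls._backends)
        return super().__getattribute__(key)


class ClapModelPlaceholder(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])
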
@@ -0,0 +1,267 @@
|
||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 HuggingFace Inc.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
|
||||||
|
import itertools
|
||||||
|
import random
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from transformers import ClapFeatureExtractor
|
||||||
|
from transformers.testing_utils import require_torch, require_torchaudio
|
||||||
|
from transformers.utils.import_utils import is_torch_available
|
||||||
|
|
||||||
|
from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
global_rng = random.Random()
|
||||||
|
|
||||||
|
|
||||||
|
# Copied from tests.models.whisper.test_feature_extraction_whisper.floats_list
|
||||||
|
def floats_list(shape, scale=1.0, rng=None, name=None):
|
||||||
|
"""Creates a random float32 tensor"""
|
||||||
|
if rng is None:
|
||||||
|
rng = global_rng
|
||||||
|
|
||||||
|
values = []
|
||||||
|
for batch_idx in range(shape[0]):
|
||||||
|
values.append([])
|
||||||
|
for _ in range(shape[1]):
|
||||||
|
values[-1].append(rng.random() * scale)
|
||||||
|
|
||||||
|
return values
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
@require_torchaudio
|
||||||
|
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester with Whisper->Clap
|
||||||
|
class ClapFeatureExtractionTester(unittest.TestCase):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=7,
|
||||||
|
min_seq_length=400,
|
||||||
|
max_seq_length=2000,
|
||||||
|
feature_size=10,
|
||||||
|
hop_length=160,
|
||||||
|
chunk_length=8,
|
||||||
|
padding_value=0.0,
|
||||||
|
sampling_rate=4_000,
|
||||||
|
return_attention_mask=False,
|
||||||
|
do_normalize=True,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.min_seq_length = min_seq_length
|
||||||
|
self.max_seq_length = max_seq_length
|
||||||
|
self.seq_length_diff = (self.max_seq_length - self.min_seq_length) // (self.batch_size - 1)
|
||||||
|
self.padding_value = padding_value
|
||||||
|
self.sampling_rate = sampling_rate
|
||||||
|
self.return_attention_mask = return_attention_mask
|
||||||
|
self.do_normalize = do_normalize
|
||||||
|
self.feature_size = feature_size
|
||||||
|
self.chunk_length = chunk_length
|
||||||
|
self.hop_length = hop_length
|
||||||
|
|
||||||
|
def prepare_feat_extract_dict(self):
|
||||||
|
return {
|
||||||
|
"feature_size": self.feature_size,
|
||||||
|
"hop_length": self.hop_length,
|
||||||
|
"chunk_length": self.chunk_length,
|
||||||
|
"padding_value": self.padding_value,
|
||||||
|
"sampling_rate": self.sampling_rate,
|
||||||
|
"return_attention_mask": self.return_attention_mask,
|
||||||
|
"do_normalize": self.do_normalize,
|
||||||
|
}
|
||||||
|
|
||||||
|
def prepare_inputs_for_common(self, equal_length=False, numpify=False):
|
||||||
|
def _flatten(list_of_lists):
|
||||||
|
return list(itertools.chain(*list_of_lists))
|
||||||
|
|
||||||
|
if equal_length:
|
||||||
|
speech_inputs = [floats_list((self.max_seq_length, self.feature_size)) for _ in range(self.batch_size)]
|
||||||
|
else:
|
||||||
|
# make sure that inputs increase in size
|
||||||
|
speech_inputs = [
|
||||||
|
floats_list((x, self.feature_size))
|
||||||
|
for x in range(self.min_seq_length, self.max_seq_length, self.seq_length_diff)
|
||||||
|
]
|
||||||
|
if numpify:
|
||||||
|
speech_inputs = [np.asarray(x) for x in speech_inputs]
|
||||||
|
return speech_inputs
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
@require_torchaudio
|
||||||
|
# Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest with Whisper->Clap
|
||||||
|
class ClapFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
|
||||||
|
feature_extraction_class = ClapFeatureExtractor
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.feat_extract_tester = ClapFeatureExtractionTester(self)
|
||||||
|
|
||||||
|
def test_call(self):
|
||||||
|
# Tests that all calls wrap to encode_plus and batch_encode_plus
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
|
||||||
|
# create three inputs of length 800, 1000, and 1200
|
||||||
|
speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
|
||||||
|
np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]
|
||||||
|
|
||||||
|
# Test feature size
|
||||||
|
input_features = feature_extractor(np_speech_inputs, padding="max_length", return_tensors="np").input_features
|
||||||
|
self.assertTrue(input_features.ndim == 4)
|
||||||
|
|
||||||
|
# Test not batched input
|
||||||
|
encoded_sequences_1 = feature_extractor(speech_inputs[0], return_tensors="np").input_features
|
||||||
|
encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").input_features
|
||||||
|
self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))
|
||||||
|
|
||||||
|
# Test batched
|
||||||
|
encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
|
||||||
|
encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
|
||||||
|
for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
|
||||||
|
self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
|
||||||
|
|
||||||
|
def test_double_precision_pad(self):
|
||||||
|
import torch
|
||||||
|
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
|
||||||
|
np_speech_inputs = np.random.rand(100, 32).astype(np.float64)
|
||||||
|
py_speech_inputs = np_speech_inputs.tolist()
|
||||||
|
|
||||||
|
for inputs in [py_speech_inputs, np_speech_inputs]:
|
||||||
|
np_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="np")
|
||||||
|
self.assertTrue(np_processed.input_features.dtype == np.float32)
|
||||||
|
pt_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="pt")
|
||||||
|
self.assertTrue(pt_processed.input_features.dtype == torch.float32)
|
||||||
|
|
||||||
|
def _load_datasamples(self, num_samples):
|
||||||
|
from datasets import load_dataset
|
||||||
|
|
||||||
|
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||||
|
# automatic decoding with librispeech
|
||||||
|
speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
|
||||||
|
|
||||||
|
return [x["array"] for x in speech_samples]
|
||||||
|
|
||||||
|
def integration_test_fusion(self):
|
||||||
|
# fmt: off
|
||||||
|
EXPECTED_INPUT_FEATURES = torch.tensor(
|
||||||
|
[
|
||||||
|
[
|
||||||
|
-30.2194, -22.4424, -18.6442, -17.2452, -22.7392, -32.2576, -36.1404,
|
||||||
|
-35.6120, -29.6229, -29.0454, -32.2157, -36.7664, -29.4436, -26.7825,
|
||||||
|
-31.1811, -38.3918, -38.8749, -43.4485, -47.6236, -38.7528, -31.8574,
|
||||||
|
-39.0591, -41.3190, -32.3319, -31.4699, -33.4502, -36.7412, -34.5265,
|
||||||
|
-35.1091, -40.4518, -42.7346, -44.5909, -44.9747, -45.8328, -47.0772,
|
||||||
|
-46.2723, -44.3613, -48.6253, -44.9551, -43.8700, -44.6104, -48.0146,
|
||||||
|
-42.7614, -47.3587, -47.4369, -45.5018, -47.0198, -42.8759, -47.5056,
|
||||||
|
-47.1567, -49.2621, -49.5643, -48.4330, -48.8495, -47.2512, -40.8439,
|
||||||
|
-48.1234, -49.1218, -48.7222, -50.2399, -46.8487, -41.9921, -50.4015,
|
||||||
|
-50.7827
|
||||||
|
],
|
||||||
|
[
|
||||||
|
-89.0141, -89.1411, -88.8096, -88.5480, -88.3481, -88.2038,
|
||||||
|
-88.1105, -88.0647, -88.0636, -88.1051, -88.1877, -88.1110,
|
||||||
|
-87.8613, -88.6679, -88.2685, -88.9684, -88.7977, -89.6264,
|
||||||
|
-89.9299, -90.3184, -91.1446, -91.9265, -92.7267, -93.6099,
|
||||||
|
-94.6395, -95.3243, -95.5923, -95.5773, -95.0889, -94.3354,
|
||||||
|
-93.5746, -92.9287, -92.4525, -91.9798, -91.8852, -91.7500,
|
||||||
|
-91.7259, -91.7561, -91.7959, -91.7070, -91.6914, -91.5019,
|
||||||
|
-91.0640, -90.0807, -88.7102, -87.0826, -85.5956, -84.4441,
|
||||||
|
-83.8461, -83.8605, -84.6702, -86.3900, -89.3073, -93.2926,
|
||||||
|
-96.3813, -97.3529, -100.0000, -99.6942, -92.2851, -87.9588,
|
||||||
|
-85.7214, -84.6807, -84.1940, -84.2021
|
||||||
|
],
|
||||||
|
[
|
||||||
|
-51.6882, -50.6852, -50.8198, -51.7428, -53.0325, -54.1619, -56.4903,
|
||||||
|
-59.0314, -60.7996, -60.5164, -59.9680, -60.5393, -62.5796, -65.4166,
|
||||||
|
-65.6149, -65.1409, -65.7226, -67.9057, -72.5089, -82.3530, -86.3189,
|
||||||
|
-83.4241, -79.1279, -79.3384, -82.7335, -79.8316, -80.2167, -74.3638,
|
||||||
|
-71.3930, -75.3849, -74.5381, -71.4504, -70.3791, -71.4547, -71.8820,
|
||||||
|
-67.3885, -69.5686, -71.9852, -71.0307, -73.0053, -80.8802, -72.9227,
|
||||||
|
-63.8526, -60.3260, -59.6012, -57.8316, -61.0603, -67.3403, -67.1709,
|
||||||
|
-60.4967, -60.5079, -68.3345, -67.5213, -70.6416, -79.6219, -78.2198,
|
||||||
|
-74.6851, -69.5718, -69.4968, -70.6882, -66.8175, -73.8558, -74.3855,
|
||||||
|
-72.9405
|
||||||
|
]
|
||||||
|
]
|
||||||
|
)
|
||||||
|
# fmt: on
|
||||||
|
MEL_BIN = [963, 963, 161]
|
||||||
|
input_speech = self._load_datasamples(1)
|
||||||
|
feature_extractor = ClapFeatureExtractor()
|
||||||
|
for padding, EXPECTED_VALUES, idx_in_mel in zip(
|
||||||
|
["repeat", "repeatpad", None], EXPECTED_INPUT_FEATURES, MEL_BIN
|
||||||
|
):
|
||||||
|
input_features = feature_extractor(input_speech, return_tensors="pt", padding=padding).input_features
|
||||||
|
self.assertTrue(torch.allclose(input_features[0, idx_in_mel], EXPECTED_VALUES, atol=1e-4))
|
||||||
|
|
||||||
|
def integration_test_rand_trunc(self):
|
||||||
|
# TODO in this case we should set the seed and use a longer audio to properly see the random truncation
|
||||||
|
# fmt: off
|
||||||
|
EXPECTED_INPUT_FEATURES = torch.tensor(
|
||||||
|
[
|
||||||
|
[
|
||||||
|
-42.3330, -36.2735, -35.9231, -43.5947, -48.4525, -46.5227, -42.6477,
|
||||||
|
-47.2740, -51.4336, -50.0846, -51.8711, -50.4232, -47.4736, -54.2275,
|
||||||
|
-53.3947, -55.4904, -54.8750, -54.5510, -55.4156, -57.4395, -51.7385,
|
||||||
|
-55.9118, -57.7800, -63.2064, -67.0651, -61.4379, -56.4268, -54.8667,
|
||||||
|
-52.3487, -56.4418, -57.1842, -55.1005, -55.6366, -59.4395, -56.8604,
|
||||||
|
-56.4949, -61.6573, -61.0826, -60.3250, -63.7876, -67.4882, -60.2323,
|
||||||
|
-54.6886, -50.5369, -47.7656, -45.8909, -49.1273, -57.4141, -58.3201,
|
||||||
|
-51.9862, -51.4897, -59.2561, -60.4730, -61.2203, -69.3174, -69.7464,
|
||||||
|
-65.5861, -58.9921, -59.5610, -61.0584, -58.1149, -64.4045, -66.2622,
|
||||||
|
-64.4610
|
||||||
|
],
|
||||||
|
[
|
||||||
|
-41.2298, -38.4211, -39.8834, -45.9950, -47.3839, -43.9849, -46.0371,
|
||||||
|
-52.5490, -56.6912, -51.8794, -50.1284, -49.7506, -53.9422, -63.2854,
|
||||||
|
-56.5754, -55.0469, -55.3181, -55.8115, -56.0058, -57.9215, -58.7597,
|
||||||
|
-59.1994, -59.2141, -64.4198, -73.5138, -64.4647, -59.3351, -54.5626,
|
||||||
|
-54.7508, -65.0230, -60.0270, -54.7644, -56.0108, -60.1531, -57.6879,
|
||||||
|
-56.3766, -63.3395, -65.3032, -61.5202, -63.0677, -68.4217, -60.6868,
|
||||||
|
-54.4619, -50.8533, -47.7200, -45.9197, -49.0961, -57.7621, -59.0750,
|
||||||
|
-51.9122, -51.4332, -59.4132, -60.3415, -61.6558, -70.7049, -69.7905,
|
||||||
|
-66.9104, -59.0324, -59.6138, -61.2023, -58.2169, -65.3837, -66.4425,
|
||||||
|
-64.4142
|
||||||
|
],
|
||||||
|
[
|
||||||
|
-51.6882, -50.6852, -50.8198, -51.7428, -53.0325, -54.1619, -56.4903,
|
||||||
|
-59.0314, -60.7996, -60.5164, -59.9680, -60.5393, -62.5796, -65.4166,
|
||||||
|
-65.6149, -65.1409, -65.7226, -67.9057, -72.5089, -82.3530, -86.3189,
|
||||||
|
-83.4241, -79.1279, -79.3384, -82.7335, -79.8316, -80.2167, -74.3638,
|
||||||
|
-71.3930, -75.3849, -74.5381, -71.4504, -70.3791, -71.4547, -71.8820,
|
||||||
|
-67.3885, -69.5686, -71.9852, -71.0307, -73.0053, -80.8802, -72.9227,
|
||||||
|
-63.8526, -60.3260, -59.6012, -57.8316, -61.0603, -67.3403, -67.1709,
|
||||||
|
-60.4967, -60.5079, -68.3345, -67.5213, -70.6416, -79.6219, -78.2198,
|
||||||
|
-74.6851, -69.5718, -69.4968, -70.6882, -66.8175, -73.8558, -74.3855,
|
||||||
|
-72.9405
|
||||||
|
]
|
||||||
|
]
|
||||||
|
)
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
input_speech = self._load_datasamples(1)
|
||||||
|
feature_extractor = ClapFeatureExtractor()
|
||||||
|
for padding, EXPECTED_VALUES in zip(["repeat", "repeatpad", None], EXPECTED_INPUT_FEATURES):
|
||||||
|
input_features = feature_extractor(
|
||||||
|
input_speech, return_tensors="pt", truncation="rand_trunc", padding=padding
|
||||||
|
).input_features
|
||||||
|
self.assertTrue(torch.allclose(input_features[0, 0, :30], EXPECTED_VALUES, atol=1e-4))
|
|
@@ -0,0 +1,665 @@
|
||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Testing suite for the PyTorch CLAP model. """
|
||||||
|
|
||||||
|
|
||||||
|
import inspect
|
||||||
|
import os
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
from datasets import load_dataset
|
||||||
|
|
||||||
|
from transformers import ClapAudioConfig, ClapConfig, ClapProcessor, ClapTextConfig
|
||||||
|
from transformers.testing_utils import require_torch, slow, torch_device
|
||||||
|
from transformers.utils import is_torch_available
|
||||||
|
|
||||||
|
from ...test_configuration_common import ConfigTester
|
||||||
|
from ...test_modeling_common import (
|
||||||
|
ModelTesterMixin,
|
||||||
|
_config_zero_init,
|
||||||
|
floats_tensor,
|
||||||
|
ids_tensor,
|
||||||
|
random_attention_mask,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
from torch import nn
|
||||||
|
|
||||||
|
from transformers import (
|
||||||
|
ClapAudioModel,
|
||||||
|
ClapAudioModelWithProjection,
|
||||||
|
ClapModel,
|
||||||
|
ClapTextModel,
|
||||||
|
ClapTextModelWithProjection,
|
||||||
|
)
|
||||||
|
from transformers.models.clap.modeling_clap import CLAP_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||||
|
|
||||||
|
|
||||||
|
class ClapAudioModelTester:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=12,
|
||||||
|
image_size=60,
|
||||||
|
num_mel_bins=16,
|
||||||
|
window_size=4,
|
||||||
|
spec_size=64,
|
||||||
|
patch_size=2,
|
||||||
|
patch_stride=2,
|
||||||
|
seq_length=16,
|
||||||
|
freq_ratio=2,
|
||||||
|
num_channels=3,
|
||||||
|
is_training=True,
|
||||||
|
hidden_size=256,
|
||||||
|
patch_embeds_hidden_size=32,
|
||||||
|
projection_dim=32,
|
||||||
|
num_hidden_layers=4,
|
||||||
|
num_heads=[2, 2, 2, 2],
|
||||||
|
intermediate_size=37,
|
||||||
|
dropout=0.1,
|
||||||
|
attention_dropout=0.1,
|
||||||
|
initializer_range=0.02,
|
||||||
|
scope=None,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.image_size = image_size
|
||||||
|
self.num_mel_bins = num_mel_bins
|
||||||
|
self.window_size = window_size
|
||||||
|
self.patch_size = patch_size
|
||||||
|
self.num_channels = num_channels
|
||||||
|
self.is_training = is_training
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.projection_dim = projection_dim
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_heads = num_heads
|
||||||
|
self.num_attention_heads = num_heads[0]
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.spec_size = spec_size
|
||||||
|
self.freq_ratio = freq_ratio
|
||||||
|
self.patch_stride = patch_stride
|
||||||
|
self.patch_embeds_hidden_size = patch_embeds_hidden_size
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.dropout = dropout
|
||||||
|
self.attention_dropout = attention_dropout
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.scope = scope
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_features = floats_tensor([self.batch_size, 1, self.hidden_size, self.num_mel_bins])
|
||||||
|
config = self.get_config()
|
||||||
|
|
||||||
|
return config, input_features
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return ClapAudioConfig(
|
||||||
|
image_size=self.image_size,
|
||||||
|
patch_size=self.patch_size,
|
||||||
|
num_mel_bins=self.num_mel_bins,
|
||||||
|
window_size=self.window_size,
|
||||||
|
num_channels=self.num_channels,
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
patch_stride=self.patch_stride,
|
||||||
|
projection_dim=self.projection_dim,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_heads,
|
||||||
|
intermediate_size=self.intermediate_size,
|
||||||
|
dropout=self.dropout,
|
||||||
|
attention_dropout=self.attention_dropout,
|
||||||
|
initializer_range=self.initializer_range,
|
||||||
|
spec_size=self.spec_size,
|
||||||
|
freq_ratio=self.freq_ratio,
|
||||||
|
patch_embeds_hidden_size=self.patch_embeds_hidden_size,
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_model(self, config, input_features):
|
||||||
|
model = ClapAudioModel(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
with torch.no_grad():
|
||||||
|
result = model(input_features)
|
||||||
|
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
|
||||||
|
|
||||||
|
def create_and_check_model_with_projection(self, config, input_features):
|
||||||
|
model = ClapAudioModelWithProjection(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
with torch.no_grad():
|
||||||
|
result = model(input_features)
|
||||||
|
self.parent.assertEqual(result.audio_embeds.shape, (self.batch_size, self.projection_dim))
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
config, input_features = config_and_inputs
|
||||||
|
inputs_dict = {"input_features": input_features}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class ClapAudioModelTest(ModelTesterMixin, unittest.TestCase):
|
||||||
|
"""
|
||||||
|
Here we also overwrite some of the tests of test_modeling_common.py, as CLAP does not use input_ids, inputs_embeds,
|
||||||
|
attention_mask and seq_length.
|
||||||
|
"""
|
||||||
|
|
||||||
|
all_model_classes = (ClapAudioModel, ClapAudioModelWithProjection) if is_torch_available() else ()
|
||||||
|
fx_compatible = False
|
||||||
|
test_pruning = False
|
||||||
|
test_resize_embeddings = False
|
||||||
|
test_head_masking = False
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = ClapAudioModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=ClapAudioConfig, has_text_modality=False, hidden_size=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapAudioModel does not use inputs_embeds")
|
||||||
|
def test_inputs_embeds(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_model_common_attributes(self):
|
||||||
|
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
|
||||||
|
x = model.get_output_embeddings()
|
||||||
|
self.assertTrue(x is None or isinstance(x, nn.Linear))
|
||||||
|
|
||||||
|
def test_hidden_states_output(self):
|
||||||
|
def check_hidden_states_output(inputs_dict, config, model_class):
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||||
|
|
||||||
|
hidden_states = outputs.hidden_states
|
||||||
|
|
||||||
|
expected_num_layers = getattr(
|
||||||
|
self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
|
||||||
|
)
|
||||||
|
self.assertEqual(len(hidden_states), expected_num_layers)
|
||||||
|
|
||||||
|
self.assertListEqual(
|
||||||
|
list(hidden_states[0].shape[-2:]),
|
||||||
|
[self.model_tester.patch_embeds_hidden_size, self.model_tester.patch_embeds_hidden_size],
|
||||||
|
)
|
||||||
|
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
inputs_dict["output_hidden_states"] = True
|
||||||
|
check_hidden_states_output(inputs_dict, config, model_class)
|
||||||
|
|
||||||
|
# check that output_hidden_states also work using config
|
||||||
|
del inputs_dict["output_hidden_states"]
|
||||||
|
config.output_hidden_states = True
|
||||||
|
|
||||||
|
check_hidden_states_output(inputs_dict, config, model_class)
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapAudioModel does not output any loss term in the forward pass")
|
||||||
|
def test_retain_grad_hidden_states_attentions(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_forward_signature(self):
|
||||||
|
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
signature = inspect.signature(model.forward)
|
||||||
|
# signature.parameters is an OrderedDict => so arg_names order is deterministic
|
||||||
|
arg_names = [*signature.parameters.keys()]
|
||||||
|
|
||||||
|
expected_arg_names = ["input_features"]
|
||||||
|
self.assertListEqual(arg_names[:1], expected_arg_names)
|
||||||
|
|
||||||
|
def test_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_model_with_projection(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_model_with_projection(*config_and_inputs)
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapAudioModel does not output any loss term in the forward pass")
|
||||||
|
def test_training(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapAudioModel does not output any loss term in the forward pass")
|
||||||
|
def test_training_gradient_checkpointing(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapAudioModel has no base class and is not available in MODEL_MAPPING")
|
||||||
|
def test_save_load_fast_init_from_base(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapAudioModel has no base class and is not available in MODEL_MAPPING")
|
||||||
|
def test_save_load_fast_init_to_base(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||||
|
model = ClapAudioModel.from_pretrained(model_name)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_model_with_projection_from_pretrained(self):
|
||||||
|
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||||
|
model = ClapAudioModelWithProjection.from_pretrained(model_name)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
self.assertTrue(hasattr(model, "visual_projection"))
|
||||||
|
|
||||||
|
|
||||||
|
class ClapTextModelTester:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=12,
|
||||||
|
seq_length=7,
|
||||||
|
is_training=True,
|
||||||
|
use_input_mask=True,
|
||||||
|
use_labels=True,
|
||||||
|
vocab_size=99,
|
||||||
|
hidden_size=32,
|
||||||
|
projection_dim=32,
|
||||||
|
num_hidden_layers=5,
|
||||||
|
num_attention_heads=4,
|
||||||
|
intermediate_size=37,
|
||||||
|
dropout=0.1,
|
||||||
|
attention_dropout=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
initializer_range=0.02,
|
||||||
|
scope=None,
|
||||||
|
projection_hidden_act="relu",
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_input_mask = use_input_mask
|
||||||
|
self.use_labels = use_labels
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.projection_dim = projection_dim
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.dropout = dropout
|
||||||
|
self.attention_dropout = attention_dropout
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.scope = scope
|
||||||
|
self.projection_hidden_act = projection_hidden_act
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
input_mask = None
|
||||||
|
if self.use_input_mask:
|
||||||
|
input_mask = random_attention_mask([self.batch_size, self.seq_length])
|
||||||
|
|
||||||
|
if input_mask is not None:
|
||||||
|
batch_size, seq_length = input_mask.shape
|
||||||
|
rnd_start_indices = np.random.randint(1, seq_length - 1, size=(batch_size,))
|
||||||
|
for batch_idx, start_index in enumerate(rnd_start_indices):
|
||||||
|
input_mask[batch_idx, :start_index] = 1
|
||||||
|
input_mask[batch_idx, start_index:] = 0
|
||||||
|
|
||||||
|
config = self.get_config()
|
||||||
|
|
||||||
|
return config, input_ids, input_mask
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return ClapTextConfig(
|
||||||
|
vocab_size=self.vocab_size,
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
projection_dim=self.projection_dim,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
intermediate_size=self.intermediate_size,
|
||||||
|
dropout=self.dropout,
|
||||||
|
attention_dropout=self.attention_dropout,
|
||||||
|
max_position_embeddings=self.max_position_embeddings,
|
||||||
|
initializer_range=self.initializer_range,
|
||||||
|
projection_hidden_act=self.projection_hidden_act,
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_model(self, config, input_ids, input_mask):
|
||||||
|
model = ClapTextModel(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
with torch.no_grad():
|
||||||
|
result = model(input_ids, attention_mask=input_mask)
|
||||||
|
result = model(input_ids)
|
||||||
|
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||||
|
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
|
||||||
|
|
||||||
|
def create_and_check_model_with_projection(self, config, input_ids, input_mask):
|
||||||
|
model = ClapTextModelWithProjection(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
with torch.no_grad():
|
||||||
|
result = model(input_ids, attention_mask=input_mask)
|
||||||
|
result = model(input_ids)
|
||||||
|
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||||
|
self.parent.assertEqual(result.text_embeds.shape, (self.batch_size, self.projection_dim))
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
config, input_ids, input_mask = config_and_inputs
|
||||||
|
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class ClapTextModelTest(ModelTesterMixin, unittest.TestCase):
|
||||||
|
all_model_classes = (ClapTextModel, ClapTextModelWithProjection) if is_torch_available() else ()
|
||||||
|
fx_compatible = False
|
||||||
|
test_pruning = False
|
||||||
|
test_head_masking = False
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = ClapTextModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=ClapTextConfig, hidden_size=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_model_with_projection(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_model_with_projection(*config_and_inputs)
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapTextModel does not output any loss term in the forward pass")
|
||||||
|
def test_training(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapTextModel does not output any loss term in the forward pass")
|
||||||
|
def test_training_gradient_checkpointing(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapTextModel does not use inputs_embeds")
|
||||||
|
def test_inputs_embeds(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapTextModel has no base class and is not available in MODEL_MAPPING")
|
||||||
|
def test_save_load_fast_init_from_base(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(reason="ClapTextModel has no base class and is not available in MODEL_MAPPING")
|
||||||
|
def test_save_load_fast_init_to_base(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||||
|
model = ClapTextModel.from_pretrained(model_name)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_model_with_projection_from_pretrained(self):
|
||||||
|
for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||||
|
model = ClapTextModelWithProjection.from_pretrained(model_name)
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
self.assertTrue(hasattr(model, "text_projection"))
|
||||||
|
|
||||||
|
|
||||||
|
class ClapModelTester:
|
||||||
|
def __init__(self, parent, text_kwargs=None, audio_kwargs=None, is_training=True):
|
||||||
|
if text_kwargs is None:
|
||||||
|
text_kwargs = {}
|
||||||
|
if audio_kwargs is None:
|
||||||
|
audio_kwargs = {}
|
||||||
|
|
||||||
|
self.parent = parent
|
||||||
|
self.text_model_tester = ClapTextModelTester(parent, **text_kwargs)
|
||||||
|
self.audio_model_tester = ClapAudioModelTester(parent, **audio_kwargs)
|
||||||
|
self.is_training = is_training
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
_, input_ids, attention_mask = self.text_model_tester.prepare_config_and_inputs()
|
||||||
|
_, input_features = self.audio_model_tester.prepare_config_and_inputs()
|
||||||
|
|
||||||
|
config = self.get_config()
|
||||||
|
|
||||||
|
return config, input_ids, attention_mask, input_features
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return ClapConfig.from_text_audio_configs(
|
||||||
|
self.text_model_tester.get_config(), self.audio_model_tester.get_config(), projection_dim=64
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_model(self, config, input_ids, attention_mask, input_features):
|
||||||
|
model = ClapModel(config).to(torch_device).eval()
|
||||||
|
with torch.no_grad():
|
||||||
|
result = model(input_ids, input_features, attention_mask)
|
||||||
|
self.parent.assertEqual(
|
||||||
|
result.logits_per_audio.shape, (self.audio_model_tester.batch_size, self.text_model_tester.batch_size)
|
||||||
|
)
|
||||||
|
self.parent.assertEqual(
|
||||||
|
result.logits_per_text.shape, (self.text_model_tester.batch_size, self.audio_model_tester.batch_size)
|
||||||
|
)
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
config, input_ids, attention_mask, input_features = config_and_inputs
|
||||||
|
inputs_dict = {
|
||||||
|
"input_ids": input_ids,
|
||||||
|
"attention_mask": attention_mask,
|
||||||
|
"input_features": input_features,
|
||||||
|
"return_loss": True,
|
||||||
|
}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
|
||||||
|

@require_torch
class ClapModelTest(ModelTesterMixin, unittest.TestCase):
    all_model_classes = (ClapModel,) if is_torch_available() else ()
    fx_compatible = False
    test_head_masking = False
    test_pruning = False
    test_resize_embeddings = False
    test_attention_outputs = False

    def setUp(self):
        self.model_tester = ClapModelTester(self)

    def test_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_model(*config_and_inputs)

    @unittest.skip(reason="Hidden_states is tested in individual model tests")
    def test_hidden_states_output(self):
        pass

    @unittest.skip(reason="Inputs_embeds is tested in individual model tests")
    def test_inputs_embeds(self):
        pass

    @unittest.skip(reason="Retain_grad is tested in individual model tests")
    def test_retain_grad_hidden_states_attentions(self):
        pass

    @unittest.skip(reason="ClapModel does not have input/output embeddings")
    def test_model_common_attributes(self):
        pass

    # override as the `logit_scale` parameter initialization is different for CLAP
    def test_initialization(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        configs_no_init = _config_zero_init(config)
        for model_class in self.all_model_classes:
            model = model_class(config=configs_no_init)
            for name, param in model.named_parameters():
                if param.requires_grad:
                    # check if `logit_scale` is initialized as per the original implementation
                    if name == "logit_scale":
                        self.assertAlmostEqual(
                            param.data.item(),
                            np.log(1 / 0.07),
                            delta=1e-3,
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )
                    else:
                        self.assertIn(
                            ((param.data.mean() * 1e9).round() / 1e9).item(),
                            [0.0, 1.0],
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )

    def _create_and_check_torchscript(self, config, inputs_dict):
        if not self.test_torchscript:
            return

        configs_no_init = _config_zero_init(config)  # To be sure we have no NaN
        configs_no_init.torchscript = True
        configs_no_init.return_dict = False
        for model_class in self.all_model_classes:
            model = model_class(config=configs_no_init)
            model.to(torch_device)
            model.eval()

            try:
                input_ids = inputs_dict["input_ids"]
                input_features = inputs_dict["input_features"]  # CLAP needs input_features
                traced_model = torch.jit.trace(model, (input_ids, input_features))
            except RuntimeError:
                self.fail("Couldn't trace module.")

            with tempfile.TemporaryDirectory() as tmp_dir_name:
                pt_file_name = os.path.join(tmp_dir_name, "traced_model.pt")

                try:
                    torch.jit.save(traced_model, pt_file_name)
                except Exception:
                    self.fail("Couldn't save module.")

                try:
                    loaded_model = torch.jit.load(pt_file_name)
                except Exception:
                    self.fail("Couldn't load module.")

                model.to(torch_device)
                model.eval()

                loaded_model.to(torch_device)
                loaded_model.eval()

                model_state_dict = model.state_dict()
                loaded_model_state_dict = loaded_model.state_dict()

                self.assertEqual(set(model_state_dict.keys()), set(loaded_model_state_dict.keys()))

                models_equal = True
                for layer_name, p1 in model_state_dict.items():
                    p2 = loaded_model_state_dict[layer_name]
                    if p1.data.ne(p2.data).sum() > 0:
                        models_equal = False

                self.assertTrue(models_equal)

    def test_load_audio_text_config(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        # Save ClapConfig and check if we can load ClapAudioConfig from it
        with tempfile.TemporaryDirectory() as tmp_dir_name:
            config.save_pretrained(tmp_dir_name)
            audio_config = ClapAudioConfig.from_pretrained(tmp_dir_name)
            self.assertDictEqual(config.audio_config.to_dict(), audio_config.to_dict())

        # Save ClapConfig and check if we can load ClapTextConfig from it
        with tempfile.TemporaryDirectory() as tmp_dir_name:
            config.save_pretrained(tmp_dir_name)
            text_config = ClapTextConfig.from_pretrained(tmp_dir_name)
            self.assertDictEqual(config.text_config.to_dict(), text_config.to_dict())

    @slow
    def test_model_from_pretrained(self):
        for model_name in CLAP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
            model = ClapModel.from_pretrained(model_name)
            self.assertIsNotNone(model)


@slow
@require_torch
class ClapModelIntegrationTest(unittest.TestCase):
    paddings = ["repeatpad", "repeat", "pad"]

    def test_integration_unfused(self):
        EXPECTED_MEANS_UNFUSED = {
            "repeatpad": 0.0024,
            "pad": 0.0020,
            "repeat": 0.0023,
        }

        librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
        audio_sample = librispeech_dummy[-1]

        model_id = "laion/clap-htsat-unfused"

        model = ClapModel.from_pretrained(model_id).to(torch_device)
        processor = ClapProcessor.from_pretrained(model_id)

        for padding in self.paddings:
            inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt", padding=padding).to(
                torch_device
            )

            audio_embed = model.get_audio_features(**inputs)
            expected_mean = EXPECTED_MEANS_UNFUSED[padding]

            self.assertTrue(
                torch.allclose(audio_embed.cpu().mean(), torch.tensor([expected_mean]), atol=1e-3, rtol=1e-3)
            )

    def test_integration_fused(self):
        EXPECTED_MEANS_FUSED = {
            "repeatpad": 0.00069,
            "repeat": 0.00196,
            "pad": -0.000379,
        }

        librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
        audio_sample = librispeech_dummy[-1]

        model_id = "laion/clap-htsat-fused"

        model = ClapModel.from_pretrained(model_id).to(torch_device)
        processor = ClapProcessor.from_pretrained(model_id)

        for padding in self.paddings:
            inputs = processor(
                audios=audio_sample["audio"]["array"], return_tensors="pt", padding=padding, truncation="fusion"
            ).to(torch_device)

            audio_embed = model.get_audio_features(**inputs)
            expected_mean = EXPECTED_MEANS_FUSED[padding]

            self.assertTrue(
                torch.allclose(audio_embed.cpu().mean(), torch.tensor([expected_mean]), atol=1e-3, rtol=1e-3)
            )
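
Beyond checking embedding means, the same pieces combine into audio-text matching. A minimal sketch (not part of the diff), reusing the model id, dataset, and padding mode from the tests above; `get_text_features` is assumed to mirror the `get_audio_features` call used here, and the candidate captions are made up.

import torch
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

model_id = "laion/clap-htsat-unfused"
model = ClapModel.from_pretrained(model_id)
processor = ClapProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = dataset[-1]["audio"]["array"]

captions = ["a person speaking", "a dog barking", "music playing"]  # hypothetical labels
text_inputs = processor(text=captions, return_tensors="pt", padding=True)
audio_inputs = processor(audios=audio, return_tensors="pt", padding="repeatpad")

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)    # assumed counterpart of get_audio_features
    audio_embeds = model.get_audio_features(**audio_inputs)  # same call as in the integration tests

# cosine similarity between the normalized audio and text embeddings
text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1)
audio_embeds = torch.nn.functional.normalize(audio_embeds, dim=-1)
similarity = audio_embeds @ text_embeds.T  # shape (num_audios, num_captions)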

@@ -0,0 +1,125 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import shutil
import tempfile
import unittest

from transformers import ClapFeatureExtractor, ClapProcessor, RobertaTokenizer, RobertaTokenizerFast
from transformers.testing_utils import require_sentencepiece, require_torchaudio

from .test_feature_extraction_clap import floats_list


@require_torchaudio
@require_sentencepiece
class ClapProcessorTest(unittest.TestCase):
    def setUp(self):
        self.checkpoint = "laion/clap-htsat-unfused"
        self.tmpdirname = tempfile.mkdtemp()

    def get_tokenizer(self, **kwargs):
        return RobertaTokenizer.from_pretrained(self.checkpoint, **kwargs)

    def get_feature_extractor(self, **kwargs):
        return ClapFeatureExtractor.from_pretrained(self.checkpoint, **kwargs)

    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

    def test_save_load_pretrained_default(self):
        tokenizer = self.get_tokenizer()
        feature_extractor = self.get_feature_extractor()

        processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        processor.save_pretrained(self.tmpdirname)
        processor = ClapProcessor.from_pretrained(self.tmpdirname)

        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())
        self.assertIsInstance(processor.tokenizer, RobertaTokenizerFast)

        self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor.to_json_string())
        self.assertIsInstance(processor.feature_extractor, ClapFeatureExtractor)

    def test_save_load_pretrained_additional_features(self):
        processor = ClapProcessor(tokenizer=self.get_tokenizer(), feature_extractor=self.get_feature_extractor())
        processor.save_pretrained(self.tmpdirname)

        tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
        feature_extractor_add_kwargs = self.get_feature_extractor(do_normalize=False, padding_value=1.0)

        processor = ClapProcessor.from_pretrained(
            self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
        )

        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
        self.assertIsInstance(processor.tokenizer, RobertaTokenizerFast)

        self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor_add_kwargs.to_json_string())
        self.assertIsInstance(processor.feature_extractor, ClapFeatureExtractor)

    def test_feature_extractor(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        raw_speech = floats_list((3, 1000))

        input_feat_extract = feature_extractor(raw_speech, return_tensors="np")
        input_processor = processor(audios=raw_speech, return_tensors="np")

        for key in input_feat_extract.keys():
            self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)

    def test_tokenizer(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        input_str = "This is a test string"

        encoded_processor = processor(text=input_str)

        encoded_tok = tokenizer(input_str)

        for key in encoded_tok.keys():
            self.assertListEqual(encoded_tok[key], encoded_processor[key])

    def test_tokenizer_decode(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]

        decoded_processor = processor.batch_decode(predicted_ids)
        decoded_tok = tokenizer.batch_decode(predicted_ids)

        self.assertListEqual(decoded_tok, decoded_processor)

    def test_model_input_names(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = ClapProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        self.assertListEqual(
            processor.model_input_names[2:],
            feature_extractor.model_input_names,
            msg="`processor` and `feature_extractor` model input names do not match",
        )
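
The processor itself is a thin wrapper: text is routed to the Roberta tokenizer and audio to ClapFeatureExtractor, which is what the key-by-key comparisons above verify. A minimal sketch (not part of the diff), using a made-up audio array.

import numpy as np
from transformers import ClapProcessor

processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

text_batch = ["a dog barking", "rain on a window"]           # hypothetical captions
audio_batch = [np.random.randn(48_000).astype(np.float32)]   # made-up waveform for illustration

text_inputs = processor(text=text_batch, return_tensors="pt", padding=True)
audio_inputs = processor(audios=audio_batch, return_tensors="pt")

print(sorted(text_inputs.keys()))   # tokenizer outputs, e.g. ['attention_mask', 'input_ids']
print(sorted(audio_inputs.keys()))  # feature extractor outputs, including 'input_features'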

@@ -172,6 +172,10 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
# should **not** be the rule.
IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    # models to ignore for model xxx mapping
    "ClapTextModel",
    "ClapTextModelWithProjection",
    "ClapAudioModel",
    "ClapAudioModelWithProjection",
    "Blip2ForConditionalGeneration",
    "Blip2QFormerModel",
    "Blip2VisionModel",

@@ -40,6 +40,8 @@ src/transformers/models/bloom/configuration_bloom.py
src/transformers/models/camembert/configuration_camembert.py
src/transformers/models/canine/configuration_canine.py
src/transformers/models/canine/modeling_canine.py
src/transformers/models/clap/configuration_clap.py
src/transformers/models/clap/modeling_clap.py
src/transformers/models/clip/configuration_clip.py
src/transformers/models/clipseg/modeling_clipseg.py
src/transformers/models/codegen/configuration_codegen.py