Add MusicGen Melody (#28819)
* first modeling code
* make repository
* still WIP
* update model
* add tests
* add latest change
* clean docstrings and copied from
* update docstrings md and readme
* correct chroma function
* correct copied from and remove unrelated test
* add doc to toctree
* correct imports
* add convert script to not_doctested
* Add suggestion from Sanchit

  Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
* correct get_unconditional_inputs docstrings
* modify README according to Sanchit's feedback
* add chroma to audio utils
* clean librosa and torchaudio hard dependencies
* fix FE
* refactor audio decoder -> audio encoder for consistency with previous MusicGen
* refactor conditional -> encoder
* modify sampling rate logic
* modify license at the beginning
* refactor all_self_attns -> all_attentions
* remove ignore copy from causallm generate
* add copied from for from_sub_models
* fix make copies
* add warning if audio is truncated
* add copied from where relevant
* remove artefact
* fix convert script
* fix torchaudio and FE
* modify chroma method according to feedback -> better naming
* refactor input_values -> input_features
* refactor input_values -> input_features and fix import fe
* add input_features to docstrings
* correct inputs_embeds logic
* remove dtype conversion
* refactor _prepare_conditional_hidden_states_kwargs_for_generation -> _prepare_encoder_hidden_states_kwargs_for_generation
* change warning for chroma length
* Update src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py

  Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
* change way to save wav, using soundfile
* correct docs and change to soundfile
* fix import
* fix init proj layers
* remove line breaks from md
* fix issue with docstrings
* add FE suggestions
* improve is in logic and remove useless imports
* remove custom from_pretrained
* simplify docstring code
* add suggestions for modeling tests
* make style
* update converting script with sanity check
* remove encoder attention mask from conditional generation
* replace musicgen melody checkpoints with official orga
* rename ylacombe -> facebook in checkpoints
* fix copies
* remove unnecessary warning
* add shape in code docstrings
* add files to slow doc tests
* fix md bug and add md to not_tested
* make fix-copies
* fix hidden states test and batching

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
parent bf3dfd1160
commit c43b380e70
@@ -442,6 +442,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
@@ -438,6 +438,7 @@ Aktuelle Anzahl der Checkpoints: ![](https://img.shields.io/endpoint?url=https:/
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
@@ -415,6 +415,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
@@ -436,6 +436,7 @@ Nombre actuel de points de contrôle : ![](https://img.shields.io/endpoint?url=h
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (de l'Université du Wisconsin - Madison) a été publié dans l'article [Analyse multi-résolution (MRA) pour une auto-attention approximative](https://arxiv.org/abs/2207.10284) par Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (de Google AI) a été publié dans l'article [mT5 : un transformateur texte-à-texte pré-entraîné massivement multilingue](https://arxiv.org/abs/2010.11934) par Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (de Meta) a été publié dans l'article [Génération de musique simple et contrôlable](https://arxiv.org/abs/2306.05284) par Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi et Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (de Meta) publié dans l'article [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) par Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi et Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (de RUC AI Box) a été publié dans l'article [MVP : Pré-entraînement supervisé multi-tâche pour la génération de langage naturel](https://arxiv.org/abs/2206.12131) par Tianyi Tang, Junyi Li, Wayne Xin Zhao et Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (de SHI Labs) a été publié dans l'article [Transformateur d'attention de voisinage](https://arxiv.org/abs/2204.07143) par Ali Hassani, Steven Walton, Jiachen Li, Shen Li et Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (du laboratoire Noah's Ark de Huawei) a été publié dans l'article [NEZHA : Représentation contextualisée neurale pour la compréhension du langage chinois](https://arxiv.org/abs/1909.00204) par Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen et Qun Liu.
@@ -389,6 +389,7 @@ conda install conda-forge::transformers
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (the University of Wisconsin - Madison से) Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. द्वाराअनुसंधान पत्र [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) के साथ जारी किया गया
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (Google AI से) साथ वाला पेपर [mT5: एक व्यापक बहुभाषी पूर्व-प्रशिक्षित टेक्स्ट-टू-टेक्स्ट ट्रांसफॉर्मर](https://arxiv.org/abs/2010.11934) लिंटिंग ज़ू, नोआ कॉन्सटेंट, एडम रॉबर्ट्स, मिहिर काले, रामी अल-रफू, आदित्य सिद्धांत, आदित्य बरुआ, कॉलिन रैफेल द्वारा पोस्ट किया गया।
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (हुआवेई नूह के आर्क लैब से) साथ में कागज़ [NEZHA: चीनी भाषा समझ के लिए तंत्रिका प्रासंगिक प्रतिनिधित्व](https://arxiv.org/abs/1909.00204) जुन्किउ वेई, ज़ियाओज़े रेन, ज़िआओगुआंग ली, वेनयोंग हुआंग, यी लियाओ, याशेंग वांग, जियाशू लिन, शिन जियांग, जिओ चेन और कुन लियू द्वारा।
@@ -449,6 +449,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (the University of Wisconsin - Madison から) Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. から公開された研究論文 [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284)
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (Google AI から) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel から公開された研究論文: [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934)
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (RUC AI Box から) Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen から公開された研究論文: [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131)
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (SHI Labs から) Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi から公開された研究論文: [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143)
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noah’s Ark Lab から) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu から公開された研究論文: [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204)
@@ -364,6 +364,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (the University of Wisconsin - Madison 에서 제공)은 Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.의 [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) 논문과 함께 발표했습니다.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (Google AI 에서) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel 의 [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) 논문과 함께 발표했습니다.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (RUC AI Box 에서) Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen 의 [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) 논문과 함께 발표했습니다.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (SHI Labs 에서) Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi 의 [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) 논문과 함께 발표했습니다.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noah’s Ark Lab 에서) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 의 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 논문과 함께 발표했습니다.
@@ -447,6 +447,7 @@ Número atual de pontos de verificação: ![](https://img.shields.io/endpoint?ur
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
@@ -437,6 +437,7 @@ conda install conda-forge::transformers
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
@@ -439,6 +439,7 @@ Flax, PyTorch లేదా TensorFlow యొక్క ఇన్స్టా
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
@@ -438,6 +438,7 @@ Số lượng điểm kiểm tra hiện tại: ![](https://img.shields.io/endpoi
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (từ the University of Wisconsin - Madison) được phát hành với bài báo [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (từ Google AI) được phát hành với bài báo [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (từ Meta) được phát hành với bài báo [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (từ Meta) được phát hành với bài báo [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (từ RUC AI Box) được phát hành với bài báo [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (từ SHI Labs) được phát hành với bài báo [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (từ Huawei Noah’s Ark Lab) được phát hành với bài báo [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
@@ -388,6 +388,7 @@ conda install conda-forge::transformers
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (来自 the University of Wisconsin - Madison) 伴随论文 [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) 由 Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh 发布。
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (来自 Google AI) 伴随论文 [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) 由 Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel 发布。
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (来自 中国人民大学 AI Box) 伴随论文 [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) 由 Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen 发布。
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (来自 SHI Labs) 伴随论文 [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) 由 Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi 发布。
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (来自华为诺亚方舟实验室) 伴随论文 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 由 Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 发布。
@@ -400,6 +400,7 @@ conda install conda-forge::transformers
 1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.
 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
 1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
+1. **[MusicGen Melody](https://huggingface.co/docs/transformers/main/model_doc/musicgen_melody)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
@@ -642,6 +642,8 @@
       title: MMS
     - local: model_doc/musicgen
       title: MusicGen
+    - local: model_doc/musicgen_melody
+      title: MusicGen Melody
     - local: model_doc/pop2piano
       title: Pop2Piano
     - local: model_doc/seamless_m4t
@@ -207,6 +207,7 @@ Flax), PyTorch, and/or TensorFlow.
 | [MRA](model_doc/mra) | ✅ | ❌ | ❌ |
 | [MT5](model_doc/mt5) | ✅ | ✅ | ✅ |
 | [MusicGen](model_doc/musicgen) | ✅ | ❌ | ❌ |
+| [MusicGen Melody](model_doc/musicgen_melody) | ✅ | ❌ | ❌ |
 | [MVP](model_doc/mvp) | ✅ | ❌ | ❌ |
 | [NAT](model_doc/nat) | ✅ | ❌ | ❌ |
 | [Nezha](model_doc/nezha) | ✅ | ❌ | ❌ |
@@ -0,0 +1,288 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# MusicGen Melody

## Overview

The MusicGen Melody model was proposed in [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.

MusicGen Melody is a single-stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations. MusicGen is then trained to predict discrete audio tokens, or *audio codes*, conditioned on these hidden states. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform.

Through an efficient token interleaving pattern, MusicGen does not require a self-supervised semantic representation of the text/audio prompts, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g. hierarchically or via upsampling). Instead, it is able to generate all the codebooks in a single forward pass.

The abstract from the paper is the following:

*We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen.*

This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/audiocraft). The pre-trained checkpoints can be found on the [Hugging Face Hub](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen).

## Differences from [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen)

There are two key differences from MusicGen:
1. The audio prompt is used here as a conditioning signal for the generated audio sample, whereas it is used for audio continuation in [MusicGen](https://huggingface.co/docs/transformers/main/en/model_doc/musicgen).
2. Conditional text and audio signals are concatenated to the decoder's hidden states instead of being used as a cross-attention signal, as in MusicGen (see the sketch below).
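
To make the second difference concrete, here is a minimal, hypothetical sketch of the two conditioning schemes in plain PyTorch. The projection layers, masking and token interleaving are omitted, and the shapes are made up purely for illustration:

```python
import torch

batch, text_len, audio_len, dec_len, hidden = 2, 12, 20, 50, 1024

text_cond = torch.randn(batch, text_len, hidden)    # projected text-encoder states
audio_cond = torch.randn(batch, audio_len, hidden)  # projected audio (chroma) features
decoder_states = torch.randn(batch, dec_len, hidden)

# MusicGen Melody: conditioning is prepended along the sequence axis, so the
# decoder self-attends over a single [audio; text; decoder] stream
melody_states = torch.cat([audio_cond, text_cond, decoder_states], dim=1)
print(melody_states.shape)  # torch.Size([2, 82, 1024])

# MusicGen: the decoder instead cross-attends to the text states, i.e. queries
# come from the decoder while keys/values come from the text encoder
cross_attention = torch.nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
cross_out, _ = cross_attention(decoder_states, text_cond, text_cond)
print(cross_out.shape)  # torch.Size([2, 50, 1024])
```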

## Generation

MusicGen Melody is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly better results than greedy, so we encourage using sampling where possible. Sampling is enabled by default, and can be explicitly specified by setting `do_sample=True` in the call to [`MusicgenMelodyForConditionalGeneration.generate`], or by overriding the model's generation config (see below).

Transformers supports both mono (1-channel) and stereo (2-channel) variants of MusicGen Melody. The mono versions generate a single set of codebooks. The stereo versions generate 2 sets of codebooks, 1 for each channel (left/right), and each set of codebooks is decoded independently through the audio compression model. The audio streams for each channel are combined to give the final stereo output.
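
A stereo variant loads exactly like the mono one. As a sketch, the checkpoint name below (`facebook/musicgen-stereo-melody`) is assumed to be the stereo counterpart of the mono checkpoint, so verify it on the Hub before relying on it:

```python
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

# assumed stereo checkpoint name; check availability on the Hugging Face Hub
processor = AutoProcessor.from_pretrained("facebook/musicgen-stereo-melody")
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-stereo-melody")

inputs = processor(text=["80s blues track with groovy saxophone"], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(audio_values.shape)  # (batch_size, 2, sequence_length) for stereo variants
```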

### Audio Conditional Generation

The model can generate an audio sample conditioned on a text and an audio prompt through use of the [`MusicgenMelodyProcessor`] to pre-process the inputs.

In the following examples, we load an audio file using the 🤗 Datasets library, which can be pip installed through the commands below:

```
pip install --upgrade pip
pip install datasets[audio]
```

The audio file we are about to use is loaded as follows:

```python
>>> from datasets import load_dataset

>>> dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
>>> sample = next(iter(dataset))["audio"]
```

The audio prompt should ideally be free of the low-frequency signals usually produced by instruments such as drums and bass. The [Demucs](https://github.com/adefossez/demucs/tree/main) model can be used to separate vocals and other signals from the drums and bass components.

If you wish to use Demucs, you first need to follow the installation steps [here](https://github.com/adefossez/demucs/tree/main?tab=readme-ov-file#for-musicians) before using the following snippet:

```python
from demucs import pretrained
from demucs.apply import apply_model
from demucs.audio import convert_audio
import torch


# convert the raw audio array to a float32 tensor
wav = torch.tensor(sample["array"]).to(torch.float32)

# load the pre-trained Hybrid Transformer Demucs model
demucs = pretrained.get_model('htdemucs')

# resample and set the channel count to what Demucs expects
wav = convert_audio(wav[None], sample["sampling_rate"], demucs.samplerate, demucs.audio_channels)
# separate the audio into its stems (drums, bass, other, vocals)
wav = apply_model(demucs, wav[None])
```

You can then use the following snippet to generate music:

```python
>>> from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

>>> processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
>>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

>>> inputs = processor(
...     audio=wav,
...     sampling_rate=demucs.samplerate,
...     text=["80s blues track with groovy saxophone"],
...     padding=True,
...     return_tensors="pt",
... )
>>> audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
```

You can also pass the audio signal directly without using Demucs, although the quality of the generation will probably be degraded:

```python
>>> from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

>>> processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
>>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

>>> inputs = processor(
...     audio=sample["array"],
...     sampling_rate=sample["sampling_rate"],
...     text=["80s blues track with groovy saxophone"],
...     padding=True,
...     return_tensors="pt",
... )
>>> audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
```

The audio outputs are a three-dimensional Torch tensor of shape `(batch_size, num_channels, sequence_length)`. To listen to the generated audio samples, you can either play them in an ipynb notebook:

```python
from IPython.display import Audio

sampling_rate = model.config.audio_encoder.sampling_rate
Audio(audio_values[0].numpy(), rate=sampling_rate)
```

Or save them as a `.wav` file using a third-party library, e.g. `soundfile`:

```python
>>> import soundfile as sf

>>> sampling_rate = model.config.audio_encoder.sampling_rate
>>> sf.write("musicgen_out.wav", audio_values[0].T.numpy(), sampling_rate)
```

### Text-only Conditional Generation

The same [`MusicgenMelodyProcessor`] can be used to pre-process a text-only prompt.

```python
>>> from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

>>> processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
>>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

>>> inputs = processor(
...     text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
...     padding=True,
...     return_tensors="pt",
... )
>>> audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
```

The `guidance_scale` is used in classifier-free guidance (CFG), setting the weighting between the conditional logits (which are predicted from the text prompts) and the unconditional logits (which are predicted from an unconditional or 'null' prompt). A higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer audio quality. CFG is enabled by setting `guidance_scale > 1`. For best results, use `guidance_scale=3` (the default).
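
Concretely, CFG runs a conditional and an unconditional forward pass and blends the two logit streams at each decoding step. A minimal sketch of the blending rule (simplified; the actual logits processor also handles batching the two passes together):

```python
import torch

def cfg_blend(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    # guidance_scale == 1 returns the conditional logits unchanged;
    # values > 1 push the output distribution further from the unconditional one
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

cond = torch.randn(1, 2048)    # logits from the text-conditioned pass
uncond = torch.randn(1, 2048)  # logits from the 'null' prompt pass
next_token_scores = cfg_blend(cond, uncond, guidance_scale=3.0)
```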

You can also generate in batch:

```python
>>> from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration
>>> from datasets import load_dataset

>>> processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
>>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

>>> # take the first quarter of the audio sample
>>> sample_1 = sample["array"][: len(sample["array"]) // 4]

>>> # take the first half of the audio sample
>>> sample_2 = sample["array"][: len(sample["array"]) // 2]

>>> inputs = processor(
...     audio=[sample_1, sample_2],
...     sampling_rate=sample["sampling_rate"],
...     text=["80s blues track with groovy saxophone", "90s rock song with loud guitars and heavy drums"],
...     padding=True,
...     return_tensors="pt",
... )
>>> audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
```

### Unconditional Generation

The inputs for unconditional (or 'null') generation can be obtained through the method [`MusicgenMelodyProcessor.get_unconditional_inputs`]:

```python
>>> from transformers import MusicgenMelodyForConditionalGeneration, MusicgenMelodyProcessor

>>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")
>>> unconditional_inputs = MusicgenMelodyProcessor.from_pretrained("facebook/musicgen-melody").get_unconditional_inputs(num_samples=1)

>>> audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)
```

### Generation Configuration

The default parameters that control the generation process, such as sampling, guidance scale and the number of generated tokens, can be found in the model's generation config, and updated as desired:

```python
>>> from transformers import MusicgenMelodyForConditionalGeneration

>>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

>>> # inspect the default generation config
>>> model.generation_config

>>> # increase the guidance scale to 4.0
>>> model.generation_config.guidance_scale = 4.0

>>> # decrease the max length to 256 tokens
>>> model.generation_config.max_length = 256
```

Note that any arguments passed to the generate method will **supersede** those in the generation config, so setting `do_sample=False` in the call to generate will supersede the `model.generation_config.do_sample` setting.
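
For example, a sketch of such a per-call override, reusing the model above and the `inputs` from the text-only example:

```python
>>> # sampling stays enabled in the generation config, but this single call decodes greedily
>>> audio_values = model.generate(**inputs, do_sample=False, max_new_tokens=256)
```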

## Model Structure

The MusicGen Melody model can be decomposed into three distinct stages:
1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen Melody models use a frozen text encoder from either T5 or Flan-T5.
2. MusicGen Melody decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditioned on the encoder hidden-state representations.
3. Audio decoder: used to recover the audio waveform from the audio tokens predicted by the decoder.

Thus, the MusicGen Melody model can either be used as a standalone decoder model, corresponding to the class [`MusicgenMelodyForCausalLM`], or as a composite model that includes the text encoder and audio encoder, corresponding to the class [`MusicgenMelodyForConditionalGeneration`]. If only the decoder needs to be loaded from the pre-trained checkpoint, it can be loaded by first specifying the correct config, or be accessed through the `.decoder` attribute of the composite model:

```python
>>> from transformers import AutoConfig, MusicgenMelodyForCausalLM, MusicgenMelodyForConditionalGeneration

>>> # Option 1: get the decoder config and pass it to `.from_pretrained`
>>> decoder_config = AutoConfig.from_pretrained("facebook/musicgen-melody").decoder
>>> decoder = MusicgenMelodyForCausalLM.from_pretrained("facebook/musicgen-melody", **decoder_config.to_dict())

>>> # Option 2: load the entire composite model, but only return the decoder
>>> decoder = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody").decoder
```

Since the text encoder and audio encoder models are frozen during training, the MusicGen Melody decoder [`MusicgenMelodyForCausalLM`] can be trained standalone on a dataset of encoder hidden states and audio codes. For inference, the trained decoder can be combined with the frozen text encoder and audio encoder to recover the composite [`MusicgenMelodyForConditionalGeneration`] model.
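
As a sketch of that recombination step, assuming MusicGen Melody mirrors MusicGen's `from_sub_models_pretrained` helper (the method name, keyword arguments and checkpoint names below are assumptions to verify against the API reference):

```python
from transformers import MusicgenMelodyForConditionalGeneration

# hypothetical recombination of a trained decoder with frozen sub-models
model = MusicgenMelodyForConditionalGeneration.from_sub_models_pretrained(
    text_encoder_pretrained_model_name_or_path="google/t5-v1_1-base",  # assumed text encoder
    audio_encoder_pretrained_model_name_or_path="facebook/encodec_32khz",  # assumed audio codec
    decoder_pretrained_model_name_or_path="path/to/your/trained/decoder",
)
```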

## Checkpoint Conversion

After downloading the original checkpoints from [here](https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md#importing--exporting-models), you can convert them using the conversion script available at `src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py` with the following command:

```bash
python src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py \
    --checkpoint="facebook/musicgen-melody" --pytorch_dump_folder /output/path
```

Tips:
* MusicGen Melody is trained on the 32kHz checkpoint of EnCodec. You should ensure you use a compatible version of the EnCodec model.
* Sampling mode tends to deliver better results than greedy; you can toggle sampling with the `do_sample` argument in the call to [`MusicgenMelodyForConditionalGeneration.generate`].

## MusicgenMelodyDecoderConfig

[[autodoc]] MusicgenMelodyDecoderConfig

## MusicgenMelodyProcessor

[[autodoc]] MusicgenMelodyProcessor
    - get_unconditional_inputs

## MusicgenMelodyFeatureExtractor

[[autodoc]] MusicgenMelodyFeatureExtractor
    - _extract_stem_indices

## MusicgenMelodyConfig

[[autodoc]] MusicgenMelodyConfig

## MusicgenMelodyModel

[[autodoc]] MusicgenMelodyModel
    - forward

## MusicgenMelodyForCausalLM

[[autodoc]] MusicgenMelodyForCausalLM
    - forward

## MusicgenMelodyForConditionalGeneration

[[autodoc]] MusicgenMelodyForConditionalGeneration
    - forward
@@ -37,7 +37,7 @@ You can finetune other architectures for causal language modeling following the
 Choose one of the following architectures:

 <!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->
-[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [Cohere](../model_doc/cohere), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [Fuyu](../model_doc/fuyu), [Gemma](../model_doc/gemma), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Mamba](../model_doc/mamba), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [StableLm](../model_doc/stablelm), [Starcoder2](../model_doc/starcoder2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [Whisper](../model_doc/whisper), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod)
+[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [Cohere](../model_doc/cohere), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [Fuyu](../model_doc/fuyu), [Gemma](../model_doc/gemma), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Mamba](../model_doc/mamba), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MusicGen Melody](../model_doc/musicgen_melody), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [StableLm](../model_doc/stablelm), [Starcoder2](../model_doc/starcoder2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [Whisper](../model_doc/whisper), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod)
@@ -42,6 +42,7 @@ from .utils import (
     is_timm_available,
     is_tokenizers_available,
     is_torch_available,
+    is_torchaudio_available,
     is_torchvision_available,
     is_vision_available,
     logging,
@@ -638,6 +639,11 @@ _import_structure = {
         "MusicgenConfig",
         "MusicgenDecoderConfig",
     ],
+    "models.musicgen_melody": [
+        "MUSICGEN_MELODY_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "MusicgenMelodyConfig",
+        "MusicgenMelodyDecoderConfig",
+    ],
     "models.mvp": ["MvpConfig", "MvpTokenizer"],
     "models.nat": ["NAT_PRETRAINED_CONFIG_ARCHIVE_MAP", "NatConfig"],
     "models.nezha": ["NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP", "NezhaConfig"],
@@ -2787,6 +2793,15 @@ else:
             "MusicgenProcessor",
         ]
     )
+    _import_structure["models.musicgen_melody"].extend(
+        [
+            "MUSICGEN_MELODY_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "MusicgenMelodyForCausalLM",
+            "MusicgenMelodyForConditionalGeneration",
+            "MusicgenMelodyModel",
+            "MusicgenMelodyPreTrainedModel",
+        ]
+    )
     _import_structure["models.mvp"].extend(
         [
             "MVP_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4552,6 +4567,21 @@ else:
     _import_structure["models.pop2piano"].append("Pop2PianoTokenizer")
     _import_structure["models.pop2piano"].append("Pop2PianoProcessor")

+try:
+    if not is_torchaudio_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    from .utils import (
+        dummy_torchaudio_objects,
+    )
+
+    _import_structure["utils.dummy_torchaudio_objects"] = [
+        name for name in dir(dummy_torchaudio_objects) if not name.startswith("_")
+    ]
+else:
+    _import_structure["models.musicgen_melody"].append("MusicgenMelodyFeatureExtractor")
+    _import_structure["models.musicgen_melody"].append("MusicgenMelodyProcessor")
+

 # FLAX-backed objects
 try:
@@ -5459,6 +5489,11 @@ if TYPE_CHECKING:
         MusicgenConfig,
         MusicgenDecoderConfig,
     )
+    from .models.musicgen_melody import (
+        MUSICGEN_MELODY_PRETRAINED_MODEL_ARCHIVE_LIST,
+        MusicgenMelodyConfig,
+        MusicgenMelodyDecoderConfig,
+    )
     from .models.mvp import MvpConfig, MvpTokenizer
     from .models.nat import NAT_PRETRAINED_CONFIG_ARCHIVE_MAP, NatConfig
     from .models.nezha import NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP, NezhaConfig
@@ -7360,6 +7395,13 @@ if TYPE_CHECKING:
            MusicgenPreTrainedModel,
            MusicgenProcessor,
        )
        from .models.musicgen_melody import (
            MUSICGEN_MELODY_PRETRAINED_MODEL_ARCHIVE_LIST,
            MusicgenMelodyForCausalLM,
            MusicgenMelodyForConditionalGeneration,
            MusicgenMelodyModel,
            MusicgenMelodyPreTrainedModel,
        )
        from .models.mvp import (
            MVP_PRETRAINED_MODEL_ARCHIVE_LIST,
            MvpForCausalLM,

@@ -8811,6 +8853,13 @@ if TYPE_CHECKING:
            Pop2PianoTokenizer,
        )

    try:
        if not is_torchaudio_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        from .utils.dummy_torchaudio_objects import *
    else:
        from .models.musicgen_melody import MusicgenMelodyFeatureExtractor, MusicgenMelodyProcessor

    try:
        if not is_flax_available():
            raise OptionalDependencyNotAvailable()

@@ -17,7 +17,7 @@ Audio processing functions to extract features from audio waveforms. This code is pure numpy to support all frameworks
and remove unnecessary dependencies.
"""
import warnings
-from typing import Optional, Union
+from typing import Optional, Tuple, Union

import numpy as np

@@ -94,6 +94,29 @@ def mel_to_hertz(mels: Union[float, np.ndarray], mel_scale: str = "htk") -> Union[float, np.ndarray]:
    return freq


def hertz_to_octave(
    freq: Union[float, np.ndarray], tuning: Optional[float] = 0.0, bins_per_octave: Optional[int] = 12
):
    """
    Convert frequency from hertz to fractional octave numbers.
    Adapted from *librosa*.

    Args:
        freq (`float` or `np.ndarray`):
            The frequency, or multiple frequencies, in hertz (Hz).
        tuning (`float`, defaults to `0.`):
            Tuning deviation from the Stuttgart pitch (A440) in (fractional) bins per octave.
        bins_per_octave (`int`, defaults to `12`):
            Number of bins per octave.

    Returns:
        `float` or `np.ndarray`: The frequencies on the octave scale.
    """
    stuttgart_pitch = 440.0 * 2.0 ** (tuning / bins_per_octave)
    octave = np.log2(freq / (float(stuttgart_pitch) / 16))
    return octave


def _create_triangular_filter_bank(fft_freqs: np.ndarray, filter_freqs: np.ndarray) -> np.ndarray:
    """
    Creates a triangular filter bank.

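As a quick sanity check of the conversion just added (a hedged sketch, assuming `hertz_to_octave` is in scope): with the default tuning, the reference pitch is 440 Hz, so A440 lands exactly four octaves above the 27.5 Hz origin.

```python
import numpy as np

# Minimal check of hertz_to_octave: log2(440 / (440 / 16)) = log2(16) = 4.
print(hertz_to_octave(440.0))  # 4.0
print(hertz_to_octave(np.array([220.0, 880.0])))  # [3. 5.]
```
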
@@ -116,6 +139,81 @@ def _create_triangular_filter_bank(fft_freqs: np.ndarray, filter_freqs: np.ndarray) -> np.ndarray:
    return np.maximum(np.zeros(1), np.minimum(down_slopes, up_slopes))


def chroma_filter_bank(
    num_frequency_bins: int,
    num_chroma: int,
    sampling_rate: int,
    tuning: float = 0.0,
    power: Optional[float] = 2.0,
    weighting_parameters: Optional[Tuple[float]] = (5.0, 2),
    start_at_c_chroma: Optional[bool] = True,
):
    """
    Creates a chroma filter bank, i.e. a linear transformation to project spectrogram bins onto chroma bins.

    Adapted from *librosa*.

    Args:
        num_frequency_bins (`int`):
            Number of frequencies used to compute the spectrogram (should be the same as in `stft`).
        num_chroma (`int`):
            Number of chroma bins (i.e. pitch classes).
        sampling_rate (`float`):
            Sample rate of the audio waveform.
        tuning (`float`):
            Tuning deviation from A440 in fractions of a chroma bin.
        power (`float`, *optional*, defaults to 2.0):
            If 2.0, normalizes each column with its L2 norm. If 1.0, normalizes each column with its L1 norm.
        weighting_parameters (`Tuple[float]`, *optional*, defaults to `(5., 2.)`):
            If specified, apply a Gaussian weighting parameterized by the first element of the tuple being the center
            and the second element being the Gaussian half-width.
        start_at_c_chroma (`bool`, *optional*, defaults to `True`):
            If `True`, the filter bank will start at the 'C' pitch class. Otherwise, it will start at 'A'.
    Returns:
        `np.ndarray` of shape `(num_chroma, int(1 + num_frequency_bins / 2))`: The chroma filter bank.
    """
    # Get the FFT bins, not counting the DC component
    frequencies = np.linspace(0, sampling_rate, num_frequency_bins, endpoint=False)[1:]

    freq_bins = num_chroma * hertz_to_octave(frequencies, tuning=tuning, bins_per_octave=num_chroma)

    # make up a value for the 0 Hz bin = 1.5 octaves below bin 1
    # (so chroma is 50% rotated from bin 1, and bin width is broad)
    freq_bins = np.concatenate(([freq_bins[0] - 1.5 * num_chroma], freq_bins))

    bins_width = np.concatenate((np.maximum(freq_bins[1:] - freq_bins[:-1], 1.0), [1]))

    chroma_filters = np.subtract.outer(freq_bins, np.arange(0, num_chroma, dtype="d")).T

    num_chroma2 = np.round(float(num_chroma) / 2)

    # Project into range -num_chroma/2 .. num_chroma/2
    # add on fixed offset of 10*num_chroma to ensure all values passed to
    # rem are positive
    chroma_filters = np.remainder(chroma_filters + num_chroma2 + 10 * num_chroma, num_chroma) - num_chroma2

    # Gaussian bumps - 2*D to make them narrower
    chroma_filters = np.exp(-0.5 * (2 * chroma_filters / np.tile(bins_width, (num_chroma, 1))) ** 2)

    # normalize each column
    if power is not None:
        chroma_filters = chroma_filters / np.sum(chroma_filters**power, axis=0, keepdims=True) ** (1.0 / power)

    # Maybe apply scaling for fft bins
    if weighting_parameters is not None:
        center, half_width = weighting_parameters
        chroma_filters *= np.tile(
            np.exp(-0.5 * (((freq_bins / num_chroma - center) / half_width) ** 2)),
            (num_chroma, 1),
        )

    if start_at_c_chroma:
        chroma_filters = np.roll(chroma_filters, -3 * (num_chroma // 12), axis=0)

    # remove aliasing columns, copy to ensure row-contiguity
    return np.ascontiguousarray(chroma_filters[:, : int(1 + num_frequency_bins / 2)])


def mel_filter_bank(
    num_frequency_bins: int,
    num_mel_filters: int,

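To make the shape conventions concrete, here is a small usage sketch of the new `chroma_filter_bank` function (the values mirror the Musicgen Melody feature extractor defaults; the dummy spectrogram is illustrative only):

```python
import numpy as np

# Build a 12-bin chroma filter bank for 32 kHz audio analysed with a 16384-point FFT.
filters = chroma_filter_bank(num_frequency_bins=16384, num_chroma=12, sampling_rate=32000)
print(filters.shape)  # (12, 8193): one row per pitch class, one column per positive FFT bin

# Project a dummy (freq, time) power spectrogram onto pitch classes.
spectrogram = np.abs(np.random.randn(8193, 10)) ** 2
chromagram = filters @ spectrogram  # (12, 10)
```
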
@@ -153,6 +153,7 @@ from . import (
    mra,
    mt5,
    musicgen,
    musicgen_melody,
    mvp,
    nat,
    nezha,

@@ -161,6 +161,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("mra", "MraConfig"),
        ("mt5", "MT5Config"),
        ("musicgen", "MusicgenConfig"),
        ("musicgen_melody", "MusicgenMelodyConfig"),
        ("mvp", "MvpConfig"),
        ("nat", "NatConfig"),
        ("nezha", "NezhaConfig"),

@@ -396,6 +397,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("mpt", "MPT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("mra", "MRA_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("musicgen", "MUSICGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("musicgen_melody", "MUSICGEN_MELODY_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("mvp", "MVP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("nat", "NAT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("nezha", "NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP"),

@@ -649,6 +651,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("mra", "MRA"),
        ("mt5", "MT5"),
        ("musicgen", "MusicGen"),
        ("musicgen_melody", "MusicGen Melody"),
        ("mvp", "MVP"),
        ("nat", "NAT"),
        ("nezha", "Nezha"),

@@ -454,6 +454,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
        ("mixtral", "MixtralForCausalLM"),
        ("mpt", "MptForCausalLM"),
        ("musicgen", "MusicgenForCausalLM"),
        ("musicgen_melody", "MusicgenMelodyForCausalLM"),
        ("mvp", "MvpForCausalLM"),
        ("open-llama", "OpenLlamaForCausalLM"),
        ("openai-gpt", "OpenAIGPTLMHeadModel"),

@@ -1176,6 +1177,7 @@ MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES = OrderedDict(
        ("bark", "BarkModel"),
        ("fastspeech2_conformer", "FastSpeech2ConformerWithHifiGan"),
        ("musicgen", "MusicgenForConditionalGeneration"),
        ("musicgen_melody", "MusicgenMelodyForConditionalGeneration"),
        ("seamless_m4t", "SeamlessM4TForTextToSpeech"),
        ("seamless_m4t_v2", "SeamlessM4Tv2ForTextToSpeech"),
        ("vits", "VitsModel"),

@@ -280,6 +280,7 @@ else:
            ),
        ),
        ("musicgen", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)),
        ("musicgen_melody", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)),
        ("mvp", ("MvpTokenizer", "MvpTokenizerFast" if is_tokenizers_available() else None)),
        ("nezha", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
        (

@@ -0,0 +1,90 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
    is_torchaudio_available,
)


_import_structure = {
    "configuration_musicgen_melody": [
        "MUSICGEN_MELODY_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "MusicgenMelodyConfig",
        "MusicgenMelodyDecoderConfig",
    ],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_musicgen_melody"] = [
        "MUSICGEN_MELODY_PRETRAINED_MODEL_ARCHIVE_LIST",
        "MusicgenMelodyForConditionalGeneration",
        "MusicgenMelodyForCausalLM",
        "MusicgenMelodyModel",
        "MusicgenMelodyPreTrainedModel",
    ]

try:
    if not is_torchaudio_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["feature_extraction_musicgen_melody"] = ["MusicgenMelodyFeatureExtractor"]
    _import_structure["processing_musicgen_melody"] = ["MusicgenMelodyProcessor"]


if TYPE_CHECKING:
    from .configuration_musicgen_melody import (
        MUSICGEN_MELODY_PRETRAINED_CONFIG_ARCHIVE_MAP,
        MusicgenMelodyConfig,
        MusicgenMelodyDecoderConfig,
    )

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_musicgen_melody import (
            MUSICGEN_MELODY_PRETRAINED_MODEL_ARCHIVE_LIST,
            MusicgenMelodyForCausalLM,
            MusicgenMelodyForConditionalGeneration,
            MusicgenMelodyModel,
            MusicgenMelodyPreTrainedModel,
        )

    try:
        if not is_torchaudio_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .feature_extraction_musicgen_melody import MusicgenMelodyFeatureExtractor
        from .processing_musicgen_melody import MusicgenMelodyProcessor


else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)

@@ -0,0 +1,256 @@
# coding=utf-8
# Copyright 2024 Meta AI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Musicgen Melody model configuration"""

from ...configuration_utils import PretrainedConfig
from ...utils import logging
from ..auto.configuration_auto import AutoConfig


logger = logging.get_logger(__name__)

MUSICGEN_MELODY_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "facebook/musicgen-melody": "https://huggingface.co/facebook/musicgen-melody/resolve/main/config.json",
}


class MusicgenMelodyDecoderConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`MusicgenMelodyDecoder`]. It is used to instantiate a
    Musicgen Melody decoder according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the Musicgen Melody
    [facebook/musicgen-melody](https://huggingface.co/facebook/musicgen-melody) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 2048):
            Vocabulary size of the MusicgenMelodyDecoder model. Defines the number of different tokens that can be
            represented by the `inputs_ids` passed when calling [`MusicgenMelodyDecoder`].
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with. Typically, set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        num_hidden_layers (`int`, *optional*, defaults to 24):
            Number of decoder layers.
        ffn_dim (`int`, *optional*, defaults to 4096):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer block.
        num_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer block.
        layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the decoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether the model should return the last key/values attentions (not used by all models).
        activation_function (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the decoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_size (`int`, *optional*, defaults to 1024):
            Dimensionality of the layers and the pooler layer.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, text_encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for activations inside the fully connected layer.
        initializer_factor (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        scale_embedding (`bool`, *optional*, defaults to `False`):
            Scale embeddings by dividing by sqrt(hidden_size).
        num_codebooks (`int`, *optional*, defaults to 4):
            The number of parallel codebooks forwarded to the model.
        audio_channels (`int`, *optional*, defaults to 1):
            Number of audio channels used by the model (either mono or stereo). Stereo models generate a separate
            audio stream for the left/right output channels. Mono models generate a single audio stream output.
        pad_token_id (`int`, *optional*, defaults to 2048): The id of the *padding* token.
        bos_token_id (`int`, *optional*, defaults to 2048): The id of the *beginning-of-sequence* token.
        eos_token_id (`int`, *optional*): The id of the *end-of-sequence* token.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`): Whether to tie word embeddings with the text encoder.
    """

    model_type = "musicgen_melody_decoder"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=2048,
        max_position_embeddings=2048,
        num_hidden_layers=24,
        ffn_dim=4096,
        num_attention_heads=16,
        layerdrop=0.0,
        use_cache=True,
        activation_function="gelu",
        hidden_size=1024,
        dropout=0.1,
        attention_dropout=0.0,
        activation_dropout=0.0,
        initializer_factor=0.02,
        scale_embedding=False,
        num_codebooks=4,
        audio_channels=1,
        pad_token_id=2048,
        bos_token_id=2048,
        eos_token_id=None,
        tie_word_embeddings=False,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.ffn_dim = ffn_dim
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.dropout = dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.activation_function = activation_function
        self.initializer_factor = initializer_factor
        self.layerdrop = layerdrop
        self.use_cache = use_cache
        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True
        self.num_codebooks = num_codebooks

        if audio_channels not in [1, 2]:
            raise ValueError(f"Expected 1 (mono) or 2 (stereo) audio channels, got {audio_channels} channels.")
        self.audio_channels = audio_channels

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`MusicgenMelodyModel`]. It is used to instantiate a
|
||||
Musicgen Melody model according to the specified arguments, defining the text encoder, audio encoder and Musicgen Melody decoder
|
||||
configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the Musicgen Melody
|
||||
[facebook/musicgen-melody](https://huggingface.co/facebook/musicgen-melody) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
Args:
|
||||
num_chroma (`int`, *optional*, defaults to 12): Number of chroma bins to use.
|
||||
chroma_length (`int`, *optional*, defaults to 235):
|
||||
Maximum chroma duration if audio is used to condition the model. Corresponds to the maximum duration used during training.
|
||||
kwargs (*optional*):
|
||||
Dictionary of keyword arguments. Notably:
|
||||
|
||||
- **text_encoder** ([`PretrainedConfig`], *optional*) -- An instance of a configuration object that
|
||||
defines the text encoder config.
|
||||
- **audio_encoder** ([`PretrainedConfig`], *optional*) -- An instance of a configuration object that
|
||||
defines the audio encoder config.
|
||||
- **decoder** ([`PretrainedConfig`], *optional*) -- An instance of a configuration object that defines
|
||||
the decoder config.
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import (
|
||||
... MusicgenMelodyConfig,
|
||||
... MusicgenMelodyDecoderConfig,
|
||||
... T5Config,
|
||||
... EncodecConfig,
|
||||
... MusicgenMelodyForConditionalGeneration,
|
||||
... )
|
||||
|
||||
>>> # Initializing text encoder, audio encoder, and decoder model configurations
|
||||
>>> text_encoder_config = T5Config()
|
||||
>>> audio_encoder_config = EncodecConfig()
|
||||
>>> decoder_config = MusicgenMelodyDecoderConfig()
|
||||
|
||||
>>> configuration = MusicgenMelodyConfig.from_sub_models_config(
|
||||
... text_encoder_config, audio_encoder_config, decoder_config
|
||||
... )
|
||||
|
||||
>>> # Initializing a MusicgenMelodyForConditionalGeneration (with random weights) from the facebook/musicgen-melody style configuration
|
||||
>>> model = MusicgenMelodyForConditionalGeneration(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
>>> config_text_encoder = model.config.text_encoder
|
||||
>>> config_audio_encoder = model.config.audio_encoder
|
||||
>>> config_decoder = model.config.decoder
|
||||
|
||||
>>> # Saving the model, including its configuration
|
||||
>>> model.save_pretrained("musicgen_melody-model")
|
||||
|
||||
>>> # loading model and config from pretrained folder
|
||||
>>> musicgen_melody_config = MusicgenMelodyConfig.from_pretrained("musicgen_melody-model")
|
||||
>>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("musicgen_melody-model", config=musicgen_melody_config)
|
||||
```"""
|
||||
|
||||
model_type = "musicgen_melody"
|
||||
is_composition = True
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
num_chroma=12,
|
||||
chroma_length=235,
|
||||
**kwargs,
|
||||
):
|
||||
super().__init__(**kwargs)
|
||||
if "text_encoder" not in kwargs or "audio_encoder" not in kwargs or "decoder" not in kwargs:
|
||||
raise ValueError("Config has to be initialized with text_encoder, audio_encoder and decoder config")
|
||||
|
||||
text_encoder_config = kwargs.pop("text_encoder")
|
||||
text_encoder_model_type = text_encoder_config.pop("model_type")
|
||||
|
||||
audio_encoder_config = kwargs.pop("audio_encoder")
|
||||
audio_encoder_model_type = audio_encoder_config.pop("model_type")
|
||||
|
||||
decoder_config = kwargs.pop("decoder")
|
||||
|
||||
self.text_encoder = AutoConfig.for_model(text_encoder_model_type, **text_encoder_config)
|
||||
self.audio_encoder = AutoConfig.for_model(audio_encoder_model_type, **audio_encoder_config)
|
||||
self.decoder = MusicgenMelodyDecoderConfig(**decoder_config)
|
||||
self.is_encoder_decoder = False
|
||||
|
||||
self.num_chroma = num_chroma
|
||||
self.chroma_length = chroma_length
|
||||
|
||||
@classmethod
|
||||
def from_sub_models_config(
|
||||
cls,
|
||||
text_encoder_config: PretrainedConfig,
|
||||
audio_encoder_config: PretrainedConfig,
|
||||
decoder_config: MusicgenMelodyDecoderConfig,
|
||||
**kwargs,
|
||||
):
|
||||
r"""
|
||||
Instantiate a [`MusicgenMelodyConfig`] (or a derived class) from text encoder, audio encoder and decoder
|
||||
configurations.
|
||||
|
||||
Returns:
|
||||
[`MusicgenMelodyConfig`]: An instance of a configuration object
|
||||
"""
|
||||
|
||||
return cls(
|
||||
text_encoder=text_encoder_config.to_dict(),
|
||||
audio_encoder=audio_encoder_config.to_dict(),
|
||||
decoder=decoder_config.to_dict(),
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
@property
|
||||
# This is a property because you might want to change the codec model on the fly
|
||||
def sampling_rate(self):
|
||||
return self.audio_encoder.sampling_rate
|
|
@@ -0,0 +1,266 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert Musicgen Melody checkpoints from the original repository."""
import argparse
from pathlib import Path
from typing import Dict, OrderedDict, Tuple

import torch
from audiocraft.models import MusicGen

from transformers import (
    AutoTokenizer,
    EncodecModel,
    T5EncoderModel,
)
from transformers.models.musicgen_melody.configuration_musicgen_melody import MusicgenMelodyDecoderConfig
from transformers.models.musicgen_melody.feature_extraction_musicgen_melody import MusicgenMelodyFeatureExtractor
from transformers.models.musicgen_melody.modeling_musicgen_melody import (
    MusicgenMelodyForCausalLM,
    MusicgenMelodyForConditionalGeneration,
)
from transformers.models.musicgen_melody.processing_musicgen_melody import MusicgenMelodyProcessor
from transformers.utils import logging


logging.set_verbosity_info()
logger = logging.get_logger(__name__)


EXPECTED_MISSING_KEYS = ["model.decoder.embed_positions.weights"]
EXPECTED_ADDITIONAL_KEYS = ["condition_provider.conditioners.self_wav.chroma.spec.window"]


def rename_keys(name):
    if "emb" in name:
        name = name.replace("emb", "model.decoder.embed_tokens")
    if "transformer" in name:
        name = name.replace("transformer", "model.decoder")
    if "cross_attention" in name:
        name = name.replace("cross_attention", "encoder_attn")
    if "linear1" in name:
        name = name.replace("linear1", "fc1")
    if "linear2" in name:
        name = name.replace("linear2", "fc2")
    if "norm1" in name:
        name = name.replace("norm1", "self_attn_layer_norm")
    if "norm_cross" in name:
        name = name.replace("norm_cross", "encoder_attn_layer_norm")
    if "norm2" in name:
        name = name.replace("norm2", "final_layer_norm")
    if "out_norm" in name:
        name = name.replace("out_norm", "model.decoder.layer_norm")
    if "linears" in name:
        name = name.replace("linears", "lm_heads")
    if "condition_provider.conditioners.description.output_proj" in name:
        name = name.replace("condition_provider.conditioners.description.output_proj", "enc_to_dec_proj")
    if "condition_provider.conditioners.self_wav.output_proj" in name:
        name = name.replace("condition_provider.conditioners.self_wav.output_proj", "audio_enc_to_dec_proj")
    return name


def rename_state_dict(state_dict: OrderedDict, hidden_size: int) -> Tuple[Dict, Dict, Dict]:
    """Function that takes the fairseq MusicgenMelody state dict and renames it according to the HF
    module names. It further partitions the state dict into the decoder (LM) state dict, and separate state
    dicts for the text encoder projection and the audio encoder projection."""
    keys = list(state_dict.keys())
    enc_dec_proj_state_dict = {}
    audio_enc_to_dec_proj_state_dict = {}
    for key in keys:
        val = state_dict.pop(key)
        key = rename_keys(key)
        if "in_proj_weight" in key:
            # split fused qkv proj
            state_dict[key.replace("in_proj_weight", "q_proj.weight")] = val[:hidden_size, :]
            state_dict[key.replace("in_proj_weight", "k_proj.weight")] = val[hidden_size : 2 * hidden_size, :]
            state_dict[key.replace("in_proj_weight", "v_proj.weight")] = val[-hidden_size:, :]
        elif "audio_enc_to_dec_proj" in key:
            audio_enc_to_dec_proj_state_dict[key[len("audio_enc_to_dec_proj.") :]] = val
        elif "enc_to_dec_proj" in key:
            enc_dec_proj_state_dict[key[len("enc_to_dec_proj.") :]] = val
        else:
            state_dict[key] = val
    return state_dict, enc_dec_proj_state_dict, audio_enc_to_dec_proj_state_dict

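The `in_proj_weight` branch above splits the original checkpoint's fused attention projection into separate q/k/v matrices. A hedged standalone sketch of that split, with toy sizes:

```python
import torch

hidden_size = 4
# Fused projection of shape (3 * hidden_size, hidden_size), as stored in the original checkpoint.
in_proj_weight = torch.randn(3 * hidden_size, hidden_size)

q_proj = in_proj_weight[:hidden_size, :]
k_proj = in_proj_weight[hidden_size : 2 * hidden_size, :]
v_proj = in_proj_weight[-hidden_size:, :]
assert torch.equal(torch.cat([q_proj, k_proj, v_proj]), in_proj_weight)
```
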
def decoder_config_from_checkpoint(checkpoint: str) -> MusicgenMelodyDecoderConfig:
    if checkpoint == "facebook/musicgen-melody" or checkpoint == "facebook/musicgen-stereo-melody":
        hidden_size = 1536
        num_hidden_layers = 48
        num_attention_heads = 24
    elif checkpoint == "facebook/musicgen-melody-large" or checkpoint == "facebook/musicgen-stereo-melody-large":
        hidden_size = 2048
        num_hidden_layers = 48
        num_attention_heads = 32
    else:
        raise ValueError(
            "Checkpoint should be one of `['facebook/musicgen-melody', 'facebook/musicgen-melody-large']` for the mono checkpoints, "
            "or `['facebook/musicgen-stereo-melody', 'facebook/musicgen-stereo-melody-large']` "
            f"for the stereo checkpoints, got {checkpoint}."
        )

    if "stereo" in checkpoint:
        audio_channels = 2
        num_codebooks = 8
    else:
        audio_channels = 1
        num_codebooks = 4

    config = MusicgenMelodyDecoderConfig(
        hidden_size=hidden_size,
        ffn_dim=hidden_size * 4,
        num_hidden_layers=num_hidden_layers,
        num_attention_heads=num_attention_heads,
        num_codebooks=num_codebooks,
        audio_channels=audio_channels,
    )
    return config


@torch.no_grad()
def convert_musicgen_melody_checkpoint(
    checkpoint, pytorch_dump_folder=None, repo_id=None, device="cpu", test_same_output=False
):
    fairseq_model = MusicGen.get_pretrained(checkpoint, device=device)
    decoder_config = decoder_config_from_checkpoint(checkpoint)

    decoder_state_dict = fairseq_model.lm.state_dict()
    decoder_state_dict, enc_dec_proj_state_dict, audio_enc_to_dec_proj_state_dict = rename_state_dict(
        decoder_state_dict, hidden_size=decoder_config.hidden_size
    )

    text_encoder = T5EncoderModel.from_pretrained("t5-base")
    audio_encoder = EncodecModel.from_pretrained("facebook/encodec_32khz")
    decoder = MusicgenMelodyForCausalLM(decoder_config).eval()

    # load all decoder weights - expect that we'll be missing embeddings and enc-dec projection
    missing_keys, unexpected_keys = decoder.load_state_dict(decoder_state_dict, strict=False)

    for key in missing_keys.copy():
        if key.startswith(("text_encoder", "audio_encoder")) or key in EXPECTED_MISSING_KEYS:
            missing_keys.remove(key)

    for key in unexpected_keys.copy():
        if key in EXPECTED_ADDITIONAL_KEYS:
            unexpected_keys.remove(key)

    if len(missing_keys) > 0:
        raise ValueError(f"Missing key(s) in state_dict: {missing_keys}")

    if len(unexpected_keys) > 0:
        raise ValueError(f"Unexpected key(s) in state_dict: {unexpected_keys}")

    # init the composite model
    model = MusicgenMelodyForConditionalGeneration(
        text_encoder=text_encoder, audio_encoder=audio_encoder, decoder=decoder
    ).to(device)

    # load the pre-trained enc-dec projection (from the decoder state dict)
    model.enc_to_dec_proj.load_state_dict(enc_dec_proj_state_dict)

    # load the pre-trained audio encoder projection (from the decoder state dict)
    model.audio_enc_to_dec_proj.load_state_dict(audio_enc_to_dec_proj_state_dict)

    # check we can do a forward pass
    input_ids = torch.arange(0, 2 * decoder_config.num_codebooks, dtype=torch.long).reshape(2, -1).to(device)
    decoder_input_ids = input_ids.reshape(2 * decoder_config.num_codebooks, -1).to(device)

    with torch.no_grad():
        logits = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids).logits

    output_length = 1 + input_ids.shape[1] + model.config.chroma_length
    if logits.shape != (2 * decoder_config.num_codebooks, output_length, 2048):
        raise ValueError("Incorrect shape for logits")

    # now construct the processor
    tokenizer = AutoTokenizer.from_pretrained("t5-base")
    feature_extractor = MusicgenMelodyFeatureExtractor()

    processor = MusicgenMelodyProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

    # set the appropriate bos/pad token ids
    model.generation_config.decoder_start_token_id = 2048
    model.generation_config.pad_token_id = 2048

    # set other default generation config params
    model.generation_config.max_length = int(30 * audio_encoder.config.frame_rate)
    model.generation_config.do_sample = True
    model.generation_config.guidance_scale = 3.0

    if test_same_output:
        # check the converted model gives the same output as the original model
        decoder_input_ids = torch.ones_like(decoder_input_ids).to(device) * model.generation_config.pad_token_id
        with torch.no_grad():
            decoder_input_ids = decoder_input_ids[: decoder_config.num_codebooks]
            inputs = processor(text=["gen"], return_tensors="pt", padding=True).to(device)
            logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits

            attributes, prompt_tokens = fairseq_model._prepare_tokens_and_attributes(["gen"], None)
            original_logits = fairseq_model.lm.forward(
                decoder_input_ids.reshape(1, decoder_config.num_codebooks, -1), attributes
            )

            torch.testing.assert_close(
                original_logits.squeeze(2).reshape(decoder_config.num_codebooks, -1),
                logits[:, -1],
                rtol=1e-5,
                atol=5e-5,
            )

    if pytorch_dump_folder is not None:
        Path(pytorch_dump_folder).mkdir(exist_ok=True)
        logger.info(f"Saving model {checkpoint} to {pytorch_dump_folder}")
        model.save_pretrained(pytorch_dump_folder)
        processor.save_pretrained(pytorch_dump_folder)

    if repo_id:
        logger.info(f"Pushing model {checkpoint} to {repo_id}")
        model.push_to_hub(repo_id, create_pr=True)
        processor.push_to_hub(repo_id, create_pr=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument(
        "--checkpoint",
        default="facebook/musicgen-melody",
        type=str,
        help="Checkpoint size of the Musicgen Melody model you'd like to convert. Can be one of: "
        "`['facebook/musicgen-melody', 'facebook/musicgen-melody-large']` for the mono checkpoints, or "
        "`['facebook/musicgen-stereo-melody', 'facebook/musicgen-stereo-melody-large']` "
        "for the stereo checkpoints.",
    )
    parser.add_argument(
        "--pytorch_dump_folder",
        default=None,
        type=str,
        help="Path to the output PyTorch model directory.",
    )
    parser.add_argument(
        "--push_to_hub",
        default="musicgen-melody",
        type=str,
        help="Where to upload the converted model on the 🤗 hub.",
    )
    parser.add_argument(
        "--device", default="cpu", type=str, help="Torch device to run the conversion, either cpu or cuda."
    )
    # use a store_true flag rather than `type=bool`: argparse's `type=bool` treats any non-empty string (even "False") as True
    parser.add_argument(
        "--test_same_output", action="store_true", help="If set, test that the converted model produces the same output logits."
    )

    args = parser.parse_args()
    convert_musicgen_melody_checkpoint(
        args.checkpoint, args.pytorch_dump_folder, args.push_to_hub, args.device, args.test_same_output
    )

@@ -0,0 +1,330 @@
# coding=utf-8
# Copyright 2024 Meta AI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Feature extractor class for Musicgen Melody
"""
import copy
from typing import Any, Dict, List, Optional, Union

import numpy as np

from ...audio_utils import chroma_filter_bank
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import TensorType, is_torch_available, is_torchaudio_available, logging


if is_torch_available():
    import torch

if is_torchaudio_available():
    import torchaudio

logger = logging.get_logger(__name__)


class MusicgenMelodyFeatureExtractor(SequenceFeatureExtractor):
    r"""
    Constructs a MusicgenMelody feature extractor.

    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
    most of the main methods. Users should refer to this superclass for more information regarding those methods.

    This class extracts chroma features from audio processed by [Demucs](https://github.com/adefossez/demucs/tree/main) or
    directly from raw audio waveform.

    Args:
        feature_size (`int`, *optional*, defaults to 12):
            The feature dimension of the extracted features.
        sampling_rate (`int`, *optional*, defaults to 32000):
            The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
        hop_length (`int`, *optional*, defaults to 4096):
            Length of the overlapping windows for the STFT used to obtain the chroma features.
        chunk_length (`int`, *optional*, defaults to 30):
            The maximum number of chunks of `sampling_rate` samples used to trim and pad longer or shorter audio
            sequences.
        n_fft (`int`, *optional*, defaults to 16384):
            Size of the Fourier transform.
        num_chroma (`int`, *optional*, defaults to 12):
            Number of chroma bins to use.
        padding_value (`float`, *optional*, defaults to 0.0):
            Padding value used to pad the audio.
        return_attention_mask (`bool`, *optional*, defaults to `False`):
            Whether to return the attention mask. Can be overwritten when calling the feature extractor.

            [What are attention masks?](../glossary#attention-mask)

            <Tip>

            For Musicgen Melody models, the audio `attention_mask` is not necessary: inputs are padded to a fixed
            length with the silence token (zero).

            </Tip>
        stem_indices (`List[int]`, *optional*, defaults to `[3, 2]`):
            Stem channels to extract if demucs outputs are passed.
    """

    model_input_names = ["input_features"]

    def __init__(
        self,
        feature_size=12,
        sampling_rate=32000,
        hop_length=4096,
        chunk_length=30,
        n_fft=16384,
        num_chroma=12,
        padding_value=0.0,
        return_attention_mask=False,  # pad inputs to max length with silence token (zero) and no attention mask
        stem_indices=[3, 2],
        **kwargs,
    ):
        super().__init__(
            feature_size=feature_size,
            sampling_rate=sampling_rate,
            padding_value=padding_value,
            return_attention_mask=return_attention_mask,
            **kwargs,
        )
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.chunk_length = chunk_length
        self.n_samples = chunk_length * sampling_rate
        self.sampling_rate = sampling_rate
        self.chroma_filters = torch.from_numpy(
            chroma_filter_bank(sampling_rate=sampling_rate, num_frequency_bins=n_fft, tuning=0, num_chroma=num_chroma)
        ).float()
        self.spectrogram = torchaudio.transforms.Spectrogram(
            n_fft=n_fft, win_length=n_fft, hop_length=hop_length, power=2, center=True, pad=0, normalized=True
        )
        self.stem_indices = stem_indices

    def _torch_extract_fbank_features(self, waveform: torch.Tensor) -> torch.Tensor:
        """
        Compute the chroma spectrogram of the provided audio using the torchaudio spectrogram implementation and the librosa chroma features.
        """

        # if wav length is not long enough, pad it
        wav_length = waveform.shape[-1]
        if wav_length < self.n_fft:
            pad = self.n_fft - wav_length
            rest = 0 if pad % 2 == 0 else 1
            waveform = torch.nn.functional.pad(waveform, (pad // 2, pad // 2 + rest), "constant", 0)

        # squeeze alongside channel dimension
        spec = self.spectrogram(waveform).squeeze(1)

        # project onto the chroma bins (weighted sum over the frequency dimension)
        raw_chroma = torch.einsum("cf, ...ft->...ct", self.chroma_filters, spec)

        # normalise with max value
        norm_chroma = torch.nn.functional.normalize(raw_chroma, p=float("inf"), dim=-2, eps=1e-6)

        # transpose time and chroma dimension -> (batch, time, chroma)
        norm_chroma = norm_chroma.transpose(1, 2)

        # replace max value alongside chroma dimension with 1 and replace the rest with 0
        idx = norm_chroma.argmax(-1, keepdim=True)
        norm_chroma[:] = 0
        norm_chroma.scatter_(dim=-1, index=idx, value=1)

        return norm_chroma

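The final argmax/scatter step quantizes each frame to a one-hot pitch class, discarding everything but the dominant chroma bin. In isolation, with toy values:

```python
import torch

# (batch, time, chroma): two frames, three chroma bins
norm_chroma = torch.tensor([[[0.2, 1.0, 0.5], [0.9, 0.1, 0.3]]])
idx = norm_chroma.argmax(-1, keepdim=True)
norm_chroma[:] = 0
norm_chroma.scatter_(dim=-1, index=idx, value=1)
print(norm_chroma)  # tensor([[[0., 1., 0.], [1., 0., 0.]]])
```
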
    def _extract_stem_indices(self, audio, sampling_rate=None):
        """
        Extracts stems from the output of the [Demucs](https://github.com/adefossez/demucs/tree/main) audio separation model,
        then converts to mono-channel and resamples to the feature extractor sampling rate.

        Args:
            audio (`torch.Tensor` of shape `(batch_size, num_stems, channel_size, audio_length)`):
                The output of the Demucs model to be processed.
            sampling_rate (`int`, *optional*):
                Demucs sampling rate. If not specified, defaults to `44000`.
        """
        sampling_rate = 44000 if sampling_rate is None else sampling_rate

        # extract "vocals" and "others" sources from audio encoder (demucs) output
        # [batch_size, num_stems, channel_size, audio_length]
        wav = audio[:, torch.tensor(self.stem_indices)]

        # merge extracted stems to single waveform
        wav = wav.sum(1)

        # convert to mono-channel waveform
        wav = wav.mean(dim=1, keepdim=True)

        # resample to model sampling rate
        # not equivalent to julius.resample
        if sampling_rate != self.sampling_rate:
            wav = torchaudio.functional.resample(
                wav, sampling_rate, self.sampling_rate, rolloff=0.945, lowpass_filter_width=24
            )

        # [batch_size, 1, audio_length] -> [batch_size, audio_length]
        wav = wav.squeeze(1)

        return wav

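A hedged sketch of this method on a dummy Demucs-shaped tensor (the default stem indices 3 and 2 select the "vocals" and "others" sources, per the comment above; the shapes are illustrative):

```python
import torch

extractor = MusicgenMelodyFeatureExtractor()

# Dummy Demucs-like output: (batch=1, num_stems=4, channels=2, audio_length=44000)
demucs_output = torch.randn(1, 4, 2, 44000)
wav = extractor._extract_stem_indices(demucs_output, sampling_rate=44000)
print(wav.shape)  # torch.Size([1, 32000]): stems summed, channels averaged, resampled to 32 kHz
```
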
    def __call__(
        self,
        audio: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
        truncation: bool = True,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_attention_mask: Optional[bool] = None,
        padding: Optional[str] = True,
        max_length: Optional[int] = None,
        sampling_rate: Optional[int] = None,
        **kwargs,
    ) -> BatchFeature:
        """
        Main method to featurize and prepare for the model one or several sequence(s).

        Args:
            audio (`torch.Tensor`, `np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[torch.Tensor]`, `List[List[float]]`):
                The sequence or batch of sequences to be padded. Each sequence can be a torch tensor, a numpy array, a list of float
                values, a list of numpy arrays, a list of torch tensors, or a list of list of float values.
                If `audio` is the output of Demucs, it has to be a torch tensor of shape `(batch_size, num_stems, channel_size, audio_length)`.
                Otherwise, it must be mono or stereo channel audio.
            truncation (`bool`, *optional*, defaults to `True`):
                Activates truncation to cut input sequences longer than *max_length* to *max_length*.
            pad_to_multiple_of (`int`, *optional*, defaults to None):
                If set will pad the sequence to a multiple of the provided value.

                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return Numpy `np.ndarray` objects.
            return_attention_mask (`bool`, *optional*):
                Whether to return the attention mask. If left to the default, will return the attention mask according
                to the specific feature_extractor's default.

                [What are attention masks?](../glossary#attention-mask)

                <Tip>
                For Musicgen Melody models, audio `attention_mask` is not necessary.
                </Tip>

            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:

                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                  sequence is provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                  acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
                  lengths).
            max_length (`int`, *optional*):
                Maximum length of the returned list and optionally padding length (see above).
            sampling_rate (`int`, *optional*):
                The sampling rate at which the `audio` input was sampled. It is strongly recommended to pass
                `sampling_rate` at the forward call to prevent silent errors.
                Note that if `audio` is the output of Demucs, `sampling_rate` must be the sampling rate at which Demucs operates.
        """

        if sampling_rate is None:
            logger.warning_once(
                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
                "Failing to do so can result in silent errors that might be hard to debug."
            )

        if isinstance(audio, torch.Tensor) and len(audio.shape) == 4:
            logger.warning_once(
                "`audio` is a 4-dimensional torch tensor and has thus been recognized as the output of `Demucs`. "
                "If this is not the case, make sure to read Musicgen Melody docstrings and "
                "to correct `audio` to get the right behaviour. "
                "Link to the docstrings: https://huggingface.co/docs/transformers/main/en/model_doc/musicgen_melody"
            )
            audio = self._extract_stem_indices(audio, sampling_rate=sampling_rate)
        elif sampling_rate is not None and sampling_rate != self.sampling_rate:
            audio = torchaudio.functional.resample(
                audio, sampling_rate, self.sampling_rate, rolloff=0.945, lowpass_filter_width=24
            )

        is_batched = isinstance(audio, (np.ndarray, torch.Tensor)) and len(audio.shape) > 1
        is_batched = is_batched or (
            isinstance(audio, (list, tuple)) and (isinstance(audio[0], (torch.Tensor, np.ndarray, tuple, list)))
        )

        if is_batched and not isinstance(audio[0], torch.Tensor):
            audio = [torch.tensor(speech, dtype=torch.float32).unsqueeze(-1) for speech in audio]
        elif is_batched:
            audio = [speech.unsqueeze(-1) for speech in audio]
        elif not is_batched and not isinstance(audio, torch.Tensor):
            audio = torch.tensor(audio, dtype=torch.float32).unsqueeze(-1)

        if isinstance(audio[0], torch.Tensor) and audio[0].dtype is torch.float64:
            audio = [speech.to(torch.float32) for speech in audio]

        # always return batch
        if not is_batched:
            audio = [audio]

        if len(audio[0].shape) == 3:
            logger.warning_once(
                "`audio` has been detected as a batch of stereo signals. It will be converted to mono signals. "
                "If this is an undesired behaviour, make sure to read Musicgen Melody docstrings and "
                "to correct `audio` to get the right behaviour. "
                "Link to the docstrings: https://huggingface.co/docs/transformers/main/en/model_doc/musicgen_melody"
            )
            # convert to mono-channel waveform
            audio = [stereo.mean(dim=0) for stereo in audio]

        batched_speech = BatchFeature({"input_features": audio})

        padded_inputs = self.pad(
            batched_speech,
            padding=padding,
            max_length=max_length if max_length else self.n_samples,
            truncation=truncation,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
            return_tensors="pt",
        )

        input_features = self._torch_extract_fbank_features(padded_inputs["input_features"].squeeze(-1))

        padded_inputs["input_features"] = input_features

        if return_attention_mask:
            # rescale from raw audio length to spectrogram length
            padded_inputs["attention_mask"] = padded_inputs["attention_mask"][:, :: self.hop_length]

        if return_tensors is not None:
            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

        return padded_inputs

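An end-to-end sketch of the extractor on raw audio. With the defaults (30 s at 32 kHz, hop 4096) and `padding="max_length"`, the chromagram has 960000 // 4096 + 1 = 235 frames, which is exactly the `chroma_length` in the model config:

```python
import numpy as np

extractor = MusicgenMelodyFeatureExtractor()

# One second of random mono audio at 32 kHz, padded internally to the 30 s maximum.
audio = np.random.randn(32000).astype(np.float32)
features = extractor(audio, sampling_rate=32000, padding="max_length", return_tensors="pt")
print(features["input_features"].shape)  # torch.Size([1, 235, 12]): (batch, frames, chroma bins)
```
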
    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes this instance to a Python dictionary.

        Returns:
            `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
        """
        output = copy.deepcopy(self.__dict__)
        output["feature_extractor_type"] = self.__class__.__name__
        if "mel_filters" in output:
            del output["mel_filters"]
        if "window" in output:
            del output["window"]
        if "chroma_filters" in output:
            del output["chroma_filters"]
        if "spectrogram" in output:
            del output["spectrogram"]
        return output

File diff suppressed because it is too large (modeling_musicgen_melody.py)

@@ -0,0 +1,174 @@
# coding=utf-8
# Copyright 2024 Meta AI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Text/audio processor class for MusicGen Melody
"""
from typing import List, Optional

import numpy as np

from ...processing_utils import ProcessorMixin
from ...utils import to_numpy


class MusicgenMelodyProcessor(ProcessorMixin):
    r"""
    Constructs a MusicGen Melody processor which wraps a MusicGen Melody feature extractor - for raw audio waveform processing - and a T5 tokenizer into a single processor
    class.

    [`MusicgenMelodyProcessor`] offers all the functionalities of [`MusicgenMelodyFeatureExtractor`] and [`T5Tokenizer`]. See
    [`~MusicgenMelodyProcessor.__call__`] and [`~MusicgenMelodyProcessor.decode`] for more information.

    Args:
        feature_extractor (`MusicgenMelodyFeatureExtractor`):
            An instance of [`MusicgenMelodyFeatureExtractor`]. The feature extractor is a required input.
        tokenizer (`T5Tokenizer`):
            An instance of [`T5Tokenizer`]. The tokenizer is a required input.
    """

    feature_extractor_class = "MusicgenMelodyFeatureExtractor"
    tokenizer_class = ("T5Tokenizer", "T5TokenizerFast")

    def __init__(self, feature_extractor, tokenizer):
        super().__init__(feature_extractor, tokenizer)

    # Copied from transformers.models.musicgen.processing_musicgen.MusicgenProcessor.get_decoder_prompt_ids
    def get_decoder_prompt_ids(self, task=None, language=None, no_timestamps=True):
        return self.tokenizer.get_decoder_prompt_ids(task=task, language=language, no_timestamps=no_timestamps)

    def __call__(self, audio=None, text=None, **kwargs):
        """
        Main method to prepare for the model one or several sequence(s) and audio(s). This method forwards the `audio`
        and `kwargs` arguments to MusicgenMelodyFeatureExtractor's [`~MusicgenMelodyFeatureExtractor.__call__`] if `audio` is not
        `None` to pre-process the audio. It also forwards the `text` and `kwargs` arguments to
        PreTrainedTokenizer's [`~PreTrainedTokenizer.__call__`] if `text` is not `None`. Please refer to the docstring of the above two methods for more information.

        Args:
            audio (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The audio or batch of audios to be prepared. Each audio can be a NumPy array or PyTorch tensor. In case
                of a NumPy array/PyTorch tensor, each audio should be a mono or stereo signal of shape (T), where T is the sample length of the audio.
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            kwargs (*optional*):
                Remaining dictionary of keyword arguments that will be passed to the feature extractor and/or the
                tokenizer.
        Returns:
            [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **input_features** -- Audio input features to be fed to a model. Returned when `audio` is not `None`.
            - **attention_mask** -- List of token indices specifying which tokens should be attended to by the model when `text` is not `None`.
              When only `audio` is specified, returns the timestamps attention mask.
        """

        sampling_rate = kwargs.pop("sampling_rate", None)

        if audio is None and text is None:
            raise ValueError("You need to specify either an `audio` or `text` input to process.")

        if text is not None:
            inputs = self.tokenizer(text, **kwargs)
        if audio is not None:
            audio_inputs = self.feature_extractor(audio, sampling_rate=sampling_rate, **kwargs)

        if text is None:
            return audio_inputs
        elif audio is None:
            return inputs
        else:
            inputs["input_features"] = audio_inputs["input_features"]
            return inputs

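A hedged usage sketch combining a text prompt and a raw melody prompt (the checkpoint name matches the conversion script above; the audio is random stand-in data):

```python
import torch
from transformers import MusicgenMelodyProcessor

processor = MusicgenMelodyProcessor.from_pretrained("facebook/musicgen-melody")

melody = torch.randn(32000)  # one second of stand-in mono audio at 32 kHz
inputs = processor(audio=melody, text=["80s pop track with synths"], sampling_rate=32000, return_tensors="pt")
print(sorted(inputs.keys()))  # ['attention_mask', 'input_features', 'input_ids']
```
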
# Copied from transformers.models.musicgen.processing_musicgen.MusicgenProcessor.batch_decode with padding_mask->attention_mask
|
||||
def batch_decode(self, *args, **kwargs):
|
||||
"""
|
||||
This method is used to decode either batches of audio outputs from the MusicGen model, or batches of token ids
|
||||
from the tokenizer. In the case of decoding token ids, this method forwards all its arguments to T5Tokenizer's
|
||||
[`~PreTrainedTokenizer.batch_decode`]. Please refer to the docstring of this method for more information.
|
||||
"""
|
||||
audio_values = kwargs.pop("audio", None)
|
||||
attention_mask = kwargs.pop("attention_mask", None)
|
||||
|
||||
if len(args) > 0:
|
||||
audio_values = args[0]
|
||||
args = args[1:]
|
||||
|
||||
if audio_values is not None:
|
||||
return self._decode_audio(audio_values, attention_mask=attention_mask)
|
||||
else:
|
||||
return self.tokenizer.batch_decode(*args, **kwargs)
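
The two dispatch paths, in docstring-example form (a sketch: `audio_values`, `attention_mask` and `generated_ids` stand for model outputs produced elsewhere):

```python
>>> # generated audio: padding is stripped using the attention mask
>>> audios = processor.batch_decode(audio_values, attention_mask=attention_mask)
>>> # token ids: the call falls through to the T5 tokenizer
>>> texts = processor.batch_decode(sequences=generated_ids, skip_special_tokens=True)
```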

    # Copied from transformers.models.musicgen.processing_musicgen.MusicgenProcessor.decode
    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to T5Tokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to the
        docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    # Copied from transformers.models.musicgen.processing_musicgen.MusicgenProcessor._decode_audio with padding_mask->attention_mask
    def _decode_audio(self, audio_values, attention_mask: Optional[np.ndarray] = None) -> List[np.ndarray]:
        """
        This method strips any padding from the audio values to return a list of numpy audio arrays.
        """
        audio_values = to_numpy(audio_values)
        bsz, channels, seq_len = audio_values.shape

        if attention_mask is None:
            return list(audio_values)

        attention_mask = to_numpy(attention_mask)

        # match the sequence length of the padding mask to the generated audio arrays by padding with the **non-padding**
        # token (so that the generated audio values are **not** treated as padded tokens)
        difference = seq_len - attention_mask.shape[-1]
        padding_value = 1 - self.feature_extractor.padding_value
        attention_mask = np.pad(attention_mask, ((0, 0), (0, difference)), "constant", constant_values=padding_value)

        audio_values = audio_values.tolist()
        for i in range(bsz):
            sliced_audio = np.asarray(audio_values[i])[
                attention_mask[i][None, :] != self.feature_extractor.padding_value
            ]
            audio_values[i] = sliced_audio.reshape(channels, -1)

        return audio_values
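
To make the mask-extension trick concrete, here is a minimal standalone sketch with toy values, assuming a padding value of 0:

```python
import numpy as np

padding_value = 0.0  # assumed feature-extractor padding value
audio = np.arange(6, dtype=np.float32).reshape(1, 6)  # (channels, seq_len)
mask = np.array([1, 1, 0, 0])  # two padded positions in the original prompt

# extend the mask with the non-padding token so newly generated values are kept
difference = audio.shape[-1] - mask.shape[-1]
mask = np.pad(mask, (0, difference), "constant", constant_values=1 - padding_value)

stripped = audio[:, mask != padding_value]  # drop the padded positions
print(stripped)  # [[0. 1. 4. 5.]]
```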

    def get_unconditional_inputs(self, num_samples=1, return_tensors="pt"):
        """
        Helper function to get null inputs for unconditional generation, enabling the model to be used without the
        feature extractor or tokenizer.

        Args:
            num_samples (`int`, *optional*, defaults to 1):
                Number of audio samples to unconditionally generate.
            return_tensors (`str`, *optional*, defaults to `"pt"`):
                The framework of the returned tensors, forwarded to the tokenizer.

        Example:
        ```python
        >>> from transformers import MusicgenMelodyForConditionalGeneration, MusicgenMelodyProcessor

        >>> model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

        >>> # get the unconditional (or 'null') inputs for the model
        >>> processor = MusicgenMelodyProcessor.from_pretrained("facebook/musicgen-melody")
        >>> unconditional_inputs = processor.get_unconditional_inputs(num_samples=1)

        >>> audio_samples = model.generate(**unconditional_inputs, max_new_tokens=256)
        ```"""
        inputs = self.tokenizer([""] * num_samples, return_tensors=return_tensors, return_attention_mask=True)
        inputs["attention_mask"][:] = 0

        return inputs
@@ -5876,6 +5876,37 @@ class MusicgenProcessor(metaclass=DummyObject):
        requires_backends(self, ["torch"])


MUSICGEN_MELODY_PRETRAINED_MODEL_ARCHIVE_LIST = None


class MusicgenMelodyForCausalLM(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class MusicgenMelodyForConditionalGeneration(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class MusicgenMelodyModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class MusicgenMelodyPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


MVP_PRETRAINED_MODEL_ARCHIVE_LIST = None
@@ -0,0 +1,16 @@
# This file is autogenerated by the command `make fix-copies`, do not edit.
from ..utils import DummyObject, requires_backends


class MusicgenMelodyFeatureExtractor(metaclass=DummyObject):
    _backends = ["torchaudio"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torchaudio"])


class MusicgenMelodyProcessor(metaclass=DummyObject):
    _backends = ["torchaudio"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torchaudio"])
@@ -0,0 +1,231 @@
# coding=utf-8
# Copyright 2024 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import itertools
import math
import os
import random
import tempfile
import unittest

import numpy as np

from transformers.testing_utils import (
    check_json_file_has_correct_format,
    require_torch,
    require_torchaudio,
)
from transformers.utils.import_utils import is_torchaudio_available

from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin


if is_torchaudio_available():
    import torch

    from transformers import MusicgenMelodyFeatureExtractor


global_rng = random.Random()


# Copied from tests.models.whisper.test_feature_extraction_whisper.floats_list
def floats_list(shape, scale=1.0, rng=None, name=None):
    """Creates a random float32 tensor"""
    if rng is None:
        rng = global_rng

    values = []
    for batch_idx in range(shape[0]):
        values.append([])
        for _ in range(shape[1]):
            values[-1].append(rng.random() * scale)

    return values


# Copied from tests.models.musicgen.test_modeling_musicgen.get_bip_bip
def get_bip_bip(bip_duration=0.125, duration=0.5, sample_rate=32000):
    """Produces a series of 'bip bip' sounds at a given frequency."""
    timesteps = np.arange(int(duration * sample_rate)) / sample_rate
    wav = np.cos(2 * math.pi * 440 * timesteps)  # 440 Hz carrier tone
    time_period = (timesteps % (2 * bip_duration)) / (2 * bip_duration)
    # gate the tone on during the second half of each 2 * bip_duration period
    envelope = time_period >= 0.5
    return wav * envelope


@require_torch
@require_torchaudio
class MusicgenMelodyFeatureExtractionTester(unittest.TestCase):
    def __init__(
        self,
        parent,
        batch_size=7,
        min_seq_length=400,
        max_seq_length=2000,
        feature_size=12,
        padding_value=0.0,
        sampling_rate=4_000,
        return_attention_mask=True,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.min_seq_length = min_seq_length
        self.max_seq_length = max_seq_length
        self.seq_length_diff = (self.max_seq_length - self.min_seq_length) // (self.batch_size - 1)
        self.padding_value = padding_value
        self.sampling_rate = sampling_rate
        self.return_attention_mask = return_attention_mask
        self.feature_size = feature_size
        self.num_chroma = feature_size

    def prepare_feat_extract_dict(self):
        return {
            "feature_size": self.feature_size,
            "padding_value": self.padding_value,
            "sampling_rate": self.sampling_rate,
            "return_attention_mask": self.return_attention_mask,
        }

    # Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester.prepare_inputs_for_common
    def prepare_inputs_for_common(self, equal_length=False, numpify=False):
        def _flatten(list_of_lists):
            return list(itertools.chain(*list_of_lists))

        if equal_length:
            speech_inputs = [floats_list((self.max_seq_length, self.feature_size)) for _ in range(self.batch_size)]
        else:
            # make sure that inputs increase in size
            speech_inputs = [
                floats_list((x, self.feature_size))
                for x in range(self.min_seq_length, self.max_seq_length, self.seq_length_diff)
            ]
        if numpify:
            speech_inputs = [np.asarray(x) for x in speech_inputs]
        return speech_inputs


@require_torchaudio
@require_torch
class MusicgenMelodyFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
    feature_extraction_class = MusicgenMelodyFeatureExtractor if is_torchaudio_available() else None

    def setUp(self):
        self.feat_extract_tester = MusicgenMelodyFeatureExtractionTester(self)

    # Copied from tests.models.seamless_m4t.test_feature_extraction_seamless_m4t.SeamlessM4TFeatureExtractionTest.test_feat_extract_from_and_save_pretrained
    def test_feat_extract_from_and_save_pretrained(self):
        feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)

        with tempfile.TemporaryDirectory() as tmpdirname:
            saved_file = feat_extract_first.save_pretrained(tmpdirname)[0]
            check_json_file_has_correct_format(saved_file)
            feat_extract_second = self.feature_extraction_class.from_pretrained(tmpdirname)

        dict_first = feat_extract_first.to_dict()
        dict_second = feat_extract_second.to_dict()
        self.assertDictEqual(dict_first, dict_second)

    # Copied from tests.models.seamless_m4t.test_feature_extraction_seamless_m4t.SeamlessM4TFeatureExtractionTest.test_feat_extract_to_json_file
    def test_feat_extract_to_json_file(self):
        feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)

        with tempfile.TemporaryDirectory() as tmpdirname:
            json_file_path = os.path.join(tmpdirname, "feat_extract.json")
            feat_extract_first.to_json_file(json_file_path)
            feat_extract_second = self.feature_extraction_class.from_json_file(json_file_path)

        dict_first = feat_extract_first.to_dict()
        dict_second = feat_extract_second.to_dict()
        self.assertEqual(dict_first, dict_second)

    def test_call(self):
        # Tests that all calls wrap to encode_plus and batch_encode_plus
        feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
        # create three inputs of length 800, 1000, and 1200
        speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
        np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]

        # Test feature size
        input_features = feature_extractor(np_speech_inputs, padding=True, return_tensors="np").input_features
        self.assertTrue(input_features.ndim == 3)
        self.assertTrue(input_features.shape[0] == 3)
        # Ignore copy
        self.assertTrue(input_features.shape[-1] == feature_extractor.feature_size)

        # Test not batched input
        encoded_sequences_1 = feature_extractor(speech_inputs[0], return_tensors="np").input_features
        encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").input_features
        self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))

        # Test batched
        encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
        encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

        # Test 2-D numpy arrays are batched.
        speech_inputs = [floats_list((1, x))[0] for x in (800, 800, 800)]
        np_speech_inputs = np.asarray(speech_inputs)
        encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
        encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

    @require_torchaudio
    def test_call_from_demucs(self):
        # Tests that all calls wrap to encode_plus and batch_encode_plus
        feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())

        # (batch_size, num_stems, channel_size, audio_length)
        inputs = torch.rand([4, 5, 2, 44000])

        # Test feature size
        input_features = feature_extractor(inputs, padding=True, return_tensors="np").input_features
        self.assertTrue(input_features.ndim == 3)
        self.assertTrue(input_features.shape[0] == 4)
        self.assertTrue(input_features.shape[-1] == feature_extractor.feature_size)

        # Test single input
        encoded_sequences_1 = feature_extractor(inputs[[0]], return_tensors="np").input_features
        self.assertTrue(np.allclose(encoded_sequences_1[0], input_features[0], atol=1e-3))

    # Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_double_precision_pad
    def test_double_precision_pad(self):
        import torch

        feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
        np_speech_inputs = np.random.rand(100, 32).astype(np.float64)
        py_speech_inputs = np_speech_inputs.tolist()

        for inputs in [py_speech_inputs, np_speech_inputs]:
            np_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="np")
            self.assertTrue(np_processed.input_features.dtype == np.float32)
            pt_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="pt")
            self.assertTrue(pt_processed.input_features.dtype == torch.float32)

    def test_integration(self):
        # get_bip_bip produces a 440 Hz tone, i.e. pitch class A, so frames containing audio are expected to be
        # one-hot at chroma bin 9 (counting up from C); trailing silent/padded frames collapse to bin 0.
        EXPECTED_INPUT_FEATURES = torch.zeros([2, 8, 12])
        EXPECTED_INPUT_FEATURES[0, :6, 9] = 1
        EXPECTED_INPUT_FEATURES[0, 6:, 0] = 1
        EXPECTED_INPUT_FEATURES[1, :, 9] = 1

        input_speech = [get_bip_bip(duration=0.5), get_bip_bip(duration=1.0)]
        feature_extractor = MusicgenMelodyFeatureExtractor()
        input_features = feature_extractor(input_speech, return_tensors="pt").input_features

        self.assertEqual(input_features.shape, (2, 8, 12))
        self.assertTrue((input_features == EXPECTED_INPUT_FEATURES).all())
@@ -0,0 +1,179 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for the MusicGen Melody processor."""

import random
import shutil
import tempfile
import unittest

import numpy as np

from transformers import T5Tokenizer, T5TokenizerFast
from transformers.testing_utils import require_sentencepiece, require_torch, require_torchaudio
from transformers.utils.import_utils import is_torchaudio_available


if is_torchaudio_available():
    from transformers import MusicgenMelodyFeatureExtractor, MusicgenMelodyProcessor


global_rng = random.Random()


# Copied from tests.models.whisper.test_feature_extraction_whisper.floats_list
def floats_list(shape, scale=1.0, rng=None, name=None):
    """Creates a random float32 tensor"""
    if rng is None:
        rng = global_rng

    values = []
    for batch_idx in range(shape[0]):
        values.append([])
        for _ in range(shape[1]):
            values[-1].append(rng.random() * scale)

    return values


@require_torch
@require_sentencepiece
@require_torchaudio
# Copied from tests.models.musicgen.test_processing_musicgen.MusicgenProcessorTest with Musicgen->MusicgenMelody, Encodec->MusicgenMelody, padding_mask->attention_mask, input_values->input_features
class MusicgenMelodyProcessorTest(unittest.TestCase):
    def setUp(self):
        # Ignore copy
        self.checkpoint = "facebook/musicgen-melody"
        self.tmpdirname = tempfile.mkdtemp()

    def get_tokenizer(self, **kwargs):
        return T5Tokenizer.from_pretrained(self.checkpoint, **kwargs)

    def get_feature_extractor(self, **kwargs):
        return MusicgenMelodyFeatureExtractor.from_pretrained(self.checkpoint, **kwargs)

    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

    def test_save_load_pretrained_default(self):
        tokenizer = self.get_tokenizer()
        feature_extractor = self.get_feature_extractor()

        processor = MusicgenMelodyProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        processor.save_pretrained(self.tmpdirname)
        processor = MusicgenMelodyProcessor.from_pretrained(self.tmpdirname)

        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())
        self.assertIsInstance(processor.tokenizer, T5TokenizerFast)

        self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor.to_json_string())
        self.assertIsInstance(processor.feature_extractor, MusicgenMelodyFeatureExtractor)

    def test_save_load_pretrained_additional_features(self):
        processor = MusicgenMelodyProcessor(
            tokenizer=self.get_tokenizer(), feature_extractor=self.get_feature_extractor()
        )
        processor.save_pretrained(self.tmpdirname)

        tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
        feature_extractor_add_kwargs = self.get_feature_extractor(do_normalize=False, padding_value=1.0)

        processor = MusicgenMelodyProcessor.from_pretrained(
            self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
        )

        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
        self.assertIsInstance(processor.tokenizer, T5TokenizerFast)

        self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor_add_kwargs.to_json_string())
        self.assertIsInstance(processor.feature_extractor, MusicgenMelodyFeatureExtractor)

    def test_feature_extractor(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = MusicgenMelodyProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        raw_speech = floats_list((3, 1000))

        input_feat_extract = feature_extractor(raw_speech, return_tensors="np")
        input_processor = processor(raw_speech, return_tensors="np")

        for key in input_feat_extract.keys():
            self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)

    def test_tokenizer(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = MusicgenMelodyProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        input_str = "This is a test string"

        encoded_processor = processor(text=input_str)

        encoded_tok = tokenizer(input_str)

        for key in encoded_tok.keys():
            self.assertListEqual(encoded_tok[key], encoded_processor[key])

    def test_tokenizer_decode(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = MusicgenMelodyProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]

        decoded_processor = processor.batch_decode(sequences=predicted_ids)
        decoded_tok = tokenizer.batch_decode(predicted_ids)

        self.assertListEqual(decoded_tok, decoded_processor)

    def test_model_input_names(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = MusicgenMelodyProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        self.assertListEqual(
            processor.model_input_names,
            feature_extractor.model_input_names,
            msg="`processor` and `feature_extractor` model input names do not match",
        )

    # Ignore copy
    def test_decode_audio(self):
        feature_extractor = self.get_feature_extractor(padding_side="left")
        tokenizer = self.get_tokenizer()

        processor = MusicgenMelodyProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        # left-padded masks: only the trailing positions of each row are real audio
        attention_mask = np.zeros((3, 20))
        attention_mask[0, -5:] = 1
        attention_mask[1, -20:] = 1
        attention_mask[2, -10:] = 1

        generated_speech = np.asarray(floats_list((3, 20)))[:, None, :]
        decoded_audios = processor.batch_decode(generated_speech, attention_mask=attention_mask)

        self.assertIsInstance(decoded_audios, list)

        for audio in decoded_audios:
            self.assertIsInstance(audio, np.ndarray)

        self.assertTrue(decoded_audios[0].shape == (1, 5))
        self.assertTrue(decoded_audios[1].shape == (1, 20))
        self.assertTrue(decoded_audios[2].shape == (1, 10))
@@ -20,6 +20,7 @@ import pytest

from transformers.audio_utils import (
    amplitude_to_db,
    chroma_filter_bank,
    hertz_to_mel,
    mel_filter_bank,
    mel_to_hertz,
@@ -27,6 +28,11 @@ from transformers.audio_utils import (
    spectrogram,
    window_function,
)
from transformers.testing_utils import is_librosa_available, require_librosa


if is_librosa_available():
    from librosa.filters import chroma


class AudioUtilsFunctionTester(unittest.TestCase):
@@ -755,3 +761,57 @@ class AudioUtilsFunctionTester(unittest.TestCase):
            amplitude_to_db(spectrogram, min_value=0.0)
        with pytest.raises(ValueError):
            amplitude_to_db(spectrogram, db_range=-80)

    @require_librosa
    def test_chroma_equivalence(self):
        num_frequency_bins = 25
        num_chroma = 6
        sampling_rate = 24000

        # test default parameters
        original_chroma = chroma(sr=sampling_rate, n_chroma=num_chroma, n_fft=num_frequency_bins)
        utils_chroma = chroma_filter_bank(
            num_frequency_bins=num_frequency_bins, num_chroma=num_chroma, sampling_rate=sampling_rate
        )

        self.assertTrue(np.allclose(original_chroma, utils_chroma))

        # test no weighting_parameters
        original_chroma = chroma(sr=sampling_rate, n_chroma=num_chroma, n_fft=num_frequency_bins, octwidth=None)
        utils_chroma = chroma_filter_bank(
            num_frequency_bins=num_frequency_bins,
            num_chroma=num_chroma,
            sampling_rate=sampling_rate,
            weighting_parameters=None,
        )

        self.assertTrue(np.allclose(original_chroma, utils_chroma))

        # test with L1 norm
        original_chroma = chroma(sr=sampling_rate, n_chroma=num_chroma, n_fft=num_frequency_bins, norm=1.0)
        utils_chroma = chroma_filter_bank(
            num_frequency_bins=num_frequency_bins, num_chroma=num_chroma, sampling_rate=sampling_rate, power=1.0
        )

        self.assertTrue(np.allclose(original_chroma, utils_chroma))

        # test starting at 'A' chroma, power = None, tuning = 0, different weighting_parameters
        original_chroma = chroma(
            sr=sampling_rate,
            n_chroma=num_chroma,
            n_fft=num_frequency_bins,
            norm=None,
            base_c=None,
            octwidth=1.0,
            ctroct=4.0,
        )
        utils_chroma = chroma_filter_bank(
            num_frequency_bins=num_frequency_bins,
            num_chroma=num_chroma,
            sampling_rate=sampling_rate,
            power=None,
            start_at_c_chroma=False,
            weighting_parameters=(4.0, 1.0),
        )

        self.assertTrue(np.allclose(original_chroma, utils_chroma))
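
For orientation, a sketch of how such a filter bank is typically applied: project a magnitude spectrogram onto the chroma bins with a matrix product. The waveform and parameters here are illustrative, and the `(num_chroma, num_frequency_bins)` layout is assumed from the librosa equivalence checked above:

```python
import numpy as np

from transformers.audio_utils import chroma_filter_bank, spectrogram, window_function

sampling_rate = 24000
n_fft = 2048
# one second of a 440 Hz tone as a toy input
waveform = np.sin(2 * np.pi * 440 * np.arange(sampling_rate) / sampling_rate)

spec = spectrogram(
    waveform,
    window_function(n_fft, "hann"),
    frame_length=n_fft,
    hop_length=512,
    power=1.0,
)  # (1 + n_fft // 2, num_frames)

filters = chroma_filter_bank(num_frequency_bins=n_fft, num_chroma=12, sampling_rate=sampling_rate)
chromagram = filters @ spec  # (num_chroma, num_frames), assuming librosa-style layout
```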
@@ -395,6 +395,7 @@ OBJECTS_TO_IGNORE = [
    "MraConfig",
    "MusicgenDecoderConfig",
    "MusicgenForConditionalGeneration",
    "MusicgenMelodyForConditionalGeneration",
    "MvpConfig",
    "MvpTokenizerFast",
    "MT5Tokenizer",
@@ -294,6 +294,7 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "BarkCoarseModel",
    "BarkFineModel",
    "BarkSemanticModel",
    "MusicgenMelodyModel",
    "MusicgenModel",
    "MusicgenForConditionalGeneration",
    "SpeechT5ForSpeechToSpeech",
@@ -176,6 +176,7 @@ docs/source/en/model_doc/mpt.md
docs/source/en/model_doc/mra.md
docs/source/en/model_doc/mt5.md
docs/source/en/model_doc/musicgen.md
docs/source/en/model_doc/musicgen_melody.md
docs/source/en/model_doc/mvp.md
docs/source/en/model_doc/nat.md
docs/source/en/model_doc/nezha.md
@@ -706,6 +707,7 @@ src/transformers/models/mt5/modeling_flax_mt5.py
src/transformers/models/mt5/modeling_mt5.py
src/transformers/models/mt5/modeling_tf_mt5.py
src/transformers/models/musicgen/convert_musicgen_transformers.py
src/transformers/models/musicgen_melody/convert_musicgen_melody_transformers.py
src/transformers/models/mvp/modeling_mvp.py
src/transformers/models/nezha/modeling_nezha.py
src/transformers/models/nllb_moe/configuration_nllb_moe.py
@@ -8,4 +8,6 @@ docs/source/en/tasks/prompting.md
src/transformers/models/blip_2/modeling_blip_2.py
src/transformers/models/ctrl/modeling_ctrl.py
src/transformers/models/fuyu/modeling_fuyu.py
src/transformers/models/kosmos2/modeling_kosmos2.py
src/transformers/models/musicgen_melody/modeling_musicgen_melody.py
src/transformers/models/musicgen_melody/processing_musicgen_melody.py