Add Seamless M4T model (#25693)
- first raw commit
- still POC
- tentative convert script
- almost working speech encoder conversion scripts
- intermediate code for encoder/decoders
- add modeling code
- first version of speech encoder
- make style
- add new adapter layer architecture
- add adapter block
- add first tentative config
- add working speech encoder conversion
- base model convert works now
- make style
- remove unnecessary classes
- remove unecessary functions
- add modeling code speech encoder
- rework logics
- forward pass of sub components work
- add modeling codes
- some config modifs and modeling code modifs
- save WIP
- new edits
- same output speech encoder
- correct attention mask
- correct attention mask
- fix generation
- new generation logics
- erase comments
- make style
- fix typo
- add some descriptions
- new state
- clean imports
- add tests
- make style
- make beam search and num_return_sequences>1 works
- correct edge case issue
- correct SeamlessM4TConformerSamePadLayer copied from
- replace ACT2FN relu by nn.relu
- remove unecessary return variable
- move back a class
- change name conformer_attention_mask ->conv_attention_mask
- better nit code
- add some Copied from statements
- small nits
- small nit in dict.get
- rename t2u model -> conditionalgeneration
- ongoing refactoring of structure
- update models architecture
- remove SeamlessM4TMultiModal classes
- add tests
- adapt tests
- some non-working code for vocoder
- add seamlessM4T vocoder
- remove buggy line
- fix some hifigan related bugs
- remove hifigan specifc config
- change
- add WIP tokenization
- add seamlessM4T working tokenzier
- update tokenization
- add tentative feature extractor
- Update converting script
- update working FE
- refactor input_values -> input_features
- update FE
- changes in generation, tokenizer and modeling
- make style and add t2u_decoder_input_ids
- add intermediate outputs for ToSpeech models
- add vocoder to speech models
- update valueerror
- update FE with languages
- add vocoder convert
- update config docstrings and names
- update generation code and configuration
- remove todos and update config.pad_token_id to generation_config.pad_token_id
- move block vocoder
- remove unecessary code and uniformize tospeech code
- add feature extractor import
- make style and fix some copies from
- correct consistency + make fix-copies
- add processor code
- remove comments
- add fast tokenizer support
- correct pad_token_id in M4TModel
- correct config
- update tests and codes + make style
- make some suggested correstion - correct comments and change naming
- rename some attributes
- rename some attributes
- remove unecessary sequential
- remove option to use dur predictor
- nit
- refactor hifigan
- replace normalize_mean and normalize_var with do_normalize + save lang ids to generation config
- add tests
- change tgt_lang logic
- update generation ToSpeech
- add support import SeamlessM4TProcessor
- fix generate
- make tests
- update integration tests, add option to only return text and update tokenizer fast
- fix wrong function call
- update import and convert script
- update integration tests + update repo id
- correct paths and add first test
- update how new attention masks are computed
- update tests
- take first care of batching in vocoder code
- add batching with the vocoder
- add waveform lengths to model outputs
- make style
- add generate kwargs + forward kwargs of M4TModel
- add docstrings forward methods
- reformate docstrings
- add docstrings t2u model
- add another round of modeling docstrings + reformate speaker_id -> spkr_id
- make style
- fix check_repo
- make style
- add seamlessm4t to toctree
- correct check_config_attributes
- write config docstrings + some modifs
- make style
- add docstrings tokenizer
- add docstrings to processor, fe and tokenizers
- make style
- write first version of model docs
- fix FE + correct FE test
- fix tokenizer + add correct integration tests
- fix most tokenization tests
- make style
- correct most processor test
- add generation tests and fix num_return_sequences > 1
- correct integration tests -still one left
- make style
- correct position embedding
- change numbeams to 1
- refactor some modeling code and correct one test
- make style
- correct typo
- refactor intermediate fnn
- refactor feedforward conformer
- make style
- remove comments
- make style
- fix tokenizer tests
- make style
- correct processor tests
- make style
- correct S2TT integration
- Apply suggestions from Sanchit code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
- correct typo
- replace torch.nn->nn + make style
- change Output naming (waveforms -> waveform) and ordering
- nit renaming and formating
- remove return None when not necessary
- refactor SeamlessM4TConformerFeedForward
- nit typo
- remove almost copied from comments
- add a copied from comment and remove an unecessary dropout
- remove inputs_embeds from speechencoder
- remove backward compatibiliy function
- reformate class docstrings for a few components
- remove unecessary methods
- split over 2 lines smthg hard to read
- make style
- replace two steps offset by one step as suggested
- nice typo
- move warnings
- remove useless lines from processor
- make generation non-standard test more robusts
- remove torch.inference_mode from tests
- split integration tests
- enrich md
- rename control_symbol_vocoder_offset->vocoder_offset
- clean convert file
- remove tgt_lang and src_lang from FE
- change generate docstring of ToText models
- update generate docstring of tospeech models
- unify how to deal withtext_decoder_input_ids
- add default spkr_id
- unify tgt_lang for t2u_model
- simplify tgt_lang verification
- remove a todo
- change config docstring
- make style
- simplify t2u_tgt_lang_id
- make style
- enrich/correct comments
- enrich .md
- correct typo in docstrings
- add torchaudio dependency
- update tokenizer
- make style and fix copies
- modify SeamlessM4TConverter with new tokenizer behaviour
- make style
- correct small typo docs
- fix import
- update docs and add requirement to tests
- add convert_fairseq2_to_hf in utils/not_doctested.txt
- update FE
- fix imports and make style
- remove torchaudio in FE test
- add seamless_m4t.md to utils/not_doctested.txt
- nits and change the way docstring dataset is loaded
- move checkpoints from ylacombe/ to facebook/ orga
- refactor warning/error to be in the 119 line width limit
- round overly precised floats
- add stereo audio behaviour
- refactor .md and make style
- enrich docs with more precised architecture description
- readd undocumented models
- make fix-copies
- apply some suggestions
- Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
- correct bug from previous commit
- refactor a parameter allowing to clean the code + some small nits
- clean tokenizer
- make style and fix
- make style
- clean tokenizers arguments
- add precisions for some tests
- move docs from not_tested to slow
- modify tokenizer according to last comments
- add copied from statements in tests
- correct convert script
- correct parameter docstring style
- correct tokenization
- correct multi gpus
- make style
- clean modeling code
- make style
- add copied from statements
- add copied statements
- add support with ASR pipeline
- remove file added inadvertently
- fix docstrings seamlessM4TModel
- add seamlessM4TConfig to OBJECTS_TO_IGNORE due of unconventional markdown
- add seamlessm4t to assisted generation ignored models

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
This commit is contained in:
parent 50d0cf4f6b
commit cb45f71c4d
@@ -459,6 +459,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
@@ -434,6 +434,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=htt
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
@@ -408,6 +408,7 @@ conda install -c huggingface transformers
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
@@ -468,6 +468,7 @@ How to install Flax, PyTorch, and TensorFlow with conda…
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
@@ -383,6 +383,7 @@ …how to install these with conda from the Flax, PyTorch, and TensorFlow installation pages
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
@@ -407,6 +407,7 @@ conda install -c huggingface transformers
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
@@ -419,6 +419,7 @@ conda install -c huggingface transformers
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
@@ -614,6 +614,8 @@
      title: MusicGen
    - local: model_doc/pop2piano
      title: Pop2Piano
    - local: model_doc/seamless_m4t
      title: Seamless-M4T
    - local: model_doc/sew
      title: SEW
    - local: model_doc/sew-d
@@ -236,6 +236,7 @@ Flax), PyTorch, and/or TensorFlow.
| [RoFormer](model_doc/roformer) | ✅ | ✅ | ✅ |
| [RWKV](model_doc/rwkv) | ✅ | ❌ | ❌ |
| [SAM](model_doc/sam) | ✅ | ✅ | ❌ |
| [SeamlessM4T](model_doc/seamless_m4t) | ✅ | ❌ | ❌ |
| [SegFormer](model_doc/segformer) | ✅ | ✅ | ❌ |
| [SEW](model_doc/sew) | ✅ | ❌ | ❌ |
| [SEW-D](model_doc/sew-d) | ✅ | ❌ | ❌ |
@@ -0,0 +1,218 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# SeamlessM4T

## Overview

The SeamlessM4T model was proposed in [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team from Meta AI.

SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T enables multiple tasks without relying on separate models:

- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)

[`SeamlessM4TModel`] can perform all the above tasks, but each task also has its own dedicated sub-model.

The abstract from the paper is the following:

*What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication*
## Usage

First, load the processor and a checkpoint of the model:

```python
>>> from transformers import AutoProcessor, SeamlessM4TModel

>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
```

You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.

Here is how to use the processor to process text and audio:

```python
>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]

>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")

>>> # now, process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```
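Speech checkpoints of this kind typically expect 16 kHz mono audio (check `processor.feature_extractor.sampling_rate` for your checkpoint); if your recording uses a different rate, resample it before calling the processor. A minimal linear-interpolation sketch with NumPy, for intuition only; in practice a proper resampler such as `torchaudio` or `librosa` is preferable:

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Naive linear-interpolation resampling of a mono waveform."""
    if orig_sr == target_sr:
        return audio
    duration = audio.shape[0] / orig_sr
    n_target = int(round(duration * target_sr))
    # positions of original and target samples on the time axis (seconds)
    x_old = np.arange(audio.shape[0]) / orig_sr
    x_new = np.arange(n_target) / target_sr
    return np.interp(x_new, x_old, audio)

# example: upsample one second of 8 kHz audio to 16 kHz
wave_8k = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
wave_16k = resample_linear(wave_8k, 8000, 16000)
```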
### Speech

[`SeamlessM4TModel`] can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation:

```python
>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```
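The generated arrays are raw waveforms; to listen to them, you can write them to disk. A sketch using the standard library's `wave` module, assuming the output is a mono float waveform in [-1, 1] and that the checkpoint's output rate is available (e.g. via `model.config.sampling_rate`); the 16 kHz rate and the synthetic tone below are illustrative stand-ins:

```python
import wave
import numpy as np

def save_wav(path: str, waveform: np.ndarray, sampling_rate: int) -> None:
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = (np.clip(waveform, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)        # mono
        f.setsampwidth(2)        # 16-bit samples
        f.setframerate(sampling_rate)
        f.writeframes(pcm.tobytes())

# example with a synthetic tone standing in for audio_array_from_text
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000).astype(np.float32)
save_wav("translated.wav", tone, 16000)
```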
With essentially the same code, we have translated English text and Arabic speech into Russian speech samples.

### Text

Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`].
This time, let's translate to French.

```python
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```
### Tips

#### 1. Use dedicated models

[`SeamlessM4TModel`] is the top-level transformers model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task; the rest of the code is exactly the same:

```python
>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
```
Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task; you only have to remove `generate_speech=False`:

```python
>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")
```
Feel free to try out [`SeamlessM4TForSpeechToText`] and [`SeamlessM4TForTextToSpeech`] as well.

#### 2. Change the speaker identity

You can change the speaker used for speech synthesis with the `spkr_id` argument. Some `spkr_id` values work better than others for some languages!

#### 3. Change the generation strategy

You can use different [generation strategies](./generation_strategies) for speech and text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`, which will successively perform beam-search decoding on the text model and multinomial sampling on the speech model.
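Under this convention, generation keyword arguments are routed by prefix: `text_*` kwargs go to the text model, `speech_*` kwargs go to the speech model, and unprefixed kwargs apply to both. A small illustrative sketch of that routing logic (a simplification for intuition, not the library's actual implementation):

```python
def split_generation_kwargs(kwargs: dict) -> tuple:
    """Route kwargs by prefix: text_* -> text model, speech_* -> speech model,
    unprefixed kwargs -> both (a simplified illustration of the convention)."""
    text_kwargs, speech_kwargs = {}, {}
    for key, value in kwargs.items():
        if key.startswith("text_"):
            text_kwargs[key[len("text_"):]] = value
        elif key.startswith("speech_"):
            speech_kwargs[key[len("speech_"):]] = value
        else:  # shared kwarg applies to both sub-models
            text_kwargs[key] = value
            speech_kwargs[key] = value
    return text_kwargs, speech_kwargs

text_kw, speech_kw = split_generation_kwargs(
    {"text_num_beams": 4, "speech_do_sample": True, "max_new_tokens": 256}
)
print(text_kw)    # {'num_beams': 4, 'max_new_tokens': 256}
print(speech_kw)  # {'do_sample': True, 'max_new_tokens': 256}
```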
#### 4. Generate speech and text at the same time

Use `return_intermediate_token_ids=True` with [`SeamlessM4TModel`] to return both speech and text!

## Model architecture

SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text.

Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://arxiv.org/abs/2010.05646) architecture is placed on top of the second seq2seq model.

Here's how the generation process works:

- Input text or speech is processed through its specific encoder.
- A decoder creates text tokens in the desired language.
- If speech generation is required, the second seq2seq model, following a standard encoder-decoder structure, generates unit tokens.
- These unit tokens are then passed through the final vocoder to produce the actual speech.
|
||||
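The steps above can be sketched as a simple function composition. The stand-in callables below are placeholders, not the real sub-modules; the sketch only mimics the order in which data flows through the architecture:

```python
# Minimal, illustrative sketch of the two-stage generation pipeline.
# Every component here is a toy stand-in for the real sub-model.
def run_pipeline(inputs, encoder, text_decoder, t2u_model, vocoder, generate_speech=True):
    hidden_states = encoder(inputs)            # text or speech encoder
    text_tokens = text_decoder(hidden_states)  # text tokens in the target language
    if not generate_speech:
        return text_tokens, None
    unit_tokens = t2u_model(text_tokens)       # second seq2seq model -> "unit tokens"
    waveform = vocoder(unit_tokens)            # HiFi-GAN-style vocoder -> speech
    return text_tokens, waveform


# Toy components that just tag the data as it passes through each stage.
text, audio = run_pipeline(
    "hello",
    encoder=lambda x: f"enc({x})",
    text_decoder=lambda h: f"txt({h})",
    t2u_model=lambda t: f"units({t})",
    vocoder=lambda u: f"wav({u})",
)
# text  -> "txt(enc(hello))"
# audio -> "wav(units(txt(enc(hello))))"
```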

This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication).

## SeamlessM4TModel

[[autodoc]] SeamlessM4TModel
    - generate

## SeamlessM4TForTextToSpeech

[[autodoc]] SeamlessM4TForTextToSpeech
    - generate

## SeamlessM4TForSpeechToSpeech

[[autodoc]] SeamlessM4TForSpeechToSpeech
    - generate

## SeamlessM4TForTextToText

[[autodoc]] transformers.SeamlessM4TForTextToText
    - forward
    - generate

## SeamlessM4TForSpeechToText

[[autodoc]] transformers.SeamlessM4TForSpeechToText
    - forward
    - generate

## SeamlessM4TConfig

[[autodoc]] SeamlessM4TConfig

## SeamlessM4TTokenizer

[[autodoc]] SeamlessM4TTokenizer
    - __call__
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## SeamlessM4TTokenizerFast

[[autodoc]] SeamlessM4TTokenizerFast
    - __call__

## SeamlessM4TFeatureExtractor

[[autodoc]] SeamlessM4TFeatureExtractor
    - __call__

## SeamlessM4TProcessor

[[autodoc]] SeamlessM4TProcessor
    - __call__

## SeamlessM4TCodeHifiGan

[[autodoc]] SeamlessM4TCodeHifiGan

## SeamlessM4THifiGan

[[autodoc]] SeamlessM4THifiGan

## SeamlessM4TTextToUnitModel

[[autodoc]] SeamlessM4TTextToUnitModel

## SeamlessM4TTextToUnitForConditionalGeneration

[[autodoc]] SeamlessM4TTextToUnitForConditionalGeneration

@@ -35,7 +35,7 @@ The task illustrated in this tutorial is supported by the following model archit
 <!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

-[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
+[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)

 <!--End of the generated tip-->

@@ -32,7 +32,7 @@ The task illustrated in this tutorial is supported by the following model archit
 <!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

-[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
+[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)

 <!--End of the generated tip-->

@@ -517,6 +517,12 @@ _import_structure = {
        "SamPromptEncoderConfig",
        "SamVisionConfig",
    ],
    "models.seamless_m4t": [
        "SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "SeamlessM4TConfig",
        "SeamlessM4TFeatureExtractor",
        "SeamlessM4TProcessor",
    ],
    "models.segformer": ["SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "SegformerConfig"],
    "models.sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"],
    "models.sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"],

@@ -805,6 +811,7 @@ else:
    _import_structure["models.plbart"].append("PLBartTokenizer")
    _import_structure["models.reformer"].append("ReformerTokenizer")
    _import_structure["models.rembert"].append("RemBertTokenizer")
    _import_structure["models.seamless_m4t"].append("SeamlessM4TTokenizer")
    _import_structure["models.speech_to_text"].append("Speech2TextTokenizer")
    _import_structure["models.speecht5"].append("SpeechT5Tokenizer")
    _import_structure["models.t5"].append("T5Tokenizer")

@@ -877,6 +884,7 @@ else:
    _import_structure["models.rembert"].append("RemBertTokenizerFast")
    _import_structure["models.roberta"].append("RobertaTokenizerFast")
    _import_structure["models.roformer"].append("RoFormerTokenizerFast")
    _import_structure["models.seamless_m4t"].append("SeamlessM4TTokenizerFast")
    _import_structure["models.splinter"].append("SplinterTokenizerFast")
    _import_structure["models.squeezebert"].append("SqueezeBertTokenizerFast")
    _import_structure["models.t5"].append("T5TokenizerFast")

@@ -1082,6 +1090,7 @@ else:
    _import_structure["modeling_utils"] = ["PreTrainedModel"]

    # PyTorch models structure

    _import_structure["models.albert"].extend(
        [
            "ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST",

@@ -2683,6 +2692,21 @@ else:
            "SamPreTrainedModel",
        ]
    )
    _import_structure["models.seamless_m4t"].extend(
        [
            "SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST",
            "SeamlessM4TCodeHifiGan",
            "SeamlessM4TForSpeechToSpeech",
            "SeamlessM4TForSpeechToText",
            "SeamlessM4TForTextToSpeech",
            "SeamlessM4TForTextToText",
            "SeamlessM4THifiGan",
            "SeamlessM4TModel",
            "SeamlessM4TPreTrainedModel",
            "SeamlessM4TTextToUnitForConditionalGeneration",
            "SeamlessM4TTextToUnitModel",
        ]
    )
    _import_structure["models.segformer"].extend(
        [
            "SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",

@@ -4658,6 +4682,12 @@ if TYPE_CHECKING:
        SamPromptEncoderConfig,
        SamVisionConfig,
    )
    from .models.seamless_m4t import (
        SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP,
        SeamlessM4TConfig,
        SeamlessM4TFeatureExtractor,
        SeamlessM4TProcessor,
    )
    from .models.segformer import SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, SegformerConfig
    from .models.sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig
    from .models.sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig

@@ -4925,6 +4955,7 @@ if TYPE_CHECKING:
    from .models.plbart import PLBartTokenizer
    from .models.reformer import ReformerTokenizer
    from .models.rembert import RemBertTokenizer
    from .models.seamless_m4t import SeamlessM4TTokenizer
    from .models.speech_to_text import Speech2TextTokenizer
    from .models.speecht5 import SpeechT5Tokenizer
    from .models.t5 import T5Tokenizer

@@ -4990,6 +5021,7 @@ if TYPE_CHECKING:
    from .models.rembert import RemBertTokenizerFast
    from .models.roberta import RobertaTokenizerFast
    from .models.roformer import RoFormerTokenizerFast
    from .models.seamless_m4t import SeamlessM4TTokenizerFast
    from .models.splinter import SplinterTokenizerFast
    from .models.squeezebert import SqueezeBertTokenizerFast
    from .models.t5 import T5TokenizerFast

@@ -5157,8 +5189,6 @@ if TYPE_CHECKING:
        top_k_top_p_filtering,
    )
    from .modeling_utils import PreTrainedModel

    # PyTorch model imports
    from .models.albert import (
        ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
        AlbertForMaskedLM,

@@ -6485,6 +6515,21 @@ if TYPE_CHECKING:
        SamModel,
        SamPreTrainedModel,
    )

    # PyTorch model imports
    from .models.seamless_m4t import (
        SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST,
        SeamlessM4TCodeHifiGan,
        SeamlessM4TForSpeechToSpeech,
        SeamlessM4TForSpeechToText,
        SeamlessM4TForTextToSpeech,
        SeamlessM4TForTextToText,
        SeamlessM4THifiGan,
        SeamlessM4TModel,
        SeamlessM4TPreTrainedModel,
        SeamlessM4TTextToUnitForConditionalGeneration,
        SeamlessM4TTextToUnitModel,
    )
    from .models.segformer import (
        SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
        SegformerDecodeHead,

@@ -775,6 +775,31 @@ class NllbConverter(SpmConverter):
        )


class SeamlessM4TConverter(SpmConverter):
    def vocab(self, proto):
        vocab = [
            ("<pad>", 0.0),
            ("<unk>", 0.0),
            ("<s>", 0.0),
            ("</s>", 0.0),
        ]
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        return vocab

    def unk_id(self, proto):
        return self.original_tokenizer.unk_token_id

    def post_processor(self):
        return processors.TemplateProcessing(
            single="__eng__ $A </s>",
            pair="__eng__ $A $B </s>",
            special_tokens=[
                ("__eng__", self.original_tokenizer.convert_tokens_to_ids("__eng__")),
                ("</s>", self.original_tokenizer.convert_tokens_to_ids("</s>")),
            ],
        )

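The `vocab` override above has one detail worth spelling out: four fixed special tokens come first, then the sentencepiece pieces are appended starting from index 3, replacing the proto's own first three entries. A plain-Python sketch (with a made-up `build_vocab` helper operating on `(piece, score)` tuples instead of a real sentencepiece proto):

```python
# Hypothetical helper mimicking SeamlessM4TConverter.vocab on plain tuples.
def build_vocab(pieces):
    # Fixed special tokens occupy the first four slots.
    vocab = [("<pad>", 0.0), ("<unk>", 0.0), ("<s>", 0.0), ("</s>", 0.0)]
    # The proto's own first three pieces are skipped; the rest are appended.
    vocab += [(piece, score) for piece, score in pieces[3:]]
    return vocab


vocab = build_vocab([("<unk>", 0.0), ("<s>", 0.0), ("</s>", 0.0), ("hello", -1.5)])
# -> [("<pad>", 0.0), ("<unk>", 0.0), ("<s>", 0.0), ("</s>", 0.0), ("hello", -1.5)]
```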


class XLMRobertaConverter(SpmConverter):
    def vocab(self, proto):
        vocab = [

@@ -1278,6 +1303,7 @@ SLOW_TO_FAST_CONVERTERS = {
    "RetriBertTokenizer": BertConverter,
    "RobertaTokenizer": RobertaConverter,
    "RoFormerTokenizer": RoFormerConverter,
    "SeamlessM4TTokenizer": SeamlessM4TConverter,
    "SqueezeBertTokenizer": BertConverter,
    "T5Tokenizer": T5Converter,
    "WhisperTokenizer": WhisperConverter,

@@ -180,6 +180,7 @@ from . import (
    roformer,
    rwkv,
    sam,
    seamless_m4t,
    segformer,
    sew,
    sew_d,

|
@ -186,6 +186,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
|
|||
("roformer", "RoFormerConfig"),
|
||||
("rwkv", "RwkvConfig"),
|
||||
("sam", "SamConfig"),
|
||||
("seamless_m4t", "SeamlessM4TConfig"),
|
||||
("segformer", "SegformerConfig"),
|
||||
("sew", "SEWConfig"),
|
||||
("sew-d", "SEWDConfig"),
|
||||
|
@@ -392,6 +393,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("roformer", "ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("rwkv", "RWKV_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("sam", "SAM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("seamless_m4t", "SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("sew", "SEW_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("sew-d", "SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP"),

@@ -622,6 +624,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("roformer", "RoFormer"),
        ("rwkv", "RWKV"),
        ("sam", "SAM"),
        ("seamless_m4t", "SeamlessM4T"),
        ("segformer", "SegFormer"),
        ("sew", "SEW"),
        ("sew-d", "SEW-D"),

@@ -76,6 +76,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
        ("pop2piano", "Pop2PianoFeatureExtractor"),
        ("regnet", "ConvNextFeatureExtractor"),
        ("resnet", "ConvNextFeatureExtractor"),
        ("seamless_m4t", "SeamlessM4TFeatureExtractor"),
        ("segformer", "SegformerFeatureExtractor"),
        ("sew", "Wav2Vec2FeatureExtractor"),
        ("sew-d", "Wav2Vec2FeatureExtractor"),

@@ -175,6 +175,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("roformer", "RoFormerModel"),
        ("rwkv", "RwkvModel"),
        ("sam", "SamModel"),
        ("seamless_m4t", "SeamlessM4TModel"),
        ("segformer", "SegformerModel"),
        ("sew", "SEWModel"),
        ("sew-d", "SEWDModel"),

@@ -674,6 +675,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
        ("pegasus_x", "PegasusXForConditionalGeneration"),
        ("plbart", "PLBartForConditionalGeneration"),
        ("prophetnet", "ProphetNetForConditionalGeneration"),
        ("seamless_m4t", "SeamlessM4TForTextToText"),
        ("switch_transformers", "SwitchTransformersForConditionalGeneration"),
        ("t5", "T5ForConditionalGeneration"),
        ("umt5", "UMT5ForConditionalGeneration"),

@@ -684,6 +686,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
    [
        ("pop2piano", "Pop2PianoForConditionalGeneration"),
        ("seamless_m4t", "SeamlessM4TForSpeechToText"),
        ("speech-encoder-decoder", "SpeechEncoderDecoderModel"),
        ("speech_to_text", "Speech2TextForConditionalGeneration"),
        ("speecht5", "SpeechT5ForSpeechToText"),

@@ -1047,6 +1050,7 @@ MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES = OrderedDict(
        # Model for Text-To-Waveform mapping
        ("bark", "BarkModel"),
        ("musicgen", "MusicgenForConditionalGeneration"),
        ("seamless_m4t", "SeamlessM4TForTextToSpeech"),
        ("vits", "VitsModel"),
    ]
)

@@ -71,6 +71,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("pix2struct", "Pix2StructProcessor"),
        ("pop2piano", "Pop2PianoProcessor"),
        ("sam", "SamProcessor"),
        ("seamless_m4t", "SeamlessM4TProcessor"),
        ("sew", "Wav2Vec2Processor"),
        ("sew-d", "Wav2Vec2Processor"),
        ("speech_to_text", "Speech2TextProcessor"),

@@ -329,6 +329,13 @@ else:
        ("roc_bert", ("RoCBertTokenizer", None)),
        ("roformer", ("RoFormerTokenizer", "RoFormerTokenizerFast" if is_tokenizers_available() else None)),
        ("rwkv", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
        (
            "seamless_m4t",
            (
                "SeamlessM4TTokenizer" if is_sentencepiece_available() else None,
                "SeamlessM4TTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("speech_to_text", ("Speech2TextTokenizer" if is_sentencepiece_available() else None, None)),
        ("speech_to_text_2", ("Speech2Text2Tokenizer", None)),
        ("speecht5", ("SpeechT5Tokenizer" if is_sentencepiece_available() else None, None)),

@@ -0,0 +1,111 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_sentencepiece_available,
    is_tokenizers_available,
    is_torch_available,
)


_import_structure = {
    "configuration_seamless_m4t": ["SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP", "SeamlessM4TConfig"],
    "feature_extraction_seamless_m4t": ["SeamlessM4TFeatureExtractor"],
    "processing_seamless_m4t": ["SeamlessM4TProcessor"],
}

try:
    if not is_sentencepiece_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["tokenization_seamless_m4t"] = ["SeamlessM4TTokenizer"]

try:
    if not is_tokenizers_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["tokenization_seamless_m4t_fast"] = ["SeamlessM4TTokenizerFast"]

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_seamless_m4t"] = [
        "SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST",
        "SeamlessM4TForTextToSpeech",
        "SeamlessM4TForSpeechToSpeech",
        "SeamlessM4TForTextToText",
        "SeamlessM4TForSpeechToText",
        "SeamlessM4TModel",
        "SeamlessM4TPreTrainedModel",
        "SeamlessM4TCodeHifiGan",
        "SeamlessM4THifiGan",
        "SeamlessM4TTextToUnitForConditionalGeneration",
        "SeamlessM4TTextToUnitModel",
    ]

if TYPE_CHECKING:
    from .configuration_seamless_m4t import SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP, SeamlessM4TConfig
    from .feature_extraction_seamless_m4t import SeamlessM4TFeatureExtractor
    from .processing_seamless_m4t import SeamlessM4TProcessor

    try:
        if not is_sentencepiece_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .tokenization_seamless_m4t import SeamlessM4TTokenizer

    try:
        if not is_tokenizers_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .tokenization_seamless_m4t_fast import SeamlessM4TTokenizerFast

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_seamless_m4t import (
            SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST,
            SeamlessM4TCodeHifiGan,
            SeamlessM4TForSpeechToSpeech,
            SeamlessM4TForSpeechToText,
            SeamlessM4TForTextToSpeech,
            SeamlessM4TForTextToText,
            SeamlessM4THifiGan,
            SeamlessM4TModel,
            SeamlessM4TPreTrainedModel,
            SeamlessM4TTextToUnitForConditionalGeneration,
            SeamlessM4TTextToUnitModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)

@@ -0,0 +1,417 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SeamlessM4T model configuration"""

from ...configuration_utils import PretrainedConfig
from ...utils import logging


logger = logging.get_logger(__name__)

SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "facebook/hf-seamless-m4t-medium": "https://huggingface.co/facebook/hf-seamless-m4t-medium/resolve/main/config.json",
    # See all SeamlessM4T models at https://huggingface.co/models?filter=seamless_m4t
}


class SeamlessM4TConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`~SeamlessM4TModel`]. It is used to instantiate a
    SeamlessM4T model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the SeamlessM4T
    [facebook/hf-seamless-m4t-medium](https://huggingface.co/facebook/hf-seamless-m4t-medium) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 256102):
            Vocabulary size of the SeamlessM4T model. Defines the number of different tokens that can be represented by
            the `inputs_ids` passed when calling [`~SeamlessM4TModel`], [`~SeamlessM4TForTextToSpeech`] or
            [`~SeamlessM4TForTextToText`].
        t2u_vocab_size (`int`, *optional*, defaults to 10082):
            Unit vocabulary size of the SeamlessM4T model. Defines the number of different unit tokens that can be
            represented by the `inputs_ids` passed when calling the Text-To-Units sub-model of [`~SeamlessM4TModel`],
            [`~SeamlessM4TForSpeechToSpeech`] or [`~SeamlessM4TForTextToSpeech`].

        > Parameters shared across sub-models

        hidden_size (`int`, *optional*, defaults to 1024):
            Dimensionality of the "intermediate" layers in the architecture.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        max_position_embeddings (`int`, *optional*, defaults to 1024):
            The maximum sequence length that this model's text encoder and decoder might ever be used with. Typically
            set this to something large just in case (e.g., 512 or 1024 or 2048).
        is_encoder_decoder (`bool`, *optional*, defaults to `True`):
            Whether the model is used as an encoder/decoder or not.
        encoder_layerdrop (`float`, *optional*, defaults to 0.05):
            The LayerDrop probability for the encoders. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        decoder_layerdrop (`float`, *optional*, defaults to 0.05):
            The LayerDrop probability for the decoders. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        activation_function (`str` or `function`, *optional*, defaults to `"relu"`):
            The non-linear activation function (function or string) in the decoder and feed-forward layers. If string,
            `"gelu"`, `"relu"`, `"selu"`, `"swish"` and `"gelu_new"` are supported.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, decoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all attention layers.
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout probability for all activation layers in the model.
        scale_embedding (`bool`, *optional*, defaults to `True`):
            Scale embeddings by dividing by sqrt(d_model).

        > Text encoder and text decoder specific parameters

        encoder_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer text encoder.
        encoder_ffn_dim (`int`, *optional*, defaults to 8192):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text encoder.
        encoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer text encoder.
        decoder_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer text decoder.
        decoder_ffn_dim (`int`, *optional*, defaults to 8192):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text decoder.
        decoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer text decoder.
        decoder_start_token_id (`int`, *optional*, defaults to 3):
            If an encoder-decoder model starts decoding with a different token than _bos_, the id of that token. Only
            applied in the text decoder.
        max_new_tokens (`int`, *optional*, defaults to 256):
            The maximum number of text tokens to generate, ignoring the number of tokens in the prompt.
        pad_token_id (`int`, *optional*, defaults to 0):
            The id of the _padding_ text token. Only applied to the text-decoder model.
        bos_token_id (`int`, *optional*, defaults to 2):
            The id of the _beginning-of-stream_ text token. Only applied to the text-decoder model.
        eos_token_id (`int`, *optional*, defaults to 3):
            The id of the _end-of-stream_ text token. Only applied to the text-decoder model.

        > Speech encoder specific parameters

        speech_encoder_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer speech encoder.
        speech_encoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer speech encoder.
        speech_encoder_intermediate_size (`int`, *optional*, defaults to 4096):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer speech encoder.
        speech_encoder_hidden_act (`str` or `function`, *optional*, defaults to `"swish"`):
            The non-linear activation function (function or string) in the speech encoder. If string, `"gelu"`,
            `"relu"`, `"selu"`, `"swish"` and `"gelu_new"` are supported.
        speech_encoder_dropout (`float`, *optional*, defaults to 0.0):
            The dropout probability for all layers in the speech encoder.
        add_adapter (`bool`, *optional*, defaults to `True`):
            Add an adapter layer on top of the speech encoder.
        speech_encoder_layerdrop (`float`, *optional*, defaults to 0.1):
            The LayerDrop probability for the speech encoder. See the [LayerDrop
            paper](https://arxiv.org/abs/1909.11556) for more details.
        feature_projection_input_dim (`int`, *optional*, defaults to 160):
            Input dimension of the input feature projection of the speech encoder, i.e. the dimension after processing
            input audio with [`SeamlessM4TFeatureExtractor`].
        num_conv_pos_embeddings (`int`, *optional*, defaults to 128):
            Number of convolutional positional embeddings. Defines the kernel size of the 1D convolutional positional
            embeddings layer of the speech encoder.
        num_conv_pos_embedding_groups (`int`, *optional*, defaults to 16):
            Number of groups of the 1D convolutional positional embeddings layer of the speech encoder.
        adaptor_kernel_size (`int`, *optional*, defaults to 8):
            Kernel size of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
        adaptor_stride (`int`, *optional*, defaults to 8):
            Stride of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
        adaptor_dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all layers in the speech adapter.
        num_adapter_layers (`int`, *optional*, defaults to 1):
            Number of convolutional layers that should be used in the adapter network. Only relevant if `add_adapter is
            True`.
        position_embeddings_type (`str`, *optional*, defaults to `"relative"`):
            Can be set to `relative` or `rotary` for relative or rotary position embeddings respectively. If left
            `None`, no relative position embedding is applied. Only applied to the speech encoder.
        rotary_embedding_base (`int`, *optional*, defaults to 10000):
            If `"rotary"` position embeddings are used, defines the size of the embedding base. Only applied to the
            speech encoder.
        max_source_positions (`int`, *optional*, defaults to 4096):
            If `"relative"` position embeddings are used, defines the maximum source input positions. Only applied to
            the speech encoder.
        conv_depthwise_kernel_size (`int`, *optional*, defaults to 31):
            Kernel size of convolutional depthwise 1D layer in Conformer blocks. Only applied to the speech encoder.

        > Text-To-Unit (t2u) model specific parameters

        t2u_bos_token_id (`int`, *optional*, defaults to 0):
            The id of the _beginning-of-stream_ unit token. Only applied to the text-to-unit seq2seq model.
        t2u_pad_token_id (`int`, *optional*, defaults to 1):
            The id of the _padding_ unit token. Only applied to the text-to-unit seq2seq model.
        t2u_eos_token_id (`int`, *optional*, defaults to 2):
            The id of the _end-of-stream_ unit token. Only applied to the text-to-unit seq2seq model.
        t2u_decoder_start_token_id (`int`, *optional*, defaults to 2):
            If an encoder-decoder model starts decoding with a different token than _bos_, the id of that token. Only
            applied to the text-to-unit seq2seq model.
        t2u_max_new_tokens (`int`, *optional*, defaults to 1024):
            The maximum number of unit tokens to generate, ignoring the number of tokens in the prompt. Only applied
            to the text-to-unit seq2seq model.
        t2u_encoder_layers (`int`, *optional*, defaults to 6):
            Number of hidden layers in the Transformer text-to-unit encoder.
        t2u_encoder_ffn_dim (`int`, *optional*, defaults to 8192):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text-to-unit encoder.
        t2u_encoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer text-to-unit encoder.
        t2u_decoder_layers (`int`, *optional*, defaults to 6):
            Number of hidden layers in the Transformer text-to-unit decoder.
        t2u_decoder_ffn_dim (`int`, *optional*, defaults to 8192):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text-to-unit decoder.
        t2u_decoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer text-to-unit decoder.
        t2u_max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model's text-to-unit component might ever be used with. Typically
            set this to something large just in case (e.g., 512 or 1024 or 2048).

        > Hifi-Gan Vocoder specific parameters

        sampling_rate (`int`, *optional*, defaults to 16000):
            The sampling rate at which the output audio will be generated, expressed in hertz (Hz).
        upsample_initial_channel (`int`, *optional*, defaults to 512):
            The number of input channels into the hifi-gan upsampling network. Applies to the vocoder only.
|
||||
upsample_rates (`Tuple[int]` or `List[int]`, *optional*, defaults to `[5, 4, 4, 2, 2]`):
|
||||
A tuple of integers defining the stride of each 1D convolutional layer in the vocoder upsampling network.
|
||||
The length of *upsample_rates* defines the number of convolutional layers and has to match the length of
|
||||
*upsample_kernel_sizes*. Applies to the vocoder only.
|
||||
upsample_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[11, 8, 8, 4, 4]`):
|
||||
A tuple of integers defining the kernel size of each 1D convolutional layer in the vocoder upsampling
|
||||
network. The length of *upsample_kernel_sizes* defines the number of convolutional layers and has to match
|
||||
the length of *upsample_rates*. Applies to the vocoder only.
|
||||
resblock_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[3, 7, 11]`):
|
||||
A tuple of integers defining the kernel sizes of the vocoder 1D convolutional layers in the multi-receptive
|
||||
field fusion (MRF) module. Applies to the vocoder only.
|
||||
resblock_dilation_sizes (`Tuple[Tuple[int]]` or `List[List[int]]`, *optional*, defaults to `[[1, 3, 5], [1, 3, 5], [1, 3, 5]]`):
|
||||
A nested tuple of integers defining the dilation rates of the vocoder dilated 1D convolutional layers in
|
||||
the multi-receptive field fusion (MRF) module. Applies to the vocoder only.
|
||||
leaky_relu_slope (`float`, *optional*, defaults to 0.1):
|
||||
The angle of the negative slope used by the leaky ReLU activation in the vocoder. Applies to the vocoder
|
||||
only.
|
||||
unit_hifi_gan_vocab_size (`int`, *optional*, defaults to 10000):
|
||||
Vocabulary size of the SeamlessM4T vocoder. Defines the number of different unit tokens that can be
|
||||
represented by the `inputs_ids` passed when calling the vocoder of [`~SeamlessM4TModel`],
|
||||
[`~SeamlessM4TForSpeechToSpeech`] or [`~SeamlessM4TForTextToSpeech`].
|
||||
unit_embed_dim (`int`, *optional*, defaults to 1280):
|
||||
The projection dimension of the input ids given to the hifi-gan vocoder. Applies to the vocoder only.
|
||||
lang_embed_dim (`int`, *optional*, defaults to 256):
|
||||
The projection dimension of the target language given to the hifi-gan vocoder. Applies to the vocoder only.
|
||||
spkr_embed_dim (`int`, *optional*, defaults to 256):
|
||||
The projection dimension of the speaker id given to the hifi-gan vocoder. Applies to the vocoder only.
|
||||
vocoder_num_langs (`int`, *optional*, defaults to 36):
|
||||
Number of langs supported by the vocoder. Might be different from `t2u_num_langs`.
|
||||
vocoder_num_spkrs (`int`, *optional*, defaults to 200):
|
||||
Number of speakers supported by the vocoder.
|
||||
variance_predictor_kernel_size (`int`, *optional*, defaults to 3):
|
||||
Kernel size of the duration predictor. Applies to the vocoder only.
|
||||
var_pred_dropout (`float`, *optional*, defaults to 0.5):
|
||||
The dropout probabilitiy of the duration predictor. Applies to the vocoder only.
|
||||
vocoder_offset (`int`, *optional*, defaults to 4):
|
||||
Offset the unit token ids by this number to account for symbol tokens. Applies to the vocoder only.
|
||||
|
||||
    ```python
    >>> from transformers import SeamlessM4TModel, SeamlessM4TConfig

    >>> # Initializing a SeamlessM4T "facebook/hf-seamless-m4t-medium" style configuration
    >>> configuration = SeamlessM4TConfig()

    >>> # Initializing a model from the "facebook/hf-seamless-m4t-medium" style configuration
    >>> model = SeamlessM4TModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "seamless_m4t"

    def __init__(
        self,
        vocab_size=256102,
        t2u_vocab_size=10082,
        # shared config
        hidden_size=1024,
        initializer_range=0.02,
        layer_norm_eps=1e-5,
        use_cache=True,
        max_position_embeddings=1024,
        is_encoder_decoder=True,
        encoder_layerdrop=0.05,
        decoder_layerdrop=0.05,
        activation_function="relu",
        dropout=0.1,
        attention_dropout=0.1,
        activation_dropout=0.0,
        scale_embedding=True,
        # text encoder|decoder
        encoder_layers=24,
        encoder_ffn_dim=8192,
        encoder_attention_heads=16,
        decoder_layers=24,
        decoder_ffn_dim=8192,
        decoder_attention_heads=16,
        decoder_start_token_id=3,
        max_new_tokens=256,
        pad_token_id=0,
        bos_token_id=2,
        eos_token_id=3,
        # speech_encoder
        speech_encoder_layers=24,
        speech_encoder_attention_heads=16,
        speech_encoder_intermediate_size=4096,
        speech_encoder_hidden_act="swish",
        speech_encoder_dropout=0.0,
        add_adapter=True,
        speech_encoder_layerdrop=0.1,
        feature_projection_input_dim=160,
        num_conv_pos_embeddings=128,
        num_conv_pos_embedding_groups=16,
        adaptor_kernel_size=8,
        adaptor_stride=8,
        adaptor_dropout=0.1,
        num_adapter_layers=1,
        position_embeddings_type="relative",
        rotary_embedding_base=10000,
        max_source_positions=4096,
        conv_depthwise_kernel_size=31,
        # t2u config
        t2u_bos_token_id=0,
        t2u_pad_token_id=1,
        t2u_eos_token_id=2,
        t2u_decoder_start_token_id=2,
        t2u_max_new_tokens=1024,
        t2u_encoder_layers=6,
        t2u_encoder_ffn_dim=8192,
        t2u_encoder_attention_heads=16,
        t2u_decoder_layers=6,
        t2u_decoder_ffn_dim=8192,
        t2u_decoder_attention_heads=16,
        t2u_max_position_embeddings=2048,
        # hifi-gan vocoder config
        sampling_rate=16000,
        upsample_initial_channel=512,
        upsample_rates=[5, 4, 4, 2, 2],
        upsample_kernel_sizes=[11, 8, 8, 4, 4],
        resblock_kernel_sizes=[3, 7, 11],
        resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
        leaky_relu_slope=0.1,
        # specific to Code Hifi-Gan
        unit_hifi_gan_vocab_size=10000,
        unit_embed_dim=1280,
        lang_embed_dim=256,
        spkr_embed_dim=256,
        vocoder_num_langs=36,
        vocoder_num_spkrs=200,
        variance_predictor_kernel_size=3,
        var_pred_dropout=0.5,
        vocoder_offset=4,
        **kwargs,
    ):
        # overall_config
        self.vocab_size = vocab_size
        self.t2u_vocab_size = t2u_vocab_size
        self.hidden_size = hidden_size
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.max_position_embeddings = max_position_embeddings
        self.use_cache = use_cache
        self.max_new_tokens = max_new_tokens
        self.encoder_layerdrop = encoder_layerdrop
        self.decoder_layerdrop = decoder_layerdrop
        self.activation_function = activation_function
        self.dropout = dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.scale_embedding = scale_embedding
        # for proper config init
        self.num_attention_heads = decoder_attention_heads
        self.num_hidden_layers = decoder_layers

        # text|unit encoder|decoder
        self.encoder_layers = encoder_layers
        self.encoder_ffn_dim = encoder_ffn_dim
        self.encoder_attention_heads = encoder_attention_heads
        self.decoder_layers = decoder_layers
        self.decoder_ffn_dim = decoder_ffn_dim
        self.decoder_attention_heads = decoder_attention_heads

        # speech_encoder
        self.speech_encoder_layers = speech_encoder_layers
        self.speech_encoder_hidden_act = speech_encoder_hidden_act
        self.speech_encoder_dropout = speech_encoder_dropout
        self.speech_encoder_attention_heads = speech_encoder_attention_heads
        self.speech_encoder_layerdrop = speech_encoder_layerdrop
        self.speech_encoder_intermediate_size = speech_encoder_intermediate_size
        self.feature_projection_input_dim = feature_projection_input_dim
        self.num_conv_pos_embeddings = num_conv_pos_embeddings
        self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
        self.adaptor_kernel_size = adaptor_kernel_size
        self.adaptor_stride = adaptor_stride
        self.adaptor_dropout = adaptor_dropout
        self.num_adapter_layers = num_adapter_layers
        self.position_embeddings_type = position_embeddings_type
        self.rotary_embedding_base = rotary_embedding_base
        self.max_source_positions = max_source_positions
        self.conv_depthwise_kernel_size = conv_depthwise_kernel_size
        self.add_adapter = add_adapter

        # t2u config
        self.t2u_bos_token_id = t2u_bos_token_id
        self.t2u_pad_token_id = t2u_pad_token_id
        self.t2u_eos_token_id = t2u_eos_token_id
        self.t2u_decoder_start_token_id = t2u_decoder_start_token_id
        self.t2u_max_new_tokens = t2u_max_new_tokens
        self.t2u_encoder_layers = t2u_encoder_layers
        self.t2u_encoder_ffn_dim = t2u_encoder_ffn_dim
        self.t2u_encoder_attention_heads = t2u_encoder_attention_heads
        self.t2u_decoder_layers = t2u_decoder_layers
        self.t2u_decoder_ffn_dim = t2u_decoder_ffn_dim
        self.t2u_decoder_attention_heads = t2u_decoder_attention_heads
        self.t2u_max_position_embeddings = t2u_max_position_embeddings

        # hifi-gan vocoder config
        # original parameters specific to Hifi-Gan
        self.sampling_rate = sampling_rate
        self.upsample_initial_channel = upsample_initial_channel
        self.upsample_rates = upsample_rates
        self.upsample_kernel_sizes = upsample_kernel_sizes
        self.resblock_kernel_sizes = resblock_kernel_sizes
        self.resblock_dilation_sizes = resblock_dilation_sizes
        self.leaky_relu_slope = leaky_relu_slope

        # specific to Code Hifi-Gan
        self.unit_hifi_gan_vocab_size = unit_hifi_gan_vocab_size
        self.unit_embed_dim = unit_embed_dim
        self.lang_embed_dim = lang_embed_dim
        self.spkr_embed_dim = spkr_embed_dim
        self.vocoder_num_langs = vocoder_num_langs
        self.vocoder_num_spkrs = vocoder_num_spkrs
        self.variance_predictor_kernel_size = variance_predictor_kernel_size
        self.var_pred_dropout = var_pred_dropout
        self.vocoder_offset = vocoder_offset

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            decoder_start_token_id=decoder_start_token_id,
            is_encoder_decoder=is_encoder_decoder,
            max_position_embeddings=max_position_embeddings,
            **kwargs,
        )
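`SeamlessM4TConfig` keeps every sub-model's hyperparameters in one flat namespace, scoping them by prefix (`t2u_*` for the text-to-unit model, vocoder fields for hifi-gan). A minimal, hypothetical sketch of how such prefix-scoped keys can be grouped back into per-component kwargs; `group_by_prefix` is illustrative only and not part of the library:

```python
def group_by_prefix(flat_config: dict, prefix: str) -> dict:
    """Collect keys scoped by `prefix` from a flat config dict, stripping the prefix."""
    return {key[len(prefix):]: value for key, value in flat_config.items() if key.startswith(prefix)}


flat = {"t2u_encoder_layers": 6, "t2u_decoder_layers": 6, "encoder_layers": 24}
t2u_kwargs = group_by_prefix(flat, "t2u_")
# t2u_kwargs == {"encoder_layers": 6, "decoder_layers": 6}
```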
@ -0,0 +1,410 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Converting Meta SeamlessM4T checkpoints from seamless_communication to HF."""

import argparse
import os
from pathlib import Path

import torch
from accelerate.utils.modeling import find_tied_parameters
from seamless_communication.models.inference.translator import Translator

from transformers import (
    SeamlessM4TConfig,
    SeamlessM4TFeatureExtractor,
    SeamlessM4TModel,
    SeamlessM4TProcessor,
    SeamlessM4TTokenizer,
)
from transformers.utils import logging


# fmt: off
UNIT_SUPPORTED_LANGUAGES = ["__arb__", "__ben__", "__cat__", "__ces__", "__cmn__", "__cym__", "__dan__", "__deu__", "__eng__", "__est__", "__fin__", "__fra__", "__hin__", "__ind__", "__ita__", "__jpn__", "__kan__", "__kor__", "__mlt__", "__nld__", "__pes__", "__pol__", "__por__", "__ron__", "__rus__", "__slk__", "__spa__", "__swe__", "__swh__", "__tam__", "__tel__", "__tgl__", "__tha__", "__tur__", "__ukr__", "__urd__", "__uzn__", "__vie__", ]
# fmt: on

# fmt: off
VOCODER_SUPPORTED_LANGUAGES = ["__arb__", "__ben__", "__cat__", "__ces__", "__cmn__", "__cym__", "__dan__", "__deu__", "__eng__", "__est__", "__fin__", "__fra__", "__hin__", "__ind__", "__ita__", "__jpn__", "__kor__", "__mlt__", "__nld__", "__pes__", "__pol__", "__por__", "__ron__", "__rus__", "__slk__", "__spa__", "__swe__", "__swh__", "__tel__", "__tgl__", "__tha__", "__tur__", "__ukr__", "__urd__", "__uzn__", "__vie__",]
# fmt: on


# fmt: off
MEDIUM_SUPPORTED_LANGUAGES = ["ace","ace_Latn","acm","acq","aeb","afr","ajp","aka","amh","apc","arb","ars","ary","arz","asm","ast","awa","ayr","azb","azj","bak","bam","ban","bel","bem","ben","bho","bjn","bjn_Latn","bod","bos","bug","bul","cat","ceb","ces","cjk","ckb","crh","cym","dan","deu","dik","dyu","dzo","ell","eng","epo","est","eus","ewe","fao","pes","fij","fin","fon","fra","fur","fuv","gla","gle","glg","grn","guj","hat","hau","heb","hin","hne","hrv","hun","hye","ibo","ilo","ind","isl","ita","jav","jpn","kab","kac","kam","kan","kas","kas_Deva","kat","knc","knc_Latn","kaz","kbp","kea","khm","kik","kin","kir","kmb","kon","kor","kmr","lao","lvs","lij","lim","lin","lit","lmo","ltg","ltz","lua","lug","luo","lus","mag","mai","mal","mar","min","mkd","plt","mlt","mni","khk","mos","mri","zsm","mya","nld","nno","nob","npi","nso","nus","nya","oci","gaz","ory","pag","pan","pap","pol","por","prs","pbt","quy","ron","run","rus","sag","san","sat","scn","shn","sin","slk","slv","smo","sna","snd","som","sot","spa","als","srd","srp","ssw","sun","swe","swh","szl","tam","tat","tel","tgk","tgl","tha","tir","taq","taq_Tfng","tpi","tsn","tso","tuk","tum","tur","twi","tzm","uig","ukr","umb","urd","uzn","vec","vie","war","wol","xho","ydd","yor","yue","cmn","cmn_Hant","zul",]
# fmt: on


# fmt: off
LARGE_SUPPORTED_LANGUAGES = ["afr","amh","arb","ary","arz","asm","azj","bel","ben","bos","bul","cat","ceb","ces","ckb","cmn","cmn_Hant","cym","dan","deu","ell","eng","est","eus","fin","fra","fuv","gaz","gle","glg","guj","heb","hin","hrv","hun","hye","ibo","ind","isl","ita","jav","jpn","kan","kat","kaz","khk","khm","kir","kor","lao","lit","lug","luo","lvs","mai","mal","mar","mkd","mlt","mni","mya","nld","nno","nob","npi","nya","ory","pan","pbt","pes","pol","por","ron","rus","sat","slk","slv","sna","snd","som","spa","srp","swe","swh","tam","tel","tgk","tgl","tha","tur","ukr","urd","uzn","vie","yor","yue","zlm","zul",]
# fmt: on

def assert_param_count(model_1, model_2):
    count_1 = sum(p[1].numel() for p in model_1.named_parameters() if "final_proj" not in p[0])
    count_2 = sum(p[1].numel() for p in model_2.named_parameters() if "final_proj" not in p[0])
    assert count_1 == count_2, f"{model_1.__class__}: {count_1} != {model_2.__class__}: {count_2}"


def param_count(model):
    return sum(p[1].numel() for p in model.named_parameters() if "final_proj" not in p[0])


def _grab_best_device(use_gpu=True):
    if torch.cuda.device_count() > 0 and use_gpu:
        device = "cuda"
    else:
        device = "cpu"
    return torch.device(device)


logging.set_verbosity_info()
logger = logging.get_logger(__name__)

vocoder_convert_list = [
    ("ups", "hifi_gan.upsampler"),
    ("conv_pre", "hifi_gan.conv_pre"),
    ("resblocks", "hifi_gan.resblocks"),
    ("conv_post", "hifi_gan.conv_post"),
    ("lang", "language_embedding"),
    ("spkr", "speaker_embedding"),
    ("dict.", "unit_embedding."),
    ("dur_predictor.conv1.0", "dur_predictor.conv1"),
    ("dur_predictor.conv2.0", "dur_predictor.conv2"),
]

# order is important
wav2vec_convert_list = [
    ("speech_encoder_frontend.model_dim_proj", "feature_projection.projection"),
    ("speech_encoder_frontend.post_extract_layer_norm", "feature_projection.layer_norm"),
    ("speech_encoder_frontend.pos_encoder.conv", "encoder.pos_conv_embed.conv"),
    ("speech_encoder.inner.layers", "encoder.layers"),
    ("speech_encoder.inner_layer_norm", "encoder.layer_norm"),
    ("speech_encoder.adaptor_layers", "adapter.layers"),
    ("inner_proj", "intermediate_dense"),
    ("self_attn.output_proj", "self_attn.linear_out"),
    ("output_proj", "output_dense"),
    ("self_attn.k_proj", "self_attn.linear_k"),
    ("self_attn.v_proj", "self_attn.linear_v"),
    ("self_attn.q_proj", "self_attn.linear_q"),
    ("self_attn.sdpa.u_bias", "self_attn.pos_bias_u"),
    ("self_attn.sdpa.v_bias", "self_attn.pos_bias_v"),
    ("self_attn.sdpa.r_proj", "self_attn.linear_pos"),
    ("conv.pointwise_conv1", "conv_module.pointwise_conv1"),
    ("conv.pointwise_conv2", "conv_module.pointwise_conv2"),
    ("conv.depthwise_conv", "conv_module.depthwise_conv"),
    ("conv.batch_norm", "conv_module.batch_norm"),
    ("conv_layer_norm", "conv_module.layer_norm"),
    ("speech_encoder.proj1", "intermediate_ffn.intermediate_dense"),
    ("speech_encoder.proj2", "intermediate_ffn.output_dense"),
    ("speech_encoder.layer_norm", "inner_layer_norm"),
]

t2u_convert_list = [
    ("t2u_model.final_proj", "lm_head"),
    ("t2u_model.", "model."),
    ("encoder_decoder_attn_layer_norm", "cross_attention_layer_norm"),
    ("encoder_decoder_attn", "cross_attention"),
    ("linear_k", "k_proj"),
    ("linear_v", "v_proj"),
    ("linear_q", "q_proj"),
    ("ffn.inner_proj", "ffn.fc1"),
    ("ffn.output_proj", "ffn.fc2"),
    ("output_proj", "out_proj"),
    ("decoder_frontend.embed", "decoder.embed_tokens"),
]

text_convert_list = [
    ("text_encoder.", ""),
    ("text_decoder.", ""),
    ("text_encoder_frontend.embed", "embed_tokens"),
    ("text_decoder_frontend.embed", "embed_tokens"),
    ("encoder_decoder_attn_layer_norm", "cross_attention_layer_norm"),
    ("encoder_decoder_attn", "cross_attention"),
    ("linear_k", "k_proj"),
    ("linear_v", "v_proj"),
    ("linear_q", "q_proj"),
    ("ffn.inner_proj", "ffn.fc1"),
    ("ffn.output_proj", "ffn.fc2"),
    ("output_proj", "out_proj"),
    ("final_proj", "lm_head"),
]

CUR_PATH = os.path.dirname(os.path.abspath(__file__))
default_cache_dir = os.path.join(os.path.expanduser("~"), ".cache")
CACHE_DIR = os.path.join(os.getenv("XDG_CACHE_HOME", default_cache_dir), "huggingface", "hub")


def _load_hf_config(model_type="medium"):
    if model_type == "medium":
        kwargs = {
            "vocab_size": 256206,
            "t2u_vocab_size": 10082,
            "hidden_size": 1024,
            "max_position_embeddings": 4096,
            "encoder_layers": 12,
            "decoder_layers": 12,
            "encoder_ffn_dim": 4096,
            "decoder_ffn_dim": 4096,
            "t2u_encoder_layers": 4,
            "t2u_decoder_layers": 4,
            "speech_encoder_layers": 12,
        }
        return SeamlessM4TConfig(**kwargs)
    else:
        return SeamlessM4TConfig()


def _convert_model(
    original_model,
    hf_model,
    convert_list,
    device,
    unwanted_prefix="model.",
    filter_state_dict="speech",
    exclude_state_dict=None,
):
    state_dict = original_model.state_dict()

    # filter func
    if isinstance(filter_state_dict, str):

        def filter_func(x):
            return filter_state_dict in x[0]

    else:

        def filter_func(item):
            if exclude_state_dict is not None and exclude_state_dict in item[0]:
                return False
            for filter_el in filter_state_dict:
                if filter_el in item[0]:
                    return True

            return False

    state_dict = dict(filter(filter_func, state_dict.items()))

    for k, v in list(state_dict.items()):
        new_k = k[len(unwanted_prefix) :]
        for old_layer_name, new_layer_name in convert_list:
            if old_layer_name in new_k:
                new_k = new_k.replace(old_layer_name, new_layer_name)

        # must do it by hand
        if ".layer_norm" in new_k and new_k.split(".layer_norm")[0][-1].isnumeric():
            new_k = new_k.replace("layer_norm", "final_layer_norm")

        state_dict[new_k] = state_dict.pop(k)

    extra_keys = set(state_dict.keys()) - set(hf_model.state_dict().keys())
    missing_keys = set(hf_model.state_dict().keys()) - set(state_dict.keys())
    missing_keys = {k for k in missing_keys if "final_logits_bias" not in k}
    if len(extra_keys) != 0:
        raise ValueError(f"extra keys found: {extra_keys}")
    if len(missing_keys) != 0:
        raise ValueError(f"missing keys: {missing_keys}")
    hf_model.load_state_dict(state_dict, strict=False)
    n_params = param_count(hf_model)

    logger.info(f"model loaded: {round(n_params / 1e6, 1)}M params")

    hf_model.eval()
    hf_model.to(device)
    del state_dict

    return hf_model


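The `*_convert_list` mappings above are applied as ordered substring replacements to each original state-dict key; order matters because a specific rule (e.g. `self_attn.output_proj`) must fire before a generic one (`output_proj`) that would otherwise match the same key. A small self-contained sketch of that mechanism:

```python
def rename_key(key: str, convert_list) -> str:
    # Apply each (old, new) substring substitution in order.
    for old_layer_name, new_layer_name in convert_list:
        if old_layer_name in key:
            key = key.replace(old_layer_name, new_layer_name)
    return key


rules = [("self_attn.output_proj", "self_attn.linear_out"), ("output_proj", "output_dense")]
renamed = rename_key("layers.0.self_attn.output_proj.weight", rules)
# renamed == "layers.0.self_attn.linear_out.weight"
# with the rules reversed, the generic "output_proj" rule would have matched first
```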
def load_model(save_dir, model_type, repo_id):
    """
    Meta SeamlessM4T is made of 8 main components:
    - speech_encoder (#1) and speech_encoder_frontend (#2)
    - t2u_model (#3)
    - text_encoder (#4) and text_encoder_frontend (#5)
    - text_decoder (#6) [and text_decoder_frontend (#5), equal to text_encoder_frontend]
    - final_proj (#7)
    - vocoder (#8)
    """
    device = _grab_best_device()
    if model_type == "medium":
        name = "seamlessM4T_medium"
    else:
        name = "seamlessM4T_large"

    original_model = Translator(name, "vocoder_36langs", device, torch.float32)

    ######### TOKENIZER

    langs = MEDIUM_SUPPORTED_LANGUAGES if model_type == "medium" else LARGE_SUPPORTED_LANGUAGES
    langs = [f"__{lang}__" for lang in langs]
    vocab_file = os.path.join(os.path.expanduser("~"), "tokenizer", model_type, "tokenizer.model")

    save_dir = os.path.join(save_dir, name)
    Path(save_dir).mkdir(exist_ok=True)

    tokenizer = SeamlessM4TTokenizer(vocab_file, additional_special_tokens=langs)

    sanity_check_lang_id = tokenizer.convert_tokens_to_ids("__fra__")

    tokenizer.save_pretrained(save_dir)
    tokenizer = SeamlessM4TTokenizer.from_pretrained(save_dir)

    if sanity_check_lang_id != tokenizer.convert_tokens_to_ids("__fra__"):
        raise ValueError(
            f"Error in tokenizer saving/loading - __fra__ lang id is not coherent: {sanity_check_lang_id} vs {tokenizer.convert_tokens_to_ids('__fra__')}"
        )

    ####### get language to ids dict
    text_decoder_lang_code_to_id = {lang.replace("__", ""): tokenizer.convert_tokens_to_ids(lang) for lang in langs}
    # offset: vocoder unit vocab size + 5 (for EOS/PAD/BOS/UNK/MSK) + len(supported_languages)
    t2u_lang_code_to_id = {
        code.replace("__", ""): i + 10005 + len(UNIT_SUPPORTED_LANGUAGES)
        for i, code in enumerate(UNIT_SUPPORTED_LANGUAGES)
    }
    vocoder_lang_code_to_id = {code.replace("__", ""): i for i, code in enumerate(VOCODER_SUPPORTED_LANGUAGES)}

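As the offset comment above notes, each t2u language id sits after the 10,000-entry unit vocabulary, 5 special tokens (EOS/PAD/BOS/UNK/MSK), and the language block itself. A worked sketch of that layout, with a truncated language list for illustration (the real `UNIT_SUPPORTED_LANGUAGES` is longer):

```python
# Worked example of the t2u language-id layout: unit vocab + 5 special tokens
# + the language block itself, then the i-th language.
unit_langs = ["__arb__", "__ben__", "__cat__"]  # truncated for illustration
offset = 10000 + 5 + len(unit_langs)
lang_code_to_id = {code.replace("__", ""): i + offset for i, code in enumerate(unit_langs)}
# lang_code_to_id == {"arb": 10008, "ben": 10009, "cat": 10010}
```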
    ######### FE

    fe = SeamlessM4TFeatureExtractor(language_code=langs)

    fe.save_pretrained(save_dir)
    fe = SeamlessM4TFeatureExtractor.from_pretrained(save_dir)

    processor = SeamlessM4TProcessor(feature_extractor=fe, tokenizer=tokenizer)
    processor.save_pretrained(save_dir)
    processor.push_to_hub(repo_id=repo_id, create_pr=True)

    processor = SeamlessM4TProcessor.from_pretrained(save_dir)

    ######## Model

    # init model
    hf_config = _load_hf_config(model_type)
    hf_model = SeamlessM4TModel(hf_config)

    hf_model.generation_config.__setattr__("text_decoder_lang_to_code_id", text_decoder_lang_code_to_id)
    hf_model.generation_config.__setattr__("t2u_lang_code_to_id", t2u_lang_code_to_id)
    hf_model.generation_config.__setattr__("vocoder_lang_code_to_id", vocoder_lang_code_to_id)

    # -1. take care of vocoder
    # similarly to SpeechT5, must apply and then remove weight norm
    hf_model.vocoder.apply_weight_norm()
    hf_model.vocoder = _convert_model(
        original_model,
        hf_model.vocoder,
        vocoder_convert_list,
        device,
        unwanted_prefix="vocoder.code_generator.",
        filter_state_dict="vocoder",
    )
    hf_model.vocoder.remove_weight_norm()

    # 1. take care of speech encoder
    wav2vec = hf_model.speech_encoder
    hf_model.speech_encoder = _convert_model(
        original_model, wav2vec, wav2vec_convert_list, device, unwanted_prefix="model.", filter_state_dict="speech"
    )

    # 2. take care of t2u
    hf_model.t2u_model = _convert_model(
        original_model,
        hf_model.t2u_model,
        t2u_convert_list,
        device,
        unwanted_prefix="model.",
        filter_state_dict="t2u_model",
    )

    # 3. take care of text encoder
    hf_model.text_encoder = _convert_model(
        original_model,
        hf_model.text_encoder,
        text_convert_list,
        device,
        unwanted_prefix="model.",
        filter_state_dict=["model.text_encoder"],
        exclude_state_dict="t2u_model",
    )

    # 4. take care of text decoder
    hf_model.text_decoder = _convert_model(
        original_model,
        hf_model.text_decoder,
        text_convert_list,
        device,
        unwanted_prefix="model.",
        filter_state_dict=["model.text_decoder"],
        exclude_state_dict="t2u_model",
    )

    # 5. take care of final proj
    hf_model.lm_head = _convert_model(
        original_model,
        hf_model.lm_head,
        [("final_proj.", "")],
        device,
        unwanted_prefix="model.",
        filter_state_dict=["model.final_proj"],
        exclude_state_dict="t2u_model",
    )

    # sanity check
    print(find_tied_parameters(hf_model))

    count_1 = param_count(hf_model)
    count_2 = param_count(original_model)

    print(f"HF MODEL: {count_1}, ORIGINAL_MODEL: {count_2}, diff: {count_1 - count_2}")
    print(f"HF MODEL excluding embeddings: {hf_model.num_parameters(exclude_embeddings=True)}")

    del original_model

    hf_model.generation_config._from_model_config = False
    hf_model.save_pretrained(save_dir)
    hf_model.push_to_hub(repo_id=repo_id, create_pr=True)
    hf_model = SeamlessM4TModel.from_pretrained(save_dir)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Required parameters

    parser.add_argument(
        "--model_type",
        default="medium",
        type=str,
        help="Model type.",
    )

    parser.add_argument(
        "--save_dir",
        default="/home/ubuntu/weights",
        type=str,
        help="Path to the output PyTorch model.",
    )

    parser.add_argument(
        "--repo_id",
        default="facebook/hf-seamless-m4t-medium",
        type=str,
        help="Repo ID.",
    )

    args = parser.parse_args()

    load_model(args.save_dir, args.model_type, args.repo_id)

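`_convert_model` selects each sub-model's weights out of the full original state dict with either a substring filter (a `str`) or an any-of filter (a list) plus an optional exclusion. A standalone sketch of that selection logic with dummy keys:

```python
def make_filter(filter_state_dict, exclude_state_dict=None):
    # Mirrors the str / list branches of `_convert_model`'s filter_func.
    def filter_func(item):
        key = item[0]
        if isinstance(filter_state_dict, str):
            return filter_state_dict in key
        if exclude_state_dict is not None and exclude_state_dict in key:
            return False
        return any(filter_el in key for filter_el in filter_state_dict)

    return filter_func


state_dict = {"model.text_encoder.w": 1, "model.t2u_model.text_encoder.w": 2, "model.vocoder.w": 3}
kept = dict(filter(make_filter(["text_encoder"], exclude_state_dict="t2u_model"), state_dict.items()))
# kept == {"model.text_encoder.w": 1}
```

The exclusion is what keeps the t2u model's own `text_encoder`-like keys from leaking into the text encoder conversion.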
@ -0,0 +1,305 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Feature extractor class for SeamlessM4T
"""

import copy
from typing import List, Optional, Union

import numpy as np

from ...audio_utils import mel_filter_bank, spectrogram, window_function
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
from ...feature_extraction_utils import BatchFeature
from ...utils import PaddingStrategy, TensorType, logging


logger = logging.get_logger(__name__)


class SeamlessM4TFeatureExtractor(SequenceFeatureExtractor):
    r"""
    Constructs a SeamlessM4T feature extractor.

    This feature extractor inherits from [`SequenceFeatureExtractor`] which contains most of the main methods. Users
    should refer to this superclass for more information regarding those methods.

    This class extracts mel-filter bank features from raw speech.

    Args:
        feature_size (`int`, *optional*, defaults to 80):
            The feature dimension of the extracted features.
        sampling_rate (`int`, *optional*, defaults to 16000):
            The sampling rate at which the audio files should be digitalized, expressed in hertz (Hz).
        num_mel_bins (`int`, *optional*, defaults to 80):
            Number of Mel-frequency bins.
        padding_value (`float`, *optional*, defaults to 0.0):
            The value that is used to fill the padding vectors.
        stride (`int`, *optional*, defaults to 2):
            Stride used to reshape audios from shape (batch_size, num_frames, num_mel_bins) to
            (batch_size, num_frames // stride, num_mel_bins * stride).
    """

model_input_names = ["input_features", "attention_mask"]
|
||||
|
||||
    def __init__(
        self,
        feature_size=80,
        sampling_rate=16000,
        num_mel_bins=80,
        padding_value=0.0,
        stride=2,
        **kwargs,
    ):
        self.num_mel_bins = num_mel_bins
        self.return_attention_mask = True
        self.stride = stride

        mel_filters = mel_filter_bank(
            num_frequency_bins=256,
            num_mel_filters=self.num_mel_bins,
            min_frequency=20,
            max_frequency=sampling_rate // 2,
            sampling_rate=sampling_rate,
            norm=None,
            mel_scale="kaldi",
            triangularize_in_mel_space=True,
        )

        self.mel_filters = np.pad(mel_filters, ((0, 1), (0, 0)))
        self.window = window_function(400, "povey", periodic=False)

        super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
    @staticmethod
    # Copied from transformers.models.wav2vec2.feature_extraction_wav2vec2.Wav2Vec2FeatureExtractor.zero_mean_unit_var_norm
    def zero_mean_unit_var_norm(
        input_values: List[np.ndarray], attention_mask: List[np.ndarray], padding_value: float = 0.0
    ) -> List[np.ndarray]:
        """
        Every array in the list is normalized to have zero mean and unit variance
        """
        if attention_mask is not None:
            attention_mask = np.array(attention_mask, np.int32)
            normed_input_values = []

            for vector, length in zip(input_values, attention_mask.sum(-1)):
                normed_slice = (vector - vector[:length].mean()) / np.sqrt(vector[:length].var() + 1e-7)
                if length < normed_slice.shape[0]:
                    normed_slice[length:] = padding_value

                normed_input_values.append(normed_slice)
        else:
            normed_input_values = [(x - x.mean()) / np.sqrt(x.var() + 1e-7) for x in input_values]

        return normed_input_values
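The masked normalization above can be exercised in isolation. The sketch below is a standalone numpy re-expression of the helper, not part of the diff, and the array values are made up for illustration:

```python
import numpy as np


def zero_mean_unit_var(vector: np.ndarray, length: int, padding_value: float = 0.0) -> np.ndarray:
    # Statistics come from the first `length` valid frames only ...
    normed = (vector - vector[:length].mean()) / np.sqrt(vector[:length].var() + 1e-7)
    # ... and padded positions are reset to the padding value afterwards.
    normed[length:] = padding_value
    return normed


padded = np.array([1.0, 3.0, 0.0, 0.0])  # last two entries are padding
out = zero_mean_unit_var(padded, length=2)  # valid part becomes roughly [-1.0, 1.0]
```

Computing the mean and variance over the valid frames only keeps the statistics independent of how much padding a batch happens to contain.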
    def _extract_fbank_features(
        self,
        waveform: np.ndarray,
    ) -> np.ndarray:
        """
        Get mel-filter bank features using TorchAudio. Note that TorchAudio requires 16-bit signed integers as inputs
        and hence the waveform should not be normalized before feature extraction.
        """
        # by default, it extracts the left channel if stereo
        if len(waveform.shape) == 2:
            waveform = waveform[0]

        waveform = np.squeeze(waveform) * (2**15)  # Kaldi compliance: 16-bit signed integers
        features = spectrogram(
            waveform,
            self.window,
            frame_length=400,
            hop_length=160,
            fft_length=512,
            power=2.0,
            center=False,
            preemphasis=0.97,
            mel_filters=self.mel_filters,
            log_mel="log",
            mel_floor=1.192092955078125e-07,
            remove_dc_offset=True,
        ).T
        return features
    def __call__(
        self,
        raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
        padding: Union[bool, str, PaddingStrategy] = True,
        pad_to_multiple_of: Optional[int] = 2,
        max_length: Optional[int] = None,
        truncation: bool = False,
        return_tensors: Optional[Union[str, TensorType]] = None,
        sampling_rate: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
        do_normalize_per_mel_bins: Optional[bool] = True,
        **kwargs,
    ) -> BatchFeature:
        """
        Main method to featurize and prepare for the model one or several sequence(s).

        Args:
            raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`, `List[List[List[float]]]`):
                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
                values, a list of numpy arrays, a list of lists of float values or a list of lists of lists of float
                values. If `raw_speech` is a one-dimensional `np.ndarray` or a `List[float]`, `raw_speech` is
                considered a single-channel, single-sample sound. In all other cases, the first dimension of
                `raw_speech`, whether from an `np.ndarray` or a `List[...]`, corresponds to the number of samples in
                the batch, and the number of channels (i.e. mono or stereo character) is derived from the other
                dimensions (1D -> single-channel waveform batches; 2D -> stereo-channel waveform batches).
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:

                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                  sequence is provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                  acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
                  lengths).
            pad_to_multiple_of (`int`, *optional*, defaults to 2):
                If set will pad the sequence to a multiple of the provided value.

                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
            max_length (`int`, *optional*):
                Maximum length of the returned list and optionally padding length (see above).
            truncation (`bool`):
                Activates truncation to cut input sequences longer than *max_length* to *max_length*.
            return_attention_mask (`bool`, *optional*):
                Whether to return the attention mask. If left to the default, will return the attention mask according
                to the specific feature_extractor's default.

                [What are attention masks?](../glossary#attention-mask)

                <Tip>

                For SeamlessM4T models, `attention_mask` should always be passed for batched inference, to avoid subtle
                bugs.

                </Tip>

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return Numpy `np.ndarray` objects.
            sampling_rate (`int`, *optional*):
                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
                `sampling_rate` at the forward call to prevent silent errors.
            do_normalize_per_mel_bins (`bool`, *optional*, defaults to `True`):
                Whether or not to zero-mean unit-variance normalize the input per mel-channel.
            kwargs (*optional*):
                Remaining dictionary of keyword arguments that will be passed to the tokenizer or the feature
                extractor.
        """
        if sampling_rate is not None:
            if sampling_rate != self.sampling_rate:
                raise ValueError(
                    f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
                    f" {self.sampling_rate}. Please make sure that the provided `raw_speech` input was sampled with"
                    f" {self.sampling_rate} and not {sampling_rate}."
                )
        else:
            logger.warning(
                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
                "Failing to do so can result in silent errors that might be hard to debug."
            )
        is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
        if is_batched_numpy and len(raw_speech.shape) > 3:
            raise ValueError(f"Only mono-channel or stereo-channel audio is supported for input to {self}")

        is_batched = is_batched_numpy or (
            isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
        )

        if is_batched:
            raw_speech = [np.asarray(speech, dtype=np.float32) for speech in raw_speech]
        elif not is_batched and not isinstance(raw_speech, np.ndarray):
            raw_speech = np.asarray(raw_speech, dtype=np.float32)
        elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
            raw_speech = raw_speech.astype(np.float32)

        # always return batch
        if not is_batched:
            raw_speech = [raw_speech]

        # extract fbank features
        features = [self._extract_fbank_features(waveform) for waveform in raw_speech]

        if do_normalize_per_mel_bins:
            # torch defaults to ddof=1, and numpy defaults to ddof=0
            features = [
                (x - np.expand_dims(x.mean(0), 0)) / np.sqrt(np.expand_dims(x.var(0, ddof=1), 0) + 1e-7)
                for x in features
            ]

        # convert into correct format for padding
        encoded_inputs = BatchFeature({"input_features": features})

        padded_inputs = self.pad(
            encoded_inputs,
            padding=padding,
            max_length=max_length,
            truncation=truncation,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask,
            return_tensors="np",
        )

        # SeamlessM4T needs to process extracted features
        input_features = padded_inputs.get("input_features")
        attention_mask = padded_inputs.get("attention_mask")

        batch_size, num_frames, num_channels = input_features.shape

        remainder = num_frames % self.stride
        if remainder != 0:
            # drop trailing frames so that `num_frames` is a multiple of `stride` and the reshape below is valid
            num_frames -= remainder
            input_features = input_features[:, :num_frames, :]
            attention_mask = attention_mask[:, :num_frames]

        input_features = np.reshape(
            input_features, (batch_size, num_frames // self.stride, num_channels * self.stride)
        )

        indices = np.arange(0, num_frames)
        attention_mask = attention_mask[:, indices % self.stride == 1]

        padded_inputs["input_features"] = input_features
        padded_inputs["attention_mask"] = attention_mask

        if return_tensors is not None:
            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

        return padded_inputs
    def to_dict(self):
        """
        Serializes this instance to a Python dictionary.

        Returns:
            `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
        """
        output = copy.deepcopy(self.__dict__)
        output["feature_extractor_type"] = self.__class__.__name__
        if "mel_filters" in output:
            del output["mel_filters"]
        if "window" in output:
            del output["window"]
        return output
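The frame-stacking step at the end of `__call__` (reshape by `stride`, then subsample the attention mask) can be sketched with plain numpy, using made-up toy shapes:

```python
import numpy as np

stride = 2
batch_size, num_frames, num_mel_bins = 1, 6, 4  # toy shapes for illustration
features = np.arange(batch_size * num_frames * num_mel_bins, dtype=np.float32).reshape(
    batch_size, num_frames, num_mel_bins
)
attention_mask = np.ones((batch_size, num_frames), dtype=np.int32)

# Stack every `stride` consecutive frames into a single, wider frame.
stacked = features.reshape(batch_size, num_frames // stride, num_mel_bins * stride)

# Keep one attention-mask entry per stacked frame.
indices = np.arange(num_frames)
subsampled_mask = attention_mask[:, indices % stride == 1]
```

Because the reshape is row-major, each stacked frame is simply the concatenation of `stride` consecutive original frames, which halves the sequence length the speech encoder has to process.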
@@ -0,0 +1,116 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Audio/Text processor class for SeamlessM4T
"""

from ...processing_utils import ProcessorMixin

class SeamlessM4TProcessor(ProcessorMixin):
    r"""
    Constructs a SeamlessM4T processor which wraps a SeamlessM4T feature extractor and a SeamlessM4T tokenizer into a
    single processor.

    [`SeamlessM4TProcessor`] offers all the functionalities of [`SeamlessM4TFeatureExtractor`] and
    [`SeamlessM4TTokenizerFast`]. See the [`~SeamlessM4TProcessor.__call__`] and [`~SeamlessM4TProcessor.decode`] for
    more information.

    Args:
        feature_extractor ([`SeamlessM4TFeatureExtractor`]):
            The audio processor is a required input.
        tokenizer ([`SeamlessM4TTokenizerFast`]):
            The tokenizer is a required input.
    """

    feature_extractor_class = "SeamlessM4TFeatureExtractor"
    tokenizer_class = ("SeamlessM4TTokenizer", "SeamlessM4TTokenizerFast")

    def __init__(self, feature_extractor, tokenizer):
        super().__init__(feature_extractor, tokenizer)
    def __call__(self, text=None, audios=None, src_lang=None, tgt_lang=None, **kwargs):
        """
        Main method to prepare one or several sequence(s) and audio(s) for the model. This method forwards the `text`
        and `kwargs` arguments to SeamlessM4TTokenizerFast's [`~SeamlessM4TTokenizerFast.__call__`] if `text` is not
        `None` to encode the text. To prepare the audio(s), this method forwards the `audios` and `kwargs` arguments to
        SeamlessM4TFeatureExtractor's [`~SeamlessM4TFeatureExtractor.__call__`] if `audios` is not `None`. Please refer
        to the docstring of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            audios (`np.ndarray`, `torch.Tensor`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The audio or batch of audios to be prepared. Each audio can be a NumPy array or a PyTorch tensor. In
                case of a NumPy array/PyTorch tensor, each audio should be of shape (C, T), where C is the number of
                channels, and T the sample length of the audio.
            src_lang (`str`, *optional*):
                The language code of the input texts/audios. If not specified, the last `src_lang` specified will be
                used.
            tgt_lang (`str`, *optional*):
                The code of the target language. If not specified, the last `tgt_lang` specified will be used.
            kwargs (*optional*):
                Remaining dictionary of keyword arguments that will be passed to the feature extractor and/or the
                tokenizer.
        Returns:
            [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **input_features** -- Audio input features to be fed to a model. Returned when `audios` is not `None`.
        """
        sampling_rate = kwargs.pop("sampling_rate", None)

        if text is None and audios is None:
            raise ValueError("You have to specify either text or audios. Both cannot be none.")
        elif text is not None and audios is not None:
            raise ValueError(
                "Text and audios are mutually exclusive when passed to `SeamlessM4T`. Specify one or the other."
            )
        elif text is not None:
            if tgt_lang is not None:
                self.tokenizer.tgt_lang = tgt_lang
            if src_lang is not None:
                self.tokenizer.src_lang = src_lang
            encoding = self.tokenizer(text, **kwargs)

            return encoding

        else:
            encoding = self.feature_extractor(audios, sampling_rate=sampling_rate, **kwargs)
            return encoding
    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to SeamlessM4TTokenizerFast's [`~PreTrainedTokenizer.batch_decode`].
        Please refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to SeamlessM4TTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        feature_extractor_input_names = self.feature_extractor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + feature_extractor_input_names))
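The `dict.fromkeys` idiom in `model_input_names` merges the two name lists while preserving first-seen order and dropping duplicates, which a plain `set()` union would not guarantee. With the literal lists from the classes above:

```python
tokenizer_input_names = ["input_ids", "attention_mask"]
feature_extractor_input_names = ["input_features", "attention_mask"]

# dict.fromkeys keeps first-seen order and ignores repeats, unlike set()
merged = list(dict.fromkeys(tokenizer_input_names + feature_extractor_input_names))
```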
@@ -0,0 +1,565 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for SeamlessM4T."""
import os
from shutil import copyfile
from typing import Any, Dict, List, Optional, Tuple, Union

import sentencepiece as spm

from ...convert_slow_tokenizer import import_protobuf
from ...tokenization_utils import (
    BatchEncoding,
    PreTokenizedInput,
    PreTrainedTokenizer,
    TextInput,
)
from ...tokenization_utils_base import AddedToken
from ...utils import PaddingStrategy, logging


logger = logging.get_logger(__name__)
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "facebook/hf-seamless-m4t-medium": (
            "https://huggingface.co/facebook/hf-seamless-m4t-medium/blob/main/sentencepiece.bpe.model"
        ),
    }
}

SPIECE_UNDERLINE = "▁"


VOCAB_FILES_NAMES = {"vocab_file": "sentencepiece.bpe.model"}


PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "facebook/hf-seamless-m4t-medium": 2048,
}

class SeamlessM4TTokenizer(PreTrainedTokenizer):
    """
    Construct a SeamlessM4T tokenizer.

    Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on
    [SentencePiece](https://github.com/google/sentencepiece).

    The tokenization method is `<language code> <tokens> <eos>` for source language documents, and `<eos> <language
    code> <tokens> <eos>` for target language documents.

    Examples:

    ```python
    >>> from transformers import SeamlessM4TTokenizer

    >>> tokenizer = SeamlessM4TTokenizer.from_pretrained(
    ...     "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
    ... )
    >>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
    >>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
    >>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
    ```

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier
            token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the beginning of
            sequence. The token used is the `cls_token`.

            </Tip>

        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
            The token used is the `sep_token`.

            </Tip>

        sep_token (`str`, *optional*, defaults to `"</s>"`):
            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
            sequence classification or for a text and a question for question answering. It is also used as the last
            token of a sequence built with special tokens.
        cls_token (`str`, *optional*, defaults to `"<s>"`):
            The classifier token which is used when doing sequence classification (classification of the whole sequence
            instead of per-token classification). It is the first token of the sequence when built with special tokens.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
        tokenizer_file (`str`, *optional*):
            The path to a tokenizer file to use instead of the vocab file.
        src_lang (`str`, *optional*, defaults to `"eng"`):
            The language to use as source language for translation.
        tgt_lang (`str`, *optional*, defaults to `"fra"`):
            The language to use as target language for translation.
        sp_model_kwargs (`Dict[str, Any]`, *optional*):
            Additional keyword arguments to pass to the `SentencePieceProcessor` initialization.
        additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
            A tuple or a list of additional special tokens. Can be used to specify the list of languages that will be
            supported by the tokenizer.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    model_input_names = ["input_ids", "attention_mask"]

    prefix_tokens: List[int] = []
    suffix_tokens: List[int] = []
    def __init__(
        self,
        vocab_file,
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        tokenizer_file=None,
        src_lang="eng",
        tgt_lang="fra",
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        additional_special_tokens=None,
        **kwargs,
    ):
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        # Add this unused argument to keep some important Copied from statements
        self.legacy = False
        self.vocab_file = vocab_file

        self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))

        # Vocab    |    0    |    1    |   2    |    3    |  4   |  5   |  6   |  7   |  8   |  9
        # -------- | ------- | ------- | ------ | ------- | ---- | ---- | ---- | ---- | ---- | ----
        # spm      | '<unk>' | '<s>'   | '</s>' | 'an'    | 'en' | '▁d' | 'er' | 'in' | '▁s' | '▁a'
        # fairseq  | '<pad>' | '<unk>' | '<s>'  | '</s>'  | 'an' | 'en' | '▁d' | 'er' | 'in' | '▁s'

        # Mimic fairseq token-to-id alignment for the first 4 tokens
        self._added_tokens_decoder = {
            0: AddedToken(pad_token, special=True) if isinstance(pad_token, str) else pad_token,
            1: AddedToken(unk_token, special=True) if isinstance(unk_token, str) else unk_token,
            2: AddedToken(bos_token, special=True) if isinstance(bos_token, str) else bos_token,
            3: AddedToken(eos_token, special=True) if isinstance(eos_token, str) else eos_token,
        }

        # The first "real" token "an" has position 4 in the original fairseq vocab and position 3 in the spm vocab
        self.fairseq_offset = 1

        self.sp_model_size = len(self.sp_model)

        self._src_lang = f"__{src_lang}__" if "__" not in src_lang else src_lang
        self._tgt_lang = f"__{tgt_lang}__" if "__" not in tgt_lang else tgt_lang

        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            cls_token=cls_token,
            pad_token=pad_token,
            tokenizer_file=tokenizer_file,
            src_lang=src_lang,
            tgt_lang=tgt_lang,
            additional_special_tokens=additional_special_tokens,
            sp_model_kwargs=self.sp_model_kwargs,
            **kwargs,
        )

        self.set_src_lang_special_tokens(self._src_lang)
        self.set_tgt_lang_special_tokens(self._tgt_lang)
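The vocab-alignment comment in `__init__` can be made concrete with a toy sketch. The vocabularies below mirror the comment table only, not the real model files:

```python
# Toy spm vocabulary from the comment table; ids are list positions.
spm_vocab = ["<unk>", "<s>", "</s>", "an", "en", "▁d", "er", "in", "▁s", "▁a"]
# fairseq reserves id 0 for <pad>, shifting everything else by one.
fairseq_specials = {0: "<pad>", 1: "<unk>", 2: "<s>", 3: "</s>"}
fairseq_offset = 1


def fairseq_token(fairseq_id: int) -> str:
    # The first four ids are remapped special tokens; regular tokens are
    # looked up in the spm vocab after undoing the +1 offset.
    if fairseq_id in fairseq_specials:
        return fairseq_specials[fairseq_id]
    return spm_vocab[fairseq_id - fairseq_offset]
```

This is the same alignment the tokenizer achieves via `_added_tokens_decoder` for ids 0-3 and `fairseq_offset` for everything else.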
    # Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.__getstate__
    def __getstate__(self):
        state = self.__dict__.copy()
        state["sp_model"] = None
        state["sp_model_proto"] = self.sp_model.serialized_model_proto()
        return state

    # Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.__setstate__
    def __setstate__(self, d):
        self.__dict__ = d

        # for backward compatibility
        if not hasattr(self, "sp_model_kwargs"):
            self.sp_model_kwargs = {}

        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.LoadFromSerializedProto(self.sp_model_proto)

    @property
    def vocab_size(self):
        return len(self.sp_model)
    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
        text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        text_pair_target: Optional[
            Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
        ] = None,
        padding: Union[bool, str, PaddingStrategy] = True,
        pad_to_multiple_of: Optional[int] = 2,
        src_lang: Optional[str] = None,
        tgt_lang: Optional[str] = None,
        **kwargs,
    ):
        """
        Args:
            text (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            text_pair (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            text_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
                list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
                you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            text_pair_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
                list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
                you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:

                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                  sequence is provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                  acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
                  lengths).
            pad_to_multiple_of (`int`, *optional*):
                If set will pad the sequence to a multiple of the provided value.

                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            src_lang (`str`, *optional*):
                A string representing the source language. If not specified, the last `src_lang` specified (either
                during initialization or when calling this tokenizer) will be used.
            tgt_lang (`str`, *optional*):
                A string representing the target language. If not specified, the last `tgt_lang` specified (either
                during initialization or when calling this tokenizer) will be used.
            kwargs (*optional*):
                Remaining dictionary of keyword arguments that will be passed to [`PreTrainedTokenizer.__call__`].
        """
        if src_lang is not None:
            self.src_lang = src_lang
        if tgt_lang is not None:
            self.tgt_lang = tgt_lang

        output = super().__call__(
            text=text,
            text_pair=text_pair,
            text_target=text_target,
            text_pair_target=text_pair_target,
            padding=padding,
            pad_to_multiple_of=pad_to_multiple_of,
            **kwargs,
        )

        return BatchEncoding(output, tensor_type=kwargs.get("return_tensors"))
    @property
    # Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.src_lang
    def src_lang(self) -> str:
        return self._src_lang

    @src_lang.setter
    def src_lang(self, new_src_lang: str) -> None:
        if "__" not in new_src_lang:
            self._src_lang = f"__{new_src_lang}__"
        else:
            self._src_lang = new_src_lang
        self.set_src_lang_special_tokens(self._src_lang)

    @property
    def tgt_lang(self) -> str:
        return self._tgt_lang

    @tgt_lang.setter
    def tgt_lang(self, new_tgt_lang: str) -> None:
        if "__" not in new_tgt_lang:
            self._tgt_lang = f"__{new_tgt_lang}__"
        else:
            self._tgt_lang = new_tgt_lang
        self.set_tgt_lang_special_tokens(self._tgt_lang)
    # Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.get_special_tokens_mask
    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """

        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        prefix_ones = [1] * len(self.prefix_tokens)
        suffix_ones = [1] * len(self.suffix_tokens)
        if token_ids_1 is None:
            return prefix_ones + ([0] * len(token_ids_0)) + suffix_ones
        return prefix_ones + ([0] * len(token_ids_0)) + ([0] * len(token_ids_1)) + suffix_ones
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.build_inputs_with_special_tokens
|
||||
def build_inputs_with_special_tokens(
|
||||
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
|
||||
) -> List[int]:
|
||||
"""
|
||||
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
|
||||
adding special tokens. An NLLB sequence has the following format, where `X` represents the sequence:
|
||||
|
||||
- `input_ids` (for encoder) `X [eos, src_lang_code]`
|
||||
- `decoder_input_ids`: (for decoder) `X [eos, tgt_lang_code]`
|
||||
|
||||
BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a
|
||||
separator.
|
||||
|
||||
Args:
|
||||
token_ids_0 (`List[int]`):
|
||||
List of IDs to which the special tokens will be added.
|
||||
token_ids_1 (`List[int]`, *optional*):
|
||||
Optional second list of IDs for sequence pairs.
|
||||
|
||||
Returns:
|
||||
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
|
||||
"""
|
||||
if token_ids_1 is None:
|
||||
return self.prefix_tokens + token_ids_0 + self.suffix_tokens
|
||||
# We don't expect to process pairs, but leave the pair logic for API consistency
|
||||
return self.prefix_tokens + token_ids_0 + token_ids_1 + self.suffix_tokens
|
||||
|
||||
# Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.create_token_type_ids_from_sequences
|
||||
def create_token_type_ids_from_sequences(
|
||||
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
|
||||
) -> List[int]:
|
||||
"""
|
||||
Create a mask from the two sequences passed to be used in a sequence-pair classification task. nllb does not
|
||||
make use of token type ids, therefore a list of zeros is returned.
|
||||
|
||||
Args:
|
||||
token_ids_0 (`List[int]`):
|
||||
List of IDs.
|
||||
token_ids_1 (`List[int]`, *optional*):
|
||||
Optional second list of IDs for sequence pairs.
|
||||
|
||||
Returns:
|
||||
`List[int]`: List of zeros.
|
||||
|
||||
"""
|
||||
|
||||
sep = [self.sep_token_id]
|
||||
cls = [self.cls_token_id]
|
||||
|
||||
if token_ids_1 is None:
|
||||
return len(cls + token_ids_0 + sep) * [0]
|
||||
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
|
||||
|
||||
def _build_translation_inputs(
|
||||
self, raw_inputs, return_tensors: str, src_lang: Optional[str], tgt_lang: Optional[str], **extra_kwargs
|
||||
):
|
||||
"""Used by translation pipeline, to prepare inputs for the generate function"""
|
||||
if src_lang is None or tgt_lang is None:
|
||||
raise ValueError("Translation requires a `src_lang` and a `tgt_lang` for this model.")
|
||||
self.src_lang = src_lang
|
||||
inputs = self(raw_inputs, add_special_tokens=True, return_tensors=return_tensors, **extra_kwargs)
|
||||
if "__" not in tgt_lang:
|
||||
tgt_lang = f"__{tgt_lang}__"
|
||||
tgt_lang_id = self.convert_tokens_to_ids(tgt_lang)
|
||||
inputs["forced_bos_token_id"] = tgt_lang_id
|
||||
return inputs
|
||||
|
||||
def get_vocab(self):
|
||||
vocab = {
|
||||
self.convert_ids_to_tokens(i): i for i in range(self.fairseq_offset, self.vocab_size + self.fairseq_offset)
|
||||
}
|
||||
vocab.update(self.added_tokens_encoder)
|
||||
return vocab
|
||||
|
||||
@property
|
||||
def unk_token_length(self):
|
||||
return len(self.sp_model.encode(str(self.unk_token)))
|
||||
|
||||
# Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.get_spm_processor
|
||||
def get_spm_processor(self, from_slow=False):
|
||||
tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
|
||||
if self.legacy or from_slow: # no dependency on protobuf
|
||||
tokenizer.Load(self.vocab_file)
|
||||
return tokenizer
|
||||
|
||||
with open(self.vocab_file, "rb") as f:
|
||||
sp_model = f.read()
|
||||
model_pb2 = import_protobuf(f"The new behaviour of {self.__class__.__name__} (with `self.legacy = False`)")
|
||||
model = model_pb2.ModelProto.FromString(sp_model)
|
||||
normalizer_spec = model_pb2.NormalizerSpec()
|
||||
normalizer_spec.add_dummy_prefix = False
|
||||
model.normalizer_spec.MergeFrom(normalizer_spec)
|
||||
sp_model = model.SerializeToString()
|
||||
tokenizer.LoadFromSerializedProto(sp_model)
|
||||
return tokenizer
|
||||
|
||||
# Copied from transformers.models.t5.tokenization_t5.T5Tokenizer.tokenize
|
||||
def tokenize(self, text: "TextInput", add_special_tokens=False, **kwargs) -> List[str]:
|
||||
"""
|
||||
Converts a string to a list of tokens. If `self.legacy` is set to `False`, a prefix token is added unless the
|
||||
first token is special.
|
||||
"""
|
||||
if self.legacy or len(text) == 0:
|
||||
return super().tokenize(text, **kwargs)
|
||||
|
||||
tokens = super().tokenize(SPIECE_UNDERLINE + text.replace(SPIECE_UNDERLINE, " "), **kwargs)
|
||||
|
||||
if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
|
||||
tokens = tokens[1:]
|
||||
return tokens
|
||||
|
||||
# Copied from transformers.models.t5.tokenization_t5.T5Tokenizer._tokenize
|
||||
def _tokenize(self, text, **kwargs):
|
||||
"""
|
||||
Returns a tokenized string.
|
||||
|
||||
We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
|
||||
SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
|
||||
`['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
|
||||
`unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
|
||||
`self.tokenizer.sp_model.encode("<unk> Hey", out_type = str)[4:]`.
|
||||
"""
|
||||
tokens = self.sp_model.encode(text, out_type=str)
|
||||
if self.legacy or not text.startswith((SPIECE_UNDERLINE, " ")):
|
||||
return tokens
|
||||
|
||||
# 1. Encode string + prefix ex: "<unk> Hey"
|
||||
tokens = self.sp_model.encode(self.unk_token + text, out_type=str)
|
||||
# 2. Remove self.unk_token from ['<','unk','>', '▁Hey']
|
||||
return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens
|
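The `unk_token`-prefix trick used by `_tokenize` can be illustrated with a toy stand-in for the SentencePiece model. Everything below is invented for illustration: the whitespace splitter is not how a real `sp_model.encode` behaves (it produces subword pieces, and the real `unk_token_length` is 4, not 1), but the slicing logic is the same.

```python
# Toy stand-in for a SentencePiece model with add_dummy_prefix disabled.
# A real model emits subword pieces; this whitespace splitter is illustrative only.
SPIECE_UNDERLINE = "▁"


def fake_spm_encode(text):
    return [SPIECE_UNDERLINE + piece for piece in text.split()]


unk_token = "<unk>"
# Number of pieces the unk token encodes to (1 for this toy splitter).
unk_token_length = len(fake_spm_encode(unk_token))


def tokenize_with_prefix_trick(text):
    # Encode the unk token followed by the text, then drop the pieces that
    # belong to the unk token, keeping the first real token's SPIECE_UNDERLINE.
    tokens = fake_spm_encode(unk_token + " " + text)
    return tokens[unk_token_length:] if len(tokens) >= unk_token_length else tokens
```

The real `_tokenize` concatenates `self.unk_token + text` without a separator; the extra space here only accommodates the toy splitter.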
    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        spm_id = self.sp_model.PieceToId(token)

        # Need to return unknown token if the SP model returned 0
        return spm_id + self.fairseq_offset if spm_id else self.unk_token_id

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.sp_model.IdToPiece(index - self.fairseq_offset)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (strings for sub-words) in a single string."""
        if tokens[0].startswith(SPIECE_UNDERLINE):
            tokens[0] = tokens[0][1:]

        out_string = "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()
        return out_string

    # Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.save_vocabulary
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
            copyfile(self.vocab_file, out_vocab_file)
        elif not os.path.isfile(self.vocab_file):
            with open(out_vocab_file, "wb") as fi:
                content_spiece_model = self.sp_model.serialized_model_proto()
                fi.write(content_spiece_model)

        return (out_vocab_file,)

    # Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.prepare_seq2seq_batch with eng_Latn->eng, fra_Latn->fra
    def prepare_seq2seq_batch(
        self,
        src_texts: List[str],
        src_lang: str = "eng",
        tgt_texts: Optional[List[str]] = None,
        tgt_lang: str = "fra",
        **kwargs,
    ) -> BatchEncoding:
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang
        return super().prepare_seq2seq_batch(src_texts, tgt_texts, **kwargs)

    # Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer._switch_to_input_mode
    def _switch_to_input_mode(self):
        return self.set_src_lang_special_tokens(self.src_lang)

    # Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer._switch_to_target_mode
    def _switch_to_target_mode(self):
        return self.set_tgt_lang_special_tokens(self.tgt_lang)

    def set_src_lang_special_tokens(self, src_lang) -> None:
        """Reset the special tokens to the source lang setting.
        Prefix=[src_lang_code], suffix=[eos]
        """
        self.cur_lang_code = self.convert_tokens_to_ids(src_lang)
        self.init_kwargs["src_lang"] = src_lang

        if self.cur_lang_code == self.unk_token_id:
            logger.warning_once(
                f"`src_lang={src_lang}` has not been found in the vocabulary. Behaviour will probably be unexpected because the language token id will be replaced by the unknown token id."
            )

        self.prefix_tokens = [self.cur_lang_code]
        self.suffix_tokens = [self.eos_token_id]

    # https://github.com/facebookresearch/fairseq2/blob/c53f18e6be6b8b46b722f2249b8397b7eccd7ad3/src/fairseq2/models/nllb/tokenizer.py#L112-L116
    def set_tgt_lang_special_tokens(self, lang: str) -> None:
        """Reset the special tokens to the target lang setting.
        Prefix=[eos, tgt_lang_code] and suffix=[eos].
        """
        self.cur_lang_code = self.convert_tokens_to_ids(lang)
        self.init_kwargs["tgt_lang"] = lang

        if self.cur_lang_code == self.unk_token_id:
            logger.warning_once(
                f"`tgt_lang={lang}` has not been found in the vocabulary. Behaviour will probably be unexpected because the language token id will be replaced by the unknown token id."
            )

        self.prefix_tokens = [self.eos_token_id, self.cur_lang_code]
        self.suffix_tokens = [self.eos_token_id]
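The prefix/suffix layouts installed by `set_src_lang_special_tokens` and `set_tgt_lang_special_tokens` can be sketched at the id level as follows. All token ids below are invented for illustration; real ids come from the SentencePiece vocabulary.

```python
# Hypothetical token ids, for illustration only.
eos_id = 3
src_lang_id = 100  # e.g. the id of "__eng__"
tgt_lang_id = 101  # e.g. the id of "__fra__"
token_ids = [10, 11, 12]  # the encoded sequence X

# Source side: prefix=[src_lang_code], suffix=[eos] -> [src_lang_code] X [eos]
encoder_input_ids = [src_lang_id] + token_ids + [eos_id]

# Target side: prefix=[eos, tgt_lang_code], suffix=[eos] -> [eos, tgt_lang_code] X [eos]
decoder_input_ids = [eos_id, tgt_lang_id] + token_ids + [eos_id]
```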
@@ -0,0 +1,459 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fast Tokenization class for SeamlessM4T."""
import os
from shutil import copyfile
from typing import List, Optional, Tuple, Union

from tokenizers import processors

from ...tokenization_utils import (
    BatchEncoding,
    PreTokenizedInput,
    TextInput,
)
from ...tokenization_utils_fast import PreTrainedTokenizerFast
from ...utils import PaddingStrategy, logging
from .tokenization_seamless_m4t import (
    SeamlessM4TTokenizer,
)


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "sentencepiece.bpe.model", "tokenizer_file": "tokenizer.json"}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "facebook/hf-seamless-m4t-medium": "https://huggingface.co/facebook/hf-seamless-m4t-medium/resolve/main/vocab.txt",
    },
    "tokenizer_file": {
        "facebook/hf-seamless-m4t-medium": "https://huggingface.co/facebook/hf-seamless-m4t-medium/resolve/main/tokenizer.json",
    },
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "facebook/hf-seamless-m4t-medium": 2048,
}


class SeamlessM4TTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a "fast" SeamlessM4T tokenizer (backed by HuggingFace's *tokenizers* library). Based on
    [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).

    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
    refer to this superclass for more information regarding those methods.

    The tokenization method is `<language code> <tokens> <eos>` for source language documents, and `<eos> <language
    code> <tokens> <eos>` for target language documents.

    Examples:

    ```python
    >>> from transformers import SeamlessM4TTokenizerFast

    >>> tokenizer = SeamlessM4TTokenizerFast.from_pretrained(
    ...     "facebook/hf-seamless-m4t-medium", src_lang="eng", tgt_lang="fra"
    ... )
    >>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
    >>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
    >>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
    ```

    Args:
        vocab_file (`str`, *optional*):
            Path to the vocabulary file.
        tokenizer_file (`str`, *optional*):
            The path to a tokenizer file to use instead of the vocab file.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier
            token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the beginning of
            sequence. The token used is the `cls_token`.

            </Tip>

        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
            The token used is the `sep_token`.

            </Tip>

        sep_token (`str`, *optional*, defaults to `"</s>"`):
            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
            sequence classification or for a text and a question for question answering. It is also used as the last
            token of a sequence built with special tokens.
        cls_token (`str`, *optional*, defaults to `"<s>"`):
            The classifier token which is used when doing sequence classification (classification of the whole sequence
            instead of per-token classification). It is the first token of the sequence when built with special tokens.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
        src_lang (`str`, *optional*, defaults to `"eng"`):
            The language to use as source language for translation.
        tgt_lang (`str`, *optional*, defaults to `"fra"`):
            The language to use as target language for translation.
        additional_special_tokens (tuple or list of `str` or `tokenizers.AddedToken`, *optional*):
            A tuple or a list of additional special tokens.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    slow_tokenizer_class = SeamlessM4TTokenizer
    model_input_names = ["input_ids", "attention_mask"]

    prefix_tokens: List[int] = []
    suffix_tokens: List[int] = []

    def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        src_lang="eng",
        tgt_lang="fra",
        additional_special_tokens=None,
        **kwargs,
    ):
        super().__init__(
            vocab_file=vocab_file,
            tokenizer_file=tokenizer_file,
            bos_token=bos_token,
            eos_token=eos_token,
            sep_token=sep_token,
            cls_token=cls_token,
            unk_token=unk_token,
            pad_token=pad_token,
            src_lang=src_lang,
            tgt_lang=tgt_lang,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )

        self.vocab_file = vocab_file
        self._src_lang = f"__{src_lang}__" if "__" not in src_lang else src_lang
        self._tgt_lang = f"__{tgt_lang}__" if "__" not in tgt_lang else tgt_lang
        self.set_src_lang_special_tokens(self._src_lang)
        self.set_tgt_lang_special_tokens(self._tgt_lang)

    @property
    def can_save_slow_tokenizer(self) -> bool:
        return os.path.isfile(self.vocab_file) if self.vocab_file else False

    @property
    # Copied from transformers.models.nllb.tokenization_nllb.NllbTokenizer.src_lang
    def src_lang(self) -> str:
        return self._src_lang

    @src_lang.setter
    def src_lang(self, new_src_lang: str) -> None:
        if "__" not in new_src_lang:
            self._src_lang = f"__{new_src_lang}__"
        else:
            self._src_lang = new_src_lang
        self.set_src_lang_special_tokens(self._src_lang)

    @property
    def tgt_lang(self) -> str:
        return self._tgt_lang

    @tgt_lang.setter
    def tgt_lang(self, new_tgt_lang: str) -> None:
        if "__" not in new_tgt_lang:
            self._tgt_lang = f"__{new_tgt_lang}__"
        else:
            self._tgt_lang = new_tgt_lang
        self.set_tgt_lang_special_tokens(self._tgt_lang)

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. The special tokens depend on calling set_lang.

        A SeamlessM4T sequence has the following format, where `X` represents the sequence:

        - `input_ids` (for encoder) `[src_lang_code] X [eos]`
        - `decoder_input_ids`: (for decoder) `[eos, tgt_lang_code] X [eos]`

        BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a
        separator.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return self.prefix_tokens + token_ids_0 + self.suffix_tokens
        # We don't expect to process pairs, but leave the pair logic for API consistency
        return self.prefix_tokens + token_ids_0 + token_ids_1 + self.suffix_tokens

    # Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast.create_token_type_ids_from_sequences
    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. nllb does not
        make use of token type ids, therefore a list of zeros is returned.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of zeros.
        """

        sep = [self.sep_token_id]
        cls = [self.cls_token_id]

        if token_ids_1 is None:
            return len(cls + token_ids_0 + sep) * [0]
        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

    def _build_translation_inputs(
        self, raw_inputs, return_tensors: str, src_lang: Optional[str], tgt_lang: Optional[str], **extra_kwargs
    ):
        """Used by translation pipeline, to prepare inputs for the generate function"""
        if src_lang is None or tgt_lang is None:
            raise ValueError("Translation requires a `src_lang` and a `tgt_lang` for this model.")
        self.src_lang = src_lang
        inputs = self(raw_inputs, add_special_tokens=True, return_tensors=return_tensors, **extra_kwargs)
        if "__" not in tgt_lang:
            tgt_lang = f"__{tgt_lang}__"
        tgt_lang_id = self.convert_tokens_to_ids(tgt_lang)
        inputs["forced_bos_token_id"] = tgt_lang_id
        return inputs
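`_build_translation_inputs` normalizes a bare language code to its `__lang__` form before looking up the forced BOS token id. A minimal sketch of that normalization, with a hypothetical vocabulary mapping:

```python
# Hypothetical language-token vocabulary, for illustration only.
lang_vocab = {"__fra__": 101}


def normalize_lang(tgt_lang):
    # Mirror the "__" check performed before convert_tokens_to_ids.
    return tgt_lang if "__" in tgt_lang else f"__{tgt_lang}__"


forced_bos_token_id = lang_vocab[normalize_lang("fra")]
```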
    # Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast.prepare_seq2seq_batch with "fra_Latn"->"fra", "eng_Latn"->"eng"
    def prepare_seq2seq_batch(
        self,
        src_texts: List[str],
        src_lang: str = "eng",
        tgt_texts: Optional[List[str]] = None,
        tgt_lang: str = "fra",
        **kwargs,
    ) -> BatchEncoding:
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang
        return super().prepare_seq2seq_batch(src_texts, tgt_texts, **kwargs)

    # Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast._switch_to_input_mode
    def _switch_to_input_mode(self):
        return self.set_src_lang_special_tokens(self.src_lang)

    # Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast._switch_to_target_mode
    def _switch_to_target_mode(self):
        return self.set_tgt_lang_special_tokens(self.tgt_lang)

    def set_src_lang_special_tokens(self, src_lang) -> None:
        """Reset the special tokens to the source lang setting.
        Prefix=[src_lang_code], suffix=[eos]
        """
        self.cur_lang_code = self.convert_tokens_to_ids(src_lang)

        if self.cur_lang_code == self.unk_token_id:
            logger.warning_once(
                f"`src_lang={src_lang}` has not been found in the vocabulary. Behaviour will probably be unexpected because the language token id will be replaced by the unknown token id."
            )

        self.init_kwargs["src_lang"] = src_lang

        self.prefix_tokens = [self.cur_lang_code]
        self.suffix_tokens = [self.eos_token_id]

        prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
        suffix_tokens_str = self.convert_ids_to_tokens(self.suffix_tokens)

        self._tokenizer.post_processor = processors.TemplateProcessing(
            single=prefix_tokens_str + ["$A"] + suffix_tokens_str,
            pair=prefix_tokens_str + ["$A", "$B"] + suffix_tokens_str,
            special_tokens=list(zip(prefix_tokens_str + suffix_tokens_str, self.prefix_tokens + self.suffix_tokens)),
        )

    def set_tgt_lang_special_tokens(self, lang: str) -> None:
        """Reset the special tokens to the target lang setting.
        Prefix=[eos, tgt_lang_code] and suffix=[eos].
        """
        self.cur_lang_code = self.convert_tokens_to_ids(lang)

        if self.cur_lang_code == self.unk_token_id:
            logger.warning_once(
                f"`tgt_lang={lang}` has not been found in the vocabulary. Behaviour will probably be unexpected because the language token id will be replaced by the unknown token id."
            )

        self.init_kwargs["tgt_lang"] = lang

        self.prefix_tokens = [self.eos_token_id, self.cur_lang_code]
        self.suffix_tokens = [self.eos_token_id]

        prefix_tokens_str = self.convert_ids_to_tokens(self.prefix_tokens)
        suffix_tokens_str = self.convert_ids_to_tokens(self.suffix_tokens)

        self._tokenizer.post_processor = processors.TemplateProcessing(
            single=prefix_tokens_str + ["$A"] + suffix_tokens_str,
            pair=prefix_tokens_str + ["$A", "$B"] + suffix_tokens_str,
            special_tokens=list(zip(prefix_tokens_str + suffix_tokens_str, self.prefix_tokens + self.suffix_tokens)),
        )

    # Copied from transformers.models.nllb.tokenization_nllb_fast.NllbTokenizerFast.save_vocabulary
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if not self.can_save_slow_tokenizer:
            raise ValueError(
                "Your fast tokenizer does not have the necessary information to save the vocabulary for a slow "
                "tokenizer."
            )

        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory.")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
            copyfile(self.vocab_file, out_vocab_file)

        return (out_vocab_file,)

    @classmethod
    def _from_pretrained(
        cls,
        resolved_vocab_files,
        pretrained_model_name_or_path,
        init_configuration,
        *init_inputs,
        token=None,
        cache_dir=None,
        local_files_only=False,
        _commit_hash=None,
        _is_local=False,
        **kwargs,
    ):
        tokenizer = super()._from_pretrained(
            resolved_vocab_files,
            pretrained_model_name_or_path,
            init_configuration,
            *init_inputs,
            token=token,
            cache_dir=cache_dir,
            local_files_only=local_files_only,
            _commit_hash=_commit_hash,
            _is_local=_is_local,
            **kwargs,
        )

        # ensure the language special tokens are also set after from_pretrained
        tokenizer.set_src_lang_special_tokens(tokenizer._src_lang)
        tokenizer.set_tgt_lang_special_tokens(tokenizer._tgt_lang)

        return tokenizer

    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
        text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        text_pair_target: Optional[
            Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
        ] = None,
        padding: Union[bool, str, PaddingStrategy] = True,
        pad_to_multiple_of: Optional[int] = 2,
        src_lang: Optional[str] = None,
        tgt_lang: Optional[str] = None,
        **kwargs,
    ):
        """
        Args:
            text (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            text_pair (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            text_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
                list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
                you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            text_pair_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
                The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
                list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
                you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:

                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                  sequence is provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                  acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different
                  lengths).
            pad_to_multiple_of (`int`, *optional*):
                If set, will pad the sequence to a multiple of the provided value.

                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            src_lang (`str`, *optional*):
                A string representing the source language. If not specified, the last `src_lang` specified (either
                during initialization or when calling this tokenizer) will be used.
            tgt_lang (`str`, *optional*):
                A string representing the target language. If not specified, the last `tgt_lang` specified (either
                during initialization or when calling this tokenizer) will be used.
            kwargs (*optional*):
                Remaining dictionary of keyword arguments that will be passed to [`PreTrainedTokenizerFast.__call__`].
        """
        if src_lang is not None:
            self.src_lang = src_lang
        if tgt_lang is not None:
            self.tgt_lang = tgt_lang

        output = super().__call__(
            text=text,
            text_pair=text_pair,
            text_target=text_target,
            text_pair_target=text_pair_target,
            padding=padding,
            pad_to_multiple_of=pad_to_multiple_of,
            **kwargs,
        )

        return output
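At the id level, the `TemplateProcessing` post-processor installed by the `set_*_lang_special_tokens` methods behaves like the pure-Python sketch below (the real work happens inside the Rust tokenizer; this is only an analogue of the `single` and `pair` templates):

```python
def apply_template(prefix_tokens, suffix_tokens, seq_a, seq_b=None):
    # single: prefix + $A + suffix; pair: prefix + $A + $B + suffix
    if seq_b is None:
        return prefix_tokens + seq_a + suffix_tokens
    return prefix_tokens + seq_a + seq_b + suffix_tokens
```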
@@ -6933,6 +6933,79 @@ class SamPreTrainedModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST = None


class SeamlessM4TCodeHifiGan(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class SeamlessM4TForSpeechToSpeech(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class SeamlessM4TForSpeechToText(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class SeamlessM4TForTextToSpeech(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class SeamlessM4TForTextToText(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class SeamlessM4THifiGan(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class SeamlessM4TModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class SeamlessM4TPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class SeamlessM4TTextToUnitForConditionalGeneration(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class SeamlessM4TTextToUnitModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
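The dummy classes above all follow the same pattern: the class is importable even when `torch` is missing, but any use raises an informative error. A self-contained sketch of that `DummyObject`/`requires_backends` mechanism (simplified; the real `requires_backends` checks whether the backend is actually installed instead of always raising):

```python
# Simplified sketch of the dummy-object pattern used above. Here
# requires_backends unconditionally raises, simulating a missing torch install;
# the real helper only raises when the backend is genuinely unavailable.
def requires_backends(obj, backends):
    name = obj.__name__ if isinstance(obj, type) else obj.__class__.__name__
    raise ImportError(f"{name} requires the following backends: {', '.join(backends)}")


class DummyObject(type):
    """Metaclass so that even attribute access on the dummy class fails cleanly."""

    def __getattribute__(cls, key):
        if key.startswith("_"):
            return super().__getattribute__(key)
        requires_backends(cls, cls._backends)


class SeamlessM4TModelDummy(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


# Importing the class works; instantiating it raises a helpful ImportError.
try:
    SeamlessM4TModelDummy()
except ImportError as e:
    message = str(e)
```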
@@ -177,6 +177,13 @@ class RemBertTokenizer(metaclass=DummyObject):
    requires_backends(self, ["sentencepiece"])


class SeamlessM4TTokenizer(metaclass=DummyObject):
    _backends = ["sentencepiece"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["sentencepiece"])


class Speech2TextTokenizer(metaclass=DummyObject):
    _backends = ["sentencepiece"]
@@ -366,6 +366,13 @@ class RoFormerTokenizerFast(metaclass=DummyObject):
    requires_backends(self, ["tokenizers"])


class SeamlessM4TTokenizerFast(metaclass=DummyObject):
    _backends = ["tokenizers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["tokenizers"])


class SplinterTokenizerFast(metaclass=DummyObject):
    _backends = ["tokenizers"]
@@ -1588,7 +1588,7 @@ class GenerationTesterMixin:
        # may fix in the future: the following models fail with assisted decoding, and need model-specific fixes
        if any(
            model_name in model_class.__name__.lower()
-            for model_name in ["bigbirdpegasus", "led", "mega", "speech2text", "git", "prophetnet"]
+            for model_name in ["bigbirdpegasus", "led", "mega", "speech2text", "git", "prophetnet", "seamlessm4t"]
        ):
            return
@@ -0,0 +1,223 @@
# coding=utf-8
# Copyright 2023 HuggingFace Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import itertools
import os
import random
import tempfile
import unittest

import numpy as np
from datasets import load_dataset

from transformers import SeamlessM4TFeatureExtractor, is_speech_available
from transformers.testing_utils import check_json_file_has_correct_format, require_torch
from transformers.utils.import_utils import is_torch_available

from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin


if is_torch_available():
    import torch

global_rng = random.Random()


# Copied from tests.models.whisper.test_feature_extraction_whisper.floats_list
def floats_list(shape, scale=1.0, rng=None, name=None):
    """Creates a random float32 tensor"""
    if rng is None:
        rng = global_rng

    values = []
    for batch_idx in range(shape[0]):
        values.append([])
        for _ in range(shape[1]):
            values[-1].append(rng.random() * scale)

    return values
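Despite its docstring, `floats_list` returns plain nested Python lists rather than a tensor. A quick standalone check of the shapes and ranges it yields (the helper is re-declared here, with a seeded RNG, so the snippet runs without the transformers test suite):

```python
import random

# Re-declaration of floats_list for a standalone shape/range check: a
# shape[0] x shape[1] nested list of floats drawn from [0, scale).
def floats_list(shape, scale=1.0, rng=None, name=None):
    if rng is None:
        rng = random.Random(0)  # seeded here so the snippet is deterministic
    values = []
    for _ in range(shape[0]):
        values.append([rng.random() * scale for _ in range(shape[1])])
    return values


batch = floats_list((3, 5), scale=2.0)
```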
@require_torch
class SeamlessM4TFeatureExtractionTester(unittest.TestCase):
    def __init__(
        self,
        parent,
        batch_size=7,
        min_seq_length=400,
        max_seq_length=2000,
        feature_size=10,
        padding_value=0.0,
        sampling_rate=4_000,
        return_attention_mask=True,
        do_normalize=True,
        stride=2,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.min_seq_length = min_seq_length
        self.max_seq_length = max_seq_length
        self.seq_length_diff = (self.max_seq_length - self.min_seq_length) // (self.batch_size - 1)
        self.padding_value = padding_value
        self.sampling_rate = sampling_rate
        self.return_attention_mask = return_attention_mask
        self.do_normalize = do_normalize
        self.feature_size = feature_size
        self.stride = stride
        self.num_mel_bins = feature_size
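The `seq_length_diff` computed above determines how the tester's input lengths step from `min_seq_length` toward `max_seq_length`. Spelled out with the default values:

```python
# The tester's default geometry: batch_size=7 inputs whose lengths step from
# min_seq_length=400 toward max_seq_length=2000 by seq_length_diff.
batch_size = 7
min_seq_length = 400
max_seq_length = 2000

seq_length_diff = (max_seq_length - min_seq_length) // (batch_size - 1)
lengths = list(range(min_seq_length, max_seq_length, seq_length_diff))
```

With these defaults the step is 266, yielding exactly `batch_size` distinct lengths.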
    def prepare_feat_extract_dict(self):
        return {
            "feature_size": self.feature_size,
            "num_mel_bins": self.num_mel_bins,
            "padding_value": self.padding_value,
            "sampling_rate": self.sampling_rate,
            "stride": self.stride,
            "return_attention_mask": self.return_attention_mask,
            "do_normalize": self.do_normalize,
        }

    # Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTester.prepare_inputs_for_common
    def prepare_inputs_for_common(self, equal_length=False, numpify=False):
        def _flatten(list_of_lists):
            return list(itertools.chain(*list_of_lists))

        if equal_length:
            speech_inputs = [floats_list((self.max_seq_length, self.feature_size)) for _ in range(self.batch_size)]
        else:
            # make sure that inputs increase in size
            speech_inputs = [
                floats_list((x, self.feature_size))
                for x in range(self.min_seq_length, self.max_seq_length, self.seq_length_diff)
            ]
        if numpify:
            speech_inputs = [np.asarray(x) for x in speech_inputs]
        return speech_inputs
@require_torch
class SeamlessM4TFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
    feature_extraction_class = SeamlessM4TFeatureExtractor if is_speech_available() else None

    def setUp(self):
        self.feat_extract_tester = SeamlessM4TFeatureExtractionTester(self)

    def test_feat_extract_from_and_save_pretrained(self):
        feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)

        with tempfile.TemporaryDirectory() as tmpdirname:
            saved_file = feat_extract_first.save_pretrained(tmpdirname)[0]
            check_json_file_has_correct_format(saved_file)
            feat_extract_second = self.feature_extraction_class.from_pretrained(tmpdirname)

        dict_first = feat_extract_first.to_dict()
        dict_second = feat_extract_second.to_dict()
        self.assertDictEqual(dict_first, dict_second)

    def test_feat_extract_to_json_file(self):
        feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)

        with tempfile.TemporaryDirectory() as tmpdirname:
            json_file_path = os.path.join(tmpdirname, "feat_extract.json")
            feat_extract_first.to_json_file(json_file_path)
            feat_extract_second = self.feature_extraction_class.from_json_file(json_file_path)

        dict_first = feat_extract_first.to_dict()
        dict_second = feat_extract_second.to_dict()
        self.assertEqual(dict_first, dict_second)

    def test_call(self):
        # Tests that all call wrap to encode_plus and batch_encode_plus
        feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
        # create three inputs of length 800, 1000, and 1200
        speech_inputs = [floats_list((1, x))[0] for x in range(800, 1400, 200)]
        np_speech_inputs = [np.asarray(speech_input) for speech_input in speech_inputs]

        # Test feature size
        input_features = feature_extractor(np_speech_inputs, padding=True, return_tensors="np").input_features
        self.assertTrue(input_features.ndim == 3)
        self.assertTrue(input_features.shape[0] == 3)
        self.assertTrue(input_features.shape[-1] == feature_extractor.feature_size * feature_extractor.stride)

        # Test not batched input
        encoded_sequences_1 = feature_extractor(speech_inputs[0], return_tensors="np").input_features
        encoded_sequences_2 = feature_extractor(np_speech_inputs[0], return_tensors="np").input_features
        self.assertTrue(np.allclose(encoded_sequences_1, encoded_sequences_2, atol=1e-3))

        # Test batched
        encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
        encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))

        # Test 2-D numpy arrays are batched.
        speech_inputs = [floats_list((1, x))[0] for x in (800, 800, 800)]
        np_speech_inputs = np.asarray(speech_inputs)
        encoded_sequences_1 = feature_extractor(speech_inputs, return_tensors="np").input_features
        encoded_sequences_2 = feature_extractor(np_speech_inputs, return_tensors="np").input_features
        for enc_seq_1, enc_seq_2 in zip(encoded_sequences_1, encoded_sequences_2):
            self.assertTrue(np.allclose(enc_seq_1, enc_seq_2, atol=1e-3))
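`test_call` asserts that the last feature dimension is `feature_size * stride`. A hypothetical plain-Python illustration of where that factor comes from, assuming (as the stride parameter suggests) that `stride` consecutive mel frames are stacked into a single feature vector:

```python
# Hypothetical illustration (plain Python, not the library code) of frame
# stacking: grouping `stride` consecutive mel frames into one vector turns a
# (num_frames, num_mel_bins) input into (num_frames // stride,
# num_mel_bins * stride), matching the shape asserted in test_call.
num_frames, num_mel_bins, stride = 6, 10, 2

# A fake mel spectrogram: frame t is [t, t, ..., t].
mel = [[t] * num_mel_bins for t in range(num_frames)]

# Stack every `stride` consecutive frames into a single feature vector.
stacked = [sum(mel[t : t + stride], []) for t in range(0, num_frames, stride)]

shape = (len(stacked), len(stacked[0]))
```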
    @require_torch
    # Copied from tests.models.whisper.test_feature_extraction_whisper.WhisperFeatureExtractionTest.test_double_precision_pad
    def test_double_precision_pad(self):
        import torch

        feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
        np_speech_inputs = np.random.rand(100, 32).astype(np.float64)
        py_speech_inputs = np_speech_inputs.tolist()

        for inputs in [py_speech_inputs, np_speech_inputs]:
            np_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="np")
            self.assertTrue(np_processed.input_features.dtype == np.float32)
            pt_processed = feature_extractor.pad([{"input_features": inputs}], return_tensors="pt")
            self.assertTrue(pt_processed.input_features.dtype == torch.float32)

    def _load_datasample(self, id):
        ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
        # automatic decoding with librispeech
        speech_sample = ds.sort("id")[id]["audio"]["array"]

        return torch.from_numpy(speech_sample).unsqueeze(0)
    def test_integration(self):
        # fmt: off
        EXPECTED_INPUT_FEATURES = torch.tensor(
            [
                -1.5621, -1.4236, -1.3335, -1.3991, -1.2881, -1.1133, -0.9710, -0.8895,
                -0.8280, -0.7376, -0.7194, -0.6896, -0.6849, -0.6788, -0.6545, -0.6610,
                -0.6566, -0.5738, -0.5252, -0.5533, -0.5887, -0.6116, -0.5971, -0.4956,
                -0.2881, -0.1512, 0.0299, 0.1762, 0.2728, 0.2236
            ]
        )
        # fmt: on

        input_speech = self._load_datasample(10)
        feature_extractor = SeamlessM4TFeatureExtractor()
        input_features = feature_extractor(input_speech, return_tensors="pt").input_features

        self.assertEqual(input_features.shape, (1, 279, 160))
        self.assertTrue(torch.allclose(input_features[0, 5, :30], EXPECTED_INPUT_FEATURES, atol=1e-4))

    def test_zero_mean_unit_variance_normalization_trunc_np_longest(self):
        feat_extract = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
        audio = self._load_datasample(1)
        audio = ((audio - audio.min()) / (audio.max() - audio.min())) * 65535  # Rescale to [0, 65535] to show issue
        audio = feat_extract.zero_mean_unit_var_norm([audio], attention_mask=None)[0]

        self.assertTrue((audio.mean() < 1e-3).all())
        self.assertTrue(((audio.var() - 1).abs() < 1e-3).all())
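The normalization exercised by `test_zero_mean_unit_variance_normalization_trunc_np_longest` can be sketched standalone. This is a plain-Python re-implementation of per-utterance zero-mean unit-variance normalization (not the `SeamlessM4TFeatureExtractor` method itself): even an input rescaled to `[0, 65535]` comes out with mean ~0 and variance ~1.

```python
import math
import random

# Plain-Python sketch of zero-mean unit-variance normalization: subtract the
# mean, divide by the standard deviation (with a small eps for stability).
def zero_mean_unit_var_norm(xs, eps=1e-7):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


rng = random.Random(0)
audio = [rng.random() * 65535 for _ in range(16_000)]  # large-scale input, as in the test

normed = zero_mean_unit_var_norm(audio)
mean = sum(normed) / len(normed)
var = sum((x - mean) ** 2 for x in normed) / len(normed)
```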
File diff suppressed because it is too large
@@ -0,0 +1,126 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import shutil
import tempfile
import unittest

from transformers import SeamlessM4TFeatureExtractor, SeamlessM4TProcessor
from transformers.models.seamless_m4t import (
    SeamlessM4TTokenizer,
    SeamlessM4TTokenizerFast,
)
from transformers.testing_utils import require_torch

from .test_feature_extraction_seamless_m4t import floats_list


@require_torch
class SeamlessM4TProcessorTest(unittest.TestCase):
    def setUp(self):
        self.checkpoint = "facebook/hf-seamless-m4t-medium"
        self.tmpdirname = tempfile.mkdtemp()

    def get_tokenizer(self, **kwargs):
        return SeamlessM4TTokenizer.from_pretrained(self.checkpoint, **kwargs)

    def get_feature_extractor(self, **kwargs):
        return SeamlessM4TFeatureExtractor.from_pretrained(self.checkpoint, **kwargs)

    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

    def test_save_load_pretrained_default(self):
        tokenizer = self.get_tokenizer()
        feature_extractor = self.get_feature_extractor()

        processor = SeamlessM4TProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        processor.save_pretrained(self.tmpdirname)
        processor = SeamlessM4TProcessor.from_pretrained(self.tmpdirname)

        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())
        tokenizer_instance = isinstance(processor.tokenizer, SeamlessM4TTokenizerFast) or isinstance(
            processor.tokenizer, SeamlessM4TTokenizer
        )
        self.assertTrue(tokenizer_instance)

        self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor.to_json_string())
        self.assertIsInstance(processor.feature_extractor, SeamlessM4TFeatureExtractor)

    def test_save_load_pretrained_additional_features(self):
        processor = SeamlessM4TProcessor(
            tokenizer=self.get_tokenizer(), feature_extractor=self.get_feature_extractor()
        )
        processor.save_pretrained(self.tmpdirname)
        tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
        feature_extractor_add_kwargs = self.get_feature_extractor(do_normalize=False, padding_value=1.0)

        processor = SeamlessM4TProcessor.from_pretrained(
            self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
        )

        self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor_add_kwargs.to_json_string())
        self.assertIsInstance(processor.feature_extractor, SeamlessM4TFeatureExtractor)
        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())

        tokenizer_instance = isinstance(processor.tokenizer, SeamlessM4TTokenizerFast) or isinstance(
            processor.tokenizer, SeamlessM4TTokenizer
        )
        self.assertTrue(tokenizer_instance)

    # Copied from tests.models.whisper.test_processor_whisper.WhisperProcessorTest.test_feature_extractor with Whisper->SeamlessM4T
    def test_feature_extractor(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = SeamlessM4TProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        raw_speech = floats_list((3, 1000))

        input_feat_extract = feature_extractor(raw_speech, return_tensors="np")
        input_processor = processor(audios=raw_speech, return_tensors="np")

        for key in input_feat_extract.keys():
            self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)

    # Copied from tests.models.whisper.test_processor_whisper.WhisperProcessorTest.test_tokenizer with Whisper->SeamlessM4T
    def test_tokenizer(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = SeamlessM4TProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        input_str = "This is a test string"

        encoded_processor = processor(text=input_str)

        encoded_tok = tokenizer(input_str)

        for key in encoded_tok.keys():
            self.assertListEqual(encoded_tok[key], encoded_processor[key])

    # Copied from tests.models.whisper.test_processor_whisper.WhisperProcessorTest.test_tokenizer_decode with Whisper->SeamlessM4T
    def test_tokenizer_decode(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = SeamlessM4TProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]

        decoded_processor = processor.batch_decode(predicted_ids)
        decoded_tok = tokenizer.batch_decode(predicted_ids)

        self.assertListEqual(decoded_tok, decoded_processor)
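The processor tests above all rely on the same delegation pattern: the processor routes `text` to the tokenizer and `audios` to the feature extractor, and forwards `batch_decode` to the tokenizer, so the two call paths must agree. A minimal mock of that pattern (not the transformers implementation):

```python
# Minimal mock of the processor delegation pattern: text goes to the tokenizer,
# audio to the feature extractor, batch_decode is forwarded to the tokenizer.
class MockTokenizer:
    def __call__(self, text):
        return {"input_ids": [ord(c) for c in text]}

    def batch_decode(self, ids_batch):
        return ["".join(chr(i) for i in ids) for ids in ids_batch]


class MockFeatureExtractor:
    def __call__(self, audios):
        return {"input_features": [[round(sum(a), 3)] for a in audios]}


class MockProcessor:
    def __init__(self, tokenizer, feature_extractor):
        self.tokenizer = tokenizer
        self.feature_extractor = feature_extractor

    def __call__(self, text=None, audios=None):
        if text is not None:
            return self.tokenizer(text)
        return self.feature_extractor(audios)

    def batch_decode(self, *args, **kwargs):
        return self.tokenizer.batch_decode(*args, **kwargs)


processor = MockProcessor(MockTokenizer(), MockFeatureExtractor())
encoded = processor(text="hi")
decoded = processor.batch_decode([encoded["input_ids"]])
```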
@@ -0,0 +1,672 @@
# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import tempfile
import unittest

from transformers import (
    SPIECE_UNDERLINE,
    AddedToken,
    BatchEncoding,
    PreTrainedTokenizerFast,
    SeamlessM4TTokenizer,
    SeamlessM4TTokenizerFast,
    is_torch_available,
)
from transformers.testing_utils import (
    get_tests_dir,
    nested_simplify,
    require_sentencepiece,
    require_tokenizers,
    require_torch,
)

from ...test_tokenization_common import TokenizerTesterMixin


SAMPLE_VOCAB = get_tests_dir("fixtures/test_sentencepiece.model")


if is_torch_available():
    from transformers.models.m2m_100.modeling_m2m_100 import shift_tokens_right

EN_CODE = 256047
RO_CODE = 256145

SMALL_TRAINING_CORPUS = [
    ["This is the first sentence.", "This is the second one."],
    ["This sentence (contains #) over symbols and numbers 12 3.", "But not this one."],
]
@require_sentencepiece
@require_tokenizers
class SeamlessM4TTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
    tokenizer_class = SeamlessM4TTokenizer
    rust_tokenizer_class = SeamlessM4TTokenizerFast
    test_rust_tokenizer = True
    test_sentencepiece = True
    from_pretrained_kwargs = {}

    def setUp(self):
        super().setUp()

        # We have a SentencePiece fixture for testing
        tokenizer = SeamlessM4TTokenizer(SAMPLE_VOCAB, keep_accents=True)
        tokenizer.save_pretrained(self.tmpdirname)

    def test_full_tokenizer(self):
        tokenizer = SeamlessM4TTokenizer(SAMPLE_VOCAB, keep_accents=True)

        tokens = tokenizer.tokenize("This is a test")
        self.assertListEqual(tokens, ["▁This", "▁is", "▁a", "▁t", "est"])

        self.assertListEqual(
            tokenizer.convert_tokens_to_ids(tokens),
            [value + tokenizer.fairseq_offset for value in [285, 46, 10, 170, 382]],
        )

        tokens = tokenizer.tokenize("I was born in 92000, and this is falsé.")
        self.assertListEqual(
            tokens,
            [
                SPIECE_UNDERLINE + "I",
                SPIECE_UNDERLINE + "was",
                SPIECE_UNDERLINE + "b",
                "or",
                "n",
                SPIECE_UNDERLINE + "in",
                SPIECE_UNDERLINE + "",
                "9",
                "2",
                "0",
                "0",
                "0",
                ",",
                SPIECE_UNDERLINE + "and",
                SPIECE_UNDERLINE + "this",
                SPIECE_UNDERLINE + "is",
                SPIECE_UNDERLINE + "f",
                "al",
                "s",
                "é",
                ".",
            ],
        )
        ids = tokenizer.convert_tokens_to_ids(tokens)
        self.assertListEqual(
            ids,
            [
                value + tokenizer.fairseq_offset
                for value in [8, 21, 84, 55, 24, 19, 7, 0, 602, 347, 347, 347, 3, 12, 66, 46, 72, 80, 6, 0, 4]
            ],
        )

        back_tokens = tokenizer.convert_ids_to_tokens(ids)
        self.assertListEqual(
            back_tokens,
            [
                SPIECE_UNDERLINE + "I",
                SPIECE_UNDERLINE + "was",
                SPIECE_UNDERLINE + "b",
                "or",
                "n",
                SPIECE_UNDERLINE + "in",
                SPIECE_UNDERLINE + "",
                "<unk>",
                "2",
                "0",
                "0",
                "0",
                ",",
                SPIECE_UNDERLINE + "and",
                SPIECE_UNDERLINE + "this",
                SPIECE_UNDERLINE + "is",
                SPIECE_UNDERLINE + "f",
                "al",
                "s",
                "<unk>",
                ".",
            ],
        )
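`test_full_tokenizer` keeps adding `tokenizer.fairseq_offset` to the raw SentencePiece ids. The idea, sketched standalone below, is that piece ids are shifted by a constant so the lowest ids stay free for fairseq-style special tokens; the offset value `1` here is chosen for illustration only, not read from the actual tokenizer:

```python
# Illustration of the fairseq_offset shift from test_full_tokenizer: raw
# SentencePiece ids are shifted by a constant offset (value hypothetical here),
# and mapping back to piece ids is the inverse shift.
fairseq_offset = 1
spm_ids = [285, 46, 10, 170, 382]  # raw piece ids for "▁This ▁is ▁a ▁t est"

vocab_ids = [value + fairseq_offset for value in spm_ids]
recovered = [value - fairseq_offset for value in vocab_ids]
```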
    def test_maximum_encoding_length_single_input(self):
        tokenizers = self.get_tokenizers(do_lower_case=False, model_max_length=100)
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                seq_0, ids = self.get_clean_sequence(tokenizer, max_length=20)

                sequence = tokenizer.encode(seq_0, add_special_tokens=False)
                total_length = len(sequence)

                self.assertGreater(
                    total_length, 4, "Issue with the testing sequence, please update it, it's too short"
                )

                # Test with max model input length
                model_max_length = tokenizer.model_max_length
                self.assertEqual(model_max_length, 100)
                seq_1 = seq_0 * model_max_length

                sequence1 = tokenizer(seq_1, add_special_tokens=False)
                total_length1 = len(sequence1["input_ids"])
                self.assertGreater(
                    total_length1,
                    model_max_length,
                    "Issue with the testing sequence, please update it, it's too short",
                )

                # Simple
                padding_strategies = (
                    [False, True, "longest"] if tokenizer.pad_token and tokenizer.pad_token_id >= 0 else [False]
                )
                for padding_state in padding_strategies:
                    with self.subTest(f"Padding: {padding_state}"):
                        for truncation_state in [True, "longest_first", "only_first"]:
                            with self.subTest(f"Truncation: {truncation_state}"):
                                output = tokenizer(seq_1, padding=padding_state, truncation=truncation_state)
                                self.assertEqual(len(output["input_ids"]), model_max_length)

                                output = tokenizer([seq_1], padding=padding_state, truncation=truncation_state)
                                self.assertEqual(len(output["input_ids"][0]), model_max_length)

                        # Simple with no truncation
                        # Reset warnings
                        tokenizer.deprecation_warnings = {}
                        with self.assertLogs("transformers", level="WARNING") as cm:
                            output = tokenizer(seq_1, padding=padding_state, truncation=False)
                            self.assertNotEqual(len(output["input_ids"]), model_max_length)
                        self.assertEqual(len(cm.records), 1)
                        self.assertTrue(
                            cm.records[0].message.startswith(
                                "Token indices sequence length is longer than the specified maximum sequence length"
                                " for this model"
                            )
                        )

                        tokenizer.deprecation_warnings = {}
                        with self.assertLogs("transformers", level="WARNING") as cm:
                            output = tokenizer([seq_1], padding=padding_state, truncation=False)
                            self.assertNotEqual(len(output["input_ids"][0]), model_max_length)
                        self.assertEqual(len(cm.records), 1)
                        self.assertTrue(
                            cm.records[0].message.startswith(
                                "Token indices sequence length is longer than the specified maximum sequence length"
                                " for this model"
                            )
                        )

                # Overflowing tokens
                stride = 2

                # modify padding because it's activated by default in seamlessM4T
                information = tokenizer(
                    seq_0,
                    max_length=total_length - 2,
                    add_special_tokens=False,
                    stride=stride,
                    truncation="longest_first",
                    return_overflowing_tokens=True,
                    padding=False,
                    # add_prefix_space=False,
                )

                # Overflowing tokens are handled quite differently in slow and fast tokenizers
                if isinstance(tokenizer, PreTrainedTokenizerFast):
                    truncated_sequence = information["input_ids"][0]
                    overflowing_tokens = information["input_ids"][1]
                    self.assertEqual(len(information["input_ids"]), 2)

                    self.assertEqual(len(truncated_sequence), total_length - 2)
                    self.assertEqual(truncated_sequence, sequence[:-2])

                    self.assertEqual(len(overflowing_tokens), 2 + stride)
                    self.assertEqual(overflowing_tokens, sequence[-(2 + stride) :])
                else:
                    truncated_sequence = information["input_ids"]
                    overflowing_tokens = information["overflowing_tokens"]

                    self.assertEqual(len(truncated_sequence), total_length - 2)
                    self.assertEqual(truncated_sequence, sequence[:-2])

                    self.assertEqual(len(overflowing_tokens), 2 + stride)
                    self.assertEqual(overflowing_tokens, sequence[-(2 + stride) :])
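The overflow assertions at the end of `test_maximum_encoding_length_single_input` encode a small piece of arithmetic worth spelling out: truncating to `len(sequence) - 2` with `stride=2` keeps everything but the last 2 tokens, while the overflow window holds those 2 tokens plus `stride` tokens of preceding context.

```python
# Standalone sketch of the truncation-with-stride arithmetic checked above,
# using a plain list as a stand-in for encoded token ids.
sequence = list(range(10))
stride = 2
max_length = len(sequence) - 2

truncated_sequence = sequence[:max_length]
overflowing_tokens = sequence[-(2 + stride):]
```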
    @unittest.skip("By default, uses pad_to_multiple_of which breaks the test")
    def test_maximum_encoding_length_pair_input(self):
        pass

    def test_padding_to_multiple_of(self):
        tokenizers = self.get_tokenizers()
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                if tokenizer.pad_token is None:
                    self.skipTest("No padding token.")
                else:
                    empty_tokens = tokenizer("", padding=True, pad_to_multiple_of=8)
                    normal_tokens = tokenizer("This is a sample input", padding=True, pad_to_multiple_of=8)
                    for key, value in empty_tokens.items():
                        self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")
                    for key, value in normal_tokens.items():
                        self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")

                    # padding defaults to True, so we need to specify which padding is used
                    normal_tokens = tokenizer("This", pad_to_multiple_of=8, padding=False)
                    for key, value in normal_tokens.items():
                        self.assertNotEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")

                    # Should also work with truncation
                    normal_tokens = tokenizer("This", padding=True, truncation=True, pad_to_multiple_of=8)
                    for key, value in normal_tokens.items():
                        self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")

                    # truncation to something which is not a multiple of pad_to_multiple_of raises an error
                    self.assertRaises(
                        ValueError,
                        tokenizer.__call__,
                        "This",
                        padding=True,
                        truncation=True,
                        max_length=12,
                        pad_to_multiple_of=8,
                    )
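The `pad_to_multiple_of=8` arithmetic that `test_padding_to_multiple_of` asserts can be stated in one line: the padded length is the sequence length rounded up to the next multiple of 8.

```python
# Round a sequence length up to the next multiple of `multiple` — the length
# rule behind the pad_to_multiple_of assertions above.
def padded_length(n, multiple=8):
    return ((n + multiple - 1) // multiple) * multiple


lengths = {n: padded_length(n) for n in (1, 5, 8, 9, 22)}
```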
    @require_torch
    def test_prepare_seq2seq_batch(self):
        if not self.test_seq2seq:
            return

        tokenizers = self.get_tokenizers()
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                # Longer text that will definitely require truncation.
                src_text = [
                    " UN Chief Says There Is No Military Solution in Syria",
                    " Secretary-General Ban Ki-moon says his response to Russia's stepped up military support for"
                    " Syria is that 'there is no military solution' to the nearly five-year conflict and more weapons"
                    " will only worsen the violence and misery for millions of people.",
                ]
                tgt_text = [
                    "Şeful ONU declară că nu există o soluţie militară în Siria",
                    "Secretarul General Ban Ki-moon declară că răspunsul său la intensificarea sprijinului militar al"
                    ' Rusiei pentru Siria este că "nu există o soluţie militară" la conflictul de aproape cinci ani şi'
                    " că noi arme nu vor face decât să înrăutăţească violenţele şi mizeria pentru milioane de oameni.",
                ]
                try:
                    batch = tokenizer.prepare_seq2seq_batch(
                        src_texts=src_text,
                        tgt_texts=tgt_text,
                        max_length=3,
                        max_target_length=10,
                        return_tensors="pt",
                        src_lang="eng",
                        tgt_lang="ron",
                        pad_to_multiple_of=None,
                    )
                except NotImplementedError:
                    return
                self.assertEqual(batch.input_ids.shape[1], 3)
                self.assertEqual(batch.labels.shape[1], 10)

                # TODO: not working for tgt_text
                # max_target_length will default to max_length if not specified
                batch = tokenizer.prepare_seq2seq_batch(
                    src_texts=src_text,
                    tgt_texts=tgt_text,
                    max_length=4,
                    return_tensors="pt",
                    pad_to_multiple_of=None,
                )
                self.assertEqual(batch.input_ids.shape[1], 4)
                self.assertEqual(batch.labels.shape[1], 4)

                batch_encoder_only = tokenizer.prepare_seq2seq_batch(
                    src_texts=src_text,
                    max_length=4,
                    max_target_length=10,
                    return_tensors="pt",
                    pad_to_multiple_of=None,
                )
                self.assertEqual(batch_encoder_only.input_ids.shape[1], 4)
                self.assertEqual(batch_encoder_only.attention_mask.shape[1], 4)
                self.assertNotIn("decoder_input_ids", batch_encoder_only)
    @unittest.skip("Unfortunately way too slow to build a BPE with SentencePiece.")
    def test_save_slow_from_fast_and_reload_fast(self):
        pass

    # Copied from tests.models.nllb.test_tokenization_nllb.NllbTokenizationTest.test_special_tokens_initialization
    def test_special_tokens_initialization(self):
        for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
            with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
                added_tokens = [AddedToken("<special>", lstrip=True)]

                tokenizer_r = self.rust_tokenizer_class.from_pretrained(
                    pretrained_name, additional_special_tokens=added_tokens, **kwargs
                )
                r_output = tokenizer_r.encode("Hey this is a <special> token")

                special_token_id = tokenizer_r.encode("<special>", add_special_tokens=False)[0]

                self.assertTrue(special_token_id in r_output)

                if self.test_slow_tokenizer:
                    tokenizer_cr = self.rust_tokenizer_class.from_pretrained(
                        pretrained_name,
                        additional_special_tokens=added_tokens,
                        **kwargs,  # , from_slow=True <- unfortunately too slow to convert
                    )
                    tokenizer_p = self.tokenizer_class.from_pretrained(
                        pretrained_name, additional_special_tokens=added_tokens, **kwargs
                    )

                    p_output = tokenizer_p.encode("Hey this is a <special> token")

                    cr_output = tokenizer_cr.encode("Hey this is a <special> token")

                    self.assertEqual(p_output, r_output)
                    self.assertEqual(cr_output, r_output)
                    self.assertTrue(special_token_id in p_output)
                    self.assertTrue(special_token_id in cr_output)
@unittest.skip(
|
||||
"encode_plus and batch_encode_plus are deprecated and __call__ do some processing, so we expect different results."
|
||||
)
|
||||
def test_call(self):
|
||||
pass
|
||||
|
||||
def test_training_new_tokenizer(self):
|
||||
# This feature only exists for fast tokenizers
|
||||
if not self.test_rust_tokenizer:
|
||||
return
|
||||
|
||||
tokenizer = self.get_rust_tokenizer()
|
||||
new_tokenizer = tokenizer.train_new_from_iterator(SMALL_TRAINING_CORPUS, 100)
|
||||
|
||||
# Test we can use the new tokenizer with something not seen during training
|
||||
inputs = new_tokenizer(["This is the first sentence", "This sentence is different 🤗."])
|
||||
self.assertEqual(len(inputs["input_ids"]), 2)
|
||||
decoded_input = new_tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
|
||||
expected_result = "This is the first sentence"
|
||||
|
||||
if tokenizer.backend_tokenizer.normalizer is not None:
|
||||
expected_result = tokenizer.backend_tokenizer.normalizer.normalize_str(expected_result)
|
||||
self.assertEqual(expected_result, decoded_input)
|
||||
|
||||
# We check that the parameters of the tokenizer remained the same
|
||||
# Check we have the same number of added_tokens for both pair and non-pair inputs.
|
||||
# make sure it has the same prefix tokens first
|
||||
new_tokenizer.tgt_lang = tokenizer.tgt_lang
|
||||
tokenizer.tgt_lang = tokenizer.tgt_lang
|
||||
self.assertEqual(tokenizer.num_special_tokens_to_add(False), new_tokenizer.num_special_tokens_to_add(False))
|
||||
self.assertEqual(tokenizer.num_special_tokens_to_add(True), new_tokenizer.num_special_tokens_to_add(True))
|
||||
|
||||
# Check we have the correct max_length for both pair and non-pair inputs.
|
||||
self.assertEqual(tokenizer.max_len_single_sentence, new_tokenizer.max_len_single_sentence)
|
||||
self.assertEqual(tokenizer.max_len_sentences_pair, new_tokenizer.max_len_sentences_pair)
|
||||
|
||||
# Assert the set of special tokens match as we didn't ask to change them
|
||||
self.assertSequenceEqual(
|
||||
tokenizer.all_special_tokens_extended,
|
||||
new_tokenizer.all_special_tokens_extended,
|
||||
)
|
||||
|
||||
self.assertDictEqual(tokenizer.special_tokens_map, new_tokenizer.special_tokens_map)
|
||||
|
||||
@unittest.skip("Fails because of the hack of adding <unk> in _tokenize")
|
||||
def test_pickle_subword_regularization_tokenizer(self):
|
||||
pass
|
||||
|
||||
@unittest.skip("Fails because of the hack of adding <unk> in _tokenize")
|
||||
def test_subword_regularization_tokenizer(self):
|
||||
pass
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_sentencepiece
|
||||
@require_tokenizers
|
||||
class SeamlessM4TDistilledIntegrationTest(unittest.TestCase):
|
||||
checkpoint_name = "facebook/hf-seamless-m4t-medium"
|
||||
src_text = [
|
||||
" UN Chief Says There Is No Military Solution in Syria",
|
||||
""" Secretary-General Ban Ki-moon says his response to Russia's stepped up military support for Syria is that "there is no military solution" to the nearly five-year conflict and more weapons will only worsen the violence and misery for millions of people.""",
|
||||
]
|
||||
tgt_text = [
|
||||
"Şeful ONU declară că nu există o soluţie militară în Siria",
|
||||
"Secretarul General Ban Ki-moon declară că răspunsul său la intensificarea sprijinului militar al Rusiei"
|
||||
' pentru Siria este că "nu există o soluţie militară" la conflictul de aproape cinci ani şi că noi arme nu vor'
|
||||
" face decât să înrăutăţească violenţele şi mizeria pentru milioane de oameni.",
|
||||
]
|
||||
|
||||
# fmt: off
|
||||
expected_src_tokens = [256047, 16297, 134408, 8165, 248066, 14734, 950, 1135, 105721, 3573, 83, 27352, 108, 49486, 3]
|
||||
# fmt: on
|
||||
|
||||
@classmethod
|
||||
def setUpClass(cls):
|
||||
cls.tokenizer: SeamlessM4TTokenizer = SeamlessM4TTokenizer.from_pretrained(
|
||||
cls.checkpoint_name, src_lang="eng", tgt_lang="ron"
|
||||
)
|
||||
# cls.pad_token_id = 1
|
||||
return cls
|
||||
|
||||
def test_language_codes(self):
|
||||
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__ace_Latn__"), 256002)
|
||||
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__shn__"), 256152)
|
||||
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__eng__"), 256047)
|
||||
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__fra__"), 256057)
|
||||
self.assertEqual(self.tokenizer.convert_tokens_to_ids("__quy__"), 256144)
|
||||
|
||||
def test_tokenizer_tgt_lang(self):
|
||||
ids = self.tokenizer(self.src_text, src_lang="fra").input_ids[0]
|
||||
self.assertListEqual(self.expected_src_tokens[1:], ids[1 : len(self.expected_src_tokens)])
|
||||
self.assertEqual(256057, ids[0])
|
||||
|
||||
rest_ids = ids[len(self.expected_src_tokens) :]
|
||||
self.assertListEqual([0] * len(rest_ids), rest_ids)
|
||||
|
||||
ids = self.tokenizer(self.src_text, src_lang="__shn__").input_ids[0]
|
||||
self.assertListEqual(self.expected_src_tokens[1:], ids[1 : len(self.expected_src_tokens)])
|
||||
self.assertEqual(256152, ids[0])
|
||||
|
||||
# Copied from tests.models.nllb.test_tokenization_nllb.NllbDistilledIntegrationTest.test_enro_tokenizer_decode_ignores_language_codes
|
||||
def test_enro_tokenizer_decode_ignores_language_codes(self):
|
||||
self.assertIn(RO_CODE, self.tokenizer.all_special_ids)
|
||||
# fmt: off
|
||||
generated_ids = [RO_CODE, 4254, 98068, 112923, 39072, 3909, 713, 102767, 26, 17314, 35642, 14683, 33118, 2022, 66987, 2, 256047]
|
||||
# fmt: on
|
||||
|
||||
result = self.tokenizer.decode(generated_ids, skip_special_tokens=True)
|
||||
expected_romanian = self.tokenizer.decode(generated_ids[1:], skip_special_tokens=True)
|
||||
self.assertEqual(result, expected_romanian)
|
||||
self.assertNotIn(self.tokenizer.eos_token, result)
|
||||
|
||||
def test_enro_tokenizer_truncation(self):
|
||||
src_text = ["this is gunna be a long sentence " * 20]
|
||||
assert isinstance(src_text[0], str)
|
||||
desired_max_length = 10
|
||||
ids = self.tokenizer(src_text, max_length=desired_max_length, truncation=True).input_ids[0]
|
||||
self.assertEqual(ids[-1], 3)
|
||||
self.assertEqual(ids[0], EN_CODE)
|
||||
self.assertEqual(len(ids), desired_max_length)
|
||||
|
||||
# Copied from tests.models.nllb.test_tokenization_nllb.NllbDistilledIntegrationTest.test_special_tokens_unaffacted_by_save_load with fairseq_tokens_to_ids->additional_special_tokens, Nllb->SeamlessM4T, Dict->List
|
||||
def test_special_tokens_unaffacted_by_save_load(self):
|
||||
tmpdirname = tempfile.mkdtemp()
|
||||
original_special_tokens = self.tokenizer.additional_special_tokens
|
||||
self.tokenizer.save_pretrained(tmpdirname)
|
||||
new_tok = SeamlessM4TTokenizer.from_pretrained(tmpdirname)
|
||||
self.assertListEqual(new_tok.additional_special_tokens, original_special_tokens)
|
||||
|
||||
@require_torch
|
||||
def test_enro_tokenizer_prepare_batch(self):
|
||||
batch = self.tokenizer(
|
||||
self.src_text,
|
||||
text_target=self.tgt_text,
|
||||
padding=True,
|
||||
truncation=True,
|
||||
max_length=len(self.expected_src_tokens),
|
||||
pad_to_multiple_of=None,
|
||||
return_tensors="pt",
|
||||
)
|
||||
batch["decoder_input_ids"] = shift_tokens_right(
|
||||
batch["labels"], self.tokenizer.pad_token_id, self.tokenizer.convert_tokens_to_ids("__ron__")
|
||||
)
|
||||
|
||||
self.assertIsInstance(batch, BatchEncoding)
|
||||
|
||||
self.assertEqual((2, 15), batch.input_ids.shape)
|
||||
self.assertEqual((2, 15), batch.attention_mask.shape)
|
||||
result = batch.input_ids.tolist()[0]
|
||||
self.assertListEqual(self.expected_src_tokens, result)
|
||||
self.assertEqual(RO_CODE, batch.decoder_input_ids[0, 0]) # EOS
|
||||
# Test that special tokens are reset
|
||||
self.assertEqual(self.tokenizer.prefix_tokens, [EN_CODE])
|
||||
self.assertEqual(self.tokenizer.suffix_tokens, [self.tokenizer.eos_token_id])
|
||||
|
||||
def test_seq2seq_max_length(self):
|
||||
batch = self.tokenizer(
|
||||
self.src_text, padding=True, truncation=True, max_length=3, return_tensors="pt", pad_to_multiple_of=None
|
||||
)
|
||||
targets = self.tokenizer(
|
||||
text_target=self.tgt_text, padding=True, truncation=True, max_length=10, return_tensors="pt"
|
||||
)
|
||||
labels = targets["input_ids"]
|
||||
batch["decoder_input_ids"] = shift_tokens_right(
|
||||
labels,
|
||||
self.tokenizer.pad_token_id,
|
||||
decoder_start_token_id=self.tokenizer.convert_tokens_to_ids(self.tokenizer.tgt_lang),
|
||||
)
|
||||
|
||||
self.assertEqual(batch.input_ids.shape[1], 3)
|
||||
self.assertEqual(batch.decoder_input_ids.shape[1], 10)
|
||||
|
||||
@require_torch
|
||||
def test_tokenizer_translation(self):
|
||||
inputs = self.tokenizer._build_translation_inputs(
|
||||
"A test", return_tensors="pt", src_lang="eng", tgt_lang="fra"
|
||||
)
|
||||
|
||||
self.assertEqual(
|
||||
nested_simplify(inputs),
|
||||
{
|
||||
# A, test, EOS, en_XX
|
||||
"input_ids": [[256047, 70, 7356, 3]],
|
||||
"attention_mask": [[1, 1, 1, 1]],
|
||||
# ar_AR
|
||||
"forced_bos_token_id": 256057,
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
@require_sentencepiece
|
||||
@require_tokenizers
|
||||
class CommonSpmIntegrationTests(unittest.TestCase):
|
||||
"""
|
||||
A class that regroups important test to make sure that we properly handle the special tokens.
|
||||
"""
|
||||
|
||||
@classmethod
|
||||
def setUpClass(cls):
|
||||
tokenizer = SeamlessM4TTokenizer(SAMPLE_VOCAB, extra_ids=0, add_bos_token=False, legacy=False)
|
||||
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("<s>", rstrip=False, lstrip=False)]})
|
||||
cls.tokenizer = tokenizer
|
||||
return cls
|
||||
|
||||
def test_add_dummy_prefix(self):
|
||||
# make sure `'▁'` is prepended, and outputs match sp_model's
|
||||
# `sentencepiece.NormalizerSpec.add_dummy_prefix` attribute
|
||||
input_ids = self.tokenizer.encode(". Hello")
|
||||
self.assertEqual(input_ids, [3, 1, 8, 5, 157, 87, 21, 3])
|
||||
sp_encode = self.tokenizer.sp_model.encode(". Hello")
|
||||
|
||||
# [bos, lang_id, _] + offset_sp_encode
|
||||
self.assertEqual(input_ids[:-1], [3, 1, 8] + [i + self.tokenizer.fairseq_offset for i in sp_encode])
|
||||
tokens = self.tokenizer.tokenize(". Hello")
|
||||
self.assertEqual(tokens, ["▁", ".", "▁He", "ll", "o"])
|
||||
|
||||
tokens = self.tokenizer.tokenize("")
|
||||
self.assertEqual(tokens, [])
|
||||
self.assertEqual(tokens, self.tokenizer.sp_model.encode("", out_type=str))
|
||||
|
||||
tokens = self.tokenizer.tokenize(" ")
|
||||
self.assertEqual(tokens, [])
|
||||
self.assertEqual(tokens, self.tokenizer.sp_model.encode(" ", out_type=str))
|
||||
|
||||
tokens = self.tokenizer.tokenize("▁")
|
||||
self.assertEqual(tokens, [])
|
||||
self.assertEqual(tokens, self.tokenizer.sp_model.encode("▁", out_type=str))
|
||||
|
||||
def test_remove_extra_whitespaces(self):
|
||||
# make sure the extra spaces are eaten. Since the sample vocab does not have
|
||||
# `______`. sentencepiece.NormalizerSpec.remove_extra_whitespaces attribute is set to False
|
||||
|
||||
input_ids = self.tokenizer.encode(" . Hello")
|
||||
self.assertEqual(input_ids, [3, 1, 8, 5, 157, 87, 21, 3])
|
||||
sp_encode = self.tokenizer.sp_model.encode(" . Hello")
|
||||
self.assertEqual([i - self.tokenizer.fairseq_offset for i in input_ids[2:-1]], [7] + sp_encode)
|
||||
tokens = self.tokenizer.tokenize(" . Hello")
|
||||
self.assertEqual(tokens, ["▁", ".", "▁He", "ll", "o"])
|
||||
|
||||
# `'▁'` is also a whitespace
|
||||
input_ids = self.tokenizer.encode("▁He is not")
|
||||
self.assertEqual(input_ids, [3, 1, 157, 47, 45, 3])
|
||||
tokens = self.tokenizer.tokenize("▁He is not")
|
||||
sp_encode = [
|
||||
self.tokenizer.sp_model.piece_to_id("▁He"),
|
||||
self.tokenizer.sp_model.piece_to_id("▁is"),
|
||||
self.tokenizer.sp_model.piece_to_id("▁not"),
|
||||
]
|
||||
self.assertEqual([i - self.tokenizer.fairseq_offset for i in input_ids[2:-1]], sp_encode)
|
||||
self.assertEqual(tokens, ["▁He", "▁is", "▁not"]) # no extra space added
|
||||
|
||||
input_ids = self.tokenizer.encode("▁He is not<s> ▁He")
|
||||
self.assertEqual(input_ids, [3, 1, 157, 47, 45, 2, 157, 3])
|
||||
tokens = self.tokenizer.tokenize("▁He is not<s> ▁He")
|
||||
self.assertEqual(tokens, ["▁He", "▁is", "▁not", "<s>", "▁He"]) # spaces are eaten by spm + our strip
|
||||
# make sure that the output after the extra id is the same as if
|
||||
# extra_id was not there
|
||||
input_ids = self.tokenizer.encode("▁He is not ▁He")
|
||||
self.assertEqual(input_ids, [3, 1, 157, 47, 45, 157, 3])
|
||||
tokens = self.tokenizer.tokenize("▁He is not ▁He")
|
||||
self.assertEqual(tokens, ["▁He", "▁is", "▁not", "▁He"]) # spaces are eaten by spm even if not start
|
||||
|
||||
def test_character_after_special_token(self):
|
||||
# Make sure that `tokenizer.tokenize` is similar to
|
||||
# adding the equivalent special token to the vocab
|
||||
input_ids = self.tokenizer.encode("Hey <s>I")
|
||||
self.assertEqual(input_ids, [3, 1, 157, 31, 2, 101, 3])
|
||||
sp_encode = self.tokenizer.sp_model.encode("Hey .I")
|
||||
|
||||
# the last token besides eos should be 100 offset
|
||||
self.assertEqual(input_ids[-2] - self.tokenizer.fairseq_offset, sp_encode[-1])
|
||||
tokens = self.tokenizer.tokenize("<s>I")
|
||||
self.assertEqual(tokens, ["<s>", "I"])
|
||||
|
||||
input_ids = self.tokenizer.encode("Hello, <s>,")
|
||||
self.assertEqual(input_ids, [3, 1, 157, 87, 21, 4, 2, 4, 3])
|
||||
tokens = self.tokenizer.tokenize("Hello, <s>,")
|
||||
self.assertEqual(tokens, ["▁He", "ll", "o", ",", "<s>", ","])
|
||||
|
||||
def test_special_tokens_strip(self):
|
||||
input_ids = self.tokenizer.encode(" <s> ,")
|
||||
self.assertEqual(input_ids, [3, 1, 2, 8, 4, 3])
|
||||
tokens = self.tokenizer.tokenize(" <s> ,")
|
||||
# spaces are eaten by rstrip / lstrip + spm sp_model.encode(" ") = []
|
||||
self.assertEqual(tokens, ["<s>", "▁", ","])
|
||||
|
||||
input_ids = self.tokenizer.encode("No <s> ▁He")
|
||||
self.assertEqual(input_ids, [3, 1, 285, 2, 157, 3])
|
||||
tokens = self.tokenizer.tokenize("No <s> ▁He")
|
||||
self.assertEqual(tokens, ["▁No", "<s>", "▁He"]) # spaces are eaten by rstrip / lstrip
|
|
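The integration tests above build `decoder_input_ids` by shifting the labels one position to the right and prepending the target-language id. A minimal sketch of that Bart-style `shift_tokens_right` helper, using plain Python lists instead of tensors for illustration (the library version operates on `torch` tensors; the language id `9` below is made up for the example):

```python
def shift_tokens_right(input_ids, pad_token_id, decoder_start_token_id):
    """Shift each label row right by one, prepend the decoder start token
    (here: the target-language id), and replace any -100 ignore-index
    entries with the pad token."""
    shifted = []
    for row in input_ids:
        new_row = [decoder_start_token_id] + row[:-1]  # drop last token, prepend start id
        shifted.append([pad_token_id if tok == -100 else tok for tok in new_row])
    return shifted


# e.g. labels [[5, 6, 2]] with pad_token_id=1 and a (made-up) "__ron__" id of 9
# -> [[9, 5, 6]]: the decoder sees the language id first, matching the
# `batch.decoder_input_ids[0, 0]` checks in the tests above.
```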
@@ -84,6 +84,18 @@ SPECIAL_CASES_TO_ALLOW = {
    "ClapAudioConfig": ["num_classes"],
    # Not used, but providing useful information to users
    "SpeechT5HifiGanConfig": ["sampling_rate"],
    # Actually used in the config or generation config, in that case necessary for the sub-components generation
    "SeamlessM4TConfig": [
        "max_new_tokens",
        "t2u_max_new_tokens",
        "t2u_decoder_attention_heads",
        "t2u_decoder_ffn_dim",
        "t2u_decoder_layers",
        "t2u_encoder_attention_heads",
        "t2u_encoder_ffn_dim",
        "t2u_encoder_layers",
        "t2u_max_position_embeddings",
    ],
}


@@ -463,6 +463,7 @@ OBJECTS_TO_IGNORE = [
    "SEWForCTC",
    "SamConfig",
    "SamPromptEncoderConfig",
    "SeamlessM4TConfig",  # use of unconventional markdown
    "Seq2SeqTrainingArguments",
    "SpecialTokensMixin",
    "Speech2Text2Config",


@@ -112,6 +112,9 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
    "BridgeTowerVisionModel",  # No need to test it as it is tested by BridgeTowerModel model.
    "BarkCausalModel",  # Building part of bigger (tested) model.
    "BarkModel",  # Does not have a forward signature - generation tested with integration tests
    "SeamlessM4TTextToUnitModel",  # Building part of bigger (tested) model.
    "SeamlessM4TCodeHifiGan",  # Building part of bigger (tested) model.
    "SeamlessM4TTextToUnitForConditionalGeneration",  # Building part of bigger (tested) model.
]

# Update this list with test files that don't have a tester with a `all_model_classes` variable and which don't


@@ -281,6 +284,10 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "SpeechT5ForTextToSpeech",
    "SpeechT5HifiGan",
    "VitMatteForImageMatting",
    "SeamlessM4TTextToUnitModel",
    "SeamlessM4TTextToUnitForConditionalGeneration",
    "SeamlessM4TCodeHifiGan",
    "SeamlessM4TForSpeechToSpeech",  # no auto class for speech-to-speech
]

# DO NOT edit this list!


@@ -768,6 +768,7 @@ src/transformers/models/sam/image_processing_sam.py
src/transformers/models/sam/modeling_sam.py
src/transformers/models/sam/modeling_tf_sam.py
src/transformers/models/sam/processing_sam.py
src/transformers/models/seamless_m4t/convert_fairseq2_to_hf.py
src/transformers/models/segformer/configuration_segformer.py
src/transformers/models/segformer/convert_segformer_original_to_pytorch.py
src/transformers/models/sew/convert_sew_original_pytorch_checkpoint_to_pytorch.py


@@ -1,5 +1,6 @@
docs/source/en/generation_strategies.md
docs/source/en/model_doc/ctrl.md
docs/source/en/model_doc/seamless_m4t.md
docs/source/en/task_summary.md
docs/source/en/tasks/prompting.md
src/transformers/models/blip_2/modeling_blip_2.py