Add SeamlessM4T v2 (#27779)

* add working conversion script
* first non-working version of modeling code
* update modeling code (working)
* make style
* make fix-copies
* add config docstrings
* add config to ignore docstrings formatting due to unconventional markdown
* fix copies
* fix generation num_return_sequences
* enrich docs
* add and fix tests beside integration tests
* update integration tests
* update repo id
* add tie weights and make style
* correct naming in .md
* fix imports and so on
* correct docstrings
* fix fp16 speech forward
* fix speech encoder attention
* make style
* fix copied from
* rename SeamlessM4Tv2-v2 to SeamlessM4Tv2
* Apply suggestions on configuration (Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>)
* remove useless public models
* fix private models + better naming for T2U models
* clean speech encoder relative position embeddings
* refactor chunk attention
* add docstrings to chunk attention method
* improve naming and docstrings
* rename some attention variables + add temperature sampling in T2U model
* rename DOCSTRINGS variable names
* make style + remove 2 useless config parameters
* enrich model card
* remove any attention_head reference + fix temperature in T2U
* new fmt and make style
* Apply suggestions from code review (Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>)
* rename spkr_id->speaker_id and change docstrings of get_char_input_ids
* simplify v2attention
* make style
* Update seamless_m4t_v2.md
* update code and tests with last update
* update repo ids
* fill article name, abstract and authors
* update not_doctested and slow_doc tests

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
This commit is contained in: parent 510270af34, commit 29f1aee3b6

@@ -466,6 +466,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

@@ -441,6 +441,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=htt
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

@@ -415,6 +415,7 @@ conda install -c huggingface transformers
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

@@ -475,6 +475,7 @@ How to install Flax, PyTorch, and TensorFlow with conda
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

@@ -390,6 +390,7 @@ how to install these with conda from the Flax, PyTorch, and TensorFlow installation pages
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

@@ -414,6 +414,7 @@ conda install -c huggingface transformers
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

@@ -426,6 +426,7 @@ conda install -c huggingface transformers
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng.
1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team.
1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

@@ -618,6 +618,8 @@
      title: Pop2Piano
    - local: model_doc/seamless_m4t
      title: Seamless-M4T
    - local: model_doc/seamless_m4t_v2
      title: SeamlessM4T-v2
    - local: model_doc/sew
      title: SEW
    - local: model_doc/sew-d

@@ -242,6 +242,7 @@ Flax), PyTorch, and/or TensorFlow.
| [RWKV](model_doc/rwkv) | ✅ | ❌ | ❌ |
| [SAM](model_doc/sam) | ✅ | ✅ | ❌ |
| [SeamlessM4T](model_doc/seamless_m4t) | ✅ | ❌ | ❌ |
| [SeamlessM4Tv2](model_doc/seamless_m4t_v2) | ✅ | ❌ | ❌ |
| [SegFormer](model_doc/segformer) | ✅ | ✅ | ❌ |
| [SEW](model_doc/sew) | ✅ | ❌ | ❌ |
| [SEW-D](model_doc/sew-d) | ✅ | ❌ | ❌ |

@@ -0,0 +1,194 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# SeamlessM4T-v2

## Overview

The SeamlessM4T-v2 model was proposed in [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team from Meta AI.

SeamlessM4T-v2 is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the [previous version](./seamless_m4t.md). For more details on the differences between v1 and v2, refer to the section [Difference with SeamlessM4T-v1](#difference-with-seamlessm4t-v1).

SeamlessM4T-v2 enables multiple tasks without relying on separate models:

- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)

[`SeamlessM4Tv2Model`] can perform all the above tasks, but each task also has its own dedicated sub-model.

The abstract from the paper is the following:

*Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model—SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel and modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols tailored for measuring the most relevant attributes in the preservation of meaning, naturalness, and expressivity. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. Finally, contributions in this work—including models, code, and a watermark detector—are publicly released and accessible at the link below.*

## Usage

In the following example, we'll load an Arabic audio sample and an English text sample and convert them into Russian speech and French text.

First, load the processor and a checkpoint of the model:

```python
>>> from transformers import AutoProcessor, SeamlessM4Tv2Model

>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
```

You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.

Here is how to use the processor to process text and audio:

```python
>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset

>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]

>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")

>>> # now, process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```

### Speech

[`SeamlessM4Tv2Model`] can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation:

```python
>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```

With essentially the same code, we've translated English text and Arabic speech into Russian speech samples.
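
The resulting arrays can be written to `.wav` files to listen to them. A minimal sketch, assuming the config exposes the vocoder's output rate as `sampling_rate` (as in SeamlessM4T-v1):

```python
>>> from scipy.io import wavfile

>>> sample_rate = model.config.sampling_rate  # assumed attribute, mirroring SeamlessM4T-v1
>>> wavfile.write("out_from_text.wav", rate=sample_rate, data=audio_array_from_text)
>>> wavfile.write("out_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)
```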

### Text

Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4Tv2Model.generate`].
This time, let's translate to French.

```python
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```

### Tips

#### 1. Use dedicated models

[`SeamlessM4Tv2Model`] is the Transformers top-level model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task; the rest of the code is exactly the same:

```python
>>> from transformers import SeamlessM4Tv2ForSpeechToSpeech
>>> model = SeamlessM4Tv2ForSpeechToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")
```

Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task; you only have to remove `generate_speech=False`.

```python
>>> from transformers import SeamlessM4Tv2ForTextToText
>>> model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")
```

Feel free to try out [`SeamlessM4Tv2ForSpeechToText`] and [`SeamlessM4Tv2ForTextToSpeech`] as well.

#### 2. Change the speaker identity

You can change the speaker used for speech synthesis with the `speaker_id` argument. Some `speaker_id` values work better than others for some languages!
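
For instance, reusing `text_inputs` from above (the index `0` is only an illustrative value; which speaker ids are valid depends on the checkpoint):

```python
>>> audio_array = model.generate(**text_inputs, tgt_lang="rus", speaker_id=0)[0].cpu().numpy().squeeze()
```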

#### 3. Change the generation strategy

You can use different [generation strategies](../generation_strategies) for text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, text_do_sample=True)`, which will perform multinomial beam-search decoding on the text model. Note that speech generation only supports greedy decoding (the default) or multinomial sampling, which can be used with e.g. `.generate(..., speech_do_sample=True, speech_temperature=0.6)`.
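
A short sketch of both options, reusing `text_inputs` from above:

```python
>>> # beam-search decoding on the text model, default greedy decoding on the speech model
>>> audio_array = model.generate(**text_inputs, tgt_lang="rus", text_num_beams=4)[0].cpu().numpy().squeeze()

>>> # multinomial sampling on the speech model
>>> audio_array = model.generate(**text_inputs, tgt_lang="rus", speech_do_sample=True, speech_temperature=0.6)[0].cpu().numpy().squeeze()
```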

#### 4. Generate speech and text at the same time

Use `return_intermediate_token_ids=True` with [`SeamlessM4Tv2Model`] to return both speech and text!
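
A minimal sketch, assuming the returned object exposes the waveform as `waveform` and the intermediate text tokens as `sequences`, as SeamlessM4T-v1's generation output does:

```python
>>> outputs = model.generate(**text_inputs, tgt_lang="rus", return_intermediate_token_ids=True)
>>> audio_array = outputs.waveform[0].cpu().numpy().squeeze()  # assumed field name
>>> translated_text = processor.decode(outputs.sequences[0].tolist(), skip_special_tokens=True)  # assumed field name
```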

## Model architecture

SeamlessM4T-v2 features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text.

Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://arxiv.org/abs/2010.05646) architecture is placed on top of the second seq2seq model.

### Difference with SeamlessM4T-v1

The architecture of this new version differs from the first in a few aspects:

#### Improvements on the second-pass model

The second seq2seq model, named the text-to-unit model, is now non-autoregressive, meaning that it computes units in a **single forward pass**. This is made possible by:
- the use of **character-level embeddings**, meaning that each character of the predicted translated text has its own embedding, which is then used to predict the unit tokens.
- the use of an intermediate duration predictor, which predicts speech duration at the **character level** on the predicted translated text (see the illustrative sketch after this list).
- the use of a new text-to-unit decoder mixing convolutions and self-attention to handle longer context.
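
To make the duration-predictor idea concrete, here is an illustrative sketch (hypothetical shapes and names, not the actual SeamlessM4Tv2 internals) of how character-level embeddings can be expanded to frame level using predicted durations:

```python
import torch

# one sentence of 6 characters, hidden size 16 (illustrative values)
char_embeddings = torch.randn(6, 16)
# predicted number of speech frames each character should last
durations = torch.tensor([3, 2, 4, 1, 2, 3])

# each character embedding is repeated for as many frames as predicted,
# yielding a frame-level sequence the non-autoregressive decoder can map to unit tokens
upsampled = torch.repeat_interleave(char_embeddings, durations, dim=0)
print(upsampled.shape)  # torch.Size([15, 16]), i.e. sum(durations) frames
```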

#### Difference in the speech encoder

The speech encoder, which is used during the first-pass generation process to predict the translated text, differs mainly from the previous speech encoder through these mechanisms:
- the use of a chunked attention mask to prevent attention across chunks, ensuring that each position attends only to positions within its own chunk and a fixed number of previous chunks (see the sketch after this list).
- the use of relative position embeddings which only consider the distance between sequence elements rather than absolute positions. Please refer to [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155) for more details.
- the use of a causal depthwise convolution instead of a non-causal one.
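
The chunked mask can be pictured with a small self-contained sketch (a hypothetical helper, not the library's implementation):

```python
import torch

def chunked_attention_mask(seq_len: int, chunk_size: int, left_chunks: int) -> torch.Tensor:
    """Boolean mask where True means "may attend": each position sees its own
    chunk and at most `left_chunks` previous chunks."""
    chunk_ids = torch.arange(seq_len) // chunk_size          # chunk index of each position
    diff = chunk_ids.unsqueeze(1) - chunk_ids.unsqueeze(0)   # chunk(query) - chunk(key)
    return (diff >= 0) & (diff <= left_chunks)

# with chunks of 2 and one previous chunk visible, position 4 attends to positions 2-5
print(chunked_attention_mask(seq_len=6, chunk_size=2, left_chunks=1).int())
```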

### Generation process

Here's how the generation process works:

- Input text or speech is processed through its specific encoder.
- A decoder creates text tokens in the desired language.
- If speech generation is required, the second seq2seq model generates unit tokens in a non-autoregressive way.
- These unit tokens are then passed through the final vocoder to produce the actual speech.

This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication).

## SeamlessM4Tv2Model

[[autodoc]] SeamlessM4Tv2Model
    - generate

## SeamlessM4Tv2ForTextToSpeech

[[autodoc]] SeamlessM4Tv2ForTextToSpeech
    - generate

## SeamlessM4Tv2ForSpeechToSpeech

[[autodoc]] SeamlessM4Tv2ForSpeechToSpeech
    - generate

## SeamlessM4Tv2ForTextToText

[[autodoc]] transformers.SeamlessM4Tv2ForTextToText
    - forward
    - generate

## SeamlessM4Tv2ForSpeechToText

[[autodoc]] transformers.SeamlessM4Tv2ForSpeechToText
    - forward
    - generate

## SeamlessM4Tv2Config

[[autodoc]] SeamlessM4Tv2Config

@@ -35,7 +35,7 @@ The task illustrated in this tutorial is supported by the following model archit
<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

-[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
+[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SeamlessM4Tv2](../model_doc/seamless_m4t_v2), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)

<!--End of the generated tip-->

@@ -32,7 +32,7 @@ The task illustrated in this tutorial is supported by the following model archit
<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

-[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)
+[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SeamlessM4Tv2](../model_doc/seamless_m4t_v2), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet)

<!--End of the generated tip-->

@@ -547,6 +547,10 @@ _import_structure = {
        "SeamlessM4TFeatureExtractor",
        "SeamlessM4TProcessor",
    ],
    "models.seamless_m4t_v2": [
        "SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "SeamlessM4Tv2Config",
    ],
    "models.segformer": ["SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "SegformerConfig"],
    "models.sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"],
    "models.sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"],
@@ -2778,6 +2782,17 @@ else:
            "SeamlessM4TTextToUnitModel",
        ]
    )
    _import_structure["models.seamless_m4t_v2"].extend(
        [
            "SEAMLESS_M4T_V2_PRETRAINED_MODEL_ARCHIVE_LIST",
            "SeamlessM4Tv2ForSpeechToSpeech",
            "SeamlessM4Tv2ForSpeechToText",
            "SeamlessM4Tv2ForTextToSpeech",
            "SeamlessM4Tv2ForTextToText",
            "SeamlessM4Tv2Model",
            "SeamlessM4Tv2PreTrainedModel",
        ]
    )
    _import_structure["models.segformer"].extend(
        [
            "SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4794,6 +4809,10 @@ if TYPE_CHECKING:
        SeamlessM4TFeatureExtractor,
        SeamlessM4TProcessor,
    )
    from .models.seamless_m4t_v2 import (
        SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP,
        SeamlessM4Tv2Config,
    )
    from .models.segformer import SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, SegformerConfig
    from .models.sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig
    from .models.sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig
@@ -6678,6 +6697,15 @@ if TYPE_CHECKING:
            SeamlessM4TTextToUnitForConditionalGeneration,
            SeamlessM4TTextToUnitModel,
        )
        from .models.seamless_m4t_v2 import (
            SEAMLESS_M4T_V2_PRETRAINED_MODEL_ARCHIVE_LIST,
            SeamlessM4Tv2ForSpeechToSpeech,
            SeamlessM4Tv2ForSpeechToText,
            SeamlessM4Tv2ForTextToSpeech,
            SeamlessM4Tv2ForTextToText,
            SeamlessM4Tv2Model,
            SeamlessM4Tv2PreTrainedModel,
        )
        from .models.segformer import (
            SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
            SegformerDecodeHead,

@@ -185,6 +185,7 @@ from . import (
    rwkv,
    sam,
    seamless_m4t,
    seamless_m4t_v2,
    segformer,
    sew,
    sew_d,

@@ -191,6 +191,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("rwkv", "RwkvConfig"),
        ("sam", "SamConfig"),
        ("seamless_m4t", "SeamlessM4TConfig"),
        ("seamless_m4t_v2", "SeamlessM4Tv2Config"),
        ("segformer", "SegformerConfig"),
        ("sew", "SEWConfig"),
        ("sew-d", "SEWDConfig"),
@@ -404,6 +405,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("rwkv", "RWKV_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("sam", "SAM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("seamless_m4t", "SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("seamless_m4t_v2", "SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("sew", "SEW_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("sew-d", "SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -642,6 +644,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("rwkv", "RWKV"),
        ("sam", "SAM"),
        ("seamless_m4t", "SeamlessM4T"),
        ("seamless_m4t_v2", "SeamlessM4Tv2"),
        ("segformer", "SegFormer"),
        ("sew", "SEW"),
        ("sew-d", "SEW-D"),

@@ -180,6 +180,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("rwkv", "RwkvModel"),
        ("sam", "SamModel"),
        ("seamless_m4t", "SeamlessM4TModel"),
        ("seamless_m4t_v2", "SeamlessM4Tv2Model"),
        ("segformer", "SegformerModel"),
        ("sew", "SEWModel"),
        ("sew-d", "SEWDModel"),
@@ -685,6 +686,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
        ("plbart", "PLBartForConditionalGeneration"),
        ("prophetnet", "ProphetNetForConditionalGeneration"),
        ("seamless_m4t", "SeamlessM4TForTextToText"),
        ("seamless_m4t_v2", "SeamlessM4Tv2ForTextToText"),
        ("switch_transformers", "SwitchTransformersForConditionalGeneration"),
        ("t5", "T5ForConditionalGeneration"),
        ("umt5", "UMT5ForConditionalGeneration"),
@@ -696,6 +698,7 @@ MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
    [
        ("pop2piano", "Pop2PianoForConditionalGeneration"),
        ("seamless_m4t", "SeamlessM4TForSpeechToText"),
        ("seamless_m4t_v2", "SeamlessM4Tv2ForSpeechToText"),
        ("speech-encoder-decoder", "SpeechEncoderDecoderModel"),
        ("speech_to_text", "Speech2TextForConditionalGeneration"),
        ("speecht5", "SpeechT5ForSpeechToText"),
@@ -1062,6 +1065,7 @@ MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES = OrderedDict(
        ("bark", "BarkModel"),
        ("musicgen", "MusicgenForConditionalGeneration"),
        ("seamless_m4t", "SeamlessM4TForTextToSpeech"),
        ("seamless_m4t_v2", "SeamlessM4Tv2ForTextToSpeech"),
        ("vits", "VitsModel"),
    ]
)

@@ -346,6 +346,13 @@ else:
                "SeamlessM4TTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        (
            "seamless_m4t_v2",
            (
                "SeamlessM4TTokenizer" if is_sentencepiece_available() else None,
                "SeamlessM4TTokenizerFast" if is_tokenizers_available() else None,
            ),
        ),
        ("speech_to_text", ("Speech2TextTokenizer" if is_sentencepiece_available() else None, None)),
        ("speech_to_text_2", ("Speech2Text2Tokenizer", None)),
        ("speecht5", ("SpeechT5Tokenizer" if is_sentencepiece_available() else None, None)),

@@ -159,25 +159,6 @@ SEAMLESS_M4T_INPUTS_DOCSTRING_LAST_PART = r"""
            If you want to change padding behavior, you should read [`modeling_bart._prepare_decoder_attention_mask`]
            and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
            information on the default strategy.
        head_mask (`torch.Tensor` of shape `(encoder_layers, encoder_attention_heads)`, *optional*):
            Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in `[0, 1]`:

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.

        decoder_head_mask (`torch.Tensor` of shape `(decoder_layers, decoder_attention_heads)`, *optional*):
            Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in `[0, 1]`:

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.

        cross_attn_head_mask (`torch.Tensor` of shape `(decoder_layers, decoder_attention_heads)`, *optional*):
            Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in `[0,
            1]`:

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.

        encoder_outputs (`tuple(tuple(torch.FloatTensor)`, *optional*):
            Tuple consists of (`last_hidden_state`, *optional*: `hidden_states`, *optional*: `attentions`)
            `last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)`, *optional*) is a sequence of
@@ -1090,10 +1071,10 @@ class SeamlessM4TSinusoidalPositionalEmbedding(nn.Module):
        return position_ids.unsqueeze(0).expand(input_shape).contiguous() + past_key_values_length


# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->SeamlessM4T,key_value_states->encoder_hidden_states
class SeamlessM4TAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    # Copied from transformers.models.bart.modeling_bart.BartAttention.__init__ with Bart->SeamlessM4T
    def __init__(
        self,
        embed_dim: int,
@@ -1134,7 +1115,6 @@ class SeamlessM4TAttention(nn.Module):
        encoder_hidden_states: Optional[torch.Tensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        attention_mask: Optional[torch.Tensor] = None,
        layer_head_mask: Optional[torch.Tensor] = None,
        output_attentions: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        """Input shape: Batch x Time x Channel"""
@@ -1208,15 +1188,6 @@ class SeamlessM4TAttention(nn.Module):

        attn_weights = nn.functional.softmax(attn_weights, dim=-1)

        if layer_head_mask is not None:
            if layer_head_mask.size() != (self.num_heads,):
                raise ValueError(
                    f"Head mask for a single layer should be of size {(self.num_heads,)}, but is"
                    f" {layer_head_mask.size()}"
                )
            attn_weights = layer_head_mask.view(1, -1, 1, 1) * attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        if output_attentions:
            # this operation is a bit awkward, but it's required to
            # make sure that attn_weights keeps its gradient.
@@ -1298,7 +1269,6 @@ class SeamlessM4TEncoderLayer(nn.Module):
        self,
        hidden_states: torch.Tensor,
        attention_mask: torch.Tensor,
        layer_head_mask: torch.Tensor,
        output_attentions: bool = False,
    ) -> torch.Tensor:
        """
@@ -1308,15 +1278,12 @@ class SeamlessM4TEncoderLayer(nn.Module):
            attention_mask (`torch.FloatTensor`):
                attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very
                large negative values.
            layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size
                `(encoder_attention_heads,)`.
        """
        residual = hidden_states
        hidden_states = self.self_attn_layer_norm(hidden_states)
        hidden_states, attn_weights, _ = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            layer_head_mask=layer_head_mask,
            output_attentions=output_attentions,
        )
        hidden_states = self.attn_dropout(hidden_states)
@@ -1375,8 +1342,6 @@ class SeamlessM4TDecoderLayer(nn.Module):
        attention_mask: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.Tensor] = None,
        layer_head_mask: Optional[torch.Tensor] = None,
        cross_attn_layer_head_mask: Optional[torch.Tensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = True,
@@ -1393,10 +1358,6 @@ class SeamlessM4TDecoderLayer(nn.Module):
            encoder_attention_mask (`torch.FloatTensor`):
                encoder attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by
                very large negative values.
            layer_head_mask (`torch.FloatTensor`):
                mask for attention heads in a given layer of size `(encoder_attention_heads,)`.
            cross_attn_layer_head_mask (`torch.FloatTensor`):
                mask for cross-attention heads in a given layer of size `(decoder_attention_heads,)`.
            past_key_value (`Tuple(torch.FloatTensor)`):
                cached past key and value projection states
            output_attentions (`bool`, *optional*):
@@ -1414,7 +1375,6 @@ class SeamlessM4TDecoderLayer(nn.Module):
            hidden_states=hidden_states,
            past_key_value=self_attn_past_key_value,
            attention_mask=attention_mask,
            layer_head_mask=layer_head_mask,
            output_attentions=output_attentions,
        )
        hidden_states = self.attn_dropout(hidden_states)
@@ -1435,7 +1395,6 @@ class SeamlessM4TDecoderLayer(nn.Module):
            encoder_hidden_states=encoder_hidden_states,
            past_key_value=cross_attn_past_key_value,
            attention_mask=encoder_attention_mask,
            layer_head_mask=cross_attn_layer_head_mask,
            output_attentions=output_attentions,
        )
        hidden_states = self.attn_dropout(hidden_states)
@@ -1710,7 +1669,6 @@ class SeamlessM4TEncoder(SeamlessM4TPreTrainedModel):
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
@@ -1734,12 +1692,6 @@ class SeamlessM4TEncoder(SeamlessM4TPreTrainedModel):
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)
            head_mask (`torch.Tensor` of shape `(encoder_layers, encoder_attention_heads)`, *optional*):
                Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`:

                - 1 indicates the head is **not masked**,
                - 0 indicates the head is **masked**.

            inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated vectors
@@ -1796,13 +1748,6 @@ class SeamlessM4TEncoder(SeamlessM4TPreTrainedModel):
        encoder_states = () if output_hidden_states else None
        all_attentions = () if output_attentions else None

        # check if head_mask has a correct number of layers specified if desired
        if head_mask is not None:
            if head_mask.size()[0] != len(self.layers):
                raise ValueError(
                    f"The head_mask should be specified for {len(self.layers)} layers, but it is for"
                    f" {head_mask.size()[0]}."
                )
        for idx, encoder_layer in enumerate(self.layers):
            if output_hidden_states:
                encoder_states = encoder_states + (hidden_states,)
@@ -1821,14 +1766,12 @@ class SeamlessM4TEncoder(SeamlessM4TPreTrainedModel):
                    encoder_layer.forward,
                    hidden_states,
                    attention_mask,
                    (head_mask[idx] if head_mask is not None else None),
                    output_attentions,
                )
            else:
                layer_outputs = encoder_layer(
                    hidden_states,
                    attention_mask,
                    layer_head_mask=(head_mask[idx] if head_mask is not None else None),
                    output_attentions=output_attentions,
                )

@@ -1912,8 +1855,6 @@ class SeamlessM4TDecoder(SeamlessM4TPreTrainedModel):
        attention_mask: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        encoder_attention_mask: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        cross_attn_head_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
@@ -1949,19 +1890,6 @@ class SeamlessM4TDecoder(SeamlessM4TPreTrainedModel):
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)
            head_mask (`torch.Tensor` of shape `(decoder_layers, decoder_attention_heads)`, *optional*):
                Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`:

                - 1 indicates the head is **not masked**,
                - 0 indicates the head is **masked**.

            cross_attn_head_mask (`torch.Tensor` of shape `(decoder_layers, decoder_attention_heads)`, *optional*):
                Mask to nullify selected heads of the cross-attention modules in the decoder to avoid performing
                cross-attention on hidden heads. Mask values selected in `[0, 1]`:

                - 1 indicates the head is **not masked**,
                - 0 indicates the head is **masked**.

            past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
                Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
                shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of
@@ -2043,14 +1971,6 @@ class SeamlessM4TDecoder(SeamlessM4TPreTrainedModel):
        all_cross_attentions = () if (output_attentions and encoder_hidden_states is not None) else None
        next_decoder_cache = () if use_cache else None

        # check if head_mask/cross_attn_head_mask has a correct number of layers specified if desired
        for attn_mask, mask_name in zip([head_mask, cross_attn_head_mask], ["head_mask", "cross_attn_head_mask"]):
            if attn_mask is not None:
                if attn_mask.size()[0] != len(self.layers):
                    raise ValueError(
                        f"The `{mask_name}` should be specified for {len(self.layers)} layers, but it is for"
                        f" {attn_mask.size()[0]}."
                    )
        for idx, decoder_layer in enumerate(self.layers):
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            if output_hidden_states:
@@ -2069,8 +1989,6 @@ class SeamlessM4TDecoder(SeamlessM4TPreTrainedModel):
                    attention_mask,
                    encoder_hidden_states,
                    encoder_attention_mask,
                    head_mask[idx] if head_mask is not None else None,
                    cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None,
                    None,
                    output_attentions,
                    use_cache,
@@ -2081,10 +1999,6 @@ class SeamlessM4TDecoder(SeamlessM4TPreTrainedModel):
                    attention_mask=attention_mask,
                    encoder_hidden_states=encoder_hidden_states,
                    encoder_attention_mask=encoder_attention_mask,
                    layer_head_mask=(head_mask[idx] if head_mask is not None else None),
                    cross_attn_layer_head_mask=(
                        cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None
                    ),
                    past_key_value=past_key_value,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
@@ -2143,16 +2057,12 @@ class SeamlessM4TTextToUnitModel(SeamlessM4TPreTrainedModel):
        # Initialize weights and apply final processing
        self.post_init()

    # Copied from transformers.models.m2m_100.modeling_m2m_100.M2M100Model.forward
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        decoder_input_ids: Optional[torch.LongTensor] = None,
        decoder_attention_mask: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        decoder_head_mask: Optional[torch.Tensor] = None,
        cross_attn_head_mask: Optional[torch.Tensor] = None,
        encoder_outputs: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
@@ -2173,7 +2083,6 @@ class SeamlessM4TTextToUnitModel(SeamlessM4TPreTrainedModel):
            encoder_outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                head_mask=head_mask,
                inputs_embeds=inputs_embeds,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
@@ -2193,8 +2102,6 @@ class SeamlessM4TTextToUnitModel(SeamlessM4TPreTrainedModel):
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=encoder_outputs[0],
            encoder_attention_mask=attention_mask,
            head_mask=decoder_head_mask,
            cross_attn_head_mask=cross_attn_head_mask,
            past_key_values=past_key_values,
            inputs_embeds=decoder_inputs_embeds,
            use_cache=use_cache,
@@ -2278,9 +2185,6 @@ class SeamlessM4TTextToUnitForConditionalGeneration(SeamlessM4TPreTrainedModel):
        attention_mask: Optional[torch.Tensor] = None,
        decoder_input_ids: Optional[torch.LongTensor] = None,
        decoder_attention_mask: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        decoder_head_mask: Optional[torch.Tensor] = None,
        cross_attn_head_mask: Optional[torch.Tensor] = None,
        encoder_outputs: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
@@ -2308,9 +2212,6 @@ class SeamlessM4TTextToUnitForConditionalGeneration(SeamlessM4TPreTrainedModel):
            decoder_input_ids=decoder_input_ids,
            encoder_outputs=encoder_outputs,
            decoder_attention_mask=decoder_attention_mask,
            head_mask=head_mask,
            decoder_head_mask=decoder_head_mask,
            cross_attn_head_mask=cross_attn_head_mask,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            decoder_inputs_embeds=decoder_inputs_embeds,
@@ -2348,9 +2249,6 @@ class SeamlessM4TTextToUnitForConditionalGeneration(SeamlessM4TPreTrainedModel):
        decoder_input_ids,
        past_key_values=None,
        attention_mask=None,
        head_mask=None,
        decoder_head_mask=None,
        cross_attn_head_mask=None,
        use_cache=None,
        encoder_outputs=None,
        **kwargs,
@@ -2365,9 +2263,6 @@ class SeamlessM4TTextToUnitForConditionalGeneration(SeamlessM4TPreTrainedModel):
            "past_key_values": past_key_values,
            "decoder_input_ids": decoder_input_ids,
            "attention_mask": attention_mask,
            "head_mask": head_mask,
            "decoder_head_mask": decoder_head_mask,
            "cross_attn_head_mask": cross_attn_head_mask,
            "use_cache": use_cache,
        }

@@ -2798,9 +2693,6 @@ class SeamlessM4TForTextToText(SeamlessM4TPreTrainedModel):
        attention_mask: Optional[torch.Tensor] = None,
        decoder_input_ids: Optional[torch.LongTensor] = None,
        decoder_attention_mask: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        decoder_head_mask: Optional[torch.Tensor] = None,
        cross_attn_head_mask: Optional[torch.Tensor] = None,
        encoder_outputs: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
@@ -2832,7 +2724,6 @@ class SeamlessM4TForTextToText(SeamlessM4TPreTrainedModel):
            encoder_outputs = self.text_encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                head_mask=head_mask,
                inputs_embeds=inputs_embeds,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
@ -2854,8 +2745,6 @@ class SeamlessM4TForTextToText(SeamlessM4TPreTrainedModel):
|
|||
attention_mask=decoder_attention_mask,
|
||||
encoder_hidden_states=encoder_outputs[0],
|
||||
encoder_attention_mask=encoder_attention_mask,
|
||||
head_mask=decoder_head_mask,
|
||||
cross_attn_head_mask=cross_attn_head_mask,
|
||||
past_key_values=past_key_values,
|
||||
inputs_embeds=decoder_inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
|
@ -3007,9 +2896,6 @@ class SeamlessM4TForTextToText(SeamlessM4TPreTrainedModel):
|
|||
decoder_input_ids,
|
||||
past_key_values=None,
|
||||
attention_mask=None,
|
||||
head_mask=None,
|
||||
decoder_head_mask=None,
|
||||
cross_attn_head_mask=None,
|
||||
use_cache=None,
|
||||
encoder_outputs=None,
|
||||
**kwargs,
|
||||
|
@ -3024,9 +2910,6 @@ class SeamlessM4TForTextToText(SeamlessM4TPreTrainedModel):
|
|||
"past_key_values": past_key_values,
|
||||
"decoder_input_ids": decoder_input_ids,
|
||||
"attention_mask": attention_mask,
|
||||
"head_mask": head_mask,
|
||||
"decoder_head_mask": decoder_head_mask,
|
||||
"cross_attn_head_mask": cross_attn_head_mask,
|
||||
"use_cache": use_cache,
|
||||
}
|
||||
|
||||
|
@ -3095,9 +2978,6 @@ class SeamlessM4TForSpeechToText(SeamlessM4TPreTrainedModel):
|
|||
attention_mask: Optional[torch.Tensor] = None,
|
||||
decoder_input_ids: Optional[torch.LongTensor] = None,
|
||||
decoder_attention_mask: Optional[torch.LongTensor] = None,
|
||||
head_mask: Optional[torch.Tensor] = None,
|
||||
decoder_head_mask: Optional[torch.Tensor] = None,
|
||||
cross_attn_head_mask: Optional[torch.Tensor] = None,
|
||||
encoder_outputs: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
|
||||
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
|
||||
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||
|
@ -3129,7 +3009,6 @@ class SeamlessM4TForSpeechToText(SeamlessM4TPreTrainedModel):
|
|||
encoder_outputs = self.speech_encoder(
|
||||
input_features=input_features,
|
||||
attention_mask=attention_mask,
|
||||
head_mask=head_mask,
|
||||
inputs_embeds=inputs_embeds,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
|
@ -3158,8 +3037,6 @@ class SeamlessM4TForSpeechToText(SeamlessM4TPreTrainedModel):
|
|||
attention_mask=decoder_attention_mask,
|
||||
encoder_hidden_states=encoder_outputs[0],
|
||||
encoder_attention_mask=encoder_attention_mask,
|
||||
head_mask=decoder_head_mask,
|
||||
cross_attn_head_mask=cross_attn_head_mask,
|
||||
past_key_values=past_key_values,
|
||||
inputs_embeds=decoder_inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
|
@ -3312,9 +3189,6 @@ class SeamlessM4TForSpeechToText(SeamlessM4TPreTrainedModel):
|
|||
decoder_input_ids,
|
||||
past_key_values=None,
|
||||
attention_mask=None,
|
||||
head_mask=None,
|
||||
decoder_head_mask=None,
|
||||
cross_attn_head_mask=None,
|
||||
use_cache=None,
|
||||
encoder_outputs=None,
|
||||
**kwargs,
|
||||
|
@ -3329,9 +3203,6 @@ class SeamlessM4TForSpeechToText(SeamlessM4TPreTrainedModel):
|
|||
"past_key_values": past_key_values,
|
||||
"decoder_input_ids": decoder_input_ids,
|
||||
"attention_mask": attention_mask,
|
||||
"head_mask": head_mask,
|
||||
"decoder_head_mask": decoder_head_mask,
|
||||
"cross_attn_head_mask": cross_attn_head_mask,
|
||||
"use_cache": use_cache,
|
||||
}
|
||||
|
||||
|
@ -3408,9 +3279,6 @@ class SeamlessM4TForTextToSpeech(SeamlessM4TPreTrainedModel):
|
|||
attention_mask: Optional[torch.Tensor] = None,
|
||||
decoder_input_ids: Optional[torch.LongTensor] = None,
|
||||
decoder_attention_mask: Optional[torch.LongTensor] = None,
|
||||
head_mask: Optional[torch.Tensor] = None,
|
||||
decoder_head_mask: Optional[torch.Tensor] = None,
|
||||
cross_attn_head_mask: Optional[torch.Tensor] = None,
|
||||
encoder_outputs: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
|
||||
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
|
||||
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||
|
@ -3447,7 +3315,6 @@ class SeamlessM4TForTextToSpeech(SeamlessM4TPreTrainedModel):
|
|||
encoder_outputs = self.text_encoder(
|
||||
input_ids=input_ids,
|
||||
attention_mask=attention_mask,
|
||||
head_mask=head_mask,
|
||||
inputs_embeds=inputs_embeds,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
|
@ -3469,8 +3336,6 @@ class SeamlessM4TForTextToSpeech(SeamlessM4TPreTrainedModel):
|
|||
attention_mask=decoder_attention_mask,
|
||||
encoder_hidden_states=encoder_outputs[0],
|
||||
encoder_attention_mask=encoder_attention_mask,
|
||||
head_mask=decoder_head_mask,
|
||||
cross_attn_head_mask=cross_attn_head_mask,
|
||||
past_key_values=past_key_values,
|
||||
inputs_embeds=decoder_inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
|
@ -3624,8 +3489,6 @@ class SeamlessM4TForTextToSpeech(SeamlessM4TPreTrainedModel):
|
|||
input_ids=sequences,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
encoder_attention_mask=attention_mask,
|
||||
head_mask=kwargs_text.get("decoder_head_mask"),
|
||||
cross_attn_head_mask=kwargs_text.get("cross_attn_head_mask"),
|
||||
).last_hidden_state
|
||||
|
||||
pad_token_id = self.generation_config.pad_token_id
|
||||
|
@ -3678,9 +3541,6 @@ class SeamlessM4TForTextToSpeech(SeamlessM4TPreTrainedModel):
|
|||
decoder_input_ids,
|
||||
past_key_values=None,
|
||||
attention_mask=None,
|
||||
head_mask=None,
|
||||
decoder_head_mask=None,
|
||||
cross_attn_head_mask=None,
|
||||
use_cache=None,
|
||||
encoder_outputs=None,
|
||||
**kwargs,
|
||||
|
@ -3695,9 +3555,6 @@ class SeamlessM4TForTextToSpeech(SeamlessM4TPreTrainedModel):
|
|||
"past_key_values": past_key_values,
|
||||
"decoder_input_ids": decoder_input_ids,
|
||||
"attention_mask": attention_mask,
|
||||
"head_mask": head_mask,
|
||||
"decoder_head_mask": decoder_head_mask,
|
||||
"cross_attn_head_mask": cross_attn_head_mask,
|
||||
"use_cache": use_cache,
|
||||
}
|
||||
|
||||
|
@ -3769,9 +3626,6 @@ class SeamlessM4TForSpeechToSpeech(SeamlessM4TPreTrainedModel):
|
|||
attention_mask: Optional[torch.Tensor] = None,
|
||||
decoder_input_ids: Optional[torch.LongTensor] = None,
|
||||
decoder_attention_mask: Optional[torch.LongTensor] = None,
|
||||
head_mask: Optional[torch.Tensor] = None,
|
||||
decoder_head_mask: Optional[torch.Tensor] = None,
|
||||
cross_attn_head_mask: Optional[torch.Tensor] = None,
|
||||
encoder_outputs: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
|
||||
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
|
||||
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||
|
@ -3809,7 +3663,6 @@ class SeamlessM4TForSpeechToSpeech(SeamlessM4TPreTrainedModel):
|
|||
encoder_outputs = self.speech_encoder(
|
||||
input_features=input_features,
|
||||
attention_mask=attention_mask,
|
||||
head_mask=head_mask,
|
||||
inputs_embeds=inputs_embeds,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
|
@ -3838,8 +3691,6 @@ class SeamlessM4TForSpeechToSpeech(SeamlessM4TPreTrainedModel):
|
|||
attention_mask=decoder_attention_mask,
|
||||
encoder_hidden_states=encoder_outputs[0],
|
||||
encoder_attention_mask=encoder_attention_mask,
|
||||
head_mask=decoder_head_mask,
|
||||
cross_attn_head_mask=cross_attn_head_mask,
|
||||
past_key_values=past_key_values,
|
||||
inputs_embeds=decoder_inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
|
@ -3999,8 +3850,6 @@ class SeamlessM4TForSpeechToSpeech(SeamlessM4TPreTrainedModel):
|
|||
input_ids=sequences,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
encoder_attention_mask=attention_mask,
|
||||
head_mask=kwargs_text.get("decoder_head_mask"),
|
||||
cross_attn_head_mask=kwargs_text.get("cross_attn_head_mask"),
|
||||
).last_hidden_state
|
||||
|
||||
pad_token_id = self.generation_config.pad_token_id
|
||||
|
@ -4063,9 +3912,6 @@ class SeamlessM4TForSpeechToSpeech(SeamlessM4TPreTrainedModel):
|
|||
decoder_input_ids,
|
||||
past_key_values=None,
|
||||
attention_mask=None,
|
||||
head_mask=None,
|
||||
decoder_head_mask=None,
|
||||
cross_attn_head_mask=None,
|
||||
use_cache=None,
|
||||
encoder_outputs=None,
|
||||
**kwargs,
|
||||
|
@ -4080,9 +3926,6 @@ class SeamlessM4TForSpeechToSpeech(SeamlessM4TPreTrainedModel):
|
|||
"past_key_values": past_key_values,
|
||||
"decoder_input_ids": decoder_input_ids,
|
||||
"attention_mask": attention_mask,
|
||||
"head_mask": head_mask,
|
||||
"decoder_head_mask": decoder_head_mask,
|
||||
"cross_attn_head_mask": cross_attn_head_mask,
|
||||
"use_cache": use_cache,
|
||||
}
|
||||
|
||||
|
@ -4167,9 +4010,6 @@ class SeamlessM4TModel(SeamlessM4TPreTrainedModel):
|
|||
attention_mask: Optional[torch.Tensor] = None,
|
||||
decoder_input_ids: Optional[torch.LongTensor] = None,
|
||||
decoder_attention_mask: Optional[torch.LongTensor] = None,
|
||||
head_mask: Optional[torch.Tensor] = None,
|
||||
decoder_head_mask: Optional[torch.Tensor] = None,
|
||||
cross_attn_head_mask: Optional[torch.Tensor] = None,
|
||||
encoder_outputs: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
|
||||
past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
|
||||
inputs_embeds: Optional[torch.FloatTensor] = None,
|
||||
|
@ -4242,7 +4082,6 @@ class SeamlessM4TModel(SeamlessM4TPreTrainedModel):
|
|||
encoder_outputs = self.text_encoder(
|
||||
input_ids=input_ids,
|
||||
attention_mask=attention_mask,
|
||||
head_mask=head_mask,
|
||||
inputs_embeds=inputs_embeds,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
|
@ -4272,8 +4111,6 @@ class SeamlessM4TModel(SeamlessM4TPreTrainedModel):
|
|||
attention_mask=decoder_attention_mask,
|
||||
encoder_hidden_states=encoder_outputs[0],
|
||||
encoder_attention_mask=encoder_attention_mask,
|
||||
head_mask=decoder_head_mask,
|
||||
cross_attn_head_mask=cross_attn_head_mask,
|
||||
past_key_values=past_key_values,
|
||||
inputs_embeds=decoder_inputs_embeds,
|
||||
use_cache=use_cache,
|
||||
|
@ -4477,8 +4314,6 @@ class SeamlessM4TModel(SeamlessM4TPreTrainedModel):
|
|||
input_ids=sequences,
|
||||
encoder_hidden_states=encoder_hidden_states,
|
||||
encoder_attention_mask=attention_mask,
|
||||
head_mask=kwargs_text.get("decoder_head_mask"),
|
||||
cross_attn_head_mask=kwargs_text.get("cross_attn_head_mask"),
|
||||
).last_hidden_state
|
||||
|
||||
pad_token_id = self.generation_config.pad_token_id
|
||||
|
@ -4531,9 +4366,6 @@ class SeamlessM4TModel(SeamlessM4TPreTrainedModel):
|
|||
decoder_input_ids,
|
||||
past_key_values=None,
|
||||
attention_mask=None,
|
||||
head_mask=None,
|
||||
decoder_head_mask=None,
|
||||
cross_attn_head_mask=None,
|
||||
use_cache=None,
|
||||
encoder_outputs=None,
|
||||
**kwargs,
|
||||
|
@ -4548,9 +4380,6 @@ class SeamlessM4TModel(SeamlessM4TPreTrainedModel):
|
|||
"past_key_values": past_key_values,
|
||||
"decoder_input_ids": decoder_input_ids,
|
||||
"attention_mask": attention_mask,
|
||||
"head_mask": head_mask,
|
||||
"decoder_head_mask": decoder_head_mask,
|
||||
"cross_attn_head_mask": cross_attn_head_mask,
|
||||
"use_cache": use_cache,
|
||||
}
|
||||
|
||||
|
|
|
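These context fragments repeat the same encoder/decoder plumbing across every task head (text-to-text, speech-to-text, text-to-speech, speech-to-speech). For orientation, a minimal usage sketch of the finished model, assuming the `facebook/seamless-m4t-v2-large` checkpoint pushed by the conversion script further down and the same `generate` interface as SeamlessM4T v1:

```python
from transformers import AutoProcessor, SeamlessM4Tv2Model

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

text_inputs = processor(text="Hello, world!", src_lang="eng", return_tensors="pt")

# text -> speech: text decoder produces tokens, the T2U model turns them into
# discrete units, and the HiFi-GAN vocoder renders the waveform
audio = model.generate(**text_inputs, tgt_lang="fra")[0].cpu().numpy().squeeze()

# text -> text only, skipping the T2U model and the vocoder
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```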
@@ -0,0 +1,65 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_torch_available,
)


_import_structure = {
    "configuration_seamless_m4t_v2": ["SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", "SeamlessM4Tv2Config"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_seamless_m4t_v2"] = [
        "SEAMLESS_M4T_V2_PRETRAINED_MODEL_ARCHIVE_LIST",
        "SeamlessM4Tv2ForTextToSpeech",
        "SeamlessM4Tv2ForSpeechToSpeech",
        "SeamlessM4Tv2ForTextToText",
        "SeamlessM4Tv2ForSpeechToText",
        "SeamlessM4Tv2Model",
        "SeamlessM4Tv2PreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_seamless_m4t_v2 import SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, SeamlessM4Tv2Config

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_seamless_m4t_v2 import (
            SEAMLESS_M4T_V2_PRETRAINED_MODEL_ARCHIVE_LIST,
            SeamlessM4Tv2ForSpeechToSpeech,
            SeamlessM4Tv2ForSpeechToText,
            SeamlessM4Tv2ForTextToSpeech,
            SeamlessM4Tv2ForTextToText,
            SeamlessM4Tv2Model,
            SeamlessM4Tv2PreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
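A note on what the `_LazyModule` registration above buys: submodules are only imported on first attribute access, so the configuration stays importable in a torch-free environment. A hypothetical session illustrating the behavior (not part of the diff):

```python
# The config is registered in _import_structure unconditionally, so this works
# even when torch is not installed; nothing torch-dependent is imported yet.
from transformers import SeamlessM4Tv2Config

config = SeamlessM4Tv2Config()  # fine without torch

# The modeling classes are only registered when is_torch_available() is True.
# Without torch, the top-level names resolve to the DummyObject placeholders
# (see the dummy_pt_objects hunk further down), which raise an ImportError on
# instantiation via requires_backends.
from transformers import SeamlessM4Tv2Model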
```

@@ -0,0 +1,426 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SeamlessM4Tv2 model configuration"""

from ...configuration_utils import PretrainedConfig
from ...utils import logging


logger = logging.get_logger(__name__)

SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "facebook/seamless-m4t-v2-large": "https://huggingface.co/facebook/seamless-m4t-v2-large/resolve/main/config.json",
}


class SeamlessM4Tv2Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`~SeamlessM4Tv2Model`]. It is used to instantiate
    a SeamlessM4Tv2 model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the SeamlessM4Tv2
    [facebook/seamless-m4t-v2-large](https://huggingface.co/facebook/seamless-m4t-v2-large) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 256102):
            Vocabulary size of the text modality of the SeamlessM4Tv2 model. Defines the number of different tokens
            that can be represented by the `inputs_ids` passed when calling [`~SeamlessM4Tv2Model`],
            [`~SeamlessM4Tv2ForTextToSpeech`] or [`~SeamlessM4Tv2ForTextToText`].
        t2u_vocab_size (`int`, *optional*, defaults to 10082):
            Unit vocabulary size of the SeamlessM4Tv2 model. Defines the number of different "unit tokens" that can be
            represented by the `inputs_ids` passed when calling the Text-To-Units sub-model of [`~SeamlessM4Tv2Model`],
            [`~SeamlessM4Tv2ForSpeechToSpeech`] or [`~SeamlessM4Tv2ForTextToSpeech`].
        char_vocab_size (`int`, *optional*, defaults to 10943):
            Character vocabulary size of the SeamlessM4Tv2 model. Defines the number of different character tokens that
            can be represented by the `char_inputs_ids` passed when calling the Text-To-Units sub-model of
            [`~SeamlessM4Tv2Model`], [`~SeamlessM4Tv2ForSpeechToSpeech`] or [`~SeamlessM4Tv2ForTextToSpeech`].

        > Parameters shared across sub-models

        hidden_size (`int`, *optional*, defaults to 1024):
            Dimensionality of the "intermediate" layers in the architecture.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that the text encoder and decoder of this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        is_encoder_decoder (`bool`, *optional*, defaults to `True`):
            Whether the model is used as an encoder/decoder or not.
        encoder_layerdrop (`float`, *optional*, defaults to 0.05):
            The LayerDrop probability for the encoders. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        decoder_layerdrop (`float`, *optional*, defaults to 0.05):
            The LayerDrop probability for the decoders. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        activation_function (`str` or `function`, *optional*, defaults to `"relu"`):
            The non-linear activation function (function or string) in the decoder and feed-forward layers. If string,
            `"gelu"`, `"relu"`, `"selu"`, `"swish"` and `"gelu_new"` are supported.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, decoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all attention layers.
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout probability for all activation layers in the model.
        scale_embedding (`bool`, *optional*, defaults to `True`):
            Scale embeddings by dividing by sqrt(d_model).

        > Text encoder and text decoder specific parameters

        encoder_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer text encoder.
        encoder_ffn_dim (`int`, *optional*, defaults to 8192):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text encoder.
        encoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer text encoder.
        decoder_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer text decoder.
        decoder_ffn_dim (`int`, *optional*, defaults to 8192):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text decoder.
        decoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer text decoder.
        decoder_start_token_id (`int`, *optional*, defaults to 3):
            If an encoder-decoder model starts decoding with a different token than _bos_, the id of that token. Only
            applied in the text decoder.
        max_new_tokens (`int`, *optional*, defaults to 256):
            The maximum number of text tokens to generate, ignoring the number of tokens in the prompt.
        pad_token_id (`int`, *optional*, defaults to 0):
            The id of the _padding_ text token. Only applied to the text-decoder model.
        bos_token_id (`int`, *optional*, defaults to 2):
            The id of the _beginning-of-stream_ text token. Only applied to the text-decoder model.
        eos_token_id (`int`, *optional*, defaults to 3):
            The id of the _end-of-stream_ text token. Only applied to the text-decoder model.

        > Speech encoder specific parameters

        speech_encoder_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer speech encoder.
        speech_encoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer speech encoder.
        speech_encoder_intermediate_size (`int`, *optional*, defaults to 4096):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer speech encoder.
        speech_encoder_hidden_act (`str` or `function`, *optional*, defaults to `"swish"`):
            The non-linear activation function (function or string) in the speech encoder. If string, `"gelu"`,
            `"relu"`, `"selu"`, `"swish"` and `"gelu_new"` are supported.
        speech_encoder_dropout (`float`, *optional*, defaults to 0.0):
            The dropout probability for all layers in the speech encoder.
        add_adapter (`bool`, *optional*, defaults to `True`):
            Add an adapter layer on top of the speech encoder.
        speech_encoder_layerdrop (`float`, *optional*, defaults to 0.1):
            The LayerDrop probability for the speech encoder. See the [LayerDrop
            paper](https://arxiv.org/abs/1909.11556) for more details.
        feature_projection_input_dim (`int`, *optional*, defaults to 160):
            Input dimension of the input feature projection of the speech encoder, i.e. the dimension after processing
            input audio with [`SeamlessM4TFeatureExtractor`].
        adaptor_kernel_size (`int`, *optional*, defaults to 8):
            Kernel size of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
        adaptor_stride (`int`, *optional*, defaults to 8):
            Stride of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
        adaptor_dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all layers in the speech adapter.
        num_adapter_layers (`int`, *optional*, defaults to 1):
            Number of convolutional layers that should be used in the adapter network. Only relevant if `add_adapter is
            True`.
        position_embeddings_type (`str`, *optional*, defaults to `"relative_key"`):
            Can be set to `relative_key`. If left as `None`, no relative position embedding is applied. Only
            applied to the speech encoder. For more information on `"relative_key"`, please refer to [Self-Attention
            with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
        conv_depthwise_kernel_size (`int`, *optional*, defaults to 31):
            Kernel size of convolutional depthwise 1D layer in Conformer blocks. Only applied to the speech encoder.
        left_max_position_embeddings (`int`, *optional*, defaults to 64):
            The left clipping value for relative positions.
        right_max_position_embeddings (`int`, *optional*, defaults to 8):
            The right clipping value for relative positions.
        speech_encoder_chunk_size (`int`, *optional*, defaults to 20000):
            The size of each attention chunk.
        speech_encoder_left_chunk_num (`int`, *optional*, defaults to 128):
            Number of chunks to the left of the current chunk that the attention is allowed to look at.

        > Text-To-Unit (t2u) model specific parameters

        t2u_bos_token_id (`int`, *optional*, defaults to 0):
            The id of the _beginning-of-stream_ unit token. Only applied to the text-to-unit seq2seq model.
        t2u_pad_token_id (`int`, *optional*, defaults to 1):
            The id of the _padding_ unit token. Only applied to the text-to-unit seq2seq model.
        t2u_eos_token_id (`int`, *optional*, defaults to 2):
            The id of the _end-of-stream_ unit token. Only applied to the text-to-unit seq2seq model.
        t2u_encoder_layers (`int`, *optional*, defaults to 6):
            Number of hidden layers in the Transformer text-to-unit encoder.
        t2u_encoder_ffn_dim (`int`, *optional*, defaults to 8192):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text-to-unit encoder.
        t2u_encoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer text-to-unit encoder.
        t2u_decoder_layers (`int`, *optional*, defaults to 6):
            Number of hidden layers in the Transformer text-to-unit decoder.
        t2u_decoder_ffn_dim (`int`, *optional*, defaults to 8192):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer text-to-unit decoder.
        t2u_decoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer text-to-unit decoder.
        t2u_max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that the text-to-unit component of this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        t2u_variance_predictor_embed_dim (`int`, *optional*, defaults to 1024):
            The projection dimension of the text-to-unit's duration predictor.
        t2u_variance_predictor_hidden_dim (`int`, *optional*, defaults to 256):
            Internal dimension of the text-to-unit's duration predictor.
        t2u_variance_predictor_kernel_size (`int`, *optional*, defaults to 3):
            Kernel size of the convolutional layers of the text-to-unit's duration predictor.
        t2u_variance_pred_dropout (`float`, *optional*, defaults to 0.5):
            The dropout probability of the text-to-unit's duration predictor.

        > Hifi-Gan Vocoder specific parameters

        sampling_rate (`int`, *optional*, defaults to 16000):
            The sampling rate at which the output audio will be generated, expressed in hertz (Hz).
        upsample_initial_channel (`int`, *optional*, defaults to 512):
            The number of input channels into the hifi-gan upsampling network. Applies to the vocoder only.
        upsample_rates (`Tuple[int]` or `List[int]`, *optional*, defaults to `[5, 4, 4, 2, 2]`):
            A tuple of integers defining the stride of each 1D convolutional layer in the vocoder upsampling network.
            The length of *upsample_rates* defines the number of convolutional layers and has to match the length of
            *upsample_kernel_sizes*. Applies to the vocoder only.
        upsample_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[11, 8, 8, 4, 4]`):
            A tuple of integers defining the kernel size of each 1D convolutional layer in the vocoder upsampling
            network. The length of *upsample_kernel_sizes* defines the number of convolutional layers and has to match
            the length of *upsample_rates*. Applies to the vocoder only.
        resblock_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[3, 7, 11]`):
            A tuple of integers defining the kernel sizes of the vocoder 1D convolutional layers in the multi-receptive
            field fusion (MRF) module. Applies to the vocoder only.
        resblock_dilation_sizes (`Tuple[Tuple[int]]` or `List[List[int]]`, *optional*, defaults to `[[1, 3, 5], [1, 3, 5], [1, 3, 5]]`):
            A nested tuple of integers defining the dilation rates of the vocoder dilated 1D convolutional layers in
            the multi-receptive field fusion (MRF) module. Applies to the vocoder only.
        leaky_relu_slope (`float`, *optional*, defaults to 0.1):
            The angle of the negative slope used by the leaky ReLU activation in the vocoder. Applies to the vocoder
            only.
        unit_hifi_gan_vocab_size (`int`, *optional*, defaults to 10000):
            Vocabulary size of the SeamlessM4Tv2 vocoder. Defines the number of different unit tokens that can be
            represented by the `inputs_ids` passed when calling the vocoder of [`~SeamlessM4Tv2Model`],
            [`~SeamlessM4Tv2ForSpeechToSpeech`] or [`~SeamlessM4Tv2ForTextToSpeech`].
        unit_embed_dim (`int`, *optional*, defaults to 1280):
            The projection dimension of the input ids given to the hifi-gan vocoder. Applies to the vocoder only.
        lang_embed_dim (`int`, *optional*, defaults to 256):
            The projection dimension of the target language given to the hifi-gan vocoder. Applies to the vocoder only.
        spkr_embed_dim (`int`, *optional*, defaults to 256):
            The projection dimension of the speaker id given to the hifi-gan vocoder. Applies to the vocoder only.
        vocoder_num_langs (`int`, *optional*, defaults to 36):
            Number of langs supported by the vocoder. Might be different from `t2u_num_langs`.
        vocoder_num_spkrs (`int`, *optional*, defaults to 200):
            Number of speakers supported by the vocoder.
        variance_predictor_kernel_size (`int`, *optional*, defaults to 3):
            Kernel size of the duration predictor. Applies to the vocoder only.
        var_pred_dropout (`float`, *optional*, defaults to 0.5):
            The dropout probability of the duration predictor. Applies to the vocoder only.
        vocoder_offset (`int`, *optional*, defaults to 4):
            Offset the unit token ids by this number to account for symbol tokens. Applies to the vocoder only.

    ```python
    >>> from transformers import SeamlessM4Tv2Model, SeamlessM4Tv2Config

    >>> # Initializing a SeamlessM4Tv2 "facebook/seamless-m4t-v2-large" style configuration
    >>> configuration = SeamlessM4Tv2Config()

    >>> # Initializing a model from the "facebook/seamless-m4t-v2-large" style configuration
    >>> model = SeamlessM4Tv2Model(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "seamless_m4t_v2"

    def __init__(
        self,
        vocab_size=256102,
        t2u_vocab_size=10082,
        char_vocab_size=10943,
        # shared config
        hidden_size=1024,
        initializer_range=0.02,
        layer_norm_eps=1e-5,
        use_cache=True,
        max_position_embeddings=4096,
        is_encoder_decoder=True,
        encoder_layerdrop=0.05,
        decoder_layerdrop=0.05,
        activation_function="relu",
        dropout=0.1,
        attention_dropout=0.1,
        activation_dropout=0.0,
        scale_embedding=True,
        # text encoder|decoder
        encoder_layers=24,
        encoder_ffn_dim=8192,
        encoder_attention_heads=16,
        decoder_layers=24,
        decoder_ffn_dim=8192,
        decoder_attention_heads=16,
        decoder_start_token_id=3,
        max_new_tokens=256,
        pad_token_id=0,
        bos_token_id=2,
        eos_token_id=3,
        # speech_encoder
        speech_encoder_layers=24,
        speech_encoder_attention_heads=16,
        speech_encoder_intermediate_size=4096,
        speech_encoder_hidden_act="swish",
        speech_encoder_dropout=0.0,
        add_adapter=True,
        speech_encoder_layerdrop=0.1,
        feature_projection_input_dim=160,
        adaptor_kernel_size=8,
        adaptor_stride=8,
        adaptor_dropout=0.1,
        num_adapter_layers=1,
        position_embeddings_type="relative_key",
        conv_depthwise_kernel_size=31,
        left_max_position_embeddings=64,
        right_max_position_embeddings=8,
        speech_encoder_chunk_size=20000,
        speech_encoder_left_chunk_num=128,
        # t2u config
        t2u_bos_token_id=0,
        t2u_pad_token_id=1,
        t2u_eos_token_id=2,
        t2u_encoder_layers=6,
        t2u_encoder_ffn_dim=8192,
        t2u_encoder_attention_heads=16,
        t2u_decoder_layers=6,
        t2u_decoder_ffn_dim=8192,
        t2u_decoder_attention_heads=16,
        t2u_max_position_embeddings=4096,
        t2u_variance_predictor_embed_dim=1024,
        t2u_variance_predictor_hidden_dim=256,
        t2u_variance_predictor_kernel_size=3,
        t2u_variance_pred_dropout=0.5,
        # hifi-gan vocoder config
        sampling_rate=16000,
        upsample_initial_channel=512,
        upsample_rates=[5, 4, 4, 2, 2],
        upsample_kernel_sizes=[11, 8, 8, 4, 4],
        resblock_kernel_sizes=[3, 7, 11],
        resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
        leaky_relu_slope=0.1,
        # specific to Code Hifi-Gan
        unit_hifi_gan_vocab_size=10000,
        unit_embed_dim=1280,
        lang_embed_dim=256,
        spkr_embed_dim=256,
        vocoder_num_langs=36,
        vocoder_num_spkrs=200,
        variance_predictor_kernel_size=3,
        var_pred_dropout=0.5,
        vocoder_offset=4,
        **kwargs,
    ):
        # overall_config
        self.vocab_size = vocab_size
        self.t2u_vocab_size = t2u_vocab_size
        self.char_vocab_size = char_vocab_size
        self.hidden_size = hidden_size
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.max_position_embeddings = max_position_embeddings
        self.use_cache = use_cache
        self.max_new_tokens = max_new_tokens
        self.encoder_layerdrop = encoder_layerdrop
        self.decoder_layerdrop = decoder_layerdrop
        self.activation_function = activation_function
        self.dropout = dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.scale_embedding = scale_embedding
        # for proper config init
        self.num_attention_heads = decoder_attention_heads
        self.num_hidden_layers = decoder_layers

        # text|unit encoder|decoder
        self.encoder_layers = encoder_layers
        self.encoder_ffn_dim = encoder_ffn_dim
        self.encoder_attention_heads = encoder_attention_heads
        self.decoder_layers = decoder_layers
        self.decoder_ffn_dim = decoder_ffn_dim
        self.decoder_attention_heads = decoder_attention_heads

        # speech_encoder
        self.speech_encoder_layers = speech_encoder_layers
        self.speech_encoder_hidden_act = speech_encoder_hidden_act
        self.speech_encoder_dropout = speech_encoder_dropout
        self.speech_encoder_attention_heads = speech_encoder_attention_heads
        self.speech_encoder_layerdrop = speech_encoder_layerdrop
        self.speech_encoder_intermediate_size = speech_encoder_intermediate_size
        self.feature_projection_input_dim = feature_projection_input_dim
        self.adaptor_kernel_size = adaptor_kernel_size
        self.adaptor_stride = adaptor_stride
        self.adaptor_dropout = adaptor_dropout
        self.num_adapter_layers = num_adapter_layers
        self.position_embeddings_type = position_embeddings_type
        self.conv_depthwise_kernel_size = conv_depthwise_kernel_size
        self.add_adapter = add_adapter
        self.left_max_position_embeddings = left_max_position_embeddings
        self.right_max_position_embeddings = right_max_position_embeddings
        self.speech_encoder_chunk_size = speech_encoder_chunk_size
        self.speech_encoder_left_chunk_num = speech_encoder_left_chunk_num

        # t2u config
        self.t2u_bos_token_id = t2u_bos_token_id
        self.t2u_pad_token_id = t2u_pad_token_id
        self.t2u_eos_token_id = t2u_eos_token_id
        self.t2u_encoder_layers = t2u_encoder_layers
        self.t2u_encoder_ffn_dim = t2u_encoder_ffn_dim
        self.t2u_encoder_attention_heads = t2u_encoder_attention_heads
        self.t2u_decoder_layers = t2u_decoder_layers
        self.t2u_decoder_ffn_dim = t2u_decoder_ffn_dim
        self.t2u_decoder_attention_heads = t2u_decoder_attention_heads
        self.t2u_max_position_embeddings = t2u_max_position_embeddings
        self.t2u_variance_predictor_embed_dim = t2u_variance_predictor_embed_dim
        self.t2u_variance_predictor_hidden_dim = t2u_variance_predictor_hidden_dim
        self.t2u_variance_predictor_kernel_size = t2u_variance_predictor_kernel_size
        self.t2u_variance_pred_dropout = t2u_variance_pred_dropout

        # hifi-gan vocoder config
        # original parameters specific to Hifi-Gan
        self.sampling_rate = sampling_rate
        self.upsample_initial_channel = upsample_initial_channel
        self.upsample_rates = upsample_rates
        self.upsample_kernel_sizes = upsample_kernel_sizes
        self.resblock_kernel_sizes = resblock_kernel_sizes
        self.resblock_dilation_sizes = resblock_dilation_sizes
        self.leaky_relu_slope = leaky_relu_slope

        # specific to Code Hifi-Gan
        self.unit_hifi_gan_vocab_size = unit_hifi_gan_vocab_size
        self.unit_embed_dim = unit_embed_dim
        self.lang_embed_dim = lang_embed_dim
        self.spkr_embed_dim = spkr_embed_dim
        self.vocoder_num_langs = vocoder_num_langs
        self.vocoder_num_spkrs = vocoder_num_spkrs
        self.variance_predictor_kernel_size = variance_predictor_kernel_size
        self.var_pred_dropout = var_pred_dropout
        self.vocoder_offset = vocoder_offset

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            decoder_start_token_id=decoder_start_token_id,
            is_encoder_decoder=is_encoder_decoder,
            max_position_embeddings=max_position_embeddings,
            **kwargs,
        )
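Because every sub-model reads its dimensions from this single config, a scaled-down variant for quick experiments only needs a few overrides. A small sketch using the parameters defined in the signature above (the layer counts are arbitrary, untested choices; hidden sizes and head counts keep their mutually compatible defaults):

```python
from transformers import SeamlessM4Tv2Config, SeamlessM4Tv2Model

# Shrink only the layer counts so the model still builds cleanly.
config = SeamlessM4Tv2Config(
    encoder_layers=2,
    decoder_layers=2,
    speech_encoder_layers=2,
    t2u_encoder_layers=2,
    t2u_decoder_layers=2,
)
model = SeamlessM4Tv2Model(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```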
@@ -0,0 +1,405 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Converting Meta SeamlessM4Tv2 checkpoints from seamless_communication to HF."""


import argparse
import os
from pathlib import Path

import torch
from accelerate.utils.modeling import find_tied_parameters
from seamless_communication.inference import Translator

from transformers import (
    SeamlessM4TFeatureExtractor,
    SeamlessM4TProcessor,
    SeamlessM4TTokenizer,
    SeamlessM4Tv2Config,
    SeamlessM4Tv2Model,
)
from transformers.utils import logging


# fmt: off
UNIT_SUPPORTED_LANGUAGES = ["__arb__", "__ben__", "__cat__", "__ces__", "__cmn__", "__cym__", "__dan__", "__deu__", "__eng__", "__est__", "__fin__", "__fra__", "__hin__", "__ind__", "__ita__", "__jpn__", "__kan__", "__kor__", "__mlt__", "__nld__", "__pes__", "__pol__", "__por__", "__ron__", "__rus__", "__slk__", "__spa__", "__swe__", "__swh__", "__tam__", "__tel__", "__tgl__", "__tha__", "__tur__", "__ukr__", "__urd__", "__uzn__", "__vie__", ]
# fmt: on

# fmt: off
VOCODER_SUPPORTED_LANGUAGES = ["__arb__", "__ben__", "__cat__", "__ces__", "__cmn__", "__cym__", "__dan__", "__deu__", "__eng__", "__est__", "__fin__", "__fra__", "__hin__", "__ind__", "__ita__", "__jpn__", "__kor__", "__mlt__", "__nld__", "__pes__", "__pol__", "__por__", "__ron__", "__rus__", "__slk__", "__spa__", "__swe__", "__swh__", "__tel__", "__tgl__", "__tha__", "__tur__", "__ukr__", "__urd__", "__uzn__", "__vie__",]
# fmt: on

# fmt: off
LARGE_SUPPORTED_LANGUAGES = ["afr","amh","arb","ary","arz","asm","azj","bel","ben","bos","bul","cat","ceb","ces","ckb","cmn","cmn_Hant","cym","dan","deu","ell","eng","est","eus","fin","fra","fuv","gaz","gle","glg","guj","heb","hin","hrv","hun","hye","ibo","ind","isl","ita","jav","jpn","kan","kat","kaz","khk","khm","kir","kor","lao","lit","lug","luo","lvs","mai","mal","mar","mkd","mlt","mni","mya","nld","nno","nob","npi","nya","ory","pan","pbt","pes","pol","por","ron","rus","sat","slk","slv","sna","snd","som","spa","srp","swe","swh","tam","tel","tgk","tgl","tha","tur","ukr","urd","uzn","vie","yor","yue","zlm","zul",]
# fmt: on


def assert_param_count(model_1, model_2):
    count_1 = sum(p[1].numel() for p in model_1.named_parameters() if "final_proj" not in p[0])
    count_2 = sum(p[1].numel() for p in model_2.named_parameters() if "final_proj" not in p[0])
    assert count_1 == count_2, f"{model_1.__class__}: {count_1} != {model_2.__class__}: {count_2}"


def param_count(model):
    return sum(p[1].numel() for p in model.named_parameters() if "final_proj" not in p[0])


def _grab_best_device(use_gpu=True):
    if torch.cuda.device_count() > 0 and use_gpu:
        device = "cuda"
    else:
        device = "cpu"
    return torch.device(device)


logging.set_verbosity_info()
logger = logging.get_logger(__name__)

vocoder_convert_list = [
    ("ups", "hifi_gan.upsampler"),
    ("conv_pre", "hifi_gan.conv_pre"),
    ("resblocks", "hifi_gan.resblocks"),
    ("conv_post", "hifi_gan.conv_post"),
    ("lang", "language_embedding"),
    ("spkr", "speaker_embedding"),
    ("dict.", "unit_embedding."),
    ("dur_predictor.conv1.0", "dur_predictor.conv1"),
    ("dur_predictor.conv2.0", "dur_predictor.conv2"),
]

# order is important
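# (more specific patterns must come first: e.g. "self_attn.output_proj" has to be
# rewritten before the generic "output_proj" rule further down would match it)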
wav2vec_convert_list = [
    ("speech_encoder_frontend.model_dim_proj", "feature_projection.projection"),
    ("speech_encoder_frontend.post_extract_layer_norm", "feature_projection.layer_norm"),
    ("speech_encoder_frontend.pos_encoder.conv", "encoder.pos_conv_embed.conv"),
    ("speech_encoder.inner.layers", "encoder.layers"),
    ("speech_encoder.inner_layer_norm", "encoder.layer_norm"),
    ("speech_encoder.adaptor_layers", "adapter.layers"),
    ("inner_proj", "intermediate_dense"),
    ("self_attn.output_proj", "self_attn.linear_out"),
    ("output_proj", "output_dense"),
    ("self_attn.k_proj", "self_attn.linear_k"),
    ("self_attn.v_proj", "self_attn.linear_v"),
    ("self_attn.q_proj", "self_attn.linear_q"),
    ("self_attn.sdpa.u_bias", "self_attn.pos_bias_u"),
    ("self_attn.sdpa.v_bias", "self_attn.pos_bias_v"),
    ("self_attn.sdpa.rel_k_embed", "self_attn.distance_embedding"),
    ("self_attn.sdpa.r_proj", "self_attn.linear_pos"),
    ("conv.pointwise_conv1", "conv_module.pointwise_conv1"),
    ("conv.pointwise_conv2", "conv_module.pointwise_conv2"),
    ("conv.depthwise_conv", "conv_module.depthwise_conv"),
    ("conv.batch_norm", "conv_module.batch_norm"),
    ("conv.layer_norm", "conv_module.depthwise_layer_norm"),
    ("conv_layer_norm", "conv_module.layer_norm"),
    ("speech_encoder.proj1", "intermediate_ffn.intermediate_dense"),
    ("speech_encoder.proj2", "intermediate_ffn.output_dense"),
    ("speech_encoder.layer_norm", "inner_layer_norm"),
]

t2u_convert_list = [
    ("t2u_model.final_proj", "lm_head"),
    ("t2u_model.", "model."),
    ("encoder_decoder_attn_layer_norm", "cross_attention_layer_norm"),
    ("encoder_decoder_attn", "cross_attention"),
    ("linear_k", "k_proj"),
    ("linear_v", "v_proj"),
    ("linear_q", "q_proj"),
    ("ffn.inner_proj", "ffn.fc1"),
    ("ffn.output_proj", "ffn.fc2"),
    ("output_proj", "out_proj"),
    ("decoder_frontend.embed_char", "decoder.embed_char"),
    ("decoder_frontend.pos_emb_alpha_char", "decoder.pos_emb_alpha_char"),
    ("decoder_frontend.embed", "decoder.embed_tokens"),
    ("decoder_frontend.pos_emb_alpha", "decoder.pos_emb_alpha"),
    ("conv1d.conv", "conv"),
    ("conv1d_layer_norm", "conv_layer_norm"),
    ("decoder_frontend.variance_adaptor", "decoder"),
    ("duration_predictor.conv1.0", "duration_predictor.conv1"),
    ("duration_predictor.conv2.0", "duration_predictor.conv2"),
]

text_convert_list = [
    ("text_encoder.", ""),
    ("text_decoder.", ""),
    ("text_encoder_frontend.embed", "embed_tokens"),
    ("text_decoder_frontend.embed", "embed_tokens"),
    ("encoder_decoder_attn_layer_norm", "cross_attention_layer_norm"),
    ("encoder_decoder_attn", "cross_attention"),
    ("linear_k", "k_proj"),
    ("linear_v", "v_proj"),
    ("linear_q", "q_proj"),
    ("ffn.inner_proj", "ffn.fc1"),
    ("ffn.output_proj", "ffn.fc2"),
    ("output_proj", "out_proj"),
    ("final_proj", "lm_head"),
]

CUR_PATH = os.path.dirname(os.path.abspath(__file__))
default_cache_dir = os.path.join(os.path.expanduser("~"), ".cache")
CACHE_DIR = os.path.join(os.getenv("XDG_CACHE_HOME", default_cache_dir), "huggingface", "hub")


def _load_hf_config():
    return SeamlessM4Tv2Config()


def _convert_model(
    original_model,
    hf_model,
    convert_list,
    device,
    unwanted_prefix="model.",
    filter_state_dict="speech",
    exclude_state_dict=None,
):
    state_dict = original_model.state_dict()

    # filter func
    if isinstance(filter_state_dict, str):

        def filter_func(x):
            return filter_state_dict in x[0]

    else:

        def filter_func(item):
            if exclude_state_dict is not None and exclude_state_dict in item[0]:
                return False
            for filter_el in filter_state_dict:
                if filter_el in item[0]:
                    return True

            return False

    state_dict = dict(filter(filter_func, state_dict.items()))

    for k in list(state_dict.keys()):
        new_k = k[len(unwanted_prefix) :]
        for old_layer_name, new_layer_name in convert_list:
            if old_layer_name in new_k:
                new_k = new_k.replace(old_layer_name, new_layer_name)

        # must do it by hand
        if ".layer_norm" in new_k and new_k.split(".layer_norm")[0][-1].isnumeric():
            new_k = new_k.replace("layer_norm", "final_layer_norm")

        state_dict[new_k] = state_dict.pop(k)

    extra_keys = set(state_dict.keys()) - set(hf_model.state_dict().keys())
    missing_keys = set(hf_model.state_dict().keys()) - set(state_dict.keys())
    missing_keys = {k for k in missing_keys if "final_logits_bias" not in k}
    if len(extra_keys) != 0:
        raise ValueError(f"extra keys found: {extra_keys}")
    if len(missing_keys) != 0:
        raise ValueError(f"missing keys: {missing_keys}")
    hf_model.load_state_dict(state_dict, strict=False)
    n_params = param_count(hf_model)

    logger.info(f"model loaded: {round(n_params / 1e6, 1)}M params")

    hf_model.eval()
    hf_model.to(device)
    del state_dict

    return hf_model
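`_convert_model` is, at heart, substring rewriting: drop `unwanted_prefix`, then apply the `(old, new)` pairs in list order. A hedged trace of what a single vocoder key would go through (the key name is illustrative, not taken from the actual checkpoint):

```python
key = "vocoder.code_generator.ups.0.weight"    # hypothetical original key
key = key[len("vocoder.code_generator."):]     # strip unwanted_prefix -> "ups.0.weight"
for old, new in vocoder_convert_list:
    if old in key:
        key = key.replace(old, new)            # ("ups", "hifi_gan.upsampler") fires here
print(key)                                     # -> "hifi_gan.upsampler.0.weight"
```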
|
||||
|
||||
|
||||
def load_model(save_dir, model_type, repo_id):
|
||||
"""
|
||||
Meta SeamlessM4Tv2 is made of 8 main components:
|
||||
- speech_encoder (#1) and speech_encoder_frontend (#2)
|
||||
- t2u_model (#3)
|
||||
- text_encoder (#4) and text_encoder_frontend (#5)
|
||||
- text_decoder (#6) [and text_decoder_frontend (#5) = equals to text_encoder_frontend]
|
||||
- final_proj (#7)
|
||||
- vocoder (#8)
|
||||
"""
|
||||
device = _grab_best_device()
|
||||
name = "seamlessM4T_v2_large"
|
||||
|
||||
original_model = Translator(name, "vocoder_v2", device, dtype=torch.float32)
|
||||
|
||||
######### TOKENIZER
|
||||
|
||||
langs = LARGE_SUPPORTED_LANGUAGES
|
||||
langs = [f"__{lang}__" for lang in langs]
|
||||
vocab_file = os.path.join(os.path.expanduser("~"), "tokenizer", model_type, "tokenizer.model")
|
||||
|
||||
save_dir = os.path.join(save_dir, name)
|
||||
Path(save_dir).mkdir(exist_ok=True)
|
||||
|
||||
tokenizer = SeamlessM4TTokenizer(vocab_file, additional_special_tokens=langs)
|
||||
|
||||
sanity_check_lang_id = tokenizer.convert_tokens_to_ids("__fra__")
|
||||
|
||||
tokenizer.save_pretrained(save_dir)
|
||||
tokenizer = SeamlessM4TTokenizer.from_pretrained(save_dir)
|
||||
|
||||
if sanity_check_lang_id != tokenizer.convert_tokens_to_ids("__fra__"):
|
||||
raise ValueError(
|
||||
f"Error in tokenizer saving/loading - __fra__ lang id is not coherent: {sanity_check_lang_id} vs {tokenizer.convert_tokens_to_ids('__fra__')}"
|
||||
)
|
||||
|
||||
####### get language to ids dict
|
||||
text_decoder_lang_code_to_id = {lang.replace("__", ""): tokenizer.convert_tokens_to_ids(lang) for lang in langs}
|
||||
# offset: vocoder unit vocab size + 5 (for EOS/PAD/BOS/UNK/MSK) + len(supported_languages)
|
||||
t2u_lang_code_to_id = {
|
||||
code.replace("__", ""): i + 10005 + len(UNIT_SUPPORTED_LANGUAGES)
|
||||
for i, code in enumerate(UNIT_SUPPORTED_LANGUAGES)
|
||||
}
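    # e.g. with the 38 entries of UNIT_SUPPORTED_LANGUAGES above, "arb" (index 0)
    # maps to 10005 + 38 + 0 = 10043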
|
||||
vocoder_lang_code_to_id = {code.replace("__", ""): i for i, code in enumerate(VOCODER_SUPPORTED_LANGUAGES)}
|
||||
|
||||
######### FE
|
||||
|
||||
fe = SeamlessM4TFeatureExtractor(language_code=langs)
|
||||
|
||||
fe.save_pretrained(save_dir)
|
||||
fe = SeamlessM4TFeatureExtractor.from_pretrained(save_dir)
|
||||
|
||||
processor = SeamlessM4TProcessor(feature_extractor=fe, tokenizer=tokenizer)
|
||||
processor.save_pretrained(save_dir)
|
||||
processor.push_to_hub(repo_id=repo_id, create_pr=True)
|
||||
|
||||
processor = SeamlessM4TProcessor.from_pretrained(save_dir)
|
||||
|
||||
######## Model
|
||||
|
||||
# init config
|
||||
hf_config = _load_hf_config()
|
||||
|
||||
######## get id_to_text and char_to_id from original model tokenizers
|
||||
id_to_text = {i: original_model.text_tokenizer.model.index_to_token(i) for i in range(hf_config.vocab_size)}
|
||||
char_to_id = {
|
||||
original_model.model.t2u_model.decoder_frontend.char_tokenizer.model.index_to_token(i): i for i in range(10904)
|
||||
}
|
||||
|
||||
# init model
|
||||
hf_model = SeamlessM4Tv2Model(hf_config)
|
||||
|
||||
hf_model.generation_config.__setattr__("text_decoder_lang_to_code_id", text_decoder_lang_code_to_id)
|
||||
hf_model.generation_config.__setattr__("t2u_lang_code_to_id", t2u_lang_code_to_id)
|
||||
hf_model.generation_config.__setattr__("vocoder_lang_code_to_id", vocoder_lang_code_to_id)
|
||||
hf_model.generation_config.__setattr__("id_to_text", id_to_text)
|
||||
hf_model.generation_config.__setattr__("char_to_id", char_to_id)
|
||||
|
||||
# -1. take care of vocoder
|
||||
# similarly to speech T5 must apply and remove weight norm
|
||||
hf_model.vocoder.apply_weight_norm()
|
||||
hf_model.vocoder = _convert_model(
|
||||
original_model,
|
||||
hf_model.vocoder,
|
||||
vocoder_convert_list,
|
||||
device,
|
||||
unwanted_prefix="vocoder.code_generator.",
|
||||
filter_state_dict="vocoder",
|
||||
)
|
||||
hf_model.vocoder.remove_weight_norm()
|
||||
|
||||
# 1. take care of speech encoder
|
||||
wav2vec = hf_model.speech_encoder
|
||||
hf_model.speech_encoder = _convert_model(
|
||||
original_model, wav2vec, wav2vec_convert_list, device, unwanted_prefix="model.", filter_state_dict="speech"
|
||||
)
|
||||
|
||||
# 2. take care of t2u
|
||||
|
||||
hf_model.t2u_model = _convert_model(
|
||||
original_model,
|
||||
hf_model.t2u_model,
|
||||
t2u_convert_list,
|
||||
device,
|
||||
unwanted_prefix="model.",
|
||||
filter_state_dict="t2u_model",
|
||||
)
|
||||
|
||||
# 3. take care of text encoder
|
||||
hf_model.text_encoder = _convert_model(
|
||||
original_model,
|
||||
hf_model.text_encoder,
|
||||
text_convert_list,
|
||||
device,
|
||||
unwanted_prefix="model.",
|
||||
filter_state_dict=["model.text_encoder"],
|
||||
exclude_state_dict="t2u_model",
|
||||
)
|
||||
|
||||
# 4. take care of text decoder
|
||||
hf_model.text_decoder = _convert_model(
|
||||
original_model,
|
||||
hf_model.text_decoder,
|
||||
text_convert_list,
|
||||
device,
|
||||
unwanted_prefix="model.",
|
||||
filter_state_dict=["model.text_decoder"],
|
||||
exclude_state_dict="t2u_model",
|
||||
)
|
||||
|
||||
# 5. take care of final proj
|
||||
hf_model.lm_head = _convert_model(
|
||||
original_model,
|
||||
hf_model.lm_head,
|
||||
[("final_proj.", "")],
|
||||
device,
|
||||
unwanted_prefix="model.",
|
||||
filter_state_dict=["model.final_proj"],
|
||||
exclude_state_dict="t2u_model",
|
||||
)
|
||||
|
||||
# sanity check
|
||||
print(find_tied_parameters(hf_model))
|
||||
|
||||
count_1 = param_count(hf_model)
|
||||
count_2 = param_count(original_model)
|
||||
|
||||
print(f"HF MODEL:{count_1}, ORIGINAL_MODEL: {count_2}, diff:{count_1 - count_2}")
|
||||
print(f"HF MODEL excluding embeddings:{hf_model.num_parameters(exclude_embeddings=True)}")
|
||||
|
||||
del original_model
|
||||
|
||||
hf_model.generation_config._from_model_config = False
|
||||
hf_model.save_pretrained(save_dir)
|
||||
hf_model.push_to_hub(repo_id=repo_id, create_pr=True)
|
||||
hf_model = SeamlessM4Tv2Model.from_pretrained(save_dir)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
# Required parameters
|
||||
|
||||
parser.add_argument(
|
||||
"--model_type",
|
||||
default="large",
|
||||
type=str,
|
||||
help="Model type.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--save_dir",
|
||||
default="/home/ubuntu/weights_v2",
|
||||
type=str,
|
||||
help="Path to the output PyTorch model.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--repo_id",
|
||||
default="facebook/seamless-m4t-v2-large",
|
||||
type=str,
|
||||
help="Repo ID.",
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
load_model(args.save_dir, args.model_type, args.repo_id)
|
File diff suppressed because it is too large
Load Diff
|
@ -7213,6 +7213,51 @@ class SeamlessM4TTextToUnitModel(metaclass=DummyObject):
|
|||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
SEAMLESS_M4T_V2_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
class SeamlessM4Tv2ForSpeechToSpeech(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class SeamlessM4Tv2ForSpeechToText(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class SeamlessM4Tv2ForTextToSpeech(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class SeamlessM4Tv2ForTextToText(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class SeamlessM4Tv2Model(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class SeamlessM4Tv2PreTrainedModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
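Each placeholder above follows the same pattern: a `DummyObject` metaclass plus `requires_backends`, so an install without PyTorch fails loudly at instantiation time rather than at import time. A sketch of the resulting behavior in a torch-free environment (hypothetical session):

```python
from transformers import SeamlessM4Tv2Model  # resolves to the dummy class above

try:
    model = SeamlessM4Tv2Model()
except ImportError as err:
    print(err)  # requires_backends explains that the "torch" backend is missing
```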
|
||||
|
||||
|
||||
|
|
|
@ -16,7 +16,6 @@
|
|||
|
||||
|
||||
import copy
|
||||
import inspect
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
|
@@ -479,10 +478,6 @@ class SeamlessM4TModelWithSpeechInputTest(ModelTesterMixin, unittest.TestCase):
    def test_save_load_fast_init_to_base(self):
        pass

    @unittest.skip(reason="The speech encoder doesn't support head masking")
    def test_generate_with_head_masking(self):
        pass

    @unittest.skip(reason="SeamlessM4TModel can take input_ids or input_features")
    def test_forward_signature(self):
        pass
@@ -714,43 +709,6 @@ class SeamlessM4TModelWithTextInputTest(
    def test_model_weights_reload_no_missing_tied_weights(self):
        pass

    def test_generate_with_head_masking(self):
        """Test designed for encoder-decoder models to ensure the attention head masking is used."""
        attention_names = ["encoder_attentions", "decoder_attentions", "cross_attentions"]
        for model_class in self.all_generative_model_classes:
            config, input_ids, attention_mask, max_length = self._get_input_ids_and_config()

            model = model_class(config).to(torch_device).eval()

            head_masking = {
                "head_mask": torch.zeros(config.encoder_layers, config.encoder_attention_heads, device=torch_device),
                "decoder_head_mask": torch.zeros(
                    config.decoder_layers, config.decoder_attention_heads, device=torch_device
                ),
                "cross_attn_head_mask": torch.zeros(
                    config.decoder_layers, config.decoder_attention_heads, device=torch_device
                ),
            }

            signature = inspect.signature(model.forward)
            # We want to test only models where encoder/decoder head masking is implemented
            if not set(head_masking.keys()) < {*signature.parameters.keys()}:
                continue

            for attn_name, (name, mask) in zip(attention_names, head_masking.items()):
                out = model.generate(
                    input_ids,
                    attention_mask=attention_mask,
                    num_beams=1,
                    output_attentions=True,
                    return_dict_in_generate=True,
                    remove_invalid_values=True,
                    **{name: mask},
                )
                # We check the state of decoder_attentions and cross_attentions just from the last step
                attn_weights = out[attn_name] if attn_name == attention_names[0] else out[attn_name][-1]
                self.assertEqual(sum([w.sum().item() for w in attn_weights]), 0.0)

    @unittest.skip(reason="SeamlessM4TModel can take input_ids or input_features")
    def test_forward_signature(self):
        pass
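The removed test above relies on a convention worth spelling out: a head mask has one entry per (layer, head), with 0.0 disabling a head, so an all-zeros mask must drive every returned attention map to zero. A standalone illustration of that arithmetic (shapes chosen arbitrarily):

import torch

num_layers, num_heads = 4, 8
head_mask = torch.zeros(num_layers, num_heads)  # disable every head, as in the test

# Inside attention, probabilities are multiplied elementwise by the mask,
# so a zero mask zeroes the attention weights, which the test sums and checks.
attn_probs = torch.softmax(torch.randn(2, num_heads, 5, 5), dim=-1)
masked = attn_probs * head_mask[0].view(1, num_heads, 1, 1)
assert masked.sum().item() == 0.0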
File diff suppressed because it is too large
@@ -96,6 +96,21 @@ SPECIAL_CASES_TO_ALLOW = {
        "t2u_encoder_layers",
        "t2u_max_position_embeddings",
    ],
    # Actually used in the config or generation config, in which case they are necessary for the sub-components' generation
    "SeamlessM4Tv2Config": [
        "max_new_tokens",
        "t2u_decoder_attention_heads",
        "t2u_decoder_ffn_dim",
        "t2u_decoder_layers",
        "t2u_encoder_attention_heads",
        "t2u_encoder_ffn_dim",
        "t2u_encoder_layers",
        "t2u_max_position_embeddings",
        "t2u_variance_pred_dropout",
        "t2u_variance_predictor_embed_dim",
        "t2u_variance_predictor_hidden_dim",
        "t2u_variance_predictor_kernel_size",
    ],
}
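`check_config_attributes.py` flags config attributes that are never read in the corresponding modeling files; the allow-list above exempts attributes that are legitimately consumed only when building the T2U sub-model or its generation config. A rough sketch of the kind of scan involved (the real script is more thorough; the regex and file handling here are assumptions):

import re
from pathlib import Path


def unused_config_attributes(config_attrs, modeling_files, allowed=()):
    # Sketch: report attributes that never appear as `config.<attr>` in any
    # modeling source file and are not explicitly allow-listed.
    source = "\n".join(Path(f).read_text() for f in modeling_files)
    return [
        attr
        for attr in config_attrs
        if attr not in allowed and re.search(rf"config\.{re.escape(attr)}\b", source) is None
    ]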
|
@ -463,6 +463,7 @@ OBJECTS_TO_IGNORE = [
|
|||
"SamConfig",
|
||||
"SamPromptEncoderConfig",
|
||||
"SeamlessM4TConfig", # use of unconventional markdown
|
||||
"SeamlessM4Tv2Config", # use of unconventional markdown
|
||||
"Seq2SeqTrainingArguments",
|
||||
"SpecialTokensMixin",
|
||||
"Speech2Text2Config",
|
||||
|
|
|
@@ -76,6 +76,9 @@ PRIVATE_MODELS = [
    "Kosmos2TextModel",
    "Kosmos2TextForCausalLM",
    "Kosmos2VisionModel",
    "SeamlessM4Tv2TextToUnitModel",
    "SeamlessM4Tv2CodeHifiGan",
    "SeamlessM4Tv2TextToUnitForConditionalGeneration",
]

# Update this list for models that are not tested with a comment explaining the reason it should not be.
@@ -296,6 +299,10 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "SeamlessM4TCodeHifiGan",
    "SeamlessM4TForSpeechToSpeech",  # no auto class for speech-to-speech
    "TvpForVideoGrounding",
    "SeamlessM4Tv2NARTextToUnitModel",
    "SeamlessM4Tv2NARTextToUnitForConditionalGeneration",
    "SeamlessM4Tv2CodeHifiGan",
    "SeamlessM4Tv2ForSpeechToSpeech",  # no auto class for speech-to-speech
]

# DO NOT edit this list!
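These entries feed `check_repo.py`, which verifies, among other things, that every public model is reachable through an auto class; models with no sensible auto mapping (e.g. speech-to-speech) are listed here instead. A toy version of that coverage check (the argument names are assumptions, not the script's actual variables):

def models_missing_auto_class(public_models, auto_mapped_models, ignore_list):
    # Sketch: anything public, unmapped, and not explicitly ignored is an error.
    return sorted(
        m for m in public_models
        if m not in auto_mapped_models and m not in ignore_list
    )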
|
@ -776,6 +776,7 @@ src/transformers/models/sam/modeling_sam.py
|
|||
src/transformers/models/sam/modeling_tf_sam.py
|
||||
src/transformers/models/sam/processing_sam.py
|
||||
src/transformers/models/seamless_m4t/convert_fairseq2_to_hf.py
|
||||
src/transformers/models/seamless_m4t_v2/convert_fairseq2_to_hf.py
|
||||
src/transformers/models/segformer/configuration_segformer.py
|
||||
src/transformers/models/segformer/convert_segformer_original_to_pytorch.py
|
||||
src/transformers/models/sew/convert_sew_original_pytorch_checkpoint_to_pytorch.py
|
||||
|
|
|
@ -2,6 +2,7 @@ docs/source/en/generation_strategies.md
|
|||
docs/source/en/model_doc/ctrl.md
|
||||
docs/source/en/model_doc/kosmos-2.md
|
||||
docs/source/en/model_doc/seamless_m4t.md
|
||||
docs/source/en/model_doc/seamless_m4t_v2.md
|
||||
docs/source/en/task_summary.md
|
||||
docs/source/en/tasks/prompting.md
|
||||
src/transformers/models/blip_2/modeling_blip_2.py
|
||||
|
|