<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# SpeechT5

## Overview
The SpeechT5 model was proposed in [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.

The abstract from the paper is the following:

*Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.*

This model was contributed by [Matthijs](https://huggingface.co/Matthijs). The original code can be found [here](https://github.com/microsoft/SpeechT5).
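
The data flow described in the abstract (a modality-specific pre-net, the shared encoder-decoder, then a modality-specific post-net) can be sketched in plain Python. This is a toy illustration and not the library's implementation: the helper `nets_for_task` and the string labels used for the six pre/post-nets are hypothetical names chosen for this example.

```python
from typing import List

# Hypothetical labels for the six modal-specific nets mentioned in the abstract:
# one encoder pre-net, one decoder pre-net and one decoder post-net per modality.
MODAL_NETS = {
    "speech_encoder_prenet",
    "text_encoder_prenet",
    "speech_decoder_prenet",
    "text_decoder_prenet",
    "speech_decoder_postnet",
    "text_decoder_postnet",
}


def nets_for_task(input_modality: str, output_modality: str) -> List[str]:
    """Return, in order, the components a task would pass through."""
    for modality in (input_modality, output_modality):
        if modality not in ("speech", "text"):
            raise ValueError(f"unknown modality: {modality!r}")
    return [
        f"{input_modality}_encoder_prenet",    # preprocess the input
        "shared_encoder",                      # shared across modalities
        f"{output_modality}_decoder_prenet",   # preprocess decoder inputs
        "shared_decoder",                      # shared across modalities
        f"{output_modality}_decoder_postnet",  # produce output in the target modality
    ]


# Speech synthesis: text in, speech out.
print(nets_for_task("text", "speech"))
```

Under this sketch, speech recognition would be `nets_for_task("speech", "text")` and voice conversion would be `nets_for_task("speech", "speech")`.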
## SpeechT5Config

[[autodoc]] SpeechT5Config
## SpeechT5HifiGanConfig

[[autodoc]] SpeechT5HifiGanConfig
## SpeechT5Tokenizer

[[autodoc]] SpeechT5Tokenizer
    - __call__
    - save_vocabulary
    - decode
    - batch_decode
## SpeechT5FeatureExtractor

[[autodoc]] SpeechT5FeatureExtractor
    - __call__
## SpeechT5Processor

[[autodoc]] SpeechT5Processor
    - __call__
    - pad
    - from_pretrained
    - save_pretrained
    - batch_decode
    - decode
## SpeechT5Model

[[autodoc]] SpeechT5Model
    - forward
## SpeechT5ForSpeechToText

[[autodoc]] SpeechT5ForSpeechToText
    - forward
## SpeechT5ForTextToSpeech

[[autodoc]] SpeechT5ForTextToSpeech
    - forward
    - generate
## SpeechT5ForSpeechToSpeech

[[autodoc]] SpeechT5ForSpeechToSpeech
    - forward
    - generate_speech
## SpeechT5HifiGan

[[autodoc]] SpeechT5HifiGan
    - forward