<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# TVLT

<Tip warning={true}>

This model is in maintenance mode only; we don't accept any new PRs changing its code.
If you run into any issues running this model, please reinstall the last version that supported this model: v4.40.2.
You can do so by running the following command: `pip install -U transformers==4.40.2`.

</Tip>

## Overview

The TVLT model was proposed in [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156)
by Zineng Tang, Jaemin Cho, Yixin Nie, and Mohit Bansal (the first three authors contributed equally). The Textless Vision-Language Transformer (TVLT) is a model that uses raw visual and audio inputs for vision-and-language representation learning, without using text-specific modules such as tokenization or automatic speech recognition (ASR). It can perform various audiovisual and vision-language tasks such as retrieval and question answering.

The abstract from the paper is the following:

*In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR). TVLT is trained by reconstructing masked patches of continuous video frames and audio spectrograms (masked autoencoding) and contrastive modeling to align video and audio. TVLT attains performance comparable to its text-based counterpart on various multimodal tasks, such as visual question answering, image retrieval, video retrieval, and multimodal sentiment analysis, with 28x faster inference speed and only 1/3 of the parameters. Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals without assuming the prior existence of text.*

<p align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/tvlt_architecture.png"
alt="drawing" width="600"/>
</p>

<small> TVLT architecture. Taken from the <a href="https://arxiv.org/abs/2209.14156">original paper</a>. </small>

The original code can be found [here](https://github.com/zinengtang/TVLT). This model was contributed by [Zineng Tang](https://huggingface.co/ZinengTang).

## Usage tips

- TVLT is a model that takes both `pixel_values` and `audio_values` as input. One can use [`TvltProcessor`] to prepare data for the model, as shown in the example below this list.
  This processor wraps an image processor (for the image/video modality) and an audio feature extractor (for the audio modality) into one.
- TVLT is trained with images/videos and audios of various sizes: the authors resize and crop the input images/videos to 224x224 and limit the length of the audio spectrogram to 2048. To make batching of videos and audios possible, the authors use a `pixel_mask` that indicates which pixels are real/padding and an `audio_mask` that indicates which audio values are real/padding.
- The design of TVLT is very similar to that of a standard Vision Transformer (ViT) and masked autoencoder (MAE) as in [ViTMAE](vitmae). The difference is that the model includes embedding layers for the audio modality.
- The PyTorch version of this model is only available in torch 1.10 and higher.
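
The snippet below is a minimal sketch of that workflow, assuming `transformers==4.40.2` (the last release that ships TVLT) and the `ZinengTang/tvlt-base` checkpoint; it runs random video frames and a random audio clip through [`TvltProcessor`] and [`TvltModel`], so all shapes and values are placeholders rather than a prescribed recipe.

```python
import numpy as np
import torch

from transformers import TvltModel, TvltProcessor

# Load the processor (image processor + audio feature extractor) and the base model.
# "ZinengTang/tvlt-base" is the checkpoint contributed by the authors; swap in your own if needed.
processor = TvltProcessor.from_pretrained("ZinengTang/tvlt-base")
model = TvltModel.from_pretrained("ZinengTang/tvlt-base")

# Dummy inputs: 8 video frames in (channels, height, width) format and a short mono audio clip.
video_frames = list(np.random.rand(8, 3, 224, 224))
audio = np.random.rand(10_000)

# The processor resizes/crops the frames, converts the audio to a spectrogram, and returns the
# padded tensors together with their `pixel_mask` / `audio_mask`.
inputs = processor(images=video_frames, audio=audio, sampling_rate=44100, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
```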

## TvltConfig

[[autodoc]] TvltConfig

## TvltProcessor

[[autodoc]] TvltProcessor
    - __call__

## TvltImageProcessor

[[autodoc]] TvltImageProcessor
    - preprocess

## TvltFeatureExtractor

[[autodoc]] TvltFeatureExtractor
    - __call__

## TvltModel

[[autodoc]] TvltModel
    - forward

## TvltForPreTraining

[[autodoc]] TvltForPreTraining
    - forward

## TvltForAudioVisualClassification

[[autodoc]] TvltForAudioVisualClassification
    - forward