81 lines
4.2 KiB
Markdown
81 lines
4.2 KiB
Markdown
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations under the License.
|
|
|
|
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
|
rendered properly in your Markdown viewer.
|
|
|
|
-->
|
|
|
|
# X-CLIP
|
|
|
|
## Overview
|
|
|
|
The X-CLIP model was proposed in [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling.
|
|
X-CLIP is a minimal extension of [CLIP](clip) for video. The model consists of a text encoder, a cross-frame vision encoder, a multi-frame integration Transformer, and a video-specific prompt generator.
|
|
|
|
The abstract from the paper is the following:
|
|
|
|
*Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinectics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited.*
|
|
|
|
Tips:
|
|
|
|
- Usage of X-CLIP is identical to [CLIP](clip).
|
|
|
|
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png"
|
|
alt="drawing" width="600"/>
|
|
|
|
<small> X-CLIP architecture. Taken from the <a href="https://arxiv.org/abs/2208.02816">original paper.</a> </small>
|
|
|
|
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
|
The original code can be found [here](https://github.com/microsoft/VideoX/tree/master/X-CLIP).
|
|
|
|
## Resources
|
|
|
|
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with X-CLIP.
|
|
|
|
- Demo notebooks for X-CLIP can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/X-CLIP).
|
|
|
|
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
|
|
|
## XCLIPProcessor
|
|
|
|
[[autodoc]] XCLIPProcessor
|
|
|
|
## XCLIPConfig
|
|
|
|
[[autodoc]] XCLIPConfig
|
|
- from_text_vision_configs
|
|
|
|
## XCLIPTextConfig
|
|
|
|
[[autodoc]] XCLIPTextConfig
|
|
|
|
## XCLIPVisionConfig
|
|
|
|
[[autodoc]] XCLIPVisionConfig
|
|
|
|
## XCLIPModel
|
|
|
|
[[autodoc]] XCLIPModel
|
|
- forward
|
|
- get_text_features
|
|
- get_video_features
|
|
|
|
## XCLIPTextModel
|
|
|
|
[[autodoc]] XCLIPTextModel
|
|
- forward
|
|
|
|
## XCLIPVisionModel
|
|
|
|
[[autodoc]] XCLIPVisionModel
|
|
- forward
|