<!---
Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# VisionTextDualEncoder and CLIP model training examples

The following example showcases how to train a CLIP-like vision-text dual encoder model
using a pre-trained vision and text encoder.

Such a model can be used for natural language image search and potentially zero-shot image classification.
The model is inspired by [CLIP](https://openai.com/blog/clip/), introduced by Alec Radford et al.
The idea is to train a vision encoder and a text encoder jointly to project the representation of images and their
captions into the same embedding space, such that the caption embeddings are located near the embeddings
of the images they describe.

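
This joint training uses a CLIP-style contrastive objective that pulls matching image/caption pairs together and pushes mismatched pairs apart. The snippet below is only a minimal sketch of that idea, assuming plain PyTorch tensors of embeddings; the `contrastive_loss` helper and the `temperature` value are illustrative and are not part of the example script:

```py
import torch
import torch.nn.functional as F


def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product becomes a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_embeds @ text_embeds.t() / temperature
    # The matching caption for each image sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: images retrieve captions and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```
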
### Download COCO dataset (2017)
This example uses the COCO dataset (2017) through a custom dataset script, which requires users to manually download the
COCO dataset before training.

```bash
mkdir data
cd data
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget http://images.cocodataset.org/annotations/image_info_test2017.zip
cd ..
```

Having downloaded the COCO dataset manually, you should be able to load it with the `ydshieh/coco_dataset_script` dataset loading script:

```py
import os
import datasets

COCO_DIR = os.path.join(os.getcwd(), "data")
ds = datasets.load_dataset("ydshieh/coco_dataset_script", "2017", data_dir=COCO_DIR)
```
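
As a quick sanity check, you can inspect one training example. The split and column names used below (`train`, `image_path`, `caption`) are assumptions based on the `--image_column`/`--caption_column` flags passed to the training command later in this document:

```py
# Column names here are assumed from the training command below.
print(ds)                         # available splits and their sizes
example = ds["train"][0]
print(example["image_path"])      # path to the downloaded image file
print(example["caption"])         # one caption describing that image
```
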

### Create a model from a vision encoder model and a text encoder model
Next, we create a [VisionTextDualEncoderModel](https://huggingface.co/docs/transformers/model_doc/vision-text-dual-encoder#visiontextdualencoder).
The `VisionTextDualEncoderModel` class lets you load any vision and text encoder model to create a dual encoder.
Here is an example of how to load the model using pre-trained vision and text models.

```python3
from transformers import (
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
    AutoTokenizer,
    AutoImageProcessor
)

model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32", "FacebookAI/roberta-base"
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

# save the model and processor
model.save_pretrained("clip-roberta")
processor.save_pretrained("clip-roberta")
```

This loads both the text and vision encoders using pre-trained weights. The projection layers are randomly
initialized, except for CLIP's vision model: if you use CLIP to initialize the vision model, then the vision projection weights are also
loaded from the pre-trained weights.

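
Before training, you can optionally verify that the saved `clip-roberta` directory loads back correctly (a quick check, not a required step):

```py
model = VisionTextDualEncoderModel.from_pretrained("clip-roberta")
processor = VisionTextDualEncoderProcessor.from_pretrained("clip-roberta")
```
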
### Train the model
Finally, we can run the example script to train the model:

```bash
python examples/pytorch/contrastive-image-text/run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --model_name_or_path ./clip-roberta \
    --data_dir $PWD/data \
    --dataset_name ydshieh/coco_dataset_script \
    --dataset_config_name=2017 \
    --image_column image_path \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train --do_eval \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
    --overwrite_output_dir \
    --push_to_hub
```
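
After training finishes, the fine-tuned model in `./clip-roberta-finetuned` can be used for the natural language image search use case mentioned at the top. The snippet below is a minimal sketch of that: the image path and candidate captions are placeholders to replace with your own data, and it reuses the processor saved in `clip-roberta` earlier.

```py
from PIL import Image
from transformers import VisionTextDualEncoderModel, VisionTextDualEncoderProcessor

model = VisionTextDualEncoderModel.from_pretrained("./clip-roberta-finetuned")
# Reuse the processor saved earlier in this guide.
processor = VisionTextDualEncoderProcessor.from_pretrained("./clip-roberta")

image = Image.open("path/to/your/image.jpg")  # placeholder path
texts = ["a photo of two cats", "a photo of a dog", "a photo of an airplane"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image has shape (num_images, num_texts); higher means a better match.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```
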