<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Image tasks with IDEFICS

[[open-in-colab]]

While individual tasks can be tackled by fine-tuning specialized models, an alternative approach
that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning.
For instance, large language models can handle NLP tasks such as summarization, translation, classification, and more.
This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can
solve image-text tasks with a large multimodal model called IDEFICS.

[IDEFICS](../model_doc/idefics) is an open-access vision and language model based on [Flamingo](https://huggingface.co/papers/2204.14198),
a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image
and text inputs and generates coherent text as output. It can answer questions about images, describe visual content,
create stories grounded in multiple images, and so on. IDEFICS comes in two variants: [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b)
and [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b), both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed
versions of the model adapted for conversational use cases.

This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However,
being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether
this approach suits your use case better than fine-tuning specialized models for each individual task.

In this guide, you'll learn how to:
- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model)
- Use IDEFICS for:
  - [Image captioning](#image-captioning)
  - [Prompted image captioning](#prompted-image-captioning)
  - [Few-shot prompting](#few-shot-prompting)
  - [Visual question answering](#visual-question-answering)
  - [Image classification](#image-classification)
  - [Image-guided text generation](#image-guided-text-generation)
- [Run inference in batch mode](#running-inference-in-batch-mode)
- [Run IDEFICS instruct for conversational use](#idefics-instruct-for-conversational-use)

Before you begin, make sure you have all the necessary libraries installed.

```bash
pip install -q bitsandbytes sentencepiece accelerate transformers
```

<Tip>
To run the following examples with a non-quantized version of the model checkpoint, you will need at least 20GB of GPU memory (one way to check what your device offers is sketched below).
</Tip>

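If you are unsure how much memory your GPU has, one way to query it with PyTorch is sketched below. This is an illustrative check assuming a single CUDA device, not part of the original setup steps:

```py
>>> import torch

>>> # Total memory of the first CUDA device, reported in GiB
>>> print(f"{torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GiB")
```
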
## Loading the model

Let's start by loading the model's 9 billion parameter checkpoint:

```py
>>> checkpoint = "HuggingFaceM4/idefics-9b"
```

Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint.
The IDEFICS processor wraps a [`LlamaTokenizer`] and IDEFICS image processor into a single processor to take care of
preparing text and image inputs for the model.

```py
>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")
```

Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized
manner given the existing devices.

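If you are curious how the weights ended up distributed across your devices, you can inspect the device map that Accelerate built. This is a minimal sketch; the exact layout depends on your hardware:

```py
>>> # Mapping from module names to the devices chosen by `device_map="auto"`
>>> print(model.hf_device_map)
```
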
### Quantized model

If high-memory GPU availability is an issue, you can load the quantized version of the model. To load the model and the
processor in 4-bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed
on the fly while loading.

```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )
```

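To get a rough idea of how much memory the quantized checkpoint occupies, you can print the model's footprint. This is a minimal sketch; the exact number depends on the checkpoint and quantization settings:

```py
>>> # Approximate size of the loaded weights, in GiB
>>> print(f"{model.get_memory_footprint() / 1024**3:.2f} GiB")
```
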
Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for.

## Image captioning
Image captioning is the task of predicting a caption for a given image. A common application is to help visually impaired
people navigate through different situations, for instance, by exploring image content online.

To illustrate the task, get an image to be captioned, e.g.:

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-im-captioning.jpg" alt="Image of a puppy in a flower bed"/>
</div>

Photo by [Hendo Wang](https://unsplash.com/@hendoo).

IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the
model, only the preprocessed input image. Without a text prompt, the model will start generating text from the
BOS (beginning-of-sequence) token, thus creating a caption.

As image input to the model, you can use either an image object (`PIL.Image`) or a URL from which the image can be retrieved.

```py
>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed
```

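Instead of a URL, the prompt can also contain a `PIL.Image` object. Below is a minimal sketch of the same example with a locally loaded image; it assumes the `requests` and `Pillow` packages are installed:

```py
>>> import requests
>>> from PIL import Image

>>> url = "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # Pass the PIL.Image object directly instead of the URL string
>>> inputs = processor([image], return_tensors="pt").to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
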
<Tip>

It is a good idea to include `bad_words_ids` in the call to `generate` to avoid errors arising when increasing
`max_new_tokens`: otherwise the model may try to generate a new `<image>` or `<fake_token_around_image>` token even though
it is not generating an image.
You can set it on the fly as in this guide, or store it in a `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide; one possible way to do that is sketched right after this tip.
</Tip>

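For instance, one way to keep these settings in a `GenerationConfig` instead of passing them on every call is sketched below (the values are simply the ones used above):

```py
>>> from transformers import GenerationConfig

>>> # Keep the generation settings in one place instead of repeating them in each generate() call
>>> generation_config = GenerationConfig(max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_ids = model.generate(**inputs, generation_config=generation_config)
```
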
## Prompted image captioning

You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take
another image to illustrate:

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-prompted-im-captioning.jpg" alt="Image of the Eiffel Tower at night"/>
</div>

Photo by [Denys Nevozhai](https://unsplash.com/@dnevozhai).

Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs.

```py
>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.
```

## Few-shot prompting

While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with
other restrictions or requirements that increase the task's complexity. Few-shot prompting can be used to enable in-context learning.
By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples.

Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model
that in addition to learning what the object in an image is, we would also like to get some interesting information about it.
Then, let's see if we can get the same response format for an image of the Statue of Liberty:

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-few-shot.jpg" alt="Image of the Statue of Liberty"/>
</div>

Photo by [Juan Mayobre](https://unsplash.com/@jmayobres).

```py
>>> prompt = ["User:",
...           "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...           "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...           "User:",
...           "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...           "Describe this image.\nAssistant:"
...           ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.
```

Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks,
feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.), for instance by assembling the prompt with a small helper like the one sketched below.

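As an illustration, a hypothetical helper for building such an n-shot prompt from `(image_url, description)` pairs could look like this (the `build_few_shot_prompt` function is just a sketch, not part of the library):

```py
>>> def build_few_shot_prompt(examples, query_image_url):
...     """Interleave (image_url, description) example pairs, then append the query image."""
...     prompt = []
...     for image_url, description in examples:
...         prompt += ["User:", image_url, f"Describe this image.\nAssistant: {description}\n"]
...     prompt += ["User:", query_image_url, "Describe this image.\nAssistant:"]
...     return prompt
```

The returned list can then be passed to the processor exactly as in the example above.
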
## Visual question answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image
captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer
service (questions about products based on images), and image retrieval.

Let's get a new image for this task:

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-vqa.jpg" alt="Image of a couple having a picnic"/>
</div>

Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos).

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions:

```py
>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.
```

## Image classification

IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing
labeled examples from those specific categories. Given a list of categories and using its image and text understanding
capabilities, the model can infer which category the image likely belongs to.

Say we have this image of a vegetable stand:

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-classification.jpg" alt="Image of a vegetable stand"/>
</div>

Photo by [Peter Wendt](https://unsplash.com/@peterwendt).

We can instruct the model to classify the image into one of the categories that we have:

```py
>>> categories = ['animals', 'vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...           "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...           "Category: "
...           ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables
```

In the example above, we instructed the model to classify the image into a single category; however, you can also prompt the model to do rank classification, for example with a prompt like the one sketched below.

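A hypothetical rank-classification prompt could ask the model to order all the categories instead of picking one. The instruction wording below is illustrative, not a prescribed format:

```py
>>> # Ask the model to order the categories rather than pick a single one
>>> prompt = [f"Instruction: Rank the categories in {categories} from most to least fitting for the image.\n",
...           "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...           "Ranking: "
...           ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
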
## Image-guided text generation

For more creative applications, you can use image-guided text generation to generate text based on an image. This can be
useful for creating product descriptions, ads, descriptions of a scene, and so on.

Let's prompt IDEFICS to write a story based on a simple image of a red door:

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-story-generation.jpg" alt="Image of a red door with a pumpkin on the steps"/>
</div>

Photo by [Craig Tidball](https://unsplash.com/@devonshiremedia).

```py
>>> prompt = ["Instruction: Use the image to write a story. \n",
...           "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...           "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Use the image to write a story.
Story:
Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, “Don’t worry, honey. He’s just a friendly ghost.”

The little girl wasn’t sure if she believed her mother, but she went outside anyway.

When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran
```

Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

<Tip>

For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help
you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies)
to learn more; one possible tweak is sketched right after this tip.
</Tip>

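As one example, swapping beam search for sampling only changes a few arguments of the `generate` call. The hyperparameter values below are arbitrary starting points, not recommendations:

```py
>>> # Sampling instead of beam search; temperature and top_p are illustrative values
>>> generated_ids = model.generate(
...     **inputs,
...     do_sample=True,
...     temperature=0.7,
...     top_p=0.9,
...     max_new_tokens=200,
...     bad_words_ids=bad_words_ids,
... )
```
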
## Running inference in batch mode

All of the earlier sections illustrated using IDEFICS for a single example. In a very similar fashion, you can run inference
for a batch of examples by passing a list of prompts:

```py
>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.
```

## IDEFICS instruct for conversational use

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub:
`HuggingFaceM4/idefics-80b-instruct` and `HuggingFaceM4/idefics-9b-instruct`.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction
fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.

Usage and prompting for the conversational use case are very similar to using the base models:

```py
>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",
...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",
...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")
```