Idefics2

Overview

The Idefics2 model was created by the Hugging Face M4 team and authored by Léo Tronchon, Hugo Laurençon, and Victor Sanh. The accompanying blog post can be found here.

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded in multiple images, or simply behave as a pure language model without visual inputs. It improves upon IDEFICS-1, notably in document understanding, OCR, and visual reasoning. Idefics2 is lightweight (8 billion parameters) and treats images in their native aspect ratio and resolution, which allows for varying inference efficiency.

Tips:

  • Each sample can contain multiple images, and the number of images can vary between samples. The processor will pad the inputs to the maximum number of images in a batch for input to the model.
  • The processor has a do_image_splitting option. If True, each input image is split into 4 sub-images, which are concatenated with the original to form 5 images. This is useful for increasing model performance. Make sure processor.image_processor.do_image_splitting is set to False if the model was not trained with this option.
  • The text passed to the processor should contain an <image> token at each position where an image should be inserted and, if the text is a chat message, <end_of_utterance> at the end of each utterance.
  • The processor has its own apply_chat_template method to convert chat messages to text that can then be passed as text to the processor.
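
The image-count arithmetic in the first two tips can be sketched in plain Python. This is only an illustration of the bookkeeping, not the processor's implementation: the helper names `count_after_splitting` and `pad_image_lists`, the use of file-name strings as stand-ins for images, the `None` padding value, and the 0/1 mask are all assumptions made for this sketch.

```python
from typing import List, Optional, Tuple


def count_after_splitting(num_images: int, do_image_splitting: bool) -> int:
    """With splitting on, each image yields 4 sub-images plus the original (5 total)."""
    return num_images * 5 if do_image_splitting else num_images


def pad_image_lists(
    batch: List[List[str]], pad_value: Optional[str] = None
) -> Tuple[List[List[Optional[str]]], List[List[int]]]:
    """Pad each sample's image list to the largest image count in the batch.

    Returns the padded lists and a hypothetical mask (1 = real image, 0 = padding).
    """
    max_images = max(len(sample) for sample in batch)
    padded, mask = [], []
    for sample in batch:
        n_pad = max_images - len(sample)
        padded.append(list(sample) + [pad_value] * n_pad)
        mask.append([1] * len(sample) + [0] * n_pad)
    return padded, mask


# Two samples with a different number of images each (strings stand in for images).
batch = [["cat.jpg"], ["dog.jpg", "bird.jpg", "fish.jpg"]]
padded, mask = pad_image_lists(batch)
print(padded)
print(mask)
print(count_after_splitting(2, do_image_splitting=True))
```

In this sketch the sample with one image is padded with two placeholders so that both samples carry three entries, and two input images become ten once splitting is enabled.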

Example of how to use the processor on chat messages:

import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

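messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")

# Render the chat messages into a single prompt string
text = processor.apply_chat_template(messages)
# "User: What's the difference between these two images?<image><image><end_of_utterance>\n"
print(text)

# return_tensors="pt" yields PyTorch tensors, which model.generate expects
inputs = processor(images=images, text=text, return_tensors="pt")

# generate returns token ids, so decode them to obtain the text
generated_ids = model.generate(**inputs)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)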
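
The prompt string shown in the comment above follows a simple pattern: the capitalized role, the text, one <image> token per image entry, and <end_of_utterance> closing the utterance. The sketch below reproduces that pattern without loading the model. `render_chat` is a hypothetical helper written for illustration only; the actual template is shipped with the checkpoint and applied by `processor.apply_chat_template`, and it may differ in cases not shown here (for example, generation prompts or messages that start with an image).

```python
def render_chat(messages):
    """Mimic the prompt format shown in the example above (illustrative only)."""
    parts = []
    for message in messages:
        body = ""
        for item in message["content"]:
            if item["type"] == "text":
                body += item["text"]
            elif item["type"] == "image":
                # One placeholder token per image, in the order the images are passed
                body += "<image>"
        parts.append(f"{message['role'].capitalize()}: {body}<end_of_utterance>\n")
    return "".join(parts)


messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]

text = render_chat(messages)
print(repr(text))

# The number of <image> tokens should match the number of images given to the processor
num_image_tokens = text.count("<image>")
print(num_image_tokens)
```

Counting the <image> tokens this way is a cheap sanity check before calling the processor, since the tips above require one token per inserted image.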

This model was contributed by amyeroberts. The original code can be found here.

Idefics2Config

autodoc Idefics2Config

Idefics2Model

autodoc Idefics2Model - forward

Idefics2ForConditionalGeneration

autodoc Idefics2ForConditionalGeneration - forward

Idefics2ImageProcessor

autodoc Idefics2ImageProcessor - preprocess

Idefics2Processor

autodoc Idefics2Processor - __call__