# PaliGemma

## Overview

The PaliGemma model was proposed by Google. It is a 3B vision-language model composed of a SigLIP-So400m vision encoder and a Gemma-2B decoder, linked by a multimodal linear projection. It is not a chat model over images: it cuts an image into a fixed number of ViT tokens and prepends them to an optional text prompt. One particularity is that the model uses full block attention over all the image tokens plus the input text tokens. It comes in 3 resolutions (224x224, 448x448 and 896x896) with 3 base models, 55 fine-tuned versions for different tasks, and 2 mix models.

This model was contributed by [Molbap](https://huggingface.co/Molbap).
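
Below is a minimal generation sketch. It assumes the `google/paligemma-3b-mix-224` checkpoint and a publicly hosted COCO example image; adjust the checkpoint, prompt and image to your use case.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# Example checkpoint and image chosen for illustration; substitute your own.
model_id = "google/paligemma-3b-mix-224"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

prompt = "What is in this image?"
image = Image.open(requests.get(url, stream=True).raw)

# The processor prepends the fixed number of image tokens to the text prompt.
inputs = processor(text=prompt, images=image, return_tensors="pt")
generation = model.generate(**inputs, max_new_tokens=20)

# Decode only the newly generated tokens, skipping the prompt portion.
answer = processor.decode(generation[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```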

## PaliGemmaConfig

[[autodoc]] PaliGemmaConfig

## PaliGemmaProcessor

[[autodoc]] PaliGemmaProcessor

## PaliGemmaForConditionalGeneration

[[autodoc]] PaliGemmaForConditionalGeneration
    - forward