transformers/docs/source/en/model_doc/llava.md

4.2 KiB
Raw Permalink Blame History

LLaVa

Overview

LLaVa is an open-source chatbot trained by fine-tuning LlamA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. In other words, it is an multi-modal version of LLMs fine-tuned for chat / instructions.

The LLaVa model was proposed in Visual Instruction Tuning and improved in Improved Baselines with Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.

The abstract from the paper is the following:

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in 1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available

drawing

LLaVa architecture. Taken from the original paper.

This model was contributed by ArthurZ and ybelkada. The original code can be found here.

Usage tips

  • We advise users to use padding_side="left" when computing batched generation as it leads to more accurate results. Simply make sure to call processor.tokenizer.padding_side = "left" before generating.

  • Note the model has not been explicitly trained to process multiple images in the same prompt, although this is technically possible, you may experience inaccurate results.

  • For better results, we recommend users to prompt the model with the correct prompt format:

"USER: <image>\n<prompt> ASSISTANT:"

For multiple turns conversation:

"USER: <image>\n<prompt1> ASSISTANT: <answer1></s>USER: <prompt2> ASSISTANT: <answer2></s>USER: <prompt3> ASSISTANT:"

Using Flash Attention 2

Flash Attention 2 is an even faster, optimized version of the previous optimization, please refer to the Flash Attention 2 section of performance docs.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BEiT.

LlavaConfig

autodoc LlavaConfig

LlavaProcessor

autodoc LlavaProcessor

LlavaForConditionalGeneration

autodoc LlavaForConditionalGeneration - forward