# CogVLM & CogAgent

📗 [Chinese README (中文版)](./README_zh.md)

🔥🔥🔥 🆕 ```2023/12/15```: **CogAgent officially launched!** CogAgent is an image-understanding model built on CogVLM. It features **visual GUI Agent capabilities** and **further enhanced image understanding**. It supports image input at a **resolution of 1120×1120** and offers multiple abilities, including **multi-turn dialogue about images, GUI Agent, grounding**, and more.

🌟 **Jump to detailed introduction: [Introduction to CogVLM](#introduction-to-cogvlm), 🆕 [Introduction to CogAgent](#introduction-to-cogagent)**
## CogVLM

🌐 Web Demo: this link · 📖 Paper: *CogVLM: Visual Expert for Pretrained Language Models*

CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, and supports image understanding and multi-turn dialogue at a resolution of 490×490. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30K captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.

## CogAgent

🌐 Web Demo: coming soon · 📖 Paper: *CogAgent: A Visual Language Model for GUI Agents*

CogAgent is an open-source visual language model improved upon CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, and supports image understanding at a resolution of 1120×1120. On top of CogVLM's capabilities, it further possesses GUI image Agent capabilities. CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly surpasses existing models on GUI operation datasets such as AITW and Mind2Web.
| Method | LLM | MM-Vet | POPE (adversarial) | TouchStone |
|---|---|---|---|---|
| BLIP-2 | Vicuna-13B | 22.4 | - | - |
| Otter | MPT-7B | 24.7 | - | - |
| MiniGPT4 | Vicuna-13B | 24.4 | 70.4 | 531.7 |
| InstructBLIP | Vicuna-13B | 25.6 | 77.3 | 552.4 |
| LLaMA-Adapter v2 | LLaMA-7B | 31.4 | - | 590.1 |
| LLaVA | LLaMA2-7B | 28.1 | 66.3 | 602.7 |
| mPLUG-Owl | LLaMA-7B | - | 66.8 | 605.4 |
| LLaVA-1.5 | Vicuna-13B | 36.3 | 84.5 | - |
| Emu | LLaMA-13B | 36.3 | - | - |
| Qwen-VL-Chat | - | - | - | 645.2 |
| DreamLLM | Vicuna-7B | 35.9 | 76.5 | - |
| CogVLM | Vicuna-7B | 52.8 | 87.6 | 742.0 |
| Method | RefCOCO (val) | RefCOCO (testA) | RefCOCO (testB) | RefCOCO+ (val) | RefCOCO+ (testA) | RefCOCO+ (testB) | RefCOCOg (val) | RefCOCOg (test) | Visual7W (test) |
|---|---|---|---|---|---|---|---|---|---|
| cogvlm-grounding-generalist | 92.51 | 93.95 | 88.73 | 87.52 | 91.81 | 81.43 | 89.46 | 90.09 | 90.96 |
| cogvlm-grounding-generalist-v1.1 | **92.76** | **94.75** | **88.99** | **88.68** | **92.91** | **83.39** | **89.75** | **90.79** | **91.05** |