upload code

Qingsong Lv 2023-11-22 14:00:42 +08:00
parent 3cace3cd57
commit c253a06bad
48 changed files with 3662 additions and 0 deletions

13
LICENSE Normal file
@@ -0,0 +1,13 @@
Copyright 2023 CogVLM team @ Zhipu AI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

73
MODEL_LICENSE Normal file
@@ -0,0 +1,73 @@
The CogVLM License
1. Definitions
“Licensor” means the CogVLM Model Team that distributes its Software.
“Software” means the CogVLM model parameters made available under this license.
2. License Grant
Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software.
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
3. Restriction
You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any military, or illegal purposes.
You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
4. Disclaimer
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
5. Limitation of Liability
EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
6. Dispute Resolution
This license shall be governed and construed in accordance with the laws of the People's Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
Note that the license is subject to update to a more comprehensive version. For any questions related to the license and copyright, please contact us at license@zhipuai.cn.
7. Llama2 and EVA-CLIP2 license
For CogVLM-17B version, Llama2 license (https://ai.meta.com/llama/license/) and EVA license (MIT, https://github.com/baaivision/EVA/blob/master/LICENSE) are applied.
1. 定义
“许可方”是指分发其软件的 CogVLM 模型团队。
“软件”是指根据本许可提供的 CogVLM 模型参数。
2. 许可授予
根据本许可的条款和条件,许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可。
上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
3.限制
您不得出于任何军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
4.免责声明
本软件“按原样”提供,不提供任何明示或暗示的保证,包括但不限于对适销性、特定用途的适用性和非侵权性的保证。 在任何情况下,作者或版权持有人均不对任何索赔、损害或其他责任负责,无论是在合同诉讼、侵权行为还是其他方面,由软件或软件的使用或其他交易引起、由软件引起或与之相关 软件。
5. 责任限制
除适用法律禁止的范围外,在任何情况下且根据任何法律理论,无论是基于侵权行为、疏忽、合同、责任或其他原因,任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害,或任何其他商业损失,即使许可人已被告知此类损害的可能性。
6.争议解决
本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
请注意,许可证可能会更新到更全面的版本。 有关许可和版权的任何问题,请通过 license@zhipuai.cn 与我们联系。
7. Llama2 和 EVA-CLIP2 许可
针对 CogVLM-17B 版本, Llama2 许可条件 (https://ai.meta.com/llama/license/) 和 EVA 许可条件 (MIT, https://github.com/baaivision/EVA/blob/master/LICENSE) 同时适用于模型权重。

270
README.md
@@ -1,2 +1,272 @@
# CogVLM
📖 [Paper](https://arxiv.org/abs/2311.03079)
🌐 [Web demo](http://36.103.203.44:7861/)
🔥 **News**: ```2023/11/20``` We have updated the checkpoint, unified the chat and VQA versions, and refreshed the SOTA results on various datasets.
🔥 **News**: ```2023/11/20``` We release **[cogvlm-chat](https://huggingface.co/THUDM/cogvlm-chat-hf)**, **[cogvlm-grounding-generalist](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)/[base](https://huggingface.co/THUDM/cogvlm-grounding-base-hf)**, and **[cogvlm-base-490](https://huggingface.co/THUDM/cogvlm-base-490-hf)/[224](https://huggingface.co/THUDM/cogvlm-base-224-hf)** on 🤗 Hugging Face. You can now run inference with Transformers in [a few lines of code](#-transformers)!
🔥 **News**: ```2023/10/27``` CogVLM bilingual version is available [online](https://chatglm.cn/)! Welcome to try it out!
[中文版README](./README_zh.md)
## Introduction
- CogVLM is a powerful **open-source visual language model** (**VLM**). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters.
- CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., **surpassing or matching PaLI-X 55B**. CogVLM can also [chat with you](http://36.103.203.44:7861) about images.
<div align="center">
<img src=assets/metrics-min.png width=80% />
</div>
| Method | LLM | MM-VET | POPE(adversarial) | TouchStone |
| ---------------- | ------------- |--------| --------- |------------|
| BLIP-2 | Vicuna-13B | 22.4 | - | - |
| Otter | MPT-7B | 24.7 | - | - |
| MiniGPT4 | Vicuna-13B | 24.4 | 70.4 | 531.7 |
| InstructBLIP | Vicuna-13B | 25.6 | 77.3 | 552.4 |
| LLaMA-Adapter v2 | LLaMA-7B | 31.4 | - | 590.1 |
| LLaVA | LLaMA2-7B | 28.1 | 66.3 | 602.7 |
| mPLUG-Owl | LLaMA-7B | - | 66.8 | 605.4 |
| LLaVA-1.5 | Vicuna-13B | 36.3 | 84.5 | - |
| Emu | LLaMA-13B | 36.3 | - | - |
| Qwen-VL-Chat | - | - | - | 645.2 |
| DreamLLM | Vicuna-7B | 35.9 | 76.5 | - |
| CogVLM | Vicuna-7B | **52.8** | **87.6** | **742.0** |
## Examples
<!-- CogVLM is powerful for answering various types of visual questions, including **Detailed Description & Visual Question Answering**, **Complex Counting**, **Visual Math Problem Solving**, **OCR-Free Reasoning**, **OCR-Free Visual Question Answering**, **World Knowledge**, **Referring Expression Comprehension**, **Programming with Visual Input**, **Grounding with Caption**, **Grounding Visual Question Answering**, etc. -->
* CogVLM can accurately describe images in details with **very few hallucinations**.
<details>
<summary>Click for comparison with LLAVA-1.5 and MiniGPT-4.</summary>
![LLaVA Comparison](assets/llava-comparison-min.png)
</details>
<br>
* CogVLM can understand and answer various types of questions, and has a **visual grounding** version.
<div align="center">
<img src=assets/pear_grounding.png width=90% />
</div>
<br>
* CogVLM sometimes captures more detailed content than GPT-4V(ision).
<div align="center">
<img src=assets/compare-min.png width=90% />
</div>
<!-- ![compare](assets/compare.png) -->
<br>
<details>
<summary>Click to expand more examples.</summary>
![Chat Examples](assets/chat.png)
</details>
## Method
The CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a **visual expert module**. See the [paper](./assets/cogvlm-paper.pdf) for more details.
<div align="center">
<img src=assets/method-min.png width=70% />
</div>
## Get Started
We support two GUIs for model inference, a **web demo** and a **CLI**. If you want to use CogVLM in your own Python code, you can easily adapt the CLI scripts to your use case.
First, install the dependencies.
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
#### Hardware requirements
* Model inference: 1 * A100 (80G) or 2 * RTX 3090 (24G).
* Finetuning: 4 * A100 (80G) *[Recommended]* or 8 * RTX 3090 (24G).
<!-- ### Online Web Demo
We provide a [web demo](http://36.103.203.44:7861/) based on [Gradio](https://gradio.app). -->
### Web Demo
We also offer a local web demo based on Gradio. First, install Gradio by running `pip install gradio`. Then clone this repository, enter it, and run `web_demo.py`. See the next section for detailed usage:
```bash
python web_demo.py --from_pretrained cogvlm-chat-v1.1 --version chat --english --bf16
python web_demo.py --from_pretrained cogvlm-grounding-generalist --version base --english --bf16
```
The GUI of the web demo looks like:
<div align="center">
<img src=assets/web_demo-min.png width=70% />
</div>
### CLI
We open-source different checkpoints for different downstream tasks:
* `cogvlm-chat-v1.1` This model supports multi-round chat and VQA simultaneously, with different prompts.
* `cogvlm-base-224` The original checkpoint after text-image pretraining.
* `cogvlm-base-490` The resolution is increased to 490 by position-encoding interpolation from `cogvlm-base-224`.
* `cogvlm-grounding-generalist` This checkpoint supports different visual grounding tasks, e.g., REC, grounding captioning, etc.
Run CLI demo via:
```bash
# The chat version gives detailed answers, while the vqa version usually answers with a single word.
python cli_demo.py --from_pretrained cogvlm-base-224 --version base --english --bf16 --no_prompt
python cli_demo.py --from_pretrained cogvlm-base-490 --version base --english --bf16 --no_prompt
python cli_demo.py --from_pretrained cogvlm-chat-v1.1 --version chat --english --bf16
python cli_demo.py --from_pretrained cogvlm-chat-v1.1 --version vqa --english --bf16
python cli_demo.py --from_pretrained cogvlm-grounding-generalist --version base --english --bf16
```
The program will automatically download the SAT model and start an interactive session in the command line. You can generate replies by entering instructions and pressing Enter.
Enter `clear` to clear the conversation history and `stop` to stop the program.
#### Multi-GPU inference
We also support model parallel inference, which splits the model across multiple (2/4/8) GPUs. `--nproc-per-node=[n]` in the following command controls the number of GPUs used.
```bash
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo.py --from_pretrained cogvlm-chat-v1.1 --version chat --english --bf16
```
**Note**:
* If you have trouble accessing huggingface.co, you can add `--local_tokenizer /path/to/vicuna-7b-v1.5` to load the tokenizer.
* If you have trouble downloading the model automatically with 🔨[SAT](https://github.com/THUDM/SwissArmyTransformer), try downloading it manually from 🤖[modelscope](https://www.modelscope.cn/models/ZhipuAI/CogVLM/summary) or 🤗[huggingface](https://huggingface.co/THUDM/CogVLM).
* When downloading the model with 🔨[SAT](https://github.com/THUDM/SwissArmyTransformer), it is saved to the default location `~/.sat_models`. Change the default location by setting the environment variable `SAT_HOME`. For example, if you want to save the model to `/path/to/my/models`, you can run `export SAT_HOME=/path/to/my/models` before running the python command.
The program provides the following hyperparameters to control the generation process:
```
usage: cli_demo.py [-h] [--max_length MAX_LENGTH] [--top_p TOP_P] [--top_k TOP_K] [--temperature TEMPERATURE] [--english]
optional arguments:
-h, --help show this help message and exit
--max_length MAX_LENGTH
max length of the total sequence
--top_p TOP_P top p for nucleus sampling
--top_k TOP_K top k for top k sampling
--temperature TEMPERATURE
temperature for sampling
--english only output English
```
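For example, a purely illustrative invocation using these knobs could be `python cli_demo.py --from_pretrained cogvlm-chat-v1.1 --version chat --english --bf16 --max_length 1024 --top_k 40 --top_p 0.8 --temperature 0.9`; the flag names come from `cli_demo.py`, while the values here are not a tuned recommendation.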
### Finetuning
You may want to use CogVLM for your own task, which may require a **different output style or domain knowledge**. Here we provide a finetuning example for **Captcha Recognition**.
1. Start by downloading the [Captcha Images dataset](https://www.kaggle.com/datasets/aadhavvignesh/captcha-images). Once downloaded, extract the contents of the ZIP file.
2. To create a train/validation/test split in the ratio of 80/5/15, execute the following (a minimal sketch of such a split is shown after this list):
```bash
python scripts/split_dataset.py
```
3. Start the fine-tuning process with this command:
```bash
bash scripts/finetune_(224/490)_lora.sh
```
4. Merge the model to `model_parallel_size=1`: (replace the 4 below with your training `MP_SIZE`)
```bash
torchrun --standalone --nnodes=1 --nproc-per-node=4 merge_model.py --version base --bf16 --from_pretrained ./checkpoints/merged_lora_(224/490)
```
5. Evaluate the performance of your model.
```bash
bash scripts/evaluate_(224/490).sh
```
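For reference, the sketch below shows what an 80/5/15 split amounts to. The `archive/` source folder, the file-extension filter, and the `train`/`valid`/`test` output layout are assumptions for illustration; the bundled `scripts/split_dataset.py` may organise the data differently.
```python
# Minimal sketch of an 80/5/15 split over a flat folder of captcha images.
import os
import random
import shutil

random.seed(2023)
src_dir = "archive"  # extracted Captcha Images dataset (assumed path)
ratios = {"train": 0.80, "valid": 0.05, "test": 0.15}

files = sorted(f for f in os.listdir(src_dir) if f.lower().endswith((".png", ".jpg")))
random.shuffle(files)

start = 0
for i, (split, ratio) in enumerate(ratios.items()):
    # hand the last split whatever remains so rounding never drops a file
    count = len(files) - start if i == len(ratios) - 1 else int(len(files) * ratio)
    os.makedirs(os.path.join(src_dir, split), exist_ok=True)
    for name in files[start:start + count]:
        shutil.move(os.path.join(src_dir, name), os.path.join(src_dir, split, name))
    start += count
```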
It is recommended to use the `490px` version. However, if you have limited GPU resources (such as only one node with 8 * RTX 3090), you can try the `224px` version with model parallelism.
The anticipated result of this script is around `95%` accuracy on the test set.
It is worth noting that the fine-tuning examples only tune a limited set of parameters. (Experts only) If you want to reach `>98%` accuracy, you need to increase the number of trainable parameters in `finetune_demo.py`.
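As a rough illustration of what that could mean (a sketch under assumptions, not the authors' recipe), one option is to replace the `disable_untrainable_params` patch at the top of `finetune_demo.py` with a looser filter; the choice of `enable = ['mlp']` below is an arbitrary example and should be adjusted after inspecting `named_parameters()`:
```python
# Sketch only: unfreeze every parameter whose name contains a substring in `enable`,
# instead of the default, which keeps just the ViT MLPs of layers 46-54 trainable.
from sat.helpers import print_rank0
from models.cogvlm_model import FineTuneTrainCogVLMModel

def disable_untrainable_params(self):
    total_trainable = 0
    enable = ['mlp']  # assumed choice; broaden or narrow to taste
    if self.args.use_ptuning:
        enable.append('ptuning')
    if self.args.use_lora or self.args.use_qlora:
        enable.extend(['matrix_A', 'matrix_B'])
    for n, p in self.named_parameters():
        if any(e.lower() in n.lower() for e in enable):
            total_trainable += p.numel()
        else:
            p.requires_grad_(False)
    print_rank0("***** Total trainable parameters: " + str(total_trainable) + " *****")

FineTuneTrainCogVLMModel.disable_untrainable_params = disable_untrainable_params
```
Keep in mind that every additional trainable parameter also increases optimizer-state memory, so the hardware requirements above grow accordingly.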
### 🤗 Transformers
To run inference with 🤗 Transformers, use the following code:
```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
'THUDM/cogvlm-chat-hf',
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True
).to('cuda').eval()
# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image]) # chat mode
inputs = {
'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
print(tokenizer.decode(outputs[0]))
# This image captures a moment from a basketball game. Two players are prominently featured: one wearing a yellow jersey with the number
# 24 and the word 'Lakers' written on it, and the other wearing a navy blue jersey with the word 'Washington' and the number 34. The player
# in yellow is holding a basketball and appears to be dribbling it, while the player in navy blue is reaching out with his arm, possibly
# trying to block or defend. The background shows a filled stadium with spectators, indicating that this is a professional game.</s>
# vqa example
query = 'How many houses are there in this cartoon?'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/3.jpg?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image], template_version='vqa') # vqa mode
inputs = {
'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
print(tokenizer.decode(outputs[0]))
# 4</s>
```
## License
The code in this repository is open source under the [Apache-2.0 license](./LICENSE), while the use of the CogVLM model weights must comply with the [Model License](./MODEL_LICENSE).
## Citation & Acknowledgements
If you find our work helpful, please consider citing the following paper:
```
@article{wang2023cogvlm,
title={CogVLM: Visual Expert for Pretrained Language Models},
author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
year={2023},
eprint={2311.03079},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
In the instruction fine-tuning phase of CogVLM, we used some English image-text data from the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLAVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR) and [Shikra](https://github.com/shikras/shikra) projects, as well as datasets from many classic cross-modal works. We sincerely thank them for their contributions.

201
README_zh.md Normal file
@@ -0,0 +1,201 @@
# CogVLM
📖 [Paper](./assets/cogvlm-paper.pdf)
🌐 [Web demo](http://36.103.203.44:7861/)
🔥 **News**: ```2023/11/20``` cogvlm-chat has been updated to v1.1, which supports chat and VQA in a single model and refreshes the SOTA results on multiple datasets.
🔥 **News**: ```2023/10/27``` The Chinese-English bilingual version of CogVLM is now [online](https://chatglm.cn/)! Welcome to try it out!
🔥 **News**: ```2023/11/20``` The 🤗 Hugging Face version of CogVLM is open-sourced, including [**cogvlm-chat**](https://huggingface.co/THUDM/cogvlm-chat-hf), **[cogvlm-grounding-generalist](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)/[base](https://huggingface.co/THUDM/cogvlm-grounding-base-hf)**, and **[cogvlm-base-490](https://huggingface.co/THUDM/cogvlm-base-490-hf)/[224](https://huggingface.co/THUDM/cogvlm-base-224-hf)**. Inference takes only a few lines of code; see [here](#-transformers) for usage.
[README in English](./README.md)
## Introduction
- CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters.
- CogVLM-17B achieves SOTA performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. You can try CogVLM's multimodal dialogue in the online [demo](http://36.103.203.44:7861).
<div align="center">
<img src=assets/metrics-min.png width=80% />
</div>
| Method | LLM | MM-VET | POPE(adversarial) | TouchStone |
| ---------------- | ------------- |--------| --------- |------------|
| BLIP-2 | Vicuna-13B | 22.4 | - | - |
| Otter | MPT-7B | 24.7 | - | - |
| MiniGPT4 | Vicuna-13B | 24.4 | 70.4 | 531.7 |
| InstructBLIP | Vicuna-13B | 25.6 | 77.3 | 552.4 |
| LLaMA-Adapter v2 | LLaMA-7B | 31.4 | - | 590.1 |
| LLaVA | LLaMA2-7B | 28.1 | 66.3 | 602.7 |
| mPLUG-Owl | LLaMA-7B | - | 66.8 | 605.4 |
| LLaVA-1.5 | Vicuna-13B | 36.3 | 84.5 | - |
| Emu | LLaMA-13B | 36.3 | - | - |
| Qwen-VL-Chat | - | - | - | 645.2 |
| DreamLLM | Vicuna-7B | 35.9 | 76.5 | - |
| CogVLM | Vicuna-7B | **52.8** | **87.6** | **742.0** |
## Examples
<!-- CogVLM is powerful for answering various types of visual questions, including **Detailed Description & Visual Question Answering**, **Complex Counting**, **Visual Math Problem Solving**, **OCR-Free Reasonging**, **OCR-Free Visual Question Answering**, **World Knowledge**, **Referring Expression Comprehension**, **Programming with Visual Input**, **Grounding with Caption**, **Grounding Visual Question Answering**, etc. -->
* CogVLM can accurately describe images with **almost no hallucination**.
<details>
<summary>Click for a comparison with LLaVA-1.5 and MiniGPT-4.</summary>
![LLaVA Comparison](assets/llava-comparison-min.png)
</details>
<br>
* CogVLM can understand and answer various types of questions, and has a **visual grounding** version.
<div align="center">
<img src=assets/pear_grounding.png width=90% />
</div>
<br>
* CogVLM sometimes captures more detailed content than GPT-4V(ision).
<div align="center">
<img src=assets/compare-min.png width=90% />
</div>
<!-- ![compare](assets/compare.png) -->
<br>
<details>
<summary>Click to expand more examples.</summary>
![Chat Examples](assets/chat.png)
</details>
## Method
The CogVLM model comprises four fundamental components: a vision transformer (ViT) encoder, an MLP adapter, a pretrained large language model (GPT), and a **visual expert module**. See the [paper](./assets/cogvlm-paper.pdf) for more details.
<div align="center">
<img src=assets/method-min.png width=70% />
</div>
## Get Started
We support two GUIs for model inference, a **web demo** and a **CLI**. If you want to use CogVLM in your own Python code, you can easily adapt the CLI scripts to your use case.
First, install the dependencies.
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
#### Hardware requirements
* Model inference: 1 * A100 (80G) or 2 * RTX 3090 (24G).
* Finetuning: 4 * A100 (80G) *[Recommended]* or 8 * RTX 3090 (24G).
<!-- ### Online Web Demo
We provide a [web demo](http://36.103.203.44:7861/) based on [Gradio](https://gradio.app). -->
### Web Demo
We also offer a local web demo based on Gradio. First, install Gradio by running `pip install gradio`. Then clone this repository, enter it, and run `web_demo.py`. Usage:
```bash
python web_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16
python web_demo.py --from_pretrained cogvlm-grounding-generalist --version base --english --bf16
```
The GUI of the web demo looks like:
<div align="center">
<img src=assets/web_demo-min.png width=70% />
</div>
### CLI
We open-source model weights for different downstream tasks:
* `cogvlm-chat` The aligned model, which supports GPT-4V-style chat.
* `cogvlm-base-224` The original checkpoint after text-image pretraining.
* `cogvlm-base-490` The 490px-resolution version finetuned from `cogvlm-base-224`.
* `cogvlm-grounding-generalist` This checkpoint supports different visual grounding tasks, e.g., REC, grounding captioning, etc.
Run the CLI demo via:
```bash
python cli_demo.py --from_pretrained cogvlm-base-224 --version base --english --bf16 --no_prompt
python cli_demo.py --from_pretrained cogvlm-base-490 --version base --english --bf16 --no_prompt
python cli_demo.py --from_pretrained cogvlm-chat --version chat --english --bf16
python cli_demo.py --from_pretrained cogvlm-grounding-generalist --version base --english --bf16
```
The program will automatically download the SAT model and start an interactive session in the command line. You can generate replies by entering instructions and pressing Enter.
Enter `clear` to clear the conversation history and `stop` to stop the program.
### 🤗 Transformers
Inference with 🤗 Transformers takes only a few lines of code:
```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
'THUDM/cogvlm-chat-hf',
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=True
).to('cuda').eval()
# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image]) # chat mode
inputs = {
'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
print(tokenizer.decode(outputs[0]))
# This image captures a moment from a basketball game. Two players are prominently featured: one wearing a yellow jersey with the number
# 24 and the word 'Lakers' written on it, and the other wearing a navy blue jersey with the word 'Washington' and the number 34. The player
# in yellow is holding a basketball and appears to be dribbling it, while the player in navy blue is reaching out with his arm, possibly
# trying to block or defend. The background shows a filled stadium with spectators, indicating that this is a professional game.</s>
# vqa example
query = 'How many houses are there in this cartoon?'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/3.jpg?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image], template_version='vqa') # vqa mode
inputs = {
'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}
with torch.no_grad():
outputs = model.generate(**inputs, **gen_kwargs)
outputs = outputs[:, inputs['input_ids'].shape[1]:]
print(tokenizer.decode(outputs[0]))
# 4</s>
```
## License
The code in this repository is open source under the [Apache-2.0 license](./LICENSE), while the use of the CogVLM model weights must comply with the [Model License](./MODEL_LICENSE).
## Citation & Acknowledgements
If you find our work helpful, please consider citing the following paper:
```
@article{wang2023cogvlm,
title={CogVLM: Visual Expert for Pretrained Language Models},
author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
year={2023},
eprint={2311.03079},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
In the instruction fine-tuning phase of CogVLM, we used some English image-text data from the [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4), [LLAVA](https://github.com/haotian-liu/LLaVA), [LRV-Instruction](https://github.com/FuxiaoLiu/LRV-Instruction), [LLaVAR](https://github.com/SALT-NLP/LLaVAR) and [Shikra](https://github.com/shikras/shikra) projects, as well as datasets from many classic cross-modal works. We sincerely thank them for their contributions.

6
assets/WECHAT.md Normal file
@@ -0,0 +1,6 @@
<div align="center">
<img src=wechat.jpg width="60%"/>
<p> 扫码关注公众号加入「ChatGLM交流群」 </p>
<p> Scan the QR code to follow the official account and join the "ChatGLM Discussion Group" </p>
</div>

BIN
assets/chat-min.png Normal file

Binary file not shown.

Size: 1.9 MiB

BIN
assets/chat.png Normal file

Binary file not shown.

Size: 7.1 MiB

BIN
assets/cogvlm-paper.pdf Normal file

Binary file not shown.

BIN
assets/compare-min.png Normal file

Binary file not shown.

Size: 64 KiB

BIN
assets/compare.png Normal file

Binary file not shown.

Size: 247 KiB

Binary file not shown.

Size: 470 KiB

BIN
assets/method-min.png Normal file

Binary file not shown.

Size: 109 KiB

BIN
assets/method.png Normal file

Binary file not shown.

Size: 286 KiB

BIN
assets/metrics-min.png Normal file

Binary file not shown.

Size: 56 KiB

BIN
assets/metrics.png Normal file

Binary file not shown.

Size: 238 KiB

BIN
assets/pear_grounding.png Normal file

Binary file not shown.

Size: 256 KiB

BIN
assets/web_demo-min.png Normal file

Binary file not shown.

Size: 70 KiB

BIN
assets/web_demo.png Normal file

Binary file not shown.

Size: 302 KiB

BIN
assets/wechat.jpg Normal file

Binary file not shown.

Size: 151 KiB

154
cli_demo.py Normal file
@@ -0,0 +1,154 @@
# -*- encoding: utf-8 -*-
import os, sys
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import torch
import argparse
from sat.model.mixins import CachedAutoregressiveMixin
from utils.chat import chat
from models.cogvlm_model import CogVLMModel
from utils.language import llama2_tokenizer, llama2_text_processor_inference
from utils.vision import get_image_processor
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--max_length", type=int, default=2048, help='max length of the total sequence')
parser.add_argument("--top_p", type=float, default=0.4, help='top p for nucleus sampling')
parser.add_argument("--top_k", type=int, default=1, help='top k for top k sampling')
parser.add_argument("--temperature", type=float, default=.8, help='temperature for sampling')
parser.add_argument("--english", action='store_true', help='only output English')
parser.add_argument("--version", type=str, default="chat", help='version to interact with')
parser.add_argument("--from_pretrained", type=str, default="cogvlm-chat-v1.1", help='pretrained ckpt')
parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
parser.add_argument("--no_prompt", action='store_true', help='Sometimes there is no prompt in stage 1')
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--bf16", action="store_true")
args = parser.parse_args()
rank = int(os.environ.get('RANK', 0))
world_size = int(os.environ.get('WORLD_SIZE', 1))
parser = CogVLMModel.add_model_specific_args(parser)
args = parser.parse_args()
# load model
model, model_args = CogVLMModel.from_pretrained(
args.from_pretrained,
args=argparse.Namespace(
deepspeed=None,
local_rank=rank,
rank=rank,
world_size=world_size,
model_parallel_size=world_size,
mode='inference',
skip_init=True,
use_gpu_initialization=True if torch.cuda.is_available() else False,
device='cuda',
**vars(args)
), overwrite_args={'model_parallel_size': world_size} if world_size != 1 else {})
model = model.eval()
from sat.mpu import get_model_parallel_world_size
assert world_size == get_model_parallel_world_size(), "world size must equal to model parallel size for cli_demo!"
tokenizer = llama2_tokenizer(args.local_tokenizer, signal_type=args.version)
image_processor = get_image_processor(model_args.eva_args["image_size"][0])
model.add_mixin('auto-regressive', CachedAutoregressiveMixin())
text_processor_infer = llama2_text_processor_inference(tokenizer, args.max_length, model.image_length)
if not args.english:
if rank == 0:
print('欢迎使用 CogVLM-CLI 输入图像URL或本地路径读图继续输入内容对话clear 重新开始stop 终止程序')
else:
if rank == 0:
print('Welcome to CogVLM-CLI. Enter an image URL or local file path to load an image. Continue inputting text to engage in a conversation. Type "clear" to start over, or "stop" to end the program.')
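    # Outer loop: start a new conversation (optionally loading an image); the inner loop handles multi-turn chat until the user types 'clear' or 'stop'.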
with torch.no_grad():
while True:
history = None
cache_image = None
if not args.english:
if rank == 0:
image_path = [input("请输入图像路径或URL回车进入纯文本对话 ")]
else:
image_path = [None]
else:
if rank == 0:
image_path = [input("Please enter the image path or URL (press Enter for plain text conversation): ")]
else:
image_path = [None]
if world_size > 1:
torch.distributed.broadcast_object_list(image_path, 0)
image_path = image_path[0]
assert image_path is not None
if image_path == 'stop':
break
if args.no_prompt and len(image_path) > 0:
query = ""
else:
if not args.english:
if rank == 0:
query = [input("用户:")]
else:
query = [None]
else:
if rank == 0:
query = [input("User: ")]
else:
query = [None]
if world_size > 1:
torch.distributed.broadcast_object_list(query, 0)
query = query[0]
assert query is not None
while True:
if query == "clear":
break
if query == "stop":
sys.exit(0)
try:
response, history, cache_image = chat(
image_path,
model,
text_processor_infer,
image_processor,
query,
history=history,
image=cache_image,
max_length=args.max_length,
top_p=args.top_p,
temperature=args.temperature,
top_k=args.top_k,
invalid_slices=text_processor_infer.invalid_slices,
no_prompt=args.no_prompt
)
except Exception as e:
print(e)
break
if rank == 0:
if not args.english:
print("模型:"+response)
if tokenizer.signal_type == "grounding":
print("Grounding 结果已保存至 ./output.png")
else:
print("Model: "+response)
if tokenizer.signal_type == "grounding":
print("Grounding result is saved at ./output.png")
image_path = None
if not args.english:
if rank == 0:
query = [input("用户:")]
else:
query = [None]
else:
if rank == 0:
query = [input("User: ")]
else:
query = [None]
if world_size > 1:
torch.distributed.broadcast_object_list(query, 0)
query = query[0]
assert query is not None
if __name__ == "__main__":
main()

217
evaluate_demo.py Normal file
@@ -0,0 +1,217 @@
import os
import torch
import argparse
from sat import mpu, get_args, get_tokenizer
from sat.training.deepspeed_training import training_main
from sat.helpers import print_rank0
from models.cogvlm_model import FineTuneTestCogVLMModel
from utils.language import llama2_text_processor, llama2_text_processor_inference
from utils.vision import get_image_processor
from functools import partial
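# Collate a list of per-example dicts into batched tensors; tensors under the 'vision' key are flattened into 'vision_'-prefixed keyword arguments so the ViT mixin can pick them up.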
def data_collator(examples):
examples = [ex for ex in examples if len(ex) > 0] # drop {}
for example in examples:
for k in example:
if isinstance(example[k], list):
example[k] = torch.tensor(example[k])
elif isinstance(example[k], np.ndarray):
example[k] = torch.from_numpy(example[k])
img_args = {}
tmp_example = examples[0]
for k in tmp_example['vision']:
if type(tmp_example['vision'][k]) is torch.Tensor:
img_args['vision_'+k] = torch.cat([example['vision'][k] for example in examples])
else:
img_args['vision_'+k] = example['vision'][k]
for example in examples:
example.pop('vision')
if 'cross' in example:
example.pop('cross')
model_args = {}
tmp_example = examples[0]
for k in tmp_example:
if type(tmp_example[k]) is torch.Tensor:
model_args[k] = torch.cat([example[k] for example in examples])
else:
model_args[k] = tmp_example[k]
model_args.update(img_args)
return model_args
from collections import defaultdict
def broadcast_auto(data_dict):
type2list = defaultdict(list)
other = []
for k in data_dict:
if type(data_dict[k]) is torch.Tensor:
type2list[data_dict[k].dtype].append(k)
else:
other.append(k)
new_data = {}
for k in type2list:
new_data.update(mpu.broadcast_data(type2list[k], data_dict, k))
for k in other:
new_data[k] = data_dict[k]
return new_data
def get_batch(data_iterator, args, timers):
# Broadcast data.
timers('data loader').start()
if data_iterator is not None:
data = next(data_iterator)
else:
data = None
timers('data loader').stop()
data_b = broadcast_auto(data)
for k in data_b:
if type(data_b[k]) is torch.Tensor and data_b[k].dtype is not torch.int32 and data_b[k].dtype is not torch.long:
if args.fp16:
data_b[k] = data_b[k].half()
elif args.bf16:
data_b[k] = data_b[k].bfloat16()
return data_b
from torch.nn import CrossEntropyLoss
import numpy as np
from sat.model.mixins import CachedAutoregressiveMixin
from sat.generation.autoregressive_sampling import filling_sequence
from sat.generation.sampling_strategies import BaseStrategy, BeamSearchStrategy
def chat(model, tokenizer, tokens,
max_length: int = 1800, num_beams=5, top_p=0.95, top_k=0, temperature=0.8, **kwargs):
inputs = tokens.to(model.parameters().__next__().device)[0]
seq = torch.cat(
[inputs, torch.tensor([-1] * (max_length - len(inputs)), device=inputs.device)], dim=0
)
strategy = BaseStrategy(temperature=temperature, top_p=0.4, top_k=1, end_tokens=[tokenizer.eos_token_id])
# strategy = BeamSearchStrategy(temperature=temperature, top_p=top_p, top_k=top_k, end_tokens=[tokenizer.eos_token_id],
# num_beams=num_beams, consider_end=True)
get_func = llama2_text_processor_inference.get_func(None, None, image_rope_mask=kwargs['image_rope_mask'])
output = filling_sequence(
model, seq,
batch_size=1,
strategy=strategy,
get_masks_and_position_ids=get_func,
**kwargs
)[0] # drop memory
return output
def forward_step_eval(data_iterator, model, args, timers):
def compute_metrics(eval_preds):
preds, labels, device = eval_preds
preds = preds.unsqueeze(0)
if isinstance(preds, tuple):
preds = preds[0]
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
if args.ignore_pad_token_for_loss:
# Replace -100 in the labels as we can't decode them.
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
score_dict = {
"acc": [],
"acc_w/o_case": [],
}
for pred, label in zip(decoded_preds, decoded_labels):
if args.rank == 0:
print('pred', pred, 'label', label, flush=True)
if pred == label:
score_dict['acc'].append(1.)
else:
score_dict['acc'].append(0.)
if pred.lower() == label.lower():
score_dict['acc_w/o_case'].append(1.)
else:
score_dict['acc_w/o_case'].append(0.)
for k, v in score_dict.items():
score_dict[k] = float(np.mean(v))
return score_dict
# Get the batch.
timers('batch generator').start()
data_b = get_batch(
data_iterator, args, timers)
timers('batch generator').stop()
context_len = int(data_b['context_length'][0])
tokens = data_b['input_ids'][:, :context_len]
data_b['vision_expert_mask'] = data_b['vision_expert_mask'][:, :context_len]
data_b['image_embed_mask'] = data_b['image_embed_mask'][:, :context_len]
data_b['image_rope_mask'] = data_b['image_rope_mask'][:, :context_len]
data_b.pop('input_ids')
data_b.pop('attention_mask')
data_b.pop('position_ids')
labels = data_b.pop('labels')
qid = data_b.pop('question_id')
model.add_mixin('auto-regressive', CachedAutoregressiveMixin())
outputs = chat(model, tokenizer, tokens, **data_b)[0][context_len:]
# print(outputs)
model.del_mixin('auto-regressive')
return torch.tensor(0, device=outputs.device), {k: torch.tensor(v, device=outputs.device) for k, v in
compute_metrics(
(outputs.cpu(), labels.cpu(), outputs.device)).items()}
from torch.nn import CrossEntropyLoss
def forward_step(data_iterator, model, args, timers):
"""Forward step."""
# Get the batch.
timers('batch generator').start()
data_b = get_batch(
data_iterator, args, timers)
labels = data_b.pop('labels')
timers('batch generator').stop()
logits = model(**data_b)[0]
lm_logits = logits.to(torch.float32)
# Shift so that tokens < n predict n
shift_labels = labels[..., 1:].contiguous()
shift_logits = lm_logits[..., -1-shift_labels.size(-1):-1, :].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
loss = loss.to(torch.float32)
return loss, {'loss': loss}
from utils.dataset import ItemDataset
def create_dataset_function(image_processor, text_processor, path, args):
dataset = ItemDataset(image_processor, text_processor, args, path)
return dataset
if __name__ == '__main__':
py_parser = argparse.ArgumentParser(add_help=False)
py_parser.add_argument('--max_length', type=int)
py_parser.add_argument('--ignore_pad_token_for_loss', action='store_false')
py_parser.add_argument("--version", type=str, default="chat", help='version to interact with')
py_parser.add_argument("--from_pretrained", type=str, default="cogvlm-chat", help='pretrained ckpt')
py_parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
py_parser.add_argument("--vit_checkpoint_activations", action='store_true')
py_parser = FineTuneTestCogVLMModel.add_model_specific_args(py_parser)
known, args_list = py_parser.parse_known_args()
args = get_args(args_list)
args = argparse.Namespace(**vars(args), **vars(known))
if args.use_qlora:
args.device = 'cpu'
model, args = FineTuneTestCogVLMModel.from_pretrained(args.from_pretrained, args, overwrite_args={'model_parallel_size': args.model_parallel_size} if args.model_parallel_size != 1 else {})
if args.use_qlora and torch.cuda.is_available():
model = model.to('cuda')
from utils.language import llama2_tokenizer
tokenizer = llama2_tokenizer(args.local_tokenizer, signal_type=args.version)
image_processor = get_image_processor(args.eva_args["image_size"][0])
text_processor = llama2_text_processor(tokenizer, args.max_length, args.image_length)
training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)

BIN
examples/1.png Normal file

Binary file not shown.

Size: 139 KiB

BIN
examples/2.jpeg Normal file

Binary file not shown.

Size: 68 KiB

BIN
examples/3.jpg Normal file

Binary file not shown.

Size: 52 KiB

BIN
examples/4.jpg Normal file

Binary file not shown.

Size: 61 KiB

BIN
examples/5.jpg Normal file

Binary file not shown.

Size: 335 KiB

BIN
examples/6.jpg Normal file

Binary file not shown.

Size: 331 KiB

@@ -0,0 +1,6 @@
{"id":1, "text": "Describe this image", "image": "examples/1.png"}
{"id":2, "text": "what did Musk talk about?", "image": "examples/2.jpeg"}
{"id":3, "text": "How many houses are there in this cartoon?", "image": "examples/3.jpg"}
{"id":4, "text": "Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?", "image": "examples/4.jpg"}
{"id":5, "text": "Where is the tree closer to the sun?", "image": "examples/5.jpg"}
{"id":6, "text": "What color are the clothes of the girl whose hands are holding flowers? Let's think step by step", "image": "examples/6.jpg"}

262
finetune_demo.py Normal file
@@ -0,0 +1,262 @@
import os
import torch
import argparse
from sat import mpu, get_args, get_tokenizer
from sat.training.deepspeed_training import training_main
from sat.helpers import print_rank0
from models.cogvlm_model import FineTuneTrainCogVLMModel
from utils.language import llama2_text_processor, llama2_text_processor_inference
from utils.vision import get_image_processor
from functools import partial
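# Decide which parameters stay trainable. With the default pattern below, only the ViT MLP parameters whose layer index falls in 46-54 are unfrozen; p-tuning and LoRA add their own trainable parameters when enabled.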
def disable_untrainable_params(self):
total_trainable = 0
enable = [('mlp', 'vit')]
if self.args.use_ptuning:
enable.extend(['ptuning'])
if self.args.use_lora or self.args.use_qlora:
enable.extend(['matrix_A', 'matrix_B'])
for n, p in self.named_parameters():
flag = False
for e in enable:
if type(e) is tuple:
if e[0].lower() in n.lower() and e[1].lower() in n.lower() and 55 > int(n[:n.find('.mlp')].split('.')[-1]) > 45:
flag = True
break
else:
if e.lower() in n.lower():
flag = True
break
if not flag:
p.requires_grad_(False)
else:
total_trainable += p.numel()
print_rank0(n)
print_rank0("***** Total trainable parameters: "+str(total_trainable)+" *****")
FineTuneTrainCogVLMModel.disable_untrainable_params = disable_untrainable_params
def data_collator(examples):
examples = [ex for ex in examples if len(ex) > 0] # drop {}
for example in examples:
for k in example:
if isinstance(example[k], list):
example[k] = torch.tensor(example[k])
elif isinstance(example[k], np.ndarray):
example[k] = torch.from_numpy(example[k])
img_args = {}
tmp_example = examples[0]
for k in tmp_example['vision']:
if type(tmp_example['vision'][k]) is torch.Tensor:
img_args['vision_'+k] = torch.cat([example['vision'][k] for example in examples])
else:
img_args['vision_'+k] = example['vision'][k]
for example in examples:
example.pop('vision')
if 'cross' in example:
example.pop('cross')
model_args = {}
tmp_example = examples[0]
for k in tmp_example:
if type(tmp_example[k]) is torch.Tensor:
model_args[k] = torch.cat([example[k] for example in examples])
else:
model_args[k] = tmp_example[k]
model_args.update(img_args)
return model_args
from collections import defaultdict
def broadcast_auto(data_dict):
type2list = defaultdict(list)
other = []
for k in data_dict:
if type(data_dict[k]) is torch.Tensor:
type2list[data_dict[k].dtype].append(k)
else:
other.append(k)
new_data = {}
for k in type2list:
new_data.update(mpu.broadcast_data(type2list[k], data_dict, k))
for k in other:
new_data[k] = data_dict[k]
return new_data
def get_batch(data_iterator, args, timers):
# Broadcast data.
timers('data loader').start()
if data_iterator is not None:
data = next(data_iterator)
else:
data = None
timers('data loader').stop()
data_b = broadcast_auto(data)
for k in data_b:
if type(data_b[k]) is torch.Tensor and data_b[k].dtype is not torch.int32 and data_b[k].dtype is not torch.long:
if args.fp16:
data_b[k] = data_b[k].half()
elif args.bf16:
data_b[k] = data_b[k].bfloat16()
return data_b
from torch.nn import CrossEntropyLoss
import numpy as np
from sat.model.mixins import CachedAutoregressiveMixin
from sat.generation.autoregressive_sampling import filling_sequence
from sat.generation.sampling_strategies import BaseStrategy, BeamSearchStrategy
def chat(model, tokenizer, tokens,
max_length: int = 1800, num_beams=5, top_p=0.95, top_k=0, temperature=0.8, **kwargs):
inputs = tokens.to(model.parameters().__next__().device)[0]
seq = torch.cat(
[inputs, torch.tensor([-1] * (max_length - len(inputs)), device=inputs.device)], dim=0
)
strategy = BaseStrategy(temperature=temperature, top_p=0.4, top_k=1, end_tokens=[tokenizer.eos_token_id])
# strategy = BeamSearchStrategy(temperature=temperature, top_p=top_p, top_k=top_k, end_tokens=[tokenizer.eos_token_id],
# num_beams=num_beams, consider_end=True)
get_func = llama2_text_processor_inference.get_func(None, None, image_rope_mask=kwargs['image_rope_mask'])
output = filling_sequence(
model, seq,
batch_size=1,
strategy=strategy,
get_masks_and_position_ids=get_func,
**kwargs
)[0] # drop memory
return output
def forward_step_eval(data_iterator, model, args, timers):
def compute_metrics(eval_preds):
preds, labels, device = eval_preds
preds = preds.unsqueeze(0)
if isinstance(preds, tuple):
preds = preds[0]
decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
if args.ignore_pad_token_for_loss:
# Replace -100 in the labels as we can't decode them.
labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
score_dict = {
"acc": [],
"acc_w/o_case": [],
}
for pred, label in zip(decoded_preds, decoded_labels):
if args.rank == 0:
print('pred', pred, 'label', label, flush=True)
if pred == label:
score_dict['acc'].append(1.)
else:
score_dict['acc'].append(0.)
if pred.lower() == label.lower():
score_dict['acc_w/o_case'].append(1.)
else:
score_dict['acc_w/o_case'].append(0.)
for k, v in score_dict.items():
score_dict[k] = float(np.mean(v))
return score_dict
# Get the batch.
timers('batch generator').start()
data_b = get_batch(
data_iterator, args, timers)
timers('batch generator').stop()
context_len = int(data_b['context_length'][0])
tokens = data_b['input_ids'][:, :context_len]
data_b['vision_expert_mask'] = data_b['vision_expert_mask'][:, :context_len]
data_b['image_embed_mask'] = data_b['image_embed_mask'][:, :context_len]
data_b['image_rope_mask'] = data_b['image_rope_mask'][:, :context_len]
data_b.pop('input_ids')
data_b.pop('attention_mask')
data_b.pop('position_ids')
labels = data_b.pop('labels')
qid = data_b.pop('question_id')
model.add_mixin('auto-regressive', CachedAutoregressiveMixin())
outputs = chat(model, tokenizer, tokens, **data_b)[0][context_len:]
# print(outputs)
model.del_mixin('auto-regressive')
return torch.tensor(0, device=outputs.device), {k: torch.tensor(v, device=outputs.device) for k, v in
compute_metrics(
(outputs.cpu(), labels.cpu(), outputs.device)).items()}
from torch.nn import CrossEntropyLoss
def forward_step(data_iterator, model, args, timers):
"""Forward step."""
# Get the batch.
timers('batch generator').start()
data_b = get_batch(
data_iterator, args, timers)
labels = data_b.pop('labels')
timers('batch generator').stop()
logits = model(**data_b)[0]
lm_logits = logits.to(torch.float32)
# Shift so that tokens < n predict n
shift_labels = labels[..., 1:].contiguous()
shift_logits = lm_logits[..., -1-shift_labels.size(-1):-1, :].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
loss = loss.to(torch.float32)
return loss, {'loss': loss}
from utils.dataset import ItemDataset
def create_dataset_function(image_processor, text_processor, path, args):
dataset = ItemDataset(image_processor, text_processor, args, path)
return dataset
from sat.model.finetune.lora2 import LoraMixin
from sat.model.finetune.prompt_tuning import PTuningV2Mixin
if __name__ == '__main__':
py_parser = argparse.ArgumentParser(add_help=False)
py_parser.add_argument('--max_length', type=int)
py_parser.add_argument('--ignore_pad_token_for_loss', action='store_false')
py_parser.add_argument("--version", type=str, default="chat", help='version to interact with')
py_parser.add_argument("--from_pretrained", type=str, default="cogvlm-chat", help='pretrained ckpt')
py_parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
py_parser.add_argument("--vit_checkpoint_activations", action='store_true')
py_parser = FineTuneTrainCogVLMModel.add_model_specific_args(py_parser)
known, args_list = py_parser.parse_known_args()
args = get_args(args_list)
args = argparse.Namespace(**vars(args), **vars(known))
if args.use_qlora:
args.device = 'cpu'
model, args = FineTuneTrainCogVLMModel.from_pretrained(args.from_pretrained, args, overwrite_args={'model_parallel_size': args.model_parallel_size} if args.model_parallel_size != 1 else {})
if args.use_ptuning:
model.add_mixin("ptuning", PTuningV2Mixin(args.num_layers, args.hidden_size // args.num_attention_heads, args.num_attention_heads, args.pre_seq_len))
if args.use_lora:
model.add_mixin("lora", LoraMixin(args.num_layers, args.lora_rank, layer_range=args.layer_range), reinit=True)
model.get_mixin("eva").vit_model.add_mixin("lora", LoraMixin(args.eva_args['num_layers'], args.lora_rank, layer_range=args.layer_range), reinit=True)
elif args.use_qlora:
model.add_mixin("lora", LoraMixin(args.num_layers, args.lora_rank, layer_range=args.layer_range, qlora=True), reinit=True)
if args.use_qlora and torch.cuda.is_available():
model = model.to('cuda')
from utils.language import llama2_tokenizer
tokenizer = llama2_tokenizer(args.local_tokenizer, signal_type=args.version)
image_processor = get_image_processor(args.eva_args["image_size"][0])
text_processor = llama2_text_processor(tokenizer, args.max_length, args.image_length)
model = training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=partial(create_dataset_function, image_processor, text_processor), collate_fn=data_collator, forward_step_eval=forward_step_eval)
if args.use_lora:
model.get_mixin("lora").merge_lora()
model.get_mixin("eva").vit_model.get_mixin("lora").merge_lora()
args.use_lora = False
args.save = "checkpoints/merged_lora_{}".format(args.eva_args["image_size"][0])
from sat.training.model_io import save_checkpoint
save_checkpoint(1, model, None, None, args)

42
merge_model.py Normal file
@@ -0,0 +1,42 @@
# -*- encoding: utf-8 -*-
import os, sys
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import torch
import argparse
from models.cogvlm_model import FineTuneTestCogVLMModel
from sat.training.model_io import save_checkpoint
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--version", type=str, default="base", help='version to interact with')
parser.add_argument("--from_pretrained", type=str, default="checkpoints/merged_lora", help='pretrained ckpt')
parser.add_argument("--fp16", action="store_true")
parser.add_argument("--bf16", action="store_true")
args = parser.parse_args()
rank = int(os.environ.get('RANK', 0))
world_size = int(os.environ.get('WORLD_SIZE', 1))
parser = FineTuneTestCogVLMModel.add_model_specific_args(parser)
args = parser.parse_args()
# load model
model, model_args = FineTuneTestCogVLMModel.from_pretrained(
args.from_pretrained,
args=argparse.Namespace(
deepspeed=None,
local_rank=rank,
rank=rank,
world_size=world_size,
model_parallel_size=world_size,
mode='inference',
skip_init=True,
use_gpu_initialization=True if torch.cuda.is_available() else False,
device='cuda',
**vars(args)
), url='local', overwrite_args={'model_parallel_size': 1})
model = model.eval()
model_args.save = './checkpoints/merged_model_{}'.format(model_args.eva_args["image_size"][0])
save_checkpoint(1, model, None, None, model_args)
if __name__ == "__main__":
main()

165
models/cogvlm_model.py Normal file
@@ -0,0 +1,165 @@
from sat.model.official.llama_model import LLaMAModel
import json
import torch
from sat.model.base_model import BaseMixin
import torch.nn as nn
from models.mixin import LlamaVisionExpertFCMixin, LlamaVisionExpertAttnMixin
from sat.resources.urls import MODEL_URLS
MODEL_URLS["cogvlm-base-224"] = "r2://cogvlm-base-224.zip"
MODEL_URLS["cogvlm-base-490"] = "r2://cogvlm-base-490.zip"
MODEL_URLS["cogvlm-chat-v1.1"] = "r2://cogvlm-chat-v1.1.zip"
MODEL_URLS["cogvlm-grounding-base"] = "r2://cogvlm-grounding-base.zip"
MODEL_URLS["cogvlm-grounding-generalist"] = "r2://cogvlm-grounding-generalist.zip"
class GLU(nn.Module):
def __init__(self, args, in_features):
super().__init__()
self.linear_proj = nn.Linear(in_features, args.hidden_size, bias=False)
self.norm1 = nn.LayerNorm(args.hidden_size)
self.act1 = nn.GELU()
self.act2 = nn.functional.silu
self.dense_h_to_4h = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
self.gate_proj = nn.Linear(args.hidden_size, args.inner_hidden_size, bias=False)
self.dense_4h_to_h = nn.Linear(args.inner_hidden_size, args.hidden_size, bias=False)
def forward(self, x):
x = self.linear_proj(x)
x = self.act1(self.norm1(x))
x = self.act2(self.gate_proj(x)) * self.dense_h_to_4h(x)
x = self.dense_4h_to_h(x)
return x
from models.eva_clip_model import EVA2CLIPModel
import argparse
from copy import deepcopy
def override_dist_dtype_device_args(args, b={}):
if args.mode == 'inference':
minimal_args = argparse.Namespace(
world_size=args.world_size,
rank=args.rank,
local_rank=args.local_rank,
skip_init=args.skip_init,
use_gpu_initialization=args.use_gpu_initialization,
deepspeed=args.deepspeed,
bf16=args.bf16,
fp16=args.fp16,
mode=args.mode,
device=args.device
)
else:
minimal_args = argparse.Namespace(
world_size=args.world_size,
rank=args.rank,
local_rank=args.local_rank,
skip_init=args.skip_init,
use_gpu_initialization=args.use_gpu_initialization,
deepspeed=args.deepspeed,
bf16=args.bf16,
fp16=args.fp16,
mode=args.mode,
checkpoint_activations=args.checkpoint_activations if not hasattr(args, 'vit_checkpoint_activations') else args.vit_checkpoint_activations,
checkpoint_num_layers=args.checkpoint_num_layers,
device=args.device,
hidden_dropout=0.,
attention_dropout=0.,
)
if hasattr(args, 'model_parallel_size'):
b['model_parallel_size'] = args.model_parallel_size
return argparse.Namespace(**deepcopy(b), **vars(minimal_args))
class ImageMixin(BaseMixin):
def __init__(self, args):
super().__init__()
vit_args = override_dist_dtype_device_args(args, args.eva_args)
self.vit_model = EVA2CLIPModel(EVA2CLIPModel.get_args(**vars(vit_args)))
self.in_features = 1792
self.linear_proj = GLU(args, self.in_features)
self.image_length = args.image_length
self.boi = nn.Parameter(torch.zeros(1, 1, args.hidden_size))
self.eoi = nn.Parameter(torch.zeros(1, 1, args.hidden_size))
def word_embedding_forward(self, input_ids, output_cross_layer, **kw_args):
vision_inputs = {}
for k in kw_args:
if k.startswith('vision_') and k != 'vision_expert_mask':
vision_inputs[k[7:]] = kw_args[k]
if input_ids.shape[1] == 1 or not vision_inputs:
return self.transformer.word_embeddings(input_ids)
image_emb = self.vit_model(**vision_inputs)[0]
image_emb = self.linear_proj(image_emb)
image_embed_mask = kw_args['image_embed_mask']
word_embedding = self.transformer.word_embeddings(input_ids).clone()
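        # Scatter [boi] + projected image features + [eoi] into the token positions marked by image_embed_mask.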
word_embedding[image_embed_mask.bool()] = torch.cat([self.boi.repeat(len(image_emb), 1, 1), image_emb, self.eoi.repeat(len(image_emb), 1, 1)], dim=1).reshape(-1, image_emb.shape[-1])
return word_embedding.contiguous()
class CogVLMModel(LLaMAModel):
def __init__(self, args, transformer=None, parallel_output=True, **kwargs):
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kwargs)
self.image_length = args.image_length
self.add_mixin("eva", ImageMixin(args))
self.del_mixin("mlp")
self.add_mixin("mlp", LlamaVisionExpertFCMixin(args.hidden_size, args.inner_hidden_size, args.num_layers, 32))
self.del_mixin("rotary")
self.add_mixin("rotary", LlamaVisionExpertAttnMixin(args.hidden_size, args.num_attention_heads, args.num_layers, 32))
@classmethod
def add_model_specific_args(cls, parser):
group = parser.add_argument_group('CogVLM', 'CogVLM Configurations')
group.add_argument('--image_length', type=int, default=256)
group.add_argument('--eva_args', type=json.loads, default={})
return super().add_model_specific_args(parser)
def forward(self, input_ids, vision_expert_mask, image_embed_mask, **kwargs):
if input_ids.shape[1] > 1:
return super().forward(input_ids=input_ids, vision_expert_mask=vision_expert_mask, image_embed_mask=image_embed_mask, **kwargs)
return super().forward(input_ids=input_ids, **kwargs)
class FineTuneTrainCogVLMModel(CogVLMModel):
def __init__(self, args, transformer=None, parallel_output=True, **kw_args):
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kw_args)
self.args = args
# If you want to use model parallel with a mp_size=1 checkpoint, and meanwhile you also want to use lora,
# you have to add_mixin after loading model checkpoint.
@classmethod
def add_model_specific_args(cls, parser):
group = parser.add_argument_group('CogVLM-finetune', 'CogVLM finetune Configurations')
group.add_argument('--pre_seq_len', type=int, default=8)
group.add_argument('--lora_rank', type=int, default=10)
group.add_argument('--use_ptuning', action="store_true")
group.add_argument('--use_lora', action="store_true")
group.add_argument('--use_qlora', action="store_true")
group.add_argument('--layer_range', nargs='+', type=int, default=None)
return super().add_model_specific_args(parser)
from sat.model.finetune import PTuningV2Mixin
from sat.model.finetune.lora2 import LoraMixin
class FineTuneTestCogVLMModel(CogVLMModel):
def __init__(self, args, transformer=None, parallel_output=True, **kw_args):
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kw_args)
if args.use_ptuning:
self.add_mixin("ptuning", PTuningV2Mixin(args.num_layers, args.hidden_size // args.num_attention_heads, args.num_attention_heads, args.pre_seq_len))
if args.use_lora:
self.add_mixin("lora", LoraMixin(args.num_layers, args.lora_rank, layer_range=args.layer_range), reinit=True)
self.get_mixin("eva").vit_model.add_mixin("lora", LoraMixin(args.eva_args['num_layers'], args.lora_rank, layer_range=args.layer_range), reinit=True)
elif args.use_qlora:
self.add_mixin("lora", LoraMixin(args.num_layers, args.lora_rank, layer_range=args.layer_range, qlora=True), reinit=True)
self.args = args
@classmethod
def add_model_specific_args(cls, parser):
group = parser.add_argument_group('CogVLM-finetune', 'CogVLM finetune Configurations')
group.add_argument('--pre_seq_len', type=int, default=8)
group.add_argument('--lora_rank', type=int, default=10)
group.add_argument('--use_ptuning', action="store_true")
group.add_argument('--use_lora', action="store_true")
group.add_argument('--use_qlora', action="store_true")
group.add_argument('--layer_range', nargs='+', type=int, default=None)
return super().add_model_specific_args(parser)
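
For reference, a minimal self-contained sketch (shapes and the image-span position below are made up) of the masked scatter that word_embedding_forward above performs: positions flagged by image_embed_mask receive [boi, image features, eoi], while the remaining positions keep their ordinary word embeddings.

import torch

batch, seq_len, hidden = 2, 10, 4
image_len = 3  # stands in for args.image_length

word_embedding = torch.randn(batch, seq_len, hidden)
image_emb = torch.randn(batch, image_len, hidden)   # stand-in for linear_proj(vit_model(...))
boi = torch.zeros(1, 1, hidden)
eoi = torch.zeros(1, 1, hidden)

# one contiguous image span per sample: boi + image tokens + eoi
image_embed_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
image_embed_mask[:, 2:2 + image_len + 2] = True

vision_part = torch.cat([boi.repeat(batch, 1, 1), image_emb, eoi.repeat(batch, 1, 1)], dim=1)
word_embedding[image_embed_mask] = vision_part.reshape(-1, hidden)
print(word_embedding.shape)  # torch.Size([2, 10, 4])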

127
models/eva_clip_model.py Normal file

@ -0,0 +1,127 @@
import torch
from sat.model.base_model import BaseModel
from sat.model.mixins import BaseMixin
from sat.model.official.vit_model import ViTProperty, ImagePatchEmbeddingMixin, InterpolatedPositionEmbeddingMixin, gelu
from sat import mpu
class IdentityMixin(BaseMixin):
def __init__(self):
super().__init__()
def final_forward(self, logits, **kwargs):
return logits[:, 1:]
import xformers.ops as xops
class XAttn(BaseMixin):
def __init__(self, head_dim):
super().__init__()
self.scale = head_dim ** -0.5
def attention_fn(self, query_layer, key_layer, value_layer, attention_mask,
attention_dropout=None, log_attention_weights=None, scaling_attention_score=True, **kwargs):
dropout_p = 0. # xformers does not support dropout for eva hidden size
query_layer = query_layer.permute(0, 2, 1, 3) # B, num_heads, N, C -> B, N, num_heads, C
key_layer = key_layer.permute(0, 2, 1, 3)
value_layer = value_layer.permute(0, 2, 1, 3)
out = xops.memory_efficient_attention(
query_layer, key_layer, value_layer,
p=dropout_p,
scale=self.scale,
)
return out
def attention_forward(self, hidden_states, mask, **kw_args):
self = self.transformer.layers[kw_args['layer_id']].attention
attention_fn = self.hooks['attention_fn']
mixed_raw_layer = self.query_key_value(hidden_states)
B, N, C = hidden_states.shape
mixed_raw_layer = mixed_raw_layer.reshape(B, N, 3, self.num_attention_heads_per_partition, -1).permute(2, 0, 3, 1, 4) # 3, B, num_heads, N, C
query_layer, key_layer, value_layer = mixed_raw_layer[0], mixed_raw_layer[1], mixed_raw_layer[2]
dropout_fn = self.attention_dropout if self.training else None
context_layer = attention_fn(query_layer, key_layer, value_layer, mask, dropout_fn, **kw_args)
context_layer = context_layer.view(B, N, -1)
output = self.dense(context_layer)
if self.training:
output = self.output_dropout(output)
return output
class NewLayerForward(BaseMixin):
def __init__(self):
super().__init__()
def layer_forward(self, hidden_states, mask, *args, **kw_args):
'''
hidden_states: [batch, seq_len, hidden_size]
mask: [(1, 1), seq_len, seq_len]
'''
self = self.transformer.layers[kw_args['layer_id']]
attention_input = hidden_states
# Self attention.
attention_output = self.input_layernorm(self.attention(attention_input, mask, **kw_args))
# DropPath for attention
if self.training and self.drop_path > 0.:
if mpu.get_cuda_rng_tracker is not None:
# drop_path must use the model-parallel rng tracker;
# the tracker is seeded with `seed + model_parallel_rank`,
# and deepspeed activation checkpointing records the model-parallel tracker states
with mpu.get_cuda_rng_tracker().fork():
# dropped samples become 0, kept samples are scaled by 1/(1-drop_path)
random_tensor = (1-self.drop_path
+ torch.rand((attention_output.shape[0],), dtype=attention_output.dtype, device=attention_output.device)).floor_() / (1-self.drop_path)
attention_output = random_tensor.view(-1, 1, 1) * attention_output
# Residual connection.
hidden_states = attention_input + attention_output
mlp_input = hidden_states
# MLP.
mlp_output = self.post_attention_layernorm(self.mlp(mlp_input, **kw_args))
# DropPath for mlp
if self.training and self.drop_path > 0.:
if mpu.get_cuda_rng_tracker is not None:
with mpu.get_cuda_rng_tracker().fork():
random_tensor = (1-self.drop_path
+ torch.rand((mlp_output.shape[0],), dtype=mlp_output.dtype, device=mlp_output.device)).floor_() / (1-self.drop_path)
mlp_output = random_tensor.view(-1, 1, 1) * mlp_output
# Second residual connection.
output = mlp_input + mlp_output
return output
class EVA2CLIPModel(BaseModel):
def __init__(self, args, transformer=None, parallel_output=True, **kwargs):
property = ViTProperty(args.image_size, args.patch_size, args.pre_len, args.post_len)
args.max_sequence_length = property.pre_len + property.num_patches + property.post_len
if 'activation_func' not in kwargs:
kwargs['activation_func'] = gelu
super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kwargs)
self.transformer.property = property
self.add_mixin("patch_embedding", ImagePatchEmbeddingMixin(args.in_channels, args.hidden_size, property))
self.add_mixin("pos_embedding", InterpolatedPositionEmbeddingMixin())
self.add_mixin("final", IdentityMixin())
self.add_mixin("newpost", NewLayerForward())
self.add_mixin("xattn", XAttn(args.hidden_size // args.num_attention_heads))
@classmethod
def add_model_specific_args(cls, parser):
group = parser.add_argument_group('EVA2CLIP', 'EVA2CLIP Configurations')
group.add_argument('--image-size', nargs='+', type=int, default=[224, 224])
group.add_argument('--pre-len', type=int, default=1) # [cls] by default
group.add_argument('--post-len', type=int, default=0) # empty by default, but sometimes with special tokens, such as [det] in yolos.
group.add_argument('--in-channels', type=int, default=3)
group.add_argument('--patch-size', type=int, default=16)
return parser
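
As a side note, here is a standalone sketch of the DropPath (stochastic depth) trick used in NewLayerForward.layer_forward above; the model-parallel rng-tracker fork is omitted and drop_prob is a hypothetical value.

import torch

def drop_path(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    # zero out whole samples with probability drop_prob, rescale survivors by 1/(1-drop_prob)
    if not training or drop_prob == 0.:
        return x
    keep_prob = 1 - drop_prob
    random_tensor = (keep_prob + torch.rand((x.shape[0],), dtype=x.dtype, device=x.device)).floor_() / keep_prob
    return random_tensor.view(-1, 1, 1) * x

out = drop_path(torch.randn(4, 16, 8), drop_prob=0.1, training=True)
print(out.shape)  # torch.Size([4, 16, 8])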

274
models/mixin.py Normal file

@ -0,0 +1,274 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from sat.transformer_defaults import attention_fn_default
from sat.model.base_model import BaseMixin, non_conflict
from sat.mpu.layers import ColumnParallelLinear, RowParallelLinear
from sat.mpu.utils import split_tensor_along_last_dim
from sat import mpu
class LlamaVisionExpertFCMixin(BaseMixin):
def __init__(self, in_features, hidden_features, num_layers=32, num_vision_layers=0, vision_layer_range=None,
params_dtype=torch.float, device=torch.device('cpu')):
super().__init__()
self.num_layers = num_layers
self.num_vision_layers = num_vision_layers
if vision_layer_range is None:
vision_layer_range = [i for i in range(min(num_vision_layers, num_layers))]
self.vision_layer_range = vision_layer_range
self.gate_proj = nn.ModuleList([ColumnParallelLinear(
in_features,
hidden_features,
gather_output=False,
init_method=None,
bias=False,
params_dtype=params_dtype,
module=self,
name="dense_h_to_4h_gate",
skip_init=True,
device=device
) for i in range(num_layers)])
# Trainable vision expert parameters
vision_dense_h_to_4h_list = []
vision_dense_4h_to_h_list = []
gate_proj_list = []
for i in vision_layer_range:
vision_dense_h_to_4h = ColumnParallelLinear(
in_features,
hidden_features,
gather_output=False,
init_method=None,
bias=False,
params_dtype=params_dtype,
module=self,
name="vision_dense_h_to_4h",
skip_init=True,
device=device
)
# Project back to h.
vision_dense_4h_to_h = RowParallelLinear(
hidden_features,
in_features,
input_is_parallel=True,
init_method=None,
bias=False,
params_dtype=params_dtype,
module=self,
name="vision_dense_4h_to_h",
skip_init=True,
device=device
)
gate_proj = ColumnParallelLinear(
in_features,
hidden_features,
gather_output=False,
init_method=None,
bias=False,
params_dtype=params_dtype,
module=self,
name="vision_gate_proj",
skip_init=True,
device=device
)
vision_dense_h_to_4h_list.append(vision_dense_h_to_4h)
vision_dense_4h_to_h_list.append(vision_dense_4h_to_h)
gate_proj_list.append(gate_proj)
self.vision_dense_h_to_4h_list = nn.ModuleDict([
(str(layer_id), vision_dense_h_to_4h)
for layer_id, vision_dense_h_to_4h in zip(vision_layer_range, vision_dense_h_to_4h_list)
])
self.vision_dense_4h_to_h_list = nn.ModuleDict([
(str(layer_id), vision_dense_4h_to_h)
for layer_id, vision_dense_4h_to_h in zip(vision_layer_range, vision_dense_4h_to_h_list)
])
self.vision_gate_proj = nn.ModuleDict([
(str(layer_id), gate_proj)
for layer_id, gate_proj in zip(vision_layer_range, gate_proj_list)
])
def mlp_forward(self, hidden_states, **kw_args):
mixin_self = self
self = self.transformer.layers[kw_args['layer_id']].mlp
if "vision_expert_mask" in kw_args:
vision_expert_mask = kw_args['vision_expert_mask']
else:
vision_expert_mask = None
layer_id_key = str(int(kw_args['layer_id']))
if kw_args['layer_id'] in mixin_self.vision_layer_range and (vision_expert_mask is not None) and vision_expert_mask.any():
vision_dense_h_to_4h = mixin_self.vision_dense_h_to_4h_list[layer_id_key]
vision_dense_4h_to_h = mixin_self.vision_dense_4h_to_h_list[layer_id_key]
vision_gate_proj = mixin_self.vision_gate_proj[layer_id_key]
output = torch.empty(hidden_states.shape, dtype=hidden_states.dtype, device=hidden_states.device)
language_hidden_state = hidden_states[~vision_expert_mask.bool()]
language_intermediate_parallel = self.activation_func(mixin_self.gate_proj[kw_args['layer_id']](language_hidden_state)) * self.dense_h_to_4h(language_hidden_state)
output[~vision_expert_mask.bool()] = self.dense_4h_to_h(language_intermediate_parallel) # language_output
vision_hidden_state = hidden_states[vision_expert_mask.bool()]
vision_intermediate_parallel = vision_dense_h_to_4h(vision_hidden_state)
gate_output = vision_gate_proj(vision_hidden_state)
vision_intermediate_parallel *= self.activation_func(gate_output)
output[vision_expert_mask.bool()] = vision_dense_4h_to_h(vision_intermediate_parallel) # vision_output
else:
intermediate_parallel = self.activation_func(mixin_self.gate_proj[kw_args['layer_id']](hidden_states)) * self.dense_h_to_4h(hidden_states)
output = self.dense_4h_to_h(intermediate_parallel)
return output.contiguous()
def copy_param(self):
with torch.no_grad():
for i in self.vision_layer_range:
self.vision_gate_proj[str(i)].weight.data.copy_(self.gate_proj[i].weight.data)
self.vision_dense_4h_to_h_list[str(i)].weight.data.copy_(self.transformer.layers[i].mlp.dense_4h_to_h.weight.data)
self.vision_dense_h_to_4h_list[str(i)].weight.data.copy_(self.transformer.layers[i].mlp.dense_h_to_4h.weight.data)
from sat.mpu import get_model_parallel_world_size
from sat.mpu.utils import divide
from sat.model.position_embedding.triton_rotary_embeddings import FastRotaryEmbedding
class LlamaVisionExpertAttnMixin(BaseMixin):
def __init__(self, hidden_size, num_heads, num_layers=28, num_vision_layers=0, use_vision_expert=True, vision_layer_range=None,
params_dtype=torch.float, device=torch.device('cpu')):
super().__init__()
world_size = get_model_parallel_world_size()
self.hidden_size = hidden_size
self.num_attention_heads = num_heads
self.hidden_size_per_attention_head = divide(hidden_size, num_heads)
self.num_attention_heads_per_partition = divide(num_heads, world_size)
self.inner_hidden_size = num_heads * self.hidden_size_per_attention_head
self.rotary_emb = FastRotaryEmbedding(
hidden_size // num_heads, pos_idx_in_fp32=False
)
self.num_vision_layers = num_vision_layers
self.num_layers = num_layers
if vision_layer_range is None:
vision_layer_range = [i for i in range(min(num_vision_layers, num_layers))]
self.vision_layer_range = vision_layer_range
self.use_vision_expert = use_vision_expert
# Trainable vision expert parameters
if self.use_vision_expert:
vision_query_key_value_list = []
vision_dense_list = []
for i in vision_layer_range:
vision_query_key_value = ColumnParallelLinear(
hidden_size,
3 * hidden_size,
stride=3,
gather_output=False,
init_method=None,
bias=False,
params_dtype=params_dtype,
module=self,
name="vision_query_key_value",
skip_init=True,
device=device
)
vision_dense = RowParallelLinear(
self.inner_hidden_size,
hidden_size,
input_is_parallel=True,
init_method=None,
bias=False,
params_dtype=params_dtype,
module=self,
name="vision_dense",
skip_init=True,
device=device,
final_bias=False
)
vision_query_key_value_list.append(vision_query_key_value)
vision_dense_list.append(vision_dense)
self.vision_query_key_value_list = nn.ModuleDict([
(str(layer_id), vision_query_key_value)
for layer_id, vision_query_key_value in zip(vision_layer_range, vision_query_key_value_list)
])
self.vision_dense_list = nn.ModuleDict([
(str(layer_id), vision_dense)
for layer_id, vision_dense in zip(vision_layer_range, vision_dense_list)
])
def attention_forward(self, hidden_states, mask, **kw_args):
mixin_self = self
self = self.transformer.layers[kw_args['layer_id']].attention
attention_fn = attention_fn_default
if 'attention_fn' in self.hooks:
attention_fn = self.hooks['attention_fn']
if "vision_expert_mask" in kw_args:
vision_expert_mask = kw_args['vision_expert_mask']
else:
vision_expert_mask = None
layer_id_key = str(int(kw_args['layer_id']))
if mixin_self.use_vision_expert and kw_args['layer_id'] in mixin_self.vision_layer_range and (
vision_expert_mask is not None) and vision_expert_mask.any():
shape = list(hidden_states.shape)
parallel_size = mpu.get_model_parallel_world_size()
shape[-1] = shape[-1] * 3 // parallel_size
vision_query_key_value = mixin_self.vision_query_key_value_list[layer_id_key]
mixed_raw_layer = torch.empty(shape, dtype=hidden_states.dtype, device=hidden_states.device)
language_hidden_states = hidden_states[~vision_expert_mask.bool()]
vision_hidden_states = hidden_states[vision_expert_mask.bool()]
mixed_raw_layer[~vision_expert_mask.bool()] = self.query_key_value(
language_hidden_states) # language_mixed_raw_layer
mixed_raw_layer[vision_expert_mask.bool()] = vision_query_key_value(
vision_hidden_states) # vision_mixed_raw_layer
else:
mixed_raw_layer = self.query_key_value(hidden_states)
(mixed_query_layer,
mixed_key_layer,
mixed_value_layer) = split_tensor_along_last_dim(mixed_raw_layer, 3)
dropout_fn = self.attention_dropout if self.training else None
query_layer = self._transpose_for_scores(mixed_query_layer)
key_layer = self._transpose_for_scores(mixed_key_layer)
value_layer = self._transpose_for_scores(mixed_value_layer)
query_layer, key_layer = mixin_self.rotary_emb(query_layer,key_layer, kw_args['position_ids'], max_seqlen=kw_args['position_ids'].max()+1, layer_id=kw_args['layer_id'])
context_layer = attention_fn(query_layer, key_layer, value_layer, mask, dropout_fn, **kw_args)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size_per_partition,)
context_layer = context_layer.view(*new_context_layer_shape)
if mixin_self.use_vision_expert and kw_args['layer_id'] in mixin_self.vision_layer_range and (
vision_expert_mask is not None) and vision_expert_mask.any():
vision_dense = mixin_self.vision_dense_list[layer_id_key]
parallel_size = mpu.get_model_parallel_world_size()
target_shape = context_layer.shape[:-1] + (context_layer.shape[-1] * parallel_size,)
output = torch.empty(target_shape, dtype=hidden_states.dtype, device=hidden_states.device)
output[~vision_expert_mask.bool()] = self.dense(context_layer[~vision_expert_mask.bool()]) # language
output[vision_expert_mask.bool()] = vision_dense(context_layer[vision_expert_mask.bool()]) # vision
else:
output = self.dense(context_layer)
if self.training:
output = self.output_dropout(output)
return output.contiguous()
def copy_param(self):
with torch.no_grad():
for i in self.vision_layer_range:
self.vision_query_key_value_list[str(i)].weight.data.copy_(self.transformer.layers[i].attention.query_key_value.weight.data)
self.vision_dense_list[str(i)].weight.data.copy_(self.transformer.layers[i].attention.dense.weight.data)
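
To illustrate the routing idea behind mlp_forward and attention_forward above, here is a toy sketch without model parallelism: tokens flagged by vision_expert_mask go through a separate set of vision weights, while the rest keep the language weights (the layer sizes and mask span below are made up).

import torch
import torch.nn as nn

batch, seq, hidden = 2, 6, 8
language_proj = nn.Linear(hidden, hidden, bias=False)
vision_proj = nn.Linear(hidden, hidden, bias=False)   # the trainable vision expert

hidden_states = torch.randn(batch, seq, hidden)
vision_expert_mask = torch.zeros(batch, seq, dtype=torch.bool)
vision_expert_mask[:, 1:4] = True                     # hypothetical image-token span

output = torch.empty_like(hidden_states)
output[~vision_expert_mask] = language_proj(hidden_states[~vision_expert_mask])
output[vision_expert_mask] = vision_proj(hidden_states[vision_expert_mask])
print(output.shape)  # torch.Size([2, 6, 8])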

9
requirements.txt Normal file

@ -0,0 +1,9 @@
SwissArmyTransformer>=0.4.8
transformers>=4.33.1
xformers>=0.0.22
torch>1.10.0
torchvision
spacy>=3.6.0
pillow>=10.0.1
deepspeed>=0.11.0
seaborn

57
scripts/evaluate_224.sh Normal file

@ -0,0 +1,57 @@
#! /bin/bash
# export PATH=/usr/local/cuda/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
NUM_GPUS_PER_WORKER=8
MP_SIZE=4
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)
MODEL_TYPE="cogvlm-base-224"
VERSION="base"
MODEL_ARGS="--from_pretrained ./checkpoints/merged_model_224 \
--max_length 319 \
--lora_rank 10 \
--use_lora \
--local_tokenizer lmsys/vicuna-7b-v1.5 \
--version $VERSION"
OPTIONS_SAT="SAT_HOME=~/.sat_models"
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=$NUM_GPUS_PER_WORKER"
HOST_FILE_PATH="hostfile"
train_data="./archive_split/train"
test_data="./archive_split/test"
gpt_options=" \
--experiment-name finetune-$MODEL_TYPE \
--model-parallel-size ${MP_SIZE} \
--mode finetune \
--train-iters 0 \
--resume-dataloader \
$MODEL_ARGS \
--train-data ${train_data} \
--test-data ${test_data} \
--distributed-backend nccl \
--lr-decay-style cosine \
--warmup .02 \
--checkpoint-activations \
--save-interval 200 \
--eval-interval 200 \
--save "./checkpoints" \
--strict-eval \
--eval-batch-size 1 \
--split 1. \
--deepspeed_config scripts/test_config_bf16.json \
--skip-init \
--seed 2023
"
run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --hostfile ${HOST_FILE_PATH} evaluate_demo.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x

57
scripts/evaluate_490.sh Normal file

@ -0,0 +1,57 @@
#! /bin/bash
# export PATH=/usr/local/cuda/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
NUM_GPUS_PER_WORKER=8
MP_SIZE=1
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)
MODEL_TYPE="cogvlm-base-490"
VERSION="base"
MODEL_ARGS="--from_pretrained ./checkpoints/merged_lora_490 \
--max_length 1288 \
--lora_rank 10 \
--use_lora \
--local_tokenizer lmsys/vicuna-7b-v1.5 \
--version $VERSION"
OPTIONS_SAT="SAT_HOME=~/.sat_models"
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=$NUM_GPUS_PER_WORKER"
HOST_FILE_PATH="hostfile"
train_data="./archive_split/train"
test_data="./archive_split/test"
gpt_options=" \
--experiment-name finetune-$MODEL_TYPE \
--model-parallel-size ${MP_SIZE} \
--mode finetune \
--train-iters 0 \
--resume-dataloader \
$MODEL_ARGS \
--train-data ${train_data} \
--test-data ${test_data} \
--distributed-backend nccl \
--lr-decay-style cosine \
--warmup .02 \
--checkpoint-activations \
--save-interval 200 \
--eval-interval 200 \
--save "./checkpoints" \
--strict-eval \
--eval-batch-size 1 \
--split 1. \
--deepspeed_config scripts/test_config_bf16_490.json \
--skip-init \
--seed 2023
"
run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --hostfile ${HOST_FILE_PATH} evaluate_demo.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x


@ -0,0 +1,58 @@
#! /bin/bash
# export PATH=/usr/local/cuda/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
NUM_GPUS_PER_WORKER=8
MP_SIZE=4
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)
MODEL_TYPE="cogvlm-base-224"
VERSION="base"
MODEL_ARGS="--from_pretrained $MODEL_TYPE \
--max_length 319 \
--lora_rank 10 \
--use_lora \
--local_tokenizer lmsys/vicuna-7b-v1.5 \
--version $VERSION"
OPTIONS_SAT="SAT_HOME=~/.sat_models"
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=$NUM_GPUS_PER_WORKER"
HOST_FILE_PATH="hostfile"
train_data="./archive_split/train"
valid_data="./archive_split/valid"
gpt_options=" \
--experiment-name finetune-$MODEL_TYPE \
--model-parallel-size ${MP_SIZE} \
--mode finetune \
--train-iters 800 \
--resume-dataloader \
$MODEL_ARGS \
--train-data ${train_data} \
--valid-data ${valid_data} \
--distributed-backend nccl \
--lr-decay-style cosine \
--warmup .02 \
--checkpoint-activations \
--vit_checkpoint_activations \
--save-interval 200 \
--eval-interval 200 \
--save "./checkpoints" \
--eval-iters 10 \
--eval-batch-size 1 \
--split 1. \
--deepspeed_config scripts/test_config_bf16.json \
--skip-init \
--seed 2023
"
run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --hostfile ${HOST_FILE_PATH} finetune_demo.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x


@ -0,0 +1,58 @@
#! /bin/bash
# export PATH=/usr/local/cuda/bin:$PATH
# export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
NUM_GPUS_PER_WORKER=8
MP_SIZE=1
script_path=$(realpath $0)
script_dir=$(dirname $script_path)
main_dir=$(dirname $script_dir)
MODEL_TYPE="cogvlm-base-490"
VERSION="base"
MODEL_ARGS="--from_pretrained $MODEL_TYPE \
--max_length 1288 \
--lora_rank 10 \
--use_lora \
--local_tokenizer lmsys/vicuna-7b-v1.5 \
--version $VERSION"
OPTIONS_SAT="SAT_HOME=~/.sat_models"
OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 LOCAL_WORLD_SIZE=$NUM_GPUS_PER_WORKER"
HOST_FILE_PATH="hostfile"
train_data="./archive_split/train"
valid_data="./archive_split/valid"
gpt_options=" \
--experiment-name finetune-$MODEL_TYPE \
--model-parallel-size ${MP_SIZE} \
--mode finetune \
--train-iters 800 \
--resume-dataloader \
$MODEL_ARGS \
--train-data ${train_data} \
--valid-data ${valid_data} \
--distributed-backend nccl \
--lr-decay-style cosine \
--warmup .02 \
--checkpoint-activations \
--vit_checkpoint_activations \
--save-interval 200 \
--eval-interval 200 \
--save "./checkpoints" \
--eval-iters 10 \
--eval-batch-size 1 \
--split 1. \
--deepspeed_config scripts/test_config_bf16_490.json \
--skip-init \
--seed 2023
"
run_cmd="${OPTIONS_NCCL} ${OPTIONS_SAT} deepspeed --master_port 16666 --hostfile ${HOST_FILE_PATH} finetune_demo.py ${gpt_options}"
echo ${run_cmd}
eval ${run_cmd}
set +x

35
scripts/split_dataset.py Normal file

@ -0,0 +1,35 @@
import os
import shutil
def find_all_files(path, suffix=".jpg"):
target_files = []
for cur_dir, _, files in os.walk(path, followlinks=True):
for f in files:
if f.endswith(suffix):
target_files.append(os.path.join(cur_dir, f))
print(f'find {len(target_files)} files...')
return target_files
all_files = find_all_files('archive')
os.makedirs("archive_split", exist_ok=True)
os.makedirs("archive_split/train", exist_ok=True)
os.makedirs("archive_split/valid", exist_ok=True)
os.makedirs("archive_split/test", exist_ok=True)
import random
random.seed(2023)
random.shuffle(all_files)
train = all_files[:8000]
valid = all_files[8000:8000+500]
test = all_files[8000+500:8000+500+1500]
print("building train")
for file in train:
shutil.move(file, os.path.join("archive_split/train", file.split("/")[-1]))
print("building valid")
for file in valid:
shutil.move(file, os.path.join("archive_split/valid", file.split("/")[-1]))
print("building test")
for file in test:
shutil.move(file, os.path.join("archive_split/test", file.split("/")[-1]))
print("done")

41
scripts/test_config_bf16.json Executable file

@ -0,0 +1,41 @@
{
"train_micro_batch_size_per_gpu": 32,
"gradient_accumulation_steps": 1,
"gradient_clipping": 0.1,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 4e7,
"allgather_bucket_size": 1e8,
"load_from_fp32_weights": false
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"zero_allow_untested_optimizer": true,
"bf16": {
"enabled": true
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00001,
"betas": [
0.9,
0.95
],
"eps": 1e-8,
"weight_decay": 5e-2
}
},
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false
},
"wall_clock_breakdown": false
}

41
scripts/test_config_bf16_490.json Executable file

@ -0,0 +1,41 @@
{
"train_micro_batch_size_per_gpu": 8,
"gradient_accumulation_steps": 1,
"gradient_clipping": 0.1,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 4e7,
"allgather_bucket_size": 1e8,
"load_from_fp32_weights": false
},
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"zero_allow_untested_optimizer": true,
"bf16": {
"enabled": true
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00001,
"betas": [
0.9,
0.95
],
"eps": 1e-8,
"weight_decay": 5e-2
}
},
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false
},
"wall_clock_breakdown": false
}

109
utils/chat.py Normal file

@ -0,0 +1,109 @@
# -*- encoding: utf-8 -*-
'''
@File : chat.py
@Time : 2023/05/08 19:10:08
@Author : Ming Ding
@Contact : dm18@mails.tsinghua.edu.cn
'''
from typing import Optional, Tuple, Union, List, Callable, Dict, Any
import requests
from PIL import Image
from io import BytesIO
import torch
from sat.generation.autoregressive_sampling import filling_sequence, get_masks_and_position_ids_default
from sat.generation.sampling_strategies import BaseStrategy, BeamSearchStrategy
from sat.mpu import get_model_parallel_rank
def process_image(image_path, img_processor, image):
if image is None:
if image_path.startswith("http"):
response = requests.get(image_path, timeout=10)
image = Image.open(BytesIO(response.content))
else:
image = Image.open(image_path)
if image is not None and isinstance(image, Image.Image):
pil_img = image.convert('RGB')
img_dict = img_processor(pil_img)
ret = (img_dict, pil_img)
else:
ret = image
return ret
def chat(image_path, model, text_processor, img_processor,
query: str, history: List[Tuple[str, str]] = None, image: Image = None,
max_length: int = 4096, top_p=0.95, top_k=5, temperature=0.95, repetition_penalty=1.0,
invalid_slices=[], no_prompt=False
):
if image is None:
assert image_path is not None
if not history:
history = []
if no_prompt:
query = ''
prompt = text_processor.history_to_prompt(query, history)
(torch_image, pil_img) = process_image(image_path, img_processor, image)
if torch_image is not None:
for k in torch_image:
if type(torch_image[k]) is torch.Tensor and torch_image[k].dtype is not torch.int and torch_image[k].dtype is not torch.long:
torch_image[k] = torch_image[k].to(next(model.parameters()).dtype)
if type(torch_image[k]) is torch.Tensor:
torch_image[k] = torch_image[k].to(next(model.parameters()).device)
inputs_dic = text_processor(prompt)
for k in inputs_dic:
if type(inputs_dic[k]) is torch.Tensor and inputs_dic[k].dtype is not torch.int and inputs_dic[k].dtype is not torch.long:
inputs_dic[k] = inputs_dic[k].to(next(model.parameters()).dtype)
if type(inputs_dic[k]) is torch.Tensor:
inputs_dic[k] = inputs_dic[k].to(next(model.parameters()).device)
input_ids = inputs_dic['input_ids'].to(next(model.parameters()).device)[0]
if max_length-len(input_ids) <= 1:
response = "The prompt exceeds the context length limit, please try again."
return response, history, (torch_image, pil_img)
seq = torch.cat(
[input_ids, torch.tensor([-1]*(max_length-len(input_ids)), device=input_ids.device)], dim=0
)
strategy = BaseStrategy(temperature=temperature, top_p=top_p, top_k=top_k, end_tokens=[text_processor.tokenizer.eos_token_id],
invalid_slices=invalid_slices, repetition_penalty=repetition_penalty)
# use beam search to get a better result
# strategy = BeamSearchStrategy(temperature=temperature, top_p=top_p, top_k=top_k, end_tokens=[text_processor.tokenizer.eos_token_id],
# num_beams=5, consider_end=True, repetition_penalty=repetition_penalty)
get_func = text_processor.get_func(input_ids, **inputs_dic) if hasattr(text_processor, 'get_func') else get_masks_and_position_ids_default
img_inputs = {'vision_'+k: v for k, v in torch_image.items()}
inputs_dic.pop('input_ids')
inputs = {**img_inputs, **inputs_dic}
output = filling_sequence(
model, seq,
batch_size=1,
get_masks_and_position_ids=get_func,
strategy=strategy,
**inputs
)[0] # drop memory
# ---------------
# ported from inference_glm.py; more general than chat mode
# strip the -1 placeholders and fill the generated tokens back into seq
if type(output) is not list:
output_list = output.tolist()
else:
output_list = output
response = text_processor.tokenizer.decode(output_list[0])
# print('original:', response)
if hasattr(text_processor, 'process_response'):
response = text_processor.process_response(response)
response = response.split(text_processor.sep)[-1].strip()
if get_model_parallel_rank() == 0:
from utils.parser import parse_response
parse_response(pil_img, response)
history = history + [(query, response)]
return response, history, (torch_image, pil_img)
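
A small sketch of how `seq` is laid out for filling_sequence above: the prompt token ids come first, followed by -1 placeholders that the sampling strategy fills in (the ids and max_length below are made up).

import torch

max_length = 12
input_ids = torch.tensor([1, 319, 3148, 1001])   # hypothetical prompt ids
seq = torch.cat([input_ids,
                 torch.full((max_length - len(input_ids),), -1, dtype=input_ids.dtype)])
print(seq.tolist())  # [1, 319, 3148, 1001, -1, -1, -1, -1, -1, -1, -1, -1]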

59
utils/dataset.py Normal file

@ -0,0 +1,59 @@
import os
import logging
import random
import jsonlines
from io import BytesIO
from PIL import Image
from torch.utils.data import Dataset
from sat.helpers import print_rank0
def find_all_files(path, suffix=".jpg"):
target_files = []
for cur_dir, _, files in os.walk(path, followlinks=True):
for f in files:
if f.endswith(suffix):
target_files.append(os.path.join(cur_dir, f))
print_rank0(f'find {len(target_files)} files...')
return target_files
class ItemDataset(Dataset):
def __init__(self, image_processor, text_processor, args, data_dirs, **kwargs):
super().__init__()
self.data = self.load_data(data_dirs)
self.image_processor, self.text_processor = image_processor, text_processor
def process_img(self, img):
img_dict = {'vision': self.image_processor(img)}
return img_dict
def process_text(self, answer, prompt):
return self.text_processor(answer, prompt)
def load_data(self, data_dir):
all_files = find_all_files(data_dir, suffix=".jpg")
print_rank0(f"find {len(all_files)} samples in all...")
return all_files
def __len__(self):
return len(self.data)
def __getitem__(self, index):
data = self.data[index]
# img
try:
img = Image.open(data).convert('RGB')
except Exception as e:
print_rank0(e, level=logging.WARNING)
return {}
img_dict = self.process_img(img)
# text
label = data.split('/')[-1].split('.')[0]
uni_key = label
text_dict = self.process_text(label, "CAPTCHA:")
if text_dict is None:
print_rank0(f"Process text failed. Please check the max_target_length & max_source_length.\n The data is {data}", level=logging.WARNING)
return {}
# other attr
ret = {**img_dict, **text_dict, "question_id": uni_key}
return ret
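
The dataset above derives the CAPTCHA label from the image file name via `data.split('/')[-1].split('.')[0]`; an equivalent, OS-independent way to express the same rule (the path below is made up):

import os

def label_from_path(path: str) -> str:
    # "archive_split/train/ab3de.jpg" -> "ab3de"
    return os.path.splitext(os.path.basename(path))[0]

print(label_from_path("archive_split/train/ab3de.jpg"))  # ab3de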

208
utils/language.py Normal file

@ -0,0 +1,208 @@
def _history_to_prompt(self, signal_type, query, history):
if signal_type == 'base':
return '<EOI>' + query
if signal_type == 'vqa':
answer_format = 'Short answer:'
else:
answer_format = 'Answer:'
prompt = '<EOI>'
for i, (old_query, response) in enumerate(history):
prompt += 'Question: ' + old_query + " {} ".format(answer_format) + response + "\n"
prompt += 'Question: {} {}'.format(query, answer_format)
return prompt
from transformers import LlamaTokenizer
def llama2_tokenizer(tokenizer_path, signal_type="base"):
tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path)
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = 32000
assert signal_type in ["base", "chat", "vqa"]
tokenizer.signal_type = signal_type
return tokenizer
import re
import numpy as np
import torch
class llama2_text_processor:
def __init__(self, tokenizer, max_target_length=2048, image_length=1225):
self.tokenizer = tokenizer
self.max_target_length = max_target_length
self.image_length = image_length
def __call__(self, caption, prompt=""):
if '<EOI>' not in prompt:
prompt = self.replace_tags_with_empty(prompt)
# caption = self.replace_tags_with_empty(caption)
history = []
prompt = self.history_to_prompt(prompt, history)
input_ids = [self.tokenizer.bos_token_id]
prompt_splits = prompt.split('<EOI>')
caption_splits = caption.split('<EOI>')
if len(prompt_splits) > 0:
input_ids.extend(self.tokenizer.encode(prompt_splits[0], add_special_tokens=False))
for tokens in prompt_splits[1:]:
tokens_with_img = [-100] + self.tokenizer.encode(tokens, add_special_tokens=False)
input_ids.extend(tokens_with_img)
context_length = len(input_ids) + (len(prompt_splits)-1) * (self.image_length + 1)
if context_length > self.max_target_length - 50:
return None # prompt is too long
if len(caption_splits) > 0:
input_ids.extend(self.tokenizer.encode(caption_splits[0], add_special_tokens=False))
for tokens in caption_splits[1:]:
tokens_with_img = [-100] + self.tokenizer.encode(tokens, add_special_tokens=False)
input_ids.extend(tokens_with_img)
if len(input_ids) > self.max_target_length - self.image_length - 5:
input_ids = input_ids[:self.max_target_length - self.image_length - 5]
input_ids += [self.tokenizer.eos_token_id]
while -100 in input_ids:
img_idx = input_ids.index(-100)
input_ids = input_ids[:img_idx] + [0] * (self.image_length + 1) + [-1] + input_ids[img_idx+1:]
image_position = []
while -1 in input_ids:
img_idx = input_ids.index(-1)
input_ids[img_idx] = 0
image_position.append(img_idx)
image_embed_mask = [0] * len(input_ids)
vision_expert_mask = [0] * len(input_ids)
image_rope_mask = [0] * len(input_ids)
for idx in image_position:
image_embed_mask[idx-self.image_length-1: idx+1] = [1] * (self.image_length + 2)
vision_expert_mask[idx-self.image_length-1: idx] = [1] * (self.image_length + 1)
image_rope_mask[idx - self.image_length: idx] = [1] * self.image_length
attention_mask = [1] * len(input_ids)
labels = [-100] * context_length + input_ids[context_length:]
pad_len = self.max_target_length - len(input_ids)
input_ids = input_ids + [self.tokenizer.pad_token_id] * pad_len
attention_mask = attention_mask + [1] * pad_len
vision_expert_mask = vision_expert_mask + [0] * pad_len
image_embed_mask = image_embed_mask + [0] * pad_len
image_rope_mask = image_rope_mask + [0] * pad_len
np_mask = np.tril(np.expand_dims(np.array(attention_mask), 0).repeat(len(attention_mask), 0))
labels = labels + [-100] * pad_len
for idx in image_position:
labels[idx-self.image_length-1: idx+1] = [-100] * (self.image_length + 2)
position_ids = []
pid = -1
for i in range(len(input_ids)):
if image_rope_mask[i] == 0 or (i > 0 and image_rope_mask[i] != image_rope_mask[i - 1]):
pid += 1
position_ids.append(pid)
input_ids = torch.tensor(input_ids).unsqueeze(0)
labels = torch.tensor(labels).unsqueeze(0)
attention_mask = torch.from_numpy(np_mask).unsqueeze(0).unsqueeze(0)
image_embed_mask = torch.tensor(image_embed_mask).unsqueeze(0)
vision_expert_mask = torch.tensor(vision_expert_mask).unsqueeze(0)
image_rope_mask = torch.tensor(image_rope_mask).unsqueeze(0)
position_ids = torch.tensor(position_ids).unsqueeze(0)
context_length = torch.tensor(context_length).unsqueeze(0).long()
return {'input_ids': input_ids, 'labels': labels, 'position_ids': position_ids, 'attention_mask': attention_mask, 'image_embed_mask': image_embed_mask,
'context_length': context_length, 'image_position': image_position, 'vision_expert_mask': vision_expert_mask, 'image_rope_mask': image_rope_mask
}
def history_to_prompt(self, query, history):
return _history_to_prompt(self, self.tokenizer.signal_type, query, history)
def replace_tags_with_empty(self, text):
return re.sub('<pad>|<s>|</s>|<EOI>', '', text)
from functools import partial
def get_masks_and_position_ids(seq, image_logits_mask):
tokens = seq.unsqueeze(0)
attention_mask = torch.ones((1, len(seq), len(seq)), device=tokens.device)
attention_mask.tril_()
attention_mask.unsqueeze_(1)
position_ids = []
pid = -1
for i in range(len(image_logits_mask[0])):
if image_logits_mask[0][i] == 0 or (i > 0 and image_logits_mask[0][i] != image_logits_mask[0][i - 1]):
pid += 1
position_ids.append(pid)
for i in range(tokens.shape[1]-image_logits_mask.shape[1]):
pid += 1
position_ids.append(pid)
position_ids = torch.tensor(position_ids, dtype=torch.long, device=tokens.device)
position_ids = position_ids.unsqueeze(0)
return tokens, attention_mask, position_ids
class llama2_text_processor_inference:
def __init__(self, tokenizer, max_target_length=2048, image_length=1225):
self.tokenizer = tokenizer
self.max_target_length = max_target_length
self.image_length = image_length
if self.tokenizer.signal_type == "chat":
self.sep = 'Answer: '
elif self.tokenizer.signal_type == "vqa":
self.sep = 'Short answer: '
else:
self.sep = 'unk'
self.invalid_slices = []
self.no_eoi = True
def __call__(self, prompt=""):
if '<EOI>' not in prompt:
prompt = self.replace_tags_with_empty(prompt)
# caption = self.replace_tags_with_empty(caption)
history = []
prompt = self.history_to_prompt(history, prompt)
input_ids = [self.tokenizer.bos_token_id]
prompt_splits = prompt.split('<EOI>')
if len(prompt_splits) > 0:
input_ids.extend(self.tokenizer.encode(prompt_splits[0], add_special_tokens=False))
for tokens in prompt_splits[1:]:
tokens_with_img = [-100] + self.tokenizer.encode(tokens, add_special_tokens=False)
input_ids.extend(tokens_with_img)
while -100 in input_ids:
img_idx = input_ids.index(-100)
input_ids = input_ids[:img_idx] + [0] * (self.image_length + 1) + [-1] + input_ids[img_idx + 1:]
image_position = []
while -1 in input_ids:
img_idx = input_ids.index(-1)
input_ids[img_idx] = 0
image_position.append(img_idx)
image_embed_mask = [0] * len(input_ids)
vision_expert_mask = [0] * len(input_ids)
image_rope_mask = [0] * len(input_ids)
for idx in image_position:
image_embed_mask[idx - self.image_length - 1: idx + 1] = [1] * (self.image_length + 2)
vision_expert_mask[idx - self.image_length - 1: idx] = [1] * (self.image_length + 1)
image_rope_mask[idx - self.image_length: idx] = [1] * self.image_length
input_ids = torch.tensor(input_ids).unsqueeze(0)
image_embed_mask = torch.tensor(image_embed_mask).unsqueeze(0)
vision_expert_mask = torch.tensor(vision_expert_mask).unsqueeze(0)
image_rope_mask = torch.tensor(image_rope_mask).unsqueeze(0)
return {'input_ids': input_ids, 'image_embed_mask': image_embed_mask, 'vision_expert_mask': vision_expert_mask, 'image_rope_mask': image_rope_mask}
def history_to_prompt(self, query, history):
return _history_to_prompt(self, self.tokenizer.signal_type, query, history)
def replace_tags_with_empty(self, text):
return re.sub('<pad>|<s>|</s>|<EOI>', '', text)
def process_response(self, response):
return response.replace('</s>', '')
def get_func(self, inputs, **kwargs):
get_func = partial(get_masks_and_position_ids, image_logits_mask=kwargs['image_rope_mask'])
return get_func
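
A standalone sketch of the position-id rule used in both the training processor and get_masks_and_position_ids above: the position counter advances on every text token and only once at the start of an image span, so all image tokens share a single rotary position (the mask below is a made-up example).

image_rope_mask = [0, 0, 1, 1, 1, 1, 0, 0]   # hypothetical 4-token image span

position_ids, pid = [], -1
for i in range(len(image_rope_mask)):
    if image_rope_mask[i] == 0 or (i > 0 and image_rope_mask[i] != image_rope_mask[i - 1]):
        pid += 1
    position_ids.append(pid)

print(position_ids)  # [0, 1, 2, 2, 2, 2, 3, 4]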

86
utils/parser.py Normal file

@ -0,0 +1,86 @@
import seaborn as sns
from PIL import Image, ImageDraw, ImageFont
import matplotlib.font_manager
import spacy
import re
nlp = spacy.load("en_core_web_sm")
def draw_boxes(image, boxes, texts, output_fn='output.png'):
box_width = 5
color_palette = sns.color_palette("husl", len(boxes))
colors = [(int(r*255), int(g*255), int(b*255)) for r, g, b in color_palette]
width, height = image.size
absolute_boxes = [[(int(box[0] * width), int(box[1] * height), int(box[2] * width), int(box[3] * height)) for box in b] for b in boxes]
overlay = Image.new('RGBA', image.size, (255, 255, 255, 0))
draw = ImageDraw.Draw(overlay)
font_path = sorted(matplotlib.font_manager.findSystemFonts(fontpaths=None, fontext='ttf'))[0]
font = ImageFont.truetype(font_path, size=26)
for box, text, color in zip(absolute_boxes, texts, colors):
for b in box:
draw.rectangle(b, outline=color, width=box_width)
if not text:
continue
splited_text = text.split('\n')
num_lines = len(splited_text)
text_width, text_height = font.getbbox(splited_text[0])[-2:]
y_start = b[3] - text_height * num_lines - box_width
if b[2] - b[0] < 100 or b[3] - b[1] < 100:
y_start = b[3]
for i, line in enumerate(splited_text):
text_width, text_height = font.getbbox(line)[-2:]
x = b[0] + box_width
y = y_start + text_height * i
draw.rectangle([x, y, x+text_width, y+text_height], fill=(128, 128, 128, 160))
draw.text((x, y), line, font=font, fill=(255, 255, 255))
img_with_overlay = Image.alpha_composite(image.convert('RGBA'), overlay).convert('RGB')
img_with_overlay.save(output_fn)
def boxstr_to_boxes(box_str):
boxes = [[int(y)/1000 for y in x.split(',')] for x in box_str.split(';') if x.replace(',', '').isdigit()]
return boxes
def text_to_dict(text):
doc = nlp(text)
box_matches = list(re.finditer(r'\[\[([^\]]+)\]\]', text))
box_positions = [match.start() for match in box_matches]
noun_phrases = []
boxes = []
for match, box_position in zip(box_matches, box_positions):
nearest_np_start = max([0] + [chunk.start_char for chunk in doc.noun_chunks if chunk.end_char <= box_position])
noun_phrase = text[nearest_np_start:box_position].strip()
if noun_phrase and noun_phrase[-1] == '?':
noun_phrase = text[:box_position].strip()
box_string = match.group(1)
noun_phrases.append(noun_phrase)
boxes.append(boxstr_to_boxes(box_string))
pairs = []
for noun_phrase, box_string in zip(noun_phrases, boxes):
pairs.append((noun_phrase.lower(), box_string))
return dict(pairs)
def parse_response(img, response, output_fn='output.png'):
img = img.convert('RGB')
width, height = img.size
ratio = min(1920 / width, 1080 / height)
new_width = int(width * ratio)
new_height = int(height * ratio)
new_img = img.resize((new_width, new_height), Image.LANCZOS)
pattern = r"\[\[(.*?)\]\]"
positions = re.findall(pattern, response)
boxes = [[[int(y) for y in x.split(',')] for x in pos.split(';') if x.replace(',', '').isdigit()] for pos in positions]
dic = text_to_dict(response)
if not dic:
texts = []
boxes = []
else:
texts, boxes = zip(*dic.items())
draw_boxes(new_img, boxes, texts, output_fn=output_fn)
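
For reference, a small sketch of the grounding-box format that parse_response and boxstr_to_boxes above handle: integers inside [[...]] are normalized to 0-1000 and ';' separates boxes; the response string and image size below are made up.

import re

response = "two cats [[120,200,480,760;530,180,900,740]]"   # made-up model output
width, height = 640, 480

for box_str in re.findall(r"\[\[(.*?)\]\]", response):
    boxes = [[int(v) / 1000 for v in part.split(',')]
             for part in box_str.split(';') if part.replace(',', '').isdigit()]
    pixel_boxes = [(int(x0 * width), int(y0 * height), int(x1 * width), int(y1 * height))
                   for x0, y0, x1, y1 in boxes]
    print(pixel_boxes)  # [(76, 96, 307, 364), (339, 86, 576, 355)]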

778
utils/template.py Normal file

@ -0,0 +1,778 @@
cn_template=[
'这幅作品描绘了:',
'描述这张图片:',
'从这张图片中,我们可以看到:',
'这张图片中最引人注目的是什么?',
'如果要在这张图片上添加一个标语,您会写什么?',
'描述图片内容的关键词:',
'画面传达了以下信息:',
'这张图展示了:',
'这张图片展示了什么?',
'描述这张图片中的场景:',
'这张图片的主要焦点是:',
'适合这张图片的标题是:',
'这张图片可以被描述为:',
'图片中的元素包括:',
'这张图片想表达什么信息?',
'请用一句话来概括这张图片的主题。',
'图片呈现了以下场景:',
'以简短的语言概括这张图片:',
'这张照片的故事是:',
'从这幅画中,我们可以发现:',
'对这幅图像进行简要说明:',
]
en_template = [
'The essence of this image is:',
'A brief summary of this scene would be:',
'If this image could speak, it would say:',
'The image depicts:',
'A photo of',
'Key elements in this picture include:',
'This visual representation showcases:',
'The main focus of this photograph is:',
'Can you identify the main elements or characters in this image?',
'Summarize this image in a single sentence:',
'What\'s happening in this picture?',
'Give a creative title for this image:',
'In a few words, what does this image convey?',
'Capture the essence of this image with a phrase:',
'Describe the scene in this image:',
'The main focus of this picture is:',
'A suitable caption for this image would be:',
'This image can be best described as:',
]
en_template_q = [ # from gpt-4
"Describe the image.",
"Give me a summary of this image.",
"What do you see in the picture?",
"Tell me about this picture.",
"Explain the image to me.",
"Break down what's in the photo.",
"What is depicted in the picture?",
"Illustrate the content of the image.",
"Convey the essence of the image.",
"Elaborate on the picture.",
"Can you detail the image?",
"Provide an overview of this picture.",
"Walk me through this image.",
"What does the image show?",
"Characterize the picture for me.",
"Render a description of the image.",
"Can you clarify what's in the image?",
"Discuss the elements of the picture.",
"Provide insight into this image.",
"What's going on in this photo?"
] + [ # from https://github.com/shikras/shikra/blob/main/config/_base_/dataset/template/image_cap.json
"Describe this image as simply as possible.",
"What happened in the picture? Answer in short sentences.",
"Briefly say the content of this scene",
"Show the content in the photo in short text.",
"Please describe the content of the image in a few words.",
"What is the content of the image? Please answer in short sentences.",
"Can you give me a brief description of this image?",
"What do you see in this picture?",
"In a few words, describe the content of the image.",
"Provide a concise explanation of this photograph.",
"What is happening in this scene?",
"Summarize the content of the photo.",
"What are the main elements present in the image?",
"Quickly explain the content of this visual.",
"In a nutshell, what can you say about this picture?",
"What's the main subject in the image?",
"Describe the main features of the image.",
"What is depicted in this photograph?",
"Give me a short description of the picture.",
"Briefly describe the objects and actions in the image.",
"What is the context of this image?",
"What are the key elements shown in this image?",
"What is the main theme of the photograph?",
"In just a few words, tell me what you see in this image.",
"What is the essence of the image?",
"Give me a quick breakdown of what's happening in the image.",
"What does this picture represent?",
"Using simple words, tell me what the image is showing.",
"Quickly mention the content of the image.",
"Describe the general scenario happening in the image.",
"Can you summarize the main aspects of this image?",
"Briefly point out the significant aspects of the image.",
"What is the core subject illustrated in this picture?",
"Tell me the central theme of the image briefly.",
"What important features should I look for in this image?",
"Describe the primary elements of the photo.",
"In a sentence or two, describe the image.",
"Outline the main content of this image.",
"What event is captured in the picture?",
"Simply put, what is being shown in the image?",
"What do you notice immediately in the image?",
"Provide a brief interpretation of the image.",
"Tell me the key things happening in this image.",
"Express the general theme of this photograph.",
"What is the core idea of the image?",
"Explain briefly what the image conveys.",
"What is the primary focus of this visual?",
"Name the most important components of this image.",
"Explain the basic scene depicted in the image.",
"What subject matter is portrayed in the picture?",
"What are the prominent features of the image?",
"Give a concise interpretation of this image.",
"Quickly describe the situation happening in the image.",
"Identify the focal point of the photograph.",
"What can you gather from this image in a few words?",
"Describe the image in the simplest way possible.",
"What's happening in the image at a glance?",
"What is the basic idea behind this picture?",
"Enumerate the crucial elements of the photograph.",
"What is the fundamental concept shown in the image?",
"Using few words, tell me the main idea of the photo.",
"Describe the essential aspects of this image.",
"Briefly outline the content within the image.",
"In a simple manner, explain the image.",
"What are the most striking details in the picture?",
"What can you say about the image in a nutshell?",
"Give a summary of the essential components of the image.",
"What is the primary message conveyed by the image?",
"Tell me briefly what the photograph is all about.",
"What is the central idea behind this image?",
"What do you observe in the image in simple terms?",
"Briefly express the main points of the image.",
"Describe the simple version of what's happening in the image.",
"What is the context of the image in brief?",
"Briefly indicate the notable features of the image.",
"What stands out in the photograph?",
"What are the major details visible in the picture?",
"What characters or objects are present in the image?",
"What do you see at first glance in the image?",
"Explain in brief the subject matter of the photograph.",
"Mention the main objects and actions in the image briefly.",
"What are the main components of the picture?",
"What is the primary objective of the image?",
"Give a short overview of the scene in the image.",
"How would you describe the content of the image?",
"What significant elements can you spot in the image?",
"In your own words, quickly describe the image.",
"Quickly outline the main ideas of this photograph.",
"Briefly explain the components of this image.",
"What are the key points portrayed in the picture?",
"Describe in a simplified manner the content of the image.",
"Give the short version of what's going on in the image.",
"What are the major aspects of this photograph?",
"What essential details can you see in the image?",
"What core elements are present in the picture?",
"Explain the main idea behind the photograph.",
"Name the key features of this visual.",
"What are the crucial points presented in this image?",
"Sum up the most important things in the image.",
"What do you think is the primary focus of this picture?",
"What are the major factors visible in the image?",
"Briefly mention the key details of the photograph.",
"Describe the main events or objects in the image.",
"In a sentence, describe the content of the image.",
"What key aspects can you see in the photograph?",
"What are the primary elements of this picture?",
"Concisely explain the content of this visual.",
"Give a short analysis of the image.",
"Describe the notable features of the photograph.",
"What's the main story being told in the image?",
"Provide a simple description of this photograph.",
"Express the gist of the scene in the image.",
"What can you deduce from the image briefly?",
"What are the most important aspects of the visual?",
"What do you find most striking in the photo?",
"Describe the essence of the picture.",
"Give a brief outline of the image content.",
"What grabs your attention in the image?",
"Explain the focal points of this photograph.",
"Describe the core elements of the image.",
"Outline the key aspects of this picture.",
"What's happening in this image in brief?",
"What scene is represented in the photograph?",
"What central theme can you identify in the image?",
"Give a brief overview of the image.",
"What main features are present in the image?",
"Describe the simple context of the photograph.",
"What are the standout details in the image?",
"Explain the primary purpose of the image.",
"Capture the basic essence of the picture.",
"Identify the key components of this image.",
"What's the main idea shown in the image?",
"Concisely describe the core content of the image.",
"Describe the primary aspects of this image.",
"Outline the significant parts of the photo.",
"What is the most important part of the image?",
"In a short statement, explain the image.",
"Relay a brief, clear account of the picture shown. The image is",
"Can you provide a brief description of the image?",
"Summarize the content of this picture.",
"Please tell me what's happening in this photo.",
"Quickly describe what you see in the photograph.",
"In a sentence or two, describe the scene in this image.",
"Give me a short summary of what you see in this picture.",
"Provide a concise analysis of the image.",
"Tell me in a nutshell what's happening in this image.",
"What does this photo depict in brief?",
"Briefly explain the content of this image.",
"Express the idea of the image in short form.",
"Kindly give a condensed description of the picture.",
"In few words, describe what this picture is about.",
"Offer a succinct summary of the scene in this image.",
"Quickly tell me the main subject of the image.",
"What is the theme of this photo in brief?",
"In simple words, explain the image.",
"Please give a short and sweet description of this image.",
"Provide an abbreviated version of the content of this photo.",
"Shorten the scenario of this scene.",
"Please give a concise description of this image.",
"Boil down the content of this photograph.",
"Quickly summarize what you see in the image.",
"Sketch the main points of this picture.",
"Offer a compact summary of the elements in this image.",
"In one sentence, describe the theme of this picture.",
"Pare down the content of this photo.",
"Provide a to-the-point explanation of this image.",
"Highlight the main subject of the photograph.",
"Summary: What can you see in this image?",
"What's the brief context of this picture?",
"Describe this scene in a few words.",
"What's the main focus of this image?",
"In just a couple of words, tell me about this picture.",
"Give a snapshot description of this image.",
"Be succinct while describing the content of this photo.",
"Cut to the main part of the picture.",
"Quickly express the idea of this scene.",
"What's the abbreviated version of this image?",
"Outline the moment captured in this photo.",
"Please make a brief statement about the image.",
"What are the basic elements in this picture?",
"Trim down the content of this image.",
"Distill the content of the photograph.",
"Give me the main idea of this picture.",
"Point out the primary focus of this image.",
"What's the gist of this scene?",
"Provide a pithy description of this photo.",
"In brief, explain the elements in this image.",
"Offer a short version of the content of this photograph.",
"Capture the essence of this picture.",
"Curtly describe what's happening in this image.",
"Brief me on the content of this scene.",
"In a word or two, what does this photo show?",
"Condense the content of this image.",
"Simply summarize the elements in this picture.",
"What is the main object of interest in the image?",
"Highlight the crux of this photograph.",
"Provide a brief explanation of what's occurring in this image.",
"Quickly identify the central theme of this picture.",
"Reveal the core content of this image.",
"What's the focal point of this photo?",
"Give a compressed description of this scene.",
"Explain the key concept of this image in simple terms.",
"Wrap up the content of this picture.",
"Make a concise statement about the photograph.",
"Identify the primary subject in this image.",
"What's happening in the photo in few words?",
"Simplify the description of this scene.",
"In a nutshell, explain the content of this image.",
"Offer the main takeaway from this photograph.",
"In a few words, give me the main idea of this picture.",
"Share a brief description of the primary action in this image.",
"What can you observe in this image in short?",
"Whittle down the content of this photo.",
"Strike at the heart of the scene depicted in this image.",
"Preserve the essence while describing this picture.",
"Keep it short and explain this photograph.",
"What's the key thing to notice in this image?",
"Give me the abridged version of this scene.",
"Pare the content of this picture down to its essence.",
"Provide a trimmed down explanation of this photo.",
"Expressed briefly, what does this image show?",
"Offer a concise assessment of the scene in this picture.",
"What does the photograph illustrate in short?",
"What are the salient features of this image?",
"Bullet-point the main elements of this picture.",
"Concisely express the key aspect of this photo.",
"Briefly, what can you spot in this image?",
"Filter the description of this scene down to the essentials.",
"Illustrate the core concept of this picture.",
"Sum up the main event in this photograph.",
"What is the most striking feature of this image?",
"Cut to the chase and explain this scene.",
"Select the main element to describe in this picture.",
"What do you see in the photo in brief?",
"Give a short but informative description of this image.",
"What stands out the most in this scene?",
"In few words, summarize the main part of this picture.",
"Briefly, what's going on in this photograph?",
"List the key elements of this image.",
"State the essence of this picture.",
"Define the central idea of this photo briefly.",
"Shorten your description of this image.",
"Be concise while explaining the content of this picture.",
"What's the short version of this photo's content?",
"Point out the main component in this photograph.",
"In a phrase, explain the essence of this image.",
"Selectively describe the content of this scene.",
"Briefly, what is this picture all about?",
"What's the central subject of this photo?",
"Get to the point and explain this image.",
"Briefly, tell me the main action depicted in this picture.",
"What's the main message of this photograph in brief?",
"Condense the scene captured in this image.",
"Please stick to the main point of this photo.",
"Single out the main focus of this picture.",
"Streamline the content of this image.",
"What's the overall theme in this scene?",
"Distill the main idea from this photograph.",
"In a few words, what's the main event in this picture?",
"Give a terse description of the content of this image.",
"Catch the essence of this photo.",
"What's the main aspect of this image?",
"Briefly, describe the primary focus of this picture.",
"What is the key attribute of this photo?",
"What's the main highlight of this image?",
"Simplify the content of this scene.",
"Explain the key feature of this photograph concisely.",
"Abstain from details while describing this picture.",
"Be short while explaining this image.",
"What's the essential point in this photo?",
"Just tell me the main subject in the picture.",
"Highlight the primary idea of this image.",
"Get straight to the point about this scene.",
"Stick to the basics while describing this photo."
]
shikra_template = {
'caption2box': [
"Where is <expr>?",
"Where is <expr> in the image?",
"Where is <expr>? answer in [[x0,y0,x1,y1]] format.",
"Can you point out <expr> in the image and provide the bounding boxes of its location?",
"Help me to locate <expr> in and give me its bounding boxes, please.",
"In the given, could you find and tell me the bounding boxes of <expr>?",
"Guide me to the location of <expr> within the image by providing its bounding boxes.",
"I'd like to know the exact bounding boxes of <expr> in the photo.",
"Would you kindly provide the bounding boxes of <expr> located in the picture?",
"Can you find <expr> in and give me the bounding boxes of where it is located?",
"I'm trying to locate <expr> in. Can you determine its bounding boxes for me?",
"What are the bounding boxes of <expr> in the image?",
"Can you disclose the position of <expr> in the photograph by stating its bounding boxes?",
"In, could you let me know the location of <expr> in the form of bounding boxes?",
"I need the bounding boxes of <expr> in, can you please assist me with that?",
"Where in is <expr> located? Provide me with its bounding boxes, please.",
"May I have the bounding boxes of <expr>?",
"In the photograph, could you pinpoint the location of <expr> and tell me its bounding boxes?",
"Can you please search and find <expr> in, then let me know its bounding boxes?",
"Please, point out the position of <expr> in the image by giving its bounding boxes.",
"What are the exact bounding boxes of <expr> in the provided picture?",
"Detect the location of <expr> in and share the bounding boxes with me, please.",
"In the picture, I'd like you to locate <expr> and provide its coordinates.",
"Please indicate the location of <expr> in the photo by giving bounding boxes.",
"Find <expr> in and share its coordinates with me.",
"Could you please help me find the bounding boxes of <expr> in the image?",
"I am looking for the position of <expr> in. Can you provide its bounding boxes?",
"In the image, can you locate <expr> and let me know its coordinates?",
"I'd appreciate if you could find and tell me the bounding boxes of <expr>.",
"In, I need the bounding box bounding boxes of <expr>.",
"Point me to the location of <expr> in the picture by providing its bounding boxes.",
"Could you trace <expr> in and tell me its bounding boxes?",
"Can you assist me in locating <expr> in, and then provide its bounding boxes?",
"I'm curious, what are the bounding boxes of <expr> in the photo?",
"Kindly share the bounding boxes of <expr> located in the image.",
"I would like to find <expr> in. Can you give me its bounding boxes?",
"Can you spot <expr> in and disclose its bounding boxes to me?",
"Please, reveal the location of <expr> in the provided photograph as coordinates.",
"Help me locate and determine the bounding boxes of <expr>.",
"I request the bounding boxes of <expr> in the image.",
"In the given, can you find <expr> and tell me its bounding boxes?",
"I need to know the position of <expr> in as bounding boxes.",
"Locate <expr> in and provide its bounding boxes, please.",
"Assist me in finding <expr> in the photo and provide the bounding box bounding boxes.",
"In, can you guide me to the location of <expr> by providing bounding boxes?",
"I'd like the bounding boxes of <expr> as it appears in the image.",
"What location does <expr> hold in the picture? Inform me of its bounding boxes.",
"Identify the position of <expr> in and share its bounding boxes.",
"I'd like to request the bounding boxes of <expr> within the photo.",
"How can I locate <expr> in the image? Please provide the bounding boxes.",
"I am interested in knowing the bounding boxes of <expr> in the picture.",
"Assist me in locating the position of <expr> in the photograph and its bounding box bounding boxes.",
"In the image, I need to find <expr> and know its bounding boxes. Can you please help?"
],
'box2caption': [
"Can you give me a description of the region <objs> in image?",
"In the provided image, would you mind describing the selected area <objs>?",
"I need details about the area <objs> located within image.",
"Could you please share some information on the region <objs> in this photograph?",
"Describe what's happening within the coordinates <objs> of the given image.",
"What can you tell me about the selected region <objs> in the photo?",
"Please, can you help me understand what's inside the region <objs> in image?",
"Give me a comprehensive description of the specified area <objs> in the picture.",
"I'm curious about the area <objs> in the following image. Can you describe it?",
"Please elaborate on the area with the coordinates <objs> in the visual.",
"In the displayed image, help me understand the region defined by <objs>.",
"Regarding the image, what's going on in the section <objs>?",
"In the given photograph, can you explain the area with coordinates <objs>?",
"Kindly describe what I should be seeing in the area <objs> of image.",
"Within the input image, what can be found in the region defined by <objs>?",
"Tell me what you see within the designated area <objs> in the picture.",
"Please detail the contents of the chosen region <objs> in the visual input.",
"What's inside the area <objs> of the provided graphic?",
"I'd like some information about the specific region <objs> in the image.",
"Help me understand the details within the area <objs> in photograph.",
"Can you break down the region <objs> in the image for me?",
"What is taking place within the specified area <objs> in this capture?",
"Care to elaborate on the targeted area <objs> in the visual illustration?",
"What insights can you provide about the area <objs> in the selected picture?",
"What does the area <objs> within the given visual contain?",
"Analyze and describe the region <objs> in the included photo.",
"Please provide details for the area marked as <objs> in this photographic.",
"For the image, can you assess and describe what's happening at <objs>?",
"Fill me in about the selected portion <objs> within the presented image.",
"In the image, elaborate on the details found within the section <objs>.",
"Please interpret and describe the area <objs> inside the given picture.",
"What information can you give me about the coordinates <objs> in image?",
"Regarding the coordinates <objs> in image, can you provide a description?",
"In the photo, can you delve into the details of the region <objs>?",
"Please provide insights on the specified area <objs> within the graphic.",
"Detail the chosen region <objs> in the depicted scene.",
"Can you discuss the entities within the region <objs> of image?",
"I'd appreciate a breakdown of the area <objs> in the displayed image.",
"What's the story in the section <objs> of the included visual?",
"Please enlighten me about the region <objs> in the given photo.",
"Offer a thorough description of the area <objs> within the illustration.",
"What can you share about the area <objs> in the presented image?",
"Help me grasp the context of the region <objs> within image.",
"Kindly give an overview of the section <objs> in photo.",
"What details can you provide about the region <objs> in the snapshot?",
"Can you divulge the contents of the area <objs> within the given image?",
"In the submitted image, please give a synopsis of the area <objs>.",
"In the image, please describe the bounding box <objs>.",
"Please describe the region <objs> in the picture.",
"Describe the bbox <objs> in the provided photo.",
"What can you tell me about the area <objs> within the image?",
"Could you give me a description of the rectangular region <objs> found in?",
"In, what elements can be found within the coordinates <objs>?",
"Please provide details for the area within the bounding box <objs> in.",
"Can you generate a description for the selected region <objs> in the image?",
"Kindly describe the objects or scenery in the bounding box <objs> within.",
"What details can you provide for the rectangle defined by the coordinates <objs> in?",
"In relation to the picture, please describe the content of the area marked by <objs>.",
"I'd like to know more about the area <objs> in the given image. Can you describe it?",
"Can you help me by describing the part of that lies within the bounding box <objs>?",
"What's happening in the section of the photo enclosed by the coordinates <objs>?",
"Describe the image content present in the specified rectangular area <objs> of.",
"Please provide information about the area within the bounding box <objs> in the picture.",
"Could you offer a description of the contents in the selected area <objs> of the image?",
"I'm curious about the area <objs> in. Can you provide a description of it?",
"What can be observed in the rectangular region <objs> in the photograph?",
"Please explain what is contained in the portion of defined by the box <objs>.",
"In the photograph, can you describe the objects or scenery enclosed by <objs>?",
"Can you give a brief explanation of the specified area <objs> in the image?",
"What does the area <objs> look like in the context of the image?",
"Could you please describe the contents of the bounding box <objs> in the given image?",
"I would like to know more about the rectangular region <objs> within the picture. Can you describe it?",
"Please tell me about the area <objs> in the image. What does it contain?",
"Help me understand what's happening in the selected bounding box <objs> within.",
"Can you provide a description of the area <objs> in the image?",
"What sort of things can be seen in the region <objs> of the photo?",
"Describe what can be found within the bounds of <objs> in the image.",
"In, can you paint a picture of the area enclosed by coordinates <objs>?",
"Please provide a detailed account of the area covered by the bounding box <objs> in.",
"Give me a vivid description of what's happening in the area <objs> within the snapshot.",
"In the image, what do you observe within the rectangular box defined by the coordinates <objs>?",
"Could you give me a breakdown of the content in the specified area <objs> of the picture?",
"Please elucidate the area<objs> of the image.",
"I'd appreciate it if you could describe the portion of that lies within the rectangle <objs>.",
"Can you share some insights about the rectangular region <objs> in the image?",
"Help me visualize the section of the photo enclosed by the bounding box <objs>.",
"Would you kindly provide a description for the content within the rectangular area <objs> of?",
"In, can you tell me more about the area specified by the bounding box <objs>?",
"Please describe what can be seen in the rectangular region <objs> of the image.",
"Can you analyze the content of the area <objs> within the photograph?",
"In the provided image, please explain the content within the region <objs>.",
"I'm interested in the selected rectangle <objs> in. Can you tell me more about it?",
"Explain what can be found in the bounding box <objs> in the context of the image.",
"Kindly share your observations about the rectangular region <objs> within.",
"I'd like a thorough description of the area <objs> in the image.",
"Could you please provide a description of the rectangular area <objs> in?",
"Please describe the section of the picture defined by the bbox <objs>.",
"Tell me more about the scenery or objects within the rectangular region <objs> in.",
"Would you kindly describe the content of the area enclosed by <objs> in the image?",
"Help me understand the objects or scenery within the bounding box <objs> in the image.",
"I would like to know about the section of the image enclosed by the rectangle <objs>. Can you describe it?",
"Describe the selected rectangular area <objs> in the photo.",
"Tell me about the region <objs> of the image.",
"I request a description of the area <objs> in the picture.",
"Can you elaborate on the content of the bounding box <objs> in?",
"Please share details about the rectangular region <objs> within the image.",
"What can I find in the bbox <objs> of the provided image?",
"In the image, could you provide a description for the coordinates <objs>?",
"Could you tell me more about the area <objs> in the snapshot?",
"Fill me in on the details of the rectangular box <objs> within the image.",
"What's going on in the section of contained within the bounding box <objs>?",
"I would like a description of the content within the bbox <objs> in.",
"Please enlighten me about the area <objs> in the photograph.",
"Can you give me a visual rundown of the area <objs> in?",
"Describe the visual elements within the selected area <objs> of the image.",
"Tell me what you see in the area <objs> within the context of the image.",
"Explain the content within the rectangular region <objs> of the image.",
"I'd like some information about the bounding box <objs> in the photo.",
"What is happening within the rectangle defined by coordinates <objs> in the image?",
"Please describe the content within the area <objs> displayed in the image.",
"What can be seen in the bounding box <objs> in the context of the provided image?",
"Share some details about the objects or environment within the bounding box <objs> in.",
"Please describe the area <objs> in the image for me.",
"Can you generate a description of the contents within the selected region <objs> in?",
"What objects or scenery can be found in the area <objs> in the image?",
"Please tell me more about the rectangular section <objs> in the photo.",
"Could you describe the content of the bbox <objs> in the image?",
"What does the selected region <objs> in the image encompass?",
"I am interested in the region <objs> of the image; please describe it.",
"Can you provide some context for the area <objs> within the picture?",
"Please give me some details about the rectangle <objs> in the image.",
"In the photo, what can you see within the region defined by the bounding box <objs>?",
"I would like a detailed description of the portion of enclosed by the bbox <objs>.",
"Please help me understand the content present within the rectangle <objs> in.",
"Would you mind describing the rectangular area <objs> in the provided image?"
],
'caption_with_box': [
"Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?",
"Please explain what's happening in the photo and give coordinates [[xmin,ymin,xmax,ymax]] for the items you reference.",
"Analyze the contents of the picture and share the positions of mentioned items using the top-left and bottom-right coordinates.",
"What do you see in this image? Please mention the objects and their locations using the format [[x1,y1,x2,y2]].",
"Examine the image and describe its content, specifying the location of each mentioned noun using coordinates [[x1,y1,x2,y2]].",
"Could you interpret the scene from this image and provide the coordinates [[xmin,ymin,xmax,ymax]] for each element you describe?",
"Please provide an overview of the visual information in this image, along with the location data [[xmin,ymin,xmax,ymax]] for each mentioned object.",
"Tell me about the picture and include position info [[x0,y0,x1,y1]] for the objects you describe.",
"What is displayed in this image? Remember to mention the objects and their corresponding locations using the format [[xmin,ymin,xmax,ymax]].",
"Give a brief analysis of the image and make sure to include the location of objects using their coordinates [[x1,y1,x2,y2]].",
"Explain the content of this image and provide the coordinates [[x1,y1,x2,y2]] for all objects that you mention.",
"Describe the scene in this picture and give the top-left and bottom-right coordinates [[xmin,ymin,xmax,ymax]] for each item you talk about.",
"Please give a summary of the image and include the position info for each object you identify with coordinates [[x0,y0,x1,y1]].",
"What is happening in the photo? Please point out the objects and their locations using the format [[x1,y1,x2,y2]].",
"Illustrate the content of the image and specify the coordinates [[xmin,ymin,xmax,ymax]] for every object you mention.",
"What can you tell me about this image? Remember to provide location data for the objects you describe using coordinates [[x1,y1,x2,y2]].",
"Please interpret this image and give coordinates [[x1,y1,x2,y2]] for each object you mention.",
"Detail what you see in the image and provide the top-left and bottom-right coordinates [[xmin,ymin,xmax,ymax]] for each mentioned noun.",
"Take a look at this image and give an explanation of its content, including the position data [[x1,y1,x2,y2]] for each object you describe.",
"What is the image depicting? Please mention the positions of any mentioned objects using square brackets.",
"Describe the visual elements in the image and note the positions of any mentioned objects in square brackets.",
"Could you please analyze the content of the image and mention the positions of any mentioned objects in square brackets?",
"Tell me about the objects present in the image and note their positions using square brackets.",
"What can you tell me about the contents of the image? Please indicate the positions of any mentioned objects in square brackets.",
"Provide a comprehensive description of the image and specify the positions of any mentioned objects in square brackets.",
"Describe the scene in the image and mention the positions of any mentioned objects using square brackets.",
"Can you identify the objects in the image? Please include their positions in square brackets.",
"Please describe the visual details in the image and note the positions of any mentioned objects using square brackets.",
"What is happening in the image? Please mention the positions of any mentioned objects using square brackets.",
"Analyze the content of the image and provide the positions of any mentioned objects in square brackets.",
"Describe the main elements in the image and note the positions of any mentioned objects using square brackets.",
"Could you please provide a detailed description of the image? Don't forget to mention the positions of any mentioned objects in square brackets.",
"Can you provide a description of the image and include the coordinates [[x0,y0,x1,y1]] for each mentioned object?",
"Please explain what's happening in the photo and give coordinates [[xmin,ymin,xmax,ymax]] for the items you reference.",
"Analyze the contents of the picture and share the positions of mentioned items using the top-left and bottom-right coordinates.",
"What do you see in this image? Please mention the objects and their locations using the format [[x1,y1,x2,y2]].",
"Examine the image and describe its content, specifying the location of each mentioned noun using coordinates [[x1,y1,x2,y2]].",
"Could you interpret the scene from this image and provide the coordinates [[xmin,ymin,xmax,ymax]] for each element you describe?",
"Please provide an overview of the visual information in this image, along with the location data [[xmin,ymin,xmax,ymax]] for each mentioned object.",
"Tell me about the picture and include position info [[x0,y0,x1,y1]] for the objects you describe.",
"What is displayed in this image? Remember to mention the objects and their corresponding locations using the format [[xmin,ymin,xmax,ymax]].",
"Give a brief analysis of the image and make sure to include the location of objects using their coordinates [[x1,y1,x2,y2]].",
"Explain the content of this image and provide the coordinates [[x1,y1,x2,y2]] for all objects that you mention.",
"Describe the scene in this picture and give the top-left and bottom-right coordinates [[xmin,ymin,xmax,ymax]] for each item you talk about.",
"Please give a summary of the image and include the position info for each object you identify with coordinates [[x0,y0,x1,y1]].",
"What is happening in the photo? Please point out the objects and their locations using the format [[x1,y1,x2,y2]].",
"Illustrate the content of the image and specify the coordinates [[xmin,ymin,xmax,ymax]] for every object you mention.",
"What can you tell me about this image? Remember to provide location data for the objects you describe using coordinates [[x1,y1,x2,y2]].",
"Please interpret this image and give coordinates [[x1,y1,x2,y2]] for each object you mention.",
"Detail what you see in the image and provide the top-left and bottom-right coordinates [[xmin,ymin,xmax,ymax]] for each mentioned noun.",
"Take a look at this image and give an explanation of its content, including the position data [[x1,y1,x2,y2]] for each object you describe.",
"What are the details of this picture? Please include the coordinates [[x1,y1,x2,y2]] for each object you mention.",
"Can you provide a detailed description of the contents of the image? Please include the positions of any mentioned objects in square brackets.",
"What is the image depicting? Please mention the positions of any mentioned objects using square brackets.",
"Describe the visual elements in the image and note the positions of any mentioned objects in square brackets.",
"Could you please analyze the content of the image and mention the positions of any mentioned objects in square brackets?",
"Tell me about the objects present in the image and note their positions using square brackets.",
"How would you describe the contents of the image? Please provide the positions of mentioned objects in square brackets.",
"What do you observe in the image? Don't forget to mention the objects and their locations using square brackets.",
"Can you give an overview of the image and list the objects along with their positions using square brackets?",
"Describe the activities taking place in the image and point out the objects with their locations using square brackets.",
"Can you explain what is going on in this picture and give the bounding boxes for each object you mention?",
"Provide a summary of the image and include bounding box coordinates for the objects you talk about.",
"Help me understand what's in the image and also give me the bounding boxes for the objects you describe.",
"Explain the scene depicted in the image and include the bounding boxes for the nouns you reference.",
"Analyze this picture for me and provide coordinates for the items you discuss.",
"I need a breakdown of what is happening in the image, and please include the bounding box information.",
"Give me a rundown of what's in this image, along with the coordinates for each mentioned object.",
"Elaborate on the image and provide the boundaries for the objects you mention.",
"Discuss the contents of this image and include the bounding boxes for mentioned objects.",
"Unveil what's happening in the image and provide the coordinates for the objects in discussion.",
"Clarify the situation depicted in the photo and include bounding box details for the objects mentioned.",
"Break down the image and share the bounding box coordinates of objects you mention.",
"Reveal the meaning behind the image and provide me with the bounding box details for the mentioned nouns.",
"Examine the picture and disclose the bounding box coordinates for each object you discuss.",
"Interpret the image and include the bounding boxes of the items you discuss.",
"Convey the essence of the photo and provide the bounding box information for mentioned objects.",
"Enlighten me about the image and provide me with the bounding box coordinates for each subject.",
"Narrate the image and include the bounding boxes for the objects you describe.",
"Decipher the story behind the image and provide the bounding box for each object in the story.",
"Illustrate your understanding of the image, and give the boxes of the described objects.",
"Walk me through the contents of the image, and include the bounding box for the mentioned items.",
"I need to know what's in the image and please provide coordinates for the featured objects.",
"Dissect the components of the image and include the bounding boxes for each object discussed.",
"Give insights into the picture and provide the bounding box details for the objects mentioned.",
"Delve into the image, and furnish the coordinates for the items you reference.",
"Portray the events in the image and include the location and boundaries of the described objects.",
"Unravel the aspects of the image and give the bounding box for the mentioned items.",
"Tell me everything about the picture and don't forget to mention bounding boxes for the described items.",
"Disentangle the details of the picture and include bounding box coordinates for mentioned items.",
"Explore the elements within the picture and provide the bounding boxes for each object mentioned.",
"Detail the occurrences in the picture and supply the bounding box info for the talked-about objects.",
"Lay out the context of the picture and include bounding box details for the featured objects.",
"Uncover the truth behind the picture and include the bounding boxes for the described nouns.",
"What's happening in the picture? Please provide the bounding box info for mentioned objects.",
"Discuss the events taking place in and include the bboxes of the involved objects.",
"In the picture, describe what's going on and provide the bboxes of mentioned objects.",
"Decode the message in the picture, and provide boundaries for the relevant objects.",
"Let me know what you see in the picture and provide the bounding boxes for the objects you discuss.",
"Scrutinize the picture and include the coordinates for the items you talk about.",
"Summon the essence from the picture and present the bounding box coordinates for relevant objects.",
"Deconstruct the scene in the picture and include bounding box info for the mentioned nouns.",
"Identify the contents of the picture and provide the coordinates for the objects involved.",
"Make sense of the happenings in the picture and include bounding box coordinates for the objects.",
"What can you tell me about the picture? Please include bounding boxes for any mentioned objects.",
"Deduce the meaning of the picture and provide location details for the discussed items.",
"Can you give me the gist of the picture and provide the bboxes of the described objects?",
"Describe what is taking place in the picture, and include the bboxes of the involved items.",
"Please narrate the story in the picture, and provide the bounding box coordinates for the included objects.",
"Scrutinize the contents of the photo and include the location details of the items you talk about.",
"Analyze what's happening within the photo and provide bounding box info for the referenced objects.",
"Relate the situation in the photo and include the location details of the items you discuss.",
"Probe into the photo and provide the boundaries for the included objects.",
"Gather the meaning of the photo and provide location info for the mentioned nouns.",
"Resolve the context of the picture and supply the bounding box details for the objects you discuss.",
"Bring clarity to the situation in the photo and provide the bounding box for the relevant objects.",
"Unfold the story of the photo and include the bounding box coordinates for the included nouns.",
"Speaking on the photo, what do you see? Don't forget to include bounding boxes for mentioned objects.",
"Inform me about the particulars in the photo and provide the bounding box info for the discussed items.",
"Give me the lowdown on the photo, and include bounding boxes for the objects you discuss.",
"Share with me the details of the photo and provide the bounding boxes for the nouns mentioned.",
"Provide a glimpse into the happenings of the photo and include bounding boxes for the involved objects.",
"Decode the events occurring in the photo and provide the location details for the mentioned items.",
"Delineate the elements of the photo and include the bounding box for each object discussed.",
"Explain to me the context of the photo and provide bounding box details for any discussed objects.",
"Describe the subjects within the photo and include bounding box coordinates for the mentioned objects.",
"Break down the narrative of the picture and include the boundaries for any related items.",
"Tell me about the image and provide me with the bboxes of any mentioned objects.",
"What's the story in the image? Please include bounding boxes for any objects discussed.",
"Elucidate the context of the image and provide bounding box details for the objects you mention.",
"Annotate the image with the bounding box coordinates of the objects you discuss during your description.",
"Examine the image carefully and point out the objects along with their respective bounding boxes.",
"Dig into the scene on the image and provide the bounding box info for the mentioned items.",
"Study the photo and cite bounding box coordinates for the subjects you mention.",
"Inspect the image and give me the coordinates of the bounding box for each mentioned object.",
"Dive into the details of the picture, and include the bounding boxes for any referenced nouns.",
"Analyze the photo and provide the boxes of the objects involved.",
"Take a look at the image and give me the location details for any mentioned items.",
"Go through the scene, describing its content, and provide bounding boxes for the mentioned nouns.",
"Evaluate the scene and include the boxes of the items you reference.",
"Share your perspective of the scene and give the bounding box for each object you discuss.",
"Get into the specifics of the picture and provide the boxes of the mentioned items.",
"Quote the happenings unfolding in the frame and provide the bounding box coordinates of related objects.",
"Shed light on the events in the frame and include the location details of the mentioned items.",
"Bring out the description of the frame and provide the bounding box for each object you mention.",
"Dissect the scenario in the frame and include the bounding boxes for any referenced objects."
],
'box_qa_True': [
"<question> Let's think step by step.",
"<question> Let's think step by step.",
"<question> Please include the reasoning process.",
"<question> Please include the reasoning process.",
"Using the image as reference, can you answer the following question: <question> Please include the reasoning process.",
"After examining the image, I'd like to know the answer to this question: <question> Please provide an explanation of your answer.",
"<question> Can you give me an answer based on the image, along with the reasoning process?",
"Looking at the image, I need to ask this question '<question>'. Can you answer it and provide the explanation?",
"After checking out the picture, I have a question: <question> Can you give me an answer with reasoning?",
"Please have a look at the image and tell me the answer to my question: <question> Don't forget to provide the reasoning.",
"I want to know the answer to this question: <question> Please refer to the image and give an explanation as well.",
"Help me understand the answer to the following question based on the image: <question> Remember to explain the reasoning.",
"Consider the image and answer my question: <question> Be sure to offer reasoning for the answer.",
"Regarding the image, can you tell me the answer to the question '<question>' and explain your thought process?",
"Here's an image I need assistance with. What's the answer to the following question: <question> Please provide reasoning.",
"Can you deduce the answer to question '<question>' after examining the image, along with the reasoning process?",
"Having a look at image, can you tell me the answer to my question '<question>' and the logic leading to it?",
"Investigate the image and provide me with the answer to this question: <question> Don't forget to reveal your reasoning.",
"In reference to the image, I have a question: <question> Can you respond with your answer and an explanation?",
"If you take a glance at the image, can you give me the answer for my question: <question> and add an explanation?",
"Centered on the image, please unravel my query: <question> and be sure to involve the reasoning process.",
"Can you offer an answer to my following inquiry: <question> Make sure to examine the image and clarify your reasoning.",
"Looking at image, would you provide an answer to the question '<question>'? Kindly include your thought process as well.",
"Upon analyzing the image, please find the answer to my question '<question>' and provide a detailed explanation.",
"Please provide a solution to my question: <question> First, examine the image and then walk me through your reasoning.",
"After inspecting the picture thoroughly, kindly furnish the answer to the query: <question> and provide the reasoning.",
"Carefully observe the image and provide me with a well-reasoned answer to the question '<question>'.",
"Focusing on the image, please offer an answer to my question '<question>' along with the reasoning process.",
"Evaluate the image and let me know your answer regarding this question '<question>'. Include your thinking process as well.",
"Keeping the image in mind, please help me with the following question: <question> and explain the reasoning process.",
"Give your observation on the image and your response to the question '<question>', along with a clear reasoning explanation.",
"Based on the image, kindly address my query: <question> Remember to elucidate the reasoning process.",
"In view of the image, could you please respond to the question '<question>' and provide the reasoning process?",
"Deliberate on the image and enlighten me with an answer to the question '<question>' including the reasoning process.",
"Please share your insights on the image by answering the question '<question>'. Do illustrate your reasoning process.",
"Examine the following image closely and provide the answer to my question: <question> Do include the thinking process.",
"Critique the image and furnish the answer to my question '<question>', along with a thorough reasoning.",
"Please analyze the image and supply an answer to the following query: <question> Ensure to elucidate the justifying process.",
"Scrutinize the image and help me with the answer to this question: <question> and explain your deduction methodology.",
"Please answer the following question '<question>' based on the image, and describe your thought process."
],
'box_qa_False': [
"Please briefly answer: <question>",
"Can you give a concise response to: <question>",
"In relation to the image, provide a short answer for: <question>",
"<question> - I need a succinct reply, please.",
"Could you offer a brief explanation for: <question>",
"After looking at the image, quickly answer: <question>",
"I'm looking for a short response to: <question>",
"Based on the image, can you sum up your answer for: <question>",
"Without going into detail, answer: <question>",
"Briefly, what's your take on: <question>",
"Considering the image, please keep your answer brief for: <question>",
"Quickly tell me about: <question>",
"I don't need a lengthy explanation, just a quick answer to: <question>",
"Can you keep it brief and answer: <question>",
"<question> - I'm hoping for a brief response.",
"Without delving too deep, please reply to: <question>",
"Just a short answer will do for: <question>",
"Briefly elaborate on: <question>",
"For <question>, please keep your answer concise.",
"Simply put, how would you respond to: <question>",
"I'm in a rush, so a brief answer to <question> would be appreciated.",
"Your quick thoughts on: <question>",
"A concise reply for: <question>, please.",
"In light of the image, briefly explain: <question>",
"No need for details, just answer: <question>",
"Cut to the chase, what's your take on: <question>",
"<question> - A short explanation, if you will.",
"Trim the details, I just need an answer for: <question>",
"Quick and concise, please answer: <question>",
"For <question>, a succinct response would be great."
]
}
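For reference, here is a minimal sketch of how these templates might be consumed. It is not part of the commit: the placeholders <expr>, <objs> and <question> are the ones used above, but the helper name and call site are hypothetical.

import random

def fill_shikra_prompt(task, **slots):
    # Pick one template for the given task and substitute its placeholders,
    # e.g. task='caption2box' with slots={'expr': 'the red car'}.
    prompt = random.choice(shikra_template[task])
    for key, value in slots.items():
        prompt = prompt.replace(f"<{key}>", value)
    return prompt

# fill_shikra_prompt('caption2box', expr='the red car') could return, for instance,
# "Where is the red car in the image?"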
question_en = ["<img><Image></img> {} A:",
"<img><Image></img> {} Answer:",
"<img><Image></img> {} The answer is:",
"<img><Image></img> {}",
"<img><Image></img> {}",
"<img><Image></img> Q: {} A:",
"<img><Image></img> Question: {} Answer:",
]
question_cn = ["<img><Image></img> {} 答:",
"<img><Image></img> {} 答案是:",
"<img><Image></img> {}",
"<img><Image></img> {}",
"<img><Image></img> 问:{} 答:",
"<img><Image></img> 问:{} ",
"<img><Image></img> Q: {} A:",
"<img><Image></img> {} A:",
]
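As a quick illustration (not part of the commit), the question_en and question_cn entries are ordinary Python format strings: the <img><Image></img> span marks where the image tokens are spliced in, and {} receives the user query. The helper below is hypothetical.

def build_question(query, template=question_en[6]):
    # question_en[6] is "<img><Image></img> Question: {} Answer:"
    return template.format(query)

# build_question("What color is the car?")
# -> "<img><Image></img> Question: What color is the car? Answer:"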

34
utils/vision.py Normal file
View File

@ -0,0 +1,34 @@
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
import torch
from functools import partial


class BlipImageEvalProcessor:
    def __init__(self, image_size=384, mean=None, std=None):
        super().__init__()
        if mean is None:
            mean = (0.48145466, 0.4578275, 0.40821073)
        if std is None:
            std = (0.26862954, 0.26130258, 0.27577711)
        self.normalize = transforms.Normalize(mean, std)
        self.transform = transforms.Compose(
            [
                transforms.Resize(
                    (image_size, image_size), interpolation=InterpolationMode.BICUBIC
                ),
                transforms.ToTensor(),
                self.normalize,
            ]
        )

    def __call__(self, item):
        return self.transform(item)


def blip2_image_processor_func_with_inputs(image_processor, image):
    # Wrap the processed image together with 1x1 placeholder text inputs.
    return {'image': image_processor(image).unsqueeze(0), 'input_ids': torch.zeros(1, 1, dtype=torch.long), 'position_ids': None, 'attention_mask': torch.ones(1, 1, dtype=torch.long)}


def get_image_processor(image_size):
    return partial(blip2_image_processor_func_with_inputs, BlipImageEvalProcessor(image_size))
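A brief usage sketch, not part of the file above: the example path and the 384 value are assumptions (web_demo.py instead derives the size from model_args.eva_args["image_size"][0]).

from PIL import Image

image_processor = get_image_processor(384)
inputs = image_processor(Image.open("example.jpg").convert("RGB"))
# inputs is a dict: 'image' has shape [1, 3, 384, 384] after resize + normalization,
# 'input_ids' and 'attention_mask' are 1x1 placeholders, and 'position_ids' is None.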

222
web_demo.py Normal file
View File

@ -0,0 +1,222 @@
import gradio as gr
import os, sys
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from PIL import Image
import base64
import json
import requests
import hashlib
import torch
import time
import re
import argparse
from sat.model.mixins import CachedAutoregressiveMixin
from sat.mpu import get_model_parallel_world_size
from utils.parser import parse_response
from utils.chat import chat
from models.cogvlm_model import CogVLMModel
from utils.language import llama2_tokenizer, llama2_text_processor_inference
from utils.vision import get_image_processor
DESCRIPTION = '''<h2 style='text-align: center'> <a href="https://github.com/THUDM/CogVLM">CogVLM-17B</a> </h2>'''
NOTES = 'This app is adapted from <a href="https://github.com/THUDM/CogVLM">https://github.com/THUDM/CogVLM</a>. We recommend checking out the repo if you want to see the details of our model.'
MAINTENANCE_NOTICE1 = 'Hint 1: If the app reports "Something went wrong, connection error out", please turn off your proxy and retry.<br>Hint 2: If you upload a large image (e.g., 10MB), it may take some time to upload and process. Please be patient.'
GROUNDING_NOTICE = 'Hint: When you check "Grounding", please use the <a href="https://github.com/THUDM/CogVLM/blob/main/utils/template.py#L344">corresponding prompt</a> or the examples below.'
default_chatbox = [("", "Hi, What do you want to know about this image?")]
model = image_processor = text_processor_infer = None
is_grounding = False
def process_image_without_resize(image_prompt):
    image = Image.open(image_prompt)
    # print(f"height:{image.height}, width:{image.width}")
    timestamp = int(time.time())
    file_ext = os.path.splitext(image_prompt)[1]
    filename_grounding = f"examples/{timestamp}_grounding{file_ext}"
    return image, filename_grounding
def load_model(args):
    model, model_args = CogVLMModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            deepspeed=None,
            local_rank=0,
            rank=0,
            world_size=world_size,
            model_parallel_size=world_size,
            mode='inference',
            fp16=args.fp16,
            bf16=args.bf16,
            skip_init=True,
            use_gpu_initialization=True,
            device='cuda'),
        overwrite_args={'model_parallel_size': world_size} if world_size != 1 else {}
    )
    model = model.eval()
    assert world_size == get_model_parallel_world_size(), "world size must equal model parallel size for web_demo!"
    tokenizer = llama2_tokenizer(args.local_tokenizer, signal_type=args.version)
    image_processor = get_image_processor(model_args.eva_args["image_size"][0])
    model.add_mixin('auto-regressive', CachedAutoregressiveMixin())
    text_processor_infer = llama2_text_processor_inference(tokenizer, args.max_length, model.image_length)
    return model, image_processor, text_processor_infer
def post(
    input_text,
    temperature,
    top_p,
    top_k,
    image_prompt,
    result_previous,
    hidden_image,
):
    result_text = [(ele[0], ele[1]) for ele in result_previous]
    for i in range(len(result_text)-1, -1, -1):
        if result_text[i][0] == "" or result_text[i][0] is None:
            del result_text[i]
    print(f"history {result_text}")

    global model, image_processor, text_processor_infer, is_grounding

    try:
        with torch.no_grad():
            pil_img, image_path_grounding = process_image_without_resize(image_prompt)
            response, _, cache_image = chat(
                image_path="",
                model=model,
                text_processor=text_processor_infer,
                img_processor=image_processor,
                query=input_text,
                history=result_text,
                image=pil_img,
                max_length=2048,
                top_p=top_p,
                temperature=temperature,
                top_k=top_k,
                invalid_slices=text_processor_infer.invalid_slices if hasattr(text_processor_infer, "invalid_slices") else [],
                no_prompt=False
            )
    except Exception as e:
        print("error message", e)
        result_text.append((input_text, 'Timeout! Please wait a few minutes and retry.'))
        return "", result_text, hidden_image

    answer = response
    if is_grounding:
        parse_response(pil_img, answer, image_path_grounding)
        new_answer = answer.replace(input_text, "")
        result_text.append((input_text, new_answer))
        result_text.append((None, (image_path_grounding,)))
    else:
        result_text.append((input_text, answer))
    print(result_text)
    print('finished')
    return "", result_text, hidden_image
def clear_fn(value):
    return "", default_chatbox, None

def clear_fn2(value):
    return default_chatbox
def main(args):
    global model, image_processor, text_processor_infer, is_grounding
    model, image_processor, text_processor_infer = load_model(args)
    is_grounding = 'grounding' in args.from_pretrained

    gr.close_all()
    examples = []
    example_ids = list(range(3)) if not is_grounding else list(range(3, 6, 1))
    with open("./examples/example_inputs.jsonl") as f:
        for i, line in enumerate(f):
            if i not in example_ids: continue
            data = json.loads(line)
            examples.append(data)

    with gr.Blocks(css='style.css') as demo:
        gr.Markdown(DESCRIPTION)
        gr.Markdown(NOTES)

        with gr.Row():
            with gr.Column(scale=4.5):
                with gr.Group():
                    input_text = gr.Textbox(label='Input Text', placeholder='Please enter text prompt below and press ENTER.')
                    with gr.Row():
                        run_button = gr.Button('Generate')
                        clear_button = gr.Button('Clear')
                    image_prompt = gr.Image(type="filepath", label="Image Prompt", value=None)
                with gr.Row():
                    temperature = gr.Slider(maximum=1, value=0.8, minimum=0, label='Temperature')
                    top_p = gr.Slider(maximum=1, value=0.4, minimum=0, label='Top P')
                    top_k = gr.Slider(maximum=100, value=10, minimum=1, step=1, label='Top K')
            with gr.Column(scale=5.5):
                result_text = gr.components.Chatbot(label='Multi-round conversation History', value=[("", "Hi, what do you want to know about this image?")]).style(height=550)
                hidden_image_hash = gr.Textbox(visible=False)

        gr_examples = gr.Examples(examples=[[example["text"], example["image"]] for example in examples],
                                  inputs=[input_text, image_prompt],
                                  label="Example Inputs (Click to insert an example into the input box)",
                                  examples_per_page=6)
        gr.Markdown(MAINTENANCE_NOTICE1)
        print(gr.__version__)
        run_button.click(fn=post, inputs=[input_text, temperature, top_p, top_k, image_prompt, result_text, hidden_image_hash],
                         outputs=[input_text, result_text, hidden_image_hash])
        input_text.submit(fn=post, inputs=[input_text, temperature, top_p, top_k, image_prompt, result_text, hidden_image_hash],
                          outputs=[input_text, result_text, hidden_image_hash])
        clear_button.click(fn=clear_fn, inputs=clear_button, outputs=[input_text, result_text, image_prompt])
        image_prompt.upload(fn=clear_fn2, inputs=clear_button, outputs=[result_text])
        image_prompt.clear(fn=clear_fn2, inputs=clear_button, outputs=[result_text])

    print(gr.__version__)
    demo.queue(concurrency_count=10)
    demo.launch()
if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_length", type=int, default=2048, help='max length of the total sequence')
    parser.add_argument("--top_p", type=float, default=0.4, help='top p for nucleus sampling')
    parser.add_argument("--top_k", type=int, default=1, help='top k for top k sampling')
    parser.add_argument("--temperature", type=float, default=.8, help='temperature for sampling')
    parser.add_argument("--english", action='store_true', help='only output English')
    parser.add_argument("--version", type=str, default="chat", help='version to interact with')
    parser.add_argument("--from_pretrained", type=str, default="cogvlm-chat", help='pretrained ckpt')
    parser.add_argument("--local_tokenizer", type=str, default="lmsys/vicuna-7b-v1.5", help='tokenizer path')
    parser.add_argument("--no_prompt", action='store_true', help='Sometimes there is no prompt in stage 1')
    parser.add_argument("--fp16", action="store_true")
    parser.add_argument("--bf16", action="store_true")
    args = parser.parse_args()
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    parser = CogVLMModel.add_model_specific_args(parser)
    args = parser.parse_args()
    main(args)
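Usage note (illustrative, not part of the commit): with the defaults above, a single-GPU launch can be as simple as "python web_demo.py --from_pretrained cogvlm-chat --version chat --bf16". RANK and WORLD_SIZE fall back to 0 and 1 when no distributed launcher sets them, and pointing --from_pretrained at a checkpoint whose name contains "grounding" switches the demo into grounding mode via the is_grounding flag.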