Ziya-Visual多模態(tài)大模型基于姜子牙通用大模型V1訓(xùn)練,具有視覺問答和對話能力。今年3月份OpenAI發(fā)布具有識圖能力的多模態(tài)大模型GPT-4,遺憾的是,時至今日絕大部分用戶也都還沒有拿到GPT-4輸入圖片的權(quán)限,Ziya-Visual參考了Mini-GPT4、LLaVA等優(yōu)秀的開源實(shí)現(xiàn),補(bǔ)齊了Ziya的識圖能力,使中文用戶群體可以體驗(yàn)到結(jié)合視覺和語言兩大模態(tài)的大模型的卓越能力。
The Ziya-Visual multimodal large model is trained on top of Ziya-LLaMA-13B-v1 and has visual question answering and dialogue capabilities. In March this year, OpenAI released GPT-4, a multimodal large model with image understanding capabilities; unfortunately, to date the vast majority of users still have no access to GPT-4’s image input. Ziya-Visual draws on excellent open-source implementations such as Mini-GPT4 and LLaVA to add image understanding to Ziya, so that Chinese-speaking users can experience the capabilities of a large model that combines the two modalities of vision and language.
pip install torch==1.12.1 tokenizers==0.13.3 git+https://github.com/huggingface/transformers
| 需求 Demand | 任務(wù) Task | 系列 Series | 模型 Model | 參數(shù) Parameter | 額外 Extra |
| --- | --- | --- | --- | --- | --- |
| 多模態(tài) Multi-Modal | 通用 General | 姜子牙-多模態(tài) Ziya-Visual | BLIP2 LLaMA | 14B | English&Chinese |
這個例子展示了模型的識圖能力、知識能力和創(chuàng)作能力。首先第一個問題中,模型識別出了圖片中是電影《泰坦尼克號》的場景,并給出電影導(dǎo)演、發(fā)布時間、獎項(xiàng)成就等信息;第二個問題,模型根據(jù)用戶的需求創(chuàng)作了一首現(xiàn)代愛情詩。
This example demonstrates the model’s image understanding, knowledge, and creative writing abilities. In the first question, the model identifies the picture as a scene from the movie Titanic and gives information about the film’s director, release date and awards; in the second question, the model writes a modern love poem according to the user’s request.
這個例子展示了Ziya-Visual對傳統(tǒng)中國文化的識別和理解能力,模型識別出了中國畫中的信息,在得到提示《清明上河圖》之后,也給出了畫家張擇端和北宋的歷史背景。
This example demonstrates Ziya-Visual’s ability to recognise and understand traditional Chinese culture. The model identifies the content of the Chinese painting and, after being given the hint ‘Qingming Shanghe Tu’, also provides the historical background of the painter Zhang Zeduan and the Northern Song Dynasty.
如果輸入多張圖片進(jìn)行問答呢?Ziya-Visual也是勝任的,在這個例子中,Ziya-Visual展現(xiàn)了強(qiáng)大的多圖和多輪交互能力,根據(jù)用戶給的三張圖片,敘述了一個女士在城市夜景中邂逅一對母子貓咪,并與之交談、分別的小故事。
What if multiple images are supplied for question answering? Ziya-Visual is also up to the task. In this example, Ziya-Visual demonstrates strong multi-image and multi-turn interaction capabilities: based on the three images given by the user, it narrates a short story of a lady who encounters a mother cat and her kitten in a city night scene, talks with them, and then parts ways.
在中文視覺問答模型訓(xùn)練上,最大的問題就是數(shù)據(jù)量少,數(shù)據(jù)質(zhì)量差。首先,封神榜團(tuán)隊(duì)在開源數(shù)據(jù)的基礎(chǔ)上清洗、積累了一部分高質(zhì)量數(shù)據(jù);其次,我們通過翻譯api得到了一部分英-中雙語數(shù)據(jù)集,我們發(fā)現(xiàn)雖然翻譯數(shù)據(jù)集會有“翻譯腔”等問題,但是借助Ziya-v1的雙語能力,最終的語言輸出是能夠緩解這一問題的;最后,團(tuán)隊(duì)結(jié)合BLIP,Grounded SAM等先進(jìn)視覺技術(shù),抽取圖像描述的粗粒度信息和圖像中物體、方位等細(xì)粒度信息,轉(zhuǎn)化為語言描述形式,構(gòu)造了一部分高質(zhì)量數(shù)據(jù)。最終,Ziya-Visual構(gòu)造了約2千萬的優(yōu)質(zhì)數(shù)據(jù)進(jìn)行訓(xùn)練。和Mini-GPT4、LLaVA一樣,Ziya-Visual-v1主要是一個以數(shù)據(jù)為中心的工作,因此數(shù)據(jù)的數(shù)量和質(zhì)量非常重要。
In training a Chinese visual question answering model, the biggest problems are the small amount of data and its poor quality. Firstly, the Fengshenbang team cleaned and accumulated some high-quality data on top of open-source datasets. Secondly, we obtained an English-Chinese bilingual dataset through a translation API; we found that although translated data suffers from problems such as “translationese”, the bilingual capability of Ziya-v1 alleviates this issue in the final language output. Finally, the team combined advanced vision techniques such as BLIP and Grounded SAM to extract coarse-grained information from image captions and fine-grained information such as objects and their positions in images, converted it into language descriptions, and thereby constructed another portion of high-quality data. In total, Ziya-Visual assembled about 20 million high-quality samples for training. Like Mini-GPT4 and LLaVA, Ziya-Visual-v1 is primarily a data-centric effort, so the quantity and quality of the data are crucial.
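As an illustration of that last step, the sketch below shows how a coarse-grained caption plus Grounded-SAM-style detections might be turned into a single language description. The helper name and the detection format are assumptions made for illustration, not the team’s actual data pipeline.

# A minimal sketch (assumed names and formats, not the actual pipeline):
# turn an image caption plus object detections into a textual description.
def detections_to_text(caption, detections, image_width):
    # detections: list of dicts like {"label": "cat", "box": [x0, y0, x1, y1]}
    parts = [f"The image shows: {caption}."]
    for det in detections:
        x0, _, x1, _ = det["box"]
        center_x = (x0 + x1) / 2
        side = ("left" if center_x < image_width / 3
                else "right" if center_x > 2 * image_width / 3
                else "middle")
        parts.append(f"There is a {det['label']} in the {side} of the image.")
    return " ".join(parts)

# Example usage with made-up detections
desc = detections_to_text(
    caption="a woman walking a dog on a city street at night",
    detections=[{"label": "dog", "box": [40, 200, 180, 360]},
                {"label": "woman", "box": [300, 80, 460, 400]}],
    image_width=640,
)
print(desc)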
為了更好地結(jié)合視覺預(yù)訓(xùn)練模型和LLM的能力,和Mini-GPT4和LLaVA工作一樣,Ziya-Visual-v1的訓(xùn)練遵循了BLIP2提出的經(jīng)典網(wǎng)絡(luò)結(jié)構(gòu)和兩階段訓(xùn)練的范式。而且我們在實(shí)驗(yàn)過程中發(fā)現(xiàn),是否訓(xùn)練Vision Encoder的參數(shù)對于最終的生成效果影響很小。因此,在整體模型上,視覺處理部分我們繼承了BLIP2的ViT + QFormer參數(shù),LLM部分繼承了Ziya-v1的權(quán)重,這兩個部分權(quán)重都是凍結(jié)不參與訓(xùn)練的。我們主要訓(xùn)練的部分是視覺映射層(Projection Layer)。第一階段,我們使用圖像Caption數(shù)據(jù)訓(xùn)練映射層,使Vision Encoder抽取出來的圖像特征能夠和LLM中的文本特征空間進(jìn)行對齊;第二階段,我們使用圖像問答數(shù)據(jù)集,進(jìn)一步微調(diào)Ziya-Visual的視覺-語言能力。
In order to better combine the capabilities of the vision pre-training model and the LLM, as in the Mini-GPT4 and LLaVA work, the training of Ziya-Visual-v1 followed the classic network structure and two-stage training paradigm proposed by BLIP2. Moreover, we found during our experiments that whether or not the Vision Encoder parameters are trained has very little impact on the final generation quality. Therefore, in the overall model, the vision processing part inherits the ViT + QFormer parameters from BLIP2 and the LLM part inherits the Ziya-v1 weights; both parts are frozen and do not participate in training. The component we mainly train is the visual projection layer. In the first stage, we use image caption data to train the projection layer so that the image features extracted by the Vision Encoder are aligned with the text feature space of the LLM; in the second stage, we use image question answering datasets to further fine-tune the vision-language capabilities of Ziya-Visual.
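A minimal sketch of this freezing scheme is shown below, assuming the model follows the usual BLIP2-style attribute layout (vision_model, qformer, language_projection, language_model); these attribute names are assumptions and may differ from the released code.

# Freeze ViT + QFormer and the LLM; train only the projection layer.
def freeze_all_but_projection(model):
    for module in (model.vision_model, model.qformer, model.language_model):
        for p in module.parameters():
            p.requires_grad = False   # vision encoder, QFormer and LLM stay frozen
    for p in model.language_projection.parameters():
        p.requires_grad = True        # only the projection layer receives gradients

# The optimizer then only sees the trainable projection parameters, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)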
首先是VQA效果上的評價,可以看到Ziya-Visual模型在GQA的中文和英文測試集上大部分指標(biāo)均高于VisualGLM,而在BLEU-4上分?jǐn)?shù)較低,這表明Ziya-Visual在大多數(shù)開放域的多模態(tài)問答上生成的答案更為泛化和準(zhǔn)確,但在一些發(fā)散性的問題上生成答案具有自主性。對于mPLUG-Owl模型,英文采用了 mPLUG-Owl 7B Instruction tuning (LoRA) 版本,中文則采用了多語言的mPLUG-Owl 7B (Multilingual) Instruction tuning (LoRA) 版本。因此在英文測評分?jǐn)?shù)上高于雙語版本的Ziya-Visual,另一方面,由于Ziya-Visual采用的LLaMA具備更優(yōu)秀的多語言理解和生成能力,并且在Ziya-Visual二階段訓(xùn)練時也通過翻譯工具引入了多語言多模態(tài)訓(xùn)練語料,因此在中文數(shù)據(jù)的測評結(jié)果上更有優(yōu)勢。
Firstly, for the VQA evaluation, the Ziya-Visual model outperforms VisualGLM on most metrics on both the Chinese and English GQA test sets, while scoring lower on BLEU-4. This indicates that Ziya-Visual generates more generalised and accurate answers on most open-domain multimodal questions, but answers more divergent questions with greater autonomy. For the mPLUG-Owl model, the mPLUG-Owl 7B Instruction tuning (LoRA) version was used for English and the multilingual mPLUG-Owl 7B (Multilingual) Instruction tuning (LoRA) version was used for Chinese, so its English evaluation scores are higher than those of the bilingual Ziya-Visual. On the other hand, the LLaMA backbone used by Ziya-Visual has better multilingual comprehension and generation capabilities, and a multilingual multimodal training corpus was introduced via translation tools in the second training stage, so Ziya-Visual has the advantage on the Chinese evaluation.
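For reference, BLEU-4 between generated and reference answers can be computed with an off-the-shelf toolkit. The snippet below is a generic sketch using sacrebleu with made-up data, not the exact evaluation script behind these numbers.

import sacrebleu

# Illustrative generated answers and single references (made-up data).
hypotheses = ["一只貓坐在椅子上", "這是一款多人在線競技游戲"]
references = [["一只貓正坐在椅子上", "這是一款多人在線游戲"]]  # one reference stream

# sacrebleu computes BLEU-4 by default; "zh" tokenisation handles Chinese text.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(f"BLEU-4: {bleu.score:.2f}")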
其次我們使用LLaVA的做法利用GPT-4打分評價,該方法利用coco數(shù)據(jù)集中的caption和物體檢測框信息輸入給GPT-4;然后將Ziya-Visual和VisualGLM的圖像問答的回答再輸入到GPT-4,要求GPT-4從回答的有用性、相關(guān)性、準(zhǔn)確性、細(xì)節(jié)程度進(jìn)行評分(1-10分);LLaVA中將對話任務(wù)劃分為conv(簡單對話),detail(細(xì)節(jié)對話)和complex(復(fù)雜推理),all是三種對話任務(wù)的綜合平均分。最終評價結(jié)果如下,可以看到在簡單對話和細(xì)節(jié)對話中,Ziya-Visual優(yōu)于VisualGLM,在復(fù)雜推理中略輸于VisualGLM,最終總體平均結(jié)果優(yōu)于VisualGLM。在對比mPLUG-Owl中我們得到的結(jié)論是類似的,Ziya-Visual總體平均結(jié)果優(yōu)于mPLUG-Owl。
Secondly, we followed the LLaVA approach of using GPT-4 for scoring: the captions and object bounding box information from the COCO dataset are given to GPT-4, then the image question answering responses of Ziya-Visual and VisualGLM are also given to GPT-4, which is asked to rate the responses in terms of helpfulness, relevance, accuracy, and level of detail (on a scale of 1-10). LLaVA divides the dialogue tasks into conv (simple dialogue), detail (detailed description) and complex (complex reasoning), and all is the combined average score over the three tasks. The final evaluation results are as follows: Ziya-Visual outperforms VisualGLM in simple and detailed dialogue, is slightly behind VisualGLM in complex reasoning, and is better than VisualGLM in the overall average.
In comparison with mPLUG-Owl we reach a similar conclusion, with Ziya-Visual outperforming mPLUG-Owl in the overall average.
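The sketch below illustrates the general shape of this scoring protocol; the prompt wording, the review data structure and the per-category averaging are illustrative assumptions, not the exact LLaVA evaluation code.

from statistics import mean

def build_review_prompt(caption, boxes, question, answer_a, answer_b):
    # The caption + box context stands in for the image, as in the LLaVA protocol.
    return (
        f"[Context]\nCaption: {caption}\nBoxes: {boxes}\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant 1]\n{answer_a}\n\n[Assistant 2]\n{answer_b}\n\n"
        "Rate the helpfulness, relevance, accuracy and level of detail of each "
        "assistant on a scale of 1-10. Output the two scores on the first line."
    )

def average_by_category(reviews):
    # reviews: list of {"category": "conv" | "detail" | "complex", "scores": (a, b)}
    result = {}
    for cat in ("conv", "detail", "complex"):
        pairs = [r["scores"] for r in reviews if r["category"] == cat]
        result[cat] = (mean(p[0] for p in pairs), mean(p[1] for p in pairs))
    result["all"] = (mean(r["scores"][0] for r in reviews),
                     mean(r["scores"][1] for r in reviews))
    return result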
首先加載Ziya-Visual模型:需要注意的是Visual-Ziya的模型倉庫只包含視覺模型部分的參數(shù),Ziya LLM部分的參數(shù)通過Ziya-LLaMA-13B-v1獲得。得到這兩部分的模型參數(shù)后,我們加載模型:
First load the Ziya-Visual model. Note that the Ziya-Visual model repository contains only the parameters of the visual part; the parameters of the Ziya LLM part are obtained from Ziya-LLaMA-13B-v1. Once we have both sets of parameters, we load the model:
from transformers import LlamaForCausalLM, LlamaTokenizer, BlipImageProcessor
from modeling_ziya_blip2 import ZiyaBlip2ForCausalLM
from PIL import Image
# model path of IDEA-CCNL/Ziya-LLaMA-13B-v1
LM_MODEL_PATH="local path of model Ziya-LLaMA-13B-v1"
lm_model = LlamaForCausalLM.from_pretrained(LM_MODEL_PATH)
tokenizer = LlamaTokenizer.from_pretrained(LM_MODEL_PATH)
# visual model
OPENAI_CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]
# demo.py is in the project path, so we can use local path ".". Otherwise you should use "IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1"
model = ZiyaBlip2ForCausalLM.from_pretrained(".", language_model=lm_model)
image_size = model.config.vision_config.image_size
image_processor = BlipImageProcessor(
size={"height": image_size, "width": image_size},
image_mean=OPENAI_CLIP_MEAN,
image_std=OPENAI_CLIP_STD,
)
model.cuda() # if you use on cpu, comment this line
模型加載完畢后,我們就可以愉快地使用Ziya-Visual模型了:
Once the model has been loaded, we can happily use the Ziya-Visual model:
generate_config = {
"max_new_tokens": 128,
"top_p": 0.1,
"temperature": 0.7
}
output = model.chat(
tokenizer=tokenizer,
pixel_values=image_processor(Image.open("wzry.jpg"), return_tensors="pt").pixel_values.to(model.device),
query="這是什么游戲",
previous_querys=[],
previous_outputs=[],
**generate_config,
)
print(output)
# 這是一款名為《王者榮耀》的多人在線競技游戲。在游戲中,玩家扮演不同的角色,并與其他玩家進(jìn)行戰(zhàn)斗。游戲中的人物和環(huán)境都是虛擬的,但它們看起來非常逼真。玩家需要使用各種技能和策略來擊敗對手,并獲得勝利。
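For multi-turn question answering, the previous round can be passed back in through previous_querys and previous_outputs (the same parameters used in the call above); the follow-up question here is only an illustration:

# Continue the conversation: feed the first round back in as history.
followup = model.chat(
    tokenizer=tokenizer,
    pixel_values=image_processor(Image.open("wzry.jpg"), return_tensors="pt").pixel_values.to(model.device),
    query="游戲里有哪些角色?",
    previous_querys=["這是什么游戲"],
    previous_outputs=[output],
    **generate_config,
)
print(followup)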
如果您在您的工作中使用了我們的模型,可以引用我們的論文:
If you use our model in your work, please cite our paper:
@article{fengshenbang,
author = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
title = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
journal = {CoRR},
volume = {abs/2209.02970},
year = {2022}
}
歡迎引用我們的網(wǎng)站:
You can also cite our website:
@misc{Fengshenbang-LM,
title={Fengshenbang-LM},
author={IDEA-CCNL},
year={2021},
howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}