Qwen-7B | Qwen-7B-Chat | Demo | Report
通義千問-7B(Qwen-7B) 是阿里云研發(fā)的通義千問大模型系列的70億參數(shù)規(guī)模的模型。Qwen-7B是基于Transformer的大語言模型, 在超大規(guī)模的預(yù)訓(xùn)練數(shù)據(jù)上進(jìn)行訓(xùn)練得到。預(yù)訓(xùn)練數(shù)據(jù)類型多樣,覆蓋廣泛,包括大量網(wǎng)絡(luò)文本、專業(yè)書籍、代碼等。同時(shí),在Qwen-7B的基礎(chǔ)上,我們使用對(duì)齊機(jī)制打造了基于大語言模型的AI助手Qwen-7B-Chat。本倉(cāng)庫(kù)為Qwen-7B-Chat的倉(cāng)庫(kù)。
如果您想了解更多關(guān)于通義千問-7B開源模型的細(xì)節(jié),我們建議您參閱Github代碼庫(kù)。
Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. Additionally, based on the pretrained Qwen-7B, we release Qwen-7B-Chat, a large-model-based AI assistant trained with alignment techniques. This repository is the one for Qwen-7B-Chat.
For more details about the open-source Qwen-7B model, please refer to the GitHub code repository.
運(yùn)行Qwen-7B-Chat,請(qǐng)確保機(jī)器環(huán)境pytorch版本不低于1.12,再執(zhí)行以下pip命令安裝依賴庫(kù)
To run Qwen-7B-Chat, please make sure that your PyTorch version is not lower than 1.12, and then execute the following pip commands to install the required dependencies.
pip install modelscope
pip install transformers_stream_generator
另外,推薦安裝flash-attention
庫(kù),以實(shí)現(xiàn)更高的效率和更低的顯存占用。
In addition, it is recommended to install the flash-attention
library for higher efficiency and lower GPU memory usage.
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
pip install csrc/layer_norm
pip install csrc/rotary
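If you want to confirm that the optional kernels actually built, a minimal sketch is to try importing them. The module names below are what the csrc/layer_norm and csrc/rotary extensions are expected to install and should be treated as assumptions; Qwen-7B-Chat still runs without flash-attention, just with less of the speed/memory benefit.

```python
# Sketch: verify the optional flash-attention kernels are importable.
import importlib

for mod in ("flash_attn", "dropout_layer_norm", "rotary_emb"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: OK")
    except ImportError as err:
        print(f"{mod}: not available ({err})")
```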
下面我們展示了一個(gè)使用Qwen-7B-Chat模型,進(jìn)行多輪對(duì)話交互的樣例(非流式):
We show an example of multi-turn interaction with Qwen-7B-Chat in the following code (non-streaming):
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)  # you can specify different generation lengths, top_p and other related hyperparameters
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
response, history = model.chat(tokenizer, "浙江的省會(huì)在哪里?", history=history)
print(response)
response, history = model.chat(tokenizer, "它有什么好玩的景點(diǎn)", history=history)
print(response)
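As the comment above suggests, decoding hyperparameters can be adjusted on `model.generation_config` before calling `chat()`. A minimal sketch follows; the attribute names are standard transformers GenerationConfig fields, and whether the custom `chat()` honours every one of them is an assumption.

```python
# Sketch: tweak decoding hyperparameters, then chat as usual.
model.generation_config.top_p = 0.8            # nucleus sampling threshold
model.generation_config.max_new_tokens = 512   # cap on generated length

response, _ = model.chat(tokenizer, "Introduce Hangzhou in one sentence.", history=None)
print(response)
```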
下面我們展示了一個(gè)使用Qwen-7B-Chat模型,進(jìn)行多輪對(duì)話交互的樣例(流式):
We show an example of multi-turn, streaming interaction with Qwen-7B-Chat in the following code:
import os
import platform
from modelscope import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
model_id = 'qwen/Qwen-7B-Chat'
revision = 'v1.0.5'
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision, trust_remote_code=True)
# use fp16
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", revision=revision,
trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained(model_id, trust_remote_code=True)  # you can specify different generation lengths, top_p and other related hyperparameters
stop_stream = False
def clear_screen():
    if platform.system() == "Windows":
        os.system("cls")
    else:
        os.system("clear")


def print_history(history):
    for pair in history:
        print(f"\nUser:{pair[0]}\nQwen-7B:{pair[1]}")


def main():
    history, response = [], ''
    global stop_stream
    clear_screen()
    print("歡迎使用 Qwen-7B 模型,輸入內(nèi)容即可進(jìn)行對(duì)話,clear 清空對(duì)話歷史,stop 終止程序")
    while True:
        query = input("\nUser:")
        if query.strip() == "stop":
            break
        if query.strip() == "clear":
            history = []
            clear_screen()
            print("歡迎使用 Qwen-7B 模型,輸入內(nèi)容即可進(jìn)行對(duì)話,clear 清空對(duì)話歷史,stop 終止程序")
            continue
        for response in model.chat_stream(tokenizer, query, history=history):
            if stop_stream:
                stop_stream = False
                break
            else:
                clear_screen()
                print_history(history)
                print(f"\nUser: {query}")
                print("\nQwen-7B:", end="")
                print(response)
        history.append((query, response))


if __name__ == "__main__":
    main()
關(guān)于更多的使用說明,請(qǐng)參考我們的Github repo獲取更多信息。
For more usage instructions, please refer to our GitHub repo.
與Qwen-7B預(yù)訓(xùn)練模型相同,Qwen-7B-Chat模型規(guī)?;厩闆r如下所示
Qwen-7B-Chat uses the same model size settings as the pretrained Qwen-7B. The details of the model architecture are listed as follows:
Hyperparameter | Value |
---|---|
n_layers | 32 |
n_heads | 32 |
d_model | 4096 |
vocab size | 151851 |
sequence length | 2048 |
在位置編碼、FFN激活函數(shù)和normalization的實(shí)現(xiàn)方式上,我們也采用了目前最流行的做法,
即RoPE相對(duì)位置編碼、SwiGLU激活函數(shù)、RMSNorm(可選安裝flash-attention加速)。
在分詞器方面,相比目前主流開源模型以中英詞表為主,Qwen-7B-Chat使用了約15萬token大小的詞表。
該詞表在GPT-4使用的BPE詞表cl100k_base
基礎(chǔ)上,對(duì)中文、多語言進(jìn)行了優(yōu)化,在對(duì)中、英、代碼數(shù)據(jù)的高效編解碼的基礎(chǔ)上,對(duì)部分多語言更加友好,方便用戶在不擴(kuò)展詞表的情況下對(duì)部分語種進(jìn)行能力增強(qiáng)。
詞表對(duì)數(shù)字按單個(gè)數(shù)字位切分。調(diào)用較為高效的tiktoken分詞庫(kù)進(jìn)行分詞。
For position encoding, FFN activation function, and normalization calculation methods, we adopt the prevalent practices, i.e., RoPE relative position encoding, SwiGLU for activation function, and RMSNorm for normalization (optional installation of flash-attention for acceleration).
For tokenization, compared with the current mainstream open-source models that mainly use Chinese/English vocabularies, Qwen-7B-Chat uses a vocabulary of about 150K tokens. Built on the cl100k_base BPE vocabulary used by GPT-4, it is optimized for Chinese and multiple other languages: on top of efficient encoding of Chinese, English, and code data, it is also friendlier to some additional languages, enabling users to enhance the capability for those languages without expanding the vocabulary.
It splits numbers into single digits and uses the efficient tiktoken library for tokenization.
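A small sketch of the behaviour described above, reusing the tokenizer id and revision from the earlier examples: encode a mixed Chinese/English/code string and decode each id back to see that the number is split into single-digit tokens.

```python
from modelscope import AutoTokenizer

# Sketch: inspect how the Qwen tokenizer segments mixed text and digits.
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)

text = "通義千問支持中英文和代碼,例如 print(12345)"
ids = tokenizer.encode(text)
print(len(ids), ids)
# Decoding token by token makes the single-digit split of "12345" visible.
print([tokenizer.decode([i]) for i in ids])
```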
對(duì)于Qwen-7B-Chat模型,我們同樣評(píng)測(cè)了常規(guī)的中文理解(C-Eval)、英文理解(MMLU)、代碼(HumanEval)和數(shù)學(xué)(GSM8K)等權(quán)威任務(wù),同時(shí)包含了長(zhǎng)序列任務(wù)的評(píng)測(cè)結(jié)果。由于Qwen-7B-Chat模型經(jīng)過對(duì)齊后,激發(fā)了較強(qiáng)的外部系統(tǒng)調(diào)用能力,我們還進(jìn)行了工具使用能力方面的評(píng)測(cè)。
提示:由于硬件和框架造成的舍入誤差,復(fù)現(xiàn)結(jié)果如有波動(dòng)屬于正?,F(xiàn)象。
For Qwen-7B-Chat, we also evaluate the model on standard benchmarks covering Chinese understanding (C-Eval), English understanding (MMLU), code (HumanEval), and mathematics (GSM8K), as well as long-context understanding. Since alignment equips Qwen-7B-Chat with a strong ability to call external systems, we also evaluate its tool-usage capability.
Note: Due to rounding errors caused by hardware and frameworks, small differences in reproduced results are normal.
在C-Eval驗(yàn)證集上,我們?cè)u(píng)價(jià)了Qwen-7B-Chat模型的zero-shot準(zhǔn)確率
We demonstrate the zero-shot accuracy of Qwen-7B-Chat on the C-Eval validation set:
Model | Avg. Acc. |
---|---|
LLaMA2-7B-Chat | 31.9 |
LLaMA2-13B-Chat | 40.6 |
Chinese-Alpaca-2-7B | 41.3 |
Chinese-Alpaca-Plus-13B | 43.3 |
Baichuan-13B-Chat | 50.4 |
ChatGLM2-6B-Chat | 50.7 |
InternLM-7B-Chat | 53.2 |
Qwen-7B-Chat | 54.2 |
C-Eval測(cè)試集上,Qwen-7B-Chat模型的zero-shot準(zhǔn)確率結(jié)果如下:
The zero-shot accuracy of Qwen-7B-Chat on the C-Eval test set is provided below:
Model | Avg. | STEM | Social Sciences | Humanities | Others |
---|---|---|---|---|---|
Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
Qwen-7B-Chat | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |
在7B規(guī)模模型上,經(jīng)過人類指令對(duì)齊的Qwen-7B-Chat模型,準(zhǔn)確率在同類相近規(guī)模模型中仍然處于前列。
Compared with other aligned models of comparable size, the human-aligned Qwen-7B-Chat is still among the top performers in C-Eval accuracy.
MMLU評(píng)測(cè)集上,Qwen-7B-Chat模型的zero-shot準(zhǔn)確率如下,效果同樣在同類對(duì)齊模型中同樣表現(xiàn)較優(yōu)。
The zero-shot accuracy of Qwen-7B-Chat on MMLU is provided below.
The performance of Qwen-7B-Chat remains among the best of human-aligned models of comparable size.
Model | Avg. Acc. |
---|---|
ChatGLM2-6B-Chat | 45.5 |
LLaMA2-7B-Chat | 47.0 |
InternLM-7B-Chat | 50.8 |
Baichuan-13B-Chat | 52.1 |
ChatGLM2-12B-Chat | 52.1 |
Qwen-7B-Chat | 53.9 |
Qwen-7B-Chat在HumanEval的zero-shot Pass@1效果如下
The zero-shot Pass@1 of Qwen-7B-Chat on HumanEval is demonstrated below
Model | Pass@1 |
---|---|
LLaMA2-7B-Chat | 12.2 |
InternLM-7B-Chat | 14.0 |
Baichuan-13B-Chat | 16.5 |
LLaMA2-13B-Chat | 18.9 |
Qwen-7B-Chat | 21.3 |
在評(píng)測(cè)數(shù)學(xué)能力的GSM8K上,Qwen-7B-Chat的準(zhǔn)確率結(jié)果如下
The accuracy of Qwen-7B-Chat on GSM8K is shown below
Model | Zero-shot Acc. | 4-shot Acc. |
---|---|---|
ChatGLM2-6B-Chat | - | 28.0 |
LLaMA2-7B-Chat | 20.4 | 28.2 |
LLaMA2-13B-Chat | 29.4 | 36.7 |
InternLM-7B-Chat | 32.6 | 34.5 |
Baichuan-13B-Chat | - | 36.3 |
ChatGLM2-12B-Chat | - | 38.1 |
Qwen-7B-Chat | 41.1 | 43.5 |
通過NTK插值,LogN注意力縮放可以擴(kuò)展Qwen-7B-Chat的上下文長(zhǎng)度。在長(zhǎng)文本摘要數(shù)據(jù)集VCSUM上(文本平均長(zhǎng)度在15K左右),Qwen-7B-Chat的Rouge-L結(jié)果如下:
(若要啟用這些技巧,請(qǐng)將config.json里的use_dynamic_ntk和use_logn_attn設(shè)置為true)
We use NTK-aware interpolation and LogN attention scaling to extend the context length of Qwen-7B-Chat. The Rouge-L results of Qwen-7B-Chat on the long-text summarization dataset VCSUM (the average text length of this dataset is around 15K) are shown below:
(To use these tricks, please set use_dynamic_ntk and use_logn_attn to true in config.json; a short sketch follows the table below.)
Model | VCSUM (zh) |
---|---|
GPT-3.5-Turbo-16k | 16.0 |
LLaMA2-7B-Chat | 0.2 |
InternLM-7B-Chat | 13.0 |
ChatGLM2-6B-Chat | 16.3 |
Qwen-7B-Chat | 16.6 |
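As referenced above, here is a minimal sketch of flipping the two flags in the checkpoint's config.json. The local path below is hypothetical; point it at wherever ModelScope downloaded the model.

```python
import json

# Hypothetical local path to the downloaded checkpoint directory.
config_path = "Qwen-7B-Chat/config.json"

with open(config_path, "r", encoding="utf-8") as f:
    cfg = json.load(f)

cfg["use_dynamic_ntk"] = True  # NTK-aware interpolation for longer contexts
cfg["use_logn_attn"] = True    # LogN attention scaling

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)
```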
千問支持通過 ReAct Prompting 調(diào)用插件/工具/API。ReAct 也是 LangChain 框架采用的主要方式之一。在即將開源的、用于評(píng)估工具使用能力的自建評(píng)測(cè)基準(zhǔn)上,千問的表現(xiàn)如下:
Qwen-7B-Chat supports calling plugins/tools/APIs through ReAct Prompting. ReAct is also one of the main approaches used by the LangChain framework. On our self-built benchmark for evaluating tool-usage capabilities, which will be open-sourced soon, Qwen-7B-Chat performs as follows:
Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
---|---|---|---|
GPT-4 | 95% | 0.90 | 15% |
GPT-3.5 | 85% | 0.88 | 75% |
Qwen-7B-Chat | 99% | 0.89 | 8.5% |
評(píng)測(cè)基準(zhǔn)中出現(xiàn)的插件均沒有出現(xiàn)在千問的訓(xùn)練集中。該基準(zhǔn)評(píng)估了模型在多個(gè)候選插件中選擇正確插件的準(zhǔn)確率、傳入插件的參數(shù)的合理性、以及假陽率。假陽率(False Positive)定義:在處理不該調(diào)用插件的請(qǐng)求時(shí),錯(cuò)誤地調(diào)用了插件。
The plugins that appear in the evaluation set do not appear in the training set of Qwen-7B-Chat. This benchmark evaluates the accuracy of the model in selecting the correct plugin from multiple candidate plugins, the rationality of the parameters passed into the plugin, and the false positive rate. False positive: the model incorrectly invokes a plugin when responding to a query that should not require one.
關(guān)于 ReAct Prompting 的 prompt 怎么寫、怎么使用,請(qǐng)參考 ReAct 樣例說明。使用工具能使模型更好地完成任務(wù)?;谇柕墓ぞ呤褂媚芰?,我們能實(shí)現(xiàn)下圖所展示的效果:
For how to write and use prompts for ReAct Prompting, please refer to the ReAct examples. The use of tools can enable the model to better perform tasks, as shown in the following figures:
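For orientation only, here is a generic ReAct-style scaffold; the exact prompt format Qwen-7B-Chat was tuned on is documented in the ReAct examples of the GitHub repo, and the `web_search` tool below is hypothetical. It assumes the `model` and `tokenizer` loaded earlier.

```python
# Sketch: a generic ReAct-style prompt (Thought / Action / Action Input / Observation loop).
TOOL_DESC = 'web_search: Call this tool to search the web. Parameters: {"query": "the search query"}'

prompt = f"""Answer the following question as best you can. You have access to the following tools:

{TOOL_DESC}

Use the following format:

Question: the input question
Thought: you should always think about what to do
Action: the action to take, should be one of [web_search]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the final answer to the original question

Question: What is the weather like in Hangzhou today?"""

response, _ = model.chat(tokenizer, prompt, history=None)
print(response)
# Expect a Thought/Action/Action Input block; parse it, call the tool yourself,
# append "Observation: ..." to the prompt, and call chat() again to continue the loop.
```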
如希望使用更低精度的量化模型,如4比特和8比特的模型,我們提供了簡(jiǎn)單的示例來說明如何快速使用量化模型:
To load the model in lower precision, e.g., 4-bit or 8-bit, we provide an example showing how to add a quantization configuration (this relies on the bitsandbytes package):
from modelscope import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import torch
from modelscope import GenerationConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.1', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.1', quantization_config=quantization_config, device_map="auto", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.1', trust_remote_code=True)  # you can specify different generation lengths, top_p and other related hyperparameters
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
response, history = model.chat(tokenizer, "浙江的省會(huì)在哪里?", history=history)
print(response)
response, history = model.chat(tokenizer, "它有什么好玩的景點(diǎn)", history=history)
print(response)
上述方法可以讓我們將模型量化成NF4和Int8精度的模型進(jìn)行讀取,幫助我們節(jié)省顯存開銷。我們也提供了相關(guān)性能數(shù)據(jù)。我們發(fā)現(xiàn)盡管模型在效果上存在損失,但模型的顯存開銷大幅降低。
With this method, Qwen-7B-Chat can be loaded in NF4 or Int8 precision, which saves GPU memory. We provide the related model performance statistics below. We find that quantization slightly degrades effectiveness but greatly reduces memory costs.
Precision | MMLU | Memory |
---|---|---|
BF16 | 56.7 | 16.2G |
Int8 | 52.8 | 10.1G |
NF4 | 48.9 | 7.4G |
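For the Int8 row above, a minimal variant of the earlier quantization snippet (same model and revision; only the quantization config changes):

```python
from modelscope import AutoModelForCausalLM
from transformers import BitsAndBytesConfig

# Sketch: 8-bit loading via bitsandbytes; everything else matches the NF4 example above.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-7B-Chat", revision='v1.0.1',
    quantization_config=quantization_config,
    device_map="auto", trust_remote_code=True).eval()
```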
我們的代碼和模型權(quán)重對(duì)學(xué)術(shù)研究完全開放,并支持商用。請(qǐng)查看LICENSE了解具體的開源協(xié)議細(xì)節(jié)。
Our code and model weights are fully open for academic research, and commercial use is also allowed. Please check the LICENSE for details about the license.
如果你想給我們的研發(fā)團(tuán)隊(duì)和產(chǎn)品團(tuán)隊(duì)留言,請(qǐng)通過郵件(qianwen_opensource@alibabacloud.com)聯(lián)系我們。
If you would like to leave a message for our research or product team, feel free to send an email to qianwen_opensource@alibabacloud.com.