baichuan-7B is an open-source large-scale pre-trained model developed by Baichuan Intelligent Technology. Based on the Transformer architecture, it has 7 billion parameters and was trained on approximately 1.2 trillion tokens. It supports both Chinese and English and has a context window of 4,096 tokens. It achieves the best results among models of its size on standard, authoritative Chinese and English benchmarks (C-Eval and MMLU).
If you wish to use baichuan-7B (for inference, finetuning, etc.), we recommend using the accompanying code repository baichuan-7B.
Among models of the same size, baichuan-7B achieves the current state-of-the-art (SOTA) results, as shown by the MMLU scores reported below.
baichuan-7B is trained on Baichuan's own bilingual Chinese-English corpus, is optimized for Chinese, and achieves SOTA performance on C-Eval.
Unlike LLaMA, which prohibits commercial use entirely, baichuan-7B is released under a more permissive open-source license that allows commercial use.
The overall model is based on the standard Transformer architecture, and we adopt the same model design as LLaMA.
The specific hyperparameters are listed in the table below:
Hyperparameter | Value |
---|---|
n_parameters | 7000559616 |
n_layers | 32 |
n_heads | 32 |
d_model | 4096 |
vocab size | 64000 |
sequence length | 4096 |
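For illustration only, the hyperparameters above can be expressed as a small LLaMA-style decoder configuration. This is a minimal sketch with assumed field names, not the model's actual configuration class:

```python
from dataclasses import dataclass

@dataclass
class Baichuan7BConfigSketch:
    """Illustrative LLaMA-style decoder configuration matching the table above.
    Field names are assumptions for readability, not the official config class."""
    hidden_size: int = 4096               # d_model
    num_hidden_layers: int = 32           # n_layers
    num_attention_heads: int = 32         # n_heads
    vocab_size: int = 64000               # vocab size
    max_position_embeddings: int = 4096   # sequence length / context window

    @property
    def head_dim(self) -> int:
        # Each attention head works over hidden_size / n_heads = 128 dimensions.
        return self.hidden_size // self.num_attention_heads
```

The following example shows how to run inference with the ModelScope pipeline: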
from modelscope.utils.constant import Tasks
from modelscope.pipelines import pipeline

# Build a text-generation pipeline for baichuan-7B; device_map='auto' places the weights across available devices.
text_generation_zh = pipeline(task=Tasks.text_generation, model='baichuan-inc/baichuan-7B', device_map='auto', model_revision='v1.0.5')
text_generation_zh._model_prepare = True  # skip the pipeline's default model-preparation step

# Beam-search continuation of the Chinese prompt "今天天氣是真的" ("The weather today is really ...").
result_zh = text_generation_zh('今天天氣是真的', min_length=10, max_length=512, num_beams=3, temperature=0.8, do_sample=False,
                               early_stopping=True, top_k=50, top_p=0.8, repetition_penalty=1.2, length_penalty=1.2, no_repeat_ngram_size=6)
print(result_zh)
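If you prefer Hugging Face transformers over the ModelScope pipeline, loading the weights along the following lines should also work. This is a hedged sketch: the repository id is assumed to match the one used above, and the model ships custom code, so trust_remote_code=True is required:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'baichuan-inc/baichuan-7B'  # assumed to match the id used above; it may differ on other hubs
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', trust_remote_code=True)

# Continue the same Chinese prompt as in the ModelScope example.
inputs = tokenizer('今天天氣是真的', return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```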
We have also open-sourced the training code that accompanies this model, which supports efficient finetuning for downstream tasks. For details, please refer to baichuan-7B.
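As a rough illustration of what a downstream finetune can look like (this is not the official training code in the baichuan-7B repository; the dataset file and hyperparameters below are placeholders), a minimal causal-LM finetuning run with transformers might be:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = 'baichuan-inc/baichuan-7B'
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # a padding token is needed for batching
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Placeholder corpus: a plain-text file with one training example per line.
dataset = load_dataset('text', data_files={'train': 'train.txt'})['train']
tokenized = dataset.map(lambda batch: tokenizer(batch['text'], truncation=True, max_length=1024),
                        batched=True, remove_columns=['text'])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal-LM objective

args = TrainingArguments(output_dir='baichuan-7b-finetune', per_device_train_batch_size=1,
                         gradient_accumulation_steps=16, num_train_epochs=1,
                         learning_rate=2e-5, bf16=True, logging_steps=10)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

In practice a 7B model usually also needs memory-saving measures (for example DeepSpeed or parameter-efficient methods); see the baichuan-7B repository for the supported setup.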
Production use without adequate risk assessment and mitigation, and any use case that may be considered irresponsible or harmful.
baichuan-7B can produce factually incorrect output and should not be relied on to produce factually accurate information. baichuan-7B was trained on various public datasets. Although great effort has gone into cleaning the pretraining data, the model may still generate lewd, biased, or otherwise offensive outputs.
For the specific training settings, please refer to baichuan-7B.
C-Eval is a comprehensive Chinese evaluation dataset for foundation models, covering 52 subjects at four difficulty levels. We use the dev split as the source of few-shot examples and run a 5-shot evaluation on the test split.
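To make the 5-shot protocol concrete, the prompt for each test question is formed by prepending answered examples taken from the dev split. A minimal sketch, assuming a simple dict layout with 'question', 'A'..'D', and 'answer' fields (the field names are assumptions, not the official loader API):

```python
def build_five_shot_prompt(dev_examples, test_item):
    """Concatenate 5 answered dev examples followed by the unanswered test question."""
    def render(ex, with_answer):
        text = (f"{ex['question']}\n"
                f"A. {ex['A']}\nB. {ex['B']}\nC. {ex['C']}\nD. {ex['D']}\n答案:")
        return text + (f"{ex['answer']}\n\n" if with_answer else "")

    shots = "".join(render(ex, with_answer=True) for ex in dev_examples[:5])
    return shots + render(test_item, with_answer=False)
```

The 5-shot results on C-Eval are as follows: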
Model 5-shot | Average | Avg(Hard) | STEM | Social Sciences | Humanities | Others |
---|---|---|---|---|---|---|
GPT-4 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
Claude-instant-v1.0 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
moss-moon-003-base (16B) | 27.4 | 24.5 | 27.0 | 29.1 | 27.2 | 26.9 |
Ziya-LLaMA-13B-pretrain | 30.2 | 22.7 | 27.7 | 34.4 | 32.0 | 28.9 |
LLaMA-7B-hf | 27.1 | 25.9 | 27.1 | 26.8 | 27.9 | 26.3 |
ChatGLM-6B | 34.5 | 23.1 | 30.4 | 39.6 | 37.4 | 34.5 |
Falcon-7B | 25.8 | 24.3 | 25.8 | 26.0 | 25.8 | 25.6 |
Open-LLaMA-v2-pretrain (7B) | 24.0 | 22.5 | 23.1 | 25.3 | 25.2 | 23.2 |
TigerBot-7B-base | 25.7 | 27.0 | 27.3 | 24.7 | 23.4 | 26.1 |
Aquila-7B* | 25.5 | 25.2 | 25.6 | 24.6 | 25.2 | 26.6 |
BLOOM-7B | 22.8 | 20.2 | 21.8 | 23.3 | 23.9 | 23.3 |
BLOOMZ-7B | 35.7 | 25.8 | 31.3 | 43.5 | 36.6 | 35.6 |
baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
Gaokao is a dataset built from Chinese college entrance examination (Gaokao) questions, used to evaluate a large language model's language ability and logical reasoning.
We keep only the single-answer multiple-choice questions and run a unified 5-shot evaluation on all models.
The results are as follows.
Model | Average |
---|---|
Open-LLaMA-v2-pretrain | 21.41 |
Ziya-LLaMA-13B-pretrain | 23.17 |
Falcon-7B | 23.98 |
TigerBot-7B-base | 25.94 |
LLaMA-7B | 27.81 |
ChatGLM-6B | 21.41 |
BLOOM-7B | 26.96 |
BLOOMZ-7B | 28.72 |
Aquila-7B* | 24.39 |
baichuan-7B | 36.24 |
AGIEval is designed to evaluate a model's general abilities on cognition- and problem-solving-related tasks.
We keep only the four-option single-answer multiple-choice questions, randomly split the data, and run a unified 5-shot evaluation on all models.
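A common way to score such single-answer multiple-choice questions is to compare the model's next-token score for each option letter after the prompt and pick the highest. This is a generic sketch of that technique, not necessarily the exact protocol behind the numbers below:

```python
import torch

@torch.no_grad()
def pick_option(model, tokenizer, prompt, options=("A", "B", "C", "D")):
    """Return the option whose first token the model scores highest after the prompt."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    next_token_logits = model(input_ids).logits[0, -1]  # logits for the token following the prompt
    scores = {opt: next_token_logits[tokenizer(opt, add_special_tokens=False).input_ids[0]].item()
              for opt in options}
    return max(scores, key=scores.get)
```

The 5-shot results are as follows: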
Model | Average |
---|---|
Open-LLaMA-v2-pretrain | 23.49 |
Ziya-LLaMA-13B-pretrain | 27.64 |
Falcon-7B | 27.18 |
TigerBot-7B-base | 25.19 |
LLaMA-7B | 28.17 |
ChatGLM-6B | 23.49 |
BLOOM-7B | 26.55 |
BLOOMZ-7B | 30.27 |
Aquila-7B* | 25.58 |
baichuan-7B | 34.44 |
*The Aquila results are taken from the official BAAI website and are listed for reference only.
In addition to Chinese, we also tested the model’s performance in English.
MMLU is an English evaluation dataset that includes 57 multiple-choice tasks, covering elementary mathematics, American history, computer science, law, etc. The difficulty ranges from high school level to expert level, making it a mainstream LLM evaluation dataset.
We adopted the open-source evaluation scheme and report 5-shot results; a sketch of how per-subject accuracies can be aggregated into the category columns is shown below, followed by the final scores.
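As a small illustration (whether the reported numbers are macro- or micro-averages over subjects is an assumption here), category scores can be computed from per-subject accuracies like this:

```python
from collections import defaultdict

def category_averages(subject_scores, subject_to_category):
    """Macro-average per-subject accuracies into Humanities/Social Sciences/STEM/Other scores."""
    buckets = defaultdict(list)
    for subject, acc in subject_scores.items():
        buckets[subject_to_category[subject]].append(acc)
    averages = {cat: sum(v) / len(v) for cat, v in buckets.items()}
    averages['Average'] = sum(subject_scores.values()) / len(subject_scores)
    return averages
```

The final 5-shot results are as follows: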
Model | Humanities | Social Sciences | STEM | Other | Average |
---|---|---|---|---|---|
LLaMA-7B² | 34.0 | 38.3 | 30.5 | 38.1 | 35.1 |
Falcon-7B¹ | - | - | - | - | 35.0 |
mpt-7B¹ | - | - | - | - | 35.6 |
ChatGLM-6B⁰ | 35.4 | 41.0 | 31.3 | 40.5 | 36.9 |
BLOOM 7B⁰ | 25.0 | 24.4 | 26.5 | 26.4 | 25.5 |
BLOOMZ 7B⁰ | 31.3 | 42.1 | 34.4 | 39.0 | 36.1 |
moss-moon-003-base (16B)⁰ | 24.2 | 22.8 | 22.4 | 24.4 | 23.6 |
moss-moon-003-sft (16B)⁰ | 30.5 | 33.8 | 29.3 | 34.4 | 31.9 |
baichuan-7B⁰ | 38.4 | 48.9 | 35.6 | 48.1 | 42.3 |
The superscript in the Model column indicates the source of the results.
0: reimplemented
1: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
2: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu