

Ziya-LLaMA-13B-v1

姜子牙系列模型 Ziya Model Series

簡介 Brief Introduction

姜子牙通用大模型V1是基于LLaMa的130億參數(shù)的大規(guī)模預(yù)訓(xùn)練模型,具備翻譯,編程,文本分類,信息抽取,摘要,文案生成,常識問答和數(shù)學(xué)計(jì)算等能力。目前姜子牙通用大模型已完成大規(guī)模預(yù)訓(xùn)練、多任務(wù)有監(jiān)督微調(diào)和人類反饋學(xué)習(xí)三階段的訓(xùn)練過程。

The Ziya-LLaMA-13B-v1 is a large-scale pre-trained model based on LLaMA with 13 billion parameters. It has the ability to perform tasks such as translation, programming, text classification, information extraction, summarization, copywriting, common sense Q&A, and mathematical calculation. The Ziya-LLaMA-13B-v1 has undergone three stages of training: large-scale continual pre-training (PT), multi-task supervised fine-tuning (SFT), and human feedback learning (RM, PPO).

模型分類 Model Taxonomy

| 需求 Demand | 任務(wù) Task | 系列 Series | 模型 Model | 參數(shù) Parameter | 額外 Extra |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 通用 General | AGI模型 AGI Model | 姜子牙 Ziya | LLaMA | 13B | English&Chinese |

模型信息 Model Information

繼續(xù)預(yù)訓(xùn)練 Continual pretraining

原始數(shù)據(jù)包含英文和中文,其中英文數(shù)據(jù)來自openwebtext、Books、Wikipedia和Code,中文數(shù)據(jù)來自清洗后的悟道數(shù)據(jù)集、自建的中文數(shù)據(jù)集。在對原始數(shù)據(jù)進(jìn)行去重、模型打分、數(shù)據(jù)分桶、規(guī)則過濾、敏感主題過濾和數(shù)據(jù)評估后,最終得到125B tokens的有效數(shù)據(jù)。

為了解決LLaMA原生分詞對中文編解碼效率低下的問題,我們在LLaMA詞表的基礎(chǔ)上增加了7k+個常見中文字,通過和LLaMA原生的詞表去重,最終得到一個39410大小的詞表,并通過復(fù)用Transformers里LlamaTokenizer來實(shí)現(xiàn)了這一效果。

在增量訓(xùn)練過程中,我們使用了160張40GB的A100,采用2.6M tokens的訓(xùn)練集樣本數(shù)量和FP16的混合精度,吞吐量達(dá)到118 TFLOPS per GPU。因此我們能夠在8天的時(shí)間里在原生的LLaMA-13B模型基礎(chǔ)上,增量訓(xùn)練110B tokens的數(shù)據(jù)。

訓(xùn)練期間,雖然遇到了機(jī)器宕機(jī)、底層框架bug、loss spike等各種問題,但我們通過快速調(diào)整,保證了增量訓(xùn)練的穩(wěn)定性。我們也放出訓(xùn)練過程的loss曲線,讓大家了解可能出現(xiàn)的問題。

The original data contains both English and Chinese, with the English data drawn from openwebtext, Books, Wikipedia, and Code, and the Chinese data drawn from the cleaned Wudao dataset and a self-built Chinese corpus. After deduplication, model scoring, data bucketing, rule filtering, sensitive-topic filtering, and data evaluation, we obtained 125 billion tokens of valid data.
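
For illustration, the cleaning steps listed above can be thought of as a sequence of filtering passes over raw documents. The following is a minimal Python sketch of that kind of pipeline; the quality scorer, score threshold, length rule, and blocklist are placeholder assumptions, not the actual components used for Ziya.

import hashlib
import re

# Hypothetical quality scorer; in practice this would be a small LM or
# classifier that assigns each document a quality score.
def quality_score(doc: str) -> float:
    return min(len(doc) / 1000.0, 1.0)  # placeholder heuristic

BLOCKLIST = re.compile(r"(example-sensitive-term)")  # placeholder rule list

def clean_corpus(docs, min_score=0.3):
    seen = set()
    kept = []
    for doc in docs:
        # 1. Exact deduplication via content hashing.
        h = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if h in seen:
            continue
        seen.add(h)
        # 2. Model scoring / bucketing: drop the lowest-quality bucket.
        if quality_score(doc) < min_score:
            continue
        # 3. Rule-based filtering (e.g. minimum length).
        if len(doc) < 32:
            continue
        # 4. Sensitive-topic filtering.
        if BLOCKLIST.search(doc):
            continue
        kept.append(doc)
    return kept

if __name__ == "__main__":
    print(len(clean_corpus(["short", "a" * 500, "a" * 500])))  # -> 1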

To address the low efficiency of the native LLaMA tokenizer when encoding and decoding Chinese, we added over 7,000 commonly used Chinese characters to the LLaMA vocabulary. After deduplicating against the original LLaMA vocabulary, we obtained a final vocabulary of 39,410 tokens, which we implemented by reusing the LlamaTokenizer in Transformers.
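
As a rough sketch of how such a vocabulary extension can be done (assuming the standard sentencepiece-based LLaMA tokenizer; the file paths and character list below are placeholders, and this is not necessarily the exact script used for Ziya), new character pieces can be appended to the tokenizer's underlying sentencepiece model and then loaded back with the stock LlamaTokenizer:

from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

# Placeholder paths: the base LLaMA tokenizer and a list of common Chinese characters.
base_tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-13b")
new_chars = [line.strip() for line in open("common_chinese_chars.txt", encoding="utf-8")]

# Parse the underlying sentencepiece model and append pieces that are
# not already in the LLaMA vocabulary (the deduplication step).
sp_model = sp_pb2_model.ModelProto()
sp_model.ParseFromString(base_tokenizer.sp_model.serialized_model_proto())
existing = {p.piece for p in sp_model.pieces}

for ch in new_chars:
    if ch and ch not in existing:
        piece = sp_pb2_model.ModelProto.SentencePiece()
        piece.piece = ch
        piece.score = 0.0
        sp_model.pieces.append(piece)

with open("ziya_tokenizer.model", "wb") as f:
    f.write(sp_model.SerializeToString())

# The merged model can then be loaded with the stock LlamaTokenizer.
merged = LlamaTokenizer(vocab_file="ziya_tokenizer.model")
print(len(merged))  # e.g. 39410 after adding ~7k new characters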

During the incremental training process, we used 160 A100 GPUs with 40GB of memory each, a global batch size of 2.6 million tokens, and FP16 mixed precision, reaching a throughput of 118 TFLOPS per GPU. As a result, we were able to incrementally train 110 billion tokens of data on top of the native LLaMA-13B model in just 8 days.
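
As a back-of-envelope sanity check (using the common 6·N·D approximation for training FLOPs, which ignores optimizer and recomputation overheads; this is our own rough estimate, not the team's accounting), the reported throughput is broadly consistent with finishing 110B tokens in about a week:

# Rough estimate: training FLOPs ≈ 6 * parameters * tokens.
params = 13e9            # 13B parameters
tokens = 110e9           # 110B incremental tokens
gpus = 160               # A100 40GB GPUs
flops_per_gpu = 118e12   # reported sustained throughput (118 TFLOPS)

total_flops = 6 * params * tokens
seconds = total_flops / (gpus * flops_per_gpu)
print(f"~{seconds / 86400:.1f} days")  # ~5.3 days; with real-world overheads
                                       # this is consistent with the reported 8 days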

Throughout the training process, we encountered various issues such as machine crashes, underlying framework bugs, and loss spikes. However, we ensured the stability of the incremental training by making rapid adjustments. We have also released the loss curve during the training process to help everyone understand the potential issues that may arise.

多任務(wù)有監(jiān)督微調(diào) Supervised finetuning

在多任務(wù)有監(jiān)督微調(diào)階段,采用了課程學(xué)習(xí)(curriculum learning)和增量訓(xùn)練(continual learning)的策略,用大模型輔助劃分已有的數(shù)據(jù)難度,然后通過“Easy To Hard”的方式,分多個階段進(jìn)行SFT訓(xùn)練。

SFT訓(xùn)練數(shù)據(jù)包含多個高質(zhì)量的數(shù)據(jù)集,均經(jīng)過人工篩選和校驗(yàn):

  • Self-Instruct構(gòu)造的數(shù)據(jù)(約2M):BELLE、Alpaca、Alpaca-GPT4等多個數(shù)據(jù)集
  • 內(nèi)部收集Code數(shù)據(jù)(300K):包含leetcode、多種Code任務(wù)形式
  • 內(nèi)部收集推理/邏輯相關(guān)數(shù)據(jù)(500K):推理、申論、數(shù)學(xué)應(yīng)用題、數(shù)值計(jì)算等
  • 中英平行語料(2M):中英互譯語料、COT類型翻譯語料、古文翻譯語料等
  • 多輪對話語料(500K):Self-Instruct生成、任務(wù)型多輪對話、Role-Playing型多輪對話等

During the multi-task supervised fine-tuning (SFT) phase, we adopted a strategy of curriculum learning and incremental (continual) training. A large model was used to help partition the existing data by difficulty, and SFT was then carried out in multiple stages following an "easy to hard" schedule.

The SFT training data consists of multiple high-quality datasets, all manually selected and verified:

  • Self-Instruct-constructed data (~2M samples): BELLE, Alpaca, Alpaca-GPT4, and other datasets
  • Internally collected code data (300K samples): LeetCode and various other code task formats
  • Internally collected reasoning/logic data (500K samples): reasoning, argumentative essays, mathematical word problems, numerical calculation, etc.
  • Chinese-English parallel corpora (2M samples): Chinese-English translation, CoT-style translation, classical Chinese translation, etc.
  • Multi-turn dialogue corpora (500K samples): Self-Instruct generation, task-oriented multi-turn dialogue, role-playing multi-turn dialogue, etc.
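
To make the "easy to hard" staging concrete, the sketch below shows one way to split SFT samples into stages by a difficulty score and train them in order. The difficulty function and stage boundaries are illustrative assumptions; in the actual pipeline the scoring was assisted by a large model, as described above.

from typing import Callable, Dict, List

def split_by_difficulty(samples: List[Dict],
                        difficulty_fn: Callable[[Dict], float],
                        boundaries=(0.33, 0.66)) -> List[List[Dict]]:
    """Partition SFT samples into easy / medium / hard stages."""
    scored = sorted(samples, key=difficulty_fn)
    n = len(scored)
    cuts = [int(b * n) for b in boundaries]
    return [scored[:cuts[0]], scored[cuts[0]:cuts[1]], scored[cuts[1]:]]

# Placeholder difficulty score; in practice this could be a large model's
# loss on the sample or an LLM-judge rating.
difficulty = lambda s: len(s["output"])

stages = split_by_difficulty(
    [{"instruction": "1+1=?", "output": "2"},
     {"instruction": "Prove ...", "output": "A long derivation ..."}],
    difficulty,
)
for i, stage in enumerate(stages):
    # Each stage would be a separate SFT run resuming from the previous
    # stage's checkpoint (the continual-training part of the strategy).
    print(f"stage {i}: {len(stage)} samples")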

人類反饋學(xué)習(xí) Human-Feedback training

為了進(jìn)一步提升模型的綜合表現(xiàn),使其能夠充分理解人類意圖、減少“幻覺”和不安全的輸出,基于指令微調(diào)后的模型,進(jìn)行了人類反饋訓(xùn)練(Human-Feedback Training,HFT)。在訓(xùn)練中,我們采用了以人類反饋強(qiáng)化學(xué)習(xí)(RM、PPO)為主,結(jié)合多種其他手段聯(lián)合訓(xùn)練的方法,手段包括人類反饋微調(diào)(Human-Feedback Fine-tuning,HFFT)、后見鏈微調(diào)(Chain-of-Hindsight Fine-tuning,COHFT)、AI反饋(AI Feedback)和基于規(guī)則的獎勵系統(tǒng)(Rule-based Reward System,RBRS)等,用來彌補(bǔ)PPO方法的短板,加速訓(xùn)練。

我們在內(nèi)部自研的框架上實(shí)現(xiàn)了HFT的訓(xùn)練流程,該框架可以利用最少8張40G的A100顯卡完成Ziya-LLaMA-13B-v1的全參數(shù)訓(xùn)練。在PPO訓(xùn)練中,我們沒有限制生成樣本的長度,以確保長文本任務(wù)的獎勵準(zhǔn)確性。每次訓(xùn)練的總經(jīng)驗(yàn)池尺寸超過100k樣本,確保了訓(xùn)練的充分性。

To further improve the model's overall performance, enabling it to fully understand human intentions and reduce "hallucinations" and unsafe outputs, we conducted Human-Feedback Training (HFT) on top of the instruction-tuned model. The training primarily relied on reinforcement learning from human feedback (RM, PPO), combined with several other techniques, including Human-Feedback Fine-tuning (HFFT), Chain-of-Hindsight Fine-tuning (COHFT), AI Feedback, and a Rule-based Reward System (RBRS), to compensate for the shortcomings of the PPO method and accelerate training.

We implemented the HFT training pipeline on an internally developed framework, which can complete full-parameter training of Ziya-LLaMA-13B-v1 with as few as eight 40GB A100 GPUs. During PPO training, we did not limit the length of generated samples, so that rewards for long-text tasks remain accurate. The total experience pool for each training run exceeded 100k samples, ensuring sufficient training.
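
For reference, the reward-model (RM) stage of this kind of RLHF setup is typically trained on human preference pairs with a pairwise ranking loss. The snippet below is a generic PyTorch sketch of that objective, not code from the internal Fengshenbang framework:

import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Standard preference-ranking loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: scalar rewards produced by the RM head for a batch of
# (chosen, rejected) response pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(pairwise_rm_loss(chosen, rejected))  # smaller when chosen > rejected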

效果評估 Performance

示例代碼 Example Code

# Build a ModelScope text-generation pipeline for Ziya-LLaMA-13B-v1.
from modelscope.utils.constant import Tasks
from modelscope.pipelines import pipeline
pipe = pipeline(task=Tasks.text_generation, model='Fengshenbang/Ziya-LLaMA-13B-v1', model_revision='v1.0.7', device_map='auto')
# Ziya expects prompts in the '<human>: ... \n<bot>:' format.
query = "幫我寫一份去西安的旅游計(jì)劃"  # "Write me a travel plan for Xi'an"
inputs = '<human>:' + query.strip() + '\n<bot>:'
result = pipe(inputs, max_new_tokens=1024, do_sample=True, top_p=0.85, temperature=1.0, repetition_penalty=1.0, eos_token_id=2, bos_token_id=1, pad_token_id=0)
print(result['text'])
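
The example above uses Ziya's single-turn '<human>: ... \n<bot>:' prompt format. A small helper like the one below (our own convenience sketch that reuses the `pipe` object defined above; whether the model was trained on exactly this multi-turn concatenation is an assumption) can extend the same format to multi-turn conversations:

def build_prompt(history, query):
    """Format a multi-turn conversation in Ziya's <human>/<bot> template.

    history: list of (user_turn, bot_turn) tuples from earlier exchanges.
    """
    prompt = ""
    for user_turn, bot_turn in history:
        prompt += f"<human>:{user_turn.strip()}\n<bot>:{bot_turn.strip()}\n"
    prompt += f"<human>:{query.strip()}\n<bot>:"
    return prompt

# Example: follow up on the travel-plan request with an extra constraint.
inputs = build_prompt([("幫我寫一份去西安的旅游計(jì)劃", "好的,以下是一份三天的西安旅游計(jì)劃……")],
                      "請把行程壓縮到兩天")
result = pipe(inputs, max_new_tokens=1024, do_sample=True, top_p=0.85,
              temperature=1.0, repetition_penalty=1.0,
              eos_token_id=2, bos_token_id=1, pad_token_id=0)
print(result['text'])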

引用 Citation

如果您在您的工作中使用了我們的模型,可以引用我們的論文

If you use this resource in your work, please cite our paper:

@article{fengshenbang,
  author    = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}

歡迎引用我們的網(wǎng)站:

You can also cite our website:

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}