WizardLM - WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions

Created at: 2023-04-23 21:26:46

Language: Python

URL: https://github.com/nlpxucan/WizardLM

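WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions

🤗 HF Repo • 🐦 Twitter • 📃 [WizardLM] • 📃 [WizardCoder] • 📃 [WizardMath]

👋 Join our Discord

Unofficial Video Introductions

Thanks to our enthusiastic friends, whose video introductions are more lively and interesting.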

News

🔥🔥🔥 [2023/08/26] We released WizardCoder-Python-34B-V1.0, which achieves 73.2 pass@1 and surpasses GPT4 (2023/03/15), ChatGPT-3.5, and Claude2 on the HumanEval benchmark. For more details, please refer to WizardCoder.
[2023/06/16] We released WizardCoder-15B-V1.0, which surpasses Claude-Plus (+6.8), Bard (+15.3), and InstructCodeT5+ (+22.3) on the HumanEval benchmark. For more details, please refer to WizardCoder.

Model	Checkpoint	Paper	HumanEval	MBPP	Demo	License
WizardCoder-Python-34B-V1.0	🤗 HF Link	📃 [WizardCoder]	73.2	61.2	Demo	Llama2
WizardCoder-15B-V1.0	🤗 HF Link	📃 [WizardCoder]	59.8	50.6	--	OpenRAIL-M
WizardCoder-Python-13B-V1.0	🤗 HF Link	📃 [WizardCoder]	64.0	55.6	--	Llama2
WizardCoder-Python-7B-V1.0	🤗 HF Link	📃 [WizardCoder]	55.5	51.6	Demo	Llama2
WizardCoder-3B-V1.0	🤗 HF Link	📃 [WizardCoder]	34.8	37.4	--	OpenRAIL-M
WizardCoder-1B-V1.0	🤗 HF Link	📃 [WizardCoder]	23.8	28.6	--	OpenRAIL-M

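Our WizardMath-70B-V1.0 model slightly outperforms some closed-source LLMs on GSM8K, including ChatGPT 3.5, Claude Instant 1, and PaLM 2 540B.
Our WizardMath-70B-V1.0 model achieves 81.6 pass@1 on the GSM8k benchmark, 24.8 points higher than the SOTA open-source LLM, and 22.7 pass@1 on the MATH benchmark, 9.2 points higher than the SOTA open-source LLM.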
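
The margins quoted above imply the scores of the previous best open-source model on each benchmark. A small sketch of that arithmetic follows; the dictionary layout and the helper name `implied_baseline` are illustrative assumptions, and the baselines are derived from the stated numbers rather than reported in this document.

```python
# (WizardMath-70B-V1.0 pass@1, stated margin over the SOTA open-source LLM)
REPORTED = {
    "GSM8k": (81.6, 24.8),
    "MATH": (22.7, 9.2),
}


def implied_baseline(score: float, margin: float) -> float:
    """Return the baseline score implied by a score and its margin."""
    # Round to one decimal to match the precision used in the text.
    return round(score - margin, 1)


for benchmark, (score, margin) in REPORTED.items():
    print(f"{benchmark}: implied open-source baseline = "
          f"{implied_baseline(score, margin)}")
```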

Model	Checkpoint	Paper	GSM8k	MATH	Online Demo	License
WizardMath-70B-V1.0	🤗 HF Link	📃 [WizardMath]	81.6	22.7	Demo	Llama 2
WizardMath-13B-V1.0	🤗 HF Link	📃 [WizardMath]	63.9	14.0	Demo	Llama 2
WizardMath-7B-V1.0	🤗 HF Link	📃 [WizardMath]	54.9	10.7	Demo	Llama 2

[2023/08/09] We released the WizardLM-70B-V1.0 model. Here are the full model weights.

Model	Checkpoint	Paper	MT-Bench	AlpacaEval	GSM8k	HumanEval	Demo	License
WizardLM-70B-V1.0	🤗 HF Link	📃 Coming Soon	7.78	92.91%	77.6%	50.6		Llama 2 License
WizardLM-13B-V1.2	🤗 HF Link		7.06	89.17%	55.3%	36.6	Demo	Llama 2 License
WizardLM-13B-V1.1	🤗 HF Link		6.76	86.32%		25.0		Non-commercial
WizardLM-30B-V1.0	🤗 HF Link		7.01			37.8		Non-commercial
WizardLM-13B-V1.0	🤗 HF Link		6.35	75.31%		24.0		Non-commercial
WizardLM-7B-V1.0	🤗 HF Link	📃 [WizardLM]				19.1		Non-commercial

Citation

Please cite the paper if you use the data or code from WizardLM.

@article{xu2023wizardlm,
  title={Wizardlm: Empowering large language models to follow complex instructions},
  author={Xu, Can and Sun, Qingfeng and Zheng, Kai and Geng, Xiubo and Zhao, Pu and Feng, Jiazhan and Tao, Chongyang and Jiang, Daxin},
  journal={arXiv preprint arXiv:2304.12244},
  year={2023}
}

Please cite the paper if you use the data or code from WizardCoder.

@article{luo2023wizardcoder,
  title={WizardCoder: Empowering Code Large Language Models with Evol-Instruct},
  author={Luo, Ziyang and Xu, Can and Zhao, Pu and Sun, Qingfeng and Geng, Xiubo and Hu, Wenxiang and Tao, Chongyang and Ma, Jing and Lin, Qingwei and Jiang, Daxin},
  journal={arXiv preprint arXiv:2306.08568},
  year={2023}
}

Please cite the paper if you refer to our model, code, data, or paper from WizardMath.

@article{luo2023wizardmath,
  title={WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct},
  author={Luo, Haipeng and Sun, Qingfeng and Xu, Can and Zhao, Pu and Lou, Jianguang and Tao, Chongyang and Geng, Xiubo and Lin, Qingwei and Chen, Shifeng and Zhang, Dongmei},
  journal={arXiv preprint arXiv:2308.09583},
  year={2023}
}

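❗Regarding common concerns about the dataset:

Recently, there have been clear changes in our organization's open-source policies and regulations covering code, data, and models. Nevertheless, we worked hard to get the model weights released first. The data is subject to stricter auditing and is under review by our legal team. Our researchers have no authority to release it publicly without authorization. Thank you for your understanding.

Hiring

📣 We are looking for highly motivated students to join us as interns and create more intelligent AI together. Please contact caxu@microsoft.com

Note on model system prompt usage:

To obtain the same results as our demo, please strictly follow the prompts and invocation methods provided in "src/infer_wizardlm13b.py" when using our models for inference. Our models adopt the prompt format from Vicuna and support multi-turn conversation.

For WizardLM, the prompt should be as follows: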

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello.</s>USER: Who are you? ASSISTANT: I am WizardLM.</s>......
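
A minimal sketch of how the multi-turn prompt above could be assembled in Python. The function name `build_wizardlm_prompt` and the `(user, assistant)` tuple layout are illustrative assumptions, not part of the repository. The separators are inferred from the single example above, so "src/infer_wizardlm13b.py" remains the authoritative reference.

```python
from typing import List, Tuple

# System preamble copied verbatim from the example prompt above.
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_wizardlm_prompt(history: List[Tuple[str, str]], user_message: str) -> str:
    """Build a Vicuna-style prompt from past turns plus a new user message.

    `history` holds completed (user, assistant) turns. Each completed
    assistant reply is closed with the end-of-sequence marker "</s>",
    mirroring the example. The prompt ends with "ASSISTANT:" so the model
    generates the next reply.
    """
    prompt = SYSTEM + " "
    for user, assistant in history:
        prompt += f"USER: {user} ASSISTANT: {assistant}</s>"
    prompt += f"USER: {user_message} ASSISTANT:"
    return prompt


if __name__ == "__main__":
    print(build_wizardlm_prompt([("Hi", "Hello.")], "Who are you?"))
```

Keeping the history as a list of pairs makes it easy to append the model's reply after each generation and rebuild the prompt for the next turn.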

For WizardCoder, the prompt should be as follows:

"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

For WizardMath, the prompts should be as follows:

Default version:

"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

CoT version: (❗For simple math questions, we do NOT recommend using the CoT prompt.)

"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response: Let's think step by step."

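The instruction-style prompts above for WizardCoder and WizardMath share one template, and the CoT variant only appends a suffix. A minimal sketch of a formatter under that reading follows; the names `build_instruction_prompt`, `ALPACA_TEMPLATE`, and `COT_SUFFIX` are illustrative assumptions and do not come from the repository.

```python
# Template copied from the prompt strings above.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

# Suffix that turns the default prompt into the CoT version.
COT_SUFFIX = " Let's think step by step."


def build_instruction_prompt(instruction: str, cot: bool = False) -> str:
    """Format an instruction with the default or the CoT prompt."""
    prompt = ALPACA_TEMPLATE.format(instruction=instruction)
    return prompt + COT_SUFFIX if cot else prompt


if __name__ == "__main__":
    print(build_instruction_prompt("Write a function that reverses a string."))
    print(build_instruction_prompt("What is 12 * 7 + 5?", cot=True))
```

Following the note above, `cot=True` is best left off for simple math questions.
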
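GPT-4 Automatic Evaluation

We adopt the GPT-4-based automatic evaluation framework proposed by FastChat to assess the performance of chatbot models. As shown in the following figure, WizardLM-30B achieves better results than Guanaco-65B.

WizardLM-30B performance on different skills.

The following figure compares the skills of WizardLM-30B and ChatGPT on the Evol-Instruct test set. The results indicate that WizardLM-30B achieves 97.8% of ChatGPT's performance on average, with almost 100% (or more) capacity on 18 skills and more than 90% capacity on 24 skills.

WizardLM performance on NLP foundation tasks.

The following table compares WizardLM with other LLMs on NLP foundation tasks. The results indicate that WizardLM consistently outperforms LLaMa models of the same size. Furthermore, our WizardLM-30B model shows performance comparable to OpenAI's Text-davinci-003 on the MMLU and HellaSwag benchmarks.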

Model	MMLU 5-shot	ARC 25-shot	TruthfulQA 0-shot	HellaSwag 10-shot	Average
Text-davinci-003	56.9	85.2	59.3	82.2	70.9
Vicuna-13b 1.1	51.3	53.0	51.8	80.1	59.1
Guanaco 30B	57.6	63.7	50.7	85.1	64.3
WizardLM-7B 1.0	42.7	51.6	44.7	77.7	54.2
WizardLM-13B 1.0	52.3	57.2	50.5	81.0	60.2
WizardLM-30B 1.0	58.8	62.5	52.4	83.3	64.2
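
The Average column in the table above reads as the unweighted mean of the four benchmark scores. A short sketch that recomputes it from the table values follows; the `TABLE` layout and the helper names are illustrative assumptions. Because the per-benchmark scores are already rounded to one decimal, the recomputed means can differ from the reported averages by about 0.05.

```python
# Scores copied from the table above, in the order
# MMLU, ARC, TruthfulQA, HellaSwag, followed by the reported average.
TABLE = {
    "Text-davinci-003": ([56.9, 85.2, 59.3, 82.2], 70.9),
    "Vicuna-13b 1.1": ([51.3, 53.0, 51.8, 80.1], 59.1),
    "Guanaco 30B": ([57.6, 63.7, 50.7, 85.1], 64.3),
    "WizardLM-7B 1.0": ([42.7, 51.6, 44.7, 77.7], 54.2),
    "WizardLM-13B 1.0": ([52.3, 57.2, 50.5, 81.0], 60.2),
    "WizardLM-30B 1.0": ([58.8, 62.5, 52.4, 83.3], 64.2),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


def max_deviation(table):
    """Largest gap between a recomputed mean and the reported average."""
    return max(abs(mean(scores) - reported) for scores, reported in table.values())


for name, (scores, reported) in TABLE.items():
    print(f"{name}: computed {mean(scores):.3f}, reported {reported}")
print(f"largest deviation: {max_deviation(TABLE):.3f}")
```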

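The following table provides a comprehensive comparison of WizardLM and several other LLMs on the code generation task, namely HumanEval. The evaluation metric is pass@1. The results indicate that WizardLM consistently outperforms LLaMa models of the same size. Furthermore, our WizardLM-30B model surpasses StarCoder and OpenAI's code-cushman-001. Moreover, our Code LLM, WizardCoder, demonstrates exceptional performance, achieving a pass@1 score of 57.3 and surpassing the open-source SOTA by approximately 20 points.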

Model	HumanEval Pass@1
LLaMA-7B	10.5
LLaMA-13B	15.8
CodeGen-16B-Multi	18.3
CodeGeeX	22.9
LLaMA-33B	21.7
LLaMA-65B	23.7
PaLM-540B	26.2
CodeGen-16B-Mono	29.3
code-cushman-001	33.5
StarCoder	33.6
WizardLM-7B 1.0	19.1
WizardLM-13B 1.0	24.0
WizardLM-30B 1.0	37.8
WizardCoder-15B 1.0	57.3
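
The text names pass@1 as the metric but does not say how it was computed. Below is a minimal sketch, assuming the unbiased pass@k estimator commonly used with HumanEval: generate n samples per problem and count the c samples that pass the unit tests. The function names and the sample counts are made up for illustration and are not results from this repository.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: evaluation budget.
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least
    one of k samples drawn without replacement is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, reported as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical (n, c) pairs for three problems.
    toy_results = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {benchmark_pass_at_k(toy_results, k=1):.1f}")
```

For k = 1 the estimator reduces to c / n per problem, so pass@1 is the average fraction of passing samples.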