llama.cpp - Port of Facebook's LLaMA model in C/C++

Created at: 2023-03-11 02:58:00
Language: C
License: MIT

llama.cpp


Inference of Facebook's LLaMA model in pure C/C++

Description

The main goal is to run the model using 4-bit quantization on a MacBook.

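  • Plain C/C++ implementation without dependencies
  • Apple silicon first-class citizen - optimized via Arm Neon and the Accelerate framework
  • AVX2 support for x86 architectures
  • Mixed F16 / F32 precision
  • 4-bit quantization support
  • Runs on the CPU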
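
To give a rough idea of what 4-bit quantization means here, below is a minimal Python sketch of block-wise quantization: each block of weights shares one scale, and every weight is stored as a 4-bit integer, two per byte. The block size of 32, the function names and the exact rounding are assumptions made for illustration; this is not the ggml code and the on-disk layout may differ.

```python
BLOCK_SIZE = 32  # assumed block size, for illustration only


def quantize_block(values):
    """Quantize an even-length list of floats to 4-bit ints sharing one scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax > 0 else 1.0
    # Map each value to an integer in [-7, 7], then shift by 8 so it fits a nibble.
    quants = [max(-7, min(7, round(v / scale))) + 8 for v in values]
    # Pack two 4-bit values into every byte (low nibble first).
    packed = bytes(quants[i] | (quants[i + 1] << 4) for i in range(0, len(quants), 2))
    return scale, packed


def dequantize_block(scale, packed):
    """Recover approximate floats from the scale and the packed nibbles."""
    out = []
    for b in packed:
        out.append(((b & 0x0F) - 8) * scale)
        out.append(((b >> 4) - 8) * scale)
    return out


weights = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
scale, packed = quantize_block(weights)
restored = dequantize_block(scale, packed)
print(len(packed), "bytes for", len(weights), "weights")
```

Under these assumptions a block of 32 weights takes 16 bytes plus the scale, and the round-trip error per weight is at most half the scale.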

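This was hacked in an evening - I have no idea if it works correctly. Please do not make conclusions about the models based on the results from this implementation. For all I know, it can be completely wrong. This project is for educational purposes. New features will probably be added mostly through community contributions.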

Supported platforms:

  • [X] Mac OS
  • [X] Linux
  • [X] Windows (via CMake)

Here is a typical run using LLaMA-7B:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 1678486056
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
     1 -> ''
  8893 -> 'Build'
   292 -> 'ing'
   263 -> ' a'
  4700 -> ' website'
   508 -> ' can'
   367 -> ' be'
  2309 -> ' done'
   297 -> ' in'
 29871 -> ' '
 29896 -> '1'
 29900 -> '0'
  2560 -> ' simple'
  6576 -> ' steps'
 29901 -> ':'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000


Building a website can be done in 10 simple steps:
1) Select a domain name and web hosting plan
2) Complete a sitemap
3) List your products
4) Write product descriptions
5) Create a user account
6) Build the template
7) Start building the website
8) Advertise the website
9) Provide email support
10) Submit the website to search engines
A website is a collection of web pages that are formatted with HTML. HTML is the code that defines what the website looks like and how it behaves.
The HTML code is formatted into a template or a format. Once this is done, it is displayed on the user's browser.
The web pages are stored in a web server. The web server is also called a host. When the website is accessed, it is retrieved from the server and displayed on the user's computer.
A website is known as a website when it is hosted. This means that it is displayed on a host. The host is usually a web server.
A website can be displayed on different browsers. The browsers are basically the software that renders the website on the user's screen.
A website can also be viewed on different devices such as desktops, tablets and smartphones.
Hence, to have a website displayed on a browser, the website must be hosted.
A domain name is an address of a website. It is the name of the website.
The website is known as a website when it is hosted. This means that it is displayed on a host. The host is usually a web server.
A website can be displayed on different browsers. The browsers are basically the software that renders the website on the users screen.
A website can also be viewed on different devices such as desktops, tablets and smartphones. Hence, to have a website displayed on a browser, the website must be hosted.
A domain name is an address of a website. It is the name of the website.
A website is an address of a website. It is a collection of web pages that are formatted with HTML. HTML is the code that defines what the website looks like and how it behaves.
The HTML code is formatted into a template or a format. Once this is done, it is displayed on the users browser.
A website is known as a website when it is hosted

main: mem per token = 14434244 bytes
main:     load time =  1332.48 ms
main:   sample time =  1081.40 ms
main:  predict time = 31378.77 ms / 61.41 ms per token
main:    total time = 34036.74 ms
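
Several numbers in the log above can be cross-checked with simple arithmetic. The Python sketch below does that. The formulas are assumptions that happen to reproduce the printed values (in particular, that the key/value memory is stored as 32-bit floats, one key and one value buffer per layer); they are not quoted from the source code.

```python
# Hyperparameters as printed by llama_model_load in the run above.
n_embd, n_mult, n_head, n_layer, n_ctx = 4096, 256, 32, 32, 512

# Per-head dimension: matches n_rot = 128.
n_rot = n_embd // n_head

# Feed-forward width: 2/3 of 4*n_embd, rounded up to a multiple of n_mult.
n_ff = ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

# One memory slot per layer and context position: matches n_mem = 16384.
n_mem = n_layer * n_ctx

# Keys + values, n_embd floats per slot, 4 bytes per float (assumed F32).
memory_size_mb = 2 * n_mem * n_embd * 4 / (1024 * 1024)

# Predict time divided by time per token gives the number of evaluations.
tokens_evaluated = round(31378.77 / 61.41)

print(n_rot, n_ff, n_mem, memory_size_mb, tokens_evaluated)
```

The last figure comes out at roughly 511, which is in line with the -n 512 passed on the command line.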

And here is another demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook:

https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4

Usage

Here are the steps for the LLaMA-7B model:

# build this repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# install Python dependencies
python3 -m pip install torch numpy sentencepiece

# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4-bits
./quantize.sh 7B

# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 128

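When running the larger models, make sure you have enough disk space to store all the intermediate files.

TODO: add model disk/mem requirements

Interactive mode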

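If you want a more ChatGPT-like experience, you can run in interactive mode by passing -i as a parameter. In this mode, you can always interrupt generation by pressing Ctrl+C and enter one or more lines of text, which will be converted into tokens and appended to the current context. You can also specify a reverse prompt with the parameter -r "reverse prompt string". This will result in user input being prompted whenever the exact tokens of the reverse prompt string are encountered in the generation. A typical use is to use a prompt which makes LLaMa emulate a chat between multiple users, say Alice and Bob, and pass -r "Alice:".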
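
The reverse prompt check can be pictured as a suffix match on the generated token stream: after every new token, compare the tail of the output with the tokens of the reverse prompt, and hand control back to the user on a match. The Python sketch below illustrates the idea only. The function names and the token ids are hypothetical, and the actual loop in main may be organized differently.

```python
def ends_with_reverse_prompt(generated_tokens, reverse_prompt_tokens):
    """True if the output currently ends with the exact reverse prompt tokens."""
    n = len(reverse_prompt_tokens)
    return n > 0 and generated_tokens[-n:] == reverse_prompt_tokens


def generate(next_token, reverse_prompt_tokens, max_tokens):
    """Generate until the reverse prompt shows up or max_tokens is reached."""
    out = []
    for _ in range(max_tokens):
        out.append(next_token(out))
        if ends_with_reverse_prompt(out, reverse_prompt_tokens):
            break  # this is where the user would be asked for input
    return out


# Stand-in for the model: replays a fixed list of made-up token ids.
stream = iter([5, 9, 12, 7, 3, 99])
tokens = generate(lambda context: next(stream), [7, 3], max_tokens=10)
print(tokens)
```

Because the match is on exact tokens, a reverse prompt that tokenizes differently in context than it does on its own would not trigger in a scheme like this.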

Here is an example few-shot interaction, invoked with the command:

./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 256 --repeat_penalty 1.0 --color -i -r "User:" \
                                           -p \
"Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:"

Note the use of --color to distinguish between user input and generated text.


Limitations

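  • I don't know yet how much the quantization affects the quality of the generated text
  • Probably the token sampling can be improved
  • The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder, there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't know how to utilize it properly. But in any case, you can even disable it with LLAMA_NO_ACCELERATE=1 make and the performance will be the same, since no BLAS calls are invoked by the current implementation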
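
For context on the token sampling point: the example run prints temp = 0.8, top_k = 40 and top_p = 0.95. The Python sketch below shows the generic textbook combination of temperature scaling, top-k filtering and top-p (nucleus) filtering. It is an assumption about what such parameters usually mean, not a copy of this repo's sampler, and the function name is hypothetical.

```python
import math
import random


def sample_top_k_top_p(logits, temp=0.8, top_k=40, top_p=0.95, rng=random):
    """Pick a token id from raw logits; temp must be greater than zero."""
    # Temperature: divide the logits, then keep the top_k largest.
    scaled = sorted(((l / temp, i) for i, l in enumerate(logits)), reverse=True)
    scaled = scaled[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[0][0]
    weights = [math.exp(l - peak) for l, _ in scaled]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    ids, probs, cumulative = [], [], 0.0
    for w, (_, token_id) in zip(weights, scaled):
        p = w / total
        ids.append(token_id)
        probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    # Draw one id in proportion to the remaining probabilities.
    return rng.choices(ids, weights=probs, k=1)[0]


print(sample_top_k_top_p([2.0, 1.0, 0.5, -1.0], rng=random.Random(0)))
```

Passing a seeded random.Random instance as rng makes the draw reproducible, which helps when comparing sampler variants.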

Contributing

  • Contributors can open PRs
  • Collaborators can push to branches in the llama.cpp repo
  • Collaborators will be invited based on contributions

Coding guidelines

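  • Avoid adding third-party dependencies, extra files, extra headers, etc.
  • Always consider cross-compatibility with other operating systems and architectures
  • Avoid fancy looking modern STL constructs, use basic for loops, avoid templates, keep it simple
  • There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
  • Clean up any trailing whitespace, use 4 spaces for indentation, brackets on the same line, int * var
  • Look at the good first issues for tasks to start with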