serge - A web interface for chatting with Alpaca through llama.cpp. Fully dockerized, with an easy-to-use API.

Created at: 2023-03-19 16:33:29
Language: Python
License: MIT

Serge - LLaMa made easy 🦙


A chat interface based on llama.cpp for running Alpaca models. Entirely self-hosted, no API keys needed. The smallest model (7B) fits in roughly 4GB of RAM and runs on the CPU.

  • SvelteKit frontend
  • MongoDB for storing chat history & parameters
  • FastAPI + beanie for the API, wrapping calls to llama.cpp


Getting started

Setting up Serge is very easy. TL;DR for running it with Alpaca 7B:

git clone
cd serge

docker compose up -d
docker compose exec serge python3 /usr/src/app/api/utils/ tokenizer 7B


⚠️ For cloning on Windows, use git clone --config core.autocrlf=input.

Make sure you have Docker Desktop installed, WSL2 configured, and enough free RAM to run models (see below).


Instructions for setting up Serge on Kubernetes can be found in the wiki.

Using Serge

(You can pass several sizes at once, e.g. 7B 13B 30B, as arguments to the download script to fetch multiple models.)

Then just go to http://localhost:8008/ and you're good to go!

The API is available at http://localhost:8008/api/
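As a quick sanity check after `docker compose up -d`, the following minimal sketch tests whether anything is answering at the API address given above. It uses only the Python standard library. The helper names `api_url` and `is_reachable` are illustrative and not part of Serge, and no specific endpoint is assumed beyond the base URL.

```python
import urllib.error
import urllib.request

# Base URL taken from the README; adjust it if you mapped a different port.
API_BASE = "http://localhost:8008/api/"


def api_url(path: str = "") -> str:
    """Join a relative path onto the API base URL (illustrative helper)."""
    return API_BASE + path.lstrip("/")


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a server answers at all, even with an HTTP error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (for example with 404), so it is up.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and so on.
        return False
```

Calling `is_reachable(api_url())` should return `True` once the container is up, and `False` if nothing is listening on port 8008.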


Currently only the 7B, 13B and 30B Alpaca models are supported. There's a script for downloading them inside the container, described above.

If you have existing weights from another project you can add them to the serge_weights volume using docker cp.
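A minimal sketch of scripting that copy from Python is shown here. It only wraps the standard `docker cp SRC CONTAINER:DEST` form. The container name and the destination directory are deliberately left as parameters, because this README does not say where the serge_weights volume is mounted inside the container; `docker compose ps` and `docker inspect` can tell you both. The function names are hypothetical.

```python
import subprocess
from typing import List


def docker_cp_command(local_path: str, container: str, dest_dir: str) -> List[str]:
    """Build a `docker cp SRC CONTAINER:DEST` command as an argument list."""
    return ["docker", "cp", local_path, f"{container}:{dest_dir}"]


def copy_weights(local_path: str, container: str, dest_dir: str) -> None:
    """Run the copy; raises CalledProcessError if docker reports a failure."""
    subprocess.run(docker_cp_command(local_path, container, dest_dir), check=True)
```

Building the command as a list (rather than a shell string) avoids quoting problems when the local path contains spaces.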

⚠️ A note on memory usage

llama.cpp will simply crash if you don't have enough available memory for your model.

  • 7B requires about 4.5GB of free RAM
  • 13B requires about 12GB free
  • 30B requires about 20GB free
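To avoid finding this out through a crash, a rough pre-flight check along these lines can help. This is a sketch, not part of Serge: the figures are copied from the list above, the function names are made up for illustration, and `free_ram_gb` relies on `os.sysconf`, which is only available on some Unix-like systems (it returns `None` elsewhere). Run it in the environment where the container actually executes, since Docker Desktop and WSL2 may expose less memory than the host has.

```python
import os
from typing import List, Optional

# Approximate free-RAM requirements in GB, copied from the list above.
REQUIRED_GB = {"7B": 4.5, "13B": 12.0, "30B": 20.0}


def models_that_fit(free_gb: float) -> List[str]:
    """Return the model sizes whose stated requirement fits in free_gb."""
    return [name for name, need in REQUIRED_GB.items() if need <= free_gb]


def free_ram_gb() -> Optional[float]:
    """Best-effort free physical memory in GB, or None if it cannot be read."""
    try:
        pages = os.sysconf("SC_AVPHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are missing on this platform.
        return None
    return pages * page_size / 1024**3
```

For example, `models_that_fit(13)` returns `["7B", "13B"]`, since 30B needs about 20GB.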


Feel free to join the Discord if you need help with the setup.

What's next

  • [x] Front-end to interface with the API
  • [x] Pass model parameters when creating a chat
  • [ ] User profiles & authentication
  • [ ] Different prompt options
  • [ ] LangChain integration with a custom LLM
  • [ ] Support for other llama models, quantization, etc.

And a lot more!