privateGPT - Interact privately with your documents using the power of GPT, 100% private, no data leaks

Created at: 2023-05-02 17:15:31
Language: Python
License: Apache-2.0

privateGPT

Ask questions of your documents without an internet connection, using the power of LLMs. 100% private, no data leaves your execution environment at any point. You can ingest documents and ask questions without an internet connection!

Built with LangChain, GPT4All, LlamaCpp, Chroma and SentenceTransformers.

Demo

Environment Setup

In order to set your environment up to run the code here, first install all requirements:

pip3 install -r requirements.txt

Then, download the LLM model and place it in a directory of your choice:

  • LLM: defaults to ggml-gpt4all-j-v1.3-groovy.bin. If you prefer a different GPT4All-J compatible model, just download it and reference it in your .env file.

Rename example.env to .env and edit the variables appropriately.

MODEL_TYPE: supports LlamaCpp or GPT4All
PERSIST_DIRECTORY: is the folder you want your vectorstore in
MODEL_PATH: Path to your GPT4All or LlamaCpp supported LLM
MODEL_N_CTX: Maximum token limit for the LLM model
EMBEDDINGS_MODEL_NAME: SentenceTransformers embeddings model name (see https://www.sbert.net/docs/pretrained_models.html)
TARGET_SOURCE_CHUNKS: The amount of chunks (sources) that will be used to answer a question
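
For reference, a filled-in .env matching the defaults mentioned in this README might look like the following (the values are illustrative; adjust the paths to your setup):

MODEL_TYPE=GPT4All
PERSIST_DIRECTORY=db
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
MODEL_N_CTX=1000
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
TARGET_SOURCE_CHUNKS=4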

Note: because of the way langchain loads the SentenceTransformers embeddings, the first time you run the script it will require an internet connection to download the embeddings model itself.

Test dataset

This repo uses a transcript of the State of the Union address as an example.

Instructions for ingesting your own dataset

Put any and all of your files into the source_documents directory.

The supported extensions (each dispatched to a matching document loader, as sketched after the list) are:

  • .csv: CSV,
  • .docx: Word Document,
  • .doc: Word Document,
  • .enex: EverNote,
  • .eml: Email,
  • .epub: EPub,
  • .html: HTML File,
  • .md: Markdown,
  • .msg: Outlook Message,
  • .odt: Open Document Text,
  • .pdf: Portable Document Format (PDF),
  • .pptx: PowerPoint Document,
  • .ppt: PowerPoint Document,
  • .txt: Text file (UTF-8)
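
Under the hood, ingestion dispatches each file extension to a matching LangChain document loader. A minimal sketch of such a mapping (the loader chosen per extension here is an assumption for illustration, not necessarily the exact one ingest.py uses):

import os

from langchain.document_loaders import (
    CSVLoader,
    PDFMinerLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredWordDocumentLoader,
)

# Illustrative subset: map supported extensions to loader classes.
LOADER_MAPPING = {
    ".csv": CSVLoader,
    ".docx": UnstructuredWordDocumentLoader,
    ".html": UnstructuredHTMLLoader,
    ".md": UnstructuredMarkdownLoader,
    ".pdf": PDFMinerLoader,
    ".txt": TextLoader,
}

def load_single_document(path):
    # Pick the loader registered for this extension and parse the file.
    ext = os.path.splitext(path)[1].lower()
    loader = LOADER_MAPPING[ext](path)
    return loader.load()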

Run the following command to ingest all the data.

python ingest.py

The output will look like this:

Creating new vectorstore
Loading documents from source_documents
Loading new documents: 100%|██████████████████████| 1/1 [00:01<00:00,  1.73s/it]
Loaded 1 new documents from source_documents
Split into 90 chunks of text (max. 500 tokens each)
Creating embeddings. May take some minutes...
Using embedded DuckDB with persistence: data will be stored in: db
Ingestion complete! You can now run privateGPT.py to query your documents

It will create a db folder containing the local vectorstore. It will take 20-30 seconds per document, depending on the size of the document. You can ingest as many documents as you want, and all will be accumulated in the local embeddings database. If you want to start from an empty database, delete the db folder.

Note: during the ingest process no data leaves your local environment. You can ingest without an internet connection, except the first time you run the ingest script, when the embeddings model is downloaded.
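
To make the flow concrete, here is a minimal sketch of the ingestion pipeline, assuming the LangChain 0.0.x APIs this repo is built on (the file name and chunk parameters are illustrative):

from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load one document from source_documents (ingest.py picks a loader per extension).
documents = TextLoader("source_documents/state_of_the_union.txt").load()

# Split into small overlapping chunks, as in the "max. 500 tokens" message above.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = splitter.split_documents(documents)

# Embed locally with SentenceTransformers and persist the vectorstore in db/.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(texts, embeddings, persist_directory="db")
db.persist()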

Ask questions of your documents, locally!

In order to ask a question, run a command like:

python privateGPT.py

And wait for the script to require your input.

> Enter a query:

Hit enter. You'll need to wait 20-30 seconds (depending on your machine) while the LLM model consumes the prompt and prepares the answer. Once done, it will print the answer and the 4 sources it used as context from your documents; you can then ask another question without re-running the script. Just wait for the prompt again.

Note: you could turn off your internet connection, and the script inference would still work. No data gets out of your local environment.

Type exit to finish the script.

CLI

The script also supports optional command-line arguments to modify its behavior. You can see a full list of these arguments by running the command python privateGPT.py --help in your terminal.
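
For illustration, such optional arguments are typically declared with argparse. The flag below is a hypothetical example, not necessarily one the script defines; --help shows the real list:

import argparse

parser = argparse.ArgumentParser(description="privateGPT: query your documents locally.")
# Hypothetical flag for illustration only; run `python privateGPT.py --help`
# to see the arguments the script actually supports.
parser.add_argument("--hide-source", action="store_true",
                    help="Do not print the source chunks used for each answer.")
args = parser.parse_args()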

How does it work?

By selecting the right local models and leveraging the power of LangChain, you can run the entire pipeline locally, without any data leaving your environment, and with reasonable performance.

  • ingest.py uses LangChain tools to parse the document and create embeddings locally using HuggingFaceEmbeddings (SentenceTransformers). It then stores the result in a local vector database using the Chroma vector store.
  • privateGPT.py uses a local LLM based on GPT4All-J or LlamaCpp to understand questions and create answers (see the sketch after this list). The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the docs.
  • The GPT4All-J wrapper was introduced in LangChain 0.0.162.
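
A minimal sketch of that question-answering side, assuming the same LangChain 0.0.x APIs and the .env defaults shown earlier (this is not the literal contents of privateGPT.py):

from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import GPT4All
from langchain.vectorstores import Chroma

# Re-open the persisted vectorstore created by ingest.py.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory="db", embedding_function=embeddings)
retriever = db.as_retriever(search_kwargs={"k": 4})  # TARGET_SOURCE_CHUNKS

# Local GPT4All-J model; inference happens entirely on your machine.
llm = GPT4All(model="models/ggml-gpt4all-j-v1.3-groovy.bin", n_ctx=1000, backend="gptj")

qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True
)

while True:
    query = input("\nEnter a query: ")
    if query == "exit":
        break
    result = qa(query)  # returns the answer plus the source documents used
    print(result["result"])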

System Requirements

Python Version

To use this software, you must have Python 3.10 or later installed. The code will not run on earlier versions of Python.
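
You can check which version you have installed with:

python3 --version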

C++ Compiler

If you encounter an error while building a wheel during the pip install process, you may need to install a C++ compiler on your computer.

For Windows 10/11

To install a C++ compiler on Windows 10/11, follow these steps:

  1. Install Visual Studio 2022.
  2. Make sure the following components are selected:
    • Universal Windows Platform development
    • C++ CMake tools for Windows
  3. Download the MinGW installer from the MinGW website.
  4. Run the installer and select the gcc component.

Mac Running Intel

When running a Mac with Intel hardware (not M1), you may run into clang: error: the clang compiler does not support '-march=native' during pip install.

If so, set your archflags during pip install, e.g.: ARCHFLAGS="-arch x86_64" pip3 install -r requirements.txt

Disclaimer

This is a test project to validate the feasibility of a fully private solution for question answering using LLMs and vector embeddings. It is not production ready, and it is not meant to be used in production. The model selection is not optimized for performance, but for privacy; however, it is possible to use different models and vectorstores to improve performance.