NExT-GPT - Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

Created at: 2023-08-30 11:34:11
Language: Python
License: BSD-3-Clause

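NExT-GPT: Any-to-Any Multimodal LLM

Shengqiong Wu, Hao Fei*, Leigang Qu, Wei Ji, and Tat-Seng Chua. (*Correspondence)

NExT++, School of Computing, National University of Singapore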



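This repository hosts the code, data, and model weights of NExT-GPT, the first end-to-end MM-LLM that perceives input and generates output in arbitrary combinations (any-to-any) of text, image, video, audio, and beyond.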


🎉 News

  • [x] [2023.09.15] 🚀🚀 Released the code of NExT-GPT, version 7b_tiva_v0.

👉 TODO

  • [ ] Release checkpoints (projection layers).
  • [ ] Release MosIT data.
  • [ ] Update NExT-GPT with more types and sizes of LLMs.
  • [ ] Empower NExT-GPT with more input and output modalities.
  • [ ] ...

Example Demos

Here we showcase examples generated by NExT-GPT. For more examples, please visit the webpage or the online live demo.

https://github.com/NExT-GPT/NExT-GPT/assets/18722770/0c2b3d88-a533-4899-ab44-65580fe54538

https://github.com/NExT-GPT/NExT-GPT/assets/18722770/eb1319a6-38aa-4546-a96e-163207e7de93

https://github.com/NExT-GPT/NExT-GPT/assets/18722770/36bec0ad-9bad-4bcf-bc37-92b028f1bc6a

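Brief Introduction

NExT-GPT is built on top of an existing pre-trained LLM, multimodal encoders, and SoTA diffusion models, with sufficient end-to-end instruction tuning.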


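  • Multimodal Encoding Stage. Established encoders encode inputs in various modalities, and a projection layer maps these representations into language-like representations that the LLM can comprehend.
  • LLM Understanding and Reasoning Stage. An existing open-source LLM serves as the core that processes the input information for semantic understanding and reasoning. The LLM not only generates text tokens directly but also produces unique "modality signal" tokens, which instruct the decoding layers whether to output multimodal content and, if so, in which modality.
  • Multimodal Generation Stage. Receiving the multimodal signals with specific instructions from the LLM (if any), the Transformer-based output projection layers map the signal token representations into representations that the subsequent multimodal decoders can understand.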
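The three stages can be pictured as a simple data flow. The following is a toy sketch only, with every stage stubbed out by string operations; the function names and the signal-token strings ([IMG], [VID], [AUD]) are hypothetical and are not taken from the NExT-GPT code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical signal-token strings; the real vocabulary is defined by the NExT-GPT code.
SIGNAL_TOKENS = {"image": "[IMG]", "video": "[VID]", "audio": "[AUD]"}


@dataclass
class LLMOutput:
    text: str
    signal_modality: Optional[str]  # None means text only, no decoder is called


def encode_and_project(inputs: Dict[str, str]) -> List[str]:
    """Stage 1 (stub): encode each input and project it into language-like tokens."""
    return [f"<{modality}:{content}>" for modality, content in inputs.items()]


def llm_reason(prompt_tokens: List[str], wants: Optional[str]) -> LLMOutput:
    """Stage 2 (stub): the LLM emits text and, optionally, a modality signal token."""
    text = "response to " + " ".join(prompt_tokens)
    if wants is not None:
        text += " " + SIGNAL_TOKENS[wants]
    return LLMOutput(text=text, signal_modality=wants)


def decode(output: LLMOutput) -> Dict[str, str]:
    """Stage 3 (stub): route the signal through an output projection to a decoder."""
    result = {"text": output.text}
    if output.signal_modality is not None:
        result[output.signal_modality] = f"generated-{output.signal_modality}"
    return result


def run_pipeline(inputs: Dict[str, str], wants: Optional[str] = None) -> Dict[str, str]:
    """Chain the three stages; `wants` stands in for the LLM's own decision."""
    return decode(llm_reason(encode_and_project(inputs), wants))
```

Calling run_pipeline({"text": "draw a cat"}, wants="image") returns both a text entry and an image entry, while omitting wants yields text only, which mirrors the "whether and what modality" role of the signal tokens.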

For more technical details, please refer to the paper.


Getting Started



1. Code Structure

├── figures
├── data
│   ├── T-X_pair_data  
│   │   ├── audiocap                      # text-audio pairs data
│   │   │   ├── audios                    # audio files
│   │   │   └── audiocap.json             # the audio captions
│   │   ├── cc3m                          # text-image pairs data
│   │   │   ├── images                    # image files
│   │   │   └── cc3m.json                 # the image captions
│   │   └── webvid                        # text-video pairs data
│   │   │   ├── videos                    # video files
│   │   │   └── webvid.json               # the video captions
│   ├── IT_data                           # instruction data
│   │   ├── T+X-T_data                    # text+[image/audio/video] to text instruction data
│   │   │   ├── alpaca                    # textual instruction data
│   │   │   ├── llava                     # visual instruction data
│   │   ├── T-T+X                         # synthesized text to text+[image/audio/video] instruction data
│   │   └── MosIT                         # Modality-switching Instruction Tuning instruction data
├── code
│   ├── config
│   │   ├── base.yaml                     # the model configuration 
│   │   ├── stage_1.yaml                  # enc-side alignment training configuration
│   │   ├── stage_2.yaml                  # dec-side alignment training configuration
│   │   └── stage_3.yaml                  # instruction-tuning configuration
│   ├── dsconfig
│   │   ├── stage_1.json                  # deepspeed configuration for enc-side alignment training
│   │   ├── stage_2.json                  # deepspeed configuration for dec-side alignment training
│   │   └── stage_3.json                  # deepspeed configuration for instruction-tuning training
│   ├── dataset
│   │   ├── base_dataset.py
│   │   ├── cc3m_dataset.py               # process and load text-image pair dataset
│   │   ├── audiocap_dataset.py           # process and load text-audio pair dataset
│   │   ├── webvid_dataset.py             # process and load text-video pair dataset
│   │   └── instruction_dataset.py        # process and load instruction pair dataset
│   ├── model                     
│   │   ├── ImageBind                     # the code from ImageBind Model
│   │   ├── common
│   │   ├── anyToImageVideoAudio.py       # the main model file
│   │   ├── agent.py
│   │   ├── modeling_llama.py
│   │   ├── custom_ad.py                  # the audio diffusion 
│   │   ├── custom_sd.py                  # the image diffusion
│   │   ├── custom_vd.py                  # the video diffusion
│   │   ├── layers.py                     # the output projection layers
│   │   └── ...  
│   ├── scripts
│   │   ├── train.sh                      # training NExT-GPT script
│   │   └── app.sh                        # deploying demo script
│   ├── header.py
│   ├── process_embeddings.py             # precompute the captions embeddings
│   ├── train.py                          # training
│   ├── inference.py                      # inference
│   ├── demo_app.py                       # deploy Gradio demonstration 
│   └── ...
├── ckpt                           
│   ├── delta_ckpt                        # tunable NExT-GPT params
│   │   ├── nextgpt         
│   │   │   ├── 7b_tiva_v0                # the directory to save the log file
│   │   │   │   ├── log                   # the logs
│   └── ...       
│   ├── pretrained_ckpt                   # frozen params of pretrained modules
│   │   ├── imagebind_ckpt
│   │   │   ├──huge                       # version
│   │   │   │   └──imagebind_huge.pth
│   │   ├── vicuna_ckpt
│   │   │   ├── 7b_v0                     # version
│   │   │   │   ├── config.json
│   │   │   │   ├── pytorch_model-00001-of-00002.bin
│   │   │   │   ├── tokenizer.model
│   │   │   │   └── ...
├── LICENCE.md
├── README.md
└── requirements.txt
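Before training, it can help to confirm that the main directories from the tree above are in place. The sketch below is a hypothetical helper (missing_dirs is not part of the repository) that checks a handful of the listed folders relative to the repository root.

```python
from pathlib import Path
from typing import List

# Directories taken from the tree above; paths are relative to the repository root.
EXPECTED_DIRS = [
    "data/T-X_pair_data",
    "data/IT_data",
    "code/config",
    "code/dsconfig",
    "ckpt/delta_ckpt",
    "ckpt/pretrained_ckpt",
]


def missing_dirs(repo_root: str, expected: List[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing when the layout is complete.
    for rel in missing_dirs("."):
        print(f"missing: {rel}")
```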

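2. Environment Preparation [Back to Top]

Please first clone the repo and install the required environment, which can be done by running the following commands: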

conda create -n nextgpt python=3.8

conda activate nextgpt

# CUDA 11.6
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

git clone https://github.com/NExT-GPT/NExT-GPT.git
cd NExT-GPT

pip install -r requirements.txt
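After installation, the pinned versions can be verified from Python. This is a minimal sketch: the pins are copied from the conda command above, the helper names are hypothetical, and it assumes the conda packages expose their metadata under the distribution names torch, torchvision, and torchaudio.

```python
from importlib import metadata
from typing import Dict, Optional

# Pins copied from the conda install command above.
PINNED = {"torch": "1.13.1", "torchvision": "0.14.1", "torchaudio": "0.13.1"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(installed: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Map each package whose version differs from its pin to a short message."""
    problems = {}
    for package, pinned in PINNED.items():
        found = installed.get(package)
        # Builds may carry a local suffix such as "+cu116"; compare the public part only.
        if found is None or found.split("+")[0] != pinned:
            problems[package] = f"expected {pinned}, found {found}"
    return problems


if __name__ == "__main__":
    report = version_mismatches({name: installed_version(name) for name in PINNED})
    print(report or "all pinned packages match")
```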

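3. Training/Adapting NExT-GPT on Your Own

3.1. Preparing Pre-trained Checkpoints [Back to Top]

NExT-GPT is trained on top of the following excellent existing models. Please follow the instructions to prepare the checkpoints.

3.2. Preparing Datasets [Back to Top]

Please download the following datasets used for model training:

A) T-X pairs data

B) Instruction data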

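3.3. Precomputing Embeddings [Back to Top]

In decoding-side alignment training, we minimize the distance between the representations of the signal tokens and the captions. To save time and memory, we precompute the text embeddings of the image, audio, and video captions using the text encoder of the respective diffusion models.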
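To illustrate why caching helps: the caption side of the objective is fixed, so it can be computed once and reused at every step. The sketch below assumes mean squared error as the distance, which this README does not specify, and uses short stand-in vectors instead of real encoder outputs.

```python
from typing import Sequence


def mse_distance(signal_repr: Sequence[float], caption_embed: Sequence[float]) -> float:
    """Mean squared distance between a projected signal-token representation
    and a precomputed caption embedding of the same length."""
    if len(signal_repr) != len(caption_embed):
        raise ValueError("representations must have the same length")
    return sum((s - c) ** 2 for s, c in zip(signal_repr, caption_embed)) / len(signal_repr)


# The caption embedding is computed once, offline, and reused during training.
cached_caption = [0.0, 1.0, 2.0]    # stand-in for a stored text-encoder output
projected_signal = [0.0, 1.0, 4.0]  # stand-in for the output projection result
loss = mse_distance(projected_signal, cached_caption)
```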

Please run this command before the following training of NExT-GPT; the produced embedding files will be saved in [./data/embed].

cd ./code/
python process_embeddings.py ../data/T-X_pair_data/cc3m/cc3m.json image ../data/embed/ runwayml/stable-diffusion-v1-5

Notes on the arguments:

  • args[1]: path of the caption file;
  • args[2]: modality, which can be image, video, or audio;
  • args[3]: saving path of the embedding file;
  • args[4]: name of the corresponding pre-trained diffusion model.
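The four positional arguments can also be assembled programmatically, for example when looping over modalities. This is a sketch with a hypothetical helper, embedding_command; only the image example below comes from this README, and the caption files and diffusion model names for video and audio must be supplied by you.

```python
from typing import List

MODALITIES = ("image", "video", "audio")


def embedding_command(caption_file: str, modality: str, save_dir: str, model_name: str) -> List[str]:
    """Build the argv list for process_embeddings.py in the positional order listed above."""
    if modality not in MODALITIES:
        raise ValueError(f"modality must be one of {MODALITIES}, got {modality!r}")
    return ["python", "process_embeddings.py", caption_file, modality, save_dir, model_name]


# Reproduces the image example from this section.
cmd = embedding_command(
    "../data/T-X_pair_data/cc3m/cc3m.json",
    "image",
    "../data/embed/",
    "runwayml/stable-diffusion-v1-5",
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, cwd="./code", check=True) to launch the script from the repository root.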

3.4. Training NExT-GPT [Back to Top]

First of all, please refer to the base configuration file [./code/config/base.yaml] for the basic system settings of all modules.

Then, the training of NExT-GPT starts with this script:

cd ./code
bash scripts/train.sh

The script specifies the following command:

deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 1 \
    --dataset cc3m \
    --data_path ../data/T-X_pair_data/cc3m/cc3m.json \
    --mm_root_path ../data/T-X_pair_data/cc3m/images/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/

where the key arguments are:

  • --include: localhost:0 selects GPU (CUDA device) number 0 for deepspeed.
  • --stage: the training stage.
  • --dataset: the name of the dataset used for training.
  • --data_path: the path of the training data file.
  • --mm_root_path: the path of the image/video/audio files.
  • --embed_path: the path of the text embedding files.
  • --save_path: the directory in which the trained delta weights are saved. This directory is created automatically.
  • --log_path: the directory in which the log file is saved.
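For the T-X pair datasets, the arguments above follow a regular pattern, so the command can be generated per dataset. The following sketch uses a hypothetical helper, train_command; the file and folder names come from the data tree in Section 1, the remaining values are copied from the command above, and it covers only the two alignment stages that use pair data.

```python
from typing import Dict, List, Tuple

# (caption file, media folder) per dataset, following the data tree in Section 1.
PAIR_DATA: Dict[str, Tuple[str, str]] = {
    "cc3m": ("cc3m.json", "images"),
    "webvid": ("webvid.json", "videos"),
    "audiocap": ("audiocap.json", "audios"),
}


def train_command(stage: int, dataset: str, gpu: int = 0,
                  save_path: str = "../ckpt/delta_ckpt/nextgpt/7b/") -> List[str]:
    """Build the deepspeed argv list for an alignment stage on a T-X pair dataset."""
    if stage not in (1, 2):
        raise ValueError("this sketch only covers the pair-data stages 1 and 2")
    if dataset not in PAIR_DATA:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {sorted(PAIR_DATA)}")
    caption_file, media_dir = PAIR_DATA[dataset]
    root = f"../data/T-X_pair_data/{dataset}"
    save_path = save_path.rstrip("/") + "/"
    return [
        "deepspeed", "--include", f"localhost:{gpu}",
        "--master_addr", "127.0.0.1", "--master_port", "28459", "train.py",
        "--model", "nextgpt",
        "--stage", str(stage),
        "--dataset", dataset,
        "--data_path", f"{root}/{caption_file}",
        "--mm_root_path", f"{root}/{media_dir}/",
        "--embed_path", "../data/embed/",
        "--save_path", save_path,
        "--log_path", f"{save_path}log/",
    ]
```

For example, train_command(2, "webvid", gpu=1) yields the decoding-side alignment command for the text-video data on GPU 1.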

The whole NExT-GPT training involves 3 steps:

  • Step-1: Encoding-side LLM-centric Multimodal Alignment. This stage trains the input projection layer while freezing ImageBind, the LLM, and the output projection layer.

    Just run the above train.sh script by setting:

    • --stage 1
    • --dataset x, where x is one of [cc3m, webvid, audiocap]
    • --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
    • --mm_root_path .../.../x, where x is one of [images, audios, videos]

    Also refer to the running config file [./code/config/stage_1.yaml] and the deepspeed config file [./code/dsconfig/stage_1.json] for more step-wise configurations.

  • Step-2: Decoding-side Instruction-following Alignment. This stage trains the output projection layers while freezing ImageBind, the LLM, and the input projection layers.

    Just run the above train.sh script by setting:

    • --stage 2
    • --dataset x, where x is one of [cc3m, webvid, audiocap]
    • --data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
    • --mm_root_path .../.../x, where x is one of [images, audios, videos]

    Also refer to the running config file [./code/config/stage_2.yaml] and the deepspeed config file [./code/dsconfig/stage_2.json] for more step-wise configurations.

  • Step-3: Instruction Tuning. This stage instruction-tunes 1) the LLM via LoRA, 2) the input projection layer, and 3) the output projection layer on the instruction dataset.

    Just run the above train.sh script with --stage 3 and the instruction data.

    Also refer to the running config file [./code/config/stage_3.yaml] and the deepspeed config file [./code/dsconfig/stage_3.json] for more step-wise configurations.
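The three steps differ mainly in which components receive updates. The table-like sketch below summarizes the descriptions above; the module names are illustrative labels rather than identifiers from the NExT-GPT code, and ImageBind is left out of stage 3 because the description does not mention it.

```python
from typing import Dict, List

# Status per module and stage, as described in the three steps above.
# "lora" means the module is tuned through LoRA adapters.
STAGE_PLAN: Dict[int, Dict[str, str]] = {
    1: {"imagebind": "frozen", "llm": "frozen",
        "input_projection": "train", "output_projection": "frozen"},
    2: {"imagebind": "frozen", "llm": "frozen",
        "input_projection": "frozen", "output_projection": "train"},
    3: {"llm": "lora", "input_projection": "train", "output_projection": "train"},
}


def updated_modules(stage: int) -> List[str]:
    """Return the modules that receive updates in the given stage, sorted by name."""
    return sorted(name for name, status in STAGE_PLAN[stage].items() if status != "frozen")


for stage in sorted(STAGE_PLAN):
    print(f"stage {stage}: {', '.join(updated_modules(stage))}")
```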

4. Running NExT-GPT System [Back to Top]

4.1. Preparing Checkpoints

First, load the pre-trained NExT-GPT system.

4.2. Deploying Gradio Demo

Upon completion of the checkpoint loading, you can run the demo locally via:

cd ./code
bash scripts/app.sh

Specify the key argument as:

  • --nextgpt_ckpt_path: the path of the pre-trained NExT-GPT params.

Contact

For any questions or feedback, feel free to contact Shengqiong Wu and Hao Fei.

Citation

If you find NExT-GPT useful in your research or applications, please kindly cite:

@article{wu2023nextgpt,
  title={NExT-GPT: Any-to-Any Multimodal LLM},
  author={Shengqiong Wu and Hao Fei and Leigang Qu and Wei Ji and Tat-Seng Chua},
  journal = {CoRR},
  volume = {abs/2309.05519},
  year={2023}
}

Acknowledgements

You may refer to the related work that serves as the foundation for our framework and code repository: Vicuna, ImageBind, Stable Diffusion, AudioLDM, and Zeroscope. We also partially draw inspiration from PandaGPT, VPGTrans, GILL, CoDi, Video-LLaMA, and MiniGPT-4. We thank the authors for their wonderful work.

License Notices

This repository is under BSD 3-Clause License. NExT-GPT is a research project intended for non-commercial use only. One must NOT use the code of NExT-GPT for any illegal, harmful, violent, racist, or sexual purposes. One is strictly prohibited from engaging in any activity that will potentially violate these guidelines. Any potential commercial use of this code should be approved by the authors.