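NExT++, School of Computing, National University of Singapore
This repository hosts the code, data, and model weights of NExT-GPT, the first end-to-end MM-LLM that perceives input and generates output in arbitrary combinations (any-to-any) of text, image, video, audio, and beyond.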
7b_tiva_v0
Here we showcase examples generated from NExT-GPT. For more examples, kindly visit the webpage or the online live demo.
https://github.com/NExT-GPT/NExT-GPT/assets/18722770/0c2b3d88-a533-4899-ab44-65580fe54538
https://github.com/NExT-GPT/NExT-GPT/assets/18722770/eb1319a6-38aa-4546-a96e-163207e7de93
https://github.com/NExT-GPT/NExT-GPT/assets/18722770/36bec0ad-9bad-4bcf-bc37-92b028f1bc6a
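NExT-GPT is built on top of an existing pre-trained LLM, a multimodal encoder, and SoTA diffusion models, with sufficient end-to-end instruction tuning.
For more technical details, kindly refer to the paper.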
├── figures
├── data
│   ├── T-X_pair_data
│   │   ├── audiocap  # text-audio pairs data
│   │   │   ├── audios  # audio files
│   │   │   └── audiocap.json  # the audio captions
│   │   ├── cc3m  # text-image pairs data
│   │   │   ├── images  # image files
│   │   │   └── cc3m.json  # the image captions
│   │   └── webvid  # text-video pairs data
│   │   │   ├── videos  # video files
│   │   │   └── webvid.json  # the video captions
│   ├── IT_data  # instruction data
│   │   ├── T+X-T_data  # text+[image/audio/video] to text instruction data
│   │   │   ├── alpaca  # textual instruction data
│   │   │   ├── llava  # visual instruction data
│   │   ├── T-T+X  # synthesized text to text+[image/audio/video] instruction data
│   │   └── MosIT  # Modality-switching Instruction Tuning instruction data
├── code
│   ├── config
│   │   ├── base.yaml  # the model configuration
│   │   ├── stage_1.yaml  # enc-side alignment training configuration
│   │   ├── stage_2.yaml  # dec-side alignment training configuration
│   │   └── stage_3.yaml  # instruction-tuning configuration
│   ├── dsconfig
│   │   ├── stage_1.json  # deepspeed configuration for enc-side alignment training
│   │   ├── stage_2.json  # deepspeed configuration for dec-side alignment training
│   │   └── stage_3.json  # deepspeed configuration for instruction-tuning training
│   ├── dataset
│   │   ├── base_dataset.py
│   │   ├── cc3m_datast.py  # process and load text-image pair dataset
│   │   ├── audiocap_datast.py  # process and load text-audio pair dataset
│   │   ├── webvid_dataset.py  # process and load text-video pair dataset
│   │   └── instruction_dataset.py  # process and load instruction pair dataset
│   ├── model
│   │   ├── ImageBind  # the code from ImageBind Model
│   │   ├── common
│   │   ├── anyToImageVideoAudio.py  # the main model file
│   │   ├── agent.py
│   │   ├── modeling_llama.py
│   │   ├── custom_ad.py  # the audio diffusion
│   │   ├── custom_sd.py  # the image diffusion
│   │   ├── custom_vd.py  # the video diffusion
│   │   ├── layers.py  # the output projection layers
│   │   └── ...
│   ├── scripts
│   │   ├── train.sh  # training NExT-GPT script
│   │   └── app.sh  # deploying demo script
│   ├── header.py
│   ├── process_embeddings.py  # precompute the captions embeddings
│   ├── train.py  # training
│   ├── inference.py  # inference
│   ├── demo_app.py  # deploy Gradio demonstration
│   └── ...
├── ckpt
│   ├── delta_ckpt  # tunable NExT-GPT params
│   │   ├── nextgpt
│   │   │   ├── 7b_tiva_v0  # the directory to save the log file
│   │   │   │   ├── log  # the logs
│   └── ...
│   ├── pretrained_ckpt  # frozen params of pretrained modules
│   │   ├── imagebind_ckpt
│   │   │   ├── huge  # version
│   │   │   │   └── imagebind_huge.pth
│   │   ├── vicuna_ckpt
│   │   │   ├── 7b_v0  # version
│   │   │   │   ├── config.json
│   │   │   │   ├── pytorch_model-00001-of-00002.bin
│   │   │   │   ├── tokenizer.model
│   │   │   │   └── ...
├── LICENCE.md
├── README.md
└── requirements.txt
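The data part of this layout can be created up front so that downloads have a place to land. The following is a minimal sketch, not part of the repository: the helper name `create_data_skeleton` is hypothetical, and the directory list only repeats paths that appear in the tree.

```python
from pathlib import Path

# Data directories named in the repository tree, relative to the repo root.
DATA_DIRS = [
    "data/T-X_pair_data/audiocap/audios",
    "data/T-X_pair_data/cc3m/images",
    "data/T-X_pair_data/webvid/videos",
    "data/IT_data/T+X-T_data/alpaca",
    "data/IT_data/T+X-T_data/llava",
]


def create_data_skeleton(root):
    """Create the (empty) data directories under `root` and return their paths.

    Hypothetical helper for illustration; existing directories are left untouched.
    """
    created = []
    for rel in DATA_DIRS:
        path = Path(root) / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_data_skeleton(".")` from the repository root only creates folders; the caption files and media still have to be downloaded separately.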
Please first clone the repo and install the required environment, which can be done by running the following commands:
conda create -n nextgpt python=3.8
conda activate nextgpt
# CUDA 11.6
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
git clone https://github.com/NExT-GPT/NExT-GPT.git
cd NExT-GPT
pip install -r requirements.txt
NExT-GPT is trained based on the following excellent existing models. Please follow the instructions to prepare the checkpoints.
ImageBind is the unified image/video/audio encoder. The pre-trained checkpoint can be downloaded from here with version huge. Afterward, put the imagebind_huge.pth file at [./ckpt/pretrained_ckpt/imagebind_ckpt/huge].
Vicuna: first prepare the LLaMA by following the instructions [here]. Then put the pre-trained model at [./ckpt/pretrained_ckpt/vicuna_ckpt/].
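Image Diffusion is used to generate images. NExT-GPT uses Stable Diffusion with version v1-5. (will be automatically downloaded)
Audio Diffusion is used to produce audio content. NExT-GPT employs AudioLDM with version l-full. (will be automatically downloaded)
Video Diffusion is used for video generation. We employ ZeroScope with version v2_576w. (will be automatically downloaded)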
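Since only the ImageBind and Vicuna checkpoints have to be placed by hand, a quick pre-flight check can catch a misplaced file before training starts. This is a sketch under assumptions: the helper `missing_checkpoints` is hypothetical, and the Vicuna file names are taken from the `7b_v0` entries in the directory tree.

```python
from pathlib import Path

# Manually placed checkpoint files, relative to the repository root.
# The Vicuna entries assume the 7b_v0 layout shown in the directory tree.
REQUIRED_CKPT_FILES = [
    "ckpt/pretrained_ckpt/imagebind_ckpt/huge/imagebind_huge.pth",
    "ckpt/pretrained_ckpt/vicuna_ckpt/7b_v0/config.json",
    "ckpt/pretrained_ckpt/vicuna_ckpt/7b_v0/tokenizer.model",
]


def missing_checkpoints(root="."):
    """Return the required checkpoint files that are absent under `root`."""
    return [rel for rel in REQUIRED_CKPT_FILES
            if not (Path(root) / rel).is_file()]
```

An empty list means every file in `REQUIRED_CKPT_FILES` was found; the diffusion models are not checked because they are downloaded automatically.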
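Please download the following datasets used for model training:
A) T-X pairs data
CC3M of text-image pairs: please follow the instruction [here]. Then put the data at [./data/T-X_pair_data/cc3m].
WebVid of text-video pairs: see the [instruction]. The file should be saved at [./data/T-X_pair_data/webvid].
AudioCap of text-audio pairs: see the [instruction]. Save the data in [./data/T-X_pair_data/audiocap].
B) Instruction data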
T+X-T
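LLaVA of the visual instruction data: download it from here, and then put it at [./data/IT_data/T+X-T_data/llava].
Alpaca of the textual instruction data: download it from here, and then put it at [./data/IT_data/T+X-T_data/alpaca/].
VideoChat: download the video instruction data here, and then put it at [./data/IT_data/T+X-T_data/videochat/].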
T-X+T
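Run the following commands to construct the T-X+T data from the T+X-T datasets prepared above; the output file is instruction_data.json.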
cd ./code/dataset/
python instruction_dataset.py
MosIT
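In decoding-side alignment training, we minimize the distance between the representation of the signal tokens and that of the captions. To save time and memory, we precompute the text embeddings of the image, audio, and video captions using the text encoder of the respective diffusion model.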
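To make the objective concrete, here is a minimal sketch of such a distance. The text does not name the metric, so mean squared error is an assumption, and `mse_distance` is a hypothetical helper written in plain Python rather than the tensor code used in training. The precomputed caption embedding plays the role of a fixed target.

```python
def mse_distance(signal_repr, caption_embed):
    """Mean squared error between two equally shaped 2-D lists (tokens x dims).

    `signal_repr` stands for the projected signal-token representations and
    `caption_embed` for the precomputed caption embedding (both assumed shapes).
    """
    if len(signal_repr) != len(caption_embed):
        raise ValueError("token counts differ")
    total, count = 0.0, 0
    for s_row, c_row in zip(signal_repr, caption_embed):
        if len(s_row) != len(c_row):
            raise ValueError("embedding dimensions differ")
        for s, c in zip(s_row, c_row):
            total += (s - c) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty input")
    return total / count
```

Because the target never changes during training, it only needs to be computed once per caption, which is what the precomputation step saves.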
Please run this command before the NExT-GPT training below; the produced embedding file will be saved at [./data/embed].
cd ./code/
python process_embeddings.py ../data/T-X_pair_data/cc3m/cc3m.json image ../data/embed/ runwayml/stable-diffusion-v1-5
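Note of arguments, in the order used above: the path of the caption file; the modality, which can be image, video, and audio; the saving path of the embedding file; and the name of the corresponding pre-trained diffusion model.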
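The same script has to be run once per modality. The sketch below only builds the three argument lists and executes nothing. The cc3m row repeats the command shown above; the model names for video and audio (`cerspense/zeroscope_v2_576w`, `cvssp/audioldm-l-full`) are assumptions derived from the versions listed earlier, so check them against the repository before use. `embedding_commands` is a hypothetical helper.

```python
# (caption file, modality, diffusion model name) for each pair dataset.
# Only the first row is taken from the README command; the other two model
# names are assumptions based on the ZeroScope and AudioLDM versions listed.
EMBED_JOBS = [
    ("../data/T-X_pair_data/cc3m/cc3m.json", "image",
     "runwayml/stable-diffusion-v1-5"),
    ("../data/T-X_pair_data/webvid/webvid.json", "video",
     "cerspense/zeroscope_v2_576w"),
    ("../data/T-X_pair_data/audiocap/audiocap.json", "audio",
     "cvssp/audioldm-l-full"),
]


def embedding_commands(out_dir="../data/embed/"):
    """Return one argv list per modality for process_embeddings.py."""
    return [
        ["python", "process_embeddings.py", captions, modality, out_dir, model]
        for captions, modality, model in EMBED_JOBS
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True, cwd="./code")` to launch the corresponding run.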
First of all, please refer to the base configuration file [./code/config/base.yaml] for the basic system settings of all modules.
Then, the training of NExT-GPT starts with this script:
cd ./code
bash scripts/train.sh
Specifying the command:
deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 1 \
    --dataset cc3m \
    --data_path ../data/T-X_pair_data/cc3m/cc3m.json \
    --mm_root_path ../data/T-X_pair_data/cc3m/images/ \
    --embed_path ../data/embed/ \
    --save_path ../ckpt/delta_ckpt/nextgpt/7b/ \
    --log_path ../ckpt/delta_ckpt/nextgpt/7b/log/
where the key arguments are:
--include: localhost:0 indicates GPU (CUDA device) number 0 for deepspeed.
--stage: training stage.
--dataset: the dataset name for training model.
--data_path: the data path for the training file.
--mm_root_path: the data path for the image/video/audio file.
--embed_path: the data path for the text embedding file.
--save_path: the directory which saves the trained delta weights. This directory will be automatically created.
--log_path: the directory which saves the log file.
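When switching between stages and datasets, it can be less error-prone to assemble this command programmatically. The following is a sketch, not the repository's launcher: `build_train_command` is a hypothetical helper, its defaults mirror the example command, and deriving the log directory from the save path is an assumption based on that example.

```python
import shlex


def build_train_command(stage, dataset, data_path, mm_root_path,
                        embed_path="../data/embed/",
                        save_path="../ckpt/delta_ckpt/nextgpt/7b/",
                        gpus="0", master_port=28459):
    """Return the deepspeed argv list for one training run (nothing is executed)."""
    # Assumption: logs live in a "log" folder under the save path, as in the example.
    log_path = save_path.rstrip("/") + "/log/"
    return [
        "deepspeed", "--include", f"localhost:{gpus}",
        "--master_addr", "127.0.0.1", "--master_port", str(master_port),
        "train.py",
        "--model", "nextgpt",
        "--stage", str(stage),
        "--dataset", dataset,
        "--data_path", data_path,
        "--mm_root_path", mm_root_path,
        "--embed_path", embed_path,
        "--save_path", save_path,
        "--log_path", log_path,
    ]


cmd = build_train_command(
    1, "cc3m",
    "../data/T-X_pair_data/cc3m/cc3m.json",
    "../data/T-X_pair_data/cc3m/images/",
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing `gpus="0,1"` yields `localhost:0,1`, which is how the deepspeed launcher selects several local devices.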
The whole NExT-GPT training involves 3 steps:
Step-1: Encoding-side LLM-centric Multimodal Alignment. This stage trains the input projection layer while freezing ImageBind, the LLM, and the output projection layer.
Just run the above train.sh script by setting:
--stage 1
--dataset x, where x varies from [cc3m, webvid, audiocap]
--data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
--mm_root_path .../.../x, where x varies from [images, audios, videos]
Also refer to the running config file [./code/config/stage_1.yaml] and the deepspeed config file [./code/dsconfig/stage_1.json] for more step-wise configurations.
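The dataset name, caption file, and media folder always go together, so the three settings can be looked up from one table. A minimal sketch follows, assuming the file and folder names shown in the directory tree; `pair_data_args` is a hypothetical helper.

```python
# dataset name -> (caption file, media folder), as listed in the directory tree.
PAIR_DATASETS = {
    "cc3m": ("cc3m.json", "images"),
    "webvid": ("webvid.json", "videos"),
    "audiocap": ("audiocap.json", "audios"),
}


def pair_data_args(dataset, root="../data/T-X_pair_data"):
    """Return the dataset-specific train.py arguments for a T-X pair dataset."""
    caption_file, media_dir = PAIR_DATASETS[dataset]
    return {
        "--dataset": dataset,
        "--data_path": f"{root}/{dataset}/{caption_file}",
        "--mm_root_path": f"{root}/{dataset}/{media_dir}/",
    }
```

For `cc3m` this reproduces the `--data_path` and `--mm_root_path` values of the example command; an unknown dataset name raises `KeyError`.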
Step-2: Decoding-side Instruction-following Alignment. This stage trains the output projection layers while freezing ImageBind, the LLM, and the input projection layers.
Just run the above train.sh script by setting:
--stage 2
--dataset x, where x varies from [cc3m, webvid, audiocap]
--data_path ../.../xxx.json, where xxx is the file name of the data in [./data/T-X_pair_data]
--mm_root_path .../.../x, where x varies from [images, audios, videos]
Also refer to the running config file [./code/config/stage_2.yaml] and the deepspeed config file [./code/dsconfig/stage_2.json] for more step-wise configurations.
Step-3: Instruction Tuning. This stage instruction-tunes 1) the LLM via LoRA, 2) the input projection layer, and 3) the output projection layer on the instruction dataset.
Just run the above train.sh script by setting:
--stage 3
--dataset instruction
--data_path ../.../xxx.json, where xxx is the file name of the data in [./data/IT_data/T+X-T_data], [./data/IT_data/T-T+X], or [./data/IT_data/MosIT_data]
--mm_root_path .../.../x, where x varies from [images, audios, videos]
Also refer to the running config file [./code/config/stage_3.yaml] and the deepspeed config file [./code/dsconfig/stage_3.json] for more step-wise configurations.
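The three steps differ mainly in which modules receive gradient updates. The table below restates that as data; it is a sketch for orientation, not code from the repository. The module labels are hypothetical, and keeping ImageBind frozen in stage 3 is an assumption, since the text only states what is tuned there.

```python
# Per-stage status of each module: "trained", "frozen", or "lora"
# (base weights fixed, low-rank adapters trained).
# Assumption: ImageBind stays frozen in stage 3; the README does not say so explicitly.
STAGE_PLAN = {
    1: {"imagebind": "frozen", "llm": "frozen",
        "input_projection": "trained", "output_projection": "frozen"},
    2: {"imagebind": "frozen", "llm": "frozen",
        "input_projection": "frozen", "output_projection": "trained"},
    3: {"imagebind": "frozen", "llm": "lora",
        "input_projection": "trained", "output_projection": "trained"},
}


def modules_with_status(stage, status):
    """Return the sorted module labels that have `status` in the given stage."""
    return sorted(name for name, s in STAGE_PLAN[stage].items() if s == status)
```

For example, `modules_with_status(3, "trained")` lists both projection layers, while `modules_with_status(3, "lora")` lists only the LLM.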
First, load the pre-trained NExT-GPT system.
Step-1: load the frozen parameters. Please refer to 3.1 Preparing Pre-trained Checkpoint.
Step-2: load the tunable parameters. Please put the NExT-GPT system in [./ckpt/delta_ckpt/nextgpt/7b_tiva_v0]. You may either 1) use the params you trained yourself, or 2) download our checkpoints from here. (We are still working hard on optimizing the system, and will release the params shortly.)
Upon completion of the checkpoint loading, you can run the demo locally via:
cd ./code
bash scripts/app.sh
Specifying the key arguments as:
--nextgpt_ckpt_path: the path of pre-trained NExT-GPT params.
For any questions or feedback, feel free to contact Shengqiong Wu and Hao Fei.
If you find NExT-GPT useful in your research or applications, please kindly cite:
@article{wu2023nextgpt,
  title={NExT-GPT: Any-to-Any Multimodal LLM},
  author={Shengqiong Wu and Hao Fei and Leigang Qu and Wei Ji and Tat-Seng Chua},
  journal = {CoRR},
  volume = {abs/2309.05519},
  year={2023}
}
You may refer to the related work that serves as the foundation for our framework and code repository: Vicuna, ImageBind, Stable Diffusion, AudioLDM, and Zeroscope. We also partially draw inspiration from PandaGPT, VPGTrans, GILL, CoDi, Video-LLaMA, and MiniGPT-4. We thank the authors for their wonderful work.
This repository is under BSD 3-Clause License. NExT-GPT is a research project intended for non-commercial use only. One must NOT use the code of NExT-GPT for any illegal, harmful, violent, racist, or sexual purposes. One is strictly prohibited from engaging in any activity that will potentially violate these guidelines. Any potential commercial use of this code should be approved by the authors.