A fork of so-vits-svc with realtime support and a greatly improved interface. Based on branch 4.0 (v1); the models are compatible.
- Partially adapts the code of QuickVC.
- Fixed misuse of ContentVec in the original repository.[1]
- More accurate pitch estimation using CREPE.
- Ready to use just by installing with pip.
- fairseq is not required.
This BAT file will automatically perform the steps described below.
Windows:
py -3.10 -m venv venv
venv\Scripts\activate
Linux/MacOS:
python3.10 -m venv venv
source venv/bin/activate
Anaconda:
conda create -n so-vits-svc-fork python=3.10 pip
conda activate so-vits-svc-fork
Installing without creating a virtual environment may cause a PermissionError
if Python is installed in Program Files, etc.
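The venv steps above can be sanity-checked by confirming which interpreter is active after activation (Linux/macOS; here plain python3 stands in for python3.10 if that exact binary is absent):

```shell
# Sketch: create and activate a virtual environment, then verify that
# "python" resolves inside it rather than to the system interpreter.
python3 -m venv venv
. venv/bin/activate
command -v python    # should resolve to ./venv/bin/python
```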
Install this via pip (or your favourite package manager that uses pip):
python -m pip install -U pip setuptools wheel
pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -U so-vits-svc-fork
If you have no NVIDIA GPU, omit pip install -U torch torchaudio --index-url https://download.pytorch.org/whl/cu118. MPS is probably supported.
For AMD GPUs on Linux, replace --index-url https://download.pytorch.org/whl/cu118 with --index-url https://download.pytorch.org/whl/rocm5.4.2. AMD GPUs are not supported on Windows (#120).
Please update this package regularly to get the latest features and bug fixes.
pip install -U so-vits-svc-fork
The GUI launches with the following command:
svcg
Realtime inference from the microphone:
svc vc
Inference from a file:
svc infer source.wav
Pretrained models are available on Hugging Face or CIVITAI.
For vocal separation, 3_HP-Vocal-UVR.pth or UVR-MDX-NET Main is recommended.[3]
svc pre-split splits the dataset into multiple files (using librosa).
svc pre-sd splits the dataset into per-speaker files using speaker diarization (pyannote.audio). Further manual classification may be necessary due to accuracy issues. If speakers speak with a variety of speech styles, set --min-speakers larger than the actual number of speakers. Due to unresolved dependencies, please install pyannote.audio manually: pip install pyannote-audio
To manually classify audio files, svc pre-classify is available. Up and down arrow keys can be used to change the playback speed.
If you do not have access to a GPU with more than 10 GB of VRAM, the free plan of Google Colab is recommended for light users and the Pro/Growth plan of Paperspace for heavy users. Conversely, if you have access to a high-end GPU, the use of cloud services is not recommended.
Place your dataset like dataset_raw/{speaker_id}/**/{wav_file}.{any_format}
(subfolders and non-ASCII filenames are acceptable) and run:
svc pre-resample
svc pre-config
svc pre-hubert
svc train -t
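As an illustration of the layout above, a minimal example with placeholder names (speaker "alice", folder "session1", and the file name are hypothetical):

```shell
# Build a minimal example of the expected dataset layout;
# speaker name "alice" and file names are placeholders.
mkdir -p "dataset_raw/alice/session1"
touch "dataset_raw/alice/session1/take01.wav"
# Subfolders of any depth and non-ASCII filenames are acceptable.
find dataset_raw -type f
# → dataset_raw/alice/session1/take01.wav
```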
It is recommended to increase batch_size as much as possible in config.json before the train command, to match your VRAM capacity. Setting batch_size to auto-{init_batch_size}-{max_n_trials} (or simply auto) will automatically increase the batch size until an OOM error occurs, but this may not be useful in some cases.

To use CREPE, replace svc pre-hubert with svc pre-hubert -fm crepe.
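A sketch of the batch_size setting described above, assuming (as in upstream so-vits-svc configs) that it lives under the train section of config.json; the auto bounds here are illustrative, with other keys omitted:

```json
{
  "train": {
    "batch_size": "auto-4-15"
  }
}
```

Per the placeholder names above, 4 stands in for init_batch_size and 15 for max_n_trials.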
To use ContentVec correctly, replace svc pre-config with svc pre-config -t so-vits-svc-4.0v1. Training may take slightly longer because some weights are reset due to reusing legacy initial generator weights.

To use the MS-iSTFT Decoder, replace svc pre-config with svc pre-config -t quickvc.

For more details, run svc -h or svc <subcommand> -h.
> svc -h
Usage: svc [OPTIONS] COMMAND [ARGS]...
so-vits-svc allows any folder structure for training data.
However, the following folder structure is recommended.
When training: dataset_raw/{speaker_name}/**/{wav_name}.{any_format}
When inference: configs/44k/config.json, logs/44k/G_XXXX.pth
If the folder structure is followed, you DO NOT NEED TO SPECIFY model path, config path, etc.
(The latest model will be automatically loaded.)
To train a model, run pre-resample, pre-config, pre-hubert, train.
To infer a model, run infer.
Options:
-h, --help Show this message and exit.
Commands:
clean Clean up files, only useful if you are using the default file structure
infer Inference
onnx Export model to onnx (currently not working)
pre-classify Classify multiple audio files into multiple files
pre-config Preprocessing part 2: config
pre-hubert Preprocessing part 3: hubert If the HuBERT model is not found, it will be...
pre-resample Preprocessing part 1: resample
pre-sd Speech diarization using pyannote.audio
pre-split Split audio files into multiple files
train Train model If D_0.pth or G_0.pth not found, automatically download from hub.
train-cluster Train k-means clustering
vc Realtime inference from microphone
Thanks goes to these wonderful people (emoji key):
This project follows the all-contributors specification. Contributions of any kind welcome!
If you register a referral code and then add a payment method, you may save about $5 on your first month's bill. Note that both referral rewards are Paperspace credits, not cash. It was a tough decision, but the referral was included because debugging and training the initial model requires a large amount of computing power and the developer is a student.