Updated the 4.0-v2 model; the entire process is the same as for 4.0. Compared with 4.0, it improves in some scenarios but regresses in others. See the 4.0-v2 branch for details.
|Branch|Feature|Compatible with the 4.0 model|
|:----:|:----:|:----:|
|4.0-v2|Uses the VISinger2 model|Incompatible|
|4.0-Vec768-Layer12|Feature input is the 12th-layer Transformer output of ContentVec|Incompatible|
```shell
# contentvec
wget -P hubert/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
# Alternatively, you can manually download and place it in the hubert directory
```
Get them from svc-develop-team (TBD) or anywhere else.
Although pretrained models generally do not cause copyright problems, please pay attention to them: for example, ask the author in advance, or make sure the author has clearly stated the permitted uses in the model description.
If you are using the NSF-HIFIGAN enhancer, you need to download the pretrained NSF-HIFIGAN model; if you do not need the enhancer, you can skip this step.
```shell
# nsf_hifigan
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
# Alternatively, you can manually download and place it in the pretrain/nsf_hifigan directory
# URL: https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
```
Simply place the dataset in the `dataset_raw` directory with the following file structure.

```
dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
```
You can customize the speaker name.

```
dataset_raw
└───suijiSUI
    ├───1.wav
    ├───...
    └───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
```
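Before preprocessing, you can sanity-check the layout with a short script like the one below. This is a minimal sketch; it only assumes the `dataset_raw` structure shown above, and everything else is illustrative.

```python
import os

# Walk dataset_raw: each immediate subdirectory is one speaker,
# and every .wav file inside it is one training clip.
dataset_root = "dataset_raw"
for speaker in sorted(os.listdir(dataset_root)):
    speaker_dir = os.path.join(dataset_root, speaker)
    if not os.path.isdir(speaker_dir):
        continue
    wavs = [f for f in os.listdir(speaker_dir) if f.endswith(".wav")]
    others = [f for f in os.listdir(speaker_dir) if not f.endswith(".wav")]
    print(f"{speaker}: {len(wavs)} wav files")
    if others:
        # Non-wav files probably do not belong here; list a few for review.
        print(f"  warning: {len(others)} non-wav files found: {others[:3]}")
```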
0. Slice audio
Slice to 5s - 15s; a bit longer is no problem. Audio that is too long may lead to `torch.cuda.OutOfMemoryError` during training or even pre-processing.
In general, only the `Minimum Interval` needs to be adjusted. For speech audio it usually remains at the default; for singing audio it can be lowered to `100` or even `50`.
After slicing, delete audio that is too long or too short; a sketch for finding such clips follows.
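A quick way to find offending clips is to scan the sliced audio and report durations outside the suggested range. This is a minimal sketch, assuming the sliced files already sit under `dataset_raw` and that the `soundfile` package is available:

```python
import os
import soundfile as sf  # assumed dependency; pip install soundfile

MIN_SEC, MAX_SEC = 5.0, 15.0  # suggested range from the slicing step above

for root, _, files in os.walk("dataset_raw"):
    for name in files:
        if not name.endswith(".wav"):
            continue
        path = os.path.join(root, name)
        info = sf.info(path)  # reads the header only, no full decode
        duration = info.frames / info.samplerate
        if duration < MIN_SEC or duration > MAX_SEC:
            print(f"{path}: {duration:.1f}s (outside {MIN_SEC}-{MAX_SEC}s)")
```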
1. Resample to 44100 Hz and mono

```shell
python resample.py
```
2. Automatically split the dataset into training and validation sets, and generate configuration files

```shell
python preprocess_flist_config.py
```
3. Generate hubert and f0

```shell
python preprocess_hubert_f0.py
```
After completing the above steps, the dataset directory will contain the preprocessed data, and the `dataset_raw` folder can be deleted.
You can modify some parameters in the generated `config.json`:
`keep_ckpts`: Keep the last `keep_ckpts` models during training. Setting it to `0` will keep them all. Default is `3`.
`all_in_mem`: Load the entire dataset into RAM. This can be enabled when disk IO on some platforms is too slow and the system memory is much larger than your dataset.
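For example, both options can be adjusted programmatically instead of by hand. This is a minimal sketch; it assumes `keep_ckpts` and `all_in_mem` sit under the `train` section of the generated `config.json`, which you should verify against your own file:

```python
import json

# Load the generated configuration, adjust the two training options,
# and write it back.
with open("configs/config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

config["train"]["keep_ckpts"] = 5      # keep only the 5 most recent checkpoints
config["train"]["all_in_mem"] = False  # enable only if RAM is much larger than the dataset

with open("configs/config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```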
```shell
python train.py -c configs/config.json -m 44k
```
🤖 Inference

```shell
# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -s "nen" -n "君の知らない物語-src.wav" -t 0
```

Required parameters:
- `--model_path`: Path to the model.
- `--config_path`: Path to the configuration file.
- `--spk_list`: Target speaker name for conversion.
- `--clean_names`: A list of wav file names located in the raw folder.
- `--trans`: Pitch adjustment; supports positive and negative values (in semitones).
Optional parameters: see the next section
- `--auto_predict_f0`: Automatic pitch prediction for voice conversion. Do not enable this when converting songs, as it can cause serious pitch issues.
- `--clip`: Forced audio slicing. Set to `0` to turn it off (default); otherwise, the slice duration in seconds.
- `--linear_gradient`: The crossfade length between two audio slices, in seconds. If the voice sounds discontinuous after forced slicing, you can adjust this value; otherwise, the default of `0` is recommended (see the crossfade sketch after this list).
- `--cluster_model_path`: Path to the clustering model. Fill in any value if clustering has not been trained.
- `--cluster_infer_ratio`: Proportion of the clustering scheme, range `0-1`. Set to `0` if the clustering model has not been trained.
- `--f0_mean_pooling`: Apply a mean filter (pooling) to f0, which may improve some hoarse sounds. Enabling this option will reduce inference speed.
- `--enhance`: Whether to use the NSF_HIFIGAN enhancer. This option can improve sound quality for models trained on small datasets, but has a negative effect on well-trained models, so it is turned off by default.
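To make `--clip` and `--linear_gradient` concrete, here is a minimal numpy sketch of a linear crossfade joining two consecutive audio slices. It is illustrative only, not the actual implementation in inference_main.py; the function name and details are assumptions.

```python
import numpy as np

def linear_crossfade(a: np.ndarray, b: np.ndarray, fade_len: int) -> np.ndarray:
    """Join two audio slices, linearly blending the last fade_len samples
    of `a` with the first fade_len samples of `b` (fade_len > 0)."""
    fade_out = np.linspace(1.0, 0.0, fade_len)
    fade_in = 1.0 - fade_out
    overlap = a[-fade_len:] * fade_out + b[:fade_len] * fade_in
    return np.concatenate([a[:-fade_len], overlap, b[fade_len:]])

# e.g. with --linear_gradient 0.5 at 44100 Hz, fade_len would be 22050 samples
```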
If the results from the previous section are satisfactory, or if you did not understand the following section, you can skip it; it will not affect model usage. (These optional settings have a relatively small impact: they may help on certain specific datasets, but in most cases the difference is not noticeable.)
During 4.0 model training, an f0 predictor is also trained, which can be used for automatic pitch prediction during voice conversion. If its results are poor, manual pitch adjustment can be used instead. Please do not enable this feature when converting singing voice, as it may cause serious pitch shifting!
Set `auto_predict_f0` to `true` in `inference_main`.
Introduction: The clustering scheme can reduce timbre leakage and make the trained model sound more like the target's timbre (although this effect is not very obvious), but using clustering alone will lower the model's clarity (the model may sound unclear). Therefore, this model adopts a fusion method to linearly control the proportion of clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target's timbre" and "being clear and articulate" to find a suitable trade-off point.
The existing steps before clustering do not need to be changed. All you need to do is to train an additional clustering model, which has a relatively low training cost.
Training: run `python cluster/train_cluster.py`. The output model will be saved in `logs/44k/kmeans_10000.pt`.
Inference: specify `cluster_model_path` and `cluster_infer_ratio` in `inference_main`, where `0` means not using clustering at all, `1` means only using clustering, and usually `0.5` is sufficient.
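To make the fusion described above concrete, here is a minimal numpy sketch of the idea (not the project's actual code; the function name and shapes are assumptions): each frame's content feature is snapped to its nearest cluster center, then linearly blended back with the original feature according to the ratio.

```python
import numpy as np

def fuse_features(features: np.ndarray, centers: np.ndarray, ratio: float) -> np.ndarray:
    """features: (n_frames, dim) content features; centers: (k, dim) cluster centers.
    ratio=0 -> original features only, ratio=1 -> cluster centers only."""
    # Nearest cluster center for every frame (squared Euclidean distance)
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    quantized = centers[dists.argmin(axis=1)]
    # Linear blend between the clustered and the raw features
    return ratio * quantized + (1.0 - ratio) * features
```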
Introduction: Mean filtering of f0 can effectively reduce the hoarse sound caused by fluctuations in the predicted pitch (hoarseness caused by reverb or harmony cannot be eliminated for now). This feature noticeably improves some songs, but on others it causes the output to go out of tune. If the output sounds hoarse after inference, consider enabling it.
Set `f0_mean_pooling` to `true` in `inference_main`.
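Conceptually, this option applies a moving average to the predicted f0 contour. The following is a minimal sketch of such a filter, assuming unvoiced frames are encoded as zeros; it is illustrative, not the repository's exact implementation.

```python
import numpy as np

def f0_mean_pool(f0: np.ndarray, win: int = 3) -> np.ndarray:
    """Smooth an f0 contour with a moving average, skipping unvoiced
    (zero) frames so silence is not averaged into the pitch."""
    smoothed = f0.astype(float)
    voiced = f0 > 0
    for i in np.where(voiced)[0]:
        lo, hi = max(0, i - win // 2), min(len(f0), i + win // 2 + 1)
        window = f0[lo:hi][voiced[lo:hi]]  # only voiced neighbours
        smoothed[i] = window.mean()
    return smoothed
```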
[23/03/16] No longer need to download hubert manually
[23/04/14] Support NSF_HIFIGAN enhancer
- Create a folder named `checkpoints` and open it
- Create a folder inside `checkpoints` as your project folder, naming it after your project, for example `aziplayer`
- Rename your model to `model.pth` and the configuration file to `config.json`, and place them in the `aziplayer` folder you just created
- Change `"NyaruTaffy"` in `path = "NyaruTaffy"` in onnx_export.py to your project name, i.e. `path = "aziplayer"`
- Run onnx_export.py
- Wait for it to finish; a `model.onnx` will be generated in your project folder, which is the exported model
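For orientation, the export step boils down to a `torch.onnx.export` call. The sketch below uses a stand-in module, because the real onnx_export.py builds the actual SoVits generator and its inputs itself; every name here is a placeholder, not the project's API.

```python
import torch

class TinyModel(torch.nn.Module):
    """Stand-in for the real generator built by onnx_export.py (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(256, 1)

    def forward(self, hidden_unit):
        return self.proj(hidden_unit)

model = TinyModel().eval()
dummy_input = torch.randn(1, 100, 256)  # placeholder (batch, frames, feature_dim)

# onnx_export.py performs essentially this call with the real model and inputs
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                 # lands in your project folder in the real script
    opset_version=16,
    input_names=["hidden_unit"],  # placeholder names
    output_names=["audio"],
)
```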
Note: For Hubert ONNX models, please use the models provided by MoeSS. Currently, they cannot be exported on your own (Hubert in fairseq has many unsupported operators and things involving constants, which cause errors or produce wrong input/output shapes and results when exported.)
`CppDataProcess` contains some functions used to preprocess data in MoeSS.
If the original project is equivalent to the Roman Empire, then this project is the Eastern Roman Empire (the Byzantine Empire), and so-vits-svc-5.0 is the Kingdom of Romania.
For some reason, the author deleted the original repository. Due to an oversight by the organization members, all files were directly re-uploaded to this repository at the beginning of its reconstruction, which cleared the contributor list. A list of previous contributors has now been added to README.md.
Some members are not listed, in accordance with their personal wishes.