In the field of singing voice conversion, SoVitsSvc is not the only project; there are many others, which will not be listed here. This project has been officially discontinued and archived. However, other enthusiasts have created their own forks and continue to maintain the SoVitsSvc project (still unrelated to SvcDevelopTeam and the maintainers of this repository), and have made some major changes to it, which you can discover for yourself.
This project is an open-source, offline project. All members of SvcDevelopTeam and all developers and maintainers of this project (hereinafter referred to as contributors) have no control over it. The contributors have never provided any form of assistance to any organization or individual, including but not limited to dataset extraction, dataset processing, computing support, training support, and inference. The contributors do not and cannot know the purposes for which users use this project. Therefore, all AI models and synthesized audio trained on the basis of this project are unrelated to its contributors, and all resulting problems shall be borne by the user.
This project runs completely offline and cannot collect any user information or obtain user input data. Therefore, the contributors of this project are unaware of any user inputs and models, and are thus not responsible for any user input.
This project is only a framework; it does not provide speech synthesis functionality by itself, and all functionality requires the user to train the models themselves. Moreover, the project does not come with any models, and any secondary distribution project is unrelated to the contributors of this project.
The 4.0-v2 model has been added; the whole process is the same as for 4.0. Compared with 4.0, it improves in some scenarios but regresses in others. See the 4.0-v2 branch for details.
| Branch | Feature | Compatible with main-branch models |
| --- | --- | --- |
| 4.0 | Main branch | - |
| 4.0v2 | Uses the VISinger2 model | Incompatible |
| 4.0-Vec768-Layer12 | Feature input is the 12th-layer Transformer output of ContentVec | Incompatible |
The singing voice conversion model uses the SoftVC content encoder to extract speech features from the source audio. These feature vectors are fed directly into VITS instead of being converted to a text-based intermediate representation, so the pitch and intonation of the source are preserved. In addition, the vocoder has been replaced with NSF HiFiGAN to solve the problem of interrupted audio.
After testing, we believe the project runs stably on Python 3.8.9.
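A minimal environment setup sketch; the use of a virtual environment and the `requirements.txt` file name are assumptions, so check against the repository:

```shell
# Create and activate a virtual environment, then install dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt  # assumes the repo ships a requirements.txt
```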
Required: the ContentVec model `checkpoint_best_legacy_500.pt`, placed in the `hubert` directory.

```shell
# contentvec
wget -P hubert/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
# Alternatively, you can manually download and place it in the hubert directory
```
Optional (strongly recommended): the pretrained base models `G_0.pth` and `D_0.pth`, placed in the `logs/44k` directory. Get them from svc-develop-team (TBD) or anywhere else.
Although pretrained models generally do not cause copyright problems, please be careful with them: for example, ask the author in advance, or check that the author has clearly stated the permitted uses in the model's description.
If you are using the NSF-HIFIGAN enhancer, you need to download the pretrained NSF-HIFIGAN model; if you do not need the enhancer, you can skip this step.
Place it in the `pretrain/nsf_hifigan` directory:

```shell
# nsf_hifigan
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
# Alternatively, you can manually download and place it in the pretrain/nsf_hifigan directory
# URL: https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
```
📊 Dataset Preparation
Simply place the dataset in the `dataset_raw` directory with the following file structure.
dataset_raw
├───speaker0
│ ├───xxx1-xxx1.wav
│ ├───...
│ └───Lxx-0xx8.wav
└───speaker1
├───xx2-0xxx2.wav
├───...
└───xxx7-xxx007.wav
You can customize the speaker name.
dataset_raw
└───suijiSUI
├───1.wav
├───...
└───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
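As a quick sanity check before preprocessing, you can count the wav files found per speaker folder; this is just a convenience sketch, not part of the project's own tooling:

```shell
# Print the number of wav files found for each speaker under dataset_raw
for spk in dataset_raw/*/; do
  echo "$spk: $(find "$spk" -name '*.wav' | wc -l) wav file(s)"
done
```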
🛠️ Preprocessing
0. Slice audio

Slice the audio into clips of 5s - 15s; slightly longer is not a problem. Clips that are too long may cause `torch.cuda.OutOfMemoryError` during training or even preprocessing.
Use audio-slicer-GUI or audio-slicer-CLI. In general, only the Minimum Interval needs to be adjusted: for spoken audio it can usually stay at the default, while for singing audio it can be lowered to 100 or even 50.
After slicing, delete any audio that is still too long or too short.
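If helpful, here is a quick way to spot such files, assuming `ffprobe` (from FFmpeg) is installed; this sketch is not part of the project's own scripts:

```shell
# List sliced wavs whose duration falls outside the 5s-15s range
for f in dataset_raw/*/*.wav; do
  d=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f")
  awk -v d="$d" -v f="$f" 'BEGIN { if (d < 5 || d > 15) print f " (" d "s)" }'
done
```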
1. Resample to 44100Hz and mono

```shell
python resample.py
```
2. Automatically split the dataset into training and validation sets, and generate configuration files

```shell
python preprocess_flist_config.py
```
3. Generate hubert and f0

```shell
python preprocess_hubert_f0.py
```
After completing the above steps, the `dataset` directory will contain the preprocessed data, and the `dataset_raw` folder can then be deleted.
You can modify some parameters in the generated `config.json`:

`keep_ckpts`: keep only the most recent `keep_ckpts` checkpoints during training. Set to `0` to keep all of them. Default is `3`.
`all_in_mem`: load the whole dataset into RAM. It can be enabled when disk IO on some platforms is too slow and system memory is much larger than your dataset.
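For example, a one-off sketch for editing these values; it assumes both keys live under the `train` section of the generated config, so verify against your own `config.json`:

```shell
# Set keep_ckpts and all_in_mem in configs/config.json
python -c "
import json
with open('configs/config.json') as fp:
    cfg = json.load(fp)
cfg['train']['keep_ckpts'] = 10    # keep the 10 most recent checkpoints
cfg['train']['all_in_mem'] = True  # cache the whole dataset in RAM
with open('configs/config.json', 'w') as fp:
    json.dump(cfg, fp, indent=2)
"
```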
🏋️‍♀️ Training

```shell
python train.py -c configs/config.json -m 44k
```
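To monitor training progress, you can point TensorBoard at the log directory; this assumes event files are written under `logs/44k`, matching the model directory above:

```shell
tensorboard --logdir logs/44k
```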
🤖 Inference
```shell
# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -s "nen" -n "君の知らない物語-src.wav" -t 0
```
Required parameters:
-m | --model_path: Path to the model.
-c | --config_path: Path to the configuration file.
-s | --spk_list: Target speaker name for conversion.
-n | --clean_names: A list of wav file names located in the raw folder.
-t | --trans: Pitch adjustment, supports positive and negative (semitone) values.
Optional parameters: see the next section
-a | --auto_predict_f0: Automatic pitch prediction for voice conversion. Do not enable this when converting songs, as it can cause serious pitch issues.
-cl | --clip: Forced audio slicing; duration in seconds. Set to 0 to turn it off (default).
-lg | --linear_gradient: The crossfade length between two audio slices, in seconds. If the voice sounds discontinuous after forced slicing, adjust this value; otherwise the default of 0 is recommended.
-cm | --cluster_model_path: Path to the clustering model. Fill in any value if clustering has not been trained.
-cr | --cluster_infer_ratio: Proportion of the clustering scheme, range 0-1. Fill in 0 if the clustering model has not been trained.
-fmp | --f0_mean_pooling: Apply a mean filter (pooling) to f0, which may improve some hoarse sounds. Enabling this option will reduce inference speed.
-eh | --enhance: Whether to use the NSF_HIFIGAN enhancer. This option can improve sound quality for models trained on small datasets, but has a negative effect on well-trained models, so it is off by default.
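Putting some of these together, a hypothetical invocation with forced slicing and crossfade might look like the following; the source file name is a placeholder:

```shell
# Example: force slicing every 30 seconds with a 0.5s crossfade between slices
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
  -s "nen" -n "example-src.wav" -t 0 -cl 30 -lg 0.5
```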
If the results from the previous section are satisfactory, or if you didn't understand what the following section discusses, you can skip it; it won't affect model usage. (These optional settings have a relatively small impact; they may help on certain specific data, but in most cases the difference may not be noticeable.)
During 4.0 model training, an f0 predictor is also trained, which can be used for automatic pitch prediction during voice conversion. If its results are poor, manual pitch adjustment can be used instead. Please do not enable this feature when converting singing voice, as it may cause serious pitch shifting!

Set `auto_predict_f0` to `true` in `inference_main`.
Introduction: The clustering scheme can reduce timbre leakage and make the trained model sound more like the target's timbre (although this effect is not very obvious), but using clustering alone will lower the model's clarity (the model may sound unclear). Therefore, this model adopts a fusion method to linearly control the proportion of clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target's timbre" and "being clear and articulate" to find a suitable trade-off point.
The existing steps before clustering do not need to be changed. All you need to do is train an additional clustering model, which has a relatively low training cost.
Training: run `python cluster/train_cluster.py`. The output model will be saved in `logs/44k/kmeans_10000.pt`.

Inference: specify `cluster_model_path` in `inference_main.py`, and set `cluster_infer_ratio` in `inference_main.py`, where `0` means not using clustering at all, `1` means using only clustering, and usually `0.5` is sufficient.
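For instance, a hypothetical command-line equivalent using the `-cm` and `-cr` flags described above, with a placeholder source file:

```shell
# Blend 50% clustering into inference with a trained clustering model
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
  -s "nen" -n "example-src.wav" -t 0 \
  -cm "logs/44k/kmeans_10000.pt" -cr 0.5
```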
Introduction: mean filtering of the F0 can effectively reduce the hoarseness caused by fluctuations in the predicted pitch (hoarseness caused by reverb or harmony cannot be eliminated for now). This feature noticeably improves some songs, though it makes others go out of tune. If the output sounds hoarse after inference, consider enabling it.
Set `f0_mean_pooling` to `true` in `inference_main.py`.
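Alternatively, the `-fmp` flag described above can be passed on the command line; the sketch below assumes it is a boolean switch, which may differ across forks:

```shell
# Enable F0 mean pooling at inference time (assumes -fmp is a boolean switch)
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" \
  -s "nen" -n "example-src.wav" -t 0 -fmp
```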
[23/03/16] No longer need to download hubert manually
[23/04/14] Support NSF_HIFIGAN enhancer
Use `onnx_export.py`:

1. Create a folder named `checkpoints` and open it.
2. Create a project folder inside `checkpoints`, named after your project, for example `aziplayer`.
3. Rename your model to `model.pth` and your configuration file to `config.json`, then place them in the `aziplayer` folder you just created.
4. Change `"NyaruTaffy"` in `path = "NyaruTaffy"` in `onnx_export.py` to your project name, i.e. `path = "aziplayer"`.
5. Run `onnx_export.py`. When it finishes, a `model.onnx` will be generated in your project folder, which is the exported model.
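As an optional sanity check (not part of the project), you can confirm that the exported file loads, assuming the `onnxruntime` package is installed:

```shell
# Load the exported model and print its input names
python -c "
import onnxruntime as ort
sess = ort.InferenceSession('checkpoints/aziplayer/model.onnx',
                            providers=['CPUExecutionProvider'])
print([i.name for i in sess.get_inputs()])
"
```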
Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently, they cannot be exported on their own (Hubert in fairseq has many unsupported operators and things involving constants that can cause errors or result in problems with the input/output shape and results when exported.)
CppDataProcess contains some functions for preprocessing data used in MoeSS.
If the original project is the Roman Empire, then this project is the Eastern Roman Empire (the Byzantine Empire), and so-vits-svc-5.0 is the Kingdom of Romania.
For some reason the author deleted the original repository. Due to the negligence of organization members, the contributor list was cleared, because all files were re-uploaded directly to this repository at the beginning of its reconstruction. The previous contributor list has now been added back to README.md.
Some members are not listed, in accordance with their personal wishes.
MistEO | XiaoMiku01 | しぐれ | TomoGaSukunai | Plachtaa | zd小达 | 凍聲響世
No organization or individual may infringe upon another person's portrait rights by vilifying or defacing their image, or by using information technology to forge it. Without the consent of the portrait right holder, no one may produce, use, or publish that person's portrait, except as otherwise provided by law. Without the consent of the portrait right holder, the rights holder of a portrait work may not use or publish the portrait by publishing, reproducing, distributing, renting, exhibiting, or other means. The protection of a natural person's voice is governed, mutatis mutandis, by the relevant provisions on the protection of portrait rights.
[Right of Reputation] Civil subjects enjoy the right of reputation. No organization or individual may infringe upon another person's right of reputation by insult, defamation, or other means.
[Infringement of Reputation by Works] Where a literary or artistic work published by a person depicts real people and real events, or a specific person, and contains insulting or defamatory content that infringes another person's right of reputation, the injured party has the right to request that the person bear civil liability in accordance with the law. Where a literary or artistic work published by a person does not depict a specific person, and merely contains plots similar to that person's circumstances, the person shall not bear civil liability.