so-vits-svc - Singing voice conversion model. The SoftVC content encoder extracts speech features from the source audio, which are fed into VITS together with F0 in place of the original text input to perform singing voice conversion. The vocoder is also replaced with NSF HiFiGAN to fix interrupted audio.

Created at: 2023-03-10 17:31:09
Language: Python
License: BSD-3-Clause

SoftVC VITS Singing Voice Conversion

so-vits-svc is not the only project in the field of singing voice conversion; there are many others that will not be listed here. This project has been officially discontinued and archived. However, other enthusiasts have created their own forks and continue to maintain the so-vits-svc project (still unrelated to SvcDevelopTeam and the maintainers of this repository), with some significant changes for you to discover on your own.

A fork with a greatly improved user interface: 34j/so-vits-svc-fork

A client that supports real-time conversion: w-okada/voice-changer

This project is fundamentally different from VITS. VITS is TTS, while this project is SVC. This project cannot do TTS, and VITS cannot do SVC; the two models are not interchangeable.

Disclaimer

This project is an open-source, offline project. All members of SvcDevelopTeam and all developers and maintainers of this project (hereinafter referred to as contributors) have no control over it. The contributors have never provided any form of assistance to any organization or individual, including but not limited to dataset extraction, dataset processing, computing support, training support, and inference. The contributors do not and cannot know for what purposes users use the project. Therefore, all AI models and synthesized audio trained with this project have nothing to do with its contributors. All resulting problems are borne by the user.

This project runs completely offline and cannot collect any user information or obtain user input data. Therefore, the contributors of this project are not aware of any user input or models and are not responsible for any user input.

This project is only a framework. It does not itself provide any speech synthesis functionality; all functionality requires users to train models themselves. The project does not come with any model, and any secondarily distributed project has nothing to do with the contributors of this project.

📏 Terms of Use

Warning: Please resolve the licensing of your datasets on your own. You bear full responsibility for any problems caused by training with unlicensed datasets, as well as all their consequences. The repository and its maintainer, svc-develop-team, have nothing to do with those consequences!

  1. This project is for academic exchange only and is intended for communication and learning purposes. It is not intended for production environments.
  2. Any video published on a video platform that is based on so-vits-svc must clearly state in its description that it is used for voice conversion and indicate the input source of the voice or audio. For example, if you use a video or audio published by someone else and separate the vocals as the input source for conversion, you must provide a clear link to the original video or music. If your own voice or audio synthesized by other commercial vocal synthesis software is used as the input source, you must also state this in the description.
  3. You are solely responsible for any infringement caused by the input source. When using other commercial vocal synthesis software as the input source, make sure you comply with that software's terms of use. Note that many vocal synthesis engines explicitly state in their terms of use that they may not be used as an input source for conversion.
  4. It is forbidden to use this project for illegal activities or for religious and political activities. The project developers firmly oppose such activities. If you disagree with this clause, you are prohibited from using the project.
  5. Continuing to use this project is deemed as agreeing to the terms described in this repository's README. This README has the obligation to persuade and is not responsible for any subsequent problems that may arise.
  6. If you use this project for any other purpose, please contact and inform the authors of this repository in advance. Thank you.

🆕 Updates!

Updated the 4.0-v2 model; the whole process is the same as for 4.0. Compared with 4.0, there are improvements in some scenarios, but also regressions in others. See the 4.0-v2 branch for details.

📝 4.0 Branch Feature List

Branch | Feature | Compatible with the main branch model
4.0 | Main branch | -
4.0v2 | Uses the VISinger2 model | Incompatible
4.0-Vec768-Layer12 | Feature input is the 12th-layer Transformer output of ContentVec | Incompatible

📝 Model Introduction

The singing voice conversion model uses the SoftVC content encoder to extract speech features from the source audio. These vectors are fed directly into VITS instead of being converted to a text-based intermediate representation, so pitch and intonation are preserved. In addition, the vocoder is replaced with NSF HiFiGAN to solve the problem of interrupted audio.
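For intuition, the data flow described above can be sketched as follows (a minimal conceptual sketch only; the actual module names, shapes, and call signatures in this repository differ):

# Conceptual sketch of the conversion pipeline, not the project's actual code
import torch

def convert(source_wav: torch.Tensor, content_encoder, f0_extractor, synthesizer) -> torch.Tensor:
    # SoftVC / ContentVec extracts largely speaker-independent speech features from the source audio
    content = content_encoder(source_wav)   # (frames, feature_dim)
    # F0 is extracted separately, so pitch and intonation are preserved
    f0 = f0_extractor(source_wav)           # (frames,)
    # Both replace the text input of VITS; the NSF HiFiGAN vocoder renders the target voice
    return synthesizer(content, f0)         # waveform in the target timbre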

🆕 What's New in Version 4.0

  • The feature input is changed to ContentVec.
  • The sampling rate is unified to 44100 Hz.
  • Due to changes in hop size and other parameters, as well as the simplification of some model structures, the GPU memory required for inference is significantly reduced. The 44 kHz GPU memory usage of version 4.0 is even lower than the 32 kHz usage of version 3.0.
  • Some code structures have been adjusted.
  • The dataset creation and training process is consistent with version 3.0, but the models are not compatible at all, and the dataset needs to be fully preprocessed again.
  • Added option 1: automatic pitch prediction for VC mode, meaning you no longer need to enter the pitch key manually when converting speech, and male and female voices can be converted automatically. However, this mode causes pitch shifting when converting songs.
  • Added option 2: reduce timbre leakage through a k-means clustering scheme, making the timbre more similar to the target timbre.
  • Added option 3: added the NSF-HiFiGAN enhancer, which improves sound quality to some extent for models trained on small datasets, but has a negative effect on well-trained models, so it is disabled by default.

💬 About the Python Version

After testing, we believe the project runs stably on Python 3.8.9.

📥预训练模型文件

Required

# contentvec
wget -P hubert/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
# Alternatively, you can manually download and place it in the hubert directory

Optional (Strongly recommended)

  • Pre-trained model files: G_0.pth, D_0.pth
    • Place them under the logs/44k directory

Get them from svc-develop-team(TBD) or anywhere else.

Although the pre-trained models generally do not cause copyright problems, please still pay attention to licensing: for example, ask the author in advance, or use models whose authors have clearly stated the permitted uses in their descriptions.

Optional (Select as Required)

If you are using the NSF-HIFIGAN enhancer, you will need to download the pre-trained NSF-HIFIGAN model; if you do not need it, you can skip this step.

  • Pre-trained NSF-HIFIGAN Vocoder: nsf_hifigan_20221211.zip
    • Unzip it and place the four files under the pretrain/nsf_hifigan directory
# nsf_hifigan
https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
# Alternatively, you can manually download and place it in the pretrain/nsf_hifigan directory
# URL:https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1

📊 Dataset Preparation

Simply place the dataset in the dataset_raw directory with the following file structure.

dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav

You can customize the speaker name.

dataset_raw
└───suijiSUI
    ├───1.wav
    ├───...
    └───25788785-20221210-200143-856_01_(Vocals)_0_0.wav

🛠️ Preprocessing

0. Slice audio

Slice to 5s - 15s; a bit longer is no problem. Clips that are too long may lead to torch.cuda.OutOfMemoryError during training or even pre-processing.

You can use audio-slicer-GUI or audio-slicer-CLI.

In general, only the Minimum Interval needs to be adjusted. For speech audio it can usually be left at the default. For singing audio it can be adjusted to 100 or even 50.

After slicing, delete audio that is too long and too short.
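If you want to check automatically which clips fall outside the suggested range, a minimal sketch (assuming the soundfile package is installed) could look like this:

# List clips outside the suggested 5s-15s range so they can be re-sliced or removed
import glob
import soundfile as sf

for path in glob.glob("dataset_raw/**/*.wav", recursive=True):
    duration = sf.info(path).duration
    if duration < 5 or duration > 15:
        print(f"{path}: {duration:.1f}s")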

1. Resample to 44100Hz and mono

python resample.py
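Conceptually, this step amounts to the following (a minimal sketch assuming librosa and soundfile; the file paths are illustrative, and resample.py walks dataset_raw and chooses the output layout for you):

# Resample one clip to 44100 Hz mono (illustrative paths only)
import os
import librosa
import soundfile as sf

os.makedirs("resampled/speaker0", exist_ok=True)
audio, _ = librosa.load("dataset_raw/speaker0/xxx1-xxx1.wav", sr=44100, mono=True)
sf.write("resampled/speaker0/xxx1-xxx1.wav", audio, 44100)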

2. Automatically split the dataset into training and validation sets, and generate configuration files.

python preprocess_flist_config.py

3. Generate hubert and f0

python preprocess_hubert_f0.py

After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.

You can modify some parameters in the generated config.json (a short editing example follows the list below):

  • keep_ckpts: Keep only the last keep_ckpts models during training. Set to 0 to keep them all. Default is 3.

  • all_in_mem: Load the entire dataset into RAM. This can be enabled when the disk IO of your platform is too slow and the system memory is much larger than your dataset.
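If you prefer to edit these fields from code, here is a minimal sketch (it assumes keep_ckpts and all_in_mem live under the "train" section of the generated config.json; adjust to match your file):

# Adjust checkpoint retention and in-memory dataset loading in config.json
import json

with open("configs/config.json") as f:
    cfg = json.load(f)
cfg["train"]["keep_ckpts"] = 10     # keep the last 10 checkpoints; 0 keeps all
cfg["train"]["all_in_mem"] = False  # set True only if RAM comfortably exceeds the dataset size
with open("configs/config.json", "w") as f:
    json.dump(cfg, f, indent=2)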

🏋️‍♀️ Training

python train.py -c configs/config.json -m 44k

🤖 Inference

Use inference_main.py

# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -s "nen" -n "君の知らない物語-src.wav" -t 0

Required parameters:

  • -m | --model_path: Path to the model.
  • -c | --config_path: Path to the configuration file.
  • -s | --spk_list: Target speaker name for conversion.
  • -n | --clean_names: A list of wav file names located in the raw folder.
  • -t | --trans: Pitch adjustment, supports positive and negative (semitone) values.

Optional parameters: see the next section

  • -a | --auto_predict_f0: Automatic pitch prediction for voice conversion. Do not enable this when converting songs, as it can cause serious pitch issues.
  • -cl | --clip: Forced voice slicing; duration in seconds. Set to 0 to turn it off (default).
  • -lg | --linear_gradient: The crossfade length, in seconds, between two audio slices. If the voice sounds discontinuous after forced slicing, you can adjust this value; otherwise the default of 0 is recommended.
  • -cm | --cluster_model_path: Path to the clustering model. Fill in any value if clustering has not been trained.
  • -cr | --cluster_infer_ratio: Proportion of the clustering scheme, range 0-1. Fill in 0 if the clustering model has not been trained.
  • -fmp | --f0_mean_pooling: Apply mean filtering (pooling) to F0, which may improve some hoarse sounds. Enabling this option will reduce inference speed.
  • -eh | --enhance: Whether to use the NSF-HiFiGAN enhancer. This option improves sound quality to some extent for models trained on small datasets, but has a negative effect on well-trained models, so it is off by default.

🤔 Optional Settings

If the results from the previous section are satisfactory, or if you didn't understand what is being discussed in the following section, you can skip it, and it won't affect the model usage. (These optional settings have a relatively small impact, and they may have some effect on certain specific data, but in most cases, the difference may not be noticeable.)

Automatic f0 prediction

During the 4.0 model training, an f0 predictor is also trained, which can be used for automatic pitch prediction during voice conversion. However, if the effect is not good, manual pitch prediction can be used instead. But please do not enable this feature when converting singing voice as it may cause serious pitch shifting!

  • Set auto_predict_f0 to true in inference_main (see the example command below).
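For example, using the command-line flag documented above (the input file name speech-src.wav is hypothetical, and -a is assumed to be a simple on/off switch):

# Example: enable automatic pitch prediction when converting speech (not songs)
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -s "nen" -n "speech-src.wav" -t 0 -a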

Cluster-based timbre leakage control

Introduction: The clustering scheme can reduce timbre leakage and make the trained model sound more like the target's timbre (although this effect is not very obvious), but using clustering alone will lower the model's clarity (the model may sound unclear). Therefore, this model adopts a fusion method to linearly control the proportion of clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target's timbre" and "being clear and articulate" to find a suitable trade-off point.

The existing steps before clustering do not need to be changed. All you need to do is to train an additional clustering model, which has a relatively low training cost.

  • Training process:
    • Train on a machine with good CPU performance. In my experience, it takes about 4 minutes to train each speaker on a Tencent Cloud machine with a 6-core CPU.
    • Execute python cluster/train_cluster.py. The output model will be saved in logs/44k/kmeans_10000.pt.
  • Inference process:
    • Specify cluster_model_path in inference_main.py.
    • Specify cluster_infer_ratio in inference_main.py, where 0 means not using clustering at all, 1 means only using clustering, and usually 0.5 is sufficient (see the example command below).
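For example, using the command-line flags documented above (the input file name song-src.wav is hypothetical; kmeans_10000.pt is the model produced by the training step):

# Example: blend 50% of the clustering scheme during inference
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -s "nen" -n "song-src.wav" -t 0 -cm "logs/44k/kmeans_10000.pt" -cr 0.5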

F0 mean filtering

Introduction: Mean filtering of F0 can effectively reduce the hoarse sound caused by fluctuations in the predicted pitch (hoarseness caused by reverb or harmony cannot be eliminated for now). This feature noticeably improves some songs, but on other songs it makes the result go off-key. If the converted song sounds hoarse after inference, consider enabling it.

  • Set f0_mean_pooling to true in inference_main.py.

Open sovits4_for_colab.ipynb in Colab

[23/03/16] No longer need to download hubert manually

[23/04/14] Support NSF_HIFIGAN enhancer

📤 Exporting to Onnx

Use onnx_export.py

  • Create a folder named checkpoints and open it.
  • Inside checkpoints, create a project folder named after your project, for example aziplayer.
  • Rename your model to model.pth and your configuration file to config.json, then place them in the aziplayer folder you just created.
  • Change "NyaruTaffy" in path = "NyaruTaffy" in onnx_export.py to your project name, i.e. path = "aziplayer".
  • Run onnx_export.py.
  • Wait for it to finish running. A model.onnx will be generated in your project folder; this is the exported model.
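To quickly verify the export, a minimal sketch (assuming the onnxruntime package is installed) can load the model and list its expected inputs:

# Load the exported model and print its input names, shapes, and types
import onnxruntime as ort

session = ort.InferenceSession("checkpoints/aziplayer/model.onnx")
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)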

UI support for Onnx models

Note: For Hubert Onnx models, please use the models provided by MoeSS. They cannot currently be exported on their own (Hubert in fairseq has many unsupported operators and constant-related constructs that can cause errors or problems with the input/output shapes and results when exported).

CppDataProcess contains some functions used to preprocess data in MoeSS.

🤔 The relationship between the original project (so-vits-svc by the original author), this project, and so-vits-svc-5.0

If the original project is equivalent to the Roman Empire, this project is the Eastern Roman Empire (the Byzantine Empire), and so-vits-svc-5.0 is the Kingdom of Romania.

☀️ Previous contributors

For some reason the author deleted the original repository. Due to the negligence of the organization members, the contributor list was cleared, as all files were re-uploaded directly to this repository at the beginning of its reconstruction. A list of previous contributors has now been added to README.md.

Some members are not listed, in accordance with their personal wishes.


MistEO


XiaoMiku01


しぐれ


TomoGaSukunai


Plachtaa


zd小达


凍聲響世

📚 Some legal provisions for reference

Any country, region, organization, or individual using this project must comply with the following laws.

Civil Code of the People's Republic of China

Article 1019

No organization or individual may infringe upon another person's portrait rights by defacement, defilement, or by using information technology to forge their image. Unless otherwise provided by law, no one may produce, use, or publicly disclose the portrait of a portrait-rights holder without that person's consent. Without the consent of the portrait-rights holder, the holder of rights in a portrait work may not use or publicly disclose the portrait by publishing, reproducing, distributing, renting, or exhibiting it, or by similar means. The protection of a natural person's voice is governed, mutatis mutandis, by the relevant provisions on the protection of portrait rights.

Article 1024

[Right to Reputation] Civil-law subjects enjoy the right to reputation. No organization or individual may infringe upon another person's right to reputation by insult, defamation, or similar means.

Article 1027

[Works Infringing the Right to Reputation] Where a literary or artistic work published by a person depicts real people and events, or a specific person, and contains insulting or defamatory content that infringes another person's right to reputation, the injured party has the right to demand that the person bear civil liability in accordance with the law. Where a literary or artistic work published by a person does not depict a specific person, and merely its plot resembles that person's circumstances, the person does not bear civil liability.

Constitution of the People's Republic of China

Criminal Law of the People's Republic of China

Civil Code of the People's Republic of China

💪 Thanks to all contributors for their efforts