tortoise-tts - 以质量优先的多语音 TTS 系统

Created at: 2022-01-28 12:33:15

Language: Jupyter Notebook

编号: https://github.com/neonbjb/tortoise-tts

License: Apache-2.0

龟

Tortoise 是一个文本到语音转换程序，具有以下优先级：

强大的多语音能力。
高度逼真的韵律和语调。

此存储库包含在推理模式下运行 Tortoise TTS 所需的所有代码。

手稿：https://arxiv.org/abs/2305.07243

版本历史

2.7 版;2023/7/26

错误修复
添加了苹果芯片支持
更新的转换器版本

2.6 版;2023/7/26

错误修复

2.5 版;2023/7/09

新增kv_cache支持速度提高 5 倍
增加的高速支持速度提高了 10 倍
增加了半精度支持

2.4 版;2022/5/17

删除了 CVVP 模型。发现它实际上不会对输出产生明显差异。
添加更好的调试支持;现有工具现在会吐出可用于重现错误运行的调试文件。

2.3 版;2022/5/12

新的 CLVP 大型模型，用于进一步改进解码指导。
对 read.py 和do_tts.py的改进（新选项）

2.2 版;2022/5/5

从训练集中添加了几个新声音。
自动密文。将要用于提示模型的文本换行，但不要在括号中朗读。
错误修复

2.1 版;2022/5/2

添加了产生完全随机语音的功能。
添加了通过脚本下载语音条件反射潜伏，然后使用用户提供的预感条件反射的功能。
添加了使用你自己的预训练模型的功能。
重构的目录结构。
性能改进和错误修复。

名字里有什么？

我以莫哈韦沙漠动植物命名与语音相关的存储库。有点开玩笑：这个模型非常慢。它利用自回归解码器和扩散解码器;两者都以其低采样率而闻名。在 K80 上，预计每 2 分钟生成一个中等大小的句子。

演示

有关示例输出的大量列表，请参阅此页面。

+GPT-3的酷应用（不是我做的）：https://twitter.com/lexman_ai

使用指南

本地安装

如果你想在自己的电脑上使用它，你必须有一个 NVIDIA GPU。

在Windows上，我强烈建议使用Conda安装路径。有人告诉我，如果你不这样做，你将花费大量时间追逐依赖问题。

首先，安装miniconda：https://docs.conda.io/en/latest/miniconda.html

然后运行以下命令，使用 anaconda 提示符作为终端（或任何其他配置为使用 conda 的终端）

这将：

创建指定了最少依赖项的 conda 环境
激活环境
使用此处提供的命令安装 pytorch：https://pytorch.org/get-started/locally/
克隆龟-tts
将当前目录更改为 tortoise-tts
运行 anaconda 安装程序安装脚本

conda create --name tortoise python=3.9 numba inflect
conda activate tortoise
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install transformers=4.29.2
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install

或者，pytorch 可以安装在基本环境中，以便其他 conda 环境也可以使用它。为此，只需在激活环境之前发送线路即可。

conda install pytorch...

注意：当你想使用Tortoise-tts时，你将始终必须确保激活conda环境。
tortoise

如果你在Windows上，你可能还需要安装pysoundfile：

conda install -c conda-forge pysoundfile

苹果硅

在带有 M13/M1 芯片的 MacOS 1+ 上，你需要安装几乎版本的 pytorch，如官方页面中所述，你可以执行以下操作：

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu

请务必在激活环境后执行此操作。如果不使用 conda，命令将如下所示：

python3.10 -m venv .venv
source .venv/bin/activate
pip install numba inflect
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip install transformers
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install .

请注意，DeepSpeed在Apple Silicon上被禁用，因为它不起作用。该标志将被忽略。你可能需要在以下命令前面加上前缀才能使它们正常工作，因为 MPS 不支持 Pytorch 中的所有操作。

--use_deepspeed

PYTORCH_ENABLE_MPS_FALLBACK=1

do_tts.py

此脚本允许你使用一个或多个声音说出单个短语。

python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast

read.py

此脚本提供用于读取大量文本的工具。

python tortoise/read.py --textfile <your text to be read> --voice random

这会将文本文件分解为句子，然后一次将它们转换为语音。它将在生成时输出一系列语音剪辑。生成所有剪辑后，它会将它们合并为一个文件并输出。

有时会搞砸输出。你可以通过使用 --regenerate 参数重新运行来重新生成任何不良剪辑。

read.py

应用程序接口

Tortoise 可以通过编程方式使用，如下所示：

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

要使用 deepspeed：

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

要使用 kv 缓存：

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(kv_cache=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

要在 float16 中运行模型：

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

对于更快的运行，请使用所有三个：

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True, kv_cache=True, half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

语音定制指南

Tortoise经过专门训练，可以成为多扬声器模型。它通过查阅参考剪辑来实现这一点。

这些参考剪辑是你提供的用于指导语音生成的扬声器的录音。这些剪辑用于确定输出的许多属性，例如语音的音高和音调、语速，甚至口齿不清或口吃等说话缺陷。参考剪辑还用于确定音频输出的非语音相关方面，如音量、背景噪音、录音质量和混响。

提供的声音

此存储库带有几个预打包的声音。以“train_”开头的声音来自训练集，并且表现比其他声音好得多。如果你的目标是高质量的演讲，我建议你选择其中之一。如果你想看看在零镜头模仿方面能做些什么，看看其他的。

添加新语音

要向 Tortoise 添加新的声音，你需要执行以下操作：

收集扬声器的音频剪辑。好的来源是YouTube采访（你可以使用youtube-dl获取音频），有声读物或播客。好剪辑的指南在下一节中。
将剪辑剪辑成 ~10 秒的片段。你至少需要 3 个剪辑。越多越好，但我在测试中只尝试了多达 5 个。
将剪辑另存为具有浮点格式和 22，050 采样率的 WAV 文件。
在语音/ 中创建子目录
将你的剪辑放在该子目录中。
使用 --voice=<your_subdirectory_name> 运行 Tortoise 实用程序。

挑选好的参考剪辑

如上所述，你的参考剪辑对 Tortoise 的输出有深远的影响。以下是挑选好剪辑的一些提示：

避免使用带有背景音乐、噪音或混响的剪辑。这些剪辑已从训练数据集中删除。不太可能和他们相处得很好。
避免演讲。这些通常具有由放大系统引起的失真。
避免电话剪辑。
避免使用过度口吃、结结巴巴或“呃”或“喜欢”等字样的剪辑。
尝试查找以你希望输出听起来的方式朗读的剪辑。例如，如果你想听到目标声音阅读有声读物，请尝试查找他们阅读书籍的剪辑。
剪辑中说出的文字并不重要，但多样化的文字似乎表现得更好。

高级用法

生成设置

Tortoise主要是一个自回归解码器模型，与扩散模型相结合。这两个都有很多可以转动的旋钮，为了便于使用，我把它们抽象掉了。为此，我使用各种设置排列生成数千个剪辑，并使用语音真实度和清晰度指标来衡量其效果。我已经将默认值设置为我能够找到的最佳整体设置。对于特定的用例，使用这些设置可能会很有效（而且很可能我错过了一些东西！

这些设置在与 Tortoise 打包的普通脚本中不可用。但是，它们在 API 中可用。有关完整列表，请参阅。

api.tts

快速工程

有些人发现，用做快速工程是可能的！例如，你可以通过在文本前包含“我真的很伤心”之类的内容来唤起情感。我构建了一个自动编校系统，你可以使用它来利用这一点。它的工作原理是尝试编辑提示中用括号括起来的任何文本。例如，提示“[我真的很伤心，]请喂我”只会说出“请喂我”（带有悲伤的音调）。

玩声音潜伏

Tortoise 通过一个小子模型单独馈送参考剪辑来摄取参考剪辑，该子模型产生一个点潜伏，然后取所有生成的潜在点的平均值。我所做的实验表明，这些点潜伏非常具有表现力，影响从语气到语速再到言语异常的所有内容。

这适合一些巧妙的技巧。例如，你可以将两种不同的声音组合给，它将输出它认为这两种声音的“平均值”听起来像什么。

从语音生成条件反射潜伏

使用该脚本提取已安装语音的条件反射潜伏。此脚本会将潜伏转储到 .pth pickle 文件中。该文件将包含一个元组（autoregressive_latent、diffusion_latent）。

get_conditioning_latents.py

或者，使用 API。TextToSpeech.get_conditioning_latents（）获取潜伏。

使用原始条件反射潜伏生成语音

使用它们后，你可以使用它们生成语音，方法是在 voices/ 中创建一个子目录，其中包含一个包含腌制条件潜伏作为元组的“.pth”文件（autoregressive_latent，diffusion_latent）。

检测

出于对这个模型可能被滥用的担忧，我构建了一个分类器来告诉音频剪辑来自 Tortoise 的可能性。

此分类器可以在任何计算机上运行，用法如下：

python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>

This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false positives.

Model architecture

Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate models that work together. I've assembled a write-up of the system architecture here: https://nonint.com/2022/04/25/tortoise-architectural-design-doc/

Training

These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of ~50k hours of speech data, most of which was transcribed by ocotillo. Training was done on my own DLAS trainer.

I currently do not have plans to release the training configurations or methodology. See the next section..

Ethical Considerations

Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system could be misused are many. It doesn't take much creativity to think up how.

After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:

It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See
```
tortoise-detect
```
above.
If I, a tinkerer with a BS in computer science with a ~$15k computer can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.

Diversity

The diversity expressed by ML models is strongly tied to the datasets they were trained on.

Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities or of people who speak with strong accents.

Looking forward

Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.

I want to mention here that I think Tortoise could be a lot better. The three major components of Tortoise are either vanilla Transformer Encoder stacks or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason to believe that the same is not true of TTS.

Acknowledgements

This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community that have helped make this happen:

Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
Ramesh et al who authored the DALLE paper, which is the inspiration behind Tortoise.
Nichol and Dhariwal who authored the (revision of) the code that drives the diffusion model.
Jang et al who developed and open-sourced univnet, the vocoder this repo uses.
Kim and Jung who implemented univnet pytorch model.
lucidrains who writes awesome open source pytorch models, many of which are used here.
Patrick von Platen whose guides on setting up wav2vec were invaluable to building my dataset.

Notice

Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.

If you use this repo or the ideas therein for your research, please cite it! A bibtex entree can be found in the right pane on GitHub.