tortoise-tts - A multi-voice TTS system that prioritizes quality

Created at: 2022-01-28 12:33:15
Language: Python
License: Apache-2.0

Tortoise is a text-to-speech program built with the following priorities:

  1. Strong multi-voice capabilities.
  2. Highly realistic prosody and intonation.

This repository contains all the code needed to run Tortoise TTS in inference mode.

A (very) rough draft of the paper is now available in doc format. I would absolutely appreciate any comments, suggestions, or reviews: https://docs.google.com/document/d/13O_eyY65i6AkNrN_LdPhpUjGhyTNKYHvDrIvHnHe1GA

Version history

v2.4; 2022/5/17

  • Removed the CVVP model. It was found to make no appreciable difference in the output.
  • Added better debugging support; existing tools now spit out debug files that can be used to reproduce bad runs.

v2.3; 2022/5/12

  • New CLVP-large model for further-improved decoding guidance.
  • Improvements to read.py and do_tts.py (new options).

v2.2; 2022/5/5

  • Added several new voices from the training set.
  • Automated redaction: wrap text that you want to use to prompt the model, but not have spoken aloud, in brackets.
  • Bug fixes.

v2.1; 2022/5/2

  • Added the ability to produce completely random voices.
  • Added the ability to download voice conditioning latents via a script, and then use user-provided conditioning latents.
  • Added the ability to use your own pretrained models.
  • Refactored the directory structure.
  • Performance improvements and bug fixes.

What's in a name?

I name my speech-related repositories after Mojave desert flora and fauna. It is a bit tongue-in-cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder, both of which are known for their low sampling rates. On a K80, expect to generate a medium-sized sentence every 2 minutes.

Demos

See this page for a large list of example outputs.

A cool application of Tortoise + GPT-3 (not made by me): https://twitter.com/lexman_ai

Usage guide

Colab

Colab is the easiest way to try this out. I have put together a notebook you can use here: https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing

Local installation

If you want to use this on your own computer, you must have an NVIDIA GPU.

First, install PyTorch using these instructions: https://pytorch.org/get-started/locally/. On Windows, I strongly recommend the Conda installation path; I have been told that if you do not use it, you will spend a lot of time chasing dependency problems.

Next, install TorToiSe and its dependencies:

git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install

If you are on Windows, you will also need to install pysoundfile:

conda install -c conda-forge pysoundfile
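
Before going further, it can help to confirm that PyTorch was installed with CUDA support, since Tortoise requires an NVIDIA GPU. A minimal check:

# Quick sanity check: the CUDA build of PyTorch is installed and a GPU is visible.
import torch
print(torch.cuda.is_available())  # should print True on a working setup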

do_tts.py

This script allows you to speak a single phrase with one or more voices.

python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast

read.py

This script provides tools for reading large amounts of text.

python tortoise/read.py --textfile <your text to be read> --voice random

This breaks the text file up into sentences and converts them to speech one at a time. It outputs a series of spoken clips as they are generated; once all the clips are generated, it combines them into a single file and outputs that as well.

Sometimes Tortoise screws up an output. You can regenerate any bad clips by re-running read.py with the --regenerate argument.
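
For the curious, here is a hedged sketch of what read.py does conceptually, written against the programmatic API described in the next section. The file paths and the naive sentence splitter are illustrative assumptions, not the script's actual logic:

# Conceptual sketch of read.py: split text into sentences, synthesize each,
# then concatenate the clips into one file. Paths are hypothetical.
import torch
import torchaudio
from tortoise import api
from tortoise.utils import audio

tts = api.TextToSpeech()
reference_clips = [audio.load_audio(p, 22050)
                   for p in ['voices/myvoice/1.wav', 'voices/myvoice/2.wav']]
text = open('mybook.txt').read()
sentences = [s.strip() + '.' for s in text.split('.') if s.strip()]  # naive splitter
clips = [tts.tts_with_preset(s, voice_samples=reference_clips, preset='fast')
         for s in sentences]
combined = torch.cat([c.squeeze(0).cpu() for c in clips], dim=-1)
torchaudio.save('combined.wav', combined, 24000)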

API

Tortoise can be used programmatically, like so:

from tortoise import api
from tortoise.utils import audio

# clips_paths: a list of paths to reference clip WAV files
reference_clips = [audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
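
tts_with_preset returns a PyTorch tensor. A short follow-up for writing it to disk; the 24,000 Hz sample rate matches what the bundled scripts use when saving output (worth verifying against your version):

import torchaudio

# Write the generated audio to disk; the bundled scripts save at 24 kHz.
torchaudio.save('generated.wav', pcm_audio.squeeze(0).cpu(), 24000)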

Voice customization guide

Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.

These reference clips are recordings of a speaker that you provide to guide speech generation. They are used to determine many properties of the output, such as the pitch and tone of the voice, the speaking rate, and even speaking defects like a lisp or stuttering. The reference clips are also used to determine non-voice-related aspects of the audio output, like volume, background noise, recording quality, and reverb.

Random voice

I have included a feature that randomly generates a voice. These voices do not actually exist, and a new one is produced every time you run it. The results are quite fascinating, and I recommend you play around with it!

You can use the random voice by passing in 'random' as the voice name. Tortoise will take care of the rest.

For those in the ML space: this is created by projecting a random vector onto the voice conditioning latent space.

Provided voices

This repository comes with several pre-packaged voices. Voices whose names start with "train_" came from the training set and perform far better than the others. If your goal is high-quality speech, I recommend you pick one of them. If you want to see what Tortoise can do with zero-shot mimicry, take a look at the others.

Adding a new voice

To add new voices to Tortoise, you need to do the following:

  1. Gather audio clips of your speaker(s). Good sources are YouTube interviews (you can use youtube-dl to fetch the audio), audiobooks, or podcasts. Guidelines for good clips are in the next section.
  2. Cut your clips into ~10-second segments. You want at least 3 clips. More is better, but I only experimented with up to 5 in my testing.
  3. Save the clips as WAV files with floating-point format and a 22,050 Hz sample rate (a conversion sketch follows this list).
  4. Create a subdirectory in voices/.
  5. Put your clips in that subdirectory.
  6. Run Tortoise utilities with --voice=<your_subdirectory_name>.
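
As promised, here is a hedged sketch of step 3 using torchaudio; the input and output paths are hypothetical:

# Convert an arbitrary clip to a mono, float32, 22,050 Hz WAV file.
import torchaudio

wav, sr = torchaudio.load('raw_clip.mp3')             # any format torchaudio can read
wav = wav.mean(dim=0, keepdim=True)                   # downmix to mono
wav = torchaudio.functional.resample(wav, sr, 22050)  # resample to 22,050 Hz
torchaudio.save('voices/myvoice/1.wav', wav, 22050)   # float tensors save as float PCM
                                                      # (check your torchaudio version)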

Picking good reference clips

As mentioned above, your reference clips have a profound impact on the output of Tortoise. Here are some tips for picking good clips:

  1. Avoid clips with background music, noise, or reverb. Such clips were removed from the training dataset, so Tortoise is unlikely to do well with them.
  2. Avoid speeches. These generally have distortion caused by the amplification system.
  3. Avoid clips from phone calls.
  4. Avoid clips with excessive stuttering, stammering, or filler words like "uh" or "like".
  5. Try to find clips spoken in the way you want your output to sound. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
  6. The text spoken in the clips does not matter, but diverse text does seem to perform better.

Advanced usage

Generation settings

Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs that can be turned, which I have abstracted away for ease of use. I did this by generating thousands of clips with various permutations of the settings and measuring their effects using metrics for voice realism and intelligibility. I have set the defaults to the best overall settings I was able to find. For specific use cases, it may be worth playing with these settings (and it is quite possible that I missed something!).

These settings are not available in the normal scripts packaged with Tortoise. They are, however, available in the API. See api.tts for a full list.
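
For example, a hedged sketch of overriding a few knobs through api.tts directly; the parameter names below are assumptions and should be checked against the api.tts signature in your version:

# Hypothetical knob overrides; verify the names against api.tts before relying on them.
# Assumes tts and reference_clips as defined in the API section above.
pcm_audio = tts.tts("your text here", voice_samples=reference_clips,
                    num_autoregressive_samples=256,  # more samples: better, slower
                    diffusion_iterations=200,        # more iterations: better, slower
                    temperature=0.8)                 # sampling temperature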

Prompt engineering

Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion by including something like "I am really sad," before your text. I have built an automated redaction system that lets you take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the prompt "[I am really sad,] Please feed me." will only speak the words "Please feed me" (with a sad tonality).
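
Using the API, the same trick might look like this (a sketch; tts and reference_clips are as defined in the API section above):

# The bracketed text steers the emotional tone but is redacted from the spoken output.
pcm_audio = tts.tts_with_preset("[I am really sad,] Please feed me.",
                                voice_samples=reference_clips, preset='fast')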

Playing with the voice latent

Tortoise ingests reference clips by feeding each one through a small submodel that produces a point latent, then taking the mean of all the produced latents. The experimentation I have done indicates that these point latents are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.

This lends itself to some neat tricks. For example, you can feed two different voices to Tortoise, and it will output what it thinks the "average" of those two voices sounds like.
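
A hedged sketch of that trick through the API: pass reference clips from two different speakers together, and the mean of the per-clip latents lands between the two voices (the paths are hypothetical; tts is as defined in the API section above):

# Mix two voices by pooling their reference clips; Tortoise averages the latents.
from tortoise.utils import audio

clips = [audio.load_audio(p, 22050)
         for p in ['voices/voice_a/1.wav', 'voices/voice_b/1.wav']]
pcm_audio = tts.tts_with_preset("your text here", voice_samples=clips, preset='fast')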

Generating conditioning latents from voices

Use the script get_conditioning_latents.py to extract conditioning latents for a voice you have installed. This script dumps the latents to a .pth pickle file. The file contains a single tuple, (autoregressive_latent, diffusion_latent).

Alternatively, use api.TextToSpeech.get_conditioning_latents() to fetch the latents.
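
For instance, a minimal sketch of fetching and reusing latents in one session (assuming tts and reference_clips from the API section above; whether tts_with_preset forwards conditioning_latents should be verified against your version):

# Extract the (autoregressive_latent, diffusion_latent) tuple, then reuse it
# in place of raw reference clips.
conditioning_latents = tts.get_conditioning_latents(reference_clips)
pcm_audio = tts.tts_with_preset("your text here",
                                conditioning_latents=conditioning_latents,
                                preset='fast')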

Using raw conditioning latents to generate speech

After you have played with them, you can use them to generate speech by creating a subdirectory in voices/ containing a single ".pth" file with the pickled conditioning latents stored as a tuple, (autoregressive_latent, diffusion_latent).
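
A hedged sketch of wiring that up with torch.save (the subdirectory and file names are arbitrary placeholders):

# Persist the latents so the voices/ machinery can pick them up by name.
import os
import torch

os.makedirs('voices/mylatentvoice', exist_ok=True)
torch.save(conditioning_latents, 'voices/mylatentvoice/latents.pth')
# Then, e.g.: python tortoise/do_tts.py --text "..." --voice mylatentvoice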

Send me feedback!

Probabilistic models like Tortoise are best thought of as an "augmented search" - in this case, through the space of possible utterances of a specific string of text. The impact of community involvement in exploring these spaces (as is being done with GPT-3 or CLIP) has really surprised me. If you find something neat that you can do with Tortoise that is not documented here, please report it to me! I would be glad to put it up on this page.

Tortoise-detect

Out of concern that this model might be misused, I have built a classifier that tells the likelihood that an audio clip came from Tortoise.

This classifier can be run on any computer; usage is as follows:

python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>

This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false positives.

Model architecture

Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate models that work together. I've assembled a write-up of the system architecture here: https://nonint.com/2022/04/25/tortoise-architectural-design-doc/

Training

These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of ~50k hours of speech data, most of which was transcribed by ocotillo. Training was done on my own DLAS trainer.

I currently do not have plans to release the training configurations or methodology. See the next section.

Ethical Considerations

Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system could be misused are many. It doesn't take much creativity to think up how.

After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:

  1. It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
  2. It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
  3. The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
  4. I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See tortoise-detect above.
  5. If I, a tinkerer with a BS in computer science and a ~$15k computer, can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.

Diversity

The diversity expressed by ML models is strongly tied to the datasets they were trained on.

Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities or of people who speak with strong accents.

Looking forward

Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.

I want to mention here that I think Tortoise could be a lot better. The three major components of Tortoise are either vanilla Transformer encoder stacks or decoder stacks. Both of these model types have a rich experimental history with scaling in the NLP realm. I see no reason to believe that the same is not true of TTS.

The largest model in Tortoise v2 is considerably smaller than GPT-2 large. It is 20x smaller than the original DALLE transformer. Imagine what a TTS model trained at or near GPT-3 or DALLE scale could achieve.

If you are an ethical organization with computational resources to spare interested in seeing what this model could do if properly scaled out, please reach out to me! I would love to collaborate on this.

Acknowledgements

This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community that have helped make this happen:

  • Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
  • Ramesh et al who authored the DALLE paper, which is the inspiration behind Tortoise.
  • Nichol and Dhariwal, who authored (the revision of) the code that drives the diffusion model.
  • Jang et al who developed and open-sourced univnet, the vocoder this repo uses.
  • lucidrains who writes awesome open source pytorch models, many of which are used here.
  • Patrick von Platen whose guides on setting up wav2vec were invaluable to building my dataset.

Notice

Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.

If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.