Tortoise is a text-to-speech program built with the following priorities:

1. Strong multi-voice capabilities.
2. Highly realistic prosody and intonation.
This repo contains all the code needed to run Tortoise TTS in inference mode.
A (very) rough draft of the Tortoise paper is now available as a document. I would definitely appreciate any comments, suggestions or reviews: https://docs.google.com/document/d/13O_eyY65i6AkNrN_LdPhpUjGhyTNKYHvDrIvHnHe1GA
I'm naming my speech-related repos after Mojave desert flora and fauna. Sorta a play on words: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder, both known for their low sampling rates. On a K80, expect to generate a medium-sized sentence every 2 minutes.
See this page for a large list of example outputs.
A cool application of Tortoise + GPT-3 (not made by me): https://twitter.com/lexman_ai
Colab is the simplest way to try this out. I've put together a notebook you can use here: https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing
If you want to use this on your own computer, you must have an NVIDIA GPU.
First, install pytorch using these instructions: https://pytorch.org/get-started/locally/. On Windows, I strongly recommend using the Conda installation path. I have been told that if you don't do this, you will spend a lot of time chasing dependency problems.
Next, install TorToiSe and its dependencies:
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install
If you are on Windows, you will also need to install pysoundfile:
conda install -c conda-forge pysoundfile
This script allows you to speak a single phrase with one or more voices.
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
This script provides tools for reading large amounts of text.
python tortoise/read.py --textfile <your text to be read> --voice random
This will break up the text file into sentences and convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well.
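The split-then-combine flow can be sketched roughly like this (hypothetical helper names, not read.py's actual code):

```python
import re

def split_into_sentences(text):
    # Naive splitter: break on ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def combine_clips(clips):
    # Each clip is a sequence of PCM samples; concatenating them
    # yields one long clip, mirroring the final combined output.
    combined = []
    for clip in clips:
        combined.extend(clip)
    return combined
```

read.py additionally writes each clip to disk as it is generated, which is what makes per-clip regeneration possible.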
Sometimes the output will be messed up. You can regenerate any bad clips by re-running read.py with the --regenerate argument.
Tortoise can be used programmatically, like so:
# Assumes the tortoise package's api and utils modules are imported,
# e.g. `from tortoise import api, utils` (adjust to your install).
reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
Tortoise was specifically trained to be a multi-speaker model. It accomplishes this by consulting reference clips.
These reference clips are recordings of a speaker that you provide to guide speech generation. These clips are used to determine many properties of the output, such as the pitch and tone of the voice, speaking speed, and even speaking defects like a lisp or stuttering. The reference clips are also used to determine non-voice related aspects of the audio output like volume, background noise, recording quality and reverb.
I've included a feature which randomly generates a voice. These voices don't actually exist and will be random every time you run it. The results are quite fascinating and I recommend you play around with it!
You can use the random voice by passing in "random" as the voice name. Tortoise will take care of the rest.
For those in the ML space: this is created by projecting a random vector onto the voice conditioning latent space.
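A toy sketch of that idea (hypothetical dimensions and projection matrix; the real conditioning space and projection live inside the model):

```python
import random

def random_voice_latent(projection, dim=4, seed=None):
    # Sample a random Gaussian vector and project it into the
    # conditioning latent space via a (dim x latent_dim) matrix.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    latent_dim = len(projection[0])
    return [
        sum(z[i] * projection[i][j] for i in range(dim))
        for j in range(latent_dim)
    ]
```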
This repo comes with several pre-packaged voices. Voices prepended with "train_" came from the training set and perform far better than the others. If your goal is high quality speech, I recommend you pick one of them. If you want to see what Tortoise can do for zero-shot mimicking, take a look at the others.
To add new voices to Tortoise, you will need to do the following:

1. Gather audio clips of your speaker(s). Good sources are YouTube interviews, audiobooks or podcasts. Guidelines for good clips are in the next section.
2. Cut your clips into ~10 second segments. You want at least 3 clips; more is better.
3. Save the clips as WAV files with a 22,050 sample rate.
4. Create a subdirectory in voices/ and put your clips in it.
5. Run tortoise utilities with --voice=<your_subdirectory_name>.
As mentioned above, your reference clips have a profound impact on the output of Tortoise. Here are some tips for picking good clips:

1. Avoid clips with background music, noise or reverb. Such clips were removed from the training dataset, so Tortoise is unlikely to do well with them.
2. Avoid speeches. These generally have distortion caused by the amplification system.
3. Avoid clips from phone calls.
4. Avoid clips with excessive stuttering, stammering or filler words like "uh" or "like".
5. Try to find clips spoken in the style you want your output to sound like. For example, if you want to hear your target voice read an audiobook, try to find clips of them reading a book.
6. The text spoken in the clips does not matter, but diverse text does seem to perform better.
Tortoise is primarily an autoregressive decoder model combined with a diffusion model. Both of these have a lot of knobs that can be turned, which I've abstracted away for the sake of ease of use. I did this by generating thousands of clips using various permutations of the settings and using metrics for voice realism and intelligibility to measure their effects. I've set the defaults to the best overall settings I was able to find. For specific use-cases, it might be effective to play with these settings (and it's very likely that I missed something!)
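The tuning procedure described above amounts to a grid search: generate clips under each permutation of settings and keep the best-scoring combination. A generic sketch (the setting names and score_fn are stand-ins for the actual knobs and realism/intelligibility metrics):

```python
import itertools

def grid_search(settings_space, score_fn):
    # settings_space: dict mapping setting name -> list of candidate values.
    # score_fn: callable scoring one dict of settings (higher is better).
    best, best_score = None, float("-inf")
    for combo in itertools.product(*settings_space.values()):
        settings = dict(zip(settings_space.keys(), combo))
        score = score_fn(settings)
        if score > best_score:
            best, best_score = settings, score
    return best, best_score
```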
These settings are not available in the normal scripts packaged with Tortoise. They are available, however, in the API. See api.tts for a full list.
Some people have discovered that it is possible to do prompt engineering with Tortoise! For example, you can evoke emotion by including something like "I am really sad," before your text. I've built an automated redaction system that you can use to take advantage of this. It works by attempting to redact any text in the prompt surrounded by brackets. For example, the prompt "[I am really sad,] Please feed me." will only speak the words "Please feed me" (with a sad tonality).
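To illustrate just the bracket rule (this is not Tortoise's implementation: the actual redaction trims the bracketed portion out of the generated audio, which is why the bracketed text still influences delivery):

```python
import re

def bracketed_and_spoken(prompt):
    # Split a prompt into the bracketed (redacted) spans and the text
    # that should remain audible in the final clip.
    redacted = re.findall(r"\[([^\]]*)\]", prompt)
    spoken = re.sub(r"\[[^\]]*\]\s*", "", prompt).strip()
    return redacted, spoken
```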
Tortoise ingests reference clips by feeding each one through a small submodel that produces a point latent, then taking the mean of all of the produced latents. The experimentation I have done indicates that these point latents are quite expressive, affecting everything from tone to speaking rate to speech abnormalities.
This lends itself to some neat tricks. For example, you can combine feeds of two different voices, and Tortoise will output what it thinks the "average" of those two voices sounds like.
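Since each voice reduces to a point latent, combining voices is just vector averaging. A sketch of the idea (toy lists standing in for the tensors the conditioning submodel actually produces):

```python
def average_latents(latents):
    # Mean of equal-length latent vectors -- one per reference clip,
    # or one per voice when blending voices.
    n = len(latents)
    dim = len(latents[0])
    return [sum(vec[i] for vec in latents) / n for i in range(dim)]
```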
Use the get_conditioning_latents.py script to extract conditioning latents for a voice you have installed. This script will dump the latents to a .pth pickle file. The file will contain a single tuple, (autoregressive_latent, diffusion_latent).
Alternatively, use the API call TextToSpeech.get_conditioning_latents() to fetch the latents.
Once you have the latents, you can use them to generate speech by creating a subdirectory in voices/ containing a single ".pth" file with the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent).
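The resulting layout looks like this (hypothetical voice name):

```
voices/
  myvoice/
    myvoice.pth   # pickled (autoregressive_latent, diffusion_latent) tuple
```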
Probabilistic models like Tortoise are best thought of as an "augmented search" - in this case, through the space of possible utterances of a specific string of text. The impact of community involvement in perusing these spaces (such as is being done with GPT-3 or CLIP) has really surprised me. If you find something neat that you can do with Tortoise that isn't documented here, please report it to me! I would be glad to publish it to this page.
Out of concerns that this model might be misused, I've built a classifier that tells the likelihood that an audio clip came from Tortoise.
This classifier can be run on any computer. Usage is as follows:
python tortoise/is_this_from_tortoise.py --clip=<path_to_suspicious_audio_file>
This model has 100% accuracy on the contents of the results/ and voices/ folders in this repo. Still, treat this classifier
as a "strong signal". Classifiers can be fooled and it is likewise not impossible for this classifier to exhibit false
positives.
Model architecture

Tortoise TTS is inspired by OpenAI's DALLE, applied to speech data and using a better decoder. It is made up of 5 separate models that work together. I've assembled a write-up of the system architecture here:
https://nonint.com/2022/04/25/tortoise-architectural-design-doc/
Training

These models were trained on my "homelab" server with 8 RTX 3090s over the course of several months. They were trained on a dataset consisting of ~50k hours of speech data, most of which was transcribed by ocotillo. Training was done on my own DLAS trainer.
I currently do not have plans to release the training configurations or methodology. See the next section.
Ethical Considerations

Tortoise v2 works considerably better than I had planned. When I began hearing some of the outputs of the last few versions, I began wondering whether or not I had an ethically unsound project on my hands. The ways in which a voice-cloning text-to-speech system could be misused are many. It doesn't take much creativity to think up how.
After some thought, I have decided to go forward with releasing this. Following are the reasons for this choice:
- It is primarily good at reading books and speaking poetry. Other forms of speech do not work well.
- It was trained on a dataset which does not have the voices of public figures. While it will attempt to mimic these voices if they are provided as references, it does not do so in such a way that most humans would be fooled.
- The above points could likely be resolved by scaling up the model and the dataset. For this reason, I am currently withholding details on how I trained the model, pending community feedback.
- I am releasing a separate classifier model which will tell you whether a given audio clip was generated by Tortoise or not. See
tortoise-detect
above.
- If I, a tinkerer with a BS in computer science with a ~$15k computer can build this, then any motivated corporation or state can as well. I would prefer that it be in the open and everyone know the kinds of things ML can do.
Diversity

The diversity expressed by ML models is strongly tied to the datasets they were trained on. Tortoise was trained primarily on a dataset consisting of audiobooks. I made no effort to balance diversity in this dataset. For this reason, Tortoise will be particularly poor at generating the voices of minorities or of people who speak with strong accents.
Looking forward

Tortoise v2 is about as good as I think I can do in the TTS world with the resources I have access to. A phenomenon that happens when training very large models is that as parameter count increases, the communication bandwidth needed to support distributed training of the model increases multiplicatively. On enterprise-grade hardware, this is not an issue: GPUs are attached together with exceptionally wide buses that can accommodate this bandwidth. I cannot afford enterprise hardware, though, so I am stuck.
I want to mention here that I think Tortoise could be a lot better. The three major components of Tortoise are either vanilla Transformer Encoder stacks
or Decoder stacks. Both of these types of models have a rich experimental history with scaling in the NLP realm. I see no reason
to believe that the same is not true of TTS.
The largest model in Tortoise v2 is considerably smaller than GPT-2 large. It is 20x smaller than the original DALLE transformer.
Imagine what a TTS model trained at or near GPT-3 or DALLE scale could achieve.
If you are an ethical organization with computational resources to spare interested in seeing what this model could do
if properly scaled out, please reach out to me! I would love to collaborate on this.
Acknowledgements

This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community that have helped make this happen:
Tortoise was built entirely by me using my own hardware. My employer was not involved in any facet of Tortoise's development.
If you use this repo or the ideas therein for your research, please cite it! A bibtex entry can be found in the right pane on GitHub.