Coqui TTS
🐸 (Frog) TTS
https://github.com/coqui-ai/TTS
Documentation: https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html
The first time it runs, TTS needs to download a model. If that download fails, later runs fail as well. To force a retry, remove the empty model folder from the path below:
/home/hgneng/.local/share/tts/
Models are downloaded from here: https://github.com/coqui-ai/TTS/releases/tag/v0.6.1_models
There is also a discussion forum where you can browse or ask questions when you are stuck: https://github.com/coqui-ai/TTS/discussions
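The cleanup described above can be scripted. The following is a minimal sketch, assuming the failed download leaves behind a sub-folder with no files in it. The function name remove_empty_model_dirs is made up for illustration and is not part of Coqui TTS. A folder that holds a partial download is left alone and would still need to be removed by hand.

```python
from pathlib import Path


def remove_empty_model_dirs(root: Path) -> list:
    """Delete direct sub-folders of `root` that are completely empty.

    Returns the list of folders that were removed.
    """
    removed = []
    if not root.is_dir():
        return removed
    for child in sorted(root.iterdir()):
        # Only touch directories that contain nothing at all.
        if child.is_dir() and not any(child.iterdir()):
            child.rmdir()
            removed.append(child)
    return removed
```

Calling remove_empty_model_dirs(Path.home() / ".local/share/tts") before the next run targets the folder shown above for the current user.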
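Problems with Coqui Mandarin
An extra g sound is usually added before syllables such as ai, an and ang.
Syllables such as gai and gan lose their initial g sound. The model may have learned this wrongly.
The e sound is inaccurate: either it is not pronounced at all, or a k sound is added in front of it.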
Courses
Open courses: https://edu.speechhome.com/
Self-attention mechanism: https://www.youtube.com/watch?v=hYdO9CscNes
Searching for videos by "李宏毅" (Hung-yi Lee) leads to a complete machine learning course.
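Training a Chinese voice
Someone is attempting exactly this. He appears to have synthesized speech successfully and only ran into problems when customizing it: https://github.com/coqui-ai/TTS/discussions/2488
A Chinese model already exists, but for some reason an odd stretch of repeated speech is appended to the audio (it seems the output has to be padded to 12.05 seconds):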
from TTS.api import TTS
tts = TTS(model_name="tts_models/zh-CN/baker/tacotron2-DDC-GST")
tts.tts_to_file("你好")
There is a TensorFlow version of this voice (though I could not get it to run): https://huggingface.co/tensorspeech/tts-tacotron2-baker-ch
Custom voices
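A reader comment further down reports that ending short text with an explicit full stop avoids the trailing artifact. A minimal sketch of that workaround follows. The helper name ensure_terminated is hypothetical, and treating other sentence-final punctuation as equally effective is an assumption, since only "。" was reported to work.

```python
# Sentence-final marks that are treated as "already terminated".
# Only "。" comes from the reported fix; the others are an assumption.
TERMINATORS = ("。", "!", "?", ".", "!", "?")


def ensure_terminated(text: str, terminator: str = "。") -> str:
    """Append a full stop when the text does not already end a sentence."""
    stripped = text.rstrip()
    if not stripped or stripped.endswith(TERMINATORS):
        return stripped
    return stripped + terminator
```

The result can then be passed straight to the synthesizer, for example tts.tts_to_file(ensure_terminated("你好")).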
Raise your voice - training a model on your very own voice clips with Common Voice and Coqui
YourTTS: Zero-Shot Multi-Speaker Text Synthesis and Voice Conversion
https://github.com/Edresson/YourTTS
Create a custom Speech-to-Text model for 💫 Your Voice 💫 with Common Voice
Best Procedure For Voice Cloning
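The following commands clone a voice easily and take 11 seconds. A multi-lingual model is required. The main problem at present is probably performance. If nothing else works, generate the basic pinyin syllables and let Ekho do the synthesis.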
from TTS.api import TTS
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False, gpu=False)
tts.tts_to_file("This is voice cloning.", speaker_wav="cameron.wav", language="en", file_path="output.wav")
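The timings quoted in these notes can be reproduced with a small wall-clock helper. This is a sketch only: timed is a hypothetical name, and the figure it reports depends on the machine and on whether the model has already been downloaded.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call `fn` once and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With the tts object created above, the cloning call could be measured as _, seconds = timed(lambda: tts.tts_to_file("This is voice cloning.", speaker_wav="cameron.wav", language="en", file_path="output.wav")).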
Coqui STT
https://github.com/coqui-ai/STT
Tacotron2
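Released in 2017, Tacotron was one of the first models to successfully apply deep learning to TTS. Tacotron is mainly an encoder-decoder model with attention.
Tacotron2 followed at the end of 2017. This model takes 74 seconds to synthesize "Hello World".
In 2020, Eren Gölge of Coqui proposed the Tacotron2 Double Decoder Consistency (DDC) model. This model takes 9 seconds to synthesize "Hello World".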
Reference: https://tts.readthedocs.io/en/latest/models/tacotron1-2.html
Learning Tacotron2, PyTorch's speech-synthesis module: https://pytorch.org/audio/stable/tutorials/tacotron2_pipeline_tutorial.html
We need to work around a network issue with this download:
In /home/hgneng/miniconda3/envs/tacotron2/lib/python3.10/site-packages/torch/hub.py, download_url_to_file fetches https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt into /home/hgneng/.cache/torch/hub/checkpoints/en_us_cmudict_forward.pt
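One possible workaround is to fetch the checkpoint separately (for instance from a network that can reach the host) and place it at the cache path shown above. The sketch below assumes that the loader skips the download when the file is already present and that the cache lives in the default torch hub location. The names checkpoint_destination and prefetch_checkpoint are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

CHECKPOINT_URL = (
    "https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/"
    "DeepPhonemizer/en_us_cmudict_forward.pt"
)
# Assumed default cache location, matching the path in the notes above.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"


def checkpoint_destination(url: str, cache_dir: Path) -> Path:
    """Return where the file named in `url` would live inside `cache_dir`."""
    return cache_dir / Path(urlparse(url).path).name


def prefetch_checkpoint(url: str, cache_dir: Path = DEFAULT_CACHE_DIR) -> Path:
    """Download `url` into `cache_dir` unless the file is already there."""
    dest = checkpoint_destination(url, cache_dir)
    if dest.exists():
        return dest
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first so an interrupted transfer
    # does not leave a truncated file at the final path.
    partial = dest.with_name(dest.name + ".part")
    urlretrieve(url, str(partial))
    partial.replace(dest)
    return dest
```

Running prefetch_checkpoint(CHECKPOINT_URL) once, or copying the file into place by hand, should let the tutorial proceed without touching the network for this file.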
The vocoder step that converts spectrograms to a waveform seems slow.
This model takes 139 seconds to synthesize "Hello World", far longer than Coqui's Tacotron2 DDC model (9 seconds).
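To see where the time goes, each stage of the torchaudio pipeline can be timed separately. This sketch follows the structure of the PyTorch tutorial linked above and assumes the TACOTRON2_WAVERNN_PHONE_LJSPEECH bundle, which may not be the one used for the 139-second measurement. The function name synthesize_with_timings is hypothetical, and model loading is deliberately left out of the timings.

```python
import time


def synthesize_with_timings(text: str, out_path: str = "output.wav") -> dict:
    """Run the torchaudio Tacotron2 + WaveRNN bundle and time each stage."""
    # Imported here so the heavy dependencies load only when synthesis runs.
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
    processor = bundle.get_text_processor()  # requires the deep_phonemizer package
    tacotron2 = bundle.get_tacotron2()
    vocoder = bundle.get_vocoder()

    timings = {}
    with torch.inference_mode():
        start = time.perf_counter()
        tokens, lengths = processor(text)
        timings["text_processing"] = time.perf_counter() - start

        start = time.perf_counter()
        spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
        timings["tacotron2"] = time.perf_counter() - start

        start = time.perf_counter()
        waveforms, _ = vocoder(spec, spec_lengths)
        timings["vocoder"] = time.perf_counter() - start

    torchaudio.save(out_path, waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)
    return timings
```

Calling synthesize_with_timings("Hello World") returns a dictionary of seconds per stage, which makes it possible to check whether the vocoder really dominates the total.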
What's the difference between phoneme-based TTS and character-based TTS?
Phoneme-based TTS first converts the text into a sequence of phonemes (the sound units of the language), usually with a pronunciation dictionary or a grapheme-to-phoneme model, and feeds that sequence to the acoustic model. Character-based TTS feeds the written characters directly to the model, which then has to learn pronunciation implicitly from the training data. Phoneme input generally gives more reliable pronunciation, especially for languages whose spelling does not map regularly to sound, at the cost of an extra front-end step whose mistakes carry through to the audio.
https://pytorch.org/audio/2.0.1/pipelines.html
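The difference in model input can be illustrated with a toy example. The lexicon below is a made-up two-word dictionary using ARPAbet-style symbols without stress marks. Real systems use a full pronunciation dictionary or a trained grapheme-to-phoneme model, and the function names here are hypothetical.

```python
# Toy pronunciation dictionary, for illustration only.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def char_tokens(text: str) -> list:
    """Character-based input: the model sees the spelling itself."""
    return list(text.lower())


def phoneme_tokens(text: str, lexicon: dict) -> list:
    """Phoneme-based input: words are looked up and replaced by sounds."""
    tokens = []
    for word in text.lower().split():
        # Unknown words get a placeholder; a real front end would fall
        # back to a grapheme-to-phoneme model instead.
        tokens.extend(lexicon.get(word, ["<unk>"]))
    return tokens
```

For "Hello world", char_tokens yields eleven symbols including the space, while phoneme_tokens yields eight sound symbols, so the model no longer has to work out that the two l letters in "hello" are one sound.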
Tacotron2 data model
Understand its models and see whether one is usable for Chinese; if not, find a way to train one myself: https://pytorch.org/audio/stable/pipelines.html
DeepPhonemizer
DeepPhonemizer is a multilingual grapheme-to-phoneme modeling library that leverages recent deep learning technology and is optimized for usage in production systems such as TTS. In particular, the library should be accurate, fast, easy to use. Moreover, you can train a custom model on your own dataset in a few lines of code.
DeepPhonemizer is compatible with Python 3.6+ and is distributed under the MIT license.
Read the documentation at: https://as-ideas.github.io/DeepPhonemizer/
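AISHELL-3 (希尔贝壳) high-fidelity Chinese speech database
AISHELL-3, the Mandarin Chinese speech database from AISHELL (希尔贝壳), contains 85 hours of speech in 88,035 utterances and can be used to build multi-speaker synthesis systems. Recording took place in a quiet indoor environment with high-fidelity microphones (44.1 kHz, 16-bit). 218 speakers from different accent regions of China took part. Professional annotators added pinyin and prosody labels, and after strict quality inspection the character-level transcription accuracy of the database is above 98%.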
https://www.aishelltech.com/aishell_3
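The headline figures give a rough feel for what each speaker contributes. The arithmetic below uses only the three numbers quoted above and assumes the data is spread evenly across speakers, which the source does not claim.

```python
TOTAL_HOURS = 85
UTTERANCES = 88_035
SPEAKERS = 218

# Average length of one utterance, in seconds.
avg_seconds_per_utterance = TOTAL_HOURS * 3600 / UTTERANCES
# Average amount of material per speaker, assuming an even split.
utterances_per_speaker = UTTERANCES / SPEAKERS
minutes_per_speaker = TOTAL_HOURS * 60 / SPEAKERS

print(f"{avg_seconds_per_utterance:.2f} s per utterance")
print(f"{utterances_per_speaker:.0f} utterances per speaker")
print(f"{minutes_per_speaker:.1f} minutes per speaker")
```

That works out to roughly 3.5 seconds per utterance and about 400 utterances, or a little over 23 minutes of audio, per speaker on average.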
Common Voice Dataset
We’re building an open source, multi-language dataset of voices that anyone can use to train speech-enabled applications.
Includes both Cantonese and Mandarin Chinese!!
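A sample of the Cantonese (Chinese Hong Kong) speech data was of poor quality: the speakers' voices are not clear enough (not voice-actor level), the background noise is fairly loud, and the label files contain errors. There is also a separate Cantonese category.
My feeling is that data generated with an existing TTS might be of much better quality.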
Librosa
audio and music processing in Python
Conda
We should avoid installing packages in the base environment. If there is a conflict, remove the packages from base.
How to activate conda env in Visual Studio Code?
1. Open Visual Studio Code.
2. Go to the Extensions tab (Ctrl+Shift+X) and install the Python extension.
3. Go to File > Preferences > Settings.
4. Search for "conda".
5. If conda is not on your PATH, set "python.condaPath" to the path of the conda executable.
6. Open the Command Palette (Ctrl+Shift+P) and run the Python: Select Interpreter command.
7. Select the conda environment you would like to use in Visual Studio Code.
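If the environment does not show up in the interpreter list, its path can be entered by hand. The sketch below asks conda for its environments with conda env list --json and derives the interpreter path for each. The function names are hypothetical, and the bin/python and python.exe locations are the usual layouts rather than a guarantee.

```python
import json
import os
import shutil
import subprocess
from pathlib import Path


def interpreter_paths(env_list_json: str, windows: bool = (os.name == "nt")) -> list:
    """Turn the JSON printed by `conda env list --json` into interpreter paths."""
    envs = json.loads(env_list_json).get("envs", [])
    relative = Path("python.exe") if windows else Path("bin") / "python"
    return [Path(env) / relative for env in envs]


def list_conda_interpreters() -> list:
    """Query the local conda installation; returns [] if conda is not found."""
    conda = shutil.which("conda")
    if conda is None:
        return []
    completed = subprocess.run(
        [conda, "env", "list", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return interpreter_paths(completed.stdout)
```

Any path returned by list_conda_interpreters() can be pasted into the "Enter interpreter path" option of the Python: Select Interpreter command.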
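Comments (13)
About this TTS
I have installed this TTS on Debian. How do I invoke it? Thanks.
小草莓 (Little Strawberry) told me about this TTS; I am just bookmarking it…
About this TTS
Would it be possible to develop a version based on this TTS that Orca can call?
First we need to work out how Frog TTS can support Chinese; Orca support can be considered after that…
About this TTS
Does this still not support Chinese? Is there any chance of making it support Chinese?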
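Not yet. How hard it would be to add Chinese support is unclear, and I have not had time to look into it. I expect it needs a Chinese corpus suitable for deep learning, followed by training to create a Chinese model. Given my current abilities, that is still too difficult.
I know of this TTS but have not used it. Is it perhaps just a framework?
Besides this one there is also MaryTTS. As I understand it, are these training frameworks for which you have to obtain trained voices separately?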
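My understanding is that for English there are already trained models: record a small amount of speech and the model can extract features and synthesize a voice very similar to the speaker's. For Chinese, a model probably does not exist yet, and producing one would require deep-learning training.
Length 12 s
If you treat the full stop as an explicit terminator and manually append one to short text, the unexpected trailing warble no longer appears. For example: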
tts.tts_to_file("你好。")
Many thanks for the expert tip! This really is a piece of magic!
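Could you teach me? For the past two days I have been trying to make a simple call and have not succeeded. The various projects I see on GitHub read like gibberish to me and I cannot get started. My home computer is also underpowered, so now I want to try Google's Colaboratory.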
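Coqui is based on PyTorch, and Colab seems to run TensorFlow, so it might not work. My computer has no GPU and can still run Coqui; I just have to wait half a minute for the result. It is easier to install on Ubuntu: there are not many commands, but the downloads take a long time.
In theory it works; I have not tried it.
You will need your own VPN or proxy.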
https://gist.github.com/erogol/97516ad65b44dbddb8cd694953187c5b
https://github.com/coqui-ai/TTS/discussions/1074
Colab supports PyTorch. SD (Stable Diffusion) image generation also uses PyTorch, and there are many Colab templates.