2025-5-20
Continuing development of this page: https://cto.eguidedog.net/node/1391
Testing the 150K checkpoint failed to generate audio. It probably needs to be retrained from scratch, so I stopped the training first; I will go through all the parameter details before starting over.
Checking the parameters revealed that sample_rate was wrong: it should be changed from 22050 to 16000.
On AI advice, changed mel_fmax from 7600 to 8500 to better capture the tonal variation of Cantonese.
On AI advice, set noam_schedule to true: the Noam adaptive learning-rate schedule warms up in the initial phase, then decays.
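For reference, a minimal sketch of these settings using Coqui's Python config classes (the recipe itself uses tacotron2-DDC.json; the field and scheduler names here follow Coqui's Tacotron2 config and should be checked against the installed version):

from TTS.config.shared_configs import BaseAudioConfig
from TTS.tts.configs.tacotron2_config import Tacotron2Config

audio_config = BaseAudioConfig(sample_rate=16000, mel_fmax=8500.0)
config = Tacotron2Config(
    audio=audio_config,
    lr_scheduler="NoamLR",                       # warm up, then decay
    lr_scheduler_params={"warmup_steps": 4000},  # illustrative warmup length
)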
Open questions to investigate (a config sketch follows this list):
- MDCC is multi-speaker audio, so I need to study how Coqui should be configured for that case.
- Find out whether style learning, i.e. Global Style Tokens (GST), can be disabled to reduce compute.
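A sketch of what both answers might look like in the config (field names follow Coqui's TacotronConfig; treat them as assumptions to verify):

from TTS.tts.configs.tacotron2_config import Tacotron2Config

config = Tacotron2Config(
    use_speaker_embedding=True,  # learn one embedding per MDCC speaker
    use_gst=False,               # skip Global Style Tokens to save compute
)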
New init steps:
- clone code: git clone https://github.com/hgneng/TTS.git
- change repo: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
- install deps: cd TTS && pip install -e .[all,dev]
- patch 1: cp /gemini/code/TTS/patch/tensorboard_logger.py /root/miniconda3/lib/python3.11/site-packages/trainer/logging/
- patch 2: cp /gemini/code/TTS/patch/summary.py /root/miniconda3/lib/python3.11/site-packages/torch/utils/tensorboard/
- patch 3: cp /gemini/code/TTS/patch/trainer.py /root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py
- run training: cd /gemini/code/TTS && time bash recipes/mdcc/tacotron2-DDC/run.sh 2>&1 | tee out
After training, link the latest checkpoint.pth to best_model.pth; only then will the next run resume from the previous progress.
Copy the event files to /gemini/output/ to see the training curves in tensorboard. Do not create symbolic links: there seems to be a bug that deletes the directory and kills the training.
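Both chores in one small helper (a sketch; run_dir is illustrative and must point at the actual training directory):

import shutil
from pathlib import Path

run_dir = Path("recipes/mdcc/tacotron2-DDC/run")  # illustrative path
# point best_model.pth at the newest checkpoint so training resumes from it
latest = max(run_dir.glob("**/checkpoint_*.pth"), key=lambda p: p.stat().st_mtime)
shutil.copy(latest, latest.parent / "best_model.pth")
# copy (never symlink) the tensorboard event files
for ev in run_dir.glob("**/events.out.tfevents.*"):
    shutil.copy(ev, "/gemini/output/")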
Model test command:
~/code/hgneng/TTS$ TTS/bin/synthesize.py --text "ngo5 wui2 syut3 jyut6 jyu5" --config_path recipes/mdcc/tacotron2-DDC/tacotron2-DDC.json --model_path recipes/mdcc/tacotron2-DDC/model_10000_411.pth --out_path ./demo.wav
2025-5-22
The XTTS v2 model looks like the easier choice for a beginner, and Ghost has already trained a Cantonese model with it successfully. Planning to switch from Tacotron 2 to XTTS v2.
kokoro, a model released on 2024-12-25, was trained on less than 100 hours of audio; it took about 500 hours on an A100 80G GPU, roughly $400. With XTTS v2 the cost may be only about 10 hours on a 4090 24G GPU. Scaled to the 8G GPU I am renting, training might finish in about 2 days.
By default tts uses hifigan_generator; generating "this is a demo" takes about 13 CPU-seconds. Not sure how long it takes once initialization is excluded.
time tts --text "this is a demo"
> Processing time: 0.7871761322021484
> Real-time factor: 0.45792617441582345
xtts_v2, by contrast, takes about 5 CPU-minutes, so the performance gap is obvious. Judged by processing time alone it is roughly a 4x gap (2.97 s vs 0.79 s).
time tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "this is a demo" --speaker_idx "Claribel Dervla" --language_idx "en"
> Processing time: 2.966963529586792
> Real-time factor: 1.9062221977677378
The Mandarin synthesis quality is quite good.
time tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "我挥一挥衣袖,不带走一片云彩" --speaker_idx "Claribel Dervla" --language_idx "zh-cn"
> Processing time: 7.238363027572632
> Real-time factor: 2.274884617416997
2025-5-23
Tried cloning Andy Lau's voice. It sounds somewhat like him, but with heavy noise, probably related to the quality of my input audio.
time tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "大家好,我是AI生成的刘德华,我暂时不会唱歌。" --speaker_wav andy.wav --language_idx "zh-cn"
2025-5-27
Training started. The process appears to take one batch of 256 samples through 20,000-odd steps × 1000 epochs, then move on to the next batch. So far a best model was produced at around step 40,000.
(base) root@gjob-dev-582417481398870016-taskrole1-0:/gemini/code/TTS/recipes/mdcc/xtts_v2# time python train_cantonese.py 2>&1 | tee out
> Training Environment:
| > Backend: Torch
| > Mixed precision: False
| > Precision: float32
| > Current device: 0
| > Num. of GPUs: 1
| > Num. of CPUs: 96
| > Num. of Torch Threads: 1
| > Torch seed: 1
| > Torch CUDNN: True
| > Torch CUDNN deterministic: False
| > Torch CUDNN benchmark: False
| > Torch TF32 MatMul: False
> Start Tensorboard: tensorboard --logdir=/gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-May-27-2025_12+37PM-4a6359e8
> Model has 518442047 parameters
> EPOCH: 0/1000
--> /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-May-27-2025_12+37PM-4a6359e8
> EVALUATION
/gemini/code/TTS/TTS/tts/layers/xtts/trainer/gpt_trainer.py:277: UserWarning: "kaiser_window" resampling method name is being deprecated and replaced by "sinc_interp_kaiser" in the next release. The default behavior remains unchanged.
dvae_wav = torchaudio.functional.resample(
--> EVAL PERFORMANCE
| > avg_loader_time: 0.04096300427506609 (+0)
| > avg_loss_text_ce: 0.04030846954300636 (+0)
| > avg_loss_mel_ce: 6.010201570464342 (+0)
| > avg_loss: 6.050510057588903 (+0)
> EPOCH: 1/1000
--> /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-May-27-2025_12+37PM-4a6359e8
> TRAINING (2025-05-27 12:38:21)
>> DVAE weights restored from: /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/XTTS_v2.0_original_model_files/dvae.pth
mdcc-dataset/text_path does not exist.
| > Found 65120 files in /gemini/data-2/dataset
> Filtering invalid eval samples!!
> Total eval samples after filtering: 247
| > Synthesizing test sentences.
> Sampling by language: dict_keys(['zh-yue'])
--> TIME: 2025-05-27 12:38:28 -- STEP: 0/21622 -- GLOBAL_STEP: 0
| > loss_text_ce: 0.03881431370973587 (0.03881431370973587)
| > loss_mel_ce: 6.046611785888672 (6.046611785888672)
| > loss: 0.07244555652141571 (0.07244555652141571)
| > current_lr: 5e-06
| > step_time: 0.9248 (0.9248268604278564)
| > loader_time: 5.8306 (5.8306238651275635)
2025-5-28
All the numbers look good and the model is converging. A new best model appeared around step 150K. About 200K steps were completed in one day; the full run seems to be about 20M steps, i.e. 100 days of training, but it should be fine to stop once the model has converged and no new best model has appeared for a long time.
Realized that changing the output sample rate from 24000 to 22050 for training may have been a mistake; once this round finishes, retrain at 24000 and compare.
kokoro uses 80M parameters while XTTS uses 500M. Glow should be the fastest model, at only 200K parameters. Planning a Glow training run later to see how fast it is.
2025-5-29
Downloaded a model for testing and got an error:
Inference...
Traceback (most recent call last):
File "/home/hgneng/code/hgneng/TTS/recipes/mdcc/xtts_v2/test.py", line 18, in <module>
out = model.inference(
^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/coqui/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/hgneng/code/hgneng/TTS/TTS/tts/models/xtts.py", line 534, in inference
text_tokens = torch.IntTensor(self.tokenizer.encode(sent, lang=language)).unsqueeze(0).to(self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hgneng/code/hgneng/TTS/TTS/tts/layers/xtts/tokenizer.py", line 653, in encode
return self.tokenizer.encode(txt).ids
^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'encode'
2025-5-30
The tokenizer problem is twofold: TTS/tts/layers/xtts/tokenizer.py needs extra logic to support zh-yue, and xtts.py never reads the config's vocab_file, so the tokenizer stays unset. Current test result: the speech is intelligible but incomplete (a 13-character sentence loses its last 3 characters), and the voice is a bit hoarse (effectively noise). Will train for another 4 days on top of the current state and see what comes out. When an important change forces a retrain anyway, switch to 24000 Hz then.
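The gist of the xtts.py side of the fix, as a sketch (attribute names such as model_args.tokenizer_file are assumptions against my fork and need checking):

# ensure the tokenizer is actually built from the configured vocab file
from TTS.tts.layers.xtts.tokenizer import VoiceBpeTokenizer

if self.tokenizer is None and config.model_args.tokenizer_file:
    self.tokenizer = VoiceBpeTokenizer(vocab_file=config.model_args.tokenizer_file)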
2025-6-3
Trained to about 1.45M steps; a best model appeared at 1.2M steps. It looks converged and unlikely to yield a better model. The result is actually worse than the 330K-step model (audio quality barely changed, and it cuts off even earlier).
2025-7-1
The earlier training used no eval set and introduced neither Cantonese characters (still undecided on those) nor the Jyutping vocab, so it is unclear how it managed to train at all (it may have leveraged the Mandarin pretraining). After switching to the Cantonese vocab, the pretrained model can no longer be used. And after adding the Cantonese dictionary, samples still fail to be recognized; something is probably still wrong somewhere. To be investigated.
fail to load a sample: UNK token found in 佢仲照樣咁樣去拔攏上面嘅草 -> -
fail to load a sample: UNK token found in 破開門 -> -
fail to load a sample: UNK token found in 得意咁話 -> -
2025-7-3
The Cantonese and Mandarin logic got somewhat tangled. I went back to the official vocab.json and got the error below. What seems to happen is: the Chinese characters are transliterated to pinyin the Mandarin way (the token ids can be looked up in vocab.json), and decoding indeed yields pinyin. The tag [zh-yue] is split into [(UNK), zh(4545), -(8), y(38), ue(925), ](UNK). So the next step is to search for zh and find where it goes wrong, i.e. why the code never enters the Jyutping conversion branch (add some logging).
If vocab-yue.json is used instead of vocab.json, almost everything encodes to 1; only [zh-yue](40) and -(8) are recognized. So I need to work out what in vocab.json makes the encoder-decoder encode and decode via pinyin.
fail to load a sample: UNK token found in 破開門 of zh-yue -> tensor([ 1, 4545, 8, 38, 925, 1, 139, 4499, 4833, 80, 4596],
dtype=torch.int32) -> zh-yuepo4kai1men2
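A quick way to see how a given vocab splits the language tag, using the HF tokenizers API that XTTS vocab files follow (paths are assumptions):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("vocab.json")
enc = tok.encode("[zh-yue]破開門")
print(list(zip(enc.tokens, enc.ids)))  # shows [zh-yue] falling apart into zh/-/y/ue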
2025-7-4
Some places in the code said yue-cn; unified them all to yue.
One place called chinese_transliterate for Cantonese and got pinyin back; a separate cantonese_transliterate should be implemented. The vocab implementation is probably also wrong and needs redoing, but after reimplementing it the machine actually froze during the run.
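A minimal sketch of what cantonese_transliterate could look like on top of pycantonese (choosing pycantonese is my assumption; any characters-to-Jyutping library would do):

import pycantonese

def cantonese_transliterate(text: str) -> str:
    # characters_to_jyutping returns (chunk, jyutping) pairs; jyutping is None for unknowns
    pairs = pycantonese.characters_to_jyutping(text)
    return "".join(jp if jp else chunk for chunk, jp in pairs)

print(cantonese_transliterate("破開門"))  # expected po3hoi1mun4, not Mandarin po4kai1men2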
2025-7-7
After rerunning it worked. The freeze was probably my Ubuntu install on an external drive acting up; it has locked up several times before. Training now runs in a VM, but memory is insufficient and the instance needs a higher configuration.
2025-7-9
Stopped with an error after a bit more than a day:
--> TIME: 2025-07-09 12:31:06 -- STEP: 1688/21707 -- GLOBAL_STEP: 349000
| > loss_text_ce: 0.032358113676309586 (0.029183796481163156)
| > loss_mel_ce: 4.705153465270996 (4.616060291703834)
| > loss: 0.05639895051717758 (0.055300525928101585)
| > current_lr: 5e-06
| > step_time: 0.1641 (0.16446169306881628)
| > loader_time: 0.0298 (0.01972889236364321)
no wav: {"text": "\u4e00", "audio_file": "mdcc-dataset/./audio/447_2012211624_71739_3881.03_3881.36.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_2012211624_71739_3881.03_3881.36"}
no wav: {"text": "\u4e0d", "audio_file": "mdcc-dataset/./audio/447_1803291536_80589_336.81159_337.03159.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1803291536_80589_336.81159_337.03159"}
no wav: {"text": "\u54fc", "audio_file": "mdcc-dataset/./audio/447_1707191853_85994_1215.29372_1215.73373.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1707191853_85994_1215.29372_1215.73373"}
no wav: {"text": "\u8a71", "audio_file": "mdcc-dataset/./audio/447_1709011939_64727_1034.59_1035.06.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1709011939_64727_1034.59_1035.06"}
no wav: {"text": "\u7b2c\u4e00", "audio_file": "mdcc-dataset/./audio/447_1810221351_47789_1028.0_1028.45.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1810221351_47789_1028.0_1028.45"}
no wav: {"text": "\u554a", "audio_file": "mdcc-dataset/./audio/447_1810221419_96005_947.05_947.44.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1810221419_96005_947.05_947.44"}
no wav: {"text": "\u4e19", "audio_file": "mdcc-dataset/./audio/447_1711171140_18276_163.89_164.34.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1711171140_18276_163.89_164.34"}
no wav: {"text": "\u54c8", "audio_file": "mdcc-dataset/./audio/447_1709011939_95079_590.36_590.73.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1709011939_95079_590.36_590.73"}
no wav: {"text": "\u54c8", "audio_file": "mdcc-dataset/./audio/447_1707191853_85012_687.72_688.16.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1707191853_85012_687.72_688.16"}
no wav: {"text": "\u4f60", "audio_file": "mdcc-dataset/./audio/447_1707101455_52606_260.58_261.06.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1707101455_52606_260.58_261.06"}
no wav: {"text": "\u6c89\u9ed8", "audio_file": "mdcc-dataset/./audio/447_1711162014_33857_69.66_70.14.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1711162014_33857_69.66_70.14"}
no wav: {"text": "\u5bc6", "audio_file": "mdcc-dataset/./audio/447_1804041046_47786_634.7_635.01.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1804041046_47786_634.7_635.01"}
no wav: {"text": "\u5509", "audio_file": "mdcc-dataset/./audio/447_1810221419_96005_358.41_358.84.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1810221419_96005_358.41_358.84"}
no wav: {"text": "\u516d", "audio_file": "mdcc-dataset/./audio/447_2101141117_97282_1908.73001_1909.1.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_2101141117_97282_1908.73001_1909.1"}
! Run is kept in /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-July-08-2025_09+44AM-05659ad2
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1833, in fit
self._fit()
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1785, in _fit
self.train_epoch()
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1504, in train_epoch
outputs, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1360, in train_step
outputs, loss_dict_new, step_time = self.optimize(
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1226, in optimize
outputs, loss_dict = self._compute_loss(
^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1157, in _compute_loss
outputs, loss_dict = self._model_train_step(batch, model, criterion)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1116, in _model_train_step
return model.train_step(*input_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gemini/code/TTS/TTS/tts/layers/xtts/trainer/gpt_trainer.py", line 308, in train_step
loss_text, loss_mel, _ = self.forward(
^^^^^^^^^^^^^
File "/gemini/code/TTS/TTS/tts/layers/xtts/trainer/gpt_trainer.py", line 215, in forward
losses = self.xtts.gpt(
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gemini/code/TTS/TTS/tts/layers/xtts/gpt.py", line 511, in forward
text_logits, mel_logits = self.get_logits(
^^^^^^^^^^^^^^^^
File "/gemini/code/TTS/TTS/tts/layers/xtts/gpt.py", line 279, in get_logits
gpt_out = self.gpt(
^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1119, in forward
outputs = block(
^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 654, in forward
feed_forward_hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 575, in forward
hidden_states = self.act(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/transformers/activations.py", line 56, in forward
return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB. GPU 0 has a total capacity of 11.70 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 13.58 GiB memory in use. Of the allocated memory 12.96 GiB is allocated by PyTorch, and 1.07 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
real 1608m32.720s
user 2338m50.832s
sys 408m9.760s
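Two cheap mitigations worth trying before the rerun (a sketch: the env var comes straight from the error message, and batch_size is the Coqui trainer config field, value purely illustrative):

# at the top of train_cantonese.py, before torch allocates anything
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # fight fragmentation

# and/or shrink the batch until the 11.7 GiB card stops overflowing
config.batch_size = 2  # illustrative value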
2025-7-11
The test failed; the configuration seems wrong. Judging by the training loss curve, it only dropped about 20%, not even close to an order of magnitude, so the training itself must have gone wrong. To be investigated. The [yue] id in vocab-yue.json was never added to the model array in the file; this may be the cause. Fix and retrain.
Computing speaker latents...
Inference...
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
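A quick check that [yue] really landed in the tokenizer file (the model/vocab and added_tokens keys are the standard HF tokenizers JSON layout):

import json

v = json.load(open("vocab-yue.json"))
print("[yue]" in v["model"]["vocab"])  # must be True for encoding to work
print([t["id"] for t in v.get("added_tokens", []) if t["content"] == "[yue]"])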
2025-7-14
After fixing the [yue] id, tested the 10000-step checkpoint: no more warning, but the generated audio is almost completely silent. Will check the effect again after another day.
2025-7-15
The loss does decrease (though not by anything like an order of magnitude), but the result is still a flat line with only a bit of faint noise. Inference on the VM shows no warning, yet downloading the model and running inference locally still raises the attention-mask warning above. The model shipped by coqui also raises the attention-mask warning, but it does produce synthesized audio; it just sounds like an unintelligible dialect.
2025-7-16
1. Understand why the vocabularies mismatch
2. Solution: extend the vocabulary rather than replace it (see the sketch after this list)
  1. Keep the original vocab.json unchanged
  2. Create a separate config file
  3. Freeze the text encoder during fine-tuning
  4. Map Jyutping characters during data preprocessing
3. Other fine-tuning notes
4. Troubleshooting common errors
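A sketch of step 2, extending the stock vocabulary in place via the HF tokenizers API (the exact Jyutping token inventory is an assumption):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("vocab.json")  # the original, kept unchanged on disk
tok.add_tokens(["[yue]"])                # language tag
tok.add_tokens(["aa1", "aa2", "aa3"])    # ...plus the full Jyutping inventory
tok.save("vocab-yue.json")
# new ids sit past the old vocab size, so the GPT text embedding (and output
# head) must be resized to tok.get_vocab_size() before fine-tuning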
2025-7-18
2025-7-19
Out of memory; training stopped at around 100K steps. The test result no longer shows the hoarseness from 2025-5-30, but the audio still drops the last 3 characters. Overall the result matches expectations. Restarting training with a higher-spec configuration.
--> TIME: 2025-07-19 02:51:42 -- STEP: 1365/21707 -- GLOBAL_STEP: 109900
| > loss_text_ce: 0.026061130687594414 (0.02817961076349566)
| > loss_mel_ce: 2.6668691635131836 (2.779251498180433)
| > loss: 0.03205869346857071 (0.03342179957872781)
| > current_lr: 5e-06
| > step_time: 0.3719 (0.384144286794977)
| > loader_time: 0.0562 (0.022447185027293638)
! Run is kept in /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-July-18-2025_10+36AM-05659ad2
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1833, in fit
self._fit()
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1785, in _fit
self.train_epoch()
File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1503, in train_epoch
for cur_step, batch in enumerate(self.train_loader):
File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
success, data = self._try_get_data()
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 514, in rebuild_storage_filename
storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Shared memory manager connection has timed out
no wav: {"text": "\u5176\u5be6", "audio_file": "mdcc-dataset/./audio/447_2101051829_46868_3236.52_3237.0.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_2101051829_46868_3236.52_3237.0"}
no wav: {"text": "\u4e94", "audio_file": "mdcc-dataset/./audio/447_2012211629_64656_1517.48883_1517.91882.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_2012211629_64656_1517.48883_1517.91882"}
no wav: {"text": "\u7532", "audio_file": "mdcc-dataset/./audio/447_1711171140_18276_147.04_147.4.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1711171140_18276_147.04_147.4"}
no wav: {"text": "\u5629", "audio_file": "mdcc-dataset/./audio/447_1810221517_55035_291.02_291.5.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1810221517_55035_291.02_291.5"}
no wav: {"text": "\u5509", "audio_file": "mdcc-dataset/./audio/447_1810221419_16686_514.352_514.722.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1810221419_16686_514.352_514.722"}
no wav: {"text": "\u4e00", "audio_file": "mdcc-dataset/./audio/447_1804261541_33391_432.74_433.06.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1804261541_33391_432.74_433.06"}
no wav: {"text": "\u5509", "audio_file": "mdcc-dataset/./audio/447_1810221419_96029_1142.27_1142.62.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1810221419_96029_1142.27_1142.62"}
no wav: {"text": "\u4fc2", "audio_file": "mdcc-dataset/./audio/447_1707171718_26643_638.19286_638.62287.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1707171718_26643_638.19286_638.62287"}
no wav: {"text": "\u4e00", "audio_file": "mdcc-dataset/./audio/447_2012211629_64656_2953.68_2954.04.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_2012211629_64656_2953.68_2954.04"}
no wav: {"text": "\u4e00", "audio_file": "mdcc-dataset/./audio/447_1709011939_34540_764.5_764.92.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1709011939_34540_764.5_764.92"}
no wav: {"text": "\u6c89\u9ed8", "audio_file": "mdcc-dataset/./audio/447_1711162014_33857_69.66_70.14.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1711162014_33857_69.66_70.14"}
no wav: {"text": "\u66f0", "audio_file": "mdcc-dataset/./audio/447_1711171140_69982_352.12_352.6.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1711171140_69982_352.12_352.6"}
no wav: {"text": "\u5509", "audio_file": "mdcc-dataset/./audio/447_1810221419_39922_676.75_677.24.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1810221419_39922_676.75_677.24"}
no wav: {"text": "\u516d", "audio_file": "mdcc-dataset/./audio/447_1711171106_61214_160.02_160.36.wav", "speaker_name": "unknown", "root_path": "mdcc-dataset/", "language": "yue", "audio_unique_name": "mdcc#audio/447_1711171106_61214_160.02_160.36"}
real 979m18.412s
user 1115m41.485s
sys 304m6.906s
New init steps:
- clone code: git clone https://github.com/hgneng/TTS.git
- change repo: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
- install deps: cd TTS && pip install -e .[all,dev]
- run training: cd /gemini/code/TTS/recipes/mdcc/xtts_v2 && time TRAINER_TELEMETRY=0 python train_cantonese.py 2>&1 | tee out
After training, link the latest checkpoint.pth to best_model.pth; only then will the next run resume from the previous progress.
Copy the event files to /gemini/output/ to see the training curves in tensorboard. Do not create symbolic links: there seems to be a bug that deletes the directory and kills the training.
Model test command:
- First point model.pth at the checkpoint file to be tested
- Copy vocab-yue.json to vocab.json
- ~/code/hgneng/TTS/recipes/mdcc/xtts_v2$ python test.py