A Lightweight TTS Model Implementation

2024-06-19 11:28
Tags: implementation, model, lightweight, tts

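This article presents a lightweight TTS model implementation: a Glow-TTS model assembled from layers in the TTS package, covering the environment, the dataset, the dependencies, and the full training code.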

1. Environment

Python 3.9

2. Training dataset

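This walkthrough uses the LJSpeech dataset. Baidu Netdisk download link: https://pan.baidu.com/s/1DDFmPpHQrTR_NvjAfwX-QA
Extraction code: 1234
train.py expects the extracted dataset under train/LJSpeech-1.1/.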
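For orientation, LJSpeech ships a pipe-separated metadata.csv in which each row holds a clip id, the raw transcription, and a normalized transcription. The sketch below shows how one such row can be turned into the kind of sample dict that train.py later sorts by audio_file. The function name parse_ljspeech_row and the sample row are hypothetical, and the 'ljspeech' formatter in the TTS package is assumed to do something similar rather than exactly this.

```python
import posixpath


def parse_ljspeech_row(row, root="train/LJSpeech-1.1"):
    """Split one metadata.csv row (id|raw text|normalized text) into a sample dict.

    Hypothetical helper for illustration only; it is not part of the TTS package.
    """
    cols = row.rstrip("\n").split("|")
    if len(cols) < 3:
        raise ValueError(f"expected 3 pipe-separated fields, got {len(cols)}")
    return {
        # Audio clips live in the wavs/ sub-folder and are named after the id.
        "audio_file": posixpath.join(root, "wavs", cols[0] + ".wav"),
        # The normalized transcription (numbers and abbreviations spelled out).
        "text": cols[2],
    }


if __name__ == "__main__":
    # Made-up row in the LJSpeech layout, not taken from the real dataset.
    print(parse_ljspeech_row("LJ000-0000|Dr. Smith said hi.|Doctor Smith said hi."))
```

Keeping the normalized column means the phoneme cleaner receives text that already has numbers and abbreviations expanded.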

3. Install dependencies

pip install TTS
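Before running the code, it can help to confirm that the packages imported by the scripts in this article are actually visible to the interpreter. The sketch below uses only the standard library; the helper name missing_packages is hypothetical, and the default list is simply the set of top-level modules that decoder.py, encoder.py, glow_tts.py and train.py import.

```python
import importlib.util


def missing_packages(names=("TTS", "trainer", "torch", "numpy", "coqpit")):
    """Return the names from `names` that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

An empty result only means the modules can be located; it does not check that their versions are compatible with each other.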

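4. Project structure

The project consists of four source files in one directory (decoder.py, encoder.py, glow_tts.py and train.py) plus a train/ folder that holds the LJSpeech-1.1 dataset and receives the training outputs.

5. Code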

decoder.py

import torch
from torch import nn

from TTS.tts.layers.generic.normalization import ActNorm
from TTS.tts.layers.glow_tts.glow import CouplingBlock, InvConvNear


def squeeze(x, x_mask=None, num_sqz=2):
    """GlowTTS squeeze operation

    Increase number of channels and reduce number of time steps
    by the same factor.

    Note:
        each 's' is a n-dimensional vector.
        ``[s1,s2,s3,s4,s5,s6] --> [[s1, s3, s5], [s2, s4, s6]]``
    """
    b, c, t = x.size()

    t = (t // num_sqz) * num_sqz
    x = x[:, :, :t]
    x_sqz = x.view(b, c, t // num_sqz, num_sqz)
    x_sqz = x_sqz.permute(0, 3, 1, 2).contiguous().view(b, c * num_sqz, t // num_sqz)

    if x_mask is not None:
        x_mask = x_mask[:, :, num_sqz - 1 :: num_sqz]
    else:
        x_mask = torch.ones(b, 1, t // num_sqz).to(device=x.device, dtype=x.dtype)
    return x_sqz * x_mask, x_mask


def unsqueeze(x, x_mask=None, num_sqz=2):
    """GlowTTS unsqueeze operation (revert the squeeze)

    Note:
        each 's' is a n-dimensional vector.
        ``[[s1, s3, s5], [s2, s4, s6]] --> [[s1, s3, s5, s2, s4, s6]]``
    """
    b, c, t = x.size()

    x_unsqz = x.view(b, num_sqz, c // num_sqz, t)
    x_unsqz = x_unsqz.permute(0, 2, 3, 1).contiguous().view(b, c // num_sqz, t * num_sqz)

    if x_mask is not None:
        x_mask = x_mask.unsqueeze(-1).repeat(1, 1, 1, num_sqz).view(b, 1, t * num_sqz)
    else:
        x_mask = torch.ones(b, 1, t * num_sqz).to(device=x.device, dtype=x.dtype)
    return x_unsqz * x_mask, x_mask


class Decoder(nn.Module):
    """Stack of Glow Decoder Modules.

    ::

        Squeeze -> ActNorm -> InvertibleConv1x1 -> AffineCoupling -> Unsqueeze

    Args:
        in_channels (int): channels of input tensor.
        hidden_channels (int): hidden decoder channels.
        kernel_size (int): Coupling block kernel size. (Wavenet filter kernel size.)
        dilation_rate (int): rate to increase dilation by each layer in a decoder block.
        num_flow_blocks (int): number of decoder blocks.
        num_coupling_layers (int): number coupling layers. (number of wavenet layers.)
        dropout_p (float): wavenet dropout rate.
        sigmoid_scale (bool): enable/disable sigmoid scaling in coupling layer.
    """

    def __init__(
        self,
        in_channels,
        hidden_channels,
        kernel_size,
        dilation_rate,
        num_flow_blocks,
        num_coupling_layers,
        dropout_p=0.0,
        num_splits=4,
        num_squeeze=2,
        sigmoid_scale=False,
        c_in_channels=0,
    ):
        super().__init__()

        self.in_channels = in_channels
        self.hidden_channels = hidden_channels
        self.kernel_size = kernel_size
        self.dilation_rate = dilation_rate
        self.num_flow_blocks = num_flow_blocks
        self.num_coupling_layers = num_coupling_layers
        self.dropout_p = dropout_p
        self.num_splits = num_splits
        self.num_squeeze = num_squeeze
        self.sigmoid_scale = sigmoid_scale
        self.c_in_channels = c_in_channels

        self.flows = nn.ModuleList()
        for _ in range(num_flow_blocks):
            self.flows.append(ActNorm(channels=in_channels * num_squeeze))
            self.flows.append(InvConvNear(channels=in_channels * num_squeeze, num_splits=num_splits))
            self.flows.append(
                CouplingBlock(
                    in_channels * num_squeeze,
                    hidden_channels,
                    kernel_size=kernel_size,
                    dilation_rate=dilation_rate,
                    num_layers=num_coupling_layers,
                    c_in_channels=c_in_channels,
                    dropout_p=dropout_p,
                    sigmoid_scale=sigmoid_scale,
                )
            )

    def forward(self, x, x_mask, g=None, reverse=False):
        """
        Shapes:
            - x:  :math:`[B, C, T]`
            - x_mask: :math:`[B, 1 ,T]`
            - g: :math:`[B, C]`
        """
        if not reverse:
            flows = self.flows
            logdet_tot = 0
        else:
            flows = reversed(self.flows)
            logdet_tot = None

        if self.num_squeeze > 1:
            x, x_mask = squeeze(x, x_mask, self.num_squeeze)
        for f in flows:
            if not reverse:
                x, logdet = f(x, x_mask, g=g, reverse=reverse)
                logdet_tot += logdet
            else:
                x, logdet = f(x, x_mask, g=g, reverse=reverse)
        if self.num_squeeze > 1:
            x, x_mask = unsqueeze(x, x_mask, self.num_squeeze)
        return x, logdet_tot

    def store_inverse(self):
        for f in self.flows:
            f.store_inverse()
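The squeeze and unsqueeze helpers in decoder.py trade time steps for channels and back. To make the index arithmetic easier to follow, here is a torch-free sketch that mirrors the same view/permute ordering on nested Python lists for a single batch item. The names squeeze_list and unsqueeze_list are hypothetical, and masking is left out on purpose.

```python
def squeeze_list(x, num_sqz=2):
    """x is a list of C channels, each a list of T values (one batch item).

    Returns C * num_sqz channels with T // num_sqz steps each.
    """
    c, t = len(x), len(x[0])
    t = (t // num_sqz) * num_sqz  # drop trailing frames that do not fill a group
    steps = t // num_sqz
    out = [[None] * steps for _ in range(c * num_sqz)]
    for ch in range(c):
        for j in range(steps):
            for k in range(num_sqz):
                # frame k of group j lands in channel block k
                out[k * c + ch][j] = x[ch][j * num_sqz + k]
    return out


def unsqueeze_list(x, num_sqz=2):
    """Inverse of squeeze_list: C channels with T steps -> C // num_sqz channels with T * num_sqz steps."""
    c, t = len(x), len(x[0])
    c_out = c // num_sqz
    out = [[None] * (t * num_sqz) for _ in range(c_out)]
    for ch in range(c_out):
        for j in range(t):
            for k in range(num_sqz):
                out[ch][j * num_sqz + k] = x[k * c_out + ch][j]
    return out


if __name__ == "__main__":
    frames = [[1, 2, 3, 4, 5]]           # 1 channel, 5 time steps
    squeezed = squeeze_list(frames)      # [[1, 3], [2, 4]]
    print(squeezed)
    print(unsqueeze_list(squeezed))      # [[1, 2, 3, 4]]; the odd fifth frame was dropped
```

The round trip restores the input up to the dropped remainder frames, which is why GlowTTS.preprocess trims the mel length to a multiple of num_squeeze before the decoder runs.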

encoder.py

import math

import torch
from torch import nn

from TTS.tts.layers.generic.gated_conv import GatedConvBlock
from TTS.tts.layers.generic.res_conv_bn import ResidualConv1dBNBlock
from TTS.tts.layers.generic.time_depth_sep_conv import TimeDepthSeparableConvBlock
from TTS.tts.layers.glow_tts.duration_predictor import DurationPredictor
from TTS.tts.layers.glow_tts.glow import ResidualConv1dLayerNormBlock
from TTS.tts.layers.glow_tts.transformer import RelativePositionTransformer
from TTS.tts.utils.helpers import sequence_mask


class Encoder(nn.Module):
    """Glow-TTS encoder module.

    ::

        embedding -> <prenet> -> encoder_module -> <postnet> --> proj_mean
                                                             |
                                                             |-> proj_var
                                                             |
                                                             |-> concat -> duration_predictor
                                                                    ↑
                                                              speaker_embed

    Args:
        num_chars (int): number of characters.
        out_channels (int): number of output channels.
        hidden_channels (int): encoder's embedding size.
        hidden_channels_ffn (int): transformer's feed-forward channels.
        kernel_size (int): kernel size for conv layers and duration predictor.
        dropout_p (float): dropout rate for any dropout layer.
        mean_only (bool): if True, output only mean values and use constant std.
        use_prenet (bool): if True, use pre-convolutional layers before transformer layers.
        c_in_channels (int): number of channels in conditional input.

    Shapes:
        - input: (B, T, C)

    ::

        suggested encoder params...

        for encoder_type == 'rel_pos_transformer'
            encoder_params={
                'kernel_size':3,
                'dropout_p': 0.1,
                'num_layers': 6,
                'num_heads': 2,
                'hidden_channels_ffn': 768,  # 4 times the hidden_channels
                'input_length': None
            }

        for encoder_type == 'gated_conv'
            encoder_params={
                'kernel_size':5,
                'dropout_p': 0.1,
                'num_layers': 9,
            }

        for encoder_type == 'residual_conv_bn'
            encoder_params={
                "kernel_size": 4,
                "dilations": [1, 2, 4, 1, 2, 4, 1, 2, 4, 1, 2, 4, 1],
                "num_conv_blocks": 2,
                "num_res_blocks": 13
            }

        for encoder_type == 'time_depth_separable'
            encoder_params={
                "kernel_size": 5,
                'num_layers': 9,
            }
    """

    def __init__(
        self,
        num_chars,
        out_channels,
        hidden_channels,
        hidden_channels_dp,
        encoder_type,
        encoder_params,
        dropout_p_dp=0.1,
        mean_only=False,
        use_prenet=True,
        c_in_channels=0,
    ):
        super().__init__()
        # class arguments
        self.num_chars = num_chars
        self.out_channels = out_channels
        self.hidden_channels = hidden_channels
        self.hidden_channels_dp = hidden_channels_dp
        self.dropout_p_dp = dropout_p_dp
        self.mean_only = mean_only
        self.use_prenet = use_prenet
        self.c_in_channels = c_in_channels
        self.encoder_type = encoder_type
        # embedding layer
        self.emb = nn.Embedding(num_chars, hidden_channels)
        nn.init.normal_(self.emb.weight, 0.0, hidden_channels**-0.5)
        # init encoder module
        if encoder_type.lower() == "rel_pos_transformer":
            if use_prenet:
                self.prenet = ResidualConv1dLayerNormBlock(
                    hidden_channels, hidden_channels, hidden_channels, kernel_size=5, num_layers=3, dropout_p=0.5
                )
            self.encoder = RelativePositionTransformer(
                hidden_channels, hidden_channels, hidden_channels, **encoder_params
            )
        elif encoder_type.lower() == "gated_conv":
            self.encoder = GatedConvBlock(hidden_channels, **encoder_params)
        elif encoder_type.lower() == "residual_conv_bn":
            if use_prenet:
                self.prenet = nn.Sequential(nn.Conv1d(hidden_channels, hidden_channels, 1), nn.ReLU())
            self.encoder = ResidualConv1dBNBlock(hidden_channels, hidden_channels, hidden_channels, **encoder_params)
            self.postnet = nn.Sequential(
                nn.Conv1d(self.hidden_channels, self.hidden_channels, 1), nn.BatchNorm1d(self.hidden_channels)
            )
        elif encoder_type.lower() == "time_depth_separable":
            if use_prenet:
                self.prenet = ResidualConv1dLayerNormBlock(
                    hidden_channels, hidden_channels, hidden_channels, kernel_size=5, num_layers=3, dropout_p=0.5
                )
            self.encoder = TimeDepthSeparableConvBlock(
                hidden_channels, hidden_channels, hidden_channels, **encoder_params
            )
        else:
            raise ValueError(" [!] Unknown encoder type.")

        # final projection layers
        self.proj_m = nn.Conv1d(hidden_channels, out_channels, 1)
        if not mean_only:
            self.proj_s = nn.Conv1d(hidden_channels, out_channels, 1)
        # duration predictor
        self.duration_predictor = DurationPredictor(
            hidden_channels + c_in_channels, hidden_channels_dp, 3, dropout_p_dp
        )

    def forward(self, x, x_lengths, g=None):
        """
        Shapes:
            - x: :math:`[B, C, T]`
            - x_lengths: :math:`[B]`
            - g (optional): :math:`[B, 1, T]`
        """
        # embedding layer
        # [B ,T, D]
        x = self.emb(x) * math.sqrt(self.hidden_channels)
        # [B, D, T]
        x = torch.transpose(x, 1, -1)
        # compute input sequence mask
        x_mask = torch.unsqueeze(sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
        # prenet
        if hasattr(self, "prenet") and self.use_prenet:
            x = self.prenet(x, x_mask)
        # encoder
        x = self.encoder(x, x_mask)
        # postnet
        if hasattr(self, "postnet"):
            x = self.postnet(x) * x_mask
        # set duration predictor input
        if g is not None:
            g_exp = g.expand(-1, -1, x.size(-1))
            x_dp = torch.cat([x.detach(), g_exp], 1)
        else:
            x_dp = x.detach()
        # final projection layer
        x_m = self.proj_m(x) * x_mask
        if not self.mean_only:
            x_logs = self.proj_s(x) * x_mask
        else:
            x_logs = torch.zeros_like(x_m)
        # duration predictor
        logw = self.duration_predictor(x_dp, x_mask)
        return x_m, x_logs, logw, x_mask
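Encoder.forward builds x_mask from x_lengths with sequence_mask so that padded positions are zeroed out after every layer. As a rough illustration of what such a mask contains, the sketch below reproduces the idea with nested lists instead of tensors. The name sequence_mask_list is hypothetical, and the TTS helper is assumed to return the equivalent boolean tensor.

```python
def sequence_mask_list(lengths, max_len=None):
    """Return one row per sequence; entry t is True while t is inside the sequence."""
    if max_len is None:
        max_len = max(lengths)
    return [[t < n for t in range(max_len)] for n in lengths]


if __name__ == "__main__":
    # Two sequences padded to length 4: the first has 4 real tokens, the second 2.
    for row in sequence_mask_list([4, 2]):
        print([int(v) for v in row])
```

Multiplying encoder outputs by this mask, as forward() does with x_m and x_logs, keeps padding from leaking into the projections and the duration predictor.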

glow_tts.py

import math
from typing import Dict, List, Tuple, Union

import torch
from coqpit import Coqpit
from torch import nn
from torch.cuda.amp.autocast_mode import autocast
from torch.nn import functional as F

from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from decoder import Decoder
from encoder import Encoder
from TTS.tts.models.base_tts import BaseTTS
from TTS.tts.utils.helpers import generate_path, maximum_path, sequence_mask
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.synthesis import synthesis
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.tts.utils.visual import plot_alignment, plot_spectrogram
from TTS.utils.io import load_fsspec


class GlowTTS(BaseTTS):
    """GlowTTS model.

    Paper::
        https://arxiv.org/abs/2005.11129

    Paper abstract::
        Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate
        mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained
        without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS,
        a flow-based generative model for parallel TTS that does not require any external aligner. By combining the
        properties of flows and dynamic programming, the proposed model searches for the most probable monotonic
        alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard
        monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows
        enables fast, diverse, and controllable speech synthesis. Glow-TTS obtains an order-of-magnitude speed-up over
        the autoregressive model, Tacotron 2, at synthesis with comparable speech quality. We further show that our
        model can be easily extended to a multi-speaker setting.

    Check :class:`TTS.tts.configs.glow_tts_config.GlowTTSConfig` for class arguments.

    Examples:
        Init only model layers.

        >>> from TTS.tts.configs.glow_tts_config import GlowTTSConfig
        >>> from TTS.tts.models.glow_tts import GlowTTS
        >>> config = GlowTTSConfig(num_chars=2)
        >>> model = GlowTTS(config)

        Fully init a model ready for action. All the class attributes and class members
        (e.g Tokenizer, AudioProcessor, etc.). are initialized internally based on config values.

        >>> from TTS.tts.configs.glow_tts_config import GlowTTSConfig
        >>> from TTS.tts.models.glow_tts import GlowTTS
        >>> config = GlowTTSConfig()
        >>> model = GlowTTS.init_from_config(config, verbose=False)
    """

    def __init__(
        self,
        config: GlowTTSConfig,
        ap: "AudioProcessor" = None,
        tokenizer: "TTSTokenizer" = None,
        speaker_manager: SpeakerManager = None,
    ):
        super().__init__(config, ap, tokenizer, speaker_manager)

        # pass all config fields to `self`
        # for fewer code change
        self.config = config
        for key in config:
            setattr(self, key, config[key])

        self.decoder_output_dim = config.out_channels

        # init multi-speaker layers if necessary
        self.init_multispeaker(config)

        self.run_data_dep_init = config.data_dep_init_steps > 0
        self.encoder = Encoder(
            self.num_chars,
            out_channels=self.out_channels,
            hidden_channels=self.hidden_channels_enc,
            hidden_channels_dp=self.hidden_channels_dp,
            encoder_type=self.encoder_type,
            encoder_params=self.encoder_params,
            mean_only=self.mean_only,
            use_prenet=self.use_encoder_prenet,
            dropout_p_dp=self.dropout_p_dp,
            c_in_channels=self.c_in_channels,
        )

        self.decoder = Decoder(
            self.out_channels,
            self.hidden_channels_dec,
            self.kernel_size_dec,
            self.dilation_rate,
            self.num_flow_blocks_dec,
            self.num_block_layers,
            dropout_p=self.dropout_p_dec,
            num_splits=self.num_splits,
            num_squeeze=self.num_squeeze,
            sigmoid_scale=self.sigmoid_scale,
            c_in_channels=self.c_in_channels,
        )

    def init_multispeaker(self, config: Coqpit):
        """Init speaker embedding layer if `use_speaker_embedding` is True and set the expected speaker embedding
        vector dimension to the encoder layer channel size. If model uses d-vectors, then it only sets
        speaker embedding vector dimension to the d-vector dimension from the config.

        Args:
            config (Coqpit): Model configuration.
        """
        self.embedded_speaker_dim = 0
        # set number of speakers - if num_speakers is set in config, use it, otherwise use speaker_manager
        if self.speaker_manager is not None:
            self.num_speakers = self.speaker_manager.num_speakers
        # set ultimate speaker embedding size
        if config.use_d_vector_file:
            self.embedded_speaker_dim = (
                config.d_vector_dim if "d_vector_dim" in config and config.d_vector_dim is not None else 512
            )
            if self.speaker_manager is not None:
                assert (
                    config.d_vector_dim == self.speaker_manager.embedding_dim
                ), " [!] d-vector dimension mismatch b/w config and speaker manager."
        # init speaker embedding layer
        if config.use_speaker_embedding and not config.use_d_vector_file:
            print(" > Init speaker_embedding layer.")
            self.embedded_speaker_dim = self.hidden_channels_enc
            self.emb_g = nn.Embedding(self.num_speakers, self.hidden_channels_enc)
            nn.init.uniform_(self.emb_g.weight, -0.1, 0.1)
        # set conditioning dimensions
        self.c_in_channels = self.embedded_speaker_dim

    @staticmethod
    def compute_outputs(attn, o_mean, o_log_scale, x_mask):
        """Compute and format the model outputs with the given alignment map"""
        y_mean = torch.matmul(attn.squeeze(1).transpose(1, 2), o_mean.transpose(1, 2)).transpose(
            1, 2
        )  # [b, t', t], [b, t, d] -> [b, d, t']
        y_log_scale = torch.matmul(attn.squeeze(1).transpose(1, 2), o_log_scale.transpose(1, 2)).transpose(
            1, 2
        )  # [b, t', t], [b, t, d] -> [b, d, t']
        # compute total duration with adjustment
        o_attn_dur = torch.log(1 + torch.sum(attn, -1)) * x_mask
        return y_mean, y_log_scale, o_attn_dur

    def unlock_act_norm_layers(self):
        """Unlock activation normalization layers for data-dependent initialization."""
        for f in self.decoder.flows:
            if getattr(f, "set_ddi", False):
                f.set_ddi(True)

    def lock_act_norm_layers(self):
        """Lock activation normalization layers."""
        for f in self.decoder.flows:
            if getattr(f, "set_ddi", False):
                f.set_ddi(False)

    def _set_speaker_input(self, aux_input: Dict):
        if aux_input is None:
            d_vectors = None
            speaker_ids = None
        else:
            d_vectors = aux_input.get("d_vectors", None)
            speaker_ids = aux_input.get("speaker_ids", None)

        if d_vectors is not None and speaker_ids is not None:
            raise ValueError("[!] Cannot use d-vectors and speaker-ids together.")

        if speaker_ids is not None and not hasattr(self, "emb_g"):
            raise ValueError("[!] Cannot use speaker-ids without enabling speaker embedding.")

        g = speaker_ids if speaker_ids is not None else d_vectors
        return g

    def _speaker_embedding(self, aux_input: Dict) -> Union[torch.tensor, None]:
        g = self._set_speaker_input(aux_input)
        # speaker embedding
        if g is not None:
            if hasattr(self, "emb_g"):
                # use speaker embedding layer
                if not g.size():  # if is a scalar
                    g = g.unsqueeze(0)  # unsqueeze
                g = F.normalize(self.emb_g(g)).unsqueeze(-1)  # [b, h, 1]
            else:
                # use d-vector
                g = F.normalize(g).unsqueeze(-1)  # [b, h, 1]
        return g

    def forward(
        self, x, x_lengths, y, y_lengths=None, aux_input={"d_vectors": None, "speaker_ids": None}
    ):  # pylint: disable=dangerous-default-value
        """
        Args:
            x (torch.Tensor):
                Input text sequence ids. :math:`[B, T_en]`

            x_lengths (torch.Tensor):
                Lengths of input text sequences. :math:`[B]`

            y (torch.Tensor):
                Target mel-spectrogram frames. :math:`[B, T_de, C_mel]`

            y_lengths (torch.Tensor):
                Lengths of target mel-spectrogram frames. :math:`[B]`

            aux_input (Dict):
                Auxiliary inputs. `d_vectors` is speaker embedding vectors for a multi-speaker model.
                :math:`[B, D_vec]`. `speaker_ids` is speaker ids for a multi-speaker model using speaker-embedding
                layer. :math:`B`

        Returns:
            Dict:
                - z: :math:`[B, T_de, C]`
                - logdet: :math:`B`
                - y_mean: :math:`[B, T_de, C]`
                - y_log_scale: :math:`[B, T_de, C]`
                - alignments: :math:`[B, T_en, T_de]`
                - durations_log: :math:`[B, T_en, 1]`
                - total_durations_log: :math:`[B, T_en, 1]`
        """
        # [B, T, C] -> [B, C, T]
        y = y.transpose(1, 2)
        y_max_length = y.size(2)
        # norm speaker embeddings
        g = self._speaker_embedding(aux_input)
        # embedding pass
        o_mean, o_log_scale, o_dur_log, x_mask = self.encoder(x, x_lengths, g=g)
        # drop residual frames wrt num_squeeze and set y_lengths.
        y, y_lengths, y_max_length, attn = self.preprocess(y, y_lengths, y_max_length, None)
        # create masks
        y_mask = torch.unsqueeze(sequence_mask(y_lengths, y_max_length), 1).to(x_mask.dtype)
        # [B, 1, T_en, T_de]
        attn_mask = torch.unsqueeze(x_mask, -1) * torch.unsqueeze(y_mask, 2)
        # decoder pass
        z, logdet = self.decoder(y, y_mask, g=g, reverse=False)
        # find the alignment path
        with torch.no_grad():
            o_scale = torch.exp(-2 * o_log_scale)
            logp1 = torch.sum(-0.5 * math.log(2 * math.pi) - o_log_scale, [1]).unsqueeze(-1)  # [b, t, 1]
            logp2 = torch.matmul(o_scale.transpose(1, 2), -0.5 * (z**2))  # [b, t, d] x [b, d, t'] = [b, t, t']
            logp3 = torch.matmul((o_mean * o_scale).transpose(1, 2), z)  # [b, t, d] x [b, d, t'] = [b, t, t']
            logp4 = torch.sum(-0.5 * (o_mean**2) * o_scale, [1]).unsqueeze(-1)  # [b, t, 1]
            logp = logp1 + logp2 + logp3 + logp4  # [b, t, t']
            attn = maximum_path(logp, attn_mask.squeeze(1)).unsqueeze(1).detach()
        y_mean, y_log_scale, o_attn_dur = self.compute_outputs(attn, o_mean, o_log_scale, x_mask)
        attn = attn.squeeze(1).permute(0, 2, 1)
        outputs = {
            "z": z.transpose(1, 2),
            "logdet": logdet,
            "y_mean": y_mean.transpose(1, 2),
            "y_log_scale": y_log_scale.transpose(1, 2),
            "alignments": attn,
            "durations_log": o_dur_log.transpose(1, 2),
            "total_durations_log": o_attn_dur.transpose(1, 2),
        }
        return outputs

    @torch.no_grad()
    def inference_with_MAS(
        self, x, x_lengths, y=None, y_lengths=None, aux_input={"d_vectors": None, "speaker_ids": None}
    ):  # pylint: disable=dangerous-default-value
        """
        It's similar to the teacher forcing in Tacotron.
        It was proposed in: https://arxiv.org/abs/2104.05557

        Shapes:
            - x: :math:`[B, T]`
            - x_lengths: :math:`B`
            - y: :math:`[B, T, C]`
            - y_lengths: :math:`B`
            - g: :math:`[B, C] or B`
        """
        y = y.transpose(1, 2)
        y_max_length = y.size(2)
        # norm speaker embeddings
        g = self._speaker_embedding(aux_input)
        # embedding pass
        o_mean, o_log_scale, o_dur_log, x_mask = self.encoder(x, x_lengths, g=g)
        # drop residual frames wrt num_squeeze and set y_lengths.
        y, y_lengths, y_max_length, attn = self.preprocess(y, y_lengths, y_max_length, None)
        # create masks
        y_mask = torch.unsqueeze(sequence_mask(y_lengths, y_max_length), 1).to(x_mask.dtype)
        attn_mask = torch.unsqueeze(x_mask, -1) * torch.unsqueeze(y_mask, 2)
        # decoder pass
        z, logdet = self.decoder(y, y_mask, g=g, reverse=False)
        # find the alignment path between z and encoder output
        o_scale = torch.exp(-2 * o_log_scale)
        logp1 = torch.sum(-0.5 * math.log(2 * math.pi) - o_log_scale, [1]).unsqueeze(-1)  # [b, t, 1]
        logp2 = torch.matmul(o_scale.transpose(1, 2), -0.5 * (z**2))  # [b, t, d] x [b, d, t'] = [b, t, t']
        logp3 = torch.matmul((o_mean * o_scale).transpose(1, 2), z)  # [b, t, d] x [b, d, t'] = [b, t, t']
        logp4 = torch.sum(-0.5 * (o_mean**2) * o_scale, [1]).unsqueeze(-1)  # [b, t, 1]
        logp = logp1 + logp2 + logp3 + logp4  # [b, t, t']
        attn = maximum_path(logp, attn_mask.squeeze(1)).unsqueeze(1).detach()

        y_mean, y_log_scale, o_attn_dur = self.compute_outputs(attn, o_mean, o_log_scale, x_mask)
        attn = attn.squeeze(1).permute(0, 2, 1)

        # get predicted aligned distribution
        z = y_mean * y_mask

        # reverse the decoder and predict using the aligned distribution
        y, logdet = self.decoder(z, y_mask, g=g, reverse=True)
        outputs = {
            "model_outputs": z.transpose(1, 2),
            "logdet": logdet,
            "y_mean": y_mean.transpose(1, 2),
            "y_log_scale": y_log_scale.transpose(1, 2),
            "alignments": attn,
            "durations_log": o_dur_log.transpose(1, 2),
            "total_durations_log": o_attn_dur.transpose(1, 2),
        }
        return outputs

    @torch.no_grad()
    def decoder_inference(
        self, y, y_lengths=None, aux_input={"d_vectors": None, "speaker_ids": None}
    ):  # pylint: disable=dangerous-default-value
        """
        Shapes:
            - y: :math:`[B, T, C]`
            - y_lengths: :math:`B`
            - g: :math:`[B, C] or B`
        """
        y = y.transpose(1, 2)
        y_max_length = y.size(2)
        g = self._speaker_embedding(aux_input)
        y_mask = torch.unsqueeze(sequence_mask(y_lengths, y_max_length), 1).to(y.dtype)
        # decoder pass
        z, logdet = self.decoder(y, y_mask, g=g, reverse=False)
        # reverse decoder and predict
        y, logdet = self.decoder(z, y_mask, g=g, reverse=True)
        outputs = {}
        outputs["model_outputs"] = y.transpose(1, 2)
        outputs["logdet"] = logdet
        return outputs

    @torch.no_grad()
    def inference(
        self, x, aux_input={"x_lengths": None, "d_vectors": None, "speaker_ids": None}
    ):  # pylint: disable=dangerous-default-value
        x_lengths = aux_input["x_lengths"]
        g = self._speaker_embedding(aux_input)
        # embedding pass
        o_mean, o_log_scale, o_dur_log, x_mask = self.encoder(x, x_lengths, g=g)
        # compute output durations
        w = (torch.exp(o_dur_log) - 1) * x_mask * self.length_scale
        w_ceil = torch.clamp_min(torch.ceil(w), 1)
        y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
        y_max_length = None
        # compute masks
        y_mask = torch.unsqueeze(sequence_mask(y_lengths, y_max_length), 1).to(x_mask.dtype)
        attn_mask = torch.unsqueeze(x_mask, -1) * torch.unsqueeze(y_mask, 2)
        # compute attention mask
        attn = generate_path(w_ceil.squeeze(1), attn_mask.squeeze(1)).unsqueeze(1)
        y_mean, y_log_scale, o_attn_dur = self.compute_outputs(attn, o_mean, o_log_scale, x_mask)

        z = (y_mean + torch.exp(y_log_scale) * torch.randn_like(y_mean) * self.inference_noise_scale) * y_mask
        # decoder pass
        y, logdet = self.decoder(z, y_mask, g=g, reverse=True)
        attn = attn.squeeze(1).permute(0, 2, 1)
        outputs = {
            "model_outputs": y.transpose(1, 2),
            "logdet": logdet,
            "y_mean": y_mean.transpose(1, 2),
            "y_log_scale": y_log_scale.transpose(1, 2),
            "alignments": attn,
            "durations_log": o_dur_log.transpose(1, 2),
            "total_durations_log": o_attn_dur.transpose(1, 2),
        }
        return outputs

    def train_step(self, batch: dict, criterion: nn.Module):
        """A single training step. Forward pass and loss computation. Run data-dependent initialization for the
        first `config.data_dep_init_steps` steps.

        Args:
            batch (dict): [description]
            criterion (nn.Module): [description]
        """
        text_input = batch["text_input"]
        text_lengths = batch["text_lengths"]
        mel_input = batch["mel_input"]
        mel_lengths = batch["mel_lengths"]
        d_vectors = batch["d_vectors"]
        speaker_ids = batch["speaker_ids"]

        if self.run_data_dep_init and self.training:
            # compute data-dependent initialization of activation norm layers
            self.unlock_act_norm_layers()
            with torch.no_grad():
                _ = self.forward(
                    text_input,
                    text_lengths,
                    mel_input,
                    mel_lengths,
                    aux_input={"d_vectors": d_vectors, "speaker_ids": speaker_ids},
                )
            outputs = None
            loss_dict = None
            self.lock_act_norm_layers()
        else:
            # normal training step
            outputs = self.forward(
                text_input,
                text_lengths,
                mel_input,
                mel_lengths,
                aux_input={"d_vectors": d_vectors, "speaker_ids": speaker_ids},
            )

            with autocast(enabled=False):  # avoid mixed_precision in criterion
                loss_dict = criterion(
                    outputs["z"].float(),
                    outputs["y_mean"].float(),
                    outputs["y_log_scale"].float(),
                    outputs["logdet"].float(),
                    mel_lengths,
                    outputs["durations_log"].float(),
                    outputs["total_durations_log"].float(),
                    text_lengths,
                )
        return outputs, loss_dict

    def _create_logs(self, batch, outputs, ap):
        alignments = outputs["alignments"]
        text_input = batch["text_input"][:1] if batch["text_input"] is not None else None
        text_lengths = batch["text_lengths"]
        mel_input = batch["mel_input"]
        d_vectors = batch["d_vectors"][:1] if batch["d_vectors"] is not None else None
        speaker_ids = batch["speaker_ids"][:1] if batch["speaker_ids"] is not None else None

        # model runs reverse flow to predict spectrograms
        pred_outputs = self.inference(
            text_input,
            aux_input={"x_lengths": text_lengths[:1], "d_vectors": d_vectors, "speaker_ids": speaker_ids},
        )
        model_outputs = pred_outputs["model_outputs"]

        pred_spec = model_outputs[0].data.cpu().numpy()
        gt_spec = mel_input[0].data.cpu().numpy()
        align_img = alignments[0].data.cpu().numpy()

        figures = {
            "prediction": plot_spectrogram(pred_spec, ap, output_fig=False),
            "ground_truth": plot_spectrogram(gt_spec, ap, output_fig=False),
            "alignment": plot_alignment(align_img, output_fig=False),
        }

        # Sample audio
        train_audio = ap.inv_melspectrogram(pred_spec.T)
        return figures, {"audio": train_audio}

    def train_log(
        self, batch: dict, outputs: dict, logger: "Logger", assets: dict, steps: int
    ) -> None:  # pylint: disable=no-self-use
        figures, audios = self._create_logs(batch, outputs, self.ap)
        logger.train_figures(steps, figures)
        logger.train_audios(steps, audios, self.ap.sample_rate)

    @torch.no_grad()
    def eval_step(self, batch: dict, criterion: nn.Module):
        return self.train_step(batch, criterion)

    def eval_log(self, batch: dict, outputs: dict, logger: "Logger", assets: dict, steps: int) -> None:
        figures, audios = self._create_logs(batch, outputs, self.ap)
        logger.eval_figures(steps, figures)
        logger.eval_audios(steps, audios, self.ap.sample_rate)

    @torch.no_grad()
    def test_run(self, assets: Dict) -> Tuple[Dict, Dict]:
        """Generic test run for `tts` models used by `Trainer`.

        You can override this for a different behaviour.

        Returns:
            Tuple[Dict, Dict]: Test figures and audios to be projected to Tensorboard.
        """
        print(" | > Synthesizing test sentences.")
        test_audios = {}
        test_figures = {}
        test_sentences = self.config.test_sentences
        aux_inputs = self._get_test_aux_input()
        if len(test_sentences) == 0:
            print(" | [!] No test sentences provided.")
        else:
            for idx, sen in enumerate(test_sentences):
                outputs = synthesis(
                    self,
                    sen,
                    self.config,
                    "cuda" in str(next(self.parameters()).device),
                    speaker_id=aux_inputs["speaker_id"],
                    d_vector=aux_inputs["d_vector"],
                    style_wav=aux_inputs["style_wav"],
                    use_griffin_lim=True,
                    do_trim_silence=False,
                )

                test_audios["{}-audio".format(idx)] = outputs["wav"]
                test_figures["{}-prediction".format(idx)] = plot_spectrogram(
                    outputs["outputs"]["model_outputs"], self.ap, output_fig=False
                )
                test_figures["{}-alignment".format(idx)] = plot_alignment(outputs["alignments"], output_fig=False)
        return test_figures, test_audios

    def preprocess(self, y, y_lengths, y_max_length, attn=None):
        if y_max_length is not None:
            y_max_length = (y_max_length // self.num_squeeze) * self.num_squeeze
            y = y[:, :, :y_max_length]
            if attn is not None:
                attn = attn[:, :, :, :y_max_length]
        y_lengths = torch.div(y_lengths, self.num_squeeze, rounding_mode="floor") * self.num_squeeze
        return y, y_lengths, y_max_length, attn

    def store_inverse(self):
        self.decoder.store_inverse()

    def load_checkpoint(
        self, config, checkpoint_path, eval=False
    ):  # pylint: disable=unused-argument, redefined-builtin
        state = load_fsspec(checkpoint_path, map_location=torch.device("cpu"))
        self.load_state_dict(state["model"])
        if eval:
            self.eval()
            self.store_inverse()
            assert not self.training

    @staticmethod
    def get_criterion():
        from TTS.tts.layers.losses import GlowTTSLoss  # pylint: disable=import-outside-toplevel

        return GlowTTSLoss()

    def on_train_step_start(self, trainer):
        """Decide on every training step whether to enable/disable data-dependent initialization."""
        self.run_data_dep_init = trainer.total_steps_done < self.data_dep_init_steps

    @staticmethod
    def init_from_config(config: "GlowTTSConfig", samples: Union[List[List], List[Dict]] = None, verbose=True):
        """Initiate model from config

        Args:
            config (GlowTTSConfig): Model config.
            samples (Union[List[List], List[Dict]]): Training samples to parse speaker ids for training.
                Defaults to None.
            verbose (bool): If True, print init messages. Defaults to True.
        """
        from TTS.utils.audio import AudioProcessor

        ap = AudioProcessor.init_from_config(config, verbose)
        tokenizer, new_config = TTSTokenizer.init_from_config(config)
        speaker_manager = SpeakerManager.init_from_config(config, samples)
        return GlowTTS(new_config, ap, tokenizer, speaker_manager)
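In forward(), logp1 through logp4 add up to the Gaussian log-likelihood of every mel-frame latent under every text token's predicted mean and scale, and maximum_path then picks the most probable monotonic alignment. The sketch below is a plain-Python dynamic program for that search on a single utterance, written from the description in the Glow-TTS paper. The names monotonic_alignment and durations_from_path are hypothetical, and the library's maximum_path is assumed to compute the same thing on batched, masked tensors.

```python
import math


def monotonic_alignment(logp):
    """logp[i][j] is the log-likelihood of mel frame j under text token i.

    Returns a list giving, for each mel frame, the index of its text token.
    The path starts at token 0, ends at the last token, and moves forward by
    at most one token per frame.
    """
    n_text, n_mel = len(logp), len(logp[0])
    if n_text > n_mel:
        raise ValueError("need at least one mel frame per text token")
    q = [[-math.inf] * n_mel for _ in range(n_text)]
    for j in range(n_mel):
        for i in range(min(j + 1, n_text)):
            if j == 0:
                best_prev = 0.0  # only token 0 can own the first frame
            else:
                stay = q[i][j - 1]
                advance = q[i - 1][j - 1] if i > 0 else -math.inf
                best_prev = max(stay, advance)
            q[i][j] = logp[i][j] + best_prev
    # Backtrack from the last token at the last frame.
    path = [0] * n_mel
    i = n_text - 1
    for j in range(n_mel - 1, -1, -1):
        path[j] = i
        if i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return path


def durations_from_path(path, n_text):
    """Count how many mel frames each text token received."""
    return [path.count(i) for i in range(n_text)]


if __name__ == "__main__":
    scores = [
        [0.0, 0.0, -5.0, -5.0],   # token 0 explains the first two frames well
        [-5.0, -5.0, 0.0, 0.0],   # token 1 explains the last two frames well
    ]
    path = monotonic_alignment(scores)
    print(path, durations_from_path(path, len(scores)))
```

The per-token frame counts are what compute_outputs turns into log(1 + duration), the target that the duration predictor is trained against.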

train.py
 

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.utils.audio import AudioProcessor
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.tts.datasets import load_tts_samples
import os
import numpy as np
import torch
from glow_tts import GlowTTS
from trainer import Trainer, TrainerArgs
from TTS.utils.radam import RAdam
from trainer.torch import NoamLR
from TTS.tts.layers.losses import GlowTTSLoss


def init_config():
    dataset_config = BaseDatasetConfig(
        path='train/LJSpeech-1.1/',
        meta_file_train='metadata.csv',
        formatter='ljspeech')

    config = GlowTTSConfig(
        batch_size=32,
        eval_batch_size=16,
        num_loader_workers=4,
        num_eval_loader_workers=4,
        run_eval=True,
        test_delay_epochs=-1,
        epochs=3,
        text_cleaner='phoneme_cleaners',
        use_phonemes=True,
        phoneme_language='en-us',
        phoneme_cache_path='train/phoneme_cache',
        print_step=25,
        print_eval=False,
        mixed_precision=True,
        output_path='train',
        datasets=[dataset_config],
        save_step=1000,
        data_dep_init_steps=0,
    )

    processor = AudioProcessor.init_from_config(config)
    tokenizer, config = TTSTokenizer.init_from_config(config)

    datas, _ = load_tts_samples(dataset_config,
                                eval_split=True,
                                eval_split_size=0.001)

    # sort the samples by audio file size
    lens = [os.path.getsize(i['audio_file']) for i in datas]
    ids = np.argsort(lens)
    datas = [datas[i] for i in ids]

    return config, processor, tokenizer, datas


config, processor, tokenizer, datas = init_config()

# quick sanity checks on the audio processor and the tokenizer
out = processor.load_wav('train/LJSpeech-1.1/wavs/LJ001-0108.wav')
print('processor.load_wav=', out, out.shape)

out = tokenizer.text_to_ids(
    'it is obvious that legibility is the first thing to be aimed at in the forms of the letters'
)
print('tokenizer.text_to_ids=', out, len(out))

out = processor.melspectrogram(
    processor.load_wav('train/LJSpeech-1.1/wavs/LJ001-0108.wav'))
print('processor.melspectrogram=', out.shape)

print(len(datas), datas[:2])


def init_model(from_trainer):
    model = GlowTTS(config, processor, tokenizer, speaker_manager=None)
    model.run_data_dep_init = False

    if from_trainer:
        # let the Trainer build the optimizer, scheduler, criterion and loader
        trainer = Trainer(args=TrainerArgs(),
                          config=config,
                          output_path='train',
                          model=model,
                          train_samples=datas,
                          eval_samples=None)

        optimizer = trainer.get_optimizer(model, config)
        scheduler = trainer.get_scheduler(model, config, optimizer)
        criterion = trainer.get_criterion(model)
        loader = trainer.get_train_dataloader({}, datas, verbose=True)
    else:
        # build the same pieces by hand
        optimizer = RAdam(model.parameters(),
                          lr=1e-3,
                          betas=[0.9, 0.998],
                          weight_decay=1e-6)
        scheduler = NoamLR(optimizer, warmup_steps=4000)
        criterion = GlowTTSLoss()
        loader = model.get_data_loader(config=config,
                                       assets={},
                                       is_eval=False,
                                       samples=datas,
                                       verbose=True,
                                       num_gpus=0)

    return model, optimizer, scheduler, criterion, loader


model, optimizer, scheduler, criterion, loader = init_model(from_trainer=False)

# count the parameters, in units of 10,000
print(sum(i.numel() for i in model.parameters()) / 10000)


def train():
    global model
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.train()
    model = model.to(device)

    for epoch in range(config.epochs):
        for i, data in enumerate(loader):
            data = model.format_batch(data)
            # move every tensor in the batch to the training device
            for k in data.keys():
                if isinstance(data[k], torch.Tensor):
                    data[k] = data[k].to(device)
            print("#############################################")
            print(data['text_input'].shape)
            print(data['mel_input'].shape)
            print("====================================================")
            _, loss_dict = model.train_step(data, criterion)

            model.zero_grad(set_to_none=True)
            loss_dict['loss'].backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)

            if i % 2 == 0:
                lr = optimizer.state_dict()['param_groups'][0]['lr']
                print(epoch, i, loss_dict['loss'].item(), lr)

            # NoamLR is a per-step schedule, so it advances once per batch
            scheduler.step()

    config.save_json('train/config.json')
    model = model.cpu()
    torch.save({
        'config': config.to_dict(),
        'model': model.state_dict()
    }, 'train/model.pth')


if __name__ == '__main__':
    train()
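train.py pairs RAdam (lr=1e-3) with NoamLR(warmup_steps=4000). To get a feel for the learning rates this produces, the sketch below evaluates the usual Noam schedule: a linear warm-up to the base rate followed by inverse-square-root decay. The function name noam_lr is hypothetical, and the formula is assumed to match what NoamLR in the trainer package computes.

```python
def noam_lr(step, base_lr=1e-3, warmup_steps=4000):
    """Learning rate at a given optimizer step under the Noam schedule."""
    step = max(step, 1)  # avoid a zero rate and division issues at step 0
    return base_lr * warmup_steps**0.5 * min(step * warmup_steps**-1.5, step**-0.5)


if __name__ == "__main__":
    for s in (1, 1000, 4000, 16000):
        print(s, noam_lr(s))
```

Under this formula the rate peaks at base_lr exactly when step equals warmup_steps. With only 3 epochs at batch size 32, a run may spend a large share of its steps in the warm-up phase, which is worth keeping in mind when reading the printed lr values.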

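train.py is the entry point for training the TTS model. After training, the configuration is saved to train/config.json and the model weights to train/model.pth.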



