【基频提取算法-PYIN】

2024-04-01 21:12
文章标签 算法 提取 基频 pyin

本文主要是介绍【基频提取算法-PYIN】,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

本文对基频提取算法 PYIN 做以介绍。如有表述不当之处欢迎批评指正。欢迎任何形式的转载,但请务必注明出处。

文章目录

  • 1. 引言
  • 2. PYIN 各模块代码讲解
    • 2.1. 概率 YIN 算法
    • 2.2. 使用 HMM 得到更平滑的音高轨迹
      • 2.2.1. 观测概率
      • 2.2.2. 状态转移概率
      • 2.2.3. 维特比算法解码
  • 3. 改进和实际问题
  • 4. 总结

1. 引言

笔者在之前的博客中介绍了 YIN 算法,可参考【论文笔记之 YIN】【基频提取算法-YIN】

本文将在此基础之上,结合开源代码介绍 PYIN 算法各个模块的实现细节,如果不了解该算法,可以先阅读笔者另一篇关于该算法论文的博客 【论文笔记之 PYIN】。开源代码的地址为:https://code.soundsoftware.ac.uk/projects/pyin/files,网页上共开源了三个版本,笔者以最早的那个版本为主进行介绍,后续版本在实现方面会略有不同,感兴趣的读者可深入研究。

2. PYIN 各模块代码讲解

PYIN 主要包括两个阶段,第一个阶段是在原始 YIN 的基础上得到概率 YIN,也就是对于每帧音频,得到多个具有不同概率的音高候选值;第二个阶段是使用 HMM 得到更平滑的音高轨迹。

2.1. 概率 YIN 算法

概率 YIN 算法的实现见以下函数,PYIN 算法与 YIN 算法的前两个步骤一样,可参考【基频提取算法-YIN】中的 2.1.2.2. 小节。

Yin::YinOutput
Yin::processProbabilisticYin(const double *in) const {double* yinBuffer = new double[m_yinBufferSize];/* 笔者注释 1:下面两个步骤的实现可参考对 YIN 算法的讲解 */// calculate aperiodicity function for all periodsYinUtil::fastDifference(in, yinBuffer, m_yinBufferSize);    YinUtil::cumulativeDifference(yinBuffer, m_yinBufferSize);/* 笔者注释 2:本小节将详细讲解以下步骤 */vector<double> peakProbability = YinUtil::yinProb(yinBuffer, m_threshDistr, m_yinBufferSize);// calculate overall "probability" from peak probabilitydouble probSum = 0;for (size_t iBin = 0; iBin < m_yinBufferSize; ++iBin){probSum += peakProbability[iBin];}/* 笔者注释 3:概率 YIN 计算结束之后的一些后处理,包括抛物线插值以及将 tau 转换为其对应的具体频率值*/    Yin::YinOutput yo(0,0,0);for (size_t iBuf = 0; iBuf < m_yinBufferSize; ++iBuf){yo.salience.push_back(peakProbability[iBuf]);if (peakProbability[iBuf] > 0){double currentF0 = m_inputSampleRate * (1.0 /YinUtil::parabolicInterpolation(yinBuffer, iBuf, m_yinBufferSize));yo.freqProb.push_back(pair<double, double>(currentF0, peakProbability[iBuf]));}}// std::cerr << yo.freqProb.size() << std::endl;delete [] yinBuffer;return yo;
}

本文将从服从 Beta 分布的阈值开始讲起。这个模块对应原始论文中的公式 ( 4 ) (4) (4) ( 5 ) (5) (5), 具体实现略有不同:
P ( τ = τ 0 ∣ S , x t ) = ∑ i = 1 N a ( s i , τ ) P ( s i ) [ Y ( x t , s i ) = τ ] , \begin{align} P(\tau=\tau_0|S,x_t) = \sum_{i=1}^{N}a(s_i,\tau)P(s_i)[Y(x_t,s_i)=\tau], \tag{4} \end{align} P(τ=τ0S,xt)=i=1Na(si,τ)P(si)[Y(xt,si)=τ],(4)

a ( s i , τ ) = { 1 , if d ′ ( τ ) < s i p a , otherwise. (5) a(s_i,\tau) = \begin{cases} 1, & \text{if} \; d^{'}(\tau) < s_i \\ p_a, & \text{otherwise.} \tag{5} \end{cases} a(si,τ)={1,pa,ifd(τ)<siotherwise.(5)

笔者将在下述开源代码上以加注释的方式进行介绍。

std::vector<double>
YinUtil::yinProb(const double *yinBuffer, const size_t prior, const size_t yinBufferSize) 
{/* 笔者注释 1:yinBuffer:存储的是前述步骤计算的累积均值归一化差分函数;prior:用于选择 Beta 分布,源代码中默认传入的值是 2,也就是选择下列分布中的 betaDist2;yinBufferSize:yinBuffer 数组的元素个数。*/double minWeight = 0.01;size_t tau;/* 笔者注释 2:thresholds:用于存储可选的阈值,从 0.01 到 1,步骤长为 0.01,共 100 个。*/std::vector<float> thresholds;/* 笔者注释 3:distribution:用于存储每个阈值所对应的概率,其实就是存储 betaDist2 数组中的值。*/std::vector<float> distribution;/* 笔者注释 4:peakProb:存储估计的周期值的概率。*/std::vector<double> peakProb = std::vector<double>(yinBufferSize);// TODO: make the distributions below part of a class, so they don't have to// be allocated every time.float uniformDist[100] = {0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000,0.0100000};float betaDist1[100] = {0.028911,0.048656,0.061306,0.068539,0.071703,0.071877,0.069915,0.066489,0.062117,0.057199,0.052034,0.046844,0.041786,0.036971,0.032470,0.028323,0.024549,0.021153,0.018124,0.015446,0.013096,0.011048,0.009275,0.007750,0.006445,0.005336,0.004397,0.003606,0.002945,0.002394,0.001937,0.001560,0.001250,0.000998,0.000792,0.000626,0.000492,0.000385,0.000300,0.000232,0.000179,0.000137,0.000104,0.000079,0.000060,0.000045,0.000033,0.000024,0.000018,0.000013,0.000009,0.000007,0.000005,0.000003,0.000002,0.000002,0.000001,0.000001,0.000001,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000};float betaDist2[100] = {0.012614,0.022715,0.030646,0.036712,0.041184,0.044301,0.046277,0.047298,0.047528,0.047110,0.046171,0.044817,0.043144,0.041231,0.039147,0.036950,0.034690,0.032406,0.030133,0.027898,0.025722,0.023624,0.021614,0.019704,0.017900,0.016205,0.014621,0.013148,0.011785,0.010530,0.009377,0.008324,0.007366,0.006497,0.005712,0.005005,0.004372,0.003806,0.003302,0.002855,0.002460,0.002112,0.001806,0.001539,0.001307,0.001105,0.000931,0.000781,0.000652,0.000542,0.000449,0.000370,0.000303,0.000247,0.000201,0.000162,0.000130,0.000104,0.000082,0.000065,0.000051,0.000039,0.000030,0.000023,0.000018,0.000013,0.000010,0.000007,0.000005,0.000004,0.000003,0.000002,0.000001,0.000001,0.000001,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000};float betaDist3[100] = {0.006715,0.012509,0.017463,0.021655,0.025155,0.028031,0.030344,0.032151,0.033506,0.034458,0.035052,0.035331,0.035332,0.035092,0.034643,0.034015,0.033234,0.032327,0.031314,0.030217,0.029054,0.027841,0.026592,0.025322,0.024042,0.022761,0.021489,0.020234,0.019002,0.017799,0.016630,0.015499,0.014409,0.013362,0.012361,0.011407,0.010500,0.009641,0.008830,0.008067,0.007351,0.006681,0.006056,0.005475,0.004936,0.004437,0.003978,0.003555,0.003168,0.002814,0.002492,0.002199,0.001934,0.001695,0.001481,0.001288,0.001116,0.000963,0.000828,0.000708,0.000603,0.000511,0.000431,0.000361,0.000301,0.000250,0.000206,0.000168,0.000137,0.000110,0.000088,0.000070,0.000055,0.000043,0.000033,0.000025,0.000019,0.000014,0.000010,0.000007,0.000005,0.000004,0.000002,0.000002,0.000001,0.000001,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000};float betaDist4[100] = {0.003996,0.007596,0.010824,0.013703,0.016255,0.018501,0.020460,0.022153,0.023597,0.024809,0.025807,0.026607,0.027223,0.027671,0.027963,0.028114,0.028135,0.028038,0.027834,0.027535,0.027149,0.026687,0.026157,0.025567,0.024926,0.024240,0.023517,0.022763,0.021983,0.021184,0.020371,0.019548,0.018719,0.017890,0.017062,0.016241,0.015428,0.014627,0.013839,0.013068,0.012315,0.011582,0.010870,0.010181,0.009515,0.008874,0.008258,0.007668,0.007103,0.006565,0.006053,0.005567,0.005107,0.004673,0.004264,0.003880,0.003521,0.003185,0.002872,0.002581,0.002312,0.002064,0.001835,0.001626,0.001434,0.001260,0.001102,0.000959,0.000830,0.000715,0.000612,0.000521,0.000440,0.000369,0.000308,0.000254,0.000208,0.000169,0.000136,0.000108,0.000084,0.000065,0.000050,0.000037,0.000027,0.000019,0.000014,0.000009,0.000006,0.000004,0.000002,0.000001,0.000001,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000};float single10[100] = {0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000};float single15[100] = {0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000};float single20[100] = {0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,1.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000};size_t nThreshold = 100;int nThresholdInt = nThreshold;for (int i = 0; i < nThresholdInt; ++i){switch (prior) {case 0:distribution.push_back(uniformDist[i]);break;case 1:distribution.push_back(betaDist1[i]);break;case 2:distribution.push_back(betaDist2[i]);break;case 3:distribution.push_back(betaDist3[i]);break;case 4:distribution.push_back(betaDist4[i]);break;case 5:distribution.push_back(single10[i]);break;case 6:distribution.push_back(single15[i]);break;case 7:distribution.push_back(single20[i]);break;default:distribution.push_back(uniformDist[i]);}thresholds.push_back(0.01 + i*0.01);}// double minYin = 2936;// for (size_t i = 2; i < yinBufferSize; ++i)// {//     if (yinBuffer[i] < minYin)//     {//         minYin = yinBuffer[i];//     }// }// if (minYin < 0.01) std::cerr << "min Yin buffer element: " << minYin << std::endl;int currThreshInd = nThreshold-1;tau = 2;// double factor = 1.0 / (0.25 * (nThresholdInt+1) * (nThresholdInt + 1)); // factor to scale down triangular weightsize_t minInd = 0;float minVal = 42.f;/* 笔者注释 5:阈值从其最大值 1 开始递减,tau 从第二个值开始递增。*/while (currThreshInd != -1 && tau < yinBufferSize){if (yinBuffer[tau] < thresholds[currThreshInd]){/* 笔者注释 6:如果 tau 所对应的累积均值归一化差分函数值小于阈值,那么就通过下面的 while 循环,找到小于阈值的第一个谷值。*/while (tau + 1 < yinBufferSize && yinBuffer[tau+1] < yinBuffer[tau]){tau++;}// tau is now local minimum// std::cerr << tau << " " << currThreshInd << " "<< thresholds[currThreshInd] << " " << distribution[currThreshInd] << std::endl;/* 笔者注释 7:寻找最小的累积均值归一化差分函数值,以及其所对应的 tau*/if (yinBuffer[tau] < minVal && tau > 2){minVal = yinBuffer[tau];minInd = tau;}/* 笔者注释 8:求该谷值所对应的概率,由于该谷值可能小于好几个阈值,所以求这些阈值所对应的概率之和。*/peakProb[tau] += distribution[currThreshInd];/* 笔者注释 9:阈值递减,继续判断当前 tau 是否满足。*/currThreshInd--;} else {/* 笔者注释 10:如果 tau 所对应的累积均值归一化差分函数值大于阈值,那么就跳过当前 tau,判断下一个 tau 是否满足。*/tau++;}}/* 笔者注释 11:可以认为 nonPeakProb 中存储的是不存在基频的概率,= 1 - 估计的周期值的概率之和。*/double nonPeakProb = 1;for (size_t i = 0; i < yinBufferSize; ++i){nonPeakProb -= peakProb[i];}// std::cerr << nonPeakProb << std::endl;if (minInd > 0){/* 笔者注释 12:给最小的 tau 所对应的概率加上不存在基频的概率 * 0.01。*/// std::cerr << "min set " << minVal << " " << minInd << " " << nonPeakProb << std::endl; peakProb[minInd] += nonPeakProb * minWeight;}return peakProb;
}

该模块的核心代码是笔者注释 5下的 while 循环。下面举个例子解释下循环里面要做的事情。假设 tau=10 是第一个令 yinBuffer[tau] 小于阈值的下标,且 yinBuffer[tau=10] < 1,0.99,0.98,那么 peakProb[tau=10] 中存储的就是 1,0.99,0.98 这三个阈值所对应的概率之和。而当阈值等于 0.97 时,tau=10 已经不满足,那么继续寻找比 tau=10 大的满足条件的 tau,假设 tau=20 时,yinBuffer[tau=20] < 0.97,0.96,0.95,那么 peakProb[tau=20] 中存储的就是 0.97,0.96,0.95 这三个阈值所对应的概率之和。以此类推。minValminInd 中存储的分别是最小的 yinBuffer[tau] 及其对应的下标 tau

在经过上述计算过程后,PYINpeakProb[tau]>0 的那些 tau 经过抛物线插值后转化为具体的频率值,并将其与对应的 peakProb[tau] 一起保存。具体可见本小节第一个代码块。

2.2. 使用 HMM 得到更平滑的音高轨迹

学习这个模块需要了解 HMM 的基本知识,关于 HMM 的介绍,可以参考笔者的另一篇博客 【机器学习之 HMM】。在这个模块中,PYIN 480 480 480 个频点建模为隐马尔可夫模型的隐状态,范围从 50 Hz 50\text{Hz} 50Hz 880 Hz 880\text{Hz} 880Hz,步长为 0.1 0.1 0.1 个半音。计算最优音高轨迹的问题被转化为了计算隐马尔可夫模型最佳隐状态序列的问题。众所周知,该问题是隐马尔可夫模型三大问题中的第二个问题,即使用维特比算法进行解码的问题。要使用维特比算法解码,需要提前知道 HMM 的参数,即状态转移概率和观测概率(或发射概率)。关于这两个个概率的计算原始 PYIN 论文都有介绍,下面我们将根据源码来进一步学习。

2.2.1. 观测概率

正如论文中所描述的那样,将上述 2.1. 节所计算出的概率直接赋值给距其最近的频点,具体实现见以下开源代码:

const vector<double>
MonoPitchHMM::calculateObsProb(const vector<pair<double, double> > pitchProb)
{/* 笔者注释 1:pitchProb:存储的是每帧音频数据估计出的周期候选值(由 HZ 转换为了 MIDI)以及其对应的概率。*/vector<double> out = vector<double>(2*m_nPitch+1);double probYinPitched = 0;// BIN THE PITCHESfor (size_t iPair = 0; iPair < pitchProb.size(); ++iPair){/* 笔者注释 2:先将 MIDI 再转回 HZ。*/double freq = 440. * std::pow(2, (pitchProb[iPair].first - 69)/12);if (freq <= m_minFreq) continue;double d = 0;double oldd = 1000;/* 笔者注释 3:核心代码,将当前周期估计值的概率赋值给距其最近的频点;m_nPitch = 480 就是论文中提到的频点数量;m_freqs 中存储了这 480 个频点对应的具体频率值;probYinPitched 中存储的是当前帧存在基频的概率。*/for (size_t iPitch = 0; iPitch < m_nPitch; ++iPitch){d = std::abs(freq-m_freqs[iPitch]);if (oldd < d && iPitch > 0){// previous bin must have been the closestout[iPitch-1] = pitchProb[iPair].second;probYinPitched += out[iPitch-1];break;}oldd = d;}}/* 笔者注释 4:下面的代码实现的是论文中的公式 (6),略有修改;m_yinTrust:值为 0.5,对应于当前帧是语音和非语音的先验概率;probReallyPitched:表示当前帧是语音且存在基频的概率;*/double probReallyPitched = m_yinTrust * probYinPitched;for (size_t iPitch = 0; iPitch < m_nPitch; ++iPitch){if (probYinPitched > 0) out[iPitch] *= (probReallyPitched/probYinPitched) ;out[iPitch+m_nPitch] = (1 - probReallyPitched) / m_nPitch;}// out[2*m_nPitch] = m_yinTrust * (1 - probYinPitched);return(out);
}

上面的代码计算每帧音频数据对应的观测概率,并将结果存储在 out 数组中,out 数组的大小为 2 ∗ m_nPitch + 1 2 * \text{m\_nPitch} + 1 2m_nPitch+1,因为每个频点都对应语音和非语音两个状态,所以有 2 ∗ 2 \, * 2 这个操作,out 是个稀疏数组,因为通常情况下,候选周期值的数量远少于 out 数组长度,上面讲的论文中也都有提到。

当有 N N N 帧音频数据输入到 HMM 模块时,观测概率模块将会是一个二维数组,其维度是 [ N , 2 ∗ m_nPitch + 1 ] [N,2 * \text{m\_nPitch} + 1] [N,2m_nPitch+1]。在具体实现过程中,算法会认为第一帧是不存在基频的,因此,该二维数组的第一行中,除了最后一个元素是 1 1 1 以外,其余元素的值均为 0 0 0

2.2.2. 状态转移概率

这部分的实现见以下代码,它是在创建 HMM 类实例的时候,在构造函数中就实现好的。

void
MonoPitchHMM::build()
{int idx = 0;// INITIAL VECTORinit = vector<double>(2*m_nPitch+1);init[2*m_nPitch] = 1; // force first frame to be unvoiced./*笔者注释 1:下面的 for 循环计算每个频点(也就是 HMM 中的隐状态)的状态转移概率,对应于论文中的公式 (7) 和 (8)*/// TRANSITIONSfor (size_t iPitch = 0; iPitch < m_nPitch; ++iPitch){/*笔者注释 2:针对每个频点,首先计算它所能转移的范围:minNextPitch 和 maxNextPitch;m_transitionWidth = 26,对应于论文中描述的最多可以转移 25 个频点;也就是以当前频点为中心,往最左边扩 13 个频点,往最右边扩 13 个频点。*/int theoreticalMinNextPitch = static_cast<int>(iPitch)-static_cast<int>(m_transitionWidth/2);int minNextPitch = iPitch>m_transitionWidth/2 ? iPitch-m_transitionWidth/2 : 0;int maxNextPitch = iPitch<m_nPitch-m_transitionWidth/2 ? iPitch+m_transitionWidth/2 : m_nPitch-1;/*笔者注释 3:weights:存储的就是论文中提到的三角权重,在这里就是转移概率,离当前频点越远,其概率值越小。*/// WEIGHT VECTORdouble weightSum = 0;vector<double> weights;for (size_t i = minNextPitch; i <= maxNextPitch; ++i){if (i <= iPitch){weights.push_back(i-theoreticalMinNextPitch+1);// weights.push_back(i-theoreticalMinNextPitch+1+m_transitionWidth/2);} else {weights.push_back(iPitch-theoreticalMinNextPitch+1-(i-iPitch));// weights.push_back(iPitch-theoreticalMinNextPitch+1-(i-iPitch)+m_transitionWidth/2);}weightSum += weights[weights.size()-1];}/*笔者注释 4:下面的 for 循环是计算转移概率的核心代码,对应于论文中的公式 (8);from:存储的是原始状态;to:存储的是要转移到的目标状态;transProb:存储的是具体的转移概率值,对 weights 做了归一化;m_selfTrans:0.99,对应于论文中的公式 (7)。*/// std::cerr << minNextPitch << "  " << maxNextPitch << std::endl;// TRANSITIONS TO CLOSE PITCHfor (size_t i = minNextPitch; i <= maxNextPitch; ++i){from[idx] = iPitch;to[idx] = i;transProb[idx] = weights[i-minNextPitch] / weightSum * m_selfTrans;++idx;from[idx] = iPitch;to[idx] = i+m_nPitch;transProb[idx] = weights[i-minNextPitch] / weightSum * (1-m_selfTrans);++idx;from[idx] = iPitch+m_nPitch;to[idx] = i+m_nPitch;transProb[idx] = weights[i-minNextPitch] / weightSum * m_selfTrans;++idx;// transProb.push_back(weights[i-minNextPitch] / weightSum * 0.5);from[idx] = iPitch+m_nPitch;to[idx] = i;transProb[idx] = weights[i-minNextPitch] / weightSum * (1-m_selfTrans);++idx;// transProb.push_back(weights[i-minNextPitch] / weightSum * 0.5);}// TRANSITION TO UNVOICED// from.push_back(iPitch+m_nPitch);// to.push_back(2*m_nPitch);// transProb.push_back(1-m_selfTrans);// TRANSITION FROM UNVOICED TO PITCHfrom[idx] = 2*m_nPitch;to[idx] = iPitch+m_nPitch;transProb[idx] = 1.0/m_nPitch;++idx;}// UNVOICED SELFTRANSITION// from.push_back(2*m_nPitch);// to.push_back(2*m_nPitch);// transProb.push_back(m_selfTrans);// for (size_t i = 0; i < from.size(); ++i) {//     std::cerr << "P(["<< from[i] << " --> " << to[i] << "]) = " << transProb[i] << std::endl;// }}

2.2.3. 维特比算法解码

该模块使用上述计算的观测概率和状态转移概率来计算出最优音高序列。建议读者先学习了解 HMM 中解码所用的维特比算法。

const std::vector<int> 
SparseHMM::decodeViterbi(std::vector<vector<double> > obsProb,vector<double> *scale) 
{/* 笔者注释 1:obsProb:2.2.1. 小节最后提到的二维数组,存储了多帧音频数据对应的观测概率;scale:代码中没实际用到;init:大小为 2 * m_nPitch + 1 的数组;*/size_t nState = init.size();size_t nFrame = obsProb.size();// check for consistency    size_t nTrans = transProb.size();// declaring variablesstd::vector<double> delta = std::vector<double>(nState);std::vector<double> oldDelta = std::vector<double>(nState);vector<vector<int> > psi; //  "matrix" of remembered indices of the best transitionsvector<int> path = vector<int>(nFrame, nState-1); // the final output path (current assignment arbitrary, makes sense only for Chordino, where nChord-1 is the "no chord" label)double deltasum = 0;/* 笔者注释 2:下面的两个 for 循环是维特比算法的初始化步骤;oldDelta:存储的是截止到前一帧,每个隐状态能取到的最大概率值;以下代码先用第一帧的音频对其进行初始化,并进行归一化。*/// initialise first framefor (size_t iState = 0; iState < nState; ++iState){oldDelta[iState] = init[iState] * obsProb[0][iState];// std::cerr << iState << " ----- " << init[iState] << std::endl;deltasum += oldDelta[iState];}for (size_t iState = 0; iState < nState; ++iState){oldDelta[iState] /= deltasum; // normalise (scale)// std::cerr << oldDelta[iState] << std::endl;}scale->push_back(1.0/deltasum);psi.push_back(vector<int>(nState,0));/* 笔者注释 3:下面的 for 循环是维特比算法的迭代步骤,也是最核心的步骤。*/// rest of forward stepfor (size_t iFrame = 1; iFrame < nFrame; ++iFrame){   deltasum = 0;psi.push_back(vector<int>(nState,0));// calculate best previous state for every current statesize_t fromState;size_t toState;double currentTransProb;double currentValue;/* 笔者注释 4:下面的 for 循环得到由前一帧的隐状态转移到当前帧的每个隐状态所能取到的最大转移概率,并将最大转移概率存储到 delta 中,取得最大转移概率所对应的前一帧的隐状态存储到 psi 中 。*/// this is the "sparse" loopfor (size_t iTrans = 0; iTrans < nTrans; ++iTrans){fromState = from[iTrans];toState = to[iTrans];currentTransProb = transProb[iTrans];currentValue = oldDelta[fromState] * currentTransProb;if (currentValue > delta[toState]){delta[toState] = currentValue; // will be multiplied by the right obs later!psi[iFrame][toState] = fromState;}            }/* 笔者注释 5:下面的 for 循环是得到当前帧的每个隐状态所能取到的最大概率,即在转移概率的基础上再乘以观测概率,并将结果存储在 delta 中。*/for (size_t jState = 0; jState < nState; ++jState){delta[jState] *= obsProb[iFrame][jState];deltasum += delta[jState];}/* 笔者注释 6:更新 oldDelta 和 delta。*/if (deltasum > 0){for (size_t iState = 0; iState < nState; ++iState){oldDelta[iState] = delta[iState] / deltasum; // normalise (scale)delta[iState] = 0;}scale->push_back(1.0/deltasum);} else{std::cerr << "WARNING: Viterbi has been fed some zero probabilities, at least they become zero in combination with the model." << std::endl;for (size_t iState = 0; iState < nState; ++iState){oldDelta[iState] = 1.0/nState;delta[iState] = 0;}scale->push_back(1.0);}}/* 笔者注释 7:回溯步骤,根据 psi 中存储的结果找到最优音高序列,存储在 path 中。*/// initialise backward stepdouble bestValue = 0;for (size_t iState = 0; iState < nState; ++iState){double currentValue = oldDelta[iState];if (currentValue > bestValue){bestValue = currentValue;            path[nFrame-1] = iState;}}// rest of backward stepfor (int iFrame = nFrame-2; iFrame != -1; --iFrame){path[iFrame] = psi[iFrame+1][path[iFrame+1]];}for (size_t iState = 0; iState < nState; ++iState){// std::cerr << psi[2][iState] << std::endl;}return path;
}

至此,PYIN 算法的核心模块介绍完毕。

3. 改进和实际问题

可以看到,在 PYIN 算法的论文和实现中,作者认为每个频点都对应 voicedunvoiced 两种状态,但这块是不是有些冗余,是不是可以将每个频点的 unvoiced 状态都集中用一个 unvoiced 状态来表示,反正表示的都是 unvoiced。这样做的话,隐状态的数量可以减少一半,大大降低了计算量。不过这样做,也需要重新设计状态转移概率等,而且效果需要进一步验证。

还有一个在实际中需要注意的是,可以看到 PYIN 算法的第二阶段,是对一段音频(包含许多连续音频帧)进行维特比算法解码。那么问题来了:这段音频到底应该需要多长,长的话计算复杂度会增加,有可能满足不了实时性要求,短的话效果可能会变差,所以在实际应用中,这也是一个需要折衷的情况。

4. 总结

本文暂且作为基频提取技术系列的最终篇,该系列主要学习了 YIN,PYIN 算法的论文,分析了相关开源代码,后续若是有更好更新的技术再继续更新。

整个系列学习下来,笔者也获益匪浅。YIN 部分加深了对 FFT 计算(互)相关操作的认识,学习了抛物线插值的实现方法。PYIN 的第二阶段,学到了隐马尔可夫模型的理论,以及该如何将隐马尔可夫模型用在实际问题上。隐马尔可夫模型也是笔者一直想学习的东西,之前做的音频、语音相关项目基本用到不它。

最后一句话送给自己:零碎时间利用起来,也能干不少事,学不少知识。

这篇关于【基频提取算法-PYIN】的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/868151

相关文章

不懂推荐算法也能设计推荐系统

本文以商业化应用推荐为例,告诉我们不懂推荐算法的产品,也能从产品侧出发, 设计出一款不错的推荐系统。 相信很多新手产品,看到算法二字,多是懵圈的。 什么排序算法、最短路径等都是相对传统的算法(注:传统是指科班出身的产品都会接触过)。但对于推荐算法,多数产品对着网上搜到的资源,都会无从下手。特别当某些推荐算法 和 “AI”扯上关系后,更是加大了理解的难度。 但,不了解推荐算法,就无法做推荐系

康拓展开(hash算法中会用到)

康拓展开是一个全排列到一个自然数的双射(也就是某个全排列与某个自然数一一对应) 公式: X=a[n]*(n-1)!+a[n-1]*(n-2)!+...+a[i]*(i-1)!+...+a[1]*0! 其中,a[i]为整数,并且0<=a[i]<i,1<=i<=n。(a[i]在不同应用中的含义不同); 典型应用: 计算当前排列在所有由小到大全排列中的顺序,也就是说求当前排列是第

csu 1446 Problem J Modified LCS (扩展欧几里得算法的简单应用)

这是一道扩展欧几里得算法的简单应用题,这题是在湖南多校训练赛中队友ac的一道题,在比赛之后请教了队友,然后自己把它a掉 这也是自己独自做扩展欧几里得算法的题目 题意:把题意转变下就变成了:求d1*x - d2*y = f2 - f1的解,很明显用exgcd来解 下面介绍一下exgcd的一些知识点:求ax + by = c的解 一、首先求ax + by = gcd(a,b)的解 这个

综合安防管理平台LntonAIServer视频监控汇聚抖动检测算法优势

LntonAIServer视频质量诊断功能中的抖动检测是一个专门针对视频稳定性进行分析的功能。抖动通常是指视频帧之间的不必要运动,这种运动可能是由于摄像机的移动、传输中的错误或编解码问题导致的。抖动检测对于确保视频内容的平滑性和观看体验至关重要。 优势 1. 提高图像质量 - 清晰度提升:减少抖动,提高图像的清晰度和细节表现力,使得监控画面更加真实可信。 - 细节增强:在低光条件下,抖

【数据结构】——原来排序算法搞懂这些就行,轻松拿捏

前言:快速排序的实现最重要的是找基准值,下面让我们来了解如何实现找基准值 基准值的注释:在快排的过程中,每一次我们要取一个元素作为枢纽值,以这个数字来将序列划分为两部分。 在此我们采用三数取中法,也就是取左端、中间、右端三个数,然后进行排序,将中间数作为枢纽值。 快速排序实现主框架: //快速排序 void QuickSort(int* arr, int left, int rig

poj 3974 and hdu 3068 最长回文串的O(n)解法(Manacher算法)

求一段字符串中的最长回文串。 因为数据量比较大,用原来的O(n^2)会爆。 小白上的O(n^2)解法代码:TLE啦~ #include<stdio.h>#include<string.h>const int Maxn = 1000000;char s[Maxn];int main(){char e[] = {"END"};while(scanf("%s", s) != EO

秋招最新大模型算法面试,熬夜都要肝完它

💥大家在面试大模型LLM这个板块的时候,不知道面试完会不会复盘、总结,做笔记的习惯,这份大模型算法岗面试八股笔记也帮助不少人拿到过offer ✨对于面试大模型算法工程师会有一定的帮助,都附有完整答案,熬夜也要看完,祝大家一臂之力 这份《大模型算法工程师面试题》已经上传CSDN,还有完整版的大模型 AI 学习资料,朋友们如果需要可以微信扫描下方CSDN官方认证二维码免费领取【保证100%免费

dp算法练习题【8】

不同二叉搜索树 96. 不同的二叉搜索树 给你一个整数 n ,求恰由 n 个节点组成且节点值从 1 到 n 互不相同的 二叉搜索树 有多少种?返回满足题意的二叉搜索树的种数。 示例 1: 输入:n = 3输出:5 示例 2: 输入:n = 1输出:1 class Solution {public int numTrees(int n) {int[] dp = new int

Codeforces Round #240 (Div. 2) E分治算法探究1

Codeforces Round #240 (Div. 2) E  http://codeforces.com/contest/415/problem/E 2^n个数,每次操作将其分成2^q份,对于每一份内部的数进行翻转(逆序),每次操作完后输出操作后新序列的逆序对数。 图一:  划分子问题。 图二: 分而治之,=>  合并 。 图三: 回溯:

最大公因数:欧几里得算法

简述         求两个数字 m和n 的最大公因数,假设r是m%n的余数,只要n不等于0,就一直执行 m=n,n=r 举例 以18和12为例 m n r18 % 12 = 612 % 6 = 06 0所以最大公因数为:6 代码实现 #include<iostream>using namespace std;/