Reference: An Introduction to Signal Processing for Speech, by Dan Ellis (Columbia University), Chapter 22 in the Handbook of Phonetic Sciences. It is an excellent introductory guide; these notes are excerpts from it plus my own additions.

This chapter aims to give a transparent and intuitive introduction to the basic ideas of the Fourier domain and filtering, and connects them to some of the common representations used in speech science, including the spectrogram and cepstral coefficients.



Very roughly, linearity is the idea that scaling the input to a system will result in scaling the output by the same amount.
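To make the definition concrete, here is a minimal sketch that checks both halves of linearity (scaling and superposition) numerically. The moving-average filter is only an assumed example of a linear system, and `moving_average` is a hypothetical helper name, not something from the chapter.

```python
import numpy as np

def moving_average(x, width=5):
    """An example linear system: each output sample is a local mean of the input."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
x2 = rng.standard_normal(200)

# Scaling: tripling the input triples the output.
scaling_holds = np.allclose(moving_average(3.0 * x1), 3.0 * moving_average(x1))

# Superposition: the response to a sum is the sum of the individual responses.
superposition_holds = np.allclose(
    moving_average(x1 + x2), moving_average(x1) + moving_average(x2)
)

print(scaling_holds, superposition_holds)
```

A system that squares or clips its input would fail both checks, which is a quick way to spot a nonlinear system.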


Now we can learn one more very important property of sinusoids: they are the eigenfunctions of linear systems. What this means is that if a linear system is fed a sinusoid (with or without an exponential envelope), the output will also be a sinusoid, with the same frequency and the same rate of exponential decay, merely scaled in amplitude and possibly shifted in phase.
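The eigenfunction property can be checked directly. The sketch below passes a cosine through a small FIR filter and compares the output with a cosine of the same frequency, scaled and phase-shifted by the filter's frequency response. The sample rate, tone frequency, and filter taps are arbitrary assumptions chosen for illustration.

```python
import numpy as np

fs = 8000.0                      # sample rate in Hz (assumed)
f0 = 440.0                       # input frequency in Hz (assumed)
n = np.arange(2000)
w = 2 * np.pi * f0 / fs          # frequency in radians per sample
x = np.cos(w * n)

h = np.array([0.5, 0.3, 0.2])    # impulse response of an arbitrary linear filter
y = np.convolve(x, h)[: len(x)]  # filter output, trimmed to the input length

# Frequency response at f0: H = sum_k h[k] * exp(-j * w * k)
H = np.sum(h * np.exp(-1j * w * np.arange(len(h))))
gain, phase = np.abs(H), np.angle(H)

# Prediction: same frequency, amplitude scaled by `gain`, phase shifted by `phase`.
y_pred = gain * np.cos(w * n + phase)

# Ignore the first len(h) - 1 samples, where the filter is still filling up.
start = len(h) - 1
max_error = np.max(np.abs(y[start:] - y_pred[start:]))
print(max_error)
```

The error is at the level of floating-point rounding: the filter changed the amplitude and the phase, but not the frequency.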


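A side note on terminology: a functional equation such as f(xy) = f(x) + f(y) states a property without giving a concrete formula, so it describes an abstract function. In Chinese textbooks, a concrete function that satisfies the property, for example f(x) = ln(x), is called a 特征函数 of the equation, the same word that is used for "eigenfunction"; here the continuous solutions are the logarithmic functions. The eigenfunction in the quote above is a stricter notion: it is an input x(t) that a system T returns unchanged except for a constant factor, T[x] = λx, where λ carries the amplitude scaling and phase shift.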

Sinusoid in, sinusoid out, combined with superposition, is the key to the value of the Fourier transform.


The core of Fourier analysis is a simple but somewhat surprising fact: Any periodically-repeating waveform can be expressed as a sum of sinusoids, each scaled and shifted in time by appropriate constants. Moreover, the only sinusoids required are those whose frequency is an integer multiple of the fundamental frequency of the periodic sequence.

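So how do we find this sum of sinusoids? That is the question of how to compute a Fourier series. (The passage below roughly explains the inner-product idea. I also recommend the 遗迹 series post 【学渣告诉你】到底神马是傅里叶级数! ("What on earth is a Fourier series!"), which is written in plain language and easy to follow. The key point is that representing a function with a set of basis functions is analogous to representing a vector with a set of basis vectors.)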

It turns out that finding the Fourier series coefficients – the optimal scale constants and phase shifts for each harmonic – is very straightforward: All you have to do is multiply the waveform, point-for-point, with a candidate harmonic, and sum up (i.e. integrate) over a complete cycle; this is known as taking the inner product between the waveform and the harmonic, and gives the required scale constant for that harmonic. This works because the harmonics are orthogonal, meaning that the inner product between different harmonics is exactly zero, so if we assume that the original waveform is a sum of scaled harmonics, only the term involving the candidate harmonic appears in the result of the inner product. Finding the phase requires taking the inner product twice, once with a cosine-phase harmonic and once with the sine-phase harmonic, giving two scaled harmonics that can sum together to give a sinusoid of the corresponding frequency at any amplitude and any phase.
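The inner-product recipe translates almost word for word into code. The following is a minimal sketch: the test waveform is built from two known harmonics (an assumption made so the result can be checked), and `harmonic_coefficients` is a hypothetical helper name.

```python
import numpy as np

N = 1000                          # samples in one complete cycle
t = np.arange(N) / N              # one period, normalised to length 1
# Waveform with known content: harmonic 1 (amp 1.5, phase 0.4)
# and harmonic 3 (amp 0.7, phase -1.1).
x = 1.5 * np.cos(2 * np.pi * 1 * t + 0.4) + 0.7 * np.cos(2 * np.pi * 3 * t - 1.1)

def harmonic_coefficients(x, t, k):
    """Amplitude and phase of harmonic k, found with two inner products."""
    a = 2 * np.mean(x * np.cos(2 * np.pi * k * t))   # cosine-phase inner product
    b = 2 * np.mean(x * np.sin(2 * np.pi * k * t))   # sine-phase inner product
    amplitude = np.hypot(a, b)
    phase = np.arctan2(-b, a)     # so that harmonic k is amplitude * cos(2*pi*k*t + phase)
    return amplitude, phase

amp1, phase1 = harmonic_coefficients(x, t, 1)
amp2, _ = harmonic_coefficients(x, t, 2)   # harmonic 2 is absent from x
amp3, phase3 = harmonic_coefficients(x, t, 3)
print(amp1, phase1, amp2, amp3, phase3)
```

Harmonic 2 comes out with essentially zero amplitude, which is orthogonality at work: the other harmonics contribute nothing to its inner product.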


Consider, however, stretching the period of repetition to be longer and longer. Fourier analysis states that within this very long period we can have any arbitrary and unique waveform, and we will still be able to represent it as accurately as we wish. All that happens is that the ‘harmonics’ of our very long period become more and more closely spaced in frequency.
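A small numerical sketch of this stretching, under assumed values: the same short pulse is placed in repetition periods of increasing length, and the spacing between adjacent harmonics (1 / period) is read off the DFT frequency grid. The sample rate and pulse shape are arbitrary choices.

```python
import numpy as np

fs = 1000.0                       # sample rate in Hz (assumed)
pulse = np.hanning(50)            # one short, fixed waveform (assumed shape)

periods = (0.1, 1.0, 10.0)        # repetition periods in seconds
spacings = []
for period_seconds in periods:
    n_samples = int(round(period_seconds * fs))
    one_period = np.zeros(n_samples)
    one_period[: len(pulse)] = pulse                 # same pulse, longer period
    spectrum = np.fft.rfft(one_period)               # one value per harmonic
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)   # harmonic frequencies k / period
    spacings.append(freqs[1] - freqs[0])
    print(period_seconds, len(spectrum), spacings[-1])
```

Each tenfold increase in the period packs the harmonics ten times closer together, while the pulse itself stays the same.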

The key point is stretching the period of repetition; see 从傅立叶级数到傅立叶变换 ("From Fourier Series to the Fourier Transform").

Now by letting the fundamental period go to infinity, we end up with a signal that is no longer periodic, since there is only space for a single repetition in the entire real time axis; at the same time, the spacing between our harmonics goes to zero, meaning that the Fourier series now becomes a continuous function of frequency, not a series of discrete values. However, nothing essentially changes – and, in particular, we can still find the value of the Fourier transform function simply by calculating the inner product integral. Now we have the most general form of the Fourier transform, pairing a continuous, non-repeating (aperiodic) waveform in time, with a continuous function of frequency.




Calculation of the spectrogram. Input signal (1) is converted into a sequence of short excerpts by applying a sliding tapered window (2). Each short excerpt is converted to the frequency domain via the Fourier transform (3), then these individual spectra become columns in the spectrogram image (4), with each pixel’s color reflecting the log-magnitude at the corresponding frequency value in the Fourier transform.
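The four steps in the caption can be sketched in a few lines of NumPy. The frame length, hop size, Hann window, and test tone are assumptions picked for illustration, and `spectrogram` is a hypothetical helper rather than any particular library's function.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Log-magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)                    # tapered window
    n_frames = 1 + (len(x) - frame_len) // hop        # trailing partial frame is dropped
    columns = []
    for i in range(n_frames):
        excerpt = x[i * hop : i * hop + frame_len] * window     # (2) windowed excerpt
        spectrum = np.fft.rfft(excerpt)                         # (3) Fourier transform
        columns.append(20 * np.log10(np.abs(spectrum) + 1e-10)) # log-magnitude in dB
    return np.stack(columns, axis=1)                  # (4) spectra become columns

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)                      # (1) one second of a 1 kHz tone

S = spectrogram(x)
peak_bin = int(np.argmax(S[:, 0]))                    # strongest bin in the first frame
peak_hz = peak_bin * fs / 256
print(S.shape, peak_hz)
```

Plotting `S` as an image (for example with matplotlib's `imshow`, lower origin) gives the familiar picture: here, a single horizontal line at 1 kHz.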

(Linear Prediction is skipped for now.)

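The most commonly used features in speech recognition are Mel-frequency cepstral coefficients (MFCC).
To understand MFCC, it helps to answer two questions: (1) What is the Mel-frequency scale? (2) What are cepstral coefficients?
(1) What is the Mel-frequency scale?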

The Mel-frequency scale is a nonlinear mapping of the audible frequency range. The scale is approximately linear below 1000 Hz and approximately logarithmic above 1000 Hz.
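The excerpt does not give a formula for the mapping. One widely used approximation, assumed here and not taken from the chapter, is mel = 2595 * log10(1 + f / 700); the sketch below uses it to show the near-linear and near-logarithmic regions.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Hz -> mel using the common 2595 * log10(1 + f / 700) approximation (an assumption)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Equal 500 Hz steps cover fewer and fewer mels as frequency rises.
low_step = hz_to_mel(1000) - hz_to_mel(500)     # step below 1000 Hz
high_step = hz_to_mel(8000) - hz_to_mel(7500)   # step well above 1000 Hz
print(hz_to_mel(1000), low_step, high_step)
```

With this formula 1000 Hz maps to almost exactly 1000 mel, and the same 500 Hz step is worth far fewer mels near 8 kHz than below 1 kHz.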

(2) What are cepstral coefficients?

Cepstra amounts to taking a second Fourier transform on the logarithm of the magnitude of the original spectrum (Fourier transform of the time waveform). Because of the symmetry between time and frequency in the basic Fourier mathematics, without the intervening log-magnitude step, taking the Fourier transform of a Fourier transform almost gets you back to the original signal. But taking the magnitude removes any phase (relative timing) information between different frequencies, and applying a logarithm drastically alters the balance between intense and weak components, leading to a very different signal.
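Both claims in this passage can be tried out numerically. The sketch below first checks that two plain Fourier transforms return the original signal (time-reversed and scaled), then computes a real cepstrum and shows how it exposes an echo. The inverse FFT is used for the second transform, a common convention that is assumed here; the noise signal, echo gain, and delay are arbitrary, and the echo is circular to keep the example exact.

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse FFT of the log-magnitude spectrum of `frame`."""
    log_magnitude = np.log(np.abs(np.fft.fft(frame)) + 1e-10)  # phase is discarded
    return np.fft.ifft(log_magnitude).real

rng = np.random.default_rng(0)
N, delay = 2048, 100
x = rng.standard_normal(N)

# Without the log-magnitude step, two Fourier transforms give back the
# original signal, time-reversed and scaled by N.
twice = np.fft.fft(np.fft.fft(x)) / N
recovers_reversed_signal = np.allclose(twice, np.roll(x[::-1], 1))

# With it, the result is a very different signal: a circular echo of x
# appears as a peak at the echo delay.
y = x + 0.8 * np.roll(x, delay)
c = real_cepstrum(y)
peak_quefrency = 1 + int(np.argmax(c[1 : N // 2]))   # skip index 0 (overall level)
print(recovers_reversed_signal, peak_quefrency)
```

The index of a cepstral coefficient is usually called its quefrency; low quefrencies describe the smooth spectral envelope, which is why only the first few coefficients are kept as features.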

So MFCCs are simply the cepstra computed on the Mel-warped spectrum.
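Putting the two pieces together, here is a rough single-frame MFCC sketch. Several details are assumptions rather than anything stated in the chapter: the 2595 * log10(1 + f / 700) Mel formula, triangular filters, 26 filters, 13 coefficients, and an unnormalised DCT-II standing in for the second Fourier transform. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)        # assumed Mel formula

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose edges are evenly spaced on the Mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2))
    bin_hz = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, len(bin_hz)))
    for i in range(n_filters):
        lo, centre, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - lo) / (centre - lo)
        falling = (hi - bin_hz) / (hi - centre)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """MFCCs of one frame: power spectrum -> Mel warping -> log -> DCT."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    mel_energies = mel_filterbank(n_filters, len(frame), fs) @ power
    log_mel = np.log(mel_energies + 1e-10)
    # Unnormalised DCT-II of the log Mel spectrum; keep the low-order terms.
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct_basis = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return dct_basis @ log_mel

fs = 16000
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)                     # stand-in for 32 ms of audio
coeffs = mfcc_frame(frame, fs)
fb = mel_filterbank(26, 512, fs)
print(coeffs.shape, fb.shape)
```

Real toolkits differ in filter shapes, normalisation, and liftering, so the numbers from this sketch should not be expected to match any specific library.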


Perceptual Linear Prediction (PLP)

PLP features often perform comparably to MFCCs, although which feature is superior tends to vary from task to task. PLP features use the Bark auditory scale, and trapezoidal (flat-topped) rather than triangular windows, to create the initial auditory spectrum. Then, rather than smoothing the auditory spectrum by keeping only the low-order cepstral coefficients, linear prediction is used to find a smooth spectrum consisting of only a few resonant peaks (typically 4 to 6) that matches the Bark-spectrum. Finally, this smoothed PLP spectrum is again converted to the compact, decorrelated cepstral coefficients via another neat mathematical trick that finds cepstra directly from an LP model.

Delta coefficients

An estimate of the local slope, along the direction of the time axis, for each frequency or cepstral coefficient.
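As a minimal sketch of a slope estimate along time, the code below uses a central difference via `np.gradient`. This is the simplest choice and an assumption on my part; speech toolkits often fit a regression line over several neighbouring frames instead. The feature layout (coefficients in rows, frames in columns) is also assumed.

```python
import numpy as np

def delta(features):
    """Slope along time for each row; features has shape (n_coefficients, n_frames)."""
    return np.gradient(features, axis=1)

# Two made-up coefficient tracks: one rising by 1 per frame, one falling by 2 per frame.
features = np.outer([1.0, -2.0], np.arange(10))
d = delta(features)
print(d[:, 0], d.shape)
```

Stacking `features` and `d` row-wise gives the usual "static plus delta" feature vector for each frame.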

Cepstral Mean Normalization (CMN)

The average value of each cepstral dimension over an entire segment or utterance is subtracted from that dimension at every time step.
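CMN is a one-line operation once the cepstra are in an array. The sketch below assumes a layout of (n_coefficients, n_frames) and uses random numbers as stand-in cepstra; it also checks that a constant offset added to every frame disappears after normalization.

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract each coefficient's mean over the utterance; shape (n_coefficients, n_frames)."""
    return cepstra - cepstra.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
cepstra = rng.standard_normal((13, 200))             # stand-in for one utterance
offset = rng.standard_normal((13, 1))                # constant shift in every frame

normalized = cepstral_mean_normalization(cepstra)
normalized_shifted = cepstral_mean_normalization(cepstra + offset)

offset_removed = np.allclose(normalized, normalized_shifted)
print(offset_removed, np.abs(normalized.mean(axis=1)).max())
```

A fixed filter, such as a microphone or channel response, multiplies the spectrum and therefore adds a constant to the log spectrum and to the cepstra, so subtracting the mean removes it.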

