Speech Signal Processing 1: Introduction

2024-09-01 00:32


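Reference: An introduction to signal processing for speech, by Dan Ellis @ Columbia University, Chapter 22 in Handbook of Phonetic Science. It is an excellent introductory guide; this post consists of excerpts from it plus my own supplementary notes.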

This chapter aims to give a transparent and intuitive introduction to the basic ideas of the Fourier domain and filtering, and connects them to some of the common representations used in speech science, including the spectrogram and cepstral coefficients.

Keywords: Fourier analysis, filtering, spectrogram, cepstral coefficients

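Before introducing Fourier analysis, we first need to understand an important concept: linearity. Roughly speaking, it means that the output of a system scales in step with its input.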

Very roughly, linearity is the idea that scaling the input to a system will result in scaling the output by the same amount.

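Sinusoids are eigenfunctions of linear systems: feeding a sinusoid into a linear system yields a sinusoid of the same frequency, changed only in amplitude and phase.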

Now we can learn one more very important property of sinusoids: they are the eigenfunctions of linear systems. What this means is that if a linear system is fed a sinusoid (with or without an exponential envelope), the output will also be a sinusoid, with the same frequency and the same rate of exponential decay, merely scaled in amplitude and possibly shifted in phase.

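The term needs a brief explanation here; see "What is a characteristic function?":

A function for which no explicit formula is given is called an abstract function. A characteristic function is defined relative to an abstract function: it is a concrete function that satisfies the given (characteristic) conditions.
For example, f(xy)=f(x)+f(y) gives no explicit expression for f, only a property it must satisfy, so it describes an abstract function. A concrete function that satisfies this property is called a characteristic function of it; f(x)=ln(x) is one, and from the property one can see that the full family of such functions is the logarithm functions.
Note that this is the functional-equation sense of the term. In the quoted passage above, "eigenfunction" has the operator sense: an input that a linear system returns unchanged in form, merely multiplied by a constant (a change of amplitude and phase).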
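The eigenfunction property of sinusoids can be checked numerically. The following is a minimal sketch, assuming a linear time-invariant system in the form of a short FIR filter; the impulse response `h` and the frequency `omega` are arbitrary illustrative values, not taken from the chapter. It compares the filter output against a sinusoid of the same frequency whose amplitude and phase come from the filter's frequency response.

```python
import numpy as np

h = np.array([0.5, 0.3, 0.2])        # impulse response of a hypothetical LTI filter
omega = 2 * np.pi * 0.05             # input frequency in radians per sample (assumed)
n = np.arange(400)
x = np.cos(omega * n)                # sinusoidal input

# Filter output: y[n] = sum_k h[k] * x[n - k]
y = np.convolve(x, h)[: len(n)]

# Frequency response at omega: H = sum_k h[k] * exp(-j * omega * k)
H = np.sum(h * np.exp(-1j * omega * np.arange(len(h))))

# Prediction: same frequency, amplitude scaled by |H|, phase shifted by angle(H)
predicted = np.abs(H) * np.cos(omega * n + np.angle(H))

# Skip the first len(h) - 1 samples, where the filter has not yet seen a full input history
start = len(h) - 1
max_error = np.max(np.abs(y[start:] - predicted[start:]))
print("amplitude scale:", np.abs(H), "phase shift:", np.angle(H), "max error:", max_error)
```

Once the start-up samples are skipped, the output matches the predicted sinusoid to within floating-point rounding: the system changed only the amplitude and the phase.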

One of the important and subtle properties that linearity brings is superposition.
Sinusoid in, sinusoid out, plus superposition, is the key to the value of the Fourier transform.
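Scaling and superposition are easy to test on a concrete system. The sketch below uses a moving-average filter as a stand-in linear system (my choice for illustration, not an example from the chapter) and squaring as a nonlinear counterexample.

```python
import numpy as np

def moving_average(x, width=5):
    """A simple linear system: each output sample is the mean of `width` neighboring inputs."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
x2 = rng.standard_normal(200)

# Scaling: multiplying the input by 3 multiplies the output by 3.
scaling_holds = np.allclose(moving_average(3.0 * x1), 3.0 * moving_average(x1))

# Superposition: the response to a sum equals the sum of the responses.
superposition_holds = np.allclose(
    moving_average(x1 + x2), moving_average(x1) + moving_average(x2)
)

# A nonlinear system (squaring) fails the scaling test: (3x)^2 is 9x^2, not 3x^2.
square_is_linear = np.allclose((3.0 * x1) ** 2, 3.0 * x1 ** 2)

print(scaling_holds, superposition_holds, square_is_linear)
```

Because any input can be written as a sum of sinusoids, and a linear system handles each sinusoid independently, knowing the system's response to every sinusoid is enough to know its response to any input.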

Now we formally introduce Fourier analysis, starting with the case of periodically-repeating waveforms:

The core of Fourier analysis is a simple but somewhat surprising fact: Any periodically-repeating waveform can be expressed as a sum of sinusoids, each scaled and shifted in time by appropriate constants. Moreover, the only sinusoids required are those whose frequency is an integer multiple of the fundamental frequency of the periodic sequence.

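So how do we find this sum of sinusoids? This is the problem of computing the Fourier series. (This part of the original roughly explains the principle of taking inner products. I recommend the vintage series "[A Poor Student Explains] What on Earth Is a Fourier Series!", which is written in plain language and easy to follow. The key point is that "representing a function with a set of basis functions" is analogous to "representing a vector with a set of basis vectors".)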

It turns out that finding the Fourier series coefficients – the optimal scale constants and phase shifts for each harmonic – is very straightforward: All you have to do is multiply the waveform, point-for-point, with a candidate harmonic, and sum up (i.e. integrate) over a complete cycle; this is known as taking the inner product between the waveform and the harmonic, and gives the required scale constant for that harmonic. This works because the harmonics are orthogonal, meaning that the inner product between different harmonics is exactly zero, so if we assume that the original waveform is a sum of scaled harmonics, only the term involving the candidate harmonic appears in the result of the inner product. Finding the phase requires taking the inner product twice, once with a cosine-phase harmonic and once with the sine-phase harmonic, giving two scaled harmonics that can sum together to give a sinusoid of the corresponding frequency at any amplitude and any phase.
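The inner-product recipe described above can be sketched in a few lines. This is an illustrative sketch only: the test waveform, the sample count, and the helper name `harmonic_coefficients` are my own assumptions. The integral over one cycle is approximated by a sum over samples that cover exactly one period.

```python
import numpy as np

def harmonic_coefficients(waveform, k):
    """Amplitude and phase of harmonic k, for samples spanning exactly one period."""
    n_samples = len(waveform)
    angle = 2.0 * np.pi * k * np.arange(n_samples) / n_samples
    # Inner products with the cosine-phase and sine-phase harmonics
    a = 2.0 / n_samples * np.sum(waveform * np.cos(angle))
    b = 2.0 / n_samples * np.sum(waveform * np.sin(angle))
    # A*cos(angle + phi) = A*cos(phi)*cos(angle) - A*sin(phi)*sin(angle),
    # so a = A*cos(phi) and b = -A*sin(phi)
    return np.hypot(a, b), np.arctan2(-b, a)

n_samples = 1000
t = np.arange(n_samples) / n_samples          # time in units of one period
waveform = 1.0 * np.cos(2 * np.pi * 1 * t + 0.3) + 0.5 * np.cos(2 * np.pi * 3 * t - 1.0)

amp1, phase1 = harmonic_coefficients(waveform, 1)
amp2, _ = harmonic_coefficients(waveform, 2)
amp3, phase3 = harmonic_coefficients(waveform, 3)
print(amp1, phase1, amp2, amp3, phase3)
```

The first and third harmonics come back with the amplitudes and phases used to build the waveform, while the second harmonic, which is absent, comes back with amplitude near zero. That zero is orthogonality at work.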

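Computing the Fourier series is Fourier analysis; conversely, Fourier synthesis turns the Fourier series coefficients back into a waveform.
But if Fourier analysis only worked for periodically-repeating waveforms, it would not be of much use, because a purely periodic signal (one that repeats over infinite time) is only a mathematical abstraction and does not exist in the real world. What can be done?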

Consider, however, stretching the period of repetition to be longer and longer. Fourier analysis states that within this very long period we can have any arbitrary and unique waveform, and we will still be able to represent it as accurately as we wish. All that happens is that the ‘harmonics’ of our very long period become more and more closely spaced in frequency.

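The key point is stretching the period of repetition; see "From Fourier Series to the Fourier Transform":
[Figure: from Fourier series to Fourier transform (FT)]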

Now by letting the fundamental period go to infinity, we end up with a signal that is no longer periodic, since there is only space for a single repetition in the entire real time axis; at the same time, the spacing between our harmonics goes to zero, meaning that the Fourier series now becomes a continuous function of frequency, not a series of discrete values. However, nothing essentially changes – and, in particular, we can still find the value of the Fourier transform function simply by calculating the inner product integral. Now we have the most general form of the Fourier transform, pairing a continuous, non-repeating (aperiodic) waveform in time, with a continuous function of frequency.
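In symbols, the limiting argument can be written as follows. The notation is my own choice rather than the chapter's: the complex exponential form is used, which bundles the cosine-phase and sine-phase inner products into a single coefficient.

```latex
% Fourier series of a waveform x(t) with period T (harmonic k has frequency k/T):
c_k = \frac{1}{T} \int_{-T/2}^{T/2} x(t)\, e^{-j 2\pi k t / T}\, dt,
\qquad
x(t) = \sum_{k=-\infty}^{\infty} c_k\, e^{j 2\pi k t / T}

% The spacing between harmonics shrinks as the period grows:
\Delta f = \frac{1}{T} \to 0 \quad \text{as } T \to \infty

% In the limit, f = k/T becomes continuous and X(f) = \lim_{T \to \infty} T c_k :
X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt,
\qquad
x(t) = \int_{-\infty}^{\infty} X(f)\, e^{j 2\pi f t}\, df
```

The analysis integral keeps the same inner-product form throughout; only the sum over discrete harmonics turns into an integral over continuous frequency.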

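The human auditory system in effect performs something like a Fourier transform: the cochlea (which can be viewed as a bank of filters) converts time-domain sound pressure into separate components at different frequencies. More precisely, though, it is closer to a short-time Fourier transform (STFT). On the STFT, see "Can someone explain the relationship between Fourier analysis and wavelet analysis in plain terms?":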

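The Fourier transform has an inherent weakness when handling non-stationary signals (signals whose frequency content changes over time). It can only tell which frequency components a stretch of signal contains overall; it knows nothing about when each component occurs. As a result, two signals that differ greatly in the time domain can have the same spectrum plot.
For such non-stationary signals, knowing which frequency components are present is not enough; we also want to know when each component occurs. Knowing how the signal's frequency content changes over time, that is, the instantaneous frequency and its amplitude at each moment, is what time-frequency analysis is about.
A simple and workable approach is windowing. To borrow Fang Qinyuan's description again: "Split the whole time-domain process into countless short segments of equal length, each approximately stationary, then Fourier-transform each one, and you know which frequencies appear at which point in time." This is the short-time Fourier transform.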
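The claim that very different signals can share a magnitude spectrum is easy to reproduce. In this sketch (the sample rate and the tone frequencies are arbitrary assumptions), a low tone followed by a high tone is compared with the same two tones in the opposite order.

```python
import numpy as np

fs = 1000                                  # sample rate in Hz (assumed)
t = np.arange(fs) / fs                     # one second of sample times
low = np.sin(2 * np.pi * 50 * t)           # 50 Hz tone
high = np.sin(2 * np.pi * 200 * t)         # 200 Hz tone

low_then_high = np.concatenate([low, high])
high_then_low = np.concatenate([high, low])

# Magnitude spectra over the whole two-second signals
mag_a = np.abs(np.fft.rfft(low_then_high))
mag_b = np.abs(np.fft.rfft(high_then_low))

signals_differ = not np.allclose(low_then_high, high_then_low)
spectra_match = np.allclose(mag_a, mag_b)
print(signals_differ, spectra_match)
```

One signal is a circular shift of the other, so the two transforms differ only in phase. The magnitude spectrum shows that both tones are present but not which came first, which is the gap that windowing fills.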

What we see on a spectrogram (not a spectrum, which carries no time information) is in fact the magnitude of the STFT.

Calculation of the spectrogram. Input signal (1) is converted into a sequence of short excerpts by applying a sliding tapered window (2). Each short excerpt is converted to the frequency domain via the Fourier transform (3), then these individual spectra become columns in the spectrogram image (4), with each pixel’s color reflecting the log-magnitude at the corresponding frequency value in the Fourier transform.
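The four steps in the caption can be sketched directly with NumPy. This is a minimal illustration under assumed parameters (a Hann window, a frame length of 256, a hop of 128, and a synthetic two-tone test signal), not the exact settings used for the chapter's figure.

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Log-magnitude STFT; returns an array of shape (frequency bins, frames)."""
    window = np.hanning(frame_len)                       # (2) sliding tapered window
    n_frames = 1 + (len(x) - frame_len) // hop
    columns = []
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len] * window
        spectrum = np.fft.rfft(frame)                    # (3) Fourier transform of the excerpt
        # Log-magnitude in dB; the small constant avoids log(0)
        columns.append(20.0 * np.log10(np.abs(spectrum) + 1e-10))
    return np.array(columns).T                           # (4) spectra become image columns

# (1) Input signal: one second at 500 Hz followed by one second at 2000 Hz
fs = 8000
t = np.arange(fs) / fs
x = np.concatenate([np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t)])

S = spectrogram(x)
freqs = np.fft.rfftfreq(256, d=1.0 / fs)                 # frequency of each row in Hz
peak_first = freqs[np.argmax(S[:, 0])]                   # strongest frequency, first frame
peak_last = freqs[np.argmax(S[:, -1])]                   # strongest frequency, last frame
print(S.shape, peak_first, peak_last)
```

Unlike a single spectrum of the whole signal, the columns show the energy moving from 500 Hz in the early frames to 2000 Hz in the late ones.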

(Linear Prediction is skipped for now.)

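However, the spectrum and the spectrogram contain too much information. Practical applications do not need that much detail; what matters more is extracting the key features.
The most commonly used features in speech recognition are Mel-frequency cepstral coefficients (MFCC).
MFCC can be understood from two angles: (1) What is the Mel-frequency scale? (2) What are cepstral coefficients?
(1) What is the Mel-frequency scale?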

The Mel-frequency scale is a nonlinear mapping of the audible frequency range. The scale is approximately linear below 1000 Hz and approximately logarithmic above 1000Hz.
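The quoted passage does not give a formula for the mapping. One widely used variant, which I assume here for illustration, is mel = 2595 * log10(1 + f / 700); other definitions exist and differ in detail.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (one common formula; an assumption, not from the source)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# The same 100 Hz step covers far fewer mels at high frequencies than at low ones.
step_low = hz_to_mel(200.0) - hz_to_mel(100.0)
step_high = hz_to_mel(4100.0) - hz_to_mel(4000.0)
print(hz_to_mel(1000.0), step_low, step_high)
```

With this formula 1000 Hz maps to roughly 1000 mel, and a fixed step in Hz shrinks in mels as frequency rises, which matches the roughly linear then roughly logarithmic behavior described above.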

(2) What are cepstral coefficients?

Cepstra amounts to taking a second Fourier transform on the logarithm of the magnitude of the original spectrum (Fourier transform of the time waveform). Because of the symmetry between time and frequency in the basic Fourier mathematics, without the intervening log-magnitude step, taking the Fourier transform of a Fourier transform almost gets you back to the original signal. But taking the magnitude removes any phase (relative timing) information between different frequencies, and applying a logarithm drastically alters the balance between intense and weak components, leading to a very different signal.
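Both halves of this description can be tried out on a random test signal. The sketch below is illustrative: the function name `real_cepstrum` and the small constant inside the logarithm are my own choices. It uses the inverse FFT for the second transform, as is common in practice; for a real signal the log-magnitude spectrum is real and symmetric, so the forward transform gives the same result up to a factor of the signal length.

```python
import numpy as np

def real_cepstrum(x):
    """Second transform applied to the log-magnitude of the first."""
    spectrum = np.fft.fft(x)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)   # small constant avoids log(0)
    return np.fft.ifft(log_magnitude).real

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
N = len(x)

# Without the log-magnitude step, transforming twice returns the signal
# time-reversed and scaled by N: "almost" the original.
twice = np.fft.fft(np.fft.fft(x)).real / N
time_reversed = np.roll(x[::-1], 1)                    # x[(-n) mod N]
recovers_signal = np.allclose(twice, time_reversed)

# With the log-magnitude step, the result is a very different sequence.
cepstrum = real_cepstrum(x)
forward_version = np.fft.fft(np.log(np.abs(np.fft.fft(x)) + 1e-12)).real / N
same_up_to_scale = np.allclose(cepstrum, forward_version)
print(recovers_signal, same_up_to_scale)
```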

MFCCs, then, are the cepstra computed on the Mel-warped spectrum.
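Putting the two ideas together for a single frame gives a rough MFCC sketch. Everything specific here is an assumption for illustration: the mel formula, the triangular filters, the choice of 26 filters and 13 coefficients, and the use of a DCT-II as the second transform are common choices, but real toolkits differ in normalization and other details.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)       # one common mel formula (assumed)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        lo, center, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_freqs - lo) / (center - lo)
        falling = (hi - bin_freqs) / (hi - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(frame, fs, n_filters=26, n_coeffs=13):
    """MFCC-style coefficients for one frame of samples."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    mel_energies = mel_filterbank(n_filters, n_fft, fs) @ power   # Mel-warped spectrum
    log_mel = np.log(mel_energies + 1e-10)                         # log step of the cepstrum
    # DCT-II of the log mel spectrum; keeping only the low-order terms smooths it
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n_filters)[None, :]
    dct_basis = np.cos(np.pi * k * (m + 0.5) / n_filters)
    return dct_basis @ log_mel

fs = 16000
n = np.arange(512)
frame = np.sin(2 * np.pi * 440 * n / fs)               # synthetic test frame (assumed)
coeffs = mfcc_frame(frame, fs)
bank = mel_filterbank(26, 512, fs)
print(coeffs.shape, bank.shape)
```

Keeping only the first 13 coefficients discards fine spectral detail and leaves a compact description of the overall spectral shape.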

There are of course other features as well, for example:

Perceptual Linear Prediction (PLP)

PLP features often perform comparably to MFCCs, although which feature is superior tends to vary from task to task. PLP features use the Bark auditory scale, and trapezoidal (flat-topped) rather than triangular windows, to create the initial auditory spectrum. Then, rather than smoothing the auditory spectrum by keeping only the low-order cepstral coefficients, linear prediction is used to find a smooth spectrum consisting of only a few resonant peaks (typically 4 to 6) that matches the Bark-spectrum. Finally, this smoothed PLP spectrum is again converted to the compact, decorrelated cepstral coefficients via another neat mathematical trick that finds cepstra directly from an LP model.

delta coefficients

an estimate of the local slope, along the direction of the time axis, for each frequency or cepstral coefficient
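One common way to estimate this slope, assumed here because the quoted text does not specify a formula, is a least-squares fit over a few frames on either side of the current one. The function name `delta` and the window half-width of 2 are illustrative choices.

```python
import numpy as np

def delta(features, width=2):
    """Local slope along time for each dimension.

    features: array of shape (n_frames, n_dims).
    Uses a least-squares slope over 2 * width + 1 frames, repeating edge frames as padding.
    """
    n_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    numerator = np.zeros(features.shape, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]    # frames t + n
        behind = padded[width - n : width - n + n_frames]   # frames t - n
        numerator += n * (ahead - behind)
    return numerator / (2.0 * sum(n * n for n in range(1, width + 1)))

# Check on features that change linearly in time, with slopes 3 and -1 per frame
feats = np.outer(np.arange(10.0), [3.0, -1.0])
d = delta(feats)
print(d[2:-2])
```

Away from the padded edges, the estimate recovers the true slopes exactly for linearly changing features.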

Cepstral Mean Normalization (CMN)

the average value of each cepstral dimension over an entire segment or utterance is subtracted from that dimension at every time step
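In code this is a one-line operation. The sketch below uses random numbers as a stand-in for real cepstra and adds a hypothetical constant offset to every frame, the kind of term a fixed channel filter contributes in the cepstral domain, to show that the normalization removes it.

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """Subtract each dimension's mean over the segment.

    cepstra: array of shape (n_frames, n_coeffs) for one segment or utterance.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
cepstra = rng.standard_normal((50, 13))           # stand-in for real cepstral features
channel_offset = rng.standard_normal(13)          # hypothetical fixed per-dimension offset

normalized = cepstral_mean_normalize(cepstra)
normalized_with_offset = cepstral_mean_normalize(cepstra + channel_offset)

print(np.allclose(normalized.mean(axis=0), 0.0))
print(np.allclose(normalized, normalized_with_offset))
```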

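Some other good resources:
zouxy09's column
小腹黑zju's blog
What is the relationship between the Fourier series and the Fourier transform?
How should the Fourier transform formula be understood?
CMU Speech Processing
University of Edinburgh Automatic Speech Recognition
Columbia University Speech and Audio Processing and Recognition
TAMU Speech processing
MIT Linguistic Phonetics
Spectrograms and filter banks (Filter banks, MFCC), which covers exactly how MFCCs are computed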
