[Information Technology] [2009.11] Automatic Emotion Recognition: An Investigation of Acoustic and Prosodic Parameters

This is a PhD thesis from the University of New South Wales, Australia (author: Vidhyasaharan Sethu), 186 pages in total.

An essential step to achieving human-machine speech communication with the naturalness of communication between humans is developing a machine that is capable of recognising emotions based on speech. This thesis presents research addressing this problem by making use of acoustic and prosodic information. At a feature level, novel group delay and weighted frequency features are proposed. The group delay features are shown to emphasise information pertaining to formant bandwidths and are shown to be indicative of emotions. The weighted frequency feature, based on the recently introduced empirical mode decomposition, is proposed as a compact representation of the spectral energy distribution and is shown to outperform other estimates of energy distribution. Feature level comparisons suggest that detailed spectral measures are very indicative of emotions while exhibiting greater speaker specificity. Moreover, it is shown that all features are characteristic of the speaker and require some sort of normalisation prior to use in a multi-speaker situation. A novel technique for normalising speaker-specific variability in features is proposed, which leads to significant improvements in the performance of systems trained and tested on data from different speakers. This technique is also used to investigate the amount of speaker-specific variability in different features. A preliminary study of phonetic variability suggests that phoneme-specific traits are not modelled by the emotion models and that speaker variability is a more significant problem in the investigated setup. Finally, a novel approach to emotion modelling that takes into account temporal variations of speech parameters is analysed. An explicit model of the glottal spectrum is incorporated into the framework of the traditional source-filter model, and the parameters of this combined model are used to characterise speech signals. An automatic emotion recognition system that takes into account the shape of the contours of these parameters as they vary with time is shown to outperform a system that models only the parameter distributions. The novel approach is also empirically shown to be on par with human emotion classification performance.
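
The abstract notes that features carry speaker-specific variability and need some form of normalisation before use across speakers. The abstract does not describe the thesis's own normalisation technique, so the sketch below is not that method. It substitutes plain per-speaker z-scoring as a stand-in, only to illustrate the general idea of removing each speaker's baseline. The function name `normalise_per_speaker`, the speaker labels, and the pitch values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalise_per_speaker(samples):
    """Z-score each feature value against its own speaker's statistics.

    `samples` is a list of (speaker_id, feature_value) pairs.
    Returns the normalised values in the same order as the input.
    NOTE: a generic stand-in, not the normalisation proposed in the thesis.
    """
    # Group the raw feature values by speaker.
    by_speaker = defaultdict(list)
    for speaker, value in samples:
        by_speaker[speaker].append(value)

    # Per-speaker mean and population standard deviation.
    stats = {}
    for speaker, values in by_speaker.items():
        mu = mean(values)
        sigma = pstdev(values)
        # Guard against a constant feature (zero spread) for a speaker.
        stats[speaker] = (mu, sigma if sigma > 0 else 1.0)

    return [(value - stats[s][0]) / stats[s][1] for s, value in samples]


# Hypothetical mean-pitch values (Hz) for two speakers with different baselines.
pitch = [("spk_a", 110.0), ("spk_a", 130.0), ("spk_b", 210.0), ("spk_b", 250.0)]
normalised = normalise_per_speaker(pitch)
```

After normalisation, the two hypothetical speakers' values fall on the same scale even though their raw pitch ranges differ, which is the kind of mismatch between training and test speakers that the abstract identifies as a problem.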