Human-Like Machine Hearing With AI (1/3)--Applying neural networks in real-time audio signal processing


Author: Daniel Rothmann

Original article: link

Significant breakthroughs in AI technology have been achieved through modeling human systems. While artificial neural networks (NNs) are mathematical models which are only loosely coupled with the way actual human neurons function, their application in solving complex and ambiguous real-world problems has been profound. Additionally, modeling the architectural depth of the brain in NNs has opened up broad possibilities in learning more meaningful representations of data.

If you’ve missed out on the other articles, click below to get up to speed:

Background: The promise of AI in audio processing
Criticism: What’s wrong with CNNs and spectrograms for audio processing?
Part 2: Human-Like Machine Hearing With AI (2/3)

In image recognition and processing, CNNs' inspiration from the complex, more spatially invariant cells of the visual system has likewise produced great improvements in our technologies. If you're interested in applying image recognition technologies to audio spectrograms, check out my article "What's wrong with CNNs and spectrograms for audio processing?".

As long as human perceptual capacity exceeds that of machines, we stand to gain by understanding the principles of human systems. Humans are very skillful when it comes to perceptual tasks and the contrast between human understanding and the status quo of AI becomes particularly apparent in the area of machine hearing. Considering the benefits reaped from getting inspired by human systems in visual processing, I propose that we stand to gain from a similar process in machine hearing with neural networks.

                   An overview of the framework which will be proposed during this article series.

In this article series, I will detail a framework for real-time audio signal processing with AI which was developed in cooperation with Aarhus University and intelligent loudspeaker manufacturer Dynaudio A/S. Its inspiration is primarily drawn from cognitive science which attempts to combine perspectives of biology, neuroscience, psychology and philosophy to gain greater understanding of our cognitive faculties.

Cognitive Sound Properties

Perhaps the most abstract domain of sound is how we, as humans, perceive it. While a solution for a signal processing problem has to operate within the parameters of intensity, spectral and temporal properties on a low level, the end goal is most often a cognitive one: Transforming a signal in such a way that our perceptions of the sounds it contains are altered.

If one wishes, for example, to programmatically change the gender of a recorded spoken voice, it is necessary to describe this problem in more meaningful terms before defining its lower-level characteristics. The gender of a speaker can be conceived as a cognitive property which is constructed from many factors: general pitch and timbre of a voice, differences in pronunciation, differences in choice of words and language, and a common understanding of how these properties relate to gender.

These parameters can be described in lower level features like intensity, spectral and temporal properties but only in more complex combinations do they form high-level representations. This forms a hierarchy of audio features from which the “meaning” of a sound can be derived. The cognitive property representing a human voice can be thought of as a combinatory pattern of temporal developments in a sound’s intensity, spectral and statistical properties.


                    A hierarchy of features that can be used to derive meaning from digital audio.

NNs are great at extracting abstracted representations of data and are therefore well suited for the task of detecting cognitive properties in sound. In order to build a system for this purpose, let’s examine how sound is represented in human auditory organs that we can use to inspire representation of sound for processing with NNs.

Cochlear Representation

Hearing in humans starts at the outer ear which firstly consists of the pinna. The pinna acts as a form of spectral preprocessing in which the incoming sound is modified depending on its direction in relation to the listener. Sound then travels through the opening in the pinna into the ear canal which further acts to modify spectral properties of incoming sound by resonating in a way that amplifies frequencies in the range ~1–6 kHz [1].

                                   An illustration of the human auditory system.

As sound waves reach the end of the ear canal, they excite the eardrum, to which the ossicles (the smallest bones in the body) are attached. These bones transmit the pressure from the ear canal to the fluid-filled cochlea in the inner ear [1]. The cochlea is of great interest in guiding sound representation for NNs because this is the organ responsible for transducing acoustic vibrations into neural activity in humans.

The cochlea is a coiled tube which is divided along its length by two membranes: Reissner's membrane and the basilar membrane. Along the length of the cochlea runs a row of around 3,500 inner hair cells [1]. As pressure waves enter the cochlea, these membranes are pushed down. The basilar membrane is narrow and stiff at its base but loose and wide at its apex, so each place along its length responds most intensely to a particular frequency.

To simplify, the basilar membrane can be thought of as a continuous array of bandpass filters which, along the length of the membrane, acts to separate sounds into their spectral components.

                                     An illustration of the human cochlea.

This is the primary mechanism by which humans convert sound pressures into neural activity. Therefore, it is reasonable to assume that spectral representations of audio would be beneficial in modeling sound perception with AI. Since frequency responses along the basilar membrane vary exponentially [2], logarithmic frequency representations might prove most efficient. One such representation can be derived using a gammatone filterbank. These filters are commonly applied in modeling spectral filtering in the auditory system, since they approximate the impulse response of human auditory filters as derived from the measured auditory nerve fiber response to white-noise stimuli, known as the "revcor" function [3].

                               A simplified comparison between human spectral transduction and a digital counterpart.

Since the cochlea has ~3,500 inner hair cells and humans can detect gaps in sounds down to ~2–5 ms in length [1], a spectral resolution of 3,500 gammatone filters analyzed in 2 ms windows would seem an optimal set of parameters for achieving human-like spectral representation in machines. In practical scenarios, however, I assume that lower resolutions could still achieve desirable effects in most analysis and processing tasks while being more viable from a computational standpoint.
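As a rough back-of-the-envelope check (the numbers are simply those quoted above, not a prescription), the raw data rate of such a "human-resolution" front end would be:

```python
# Rough data rate of a human-resolution spectral front end:
# 3,500 gammatone channels, updated once per 2 ms window.
channels = 3500
windows_per_second = 1 / 0.002                 # 500 analysis windows per second
values_per_second = channels * windows_per_second
print(f"{values_per_second:,.0f} values per second")   # 1,750,000 values per second
```

At 1.75 million values per second before any further processing, it is easy to see why a reduced channel count is attractive in practice.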

A number of software libraries for auditory analysis are available online. A notable example is the Gammatone Filterbank Toolkit by Jason Heeris. It provides adjustable filters as well as tools for spectrogram-like analysis of audio signals with gammatone filters.
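As a minimal sketch of how such an analysis could look, the snippet below assumes the toolkit's gtgram module (https://github.com/detly/gammatone); the file name "voice.wav", the channel count and the window settings are illustrative assumptions rather than recommendations from this article:

```python
# Sketch: gammatone-based spectral frames, assuming the gtgram interface of
# Jason Heeris' Gammatone Filterbank Toolkit.
import numpy as np
from scipy.io import wavfile
from gammatone import gtgram

fs, wave = wavfile.read("voice.wav")   # hypothetical mono input file
wave = wave.astype(np.float32)

window_time = 0.002   # 2 ms analysis windows, as discussed above
hop_time = 0.002      # non-overlapping hops
channels = 256        # far fewer than 3,500 filters, for computational viability
f_min = 20            # lowest filter centre frequency in Hz

# Result: a (channels x frames) matrix of band energies on an ERB-spaced scale,
# a rough digital analogue of basilar-membrane excitation over time.
spectral_frames = gtgram.gtgram(wave, fs, window_time, hop_time, channels, f_min)
print(spectral_frames.shape)
```

Increasing the channel count toward 3,500 while keeping 2 ms windows would approach the human-like resolution discussed above, at a correspondingly higher computational cost.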

Neural Coding

As neural activity moves from the cochlea onto the auditory nerve and the ascending auditory pathways, a number of processes are applied in brainstem nuclei before it reaches the auditory cortex.

These processes form a neural code which represents an interface between stimulus and perception [4]. Much knowledge about the specific inner workings of these nuclei is still speculative or unknown, so I will detail these nuclei only at their higher levels of functioning.

                   A simplified illustration of the ascending auditory pathway (for one ear) and its assumed functions.

Humans have a set of these nuclei for each ear that are interconnected, but for simplicity, I've illustrated the flow for only one ear. The cochlear nucleus is the first coding step for neural signals coming from the auditory nerve. It consists of a variety of neurons with different properties which perform initial processing of sound features. Some of these are directed to the superior olive, which is associated with sound localization, while others are directed to the lateral lemniscus and inferior colliculus, commonly associated with more advanced features [1].

J. J. Eggermont details this flow of information from the cochlear nucleus in “Between sound and perception: reviewing the search for a neural code” as follows: “The ventral [cochlear nucleus] (VCN) extracts and enhances the frequency and timing information that is multiplexed in the firing patterns of the [auditory nerve] fibers, and distributes the results via two main pathways: the sound localization path and the sound identification path. The anterior part of the VCN (AVCN) mainly serves the sound localization aspects and its two types of bushy cells provide input to the superior olivary complex (SOC), where interaural time differences (ITDs) and level differences (ILDs) are mapped for each frequency separately” [4].

The information carried by the sound identification pathway is a representation of complex spectra such as vowels. This representation is mainly created in the ventral cochlear nucleus by special types of units dubbed "chopper" (stellate) neurons [4]. The details of these auditory encodings are difficult to specify, but they indicate that a form of "coding" of incoming frequency spectra could improve understanding of low-level sound features as well as make sound impressions less expensive to process in NNs.

Spectral Sound Embeddings

We can apply the unsupervised autoencoder NN architecture in an attempt to learn common properties associated with complex spectra. As with word embeddings, it's possible to find commonalities in frequency spectra that represent select features (or a more tightly condensed meaning) of sounds.

An autoencoder is trained to encode an input into a compressed representation that can be reconstructed back into a representation with a high similarity to the input. This means that the autoencoder’s target output is the input itself [5]. If an input can be reconstructed without great loss, the network has learnt to encode it in such a way that the compressed internal representation contains enough meaningful information. This internal representation is then what we refer to as the embedding. The encoding part of the autoencoder can be decoupled from the decoder to generate embeddings for other applications.

                         A simplified illustration of an autoencoder architecture for spectral sound embeddings.

Embeddings also have the benefit that they are often of lower dimensionality than the original data. For instance, an autoencoder could compress a frequency spectrum with a total of 3,500 values into a vector of 500 values. Put simply, each value of such a vector could describe higher-level factors of a spectrum such as vowel, harshness or harmonicity. These are only examples, as the meaning of statistically common factors derived by an autoencoder is often difficult to label in plain language.
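As a minimal sketch of such a spectral autoencoder (written in PyTorch, which is an assumption; the article does not prescribe a framework), the model below compresses 3,500-value spectra into 500-value embeddings and trains against its own input:

```python
# Sketch: a spectral autoencoder that maps 3,500-value spectra to 500-value
# embeddings and back, trained to reconstruct its input.
import torch
import torch.nn as nn

class SpectralAutoencoder(nn.Module):
    def __init__(self, spectrum_size=3500, embedding_size=500):
        super().__init__()
        # Encoder: spectrum -> compressed embedding
        self.encoder = nn.Sequential(
            nn.Linear(spectrum_size, 1500), nn.ReLU(),
            nn.Linear(1500, embedding_size),
        )
        # Decoder: embedding -> reconstructed spectrum
        self.decoder = nn.Sequential(
            nn.Linear(embedding_size, 1500), nn.ReLU(),
            nn.Linear(1500, spectrum_size),
        )

    def forward(self, x):
        embedding = self.encoder(x)
        return self.decoder(embedding), embedding

model = SpectralAutoencoder()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

spectra = torch.rand(64, 3500)      # placeholder batch of spectral frames
for _ in range(10):                 # a few illustrative training steps
    reconstruction, embedding = model(spectra)
    loss = loss_fn(reconstruction, spectra)   # target output is the input itself
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

Once trained, `model.encoder` can be detached from the decoder and reused on its own to generate embeddings for other applications, as described above.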

In the next article, we will expand upon this idea with added memory to produce embeddings for temporal developments of audio frequency spectra.


This wraps up the first part of my article series on audio processing with artificial intelligence. Next, we will discuss the essential concepts of sensory memory and temporal dependencies in sound.

Follow to stay updated and feel free to leave claps if you enjoyed the article!

To stay in touch, please feel free to connect with me on LinkedIn!

References

[1] C. J. Plack, The Sense of Hearing, 2nd ed. Psychology Press, 2014.

[2] S. J. Elliott and C. A. Shera, “The cochlea as a smart structure,” Smart Mater. Struct., vol. 21, no. 6, p. 64001, Jun. 2012.

[3] A. M. Darling, "Properties and implementation of the gammatone filter: A tutorial," Speech, Hearing and Language, University College London, 1991.

[4] J. J. Eggermont, "Between sound and perception: reviewing the search for a neural code," Hear. Res., vol. 157, no. 1–2, pp. 1–42, Jul. 2001.

[5] Y. Bengio, "Learning Deep Architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
