Human-Like Machine Hearing With AI (3/3) – Results and Perspectives

2023-11-30 05:58


Article source:

Author :Daniel Rothmann

Original article: link

 


This is the last part of my article series on “Human-Like” Machine Hearing: Modeling parts of human hearing to do audio signal processing with AI.

This last part of the series will provide:

  • A concluding summary of the key ideas.
  • Results from empirical testing.
  • Related work and future perspectives.

If you’ve missed out on the previous articles, here they are:

Background: The promise of AI in audio processing
Criticism: What’s wrong with CNNs and spectrograms for audio processing?
Part 1: Human-Like Machine Hearing With AI (1/3)
Part 2: Human-Like Machine Hearing With AI (2/3)

Summary.

Understanding and processing information at an abstract level is not an easy task. Artificial neural networks have moved mountains in this area, especially in computer vision: deep 2D CNNs have been shown to capture a hierarchy of visual features that grows in complexity with each layer of the network. The convolutional neural network was inspired by the Neocognitron, which in turn was inspired by the human visual system.

Attempts have been made at reapplying techniques such as style transfer in the audio domain, but the results are rarely convincing. Visual methods don’t seem to transfer well to sound.

I have argued that sound is a different beast altogether, something to keep in mind when doing feature extraction and designing deep learning architectures. Sounds behave differently. Just as computer vision benefited from modeling the visual system, we can benefit from considering human hearing when working with sound in neural networks.

[Image: Photo credit: Steve Harvey]

Sound representation.

To start exploring a modeling approach, we can establish a human baseline:

To the brain, sounds are spectrally represented. Pressure waves are processed by the cochlea and divided into ~3500 logarithmically spaced frequency bands in the range ~20-20,000 Hz.

Sounds are heard at a temporal resolution of 2–5 ms. Sounds (or gaps in sounds) shorter than this are nearly imperceptible to humans.

Based on this information, I recommend using a Gammatone filterbank instead of the Fourier transform. Gammatone filterbanks are a common tool in auditory modeling, and filterbanks in general allow spectral and temporal resolution to be decoupled. This way, you can have many spectral bands and a short analysis window.
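Below is a minimal NumPy/SciPy sketch of such a front end, assuming Glasberg & Moore’s ERB scale and Slaney-style center-frequency spacing (approximately logarithmic). The 100-filter default follows the Results section later in this article, but the frequency range and the crude peak normalization are illustrative choices, not the exact parameters used in the original experiments.

```python
import numpy as np
from scipy.signal import fftconvolve

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg & Moore) at frequency f in Hz."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def erb_space(low, high, n):
    """n center frequencies spaced uniformly on the ERB-rate scale between low and high Hz."""
    ear_q, min_bw = 9.26449, 24.7
    freqs = -(ear_q * min_bw) + np.exp(
        np.arange(1, n + 1) * (np.log(low + ear_q * min_bw) - np.log(high + ear_q * min_bw)) / n
    ) * (high + ear_q * min_bw)
    return np.sort(freqs)  # ascending order

def gammatone_ir(fc, fs, duration=0.064, order=4):
    """Impulse response of a 4th-order gammatone filter centered at fc, peak-normalized."""
    t = np.arange(0, duration, 1.0 / fs)
    b = 1.019 * erb(fc)  # bandwidth parameter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def gammatone_filterbank(signal, fs, n_filters=100, f_low=20.0, f_high=20000.0):
    """Filter a mono signal into n_filters band-limited signals, shape (n_filters, n_samples)."""
    fcs = erb_space(f_low, min(f_high, fs / 2.0), n_filters)
    bands = np.stack([fftconvolve(signal, gammatone_ir(fc, fs), mode="same") for fc in fcs])
    return bands, fcs
```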

Memory and buffers.

Humans are thought to have a memory for storing sensory impressions in the short term so that they can be compared and integrated. Although experimental results differ slightly, they have shown that humans have ~0.25–4 seconds of echoic memory (sensory memory dedicated to sound).

Any sound that can be understood by a human being can be represented within these limits!

That amounts to a couple of seconds’ worth of 2–5 ms windows, each with ~3500 logarithmically spaced frequency bands. It does add up to a lot of data.

[Image: Two dilated buffers covering ~1.25 s of sound with 8 time steps.]

To reduce the dimensionality a bit, I proposed the idea of dilated buffers, where the temporal resolution of a time series is reduced by an increasing factor for older timesteps. This way, a larger time context can be covered.
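As a concrete illustration, here is one way a dilated buffer could be built from a sequence of 10 ms spectral frames. The doubling-lag scheme below is an assumption on my part; it happens to match the figure above (8 steps reaching back ~1.27 s), but the original buffers may, for instance, average over the dilated spans rather than subsample them.

```python
import numpy as np

def dilated_buffer(frames, n_steps=8, base=2):
    """
    Pick n_steps rows from a (time, bands) array at exponentially growing lags.

    The newest frame is kept at full resolution; each older step looks back roughly
    twice as far as the previous one, so 8 steps of 10 ms frames reach back ~1.27 s
    instead of the 80 ms a plain FIFO buffer would cover.
    """
    lags = base ** np.arange(n_steps) - 1      # 0, 1, 3, 7, 15, 31, 63, 127
    lags = np.minimum(lags, len(frames) - 1)   # guard against short inputs
    picked = frames[len(frames) - 1 - lags]    # newest first
    return picked[::-1]                        # oldest -> newest, ready for an RNN
```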

[Image: Photo credit: Alireza Attari]

Listener-processor architecture.

We can conceptualize the inner ear as a spectral feature extractor and the auditory cortex as an analytical processor, deriving “cognitive properties” from memories of auditory impressions.

In between, there is a set of steps that is often forgotten about: the cochlear nuclei. There’s a lot we don’t know about them, but they perform a sort of initial neural coding of sound, encoding basic features essential to localization and sound identification.

This led me to explore a listener-processor model, where sound buffers are embedded into a low-dimensional space by a general-purpose LSTM autoencoder (a “listener”) before being passed to a task-specific neural network (a “processor”). That essentially makes the autoencoder a reusable preprocessing step for doing task-specific analytical work on a sound.

[Image: A listener-processor architecture to do sound classification.]

Results.

I like to think this article series has presented a couple of fresh ideas for working with sound in neural networks. Using these principles, I built a model to do environmental sound classification using the UrbanSound8K dataset.

Due to limited computing resources, I settled for a humble representation:

  • 100 Gammatone filters.
  • 10 ms analysis windows.
  • 8-step dilated buffers covering ~1.25 seconds of sound.

To train the listener, I fed thousands of dilated buffers to an autoencoder with 2 LSTM layers on each side, encoding the 800-dimensional sequential input (8 time steps × 100 bands) into a latent space with 250 “static” dimensions.

[Image: An illustration of the LSTM autoencoder architecture.]
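A sketch of how such a sequence-to-sequence autoencoder might look in tf.keras is shown below. The 8-step, 100-band input and the 250-dimensional latent space follow the description above; the hidden layer width of 512, the optimizer, and the loss are assumptions on my part.

```python
from tensorflow.keras import layers, models

N_STEPS, N_BANDS, LATENT = 8, 100, 250  # 8 dilated time steps x 100 gammatone bands -> 250-d embedding

def build_listener(hidden=512):
    """LSTM autoencoder over dilated buffers; the trained encoder half is the reusable 'listener'."""
    inputs = layers.Input(shape=(N_STEPS, N_BANDS))
    # Encoder: two stacked LSTMs, the second collapses the sequence to a static embedding.
    x = layers.LSTM(hidden, return_sequences=True)(inputs)
    embedding = layers.LSTM(LATENT, name="embedding")(x)
    # Decoder: repeat the embedding once per time step and reconstruct the input buffer.
    x = layers.RepeatVector(N_STEPS)(embedding)
    x = layers.LSTM(hidden, return_sequences=True)(x)
    x = layers.LSTM(hidden, return_sequences=True)(x)
    outputs = layers.TimeDistributed(layers.Dense(N_BANDS))(x)

    autoencoder = models.Model(inputs, outputs)
    listener = models.Model(inputs, embedding)  # shares weights with the trained autoencoder
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, listener
```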

After training for ~50 epochs, the autoencoder was able to capture the coarse structure of most input buffers. Being able to produce an embedding that captures the complex sequential movements of frequencies in sounds is very interesting!

But is it useful?

This is the question to ask. To test this, I trained a 5-layer self-normalizing neural network to predict the sound class (UrbanSound8K defines 10 possible classes) using embeddings from the LSTM encoder.
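For reference, a 5-layer self-normalizing network of this kind could be set up as follows, using SELU activations with lecun_normal initialization and alpha dropout as in Klambauer et al.’s self-normalizing networks. The layer width and dropout rate here are my own guesses rather than the values used in the original experiment.

```python
from tensorflow.keras import layers, models

def build_processor(n_classes=10, latent_dim=250, width=256, dropout=0.05):
    """Self-normalizing classifier: 4 SELU hidden layers + softmax output over the 10 classes."""
    inputs = layers.Input(shape=(latent_dim,))
    x = inputs
    for _ in range(4):
        x = layers.Dense(width, activation="selu", kernel_initializer="lecun_normal")(x)
        x = layers.AlphaDropout(dropout)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    processor = models.Model(inputs, outputs)
    processor.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return processor
```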

After 50 epochs of training, this network predicted sound class with an accuracy of ~70%.

In 2018, Z. Zhang et al. achieved state-of-the-art results on the UrbanSound8K dataset with a 77.4% prediction accuracy. They achieved this by applying 1D convolutions to Gammatone spectrograms. With data augmentation, that accuracy can be pushed higher yet. I did not have the resources to explore data augmentation, so I compare against the non-augmented version of their system.

In comparison, my approach was 7.4 percentage points less accurate. However, my technique works on 10 ms time frames (usually with some memory attached), meaning that the 70% accuracy covers a 10 ms moment at any given point in a sound from the dataset. This reduces the latency of the system by a factor of 300, making it well suited for real-time processing. Put simply, this approach was less accurate but introduced significantly less delay when processing.
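To tie the pieces together, here is a hypothetical end-to-end inference pass built from the sketches above. The names gammatone_filterbank, dilated_buffer, build_listener, and build_processor are the illustrative functions introduced earlier, the 10 ms RMS framing is an assumed summarization step, and with untrained models the output is of course meaningless.

```python
import numpy as np

fs = 22050
signal = np.random.randn(2 * fs)  # stand-in for a loaded mono audio clip

# 1. Spectral front end: 100-band gammatone filterbank, summarized as 10 ms RMS frames.
bands, _ = gammatone_filterbank(signal, fs, n_filters=100)                   # (100, n_samples)
frame_len = int(0.010 * fs)
n_frames = bands.shape[1] // frame_len
frames = np.sqrt((bands[:, :n_frames * frame_len]
                  .reshape(100, n_frames, frame_len) ** 2).mean(axis=2)).T   # (n_frames, 100)

# 2. Memory: one dilated buffer ending at the newest frame.
buffer = dilated_buffer(frames, n_steps=8)                                   # (8, 100)

# 3. Listener-processor: embed the buffer, then classify the embedding.
#    (Both models are assumed to have been trained beforehand.)
autoencoder, listener = build_listener()
processor = build_processor()
embedding = listener.predict(buffer[None, ...])                              # (1, 250)
class_probs = processor.predict(embedding)                                   # (1, 10)
print("Predicted class index:", int(np.argmax(class_probs)))
```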

 

[Image: A “happy accident” I encountered during initial experiments.]

Perspectives.

Working through this project has been very interesting indeed. I hope to have supplied you with some ideas for how you can work with sound in neural networks. Though I am happy with my initial results, I believe they can be significantly improved given a finer spectral resolution and more computational resources for training the neural networks.

I hope that someone will pick this up and experiment with a listener-processor approach on new problems. In particular, I am curious to try a variational autoencoder approach to see what happens to reconstructed sounds when their latent representations are adjusted. Maybe this can reveal some intuitions about the basic statistical features of sound itself.

Related work.

If you’re interested in sound representation using autoencoders, here are some projects that have inspired me and that I recommend looking into:

Audio Word2Vec

These folks worked on a similar approach for encoding speech using MFCCs and sequence-to-sequence autoencoders. They found that the phonetic structures of speech can be adequately represented this way.

A Universal Musical Translation Network

This is very impressive and the closest thing to style transfer in sound I have seen yet. It came out of Facebook AI Research last year. Using a shared WaveNet encoder, they compressed a number of raw-sample musical sequences into a latent space, then decoded them with separate WaveNet decoders for each desired output “style”.

Modeling Non-Linear Audio Effects with Neural Networks

Marco Martinez and Joshua Reiss successfully modeled non-linear audio effects (like distortion) with neural networks. They achieved this by using 1D convolutions to encode raw-sample sequences, transforming these encodings (!) with a deep neural network, and then resynthesizing them back to raw samples with deconvolution.

...

Dear reader, thanks so much for coming on this journey with me. I feel privileged by the amount of positive, critical and informative responses I’ve received to this article series.

Having been swamped with work, I took a long time to wrap up this final article. Now that it’s done, I am looking forward to the next chapters, new projects and ideas to explore.

I hope you’ve enjoyed it! If you would like to get in touch, please feel free to connect with me here and on LinkedIn.
