Simulating group conversations with talking heads


This work was presented at the 184th Meeting of the Acoustical Society of America, May 2023, in Chicago, Illinois.

This project is part of the larger Mechatronic Acoustic Research System, a tool for roboticized, automatic audio data collection.

In group conversations, a listener will hear speech coming from many directions of arrival. Human listeners can discern where a particular sound is coming from based on the difference in volume and timing of sound at their left and right ears: these are referred to in the literature as interaural level and time differences.
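
As a concrete illustration, the short Python sketch below (not code from this project) estimates both cues from a two-channel recording; the sample rate, signal lengths, and synthetic test signal are assumptions made for the example.

```python
import numpy as np

fs = 16000  # sample rate (Hz), assumed for this example

def interaural_cues(left, right, fs):
    """Estimate the interaural time difference (s) and level difference (dB)."""
    # ITD: lag of the cross-correlation peak. A positive lag means the
    # left-ear signal arrives later than the right-ear signal.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)
    itd = lag / fs
    # ILD: energy ratio between the ears, in decibels
    # (negative means the sound is quieter at the left ear).
    ild = 10 * np.log10(np.sum(left**2) / np.sum(right**2))
    return itd, ild

# Synthetic check: a source on the listener's right arrives earlier
# and louder at the right ear.
rng = np.random.default_rng(0)
src = rng.standard_normal(4000)
right = src
left = 0.7 * np.roll(src, 8)             # ~0.5 ms later and ~3 dB quieter at the left ear
print(interaural_cues(left, right, fs))  # ITD of about +0.5 ms, ILD of about -3.1 dB
```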

Diagram of interaural effects

While the brain automatically performs this localization, computers must rely on algorithms. Developing algorithms that are sufficiently accurate, quick, and robust is the work of acoustical signal processing researchers. To do so, researchers need datasets of spatial audio that mimic what is sensed by the ears of a real listener.

Acoustic head simulators provide a solution for generating such datasets. These simulators are designed to have absorptive and structural properties similar to those of a real human head, and unlike real humans, they can be stationed in a lab 24/7 and actuated for precise, repeatable motion.

Head and torso simulators (HATS) from Bruel & Kjaer, an HBK company.

However, research-grade acoustic head simulators can be prohibitively expensive. To achieve high levels of realism, they use costly materials and actuators, which pushes typical prices into the tens of thousands of dollars. As such, very few labs have access to the multiple head simulators needed to simulate a group conversation.

We investigate the application of 3D printing technology to the fabrication of head simulators. In recent years, 3D printing has become a cheap and accessible means of producing highly complicated structures. This makes it uniquely suited to the complex geometry of the human ears and head, both of which significantly affect interaural level and time differences.

Exploded-view render of head simulators, produced by Zhihao Tang for TE401F in 2021

Prototype 3D printed ears, which affect the binaural cues

To allow for movement of each individual head, we also design a multi-axial turret that the head can lock onto. This lets the simulators nod and turn, mimicking natural gestures. Researchers can use this feature to evaluate the robustness and responsiveness of their algorithms to spatial perturbations.

3D printed head simulator mounted on a multiaxial turret for motion.

By designing a 3D printable, actuated head simulator, we aim to enable anyone to fabricate many such devices for their own research.

 

Bandwidth extension with air and body-conduction microphones for speech enhancement

This work will be presented at the 184th Meeting of the Acoustical Society of America, May 2023, in Chicago, Illinois.

Conventional microphones can be referred to as air-conduction mics (ACMs), because they capture sound that propagates through the air. ACMs can record wideband audio, but will capture sounds from undesired sources in noisy scenarios.

In contrast, bone-conduction mics (BCMs) are worn directly on a person to detect sounds propagating through the body. While this can isolate the wearer’s speech, it also severely degrades the quality. We can model this degradation as a low-pass filter.

To enhance a target talker’s speech, we can use the BCM for a noise-robust speech estimate and combine it with the ACM audio by applying a ratio mask in the time-frequency domain.
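
As a rough sketch of this idea (not the actual system from the paper), the example below low-pass filters clean speech to simulate a BCM signal, then builds a simple magnitude-ratio mask from the BCM spectrogram and applies it to the noisy ACM spectrogram. The STFT size, cutoff frequency, and mask formula are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft, butter, sosfilt

fs = 16000       # sample rate (Hz)
nperseg = 512    # STFT window length, chosen arbitrarily for the example

def ratio_mask_fusion(acm, bcm, eps=1e-8):
    """Fuse noisy ACM audio with a noise-robust BCM reference via a T-F ratio mask."""
    _, _, A = stft(acm, fs, nperseg=nperseg)
    _, _, B = stft(bcm, fs, nperseg=nperseg)
    # Crude magnitude-ratio mask: keep time-frequency bins where the BCM
    # reference carries energy, attenuate bins dominated by ambient noise.
    mask = np.minimum(np.abs(B) / (np.abs(A) + eps), 1.0)
    _, enhanced = istft(mask * A, fs, nperseg=nperseg)
    return enhanced

# Toy demo with synthetic stand-ins for real recordings.
rng = np.random.default_rng(0)
speech = rng.standard_normal(fs)                        # stand-in for the target talker
noise = rng.standard_normal(fs)                         # stand-in for background noise
sos = butter(4, 1000, btype="low", fs=fs, output="sos")
bcm = sosfilt(sos, speech)                              # BCM modeled as low-passed speech
acm = speech + noise                                    # ACM captures speech plus noise
enhanced = ratio_mask_fusion(acm, bcm)
```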

 

Factorization methods outperform parametric methods for BWE of female (left) and male (right) speech. The ensemble systems (solid) can significantly outperform baseline systems (striped)

However, the BCM only provides good estimates of the lower frequencies. Therefore, we need to estimate the missing upper frequencies. This task is called bandwidth extension (BWE), and can be solved in a variety of ways.

We found that ensemble factorization approaches can significantly outperform other low-compute BWE methods.

 

The ensemble factorization method uses two expert models for voiced and unvoiced speech segments

We provide listening examples of our proposed ensemble system. The audio data was generated from a simulated indoor, multi-talker scene.

Audio examples for female and male speech: the raw BCM and ACM recordings, the baseline system, and the proposed ensemble system.

Investigating sample bias towards languages in audio super-resolution

This work was presented at the 2023 Undergraduate Research Symposium, held by the University of Illinois Urbana-Champaign (Poster 61).

Speech audio sounds good when sampled at 16 kHz; however, legacy infrastructure and certain microphones can only capture 8 kHz audio. This can significantly reduce the perceived clarity and intelligibility of speech.

Deep learning provides a way to estimate the lost frequency components, thereby improving quality. This task is called audio super-resolution (or bandwidth extension).
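
As a rough sketch of how such models are commonly trained (not necessarily our exact pipeline), the example below generates a paired training example by decimating a 16 kHz target to 8 kHz and returning it to the 16 kHz grid, leaving the 4 to 8 kHz band for the model to reconstruct.

```python
import numpy as np
from scipy.signal import resample_poly

def make_training_pair(target_16k):
    """Return (model_input, target): the input lacks content above 4 kHz."""
    narrowband = resample_poly(target_16k, up=1, down=2)   # 16 kHz -> 8 kHz
    model_input = resample_poly(narrowband, up=2, down=1)  # back to the 16 kHz grid,
                                                           # upper band still missing
    n = min(len(model_input), len(target_16k))             # align sample-for-sample
    return model_input[:n], target_16k[:n]

# Example with a synthetic one-second "recording" at 16 kHz.
x = np.random.default_rng(0).standard_normal(16000)
model_input, target = make_training_pair(x)
```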

Typically, large datasets of clean audio are required to train such models. However, these datasets do not sufficiently represent all languages, and it is not always possible to train language-specific models for any given language. We investigate how a model trained only on high-quality English recordings generalizes to lower-quality recordings of unseen languages.

The specifics of our model are discussed in our poster at the Undergraduate Research Symposium. Here, we present some results and audio.

We find that our model generalizes well to certain languages, but not others. We provide example audio below. Languages are listed in order of the model's accuracy, with English the most accurate and Catalan the least.

Audio examples for English, Korean, Twi, German, Nepali, Esperanto, and Catalan: each includes the 16 kHz target, the 8 kHz input, and the 16 kHz model output.

We conjecture that the variation in performance is correlated with the linguistic similarity between English, the training language, and each inference language. We reserve this analysis for future work.

*Target audio samples are from the Common Voice corpus, which contains recordings of over 100 languages.

An Unofficial Port of Matrix HAL to Ubuntu 22.04 and Raspbian Bullseye

What is Matrix VOICE and Matrix HAL?

The Matrix VOICE is described on the Matrix website as a “development board for building sound driven behaviors and interfaces.” It is a nifty piece of hardware that features an 8-microphone array and has unique possibilities for beamforming and audio processing applications.

However, in February 2021, when Matrix Labs was bought out, development and support for the Matrix VOICE C++ library, the Matrix HAL, were silently withdrawn.

The latest release of Matrix HAL only works with a version of the Raspberry Pi OS called Raspbian Buster. That is fine for many applications, but our recent research has focused on integrating this device with ROS, a robotics framework primarily targeted at Ubuntu. While there were workarounds for using ROS on Raspbian Buster, they would not be without their own unique set of challenges.

We determined that it would be valuable to patch the Matrix HAL to work on Ubuntu 22.04, and over the summer we were able to accomplish this goal.

Description of the Port

There are some limitations to the port. Because our own acoustics research is the primary application, we only verified the functionality of the microphone array and the Everloop LED interface; there is currently no support for the humidity sensor, IMU, pressure sensor, or UV sensor.

Additionally, the repo has not been tested on the Matrix CREATOR, and there is no guarantee that the CREATOR will be compatible with this patch.

You can download the source from this GitHub repository. Please direct any support-related inquiries to gfw3@illinois.edu.

Enhancing Group Conversations with Smartphones and Hearing Devices

This post describes our paper “Adaptive Crosstalk Cancellation and Spatialization for Dynamic Group Conversation Enhancement Using Mobile and Wearable Devices,” presented at the International Workshop on Acoustic Signal Enhancement (IWAENC) in September 2022.

One of the most common complaints from people with hearing loss – and everyone else, really – is that it’s hard to hear in noisy places like restaurants. Group conversations are especially difficult since the listener needs to keep track of multiple people who sometimes interrupt or talk over each other. Conventional hearing aids and other listening devices don’t work well for noisy group conversations.

Our team at the Illinois Augmented Listening Laboratory is developing systems to help people hear better in group conversations by connecting hearing devices with other nearby devices. Previously, we showed how wireless remote microphone systems can be improved to support group conversations and how a microphone array can enhance talkers in the group while removing outside noise. But both of those approaches rely on specialized hardware, which isn’t always practical. What if we could build a system using devices that users already have with them?

We can connect together hearing devices and smartphones to enhance speech from group members and remove unwanted background noise.

In this work, we enhance a group conversation by connecting together the hearing devices and mobile phones of everyone in the group. Each user wears a pair of earpieces – which could be hearing aids, “hearables”, or wireless earbuds – and places their mobile phone on the table in front of them. The earpieces and phones all transmit audio data to each other, and we use adaptive signal processing to generate an individualized sound mixture for each user. We want each user to be able to hear every other user in the group, but not background noise from other people talking nearby. We also want to remove echoes of the user’s own voice, which can be distracting. And as always, we want to preserve spatial cues that help users tell which direction sound is coming from. Those spatial cues are especially important for group conversations where multiple people might talk at once.
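
As a toy illustration of the spatialization idea (not the algorithm from the paper), the sketch below applies an interaural delay and level difference to a mono feed from another user's device so that it seems to arrive from that talker's direction; the delay, gain, and signal names are assumptions made for the example.

```python
import numpy as np

fs = 16000  # sample rate (Hz), assumed for this example

def spatialize(mono, left_delay_s, left_gain_db):
    """Apply interaural cues: delay and gain on the left channel relative to the right."""
    d = int(round(left_delay_s * fs))            # non-negative, integer-sample delay for simplicity
    g = 10 ** (left_gain_db / 20)                # linear gain applied to the left ear
    right = np.concatenate([mono, np.zeros(d)])  # zero-pad so both channels match in length
    left = g * np.concatenate([np.zeros(d), mono])
    return left, right

# A talker to the listener's right: the left ear hears the voice
# about 0.4 ms later and 4 dB quieter than the right ear.
talker_feed = np.random.default_rng(0).standard_normal(fs)  # stand-in for a remote device signal
left, right = spatialize(talker_feed, 0.0004, -4.0)
```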


Turning the Music Down with Wireless Assistive Listening Systems

This post accompanies our presentation “Turn the music down! Repurposing assistive listening broadcast systems to remove nuisance sounds” from the Acoustical Society of America meeting in May 2022.

It is often difficult to hear over loud music in a bar or restaurant. What if we could remove the annoying music while hearing everything else? With the magic of adaptive signal processing, we can!

To do that, we’ll use a wireless assistive listening system (ALS). An ALS is usually used to enhance sound in theaters, places of worship, and other venues with sound systems. It transmits the sound coming over the speakers directly to the user’s hearing device, making it louder and cutting through noise and reverberation. Common types of ALS include infrared (IR) or frequency modulation (FM) transmitters, which work with dedicated headsets, and induction loops, which work with telecoils built into hearing devices.

We can instead use those same systems to cancel the sound at the ears while preserving everything else. We use an adaptive filter to predict the music as heard at the listener’s ears, then subtract it out. What’s left over is all the other sound in the room, including the correct spatial cues. The challenge is adapting as the listener moves.
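
As a simplified sketch of that core idea (not the system used in the demonstration), the example below runs a normalized LMS adaptive filter that predicts the broadcast reference as received at an ear microphone and subtracts the prediction; the filter length and step size are arbitrary, and a real system must also track the listener's motion.

```python
import numpy as np

def nlms_cancel(reference, ear_mic, filter_len=256, mu=0.5, eps=1e-6):
    """Subtract an adaptively filtered copy of the reference (music) from the ear signal."""
    w = np.zeros(filter_len)                    # estimated loudspeaker-to-ear response
    out = np.zeros(len(ear_mic))
    for n in range(filter_len, len(ear_mic)):
        x = reference[n - filter_len:n][::-1]   # most recent reference samples, newest first
        y_hat = w @ x                           # predicted music at the ear microphone
        e = ear_mic[n] - y_hat                  # residual: speech and other room sounds
        w += mu * e * x / (x @ x + eps)         # normalized LMS coefficient update
        out[n] = e
    return out

# Toy demo: the "room" delays and attenuates the music; the speech survives cancellation.
rng = np.random.default_rng(0)
music = rng.standard_normal(16000)
speech = 0.3 * rng.standard_normal(16000)
ear_mic = 0.6 * np.roll(music, 20) + speech     # stand-in for the ear-worn microphone signal
residual = nlms_cancel(music, ear_mic)          # residual approximates speech once converged
```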

The video below demonstrates the system using a high-end FM wireless system. The dummy head wears a set of microphones that simulate a hearing device; you’ll be hearing through its ears. The FM system broadcasts the same sound being played over the speakers. An adaptive filter cancels it so you can hear my voice but not the music.

Group Conversation Enhancement

This post accompanies two presentations titled “Immersive Conversation Enhancement Using Binaural Hearing Aids and External Microphone Arrays” and “Group Conversation Enhancement Using Wireless Microphones and the Tympan Open-Source Hearing Platform”, which were presented at the International Hearing Aid Research Conference (IHCON) in August 2022. The latter is part of a special session on open-source hearing tools.

Have you ever struggled to hear the people across from you in a crowded restaurant? Group conversations in noisy environments are among the most frustrating hearing challenges, especially for people with hearing loss, but conventional hearing devices don’t do much to help. They make everything louder, including the background noise. Our research group is developing new methods to make it easier to hear in loud noise. In this project, we focus on group conversations, where there are several users who all want to hear each other.

Conversation enhancement allows users within a group to hear each other while tuning out background noise.

A group conversation enhancement system should turn up the voices of users in the group while tuning out background noise, including speech from other people nearby. To do that, it needs to separate the speech of group members from that of non-members. It should handle multiple talkers at once, in case people interrupt or talk over each other. To help listeners keep track of fast-paced conversations, it should sound as immersive as possible. Specifically, it should have imperceptible delay and it should preserve spatial cues so that listeners can tell what sound is coming from what direction. And it has to do all that while all the users are constantly moving, such as turning to look at each other while talking.


Motion and Audio, with Robots

This post describes our paper “Mechatronic Generation of Datasets for Acoustics Research,” presented at the International Workshop on Acoustic Signal Enhancement (IWAENC) in September 2022.

Creating datasets is expensive, be it in terms of time or funding. This is especially true for spatial audio: some applications require that hundreds of recordings be taken from specific regions in a room, while others involve arranging many microphones and loudspeakers to mimic real-life scenarios – for instance, a conference. Few researchers have access to dedicated recording spaces that can accurately portray acoustically interesting environments, and fewer still are able to create dynamic scenes where microphones and speakers move precisely to replicate how people walk and talk.

To support the creation of these types of datasets, we propose the Mechatronic Acoustic Research System, or MARS for short. We envision MARS as a robot-enabled recording space that researchers would have remote access to. Users could emulate a wide variety of acoustic environments and take recordings with little effort. Our initial concept is a web-based design interface that can be used to specify a complicated experiment, which a robot system then automatically recreates.

Diagram of MARS

How the MARS frontend and backend link together


Immersive Remote Microphone System on the Tympan Platform

This post accompanies our presentation “Immersive multitalker remote microphone system” at the 181st Acoustical Society of America Meeting in Seattle.

In our previous post, which accompanied a paper at WASPAA 2021, we proposed an improved wireless microphone system for hearing aids and other listening devices. Unlike conventional remote microphones, the proposed system works with multiple talkers at once, and it uses earpiece microphones to preserve the spatial cues that humans use to localize and separate sound. In that paper, we successfully demonstrated the adaptive filtering system in an offline laboratory experiment.

To see if it would work in a real-time, real-world listening system, we participated in an Acoustical Society of America hackathon using the open-source Tympan platform. The Tympan is an Arduino-based hearing aid development kit. It comes with high-quality audio hardware, a built-in rechargeable battery, a user-friendly Android app, a memory card for recording, and a comprehensive modular software library. Using the Tympan, we were able to quickly demonstrate our adaptive binaural filtering system in real hardware.

The Tympan processor connects to a stereo wireless microphone system and binaural earbuds.


Improving remote microphones for group conversations

This post accompanies the paper “Adaptive Binaural Filtering for a Multiple-Talker Listening System Using Remote and On-Ear Microphones” presented at WASPAA 2021 (PDF).

Wireless assistive listening technology

Hearing aids and other listening devices can help people to hear better by amplifying quiet sounds. But amplification alone is not enough in loud environments like restaurants, where the sound from a conversation partner is buried in background noise, or when the talker is far away, like in a large classroom or a theater. To make sound easier to understand, we need to bring the sound source closer to the listener. While we often cannot physically move the talker, we can do the next best thing by placing a microphone on them.

A remote microphone transmits sound from a talker to the listener's hearing device.

Remote microphones make it easier to hear by transmitting sound directly from a talker to a listener. Conventional remote microphones only work with one talker at a time.

When a remote microphone is placed on or close to a talker, it captures speech with lower noise than the microphones built into hearing aid earpieces. The sound also has less reverberation since it does not bounce around the room before reaching the listener. In clinical studies, remote microphones have been shown to consistently improve speech understanding in noisy environments. In our interviews of hearing technology users, we found that people who use remote microphones love them – but with the exception of K-12 schools, where remote microphones are often legally required accommodations, very few people bother to use them.
