Investigating sample bias towards languages in audio super-resolution

This work was presented at the 2023 Undergraduate Research Symposium, held by the University of Illinois Urbana-Champaign (Poster 61)

Speech audio sounds good when sampled at 16 kHz; however, legacy infrastructure and certain microphones can only capture 8 kHz audio. This can significantly reduce the perceived clarity and intelligibility of speech.

Deep learning provides a way to estimate the lost frequency components, thereby improving quality. This task is called audio super-resolution (or bandwidth extension).

Typically, large datasets of clean audio are required to train such models. However, these datasets do not sufficiently represent all languages, and it is not always possible to train a language-specific model for every language. We investigate how a model trained only on high-quality English recordings generalizes to lower-quality recordings of unseen languages.
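To make the setup concrete, here is a minimal sketch of how the 8 kHz input can be simulated from a 16 kHz recording and handed to a trained bandwidth-extension model. The file name and the model interface are illustrative placeholders, not our actual pipeline.

```python
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

# Load a 16 kHz ground-truth recording (illustrative file name).
target, sr = sf.read("english_sample.wav")
assert sr == 16000

# Simulate an 8 kHz capture, then naively upsample back to 16 kHz.
low = resample_poly(target, up=1, down=2)
low_16k = resample_poly(low, up=2, down=1)

# A trained bandwidth-extension model would estimate the missing upper band:
# output = model.predict(low_16k)   # hypothetical model API
# Per-language accuracy can then be scored against `target`,
# e.g., with a log-spectral distance.
```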

The specifics of our model are discussed in our poster at the Undergraduate Research Symposium. Here, we present some results and audio.

We find that our model generalizes well to certain languages but not others. We provide example audio in the table below. Languages are listed by the model's attained accuracy, with English the most accurate and Catalan the least.

[Audio table: Target (16 kHz), Input (8 kHz), and Output (16 kHz) samples for English, Korean, Twi, German, Nepali, Esperanto, and Catalan]

We conjecture that the variance in performance is correlated with the linguistic similarity between English, the training language, and each inference language. We reserve this analysis for future work.

*Target audio samples are from the Common Voice corpus, which contains recordings of over 100 languages

Simulating group conversations with talking heads

Featured

This work was presented at the 184th Meeting of the Acoustical Society of America, May 2023, in Chicago, Illinois.

This project is part of the larger Mechatronic Acoustic Research System, a tool for roboticized, automatic audio data collection.

In group conversations, a listener will hear speech coming from many directions of arrival. Human listeners can discern where a particular sound is coming from based on the difference in volume and timing of sound at their left and right ears: these are referred to in the literature as interaural level and time differences.
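As a rough illustration of these cues, here is a minimal sketch of estimating the interaural level and time differences from a two-channel (left/right ear) recording. The file name is a placeholder, and real localization algorithms are considerably more sophisticated.

```python
import numpy as np
import soundfile as sf

# Load a binaural recording: shape (samples, 2), columns = left and right ears.
binaural, fs = sf.read("binaural_recording.wav")
left, right = binaural[:, 0], binaural[:, 1]

# Interaural level difference (ILD): ratio of energies between the ears, in dB.
ild_db = 10 * np.log10(np.sum(left**2) / np.sum(right**2))

# Interaural time difference (ITD): the lag that maximizes the cross-correlation
# between the two ears, converted to seconds.
corr = np.correlate(left, right, mode="full")
lag = np.argmax(corr) - (len(right) - 1)
itd = lag / fs

print(f"ILD: {ild_db:.1f} dB, ITD: {itd * 1e6:.0f} microseconds")
```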

Diagram of interaural effects

While the brain automatically performs this localization, computers must rely on algorithms. Developing algorithms that are sufficiently accurate, quick, and robust is the work of acoustical signal processing researchers. To do so, researchers need datasets of spatial audio that mimic what is sensed by the ears of a real listener.

Acoustic head simulators provide a solution for generating such datasets. These simulators are designed to have absorptive and structural properties similar to those of a real human head, and unlike real humans, they can be stationed in a lab 24/7 and actuated for precise, repeatable motion.

Head and torso simulators (HATS) from Bruel & Kjaer, an HBK company.

However, research-grade acoustic head simulators can be prohibitively expensive. Achieving high levels of realism requires expensive materials and actuators, which push typical prices into the tens of thousands of dollars. As a result, very few labs have access to the multiple head simulators needed to simulate group conversations.

We investigate the application of 3D printing technology to the fabrication of head simulators. In recent years, 3D printing has become a cheap and accessible means of producing highly complicated structures. This makes it uniquely suited to the complex geometry of the human ears and head, both of which significantly affect interaural level and time differences.

Exploded-view render of head simulators, produced by Zhihao Tang for TE401F in 2021

Prototype 3D printed ears, which affect the binaural cues

To allow each head to move individually, we also design a multi-axial turret that the head can lock onto. This lets the simulators nod and turn, mimicking natural gestures. Researchers can use this feature to evaluate the robustness and responsiveness of their algorithms to spatial perturbations.

3D printed head simulator mounted on a multiaxial turret for motion.

By designing a 3D printable, actuated head simulator, we aim to enable anyone to fabricate many such devices for their own research.

 

An Unofficial Port of Matrix HAL to Ubuntu 22.04 and Raspbian Bullseye

What is Matrix VOICE and Matrix HAL?

The Matrix VOICE is described on the Matrix website as a “development board for building sound driven behaviors and interfaces.” It is a nifty piece of hardware that features an 8-microphone array and opens unique possibilities for beamforming and audio processing applications.

However, in February 2021, when Matrix Labs was bought out, development and support for the Matrix VOICE C++ library, the Matrix HAL, were silently withdrawn.

The latest release of Matrix HAL only works with a version of the Raspberry Pi OS called Raspbian Buster, which is fine for many applications. Recently, however, our research has focused on integrating this device with ROS, a robotics framework meant for Ubuntu. While there are workarounds for running ROS on Raspbian Buster, they come with their own unique set of challenges.

We determined that it would be valuable to investigate patching the Matrix HAL to work on Ubuntu 22.04, and over the summer, we were able to accomplish this goal.

Description of the Port

There are some limitations to the port. With our own acoustic research as the primary application, we only verified the functionality of the microphone array and the Everloop LED interface; there is currently no support for the humidity sensor, IMU, pressure sensor, or UV sensor.

Additionally, the repo has not been tested on the Matrix CREATOR, and there is no guarantee that this patch will be compatible with that device.

You can download the source from this GitHub repository. Please direct any support-related inquiries to gfw3@illinois.edu.

Motion and Audio, with Robots

This post describes our paper “Mechatronic Generation of Datasets for Acoustics Research,” presented at the International Workshop on Acoustic Signal Enhancement (IWAENC) in September 2022.

Creating datasets is expensive, be it in terms of time or funding. This is especially true for spatial audio: Some applications require that hundreds of recordings are taken from specific regions in a room, while others involve arranging many microphones and loudspeakers to mimic real-life scenarios – for instance, a conference. Few researchers have access to dedicated recording spaces that can accurately portray acoustically-interesting environments, and fewer still are able to create dynamic scenes where microphones and speakers move precisely to replicate how people walk and talk.

To support the creation of these types of datasets, we propose the Mechatronic Acoustic Research System, or MARS for short. We envision MARS as a robot-enabled recording space that researchers would have remote access to. Users could emulate a wide variety of acoustic environments and take recordings with little effort. Our initial concept is a web-based design interface that can be used to specify a complicated experiment, which a robot system then automatically recreates.
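As a sketch of what such a specification might look like, the snippet below describes one movable talker and one wearable microphone array. The schema and field names are hypothetical; they are not the actual MARS interface.

```python
# Hypothetical experiment specification the web frontend might hand to the
# robotic backend. All field names are illustrative.
experiment = {
    "room": "acoustics_lab",
    "duration_s": 60,
    "sources": [
        {
            "type": "loudspeaker",
            "signal": "speech_01.wav",
            # Waypoints: the robot carries the loudspeaker across the room.
            "trajectory": [
                {"t": 0, "xyz": [1.0, 2.0, 1.5]},
                {"t": 30, "xyz": [2.5, 2.0, 1.5]},
            ],
        },
    ],
    "microphones": [
        {"type": "wearable_array", "channels": 16, "xyz": [3.0, 3.0, 1.6]},
    ],
}
```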

Diagram of MARS

How the MARS frontend and backend link together

Continue reading

Hearing aid algorithm adapted for COVID-19 ventilators

Audio signal processing would seem to have nothing to do with the COVID-19 pandemic. It turns out, however, that a low-complexity signal processing algorithm used in hearing aids can also be used to monitor breathing for patients on certain types of ventilator.

To address the shortage of emergency ventilators caused by the pandemic, this spring the Grainger College of Engineering launched the Illinois RapidVent project to design an emergency ventilator that could be rapidly and inexpensively produced. In little more than a week, the team built a functional pressure-cycled pneumatic ventilator, which is now being manufactured by Belkin.

The Illinois RapidVent is powered by pressurized gas and has no electronic components, making it easy to produce and to use. However, it lacks many of the monitoring features found in advanced commercial ventilators. Without an alarm to indicate malfunctions, clinicians must constantly watch patients to make sure that they are still breathing. More-advanced ventilators also display information about pressure, respiratory rate, and air volume that can inform care decisions.

The Illinois RapidAlarm adds monitoring features to pressure-cycled ventilators.

To complement the ventilator, a team of electrical engineers worked with medical experts to design a sensor and alarm system known as the Illinois RapidAlarm. The device attaches to a pressure-cycled ventilator, such as the Illinois RapidVent, and monitors the breathing cycle. The device includes a pressure sensor, a microcontroller, a buzzer, three buttons, and a display. It shows clinically useful metrics and sounds an audible alarm when the ventilator stops working. The hardware design, firmware code, and documentation are available online with open-source licenses. A paper describing how the system works is available on arXiv.
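The arXiv paper details the exact algorithm; as a rough illustration, the kind of low-complexity attack/release envelope tracking used in hearing-aid dynamic range compressors looks something like the sketch below. The coefficients and the pressure signal are placeholders, not the published design.

```python
import numpy as np

def track_envelope(pressure, attack=0.2, release=0.02):
    """One-pole peak tracker: rises quickly (attack), decays slowly (release)."""
    env = np.zeros_like(pressure, dtype=float)
    level = float(pressure[0])
    for n, p in enumerate(pressure):
        coeff = attack if p > level else release
        level = (1 - coeff) * level + coeff * p
        env[n] = level
    return env

# Breaths can then be counted from threshold crossings of the tracked envelope,
# and an alarm raised if no crossing occurs within a timeout.
```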

Continue reading

EchoXL

The EchoXL is a large-format Alexa-powered smart speaker developed as part of TEC’s Alexa program. Based on current market offerings, it would be the largest of its kind. Its form and features will be modeled after Amazon’s Echo speaker, to keep branding consistent and to exemplify a potential line expansion. While small Bluetooth speakers still hold the largest market segment in audio, the market for larger sound systems has been steadily increasing over the past few years (as evidenced by new products from LG, Samsung, Sony, and JBL). Currently, Amazon does not have any products in this category.

The speaker will be used as a public demonstration piece to exhibit the technology incorporated within current smart speakers, such as microphone arrays and internal room correction capabilities. The novelty of a scaled-up Echo speaker will also be useful for publicizing the group’s research.

Continue reading

Studio-Quality Recording Devices for Smart Home Data Collection

Alexa, Google Home, and Facebook smart devices are becoming more and more commonplace in the home. Although many individuals only use these smart devices to ask for the time or weather, they provide an important edge controller for the Internet of Things infrastructure.

Unknown to some consumers, Alexa and other smart devices contain multiple microphones. Alexa uses these microphones to determine the direction of the speaker and displays a light almost as if to “face” the user. This localization function is also very important for processing whatever is said after “Alexa” or “OK Google”.

In our research lab, this kind of localization is important, and we hope to learn more from individuals’ interactions with their home smart speakers. The final details of the experiments we hope to run are not yet concrete. However, we know that we will need our own Alexa-like device that can make studio-quality recordings across a number of different channels.

Continue reading

Sound Source Localization

Imagine you are at a noisy restaurant: you hear the clanging of dishes, the hearty laughs of the patrons around you, and the musical ambience, and you are struggling to hear your friend from across the table. Wouldn’t it be nice if the primary sound you heard came solely from your friend? That is the problem that sound source localization can help solve.

Sound source localization, as you might have guessed, is the process of identifying where a sound of interest is coming from so that it can be isolated or amplified. It is how your Amazon Echo Dot indicates who is speaking to it with the little light ring at the top. For Engineering Open House, we wanted to create a device that mimics that colorful ring in a fun, creative way. Instead of a light ring, we wanted to use a mannequin head that turns toward audience members when they speak to it.
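As a toy illustration of the idea (not the algorithm we actually implemented), the sketch below picks the loudest microphone in an evenly spaced ring and reports the corresponding azimuth, which a motor could then turn the head toward. The array size and motor interface are assumptions.

```python
import numpy as np

NUM_MICS = 8  # assumed evenly spaced ring of microphones
mic_azimuths_deg = np.arange(NUM_MICS) * 360.0 / NUM_MICS

def estimate_azimuth(block):
    """block: array of shape (samples, NUM_MICS) for one short chunk of audio."""
    rms = np.sqrt(np.mean(block**2, axis=0))
    return mic_azimuths_deg[int(np.argmax(rms))]

# In the real device, the estimate would drive the motor, e.g.:
# motor.rotate_to(estimate_azimuth(block))   # hypothetical motor API
```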

My colleague Manan and I designed “Alexander”, the spinning head that can detect speech. We knew our system had to contain a microphone array, a processor to run the localization system, and a motor to turn the mannequin head. Our choices for each component are as follows:

Continue reading

Capturing Data From a Wearable Microphone Array

Introduction

Constructing a microphone array is a challenge of its own, but how do we actually process the microphone array data to do things like filtering and beamforming? One solution is to store the data in off-chip memory for later processing. This approach is great for experimenting with different microphone arrays, since we can process the data offline and see which filter combinations work best on the data we collected. It also avoids having to change the hardware design whenever we want to change filter coefficients or the algorithm being implemented.

Overview of a basic microphone array system

Here’s a quick refresher on the DE1-SoC, the development board we use to process the microphone array data.

The main components we use in this project are the GPIO pins, the off-chip DDR3 memory, the HPS, and the Ethernet port. The microphone array connects to the GPIO port of the FPGA. The digital I2S data is interpreted on the FPGA by deserializing it into samples. The 1 GB of off-chip memory is where the samples are stored for later processing. The HPS, which runs Linux, can grab the data from memory and store it on the SD card. Connecting the Ethernet port to a computer lets us pull the data from the FPGA seamlessly using shell and Python scripts.

Currently, the system is set up to stream the samples from the microphone array to the output of the audio codec. The microphones on the left side are summed and output to the left channel, and the microphones on the right side are summed and output to the right channel. The microphone signals are not processed before being sent to the codec. Here is a block diagram of what the system looks like before we add a DMA interface.
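Once the samples are sitting on the SD card, the offline processing is ordinary array manipulation. The sketch below assumes the HPS dumps raw 16-bit interleaved samples to a file; the file name, sample format, and channel count are assumptions for illustration.

```python
import numpy as np

NUM_MICS = 16  # assumed channel count of the array

# Raw 16-bit interleaved samples dumped by the HPS (illustrative file name).
raw = np.fromfile("mic_capture.bin", dtype=np.int16)
frames = raw.reshape(-1, NUM_MICS).astype(np.float32)

# Reproduce the current FPGA behavior in software: sum the left-side and
# right-side microphones into a stereo signal.
left = frames[:, : NUM_MICS // 2].sum(axis=1)
right = frames[:, NUM_MICS // 2 :].sum(axis=1)

# From here, different filters or beamformers can be applied and compared
# offline without touching the hardware design.
```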

Continue reading

Talking Heads

Within the Augmented Listening team, it has been my goal to develop Speech Simulators for testing purposes. These would be distributed around the environment in a sort of ‘Cocktail Party’ scenario.

 

Why use a Speech Simulator instead of human subjects?

CONSISTENCY.
Human subjects can never say the same thing exactly the same way twice. By playing anechoic recordings of people speaking through loudspeakers, we can remove this source of error from the experiment. We can also simulate the user’s own voice as captured by a wearable microphone array.

 

Why not just use normal Studio Monitors?

While studio monitors are designed to have a flat frequency response, which seems perfect for this situation, their off-axis performance is not consistent with that of the human voice. Because most monitors use multiple drivers to cover the desired frequency range, their dispersion is also inconsistent as the signal crosses between drivers.

Continue reading