This post describes our paper “Motion-Robust Beamforming for Deformable Microphone Arrays,” which won the best student paper award at WASPAA 2019.
When our team designs wearable microphone arrays, we usually test them on our beloved mannequin test subject, Mike A. Ray. With Mike’s help, we’ve shown that large wearable microphone arrays can perform much better than conventional earpieces and headsets for augmented listening applications, such as noise reduction in hearing aids. Mannequin experiments are useful because, unlike a human, Mike doesn’t need to be paid, doesn’t need to sign any paperwork, and doesn’t mind having things duct-taped to his head. There is one major difference between mannequin and human subjects, however: humans move. In our recent paper at WASPAA 2019, which won a best student paper award, we described the effects of this motion on microphone arrays and proposed several ways to address it.
Beamformers, which use spatial information to separate and enhance sounds from different directions, rely on precise distances between microphones. (We don’t actually measure those distances directly; we measure relative time delays between signals at the different microphones, which depend on distances.) When a human user turns their head – as humans do constantly and subconsciously while listening – the microphones near the ears move relative to the microphones on the lower body. The distances between microphones therefore change frequently.
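To make the link between geometry and time delay concrete, here is a minimal sketch, not taken from the paper, of how the relative delay between two microphones changes when one of them rotates with the head. It assumes a two-dimensional, far-field (plane-wave) model and a speed of sound of 343 m/s; the function names and the microphone positions are hypothetical values chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def relative_delay(mic_a, mic_b, source_angle_deg):
    """Delay in seconds of mic_b relative to mic_a for a plane wave
    arriving from source_angle_deg (measured from the +x axis)."""
    theta = math.radians(source_angle_deg)
    # Unit vector pointing from the array toward the source.
    ux, uy = math.cos(theta), math.sin(theta)
    # Projection of the displacement (b - a) onto the source direction.
    proj = (mic_b[0] - mic_a[0]) * ux + (mic_b[1] - mic_a[1]) * uy
    # The microphone closer to the source hears the wave earlier,
    # so a positive projection gives a negative delay.
    return -proj / SPEED_OF_SOUND


def rotate(point, angle_deg):
    """Rotate a 2-D point about the origin (a stand-in for the head axis)."""
    a = math.radians(angle_deg)
    x, y = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


# Hypothetical top-down layout in metres: one torso microphone that stays
# fixed and one ear microphone that turns with the head.
torso_mic = (0.0, -0.10)
ear_mic = (0.09, 0.0)

for turn in (0, 30, 60):
    delay = relative_delay(torso_mic, rotate(ear_mic, turn), 0.0)
    print(f"head turned {turn:2d} deg: relative delay {delay * 1e6:7.1f} us")
```

Even in this simplified model, a modest head turn shifts the relative delay by tens of microseconds, which is a large fraction of a period at higher audio frequencies.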
In a deformable microphone array, microphones can move relative to each other.
Microphone array researchers have studied motion before, but it is usually the sound source that moves relative to the entire array. For example, a talker might walk around the room. That problem, while challenging, is easier to deal with: we just need to track the direction of the source. Deformation of the array itself – that is, relative motion between microphones – is more difficult because there are more moving parts and the changing shape of the array has complicated effects on the signals. In this paper, we mathematically analyzed the effects of deformation on beamformer performance and considered several ways to compensate for it.
The EchoXL is a large-format, Alexa-powered smart speaker developed as part of TEC’s Alexa program. Based on current market offerings, it would be the largest of its kind. Its form and features will be modeled after Amazon’s Echo speaker, both to keep branding consistent and to exemplify a potential line expansion. While small Bluetooth speakers still hold the largest market segment in audio, the market for larger sound systems has been steadily increasing over the past few years (as evidenced by new products from LG, Samsung, Sony, and JBL). Currently, Amazon does not have any products in this category.
The speaker will be used as a public demonstration piece to exhibit the technology incorporated in current smart speakers, such as microphone arrays and internal room correction. The novelty of a scaled-up Echo speaker will also be useful for generating press for the group’s research.
This post describes our paper presented at CAMSAP 2019.
Imagine what it would sound like to listen through someone else’s ears. I don’t mean that in a metaphorical sense. What if you had a headset that would let you listen through microphones in the ears of someone else in the room, so that you can hear what they hear? Better yet, what if your headset was connected to the ears of everyone else in the room? In our group’s latest paper, “Cooperative Audio Source Separation and Enhancement Using Distributed Microphone Arrays and Wearable Devices,” presented this week at CAMSAP 2019, we designed a system to do just that.
Our team is trying to improve the performance of hearing aids and other augmented listening devices in crowded, noisy environments. In spaces like restaurants and bars where there are many people talking at once, it can be difficult for even normal-hearing people to hold a conversation. Microphone arrays can help by spatially separating sounds, so that each user can hear what they want to hear and turn off the sounds they don’t want to hear. To do that in a very noisy room, however, we need a large number of microphones that cover a large area.
Complex listening environments include many different sound sources, but also many microphone-equipped devices. Each listening device tries to enhance a different sound source.
In the past, we’ve built large wearable microphone arrays with sensors that cover wearable accessories or even the entire body. These arrays can perform much better than conventional earpieces, but they aren’t enough in the most challenging environments. In a large, reverberant room packed with noisy people, we need microphones spread all over the room. Instead of having a compact microphone array surrounded by sound sources, we should have microphones spread around and among the sound sources, helping each listener to distinguish even faraway sounds.
This post describes our new massive distributed microphone array dataset, which is available for download from the Illinois Databank and is featured in an upcoming paper at CAMSAP 2019.
The conference room used for the massive distributed array dataset.
Listening in loud noise is hard: we only have two ears, after all, but a crowded party might have dozens or even hundreds of people talking at once. Our ears are hopelessly outnumbered! Augmented listening devices, however, are not limited by physiology: they could use hundreds of microphones spread all across a room to make sense of the jumble of sounds.
Our world is already filled with microphones. There are multiple microphones in every smartphone, laptop, smart speaker, conferencing system, and hearing aid. As microphone technology and wireless networks improve, it will be possible to place hundreds of microphones throughout crowded spaces to help us hear better. Massive-scale distributed arrays are more useful than compact arrays because they are spread around and among the sound sources. One user’s listening device might have trouble distinguishing between two voices on the other side of the room, but wearable microphones on those talkers can provide excellent information about their speech signals.
Many researchers, including our team, are developing algorithms that can harness information from massive-scale arrays, but there is little publicly available data suitable for source separation and audio enhancement research at such a large scale. To facilitate this research, we have released a new dataset with 10 speech sources and 160 microphones in a large, reverberant conference room.
Alexa, Google Home, and Facebook smart devices are becoming more and more commonplace in the home. Although many individuals only use these smart devices to ask for the time or weather, they provide an important edge controller for the Internet of Things infrastructure.
Unknown to some consumers, Alexa and other smart devices contain multiple microphones. Alexa uses these microphones in order to determine the direction of the speaker, and display a light almost as if to “face” the user. This localization function is also very important for processing whatever is about to be said after “Alexa”, or “OK Google”.
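As a rough illustration of how two microphones can reveal a talker's direction, here is a minimal sketch of time-delay estimation by cross-correlation. This is one common textbook approach, not necessarily what any commercial smart speaker does. The function names, the microphone spacing, and the sample rate are hypothetical, and the angle formula assumes a far-field source and a speed of sound of 343 m/s.

```python
import math


def estimate_delay(x, y, max_lag):
    """Return the lag in samples (positive if y trails x) that maximizes
    the cross-correlation between the two signals."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(len(x)):
            u = t + lag
            if 0 <= u < len(y):
                score += x[t] * y[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle_deg(lag, sample_rate, spacing_m, speed_of_sound=343.0):
    """Convert a lag into an angle from broadside for a two-microphone pair.
    A positive angle means the source is closer to the first microphone."""
    s = lag / sample_rate * speed_of_sound / spacing_m
    s = max(-1.0, min(1.0, s))  # clip so measurement noise cannot break asin
    return math.degrees(math.asin(s))


# Toy signals: the same click reaches the second microphone 2 samples later.
mic_1 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic_2 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

lag = estimate_delay(mic_1, mic_2, max_lag=3)
angle = arrival_angle_deg(lag, sample_rate=16000, spacing_m=0.1)
print(f"lag = {lag} samples, angle = {angle:.1f} degrees from broadside")
```

Real devices face reverberation and background noise, so practical systems typically use more microphones and more robust correlation measures than this toy example.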
In our research lab, this kind of localization is important, and we hope to extrapolate more from individuals’ interactions with their home smart speakers. The final details of the experiments we hope to run are not yet concrete. However, we know that we will need our own Alexa-like device that can make studio-quality recordings on a number of different channels.
Imagine you are at a noisy restaurant. You hear the clanging of the dishes, the hearty laughs of the patrons around you, and the musical ambience, and you are struggling to hear your friend across the table. Wouldn’t it be nice if the main sound you heard came solely from your friend? That is the problem that sound source localization can help solve.
Sound source localization, as you might have guessed, is the process of identifying unique noises that you want to amplify. It is how your Amazon Echo Dot identifies who is speaking to it with the little ring at the top. For Engineering Open House, we wanted to create a device that can mimic the colorful ring at the top in a fun, creative way. Instead of a colorful light ring, we wanted to use a mannequin head that turns towards the audience when they speak to it.
My colleague Manan and I designed “Alexander”, the spinning head that can detect speech. We knew our system had to contain a microphone array, a processor to control the localization system and a motor to turn the mannequin head. Our choices of each component are as follows:
Have you ever wondered what it would sound like to listen through sixteen ears? This past March, hundreds of Central Illinois children and families experienced microphone-array augmented listening technology firsthand at the annual Engineering Open House (EOH) sponsored by the University of Illinois College of Engineering. At the event, which attracts thousands of elementary-, middle-, and high-school students and local community members, visitors learned about technologies for enhancing human and machine listening.
Listen up (or down): The technology of directional listening
Our team’s award-winning exhibit introduced visitors to several directional listening technologies, which enhance audio by isolating sounds that come from a certain direction. Directional listening is important when the sounds we want to hear are far away, or when there are many different sounds coming from different directions—like at a crowded open house! There are two ways to focus on sounds from one direction: we can physically block sounds from directions we don’t want, or we can use the mathematical tools of signal processing to cancel out those unwanted sounds. At our exhibit in Engineering Hall, visitors could try both.
This carefully designed mechanical listening device is definitely not an oil funnel from the local hardware store.
The oldest and most intuitive listening technology is the ear horn, pictured above. This horn literally funnels sound waves from the direction in which it is pointed. The effect is surprisingly strong, and there is a noticeable difference in the acoustics of the two horns we had on display. The shape of the horn affects both its directional pattern and its effect on different sound wavelengths, which humans perceive as pitch. The toy listening dish shown below operates on the same principle, but also includes an electronic amplifier. The funnels work much better for directional listening, but the spy gadget is the clear winner for style.
This toy listening dish is not very powerful, but it certainly looks cool!
These mechanical hearing aids rely on physical acoustics to isolate sound from one direction. To listen in a different direction, the user needs to physically turn them in that direction. Modern directional listening technology uses microphone arrays, which are groups of microphones spread apart from each other in space. We can use signal processing to compare and combine the signals recorded by the microphones to tell what direction a sound came from or to listen in a certain direction. We can change the direction using software, without physically moving the microphones. With sophisticated array signal processing, we can even listen in multiple directions at once, and can compensate for reflections and echoes in the room.
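The simplest way to "listen in a certain direction" with an array is a delay-and-sum beamformer. The sketch below is a minimal illustration, not the processing used in our exhibit: it assumes the arrival delays are known whole numbers of samples, and the function name and toy signals are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Steer an array toward one direction.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   arrival delay of the target sound at each microphone, in
              whole samples, relative to the earliest microphone.
    Each channel is advanced by its delay so that the target lines up
    across microphones, and the aligned channels are then averaged.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n):
            src = t + d
            if 0 <= src < n:
                out[t] += samples[src]
    return [v / len(channels) for v in out]


# Toy example: a click reaches microphone 0 at sample 2 and microphone 1
# at sample 4.
mic_0 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
mic_1 = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

steered = delay_and_sum([mic_0, mic_1], [0, 2])    # aligned with the click
unsteered = delay_and_sum([mic_0, mic_1], [0, 0])  # aimed somewhere else
print(steered)
print(unsteered)
```

When the delays match the sound's direction, the copies add up at full strength; otherwise they are smeared out and attenuated. Changing the direction only means changing the `delays` list, which is why no physical movement is needed.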
Constructing a microphone array is a challenge of its own, but how do we actually process the microphone array data to do things like filtering and beamforming? One solution is to store the data on off-chip memory for later processing. This solution is great for experimenting with different microphone arrays since we can process the data offline and see what filter combinations work best from the data that we collected. This solution also avoids having to make changes to the hardware design any time we want to change filter coefficients or what algorithm is being implemented.
Overview of a basic microphone array system
Here’s a quick refresher of the DE1-SoC, the development board we use to process the microphone array.
The main components we use in this project are the GPIO pins, the off-chip DDR3 memory, the HPS, and the Ethernet port. The microphone array connects to the GPIO port of the FPGA. The digital I2S data is interpreted on the FPGA by deserializing the data into samples. The 1-GB off-chip memory is where the samples will be stored for later processing. The HPS, which runs Linux, can grab the data from memory and store it on the SD card. Connecting the Ethernet port to a computer lets us grab the data from the FPGA seamlessly using shell and Python scripts.
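For readers unfamiliar with deserialization, the following Python sketch models the idea in software. It is not our FPGA design: it only shows how a serial, MSB-first stream of bits is packed into signed two's-complement samples. The function name and the word length are assumptions (the example uses 4-bit words for readability), and the sketch ignores I2S details such as the word-select signal and left/right channel interleaving.

```python
def deserialize(bits, word_len=24):
    """Pack a flat, MSB-first sequence of 0/1 bits into signed samples.

    word_len is an assumed sample width; any trailing bits that do not
    fill a whole word are dropped.
    """
    samples = []
    for i in range(0, len(bits) - word_len + 1, word_len):
        value = 0
        for b in bits[i:i + word_len]:
            value = (value << 1) | b  # shift in one bit at a time
        # Interpret the word as two's complement.
        if value >= 1 << (word_len - 1):
            value -= 1 << word_len
        samples.append(value)
    return samples


# Two 4-bit words: 0001 is +1 and 1111 is -1 in two's complement.
print(deserialize([0, 0, 0, 1, 1, 1, 1, 1], word_len=4))
```

In hardware, the inner loop corresponds to a shift register clocked by the bit clock, with one completed word written to memory per sample period.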
Currently, the system is set up to stream the samples from the microphone array to the output of the audio codec. The microphones on the left side are summed and output to the left channel, and the microphones on the right side are summed and output to the right channel. The microphone signals are not processed before being sent to the codec. Here is a block diagram of what the system looks like before we add a DMA interface to the system.
Within the Augmented Listening team, it has been my goal to develop Speech Simulators for testing purposes. These would be distributed around the environment in a sort of ‘Cocktail Party’ scenario.
Why use a Speech Simulator instead of human subjects?
Human Subjects can never say the same thing exactly the same way twice. By using anechoic recordings of people speaking played through speakers, we can remove the human error from the experiment. We can also simulate the user’s own voice captured by a wearable microphone array.
Why not just use normal Studio Monitors?
While studio monitors are designed to have a flat frequency response perfect for this situation, their off-axis performance is not consistent with that of the human voice. As most monitors use multiple drivers to achieve the desired frequency range, the dispersion is also inconsistent across the frequency range as it crosses between the drivers.
With March 3rd being World Hearing Day, the WHO-ITU (World Health Organization and International Telecommunication Union) released a new standard for safe listening devices on February 12th, 2019. Because our group researches ways to improve hearing through array processing, we also think that preventing hearing loss and taking care of our hearing is important. Hearing loss is almost always permanent, and there is currently no treatment for restoring hearing once it is lost. In this post, I will review the new WHO-ITU standard for safe listening devices, and I will also test how loud my personal audio device is with respect to the new standard.
Summary of WHO-ITU standard for safe listening devices
In the new standard for safe listening devices, WHO-ITU recommends including the following four functions in audio devices (originally found here):
- “Sound allowance” function: software that tracks the level and duration of the user’s exposure to sound as a percentage used of a reference exposure.
- Personalized profile: an individualized listening profile, based on the user’s listening practices, which informs the user of how safely (or not) he or she has been listening and gives cues for action based on this information.
- Volume limiting options: options to limit the volume, including automatic volume reduction and parental volume control.
- General information: information and guidance to users on safe listening practices, both through personal audio devices and for other leisure activities.
As stated in the introduction of Safe Listening Devices and Systems, WHO-ITU considers a safe level of listening to be sound below 80dB for a maximum of 40 hours per week. This recommendation is stricter than the standard currently implemented by OSHA (Occupational Safety and Health Administration), which enforces a PEL (permissible exposure limit) of 90dBA* for 8 hours per day, with the exposure time halving with each 5dBA* increase in the noise level. NIOSH (The National Institute for Occupational Safety and Health) has a different set of recommendations for noise exposure: an exposure time of 8 hours for noise at 85dBA*, with the exposure time halving with each 3dBA* increase in the noise level. Under this recommendation, workers should be exposed to noise at 100dBA* for no more than 15 minutes per day!
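The halving rules above are easy to check with a few lines of code. This is a minimal sketch of the arithmetic only, using the OSHA and NIOSH figures quoted in this post; the function names are my own, and it does not implement the WHO-ITU "sound allowance" calculation.

```python
def permissible_hours(level_dba, reference_dba, exchange_rate_db,
                      reference_hours=8.0):
    """Allowed daily exposure time when the time halves for every
    exchange_rate_db increase above reference_dba."""
    halvings = (level_dba - reference_dba) / exchange_rate_db
    return reference_hours / 2 ** halvings


def osha_hours(level_dba):
    # OSHA PEL: 90 dBA for 8 hours, halving every 5 dBA.
    return permissible_hours(level_dba, 90.0, 5.0)


def niosh_hours(level_dba):
    # NIOSH recommendation: 85 dBA for 8 hours, halving every 3 dBA.
    return permissible_hours(level_dba, 85.0, 3.0)


for level in (85, 90, 95, 100):
    print(f"{level} dBA: OSHA {osha_hours(level) * 60:6.1f} min, "
          f"NIOSH {niosh_hours(level) * 60:6.1f} min")
```

At 100 dBA, the NIOSH rule gives 8 hours divided by 2 five times, or 15 minutes, while the OSHA rule still allows 2 hours. This shows how much the choice of reference level and exchange rate matters.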