Source Separation using a Massive Number of Microphones

This post describes the results of using simple delay-and-sum beamforming for source separation on the Massive Distributed Microphone Array Dataset collected by Ryan Corey, Matt Skarha, and Professor Andrew Singer.

Although source separation (separating distinct, overlapping sound sources from one another and from diffuse background noise) usually produces excellent results in a small, quiet lab with only a few speakers, such favorable conditions are not always available. In a large, reverberant room with many speakers, for example, it may be difficult for a person or a speech recognition system to track and comprehend what any one speaker is saying. Yet using source separation in such a scenario to improve intelligibility is quite difficult without external information that may itself be hard to obtain.

Many source separation methods work well only up to a certain number of speakers – typically no more than four or five. Moreover, some of these methods require the number of microphones to equal the number of sources, and their results scale poorly with additional microphones, if they can use them at all. Limiting the number of microphones to the number of speakers will not work in these difficult scenarios, but adding more microphones may help, thanks to the greater amount of spatial information they provide. The motivation behind this experiment was therefore to find a source separation method, or a chain of such methods, that could leverage the “massive” number of microphones in the Massive Distributed Microphone Array Dataset to solve the particularly challenging problem of separating ten speech sources. Ideally, the algorithm would rely on as little external information as possible, drawing instead on the wealth of information gathered by the microphone arrays distributed around the conference room. The delay-and-sum beamformer was the first method considered for this task because it requires only the locations of the sources and microphones, and it inherently scales well with a large number of microphones.

The Dataset

Layout of the conference room

Shown above is a diagram of the setup for the Massive Distributed Microphone Array Dataset. Note that there are two distinct types of arrays: wearable arrays, denoted by the letter W and numbered 1-4, and tabletop arrays, denoted by the letter T and numbered 1-12. Each wearable array has 16 microphones and each tabletop array has 8, for a total of 160 microphones.

Delay-and-Sum Beamforming

Delay-and-sum beamforming attempts to align the delays of the source images across all microphones for a given target source, and then sums the delayed signals across all microphones. The hope is that the images of the target source will be perfectly in phase, so that summing them produces constructive interference, “boosting” the target source while having a more mixed effect on the interfering sources. Ideally, the output of a delay-and-sum beamformer contains the target source at a volume loud enough to be clearly intelligible, though the interfering sources will probably still be present at a quieter, less intelligible volume.
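
To make this concrete, below is a minimal time-domain sketch of a delay-and-sum beamformer. It assumes synchronized recordings and rounds delays to whole samples for simplicity (a real implementation would use fractional delays, e.g. via phase shifts in the STFT domain); the function and argument names are illustrative, not taken from the experiment's code.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_position, fs, c=343.0):
    """Time-align each microphone channel to a target source, then average.

    signals:         (n_mics, n_samples) synchronized recordings
    mic_positions:   (n_mics, 3) microphone coordinates in meters
    source_position: (3,) target source coordinates in meters
    fs:              sampling rate in Hz; c: speed of sound in m/s
    """
    # Propagation delay from the target source to each microphone
    distances = np.linalg.norm(mic_positions - source_position, axis=1)
    delays = distances / c

    # Advance each channel (relative to the closest microphone) so the
    # target's images line up in phase; integer-sample approximation
    shifts = np.round((delays - delays.min()) * fs).astype(int)
    aligned = np.zeros_like(signals)
    for m, s in enumerate(shifts):
        aligned[m, : signals.shape[1] - s] = signals[m, s:]

    # The target adds coherently across channels; interferers add with
    # mixed phases and are attenuated
    return aligned.mean(axis=0)
```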

Diagram of Delay-and-Sum Beamforming

A diagram depicting delay-and-sum beamforming. The top half shows the delays of each image of source 1 (the target source) being aligned at each microphone so that the images are perfectly in phase, and the sum of these aligned images producing a “boosted” target source. The interfering source, source 2, has its images delayed by the same amounts as the images of source 1, which does not align them; their sum therefore produces no coherent boost.

Results

The beamformer performed poorly when targeting sources 3, 5, 6, and 10, but reproduced intelligibly separated versions of the other sources. Specifically, the problematic sources are often boosted when targeted, but either fade in and out of intelligibility or remain too quiet to comprehend. The only exception is source 10, which is not boosted at all and is completely undetectable when targeted. Three probable causes of these poor results were investigated: excessive reverberation, incorrectly recorded positions of the microphones and sources, and directionality of the speech sources undermining the omnidirectional point-source model that delay-and-sum beamforming assumes. The first cause was dismissed, the second was found to be somewhat of an issue for several sources, but the third was deemed the most significant factor behind the beamformer’s poor performance on the problematic target sources.

In an attempt to mitigate these issues, an enhanced delay-and-sum beamformer was devised that, for each target source, used only the microphones nearest to it that also lay within a 180° arc of the direction the source was facing. This direction was estimated from the diagram, so some inaccuracy was probably introduced into the selection process. The output of this enhanced beamformer was then used as input to the AuxIVA blind source separation algorithm developed by Nobutaka Ono. Because the enhanced beamformer was able to boost the target source relative to the interfering sources more than the original beamformer could in some cases, AuxIVA produced much better results with the enhanced beamformer than with the original for certain target sources. This can be seen below in the mir_eval results.
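
A minimal sketch of this two-stage idea is shown below, with the microphone selection implemented as described above (keep the nearest microphones whose offset has a positive dot product with the source's facing direction, i.e. lies within a 180° arc) and the separation stage using the AuxIVA implementation from pyroomacoustics. The facing vectors, the cutoff k, and the STFT parameters are illustrative assumptions, not values from the experiment.

```python
import numpy as np
import pyroomacoustics as pra

def select_mics(mic_positions, source_position, facing, k=24):
    """Indices of the k nearest microphones within the 180° arc the source faces.

    facing: unit vector for the direction the talker points in,
    estimated from the room diagram (so only approximate).
    """
    offsets = mic_positions - source_position
    in_arc = offsets @ facing > 0  # within ±90° of the facing direction
    idx = np.flatnonzero(in_arc)
    order = np.argsort(np.linalg.norm(offsets[idx], axis=1))
    return idx[order[:k]]

def separate_with_auxiva(channels, n_fft=2048, hop=512, n_iter=30):
    """STFT the selected channels, run AuxIVA, and return time-domain outputs."""
    # X has shape (n_frames, n_freq, n_channels), as pra.bss.auxiva expects
    X = np.stack(
        [pra.transform.stft.analysis(ch, n_fft, hop) for ch in channels], axis=2
    )
    Y = pra.bss.auxiva(X, n_iter=n_iter, proj_back=True)
    return np.stack(
        [pra.transform.stft.synthesis(Y[:, :, s], n_fft, hop) for s in range(Y.shape[2])]
    )
```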

Diagram color-coded to show the microphone arrays best suited to target sources 3, 5, 6, 8, and 10 under the enhanced beamformer. These arrays captured the majority of the information from each respective source, so they also illustrate the degree of the directionality problem for each source, particularly 6 and 10.

The table below shows the results of the enhanced beamformer and compares them to the results of the original beamformer, which naively used data from every microphone in the room. Results for the problematic sources 3, 5, 6, and 10 are shown, along with results for source 8, which were fairly good under both beamformers, for reference.

Although the enhanced beamformer could mitigate the directionality problem, the incorrectly recorded coordinates could only be resolved by estimating relative distances through cross-correlation of source images, which is difficult with so many sources present.

Mir_Eval Results

Here SDR means source-to-distortion ratio, which describes in decibels how much energy the target source has compared to the interfering sources, the noise, and the artifacts in the result. SIR means source-to-interference ratio, which describes in decibels how much energy the target source has compared to the interfering sources alone.
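
For reference, these metrics come from mir_eval's BSS Eval implementation; a minimal usage sketch follows, with placeholder file names standing in for the actual references and estimates.

```python
import numpy as np
import mir_eval

# Both arrays have shape (n_sources, n_samples): the clean speech signals
# and the corresponding beamformer/AuxIVA outputs (placeholder paths).
reference_sources = np.load("references.npy")
estimated_sources = np.load("estimates.npy")

# Returns per-source SDR, SIR, and SAR (all in dB), plus the permutation
# that best matches each estimate to a reference.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources
)
print("SDR (dB):", sdr)
print("SIR (dB):", sir)
```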

Key Takeaways

  • We used delay-and-sum beamforming on the Massive Distributed Microphone Array Dataset because it is a relatively simple algorithm that scales well with the number of microphones and requires only the coordinates of the microphones and sources
  • Beamforming performed poorly on several target sources (3, 5, 6, 10), probably due to a combination of inaccurately recorded coordinates and the directionality of the sound sources
  • Using an enhanced beamformer in tandem with the AuxIVA algorithm to address the directionality problem produced intelligible results for most sources

Further improvements in source separation with this dataset probably lie in statistical methods, such as the MVDR beamformer, or in using neural networks to estimate something like an ideal ratio mask (IRM) that aids in source separation. MVDR was in fact attempted – the dataset contains recordings of exponential chirps played from the positions of the speech sources, from which the acoustic transfer functions MVDR needs can be measured – and it worked very well. However, MVDR relies on very specific information that is difficult to obtain in real-world scenarios, so it was not considered in line with the goals of this experiment. Nevertheless, this experiment suggests the viability of algorithms that scale well with an increasing number of microphones, as the enhanced delay-and-sum beamformer in tandem with the AuxIVA algorithm produced largely intelligible results with only a few exceptions.
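
To see why MVDR demands that extra information, consider the textbook formulation: its weights at each frequency are built from the target's steering vector (here, a measured acoustic transfer function) and the interference-plus-noise spatial covariance, neither of which plain delay-and-sum requires. The sketch below is that standard formula, not the experiment's code.

```python
import numpy as np

def mvdr_weights(d, R):
    """MVDR weights for one frequency bin: w = R^{-1} d / (d^H R^{-1} d).

    d: (n_mics,) complex steering vector for the target source,
       e.g. measured from the dataset's chirp recordings
    R: (n_mics, n_mics) interference-plus-noise spatial covariance
    """
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)
```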

Citations

Dataset

Corey, Ryan M.; Skarha, Matthew D.; Singer, Andrew C. (2019): Massive Distributed Microphone Array Dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-6216881_V1

Mir_Eval

Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis, “mir_eval: A Transparent Implementation of Common MIR Metrics”, Proceedings of the 15th International Conference on Music Information Retrieval, 2014.

Pyroomacoustics and AuxIVA implementation

R. Scheibler, E. Bezzam, I. Dokmanić, “Pyroomacoustics: A Python package for audio room simulations and array processing algorithms,” Proc. IEEE ICASSP, Calgary, CA, 2018.

Delay-and-Sum Diagram

Detecting Laterality and Nasality in Speech with the Use of a Multi-Channel Recorder – Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/The-block-diagram-of-the-delay-sum-beamforming-method_fig2_280041100 [accessed 26 Jul 2020]