From 2010 to 2013, we made great progress toward solving fundamental problems in developing audio technologies for a low-cost, real-time, realistic, and flexible telepresence system. Our accomplishments over the past three years can be summarized as follows:

Biologically inspired miniature microphone array

Realistic teleimmersion will never be widely deployed if each user must have a large physical array of microphones. Thus, in the RATEM project we employ a biologically inspired miniature microphone array, shown in Figure 1. This ‘zero-aperture’ microphone array uses four precisely placed microphones and is the smallest microphone array sensor in the world. Each microphone measures only a few millimeters across and is placed about a centimeter from the others. This XYZO array combines three directional gradient-response microphones (XYZ) with an omnidirectional-response microphone (O); together they capture more acoustic information than traditional arrays composed only of omnidirectional microphones. Unlike a conventional spatially separated microphone array, which exploits the time difference between each microphone pair, the XYZO array's collocated gradient microphones of orthogonal orientations exploit the amplitude differences between the microphones. The performance of the XYZO array is therefore less affected by frequency variations of the target sources. The localization mechanism of spatially separated arrays still applies to the XYZO array in the sense that the array response depends strongly on the direction of arrival (DOA) of the sources. Once this dependence is characterized, through experimental measurement or theory, a conventional localizer such as Multiple Signal Classification (MUSIC) can be adopted to search for directional peaks. The 2D localization and beamforming capabilities of the XYZO microphone array were experimentally studied in [MLKJ08].
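Once the direction-dependent array response is known, the MUSIC search amounts to scanning a grid of steering vectors against the noise subspace of the array covariance. The following is a minimal sketch using an idealized 2D response for the in-plane XYZO channels (an illustrative model; real systems use measured responses):

```python
import numpy as np

def xyzo_steering(az):
    # Idealized in-plane XYZO response: omni channel plus two
    # orthogonal gradient channels (a simplified stand-in model).
    return np.array([1.0, np.cos(az), np.sin(az)])

def music_spectrum(R, n_sources, grid):
    # Noise subspace from the eigen-decomposition of the covariance R.
    w, V = np.linalg.eigh(R)                 # eigenvalues ascending
    En = V[:, : R.shape[0] - n_sources]      # noise eigenvectors
    p = []
    for az in grid:
        a = xyzo_steering(az)
        # MUSIC pseudospectrum: peaks where a is orthogonal to En.
        p.append(1.0 / np.real(a.conj() @ En @ En.conj().T @ a))
    return np.array(p)

# Simulate one source at 60 degrees azimuth plus sensor noise.
rng = np.random.default_rng(0)
az_true = np.deg2rad(60.0)
s = rng.standard_normal(2000)
X = np.outer(xyzo_steering(az_true), s) + 0.05 * rng.standard_normal((3, 2000))
R = X @ X.conj().T / X.shape[1]

grid = np.deg2rad(np.arange(0, 360))
est = np.rad2deg(grid[np.argmax(music_spectrum(R, 1, grid))])
```

Because the gradient channels encode direction in their amplitudes, a single covariance matrix over one frame suffices; no inter-sensor time delays are needed.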


Sonistic’s New MEMS Microphone Array

Besides the unique biologically inspired XYZO array, the world’s smallest four-channel microphone array, we are using a novel microelectromechanical systems (MEMS) microphone array from Sonistic, shown in Figure 1.1. Sonistic LLC is an Illinois start-up company co-founded by Prof. Douglas Jones; we have been granted use of its hardware. Sonistic’s microphone array has a highly compact form (approximately 1 in²) and uses standard low-cost MEMS microphones, which are all omnidirectional and highly matched to one another. With Sonistic’s array the calibration process can be avoided, while still achieving excellent performance in audio direction finding, speech enhancement, and binaural audio reconstruction.


Figure 1.1. View of the system configuration with Sonistic’s MEMS microphone array

Real-time 2D and 3D direction finding systems

For localizing speakers in realistic teleimmersion, we have demonstrated real-time 2D and 3D direction-finding algorithms for the XYZO miniature microphone array using an ordinary laptop, a commodity graphics card, and general-purpose GPU (GPGPU) computing. Figure 2 shows the system at work. The new hardware and software are simultaneously smaller, lower power, less expensive, more accurate, and more robust than conventional approaches based on much larger microphone arrays. Figure 3 shows the directional tracking of a military vehicle in an open-field trial using the XYZO array. The algorithms developed for the XYZO microphone array leverage the increased acoustic information obtainable from a miniature directional array and employ a wide variety of signal processing techniques not feasible with traditional physically separated microphone arrays. The GPU implementation and computational optimization of the real-time 3D direction-finding system has been accepted for publication [LCZRZJC12], and the theory and performance evaluation of the system has been submitted for publication [ZALRCJ12]. For more information, view a video of this research in action here and here.


Robust 2D and 3D underdetermined DOA estimation systems

We first addressed underdetermined 2D DOA estimation, in which only the azimuth direction is detected. Extending our previous research on time-frequency sparsity and the coherence test, we added noise tracking and onset detection to improve the overall accuracy. After all single-source time-frequency bins are selected, we separate them into clusters corresponding to the different sources. We perform an eigen-decomposition of the covariance matrices of those bins to reveal the structure of the data: the principal eigenvectors of the covariance matrices form distinct clusters, one per speech source. The resulting underdetermined 2D DOA estimation system is more robust and works well in noisy and correlated conditions. Figure 4 illustrates eigenvector clustering and DOA estimation of seven speech sources at 20 dB SNR; all sources are well localized in the xy plane [TZJ14].

Figure 4. Eigenvector clustering for 7 sound sources
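The clustering step can be illustrated on synthetic data: each single-source time-frequency bin contributes the principal eigenvector of its covariance matrix, and those eigenvectors bunch around their sources’ steering vectors. The sketch below uses two hypothetical sources; the steering vectors and the nearest-direction labeling are illustrative stand-ins, not the actual system:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical source steering vectors (normalized stand-ins).
a1 = np.array([1.0, 0.8, 0.2]); a1 /= np.linalg.norm(a1)
a2 = np.array([1.0, -0.3, 0.9]); a2 /= np.linalg.norm(a2)

def principal_eigvec(R):
    # Eigenvector of the largest eigenvalue, with sign ambiguity fixed.
    w, V = np.linalg.eigh(R)
    v = V[:, -1]
    return v * np.sign(v[0])

vecs = []
for a in (a1, a2):
    for _ in range(50):                    # 50 single-source bins per source
        s = rng.standard_normal(64)
        X = np.outer(a, s) + 0.05 * rng.standard_normal((3, 64))
        vecs.append(principal_eigvec(X @ X.T / 64))
vecs = np.array(vecs)

# Group each eigenvector with its closest steering direction (a toy
# stand-in for the unsupervised clustering used in practice).
labels = (vecs @ a2 > vecs @ a1).astype(int)
```

In the real system the source directions are unknown, so the eigenvectors are clustered without reference directions; the point here is only that bins from the same source yield nearly identical principal eigenvectors.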

Next, we addressed underdetermined 3D DOA estimation of quasi-stationary speech signals using the Khatri-Rao (KR) product and an acoustic vector sensor (AVS) configuration. The Khatri-Rao subspace method was first proposed for the uniform linear array (ULA) to detect the azimuth directions of up to 2N−2 uncorrelated sound sources with N sensors. In our work, we apply the Khatri-Rao subspace method to the XYZO array to detect both azimuth and elevation directions. We studied the identifiability of the new approach and demonstrated that six speech sources are successfully determined in both horizontal and vertical planes, as shown in Figure 5 [ZSJ14].


Figure 5. The 3D DOA spectrum results of the KR-AVS approach with both azimuth and elevation angles
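The key identity behind the KR subspace method is that, for quasi-stationary sources, the vectorized frame covariances lie in the column space of the Khatri-Rao (column-wise Kronecker) product of the steering matrix with its conjugate, which behaves like a virtual array with N² rows instead of N. A minimal sketch with random stand-in steering vectors:

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: column k is kron(A[:, k], B[:, k]).
    return np.einsum('ik,jk->ijk', A, B).reshape(A.shape[0] * B.shape[0], -1)

# N = 4 sensors, K = 5 sources: more sources than sensors.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 5))          # stand-in steering matrix

# Vectorized frame covariances live in the range of A* (KR) A,
# a 16-row "virtual" steering matrix built from 4 physical sensors.
V = khatri_rao(A.conj(), A)
rank = np.linalg.matrix_rank(V)
```

Because the 16 virtual rows generically keep all five columns independent, subspace methods applied to this virtual array can resolve more sources than physical sensors, which is the underdetermined regime exploited in [ZSJ14].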

Faster and more effective speech enhancement algorithms

We developed a multiple-iteration constrained conjugate gradient (MICCG) algorithm and a single-iteration constrained conjugate gradient (SICCG) algorithm to realize the widely used frequency-domain minimum-variance-distortionless-response (MVDR) beamformer, and applied the resulting algorithms to speech enhancement. The algorithms are derived from the Lagrange method and conjugate gradient techniques, and their implementations avoid any form of explicit or implicit autocorrelation matrix inversion. Theoretical analysis establishes formal convergence of the algorithms. Specifically, the MICCG algorithm uses a block adaptation approach and generates a finite sequence of estimates that converge to the MVDR solution; for limited data records, its estimates are better than those of conventional estimators and equivalent to those of the auxiliary vector (AV) algorithms. The SICCG algorithm uses a continuous adaptation approach with a sample-by-sample updating procedure, and its estimates converge asymptotically to the MVDR solution. We studied an illustrative example using synthetic data from a uniform linear array (ULA) and demonstrated an evaluation on real data recorded by an acoustic vector sensor (AVS) array. The performance of the MICCG and SICCG algorithms is compared with state-of-the-art approaches in Figure 6 [ZJKM14].


Figure 6. Comparison of the minimum MS estimation errors as a function of data record size (up to 500 samples) for the MICCG, AV, STAV, BPAV, RAV, UCG-DL-SMI, SICCG, RLS, and Frost algorithms.
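The inversion-free idea can be sketched in a few lines: the MVDR weights w = R⁻¹a / (aᴴR⁻¹a) are obtained by solving Rx = a with conjugate gradient iterations, which require only matrix-vector products, and then rescaling to meet the distortionless constraint. This is a plain textbook CG sketch for illustration, not the constrained MICCG/SICCG recursions of [ZJKM14]:

```python
import numpy as np

def mvdr_cg(R, a, n_iter=20):
    # Solve R x = a by conjugate gradient (no explicit inverse of R),
    # then scale so the distortionless constraint w^H a = 1 holds.
    x = np.zeros_like(a)
    r = a - R @ x
    p = r.copy()
    rs = r.conj() @ r
    for _ in range(n_iter):
        Rp = R @ p
        alpha = rs / (p.conj() @ Rp)
        x = x + alpha * p
        r = r - alpha * Rp
        rs_new = r.conj() @ r
        if np.sqrt(abs(rs_new)) < 1e-12:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x / (a.conj() @ x)

# Synthetic SPD covariance surrogate and stand-in steering vector.
rng = np.random.default_rng(3)
M = 6
B = rng.standard_normal((M, M))
R = B @ B.T + M * np.eye(M)
a = rng.standard_normal(M)

w = mvdr_cg(R, a)
w_ref = np.linalg.solve(R, a)
w_ref = w_ref / (a @ w_ref)               # direct-inversion reference
```

For an M × M system, CG reaches the exact solution in at most M iterations in exact arithmetic, and early termination yields the reduced-rank estimates that make the CG family attractive for short data records.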

Novel 3D audio capture and reconstruction algorithm

To further improve the user experience and the sense of being physically co-located in an immersive teleconference, we have also developed a novel 3D audio capture and reconstruction algorithm for stereo headphones, based on the XYZO array. The approach produces the left- and right-ear signals from the microphone array outputs by applying a set of optimal time-invariant gain vectors. These gain vectors are derived from the minimum-variance-distortionless-response (MVDR) beamformer and minimum-mean-squared-error (MMSE) estimation; they integrate the two stages of beamforming and head-related transfer function (HRTF) filtering and can be easily computed offline. The proposed approach is independent of the number of virtual sound sources and is flexible enough to work with different sets of HRTFs. In subjective user tests, a wide variety of subjects reported good localization accuracy [ZRJJ12]. For more information, view a video clip of our 3D audio effect here.
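The offline computation of such gain vectors can be illustrated with a least-squares stand-in: fit one time-invariant vector per ear so that beamforming toward every candidate direction and HRTF weighting collapse into a single inner product per sample. The steering matrix and HRTF gains below are random placeholders, and plain least squares replaces the MVDR/MMSE derivation of the actual system:

```python
import numpy as np

rng = np.random.default_rng(4)
M, D = 4, 36                          # 4 mics, 36 candidate directions

A = rng.standard_normal((M, D))       # array steering vectors (stand-in)
h_left = rng.standard_normal(D)       # left-ear HRTF gain per direction (stand-in)

# Time-invariant gain vector: least-squares fit so that g^T a(d) ~ h_left(d)
# for every direction d, folding beamforming and HRTF filtering into one step.
g_left, *_ = np.linalg.lstsq(A.T, h_left, rcond=None)

# At run time the left-ear signal is just one inner product per sample.
x = rng.standard_normal(M)            # one frame of microphone samples
y_left = g_left @ x
```

Because the gain vectors are fixed ahead of time, the run-time cost is independent of how many virtual sources are present, which matches the flexibility claimed above.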

[MLKJ08] S. Mohan, M.E. Lockwood, M.L. Kramer and D.L. Jones, “Localization of multiple acoustic sources with small arrays using a coherence test”, Journal of the Acoustical Society of America 123(4), pp. 2136–2147, April 2008.

[LCZRZJC12] Y. Liang, Z. Cui, S. Zhao, K. Rupnow, Y. Zhang, D.L. Jones, and D. Chen, “Real-time implementation and performance optimization of 3D sound localization on GPUs”, Proceedings of Design, Automation & Test in Europe (DATE’12), March 2012, Dresden, Germany.

[ZALRCJ12] S. Zhao, S. Ahmed, Y. Liang, K. Rupnow, D. Chen, and D.L. Jones, “A real-time 3D sound localization system with miniature microphone array for virtual reality”, Proceedings of the 7th IEEE Conference on Industrial Electronics and Applications (ICIEA ’12), July 2012, Singapore.

[ZRJJ12] S. Zhao, R. Rogowski, R. Johnson, and D. L. Jones, “3D binaural sound capture and reproduction using a miniature microphone array”, Proceedings of the 15th International Conference on Digital Audio Effects (DAFx’ 12), 17-21 Sep, 2012, York, UK.

[TZJ14] N. T. N. Tho, S. Zhao, and D. L. Jones, “Robust DOA estimation of multiple speech sources,” Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, 4-11 May 2014.

[ZSJ14] S. Zhao, T. Saluev, and D. L. Jones, “Underdetermined direction of arrival estimation using acoustic vector sensor,” Signal Processing (Elsevier), vol. 100, pp. 160-168, Jul. 2014.

[ZJKM14] S. Zhao, D. L. Jones, S. Khoo, and Z. Man, “Frequency-domain beamformers using conjugate gradient techniques for speech enhancement,” in press, Journal of the Acoustical Society of America, 2014.