For low-cost, high-quality and real-time telepresence
ITEM is a research project conducted at the Advanced Digital Sciences Center (ADSC) as part of the Interactive Digital Media (IDM) sub-program. The ITEM project contributes to the main vision of the IDM sub-program by providing a 4D (three spatial dimensions plus time) telepresence meeting space, the ultimate goal of an audio-visual remote reality system, augmented by information from cyberspace. The PI leading our research efforts is Professor Minh N. Do from the University of Illinois at Urbana-Champaign.
Low-cost, realistic, real-time, and flexible audio-visual (AV) telepresence is a fundamental goal of AV research in ADSC’s IDM program and is a grand challenge in itself, as acknowledged by the inclusion of virtual reality in the US National Academy of Engineering’s list of 14 grand challenges for engineering in the 21st century. Success in this endeavor would mean natural and seamless communication between individuals through a wide variety of media. Among its many uses, such a system could revolutionize teleconferencing, augmented reality, and gaming by providing 1) true 4D sound and video perception and 2) the ability to synthesize a telepresence involving several sites by placing participants in a virtual scene and allowing them to interact naturally, as if all of them were physically present. Effective interaction with friends, colleagues, and collaborators around the world has become central to success in business, government, and personal relationships. By greatly extending the capability and naturalness of remote communication, the ITEM project could enable more frequent and less costly high-quality communication and interaction, with huge benefits to business and society.
The following sections, ‘Research challenge and the state of the art’ and ‘Our research methodology and novelty’, describe these points in more detail.
Research challenge and the state of the art
– It may be easy for humans, but for computers?
For realistic telepresence, the processing chain must solve a variety of classical theoretical problems in computer vision and image processing, such as bi-layer video segmentation, visual correspondences, image matting, and motion deblurring. For example, suppose that Alice and Bob attend a meeting in a virtual meeting room, while in fact Alice is seated in a coffee shop and Bob is at home. To Bob, it must appear as though Alice is in the meeting room, not in the coffee shop. In other words, we must remove the busy coffee shop from the background of Alice’s camera image and replace it with the appropriate part of the virtual meeting room. Although this is easy for humans, no known algorithm can do a decent job of cutting out Alice from her backdrop in real time using an existing camera and a conventional processor, not even with hardware assistance in the form of a GPU or FPGA, despite intense interest in both the industrial and research communities.
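The final replacement step in the Alice example can be illustrated with the standard compositing equation, out = alpha * foreground + (1 - alpha) * background, applied per pixel. The sketch below is a minimal illustration in plain Python, not the ITEM pipeline: the function name `composite` and the tiny hand-written alpha matte are assumptions made for the example, and the hard part described above (estimating alpha from a live camera image in real time) is deliberately left out.

```python
def composite(foreground, background, alpha):
    """Blend two images per pixel: out = alpha * fg + (1 - alpha) * bg.

    foreground, background: rows of (r, g, b) tuples with values in 0..255.
    alpha: rows of floats in [0, 1]; 1.0 means the pixel belongs to the person.
    """
    blended = []
    for fg_row, bg_row, a_row in zip(foreground, background, alpha):
        row = []
        for fg, bg, a in zip(fg_row, bg_row, a_row):
            # Blend each color channel and round back to an integer intensity.
            row.append(tuple(round(a * f + (1.0 - a) * b) for f, b in zip(fg, bg)))
        blended.append(row)
    return blended


# Hypothetical 1x3 frame: Alice's camera image and the virtual meeting room.
camera = [[(200, 100, 0), (200, 100, 0), (200, 100, 0)]]
room = [[(0, 100, 200), (0, 100, 200), (0, 100, 200)]]
# Hand-written matte: person pixel, soft boundary pixel, background pixel.
matte = [[1.0, 0.5, 0.0]]

result = composite(camera, room, matte)
```

With this matte, the first pixel keeps the camera color, the last pixel takes the room color, and the boundary pixel is an even mix of the two, which is what image matting aims to provide along hair and other soft edges.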
– Ultimately, it’s all for human users.
Realistic teleimmersion must also consider human perceptual issues. For example, if Alice participates in a meeting around a virtual table, she should see and hear other participants from a consistent perspective. A colleague speaking to her must appear to be looking at her, even though he may be looking in another direction in the real world (gaze correction). Alice must have a full-front view of the colleague seated across from her, and she must see the left side of the face of the person seated at her virtual right, even though each participant has only a single video capture device operating from a single angle (free-viewpoint video synthesis). Further, the degree and angle of lighting should be consistent around the virtual table, even though each participant has different real-world lighting. Current state-of-the-art algorithms for free-viewpoint video synthesis cannot reconstruct a scene from a different viewpoint at high quality in real time, mainly because accurate 3D depth information is difficult to obtain efficiently. Ensuring consistent lighting and color tone across different participants and their virtual environment is another open challenge.
– 3D Audio, another fundamental part…
Human hearing is extremely sensitive to the perceived direction of arrival of sounds. If there are small differences between the expected and perceived directions of arrival, so that the voice of the colleague on Alice’s (virtual) left seems to be coming from another direction, the illusion of co-presence will be broken. Thus, accurate capture of acoustic sources’ direction of arrival and appropriate reconstruction of their direction for each virtual listener (3D audio direction finding and reconstruction) is a critical need for immersive communications. Acoustic direction finding is a classic problem in array signal processing and has been extensively studied with arrays of physically separated microphones. The practical uses of these systems are limited because they need many microphones, which occupy a lot of space. Further, these systems require a large separation between microphones in order to detect low-frequency sources accurately. Overall, state-of-the-art algorithms have trouble performing 2D audio direction finding with good accuracy in real time on a conventional processor. Even with GPU assistance, there are no real-time algorithms for 3D audio direction finding. Thus, realistic teleimmersion will require a major advance in audio direction finding. (For more audio-related research activities at ADSC, please refer to the audio research group.)
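The dependence on microphone spacing can be made concrete with the textbook far-field model for a two-microphone pair, which is an assumption of this sketch rather than the method used in the project. A plane wave arriving at angle theta from broadside reaches the second microphone later by tau = d * sin(theta) / c, so theta = arcsin(c * tau / d), and the phase difference between the microphones at frequency f is 2 * pi * f * tau. The function names and the 343 m/s speed of sound are illustrative choices.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def arrival_angle_deg(delay_s, spacing_m, c=SPEED_OF_SOUND):
    """Far-field angle from broadside, given the delay between two microphones."""
    ratio = c * delay_s / spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against noise pushing |ratio| above 1
    return math.degrees(math.asin(ratio))


def phase_difference_rad(freq_hz, spacing_m, angle_deg, c=SPEED_OF_SOUND):
    """Phase difference between the two microphones for a tone at freq_hz."""
    delay_s = spacing_m * math.sin(math.radians(angle_deg)) / c
    return 2.0 * math.pi * freq_hz * delay_s


# A source at 30 degrees, observed by a hypothetical 10 cm pair.
delay = 0.10 * math.sin(math.radians(30.0)) / SPEED_OF_SOUND
angle = arrival_angle_deg(delay, 0.10)

# The same 100 Hz source seen by a 2 cm pair and by a 50 cm pair.
small = phase_difference_rad(100.0, 0.02, 30.0)
large = phase_difference_rad(100.0, 0.50, 30.0)
```

For the 100 Hz tone, the 2 cm pair sees a phase difference 25 times smaller than the 50 cm pair, so the same measurement noise translates into a much larger angle error. This is one way to see why compact arrays struggle with low-frequency sources.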
Finally, additional technical challenges, such as resource-constrained coding of color video plus depth information, emerge because we want to support real-time teleimmersion even with low-cost sensors, commodity CPUs, GPUs, and FPGAs, and the bandwidth-limited public Internet.
Given the many challenges that must be addressed, it is not surprising that state-of-the-art teleimmersive systems, such as the National Tele-Immersion Initiative [NTII11] and the University of Illinois at Urbana-Champaign’s own TEEVE project [TEEVE11], are typically very bulky, demand high bandwidth, and are too computationally intensive to run on an ordinary PC or laptop. These barriers seriously hinder their wide deployment in daily practice.
Our research methodology and novelty
Towards our vision, we are addressing various fundamental and practical challenges in building such a cost-effective telepresence system by focusing on the following aspects:
- Developing solid basic theory as well as innovative computational tools to process, represent, and interpret (often non-ideal) raw sensory data efficiently and effectively;
- Taking a holistic and integrative approach to solving sets of problems that are typically solved in isolation and by conventional means;
- Conducting application-driven, use-inspired system design and optimization by leveraging multi-disciplinary expertise.