University of Illinois Urbana-Champaign Presence at CVPR 2024

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024 will take place June 17-21 in Seattle, bringing together the world’s top researchers to present cutting-edge advances in computer vision, machine learning, and artificial intelligence. CVPR is the premier computer vision conference, attracting over 5,000 attendees.

Illinois @ CVPR 2024

41 Conference Papers

7 Oral & Highlight Papers

Conference Activities

67+ Researchers

55+ Partner Institutions  

Adobe, Bar-Ilan University, California Institute of Technology, Carnegie Mellon University, École Polytechnique Fédérale de Lausanne, Georgia Institute of Technology, Google, Harvard University, Indiana University, International Institute of Information Technology, Hyderabad, Johns Hopkins University, King Abdullah University of Science and Technology, KUIS AI, Kyung Hee University, Lapis Labs, Meta, Microsoft, Mistral AI, National University of Singapore, Nokia Bell Labs, Nvidia, Picsart, Pohang University of Science and Technology, Purdue University, SRI International, SambaNova Systems, Samsung, Shanghai Jiao Tong University, Simon Fraser University, Stability AI, Stanford University, Tencent, University of Texas at Austin, University of Tokyo, Toyota, Tsinghua University, University of California Berkeley, University of California Merced, University of Cambridge, University of Catania, University of Chicago, University of Minnesota, University of North Carolina at Chapel Hill, University of Pennsylvania, University of Southern California, University of Trento, University of Virginia, University of Washington, University of Wisconsin–Madison, Universidad de los Andes, University of Bristol, Virginia Polytechnic Institute and State University, Xi’an Jiaotong University, Yonsei University.

Oral & Highlight Publications

Oral and Highlight paper designations are highly selective. These papers are promoted individually by the program committee and typically received high scores during the review process. Oral and Highlight papers represent the top 3% and 10% of accepted papers, respectively. Illinois researchers co-authored 2 Oral and 5 Highlight papers.


Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives (CVPR’24 Oral)

Research Team: Kristen Grauman et al. (including Bikram Boote, Fiona Ryan, James M. Rehg)

Bikram Boote

Bikram Boote is a Research Engineer in the Health Care Engineering Systems Center at the University of Illinois. He holds an MS in Robotics from Georgia Tech. His interests lie in computer vision and deep learning, with applications in 3D and egocentric vision.

AI-driven understanding of human skill can enable applications such as picking up new skills with AR glasses. Ego-Exo4D is a foundational dataset that supports research in this direction, providing both egocentric and exocentric views of a wide range of skills.

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions — including a novel “expert commentary” done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community.

Figure. Ego-Exo4D offers egocentric video alongside multiple time-synchronized exocentric video streams for an array of skilled human activities—1,286 hours of ego and exo video in total. The data is both multiview and multimodal, and it is extensively annotated with language, 3D body and hand pose, keysteps, procedural dependencies, and proficiency ratings in support of our proposed benchmark tasks.

Why it matters

  • Both the egocentric and exocentric viewpoints are critical for capturing human skill, and Ego-Exo4D is a foundational dataset to support research on ego-exo video learning and multimodal perception.
  • Ego-Exo4D focuses on skilled single-person activities: 740 participants perform skilled physical and/or procedural activities—dance, soccer, basketball, bouldering, music, cooking, bike repair, health care—in an unscripted manner and in natural settings (e.g., gyms, soccer fields, kitchens, bike shops), exhibiting a variety of skill levels from novice to expert.
  • Alongside the dataset, the paper also introduces four benchmark task families: ego-exo relation, ego-exo recognition, ego-exo proficiency estimation, and ego pose.
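
To make the multiview, multimodal structure described above concrete, here is a minimal Python sketch of how a single time-synchronized ego-exo “take” might be organized. The class and field names (EgoExoTake, CameraStream, iter_benchmark_samples) are hypothetical illustrations, not the actual Ego-Exo4D API or file layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CameraStream:
    """One time-synchronized video stream plus its per-frame camera poses."""
    video_path: str
    poses: List[List[float]] = field(default_factory=list)  # hypothetical: flattened 4x4 pose per frame


@dataclass
class EgoExoTake:
    """One skilled-activity capture: an ego stream, several exo streams, and paired annotations."""
    take_id: str
    activity: str                       # e.g., "bike repair", "bouldering"
    ego: CameraStream
    exos: List[CameraStream]
    audio_path: Optional[str] = None    # multichannel audio
    gaze_path: Optional[str] = None     # eye-gaze track aligned to ego frames
    imu_path: Optional[str] = None
    point_cloud_path: Optional[str] = None
    narrations: Dict[str, List[str]] = field(default_factory=dict)  # e.g., {"expert_commentary": [...]}


def iter_benchmark_samples(takes: List[EgoExoTake], task: str):
    """Yield (take, task) pairs for one benchmark, e.g. 'proficiency_estimation'."""
    for take in takes:
        yield take, task


# Toy usage
take = EgoExoTake(
    take_id="demo_0001",
    activity="bike repair",
    ego=CameraStream(video_path="ego.mp4"),
    exos=[CameraStream(video_path=f"exo_{i}.mp4") for i in range(4)],
    narrations={"expert_commentary": ["Checks brake caliper alignment before adjusting the cable."]},
)
for t, task in iter_benchmark_samples([take], "proficiency_estimation"):
    print(t.take_id, t.activity, task)
```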


Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations (CVPR’24 Oral)

Research Team: Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, James M. Rehg

Sangmin Lee

Sangmin Lee is a postdoctoral researcher in Computer Science at the University of Illinois. His interests lie in expanding machine capabilities through multimodal perception and minimal supervision. Building on these foundations, his current research focuses on developing socially intelligent machines that can seamlessly understand and interact with humans in social contexts.

Multimodality and multi-party dynamics are key features of real-world social interactions. Tackling these challenges brings us closer to developing socially intelligent machines.

Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions.

Figure. We introduce three new multimodal social benchmarks for multi-party understanding. These tasks are challenging because they require understanding the fine-grained interplay of verbal and non-verbal cues exchanged between multiple people. To address this, we propose the necessity of densely aligned language-visual representations and introduce a novel baseline leveraging this concept.

Why it matters

  • We introduce new multimodal social benchmarks that are challenging because they require understanding the fine-grained dynamics between multiple people.
  • We propose a novel baseline model with densely aligned language-visual representations, which enables individual-level interpretation of verbal and non-verbal cues.
  • We validate the significance of densely aligned multimodal representations in multi-party interaction understanding.
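
The core idea of dense alignment, pooling each participant’s visual features over the exact time span of each utterance so that verbal and non-verbal cues can be modeled jointly, can be sketched in a few lines. This is a simplified illustration with hypothetical function and field names, not the paper’s actual model, which learns a fused representation on top of such aligned features.

```python
import numpy as np


def align_visual_to_utterances(visual_feats, fps, utterances):
    """Pool each person's per-frame visual features over each utterance's time span,
    yielding one visual vector per (utterance, person) that is aligned to the words.

    visual_feats: dict person_id -> array of shape (num_frames, feat_dim)
    fps: frames per second of the video
    utterances: list of dicts like {"speaker": id, "start": sec, "end": sec, "text": str}
    """
    aligned = []
    for utt in utterances:
        start_f = int(utt["start"] * fps)
        end_f = max(start_f + 1, int(utt["end"] * fps))
        per_person = {
            pid: feats[start_f:end_f].mean(axis=0)  # each person's non-verbal cue *during* this utterance
            for pid, feats in visual_feats.items()
        }
        aligned.append({"text": utt["text"], "speaker": utt["speaker"], "visual": per_person})
    return aligned


# Toy usage: two people, 100 frames of 16-dim visual features, one utterance by person 0
rng = np.random.default_rng(0)
visual = {0: rng.normal(size=(100, 16)), 1: rng.normal(size=(100, 16))}
utts = [{"speaker": 0, "start": 1.0, "end": 2.5, "text": "I think it's you."}]
out = align_visual_to_utterances(visual, fps=10, utterances=utts)
print(out[0]["text"], {pid: v.shape for pid, v in out[0]["visual"].items()})
```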


Putting the Object Back Into Video Object Segmentation (CVPR’24 Highlight)

Research Team: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, Alexander Schwing

Ho Kei Cheng

Ho Kei (Rex) is a Ph.D. student in the Computer Science Department at the University of Illinois, advised by Alexander Schwing. He works on machine learning algorithms for videos, including segmentation, tracking, editing, and generation.

Track your mask with Cutie today!

We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Via those, it interacts with the bottom-up pixel features iteratively with a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster.

Figure. We store pixel memory and object memory representations from past segmented (memory) frames. Pixel memory is retrieved for the query frame as pixel readout, which bidirectionally interacts with object queries and object memory in the object transformer. The object transformer blocks enrich the pixel feature with object-level semantics and produce the final object readout for decoding into the output mask.

Why it matters

  • Cutie combines pixel-level outputs and object-level reasoning in a single transformer, which permits flexible end-to-end learning.
  • Cutie is class-agnostic and works out of the box for many types of data at a low cost.
  • Cutie is currently the best open-source mask tracker for downstream tasks such as mask annotation, providing auxiliary inputs to multi-view tasks, and robot manipulation.
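
A minimal sketch of the query-based, object-level reading described above: a small set of learned object queries summarizes the target object from pixel features, and the enriched queries are then used to refine the pixel features. It omits memory, masked attention, and iteration, and the names (ObjectReadout, etc.) are illustrative rather than Cutie’s actual implementation.

```python
import torch
import torch.nn as nn


class ObjectReadout(nn.Module):
    """Illustrative object-level readout: learned object queries cross-attend to pixel
    features (bottom-up), then the enriched queries refine the pixel features (top-down)."""

    def __init__(self, dim=256, num_queries=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.obj_from_pixels = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pixels_from_obj = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_p = nn.LayerNorm(dim)

    def forward(self, pixel_feats):
        # pixel_feats: (B, H*W, dim) pixel readout
        B = pixel_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)        # (B, Nq, dim)
        # Object queries act as a high-level summary of the target object
        obj, _ = self.obj_from_pixels(q, pixel_feats, pixel_feats)
        obj = self.norm_q(obj + q)
        # Pixels are enriched with object-level semantics for accurate segmentation
        refined, _ = self.pixels_from_obj(pixel_feats, obj, obj)
        refined = self.norm_p(refined + pixel_feats)
        return refined, obj


# Toy usage: a 24x24 feature map with 256 channels
feats = torch.randn(1, 24 * 24, 256)
refined, obj_tokens = ObjectReadout()(feats)
print(refined.shape, obj_tokens.shape)  # torch.Size([1, 576, 256]) torch.Size([1, 16, 256])
```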

OpenBias: Open-set Bias Detection in Text-to-Image Generative Models (CVPR’24 Highlight)

Research Team: Moreno D’Incà, Elia Peruzzo, Massimiliano Mancini, Dejia Xu, Vidit Goel, Xingqian Xu, Zhangyang Wang, Humphrey Shi, Nicu Sebe

Xingqian Xu

Xingqian Xu is a senior research scientist at Picsart AI Research and a Ph.D. graduate of the University of Illinois ECE IFP group, where he was advised by Humphrey Shi and Thomas Huang. He also earned his Bachelor’s and Master’s degrees in Electrical Engineering from Illinois. His research focuses on generative AI and computer vision, particularly diffusion-based text-to-image multimodal generative AI.

Text-to-image generative models are becoming increasingly popular and accessible to the general public. As these models see large-scale deployments, it is necessary to deeply investigate their safety and fairness to not disseminate and perpetuate any kind of biases. However, existing works focus on detecting closed sets of biases defined a priori, limiting the studies to well-known concepts. In this paper, we tackle the challenge of open-set bias detection in text-to-image generative models presenting OpenBias, a new pipeline that identifies and quantifies the severity of biases agnostically, without access to any precompiled set. OpenBias has three stages. In the first phase, we leverage a Large Language Model (LLM) to propose biases given a set of captions. Secondly, the target generative model produces images using the same set of captions. Lastly, a Vision Question Answering model recognizes the presence and extent of the previously proposed biases. We study the behavior of Stable Diffusion 1.5, 2, and XL emphasizing new biases, never investigated before. Via quantitative experiments, we demonstrate that OpenBias agrees with current closed-set bias detection methods and human judgement.

Figure. OpenBias pipeline. Starting with a dataset of real textual captions (T), we leverage a Large Language Model (LLM) to build a knowledge base B of possible biases that may occur during the image generation process. In the second stage, synthesized images are generated using the target generative model conditioned on captions where a potential bias has been identified. Finally, the biases are assessed and quantified by querying a VQA model with caption-specific questions extracted during the bias-proposal phase.

Why it matters

  • OpenBias is the first to study the problem of open-set bias detection at large scale without relying on a predefined list of biases. The method discovers novel biases that have never been studied before.
  • OpenBias is also a modular pipeline: given a list of prompts, it leverages an LLM to extract a knowledge base of possible biases and a VQA model to recognize and quantify them.
  • The pipeline is applied to multiple text-to-image generative models, such as Stable Diffusion 1.5, 2, and XL, and it agrees with closed-set classifier-based methods and with human judgement.
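
The three-stage pipeline described above can be sketched as a small orchestration function. The callables, names, and the simple class-frequency scoring below are assumptions for illustration, not OpenBias’s actual code or metrics; real LLM, text-to-image, and VQA models would be plugged in.

```python
from collections import Counter


def open_set_bias_audit(captions, propose_biases, generate_images, vqa_answer, images_per_caption=4):
    """Illustrative three-stage OpenBias-style audit.
      1) propose_biases(captions) -> {bias: {"question": str, "classes": [...], "captions": [...]}}
      2) generate_images(caption, n) -> list of images from the target text-to-image model
      3) vqa_answer(image, question, classes) -> one of the proposed classes
    """
    report = {}
    proposals = propose_biases(captions)                    # stage 1: LLM proposes biases
    for bias, spec in proposals.items():
        counts = Counter()
        for caption in spec["captions"]:
            for img in generate_images(caption, images_per_caption):   # stage 2: generate images
                counts[vqa_answer(img, spec["question"], spec["classes"])] += 1  # stage 3: VQA
        total = sum(counts.values()) or 1
        # Report the class distribution; a skewed distribution signals a bias.
        report[bias] = {cls: counts[cls] / total for cls in spec["classes"]}
    return report


# Toy usage with stub models (no real LLM / diffusion / VQA calls):
captions = ["a photo of a doctor at work"]
stub_propose = lambda caps: {"gender": {"question": "What is the person's gender?",
                                        "classes": ["male", "female"], "captions": caps}}
stub_generate = lambda cap, n: [f"img_{i}" for i in range(n)]
stub_vqa = lambda img, q, classes: classes[0]  # pretend the VQA model always answers "male"
print(open_set_bias_audit(captions, stub_propose, stub_generate, stub_vqa))
```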

RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models (CVPR’24 Highlight)

Research Team: Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, Pinar Yanardag

Ozgur Kara

Ozgur Kara is a Ph.D. student in Computer Science at the University of Illinois (previously a Ph.D. student at Georgia Tech). He earned his Bachelor’s degree in Electrical and Electronics Engineering from Boğaziçi University. His research focuses on generative AI and computer vision, particularly their applications to video.

Let’s RAVE!

Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt to produce high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy, leveraging spatio-temporal interactions between frames, to produce temporally consistent videos faster than existing methods. It is also efficient in terms of memory requirements, allowing it to handle longer videos. RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations. In order to demonstrate the versatility of RAVE, we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities like dancing and typing, and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods.

Figure. Our process begins by performing a DDIM inversion with the pre-trained T2I model and condition extraction with an off-the-shelf condition preprocessor applied to the input video. These conditions are subsequently input into ControlNet. In the RAVE video editing process, diffusion denoising is performed for T timesteps using condition grids, latent grids, and the target text prompt as input for ControlNet. Random shuffling is applied to the latent grids and condition grids at each denoising step. After T timesteps, the latent grids are rearranged, and the final output video is obtained.

Why it matters

  • RAVE’s zero-shot capability allows users to employ any off-the-shelf T2I model, making it compatible with any fine-tuned stable diffusion model.
  • RAVE’s speed makes it practical for real-world applications, and it supports videos of any length.
  • The released dataset aims to standardize the evaluation of text-guided video editing to ensure scientific rigor.
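
Below is a minimal sketch of the noise-shuffling idea, assuming per-frame latents and a placeholder denoise_grid callable that stands in for RAVE’s spatial tiling plus the T2I UNet/ControlNet call; the names and simplifications are illustrative, not RAVE’s actual implementation.

```python
import torch


def rave_denoising_loop(latents, conditions, denoise_grid, timesteps, grid_size=3):
    """Illustrative RAVE-style loop: at every denoising step, apply the SAME random
    permutation to per-frame latents and their conditions, group them into
    grid_size**2-frame grids that are denoised jointly, then undo the permutation.

    denoise_grid(grid_latents, grid_conditions, t) is a placeholder for tiling the group
    into one latent grid and running the T2I UNet + ControlNet.
    latents, conditions: (F, C, H, W) tensors with F divisible by grid_size**2 (simplifying assumption).
    """
    num_frames = latents.shape[0]
    n = grid_size * grid_size
    for t in timesteps:
        perm = torch.randperm(num_frames)       # fresh shuffle at every step
        inv = torch.argsort(perm)               # inverse permutation
        lat, cond = latents[perm], conditions[perm]
        out = torch.empty_like(lat)
        for g in range(num_frames // n):        # each grid mixes frames from across the video
            sl = slice(g * n, (g + 1) * n)
            out[sl] = denoise_grid(lat[sl], cond[sl], t)
        latents = out[inv]                      # restore the original frame order
    return latents


# Toy usage with a stand-in "denoiser" that simply shrinks its input
frames, cond = torch.randn(9, 4, 8, 8), torch.randn(9, 4, 8, 8)
edited = rave_denoising_loop(frames, cond, lambda x, c, t: 0.9 * x, timesteps=range(5))
print(edited.shape)  # torch.Size([9, 4, 8, 8])
```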

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (CVPR’24 Highlight)

Research Team: Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

Savya Khosla

Savya Khosla is an incoming Ph.D. student in the Department of Computer Science at the University of Illinois, advised by Professor Derek Hoiem. His research focuses on multimodal learning and long-form video understanding.

“Enabling machines to learn from multiple data modalities provides them with a more human-like understanding of our world, allowing them to make better decisions, solve complex problems, and interact more effectively with humans and their environment.”

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs — images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation.

Figure. Unified-IO 2 architecture. Input text, images, audio, or image/audio history are encoded into sequences of embeddings which are concatenated and used as input to an encoder-decoder transformer model. The transformer outputs discrete tokens that can be decoded into text, an image, or an audio clip.

Why it matters

  • Unified-IO 2 is the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action.
  • This model was trained from scratch on multimodal data and further refined with instruction tuning on a massive multimodal corpus.
  • This work further proposes several architectural techniques to overcome the stability and scalability issues during training.
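
A minimal sketch of the tokenize-and-unify idea described above: each modality is discretized and mapped into its own range of a single shared vocabulary so that one encoder-decoder can consume and produce all modalities. The vocabulary sizes, offsets, and special tokens below are assumptions for illustration, not Unified-IO 2’s actual tokenizers.

```python
# Vocabulary sizes, offsets, and special tokens are illustrative assumptions.
TEXT_VOCAB = 32_000     # e.g., a SentencePiece text vocabulary
IMAGE_CODES = 16_384    # e.g., codes from a VQ image tokenizer
AUDIO_CODES = 8_192     # e.g., codes from a VQ audio (spectrogram) tokenizer

# Give each modality its own non-overlapping id range within one shared vocabulary.
IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_CODES
SPECIAL = {"<image>": AUDIO_OFFSET + AUDIO_CODES, "<audio>": AUDIO_OFFSET + AUDIO_CODES + 1}


def unify(text_ids, image_codes=None, audio_codes=None):
    """Concatenate modality tokens into one discrete sequence an encoder-decoder can consume."""
    seq = list(text_ids)
    if image_codes is not None:
        seq += [SPECIAL["<image>"]] + [IMAGE_OFFSET + c for c in image_codes]
    if audio_codes is not None:
        seq += [SPECIAL["<audio>"]] + [AUDIO_OFFSET + c for c in audio_codes]
    return seq


# Toy usage: a short prompt, a 4-token "image", and a 3-token "audio clip"
tokens = unify(text_ids=[17, 902, 5], image_codes=[1, 2, 3, 4], audio_codes=[7, 8, 9])
print(len(tokens), tokens[:6])  # 12 tokens; text ids first, then the <image> marker and image codes
```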


Project Lead and Web Development: Sangmin Lee and Jim Rehg