Artificial Intelligence in Action

From voice-powered personal assistants like Siri and Alexa to more fundamental underlying technologies such as behavioral algorithms, suggestive searches, and self-driving vehicles with powerful predictive capabilities, artificial intelligence has only started to revolutionize our lives. A multitude of exciting possibilities in fields like computer vision, natural language processing, medicine, biology, industry, manufacturing, security, education, virtual environments, games, and others are yet to be explored.

The ‘Artificial Intelligence In Action’ session aims to bring together student researchers and practitioners to present their latest achievements and innovations in different areas of artificial intelligence.

Time and place – Friday Feb. 8, 2pm-5pm, CSL B02 

Keynote Speaker: Antonio Torralba – MIT

Learning to See

Summary – It is an exciting time for computer vision. With the success of new computational architectures for visual processing, such as deep neural networks (e.g., convNets), and access to image databases with millions of labeled examples (e.g., ImageNet, Places), the state of the art in computer vision is advancing rapidly. Computer vision is now present in many commercial products, such as digital cameras, web applications, and security applications. The performance achieved by convNets is remarkable and constitutes the state of the art on many recognition tasks. But why does it work so well? What is the nature of the internal representation learned by the network? I will show that the internal representation can be interpretable. In particular, object detectors emerge in a scene classification task. Then, I will show that an ambient audio signal can be used as a supervisory signal for learning visual representations. We do this by taking advantage of the fact that vision and hearing often tell us about similar structures in the world, such as when we see an object and simultaneously hear it make a sound. We train a convNet to predict ambient sound from video frames, and we show that, through this process, the model learns a visual representation that conveys significant information about objects and scenes. I will also show how we can use raw speech descriptions of images to jointly learn to segment words in speech and objects in images without any additional supervision.

Bio – Antonio Torralba is a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT), the MIT director of the MIT-IBM Watson AI Lab, and the inaugural director of the MIT Quest for Intelligence, an MIT campus-wide initiative to discover the foundations of intelligence. He received a degree in telecommunications engineering from Telecom BCN, Spain, in 1994 and a Ph.D. degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, France, in 2000. From 2000 to 2005, he completed postdoctoral training at the Brain and Cognitive Science Department and the Computer Science and Artificial Intelligence Laboratory, MIT, where he is now a professor.

Pulkit Agrawal – UC Berkeley

Computational Sensorimotor Learning

Summary – An open question in artificial intelligence is how to endow agents with the common sense knowledge that humans naturally seem to possess. A prominent theory in child development posits that human infants gradually acquire such knowledge through the process of experimentation. According to this theory, even the seemingly frivolous play of infants is a mechanism for them to conduct experiments to learn about their environment. Inspired by this view of biological sensorimotor learning, I will present my work on building artificial agents that use the paradigm of experimentation to explore and condense their experience into models that enable them to solve new problems. I will discuss the effectiveness of my approach and open issues using case studies of a robot learning to push objects, manipulate ropes, and find its way in office environments, and of an agent learning to play video games based merely on the incentive of conducting experiments.

Bio – Pulkit is a co-founder of SafelyYou Inc. and holds a Ph.D. in computer science from UC Berkeley. His research interests span robotics, deep learning, computer vision and computational neuroscience. Pulkit completed his bachelor’s in electrical engineering at IIT Kanpur and was awarded the Director’s Gold Medal. His work has appeared multiple times in MIT Tech Review, Quanta, New Scientist, NY Post, etc. He is a recipient of the Signatures Fellow Award, Fulbright Science and Technology Award, Goldman Sachs Global Leadership Award, OPJEMS, Sridhar Memorial Prize and IIT Kanpur’s Academic Excellence Awards, among others. Pulkit holds a “Sangeet Prabhakar” (equivalent to a bachelor’s degree in Indian classical music) and occasionally performs in music concerts.

Tanmay Gangwani – UIUC

Reinforcement Learning via Self-imitation

Summary – The success of popular algorithms for deep reinforcement learning, such as policy-gradients and Q-learning, relies heavily on the availability of an informative reward signal at each time step of the sequential decision-making process. When rewards are only sparsely available during an episode, or rewarding feedback is provided only after episode termination, these algorithms perform sub-optimally due to the difficulty in credit assignment. Alternatively, trajectory-based policy optimization methods, such as the cross-entropy method and evolution strategies, do not require per-time step rewards, but have been found to suffer from high sample complexity by completely forgoing the temporal nature of the problem. Improving the efficiency of RL algorithms in real-world problems with sparse or episodic rewards is therefore a pressing need. In this talk, we introduce a self-imitation learning algorithm that exploits and explores well in the sparse and episodic reward settings. We view each policy as a state-action visitation distribution and formulate policy optimization as a divergence minimization problem. We show that with Jensen-Shannon divergence, this divergence minimization problem can be reduced to a policy-gradient algorithm with shaped rewards learned using an experience-replay buffer. Experimental results indicate that our algorithm performs comparably to existing algorithms in environments with dense rewards, and significantly better in environments with sparse and episodic rewards. We then discuss limitations of self-imitation learning and propose to solve them by using Stein variational policy gradient descent with the Jensen-Shannon kernel to learn multiple diverse policies. We demonstrate its effectiveness on a challenging variant of continuous-control MuJoCo locomotion tasks.
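As a small illustration of the quantity named in the summary, the sketch below computes the Jensen-Shannon divergence between two discrete distributions. It is not the speaker's algorithm: the function names `kl_divergence` and `js_divergence` are hypothetical, and the toy distributions merely stand in for state-action visitation distributions, which in practice are not available in closed form.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical visitation distributions over four state-action pairs.
current_policy = [0.70, 0.10, 0.10, 0.10]
good_trajectories = [0.10, 0.10, 0.10, 0.70]

print(js_divergence(current_policy, good_trajectories))
print(js_divergence(current_policy, current_policy))  # identical inputs give 0.0
```

Unlike the KL divergence, this quantity is symmetric and bounded above by log 2 (in nats), so it stays finite even when the two distributions barely overlap.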

Bio – Tanmay Gangwani is a Ph.D. candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He is advised by Prof. Jian Peng. His research focuses on challenges in deep reinforcement learning, such as sample complexity, efficient exploration, imitation, and model learning. Previously, he earned a master’s degree from the same university, with a specialization in computer architecture.

Aditya Deshpande – UIUC

Learning multiple solutions for computer vision problems

Summary – Deep nets trained to regress to a single value or classify to a single class label are the workhorse of many computer vision applications. However, some computer vision problems are ambiguous, i.e., they have more than one plausible solution. Therefore, we need methods that (a) can model the multi-modal output distribution, and (b) produce diverse and meaningful solutions from the estimated (multi-modal) distribution. In this talk, I will demonstrate our research on the ambiguous problems of colorization and image captioning. Note that more than one colorization is feasible for a grey-level image, and that a given image can be described by captions in various ways. In colorization, we first use variational autoencoders and mixture density networks to model the multi-modal output space of colorizations and then infer diverse colorizations. For image captioning, we use convolutional networks to perform captioning, which was previously done using recurrent networks. We show that convolutional networks achieve similar accuracy while producing more entropy in the output posteriors and, therefore, more diversity in output captions.
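To make the link between posterior entropy and caption diversity concrete, here is a minimal sketch that compares the Shannon entropy of two next-word distributions. The probabilities are made up for illustration and do not come from the speaker's models; `entropy` is a hypothetical helper name.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    # Zero-probability entries are skipped, following the 0 * log(0) = 0 convention.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-word posteriors over a four-word vocabulary.
peaked_posterior = [0.97, 0.01, 0.01, 0.01]  # almost always the same word
flatter_posterior = [0.40, 0.30, 0.20, 0.10]  # several plausible words

print(entropy(peaked_posterior))
print(entropy(flatter_posterior))
```

Sampling from the flatter posterior picks different words more often, which is one way higher entropy can translate into more varied captions.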

Bio – Aditya Deshpande is a final-year PhD student in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He is advised by Prof. David Forsyth and Prof. Alexander Schwing. Aditya’s research is centered around the fields of machine learning and computer vision. His thesis research deals with models for learning multi-modal distributions and their application to different tasks, e.g., captioning and colorization. As an undergraduate, he worked on research problems in the areas of GPU computing and structure from motion. His paper “Can GPUs Sort Strings Efficiently?” won the Best GPU Paper Award at the IEEE International Conference on High Performance Computing, 2013.

Yuan-Ting Hu – UIUC

Semantic Amodal Instance Level Video Object Segmentation – A Dataset and Baselines

Summary – Semantic amodal instance level video object segmentation, i.e., semantic segmentation of individual objects in videos even under occlusion, is an important problem for sophisticated occlusion reasoning, depth ordering, and object size prediction. In particular, the temporal sequence provided by a densely and semantically labeled video dataset is increasingly important since it enables assessment of temporal reasoning and evaluation of methods which anticipate the behavior of objects and humans. Despite these benefits, even for images, amodal segmentation has not been considered until very recently. While the problem is ill-posed, it has been shown that humans are able to predict the occluded regions with high degrees of confidence and consistency. However, the lack of available data makes amodal image segmentation a challenging endeavor, even today. In this talk, I will present a new dataset for semantic amodal instance level video object segmentation and the methodology for collecting the dataset by leveraging Grand Theft Auto V (GTA V). I will also show the benefits of the proposed dataset and discuss some of the possibilities of the dataset.

Bio – Yuan-Ting Hu is a Ph.D. candidate in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, under the supervision of Prof. Alexander Schwing. Her research interests include computer vision and machine learning. She is particularly interested in video analysis and its applications.

Unnat Jain – UIUC

Two Body Problem: Collaborative Visual Task Completion

Summary – Collaboration is a necessary skill to perform tasks that are beyond one agent’s capabilities. Addressed extensively in both conventional and modern AI, multi-agent collaboration has often been studied in the context of simple grid worlds. We argue that there are inherently visual aspects to collaboration which should be studied in visually rich environments. A key element in collaboration is communication that can be either explicit, through messages, or implicit, through perception of the other agents and the visual world. Learning to collaborate in a visual environment entails learning (1) to perform the task, (2) when and what to communicate, and (3) how to act based on these communications and the perception of the visual world. We compare performance of collaborative tasks in photorealistic visual environments to an analogous grid-world environment, to establish that the former are more challenging. We also provide a statistical interpretation of the communication strategy learned by the agents. To summarize, in this work we study the problem of learning to collaborate directly from pixels in AI2-THOR and demonstrate the benefits of explicit and implicit modes of communication to perform visual tasks.

Bio – Unnat is a PhD student in the Department of Computer Science at UIUC. His current research is focused on applications of machine learning and multi-agent reinforcement learning to computer vision problems. His work with Prof. Alexander Schwing and Prof. Svetlana Lazebnik has been published at top conferences like CVPR. Exploring his interests in AI, he worked as an intern with the Perception team in Uber’s self-driving project and as a research intern at the Allen Institute for Artificial Intelligence. Prior to joining the PhD program, Unnat graduated with the best master’s thesis award at CS@Illinois and the best thesis award across all engineering departments from IIT Kanpur. He is also the recipient of the Siebel Scholars Award at UIUC and the Director’s Gold Medal at IIT Kanpur for his all-round achievement and leadership.

Anand Bhattad – UIUC

Big but Invisible Adversarial Attacks

Summary – An adversarial example is an image that has been adjusted to cause a classifier to report the wrong label. Adversarial examples are interesting because some changes that cause labels to flip are imperceptible to people. Current constructions search for small, noise-like changes to avoid detection by human users; typically, the search controls the norm of the change. This paper shows how to construct large changes to images that confuse deep network classifiers, but still keep the image looking natural to human subjects. The paper demonstrates three novel constructions which achieve high attack success rates while preserving the perception that the image is natural. These attacks exploit solutions to fundamental computer vision problems (texture transfer, colorization, super-resolution) to apply major image changes that aren’t noticeable to users. For example, our colorization attack adjusts colors in large regions of the image (achieving high Lp norm changes) but does so in a way that the picture still looks natural to a user. We do a thorough analysis of each of our attacks and also conduct a user study to show that humans cannot discriminate between benign images and these generated adversarial images.
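For readers unfamiliar with the norm mentioned above, the following minimal sketch computes the Lp norm of a perturbation, i.e., of the per-pixel difference between an adversarial image and the original. The helper name `lp_norm` and the toy perturbation values are assumptions for illustration only and are not taken from the paper.

```python
def lp_norm(delta, p):
    """Lp norm of a perturbation given as a flat list of per-pixel differences."""
    if p == float("inf"):
        # L-infinity norm: the largest absolute change to any single pixel.
        return max(abs(d) for d in delta)
    return sum(abs(d) ** p for d in delta) ** (1.0 / p)

# Hypothetical perturbations on a six-pixel image with intensities in [0, 1].
noise_like = [0.01, -0.01, 0.01, -0.01, 0.01, -0.01]  # small change everywhere
region_shift = [0.30, 0.30, 0.30, 0.00, 0.00, 0.00]   # large change in one region

for name, delta in [("noise-like", noise_like), ("region shift", region_shift)]:
    print(name, lp_norm(delta, 2), lp_norm(delta, float("inf")))
```

A norm-bounded attack would keep these values small; the attacks described in the summary instead allow them to be large, as in the second toy perturbation, and rely on the change looking natural.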

Bio – Anand Bhattad is a first-year Ph.D. student advised by Prof. David Forsyth in the ECE department of UIUC. His research interest lies at the intersection of Computer Vision, Computational Photography and Machine Learning. He develops algorithms for scene understanding and image manipulation. Webpage: