Challenges in Rebooting Autonomy with Deep Learned Perception

Synopsis

Deep learning (DL) models are becoming effective at solving computer-vision tasks such as semantic segmentation, object tracking, and pose estimation on real-world captured images. Reliability analysis of autonomous systems that use these DL models as part of their perception systems has to account for the performance of these models. Autonomous systems with traditional sensors have tried-and-tested reliability assessment processes built on modular design, unit tests, system integration, compositional verification, certification, etc. In contrast, DL perception modules rely on data-driven or learned models. These models do not capture uncertainty and often lack robustness. Moreover, these models are often updated throughout the lifecycle of the product as new data sets become available, and integrating an updated DL-based perception module requires rebooting and restarting the reliability assessment and operation processes for the autonomous system. On this webpage, we provide references and implementations for the following two concrete, open-source examples based on the Microsoft AirSim simulator:

  • Drone racing with vision-based gate detection
  • Swarm formation using vision-based relative positioning

In our industrial pitch paper published at EMSOFT 2022,1 we discuss three challenges related to specifying, verifying, and operating systems with DL-based perception modules, and we use these two systems to illustrate the challenges. We would like to see these three challenges addressed by the community.

1Please refer to https://doi.org/10.1109/EMSOFT55006.2022.00016 for our industrial track paper. A preprint version is also available at https://mitras.ece.illinois.edu/research/2022/Industry-EMSOFT22.pdf.

Drone Racing

Our first example is built on top of the AirSim Drone Racing Lab (ADRL) [1], available at https://github.com/microsoft/AirSim-Drone-Racing-Lab/blob/master/docs/adrl_overview.md. We particularly consider Tier II: Perception Only. In this tier, the next gate is not always in view, but the noisy pose provided by the ADRL API can steer the participants roughly in the right direction. Vision-based perception and control are necessary to first detect the gate and then navigate the drone through it.
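As a rough illustration, the sketch below queries the noisy gate pose and commands the drone toward it until a DL-based detector could take over. It is a minimal sketch using the generic AirSim Python client, assuming the ADRL client exposes a similar interface; the gate object name is a placeholder that depends on the loaded racing level.

```python
import airsim  # generic AirSim Python client; the ADRL client exposes a similar interface

# Hypothetical gate object name; actual names depend on the loaded racing level.
GATE_NAME = "Gate00"

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

# Tier II: the simulator provides only a *noisy* pose of the next gate.
# Use it to steer roughly in the right direction.
noisy_pose = client.simGetObjectPose(GATE_NAME)
p = noisy_pose.position
client.moveToPositionAsync(p.x_val, p.y_val, p.z_val, velocity=3.0).join()

# Once the gate is in view, gate-pose estimates from the DL-based perception
# pipeline (Figure 1) would replace the noisy API pose for the final approach.
```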

Figure 1. Sub-systems in the drone racing autonomous system. Control decisions are made based on the gate pose estimates obtained from a DL-based perception pipeline.

Below we provide links to the relevant GitHub repositories for ADRL as well as the vision-based gate detection provided by the competitors. We do not implement or own this code. We have exchanged emails with the competitors; thanks to Team USRG and Spleenlab AI for replying and providing their open-source implementations.

Swarm Formation

Our second example is built on top of the Microsoft AirSim simulator [2], available at https://microsoft.github.io/AirSim/. The vision-based formation control system is inspired by Fathian et al. [3]. The distributed formation control system consists of N identical aerial vehicles, or agents, as shown in Figure 2. Each agent has a downward-facing camera, and each ego agent uses images from its own camera and its neighbors' cameras to periodically estimate the relative positions of its neighbors with respect to itself. Based on these estimated relative positions to all its neighbors, the ego agent then updates its own position by setting a velocity so as to achieve the target formation. The linear velocity control follows the classical displacement-based formation control algorithm from the textbook [4]. A concise description of the algorithm is available in Section 5 of the survey [5].
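A minimal sketch of this displacement-based control law is given below, assuming the relative positions estimated by the vision pipeline are already available as 3-vectors; the function name, gain value, and neighbor identifiers are illustrative, not taken from our implementation.

```python
import numpy as np

def displacement_control(rel_positions, desired_offsets, gain=0.5):
    """Compute the ego agent's velocity command from estimated relative positions.

    rel_positions:    dict neighbor_id -> estimated relative position p_j - p_i (3-vector)
    desired_offsets:  dict neighbor_id -> desired relative position in the target formation
    gain:             control gain (assumed value; tune for the platform)
    """
    v = np.zeros(3)
    for j, rel in rel_positions.items():
        # Drive the error between the estimated and the desired displacement to zero.
        v += gain * (np.asarray(rel) - np.asarray(desired_offsets[j]))
    return v

# Example: ego agent with two neighbors and a triangular target formation.
rel = {"drone1": [2.1, 0.2, 0.0], "drone2": [1.0, 1.9, 0.0]}   # from the vision pipeline
des = {"drone1": [2.0, 0.0, 0.0], "drone2": [1.0, 2.0, 0.0]}   # target formation offsets
print(displacement_control(rel, des))  # velocity command for the ego agent
```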

For simplicity, our implementation ensures that the whole system runs in a synchronous mode, i.e., all drones capture images at the same time, and there is no communication delay between drones while sharing the images. Our OpenCV-based implementation first uses the ORB algorithm for feature detection and then the FLANN algorithm for feature matching to find pairs of matching pixels. The relative position is reconstructed from these pixel pairs by finding and decomposing the homography mapping between the two images [6]. Detailed installation instructions and simulation scripts are provided in the repository below.
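The sketch below outlines this ORB + FLANN + homography pipeline with standard OpenCV calls; the function name and the availability of the camera intrinsic matrix K are assumptions, and the decomposed translation is only defined up to scale.

```python
import cv2
import numpy as np

def estimate_relative_position(img_ego, img_neighbor, K):
    """Sketch of the ORB + FLANN + homography pipeline described above.

    img_ego, img_neighbor: grayscale downward-facing camera images
    K: 3x3 camera intrinsic matrix (assumed known from the AirSim camera settings)
    """
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img_ego, None)
    kp2, des2 = orb.detectAndCompute(img_neighbor, None)

    # FLANN with an LSH index, the variant suited to ORB's binary descriptors.
    index_params = dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1)
    flann = cv2.FlannBasedMatcher(index_params, dict(checks=50))
    matches = flann.knnMatch(des1, des2, k=2)

    # Lowe's ratio test to keep only confident matches.
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.7 * pair[1].distance:
            good.append(pair[0])
    if len(good) < 8:
        return None  # not enough matches to estimate a homography

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    # Decomposition yields up to four (R, t, n) candidates; t is known only up to
    # scale, so the true relative position still requires disambiguation and a
    # metric scale (e.g. from the known altitude of the downward-facing cameras).
    _, rotations, translations, normals = cv2.decomposeHomographyMat(H, K)
    return translations
```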

Figure 2. Architecture of an agent in formation control systems with vision-based relative positioning.

References

[1] R. Madaan, N. Gyde, S. Vemprala, M. Brown, K. Nagami, T. Taubner, E. Cristofalo, D. Scaramuzza, M. Schwager, and A. Kapoor, “AirSim Drone Racing Lab,” in Proc. NeurIPS 2019 Competition and Demonstration Track, ser. Proc. Machine Learning Research, vol. 123. PMLR, Dec. 2020, pp. 177–191.
[2] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles,” in Field and Service Robotics, M. Hutter and R. Siegwart, Eds. Cham: Springer International Publishing, 2018, pp. 621–635.
[3] K. Fathian, E. Doucette, J. W. Curtis and N. R. Gans, “Vision-Based Distributed Formation Control of Unmanned Aerial Vehicles,” 2018, arXiv:1809.00096, doi: 10.48550/arXiv.1809.00096.
[4] M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton, NJ: Princeton University Press, 2010.
[5] K. Oh, M. Park, and H. Ahn, “A survey of multi-agent formation control,” Automatica, vol. 53, pp. 424–440, 2015, doi: 10.1016/j.automatica.2014.10.022.
[6] E. Malis and M. Vargas, “Deeper understanding of the homography decomposition for vision-based control,” INRIA, Research Report RR-6303, 2007. [Online]. Available: https://hal.inria.fr/inria-00174036