Thrust 5: Prediction and trend recognition by AI and machine learning

Once pathogen diagnostic sensors have been deployed, e.g., as part of a mobile crowdsourcing system, a variety of artificial intelligence (AI) algorithms can be used to perform public health surveillance tasks. Because physical sensors are inherently subject to uncertainty and may not provide complete coverage, these algorithms must be robust to noisy and incomplete data. In all of these settings, it is valuable to combine physical models with data-driven techniques, where the physics describe not just the sensor but also the phenomenon, such as human mobility patterns or disease-spreading dynamics. This approach is often called physics-based AI.

As a typical example task, consider localizing the source of an infection (so-called patient zero) spreading through a social network. We previously developed an AI algorithm that localizes the source of such an infection from noisy and incomplete snapshot data in a directed acyclic graph, proved that it is Bayes-optimal under certain technical conditions, and demonstrated its efficacy in numerous examples. Here we propose to extend this work beyond directed acyclic graphs to undirected graphs, simplicial complexes, and hypergraphs, which may better model the transmission of infectious disease in social settings where several people interact simultaneously. We also propose to extend the approach to multiple infection sources that arrive at different times, as may happen under community spread, building on our previous work in this direction [81]. The same algorithmic idea can be extended to develop statistical privacy-preserving techniques for contact tracing, in contrast to cryptographic ones. In this approach, privacy is preserved by adding noise to the data and omitting some of it; the aforementioned source-localization algorithms can then be adapted to infer contacts from the resulting noisy, incomplete data.
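To make the source-localization task concrete, the following is a minimal sketch under strong simplifying assumptions, not the Bayes-optimal algorithm referred to above. It assumes the infection spreads deterministically to every node within a fixed number of directed hops of the source, that each observed node's status is flipped independently with probability `flip_prob`, and that unobserved nodes are simply missing from the snapshot. The function names, the spread model, and the noise model are all illustrative assumptions.

```python
import math
from collections import deque


def reachable_within(graph, source, hops):
    """Nodes reachable from `source` in at most `hops` directed steps (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == hops:
            continue  # do not expand beyond the hop budget
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def log_likelihood(graph, source, hops, snapshot, flip_prob):
    """Log-likelihood of the noisy snapshot for one (source, hops) hypothesis."""
    infected = reachable_within(graph, source, hops)
    ll = 0.0
    for node, observed in snapshot.items():  # unobserved nodes are skipped
        truth = node in infected
        ll += math.log(1.0 - flip_prob if observed == truth else flip_prob)
    return ll


def localize_source(graph, snapshot, max_hops, flip_prob=0.1):
    """Maximum-likelihood (source, hops) pair under the assumed toy model."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    candidates = (
        (log_likelihood(graph, s, h, snapshot, flip_prob), s, h)
        for s in sorted(nodes)
        for h in range(max_hops + 1)
    )
    _, source, hops = max(candidates, key=lambda item: item[0])
    return source, hops


# Hypothetical example: a small directed acyclic contact graph.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
# Incomplete snapshot: node "a" was never observed.
snapshot = {"b": True, "c": True, "d": False, "e": False}
estimate = localize_source(graph, snapshot, max_hops=3)
```

In this toy example the estimate points to the unobserved node `a`, since it is the only hypothesis consistent with all four observations. A realistic treatment would replace the deterministic spread with a stochastic epidemic model and would need to handle the undirected, simplicial, and hypergraph settings proposed above.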

As another example task, we can develop ways to reduce the number of physical tests needed to make inferences about the disease status of individuals, neighborhoods, or communities. Of particular note, it may be possible to perform localized inference without determining the disease state of any particular individual. Group testing is the problem of identifying a small set of infected people within a larger population, or of estimating the infection rate in various subsets, using as few tests as possible. It is relevant in infectious disease surveillance when resources are limited or when privacy must be preserved. The basic idea is to pool samples and test them together. Since physical tests may be noisy, redundancy must be added to achieve reliable results, much as with error-correcting codes in communications (the Shannon limit for communication is also a fundamental limit on the number of required tests). Here we propose a significant extension of the standard formulation of group testing that considers not only noise in physical tests but also (1) potentially complicated correlation structure among people's infection states and (2) side information on their infection status, including symptom reporting, wastewater surveillance, and network structure. Because these complications make analytical approaches untenable, we propose a new machine learning architecture to design adaptive pooling protocols. The architecture is designed by analogy to optimal posterior matching in joint source-channel coding from information theory. Experiments show a significant reduction in the number of tests compared to baselines that do not consider correlation structure or side information. We also plan to compare against newly derived information-theoretic limits to show that the learned schemes are nearly optimal.
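The ingredients of noisy adaptive group testing can be illustrated with a small exact Bayesian sketch. This is not the proposed learned architecture: it enumerates all infection states, so it only scales to a handful of people. It also uses a simple greedy rule (choose the pool whose probability of containing an infection is closest to 1/2) as a stand-in for a posterior-matching-style protocol. The error rates, function names, and independent prior are illustrative assumptions. Side information would enter through the per-person prior probabilities, and correlation could be modeled by supplying any joint distribution over states in place of the independent prior.

```python
from itertools import combinations, product


def prior_over_states(prior_marginals):
    """Joint prior over all 2^n infection states, assuming independence."""
    states = {}
    for state in product((0, 1), repeat=len(prior_marginals)):
        p = 1.0
        for infected, q in zip(state, prior_marginals):
            p *= q if infected else 1.0 - q
        states[state] = p
    return states


def update(posterior, pool, result, false_pos=0.05, false_neg=0.05):
    """Bayes update after one noisy pooled test on the people in `pool`."""
    new = {}
    for state, p in posterior.items():
        truly_positive = any(state[i] for i in pool)
        p_pos = 1.0 - false_neg if truly_positive else false_pos
        new[state] = p * (p_pos if result else 1.0 - p_pos)
    total = sum(new.values())
    return {state: p / total for state, p in new.items()}


def posterior_marginals(posterior, n):
    """Per-person probability of infection under the current posterior."""
    return [sum(p for s, p in posterior.items() if s[i]) for i in range(n)]


def choose_pool(posterior, n, max_size):
    """Greedy heuristic: pool whose chance of holding an infection is nearest 1/2."""
    def p_any(pool):
        return sum(p for s, p in posterior.items() if any(s[i] for i in pool))

    pools = [c for k in range(1, max_size + 1)
             for c in combinations(range(n), k)]
    return min(pools, key=lambda pool: abs(p_any(pool) - 0.5))


# Hypothetical example: four people, each with a 20% prior infection probability.
n = 4
belief = prior_over_states([0.2] * n)
pool = choose_pool(belief, n, max_size=n)          # next pool to test
belief = update(belief, pool, result=False)        # suppose the test is negative
risk = posterior_marginals(belief, n)              # updated per-person risk
```

After a negative pooled result, the posterior risk of everyone in the pool drops while the risk of anyone left out is unchanged, and the next pool is chosen from the updated belief. Repeating this loop gives a simple adaptive protocol against which learned schemes could be compared on small instances.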