Faulttolerant Communication
The security of a complex interconnected cyberphysical system like a smart grid critically depends on its capability to correctly and timely compute various functions with inputs held by distributed sensors (e.g. computing the total power being consumed by certain types of end loads), even when its communication system is under attacks. Thus, a natural question to ask is:


In this work, we developed the theoretical foundations to formally study the communication complexity required for computing functions in a faulttolerant way. To our best knowledge, our results presented the very first series of lower bounds on the faulttolerant communication complexity of the important SUM function. These faulttolerant lower bounds are (at least) exponentially larger than the corresponding upper bounds on the nonfaulttolerant communication complexity, which demonstrates the significant impact of failures. Thus the faulttolerant communication complexity needs to be studied on its own, in addition to and in comparison with the classic nonfaulttolerant communication complexity.
ChallengesWhile the above question is natural, interestingly, it has not been formally posed and thoroughly studied. We suspect that such a lack of prior work on fault tolerance is due to two reasons. First, one needs to define correctness in a meaningful way when failures are possible, since some of the inputs can be missing. For this work, recent applications in wireless sensor networks have shown us how to do so. Second, communication complexity problems tend to be challenging to study, and taking failures into account only makes things harder. For this work, we rely on several quite recent results, in particular, the breakthrough result on the communication complexity of gaphamming distance problem by A. Chakrabarti and O. Regev (see their STOC’11 paper “An optimal lower bound on the communication complexity of gaphamming distance”). Our approachOur work considers the setting of a synchronous network with N nodes connected by some topology (e.g. consider a wireless mesh network of smart meters). Each node can only communicate directly with its neighbours, and the communication complexity corresponds to the number of bits sent by the bottleneck node. Given that each node has a nonnegative integer value with bounded domain, the SUM function asks for the sum of all the values. We study the faulttolerant communication complexity of SUM function with the following two natural assumptions: (1) we assume that the special root node who needs to learn the final result never fails; (2) we allow the computation to ignore/omit the inputs held by nodes that have failed or been disconnected from the root due to other nodes’ failures. Our resultsSince there is a tradeoff between communication complexity and time complexity, we prove a series of faulttolerant communication complexity lower bounds under increasing time complexity of the SUM protocol, as shown in the figure below. Our results also imply the optimality (within polylog factors) of some recent faulttolerant protocols for computing SUM via duplicateinsensitive techniques, thereby answering an open question as well. Part of our results are obtained via a novel reduction from a new twoparty problem UNIONSIZECP that we introduce. UNIONSIZECP comes with a novel cycle promise, which is the key enabler of our reduction. We further prove that this cycle promise and UNIONSIZECP play a fundamental role in reasoning about faulttolerant communication complexity. CollaborationsThis work results from the collaboration with Prof. Haifeng Yu’s team in National University of Singapore and Dr. Phillip B. Gibbons from Intel Labs in the US. Publications
