|The security of a complex interconnected cyber-physical system like a smart grid critically depends on its capability to correctly and timely compute various functions with inputs held by distributed sensors (e.g. computing the total power being consumed by certain types of end loads), even when its communication system is under attacks. Thus, a natural question to ask is:
|In this work, we developed the theoretical foundations to formally study the communication complexity required for computing functions in a fault-tolerant way. To our best knowledge, our results presented the very first series of lower bounds on the fault-tolerant communication complexity of the important SUM function. These fault-tolerant lower bounds are (at least) exponentially larger than the corresponding upper bounds on the non-fault-tolerant communication complexity, which demonstrates the significant impact of failures. Thus the fault-tolerant communication complexity needs to be studied on its own, in addition to and in comparison with the classic non-fault-tolerant communication complexity.
While the above question is natural, interestingly, it has not been formally posed and thoroughly studied. We suspect that such a lack of prior work on fault tolerance is due to two reasons. First, one needs to define correctness in a meaningful way when failures are possible, since some of the inputs can be missing. For this work, recent applications in wireless sensor networks have shown us how to do so. Second, communication complexity problems tend to be challenging to study, and taking failures into account only makes things harder. For this work, we rely on several quite recent results, in particular, the breakthrough result on the communication complexity of gap-hamming distance problem by A. Chakrabarti and O. Regev (see their STOC’11 paper “An optimal lower bound on the communication complexity of gap-hamming distance”).
Our work considers the setting of a synchronous network with N nodes connected by some topology (e.g. consider a wireless mesh network of smart meters). Each node can only communicate directly with its neighbours, and the communication complexity corresponds to the number of bits sent by the bottleneck node. Given that each node has a non-negative integer value with bounded domain, the SUM function asks for the sum of all the values. We study the fault-tolerant communication complexity of SUM function with the following two natural assumptions: (1) we assume that the special root node who needs to learn the final result never fails; (2) we allow the computation to ignore/omit the inputs held by nodes that have failed or been disconnected from the root due to other nodes’ failures.
Since there is a tradeoff between communication complexity and time complexity, we prove a series of fault-tolerant communication complexity lower bounds under increasing time complexity of the SUM protocol, as shown in the figure below.
Our results also imply the optimality (within polylog factors) of some recent fault-tolerant protocols for computing SUM via duplicate-insensitive techniques, thereby answering an open question as well. Part of our results are obtained via a novel reduction from a new two-party problem UNIONSIZECP that we introduce. UNIONSIZECP comes with a novel cycle promise, which is the key enabler of our reduction. We further prove that this cycle promise and UNIONSIZECP play a fundamental role in reasoning about fault-tolerant communication complexity.
This work results from the collaboration with Prof. Haifeng Yu’s team in National University of Singapore and Dr. Phillip B. Gibbons from Intel Labs in the US.