Fault-tolerant Communication

The security of a complex interconnected cyber-physical system like a smart grid critically depends on its capability to correctly and timely compute various functions with inputs held by distributed sensors (e.g. computing the total power being consumed by certain types of end loads), even when its communication system is under attacks. Thus, a natural question to ask is:

If we want to compute a certain function in a fault-tolerant way, what will the communication complexity be?

In this work, we developed the theoretical foundations to formally study the communication complexity required for computing functions in a fault-tolerant way. To our best knowledge, our results presented the very first series of lower bounds on the fault-tolerant communication complexity of the important SUM function. These fault-tolerant lower bounds are (at least) exponentially larger than the corresponding upper bounds on the non-fault-tolerant communication complexity, which demonstrates the significant impact of failures. Thus the fault-tolerant communication complexity needs to be studied on its own, in addition to and in comparison with the classic non-fault-tolerant communication complexity.


While the above question is natural, interestingly, it has not been formally posed and thoroughly studied. We suspect that such a lack of prior work on fault tolerance is due to two reasons. First, one needs to define correctness in a meaningful way when failures are possible, since some of the inputs can be missing. For this work, recent applications in wireless sensor networks have shown us how to do so. Second, communication complexity problems tend to be challenging to study, and taking failures into account only makes things harder. For this work, we rely on several quite recent results, in particular, the breakthrough result on the communication complexity of gap-hamming distance problem by A. Chakrabarti and O. Regev (see their STOC’11 paper “An optimal lower bound on the communication complexity of gap-hamming distance”).

Our approach

Our work considers the setting of a synchronous network with N nodes connected by some topology (e.g. consider a wireless mesh network of smart meters). Each node can only communicate directly with its neighbours, and the communication complexity corresponds to the number of bits sent by the bottleneck node. Given that each node has a non-negative integer value with bounded domain, the SUM function asks for the sum of all the values. We study the fault-tolerant communication complexity of SUM function with the following two natural assumptions: (1) we assume that the special root node who needs to learn the final result never fails; (2) we allow the computation to ignore/omit the inputs held by nodes that have failed or been disconnected from the root due to other nodes’ failures.

Our results

Since there is a tradeoff between communication complexity and time complexity, we prove a series of fault-tolerant communication complexity lower bounds under increasing time complexity of the SUM protocol, as shown in the figure below.

Gap between fault-tolerant communication complexity and non-fault-tolerant communication complexity

Our results also imply the optimality (within polylog factors) of some recent fault-tolerant protocols for computing SUM via duplicate-insensitive techniques, thereby answering an open question as well. Part of our results are obtained via a novel reduction from a new two-party problem UNIONSIZECP that we introduce. UNIONSIZECP comes with a novel cycle promise, which is the key enabler of our reduction. We further prove that this cycle promise and UNIONSIZECP play a fundamental role in reasoning about fault-tolerant communication complexity.


This work results from the collaboration with Prof. Haifeng Yu’s team in National University of Singapore and Dr. Phillip B. Gibbons from Intel Labs in the US.


  1. Binbin Chen, Haifeng Yu, Yuda Zhao, and Phillip Gibbons
    The Cost of Fault Tolerance in Multi-Party Communication Complexity
    Accepted by the Journal of the ACM.
  2. Binbin Chen, Haifeng Yu, Yuda Zhao, and Phillip Gibbons
    The Cost of Fault Tolerance in Multi-Party Communication Complexity
    In Proceedings of 31st Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC’12)
    Madeira, Portugal, July 2012. [PDF]
  3. Binbin Chen, Haifeng Yu, Yuda Zhao, and Phillip Gibbons
    The Cost of Fault Tolerance in Multi-Party Communication Complexity
    Technical report with full proofs of our results. [PDF]