Software Fault Tolerance

Most Realtime systems focus on hardware fault tolerance; software fault tolerance is often overlooked. This is surprising because hardware components are far more reliable than the software that runs over them. System designers go to great lengths to limit the impact of a hardware failure on system performance, yet pay little attention to the system's behavior when a software module fails.

In this article we cover several techniques that can be used to limit the impact of software faults (read: bugs) on system performance. The main idea is to contain the damage caused by software faults. Software fault tolerance is not a license to ship the system with bugs; the real objective is to improve system performance and availability when the system encounters a software or hardware fault.

Timeouts

Most Realtime systems use timers to keep track of feature execution. A timeout generally signals that some entity involved in the feature has misbehaved and corrective action is required. The corrective action can take one of two forms: retry the operation, or abort the feature.

The choice between retrying and aborting on a timeout depends on several factors; weigh all of them before deciding either way.
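
As a rough illustration, the fragment below sketches a timer-supervised request that retries a bounded number of times before aborting the feature. The helper names (send_request_and_wait, abort_feature), the retry limit, and the timeout value are illustrative assumptions, not part of any particular RTOS API.

  #include <stdbool.h>

  #define MAX_RETRIES      3     /* illustrative retry limit */
  #define REPLY_TIMEOUT_MS 200   /* illustrative timer value */

  /* Assumed to be provided elsewhere: sends the request and blocks until
     a reply arrives or the timer expires; returns true only on a reply. */
  bool send_request_and_wait(int timeout_ms);
  void abort_feature(void);      /* assumed feature-abort hook */

  void run_feature(void)
  {
      for (int attempt = 0; attempt < MAX_RETRIES; ++attempt) {
          if (send_request_and_wait(REPLY_TIMEOUT_MS)) {
              return;            /* reply received before the timer expired */
          }
          /* Timeout: the peer entity has misbehaved; retry a bounded
             number of times before giving up. */
      }
      abort_feature();           /* retries exhausted: abort and clean up */
  }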

Audits

Most Realtime systems comprise software running across multiple processors, which implies that data is also distributed. The distributed data may become inconsistent during live operation for several reasons.

The system must behave reliably under all these conditions. A simple strategy for overcoming data inconsistency is to implement audits. An audit is a program that checks the consistency of data structures across processors by performing predefined checks.

Audit Procedure

  1. The system may trigger audits for several reasons:
    • periodic scheduling
    • failure of certain features
    • processor reboots
    • processor switchovers
    • certain cases of resource congestion
  2. Audits perform checks on the data and look for inconsistencies between processors.
  3. Since audits have to run on live systems, they need to filter out cases where the inconsistency is caused by a transient data update. On detecting an inconsistency, an audit repeats its check several times; the inconsistency is considered valid if and only if it is detected on every iteration.
  4. When an inconsistency is confirmed, the audit may clean up the affected data structures across processors.
  5. At times an audit may not clean up the inconsistency directly; instead it may trigger an appropriate feature abort or similar recovery action.

An Example

Let's consider a switching system. If the call occupancy on the system is much lower than the maximum it can handle and calls are still failing due to a lack of space-slot resources, the call processing subsystem will detect this condition and trigger the space-slot audit. The audit runs on the Switching and Central processors and cross-checks whether a space-slot that is marked busy at Central actually has a corresponding call at Switching. If no active call is found at Switching for a space-slot, the audit rechecks the condition several times, with a small delay between attempts. If the inconsistency holds on every attempt, the space-slot is marked free at Central. The repeated rechecks eliminate the scenario in which a space-slot release message is simply in transit.
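
A minimal sketch of the confirmation loop used by such an audit is shown below. The helper names (is_busy_at_central, has_active_call_at_switching, free_space_slot_at_central, audit_delay_ms) and the recheck parameters are hypothetical and chosen only for illustration.

  #include <stdbool.h>

  #define AUDIT_RECHECKS 5               /* illustrative number of confirmation passes */

  /* Assumed queries and cleanup hooks into the Central and Switching processors. */
  bool is_busy_at_central(int slot);
  bool has_active_call_at_switching(int slot);
  void free_space_slot_at_central(int slot);
  void audit_delay_ms(int ms);           /* assumed delay primitive */

  void audit_space_slot(int slot)
  {
      if (!is_busy_at_central(slot)) {
          return;                        /* nothing to audit */
      }
      /* Recheck several times so that a release message still in transit
         is not mistaken for a genuine inconsistency. */
      for (int pass = 0; pass < AUDIT_RECHECKS; ++pass) {
          if (has_active_call_at_switching(slot)) {
              return;                    /* consistent: an active call exists */
          }
          audit_delay_ms(50);            /* small delay between passes */
      }
      free_space_slot_at_central(slot);  /* inconsistency confirmed: reclaim the slot */
  }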

Exception Handling

Whenever a task receives a message, it performs a series of defensive checks before processing it. The defensive checks verify the consistency of the message as well as the internal state of the task. An exception handler is invoked when a defensive check fails.

Depending on the severity of the failure, the exception handler can take any of several actions; the techniques described in the following sections (leaky bucket counters, task rollback, incremental reboot) are typical responses.
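
As a sketch, the defensive checks on message receipt might look like the fragment below. The message layout, the task state values, and the raise_exception hook are assumptions made purely for illustration.

  #include <stdbool.h>
  #include <stddef.h>

  typedef enum { TASK_IDLE, TASK_ACTIVE } task_state_t;   /* illustrative states */

  typedef struct {
      int    type;              /* message type code */
      size_t length;            /* payload length in bytes */
  } message_t;

  void raise_exception(const char *reason);   /* assumed exception-handler hook */

  bool handle_message(task_state_t state, const message_t *msg, size_t max_len)
  {
      /* Defensive checks on the message itself. */
      if (msg == NULL || msg->length > max_len) {
          raise_exception("malformed message");
          return false;
      }
      /* Defensive check on the task's own internal state. */
      if (state != TASK_ACTIVE) {
          raise_exception("message received in unexpected state");
          return false;
      }
      /* ... normal message processing ... */
      return true;
  }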

Leaky Bucket Counter

Leaky bucket counters are used to detect a flurry of error conditions. To ignore rare, isolated errors, the counters are periodically leaked, i.e. decremented. If a counter crosses a certain threshold, appropriate exception handling is triggered. The threshold will never be crossed by rare occurrences of the associated error condition; however, if the error occurs in rapid succession, the counter overflows, i.e. crosses the threshold.
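
A minimal sketch of such a counter is shown below; the threshold, the leak period, and the trigger_exception_handling hook are illustrative assumptions rather than recommendations.

  #define LEAK_THRESHOLD 10              /* illustrative overflow threshold */

  typedef struct {
      int count;                         /* current error count */
  } leaky_bucket_t;

  void trigger_exception_handling(void); /* assumed recovery hook */

  /* Called each time the associated error condition is detected. */
  void bucket_report_error(leaky_bucket_t *b)
  {
      if (++b->count >= LEAK_THRESHOLD) {
          b->count = 0;
          trigger_exception_handling();  /* flurry of errors detected */
      }
  }

  /* Called periodically (e.g. from a timer) to leak the counter, so that
     rare, isolated errors never accumulate up to the threshold. */
  void bucket_leak(leaky_bucket_t *b)
  {
      if (b->count > 0) {
          --b->count;
      }
  }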

Task Rollback

In a complex Realtime system, a software bug in one task leading to a processor reboot may not be acceptable. A better option in such cases is to isolate the erroneous task and handle the failure at the task level. The task may then decide to roll back, i.e. resume operation from a known or previously saved state. In other cases, it may be cheaper to simply forget the context by deleting the offending task and informing the associated tasks.

For example, if a central entity encounters an exception condition leading to a task rollback, it might resume operation by recovering information from the downstream processors in the system. Another option is to keep the design simple and clear all sessions during a rollback.
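
As one possible sketch, a task might checkpoint its context at stable points and restore that checkpoint when it rolls back; the context structure and the choice between restoring the snapshot and clearing all sessions are assumptions made for the example.

  typedef struct {
      int session_count;                 /* illustrative task context */
      int last_sequence_number;
  } task_context_t;

  static task_context_t live;            /* the task's working state */
  static task_context_t checkpoint;      /* last known-good copy */

  /* Save a known-good snapshot at a stable point in the task's operation. */
  void task_checkpoint(void)
  {
      checkpoint = live;
  }

  /* On an unrecoverable defensive check failure, restore the snapshot,
     or, in a simpler design, reset the context and clear all sessions. */
  void task_rollback(void)
  {
      live = checkpoint;
  }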

Task rollback may be triggered by several events; a typical trigger is the overflow of a defensive check leaky bucket counter, as described in the next section.

Incremental Reboot

Rebooting a processor's software can be time consuming, leading to an unacceptable amount of downtime. To reduce reboot time, complex Realtime systems often implement incremental system initialization procedures. For example, a typical Realtime system may implement three levels of system reboot, applied in order of increasing severity.

Incremental Reboot Procedure

  1. Overflow of a defensive check leaky bucket counter will typically lead to a rollback of the offending task.
  2. In most cases, the task rollback fixes the problem. In some cases, however, the problem persists and further rollbacks follow in quick succession. This causes the task-level rollback counter to overflow, leading to a Level 1 Reboot.
  3. Most of the time, a Level 1 Reboot fixes the problem. In some cases, though, the processor keeps hitting Level 1 Reboots; the Level 1 Reboot counter then overflows, leading to a Level 2 Reboot.
  4. In the majority of cases, a Level 2 Reboot fixes the problem. If it does not, the processor repeatedly hits Level 2 Reboots, causing the Level 2 Reboot counter to overflow and escalating to a Level 3 Reboot.
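
A rough sketch of this escalation logic is shown below; the counters, thresholds, and reboot hooks are illustrative assumptions (and in a real system the counters would themselves be leaked periodically, as described earlier).

  /* Per-level escalation counters; the limits are illustrative. */
  #define ROLLBACK_LIMIT 3
  #define LEVEL1_LIMIT   3
  #define LEVEL2_LIMIT   2

  static int rollback_count, level1_count, level2_count;

  void task_rollback(void);              /* assumed recovery hooks, defined elsewhere */
  void level1_reboot(void);
  void level2_reboot(void);
  void level3_reboot(void);

  /* Called whenever a defensive check leaky bucket counter overflows. */
  void escalate_recovery(void)
  {
      if (++rollback_count < ROLLBACK_LIMIT) {
          task_rollback();               /* cheapest recovery first */
      } else if (++level1_count < LEVEL1_LIMIT) {
          rollback_count = 0;
          level1_reboot();               /* rollbacks keep failing */
      } else if (++level2_count < LEVEL2_LIMIT) {
          level1_count = 0;
          level2_reboot();               /* Level 1 Reboots keep failing */
      } else {
          level2_count = 0;
          level3_reboot();               /* last resort */
      }
  }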

Voting

This technique is used in mission critical systems where a software failure may lead to loss of human life, e.g. aircraft navigation software. Here, the Realtime system software is developed by at least three distinct teams, each working independently. In the live system, all three implementations run simultaneously: every input is fed to the three versions of the software, and their outputs are voted on to determine the actual system response. In such systems, a bug in one of the three modules is voted out by the other two versions.
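
A two-out-of-three majority voter over such redundant outputs could be sketched as follows; the integer output type and the behaviour when all three versions disagree are simplifying assumptions.

  #include <stdbool.h>

  /* Returns true and writes the majority value to *result when at least
     two of the three independently computed outputs agree; returns false
     when all three disagree, so the caller can raise an alarm. */
  bool vote_2_of_3(int a, int b, int c, int *result)
  {
      if (a == b || a == c) {
          *result = a;
          return true;
      }
      if (b == c) {
          *result = b;
          return true;
      }
      return false;                      /* no majority: no safe output */
  }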