Software Fault Tolerance

Most Realtime systems focus on hardware fault tolerance, and software fault tolerance is often overlooked. This is surprising, because hardware components are generally much more reliable than the software that runs on them. Most system designers go to great lengths to limit the impact of a hardware failure on system performance. However, they pay little attention to the system's behavior when a software module fails.

This article covers several techniques that can be used to limit the impact of software faults (read: bugs) on system performance. The main idea is to contain the damage caused by software faults. Software fault tolerance is not a license to ship the system with bugs. The real objective is to improve system performance and availability when the system encounters a software or hardware fault.

Timeouts
Audits
Exception Handling
Task Rollback
Incremental Reboot
Voting

Timeouts

Most Realtime systems use timers to keep track of feature execution. A timeout generally signals that some entity involved in the feature has misbehaved and that corrective action is required. The corrective action can take one of two forms:

Retry: When the application times out waiting for a response, it can retry the message interaction. You might argue that application level retries are unnecessary because lower level protocols will automatically recover from message loss. Keep in mind that recovering from message loss is not the only objective of implementing retries. Retries help in recovering from software faults too. Consider a scenario where a message sent to a task is not processed because of a task restart or processor reboot. An application level retry will recover from this condition.
Abort: In this case, a timeout while waiting for a response leads to the feature being aborted. This might seem too drastic, but in practice aborting a feature might be the simplest and safest way to recover from the error. The user can then retry by invoking the feature again. Consider a case where a call has to be cleared because the task originating the call did not receive a response in time. If this condition can happen only in rare scenarios, the simplest action on timeout might be to clear the call. The user would retry the call.

The choice between retrying and aborting on a timeout depends on several factors. Consider all of them before you decide either way:

If the feature being executed is fairly important for system stability, it might be better to retry. For example, a system startup feature should not be aborted on one timeout.
If the lower layer protocol is not robust, retry might be a good option. For example, message interactions using an inherently unreliable protocol like slotted ALOHA should always be retried.
Complexity of implementation should also be considered before retrying a message interaction. Aborting a feature is a simpler option. More often than not system designers just default to retrying without even considering the abort option. Keep in mind that retry implementation complicates the code and state machine design.
If the entity invoking this feature will retry the feature, the simplest action might be to abort the feature and wait for an external retry.
Retrying every message in the system will lower system performance because of frequent timer start and stop operations. In many cases, performance can be improved by just running a single timer for the complete feature execution. On timeout the feature can simply be aborted.
For most external interactions, the designer might have no choice, as the timeouts and retry actions are generally specified by the external protocols.
Often the two techniques are used together. The task retries a message a certain number of times. If no response is received after exhausting this limit, the feature is aborted.
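
The combination of bounded retries followed by an abort can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names request_with_retries, FeatureAborted, send_and_wait and make_flaky_peer are hypothetical, and the timeout and retry limit are arbitrary example values.

```python
from typing import Callable, Optional


class FeatureAborted(Exception):
    """Raised when every attempt has timed out and the feature is given up."""


def request_with_retries(
    send_and_wait: Callable[[float], Optional[str]],
    timeout_s: float = 0.5,
    max_attempts: int = 3,
) -> str:
    """Send a request, retry on timeout, and abort after max_attempts.

    send_and_wait is a hypothetical callable that sends the message and
    returns the response, or None if nothing arrived within timeout_s.
    """
    for _ in range(max_attempts):
        response = send_and_wait(timeout_s)
        if response is not None:
            return response
    # Retry limit exhausted: abort and let the caller (or user) retry later.
    raise FeatureAborted(f"no response after {max_attempts} attempts")


def make_flaky_peer(drops: int) -> Callable[[float], Optional[str]]:
    """Simulated peer that ignores the first `drops` messages (assumption)."""
    remaining = {"drops": drops}

    def send_and_wait(timeout_s: float) -> Optional[str]:
        if remaining["drops"] > 0:
            remaining["drops"] -= 1
            return None  # simulated timeout
        return "ACK"

    return send_and_wait
```

With make_flaky_peer(2) the third attempt succeeds. With make_flaky_peer(5) all three attempts time out and FeatureAborted is raised, which corresponds to aborting the feature.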

Audits

Most Realtime systems consist of software running across multiple processors. This implies that data is also distributed. The distributed data may become inconsistent while the system is running, for reasons such as:

independent processor reboot
software bugs
race conditions
hardware failures
protocol failures

The system must behave reliably under all these conditions. A simple strategy to overcome data inconsistency is to implement audits. An audit is a program that checks the consistency of data structures across processors by performing predefined checks.

Audit Procedure

The system may trigger audits for several reasons:
- periodically
- failure of certain features
- processor reboots
- processor switchovers
- certain cases of resource congestion
Audits perform checks on data and look for data inconsistencies between processors.
Since audits have to run on live systems, they need to filter out conditions where the data inconsistency is caused by transient data updates. When an inconsistency is detected, audits repeat the check several times to confirm it. An inconsistency is considered valid if and only if it is detected on every iteration of the check.
When an inconsistency is confirmed, audits may clean up data structures across processors.
At times audits may not clean up inconsistencies directly; instead, they may trigger appropriate feature aborts.

An Example

Let's consider a switching system. If the call occupancy on the system is much less than the maximum that can be handled and calls are still failing due to lack of space-slot resources, the call processing subsystem will detect this condition and trigger a space-slot audit. The audit will run on the Switching and Central processors and cross-check whether each space-slot that is busy at Central actually has a corresponding call at Switching. If no active call is found on Switching for a space-slot, the audit will recheck the condition several times, waiting a small delay before each recheck. If the inconsistency holds on every attempt, the space-slot resource is marked free at Central. The audit performs several rechecks to rule out the scenario in which a space-slot release message is still in transit.
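
The space-slot audit above can be sketched as follows. This is a simplified illustration under stated assumptions: the two processors' data is modeled as plain sets in one process, and the names audit_space_slots, central_busy and switching_active, as well as the recheck count and delay, are hypothetical. In a live system the data could change between rechecks, which is exactly what the delay is meant to allow for.

```python
import time
from typing import List, Set


def audit_space_slots(
    central_busy: Set[int],
    switching_active: Set[int],
    rechecks: int = 3,
    delay_s: float = 0.01,
) -> List[int]:
    """Free slots that Central marks busy but Switching has no call for.

    A slot is freed only if the inconsistency is seen on the first check
    and on every one of the delayed rechecks.
    """

    def inconsistent(slot: int) -> bool:
        return slot in central_busy and slot not in switching_active

    freed = []
    # sorted() takes a snapshot, so central_busy can be modified in the loop.
    for slot in sorted(central_busy):
        confirmed = inconsistent(slot)
        for _ in range(rechecks):
            if not confirmed:
                break  # transient condition, e.g. a release was in transit
            time.sleep(delay_s)  # give an in-transit release time to arrive
            confirmed = inconsistent(slot)
        if confirmed:
            central_busy.discard(slot)  # mark the slot free at Central
            freed.append(slot)
    return freed
```

For example, with central_busy = {1, 2, 3} and switching_active = {1, 3}, the audit frees slot 2 and leaves slots 1 and 3 untouched.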

Exception Handling

Whenever a task receives a message, it performs a series of defensive checks before processing it. The defensive checks should verify the consistency of the message as well as the internal state of the task. The exception handler should be invoked when a defensive check fails.

Depending on the severity, the exception handler can take any of the following actions:

Log a trace for developer post processing.
Increment a leaky-bucket counter for the error condition.
Trigger appropriate audit.
Trigger a task rollback.
Trigger processor reboot.

Leaky Bucket Counter

Leaky-bucket counters are used to detect a flurry of error conditions. To ignore rare error conditions, they are periodically leaked, i.e., decremented. If a counter reaches a certain threshold, appropriate exception handling is triggered. Note that the threshold will never be crossed by rare occurrences of the associated error condition. However, if the error condition occurs in rapid succession, the counter will overflow, i.e., cross the threshold.
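
A minimal sketch of such a counter is shown below. The class name LeakyBucketCounter, its method names, and the choice of leaking one count per period are assumptions made for illustration. The leak method would normally be driven by a periodic timer.

```python
class LeakyBucketCounter:
    """Counts errors, leaks periodically, and reports threshold overflow."""

    def __init__(self, threshold: int, leak_amount: int = 1) -> None:
        self.threshold = threshold
        self.leak_amount = leak_amount
        self.count = 0

    def record_error(self) -> bool:
        """Count one error; return True if the threshold has been reached."""
        self.count += 1
        return self.count >= self.threshold

    def leak(self) -> None:
        """Periodic leak: decrement the counter, never going below zero."""
        self.count = max(0, self.count - self.leak_amount)
```

With a threshold of 3, one error per leak period never overflows the counter, because each error is leaked away before the next one arrives. Three errors within a single leak period make record_error return True, and the caller can then trigger the appropriate exception handling.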

Task Rollback

In a complex Realtime system, it may not be acceptable for a software bug in one task to cause a processor reboot. A better option in such cases is to isolate the erroneous task and handle the failure at the task level. The task in turn may decide to roll back, i.e., restart operation from a known or previously saved state. In other cases, it may be inexpensive to forget the context by simply deleting the offending task and informing the other associated tasks.

For example, if a central entity encounters an exception condition leading to task rollback, it might resume operation by recovering information from downstream processors in the system. Another option would be to keep the design simple and clear all sessions during a rollback.
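
Rolling back to a previously saved state can be sketched as follows. This is a hedged illustration only: the class RollbackTask, its checkpoint mechanism based on copy.deepcopy, and the message format are hypothetical, and a real task would choose its own checkpoint contents and failure detection.

```python
import copy
from typing import Any, Dict


class RollbackTask:
    """A task that restores its last saved state when processing fails."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {"sessions": {}}
        self._checkpoint = copy.deepcopy(self.state)

    def save_checkpoint(self) -> None:
        """Record the current state as the known-good state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard the current state and resume from the checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)

    def handle(self, session_id: str, payload: Any) -> None:
        """Process a message; contain any failure at the task level."""
        try:
            self._process(session_id, payload)
        except Exception:
            self.rollback()

    def _process(self, session_id: str, payload: Any) -> None:
        # A partial update happens before the failure is detected.
        self.state["sessions"][session_id] = "in-progress"
        if payload is None:
            raise ValueError("malformed message")
        self.state["sessions"][session_id] = payload
```

If a checkpoint is saved after session "a" is set up and a malformed message for session "b" then arrives, the half-created entry for "b" is discarded by the rollback and the state again contains only session "a".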

Task rollback may be triggered by any of the following events:

Hardware exception conditions such as divide by zero or illegal address access (bus error).
Overflow of a defensive check leaky-bucket counter.
An inconsistency detected by an audit that has to be resolved by task rollback.

Incremental Reboot

Software processor reboots can be time-consuming, leading to an unacceptable amount of downtime. To reduce the system reboot time, complex Realtime systems often implement incremental system initialization procedures. For example, a typical Realtime system may implement three levels of system reboot:

Level 1 Reboot: Operating system reboot
Level 2 Reboot: Operating system reboot along with configuration data download
Level 3 Reboot: Code reload followed by operating system reboot along with configuration data download

Incremental Reboot Procedure

A defensive check leaky-bucket counter overflow will typically lead to rollback of the offending task.
In most cases, task rollback will fix the problem. However, in some cases the problem may persist, leading to further rollbacks in quick succession. This will cause the task level rollback counter to overflow, leading to a Level 1 Reboot.
Most of the time, a Level 1 Reboot will fix the problem. But in some cases, the processor may continue to hit Level 1 Reboots repeatedly. This will cause the Level 1 Reboot counter to overflow, leading to a Level 2 Reboot.
In the majority of cases, a Level 2 Reboot is able to fix the problem. If it is not, the processor will repeatedly hit Level 2 Reboots, causing the Level 2 Reboot counter to overflow and leading to a Level 3 Reboot.
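
The escalation procedure above can be sketched as a chain of counters, one per recovery level. This is an illustrative sketch, not a definitive implementation: the class RecoveryEscalator, the shared threshold of 3, and the leak policy are assumptions, and a real system would have to keep these counters in storage that survives the reboots themselves.

```python
from typing import List

LEVELS = ["task rollback", "Level 1 Reboot", "Level 2 Reboot", "Level 3 Reboot"]


class RecoveryEscalator:
    """Picks a recovery action, escalating when a level's counter overflows."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        # One counter per level that can escalate (every level but the last).
        self.counts: List[int] = [0] * (len(LEVELS) - 1)

    def next_action(self) -> str:
        """Return the action for a new defensive check counter overflow."""
        level = 0
        while level < len(self.counts):
            self.counts[level] += 1
            if self.counts[level] < self.threshold:
                break
            self.counts[level] = 0  # overflow: reset and escalate one level
            level += 1
        return LEVELS[level]

    def leak(self) -> None:
        """Periodic leak so that widely spaced failures never escalate."""
        self.counts = [max(0, c - 1) for c in self.counts]
```

With a threshold of 3 and no leaking in between, the third consecutive failure escalates to a Level 1 Reboot, the ninth to a Level 2 Reboot, and the twenty-seventh to a Level 3 Reboot. If leak is called between failures, the action stays at task rollback.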

Voting

This technique is used in mission critical systems where a software failure may lead to loss of human life, e.g., aircraft navigation software. Here, the Realtime system software is developed independently by at least three distinct teams. In a live system, all three implementations run simultaneously. All inputs are fed to the three versions of the software, and their outputs are voted on to determine the actual system response. In such systems, a bug in one of the three modules will be voted out by the other two versions.
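
The voting step can be sketched as a simple majority vote over the outputs of the independent versions. The function majority_vote and the three toy versions below are hypothetical and exist only to illustrate the idea. The sketch assumes outputs can be compared for exact equality, which real systems often have to relax, for example for floating-point results.

```python
from collections import Counter
from typing import Sequence, TypeVar

T = TypeVar("T")


def majority_vote(outputs: Sequence[T]) -> T:
    """Return the output produced by a strict majority of the versions."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no majority among versions")
    return value


# Three toy "independently developed" versions of the same function.
def version_a(x: int) -> int:
    return x * x


def version_b(x: int) -> int:
    return x ** 2


def version_c(x: int) -> int:
    return x * 2  # deliberately buggy version for the example


def voted_square(x: int) -> int:
    """Feed the same input to all versions and vote on their outputs."""
    return majority_vote([v(x) for v in (version_a, version_b, version_c)])
```

For an input of 3, the versions produce 9, 9 and 6, so the buggy result is voted out and voted_square returns 9. If all three outputs differ, there is no majority and the sketch raises an error instead of guessing.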