Most Realtime systems focus on hardware fault tolerance. Software fault
tolerance is often overlooked. This is surprising, because hardware
components are far more reliable than the software that runs on them.
Most system designers go to great lengths to limit the impact of a hardware
failure on system performance. However, they pay little attention to the
system's behavior when a software module fails.
In this article we cover several techniques that can be used to limit the
impact of software faults (read: bugs) on system performance. The main
idea is to contain the damage caused by software faults. Software fault
tolerance is not a license to ship the system with bugs. The real objective is
to improve system performance and availability when the system
encounters a software or hardware fault.
Most Realtime systems use timers to keep track of feature execution. A
timeout generally signals that some entity involved in the feature has
misbehaved and a corrective action is required. The corrective action can take
one of two forms:
- Retry: When the application times out waiting for a response, it can retry
the message interaction. You might argue that we do not need to implement
application-level retries because lower-level protocols will automatically
recover from message loss. Keep in mind that message loss recovery is not the
only objective of implementing retries. Retries help in recovering from
software faults too. Consider a scenario where a message sent to a task is not
processed because of a task restart or processor reboot. An application-level
retry will recover from this condition.
- Abort: In this case, a timeout while waiting for a response leads to the
feature being aborted. This might seem too drastic, but in reality aborting a
feature might be the simplest and safest way to recover from the error. The
feature can then be retried by the user who invoked it. Consider a case where
a call has to be cleared because the task originating the call did not receive
a response in time. If this condition can happen only in rare scenarios, the
simplest action on timeout might be to clear the call. The user would retry
the call.
The choice between retrying and aborting on timeouts depends on several
factors. Consider all these factors before you decide either way:
- If the feature being executed is fairly important for system stability, it
might be better to retry. For example, a system startup feature should not
be aborted on one timeout.
- If the lower-layer protocol is not robust, retry might be a good option.
For example, message interactions using an inherently unreliable protocol
like slotted ALOHA should always be retried.
- Complexity of implementation should also be considered before retrying a
message interaction. Aborting a feature is a simpler option. More often than
not, system designers default to retrying without even considering the
abort option. Keep in mind that implementing retries complicates the code
and the state machine design.
- If the entity invoking this feature will retry the feature, the simplest
action might be to abort the feature and wait for an external retry.
- Retrying every message in the system will lower system performance because
of frequent timer start and stop operations. In many cases, performance can
be improved by just running a single timer for the complete feature
execution. On timeout the feature can simply be aborted.
- For most external interactions, the designer might have no choice, as the
timeouts and retry actions are generally specified by the external
protocols.
- Often the two techniques are used together. The task retries a
message a certain number of times. If no response is received after exhausting
this limit, the feature is aborted.
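The combined strategy (retry a limited number of times, then abort) can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `request_with_retry`, `FeatureAborted`, `send` and `wait_for_response` are hypothetical, and the timer is assumed to be hidden inside `wait_for_response`, which returns `None` on timeout.

```python
from typing import Callable, Optional


class FeatureAborted(Exception):
    """Raised when the retry limit is exhausted and the feature is aborted."""


def request_with_retry(send: Callable[[], None],
                       wait_for_response: Callable[[float], Optional[str]],
                       timeout_s: float = 1.0,
                       max_attempts: int = 3) -> str:
    """Send a request, retrying on timeout; abort after max_attempts.

    Both callables are assumptions for this sketch: send() transmits the
    message, wait_for_response(timeout_s) returns the response or None
    if the timer expired first.
    """
    for _ in range(max_attempts):
        send()                                   # (re)send the request
        response = wait_for_response(timeout_s)  # None signals a timeout
        if response is not None:
            return response
    # Retry limit exhausted: give up and let the caller clean up the feature.
    raise FeatureAborted(f"no response after {max_attempts} attempts")
```

Setting `max_attempts` to 1 turns this into the pure abort-on-timeout policy, which is one way to keep both options behind a single interface.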
Most Realtime systems consist of software running across multiple
processors. This implies that data is also distributed. The distributed data
may become inconsistent during operation for reasons such as:
- independent processor reboot
- software bugs
- race conditions
- hardware failures
- protocol failures
The system must behave reliably under all these conditions. A simple strategy
to overcome data inconsistency is to implement audits. An audit is a program
that checks the consistency of data structures across processors by performing
predefined checks.
Audit Procedure
- The system may trigger audits for several reasons:
  - periodically
  - failure of certain features
  - processor reboots
  - processor switchovers
  - certain cases of resource congestion
- Audits perform checks on data and look for data inconsistencies between
processors.
- Since audits have to run on live systems, they need to filter out
conditions where the data inconsistency is caused by transient data updates.
When an inconsistency is detected, audits repeat the check several times to
confirm it. An inconsistency is considered valid if and only if it is
detected on every iteration of the check.
- When inconsistency is confirmed, audits may perform data structure
cleanups across processors.
- At times audits may not clean up inconsistencies directly; instead they may
trigger other recovery actions, such as aborting the affected feature.
An Example
Let's consider the Xenon Switching System. If the call occupancy on the system
is much lower than the maximum that can be handled and calls are still failing
due to lack of space-slot resources, the call processing subsystem will detect
this condition and trigger the space-slot audit. The audit runs on the XEN
and CAS processors and cross-checks whether a space-slot that is busy at CAS
actually has a corresponding call at XEN. If no active call is found on XEN for
a space-slot, the audit rechecks the condition several times, with a small
delay between attempts. If the inconsistency holds on every attempt, the
space-slot resource is marked free at CAS. The audit performs several rechecks
to rule out the scenario in which the space-slot release message is still in
transit.
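The recheck logic of such an audit can be sketched as follows. This is only an illustration under stated assumptions: the function name and the three callables (`is_busy_at_cas`, `has_call_at_xen`, `release_slot`) are hypothetical stand-ins for the real inter-processor queries, and the recheck count and delay are arbitrary.

```python
import time
from typing import Callable, Iterable, List


def audit_space_slots(slots: Iterable[int],
                      is_busy_at_cas: Callable[[int], bool],
                      has_call_at_xen: Callable[[int], bool],
                      release_slot: Callable[[int], None],
                      rechecks: int = 3,
                      delay_s: float = 0.05) -> List[int]:
    """Free slots that are busy at CAS but have no call at XEN.

    A slot is released only if the inconsistency is seen on the first
    check and on every recheck; the freed slots are returned.
    """
    def inconsistent(slot: int) -> bool:
        return is_busy_at_cas(slot) and not has_call_at_xen(slot)

    suspects = [s for s in slots if inconsistent(s)]
    for _ in range(rechecks):
        if not suspects:
            break
        time.sleep(delay_s)  # let in-transit release messages arrive
        suspects = [s for s in suspects if inconsistent(s)]

    for slot in suspects:    # confirmed on every iteration: clean up
        release_slot(slot)
    return suspects
```

A slot whose release message arrives during one of the delays drops out of `suspects` and is left alone, which is the transient case the rechecks are meant to filter out.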
Whenever a task receives a message, it should perform a series of defensive
checks before processing it. The defensive checks should verify the
consistency of the message as well as the internal state of the task. The
exception handler should be invoked when a defensive check fails.
Depending on the severity, the exception handler can take any of the following
actions:
- Log a trace for developer post processing.
- Increment a leaky-bucket counter for the error condition.
- Trigger appropriate audit.
- Trigger a task rollback.
- Trigger processor reboot.
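A defensive check in front of a message handler might look like the sketch below. The message type `ReleaseSlot`, the `slot_owner` table and the `on_exception` callback are all hypothetical names chosen for illustration; the callback stands in for whichever of the actions listed above the exception handler selects.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ReleaseSlot:
    """Hypothetical message asking the task to free a slot."""
    slot: int
    call_id: int


class DefensiveCheckError(Exception):
    """Raised when a message or the task state fails a defensive check."""


def check_release(msg: ReleaseSlot, slot_owner: Dict[int, int],
                  num_slots: int) -> None:
    # Message consistency: the slot number must be in range.
    if not 0 <= msg.slot < num_slots:
        raise DefensiveCheckError(f"slot {msg.slot} out of range")
    # Internal state consistency: the slot must belong to this call.
    if slot_owner.get(msg.slot) != msg.call_id:
        raise DefensiveCheckError(
            f"slot {msg.slot} is not owned by call {msg.call_id}")


def handle_release(msg: ReleaseSlot, slot_owner: Dict[int, int],
                   num_slots: int,
                   on_exception: Callable[[Exception], None]) -> bool:
    """Process the message only if every defensive check passes."""
    try:
        check_release(msg, slot_owner, num_slots)
    except DefensiveCheckError as err:
        on_exception(err)  # log, count, audit, roll back or reboot
        return False
    del slot_owner[msg.slot]
    return True
```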
Leaky Bucket Counter
Leaky-bucket counters are used to detect a flurry of error conditions. So that
rare error conditions are ignored, the counters are periodically leaked, i.e.
decremented. If a counter reaches a certain threshold, appropriate exception
handling is triggered. The threshold will never be crossed by rare occurrences
of the associated error condition. However, if the error condition occurs
in rapid succession, the counter will overflow, i.e. cross the threshold.
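A leaky-bucket counter can be sketched in a few lines. The class and method names below are assumptions made for this example, as is the choice to treat reaching the threshold as an overflow; `leak()` is assumed to be called from a periodic timer.

```python
class LeakyBucketCounter:
    """Counts error events and periodically forgets old ones."""

    def __init__(self, threshold: int, leak_amount: int = 1) -> None:
        self.threshold = threshold
        self.leak_amount = leak_amount
        self.count = 0

    def record_error(self) -> bool:
        """Count one error; return True if the threshold has been reached."""
        self.count += 1
        return self.count >= self.threshold

    def leak(self) -> None:
        """Periodic decrement, so that rare errors never accumulate."""
        self.count = max(0, self.count - self.leak_amount)
```

With a threshold of 3, one error per leak period never overflows the counter, while three errors within a single period do.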
In a complex Realtime system, a processor reboot caused by a software bug in
a single task may not be acceptable. A better option in such cases is to
isolate the erroneous task and handle the failure at the task level. The task
in turn may decide to roll back, i.e. restart operation from a known or
previously saved state. In other cases, it may be cheaper to discard the
context by simply deleting the offending task and informing other associated
tasks.
For example, if the Space Slot Manager on the CAS card encounters an exception
condition leading to task rollback, it might resume operation by recovering
the space slot allocation status from the connection memory. On the other
hand, an exception in a call task might be handled simply by clearing the call
task and releasing all the resources assigned to it.
Task rollback may be triggered by any of the following events:
- Hardware exception conditions such as divide by zero or illegal address
access (bus error).
- A defensive check leaky-bucket counter overflows.
- An audit detects an inconsistency that is to be resolved by task rollback.
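Rolling a task back to a previously saved state can be sketched as follows. This is an illustrative sketch only: the `Task` class, its `checkpoint` and `rollback` methods, and the use of a Python exception to stand in for a hardware exception are all assumptions.

```python
import copy
from typing import Any, Callable


class Task:
    """A task that can return to its last saved state after a failure."""

    def __init__(self, state: dict) -> None:
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Save the current state as the known-good rollback point."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard the current state and resume from the saved one."""
        self.state = copy.deepcopy(self._saved)

    def handle(self, message: Any,
               handler: Callable[[dict, Any], None]) -> bool:
        """Run a handler; roll back instead of letting the failure spread."""
        try:
            handler(self.state, message)
            return True
        except Exception:
            # Stand-in for a hardware exception such as divide by zero.
            self.rollback()
            return False
```

The failure is contained inside the task: the caller only sees `False`, and the task continues from its last checkpoint rather than forcing a processor reboot.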
Processor reboots can be time consuming, leading to an unacceptable
amount of downtime. To reduce the system reboot time, complex Realtime systems
often implement incremental system initialization procedures. For example, a
typical Realtime system may implement three levels of system reboot:
- Level 1 Reboot: Operating system reboot
- Level 2 Reboot: Operating system reboot along with configuration data
download
- Level 3 Reboot: Code reload followed by operating system reboot along with
configuration data download
Incremental Reboot Procedure
- A defensive check leaky-bucket counter overflow will typically lead to
rollback of the offending task.
- In most cases task rollback will fix the problem. However, in some cases
the problem persists and further rollbacks follow in quick succession. This
will cause the task-level rollback counter to overflow, leading to a Level 1
Reboot.
- Most of the time, a Level 1 Reboot will fix the problem. But in some cases
the processor may continue to hit Level 1 Reboots repeatedly. This will
cause the Level 1 Reboot counter to overflow, leading to a Level 2 Reboot.
- In the majority of cases, a Level 2 Reboot will fix the problem. If it
does not, the processor will repeatedly hit Level 2 Reboots, causing the
Level 2 Reboot counter to overflow and leading to a Level 3 Reboot.
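The escalation procedure can be sketched as a chain of counters, one per recovery level. The class name, the single shared threshold and the `leak()` method are assumptions made for this illustration; a real system would likely tune each level separately.

```python
LEVELS = ["task rollback", "Level 1 Reboot", "Level 2 Reboot", "Level 3 Reboot"]


class RecoveryEscalator:
    """Chooses a recovery action, escalating when a level keeps recurring."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.counts = [0] * len(LEVELS)

    def next_action(self) -> str:
        """Return the recovery action for one more reported failure."""
        level = 0
        while level < len(LEVELS) - 1:
            self.counts[level] += 1
            if self.counts[level] < self.threshold:
                break
            self.counts[level] = 0  # counter overflow: reset and escalate
            level += 1
        return LEVELS[level]

    def leak(self) -> None:
        """Periodic decrement, so only closely spaced failures escalate."""
        self.counts = [max(0, c - 1) for c in self.counts]
```

With a threshold of 2 and no leaking, successive failures produce a task rollback, a Level 1 Reboot, a task rollback, and then a Level 2 Reboot, mirroring the steps listed above.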
Voting is a technique used in mission-critical systems where a software
failure may lead to loss of human life, e.g. aircraft navigation software.
Here, the Realtime system software is developed by at least three distinct
teams, each working independently. In the live system, all three
implementations run simultaneously. Every input is fed to all three versions
of the software, and their outputs are voted on to determine the actual
system response. In such systems, a bug in one of the three modules will be
voted out by the other two versions.
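A majority voter over independently developed versions can be sketched as follows. The function names and the `NoMajority` exception are hypothetical, and the sketch assumes that outputs are hashable and can be compared for exact equality, which real systems with floating-point outputs would have to relax.

```python
from collections import Counter
from typing import Any, Callable, Sequence


class NoMajority(Exception):
    """Raised when no output is shared by more than half of the versions."""


def vote(outputs: Sequence[Any]) -> Any:
    """Return the output produced by a strict majority of versions."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise NoMajority(f"no majority among {list(outputs)!r}")
    return value


def run_voted(versions: Sequence[Callable[[Any], Any]], value: Any) -> Any:
    """Feed the same input to every version and vote on the results."""
    return vote([version(value) for version in versions])
```

For instance, with two versions that square their input and a third, buggy version that doubles it, `run_voted` returns 9 for the input 3, because the buggy output 6 is outvoted.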