Handling Processor Reboot

Realtime systems typically consist of multiple processors implementing different parts of the system's functionality. Each of these processors can encounter a hardware or software failure and reboot. Realtime systems should be designed to handle processor failure and recovery smoothly.

Processor failure and recovery handling can be divided into the following steps:

A processor in the system fails. Other processors in the system detect the failure.
All other processors in the system clean up all features that are involved in interactions with the failed processor.
The failed processor reboots and comes up.
Once the processor comes back up, it establishes protocol with all the processors in the system.
After establishing protocol, the rebooted processor reconciles all its data structures with the system.
Data structure audits are initiated with other processors to weed out inconsistencies that might have taken place due to processor reboot.

In the following discussion we will cover each of the steps mentioned above. We will be taking the example of a call processor reboot.

Processor Failure Detection

When a processor reboots in the system, other processors will detect its failure in one of the following ways:

Loss of periodic health messages: In an idle system with very little traffic, loss of periodic health messages may be the only mechanism to detect processor failure. This mechanism places an upper bound on the time it will take to detect processor failure.
Protocol faults: Protocol faults are the quickest way to detect the failure of a processor in a busy system. As soon as a node sends a message to the failed processor, the protocol software will time out waiting for the peer protocol entity on the failed processor. This failure is reported to the fault handling software. Note that this technique works only when a message is sent to the failed node, so no upper bound can be specified on the failure detection time. In most situations, however, protocol fault detection will be fast, as there will be some message traffic towards the failed node. For example, a Switching card failure will be detected by other Switching and Central processors as soon as they try to send a message to the failed Switching card.
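The two detection mechanisms above can be combined in a single failure detector. The following is a minimal Python sketch, not the implementation used by any particular system. The class name FailureDetector, its method names, and the rule of declaring failure after a fixed number of missed health periods are assumptions made for illustration. Time is passed in explicitly so that the behavior is easy to follow.

```python
class FailureDetector:
    """Hypothetical detector combining health-message loss and protocol faults."""

    def __init__(self, health_period, missed_limit):
        # Assumption: a peer is declared failed once no health message has
        # arrived for `missed_limit` periods. This product is the upper bound
        # on detection time for the health-message mechanism.
        self.timeout = health_period * missed_limit
        self.last_seen = {}   # peer name -> time of last health message
        self.failed = set()   # peers currently considered failed

    def on_health_message(self, peer, now):
        # Record the arrival time of a periodic health message.
        self.last_seen[peer] = now

    def on_protocol_timeout(self, peer):
        # Called by the protocol layer when a message sent to `peer` timed out.
        # This path has no upper bound: it fires only if traffic is sent.
        self.failed.add(peer)

    def poll(self, now):
        # Mark every peer whose health messages stopped for too long as failed.
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.failed.add(peer)
        return set(self.failed)
```

In an idle system only poll() will report the failure, while in a busy system on_protocol_timeout() will usually report it first.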

Cleaning Up on Processor Failure

Whenever a node fails, all the other nodes in the system that were involved in feature interactions with this node need to be notified so that they can clean up any feature that might be affected by the failure.

For example, when a Switching card fails, all the other Switching cards are informed so that they can clear all calls that had one leg on the failed Switching card. This may appear to be fairly straightforward, but consider that the system suddenly has to clear a large number of calls at once. This may lead to a sudden increase in memory buffer and CPU utilization. The designers should take this into account when dimensioning resources.
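As an illustration of the cleanup step, the sketch below selects the calls that have a leg on the failed card and hands them out for clearing in bounded batches. Batching is only one possible way to limit the burst of buffer and CPU usage and is an assumption here, not something the design above prescribes. The Call record and the function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical call record: each call has two legs, each handled by a card.
Call = namedtuple("Call", ["call_id", "leg_a_card", "leg_b_card"])


def calls_affected_by(calls, failed_card):
    """Return the calls that have at least one leg on the failed card."""
    return [c for c in calls if failed_card in (c.leg_a_card, c.leg_b_card)]


def clear_in_batches(calls, failed_card, batch_size):
    """Yield lists of call ids to clear, at most `batch_size` ids per list.

    Clearing a bounded number of calls per scheduling pass is an assumed
    strategy for spreading out the load caused by a card failure.
    """
    affected = calls_affected_by(calls, failed_card)
    for start in range(0, len(affected), batch_size):
        yield [c.call_id for c in affected[start:start + batch_size]]
```

The batch size would have to be chosen when dimensioning the system, together with the buffer pools that the call clearing messages consume.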

Processor Recovery

Once a failed processor reboots and comes up, it will communicate with the central processor, informing it that it has recovered and is ready to resume service. At this point the central processor informs all other processors so that they can reestablish protocol with the newly recovered processor.

In the Switching example, when a Switching card recovers, it will inform the Central card about its recovery. The Central card will then inform the other Switching cards so that they can resume protocol with the recovered card. This will also involve changing the status of all terminals and trunk groups handled by the Switching card to in service.
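The recovery notification flow can be sketched as follows. This is an illustrative Python model only. The classes SwitchingCard and CentralCard, their methods, and the status strings are assumptions, and real cards would exchange messages rather than call each other directly.

```python
class SwitchingCard:
    """Hypothetical Switching card that tracks peers it has protocol with."""

    def __init__(self, name):
        self.name = name
        self.active_peers = set()

    def resume_protocol(self, peer_name):
        # Reestablish protocol with a peer that has just recovered.
        self.active_peers.add(peer_name)


class CentralCard:
    """Hypothetical Central card that coordinates recovery."""

    def __init__(self, cards, resources):
        self.cards = {card.name: card for card in cards}
        # Assumed layout: card name -> {terminal or trunk group name -> status}
        self.resources = resources

    def on_recovery_report(self, recovered_name):
        # Inform every other card so it can resume protocol with the
        # recovered card.
        notified = []
        for name, card in self.cards.items():
            if name != recovered_name:
                card.resume_protocol(recovered_name)
                notified.append(name)
        # Mark the terminals and trunk groups handled by the card in service.
        owned = self.resources.get(recovered_name, {})
        for resource in owned:
            owned[resource] = "in-service"
        return notified
```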

Data Reconciliation

When the failed card comes up, it has to recover the context that was lost due to failure. The context is recovered by the following mechanisms:

Getting the configuration data from the operations and maintenance module.
Restoring state data that was periodically backed up with the operations and maintenance module before the failure.
Reconciling data structures with other processors in the system to rebuild data structures.

When a Switching card recovers, it obtains the V5.2 interface definitions, trunk group data, etc. from the operations and maintenance module. Permanent status change information, such as circuit failure status, is obtained from the backed-up data. Transient state information, such as circuit blocking status, is recovered by exchanging blocking messages with other exchanges.
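The three recovery sources can be merged into one rebuilt context, as in the sketch below. The function rebuild_context, the dictionary layout, and the status strings are hypothetical and chosen only to show which source supplies which kind of data.

```python
def rebuild_context(config_data, backed_up_state, peer_reports):
    """Rebuild a card's context after reboot from three assumed inputs.

    config_data:     configuration from the operations and maintenance module
    backed_up_state: circuit name -> "failed" or "ok", from the periodic backup
    peer_reports:    circuit name -> True if a peer reports the circuit blocked
    """
    context = {"config": dict(config_data), "circuits": {}}

    # Permanent status (circuit failure) comes from the backed-up data.
    for circuit, status in backed_up_state.items():
        context["circuits"][circuit] = {
            "failed": status == "failed",
            "blocked": False,
        }

    # Transient status (blocking) comes from reconciling with peers.
    for circuit, blocked in peer_reports.items():
        entry = context["circuits"].setdefault(
            circuit, {"failed": False, "blocked": False}
        )
        entry["blocked"] = blocked

    return context
```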

Audits

A processor reboot might have created a lot of inconsistencies in the system. Software audits are run just after processor recovery to catch these inconsistencies. Once the inconsistencies are fixed, the system designers may opt to run audits periodically to counter inconsistencies that might arise during the normal course of operation.

When the Switching card recovers, it triggers the following audits:

Space slot resource audit with Central
Time slot resource audit with other Switching cards
Call audit with other Switching cards and Central

The above audits will clean up any hanging slot allocations or hanging calls in the system.
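A slot resource audit of this kind can be sketched as a comparison between the allocations recorded by the resource owner and the allocations the audited peer actually reports in use. The function audit_slots and its data layout are assumptions made for illustration. A real audit would likely confirm a mismatch before releasing the resource.

```python
def audit_slots(allocated, in_use):
    """Release hanging slot allocations and return the released slots.

    allocated: slot number -> call id, as recorded by the resource owner
    in_use:    set of (slot number, call id) pairs reported by the peer

    A slot is treated as hanging when the owner has it allocated to a call
    that the peer does not report for that slot.
    """
    hanging = sorted(
        slot for slot, call in allocated.items() if (slot, call) not in in_use
    )
    for slot in hanging:
        del allocated[slot]  # free the hanging allocation
    return hanging
```

Running the same comparison periodically, and not only after recovery, corresponds to the periodic audits mentioned above.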