Handling Processor Reboot

Realtime systems typically consist of multiple processors, each implementing a different part of the system's functionality. Any of these processors can encounter a hardware or software failure and reboot. Realtime systems should be designed to handle processor failure and recovery smoothly.

Processor failure and recovery handling can be divided into the following steps:

  1. A processor in the system fails. Other processors in the system detect the failure.
  2. All other processors in the system clean up all features that are involved in interactions with the failed processor.
  3. The failed processor reboots and comes up.
  4. Once the processor comes back up, it establishes protocol with all the processors in the system.
  5. After establishing protocol, the rebooted processor reconciles all its data structures with the system.
  6. Data structure audits are initiated with other processors to weed out inconsistencies that might have taken place due to processor reboot.
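
The six steps above can be read as an ordered sequence of phases. The following is a minimal sketch of that ordering only; the phase names and the `next_phase` helper are hypothetical labels chosen for illustration and do not come from any particular system.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical names for the six steps listed above, in order."""
    FAILURE_DETECTED = auto()      # step 1: peers detect the failure
    PEERS_CLEANED_UP = auto()      # step 2: peers clean up affected features
    REBOOTED = auto()              # step 3: failed processor comes back up
    PROTOCOL_ESTABLISHED = auto()  # step 4: protocol re-established with peers
    DATA_RECONCILED = auto()       # step 5: data structures reconciled
    AUDITED = auto()               # step 6: audits remove inconsistencies


# Enum members iterate in definition order, which matches the step order.
ORDER = list(Phase)


def next_phase(current):
    """Return the phase that follows `current`, or None after the last one."""
    index = ORDER.index(current)
    return ORDER[index + 1] if index + 1 < len(ORDER) else None
```

Keeping the order in one place makes it easy to check that, for example, audits never start before protocol has been re-established.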

In the following discussion we will cover each of the steps mentioned above. We will be taking the example of a call processor reboot.

Processor Failure Detection

When a processor reboots in the system, other processors will detect its failure in one of the following ways:

Cleaning Up on Processor Failure

Whenever a node fails, all the other nodes in the system that were involved in feature interactions with it need to be notified so that they can clean up any feature that might be affected by the failure.

For example, when a Switching card fails, all the other Switching cards are informed so that they can clear all calls that had one leg in the failed Switching card. This may appear fairly straightforward, but the system suddenly has to clear a large number of calls at once. This can cause a sharp increase in memory buffer and CPU utilization. The designers should take this into account when dimensioning resources.
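
One way to bound the load spike described above is to clear the affected calls in fixed-size batches rather than all at once. The sketch below assumes this batching approach, which the text does not prescribe; the `Call` record, the `leg_cards` field and the `batch_size` parameter are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    call_id: int
    leg_cards: tuple  # names of the cards carrying the legs of this call


def calls_touching(calls, failed_card):
    """Return the calls that have at least one leg on the failed card."""
    return [call for call in calls if failed_card in call.leg_cards]


def clear_in_batches(calls, failed_card, batch_size):
    """Yield lists of call IDs to clear, at most `batch_size` IDs per list."""
    pending = deque(calls_touching(calls, failed_card))
    while pending:
        count = min(batch_size, len(pending))
        yield [pending.popleft().call_id for _ in range(count)]
```

A caller would process one batch per scheduling interval, so that buffer and CPU usage stay within the dimensioned limits while the backlog drains.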

Processor Recovery

Once a failed processor reboots and comes up, it informs the central processor that it has recovered and is ready to resume service. The central processor then informs all other processors so that they can reestablish protocol with the recovered processor.

In the Switching example, when a Switching card recovers, it informs the Central card of its recovery. The Central card then informs the other Switching cards so that they can resume protocol with the recovered card. This also involves changing the status of all terminals and trunk groups handled by the recovered Switching card to in service.
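
The notification flow in this example can be sketched as follows. This is an illustrative model only: the class names, method names and the `"in service"` status string are assumptions, and real cards would exchange messages rather than call each other directly.

```python
class SwitchingCard:
    """Hypothetical model of a Switching card that tracks its protocol peers."""

    def __init__(self, name):
        self.name = name
        self.peers_in_protocol = set()

    def resume_protocol(self, peer_name):
        # Called when the Central card reports that `peer_name` has recovered.
        self.peers_in_protocol.add(peer_name)


class CentralCard:
    """Hypothetical model of the Central card coordinating recovery."""

    def __init__(self, cards):
        self.cards = {card.name: card for card in cards}
        self.resource_status = {}  # terminal or trunk group id -> status

    def on_recovered(self, card_name, resources):
        # Tell every other Switching card to resume protocol with the recovered one.
        for name, card in self.cards.items():
            if name != card_name:
                card.resume_protocol(card_name)
        # Mark the terminals and trunk groups handled by the card as in service.
        for resource in resources:
            self.resource_status[resource] = "in service"
```

For instance, after `central.on_recovered("sw1", ["trunk-group-7"])`, every card other than `sw1` lists `sw1` among its protocol peers and `trunk-group-7` is marked in service.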

Data Reconciliation

When the failed card comes up, it has to recover the context that was lost due to failure. The context is recovered by the following mechanisms:

When a Switching card recovers, it obtains the V5.2 interface definition, trunk group data, etc. from the operations and maintenance module. Permanent status change information, such as circuit failure status, is obtained from the backed-up data. Transient state information, such as circuit blocking status, is recovered by exchanging blocking messages with other exchanges.
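
The three sources of context in this example (configuration from operations and maintenance, permanent status from backup, transient status from peers) can be combined as in the sketch below. The function name, the dictionary layout and the `(circuit, blocked)` report format are hypothetical; they only illustrate that each kind of state is restored from a different source.

```python
def rebuild_context(config_data, backed_up_status, peer_reports):
    """Rebuild a recovered card's context from its three sources.

    config_data:      configuration from the operations and maintenance module
    backed_up_status: permanent status (e.g. circuit failures) from backup
    peer_reports:     iterable of (circuit, blocked) pairs learned from peers
    """
    context = {
        "config": dict(config_data),
        "permanent_status": dict(backed_up_status),
        "transient_status": {},
    }
    # Transient state is not stored anywhere locally; it is relearned from peers.
    for circuit, blocked in peer_reports:
        context["transient_status"][circuit] = "blocked" if blocked else "unblocked"
    return context
```

Keeping the three categories separate in the rebuilt context makes it clear which data can be trusted immediately and which depends on the peers having answered.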

Audits

A processor reboot might have created a lot of inconsistencies in the system. Software audits are run just after processor recovery to catch these inconsistencies. Once the inconsistencies are fixed, the system designers may opt to keep audits running periodically to counter inconsistencies that might arise during the normal course of operation.

When the Switching card recovers, it triggers the following audits:

The above audits will clean up any hanging slot allocations or hanging calls in the system.
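
As an illustration of how an audit can detect a hanging slot allocation, the sketch below compares a slot allocation table against the set of calls that still exist. It is a simplified, assumed model: the function names and the `slot -> call_id` table layout are hypothetical, and a real audit would run between processors by exchanging messages.

```python
def find_hanging_slots(slot_allocations, active_call_ids):
    """Return slots allocated to calls that no longer exist, in sorted order.

    slot_allocations: dict mapping slot number -> call id holding the slot
    active_call_ids:  set of call ids that are still active
    """
    return sorted(
        slot
        for slot, call_id in slot_allocations.items()
        if call_id not in active_call_ids
    )


def release_hanging_slots(slot_allocations, active_call_ids):
    """Free every hanging slot in place and return the list of freed slots."""
    hanging = find_hanging_slots(slot_allocations, active_call_ids)
    for slot in hanging:
        del slot_allocations[slot]
    return hanging
```

The same comparison can be run in the other direction (calls that reference slots which are not allocated) to find hanging calls.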