Fault Handling Techniques

This article describes techniques used in fault handling software design. It walks through a typical fault handling state transition diagram in detail and covers several fault detection and isolation techniques.

Fault Handling Lifecycle

The following figure describes the fault handling lifecycle of an active unit in a redundancy pair.

Fault Handling State Machine

  1. Assume that the system is running with copy-0 as the active unit and copy-1 as the standby.
  2. When copy-0 fails, copy-1 detects the fault through one of the fault detection mechanisms.
  3. At this point, copy-1 takes over from copy-0 and becomes active. The state of copy-0 is marked suspect, pending diagnostics.
  4. The system raises an alarm, notifying the operator that the system is working in a non-redundant configuration.
  5. Diagnostics are scheduled on copy-0. This includes power-on diagnostics and hardware interface diagnostics.
  6. If the diagnostics on copy-0 pass, copy-0 is brought into service as the standby unit. If the diagnostics fail, copy-0 is marked failed and the operator is notified about the failed card.
  7. The operator replaces the failed card and commands the system to bring the card in-service.
  8. The system schedules diagnostics on the new card to ascertain that the card is healthy.
  9. Once the diagnostics pass, copy-0 is marked standby.
  10. Copy-0 now starts monitoring the health of copy-1, which is currently the active copy.
  11. The system clears the non-redundant configuration alarm as redundancy has been restored.
  12. The operator can restore the original configuration by switching over the two copies.
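
The lifecycle above can be sketched as a small state machine. This is only an illustrative sketch, not the article's implementation: the names UnitState, RedundancyPair, active_unit_failed, diagnostics_completed and switchover are hypothetical, and the running of diagnostics and card replacement are assumed to happen outside the code, which only records their outcome.

```python
from enum import Enum


class UnitState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    SUSPECT = "suspect"  # taken out of service, diagnostics pending
    FAILED = "failed"    # diagnostics failed, card must be replaced


class RedundancyPair:
    """Hypothetical tracker for copy-0, copy-1 and the non-redundant alarm."""

    def __init__(self):
        # Step 1: copy-0 is active, copy-1 is standby.
        self.state = {"copy-0": UnitState.ACTIVE, "copy-1": UnitState.STANDBY}
        self.non_redundant_alarm = False

    def _unit_in(self, wanted):
        # Return the name of the unit currently in the wanted state, or None.
        for unit, state in self.state.items():
            if state is wanted:
                return unit
        return None

    def active_unit_failed(self):
        # Steps 2-4: the standby takes over, the failed unit is marked
        # suspect, and the non-redundant configuration alarm is raised.
        active = self._unit_in(UnitState.ACTIVE)
        standby = self._unit_in(UnitState.STANDBY)
        if active is None or standby is None:
            raise RuntimeError("no healthy standby available to take over")
        self.state[standby] = UnitState.ACTIVE
        self.state[active] = UnitState.SUSPECT
        self.non_redundant_alarm = True

    def diagnostics_completed(self, unit, passed):
        # Steps 5-6 and 8-11: record the diagnostics result for a suspect
        # unit or for a replaced (previously failed) card.
        if self.state[unit] not in (UnitState.SUSPECT, UnitState.FAILED):
            raise RuntimeError(f"{unit} is not awaiting diagnostics")
        if passed:
            self.state[unit] = UnitState.STANDBY
            self.non_redundant_alarm = False  # redundancy restored
        else:
            self.state[unit] = UnitState.FAILED

    def switchover(self):
        # Step 12: the operator swaps the roles of the two healthy copies.
        active = self._unit_in(UnitState.ACTIVE)
        standby = self._unit_in(UnitState.STANDBY)
        if active is None or standby is None:
            raise RuntimeError("switchover needs one active and one standby unit")
        self.state[active], self.state[standby] = UnitState.STANDBY, UnitState.ACTIVE
```

Calling active_unit_failed(), then diagnostics_completed("copy-0", passed=False), then diagnostics_completed("copy-0", passed=True) after the card is replaced, and finally switchover() walks through steps 2 to 12 in order.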

Fault Detection and Isolation

Fault Detection

One of the most important aspects of fault handling is detecting a fault immediately and isolating it to the appropriate unit as quickly as possible. Here are some of the commonly used fault detection mechanisms.

Fault Isolation

If a unit is actually faulty, many fault triggers will be generated for that unit. The main objective of fault isolation is to correlate the fault triggers and identify the faulty unit. If the fault triggers are fuzzy, that is, they do not point to a single unit, the isolation procedure involves interrogating the health of several units. For example, if a protocol fault is the only fault reported, all the units in the path from source to destination are probed for health.
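
One possible way to correlate triggers is sketched below. It is an assumption-laden illustration rather than a prescribed algorithm: each trigger is modeled as the list of units it implicates (a fuzzy trigger such as a protocol fault lists every unit on the path), and the function name isolate_fault and the probe callback are hypothetical. The sketch counts how often each unit is implicated and probes the most implicated units first.

```python
from collections import Counter


def isolate_fault(triggers, probe):
    """Return the first implicated unit that fails its health probe, or None.

    triggers: list of lists of unit names implicated by each fault trigger.
    probe:    callable taking a unit name and returning True if it is healthy.
    """
    # Correlate the triggers: count how many triggers implicate each unit.
    votes = Counter(unit for suspects in triggers for unit in suspects)

    # Interrogate units in order of suspicion, most implicated first.
    for unit, _count in votes.most_common():
        if not probe(unit):
            return unit
    return None  # every implicated unit answered the probe as healthy
```

For a lone protocol fault between a source and a destination, triggers would hold a single entry listing every unit on the path, so each of them is probed in turn until an unhealthy one is found.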