Issues in Real-time System Design

Designing Realtime systems is a challenging task. Most of the challenge comes from the fact that Realtime systems have to interact with real-world entities. These interactions can get fairly complex. A typical Realtime system might be interacting with thousands of such entities at the same time. For example, a telephone switching system routinely handles calls from tens of thousands of subscribers. The system has to connect each call differently, and the exact sequence of events in a call can vary a lot.

In the following sections we will be discussing these very issues:

Realtime Response
Recovering from Failures
Working with Distributed Architectures
Asynchronous Communication
Race Conditions and Timing

Realtime Response

Realtime systems have to respond to external interactions in a predetermined amount of time. Successful completion of an operation depends upon the correct and timely operation of the system. Design the hardware and the software in the system to meet the Realtime requirements. For example, a telephone switching system must feed dial tone to thousands of subscribers within a recommended limit of one second. To meet these requirements, the off hook detection mechanism and the software message communication involved have to work within the limited time budget. The system has to meet these requirements for all the calls being set up at any given time.

The designers have to focus very early on the Realtime response requirements. During the architecture design phase, the hardware and software engineers work together to select a system architecture that will meet the requirements. This involves deciding the interconnectivity of the processors, link speeds, processor speeds, etc. The main questions to be asked are:

Is the architecture suitable? If message communication involves too many nodes, the system may fail to meet the Realtime requirements under even mild congestion. Thus a simpler architecture has a better chance of meeting the Realtime requirements.

Are the link speeds adequate? Generally, loading a link beyond 40-50% is a bad idea. Higher link utilization causes queues to build up on different nodes, which leads to variable delays in message communication.

Are the processing components powerful enough? A CPU with really high utilization will lead to unpredictable Realtime behavior. Also, it is possible that the high priority tasks in the system will starve the low priority tasks of any CPU time. This can cause the low priority tasks to misbehave. As with links, keep the peak CPU utilization below 50%.
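
To see why utilization much beyond 50% is risky for links and CPUs alike, here is a rough sketch based on the textbook M/M/1 queueing model. The model is an assumption made for illustration (real links and processors are rarely this simple), and the function name and the 1 ms service time are hypothetical.

```python
def mm1_mean_wait(service_time_ms: float, utilization: float) -> float:
    """Mean time a message waits in the queue (excluding its own service)
    under the M/M/1 model: Wq = rho / (1 - rho) * service_time."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization) * service_time_ms


if __name__ == "__main__":
    # Illustrative assumption: each message takes 1 ms to serve.
    for rho in (0.3, 0.5, 0.7, 0.9):
        wait = mm1_mean_wait(1.0, rho)
        print(f"utilization {rho:.0%}: mean queueing delay {wait:.2f} ms")
```

Under this model the mean wait equals one service time at 50% utilization and nine service times at 90%, so delay grows much faster than load once the 50% mark is passed.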

Is the Operating System suitable? Assign high priority to tasks that are involved in processing Realtime critical events. Consider preemptive scheduling if Realtime requirements are stringent. When choosing the operating system, the interrupt latency and scheduling variance should be verified.
- Scheduling variance refers to how predictable task scheduling times are. For example, a telephone switching system is expected to feed dial tone in less than 500 ms. This would typically involve scheduling three to five tasks within the stipulated time. Most operating systems would easily meet these numbers as far as the mean dial tone delay is concerned. But general purpose operating systems would have a much higher standard deviation in the dial tone delay.
- Interrupt Latency refers to the delay with which the operating system can handle interrupts and schedule tasks to respond to the interrupt. Again, real-time operating systems would have much lower interrupt latency.
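
The difference between mean delay and scheduling variance can be shown with a small sketch. The sample values below are made up purely for illustration (they are not measurements), and the helper name `summarize_delays` is hypothetical. The 500 ms budget is the dial tone figure mentioned above.

```python
import statistics


def summarize_delays(delays_ms, budget_ms):
    """Return mean, standard deviation, worst case, and the number of
    samples that exceeded the budget."""
    return {
        "mean": statistics.mean(delays_ms),
        "stdev": statistics.pstdev(delays_ms),
        "worst": max(delays_ms),
        "misses": sum(1 for d in delays_ms if d > budget_ms),
    }


# Made-up dial tone delays in milliseconds, for illustration only.
steady = [200, 202, 198, 201, 199]   # low scheduling variance
jittery = [50, 60, 70, 60, 760]      # same mean, high variance

if __name__ == "__main__":
    for name, samples in (("steady", steady), ("jittery", jittery)):
        print(name, summarize_delays(samples, budget_ms=500))
```

Both sample sets have the same mean, but only the high-variance set misses the budget, which is why the mean alone is not enough when evaluating an operating system.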

Recovering from Failures

Realtime systems must function reliably in the event of failures. These failures can be internal as well as external. The following sections discuss the issues involved in handling these failures.

Internal Failures

Internal failures can be due to hardware and software failures in the system. The different types of failures you would typically expect are:

Software Failures in a Task: Unlike desktop applications, Realtime applications do not have the luxury of popping up a dialog box and exiting on detecting a failure. Design the tasks to safeguard against error conditions. This becomes even more important in a Realtime system because the sequence of events can result in a large number of scenarios, and it may not be possible to test all the cases in the laboratory environment. Thus apply defensive checks to recover from error conditions. Also, some software error conditions might lead to a task hitting a processor exception. In such cases, it might sometimes be possible to just roll back the task to its previously saved state.
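
The combination of defensive checks and rollback to a saved state can be sketched as follows. This is a minimal illustration only: the `Task` class, its state fields and the message format are hypothetical, and a Python exception stands in for a processor exception.

```python
import copy


class Task:
    """Hypothetical task that checkpoints its state before each message."""

    def __init__(self):
        self.state = {"messages_seen": 0, "digits": []}

    def handle(self, message):
        self.state["messages_seen"] += 1
        # Defensive check: reject malformed input instead of using it.
        digit = message.get("digit")
        if not isinstance(digit, int) or not 0 <= digit <= 9:
            raise ValueError(f"invalid digit: {digit!r}")
        self.state["digits"].append(digit)

    def safe_handle(self, message):
        """Process one message; on any error, restore the saved state."""
        checkpoint = copy.deepcopy(self.state)
        try:
            self.handle(message)
            return True
        except Exception:
            # Roll back so a half-processed message leaves no trace.
            self.state = checkpoint
            return False
```

The task keeps running after a bad message, and its state is exactly what it was before the message arrived.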

Processor Restart: Most Realtime systems are made up of multiple nodes. It is not acceptable to bring down the complete system on failure of a single node, so design the software to handle independent failure of any of the nodes. This involves two activities:
1. Handling Processor Failure: When a processor fails, other processors have to be notified about the failure. These processors will then abort any interactions with the failed processor node. For example, if a control processor fails, the telephone switch clears all calls involving that processor.
2. Recovering Context for the Failed Processor: When the failed processor comes back up, it will have to recover all its lost context from other processors in the system. There is always a chance of inconsistencies between different processors in the system. In such cases, the system runs audits to resolve any inconsistencies. Taking our switch example, once the control processor comes up it will recover the status of subscriber ports from other processors. To avoid any inconsistencies, the system initiates audits to crosscheck data-structures on the different control processors.
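
The audit step described above can be sketched as a comparison of two port status tables. This is a simplified illustration: the function name and table layout are hypothetical, and treating the peer processor's view as authoritative is an assumption that a real system would have to decide case by case.

```python
def audit_ports(recovered_view, peer_view):
    """Compare the restarted processor's port table with a peer's table.

    Returns the sorted list of ports whose status disagreed. In this
    sketch the peer is treated as authoritative (an assumption), so the
    recovered table is corrected in place.
    """
    mismatches = []
    for port, peer_status in peer_view.items():
        if recovered_view.get(port) != peer_status:
            mismatches.append(port)
            recovered_view[port] = peer_status
    # Ports the peer does not know about are stale; drop them.
    for port in list(recovered_view):
        if port not in peer_view:
            mismatches.append(port)
            del recovered_view[port]
    return sorted(mismatches)
```

The returned list can be logged or reported to the operator, so that recurring inconsistencies are noticed rather than silently repaired.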

Board Failure: Realtime systems are expected to recover from hardware failures. The system should be able to detect and recover from board failures. When a board fails, the system notifies the operator about it. Also, if the failed board has a spare, the system should be able to switch in the spare.

Link Failure: Most of the communication in Realtime systems takes place over links connecting the different processing nodes in the system. Again, the system isolates a link failure and reroutes messages so that link failure does not disturb the message communication.
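
A minimal sketch of isolating a failed link and rerouting around it is shown below. The `Router` class and the link names are hypothetical. The sketch assumes each destination has a pre-configured list of links in order of preference.

```python
class Router:
    """Hypothetical per-destination route list, in order of preference."""

    def __init__(self, routes):
        self.routes = routes          # destination -> [link, link, ...]
        self.failed_links = set()

    def link_down(self, link):
        self.failed_links.add(link)   # isolate the failed link

    def link_up(self, link):
        self.failed_links.discard(link)

    def pick_link(self, destination):
        """Return the first healthy link, or None if none remains."""
        for link in self.routes.get(destination, []):
            if link not in self.failed_links:
                return link
        return None
```

Senders call `pick_link` for every message, so a link failure changes the route without the sending task having to know about it.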

External Failures

Realtime systems have to perform in the real world. Thus they should recover from failures in the external environment. Different types of failures that can take place in the environment are:

Invalid Behavior of External Entities: When a Realtime system interacts with external entities, it should be able to handle all possible failure conditions from these entities. A good example of this is the way a telephone switching system handles calls from subscribers. In this case, the system is interacting with humans, so it should handle all kinds of failures, like:
1. Subscriber goes off hook but does not dial.
2. Toddler playing with the phone!
3. Subscriber hangs up before completing dialing.

Interconnectivity Failure: A Realtime system is often distributed across several locations, with external links connecting these locations. Handling failures of these links is similar to handling internal link failures. The major difference is that such failures might last for an extended duration, and often it might not be possible to reroute the messages.

Working with Distributed Architectures

Most Realtime systems involve processing on several different nodes. The system itself distributes the processing load among several processors. This introduces several challenges in design:

Maintaining Consistency: Maintaining data-structure consistency is a challenge when multiple processors are involved in feature execution. Consistency is generally maintained by running data-structure audits.

Initializing the System: Initializing a system with multiple processors is far more complicated than bringing up a single machine. In most systems the software release is resident on the OMC (Operations and Maintenance Center). The node that is directly connected to the OMC will initialize first. When this node finishes initialization, it will initiate software downloads for the child nodes directly connected to it. This process goes on in a hierarchical fashion till the complete system is initialized.

Inter-Processor Interfaces: One of the biggest headaches in Realtime systems is defining and maintaining message interfaces. Defining interfaces is complicated by different byte ordering and padding rules in processors. Maintaining interfaces is complicated by backward compatibility issues. For example, if a cellular system changes the air interface protocol for a new breed of phones, it will still have to support interfaces with older phones.

Load Distribution: When multiple processors and links are involved in message interactions, distributing the load evenly can be a daunting task. If the system has evenly balanced load, the capacity of the system can be increased by adding more processors. Such systems are said to scale linearly with increasing processing power. But often designers find themselves in a position where a single processor or link becomes a bottleneck. This leads to costly redesign of the features to improve system scalability.

Centralized Resource Allocation: Distributed systems may be running on multiple processors, but they have to allocate resources from a shared pool. Shared pool allocation is typically managed by a single processor. If the system is not designed carefully, this shared resource allocator can become a bottleneck in achieving full system capacity.
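
The byte ordering and padding problem mentioned under inter-processor interfaces is usually handled by fixing the wire format explicitly instead of sending in-memory structures. The sketch below uses Python's `struct` module. The message layout, field names and version numbers are hypothetical and chosen only for illustration.

```python
import struct

# Hypothetical header: version (1 byte), message type (1 byte),
# port number (2 bytes), call id (4 bytes). The "!" prefix selects
# network (big-endian) byte order with standard sizes and no padding,
# so every processor sees the same 8 bytes.
HEADER = struct.Struct("!BBHI")

# Older peers keep working as long as their version stays in this set.
SUPPORTED_VERSIONS = (1, 2)


def encode(version, msg_type, port, call_id):
    return HEADER.pack(version, msg_type, port, call_id)


def decode(data):
    version, msg_type, port, call_id = HEADER.unpack_from(data)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported interface version {version}")
    return {"version": version, "type": msg_type,
            "port": port, "call_id": call_id}
```

Carrying a version field in every message is one simple way to keep supporting older peers while the interface evolves. Other schemes, such as negotiating the version once at startup, are equally possible.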

Asynchronous Communication

Remote procedure calls (RPC) are used in computer systems to simplify software design. RPC allows a programmer to call procedures on a remote machine with the same semantics as local procedure calls. RPCs really simplify the design and development of conventional systems, but they are of very limited use in Realtime systems. The main reason is that most communication in the real world is asynchronous in nature, i.e. very few message interactions can be classified into the query response paradigm that works so well using RPCs.

Thus most Realtime systems support state machine based design where multiple messages can be received in a single state. The next state is determined by the contents of the received message. State machines provide a very flexible mechanism to handle asynchronous message interactions. The flexibility comes with its own complexities. We will be covering state machine design issues in future additions to the Realtime Mantra.
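
A table-driven state machine is one common way to implement this. The sketch below is a simplified illustration, not a complete call model. The state names, message names and transitions are all hypothetical.

```python
# Hypothetical call states and messages, for illustration only.
# Several different messages can be received in the same state.
TRANSITIONS = {
    ("IDLE", "OFF_HOOK"): "DIAL_TONE",
    ("DIAL_TONE", "DIGIT"): "COLLECTING_DIGITS",
    ("DIAL_TONE", "ON_HOOK"): "IDLE",
    ("DIAL_TONE", "TIMEOUT"): "IDLE",
    ("COLLECTING_DIGITS", "DIGIT"): "COLLECTING_DIGITS",
    ("COLLECTING_DIGITS", "DIALING_COMPLETE"): "ROUTING",
    ("COLLECTING_DIGITS", "ON_HOOK"): "IDLE",
    ("COLLECTING_DIGITS", "TIMEOUT"): "IDLE",
}


class CallStateMachine:
    def __init__(self):
        self.state = "IDLE"

    def on_message(self, message):
        """Move to the next state; ignore messages not valid in this state."""
        next_state = TRANSITIONS.get((self.state, message))
        if next_state is not None:
            self.state = next_state
        return self.state
```

Because the handler never blocks waiting for a particular reply, a subscriber who hangs up in the middle of dialing is handled by the same mechanism as one who completes the call.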

Race Conditions and Timing

It is said that the three most important things in Realtime system design are timing, timing and timing. A brief look at any protocol will underscore the importance of timing. All the steps in a protocol are described with exact timing specifications for each stage. Most protocols will also specify how the timing should vary with increasing load. Realtime systems deal with timing issues by using timers. Timers are started to monitor the progress of events. If the expected event takes place, the timer is stopped. If the expected event does not take place, the timer will time out and recovery action will be triggered.
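
The start, stop and timeout pattern can be sketched with a small tick-driven timer manager. This is a minimal illustration: the class and timer names are hypothetical, and time only advances when `tick()` is called, whereas a real system would drive the ticks from a hardware or operating system clock.

```python
class TimerManager:
    """Tick-driven timers; time advances only when tick() is called."""

    def __init__(self):
        self.now = 0
        self.timers = {}  # name -> (expiry_tick, callback)

    def start(self, name, duration, on_timeout):
        self.timers[name] = (self.now + duration, on_timeout)

    def stop(self, name):
        # Called when the expected event arrives in time.
        self.timers.pop(name, None)

    def tick(self, ticks=1):
        for _ in range(ticks):
            self.now += 1
            expired = [name for name, (expiry, _cb) in self.timers.items()
                       if expiry <= self.now]
            for name in expired:
                entry = self.timers.pop(name, None)
                if entry is not None:
                    entry[1](name)  # trigger the recovery action
```

A task would start a timer when it sends a request, stop it when the response arrives, and run its recovery action from the callback if the response never comes.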

A race condition occurs when the state of a resource depends on timing factors that are not predictable. This is best explained with an example. Telephone exchanges have two-way trunks which can be used by either of the two exchanges connected by the trunk. The problem is that both ends can allocate the trunk at more or less the same time, resulting in a race condition: the same trunk has been allocated for an incoming and an outgoing call. This race condition can be easily resolved by defining rules on who gets to keep the resource when such a clash occurs. The clash itself can be made rare by requiring the two exchanges to allocate from opposite ends of the trunk pool, so that there will be no clashes under low load. Under high load, clashes will still occur and will be resolved by the pre-defined rules.

A more conservative design would partition the two-way trunk pool into two one-way pools. This would avoid the race condition but would fragment the resource pool.
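
The two-way trunk scheme, with allocation from opposite ends and a pre-defined rule for clashes, can be sketched as follows. Everything here is a simplified illustration. The class and function names are hypothetical, and the tie-break rule (the low-end exchange "A" wins clashes on the lower half of the pool) is an assumption, not a rule taken from any particular signaling standard.

```python
class TrunkGroup:
    """One exchange's view of a shared two-way trunk group."""

    def __init__(self, trunk_count, from_low_end):
        self.busy = set()
        order = list(range(trunk_count))
        self.search_order = order if from_low_end else order[::-1]

    def allocate(self):
        """Seize the first idle trunk in this exchange's search order."""
        for trunk in self.search_order:
            if trunk not in self.busy:
                self.busy.add(trunk)
                return trunk
        return None


def resolve_clash(trunk, trunk_count):
    """Hypothetical rule: exchange A keeps lower-half trunks, B the rest."""
    return "A" if trunk < trunk_count // 2 else "B"


def seize_simultaneously(a, b, trunk_count):
    """Both ends pick a trunk before hearing about the other's choice."""
    pick_a, pick_b = a.allocate(), b.allocate()
    if pick_a is not None and pick_a == pick_b:
        # Clash: the rule names the winner; the loser must retry elsewhere.
        return pick_a, pick_b, resolve_clash(pick_a, trunk_count)
    # No clash: each end learns that the other's trunk is now busy.
    if pick_b is not None:
        a.busy.add(pick_b)
    if pick_a is not None:
        b.busy.add(pick_a)
    return pick_a, pick_b, None
```

With three trunks, the first simultaneous seizure gives trunk 0 to A and trunk 2 to B without a clash. The second one collides on trunk 1, the only idle trunk left, and the rule awards it to B.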

The main issue here is identifying race conditions. Most race conditions are not as simple as this one. Some of them are subtle and can only be identified by careful examination of the design.