Issues in Real-time System Design

Designing Realtime systems is a challenging task. Most of the challenge comes from the fact that Realtime systems have to interact with real world entities. These interactions can get fairly complex. A typical Realtime system might be interacting with thousands of such entities at the same time. For example, a telephone switching system routinely handles calls from tens of thousands of subscriber. The system has to connect each call differently. Also, the exact sequence of events in the call might vary a lot.

In the following sections we will be discussing these very issues:

Realtime Response

Realtime systems have to respond to external interactions in a predetermined amount of time. Successful completion of an operation depends upon the correct and timely operation of the system. Design the hardware and the software in the system to meet the Realtime requirements. For example, a telephone switching system must feed dial tone to thousands of subscribers within a recommended limit of one second. To meet these requirements, the off hook detection mechanism and the software message communication involved have to work within the limited time budget. The system has to meet these requirements for all the calls being set up at any given time.

The designers have to focus very early on the Realtime response requirements. During the architecture design phase, the hardware and software engineers work together to select the right system architecture that will meet the requirements. This involves deciding inter connectivity of the processors, link speeds, processor speeds, etc. The main questions to be asked are:

Recovering from Failures

Realtime systems must function reliably in event of failures. These failures can be internal as well as external. The following sections discuss the issues involved in handling these failures.

Internal Failures

Internal failures can be due to hardware and software failures in the system. The different types of failures you would typically expect are:

External Failures

Realtime systems have to perform in the real world. Thus they should recover from failures in the external environment. Different types of failures that can take place in the environment are:

Working with Distributed Architectures

Most Realtime systems involve processing on several different nodes. The system itself distributes the processing load among several processors. This introduces several challenges in design:

Asynchronous Communication

Remote procedure calls (RPC) are used in computer systems to simplify software design. RPC allows a programmer to call procedures on a remote machine with the same semantics as local procedure calls. RPCs really simplify the design and development of conventional systems, but they are of very limited use in Realtime systems. The main reason is that most communication in the real world is asynchronous in nature, i.e. very few message interactions can be classified into the query response paradigm that works so well using RPCs.

Thus most Realtime systems support state machine based design where multiple messages can be received in a single state. The next state is determined by the contents of the received message. State machines provide a very flexible mechanism to handle asynchronous message interactions. The flexibility comes with its own complexities. We will be covering state machine design issues in future additions to the Realtime Mantra.

Race Conditions and Timing

It is said that the three most important things in Realtime system design are timing, timing and timing. A brief look at any protocol will underscore the importance of timing. All the steps in a protocol are described with exact timing specification for each stage. Most protocols will also specify how the timing should vary with increasing load. Realtime systems deal with timing issues by using timers. Timers are started to monitor the progress of events. If the expected event takes place, the timer is stopped. If the expected event does not take place, the timer will timeout and recovery action will be triggered.

A race condition occurs when the state of a resource depends on timing factors that are not predictable. This is best explained with an example. Telephone exchanges have two way trunks which can be used by any of the two exchanges connected by the trunk. The problem is that both ends can allocate the trunk at more or less the same time, thus resulting in a race condition. Here the same trunk has been allocated for a incoming and an outgoing call. This race condition can be easily resolved by defining rules on who gets to keep the resource when such a clash occurs. The race condition can be avoided by requiring the two exchanges to work from different ends of the pool. Thus there will be no clashes under low load. Under high load race conditions will be hit which will be resolved by the pre-defined rules.

A more conservative design would partition the two way trunk pool into two one way pools. This would avoid the race condition but would fragment the resource pool.

The main issue here is identifying race conditions. Most race conditions are not as simple as this one. Some of them are subtle and can only be identified by careful examination of the design.