This section examines in detail a limitation of the upper bound model tuning technique, which arises from the assumption that the model defines an absolute upper limit on run-time. Intuitively, it might be expected that the effect of reducing a grain size should be either to reduce run-time or to leave it unchanged, depending on whether the grain is or is not critical. In fact, a reduced grain size can theoretically lead to an increase in overall run-time; this will be termed an anomaly. A simple form of the phenomenon is illustrated below. The critical path of the program is formed by the grains of tasks A, B, C and E. Their messages can be assumed to be sent and received in negligible time, so the run-time of this part of the program is the sum of the grain sizes, A+B+C+E. Task A also sends a message to task R, which is not on the critical path but which sends a message to task D. D, in turn, sends a message to the final grain, E. This program is illustrated as a DAG in Figure 34.
Figure 34 : An Example of a DAG Program
The behaviour of this program before tuning when it is mapped onto three processors is shown in the upper part of Figure 35. The program runs, as planned, with the grains A, B, C and E on the critical path. If sufficient reduction is now made in the size of the grain R, the grain D is brought into contention with the grain B on the critical path. If the priorities are such that D pre-empts B, the reduction in the size of R can cause the completion of B to be considerably delayed. Hence, the reduction in grain size causes an increase in overall run-time and thus illustrates the concept of an anomaly.
[Figure 35: execution timelines (horizontal axis: time) for processors A, B and C, before tuning (upper part) and after tuning (lower part). After tuning, the reduction in the grain size of R allows D to pre-empt B on processor B; D interrupts the critical path and the overall run-time is extended. Key: critical path grains, non-critical grains, messages.]
Figure 35 : An Example of a Tuning Anomaly
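The anomaly can be reproduced with a small simulation. The sketch below is illustrative only and is not the modelling tool used in this work: it assumes unit time steps, zero message time, pre-emptive fixed-priority scheduling on each processor, and a mapping in which A and R share one processor, B and D a second, and C and E a third. The function names, the grain sizes and the priority values are all hypothetical.

```python
def simulate(size, preds, proc, prio):
    """Return the overall run-time under pre-emptive fixed-priority scheduling.

    Time advances in unit ticks and messages take zero time. Grain sizes are
    assumed to be positive integers and the precedence graph to be a DAG.
    """
    remaining = dict(size)
    finish = {}                      # grain -> completion time
    t = 0
    while len(finish) < len(size):
        # Grains whose predecessors had all completed by time t.
        ready = [g for g in size
                 if g not in finish and all(p in finish for p in preds[g])]
        # On each processor, run the ready grain with the highest priority.
        running = {}
        for g in ready:
            p = proc[g]
            if p not in running or prio[g] > prio[running[p]]:
                running[p] = g
        for g in running.values():
            remaining[g] -= 1
        t += 1
        for g in running.values():
            if remaining[g] == 0:
                finish[g] = t
    return t


# Hypothetical values, chosen only for illustration.
PREDS = {"A": [], "B": ["A"], "C": ["B"], "R": ["A"], "D": ["R"], "E": ["C", "D"]}
PROC = {"A": 0, "R": 0, "B": 1, "D": 1, "C": 2, "E": 2}    # assumed mapping
D_OVER_B = {"A": 0, "R": 0, "B": 1, "D": 2, "C": 0, "E": 0}  # D may pre-empt B
B_OVER_D = {"A": 0, "R": 0, "B": 2, "D": 1, "C": 0, "E": 0}  # B cannot be pre-empted


def run_time(r_size, prio):
    """Run-time of the example program for a given size of grain R."""
    size = {"A": 1, "B": 4, "C": 4, "R": r_size, "D": 2, "E": 1}
    return simulate(size, PREDS, PROC, prio)


print("before tuning (R=5), D over B:", run_time(5, D_OVER_B))
print("after tuning  (R=1), D over B:", run_time(1, D_OVER_B))
print("after tuning  (R=1), B over D:", run_time(1, B_OVER_D))
```

With these assumed values the three runs take 10, 12 and 10 time units respectively. Reducing R from 5 to 1 lengthens the run by the size of D when D is allowed to pre-empt B, and has no effect on run-time when B has the higher priority.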
The importance of anomalies is that they violate the assumption that an upper bound model represents an upper limit on run-time. An anomaly can cause the run-time of a program for a particular data set to be longer than that of the upper bound model. The implication is that the upper bound model cannot totally resolve the risk from data-dependence where it is possible for one grain of processing to pre-empt another.
This phenomenon is not peculiar to the special case studied here and can occur in many circumstances. Firstly, anomalous effects are not an artefact of imperfections in modelling or monitoring tools. Indeed, even the program itself, executed for a data set for which all its grain sizes take their worst-case values, would not necessarily define an absolute upper bound on run-time. Secondly, anomalies can occur even if computations are highly engineered so that the upper bounds on grain sizes are never violated. Finally, anomalies are not an artefact of non-determinism or complexity in programs. In the above illustration, they occur for a simple DAG program with well-understood behaviour and a deterministic trace. On the other hand, it is certainly reasonable to assume that anomalies can also occur for complex, non-deterministic and non-DAG programs, and with modelling and monitoring tools which are not perfect.
For performance-critical programs, it is therefore important to understand the circumstances under which anomalies can and cannot occur, and the circumstances in which they can have significant effects and important practical consequences. One way of preventing anomalies from occurring in the first place is through program design. Anomalies only occur when less critical grains are allowed to pre-empt more critical grains. Hence, anomalies never occur if the grains are prioritised so that such priority inversions cannot occur. However, priority inversions are often associated with message routing. Processing grains associated with routing generally have a high priority, which is usually defined in system software and is not user-modifiable. Hence, the high-priority processing grain associated with routing a non-critical message may be brought into contention with a critical grain of processing, causing an anomaly. This may lead to large effects if there are many such messages to be routed.
Where anomalies may occur for a performance-critical program, it is important to manage the risks they cause. This can be done by testing for their effects. The visualisation of behaviour, whether from a program model or from performance monitoring results, can provide information as to how anomalies might occur. In particular, events can be used to show the order in which grains are executed and how they may contend. The exception to this occurs if program behaviour is very complex, so that the number of events to be analysed becomes overwhelming, or if there is no well-defined critical path or load. Where anomalies can be detected, the priorities of tasks and the routing of messages can be tuned to ensure that critical grains are not pre-empted.
Testing for anomalies is facilitated by modelling, because a model allows grain sizes to be varied directly. In particular, it is possible to carry out a sensitivity analysis by testing the effects of grain size variations in a systematic way. If a reduction in a single grain size with respect to the upper bound model produces an increase in run-time, the possibility of anomalous behaviour has been detected. Such tests are much harder to carry out on the program itself, because the program grain sizes can only be varied indirectly, by varying the input data. Failure to detect anomalies by sensitivity analysis affords a good indication that tuning anomalies will not arise, but is not an absolute proof. Greater assurance can be provided by testing combinations of grain sizes and by testing a range of values for each grain size. Sensitivity analysis reaches its limits where many grain sizes are variable over a wide range, so that the space of possibilities is too large to test thoroughly.
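A single-grain sensitivity analysis of this kind might be sketched as follows, using the A, B, C, R, D, E example from earlier in this section. The simulator is a deliberately simple stand-in for a real program model: it assumes unit time steps, zero message time and pre-emptive fixed-priority scheduling. The function names, the upper bound grain sizes, the processor mapping and the priorities are hypothetical values chosen for illustration.

```python
def simulate(size, preds, proc, prio):
    """Run-time under pre-emptive fixed-priority scheduling, in unit ticks.

    Assumes positive integer grain sizes, a DAG and zero message time.
    """
    remaining = dict(size)
    finish = {}                      # grain -> completion time
    t = 0
    while len(finish) < len(size):
        # Grains whose predecessors had all completed by time t.
        ready = [g for g in size
                 if g not in finish and all(p in finish for p in preds[g])]
        # On each processor, run the ready grain with the highest priority.
        running = {}
        for g in ready:
            p = proc[g]
            if p not in running or prio[g] > prio[running[p]]:
                running[p] = g
        for g in running.values():
            remaining[g] -= 1
        t += 1
        for g in running.values():
            if remaining[g] == 0:
                finish[g] = t
    return t


def find_anomalies(upper, preds, proc, prio):
    """Reduce one grain at a time below its upper bound.

    Returns the upper bound run-time and a list of (grain, reduced size,
    run-time) for every single-grain reduction that increases run-time.
    """
    bound = simulate(upper, preds, proc, prio)
    anomalies = []
    for g in upper:
        for s in range(1, upper[g]):         # every size below the bound
            trial = {**upper, g: s}
            t = simulate(trial, preds, proc, prio)
            if t > bound:
                anomalies.append((g, s, t))
    return bound, anomalies


# Hypothetical upper bound model of the example program.
UPPER = {"A": 1, "B": 4, "C": 4, "R": 5, "D": 2, "E": 1}
PREDS = {"A": [], "B": ["A"], "C": ["B"], "R": ["A"], "D": ["R"], "E": ["C", "D"]}
PROC = {"A": 0, "R": 0, "B": 1, "D": 1, "C": 2, "E": 2}    # assumed mapping
PRIO = {"A": 0, "R": 0, "B": 1, "D": 2, "C": 0, "E": 0}    # D may pre-empt B

BOUND, ANOMALIES = find_anomalies(UPPER, PREDS, PROC, PRIO)
print("upper bound run-time:", BOUND)
for grain, reduced, t in ANOMALIES:
    print(f"grain {grain} reduced to {reduced}: run-time {t}")
```

For these assumed values the upper bound run-time is 10, and only reductions of R to 3, 2 or 1 are reported, each giving a run-time of 12. The sweep covers one grain at a time, so it cannot detect anomalies that appear only when several grain sizes are reduced together.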
In summary, while a number of modelling techniques can be employed to minimise the risks from anomalies, the risk cannot be entirely eliminated. If the program is time-critical, data-dependent grain sizes and large amounts of communication should be avoided if possible, particularly if they are shown to lead to complex variations in behaviour.