

The quasi-static scheduling of an actor partition enables the actors to be merged into one larger actor with predictable internal behavior. A composed actor includes the FIFOs connecting the actors in the partition, all the state variables of the actors, and all the actions of the actors. After composition, the interaction between these is more predictable, as the behavior of the actor is described by a set of static schedules. For efficient code generation, it makes a difference how the actor is merged and transformed before code generation.

Typically, code generation from high-level descriptions such as dataflow targets other programming languages (e.g. C/C++), which already have wide support for code generation for various processor architectures. This in turn means that the efficiency of a program partly depends on how well the low-level compiler is able to optimize the code. Several trade-offs that should be considered in code generation are discussed by Battacharya et al. in [18]; in addition to transformations related to high-level concepts such as clustering and scheduling, these include code generation aspects such as a potential negative interaction between, e.g., inlining and register allocation. While the work in this thesis is concerned with scheduling and composition, it will make a difference how the merged actors are constructed. For the tool chain, this means two things. First, the transformation should be made on a close-to-algorithmic level, as low-level optimizations are provided by the low-level compiler and all possible options should be kept open for these optimizations. Second, optimizations related to the information specific to a dataflow program should be used to minimize the needed calculations of the program, as the low-level compiler does not have this information.

The actor merging approach presented in Paper 5 [26] is an example of this. The quasi-static scheduler provided for an actor partition is implemented such that there is a minimal loss of information. The actions of the original actors are made into procedures, the schedules of the composed actor become actions which sequentially fire the action procedures, and the FSM and guards of the compound scheduler are used directly. This kind of structure enables the code generator to inline these (action) procedures if it is useful, or to iterate sequences of repeated actions either in a loop or as a sequence of procedure calls. The point is that it is up to the low-level compiler to decide what is the most adequate code for the target platform. The actor composition enables the actor merger to remove all internal buffers of a composition, as the quasi-static scheduler of such a composition can be used to predict the read and write behavior of the buffers. As a result, FIFO channels can be replaced with simple data structures such as arrays.
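The structure described above can be illustrated with a minimal sketch. It is not the merger from Paper 5 [26]; the two actors, their actions, and all names (`scale_action`, `offset_action`, `composed_action`) are hypothetical and chosen only to show actions turned into procedures, a schedule turned into an action, and an internal FIFO turned into an array.

```python
# Hypothetical example: two actors ("scale" and "offset") merged into one
# composed actor. None of these names come from the thesis tool chain.

def scale_action(x):
    """Action of the first original actor, now a plain procedure."""
    return 2 * x


def offset_action(x):
    """Action of the second original actor, now a plain procedure."""
    return x + 1


def composed_action(inputs):
    """One action of the composed actor: a static schedule that fires the
    action procedures in a fixed order.

    The FIFO that connected the two actors is replaced by a fixed-size
    list, because the schedule fixes when each slot is written and read.
    """
    internal = [0] * len(inputs)            # replaces the internal FIFO
    for i, token in enumerate(inputs):      # repeated firings as a loop
        internal[i] = scale_action(token)
    return [offset_action(value) for value in internal]
```

Whether the two procedures are inlined or the loops are unrolled is left to the compiler (or, here, the interpreter); the sketch only fixes the order of the firings.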

What should be handled by the dataflow code generator, however, is anything related to removing scheduling decisions, removing redundant scheduling information such as FIFOs or state variables, and anything related to sequentializing parts of the code. This type of transformation may not be possible for a low-level compiler to perform, as it cannot know about the restrictions a dataflow program has regarding data locality etc. This means that it would make sense for the tool to remove state variables and FIFOs that cannot affect the outputs of the partition that has been merged into one actor.
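One possible way to identify state variables and FIFOs that cannot affect the outputs is a backward reachability pass over a dependency graph. The text does not specify how the tool would do this, so the following is only a sketch under that assumption; the function `live_variables` and the variable names in the usage note are hypothetical.

```python
def live_variables(dependencies, outputs):
    """Return every variable that can influence one of the outputs.

    `dependencies` maps a variable (state variable, FIFO, or port) to the
    variables it is computed from. Anything not returned can be removed
    without changing the outputs.
    """
    live = set()
    stack = list(outputs)
    while stack:
        variable = stack.pop()
        if variable in live:
            continue
        live.add(variable)
        stack.extend(dependencies.get(variable, ()))
    return live
```

For a dependency map such as `{"out": ["acc"], "acc": ["acc", "in"], "debug_count": ["debug_count"]}` with the single output `"out"`, the variable `debug_count` is not reached and would be a candidate for removal.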

With the composed actors in place, the last step that affects performance is the actual code generator. A number of code generators exist for CAL, and tools such as Orcc are under active development. As these code generators are constantly improved, it is important that the scheduling tools, which in this case are external, are able to use these code generators.

6.4 Related Work

This chapter is mainly based on the experimental results that have been obtained from the evaluation of the scheduling approaches presented. A number of publications present scheduling approaches, and each of these presents, in one way or another, experimental results which could be compared to the results in this work. While it is to some extent hard to directly compare measurements such as speedup, some experimental results should be compared to show, on a more general level, that the presented results and conclusions hold.

Some other Experiments. To get broader results regarding the potential of efficient scheduling and actor composition, some of the approaches presented earlier should be compared with respect to the reported performance improvement.

Wipliez et al. presented actor classification in [124, 123], where the scheduling is performed based on the classes of MoCs that the actors are determined to belong to. This kind of approach can provide very efficient scheduling when the actors belong to static MoCs, which is also shown by the results in Paper 3 [50], where this classification approach was used to schedule one of the partitions. When the actors of a partition belong to statically schedulable classes, it is difficult to improve on the speedup of this type of approach. With the actor classification, the speedup of the MPEG-4 decoder that was used as an example in [123] was about 20%. However, the results in Paper 3 [50] show, for a different platform, that the speedup of an MPEG-4 decoder is up to 53% using the actor composition based on the classification presented in [124].

Boutellier et al. presented a scheduling approach based on dynamic code analysis in [27]. The dynamic code analysis resembles the work presented in this thesis, with the exception of how the input sequences of a program are generated and how the actual schedules are searched for. Otherwise, the type of composition is quite similar, as it is based on the traces of how actions of several actors are fired. The results presented in [27] show a speedup of up to 58% for an MPEG-4 decoder running on an Intel Core 2 Duo E8500 processor. The schedules produced with the dynamic code analysis approach resemble the schedules produced by the model checking approach. For this reason, the resulting performance can be expected to reside in the same range; however, the partitioning is more closely specified in the dynamic code analysis approach.

Gu et al. presented the use of static regions for enabling static scheduling of parts of a dataflow program in [60]. This approach, while elegant, only performs well when the application is sufficiently static. The approach was evaluated on the IDCT network of a CAL implementation of an MPEG-4 decoder and showed a speedup of about 10% regarding the frame rate. Again, it is difficult to compare the results; however, it shows an improvement when the dataflow network is appropriately restructured. What is interesting in this approach, although it was not the case in this example, is that the restructuring is not necessarily a pure composition but may also involve splitting actors, or ending up with a completely new set of actors.

is based on the CFDF model of computation. The experimental results presented show a reduction of execution time of up to almost 85% for some signal processing applications, compared to a simple round-robin scheduler. However, for one of the test programs, the run-time increased; this result corresponds to the findings presented in this chapter regarding the trade-off between different overheads.

Falk et al. presented a rule-based quasi-static scheduling approach in [55], where sets of static actors were clustered together and quasi-statically scheduled such that deadlocks were avoided when the composite actors interacted with their surroundings, including dynamic actors. Falk et al. present some experimental results with the proposed clustering algorithms where, among other results, the performance of a Motion JPEG decoder was improved by 40% when the (fine-grained) IDCT network was clustered with the proposed algorithm. Another experiment showed that the speedup of a coarser-grained mp3 decoder was about 6%. This is another example showing that clustering, or composition, is only useful up to a specific point where the program reaches a sweet spot, which again depends on the platform.

Each of these works shows an improvement in the speed of a dataflow program in the (quite wide) range of 10-85%, depending on the platform and the application that was used in the experiment. These results correspond to the experiments performed in this work, and therefore show that actor composition and scheduling are an important part of the tools that are needed for fitting dataflow applications onto various processor architectures. The other property that is of interest to evaluate based on relevant literature is the scalability of dataflow programs onto multi-core architectures.

Scalability to Many-Core. The main reason why dataflow is interesting for software development is that it scales well to many-core architectures. Now, while it is easy to see that the explicit parallelism of a dataflow program at least in theory scales to as many cores as there are actors, a statement like this needs some evidence that takes into account the potential overheads of a realistic platform.

CAL is in principle platform independent, and the same implementation should run on anything between an FPGA and a single-processor architecture. Eker et al. present in [47] an overview of results where CAL is used to target both hardware and software. One of the results shows that an MPEG-4 decoder scales almost linearly, at least up to four cores, without modification. Of course, the program can be made to scale even better if the appropriate transformations are performed on the actor network.

To know which transformation to perform, some knowledge regarding the program is needed. Casale-Brunet et al. presented a tool called TURNUS in [32], which performs a profiled simulation of CAL dataflow programs. The tool records causation traces and analyses properties such as computational load, optimal buffer sizes, critical path measurements, etc. The analysis provided by TURNUS is used to decide how the program is to be refactored to meet the requirements [32, 3].

With the appropriate design space exploration for identifying the transformations needed for a program to run efficiently on a platform, dataflow programs will provide a scalable implementation of many algorithms. The scalability to many-core is, of course, still limited by the size of the program and the communication overheads between cores.

Chapter 7

Some Scheduling Case Studies

The scheduling methods should be general enough to produce a scheduler for any program, or at least reproduce the original actor if a better one cannot be produced. At the same time, the scheduler should use specific properties of the program to be able to identify more clever schedules for the program. This is a difficult trade-off; in the work by Wipliez et al. [124], for example, the actors must conform to one of a few statically schedulable MoCs. The schedules produced are very efficient, but the approach also requires the actors to be implemented such that they can be scheduled according to one of the given MoCs. In this work, the target is something more general, which is still enough to identify the schedules to be searched for. In practice, the question is how to ask the model checker the right questions regarding the state to search for. This was already described in Chapter 5, and worked for the kind of applications that were used as examples, but it is still relevant to evaluate the methods with some more applications and types of applications.

In this chapter some relevant applications¹ for the signal processing and multimedia domain are used to show how different properties of the program affect the scheduling approach. The idea is to draw from this some conclusions about the generality of the approach and what kind of problems may arise. The first two cases are basic digital filters, where the scheduling can be expected to be static; however, the implementation may cause some difficulties with the scheduling. The following case is a network protocol application, which may also be expected to have static scheduling, but the size of the packets is variable. The last cases are a couple of video decoding implementations, representing larger applications than the other case studies, and are for this reason partitioned into a couple of regions which are individually scheduled and composed.

¹ The applications are available at https://github.com/orcc/orc-apps and I would like to thank all the authors for their work, as none of this work would have been possible without their contributions.

7.1 Case I – FIR Filter

To get started, let us consider a very simple application that is typical for dataflow, namely a Finite Impulse Response (FIR) filter. An FIR filter is a digital filter where the output signal is a function of a finite number of inputs, and it can be implemented with a number of simple actors such as adders, multipliers, and delays. In practice, the operation of an FIR filter consists of reading input tokens, keeping each token in the filter for some time, and generating the output as the sum of different input tokens, each multiplied by a constant that depends on how old the input is. As an example, a three-tap filter is a filter that uses the current input and the two previous inputs to produce the output sample.
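As a minimal sketch of the filter operation just described, the following function computes each output as a weighted sum of the current and previous inputs. It is a plain sequential version, not the actor network used in the experiment; the function name `fir_filter` and the assumption that the delay line initially holds zeros are illustrative choices.

```python
from collections import deque


def fir_filter(samples, coefficients):
    """Apply an FIR filter to a sequence of samples.

    coefficients[0] weights the current input, coefficients[1] the
    previous input, and so on. The delay line is assumed to start
    filled with zeros.
    """
    taps = deque([0] * len(coefficients), maxlen=len(coefficients))
    outputs = []
    for sample in samples:
        taps.appendleft(sample)  # newest sample first, oldest drops out
        outputs.append(sum(c * t for c, t in zip(coefficients, taps)))
    return outputs
```

With three coefficients this is a three-tap filter: feeding it a single impulse, `fir_filter([1, 0, 0, 0], [1, 2, 3])`, returns the coefficients followed by zero, `[1, 2, 3, 0]`.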

Figure 7.1: The FIR filter implementation in Orcc.

The CAL program used for this experiment is shown in Figure 7.1. The actors are rather simple: the adder, multiplier, and rshift actors perform their respective operation in one action firing; similarly, the source and sink actors produce or consume one token in each actor firing. The delay actors also consume and produce one token in each firing; however, they keep an internal buffer that delays the tokens, in this case by one token, meaning that the first input will correspond to the second output. Finally, the delay actors and the source and sink actors have initializers which must be scheduled and run before starting the actual computations; but except for this, the actors clearly have static token rates and it should be possible to find an SDF-like schedule for the operation of the filter.
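The behavior of a delay actor can be sketched as follows. This is not the CAL source of the actor; the class name `Delay`, the method names, and the assumption that the initializer stores an initial token of value 0 are hypothetical.

```python
class Delay:
    """One-token delay: each firing consumes one token and produces the
    token stored by the previous firing."""

    def __init__(self, initial_token=0):
        self.initial_token = initial_token  # assumed initial content
        self.stored = None                  # empty until initialized

    def initialize(self):
        """Initializer: must run once before the first firing."""
        self.stored = self.initial_token

    def fire(self, token):
        """Consume `token` and produce the previously stored token."""
        if self.stored is None:
            raise RuntimeError("delay actor fired before initialization")
        output, self.stored = self.stored, token
        return output
```

After `initialize()`, firing the actor with the inputs 5, 6, 7 produces 0, 5, 6: the first input appears as the second output, and the token rates are one in and one out per firing.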

Two schedules can be expected to be found for this application: an initializing schedule that is run once, and a schedule that passes one token through the filter and can be repeated forever. In order to find appropriate schedules for the application, two properties need to hold for a schedule. First, a schedule must show some progress, that is, some actions must be fired and the actors must have been initialized. Second, the FIFOs must be empty after the schedule has been completed.

The actors of this network are classified as belonging to either SDF or KPN, and none of the actors have input-dependent guards. In other words, the actors are not time-dependent and there are no control tokens sent between the actors. As a result, the application can be scheduled based on token rates and actor states only. One actor is chosen as the leader for the partition, but because there are no control tokens in this partition, the choice of actor is arbitrary; it has no consequence for the composition and only affects the naming of the schedules. In this example program none of the actors has an FSM scheduler, which means that each actor is modeled as an FSM with one state, where each action is a transition starting and ending in this state.

The first schedule to search for is the one starting in the initial state and ending in a state where the program has made progress and has empty FIFOs. This schedule is found and includes the initialization of all the actors which have an initializer, but it may also contain an arbitrary number of iterations of the filter. The reason for this is that if the state space is traversed in such an order that some tokens are read into the dataflow network before a state matching the properties searched for is found, the schedule includes processing these tokens. It is not incorrect to have such a schedule; however, it means that the program cannot be initialized before a specific number of inputs is available. A more general schedule can be found by instructing the model checker to search for the shortest trace to such a state. When requested, the Spin model checker does this by continuing the search after a matching state has been found, while restricting the search depth to be less than the shortest trace found so far. By using this approach, a minimal schedule to the state where only the initializer has been run is found. This state is named 's1', and in a similar fashion the minimal schedule processing one token and returning to this same state is found. The scheduler can be described as an FSM with two states and two transitions:

0 one_state_run one_state one_state s0 s1
1 one_state_run one_state one_state s1 s1
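The idea of tightening the depth bound each time a matching state is found can be sketched on a toy state space. This is not Spin and not the model used in the thesis; the three-component state (initialized flag and two FIFO fill levels), the four action names, the FIFO capacity, and the depth limit are all assumptions made for illustration.

```python
CAPACITY = 2  # assumed FIFO capacity, keeps the toy state space finite


def successors(state):
    """Yield (action name, next state) for every enabled action."""
    initialized, fifo_in, fifo_out = state
    if fifo_in < CAPACITY:              # listed first on purpose, so the
        yield "source", (initialized, fifo_in + 1, fifo_out)  # first hit is long
    if not initialized:
        yield "init", (True, fifo_in, fifo_out)
    if initialized and fifo_in > 0 and fifo_out < CAPACITY:
        yield "filter", (initialized, fifo_in - 1, fifo_out + 1)
    if fifo_out > 0:
        yield "sink", (initialized, fifo_in, fifo_out - 1)


def is_target(state):
    """Actors initialized and all FIFOs empty."""
    return state == (True, 0, 0)


def shortest_schedule(start, max_depth=6):
    """Depth-first search that, after each match, only accepts strictly
    shorter traces. A non-empty trace stands for 'progress'."""
    best = None
    bound = max_depth

    def dfs(state, trace):
        nonlocal best, bound
        if len(trace) > bound:
            return
        if trace and is_target(state):
            best = list(trace)
            bound = len(trace) - 1      # tighten the depth bound
            return
        for name, next_state in successors(state):
            trace.append(name)
            dfs(next_state, trace)
            trace.pop()

    dfs(start, [])
    return best
```

Starting from the uninitialized state `(False, 0, 0)`, the search first reaches the target through traces that also process tokens, but the tightened bound leaves `["init"]` as the result. Starting from `(True, 0, 0)`, the result is `["source", "filter", "sink"]`, which corresponds to one token passing through the network and returning to the same state.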

The output from the scheduling framework is an XML file describing how actors are composed and how the composed actors are to be scheduled. A simplified version of the scheduler description, which shows the parts relevant for the discussion, is shown in Figure 7.2. The scheduler includes an FSM scheduler and static schedules, which here are called superactions; each superaction includes a list of action firings and guard expressions, which in this case are empty. This description of the composed actor scheduler is then used to merge the actors before code is generated for them.
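How such a description drives a composed actor can be sketched with an in-memory stand-in. The dictionary layout below is not the XML format of the framework, and the superaction and action names are hypothetical; guards are omitted because they are empty in this case.

```python
# Hypothetical stand-in for the scheduler description: an FSM whose
# transitions each name a superaction (a static list of action firings).
SCHEDULER = {
    "initial": "s0",
    "transitions": {
        "s0": ("init_schedule", "s1"),  # run once
        "s1": ("run_schedule", "s1"),   # repeated for every token
    },
    "superactions": {
        "init_schedule": ["delay.init"],
        "run_schedule": ["source.fire", "delay.fire", "sink.fire"],
    },
}


def run(scheduler, steps):
    """Take `steps` FSM transitions and return the actions fired, in order."""
    fired = []
    state = scheduler["initial"]
    for _ in range(steps):
        superaction, next_state = scheduler["transitions"][state]
        fired.extend(scheduler["superactions"][superaction])
        state = next_state
    return fired
```

Calling `run(SCHEDULER, 3)` fires the initialization superaction once, followed by two repetitions of the steady-state superaction.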

The produced scheduler consists, as expected, of an initialization phase and an SDF schedule which then can be repeated as long as the application