The discussion of experimental results includes two experimental scenarios. First, we compare the WCET and the analysis time obtained for small programs from the analysis of architectural flows (TDM) versus control flows (TDM) in Table 8.1. Second, we compare the WCET results of composable TDM versus an LR abstraction of a TDM arbiter for Mälardalen WCET benchmark programs [109] in Table 8.2. By compositionality of the LR abstraction, and assuming that each processor core has a sufficiently large private data memory (D-$) and a common initial hardware state, each program is analyzed independently of the program configured to run on the second core. We consider the simplified multicore architecture in Fig. 8.4(b), where instructions are stored in a partitioned SRAM memory shared through a TDM arbiter.
Figure 8.4: A simple source code running on a simplified multicore architecture. (a) Template for a multi-process source program. (b) Simplified multi-processor architecture.
Architectural flows cannot, in general, be feasibly computed. However, we do compute the interleavings for the simple program in Fig. 8.4(a), where “application A” and “application X” have only a few instructions each. Due to its natural composability, the analysis of control flows with TDM arbitration is much faster than the analysis of architectural flows, requiring less than 1% of the time. With respect to the WCET estimate, the first row in Table 8.1 shows a lower WCET (179 CPU cycles) for the interleavings approach compared to composable TDM analysis (185 CPU cycles). This difference in the WCET is a consequence of the actual hardware state of the processor core running “application X” upon the invocation of the fork procedure, and it demonstrates the impact that the intermediate hardware states have on the timing analysis of architectural flows.
In fact, when the number of instructions of “application X” is greater than the number of instructions of “application A”, the worst-case path corresponds to that of “application X”. However, since the analysis of “application X” starts with an empty pipeline state, it naturally takes fewer CPU cycles to complete. After increasing the number of instructions in “application A”, this effect is eliminated because the worst-case path becomes that of “application A”. Consequently, the two analyses yield the same WCET in the last two experiments.
Table 8.1: Comparison results for architectural flows and composable TDM
No. instructions   No. instructions   No. of          Result                 Architectural   Composable
“application A”    “application X”    interleavings                          flows (TDM)     TDM
4                  5                  126             WCET (CPU cycles)      179             185
                                                      Analysis time (sec.)   57.0            0.17
5                  5                  252             WCET (CPU cycles)      188             188
                                                      Analysis time (sec.)   140.3           0.18
6                  5                  462             WCET (CPU cycles)      195             195
                                                      Analysis time (sec.)   588.7           0.43
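The interleaving counts in Table 8.1 are consistent with the binomial coefficient C(m + n, m), the number of ways to merge two instruction sequences of lengths m and n while preserving the order within each sequence. The following Haskell sketch reproduces the three counts; the function names are illustrative, and that the analysis tool enumerates interleavings in exactly this way is an assumption.

```haskell
-- Number of order-preserving interleavings of two instruction sequences
-- of lengths m and n: the binomial coefficient C(m + n, m).
interleavings :: Integer -> Integer -> Integer
interleavings m n = product [n + 1 .. m + n] `div` product [1 .. m]

-- Instruction counts ("application A", "application X") taken from Table 8.1.
table81 :: [(Integer, Integer)]
table81 = [(4, 5), (5, 5), (6, 5)]

-- Interleaving counts for the three experiments of Table 8.1.
interleavingCounts :: [Integer]
interleavingCounts = [interleavings a x | (a, x) <- table81]
```

The combinatorial growth of this count is what makes the analysis time of architectural flows rise so quickly when a single instruction is added.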
Next, we compare the WCET results in Table 8.2 obtained using the LR abstraction with Θ = 1 and ρ = 0.5 (modeling a particular TDM configuration with a frame size of 2) to the results obtained with composable TDM. The WCET values presented in Table 8.2 depend not only on the size of the instruction cache and on the ability of the LR server to stay busy, but also on the program flow, e.g. the number of loop iterations. Since we are considering a blocking multicore architecture, where a request from a processor core cannot be issued before the previous request has been served, every request starts a new busy period by definition. This is the most unfavorable situation possible for the LR abstraction, since every request requires Θ + 1/ρ cycles to complete, maximizing the overhead compared to TDM.
Still, our experiments show that this overhead is limited to between 8.7% and 12.1% for the considered arbiter, configuration, and applications. This is partly because the small frame size, through the low value Θ = 1, reduces the penalty of starting a new busy period upon every cache miss, but also because the case of an SRAM shared by a TDM arbiter is quite simple and is captured well by the abstraction. A more complex case with DRAM and CCSP arbitration is shown in [127], along with an optimization that reduces the pessimism of the abstraction without loss of generality. The run-time of the analysis tool is approximately the same for both composable TDM and the LR abstraction.
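The per-request bound Θ + 1/ρ discussed above can be written down directly. The following is a minimal Haskell sketch, not the thesis implementation: the record type, the field names and `blockingBound` are hypothetical, and one memory request is assumed to correspond to one service unit.

```haskell
import Data.Ratio ((%))

-- Parameters of a latency-rate (LR) server (names are illustrative).
data LRServer = LRServer
  { latency :: Rational  -- Θ, service latency in cycles
  , rate    :: Rational  -- ρ, allocated rate in service units per cycle
  }

-- Worst-case completion time, in cycles, of one request that starts a
-- new busy period: Θ + 1/ρ.
requestBound :: LRServer -> Rational
requestBound s = latency s + 1 / rate s

-- On a blocking core every request starts a new busy period, so n
-- requests take at most n times the single-request bound.
blockingBound :: LRServer -> Integer -> Rational
blockingBound s n = fromInteger n * requestBound s

-- The configuration used in the experiments: Θ = 1 and ρ = 0.5.
tdmFrame2 :: LRServer
tdmFrame2 = LRServer { latency = 1, rate = 1 % 2 }
```

With these parameters `requestBound tdmFrame2` evaluates to 3 cycles per request. Using `Rational` keeps the arithmetic exact for non-integer rates.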
From this experiment, we conclude that compositional analysis of control flows using the LR abstraction is very fast and scalable compared to the analysis of architectural flows. The analysis time is similar to that of compositional analysis based on composable TDM arbitration, although the LR abstraction incurs a reduction in accuracy of about 8-12% for our configuration and applications. More precise WCET estimates would be obtained for multicore architectures that support high levels of parallelism, for example architectures with super-scalar pipelines or with caches allowing multiple outstanding requests. This would reduce the number of busy periods in the LR server upon cache misses, but would also increase the overall complexity of the WCET analyzer. Nevertheless, the main benefit of the LR abstraction is that it enables compositional timing analysis using any arbiter belonging to the class, as opposed to being limited to composable TDM.
Table 8.2: WCET results for some of the Mälardalen benchmarks
Benchmark   No. source        LR-server   No. cache   TDM      Overhead   Analysis time
            loop iterations   (WCET)      misses      (WCET)   (%)        in sec. (≈)
bs          152               1162        111         1036     10.8       2.3
bsort       156               1459        152         1311     10.1       0.9
cnt         145               1309        175         1171     10.5       0.8
cover       111               796         105         707      11.2       3.9
crc         459               3160        304         2826     10.6       15.0
expint      251               2023        233         1818     10.1       1.9
fdct        1011              10897       720         9892     9.2        20.1
fibcall     111               994         59          885      11.0       2.3
matmult     287               2580        188         2343     9.2        5.2
minmax      221               956         263         873      8.7        2.6
prime       232               1079        196         959      11.1       5.2
ud          418               3943        97          3464     12.1       40.0
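The Overhead column of Table 8.2 appears to be the difference between the two WCET estimates relative to the LR-server estimate, that is, 100 · (LR − TDM) / LR. This formula is inferred from the tabulated values rather than stated in the text, so the Haskell sketch below should be read as a consistency check under that assumption; the names are illustrative.

```haskell
-- (benchmark, LR-server WCET, composable TDM WCET) in CPU cycles,
-- copied from Table 8.2.
table82 :: [(String, Double, Double)]
table82 =
  [ ("bs", 1162, 1036), ("bsort", 1459, 1311), ("cnt", 1309, 1171)
  , ("cover", 796, 707), ("crc", 3160, 2826), ("expint", 2023, 1818)
  , ("fdct", 10897, 9892), ("fibcall", 994, 885), ("matmult", 2580, 2343)
  , ("minmax", 956, 873), ("prime", 1079, 959), ("ud", 3943, 3464)
  ]

-- Overhead of the LR abstraction in percent, relative to the LR estimate
-- (assumed formula, inferred from the table).
overheadPct :: Double -> Double -> Double
overheadPct lr tdm = 100 * (lr - tdm) / lr

-- Round to one decimal place, as printed in the table.
roundTo1 :: Double -> Double
roundTo1 x = fromIntegral (round (x * 10) :: Integer) / 10

-- Recomputed Overhead column, in table order.
overheads :: [Double]
overheads = [roundTo1 (overheadPct lr tdm) | (_, lr, tdm) <- table82]
```

Under this assumption the recomputed column matches the printed one, with a minimum of 8.7% (minmax) and a maximum of 12.1% (ud).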
8.7 Summary
The work presented in this chapter is an approach to timing analysis in multicore architectures based exclusively on the declarative frameworks of denotational semantics, abstract interpretation and functional programming. Relative to the generic framework for data flow analysis described in Section 5, the WCET analysis in multicores is defined incrementally by extending the intermediate representation language with a new syntactic element, representing programs running on different processing cores, whose denotational interpretation reuses the algebraic combinators used for static analysis in single-cores to automatically generate type-safe fixpoint (abstract) interpreters.
The complexity of the new fixpoint interpreter is reduced by using the abstraction provided by the LR server model on the timing behavior of shared resources. This abstraction is proved correct in relation to the calculational approach of “architectural flows” by means of a Galois connection. Using declarative programming in Haskell, the temporal behavior of shared resources is in direct correspondence with the mathematical definitions of the TDM and LR arbiter models. The outcome is the definition of a provably sound and compositional timing analysis in multicore environments, with a loss in precision of roughly 8-12% (Table 8.2) that is relatively small compared to the reduction in analysis time by a factor of 100.
Conclusion and Future Work
The main objective of the work reported in this dissertation is the definition of a programming-language-independent meta-semantic formalism, capable of specifying the fixpoint semantics of programs using typed and polymorphic higher-order combinators in Haskell. Sound and efficient fixpoint computations are obtained through the use of formal approaches to control-flow analysis combined with data-flow analysis within the same meta-semantic formalism. The former is obtained by a type-safe fixpoint algorithm, automatically derived from a topological order over the syntactic elements of the program. The latter is obtained by applying a calculational method to the induction of “correct by construction” abstract interpreters. The meta-semantic formalism is defined to ease fixpoint verification and program transformations in the scenarios where the framework of Abstraction-Carrying Code (ACC) can be applied. The success of our approach is evaluated by applying the meta-semantic formalism to the analysis of the worst-case execution time (WCET) of assembly programs, considering the ARM9 as the target platform. We show that the conservative approach to abstract interpretation proposed by the Cousots can be used to prove the correctness of the existing state of the art in WCET analysis, in terms of the several static analyses required to compute a WCET estimate.
When using WCET safety specifications, the verification mechanism of the abstract interpretation part of ACC was extended with duality theory applied to linear programming (LP). In this way, the complexity of the LP problem on consumer sites is reduced from NP-hard to polynomial time, by using simple linear algebra computations. Therefore, we are able to provide an efficient verification mechanism with low resource consumption.
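To illustrate why checking with duality only needs simple linear algebra, the following Haskell sketch verifies an optimality certificate for an LP in standard form: a primal solution and a dual solution that are both feasible and have equal objective values are both optimal. This is textbook LP duality, not the thesis's actual verification mechanism or encoding; the function names and the small example problem are made up for illustration, and matrix and vector dimensions are assumed to agree.

```haskell
import Data.List (transpose)

type Vec = [Rational]
type Mat = [Vec]  -- row-major matrix

dot :: Vec -> Vec -> Rational
dot u v = sum (zipWith (*) u v)

-- Matrix-vector product.
mulMV :: Mat -> Vec -> Vec
mulMV a x = map (`dot` x) a

-- Primal: maximise c.x subject to A x <= b, x >= 0.
-- Dual:   minimise b.y subject to A^T y >= c, y >= 0.
-- The pair (x, y) certifies optimality if both are feasible and the
-- objective values coincide.
checkCertificate :: Mat -> Vec -> Vec -> Vec -> Vec -> Bool
checkCertificate a b c x y =
     all (>= 0) x
  && and (zipWith (<=) (mulMV a x) b)
  && all (>= 0) y
  && and (zipWith (>=) (mulMV (transpose a) y) c)
  && dot c x == dot b y

-- Hypothetical example: maximise 3 x1 + 2 x2
-- subject to x1 + x2 <= 4 and x1 <= 2.
exampleA :: Mat
exampleA = [[1, 1], [1, 0]]

exampleB, exampleC :: Vec
exampleB = [4, 2]
exampleC = [3, 2]
```

Here `checkCertificate exampleA exampleB exampleC [2, 2] [2, 1]` holds (both objectives equal 10), whereas the feasible but suboptimal point `[2, 1]` is rejected. The check costs a few matrix-vector products, whereas producing the solution requires running a solver.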
Last but not least, we apply the latency-rate (LR) server model to our WCET analysis with the objective of overcoming the intrinsic computational complexity of timing analysis of multiple processing cores sharing common resources. The soundness of the integration of the LR timing abstraction into our data flow framework is proved using the abstract interpretation framework based on Galois connections. Although the considered multicore architecture is rather simplified, the results show that our solution for WCET analysis on multicores can be easily parametrized with an abstraction of the timing behavior of any arbiter for shared resources belonging to the class of LR servers.
9.1 Future Work
The two main limitations of our WCET analysis framework are the lack of widening/narrowing operators to accelerate the convergence of fixpoint computations and the simplification of the real ARM9 cache replacement policies and hardware timing models. The first limitation is a consequence of the requirements imposed by the ACC framework, which state that verification must be performed without manual intervention. In fact, since program flow annotations on the source code are not allowed in ACC when performing static analysis of the machine code, we have to resort to complete loop unrolling to perform an automatic program flow analysis by abstract interpretation. This can be considerably less efficient than existing state-of-the-art tools, such as AbsInt’s aiT, but it produces more precise results by minimizing the non-determinism introduced by the separate use of different analyses.
The static analysis of a realistic ARM9 microprocessor depends greatly on its hardware components and may even be impossible to perform. For example, the cache replacement policy of ARM9 is typically Pseudo-Random. This replacement policy is highly unpredictable and precludes, to the best of our knowledge, the application of static analysis methods to determine approximations of the actual dynamic cache behavior. Indeed, state-of-the-art cache analyses consider either Least Recently Used (LRU), First-In-First-Out (FIFO) or Pseudo-LRU (PLRU) [55]. For this reason, and for the sake of simplicity, we restrict our calculational approach to abstract cache analysis using the LRU replacement policy, for which we give a correctness proof by construction.
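What makes LRU amenable to static analysis is that the next cache state is a deterministic function of the current state and the accessed block. The following Haskell sketch shows this concrete update for a single cache set; it is only the concrete behavior that an abstract cache analysis would approximate, not the abstract domain used in this work, and the type and function names are illustrative.

```haskell
-- One cache set, ordered from most to least recently used block.
type CacheSet a = [a]

-- Access block b in a set with the given associativity.
-- Returns whether the access hits, together with the updated set:
-- b becomes the most recent block and, if the set overflows,
-- the least recently used block is evicted.
accessLRU :: Eq a => Int -> a -> CacheSet a -> (Bool, CacheSet a)
accessLRU ways b set = (hit, take ways (b : filter (/= b) set))
  where
    hit = b `elem` set

-- Hit/miss outcome of a sequence of accesses starting from an empty set.
simulate :: Eq a => Int -> [a] -> [Bool]
simulate ways = go []
  where
    go _ [] = []
    go set (b : bs) =
      let (hit, set') = accessLRU ways b set
      in hit : go set' bs
```

For a 2-way set, `simulate 2 "abacb"` yields `[False, False, True, False, False]`. Under a Pseudo-Random policy the evicted block is not determined by the access history alone, so no such function of the access sequence exists.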
Furthermore, recently published work on the automatic generation of timing models from VHDL microprocessor specifications [118] would allow the automatic generation of Haskell code for the pipeline timing model of ARM9. Although some progress was made in this direction in cooperation with AbsInt GmbH [2] and with the Compiler Design Lab at Saarbrücken University, the results of such work are not mature enough and are, therefore, outside the scope of this thesis. Of course, the absence of a realistic timing model for ARM9 will affect how our WCET estimates compare with those of AbsInt’s tool, for example.