
To obtain hints for developing an ASIP that targets task management, its C-implementation is analyzed at two levels: a high-level analysis of the data structure and a low-level profiling of the C-implementation. The high-level analysis focuses on how the information, especially the tasks, is organized and maintained in OSIP and in which data structure the tasks are represented. In the low-level analysis, cycle-accurate simulations are run on LT-OSIP to obtain detailed profiling results such as the frequency of memory accesses and the control flow overhead.

3.2.2.1 Data Structure

In the C-implementation, the complete system information is organized hierarchically as a tree-like structure. Each information node of the tree can be a task node, a PE node, a scheduling node or a mapping node. Note that a task node does not contain the complete information of a task, such as the function pointer and the task parameters, but only a reference to the location in the shared memory where the task information can be found. As explained for the example given in Listing 3.2 in Section 3.1.2, creating a task, or more precisely creating a task node in OSIP, actually means registering the task in OSIP with the necessary scheduling information. Similarly, fetching a task from OSIP does not fetch the complete task information directly, but only the above-mentioned reference to its location in the shared memory.

The tree structure enables hierarchical scheduling and mapping, similar to the approach introduced in [71]. Its benefit is that the scheduling and mapping structures can be easily constructed and extended for different applications and system configurations.

In Figure 3.5, an exemplary scheduling hierarchy is given. This hierarchy has three levels, one level for the task nodes (the leaf nodes of the tree) and two levels for the scheduling nodes (the root node and intermediate nodes). In each scheduling node, a scheduling policy is defined to determine the best task candidate from the underlying task nodes and sub-trees. For example, the best candidate among node 2, node 3 and node 4 is determined based on the FIFO policy defined in node 8. The best candidate for all task nodes of the whole hierarchy is determined among node 1, node 5, the winner in the sub-tree of node 8 and the winner in the sub-tree of node 9, following the fair queuing policy.
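The winner selection described above can be sketched in C as a recursive walk over the tree. This is a minimal illustration under stated assumptions, not the OSIP implementation: the names `node_t`, `winner`, `better`, `build_example` and the `NIL` marker are hypothetical, only FIFO and priority-based policies are modeled (fair queuing is omitted for brevity), and the example hierarchy is smaller than the one in Figure 3.5.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 16u      /* hypothetical pool size for this sketch */
#define NIL       0xFFFFu  /* hypothetical "no node" marker */

typedef enum { NODE_TASK, NODE_SCHED } node_kind_t;
typedef enum { POLICY_FIFO, POLICY_PRIORITY } policy_t;

typedef struct {
    node_kind_t kind;
    policy_t    policy;       /* meaningful for scheduling nodes only */
    uint16_t    first_child;  /* scheduling nodes: index of the first child */
    uint16_t    next_sibling; /* index of the next node on the same level */
    uint32_t    priority;     /* task nodes: larger value wins */
    uint32_t    arrival;      /* task nodes: smaller value arrived earlier */
} node_t;

static node_t nodes[MAX_NODES];

/* Return 1 if task a beats task b under policy p. */
int better(policy_t p, uint16_t a, uint16_t b)
{
    if (p == POLICY_FIFO)
        return nodes[a].arrival < nodes[b].arrival;
    return nodes[a].priority > nodes[b].priority;
}

/* Return the index of the winner task in the sub-tree rooted at n,
 * or NIL if the sub-tree contains no task. */
uint16_t winner(uint16_t n)
{
    assert(n < MAX_NODES);
    if (nodes[n].kind == NODE_TASK)
        return n;

    uint16_t best = NIL;
    for (uint16_t c = nodes[n].first_child; c != NIL; c = nodes[c].next_sibling) {
        uint16_t w = winner(c);  /* winner of the child's own sub-tree */
        if (w == NIL)
            continue;
        if (best == NIL || better(nodes[n].policy, w, best))
            best = w;
    }
    return best;
}

/* Build a small hypothetical hierarchy and return the root index:
 *   node 0: root, priority-based; children: task 1, scheduling node 2
 *   node 2: FIFO; children: tasks 3, 4, 5
 * The policy field of task nodes is unused. */
uint16_t build_example(void)
{
    nodes[0] = (node_t){ NODE_SCHED, POLICY_PRIORITY, 1,   NIL, 0, 0  };
    nodes[1] = (node_t){ NODE_TASK,  POLICY_FIFO,     NIL, 2,   5, 40 };
    nodes[2] = (node_t){ NODE_SCHED, POLICY_FIFO,     3,   NIL, 0, 0  };
    nodes[3] = (node_t){ NODE_TASK,  POLICY_FIFO,     NIL, 4,   9, 30 };
    nodes[4] = (node_t){ NODE_TASK,  POLICY_FIFO,     NIL, 5,   7, 10 };
    nodes[5] = (node_t){ NODE_TASK,  POLICY_FIFO,     NIL, NIL, 2, 20 };
    return 0;
}
```

With this hypothetical hierarchy, `winner(build_example())` returns 4: task 4 arrived first among the FIFO children of node 2, and its priority 7 then beats the priority 5 of task 1 at the root. Task 3 has the highest priority overall but never reaches the root, because it loses under the FIFO policy of its parent.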

The mapping hierarchy is very similar to the scheduling hierarchy, but with the PE nodes being the leaf nodes of the tree and the mapping nodes being the root or the intermediate nodes. A mapping node defines the rule by which one of the underlying PEs is selected for executing a chosen task.

[Figure: tree of nodes 1 to 10; legend: scheduling node, task node; policies: fair queuing, FIFO, priority-based]

Figure 3.5: An exemplary scheduling hierarchy

In summary, the scheduling hierarchy determines a winner task, i.e., the best task candidate to be executed next, and the mapping hierarchy determines a winner PE, i.e., the best PE candidate to execute the winner task. The switch from task scheduling to task mapping is achieved by merging the root scheduling node and the root mapping node of the two hierarchies. Note that multiple independent scheduling and mapping hierarchies can exist in parallel in a system.

These hierarchies are implemented using doubly linked lists, which make the data structure easy to maintain when nodes are added or removed. The lists are created not only along the hierarchy, but also for the nodes at the same hierarchy level, e.g., for connecting task nodes or connecting a task node to a scheduling node. For clarity, these lists are not shown in the figure.

In contrast to the standard list implementation, in which the nodes are linked using C pointers, the nodes here are linked by node indices. Each node is assigned a unique index. From the hardware perspective, these index-based lists have two main advantages over pointer-based lists. First, index-based nodes need less memory space than pointer-based ones. The size of a pointer typically corresponds to the processor architecture. For example, the pointers of a 32-bit processor usually have a size of 32 bits. In contrast, the number of bits needed for an index depends on the maximum number of supported nodes. To support 65536 nodes in the system, which is already enough for most of today's applications, only 16 bits are needed for the index. Second, indexing the nodes enables efficient hardware support for the operations performed on the lists, which will be shown later.
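The index-based linking can be illustrated with a small doubly linked list whose nodes live in one statically allocated array. This is a sketch under stated assumptions rather than the actual OSIP code: `idx_node_t`, `idx_list_t`, `pool`, `MAX_NODES`, the `payload` field and the list functions are hypothetical names, and the `NIL` marker is an assumption of this sketch that reserves one of the 65536 index values.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64u      /* hypothetical pool size for this sketch */
#define NIL       0xFFFFu  /* hypothetical "no node" marker (reserves one index) */

typedef struct {
    uint16_t prev;     /* index of the previous node, or NIL */
    uint16_t next;     /* index of the next node, or NIL */
    uint32_t payload;  /* stand-in for the node's scheduling information */
} idx_node_t;

typedef struct {
    uint16_t head;     /* index of the first node, or NIL if empty */
    uint16_t tail;     /* index of the last node, or NIL if empty */
} idx_list_t;

/* All nodes are allocated consecutively in one static array,
 * so a 16-bit index is enough to identify a node. */
static idx_node_t pool[MAX_NODES];

void list_init(idx_list_t *l)
{
    l->head = NIL;
    l->tail = NIL;
}

/* Append node n to the end of list l. */
void list_push_back(idx_list_t *l, uint16_t n)
{
    assert(n < MAX_NODES);
    pool[n].prev = l->tail;
    pool[n].next = NIL;
    if (l->tail != NIL)
        pool[l->tail].next = n;
    else
        l->head = n;
    l->tail = n;
}

/* Unlink node n from list l; n must currently be a member of l. */
void list_remove(idx_list_t *l, uint16_t n)
{
    assert(n < MAX_NODES);
    uint16_t p = pool[n].prev;
    uint16_t x = pool[n].next;
    if (p != NIL) pool[p].next = x; else l->head = x;
    if (x != NIL) pool[x].prev = p; else l->tail = p;
    pool[n].prev = NIL;
    pool[n].next = NIL;
}
```

In this sketch the two links of a node occupy 2 × 16 = 32 bits, whereas two pointers on a 32-bit processor would occupy 64 bits. Following a link is an array access by index instead of a pointer dereference.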

Linking the lists through indices is made possible by a) fixing the same size for all node types (task node, PE node, scheduling node and mapping node) and b) allocating the nodes consecutively in a static array. In the actual C-implementation, each node has a size of eight 32-bit words and serves as the basic data type that OSIP considers during scheduling and mapping. This basic data type is referred to as OSIP_DT in the following. Internally, however, the construction of OSIP_DT differs between node types, largely depending on which information has to be included in the nodes.

3.2.2.2 Profiling

In addition to the high-level analysis of the data structure, cycle-accurate simulations are run on LT-OSIP to profile the C-implementation at the instruction level. A set of applications, ranging from synthetic programs to real-life applications such as H.264 video decoding, is simulated. The profiling results are given in Figure 3.6.

[Figure: pie chart of the instruction mix: arithmetic, logic, compare, misc (41.2%); memory access (32.5%); control (26.3%). Bar charts, each with percentage (%) on the axis: number of consecutive arithmetic/logic/compare instructions (1-2, 3-4, 5-6, 6+); control (branch; nop, stall, flush); memory access (load; store; nop, stall)]

Figure 3.6: Instruction-level profiling of C-implementation

The following observations can be made from the profiling results. First, arithmetic operations, memory accesses and control flow (branch/jump operations) are roughly evenly distributed, as shown by the pie chart in the figure. Even the largest group of operations, the arithmetic operations, does not dominate the execution time.

Second, there are only few cases in which a large number of arithmetic instructions are executed consecutively (see the left bar chart of the figure). 46% of the consecutive executions contain only one or two arithmetic instructions. The reason is that the arithmetic instructions are frequently separated by control and/or memory access instructions, which is characteristic of this kind of OS-like application. During task scheduling and mapping, no highly complex data processing takes place, unlike in signal processing from the multimedia or wireless communication domain. Instead, examining system information, such as checking the status of a task or a PE, and simple arithmetic operations, such as comparing task priorities, are the most common operations needed for making scheduling and mapping decisions. This application characteristic requires frequent memory accesses on the one hand and involves high control overhead on the other hand, because decisions are made based on the information obtained from the memory accesses. Therefore, the ASIP development for OSIP differs largely from the typical data-centric ASIP development: it is control-centric. This makes the development challenging, and efficient handling of control and memory accesses becomes extremely important.

Third, a further profiling of the control and memory accesses, given in the two bar charts on the right side of the figure, shows that a large part of the execution overhead of these operations is caused by pipeline hazards, including both data and control hazards. A large number of pipeline stall and flush operations and additional nop instructions are present. This again highlights the necessity of handling control and memory accesses efficiently, or even of reducing them in general by using native hardware support.
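The control-centric character described in these observations can be illustrated with a typical scheduling kernel: walking a list of task nodes, checking each task's status and comparing priorities. The following is a hypothetical sketch, not code taken from the C-implementation; `task_node_t`, `pick_ready_task`, the field names and the `NIL` marker are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NIL 0xFFFFu  /* hypothetical "no node" marker */

typedef struct {
    uint16_t next;      /* index of the next task node, or NIL */
    uint8_t  ready;     /* nonzero if the task can be scheduled */
    uint8_t  priority;  /* larger value wins */
} task_node_t;

/* Return the index of the ready task with the highest priority in the
 * list starting at head, or NIL if no task is ready. On equal priority,
 * the task that appears first in the list is kept. */
uint16_t pick_ready_task(const task_node_t *tasks, uint16_t head)
{
    uint16_t best = NIL;
    /* Each iteration loads the link field and branches on it. */
    for (uint16_t i = head; i != NIL; i = tasks[i].next) {
        if (!tasks[i].ready)            /* load status, then branch */
            continue;
        if (best == NIL ||
            tasks[i].priority > tasks[best].priority)  /* loads, one compare, branch */
            best = i;
    }
    return best;
}
```

The loop body contains almost no arithmetic beyond a single comparison. Each decision depends on a value that has just been loaded from memory and is immediately followed by a branch, which is the pattern that leaves only short runs of arithmetic instructions between memory and control operations.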
