
Instituto Tecnológico y de Estudios Superiores de Monterrey

Campus Monterrey

School of Engineering and Sciences

Dual-Core Embedded Implementation of the SISO Adaptive Predictive Control

A thesis presented by

Rene Martinez Esquivel

Submitted to the

School of Engineering and Sciences

in partial fulfillment of the requirements for the degree of

Master of Science

in

Electronics Engineering

Monterrey, Nuevo León, March 2020


Dedication

To my girlfriend and future wife, Cinthya, whom I love. Thank you so much for your support throughout these years. To my kids, Ares and Emma, you were my motivation to finish this work.

To my beloved parents, Porfirio and Flavia. You raised me to be a hardworking individual and instilled in me the discipline required to achieve this endeavor. Thank you for your love and unconditional support.

To my siblings, Ernesto, Ruth, Eva and Esther. As the youngest one, I had the privilege of always having someone there for me. I can’t even begin to list the many ways each one of you has helped me throughout my life and continues to do so to this day. My gratitude to each of you is beyond words.


Acknowledgements

I sincerely thank my advisor, Dr. Alfonso Ávila, for giving me the opportunity to collaborate in this project and for being a mentor throughout my enrollment in this university. I would also like to thank my co-advisor, Dr. Antonio Favela, for introducing me to the area of advanced control theory.

To my employer, Commscope-Arris, I am grateful for sponsoring me via the Education Assistance program. Special thanks to my manager, Osvaldo Mendoza, for giving me the opportunity to participate in this program and for allowing me to take time off work to do my schoolwork.

Special thanks to the Tecnológico de Monterrey for the scholarship I was awarded to help me fulfill this professional endeavor.


Dual-Core Embedded Implementation of the SISO Adaptive Predictive Control

by

Rene Martinez Esquivel

Abstract

The SISO Adaptive Predictive Control (APC) algorithm is implemented on a dual-core embedded system. The predictive control strategy is implemented on one of the cores available on the ZYNQ processing system, and the adaptation mechanism is implemented on the second core. The implementation can be thought of as two independent applications running simultaneously. However, these are not completely independent: the predictive control model is updated on every sampling interval with new parameters generated by the adaptation mechanism that better fit the response of the system under control. To correctly synchronize the two applications, a master-slave architecture was chosen for the communication between the two cores. The core running the predictive control algorithm was assigned the role of the master because it had the longer execution time for every sampling period; it orchestrated the flow of data between the two cores and the actions taken by the second core. The second core, which computes the adaptation mechanism, acted as the slave, as it only performed operations when it received a message from the first core at specific times during a sampling period. Further optimization was achieved by using in-line assembly code, which makes it possible to take advantage of the processor architecture and implement optimal subroutines for a particular application. The low-level optimizations improved execution time by reducing clock cycle counts and reducing the use of external memory for temporary variables. These optimizations resulted in a speedup of up to 3x when compared to the latest embedded implementation of the APC algorithm reported in the literature.


List of Figures

1.1 Methodology Flow Chart

3.1 SISO Adaptive Predictive Control Block Diagram [1]

3.2 APC Variable Dependence Diagram

4.1 The ZedBoard Development Board [2]

4.2 ARM Cortex-A9 MPCore Processor [3]

4.3 Application Processor Unit (APU) [4]

6.1 ZYNQ Design Configuration

6.2 Workload Distribution Between Two Cores

6.3 Process and Control Output for prediction horizon (a) λ = 2, (b) λ = 5, (c) λ = 10, (d) λ = 20

6.4 Error Histogram


List of Tables

5.1 SISO APC Computational Complexity (FLOPs) [1]

6.1 Cortex-A9 Processor Clock Configuration

6.2 Memory Map for Core 0

6.3 Application-Specific Configurations

6.4 Optimization Performance Results for N = 4

6.5 Optimization Performance Results for λ = 2


Contents

Abstract

List of Figures

List of Tables

1 Introduction
1.1 Background
1.2 Motivation
1.3 Methodology
1.4 Solution Overview
1.4.1 Dual-core Processing System
1.4.2 Low-level Optimization
1.5 Thesis Outline

2 Related Work
2.1 APC SISO Real-time Implementations
2.2 Parallel Computation Approaches
2.2.1 Multi-core MPC
2.3 Compiler and Optimizations

3 Model Predictive Control
3.1 SISO APC Overview
3.1.1 The Predictive Model
3.1.2 The Driver Block
3.1.3 The Adaptation Mechanism
3.2 Algorithm Variable Dependence Analysis
3.3 Chapter Summary

4 An Embedded Platform for the APC Algorithm Implementation
4.1 The Development Board: ZedBoard
4.2 The Dual-Core ZYNQ SoC
4.3 The Application Processor Unit (APU)
4.4 Chapter Summary

5 Embedded APC Implementation Optimization
5.1 Enabling the Dual-Core Capabilities on the Zynq
5.2 Dual-Core SISO APC Embedded Optimizations
5.3 Analysis of the Dual-Core Implementation
5.4 The Case for Manual Low-Level Optimization
5.5 Use of the SIMD Co-processor
5.5.1 NEON Usage Example in APC Implementation
5.6 Regression Stage Optimizations
5.6.1 Low-Level Optimizations
5.7 Chapter Summary

6 Setup and Results
6.1 ZYNQ Hardware Configuration
6.2 Software Configurations
6.2.1 Memory Allocation
6.2.2 Compiler Settings
6.3 Results
6.4 Controller Performance
6.5 Chapter Summary

7 Conclusion
7.1 Future Work

Bibliography


Chapter 1 Introduction

1.1 Background

Over the past few decades, predictive control has gained popularity as a controller design strategy, in theory and in practice, primarily because of the high performance and stability exhibited when applied to difficult high-order and multi-variable processes, and because of its ability to handle process constraints in a systematic way. Predictive control approaches are characterized by the model of the process under control, which is used to make predictions of future process outputs; these predictions are then used to produce a control signal that optimizes the future behavior of the process. The model represents the interactions among inputs, outputs, and disturbances.

The principles of predictive control can be traced to the 1960s. Advanced control techniques were driven by the needs of the petroleum, defense, and similar industries, and the first recorded industrial applications date back to the early 1970s. The constraint handling and zero tracking error made predictive control particularly attractive. A big contributor to the adoption of predictive control techniques in industrial applications during the late 1960s and early 1970s was the market availability of digital computers that were more reliable, relatively easier to program, and inexpensive. The introduction of the predictive model increased the mathematical complexity of the control algorithms, which made the digital computer necessary in model-based control strategies [5].

1.2 Motivation

In this work, a particular type of model predictive control algorithm, identified as Adaptive Predictive Control (APC), is studied and implemented in an embedded system. When compared to other predictive control algorithms, the APC algorithm is under-represented in the literature, even though it has proved useful in actual industrial applications [6, 7, 8]. As with other MPC algorithms, the principles of APC were first developed in the 1970s.

The APC algorithm improves on the predictive control strategy by adding an adaptation mechanism. Predictive control is built on a mathematical model of the process under control, but even if the model is an accurate representation of the process, this control strategy becomes unstable when perturbations are introduced into the system, or if the dynamics of the process change. The adaptive strategy helps to deal with this problem by dynamically updating the process model. This property of the APC approach gives it an advantage over some MPC algorithms, which makes it attractive for industrial applications.

Implementations of the APC algorithm on general purpose computers can be found in several research works, primarily developed at the ITESM. The contributions of these research works include the analysis, design, and implementation of the APC algorithm in high-level programming languages such as MATLAB and LabVIEW [9, 7, 8, 10, 11]. Later works have focused on implementing the algorithm on embedded platforms [1, 12].

This work aims to broaden the study of the APC algorithm and its implementation in embedded systems. It builds upon past embedded implementations of the same algorithm and strives to reduce execution time by following different optimization approaches.


Figure 1.1: Methodology Flow Chart

1.3 Methodology

The methodology followed for improving the application's execution time is shown in Figure 1.1. The first step involves measuring execution times for different sections of the code to create a profile of the execution time per section. The profile helps concentrate the optimization effort on the sections of the code that consume most of the execution time.

In the analysis step, the sections of the code that showed the longest execution times are scrutinized to identify bottlenecks that are candidates for optimization. The optimizations pursued in this work fall into two categories: portions of the code that are parallelizable, and methods that help reduce memory accesses. The disassembly debugger is a great tool for this job.

In the optimization step, changes are made to the source code that was identified as optimizable. Any change made should perform the same operation in a shorter time while producing the same results or outputs. It is important to point out that the focus of this work was execution time optimization rather than algorithmic reduction.

Because of the code changes, it is necessary to add another step to verify the correctness of the results. This step is shown in the flowchart as “Formal verification”.

1.4 Solution Overview

The optimization of an embedded SISO APC implementation required two different approaches in order to achieve an acceptable improvement in performance. These approaches are specified in this section.

1.4.1 Dual-core Processing System

A time-tested method to achieve greater speed in computation-intensive applications is to increase the processing power of the platform running the application. Increasing the processing power can be straightforward for applications developed to run on a general purpose computer, because it involves either upgrading the hardware to a processor that works at higher clock frequencies or upgrading to a system with a greater number of processors. At the time of this work, a plethora of multi-core computers and operating systems were available.

However, the multi-core approach to reducing execution time is not as straightforward for embedded applications. Such applications often include low-level optimizations, such as implementing certain functions in assembly, that make it difficult to port the application code to a multi-core platform. This was the case for the SISO APC embedded implementation presented in this work. The existing implementation exploited processor-specific capabilities that are supported using assembly language, specifically the SIMD unit of the ARM Cortex-A9, named the NEON co-processor. Porting such an implementation to a multi-core system is not straightforward. Some problems need to be solved first, such as inter-core communication, to enable a multi-core solution to the increased speed requirement. Deep knowledge of the processing system architecture is required to implement communication among processors.


In addition to solving the challenges of properly configuring a system to use multiple cores, the algorithm also needs to be adapted to allow for parallel computation. Identifying sections of the algorithm that are candidates for parallelization requires careful analysis of the algorithm.

1.4.2 Low-level optimization

A processing system executes instructions that operate on data to produce results. An instruction is represented by a number known as an operation code, or opcode for short. The processor takes in the opcode, along with the data to operate on, the operands, and performs the operation indicated by the opcode. The opcode, which is a set of bits often one byte or a few bytes in size, is denoted by a mnemonic that is easier for a human to remember.

At the level of mnemonics, a programmer can start developing applications. However, these mnemonics still need to be translated into opcode bytes; this translation is performed by an assembler, and for higher-level languages it happens as part of compilation. This simplified description of how a processing system operates is needed to understand the concepts described in this work.

Low-level optimization refers to the process of exploiting the full capabilities of a processor by means of low-level programming languages that translate directly into machine code, such as assembly. Although this requires intimate knowledge of the processor architecture, it makes possible the efficient use of the processing resources. As processors become faster, the need for low-level optimization becomes less critical because applications run faster by simply upgrading to the latest generation of a processor. Additionally, the code compilers used with higher-level languages can often achieve higher levels of optimization than a human programmer can. Nonetheless, there are still opportunities for manual low-level optimization to make an application more efficient, as this work shows.


1.5 Thesis Outline

Chapter 2 provides a review of the literature on embedded implementations of MPC algorithms. In particular, this chapter covers parallel and multi-core computation approaches for some MPC applications. It also reviews some low-level optimization approaches in MPC implementations.

Chapter 3 covers the concepts of the single-input single-output APC. It illustrates the main components of the APC algorithm, and each of the blocks is described in detail in the sections of this chapter. The last section exposes some properties of the APC algorithm that are relevant to concepts discussed in the subsequent chapters.

In Chapter 4, the hardware platform used in the implementation of the SISO APC algorithm is described in detail. It introduces the ZedBoard and lists its main components. At the heart of the ZedBoard is the ZYNQ SoC, which combines a Processing System (PS) and Programmable Logic (PL) interconnected via high-speed communication interfaces. In this research work, only the PS of the ZYNQ is used for the embedded implementation. The last section in this chapter provides a detailed description of the critical components in the PS, including the dual-core Cortex-A9 processor.

Chapter 5 presents the main contributions of this work, in particular how the dual-core processor was leveraged to reduce the execution time of the application. First, it reveals the subtleties of working with the two cores on the ZYNQ SoC and the particular solution for the parallel implementation of the SISO APC algorithm. Then, it describes the low-level optimizations that made it possible to improve the application's execution time.

Chapter 6 discloses the results of the optimized implementation and compares these results to an analogous solution reported in [1]. The performance and precision of the controller are discussed in the latter sections of this chapter.


Chapter 2

Related Work

2.1 APC SISO Real-time Implementations

An embedded implementation of a single-input single-output (SISO) adaptive predictive control (APC) algorithm is described in [12]. The APC algorithm was developed in the C language on the ZYBO development board, where customized data structures and libraries were developed with the aim of minimizing execution time while maintaining the accuracy of the controlled output. Specifically, the implementation of a circular linked list of past control and system outputs proved to be essential for reducing execution time. Execution times for simulated test cases were reported in the range of 5 to 13 microseconds at a processor speed of 650 MHz. The simulation parameters for the reported execution times were λ taking values of 2, 5, and 10, while keeping the N parameter set to 4.

Another APC SISO embedded implementation is presented in [1], which builds on the work done in [12]. Three different APC embedded implementations are introduced in [1], each for a different type of processor. The first implementation was developed to work on any general purpose processor; for processors lacking an FPU, the floating point arithmetic is emulated in software by the compiler libraries. The second implementation works with ARM processors that include an FPU, where the floating point arithmetic is carried out in the floating point unit; in addition, this implementation makes use of the FPU Math library. The third implementation works on ARM processors that include a NEON co-processor. The floating point arithmetic is managed by NEON, which can process four floating point operations with a single instruction. This firmware makes use of the NEON Math Library. This last implementation is the one that achieved the highest reduction in execution time when compared against the other implementations presented in [1]. The speedup achieved ranges from 7x to 12x over the original implementation [12], depending on the λ and N parameters selected for comparison.

Embedded optimizations of other MPC algorithms can be found in the literature. However, to reduce the scope, the literature research was limited to multi-core implementations of the MPC algorithm. Few theoretical and applied solutions have been proposed for implementing model predictive control on multi-core processors.

2.2 Parallel Computation Approaches

In [13], it is shown that an MPC problem can be broken down into smaller MPC problems with shorter prediction horizons. These reduced MPC problems can be solved in parallel, which reduces the overall execution time of the original MPC problem. This approach reduces the complexity of the portion of the algorithm that solves the quadratic programming problem from O(λ) to O(log λ).

2.2.1 Multi-core MPC

In [14], a Network-on-Chip (NoC) based multi-core architecture is presented with the goal of solving large-scale nonlinear model predictive control problems. Using the method proposed in [13], a complex MPC problem is broken down into subproblems, and by using convex optimization concepts, the quadratic programming problem can be reduced to solving a linear system with a large matrix of coefficients. The large matrix of coefficients is factorized into blocks of smaller matrices, which can be solved on a small group of parallel computing cores, or NoC cores.

A multi-core approach for the MPC algorithm is explored in [15]. Their approach was to utilize the extra processing capability of a multi-core processor system for speculative execution, where the predicted output produced by the model is exploited for control execution optimization.

An MPC algorithm is applied to a pendulum control problem. A profile of the execution time for different sections of the algorithm was generated, and the results from this profile revealed that the recursive prediction horizon calculations account for about 70% of the total execution time. However, the authors argue that parallelization of the numerical calculations of the MPC algorithm is difficult using traditional thread-level parallelism due to the interdependencies found in the algorithm. Instead of evenly dividing the workload among the multiple cores, the authors propose to take each of the predicted output trajectories as the real output of the system. Then, on each of the available cores, new predicted trajectories are computed based on the previous predicted trajectories. In other words, this method speculates that at least one of the multiple predicted outputs will match the actual output, and since the new trajectories for that output have been pre-computed on one of the multiple cores available, time is saved.

A distinguishing feature of an MPC algorithm is the updating of the model based on, among other things, the output of the system via a feedback path. Hence, the speculated outputs become speculated inputs to the controller. The speculated inputs are obtained from the calculation results of the prediction horizon and represent no additional workload on the processor running the algorithm. Rather than waiting for the actual system output after updating a control signal, this method takes the predicted output and pre-computes possible control signals for the next iteration. With this speculated input, another speculated output can be computed on a different processor.


The improvement in performance is achieved by running the MPC algorithm on one processor and the speculative pre-computations on separate processors. In essence, an instance of the MPC algorithm would run on each available processor. Then, the actual measured system output would be matched to one of the speculated inputs, and the controller would follow that pre-computed trajectory. Since this trajectory was already calculated, execution time is reduced. This process is repeated continuously. This solution also proposes a mechanism for handling misspeculation. The gain in performance using this approach depends on the desired accuracy (i.e., the number of digits after the decimal point) and the number of cores used. As an example, the authors estimate that 56 Xeon cores running at 2.93 GHz would be needed to achieve an 8x speedup with an accuracy of 3 decimal points.

A similar parallel pre-computation approach is discussed in [16]. This work goes a step further and applies the concept to benchmark applications that simulate MPC control systems, such as the nonlinear spring and the arm pendulum. This work revealed that there are challenges a real MPC pre-computation implementation needs to overcome in order to achieve acceptable control performance. For example, small differences between the predicted output and the actual output are tolerated by the implementation, but these errors eventually accumulate, the difference between the predicted output and the actual output becomes large, and the system fails. This work proposes a mechanism to handle such challenges. In the results, the authors claim a speedup of 6x using 7 Xeon Phi processors.

2.3 Compiler and Optimizations

Newer microprocessor designs are driven by higher performance requirements. To improve performance, different hardware architectures have been developed to achieve different types of parallelism, including pipelining, superscalar and Very Long Instruction Word (VLIW) execution, out-of-order execution, speculative execution, register renaming, branch prediction, and multithreaded architectures. Each of these hardware architectures targets a different level of parallelism to improve application execution performance. In general, three major categories of parallelism exist: Instruction Level Parallelism (ILP), Data Level Parallelism (DLP), and Thread Level Parallelism (TLP) [17].

Architectural designs such as instruction pipelining, VLIW execution, out-of-order execution, and branch prediction all exploit ILP. From a software developer's perspective, no additional work is required to take advantage of ILP. That is not the case for TLP or DLP. TLP approaches applied to predictive control, including multithreading and chip multiprocessors (CMP), were discussed in section 2.2.1. Single Instruction Multiple Data (SIMD) accelerators and vector processors are used for processing data in parallel. DLP approaches have also been explored for predictive control applications.

In [18], the optimization of a single linear algebra operation, the matrix-matrix multiplication commonly used in the linear MPC algorithm, is presented for three different processors, one of them being the ARM Cortex-A9. The optimizations on each of the processors leverage its SIMD capabilities, as well as assembly code for the most computationally expensive operations. The results presented show a speedup from 2x to 6x for small and large problems, respectively, with the single precision floating point supported by SIMD.

SIMD execution entails operating on multiple data elements in parallel using a single instruction. Thus, SIMD execution requires data arrangements that allow parallel operations. This is known as vectorization and is usually set up at compilation time.

However, an evaluation of compilers in [19] demonstrates that compilers such as GNU GCC are limited in performing data vectorization on vectorizable applications. The main problem is the compiler's inability to do accurate interprocedural pointer disambiguation and interprocedural array dependence analysis. Additionally, [20] shows similar results for pointer-based applications.


Chapter 3

Model Predictive Control

Advanced control techniques, that is to say, techniques more advanced than classical PID control, were developed in the 1960s as an application of optimal control theory. The term used by the control community to distinguish these control techniques from the rest is Model Predictive Control (MPC).

MPC is an umbrella term that refers to a family of controllers that have some elements in common, the main element being a model of the system under control. This model is used to predict the future behavior of the system over a time horizon, referred to as the prediction horizon.

Other key characteristics of MPC algorithms are that they keep a record of past control values for later use in the control algorithm, and that they make use of a cost function that the algorithm strives to minimize.

This work focuses on a particular type of MPC algorithm, the Adaptive Predictive Control algorithm, with the goal of improving existing implementations, particularly embedded implementations. Before presenting the contributions of this research work, the following section summarizes the theory behind the single-input single-output (SISO) APC and provides the basis for the actual implementations.


Figure 3.1: SISO Adaptive Predictive Control Block Diagram [1]

3.1 SISO APC Overview

Adaptive predictive control is an advanced control technique used by complex dynamic industrial processes. The virtue of predictive control is its capacity to predict the behavior of the system based on past control and output values. This type of control is particularly suitable for dynamic processes that change slowly over time. For processes that are subject to perturbations, or processes whose dynamics change significantly over time, it is convenient to introduce adaptive mechanisms to achieve better control. The adaptive predictive control (APC) algorithm belongs to the family of model-based predictive control (MPC) algorithms. It was first introduced by Martín-Sánchez [21] in 1978 as part of a doctoral thesis. The principles of the APC algorithm presented in this chapter are based on a publication by the same author [22].

The block diagram of the SISO adaptive predictive controller is shown in Figure 3.1. The following sections describe the purpose and the underlying equations of each block.

3.1.1 The Predictive Model

The predictive model, the block between the Driver Block and the Process Block in Figure 3.1, is a mathematical model of the process under control. This model makes possible the estimation of future process outputs, information that is used to generate an adequate control signal for the process. The predictive model computes the control signal that makes the process output reach the setpoint, and it does this following the desired trajectory provided by the Driver Block. In turn, the driver generates the desired output based on the setpoint value and past process output values. The control signal is calculated using equation (3.1).

$$u(k) = \frac{1}{\hat{h}^{(\lambda)}} \left[ y_d(k+\lambda|k) - \sum_{i=1}^{\hat{n}} \hat{e}_i^{(\lambda)}\, y(k+1-i) - \sum_{i=2}^{\hat{m}} \hat{g}_i^{(\lambda)}\, u(k+1-i) \right] \quad (3.1)$$

The coefficients $\hat{e}$, $\hat{g}$, and $\hat{h}$ are obtained using equations (3.2), (3.3), and (3.4), respectively, which are the recursive equations needed to estimate the system output throughout the prediction horizon $\lambda$.

$$\hat{e}_i^{(j)} = \hat{e}_1^{(j-1)} \hat{a}_i + \hat{e}_{i+1}^{(j-1)} \qquad i = 1,\dots,\hat{n} \quad j = 2,\dots,\lambda \quad (3.2)$$

$$\hat{g}_i^{(j)} = \hat{e}_1^{(j-1)} \hat{b}_i + \hat{g}_{i+1}^{(j-1)} \qquad i = 1,\dots,\hat{m} \quad j = 2,\dots,\lambda \quad (3.3)$$

$$\hat{h}^{(\lambda)} = \hat{g}_1^{(\lambda)} + \hat{g}_1^{(\lambda-1)} + \dots + \hat{g}_1^{(1)} \quad (3.4)$$

3.1.2 The Driver Block

The purpose of the driver block is to drive the desired output towards the setpoint, following the path or trajectory defined by a mathematical model. Because the goal of control is to reach the setpoint smoothly, the desired output follows a trajectory that avoids abrupt changes, including overshoots, and tries to reach the setpoint as fast as possible. This trajectory is redefined at each sampling interval and extends over what is referred to in this work as the prediction horizon ($\lambda$). The prediction horizon represents the number of future sampling intervals for which the process output $y$ is predicted. The driver consists of an ARX difference equation (3.5).


$$y_d(k+\lambda|k) = \mu^{(\lambda)} y_{sp}(k) + \sum_{i=1}^{p} \varphi_i^{(\lambda)}\, y(k+1-i) + \sum_{i=2}^{q} \delta_i^{(\lambda)}\, y_{sp}(k+1-i) \quad (3.5)$$

where $p$ and $q$ represent the number of $\alpha$ and $\beta$ coefficients in the system's model, or equivalently the order of the system. Note that the real process output $y$ is needed in this equation. $y_{sp}$ is the setpoint that the output is trying to reach. The choice of $\alpha$ and $\beta$ determines the gain of the driver block and whether $y_d$ will be overdamped, underdamped, or critically damped. Ideally, the driver block should have unity gain and be critically damped. The $\varphi$, $\delta$, and $\mu$ variables are computed with the recursive equations (3.6), (3.7), and (3.8), respectively.

$$\varphi_i^{(j)} = \varphi_1^{(j-1)} \alpha_i + \varphi_{i+1}^{(j-1)} \qquad i = 1,\dots,p \quad j = 2,\dots,\lambda \quad (3.6)$$

$$\delta_i^{(j)} = \varphi_1^{(j-1)} \beta_i + \delta_{i+1}^{(j-1)} \qquad i = 1,\dots,q \quad j = 2,\dots,\lambda \quad (3.7)$$

$$\mu^{(\lambda)} = \delta_1^{(\lambda)} + \delta_1^{(\lambda-1)} + \dots + \delta_1^{(1)} \quad (3.8)$$

where $\varphi_i^{(1)} = \alpha_i$, $\varphi_{p+1}^{(j-1)} = 0$, $\delta_i^{(1)} = \beta_i$, and $\delta_{q+1}^{(j-1)} = 0$.

3.1.3 The Adaptation Mechanism

Adaptive control consists of updating the predictive model parameters based on the input and output values of the system under control. The purpose of constantly updating the model is to keep it as accurate as possible, so that control can be maintained even when the underlying process has changed. The adaptation mechanism computes the difference between the actual process output and the predicted output to obtain the error, and its goal is to drive this prediction error towards zero. Note in Figure 3.1 that the control signal is not only applied to the system under control, but is also fed to the adaptive


model. The adaptation mechanism also informs the driver block of the output deviation with respect to the desired trajectory. Using this information, the driver block can redefine the desired trajectory in a coherent manner.

The adaptation mechanism adjusts the â and b̂ coefficients of the predictive model, stored in the θ̂ vector (3.10), at each sampling period k. The past output and control values are stored in φ. θ̂ and φ are used to compute the predicted process output (3.9).

\hat{y}(k|k-1) = \hat{\theta}(k-1)^{T}\varphi(k) \tag{3.9}

Where:

\hat{\theta}^{T} = \left[\, \hat{a}_1 \dots \hat{a}_{\hat{n}} \;\; \hat{b}_1 \dots \hat{b}_{\hat{m}} \,\right] \tag{3.10}

\varphi(k)^{T} = \left[\, y(k-1) \dots y(k-n) \;\; u(k-1) \dots u(k-m) \,\right]

The updated adaptive predictive model parameters are computed with (3.11), where B is the identity matrix of size N = (m + n). The error e is the difference between the actual system output and the predicted output (3.12).

\hat{\theta}(k) = \frac{e(k|k-1)\,B\varphi(k)}{1 + \varphi(k)^{T}B\varphi(k)} + \hat{\theta}(k-1) \tag{3.11}

e(k|k-1) = y(k-1) - \hat{y}(k|k-1) \tag{3.12}
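As a sketch only (not the thesis code), the update (3.9)–(3.12) with B = I can be written in C as follows; NP, predict, and adapt are illustrative names, and a second-order SISO model (N = n + m = 4) is assumed:

```c
#define NP 4  /* number of model parameters: N = n + m */

/* Predicted output (3.9): theta^T * phi */
static double predict(const double theta[NP], const double phi[NP])
{
    double y = 0.0;
    for (int i = 0; i < NP; i++) y += theta[i] * phi[i];
    return y;
}

/* In-place parameter update (3.11)-(3.12); y_meas is y(k-1).
 * With B = I, the term B*phi(k) reduces to phi(k). */
static void adapt(double theta[NP], const double phi[NP], double y_meas)
{
    double e = y_meas - predict(theta, phi);                 /* (3.12) */
    double denom = 1.0;
    for (int i = 0; i < NP; i++) denom += phi[i] * phi[i];   /* 1 + phi^T phi */
    for (int i = 0; i < NP; i++)
        theta[i] += e * phi[i] / denom;                      /* (3.11) */
}
```

After one call to adapt, the prediction error on the same regressor is reduced, which is the behavior the mechanism relies on to track a changing process.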

3.2 Algorithm Variable Dependence Analysis

The SISO APC algorithm is illustrated in Figure 3.2 using the same variables introduced in the previous section, with the intent of emphasizing how the variables in the algorithm depend


on each other, which will help to understand the concepts discussed in the following chapters.

The variables enclosed in the red box are of special interest in this work because they constitute the most computationally intensive portion of the algorithm. The variables enclosed in the dotted box constitute the adaptation mechanism. They are shown in that manner to denote a weak dependence between the recursion and adaptation stages: the recursion stage depends on the result from the adaptation stage, but not until the next sampling period. This characteristic is of special interest in this work, as it allows for the partial parallelization of the algorithm.

3.3 Chapter Summary

In this chapter the concepts of the Single-Input Single-Output (SISO) APC algorithm were presented. The block diagram shown in Figure 3.1 illustrates the main components of the APC algorithm, and each of these blocks was described in detail in the sections of this chapter. The last section highlighted some properties of the algorithm that are relevant to the concepts discussed in the subsequent chapters. In the next chapter, the hardware platform used for the implementation of the SISO APC algorithm is described in detail.


Figure 3.2: APC Variable Dependence Diagram


Chapter 4

An Embedded Platform for the APC Algorithm Implementation

In this chapter, the ZEDBOARD is introduced, on which an optimized implementation of the SISO APC was developed. Additionally, the features on the board and its components that are relevant to this work are discussed in detail, such as the Dual-Core multiprocessor.

4.1 The Development Board: ZEDBOARD

The Zynq Evaluation and Development Board (ZedBoard) is a single-board computer that features the Zynq Z-7020 (XC7Z020) SoC as its main component. The board is designed around the Zynq SoC, and it provides all the necessary components and interfaces to develop applications for the Zynq. The board offers 256 Mbit of flash memory and 512 MB of DDR3 memory. It includes two oscillator clock sources, one at 100 MHz and the other at 33.3333 MHz.

The ZedBoard also includes multiple peripheral interfaces, among them GPIO for LEDs, switches, and push buttons; audio peripherals; an HDMI port; a VGA connector; an OLED display; Pmod interfaces; an Ethernet port; a USB port including OTG; JTAG; UART; an SD card slot; an FMC interface; an XADC header; and a Xilinx JTAG header [23]. Figure 4.1 shows some of these components highlighted.


Figure 4.1: The ZedBoard Development Board [2]

4.2 The Dual-Core ZYNQ SoC

The Zynq-7000 is a Xilinx SoC-based architecture. This SoC incorporates a dual- or single-core ARM Cortex-A9 based processing system (PS) and Xilinx programmable logic (PL) in a single device. The ARM Cortex-A9 CPU is the main component of the PS, which also includes on-chip memory, external memory interfaces, and I/O peripherals. The PL brings the flexibility of an FPGA to the Zynq SoC, where custom logic can be developed. The combination of custom logic in the PL and software in the PS allows for a wide range of applications while providing levels of performance that two-chip implementations cannot match due to bandwidth and power limitations [24].

The interconnectivity and high-speed communication between the Processing System and the Programmable Logic make the Zynq SoC appealing from an application development point of view. The programmable logic on the Zynq is based on the Xilinx Artix 7-series of FPGAs. The PL is independent from the PS, as it has a separate on-chip power plane and its own clock and reset management. A set of JTAG ports enables independent programming and debugging of the PL. The PL also incorporates special communication interfaces, such as GTX transceivers for high-speed communications, which support multiple standard interfaces including PCI Express, Serial RapidIO, and SATA [25].


Figure 4.2: ARM Cortex-A9 MPcore Processor [3]

An ARM Cortex-A9 MPCore processor is at the heart of the Zynq Z-7020. The Cortex-A9 MPCore comes in many variants and supports up to four ARM Cortex-A9 processors. Figure 4.2 shows the diagram of a four-core MPCore. Each ARM Cortex-A9 processor in an MPCore implements the ARMv7-A architecture and supports the Thumb instruction set and Jazelle Runtime Compilation Target (RCT). Among the standard components included in a Cortex-A9 MPCore is the snoop control unit (SCU), which is responsible for L1 cache coherence, Accelerator Coherency Port (ACP) operations, and uniprocessor access to private memory regions. Other relevant components include the Generic Interrupt Controller (GIC), a private timer, a private watchdog, and a global timer [26]. Configurable options include the instruction and data cache sizes and the NEON FPU, among others [24].

The MPCore on the Zynq is a dual-core and includes the floating-point unit and NEON co-processors. Arm NEON technology is an advanced Single Instruction Multiple Data (SIMD) architecture extension for the ARM Cortex-A family of processors. NEON registers are considered vectors of elements of the same data type, and NEON instructions perform the same operation in all lanes of the vectors. The number of operations performed depends on the data type. NEON instructions allow up to 16x8-bit, 8x16-bit, 4x32-bit, and 2x64-bit integer operations, and 8x16-bit, 4x32-bit, and 2x64-bit floating-point operations. NEON technology can also support issuing multiple instructions in parallel [27].


Figure 4.3: Application Processor Unit (APU) [4]

4.3 The Application Processor Unit (APU)

The PS is divided into multiple functional units, including, among others, the Application Processor Unit (APU), I/O peripherals (IOP), and datapath and memory resources [4].

The main component of the Processing System is the Application Processor Unit, which is made up of dual ARM Cortex-A9 processors, the NEON co-processor, the Generic Interrupt Controller, timers, and caches, as seen in Figure 4.3. The key components in the APU are the dual-core ARM Cortex-A9 processors with their independent caches. The dual-core ARM Cortex-A9 processor is implemented as a hard-core processor, which means that it is a dedicated and optimized silicon element on the chip. The ARM processor is based on a RISC architecture with the goal of high-speed instruction processing. The ARM Cortex-A9 processor can operate at up to 866 MHz using the fastest speed grade [28]. Each of the cores has a dedicated 32KB level 1 cache and cache controllers for instructions and data. The cores share an external 512KB level 2 cache. Each of the two cores includes a NEON and a VFP extension for single-instruction multiple-data (SIMD) and double-precision floating-point operations, respectively [4]. The dual-core ARM Cortex-A9 processor plays a key role in the main solution in this thesis.

4.4 Chapter Summary

This chapter presented the ZEDBOARD as the development board for the SISO APC implementation. The peripherals on the development board were listed in this chapter, and Figure


4.1 shows each one of them. At the heart of the ZEDBOARD is the ZYNQ SoC, which combines a Processing System (PS) and Programmable Logic (PL) interconnected via high-speed communication interfaces. In this work, only the PS of the ZYNQ was used for the embedded implementation. Thus, the last section in this chapter provided a detailed description of the main components in the PS, including the dual-core Cortex-A9 processor.

The next chapter explains how the Dual-Core processor was leveraged for this research.


Chapter 5

Embedded APC Implementation Optimization

Following the methodology discussed in section 1.3, a profile of the baseline implementation reveals that two sections of the code account for 60% of the execution time: the Recursion stage contributes 40% of the execution time, while the Adaptation stage contributes 20% of the total execution time.

The analysis of the algorithm for the Recursion and Adaptation stages reveals that the two stages are partially independent of each other. In other words, the calculations made in the Recursion stage do not play a role in the calculations carried out in the Adaptation stage. The same cannot be said for the reverse scenario: some calculations in the Adaptation stage are used in the Recursion stage, but only in a subsequent iteration. Therefore, the first optimization considers parallel execution of the two stages. The ZYNQ SoC is a good choice for parallel implementations, as it includes two ARM Cortex-A9 processors. The Adaptation stage was selected to run on the ZYNQ's second core in this implementation.

Moving one portion of the algorithm to the second core raises the problem of synchronization. Special considerations need to be made to prevent the Adaptation stage from moving on to the second iteration before the Recursion stage completes its first iteration. If this is not


handled, the Recursion stage will use results from a subsequent iteration instead of the values from the first iteration, causing the application to fail as a control system. The following sections explain in detail the optimizations implemented to reduce execution time.

5.1 Enabling the Dual-Core capabilities on the Zynq

As illustrated in section 3.2, the adaptation stage is the portion of the algorithm that can be parallelized, for two reasons. First, the inputs to the adaptation stage are independent of any of the other variables computed during a single iteration of the algorithm. Second, the data obtained in the adaptation stage is not utilized in the same iteration cycle in which it is computed; instead, it is not consumed by the recursion stage until the next iteration cycle. This property makes it possible to run the adaptation stage alongside the remaining portion of the APC algorithm.

The baseline embedded implementation of the SISO APC algorithm was developed in the C language [1] for the ZEDBOARD development platform. Chapter 4 gives an overview of the ZEDBOARD and its main components. The ZYNQ SoC on the ZEDBOARD includes a Processing System (PS), which is made up of, among other components, a dual-core ARM Cortex-A9 MPCore.

To improve the execution time of the baseline implementation, the adaptation stage of the algorithm was implemented on the second core of the ZYNQ. The two Cortex-A9 processors on the ZYNQ share memory and some peripherals; a complete list of shared resources can be found in the reference manual [24]. Some of these shared resources include the Interrupt Control Distributor (ICD), DDR memory, On-Chip Memory (OCM), the global timer, the Snoop Control Unit (SCU), the L2 cache, and a UART interface.

Upon startup, each processor initializes each of these components. If both processors are started at the same time without a special configuration, proper initialization of these components will fail. Asymmetric multiprocessing (AMP) is a special configuration that makes it


possible for both processors on the ZYNQ to run separate applications while allowing them to communicate via shared resources. When configured for AMP, each CPU executes a bare-metal application in a standalone environment. AMP is needed to prevent the cores from conflicting on shared hardware resources, particularly during initialization [29].

The development tools provided by Xilinx, the ZYNQ SoC designer, include scripts to initialize each of the cores automatically. The initialization of the shared resources can be sequenced in such a way that it prevents both processors from accessing the same resource at the same time. This is controlled in the scripts by defining a variable named USE_AMP.

By default, this variable is disabled (set to zero), and the initialization sequence of each of the processors is identical, which causes conflicts with shared resources when both cores are enabled. When this variable is set to 1, the initialization script for the second processor, CPU1, skips the initialization of the shared components, which are initialized by the first processor, CPU0. From an application developer's point of view, enabling the proper initialization of both processors takes a single step: a macro is defined as a compiler flag with the 'D' option in the gcc compiler, which takes the form -DUSE_AMP=1.
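The effect of the flag can be pictured as a compile-time guard. The sketch below is conceptual only (the actual Xilinx initialization scripts differ); cpu_id and should_init_shared are hypothetical names:

```c
/* Hypothetical illustration of the -DUSE_AMP=1 effect: CPU1 skips
 * shared-resource initialization, which CPU0 performs alone. */
#ifndef USE_AMP
#define USE_AMP 1   /* assumed to come from -DUSE_AMP=1 on the gcc command line */
#endif

/* Returns nonzero if the given core should initialize shared resources. */
static int should_init_shared(int cpu_id)
{
#if USE_AMP
    return cpu_id == 0;   /* only CPU0 touches shared resources */
#else
    return 1;             /* both cores run identical init: conflicts */
#endif
}
```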

5.2 Dual Core SISO APC Embedded Optimizations

Enabling the Asymmetric Multiprocessing (AMP) mechanism discussed in the previous section only resolves the conflicts with shared resources during initialization. Once initialized, it is up to the application developer to prevent any conflicts when the applications access the shared resources. A simple solution to avoid such conflicts is to follow a master/slave approach, in which one processor, the master, controls the execution of the other processor and decides when the slave can access the shared resources. This is the approach followed in this work.

The available memory shared by both processors can be divided into separate memory regions, and a memory region can be assigned exclusively to a specific processor. This is


accomplished using the linker scripts ('.ld'), which control where the different sections of an executable are placed in memory. In the linker scripts, the application developer can also define new memory regions and change the default assignment of sections to memory regions. For the applications developed in this work, the shared memory was divided into three segments. The first segment is assigned exclusively to CPU0, and the second segment is assigned exclusively to CPU1. Each core uses its assigned segment to load machine instructions and to reserve memory for variables used during code execution. It is important to note that the way these memory regions are assigned does not prevent CPU0 from accessing the memory region assigned to CPU1, or vice versa, during code execution.
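A minimal sketch of such a linker-script memory layout follows; the region names, origins, and lengths are illustrative only, except for the 0x8000000 shared base used in this work:

```ld
/* Illustrative MEMORY layout for AMP; addresses are hypothetical
 * except the 0x8000000 shared-communication base. */
MEMORY
{
   cpu0_ddr : ORIGIN = 0x00100000, LENGTH = 0x03F00000  /* CPU0 code + data */
   cpu1_ddr : ORIGIN = 0x04000000, LENGTH = 0x03F00000  /* CPU1 code + data */
   comm_ddr : ORIGIN = 0x08000000, LENGTH = 0x00100000  /* shared sync region */
}
```

Each application's sections would then be directed to its own region, while both applications may reference comm_ddr for communication.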

Since the memory addresses are assigned to each segment manually by the developer, he or she can simply point to those memory addresses in the code to read and write data as desired. It is this very property that allows both processors to communicate and share data.

In this work, a third region of memory is assigned in the linker scripts for synchronizing code execution and sharing data. In this third region, a pointer to the APCType data structure, which holds all the algorithm variables, is saved. Additionally, two more variables, event0 and event1, are stored in this memory region; they are used by both processors to synchronize the execution of the applications. In this case, memory address 0x8000000 was chosen as the beginning of this third memory region.
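The layout of this communication region can be sketched as a C struct. The names below (apc_comm_t and the APCType stand-in) are hypothetical; on the target, the struct would be placed at 0x8000000 rather than in ordinary static storage as in this host-side illustration:

```c
#include <stdint.h>

/* Hypothetical stand-in for the thesis data structure. */
typedef struct { double theta[4]; double phi[4]; } APCType;

typedef struct {
    APCType *volatile apc;     /* pointer to the shared algorithm data */
    volatile uint32_t event0;  /* CPU0 -> CPU1 synchronization flag    */
    volatile uint32_t event1;  /* CPU1 -> CPU0 synchronization flag    */
} apc_comm_t;

/* On the ZYNQ this would be: #define COMM ((apc_comm_t *)0x08000000)
 * For this host-side sketch, a static instance is used instead. */
static apc_comm_t comm_region;

static apc_comm_t *comm_get(void) { return &comm_region; }
```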

Both applications start by initializing the variables stored in the third memory region used for inter-processor communication. Then CPU0 initializes all the variables that will be needed during computation, while CPU1 waits for a message from CPU0 that marks the start of the first iteration. When the message is received, CPU1 starts executing the Adaptation stage of the algorithm for the first iteration, while at the same time CPU0 executes the rest of the APC algorithm for the first iteration. Since CPU1 completes its part of the first iteration before CPU0, it waits until CPU0 signals that it has completed the first iteration; only then does CPU1 store the results of the first iteration to the


appropriate memory location. This cycle repeats for every iteration.
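A minimal sketch of this master/slave handshake using two flags follows. It is simplified to a single-threaded host illustration; on the target, these would be the volatile event0/event1 words in the shared region, polled by busy-waiting:

```c
#include <stdint.h>

/* Simplified handshake flags (on the target: volatile words in the
 * shared region). Host-side sketch; no real busy-waiting here. */
static volatile uint32_t event0; /* set by CPU0: "iteration k done"  */
static volatile uint32_t event1; /* set by CPU1: "adaptation k done" */

static void cpu0_finish_iteration(uint32_t k)  { event0 = k; }
static void cpu1_finish_adaptation(uint32_t k) { event1 = k; }

/* CPU1 may commit its results for iteration k only once CPU0 has
 * finished that same iteration; on the target this check would be a
 * busy-wait loop such as `while (event0 < k) ;` on the shared flag. */
static int cpu1_may_commit(uint32_t k) { return event0 >= k; }
```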

5.3 Analysis of the Dual-Core Implementation

The output of the dual-core implementation was confirmed to be accurate. However, the performance achieved by applying the optimizations described in the previous section was not sufficient to reach a speedup of 2x, which was the goal given that the processing resources were doubled. Following the methodology described in section 1.3, it was concluded that the recursion stage needed to be optimized.

By analyzing the recursion algorithm, it was determined that some operations could be split for parallel computation, but only for cases where the system under control is of high order. No gain in execution time would have been obtained for second-order systems with that approach. A rational and proven method for code optimization is the use of low-level programming languages, such as assembly, for computationally heavy algorithms. The remaining sections of this chapter discuss the benefits of using low-level optimization for the APC embedded implementation.

5.4 The Case for Manual Low Level Optimization

Modern compilers are capable of generating highly optimized machine code. In most cases, compilers perform much better at low-level code optimization than a human can. However, a human can write better assembly code than a modern compiler when he or she knows more about the custom hardware involved.

In general, a modern C compiler knows much more about code optimization. It knows how the processor pipeline works, and it can rearrange instructions faster than a human can. As an analogy, a computer is as good as or better than the best human chess player because it can compute different outcomes faster than a human can. Although a human can theoretically perform as well as a computer in a specific case, he or she cannot do it at the same


speed, making it impractical for more than a few cases. A good compiler will undoubtedly outperform a human who tries to write more than a few assembly routines.

Adding to the compiler's capabilities, there are techniques, or best practices, that a developer can use when coding in high-level languages that make it more likely that the compiler will generate optimal low-level code. These techniques include loop unrolling and loop-invariant code extraction, among others.

While it is true that compilers do an outstanding job, there are cases where a compiler does not have an overall understanding of the hardware architecture. This is particularly true for application-specific SoCs. As an example, the Cortex-A7 processor and its successors have a SIMD engine, or co-processor, designed for parallel computation on large data sets, called the NEON engine. Some NEON intrinsic instructions include VEXT, which concatenates the 'tail' and 'head' of two one-dimensional vectors, and VLD3, which loads vectors from memory in an interleaved manner. Each of these intrinsics takes only a single instruction. In these cases, a developer's familiarity with the hardware in question can yield better results than a compiler could achieve.

5.5 Use of the SIMD co-processor

This work takes the APC SISO embedded implementation presented in [1] as the baseline and increases the use of assembly code to improve performance. The supporting arguments for low-level optimization were presented in the previous section. Low-level optimization helps by decreasing memory accesses and decreasing the number of processor instructions; in other words, the overhead generated by the compiler is reduced. This level of optimization is also possible because, in the APC SISO algorithm, some computations produce intermediate results that are only used in successive computations within a single iteration, so there is no need to commit these intermediate results to external memory, as the compiler does when temporary variables are used.


The baseline implementation [1] made use of some in-line assembly for optimization, primarily to take advantage of the NEON co-processor available in the Cortex-A9 processor. The NEON engine supports assembly instructions to load data into the Vector Floating-Point (VFP) registers and to perform vector operations, such as Vector Multiply and Accumulate (VMLA) and Vector Division (VDIV), which were exploited in the APC embedded implementation.

5.5.1 NEON Usage Example in APC Implementation

The main advantage of using the NEON co-processor is the ability to perform vector multiplications with a single instruction, which takes fewer cycles to execute than multiplying each of the vector elements individually. The size of the vector that a NEON instruction can take depends on the data type of each vector element. The NEON engine on the Cortex-A9 processor only supports single-precision floating-point representation, where 32 bits are assigned to a variable. Therefore, in this implementation, up to four 32-bit floating-point numbers can be loaded into a VFP register for processing.

\hat{\theta} = [\hat{a}_1, \hat{a}_2, \hat{b}_1, \hat{b}_2] \tag{5.1}

\hat{a}_1[\hat{a}_1, \hat{a}_2, \hat{b}_1, \hat{b}_2] + [\hat{a}_2, 0, \hat{b}_2, 0] = [\hat{e}^{(2)}_1, \hat{e}^{(2)}_2, \hat{g}^{(2)}_1, \hat{g}^{(2)}_2] \qquad \lambda = 2 \tag{5.2}

\hat{e}^{(2)}_1[\hat{e}^{(2)}_1, \hat{e}^{(2)}_2, \hat{g}^{(2)}_1, \hat{g}^{(2)}_2] + [\hat{e}^{(2)}_2, 0, \hat{g}^{(2)}_2, 0] = [\hat{e}^{(3)}_1, \hat{e}^{(3)}_2, \hat{g}^{(3)}_1, \hat{g}^{(3)}_2] \qquad \lambda = 3 \tag{5.3}

\hat{e}^{(3)}_1[\hat{e}^{(3)}_1, \hat{e}^{(3)}_2, \hat{g}^{(3)}_1, \hat{g}^{(3)}_2] + [\hat{e}^{(3)}_2, 0, \hat{g}^{(3)}_2, 0] = [\hat{e}^{(4)}_1, \hat{e}^{(4)}_2, \hat{g}^{(4)}_1, \hat{g}^{(4)}_2] \qquad \lambda = 4 \tag{5.4}

To illustrate the power of the NEON engine and how it was exploited, the computations


Table 5.1: SISO APC Computational Complexity (FLOPs) [1]

    Algorithm        FLOPs
    Output           2(n + m) − 1
    Recursion        2(λ − 1)(p + q + N + 1)
    Desired Output   2(p + q − 1)
    Manipulation     2N − 1
    Adaptation       4N² + 5N

performed for the most computationally intensive portion of the SISO APC algorithm, referred to as the recursion stage, are shown in equations (5.1)–(5.4) above. For a second-order system, the process of calculating ê and ĝ requires the use of the VMLA instruction. Given that the operands have been properly loaded into the VFP registers, those equations represent the computations performed by the NEON engine with one VMLA instruction per λ iteration.

The equations above show the calculation of ê and ĝ for λ up to 4. Although only three iterations of the algorithm are shown, the pattern they follow can already be observed. The VMLA instruction takes care of the vector multiplication and addition shown for each λ; no additional instructions are needed, provided that the operands have been loaded into the appropriate VFP registers. Therefore, four floating-point multiplications and four additions are computed with a single instruction, which takes fewer clock cycles using the SIMD NEON engine. The supporting code that loads the operands onto the VFP registers is not illustrated in these equations, and it also adds execution time to the recursion stage.
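For illustration only, one λ step of (5.2)–(5.4) can be emulated in scalar C standing in for the four NEON lanes (on the target, this is a single vmla.f32 over a quad register; vmla_step is a hypothetical helper, not the thesis code):

```c
/* Scalar emulation of the 4-lane VMLA step used in the recursion:
 * v_next = v[0] * v + r, where v = [e1, e2, g1, g2] and
 * r = [e2, 0, g2, 0], matching equations (5.2)-(5.4). */
static void vmla_step(const float v[4], float v_next[4])
{
    const float r[4] = { v[1], 0.0f, v[3], 0.0f };
    for (int lane = 0; lane < 4; lane++)
        v_next[lane] = v[0] * v[lane] + r[lane];
}
```

Applying vmla_step repeatedly, with each output fed back as the next input, reproduces the λ = 2, 3, 4 progression shown above.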

5.6 Recursion Stage Optimizations

The Recursion stage is the most computationally demanding portion of the SISO APC algorithm. Table 5.1 shows the floating-point operations (FLOPs) for the different sections of the code, where the Recursion and Adaptation stages stand out from the rest. The Adaptation stage was dealt with in Section 5.2; further analysis of the Recursion stage is presented in this section to justify the low-level optimizations.

Algorithm 1 shows the pseudocode for the recursion stage. Most of the code lines in


the recursion stage prepare the variables that are used as arguments for the VMLA function. The VMLA function implements the NEON Vector Multiply and Accumulate with in-line assembly, using mainly the vmla.f32 NEON instruction. This function also includes other in-line assembly instructions to load the operands into the NEON VFP registers. Only one vector, θ̂, is required as the input. θ̂ points to the current â and b̂ coefficients of the adaptive predictive model, which were calculated in the previous sampling period or were input as initial parameters for the first sampling period. Iterations over λ do change the values in θ̂. The outputs of the recursion stage are ê^λ and ĝ^λ. For the first iteration over λ, ê^λ contains an exact copy of the â elements in θ̂, and ĝ^λ contains an exact copy of the b̂ elements in θ̂; these vectors are then updated in every λ iteration with the result from the VMLA function.

Algorithm 1: Recursion Implementation

    input : θ̂, which contains [â_1, â_2 … â_n, b̂_1, b̂_2 … b̂_n]
    output: ê^λ, ĝ^λ

    ê^0 ← [â_1, â_2 … â_n];
    ĝ^0 ← [b̂_1, b̂_2 … b̂_n];
    Initialize temporary variable R̂;
    for j ← 1 to λ do
        R̂ ← [ê_2^(j−1), ê_3^(j−1) … ê_n^(j−1), 0, ĝ_2^(j−1), ĝ_3^(j−1) … ĝ_n^(j−1), 0];
        for i ← 0 to i < RegSize do
            [ê_i^(j), ĝ_i^(j)] = vmla([ê_i^(j−1), ĝ_i^(j−1)], θ̂, R̂);
        end
    end

The R̂ vector holds one of the operands used in the VMLA function. It contains a copy of most of the elements in ê^λ and ĝ^λ together, except for the first element of each of these vectors. In essence, the second element of ê^λ becomes the first element in R̂, the third element in ê^λ becomes the second element in R̂, and so forth. A zero is added after the last element of ê^λ has been copied to R̂, which lands right in the middle of the R̂ vector. After the zero in the middle, the second element of ĝ^λ is appended to the R̂ vector, then the third element, and so forth. Again, a zero is added at the end to keep the vector size equal to the


size of θ̂. This way, the computations still yield the correct results.

In the actual C language implementation of the recursion stage in [1], R̂ is obtained by first initializing it to zeros and then using a series of memory copy operations to copy the desired elements from ê^λ and ĝ^λ. The memcpy() C function with the appropriate arguments is used to accomplish this task. Therefore, blocks of data are copied from one memory location to another for every λ iteration.
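A sketch of this R̂ construction follows (array sizes assume a second-order model, so each half has two elements; build_R and the macros are illustrative names, not the baseline code):

```c
#include <string.h>

#define NN  2          /* assumed model order: n = m = 2            */
#define REG (2 * NN)   /* size of R, matching the size of theta-hat */

/* Builds R = [e2..en, 0, g2..gn, 0] from e and g via memcpy,
 * mirroring the baseline implementation's zero-then-copy approach. */
static void build_R(const float e[NN], const float g[NN], float R[REG])
{
    memset(R, 0, REG * sizeof(float));                /* zeros first        */
    memcpy(&R[0],  &e[1], (NN - 1) * sizeof(float));  /* e tail, 0 follows  */
    memcpy(&R[NN], &g[1], (NN - 1) * sizeof(float));  /* g tail, 0 follows  */
}
```

Because this runs once per λ iteration, the memory traffic it generates is part of what the assembly port described next eliminates.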

5.6.1 Low Level Optimizations

The pseudocode in Algorithm 1 shows how the recursion stage is coded. In the recursion stage, ê and ĝ are computed iteratively, which means that the result from each iteration is used in subsequent iterations. This creates a data dependency that makes it difficult to optimize via parallel execution, as was done with the Adaptation stage. However, other optimization approaches can be applied in these scenarios.

Algorithm 2: Assembly code generated by the compiler

    for j ← 1 to λ do
        regR0 ← θ̂; loaded from memory;
        regR1 ← R̂; loaded from memory;
        regR2 ← ê_0^(j−1); loaded from memory;
        regR3 ← regR2 × regR0 + regR1;
        regR3 → [ê_i^(j), ĝ_i^(j)]; stored in memory;
    end

Upon detailed analysis of the machine code generated by the compiler for Algorithm 1, it was determined that the compiler did not generate an optimal low-level implementation, mainly because it required multiple transactions between the processor registers and memory external to the processor. Algorithm 2 presents simplified pseudocode for the assembly code generated by the compiler: the data is fetched from memory before the vector operation VMLA is performed, and the result is also stored in memory. The pseudocode explicitly shows the instructions that require data to be moved in and out of external


memory. The number of memory transactions increases proportionally to λ. Therefore, when λ is large, a significant amount of time is wasted in memory transactions.

The time spent in memory transactions was greatly reduced by porting the complete recursion section to assembly code. With assembly language, it became possible to transfer data among registers instead of committing the data to external memory and reading it back in subsequent iterations. This was only possible because the result of the vector operation is only needed for the next iteration, so there is no need to save it to memory.

Since the result from the VMLA instruction is already in the registers, it can simply be moved to the appropriate register, which takes significantly less time than writing it to and reading it from memory. The subsequent iterations can then be executed without accessing external memory.

5.7 Chapter Summary

This chapter described how the dual-core processor was leveraged to reduce the execution time of the embedded SISO APC implementation. First, it revealed the subtleties of working with the two cores on the ZYNQ SoC and the particular solution adopted for the parallel implementation. Finally, it described the low-level optimizations that made it possible to further improve the application's execution time.


Chapter 6

Setup and Results

6.1 ZYNQ Hardware Configuration

Zynq SoCs are configured using the Vivado Design Suite from Xilinx, which provides an environment to configure, implement, verify, and integrate Intellectual Property (IP). The Vivado development environment includes a comprehensive library of IP blocks, called the LogiCORE library, and templates for Zynq and FPGA designs. To develop register-transfer level (RTL) logic for a project, the included IP block templates can be used and customized via the IP Integrator tool.

The project design diagram used in this work is shown in Figure 6.1, which was generated with the Vivado IP Integrator. The main block in the design is the Processing System 7 IP core, displayed as ZYNQ Processing System in the diagram, which is included in the LogiCORE library. This block is the software interface around the Zynq-7000 SoC; it acts as a logic connection between the PS and the PL and assists in the integration of custom and embedded IPs. The AXI Interconnect block acts as a bus, which allows access to peripherals such as timers, UART ports, GPIO pins, the CAN bus, etc. In this implementation, the AXI GPIO blocks create the interface between the ZYNQ PS and the components connected to the GPIO ports on the development board. The Zedboard includes a series of 8 switches, which in




Figure 6.1: ZYNQ Design configuration

past implementations have been used to start/stop the execution of the code, as well as some LEDs that indicate whether the processor is busy or has completed the cycle of iterations specified in the implementation.

Using Vivado, it is simple to create such a design. The design was auto-generated by the IP Integrator after the ZYNQ PS block and the AXI GPIO blocks were added to the block diagram. The wizard added the missing blocks, in this case the AXI Interconnect and the Processor System Reset blocks, and made the appropriate connections among them. The Processor System Reset IP module provides customized resets for an entire processor system, including the processor, the interconnect, and the peripherals. It oversees the sequencing of the reset signals after a reset. For instance, the UART, SPI, and IIC peripherals come out of reset after 16 clock cycles.

The ARM processor clock frequency is customizable and is configured in the ZYNQ PS block, as seen in Table 6.1. The clock was set to 667 MHz for this implementation. A
