Xilinx HLS + OmpSs@FPGA Course Guide


XUP and OmpSs@FPGA Team

February 12, 2019

Index

1 Course Environment
1.1 Hard Disk: Ubuntu system and code
1.2 Docker: Cross-compilation environment
1.2.1 Login in the container
1.2.2 Sharing data between host Ubuntu system and docker container
1.2.3 Cross-compiling
1.2.4 Vivado HLS analysis
1.3 Zynq Board System
1.3.1 Boot from SD
1.3.2 Minicom Connection
1.3.3 Ethernet connection
1.3.4 Application Execution
1.4 Shutdown
1.5 Disclaimer
2 First Day
2.1 Vivado HLS analysis of the MxM
2.1.1 OmpSs compilation and OmpSs@FPGA porting
2.1.2 Compilation to Design and Analysis
2.1.3 RTL/C Simulation
2.1.4 RTL/C Simulation w/ Dump Trace
2.2 Vivado HLS analysis and optimization of the RGB2YUV Filter
2.2.1 OmpSs compilation and OmpSs@FPGA porting
2.2.2 Compilation to Design and Analysis
2.2.3 Optimizations and Analysis
2.2.4 Compilation to Bitstream
2.2.5 Paraver Trace analysis
3 Second Day
3.1 Vivado HLS analysis and optimization of a DCT code
3.1.1 OmpSs compilation and OmpSs@FPGA porting
3.1.2 Compilation to Design and Analysis
3.1.3 Optimizations and Analysis
3.1.4 Compilation to Bitstream
3.1.5 Timing and Execution in the Zynq board
3.1.6 Paraver Trace analysis
3.2 Hardware Instrumentation
3.2.1 Default Instrumentation
3.2.2 User Instrumentation
3.3 MxM
3.3.1 Vivado HLS pragmas and OmpSs@FPGA MxM porting
3.3.2 Compilation to Bitstream
3.3.3 Timing and Execution of the OmpSs@FPGA application
3.3.4 Paraver Trace analysis
3.3.5 Tuning MxM application
3.3.6 Blocking MxM application

1 Course Environment

1.1 Hard Disk: Ubuntu system and code

The bootable hard disk runs Ubuntu 16.04 and comes with a Docker container prepared for cross-compiling to the Zynq 7000 (ZedBoard). The next section explains how to access this container and how to copy files to and from it when you compile. This Ubuntu system also includes the minicom application, which is used to connect to the board over a serial link (explained later). You can also copy data to the ZedBoard using ssh (explained later).

1. To boot from the hard disk, you may have to press a function key (usually F9 or F12) to make the BIOS appear.

2. Boot from the external hard disk into Ubuntu (password: ubuntu).

3. Work in the tutorial directory.

4. The original source code should be in this directory (or provided to you on a USB drive if you are running your own system).

1.2 Docker: Cross-compilation environment

1.2.1 Login in the container

1. The Docker container has already been created.

2. Run: sudo docker ps -a . The package release may differ, but the command output should be the list of existing Docker containers (running or stopped):

3. Look for the name of the container. Run: sudo docker start -i -a name_container

4. You can exit the container by running exit. To log in again, repeat the previous command.

1.2.2 Sharing data between host Ubuntu system and docker container

1. The /home/ubuntu directory inside the Docker container gives you access to the Ubuntu 16.04 host file system (or to the directory you chose, if you ran the Docker image to create a new container).

2. You can cp files between host directories and container directories, or work directly in a host directory.

1.2.3 Cross-compiling

1. The Makefile and the course code are prepared to compile and generate an OmpSs binary and the bitstream with the accelerator specified by the programmer.

2. We advise you to open the Makefiles to see the targets and compilation commands we are going to use.

• To estimate the performance of your accelerators before starting the time-consuming process of generating the hardware, you can use the target: VERSION_CODE=NAME make toDesign-zedboard-i . This Makefile target passes the compilation flag --to_step=design to our OmpSs@FPGA toolchain, which makes the compilation stop after the Vivado HLS project and the Vivado design have been generated. Later we will explain how to analyze the estimated performance. In addition, we compile with hardware instrumentation so that we can generate and analyze execution traces.

• VERSION_CODE is an environment variable that can be defined on the command line; you can then use #ifdef in the source code to activate or deactivate parts of it. Be careful with the defines you use, since a define can change the meaning of keywords or variables in your code. It is usually a good idea to use capital letters.

3. To compile partially or completely to generate the bitstream, the computer must be connected to the network (wired or wireless). You should have access to eduroam; otherwise, ask us for help.

4. Compilation may take a while. You can use nohup to run the compilation in the background so that you can continue working on the following exercise while it compiles.

5. Remember:

• For partial compilation:

VERSION_CODE=PROGRAM_TUTORIAL nohup make toDesign-zedboard-i >& \
    output_toDesign.PROGRAM_TUTORIAL.txt &

During the tutorial, we will also open the Vivado HLS project, modify it, and export the new IP. In that case, we have to open the Vivado project created by the previous command and update it with the new IP release. Make a backup of the name.xtasks.config file that you will find in zedboard_PROGRAM_TUTORIAL/name_autoVivado/. Then run this second partial command to finally generate the bitstream:

VERSION_CODE=PROGRAM_TUTORIAL nohup make fromSyn-zedboard-i >& \
    output_fromSyn.PROGRAM_TUTORIAL.txt &

In both cases you can run:

tail -f output.txt

to observe the progress of the compilation.

• For complete compilation:

nohup make bitstream-zedboard-i >& output_PROGRAM_TUTORIAL.txt &
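As a minimal sketch of how VERSION_CODE can select code paths with #ifdef, assume that the Makefile forwards its value to the compiler as a -D define (an assumption; check the Makefile to confirm). The macro name PROGRAM_TUTORIAL is the placeholder used in the commands above, and the function active_version is purely illustrative:

```cpp
#include <cassert>
#include <string>

// Hypothetical example: PROGRAM_TUTORIAL would be defined when building with
// VERSION_CODE=PROGRAM_TUTORIAL, assuming the Makefile passes -DPROGRAM_TUTORIAL.
// Capital letters make it unlikely that the define collides with a variable name.
std::string active_version() {
#ifdef PROGRAM_TUTORIAL
    return "tutorial";   // compiled only when the define is present
#else
    return "default";    // compiled when the define is absent
#endif
}
```

Built without the define, active_version() returns "default"; built with -DPROGRAM_TUTORIAL it returns "tutorial".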

1.2.4 Vivado HLS analysis

1. After partial compilation up to the design step, you can analyze the Vivado HLS project.

VERSION_CODE=PROGRAM_TUTORIAL nohup make toDesign-zedboard >& output_PROGRAM_TUTORIAL.txt &

2. Once the command above has finished, the Vivado HLS project can be opened with the following command:

vivado_hls -p zedboard_PROGRAM_TUTORIAL/name_autoVivado/Vivado_HLS/accelerator_name

3. Once it is open, select the Analysis perspective in the top-right corner. There, look for the accelerated function to analyze its latency and interval information, as well as its resource requirements.

5. In the case of the corr_fpga kernel shown in the picture, there is no pipelining at all, and the latency we pay is 16 cycles per iteration.

6. At the top right, you can also see the number of cycles per iteration of each loop, and the high-level mnemonics of the operations.

• Yellow: indicates the overall latency per iteration.

• Red: indicates conflicts. You can click the red cycles to see where the conflicts are.

1.3 Zynq Board System

1.3.1 Boot from SD

1. BE SURE your board is turned off.

2. Insert the microSD card in the board.

3. Connect the board to the power supply.

4. Connect the microUSB end of the microUSB-USB cable to the UART connector.

5. Connect the USB end of the microUSB-USB cable to the computer.

6. Turn on the board.

7. Run dmesg on your computer and look for something like "... ttyACM0: USB ACM device ..." or "... ttyUSB0 ...". This is the device name you have to use to set up the minicom port.

1.3.2 Minicom Connection

1. Assuming that you have seen ttyUSB1 in the dmesg log, run "sudo minicom -D /dev/ttyUSB1 -b 115200".

2. At the login prompt, use ubuntu:ubuntu .

3. The Zynq system also runs Ubuntu 16.04.

4. Check the date. If it is not the current one, run sudo date -s "DD MMM YYYY HH:MM:SS" to set the current date and time. DD stands for the day, MMM for the first three letters of the month, YYYY for the year, and HH:MM:SS for hours, minutes, and seconds.

1.3.3 Ethernet connection

You can access the board through the Ethernet connection, using ssh to log in or scp to copy data to/from the host. This is useful to transfer files between the host Ubuntu system and the Zynq system.

1. First, from the minicom connection, run sudo ifup eth0. The IP address obtained is usually 10.42.0.220.

• Login: the IP obtained in the previous step is the one you can use for ssh or scp.

• Assuming the IP is 10.42.0.220, run ssh -X ubuntu@10.42.0.220 if you want to log in remotely.

• Copy from host to ZedBoard: scp files ubuntu@10.42.0.220:path

1.3.4 Application Execution

1. First, compile the application in the Docker container (smp or fpga version of the code):

2. Then, transfer the following files from the host system to the Zynq system:

• For SMP-only or heterogeneous OmpSs@FPGA applications: copy the ARM binary file, which is in the current compilation directory.

• For applications with FPGA acceleration:

– Bitstream (.bin): this file should be under zedboard_PROGRAM_VERSION/name_autoVivado/ and it is the bitstream that we have to write to the FPGA device.

– name.xtasks.config: this file should be under zedboard_PROGRAM_VERSION/name_autoVivado/ and it is the runtime configuration file that is needed to run the program. For partial compilation, use the backup you made before running the second partial command that completes the compilation. For complete compilation, use the one you find in the directory.

3. Log in to the Zynq system.

4. Do the following steps:

• Run: cat bitstream.bin > /dev/xdevcfg

• Run for non-instrumented:

– Run:

NX_ARGS="--summary" ./program arguments

• Run for instrumented:

– Only once: export EXTRAE_CONFIG_FILE=extrae.xml

– Be sure you have the extrae.xml file in the directory. Then run:

NX_ARGS="--summary --instrumentation=extrae" ./program arguments

1.4 Shutdown

1. Shut down the machine properly with sudo halt or sudo shutdown now.

1.5 Disclaimer

The OmpSs@FPGA team has ported or adapted the codes of the XUP tutorials to fit the requirements of OmpSs@FPGA applications, and to show the benefits of using OmpSs@FPGA for production and analysis. In general, the porting consists of the following:

• Constants and global variables have been moved to include files with the .fpga.h extension. Const array/matrix variables should also be moved to the .fpga.h files.

• There must be a call to the accelerated kernel in the same source file; otherwise, no accelerator will be generated.

• Parameters of the top function in OmpSs@FPGA use the m_axi protocol, which provokes errors when using structs with different field characteristics (to be analyzed). Those fields have been separated into different parameters.

• Dimension and size information in pragmas should be const values.

• Makefile targets have been tuned so that the Vivado project of a compilation can be generated even if the Vivado HLS resources of the kernel exceed the maximum of the board. This allows us to play with the Vivado HLS project, export the new IP, update the IP in the Vivado project, and continue the compilation.
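The porting rules above can be illustrated with a small sketch. Everything here is hypothetical (the struct image_sketch_t, the constant MAX_PIXELS, and the functions negate_hw and negate are not part of the course code), and the OmpSs pragmas are omitted; it only shows a struct being unpacked into separate parameters and the kernel being called from the same source file:

```cpp
#include <cassert>
#include <cstdint>

// Would normally live in a .fpga.h header (constant moved out of the source file).
const int MAX_PIXELS = 64;

// Hypothetical image type with fields of different characteristics.
struct image_sketch_t {
    int width;
    int height;
    uint8_t ch1[MAX_PIXELS];
    uint8_t ch2[MAX_PIXELS];
    uint8_t ch3[MAX_PIXELS];
};

// Kernel candidate: every field is a separate parameter, so each array can
// map to its own port instead of passing the whole struct.
void negate_hw(int npixels,
               const uint8_t *in1, const uint8_t *in2, const uint8_t *in3,
               uint8_t *out1, uint8_t *out2, uint8_t *out3) {
    for (int i = 0; i < npixels; i++) {
        out1[i] = static_cast<uint8_t>(255 - in1[i]);
        out2[i] = static_cast<uint8_t>(255 - in2[i]);
        out3[i] = static_cast<uint8_t>(255 - in3[i]);
    }
}

// Host-side function in the same source file: it unpacks the struct and
// calls the kernel, so the call is visible where the kernel is defined.
void negate(const image_sketch_t &in, image_sketch_t &out) {
    assert(in.width * in.height <= MAX_PIXELS);
    out.width = in.width;
    out.height = in.height;
    negate_hw(in.width * in.height, in.ch1, in.ch2, in.ch3,
              out.ch1, out.ch2, out.ch3);
}
```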

2 First Day

2.1 Vivado HLS analysis of the MxM

In this section you will port the MxM code to OmpSs@FPGA with no Vivado HLS optimizations. On the second day, you will optimize it.

2.1.1 OmpSs compilation and OmpSs@FPGA porting

Do the following actions:

• The code is already prepared with the OmpSs pragmas for an smp application. Compile it to generate an smp version that we can execute to check that the result is correct. Run:

make smp-zedboard-i

• Rename the binary to smp_matmul. We can copy it to the Zynq system and then run it:

• Edit the matmul.cpp code and add the target device(fpga) pragma and the necessary clauses to do the copies. We only want one accelerator instance of matmulBlock, with the data dependences copied in and out of the fpga task.
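One possible shape of the annotated kernel is sketched below. The clause names copy_deps and num_instances and the in/inout array-section syntax are the ones that appear later in this guide; the element type, the block size, and the guard macro SKETCH_USE_OMPSS are assumptions made so that the sketch also builds with a plain C++ compiler. Treat it as a starting point rather than the reference solution:

```cpp
#include <cassert>

typedef float elem_t;          // assumption: element type of the matrices
const unsigned int BSIZE = 4;  // assumption: small block size for illustration

// The pragmas are guarded by a hypothetical macro because a plain C++
// compiler does not understand the OmpSs clauses.
#ifdef SKETCH_USE_OMPSS
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([BSIZE*BSIZE]a, [BSIZE*BSIZE]b) inout([BSIZE*BSIZE]c)
#endif
void matmulBlock(const elem_t *a, const elem_t *b, elem_t *c) {
    for (unsigned int i = 0; i < BSIZE; i++) {
        for (unsigned int j = 0; j < BSIZE; j++) {
            elem_t sum = c[i*BSIZE + j];  // c is inout: accumulate onto it
            for (unsigned int k = 0; k < BSIZE; k++) {
                sum += a[i*BSIZE + k] * b[k*BSIZE + j];
            }
            c[i*BSIZE + j] = sum;
        }
    }
}
```

Here copy_deps asks the runtime to copy the declared dependences in and out of the task, and num_instances(1) requests a single accelerator instance.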

2.1.2 Compilation to Design and Analysis

• Run the following command to generate the Vivado HLS project and create the Vivado design:

VERSION_CODE=PROGRAM_TUTORIAL nohup make toDesign-zedboard-i >& \
    output_toDesign.PROGRAM_TUTORIAL.txt &

• Once it has finished, open the Vivado HLS project by running this command (adapted to the program name and accelerator name, as described in section 1.2.4):

• You can right-click an operation to go to the source code associated with it.

• To figure out the details of each loop, look at the Performance Profile and Module Hierarchy panels, and at the trip count in the top-right panel.

Zoom-in of the Module Hierarchy

Are any of the loops pipelined? What is the overall latency of the loops?

• Regarding the resources, you can look at the Module Hierarchy and Resource Profile panels and, on the other hand, at the resource panel of the top-right view. This also shows where the memory ports are needed.

2.1.3 RTL/C Simulation

You can also adapt the Vivado HLS project generated by the OmpSs@FPGA toolchain to use test benches and RTL/C simulation. You will need to modify the wrapper code generated by the OmpSs@FPGA toolchain and add some compilation flags. Do the following steps:

• Open the Vivado HLS project.

• Add the test bench matmul_test.cpp by right-clicking in the Explorer panel (top left).

• Double-click matmulBlock_hw_hls_automatic_mcxx.cpp on the left panel (Synthesis view).

• Add #ifndef HW_COSIM after the include files in the source code you have just opened:

• Also, modify the code to make matmulBlock_hw_automatic_mcxx_wrapper the entry point of the simulation, with the same parameters as the kernel to be accelerated in the host code:

• Hide the rest of the code, which is OmpSs@FPGA related, up to the end of the file:

• Activate HW_COSIM at compilation time, also for the test bench, using:

• Once you have done that, you can start the RTL/C simulation using Project and C/RTL Simulation. Choose VHDL and the input arguments 64 1, which mean matrix size 64x64 and check the result. Click OK.

• After running the RTL/C simulation, you should see something like the following, showing that the test passed.
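The core of such a test bench is a comparison between the kernel output and a golden reference. The sketch below is not the matmul_test.cpp provided in the course; kernel_standin replaces the generated wrapper, and elem_t is assumed to be float. It only illustrates the checking pattern, where a mismatch count of 0 means the test passed:

```cpp
#include <cassert>
#include <vector>

typedef float elem_t;  // assumption: element type used by the kernel

// Stand-in for the kernel under test (loop order i, k, j). In the real
// project the test bench would call the generated wrapper instead.
void kernel_standin(int msize, const elem_t *a, const elem_t *b, elem_t *c) {
    for (int i = 0; i < msize; i++)
        for (int k = 0; k < msize; k++)
            for (int j = 0; j < msize; j++)
                c[i*msize + j] += a[i*msize + k] * b[k*msize + j];
}

// Golden reference (loop order i, j, k).
void reference_matmul(int msize, const elem_t *a, const elem_t *b, elem_t *c) {
    for (int i = 0; i < msize; i++)
        for (int j = 0; j < msize; j++) {
            elem_t sum = 0;
            for (int k = 0; k < msize; k++)
                sum += a[i*msize + k] * b[k*msize + j];
            c[i*msize + j] += sum;
        }
}

// Returns the number of mismatching elements. A test bench main() would
// return this value, so that 0 reports a passed test to the simulator.
int count_mismatches(int msize) {
    assert(msize > 0);
    std::vector<elem_t> a(msize*msize), b(msize*msize);
    std::vector<elem_t> c_hw(msize*msize, 0), c_ref(msize*msize, 0);
    for (int i = 0; i < msize*msize; i++) {
        a[i] = (elem_t)(i % 7);  // small integers keep the float sums exact
        b[i] = (elem_t)(i % 5);
    }
    kernel_standin(msize, a.data(), b.data(), c_hw.data());
    reference_matmul(msize, a.data(), b.data(), c_ref.data());
    int mismatches = 0;
    for (int i = 0; i < msize*msize; i++)
        if (c_hw[i] != c_ref[i]) mismatches++;
    return mismatches;
}
```

With the arguments used above, the test bench would run the equivalent of count_mismatches(64).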

2.1.4 RTL/C Simulation w/ Dump Trace

Now we are going to run the same simulation while generating a simulation trace, which you will be able to analyze with Vivado.

• Start an RTL/C simulation using Project and C/RTL Simulation. This time, choose Verilog, dump trace all, and the input arguments 64 1, which mean matrix size 64x64 and check the result. Click OK.

• After running this simulation, you should see some execution traces on the left:

• Push the button to open Vivado with the simulation trace. Then, you can expand the variable signals and change the radix of the address variables (right-click, Radix, decimal).

You can explore the trace by zooming in and out.

2.2 Vivado HLS analysis and optimization of the RGB2YUV Filter

2.2.1 OmpSs compilation and OmpSs@FPGA porting

Function yuv_filter_hw is already annotated with OmpSs pragmas. In addition, the original code has been modified so that there is a function that calls the future FPGA task from the same source file; otherwise, the accelerator would not be generated. Constants and macros needed by the accelerator are all within .fpga.h include files. The channels of the image have been split into different parameters/ports; otherwise, there may be problems with the m_axi port and the different fields of the image_t structure.

2.2.2 Compilation to Design and Analysis

Create a naive project for yuv_filter in Vivado HLS and the preliminary design of the Vivado project by running this command line:

VERSION_CODE=YUV_TUTORIAL nohup make toDesign-zedboard-i >& \
    output_toDesign.YUV_TUTORIAL.txt &

The name YUV_TUTORIAL will be part of the autoVivado project directory, zedboard_YUV_TUTORIAL, where you will find the Vivado HLS and Vivado projects.

Note that the Makefile includes a compilation flag (--disable_utilization_check) that allows you to create the Vivado project even if the Vivado HLS compilation reports more resources than are available.

In fact, this happens for this first compilation:

2.2.3 Optimizations and Analysis

Start adding some directives so that we can analyze the hardware that we are creating, improve its performance, and reduce the amount of resources.

• Open the Vivado HLS project:

vivado_hls -p zedboard_YUV_TUTORIAL/yuv_filter_autoVivado/Vivado_HLS/yuv_filter_hw

• Run synthesis (green play)


• Open synthesis, solution1, syn, report, rgb2yuv_csynth.rpt. There is no information about latencies or intervals, since the compiler cannot determine the number of iterations.

• Uncomment the TRIPCOUNT pragmas to allow the iteration estimation, and add the pragma HLS INLINE off to yuv_filter_hw so that we can analyze this function in the analysis view. Then, run synthesis again and, once it has finished, go to solution1, syn, report, yuv_filter_hw and open it.

Now you should have information about the latency of the top kernel function of the accelerator.

However, yuv_scale does not appear, since it has been inlined.
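As a sketch of what these two directives look like in source form, the function below uses loops with variable bounds. The function scale_sum_hw and the trip count bounds are hypothetical and are not taken from the course code; LOOP_TRIPCOUNT only affects the reported estimates, not the generated hardware, and a plain C++ compiler simply ignores the HLS pragmas:

```cpp
#include <cassert>

// Hypothetical kernel with variable loop bounds.
int scale_sum_hw(const unsigned char *in, int width, int height, int scale) {
#pragma HLS INLINE off
    assert(width > 0 && height > 0);
    int acc = 0;
    for (int y = 0; y < height; y++) {
        // Bounds are illustrative; they let the tool report latency estimates.
#pragma HLS LOOP_TRIPCOUNT min=200 max=1280
        for (int x = 0; x < width; x++) {
#pragma HLS LOOP_TRIPCOUNT min=200 max=1920
            acc += (in[y*width + x] * scale) >> 7;
        }
    }
    return acc;
}
```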

• We can look at the latencies of the three main functions (rgb2yuv, yuv_scale, and yuv2rgb) in the Performance Estimates of the yuv_filter_hw report: Instances and Loop (yuv_scale is inlined).

• Analyze the cycles in the Analysis perspective view, selecting yuv_filter_hw:

• Analyze the resources needed by each of the functions called by yuv_filter_hw by accessing Detail, Instances:

and the rgb2yuv:

• Add INLINE off to the yuv_scale function, so that we can analyze it better, and also because we would like to create a dataflow between functions. Also, if we look at the loops of the different functions, they have significant latencies with no II. So, we would like to pipeline their execution using the PIPELINE directive: “The pipeline directive must be applied to the inner-most loop in each function – the inner-most loops have no variable-bounded loops inside of them which are required to be unrolled and the outer loop will simply keep the inner loop fed with data”
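The placement of the directive can be sketched as follows. The function rgb2yuv_sketch is not the course kernel; it uses a common integer approximation of the RGB to YUV conversion, with one array per channel, and the point is only where the PIPELINE pragma sits (inside the inner-most loop):

```cpp
#include <cassert>
#include <cstdint>

// Clamp an intermediate value to the 0..255 range of a channel sample.
static uint8_t clamp_u8(int v) {
    return static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Hypothetical conversion kernel with separate channel arrays.
void rgb2yuv_sketch(int width, int height,
                    const uint8_t *r, const uint8_t *g, const uint8_t *b,
                    uint8_t *y, uint8_t *u, uint8_t *v) {
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            // Directive on the inner-most loop: the outer loop keeps it fed.
#pragma HLS PIPELINE II=1
            const int i = row*width + col;
            const int R = r[i], G = g[i], B = b[i];
            y[i] = clamp_u8(((66*R + 129*G + 25*B + 128) >> 8) + 16);
            u[i] = clamp_u8(((-38*R - 74*G + 112*B + 128) >> 8) + 128);
            v[i] = clamp_u8(((112*R - 94*G - 18*B + 128) >> 8) + 128);
        }
    }
}
```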

• Again, synthesize and analyze the yuv_filter_hw synthesis report:

The number of cycles has decreased from 29M to 7.3M, roughly a 4x reduction. Although the resource usage has increased a little, the overall performance gain is significant. If we had different solutions, we could use Project, Compare Reports to compare all of them. In our case, to ease the integration with the Vivado project already generated, we are not creating new solutions.

• Add DATAFLOW to yuv_filter_hw to try to run rgb2yuv, yuv_scale, and yuv2rgb in dataflow, and synthesize.

Adding the DATAFLOW directive:

Synthesize:

The number of BRAMs increases a lot due to the dataflow. In fact, it depends on the size of the image passed from one function to the next in the dataflow execution. The image we are using is 200x270, and there seems to be no problem with the amount of BRAM resources.

By default, dataflow creates a ping-pong buffer between the functions. However, if the image size were 1920x1280, the amount of BRAM needed would be much larger than what is available on the board.
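A rough calculation shows how the buffer size scales with the image. This is a simplified model, not what Vivado HLS reports: it assumes that a ping-pong buffer holds two full copies of one single-channel image with one byte per sample, and that the device offers about 140 block RAMs of 36 Kb each (an assumption about the ZedBoard device; check the data sheet):

```cpp
#include <cassert>
#include <cstdint>

// Assumption: about 140 block RAMs of 36 Kb each are available.
const uint64_t ASSUMED_BRAM_BYTES = 140ULL * 36 * 1024 / 8;

// Bytes of one ping-pong (double) buffer holding one single-channel image.
uint64_t pingpong_bytes(uint64_t width, uint64_t height, uint64_t bytes_per_sample) {
    assert(width > 0 && height > 0 && bytes_per_sample > 0);
    return 2 * width * height * bytes_per_sample;
}

// True if a buffer of this size fits in the assumed BRAM capacity.
bool fits_in_bram(uint64_t bytes) {
    return bytes <= ASSUMED_BRAM_BYTES;
}
```

Under these assumptions, a single channel of a 200x270 image needs 108,000 bytes per ping-pong buffer, while a 1920x1280 image needs 4,915,200 bytes, which already exceeds the assumed capacity before counting the other channels and buffers.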

On the other hand, if we continue with the default configuration, the Vivado project will not synthesize:

Therefore, at Output Settings, specify the following command: configuration dataflow to use

2.2.4 Compilation to Bitstream

In order to generate the bitstream we have to update the IP in the Vivado Project. Do the following steps:

• From the Vivado HLS project, synthesize and export RTL.

• Open the vivado project:

vivado zedboard_YUV_TUTORIAL/yuv_filter_autoVivado/Vivado/yuv_filter/yuv_filter.xpr

• Open Block Design (left menu):

• Run Report IP Status:

• Select the IP to update (skipping output products):

• And exit:

• Keep the xtasks.config with the information of the accelerator (id, frequency, and name):

cp zedboard_YUV_TUTORIAL/yuv_filter_autoVivado/yuv_filter.xtasks.config .

• Run this command in order to continue the OmpSs@FPGA compilation to bitstream:

VERSION_CODE=YUV_TUTORIAL nohup make fromSyn-zedboard-i >& output.fromSyn.zedboard.txt &

After a while, the bitstream should be generated:

2.2.5 Paraver Trace analysis

One way to understand the execution time of the applications is to analyze the execution trace. OmpSs@FPGA allows us to analyze the smp and fpga application executions.

• Execute the processing of 10 images using the smp version:

export EXTRAE_CONFIG_FILE=extrae.xml

NX_ARGS="--summary --instrumentation=extrae" ./smp_yuv_filter

• Execute the fpga version with one accelerator, in order to obtain an execution trace.

export EXTRAE_CONFIG_FILE=extrae.xml

NX_ARGS="--summary --instrumentation=extrae" ./yuv_filter

• Copy the execution traces to the Docker container.

• Run:

wxparaver *.prv

• Use the configuration file smp_and_fpga_tasks.cfg that you will find in the paraver_cfg directory.

• Use the zoom to select the SMP tasks:

Timing of all the execution:

Timing of one SMP task:

• Use the zoom to select the FPGA tasks:

Timing of all the execution:

Timing of one FPGA task:

The overall performance is more or less the same. However, note that in the SMP case we are exploiting the two ARM cores at 666 MHz, while in the accelerator case we are exploiting only one accelerator at 100 MHz. On the other hand, each accelerator task is much faster than one SMP task execution, although we pay cycles to copy the data in and out of the accelerator. Therefore, increasing the frequency of the copies, overlapping the execution of

For instance, just specifying target device(fpga,smp) helps us improve the overall performance thanks to heterogeneous execution:

Timing of all the execution: SMP-only tasks:

Timing of all the execution: FPGA-only tasks:

Timing of all the execution: heterogeneous SMP and FPGA tasks:

3 Second Day

3.1 Vivado HLS analysis and optimization of a DCT code

This is a simple code whose objective is to show the usage and benefits of the Vivado HLS directives. However, do not expect an execution improvement over the SMP code of the application where this DCT kernel is used in the tutorial.

3.1.1 OmpSs compilation and OmpSs@FPGA porting

Do the following actions:

• The code is already prepared with the OmpSs pragmas for an smp application. Compile it to generate an smp version that we can execute to check that the result is correct. Run:

make smp-zedboard-i

• Rename the binary to smp_dct. We can copy it to the Zynq system and then run it:

• Edit the dct.c code and add the target device(fpga) pragma and the necessary clauses to do the copies. We only want one accelerator instance of dct_hw, with the data dependences copied in and out of the fpga task.

3.1.2 Compilation to Design and Analysis

• Run the following command to generate the Vivado HLS project and create the Vivado design:

VERSION_CODE=DCT_TUTORIAL nohup make toDesign-zedboard-i >& output.toDesign.zedboard-dct.txt &

The amount of resources used is small:

• Once it has finished, open the Vivado HLS project by running this command (adapted to the program name and accelerator name):

vivado_hls -p zedboard_DCT_TUTORIAL/dct_autoVivado/Vivado_HLS/dct_hw

• The latency of the overall project is:

The latencies and intervals of dct_hw, dct2d (with its calls to dct1d), write_data, and read_data are:

DCT HW latency and intervals:

DCT2D latency and intervals:

DCT1D latency and intervals:

• The overall hardware resources used can be seen in the Utilization Estimates:

3.1.3 Optimizations and Analysis

The reports have shown that no pipelining is applied to any of the kernels. In this section we are going to optimize the code by adding some Vivado HLS pragmas.

• Apply the PIPELINE directive to DCT_Inner_Loop, Xpose_Row_Inner_Loop, Xpose_Col_Inner_Loop, RD_Loop_Col, and WR_Loop_Col, and run synthesis.

• The output of this synthesis run should be similar to:

The latencies have been significantly reduced, while the amount of resources has not increased too much:

• Open the Analysis perspective and determine where most of the clock cycles are spent, i.e. where the large latencies are. In particular, look at dct2d and dct1d.

The dct2d reports show that some parts are not pipelined.

dct1d shows a latency of 97 cycles. The reason is that the inner-most loop could not be flattened.

• Move the pipeline pragma of dct1d from the inner to the outer loop, and run synthesis.
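The effect of moving the pragma can be sketched on a simplified 1-D transform. This is not the course's dct_1d: the coefficient table is passed as a parameter, the types and DCT_SIZE are assumptions, and no descaling is done. With PIPELINE on the outer loop, Vivado HLS fully unrolls the inner loop, so all the reads of src for one output are requested in the same iteration:

```cpp
#include <cassert>

const int DCT_SIZE = 8;    // assumption: 8-point transform
typedef short dct_data_t;  // assumption: 16-bit samples

// Simplified 1-D transform: dst[k] = sum over n of src[n] * coeff[k][n].
void dct_1d_sketch(const dct_data_t src[DCT_SIZE], dct_data_t dst[DCT_SIZE],
                   const dct_data_t coeff[DCT_SIZE][DCT_SIZE]) {
DCT_Outer_Loop:
    for (int k = 0; k < DCT_SIZE; k++) {
        // Pipelining the outer loop implies unrolling DCT_Inner_Loop.
#pragma HLS PIPELINE
        int tmp = 0;
DCT_Inner_Loop:
        for (int n = 0; n < DCT_SIZE; n++) {
            tmp += src[n] * coeff[k][n];
        }
        dst[k] = static_cast<dct_data_t>(tmp);
    }
}
```

Because the unrolled loop reads every element of src at once, the number of memory ports of src becomes the next limit, which is what the following steps address.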

• Look at the console for messages that give more information about what causes an II = 4:

It seems that there are not enough ports to access the col_inbuf and buf_2d_in arrays. This can be seen by selecting dct2d and opening the Resource tab, Memory ports. The number of ports accessing the src variable (col_inbuf or buf_2d_in, depending on the call) is the maximum that the BRAM has. This indicates a possible bottleneck in the BRAM accesses.

• Apply the ARRAY_PARTITION directive to buf_2d_in of dct (since the bottleneck was on the src port of the dct_1d function, which was passed via in_block of the dct_2d function, which in turn was passed via buf_2d_in of the dct function) and to col_inbuf of dct_2d. Use complete partitioning on dimension 2. That results in:

This halves the overall latency, because the latency of dct1d is reduced at the cost of more BRAM resources:

• Apply the DATAFLOW directive to improve the throughput of the top function dct_hw, and synthesize.

On the left, in the Module Hierarchy, the dataflow that has been created appears:

And selecting dct2d:

Now, Row_DCT_Loop and Col_DCT_Loop seem to limit the overall performance. We should expose the loops inside dct_2d to the dataflow. To achieve that, we should inline dct_2d.

• Apply the INLINE directive to function dct_2d.

The reduction in latency is really large compared to the original accelerator code. Observe that the dct_2d entry is now replaced with dct_Loop_Row_DCT_Loop_proc, dct_Loop_Xpose_Row_Outer_Loop_proc, dct_Loop_Col_DCT_Loop_proc, and dct_Loop_Xpose_Col_Outer_Loop_proc, since the dct_2d function is inlined. Also observe that all the functions operate in parallel, yielding a top-level function interval (throughput) of 114 clock cycles.

3.1.4 Compilation to Bitstream

In order to generate the bitstream, we have to update the IP in the Vivado project. To do that:

• From the Vivado HLS project, synthesize and export RTL.

• Open the vivado project:

vivado zedboard_DCT_TUTORIAL/dct_autoVivado/Vivado/dct/dct.xpr

• Open Block Design (left menu).

• Run Report IP Status, select the IP to update, and update the IP (skipping output products), and exit.

• Keep the xtasks.config with the information of the accelerator: id, frequency and name.

cp zedboard_DCT_TUTORIAL/dct_autoVivado/dct.xtasks.config .

And run the following command in order to continue the OmpSs@FPGA compilation to bitstream:

VERSION_CODE=DCT_TUTORIAL nohup make fromSyn-zedboard-i >& output.fromSyn.zedboard.txt &

After a while, the bitstream should be generated.

3.1.5 Timing and Execution in the Zynq board

Copy the binary files of the OmpSs@FPGA application and the xtasks.config file. Remember that the ARM binary is in the current directory, and the bitstream (.bin) and xtasks.config file are under the autoVivado project directory (see subsection 1.3.4 above).

Do the following steps:

• Load the bitstream to the FPGA (it is just a cat command for the Zynq 7000 family).

• Run the dct program using the same input size as above for the smp version.

• Observe that the execution time is worse than that of the smp version. The accelerator is so small that the overhead of calling it outweighs the benefits of the acceleration. Using several of them in parallel, or in dataflow, would change this behavior. In fact, we can analyze the execution trace to see the real latency of an SMP task and an FPGA task.

3.1.6 Paraver Trace analysis

• Execute the smp version

NX_ARGS="--summary --instrumentation=extrae" ./smp_dct

• Execute the fpga version with one accelerator, in order to obtain an execution trace.

NX_ARGS="--summary --instrumentation=extrae" ./dct_filter

• Copy the execution traces to the Docker container.

• Run:

wxparaver *.prv

• Use the configuration file smp_and_fpga_tasks.cfg that you will find in the paraver_cfg directory.

• Use the zoom to select the SMP tasks:

• Use the zoom to select the FPGA tasks:

An accelerator is much faster than an SMP task execution. However, there are other overheads, for setting up the FPGA system and the FPGA task executions, that make calling just one dct slower than using the SMP cores.

3.2 Hardware Instrumentation

OmpSs@FPGA provides, by default, a basic instrumentation of the copies and the execution of the accelerators. If instrumentation is activated at execution time, running the instrumented code generates a Paraver trace that you can visualize with the Paraver tool. We have prepared the Makefile to compile the code with this instrumentation: all targets with the -i suffix. In addition, a very recent instrumentation feature of OmpSs@FPGA allows the programmer to introduce user events inside the accelerator to expose possible problems in the code.

3.2.1 Default Instrumentation

The default instrumentation is the one shown in previous sections: it shows the copy of data from the host to the FPGA BRAM, the kernel execution, and the copy of data from the BRAM back to the host.

We have used the configuration file paraver_cfg/smp_and_fpga_tasks.cfg to see the tasks executed in the SMP and in the FPGA.

3.2.2 User Instrumentation

OmpSs@FPGA allows the programmer to insert user instrumentation into the Vivado HLS code of the hardware accelerators. Although there are some limitations on the number of events that an accelerator can emit, it gives the programmer a very useful tool.

We have provided you with an optimized Vivado HLS code of a cross-correlation program. This program may be useful to see whether two tide waveforms are equivalent in various places on Earth. Tides are sea level changes caused by the gravitational influence of the Sun and the Moon.

In order to insert user events, we have registered the event key in the main program using:

nanos_instrument_register_key_with_key(LOOP_I, "delay loop", "Delay loop iteration", true);

This call registers the event key "delay loop" with the label "Delay loop iteration" (which will appear in the Paraver trace) and the key value LOOP_I. The key value can be defined in the application's .fpga.h header file. For instance:

#define LOOP_I 10000

Then, to mark a burst (with a begin and an end) in the accelerator state, we have used the following calls. In the example, we have used burst begin and end to mark the beginning and the end of each delay iteration:

nanos_instrument_burst_begin(LOOP_I, delay+maxdelay);

#For the beginning of the burst

nanos_instrument_burst_end(LOOP_I, delay+maxdelay);

#For the end of the burst

Note that we add maxdelay to delay to avoid negative values in the events. Compile the correlation program using the bitstream-zedboard-i target of the Makefile, copy the binary files and the bitstream to the Zynq system, and execute it using instrumentation. Then, copy back the generated traces.
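The role of the +maxdelay offset can be seen in a host-only sketch of a delay loop. This is not the provided correlation kernel: correlate_sketch is a simplified cross-correlation, and burst_begin_stub and burst_end_stub are local stand-ins that only record the event values, since the real instrumentation calls come from the OmpSs@FPGA toolchain:

```cpp
#include <cassert>
#include <vector>

#define LOOP_I 10000

// Stand-ins for the instrumentation calls: they only record what would be
// emitted, so that the event values can be inspected on the host.
struct event_t { int key; int value; bool begin; };
static std::vector<event_t> recorded_events;
static void burst_begin_stub(int key, int value) { recorded_events.push_back({key, value, true}); }
static void burst_end_stub(int key, int value) { recorded_events.push_back({key, value, false}); }

// Simplified cross-correlation: out[delay + maxdelay] = sum of x[i] * y[i + delay].
void correlate_sketch(const float *x, const float *y, int n, int maxdelay, float *out) {
    for (int delay = -maxdelay; delay < maxdelay; delay++) {
        // delay is negative for half of the iterations, so it is shifted by
        // maxdelay to keep the event value non-negative.
        burst_begin_stub(LOOP_I, delay + maxdelay);
        float acc = 0;
        for (int i = 0; i < n; i++) {
            const int j = i + delay;
            if (j >= 0 && j < n) acc += x[i] * y[j];
        }
        out[delay + maxdelay] = acc;
        burst_end_stub(LOOP_I, delay + maxdelay);
    }
}
```

Each iteration emits one begin and one end event with the same value, which is what lets the trace show the duration of every delay loop iteration.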

To execute it on the board, run these commands:

export EXTRAE_CONFIG_FILE="extrae.xml"

NX_ARGS="--instrumentation=extrae" ./correlation datafiles/ibiz-30.txt datafiles/tarr-30.txt

This will generate the Paraver trace files correlation.prv, correlation.pcf, and correlation.row. Copy those files to the Docker container and run Paraver:

wxparaver correlation.prv

Use the configuration file xcorr_delay_loop.cfg to see the tasks executed in the SMP and in the FPGA. In fact, just one correlation task is executed on each device. The execution trace shows the execution time of each delay loop iteration.

3.3 MxM

3.3.1 Vivado HLS pragmas and OmpSs@FPGA MxM porting

Use Vivado HLS pragmas to:

• Add pipeline to loop j of matmul (which implies a full unroll of loop k of matmul).

• Add array partition to the a and b variables so that there are enough ports to access the data in the pipeline with the unrolled k loop.

• Add inline off to the matmulBlock kernel to better understand the reports.
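A possible placement of these three directives is sketched below. The block size, the element type, and the partition types and factors are assumptions chosen for illustration (cyclic for a and block for b, so that the elements read by the unrolled k loop fall in different banks); the values that fit the board have to be found with the reports:

```cpp
#include <cassert>

typedef float elem_t;          // assumption: element type of the matrices
const unsigned int BSIZE = 8;  // assumption: small block size for illustration

void matmulBlock(const elem_t a[BSIZE*BSIZE], const elem_t b[BSIZE*BSIZE],
                 elem_t c[BSIZE*BSIZE]) {
#pragma HLS INLINE off
    // a[i*BSIZE + k]: consecutive elements, spread over banks cyclically.
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
    // b[k*BSIZE + j]: stride of BSIZE, so each row k goes to its own block.
#pragma HLS ARRAY_PARTITION variable=b block factor=8
    for (unsigned int i = 0; i < BSIZE; i++) {
        for (unsigned int j = 0; j < BSIZE; j++) {
            // Pipelining loop j fully unrolls the k loop below.
#pragma HLS PIPELINE II=1
            elem_t sum = c[i*BSIZE + j];
            for (unsigned int k = 0; k < BSIZE; k++) {
                sum += a[i*BSIZE + k] * b[k*BSIZE + j];
            }
            c[i*BSIZE + j] = sum;
        }
    }
}
```

A larger II or a smaller partition factor trades performance for resources, which becomes relevant when the block size grows.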

3.3.2 Compilation to Bitstream

Run the following command:

VERSION_CODE=FULL_TUTORIAL nohup make bitstream-zedboard-i >& \
    output_bitstream.FULL_TUTORIAL.txt &

3.3.3 Timing and Execution of the OmpSs@FPGA application

Copy the binary files of the OmpSs@FPGA application and the xtasks.config file. Remember that the ARM binary is in the current directory, and the bitstream (.bin) and xtasks.config file are under the autoVivado project directory (see subsection 1.3.4 above).

Do the following steps:

• Load the bitstream to the FPGA (it is just a cat command for the Zynq 7000 family).

• Run the matmul program using the same input size as above for the smp version.

• Observe that the execution time is slightly worse than that of the smp version.

3.3.4 Paraver Trace analysis

• Run the instrumented versions of your 32x32 smp and fpga code:

SMP version


FPGA version

• Copy the traces to the docker container and open them with:

wxparaver name_trace.prv


SMP 32x32 MxM task execution time

FPGA 32x32 MxM task execution time

3.3.5 Tuning MxM application

Based on the previous experiments, we suggest that you:

• Increase the block size of the matrix multiplication in matmul.fpga.h to 64x64 and adapt the II to the resources of the board.

• Compile it to generate the bitstream.

• You may obtain something like this, where the fpga version of matmul is more than 2x faster than the SMP version:

• The trace analysis shows that the 64x64 accelerators take longer per task than the 32x32 accelerators, but the overall communication is smaller.

SMP 64x64 MxM task execution time

FPGA 64x64 MxM task execution time

3.3.6 Blocking MxM application

We have provided you with an MxM version prepared to perform blocking from inside the FPGA. The MxM block function is not optimized with Vivado HLS pragmas. Copy the Vivado HLS pragmas that you have used in previous sections.

Pay attention to the pragma #pragma omp target device(fpga) copy_deps no_localmem_copies num_instances(1) and to the memcpy calls inside the matmulFull_hw function. no_localmem_copies deactivates the generation of the copies before and after the kernel execution of the FPGA task inside the accelerator. Therefore, the user controls those copies (using memcpy, for instance) from host variables to local variables, as shown in the code.

#pragma omp target device(fpga) copy_deps no_localmem_copies num_instances(1)
#pragma omp task in([msize*msize]a, [msize*msize]b) inout([msize*msize]c)
void matmulFull_hw(int msize, elem_t *a, elem_t *b, elem_t *c) {
    unsigned int const m2size = msize*msize;
    unsigned int const b2size = BSIZE*BSIZE;
    // Local copies of one block of each matrix
    elem_t aa[BSIZE*BSIZE];
    elem_t bb[BSIZE*BSIZE];
    elem_t cc[BSIZE*BSIZE];

    for (unsigned int i = 0; i < msize/BSIZE; i++) {
        for (unsigned int j = 0; j < msize/BSIZE; j++) {
            unsigned int const ci = j*b2size + i*BSIZE*msize;
            // Copy in block (i,j) of c
            memcpy(cc, &c[ci], BSIZE*BSIZE*sizeof(elem_t));
            for (unsigned int k = 0; k < msize/BSIZE; k++) {
                unsigned int const ai = k*b2size + i*BSIZE*msize;
                unsigned int const bi = j*b2size + k*BSIZE*msize;
                // Copy in block (i,k) of a and block (k,j) of b
                memcpy(aa, &a[ai], BSIZE*BSIZE*sizeof(elem_t));
                memcpy(bb, &b[bi], BSIZE*BSIZE*sizeof(elem_t));
                matmulBlock(aa, bb, cc);
            }
            // Copy out the accumulated block of c
            memcpy(&c[ci], cc, BSIZE*BSIZE*sizeof(elem_t));
        }
    }
}
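The offsets ci, ai, and bi in the code above imply that each BSIZE x BSIZE block is stored contiguously, one block row after another. The helper below, whose name block_offset is introduced here only for illustration, reproduces that index computation so it can be checked on small sizes:

```cpp
#include <cassert>

// Offset of the first element of block (bi, bj) in a matrix of msize x msize
// elements stored block by block, with blocks of bsize x bsize elements.
// Same formula as ci = j*b2size + i*BSIZE*msize in matmulFull_hw.
unsigned int block_offset(unsigned int bi, unsigned int bj,
                          unsigned int msize, unsigned int bsize) {
    assert(bsize > 0 && msize % bsize == 0);
    return bj * bsize * bsize + bi * bsize * msize;
}
```

For a 4x4 matrix with 2x2 blocks, the four blocks start at offsets 0, 4, 8, and 12, so each memcpy of BSIZE*BSIZE elements moves exactly one block.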

Compile the code and run it on the board. In our case, using the same Vivado HLS directives for the MxM block function (with 64x64 blocks), the execution time for a 256x256 matrix has been reduced from 0.17 seconds in the very first accelerated version to 0.0267 seconds, which is a 6.4x speedup.

MxM 32x32 block acceleration

MxM 64x64 block acceleration with FPGA blocking