OmpSs@FPGA 1-day PACT Course Guide
OmpSs@FPGA Team
May 23, 2019
Index
Index 1
1 Course Environment 2
1.1 Hard Disk: Ubuntu system and code . . . 2
1.2 Docker: Cross-compilation environment . . . 2
1.2.1 Login in the container . . . 2
1.2.2 Sharing data between host Ubuntu system and docker cotainer . . . 3
1.2.3 Cross-compiling. . . 3
1.2.4 Vivado HLS analysis . . . 4
1.3 Zynq Board System . . . 5
1.3.1 Boot from SD. . . 5
1.3.2 Minicom Connection . . . 5
1.3.3 Ethernet connection . . . 6
1.3.4 Application Execution . . . 7
1.4 Shutdown . . . 7
2 Case of Study: Matrix Multiply 8 2.1 OmpSs MxM . . . 8
2.1.1 SMP Performance Analysis . . . 8
2.2 OmpSs@FPGA MxM porting . . . 8
2.2.1 HLS Analysis of the Naive Version . . . 9
2.2.2 HLS Optimization and Analysis. . . 12
2.3 Execution Analysis . . . 13
2.3.1 Full Compilation and Execution . . . 13
2.3.2 Paraver Trace analysis . . . 13
2.4 Optimizing MxM application . . . 15
2.4.1 Tunning MxM application . . . 15
2.4.2 FPGA Blocking MxM application . . . 16
2.5 Hardware Instrumentation. . . 17
2.5.1 Default Instrumentation . . . 18
2.5.2 User Instrumentation . . . 18
1
Course Environment
1.1 Hard Disk: Ubuntu system and code
Ubuntu 16.04 running in the bootable Hard disk is prepared with a docker container that will help you to cross-compile to the Zynq 7000 (zedboard). Next section explains how to access to this docker container and copy from/to files when you compile. Also, this ubuntu has the minicom application to connect via serie with the board (explained later). You can also copy data to the Zedboard using ssh (explained later).
1. To Boot from the hard disk you may have to push the function key (usually F9/F12) to make BIOS appear.
2. Boot from the external hard disk: BSC-Training user, password: bsc.
3. Original source code will be provided to you by USB .
1.2 Docker: Cross-compilation environment
1.2.1 Login in the container
1. Docker container has been already created.
2. Run: sudo docker ps -a . Although it may change the release of the package, the command output should be the list of active docker containers:
3. Look for the name of the container version ompss-at-fpga 1.3.1. Run: sudo docker start -i -a name container
4. You can exit the container running exit. To login in again can repeat previous command.
1.2.2 Sharing data between host Ubuntu system and docker cotainer
1. At /home/bsc docker container directory you have access to the Ubuntu 16.04 host file system (or the directory you have chosen if you run the docker image to create a new container).
2. You can cp from/to host directories to/from contanier directories, or use directly a host directory.
1.2.3 Cross-compiling
1. Makefile and course code are prepared to compile and generate a OmpSs binary and the bitstream with the accelerator specified by the programmer.
2. We advise you to open the Makefile’s to figure out the target and compilation commands we are going to use.
• In order to estimate the performance of your accelerators before starting the time consuming process of generating the hardware, you can target: VERSION CODE=NAME make toDesign- zedboard . This Makefile target specifies to our OmpSs@FPGA toolchain the compilation flag --to step=design, which will make the compilation to stop after the Vivado HLS project and the Vivado design is generated. Later we will explain how to analyze the estimated performance. In addition, we are compiling with hardware instrumentation so that we can generate and analyze execution traces.
• VERSION CODE is an environment variable that can be defined at command line, and you can use #ifdef in the source code to activate or deactivate parts of the source code. Be careful with the defines you use or you can change the meaning of keywords or variables in your code.
Usually it is good to use capital letters.
3. In order to partially or completely compile to generate the bitstream, the computer should be connected to the network (wireless or not). Then, you should have access to eduroam. Otherwise, ask us for help.
4. Compilation may take a while. Maybe you can use nohup to run the compilation so that you an continue working in the following exercise meanwhile it is compiling.
5. Remember:
• For partial compilation:
VERSION_CODE=PROGRAM_TUTORIAL nohup make toDesign-zedboard >& \ output_toDesign.PROGRAM_TUTORIAL.txt &
During the tutorial, we will also open the Vivado HLS project, modify it and export the new IP. In that case, we will have to open the Vivado Project created with the previous command so that we update the new IP release. Make a backup of the name.xtasks.config file that you will find at zedboard PROGRAM TUTORIAL/name autoVivado/. Then, you will have to run this another partial command to finally generate the bitstream:
VERSION_CODE=PROGRAM_TUTORIAL nohup make fromSyn-zedboard >& \ output_fromSyn.PROGRAM_TUTORIAL.txt &
In both cases you can run:
tail -f output.txt
to observe the progress of the compilation.
• For complete compilation:
nohup make bitstream-zedboard >& output_PROGRAM_TUTORIAL.txt &
1.2.4 Vivado HLS analysis
1. After partial compilation upto step Design, you can analyze the vivado hls project.
VERSION_CODE=PROGRAM_TUTORIAL nohup make toDesign-zedboard >& output_PROGRAM_TUTORIAL.txt &
2. Once this command above has been executed, the Vivado HLS project can be open using the following command:
vivado_hls -p zedboard_PROGRAM_TUTORIAL/name_autoVivado/Vivado_HLS/accelerator_name 3. Once it is open, we can select the Analysis Perspective on top-right corner. There, we can look for
the function accelerated to analyze the latency and interval information, in addition to resource requirements.
4. One important information that we can see in this view is the II one can achieve per iteration.
At bottom left, you can see the names of the different loops in the original code. In this case, the names are automatically given by Vivado HLS. However, you can add labels to your code, and those labels will appear here.
5. In the case of the corr fpga kernel shown in the picture there is not pipeline at all, and the latency we pay per iteration is 16 cycles per interation!!!
6. At top right, you can also observe the cycles per iterations that each loop has, and the high level pnemotecnic of the operations.
• Yellow: indicates overall latency per iteration
• Red: indicates conflicts. You can click the red cycles to see where the conflits are.
1.3 Zynq Board System
1.3.1 Boot from SD
1. BE SURE your board is turned off 2. Insert the microSD card in the board 3. Connect the board to the power supplier
4. Connect the microUSB side of the microUSB-USB cable to UART connection.
5. Connect the USB side of the microUSB-USB cable to the computer.
6. Turn on the board
7. Run dmesg in your computer and look for something like: ”.... ttyACM0: USB ACM device...
” or ”.... ttyUSB0 .... ” . This is what you have to use to setup the minicom port.
1.3.2 Minicom Connection
1. Assuming that you have seen ttyUSB1 in the dmesg log, run ”sudo minicom -D /dev/ttyUSB1 -b 115200”
2. At login use ubuntu:ubuntu .
3. The Zynq System is also running an Ubuntu 16.04.
4. Look at the date. If this is not the current one, you can run sudo date -s "DD MMM YYYY HH:MM:SS" so that you can set the current day and hour. DD stands for Day, MMM stands for the first three letter of the month, YYYY stands for the year, and HH:MM:SS stands for hour, minutes, seconds.
1.3.3 Ethernet connection
You can access through ethernet connection using ssh to connect to the board or scp to copy data to/from the host. This will be useful to transfer files between host Ubuntu system and Zynq system.
1. First, from the minicom connection run sudo ifup eth0. Usually, it is 10.42.0.220.
• Login: The IP obtained in previous steps is the one you can use to ssh or scp.
• Assuming IP: 10.42.0.220 run ssh -X [email protected] if you want to remotely login
• Copy host to zedboard: scp files [email protected]:path
• Copy zedobard to host: scp [email protected]:path/file
1.3.4 Application Execution
1. First, compile the application in the Docker container (smp or fpga version of the code):
2. Then, transfer the following files from the Host system to the Zynq system:
• For only SMP or heterogeneous OmpSs@FPGA application: copy the ARM binary file, at the current compilation directory.
• For applications with FPGA acceleration:
– Bitstream (.bin) : this file should be under the zedboard PROGRAM VERSION/name autoVivado/
and it is the bitstream that we have to write in the FPGA device.
– name.xtasks.config : this file should be under the zedboard PROGRAM VERSION/name autoVivado/
and it is the configuration runtime file that is necessary when running the program. For partial compilation, the one you have done a backup before the second partial to end compilation. For complete compilation, the one you will find in the directory.
3. Login into the Zynq System 4. Do the following steps:
• Run: cat bitstream.bin > /dev/xdevcfg
• Run for non-instrumented:
– Run:
NX_ARGS="--summary" ./program arguments
• Run for instrumented:
– Only once: export EXTRAE CONFIG FILE=extrae.xml – Be sure you have extrae.xml file in the directory. Then run:
NX_ARGS="--summary --instrumentation=extrae" ./program arguments
1.4 Shutdown
1. Correctly shutdown the machine with sudo halt or sudo shutdown now.
2
Case of Study: Matrix Multiply
In this section you will port an OmpSs MxM code to OmpSs@FPGA (meaning you will target fpga tasks) with no Vivado HLS optimizations. Then, you will analyze it and add HLS pragmas to optimize it.
2.1 OmpSs MxM
2.1.1 SMP Performance Analysis
We ask you to follow next steps:
• The code is already prepared with the ompss pragmas for a smp application. Compile it to generate a smp version that we can execute and test the correct result. Copy the source files of the MxM to a directory in the docker container and, within the docker at that directory, run:
make smp-zedboard
Previous command will generate a ARM binary. Rename it to smp matmul.
• Boot the Zynq system. We assume you will be connected using serial and ssh. We can copy the binary of previous step to the Zynq system using scp and, from inside the Zynq syste, run it:
• There are other mechanisms to analyze the execution time and the most time consuming parts of the code using perf, oprofile, papi which are profiling tools that use the hardware counters of the ARM processor. Also, compiling with -pg will allow you to do profiling with gprof; in this case this tool use sampling and signals to do the profiling. However, we are not going to do that due to time constraints.
2.2 OmpSs@FPGA MxM porting
In general OmpSs@FPGA porting consists on:
• Constants and global variables have to be moved to includes with .fpga.h extensions. Const array/matrix variables should be also moved to the .fpga.h files.
• There should be a call to the accelerated kernel from the same source code, otherwise, there will not be hardware generation.
• Parameters of the top function on OmpSs@FPGA uses m axi protocols, and provoke errors when using structs with different field characteristics (to be analyzed). Those fields have been separated in different parameters.
• Dimension and Size information at pragmas should be const type values.
In our case, the most time consuming function of the MxM application we have is matmulBlock hw.
We ask you to edit matmul.cpp code and add the target device(fpga) pragma and the necessary clauses to do the copies. We only want one accelerator instance of the matmulBlock and data dependences been copied in and out of the fpga task.
2.2.1 HLS Analysis of the Naive Version
In this section we are going to analyze the hardware solution that is generated from a naive HLS version (no HLS optimization pragmas).
We ask you:
• Run following command to generate the Vivado HLS project and create the Vivado design:
VERSION_CODE=PROGRAM_TUTORIAL nohup make toDesign-zedboard >& \ output_toDesign.PROGRAM_TUTORIAL.txt &
• Once it has finished open the Vivado HLS project running this command (adapted to the program name and accelerator name):
vivado_hls -p zedboard_PROGRAM_TUTORIAL/name_autoVivado/Vivado_HLS/accelerator_name
• Open the Analysis Perspective and extend the loop i matmul, loop j matmul and loop k matmul.
• You can do right-click to go to the source code associated to one of the operations.
• In order to figure out the details of each of the loops you can look at Performance Profile and Module Hierarchy, and the tripcount on top right panel.
Zoom in of the Module Hierarchy
Is there any of the loops with pipeline? which is the overall latency of the loops?
• Regarding to the resources, you can look at the Module Hierarchy and Resource Profile
and, on the other hand, the resource panel of the top-right view. This also shows where the memory ports are needed.
2.2.2 HLS Optimization and Analysis
In this section we suggest to add some HLS directives to the MxM kernel code to improve the overall performance of the hardware accelerator of the kernel and the application. Then, we can analyze which is the estimated performance gain.
1. In order to improve the performance of the kernel we ask you to add the following directives:
• #pragma HLS PIPELINE II=1. Add it to the loop j matmul. With this, we expect to achieve the following objectives:
(a) An initiation interval of 1 in the loop j matmul loop,
(b) Full unroll of the loop k matmul, parallelizing all the computation in loop k matmul (also using more resources)
(c) Collapse of loop i matmul and loop j matmul, reducing the overall overhead due to control flow of the loop iterations.
IMPORTANT NOTE: Maybe II=1 will require a lot of hardware resources. In- crease it to II=2 or II=4, based on the analysis you do.
• #pragma HLS ARRAY PARTITION variable=name type factor=FACTOR VALUE. Add it to a and b variable so that we can have enough ports to access the data during the pipeline, and the unrolled k loop.
• #pragma HLS INLINE OFF. Add inline off to the matmulBlock kernel to better understand the reports.
2. In order to analyze the HLS performance impact we suggest you to follow the same steps that you follow at Section2.2.1: 1) run up to design generation using our toolchain, 2) run vivado hls withe project generated in point 1), and analyse the performance of the loops of the code.
We suggest to spend some time in this section until you are happy with the optimization achieved.
2.3 Execution Analysis
2.3.1 Full Compilation and Execution
In this section you are going to fully compile to obtain the ARM binary and the bitstream with the accelerators of the application you are going to use during the application execution.
We ask you:
• Run following command to obtain the bitstream and binary to run the accelerated application:
VERSION_CODE=FULL_TUTORIAL nohup make bitstream-zedboard >& \ output_bitstream.FULL_TUTORIAL.txt &
• Copy the binary files of the OmpSs@FPGA application and xtasks.config file. Remember, ARM bi- nary is in the current directory and bitstream (.bin) and xtasks.config file are under the autoVivado project directory (look above in the document, subsection1.3.4).
• In order to run the application in the Zynq system, do the following steps:
– Load the bitstream to the FPGA (it is just a cat command for the Zynq 7000 family).
– Run the matmul program using the same input size than above for the smp version.
– Observe that the execution time is slightly worst than the one for the smp version.
2.3.2 Paraver Trace analysis
One way to understand the execution time of the applications is to analyze the execution trace. OmpSs@FPGA allows us to analyze the smp and fpga application executions.
• Run the instrumented versions of your 32x32 smp and fpga code:
SMP version
FPGA version
• Copy the traces to the docker container and open them with:
wxparaver name_trace.prv
• Use the configuration file smp and fpga tasks.cfg that you will find in paraver cfg directory.
• In our case, for the 32x32 kernels we can see that accelerators are faster but they are so small that the overhead of executing them overcomes the overall performance gain.
SMP 32x32 MxM task execution time
FPGA 32x32 MxM task execution time
2.4 Optimizing MxM application
2.4.1 Tunning MxM application
Based on the previous experiments we suggest you:
• Increase the block size of the matrix multiplication in matmul.fpga.h to 64x64 and adapt the II to the resources of the board.
• Compile it to generate the bitstream.
• You may obtain something like this, where the fpga version matmul is more than 2x faster thant the SMP version:
• You can execute again with instrumentation and do trace analysis. The trace analysis shows that the accelerators spend more time than the 32x32 accelerators but the overall communication is smaller.
SMP 64x64 MxM task execution time
FPGA 64x64 MxM task execution time
2.4.2 FPGA Blocking MxM application
We have provided you with a MxM version prepared to perform blocking from inside the FPGA. The MxM block function is not optimized with Vivado HLS pragmas. Copy the Vivado HLS pragmas that you have used in previous sections.
Pay attention to the the pragma #pragma omp target device(fpga) copy deps no localmem copies num instances(1) and memcpy call functions inside the matmulFull hw function. no localmem copies deactives the generation of copies before and after the kernel execution of the FPGA task inside the accelerator. Therefore, the user has the control of doing those copies (using memcpy for instance) from host variables to local variables, as shown in the code.
#pragma omp target device(fpga) copy_deps no_localmem_copies num_instances(1)
#pragma omp task in([msize*msize]a, [msize*msize]b) inout([msize*msize]c) void matmulFull_hw(int msize, elem_t *a, elem_t *b, elem_t *c) {
unsigned int const m2size = msize*msize;
unsigned int const b2size = BSIZE*BSIZE;
elem_t aa[BSIZE*BSIZE];
elem_t bb[BSIZE*BSIZE];
elem_t cc[BSIZE*BSIZE];
for (unsigned int i = 0; i < msize/BSIZE; i++) {
for (unsigned int j = 0; j < msize/BSIZE; j++) { unsigned int const ci = j*b2size + i*BSIZE*msize;
memcpy(cc, &c[ci], BSIZE*BSIZE*sizeof(elem_t));
for (unsigned int k = 0; k < msize/BSIZE; k++) { unsigned int const ai = k*b2size + i*BSIZE*msize;
unsigned int const bi = j*b2size + k*BSIZE*msize;
memcpy(aa, &a[ai], BSIZE*BSIZE*sizeof(elem_t));
memcpy(bb, &b[bi], BSIZE*BSIZE*sizeof(elem_t));
matmulBlock(aa, bb, cc);
}
memcpy(&c[ci], cc, BSIZE*BSIZE*sizeof(elem_t));
}
} }
Compile the code, and run it in the board. In our case, using the same Vivado HLS directives for the MxM Block function (with 64x64 blocks), the execution time of a 256x256 matrix size has been reduced from 0.17 seconds of the very fist version of acceleration to 0.0267, which is a 6.4x speedup.
MxM 32x32 block acceleration
MxM 64x64 block acceleration with FPGA blocking
2.5 Hardware Instrumentation
OmpSs@FPGA toolchain provides a default basic instrumentation of copies and kernel execution of the accelerators. The execution of this instrumented code, in case of activating the instrumentation at execution time, will generate a Paraver trace that you can visualize with the Paraver tool. We have prepared the Makefile to compile the code to have this instrumentation. Also, there is a very recent instrumentation feature in our OmpSs@FPGA that allows the programmer to introduce user events inside the accelerator to show up possible problems in the code.
2.5.1 Default Instrumentation
The default instrumentation is the one we have shown in previous sections where it is shown the copy of data from the host to the FPGA BRAM, the kernel execution, and then, the copy out of data from BRAM to Host.
We have used the configuration file paraver cfg/smp and fpga tasks.cfg to see the tasks executed in the SMP and in the FPGA.
2.5.2 User Instrumentation
OmpSs@FPGA allows the programmer to insert user instrumentation to the code of the hardware accel- erators. Although there are some limitations on the number of events to be evicted by the accelerator, it provides to the programmer with a very useful tool.
We have provided you with an FPGA blocking Matrix Multiply program with user instrumentation.
This code has no Vivado HLS pragmas. You should include the Vivado HLS directives of previous exercise.
User instrumented code for FPGA should first register the user events in the main program using nanos instrument register key with key. In our case, we have registered three events. For instance:
nanos_instrument_register_key_with_key(COPY_C, "Copy_C", "Copy C Data", true);
This call register event key ”Copy C” with label ”Copy C Data” (what will appear in the Paraver trace), and key value COPY C. The key value ca be defined in the ”.fpga.h” header application file. To avoid event value conflicts, we suggest to use values larger than 10000. For instance:
#define COPY_C 10000
Then, in order to mark a burst (with begin and end) in the accelerator state, use the following calls:
nanos instrument burst begin and nanos instrument burst end. In the example, we can use burst begin and end to determine the beginning and the end of a copy of C, the copy of A and B, and the computation of a tile:
#For the beginning of the burst
nanos_instrument_burst_begin(COPY_C, j+1);
memcpy(cc, &c[ci], BSIZE*BSIZE*sizeof(elem_t));
#For the beginning of the burst
nanos_instrument_burst_end(COPY_C, j+1);
Note that we add +1 to j variable to avoid value 0 in the events (also, try to avoid negative values).
Compile the program using the bitstream-zedboard target of the Makefile, copy the binary files and bitstream to the Zynq system, execute using instrumentation and copy back the traces generated.
In order to execute it in the board, run this command:
export EXTRAE_CONFIG_FILE="extrae.xml"
NX_ARGS="--instrumentation=extrae" ./matmul 128 0
This will generate a Paraver trace matmul.prv,pcf,raw. Copy those files to the docker container and run Paraver:
wxparaver matmul.prv
Use the configuration file smp and instr fpga tasks.cfg to see the tasks executed in the SMP and in the FPGA. In fact, just one correlation task is executed in the device.