www.bsc.es
OmpSs@FPGA
PATC-course
BSC OmpSs@FPGA team
Universitat Politècnica de Catalunya, and Barcelona Supercomputing Center
March 11th , 2021
Agenda
2
OmpSs@FPGA Overview + Vivado HLS usage
● OmpSs@FPGA :
Basic Programming, Advanced Prog, Compilation and Execution
● Vivado HLS optimizations
Directives (most common) and environment Break:
● Demo/Tutorial Environment + Quizz :-):
MxM OmpSs@FPGA porting + Vivado HLS opt
MxM tuning (increase number of accelerators vs Tile size) MxM Blocking (using non caching clauses)
Other versions
● OmpSs@FPGA New features
● Software and Environment support OmpSs@FPGA Docker container Contact us @ [email protected]
Agenda
OmpSs@FPGA Overview + Vivado HLS usage
● OmpSs@FPGA :
Basic Programming, Advanced Prog, Compilation and Execution
● Vivado HLS optimizations
Directives (most common) and environment Break:
● Demo/Tutorial Environment + Quizz :-)::
MxM OmpSs@FPGA porting + Vivado HLS opt
MxM tuning (increase number of accelerators vs Tile size) MxM Blocking (using non caching clauses)
Other versions
● OmpSs@FPGA New features
● Software and Environment support OmpSs@FPGA Docker container Contact us @ [email protected]
Question: Have you
used FPGAs before?
What is a FPGA? And Zynq?
What is a FPGA?
FPGA stands for Field Programmable Gate Array
6
What is a FPGA?
FPGA stands for Field Programmable Gate Array
Programming:
HDL vs HLS
What is Zynq? Zynq-7000 – 32-bits
8
What is Zynq? Zynq-7000 – 32-bits
Programming:
HDL vs HLS And
Heterogeneous?
Zynq 702-board
OmpSs@FPGA Model
• StarSs family key concepts
• Sequential task-based program
• View of a single address/name space
• Execution in parallel: automatic
• runtime computation of dependencies
• Productivity and portability
• OmpSs is used for prototyping extensions to OpenMP
• Tasking
• data-dependences
• Heterogeneity
• Multiple address spaces
• Mercurium compiler
• …• Nanos runtime system
StarSs
OmpSs COMPSs
PyCOMPSs
@ SMP @ GPU @ FPGA @ Cluster
Task-based programming
@ CloudFPGA
Execution Model (OmpSs + accelerators)
Global thread team created on startup
─ Master starts main task (also executes)
─ N workers execute tasks
─ One representative per device (accelerator)
All threads get work from a task pool
─ Accelerator kernels become tasks
─ Tasks are labeled with (at least) one target device
» smp (default), cuda, opencl, fpga
─ Scheduler decides which task to execute
─ Tasks may have several targets (and we can have special versions using implements)
Task pool Global thread team
12
www.bsc.es
Easy to Program
Programming with OmpSs@FPGA
#include “scale.fpga.h”
void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . .
#pragma omp taskwait }
#include “scale.fpga.h”
void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . .
#pragma omp taskwait }
main.c
const int SIZE=1024;
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);
const int SIZE=1024;
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);
scale.fpga.h
14 –
Dimension and Size pragmas
should be const type
–
Constants and global variables needed by a FPGA task may have to be moved to .fpga.h includes (C++). Functions used by two different accelerators too.
–
There should be at least one call to the accelerated kernel (FPGA task code) in the same source code
–
Parameters can be structs (w/
some limitations… better not)
OmpSs@FPGA Model Details
Very Simple Code: Execution Model
• There are a pool of threads to run tasks
• Runtime decides
» When a thread executes a task (it should be ready at task pool)
» Where to execute a task based on user annotations
» Which copies has to be done
#include “scale.fpga.h”
void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . .
#pragma omp taskwait }
#include “scale.fpga.h”
void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . .
#pragma omp taskwait }
main.c
const int SIZE=1024;
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);
const int SIZE=1024;
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);
scale.fpga.h
scale_taskscale_task
Task pool Global thread team
16
Execution Model (Data Transfer): scale
copy_in
copy_out c c
a a
c a
b b
b Start tim e Accelera of tor
Fin ish ac ce ler ato r
at CMA ARM
Sequential Example: Tiled MxM
18
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)
matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
bs bs NB_J
NB_I
OmpSs Example: Tiled MxM
#pragma omp task in([BS]a,[BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)
matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
#pragma omp taskwait // Or
// other tasks depending on input output c of matrix multiply task
bs bs NB_J
NB_I
OmpSs Example: Tiled MxM (1 acc)
20
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([BS]a,[BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)
matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
#pragma omp taskwait // Or
// other tasks depending on input output c of matrix multiply task
bs bs NB_J
NB_I
20
FPGA
Mercurium compiler with AIT tool
─ Few extensions (num_instances)
─ Triggers the bitstream generation automatically (
Stub function generated)
OmpSs Example: Tiled MxM (2 acc)
21
FPGA
BS NB_J
NB_I
BS
BS BS
#pragma omp target device(fpga) copy_deps
num_instances(2)
#pragma omp task in([BS]a,[BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)
matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
#pragma omp taskwait // Or
// other tasks depending on input output c of matrix multiply task
Mercurium compiler with AIT tool
─ Few extensions (num_instances)
─ Triggers the bitstream generation automatically (
Stub function generated)
OmpSs Example: MxM kernel + Vivado HLS
Mercurium compiler with AIT tool
─ Few extensions (num_instances)
─ Triggers the bitstream generation automatically (
Stub function generated)
22
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([BS]a,[BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
bs bs NB_J
NB_I
22
FPGA
HP, Sant Cugat, January 30th, 2018 void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {
#pragma HLS inline
int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix
for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1 float sum = 0;
for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];
c[ia][ib] += sum;
}
}
void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {
#pragma HLS inline
int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix
for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1 float sum = 0;
for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];
c[ia][ib] += sum;
}
}
OmpSs Example: MxM kernel + Vivado HLS
bs bs NB_J
NB_I
23
FPGA
void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {
#pragma HLS inline
int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix
for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1 float sum = 0;
for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];
c[ia][ib] += sum;
}
}
void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {
#pragma HLS inline
int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix
for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1 float sum = 0;
for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];
c[ia][ib] += sum;
}
}
Mercurium compiler with AIT tool
─ Few extensions (num_instances)
─ Triggers the bitstream generation automatically (
Stub function generated)
void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {
#pragma HLS inline
int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix
for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1 float sum = 0;
for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];
c[ia][ib] += sum;
}
}
void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {
#pragma HLS inline
int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix
for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1 float sum = 0;
for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];
c[ia][ib] += sum;
}
}
Xilinx Zynq Support ++
OmpSs FPGA Support (present and near future)
24
Zynq-7000 Family
2x Cortex-A9 cores + FPGA (32-bit platforms)
SECO AXIOM Board Zynq U+ XCZU9EG-ES2
Trenz Electronics Zynq U+
TE0808 XCZU9EG-ES1
4x Cortex-A53 cores + FPGA (64-bit platforms)
ZCU102 Board
XCZU9EG-2FFVB1156
Discrete FPGAs like
Alveo and Cloud
www.bsc.es
Advanced OmpSs and OmpSs@FPGA
Implements & versioning: Device Tuning
Example: One single task & two different implementations
─ Scheduler decides (at runtime) which version to execute
» On resource availability (first thread requesting work)
» Smart scheduling (shortest execution time)
scale_taskscale_task
#include “scale.h”
#include “scale.fpga.h”
#define SIZE 100
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . . }
#include “scale.h”
#include “scale.fpga.h”
#define SIZE 100
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . . }
main.c
#pragma omp target device(smp) no_copy_deps implements(scale_task)
#pragma omp task in([SIZE]c,a) out([SIZE] b)
void scale_task_smp(double *b, double *c, double *a);
#pragma omp target device(smp) no_copy_deps implements(scale_task)
#pragma omp task in([SIZE]c,a) out([SIZE] b)
void scale_task_smp(double *b, double *c, double *a);
void scale_task_smp(double *b, double *c, double *a){
double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[jj;} // or special code void scale_task_smp(double *b, double *c, double *a){
double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[jj;} // or special code
scale.h
scale.c
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);
void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }
scale.fpga.h
scale.c
Task pool Global thread team
26
Implements & versioning: Device Tuning
Example: One single task & two different implementations
─ Scheduler decides (at runtime) which version to execute
» On resource availability (first thread requesting work)
» Smart scheduling (shortest execution time)
scale_taskscale_task
#include “scale.h”
#include “scale.fpga.h”
#define SIZE 100
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . . }
#include “scale.h”
#include “scale.fpga.h”
#define SIZE 100
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . . }
main.c
#pragma omp target device(smp) no_copy_deps implements(scale_task)
#pragma omp task in([SIZE]c,a) out([SIZE] b)
void scale_task_smp(double *b, double *c, double *a);
#pragma omp target device(smp) no_copy_deps implements(scale_task)
#pragma omp task in([SIZE]c,a) out([SIZE] b)
void scale_task_smp(double *b, double *c, double *a);
void scale_task_smp(double *b, double *c, double *a){
double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code void scale_task_smp(double *b, double *c, double *a){
double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code
scale.h
scale.c
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);
void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }
scale.fpga.h
scale.c
Task pool Global thread team
Implements & versioning: Device Tuning
Example: One single task & two different implementations
─ Scheduler decides (at runtime) which version to execute
» On resource availability (first thread requesting work)
» Smart scheduling (shortest execution time)
scale_taskscale_task
#include “scale.h”
#include “scale.fpga.h”
#define SIZE 100
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . . }
#include “scale.h”
#include “scale.fpga.h”
#define SIZE 100
int main (int argc, char *argv[]) {
for (. . . ) {
scale_task(B,C,&alpha);
}. . . }
main.c
#pragma omp target device(smp) no_copy_deps implements(scale_task)
#pragma omp task in([SIZE]c,a) out([SIZE] b)
void scale_task_smp(double *b, double *c, double *a);
#pragma omp target device(smp) no_copy_deps implements(scale_task)
#pragma omp task in([SIZE]c,a) out([SIZE] b)
void scale_task_smp(double *b, double *c, double *a);
void scale_task_smp(double *b, double *c, double *a){
double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code void scale_task_smp(double *b, double *c, double *a){
double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code
scale.h
scale.c
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);
void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;
for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }
scale.fpga.h
scale.c
Task pool Global thread team
28
OmpSs@FPGA : Data Movement & Blocking
FPGA Blocking....
… data locality
optimizations, etc ...
Mercurium will not
generate memory copies
inside the accelerator,
however the Runtime will
copy the data to accesible
FPGA memory
Execution Model (Data Transfer): scale
30
copy_in
copy_out
Wrapper Local Memory Copies
const int SIZE=1024;
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);
const int SIZE=1024;
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);
c c
a a
c a
b b
b
Execution Model (Data Transfer): scale
copy_in
copy_out
Computation Logic Local Memory Copies
const int SIZE=1024;
#pragma omp target device(fpga) copy_deps no_localmem_copies num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);
const int SIZE=1024;
#pragma omp target device(fpga) copy_deps no_localmem_copies num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);
c c
a a
b b
memcpy BRAM
COMPUTATION LOGIC
FPGA Blocking: Mercurium helps
32
FPGA Blocking....
… data locality
optimizations, etc ...
Arguments with no local copies
● Addr is automatically passed in
● Addr transparent access to SMP Memory using AXI ports
● Transparent to the Programmer!
● Be careful with memcpy !!!
FPGA Blocking: User should copy data
FPGA Blocking....
… data locality
optimizations, etc ...
Arguments with no local copies
● Addr is automatically passed in
● Addr transparent access to SMP Memory using AXI ports
● Transparent to the Programmer!
● Be careful with memcpy !!!
www.bsc.es
OmpSs@FPGA Ecosystem: Integrating
FPGA compilation into OmpSs
OmpSs@FPGA Ecosystem: AIT
OmpSs transformations
─ On full source
» Vivado HLS and Vivado
AIT
OmpSs Application
Mercurium
GCC
Nanos
xtasks lib
Extrae
OmpSs.elf bitstream
SMP FPGA
OS (petaLinux) +
BOOT.bin + image.ub + ompssatfpga driver
OmpSs phase FPGA phase
Wrapper and Code generation
Vivado HLS
Hardware generation
Netlist
OmpSs Manager Host code + Nanos++ calls
Instrumentation
Interconnect
Vivado
Netlist
Compilation
36
Few details on the steps to compile
─ Compile: use fpgacc
» [cross-compile-]fpgacc --ompss -o program program.c
» [cross-compile-]fpgacc --ompss --instrumentation -o program program.c
─ Details...
» Compiling options:
--bitstream-generation
» Linking options:
--Wf,”--board=$(BOARD_NAME),--clock=100,--hwruntime=som”
--Wf,”-v,--name=vivado_project_name,--dir=$(VIVADO_WORKSPACE)”
Execution
Few details on the steps to run the application
─ Environment:
●
Download the bitstream:
–
Zynq 7000 family: cat program.bin > /dev/xdevcfg
─ Run Program: ./program <input arguments>
» Runtime Application Configuration:
–
program.xtasks.config or xtasks.config
» With instrumentation:
–
export EXTRAE_CONFIG_FILE=”extrae.xml”
–
NX_ARGS=”--instrumentation=extrae” ./program <input>
www.bsc.es
Vivado HLS Optimizations
OmpSs Example: Vivado HLS
void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {
#pragma HLS inline
int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix
for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1 float sum = 0;
for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];
c[ia][ib] += sum;
}
}
void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {
#pragma HLS inline
int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix
for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1 float sum = 0;
for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];
c[ia][ib] += sum;
}
}
Overall Overview*
* Some information and pictures are extracted from the Xilinx University Programm Vivado HLS Presentations 40
Loop Unroll
Option 1: one iteration per cycle
Option 2: several iterations per cycle in sequential
Option 3: several iterations per cycle in parallel
Be careful with the needed resources Memory conflicts!!
Pipeline
42
Several instances of the hardware can operate in pipleine
Initiation Interval: Number of cycles before you can start a
new instance of the hardware
Pipeline: Resources vs Parallelism
Looking for pipeline
● With few resources (Left)
Looking for some parallelism and pipeline:
● Full unroll of the inner-most loop. Resources needed and throughput of loop L1 (Middle)
Looking for high parallelism
● Full unroll of L1 and L2. Too much resources!!! (Right)
Pipeline: Initiation Interval Limitations
44
Constant Bounds help…
● Example: comparison due non known limit…
Memory Accesses can have conflicts
● Arra Partition can help
Exploring Data Conflicts: Array Synthesis
Memory Accesses can have conflicts
● Arra Partition can help
Memory is synthesized (usually) to a BRAM by default
Array Partitioning
46
Memory Accesses can have conflicts
● Arra Partition can help
● #pragma HLS ARRAY_PARTITION variable=name dim=X type factor=FACTOR
Array Partitioning: Array Dimensions
www.bsc.es
Vivado HLS Analysis
Vivado HLS: Main views
Vivado HLS: Performance analysis
50
Vivado HLS: Resources
www.bsc.es
Break!
Really?! :-)
www.bsc.es
Step by step Porting
Example: Tiled MxM
54
void matmulBlock_hw(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<BSIZE; i_b++) for (j_b=0; j_b<BSIZE; j_b++) for (k_b=0; k_b<BSIZE; k_b++)
matmulBlock_hw(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
It is better to port it to pointer, as it
happen in CUDA … GPU porting...
Step 1: SMP Porting of Tiled MxM w/ pointers
…
#define BSIZE 32
void matmulBlock_hw(T *a,T *b,T *c);
…
for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {
unsigned int const ci = j*b2size + i*BSIZE*msize;
for (unsigned int k = 0; k < msize/BSIZE; k++) {
unsigned int const ai = k*b2size + i*BSIZE*msize;
unsigned int const bi = j*b2size + k*BSIZE*msize;
matmulBlock_hw(&a[ai], &b[bi], &c[ci]);
} } }
…
Question: What ompss pragma/s should you
should use to have a correct SMP OmpSs
version?
Step 1: SMP Porting of Tiled MxM w/ pointers
56
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {
unsigned int const ci = j*b2size + i*BSIZE*msize;
for (unsigned int k = 0; k < msize/BSIZE; k++) {
unsigned int const ai = k*b2size + i*BSIZE*msize;
unsigned int const bi = j*b2size + k*BSIZE*msize;
matmulBlock_hw(&a[ai], &b[bi], &c[ci]);
} } }
#pragma omp taskwait
…
●
Task directive
●
in/out/inout dependences
●
Shape of the dependences
CONST_BSIZE should be a variable
Taskwait directive!!!!
What is the performance of the SMP version?
128 256 512 1024
0,9 1,0 1,1 1,2 1,3 1,4 1,5
Matrix Multiply
Speedup
seq_matmul smp 32x32 2t
Matrix Size
Speedup (seq/version)
What is the performance of the SMP version?
58
128 256 512 1024
0,9 1,0 1,1 1,2 1,3 1,4 1,5
Matrix Multiply
Speedup
seq_matmul smp 32x32 2t
Matrix Size
Speedup (seq/version)
Efficiency decreases due to the number of tasks created – Too much overhead
Question: How many tasks
have been created?
What is the performance of the SMP version?
128 256 512 1024
0,9 1,0 1,1 1,2 1,3 1,4 1,5
Matrix Multiply
Speedup
seq_matmul smp 32x32 2t
Matrix Size
Speedup (seq/version)
Efficiency decreases due to the number of tasks created – Too much overhead
Question: How many tasks
has been created?
Step 2: FPGA Porting of Tiled MxM w/ pointers
60
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {
unsigned int const ci = j*b2size + i*BSIZE*msize;
for (unsigned int k = 0; k < msize/BSIZE; k++) {
unsigned int const ai = k*b2size + i*BSIZE*msize;
unsigned int const bi = j*b2size + k*BSIZE*msize;
matmulBlock_hw(&a[ai], &b[bi], &c[ci]);
} } }
#pragma omp taskwait
…
Question: What ompss pragma/s should you should use to create an
accelerator
implementing the task?
Step 2: FPGA Porting of Tiled MxM w/ pointers
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c); … ...
for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {
unsigned int const ci = j*b2size + i*BSIZE*msize;
for (unsigned int k = 0; k < msize/BSIZE; k++) {
unsigned int const ai = k*b2size + i*BSIZE*msize;
unsigned int const bi = j*b2size + k*BSIZE*msize;
matmulBlock_hw(&a[ai], &b[bi], &c[ci]);
}}}
#pragma omp taskwait
●
Task device (fpga)
●
num_instances
●
copy_deps
Taskwait directive!!!!
Step 2: Note… SMP Porting of Tiled MxM
62
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp target device(smp) no_copy_deps
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {
unsigned int const ci = j*b2size + i*BSIZE*msize;
for (unsigned int k = 0; k < msize/BSIZE; k++) {
unsigned int const ai = k*b2size + i*BSIZE*msize;
unsigned int const bi = j*b2size + k*BSIZE*msize;
matmulBlock_hw(&a[ai], &b[bi], &c[ci]);
} } }
#pragma omp taskwait
…
●
Target device by default is smp
●
No copy deps since it is the same
memory space
Step 2: What is matmulBlock_hw?
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
for (i = 0; i < BSIZE; i++) { for (j = 0; j < BSIZE; j++) { elem_t sum = c[i*BSIZE + j];
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
}}}
Advise: Add
labels to loops!
Step 2: What is matmulBlock_hw?
64
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
loop_i_matmul:
for (i = 0; i < BSIZE; i++) { loop_j_matmul:
for (j = 0; j < BSIZE; j++) { elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
} }}
Advise: Add labels to loops!
If you want to
analyze the HLS
synthesis w/ Vivado
HLS
Compilation
Few details on the steps to compile
─ Compile: use fpgacc
» [cross-compile-]fpgacc --ompss -o program program.c
» [cross-compile-]fpgacc --ompss --instrumentation -o program program.c
─ Details...
» Compiling options:
--bitstream-generation
» Linking options:
--Wf,”--board=$(BOARD_NAME),--clock=100,--hwruntime=som,--to_step=HLS”
--Wf,”-v,--name=vivado_project_name,--dir=$(VIVADO_WORKSPACE)”
Compilation
66
ompss@name
:~/ BOARD=zedboard nohup make hls-analysis >& output.hls-analysis &
Step 3: Overall – Vivado HLS synthesis
Open Vivado HLS project:
vivado_hls -p vivado_project_name/matmul_ait/xilinx/HLS/matmulBlock_hw
Step 3: Overall – Vivado HLS synthesis
68
matmulBlock_hw kernel:
Resources & cycles! matmulBlock_hw kernel:
Operation Analysis &
Detail loop latency iteration
matmulBlock_hw kernel:
Initiation Interval?
Latency?
Pipelined?
Step 3: Operation Analysis & Detail loop latency iteration
●
In orange, latency of one iteration
●
Vertical means parallel
Step 3: Operation Analysis & Detail loop latency iteration
70
You can click-right and see source code
Step 3: Operation Analysis & Detail loop latency iteration
Tripcount analysis… place your mouse in the Performance view - orange
Step 3: Overall Loop Analysis: Latency
72
matmulBlock_hw kernel:
Resources & cycles!
matmulBlock_hw kernel:
●
Initiation Interval?
➢
No ititiation interval in any of the loops
●
Latency?
➢
Latency of loop k is 11 latency/iteration * 32 iterations
➢
Latency of loop j is
loopk_latency * 32 iterations
➢
Latency of loop i is
loopj_latency * 32 iterations
●
Pipelined?
No pipeline
Step 3: Overall – Loop Analysis: Resources
Performance Analysis: 32x32 acc, Slowdown!!!
74
512,000 0,000
0,500 1,000 1,500 2,000 2,500 3,000 3,500
4,000 Matrix Multiply 15,380
seq_matmul smp 32x32 2t 32x32_1acc_NAIVE
Matrix Size
Seconds
Almost with no effort you can achieve an accelerator!!!
however….
Question:
Why is so slow compared to the sequential
version?
Step 4: How do we optimize the kernel?
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
loop_i_matmul:
for (i = 0; i < BSIZE; i++) { loop_j_matmul:
for (j = 0; j < BSIZE; j++) { elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
} }}
Questions:
●
Can we force HLS to pipeline a loop?
●
Which directive should we add?
●
Where?
Pipeline
76
Several instances of the hardware can operate in pipleine
Initiation Interval: Number of cycles before you can start a
new instance of the hardware
Pipeline: Resources vs Parallelism
Looking for pipeline
● With few resources (Left)
Looking for some parallelism and pipeline:
● Full unroll of the inner-most loop. Resources needed and throughput of loop L1 (Middle)
Looking for high parallelism
● Full unroll of L1 and L2. Too much resources!!! (Right)
Pipeline: Initiation Interval Limitations
78
Constant Bounds help…
● Example: comparison due non known limit…
Memory Accesses can have conflicts
● Arra Partition can help
Step 4: How do we optimize the kernel?
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
loop_i_matmul:
for (i = 0; i < BSIZE; i++) { loop_j_matmul:
for (j = 0; j < BSIZE; j++) { elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
} }}
Questions:
●
Can we force HLS to pipeline a loop?
●
Which is the
directive that we should add?
●
Where?
Step 4: How do we optimize the kernel?
80
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
loop_i_matmul:
for (i = 0; i < BSIZE; i++) { loop_j_matmul:
for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
} }}
II=1
Every cycle starts a new iteration of loop j
●
Question:
Are other
optimizations
produced due to
the PIPELINE
directive of this
loop?
Step 4: How do we optimize the kernel?
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
loop_i_matmul:
for (i = 0; i < BSIZE; i++) { loop_j_matmul:
for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
II=1
Every cycle starts a new iteration of loop j
●
Maybe… collapse of loop i and j
●
Maybe … full unroll
of loop k will happen
Step 4: Overall – Vivado HLS synthesis
82
Open Vivado HLS project:
vivado_hls -p vivado_project_name/matmul_ait/xilinx/HLS/matmulBlock_hw
●
II=16
●
Loop i and j collapsed
●
No loop k… full unroll
Step 4: Overall – Vivado HLS synthesis
Open Vivado HLS project:
vivado_hls -p vivado_project_name/matmul_ait/xilinx/HLS/matmulBlock_hw
●
II=16
●
Loop i and j collapsed (1024 iterations)
●
Long latency iteration due to reduction...
Why we don’t achieve II=1?
Step 4: Vivado HLS synthesis: Data Detail...
84
➢
Open compilation log:
vim vivado_project/matmul_ait/matmul.ait.log
➢
Look for: matmulBlock_hw_moved... Question:
How do we solve those conflicts to allow more accesses each cycle?
There are conflicts accessing to “a” variable…
Exploring Data Conflicts: Array Synthesis
Memory Accesses can have conflicts
● Arra Partition can help
Memory is synthesized (usually) to a BRAM by default
Array Partitioning
86
Memory Accesses can have conflicts
● Arra Partition can help
● #pragma HLS ARRAY_PARTITION variable=name dim=X type factor=FACTOR
Array Partitioning: Array Dimensions
Step 5: Array Partitioning
88
…
#define BSIZE 32
const unsigned int CONST_BSIZE = BSIZE;
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(T *a,T *b,T *c);
…
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
loop_i_matmul:
for (i = 0; i < BSIZE; i++) { loop_j_matmul:
for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
} }}
Question:
●
Which variable/s should we partition based on the log error?
●
Which type of array
partitioning should
be used?
Step 5: Array Partitioning
…
#define BSIZE 32
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
unsigned int FACTOR=BSIZE/2;
#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR loop_i_matmul:
for (i = 0; i < BSIZE; i++) { loop_j_matmul:
for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
} }}
Question:
Why can BSIZE be divided by two?
Let’s analyze if we
achieve the II=1 with
the new directive!
Step 5: Vivado HLS synthesis: Data Detail...
90
➢
Open compilation log:
vim vivado_project/matmul_ait/matmul.ait.log
➢
Look for: matmulBlock_hw_moved...
Question:
II=16 …
How do we solve them to allow more accesses in the same cycle?
There are conflicts NOW accessing to “b” variable…
Step 5: Array Partitioning (II)
…
#define BSIZE 32
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
unsigned int FACTOR=BSIZE/2;
#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR loop_i_matmul:
for (i = 0; i < BSIZE; i++) { loop_j_matmul:
for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
} }}
Question:
Which type of array partitioning and factor you will use for “b”
variable?
Step 5: Array Partitioning (II)
92
…
#define BSIZE 32
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;
unsigned int FACTOR=BSIZE/2;
#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR
#pragma HLS ARRAY_PARTITION variable=b dim=1 block factor=FACTOR loop_i_matmul:
for (i = 0; i < BSIZE; i++) { loop_j_matmul:
for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
for (k = 0; k < BSIZE; k++)
sum += a[i*BSIZE + k] * b[k*BSIZE + j];
c[i*BSIZE + j] = sum;
} }}
Let’s analyze if now we
can achieve the II …
and which are the
resources achieved
Step 5: Vivado HLS synthesis: Data Detail...
➢
Open compilation log:
vim vivado_project/matmul_ait/matmul.ait.log
➢
Look for: matmulBlock_hw_moved...
YES! Now we have achieved the target II = 1!!!!
BEFORE Overall Loop Analysis: Latency
94
matmulBlock_hw kernel:
Resources & cycles!
matmulBlock_hw kernel:
●
Initiation Interval?
➢
No ititiation interval in any of the loops
●
Latency?
➢
Latency of loop k is 11 latency/iteration * 32 iterations
➢
Latency of loop j is
loopk_latency * 32 iterations
➢
Latency of loop i is
loopj_latency * 32 iterations
●