• No se han encontrado resultados

PATC-course OmpSs@FPGA

N/A
N/A
Protected

Academic year: 2023

Share "PATC-course OmpSs@FPGA"

Copied!
163
0
0

Texto completo

(1)

www.bsc.es

OmpSs@FPGA

PATC-course

BSC OmpSs@FPGA team

Universitat Politècnica de Catalunya, and Barcelona Supercomputing Center

March 11th , 2021

(2)

Agenda

2

OmpSs@FPGA Overview + Vivado HLS usage

OmpSs@FPGA :

Basic Programming, Advanced Prog, Compilation and Execution

Vivado HLS optimizations

Directives (most common) and environment Break:

Demo/Tutorial Environment + Quizz :-):

MxM OmpSs@FPGA porting + Vivado HLS opt

MxM tuning (increase number of accelerators vs Tile size) MxM Blocking (using non caching clauses)

Other versions

OmpSs@FPGA New features

Software and Environment support OmpSs@FPGA Docker container Contact us @ [email protected]

(3)

Agenda

OmpSs@FPGA Overview + Vivado HLS usage

OmpSs@FPGA :

Basic Programming, Advanced Prog, Compilation and Execution

Vivado HLS optimizations

Directives (most common) and environment Break:

Demo/Tutorial Environment + Quizz :-)::

MxM OmpSs@FPGA porting + Vivado HLS opt

MxM tuning (increase number of accelerators vs Tile size) MxM Blocking (using non caching clauses)

Other versions

OmpSs@FPGA New features

Software and Environment support OmpSs@FPGA Docker container Contact us @ [email protected]

Question: Have you

used FPGAs before?

(4)

What is a FPGA? And Zynq?

(5)

What is a FPGA?

FPGA stands for Field Programmable Gate Array

(6)

6

What is a FPGA?

FPGA stands for Field Programmable Gate Array

Programming:

HDL vs HLS

(7)

What is Zynq? Zynq-7000 – 32-bits

(8)

8

What is Zynq? Zynq-7000 – 32-bits

Programming:

HDL vs HLS And

Heterogeneous?

(9)

Zynq 702-board

(10)

OmpSs@FPGA Model

(11)

• StarSs family key concepts

• Sequential task-based program

• View of a single address/name space

• Execution in parallel: automatic

• runtime computation of dependencies

• Productivity and portability

• OmpSs is used for prototyping extensions to OpenMP

• Tasking

• data-dependences

• Heterogeneity

• Multiple address spaces

• Mercurium compiler

• …

• Nanos runtime system

StarSs

OmpSs COMPSs

PyCOMPSs

@ SMP @ GPU @ FPGA @ Cluster

Task-based programming

@ CloudFPGA

(12)

Execution Model (OmpSs + accelerators)

Global thread team created on startup

─ Master starts main task (also executes)

─ N workers execute tasks

─ One representative per device (accelerator)

All threads get work from a task pool

─ Accelerator kernels become tasks

─ Tasks are labeled with (at least) one target device

» smp (default), cuda, opencl, fpga

─ Scheduler decides which task to execute

─ Tasks may have several targets (and we can have special versions using implements)

Task pool Global thread team

12

(13)

www.bsc.es

Easy to Program

(14)

Programming with OmpSs@FPGA

#include “scale.fpga.h”

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . .

#pragma omp taskwait }

#include “scale.fpga.h”

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . .

#pragma omp taskwait }

main.c

const int SIZE=1024;

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);

const int SIZE=1024;

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);

scale.fpga.h

14

Dimension and Size pragmas

should be const type

Constants and global variables needed by a FPGA task may have to be moved to .fpga.h includes (C++). Functions used by two different accelerators too.

There should be at least one call to the accelerated kernel (FPGA task code) in the same source code

Parameters can be structs (w/

some limitations… better not)

(15)

OmpSs@FPGA Model Details

(16)

Very Simple Code: Execution Model

• There are a pool of threads to run tasks

• Runtime decides

» When a thread executes a task (it should be ready at task pool)

» Where to execute a task based on user annotations

» Which copies has to be done

#include “scale.fpga.h”

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . .

#pragma omp taskwait }

#include “scale.fpga.h”

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . .

#pragma omp taskwait }

main.c

const int SIZE=1024;

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);

const int SIZE=1024;

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);

scale.fpga.h

scale_taskscale_task

Task pool Global thread team

16

(17)

Execution Model (Data Transfer): scale

copy_in

copy_out c c

a a

c a

b b

b Start tim e Accelera of tor

Fin ish ac ce ler ato r

at CMA ARM

(18)

Sequential Example: Tiled MxM

18

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

bs bs NB_J

NB_I

(19)

OmpSs Example: Tiled MxM

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

bs bs NB_J

NB_I

(20)

OmpSs Example: Tiled MxM (1 acc)

20

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

bs bs NB_J

NB_I

20

FPGA

Mercurium compiler with AIT tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically (

Stub function generated

)

(21)

OmpSs Example: Tiled MxM (2 acc)

21

FPGA

BS NB_J

NB_I

BS

BS BS

#pragma omp target device(fpga) copy_deps

num_instances(2)

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

Mercurium compiler with AIT tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically (

Stub function generated

)

(22)

OmpSs Example: MxM kernel + Vivado HLS

Mercurium compiler with AIT tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically (

Stub function generated

)

22

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

bs bs NB_J

NB_I

22

FPGA

HP, Sant Cugat, January 30th, 2018 void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

(23)

OmpSs Example: MxM kernel + Vivado HLS

bs bs NB_J

NB_I

23

FPGA

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

Mercurium compiler with AIT tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically (

Stub function generated

)

(24)

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

Xilinx Zynq Support ++

OmpSs FPGA Support (present and near future)

24

Zynq-7000 Family

2x Cortex-A9 cores + FPGA (32-bit platforms)

SECO AXIOM Board Zynq U+ XCZU9EG-ES2

Trenz Electronics Zynq U+

TE0808 XCZU9EG-ES1

4x Cortex-A53 cores + FPGA (64-bit platforms)

ZCU102 Board

XCZU9EG-2FFVB1156

Discrete FPGAs like

Alveo and Cloud

(25)

www.bsc.es

Advanced OmpSs and OmpSs@FPGA

(26)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

scale_taskscale_task

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

main.c

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[jj;} // or special code void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[jj;} // or special code

scale.h

scale.c

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

scale.fpga.h

scale.c

Task pool Global thread team

26

(27)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

scale_taskscale_task

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

main.c

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code

scale.h

scale.c

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

scale.fpga.h

scale.c

Task pool Global thread team

(28)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

scale_taskscale_task

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

main.c

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code

scale.h

scale.c

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

scale.fpga.h

scale.c

Task pool Global thread team

28

(29)

OmpSs@FPGA : Data Movement & Blocking

FPGA Blocking....

… data locality

optimizations, etc ...

Mercurium will not

generate memory copies

inside the accelerator,

however the Runtime will

copy the data to accesible

FPGA memory

(30)

Execution Model (Data Transfer): scale

30

copy_in

copy_out

Wrapper Local Memory Copies

const int SIZE=1024;

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);

const int SIZE=1024;

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);

c c

a a

c a

b b

b

(31)

Execution Model (Data Transfer): scale

copy_in

copy_out

Computation Logic Local Memory Copies

const int SIZE=1024;

#pragma omp target device(fpga) copy_deps no_localmem_copies num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);

const int SIZE=1024;

#pragma omp target device(fpga) copy_deps no_localmem_copies num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);

c c

a a

b b

memcpy BRAM

COMPUTATION LOGIC

(32)

FPGA Blocking: Mercurium helps

32

FPGA Blocking....

… data locality

optimizations, etc ...

Arguments with no local copies

Addr is automatically passed in

Addr transparent access to SMP Memory using AXI ports

Transparent to the Programmer!

Be careful with memcpy !!!

(33)

FPGA Blocking: User should copy data

FPGA Blocking....

… data locality

optimizations, etc ...

Arguments with no local copies

Addr is automatically passed in

Addr transparent access to SMP Memory using AXI ports

Transparent to the Programmer!

Be careful with memcpy !!!

(34)

www.bsc.es

OmpSs@FPGA Ecosystem: Integrating

FPGA compilation into OmpSs

(35)

OmpSs@FPGA Ecosystem: AIT

OmpSs transformations

─ On full source

» Vivado HLS and Vivado

AIT

OmpSs Application

Mercurium

GCC

Nanos

xtasks lib

Extrae

OmpSs.elf bitstream

SMP FPGA

OS (petaLinux) +

BOOT.bin + image.ub + ompssatfpga driver

OmpSs phase FPGA phase

Wrapper and Code generation

Vivado HLS

Hardware generation

Netlist

OmpSs Manager Host code + Nanos++ calls

Instrumentation

Interconnect

Vivado

Netlist

(36)

Compilation

36

Few details on the steps to compile

─ Compile: use fpgacc

» [cross-compile-]fpgacc --ompss -o program program.c

» [cross-compile-]fpgacc --ompss --instrumentation -o program program.c

─ Details...

» Compiling options:

--bitstream-generation

» Linking options:

--Wf,”--board=$(BOARD_NAME),--clock=100,--hwruntime=som”

--Wf,”-v,--name=vivado_project_name,--dir=$(VIVADO_WORKSPACE)”

(37)

Execution

Few details on the steps to run the application

─ Environment:

Download the bitstream:

Zynq 7000 family: cat program.bin > /dev/xdevcfg

─ Run Program: ./program <input arguments>

» Runtime Application Configuration:

program.xtasks.config or xtasks.config

» With instrumentation:

export EXTRAE_CONFIG_FILE=”extrae.xml”

NX_ARGS=”--instrumentation=extrae” ./program <input>

(38)

www.bsc.es

Vivado HLS Optimizations

(39)

OmpSs Example: Vivado HLS

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

(40)

Overall Overview*

* Some information and pictures are extracted from the Xilinx University Programm Vivado HLS Presentations 40

(41)

Loop Unroll

Option 1: one iteration per cycle

Option 2: several iterations per cycle in sequential

Option 3: several iterations per cycle in parallel

Be careful with the needed resources Memory conflicts!!

(42)

Pipeline

42

Several instances of the hardware can operate in pipleine

Initiation Interval: Number of cycles before you can start a

new instance of the hardware

(43)

Pipeline: Resources vs Parallelism

Looking for pipeline

With few resources (Left)

Looking for some parallelism and pipeline:

Full unroll of the inner-most loop. Resources needed and throughput of loop L1 (Middle)

Looking for high parallelism

Full unroll of L1 and L2. Too much resources!!! (Right)

(44)

Pipeline: Initiation Interval Limitations

44

Constant Bounds help…

Example: comparison due non known limit…

Memory Accesses can have conflicts

Arra Partition can help

(45)

Exploring Data Conflicts: Array Synthesis

Memory Accesses can have conflicts

Arra Partition can help

Memory is synthesized (usually) to a BRAM by default

(46)

Array Partitioning

46

Memory Accesses can have conflicts

Arra Partition can help

#pragma HLS ARRAY_PARTITION variable=name dim=X type factor=FACTOR

(47)

Array Partitioning: Array Dimensions

(48)

www.bsc.es

Vivado HLS Analysis

(49)

Vivado HLS: Main views

(50)

Vivado HLS: Performance analysis

50

(51)

Vivado HLS: Resources

(52)

www.bsc.es

Break!

Really?! :-)

(53)

www.bsc.es

Step by step Porting

(54)

Example: Tiled MxM

54

void matmulBlock_hw(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<BSIZE; i_b++) for (j_b=0; j_b<BSIZE; j_b++) for (k_b=0; k_b<BSIZE; k_b++)

matmulBlock_hw(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

It is better to port it to pointer, as it

happen in CUDA … GPU porting...

(55)

Step 1: SMP Porting of Tiled MxM w/ pointers

#define BSIZE 32

void matmulBlock_hw(T *a,T *b,T *c);

for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {

unsigned int const ci = j*b2size + i*BSIZE*msize;

for (unsigned int k = 0; k < msize/BSIZE; k++) {

unsigned int const ai = k*b2size + i*BSIZE*msize;

unsigned int const bi = j*b2size + k*BSIZE*msize;

matmulBlock_hw(&a[ai], &b[bi], &c[ci]);

} } }

Question: What ompss pragma/s should you

should use to have a correct SMP OmpSs

version?

(56)

Step 1: SMP Porting of Tiled MxM w/ pointers

56

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {

unsigned int const ci = j*b2size + i*BSIZE*msize;

for (unsigned int k = 0; k < msize/BSIZE; k++) {

unsigned int const ai = k*b2size + i*BSIZE*msize;

unsigned int const bi = j*b2size + k*BSIZE*msize;

matmulBlock_hw(&a[ai], &b[bi], &c[ci]);

} } }

#pragma omp taskwait

Task directive

in/out/inout dependences

Shape of the dependences

CONST_BSIZE should be a variable

Taskwait directive!!!!

(57)

What is the performance of the SMP version?

128 256 512 1024

0,9 1,0 1,1 1,2 1,3 1,4 1,5

Matrix Multiply

Speedup

seq_matmul smp 32x32 2t

Matrix Size

Speedup (seq/version)

(58)

What is the performance of the SMP version?

58

128 256 512 1024

0,9 1,0 1,1 1,2 1,3 1,4 1,5

Matrix Multiply

Speedup

seq_matmul smp 32x32 2t

Matrix Size

Speedup (seq/version)

Efficiency decreases due to the number of tasks created – Too much overhead

Question: How many tasks

have been created?

(59)

What is the performance of the SMP version?

128 256 512 1024

0,9 1,0 1,1 1,2 1,3 1,4 1,5

Matrix Multiply

Speedup

seq_matmul smp 32x32 2t

Matrix Size

Speedup (seq/version)

Efficiency decreases due to the number of tasks created – Too much overhead

Question: How many tasks

has been created?

(60)

Step 2: FPGA Porting of Tiled MxM w/ pointers

60

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {

unsigned int const ci = j*b2size + i*BSIZE*msize;

for (unsigned int k = 0; k < msize/BSIZE; k++) {

unsigned int const ai = k*b2size + i*BSIZE*msize;

unsigned int const bi = j*b2size + k*BSIZE*msize;

matmulBlock_hw(&a[ai], &b[bi], &c[ci]);

} } }

#pragma omp taskwait

Question: What ompss pragma/s should you should use to create an

accelerator

implementing the task?

(61)

Step 2: FPGA Porting of Tiled MxM w/ pointers

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c); … ...

for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {

unsigned int const ci = j*b2size + i*BSIZE*msize;

for (unsigned int k = 0; k < msize/BSIZE; k++) {

unsigned int const ai = k*b2size + i*BSIZE*msize;

unsigned int const bi = j*b2size + k*BSIZE*msize;

matmulBlock_hw(&a[ai], &b[bi], &c[ci]);

}}}

#pragma omp taskwait

Task device (fpga)

num_instances

copy_deps

Taskwait directive!!!!

(62)

Step 2: Note… SMP Porting of Tiled MxM

62

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp target device(smp) no_copy_deps

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

for (unsigned int i = 0; i < msize/BSIZE; i++) { for (unsigned int j = 0; j < msize/BSIZE; j++) {

unsigned int const ci = j*b2size + i*BSIZE*msize;

for (unsigned int k = 0; k < msize/BSIZE; k++) {

unsigned int const ai = k*b2size + i*BSIZE*msize;

unsigned int const bi = j*b2size + k*BSIZE*msize;

matmulBlock_hw(&a[ai], &b[bi], &c[ci]);

} } }

#pragma omp taskwait

Target device by default is smp

No copy deps since it is the same

memory space

(63)

Step 2: What is matmulBlock_hw?

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

for (i = 0; i < BSIZE; i++) { for (j = 0; j < BSIZE; j++) { elem_t sum = c[i*BSIZE + j];

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

}}}

Advise: Add

labels to loops!

(64)

Step 2: What is matmulBlock_hw?

64

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

loop_i_matmul:

for (i = 0; i < BSIZE; i++) { loop_j_matmul:

for (j = 0; j < BSIZE; j++) { elem_t sum = c[i*BSIZE + j];

loop_k_matmul:

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

} }}

Advise: Add labels to loops!

If you want to

analyze the HLS

synthesis w/ Vivado

HLS

(65)

Compilation

Few details on the steps to compile

─ Compile: use fpgacc

» [cross-compile-]fpgacc --ompss -o program program.c

» [cross-compile-]fpgacc --ompss --instrumentation -o program program.c

─ Details...

» Compiling options:

--bitstream-generation

» Linking options:

--Wf,”--board=$(BOARD_NAME),--clock=100,--hwruntime=som,--to_step=HLS”

--Wf,”-v,--name=vivado_project_name,--dir=$(VIVADO_WORKSPACE)”

(66)

Compilation

66

ompss@name

:~/ BOARD=zedboard nohup make hls-analysis >& output.hls-analysis &

(67)

Step 3: Overall – Vivado HLS synthesis

Open Vivado HLS project:

vivado_hls -p vivado_project_name/matmul_ait/xilinx/HLS/matmulBlock_hw

(68)

Step 3: Overall – Vivado HLS synthesis

68

matmulBlock_hw kernel:

Resources & cycles! matmulBlock_hw kernel:

Operation Analysis &

Detail loop latency iteration

matmulBlock_hw kernel:

Initiation Interval?

Latency?

Pipelined?

(69)

Step 3: Operation Analysis & Detail loop latency iteration

In orange, latency of one iteration

Vertical means parallel

(70)

Step 3: Operation Analysis & Detail loop latency iteration

70

You can click-right and see source code

(71)

Step 3: Operation Analysis & Detail loop latency iteration

Tripcount analysis… place your mouse in the Performance view - orange

(72)

Step 3: Overall Loop Analysis: Latency

72

matmulBlock_hw kernel:

Resources & cycles!

matmulBlock_hw kernel:

Initiation Interval?

No ititiation interval in any of the loops

Latency?

Latency of loop k is 11 latency/iteration * 32 iterations

Latency of loop j is

loopk_latency * 32 iterations

Latency of loop i is

loopj_latency * 32 iterations

Pipelined?

No pipeline

(73)

Step 3: Overall – Loop Analysis: Resources

(74)

Performance Analysis: 32x32 acc, Slowdown!!!

74

512,000 0,000

0,500 1,000 1,500 2,000 2,500 3,000 3,500

4,000 Matrix Multiply 15,380

seq_matmul smp 32x32 2t 32x32_1acc_NAIVE

Matrix Size

Seconds

Almost with no effort you can achieve an accelerator!!!

however….

Question:

Why is so slow compared to the sequential

version?

(75)

Step 4: How do we optimize the kernel?

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

loop_i_matmul:

for (i = 0; i < BSIZE; i++) { loop_j_matmul:

for (j = 0; j < BSIZE; j++) { elem_t sum = c[i*BSIZE + j];

loop_k_matmul:

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

} }}

Questions:

Can we force HLS to pipeline a loop?

Which directive should we add?

Where?

(76)

Pipeline

76

Several instances of the hardware can operate in pipleine

Initiation Interval: Number of cycles before you can start a

new instance of the hardware

(77)

Pipeline: Resources vs Parallelism

Looking for pipeline

With few resources (Left)

Looking for some parallelism and pipeline:

Full unroll of the inner-most loop. Resources needed and throughput of loop L1 (Middle)

Looking for high parallelism

Full unroll of L1 and L2. Too much resources!!! (Right)

(78)

Pipeline: Initiation Interval Limitations

78

Constant Bounds help…

Example: comparison due non known limit…

Memory Accesses can have conflicts

Arra Partition can help

(79)

Step 4: How do we optimize the kernel?

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

loop_i_matmul:

for (i = 0; i < BSIZE; i++) { loop_j_matmul:

for (j = 0; j < BSIZE; j++) { elem_t sum = c[i*BSIZE + j];

loop_k_matmul:

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

} }}

Questions:

Can we force HLS to pipeline a loop?

Which is the

directive that we should add?

Where?

(80)

Step 4: How do we optimize the kernel?

80

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

loop_i_matmul:

for (i = 0; i < BSIZE; i++) { loop_j_matmul:

for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];

loop_k_matmul:

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

} }}

II=1

Every cycle starts a new iteration of loop j

Question:

Are other

optimizations

produced due to

the PIPELINE

directive of this

loop?

(81)

Step 4: How do we optimize the kernel?

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

loop_i_matmul:

for (i = 0; i < BSIZE; i++) { loop_j_matmul:

for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];

loop_k_matmul:

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

II=1

Every cycle starts a new iteration of loop j

Maybe… collapse of loop i and j

Maybe … full unroll

of loop k will happen

(82)

Step 4: Overall – Vivado HLS synthesis

82

Open Vivado HLS project:

vivado_hls -p vivado_project_name/matmul_ait/xilinx/HLS/matmulBlock_hw

II=16

Loop i and j collapsed

No loop k… full unroll

(83)

Step 4: Overall – Vivado HLS synthesis

Open Vivado HLS project:

vivado_hls -p vivado_project_name/matmul_ait/xilinx/HLS/matmulBlock_hw

II=16

Loop i and j collapsed (1024 iterations)

Long latency iteration due to reduction...

Why we don’t achieve II=1?

(84)

Step 4: Vivado HLS synthesis: Data Detail...

84

Open compilation log:

vim vivado_project/matmul_ait/matmul.ait.log

Look for: matmulBlock_hw_moved... Question:

How do we solve those conflicts to allow more accesses each cycle?

There are conflicts accessing to “a” variable…

(85)

Exploring Data Conflicts: Array Synthesis

Memory Accesses can have conflicts

Arra Partition can help

Memory is synthesized (usually) to a BRAM by default

(86)

Array Partitioning

86

Memory Accesses can have conflicts

Arra Partition can help

#pragma HLS ARRAY_PARTITION variable=name dim=X type factor=FACTOR

(87)

Array Partitioning: Array Dimensions

(88)

Step 5: Array Partitioning

88

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

void matmulBlock_hw(T *a,T *b,T *c);

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

loop_i_matmul:

for (i = 0; i < BSIZE; i++) { loop_j_matmul:

for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];

loop_k_matmul:

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

} }}

Question:

Which variable/s should we partition based on the log error?

Which type of array

partitioning should

be used?

(89)

Step 5: Array Partitioning

#define BSIZE 32

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

unsigned int FACTOR=BSIZE/2;

#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR loop_i_matmul:

for (i = 0; i < BSIZE; i++) { loop_j_matmul:

for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];

loop_k_matmul:

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

} }}

Question:

Why can BSIZE be divided by two?

Let’s analyze if we

achieve the II=1 with

the new directive!

(90)

Step 5: Vivado HLS synthesis: Data Detail...

90

Open compilation log:

vim vivado_project/matmul_ait/matmul.ait.log

Look for: matmulBlock_hw_moved...

Question:

II=16 …

How do we solve them to allow more accesses in the same cycle?

There are conflicts NOW accessing to “b” variable…

(91)

Step 5: Array Partitioning (II)

#define BSIZE 32

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

unsigned int FACTOR=BSIZE/2;

#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR loop_i_matmul:

for (i = 0; i < BSIZE; i++) { loop_j_matmul:

for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];

loop_k_matmul:

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

} }}

Question:

Which type of array partitioning and factor you will use for “b”

variable?

(92)

Step 5: Array Partitioning (II)

92

#define BSIZE 32

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) { unsigned int i, j, k;

unsigned int FACTOR=BSIZE/2;

#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR

#pragma HLS ARRAY_PARTITION variable=b dim=1 block factor=FACTOR loop_i_matmul:

for (i = 0; i < BSIZE; i++) { loop_j_matmul:

for (j = 0; j < BSIZE; j++) { #pragma HLS PIPELINE II=1 elem_t sum = c[i*BSIZE + j];

loop_k_matmul:

for (k = 0; k < BSIZE; k++)

sum += a[i*BSIZE + k] * b[k*BSIZE + j];

c[i*BSIZE + j] = sum;

} }}

Let’s analyze if now we

can achieve the II …

and which are the

resources achieved

(93)

Step 5: Vivado HLS synthesis: Data Detail...

Open compilation log:

vim vivado_project/matmul_ait/matmul.ait.log

Look for: matmulBlock_hw_moved...

YES! Now we have achieved the target II = 1!!!!

(94)

BEFORE Overall Loop Analysis: Latency

94

matmulBlock_hw kernel:

Resources & cycles!

matmulBlock_hw kernel:

Initiation Interval?

No ititiation interval in any of the loops

Latency?

Latency of loop k is 11 latency/iteration * 32 iterations

Latency of loop j is

loopk_latency * 32 iterations

Latency of loop i is

loopj_latency * 32 iterations

Pipelined?

No pipeline

(95)

Before vs Now!

matmulBlock_hw kernel:

Resources & cycles! Pipeline+Array_Partition matmulBlock_hw kernel:

Resources & cycles!

Few HLS directives help a lot!!!!

Referencias

Documento similar