PATC-course OmpSs@FPGA

(1)

www.bsc.es

OmpSs@FPGA

PATC-course

BSC OmpSs@FPGA team

Universitat Politècnica de Catalunya, and Barcelona Supercomputing Center

March 11th, 2021

(2)

Agenda

OmpSs@FPGA Overview + Vivado HLS usage

● OmpSs@FPGA:

Basic Programming, Advanced Programming, Compilation and Execution

● Vivado HLS optimizations

Directives (most common) and environment

Break

● Demo/Tutorial Environment + Quiz :-)

MxM OmpSs@FPGA porting + Vivado HLS optimizations

MxM tuning (increasing the number of accelerators vs. tile size)

MxM blocking (using non-caching clauses)

Other versions

● OmpSs@FPGA new features

● Software and environment support

OmpSs@FPGA Docker container

Contact us @ [email protected]

(3)

Question: Have you used FPGAs before?

(4)

What is an FPGA? And Zynq?

(5)

What is an FPGA?

FPGA stands for Field Programmable Gate Array

Programming: HDL vs HLS

(7)

What is Zynq? Zynq-7000 (32-bit)

Programming: HDL vs HLS. And heterogeneous?

(9)

Zynq 702-board

(10)

OmpSs@FPGA Model

(11)

• StarSs family key concepts

  • Sequential task-based program

  • View of a single address/name space

  • Execution in parallel: automatic runtime computation of dependencies

• Productivity and portability

• OmpSs is used for prototyping extensions to OpenMP

  • Tasking

  • Data dependences

  • Heterogeneity

  • Multiple address spaces

  • …

• Mercurium compiler

• Nanos runtime system

[Diagram: StarSs family of task-based programming models (OmpSs, COMPSs, PyCOMPSs) and targets (@ SMP, @ GPU, @ FPGA, @ Cluster, @ CloudFPGA)]

(12)

Execution Model (OmpSs + accelerators)

Global thread team created on startup

─ Master starts main task (also executes)

─ N workers execute tasks

─ One representative per device (accelerator)

All threads get work from a task pool

─ Accelerator kernels become tasks

─ Tasks are labeled with (at least) one target device

» smp (default), cuda, opencl, fpga

─ Scheduler decides which task to execute

─ Tasks may have several targets (and we can have special versions using implements)

[Figure: task pool feeding the global thread team]

(13)

Easy to Program

(14)

Programming with OmpSs@FPGA

main.c:

#include "scale.fpga.h"

void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1
    double alpha = *a;
    for (int j = 0; j < SIZE; j++)
        b[j] = alpha * c[j];
}

int main (int argc, char *argv[]) {
    for (. . . ) {
        scale_task(B, C, &alpha);
    }
    . . .
#pragma omp taskwait
}

scale.fpga.h:

const int SIZE = 1024;

#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c, a) out([SIZE]b)
void scale_task(double *b, double *c, double *a);

– Dimensions and sizes used in pragmas should be of const type

– Constants and global variables needed by an FPGA task may have to be moved to .fpga.h includes (C++). The same applies to functions used by two different accelerators.

– There should be at least one call to the accelerated kernel (FPGA task code) in the same source file

– Parameters can be structs (with some limitations… better not)

(15)

OmpSs@FPGA Model Details

(16)

Very Simple Code: Execution Model

• There is a pool of threads to run tasks

• The runtime decides

» When a thread executes a task (the task must be ready in the task pool)

» Where to execute a task, based on user annotations

» Which copies have to be done

(17)

Execution Model (Data Transfer): scale

[Figure: data transfer timeline for the scale task: copy_in of c and a before the accelerator starts, copy_out of b after the accelerator finishes (labels: CMA, ARM)]

(18)

Sequential Example: Tiled MxM

void matrix_multiply(T a[BS][BS], T b[BS][BS], T c[BS][BS]);

…

for (i_b = 0; i_b < NB_I; i_b++)
    for (j_b = 0; j_b < NB_J; j_b++)
        for (k_b = 0; k_b < NB_K; k_b++)
            matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

…

[Figure: matrix tiled into NB_I × NB_J blocks of bs × bs elements]

(19)

OmpSs Example: Tiled MxM

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

…

#pragma omp taskwait // or other tasks depending on the inout c of the matrix_multiply task
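Putting the task pragma and the tiled loop nest together, a minimal sketch could look as follows. The tile size `BS`, the tile count `NB`, the element type `T = float`, the wrapper name `tiled_matmul` and the kernel body are assumptions made for illustration. A plain C compiler ignores the OmpSs pragmas and simply runs the calls in order.

```c
#include <assert.h>

#define BS 4          /* tile size (assumed, small for illustration) */
#define NB 2          /* tiles per dimension (assumed) */
typedef float T;      /* element type (assumed) */

/* Task annotation on the declaration: every call becomes a task.
   a and b are inputs, c is read and written. */
#pragma omp task in([BS]a, [BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS], T b[BS][BS], T c[BS][BS]);

/* Plain block product: c += a * b */
void matrix_multiply(T a[BS][BS], T b[BS][BS], T c[BS][BS]) {
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Hypothetical driver: one task per (i_b, j_b, k_b) triple. */
void tiled_matmul(T AA[NB][NB][BS][BS], T BB[NB][NB][BS][BS],
                  T CC[NB][NB][BS][BS]) {
    assert(AA && BB && CC);
    for (int i_b = 0; i_b < NB; i_b++)
        for (int j_b = 0; j_b < NB; j_b++)
            for (int k_b = 0; k_b < NB; k_b++)
                matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
    /* Wait for all tasks before CC is read by non-task code. */
#pragma omp taskwait
}
```

Tasks that update the same `CC[i_b][j_b]` block are serialized through the `inout` dependence, while tasks on different blocks are free to run in parallel.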

(20)

OmpSs Example: Tiled MxM (1 acc)

…

[Figure: tiled matrix and one accelerator instance on the FPGA]

Mercurium compiler with AIT tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically (stub function generated)

(21)

OmpSs Example: Tiled MxM (2 acc)

#pragma omp target device(fpga) copy_deps num_instances(2)

…

[Figure: tiled matrix and two accelerator instances on the FPGA]

(22)

OmpSs Example: MxM kernel + Vivado HLS

void matrix_multiply(float a[BS][BS], float b[BS][BS], float c[BS][BS]) {
#pragma HLS inline
    int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1
    // matrix multiplication of a A*B matrix
    for (int ia = 0; ia < BS; ++ia)
        for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1
            float sum = 0;
            for (int id = 0; id < BS; ++id)
                sum += a[ia][id] * b[id][ib];
            c[ia][ib] += sum;
        }
}

(24)

OmpSs FPGA Support (present and near future)

Xilinx Zynq Support ++

Zynq-7000 Family: 2x Cortex-A9 cores + FPGA (32-bit platforms)

Zynq U+: 4x Cortex-A53 cores + FPGA (64-bit platforms)

─ SECO AXIOM Board, Zynq U+ XCZU9EG-ES2

─ Trenz Electronics Zynq U+ TE0808, XCZU9EG-ES1

─ ZCU102 Board, XCZU9EG-2FFVB1156

Discrete FPGAs like Alveo, and Cloud

(25)

Advanced OmpSs and OmpSs@FPGA

(26)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

main.c:

#include "scale.h"
#define SIZE 100
for (. . . ) {
}
. . .
}

scale.h:

#pragma omp target device(smp) no_copy_deps implements(scale_task)
#pragma omp task in([SIZE]c, a) out([SIZE]b)
void scale_task_smp(double *b, double *c, double *a);

scale.c:

void scale_task_smp(double *b, double *c, double *a) {
    double alpha = *a;
    for (int j = 0; j < SIZE; j++)
        b[j] = alpha * c[j];   // or special code
}

scale.fpga.h:

#pragma omp task in([SIZE]c, a) out([SIZE]b)
void scale_task(double *b, double *c, double *a);

scale.c:

void scale_task(double *b, double *c, double *a) {
    …
    for (int j = 0; j < SIZE; j++)
        b[j] = alpha * c[j];
}

(29)

OmpSs@FPGA : Data Movement & Blocking

FPGA blocking… data locality optimizations, etc.

Mercurium will not generate memory copies inside the accelerator; however, the runtime will copy the data to memory accessible from the FPGA.

(30)

Execution Model (Data Transfer): scale

[Figure: data transfers for scale with local memory copies in the wrapper: copy_in of c and a, copy_out of b]

(31)

Execution Model (Data Transfer): scale

#pragma omp target device(fpga) copy_deps no_localmem_copies num_instances(1)

[Figure: data transfers for scale with no_localmem_copies: the local memory copies (memcpy to BRAM) are done in the computation logic instead of the wrapper]

(32)

FPGA Blocking: Mercurium helps

FPGA blocking… data locality optimizations, etc.

Arguments with no local copies

● The address is automatically passed in

● Transparent access to SMP memory through that address, using AXI ports

● Transparent to the programmer!

● Be careful with memcpy!

(33)

FPGA Blocking: User should copy data

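When local copies are disabled, the kernel itself is responsible for bringing data close to the computation. The sketch below shows one way this might look for the scale task. The local buffer names are hypothetical, and `SIZE` is a macro here (the slides use a `const int` in a C++ header) so the example compiles as plain C.

```c
#include <assert.h>
#include <string.h>

#define SIZE 1024   /* macro instead of const int, to keep this plain C */

#pragma omp target device(fpga) copy_deps no_localmem_copies num_instances(1)
#pragma omp task in([SIZE]c, a) out([SIZE]b)
void scale_task(double *b, double *c, double *a);

void scale_task(double *b, double *c, double *a) {
    /* Hypothetical local buffers: candidates for BRAM in the accelerator. */
    double c_local[SIZE];
    double b_local[SIZE];

    assert(b && c && a);   /* host-side sanity check */

    /* Explicit copy in: with no_localmem_copies the wrapper does not do it. */
    memcpy(c_local, c, sizeof c_local);
    double alpha = *a;

    for (int j = 0; j < SIZE; j++)
        b_local[j] = alpha * c_local[j];

    /* Explicit copy out of the result. */
    memcpy(b, b_local, sizeof b_local);
}
```

Copying whole blocks with `memcpy` keeps the accesses to external memory in bursts, while the compute loop works only on the local buffers.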

(34)

OmpSs@FPGA Ecosystem: Integrating

FPGA compilation into OmpSs

(35)

OmpSs@FPGA Ecosystem: AIT

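OmpSs transformations

─ On full source

» Vivado HLS and Vivado

[Diagram: OmpSs@FPGA toolchain. The OmpSs application is compiled by Mercurium in two phases. OmpSs phase: host code + Nanos++ calls, compiled with GCC and linked with Nanos, the xtasks lib and Extrae into OmpSs.elf. FPGA phase (AIT): wrapper and code generation, Vivado HLS (netlist), then hardware generation (OmpSs Manager, instrumentation, interconnect) with Vivado (netlist), producing the bitstream. OmpSs.elf runs on the SMP and the bitstream on the FPGA, on top of the OS (petaLinux) + BOOT.bin + image.ub + ompssatfpga driver.]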

(36)

Compilation

Few details on the steps to compile

─ Compile: use fpgacc

» [cross-compile-]fpgacc --ompss -o program program.c

» [cross-compile-]fpgacc --ompss --instrumentation -o program program.c

─ Details...

» Compiling options:

--bitstream-generation

» Linking options:

--Wf,"--board=$(BOARD_NAME),--clock=100,--hwruntime=som"

--Wf,"-v,--name=vivado_project_name,--dir=$(VIVADO_WORKSPACE)"

(37)

Execution

Few details on the steps to run the application

─ Environment:

● Download the bitstream:

– Zynq 7000 family: cat program.bin > /dev/xdevcfg

─ Run program: ./program <input arguments>

» Runtime application configuration:

– program.xtasks.config or xtasks.config

» With instrumentation:

– export EXTRAE_CONFIG_FILE="extrae.xml"

– NX_ARGS="--instrumentation=extrae" ./program <input>

(38)

Vivado HLS Optimizations

(39)

OmpSs Example: Vivado HLS

(40)

Overall Overview*

* Some information and pictures are extracted from the Xilinx University Program Vivado HLS presentations

(41)

Loop Unroll

Option 1: one iteration per cycle

Option 2: several iterations per cycle in sequential

Option 3: several iterations per cycle in parallel

Be careful with the resources needed. Memory conflicts!
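To make the unrolling options concrete, the sketch below contrasts a loop annotated with the HLS unroll directive and the same loop unrolled by hand by a factor of 4. The function names, `SIZE` and the factor are assumptions for illustration. A plain C compiler ignores the HLS pragma, so both functions compute the same result on the host.

```c
#include <assert.h>

#define SIZE 16   /* assumed loop trip count */
#define UF 4      /* assumed unroll factor */

/* Version 1: let the HLS tool unroll the loop body 4 times. */
void scale_unroll_pragma(double *b, const double *c, double alpha) {
scale_loop:
    for (int j = 0; j < SIZE; j++) {
#pragma HLS UNROLL factor=4
        b[j] = alpha * c[j];
    }
}

/* Version 2: what a factor of 4 means conceptually. The four statements
   are independent, so the hardware may run them in parallel if b and c
   can serve four accesses per cycle. */
void scale_unroll_by_hand(double *b, const double *c, double alpha) {
    assert(SIZE % UF == 0);
    for (int j = 0; j < SIZE; j += UF) {
        b[j]     = alpha * c[j];
        b[j + 1] = alpha * c[j + 1];
        b[j + 2] = alpha * c[j + 2];
        b[j + 3] = alpha * c[j + 3];
    }
}
```

The by-hand version shows where the memory conflicts come from: four reads of `c` and four writes of `b` are requested in the same iteration.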

(42)

Pipeline

Several instances of the hardware can operate in pipeline

Initiation Interval: number of cycles before you can start a new instance of the hardware
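As a rough cost model (a standard approximation, not taken from the slides), consider a loop with $N$ iterations whose body has a latency of $D$ cycles. The symbols $N$, $D$ and $T$ are introduced here only for illustration.

```latex
\begin{align*}
T_{\text{sequential}} &= N \cdot D \\
T_{\text{pipelined}}  &\approx D + II \cdot (N - 1)
\end{align*}
```

For example, with $N = 32$ and $D = 11$, the sequential loop needs $352$ cycles, while a pipeline with $II = 1$ needs about $11 + 31 = 42$ cycles.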

(43)

Pipeline: Resources vs Parallelism

Looking for pipeline

● With few resources (Left)

Looking for some parallelism and pipeline:

● Full unroll of the innermost loop. Resources needed and throughput of loop L1 (Middle)

Looking for high parallelism

● Full unroll of L1 and L2. Too many resources! (Right)

(44)

Pipeline: Initiation Interval Limitations

Constant bounds help…

● Example: extra comparison due to an unknown loop limit…

Memory accesses can have conflicts

● Array partitioning can help

(45)

Exploring Data Conflicts: Array Synthesis

Memory is synthesized (usually) to a BRAM by default

(46)

Array Partitioning

● #pragma HLS ARRAY_PARTITION variable=name dim=X type factor=FACTOR

where type is block, cyclic or complete
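The difference between the block and cyclic types is easiest to see as a mapping from an element index to the memory bank that holds it. The two helper functions below are hypothetical (they are not part of any Vivado HLS API) and assume the array size is a multiple of the factor.

```c
#include <assert.h>

/* cyclic: consecutive elements are dealt out to consecutive banks,
   so element i lives in bank i mod factor. */
unsigned int bank_cyclic(unsigned int index, unsigned int factor) {
    assert(factor > 0);
    return index % factor;
}

/* block: the array is cut into `factor` contiguous chunks,
   so element i lives in bank i / (size / factor). */
unsigned int bank_block(unsigned int index, unsigned int size,
                        unsigned int factor) {
    assert(factor > 0 && size % factor == 0);
    return index / (size / factor);
}
```

With 8 elements and a factor of 4, cyclic places elements 0, 4 in bank 0 and 1, 5 in bank 1, while block places elements 0, 1 in bank 0 and 2, 3 in bank 1. Accesses that fall in different banks can be served in the same cycle.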

(47)

Array Partitioning: Array Dimensions

(48)

Vivado HLS Analysis

(49)

Vivado HLS: Main views

(50)

Vivado HLS: Performance analysis

(51)

Vivado HLS: Resources

(52)

Break!

Really?! :-)

(53)

Step by step Porting

(54)

Example: Tiled MxM

void matmulBlock_hw(T a[BS][BS], T b[BS][BS], T c[BS][BS]);

…

for (i_b = 0; i_b < BSIZE; i_b++)
    for (j_b = 0; j_b < BSIZE; j_b++)
        for (k_b = 0; k_b < BSIZE; k_b++)
            matmulBlock_hw(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

…

It is better to port it to pointers, as is usually done when porting to CUDA/GPUs.

(55)

Step 1: SMP Porting of Tiled MxM w/ pointers

…

#define BSIZE 32

void matmulBlock_hw(T *a,T *b,T *c);

…

for (unsigned int i = 0; i < msize/BSIZE; i++) {
    for (unsigned int j = 0; j < msize/BSIZE; j++) {
        unsigned int const ci = j*b2size + i*BSIZE*msize;
        for (unsigned int k = 0; k < msize/BSIZE; k++) {
            unsigned int const ai = k*b2size + i*BSIZE*msize;
            unsigned int const bi = j*b2size + k*BSIZE*msize;
            matmulBlock_hw(&a[ai], &b[bi], &c[ci]);
        }
    }
}
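The three index expressions above follow a single pattern: the offset of block (row, column) in a matrix stored block by block. The helper below is a hypothetical sketch of that pattern. It assumes `b2size` is `BSIZE*BSIZE`, which the slide does not show.

```c
#include <assert.h>

#define BSIZE 32

/* Offset (in elements) of block (bi, bj) in an msize x msize matrix whose
   BSIZE x BSIZE blocks are stored contiguously, block row after block row.
   Assumption: b2size == BSIZE * BSIZE. */
unsigned int block_offset(unsigned int bi, unsigned int bj,
                          unsigned int msize) {
    const unsigned int b2size = BSIZE * BSIZE;
    assert(msize % BSIZE == 0);
    assert(bi < msize / BSIZE && bj < msize / BSIZE);
    return bj * b2size + bi * BSIZE * msize;
}
```

With this helper, `ci` corresponds to `block_offset(i, j, msize)`, `ai` to `block_offset(i, k, msize)` and `bi` to `block_offset(k, j, msize)`.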

…

Question: Which OmpSs pragma/s should you use to get a correct SMP OmpSs version?

(56)

Step 1: SMP Porting of Tiled MxM w/ pointers

…

#define BSIZE 32

const unsigned int CONST_BSIZE = BSIZE;

#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)

…

} } }

#pragma omp taskwait

…

● Task directive

● in/out/inout dependences

● Shape of the dependences: CONST_BSIZE should be a (const) variable, not a macro

● Taskwait directive!

(57)

What is the performance of the SMP version?

[Plot: Matrix Multiply speedup (seq/version) vs. matrix size (128, 256, 512, 1024) for seq_matmul and smp 32x32 2t; y-axis from 0.9 to 1.5]

(58)

What is the performance of the SMP version?

[Plot: Matrix Multiply speedup vs. matrix size (128, 256, 512, 1024); y-axis from 0.9 to 1.5]

Efficiency decreases due to the number of tasks created: too much overhead

Question: How many tasks have been created?
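The count follows directly from the loop nest of Step 1: one task is created per $(i, j, k)$ triple, and each loop runs `msize/BSIZE` times. The symbol $N_{\text{tasks}}$ is introduced here only for illustration.

```latex
\[
N_{\text{tasks}} = \left(\frac{\text{msize}}{\text{BSIZE}}\right)^{3}
\]
\[
\text{BSIZE}=32:\quad
128 \rightarrow 4^3 = 64,\quad
256 \rightarrow 8^3 = 512,\quad
512 \rightarrow 16^3 = 4096,\quad
1024 \rightarrow 32^3 = 32768
\]
```

Doubling the matrix size therefore multiplies the number of tasks by eight, while the work per task stays constant.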

(60)

Step 2: FPGA Porting of Tiled MxM w/ pointers

…

#define BSIZE 32

…

} } }

…

Question: Which OmpSs pragma/s should you use to create an accelerator implementing the task?

(61)

Step 2: FPGA Porting of Tiled MxM w/ pointers

…

#define BSIZE 32

void matmulBlock_hw(T *a, T *b, T *c);

…

} } }

● Task device (fpga)

● num_instances

● copy_deps

● Taskwait directive!
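Combining the clauses listed above with the dependence shapes from Step 1, the FPGA port might look like the sketch below. This is an assumption assembled from the earlier slides, not the exact course code: `elem_t` is taken to be `float`, the driver name `matmul` is hypothetical, and `b2size` is assumed to be `BSIZE*BSIZE`. A plain C compiler ignores the pragmas and runs everything on the host.

```c
#include <assert.h>

#define BSIZE 32
typedef float elem_t;                     /* assumed element type */
const unsigned int CONST_BSIZE = BSIZE;   /* const variable for the pragmas */

#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([CONST_BSIZE*CONST_BSIZE]a, [CONST_BSIZE*CONST_BSIZE]b) inout([CONST_BSIZE*CONST_BSIZE]c)
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c);

/* Naive block kernel: c += a * b on one BSIZE x BSIZE block. */
void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) {
    unsigned int i, j, k;
    for (i = 0; i < BSIZE; i++) {
        for (j = 0; j < BSIZE; j++) {
            elem_t sum = c[i*BSIZE + j];
            for (k = 0; k < BSIZE; k++)
                sum += a[i*BSIZE + k] * b[k*BSIZE + j];
            c[i*BSIZE + j] = sum;
        }
    }
}

/* Hypothetical driver over a blocked msize x msize matrix. */
void matmul(elem_t *a, elem_t *b, elem_t *c, unsigned int msize) {
    const unsigned int b2size = BSIZE * BSIZE;   /* assumed definition */
    assert(msize % BSIZE == 0);
    for (unsigned int i = 0; i < msize/BSIZE; i++) {
        for (unsigned int j = 0; j < msize/BSIZE; j++) {
            unsigned int const ci = j*b2size + i*BSIZE*msize;
            for (unsigned int k = 0; k < msize/BSIZE; k++) {
                unsigned int const ai = k*b2size + i*BSIZE*msize;
                unsigned int const bi = j*b2size + k*BSIZE*msize;
                matmulBlock_hw(&a[ai], &b[bi], &c[ci]);
            }
        }
    }
    /* Results are only guaranteed to be back in host memory after this. */
#pragma omp taskwait
}
```

Only the two pragma lines on the declaration differ from the SMP version; the host loop nest is unchanged.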

(62)

Step 2: Note… SMP Porting of Tiled MxM

…

#define BSIZE 32

#pragma omp target device(smp) no_copy_deps

…

} } }

…

● The target device is smp by default

● no_copy_deps, since it is the same memory space

(63)

Step 2: What is matmulBlock_hw?

…

#define BSIZE 32

…

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) {
    unsigned int i, j, k;
    for (i = 0; i < BSIZE; i++) {
        for (j = 0; j < BSIZE; j++) {
            elem_t sum = c[i*BSIZE + j];
            for (k = 0; k < BSIZE; k++)
                sum += a[i*BSIZE + k] * b[k*BSIZE + j];
            c[i*BSIZE + j] = sum;
        }
    }
}

Advice: add labels to loops!

(64)

Step 2: What is matmulBlock_hw?

…

#define BSIZE 32

…

loop_i_matmul:
for (i = 0; i < BSIZE; i++) {
    loop_j_matmul:
    for (j = 0; j < BSIZE; j++) {
        elem_t sum = c[i*BSIZE + j];
        loop_k_matmul:
        for (k = 0; k < BSIZE; k++)
            sum += a[i*BSIZE + k] * b[k*BSIZE + j];
        c[i*BSIZE + j] = sum;
    }
}
}

Advice: add labels to loops if you want to analyze the HLS synthesis with Vivado HLS.

(65)

Compilation

Few details on the steps to compile

─ Compile: use fpgacc

» [cross-compile-]fpgacc --ompss -o program program.c

» [cross-compile-]fpgacc --ompss --instrumentation -o program program.c

─ Details...

» Compiling options:

--bitstream-generation

» Linking options:

--Wf,"--board=$(BOARD_NAME),--clock=100,--hwruntime=som,--to_step=HLS"

--Wf,"-v,--name=vivado_project_name,--dir=$(VIVADO_WORKSPACE)"

(66)

Compilation

ompss@name:~/ BOARD=zedboard nohup make hls-analysis >& output.hls-analysis &

(67)

Step 3: Overall – Vivado HLS synthesis

Open Vivado HLS project:

vivado_hls -p vivado_project_name/matmul_ait/xilinx/HLS/matmulBlock_hw

(68)

Step 3: Overall – Vivado HLS synthesis

matmulBlock_hw kernel: Resources & cycles!

matmulBlock_hw kernel: Operation analysis & detailed loop iteration latency

matmulBlock_hw kernel: Initiation interval? Latency? Pipelined?

(69)

Step 3: Operation Analysis & Detail loop latency iteration

●

In orange, latency of one iteration

●

Vertical means parallel

(70)

Step 3: Operation Analysis & Detail loop latency iteration

You can right-click to see the source code

(71)

Step 3: Operation Analysis & Detail loop latency iteration

Trip count analysis: place your mouse over the orange bars in the Performance view

(72)

Step 3: Overall Loop Analysis: Latency

matmulBlock_hw kernel: Resources & cycles!

matmulBlock_hw kernel:

● Initiation interval?

➢ No initiation interval in any of the loops

● Latency?

➢ Latency of loop k is 11 cycles/iteration * 32 iterations

➢ Latency of loop j is loopk_latency * 32 iterations

➢ Latency of loop i is loopj_latency * 32 iterations

● Pipelined?

➢ No pipeline
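Working the three products through gives a feeling for the cost of the naive kernel. The estimate below ignores loop entry/exit overhead and data transfers, and assumes that `--clock=100` in the compilation options means a 100 MHz accelerator clock. The symbols $L_k$, $L_j$, $L_i$ are introduced here for illustration.

```latex
\begin{align*}
L_k &= 11 \cdot 32 = 352 \text{ cycles} \\
L_j &\approx 32 \cdot L_k = 11\,264 \text{ cycles} \\
L_i &\approx 32 \cdot L_j = 360\,448 \text{ cycles}
      \;\approx\; 3.6 \text{ ms per block at 100 MHz}
\end{align*}
```

For a 512×512 matrix with 32×32 blocks, $16^3 = 4096$ block tasks on a single accelerator would then take roughly $4096 \cdot 3.6\ \text{ms} \approx 14.8\ \text{s}$ of pure compute under these assumptions.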

(73)

Step 3: Overall – Loop Analysis: Resources

(74)

Performance Analysis: 32x32 acc, Slowdown!!!

[Plot: Matrix Multiply execution time in seconds for matrix size 512, comparing seq_matmul, smp 32x32 2t and 32x32_1acc_NAIVE; y-axis from 0 to 4 s, with one value labelled 15.380]

With almost no effort you can obtain an accelerator! However…

Question: Why is it so slow compared to the sequential version?

(75)

Step 4: How do we optimize the kernel?

…

#define BSIZE 32

…

loop_i_matmul:

…

loop_k_matmul:

…

} }}

Questions:

● Can we force HLS to pipeline a loop?

● Which directive should we add?

● Where?

(80)

Step 4: How do we optimize the kernel?

…

#define BSIZE 32

…

loop_i_matmul:

…

for (j = 0; j < BSIZE; j++) {
#pragma HLS PIPELINE II=1
    elem_t sum = c[i*BSIZE + j];
    loop_k_matmul:
    …
} }}

II=1: every cycle starts a new iteration of loop j

● Question: Are other optimizations produced due to the PIPELINE directive of this loop?

(81)

Step 4: How do we optimize the kernel?

II=1: every cycle starts a new iteration of loop j

● Maybe… loops i and j will be collapsed

● Maybe… loop k will be fully unrolled

(82)

Step 4: Overall – Vivado HLS synthesis

Open Vivado HLS project:

vivado_hls -p vivado_project_name/matmul_ait/xilinx/HLS/matmulBlock_hw

● II=16

● Loops i and j collapsed

● No loop k: fully unrolled

(83)

Step 4: Overall – Vivado HLS synthesis

● II=16

● Loops i and j collapsed (1024 iterations)

● Long iteration latency due to the reduction...

Why don't we achieve II=1?

(84)

Step 4: Vivado HLS synthesis: Data Detail...

➢ Open compilation log: vim vivado_project/matmul_ait/matmul.ait.log

➢ Look for: matmulBlock_hw_moved...

There are conflicts accessing the “a” variable…

Question: How do we solve those conflicts to allow more accesses each cycle?

(88)

Step 5: Array Partitioning

…

#define BSIZE 32

…

loop_i_matmul:

…

loop_k_matmul:

…

} }}

Question:

● Which variable/s should we partition, based on the error in the log?

● Which type of array partitioning should be used?

(89)

Step 5: Array Partitioning

…

#define BSIZE 32

const unsigned int FACTOR = BSIZE/2;

#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR
loop_i_matmul:

…

loop_k_matmul:

…

} }}

Question: Why can BSIZE be divided by two?

Let’s analyze whether we achieve II=1 with the new directive!
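One way to reason about both the observed II=16 and the choice of BSIZE/2 is a simple port-counting bound. It assumes that each array (or each partition of it) is mapped to a memory with two ports, which is the usual case for a dual-port BRAM, and that the accesses are spread evenly over the partitions. The symbols $R$, $P$ and $F$ are introduced here for illustration.

```latex
\[
II \;\ge\; \left\lceil \frac{R}{P \cdot F} \right\rceil
\qquad
\begin{aligned}
R &= \text{accesses to the array per pipelined iteration} \\
P &= \text{ports per memory (assumed 2)} \\
F &= \text{number of partitions}
\end{aligned}
\]
\[
F = 1:\ \left\lceil \tfrac{32}{2 \cdot 1} \right\rceil = 16
\qquad\qquad
F = \tfrac{\text{BSIZE}}{2} = 16:\ \left\lceil \tfrac{32}{2 \cdot 16} \right\rceil = 1
\]
```

With loop k fully unrolled, each iteration reads 32 elements of `a`, so a single two-port memory limits the pipeline to II=16, while 16 partitions with two ports each can deliver all 32 reads in one cycle.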

(90)

Step 5: Vivado HLS synthesis: Data Detail...

➢ Open compilation log: vim vivado_project/matmul_ait/matmul.ait.log

➢ Look for: matmulBlock_hw_moved...

II=16 … There are NOW conflicts accessing the “b” variable…

Question: How do we solve them to allow more accesses in the same cycle?

(91)

Step 5: Array Partitioning (II)

…

#define BSIZE 32

#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR
loop_i_matmul:

…

loop_k_matmul:

…

} }}

Question: Which type of array partitioning and which factor will you use for the “b” variable?

(92)

Step 5: Array Partitioning (II)

…

#define BSIZE 32

#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR
#pragma HLS ARRAY_PARTITION variable=b dim=1 block factor=FACTOR
loop_i_matmul:

…

loop_k_matmul:

…

} }}

Let’s analyze whether we can now achieve the target II… and which resources are used
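Collecting the directives from Steps 4 and 5, the optimized kernel might look like the sketch below. It is assembled from the slide fragments, with `elem_t` assumed to be `float` and the host-side assert added only as a sanity check. A plain C compiler ignores the HLS pragmas, so the function can be validated on the host before synthesis.

```c
#include <assert.h>

#define BSIZE 32
typedef float elem_t;                       /* assumed element type */
const unsigned int FACTOR = BSIZE / 2;      /* two ports per partition */

void matmulBlock_hw(elem_t *a, elem_t *b, elem_t *c) {
    /* a is read along a row (consecutive k): cyclic spreads neighbours.
       b is read along a column (stride BSIZE): block spreads the rows. */
#pragma HLS ARRAY_PARTITION variable=a dim=1 cyclic factor=FACTOR
#pragma HLS ARRAY_PARTITION variable=b dim=1 block factor=FACTOR
    unsigned int i, j, k;

    assert(a && b && c);   /* host-side sanity check */

loop_i_matmul:
    for (i = 0; i < BSIZE; i++) {
loop_j_matmul:
        for (j = 0; j < BSIZE; j++) {
            /* Pipelining loop j lets the tool unroll loop k completely. */
#pragma HLS PIPELINE II=1
            elem_t sum = c[i*BSIZE + j];
loop_k_matmul:
            for (k = 0; k < BSIZE; k++)
                sum += a[i*BSIZE + k] * b[k*BSIZE + j];
            c[i*BSIZE + j] = sum;
        }
    }
}
```

The choice of cyclic for `a` and block for `b` follows the access pattern of the unrolled k loop: in both cases the 32 elements read in one iteration land two per partition.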

(93)

Step 5: Vivado HLS synthesis: Data Detail...

➢ Open compilation log: vim vivado_project/matmul_ait/matmul.ait.log

➢ Look for: matmulBlock_hw_moved...

YES! Now we have achieved the target II = 1!

(94)

BEFORE Overall Loop Analysis: Latency

matmulBlock_hw kernel: Resources & cycles!

matmulBlock_hw kernel:

● Initiation interval?

➢ No initiation interval in any of the loops

● Latency?

➢ Latency of loop k is 11 cycles/iteration * 32 iterations

➢ Latency of loop j is loopk_latency * 32 iterations

➢ Latency of loop i is loopj_latency * 32 iterations

● Pipelined?

➢ No pipeline
