BSC/UPC – Introduction OmpSs@FPGA

(1)

www.bsc.es

OmpSs@FPGA

BSC/UPC – Introduction

BSC OmpSs@FPGA team

Universitat Politècnica de Catalunya, and Barcelona Supercomputing Center

May 24th , 2019

(2)

Agenda

9:30-10:30 aprox: Vertex - VS208

●

OmpSs@FPGA :

Basic Programming, Advanced Prog, Compilation and Execution

●

Vivado HLS optimizations

Directives (most common) and environment

●

Tutorial Environment:

Description: Zedboard, Hard Drive (w/ Docker), SD, etc.

Expected Results: On the hands-on...

Coffee Break

Hands-on: Campus Nord A5 Building: A5 S103 and A5S112

●

MxM OmpSs@FPGA porting + Vivado HLS opt

●

MxM tuning (increase number of accelerators vs Tile size)

●

MxM Blocking (partially provided)

●

MxM Hardware Instrumentation (partially provided)

(3)

• StarSs family key concepts

• Sequential task-based program

• View of a single address/name space

• Execution in parallel: automatic

• runtime computation of dependencies

• Productivity and portability

• OmpSs is used for prototyping extensions to OpenMP

• Tasking

• data-dependences

• Heterogeneity

• Multiple address spaces

• Mercurium compiler • …

• Nanos runtime system

StarSs

OmpSs ^COMPSs

PyCOMPSs

@ SMP @ GPU @ FPGA @ Cluster

Task-based programming

3

(4)

Execution Model (OmpSs + accelerators)

Global thread team created on startup

─ Master starts main task (also executes)

─ N workers execute tasks

─ One representative per device (accelerator)

All threads get work from a task pool

─ Accelerator kernels become tasks

─ Tasks are labeled with (at least) one target device

» smp (default), cuda, opencl, fpga

─ Scheduler decides which task to execute

─ Tasks may have several targets (implements)

Task pool

Global thread team

(5)

www.bsc.es

Easy to Program

(6)

Sequential Example: Tiled MxM

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

…

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

…

bs bs NB_J

NB_I

(7)

OmpSs Example: Tiled MxM

7 #pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

…

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

…

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

bs bs NB_J

NB_I

(8)

OmpSs Example: Tiled MxM (1 acc)

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

…

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

…

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

bs bs NB_J

NB_I

8

FPGA

Mercurium compiler with autoVivado tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically ( Stub function generated )

(9)

OmpSs Example: Tiled MxM (2 acc)

9

FPGA

BS NB_J

NB_I

BS

BS BS

#pragma omp target device(fpga) copy_deps num_instances(2)

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

…

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

…

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

Mercurium compiler with autoVivado tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically ( Stub function generated )

(10)

OmpSs@FPGA Porting: Warnings!

●

Constants and global variables needed by a FPGA task have to be moved to .fpga.h includes

●

There should be a call to the accelerated kernel (FPGA task code) in the same source code

●

Parameters shouldn’t be structs (nor array of structs)

●

Dimension and Size pragmas should be const type

(11)

OmpSs Example: MxM kernel + Vivado HLS

Mercurium compiler with autoVivado tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically ( Stub function generated )

11 #pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

bs bs NB_J

NB_I

11

FPGA

HP, Sant Cugat, January 30th, 2018 void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 **// matrix multiplication of a A*B matrix**

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) **sum += a[ia][id] * b[id][ib];**

c[ia][ib] += sum;

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 **// matrix multiplication of a A*B matrix**

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) **sum += a[ia][id] * b[id][ib];**

c[ia][ib] += sum;

}

(12)

OmpSs Example: MxM kernel + Vivado HLS

bs bs NB_J

NB_I

12

FPGA

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 **// matrix multiplication of a A*B matrix**

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) **sum += a[ia][id] * b[id][ib];**

c[ia][ib] += sum;

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 **// matrix multiplication of a A*B matrix**

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) **sum += a[ia][id] * b[id][ib];**

c[ia][ib] += sum;

}

Mercurium compiler with autoVivado tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically ( Stub function generated )

(13)

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 **// matrix multiplication of a A*B matrix**

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) **sum += a[ia][id] * b[id][ib];**

c[ia][ib] += sum;

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 **// matrix multiplication of a A*B matrix**

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) **sum += a[ia][id] * b[id][ib];**

c[ia][ib] += sum;

}

Xilinx Zynq Support ++

OmpSs FPGA Support (present and near future)

13 Zynq-7000 Family

2x Cortex-A9 cores + FPGA (32-bit platforms)

SECO AXIOM Board Zynq U+ XCZU9EG-ES2

Trenz Electronics Zynq U+

TE0808 XCZU9EG-ES1

4x Cortex-A53 cores + FPGA (64-bit platforms)

ZCU102 Board

XCZU9EG-2FFVB1156

Discrete FPGAs

and Cloud

(14)

www.bsc.es

Advanced OmpSs and OmpSs@FPGA

(15)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

scale_taskscale_task

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

#define SIZE 100

for (. . . ) {

}. . . }

main.c

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

void scale_task_smp(double *b, double *c, double *a);

void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[jj;} // or special code void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[jj;} // or special code

scale.h

scale.c

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

scale.fpga.h

scale.c

Task pool Global thread team

15

(16)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

#define SIZE 100

for (. . . ) {

}. . . }

#define SIZE 100

for (. . . ) {

}. . . }

main.c

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code

scale.h

scale.c

scale.fpga.h

scale.c

Task pool

Global thread team

(17)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

#define SIZE 100

for (. . . ) {

}. . . }

#define SIZE 100

for (. . . ) {

}. . . }

main.c

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code

scale.h

scale.c

scale.fpga.h

scale.c

Task pool Global thread team

17

(18)

OmpSs@FPGA : Data Movement & Blocking

FPGA Blocking....

… data locality

optimizations, etc ...

Mercurium will not

generate memory copies

inside the accelerator,

however the Runtime will

copy the data to accesible

FPGA memory

(19)

FPGA Blocking: Mercurium helps

19

FPGA Blocking....

… data locality

optimizations, etc ...

Arguments with no local copies

●

Addr is automatically passed in

●

Addr transparent access to SMP Memory using AXI ports

●

Transparent to the Programmer!

●

Be careful with

memcpy !!!

(20)

FPGA Blocking: User should copy data

FPGA Blocking....

… data locality

optimizations, etc ...

Arguments with no local copies

●

Addr is automatically passed in

●

Addr transparent access to SMP Memory using AXI ports

●

Transparent to the Programmer!

●

Be careful with

memcpy !!!

(21)

www.bsc.es

OmpSs@FPGA Ecosystem: Integrating

FPGA compilation into OmpSs

(22)

OmpSs@FPGA Ecosystem: autoVivado

OmpSs transformations

─ On full source

» Vivado HLS and Vivado

autoVivado

OmpSs Application

Mercurium

GCC

Nanos xtasks lib

Extrae

OmpSs.elf bitstream

SMP FPGA

OS (petaLinux) +

BOOT.bin + image.ub + taskmanager driver

OmpSs phase FPGA phase

Wrapper and Code generation

Vivado HLS

Hardware generation

Netlist

Manager Task Host code + Nanos++ calls

Instrumentation

Interconnect Vivado

xtask.config ^Netlist

(23)

Compilation

23 Few details on the steps to compile

─ Compile: use fpgacc

» [cross-compile-]fpgacc --ompss -o program program.c

» [cross-compile-]fpgacc --ompss --instrumentation -o program program.c

─ Details...

» Compiling options:

--bitstream-generation

» Linking options:

--Wf,”--board=$(BOARD_NAME),--clock=100,--task_manager”

--Wf,”-v,--name=vivado_project_name,--dir=$(VIVADO_WORKSPACE)”

(24)

Execution

Few details on the steps to run the application

─ Environment:

●

Download the bitstream:

– Zynq 7000 family: cat program.bin > /dev/xdevcfg

─ Run Program: ./program <input arguments>

» Runtime Application Configuration:

– program.xtasks.config or xtasks.config

» With instrumentation:

– export EXTRAE_CONFIG_FILE=”extrae.xml”

– NX_ARGS=”--instrumentation=extrae” ./program <input>

(25)

www.bsc.es

Vivado HLS Optimizations

(26)

OmpSs Example: Vivado HLS

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 **// matrix multiplication of a A*B matrix**

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) **sum += a[ia][id] * b[id][ib];**

c[ia][ib] += sum;

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 **// matrix multiplication of a A*B matrix**

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) **sum += a[ia][id] * b[id][ib];**

c[ia][ib] += sum;

}

(27)

Overall Overview*

* Some information and pictures are extracted from the Xilinx University Programm Vivado HLS Presentations 27

(28)

Loop Unroll

Option 1: one iteration per cycle

Option 2: several iterations per cycle in sequential

Option 3: several iterations per cycle in parallel

Be careful with the

needed resources

Memory conflicts!!

(29)

Pipeline

29 Several instances of the hardware can operate in pipleine

Initiation Interval: Number of cycles before you can start a

new instance of the hardware

(30)

Pipeline: Resources vs Parallelism

Looking for pipeline

●

With few resources (Left)

Looking for some parallelism and pipeline:

●

Full unroll of the inner-most loop. Resources needed and throughput of loop L1 (Middle)

Looking for high parallelism

●

Full unroll of L1 and L2. Too much resources!!! (Right)

(31)

Pipeline: Initiation Interval Limitations

31 Constant Bounds help…

●

Example: comparison due non known limit…

Memory Accesses can have conflicts

●

Arra Partition can help

(32)

Exploring Data Conflicts: Array Synthesis

Memory Accesses can have conflicts

●

Arra Partition can help

Memory is synthesized (usually) to a BRAM by default

(33)

Array Partitioning

33 Memory Accesses can have conflicts

●

Arra Partition can help

●

#pragma HLS ARRAY_PARTITION variable=name dim=X type

(34)

Array Partitioning: Array Dimensions

(35)

www.bsc.es

Vivado HLS Analysis

(36)

Vivado HLS: Main views

(37)

Vivado HLS: Performance analysis

37

(38)

Vivado HLS: Resources

(39)

www.bsc.es

OmpSs-at-FPGA

Environment

(40)

OmpSs@FPGA Releases

Release: Software, docker and sd image

●

Package that integrates all the software necessary to use OmpSs@FPGA

●

Tested with jenkins and automatically generated

●

URL: https://ompssatfpga.bsc.es

●

URL Downloads: ompssatfpga.bsc.es/downloads

(41)

OmpSs@FPGA Releases: Docker

41 Docker image

●

Based on Debian

●

Cross-compilation software and libraries to generate ARM and bitstreams for Zynq- 7000 and Zynq Ultrascale

●

Source code example with makefiles

●

Prepared to be able to use vivado tools (no included in the image)

(42)

OmpSs@FPGA Releases: SD image

SD image

●

Based on Ubuntu 16.04. On SD image for Zynq 7000 and another for ZU+

●

Libraries, kernels modules and PATH already setup

●

Ubuntu account

(ubuntu:ubuntu)

(43)

OmpSs@FPGA Tutorial

43 Boot from Hard Disk User BSC-Training passwd: bsc

Docker container within this user:

●

Run: sudo docker ps -a Look for version 1.3.1

●

Run: sudo docker start -a -i name_container

●

Use it to cross-compile to ARM and FPGA (PC should be connected to network) Zedboard and cable connections

●

Insert SD

●

Boot

●

Use minicom for serie

●

Use ssh/scp for connecting and copying files

(44)

www.bsc.es

Hands-on: Tutorial Exepected

Results

(45)

OmpSs Example: Tiled MxM

45 #pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

…

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

…

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

(46)

Speedup (Seq/OmpSs)

128 256 512 1024

0,900 1,100 1,300 1,500 1,700 1,900 2,100

1,80

1,87 1,87

Matrix Multiply

Speedup

seq_matmul smp 32x32 2t smp 64x64 2t

Matrix Size

Speedup (seq/version)

(47)

Paraver Analysis: SMP and Granularity help!

47 2x

(48)

Paraver Analysis: 32x32 acc, Slowdown!!!

512,000 0,000

0,500 1,000 1,500 2,000 2,500 3,000 3,500

4,000 15,380

Matrix Multiply

seq_matmul smp 32x32 2t smp 64x64 2t 32x32_1acc_NAIVE 32x32_1acc_II1 32x32_2acc_II2 64x64_II2_1acc 64x64_II2_blocking

Matrix Size

Seconds

(49)

Paraver Analysis: Why 2 acc are not helping?

49

512,000 0,000

0,500 1,000 1,500 2,000 2,500 3,000 3,500

4,000 15,380

Matrix Multiply

Matrix Size

Seconds

(50)

Paraver Analysis: 1 vs 2 accelerators

32x32 II=1 1 acc

32x32 II=2 2 acc

(51)

Paraver Analysis: 32x32 is too fast… for our smp

51

(52)

Paraver Analysis: 32x32 is too fast… for our smp

(53)

Paraver Analysis: Increasing Grain Size …. 64x64

53

512,000 0,000

0,500 1,000 1,500 2,000 2,500 3,000 3,500

4,000 15,380

Matrix Multiply

Matrix Size

Seconds

(54)

Paraver Analysis: 64x64 almost fully busy

(55)

Paraver Analysis: 64x64 almost fully busy

55

(56)

Paraver Analysis: Fully busy? FPGA Blocking

>2x

(57)

Paraver Analysis: FPGA Blocking

57

(58)

Paraver Analysis: FPGA Blocking

>2x

Register

Events (main)

Begin / End Event C

A B

Compute

(59)

Paraver Analysis: Instrumented FPGA Blocking

59 Loop K:

●

Copy In C at the beginning

●

Copy in A, B and Compute

●

Copy out C at the end C A

B ^C

om pu te A

B A

B C

User instrumentation allows to understand

the internals

(60)

Execution Time: Significant Time Reduction

128 256 512 1024

0,000 2,000 4,000 6,000 8,000 10,000 12,000 14,000 16,000 18,000 20,000

0,2

1,9

15,4

123,5

Matrix Multiply

Matrix Size

Seconds

(61)

Execution Time: Zoom in for small data sets

61

128,000 256,000

0,000 0,050 0,100 0,150 0,200 0,250 0,300

0,2

1,9

Matrix Multiply

Matrix Size

Seconds

(62)

Speedup (Seq/OmpSs)

128 256 512 1.024

0,000 1,000 2,000 3,000 4,000 5,000 6,000 7,000 8,000 9,000 10,000

1,80 1,87 1,87

4,83

7,29

9,39

Matrix Multiply

Speedup

Matrix Size

Speedup (seq/version)

(63)

OmpSs@FPGA Tutorial

63 Git clone

Doc:

●

presentation and tutorial guide (environment) Sources:

●

code to follow the tutorial Solutions:

●

it is up to you :-)

git clone https://pm.bsc.es/gitlab/ompss-at-fpga/tutorials/patc-2019.git

(64)

OmpSs@FPGA Tutorial

Git clone

Doc:

●

presentation and tutorial guide (environment) Sources:

●

code to follow the tutorial Solutions:

●

it is up to you :-)

git clone https://pm.bsc.es/gitlab/ompss-at-fpga/tutorials/patc-2019.git

HOP E YO U

ENJ OY!

(65)

BSC/UPC – Introduction OmpSs@FPGA

www.bsc.es

OmpSs@FPGA