• No se han encontrado resultados

BSC/UPC – Introduction OmpSs@FPGA

N/A
N/A
Protected

Academic year: 2023

Share "BSC/UPC – Introduction OmpSs@FPGA"

Copied!
65
0
0

Texto completo

(1)

www.bsc.es

OmpSs@FPGA

BSC/UPC – Introduction

BSC OmpSs@FPGA team

Universitat Politècnica de Catalunya, and Barcelona Supercomputing Center

May 24th , 2019

(2)

Agenda

9:30-10:30 aprox: Vertex - VS208

OmpSs@FPGA :

Basic Programming, Advanced Prog, Compilation and Execution

Vivado HLS optimizations

Directives (most common) and environment

Tutorial Environment:

Description: Zedboard, Hard Drive (w/ Docker), SD, etc.

Expected Results: On the hands-on...

Coffee Break

Hands-on: Campus Nord A5 Building: A5 S103 and A5S112

MxM OmpSs@FPGA porting + Vivado HLS opt

MxM tuning (increase number of accelerators vs Tile size)

MxM Blocking (partially provided)

MxM Hardware Instrumentation (partially provided)

(3)

• StarSs family key concepts

• Sequential task-based program

• View of a single address/name space

• Execution in parallel: automatic

• runtime computation of dependencies

• Productivity and portability

• OmpSs is used for prototyping extensions to OpenMP

• Tasking

• data-dependences

• Heterogeneity

• Multiple address spaces

• Mercurium compiler • …

• Nanos runtime system

StarSs

OmpSs COMPSs

PyCOMPSs

@ SMP @ GPU @ FPGA @ Cluster

Task-based programming

3

(4)

Execution Model (OmpSs + accelerators)

Global thread team created on startup

─ Master starts main task (also executes)

─ N workers execute tasks

─ One representative per device (accelerator)

All threads get work from a task pool

─ Accelerator kernels become tasks

─ Tasks are labeled with (at least) one target device

» smp (default), cuda, opencl, fpga

─ Scheduler decides which task to execute

─ Tasks may have several targets (implements)

Task pool

Global thread team

(5)

www.bsc.es

Easy to Program

(6)

Sequential Example: Tiled MxM

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

bs bs NB_J

NB_I

(7)

OmpSs Example: Tiled MxM

7

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

bs bs NB_J

NB_I

(8)

OmpSs Example: Tiled MxM (1 acc)

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

bs bs NB_J

NB_I

8

FPGA

Mercurium compiler with autoVivado tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically ( Stub function generated )

(9)

OmpSs Example: Tiled MxM (2 acc)

9

9

FPGA

BS NB_J

NB_I

BS

BS BS

#pragma omp target device(fpga) copy_deps num_instances(2)

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

Mercurium compiler with autoVivado tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically ( Stub function generated )

(10)

OmpSs@FPGA Porting: Warnings!

Constants and global variables needed by a FPGA task have to be moved to .fpga.h includes

There should be a call to the accelerated kernel (FPGA task code) in the same source code

Parameters shouldn’t be structs (nor array of structs)

Dimension and Size pragmas should be const type

(11)

OmpSs Example: MxM kernel + Vivado HLS

Mercurium compiler with autoVivado tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically ( Stub function generated )

11

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

bs bs NB_J

NB_I

11

FPGA

HP, Sant Cugat, January 30th, 2018 void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

(12)

OmpSs Example: MxM kernel + Vivado HLS

bs bs NB_J

NB_I

12

FPGA

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

Mercurium compiler with autoVivado tool

─ Few extensions (num_instances)

─ Triggers the bitstream generation automatically ( Stub function generated )

(13)

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

Xilinx Zynq Support ++

OmpSs FPGA Support (present and near future)

13

Zynq-7000 Family

2x Cortex-A9 cores + FPGA (32-bit platforms)

SECO AXIOM Board Zynq U+ XCZU9EG-ES2

Trenz Electronics Zynq U+

TE0808 XCZU9EG-ES1

4x Cortex-A53 cores + FPGA (64-bit platforms)

ZCU102 Board

XCZU9EG-2FFVB1156

Discrete FPGAs

and Cloud

(14)

www.bsc.es

Advanced OmpSs and OmpSs@FPGA

(15)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

scale_taskscale_task

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

main.c

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[jj;} // or special code void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[jj;} // or special code

scale.h

scale.c

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

scale.fpga.h

scale.c

Task pool Global thread team

15

(16)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

scale_taskscale_task

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

main.c

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code

scale.h

scale.c

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

scale.fpga.h

scale.c

Task pool

Global thread team

(17)

Implements & versioning: Device Tuning

Example: One single task & two different implementations

─ Scheduler decides (at runtime) which version to execute

» On resource availability (first thread requesting work)

» Smart scheduling (shortest execution time)

scale_taskscale_task

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

#include “scale.h”

#include “scale.fpga.h”

#define SIZE 100

int main (int argc, char *argv[]) {

for (. . . ) {

scale_task(B,C,&alpha);

}. . . }

main.c

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

#pragma omp target device(smp) no_copy_deps implements(scale_task)

#pragma omp task in([SIZE]c,a) out([SIZE] b)

void scale_task_smp(double *b, double *c, double *a);

void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code void scale_task_smp(double *b, double *c, double *a){

double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];} // or special code

scale.h

scale.c

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

#pragma omp target device(fpga) copy_deps num_instances(1)

#pragma omp task in([SIZE]c,a) out([SIZE] b) void scale_task(double *b, double *c, double *a);

void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; } void scale_task(double *b, double *c, double *a) {

#pragma HLS ARRAY_PARTITION variable=c complete dim=1

#pragma HLS ARRAY_PARTITION variable=b complete dim=1 double alpha=*a;

for (int j=0; j < SIZE; j++) b[j] = alpha*c[j]; }

scale.fpga.h

scale.c

Task pool Global thread team

17

(18)

OmpSs@FPGA : Data Movement & Blocking

FPGA Blocking....

… data locality

optimizations, etc ...

Mercurium will not

generate memory copies

inside the accelerator,

however the Runtime will

copy the data to accesible

FPGA memory

(19)

FPGA Blocking: Mercurium helps

19

FPGA Blocking....

… data locality

optimizations, etc ...

Arguments with no local copies

Addr is automatically passed in

Addr transparent access to SMP Memory using AXI ports

Transparent to the Programmer!

Be careful with

memcpy !!!

(20)

FPGA Blocking: User should copy data

FPGA Blocking....

… data locality

optimizations, etc ...

Arguments with no local copies

Addr is automatically passed in

Addr transparent access to SMP Memory using AXI ports

Transparent to the Programmer!

Be careful with

memcpy !!!

(21)

www.bsc.es

OmpSs@FPGA Ecosystem: Integrating

FPGA compilation into OmpSs

(22)

OmpSs@FPGA Ecosystem: autoVivado

OmpSs transformations

─ On full source

» Vivado HLS and Vivado

autoVivado

OmpSs Application

Mercurium

GCC

Nanos xtasks lib

Extrae

OmpSs.elf bitstream

SMP FPGA

OS (petaLinux) +

BOOT.bin + image.ub + taskmanager driver

OmpSs phase FPGA phase

Wrapper and Code generation

Vivado HLS

Hardware generation

Netlist

Manager Task Host code + Nanos++ calls

Instrumentation

Interconnect Vivado

xtask.config Netlist

(23)

Compilation

23

Few details on the steps to compile

─ Compile: use fpgacc

» [cross-compile-]fpgacc --ompss -o program program.c

» [cross-compile-]fpgacc --ompss --instrumentation -o program program.c

─ Details...

» Compiling options:

--bitstream-generation

» Linking options:

--Wf,”--board=$(BOARD_NAME),--clock=100,--task_manager”

--Wf,”-v,--name=vivado_project_name,--dir=$(VIVADO_WORKSPACE)”

(24)

Execution

Few details on the steps to run the application

─ Environment:

Download the bitstream:

Zynq 7000 family: cat program.bin > /dev/xdevcfg

─ Run Program: ./program <input arguments>

» Runtime Application Configuration:

– program.xtasks.config or xtasks.config

» With instrumentation:

export EXTRAE_CONFIG_FILE=”extrae.xml”

NX_ARGS=”--instrumentation=extrae” ./program <input>

(25)

www.bsc.es

Vivado HLS Optimizations

(26)

OmpSs Example: Vivado HLS

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

void matrix_multiply(float a[BS][BS], float b[BS][BS],float c[BS][BS]) {

#pragma HLS inline

int const FACTOR = BS/2;

#pragma HLS array_partition variable=a block factor=FACTOR dim=2

#pragma HLS array_partition variable=b block factor=FACTOR dim=1 // matrix multiplication of a A*B matrix

for (int ia = 0; ia < BS; ++ia) for (int ib = 0; ib < BS; ++ib) {

#pragma HLS PIPELINE II=1 float sum = 0;

for (int id = 0; id < BS; ++id) sum += a[ia][id] * b[id][ib];

c[ia][ib] += sum;

}

}

(27)

Overall Overview*

* Some information and pictures are extracted from the Xilinx University Programm Vivado HLS Presentations 27

(28)

Loop Unroll

Option 1: one iteration per cycle

Option 2: several iterations per cycle in sequential

Option 3: several iterations per cycle in parallel

Be careful with the

needed resources

Memory conflicts!!

(29)

Pipeline

29

Several instances of the hardware can operate in pipleine

Initiation Interval: Number of cycles before you can start a

new instance of the hardware

(30)

Pipeline: Resources vs Parallelism

Looking for pipeline

With few resources (Left)

Looking for some parallelism and pipeline:

Full unroll of the inner-most loop. Resources needed and throughput of loop L1 (Middle)

Looking for high parallelism

Full unroll of L1 and L2. Too much resources!!! (Right)

(31)

Pipeline: Initiation Interval Limitations

31

Constant Bounds help…

Example: comparison due non known limit…

Memory Accesses can have conflicts

Arra Partition can help

(32)

Exploring Data Conflicts: Array Synthesis

Memory Accesses can have conflicts

Arra Partition can help

Memory is synthesized (usually) to a BRAM by default

(33)

Array Partitioning

33

Memory Accesses can have conflicts

Arra Partition can help

#pragma HLS ARRAY_PARTITION variable=name dim=X type

(34)

Array Partitioning: Array Dimensions

(35)

www.bsc.es

Vivado HLS Analysis

(36)

Vivado HLS: Main views

(37)

Vivado HLS: Performance analysis

37

(38)

Vivado HLS: Resources

(39)

www.bsc.es

OmpSs-at-FPGA

Environment

(40)

OmpSs@FPGA Releases

Release: Software, docker and sd image

Package that integrates all the software necessary to use OmpSs@FPGA

Tested with jenkins and automatically generated

URL: https://ompssatfpga.bsc.es

URL Downloads: ompssatfpga.bsc.es/downloads

(41)

OmpSs@FPGA Releases: Docker

41

Docker image

Based on Debian

Cross-compilation software and libraries to generate ARM and bitstreams for Zynq- 7000 and Zynq Ultrascale

Source code example with makefiles

Prepared to be able to use vivado tools (no included in the image)

(42)

OmpSs@FPGA Releases: SD image

SD image

Based on Ubuntu 16.04. On SD image for Zynq 7000 and another for ZU+

Libraries, kernels modules and PATH already setup

Ubuntu account

(ubuntu:ubuntu)

(43)

OmpSs@FPGA Tutorial

43

Boot from Hard Disk User BSC-Training passwd: bsc

Docker container within this user:

Run: sudo docker ps -a Look for version 1.3.1

Run: sudo docker start -a -i name_container

Use it to cross-compile to ARM and FPGA (PC should be connected to network) Zedboard and cable connections

Insert SD

Boot

Use minicom for serie

Use ssh/scp for connecting and copying files

(44)

www.bsc.es

Hands-on: Tutorial Exepected

Results

(45)

OmpSs Example: Tiled MxM

45

#pragma omp task in([BS]a,[BS]b) inout([BS]c)

void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);

for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)

matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);

#pragma omp taskwait // Or

// other tasks depending on input output c of matrix multiply task

(46)

Speedup (Seq/OmpSs)

128 256 512 1024

0,900 1,100 1,300 1,500 1,700 1,900 2,100

1,80

1,87 1,87

Matrix Multiply

Speedup

seq_matmul smp 32x32 2t smp 64x64 2t

Matrix Size

Speedup (seq/version)

(47)

Paraver Analysis: SMP and Granularity help!

47

2x

(48)

Paraver Analysis: 32x32 acc, Slowdown!!!

512,000 0,000

0,500 1,000 1,500 2,000 2,500 3,000 3,500

4,000 15,380

Matrix Multiply

seq_matmul smp 32x32 2t smp 64x64 2t 32x32_1acc_NAIVE 32x32_1acc_II1 32x32_2acc_II2 64x64_II2_1acc 64x64_II2_blocking

Matrix Size

Seconds

(49)

Paraver Analysis: Why 2 acc are not helping?

49

512,000 0,000

0,500 1,000 1,500 2,000 2,500 3,000 3,500

4,000 15,380

Matrix Multiply

seq_matmul smp 32x32 2t smp 64x64 2t 32x32_1acc_NAIVE 32x32_1acc_II1 32x32_2acc_II2 64x64_II2_1acc 64x64_II2_blocking

Matrix Size

Seconds

(50)

Paraver Analysis: 1 vs 2 accelerators

32x32 II=1 1 acc

32x32 II=2 2 acc

(51)

Paraver Analysis: 32x32 is too fast… for our smp

51

(52)

Paraver Analysis: 32x32 is too fast… for our smp

(53)

Paraver Analysis: Increasing Grain Size …. 64x64

53

512,000 0,000

0,500 1,000 1,500 2,000 2,500 3,000 3,500

4,000 15,380

Matrix Multiply

seq_matmul smp 32x32 2t smp 64x64 2t 32x32_1acc_NAIVE 32x32_1acc_II1 32x32_2acc_II2 64x64_II2_1acc 64x64_II2_blocking

Matrix Size

Seconds

(54)

Paraver Analysis: 64x64 almost fully busy

(55)

Paraver Analysis: 64x64 almost fully busy

55

(56)

Paraver Analysis: Fully busy? FPGA Blocking

>2x

(57)

Paraver Analysis: FPGA Blocking

57

(58)

Paraver Analysis: FPGA Blocking

>2x

Register

Events (main)

Begin / End Event C

A B

Compute

(59)

Paraver Analysis: Instrumented FPGA Blocking

59

Loop K:

Copy In C at the beginning

Copy in A, B and Compute

Copy out C at the end C A

B C

om pu te A

B A

B A

B A

B A

B A

B A

B C

User instrumentation allows to understand

the internals

(60)

Execution Time: Significant Time Reduction

128 256 512 1024

0,000 2,000 4,000 6,000 8,000 10,000 12,000 14,000 16,000 18,000 20,000

0,2

1,9

15,4

123,5

Matrix Multiply

seq_matmul smp 32x32 2t smp 64x64 2t 32x32_1acc_NAIVE 32x32_1acc_II1 32x32_2acc_II2 64x64_II2_1acc 64x64_II2_blocking

Matrix Size

Seconds

(61)

Execution Time: Zoom in for small data sets

61

128,000 256,000

0,000 0,050 0,100 0,150 0,200 0,250 0,300

0,2

1,9

Matrix Multiply

seq_matmul smp 32x32 2t smp 64x64 2t 32x32_1acc_NAIVE 32x32_1acc_II1 32x32_2acc_II2 64x64_II2_1acc 64x64_II2_blocking

Matrix Size

Seconds

(62)

Speedup (Seq/OmpSs)

128 256 512 1.024

0,000 1,000 2,000 3,000 4,000 5,000 6,000 7,000 8,000 9,000 10,000

1,80 1,87 1,87

4,83

7,29

9,39

Matrix Multiply

Speedup

seq_matmul smp 32x32 2t smp 64x64 2t 32x32_1acc_NAIVE 32x32_1acc_II1 32x32_2acc_II2 64x64_II2_1acc 64x64_II2_blocking

Matrix Size

Speedup (seq/version)

(63)

OmpSs@FPGA Tutorial

63

Git clone

Doc:

presentation and tutorial guide (environment) Sources:

code to follow the tutorial Solutions:

it is up to you :-)

git clone https://pm.bsc.es/gitlab/ompss-at-fpga/tutorials/patc-2019.git

(64)

OmpSs@FPGA Tutorial

Git clone

Doc:

presentation and tutorial guide (environment) Sources:

code to follow the tutorial Solutions:

it is up to you :-)

git clone https://pm.bsc.es/gitlab/ompss-at-fpga/tutorials/patc-2019.git

HOP E YO U

ENJ OY!

(65)

www.bsc.es

OmpSs@FPGA

BSC/UPC – Introduction

BSC OmpSs@FPGA team

Universitat Politècnica de Catalunya, and Barcelona Supercomputing Center

May 24th , 2019

Referencias

Documento similar

The proposed architecture has been implemented in a Xilinx Zynq Ultrascale+ MPSoC platform (XCZU9EG-2FFVB1156E FPGA with a grade speed of -2) at a work frequency of

Government policy varies between nations and this guidance sets out the need for balanced decision-making about ways of working, and the ongoing safety considerations

After them, you will be able to build different boot files just using the AIT option or executing the steps in any of the following sections: Petalinux (2016.3) build for a custom

–  Data copies to/from device memory –  Manual work scheduling.. –  Code and data management from

Compile the program using the bitstream-zedboard target of the Makefile, copy the binary files and bitstream to the Zynq system, execute using instrumentation and copy back the

#pragma omp task in([SIZE]c,a) out([SIZE]b) void scale_task(double *b, double *c, double *a);. const

Compile the program using the bitstream-zedboard target of the Makefile, copy the binary files and bitstream to the Zynq system, execute using instrumentation and copy back the

- User program introduce the memory copies in the program to access host memory. - Dataflow pragma can be used with functions