www.bsc.es
OmpSs@FPGA
BSC/UPC – Introduction
BSC OmpSs@FPGA team
Universitat Politècnica de Catalunya, and Barcelona Supercomputing Center
May 24th, 2019
Agenda
9:30-10:30 (approx.): Vertex - VS208
● OmpSs@FPGA: Basic Programming, Advanced Programming, Compilation and Execution
● Vivado HLS optimizations: Directives (most common) and environment
● Tutorial Environment:
  Description: Zedboard, Hard Drive (w/ Docker), SD, etc.
  Expected Results: On the hands-on...
Coffee Break
Hands-on: Campus Nord A5 Building: A5 S103 and A5 S112
● MxM OmpSs@FPGA porting + Vivado HLS opt
● MxM tuning (increase number of accelerators vs Tile size)
● MxM Blocking (partially provided)
● MxM Hardware Instrumentation (partially provided)
• StarSs family key concepts
• Sequential task-based program
• View of a single address/name space
• Execution in parallel: automatic
• runtime computation of dependencies
• Productivity and portability
• OmpSs is used for prototyping extensions to OpenMP
• Tasking
• data-dependences
• Heterogeneity
• Multiple address spaces
• Mercurium compiler • …
• Nanos runtime system
[Diagram: the StarSs family: OmpSs, COMPSs and PyCOMPSs; targets: @SMP, @GPU, @FPGA, @Cluster]
Task-based programming
Execution Model (OmpSs + accelerators)
Global thread team created on startup
─ Master starts main task (also executes)
─ N workers execute tasks
─ One representative per device (accelerator)
All threads get work from a task pool
─ Accelerator kernels become tasks
─ Tasks are labeled with (at least) one target device
» smp (default), cuda, opencl, fpga
─ Scheduler decides which task to execute
─ Tasks may have several targets (implements)
[Figure: task pool feeding the global thread team]
Easy to Program
Sequential Example: Tiled MxM
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)
matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
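The sequential loop nest above can be sketched as a complete C file. The tile size, the block counts, the element type `T = float`, the tile-major layout of `AA`, `BB` and `CC`, and the helper `tiled_matmul_max_error` are all assumptions chosen for illustration; the slides elide them.

```c
/* Illustrative sizes (assumptions, not taken from the slides). */
#define BS   4   /* tile edge */
#define NB_I 2   /* tiles along the rows of C */
#define NB_J 2   /* tiles along the columns of C */
#define NB_K 2   /* tiles along the shared dimension */
typedef float T;

/* Tile-major storage: AA[i_b][k_b] is one BS x BS tile. */
static T AA[NB_I][NB_K][BS][BS];
static T BB[NB_K][NB_J][BS][BS];
static T CC[NB_I][NB_J][BS][BS];

/* One tile product with accumulation: c += a * b. */
void matrix_multiply(T a[BS][BS], T b[BS][BS], T c[BS][BS]) {
    for (int i = 0; i < BS; ++i)
        for (int j = 0; j < BS; ++j) {
            T sum = 0;
            for (int k = 0; k < BS; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] += sum;
        }
}

/* The loop nest from the slide. */
void tiled_matmul(void) {
    for (int i_b = 0; i_b < NB_I; i_b++)
        for (int j_b = 0; j_b < NB_J; j_b++)
            for (int k_b = 0; k_b < NB_K; k_b++)
                matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
}

/* Element accessors that hide the tile-major layout. */
static T a_at(int r, int c) { return AA[r / BS][c / BS][r % BS][c % BS]; }
static T b_at(int r, int c) { return BB[r / BS][c / BS][r % BS][c % BS]; }
static T c_at(int r, int c) { return CC[r / BS][c / BS][r % BS][c % BS]; }

/* Fill A and B with small integers, run the tiled product and return the
 * largest absolute difference against a plain triple loop. */
double tiled_matmul_max_error(void) {
    for (int r = 0; r < NB_I * BS; ++r)
        for (int c = 0; c < NB_K * BS; ++c)
            AA[r / BS][c / BS][r % BS][c % BS] = (T)((r + c) % 3);
    for (int r = 0; r < NB_K * BS; ++r)
        for (int c = 0; c < NB_J * BS; ++c)
            BB[r / BS][c / BS][r % BS][c % BS] = (T)((r * 2 + c) % 5);
    for (int r = 0; r < NB_I * BS; ++r)
        for (int c = 0; c < NB_J * BS; ++c)
            CC[r / BS][c / BS][r % BS][c % BS] = 0;

    tiled_matmul();

    double max_err = 0.0;
    for (int r = 0; r < NB_I * BS; ++r)
        for (int c = 0; c < NB_J * BS; ++c) {
            double ref = 0.0;
            for (int k = 0; k < NB_K * BS; ++k)
                ref += (double)a_at(r, k) * (double)b_at(k, c);
            double err = (double)c_at(r, c) - ref;
            if (err < 0) err = -err;
            if (err > max_err) max_err = err;
        }
    return max_err;
}
```

Because the inputs are small integers, the tiled and the reference results should agree exactly; a non-zero return value would point at an indexing mistake in the tiling.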
[Figure: matrices partitioned into NB_I × NB_J tiles of bs × bs elements]
OmpSs Example: Tiled MxM
#pragma omp task in([BS]a,[BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)
matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
#pragma omp taskwait // Or
// other tasks depending on input output c of matrix multiply task
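To see how much parallelism the `in`/`inout` clauses above expose, one can count dependence chains. The A and B tiles are only read, so in this sketch the only dependences come from `inout([BS]c)`: tasks that update the same `CC[i_b][j_b]` tile run one after another, while tasks on different tiles are independent. The struct and function names below are hypothetical, and the code only models the dependence analysis; it is not part of the runtime.

```c
#include <stdlib.h>

typedef struct {
    int ready_at_start;   /* tasks with no predecessor */
    int longest_chain;    /* number of tasks on the longest dependence chain */
} dep_summary;

/* Walk the tasks in program order and track, per C tile, how many tasks
 * have already been created that update it. */
dep_summary tiled_mxm_dependences(int nb_i, int nb_j, int nb_k) {
    dep_summary s = {0, 0};
    if (nb_i <= 0 || nb_j <= 0 || nb_k <= 0) return s;

    int *chain = calloc((size_t)nb_i * (size_t)nb_j, sizeof *chain);
    if (chain == NULL) return s;

    for (int i_b = 0; i_b < nb_i; i_b++)
        for (int j_b = 0; j_b < nb_j; j_b++)
            for (int k_b = 0; k_b < nb_k; k_b++) {
                int *len = &chain[i_b * nb_j + j_b];
                if (*len == 0) s.ready_at_start++;  /* first writer of this tile */
                (*len)++;                           /* later writers wait for it */
                if (*len > s.longest_chain) s.longest_chain = *len;
            }

    free(chain);
    return s;
}
```

For a 4 × 4 × 4 blocking this gives 16 tasks that can start immediately and chains of 4 tasks each, which is the parallelism the scheduler can spread over workers and accelerators.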
OmpSs Example: Tiled MxM (1 acc)
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([BS]a,[BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)
matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
#pragma omp taskwait // Or
// other tasks depending on input output c of matrix multiply task
Mercurium compiler with autoVivado tool
─ Few extensions (num_instances)
─ Triggers the bitstream generation automatically ( Stub function generated )
OmpSs Example: Tiled MxM (2 acc)
#pragma omp target device(fpga) copy_deps num_instances(2)
#pragma omp task in([BS]a,[BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)
matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
#pragma omp taskwait // Or
// other tasks depending on input output c of matrix multiply task
OmpSs@FPGA Porting: Warnings!
● Constants and global variables needed by an FPGA task have to be moved to .fpga.h includes
● There should be a call to the accelerated kernel (FPGA task code) in the same source file
● Parameters shouldn't be structs (nor arrays of structs)
● Dimensions and sizes used in pragmas should be of const type
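The warnings above can be illustrated with a small sketch. Everything named here (`kernel.fpga.h`, `scale_add`, `scale_add_points`, `point`, `N`, `SCALE`) is hypothetical, and the two files are shown in one listing for brevity. A plain C compiler ignores the OmpSs pragmas, so the code also runs sequentially.

```c
/* --- contents of a hypothetical "kernel.fpga.h" -------------------- */
#define N 8                    /* size used in the pragmas: a constant */
static const int SCALE = 3;    /* constant needed by the FPGA task     */

#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([N]x, [N]y) out([N]r)
void scale_add(int *x, int *y, int *r);

/* --- contents of the application source file ----------------------- */
typedef struct { int x; int y; } point;

/* FPGA task code: takes plain arrays, not an array of structs. */
void scale_add(int *x, int *y, int *r) {
    for (int i = 0; i < N; ++i)
        r[i] = SCALE * x[i] + y[i];
}

/* Host-side helper: unpacks the structs into plain arrays and calls the
 * kernel from the same source file that contains the task code. */
void scale_add_points(const point p[N], int r[N]) {
    int x[N], y[N];
    for (int i = 0; i < N; ++i) {
        x[i] = p[i].x;
        y[i] = p[i].y;
    }
    scale_add(x, y, r);
}
```

Flattening the struct array on the host costs one extra copy, but it keeps the task interface to plain arrays of a basic type.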
OmpSs Example: MxM kernel + Vivado HLS
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([BS]a,[BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
void matrix_multiply(float a[BS][BS], float b[BS][BS], float c[BS][BS]) {
#pragma HLS inline
  int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1
  // matrix multiplication of a A*B matrix
  for (int ia = 0; ia < BS; ++ia)
    for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1
      float sum = 0;
      for (int id = 0; id < BS; ++id)
        sum += a[ia][id] * b[id][ib];
      c[ia][ib] += sum;
    }
}
Xilinx Zynq Support ++
OmpSs FPGA Support (present and near future)
Zynq-7000 Family: 2x Cortex-A9 cores + FPGA (32-bit platforms)
4x Cortex-A53 cores + FPGA (64-bit platforms):
● SECO AXIOM Board, Zynq U+ XCZU9EG-ES2
● Trenz Electronics Zynq U+ TE0808, XCZU9EG-ES1
● ZCU102 Board, XCZU9EG-2FFVB1156
Discrete FPGAs and Cloud
Advanced OmpSs and OmpSs@FPGA
Implements & versioning: Device Tuning
Example: One single task & two different implementations
─ Scheduler decides (at runtime) which version to execute
» On resource availability (first thread requesting work)
» Smart scheduling (shortest execution time)
main.c:
#include "scale.h"
#include "scale.fpga.h"
#define SIZE 100
int main (int argc, char *argv[]) {
  for (. . . ) {
    scale_task(B,C,&alpha);
  }. . . }

scale.h:
#pragma omp target device(smp) no_copy_deps implements(scale_task)
#pragma omp task in([SIZE]c,a) out([SIZE] b)
void scale_task_smp(double *b, double *c, double *a);

scale.c:
void scale_task_smp(double *b, double *c, double *a){
  double alpha=*a;
  for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];
} // or special code

scale.fpga.h:
#pragma omp target device(fpga) copy_deps num_instances(1)
#pragma omp task in([SIZE]c,a) out([SIZE] b)
void scale_task(double *b, double *c, double *a);

scale.c:
void scale_task(double *b, double *c, double *a) {
#pragma HLS ARRAY_PARTITION variable=c complete dim=1
#pragma HLS ARRAY_PARTITION variable=b complete dim=1
  double alpha=*a;
  for (int j=0; j < SIZE; j++) b[j] = alpha*c[j];
}

[Figure: both versions of scale_task enter the task pool served by the global thread team]
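The "smart scheduling (shortest execution time)" idea can be pictured with a toy model. This is not Nanos++ code and does not claim to reproduce its policy: the types and functions below are hypothetical, and the rule used (try every version once, then prefer the lowest running mean) is an assumption made only for illustration.

```c
#include <stddef.h>

typedef struct {
    const char *name;   /* e.g. "smp" or "fpga" */
    double mean_time;   /* running mean of observed execution times (s) */
    int runs;           /* number of observations so far */
} version_stats;

/* Fold one measured execution time into the running mean. */
void record_time(version_stats *v, double seconds) {
    v->runs++;
    v->mean_time += (seconds - v->mean_time) / v->runs;
}

/* Return the index of the version to run next: any version that has not
 * been measured yet, otherwise the one with the lowest mean time. */
size_t pick_version(const version_stats *v, size_t n) {
    size_t best = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i].runs == 0) return i;
        if (v[i].mean_time < v[best].mean_time) best = i;
    }
    return best;
}
```

A caller would time each execution of `scale_task_smp` or `scale_task`, pass the result to `record_time`, and ask `pick_version` before launching the next instance. The other policy on the slide, resource availability, needs no statistics: the first idle thread simply takes the task.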
OmpSs@FPGA: Data Movement & Blocking
FPGA Blocking... data locality optimizations, etc.
Mercurium will not generate memory copies inside the accelerator; however, the runtime will copy the data to accessible FPGA memory
FPGA Blocking: Mercurium helps
FPGA Blocking... data locality optimizations, etc.
Arguments with no local copies
● Addr is automatically passed in
● Addr gives transparent access to SMP memory using AXI ports
● Transparent to the programmer!
● Be careful with memcpy!!!
FPGA Blocking: User should copy data
FPGA Blocking... data locality optimizations, etc.
Arguments with no local copies
● Addr is automatically passed in
● Addr gives transparent access to SMP memory using AXI ports
● Transparent to the programmer!
● Be careful with memcpy!!!
OmpSs@FPGA Ecosystem: Integrating FPGA compilation into OmpSs
OmpSs@FPGA Ecosystem: autoVivado
OmpSs transformations
─ On full source
» Vivado HLS and Vivado
[Figure: autoVivado tool flow. Labels: OmpSs Application, Mercurium; OmpSs phase: host code + Nanos++ calls, GCC, Nanos, xtasks lib, Extrae, OmpSs.elf (SMP); FPGA phase: wrapper and code generation, Vivado HLS, hardware generation, netlist, task manager, instrumentation, interconnect, Vivado, xtask.config, bitstream (FPGA); OS (petaLinux) + BOOT.bin + image.ub + taskmanager driver]
Compilation
Few details on the steps to compile
─ Compile: use fpgacc
» [cross-compile-]fpgacc --ompss -o program program.c
» [cross-compile-]fpgacc --ompss --instrumentation -o program program.c
─ Details...
» Compiling options:
--bitstream-generation
» Linking options:
--Wf,"--board=$(BOARD_NAME),--clock=100,--task_manager"
--Wf,"-v,--name=vivado_project_name,--dir=$(VIVADO_WORKSPACE)"
Execution
Few details on the steps to run the application
─ Environment:
● Download the bitstream:
– Zynq 7000 family: cat program.bin > /dev/xdevcfg
─ Run Program: ./program <input arguments>
» Runtime Application Configuration:
– program.xtasks.config or xtasks.config
» With instrumentation:
– export EXTRAE_CONFIG_FILE="extrae.xml"
– NX_ARGS="--instrumentation=extrae" ./program <input>
Vivado HLS Optimizations
OmpSs Example: Vivado HLS
void matrix_multiply(float a[BS][BS], float b[BS][BS], float c[BS][BS]) {
#pragma HLS inline
  int const FACTOR = BS/2;
#pragma HLS array_partition variable=a block factor=FACTOR dim=2
#pragma HLS array_partition variable=b block factor=FACTOR dim=1
  // matrix multiplication of a A*B matrix
  for (int ia = 0; ia < BS; ++ia)
    for (int ib = 0; ib < BS; ++ib) {
#pragma HLS PIPELINE II=1
      float sum = 0;
      for (int id = 0; id < BS; ++id)
        sum += a[ia][id] * b[id][ib];
      c[ia][ib] += sum;
    }
}
Overall Overview*
* Some information and pictures are extracted from the Xilinx University Program Vivado HLS Presentations
Loop Unroll
Option 1: one iteration per cycle
Option 2: several iterations per cycle, executed sequentially
Option 3: several iterations per cycle, executed in parallel
Be careful with the needed resources
Memory conflicts!!
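What unrolling does to a loop body can be written out by hand. The sketch below unrolls a reduction by a factor of 4; the function names and the size `N` are illustrative. In Vivado HLS the same transformation is normally requested with `#pragma HLS unroll factor=4` rather than written manually.

```c
#define N 16   /* trip count, a multiple of the unroll factor */

/* Rolled loop: one addition per iteration. */
int sum_rolled(const int x[N]) {
    int s = 0;
    for (int i = 0; i < N; ++i)
        s += x[i];
    return s;
}

/* Unrolled by 4: the body is replicated, so each iteration needs four
 * adders and four reads of x. */
int sum_unrolled4(const int x[N]) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

The four independent partial sums are what allows the copies to run in parallel (option 3). If `x` sits in a memory with fewer than four read ports, the reads are serialized, which is the memory conflict the slide warns about.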
Pipeline
Several instances of the hardware can operate in a pipeline
Initiation Interval: number of cycles before you can start a new instance of the hardware
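The effect of the initiation interval can be estimated with the usual idealized model: the first iteration takes the full iteration latency, and every following iteration starts II cycles after the previous one. The function names are hypothetical, and the model ignores stalls, loop entry/exit overhead and memory conflicts.

```c
/* Cycles for `trip` iterations of `depth` cycles each, without pipelining. */
long sequential_cycles(long trip, long depth) {
    return trip <= 0 ? 0 : trip * depth;
}

/* Cycles for the same loop pipelined with initiation interval `ii`:
 * depth + ii * (trip - 1). */
long pipelined_cycles(long trip, long depth, long ii) {
    return trip <= 0 ? 0 : depth + ii * (trip - 1);
}
```

For example, 100 iterations of a 5-cycle body take 500 cycles sequentially, 104 cycles with II=1 and 203 cycles with II=2 under this model.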
Pipeline: Resources vs Parallelism
Looking for pipeline
● With few resources (Left)
Looking for some parallelism and pipeline:
● Full unroll of the inner-most loop. Resources needed and throughput of loop L1 (Middle)
Looking for high parallelism
● Full unroll of L1 and L2. Too many resources!!! (Right)
Pipeline: Initiation Interval Limitations
Constant Bounds help…
● Example: comparison due to an unknown limit…
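The slide's own example is not recoverable from the text, so the following is only one common way to illustrate the point. `MAX_N` and both function names are assumptions. With a run-time bound the tool cannot know the trip count; with a constant bound and a guard inside the body, the loop has a fixed shape that can be unrolled or pipelined more predictably.

```c
#define MAX_N 64   /* compile-time upper bound (illustrative) */

/* Trip count only known at run time. */
int sum_variable_bound(const int *x, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Constant trip count; iterations past n do nothing.
 * Requires 0 <= n <= MAX_N. */
int sum_constant_bound(const int *x, int n) {
    int s = 0;
    for (int i = 0; i < MAX_N; ++i)
        if (i < n)
            s += x[i];
    return s;
}
```

The trade-off is that the constant-bound version always runs `MAX_N` iterations, so it pays off only when `n` is usually close to the bound.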
Memory Accesses can have conflicts
● Array Partition can help
Exploring Data Conflicts: Array Synthesis
Memory Accesses can have conflicts
● Array Partition can help
Memory is synthesized (usually) to a BRAM by default
Array Partitioning
Memory Accesses can have conflicts
● Array Partition can help
● #pragma HLS ARRAY_PARTITION variable=<name> <type> factor=<N> dim=<X> (type: block, cyclic or complete)
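How the partition type decides which elements can be read in the same cycle can be sketched by computing the bank each index lands in. The enum and function names are hypothetical; the mapping follows the usual definitions (block: consecutive chunks, cyclic: interleaved elements, complete: one register per element).

```c
typedef enum { PART_BLOCK, PART_CYCLIC, PART_COMPLETE } part_type;

/* Bank that element i of an n-element dimension is placed in when the
 * dimension is split into `factor` banks. */
int bank_of(int i, int n, int factor, part_type type) {
    switch (type) {
    case PART_BLOCK: {
        int chunk = (n + factor - 1) / factor;  /* elements per bank */
        return i / chunk;
    }
    case PART_CYCLIC:
        return i % factor;
    default:                                    /* PART_COMPLETE */
        return i;
    }
}
```

With n = 8 and factor = 2, block partitioning puts elements 0..3 in bank 0 and 4..7 in bank 1, while cyclic partitioning alternates banks. Two accesses conflict only when they hit the same bank more often than it has ports, so the better choice depends on the access pattern of the loop.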
Array Partitioning: Array Dimensions
Vivado HLS Analysis
Vivado HLS: Main views
Vivado HLS: Performance analysis
Vivado HLS: Resources
OmpSs-at-FPGA Environment
OmpSs@FPGA Releases
Release: Software, docker and sd image
● Package that integrates all the software necessary to use OmpSs@FPGA
● Tested with Jenkins and automatically generated
● URL: https://ompssatfpga.bsc.es
● URL Downloads: ompssatfpga.bsc.es/downloads
OmpSs@FPGA Releases: Docker
Docker image
● Based on Debian
● Cross-compilation software and libraries to generate ARM binaries and bitstreams for Zynq-7000 and Zynq UltraScale+
● Source code example with makefiles
● Prepared to be able to use Vivado tools (not included in the image)
OmpSs@FPGA Releases: SD image
SD image
● Based on Ubuntu 16.04. One SD image for Zynq 7000 and another for ZU+
● Libraries, kernel modules and PATH already set up
● Ubuntu account (ubuntu:ubuntu)
OmpSs@FPGA Tutorial
Boot from Hard Disk. User: BSC-Training, passwd: bsc
Docker container within this user:
● Run: sudo docker ps -a and look for version 1.3.1
● Run: sudo docker start -a -i name_container
● Use it to cross-compile to ARM and FPGA (PC should be connected to network)
Zedboard and cable connections
● Insert SD
● Boot
● Use minicom for the serial console
● Use ssh/scp for connecting and copying files
Hands-on: Tutorial Expected Results
OmpSs Example: Tiled MxM
#pragma omp task in([BS]a,[BS]b) inout([BS]c)
void matrix_multiply(T a[BS][BS],T b[BS][BS],T c[BS][BS]);
…
for (i_b=0; i_b<NB_I; i_b++) for (j_b=0; j_b<NB_J; j_b++) for (k_b=0; k_b<NB_K; k_b++)
matrix_multiply(AA[i_b][k_b], BB[k_b][j_b], CC[i_b][j_b]);
…
#pragma omp taskwait // Or
// other tasks depending on input output c of matrix multiply task
Speedup (Seq/OmpSs)
[Figure: Matrix Multiply speedup (seq/version) vs. matrix size (128, 256, 512, 1024); series: seq_matmul, smp 32x32 2t, smp 64x64 2t; data labels shown: 1.80, 1.87, 1.87]
Paraver Analysis: SMP and Granularity help!
[Figure: Paraver trace, annotated "2x"]
Paraver Analysis: 32x32 acc, Slowdown!!!
[Figure: Matrix Multiply execution time (seconds) at matrix size 512; series: seq_matmul, smp 32x32 2t, smp 64x64 2t, 32x32_1acc_NAIVE, 32x32_1acc_II1, 32x32_2acc_II2, 64x64_II2_1acc, 64x64_II2_blocking; data label shown: 15.380]
Paraver Analysis: Why 2 acc are not helping?
[Figure: Matrix Multiply execution time (seconds) at matrix size 512; series: seq_matmul, smp 32x32 2t, smp 64x64 2t, 32x32_1acc_NAIVE, 32x32_1acc_II1, 32x32_2acc_II2, 64x64_II2_1acc, 64x64_II2_blocking; data label shown: 15.380]
Paraver Analysis: 1 vs 2 accelerators
32x32 II=1 1 acc
32x32 II=2 2 acc
Paraver Analysis: 32x32 is too fast… for our smp
Paraver Analysis: Increasing Grain Size …. 64x64
[Figure: Matrix Multiply execution time (seconds) at matrix size 512; series: seq_matmul, smp 32x32 2t, smp 64x64 2t, 32x32_1acc_NAIVE, 32x32_1acc_II1, 32x32_2acc_II2, 64x64_II2_1acc, 64x64_II2_blocking; data label shown: 15.380]
Paraver Analysis: 64x64 almost fully busy
Paraver Analysis: Fully busy? FPGA Blocking
>2x
Paraver Analysis: FPGA Blocking
[Figure: Paraver trace, annotated ">2x"; labels: Register Events (main), Begin / End Event, C, A, B, Compute]
Paraver Analysis: Instrumented FPGA Blocking
Loop K:
● Copy in C at the beginning
● Copy in A, B and compute
● Copy out C at the end
[Figure: instrumented trace showing the C copy-in, repeated A and B copies with Compute, and the final C copy-out]
User instrumentation helps to understand the internals
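The "Loop K" steps above can be sketched as a blocked kernel in which one task produces a whole C tile. The sizes, the function names and the use of local arrays as stand-ins for on-chip buffers are assumptions; the tutorial's own blocked kernel is not shown in the slides. Explicit copy loops are used instead of `memcpy`, in line with the earlier warning.

```c
#define BS   4   /* tile edge (illustrative) */
#define NB_K 3   /* tiles along the shared dimension (illustrative) */

/* Element-wise tile copy; an explicit loop stands in for memcpy. */
static void copy_tile(float dst[BS][BS], float src[BS][BS]) {
    for (int i = 0; i < BS; ++i)
        for (int j = 0; j < BS; ++j)
            dst[i][j] = src[i][j];
}

/* c += a * b on local tiles. */
static void tile_mac(float a[BS][BS], float b[BS][BS], float c[BS][BS]) {
    for (int i = 0; i < BS; ++i)
        for (int j = 0; j < BS; ++j) {
            float sum = 0;
            for (int k = 0; k < BS; ++k)
                sum += a[i][k] * b[k][j];
            c[i][j] += sum;
        }
}

/* One task computes a full C tile: C += sum over k of A[k] * B[k]. */
void matmul_block_k(float a[NB_K][BS][BS], float b[NB_K][BS][BS],
                    float c[BS][BS]) {
    float la[BS][BS], lb[BS][BS], lc[BS][BS];  /* local (on-chip) buffers */

    copy_tile(lc, c);                  /* copy in C at the beginning */
    for (int k = 0; k < NB_K; ++k) {   /* loop K */
        copy_tile(la, a[k]);           /* copy in A */
        copy_tile(lb, b[k]);           /* copy in B */
        tile_mac(la, lb, lc);          /* compute */
    }
    copy_tile(c, lc);                  /* copy out C at the end */
}
```

Compared with one task per (i_b, j_b, k_b) triple, C moves in and out once per tile instead of once per k_b step, and the host creates NB_K times fewer tasks.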
Execution Time: Significant Time Reduction
[Figure: Matrix Multiply execution time (seconds) vs. matrix size (128, 256, 512, 1024); series: seq_matmul, smp 32x32 2t, smp 64x64 2t, 32x32_1acc_NAIVE, 32x32_1acc_II1, 32x32_2acc_II2, 64x64_II2_1acc, 64x64_II2_blocking; data labels shown: 0.2, 1.9, 15.4, 123.5]
Execution Time: Zoom in for small data sets
[Figure: Matrix Multiply execution time (seconds) for matrix sizes 128 and 256; series: seq_matmul, smp 32x32 2t, smp 64x64 2t, 32x32_1acc_NAIVE, 32x32_1acc_II1, 32x32_2acc_II2, 64x64_II2_1acc, 64x64_II2_blocking; data labels shown: 0.2, 1.9]
Speedup (Seq/OmpSs)
[Figure: Matrix Multiply speedup (seq/version) vs. matrix size (128, 256, 512, 1024); series: seq_matmul, smp 32x32 2t, smp 64x64 2t, 32x32_1acc_NAIVE, 32x32_1acc_II1, 32x32_2acc_II2, 64x64_II2_1acc, 64x64_II2_blocking; data labels shown: 1.80, 1.87, 1.87, 4.83, 7.29, 9.39]
OmpSs@FPGA Tutorial
Git clone
git clone https://pm.bsc.es/gitlab/ompss-at-fpga/tutorials/patc-2019.git
Doc:
● presentation and tutorial guide (environment)
Sources:
● code to follow the tutorial
Solutions:
● it is up to you :-)