Hardware accelerators for HEP

(1)

Computing Challenges (COMCHA):

Hardware accelerators for HEP

A. Oyanguren (IFIC – Valencia)

XIII CPAN Days – Huelva, March 2022

(2)

Outline

• Motivation

• Hardware accelerators

• LHCb

• ATLAS

• CMS

• About COMCHA

• Outlook

(3)

Motivation

2808 bunches of protons per beam

10¹¹ protons per bunch

Beam energy of 7 TeV (access to ~10⁻¹⁶ cm)

Luminosity 10³⁴ cm⁻² s⁻¹

Crossing rate 40 MHz, i.e. 40 M collisions/s

About 1 MB data per collision

→ 40 TB/s
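The 40 TB/s figure follows directly from the crossing rate and the event size. A minimal sketch of that arithmetic, taking the quoted round numbers as exact (an assumption made only for illustration) and using function names chosen here:

```python
# Raw data rate implied by the numbers quoted on this slide.
CROSSING_RATE_HZ = 40e6   # 40 MHz bunch-crossing rate
EVENT_SIZE_BYTES = 1e6    # about 1 MB of data per collision

def raw_data_rate_tb_per_s(rate_hz, event_size_bytes):
    """Raw data rate in TB/s, with 1 TB = 1e12 bytes."""
    return rate_hz * event_size_bytes / 1e12

def bunch_spacing_ns(rate_hz):
    """Time between two consecutive bunch crossings, in nanoseconds."""
    return 1e9 / rate_hz

raw_rate = raw_data_rate_tb_per_s(CROSSING_RATE_HZ, EVENT_SIZE_BYTES)
spacing = bunch_spacing_ns(CROSSING_RATE_HZ)
print(f"raw rate ~ {raw_rate:.0f} TB/s, bunch spacing = {spacing:.0f} ns")
```

The same crossing rate also fixes the time between two collisions, which is the scale against which any trigger decision has to be compared.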

(New physics rate: ~0.00001 events/s)

[Figure: proton-proton collision]

(4)

Motivation

The trigger systems:

[Figure: trigger schemes of ATLAS, CMS and LHCb. Labels: 1 MHz; (Run 3: 2022) → New strategy!]

(5)

Motivation

How much data can we record?

Storage needs are set by the trigger bandwidth:

Bandwidth [GB/s] ~ Trigger output rate [kHz] x Average event size [MB]

Raw event data size: ~1 MB (ATLAS and CMS), ~0.1 MB (LHCb)

Trigger output rate: 1 kHz (ATLAS & CMS), 12.5 kHz (LHCb)

→ Bandwidth ~ 1 GB/s
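The bandwidth relation on this slide can be checked with the quoted rates and event sizes. The sketch below is only an illustration of that formula; the function and dictionary names are chosen here and are not taken from any experiment software:

```python
def trigger_bandwidth_gb_per_s(output_rate_khz, event_size_mb):
    """Bandwidth [GB/s] ~ trigger output rate [kHz] x average event size [MB]."""
    # kHz x MB = (1e3 / s) x (1e6 B) = 1e9 B/s = 1 GB/s, so no extra factor is needed.
    return output_rate_khz * event_size_mb

# (trigger output rate in kHz, raw event size in MB), as quoted on the slide
experiments = {
    "ATLAS/CMS": (1.0, 1.0),
    "LHCb": (12.5, 0.1),
}

bandwidths = {
    name: trigger_bandwidth_gb_per_s(rate, size)
    for name, (rate, size) in experiments.items()
}
for name, value in bandwidths.items():
    print(f"{name}: ~{value:.2f} GB/s")
```

Both cases land close to 1 GB/s: a higher output rate at LHCb is compensated by a smaller event size.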

• Moving to Real Time Analysis schemes: turbo (LHCb), scouting (CMS) and TLA (ATLAS)

• Analysis Object Data (AOD) formats at the trigger level

~1 kB in Run 3 and <5 kB in HL-LHC

• Fast reconstruction in real time becomes crucial!

• Fast decisions: between two collisions, each event must be either discarded forever or sent for permanent storage

(6)

Motivation

(From S. Campana, LHCC March 2022)

• The HL-LHC:

- LHCb and ALICE have already been upgraded for Run 3 (LHCb: x 5 luminosity)

- HL-LHC will come next…

• Data reconstruction and storage will become a tough issue; reduced data formats will not be enough

→ need to move more complex event reconstruction to the earliest stage of the trigger

(7)

Hardware accelerators

• Use more than one kind of processor or core to maximize performance or energy efficiency.

• Exploit the high level of parallelism to handle particular tasks.

Graphics Processing Units (GPUs)

- Multicore processors, highly commercial

- High throughput

- Ideal for data-intensive parallelizable applications

Field Programmable Gate Arrays (FPGAs)

- Programmable and flexible devices

- Low latency

- Low power consumption

- Ideal for compute- and data-intensive workloads

(8)

Hardware accelerators

• In practice (e.g. at LHCb)

(9)

Hardware accelerators

• In practice (e.g. at LHCb), accelerators mounted in the CPU servers:

[Figure: Event Building server. Labels: PCIe slots; 3 PCIe40 (FPGAs); 2 network connections; 1-3 GPUs; 40 Tb/s; Event Building; CPU; RAM; fans]

(10)

LHCb

The upgraded LHCb for Run 3:

- No L0 hardware trigger → full detector read-out at 30 MHz!

- Detector data received by O(500) FPGAs and built into events in the Event Building servers

- Full HLT1 in real time on GPUs (Allen project) → O(200) Nvidia RTX A5000
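To give a feeling for what these numbers imply, the sketch below estimates the event rate each GPU must sustain. It assumes that "O(200)" means exactly 200 cards and that the load is shared evenly; both are simplifications made here, not statements from the slides.

```python
INPUT_RATE_MHZ = 30.0  # full detector read-out rate quoted above
N_GPUS = 200           # assumption: O(200) taken as exactly 200 cards

def events_per_gpu_khz(input_rate_mhz, n_gpus):
    """Average event rate per GPU in kHz, assuming an even share of the load."""
    return input_rate_mhz * 1e3 / n_gpus

def mean_time_per_event_us(rate_khz):
    """Average time available per event on one GPU, in microseconds."""
    return 1e3 / rate_khz

per_gpu_khz = events_per_gpu_khz(INPUT_RATE_MHZ, N_GPUS)
budget_us = mean_time_per_event_us(per_gpu_khz)
print(f"{per_gpu_khz:.0f} kHz per GPU, ~{budget_us:.1f} us per event on average")
```

The per-event time is an average over many events processed in parallel, not the latency of a single event.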

[Figure: HLT1 sequence on GPUs: RAW DATA → Global Event Cut → selected events → VELO decoding and clustering → VELO tracking → Simple Kalman → Find PV → UT decoding → UT tracking → SciFi decoding → SciFi tracking → Parameterized Kalman → Muon decoding → Muon ID → Find 2ary PV → selected event] [LHCB-TDR-021]

(11)

LHCb

• GPU HLT1 sequence: algorithm breakdown (indicative), throughput and performance:

[Figure: breakdown by algorithm: VELO, UT, PV, SciFi, muons, Kalman] LHCb-FIGURE-2020-014; arXiv:2105.04031 [physics.ins-det]; [Comput Softw Big Sci 4, 7 (2020)]

• Working on the implementation of more time-consuming algorithms (LLPs)

(12)

LHCb

• Real-time reconstruction on FPGAs with the “artificial retina” architecture [NIMA 453 (2000) 425-429]

• VELO clustering already implemented in FPGAs for Run 3!

• Tracking in development for Run 5 (~2030); coprocessor testbed established at CERN for tests in realistic conditions
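The idea behind a retina-like pattern recognition can be sketched in a few lines: the track-parameter space is divided into cells, each cell sums Gaussian weights of the hits around the positions expected for its track hypothesis, and cells with a high response are track candidates. The toy below uses straight lines in two dimensions; the geometry, the receptive-field width and all names are hypothetical choices for illustration and do not describe the LHCb firmware.

```python
import math

LAYER_Z = [0.0, 1.0, 2.0, 3.0]  # hypothetical detector-layer positions
SIGMA = 0.05                    # hypothetical receptive-field width

def cell_response(x0, slope, hits):
    """Sum of Gaussian weights of all hits around the track x = x0 + slope * z."""
    total = 0.0
    for z, x_hits in zip(LAYER_Z, hits):
        x_expected = x0 + slope * z
        for x in x_hits:
            d = x - x_expected
            total += math.exp(-d * d / (2.0 * SIGMA ** 2))
    return total

def retina_scan(hits, x0_values, slope_values):
    """Evaluate every cell and return (response, x0, slope) of the strongest one."""
    return max(
        (cell_response(x0, slope, hits), x0, slope)
        for x0 in x0_values
        for slope in slope_values
    )

# One straight track x = 0.2 + 0.5 z, plus one noise hit on the second layer.
hits = [[0.2], [0.7, -0.4], [1.2], [1.7]]
x0_grid = [i / 10 for i in range(-5, 6)]       # -0.5 ... 0.5
slope_grid = [i / 10 for i in range(-10, 11)]  # -1.0 ... 1.0

response, best_x0, best_slope = retina_scan(hits, x0_grid, slope_grid)
print(f"best cell: x0={best_x0}, slope={best_slope}, response={response:.2f}")
```

In this sketch the cells are scanned in a loop and only the global maximum is kept. On an FPGA all cells would be evaluated in parallel, and a realistic implementation would look for local maxima above a threshold in order to find several tracks per event.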

(13)

ATLAS

• The HL-LHC pile-up degrades the pulse quality of the detectors, which in turn degrades the performance of the reconstruction algorithms.

• Investigating FPGA implementations of deep learning algorithms for real-time signal reconstruction in particle detectors under high pile-up conditions [JINST 14 (2019) 09, P09002]

• Machine Learning: NN models usually use floating-point arithmetic, which is not efficient on FPGAs

→ study of the impact of quantization on convolutional neural network models

Tests of quantized models on an FPGA (Xilinx ZCU104) showed up to 5 times better power efficiency than a GPU (Nvidia RTX 2080 Ti) for the CNN reconstruction.
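Quantization replaces floating-point weights with small integers plus a common scale factor, which maps well onto FPGA resources. The sketch below shows one common scheme, symmetric uniform quantization, on made-up weights; it is an illustration under those assumptions and not the procedure used in the cited study.

```python
def quantize(weights, n_bits=8):
    """Symmetric uniform quantization: float weights -> signed integers and a scale."""
    q_max = 2 ** (n_bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    quantized = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float weights."""
    return [q * scale for q in quantized]

def max_error(weights, n_bits):
    """Largest absolute difference between original and dequantized weights."""
    quantized, scale = quantize(weights, n_bits)
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))

weights = [0.82, -0.41, 0.053, -1.27, 0.337]  # made-up example weights
q8, scale8 = quantize(weights, n_bits=8)
q4, scale4 = quantize(weights, n_bits=4)
err8 = max_error(weights, 8)
err4 = max_error(weights, 4)
print(f"8 bits: {q8}, max error {err8:.4f}")
print(f"4 bits: {q4}, max error {err4:.4f}")
```

Fewer bits mean cheaper arithmetic but a larger rounding error per weight. Studying how that error propagates to the reconstruction quality is the point of the quantization study mentioned above.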

(14)

ATLAS

• Demonstrator of the HL-LHC electronics: equipped with prototypes of the Phase-II electronics, installed at Point 1 and reading out a slice of the ATLAS calorimeter: Δη × Δφ = 1.0 × 0.1

• Inserted in July 2019; it will read out data during Run 3

• Thoroughly validated in multiple testbeam campaigns

ATLAS Tile CPM (2020):

- AMC, ~7.4 x 18 cm

- 8 Firefly (24 links), 1 Xilinx KU115 FPGA

- Throughput (16 Gb/s line): TX: 512 Gb/s; RX: 512 Gb/s

- FPGA total: 1 Tbps, 7.4 Gbps/cm²

(15)

CMS

• Real-time muon tracking algorithm on FPGAs for the CMS upgrade

- DT trigger primitives built from the input hits (asynchronous)

- Maximum resolution and reduced dead time: resolutions ~ offline

- 400 ns drift time, but 25 ns between collisions; also, left-right hit ambiguity

- Extensions of the algorithm to include a pseudo-Bayes approach to improve the grouping step, particularly under aging

- Better performance expected in aged-chamber scenarios towards the end of the HL-LHC

- Full algorithm includes DT+RPC superprimitives (as in Phase 1)
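Two of the difficulties listed above can be made concrete with a small calculation: a 400 ns drift time spans many 25 ns bunch crossings, and a single drift time gives two possible hit positions, one on each side of the wire. The sketch below assumes a half-cell width of 21 mm and a constant drift velocity; both are assumptions made here for illustration, not values from the slides.

```python
MAX_DRIFT_TIME_NS = 400.0  # maximum drift time quoted above
BUNCH_SPACING_NS = 25.0    # time between collisions quoted above
HALF_CELL_WIDTH_MM = 21.0  # assumption: drift distance covered in 400 ns

def crossings_spanned():
    """Number of bunch crossings that fit within the maximum drift time."""
    return MAX_DRIFT_TIME_NS / BUNCH_SPACING_NS

def hit_candidates(wire_x_mm, drift_time_ns):
    """Left and right position candidates for one hit (constant drift velocity)."""
    drift_velocity = HALF_CELL_WIDTH_MM / MAX_DRIFT_TIME_NS  # mm per ns
    distance = drift_velocity * drift_time_ns
    return (wire_x_mm - distance, wire_x_mm + distance)

n_crossings = crossings_spanned()
left, right = hit_candidates(wire_x_mm=100.0, drift_time_ns=200.0)
print(f"drift window spans {n_crossings:.0f} bunch crossings")
print(f"hit candidates: {left:.1f} mm or {right:.1f} mm")
```

Resolving which candidate is the true one, and which bunch crossing the hit belongs to, requires combining hits from several layers, which is what the grouping step of the algorithm does.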

(16)

CMS

Firmware demonstration performed in Xilinx Virtex 7 (1 chamber phi view).

Validated at the lab (firmware-emulator comparison).

Installed at P5 and validated with Cosmic campaigns.

Exercised in KU115. Target implementation in Xilinx Virtex Ultrascale Plus VU13P (ATCA module).

Aiming at 1 or 2 sectors/FPGA.

Results for phi view DT chamber AM algo in KU115 (9%)

(17)

About COMCHA

• Forum for discussions related to Computing Challenges in HEP and other fields

• Aiming to be transversal and synergetic

• Important for communication, coordination of activities, use of infrastructures, training, etc.

https://twiki.ific.uv.es/twiki/bin/view/Main/ComCha

2nd COMCHA School, IFIC, Valencia, November 2021: Artificial Intelligence, Machine Learning, GPU and FPGA programming, use of Artemisa @ IFIC

(Contact: L. Fiorini, A. Oyanguren)

(18)

Outlook

• On FPGAs:

- Xilinx and Altera together hold around 90% of the FPGA market (Intel acquired Altera in 2015; AMD announced its acquisition of Xilinx in 2020 and completed it in 2022)

- Wide range of FPGA models, with families for:

> High performance

> System On Chip

> General purpose

• On GPUs:

- Market dominated by NVIDIA and AMD

- Huge number of commercial models, both professional and gaming (cheaper)

- Large AI developments and tools

• Hybrid: systems combining FPGA, GPU and CPU features:

- Xilinx/AMD ACAP Versal

- Altera/Intel Agilex

• Other processors (IPUs…)

• HL-LHC will be characterized by improved detectors and huge data volumes

• Hardware accelerators are becoming crucial, in particular for the trigger systems

