ATLAS distributed computing and challenges in LHC Run-2

(1)

ATLAS distributed computing and challenges in LHC Run-2

Simone Campana

CERN

(2)

Event Collection in ATLAS

17/10/14

[email protected] 2

L1 trigger (hardware)

L2 trigger (software)

Event Filter (software)

40 MHz

75 kHz

~3 kHz

~300 Hz

Storage

We really do throw away 99.9999% of LHC data

before writing to persistent storage

(3)

15 PB, Big Data?

Business emails per year: 3000 PB

Content uploaded to Facebook per year: 182 PB Google search index

(2013): 98 PB LHC data output (2012): 15 PB

YouTube uploads per year: 15PB

1. Duplicate the raw data (not such a bad idea)

2. Add a similar volume of simulated data (essential)

3. Make a rich set of derived data products (some of them can be larger than the raw data)

4. Re-create the derived data products whenever the software has been

significantly improved (several times a year) and keep the old versions for quite a while

5. Place several copies of the derived data around the world

6. ATLAS ends up with 100 PB of data.

(4)

The challenge of complexity

• One event has many tracks (particles)

• Reconstruction takes 30 seconds on a modern CPU

• 100,000 cores needed to keep up with data rate

• Re-reconstructed many times

• Same amount of data has to simulated also

• Simulation takes ~100 times longer than reconstruction

• Subsequent analysis steps over all data

17/10/14

[email protected] 4

(5)

The data processing and analysis model

Real and Simulated Data

Fortunately, the problem is EMBARASSINGLY PARALLEL

(each event/collision is independent)

Data intensive processing and physics analysis at ~100 distributed

computing centers and ~1000 universities worldwide Organized Chaotic

The model is data-driven

• Distribute the data among centers

• Process data at the centers hosting it

(6)

WLCG

 The Worldwide LHC Computing Grid (WLCG) project is a global collaboration linking grid infrastructures and computer centers worldwide

 The infrastructure built by integrating thousands of computers and storage systems in hundreds of data centers worldwide enables a collaborative computing environment

 WLCG serves a community of more than 8,000 physicists around the world with near real- time access to LHC data, and the power to process it

17/10/14

[email protected] 6

What is “the Grid” btw?

Many definitions in http://en.wikipedia.org/wiki/Computing_grid You can pick up the one you like, but at CERN we use a very pragmatic one:

“a service for sharing computer power and data storage over the Internet”

(7)

WLCG Organization

Tier-0 (CERN):

•Data recording

•Initial data reconstruction

•Data distribution Tier-1 (11 centres):

•Permanent storage

•Re-processing

•Analysis

Tier-2 (~130 centres):

• Simulation

• End-user analysis

Hierarchical structure organized in Tiers

Tiers definition have to do with the ROLE of the site in the experiment computing model

(8)

ATLAS Computing Model - Clouds

17/10/14

[email protected] 8

(9)

The ATLAS Grid(s)

 ATLAS built high level services on top of the Grid baseline middleware services

 PanDA (Production and Data Analysis) for job management

 DQ2 (Don Quijote)/Rucio for data management

(10)

Distributed Data Management

 DQ2/Rucio: the data management system to implement the ATLAS computing model

 A dataset catalog and transfer system, and more

 deletion, quota management, consistency, accounting,

monitoring, end-user tools, …

 FAX: an implementation of storage federation

 Transparent access to data, regardless the locality

17/10/14

[email protected] 10

(11)

Data Management

Data transfer load dominated by

data placement to serve analysis Nearly 10GB/s transfer rate (steady state target: 1.5GB/s)

1.5GB/s 10GB/s

Fiber Cut

Fully redundant network infrastructure Analysis

data replication

(12)

Workload Management System

 Core components

 DEFT: translates user requests into task (collection of jobs) definitions

 JEDI: dynamically generates the job definitions (processing N events)

 PanDA: the job management engine

 Features

 Provide a workflow engine for both production and analysis

 Interfaces to Grid services and hides their complexity

Panda …

… and Prodsys

17/10/14

(13)

Workload Management

1.4M jobs/day, 150K concurrently running 80 trillion events processed in 10 regions over 4 months in 2011

100k 1M

15% jobs failed but only 5% of WallClockTime

was lost

(14)

Data Placement

 In ATLAS, “jobs go where data are”, so we need a broad distribution of important data

 This is simple in theory but with 2,000 users, 20 data types and 100 sites this gets a bit more complicated

 ATLAS implemented a Dynamic Data Placement model, base on data popularity

 A minimal number of “pinned” copies at different sites is distributed by default after data creation

 More “cache” copies are created if those data are “hot” (heavily accessed)

 Space occupied by “cache” copies can be reclaimed at any time if needed.

17/10/14

(15)

Dynamic Data Replication and Reduction

Data Popularity

Dynamic Replication Dynamic Reduction

Pinned Cache

(16)

Challenges for Run-2

 Run-2 period of LHC data taking will come with new challenges

 Higher trigger rate

 Higher number of collisions

 More complex events (e.g. more particles)

 The amount of computing resources is not planned to scale accordingly

 If you do not have enough resources you can:

 Use better what you have

 Use somebody else’s stuff

 We plan to do both and this is what I will show

17/10/14

(17)

Space management crisis

Primary (pinned)

Default (pinned)

Disk occupancy at T1s

 T1 dynamically managed space (green) is unacceptably small

 It compromises our strategy of dynamic replication and cleaning of popular/unpopular data

23PB on disk, created in the last 3 months and never accessed

8 PB of data on disk never been touched

(18)

The new data lifecycle model

 Every dataset will have a lifetime set at creation

 The lifetime can be infinite (e.g. RAW data)

 The lifetime can be extended

 E.g. if the dataset is recently accessed. Or if there is a known exception

 Every dataset will have a retention policy

 E.g. RAW need at least 2 copies on tape. Need at least one copy of AODs on tape.

 ATLAS Distributed Computing will flexibly manage data replication and reduction

 Within the boundaries of lifetime and retention

17/10/14

(19)

Tiers, clouds and network traffic

• Hierarchical tier organization based on Monarc network topology

• Sites are grouped into clouds for organizational reasons

• Possible communications:

• Optical Private Network

• T0-T1

• T1-T1

• National networks

• Intra-cloud T1-T2

• Restricted communications:

General public network

• Inter-cloud T1-T2

(20)

Breaking Cloud Boundaries

 You need data at some T2 (normally “your” T2)

 The inputs are at some

other T2 in a different cloud

 Example: outputs of analysis jobs

Tier-2

Tier-1 Tier-1

17/10/14

Source

Dest

Cloud B Cloud A

(21)

Data management protocols

 Rucio enables usage of different and more efficient protocols with respect of what we have today (SRM+gridFTP)

 HTTP can be used to access data from Grid storages

 Visualize logs directly on browsers

 Download files from the Grid to laptops/desktops/local clusters (dq2-get use case)

 On can benefit from many non HEP tools using HTTP

• The same that kids use to download movies

 For data access e.g. in ROOT/Athena we can use the Xrootd protocol

 HTTP based data access is not production quality today, remains a medium/long term goal

(22)

Breaking the data locality:

Remote data access and FAX

17/10/14

Goal reached ! >96% data covered

We deployed a Federate Storage Infrastructure: all data accessible from any location Analysis (and production) will be able to access remote (offsite) files

Jobs will run at sites w/o data but with free CPUs. We call this “overflow”.

(23)

FAX use cases

Grid jobs recovery: failover to FAX in case of data access issues

In production

Execute jobs at sites w/o the data and access the files remotely

In pre-production

Reading remotely Per job:

• 4.2 MB/s

• 43% CPU eff

• 42 ev/s

Reading locally Per job:

• 7.2 MB/s

• 67% CPU eff

Data at AGLT2 Computing site: MWT2

(24)

FAX overflow now in production

17/10/14

regular overflow jobs per hour 2700 410 13%

regular overflow job efficiency 92% 83%

Target Overflow: < 10%

How much time do I waste in reading instead

of processing?

Acceptable Job Efficiency, still work TDB for CPU efficiency Adoption of xAOD will allow to tune performance Completed Jobs

10K

80%

Job Efficiency

CPU Efficiency

80%

(25)

Evolution of the model and impact on sites

 So far ATLAS asked T2s

 To be well connected to their T1

 To be well connected to the T2s of their cloud

 Now we are asking T2s:

 To be well connected to ALL T1s

 To foresee non negligible traffic from/to other (large) T2s

 The site storage will serve data to other sites

 We foresee 10% average, but watch out the fluctuations

 Need to be scaled accordingly

(26)

Breaking Tier Boundaries

Because we can benefit of reliable networks, we plan to break cloud boundaries.

Because we can benefit of better protocols, we will break job-to-data locality Some T2s are equivalent to T1s in term of disk storage & CPU power

17/10/14

The next consequence: we will relax the Tiers boundaries (Run1 model defines rigidly T0/T1/T2 roles). Examples:

Different kind of jobs (e.g. Reco) will run at various sites regardless the tier level

Custodial copies of data will be hosted at various sites regardless the tier level

(27)

Opportunistic Computing Resources

 Thanks to the flexibility of the ATLAS data and workload

management, we can utilize spare cycles provided through non-grid interfaces

 Cloud resources (e.g. Amazon EC2, Openstack, ..)

 Volunteer Computing (Boinc)

 High Performance Computing (supercomputers)

Oak Ridge Titan System

Architecture: Cray XK7

Cabinets: 200

Total cores: 299,008 Opteron Cores

Memory/core: 2GB

Cloud Resources

(28)

Cloud Resources

 What is a Cloud interface?

 You ask for resources though a defined interface and you get access and control of a (virtual) machine, rather than a job slot on the Grid

 Academic facilities offering access to their infrastructure through a cloud interface

 Commercial Infrastructures (AMAZON EC2, Google, …) offering good deals under restrictive conditions

 The ATLAS HLT farm (online trigger) is accessible through a Cloud interface when not in use

HLT farm 20K running jobs

Cloud Resources

17/10/14

(29)

(Opportunistic) Cloud Resources

We invested a lot of effort in enabling usage of Cloud resources

The HLT farm for example was has been instrumented with a Cloud interface in order to run simulation: Sim@P1

CERN-P1

Recently, the HLT farm was dynamically reconfigured to run reconstruction on 20M events/day

CERN-P1 (approx 5%)

(30)

Event Service

17/10/14

Efficient utilization of opportunistic resources implies short payloads (get out quickly from the resources if the owner needs it)

We developed a system to deliver payloads as short as the single event:

the Event Service. Based on core components such as:

• A new ‘JEDI’ extension to PanDA allows it to manage fine grained workloads

• The new parallel framework athenaMP brings multi/many-core concurrency to ATLAS processing, can manage independent streams of events in parallel

• Newly available object stores provide highly scalable cloud storage for small event-scale outputs

In advanced prototype phase (demonstrator exists), will be used sometime in Run-2

(31)

HPCs

High Performance Computers were designed for massively parallel applications (different from HEP use case) but we can parasitically benefit from empty cycles that

others can not use (e.g. single core job slots)

24h test at Oak Ridge Titan system (#2 world HPC machine, 299,008 cores).

ATLAS event generation: 200,000 CPU hours on 90K parallel cores (equivalent of 70% of our Grid resources)

The ATLAS production system has been extended to leverage HPC resources

EVNT,SIMUL,RECO jobs

@ MPPMU, LRZ and CSCS

Mira@ARGONNE: Sherpa Generation using 12244 nodes with 8 threads per node, so 97,952 parallel Sherpa processes.

Average 1,700 running jobs

(32)

ATLAS@Home

 Volunteer computing using Boinc

 SETI@Home, Einstein@Home,

LHC@Home… now ATLAS@Home!

 Why use volunteer computing?

 It’s free

 Public outreach

 Potential solution for T2/3 and institute desktop clusters

17/10/14

(33)

How to contribute

 Install Boinc client and VirtualBox

 Linux, Mac and Windows supported

 IMPORTANT! Set boinc to use at most 50% of cores

 Register for ATLAS@Home and connect client to it

 The server is http://atlasathome.cern.ch/

 That’s it!

 Client downloads

 CERNVM image (once, ~500MB)

(34)

Boinc activity

17/10/14

Completed jobs (0.4% of total)

Failure rate: 20% (vs 5% on Grid)

#users and #hosts

(35)

Conclusions

 The ATLAS Computing system served well LHC Run-1

 Run-2 will come with more challenges

 As the use case evolves, our system needs to evolve as well as our model

 The Run-2 model will be based on more flexibility and less artificial boundaries

 Introduction of new technologies and R&D activities have

been vital ATLAS Computing for long time