ATLAS distributed computing and challenges in LHC Run-2
Simone Campana
CERN
Event Collection in ATLAS
17/10/14
L1 trigger (hardware)
L2 trigger (software)
Event Filter (software)
40 MHz
75 kHz
~3 kHz
~300 Hz
Storage
We really do throw away 99.9999% of LHC data
before writing to persistent storage
15 PB, Big Data?
Business emails per year: 3000 PB
Content uploaded to Facebook per year: 182 PB Google search index
(2013): 98 PB LHC data output (2012): 15 PB
YouTube uploads per year: 15PB
1. Duplicate the raw data (not such a bad idea)
2. Add a similar volume of simulated data (essential)
3. Make a rich set of derived data products (some of them can be larger than the raw data)
4. Re-create the derived data products whenever the software has been
significantly improved (several times a year) and keep the old versions for quite a while
5. Place several copies of the derived data around the world
6. ATLAS ends up with 100 PB of data.
The challenge of complexity
• One event has many tracks (particles)
• Reconstruction takes 30 seconds on a modern CPU
• 100,000 cores needed to keep up with data rate
• Re-reconstructed many times
• Same amount of data has to simulated also
• Simulation takes ~100 times longer than reconstruction
• Subsequent analysis steps over all data
17/10/14
The data processing and analysis model
Real and Simulated Data
Fortunately, the problem is EMBARASSINGLY PARALLEL
(each event/collision is independent)
Data intensive processing and physics analysis at ~100 distributed
computing centers and ~1000 universities worldwide Organized Chaotic
The model is data-driven
• Distribute the data among centers
• Process data at the centers hosting it
WLCG
The Worldwide LHC Computing Grid (WLCG) project is a global collaboration linking grid infrastructures and computer centers worldwide
The infrastructure built by integrating thousands of computers and storage systems in hundreds of data centers worldwide enables a collaborative computing environment
WLCG serves a community of more than 8,000 physicists around the world with near real- time access to LHC data, and the power to process it
17/10/14
What is “the Grid” btw?
Many definitions in http://en.wikipedia.org/wiki/Computing_grid You can pick up the one you like, but at CERN we use a very pragmatic one:
“a service for sharing computer power and data storage over the Internet”
WLCG Organization
Tier-0 (CERN):
•Data recording
•Initial data reconstruction
•Data distribution Tier-1 (11 centres):
•Permanent storage
•Re-processing
•Analysis
Tier-2 (~130 centres):
• Simulation
• End-user analysis
Hierarchical structure organized in Tiers
Tiers definition have to do with the ROLE of the site in the experiment computing model
The ATLAS Grid(s)
ATLAS built high level services on top of the Grid baseline middleware services
PanDA (Production and Data Analysis) for job management
DQ2 (Don Quijote)/Rucio for data management
Distributed Data Management
DQ2/Rucio: the data management system to implement the ATLAS computing model
A dataset catalog and transfer system, and more
deletion, quota management, consistency, accounting,
monitoring, end-user tools, …
FAX: an implementation of storage federation
Transparent access to data, regardless the locality
17/10/14
Data Management
Data transfer load dominated by
data placement to serve analysis Nearly 10GB/s transfer rate (steady state target: 1.5GB/s)
1.5GB/s 10GB/s
Fiber Cut
Fully redundant network infrastructure Analysis
data replication
Workload Management System
Core components
DEFT: translates user requests into task (collection of jobs) definitions
JEDI: dynamically generates the job definitions (processing N events)
PanDA: the job management engine
Features
Provide a workflow engine for both production and analysis
Interfaces to Grid services and hides their complexity
Panda …
… and Prodsys
17/10/14
Workload Management
1.4M jobs/day, 150K concurrently running 80 trillion events processed in 10 regions over 4 months in 2011
100k 1M
15% jobs failed but only 5% of WallClockTime
was lost
Data Placement
In ATLAS, “jobs go where data are”, so we need a broad distribution of important data
This is simple in theory but with 2,000 users, 20 data types and 100 sites this gets a bit more complicated
ATLAS implemented a Dynamic Data Placement model, base on data popularity
A minimal number of “pinned” copies at different sites is distributed by default after data creation
More “cache” copies are created if those data are “hot” (heavily accessed)
Space occupied by “cache” copies can be reclaimed at any time if needed.
17/10/14
Dynamic Data Replication and Reduction
Data Popularity
Dynamic Replication Dynamic Reduction
Pinned Cache
Challenges for Run-2
Run-2 period of LHC data taking will come with new challenges
Higher trigger rate
Higher number of collisions
More complex events (e.g. more particles)
The amount of computing resources is not planned to scale accordingly
If you do not have enough resources you can:
Use better what you have
Use somebody else’s stuff
We plan to do both and this is what I will show
17/10/14
Space management crisis
Primary (pinned)
Default (pinned)
Disk occupancy at T1s
T1 dynamically managed space (green) is unacceptably small
It compromises our strategy of dynamic replication and cleaning of popular/unpopular data
23PB on disk, created in the last 3 months and never accessed
8 PB of data on disk never been touched
The new data lifecycle model
Every dataset will have a lifetime set at creation
The lifetime can be infinite (e.g. RAW data)
The lifetime can be extended
E.g. if the dataset is recently accessed. Or if there is a known exception
Every dataset will have a retention policy
E.g. RAW need at least 2 copies on tape. Need at least one copy of AODs on tape.
ATLAS Distributed Computing will flexibly manage data replication and reduction
Within the boundaries of lifetime and retention
17/10/14
Tiers, clouds and network traffic
• Hierarchical tier organization based on Monarc network topology
• Sites are grouped into clouds for organizational reasons
• Possible communications:
• Optical Private Network
• T0-T1
• T1-T1
• National networks
• Intra-cloud T1-T2
• Restricted communications:
General public network
• Inter-cloud T1-T2
Breaking Cloud Boundaries
You need data at some T2 (normally “your” T2)
The inputs are at some
other T2 in a different cloud
Example: outputs of analysis jobs
Tier-2
Tier-2
Tier-1 Tier-1
17/10/14
Source
Dest
Cloud B Cloud A
Data management protocols
Rucio enables usage of different and more efficient protocols with respect of what we have today (SRM+gridFTP)
HTTP can be used to access data from Grid storages
Visualize logs directly on browsers
Download files from the Grid to laptops/desktops/local clusters (dq2-get use case)
On can benefit from many non HEP tools using HTTP
• The same that kids use to download movies
For data access e.g. in ROOT/Athena we can use the Xrootd protocol
HTTP based data access is not production quality today, remains a medium/long term goal
Breaking the data locality:
Remote data access and FAX
17/10/14
Goal reached ! >96% data covered
We deployed a Federate Storage Infrastructure: all data accessible from any location Analysis (and production) will be able to access remote (offsite) files
Jobs will run at sites w/o data but with free CPUs. We call this “overflow”.
FAX use cases
Grid jobs recovery: failover to FAX in case of data access issues
In production
Execute jobs at sites w/o the data and access the files remotely
In pre-production
Reading remotely Per job:
• 4.2 MB/s
• 43% CPU eff
• 42 ev/s
Reading locally Per job:
• 7.2 MB/s
• 67% CPU eff
Data at AGLT2 Computing site: MWT2
FAX overflow now in production
17/10/14
regular overflow jobs per hour 2700 410 13%
regular overflow job efficiency 92% 83%
Target Overflow: < 10%
How much time do I waste in reading instead
of processing?
Acceptable Job Efficiency, still work TDB for CPU efficiency Adoption of xAOD will allow to tune performance Completed Jobs
10K
80%
Job Efficiency
CPU Efficiency
80%
Evolution of the model and impact on sites
So far ATLAS asked T2s
To be well connected to their T1
To be well connected to the T2s of their cloud
Now we are asking T2s:
To be well connected to ALL T1s
To foresee non negligible traffic from/to other (large) T2s
The site storage will serve data to other sites
We foresee 10% average, but watch out the fluctuations
Need to be scaled accordingly
Breaking Tier Boundaries
Because we can benefit of reliable networks, we plan to break cloud boundaries.
Because we can benefit of better protocols, we will break job-to-data locality Some T2s are equivalent to T1s in term of disk storage & CPU power
17/10/14
The next consequence: we will relax the Tiers boundaries (Run1 model defines rigidly T0/T1/T2 roles). Examples:
Different kind of jobs (e.g. Reco) will run at various sites regardless the tier level
Custodial copies of data will be hosted at various sites regardless the tier level
Opportunistic Computing Resources
Thanks to the flexibility of the ATLAS data and workload
management, we can utilize spare cycles provided through non-grid interfaces
Cloud resources (e.g. Amazon EC2, Openstack, ..)
Volunteer Computing (Boinc)
High Performance Computing (supercomputers)
Oak Ridge Titan System
Architecture: Cray XK7
Cabinets: 200
Total cores: 299,008 Opteron Cores
Memory/core: 2GB
Cloud Resources
Cloud Resources
What is a Cloud interface?
You ask for resources though a defined interface and you get access and control of a (virtual) machine, rather than a job slot on the Grid
Academic facilities offering access to their infrastructure through a cloud interface
Commercial Infrastructures (AMAZON EC2, Google, …) offering good deals under restrictive conditions
The ATLAS HLT farm (online trigger) is accessible through a Cloud interface when not in use
HLT farm 20K running jobs
Cloud Resources
17/10/14
(Opportunistic) Cloud Resources
We invested a lot of effort in enabling usage of Cloud resources
The HLT farm for example was has been instrumented with a Cloud interface in order to run simulation: Sim@P1
CERN-P1
Recently, the HLT farm was dynamically reconfigured to run reconstruction on 20M events/day
CERN-P1 (approx 5%)
Event Service
17/10/14
Efficient utilization of opportunistic resources implies short payloads (get out quickly from the resources if the owner needs it)
We developed a system to deliver payloads as short as the single event:
the Event Service. Based on core components such as:
• A new ‘JEDI’ extension to PanDA allows it to manage fine grained workloads
• The new parallel framework athenaMP brings multi/many-core concurrency to ATLAS processing, can manage independent streams of events in parallel
• Newly available object stores provide highly scalable cloud storage for small event-scale outputs
In advanced prototype phase (demonstrator exists), will be used sometime in Run-2
HPCs
High Performance Computers were designed for massively parallel applications (different from HEP use case) but we can parasitically benefit from empty cycles that
others can not use (e.g. single core job slots)
24h test at Oak Ridge Titan system (#2 world HPC machine, 299,008 cores).
ATLAS event generation: 200,000 CPU hours on 90K parallel cores (equivalent of 70% of our Grid resources)
The ATLAS production system has been extended to leverage HPC resources
EVNT,SIMUL,RECO jobs
@ MPPMU, LRZ and CSCS
Mira@ARGONNE: Sherpa Generation using 12244 nodes with 8 threads per node, so 97,952 parallel Sherpa processes.
Average 1,700 running jobs
ATLAS@Home
Volunteer computing using Boinc
SETI@Home, Einstein@Home,
LHC@Home… now ATLAS@Home!
Why use volunteer computing?
It’s free
Public outreach
Potential solution for T2/3 and institute desktop clusters
17/10/14
How to contribute
Install Boinc client and VirtualBox
Linux, Mac and Windows supported
IMPORTANT! Set boinc to use at most 50% of cores
Register for ATLAS@Home and connect client to it
The server is http://atlasathome.cern.ch/
That’s it!
Client downloads
CERNVM image (once, ~500MB)
Boinc activity
17/10/14
Completed jobs (0.4% of total)
Failure rate: 20% (vs 5% on Grid)
#users and #hosts
Conclusions
The ATLAS Computing system served well LHC Run-1
Run-2 will come with more challenges
As the use case evolves, our system needs to evolve as well as our model
The Run-2 model will be based on more flexibility and less artificial boundaries
Introduction of new technologies and R&D activities have
been vital ATLAS Computing for long time