Plans of the WLCG for Run 3 and the HL-LHC Era
Jose F. Salt Cairols
Instituto de Física Corpuscular
XI CPAN DAYS
21-23 October 2019
Overview
1.- The WLCG Global Collaboration
2.- Run 3 and HL-LHC Plan
3.- The Spanish LHC Computing GRID Community (LCG-ES)
4.- Usage of additional compute resources
5.- Heterogeneity and Federation
6.- Software Optimization
7.- Spanish Strategy in Computing
8.- Summary and Outlook
23/10/2019 XI CPAN Days 2
1.- The WLCG Global Collaboration
The Worldwide LHC Computing Grid (WLCG):
distributed high-throughput computing infrastructure to store, process and analyze data produced by the LHC experiments.
In numbers:
- 167 sites, 42 countries, 63 MoUs
- ~800k cores
- ~500 PB disk storage
- ~750 PB tape storage
- Private optical network (LHCOPN) and overlay over NRENs (LHCONE) with 10/100 Gbps links
CERN Computing Center
The equipment purchased by the centers (T0, T1 and T2) serves the whole collaboration, just as a detector does
WLCG is a worldwide and non-stop infrastructure
It contributes to the scientific and technological progress of each center that participates in WLCG: scientific infrastructure, expert personnel, etc.
2.- Run 3 and HL-LHC Plan
Best guess for Run 3:
- 2021 is a very low data test run: resources stay the same as 2018 for pp
- A full heavy-ion run is likely and will need some level of additional resources
- 2022 is a full year with a resource level of 1.5 times 2018
- 2023-24: moderate (20%) growth rates
From I. Bird’s talk at the 7th Scientific Computing Forum (SCF), 4 October 2019, CERN
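The best-guess levels above can be turned into a small table relative to 2018. The following is only a sketch: the function name and parameters are hypothetical, it assumes the 20% growth of 2023-24 compounds on top of the 2022 level, and it leaves out the additional heavy-ion resources because the slide does not quantify them.

```python
# Relative pp resource level per year, normalised to 2018 = 1.0.
# Assumption of this sketch: the 20% growth in 2023-24 compounds on the 2022 level.
# The extra heavy-ion resources are not quantified in the talk and are not modelled.
def run3_resource_levels(level_2022=1.5, growth=0.20):
    levels = {2021: 1.0, 2022: level_2022}
    for year in (2023, 2024):
        levels[year] = levels[year - 1] * (1.0 + growth)
    return levels


if __name__ == "__main__":
    for year, level in run3_resource_levels().items():
        print(f"{year}: {level:.2f} x 2018")
```

Under these assumptions the level reaches roughly 1.8 times 2018 in 2023 and about 2.2 times 2018 in 2024.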
Resource Evolution
- A gap of 4-5 times between a ‘flat budget’ scenario (20% annual increase) and the resource requirements for HL-LHC
- Intense R&D to reduce data and resource requirements
- Cost evolution is not well established
- Assumed price reduction: 10% CPU, 15% disk, 20% tape
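The arithmetic behind a flat-budget gap can be sketched as follows. This is an illustration under stated assumptions, not the WLCG cost model: it assumes the quoted price reductions apply per year, treats capacity as simply compounding each year, and all function names as well as the example inputs in the demo (8 years, a 20-fold requirement) are hypothetical.

```python
def capacity_gain_from_price_drop(price_drop):
    """Fractional capacity gain when the unit price falls by price_drop at constant spend."""
    if not 0.0 <= price_drop < 1.0:
        raise ValueError("price_drop must be in [0, 1)")
    return 1.0 / (1.0 - price_drop) - 1.0


def flat_budget_capacity(years, annual_gain=0.20):
    """Capacity relative to today after compounding annual_gain for the given number of years."""
    return (1.0 + annual_gain) ** years


def gap_factor(required, years, annual_gain=0.20):
    """Ratio between the required capacity and what the flat budget delivers."""
    return required / flat_budget_capacity(years, annual_gain)


if __name__ == "__main__":
    for label, drop in (("CPU", 0.10), ("disk", 0.15), ("tape", 0.20)):
        gain = capacity_gain_from_price_drop(drop)
        print(f"{label}: {drop:.0%} price drop -> {gain:.1%} more capacity at flat budget")
    # Hypothetical inputs for illustration only, not figures from the talk.
    years, required = 8, 20.0
    print(f"gap after {years} years: {gap_factor(required, years):.1f}x")
```

One point the sketch makes explicit: a 10% price drop buys only about 11% more CPU at constant spend, which is well below a 20% annual increase.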
3.- The Spanish LHC Computing GRID Community (LCG-ES)
Clouds:
● CERN, CA, DE, ES, FR, IT, ND, NL, RU, TW, UK, US
The PIC Cloud (ES)
● Tier1: PIC Barcelona
● Provides 5% of the Tier1 data processing for CERN's LHC detectors ATLAS, CMS and LHCb
● Tier2s:
○ CMS Spanish Tier2: CIEMAT Madrid, IFCA Santander
○ ATLAS Spanish Tier2: IFIC Valencia, IFAE Barcelona, UAM Madrid
○ LHCb Spanish Tier2: USC Santiago de Compostela, UB (Universitat de Barcelona)
○ LIP Lisbon, Portugal
○ UTFSM Santiago, Chile
○ UNLP La Plata, Argentina (inactive)
- Integrated in the WLCG project (Worldwide LHC Computing Grid) and following the ATLAS/CMS/LHCb computing models
- We provide 4% of the total Tier-2 resources and 5% of the Tier-1 resources
Total accounting of resources:
- CPU: 182k HS06
- Disk: 14.5 PB
- Tape: 19.6 PB
LCG-ES
More than 22 million finished jobs
On average, 5000 slots occupied by running jobs daily
More than 196 million events processed
More than 46 million files produced
Spanish Cloud performance in Run II
4.- Usage of additional compute resources
• Supercomputers for LHC
– Growing funding for supercomputing (HPC) infrastructures
• Roadmap towards exaflop machines
• Countries and funding agencies are pushing the HEP community to use these resources
– EuroHPC: funding at the BEUR (billion euro) level; 2 machines of approx. 200 PFlops by 2021, 2 exaflop machines by 2024
– Data-intensive computing at HPC facilities is not easy
• Limited or no network connectivity on compute nodes
• Limited storage for caching I/O event data files
– The ‘call for resource allocation’ model is not suitable
• We need a guaranteed share of resources
• Agreement with BSC
– LHC applications are NOT really suited for HPC
• No large-scale parallelization (no use of fast node interconnects)
• No essential use of accelerators (GPU, FPGA)
– Substantial integration work is needed to make HPC work for HTC
• Use of BSC (Barcelona Supercomputing Center) resources:
– The recommendation to use the computing resources of BSC comes from the Funding Agency
– ATLAS: effort devoted to adapting the queues at BSC to run simulation production jobs. In 2018, IFIC and IFAE started applying for computing time, and several requests have been granted
• Computing hours were requested from the Spanish Supercomputing Network (RES) and in Europe (PRACE): IFAE was granted 2.8 M hours and IFIC 1.2 M hours on MareNostrum (BSC), plus 2 M hours on Lusitania (CénitS)
• The ATLAS software and the tools needed to run ATLAS detector simulation were installed on these HPCs, so resources outside the Spanish Tier centers have been used
- IFIC and IFAE-PIC led the ATLAS simulation effort on opportunistic HPC resources
- More than 60 million events simulated
- More than 90% of jobs ended successfully
– CMS:
• CIEMAT/PIC: CMS still cannot use BSC resources because the compute nodes lack the network connectivity that CMS needs to integrate them into the WMS. There is a project with the HTCondor team to address that limitation.
• IFCA: adaptation of ALTAMIRA (the RES node in Cantabria) within the Grid infrastructure (input from Ibán)
– The grid infrastructure of the T2 has been redesigned so that, when the T2 is saturated, it checks for free HPC resources and forwards jobs there. At the moment pilot examples are running on ALTAMIRA in "parasitic" mode, but this can easily be changed.
– LHCb:
At the Spanish level, the LHCb groups have not started these activities yet
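The overflow behaviour described for the IFCA Tier-2 (fill the T2 first, then forward jobs to idle HPC slots) can be sketched as below. This is a toy model and not the actual IFCA or HTCondor configuration: the `Pool` class, the `route_jobs` function and the slot counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A pool of job slots (hypothetical model of a Tier-2 or an HPC partition)."""
    name: str
    total_slots: int
    busy_slots: int = 0

    @property
    def free_slots(self):
        return self.total_slots - self.busy_slots


def route_jobs(n_jobs, tier2, hpc):
    """Fill the Tier-2 first; forward the overflow to HPC slots that are idle.

    Returns (jobs sent to the Tier-2, jobs sent to HPC, jobs left queued).
    """
    to_t2 = min(n_jobs, tier2.free_slots)
    overflow = n_jobs - to_t2
    # "Parasitic" mode: only HPC slots that are currently idle are used.
    to_hpc = min(overflow, hpc.free_slots)
    queued = overflow - to_hpc
    tier2.busy_slots += to_t2
    hpc.busy_slots += to_hpc
    return to_t2, to_hpc, queued


if __name__ == "__main__":
    tier2 = Pool("T2", total_slots=100, busy_slots=90)
    hpc = Pool("HPC", total_slots=50, busy_slots=45)
    print(route_jobs(30, tier2, hpc))
```

Moving away from parasitic mode would amount to replacing the idle-slot check with a guaranteed share, which is the kind of change the slide says is easy to make.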
- December 2018: meeting at BSC to explore the possibility of a dedicated share for LHC computing needs
- Model: another special ‘project agreement’ with BSC
– February-April: preparation of a draft LHC Computing-BSC agreement
– Discussion of technical and policy questions
– July 2019: Sergi Girona (BSC) to prepare the definitive agreement document, to be approved at the November BSC ‘Junta de Gobierno’ (BSC Executive Board)
- February-March 2020: could be opened to users (hopefully)
Meeting at BSC in December 2018
View of Mare Nostrum
Cloud Computing Resources:
Experiments have run large-scale tests using cloud compute nodes:
Google Cloud, Amazon AWS, Microsoft Azure
-> approx. 50k cores concurrently for a few days
=> Commercial cloud is
• not cost-effective for either (a) storage or (b) computing,
• but it can be useful for testing new architectures without investing
Currently there is essentially no commercial cloud use for LHC computing
Potential future opportunities:
European Open Science Cloud (EOSC)
An EU model for the use of cloud computing in the private and public sectors
European Science Cluster of Astronomy & Particle physics ESFRI research infrastructures (ESCAPE)
5.- Heterogeneity and Resource Federation
Federation is the key
• Federation in data storage:
– The idea is to keep bulk data in a cloud-like service (data lake): minimize replication, assure availability
– Serve data to remote (or local) compute: grid, cloud, HPC, ???
– Simple caching is all that is needed at the compute site (or none, if the network is fast)
– Federated data at national, regional and global scales
• Federation of computing resources:
– Main issue: reducing the hardware cost
– Reducing the operational cost
– Co-location of data and processors is not guaranteed: sites can be ‘diskless’
– Heterogeneous computing
PIC is contributing actively to the first area (federation in data storage) with studies of data access and popularity for CMS at PIC and CIEMAT, measuring the effect on applications of accessing real data remotely
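The "simple caching at the compute site" idea can be sketched as a small least-recently-used (LRU) cache in front of a remote data lake. This is a toy illustration rather than any WLCG caching service: the `SiteCache` class and the `fetch_from_lake` callback are hypothetical, and LRU is just one possible eviction policy. The hit and miss counters hint at the kind of access and popularity measurement mentioned above.

```python
from collections import OrderedDict


class SiteCache:
    """Byte-bounded LRU cache in front of a remote data lake (hypothetical sketch)."""

    def __init__(self, capacity_bytes, fetch_from_lake):
        self.capacity = capacity_bytes
        self._fetch = fetch_from_lake   # callable: file name -> bytes
        self._files = OrderedDict()     # file name -> bytes, oldest first
        self._used = 0
        self.hits = 0
        self.misses = 0

    def read(self, name):
        if name in self._files:
            self._files.move_to_end(name)   # mark as most recently used
            self.hits += 1
            return self._files[name]
        self.misses += 1
        data = self._fetch(name)            # remote read from the data lake
        self._store(name, data)
        return data

    def _store(self, name, data):
        if len(data) > self.capacity:
            return                          # too large to cache: served straight through
        while self._used + len(data) > self.capacity:
            _, evicted = self._files.popitem(last=False)   # drop least recently used
            self._used -= len(evicted)
        self._files[name] = data
        self._used += len(data)

    def cached_names(self):
        return list(self._files)


if __name__ == "__main__":
    lake = {"a": b"x" * 4, "b": b"y" * 4, "c": b"z" * 4}
    cache = SiteCache(capacity_bytes=8, fetch_from_lake=lake.__getitem__)
    for name in ("a", "b", "a", "c", "a"):
        cache.read(name)
    print(cache.cached_names(), cache.hits, cache.misses)
```

Setting `capacity_bytes=0` models a ‘diskless’ site: every read goes to the lake and nothing is kept locally.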
6.- Software Optimization
• The solution could come from the software
– 50 million lines of code, mainly C++
– “a project / experiment cannot afford to have bad software” (Graeme’s talk in Granada)
• Initiatives:
– HEP Software Foundation
– IRIS-HEP: Institute for Research & Innovation in Software for HEP, 25M$, 5 years
– Proposal for an EU Scientific Software Institute
– In Spain: COMCHA forum
• New hardware architectures
– High-level parallelism, new instruction sets, …
– Support in software frameworks for heterogeneous hardware
• New/faster algorithms
– Machine Learning/Deep Learning
– Rewrite physics algorithms for new hardware
Improvement in CPU consumption by using faster physics algorithms in FASTSIM/FASTRECO
7.- Spanish Strategy in Computing
• A common theme in many contributions to the EPPS Granada is the desire to collaborate with and benefit from LHC R&D work
• Synergies and ‘not reinventing the wheel’
• Situation in different projects:
DUNE and CTA will leverage the WLCG for their computing infrastructure
Nuclear Physics Coll.:
ESCAPE addresses FAIR data management
The LHC Computing Model has been adapted to the needs and the size of the AGATA collaboration
Computing @ Future Accelerators meeting, May 2019:
Addressing the outstanding questions for CLIC and Future Circular Colliders, which implies governance evolution
Our strategy in Spain could be to establish a Computing Committee to coordinate the study of the computing/storage needs of the different projects/initiatives. Our organization would be fully embedded in the governance model described above.
8.- Summary and Conclusions
• The Spanish LHC GRID Computing projects have been essential for the scientific achievements of the LHC projects
• New needs and objectives for Run 3 and HL-LHC will imply deep changes in our organization and technical challenges for the HEP computing community
– HPC resources / Cloud Computing / HLT
– Resource Federation: Data Lakes
• Export, partially or globally, the WLCG organization and perspective to other astroparticle, nuclear and high-energy scientific projects (synergy)
• Take advantage of the experience of the LHC Computing GRID groups at the Spanish centers, since the centers are also involved in other non-LHC experiments
THANKS
QUESTIONS?