LHC Computing Perspectives in Spain for Run3 and HL-LHC
XIII CPAN Days
Miguel Villaplana
Institut de Física Corpuscular (IFIC)
on behalf of the WLCG-ES sites
March 22nd, 2022
Spanish computing sites for the LHC (WLCG-ES sites)
● Tier 1 (ATLAS, CMS, LHCb):
○ PIC-Barcelona
● Federated Tier2s
○ 60% IFIC-Valencia
○ 25% IFAE-Barcelona
○ 15% UAM-Madrid
○ 75% Ciemat-Madrid
○ 25% IFCA-Santander
○ 50% USC-Santiago
○ 50% UB-Barcelona*
● LHC sites pledges in the last 5 years:
https://wlcg-cric.cern.ch/core/federation/list/
● Integrated in the WLCG project (World Wide LHC Computing GRID) and strictly following the experiments
*UB Tier2: Now it is working as a Tier3 where MC simulation is running
Spanish computing sites for the LHC (WLCG-ES sites)
● Tier 1 (ATLAS, CMS, LHCb):
○ PIC-Barcelona
● Federated Tier2s
○ 60% IFIC-Valencia
○ 25% IFAE-Barcelona
○ 15% UAM-Madrid
○ 75% Ciemat-Madrid
○ 25% IFCA-Santander
○ 50% USC-Santiago
○ 50% UB-Barcelona*
● LHC sites pledges in the last 5 years:
https://wlcg-cric.cern.ch/core/federation/list/
*UB Tier2: Now it is working as a Tier3 where MC simulation is running
Resources: PIC
● T2K [neutrinos], MAGIC and CTA [gamma-ray astronomy], PAU and EUCLID [cosmology], VIP [instrumentation], opportunistic access to LIGO/VIRGO and DUNE, among others…
● CPU: 111 kHS06 Disk: 12.6 PB Tape: 38.4 PB
○ 325 compute nodes (9604 slots), under HTCondor v.9.0.7
● Spanish WLCG Tier-1 centre →~80% of resources
○ Provides ~4% of Tier1 data processing of CERN's LHC detectors ATLAS, CMS and LHCb
● 1⁄4 of the Spanish ATLAS Tier-2 and a Tier-3 ATLAS data analysis facility
○ ~10% of resources
● 18 GPUs available: via JupyterHub and (direct) access by some VOs, also available through Grid
● ~12 PB running on dCache 6.2.33 (~3.8 PB for ATLAS Tier-1)
● PIC is part of the national Data services RES nodes, which also allow us to exploit storage resources as well for the
Resources: PIC
● Successful network upgrade from 20 Gbps to 200 Gbps
● We saturate the LHCOPN/ONE often at 10 Gbps, each…
○ The GEANT - REDIRIS (NREN) connection is not yet migrated to 100Gbps. LHCOPN temporary fix:
LHCOPN to 100Gbps using the ‘long route’ Madrid-Lisbon-Paris-Geneva (~40ms RTT) might soon be applied, while the rest of things get fixed…
○ LHCONE hardware needs to be upgraded at the NREN level to get to 100 Gbps
● PIC participated successfully in the Tape 2021 Data Challenge
○ The network limitation did not affected these tape tests
■ NOTED (link) was enabled at CERN to load-balance the CERN-PIC LHCOPN traffic at saturation to LHCONE during the recent WLCG Network challenge
○ The current srm+gridftp protocol was used in this challenge. Gridftp will be deprecated in Run3, future challenges are expected to run with srm+https protocol
Resources: IFIC - ATLAS
● At the top of availability and reliability ranks: IFIC Tier2 is a Nucleus
○ Nucleus are Tier2s with a big amount of storage and very good network connection, passing job production on to smaller Tier2s (Satellites)
● The base capacity of this T2 is 60% of the Federated Spanish Tier-2.
● Network bandwidth was 2x10 Gbps last year. Now connected to the 100 Gbps of UValencia’s backbone
● Ready for RedIRIS-Nova at 100 Gbps -> IFIC’s WAN connectivity will be increased to 100 Gbps
● We have processed more than 300 Bill. events in these last 5 years
● Steady state of more than 5.000 running job slots since 2019, typically using 2GB per job slot
● Running with either 8 or 1 cores (“multi-core” or
“single-core”) per job
● Now additional 2K slots from MN4
Resources: IFAE - ATLAS
● The ATLAS IFAE Tier-2 is co-located with the ATLAS Tier-1 resources at PIC sharing the same infrastructure. See slides for PIC Tier-1 for details. The same team supports the Tier-1, Tier-2 and Tier-3 for ATLAS at PIC.
● The base capacity of this T2is 25% of the Federated Spanish Tier-2.
● The Tier-2 hosts the Tier-3 analysis facility of the IFAE-ATLAS group and provides 12% (120 TB) for local data analysis repository (IFAE_LOCALGROUPDISK).
● The current capacity of the IFAE Tier-2 is of the order of 12.000 HepSpecs2006 of CPU capacity and 1 PB of data storage.
● As collateral, IFAE T2 has access to the 3.5 PB data on disk and 10 PB on tape of the PIC ATLAS Tier-1, and 0.5 PB of the Tier-3 disk.
● The CPU capacity share is adapted monthly as a function of the pledge and the delivered computing resources.
● If the ES-Tier2 pledge is underperforming for some circumstances, IFAE could take CPU capacity from the exceeding of PIC resources. In more than ten years, this mechanism was never needed.
● The Spanish Data Supercomputing Network (RES-DATA) provided a multi-year grant (DATA-2020-1-0024) to increase the local data analysis repository with 200 TB on disk and 200 TB on tapeas an extension of the disk.
● The IFAE benefits from the200 Gbps upgrade of the PIC network
Resources: UAM - ATLAS
● The base capacity of this T2 is 15% of the Federated Spanish Tier-2.
● Steady state of more than 1.000 running job slots since 2019, typically using 2GB per job slot
○ Mainly running with either 8 or 1 cores (“multi-core” or “single-core”) per job, depending on type of job
● Average transfer throughput as destination/source 100/100 MB/s, with peaks up to 400/150 MB/s
○ Consistently transferring more than 200/200 TB/month
Resources: CIEMAT - CMS
● 75% of CMS Spanish Tier-2 federation
● Also providing computing services and support to other local research communities (astroparticle physics, cosmology, neutrinos…)
CPU
● ~150 CPU nodes, ~2800 slots
● HTCondor (v8.8.10) and 2 HTCondorCEs (v3.2.1) Storage
● ~2.6 PB, dCache v2.27
● dCache pools in dual-stack IPv4/IPv6
● TPC enabled for HTTPs (already moving to production) Network
● 2x10 Gbps WAN (LHCOne + Internet connections)
● Upgrade to 100 Gbps WAN pending deployment by RedIRIS and CIEMAT
Resources: IFCA - CMS
● The IFCA Tier2 is implemented on the Opensource Suite of Cloud OpenStack.
○ Integrated with the rest of the IFCA computing infrastructure.
● IFCA provides a IaaS(Infrastructure as a Service) to the Tier2 project of CMS.
○ Allows to easily benefit from already deployed services.
● Different resources can be used and shared through the batch system:
○ Grid Worker nodes (IaaS).
○ GPU nodes can also be served by the cloud system (IaaS).
○ Opportunistic running on the HPCAltamira node.
● Worker Nodes are cloud machines building singularity containers to run CMS jobs.
○ CMS software loaded through cvmfs cache.
○ Output is stored in GPFS distributed file system.
○ Containers deleted after execution.
Resources: USC - LHCb
Network connection to CESGA (Centro Supercomputación de Galicia)
●
Traffic filtering at the Perimetral Firewall
●
ACLs on some internal routers
●
Special Projects Network is not filtered and is
directly connectedto CESGA.
●
Control nodes are all connected to the Special Projects Network
● All paths are at least 10Gb/s optical fibres
●
~9kHS06 on average dedicated to MC Simulation
●
~60 jobs/hour
● ~3% of total LHCb production
●
Throughput typically below 100 MB/s
○
Peaks of up to 240MB/s observed
Resources outside WLCG pledge: Amazon cloud burning tests at PIC
● We tested AWS (Amazon Web Services) for a week (June 2019), doubling PIC compute power
●
Integration of a cloud environment with the local batch system - sporadic increase of resources
●
Special interest in a spot instance based scenario
●
Data center in Frankfurt (~40 ms) - used Condor_Annex
●
Set up HTCondor Connection Brokering (CCB)
●
Bridge server to connect the local system to the outside nodes
●
HTCondor-CE routing modified so only ATLAS and
CMSsend jobs to AWS
●
Custom WN image deployed in AWS servers, + CVMFS, + access to Squids
●
Configuration of spot instances requirements
Good optionto increase computing resources sporadically Flexible and easyto deploy through HTCondor
Not very good for data intensive jobs
Resources outside WLCG pledge: HPC - ATLAS (IFIC, IFAE-PIC, UAM)
● A huge effort that is paying back
○ Started as an opportunistic resource, now it is a backbone of our computing contribution to simulation.
● The access to HPC CPU time has been through the RES open calls.
○ From 2018 to mid 2020 as standard calls.
○ Starting in mid 2020 within the Ministerio-BSC agreement (“Proyecto Estratégico de Acceso al Marenostrum 4 para su utilización en la Computación del LHC”).
○ ~10 million hours approved and used at BSC every quarterby the Spanish ATLAS colleagues
● Three HPCs have been used Lusitania, Cibeles and MareNostrum4
● Using ARC-CEto interconnect MareNostrum and ATLAS production system
● Only simulation workflow validated - singularity containers, pre-placed at MareNostrum GPFS MareNostrum accepts only
Resources outside WLCG pledge: HPC - CMS (PIC, CIEMAT)
Current status:
● Interaction with BSC execute nodes through the login node, mounting the shared FS through sshfs and sending jobs to the Slurm scheduler via ssh. Slurm jobs launch a HTCondor slot that joins the CMS Global Pool
● CMS Software modifiedto accept sql files for conditions data at runtime
● Using cvmfs_preload to bring cvmfs CMS files to BSC. Two weeks to copy ~37M files (13 TB), at first injection. cvmfsexec used to build the cvmfs file structure
● Tested CMSSW >= 11.2 execution based on cvmfs copy, reading conditions data in a file and producing meaningful outputs
● Stage-out + Data Transfer Manager designed to transfer output data to PIC
● Testing the integration with WMAgentat the moment
● A New grant of 6 Mhours sent (typical quarter allocation for CMS)
Resources: Projections for Run 3 and beyond
Storage projections towards HL-LHC
• Storage is the most expensive resource to deploy and operate WLCG
• Developments to reduce the needs (costs) of storage infrastrucure are paramount!
• Data lakes, reduce data size and duplication, analysis via caches, central processing through buffers, …
DOMA Working Group
• In 2018 the WLCG DOMA R&D Working Group was launched, covering several activities in the area of
Data Organization, Management and Access, with a focus on the medium/long term evolutionTwiki → https://twiki.cern.ch/twiki/bin/view/LCG/DomaActivities Indico area → https://indico.cern.ch/category/10360/
• 3 active groups on:
- Data Access, Content Delivery and Caching - Third Party Copy
- Quality of Service
Moving to a network-centric model
• Definition of “datalake” in our context: an infrastructure where CPU and storage capacity are loosely coupled (not necessarily co-located)
• Storages in the datalake need to be connected by a fast and reliable network
• The decisions about storage deployment in a region will impact the network
• Distributed storage or storage-less sites will require more network
• Deployment of caches might reduce the network needs, depending on the use cases and the workflows
• Defining regional plans for storage and the corresponding network needs will be the obvious next step. The two aspects
Third party copies
● LHC data is constantly being moved between computing and storage sites to support analysis, processing, and simulation
● Need to modernize the infrastructure to sustain the at least 1Tbps of data expected in HL-LHC by 2027
● Historically, bulk data movement has been done with the GridFTP protocol but there is a need to use modern software, protocols, and techniques to move data.
● It is agreed to consider HTTP as the WLCG baseline protocol for TPC.Every storage solution should implement it and every site should deploy it
● Need to make it work at the required scale across the WLCG infrastructure
Archive storage
● A storage archive is used to preserve data that is rarely if ever accessed, often for long periods of time
● Tape Storage is not a ‘cold storage’ for LHC
● Three frontend solutions in WLCG: EOS/CTA, dCache, StoRM
● In the short term SRM will continue playing a role. FTS should hide the complexity of “stage+transfer” via SRM(dCache,StoRM) or xrootd(CTA)+HTTP
● In the medium term, harmonise the tape access through a common REST API.
Archive storage does not need to be tape
● Disk-based solution presented by KISTI
● Storage TCO needs to be considered, particularly if the usage will increase (e.g. tape carousels) - BNL studies
Reducing Operational Effort
● The infrastructure and the experiment activities will grow in size and complexity. The available effort in operations probably will not
● Rather than reducing effort, the aim is to operate at a large scale with the same effort
→ we will need to be more efficient
● Some examples of improvements in this direction in our community:
● Recent improvements in the monitoring and alerts system at
UAM: Telegraf+InfluxDB+Grafana; Setup with Ansible IFIC: local monitoring using FileBeat+ELK; Lustre low level monitoring with Probe+InfluxDB+Grafana
Towards more streamlined analysis data formats
In CMS, NANOAOD (1-2KB/evt): ROOT file as flat TTreewith entries corresponding to events and branches to physics
observables
● produced as part of the centralised workflows
● further reduce demands on disk and CPU
● satisfy the requirements of as many physics analyses as
Similarly for ATLAS:New data tiers introduced, in tabular format:
● Compression factor
● Subscription model for analysis jobs (auto re-run when new data arrives)
● Access & interpretation of the data with standard
Analysis Facilities
● In the early Grid computing model, Tier-3s were a layer of sites dedicated to analysis, frequently collocated to Tier-2s, however for private use of local groups. Tier-3 are not pledged and require non standard resources unlike the WLCG grid deployment
● Guiding principle: Help physicists minimize time-to-insight, enabling iterative exploration of the data:
○ In the future, when processing 10x lumi/evts, avoid physicists 10x waiting time!!
○ Boost productivity and competitiveness of our physics communities.
● Key features:
○ Local access to the reduced data samples (e.g. NanoAOD) with low latency
○ Processing resources available with capacity and priority (CPUs, GPUs, memory)
○ Efficient and elastic infrastructure (e.g. scheduling based on HTCondor) with dynamic expansion to HPC/Cloud to absorb peaks
○ Enhanced support to software tools (ROOT + Python ecosystem)
Analysis Facilities
Typically consist of:
● Dedicated storage resources (of order of several 100s of TB)
○ HDDfor local access to reduced data formats and produced ntuples
○ SSDfor low latency access to data with catching
● CPU resources used interactively and/or via a batch system (mostly HTCondor)
○ users are reluctant to get into the local batch system, keeping execution of analysis tasks in the UIs. As the need for resources grows, the pressure for them to use HTCondor grows too!
○ Future UI to be designed in a flexible way, with user-friendly interfaces that do not discourage users
○ IFAE: Jupyter Notebook instances can be spawned via a dedicated portal:
■ Select different profiles (CPU, mem, GPU) allocating resources at the PIC farm
● SW delivery mostly via CVMFSwith increasing presence of containers
● GPU resources available, but often not dedicated.
○ IFIC: ARTEMISA infrastructure (http://artemisa.ific.uv.es)
○ IFCA: IaaS access
○ PIC-IFAE:Jupyter Notebook instances
● Network:
○ LAN of multiple 10 Gbps to support the intense data throughput between AF processing and storage nodes
Conclusions
● We have developed a great infrastructure and community within WLCG-ES!
○ Expertise and infrastructure built upon decades of work to service the LHC experiments
○ now supports other experiments HEP/astro/cosmo
● Data is our distinctive asset!
○ Modest (~10%/year) foreseen resource increases along Run 3
○ Exception is LHCb, since its high-lumi phase is Run 3
■ Large (>50%/year) increases in resource needs
○ Large step function for HL-LHC
● Quite some R&D underway
○ Small data formats (~kB/event) for analysis
○ Try to reduce cost of storage infrastructure (data lakes, reduce data duplication, analysis via caches, central processing through buffers)
○ CPU from outside WLCG resources (HPC, GPUs, Clouds)
○ Tabular data formats for easy massive filtering operations
● Analysis Facilities are a very important aspect
○ Specialized for highly parallel local and interactive analysis
Thanks
Example: CMS AF@CIEMAT proposal