Challenges in the adoption of the EGI paradigm for an e-Science/Tier2 centre (ES-ATLAS-T2)

(1)

Presented by:

Santiago González de la Hoz, IFIC – Valencia (Spain)

Santiago González de la Hoz, Álvaro Fernández Casaní, Gabriel Amorós, Carlos Escobar, Mohamed Kaci, Alejandro Lamas, Elena Oliver, José Salt, Javier Sánchez, Miguel Villaplana

EGI-InSPIRE

(2)

Outline

 

Introduction to the LHC and ATLAS Computing Model

  LHC and ATLAS achievements last year

  ATLAS hierarchical Computing Model

 

ATLAS Spanish Tier-2 and Tier-3

  Description

  Computing Resources

  Storage Resources

  User Analysis Jobs

 

RELATION with EGI and the Ibergrid NGI

  Transition from EGEE to EGI

  EGI organization

  Heavy User Communities: HEP case

  Migration from EGEE to EGI at IFIC

  Description of tools

 

CONCLUSIONS

(3)

LHC and ATLAS achievements last year

LHC restarted on 20 Nov 2009; first collisions on 23 Nov (900 GeV)

Christmas stop

One million events reached (before 7 TeV in March 2010)

(4)

 ATLAS operation has been very successful during both pp and PbPb collisions @ LHC during 2010.

  Recorded almost all delivered luminosity

  Sub-systems operational almost 100% of the time

[Figure: DATA RECORDED]

(5)

LHC and ATLAS achievements last year

DATA RECORDED

 Several full data re-processings @ Tier1s have been done with improved software, alignment and calibrations

 Several MC productions have been done as well, and processed with the same conditions applied for data.

[Figure: DATA PROCESSING 2010. Number of reprocessing jobs versus time (Oct to Nov), with markers at Oct 8, Oct 29 and Nov 8. Annotations: Step I, data taken Jul-Sep (L ~ 10 pb^-1); Express Stream and Calo Stream reprocessing; start of reprocessing; Step II.]

(6)

LHC and ATLAS achievements last year

 Huge increase in running analysis jobs: millions of jobs are run every week at hundreds of sites

 Over 1000 different users during the past 6 months

[Figure: WORLDWIDE DATA ANALYSIS 2010, with the LHC start-up marked]

(7)

ATLAS Multi-Tier Hierarchical Model

(8)

Hierarchical Computing Model

 MONARC was more than ten years ago:

  The landscape has changed

  We have to update the maps

 Cloud boundaries have serious impact on:

  Production and Data Processing

  Data placement and data access

  CPU utilization

  Disk space usage and data availability

  Network bandwidth

 Keeping one copy of derived data per cloud is not sustainable.

 Executing an entire production task within one cloud is not sustainable.

 Data transfer capability today can handle much higher bandwidth than expected/planned.

  Network is extremely reliable

  Traffic could flow more between countries as well as within them

 Tier2s could be used more efficiently. Tier1 and Tier2 may become more equivalent from the network point of view. The Tier1/Tier2 hierarchy is no longer so important

2010 Lessons

The first steps toward a new model

(9)

Hierarchical Computing Model

 Grid and Clouds

 Local File Catalogs consolidation:

  ATLAS has more than 15 LFCs: one per cloud plus 6 in the US.

  If an LFC is down, the whole cloud is down

  There will be one catalog at CERN and a hot backup in another geographical location (BNL)

 Dynamic data placement: PD2P

  Caching data which are planned to be used

  Decrease number of primary replicas

 T2Ds. Directly Connected Tier2

  Tier2 with the direct connection to ALL Tier1s, Tier2Ds and CERN

  Tier2Ds Selection Criteria: Robustness and Network bandwidth and performance

Looking to the future

(10)

ATLAS Spanish Distributed Tier2

IFIC

IFAE

UAM

  Enable Physics Analysis by Spanish ATLAS Users

  Tier-1s send AOD data to Tier-2s

  Continuous production of ATLAS MC events

  Tier-2s produce simulated data and send them to Tier-1s

  To contribute to ATLAS + WLCG Computing Common Tasks

  Sustainable growth of infrastructure according to the scheduled ATLAS ramp-up and stable operation

T1/T2 Relationship

  FTS (File Transfer Service) channels are installed for these data for production use

  All other data transfers go through normal network routes

  Tier 1 services:

  VO Box, FTS channel server, Local file catalogue (part of Distributed Data Management)

(11)

Tier-2 and Tier-3 IFIC resources

CE     CPU                                   Cores          Mem/Core  HEPSpec06
-----  ------------------------------------  -------------  --------  ------------------------
ce01   Intel(R) Pentium(R) D CPU 3.20GHz     40 x 2 = 80    2 GB      40 x 2 x 5.77 = 461.6
ce04   Intel(R) Xeon(R) CPU E5472 3.00GHz    48 x 8 = 384   2 GB      48 x 8 x 9.20 = 3532.8
       Intel(R) Xeon(R) CPU L5520 @ 2.27GHz  32 x 8 = 256   2 GB      32 x 8 x 18.40 = 4710.4
Total                                        720                      8704.8
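The totals in the table can be re-derived with a few lines of Python. This is only a sketch: it assumes that each "a x b" entry in the Cores column means nodes times cores per node (the slide does not say so explicitly), and the shortened CPU labels and variable names are illustrative.

```python
# Worker-node groups as listed in the table:
# (CE label, CPU model, nodes, cores per node, HEPSpec06 per core).
# Reading "40 x 2" as nodes x cores-per-node is an assumption.
# The third row has no CE label in the source, so it is left as None.
NODE_GROUPS = [
    ("ce01", "Pentium D 3.20GHz", 40, 2, 5.77),
    ("ce04", "Xeon E5472 3.00GHz", 48, 8, 9.20),
    (None, "Xeon L5520 2.27GHz", 32, 8, 18.40),
]

PLEDGE_2010_HS06 = 6000  # ES-ATLAS-T2 pledge figure quoted on the slide


def totals(groups):
    """Return (total cores, total HEPSpec06) over all node groups."""
    cores = sum(nodes * per_node for _, _, nodes, per_node, _ in groups)
    hs06 = sum(nodes * per_node * score for _, _, nodes, per_node, score in groups)
    return cores, round(hs06, 1)


total_cores, total_hs06 = totals(NODE_GROUPS)
# Difference between the installed total and the pledge figure on the slide.
surplus_hs06 = round(total_hs06 - PLEDGE_2010_HS06, 1)

if __name__ == "__main__":
    print(f"cores: {total_cores}, HEPSpec06: {total_hs06}")
    print(f"above the quoted 2010 pledge by: {surplus_hs06}")
```

The computed totals agree with the bottom row of the table (720 cores, 8704.8 HEPSpec06).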

 IFIC is an e-Science centre with two infrastructures: Tier2 and GRID-CSIC

 Pledge 2010 ES-ATLAS-T2: 6000 HEPSpec06

 Extra resources thanks to GRID-CSIC infrastructure:

  Grid resources to be used by all the scientific communities in Spain which belong to CSIC

(12)

Storage resources

SUN X4500/40/70: 5 x (500 GB) + 14 x (1 TB). Total: 710 TB (ATLAS Pledge 2010: 523 TB)

Tier-2 Resources

Disk servers aggregated using Linux (RHEL5) + RAID5 (software) + Lustre

The 48 disks are distributed into 6 OSTs (5 x 8 + 1 x 6, plus 2 disks for the OS). Lustre v1.8 (in hardware with iSCSI + HA)

One Lustre metadata server (MDS) with RAID1 redundancy.

Tier-3 Resources

  Around 100 TB → 60 TB under DDM control + 40 TB under IFIC control

  Space token dedicated to Tier-3 → ATLASLOCALGROUPDISK

  To manage local users’ data.

  It has an area on a SE but points to non-pledged space

• Switch Cisco 6509

• 10 Gbit to backbone

• 1 Gbit to worker nodes and disk servers

(13)

User Analysis jobs

 

Jobs run where the data are located (2010 model). Data grouped in datasets

 

Users can request a replica at another site (DaTRI, DDM)

 

The Athena package is installed through grid jobs by the software administrators (swadmins) and is used for Monte Carlo production and analysis.

 

Receiving and storing the produced data, thanks to the high availability of its sites and the reliable services provided by the team managers

 

Providing the required distributed analysis tools to allow users to use the data and produce experimental results.

Distributed analysis through Ganga and Panda

 

Some tools are going to be supported by EGI

[Figure labels: Tier2, Dataset, Job]
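The "jobs run where the data are" rule of the 2010 model can be illustrated with a toy lookup. This is a minimal sketch, not how Panda or DDM is implemented: the catalogue contents, dataset names and function names are all hypothetical.

```python
# Hypothetical replica catalogue: dataset name -> sites holding a replica.
# Dataset names and their placement are made up for illustration.
REPLICAS = {
    "data10_7TeV.example.AOD": {"IFIC", "IFAE"},
    "mc10_7TeV.example.AOD": {"UAM"},
}


def candidate_sites(dataset, replicas):
    """Sites where a job reading `dataset` may run, i.e. where the data are."""
    return sorted(replicas.get(dataset, set()))


def request_replica(dataset, site, replicas):
    """Record a new replica at `site`, mimicking a user replica request."""
    replicas.setdefault(dataset, set()).add(site)


if __name__ == "__main__":
    print(candidate_sites("mc10_7TeV.example.AOD", REPLICAS))
    request_replica("mc10_7TeV.example.AOD", "IFIC", REPLICAS)
    print(candidate_sites("mc10_7TeV.example.AOD", REPLICAS))
```

In this sketch a replica request simply widens the set of sites where later jobs on that dataset could run.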

(14)

Transition from EGEE to EGI

 

EGEE ended in April 2010 and EGI-InSPIRE continues its legacy

 

Organized in NGIs (Pl-Grid, Ibergrid,…)

 

The transition is under way; the important issues as far as ATLAS is concerned are:

 

Support for middleware and tools.

 

Infrastructure support according to required levels

 

No disruption to current operations and end-users

(15)

Heavy User Communities: HEP case

This activity provides continued support for activities currently supported by EGEE while they transition to a sustainable support model within their own community or within the production infrastructure

Main EGI-InSPIRE tasks that affect ATLAS:

 

EGI-InSPIRE SA3:

  SA3.3 Services for HEP (204PMs CERN, 60PMs INFN)

  User Community Support on Services for HEP

The services used by High Energy Physics experiments at the LHC can be classified as follows (as defined in deliverable MS603):

1. Experiment services – developed, maintained and operated by the collaborations themselves

2. Middleware services – generic services at Grid middleware layer, typically operated by WLCG

3. Infrastructure services – fabric-oriented services operated by the sites

4. Database services

(16)

Experiment specific services

  Experiment services provide functionality very specific to one experiment and the corresponding computing model

  Use generic middleware where possible

  ALICE: AliEn

  ATLAS: PanDA, DDM

  CMS: CRAB, Analysis Server, Production Agent, PhEDEx, DBS

  LHCb: DIRAC

(as defined in deliverable MS603)

(17)

Middleware Services used by HEP

 

Data Management: LFC, FTS

 

Workload management: Ganga, Condor-G, gLite WMS, glideinWMS

 

Persistency services: CORAL, POOL, COOL, FroNTier

 

Monitoring services: HammerCloud, Experiment Dashboard, SAM, Nagios

 

Security Services: VOMS, VOMRS, MyProxy

 

Computing Services: LCG CE, CREAM CE, OSG CE, ARC CE

 

Storage Services: CASTOR, dCache, DPM, xrootd, StoRM, BeSTMan

(as defined in deliverable MS603)

(18)

Migration from EGEE to EGI at IFIC

 

The transition from EGEE to EGI-InSPIRE is under way; the important issues as far as ATLAS is concerned are:

 

All our services run gLite 3.2 on SLC5 (SRMv2, GridFTP, Squid, top-BDII, site-BDII, WMS, proxy, MON, ...)

 

Support for middleware and tools, for instance supporting NGI VOs with the Grid-CSIC infrastructure:

  Infrastructure support according to required levels. Some user communities use existing NGI VOs, and for others we have created new specialized VOs. Each VO should have a dedicated person, for instance to install the software required by that community.

  No disruption to current operations and end-users. At IFIC a local VO (“ific”) is used to train new users in grid technologies and tools

 

The end of support for the LCG-CE service, with a completed migration to the CREAM service (same situation for the UI).

  The proposal is that all sites supporting LHC experiments run CREAM and are no longer required to run LCG-CE for LHC.

  To support the new VOs

 

Storage Element: New space in our storage element (Lustre + StoRM) is being deployed for the new VOs without stopping the running services

(19)

Evolving ATLAS cloud model in 2011

 Summer 2011: all LFCs aggregated into a single LFC at CERN (agreement between ATLAS and WLCG)

 Cross Cloud Production

 Current situation: Some big T2 sites already associated to many Tier1s

 Adapt monitoring

 Data Collection into T2s: Extend current channel validation:

(20)

Application: ATLAS Distributed Analysis using the Grid, supported by Panda and GANGA

  How to combine all these: Job scheduler/manager: GANGA

  Heterogeneous grid environment based on 3 grid infrastructures: OSG, EGEE, NorduGrid

(21)

Ganga

https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedAnalysisUsingGanga

  A user-friendly job definition and management tool

  Allows simple switching between testing on a local batch system and large-scale data processing on distributed resources (Grid)

  Developed in the context of ATLAS and LHCb

  Python framework

  Support for development work from UK (PPARC/GridPP), Germany (D-Grid) and EU (EGEE/ARDA)

  Ganga is based on a simple, but flexible, job abstraction

  A job is constructed from a set of building blocks, not all required for every job

  Ganga offers three ways of user interaction:

  Shell command line

  Interactive IPython shell

  Graphical User Interface

  See Kenyon’s talk on 13th April in the user environment session
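The job abstraction described above (a job assembled from building blocks, with the backend switched from a local system to the Grid) can be sketched in plain Python. This is a toy model under stated assumptions: the `Job` class, its fields and the backend names are hypothetical and are not Ganga's own classes or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    """Toy job built from optional building blocks (not Ganga's real classes)."""
    application: str                      # what to run
    backend: str = "Local"                # where to run it
    inputdata: Optional[str] = None       # dataset to read, if any
    splitter: Optional[int] = None        # number of subjobs, if split
    history: List[str] = field(default_factory=list)

    def submit(self) -> str:
        """Pretend to submit; return a description of what would run."""
        parts = self.splitter or 1
        msg = f"{self.application} on {self.backend} as {parts} subjob(s)"
        self.history.append(msg)
        return msg


# Test on a local system first ...
job = Job(application="analysis.py")
local_run = job.submit()

# ... then move the same job to the Grid by changing building blocks only.
job.backend = "Grid"
job.inputdata = "example.dataset"   # hypothetical dataset name
job.splitter = 10
grid_run = job.submit()

if __name__ == "__main__":
    print(local_run)
    print(grid_run)
```

The point of the design is that the application block stays untouched while the backend, input data and splitter blocks change, which is what makes the switch from local testing to large-scale processing simple.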

(22)

Panda/Ganga usage at IFIC

 

HammerCloud: these tests are executed on a regular basis to spot potential problems at ATLAS sites (see Daniel’s talk on 12th April in the User support services session).

  The performance is shown to depend on the file system used. Lustre (at IFIC) works better without the file stager, while dCache (at IFAE and UAM) behaves better when the file stager is activated (ref)
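The HammerCloud observation above amounts to a per-site configuration choice. A minimal sketch, assuming the choice is expressed as a simple lookup (the dictionary and function names are illustrative, not taken from any ATLAS tool):

```python
# File system per site, as stated on the slide.
SITE_FILE_SYSTEM = {"IFIC": "Lustre", "IFAE": "dCache", "UAM": "dCache"}

# Preference reported from the HammerCloud tests:
# Lustre performed better without the file stager, dCache with it.
USE_FILE_STAGER = {"Lustre": False, "dCache": True}


def use_file_stager(site):
    """Return True if jobs at `site` should enable the file stager."""
    return USE_FILE_STAGER[SITE_FILE_SYSTEM[site]]


if __name__ == "__main__":
    for site in SITE_FILE_SYSTEM:
        print(site, "file stager:", use_file_stager(site))
```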

 

Panda Statistics:

(23)

DDM

(see Fernando Barreiro’s talk on 12th April in the Data management session)

[Plot annotations: STEP09, Data taking]

Volumes managed today:

more than 43 PB

more than 1.7 million datasets

more than 130 million files distributed across >100 sites

Aggregated data transfer record on 2010-05-09: 10 GB/s (plot from the ATLAS DDM Dashboard)

In production since 2004 and considered one of the largest data management environments

•  Manage the experiment’s data:

–  Data placement

–  Bookkeeping & accounting

–  Data access to the other systems and end-users
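To give a feel for the scale of the DDM figures quoted above, the following sketch derives a few rough ratios. It assumes decimal units (1 PB = 10^15 bytes) and treats the quoted "more than" values as if they were exact, so the results are indicative only.

```python
PB = 10**15  # bytes; decimal units are an assumption
GB = 10**9

total_bytes = 43 * PB          # "more than 43 PB"
datasets = 1.7e6               # "more than 1.7 million datasets"
files = 130e6                  # "more than 130 million files"
peak_rate = 10 * GB            # bytes per second, record on 2010-05-09

files_per_dataset = files / datasets                # average files per dataset
mean_file_size_mb = total_bytes / files / 10**6     # average file size in MB
tb_per_day_at_peak = peak_rate * 86_400 / 10**12    # TB moved in a day at the peak rate

if __name__ == "__main__":
    print(f"files per dataset:  {files_per_dataset:.0f}")
    print(f"mean file size:     {mean_file_size_mb:.0f} MB")
    print(f"one day at 10 GB/s: {tb_per_day_at_peak:.0f} TB")
```

Sustained for a full day, the 10 GB/s record rate would correspond to 864 TB transferred.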

(24)

  Monitoring helps improve the reliability of the sites:

  Data transfers

 

Job Monitoring

 

Site Commissioning

(25)

Conclusions

EGEE to EGI transition tasks performed:

 

Common middleware services and operations are now supported by EGI

 

A gradual migration from LCG-CE to CREAM-CE has been done in order to support the new VOs without interfering with the running services. The same has been done for the User Interfaces.

 

New space in our storage element (Lustre + StoRM) is being deployed for the new VOs without stopping the running services.

 

IFIC users submit their analysis jobs to where the data are, replicating the most used datasets to local storage.

 

Various tools are used, some of them supported by EGI, since HEP is a Heavy User Community (like biomed):

 

Ganga, Panda, DDM, Dashboards, ...

Impact on the IFIC e-Science infrastructure for ATLAS:

 

LHC restarted in November 2009 and successfully reached 7 TeV.

 

IFIC is part of the Spanish Tier-2 and has defined its Tier-3 to fulfil ATLAS requirements. Computing and Storage resources are in place according to

(26)

  Backup slides

(28)

The offline computing:

- Output event rate: 200 Hz ~ 10⁹ events/year

- Average event size (raw data): 1.6 MB/event

Processing:

- order of 40k of today’s fastest PCs

Storage:

- Raw data recording rate: 320 MB/sec

- Accumulating at 5-8 PB/year

A solution: Grid technologies

Worldwide LHC Computing Grid (WLCG)

ATLAS Data Challenge (DC)

ATLAS Production System (ProdSys)
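The offline computing figures on this slide are mutually consistent, which a short calculation makes explicit. This is a sketch using only the numbers quoted above; the variable names are illustrative and 1 PB is taken as 10^9 MB.

```python
EVENT_RATE_HZ = 200          # output event rate
EVENT_SIZE_MB = 1.6          # average raw event size
EVENTS_PER_YEAR = 1e9        # quoted as ~10^9 events/year

raw_rate_mb_s = EVENT_RATE_HZ * EVENT_SIZE_MB            # MB/s
live_seconds = EVENTS_PER_YEAR / EVENT_RATE_HZ           # seconds of data taking per year
raw_volume_pb = EVENTS_PER_YEAR * EVENT_SIZE_MB / 1e9    # PB/year (1 PB = 1e9 MB)

if __name__ == "__main__":
    print(f"raw recording rate: {raw_rate_mb_s:.0f} MB/s")
    print(f"data-taking time:   {live_seconds:.1e} s/year")
    print(f"raw data volume:    {raw_volume_pb:.1f} PB/year")
```

The product of event rate and event size reproduces the 320 MB/sec recording rate. Raw data alone come to about 1.6 PB/year, which is below the 5-8 PB/year accumulation figure; the slide does not break down what makes up the remainder.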

(29)

  Analysis Data Format

  Derived Physics Dataset (DPD): after many discussions last year in the context of the Analysis Forum, it will consist (for most analyses) of skimmed/slimmed/thinned AODs plus relevant blocks of computed quantities (such as invariant masses).

  Produced at Tier-1s and Tier-2s

  Stored in the same format as ESD and AOD at Tier-3s
