Proyectos sociales - DIAGNÓSTICO GENERAL DE LA EMPRESA

3. DIAGNÓSTICO GENERAL DE LA EMPRESA

3.2. Proyectos sociales

We have seen in the last chapter, that each core inside a Xeon Phi processor has up to 4 hardware threads called hyper-threads or SMT threads. QPhiX treats these threads inside a core as a grid of dimension Sy ∗ Sz, both of which variables can be set by the user. The number of OpenMP threads will then be equal to Ncores ∗ Sy ∗ Sz. The thread, core and SMT IDs satisfy the relation tid= Sy ∗ Sz ∗ cid + Sy ∗ smtid_z + smtid_yand can be calculated as shown in Lst. 3.3.

With the SMT threading, the data layout using tiles and the blocking scheme in place, we are now ready to understand the lattice traversal for the dslash stencil operator. It is implemented in the two routines

Dslash<FT,VECLEN,SOALEN,COMPRESS12>::DyzPlus

Dslash<FT,VECLEN,SOALEN,COMPRESS12>::DyzMinus

for dslash and its hermitian conjugate, respectively, in the file dslash_body.h (and similarly for the dslash stencil with clover term in clover_dslash_body.h). As its parameters, it takes pointers to the in- and output spinor fields, the gauge field, as well as the thread ID and the checkerboard (even or odd), cf. Lst. 3.4.

All threads loop over all the phases constructed from the geometry. If there are fewer blocks than cores, cores which are not needed, will skip to the next phase. From the thread ID, each thread will calculate its core ID, the SMT IDs, the block, it has to process, as well as the starting value in the time direction. Every thread then streams over its range of T values in its block. For each given time slice, it traverses the XYZ volume in the following way: Every core splits its respective XY-planes between the SMT threads, such that every SMT thread processes a tile of size SOALEN ∗ ngy, at any given step. The volume is then traversed by processing chunks of size SOALENin the X-direction, and increasing Y in steps of ngy (the Y-axis of the SMT grid). The next plane (associated with the Z-axis of the SMT grid) will then be reached, by looping over the Z-direction in strides of Sz.

Within the loops, the indices can be used to calculated the addresses of neighbouring spinor blocks, the output spinor and the gauge field, as well as the offsets to the neighbouring spinors for successive blocks. They will be used to prefetch data to the L2 cache, in case that software prefetching is turned on. In addition, communication for the faces is handled. Finally, all this

1 template<typename FT , i n t VECLEN , i n t SOALEN , bool COMPRESS12> 2 void Dslash <FT , VECLEN , SOALEN , COMPRESS12 > : : DyzPlus (

3 i n t t i d , c o n s t F o u r S p i n o r B l o c k ∗ psi ,

4 F o u r S p i n o r B l o c k ∗ r e s u l t , c o n s t SU3MatrixBlock ∗ u , i n t cb )

5 {

6 / / Loop over phases

7 f o r(i n t ph = 0 ; ph < num_phases ; ph ++) { 8 i n t nActiveCores = phase . Cyz ∗ phase . Ct ; 9 i f ( c i d >= nActiveCores ) c o n t i n u e; 10

11 / / Stream over b l o c k s of time s l i c e s 12 f o r(i n t c t = 0 ; c t < Nct ; c t ++) { 13 14 / / Handle B / Cs here 15 16 / / Loop over z 17 f o r(i n t cz = smtid_z ; cz < Bz ; cz += Sz ) { 18 19 / / c a l c u l a t e s p i n o r base a d d r e s s f o r : 20 / / x & y neighbours

21 / / backward / forward z neighbour 22 / / backward / forward t neighbour 23 / / output

25 / / Loop over y

26 f o r(i n t cy = nyg ∗ smtid_y ; cy < By ; cy += nyg ∗ Sy ) { 27 28 / / cx l o o p s over the s o a l e n p a r t i a l v e c t o r s 29 f o r(i n t cx = 0 ; cx < nvecs ; cx ++) { 30 31 / / c a l c u l a t e base a d d r e s s f o r gauges 32 / / and v a r i o u s o f f s e t to s u c c e s s i v e f i e l d s 33 / / f o r L2 p r e f e t c h e s 34

35 / / C a l l the pre−g e n e r a t e k e r n e l on the t i l e 36 / / with the a p p r o p r i a t e p o i n t e r s

37 d s l a s h _ p l u s _ v e c <FT , VECLEN , SOALEN , COMPRESS12 > ( . . . ) ; 38 39 } 40 } / / End f o r over s c a n l i n e s y 41 } / / End f o r over s c a l i n e s z 42 43 / / C a l l a b a r r i e r w i th i n a c ore group 44 i f( c t % BARRIER_TSLICES == 0 ) 45 b a r r i e r s [ ph ] [ b i n f o . c i d _ t ]−> wait ( b i n f o . g r o u p _ t i d ) ; 46 47 } / / end f o r over t 48 } / / phases 49 }

Listing 3.4: The skeleton of the lattice traversal for the dslash stencil. Here, indices of neighbouring spinor and gauges, output addresses, and offsets for L2 prefetches are computed and the pre-generated service kernel, which operates on tiles is called. There, the main dslash computation is carried out. (We have suppressed all QMP communication done here.) [dslash_body.h]

1 template<typename FT , i n t V , i n t S , bool C>

2 void axpy (c o n s t double alpha , c o n s t F o u r S p i n o r B l o c k ∗ x , 3 F o u r S p i n o r B l o c k ∗ y , c o n s t Geometry <FT , V , S , C>& geom , 4 i n t n _ b l a s _ s i m t ) 5 { 6 AXPYFunctor <FT , V , S , C> f ( alpha , x , y ) ; 7 siteLoopNoReduction <FT , V , S , C , AXPYFunctor <FT , V , S , C> > 8 ( f , geom , n _ b l a s _ s i m t ) ; 9 }

Listing 3.5: Structure of BLAS routines, for the example of an axpy. Every operation has its own functor, which is constructed and fed to the site loop facility template functions, which also handle reductions. [blas_new_c.h]

data is passed to the service kernel, in which the actual calculation will take place. The necessary kernels are pre-generated with QPhiX-codegen, which we will cover shortly.

In document Plan de comunicación para "El Armadillo Ilustrado" (página 37-51)