The first product of the Intel MIC architecture is the Intel Xeon Phi coprocessor, code-named Knights Corner [38]. It features between 50 and 61 cores, depending on the model. Figure 2.4 shows a diagram of the main components of the architecture.
Every core in the chip is based on a simple in-order x86 architecture that runs at approximately 1.0 to 1.2 GHz, a much lower frequency than that of the cores of regular Intel Xeon processors. Each core has four hardware threads that are scheduled in a round-robin fashion and can issue up to two instructions per cycle from the same thread context. A relevant limitation in this respect is that a core cannot issue instructions from the same thread context in back-to-back cycles [40]. This means that, in order to take advantage of all the execution cycles of a core, we need at least two threads running on it.
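The effect of the back-to-back issue restriction can be illustrated with a small simulation. The following Python sketch is a simplified model, not a description of the real pipeline: it assumes that every hardware thread always has an instruction ready (no stalls or cache misses), and the function name `issue_utilization` is only illustrative.

```python
def issue_utilization(num_threads: int, cycles: int = 1200) -> float:
    """Fraction of cycles in which the core issues, under a simplified model.

    Assumptions (not taken from the hardware documentation): every thread
    always has an instruction ready, and the only constraint is that a
    thread cannot issue in two consecutive cycles.
    """
    if num_threads < 1:
        raise ValueError("at least one hardware thread is required")

    last_issue = [-2] * num_threads  # cycle in which each thread last issued
    next_thread = 0                  # round-robin pointer
    issued_cycles = 0

    for cycle in range(cycles):
        # Visit the threads in round-robin order, starting at the pointer.
        for k in range(num_threads):
            t = (next_thread + k) % num_threads
            if cycle - last_issue[t] >= 2:  # not a back-to-back issue
                last_issue[t] = cycle
                issued_cycles += 1
                next_thread = (t + 1) % num_threads
                break

    return issued_cycles / cycles


for n in (1, 2, 4):
    print(n, "thread(s):", issue_utilization(n))
```

Under these assumptions, a single thread can use only half of the issue cycles, whereas two or more threads per core can fill all of them.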
The cache hierarchy is organized in two levels. The L1 cache is local to each core and has 32 KB for instructions and 32 KB for data. This level of the cache is thus shared among all the hardware threads of the core. The L2 cache is distributed across the cores in modules of 512 KB that hold both data and instructions. The whole level is shared among all threads. However, accessing the local module of a core is potentially faster than accessing remote modules. In addition, although the L2 cache is shared, data can be replicated in different L2 modules: if a core requests a cache line from a remote L2 module, it keeps a copy of that cache line in its local L2 module.
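The replication behavior of the distributed L2 cache can be sketched with a toy model. The Python class below is a hypothetical illustration only: it ignores module capacity, evictions and coherence states, and the names `DistributedL2`, `read` and `copies` are not taken from any Intel documentation.

```python
class DistributedL2:
    """Toy model of a shared L2 cache split into per-core modules."""

    def __init__(self, num_cores: int) -> None:
        # One set of cached line addresses per core-local L2 module.
        self.modules = [set() for _ in range(num_cores)]

    def read(self, core: int, line: int) -> str:
        """Read a cache line from `core` and report where it was found."""
        if line in self.modules[core]:
            return "local hit"
        found_remote = any(
            line in module
            for other, module in enumerate(self.modules)
            if other != core
        )
        # In both remaining cases the line ends up in the local module;
        # after a remote hit, the line is therefore replicated.
        self.modules[core].add(line)
        return "remote hit" if found_remote else "miss"

    def copies(self, line: int) -> int:
        """Number of L2 modules that currently hold the line."""
        return sum(line in module for module in self.modules)


l2 = DistributedL2(num_cores=4)
print(l2.read(0, 0x40))   # first access: miss
print(l2.read(1, 0x40))   # found in the module of core 0: remote hit
print(l2.read(1, 0x40))   # now replicated locally: local hit
print(l2.copies(0x40))    # the line is held by two modules
```

In this model, data that is read by many cores occupies one line in each of their local modules, so the effective capacity of the shared level can be smaller than the sum of the module sizes.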
Figure 2.4: Intel Xeon Phi coprocessor diagram (Knights Corner).
Number of chips     1           VPU size               512 bits
Cores / chip        61          Memory size            16 GB
Hardware stepping   C0          Memory bandwidth       352 GB/s
Threads / core      4           ECC mode               Supported
Frequency           1.238 GHz   Peak performance (DP)  1.2 TFLOPS
L1 size / core      32+32 KB    Power consumption      300 W
L2 size / core      512 KB      Software stack         Gold
Table 2.1: Characteristics of the Intel Xeon Phi coprocessor 7120P
Both the L1 and L2 caches are fully coherent. The coherence protocol is implemented by means of distributed tag directories (DTDs), which keep the coherence information of each cache line. The tag information of each cache line is assigned to a DTD by a hash function of its address. Dealing with the DTD can account for a large part of the overhead in a cache-hit memory access [127].
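The assignment of cache lines to tag directories can be sketched as follows. The actual hash function is not described here, so the XOR-folding hash below is purely hypothetical, and the cache line size (64 bytes) and the number of DTDs (64) are assumptions made for illustration.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines
NUM_DTDS = 64          # assumption: number of tag directories on the ring


def dtd_for_address(address: int, num_dtds: int = NUM_DTDS) -> int:
    """Map a physical address to the DTD that tracks its cache line.

    The hash is hypothetical: it XOR-folds the cache line number so that
    consecutive lines are spread over different tag directories.
    """
    line = address // CACHE_LINE_BYTES
    folded = line ^ (line >> 6) ^ (line >> 12)
    return folded % num_dtds


# All bytes of a cache line are tracked by the same DTD...
print(dtd_for_address(0x1000) == dtd_for_address(0x103F))
# ...while consecutive cache lines are spread across the directories.
print(len({dtd_for_address(i * CACHE_LINE_BYTES) for i in range(64)}))
```

The point of such a hash is that the DTD in charge of a line depends only on the line address, not on the core that holds the data, so a coherence request may have to travel along the ring even when the data itself is nearby.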
Cores and the L2 modules are connected through a double ring bus (one bus in each direction) as depicted in Figure 2.4. The ring also connects the memory controllers and the I/O interface that allows the coprocessor to communicate with the host through the PCIe bus. The memory controllers have up to 16 memory channels available to access the on-board GDDR memory.
Regarding the SIMD instruction set, the coprocessor comes with a specially designed Vector Processing Unit (VPU) that provides the architecture with 512-bit SIMD instructions, called Intel® Initial Many Core Instructions (Intel® IMCI). This SIMD instruction set supports gather/scatter memory instructions, masked instructions for predicated execution, advanced shift and permutation operations and fused multiply-and-add operations, among other features. As a result, the coprocessor can yield a peak performance of 1.2 teraFLOPS in double precision on a 300 W thermal design power (TDP) package.
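The 1.2 teraFLOPS figure is consistent with the theoretical peak that follows from the values in Table 2.1. The short Python sketch below reproduces this arithmetic under the assumption that each core can complete one 512-bit fused multiply-and-add instruction per cycle; the function name is illustrative.

```python
def peak_dp_gflops(cores: int, freq_ghz: float,
                   vector_bits: int = 512, fma: bool = True) -> float:
    """Theoretical double-precision peak in GFLOPS.

    Assumes one vector instruction per core per cycle, with a fused
    multiply-and-add counted as two floating-point operations per lane.
    """
    lanes = vector_bits // 64              # 64-bit double-precision lanes
    flops_per_cycle = lanes * (2 if fma else 1)
    return cores * freq_ghz * flops_per_cycle


# Values of the 7120P from Table 2.1: 61 cores at 1.238 GHz.
print(f"{peak_dp_gflops(61, 1.238):.1f} GFLOPS")
```

This is an upper bound: it can only be approached by code that keeps every core issuing a vector fused multiply-and-add in every cycle.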
The support of gather/scatter SIMD instructions is limited in Intel IMCI. In this instruction set, only gather/scatter operations that can be expressed with a single memory address as base for all vector lanes and a set of 32-bit integer offsets have direct support in hardware. In this thesis, we denote these gather/scatter operations as simple or uniform-base. Gather/scatter operations that cannot be expressed using a single memory address as base are denoted as gather/scatter of pointers.
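The distinction between the two kinds of gather can be made concrete with a scalar emulation. The Python sketch below models memory as a list indexed by element address; the function names are hypothetical, and `to_uniform_base` is only one possible way to check whether a set of pointers can be expressed in uniform-base form, not necessarily the approach followed in this thesis.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def gather_uniform_base(memory, base, offsets):
    """Uniform-base gather: a single base plus one 32-bit offset per lane."""
    for offset in offsets:
        if not INT32_MIN <= offset <= INT32_MAX:
            raise ValueError("offset does not fit in a 32-bit integer")
    return [memory[base + offset] for offset in offsets]


def gather_of_pointers(memory, pointers):
    """Gather of pointers: each lane carries its own full address."""
    return [memory[pointer] for pointer in pointers]


def to_uniform_base(pointers):
    """Try to rewrite per-lane pointers as (base, 32-bit offsets).

    Returns None when the pointers are too far apart for 32-bit offsets.
    """
    base = min(pointers)
    offsets = [pointer - base for pointer in pointers]
    if max(offsets) > INT32_MAX:
        return None
    return base, offsets


memory = list(range(100, 200))
print(gather_uniform_base(memory, 10, [0, 3, 5]))
print(gather_of_pointers(memory, [40, 12, 20]))
print(to_uniform_base([40, 12, 20]))
```

When the conversion succeeds, a gather of pointers yields the same result as a uniform-base gather over the computed base and offsets.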
Intel IMCI masked instructions have a particular feature. Most of them require an additional vector register argument that supplies the values of the output vector lanes that are disabled in the mask. We denote this argument old value. In this way, these masked instructions perform an implicit blend operation between the result of the instruction (for the vector lanes enabled) and the old value register (for the vector lanes disabled).
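The implicit blend can be illustrated with a scalar emulation of a masked vector addition. The Python sketch below uses four lanes for brevity, and the name `masked_add` is illustrative rather than an actual intrinsic.

```python
def masked_add(old_value, mask, a, b):
    """Emulate a masked vector add with an old value argument.

    Lanes enabled in the mask receive a[i] + b[i]; lanes disabled in the
    mask keep the corresponding element of old_value.
    """
    return [
        x + y if enabled else old
        for old, enabled, x, y in zip(old_value, mask, a, b)
    ]


old_value = [9, 9, 9, 9]
mask = [1, 0, 1, 0]
print(masked_add(old_value, mask, [1, 2, 3, 4], [10, 20, 30, 40]))
```

Passing one of the source operands as old value makes the disabled lanes keep their input unchanged, while passing a different register merges the result into previously computed data.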
A particular feature of this SIMD instruction set exploited in this thesis is vector streaming stores. Vector streaming stores are useful for writing non-temporal data, i.e., data that is not going to be read shortly. These vector stores perform the write without first requesting the data of the involved cache line (read for ownership), which saves memory bandwidth in case of a cache miss. In addition to regular vector streaming stores, the Intel Xeon Phi coprocessor also includes a special implementation denoted as non-globally ordered vector streaming stores. This special implementation may improve performance by relaxing the memory consistency: writes that follow a non-globally ordered streaming store in program order can be observed before it.
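The bandwidth saving of streaming stores can be estimated with a simple traffic model. The Python sketch below assumes 64-byte cache lines, that every written line misses in the cache, that lines are written in full and that each line is eventually written back to memory exactly once; the function name is illustrative.

```python
def store_traffic_bytes(num_bytes: int, streaming: bool,
                        line_bytes: int = 64) -> int:
    """Memory traffic generated by writing num_bytes, all cache misses.

    Assumptions: full-line writes, every line misses in the cache and is
    written back to memory exactly once.
    """
    lines = -(-num_bytes // line_bytes)   # ceiling division
    # A regular store first reads the line (read for ownership);
    # a streaming store skips that read.
    read_for_ownership = 0 if streaming else lines * line_bytes
    write_back = lines * line_bytes
    return read_for_ownership + write_back


print(store_traffic_bytes(4096, streaming=False))  # read + write traffic
print(store_traffic_bytes(4096, streaming=True))   # write traffic only
```

Under these assumptions, streaming stores halve the memory traffic of a write-only access pattern.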
The coprocessor supports a standard software stack with a Linux operating system and programming models such as OpenMP, OpenCL or MPI. Applications written in one of these paradigms can therefore readily run on the Intel Xeon Phi coprocessor, which makes the optimization of these programming models for the Intel MIC architecture of great importance.
In this thesis, we use the 61-core Intel Xeon Phi coprocessor 7120P described in Table 2.1, with C0 silicon and ECC memory mode enabled, in our evaluation experiments.