
The first question to answer is: how expensive is a context switch? To evaluate that, I set up a benchmark that performs a remote procedure call (consisting of request and response) with running or suspended VPEs in different settings. The kernel PE and the PE that runs the benchmark always use PE-type C, containing a single out-of-order x86-64 core clocked at 3 GHz with 32 KiB L1 instruction cache, 32 KiB L1 data cache, and 256 KiB L2 cache. The PE of the communication partner varies to show the context switch on all PE types. PE-types A and B contain a stream-processing and a request-processing accelerator, respectively, and are clocked at 1 GHz. PE-type C is configured like the two other type-C PEs. As in the previous chapters, the DDR3_1600_8x8 model of gem5 is used as the physical memory, clocked at 1 GHz. As a reference point, I also show the time for core-local and cross-core IPC on the microkernel/hypervisor NOVA [137], because NOVA has a well-optimized IPC mechanism. The results for NOVA have been obtained on gem5 with the same core configuration as used in type-C PEs (see Section 3.7.2).
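The measurement procedure can be sketched as follows. This is a minimal illustration and not the benchmark used for the evaluation: the function names `measure_round_trip` and `echo_rpc` are hypothetical, the number of warm-up calls is an assumption, and the stand-in RPC runs in-process instead of exchanging messages between PEs. It only shows the general shape of timing a request/response pair and averaging over a fixed number of runs after warming the caches.

```python
import statistics
import time


def measure_round_trip(rpc, runs=16, warmup=4):
    """Return the mean duration of rpc() in microseconds over `runs` calls."""
    for _ in range(warmup):
        rpc()  # warm-up calls (assumed count); these are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        rpc()  # one request/response exchange
        samples.append((time.perf_counter_ns() - start) / 1000.0)
    return statistics.mean(samples)


def echo_rpc():
    """Hypothetical stand-in for one remote procedure call."""
    request = b"ping"
    response = bytes(request)  # a real partner would reply from another PE
    return response


if __name__ == "__main__":
    print(f"mean round trip: {measure_round_trip(echo_rpc):.3f} us")
```

Passing the RPC as a callable keeps the timing loop independent of how the communication partner is reached, so the same loop could serve the running and the suspended case.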

Figure 7.4 shows the average time over 16 runs with warm caches. The times are split into multiple components to explain the behavior. “Wake” denotes the time from the moment the benchmark application fails to send the message to its suspended communication partner until the wakeup of the target PE to perform a context switch. “CtxSw” denotes the time for the actual context switch, “Fwd” the time to forward the message to the communication partner, and “Comm” denotes the remaining time for the communication itself. The first two rows in Figure 7.4 show the time for core-local (“local”) and cross-core (“remote”) communication on NOVA². The next six rows show the times to communicate with a VPE on another PE of different PE types. For each PE type, the first row shows the case where the communication partner has its PE exclusively for itself (“rem-ex”), followed by the case where two VPEs share a PE (“rem-sh”). In the former case, no context switch is required. In the latter case, the benchmark communicates with the two VPEs in an alternating fashion, which requires a context switch for every communication. The final row shows the time for a core-local communication on M³, requiring two context switches.

[Figure 7.4: Overhead of communication with running and suspended VPEs. Rows: NOVA (local), NOVA (remote), M³-A (rem-ex), M³-A (rem-sh), M³-B (rem-ex), M³-B (rem-sh), M³-C (rem-ex), M³-C (rem-sh), M³-C (local). Horizontal axis: time in µs, from 0 to 11. Stacked components: Wake, CtxSw, Fwd, Comm.]
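To make the decomposition concrete, the following sketch models one row of Figure 7.4 as the sum of its four components. The class name `Breakdown` and all numeric values are placeholders chosen for illustration; they are not read off the figure. The sketch also assumes that the “rem-ex” case, which needs no context switch, consists of the “Comm” component only.

```python
from dataclasses import dataclass


@dataclass
class Breakdown:
    """Time components of one communication, in microseconds."""
    wake: float = 0.0   # "Wake": failed send until the target PE is woken up
    ctxsw: float = 0.0  # "CtxSw": the actual context switch
    fwd: float = 0.0    # "Fwd": forwarding the message to the partner
    comm: float = 0.0   # "Comm": remaining time for the communication itself

    def total(self):
        return self.wake + self.ctxsw + self.fwd + self.comm

    def switch_overhead(self):
        # Everything that only occurs because the partner was suspended.
        return self.wake + self.ctxsw + self.fwd


# Placeholder values, NOT the measured results from Figure 7.4.
rem_ex = Breakdown(comm=1.0)
rem_sh = Breakdown(wake=1.0, ctxsw=2.0, fwd=0.5, comm=1.5)

if __name__ == "__main__":
    print("rem-ex total:", rem_ex.total())
    print("rem-sh total:", rem_sh.total())
    print("rem-sh switch overhead:", rem_sh.switch_overhead())
```

Note that “Comm” may differ between the two cases as well, so the difference between two rows is not necessarily equal to the sum of “Wake”, “CtxSw”, and “Fwd”.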

As can be seen in Figure 7.4 when comparing the rows labeled “M³-* (rem-sh)” with the rows labeled “M³-* (rem-ex)”, communication on M³ is significantly more expensive if the communication partner is suspended, due to the context-switching overhead. The overhead is similar on all three PE types, but has different causes. First, PE-type C is clocked at 3 GHz, while PE-types A and B are clocked at 1 GHz, with the consequence that PE-type C performs more work in a comparable period of time. This stems primarily from the fact that PE-type C executes software, whereas PE-types A and B use a finite state machine that performs the context switch in hardware. The context switch itself is more expensive on PE-type A than on PE-type B, because the content of the scratchpad memory in PE-type A needs to be saved and restored. On the other hand, the communication overhead (“Comm”) is larger on PE-type B due to TLB misses (the DTU’s TLB is not tagged and hence needs to be flushed on a context switch). The communication overhead on PE-type C is even larger due to the DTU’s TLB misses, which occur if, for example, the DTU needs to store a received message. The reason is that on PE-type C, in contrast to PE-type B, these TLB misses are not handled in hardware; instead, the DTU injects an interrupt and lets the virtual-memory assistant handle the TLB miss in software.
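The effect of an untagged TLB on alternating communication can be illustrated with a toy model. This is a sketch under simplifying assumptions and does not describe the DTU's actual TLB: the class `Tlb`, the function `run`, the unlimited capacity, and the access pattern (each VPE touches the same two pages after every switch) are all hypothetical. It only shows why flushing on every context switch causes misses that a tagged TLB would avoid.

```python
class Tlb:
    """Toy TLB model that only counts misses; capacity is unlimited."""

    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = set()
        self.misses = 0

    def context_switch(self):
        if not self.tagged:
            # Untagged entries cannot be told apart, so they must be flushed.
            self.entries.clear()

    def access(self, vpe, page):
        # A tagged TLB keys entries by (VPE, page), an untagged one by page.
        key = (vpe, page) if self.tagged else page
        if key not in self.entries:
            self.misses += 1
            self.entries.add(key)


def run(tagged, rounds=4, pages=(0, 1)):
    """Alternate between two VPEs on one PE and return the miss count."""
    tlb = Tlb(tagged)
    for _ in range(rounds):
        for vpe in (1, 2):
            tlb.context_switch()
            for page in pages:
                tlb.access(vpe, page)
    return tlb.misses


if __name__ == "__main__":
    print("untagged misses:", run(tagged=False))
    print("tagged misses:", run(tagged=True))
```

In this model the untagged variant misses on every access after a switch, whereas the tagged variant misses only on the first access of each VPE to each page.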

² On NOVA, the core-local communication requires two context switches. The cross-core communication does not necessarily involve context switches, but has a high overhead due to the use of inter-processor interrupts and kernel entries and exits.

[Figure 7.5: Total runtime of a simultaneous execution of two applications compared to a sequential execution, using different time slices. Axes: time slice (1 ms, 2 ms, 4 ms, 8 ms) and relative runtime (0.990 to 1.000).]

The core-local communication on M³ (“M³-C (local)”) is even more expensive, because it requires two context switches: one context switch from the benchmark application that sends the request to the communication partner and another switch to send the reply back to the benchmark application. However, the overhead is less than twice as high, because the M³ kernel knows that both communication partners share a PE. For that reason, the M³ kernel can directly switch from VPE1 to VPE2 without asking VPE1 first whether it is currently idling, resulting in a shorter wakeup time (“Wake”). On the other hand, the communication overhead (“Comm”) is larger, because both messages need to be forwarded, leading to more interrupts caused by TLB misses in the DTU.
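The shortcut in the core-local case can be sketched as a simple decision. This is a strongly simplified model and not the actual protocol between the kernel and RCTMux: the function `wakeup_steps` and the step descriptions are hypothetical, and the assumption that the kernel otherwise asks the VPE currently running on the target PE whether it is idle is inferred from the description above.

```python
def wakeup_steps(sender_pe, target_pe):
    """List the assumed kernel steps until a suspended partner gets a message."""
    steps = []
    if sender_pe != target_pe:
        # Assumed remote case: the kernel first asks the VPE that currently
        # runs on the target PE whether it is idle.
        steps.append("ask running VPE whether it is idle")
    # Core-local case: sender and partner share a PE, so the kernel can
    # switch directly without the idle check.
    steps.append("switch to target VPE")
    steps.append("forward message")
    return steps


if __name__ == "__main__":
    print("remote:", wakeup_steps(sender_pe=0, target_pe=1))
    print("local: ", wakeup_steps(sender_pe=0, target_pe=0))
```

The local case skips one step per switch, which matches the shorter “Wake” component, although it still needs two switches per request/response pair.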

In summary, the context-switching overhead is much higher on M³ than on NOVA, because context switches are done remotely based on a communication protocol between the kernel and RCTMux. However, there is still room for optimization, for example, by using a tagged TLB in the DTU. I would also like to highlight that core-local communication is fast on NOVA, whereas cross-core communication is fast on M³ if the communication partner is running. This suggests that a combination of these two approaches could be used to get the advantages of both. I will discuss this combination in more detail in the conclusion in Chapter 9.
