M³ supports VPE::run and VPE::exec to clone and load applications. Because these operations resemble fork and exec on Linux, this section compares their performance. As described in Section 5.9.4, VPE::run clones the current state of the application and executes a given function in another VPE. If virtual memory is supported, copy-on-access is used to clone the state, similar to copy-on-write for fork on Linux. VPE::exec loads a new application from the file system into another VPE. If virtual memory is supported, M³ uses demand loading, as exec does on Linux. In contrast to exec on Linux, VPE::exec can currently only be used on other VPEs, not on the calling VPE itself. It should also be mentioned that fork and exec on Linux offer more features than their equivalents on M³. For these reasons, the results of the comparison should be interpreted with a grain of salt. Nevertheless, comparing them puts the performance of M³’s operations into perspective.
First, I compare the performance of VPE::run and fork. On M³, the benchmark creates a new VPE and calls VPE::run. On Linux, the benchmark calls fork. In both cases, the measurement starts before the VPE creation or fork and stops as soon as the child starts executing. To evaluate the influence of the application’s size on the performance, the benchmark has an array of varying size (1 B to 8 MiB) in its static data segment, which is initialized before the measurement. Figure 5.14 shows the average of four runs after one warm-up run. M³ shows comparable performance to Linux on type-B PEs, but is significantly slower on type-C PEs due to the slower DTU register accesses. As can be seen, the performance depends on the application’s size in all configurations. On all configurations except PE-type A, the reason is that copy-on-access/copy-on-write requires setting all writable pages to read-only. On M³, this is only done for the parent, and all pages for the child are created on demand. On Linux, all page table entries of the parent are set to read-only and are copied to the child, which is the main reason why Linux’s performance scales worse with the application size in this benchmark. On PE-type A, all data needs to be copied eagerly due to the missing virtual-memory support, leading to poor performance for large applications. However, this is acceptable, because the scratchpad memory in type-A PEs is typically on the order of 100 KiB, which limits the application size anyway.
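The Linux side of this measurement can be illustrated with a minimal sketch. It is not the benchmark code used for Figure 5.14, but a Python approximation under stated assumptions: a bytearray stands in for the array in the static data segment, the monotonic clock is assumed to be comparable between parent and child (as it is on Linux), and the function names fork_latency_ns and average_fork_latency_ns are hypothetical.

```python
import os
import struct
import time


def fork_latency_ns(data_size: int) -> int:
    """Time from just before fork() until the child starts executing."""
    # Stand-in for the array in the static data segment (an assumption of
    # this sketch); it is initialized before the measurement starts.
    data = bytearray(b"\x01") * data_size

    read_fd, write_fd = os.pipe()
    start = time.monotonic_ns()
    pid = os.fork()
    if pid == 0:
        # Child: record when it starts running and report it to the parent.
        child_start = time.monotonic_ns()
        os.write(write_fd, struct.pack("<q", child_start))
        os._exit(0)

    # Parent: collect the child's timestamp and reap the child.
    os.close(write_fd)
    payload = os.read(read_fd, 8)
    os.close(read_fd)
    os.waitpid(pid, 0)
    del data  # the buffer was kept alive until after the fork
    return struct.unpack("<q", payload)[0] - start


def average_fork_latency_ns(data_size: int, runs: int = 4) -> float:
    """Average over `runs` measurements after one discarded warm-up run."""
    fork_latency_ns(data_size)
    return sum(fork_latency_ns(data_size) for _ in range(runs)) / runs


if __name__ == "__main__":
    for size in (1, 2 << 20, 4 << 20, 8 << 20):
        print(size, average_fork_latency_ns(size))
```

The absolute numbers from such a sketch are not comparable to the figure, since interpreter overhead and the host configuration differ. It only makes the measurement window concrete, from immediately before fork to the first instruction the child executes.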
The comparison of VPE::exec and exec has been done similarly. On M³, the benchmark creates a new VPE and calls VPE::exec to execute an application. On Linux, the benchmark calls vfork and also executes an application in the child process. On both OSes, the time is measured from the VPE creation or vfork until the child starts executing. In this case, vfork is used instead of fork to let the child process borrow the parent’s address space until the call of exec, improving the performance on Linux. As in the previous benchmark, the application that is executed in the child VPE/process contains an array of varying size (1 B to 8 MiB) in its static data segment. Figure 5.15 again shows the average of four runs after one warm-up run. Since the array of varying size does not need to be cloned in this case and is also not touched, the performance is mostly independent of the application’s size (except for M³-A). As in the previous benchmark, M³’s performance is roughly on the same level as Linux’s performance. On PE-type A, the entire application needs to be loaded, which costs significantly more time than using demand loading. Note also that on type-A PEs, VPE::exec takes much longer than VPE::run, because VPE::exec first has to load the data from the file into the parent’s address space and then copy it into the child’s address space. In contrast, VPE::run simply copies the data from the parent’s address space into the child’s address space.
Figure 5.15: Performance comparison of vfork+exec and VPE::exec on different PE types and with varying application sizes. (Panels: Linux, M³-A, M³-B, M³-C, M³-C*; x-axis ticks: 1 B, 2 MiB, 4 MiB, 8 MiB; y-axis: time in ms.)
5.14 Summary
This chapter extended the system with caches and virtual-memory support. To this end, I introduced two new PE types (called PE-type B and PE-type C) in addition to the existing type (called PE-type A). All three PE types have the same external interface to collaborate seamlessly with each other, but differ in the way the CU is attached to the DTU and in the internal memory (scratchpad memory or caches). PE-type A is intended for accelerators that prefer untranslated access to scratchpad memory (SPM). PE-type B is intended for accelerators that desire cache-based access to large amounts of data. Since these accelerators typically lack virtual-memory support, the DTU adds the support externally and transparently to the accelerator. Finally, PE-type C is intended for general-purpose cores that already have a memory management unit (MMU). For that reason, the MMU is reused for virtual-memory support and the DTU offloads the translation of virtual addresses to a small component running on the core, called the virtual-memory assistant (VMA).
Fortunately, in the remaining chapters of the thesis and when working with the system, the differences between the PE types can mostly be ignored. The reason is that all PE types have the same external interface, hiding these differences. In particular, all PE types accept RDMA requests, which always refer to the address space of the running VPE: the SPM on PE-type A and the virtual address space on PE-type B and C. Furthermore, page faults are handled by M³’s pager in the same way, independent of whether the paged application is running on PE-type B or PE-type C. The pager receives page faults as messages from the application’s PE (either from the VMA or from the DTU) and handles the page faults by using the kernel’s mechanism to update page table entries.
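The uniform page-fault path can be illustrated with a small model. It is a sketch under assumptions, not M³’s pager: the names PageFaultMsg, Kernel, Pager, and map_page are hypothetical, the 4 KiB page size is assumed, and physical pages are handed out by a simple counter. The mapping of senders to PE types (DTU on PE-type B, VMA on PE-type C) is inferred from the description of the two PE types.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size; not stated in the text


@dataclass(frozen=True)
class PageFaultMsg:
    vpe_id: int
    virt_addr: int
    sender: str  # "DTU" (assumed for PE-type B) or "VMA" (assumed for PE-type C)


@dataclass
class Kernel:
    """Stand-in for the kernel mechanism that updates page table entries."""
    page_tables: dict = field(default_factory=dict)

    def map_page(self, vpe_id: int, virt_page: int, phys_page: int) -> None:
        self.page_tables.setdefault(vpe_id, {})[virt_page] = phys_page


@dataclass
class Pager:
    kernel: Kernel
    next_free_page: int = 0  # trivial allocator, for illustration only

    def handle(self, msg: PageFaultMsg) -> int:
        """Resolve a fault and return the physical page now backing it."""
        # The sender is deliberately ignored: the handling path is the same
        # whether the message came from the DTU or from the VMA.
        virt_page = msg.virt_addr // PAGE_SIZE
        mapped = self.kernel.page_tables.get(msg.vpe_id, {})
        if virt_page not in mapped:
            self.kernel.map_page(msg.vpe_id, virt_page, self.next_free_page)
            self.next_free_page += 1
        return self.kernel.page_tables[msg.vpe_id][virt_page]
```

In this model, a fault reported by the DTU and one reported by the VMA take exactly the same code path. The only PE-specific part is who sends the message.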
The evaluation has shown that the performance of DTU commands is worse on PE-type C, because accessing the DTU’s registers is slower. However, this slowdown is mostly negligible in more realistic settings such as file and pipe accesses, which show comparable performance. Comparing the performance of page fault handling and application loading to Linux has also shown that M³ is competitive in this regard.
Chapter 6
Autonomous Accelerators
In the previous chapters, I introduced the architecture that enables the integration of very different kinds of compute units (CUs) as first-class citizens. I also introduced different processing element (PE) types that suit different kinds of CUs. After focusing on general-purpose cores in the previous chapters, this chapter integrates accelerators into type-A and type-B PEs and shows how M³’s concepts enable accelerators to run autonomously.
6.1 Motivation and Related Work
Running accelerators autonomously has multiple benefits. First, accelerators typically offer substantial energy savings over general-purpose cores [69, 86, 91]. However, if accelerators need to be assisted by a typically power-hungry general-purpose core during their operation, the system cannot benefit from the energy savings. For example, Google’s TPUs burden their controlling CPU with 11 % to 76 % load just to operate the TPU [69]. Second, if the CPU does not need to assist the accelerator, the CPU can perform other work in the meantime. Third, with an increasing number of accelerators that the CPU needs to assist simultaneously, the CPU becomes the bottleneck.
In this chapter, I first explain how accelerators can access OS services such as file systems or network stacks based on the file protocol introduced in Chapter 4. I also show how this protocol can be used for direct accelerator-to-accelerator communication. There are already specialized solutions that allow access to OS services by a specific type of accelerator, like GPUfs [133] and GPUnet [75] for GPUs or BORPH [135] and ReconOS [26] for FPGAs. However, there is no general solution that would grant any accelerator first-party access to OS services and also allow direct communication between multiple accelerators without assistance by the CPU. CAPI [15] uses a similar approach to integrate accelerators into a system, but focuses on cache coherency and address translation, whereas I am focusing on protocols to access OS services.
Second, I show how fine-grained interruptibility can be combined with autonomous operation. If a system wants to support multiple activities with different priorities on a single accelerator, a low-latency context switch to the prioritized activity is needed. However, accelerators are typically invoked by software and are not interruptible until the computation is complete. One way to lower the latency is to reduce the amount of data per invocation. Consequently, the compute time per invocation is reduced, but software needs to continuously invoke the accelerator, which causes more
CPU utilization and power consumption. Alternatively, the fine-grained invocation can be done in hardware by adding a simple state machine with preemption points next to the accelerator logic. This approach offers the best of both worlds: accelerators operate autonomously and can still be interrupted with low latency. As the context-switching mechanism is the subject of the next chapter, this chapter only explains how accelerators can be interrupted with low latency.
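The idea of preemption points can be illustrated with a small software model. It is only a sketch of the control flow. The state machine discussed here is meant to live in hardware next to the accelerator logic, and the function name run_with_preemption_points as well as its parameters are hypothetical. The model splits the input into chunks and checks for a pending interrupt request between chunks, so the interrupt latency is bounded by the compute time of a single chunk.

```python
from typing import Callable, List, Optional, Tuple


def run_with_preemption_points(
    data: List[int],
    chunk_size: int,
    compute: Callable[[List[int]], List[int]],
    interrupt_requested: Callable[[], bool],
    start: int = 0,
    output: Optional[List[int]] = None,
) -> Tuple[List[int], int, bool]:
    """Process `data` chunk by chunk, with a preemption point before each chunk.

    Returns (output, next_index, finished). If interrupted, the caller can
    resume later by passing `next_index` as `start` and the partial `output`.
    """
    output = [] if output is None else output
    pos = start
    while pos < len(data):
        # Preemption point: stop between chunks if an interrupt is pending.
        if interrupt_requested():
            return output, pos, False
        output.extend(compute(data[pos:pos + chunk_size]))
        pos += chunk_size
    return output, min(pos, len(data)), True
```

A smaller chunk_size lowers the worst-case interrupt latency at the cost of more frequent checks. In the software-driven variant described above, each of these checks would be a separate invocation by the CPU, whereas here the loop stands in for the state machine that performs them without CPU involvement.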