Almost all major applications in use today execute some form of dynamically generated code. Web browsers aggressively use the latest just-in-time (JIT) compilation techniques to speed up JavaScript execution. Dynamic instrumentation and dynamic translation of code are also common in trace tools that inject performance analysis code to gather metrics from running applications. However, program flow analysis of such code using hardware trace presents many challenges. To explain the current limitations, we first define the scope of the problem, give some background on code execution, and then move to an example where it manifests.
For a given target process P, Figure 7.1 illustrates the operating system's view of the process memory. The virtual address space of P is populated with pages, a number of which lie in executable Virtual Memory Areas (VMAs), typically the .text sections of the process, which contain the executable code of the program and of its shared libraries. This is shown as VMA1, which contains executable file-backed pages that we name the code section CSp of P.
Figure 7.1 Runtime and file-backed code section for a process P as observed by the OS (diagram labels: Text, Data, BSS, Heap; anonymous pages CSr in VMA0, file-backed pages CSp in VMA1; each VMA spans vm_start to vm_end)
Consider that P now generates dynamically compiled code. Typically for such code, the
memory is dynamically allocated on the heap and the code is copied to the assigned pages, which are then marked as executable. Unlike the pages in VMA1, these pages in VMA0 are anonymous and have no backing file. We name these pages the code section CSr. At runtime, some of these pages may need to be modified and revised. As seen in Figure 7.2, each revision or newly compiled section can be considered a single segment CSrn, where n is the number of times a new section has been added or a previous section revised and rewritten at runtime. This behavior is common to all dynamically generated code, both in userspace and in the kernel. As an example, for a userspace JIT-compiled network packet filter based on eBPF, CSr may represent a single page of dynamically compiled filter code that is modified repeatedly at runtime, based on policy requirements. We elaborate on this in Section 7.5. We can now define the process control flow function
F(P) as
$$F(P) = F(CS_p) \cup \sum_{i=1}^{n} F(CS_{r_i})$$
where F denotes the instruction flow of a given code section and Σ signifies the union of the individual flows of the sections CSri. However, a software-only approach to generating the flow F(P) would also involve inserting extra code before each CSr section, adding instructions to the critical execution path. As discussed, for JIT-compiled code this is currently achieved only with JIT-compiler-specific functions or language-dependent APIs [65, 68].
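The distinction drawn above between file-backed code (CSp) and anonymous runtime code (CSr) can be observed from userspace on Linux through the /proc/<pid>/maps listing. The following is a minimal sketch, not part of the technique described here: the sample listing is made up for illustration, and the names `Vma`, `parse_maps_line`, `classify` and `code_sections` are hypothetical helpers. It assumes that an executable mapping with a non-zero inode is file-backed and that an executable mapping with inode 0 and no pathname is anonymous.

```python
from typing import Dict, List, NamedTuple, Optional


class Vma(NamedTuple):
    vm_start: int
    vm_end: int
    perms: str
    inode: int
    path: str


# Made-up sample in the /proc/<pid>/maps format:
# address-range perms offset dev inode [pathname]
SAMPLE_MAPS = """\
55d0c8a00000-55d0c8a21000 r-xp 00000000 08:01 1234567 /usr/bin/app
7f3a10000000-7f3a10001000 rwxp 00000000 00:00 0
7ffd5e3c0000-7ffd5e3c2000 r-xp 00000000 00:00 0 [vdso]
7ffd5e1f0000-7ffd5e211000 rw-p 00000000 00:00 0 [stack]
"""


def parse_maps_line(line: str) -> Vma:
    """Parse one maps line; the pathname field is optional."""
    fields = line.split(None, 5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Vma(start, end, fields[1], int(fields[4]), path)


def classify(vma: Vma) -> Optional[str]:
    """Return 'CSp', 'CSr', or None for mappings that are not code sections."""
    if "x" not in vma.perms:
        return None          # not executable
    if vma.inode != 0:
        return "CSp"         # executable and file-backed
    if vma.path == "":
        return "CSr"         # executable and anonymous
    return None              # special kernel mapping such as [vdso]


def code_sections(maps_text: str) -> Dict[str, List[Vma]]:
    """Group the executable VMAs of a maps listing into CSp and CSr."""
    sections: Dict[str, List[Vma]] = {"CSp": [], "CSr": []}
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        vma = parse_maps_line(line)
        kind = classify(vma)
        if kind is not None:
            sections[kind].append(vma)
    return sections


if __name__ == "__main__":
    for kind, vmas in code_sections(SAMPLE_MAPS).items():
        for vma in vmas:
            print(kind, hex(vma.vm_start), hex(vma.vm_end), vma.path or "(anonymous)")
```

A listing like this only shows where CSr pages live at one instant; it says nothing about how their content changes between revisions, which is the gap the rest of this section addresses.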
Figure 7.2 Corresponding hardware pages (diagram labels: Process Code CSp and Runtime Code Pages CSr1, CSr2, ..., CSrn along the Code Execution Flow, with Hardware Trace Tp, Tr1, ..., Trn)
thus generating true execution profiles at very low overhead. We have discussed this in detail in our previous work [60]. Therefore, for each branch encountered in CSp and CSr, the processor generates encoded trace packets that record whether the branch was taken or not taken, along with the instruction pointer (IP) where required. We represent these trace packets symbolically as Tp for CSp and Tr for CSr. Decoding the encoded branch trace requires the static binaries of the running process to be available on disk, since the pages belonging to this VMA are file-backed. Therefore, when traced with hardware, the control flow of the process code section, F(CSp), can be derived as
$$F(CS_p) = \Pi(CS_p, T_p)$$
where Π is a map-and-merge function that takes the statically available process code segment (CSp) and the corresponding hardware trace packets (Tp) as input, and generates the flow as output. However, for a dynamically generated section CSrn, it is not possible to faithfully obtain F(CSrn), because the packets Trn do not map to any available code segment: they belong to a VMA backed by anonymous memory. For example, JIT compilers cause in-memory execution of short sections of dynamically generated code that hardware trace decoders fail to account for, as they expect static binaries while decoding.
As discussed in the previous section, one solution to the non-availability of CSr sections is to use JIT- or language-specific APIs that periodically dump runtime-compiled code as it is generated and executed. However, this may require recompiling the JIT-enabled runtime, which may not be possible when diagnosing production systems that do not allow code modifications. Moreover, it adds undesired API code to the critical flow path, which eventually shows up in F(P).
We observed this problem in many locations in the Linux kernel, where modifying code for optimization is fairly common in the trace, network packet filter and security subsystems. The problem is more acute in userspace, where multiple languages may use JIT compilation and APIs to dump and analyze JIT code may not always be available. This motivated us to approach the reconstruction limitation of state-of-the-art hardware trace systems from a different perspective. We therefore propose a kernel-assisted technique that monitors executable code memory and records CSrn sections transparently, in order to generate accurate program flow. To obtain the flow of a given dynamic code section CSrn, we define F(CSrn) as
$$F(CS_{r_n}) = \Gamma(CS_{r_n}, \mathrm{vma}(CS_{r_n}), \mathrm{ts}(CS_{r_n}), T_{r_n})$$
The function Γ takes as input the code section (CSrn), the address of the VMA to which this section belongs (vma(CSrn)), the timestamp of the revision of this section (ts(CSrn)) and the associated hardware trace for the section (Trn). With our FlowJIT technique, we store the timestamp, address and content of each new dynamic code section revision, which then allows reconstruction of the hardware trace, something not otherwise possible.
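To make the role of the stored timestamp, address and content concrete, the sketch below models the lookup step that Γ depends on: given a VMA address and the timestamp of a trace packet, find the code revision that was live at that moment. It is a hypothetical userspace model, not the FlowJIT implementation; `Revision`, `RevisionStore`, `record` and `lookup` are illustrative names, timestamps are plain integers, and actual packet decoding is left out.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Revision(NamedTuple):
    ts: int       # timestamp at which this revision was recorded
    code: bytes   # snapshot of the code section content


class RevisionStore:
    """Hypothetical store of code section revisions, keyed by VMA start address."""

    def __init__(self) -> None:
        self._stamps: Dict[int, List[int]] = defaultdict(list)
        self._revisions: Dict[int, List[Revision]] = defaultdict(list)

    def record(self, vma_start: int, ts: int, code: bytes) -> None:
        """Store a new revision, keeping each VMA's list sorted by timestamp."""
        idx = bisect.bisect_right(self._stamps[vma_start], ts)
        self._stamps[vma_start].insert(idx, ts)
        self._revisions[vma_start].insert(idx, Revision(ts, code))

    def lookup(self, vma_start: int, ts: int) -> Optional[Revision]:
        """Return the latest revision recorded at or before ts, if any."""
        stamps = self._stamps.get(vma_start)
        if not stamps:
            return None
        idx = bisect.bisect_right(stamps, ts)
        if idx == 0:
            return None      # packet predates the first recorded revision
        return self._revisions[vma_start][idx - 1]


if __name__ == "__main__":
    store = RevisionStore()
    # Two revisions of the same anonymous page (made-up address and bytes).
    store.record(0x7F3A10000000, ts=100, code=b"\x90\xc3")
    store.record(0x7F3A10000000, ts=250, code=b"\x31\xc0\xc3")

    # A trace packet stamped 180 must be decoded against the first revision,
    # one stamped 300 against the second.
    for packet_ts in (180, 300):
        rev = store.lookup(0x7F3A10000000, packet_ts)
        print(packet_ts, rev.ts if rev else None)
```

Keeping the timestamps in a sorted list per VMA makes each lookup a binary search, which matters when a page is rewritten many times during a trace; a decoder would then run over the returned `code` bytes instead of a static binary.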