Chapter 2 Internal Control in the Practical Discipline Process
2.1 Characterization of the UCLV and the Accounting and Finance Degree
WMTools is designed on the principle of data collection and retention. WMTrace is the tool specifically designed to collect memory allocation information from parallel applications and save it to file, allowing for in-depth processing at a later date. Whilst the implications of this data capture and storage methodology are non-trivial, the benefits are clear: although potentially gigabytes of trace information are generated, representing millions of nodes in the application call tree, the result is an exact replica of the series of events, allowing for in-depth analysis [39].
[Figure: block diagram. Labelled components: Application + Libraries; WMTrace (Malloc / Calloc / Realloc / Free, Event Processor, Call Stack Traversal, ELF Reader, Stack Database, Internal Buffer, Compressor); stdlib.h / libc; Linux Kernel; Trace File.]
Figure 4.1: WMTrace data collection process
4.1.1 Library Structure
WMTrace is a dynamic C++ library which interposes POSIX-based memory allocation calls, such as malloc, calloc, realloc and free.
Figure 4.1 illustrates the internal layout of WMTrace. The library sits between the application and other dynamic libraries, such as system libraries. Memory management calls are intercepted and passed to the event processor, which records the size, time and location of each allocation. This event data is then written to an internal buffer. Call stacks, which are generated from these events to represent the location of an allocation, are passed to a stack dictionary which maps each call stack to a unique ID as a form of compression. Periodically the internal buffer is flushed and, along with a list of the newly observed call stacks, passed through a compression engine, which in turn writes the data to file. Analysis of the application, through ELF and the virtual address space, is also performed and stored in the trace files, allowing function addresses to be resolved at a later date.
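The role of the stack dictionary can be illustrated with a minimal sketch. This is not the WMTrace implementation (which is a C++ library); the class and function names below (`StackDictionary`, `intern`, `drain_new`, `record_event`) and the event tuple layout are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

CallStack = Tuple[int, ...]  # return addresses, innermost frame first (assumed layout)


class StackDictionary:
    """Hypothetical sketch: maps each distinct call stack to a small integer ID."""

    def __init__(self) -> None:
        self._ids: Dict[CallStack, int] = {}
        self._new: List[Tuple[int, CallStack]] = []  # stacks not yet written to file

    def intern(self, stack: CallStack) -> int:
        """Return the ID for `stack`, assigning the next free ID on first sight."""
        stack_id = self._ids.get(stack)
        if stack_id is None:
            stack_id = len(self._ids)
            self._ids[stack] = stack_id
            self._new.append((stack_id, stack))
        return stack_id

    def drain_new(self) -> List[Tuple[int, CallStack]]:
        """Hand over the stacks first seen since the last flush."""
        new, self._new = self._new, []
        return new


def record_event(buffer: list, stacks: StackDictionary,
                 kind: str, size: int, time: float, stack: CallStack) -> None:
    """Append one event to the buffer, storing only the stack's ID."""
    buffer.append((kind, size, time, stacks.intern(stack)))
```

Repeated allocations from the same location then cost one integer per event rather than a full list of addresses, and `drain_new` yields exactly the stacks that would need to accompany the next buffer flush.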
4.1.2 Application Interaction
WMTrace is a dynamic library which is loaded via LD_PRELOAD at runtime, during the application setup phase. There are many benefits to this approach, chief among them that no compile-time linking is required: applications do not need to be recompiled before being traced with WMTrace.
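As a sketch of how such a preload might be set up from a launcher script, the helper below builds an environment with LD_PRELOAD set and runs a command under it. The function names and any library path passed to them are hypothetical; only the LD_PRELOAD variable itself is taken from the text, and launching under an MPI job launcher may need additional environment forwarding that is not shown here.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional


def preload_env(library: str,
                base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with `library` placed first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon-separated list of libraries.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_traced(command: List[str], library: str) -> int:
    """Run an unmodified binary with the tracing library preloaded."""
    return subprocess.run(command, env=preload_env(library)).returncode
```

Because the library is injected through the environment, the traced binary is the same one used for production runs.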
WMTrace is specifically designed to handle MPI-based applications, and is initiated by an application's call to MPI_Init. This allows WMTrace to establish separate trace files based on rank information.
ELF
From the binary, WMTrace is able to ascertain the static memory partition, which does not present as a malloc but still contributes to memory consumption. WMTrace also queries the ELF header for function address information, which is used to resolve addresses obtained during call stack traversal. To gather the function addresses of dynamic libraries, WMTrace must query the virtual address space using the dl_iterate_phdr function.
We note that WMTrace uses function address information from the ELF headers and resolves locations to within function address ranges; such information is largely available even without debugging information in the binary.
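Resolving an address to within a function address range amounts to a sorted-range lookup. The following sketch assumes the symbol information has already been extracted as (start address, size, name) triples; the `SymbolTable` class and the example values are illustrative and are not taken from WMTrace.

```python
import bisect
from typing import List, Optional, Tuple

Symbol = Tuple[int, int, str]  # (start_address, size_in_bytes, name), assumed layout


class SymbolTable:
    """Hypothetical sketch: resolves a return address to its enclosing function."""

    def __init__(self, symbols: List[Symbol]) -> None:
        self._symbols = sorted(symbols)
        self._starts = [start for start, _, _ in self._symbols]

    def resolve(self, address: int) -> Optional[str]:
        """Return the name of the function whose range contains `address`."""
        # Find the last symbol starting at or before the address.
        i = bisect.bisect_right(self._starts, address) - 1
        if i < 0:
            return None
        start, size, name = self._symbols[i]
        # Addresses in gaps between functions are left unresolved.
        return name if address < start + size else None
```

For example, with a symbol `("solve")` covering 0x401040 to 0x40105f, a return address of 0x401050 resolves to `solve`, while 0x401060 falls outside every range and resolves to nothing. Only function boundaries are needed, which is why the lookup works without line-level debugging information.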
Stack Tracing
Stack tracing allows WMTools to understand the 'location' of an allocation, with respect to the sequence of function calls which caused it. This information is essential for any form of complex analysis that differentiates between allocations. However, the collection of this information is expensive and can generate a lot of data. The complexities are handled by a third-party stack tracing library, libunwind [90], which is reasonably efficient and highly portable.
We experimented with alternative methods of collecting call stack information. There are various methods of improving the performance of frequent call stack traversal using additional operations [40, 92, 110, 119, 121, 125]. Many of these methods involve modification of the stack and the insertion of markers, which are detected on later traversals to prevent unnecessary further traversal.
We developed a heuristic call stack traversal method, presented in [104], which uses the repetition of patterns and the stack size to deduce change. With this method we were able to predict call stack suffixes with an average accuracy of 89%, providing an overall speedup of 12% to WMTrace.
Using our initial technique some applications, such as AMG, experienced stack prediction accuracy as low as 5.2%, a result of low call stack densities within the application. Methods employed to improve this accuracy were detrimental to the performance of the technique, thus reducing the gains available to WMTools.
During this heuristic traversal we were unable to validate our predictions without knowledge of the correct call stack information. The variability of accuracy therefore becomes an issue, as it would in turn diminish confidence in later analysis. For this reason we did not pursue the method any further within WMTools.
4.1.3 Data Storage
WMTrace has a simple method of data storage, utilising a single trace file per MPI process. This allows each process to act independently, saving runtime, but resulting in potentially large combined file output.
Data storage is key to WMTools, as it facilitates the offline analysis of runs, allowing for different forms of analysis to be performed as and when they are required. The drawback of this method is the volume of data generated, with implications on both storage and I/O performance.
Because WMTrace employs lossless data collection and storage, the size of the trace files depends on the number of allocations, which in most circumstances will grow over time. The implication is that for extremely long runs these trace files will continue to grow, potentially causing storage problems.
WMTrace employs an internal buffer to facilitate the periodic staging of data to file. As this buffer is appended to file it is passed through a ZLib [133] compression engine, reducing the data volume.
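The buffer-then-compress pattern described above can be sketched with the standard zlib bindings. The class below is an illustration only: the name `BufferedTraceWriter`, the default capacity and the use of a sync flush at each staging point are our assumptions, not details of WMTrace.

```python
import io
import zlib
from typing import BinaryIO, List


class BufferedTraceWriter:
    """Hypothetical sketch: stages records in memory, compressing on each flush."""

    def __init__(self, sink: BinaryIO, capacity: int = 4096) -> None:
        self._sink = sink
        self._capacity = capacity          # flush threshold in bytes (assumed value)
        self._pending: List[bytes] = []
        self._pending_bytes = 0
        self._compressor = zlib.compressobj()

    def append(self, record: bytes) -> None:
        """Buffer one record, flushing once the buffer is full."""
        self._pending.append(record)
        self._pending_bytes += len(record)
        if self._pending_bytes >= self._capacity:
            self.flush()

    def flush(self) -> None:
        """Compress the buffered records and append them to the sink."""
        if not self._pending:
            return
        data = b"".join(self._pending)
        self._pending = []
        self._pending_bytes = 0
        out = self._compressor.compress(data)
        # A sync flush emits everything buffered so far to the output.
        out += self._compressor.flush(zlib.Z_SYNC_FLUSH)
        self._sink.write(out)

    def close(self) -> None:
        """Flush remaining records and terminate the compressed stream."""
        self.flush()
        self._sink.write(self._compressor.flush())
```

A single compressor is kept across flushes so that later buffers can reference repetition seen in earlier ones, at the cost of the small per-flush marker that the sync flush adds.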
Ferreira et al. discuss the importance of managing data storage volumes in HPC, and the potential benefit of using standard compression algorithms to minimise data from log outputs [38]. They utilise a parallel bzip2 implementation, pbzip2 [44], in conjunction with a staging area, similar to the internal buffers utilised in WMTrace. They achieve compression ratios of over 80% for HPCCG (a conjugate gradient benchmark), though as the compression was handled by a dedicated 'spare' core they do not discuss the performance implications of this technique.
WMTrace handles the storage of stack traces differently from events. As there is a lot of repetition, we maintain a map structure recording all unique call stacks. This method of compression is more efficient than relying on ZLib to spot repetition. Newly observed call stacks are periodically written to file, before the event trace segment, and are passed through the ZLib engine for additional compression. As a form of fault tolerance, trace files are kept well formed, allowing the partial analysis of runs which fail mid-execution.
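One way such a well-formed, segmented file could be laid out is sketched below. The format here (a one-byte tag and a four-byte length before each independently compressed segment) is a hypothetical layout chosen for illustration; the text does not specify the actual WMTrace file format.

```python
import io
import struct
import zlib
from typing import BinaryIO, Iterator, Tuple

HEADER = struct.Struct("<cI")  # assumed header: 1-byte tag, 4-byte payload length
STACKS, EVENTS = b"S", b"E"    # hypothetical segment tags


def write_segment(sink: BinaryIO, tag: bytes, payload: bytes) -> None:
    """Write one length-prefixed, zlib-compressed segment."""
    body = zlib.compress(payload)
    sink.write(HEADER.pack(tag, len(body)))
    sink.write(body)


def write_flush(sink: BinaryIO, new_stacks: bytes, events: bytes) -> None:
    """Write newly observed stacks first, then the events that refer to them."""
    write_segment(sink, STACKS, new_stacks)
    write_segment(sink, EVENTS, events)


def read_segments(source: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Yield every complete segment, stopping quietly at a truncated tail."""
    while True:
        header = source.read(HEADER.size)
        if len(header) < HEADER.size:
            return
        tag, length = HEADER.unpack(header)
        body = source.read(length)
        if len(body) < length:
            return  # the run ended mid-write; earlier segments remain usable
        yield tag, zlib.decompress(body)
```

Writing the stack segment ahead of the event segment means that any event a reader recovers refers only to stack IDs it has already seen, so a file cut short by a failed run can still be analysed up to the last complete segment.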