To demonstrate the scaling problem of dynamic loading at large scale, a parallel benchmark was run on two Linux clusters: Sierra at LLNL and JUROPA at JSC. The first cluster, Sierra, is equipped with 1,856 compute nodes, each with 12 cores. Data is stored either on NFS or on a Lustre file system. The test environment on the cluster is described in more detail in Section 6.4. Figure 2.6 shows the results of scaling Pynamic, a benchmark that stresses the dynamic loader (cf. Section 6.2). In this test, the benchmark loads 495 libraries with a total size of 1.1 GiB. In the first test, the benchmark loads the libraries from NFS. The overall runtime increased rapidly, and the measurement had to be stopped at a small scale to prevent an I/O storm that would have affected other users on the system. This demonstrates the poor support of NFS for parallel access. It makes dynamic loading impossible when using more than 5 % of
[Plot: wall time (seconds) versus number of tasks; series: Pynamic on NFS and Pynamic on Lustre.]
Figure 2.6: Results of Pynamic benchmark runs on Sierra (12 tasks per node, 495 shared objects, 215 utility libraries, 1.1 GiB library data).
2.3 Scalability of Parallel Loading
the system. This is noteworthy because system files, including system libraries, are often stored on a central NFS file system. The benchmarks on Sierra ran during production time, concurrently with other applications. Therefore, the tests were repeated multiple times, and the local memory caches for the file systems were reset before each program execution. This gave more reproducible results.
The second test was run with the libraries stored in the Lustre file system. The results show a more linear growth in runtime. The tests were performed with up to 512 nodes (6,144 processes). In general, the parallel file system is better suited to this kind of file operation. However, parallel HPC file systems are typically optimized for heavy write operations on large files and for data parallelism. Parallel read operations on small files combined with a high metadata rate are not the access pattern for which these file systems are optimized. The measurements show that using the Lustre file system can be a partial solution to the loading problem, but only up to a moderate number of processes. The tests had to be stopped at 30 % of the system size because the performance of the Lustre file system was already starting to degrade at this scale. The improved scaling of the benchmark can be explained by the higher read bandwidth for library data and by Lustre's use of local memory to cache and share the data. As with NFS, all look-up requests are sent to a single metadata server (MDS), which is potentially one reason for the exponential growth of the load time with an increasing number of tasks.
To verify that a large number of metadata operations occur at runtime, the benchmark program was instrumented to count the corresponding system calls. In the benchmark run above, each task issues about 5,671 open and stat system calls. Extrapolated to the total system size of Sierra (23,328 tasks), this amounts to more than 130 million system calls. Without optimization of metadata handling, such a high number of file-system operations would overload the metadata server.
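The extrapolation of the system-call count is simple arithmetic and can be reproduced with a short Python sketch. The per-task count (5,671) and the task counts (6,144 and 23,328) are taken from the text; the function and constant names are illustrative only, and the sketch assumes that every task issues the same number of calls.

```python
# Sketch: extrapolate the number of metadata system calls (open + stat)
# issued during startup. The figures come from the text; the names are
# illustrative and not part of any benchmark code.

CALLS_PER_TASK = 5_671      # open and stat calls counted per task
TASKS_MEASURED = 6_144      # largest Lustre run (512 nodes x 12 tasks)
TASKS_FULL_SYSTEM = 23_328  # full size of Sierra as stated in the text


def total_metadata_calls(tasks: int, calls_per_task: int = CALLS_PER_TASK) -> int:
    """Total calls, assuming every task issues the same number of look-ups."""
    return tasks * calls_per_task


if __name__ == "__main__":
    for tasks in (TASKS_MEASURED, TASKS_FULL_SYSTEM):
        print(f"{tasks:>6} tasks: {total_metadata_calls(tasks):,} calls")
```

For the full system this yields 132,293,088 calls, which is consistent with the figure of more than 130 million given in the text.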
The results of the benchmark runs on the second Linux cluster, JUROPA, are shown in Figure 2.7. The JUROPA system has since been replaced by its successor, JURECA [54]. JUROPA was equipped with 3,288 compute nodes, each with 8 cores and 24 GiB of memory [55]. In contrast to the Sierra cluster, all file systems of JUROPA were Lustre file systems, including the home file system. All measurements were done within the production environment of JUROPA, which means that other applications were running in parallel with the benchmark. Measurements on shared resources such as the parallel file system could be influenced by these applications. In contrast to the first tests on the Sierra cluster, it was not possible to reset the local memory caches of Lustre on the compute nodes. Therefore, when similar tests are repeated on the same nodes, subsequent runs benefit from the local memory caches and need less time to load the libraries. The measurements were repeated at least three times, and the best value was reported. Caching on the local nodes influences only the timings for data loading, not the look-up timings, because the Lustre file system does not cache metadata locally. All metadata requests are sent directly to the metadata server.
[Plot: load time (seconds) versus number of tasks; series: CLoadtest on Lustre.]
Figure 2.7: Results of the CLoadtest benchmark run on JUROPA (8 tasks per node, 710 libraries, 32 MiB library data).
On JUROPA, a self-written benchmark program (CLoadtest) was used for the measurements. The benchmark consists of a set of dynamic libraries containing compiled C code and a main program written in C that loads the libraries at startup (cf. Section 6.1). The implementation of the main program differs from Pynamic, which uses a Python script as its main driver program. Furthermore, in Pynamic the dynamic libraries are embedded in Python modules, which have to be loaded additionally.
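The measurement idea behind such a load-time benchmark can be sketched in a few lines. The following Python sketch is only an illustration: CLoadtest itself is a C program (cf. Section 6.1), and the function name `time_library_loads`, its parameters, and the script name in the usage comment are hypothetical. The sketch uses `ctypes.CDLL`, which on Linux loads a shared object through the dynamic loader, and measures the wall-clock time needed to load a list of libraries.

```python
# Sketch of a load-time measurement in the spirit of CLoadtest.
# This is NOT the CLoadtest source; all names here are hypothetical.
import ctypes
import time
from typing import Callable, Iterable, Tuple


def time_library_loads(
    paths: Iterable[str],
    loader: Callable[[str], object] = ctypes.CDLL,
) -> Tuple[int, float]:
    """Load each shared library in `paths` and return (count, seconds).

    `loader` defaults to ctypes.CDLL, which asks the dynamic loader
    (dlopen on Linux) to map the library into the process.
    """
    handles = []  # keep references so the handle objects stay alive
    start = time.perf_counter()
    for path in paths:
        handles.append(loader(path))
    elapsed = time.perf_counter() - start
    return len(handles), elapsed


if __name__ == "__main__":
    import sys

    # Usage: python loadtime.py /path/to/libA.so /path/to/libB.so ...
    count, seconds = time_library_loads(sys.argv[1:])
    print(f"loaded {count} libraries in {seconds:.3f} s")
```

In a parallel run, each task would perform such a measurement on its own; how the per-task timings are combined into a single load time is not covered by this sketch.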
As data caching on JUROPA could not be controlled, the tests were configured to focus on metadata operations by increasing the number of look-up operations and decreasing the size of the libraries. This leads to a benchmark configuration of 710 dynamic libraries with 32 MiB of library data per task. As on Sierra, the test runs were stopped at a scale of 512 nodes (4,096 tasks) to prevent an overload of the MDS. Up to this scale, the load time increases linearly with the number of tasks. This indicates that metadata operations are serialized on the MDS. In contrast to the results on Sierra, the timings show no exponential behavior when scaling to a larger number of tasks. The different configurations of the tests on Sierra (focus on data and metadata) and JUROPA (focus only on metadata) indicate that the exponential contribution might be caused by contention during parallel data loading and not by the look-up operations.
To summarize, these experiments show that the startup of dynamically linked applications on Linux clusters does not scale to a large number of processes. Furthermore, they expose a fundamental scaling problem caused by the serial behavior of the dynamic loader. In particular, optimizations of the look-up and load procedures are required. These optimizations can be achieved by caching the metadata in local memory and by exploiting application parallelism. The next section shows an example of how I/O caching strategies can help improve library look-up and loading on the hierarchical I/O infrastructure of the Blue Gene system.