4. Destination of the Collected Revenue
4.1. Efficiency of the State in allocating the collected resources
The first challenge is data preparation. The core of the problem is that all input data for this study come as ROOT files. The reasons why the HEP community uses this format are well known, but it is not well suited to the direct application of ML techniques. The first task was therefore to find a means to “transition” from ROOT input files to a format better suited to interfacing with the ML frameworks used in other (non-HEP) sciences and in non-scientific sectors. The design choice enforced in the TFaaS architecture is to read the ROOT file content into NumPy arrays that are kept in memory and then fed to ML in a streamlined manner. In addition, to simplify debugging and data checking, the code was developed to also allow dumping the data to local disk in CSV (comma-separated values) format, i.e. a format in which tabular data (numbers and text) are stored in plain
text, with each line of the file acting as a data record and each record consisting of one or more fields separated by commas.
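As a minimal sketch of the CSV layout just described, the following standard-library Python serializes a set of equal-length columns into one header line plus one record per event. The function name `columns_to_csv`, the branch names and the values are hypothetical and chosen only for illustration; plain lists stand in for the in-memory NumPy arrays, and this is not the TFaaS code itself.

```python
import csv
import io


def columns_to_csv(columns):
    """Serialize equal-length columns (name -> list of values) to CSV text."""
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of entries")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)  # header line: one field per column (branch)
    # zip(*...) transposes the columns so that each row is one event record
    writer.writerows(zip(*(columns[name] for name in names)))
    return buffer.getvalue()


# Hypothetical branch names and values, for illustration only.
example = {"nJets": [6, 7], "ht": [512.5, 640.0], "triggerBit": [True, False]}
csv_text = columns_to_csv(example)
print(csv_text)
```

Each printed line after the header is one record, with fields separated by commas, which is the property that makes the dump easy to inspect by eye during debugging.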
The task thus reduces to coding the transformation from ROOT to CSV. This is by no means a pioneering task: in the HEP community, a variety of interfaces to ROOT have appeared over time, ranging from structured efforts to one-off scripts written by individuals for their specific needs. For this work, however, a performant and scalable solution was sought, and compatibility with the TFaaS design was a requirement. TFaaS offers the uproot component for this task, as presented in Section 4.1.1. We also evaluated one alternative approach, namely root_numpy, a Python extension module that provides an efficient interface between ROOT and NumPy. According to its documentation, its internals are compiled C++ and can therefore handle large amounts of data much faster than equivalent pure Python implementations. In our tests, though, we found uproot to be more performant (see Section 4.1.1); since it is also natively supported in the TFaaS architecture, we adopted it for the data preparation phase.
Another factor to consider is that the Monte Carlo and data samples used in this study are officially produced by the CMS top PAG (Physics Analysis Group) and used for analysis by the all-hadronic top sub-group. They are used in this thesis exactly as produced by those teams, with no change or manipulation. They have a flat structure, consisting of a TTree with its branches, and any tool used to read these ROOT files for the transformation must be able to handle this organization. The uproot component natively handles ROOT files with such a flat structure, which was an additional motivation for adopting it.
To implement this choice, the author of this thesis collaborated on the coding and debugging of a scalable Python script written specifically for this purpose within the overall TFaaS prototype design. This script acts as a high-performance wrapper around uproot that reads ROOT files in a streamlined and efficient way. Developing this component took a non-negligible part of the thesis work and was carried out in constant contact with the core uproot developers. Some of the main issues addressed and fixed are:
• the reading of ROOT files that had not been correctly and completely written [85],
• the construction of the jagged arrays [86] that allow arrays to be read from ROOT files,
• the reading of vectors of booleans [87].
Fixing these issues resulted in new releases of the uproot component, which is becoming more robust over time, also thanks to this thesis work. In turn, the functionality of the most recent beta versions could be fully exploited in this thesis. The collaboration with the uproot development team will continue after this thesis.
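To make the notion of a jagged array more concrete, the sketch below shows, in pure Python, one common way to represent variable-length per-event data (a flat content list plus offsets) and one possible way to pad it to a fixed width so that it fits a tabular format. The helper names `to_content_offsets` and `pad_jagged`, the fill value and the sample numbers are assumptions made for illustration; this is not the uproot implementation.

```python
def to_content_offsets(rows):
    """Flatten variable-length rows into (content, offsets).

    Row i is recovered as content[offsets[i]:offsets[i + 1]].
    """
    content, offsets = [], [0]
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    return content, offsets


def pad_jagged(rows, width, fill=0.0):
    """Pad or truncate each row to exactly `width` entries."""
    padded = []
    for row in rows:
        kept = list(row[:width])  # truncate rows longer than `width`
        padded.append(kept + [fill] * (width - len(kept)))
    return padded


# Hypothetical per-event jet transverse momenta: 3 jets, 0 jets, 1 jet.
jet_pt = [[120.3, 85.1, 40.2], [], [64.0]]

content, offsets = to_content_offsets(jet_pt)
padded = pad_jagged(jet_pt, width=2)
print(content, offsets)
print(padded)
```

Padding to a fixed width is only one option: it keeps every record the same length, at the cost of truncating long events and introducing fill values that a downstream model has to be able to tolerate.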
The aforementioned Python script, which reads ROOT files via uproot calls, is complex and is not reported verbatim in this thesis, but a small snippet illustrating its main concept is shown below:
    import uproot
    t = uproot.open("input_file.root")["events"]
    t.show()
    for (array,) in t.iterate("triggerBit", entrysteps=1000, outputtype=tuple):
        print(array)
The snippet is strikingly simple from an end-user's point of view. In the first line, the user imports the uproot Python module. In the second line, uproot opens a ROOT file named input_file.root and retrieves the TTree named events stored inside it. In the third line, the content of the TTree is printed, so that the user can see on screen all the branches and their types (see Figure 4.4). The last lines are the simplest code a user can write to read the content of the file in event bunches of a given size (1000), here from a specific branch (triggerBit in this example).
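The idea of reading in event bunches can be illustrated without any ROOT file. The following sketch emulates chunked iteration over a single in-memory column using only the Python standard library; `entry_ranges`, `iterate_branch` and the synthetic `trigger_bit` values are hypothetical stand-ins, and the real chunking performed by uproot may differ in its details.

```python
def entry_ranges(num_entries, step):
    """Yield (start, stop) index pairs covering num_entries in bunches of `step`."""
    if step <= 0:
        raise ValueError("step must be positive")
    for start in range(0, num_entries, step):
        yield start, min(start + step, num_entries)


def iterate_branch(values, step):
    """Yield consecutive slices of `values`, each at most `step` entries long."""
    for start, stop in entry_ranges(len(values), step):
        yield values[start:stop]


# Synthetic stand-in for a boolean branch with 2500 entries.
trigger_bit = [i % 2 == 0 for i in range(2500)]

chunk_sizes = [len(chunk) for chunk in iterate_branch(trigger_bit, step=1000)]
# Accumulate a per-chunk quantity so only one bunch is processed at a time.
n_fired = sum(sum(chunk) for chunk in iterate_branch(trigger_bit, step=1000))
print(chunk_sizes, n_fired)
```

The last bunch is shorter than the others whenever the number of entries is not a multiple of the step, so downstream code should not assume a fixed chunk length.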
Figure 4.4: Output of the show() function. See text for details.
In summary, a Python script that is part of the TFaaS infrastructure and internally uses uproot has been co-developed by the author of this thesis. It efficiently converts files from the ROOT format into a structure (CSV) that is simpler and can be handled by all commonly used ML frameworks and algorithms. Figure 4.5 shows an example of the headers of a CSV file produced by this tool. This concludes the data preparation part.