5.6 Typical Properties of Different Classification Methods
Taking into account the results from the three toy examples, we briefly summarise some characteristic properties of the learning methods discussed:
Figure 5.30: 'Gaussians': Support vector machine training result (the chosen support vectors and the final output distribution).
Figure 5.31: 'Gaussians': Random forest training result (two trees, left and middle, of 200 trees building up the final output distribution, right, by averaging).
them had some kind of enclosed region which could not be described with simple cuts. On the last example the simple cuts performed fairly well, but axis-parallel cuts have great difficulty with any kind of correlation.
• Naive Bayes, C4.5 and random forest often have the same problem as the simple cuts: they are based on axis-parallel cuts. If the structure has decision boundaries parallel to the axes (as in the 'Hole' example), they perform well. Correlations and rounded decision boundaries are problematic for all three methods. The random forest copes best with these problems because of its averaging strategy.
• k-nearest-neighbours and range search show the typical properties of local density estimators: they describe the decision boundary quite well, independent of its structure, and have no problems with correlations or rounded regions. However, the derived decision boundaries are always much noisier than they should be. In addition, range search suffers from regions where no decision can be made because training events are missing, even if the classification would be simple.
• Neural networks have no problems with correlations and are especially well suited for piecewise linear decision boundaries. However, the approximation of rounded decision boundaries with several hidden neurons is not perfect.
• Support vector machines with a Gaussian kernel show a behaviour complementary to that of neural networks. Rounded decision boundaries are described excellently by support vector machines. In contrast, linear decision boundaries are described less well and often lead to artefacts and thus to misclassifications.
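The difference between the two local density estimators in the list above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the `Event` type and the function names are hypothetical and chosen for illustration only, and both methods use a brute-force search with Euclidean distance. The k-nearest-neighbours output is defined everywhere in input space, whereas the range search reports that no decision is possible when its search region contains no training events.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical event type for this sketch: input vector plus class label.
struct Event {
    std::vector<double> x;
    int label;  // 1 = signal, 0 = background
};

// Squared Euclidean distance between two points of equal dimension.
double squaredDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// k-nearest-neighbours: signal fraction among the k closest training events.
// An output always exists, because k neighbours can be found for any query.
double knnOutput(const std::vector<Event>& training,
                 const std::vector<double>& query, std::size_t k) {
    assert(k > 0 && k <= training.size());
    std::vector<std::pair<double, int>> byDistance;
    byDistance.reserve(training.size());
    for (const Event& e : training)
        byDistance.emplace_back(squaredDistance(e.x, query), e.label);
    // Only the k smallest distances need to be sorted.
    std::partial_sort(byDistance.begin(),
                      byDistance.begin() + static_cast<std::ptrdiff_t>(k),
                      byDistance.end());
    std::size_t signal = 0;
    for (std::size_t i = 0; i < k; ++i)
        if (byDistance[i].second == 1) ++signal;
    return static_cast<double>(signal) / static_cast<double>(k);
}

// Range search: signal fraction among training events within `radius`.
// Returns false if the region is empty, i.e. no decision can be made.
bool rangeSearchOutput(const std::vector<Event>& training,
                       const std::vector<double>& query, double radius,
                       double& output) {
    std::size_t inside = 0;
    std::size_t signal = 0;
    for (const Event& e : training) {
        if (squaredDistance(e.x, query) <= radius * radius) {
            ++inside;
            if (e.label == 1) ++signal;
        }
    }
    if (inside == 0) return false;
    output = static_cast<double>(signal) / static_cast<double>(inside);
    return true;
}
```

A caller of `rangeSearchOutput` has to handle the `false` case explicitly, for example by falling back to a default class. This is the practical cost of the empty regions mentioned above.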
We see that each learning method prefers a specific kind of problem, for which it then gives very good results. This closes the circle and brings us back to the discussion at the beginning of this chapter about the different kinds of bias implemented in each learning method. As mentioned in section 3.15, the remaining question is which kind of bias is best suited to the datasets encountered in physics analysis. Chapter 7 will try to answer this question by comparing different learning methods on many different datasets from high energy and astrophysics experiments.
Chapter 6
Software Development
This chapter describes the software framework which was developed to allow the application and evaluation of many different statistical learning methods on all kinds of datasets from physics experiments. The toy problems discussed in the last chapter were also processed within this framework. On the one hand, a large fraction of the software development depends on the data sources and therefore on the specific experiment. It is important to realise that accessing and managing data from a large experiment is always a demanding and time-consuming task. On the other hand, a large fraction of the software development is independent of the specific data source.
In general, all implemented learning methods and their automation, as well as the prior analysis and the posterior evaluation, work independently of the underlying experimental situation. However, part of the evaluation does depend on the application, since the quantity to be optimised may be defined differently for each application.
The programs described in this chapter are mainly implemented in C++, making use of its type-safe object-oriented style and its fast executables. These programs use the ROOT library [97], mainly for the graphics tools and basic mathematical operations. Several scripts which operate on top of the basic programs are written in Perl, since this language provides an excellent interface to the shell and very comfortable text processing.
The framework of learning methods and the programs for the evaluation of learning methods contain about 15,000 lines of code. In contrast, the experiment-dependent programs, such as data access and special performance evaluation, contain about 55,000 lines. This discrepancy confirms the comments above that large parts of the work, and thus also large parts of the programs, have to be dedicated to the interaction with the specific experiment.
6.1 Data Access and Preprocessing
The first step towards the application of statistical learning methods to a new problem from physics analysis depends strongly on the hardware and software used in the experiment. Usually some kind of analysis chain already exists and has to be used to obtain the training data for a statistical learning method. Tools implementing an interface between the experiment-dependent software and the framework of learning methods (which will be described below) have been created for each of the experiments described in this thesis. They range from simply reading in text tables or binary data files to the usage of very complex software frameworks.
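The simplest case mentioned above, reading in a text table, might look like the following minimal sketch. The file layout is an assumption made for illustration (one event per line, whitespace-separated numeric columns, lines starting with `#` treated as comments), and `readTextTable` is a hypothetical name, not a function of the framework described here.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads a whitespace-separated table with one event per line.
// Empty lines and lines starting with '#' are skipped (assumed convention).
std::vector<std::vector<double>> readTextTable(std::istream& in) {
    std::vector<std::vector<double>> events;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        std::istringstream fields(line);
        std::vector<double> row;
        double value = 0.0;
        while (fields >> value) row.push_back(value);
        if (row.empty()) continue;  // line held only whitespace or no numbers
        // Every event is expected to have the same number of columns.
        assert(events.empty() || row.size() == events.front().size());
        events.push_back(row);
    }
    return events;
}
```

Because the function takes a `std::istream&`, the same code can be used with a `std::ifstream` for data files and with a `std::istringstream` in tests.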
Example: Data Access in the Large Software Frameworks of H1 and MAGIC
The training data for the level 2 neural network trigger consists of the level 1 trigger quantities which are sent to the level 2 system for each event (see section 2.1.4). This data stream is stored on mass storage in a special "bank", unfortunately only for every second event due to a limitation in the readout system. Libraries in Fortran and, more recently, also in C++ allow access to the H1 data files and in particular to the neural network data. Several hundred megabytes have to be processed to extract the neural network inputs for a few hundred events.
A large software framework called MARS handles the data analysis for the MAGIC telescope. This C++ framework has to be used in order to extract detector data from the MAGIC data files. Depending on the abstraction level of the data (information from all pixels vs. Hillas parameters, compare section 3.9), typical file sizes range from tens of megabytes to several gigabytes.
The preprocessing step also depends strongly on the experiment from which the data is taken. Some datasets suggest only one specific input vector, which is obtained in a trivial way from the available data stream. Much more often, however, the given detector data can be preprocessed in different ways, including not only standard operations such as normalisation but also specific operations based on knowledge about the detector.
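As an illustration of a standard preprocessing operation, the following sketch normalises each input to zero mean and unit standard deviation. The choice of this particular normalisation and all names are assumptions made for the example. The text does not specify which normalisation the framework applies.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-input mean and standard deviation (hypothetical helper type).
struct Normalisation {
    std::vector<double> mean;
    std::vector<double> sigma;
};

// Determines the normalisation parameters from a set of events.
Normalisation computeNormalisation(const std::vector<std::vector<double>>& events) {
    assert(!events.empty());
    const std::size_t n = events.front().size();
    const double count = static_cast<double>(events.size());
    Normalisation norm{std::vector<double>(n, 0.0), std::vector<double>(n, 0.0)};
    for (const std::vector<double>& e : events) {
        assert(e.size() == n);
        for (std::size_t i = 0; i < n; ++i) norm.mean[i] += e[i];
    }
    for (double& m : norm.mean) m /= count;
    for (const std::vector<double>& e : events) {
        for (std::size_t i = 0; i < n; ++i) {
            const double d = e[i] - norm.mean[i];
            norm.sigma[i] += d * d;
        }
    }
    for (double& s : norm.sigma) {
        s = std::sqrt(s / count);
        if (s == 0.0) s = 1.0;  // constant input: avoid division by zero
    }
    return norm;
}

// Applies previously determined parameters to one input vector in place.
void applyNormalisation(const Normalisation& norm, std::vector<double>& x) {
    assert(x.size() == norm.mean.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = (x[i] - norm.mean[i]) / norm.sigma[i];
}
```

Separating the two functions allows the parameters to be determined once on the training set and then applied unchanged to the selection and test sets.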
Example: Preprocessing for the MAGIC telescope and the XEUS pixel-detector
Substantial preprocessing was done, for example, for the MAGIC datasets, since different sets of inputs were formed which are based on completely different levels of abstraction. The basic analysis presented in section 7.6 works on the level of the Hillas parameters. First steps towards an analysis based on pixel information are presented in section 7.6.3.
For the XEUS datasets the preprocessing step is essential since it can already filter out background. A framework which transforms uncalibrated data files into various analysis formats has been developed and will be presented in appendix B.1.
Additional tools have been developed which make typical data manipulation tasks easier. One of them "squares" the input dimension (adding $x_i x_j \;\forall\, i, j$ to the inputs $x_1, \dots, x_n$). Another one selects or removes specific inputs from a data file; others help to manage the weights and IDs which are used to distinguish between training, selection and test sets.
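The "squaring" of the input dimension can be sketched as follows. This is an illustration under stated assumptions, not the tool itself: the function name `squareInputs` is hypothetical, and since $x_i x_j = x_j x_i$ the sketch adds each product only once (for $i \le j$), which gives $n + n(n+1)/2$ inputs in total. Whether the original tool also stores the symmetric duplicates is not stated in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the original inputs x_1..x_n followed by all products x_i * x_j
// with i <= j (assumption: symmetric duplicates are omitted).
std::vector<double> squareInputs(const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> out(x);
    out.reserve(n + n * (n + 1) / 2);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i; j < n; ++j)
            out.push_back(x[i] * x[j]);
    return out;
}
```

For two inputs $(x_1, x_2)$ the result is $(x_1, x_2, x_1^2, x_1 x_2, x_2^2)$. With the product terms available as inputs, a method restricted to axis-parallel cuts can cut directly on quantities such as $x_1 x_2$.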