The more data used for training, the better the model generally performs. However, in some cases the resulting performance gain is small compared with the difficulty of acquiring accurately labelled data. Figure 4.5 illustrates the model’s behaviour as a function of the size of the training data. It shows that, as the training set grows, overfitting becomes less likely. However, the improvement eventually reaches a point where adding more data does not significantly change the performance. This prompted further investigation into other ways of improving the performance.

Figure 4.5 RF incremental learning curve on DS1: random forest performance according to the size of the training data used

From 20,000 training data points onwards, the cross-validation performance did not change significantly, which means that adding more training data points would not lead to a significant performance improvement. Therefore, more discriminative features needed to be found instead of continuing to collect and extract the same features used in DS1. That is why DS2 was collected: it has a higher number of features and was labelled with an extra manual validation stage to produce an accurate ground truth dataset. DS2 was used in all the experiments described in the next chapter so that a better model could be built using a dataset with higher dimensionality. The next chapter discusses the more complex models that were investigated. More advanced and complex models can be achieved by adding more discriminative features to the dataset used [146].
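The plateau described above can be checked with a learning curve. The following is a minimal sketch using scikit-learn's `learning_curve`; it is not the code used in this work. The synthetic data from `make_classification` is only a stand-in for DS1, and the sample sizes, number of trees and number of folds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for DS1 (assumption: the real dataset is not reproduced here).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Train on increasing fractions of the data and score each with 5-fold CV.
train_sizes, train_scores, cv_scores = learning_curve(
    clf,
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

mean_cv = cv_scores.mean(axis=1)   # mean CV score per training size
gains = np.diff(mean_cv)           # change in CV score between successive sizes

for n, score in zip(train_sizes, mean_cv):
    print(f"{int(n):5d} training samples: mean CV accuracy = {score:.3f}")
print("Gain between successive sizes:", np.round(gains, 3))
```

When the gains between successive training sizes shrink towards zero, collecting more samples with the same features is unlikely to help, which is the situation reported for DS1.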

4.6 Conclusion

In this preliminary experiment, the author first investigated several supervised machine learning models to find which type of algorithm gives good results without further tuning. Further time and effort were then invested in optimising the chosen algorithm (RF) and identifying the best hyper-parameters for the dataset, that is, the highest-performing model using the smallest number of features and the smallest structural parameters (number of trees, maximum tree depth and maximum leaf size), in order to find the least complex but still high-performing classifier.

As RF showed the highest performance among the compared algorithms, the author went further to determine the main hyper-parameters that limit the model’s performance and complexity. The experiments show that increasing the number of trees in RF gives more stable results; however, beyond a certain point, adding more trees has no significant impact on the performance. In contrast, the trees’ stopping-criteria hyper-parameters, such as tree depth and minimum samples per leaf, have a direct impact on the model complexity, so choosing values that give good results without overfitting the training dataset is essential for model evolution. For parameter tuning, the aim was to determine the best setting for the number of trees, the size of the leaf nodes and the depth of the trees in the RF classifier. However, these values would be suitable only for the training set used.
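A search over the three hyper-parameters named above can be sketched as a cross-validated grid search. This is an illustration rather than the tuning procedure used in this work. The data is synthetic, the grid values are arbitrary, and mapping "size of leaf nodes" to scikit-learn's `min_samples_leaf` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the dataset used in the thesis).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)

# Illustrative grid: number of trees plus the two stopping criteria.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [4, 8, None],     # None lets trees grow until leaves are pure
    "min_samples_leaf": [1, 5],    # larger values give simpler trees
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

Because the best values are selected against one particular training set, they should be re-estimated whenever the dataset or feature set changes, in line with the caveat above.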

Another experiment was then conducted to study three feature selection methods belonging to two families of techniques: filter methods and wrapper methods. First, the minimal structural hyper-parameter values required for RF were determined. Following this, a model was built using all the features, and the feature set was then reduced to the minimal required set. The best feature set reduction was achieved using the computationally costly MDA wrapper method; however, the two chosen filter methods achieved a similar feature set reduction with only slightly lower performance, although the difference was statistically significant.
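The contrast between a cheap filter ranking and a model-based MDA ranking can be sketched as follows. This is a hedged illustration, not the procedure used in this work. It assumes that MDA (mean decrease in accuracy) can be approximated by scikit-learn's `permutation_importance`, it uses mutual information as an example filter criterion (the filter methods actually chosen are not named in this section), and the data and the cut-off `k` are synthetic and arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the 5 informative features are columns 0-4.
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
k = 5  # illustrative size of the reduced feature set

# Filter approach: rank features by mutual information, no model required.
mi = mutual_info_classif(X_train, y_train, random_state=0)
filter_top = set(int(i) for i in np.argsort(mi)[::-1][:k])

# MDA-style approach: permute each feature on held-out data and measure the
# drop in the fitted model's accuracy (more expensive than the filter).
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
mda = permutation_importance(
    rf, X_val, y_val, scoring="accuracy", n_repeats=5, random_state=0
)
mda_top = set(int(i) for i in np.argsort(mda.importances_mean)[::-1][:k])

print("Top features (filter, mutual information):", sorted(filter_top))
print("Top features (MDA, permutation importance):", sorted(mda_top))
print("Selected by both:", sorted(filter_top & mda_top))
```

The filter score is computed once from the data alone, whereas the permutation approach needs a fitted model and repeated re-scoring, which is the source of the extra computational cost mentioned above.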

In summary, the results show that hyper-parameter tuning is essential and can make a difference in finding the balance between high performance and the risk of overfitting. Furthermore, the feature selection process can enhance performance and reduce the resources required, both by avoiding the extraction of expensive, unnecessary features and by reducing the dimensionality of the dataset. Using only the important features to build the classification model reduces the model’s unnecessary complexity. It is also important to report the parameter values and the details of the feature set optimisation method that was applied, in order to guarantee the reproducibility of the results reported in studies.

More Informative Features and Ensemble Learning Methods Used to Detect Malicious

5.1 Introduction

The previous chapter gave a clear understanding of what is required when building a machine learning-based detection method, namely model selection, tuning and feature selection. As an ensemble-based model (RF) showed better results, further investigations were conducted by studying several ensemble methods built on tree-based models. More ensemble learning models are used in this chapter’s experiments, such as XGBoost, extra random trees and gradient boosting trees. Moreover, in the second part of these experiments, several ensemble learning models were combined using two methods of combination. Combining models could help to optimise performance by producing an even better model and/or could contribute to the process of model selection. Therefore, the aim of this chapter is to investigate the possibility of achieving both of these goals, or at least one of them.
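As an illustration of what combining tree-based ensembles can look like, the sketch below uses soft voting and stacking from scikit-learn. These two schemes are assumptions chosen for illustration; this introduction does not name the two combination methods that were actually applied. XGBoost is omitted because it is a separate package, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not DS2).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)


def base_models():
    """Return fresh, unfitted tree-based ensembles to be combined."""
    return [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ]


# Combination 1: average the predicted class probabilities (soft voting).
voting = VotingClassifier(estimators=base_models(), voting="soft")

# Combination 2: train a meta-learner on the base models' out-of-fold predictions.
stacking = StackingClassifier(
    estimators=base_models(),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in [("voting", voting), ("stacking", stacking)]
}
for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```

Comparing the combined models against each base model under the same cross-validation folds indicates whether the combination yields a better model or simply matches the best single one.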

It was also shown in the previous chapter that the model enhancement reached a point where no further improvement could be achieved, even when more data was used in training. To overcome this issue, all the experiments in this chapter are based on DS2, which contains deeper features extracted from several sources. For further information about the nature of this dataset, including which new features were added and which features were eliminated, see section 3.3 in chapter 3.