1.2 Theoretical Framework
1.2.8 Training of human talent
Before the machine learning models can use the data, it has to be prepared and cleansed. This step is as important as the model construction itself: both affect the final results of the models, so both have to be done correctly in order to achieve valid results.
The data from Datastream contained some missing values, so all missing values were first labeled as -99999. The models are expected to read these values as outliers, so they should not have a major effect on the models or the predictions. Labeling is necessary in order to keep the data in a panel data format without narrowing down the data set any further. For instance, if missing values were dropped and just one company were missing the P/E value for the year 2000, the year 2000 P/E values of all the other companies would have to be dropped as well in order to keep the data in a balanced panel format. Therefore, missing values are labeled rather than dropped. Milosevic (2016) applied the same procedure to the data used in his research. The Python programming language was used to construct the whole data set and to format it into panel form. Python was also used in the data analysis and machine learning part of the study: constructing statistical models, making predictions, and obtaining and evaluating the results.
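The labeling step can be illustrated with a minimal sketch. The toy panel, the column names, and the use of pandas are assumptions made for illustration only; the text states only that Python was used and that missing values were labeled as -99999.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: two companies, two years, one missing P/E value.
panel = pd.DataFrame({
    "company": ["Afarak", "Afarak", "Amer Sports", "Amer Sports"],
    "year": [2000, 2001, 2000, 2001],
    "pe": [12.5, np.nan, 18.0, 20.1],
})

MISSING_LABEL = -99999  # sentinel value described in the text

# Label the missing value instead of dropping the row, so every company
# keeps the same number of observations (balanced panel).
labeled = panel.assign(pe=panel["pe"].fillna(MISSING_LABEL))

# For comparison: dropping rows with missing values unbalances the panel.
dropped = panel.dropna()

rows_labeled = labeled.groupby("company").size().to_dict()
rows_dropped = dropped.groupby("company").size().to_dict()
```

In this sketch, `rows_labeled` keeps two observations per company, whereas `rows_dropped` leaves one company with fewer observations than the other.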
The categorical feature, the company name in this case, was encoded. This means that all company names are converted to whole numbers from 0 to 72. For example, in this study Afarak is assigned 0 because it is the first company, Amer Sports is assigned 1, and so on. After this step, the numbers are spread into separate columns so that each company has its own column: instead of one company name column, there are now 73 columns, one for each company. Finally, these columns are transformed into dummy variables, where 1 means that the information in the row concerns the company and 0 that it does not. This must be done because the company names themselves do not have any hierarchical meaning; for instance, the company name Nokia is not greater than the company name Elisa. If the encoded values were not transformed into dummy variables, the models would evaluate a company that was assigned the number 4 as more relevant than a company that was assigned the number 3 in the encoding step. The encoding was done in Python with the OneHotEncoder class from the sklearn package. At the end of the encoding phase, one of the transformed columns has to be dropped, leaving 72, in order to avoid the dummy variable trap.
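The encoding can be sketched as follows, under stated assumptions. The four-company sample is hypothetical, and passing `drop="first"` is only one way of dropping a column; the text does not say how the column was removed in the study. Recent versions of sklearn's `OneHotEncoder` accept string categories directly, so the intermediate integer encoding is not shown.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the company-name column (one row per observation).
companies = np.array([["Afarak"], ["Amer Sports"], ["Elisa"], ["Nokia"], ["Afarak"]])

# drop="first" removes the dummy column of the first category (in sorted
# order) so that the remaining dummies are not perfectly collinear.
encoder = OneHotEncoder(drop="first")

# fit_transform returns a sparse matrix by default; convert it to a dense array.
dummies = encoder.fit_transform(companies).toarray()

# Four categories minus one dropped column leaves three dummy columns.
n_categories = len(encoder.categories_[0])
n_dummy_columns = dummies.shape[1]
```

Rows that belong to the dropped category (here Afarak) are represented by zeros in all remaining dummy columns.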
The dummy variable trap is a case of perfect collinearity, where one independent variable can be predicted exactly from the other independent variables. The trap occurs when a categorical variable is transformed into as many dummy variables as it has categories, because the dummy columns then always sum to one and duplicate the constant term of the model. It can be avoided by excluding one of the dummy variables from the model (Bech & Gyrd-Hansen 2005). For instance, in this study there are 73 different companies, so one of the 73 dummy variables has to be excluded to avoid perfect collinearity, which leaves 72. It does not matter which one of the variables is dropped.
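The collinearity can be checked numerically. The following is a minimal sketch with three hypothetical companies and an explicit intercept column; it is not taken from the study.

```python
import numpy as np

# Hypothetical company codes for six observations (three companies).
company_ids = np.array([0, 1, 2, 0, 1, 2])

# One dummy column per company: each row has exactly one 1.
full_dummies = np.eye(3)[company_ids]
intercept = np.ones((len(company_ids), 1))

# Trap: intercept plus all three dummies. The dummies sum to the intercept,
# so the four columns are linearly dependent.
X_trap = np.hstack([intercept, full_dummies])

# Fix: drop one dummy column; the remaining columns are independent.
X_ok = np.hstack([intercept, full_dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)  # lower than the number of columns
rank_ok = np.linalg.matrix_rank(X_ok)      # equal to the number of columns
```

A design matrix whose rank is lower than its number of columns has no unique least-squares solution, which is why one dummy column is removed.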
Feature scaling is also needed to standardize the range of all the independent variables in the study. This process is also known as data normalization. Feature scaling is important because some machine learning algorithms do not work properly if the data is not standardized. For instance, many classifiers, such as SVM and KNN, calculate the distance between two data points using the Euclidean distance. Therefore, if one of the features has a much wider range of values than the others, it will dominate the distance, and this will lead to false results (Young & Jeong 2009; Bo, Wang & Jiao 2006). In the study, all variables were scaled with the StandardScaler class from the sklearn package in Python.
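The effect of scaling on the Euclidean distance can be sketched with made-up numbers. The two features and their values are hypothetical; only the use of `StandardScaler` comes from the text.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (one row per observation).
X = np.array([
    [10.0, 1000.0],
    [20.0, 3000.0],
    [30.0, 5000.0],
])

# Before scaling, the distance between the first two rows is driven almost
# entirely by the second, wide-range feature.
raw_distance = np.linalg.norm(X[0] - X[1])
raw_share_feature_2 = (X[0, 1] - X[1, 1]) ** 2 / raw_distance ** 2

# StandardScaler rescales each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, both features contribute equally in this toy example.
scaled_distance = np.linalg.norm(X_scaled[0] - X_scaled[1])
scaled_share_feature_2 = (X_scaled[0, 1] - X_scaled[1, 1]) ** 2 / scaled_distance ** 2
```

Comparing `raw_share_feature_2` with `scaled_share_feature_2` shows how much of the squared distance each version attributes to the wide-range feature.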
After the data preparation was done, the data was divided into training and test data sets. First, the models are trained on the training data set, and then the trained models try to predict the dependent variable in the test data set. The test data set contains new information that the models have not encountered during the training phase. The training data set includes all the values from 1/2000 to 12/2015. The last two and a half years of the whole time period, 1/2016–7/2018, are selected for the test set for prediction purposes, instead of randomly picked values from the whole sample time frame. If the values were randomly picked from the whole time frame, all the models would contain look-ahead bias.
Look-ahead bias occurs when a model has access to information that would not have been available or known during the period being analyzed (Daniel, Sornette & Woehrmann 2009). For instance, when predicting the values for the year 2002, the model would already have information on some values from the following years, which would not normally be known. Look-ahead bias artificially inflates the measured accuracy of the models and would lead to false and biased results. For this reason, the data is separated into training and test sets based on the year and not by random selection.
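A chronological split of this kind can be sketched as follows. The monthly date index, the placeholder `value` column, and the use of pandas are assumptions for illustration; only the date ranges are taken from the text.

```python
import pandas as pd

# Hypothetical monthly observations covering 1/2000 to 7/2018.
dates = pd.date_range("2000-01-01", "2018-07-01", freq="MS")
data = pd.DataFrame({"date": dates, "value": range(len(dates))})

# Split by time instead of at random: everything before 1/2016 is used for
# training, everything from 1/2016 onward is held out for testing.
split_date = pd.Timestamp("2016-01-01")
train = data[data["date"] < split_date]
test = data[data["date"] >= split_date]

# Every training observation precedes every test observation, so the models
# never see information from the test period during training.
no_look_ahead = train["date"].max() < test["date"].min()
```

With this split, the training set covers 1/2000 to 12/2015 and the test set covers 1/2016 to 7/2018, matching the periods described above.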