As mentioned before, the number of occurrences and the number of outflows in Table 5.3 are observations. To use the estimated outflow rate in practice, these observations must be converted to expectations. In other words, the results must be generalized into a model. That model can then be used in practice to estimate the expected number of TEMPs that leave the organisation when they receive less work than desired. The section below explains the steps needed to determine that model.
Visualization of the results
First, it is useful to have a visual overview of the results. As mentioned before, Table 5.3 indicates that the likelihood of outflow increases when a TEMP receives less work. Is this the case in every situation? Are there any outliers? Is the increase in likelihood linear or exponential? Figure 5.9 helps to answer these questions and provides further insight.
Regression model
Figure 5.9 gives some indication of a relation between the outflow rate and the reduction in work received. If a regression model is applied to this data set, only linear regression and polynomial regression are applicable (as mentioned in chapter 3.4). There is one independent variable (the percentage of fewer shifts worked per week than usual) and one dependent variable (the outflow rate), and both variables are continuous. Logistic regression, which is intended for a categorical dependent variable, therefore cannot be used.
To answer the question "Is it true that the outflow rate increases when the percentage of fewer shifts worked increases?", the correlation between the two variables must be calculated.
corr(y, x) = corr(outflow rate, difference) = 43%
A correlation of only 43% indicates a positive but fairly weak linear relationship: the outflow rate tends to increase when the percentage of fewer shifts worked increases, but with considerable scatter. Note that the correlation coefficient measures the strength of the linear association, not the share of cases in which both variables increase together. Because the correlation is rather low, a linear or polynomial regression model is not expected to perform well. Two measures indicate the performance of a regression model: the p-value and the R-squared value.
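As a minimal sketch of how such a correlation can be computed, the Python snippet below implements the Pearson correlation coefficient directly from its definition. The function name pearson_corr and the data values are illustrative assumptions; they are not the observations of Table 5.3.

```python
import math


def pearson_corr(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical bins (NOT the values of Table 5.3):
# percentage of fewer shifts worked and the observed outflow rate per bin.
difference = [10.0, 20.0, 30.0, 40.0, 50.0]
outflow_rate = [0.02, 0.05, 0.03, 0.08, 0.06]

r = pearson_corr(difference, outflow_rate)
print(f"corr(outflow rate, difference) = {r:.2f}")
```

The result always lies between -1 and 1; a value close to 1 means that the points lie close to an increasing straight line.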
Figure 5.9: The initial result of the outflow rate of Table 5.3. Run time = 7 hours
The p-value of a term indicates how strongly the data support a relation between changes of that independent variable and changes of the dependent variable: it is the probability of observing an effect at least as large as the estimated one if the true coefficient were zero. A low p-value means that the independent variable is a meaningful addition to the model. The R-squared value is the percentage of the variation of the dependent variable that is explained by the regression model. A high R-squared value indicates that the variability of the dependent variable is well explained by the regression model. The results are listed in Table 5.4. A plot of these results can be found in Figure 5.10, with the linear regression model on the left and the polynomial regression model on the right.
Table 5.4: The results of two regression models to determine the outflow rate.
Model                                            P-value       R-squared
Linear regression (y = c0 + c1*x)                x: 0.034      0.151
Polynomial regression (y = c0 + c1*x + c2*x^2)   x: 0.019      0.338
                                                 x^2: 0.014
(a) Linear regression model (b) Polynomial regression model
Figure 5.10: Two regression models to determine the outflow rate
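The comparison between the two model forms can be illustrated with a small sketch. The Python code below fits y = c0 + c1*x and y = c0 + c1*x + c2*x^2 by ordinary least squares and computes the R-squared value of each fit. The helper names fit_polynomial and r_squared and the data are assumptions made for illustration, not the implementation or data behind Table 5.4, and p-values are not computed here.

```python
def fit_polynomial(x, y, degree):
    """Least-squares coefficients [c0, c1, ..., c_degree] via the normal equations."""
    k = degree + 1
    # Augmented normal-equation matrix [X^T X | X^T y].
    a = [
        [sum(xi ** (i + j) for xi in x) for j in range(k)]
        + [sum(yi * xi ** i for xi, yi in zip(x, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def r_squared(x, y, coeffs):
    """Share of the variation of y explained by the fitted polynomial."""
    mean_y = sum(y) / len(y)
    predicted = [sum(c * xi ** p for p, c in enumerate(coeffs)) for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical bins (NOT the values of Table 5.3).
difference = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0]
outflow_rate = [0.02, 0.03, 0.03, 0.05, 0.04, 0.07, 0.06, 0.09]

linear = fit_polynomial(difference, outflow_rate, degree=1)
quadratic = fit_polynomial(difference, outflow_rate, degree=2)
r2_linear = r_squared(difference, outflow_rate, linear)
r2_quadratic = r_squared(difference, outflow_rate, quadratic)
print(f"linear R^2 = {r2_linear:.3f}, quadratic R^2 = {r2_quadratic:.3f}")
```

Because the linear model is a special case of the quadratic one (c2 = 0), the R-squared value of the quadratic fit on the same data can never be lower than that of the linear fit. This is why the p-values of the individual terms are also needed to judge whether the extra term is worthwhile.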
As indicated by the correlation of only 43%, the results were not expected to be good, so it is worthwhile to look for ways to improve the models. In addition, the support of both models is currently quite low: for some values of the independent variable there are only a few observations. The support increases if those observations are left out. This process is explained in the next section.
Improvement of regression model
As mentioned in literature section 3.4, a good regression model balances the trade-off between underfitting and overfitting. The polynomial regression model is included to overcome the tendency of a linear regression model to underfit the data. A drawback of a polynomial regression model is the risk of overfitting the data, but the proposed polynomial model only goes up to the second degree. With a higher-degree polynomial, the likelihood of overfitting would increase. Since Figure 5.10(b) shows no signs of overfitting and only a second-degree polynomial is used, there is no need to implement models such as ridge or lasso regression.
Some values of the input data have only a few observations. For example, in Table 5.3 the bin 50-52.2% contains only 9 TEMPs that left because of less work. Because of this lack of observations, the confidence that the corresponding outflow rate is reliable is low. There is no rule of thumb for leaving out data with a limited number of observations, so the only criterion is that the decision must be based on sound reasoning. Table 5.3, column "number of outflows", provides the number of observations. Bins with fewer than 30 observations are excluded, since that is less than 1% of all the observations. The assumption is that this makes the input data for the regression models more reliable. The correlation is calculated again to see whether excluding these observations has the potential to improve the regression model.
corr(y, x) = corr(outflow rate, difference) = 83%
The higher correlation of 83% indicates a strong positive linear relationship between the percentage of fewer shifts worked and the likelihood of outflow. This suggests that the linear and polynomial regression models can be improved in terms of p-value and R-squared value. The results are listed in Table 5.5 and Figure 5.11.
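The exclusion of sparsely populated bins described above can be sketched in a few lines of Python. The threshold of 30 observations comes from the text; the function name exclude_sparse_bins, the tuple layout and the bin values are hypothetical and do not reproduce Table 5.3.

```python
MIN_OBSERVATIONS = 30  # threshold taken from the text

# Hypothetical bins, NOT the values of Table 5.3:
# (lower bound of the bin in %, number of outflows, outflow rate)
bins = [
    (10.0, 140, 0.03),
    (20.0, 85, 0.05),
    (30.0, 42, 0.06),
    (40.0, 30, 0.09),
    (50.0, 9, 0.04),
]


def exclude_sparse_bins(rows, min_observations=MIN_OBSERVATIONS):
    """Keep only the bins with at least `min_observations` outflows."""
    return [row for row in rows if row[1] >= min_observations]


kept = exclude_sparse_bins(bins)
print(f"kept {len(kept)} of {len(bins)} bins")
```

The filtered bins would then serve as the input for recomputing the correlation and refitting both regression models.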
Table 5.5: The results of the improved regression models to determine the outflow rate.
Improved model                                   P-value       R-squared
Linear regression (y = c0 + c1*x)                x: 6.67e-05   0.64
Polynomial regression (y = c0 + c1*x + c2*x^2)   x: 1.14e-04   0.64
                                                 x^2: 0.95
(a) Improved linear regression model (b) Improved polynomial regression model
Figure 5.11: Two regression models to determine the outflow rate
Both the p-values and the R-squared values improved, except for the p-value of the second-degree term of the polynomial model. That term is therefore not a good addition to the model: its high p-value (0.95) means that the data provide no evidence that the squared term of the fewer-shifts-worked-per-week variable has any effect on the outflow rate. This is also reflected in the R-squared value, which is 0.64 for both models. Thus, the polynomial regression model does not meet the conditions to be statistically significant. The conclusion is that the linear regression model is a good generalization of the observed relation between the percentage worked less than usual and the outflow rate of TEMPs. From this point on, all calculations and examples use the linear regression model.
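To illustrate how the linear model could be applied in practice, the sketch below converts a predicted outflow rate into an expected number of TEMPs that leave. The coefficients are hypothetical, since the fitted values are not reported in this section, and the function name expected_outflows is an assumption. The sketch further assumes that the outflow rate is expressed as a fraction between 0 and 1 and that the expected number of outflows equals the rate multiplied by the number of TEMPs.

```python
def expected_outflows(c0, c1, difference_pct, n_temps):
    """Expected number of TEMPs leaving under the linear model y = c0 + c1 * x."""
    rate = c0 + c1 * difference_pct
    # Assumption: the rate is a proportion, so keep the prediction within [0, 1].
    rate = min(max(rate, 0.0), 1.0)
    return rate * n_temps


# Hypothetical coefficients; the fitted values are not given in this section.
C0, C1 = 0.01, 0.002

# Example: 200 TEMPs who each work 25% fewer shifts per week than usual.
print(expected_outflows(C0, C1, difference_pct=25.0, n_temps=200))
```

Clamping the rate is a design choice of this sketch: a straight line can produce values below 0 or above 1 outside the range of the observed data, which would not be meaningful as a rate.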