Smuggling - BORDER TRADE BETWEEN TULCÁN AND IPIALES

CHAPTER III BORDER TRADE

3.5 BORDER TRADE BETWEEN TULCÁN AND IPIALES

3.5.3 Smuggling

To calculate the risk score, a data mining model is developed from the available data on the independent variables and the dependent variable. In the current case study, 25 independent variables are initially available for building a risk score model (see Table 5.1). These factors were selected according to purchasing managers’ perceptions and the previous literature on supply risk. However, instead of using all of these factors directly in the supplier risk assessment, the proposed approach takes a different viewpoint: a rule-based knowledge discovery approach reduces the number of independent variables by identifying the risk factors that are significant predictors of supply performance, which is used to define the dependent variable (supplier risk). In other words, only the factors that actually contribute to realised supply risk or affect supply performance are retained. Furthermore, numerical data are discretized into bins (categorization).

To assess the impact of the proposed approach on classification performance, it is compared with the following alternative approaches:

• A model is developed without the knowledge discovery approach, that is, from the initial list of risk factors (25 independent variables) and actual supplier performance. The result of this model (given in detail in Appendix III) is compared with the knowledge-driven model built for risk scoring.

• A model is developed without knowledge discovery but with a state-of-the-art discretization approach, namely the equal-width binning method (results are shown in Appendix IV). Equal-width binning converts numerical data into categorical data by dividing the range of numerical values into bins of equal width (results are shown in Appendix V).

• A model is developed using a state-of-the-art variable selection approach, namely the Correlation-based Feature Selection method (Hall 1998). This method identifies a subset of important variables by considering the individual predictive capability of each feature along with the degree of redundancy between features. The result of this model (shown in Appendix VI) is compared with the knowledge-driven model built for risk scoring.

• A model is developed using both state-of-the-art variable selection and discretization approaches. The result of this model (shown in Appendix VII) is compared with the knowledge-driven model built for risk scoring.
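As an illustration of the equal-width binning baseline mentioned above, the following minimal Python sketch divides the range of a numerical variable into bins of equal width. The function name, the variable `delivery_delay_days` and its values are hypothetical and are not taken from the case study data; the number of bins is likewise an assumption.

```python
def equal_width_bins(values, n_bins):
    """Assign each numeric value to one of n_bins equal-width intervals.

    Returns a list of bin indices in the range 0 .. n_bins - 1.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value would fall just past the last bin; clamp it.
        labels.append(min(idx, n_bins - 1))
    return labels


# Hypothetical risk factor values (not from the thesis data).
delivery_delay_days = [0, 2, 3, 7, 10, 15, 30]
bins = equal_width_bins(delivery_delay_days, 3)
print(bins)  # [0, 0, 0, 0, 1, 1, 2]
```

In this toy example most values fall into the first bin because the interval borders depend only on the minimum and maximum, not on how the values are distributed or on their relation to supplier performance.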

The sample covers 136 suppliers over a period of six years, with 25 risk factors and the corresponding supplier performance, giving a total of 820 observations. Of these, 684 observations were used to build the different models (statistics are given in Appendix VII), and a sample of 136 observations was used to test the resulting models (statistics are given in Appendix IX). The classification performance of these models is compared with that of the knowledge discovery model in Figure 6.10.

Figure 6.10: Comparison of KD risk scoring with other approaches

All the models built without the knowledge discovery approach have acceptable overall accuracy with respect to both Neter (1966)’s method and the benchmark of Hair et al. (2006); however, all of them perform poorly on the minority class. Only the model built without knowledge discovery but with discretization reaches the benchmark stated by Hair et al. (2006) for the minority class, i.e. 61.5%. For unbalanced data, higher accuracy on the minority class is more desirable, especially when the minority class is the main target. In the current study, the minority class, labelled “bad”, is the main target, so high accuracy on it is desirable. The proposed knowledge discovery model provides much higher accuracy for both the minority and the majority class than the required benchmarks (both Neter (1996)’s method and the Hair et al. (2006) benchmark). Furthermore, the knowledge-discovery-based supplier risk scoring model outperformed all the models built without the knowledge discovery approach on all metrics.
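The distinction between overall accuracy and minority-class accuracy can be made concrete with a small Python sketch. The labels below are invented for illustration and do not reproduce the results in Figure 6.10; the function name is also hypothetical.

```python
def class_accuracies(actual, predicted):
    """Return overall accuracy and the accuracy within each actual class."""
    totals, correct = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        if a == p:
            correct[a] = correct.get(a, 0) + 1
    per_class = {c: correct.get(c, 0) / totals[c] for c in totals}
    overall = sum(correct.values()) / len(actual)
    return overall, per_class


# Hypothetical unbalanced test set: 8 "good" suppliers and 2 "bad" ones.
actual = ["good"] * 8 + ["bad"] * 2
# A classifier that misses one of the two "bad" suppliers.
predicted = ["good"] * 8 + ["good", "bad"]

overall, per_class = class_accuracies(actual, predicted)
print(overall)           # 0.9
print(per_class["bad"])  # 0.5
```

Here the overall accuracy looks high even though only half of the minority class is classified correctly, which is why the per-class figures are reported separately.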

The model built without knowledge discovery has 74.3% accuracy, whereas the proposed approach, which performs variable selection, provides 86.8% classification accuracy. According to the principle of Occam’s razor, the simplest model is preferred; furthermore, unnecessary predictors add noise to the estimation of the desired output, i.e. supplier risk.

An automatic variable selection method, such as the one used in this study, aims to construct a model that predicts well or explains the relationships in the data; however, automatic variable selection does not guarantee that these goals are met consistently. Piramuthu (2004) compared different feature selection techniques but did not find a clear winner. Automatic variable selection is a means to an end and not an end in itself. In the current study, the proposed variable selection method provided higher prediction accuracy than the automatic variable selection method. These results support the validity of the knowledge-driven method for building risk scoring models, which enhanced classification performance.
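For reference, the automatic method used as a baseline, Correlation-based Feature Selection, scores a candidate subset of k features by a merit that rewards correlation with the class and penalises redundancy between features: merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below is a minimal illustration of that merit only, not of the subset search, and it assumes absolute Pearson correlation as the correlation measure; the data and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cfs_merit(features, target):
    """CFS merit of a feature subset, using absolute Pearson correlation."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Hypothetical data: 1 = "bad" supplier performance, 0 = "good".
target = [0, 0, 1, 1, 0, 1]
late_deliveries = [1, 2, 5, 6, 2, 7]
late_deliveries_pct = [10, 20, 50, 60, 20, 70]  # perfectly redundant copy

single = cfs_merit([late_deliveries], target)
with_duplicate = cfs_merit([late_deliveries, late_deliveries_pct], target)
print(single, with_duplicate)
```

Adding the redundant copy leaves the merit essentially unchanged, which shows how the redundancy term keeps such a variable from improving the score of a subset.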

Further, the proposed approach includes a knowledge-driven discretization method to reduce the number of possible values of numerical variables. Choosing the interval borders and the correct arity for the discretization of a numerical value range remains an open problem in numerical feature handling (Kotsiantis and Kanellopoulos, 2006). The models built using the most common discretization technique, “equal width binning”, are outperformed by the knowledge-driven model on all performance evaluation metrics (see Figure 6.10). There is general agreement in the data mining literature that no universal approach exists for building the best data mining model; rather, different methods or techniques can be applied that perform better for a given problem with the available resources. The proposed approach performed better in the current problem domain and with the available data, which supports its validity for the stated problem and the given resources, without any claim to “comprehensiveness” in this problem domain.
