In real-world problems, the collected data can be affected by one or more of the following complications: noise, missing values, inconsistent data and high dimensionality. Thus, pre-processing the data before training the predictive models can improve the quality of the prediction and/or reduce the time required to deliver it (Han et al. (2011), Bruha and Famili (2000), Salvador et al. (2016) and Zliobaite and Gabrys (2014)). One or more stages of pre-processing can be applied to the data. In general, pre-processing techniques can be used to perform different tasks. These tasks include (but are not limited to) the following:

• Outlier removal: An outlier is defined as a point that lies outside the main distribution of the input data (Theodoridis and Koutroumbas (2006)). Outliers can result from noisy measurements. In most cases outliers are considered noise or exceptions and are discarded during the pre-processing stages. However, when the number of outliers is large, or when they represent an event of interest to the designer (for example, in fraud detection, where rare cases of fraud are more important than regular events), outliers can be analysed through outlier mining techniques (Han et al. (2011)).
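As a minimal sketch of discarding outliers during pre-processing, the following uses a z-score rule: a point is dropped when it lies more than a chosen number of standard deviations from the mean. The rule itself, the function name `remove_outliers` and the threshold value are assumptions made for illustration; the cited works describe other criteria as well.

```python
from statistics import mean, stdev


def remove_outliers(values, threshold=3.0):
    """Keep only points within `threshold` sample standard deviations of the mean.

    The z-score rule and the default threshold are illustrative assumptions.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All points are identical, so nothing can be flagged as an outlier.
        return list(values)
    return [v for v in values if abs(v - mu) / sigma <= threshold]


# Hypothetical sensor readings with one noisy measurement.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 55.0]
cleaned = remove_outliers(readings, threshold=2.0)
```

With very small samples the largest attainable z-score is bounded, so a lower threshold than the common value of 3 is used here. In an application such as fraud detection, the flagged points would be kept for analysis rather than discarded.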

• Data normalization: The ranges of the data features can vary widely, as they represent different aspects of the prediction problem. For example, in customer data, age and salary can have completely different ranges which do not necessarily reflect their influence on the prediction problem. In machine learning methods, especially those that base their prediction on a distance metric, such as nearest neighbour or clustering methods, large differences in the scales of the features can affect the accuracy of the prediction. Thus, normalising the data can often improve the performance of the prediction method (Han et al. (2011)).
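The age and salary example can be sketched with min-max scaling, which maps each feature to the range [0, 1] so that no feature dominates a distance computation merely because of its units. The function name `min_max_scale` and the sample values are hypothetical; min-max scaling is one common choice among several normalisation schemes.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no scale information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical customer data: very different raw ranges.
ages = [20, 30, 40, 60]
salaries = [20000, 35000, 50000, 80000]

scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)
```

After scaling, both features occupy the same range, so a Euclidean distance between two customers is no longer dominated by the salary column.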

• Missing values: In practice, some of the data features can have missing values for certain instances. This can be due to one or more of the following: malfunction of the equipment used to collect the data, missing information for certain users, errors in entering the data, etc. (Mitchell (1998) and Alpaydin (2014)). Pre-processing techniques can be used to handle the missing values problem. If the data set is large enough, the data instances with missing values can be discarded; however, in real-world problems this is seldom the case. Alternatively, the missing values can be filled in using one of the following techniques (Han et al. (2011)): filling in the values manually, replacing each missing value with a global constant, using the mean of the respective feature, or predicting the missing values.
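Of the techniques listed above, filling with the feature mean is the simplest to illustrate. The sketch below assumes missing entries are represented by `None`; the function name `impute_mean` and the sample column are hypothetical.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [x for x in column if x is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill = mean(observed)
    return [fill if x is None else x for x in column]


# Hypothetical age column with two missing entries.
ages = [25, None, 35, 30, None]
filled = impute_mean(ages)
```

Replacing `fill` with a fixed value would give the global-constant variant mentioned in the same list.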

A number of machine learning approaches have been proposed to handle missing values without explicitly providing alternative values for them. Examples of these methods can be found in neural network ensembles, such as the network reduction method proposed by Sharpe and Solly (1995), which trains a set of Multilayer Perceptrons (MLPs), each on a different possible combination of the features. Furthermore, Krause and Polikar (2003) developed a neural network ensemble which deals with missing values by training its base predictors on random subsets of the features. Another approach is proposed by Juszczak and Duin (2004), where the ensemble consists of one-class classifiers, each trained on a single feature. Therefore, if a feature has a missing value for a certain sample, the ensemble can still provide a prediction for this sample using the other features.

On the other hand, some of the well-known methods used to generate decision trees have additional mechanisms that allow them to deal with missing values, such as ID3 (Quinlan (1986)), where an additional edge in the tree is provided for each missing feature. This edge contains the possible values for the missing feature. An extension of ID3 is C4.5 (Quinlan (2014)), which uses probabilistic approaches to deal with missing values.
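The ensemble idea described above, in which each base predictor sees only part of the feature set so that predictors whose features are missing can simply abstain, can be sketched as follows. This is a deliberately simplified illustration and not the method of any of the cited papers: each base predictor here is a nearest-class-mean rule on a single feature, missing values are assumed to be `None`, and the names `train_per_feature` and `predict` are hypothetical.

```python
from collections import Counter
from statistics import mean


def train_per_feature(X, y):
    """Train one base predictor per feature: the per-class mean of that feature."""
    models = []
    for j in range(len(X[0])):
        by_class = {}
        for row, label in zip(X, y):
            if row[j] is not None:
                by_class.setdefault(label, []).append(row[j])
        models.append({label: mean(vals) for label, vals in by_class.items()})
    return models


def predict(models, sample):
    """Majority vote over the base predictors whose feature is present."""
    votes = Counter()
    for j, value in enumerate(sample):
        if value is None:
            continue  # this base predictor abstains
        centroids = models[j]
        nearest = min(centroids, key=lambda c: abs(value - centroids[c]))
        votes[nearest] += 1
    if not votes:
        raise ValueError("all features are missing for this sample")
    return votes.most_common(1)[0][0]


# Hypothetical two-feature, two-class training set.
X = [[1.0, 10.0], [1.2, 11.0], [5.0, 50.0], [5.5, 52.0]]
y = ["a", "a", "b", "b"]
models = train_per_feature(X, y)

label_missing_first = predict(models, [None, 49.0])
label_missing_second = predict(models, [1.1, None])
```

No value is ever substituted for the missing feature; the sample is classified from whatever features remain.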

• Dimensionality reduction: One of the main factors that control the complexity of the predictive model is the size and dimensionality of the data. Thus, reducing the dimensionality of the data can reduce the complexity of the model. Furthermore, using a smaller number of features (without loss of information) can better explain the underlying process which generates the data, and can help to visualise and analyse the data (Alpaydin (2014)).

Dimensionality reduction can be achieved using one of the following two approaches:

– Feature selection: In this approach, a subset of the features that contains the most information about the prediction problem is selected. The subset is selected based on a predefined metric, such as feature correlation (Tumer and Ghosh (1996)) or mutual information (Cover and Thomas (2012)).
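A minimal sketch of metric-based feature selection is given below. It assumes one particular use of correlation, namely ranking features by the absolute Pearson correlation between each feature and the target and keeping the top k; the cited works may use correlation differently. The names `pearson` and `select_top_k` and the sample data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant sequence has no linear relationship to report
    return cov / sqrt(vx * vy)


def select_top_k(columns, target, k):
    """Return the names of the k features most correlated with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical features: two informative, one unrelated to the target.
columns = {
    "age": [20, 30, 40, 50, 60],
    "noise": [3, 1, 4, 1, 5],
    "income": [21, 29, 42, 50, 61],
}
target = [2.0, 3.1, 3.9, 5.2, 6.0]
selected = select_top_k(columns, target, k=2)
```

Note that this ranking scores each feature independently, so it does not penalise two selected features for being redundant with each other.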

– Feature extraction: In this approach, the original features are transformed into a new set that has fewer features. Feature extraction methods can be classified according to the type of learning method into supervised and unsupervised methods.

The most widely used feature extraction methods are Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), which are linear projection methods; PCA is unsupervised, whereas LDA is supervised. Examples of non-linear dimensionality reduction methods are isometric feature mapping (Tenenbaum et al. (2000)) and locally linear embedding (Saul and Roweis (2000)).
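To make the idea of a linear projection concrete, the sketch below computes the first principal component of a small data set and projects the data onto it, reducing two features to one. It uses power iteration on the sample covariance matrix, which is an implementation choice assumed here for brevity; library implementations typically use a full eigendecomposition or SVD. The function name `first_principal_component` and the data are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return the leading PCA direction and the 1-D projection of the data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # degenerate case: the data has no variance
        v = [x / norm for x in w]
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected


# Hypothetical data lying close to the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
```

Here `direction` points approximately along (1, 2), and `scores` holds one coordinate per sample in place of the original two. Power iteration can fail to converge if the starting vector happens to be orthogonal to the leading direction, which is one reason a library routine is preferable in practice.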
