As discussed in Chapter 2, GA-based solutions are represented as chromosomes. Real-life information must be encoded in chromosomes before the natural selection process can start. For example, rule conditions need to be encoded in chromosomes. Encoding discrete attributes is straightforward: each discrete value forms a condition in the chromosome, as explained in Section 2.5.1. Numeric attributes, however, must be discretized before they can form a condition. Once the range of a numeric attribute has been discretized, the resulting intervals can be treated as discrete attribute values for encoding.
The discretization of numerical attributes is a well-known task among statisticians and machine learning researchers, and many discretization methods are available in the literature. They fall broadly into two classes: unsupervised methods and supervised methods. Examples of unsupervised methods include equal-distance and equal-frequency discretization. In equal-distance discretization, intervals of equal width are constructed between the minimum and maximum values of the attribute. This method assumes that the underlying data fit a uniform distribution reasonably well. Equal-frequency discretization partitions the attribute into intervals that contain the same number of instances. Because unsupervised discretization methods do not use class information, they are also called class-blind discretization methods [Kerber 1992]. Unsupervised discretization is useful in association and characteristic rule mining. However, it is not useful in classification rule mining.
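The two unsupervised methods described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the original work; the function names and the sample values are hypothetical and chosen only for the example.

```python
def equal_width_cut_points(values, k):
    """Interior cut points of k equal-width intervals between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_cut_points(values, k):
    """Interior cut points so that each interval holds about len(values)/k instances."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_interval(value, cut_points):
    """Index of the interval that contains value (cut points are lower bounds)."""
    return sum(value >= c for c in cut_points)


def interval_counts(values, cut_points):
    """Number of instances that fall into each interval."""
    counts = [0] * (len(cut_points) + 1)
    for v in values:
        counts[assign_interval(v, cut_points)] += 1
    return counts


# Hypothetical sample with one outlier, to contrast the two methods.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]

width_cuts = equal_width_cut_points(sample, 3)      # [34.0, 67.0]
freq_cuts = equal_frequency_cut_points(sample, 3)   # [4, 7]

width_counts = interval_counts(sample, width_cuts)  # [8, 0, 1]
freq_counts = interval_counts(sample, freq_cuts)    # [3, 3, 3]
```

On this sample the equal-width intervals are dominated by the outlier, which reflects the uniform-distribution assumption mentioned above, while the equal-frequency intervals each receive three instances. Neither function looks at class labels.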
Supervised discretization methods are widely used in machine learning problems, since class information is available there. Supervised discretization methods include ChiMerge [Kerber 1992], Chi2 [Liu and Setiono 1995], error-based and entropy-based discretization [Kohavi and Sahami 1996], and rough set-based discretization [Nguyen and Nguyen 1998]. ChiMerge has been widely used and cited in DM and machine learning papers. In ChiMerge, the instances of the dataset are sorted according to the value of the attribute to be discretized. Initially, each unique value of the attribute is regarded as an interval. The statistical measure χ² is computed for every pair of adjacent intervals, and the pair with the lowest χ² value is merged. Merging continues until every pair of adjacent intervals has a χ² value higher than a predefined threshold.
Similar to ChiMerge, we propose a new discretization algorithm named Support-ErrorMerge (SuppErrMerge), in which rule class support [Definition 2.2] and rule class error [Definition 2.4] are used instead of χ². The algorithm is outlined in Function 3.2. It produces a set of intervals for each attribute and each class, merging the initial intervals until the resulting intervals satisfy the minimum class support and the maximum allowed class error.
The SuppErrMerge discretization algorithm differs from existing supervised discretization algorithms in several ways. Existing discretization algorithms build the intervals of numerical attributes from statistical measures based on error information between classes (e.g. χ² in ChiMerge, inconsistency rate in Chi2). SuppErrMerge, however, uses support information as well as error information in the interval building process. As discussed in Section 2.7, interesting rules can be mined using different combinations of class error and class support. The combinations that produce interesting rules are accepted in the SuppErrMerge discretization method. The second feature of this method is that it can produce continuous or non-continuous intervals of interest to the user, whereas existing methods always produce continuous intervals. Non-continuous intervals arise because the rule mining method looks for rules only in intervals of user interest (in terms of support or error), and some intervals may not meet these requirements. Hence, non-continuous intervals are more suitable for rule mining. The third feature of this method is that it generates a separate set of intervals for each class. This enables the rule mining process to deal with a smaller number of intervals. For example, if a dataset has 3 classes and a discretization method achieves a certain efficiency with 10 intervals for a numerical attribute, then the rule mining process needs to examine these 10 intervals for each class, which is equivalent to examining 30 intervals altogether. If the SuppErrMerge method is used, it produces at most 10 intervals for each class, and some classes may have fewer than 10 intervals; in that case the rule mining process needs to examine fewer than 30 intervals. This gives a significant performance improvement in the rule mining algorithm.
To test the SuppErrMerge discretization method, the IRIS dataset from the Machine Learning Repository¹ was used. This dataset consists of 150 instances,
¹ Machine Learning Repository is available at kdd.ics.uci.edu
SUPP-ERR-MERGE-DISCRETIZATION
INPUT:
    instances {Training instances}
    attribute_list {Attribute list}
    target_attribute {Target attribute}
    min_class_support {Minimum support}
    max_class_error {Maximum allowed error}
STEPS:
    for each continuous attribute from attribute_list do
        Sort(attribute, instances)
        for each class ∈ target_attribute do
            merge := true;
            do while merge
                merge := false;
                Support_error_calculation(attribute, data, target_attribute)
                for each interval of attribute do
                    if ClassSupport(interval) >= min_class_support and ClassError(interval) <= max_class_error and
                       ClassSupport(next(interval)) >= min_class_support and ClassError(next(interval)) <= max_class_error then
                        if ClassSupport(interval+next(interval)) >= min_class_support and
                           ClassError(interval+next(interval)) <= max_class_error then
                            merge_intervals(interval, next(interval))
                            merge := true;
                        endif
                    endif
                endfor
                merge := (data,attribute);
            enddo
            Save_intervals(attribute, class)
        endfor
    endfor
RETURN: a list of intervals of continuous attributes
which are described by four numerical attributes: sepal length, sepal width, petal length, and petal width. Each instance belongs to one of three classes: Iris-setosa, Iris-versicolor, or Iris-virginica. SuppErrMerge discretization was applied to this dataset with different combinations of support and error. Table 3.3 shows the results of the discretization on the Iris dataset. Naturally, when two adjacent intervals are merged, the resulting interval has a higher support and a higher error than either of the two original intervals. After discretization, if an attribute has k intervals, then the GA chromosome needs k bits to encode the attribute using the Michigan approach. A performance study of the SuppErrMerge discretization method is presented in Section 3.7, where it is adopted in GA rule mining.
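The k-bit encoding of a discretized attribute can be illustrated with a short Python sketch. It assumes that bit i of the gene is set when interval i takes part in the rule condition; this reading, the function names, and the interval boundaries are hypothetical, since the encoding itself is defined in Section 2.5.1 and is not reproduced here.

```python
def encode_condition(selected, k):
    """k-bit gene: bit i is 1 when interval i takes part in the condition."""
    return [1 if i in selected else 0 for i in range(k)]


def interval_index(value, intervals):
    """Index of the (low, high) interval containing value, or None if no interval does."""
    for i, (low, high) in enumerate(intervals):
        if low <= value <= high:
            return i
    return None


def condition_matches(value, intervals, gene):
    """True when value falls in an interval whose bit is set in the gene."""
    i = interval_index(value, intervals)
    return i is not None and gene[i] == 1


# Hypothetical non-continuous intervals for one attribute (k = 3).
intervals = [(0.0, 1.0), (2.0, 3.0), (5.0, 8.0)]
gene = encode_condition({0, 2}, len(intervals))   # [1, 0, 1]

in_first = condition_matches(0.5, intervals, gene)    # True: interval 0 is selected
in_second = condition_matches(2.5, intervals, gene)   # False: interval 1 is not selected
in_gap = condition_matches(4.0, intervals, gene)      # False: value lies between intervals
```

A value that falls in a gap between non-continuous intervals simply matches no condition, which is one way such intervals keep the chromosome short.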
Number of intervals per attribute
Class              Sepal length   Sepal width   Petal length   Petal width
Maximum class error = 5% and minimum class support = 80%
Iris-setosa        1              0             1              1
Iris-versicolor    0              0             1              0
Iris-virginica     0              0             1              1
Maximum class error = 10% and minimum class support = 80%
Iris-setosa        1              0             1              1
Iris-versicolor    0              0             1              0
Iris-virginica     0              0             1              1
Maximum class error = 15% and minimum class support = 80%
Iris-setosa        1              0             1              1
Iris-versicolor    0              0             1              1
Iris-virginica     0              0             1              1
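The merging loop of SuppErrMerge for a single attribute and a single class can be sketched in Python as follows. This is a loose illustration under stated assumptions, not the implementation used to produce Table 3.3: class support is taken to be the share of the class's instances that fall in an interval, class error the share of an interval's instances that belong to other classes (Definitions 2.2 and 2.4 are not reproduced here), and the final step keeps only the intervals that meet both thresholds. All names and the toy data are hypothetical.

```python
from collections import Counter


def class_support(counts, target, class_total):
    # Assumed definition: share of all target-class instances inside the interval.
    return counts[target] / class_total


def class_error(counts, target):
    # Assumed definition: share of the interval's instances from other classes.
    return 1.0 - counts[target] / sum(counts.values())


def supp_err_merge(values, labels, target, min_support, max_error):
    """Return (low, high) intervals of one attribute for one target class."""
    class_total = sum(1 for y in labels if y == target)
    if class_total == 0:
        return []

    # One initial interval per unique attribute value, in sorted order.
    per_value = {}
    for v, y in zip(values, labels):
        per_value.setdefault(v, Counter())[y] += 1
    intervals = [(v, v, per_value[v]) for v in sorted(per_value)]

    def ok(counts):
        return (class_support(counts, target, class_total) >= min_support
                and class_error(counts, target) <= max_error)

    merged = True
    while merged:                      # repeat passes until nothing merges
        merged = False
        i = 0
        while i < len(intervals) - 1:
            low, _, left = intervals[i]
            _, high, right = intervals[i + 1]
            both = left + right        # class counts of the merged interval
            if ok(left) and ok(right) and ok(both):
                intervals[i:i + 2] = [(low, high, both)]
                merged = True
            else:
                i += 1

    # Assumption: only intervals meeting both thresholds are saved.
    return [(low, high) for low, high, counts in intervals if ok(counts)]


# Toy data (not the Iris dataset): class "a" on low values, class "b" on high values.
values = [1, 1, 2, 2, 3, 3, 4, 4]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

intervals_a = supp_err_merge(values, labels, "a", min_support=0.25, max_error=0.1)  # [(1, 2)]
intervals_b = supp_err_merge(values, labels, "b", min_support=0.25, max_error=0.1)  # [(3, 4)]
```

Because the function is called once per class, each class receives its own set of intervals, and a class may receive none when no interval meets the thresholds.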