In the classification problem, the dataset is represented as a structured table of records and features. Each feature is identified by a feature_id and, for each record, is either set to some value v or set to a null value when the information is not available. More formally, the input dataset D is represented as a relation R whose schema is given by d distinct features $A_1, \dots, A_d$ and a class attribute $C \in \mathcal{C}$, where $\mathcal{C}$ is the set of distinct classes.
Each record of R carries a class label (a value belonging to the domain of the class attribute C). Each (feature, value) pair can be seen as an item, so we can use the notation introduced in Section 2.1.1.
Unavailable data are represented with a null value, or are not represented at all, since transactional datasets do not have a fixed structure. If a transaction has exactly d items, or equivalently has no missing values, it is a point in a d-dimensional space. The common practice in classification is to define a training set, i.e. a part of the labeled dataset that is used to train the algorithm, and a test set, from which the labels are removed. The two sets are used together with other techniques, such as cross-validation, to simulate the behavior of the algorithm on new, unlabeled data and to validate its performance.
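The (feature, value) encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `record_to_transaction`, the feature names and the toy values are hypothetical, and a null value is assumed to be represented by `None`.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# Hypothetical item type: a (feature_id, value) pair.
Item = Tuple[str, str]


def record_to_transaction(record: Dict[str, Optional[str]]) -> FrozenSet[Item]:
    """Encode a record as a set of (feature, value) items.

    Features whose value is None (assumed here to mean "not available")
    are simply left out, so transactions may have fewer than d items.
    """
    return frozenset(
        (feature, value) for feature, value in record.items() if value is not None
    )


# Toy record with d = 3 features, one of which is missing.
record = {"A1": "sunny", "A2": None, "A3": "high"}
transaction = record_to_transaction(record)
```

Here the resulting transaction has two items rather than three, which is why it cannot be read as a point in the full d-dimensional space.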
Association rules [9] consist of an antecedent itemset A and a consequent itemset B, and are read as A yields B, or A ⇒ B. When the consequent is a single item, and specifically an item belonging to the set of class labels, the association rule can be used to label a record. We adopt the naming in [10] and call these rules Class Association Rules, or CARs.
Both association rules and CARs share a number of metrics that measure their strength and statistical significance [11, 12]. The support count (supCount), or absolute support, of an itemset is the number of transactions in the dataset D that contain the whole itemset. The support (sup), or relative support, of a rule A ⇒ B is defined as supCount(A ∪ B)/|D|, where |D| is the cardinality of D. The confidence (conf) of the rule is defined as supCount(A ∪ B)/supCount(A), and in CARs it measures how precise the rule is at labeling a record. The lift of a rule (lift) is a measure of the (symmetric) correlation between the antecedent and the consequent of the rule, and it is defined as conf(A ⇒ B)/sup(B). Lift values significantly above 1 indicate a positive correlation between the rule antecedent and consequent, meaning that the implication between A and B holds more often than expected in the source dataset. The χ² of a CAR is the value of the χ² statistic computed against the distribution of the classes in D, which indicates whether the assumed correlation between the antecedent and the consequent is statistically significant. Other measures used to sort rules and CARs are their length (len), i.e. the total number of items in A and B, and the lexicographical order (lex).
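As a worked illustration of the definitions above, the following minimal sketch computes supCount, sup, conf and lift on a toy dataset. The function names, the string encoding of items and the four transactions are assumptions made for the example; class labels are encoded as ordinary items such as `"class=yes"`.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def sup_count(itemset: Itemset, transactions: List[Itemset]) -> int:
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(a: Itemset, b: Itemset, transactions: List[Itemset]) -> float:
    """sup(A => B) = supCount(A u B) / |D|."""
    return sup_count(a | b, transactions) / len(transactions)


def confidence(a: Itemset, b: Itemset, transactions: List[Itemset]) -> float:
    """conf(A => B) = supCount(A u B) / supCount(A).

    Raises ZeroDivisionError if A never occurs; a real implementation
    would have to decide how to handle that case.
    """
    return sup_count(a | b, transactions) / sup_count(a, transactions)


def lift(a: Itemset, b: Itemset, transactions: List[Itemset]) -> float:
    """lift(A => B) = conf(A => B) / sup(B)."""
    sup_b = sup_count(b, transactions) / len(transactions)
    return confidence(a, b, transactions) / sup_b


# Toy dataset D with |D| = 4 (hypothetical values).
D = [
    frozenset({"a", "b", "class=yes"}),
    frozenset({"a", "class=yes"}),
    frozenset({"a", "class=no"}),
    frozenset({"b", "class=no"}),
]
A = frozenset({"a"})
B = frozenset({"class=yes"})

rule_sup = support(A, B, D)      # 2 / 4
rule_conf = confidence(A, B, D)  # 2 / 3
rule_lift = lift(A, B, D)        # (2 / 3) / (2 / 4)
```

In this toy case the lift is 4/3, i.e. above 1, so the antecedent and the class label co-occur more often than they would if they were independent.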
Another measure that is widely used in classification algorithms is the Gini impurity [13]. The Gini impurity measures how often a record would be wrongly labeled if it were labeled randomly according to the distribution of the classes in the dataset. It is used, for example, in decision trees to evaluate the quality of the splits at each node. The Gini impurity of a dataset, or of a portion of it, is computed as
$$\mathrm{Gini} = \sum_{i \in \mathcal{C}} f_i \, (1 - f_i)$$
where $f_i$ is the frequency of class i in the dataset, or portion of it, for which we are computing the impurity. A portion of the dataset is considered pure if its Gini impurity is equal to 0, which happens when only a single label appears. We will refer to the Gini impurity of an itemset as the impurity of the portion of the dataset that contains the itemset.
In association rule mining and associative classification, the user is usually able to set minimum thresholds for the above-mentioned quality measures, such as a minimum support minsup, a minimum confidence minconf, a minimum positive lift min+lift, etc.
The model generation phase of associative classifiers is usually based on two steps: (i) extraction of all the CARs with a support higher than a minimum support threshold minsup and a confidence higher than a minimum confidence threshold minconf, and (ii) rule selection by means of the database coverage technique, first introduced in [10]. The database coverage technique works as follows. Given an ordered list of CARs extracted from a training set, it considers one CAR r at a time, in the sort order, and selects the transactions in the training set that are matched by r. For each matched transaction t, it also checks whether r classifies t correctly. If r correctly classifies at least one training record, then r is kept. Conversely, if all the training transactions matched by r have a class label different from that of r, then r is discarded. If r does not match any training data, then r is discarded as well. Once r has been analysed, all the transactions matched by r are removed from the training set, and the next CAR is analysed by considering only the remaining transactions.
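The database coverage procedure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `database_coverage`, the data layout (each transaction stored as a pair of items and class label) and the toy data are assumptions. The sketch follows the description literally and removes the transactions matched by r whether or not r is kept; a variant that removes them only when r is kept would be a one-line change.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical layouts: a CAR is (antecedent itemset, predicted class label),
# a training transaction is (items, true class label).
Rule = Tuple[FrozenSet[str], str]
Transaction = Tuple[FrozenSet[str], str]


def database_coverage(
    sorted_rules: List[Rule], training_set: List[Transaction]
) -> List[Rule]:
    """Select CARs from an already sorted list via database coverage."""
    remaining = list(training_set)
    selected: List[Rule] = []
    for antecedent, label in sorted_rules:
        # Transactions still in the training set that r matches.
        matched = [t for t in remaining if antecedent <= t[0]]
        if not matched:
            continue  # r matches no remaining training data: discard it
        # Keep r only if it labels at least one matched transaction correctly.
        if any(true_label == label for _, true_label in matched):
            selected.append((antecedent, label))
        # Assumption (literal reading of the text): matched transactions are
        # removed once r has been analysed, even if r was discarded.
        remaining = [t for t in remaining if not antecedent <= t[0]]
    return selected


# Toy training set and sorted rule list (hypothetical values).
training = [
    (frozenset({"a", "b"}), "yes"),
    (frozenset({"a"}), "yes"),
    (frozenset({"b"}), "no"),
    (frozenset({"c"}), "no"),
]
rules = [
    (frozenset({"a"}), "yes"),  # matches two transactions, both correct: kept
    (frozenset({"b"}), "yes"),  # matches one transaction, wrong label: discarded
    (frozenset({"b"}), "no"),   # its only match was already removed: discarded
    (frozenset({"c"}), "no"),   # matches one transaction, correct: kept
]
selected = database_coverage(rules, training)
```

The third rule shows why the sort order matters: it would have classified a training record correctly, but the record was consumed by an earlier rule, so the rule no longer matches anything and is dropped.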