The HyperSplit decision tree technique [107] is a successor to HiCuts that aims to improve on it in both memory footprint and classification performance. At its core, HyperSplit works similarly to HiCuts, but it differs significantly in the heuristic used to select the cut dimension. Furthermore, HyperSplit always performs exactly one cut in the chosen cut dimension, in contrast to the γ cuts performed by HiCuts. Accordingly, the HyperSplit data structure is always a binary tree, which can be stored and accessed efficiently.
The HyperSplit paper [107] describes two different heuristics to choose the dimension to cut, but recommends the usage of one specific heuristic due to its superiority over the other in terms of space savings. Before we dive into the details of the superior heuristic, we introduce the notion of interval weights. As we have seen in Section 4.3.1, the endpoints of the geometric rule representations in a dimension $j$ can be used to partition the $j$th axis of a $d$-dimensional box. More precisely, when we consider a rule set $R = \langle R_1, \ldots, R_n \rangle$, where the geometric representation of each rule $B(R_i) = [a_{i1}, b_{i1}] \times \ldots \times [a_{id}, b_{id}]$ lies within a $d$-dimensional box $B = [a_1, b_1] \times \ldots \times [a_d, b_d]$, the endpoints $a_{ij}$ and $b_{ij}$ partition the $j$th axis $[a_j, b_j]$ of $B$ into at most $l_j \le 2n + 1$ disjoint intervals $I^j_k$, $k \in \{1, \ldots, l_j\}$. The weight $w(I^j_k)$ of an interval $I^j_k$ is the number of rules that intersect with $I^j_k$ in the $j$th dimension, i.e.,

$$w(I^j_k) = \left| \left\{ R_i \mid [a_{ij}, b_{ij}] \cap I^j_k \neq \emptyset \right\} \right|. \tag{4.20}$$
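The interval weights of Equation 4.20 can be computed with a few lines of code. The following is a minimal sketch, not the implementation from [107]. It assumes that rules are given as inclusive integer ranges per dimension and that a new interval starts at every $a_{ij}$ and directly after every $b_{ij}$. The names `axis_intervals` and `interval_weights` are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]   # inclusive integer range [lo, hi] (assumption)
Rule = Sequence[Interval]    # one inclusive range per dimension


def axis_intervals(rules: Sequence[Rule], box: Sequence[Interval], j: int) -> List[Interval]:
    """Partition axis j of the box at the rule endpoints (at most 2n + 1 pieces)."""
    a_j, b_j = box[j]
    starts = {a_j}
    for rule in rules:
        lo, hi = rule[j]
        if a_j < lo <= b_j:
            starts.add(lo)       # an interval begins where a rule begins
        if a_j <= hi < b_j:
            starts.add(hi + 1)   # and directly after a rule ends
    ordered = sorted(starts)
    ends = [s - 1 for s in ordered[1:]] + [b_j]
    return list(zip(ordered, ends))


def interval_weights(rules: Sequence[Rule], box: Sequence[Interval], j: int) -> List[int]:
    """w(I_k^j): number of rules whose j-th range intersects I_k^j (Eq. 4.20)."""
    weights = []
    for start, end in axis_intervals(rules, box, j):
        # Two inclusive ranges intersect iff each starts before the other ends.
        weights.append(sum(1 for r in rules if r[j][0] <= end and r[j][1] >= start))
    return weights


# Hypothetical two-rule example in a box [0, 15] x [0, 9].
example_rules = [((0, 7), (2, 5)), ((4, 15), (0, 9))]
example_box = ((0, 15), (0, 9))
print(axis_intervals(example_rules, example_box, 0))    # [(0, 3), (4, 7), (8, 15)]
print(interval_weights(example_rules, example_box, 0))  # [1, 2, 1]
```

In this example, both rules overlap only in the middle interval of the first axis, so only that interval has weight 2.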
To determine the dimension $\delta$ to cut, HyperSplit's dimension heuristic chooses the dimension with the smallest average interval weight $w_\delta$, where

$$w_j = \frac{1}{l_j} \sum_{k=1}^{l_j} w(I^j_k). \tag{4.21}$$

Fig. 4.20: HyperSplit cut dimension and cut point determination using interval weights.

Since the cut will take place in one interval $I^\delta_k$, possibly every rule that intersects with $I^\delta_k$ will be duplicated. Therefore, the heuristic's idea is to pick the dimension where the average number of possibly duplicated rules is minimal. Accordingly, the dimension $Y$ would be chosen to cut in Figure 4.20, since $w_Y < w_X$.
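The dimension heuristic can be sketched as follows, assuming that the interval weights of every dimension have already been computed. The function names and the weight values are hypothetical and are not taken from Figure 4.20. Ties are broken in favor of the lowest dimension index, which is an assumption of this sketch rather than something the text specifies.

```python
from typing import Sequence


def average_weight(weights: Sequence[int]) -> float:
    """Average interval weight w_j of one dimension (Eq. 4.21)."""
    return sum(weights) / len(weights)


def choose_cut_dimension(weights_per_dim: Sequence[Sequence[int]]) -> int:
    """Return the index of the dimension with the smallest average interval weight."""
    averages = [average_weight(w) for w in weights_per_dim]
    # min() returns the first minimum, so ties go to the lowest index (assumption).
    return min(range(len(averages)), key=averages.__getitem__)


# Hypothetical weights for two dimensions X (index 0) and Y (index 1).
weights_x = [1, 2, 2, 1]      # average 1.5
weights_y = [1, 1, 2, 1, 1]   # average 1.2
print(choose_cut_dimension([weights_x, weights_y]))  # 1, i.e. dimension Y
```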
The next important step is to determine the cut point $\rho$ in the interval $[a_\delta, b_\delta]$. A reasonable choice for $\rho$ would be a point that separates half of the rules in $B$ into the interval $I_{\text{left}} = [a_\delta, \rho - 1]$ and the other half into $I_{\text{right}} = [\rho, b_\delta]$. Although this is not always possible, e.g., due to rule overlaps, HyperSplit aims to approximate such a bisection by choosing $\rho$ such that the interval weights of $I_{\text{left}}$ and $I_{\text{right}}$ are close. To this end, $\rho$ is set to the start point of the interval $I^\delta_m$, where $m$ is the smallest value in $\{1, \ldots, l_\delta\}$ with

$$\sum_{k=1}^{m} w(I^\delta_k) > \frac{1}{2} \sum_{k=1}^{l_\delta} w(I^\delta_k). \tag{4.22}$$
In Figure 4.20, $\rho$ would be set to the start point of the fourth interval of the chosen dimension, because $m = 4$ is the smallest value that satisfies Condition 4.22. The figure indicates that this cut point indeed separates rule $R_1$ from rule $R_2$.
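Condition 4.22 translates directly into a running sum over the interval weights of the chosen dimension. The sketch below assumes inclusive integer intervals that are sorted along the axis. The function name and the example values are hypothetical and do not reproduce the data of Figure 4.20.

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]  # inclusive integer range [start, end] (assumption)


def choose_cut_point(intervals: Sequence[Interval], weights: Sequence[int]) -> int:
    """Return the start point of I_m, with m the smallest index satisfying Cond. 4.22."""
    half = sum(weights) / 2
    running = 0
    for (start, _end), weight in zip(intervals, weights):
        running += weight
        if running > half:   # prefix sum exceeds half of the total weight
            return start
    raise ValueError("no cut point found: the total interval weight must be positive")


# Hypothetical axis [0, 9] split into five intervals.
intervals = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
weights = [1, 1, 0, 1, 1]
print(choose_cut_point(intervals, weights))  # 6, the start of the fourth interval
```

With these values the prefix sums are 1, 2, 2, 3, so the condition first holds for the fourth interval, and the cut yields the intervals [0, 5] and [6, 9].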
The general procedure that generates a HyperSplit decision tree from a specified rule set is analogous to that of HiCuts. Figure 4.22 shows the HyperSplit tree that would be generated from the rule set given in Figure 4.21. It becomes apparent that the HyperSplit tree in Figure 4.22 has the same height as the HiCuts tree from Figure 4.19, but requires only five nodes (in contrast to the 16 nodes of the HiCuts tree). The reason for this is the greater care with which HyperSplit places its cuts with regard to tree balancing and the avoidance of rule duplication.
Fig. 4.21: Complete HyperSplit example with five rules, β = 2, and H = [0, 15] × [0, 9].

Fig. 4.23: Array representation of the HyperSplit tree shown in Figure 4.22.

The worst-case complexity of HyperSplit does not differ from that of HiCuts in terms of tree height $T^{\uparrow}$ and number of nodes $|T|$ in the tree. However, the authors of the HyperSplit paper suggest that $T^{\uparrow} \in O(d \cdot \log(2n + 1))$ [107]. In comparison to HiCuts trees, the binary HyperSplit trees also have the advantage that they can be stored and accessed in a compact way using an array representation, as sketched in Figure 4.23. Here, each node in the HyperSplit tree is represented as a 64-bit word. If the word represents an inner node, the first 32 bits contain the cut dimension δ and the offset to the word representing the right child node (the left child node is always the immediately following word), while the second 32 bits store the cut point ρ. If the word represents a leaf node, the first 32 bits are zero, while the second 32 bits contain a pointer to the corresponding sub rule set. That way, the entire search data structure can not only be stored in one coherent memory chunk (except for the sub rule sets), but it also requires only forward jumps during a lookup, which allows for cache-efficient memory accesses.
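A lookup in such an array can be sketched as follows. The text does not specify how the first 32 bits are divided between the offset and the cut dimension, so this sketch assumes that the lowest `DIM_BITS` = 3 bits hold δ and the remaining bits hold the offset relative to the current word. It also models the pointer to a sub rule set as a plain index. All names are hypothetical.

```python
from typing import List, Sequence

DIM_BITS = 3                      # assumed width of the cut-dimension field
DIM_MASK = (1 << DIM_BITS) - 1
LOW_MASK = 0xFFFFFFFF


def pack_inner(right_offset: int, dim: int, cut_point: int) -> int:
    """Inner node: upper 32 bits = offset to right child and delta, lower 32 bits = rho."""
    upper = (right_offset << DIM_BITS) | dim
    assert 0 < upper <= LOW_MASK and 0 <= cut_point <= LOW_MASK
    return (upper << 32) | cut_point


def pack_leaf(rule_set_index: int) -> int:
    """Leaf node: upper 32 bits are zero, lower 32 bits reference the sub rule set."""
    assert 0 <= rule_set_index <= LOW_MASK
    return rule_set_index


def classify(tree: Sequence[int], point: Sequence[int]) -> int:
    """Walk the array from the root and return the sub rule set index of the leaf."""
    pos = 0
    while True:
        upper, lower = tree[pos] >> 32, tree[pos] & LOW_MASK
        if upper == 0:
            return lower                      # leaf reached
        dim, right_offset = upper & DIM_MASK, upper >> DIM_BITS
        if point[dim] < lower:
            pos += 1                          # left child is the next word
        else:
            pos += right_offset               # forward jump to the right child


# Hypothetical tree: the root cuts dimension 1 at 5, its right child cuts dimension 0 at 8.
tree: List[int] = [
    pack_inner(2, 1, 5),   # index 0: root
    pack_leaf(0),          # index 1: left child of the root
    pack_inner(2, 0, 8),   # index 2: right child of the root
    pack_leaf(1),          # index 3: left child of index 2
    pack_leaf(2),          # index 4: right child of index 2
]
print(classify(tree, (3, 2)), classify(tree, (3, 7)), classify(tree, (12, 7)))  # 0 1 2
```

Because the right child of an inner node is always at least two words ahead, the upper 32 bits of an inner node are never zero in this layout, which keeps the leaf marker unambiguous.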