Hierarchical Intelligent Cuttings (HiCuts) by Gupta and McKeown [63] recursively partitions the header space H through equidistant cuts, thereby creating a HiCuts decision tree. The root of the tree is associated with the hyperrectangle B(H) of the entire header space and is the starting point of the recursive decision tree generation process. During preprocessing, a node N is cut if the hyperrectangle B(N) that is associated with N covers more than β rules. Here, β is a predefined constant that is also called the binth [63] of the decision tree. Lower values of β often result in deeper decision trees and in faster searches of the leaf nodes during classification, whose sub rule sets are queried linearly. In contrast, high values of β lead to small decision trees that can be created quickly but can classify more slowly, as the linear searches become more expensive.
When a node N is cut, HiCuts uses heuristics to decide (1) which dimension δ should be cut and (2) how many cuts γ should be performed in dimension δ. Finding adequate values for both is crucial for the performance of HiCuts in the preprocessing as well as in the classification phase, because a bad choice of, e.g., the cut dimension can lead to memory explosions, huge preprocessing times, and bad classification performance. The original HiCuts paper [63] proposes four different heuristics for computing the cut dimension δ, but does not recommend which of them should be used in which situation. We therefore describe only one of these heuristics, which was the most successful in our experiments and is also recommended in related work [122].
Fig. 4.16: The HiCuts dimension heuristic chooses δ = Y since |ΠY| = 4 > |ΠX| = 2.
The idea of the dimension choice heuristic is to pick the dimension δ with the most rule projection points. That is, given a node N that covers the area $B(N) = [a^N_1, b^N_1] \times \ldots \times [a^N_d, b^N_d]$, all rules $R_i \in R$ are taken into consideration whose geometrical representation $B(R_i)$ intersects with $B(N)$. We refer to the sub rule set of the rules associated to N through intersection by $R_N$. Then, for each dimension j, the set of distinct rule projection points $\Pi_j$ is computed with
\[
\Pi_j = \left\{ x \;\middle|\; \left( x = a^R_j \wedge x \in [a^N_j, b^N_j] \right) \vee \left( x = b^R_j \wedge x \in [a^N_j, b^N_j] \right) \text{ with } R \in R_N \text{ and } B(R)_j = [a^R_j, b^R_j] \right\}. \tag{4.15}
\]
Finally, the cutting dimension δ is the dimension with |Πδ| ≥ |Πj| ∀j ∈ {1, . . . , d}.
In case of a tie, the cutting dimension can be determined randomly from the best candidates. The dimension selection heuristic is illustrated in Figure 4.16.
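The dimension choice can be sketched in a few lines of Python. The sketch assumes that rules and nodes are given as lists of closed integer intervals, one per dimension. The function names and the example rule set are hypothetical and are not the rules of Figure 4.16. Ties are resolved in favor of the lowest dimension index instead of randomly, only to keep the result reproducible.

```python
from typing import List, Sequence, Set, Tuple

Interval = Tuple[int, int]   # closed integer interval [a, b]
Box = Sequence[Interval]     # one interval per dimension


def intersects(rule: Box, node: Box) -> bool:
    """True if the two hyperrectangles share at least one point."""
    return all(ra <= nb and na <= rb for (ra, rb), (na, nb) in zip(rule, node))


def projection_points(rules: List[Box], node: Box, j: int) -> Set[int]:
    """Distinct rule end points in dimension j that lie inside the node (Eq. 4.15)."""
    na, nb = node[j]
    points: Set[int] = set()
    for rule in rules:
        if not intersects(rule, node):
            continue  # rule does not belong to the sub rule set of the node
        for x in rule[j]:  # the two end points a and b of the rule in dimension j
            if na <= x <= nb:
                points.add(x)
    return points


def choose_cut_dimension(rules: List[Box], node: Box) -> int:
    """Index of the dimension with the most projection points (ties: lowest index)."""
    counts = [len(projection_points(rules, node, j)) for j in range(len(node))]
    return max(range(len(node)), key=lambda j: counts[j])


# Made-up example: two projection points in X (index 0), four in Y (index 1).
node = [(0, 7), (0, 7)]
rules = [[(0, 7), (0, 1)], [(0, 7), (4, 6)]]
delta = choose_cut_dimension(rules, node)
```

In this made-up example, both rules span the node completely in X, so only the node bounds 0 and 7 are projection points there, whereas Y contributes the points 0, 1, 4, and 6.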
After the cutting dimension δ has been determined for a node N, a cut heuristic chooses the number of cuts to perform in the dimension δ with respect to the bounds of N. Here, the HiCuts algorithm relies on a greedy strategy that starts off with $\max\{4, \lfloor \sqrt{|R_N|} \rfloor\}$ cuts and then stepwise doubles the number of cuts γ until the space required by the resulting child nodes exceeds a threshold $\Sigma_N$. More precisely, with $C_N$ being the set of child nodes of N and spfac being a predefined space factor, the termination criterion for the greedy search is
\[
\left( |C_N| + \sum_{C \in C_N} |R_C| \right) > \left( \mathit{spfac} \cdot |R_N| \right). \tag{4.16}
\]
Of course, γ is naturally limited by the number of points in the interval $[a^N_\delta, b^N_\delta]$ of the area $B(N)$. Similar to the binth parameter, the space factor spfac influences the shape of the generated decision tree: the larger spfac, the more cuts can be performed, which can result in a shallower and broader tree. However, large values for spfac can quickly lead to memory explosions, which is why the literature uses small values of spfac of up to eight [63,122].
Once the cut dimension δ and the number of cuts γ have been determined, the node N is partitioned into γ + 1 child nodes $C_1, \ldots, C_{\gamma+1}$. This happens by splitting the area
\[
B(N) = [a^N_1, b^N_1] \times \ldots \times [a^N_\delta, b^N_\delta] \times \ldots \times [a^N_d, b^N_d] \tag{4.17}
\]
into γ + 1 areas
\[
B(C_i) = [a^N_1, b^N_1] \times \ldots \times [a^{C_i}_\delta, b^{C_i}_\delta] \times \ldots \times [a^N_d, b^N_d] \tag{4.18}
\]
with $i \in \{1, \ldots, \gamma+1\}$, $\bigcap_{i=1}^{\gamma+1} [a^{C_i}_\delta, b^{C_i}_\delta] = \emptyset$ and $\bigcup_{i=1}^{\gamma+1} [a^{C_i}_\delta, b^{C_i}_\delta] = [a^N_\delta, b^N_\delta]$. These areas are equally sized, except occasionally the last area $B(C_{\gamma+1})$.
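The greedy cut search and the subsequent split can be sketched together in Python. Several points are assumptions of this sketch rather than statements of the HiCuts paper: rules are passed only as their projections onto the cut dimension δ (all rules of the node are assumed to intersect it in the other dimensions), the last child absorbs the remainder of the interval, and the search returns the last number of cuts that stays within the space budget of criterion (4.16), or the start value if already that exceeds it. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Interval = Tuple[int, int]  # closed integer interval [a, b]


def child_intervals(a: int, b: int, cuts: int) -> List[Interval]:
    """Split [a, b] into cuts + 1 intervals; the last one absorbs the remainder."""
    width = (b - a + 1) // (cuts + 1)
    if width < 1:
        raise ValueError("more child intervals than points in [a, b]")
    return [(a + i * width, b if i == cuts else a + (i + 1) * width - 1)
            for i in range(cuts + 1)]


def space_measure(rules_1d: List[Interval], a: int, b: int, cuts: int) -> int:
    """Left-hand side of (4.16): number of children plus all rule copies."""
    children = child_intervals(a, b, cuts)
    copies = sum(1 for lo, hi in children for ra, rb in rules_1d
                 if ra <= hi and lo <= rb)
    return len(children) + copies


def choose_num_cuts(rules_1d: List[Interval], a: int, b: int, spfac: float) -> int:
    """Double the number of cuts while the next candidate stays within the budget."""
    n = len(rules_1d)
    max_cuts = b - a  # at most one child per point of the interval
    if max_cuts < 1:
        return 0      # a single point cannot be cut
    cuts = min(max(4, math.isqrt(n)), max_cuts)
    while (2 * cuts <= max_cuts
           and space_measure(rules_1d, a, b, 2 * cuts) <= spfac * n):
        cuts *= 2
    return cuts


# Made-up projections of five rules onto the cut dimension [0, 15].
rules_1d = [(0, 15), (0, 3), (4, 7), (8, 11), (12, 15)]
few_cuts = choose_num_cuts(rules_1d, 0, 15, spfac=2)
more_cuts = choose_num_cuts(rules_1d, 0, 15, spfac=8)
```

With the made-up projections above, the small space factor keeps the start value of four cuts, while the larger space factor permits one doubling to eight cuts.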
Fig. 4.17: The HiCuts cutting process creates five child nodes $C_1, \ldots, C_5$ by cutting four times in dimension X.
Figure 4.17 depicts the HiCuts cutting process described above and thereby illustrates the major cause of memory usage in HiCuts, namely rule duplication. For example, the rule $R_1$ intersects with the areas of the children $C_1, \ldots, C_4$ and is thus duplicated three times, as it must be processed in each of these child nodes. By looking at the figure, we could also assume that rule $R_2$ is duplicated into the children $C_3$, $C_4$, and $C_5$. However, in the case of child $C_3$, an important subtlety must be respected: within the area $B(C_3)$, the rule $R_2$ is completely covered by the rule $R_1$ and can therefore never be reached when the classification process enters $C_3$. Rule $R_2$ should therefore be removed from the child $C_3$, and generally, every rule in a node N that is completely covered by more highly prioritized rules should be removed from N. In the original HiCuts publication [63], this redundancy removal is mentioned as an optional and “time consuming” way to optimize the storage requirements of the resulting tree. However, in order to guarantee that the tree building process always terminates, this redundancy removal step is mandatory. As a simple example, consider a rule set $R = \langle R_1, R_2 \rangle$ with $B(R_1) = B(R_2)$. Without redundancy removal, it is not possible to construct a HiCuts tree for R with a binth value β = 1, because no executed cut can separate $R_1$ from $R_2$.
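A simplified form of this redundancy removal can be sketched as follows. The sketch assumes that the rule list is ordered by decreasing priority and only detects rules that are covered by a single more highly prioritized rule within the node; coverage by the union of several rules, which the general statement above also includes, is not detected. The function names and example rules are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[int, int]   # closed integer interval [a, b]
Box = Sequence[Interval]     # one interval per dimension


def clip(rule: Box, node: Box) -> Optional[List[Interval]]:
    """Part of the rule that lies inside the node, or None if they are disjoint."""
    clipped = []
    for (ra, rb), (na, nb) in zip(rule, node):
        lo, hi = max(ra, na), min(rb, nb)
        if lo > hi:
            return None
        clipped.append((lo, hi))
    return clipped


def covers(outer: Box, inner: Box) -> bool:
    """True if outer contains inner in every dimension."""
    return all(oa <= ia and ib <= ob for (oa, ob), (ia, ib) in zip(outer, inner))


def rules_for_node(rules: List[Box], node: Box) -> List[Box]:
    """Sub rule set of the node without rules hidden by a single earlier rule."""
    kept = []  # pairs of (rule, part of the rule inside the node)
    for rule in rules:  # highest priority first
        part = clip(rule, node)
        if part is None:
            continue  # rule does not intersect the node
        if any(covers(kept_part, part) for _, kept_part in kept):
            continue  # unreachable inside this node
        kept.append((rule, part))
    return [rule for rule, _ in kept]


# Made-up rules: r1 hides r2 inside the child [6, 8] x [0, 9].
r1 = [(0, 11), (0, 9)]
r2 = [(6, 15), (2, 5)]
in_child = rules_for_node([r1, r2], [(6, 8), (0, 9)])
```

Comparing only against rules that were kept is sufficient here, because a rule that was dropped is itself covered by a kept rule, and coverage inside the node is transitive. For two rules with identical areas, the second one is always dropped, which is the case that otherwise prevents termination for β = 1.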
Fig. 4.18: Complete HiCuts example with five rules, β = 2, and H = [0, 15] × [0, 9].
Figures 4.18 and 4.19 illustrate a complete example for the HiCuts tree generation process. It can be seen that even for the small rule set in Figure 4.18 with five rules, the resulting decision tree in Figure 4.19 has a considerable size of 16 nodes with a total of 13 sub rule sets associated to the leaf nodes. Furthermore, note that the rules $R_2$, $R_3$, $R_4$, and $R_5$ have a duplication factor of 3, 5, 2, and 7, respectively, which significantly contributes to the tree’s size. In order to mitigate the large storage requirements for the decision tree, the HiCuts paper [63] suggests joining neighboring child nodes with the same set of associated rules. For example, in Figure 4.19, the nodes $N_7$ and $N_8$ can be joined to a single node $N_{7,8}$ that covers the area [6, 8] × [0, 3]. Note, however, that the number of pointers in the parent node remains the same, and pointers to originally distinct nodes are simply redirected to the joined node.
The generated decision tree $T_R$ for a rule set R can be used for quick
packet classifications by traversing the tree based on the header data hp for an
incoming packet p, until a leaf node is reached. The traversal starts at the root node and, whenever a non-leaf node N is encountered, the metadata stored in N is used to quickly decide into which child the search must descend. More precisely, when we assume that pointers to the child nodes are stored in a zero-indexed array AN, then the O(1) operation
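\[
A_N[i] \quad \text{with} \quad i = \min\left\{ \left\lfloor \frac{\overbrace{h_\delta - a^N_\delta}^{\text{header offset}}}{\underbrace{\left\lfloor \frac{b^N_\delta - a^N_\delta + 1}{\gamma + 1} \right\rfloor}_{\text{number of points in child intervals, can be precomputed}}} \right\rfloor,\ \gamma \right\} \tag{4.19}
\]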
leads to the correct child node $C_{i+1}$. When the search has finally descended into a leaf node N, the corresponding sub rule set $R_N$ is searched linearly for the most highly prioritized matching rule. For example, the packet p with header $h_p = (13, 5)$ is classified by starting the tree traversal at the root node $N_1$ and computing the child index $\min\{\lfloor (13 - 0) / \lfloor (15 - 0 + 1) / 5 \rfloor \rfloor, 4\} = 4$, which points to the node $N_6$. Subsequently, the index $\min\{\lfloor (5 - 0) / \lfloor (9 - 0 + 1) / 5 \rfloor \rfloor, 4\} = 2$ points to the leaf node $N_{14}$, whose associated sub rule set $R_{N_{14}} = \langle R_5 \rangle$ yields the most highly prioritized matching rule $R_5$.
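The traversal can be sketched in Python as follows. The `Node` layout, the function names, and the one-level example tree are assumptions made for illustration; the tree is not the one of Figure 4.19. Only the two child index computations reproduce the numbers of the example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[int, int]               # closed integer interval [a, b]
Rule = Tuple[str, Sequence[Interval]]    # (name, box); list order = priority


def child_index(h: int, a: int, b: int, cuts: int) -> int:
    """Zero-based child index of header value h in [a, b] for the given cuts."""
    points_per_child = (b - a + 1) // (cuts + 1)  # can be precomputed per node
    return min((h - a) // points_per_child, cuts)


@dataclass
class Node:
    rules: List[Rule] = field(default_factory=list)  # searched in leaf nodes only
    dim: int = 0                                     # cut dimension
    lo: int = 0                                      # lower node bound in dim
    hi: int = 0                                      # upper node bound in dim
    cuts: int = 0                                    # number of cuts
    children: List["Node"] = field(default_factory=list)


def classify(root: Node, header: Sequence[int]) -> Optional[str]:
    """Descend to a leaf, then search its rules linearly in priority order."""
    node = root
    while node.children:
        i = child_index(header[node.dim], node.lo, node.hi, node.cuts)
        node = node.children[i]
    for name, box in node.rules:
        if all(a <= h <= b for h, (a, b) in zip(header, box)):
            return name
    return None


# Child indices of the example: root cut in X over [0, 15], then Y over [0, 9].
first = child_index(13, 0, 15, 4)
second = child_index(5, 0, 9, 4)

# Hypothetical one-level tree with five leaves; only the last leaf holds a rule.
leaves = [Node() for _ in range(5)]
leaves[4].rules = [("R5", [(12, 15), (0, 9)])]
root = Node(dim=0, lo=0, hi=15, cuts=4, children=leaves)
match = classify(root, (13, 5))
```

The `min` in `child_index` matters for the last child, which may be larger than the others: a header value of 15 in [0, 15] would otherwise yield the out-of-range index 5.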
With $T_R{\uparrow}$ being the height of the decision tree, the overall classification time is in $O(T_R{\uparrow})$, since every classification requires the traversal from the root to a leaf node associated to a rule set with a small maximum size of β. Also, the required space and the tree construction time are proportional to the number of nodes $|T_R|$ in the decision tree $T_R$ and therefore in $O(|T_R|)$. Both the height $T_R{\uparrow}$ and the number of nodes $|T_R|$ are hard to predict due to the utilized heuristics. Although it is generally possible to incrementally update a decision tree [63], a rule insertion or deletion can affect every single node in the tree. Furthermore, depending on the nature of the update, such as a rule insertion, additional backtracking may be required in order to avoid deterioration of the tree’s structure. According to the literature [58,61,107,144,147], $|T|$ is in $O(n^d)$ and $T{\uparrow}$ is in $O(d)$. However, these publications do not provide a proof for the abovementioned bounds. These performance indicators are summarized in Table 4.8.

Classification   Data structure   Data structure   Memory
operation        creation         update           requirements
O(T↑)            O(|T|)           O(|T|)           O(|T|)

T: the HiCuts decision tree   T↑: height of T   |T|: number of nodes in T
Tab. 4.8: HiCuts performance characteristics.