tion
GIM is designed for high dimensional data. In particular, note from section 3.8
that it scales linearly with the dimensionality of the vectors, since the framework is based purely on vector valued functions. Since GIM operates on vectors existing in some space X, it is possible to reduce the dimensionality of those vectors without affecting the operation of GIM. Furthermore, the semantics and scrutability of the pattern do not change. That is, the results are still an interaction expressed as a subset of V. For example, the clustering approach considered in section 3.9 may easily have dimensionality reduction methods applied, since the distance measure is usually preserved quite well in the reduced space. For dimensionality reduction to be applicable in a GIM method, the result of mI(·) in the reduced space should be sufficiently similar to the result obtained in the original space, and aI(·) must lead to the same semantics in the reduced space as in the original space. Chapter 4 provides some discussion on the use of Singular Value Decomposition to reduce the space in the context of itemset mining.
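
One possible way to write these two requirements down is sketched below. The reduction map \phi, the reduced-space functions \tilde{m}_I and \tilde{a}_I, and the tolerance \varepsilon are notation introduced here purely for illustration. The sketch also assumes that mI(·) is real valued, and it reads "the same semantics" as approximate commutation of aI(·) with the reduction, which is only one interpretation.

```latex
% Hypothetical notation: \phi maps the original space X to a smaller space \tilde{X}.
\phi : X \to \tilde{X}, \qquad \dim \tilde{X} < \dim X

% Requirement on the measure: results in the reduced space stay close
% to the results in the original space, for every interaction vector x.
\bigl|\, \tilde{m}_I\bigl(\phi(x)\bigr) - m_I(x) \,\bigr| \le \varepsilon

% Requirement on the aggregation (one reading of "same semantics"):
% reducing then aggregating agrees with aggregating then reducing.
\tilde{a}_I\bigl(\phi(x), \phi(y)\bigr) \approx \phi\bigl(a_I(x, y)\bigr)
```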
3.15 Vector Representations and Subspace Projections
This section briefly discusses the importance of different vector representations in GIM and the ability for them to exploit subspace projections. Examples in this chapter have had real valued vectors (for example in sections 3.6 and 3.10), integer valued vectors (for example in sections 3.8 and 3.7) and of course binary valued vectors (for example, any counting approach and some graph based approaches). Since the GIM framework operates using only functions on vectors, the way these vectors are implemented has a significant impact on the run time of these operations and hence the algorithm. Different instantiations of functions also favour different vector implementations. While it is beyond the scope of this section to give a detailed analysis, a few examples will be shown to demonstrate the issue and provide practical advice.
The most basic task in interaction mining is often determining in which samples an interaction exists. Measures on the interaction typically require only the values corresponding to these samples. First, suppose that we are interested only in the presence or absence and not the values. Then interaction vectors are (logically) sets containing those sample ids that contain the interaction. There are multiple ways to implement this. One method is to use a standard set implementation for xV0.
This has the advantage that aI(xV0, xv) operates in O(max{|xV0|, |xv|}) time (set intersection) and mI(xV0) operates in O(|xV0|) time, where |xV0| is the set size and is typically much less than the number of samples, n. A disadvantage is that such sets have considerable computational and space overhead. A better alternative is to use bit-vectors, where a bit xV0[i] is set if the ith sample contains the interaction V0. This has the advantage that aI(·) is the bit-wise AND operation, which is a machine level operation and can thus be performed quickly. Also, the space usage per sample is low. Despite fast execution times and low space per sample, the run-time of aI(·) and mI(·) is typically O(n) and space usage is O(n), regardless of the logical size of the set. Another efficient method, particularly when the data is sparse, is to use an integer array storing the indexes of those samples containing V0. If this array is sorted by sample id, then intersection can be implemented very quickly and thus aI(xV0, xv) operates in O(max{|xV0|, |xv|}) time (while still maintaining the order). mI(xV0) can be implemented to operate in the same time and the space usage depends on the number of non-zero values: O(|xV0|).
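
The three representations can be compared side by side in a minimal Python sketch. It assumes a counting measure (mI(·) is simply the number of samples containing the interaction) and uses an arbitrary-precision Python integer as a stand-in for a bit-vector. All function names are hypothetical and chosen for illustration; this is not the implementation used in this thesis.

```python
from typing import Iterable, List, Set


def intersect_sets(a: Set[int], b: Set[int]) -> Set[int]:
    """Set representation: aI is ordinary set intersection."""
    return a & b


def to_bitvector(sample_ids: Iterable[int], n: int) -> int:
    """Bit-vector representation: bit i is set if sample i contains the interaction."""
    bits = 0
    for i in sample_ids:
        if not 0 <= i < n:
            raise ValueError(f"sample id {i} is outside 0..{n - 1}")
        bits |= 1 << i
    return bits


def intersect_bitvectors(a: int, b: int) -> int:
    """Bit-vector representation: aI is a single bit-wise AND."""
    return a & b


def count_bits(bits: int) -> int:
    """Counting measure on a bit-vector: the number of set bits."""
    return bin(bits).count("1")


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Sorted-array representation: aI is a linear merge that keeps the order."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Illustrative data: n = 10 samples, two interaction vectors.
n = 10
ids_a = [1, 3, 4, 8]
ids_b = [0, 3, 8, 9]

as_set = intersect_sets(set(ids_a), set(ids_b))
as_bits = intersect_bitvectors(to_bitvector(ids_a, n), to_bitvector(ids_b, n))
as_array = intersect_sorted(ids_a, ids_b)
```

All three calls describe the same logical result, the samples 3 and 8. The difference lies only in how much work and space each representation needs: the merge touches each array element at most once, whereas the bit-vector always processes all n bits regardless of how many are set.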
As a rule of thumb, the bit-vector approach is much faster than a set implementation. Similarly, using one bit per sample per variable uses less space in practice than a set-based implementation. The sparse method using a sorted integer array is sometimes faster than the bit-vector approach, and sometimes slower. This seems to be determined primarily by the platform used (processor and OS) and less by the density of the vectors. The majority of the work in this thesis uses bit-vectors, as the sparse method was investigated only toward the end of the PhD.
The sparse vector method is easily extended to problems where the value of a variable or interaction is required too. For example, the problems presented in sections 3.8 and 3.7 can be expected to have sparse integer vectors, and mining expected or probabilistic frequent itemsets involves sparse real valued vectors. Rather than using arrays of length n, it is only necessary to store the (index, value) pairs of the non-zero elements. This reduces the space required and improves the computation time of mI(xV0) to O(|xV0|), where |xV0| is the number of non-zero values in the interaction vector. aI(xV0, xv) operates in O(max{|xV0|, |xv|}) time. Experiments have shown that this sparse approach not only uses much less space than a full array implementation, but is also significantly faster (for example, experiments in chapter 13 demonstrate this).
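
A minimal sketch of the (index, value) representation follows. It assumes that aI(·) combines the values of samples present in both vectors with a product and that mI(·) sums the stored values, which corresponds to the usual expected support of an itemset over uncertain data. The names `SparseVector`, `combine_sparse` and `measure_sum`, and the choice of product and sum, are illustrative assumptions rather than the functions used in this thesis.

```python
import operator
from typing import Callable, List, Tuple

# A sparse vector is a list of (sample index, value) pairs sorted by index,
# holding only the non-zero elements.
SparseVector = List[Tuple[int, float]]


def combine_sparse(
    a: SparseVector,
    b: SparseVector,
    op: Callable[[float, float], float] = operator.mul,
) -> SparseVector:
    """Merge two sorted sparse vectors, keeping indexes present in both.

    Each element of a and b is visited at most once, so the cost is
    linear in the number of stored (non-zero) elements.
    """
    out: SparseVector = []
    i = j = 0
    while i < len(a) and j < len(b):
        index_a, value_a = a[i]
        index_b, value_b = b[j]
        if index_a == index_b:
            value = op(value_a, value_b)
            if value != 0:  # keep the result sparse
                out.append((index_a, value))
            i += 1
            j += 1
        elif index_a < index_b:
            i += 1
        else:
            j += 1
    return out


def measure_sum(x: SparseVector) -> float:
    """Sum of the stored values; touches only the non-zero elements."""
    return sum(value for _, value in x)


# Illustrative data: existence probabilities of two items per sample.
x_a: SparseVector = [(0, 0.5), (2, 1.0), (5, 0.75)]
x_b: SparseVector = [(2, 0.25), (3, 0.5), (5, 0.5)]

x_ab = combine_sparse(x_a, x_b)
expected_support = measure_sum(x_ab)
```

Here only samples 2 and 5 hold both items, so the combined vector stores two pairs and the measure never visits the remaining samples.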
3.15.1 Subspaces, Projections and Geometric Interaction Mining
Geometrically, the sparse vector method records only those dimensions (samples) spanning the subspace of X where the current interaction V0 has a presence. Let