The general steps of the activity recognition process are related to the general KDD process (cf. Chapter 1). Contributions as well as application-oriented publications that use particular techniques are summarized in the following part. A more detailed survey of this processing chain is given in [17].
Data Preprocessing
Time series recorded by accelerometers often contain high-frequency noise, which in many cases distorts the actual signal. Thus, sliding-window-based average [127] or median filters [118] are applied in order to remove outliers. Furthermore, removing the effect of the gravitational force helps to distinguish activity from non-activity phases. This is generally achieved by applying a low-pass filter, as shown in [13, 118].
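As an illustration of the two preprocessing steps described above, the following minimal Python sketch applies a sliding-window median filter and removes a slowly varying gravity component with a first-order low-pass filter. The function names, the window size and the smoothing factor alpha are assumptions chosen for this example; they are not taken from the cited works.

```python
from statistics import median


def median_filter(signal, window=3):
    """Replace each sample by the median of a centered sliding window."""
    half = window // 2
    filtered = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        filtered.append(median(signal[lo:hi]))
    return filtered


def remove_gravity(signal, alpha=0.9):
    """Estimate gravity with a first-order low-pass filter and subtract it.

    alpha close to 1 makes the gravity estimate follow the signal slowly
    (assumed value, to be tuned per sensor and sampling rate).
    """
    gravity = signal[0]
    body = []
    for x in signal:
        gravity = alpha * gravity + (1.0 - alpha) * x
        body.append(x - gravity)
    return body
```

A single spike such as the 9 in [1, 1, 9, 1, 1] is suppressed by the median filter, and a constant signal (a sensor at rest) is mapped to values close to zero by remove_gravity.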
Segmentation
In order to separate periodic parts from nonperiodic parts, time series are divided into subsequent segments. The literature offers different techniques for segmenting time series. Sliding-window-based methods [131, 174, 201, 202] are suitable for online processing and provide pattern matching algorithms starting with a reference sample that is extended until the matching distance exceeds a particular threshold. Top-down approaches in the context of time series processing [148, 153, 188] recursively split a time series into two subsequences w.r.t. an approximation error threshold. Complementary approaches [123, 124, 125] work in a bottom-up manner, starting with n/2 segments of size 2 (where n denotes the number of observations) and combining adjacent subsequences until an upper cost bound is reached.
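The bottom-up strategy can be sketched as follows. This is a minimal Python illustration, assuming that the merge cost is the sum of squared residuals of a least-squares line fitted to the merged segment; the function names and the cost function are assumptions for this example and do not reproduce a particular published implementation.

```python
def fit_error(values):
    """Sum of squared residuals of a least-squares line through the values."""
    n = len(values)
    if n < 3:
        return 0.0  # one or two points are always fitted exactly
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in enumerate(values))


def bottom_up(series, max_error):
    """Return segments as (start, end) index pairs with exclusive end."""
    # Finest segmentation: n/2 segments of size 2
    bounds = [(i, min(i + 2, len(series))) for i in range(0, len(series), 2)]
    while len(bounds) > 1:
        # Cost of merging each pair of adjacent segments
        costs = [fit_error(series[bounds[i][0]:bounds[i + 1][1]])
                 for i in range(len(bounds) - 1)]
        best = min(range(len(costs)), key=costs.__getitem__)
        if costs[best] > max_error:
            break  # the cheapest merge would exceed the cost bound
        bounds[best:best + 2] = [(bounds[best][0], bounds[best + 1][1])]
    return bounds
```

For a series that rises linearly and then falls linearly, the procedure merges the small segments on each slope and stops at the turning point, because merging across it would exceed the error bound.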
In [122], time series segmentation is used to obtain a piecewise linear representation of time series, such as PLA [167] or PLR [122]. The authors propose the SWAB framework, which combines the advantages of the sliding-window approach, which is most efficient, and the bottom-up approach, which provides the best segmentation quality. Nevertheless, sliding-window-based methods are still most frequently used in the area of activity recognition [17], as they are suitable for online processing. Selected coefficients obtained by dimensionality reduction methods and further characteristics are then obtained by feature extraction. Thus, the segmentation method used in Chapter 6 is based on a sliding-window algorithm. The obtained segments will turn out to provide a good separation of the activity recordings.
Feature Extraction
Periodic and nonperiodic segments of time series are commonly described by a combination of features of different types.
• Time-domain features, such as mean, variance and standard deviation [146, 171, 205] or the Root Mean Square (RMS) [100, 164] are directly derived from the time series. Further prominent examples of this feature type are the average time between the peaks [146] and the number and average value of the peaks [205].
• Many publications apply well-known dimensionality reduction techniques by transforming the time series into the frequency domain (see also Subsection 4.1.1). Frequency-domain features can be derived by the DFT (or FFT, Fast Fourier Transform [62]) and are used in [20, 193]. Features like aggregated FFT coefficients or the entropy of the frequency domain [20], which help to distinguish activities with similar energy values (e.g., running and cycling), or single FFT coefficients [133] are also used in the existing literature.
• A combination of the time and frequency domains is given by wavelet features, which are derived from the DWT and are used in the context of gait classification [169].
• Heuristic features cannot be directly derived from the time series, but require mathematical and statistical methods to be extracted from the three dimensions of accelerometer time series data simultaneously. A prominent example here is the Signal Magnitude Area (SMA), which is defined as the sum of the absolute values of all axes within the current time window and which is used in several works [13, 58, 118, 127, 212]. A further feature of this class is the Inter-Axis Correlation, which is a suitable measure to distinguish between movements measured at different body parts [20]. However, the authors of [171] showed that this feature performs worse than simple features like mean and standard deviation. Further heuristic features will be presented in Chapter 6.
Choosing an adequate combination of features is an important task, since the classification of the time series depends heavily on a good representation.
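Several of the features listed above can be computed in a few lines. The following Python sketch covers the mean, the standard deviation, the RMS, the SMA (as the plain sum of absolute values over all three axes, following the definition given above; some works additionally normalize by the window length), the inter-axis correlation and the magnitudes of the DFT coefficients. All function names are assumptions for this illustration, and the DFT is computed naively rather than by an FFT.

```python
import cmath
import math


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def rms(xs):
    """Root Mean Square of one axis within the window."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def sma(ax, ay, az):
    """Signal Magnitude Area: sum of absolute values of all three axes."""
    return sum(abs(x) + abs(y) + abs(z) for x, y, z in zip(ax, ay, az))


def correlation(a, b):
    """Pearson correlation between two axes (0.0 if one axis is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def dft_magnitudes(xs):
    """Magnitudes of the DFT coefficients 0 .. n/2 (naive O(n^2) version)."""
    n = len(xs)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(xs)))
            for k in range(n // 2 + 1)]
```

For one window with axes ax, ay and az, a feature vector could then be assembled as [mean(ax), std(ax), rms(ax), sma(ax, ay, az), correlation(ax, ay)] together with selected entries of dft_magnitudes(ax).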
Feature Vector Dimensionality Reduction
In order to reduce the computational effort of the classification process, dimensionality reduction is typically applied to remove redundant information, which decreases the size of the feature vectors. In the context of accelerometer data, the literature distinguishes between methods of feature selection and feature transformation, which can also be used in combination.
• Feature selection methods include, for example, methods based on Support Vector Machines (SVMs [132], e.g., applied in [203]), or the forward-backward search technique [218] (e.g., used in [171] and also in Chapter 6).
• Feature transformation methods further support the separation of different classes. Commonly applied techniques here are the Principal Component Analysis (PCA [170], e.g., used in [212, 216]), the Independent Component Analysis (ICA [81], e.g., used in [166]) or the Linear Discriminant Analysis (LDA [95], e.g., used in [100, 127] and also in Chapter 6).
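To make the idea of feature transformation concrete, the following Python sketch projects feature vectors onto their first principal component. It is a simplified illustration under stated assumptions: the component is found by power iteration on the covariance matrix with a fixed start vector, whereas practical PCA implementations usually rely on an eigenvalue or singular value decomposition. All names are chosen for this example.

```python
import math


def center(rows):
    """Subtract the per-dimension mean from every feature vector."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already centered feature vectors."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=100):
    """Dominant eigenvector of cov by power iteration (assumed start vector)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate case: no variance in the data
        v = [x / norm for x in w]
    return v


def project(rows, component):
    """One-dimensional representation of each (centered) feature vector."""
    return [sum(x * c for x, c in zip(r, component)) for r in center(rows)]
```

For feature vectors that lie on a line, such as [1, 2], [2, 4], [3, 6] and [4, 8], the first component points along that line, so a single coordinate per vector preserves all of the variance.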
Classification
Both the effectiveness and the efficiency vary with the selection of the classifier. When similarity-based classifiers are used, the similarity queries performed within the classification process can be further accelerated via indexing techniques. The latter issue will be addressed in Chapters 7 and 8 for the case of k-nearest neighbor (k-NN) queries, which are performed in the context of k-NN classification.
Several publications in the context of activity recognition apply supervised classification methods based on pattern recognition and training phases, e.g., decision trees [20, 106, 118], Hidden Markov Models [205], Gaussian Mixture Models [13], k-NN classifiers [106, 171], Naïve Bayes classifiers [106], Support Vector Machines (SVMs) [132], or Neural Networks [127, 146]. Chapter 6 will propose an additional step that improves the classification result.
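As an example of a similarity-based classifier, the following Python sketch implements k-NN classification by a sequential scan over the training set with majority voting. The Euclidean distance, the function name and the data layout (pairs of feature vector and label) are assumptions for this illustration; the index-based acceleration addressed in Chapters 7 and 8 is not part of the sketch.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label the query by majority vote among its k nearest training vectors.

    train is a list of (feature_vector, label) pairs.
    """
    # Sequential scan: rank all training vectors by distance to the query
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

The sorting step corresponds to a k-NN query over the feature vectors, which is the part that an index structure would replace.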
4.2 Indexing in High-Dimensional Feature Spaces
4.2.1 Full-Dimensional Indexing
The contributions of research on full-dimensional index structures are abundant [183]. Established index structures, such as [23, 28, 101, 119], are designed and optimized for the complete data space where all attributes are relevant for data partitioning and clustering or for simply satisfying a query predicate. With increasing dimensionality, however, index structures degrade rapidly due to the curse of dimensionality [24].
One solution is provided by commonly applied methods that enhance the sequential scan, for example the VA-file [207]. Other approaches use a hybrid structure, which is tree-based but requires scanning successive blocks or node elements [27, 28].
A third solution to the problem of indexing high-dimensional data, called BOND, is given in [85]; it is also a search strategy that enhances the sequential scan. Contrary to the aforementioned techniques, BOND exploits modifications w.r.t. the physical database design. The basic idea is to use a column-store architecture (as known from NoSQL database systems), sort the columns according to their potential impact on distances and prune columns if their impact becomes too small to change the query result. However, BOND depends on particular assumptions that restrict the applicability of the approach. Chapter 7 [40] will introduce a solution that overcomes most of these restrictions.
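The column-wise processing idea can be illustrated by the following Python sketch. It is not the BOND algorithm of [85], but a simplified variant under explicit assumptions: all values are normalized to [0, 1], the squared Euclidean distance is used, columns are ordered by the largest difference the query can still incur in a dimension, and objects are pruned as soon as their partial distance exceeds the k-th smallest upper bound. All names are chosen for this example.

```python
def columnwise_knn(columns, query, k):
    """k-NN over column-wise stored data; columns[j][i] is object i in dim j.

    Assumes all values and the query lie in [0, 1].
    """
    n = len(columns[0])
    # Largest squared difference any object can contribute in dimension j
    max_contrib = [max(q, 1.0 - q) ** 2 for q in query]
    # Process the dimensions with the largest potential impact first
    order = sorted(range(len(columns)), key=lambda j: max_contrib[j],
                   reverse=True)
    partial = {i: 0.0 for i in range(n)}  # candidate -> partial distance
    for pos, j in enumerate(order):
        col, q = columns[j], query[j]
        for i in partial:
            partial[i] += (col[i] - q) ** 2
        # Maximum contribution of the columns not yet processed
        remaining = sum(max_contrib[d] for d in order[pos + 1:])
        if len(partial) > k:
            # k-th smallest upper bound on the full distance
            threshold = sorted(partial.values())[k - 1] + remaining
            # Partial distances are lower bounds: prune hopeless candidates
            partial = {i: d for i, d in partial.items() if d <= threshold}
    return sorted(partial, key=partial.get)[:k]
```

Since the partial distance of an object never decreases, an object whose partial distance already exceeds the k-th smallest upper bound cannot enter the result, so the remaining columns need not be read for it.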
4.2.2 Indexing Approaches for Subspace Queries
The first approach that explicitly addresses the problem of subspace similarity search is the Partial VA-file proposed in [136]. There, the authors propose an adaptation of the VA-file [207] to the problem of subspace similarity search. The basic idea of this approach is to split the original VA-file into one partial VA-file for each dimension, containing the approximation of the original full-dimensional VA-file in that dimension. Based on the information of the partial VA-files, upper and lower bounds of the true distance between data objects and the query are derived. Subspace similarity queries are processed by scanning only the relevant files in the order of relevance, i.e., the files are ranked by the selectivity of the query in the corresponding dimension. This processing is similar to [85], which will be reviewed in the full-dimensional case in Chapter 7, and which implicitly addresses the subspace problem by its physical design via weighted search. A third approach to the problem is proposed in [156], although only ε-similarity range queries are supported. The idea of this multipivot-based method is to derive lower and upper bounds for distances based on the average minimum and maximum impact of a possible range of the subspace dimensions; these bounds are computed in a preprocessing step for a couple of pivot points. All these approaches are variations of the sequential scan. In contrast, Chapter 8 will present two index-based solutions that accelerate similarity search in arbitrary subspaces.
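How distance bounds follow from per-dimension approximations can be sketched as follows. The Python example assumes values in [0, 1], a uniform grid with 2^bits cells per dimension and the Euclidean distance restricted to the queried dimensions; the function names and the grid layout are assumptions for this illustration and do not reproduce the Partial VA-file of [136] in detail.

```python
import math


def quantize(value, bits=2):
    """Cell index of a value in [0, 1] on a uniform grid with 2**bits cells."""
    cells = 1 << bits
    return min(int(value * cells), cells - 1)


def dimension_bounds(cell, q, bits=2):
    """Lower and upper bound of |x - q| for any value x inside the cell."""
    cells = 1 << bits
    lo_edge, hi_edge = cell / cells, (cell + 1) / cells
    if q < lo_edge:
        lower = lo_edge - q
    elif q > hi_edge:
        lower = q - hi_edge
    else:
        lower = 0.0  # the query falls into the cell itself
    upper = max(abs(q - lo_edge), abs(q - hi_edge))
    return lower, upper


def subspace_bounds(approx, query, subspace, bits=2):
    """Distance bounds using only the approximations of the subspace dims.

    approx[j] is the cell index of the object in dimension j, as it would be
    stored in the partial file of that dimension.
    """
    lower_sq = upper_sq = 0.0
    for j in subspace:
        lo, up = dimension_bounds(approx[j], query[j], bits)
        lower_sq += lo * lo
        upper_sq += up * up
    return math.sqrt(lower_sq), math.sqrt(upper_sq)
```

Only the approximations of the dimensions in the query subspace are read; objects whose lower bound exceeds the best upper bounds found so far can be discarded without accessing their exact values.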
Chapter 5
Knowing: A Generic Time Series Analysis Framework
5.1 Motivation
Tool support for the data mining process has been, and still is, a very important step in the history of data mining. With the support of several tools like ELKI [1], MOA [56], WEKA [102], RapidMiner [165] or R [175], scientists are nowadays able to apply a variety of well-known and established algorithms to their data for quick comparison and evaluation. Although all of these frameworks perform data mining at their core, they address different target groups.
WEKA and MOA provide both algorithms and GUIs. By using these GUIs, the user can analyze datasets, configure and test algorithms and visualize the outcome of the respective algorithm for evaluation purposes without any programming. As the GUI cannot cover all complex scenarios, the user can still use the respective APIs to build more complex scenarios in his or her own code.
RapidMiner integrates WEKA and provides powerful functionality for analysis and reporting that is not covered by the WEKA GUI itself. RapidMiner provides an improved GUI and also defines an API for user extensions. Both RapidMiner and WEKA provide some support for external databases.
The aim of ELKI is to provide an extensible framework for different algorithms in the fields of clustering, outlier detection and indexing with the main focus on the comparability of algorithm performance. Therefore, single algorithms are not extensively tuned for performance, but tuning is done on the application level for all algorithms and index structures. Like the other frameworks, ELKI also provides a GUI, so that programming is not needed for the most basic tasks. ELKI also provides an API that supports the integration of user-specified algorithms and index structures.
All the above frameworks support quick testing, evaluation and reporting and define APIs of different depths. Thus, scientists can incorporate new algorithms into the systems. R provides a rich toolbox for data analysis. Also, there are many plug-ins which extend the functionality of R.
In cases where the requirements enforce a rapid development from data mining to a representative prototype, these unstandardized plug-in systems can cause a significant delay due to the time needed to incorporate the algorithms. Each implementation of an algorithm is specifically adapted to the respective framework without being interchangeable.
With the use of a standardized plug-in system like OSGi¹, Java Plug-in Framework (JPF) or Java Simple Plug-in Framework (JSPF), an implementation of an algorithm does not have to be specifically adapted to the respective framework. This chapter will introduce Knowing (Knowledge Engineering) [41], a framework that addresses this shortcoming by bridging the gap between the data mining process and rapid prototype development. This is achieved by using a standardized plug-in system based on OSGi, so that algorithms can be packed in OSGi resource bundles. This makes it possible both to create new algorithms and to integrate and exchange existing algorithms from common data mining frameworks. The advantage of these OSGi-compliant bundles is that they are not restricted to use in Knowing, but can be used in any OSGi-compliant architecture.
The data mining tool Knowing includes the following contributions:
• a simple, yet powerful graphical user interface (GUI),
• a bundled embedded database as data storage,
• an extensible data mining functionality,
• extension support for algorithms addressing different use cases, and
• a generic visualization of the results of the data mining process.
Details of the architecture of Knowing will be given in Section 5.2. The application scenario, described in Section 5.3, presents the medical monitoring system MedMon [186], which itself extends Knowing. During development, it is easily possible to switch between the scientific data mining view and the views which will be presented to the end users later on. As MedMon is intended to be used by different target groups in the medical area (physicians and patients), it is desirable to use a single base system for all views and only deploy different user interface bundles for each target group. This way, the data mining process can be integrated seamlessly into the development process, and long-term maintenance is reduced to a minimum, as only a single system with different interface bundles has to be kept up to date and synchronized instead of a special data mining tool, a physician tool and a patient tool.
¹ OSGi Alliance:
5.2 Architecture
5.2.1 Modularity
When a standardized plug-in system like OSGi is applied, the bundles can be used in any OSGi-compliant architecture like the Eclipse Rich Client Platform (RCP)² or the NetBeans RCP³. Then, the integration of existing algorithms can simply be done by wrapping and packing them into a separate bundle. Such bundles are then registered as independent service providers to the framework. In either case, algorithms are wrapped into Data Processing Units (DPUs) which can be integrated and configured via pluggable RCP-based GUI controls. Thus, the user is able to perform an arbitrary number of steps to pre- and postprocess the data. Furthermore, the DPUs contained in the system can be used in any other OSGi-compliant architecture. As dependencies between resource bundles have to be modeled explicitly, it is much easier to extract particular bundles from the system. This loose coupling is an advantage not only in cases where algorithms should be ported between completely different systems, but also if the GUI should be changed from a data mining view to a prototype view for the productive system. This can be done by either using the resource bundles containing the DPUs, or by directly extending Knowing itself.
In the current implementation, the Knowing framework is based on the established and well-known Eclipse RCP system and uses the standardized OSGi architecture⁴, which allows the composition of different bundles. This has the advantage that data miners and developers can take two different ways towards their individual goal: if they start a brand new RCP-based application, they can use Knowing out of the box and create the application directly on top of Knowing. The more common case might be that an RCP- or OSGi-based application already exists and should only be extended with data mining functionality. In this case, only the appropriate bundles are taken from Knowing and integrated into the application.
The following part describes the architecture of the Knowing framework, which consists of a classical three-tier architecture comprising data storage tier, data mining tier and GUI tier, where each tier can be integrated or exchanged using a modular concept.
5.2.2 Data Storage
The data storage tier of Knowing provides the functionality and abstraction layers to access, import, convert and persist the source data. The data import is accomplished by an import wizard using service providers, so that importing data is not restricted to a particular format.
In the example of the MedMon application, a service provider is registered that reads binary data from a three-dimensional accelerometer [198] which is connected via USB. The data storage currently defaults to an embedded Apache Derby database⁵, which is accessed through the standardized Java Persistence API (JPA, via EclipseLink). This has the advantage that the amount of data being read is not limited by the main memory of the workstation used and that users do not have to set up a separate database server on their own. However, the JPA also makes it possible to use more than 20 established and well-known database systems which are supported by this API⁶. An important feature of the data storage tier is the possibility to use existing data to support the evaluation of newly recorded data, e.g., to apply particular parts of the data as training sets or reference results.
Figure 5.1: The process chain editor of the Knowing user interface.
² Eclipse RCP: http://www.eclipse.org/platform/
³ NetBeans RCP: http://netbeans.org/features/platform/
⁴ Eclipse Equinox: http://www.eclipse.org/equinox/
5.2.3 Data Mining
This tier includes all components needed for data mining and data analysis. OSGi bundles containing implemented algorithms are available fully transparently to the system after the bundle is registered as a service provider.
Algorithms are either implemented directly or wrapped in DPUs. Following the design of WEKA, DPUs represent filters, algorithms or classifiers. One or more DPUs can be bundled into an OSGi resource bundle which is registered into the program and, thus, made available in the framework. Bundling algorithms enforces a pluggable and modular architecture, so that new algorithms can be integrated and removed quickly without the need for extensive dependency checks. The separation into bundles also allows visibility borders between bundles, so that separate bundles remain independent and, thus, the system remains maintainable. The modularity also makes it possible to concatenate different algorithms into processing chains so that algorithms can act both as sources and targets of processed entities (cf. Figure 5.1). Raw data, for example, could first pass one or more filtering components before being processed by a classification or clustering component.
⁵ Apache Derby: http://db.apache.org/derby/
⁶ List of supported databases:
The creation of a processing chain (a.k.a. model) of different, concatenated algorithms and data-conditioning filters is supported by GUI controls, so that different parameters or concatenations can be tested easily. After a model has proved to fit the needs of a use case, the model can be bundled and later be bound to other views of the GUI, so that the cost for porting, adapting and integration is reduced to binding components and models together. Hence, porting and adapting algorithms and other components from different APIs is not needed.
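The concept of a processing chain can be sketched in a few lines. The following Python example is purely illustrative: Knowing itself is implemented in Java and Scala on top of OSGi, and the class ProcessingChain as well as the two example units are hypothetical names that do not correspond to the Knowing API. Each unit consumes the output of its predecessor, so that a filter acts as the source of the subsequent classifier.

```python
class ProcessingChain:
    """Hypothetical chain (model) of concatenated processing units."""

    def __init__(self, *units):
        self.units = list(units)

    def run(self, data):
        # Each unit is the target of the previous one and the source
        # of the next one
        for unit in self.units:
            data = unit(data)
        return data


def moving_average(window):
    """Filtering unit: trailing moving average over the given window."""
    def unit(samples):
        return [sum(samples[max(0, i - window + 1):i + 1])
                / (i - max(0, i - window + 1) + 1)
                for i in range(len(samples))]
    return unit


def threshold_classifier(limit):
    """Classification unit: label each sample by its magnitude."""
    def unit(samples):
        return ["active" if abs(s) > limit else "idle" for s in samples]
    return unit


chain = ProcessingChain(moving_average(2), threshold_classifier(0.5))
labels = chain.run([0, 0, 2, 2])
```

Exchanging a unit, for example replacing the moving average by a median filter, leaves the rest of the chain untouched, which mirrors the loose coupling between bundles described above.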
This architecture makes it possible to integrate algorithms from other sources like [1, 56, 102, 175], so that existing knowledge can be reused without having to reimplement algorithms from scratch. It also makes it possible to quickly replace components by different implementations if performance or licensing issues require it.
In the data mining part of the application, Knowing supports not only plain Java but also relies on the use of the Scala programming language⁷. Scala is a functional and object-oriented programming language which runs on the Java Virtual Machine, so that it integrates seamlessly into Knowing. The advantage of Scala in this part of the application lies in the simple possibility of writing functional code shorter than in regular