we need for a process producing explanations, and work in Knowledge Discovery to analyse how to generate them automatically. We then focus on the way to exploit the available knowledge in the Web of Data, by trying to understand which Semantic Web technologies and which Graph Theory algorithms and models can be integrated in our process.
The design of the prototype aiming at validating our research hypothesis is then realised in an iterative way, following the idea of [Denning, 1981] that experimentation can be used in a feedback loop to improve the system when results do not match expectations. In the following chapters (Chapters 3, 4, 5 and 6) we approach each of our sub-questions, propose a solution and use the experimental evaluation to assess whether the system is producing the expected results, which in our case consist of meaningful explanations for a pattern generated from the Web of Data.
The final part of our methodology is then to assess the validity of our system in an empirical user study designed around a large real-world scenario. This is presented in Chapter 7. We evaluate the system by comparing the explanations generated automatically from the Web of Data with the ones provided by the set of users who took part in the study.
1.5 Approach and Contributions
The resulting system, which we called Dedalo, is an automatic framework that uses background knowledge automatically extracted from the Web of Data to derive explanations for data that have been grouped into patterns according to some criteria.
Below we present the process in detail: some real-world applications of our approach, an overall picture of the process, and the contributions that our approach brings.
1.5.1 Applicability
The approach that we propose could benefit many real-world domains, in particular those where background knowledge plays a central role in the analysis of trends or common behaviours. For instance, Dedalo could be exploited for business purposes such as decision making or predictive analytics, by providing experts with Linked Data information that they might otherwise lack when explaining the regularities that emerge from raw data collected with data analytics methods. A practical application is the Google Trends scenario used throughout this thesis: Linked Data could help to explain the increased interest of users in some topics, which could in turn be used to improve user experience and profiling.
A second application is in educational fields such as Learning Analytics, where Dedalo could help to accelerate the analysis of learners’ behaviours. This would allow universities to improve the way they assist people’s learning, teachers to better support their students or improve their courses, and staff to plan and make their decisions.
Finally, Dedalo could be applied in medical contexts, by helping experts to explain patterns and anomalies that require some external knowledge, e.g. the environmental changes affecting the spread of diseases.
Of course, these are only a few examples of the way the explanation of patterns through background knowledge from Linked Data can be useful.
1.5.2 Dedalo at a Glance
Figure 1.1 presents an overview of Dedalo, with indications about the chapters of the thesis describing each part. As one can see, every step requires the integration of knowledge from the Web of Data, which is the core aspect within our process.
Figure 1.1 Overview of Dedalo according to the thesis narrative.
Hypothesis Generation. Assuming a pattern (any data grouped according to some criteria: clusters, association rules, sequence patterns and so on), the first step is to search the Linked Data space for information about the data contained in the pattern, and then to generate some correlated facts (alternatively called anterior events, hypotheses or candidate explanations throughout this work), which might be plausible explanations for it.
We combine here several techniques from Machine Learning, Graph Theory, Linked Data and Information Theory to iteratively explore portions of Linked Data on-the-fly, so that only the part of information needed for the explanation is collected. By doing so, we avoid inconveniences such as dataset indexing or crawling, while holding to our resolution not to introduce any a priori knowledge into the process.
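As a minimal sketch of this exploration, assume a toy in-memory graph in place of the Web of Data. The `TOY_GRAPH` data, the `dereference` helper and the `generate_hypotheses` function are hypothetical names introduced only for illustration, and the plain breadth-first walk stands in for the heuristic strategies that the thesis develops in later chapters. The sketch walks outwards from each item of a pattern and groups the items by the (property path, end value) pairs they reach; each such pair is a candidate explanation.

```python
from collections import defaultdict, deque

# Toy stand-in for the Web of Data (invented data): each "URI" maps to the
# (property, value) pairs that dereferencing it would hypothetically return.
TOY_GRAPH = {
    "ex:item1": [("ex:genre", "ex:Fantasy"), ("ex:author", "ex:AuthorA")],
    "ex:item2": [("ex:genre", "ex:Fantasy"), ("ex:author", "ex:AuthorB")],
    "ex:item3": [("ex:genre", "ex:Crime"), ("ex:author", "ex:AuthorB")],
    "ex:AuthorA": [("ex:bornIn", "ex:CountryX")],
    "ex:AuthorB": [("ex:bornIn", "ex:CountryX")],
}


def dereference(uri):
    """Hypothetical lookup: return the (property, value) pairs describing uri."""
    return TOY_GRAPH.get(uri, [])


def generate_hypotheses(pattern_items, max_depth=2):
    """Map each (property path, end value) pair to the items that reach it."""
    hypotheses = defaultdict(set)
    for item in pattern_items:
        frontier = deque([(item, ())])
        seen = {item}
        while frontier:
            node, path = frontier.popleft()
            if len(path) == max_depth:
                continue  # do not expand beyond the chosen depth
            # Only the nodes actually reached are looked up: no crawling.
            for prop, value in dereference(node):
                new_path = path + (prop,)
                hypotheses[(new_path, value)].add(item)
                if value not in seen:
                    seen.add(value)
                    frontier.append((value, new_path))
    return hypotheses


if __name__ == "__main__":
    found = generate_hypotheses(["ex:item1", "ex:item2", "ex:item3"])
    for (path, value), items in sorted(found.items()):
        print(" / ".join(path), "->", value, sorted(items))
```

In this toy data, the two-hop path `ex:author / ex:bornIn` ending in `ex:CountryX` is shared by all three items, whereas no single-hop fact covers them all, which is why following links beyond the first hop can matter.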
Hypothesis Evaluation. In this step, we evaluate the unranked facts so that we can assess which ones are valid and may represent the pattern.
We use Linked Data combined with techniques from Information Retrieval, Rule Mining and Cluster Analysis to define the interestingness criteria, giving the generated hypotheses a priority order. This step also includes the study of a Machine Learning model that predicts the likelihood of improving the quality of the hypotheses by combining several of them.
Hypothesis Validation. The final step consists in validating the ranked hypotheses so that they are turned into explanations that can be considered valuable knowledge.
The validation process exploits Linked Data and applies techniques from Graph Theory and Machine Learning to identify the relationship between a pattern and a hypothesis generated from Linked Data that is correlated to it.
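To make the idea of searching for a relationship between two entities concrete, the following sketch runs a uniform-cost search over a small weighted graph. This is an illustrative assumption rather than the method of the thesis: the `TOY_EDGES` data, the edge weights and the `cheapest_path` function are invented here, whereas the actual cost function is derived from the Linked Data structure and described in Chapter 6.

```python
import heapq

# Invented toy graph: node -> list of (property, target, cost) edges.
# The costs are hand-assigned for illustration only.
TOY_EDGES = {
    "ex:PatternEntity": [
        ("ex:relatedTo", "ex:Hub", 5.0),
        ("ex:partOf", "ex:Event", 1.0),
    ],
    "ex:Event": [("ex:about", "ex:Hypothesis", 1.0)],
    "ex:Hub": [("ex:relatedTo", "ex:Hypothesis", 5.0)],
}


def cheapest_path(start, goal, edges):
    """Uniform-cost search: return (total cost, list of triples) or None."""
    queue = [(0.0, start, [])]
    settled = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue  # already expanded through a cheaper or equal route
        settled[node] = cost
        for prop, target, weight in edges.get(node, []):
            step = (node, prop, target)
            heapq.heappush(queue, (cost + weight, target, path + [step]))
    return None


if __name__ == "__main__":
    result = cheapest_path("ex:PatternEntity", "ex:Hypothesis", TOY_EDGES)
    print(result)
```

With these weights the search prefers the route through `ex:Event` over the route through the generic `ex:Hub` node, which illustrates how a cost function can favour specific relationships over paths through highly connected but uninformative nodes.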
1.5.3 Contributions of the Thesis
This thesis aims at being a contribution to the process of automatically discovering knowledge and at reducing the gap between Knowledge Discovery and the Semantic Web communities. From the Semantic Web perspective, our main contribution is that we provide several solutions to efficiently manage the vastness of the Web of Data, and to easily detect the correct portion of information according to the needs of a situation. From a Knowledge Discovery perspective, we show how pattern interpretation in the KD process can be qualitatively (in time) and quantitatively (in completeness) improved thanks to the use of semantic technologies.
Specific contributions, further detailed in the corresponding chapters, are as follows:
• we present a survey on the definition of explanation from a Cognitive Science perspective, and we formalise it as a small ontology (Chapter 2);
• we show how the interconnected knowledge encoded within Linked Data can be used in
• we reveal how URI dereferencing can be combined with graph search strategies that access Linked Data on-the-fly and remove the need for wide data crawling (Chapter 4 and Chapter 6);
• we show that the Entropy measure [Shannon, 2001] is a promising function to drive a heuristic search in Linked Data (Chapter 4);
• we present some metrics and methodologies to improve and predict the accuracy of the explanations (Chapter 4 and Chapter 5);
• we detect which factors in the Linked Data structure reveal the strongest relationships between entities, and enclose them in a cost-function to drive a blind search to find entity relationships (Chapter 6);
• we present a methodology to evaluate the process of automatically generating explanations with respect to human experts (Chapter 7);
• we show how to identify the bias that is introduced in the results when dealing with incomplete Linked Data (Chapter 8).
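As a small illustration of the Entropy measure mentioned in the list above, the sketch below computes the Shannon entropy of the values that each property takes across a set of items, and orders the properties by that score. The `shannon_entropy` and `rank_properties` functions and the sample data are hypothetical. Sorting from highest to lowest entropy is an arbitrary choice made for this example: which direction the search should prefer, and how the measure is applied to paths in Linked Data, is addressed in Chapter 4.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_properties(values_by_property):
    """Order properties by the entropy of their observed values.

    Highest entropy first: an assumption made only for this example.
    """
    return sorted(
        values_by_property,
        key=lambda prop: shannon_entropy(values_by_property[prop]),
        reverse=True,
    )


if __name__ == "__main__":
    # Invented observations: the values seen for each property over three items.
    observed = {
        "ex:genre": ["ex:Fantasy", "ex:Fantasy", "ex:Crime"],
        "ex:type": ["ex:Book", "ex:Book", "ex:Book"],
        "ex:author": ["ex:AuthorA", "ex:AuthorB", "ex:AuthorC"],
    }
    for prop in rank_properties(observed):
        print(prop, round(shannon_entropy(observed[prop]), 3))
```

A property whose value is identical for every item has zero entropy, while a property with a different value for every item reaches the maximum, so the score separates uniform properties from discriminating ones without any prior knowledge of the vocabulary.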