8.2.2 LOS MÉDICOS EN LA OBRA DE HERGÉ Médicos que se mencionan
9- PIC42B2: Ensayo clínico sin consentimiento informado
Our experiments gave promising results towards the use and acceptance of probabilistic approximation for estimating the quality of linked open datasets. Figure7.2shows the time taken in all implemented metrics (actual and approximated) against the evaluated datasets. The graph clearly shows that all approximate metrics have a lower runtime than their equivalent actual metric. Whilst the approximate metrics for link external data providers, dereferenceability (approach two), and extensional conciseness computed for all datasets, the approximate clustering coefficient computed five datasets within a reasonable time. The actual link external data providers also computed all datasets within a reasonable time. The actual dereferenceability metric managed only to compute two datasets, while the rest computed up to the WordNet dataset. Table7.8shows the metric (actual and estimated) values for the datasets.
The approximate results are in most cases very close to the actual results. However, approximate measures are calculated in an acceptable time unlike their actual counterparts. As part of a larger effort to implement scalable LOD quality assessment metrics, we assessed the metrics identified in [160] and assigned to them possible approximation techniques discussed in chapter (cf. Table7.9).
invaluable results that can be the basis for further studies. These results show that with probabilistic approximation techniques:
1. Runtime decreases considerably – for larger datasets easily by more than an order of magnitude; 2. Loss of precision is acceptable in most cases with less than 10% deviation from actual values; 3. Large linked datasets can be assessed for quality even within very limited computational capabilit-
ies, such as a personal notebook.
7.4 Concluding Remarks
Ensuring a scalable and time acceptable assessment of linked big datasets is important for the ever- growing Web of Data. Data consumers are usually satisfied with near-to-accurate data quality values, rather than the considerably more time consuming process that provides the exact results. In this chapter, we have demonstrated how three probabilistic techniques sampling, Bloom Filters and clustering coefficient estimation can be successfully applied for Linked Data quality assessment. Furthermore, we extend our previous work [49] by investigating the stratified sampling technique.
Using both sampling techniques, our idea was to sample the assessed dataset in order to reduce time-consuming processes. More specifically, we applied these techniques to those metrics that require network operations to complete. Such operations can be expensive, mostly depending on the bandwidth and data source latency. With Bloom Filters we aimed at finding a time efficient solution to detect possible duplicate resources in the assessed dataset. Using bit vectors stored in memory, duplicates can be detected in streams of data. Finally, the clustering coefficient estimation was applied on a graph representation of
LAK 75K LSOA 270K S’OTON 1M WN 2M SWETO 15M S-XBRL 100M
Extensional Conciseness 0.9948 1 0.7375 0.948 0,000370 N/A
Approx. Extensional Conciseness 0.9945 1 0.6601 0.8447 0.9998 0.1097
Clustering Coefficiency 0.9610 1 0.9335 0.7592 N/A N/A
Approx. Clustering Coefficiency 0.9782 0.9999 0.9930 0.8104 1 0
Link External Data Providers 3 0 2 0 0 1
Approx. Link External Data Prov. 3 0 2 0 0 1
Dereferencibility 0.99485 0.744414 N/A N/A N/A N/A
Approx. Dereferencibility (A2) 0.99600 0.73 0.02597 0.916 0.0 0.034
Table 7.8: Metric value (actual and approximate) per dataset.
Category Dimension Metric Approximation Technique
Accessibility
Availability Dereferenceability of the URI Sampling Dereferenced Forward-Links Sampling Interlinking Detection of Good Quality Interlinks Random Walk
Dereferenced Back-Links Sampling
Performance Usage of Slash-URIs Sampling
Intrinsic
Syntactic Validity Syntactically Accurate Values Sampling Conciseness
High Extensional Conciseness Bloom Filters High Intensional Conciseness Bloom Filters
Duplicate Instance Bloom Filters
Contextual Relevancy Relevant Terms Within Meta-Information Attributes Page Rank
Coverage Sampling
7.4 Concluding Remarks
the assessed dataset in order to measure the neighbourhood density of a resource. However, this graph measure still needs refinement and in general the logical next step would be to create a sample of the graph before applying network measures.
Our comprehensive experiments have shown that we can reduce runtime in most cases by more than an order of magnitude, while keeping the precision of results reasonable for most practical applications. All in all, we have demonstrated that using these approximation techniques enables data publishers to assess their datasets in a convenient and efficient manner without the need of having a large infrastructure for computing quality metrics. Therefore, probabilistic techniques can make quality assessment scalable for large linked datasets (cf. RQ 3 in Section1.3). The contribution in this chapter opens up further quests to apply other probabilistic techniques to important but otherwise intractable quality metrics.
C H A P T E R
8
A Preliminary Investigation Towards
Improving Linked Data Quality using Outlier
Detection
As linked datasets usually originate from various structured (e.g. relational databases), semi-structured (e.g. Wikipedia) or unstructured sources (e.g. plain text), a complete and accurate semantic lifting process is difficult to attain. Such processes often contribute to incomplete, misrepresented and noisy data, especially for semi-structured and unstructured sources. Issues caused by these processes can be attributed to the fact that either the knowledge worker is not aware of the various implications of a schema (e.g. incorrectly using inverse functional properties), or because the schema is not well defined (e.g. having an open domain and range for a property). In this chapter we are concerned with the latter cause, aiming to identify potentially incorrect statements in order to improve the quality of the knowledge base. When analysing the schema of the DBpedia knowledge base, we found out that from around 61,000 properties approximately 59,000 had an undefined domain and range. This means that the type of re- sources attached to such properties as the subject or the object of an RDF triple can be very generic, i.e. owl:Thing. For example, the property dbp:author, whose domain and range are undefined, has in- stances where the subject is of type dbo:Book and the object of type dbo:Writer, and other instances where the subject is of type dbo:Software and the object of type dbo:ArtificialSatellite (e.g.dbpedia:Cubesat_Space_Protocol dbp:author dbpedia:AAUSAT3). Without looking at a schema one would intuitively expect that a Book has a Person as its author, but if the author property is under-specified, that is without a domain and range specified in the schema, its semantics cannot be fully understood. Thus, lifting such domain and range restrictions on properties can undeniably increase the possibility of having incorrect RDF statements. Such incorrect statements decrease the quality of the dataset and as a result the precision of results when querying, in turn affecting data consumers who use the knowledge base, including service providers as well as end users.
In this chapter we explore the possibility of using distance-based outlier detection, in an attempt to automate the detection of incorrect RDF triples in a data source. In this preliminary investigation we do not consider object literals. Furthermore, we show how this approach can be used as a Linked Data quality metric.
Overall the particular contributions of this chapter are:
• the definition of an outlier detection for Linked Data method aiming at effectively (precision) and efficiently (scalability) detecting statements that are potentially incorrect w.r.t. domain and range; • the preliminary analysis of the method with regard to various similarity measures, parameters as
well as regarding its time and space complexity.
Furthermore, we look into how this approach can provide interested stakeholders with an approximative quality value that detects the number of outliers in a dataset. We implement this metric for the Luzzu framework (cf. Chapter4).
Our proposed approach is explained in Section8.1, together with the analysis of its space and time complexity. Experiments and evaluations of our approach are documented in Section8.2.