In this section, we introduce some preliminary definitions before presenting an overview of the KD2R approach. In the following, we assume that the dataset D1 of Figure 3.1 is under the Unique Name Assumption (UNA).
3.2.1 Keys, Non Keys and Undetermined Keys
We consider a set of properties as a key for a class, if every instance of the class is uniquely identified by this set of properties. In other words, a set of properties is a key for a class if, for all pairs of distinct instances of this class, there exists at least one property in this set for which all the values are distinct.
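Definition 7. (Key). A set of properties $P$ ($P \subseteq \mathcal{P}$) is a key for the class $c$ ($c \in \mathcal{C}$) in a dataset $D$ if:
$$\forall X\, \forall Y\, \big((X \neq Y) \wedge c(X) \wedge c(Y)\big) \Rightarrow \exists p_j \in P\, \big(\exists U\, \exists V\; p_j(X,U) \wedge p_j(Y,V)\big) \wedge \big(\forall Z\, \neg(p_j(X,Z) \wedge p_j(Y,Z))\big)$$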
For example in D1 (see Figure 3.1), the property lastName is a key for the class Person since every last name in the dataset is unique. The set of properties {firstName, bornIn} is also a key since (i) no two persons share values for both firstName and bornIn and (ii) whatever country the person p3 was born in, this set of properties will always be a key.
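As an illustration of Definition 7, the check can be sketched in Python. This is only a minimal sketch: the dictionary `persons` is a hypothetical toy dataset (it is not the dataset D1 of Figure 3.1), and the helper names `distinguishes` and `is_key` are assumptions introduced here.

```python
from itertools import combinations

# Hypothetical toy data (NOT the dataset D1 of Figure 3.1):
# instance -> property -> set of values.
persons = {
    "p1": {"firstName": {"Wendy"}, "lastName": {"Johnson"}, "bornIn": {"USA"}},
    "p2": {"firstName": {"Wendy"}, "lastName": {"Miller"}, "bornIn": {"Canada"}},
    "p3": {"firstName": {"Madalina"}, "lastName": {"David"}},  # bornIn unknown
}

def distinguishes(prop, x, y):
    """True if both instances have values for prop and share none of them."""
    vx, vy = x.get(prop, set()), y.get(prop, set())
    return bool(vx) and bool(vy) and vx.isdisjoint(vy)

def is_key(props, instances):
    """Definition 7: every pair of distinct instances is separated by
    at least one property of props."""
    return all(
        any(distinguishes(p, x, y) for p in props)
        for x, y in combinations(instances.values(), 2)
    )

print(is_key({"lastName"}, persons))
print(is_key({"firstName", "bornIn"}, persons))
print(is_key({"firstName"}, persons))
```

In this toy data, {firstName, bornIn} remains a key whatever value is later added for the bornIn of p3, because p3 is already separated from the other instances by firstName.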
We denote by $K_{D.c}$ the set of keys of the class $c$ w.r.t. the dataset $D$.
Definition 8. (Minimal key). A set of properties $P$ is a minimal key for the class $c$ ($c \in \mathcal{C}$) and a dataset $D$ if $P$ is a key and $\nexists P'$ a key s.t. $P' \subset P$.
We consider a set of properties as a non key for a class c if there exist at least two distinct instances of this class that share values for all the properties of this set.
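Definition 9. (Non key). A set of properties $P$ ($P \subseteq \mathcal{P}$) is a non key for the class $c$ ($c \in \mathcal{C}$) and a dataset $D$ if:
$$\exists X\, \exists Y\; (X \neq Y) \wedge c(X) \wedge c(Y) \wedge \Big(\bigwedge_{p \in P} \exists U\; p(X,U) \wedge p(Y,U)\Big)$$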
For example in D1, the property firstName is a non key for the class Person since there exist two people whose first name is “Wendy”.
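Definition 9 can be sketched in the same way. Again, `persons` is a hypothetical toy dataset rather than D1, and `is_non_key` is a name assumed for this illustration.

```python
from itertools import combinations

# Hypothetical toy data (NOT the dataset D1 of Figure 3.1).
persons = {
    "p1": {"firstName": {"Wendy"}, "lastName": {"Johnson"}},
    "p2": {"firstName": {"Wendy"}, "lastName": {"Miller"}},
    "p3": {"firstName": {"Madalina"}, "lastName": {"David"}},
}

def is_non_key(props, instances):
    """Definition 9: some pair of distinct instances shares at least one
    value for every property of props."""
    return any(
        all(x.get(p, set()) & y.get(p, set()) for p in props)
        for x, y in combinations(instances.values(), 2)
    )

print(is_non_key({"firstName"}, persons))
print(is_non_key({"firstName", "lastName"}, persons))
```

A single pair of instances is enough to establish a non key, so the search can stop at the first such pair, which is what `any` does here.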
We denote by $NK_{D.c}$ the set of non keys of the class $c$ w.r.t. the dataset $D$.
Definition 10. (Maximal non key). A set of properties $P$ is a maximal non key for the class $c$ ($c \in \mathcal{C}$) and a dataset $D$ if $P$ is a non key and $\nexists P'$ a non key s.t. $P \subset P'$.
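The minimality and maximality conditions of Definitions 8 and 10 are plain set-inclusion filters. The following sketch assumes that a collection of keys or non keys has already been computed; the function names and the example sets are hypothetical.

```python
def maximal_sets(sets):
    """Keep the sets that are not strictly contained in another set."""
    sets = {frozenset(s) for s in sets}
    return {s for s in sets if not any(s < t for t in sets)}

def minimal_sets(sets):
    """Keep the sets that do not strictly contain another set."""
    sets = {frozenset(s) for s in sets}
    return {s for s in sets if not any(t < s for t in sets)}

# Hypothetical non keys and keys over properties a, b, c, d.
non_keys = [{"a"}, {"a", "b"}, {"b", "c"}, {"c"}]
keys = [{"d"}, {"a", "d"}, {"a", "c"}]

print(maximal_sets(non_keys))
print(minimal_sets(keys))
```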
To be able to apply the pessimistic and optimistic heuristics, we distinguish the combinations of properties that can be considered neither as keys nor as non keys. More precisely, a set of properties is called an undetermined key for a class c if (i) this set of properties is not a non key, (ii) there exist at least two distinct instances of the class that share values for a subset of the undetermined key and (iii) the remaining properties are not instantiated for at least one of the two instances.
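Definition 11. (Undetermined key). A set of properties $P$ ($P \subseteq \mathcal{P}$) is an undetermined key for the class $c$ ($c \in \mathcal{C}$) in $D$ if:
• (i) $P \notin NK_{D.c}$ and
• (ii) $\exists X\, \exists Y\, \Big(c(X) \wedge c(Y) \wedge (X \neq Y) \wedge \forall p_j \in P\, \big(\exists Z\,(p_j(X,Z) \wedge p_j(Y,Z)) \vee \nexists W\, p_j(X,W) \vee \nexists W\, p_j(Y,W)\big)\Big)$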
For example in D1, the persons p1 and p2 have the same first name (“Wendy”), but no information about the friends of p2 is given. Thus, the set of properties {firstName, hasFriend} is an undetermined key. If hasFriend(p2, p3) were true in the dataset D1, then {firstName, hasFriend} would be a non key.
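Definition 11 can be sketched as follows. The data is a hypothetical toy dataset that only mimics the situation described in the example (it is not D1), and the function names are assumptions made for this illustration.

```python
from itertools import combinations

# Hypothetical toy data (NOT the dataset D1 of Figure 3.1).
persons = {
    "p1": {"firstName": {"Wendy"}, "lastName": {"Johnson"}, "hasFriend": {"p3"}},
    "p2": {"firstName": {"Wendy"}, "lastName": {"Miller"}},  # no friends given
    "p3": {"firstName": {"Madalina"}, "lastName": {"David"}, "hasFriend": {"p1"}},
}

def is_non_key(props, instances):
    """Definition 9: some pair shares a value for every property of props."""
    return any(
        all(x.get(p, set()) & y.get(p, set()) for p in props)
        for x, y in combinations(instances.values(), 2)
    )

def is_undetermined_key(props, instances):
    """Definition 11: props is not a non key, but for some pair every
    property is either shared or missing for one of the two instances."""
    if is_non_key(props, instances):
        return False

    def compatible(x, y):
        return all(
            (x.get(p, set()) & y.get(p, set())) or not x.get(p) or not y.get(p)
            for p in props
        )

    return any(compatible(x, y) for x, y in combinations(instances.values(), 2))

print(is_undetermined_key({"firstName", "hasFriend"}, persons))
```

Adding the value `p3` to the `hasFriend` property of `p2` in this toy data turns the same set of properties into a non key, which mirrors the remark made in the example.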
We denote by $UK_{D.c}$ the set of undetermined keys of the class $c$ for a dataset $D$.
Following Definition 10, an undetermined key $P$ is maximal if there does not exist an undetermined key $P'$ such that $P \subset P'$.
Undetermined keys can be considered either as keys or as non keys, depending on the selected heuristic. Using the pessimistic heuristic, undetermined keys are considered as non keys, while using the optimistic heuristic, they are considered as keys. The discovered undetermined keys can be validated by a human expert who can assign them to the set of keys or non keys.
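The effect of the two heuristics can be sketched as a simple reassignment of the undetermined keys. The function name `apply_heuristic` and the example sets are hypothetical; the sketch only restates the rule given above.

```python
def apply_heuristic(keys, non_keys, undetermined, heuristic):
    """Resolve undetermined keys according to the selected heuristic.

    Returns the pair (keys, non_keys) after reassignment.
    """
    if heuristic == "pessimistic":   # undetermined keys count as non keys
        return set(keys), set(non_keys) | set(undetermined)
    if heuristic == "optimistic":    # undetermined keys count as keys
        return set(keys) | set(undetermined), set(non_keys)
    raise ValueError(f"unknown heuristic: {heuristic!r}")

# Hypothetical sets of keys, non keys and undetermined keys.
K = {frozenset({"lastName"})}
NK = {frozenset({"firstName"})}
UK = {frozenset({"firstName", "hasFriend"})}

print(apply_heuristic(K, NK, UK, "pessimistic"))
print(apply_heuristic(K, NK, UK, "optimistic"))
```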
3.2.2 KD2R overview
A naive automatic way to discover the complete set of keys of a class is to check all the possible combinations of properties that refer to this class. Let us consider a class described by 60 properties. In this case, the number of candidate keys is $2^{60} - 1$. Even if we consider that the size of each key will be small in terms of number of properties, the number of candidate keys can be in the millions. In the previous example, if we consider that the maximum number of properties for a key is 5, the number of candidate keys is about six million (5,985,197). For each candidate key, determining whether it is a key requires exploring the values of all the instances for the properties of this candidate. In order to minimize the number of computations, we propose a method inspired by [SBHR06] which first retrieves the set of maximal non keys (i.e., sets of properties for which at least two instances share values) and then derives the set of minimal keys from the non keys. Unlike for keys, two instances sharing values for a set of properties are enough to consider this set as a non key.
Fig. 3.2 Key Discovery for two datasets: (a) KeyFinder for one dataset; (b) Key merge for two datasets
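The size of the naive search space for a class with 60 properties can be checked with a few lines of Python. The helper `candidate_count` is a name introduced here for illustration; it simply counts the non-empty sets of properties, optionally bounded in size.

```python
from math import comb

def candidate_count(n_properties, max_size=None):
    """Number of non-empty property sets, optionally capped at max_size properties."""
    upper = n_properties if max_size is None else min(max_size, n_properties)
    return sum(comb(n_properties, k) for k in range(1, upper + 1))

print(candidate_count(60))      # all non-empty subsets: 2**60 - 1
print(candidate_count(60, 5))   # subsets with at most 5 properties
```

The second call sums the binomial coefficients C(60, 1) to C(60, 5), that is 60 + 1,770 + 34,220 + 487,635 + 5,461,512 = 5,985,197.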
In Figure 3.2, we show the main steps of the KD2R approach. Our method discovers the keys for each RDF dataset independently. In each dataset, KD2R is applied to the classes, which are first sorted in topological order. In this way, the keys discovered in the superclasses can be exploited when keys are discovered in their subclasses. For a given dataset Di and a given class c, we apply KeyFinder (see Algorithm 1), an algorithm that finds keys for each class of a dataset. The instances of a given class are represented in a prefix tree (see Figure 3.2(a)). This structure is used to discover the sets of maximal undetermined keys and maximal non keys. Once all the undetermined keys and non keys are found, they are used to derive the set of minimal keys. KeyFinder repeats this process for every class of the given ontology. To compute keys that are valid for the classes of two ontologies, KeyFinder is applied to each dataset independently and, once the keys of every class are found, they are merged in order to compute the set of keys that are valid for both datasets (see Figure 3.2(b)).
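One way to picture the derivation of minimal keys from maximal non keys is sketched below. It is not Algorithm 1 and not necessarily the derivation implemented in KD2R. It assumes that every set of properties is either a key or a non key (for instance after undetermined keys have been resolved by a heuristic), so that a set is a key exactly when it is contained in no maximal non key, i.e. when it contains at least one property from the complement of each maximal non key. The function name and the example sets are hypothetical.

```python
def minimal_keys(max_non_keys, properties):
    """Minimal sets of properties that are contained in no maximal non key.

    Assumption of this sketch: every property set is either a key or a non key.
    """
    properties = frozenset(properties)
    candidates = {frozenset()}
    for non_key in max_non_keys:
        complement = properties - frozenset(non_key)
        # Extend each candidate so that it contains a property of the complement.
        extended = {
            c if c & complement else c | {p}
            for c in candidates
            for p in complement
        }
        # Keep only the candidates that are minimal w.r.t. set inclusion.
        candidates = {c for c in extended if not any(o < c for o in extended)}
    return candidates

# Hypothetical maximal non keys over the properties a, b, c, d.
result = minimal_keys([{"a", "b"}, {"b", "c"}], {"a", "b", "c", "d"})
print(result)
```

Here {d} and {a, c} are returned: neither is contained in {a, b} or {b, c}, and no proper subset of them has this property. If a maximal non key covers all the properties, its complement is empty and no key is returned.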