We now describe the creation of the prefix tree that represents the descriptions of the instances of a given class in one dataset. Each level of the prefix tree corresponds to a property p and contains a set of nodes. Each node contains a set of cells. Each cell contains:
1. a cell value: (i) when p is a property, the cell value is one literal value, one URI instantiating its range or a null value and (ii) when p is an inverse property, the cell value is one URI instantiating its domain or an artificial null value.
2. IL: (i) when p is a property, the Instance List, called IL, is the set of URIs instantiating its domain and having as range the cell value, and (ii) when p is an inverse property, the Instance List is the set of URIs instantiating its range and having as domain the cell value.
3. NIL: the Null Instance List, called NIL, is the list of URIs whose value for p is unknown and to which the cell value (null or not) has been assigned.
4. a pointer to a single child node.
Each prefix path corresponds to the set of instances that share cell values for all the properties involved in the path.
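As a minimal sketch of the cell and node structure described above, the following Python dataclasses mirror the four components of a cell. The names `Cell`, `Node`, `il`, `nil` and `cell_for` are illustrative assumptions, as is the use of `None` to stand for the artificial null value and the property name `db:country`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cell:
    # Literal, URI, or None (assumed encoding of the artificial null value).
    value: Optional[str]
    # IL: instances whose description carries this value for the property.
    il: set[str] = field(default_factory=set)
    # NIL: instances with an unknown value that were assigned this cell.
    nil: set[str] = field(default_factory=set)
    # Pointer to a single child node (next property level).
    child: Optional[Node] = None


@dataclass
class Node:
    # One cell per distinct value, indexed by that value.
    cells: dict[Optional[str], Cell] = field(default_factory=dict)

    def cell_for(self, value: Optional[str]) -> Cell:
        """Return the cell holding `value`, creating it if it does not exist."""
        if value not in self.cells:
            self.cells[value] = Cell(value)
        return self.cells[value]


# Prefix path for restaurant r1: country "Spain", then city c1.
root = Node()                      # level of the (assumed) property db:country
spain = root.cell_for("Spain")
spain.il.add("r1")
spain.child = Node()               # level of the property db:city
spain.child.cell_for("c1").il.add("r1")
```

Indexing the cells of a node by value makes the "does a cell with this value already exist" test a dictionary lookup; this is a design choice of the sketch, not something the text prescribes.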
To handle the cases where property values are not given in the dataset, we first create an intermediate prefix tree, called the IP-Tree. In the IP-Tree, the absence of a value for a given property is represented by an artificial null value. The final prefix tree, called the FP-Tree, is generated by assigning all the existing cell values of a node to the cell of that node that contains the artificial null value.
3.3.2.1 IP-Tree creation
To create the IP-Tree, we use all the properties that appear in at least one description of an instance of the considered class. For each value of a property in the description of an instance, if the current node does not already contain a cell with this value, a new cell is created and its Instance List IL is initialized with this instance; otherwise, the instance is added to the IL of the existing cell. When a property does not appear in the description of an instance, we create or update, in the same way, a cell with an artificial null value. The IP-Tree is created by scanning the data only once.
Example 3.3.1. Example of CreateIntermediatePrefixTree algorithm.
Figure 3.5 shows the IP-Tree for the descriptions of instances of the class db:Restaurant in the RDF dataset D2 presented in Figure 3.4. The creation of the IP-Tree starts with the first instance, which is the restaurant r1. A new cell is created in the root node containing the name of the country in which the restaurant is located. The next piece of information concerning this restaurant is the city where it is located. To store this information, a new node is created as a child node of the cell “Spain”. In this new node, a new cell is created to store the value c1. The process continues until all the information about an instance is represented in the tree. For each new instance, the insertion begins again from the root.

Algorithm 2: CreateIntermediatePrefixTree
Input: RDF dataset s, class c
Output: root of the IP-Tree
  root ← newNode()
  P ← getProperties(c, s)
  foreach c(i) ∈ s do
    node ← root
    foreach pk ∈ P do
      pk(i) ← getValue(i)
      if pk(i) == ∅ then
        if ∃ cell1 ∈ node with null value then
          node.cell1.IL.add(i)
        else
          cell1 ← newCell()
          node.cell1.value ← null
          node.cell1.IL.add(i)
      else
        foreach value v ∈ pk(i) do
          if ∃ cell1 ∈ node with value v then
            node.cell1.IL.add(i)
          else
            cell1 ← newCell()
            node.cell1.value ← v
            node.cell1.IL.add(i)
      if pk is not the last property then
        if hasChild(cell1) then
          node ← cell1.child.node()
        else
          node ← cell1.child.newNode()
  return root
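The insertion procedure of CreateIntermediatePrefixTree can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: instance descriptions are assumed to be given as a dictionary mapping each instance to its properties and value sets, `None` stands for the artificial null value, and the property name `db:country` is hypothetical. The pseudocode does not spell out how the descent continues after a multi-valued property; this sketch follows every matching cell, which is also an assumption.

```python
class Cell:
    def __init__(self, value):
        self.value = value   # None plays the role of the artificial null value
        self.il = set()      # Instance List
        self.nil = set()     # Null Instance List (filled only in the FP-Tree)
        self.child = None    # single child Node, or None at the last level


class Node:
    def __init__(self):
        self.cells = {}      # value -> Cell


def create_ip_tree(descriptions, properties):
    """Build an IP-Tree in one pass over the instance descriptions.

    descriptions: {instance: {property: set of values}} (assumed input shape)
    properties:   ordered list of properties, one per tree level
    """
    root = Node()
    for instance, desc in descriptions.items():
        frontier = [root]                      # nodes reached at this level
        for k, prop in enumerate(properties):
            # A missing or empty property yields the artificial null value.
            values = desc.get(prop) or {None}
            is_last = k == len(properties) - 1
            next_frontier = []
            for node in frontier:
                for v in values:
                    cell = node.cells.get(v)
                    if cell is None:           # no cell with this value yet
                        cell = node.cells[v] = Cell(v)
                    cell.il.add(instance)
                    if not is_last:
                        if cell.child is None:
                            cell.child = Node()
                        next_frontier.append(cell.child)
            frontier = next_frontier
    return root


# Values taken from the running example: r1 is in Spain (city c1),
# r2 and r3 are in the USA, r2 is in city c2, the city of r3 is unknown.
descriptions = {
    "r1": {"db:country": {"Spain"}, "db:city": {"c1"}},
    "r2": {"db:country": {"USA"}, "db:city": {"c2"}},
    "r3": {"db:country": {"USA"}},
}
ip_root = create_ip_tree(descriptions, ["db:country", "db:city"])
usa_cities = ip_root.cells["USA"].child
print(sorted(usa_cities.cells[None].il))  # ['r3']
```

Only two of the properties of the example are used here, so the resulting tree is a fragment of the one in Figure 3.5.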
Fig. 3.5 IP-Tree for the instances of the class db:Restaurant
Fig. 3.6 FP-Tree for the instances of the class db:Restaurant
3.3.2.2 Final Prefix Tree creation
Using the IP-Tree, we generate an FP-Tree (see Algorithm 3). For each node, the set of possible values contained in its cells is assigned to the artificial null value of this node. If no null value exists in an IP-Tree, this tree is also the FP-Tree. We use the Null Instance List NIL to store the instances for which the property value is unknown. This information will be used by UNKFinder (Algorithm 5) to distinguish non keys from undetermined keys.
Example 3.3.2. Example of CreateFinalPrefixTree algorithm.
In Figure 3.6, we give the FP-Tree of the RDF dataset D2. As shown in Figure 3.5, the restaurants r2 and r3 are both located in “USA”. The restaurant r2 is located in the city c2, while there is no information about the city of the restaurant r3. This absence is represented by a null cell in the IP-Tree. Therefore, to build the FP-Tree, we assign the value c2 to the null value of r3 for the property db:city. The IL of the cell c2 is now {r2,r3} and r3 is stored in its NIL (see Figure 3.7(b)). This assignment is performed using the mergeCells operation. This process is applied recursively to the children of this node (see Figure 3.7(c)) in order to (i) merge the cells of the child nodes that contain the same value and (ii) replace the null values by the remaining possible ones.
Algorithm 3: CreateFinalPrefixTree
Input: IPT: IP-Tree
Output: FPT: FP-Tree
  FPT.root ← mergeCells(getCells(IPT.root))
  foreach cell c ∈ FPT.root do
    nodeList ← getSelectedChildren(IPT.root, c.value)
    nodeList.add(getSelectedChildren(IPT.root, null))
    c.child ← MergeNodeOperation(nodeList)
  return FPT
Algorithm 4: MergeNodeOperation
Input: nodeList, a list of nodes to be merged
Output: mergedNode, the merged node and its descendants
  cellList ← getCells(nodeList)
  mergedNode ← mergeCells(cellList)
  if nodeList contains non leaf nodes then
    foreach cell c ∈ mergedNode do
      childrenNodeList.add(getSelectedChildren(nodeList, null))
      childrenNodeList.add(getSelectedChildren(nodeList, c.value))
      c.child ← MergeNodeOperation(childrenNodeList)
  return mergedNode
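The recursive merge behind the FP-Tree creation can be sketched in Python as follows. This is a simplified illustration under assumptions, not the authors' code: `merge_nodes` folds mergeCells, getSelectedChildren and MergeNodeOperation into one hypothetical function, `None` stands for the artificial null value, and a node that contains only null cells keeps a single null cell, a case the text does not detail. The children of each merged cell are collected afresh for that cell.

```python
class Cell:
    def __init__(self, value, il=()):
        self.value = value    # None plays the role of the artificial null value
        self.il = set(il)     # Instance List
        self.nil = set()      # Null Instance List
        self.child = None     # single child Node, or None at the last level


class Node:
    def __init__(self):
        self.cells = {}       # value -> Cell


def merge_nodes(nodes):
    """Merge a list of IP-Tree nodes into one FP-Tree node with descendants."""
    null_cells = [n.cells[None] for n in nodes if None in n.cells]
    null_instances = set().union(*(c.il for c in null_cells))
    values = {v for n in nodes for v in n.cells if v is not None}
    if not values and null_cells:
        values = {None}       # assumption: nothing to substitute for the null
    merged = Node()
    for v in values:
        same = [n.cells[v] for n in nodes if v in n.cells]
        # Null cells contribute their instances to every existing value.
        sources = same if v is None else same + null_cells
        cell = Cell(v)
        for c in sources:
            cell.il |= c.il
        cell.nil |= null_instances       # remember who had an unknown value
        children = [c.child for c in sources if c.child is not None]
        if children:                     # non-leaf level: recurse
            cell.child = merge_nodes(children)
        merged.cells[v] = cell
    return merged


# IP-Tree fragment of the running example: r1 (Spain, c1), r2 (USA, c2),
# r3 (USA, city unknown).
ip_root = Node()
spain = ip_root.cells["Spain"] = Cell("Spain", {"r1"})
usa = ip_root.cells["USA"] = Cell("USA", {"r2", "r3"})
spain.child, usa.child = Node(), Node()
spain.child.cells["c1"] = Cell("c1", {"r1"})
usa.child.cells["c2"] = Cell("c2", {"r2"})
usa.child.cells[None] = Cell(None, {"r3"})

fp_root = merge_nodes([ip_root])
city = fp_root.cells["USA"].child.cells["c2"]
print(sorted(city.il), sorted(city.nil))  # ['r2', 'r3'] ['r3']
```

The sketch builds new nodes and leaves the IP-Tree untouched, so the null cell of r3 is still present in `ip_root` after the merge.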