A SchemaGuide [LRB10a] is a tree-based metadata structure that summarises the structural (pattern) information of an XML tree and describes constraints on the XML data. It is used by the containment checking algorithm to determine the containment relationship between fragments and to identify the common sub-expressions between views. Moreover, as will be shown in Chapter 7, it provides metadata that can be exploited for performance gains.
A SchemaGuide describes the structure of an XML tree without concern for the content of the corresponding XML document. Similar to other existing metadata structures, it also provides constraints on XML data that are either explicitly defined in a DTD file or an XML schema, or implicitly outlined by a structure-based summary such as a strong DataGuide [GW97] or a QueryGuide [IHH09]. The difference between the SchemaGuide and existing metadata structures is that it provides more detailed structural information about XML documents, e.g., the subtree structure of each instance node and the order in which each instance node appears in the subtree. Similar to an XML tree, a SchemaGuide has a tree-based structure and is represented by a 4-tuple, as given in Definition 5.1.
Definition 5.1 [SchemaGuide]
Given an XML document and its tree representation $t$, a SchemaGuide $G$ is a tree summarising the structural information in $t$. It is represented by a 4-tuple $\langle R_G, N_G, E_G, L_G \rangle$, where $R_G$ denotes the root node of $G$; $N_G$ is the set of nodes in $G$; $E_G$ is the set of edges in $G$; and $L_G$ is the set of labels in $G$, with $L_G \subseteq \Sigma$.
If $G$ is a SchemaGuide corresponding to an XML tree $t$, then we say that $G$ summarises $t$, or that $t$ conforms to $G$, denoted by $G \models t$. Given a schema node $u \in N_G$, $L_G(u)$ returns the label of $u$, and $L_G(u) \in \Sigma$. $E_G^+$ is the transitive closure of $E_G$ and defines the ancestor-descendant relationship between any pair of schema nodes in $G$.
Figure 4.1b (see Page 45) is a snapshot of the SchemaGuide corresponding to the segment of the Worldbikes dataset shown in Figure 4.1a. Each instance node of an XML tree is mapped to a node within the SchemaGuide. To differentiate between nodes in an XML tree and nodes in a SchemaGuide, we refer to the latter as schema nodes.
Each schema node is uniquely identified by an integer value called the schema node id (sid), e.g., the number associated with each node in Figure 4.1b. We use $g_u$ to denote the subtree of a SchemaGuide rooted at the schema node $u$; it contains all nodes that are transitively reachable from $u$. For example, the subtree rooted at time in Figure 4.1b is denoted by $g_{time}$.
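The 4-tuple of Definition 5.1, the sid, the subtree $g_u$ and the closure $E_G^+$ can be illustrated with a minimal in-memory sketch. The class and method names below (`SchemaNode`, `SchemaGuide`, `add_child`, `subtree`, `is_ancestor`) are hypothetical and chosen only for illustration; the sketch assumes that sids are assigned in creation order, which the text does not state, and the labels in the usage note are not taken from Figure 4.1b.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass(eq=False)  # compare schema nodes by identity, not by field values
class SchemaNode:
    sid: int                                   # schema node id
    label: str                                 # L_G(u)
    children: List["SchemaNode"] = field(default_factory=list)  # ordered


class SchemaGuide:
    """Illustrative container for the 4-tuple <R_G, N_G, E_G, L_G>."""

    def __init__(self, root_label: str) -> None:
        self.nodes: Dict[int, SchemaNode] = {}     # N_G, keyed by sid
        self.edges: Set[Tuple[int, int]] = set()   # E_G as (parent sid, child sid)
        self.root = self._new_node(root_label)     # R_G

    def _new_node(self, label: str) -> SchemaNode:
        # Assumption: sids are handed out in creation order, starting at 1.
        node = SchemaNode(sid=len(self.nodes) + 1, label=label)
        self.nodes[node.sid] = node
        return node

    def add_child(self, parent: SchemaNode, label: str) -> SchemaNode:
        child = self._new_node(label)
        parent.children.append(child)
        self.edges.add((parent.sid, child.sid))
        return child

    def labels(self) -> Set[str]:
        """L_G: the set of labels used in the guide."""
        return {n.label for n in self.nodes.values()}

    def subtree(self, u: SchemaNode) -> List[SchemaNode]:
        """g_u: u together with every node transitively reachable from u."""
        found = [u]
        for child in u.children:
            found.extend(self.subtree(child))
        return found

    def is_ancestor(self, u: SchemaNode, v: SchemaNode) -> bool:
        """True when (u, v) is in the transitive closure E_G^+."""
        return v is not u and any(n is v for n in self.subtree(u))
```

As a usage example, building a guide with `g = SchemaGuide("city")`, then `station = g.add_child(g.root, "station")` and `time = g.add_child(station, "time")`, gives `g.subtree(station)` containing the station and time nodes, and `g.is_ancestor(g.root, time)` holds because time is reachable from the root through two edges.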
For an XML tree and its corresponding SchemaGuide, we define a mapping function, $\phi$, which maps the nodes of the XML tree to the nodes of the SchemaGuide. The mapping function must retain all the characteristics specified in Definition 5.2, and it serves as an essential concept for our later propositions and proofs. It is the many-to-one nature of the mapping function that makes it suitable for mapping an XML tree to a SchemaGuide, since there are generally many instances of a schema node in the XML tree.
Definition 5.2 [The Mapping Function (ϕ)]
Given a tree $t$ and a SchemaGuide $G$ with $G \models t$, let $\phi : N_t \to N_G$ be a mapping function for which the following characteristics are preserved:
1. Root preserving: $\phi(R_t) = R_G$.
2. Edge preserving: $(u, v) \in E_t \rightarrow (\phi(u), \phi(v)) \in E_G$.
3. Label preserving: $\forall u \in N_t \rightarrow L_t(u) = L_G(\phi(u))$.
4. Root-to-Node Path preserving: $\forall u \in N_t$ such that $u$ is reachable from $R_t$ by the path $P$ $\rightarrow$ $\phi(u)$ is reachable from $\phi(R_t)$ by the same path $P$.
5. Subtree Structure preserving: $\forall u \in N_t \wedge \forall v \in N_{t_u} \rightarrow \phi(v) \in N_{G_{\phi(u)}} \wedge L_t(v) = L_G(\phi(v))$.
6. Order preserving: $\forall u \in N_t \wedge \forall v_i \in N_{t_u}$, $0 < i < k$, where $k$ is the number of children of $u$ $\rightarrow$ $\phi(v_i) \in N_{G_{\phi(u)}}$ and $\phi(v_i)$ is the $i$th child of $\phi(u)$.
As shown in Definition 5.2, when mapping nodes from an XML tree $t$ to a SchemaGuide $G$, the mapping function $\phi$ guarantees that: 1) the root node of $t$ is mapped to the root node of $G$; 2) for any pair of instance nodes in $t$ connected by an edge, their mapped schema nodes in $G$ are connected by an edge; 3) mapped nodes in $t$ and $G$ carry the same label; 4) any instance node in $t$ and its mapped schema node in $G$ lie on the same root-to-node path; 5) any instance node in $t$ and its mapped schema node in $G$ have an identical subtree structure; and 6) the order of nodes within the subtree remains the same. Existing research has focused on the first four characteristics, and we extend it by providing the more detailed metadata required by characteristics 5 and 6. This is the key contribution of our SchemaGuide, as it makes it possible to reduce the search space required for containment checking.
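To make the first three characteristics of Definition 5.2 concrete, the following sketch checks a candidate mapping against them. It is an illustration only, not the procedure used in this work: the `Node` class, the `violations` function and the representation of $\phi$ as a Python dictionary are all assumptions. Characteristic 4 is not tested separately because, in a tree, it follows from 1 to 3 by induction along the path from the root. Characteristics 5 and 6 are left out of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)  # identity-based equality keeps nodes usable as dict keys
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def violations(t_root: Node, g_root: Node, phi: Dict[Node, Node]) -> List[str]:
    """Return the names of the violated characteristics (1 to 3 only).

    phi maps each instance node of the XML tree to a schema node.
    """
    broken = set()
    # 1. Root preserving: phi(R_t) = R_G.
    if phi[t_root] is not g_root:
        broken.add("root")
    stack = [t_root]
    while stack:
        u = stack.pop()
        mapped_u = phi[u]
        # 3. Label preserving: L_t(u) = L_G(phi(u)).
        if u.label != mapped_u.label:
            broken.add("label")
        for v in u.children:
            # 2. Edge preserving: (u, v) in E_t implies (phi(u), phi(v)) in E_G.
            if not any(c is phi[v] for c in mapped_u.children):
                broken.add("edge")
            stack.append(v)
    return sorted(broken)
```

In this sketch, two sibling instance nodes with the same label may both be mapped to a single schema node without triggering a violation, which reflects the many-to-one nature of $\phi$.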
Based on the mapping function defined in Definition 5.2, we now present a new property of our SchemaGuide. Property 5.1 states that if an XML tree $t$ conforms to a SchemaGuide $G$, then there must exist a many-to-one mapping between instance nodes in $t$ and schema nodes in $G$. Every instance node in $t$ maps to exactly one schema node in $G$, whereas every schema node in $G$ corresponds to at least one instance node in $t$.
Property 5.1 Given an XML tree $t$ and a SchemaGuide $G$, if $G \models t$, then there exists a mapping function $\phi$ which maps every instance node in $t$ to a schema node in $G$, whereas a schema node can be mapped to multiple instance nodes.
Furthermore, according to Property 5.1, each instance node within an XML tree maps to a schema node in the corresponding SchemaGuide and, as a result, derives its sid from the mapped schema node. As was shown in Figure 4.1a, each instance node is associated with an integer value representing the sid. Based on the concepts outlined in Definition 5.2 and Property 5.1, the SchemaGuide of any XML tree can be built while the XML document is being parsed, by temporarily storing the root-to-leaf and subtree structures of each instance node. A schema node is created only if the root-to-leaf and subtree structure associated with an instance node has not previously been encountered during parsing. The algorithm for generating a SchemaGuide was implemented using the Xerces2 SAXParser [Xerces2], which scans an XML document and, each time it encounters a tag, calls the corresponding tag handler method. We record all paths and subtree structures by iteratively concatenating or removing tags as they are encountered during parsing.
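The event-driven construction described above can be sketched as follows. This is a deliberately simplified illustration and not the implementation used in this work: it uses Python's standard `xml.sax` module instead of Xerces2, and it identifies a schema node by its root-to-node label path only, whereas the SchemaGuide also takes the subtree structure into account. The names `PathSummaryHandler` and `summarise` are hypothetical, as is the assumption that sids are assigned in order of first encounter.

```python
import xml.sax
from typing import Dict, List, Set, Tuple


class PathSummaryHandler(xml.sax.ContentHandler):
    """Creates one schema node per distinct root-to-node label path."""

    def __init__(self) -> None:
        super().__init__()
        self.path: List[str] = []                    # tags currently open
        self.sids: Dict[Tuple[str, ...], int] = {}   # label path -> sid
        self.edges: Set[Tuple[int, int]] = set()     # (parent sid, child sid)
        self.instance_sids: List[int] = []           # sid per instance node, in document order

    def startElement(self, name, attrs):
        self.path.append(name)                       # concatenate the opening tag
        key = tuple(self.path)
        if key not in self.sids:                     # path not encountered before
            self.sids[key] = len(self.sids) + 1      # assumption: sids in encounter order
            if len(key) > 1:
                self.edges.add((self.sids[key[:-1]], self.sids[key]))
        self.instance_sids.append(self.sids[key])    # instance node derives its sid

    def endElement(self, name):
        self.path.pop()                              # remove the closing tag


def summarise(xml_text: str) -> PathSummaryHandler:
    handler = PathSummaryHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler
```

For a made-up document such as `<city><station><time/></station><station><time/><bikes/></station></city>`, the two station elements share one schema node and so do the two time elements, so six instance nodes are summarised by four schema nodes.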