In the context of the Web and the Semantic Web, graph-shaped data is often considered in addition to tree-shaped data. Usually a dominant spanning tree of the graph is given and the ‘missing’ links are added by means of some reference mechanism. XML provides an integral reference mechanism known as ID/ID-Ref [61], and various linking and reference-based standards have been built around XML, such as XML Fragment Interchange [52], XML Linking Language (XLink) [53] and many others. A practical example of such graph-structured data is widespread in HTML documents: the use of internal links. Consider the HTML document in code example 25, with two references modelling a kind of circular link structure between two paragraphs:
<html>
  <body>
    <p id="p1">
      This paragraph refers to the <a href="#p2">next paragraph</a> (and is referred to by it).
    </p>
    <p id="p2">
      This paragraph refers to the <a href="#p1">previous paragraph</a> (and is referred to by it).
    </p>
  </body>
</html>
Code Example 25 An HTML document with two references modelling a kind of circular link structure between two paragraphs.
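The circular link structure can be made explicit by extracting the internal links as graph edges. The following is a minimal sketch using Python's standard `html.parser` module; the class name `LinkGraphParser`, the function `link_edges` and the shortened document text are assumptions made for illustration, and the two links are assumed to point to `#p2` and `#p1` respectively.

```python
from html.parser import HTMLParser


class LinkGraphParser(HTMLParser):
    """Collects edges (id of enclosing <p>, id targeted by an internal link)."""

    def __init__(self):
        super().__init__()
        self.current_id = None  # id of the <p> element currently open
        self.edges = []         # list of (source_id, target_id) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self.current_id = attrs.get("id")
        elif tag == "a" and self.current_id is not None:
            href = attrs.get("href") or ""
            if href.startswith("#"):  # internal link only
                self.edges.append((self.current_id, href[1:]))

    def handle_endtag(self, tag):
        if tag == "p":
            self.current_id = None


# Shortened variant of the document (assumption: links target #p2 and #p1).
DOC = """<html><body>
<p id="p1">This paragraph refers to the <a href="#p2">next paragraph</a>.</p>
<p id="p2">This paragraph refers to the <a href="#p1">previous paragraph</a>.</p>
</body></html>"""


def link_edges(html_text):
    """Return the internal-link edges between paragraphs of html_text."""
    parser = LinkGraphParser()
    parser.feed(html_text)
    parser.close()
    return parser.edges


edges = link_edges(DOC)
print(edges)  # [('p1', 'p2'), ('p2', 'p1')]
```

The two edges form a cycle between `p1` and `p2`, which cannot be expressed by the element nesting alone.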
RDF documents are another example of typically graph-shaped structures modelled in XML.
3.3.1 Reference Types and Typed References
Current XML schema formalisms such as DTD, Relax-NG and XML Schema support the modelling of graph-shaped structures by means of elements that carry ID and ID-Ref attributes. The ID/ID-Ref mechanism is global throughout the whole document, meaning that any reference may refer to any identifier. Unfortunately, this does not allow typed references to be modelled, i.e. references restricted to elements of a certain type. Consider the following example for an illustration of the ID/ID-Ref mechanism:
Example 3.3
Consider a grammar, using a syntax similar to that of example 3.2, that models books and authors in a kind of bibliographical database. To prevent redundancy and misspelling of author names, the authors are kept in an index of authors and are referred to from the definitions of the books in another section of the document. The grammar uses the special type name “ˆ” for references, and the type name “@” denotes IDs. An element should get only one identifier, which arguably simplifies understanding document instances. To emphasise this special multiplicity of an identifier (an element may not contain more than one identifier), it is prefixed to the element name in the type rules. An author element contains an “@”, denoting that instances of that type have an identifier, and the authors list of a book contains one or more references to author elements. Note that, for the sake of consistency with the syntax of R2G2 presented later on, the type names “ˆ” and “@” have been chosen so that they harmonize with R2G2. In turn, the syntax of R2G2 has been streamlined with the syntax of Xcerpt. The mapping to XML attributes named id and ref is given in an opaque way in this example, yet a concrete schema or type formalism would need a way to specify such mappings. Furthermore, the type names Name and Title are synonyms for plain text, or CDATA in the sense of XML; again, a concrete schema or type formalism needs support for atomic data types to be practically useful.
G = (Bibliography,
     {ˆ, @, Bibliography, AuthorIndex, Author, BookIndex, Book, Authors},
     {bib, authors, author, books, book}, R)
where R =
{ Bibliography → bib[AuthorIndex, BookIndex]
, AuthorIndex → authors[Author+]
, Author → @author[Name]
, BookIndex → books[Book+]
, Book → book[Title, Authors]
, Authors → authors[ˆ+] }
<bib>
  <authors>
    <author id="se">Shamir Eli</author>
    <author id="sw">Stevens W.</author>
    <author id="as">Abiteboul Serge</author>
    <author id="bp">Buneman Peter</author>
    <author id="sd">Suciu Dan</author>
  </authors>
  <books>
    <book>
      Automata, Languages, and Programming
      <authors refs="as se"/>
    </book>
    <book>
      Data on the Web
      <authors refs="as bp sd"/>
    </book>
    <book>
      Advanced Programming in the Unix environment
      <authors refs="sw"/>
    </book>
    <book>
      TCP_IP Illustrated
      <authors refs="sw"/>
    </book>
  </books>
</bib>
Code Example 26 An XML document instance valid with respect to the schema shown above.
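To illustrate the global nature of the ID/ID-Ref mechanism, the references of such a document can be resolved against a single identifier table covering the whole document. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the function name `resolve_authors` and the shortened document are assumptions made for illustration, and no schema validation is performed.

```python
import xml.etree.ElementTree as ET

# Shortened variant of the bibliography document (two books only).
BIB = """<bib>
  <authors>
    <author id="se">Shamir Eli</author>
    <author id="as">Abiteboul Serge</author>
    <author id="bp">Buneman Peter</author>
    <author id="sd">Suciu Dan</author>
  </authors>
  <books>
    <book>Automata, Languages, and Programming <authors refs="as se"/></book>
    <book>Data on the Web <authors refs="as bp sd"/></book>
  </books>
</bib>"""


def resolve_authors(xml_text):
    """Map each book title to the names of the authors it refers to."""
    root = ET.fromstring(xml_text)
    # One identifier table for the whole document: any reference may
    # point to any element that carries an id attribute.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    result = {}
    for book in root.iter("book"):
        title = (book.text or "").strip()
        refs = book.find("authors").get("refs", "").split()
        result[title] = [by_id[ref].text.strip() for ref in refs]
    return result


for title, names in resolve_authors(BIB).items():
    print(title, "->", names)
```

Because `by_id` is not partitioned by element type, nothing in this lookup prevents a reference from resolving to an element other than an author.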
The former document could be graphically interpreted as follows:
Now, consider the following example for illustration of problems with untyped references:
Example 3.4
In contrast to the previous example, books are also referable elements here, and author elements contain references to their books. This is modelled by providing the reference type name “ˆ” and the identifier type name “@” in both the author and the book elements.
G = (Bibliography,
     {ˆ, @, Bibliography, AuthorIndex, Author, BookIndex, Book, Authors},
     {bib, authors, author, books, book}, R)
where R =
{ Bibliography → bib[AuthorIndex, BookIndex]
, AuthorIndex → authors[Author+]
, Author → @author[Name, ˆ+]
, BookIndex → books[Book+]
, Book → @book[Title, Authors]
, Authors → authors[ˆ+] }
The following example document is valid with respect to the grammar. Unfortunately, it is not possible to distinguish references to authors from references to books, resulting in a conceptually invalid document, in which a book is referred to as an author of another book and an author contains another author in his list of published books.
<bib>
  <authors>
    <author id="se" ref="alap as"> <!-- CONCEPTUALLY WRONG -->
      Shamir Eli
    </author>
    <author id="sw" ref="apitue ti">Stevens W.</author>
    <author id="as" ref="alap dotw">Abiteboul Serge</author>
    <author id="bp" ref="dotw">Buneman Peter</author>
    <author id="sd" ref="dotw">Suciu Dan</author>
  </authors>
  <books>
    <book id="alap">
      Automata, Languages, and Programming
      <authors refs="as se"/>
    </book>
    <book id="dotw">
      Data on the Web
      <authors refs="as bp sd alap"/> <!-- CONCEPTUALLY WRONG -->
    </book>
    <book id="apitue">
      Advanced Programming in the Unix environment
      <authors refs="sw"/>
    </book>
    <book id="ti">
      TCP_IP Illustrated
      <authors refs="sw"/>
    </book>
  </books>
</bib>
Code Example 27 This document is conceptually wrong with respect to the schema of example 3.4.
The document in code example 27 could be graphically interpreted as follows (the conceptually wrong references are denoted by flashes crossing the edges):
The proposed schema language R2G2 introduces typed references to regular tree grammars in order to model graph-shaped data more precisely. Syntactically, typed references are type name extensions of references: the name of the type intended to be referred to is appended to the reference.
Example 3.5
This grammar is an extension of the grammar in example 3.4, such that the references to authors and books are clearly separated. The conceptually erroneous example document of example 3.4 is invalid under this grammar:
G = (Bibliography,
     {ˆ, @, Bibliography, AuthorIndex, Author, BookIndex, Book, Authors},
     {bib, authors, author, books, book}, R)
where R =
{ Bibliography → bib[AuthorIndex, BookIndex]
, AuthorIndex → authors[Author+]
, Author → @author[Name, ˆBook+]
, BookIndex → books[Book+]
, Book → @book[Title, Authors]
, Authors → authors[ˆAuthor+] }
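The difference between untyped and typed references can be sketched as a simple check over a document instance. The following Python sketch is not an implementation of R2G2; the table `TYPED_REFS`, the function `reference_errors` and the shortened document are hypothetical and only encode the two reference rules of the grammars above (an author refers to books, an authors list refers to authors).

```python
import xml.etree.ElementTree as ET

# Shortened variant of the conceptually wrong document of example 3.4.
DOC = """<bib>
  <authors>
    <author id="se" ref="alap as">Shamir Eli</author>
    <author id="as" ref="alap dotw">Abiteboul Serge</author>
  </authors>
  <books>
    <book id="alap">Automata, Languages, and Programming<authors refs="as se"/></book>
    <book id="dotw">Data on the Web<authors refs="as alap"/></book>
  </books>
</bib>"""

# Hypothetical encoding of the typed references:
# (referring element, attribute) -> element name that must be referred to.
TYPED_REFS = {("author", "ref"): "book", ("authors", "refs"): "author"}


def reference_errors(xml_text, typed):
    """List reference errors; target types are only checked if typed is True."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    errors = []
    for el in root.iter():
        for (tag, attr), target in TYPED_REFS.items():
            if el.tag != tag or el.get(attr) is None:
                continue
            for ref in el.get(attr).split():
                if ref not in by_id:
                    errors.append(f"{tag}: dangling reference '{ref}'")
                elif typed and by_id[ref].tag != target:
                    found = by_id[ref].tag
                    errors.append(f"{tag}: '{ref}' has type {found}, expected {target}")
    return errors


print(reference_errors(DOC, typed=False))  # untyped check: no errors reported
for message in reference_errors(DOC, typed=True):
    print(message)
```

With `typed=False` every reference resolves to some identifier, so the document passes; with `typed=True` the two conceptually wrong references are reported.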
3.3.2 About (Non Tree Structured) Graphs and Tree Grammars
In some graph serialisation formalisms such as RDF, the structure of the serialisation is semantically irrelevant in the sense that different serialisations are considered to denote isomorphic graphs. The underlying graph structure may nevertheless need some sort of schema. As long as the graphs have a special node, called the root node from now on, which is chosen as the starting point for graph traversals, a tree grammar can also be used to model some structural properties of the graph. For a given root, the set of all possible graph traversals is unambiguously determined. A tractable way to realize rooted graph modelling using tree grammars is based on the simulation preorder: a tree grammar is a generator for the language of all trees that can be obtained by means of rule application, and also of all rooted graphs that can be obtained by sharing nodes that result from the same (possibly infinite) chain of rule applications. An implication is that, concerning schema validity, there is no distinction between two graphs where one of them shares one instance of a node in many positions and the other one uses multiple instances with identical shape or value instead of sharing. One graph is then indistinguishable from another one if it is simulated by the other graph.

Graph isomorphism is arguably the most precise notion of equality or indistinguishability for graphs. Simulation is weaker in the sense that graphs that are not isomorphic may simulate or even bi-simulate each other. Consider example 3.1 for two rooted directed graphs that bi-simulate but are not isomorphic. The central difference between (bi)simulation and isomorphism is that (bi)simulation does not require a bijection between the nodes of the two graphs such that related nodes have the same inbound and outbound behaviour, whereas isomorphism requires such a bijection. The disadvantage of identifying, distinguishing or recognising objects using graph isomorphism is its cost: no polynomial-time algorithm for deciding graph isomorphism is known, while the simulation preorder of two graphs can be checked in polynomial time. Arguably, identification, distinction or recognition of objects based on simulation is useful in many practical contexts on the Web, as many practical use cases with Xcerpt show [33]. For a brief introduction to simulation and simulation unification along with some examples, see section 2.5.4.
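The effect of node sharing on simulation can be sketched with a naive fixpoint computation of the standard simulation preorder on labelled rooted graphs. This is an illustrative sketch only, not the simulation variant used by Xcerpt; the graph encoding (node name mapped to a label and a successor list) and the names `simulation`, `simulates`, `SHARED`, `COPIED` and `NO_AUTHOR` are assumptions.

```python
def simulation(g1, g2):
    """Largest relation of pairs (u, v) such that node v of g2 simulates node u of g1.

    A graph is a dict: node -> (label, list of successor nodes).
    """
    # Start with all pairs carrying equal labels, then refine.
    rel = {(u, v) for u in g1 for v in g2 if g1[u][0] == g2[v][0]}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(rel):
            # Every successor of u must be simulated by some successor of v.
            ok = all(any((u2, v2) in rel for v2 in g2[v][1]) for u2 in g1[u][1])
            if not ok:
                rel.discard((u, v))
                changed = True
    return rel


def simulates(g1, root1, g2, root2):
    """True if g2 rooted at root2 simulates g1 rooted at root1."""
    return (root1, root2) in simulation(g1, g2)


# One author node shared by two books ...
SHARED = {
    "bib": ("bib", ["b1", "b2"]),
    "b1": ("book", ["a"]),
    "b2": ("book", ["a"]),
    "a": ("author", []),
}
# ... versus two separate author nodes of identical shape.
COPIED = {
    "bib": ("bib", ["b1", "b2"]),
    "b1": ("book", ["a1"]),
    "b2": ("book", ["a2"]),
    "a1": ("author", []),
    "a2": ("author", []),
}
# A bibliography whose only book has no author.
NO_AUTHOR = {"bib": ("bib", ["b1"]), "b1": ("book", [])}

print(simulates(SHARED, "bib", COPIED, "bib"))     # True
print(simulates(COPIED, "bib", SHARED, "bib"))     # True
print(simulates(SHARED, "bib", NO_AUTHOR, "bib"))  # False
```

`SHARED` and `COPIED` simulate each other although they have four and five nodes respectively, so they cannot be isomorphic. Each pass of the loop removes at least one pair or terminates, so the number of passes is bounded by the number of node pairs, which keeps the computation polynomial.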