• unordered/ordered content: In XML documents, content is always considered to be ordered (the so-called document order). In many applications, particularly in semistructured databases, it is however desirable to be able to treat data as unordered, i.e. the order in which data items occur is irrelevant. Xcerpt allows ordered and unordered content to be mixed.
• query-specific constructs: As Xcerpt is a pattern-based language, it is necessary to enrich term patterns with certain query-specific constructs like variables or partial/total and ordered/unordered term specification (see Query Terms below), while nonetheless staying as close as possible to the representation of data items.
4.2 Data Terms: An Abstraction for Data on the Web
Data terms represent XML documents and data items in semistructured databases. Data terms correspond to ground functional programming expressions and ground logical atoms. Syntactically, they are very similar to the semistructured expressions introduced in 2.1, but they contain additional constructs for representing peculiarities of XML (like attributes). Apart from the special constructs for ordered/unordered term specification and the Xcerpt reference mechanism, data terms are thus just a simplified syntax for XML, or “XML in disguise”. Data terms are not restricted to representing XML data or semistructured expressions: they are meant as an abstraction of many of the available formalisms for rooted, graph structured data, such as data represented in OEM or ACeDB, but also Lisp S-expressions or RDF graphs.
<data-term>      := ( oid "@" )? <ns-label> <list> .
<ns-label>       := ( <ns-prefix> ":" )? label .
<ns-prefix>      := label | '"' iri '"' .
<list>           := <ordered-list> | <unordered-list> .
<ordered-list>   := "[" <attributes>? <data-subterms>? "]" .
<unordered-list> := "{" <attributes>? <data-subterms>? "}" .
<data-subterms>  := <data-subterm> ( "," <data-subterm> )* .
<data-subterm>   := <data-term> | '"' string '"' | number | "^" oid .
<attributes>     := "attributes" "{" <attribute> ( "," <attribute> )* "}" .
<attribute>      := <ns-label> "{" '"' string '"' "}" .
As in the grammar of Section 2.1, expressions between < and > are non-terminal symbols (or variables). Expressions enclosed in the quotation characters " or ' are terminal symbols. oid and label denote object identifiers and expression labels (tag names), respectively. oid, label, and string are character sequences corresponding to XML identifiers, tag names, and text content. number is an arbitrary integer or floating point number. iri is an internationalised resource identifier as defined in [61]. In this thesis, the symbol ^ is often replaced by the more concise symbol ↑, which is unfortunately not available in ASCII.
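To make the grammar concrete, the following is a minimal recursive-descent parser sketch in Python for a subset of it. It is an illustration only, not part of Xcerpt: the names Term, Ref, tokenize and parse are hypothetical, namespace prefixes are not split off from labels, and an attributes subterm is parsed like any other labelled subterm.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Ref:
    oid: str                        # referring occurrence: ^oid (or ↑oid)

@dataclass
class Term:
    label: str
    ordered: bool                   # True for [ ... ], False for { ... }
    children: list = field(default_factory=list)
    oid: Optional[str] = None       # set by a defining occurrence: oid @ term

# string | number | punctuation | label (anything else up to a delimiter)
TOKEN = re.compile(
    r'\s*(?:("(?:[^"\\]|\\.)*")|(-?\d+(?:\.\d+)?)|([\[\]{},@^↑])|([^\s\[\]{},@^↑"]+))'
)

def tokenize(text):
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        string, number, punct, label = m.groups()
        if string is not None:
            tokens.append(("string", string[1:-1]))
        elif number is not None:
            tokens.append(("number", float(number) if "." in number else int(number)))
        elif punct is not None:
            tokens.append(("^" if punct == "↑" else punct, punct))
        else:
            tokens.append(("label", label))
        pos = m.end()
    return tokens

def parse_subterm(tokens, i):
    # <data-subterm> := <data-term> | string | number | "^" oid
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    kind, value = tokens[i]
    if kind in ("string", "number"):
        return value, i + 1
    if kind == "^":
        if i + 1 >= len(tokens) or tokens[i + 1][0] != "label":
            raise SyntaxError("expected an oid after '^'")
        return Ref(tokens[i + 1][1]), i + 2
    return parse_term(tokens, i)

def parse_term(tokens, i):
    # <data-term> := ( oid "@" )? label <list>
    oid = None
    if i + 1 < len(tokens) and tokens[i][0] == "label" and tokens[i + 1][0] == "@":
        oid, i = tokens[i][1], i + 2
    if i >= len(tokens) or tokens[i][0] != "label":
        raise SyntaxError("expected a label")
    label, i = tokens[i][1], i + 1
    if i >= len(tokens) or tokens[i][0] not in ("[", "{"):
        raise SyntaxError(f"expected a list after label {label!r}")
    opener = tokens[i][0]
    closer = "]" if opener == "[" else "}"
    i += 1
    children = []
    while i < len(tokens) and tokens[i][0] != closer:
        if children:
            if tokens[i][0] != ",":
                raise SyntaxError("expected ',' between subterms")
            i += 1
        child, i = parse_subterm(tokens, i)
        children.append(child)
    if i >= len(tokens):
        raise SyntaxError(f"missing {closer!r}")
    return Term(label, opener == "[", children, oid), i + 1

def parse(text):
    tokens = tokenize(text)
    term, end = parse_term(tokens, 0)
    if end != len(tokens):
        raise SyntaxError("trailing input after data term")
    return term
```

Because every list must be introduced by a label, this sketch rejects an unlabelled list such as ["b","c"] occurring as a subterm, and it also reports a list that is opened with one kind of bracket and closed with the other.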
If a data term t is of the form label[t1,...,tn] or label{t1,...,tn}, then the ti are called immediate subterms of t. Subterms of the ti are called indirect subterms of t. If neither “immediate” nor “indirect” is specified, the term subterm usually refers only to the immediate subterms of a term. In analogy to the XML terminology, t is the parent term of its subterms, (immediate) subterms are sometimes also referred to as child terms, and the topmost parent term is called the root term. In an <attributes> expression of the form attributes{label1{...},...,labeln{...}}, the labels must be different, because XML attributes need to have different names.
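The distinction between immediate and indirect subterms can be sketched in Python as follows. The Term class and both function names are hypothetical, and for brevity only labelled subterms are collected; string and number leaves are skipped.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    ordered: bool                   # True for [ ... ], False for { ... }
    children: list = field(default_factory=list)

def immediate_subterms(t):
    """Labelled child terms of t (text and number leaves are skipped)."""
    return [c for c in t.children if isinstance(c, Term)]

def indirect_subterms(t):
    """Subterms of the immediate subterms of t, at any depth."""
    result = []
    for child in immediate_subterms(t):
        result.extend(immediate_subterms(child))
        result.extend(indirect_subterms(child))
    return result

# A small root term: publications { book { title [...], authors [ author [...] ] } }
root = Term("publications", False, [
    Term("book", False, [
        Term("title", True, ["Boken Om Vikingarna"]),
        Term("authors", True, [Term("author", True, ["Catharina Ingelman-Sundberg"])]),
    ]),
])
print([t.label for t in immediate_subterms(root)])   # ['book']
print([t.label for t in indirect_subterms(root)])    # ['title', 'authors', 'author']
```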
Example 4.1
Consider again the publication list from Section 2.1. The representation of this semistructured data item as a data term (or semistructured expression) is shown first, followed by an equivalent representation (except for subterm ordering) as an XML document. Note that the document prologue is omitted for brevity.
CHAPTER 4. XCERPT
publications {
  book {
    title [ "Folket i Birka på Vikingarnas Tid" ],
    authors [
      author [ "Mats Wahl" ],
      author [ "Sven Nordqvist" ],
      author [ "Björn Ambrosiani" ]
    ]
  },
  book {
    title [ "Boken Om Vikingarna" ],
    authors [
      author [ "Catharina Ingelman-Sundberg" ]
    ]
  }
}
<publications>
  <book>
    <title>Folket i Birka på Vikingarnas Tid</title>
    <authors>
      <author>Mats Wahl</author>
      <author>Sven Nordqvist</author>
      <author>Björn Ambrosiani</author>
    </authors>
  </book>
  <book>
    <title>Boken Om Vikingarna</title>
    <authors>
      <author>Catharina Ingelman-Sundberg</author>
    </authors>
  </book>
</publications>
In this example, the terms with label book are immediate subterms or child terms of the term with label publications, which is also the root term. The term with label publications is thus the parent term of the terms with label book. The terms labelled author are immediate subterms of the respective terms labelled authors, and indirect subterms of e.g. the respective terms labelled book.
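The correspondence between a data term and its XML form can be sketched with Python's standard xml.etree.ElementTree module. This is an assumption-laden illustration, not the thesis's own translation: Term, to_element and to_xml are hypothetical names, attributes, references and namespaces are ignored, and the ordered/unordered distinction is lost because XML content is always ordered.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    ordered: bool                   # True for [ ... ], False for { ... }
    children: list = field(default_factory=list)

def to_element(term):
    """Map a term to an XML element; string leaves become text content."""
    elem = ET.Element(term.label)
    last = None                     # most recently appended child element
    for child in term.children:
        if isinstance(child, Term):
            last = to_element(child)
            elem.append(last)
        elif last is None:
            elem.text = (elem.text or "") + str(child)
        else:
            last.tail = (last.tail or "") + str(child)
    return elem

def to_xml(term):
    return ET.tostring(to_element(term), encoding="unicode")

book = Term("book", False, [
    Term("title", True, ["Boken Om Vikingarna"]),
    Term("authors", True, [Term("author", True, ["Catharina Ingelman-Sundberg"])]),
])
print(to_xml(book))
```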
Data terms may be used as an abstraction for many other formalisms that represent hierarchical or graph structured data. The following two examples show the publication list as a Lisp S-expression and in the Object Exchange Model (OEM).
(publications
  (book
    (title "Folket i Birka på Vikingarnas Tid")
    (authors
      (author "Mats Wahl")
      (author "Sven Nordqvist")
      (author "Björn Ambrosiani")
    )
  )
  (book
    (title "Boken Om Vikingarna")
    (authors
      (author "Catharina Ingelman-Sundberg")
    )
  )
)
{ publications:
  { book:
    { title: "Folket i Birka på Vikingarnas Tid",
      authors:
        { author: "Mats Wahl",
          author: "Sven Nordqvist",
          author: "Björn Ambrosiani" }
    },
    book:
    { title: "Boken Om Vikingarna",
      authors:
        { author: "Catharina Ingelman-Sundberg" }
    }
  }
}
4.2.1 Term Specifications
Like semistructured expressions, data terms allow the specification of ordered and unordered lists of subterms. These properties are expressed by using different kinds of braces to parenthesise the subterms.
• Square brackets (i.e. [ ]) denote ordered term specification, i.e. the order of subterms in the list is significant. An ordered term specification makes it possible to select subterms by position and is important e.g. in text documents.
• Curly braces (i.e. { }) denote unordered term specification, i.e. the order of subterms in the term is insignificant, although they are stored in a particular sequence. An unordered term specification makes it possible to rearrange subterms in the list, e.g. for building an index for faster access, or for more efficient use of a storage system (like grouping several small subterms in a single page of background memory while storing large subterms in an individual page each). Unordered term specification is commonly found in semistructured databases.
In Example 4.1 above, the term with label publications has an unordered term specification, meaning that the order of the book subterms is irrelevant, i.e. the storage system might choose to rearrange them in a different order. The terms with label authors have ordered term specification, meaning that the order of the list of author elements is significant (e.g. for proper citing).
Terms with different term specifications may be nested (i.e. subterms of a term may have a term specification different from the parent term’s), but nesting of term specifications within the same list of subterms is not permitted. For example, the term f{g["a","b"],h{"c","d"}} is a data term, but f{"a",["b","c"],"d"} is not.
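One plausible reading of "the order of subterms is insignificant" is that two unordered terms are the same whenever their subterms agree as multisets, whereas ordered terms must agree position by position. The Python sketch below implements that reading; it is an assumption made for illustration, not a definition taken from the thesis, and Term, canonical and same_term are hypothetical names. It also assumes that an ordered and an unordered term with the same content count as different.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    label: str
    ordered: bool                   # True for [ ... ], False for { ... }
    children: list = field(default_factory=list)

def canonical(t):
    """Nested-tuple form in which unordered subterm lists are sorted."""
    if not isinstance(t, Term):
        return ("leaf", t)
    kids = [canonical(c) for c in t.children]
    if not t.ordered:
        kids.sort(key=repr)         # any fixed total order works here
    return (t.label, t.ordered, tuple(kids))

def same_term(a, b):
    return canonical(a) == canonical(b)

g = Term("g", True, ["a", "b"])
h = Term("h", False, ["c", "d"])
print(same_term(Term("f", False, [g, h]), Term("f", False, [h, g])))   # True
print(same_term(Term("f", True, [g, h]), Term("f", True, [h, g])))     # False
```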
4.2. DATA TERMS: AN ABSTRACTION FOR DATA ON THE WEB
4.2.2 References
References are used for representing graph structures in a textual syntax. In Xcerpt data terms, subterms of the form oid @ t (read: “oid at t”) are defining occurrences of oid and associate the identifier oid with the subterm t. Subterms of the form ^oid (or ↑oid, read: “reference to oid”) are referring occurrences of oid and refer to the subterm associated with the identifier oid. As with semistructured expressions, every identifier may occur at most once in a defining occurrence, and an identifier used in a referring occurrence must also occur in a defining occurrence somewhere.
References in data terms are a unified representation for the various linking mechanisms available for XML (and other formalisms), like ID/IDREF, XPointer, XLink and URIs, and serve to simplify their representation in Xcerpt.¹ Unlike other query languages, Xcerpt automatically dereferences such references when querying, i.e. a reference can be treated like a parent-child relationship.
Example 4.2
The following two terms are considered to be equivalent:
f { b { &o1 @ d {} }, c { ↑&o1 } }
f { b { ↑&o1 }, c { &o1 @ d {} } }
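One way to see why the two terms of Example 4.2 can be regarded as equivalent is to replace every referring occurrence by the term it points to and compare the results. The Python sketch below does this under stated assumptions: Term, Ref, definitions and dereference are hypothetical names, the graph is assumed to be acyclic (a cyclic reference would make the expansion loop forever), and this is not a description of how Xcerpt itself evaluates references. The two well-formedness conditions on identifiers are checked along the way.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Ref:
    oid: str                        # referring occurrence: ↑oid

@dataclass
class Term:
    label: str
    ordered: bool
    children: list = field(default_factory=list)
    oid: Optional[str] = None       # defining occurrence: oid @ term

def definitions(t, table=None):
    """Collect oid -> term, rejecting identifiers that are defined twice."""
    table = {} if table is None else table
    if isinstance(t, Term):
        if t.oid is not None:
            if t.oid in table:
                raise ValueError(f"identifier {t.oid} defined more than once")
            table[t.oid] = t
        for child in t.children:
            definitions(child, table)
    return table

def dereference(t, table):
    """Copy t with every reference replaced by its target (acyclic only)."""
    if isinstance(t, Ref):
        if t.oid not in table:
            raise ValueError(f"reference to undefined identifier {t.oid}")
        return dereference(table[t.oid], table)
    if isinstance(t, Term):
        return Term(t.label, t.ordered, [dereference(c, table) for c in t.children])
    return t

left = Term("f", False, [
    Term("b", False, [Term("d", False, [], "&o1")]),
    Term("c", False, [Ref("&o1")]),
])
right = Term("f", False, [
    Term("b", False, [Ref("&o1")]),
    Term("c", False, [Term("d", False, [], "&o1")]),
])
print(dereference(left, definitions(left)) == dereference(right, definitions(right)))  # True
```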
4.2.3 Attributes
Unlike XML, Xcerpt does not have a special representation for attributes. Instead, XML attributes are treated as subterms of a term with the specific restriction that the value may not be structured content. An attribute of the form key = "value" is represented in Xcerpt as a term of the form key{"value"}.
In order to separate attributes from child elements and thus retain the possibility to perform one-to-one transformations between Xcerpt and XML, Xcerpt groups them in a special subterm with the label attributes. Since attributes in XML are always unordered, this special subterm always has an unordered term specification (see above). As a convention, every data term should contain at most one attributes subterm, and this subterm, if existent, should be the first subterm in the list of subterms (even in case the parent term is unordered). Also, all attributes of a term need to have different labels.
Example 4.3
Each book in the bib.xml database of Section 2.4.2 contains an attribute year in the XML syntax. Consider for example the following book:
<book year="1995">
  <title>Vikinga Blot</title>
  <authors>
    <author>
      <last>Ingelman-Sundberg</last>
      <first>Catharina</first>
    </author>
  </authors>
  <publisher>Richters</publisher>
  <price>5.95</price>
</book>
In Xcerpt syntax, this book can be represented as follows. Note in particular that the element itself is ordered (as it is a representation of an XML document) while the attributes are unordered:
¹ Note that Xcerpt is not limited to its own reference mechanism: e.g. ID/IDREF can easily be dereferenced using an appropriate query (cf. Section 5.1.2).
book [
  attributes { year { "1995" } },
  title [ "Vikinga Blot" ],
  authors [
    author [
      last [ "Ingelman-Sundberg" ],
      first [ "Catharina" ]
    ]
  ],
  publisher [ "Richters" ],
  price [ "5.95" ]
]
This treatment of attributes has the main advantage that no exceptions are needed in the definition of Xcerpt extensions like variables or regular expressions. Instead, since attributes are represented in the same term structure as elements, it is possible to use the standard constructs for all occurrences of attributes.
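The mapping from an XML element with attributes to a data term can be sketched in Python with the standard xml.etree.ElementTree module. This is a simplified illustration under assumptions: to_data_term is a hypothetical helper, every element is emitted as an ordered term, namespaces are ignored, whitespace-only text is dropped, and quotation marks inside text are not escaped.

```python
import xml.etree.ElementTree as ET

def to_data_term(elem):
    """Render an XML element as data term text, attributes first."""
    parts = []
    if elem.attrib:
        # key="value" becomes key { "value" } inside one unordered subterm
        attrs = ", ".join(f'{k} {{ "{v}" }}' for k, v in elem.attrib.items())
        parts.append(f"attributes {{ {attrs} }}")
    if elem.text and elem.text.strip():
        parts.append(f'"{elem.text.strip()}"')
    for child in elem:
        parts.append(to_data_term(child))
        if child.tail and child.tail.strip():
            parts.append(f'"{child.tail.strip()}"')
    body = ", ".join(parts)
    return f"{elem.tag} [ {body} ]" if body else f"{elem.tag} []"

xml = '<book year="1995"><title>Vikinga Blot</title><price>5.95</price></book>'
print(to_data_term(ET.fromstring(xml)))
# book [ attributes { year { "1995" } }, title [ "Vikinga Blot" ], price [ "5.95" ] ]
```

Because attributes end up in the same term structure as child elements, the helper needs no separate code path for them beyond grouping them under the attributes label.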
4.2.4 Namespaces
Xcerpt supports namespaces in a straightforward manner that follows closely the use of namespaces in XML (cf. Section 2.2.6). As in XML, namespaces are URIs (uniform resource identifiers) or IRIs (internationalised resource identifiers). Namespace prefixes can be declared and are then separated from term labels by a colon. As an extension to XML namespaces, it is also possible to use the namespace URI as a prefix.²
Namespace Declarations
Namespace prefixes are declared with the keyword ns-prefix followed by the defined prefix, a = and the namespace IRI. The default namespace (i.e. the namespace of all subterms that do not have an explicit namespace prefix) can be defined with the keyword ns-default, followed by = and the namespace IRI of the default namespace.
<ns-declaration> ::= "ns-prefix" <ns-prefix> "=" '"' iri '"'
                   | "ns-default" "=" '"' iri '"' .
As a simplification over XML namespaces, this thesis allows namespace declarations only outside terms. This restriction rules out nested namespace declarations and shadowing, and thus a syntactic one-to-one mapping between XML documents and Xcerpt terms that preserves the namespace prefixes is not always possible, although the two approaches have equivalent expressiveness (both make it possible to associate namespace IRIs with term/element labels). Transforming XML documents that use nested namespace declarations into data terms and vice versa is nevertheless possible, as the namespaces themselves are preserved and only the namespace prefixes might get lost. Further refinements of namespaces that take into account both nested declarations and shadowing are currently being investigated.
Namespaces in Data Terms
In Xcerpt terms, namespaces are used almost as in XML. The most significant difference from XML is that the namespace IRI may also be used as a namespace prefix. In this case, it is not necessary to declare the namespace in advance.
² In XML, this is not admissible due to syntactic restrictions. Xcerpt does not need to adhere to such restrictions, as it is not necessary to retain backwards compatibility with applications that are not namespace-aware.
<ns-prefix> := label | '"' iri '"' .
Example 4.4 (Namespaces in Xcerpt)
Consider again Example 2.14 on page 30, which illustrated the use of namespaces in XML by adding a remarks element to address book entries that might contain HTML elements for markup. It uses the namespace prefix a to refer to the address book schema, and the namespace prefix b to refer to the XHTML schema. As a data term, this document might be represented as follows:
ns-prefix a = "http://www.myschemas.org/address-book"
ns-prefix b = "http://www.w3.org/2002/06/xhtml2"
a:address-book {
  &o1 @ a:person {
    a:name { a:first { "Mickey" }, a:last { "Mouse" } },
    a:phone { attributes { a:type { "home" } }, "19281118" },
    a:knows { ↑&o2 },
    a:remarks {
      b:strong{"Note:"}, "The phone number is also the", b:em{"birthday"},"!"
    }
  },
  &o2 @ a:person {
    a:name { a:first { "Donald" }, a:last { "Duck" } }
  }
}
Instead of declaring the namespace prefix b, it would also be possible to use the namespace URI directly, as in the following example. Note also the use of the default namespace declaration.
ns-default = "http://www.myschemas.org/address-book"
address-book {
  &o1 @ person {
    name { first { "Mickey" }, last { "Mouse" } },
    phone { attributes { type { "home" } }, "19281118" },
    knows { ↑&o2 },
    remarks {
      "http://www.w3.org/2002/06/xhtml2":strong{"Note:"},
      "The phone number is also the",
      "http://www.w3.org/2002/06/xhtml2":em{"birthday"},"!"
    }
  },
  &o2 @ person {
    name { first { "Donald" }, last { "Duck" } }
  }
}
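The two notations of Example 4.4 denote the same labels once prefixes are resolved. The Python sketch below shows one way such a resolution could work; expand is a hypothetical helper, the prefix table is assumed to have been built from the ns-prefix declarations, and applying the default namespace to every unprefixed label is a simplifying assumption.

```python
def expand(label, prefixes, default=None):
    """Return (namespace IRI, local name) for a possibly prefixed label."""
    if label.startswith('"'):
        # The namespace IRI itself is used as prefix: "iri":local
        iri, _, local = label[1:].partition('":')
        return iri, local
    prefix, sep, local = label.partition(":")
    if not sep:
        return default, label       # no prefix: default namespace applies
    if prefix not in prefixes:
        raise KeyError(f"undeclared namespace prefix {prefix!r}")
    return prefixes[prefix], local

ADDRESS_BOOK = "http://www.myschemas.org/address-book"
XHTML = "http://www.w3.org/2002/06/xhtml2"
prefixes = {"a": ADDRESS_BOOK, "b": XHTML}   # from the ns-prefix declarations

# Declared prefix and inline IRI resolve to the same expanded name.
print(expand("b:strong", prefixes) == expand(f'"{XHTML}":strong', {}))              # True
# An explicit prefix and the default namespace agree as well.
print(expand("a:person", prefixes) == expand("person", {}, default=ADDRESS_BOOK))   # True
```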