Together with schema-independent non-queriable XML compressors, non-homomorphic queriable compressors are the two categories for which most tools have been developed in recent years. Some representative methods of the second group are the following 11 tools:
XCQ. XCQ [LNWL03, NLWL06] is an XML schema-aware compressor based on a technique called DTD Tree and SAX Event Stream Parsing (DSP), which tries to take advantage of the information provided by the Document Type Definition (DTD) of the XML document to generate compressed data that is concise but also useful for query evaluation. The DSP technique separates document structure and data content from the input SAX event stream produced while parsing the XML document. Similarly to those XML compressors that use the knowledge of a schema specification (like Millau, SCA, XAUST, etc.), it only encodes the structural information that cannot be inferred from the DTD, that is, occurrences of the ∗, +, ? and | operators. The data part, on the other hand, is arranged by applying a path-based partition grouping. Each time data values are encountered, they are sent to the data stream associated with the full tree path connecting the data to the root node. These data streams are then divided into indexed blocks. Finally, both the structure stream and the blocks of the data streams are individually compressed using a general text compressor, usually gzip.
Data block division slightly worsens the compression ratio, because the commonalities that can be exploited are limited to the contents of the current block. However, since blocks can be compressed and decompressed as individual units and are created in a path-based manner, it also makes it possible to decompress only those blocks that are relevant for a posed query. Therefore, a critical issue in XCQ is choosing an appropriate block size, given that compression and query performance are affected in opposite ways.
XCQ supports the evaluation of a subset of XPath queries involving not only selection and predicates, but also aggregation operators (e.g. count, sum, average, etc.) and equality comparisons (e.g. =).
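The path-based partitioning and block-wise compression described above can be sketched as follows. This is only an illustrative Python sketch, not XCQ's actual implementation: the function names, the BLOCK_SIZE parameter and the sample document are assumptions, zlib (DEFLATE, the algorithm behind gzip) stands in for the general text compressor, values are assumed to contain no newline, and the DTD-driven structure encoding is left out.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

BLOCK_SIZE = 2  # values per block; a hypothetical tuning knob

# Sample document (an assumption, loosely based on Figure 4.20).
DOC = (
    "<clients>"
    "<person><name>Miguel Zas</name><phone>+34555101212</phone></person>"
    "<person><name>Sara Weinstz</name><phone>+34652124133</phone></person>"
    "<company><name>EpsTon</name><phone>+34981241267</phone></company>"
    "</clients>"
)

def partition_by_path(xml_text):
    """Group text values by the root-to-leaf tag path that leads to them."""
    streams = defaultdict(list)

    def walk(node, path):
        path = path + "/" + node.tag
        if node.text and node.text.strip():
            streams[path].append(node.text.strip())
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return streams

def compress_blocks(streams, block_size=BLOCK_SIZE):
    """Split each path stream into blocks and compress every block on its own."""
    blocks = {}
    for path, values in streams.items():
        blocks[path] = [
            zlib.compress("\n".join(values[i:i + block_size]).encode("utf-8"))
            for i in range(0, len(values), block_size)
        ]
    return blocks

def values_for_path(blocks, path):
    """Decompress only the blocks that belong to the queried path."""
    out = []
    for block in blocks.get(path, []):
        out.extend(zlib.decompress(block).decode("utf-8").split("\n"))
    return out
```

A query on /clients/person/name touches only the blocks of that stream. A smaller block_size means less data to decompress per access, but fewer repetitions visible to the compressor inside each block.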
XQzip. XQzip [CN04] introduces indexing structures to support a wide range of XPath queries over the compressed XML document, although partial decompression is still needed to match string conditions. XQzip separates structure (i.e. tags and attributes) from data (i.e. element content and attribute values) while parsing the XML document. The first stream is used to build the Structure Index Tree (SIT), an indexing structure that removes duplicate structures from the XML document to improve query performance. Figure 4.20 b) illustrates an example of a SIT, which corresponds to the tree structure of the XML fragment of Figure 4.20 a). In turn, data are first grouped into different containers according to their associated tag/attribute, and then further divided into smaller data blocks, which are separately compressed using gzip. These blocks can be decompressed individually, hence avoiding full decompression in query evaluation. Yet this leads to a trade-off between compression ratio and decompression overhead when querying, as happened in XCQ. If the block size is small, redundancies across separate blocks are not properly exploited, while if a large block size is defined, each block is costly to decompress. Hence, it may be difficult to find a block size that suits both compression and query evaluation. To minimize decompression overhead in query evaluation, XQzip applies the Least Recently Used (LRU) algorithm to manage a buffer pool for the decompressed data blocks, thus avoiding repeated decompressions if the data is already in the pool. XQzip addresses different types of XPath queries, such as multiple predicates with mixed value-based and structure-based query conditions, and it also allows comparison (e.g. =, >, <, >=, <=, etc.), string (e.g. contains and starts-with) and aggregation operators.
a) XML document:
<regions>
  <region id="C22">
    <clients>
      <person> <name>Miguel Zas</name> <nif>32145680N</nif> <phone>+34555101212</phone> </person>
      <person> <name>Sara Weinstz</name> <nif>44246381P</nif> <phone>+34652124133</phone> </person>
      <company> <name>EpsTon</name> <cif>A15128910</cif> <phone>+34981241267</phone> <web>www.epston.es</web> </company>
    </clients>
  </region>
</regions>

b) SIT. Each tree node is shown as name (Elem/AttID, nodeID), using the element/attribute ID assignment ROOT = 0, regions = 17, region = 21, @id = 14, clients = 43, person = 8, name = 18, nif = 32, phone = 52, company = 11, cif = 33, web = 69:
ROOT (0, 0)
  regions (17, 1)
    region (21, 2)
      @id (14, 3)
      clients (43, 4)
        person (8, 5)
          name (18, 6)
          nif (32, 7)
          phone (52, 8)
        company (11, 9)
          name (18, 10)
          cif (33, 11)
          phone (52, 12)
          web (69, 13)
Figure 4.20: SIT structure (b) of an XML document fragment (a).
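The LRU-managed buffer pool of decompressed blocks mentioned for XQzip can be illustrated with a short sketch. The class name BlockPool, its capacity parameter and the decompressions counter are assumptions made for this example, and zlib stands in for gzip; XQzip's real buffer manager is not described at this level of detail in the text.

```python
import zlib
from collections import OrderedDict

class BlockPool:
    """Keeps at most `capacity` decompressed blocks, evicting the least recently used."""

    def __init__(self, compressed_blocks, capacity):
        self.compressed = compressed_blocks  # block id -> compressed bytes
        self.capacity = capacity
        self.pool = OrderedDict()            # block id -> decompressed bytes, oldest first
        self.decompressions = 0              # counts real decompression work

    def get(self, block_id):
        if block_id in self.pool:
            # Hit: mark the block as most recently used and skip decompression.
            self.pool.move_to_end(block_id)
            return self.pool[block_id]
        # Miss: decompress, cache, and evict the oldest entry if the pool is full.
        data = zlib.decompress(self.compressed[block_id])
        self.decompressions += 1
        self.pool[block_id] = data
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)
        return data
```

Repeated accesses to a block that is still in the pool cost no decompression, which is the effect the buffer pool is meant to achieve during query evaluation.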
XMLZip. This compressor [XMLb] takes as input the DOM tree representation of an XML document and divides that tree into different components by pruning it at a certain depth, d, that can be specified by the user. Each component is then separately compressed with gzip. The component that contains all the nodes in the tree up to depth d is called the root component. The remaining ones are child components and correspond to all the sibling subtrees starting at depth d. In the root component, these children are replaced by references. Figure 4.21 shows an example of the DOM tree component division performed by XMLZip using d = 2. XMLZip does not improve on the compression ratios obtained by compressing the whole document with the underlying gzip; its main advantage is that it supports partial decompression, decompressing only those compressed components that are needed for query evaluation.
XML document:
<account>
  <sale date="01/03/2012">
    <product> <description>King bed-ModR124</description> <price>876</price> </product>
  </sale>
  <sale date="02/03/2012">
    <product> <description>Wardrobe-ModS42</description> <price>1721</price> </product>
  </sale>
</account>

DOM tree: account, with two sale children (date="01/03/2012" and date="02/03/2012"), each containing a product with description and price children and their text values (King bed-ModR124, 876; Wardrobe-ModS42, 1721). The tree is divided into a root component and child components.
Figure 4.21: DOM tree division in XMLZip.
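The component division can be sketched in Python as follows. This is a hedged illustration rather than XMLZip's code: the function names and the placeholder element ref with its id attribute are hypothetical, zlib stands in for gzip, and the sketch assumes that the root has depth 1 and that the subtrees hanging below the nodes at depth d become the child components.

```python
import zlib
import xml.etree.ElementTree as ET

# Sample document taken from Figure 4.21.
DOC = (
    '<account>'
    '<sale date="01/03/2012"><product><description>King bed-ModR124</description>'
    '<price>876</price></product></sale>'
    '<sale date="02/03/2012"><product><description>Wardrobe-ModS42</description>'
    '<price>1721</price></product></sale>'
    '</account>'
)

def split_at_depth(xml_text, d):
    """Return (compressed root component, list of compressed child components)."""
    root = ET.fromstring(xml_text)
    children = []

    def prune(node, depth):
        for i, child in enumerate(list(node)):
            if depth == d:
                # Cut here: store the subtree on its own and leave a reference behind.
                ref = ET.Element("ref", {"id": str(len(children))})
                ref.tail, child.tail = child.tail, None
                children.append(zlib.compress(ET.tostring(child)))
                node[i] = ref
            else:
                prune(child, depth + 1)

    prune(root, 1)
    return zlib.compress(ET.tostring(root)), children

def load_component(component):
    """Decompress a single component without touching the others."""
    return ET.fromstring(zlib.decompress(component))
```

With d = 2 the two product subtrees become child components, so a query that only needs the second sale decompresses the root component and one child.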
XQueC. This compressor [ABMP07] focuses on query speed rather than compression efficiency. Like XGrind and XPRESS, XQueC compresses individual data items of the XML document to avoid decompression during query processing, but it differs from them in that it separates the document structure from the data. With respect to structure, tag and attribute names are encoded using a binary representation of log2 N bits, N being the total number of different names. Furthermore, XQueC builds a structure tree of the input XML document, where each node is assigned a unique identifier reflecting the order of the represented tag/attribute in the document, together with the corresponding assigned code. Meanwhile, data values reached by the same root-to-leaf path are grouped into the same container. XQueC can compress the XML data by applying either the ALM algorithm [Ant97] or the classical Huffman compressor [Huf52]. In the former case, order is preserved in the encoded data, thus allowing range queries to be performed directly over the compressed values. Huffman coding, in turn, supports prefix-wildcards (although not inequalities). Moreover, XQueC considers grouping containers into sets according to the common properties of their data to improve compression efficiency. To determine how containers are grouped, as well as the most suitable compression algorithm, XQueC creates cost models of the different possible configurations by exploiting query workload information.
XQueC supports a wide subset of the XQuery language. To this end, it also builds additional data structures and indices. For instance, it creates a dataguide [GW97], that is, a structural summary representing all possible paths in the document, and links each node to the corresponding data container. What is more, XQueC links each individually compressed data item to its corresponding node in the structure tree. These auxiliary structures significantly improve query performance; however, they may incur a huge space overhead.
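As a small illustration of the fixed-width name encoding used for the structure, the following sketch assigns each distinct tag or attribute name a code of ceil(log2 N) bits. The function name, the alphabetical assignment order and the list NAMES (taken from the document of Figure 4.20) are assumptions for the example; the text does not say how XQueC orders the codes.

```python
# Names appearing in the document of Figure 4.20 (used here only as sample input).
NAMES = ["regions", "region", "@id", "clients", "person", "name",
         "nif", "phone", "company", "cif", "web"]

def build_name_codes(names):
    """Map every distinct name to a binary string of ceil(log2 N) bits."""
    distinct = sorted(set(names))
    # (N - 1).bit_length() equals ceil(log2 N) for N >= 2; use 1 bit when N == 1.
    width = max(1, (len(distinct) - 1).bit_length())
    return {name: format(i, "0{}b".format(width)) for i, name in enumerate(distinct)}
```

For the 11 names above, every code is 4 bits long, since 3 bits can distinguish at most 8 names.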
XSeq. XSeq [LZLY05] is another example of a grammar-based compressor. It is based on Sequitur [NMW97a, NMW97b], a linear-time on-line algorithm that generates a context-free grammar that uniquely represents the input string. XSeq uses this algorithm to compress each of the several containers into which the structure and data tokens of an input XML document have previously been separated. In addition, XSeq makes use of a set of indices to correlate data values stored in different containers, thus improving querying efficiency. For instance, it uses a header index, pointing to each different container, and a structural index, through which each data value can be quickly located in the container without decompression. Data containers also include dedicated indices. All these features give XSeq the ability to directly process queries (in particular, XPath queries) over the compressed document, without full or partial decompression. XSeq is also able to process only the data values relevant to a given query, thus avoiding a sequential scan of irrelevant compressed data.
XCPaqs. This compressor [WLLH04] separates structure and content, and compresses them separately. For the structural part, individual tags, but also complete root-to-leaf paths, are considered. XCPaqs gathers statistics for both components, and it first codes tags with a Huffman compressor [Huf52]. Then paths, which can be described as a series of tags in Huffman code, are further encoded by using the same encoder again. The connection between structure and content is kept through the order, in the original document, of the path associated with each data item. When processing the document, the path type (i.e. the data type and range of values of the data associated with the same root-to-leaf path) is recognized, so that data is compressed with a specific compressor depending on the inferred path type. For instance, enumerated-type data are dictionary encoded, while string data are encoded with a suffix compressor, and long text is compressed with the Burrows-Wheeler Transform [BW94]. The results obtained from the structure and content encoders are finally combined based on their connection relations, leading to a 2-ary final structure.
XCPaqs can solve XQuery queries. Before query processing, tags in the query are translated into their corresponding codes, and then the query plan is split into three steps: i) selecting the appropriate path codes; ii) relating elements and conditions according to their content; iii) constructing the final result.
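The two-level structural coding of XCPaqs (Huffman codes for tags, then Huffman codes for whole paths expressed as series of tag codes) can be sketched as below. The function names and the sample PATHS are assumptions for illustration; the sketch uses a textbook Huffman construction and does not reproduce the statistics gathering or the content compressors of the actual tool.

```python
import heapq
from collections import Counter
from itertools import count

# Sample root-to-leaf paths, one per data item (an assumption for the example).
PATHS = [
    ("clients", "person", "name"),
    ("clients", "person", "nif"),
    ("clients", "person", "name"),
    ("clients", "company", "name"),
    ("clients", "person", "name"),
]

def huffman_codes(symbols):
    """Build a prefix-free binary code from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so that dicts are never compared
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def encode_structure(paths):
    """First code tags, then code each path written as a series of tag codes."""
    tag_codes = huffman_codes(tag for path in paths for tag in path)
    as_tag_codes = ["".join(tag_codes[tag] for tag in path) for path in paths]
    path_codes = huffman_codes(as_tag_codes)
    return tag_codes, as_tag_codes, path_codes
```

Frequent paths receive short codes; in the sample, the path ending in name under person occurs three times out of five and is coded with a single bit.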
ISX. ISX [WLS07] proposes a compact storage scheme for XML that provides, at the same time, efficient support for XPath query evaluation and for update operations such as insertions and deletions. ISX distinguishes three different storage layers: the topology layer, the internal node layer, and the leaf node layer. The first one stores the tree structure of the XML document by using a balanced parentheses encoding derived from [KM90]. The internal node layer, in turn, stores the elements, attributes and signatures of the data content to enable fast text queries. Finally, data values are actually stored in the leaf node layer. These data are referenced by the topology layer and can be compressed by various common compression techniques (usually gzip). Additionally, ISX creates auxiliary data structures over the basic storage scheme to allow efficient query processing.
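To make the balanced parentheses idea of the topology layer concrete, the following sketch encodes the element tree of a document as a parentheses string and navigates it. The function names are assumptions, only element nodes are encoded, and the linear scans are for illustration only; they are not the auxiliary structures ISX builds for efficient navigation.

```python
import xml.etree.ElementTree as ET

def balanced_parentheses(node):
    """Preorder encoding: '(' when a node is entered, ')' when it is left."""
    return "(" + "".join(balanced_parentheses(child) for child in node) + ")"

def find_close(bp, i):
    """Position of the ')' matching the '(' at position i (linear scan)."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")

def subtree_size(bp, i):
    """Number of nodes in the subtree whose '(' is at position i."""
    return (find_close(bp, i) - i + 1) // 2

def first_child(bp, i):
    """Position of the first child of the node at i, or None for a leaf."""
    return i + 1 if bp[i + 1] == "(" else None
```

For the document <a><b><c/></b><d/></a> the encoding is ((())()), which takes two bits per node once parentheses are stored as bits.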
TREECHOP. All procedures in TREECHOP [LMD05] view the input XML document as a tree structure, where non-leaf nodes correspond to elements and attributes, but also to CDATA sections, comments and processing instructions. In turn, leaf nodes are character data, such as attribute values and the data content enclosed by an element. TREECHOP compresses the XML document in an adaptive way. As tokens are received from a SAX parser, new tree nodes are created and sent to the compression stream. Each non-leaf node is assigned a binary codeword. This codeword is uniquely assigned based on the complete path from the root of the tree to the node. Hence, nodes with the same absolute path will receive the same codeword. Formally, the codeword C_n assigned to a non-leaf node n, with parent node p, is formed by the concatenation of three codes C_p, G_n, and T_n. C_p represents the codeword of p, while G_n is a Golomb code [Gol66] assigned to n based on its order with respect to p. Finally, T_n is a sequence of 3 bits denoting the kind of node (e.g. an element, an attribute, a comment, etc.). This encoding scheme preserves the structure of the original XML document. Leaf nodes are processed in a similar manner, additionally using reserved byte values to indicate the beginning and end of the associated character data. As node information is added to the compression stream, it is compressed using gzip. Like XGrind, TREECHOP supports exact-match queries through a sequential scan over the compressed document, while range-match queries require decompressing the data values for further validation.
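The codeword construction C_n = C_p G_n T_n can be sketched as follows. Several details are assumptions made only for this example: the 3-bit values in NODE_TYPES, the use of a Golomb code with a power-of-two parameter (a Rice code), the interpretation of the order of n with respect to p as the rank of first appearance among the distinct children of p, and the empty codeword for the virtual parent of the root.

```python
# Hypothetical 3-bit type tags; the real assignment is not given in the text.
NODE_TYPES = {"element": "000", "attribute": "001", "comment": "010",
              "cdata": "011", "pi": "100"}

def rice_code(n, k=1):
    """Golomb code of n >= 0 with M = 2**k (k >= 1): unary quotient, then k-bit remainder."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, "0{}b".format(k))

class CodewordTable:
    """Assigns one codeword per absolute path, built adaptively as nodes arrive."""

    def __init__(self, k=1):
        self.k = k
        self.codes = {(): ""}   # virtual parent of the root: empty codeword
        self.child_count = {}   # parent path -> number of distinct children seen so far

    def codeword(self, path, kind):
        """`path` is a tuple of names from the root; parents must be seen first."""
        if path not in self.codes:
            parent = path[:-1]
            order = self.child_count.get(parent, 0)
            self.child_count[parent] = order + 1
            # C_n = C_p + G_n + T_n
            self.codes[path] = (self.codes[parent]
                                + rice_code(order, self.k)
                                + NODE_TYPES[kind])
        return self.codes[path]
```

Because each codeword starts with the codeword of its parent, the codeword of a node spells out its whole path, and a second node with the same absolute path simply reuses the stored codeword.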
LZCS. Although it falls into this category, LZCS [ANF07] cannot be considered a general-purpose XML compressor, since it is specifically adapted to compress highly structured XML documents, and hence it does not perform well with arbitrary ones. Inspired by Ziv-Lempel compression, LZCS replaces identical subtrees by a pointer to their first occurrence. To improve compression, the LZCS transformation of a document can be further compressed with a classical compressor. In particular, the authors use the semi-static word-based Huffman method [Mof89] and two PPM schemes [CW84], namely PPMdi and PPMz. The former keeps the properties of the LZCS transformation related to navigation ability, while the latter does not. In [ANF09], the authors show how to perform some basic XPath operations (regarding the child, descendant, parent, and ancestor axes, and also the text matching operator) over the LZCS transformation by using a streaming approach. The main idea is to speed up path matching operations by taking advantage of the work done over repeated substructures.
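The core substitution of LZCS, replacing a repeated subtree by a pointer to its first occurrence, can be illustrated with the sketch below. It is a simplification under stated assumptions: the function names and the nested-tuple output format are hypothetical, subtrees are compared through a canonical tuple, every repeated subtree is replaced regardless of its size, and neither the real LZCS output format nor the subsequent Huffman/PPM stage is modeled.

```python
import xml.etree.ElementTree as ET

DOC = ("<r><p><n>a</n><t>x</t></p>"
       "<p><n>a</n><t>x</t></p>"
       "<q><n>a</n></q></r>")

def canonical(node):
    """Hashable description of a whole subtree, used to detect identical subtrees."""
    return (node.tag, tuple(sorted(node.attrib.items())),
            (node.text or "").strip(),
            tuple(canonical(child) for child in node))

def compress_subtrees(xml_text):
    """Return the tree as nested tuples where repeats become ('ref', first_id)."""
    first_seen = {}  # canonical subtree -> id of its first occurrence
    counter = 0

    def encode(node):
        nonlocal counter
        key = canonical(node)
        if key in first_seen:
            return ("ref", first_seen[key])
        node_id = counter
        counter += 1
        first_seen[key] = node_id
        return (node_id, node.tag, dict(node.attrib),
                (node.text or "").strip(), [encode(child) for child in node])

    return encode(ET.fromstring(xml_text))
```

In the sample, the second p subtree is replaced by a reference to the first one, and the n element under q by a reference to the first n. The more regular the document, the more subtrees collapse into references, which is consistent with LZCS targeting highly structured documents.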
XBzipIndex. As first introduced in Section 4.2.2.2, XBzipIndex is the compressed and searchable tool based on the adaptation of the XBW transform presented in [FLMM06, FLMM09]. As in XBzip, computing the XBW transform of an XML document, given by ⟨Ŝ_last, Ŝ_α, Ŝ_pcdata⟩, constitutes the first step of the XBzipIndex construction. But to support navigation and searching, it also needs to support rank and select operations over Ŝ_last and Ŝ_α. Hence these two arrays are stored using a compressed representation supporting the aforementioned operations (see [FLMM09] for more implementation details). In turn, Ŝ_pcdata is first split into homogeneous buckets, in such a way that two elements are held in the same bucket if they have the same upward path, and afterwards an FM-index [FM01, FM05] representation is created for each bucket. Under this representation, XBzipIndex can answer two different kinds of queries: i) //Π, and ii) //Π[fn:contains(., γ)], where Π denotes a fully-specified path consisting of tag/attribute names and γ is an arbitrary string.
One of the distinctive features of XBzipIndex is that it constitutes the first solution combining compression and indexing. The compressed data represents at the same time the structured text and an index built on it. This is called a self-index [NM07].
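The rank and select operations required over the arrays of the transform can be illustrated with a plain, uncompressed sketch. The class name and the conventions used here (rank counts occurrences in the prefix seq[0:i], select takes a 1-based occurrence number and returns a 0-based position) are assumptions for the example; XBzipIndex relies on compressed representations, which this sketch does not attempt to reproduce.

```python
from bisect import bisect_left

class RankSelect:
    """Answers rank/select on a sequence by storing the positions of each symbol."""

    def __init__(self, seq):
        self.positions = {}
        for i, symbol in enumerate(seq):
            self.positions.setdefault(symbol, []).append(i)

    def rank(self, symbol, i):
        """Number of occurrences of `symbol` in seq[0:i]."""
        return bisect_left(self.positions.get(symbol, []), i)

    def select(self, symbol, j):
        """0-based position of the j-th occurrence of `symbol` (j >= 1)."""
        return self.positions[symbol][j - 1]
```

The two operations are inverse to each other in the sense that rank(c, select(c, j)) = j - 1, which is what navigation over such arrays typically relies on.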
SXSI. Like XBzipIndex, Succinct XML Self Index (SXSI) [ACM+10] is another tool for compressed indexing of XML data. Yet it is able to support a wider range of XPath queries than that addressed by XBzipIndex. SXSI is tailored to work in main memory, and uses a compressed index representation for XML data that is able to solve queries involving some of the forward XPath axes, together with different text functions (e.g. `=', contains, and starts-with).
SXSI regards XML documents both as an ordered set of strings and as a labeled tree defined by the hierarchical tags. Hence, it establishes a separation between the structure itself and the text content. Figure 4.22 illustrates the model used by this proposal for a given XML fragment. Note that the actual tree is formed by the solid edges, whereas dotted edges show the connection with the textual parts. Each node of the tree representing an element is labeled by its corresponding tag name, text nodes are modeled as leaves labeled with #, and each attribute node is represented as a sequence of nodes where the first one is labeled with @, its child node is the attribute name itself and the leaf child denotes the associated attribute value by means of the special label %. Observe that there is exactly one text content related to each tree leaf labeled # or %. Nodes of the tree are assigned global identifiers, and each text content also receives its own text identifier. Then, SXSI concatenates all text data⁷ and represents them by using a succinct full-text self-index, namely the FM-index [FM05]. This index is based on the BWT [BW94] and supports pattern matching operations. In turn, the tree structure is represented by combining two different and aligned sequences: a balanced parentheses representation of the tree skeleton, and a sequence of the tag identifiers of each tree node. Tree navigation operations are directly inherited from the implementation of the first sequence [SN10]. Figure 4.23 shows how SXSI models the structural and textual parts of the example depicted in Figure 4.22.

XML document:
<shop>
  <product mod="12b"> <name>skirt</name> <size>m</size> </product>
  <product mod="23c"> <name>handbag</name> <color>black</color> </product>
</shop>

Tree model (tree nodes carry identifiers 1 to 17, text contents carry text identifiers 1 to 6):
shop
  product
    @
      mod
        % : 12b
    name
      # : skirt
    size
      # : m
  product
    @
      mod
        % : 23c
    name
      # : handbag
    color
      # : black

Figure 4.22: Example of SXSI data model.
Tree:
Par = ( ( ( ( ( ) ) ) ( ( ) ) ( ( ) ) ) ( ( ( ( ) ) ) ( ( ) ) ( ( ) ) ) )
Tag = S p @ m % /% /m /@ n # /# /n s # /# /s /p p @ m % /% /m /@ n # /# /n c # /# /c /p /S
(S = shop, p = product, n = name, s = size, c = color)

Text collection:
T = 12b$skirt$m$23c$handbag$black$
F = $$$$$$1223aaabbbccdghikklmnrst
L = Tbwt = kmgctb$$12lbh2d$3ana$kcsb$ai$r
Figure 4.23: Tree and text data representation in SXSI.
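The following sketch derives the two aligned sequences and the text collection of Figure 4.23 from the XML fragment of Figure 4.22. It is an illustrative reconstruction of the data model only: the function name is an assumption, full tag names are used instead of the one-letter abbreviations of the figure, text that follows a child element (mixed content) is ignored, and neither the succinct tree representation nor the FM-index is built.

```python
import xml.etree.ElementTree as ET

# XML fragment of Figure 4.22.
DOC = ('<shop>'
       '<product mod="12b"><name>skirt</name><size>m</size></product>'
       '<product mod="23c"><name>handbag</name><color>black</color></product>'
       '</shop>')

def sxsi_model(xml_text):
    """Return (Par, Tag, T): parentheses, aligned tag labels, '$'-terminated texts."""
    par, tags, texts = [], [], []

    def open_node(label):
        par.append("(")
        tags.append(label)

    def close_node(label):
        par.append(")")
        tags.append("/" + label)

    def visit(node):
        open_node(node.tag)
        if node.attrib:
            # Attributes hang from a node labeled @; each value is a leaf labeled %.
            open_node("@")
            for name, value in node.attrib.items():
                open_node(name)
                open_node("%")
                texts.append(value)
                close_node("%")
                close_node(name)
            close_node("@")
        if node.text and node.text.strip():
            # Text content is a leaf labeled #.
            open_node("#")
            texts.append(node.text.strip())
            close_node("#")
        for child in node:
            visit(child)
        close_node(node.tag)

    visit(ET.fromstring(xml_text))
    return "".join(par), tags, "".join(text + "$" for text in texts)
```

Each position of Par has its label at the same position of Tag, and the i-th leaf labeled # or % corresponds to the i-th string of T, which is how the tree and the text index are connected.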
The aforementioned data structures constitute the basis for query evaluation. Each XPath query is translated into an alternating tree automaton [CDG+07, Hos10]. Conventionally, the run of a tree automaton visits every node of the input tree, but SXSI makes use of the information kept in the indexes and applies different techniques to visit only the relevant ones [MN10], thus reducing processing times.

⁷ Each one appended with the special end-marker $.
Part II
Our proposal: XXS
Chapter 5
The XML Wavelet Tree
In this chapter we present the first core part of XXS, the XML Wavelet Tree (XWT), a new data structure to represent an XML document in a compressed and self-indexed way (see Figure 5.1). The XWT constitutes a new approach for the compact representation of XML documents, which takes about 30%-40% of the original document size while allowing their efficient processing and querying: the XWT provides implicit indexing properties that can be successfully exploited to efficiently support XPath queries, as will be seen later in Chapters 7 to 9.

[Figure 5.1 diagram labels: XXS; XML Document; XML Representation; XML Wavelet Tree; Query Module; Query Parser; Query Evaluator.]
Figure 5.1: XML representation of XXS: the XML Wavelet Tree (XWT).
This chapter focuses on describing the XML Wavelet Tree data structure. Section 5.1 first introduces the main construction features of this representation, while Section 5.2 details the basic procedures to decompress and search over the XWT. Sections 5.3 and 5.4 end the chapter by uncovering some of the main XWT