Query performance is the most important issue for an RDF system; this section therefore focuses on the query process.
As in traditional database systems, the main architectural components of Jena, Sesame, RDF-3X and Virtuoso are a query engine, a storage subsystem and a database. The query engine parses the query from a user or an application program and produces an execution plan, represented as a tree of relational operations [70]. The storage subsystem includes a buffer, or even its own file cache manager, which manages the buffering of data and reduces the number of disk accesses.
The general query process of the four triple stores is illustrated in Figure 3.1. It comprises three main phases, from top to bottom: query parsing, query planning and query execution. Previous performance evaluations have focused only on the time cost of the entire query process. In our evaluations, we measure the time cost of each phase separately in order to track performance more precisely.
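The per-phase measurement described above can be sketched as follows. This is only an illustrative harness, not the instrumentation used in the evaluated stores: `PhaseTimer`, `run_query` and the three toy phase functions are hypothetical names introduced here, and the sketch assumes that each phase can be invoked as a separate callable.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase (hypothetical helper)."""

    def __init__(self):
        self.elapsed = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            spent = time.perf_counter() - start
            self.elapsed[name] = self.elapsed.get(name, 0.0) + spent


def run_query(query_text, parse, plan, execute, timer):
    """Run the three phases in order, timing each one separately."""
    with timer.phase("parsing"):
        syntax_tree = parse(query_text)
    with timer.phase("planning"):
        query_plan = plan(syntax_tree)
    with timer.phase("execution"):
        # Materialize the results so that lazy evaluation is also timed.
        results = list(execute(query_plan))
    return results


# Toy stand-ins for the real phases (assumptions, for illustration only).
def toy_parse(text):
    return text.split()


def toy_plan(tokens):
    return sorted(tokens)


def toy_execute(query_plan):
    return iter(query_plan)


timer = PhaseTimer()
rows = run_query("c a b", toy_parse, toy_plan, toy_execute, timer)
for name, seconds in timer.elapsed.items():
    print(f"{name}: {seconds:.6f} s")
```

Materializing the result inside the execution phase matters for stores that stream results lazily, since otherwise part of the execution cost would be attributed to whatever consumes the iterator.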
To better understand the internal implementation of triple stores, we examine the three phases in turn. Query parsing is simple: the input query strings are analysed according to the SPARQL syntax. We therefore focus on the latter two phases.
3.2 RDF Store Querying
[Figure 3.1 (diagram labels: Query; Parsing; Planning; Execution; Join over Triple i, Triple i+1, ...; Data Access; B/B+ Tree; Page Buffer/Cache)]
Fig. 3.1 The workflow of the general query process in triple stores.
3.2.1 Query Planning
The sequence of underlying operations of a query, such as joins, depends on the query plan. An unoptimized plan produces a large number of redundant intermediate results and thus degrades query performance through memory consumption and result materialization. SPARQL queries typically generate deep query plans, and RDF lacks the information about access patterns that is available in relational databases (e.g. foreign keys). This makes query planning, and in particular join order optimization, challenging and resource-consuming. Currently, most RDF stores' query optimizers can collect only limited statistics, such as RDF-3X's histograms, described previously, and Jena's counts of the number of times each predicate appears. Further, some systems (for example, Virtuoso) cache query plans for later use. We measure the planning time of each store in order to show how they differ.
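To illustrate how limited statistics such as predicate counts can drive join ordering, the following minimal sketch orders triple patterns greedily by estimated cardinality. It is not the optimizer of Jena or of any other store discussed here: the function names, the fixed selectivity factor of 10 for a bound subject or object, and the example counts are all assumptions made for illustration.

```python
def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def variables(pattern):
    return {term for term in pattern if is_var(term)}


def estimate(pattern, predicate_counts, default=1000):
    """Crude cardinality estimate based only on predicate frequencies."""
    s, p, o = pattern
    if is_var(p):
        card = sum(predicate_counts.values())
    else:
        card = predicate_counts.get(p, default)
    # Assumed selectivity: each bound subject/object divides the estimate by 10.
    for term in (s, o):
        if not is_var(term):
            card = max(1, card // 10)
    return card


def greedy_join_order(patterns, predicate_counts):
    """Pick the cheapest pattern first, then the cheapest connected one."""
    remaining = list(patterns)
    ordered = []
    bound = set()
    while remaining:
        # Prefer patterns sharing a variable with those already joined,
        # which avoids Cartesian products where possible.
        connected = [p for p in remaining if variables(p) & bound]
        candidates = connected or remaining
        best = min(candidates, key=lambda p: estimate(p, predicate_counts))
        ordered.append(best)
        bound |= variables(best)
        remaining.remove(best)
    return ordered


# Hypothetical statistics and query, for illustration only.
counts = {"rdf:type": 50000, "foaf:name": 8000, "foaf:knows": 2000}
patterns = [
    ("?x", "rdf:type", "?t"),
    ("?x", "foaf:name", "?n"),
    ("?x", "foaf:knows", "ex:alice"),
]
order = greedy_join_order(patterns, counts)
```

With these assumed counts, the pattern with the bound object and the rarest predicate is evaluated first, so the later joins start from a small intermediate result.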
3.2.2 Query Execution
Joins. The join implementations in RDF stores have been studied extensively in previous chapters: the candidate results of two sub-graphs are joined on their join keys, following the query plan. As the execution time of SPARQL queries is dominated by such operations, here we report only the join methods used in the four stores examined.
Jena and Sesame use only nested-loop joins, RDF-3X uses merge joins as well as hash joins, and Virtuoso uses all three types of join. The latter two systems choose the cheapest join method in the planning phase according to the estimated cost of each kind of join, which means that they may spend more time on query planning in order to reduce the query execution cost.
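For reference, the three join types named above can be sketched in a few lines each. These are textbook versions operating on in-memory lists of variable bindings, not the implementations found in the four stores; the function names and the sample bindings are illustrative assumptions.

```python
def nested_loop_join(left, right, key):
    """Compare every left binding with every right binding."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    """Build a hash table on the left input, then probe it with the right."""
    table = {}
    for l in left:
        table.setdefault(l[key], []).append(l)
    return [{**l, **r} for r in right for l in table.get(r[key], [])]


def merge_join(left, right, key):
    """Join two inputs that are already sorted on the join key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on both sides and pair them up.
            i_end = i
            while i_end < len(left) and left[i_end][key] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    out.append({**l, **r})
            i, j = i_end, j_end
    return out


# Hypothetical bindings for two triple patterns sharing the variable ?x.
left = [{"?x": 1, "?n": 10}, {"?x": 2, "?n": 20}, {"?x": 2, "?n": 21}]
right = [{"?x": 2, "?t": 7}, {"?x": 3, "?t": 8}]

nl = nested_loop_join(left, right, "?x")
hj = hash_join(left, right, "?x")
mj = merge_join(left, right, "?x")
```

All three produce the same set of bindings. They differ in cost: the nested-loop join compares every pair of bindings, the hash join makes one pass over each input plus the cost of building the table, and the merge join makes one pass over each input but requires both to be sorted on the join key.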
Data Access. If a join organizes the overall evaluation of all the triple patterns in a query, then data access is the detailed process of retrieving bindings for a single triple pattern. This process is costly, so an efficient indexing structure is needed to locate the required data pages quickly and then retrieve them. Jena [86], Sesame [19] and RDF-3X [89] use B/B+ tree indexes, which are suitable for range queries, while the index scheme of Virtuoso contains primary key and bitmap indexes. Jena provides three triple indexes, on spo, pos and osp, to accommodate different triple patterns. Sesame offers two indexes, spoc and posc, by default. RDF-3X maintains 15 indexes (6 full indexes and 9 aggregated indexes) that cover all the possible join patterns; the redundancy is offset by index compression. By default, Virtuoso provides two full indexes, psog and pogs, where the g indicates the graph name, and three partial indexes, sp, op and gs. All systems use a dictionary that maps values to numeric identifiers. The triple indexes, and most operations, work on these numeric identifiers.
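The interplay of the dictionary and the triple indexes can be illustrated with a small in-memory sketch. It uses the spo, pos and osp orders mentioned for Jena, but it is not Jena's storage layer: sorted Python lists stand in for B+ trees, and the class and method names are hypothetical.

```python
import bisect

# Index name -> positions of (s, p, o) in the order the index stores them.
ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


class Dictionary:
    """Maps terms to dense numeric identifiers and back."""

    def __init__(self):
        self.to_id = {}
        self.to_term = []

    def encode(self, term):
        if term not in self.to_id:
            self.to_id[term] = len(self.to_term)
            self.to_term.append(term)
        return self.to_id[term]


def choose_index(s, p, o):
    """Pick the index with the longest prefix of bound components."""
    bound = (s is not None, p is not None, o is not None)

    def prefix_len(order):
        n = 0
        for position in order:
            if not bound[position]:
                break
            n += 1
        return n

    return max(ORDERS, key=lambda name: prefix_len(ORDERS[name]))


class TripleStore:
    def __init__(self):
        self.dictionary = Dictionary()
        self.indexes = {name: [] for name in ORDERS}

    def add(self, s, p, o):
        ids = tuple(self.dictionary.encode(t) for t in (s, p, o))
        for name, order in ORDERS.items():
            key = tuple(ids[i] for i in order)
            index = self.indexes[name]
            at = bisect.bisect_left(index, key)
            if at == len(index) or index[at] != key:
                index.insert(at, key)  # keep each index sorted, no duplicates

    def match(self, s=None, p=None, o=None):
        """Return the triples matching a pattern; None marks a variable."""
        ids = []
        for term in (s, p, o):
            if term is None:
                ids.append(None)
            elif term in self.dictionary.to_id:
                ids.append(self.dictionary.to_id[term])
            else:
                return []  # unknown term: nothing can match
        name = choose_index(s, p, o)
        order = ORDERS[name]
        # With these three orders, the bound components always form a prefix.
        prefix = tuple(ids[i] for i in order if ids[i] is not None)
        index = self.indexes[name]
        results = []
        at = bisect.bisect_left(index, prefix)  # start of the range scan
        while at < len(index) and index[at][:len(prefix)] == prefix:
            triple = [None, None, None]
            for slot, position in enumerate(order):
                triple[position] = self.dictionary.to_term[index[at][slot]]
            results.append(tuple(triple))
            at += 1
        return results


store = TripleStore()
store.add("ex:a", "foaf:knows", "ex:b")
store.add("ex:a", "foaf:name", '"Alice"')
store.add("ex:c", "foaf:knows", "ex:b")
known_by = store.match(p="foaf:knows", o="ex:b")
```

Every combination of bound subject, predicate and object is a prefix of one of the three orders, which is why a single range scan suffices for any triple pattern in this sketch.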
Data Caches. Practically all RDF stores (and all databases) employ caching mechanisms for triple indexes and dictionaries to improve the performance of frequently encountered queries. Such data caches are implementation-specific. For example, Sesame employs a caching and buffering approach that uses the Java heap. During data retrieval, it first checks whether the required data is in the buffer or cache and, if so, starts an index scan. If no matching data is found, the required B-tree node is read into the buffer before seeking to the exact data position. Depending on the location of the requested data, some B-tree nodes are processed directly, some are read from the disk cache and some are read directly from disk. Caches influence data access operations such as index scans, page reads and triple lookups. To describe the performance of such operations more precisely, we record the number of index scans and their timing, the number of pages read and the number of triple lookups for a single query. All of these data are useful for describing the dynamics of data searching, which are directly associated with query performance.
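The kind of counters described above can be attached to a page buffer as in the following sketch. It assumes a least-recently-used replacement policy, which is a common choice but is not stated here for any of the four stores; `PageBuffer` and `fake_disk_read` are hypothetical names.

```python
from collections import OrderedDict


class PageBuffer:
    """A small LRU page buffer that counts hits and disk reads."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk
        self.pages = OrderedDict()  # page_id -> page, least recent first
        self.hits = 0
        self.disk_reads = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            self.hits += 1
            return self.pages[page_id]
        page = self.read_page_from_disk(page_id)
        self.disk_reads += 1
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used
        return page


def fake_disk_read(page_id):
    """Stand-in for a real page read (assumption, for illustration only)."""
    return f"contents of page {page_id}"


buffer = PageBuffer(capacity=2, read_page_from_disk=fake_disk_read)
for page_id in [1, 2, 1, 3, 2]:
    buffer.get(page_id)
print(buffer.hits, buffer.disk_reads)
```

Resetting the counters before each query and reading them afterwards gives per-query figures, and comparing a cold run with a warm run of the same query shows how much of the data access cost the cache absorbs.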