Given a time-interval query $q[t_b, t_e]$, we must retrieve, for each query term $v \in q$, all postings from the index for the term $v$ whose valid-time interval overlaps with the query time-interval, i.e., $[t_i, t_j) \cap [t_b, t_e] \neq \emptyset$.
Note that, in contrast to the processing of time-point queries described above, there may not be a single list $L_v : [t_k, t_l)$ such that $[t_b, t_e] \subseteq [t_k, t_l)$, which could thus be used alone to answer the query. Recall that $P_v$ denotes the set of time intervals for which our index keeps a posting list for term $v$. To find all postings relevant to term $v$ and time interval $[t_b, t_e]$, we have to determine a set $L_v \subseteq P_v$ of time intervals that together cover the query time-interval $[t_b, t_e]$. Note that we can safely assume that every time interval from $L_v$ overlaps with the query time-interval $[t_b, t_e]$, i.e.,
$$\forall [t_k, t_l) \in L_v : [t_k, t_l) \cap [t_b, t_e] \neq \emptyset , \tag{3.14}$$
since we could otherwise remove $[t_k, t_l)$ and still process the query. Similarly, we can assume that there is no subsumption between time intervals in $L_v$, i.e.,
$$\forall [t_i, t_j) \in L_v \;\; \forall [t_k, t_l) \in L_v : [t_i, t_j) \subseteq [t_k, t_l) \Rightarrow [t_i, t_j) = [t_k, t_l) . \tag{3.15}$$
Putting these two together, we can assume that $L_v$ is a sequence
$$L_v = \langle\, [t_{b_1}, t_{e_1}), \ldots, [t_{b_m}, t_{e_m}) \,\rangle \tag{3.16}$$
3.5 Query Processing
of $m$ time intervals arranged in ascending order of their begin boundary, retaining $t_{e_i} \geq t_{b_{i+1}}$, $t_b \in [t_{b_1}, t_{e_1})$, and $t_e \in [t_{b_m}, t_{e_m})$. To identify relevant postings for the query, we merge the posting lists $L^{.}_v : [t_{b_1}, t_{e_1})$ and $L^{+}_v : [t_{b_i}, t_{e_i})$ for $1 \leq i \leq m$. In doing so, we are guaranteed to read all postings whose valid-time interval overlaps with $[t_{b_1}, t_{e_m})$. This is because of the dichotomy that postings whose valid-time interval overlaps with $[t_{b_1}, t_{e_m})$ are either already alive before time $t_{b_1}$, which we obtain by reading the posting list $L^{.}_v : [t_{b_1}, t_{e_1})$, or have been created during $(t_{b_1}, t_{e_m})$, which we obtain by reading the posting lists $L^{+}_v : [t_{b_i}, t_{e_i})$. Moreover, since $[t_b, t_e] \subseteq [t_{b_1}, t_{e_m})$, we are guaranteed to see all postings relevant to the query. While merging the posting lists, we filter out postings that have a valid-time interval $[t_i, t_j)$ such that $[t_i, t_j) \cap [t_b, t_e] = \emptyset$.
The merging and filtering can be implemented efficiently using a priority queue, thus exploiting the fact that posting lists have a consistent sort order. This order is naturally preserved for the merged list of relevant postings.
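The merge-and-filter step can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the posting layout `(doc_id, begin, end, payload)`, the sort order by that tuple, and the names `overlaps` and `merge_relevant` are assumptions made for the example. The standard-library `heapq.merge` plays the role of the priority queue.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical posting layout: (doc_id, begin, end, payload), valid during [begin, end).
Posting = Tuple[int, int, int, str]


def overlaps(begin: int, end: int, tb: int, te: int) -> bool:
    """True if the half-open interval [begin, end) intersects the closed interval [tb, te]."""
    return begin <= te and tb < end


def merge_relevant(lists: Iterable[List[Posting]], tb: int, te: int) -> Iterator[Posting]:
    """Merge posting lists that share one sort order, skipping duplicates and
    postings whose valid-time interval does not overlap [tb, te]."""
    previous = None
    # heapq.merge keeps a small heap with one head element per input list.
    for posting in heapq.merge(*lists):
        if posting == previous:
            continue  # the same posting was read from two lists
        previous = posting
        _, begin, end, _ = posting
        if overlaps(begin, end, tb, te):
            yield posting


# Illustrative data: one list of postings alive at the start, two lists of created postings.
alive = [(1, 0, 5, "a"), (3, 2, 9, "c")]
created_1 = [(1, 5, 8, "a2"), (2, 6, 7, "b")]
created_2 = [(2, 6, 7, "b"), (4, 20, 30, "d")]
result = list(merge_relevant([alive, created_1, created_2], tb=4, te=10))
```

Because every input list is sorted in the same way, equal postings end up adjacent in the merged stream, so comparing with the previous posting is enough to drop duplicates.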
As for time-point queries, the query-processing performance of a time-interval query depends on the total number of postings read, which includes filtered-out and duplicate postings. For a term $v$ contained in a given time-interval query, the question of which sequence $L_v$ to pick, and thus which posting lists to merge, can be formalized as the following optimization problem:
Definition 3.5 (Time-interval query optimization problem)
argmin
When choosing $L_v$, we thus aim at minimizing the total number of postings read, while making sure that we see all postings relevant for the term $v$ and the time interval $[t_b, t_e]$. The greedy Algorithm 2 computes an optimal solution to the above optimization problem. The algorithm keeps track of an optimal solution $l$ for the time interval $[t_b, t)$ and its associated cost $c$ as a triple $(t, c, l)$ in the set $S$. In each iteration of the main loop, one additional triple is added to $S$ by determining the globally cost-minimal solution for a time interval $[t_b, t')$ that has not yet been covered. To this end, the algorithm extends an optimal solution $(t, c, l)$ already recorded in $S$ by appending a time interval $[t_k, t_l)$ from $P_v$ that does not introduce a gap, i.e., $t \in [t_k, t_l)$. If $[t_k, t_l)$ is the first time interval, the solution obtained has cost $|L^{.}_v : [t_k, t_l)| + |L^{+}_v : [t_k, t_l)|$. Otherwise, its cost is obtained by incrementing the cost of the extended solution by $|L^{+}_v : [t_k, t_l)|$. The algorithm terminates and outputs an optimal solution for $[t_b, t_e]$ once a solution with $t' > t_e$ has been determined.
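Under the cost description above, the objective that the algorithm minimizes can be written out as follows. This is a sketch assembled from the surrounding text (the cost of the first interval, the incremental cost of each further interval, and the coverage conditions stated for the sequence in (3.16)); it is not a verbatim restatement of Definition 3.5.

```latex
\[
  \operatorname*{argmin}_{L_v = \langle [t_{b_1}, t_{e_1}), \ldots, [t_{b_m}, t_{e_m}) \rangle \,\subseteq\, P_v}
  \;\; \bigl| L^{.}_v : [t_{b_1}, t_{e_1}) \bigr|
  \;+\; \sum_{i=1}^{m} \bigl| L^{+}_v : [t_{b_i}, t_{e_i}) \bigr|
\]
\[
  \text{subject to} \quad
  t_b \in [t_{b_1}, t_{e_1}), \qquad
  t_e \in [t_{b_m}, t_{e_m}), \qquad
  t_{e_i} \geq t_{b_{i+1}} \;\; (1 \leq i < m).
\]
```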
The algorithm closely resembles Dijkstra’s algorithm [KT05] for computing shortest paths in a graph. Note that the pseudo-code in Algorithm 2 is meant to illustrate the ideas underlying the algorithm, but does not achieve good time and space complexities. In detail, if implemented as described, it achieves time complexity in $O(|P_v|^3)$ and space complexity in $O(|P_v|^2)$. If a priority queue is used to implement the set $S$ and optimal solutions are represented implicitly using pointers, these can be reduced to $O(|P_v| \cdot \log |P_v|)$ time and $O(|P_v|)$ space, respectively, as discussed for Dijkstra’s algorithm in Kleinberg and Tardos [KT05]. The optimality of the algorithm is stated in the following theorem.
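The following Python sketch illustrates the Dijkstra-like selection with a priority queue and parent pointers. It is an illustration under stated assumptions rather than a transcription of Algorithm 2: each interval is modelled as a hypothetical tuple `(begin, end, alive, created)`, where `alive` and `created` stand for the lengths of the two posting lists of that interval, and the name `optimal_sequence` is invented for the example. For brevity, the sketch scans all intervals in every step, so it does not reach the improved time bound mentioned above.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# Hypothetical interval layout: (begin, end, alive, created) for [begin, end), where
# alive   = number of postings already alive at begin,
# created = number of postings created during the interval.
Interval = Tuple[int, int, int, int]


def optimal_sequence(intervals: List[Interval], tb: int, te: int
                     ) -> Optional[Tuple[int, List[Interval]]]:
    """Return (cost, sequence) of a cheapest gap-free cover of [tb, te], or None."""
    best: Dict[int, int] = {tb: 0}                    # cheapest known cost to cover [tb, t)
    parent: Dict[int, Tuple[int, Interval]] = {}      # t -> (previous t, interval appended)
    heap: List[Tuple[int, int]] = [(0, tb)]           # entries are (cost, t)
    settled = set()
    while heap:
        cost, t = heapq.heappop(heap)
        if t in settled:
            continue                                  # stale heap entry
        settled.add(t)
        if t > te:
            # First settled solution that reaches past te: follow the pointers back.
            sequence = []
            while t in parent:
                t, interval = parent[t]
                sequence.append(interval)
            sequence.reverse()
            return cost, sequence
        for interval in intervals:
            begin, end, alive, created = interval
            if begin <= t < end:                      # appending introduces no gap
                # Only the first interval (t == tb) also pays for the alive postings.
                new_cost = cost + created + (alive if t == tb else 0)
                if new_cost < best.get(end, float("inf")):
                    best[end] = new_cost
                    parent[end] = (t, interval)
                    heapq.heappush(heap, (new_cost, end))
    return None                                       # [tb, te] cannot be covered


# Illustrative data only.
P_v = [(0, 10, 5, 20), (0, 4, 5, 3), (4, 10, 6, 4), (3, 8, 7, 2), (8, 12, 9, 1)]
cost, sequence = optimal_sequence(P_v, tb=2, te=9)
```

In this example the single interval `(0, 10, 5, 20)` covers the query on its own at cost 25, whereas chaining three shorter intervals costs 5 + 3 + 2 + 1 = 11, which is what the sketch selects.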
Theorem 3.1 Algorithm 2 determines an optimal sequence $L_v$ of time intervals.
We need two lemmas to prove Theorem 3.1.
Lemma 3.1 Algorithm 2 adds triples in non-decreasing order of their cost to the set S.
Proof of Lemma 3.1 By contradiction. Let $(t, c, l)$ and $(t', c', l')$ be two triples that the algorithm adds to the set $S$ in consecutive iterations. We assume $c' < c$, i.e., the triple added last has lower cost. This is impossible, since in each iteration Algorithm 2 extends a solution that is already in $S$. Therefore, either $l \subset l'$, i.e., in the last iteration the solution added just before is extended, which implies $c \leq c'$; or $l'$ extends another solution that was already in $S$ when $l$ was determined, which also implies $c \leq c'$, since otherwise our greedy algorithm would have selected $(t', c', l')$ in the first of the two iterations.

Lemma 3.2 For any $(t, c, l) \in S$, the sequence $l$ is an optimal solution for $[t_b, t)$.

Proof of Lemma 3.2 By induction over $|S|$.
$|S| = 1$: For the initial case $S = \{(t_b, 0, \emptyset)\}$ the lemma holds, since no work and therefore zero cost is needed to cover the empty interval $[t_b, t_b)$.
$|S| = 2$: Let $(t, c, l)$ be the triple that Algorithm 2 adds to $S$ in the first iteration of the main loop. By design, in this first iteration, the algorithm chooses the $[t_k, t_l) \in P_v$ that has minimal cost $|L_v : [t_k, t_l)|$ and fulfills $t_b \in [t_k, t_l)$. Thus, any other solution for $[t_b, t)$ must have a cost of at least $c$, since otherwise the algorithm would have selected it.
Algorithm 2: Determining an optimal sequence $L_v$ of time intervals whose posting lists are merged to retrieve all postings relevant to query term $v$ and query time-interval $[t_b, t_e]$
Data: Set of time intervals $P_v$, query time-interval $[t_b, t_e]$
Result: Sequence of time intervals $L_v \subseteq P_v$
/* Initialization */
$|S| \to |S| + 1$ for $|S| \geq 2$: Let $(t, c, l)$ be the triple that Algorithm 2 adds to $S$ and let $(t', c', l')$ be the triple that the algorithm extends to this end. Assume that there is a solution $(t, c^*, l^*)$ that achieves cost $c^* < c$. Note that there must be an optimal $(t'', c'', l'') \in S$ such that $l'' \subset l^*$; this holds, for instance, for the initial triple $(t_b, 0, \emptyset)$ added to $S$. Therefore, Algorithm 2 could gradually extend $(t'', c'', l'')$ and yield $(t, c^*, l^*)$. The fact that it extends $(t', c', l')$ into $(t, c, l)$ when growing $S$ indicates that any extension of $(t'', c'', l'')$ has a cost of at least $c$. Therefore, $(t, c^*, l^*)$ must have cost $c^* \geq c$, which contradicts our assumption.
Proof of Theorem 3.1 Due to Lemma 3.2, we know that the solution $l'$ output by Algorithm 2 is an optimal solution for $[t_b, t')$ that also covers $[t_b, t_e]$, since $t' > t_e$. Any other solution covering $[t_b, t_e]$ must have a cost of at least $c'$ due to Lemma 3.1.
As a side remark, note that for the special case $t_b = t_e$ corresponding to a time-point query, Algorithm 2 correctly picks a list $L_v : [t_k, t_l)$ that has shortest length while retaining $t_b \in [t_k, t_l)$, which matches our above description of how time-point queries are processed using TTIX.
Note that the result of a time-interval query may contain more than one version per document. This is in contrast to time-point queries, whose results contain at most one version per document. This subtle difference has ramifications on the bookkeeping required during query processing (for time-point queries, results can be uniquely identified based on their document identifier) and on the temporal coalescing techniques presented in the following section.
3.6 Temporal Coalescing
If we employ TTIX, as described thus far, to index a versioned document collection, we would naïvely create one posting per term per document version. For frequent terms and highly dynamic document collections, this leads to a huge number of postings that have to be kept in our index. Often, though, changes to documents are minor (e.g., spelling corrections), leading to a typically high degree of redundancy between consecutive versions of the same document. In this section, we describe techniques that leverage this high degree of redundancy to reduce the total number of distinct postings kept in our index. Toward this
objective, the techniques presented coalesce postings belonging to consecutive (i.e., temporally adjacent) versions of the same document at index-build time and thus construct postings that contain information about consecutive versions of the same document. The difference between the techniques lies in the type of posting payload that they target.
When presenting the temporal coalescing techniques for Boolean, scalar, and positional payloads in the following, we assume that our input consists of a sequence of n temporally adjacent postings
$$I = \langle\, (d, [t_1, t_2), p_1), \ldots, (d, [t_n, t_{n+1}), p_n) \,\rangle . \tag{3.17}$$
The input thus represents a contiguous time period during which the term was present in the document $d$. If the term disappears from $d$ but reappears later, multiple input sequences are obtained that are dealt with independently. All methods produce an output sequence
$$O = \langle\, (d, [t'_1, t'_2), p'_1), \ldots, (d, [t'_m, t'_{m+1}), p'_m) \,\rangle \tag{3.18}$$
that consists of $|O| = m \leq n$ coalesced postings and covers the same time interval as the original postings from the input sequence $I$, so that $t_1 = t'_1$ and $t_{n+1} = t'_{m+1}$.
Our temporal coalescing techniques deal with postings belonging to a term $v$ in isolation and therefore naturally allow for parallelization or implementation using a modern data-processing approach such as MapReduce [DG10]. We further assume that input sequences have already been computed. Letting $L_v$ denote the list of all postings belonging to term $v$, computing the input sequences, i.e., sorting $L_v$ and splitting it up into sequences of temporally adjacent postings, is possible in time $O(|L_v| \cdot \log |L_v|)$.
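The sort-and-split step that produces the input sequences can be sketched as follows. This is a minimal illustration, assuming a hypothetical posting layout `(doc_id, begin, end, payload)` and an invented function name `input_sequences`; the sort dominates the running time, and the subsequent split is a single linear pass.

```python
from typing import List, Tuple

# Hypothetical posting layout: (doc_id, begin, end, payload), valid during [begin, end).
Posting = Tuple[str, int, int, float]


def input_sequences(postings: List[Posting]) -> List[List[Posting]]:
    """Sort the postings of one term by (doc_id, begin) and split them into
    maximal runs of temporally adjacent postings of the same document."""
    sequences: List[List[Posting]] = []
    for posting in sorted(postings, key=lambda p: (p[0], p[1])):
        doc_id, begin, _, _ = posting
        last = sequences[-1][-1] if sequences else None
        # Adjacent means: same document, and the previous posting ends where this one begins.
        if last is not None and last[0] == doc_id and last[2] == begin:
            sequences[-1].append(posting)
        else:
            sequences.append([posting])
    return sequences


# Illustrative data: the term is absent from d1 during [9, 12), so d1 yields two sequences.
L_v = [("d2", 0, 4, 1.0), ("d1", 5, 9, 0.5), ("d1", 0, 3, 0.2),
       ("d1", 3, 5, 0.3), ("d1", 12, 15, 0.7)]
sequences = input_sequences(L_v)
```

Each resulting sequence has the form of (3.17) and can be handed to a coalescing method independently of the others.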