We start this section by providing detailed information about weak and strong compositions. After that, we discuss how weak and strong 4-compositions can be used to represent the query-document model that was used in this dissertation.

If each of the k numbers in an ordered arrangement (such as the type of arrangement that is introduced in Section 2.2.5) must be positive, then the arrangement is not only a weak k-composition but also a (strong) k-composition. The set of (strong) k-compositions is a proper subset of the set of weak k-compositions. Figure 2.1 on the following page depicts the relationship between sets of weak compositions and sets of compositions. From this point on, (strong) compositions are referred to simply as compositions unless a (strong) composition needs to be contrasted with a weak one. The notation [k], used in the quote below from Bóna (2006), denotes the set of the first k positive integers; for example, [5] represents the set {1, 2, 3, 4, 5}.

More formally, here are definitions for weak compositions and compositions:

A sequence (a1, a2, ..., ak) of integers fulfilling ai ≥ 0 for all i, and (a1 + a2 + ... + ak) = n is called a weak composition of n. If, in addition, the ai are positive for all i ∈ [k], then the sequence (a1, a2, ..., ak) is called a composition of n. (Bóna, 2006)

For example, the compositions of size 4 of the number 5 are (1, 1, 1, 2), (1, 1, 2, 1), (1, 2, 1, 1), and (2, 1, 1, 1). An alternative way of viewing them is as ordered sums:

5 = 1 + 1 + 1 + 2 = 1 + 1 + 2 + 1 = 1 + 2 + 1 + 1 = 2 + 1 + 1 + 1.
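The four compositions listed above can be reproduced with a short recursive generator. This is a minimal sketch rather than the dissertation's own algorithm; the function name `compositions` is hypothetical.

```python
from typing import Iterator, Tuple


def compositions(n: int, k: int) -> Iterator[Tuple[int, ...]]:
    """Yield every (strong) composition of n into k positive parts.

    Results come out in lexicographic order.
    """
    if k == 0:
        # Only the empty sequence sums to 0 with zero parts.
        if n == 0:
            yield ()
        return
    # The first part must leave at least 1 for each of the other k - 1 parts.
    for first in range(1, n - (k - 1) + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


if __name__ == "__main__":
    for c in compositions(5, 4):
        print(c, "->", " + ".join(map(str, c)), "=", sum(c))
```

Calling `compositions(5, 4)` yields exactly the four arrangements given in the text, in the same order.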

Figure 2.1: The relationships between the sets of compositions (C) and weak compositions (W) of a positive integer n into k parts. The circle represents the set of compositions, and the backslash (\) symbol denotes set difference (relative complement). The set W\C contains the weak compositions that are not also compositions, that is, the weak compositions that are not members of C. The gray region represents the members of W\C.

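Weak compositions can be viewed as ordered sums in the same way. For example, the weak compositions of size 2 of the number 3 are (0, 3), (1, 2), (2, 1), and (3, 0), which can be written as: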

3 = 0 + 3 = 1 + 2 = 2 + 1 = 3 + 0.
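The ordered sums of 3 shown above, in which a part may be 0, can be generated with the classic "stars and bars" construction. The sketch below is one possible approach under that assumption, not necessarily the method used in this work; the name `weak_compositions` is hypothetical.

```python
from itertools import combinations
from typing import Iterator, Tuple


def weak_compositions(n: int, k: int) -> Iterator[Tuple[int, ...]]:
    """Yield every weak composition of n into k non-negative parts."""
    if k == 0:
        if n == 0:
            yield ()
        return
    slots = n + k - 1  # n stars plus k - 1 bars
    # Choosing the positions of the k - 1 bars fixes the composition:
    # each part is the number of stars between two consecutive bars.
    for bars in combinations(range(slots), k - 1):
        edges = (-1,) + bars + (slots,)
        yield tuple(edges[i + 1] - edges[i] - 1 for i in range(k))


if __name__ == "__main__":
    print(list(weak_compositions(3, 2)))
```

Every composition is also a weak composition, so filtering this output for tuples without a 0 recovers the (strong) compositions.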

Now, let us imagine that we have a collection of N documents and a particular single-term query. Furthermore, let us assume that, for each document, we are interested in two pieces of information: whether that document is relevant to the query and whether its bag of terms contains the query term. This divides the document collection into 4 non-overlapping (i.e., mutually exclusive) categories: the documents that are relevant and contain the query term (r1 denotes the cardinality of this category), the documents that are relevant but do not contain the query term (r0 denotes the cardinality of this category), the documents that are non-relevant and contain the query term (s1 denotes the cardinality of this category), and the documents that are non-relevant and do not contain the query term (s0 denotes the cardinality of this category).

Each of these categories contains anywhere from none to all of the documents in the collection. No matter how many documents each category contains, though,

r0+r1+s0+s1

must always equal N because each document must be a member of exactly one of these 4 categories. Notationally, let

N = R + S = n0 + n1

represent the total number of documents in a collection, with

R = r0 + r1

representing the number of relevant documents and S = s0 + s1 representing the number of non-relevant documents. Figure 2.2 on the next page uses a contingency table to depict the relationships between these variables.
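To make the bookkeeping concrete, the following sketch tallies the four category counts for a small collection and checks the identities stated above. The toy data and the function name `contingency_counts` are hypothetical, and the code assumes that n0 and n1 count the documents without and with the query term, respectively (n0 = r0 + s0, n1 = r1 + s1).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def contingency_counts(docs: Iterable[Tuple[bool, bool]]) -> Dict[str, int]:
    """Tally counts from (is_relevant, has_query_term) pairs, one per document."""
    cells = Counter((bool(rel), bool(term)) for rel, term in docs)
    r1 = cells[(True, True)]    # relevant, term present
    r0 = cells[(True, False)]   # relevant, term absent
    s1 = cells[(False, True)]   # non-relevant, term present
    s0 = cells[(False, False)]  # non-relevant, term absent
    return {
        "r1": r1, "r0": r0, "s1": s1, "s0": s0,
        "R": r0 + r1, "S": s0 + s1,
        # Assumption: n0 / n1 are the term-absent / term-present totals.
        "n0": r0 + s0, "n1": r1 + s1,
        "N": r0 + r1 + s0 + s1,
    }


# Hypothetical six-document collection.
toy_docs = [(True, True), (False, False), (False, True),
            (True, False), (False, False), (False, True)]
counts = contingency_counts(toy_docs)

# Each document falls into exactly one category, so the totals agree.
assert counts["N"] == len(toy_docs)
assert counts["N"] == counts["R"] + counts["S"] == counts["n0"] + counts["n1"]
print(counts)
```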

The above requirements are very naturally modeled by the set of weak compositions of size 4 of N. Each weak composition is represented by the following ordered arrangement: (r1, s0, r0, s1). There is nothing special about this particular arrangement; it is just one of the 4! = 24 different ways that those 4 distinct symbols could have been ordered. Two of the remaining 23 possibilities are (r0, r1, s0, s1) and (r0, s0, r1, s1).

                                  document is relevant?
                                  Yes      No       Total
query term is present     No      r0       s0       n0
in the document?          Yes     r1       s1       n1
                          Total   R        S        N

Figure 2.2: The relationships discussed earlier between N, R, S, r0, r1, s0, s1, n0, and n1 can be succinctly expressed by this 2x2 contingency table.

As described above, the documents in a collection can be divided into 4 non-overlapping (i.e., mutually exclusive) categories. The set of weak compositions for a particular query and an associated document collection of cardinality N represents all of the distinct ways that the N documents could be distributed, by count, among those 4 categories. How to calculate the cardinality of this set is discussed below.

A primary item of interest in some of the modeling scenarios that this research explored was the sample space of weak compositions for an N-document collection. More specifically, the interest was in the generation of the sample space and the number of weak compositions in this space whose elements satisfied particular mathematical constraints. This research mainly used weak compositions of size 4 to help determine probabilities or proportions in various modeling scenarios.
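As an illustration of generating the sample space and counting the members that satisfy a constraint, the sketch below enumerates all weak compositions of size 4 for a small N and reports a proportion. The constraint shown (exactly one relevant document) and the function names are hypothetical examples, not scenarios taken from this research.

```python
from fractions import Fraction
from typing import Callable, Iterator, Tuple

Weak4 = Tuple[int, int, int, int]  # ordered as (r1, s0, r0, s1)


def weak_4_compositions(N: int) -> Iterator[Weak4]:
    """Yield every weak composition (r1, s0, r0, s1) of N into 4 parts."""
    for r1 in range(N + 1):
        for s0 in range(N - r1 + 1):
            for r0 in range(N - r1 - s0 + 1):
                s1 = N - r1 - s0 - r0  # the last part is forced
                yield (r1, s0, r0, s1)


def proportion(N: int, constraint: Callable[[int, int, int, int], bool]) -> Fraction:
    """Fraction of the sample space whose members satisfy the constraint."""
    space = list(weak_4_compositions(N))
    hits = sum(1 for w in space if constraint(*w))
    return Fraction(hits, len(space))


def exactly_one_relevant(r1: int, s0: int, r0: int, s1: int) -> bool:
    """Hypothetical constraint: the collection has R = r0 + r1 = 1."""
    return r0 + r1 == 1


if __name__ == "__main__":
    print(len(list(weak_4_compositions(3))))    # size of the sample space
    print(proportion(3, exactly_one_relevant))  # share satisfying the constraint
```

The proportion is a simple count ratio over the sample space; reading it as a probability would require the further assumption that every weak composition is equally likely.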

In IR terms, a weak composition of size 4 that is not also a composition (i.e., a member of W\C) is a collection of N documents where at least one of the following conditions is true: the number of relevant documents that contain the query term is 0 (i.e., r1 = 0), the number of relevant documents that do not contain the query term is 0 (i.e., r0 = 0), the number of non-relevant documents that contain the query term is 0 (i.e., s1 = 0), or the number of non-relevant documents that do not contain the query term is 0 (i.e., s0 = 0).

Also, in IR terms, a composition of size 4 is a collection of N documents where all of the following conditions must be true: the number of relevant documents that contain the query term is positive (i.e., r1 ≥ 1), the number of relevant documents that do not contain the query term is positive (i.e., r0 ≥ 1), the number of non-relevant documents that contain the query term is positive (i.e., s1 ≥ 1), and the number of non-relevant documents that do not contain the query term is positive (i.e., s0 ≥ 1).
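The conditions can be checked mechanically. The sketch below follows the definitions quoted earlier from Bóna (2006), under which every composition is also a weak composition; the function name `classify` and its return labels are hypothetical.

```python
from typing import Sequence


def classify(parts: Sequence[int], n: int) -> str:
    """Classify an ordered arrangement of integers with respect to n.

    Returns "composition" if all parts are >= 1 and sum to n,
    "weak only" if the parts are >= 0, sum to n, and at least one is 0,
    and "neither" otherwise.
    """
    if any(p < 0 for p in parts) or sum(parts) != n:
        return "neither"
    if all(p >= 1 for p in parts):
        return "composition"  # also a weak composition by definition
    return "weak only"        # a member of W\C


if __name__ == "__main__":
    print(classify((1, 1, 1, 2), 5))  # all four categories are non-empty
    print(classify((2, 0, 1, 2), 5))  # one category is empty
```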

According to Bóna (2006) and Weisstein (2003), the number of compositions of n into k parts is given by

\[ C_k(n) = \binom{n-1}{k-1} \tag{2.2.1} \]

and the number of weak compositions of n into k parts is given by

\[ \tilde{C}_k(n) = \binom{n+k-1}{k-1}, \tag{2.2.2} \]

where $\binom{n}{k}$ denotes the number of combinations of n things taken k at a time, $C_k(n)$ denotes the number of compositions of n into k parts, and $\tilde{C}_k(n)$ denotes the number of weak compositions of n into k parts. Figure 2.1 on page 23 illustrates an important relationship between the set of weak compositions of n into k parts and the set of compositions of n into k parts.
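Both counting formulas can be evaluated directly with Python's `math.comb` and sanity-checked against a brute-force count for small inputs. This is a sketch assuming n ≥ 1 and k ≥ 1; the function names are hypothetical.

```python
from itertools import product
from math import comb
from typing import Tuple


def num_compositions(n: int, k: int) -> int:
    """Equation 2.2.1: compositions of n into k positive parts (n, k >= 1)."""
    return comb(n - 1, k - 1)


def num_weak_compositions(n: int, k: int) -> int:
    """Equation 2.2.2: weak compositions of n into k non-negative parts."""
    return comb(n + k - 1, k - 1)


def brute_force_counts(n: int, k: int) -> Tuple[int, int]:
    """Count (compositions, weak compositions) by exhaustive enumeration."""
    strong = weak = 0
    for parts in product(range(n + 1), repeat=k):
        if sum(parts) == n:
            weak += 1
            if all(p >= 1 for p in parts):
                strong += 1
    return strong, weak


if __name__ == "__main__":
    n, k = 5, 4
    print(num_compositions(n, k), num_weak_compositions(n, k))
    print(brute_force_counts(n, k))
```

The brute-force check visits (n + 1)^k tuples, so it is only practical for small n and k; the closed forms have no such limit.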

Related symbols that are used later in this work are C(n, k) (an alternate notation for $\binom{n}{k}$), P(n, k) to denote the number of permutations of n things taken k at a time, and n! to denote the number of permutations of n distinct objects.

The first identity above (i.e., Equation 2.2.1) provides a way to determine the cardinality of the sample space when each integer in a composition must be at least 1. The second identity (i.e., Equation 2.2.2) gives the cardinality when any of the integers is allowed to be 0. The latter identity is expected to be of more use in this research, mainly because any of the 4 integers in an ordered arrangement for a modeling scenario could have the value 0. For example, the weak composition (r1, s0, r0, s1) = (1, 5, 0, 3) represents a nine-document collection (1 + 5 + 0 + 3 = 9) that has 1 relevant document where the query term is present, 5 non-relevant documents where the term is absent, 0 relevant documents where the term is absent, and 3 non-relevant documents where the term is present.
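As a worked check of the two identities on this nine-document example (k = 4, N = 9), the arithmetic runs as follows; the subtraction in the last line is an added illustration, not a result quoted from the source.

```latex
\begin{align*}
N &= r_1 + s_0 + r_0 + s_1 = 1 + 5 + 0 + 3 = 9, \\
R &= r_0 + r_1 = 0 + 1 = 1, \qquad S = s_0 + s_1 = 5 + 3 = 8, \\
\tilde{C}_4(9) &= \binom{9 + 4 - 1}{4 - 1} = \binom{12}{3} = 220, \\
C_4(9) &= \binom{9 - 1}{4 - 1} = \binom{8}{3} = 56, \\
\tilde{C}_4(9) - C_4(9) &= 220 - 56 = 164.
\end{align*}
```

So the sample space for a nine-document collection contains 220 weak compositions, of which 56 are compositions and 164, including (1, 5, 0, 3), have at least one part equal to 0.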

2.2.7 Combinatorial Generation and Enumeration Algorithms
