Evaluation of a verification laboratory


III. RESULTS AND DISCUSSION

III.1. Evaluation of a verification laboratory

As discussed in Chapter 3, users and items are represented as user-concept and item-concept hierarchies, respectively. This section presents a novel way of estimating the likelihood that user $u_a$ is interested in item $b_k$ using a probabilistic language modelling (LM) approach.

This likelihood is approximated by the probability that the user's concepts of interest are generated by the item's descriptors, i.e. $P(u_a|b_k)$. The basic assumption is that if the LM of an item can generate the concepts that characterise the user's interests, then the user is likely to be interested in that item.

This novel inferential language model (ILM) approach makes both semantic and statistical inferences to estimate the probability that a user $u_a$ is interested in an item $b_k$. An LM is a probabilistic function that assigns a probability to a string $t$ drawn from some vocabulary set $T$. In the field of IR, probabilistic LMs have been applied to estimate the relevance of a document $d$ with respect to a query $q$ in terms of its generation likelihood [68, 90]. Language modelling has also been applied successfully to opinion mining [58, 60]. In the present context, the usual query is replaced by a descriptor of the user's interests, and a document is a set of terms describing the item.

$$P(d|q) \propto P(d) \prod_{t \in q} \big((1-\lambda)P(t|M_D) + \lambda P(t|M_d)\big) \tag{4.7}$$

Here, $P(d|q)$ is the probability that document $d$ is relevant to a given query $q$, and it is proportional ($\propto$) to $P(d)\prod_{t \in q}((1-\lambda)P(t|M_D) + \lambda P(t|M_d))$; $M_d$ is a language model built for each document $d$; $M_D$ is a language model built for the entire document collection. This equation combines the probability of the document with the general collection frequency of the terms $t$. To generate the prediction scores needed to make recommendations, Equation 4.7 has been modified as follows:

$$P(b_k|u_a) \propto P(b_k) \prod_{c_i \in H_{u_a}} \big((1-\lambda)P(c_i|M_D) + \lambda P(c_i|M_{b_k})\big), \tag{4.8}$$

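where $H_{u_a}$ is the user's concept hierarchy and $D$ is the set of all item-concept hierarchies (which might be rated by other users). Because the logarithm is monotonic, taking the logarithm of the right-hand side of Equation 4.8 preserves the ranking of items, so $\propto$ is read below as rank equivalence. From Equation 4.8, we derive the following results:

$$P(b_k|u_a) \propto \ln\Big\{P(b_k) \prod_{c_i \in H_{u_a}} \big((1-\lambda)P(c_i|M_D) + \lambda P(c_i|M_{b_k})\big)\Big\}$$

$$P(b_k|u_a) \propto \ln(P(b_k)) + \sum_{c_i \in H_{u_a}} \ln\big((1-\lambda)P(c_i|M_D) + \lambda P(c_i|M_{b_k})\big) \tag{4.9}$$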
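The log form in Equation 4.9 can be illustrated with a minimal Python sketch. The function name `log_score`, the dictionary-based models and the toy probabilities are assumptions made for illustration only; they are not taken from the thesis.

```python
import math

def log_score(prior, user_concepts, p_collection, p_item, lam=0.5):
    """Log-form score of Equation 4.9 for one item (hypothetical helper).

    prior         -- P(b_k), the item prior
    user_concepts -- concepts c_i in the user's hierarchy H_ua
    p_collection  -- dict mapping concept -> P(c_i | M_D)
    p_item        -- dict mapping concept -> P(c_i | M_bk)
    """
    total = math.log(prior)
    for c in user_concepts:
        mixed = (1 - lam) * p_collection.get(c, 0.0) + lam * p_item.get(c, 0.0)
        # math.log raises ValueError if the mixed probability is zero,
        # i.e. if the concept is unseen in both the item and the collection.
        total += math.log(mixed)
    return total

# Toy data (assumed values, for illustration only).
p_collection = {"data mining": 0.2, "databases": 0.3, "sport": 0.5}
item_a = {"data mining": 0.5, "databases": 0.5}
item_b = {"sport": 1.0}
user = ["data mining", "databases"]

score_a = log_score(0.5, user, p_collection, item_a)
score_b = log_score(0.5, user, p_collection, item_b)
```

With these toy values, the item whose descriptors overlap the user's concepts receives the higher score, and exponentiating a score recovers the product form of Equation 4.8.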

A concept that describes a user's interest but is absent from an item $b_k$ does not necessarily mean that the item is irrelevant to the user's interests, because the document indexing scheme is imperfect and synonymous concepts are sometimes used in concept hierarchies. For instance, if the user's interest is described by the descriptor data mining, an item $b_k$ (e.g., a book) about knowledge discovery from databases is very relevant even though that descriptor does not appear in the item. To reduce the effect of underestimating the probability of unseen concepts in an LM, various document smoothing methods have been proposed [68, 90]. The basic idea is to replace the zero probability of an unseen concept with a small non-zero value. With Jelinek-Mercer smoothing [68], $P(c_i|M_{b_k})$ is updated using the following equation:

$$P(c_i|M_{b_k}) = (1-\lambda)P_{ML}(c_i|M_{b_k}) + \lambda P_{ML}(c_i|M_D) \tag{4.10}$$

$$P_{ML}(c_i|M_D) = \frac{tf(c_i, D)}{|D|}, \tag{4.11}$$

where $D$ is the set of all item-concept hierarchies; $P_{ML}(c_i|M_D)$ is the maximum likelihood estimate of $c_i$ under the LM of the entire item collection; $\lambda$ is the Jelinek-Mercer smoothing parameter, which may take values in the range [0.1, 0.7] [84, 116]; and $tf(c_i, D)$ is the occurrence frequency of $c_i$ in the entire item collection $D$, i.e. in all item hierarchies.
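As a rough illustration of Equations 4.10 and 4.11, the sketch below smooths an item model with the collection model. It assumes that $|D|$ is the total number of concept occurrences in the collection and that each item is given as a flat list of concepts; the helper names `p_ml` and `jm_smoothed` and the toy items are hypothetical.

```python
from collections import Counter

def p_ml(concept, counts):
    """Maximum likelihood estimate: count of the concept over all counts."""
    total = sum(counts.values())
    return counts[concept] / total if total else 0.0

def jm_smoothed(concept, item_counts, collection_counts, lam=0.3):
    """Jelinek-Mercer smoothing in the form of Equation 4.10."""
    return ((1 - lam) * p_ml(concept, item_counts)
            + lam * p_ml(concept, collection_counts))

# Toy collection (assumed data): each item is a flat list of concepts.
items = {
    "b1": ["data mining", "databases", "data mining"],
    "b2": ["knowledge discovery", "databases"],
}
collection_counts = Counter(c for concepts in items.values() for c in concepts)
b2_counts = Counter(items["b2"])

p_seen = jm_smoothed("databases", b2_counts, collection_counts)
p_unseen = jm_smoothed("data mining", b2_counts, collection_counts)
```

The concept data mining never occurs in item b2, yet its smoothed probability is non-zero because the collection model contributes the $\lambda$-weighted term.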

Using the item-collection model $M_D$ to smooth an item language model $M_{b_k}$ might partially solve the problem of the zero probability of an unseen user term. However, the generation probability might still be severely underestimated. For example, an item with the descriptor knowledge discovery from databases is very likely to match the interest of a user described by data mining, i.e. it should receive a high generation probability. Accordingly, an ILM that accounts for both semantic and statistical term associations is proposed to address this issue. Our inferential language model updates $P(c_i|M_{b_k})$ using the following equations:

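$$P(c_i|M_{b_k}) = (1-\lambda)\Big[(1-\gamma)P_{ML}(c_i|M_{b_k}) + \gamma P_{INF}(c_i|M_{b_k})\Big] + \lambda P_{ML}(c_i|M_D) \tag{4.12}$$

$$P_{INF}(c_i|M_{b_k}) = \frac{\sum_{c_i, c_j \in R} P(c_i|c_j)P(c_j|M_{b_k})}{|R|} = \frac{\sum_{c_i, c_j \in R} P(c_j \rightarrow c_i)P(c_j|M_{b_k})}{|R|}, \tag{4.13}$$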

where $P_{INF}(c_i|M_{b_k})$ is the item inferential language model and $\gamma$ weights the inferential component against the maximum likelihood estimate. This is an extension of the original ILM developed by Nie et al. [78], which considers only the semantic term relationships captured in WordNet. In this thesis, the rule set $R$ contains concept relations of the form $c_j \rightarrow c_i$, e.g., soccer → sport, which might be acquired from an external source such as WordNet. Statistical concept associations such as wii → game, in contrast, are dynamically discovered from the set of item descriptions via context-sensitive text mining or sequential text mining methods [59].
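The following Python sketch shows one way Equations 4.12 and 4.13 could be computed. It rests on several assumptions: the rule set is a dictionary mapping `(c_j, c_i)` pairs to the strength $P(c_j \rightarrow c_i)$, $|R|$ is the total number of rules, and $P(c_j|M_{b_k})$ is taken from the unsmoothed item model. The function names and all numeric values are hypothetical.

```python
def p_inf(concept, item_model, rules):
    """Inferential probability in the form of Equation 4.13.

    rules maps (c_j, c_i) -> P(c_j -> c_i); only rules whose consequent
    is the target concept contribute to the sum.
    """
    if not rules:
        return 0.0
    total = sum(
        strength * item_model.get(cj, 0.0)
        for (cj, ci), strength in rules.items()
        if ci == concept
    )
    return total / len(rules)

def ilm_probability(concept, item_model, collection_model, rules,
                    lam=0.3, gamma=0.5):
    """Combined estimate in the form of Equation 4.12."""
    direct = item_model.get(concept, 0.0)
    inferred = p_inf(concept, item_model, rules)
    return ((1 - lam) * ((1 - gamma) * direct + gamma * inferred)
            + lam * collection_model.get(concept, 0.0))

# Toy data (assumed values, for illustration only).
rules = {("knowledge discovery", "data mining"): 0.8,
         ("soccer", "sport"): 0.9}
item_model = {"knowledge discovery": 0.5, "databases": 0.5}
collection_model = {"data mining": 0.1, "knowledge discovery": 0.1,
                    "databases": 0.3, "sport": 0.5}

p_with_inference = ilm_probability("data mining", item_model,
                                   collection_model, rules)
p_smoothing_only = ilm_probability("data mining", item_model,
                                   collection_model, rules, gamma=0.0)
```

Setting `gamma=0.0` reduces the estimate to plain Jelinek-Mercer smoothing, so comparing the two values shows how the rule knowledge discovery → data mining raises the probability of a concept that the item never mentions.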

Based on the above discussion, $P(c_i|M_{b_k})$ is very difficult to calculate exactly because of the hierarchical relationships between concepts. The extended ILM suggests that only concept associations need to be considered when approximating $P(c_i|M_{b_k})$. Generally speaking, all possible associations can be described as the similarity between item $b_k$'s concepts and user $u_a$'s concepts. Hence, in this thesis, the user-item concept hierarchy similarity $cs(u_a, b_k)$ is used to approximate $P(c_i|M_{b_k})$. One might also simply use $P(c_i)$ to replace $P(c_i|M_D)$, where $P(c_i)$ is the probability of concept $c_i$ in all relevant item concept hierarchies. Thus, if we let $npop(b_k) = \ln(P(b_k))$ describe the popularity of a given item, then based on the above analysis and Equation 4.9 we propose the following approximation for estimating $P(b_k|u_a)$:

$$p\_score(u_a, b_k) = \alpha \times npop(b_k) + (1-\alpha)\Big[\beta \times cs(u_a, b_k) + (1-\beta) \sum_{c \in H_{u_a} \cap H_{b_k}} \ldots \Big]$$

where $\alpha$ and $\beta$ are experimental coefficients between 0 and 1.
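The weighted combination can be sketched in Python as follows. The summand of the final sum is not recoverable from the text above, so this sketch assumes it is the concept probability $P(c)$ of each concept shared by the user and item hierarchies; that choice, the function signature and the default weights are all assumptions rather than the thesis's definition.

```python
import math

def p_score(npop, cs, shared_concept_probs, alpha=0.3, beta=0.6):
    """Weighted prediction score (hypothetical sketch).

    npop                 -- item popularity, ln(P(b_k))
    cs                   -- user-item concept hierarchy similarity cs(u_a, b_k)
    shared_concept_probs -- P(c) for each concept c shared by H_ua and H_bk
                            (ASSUMED summand; the source equation is truncated)
    alpha, beta          -- coefficients between 0 and 1
    """
    concept_term = sum(shared_concept_probs)
    return alpha * npop + (1 - alpha) * (beta * cs + (1 - beta) * concept_term)

# Toy inputs (assumed values, for illustration only).
npop = math.log(0.05)
probs = [0.10, 0.05]
similar_item = p_score(npop, 0.8, probs)
dissimilar_item = p_score(npop, 0.2, probs)
```

With popularity and shared concepts held fixed, the item with the higher hierarchy similarity obtains the higher score; `alpha=1.0` reduces the score to popularity alone.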

The proposed CTLM adapts the LM to account for the probability of taxonomic concepts, so that the relevance between users and items can be used to improve subsequent recommendations. The CTLM approach is composed of three parts: (1) the item popularity $npop(b_k)$, (2) the user-item concept hierarchy similarity $cs(u_a, b_k)$ and (3) the concept probability $P(c_i)$. The user-item concept hierarchy similarity and the item popularity were described in Section 4.3.1 and Subsection 4.4.1.1. The following subsections discuss the details of each constituent part and then present the proposed CTLM recommender algorithm.

4.4.2.2 Concept Probability

The probability $P(c_i)$ of a concept in the item concept hierarchies $H_{b_k}$ is relevant to user $u_a$'s preferences or interests. We assume that user $u_a$ is interested in the items $b_k$ because the user has an affinity for the concepts $c_i$ intrinsic to those items. $P(c_i)$ describes the occurrence of $c_i$ in all relevant item concept hierarchies $H_{b_k}$. It can be approximated by the frequency with which $c_i$ occurs in items $b_k \in B$ (the number of items whose hierarchy contains $c_i$), divided by the total number of concepts used across the entire item collection $B$ in the training set. The probability $P(c_i)$ can thus be calculated as follows:

$$P(c_i) = \frac{|\{b_k \in B \mid c_i \in H_{b_k}\}|}{\sum_{b_k \in B} |\{c_i \mid c_i \in H_{b_k}\}|} \tag{4.15}$$
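Equation 4.15 amounts to a simple counting procedure, sketched below in Python. The sketch assumes each item hierarchy is flattened into a set of concept labels; the function name `concept_probability` and the toy collection are hypothetical.

```python
def concept_probability(concept, hierarchies):
    """Concept probability in the form of Equation 4.15.

    hierarchies maps item id -> set of concepts in that item's hierarchy
    (a flattened view of H_bk, assumed for this sketch).
    """
    # Numerator: number of items whose hierarchy contains the concept.
    containing = sum(1 for concepts in hierarchies.values()
                     if concept in concepts)
    # Denominator: total number of concepts over all item hierarchies.
    total = sum(len(set(concepts)) for concepts in hierarchies.values())
    return containing / total if total else 0.0

# Toy item collection B (assumed data).
hierarchies = {
    "b1": {"data mining", "databases"},
    "b2": {"databases", "sport"},
    "b3": {"sport"},
}
p_databases = concept_probability("databases", hierarchies)
all_concepts = set().union(*hierarchies.values())
total_mass = sum(concept_probability(c, hierarchies) for c in all_concepts)
```

Because every concept occurrence is counted once in the denominator, the probabilities of all concepts in the collection sum to one under this flattened representation.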
