The relations defined by our data model in Section 3.1 can be implemented in database tables as (key, value) pairs. The key is a tuple of the entities involved in a relation (in our case, users, tags, or documents), and the value is the weight defined by that relation (in our case, the friendship strength, tag similarity, score, etc.).

By creating an appropriate index, the entries in such database tables can be sorted in descending order of their values. In this way, the database tables follow the semantics of inverted lists, where entries are sorted not by their keys but by the values associated with those keys. Hence, we use the term inverted list as a synonym for a database table implementing this functionality.

With tables sorted in this inverted way, entities can be fetched sequentially in descending order of their values without the far more expensive random accesses to the database. The tables can thus be used like inverted lists with typical top-k query processing algorithms.

Top-k query processing is a cornerstone of ranked document retrieval and of many other modern applications. Ideally, an efficient query processor does not read the entire input (i.e. all (key, value) pairs from the underlying relations) but terminates early, as soon as the k best results can be safely determined, using techniques such as priority queues, bounds for partially computed aggregation values, and pruning of intermediate results. These issues have been intensively researched in recent years (e.g. [32, 35, 44, 54, 61, 91, 98, 125]) and are well understood.

Most top-k algorithms scan, i.e. sequentially read, inverted lists and aggregate precomputed per-term or per-dimension scores into in-memory “accumulators”, one for each candidate document. The optimisations in the IR literature aim to limit the number of accumulators and the scan depth on the index lists in order to terminate the algorithm as early as possible. This involves a variety of heuristics for pruning potential result candidates and stopping some or all of the index list traversals as early as possible, ideally after having seen only short prefixes of the potentially very long lists.

For this, it is often beneficial that the entries in inverted lists are kept in descending order of score values rather than being sorted by document identifiers.

Our algorithms SOCIALMERGE and CONTEXTMERGE also operate on index structures corresponding to inverted lists, since both fall into the well-established framework of so-called threshold algorithms (TA). They depend on impact-sorted inverted lists for efficient top-k query processing and require score aggregation functions to be monotonic (e.g. a weighted summation). Ultimately, we employ variants of Fagin’s Threshold Algorithm (TA) [54] with flexible scheduling of list scans ([124, 20]). Hence, the design decision to cast our data model into inverted lists has been crucial for both of our algorithms. The details of SOCIALMERGE and CONTEXTMERGE are given in Chapters 5 and 6, respectively.
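To illustrate the general idea of threshold-style processing over score-sorted inverted lists, the following is a minimal sketch of a basic threshold algorithm with summation as the monotonic aggregation function. It is not the SOCIALMERGE or CONTEXTMERGE algorithm and does not include the flexible scan scheduling mentioned above. The function name `threshold_top_k`, the in-memory list layout, and the assumption of non-negative scores (a document missing from a list contributes 0) are choices made for this illustration only.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_top_k(
    lists: List[List[Tuple[str, float]]], k: int
) -> List[Tuple[str, float]]:
    """Return the k documents with the highest summed score.

    Each input list holds (doc_id, score) pairs sorted by descending score.
    Assumption: scores are non-negative and a document that is missing
    from a list contributes 0 for that list.
    """
    if k <= 0:
        return []
    # Random-access lookup tables, one per list (illustrative helper).
    lookup: List[Dict[str, float]] = [dict(entries) for entries in lists]
    seen = set()
    top: List[Tuple[float, str]] = []  # min-heap of (score, doc_id)
    max_depth = max((len(entries) for entries in lists), default=0)
    for depth in range(max_depth):
        # Upper bound on the total score of any document not yet seen.
        threshold = 0.0
        for entries in lists:
            if depth >= len(entries):
                continue  # exhausted list: unseen documents score 0 here
            doc, score = entries[depth]
            threshold += score
            if doc in seen:
                continue
            seen.add(doc)
            total = sum(table.get(doc, 0.0) for table in lookup)
            if len(top) < k:
                heapq.heappush(top, (total, doc))
            elif total > top[0][0]:
                heapq.heapreplace(top, (total, doc))
        # Stop early once the k-th best score reaches the threshold.
        if len(top) == k and top[0][0] >= threshold:
            break
    return [(doc, score) for score, doc in sorted(top, key=lambda p: (-p[0], p[1]))]
```

For two lists `[("a", 4.0), ("b", 3.0), ("c", 1.0)]` and `[("b", 5.0), ("c", 2.0), ("a", 1.0)]` with k = 2, the scan can stop after the second row of each list, because the second-best total seen so far (5.0) already equals the threshold (3.0 + 2.0).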

4 Problem Statement

In order to introduce our SOCIALMERGE and CONTEXTMERGE algorithms and the scoring models used, we first formalise the notion of a query. In line with the free-text tagging of social networks, we define a query as follows.

Definition 4.1 (Query $q_U$). A query $q_U = \{t_0, \ldots, t_{n-1}\}$ is a set of query tags issued by a query initiator $U$ to the social network. A query tag is a keyword corresponding to a tag $t_i$ used by some user in the network to annotate a document.

The result of a query is then defined as follows:

Definition 4.2 (Query Result $R_U$). The result $R_U$ of a query $q_U$ is a ranked list of documents, each annotated with at least one of the query tags $t_i \in q_U$ or with a tag $t_e$ similar to $t_i$, which is determined during query processing by expanding a tag $t_i \in q_U$ to $t_e$. The result list is ordered according to a query-specific document score.

In particular, a query-specific document score enables top-k query processing for efficiently retrieving the k documents with the highest scores for a query. The definition is in line with the current querying model of popular search engines. However, in contrast to those search engines, the document scores used in our model (details are given in Sections 5.1 and 6.2) also contain a social component: the content-based score of a document, which is query-specific by definition, is additionally user-specific, i.e. it depends on the social context of the query initiator.

Even though commercial search engines offer similar personalisation approaches, social tagging networks are the natural habitat for exploring and improving this idea further, since they offer the additional asset of known friendship relations and the friends’ tagging behaviour. Moreover, when these additional assets are considered for computing user-specific query results, queries become high-dimensional and traditional IR text-retrieval methods are bound to fail. The dimensions of a query are defined as follows:

Definition 4.3 (Query Dimensions). The components involved in computing the final score of a document with respect to a query are called the query dimensions.

Accordingly, high-dimensional queries involve a high number of components during the result computation.

Unlike in standard text retrieval, the dimensions of a query are not only the tags in the query $q_U$ once friendship relations in social tagging networks and our proposed scoring methods (see Sections 5.1 and 6.2) are taken into account. Instead, for each query tag $t_i$, the score of a document is additionally influenced by a user-specific score. Assuming m users in the system and a query with n tags, a query therefore has m · n dimensions. If we additionally consider expansions of tags, the number of query dimensions increases further. Such high-dimensional queries cannot be handled efficiently by the existing variants of threshold algorithms (TA) for standard text retrieval, since usually a corresponding inverted list has to be precomputed for each dimension and is eventually involved in the query processing.

In the following chapters, we introduce two different algorithms for efficiently retrieving content-based and user-specific query results for high-dimensional queries from social tagging networks. Both algorithms are based on the data model introduced in Section 3.1.

5 SOCIALMERGE Algorithm

Our first algorithm developed for the SENSE framework is called SOCIALMERGE. In the following, we introduce the associated scoring model, followed by details about the query processing.

5.1 Scoring Model

The scoring model used with our SOCIALMERGE algorithm is based on the relations defined by our data model in Section 3.1. The score for a document depends on the tags that have been used to annotate the document and on the users who have tagged the document. In this scoring model, we perform semantic expansion by considering tags that are similar to the keywords appearing in a query, and social expansion by preferring documents tagged by close friends. More formally, let

$$q_U = \{t_0, \ldots, t_{n-1}\}$$

be a query with query tags $t_0, \ldots, t_{n-1}$. We define the social score $\mathrm{ssc}(q_U, d)$ of a document $d$ with respect to a query $q_U$ initiated by user $U$ in the following way:

Definition 5.1 (Social Score $\mathrm{ssc}(q_U, d)$).

$$\mathrm{ssc}(q_U, d) = \sum_{t_i \in q_U} \mathrm{sts}(t_i, d, U)$$

where $\mathrm{sts}(t_i, d, U)$ is the single tag score of a document $d$ with respect to a query tag $t_i \in q_U$ and the user $U$ who issued the query.

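We define the single tag score $\mathrm{sts}(t_i, d, U)$ as follows:

Definition 5.2 (Single Tag Score $\mathrm{sts}(t_i, d, U)$).

$$\mathrm{sts}(t_i, d, U) = \mathrm{DR}(d) \times \sum_{U_f \in \mathrm{FLIST}(U)} s_f(U, U_f) \times \max_{t' \in \mathrm{SIMTAGS}(t_i)} \bigl\{ \mathrm{tsim}(t', t_i) \cdot s_d(U_f, t', d) \bigr\} \times \mathrm{UR}(U_f)$$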
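How the factors of Definitions 5.1 and 5.2 combine can be sketched in a few lines of code. This is only an illustration of the arithmetic, not the query processing of SOCIALMERGE: all component values are assumed to be precomputed and passed in as plain dictionaries, and the names `friend_strength`, `similar_tags`, `tagging_score`, `user_rank`, and `doc_rank` are hypothetical. A tag without an entry in `similar_tags` is assumed to be similar only to itself with similarity 1.

```python
from typing import Dict, Iterable, Tuple


def single_tag_score(
    tag: str,
    doc: str,
    friend_strength: Dict[str, float],               # U_f -> s_f(U, U_f)
    similar_tags: Dict[str, Dict[str, float]],       # t_i -> {t': tsim(t', t_i)}
    tagging_score: Dict[Tuple[str, str, str], float],  # (U_f, t', d) -> s_d
    user_rank: Dict[str, float],                     # U_f -> UR(U_f)
    doc_rank: Dict[str, float],                      # d -> DR(d)
) -> float:
    """Single tag score: DR(d) times a sum over the friends of the initiator."""
    # Assumption: a tag with no similarity entry is similar only to itself.
    expansions = similar_tags.get(tag, {tag: 1.0})
    total = 0.0
    for friend, strength in friend_strength.items():
        # Best expansion of the query tag for this friend and document.
        best = max(
            (sim * tagging_score.get((friend, t, doc), 0.0)
             for t, sim in expansions.items()),
            default=0.0,
        )
        total += strength * best * user_rank.get(friend, 0.0)
    return doc_rank.get(doc, 0.0) * total


def social_score(query: Iterable[str], doc: str, **components) -> float:
    """Social score: sum of the single tag scores over all query tags."""
    return sum(single_tag_score(tag, doc, **components) for tag in query)
```

Because the maximum is taken per friend, each friend contributes through only one of the similar tags, namely the one with the highest product of tag similarity and document score.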

Before defining each of the components used in the definition of the single tag score $\mathrm{sts}(t_i, d, U)$, we introduce the notion of friendship with regard to the scoring model used with our SOCIALMERGE algorithm.

With this scoring model, we use only the user-user relation

$\mathit{Friendship}(U_1, U_2, \mathit{type} = \mathit{social}, s_f)$

of type social as presented in Section 3.1.2 and abstractly given in Definition 3.1. We implement the social friendship relation in our scoring model based on the friendship graph in social tagging networks (see Definition 3.2) by considering the shortest path distances of users in the graph. Since we are using a different notion of shortest path in later chapters, we explicitly define the shortest path for our SOCIALMERGE algorithm in the following (obvious) way:

Definition 5.3 (Shortest Path). Let $G$ be the directed, unweighted friendship graph of a social tagging network, and let $U$ and $U_f$ be two different users in $G$. The length of a path in $G$ leading from $U$ to $U_f$ is equal to its number of edges. A path of shortest length leading from $U$ to $U_f$ is called a shortest path from $U$ to $U_f$.

Based on this definition, we can now define the distance between two users in the friendship graph.

Definition 5.4 (Distance of $U_f$ wrt. $U$). Let $G$ be the directed, unweighted friendship graph of a social tagging network, with $U$ and $U_f$ being two different users in $G$. The distance $\mathrm{dist}(U, U_f)$ of $U_f$ with respect to $U$ is equal to the length of the shortest path leading from $U$ to $U_f$ in $G$ if such a path exists, and $\infty$ otherwise, i.e.

$$\mathrm{dist}(U, U_f) = \begin{cases} \text{length of a shortest path from } U \text{ to } U_f & \text{if such a path exists} \\ \infty & \text{otherwise} \end{cases}$$

Furthermore, we denote with $\mathrm{FLIST}(U)$ the list of all social friends of a user $U$. Formally, we define the social friendship of two users as follows:

Definition 5.5 (Social Friendship). A user $U_f \neq U$ is a (social) friend of $U$ if and only if there is a path from $U$ to $U_f$ in the friendship graph of a social network, i.e.

$$U_f \text{ is a friend of } U \iff U_f \in \mathrm{FLIST}(U) \iff 0 < \mathrm{dist}(U, U_f) < \infty$$

We denote with $\mathrm{FLIST}(U)$ the list of all transitive friends of $U$.

The measure in our scoring model for the social friendship strength $s_f(U, U_f)$ of a user $U_f$ with regard to a user $U$ favours users that are closer to $U$ in the friendship graph of the social network. The intuition is that if $U$ issues a query, results from users close to $U$ in the friendship graph are preferred, because a user is likely to be more interested in results from her friends, or to trust them more than unknown users.

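We define the friendship strength as follows:

Definition 5.6 (Friendship Strength $s_f(U, U_f)$).

$$s_f(U, U_f) = \begin{cases} 0 & \text{if } U_f = U \\ \dfrac{1}{\mathrm{dist}(U, U_f)^2} & \text{if } U_f \in \mathrm{FLIST}(U) \\ \dfrac{1}{|U|^2} & \text{otherwise} \end{cases}$$

where $|U|$ is the number of all users in the network.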

The friendship strength of a user $U$ with respect to herself is set to 0, while the friendship strength for a social friend $U_f$ is equal to the inverse of the square of the shortest path distance from $U$ to $U_f$. If there is no path between two users in the friendship graph of a social network, the friendship strength is set to a constant equal to the inverse of the square of the longest possible distance, that is, a path leading over all users.
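The distance and the friendship strength described above can be sketched with a breadth-first search on the directed, unweighted friendship graph. The adjacency-dictionary representation and the function names are assumptions made for this illustration. In particular, the fallback for unreachable users takes the longest possible distance to be the number of users in the graph, which is one reading of "a path leading over all users" and should be treated as an assumption.

```python
from collections import deque
from typing import Dict, List


def distances_from(graph: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Shortest path lengths (edge counts) from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        user = queue.popleft()
        for friend in graph.get(user, []):
            if friend not in dist:
                dist[friend] = dist[user] + 1
                queue.append(friend)
    return dist  # users without an entry are unreachable (distance infinity)


def friendship_strength(graph: Dict[str, List[str]], u: str, u_f: str) -> float:
    """Friendship strength of u_f with respect to u."""
    if u == u_f:
        return 0.0  # a user has strength 0 with respect to herself
    d = distances_from(graph, u).get(u_f)
    if d is not None:
        return 1.0 / d ** 2  # inverse square of the shortest path distance
    # Assumption: the longest possible distance is taken as the number of users.
    users = set(graph) | {f for friends in graph.values() for f in friends}
    return 1.0 / len(users) ** 2
```

With the graph `{"u": ["a"], "a": ["b"], "b": [], "x": []}`, a direct friend of `u` gets strength 1, the friend at distance 2 gets 0.25, and the unreachable user `x` gets the constant 1/16 under the stated assumption.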

After having cast the friendship relation defined in our data model into our scoring model, we now define the remaining components of the single tag score $\mathrm{sts}(t_i, d, U)$ given in Definition 5.2.

The value of $\mathrm{tsim}(t, t')$ corresponds to the tag similarity of tag $t'$ with regard to tag $t$ and implements in our scoring model the tag-tag relation

$\mathit{TagSimilarity}(t_1, t_2, \mathit{tsim})$

introduced with our data model in Section 3.1.2.

In our scoring model, the similarity of tags is based on the co-occurrence of tags in the document collection. Formally, it is defined as follows:

Definition 5.7 (Tag Similarity $\mathrm{tsim}(t, t')$). The list of all tags $t'$ which are similar to a tag $t$ is denoted with $\mathrm{SIMTAGS}(t)$. The similarity of two tags $t$ and $t'$ is computed by the Dice coefficient on the sets of documents in the social network tagged with these tags, i.e.

$$\mathrm{tsim}(t, t') = \frac{2 \cdot df(t \wedge t')}{df(t) + df(t')}$$

where $df(t \wedge t')$ is the document frequency for both tags $t$ and $t'$, i.e. the number of documents that are tagged with $t$ and $t'$; $df(t)$ and $df(t')$ are the document frequencies for the single tags $t$ and $t'$, respectively.

Note: The most similar tag with respect to a tag $t$ is $t$ itself, i.e. $\mathrm{tsim}(t, t) = 1$, and thus $t$ is the entry in $\mathrm{SIMTAGS}(t)$ with the highest similarity value.
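The Dice coefficient on co-occurrence counts is straightforward to compute once the set of documents per tag is known. The following sketch assumes a hypothetical in-memory mapping `docs_by_tag` from each tag to the set of documents annotated with it; returning 0 when neither tag has been used is a convention chosen for this illustration.

```python
from typing import Dict, Set


def tag_similarity(docs_by_tag: Dict[str, Set[str]], t: str, t_prime: str) -> float:
    """Dice coefficient of the document sets tagged with t and t_prime."""
    docs_t = docs_by_tag.get(t, set())
    docs_tp = docs_by_tag.get(t_prime, set())
    denominator = len(docs_t) + len(docs_tp)  # df(t) + df(t')
    if denominator == 0:
        return 0.0  # convention for this sketch: unused tags are not similar
    # The intersection size is df(t AND t'), the number of co-tagged documents.
    return 2 * len(docs_t & docs_tp) / denominator
```

For example, if `python` annotates three documents, `code` annotates four, and two documents carry both tags, the similarity is 2 · 2 / (3 + 4) = 4/7. Any tag that annotates at least one document has similarity 1 to itself.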

$\mathrm{UR}(U)$ and $\mathrm{DR}(d)$ define the rank of a user $U$ in the friendship graph and the rank of a document $d$ in the document graph of a social network, respectively. The user rank gives more weight to documents from users with a high reputation, while the document rank generally boosts the single tag score $\mathrm{sts}(t_i, d, U)$ for highly authoritative documents in the social network. The user or document rank is equal to the PageRank [99] score of the respective entity, defined by a random walk on the user or document graph, respectively, with a random jump probability $(1 - \epsilon)$ set to 0.15. Formally, the user rank is defined as follows:

Definition 5.8 (User Rank $\mathrm{UR}(U)$).

$$\mathrm{UR}(U) = \frac{1 - \epsilon}{|U|} + \epsilon \cdot \sum_{\forall U_i :\, U \in \mathrm{DIRECTFRIENDS}(U_i)} \mathrm{UR}(U_i)$$

Analogously, the document rank is defined as:

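Definition 5.9 (Document Rank $\mathrm{DR}(d)$).

$$\mathrm{DR}(d) = \frac{1 - \epsilon}{|D|} + \epsilon \cdot \sum_{\forall d_i :\, d_i \to d} \frac{\mathrm{DR}(d_i)}{\mathrm{outdegree}(d_i)}$$

where $|D|$ is the number of all documents in the network, $\mathrm{outdegree}(d_i)$ is the number of outgoing links from $d_i$, and $\epsilon = 0.85$ is a damping factor.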
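The recursive definition of the document rank can be evaluated by fixed-point iteration. The sketch below is a plain power-iteration PageRank following the form of Definition 5.9 with a damping factor of 0.85; the adjacency-dictionary input, the uniform initialisation, the fixed number of iterations, and the function name `document_rank` are assumptions of this illustration. Documents without outgoing links simply pass on no rank here, since the definition does not state how they are handled.

```python
from typing import Dict, List


def document_rank(
    links: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Iteratively evaluate the document rank on a link graph.

    links maps each document to the documents it links to
    (assumed to contain no duplicate targets).
    """
    docs = set(links) | {t for targets in links.values() for t in targets}
    n = len(docs)
    if n == 0:
        return {}
    rank = {d: 1.0 / n for d in docs}  # uniform start (assumption)
    for _ in range(iterations):
        # Random jump part: (1 - damping) / |D| for every document.
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for source, targets in links.items():
            if not targets:
                continue  # no outgoing links: nothing is passed on
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

On a three-document cycle every document ends up with rank 1/3, while a document with no incoming links keeps only the random jump share of 0.15 / |D|.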

Finally, the document score used in Definition 5.2 of the single tag score corresponds to the ternary user-tag-document relation

$\mathit{Tagging}(U, t, d, \mathit{score})$

introduced with our data model in Section 3.1.3.

The score computation of a document with respect to a certain tag is based on a user-specific BM25 [105] formula. Its definition looks as follows:

Definition 5.10 (Document Score $s_d(U, t, d)$). The score of a document $d$ that is tagged with a tag $t$ by a user $U$ is defined as

$$s_d(U, t, d) = \frac{k_1 + tf_U(t, d)}{K + tf_U(t, d)} \cdot \log \frac{N_U - df_U(t) + 0.2}{df_U(t) + 0.5}$$

where

$$K = k_1 \cdot \left( (1 - b) + b \cdot \frac{\mathit{length}_U(d)}{\mathrm{avg}_{d' \text{ tagged by } U} \{ \mathit{length}_U(d') \}} \right)$$

and $k_1$ and $b$ are constants, set to $k_1 = 1.2$ and $b = 0.5$.

The value of $tf_U(t, d)$ corresponds to the number of times $U$ tagged $d$ with $t$, $df_U(t)$ corresponds to the number of times $U$ tagged any document with $t$, $N_U$ is equal to the total number of documents tagged by $U$, and $\mathit{length}_U(d)$ corresponds to the number of tags given to $d$ by $U$.

Notes on remaining entity relations

By summing up the document score $s_d(U', t, d)$ over all users $U'$ in the social network, the resulting value is independent of any user relation with regard to a querying user $U$. Hence, the aggregated document scores for a document $d$ and tag $t$ from the entire social tagging network implement in our scoring model the document-tag relation

$\mathit{Content}(d, t, \mathit{score})$

as defined by our data model in Section 3.1. We make use of it in Section 5.3 for enabling a semantic search strategy (see Definition 5.12).

The scoring model used in our SOCIALMERGE algorithm implements neither the document-document relation $\mathit{Linkage}(d_1, d_2, w)$ (see Section 3.1.3) nor the user-document relation $\mathit{Rating}(U, d, \mathit{rating})$ (see Section 3.1.2) as defined by our data model in Section 3.1. The reason is that the datasets crawled from real-world social tagging networks (see Section 3.2) and used in our experimental evaluation presented in Section 5.4 do not exhibit such relations to the full extent, or the relations could be harvested only on a limited scale.
