Characterisation of the Sucursal Cubalse Villa Clara.

Chapter III. Validation of the General Procedure Proposed for an Occupational Health and Safety Management System at the Sucursal Cubalse Villa Clara.

3.1 Characterisation of the Sucursal Cubalse Villa Clara.

The co-occurrence recommender we use as a baseline combines user-related and document-related tags: the set of tags co-occurring with the query user and the set of tags co-occurring with the query document. For query posts where the combined set of user-related and document-related tags contains fewer tags than the required number of recommendations (or is empty), we add the most popular tags in the data overall to fill the recommendation set to the required size.

User Tags (UT)

The User Tags recommender bases its recommendations on the query user only and ignores the document to be tagged. For a query post $(u_q, d_q, \emptyset)$, the set of candidate tags consists of the past tags of the query user $u_q$. The score for each candidate tag $t$ is the co-occurrence count of $u_q$ and $t$ divided by the total number of posts of $u_q$, and can be seen as the probability that $u_q$ will use $t$ for any document. We use the notation $\mathrm{UT}(u_q, t)$ to denote the prediction score of tag $t$ for user $u_q$, calculated as

$$\mathrm{UT}(u_q, t) = \frac{|(u_q, d_\exists, t) \in A|}{|(u_q, d_\exists, S_\exists) \in P|}$$

where $d_\exists$ is any document and $S_\exists$ is any tag set. The numerator of the fraction is the number of co-occurrences of $u_q$ and $t$, and the denominator is the total number of posts made by $u_q$.

Document Tags (DT)

Analogous to the User Tags recommender, we calculate a tag co-occurrence probability for a document $d_q$ as

$$\mathrm{DT}(d_q, t) = \frac{|(u_\exists, d_q, t) \in A|}{|(u_\exists, d_q, S_\exists) \in P|}$$

where $u_\exists$ is any user and $S_\exists$ is any tag set. The numerator is the number of co-occurrences of $d_q$ and $t$, and the denominator is the total number of posts containing $d_q$. The tag score $\mathrm{DT}(d_q, t)$ represents the probability that tag $t$ will be assigned to $d_q$ by any user.
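The UT and DT scores defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy `posts` list, the user and document names, and the function names `user_tag_scores` and `document_tag_scores` are all assumptions made for the example. Each post is represented as a (user, document, tag set) triple, so counting the posts that contain a tag is the same as counting the matching tag assignments.

```python
from collections import Counter

# Hypothetical toy data: each post is (user, document, set_of_tags).
posts = [
    ("alice", "doc1", {"python", "ml"}),
    ("alice", "doc2", {"python"}),
    ("bob", "doc1", {"ml", "stats"}),
]


def user_tag_scores(posts, user):
    """UT(user, t): posts by `user` that contain t, divided by all posts by `user`."""
    user_posts = [tags for u, _, tags in posts if u == user]
    if not user_posts:
        return {}  # unseen user: no candidate tags
    counts = Counter(t for tags in user_posts for t in tags)
    return {t: c / len(user_posts) for t, c in counts.items()}


def document_tag_scores(posts, doc):
    """DT(doc, t): posts on `doc` that contain t, divided by all posts on `doc`."""
    doc_posts = [tags for _, d, tags in posts if d == doc]
    if not doc_posts:
        return {}  # unseen document: no candidate tags
    counts = Counter(t for tags in doc_posts for t in tags)
    return {t: c / len(doc_posts) for t, c in counts.items()}


ut = user_tag_scores(posts, "alice")     # {"python": 1.0, "ml": 0.5}
dt = document_tag_scores(posts, "doc1")  # {"python": 0.5, "ml": 1.0, "stats": 0.5}
```

In this toy data "alice" used "python" in both of her posts, so its UT score is 1.0, while "ml" appears in both posts on "doc1" and therefore has a DT score of 1.0.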

Most Popular Tags (MP)

The simplest baseline recommender is the Most Popular Tags recommender, which recommends the same set of tags for all test posts: the top $N$ most frequently used tags in the system. The MP score for a tag $t$ is the probability that $t$ will be used in any post by any user for any document, and is calculated as

$$\mathrm{MP}(t) = \frac{|(u_\exists, d_\exists, t) \in A|}{|P|}$$

where the numerator is the number of tag assignments containing tag t, and the denominator is the total number of posts.
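As a rough sketch of the MP baseline, the following Python snippet computes the score for every tag and returns the top $N$. The toy `posts` data and the names `most_popular_scores` and `top_n` are assumptions for illustration only. The alphabetical tie-break is also an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

# Hypothetical toy data: each post is (user, document, set_of_tags).
posts = [
    ("alice", "doc1", {"python", "ml"}),
    ("alice", "doc2", {"python"}),
    ("bob", "doc1", {"ml", "stats"}),
]


def most_popular_scores(posts):
    """MP(t): number of tag assignments containing t, divided by the number of posts."""
    if not posts:
        return {}
    counts = Counter(t for _, _, tags in posts for t in tags)
    return {t: c / len(posts) for t, c in counts.items()}


def top_n(scores, n):
    """Return the n highest-scoring tags (ties broken alphabetically, by assumption)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


mp = most_popular_scores(posts)
recommended = top_n(mp, 2)  # ["ml", "python"]
```

Because a post may carry several tags, the MP scores of all tags need not sum to one; each score is a per-tag probability of appearing in a post.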

Co-Occurrence Recommender (CoOcc)

To recommend tags which are related to the document as well as personalised to the user's preferences, the tag prediction scores of the User Tags (UT) and the Document Tags (DT) recommenders are combined. To combine the prediction scores from UT and DT, we apply the standard approach of taking a linear combination of the two prediction sets [Lipczak and Milios, 2010a; Gemmell et al., 2010]. However, it is important to highlight that the candidate tag set of the UT recommender does not include all tags that are found by DT, and vice versa. Since the candidate tag set of the combined approach includes all tags that appear in either one or both of the source sets, we refer to the weighted linear combination of prediction sets as the union. The score of each tag $t$ in the union UT∪DT is a weighted sum of the tag's scores in UT and in DT. The effect is that scores of tags which appear in both source sets are increased relative to scores of tags which appear only in UT or only in DT.

We use the notation $\mathrm{UT}(u_q)$ to denote the candidate tag set of the UT recommender for user $u_q$, and $\mathrm{UT}(u_q, t)$ to denote the prediction score of tag $t$ for user $u_q$. Similarly for DT, $\mathrm{DT}(d_q)$ denotes the candidate tag set for document $d_q$ and $\mathrm{DT}(d_q, t)$ denotes the prediction score of tag $t$. We calculate the score of each tag in the union UT∪DT as

$$\mathrm{UT}{\cup}\mathrm{DT}(u_q, d_q, t) = \begin{cases} b \cdot \mathrm{UT}(u_q, t) + (1-b) \cdot \mathrm{DT}(d_q, t) & \text{if } t \in \mathrm{UT}(u_q) \wedge t \in \mathrm{DT}(d_q) \\ b \cdot \mathrm{UT}(u_q, t) & \text{if } t \in \mathrm{UT}(u_q) \wedge t \notin \mathrm{DT}(d_q) \\ (1-b) \cdot \mathrm{DT}(d_q, t) & \text{if } t \notin \mathrm{UT}(u_q) \wedge t \in \mathrm{DT}(d_q) \end{cases}$$

where $0 \le b \le 1$ is a parameter that determines the balance in importance given to scores from UT and DT. Before combining the recommendation sets, we normalise the tag scores in each of the two source sets so that they sum to one. To find the optimal setting for parameter $b$, we tune $b$ on the evaluation set and then use the best value on the test set.

Since a query post can contain a user and document that are both new or have only very few tag co-occurrences in the historical data, the UT∪DT recommendation set might not contain a sufficient number of tags in some cases. For these cases, we use the most popular tags (MP) to fill the recommendation set to the required size and produce the final recommendations of the CoOcc recommender. The tags from MP which are not already included in UT∪DT are appended to the end of the tag ranking of UT∪DT in order of their overall popularity. Tags which are added from MP can thus never outrank existing tags in UT∪DT, and the scores from MP do not influence the scores of existing tags in UT∪DT. The most popular tags are used only as a means to fill the recommendation set to the required size.
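The combination and fill steps described above could be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names (`normalise`, `union_scores`, `cooc_recommend`), the example scores, and the alphabetical tie-break are all hypothetical. Treating a missing tag as having score 0 in the other source reproduces the three cases of the piecewise definition.

```python
def normalise(scores):
    """Rescale scores so that they sum to one (empty or all-zero input gives {})."""
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()} if total > 0 else {}


def union_scores(ut, dt, b):
    """Weighted union of UT and DT scores; a tag absent from one source contributes 0 there."""
    ut_n, dt_n = normalise(ut), normalise(dt)
    return {
        t: b * ut_n.get(t, 0.0) + (1 - b) * dt_n.get(t, 0.0)
        for t in ut_n.keys() | dt_n.keys()
    }


def cooc_recommend(ut, dt, popular_ranking, b, n):
    """Rank tags by union score, then append popular tags until n tags are available."""
    combined = union_scores(ut, dt, b)
    # Ties are broken alphabetically here; this is an assumption for determinism.
    ranked = sorted(combined, key=lambda t: (-combined[t], t))
    for t in popular_ranking:  # assumed to be ordered by overall popularity
        if len(ranked) >= n:
            break
        if t not in combined:
            ranked.append(t)  # appended at the end, so it never outranks union tags
    return ranked[:n]


# Hypothetical inputs: unnormalised UT and DT scores and a popularity ranking.
ut = {"python": 1.0, "ml": 0.5}
dt = {"ml": 1.0, "stats": 0.5}
popular = ["python", "web", "ml", "news"]

result = cooc_recommend(ut, dt, popular, b=0.5, n=5)
# ["ml", "python", "stats", "web", "news"]
```

With $b = 0.5$, "ml" ranks first because it is the only tag present in both source sets. The tags "web" and "news" come from the popularity list and are placed after all union tags.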

A direct comparison of the accuracy achieved with FolkRank and a recommender based on co-occurrence is presented in [Jäschke et al., 2008], where FolkRank is reported to produce significantly better results. Similarly to CoOcc, the co-occurrence recommender used in [Jäschke et al., 2008], called "most popular tags mix", is a weighted combination of tags co-occurring with the query user and tags co-occurring with the query document. However, there are major differences between "most popular tags mix" and our CoOcc approach in how the co-occurrence scores are calculated and how the weights are normalised before combination.

The recommender used in [Jäschke et al., 2008] only considers the absolute user-tag and document-tag co-occurrence counts without considering the total number of posts of the user or document, which would correspond to using only the numerator in our UT and DT score calculations. Before combining the user-tag and document-tag co-occurrences to give the final recommendation set, Jäschke et al. normalise the co-occurrence counts from the two sources to be in the interval [0,1] so that the highest scoring tag has a score of 1 and the lowest scoring tag a score of 0. However, the distribution of co-occurrence counts across all tags of a user and all tags of a document is not considered. Only the minimum and maximum scores in each of the source sets are taken into account in the calculation of the normalised scores.
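The difference between the two normalisation schemes can be illustrated with a small Python sketch. It is not the implementation of Jäschke et al.; the function names, the toy counts, and the handling of the all-equal case are assumptions made for the example. Min-max scaling uses only the smallest and largest count, whereas sum-to-one normalisation reflects how counts are distributed across all tags.

```python
def min_max_scale(counts):
    """Map counts to [0, 1]: the highest count becomes 1, the lowest becomes 0."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # Assumption: when all counts are equal, every tag gets 1.0.
        return {t: 1.0 for t in counts}
    return {t: (c - lo) / (hi - lo) for t, c in counts.items()}


def sum_normalise(counts):
    """Rescale counts so that they sum to one."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total > 0 else {}


# Hypothetical co-occurrence counts for two users with different distributions.
user_a = {"x": 10, "y": 9, "z": 1}
user_b = {"x": 10, "y": 2, "z": 1}

mm_a, mm_b = min_max_scale(user_a), min_max_scale(user_b)
sn_a, sn_b = sum_normalise(user_a), sum_normalise(user_b)
```

Under min-max scaling the top tag "x" scores 1 and the bottom tag "z" scores 0 for both users. Under sum-to-one normalisation "x" scores 10/20 for the first user and 10/13 for the second, so the score reflects how strongly each user concentrates on that tag.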

In document: General procedure for the design and implementation of an occupational health and safety management system. Application at the Sucursal Cubalse Villa Clara (pages 67-70)