
Since Herlocker et al.’s comprehensive investigation of collaborative filtering parameters for the MovieLens dataset [102, 104], it has been generally accepted that these parameters are the best for that dataset when using a nearest-neighbourhood approach. The space of possible parameter values and their combinations explored by Herlocker et al. was large and the exploration was exhaustive. In recent years many new collaborative filtering datasets have become available, with potentially different characteristics to the MovieLens dataset. The motivation for this work is to ascertain whether similar parameter values are useful for datasets other than the MovieLens dataset. This will give an insight into the generalisability of the settings used across the datasets.

The approach adopted in this work uses a genetic algorithm to learn the best set of collaborative filtering parameters, rather than exhaustively testing combinations of different parameter values. Genetic algorithms are stochastic search techniques that evaluate a population of solutions (individuals) over a number of iterations (generations) and, at each iteration, evaluate how good (or fit) each solution is [110, 81]. Based on this evaluation, some simple operations are performed on the solutions to create a new, “better” population for the next iteration. The process continues until a satisfactory solution is found or until a set number of iterations has been reached. The terminology and technique are based on the principles of evolution via natural selection. The process begins with a set of solutions, the population, which is represented by chromosomes. Generally, this initial population is created randomly. A fitness function is used to evaluate each solution. Based on this fitness function, a proportion of the solutions are picked for the next generation. Operations of crossover and mutation are performed on the solutions to create new solutions for the next generation (analogous to reproduction). The process is repeated many times until some stopping criterion is met.
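The generic loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the bit-string chromosome, the truncation selection (keeping the fitter half as parents), the default rates and the name `run_ga` are all assumptions chosen for brevity.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Toy generational GA over bit strings; higher fitness is better."""
    rng = random.Random(seed)
    # Initial population is created randomly.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Selection (assumed here: keep the fitter half as parents).
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = list(a)
            # Crossover: splice two parents at a random cut point.
            if rng.random() < crossover_rate:
                point = rng.randrange(1, n_genes)
                child = a[:point] + b[point:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        # Remember the best solution seen in any generation.
        best = max(population + [best], key=fitness)
    return best

if __name__ == "__main__":
    # Toy problem: maximise the number of 1 bits in a 16-bit chromosome.
    print(run_ga(sum, 16))
```

Here the stopping criterion is simply a fixed number of generations; a threshold on the best fitness would be an equally valid choice.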

Some work has applied genetic algorithms in the collaborative filtering domain. Hwang et al. [116, 118] use a genetic algorithm, per user, to learn an optimal weighting scheme for the collaborative filtering system for each user. Both collaborative and inferred content information is used (a user’s rating for an item is taken as a rating for the features of that item). In comparison to a traditional collaborative filtering approach, improvements were seen with the genetic algorithm approach (using the metrics of precision, recall and the F1 measure). da Silva et al. [65] use a genetic algorithm to find the best combination of recommendations given the output of six different collaborative filtering techniques. The fitness function used is a combination of RMSE, a weight assigned to each technique giving a measure of the technique’s importance, and a weight indicating the quantity of user ratings available. The work described in this chapter is unique in using a genetic algorithm to search the space of possible parameter values.

4.3 Methodology

The collaborative filtering technique used is a standard memory-based nearest-neighbourhood approach consisting of the following steps:

• A portion of users are chosen as the test users and a portion of their items are withheld as test items. The task is to generate predictions for the withheld test items for the test users.

• Using a similarity function (Pearson correlation, Spearman rank correlation or cosine similarity), users similar to the test users are found (which are called the test users’ neighbours). Deviation from the mean is used to normalise user ratings. Similarity scores between users are “dampened” if the number of items co-rated by two users is below a certain significance threshold.

• Using a prediction formula, predictions for test items are calculated using a function based on the neighbours’ ratings for the test items, the neighbours’ similarity scores with the test user, the neighbours’ mean ratings and the test user’s mean rating.

• The accuracy of the predictions is calculated by comparing the ratings predicted by the system with the actual ratings given to the test items in the withheld set. The mean absolute error (MAE) metric is used, where the errors of all predictions for all users are averaged together per run.
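The steps above can be illustrated with a small sketch. It assumes Pearson correlation over co-rated items, a top-N neighbourhood restricted to positively correlated users, neighbour means computed over all rated items, and a toy ratings dictionary; the function names and data are hypothetical, and significance dampening is omitted here.

```python
from math import sqrt

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pearson(ratings_a, ratings_b):
    """Pearson correlation over co-rated items; 0.0 when undefined."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    ma = mean(ratings_a[i] for i in common)
    mb = mean(ratings_b[i] for i in common)
    num = sum((ratings_a[i] - ma) * (ratings_b[i] - mb) for i in common)
    den = sqrt(sum((ratings_a[i] - ma) ** 2 for i in common)
               * sum((ratings_b[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, user, item, top_n=2):
    """Deviation-from-mean prediction from the top_n most similar raters of item."""
    user_mean = mean(ratings[user].values())
    candidates = [(pearson(ratings[user], r), v) for v, r in ratings.items()
                  if v != user and item in r]
    neighbours = sorted((c for c in candidates if c[0] > 0), reverse=True)[:top_n]
    norm = sum(sim for sim, _ in neighbours)
    if not norm:
        return user_mean  # no usable neighbours: fall back to the user's mean
    deviation = sum(sim * (ratings[v][item] - mean(ratings[v].values()))
                    for sim, v in neighbours)
    return user_mean + deviation / norm

def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))

# Toy data: the rating of u1 for item "d" is withheld as the test item.
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 3, "d": 4},
    "u3": {"a": 1, "b": 5, "c": 2, "d": 1},
}
withheld = {("u1", "d"): 5}

if __name__ == "__main__":
    preds = [predict(ratings, u, i) for (u, i) in withheld]
    print(preds, mae(preds, withheld.values()))
```

In this toy example only u2 correlates positively with u1, so the prediction is u1's mean (4.0) plus u2's deviation from its own mean on item "d" (0.75).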

For the genetic algorithm, a set of parameters is chosen and the representation (position) of each parameter in the chromosome is decided. The parameters initially chosen are based on a subset of those tested in the work by Herlocker et al. [104]. The flow of control of the genetic algorithm is depicted in Figure 4.1 where, for each generation:

1. Pick test users and test items. The same test users and items are used to evaluate all individuals in a generation, but a new set of test users and items is picked for each new generation to avoid over-fitting to one set of test users and test items.

2. Randomly generate a population of individuals, of a fixed size.

3. Calculate the fitness of each individual, where each individual represents a set of values for the parameters tested. For each individual:

(a) Set all of the collaborative filtering parameters to the values indicated in the individual.

(b) Find nearest neighbours and make predictions for the test users and items based on the set of parameter values.

(c) Calculate the average MAE (mean absolute error) score for the test users and items and return this as the fitness score of the individual. The genetic algorithm for this experiment is required to minimise the fitness score, that is, the lower the MAE value the better (more fit) a solution is.

4. Perform the genetic algorithm operations of crossover, mutation and selection:

(a) The crossover operator used is single point crossover and the crossover rate is 80%.

(b) The mutation rate is set at 2%.

(c) The selection operator used is roulette wheel selection based on MAE scores (as the fitness value).

(d) The population size is 20 for experiment 1 and 200 for experiment 2.

(e) The genetic algorithm iterates for 12 generations for experiment 1 and for 50 generations for experiment 2.

Figure 4.1: Flow of control of GA experiment.
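The operators listed in step 4 can be sketched as below, using the stated crossover rate (80%) and mutation rate (2%). Several details are assumptions made for illustration only: chromosomes are plain lists of integer genes, roulette-wheel weights are obtained by subtracting each MAE from the worst MAE so that lower error earns a larger slice, and mutation resamples a gene uniformly within its range. The framework used in this work may handle these points differently.

```python
import random

CROSSOVER_RATE = 0.80
MUTATION_RATE = 0.02

def roulette_select(population, mae_scores, rng):
    """Pick one individual; lower MAE gets a proportionally larger slice."""
    worst = max(mae_scores)
    # Assumed inversion for minimisation; epsilon keeps every weight positive.
    weights = [worst - s + 1e-9 for s in mae_scores]
    return rng.choices(population, weights=weights, k=1)[0]

def single_point_crossover(mum, dad, rng):
    """Swap the tails of two parents at one random cut point."""
    point = rng.randrange(1, len(mum))
    return mum[:point] + dad[point:], dad[:point] + mum[point:]

def mutate(individual, gene_ranges, rng):
    """Resample each gene within its (low, high) range with probability MUTATION_RATE."""
    return [rng.randint(lo, hi) if rng.random() < MUTATION_RATE else g
            for g, (lo, hi) in zip(individual, gene_ranges)]

def next_generation(population, mae_scores, gene_ranges, rng):
    """Build a new population of the same size from the scored one."""
    children = []
    while len(children) < len(population):
        mum = roulette_select(population, mae_scores, rng)
        dad = roulette_select(population, mae_scores, rng)
        if rng.random() < CROSSOVER_RATE:
            mum, dad = single_point_crossover(mum, dad, rng)
        children += [mutate(mum, gene_ranges, rng), mutate(dad, gene_ranges, rng)]
    return children[:len(population)]
```

Because single-point crossover keeps every gene at its original position, each gene stays within the range defined for that position.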

• sigT, the significance threshold, which is an integer in the range 0 to 100. This is used when calculating the similarity between users so as to dampen the similarity between two users if the number of co-rated items between the users is less than this threshold [104, 165, 157, 139]. The dampening used is that described by Herlocker [104]: multiply the similarity score between two users by n/d, where n is the number of co-rated items between the two users and d is the significance threshold. Note that a significance threshold value of 0 implies that this dampening will not be used. A value of n = 100 is chosen as the limit for dampening as it does not seem reasonable to dampen a similarity score if the number of co-rated items is greater than 100.

• sim, the similarity option, which is an integer value in the range 0 to 2. This indicates which similarity function should be used to find the similarity between users. The options are:

0: Spearman rank correlation.

1: Pearson correlation.

2: Cosine similarity.

• P, the predict option, which is an integer value in the range 0 to 3 indicating which version of a prediction formula is used, and what users are involved, in the prediction. As shown in Table 4.1, when P is 1 or 3, then the most similar top-N neighbours to the current user, for some N, will be used to form predictions. When P is 0 or 2 then correlation thresholding is used, that is, all users who have a similarity to the current user greater than some threshold are used to form predictions. The difference within each approach (1/3 and 0/2) is whether, when calculating the average rating value of users, these averages are calculated over all the ratings a user has given to all the items the user has rated (P = 2 or P = 3) or whether the average is calculated only over the ratings given to co-rated items between the current user and each of the other users (P = 0 or P = 1).

Table 4.1: Values for Parameter P.

P                     | avg. co-rated items | avg. all items | other parameter required
correlation threshold | 0                   | 2              | corrT
top-N neighbours      | 1                   | 3              | N

• N, the top-N value, which is an integer in the range 0 to 300. As shown in Table 4.1, this is used when the option of using top-N (option 1 or 3) is chosen and indicates the number of neighbours that will be used to form a prediction.

• corrT, the correlation threshold value, which is a real value in the range [0.0, 0.35]. As shown in Table 4.1, this is used when the predict option of correlation thresholding (option 0 or 2) is chosen. The limit of 0.35 was chosen as, in reality, user similarities would rarely be greater than this.

An open source Python genetic algorithm framework was used (Pyevolve 0.51) with the neighbourhood-based collaborative filtering technique implemented in Python.
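The parameters described above can be pictured as one chromosome with five genes. The sketch below is an illustration under stated assumptions rather than the encoding used with Pyevolve: the `Individual` dataclass, the field names and the helper functions are hypothetical, and only the ranges, the n/d dampening rule and the meaning of P follow the text and Table 4.1.

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    sig_t: int     # significance threshold, 0..100 (0 disables dampening)
    sim: int       # 0 Spearman, 1 Pearson, 2 cosine
    p: int         # predict option, 0..3 (see Table 4.1)
    n: int         # top-N value, 0..300 (used when p is 1 or 3)
    corr_t: float  # correlation threshold, 0.0..0.35 (used when p is 0 or 2)

def random_individual(rng):
    """Draw every gene uniformly from its stated range."""
    return Individual(rng.randint(0, 100), rng.randint(0, 2), rng.randint(0, 3),
                      rng.randint(0, 300), rng.uniform(0.0, 0.35))

def dampen(similarity, n_corated, sig_t):
    """Scale similarity by n/d when fewer than d items are co-rated."""
    if sig_t > 0 and n_corated < sig_t:
        return similarity * n_corated / sig_t
    return similarity

def decode_predict_option(ind):
    """Return (neighbour selection, averaging scope, value of the active parameter)."""
    selection = "top-N" if ind.p in (1, 3) else "correlation threshold"
    averaging = "all items" if ind.p in (2, 3) else "co-rated items"
    value = ind.n if selection == "top-N" else ind.corr_t
    return selection, averaging, value

if __name__ == "__main__":
    ind = random_individual(random.Random(0))
    print(ind, decode_predict_option(ind))
    # 25 co-rated items against a threshold of 50 halves the similarity.
    print(dampen(0.8, 25, 50))
```

Note that, for any one individual, only one of N and corrT is active, depending on the value of P; the other gene is carried in the chromosome but ignored.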