applies to NDCG; however, NDCG is a more subtle measure and is able to differentiate between orderings of the items in the ranked list. Therefore, an NDCG score can tell us how well the relevant items are ranked, which is what matters most to the user.
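To illustrate how NDCG distinguishes between orderings of the same items, the following is a minimal sketch. It assumes one common formulation (linear gain with a log2 position discount); the text does not state which variant is used, and the relevance lists are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance scores given in ranked order.

    Assumption: linear gain and a log2(position + 1) discount, with
    positions starting at 1.
    """
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance scores for the same four items in two rankings.
good_order = [5, 4, 3, 1]  # most relevant items ranked first
bad_order = [1, 3, 4, 5]   # most relevant items ranked last

good_score = ndcg(good_order)
bad_score = ndcg(bad_order)
```

Both rankings contain the same items, so a set-based measure cannot tell them apart, whereas `good_score` is higher than `bad_score`.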
3.5 Possible extensions: Difficult and Popular Items
The main reason for identifying the direction of the error was to assign importance to items according to the deviation between their ground-truth and predicted ratings. This is based on the perception of what is important to users, who expect good-quality recommendations only on items that are of interest to them. There are other qualities of items that can be captured as posterior knowledge and that affect the quality of the recommendation. Some of these features affect recommendation on a global level (as opposed to the user level we considered earlier in this chapter), such as popularity or the difficulty of predicting the score (which, in the simplest case, can be based on the variance of the item ratings). These qualities determine whether an item can be predicted accurately as well as whether it is important for the item to be predicted correctly. For example, in order to improve the accuracy of a recommender system for most users, items that are frequently rated should be predicted correctly; another strategy is to concentrate on highly rated items instead. In other scenarios, new items that have not received many ratings but have the potential to become popular should be emphasised and predicted correctly. This illustrates that, on a global level, knowing the importance (e.g. popularity) of items helps to focus on them in order to improve the accuracy of the overall system.
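The global item qualities mentioned above (frequency, mean rating and rating variance) can be computed directly from the rating data. The sketch below is only an illustration: the `(user, item, rating)` tuple layout, the function name and the toy ratings are assumptions, and the population variance is used as the simplest difficulty proxy.

```python
from collections import defaultdict
from statistics import mean, pvariance

def item_statistics(ratings):
    """Return {item: (frequency, mean_rating, rating_variance)}.

    ratings: iterable of (user, item, rating) tuples (assumed layout).
    """
    by_item = defaultdict(list)
    for _user, item, rating in ratings:
        by_item[item].append(rating)
    return {
        item: (len(values), mean(values), pvariance(values))
        for item, values in by_item.items()
    }

# Hypothetical toy data: item "a" is rated often, item "b" only once.
ratings = [
    ("u1", "a", 5.0),
    ("u2", "a", 4.0),
    ("u3", "a", 3.0),
    ("u1", "b", 2.0),
]
stats = item_statistics(ratings)
```

An item with a single rating has zero variance by construction, so a variance-based difficulty score is only informative for items with enough ratings.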
3.5.1 Selective learning
After identifying important items, the prediction error on them should be reduced. Boosting techniques can help here: they combine many weak learners into a single strong learner, concentrating on the examples that the weak learners predict poorly. One example of boosting is AdaBoost, first introduced in [SF96], where it was used to generate a highly accurate hypothesis by combining many weak hypotheses, each of which has only moderate accuracy. This extension would consist of applying boosting techniques to a simple SVD algorithm [KBV09], sampling a subset of the whole dataset and concentrating on items that are likely to be predicted wrongly. Our initial assumption is that this approach would have two main advantages. First, it would improve the accuracy of the algorithm, which is the basic property of boosting. In addition, it would provide a way to scale an SVD algorithm by allowing the task to be divided into small segments that can be distributed, for example using the MapReduce framework [DG08]. This approach would also provide the necessary framework to incorporate direction-based errors into a more general framework, since the data can be sampled based on the direction of the error made by the baseline algorithm, or the distance between items can be based on this direction.
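The sampling step described above can be sketched as follows. This is not AdaBoost itself but a simplified, boosting-flavoured resampling in which ratings with a larger baseline error are drawn more often, optionally restricted to one error direction. The function names, the `direction` parameter and the example errors are hypothetical.

```python
import random

def resampling_weights(errors, direction=None):
    """Sampling weights proportional to the absolute baseline error.

    errors: signed errors (predicted minus true) of a baseline model.
    direction: None for all errors, "over" to keep only over-predictions
    (error > 0), "under" to keep only under-predictions (error < 0).
    """
    raw = []
    for error in errors:
        if direction == "over" and error <= 0:
            raw.append(0.0)
        elif direction == "under" and error >= 0:
            raw.append(0.0)
        else:
            raw.append(abs(error))
    total = sum(raw)
    if total == 0:
        # No error to focus on: fall back to uniform sampling.
        return [1.0 / len(errors)] * len(errors)
    return [weight / total for weight in raw]

def sample_indices(weights, k, seed=0):
    """Draw k training-point indices with replacement according to weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=k)

# Hypothetical signed errors of a baseline on three training ratings.
errors = [0.5, -1.0, 1.5]
over_weights = resampling_weights(errors, direction="over")
subset = sample_indices(over_weights, k=10)
```

Each sampled subset could then be used to train one model in the ensemble; because the subsets are drawn independently, that training could in principle be distributed.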
It is important to identify the qualities of the items that tend to be easily predicted. Depending on the model, these qualities include the frequency, the mean and the variance of the ratings of the items and users in the dataset. We show that popularity can be defined as a combination of these qualities. In order to tackle the problem we need to investigate them separately. The frequency of an item has two very distinct effects on recommendation: first, there are items that received fewer ratings because they are not popular; also
[Figure 3.6, three panels: RMSE (y-axis) of the SVD, item-based and user-based algorithms plotted against (a) the mean, (b) the frequency and (c) the variance of the item ratings (x-axis).]
(a) Popularity is measured by the average rating of the item. (b) Popularity is measured by the frequency of votes. (c) Popularity is measured by the variance of the ratings.
Figure 3.6: Popularity of Items against the Error Rate (RMSE) for a number of popular algorithms.
there are items that have fewer ratings because they are new, so it is important to differentiate between these two cases. This quality is also important in the evaluation phase, since the frequency of an item in the training set is proportional to its frequency in the test set, given that the data was divided randomly. In the case of RMSE, this means that the number of test points for a given item is proportional to the number of training points; therefore, emphasising items that have been rated many times is the best way to achieve high performance (Figure 3.6 (b)). The average rating of the item is clearly a good indication of popularity. Figure 3.6 (a) shows that neighbourhood-based models are sensitive to the mean of the item, whereas latent factor-based models provide a more stable prediction. The reason is that neighbourhood-based models capture localised relationships, whereas latent factor-based models capture a more global level of structure.
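The effect of item frequency on RMSE can be made concrete with a small worked example: the overall mean squared error is the average of the per-item errors weighted by the number of test points, so frequently rated items dominate the score. The test set below is hypothetical and the function names are illustrative only.

```python
import math
from collections import defaultdict

def overall_rmse(test):
    """RMSE over a list of (item, true_rating, predicted_rating) tuples."""
    return math.sqrt(sum((t - p) ** 2 for _, t, p in test) / len(test))

def per_item_mse(test):
    """Return {item: (number_of_test_points, mean_squared_error)}."""
    squared = defaultdict(list)
    for item, true_rating, predicted in test:
        squared[item].append((true_rating - predicted) ** 2)
    return {item: (len(v), sum(v) / len(v)) for item, v in squared.items()}

# Hypothetical test set: a frequent item predicted well (error 0.5) and
# a rare item predicted badly (error 2.0).
test = [("popular", 4.0, 3.5)] * 4 + [("rare", 5.0, 3.0)]

stats = per_item_mse(test)
# Frequency-weighted average of the per-item mean squared errors.
weighted_mse = sum(n / len(test) * mse for n, mse in stats.values())
rmse = overall_rmse(test)
```

Here the square root of `weighted_mse` equals `rmse`, and the popular item contributes four fifths of the weight even though its own error is small.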
In addition, modelling the correlation between items and users can affect the quality of the recommendation. For example, user-based recommender systems have a higher error rate at higher ranking positions. This can be explained by the way different algorithms deal with neighbourhoods and by the nature of the data. It is usually the case that the data consist of more users than items, so it is easier to find meaningful correlations between items than between users. Furthermore, item correlation based on little information is more meaningful than user correlation. For instance, if we try to define correlation based on only one rating, then surely, if a user agrees on one item with other users, we cannot say that he or