Table 32 shows the results of the regression analysis for the Imhonet dataset. Although the RMSEs of the models are low, the p-value of R² for CD-CCA is not significant, and none of the coefficients other than the intercept is significant in this model.

For SD-SVD and CD-SVD, we see a positive relationship between their RMSEs and the kurtosis of ratings in the source domain, the ratio of users to source domain items, and the ratio of users to target domain items. For SD-SVD, the first two relationships are meaningless because it does not use source domain information. Also, as the number of items in the source domain increases, the error of CD-SVD decreases.

Table 32: RMSE regression analysis results for the Imhonet dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001

                                  CD-CCA      CD-SVD      SD-SVD
intercept                         0.2182***   0.1618**    0.1794**
source to target density ratio    -0.0091     0.006       -0.0224
target var rating                 0.0071      -0.0141     0.0626
source item size                  0.0134      -0.0783**   -0.0433
source kurtosis rating            0.0096      0.1116**    0.0815*
user to target item ratio         0.0049      0.3044***   0.1286**
user to source item ratio         -0.0073     0.2614***   0.2398**
RMSE                              0.0129      0.0121      0.017
p-value                           1.83E-01    3.22E-04    7.31E-03
R²                                0.425       0.961       0.86
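The predictors named in Table 32 can be derived from raw rating triples. The following is a minimal sketch of that derivation, not the procedure used in this work: the function name `domain_features`, the toy ratings, and the choice of population excess kurtosis are all assumptions made for illustration, since the text does not state which estimator it uses.

```python
from statistics import fmean


def domain_features(ratings):
    """Summarise one domain given (user_id, item_id, rating) triples.

    Hypothetical helper: the feature names mirror the rows of Table 32,
    but the exact estimators used in the study are an assumption here.
    """
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    values = [r for _, _, r in ratings]

    mean = fmean(values)
    # Population variance and fourth central moment of the ratings.
    var = fmean([(v - mean) ** 2 for v in values])
    m4 = fmean([(v - mean) ** 4 for v in values])
    # Excess kurtosis (assumed convention): m4 / var^2 - 3.
    kurtosis = m4 / var ** 2 - 3 if var > 0 else float("nan")

    return {
        "n_users": len(users),
        "n_items": len(items),
        # Fraction of the user-by-item matrix that is filled.
        "density": len(values) / (len(users) * len(items)),
        # Users per item: larger means a taller rating matrix.
        "user_to_item_ratio": len(users) / len(items),
        "var_rating": var,
        "kurtosis_rating": kurtosis,
    }


# Toy data, invented for illustration only.
source = [("u1", "a", 5), ("u1", "b", 4), ("u2", "a", 5),
          ("u3", "b", 1), ("u3", "c", 5), ("u4", "a", 3)]
target = [("u1", "x", 4), ("u2", "x", 2), ("u3", "y", 5), ("u4", "y", 3)]

src = domain_features(source)
tgt = domain_features(target)
density_ratio = src["density"] / tgt["density"]
print(src, tgt, density_ratio)
```

Each domain pair would contribute one such feature row, and the regression would then be fitted over those rows.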
Table 33 reports the regression results for the improvement ratios (IR). Here, none of the models has a significant p-value for R². Still, we see a significant positive relationship between the ratio of users to source domain items and CD-CCA’s IR, and a significant negative relationship between the ratio of users to target domain items and CD-SVD’s IR. This means that the taller the source domain rating matrix is, the more CD-CCA improves over SD-SVD; and the taller the target domain rating matrix is (that is, the more users per target item), the less CD-SVD improves over SD-SVD.
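To make the two quantities in this paragraph concrete, here is a small sketch. The improvement-ratio formula is an assumed definition (relative RMSE reduction over the single-domain baseline), as this section does not restate it. The function names and numbers are hypothetical.

```python
def improvement_ratio(rmse_baseline, rmse_cross):
    """Relative RMSE reduction of a cross-domain model over the baseline.

    Assumed definition, not taken from the text: positive values mean the
    cross-domain model has lower error than the single-domain baseline.
    """
    if rmse_baseline <= 0:
        raise ValueError("baseline RMSE must be positive")
    return (rmse_baseline - rmse_cross) / rmse_baseline


def user_to_item_ratio(n_users, n_items):
    """Users per item; larger values mean a taller user-by-item matrix."""
    return n_users / n_items


# Hypothetical domain pair (illustrative numbers only).
ir = improvement_ratio(rmse_baseline=0.20, rmse_cross=0.18)
source_shape = user_to_item_ratio(n_users=4000, n_items=500)
print(f"IR = {ir:.3f}, users per source item = {source_shape:.1f}")
```

Under this definition, a negative IR would indicate that the cross-domain model performs worse than the baseline.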
7.3.5 Summary
As the regression results show, many different factors matter, and they vary by algorithm and by dataset. For example, the number of users is an important factor only in the Supermarket dataset, and the first component’s CCA correlation is important only in the Yelp dataset.
Table 33: Improvement ratio regression analysis results for the Imhonet dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001

                                  CD-CCA      CD-SVD
intercept                         0.0774      0.0252
source to target density ratio    -0.003      -0.0677
target var rating                 0.0883      0.2254
source item size                  -0.1218     0.0904
source kurtosis rating            0.1118      -0.0714
user to target item ratio         0.2152      -0.4745*
user to source item ratio         0.4171*     -0.0414
RMSE                              0.0582      0.0774
p-value                           6.84E-02    1.17E-01
R²                                0.636       0.535
Some important factors are also shared across datasets. For example, the density ratio of the source and target domains appears in both the Supermarket and Imhonet regression models, and the source domain kurtosis appears in both the Yelp and Imhonet models.
However, there are two contradictory results between the datasets. The first concerns the density of the target domain. It has a positive relationship with RMGM’s improvement ratio in the Supermarket dataset and a negative relationship with it in the Yelp dataset. That is, in the Supermarket dataset, the denser the target domain is, the more RMGM improves over SD-SVD, whereas in the Yelp dataset a denser target domain is associated with less improvement from RMGM.
The second concerns the skewness of ratings in the target domain. In the Supermarket dataset, it has a positive relationship with RMGM’s error and a negative relationship with its improvement ratio. In the Yelp dataset, however, both relationships are reversed. This may be one of the reasons that RMGM performs well in the Yelp dataset compared to the Supermarket dataset. We can also see that the mode of ratings in the training data has a positive relationship with RMGM’s RMSE. We know that in highly skewed datasets, the mode of ratings moves in the direction of skewness. So, if the ratings are skewed towards higher ratings (which is the case for most explicit-feedback recommender system datasets), the mode is also going to be higher. Although there is no direct correlation or collinearity between the mode and the skewness in Yelp, we think that this general rule can explain some of the contradiction that we see in the regression analysis of the Yelp dataset. Specifically, we hypothesize that some of the “positivity” of the relationship between skewness and RMGM’s RMSE in the Yelp dataset is absorbed by the positive relationship between the mode of target ratings and RMGM’s RMSE. The same can be true for the median of target domain ratings, which appears in the Yelp dataset’s regression analysis.
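The link between a rating distribution concentrated on high values and its mode and median can be checked on a small example. The sketch below uses an invented rating histogram and the standard moment-based skewness. That estimator is an assumption, since the text does not say which one is used. Under that convention, a distribution piled up at the top of the scale has negative skewness, while its mode and median sit near the maximum rating.

```python
from math import sqrt
from statistics import fmean, median, mode


def skewness(values):
    """Population skewness: third central moment divided by std cubed."""
    mean = fmean(values)
    std = sqrt(fmean([(v - mean) ** 2 for v in values]))
    return fmean([(v - mean) ** 3 for v in values]) / std ** 3


# Hypothetical 1-5 star ratings, concentrated on high values.
ratings = [5] * 50 + [4] * 25 + [3] * 12 + [2] * 8 + [1] * 5

rating_mode = mode(ratings)
rating_median = median(ratings)
rating_skew = skewness(ratings)
print(rating_mode, rating_median, round(rating_skew, 3))
```

Because mode, median, and skewness all respond to the same concentration of high ratings, they can share explanatory weight when entered together in a regression, which is consistent with the hypothesis above.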
As for the CCA-related features, the number of canonical correlations greater than 0.95 appears in the Supermarket dataset’s regression analysis, and the canonical correlation of the first component appears in the Yelp regression analysis. The direction of their relationship, when significant, is as expected: they lower the cross-domain recommenders’ error and increase their improvement ratios. However, they are not present in the Imhonet dataset’s regression analysis.
We should also note the number of data points in the analysis. Each domain pair is one data point in this regression analysis, so we have 12 data points for the Imhonet dataset, 50 for the Supermarket dataset, and 158 for the Yelp dataset. These numbers of data points, especially for Imhonet, are not nearly enough for a powerful regression analysis.
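A quick calculation shows how little room 12 data points leave. The sketch below takes the six predictors listed in Table 32 and its reported R² values, then computes the residual degrees of freedom and the textbook adjusted R². Adjusted R² is not reported in this work and is used here only to illustrate the effect of the small sample. The helper names are hypothetical.

```python
def residual_dof(n_points, n_predictors):
    """Residual degrees of freedom for OLS with an intercept."""
    return n_points - n_predictors - 1


def adjusted_r2(r2, n_points, n_predictors):
    """Standard adjusted R^2: penalises R^2 for the number of predictors."""
    dof = residual_dof(n_points, n_predictors)
    return 1 - (1 - r2) * (n_points - 1) / dof


N_POINTS = 12      # Imhonet domain pairs
N_PREDICTORS = 6   # predictors listed in Table 32, excluding the intercept
R2_TABLE_32 = {"CD-CCA": 0.425, "CD-SVD": 0.961, "SD-SVD": 0.86}

dof = residual_dof(N_POINTS, N_PREDICTORS)
adjusted = {
    model: adjusted_r2(r2, N_POINTS, N_PREDICTORS)
    for model, r2 in R2_TABLE_32.items()
}

print(f"residual degrees of freedom: {dof}")
for model, value in adjusted.items():
    print(f"{model}: R2 = {R2_TABLE_32[model]}, adjusted R2 = {value:.3f}")
```

With only five residual degrees of freedom, the penalty is large: a moderate raw R² can shrink substantially or even turn negative once adjusted.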
Finally, we only looked at possible linear relationships between the dependent and independent variables. Thus, we cannot detect other kinds of relationships, such as polynomial or exponential ones. The scatter plots of the independent variables against the algorithms’ errors in the appendices show that most of the independent variables do not have a strictly linear relationship with the dependent variables.
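The limitation of a purely linear analysis can be illustrated on synthetic data. In the sketch below, the data are invented and the `pearson` helper is a plain textbook implementation. A perfectly deterministic quadratic relationship yields a linear correlation of essentially zero, so a linear model would report no effect.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic predictor on a range symmetric around zero.
xs = [x / 10 for x in range(-20, 21)]
ys_linear = [2 * x + 1 for x in xs]       # exactly linear in x
ys_quadratic = [x ** 2 for x in xs]       # exactly determined by x, not linear

r_linear = pearson(xs, ys_linear)
r_quadratic = pearson(xs, ys_quadratic)
print(f"linear: r = {r_linear:.3f}, quadratic: r = {r_quadratic:.3f}")
```

Adding transformed terms (for example a squared or log-transformed predictor) is one common way to let a linear regression capture such patterns, at the cost of more parameters.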