In this dataset, 10 variables remain after the multicollinearity analysis. These variables and the results of regressing the RMSE of each algorithm on them are listed in Table 28. The table shows the coefficient of each variable, with stars marking its significance, together with the RMSE, R², and p-value for R² of each model.
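The multicollinearity procedure is not spelled out in this section. One common option, assumed here purely for illustration, is to drop the variable with the highest variance inflation factor (VIF) repeatedly until every remaining VIF falls below a threshold. The function names and the threshold of 10 in this sketch are hypothetical choices, not taken from the analysis above.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress column j on all other columns plus an intercept.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

def prune_collinear(X, names, threshold=10.0):
    """Drop the highest-VIF column until all VIFs are <= threshold (assumed rule)."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names
```

Applied to a matrix of dataset characteristics, `prune_collinear` would return the reduced design matrix and the names of the surviving variables, which could then be used as regressors.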
1We tried starting the multicollinearity analysis with all of the variables. The results were similar to, or
Table 28: RMSE regression analysis results for the Supermarket dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001
| Variable | CD-CCA | CD-SVD | RMGM | CMF | SD-SVD |
|---|---|---|---|---|---|
| intercept | 0.4244*** | 0.3917*** | 0.4287*** | 0.1095 | 0.4319*** |
| target mode rating | 0.0018 | 0.0412 | 0.0811* | -0.2233 | 0.0256 |
| mean user KL-divergence | -0.0129 | 0.0249 | -0.0349 | 0.1211 | -0.0143 |
| user to source item ratio | -0.1115*** | -0.0158 | -0.0111 | 0.2977 | -0.0099 |
| CCA correlation ≥ 0.95 | -0.1316*** | -0.0268 | -0.0419 | 0.1632 | -0.0266 |
| target density | 0.0028 | -0.0566 | -0.2134*** | -0.138 | -0.0583 |
| target item size | 0.0149 | -0.0168 | -0.0661 | 0.005 | -0.025 |
| target skewness rating | -0.0135 | -0.0543 | 0.0794* | 0.4595** | -0.0567 |
| source to target density ratio | 0.0154 | 0.2252*** | 0.1584*** | -0.2605 | -0.0787 |
| user size | 0.0401** | 0.1261*** | -0.1154*** | 0.1354 | 0.1512*** |
| variance user KL-divergence | -0.0156 | -0.0439 | 0.1452** | 0.5374* | -0.1199* |
| RMSE | 0.0202 | 0.0514 | 0.0441 | 0.23 | 0.0474 |
| p-value | 1.55E-06 | 1.12E-05 | 1.73E-19 | 1.66E-03 | 1.79E-08 |
| R² | 0.572 | 0.52 | 0.913 | 0.351 | 0.666 |
The p-values for R² show that all five models are significant. However, not every individual variable is significant in every model.
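The distinction between model-level and coefficient-level significance can be sketched in code. The example below assumes an ordinary least squares fit with an intercept, t-tests for the coefficients, and an F-test for R²; the estimator is not named in this section, so these are assumptions, as is the definition of RMSE as the root mean squared residual. The helper names `ols_summary` and `stars` are hypothetical.

```python
import numpy as np
from scipy import stats

def ols_summary(X, y):
    """Fit y ~ intercept + X by least squares and report fit statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])           # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    dof = n - k - 1                                # residual degrees of freedom
    r2 = 1.0 - ss_res / ss_tot
    # Standard errors from the usual OLS covariance estimate.
    se = np.sqrt(np.diag((ss_res / dof) * np.linalg.inv(A.T @ A)))
    coef_p = 2.0 * stats.t.sf(np.abs(beta / se), dof)   # two-sided t-tests
    f_stat = (r2 / k) / ((1.0 - r2) / dof)
    model_p = float(stats.f.sf(f_stat, k, dof))         # p-value for R²
    return {"coef": beta, "coef_p": coef_p, "r2": r2,
            "rmse": float(np.sqrt(ss_res / n)), "model_p": model_p}

def stars(p):
    """Significance marker using the thresholds from the table captions."""
    return "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
```

A model can have a very small `model_p` while several entries of `coef_p` stay above 0.05, which is the pattern described above.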
For CD-CCA, the RMSE has significant negative relationships with the number of components whose CCA correlation is at least 0.95 and with the ratio between the number of users and the number of source domain items. This means that the more highly correlated canonical components there are, the lower the error of CD-CCA. Also, a taller source domain rating matrix is associated with less error in CD-CCA. However, an increase in the number of users by itself increases the error. For CD-SVD, the density ratio and the number of users both increase the RMSE. This means that a source domain that is denser relative to the target domain is associated with more error in CD-SVD.
For SD-SVD, we see a positive relationship with the number of users and a negative one with the variance of the user-based KL-divergences between the source and target domains. The latter relationship is not meaningful, since the SD-SVD algorithm does not use source domain information.
For RMGM, the more skewed the target domain ratings are, the higher the error. The user-based KL-divergence variance, the ratio between source and target domain densities, and the mode of ratings in the target domain also have a positive relationship with the error. However, the denser the target domain and the larger the number of users, the lower the error of RMGM.
The user-based KL-divergence variance and the target domain skewness also have a positive relationship with CMF's error.
The significant coefficient with the largest magnitude belongs to the number of canonical correlations ≥ 0.95 for CD-CCA, the source to target domain density ratio for CD-SVD, the number of users for SD-SVD, the target domain density for RMGM, and the variance in user-based KL-divergence between domains for CMF.
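This selection can be reproduced directly from the starred entries of Table 28. The short sketch below copies the significant non-intercept coefficients from the table and picks, for each algorithm, the one with the largest absolute value; the dictionary layout and variable names are illustrative only.

```python
# Significant (starred) non-intercept coefficients copied from Table 28.
significant = {
    "CD-CCA": {
        "user to source item ratio": -0.1115,
        "CCA correlation ≥ 0.95": -0.1316,
        "user size": 0.0401,
    },
    "CD-SVD": {
        "source to target density ratio": 0.2252,
        "user size": 0.1261,
    },
    "RMGM": {
        "target mode rating": 0.0811,
        "target density": -0.2134,
        "target skewness rating": 0.0794,
        "source to target density ratio": 0.1584,
        "user size": -0.1154,
        "variance user KL-divergence": 0.1452,
    },
    "CMF": {
        "target skewness rating": 0.4595,
        "variance user KL-divergence": 0.5374,
    },
    "SD-SVD": {
        "user size": 0.1512,
        "variance user KL-divergence": -0.1199,
    },
}

# For each algorithm, keep the variable whose coefficient has the largest magnitude.
strongest = {
    alg: max(coefs, key=lambda name: abs(coefs[name]))
    for alg, coefs in significant.items()
}

for alg, name in strongest.items():
    print(f"{alg}: {name} ({significant[alg][name]:+.4f})")
```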
In general, the variance of the user-based KL-divergence and the target domain skewness are both positively related to the errors of CMF and RMGM; the number of users can have a positive or a negative relationship with the RMSE, depending on the algorithm; and the density ratio between the source and target domains has a positive relationship with the error of both RMGM and CD-SVD.
The relationships become clearer if we look at the improvement ratios (IR) in Table 29. Here, the variance in user-based KL-divergence is associated with less improvement over SD-SVD in all of the cross-domain recommenders. This means that the more the KL-divergence between each user's source and target domain ratings varies across users, the less the source domain information helps in cross-domain recommendation. The next important factor is the density ratio between the source and target domains, which is significant for the IR of RMGM, CD-SVD, and CD-CCA. The denser the source domain is relative to the target domain, the less these algorithms improve on the RMSE of SD-SVD. Skewness of ratings in the target domain has a negative effect on the IR of RMGM and CMF, in accordance with its relationship with the error of these algorithms. The number of users has a seemingly contradictory effect in CD-CCA, but its relationship with RMGM is consistent. Although it has a positive relationship with the RMSE of CD-CCA, it has a positive relationship with its IR too. In other words, although the error of CD-CCA grows with the number of users, we also see more improvement over SD-SVD with
Table 29: Improvement ratio regression analysis results for the Supermarket dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001
| Variable | CD-CCA | CD-SVD | RMGM | CMF |
|---|---|---|---|---|
| intercept | 0.0685 | 0.1671 | 0.06 | 1.2936 |
| target mode rating | 0.0063 | -0.0795 | -0.3454* | 0.4984 |
| mean user KL-divergence | -0.0118 | -0.161* | 0.0601 | -0.4848 |
| user to source item ratio | 0.2434* | 0.0464 | -0.0177 | -0.9167 |
| CCA correlation ≥ 0.95 | 0.2415 | 0.0163 | 0.0094 | -0.5688 |
| target density | -0.1324 | -0.0589 | 0.4745** | 0.0809 |
| target item size | -0.1064 | -0.0295 | 0.1086 | -0.1517 |
| target skewness rating | -0.1239 | -0.0147 | -0.4594** | -1.9305** |
| source to target density ratio | -0.2716** | -1.0848*** | -0.7398*** | 0.6203 |
| user size | 0.2434*** | 0.1104 | 0.6934*** | 0.0982 |
| variance user KL-divergence | -0.3685** | -0.3047** | -0.9634*** | -2.5057** |
| RMSE | 0.1023 | 0.0972 | 0.1911 | 0.7658 |
| p-value | 2.62E-12 | 4.74E-18 | 4.59E-19 | 1.84E-05 |
Table 30: RMSE regression analysis results for the Yelp dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001
| Variable | CD-CCA | CD-SVD | RMGM | CMF | SD-SVD |
|---|---|---|---|---|---|
| intercept | 0.4889*** | 0.5221*** | 1.9023*** | 0.2964 | 0.5979*** |
| source kurtosis rating | -0.0022 | 0.1295 | -0.0134 | 0.1188 | 0.0287 |
| first component correlation | -0.0099 | -0.105* | -0.4184*** | -0.4066 | -0.0746 |
| target mode rating | 0.2317*** | 0.2401*** | 0.2979*** | 0.2769 | 0.2486*** |
| target density | -0.2123** | -0.3217*** | 0.1057 | -0.3196 | -0.3424*** |
| target median rating | 0.1951*** | 0.1286** | 0.078 | 0.2482 | 0.1379** |
| target skewness rating | 0.8091*** | 0.9822*** | -0.4937** | 1.7259 | 0.9036*** |
| RMSE | 0.1213 | 0.1317 | 0.1673 | 0.9004 | 0.1381 |
| p-value | 2.33E-18 | 2.81E-19 | 1.06E-35 | 2.86E-01 | 6.50E-18 |
| R² | 0.449 | 0.459 | 0.682 | 0.00949 | 0.435 |
a larger number of users. This is because the error of SD-SVD is also positively correlated with the number of users.
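This effect can be checked numerically with the intercepts and user size coefficients from Table 28. The sketch below is illustrative only. It assumes that IR is the relative RMSE reduction over SD-SVD, (RMSE of SD-SVD minus RMSE of the algorithm) divided by RMSE of SD-SVD, which this section does not define, and it holds every other predictor at zero. The function names and the user size values are hypothetical.

```python
def predicted_rmse(intercept, user_coef, user_size):
    """Regression prediction with all other predictors held at zero (assumption)."""
    return intercept + user_coef * user_size

def improvement_ratio(rmse_single, rmse_cross):
    """Assumed IR definition: relative RMSE reduction over the single-domain baseline."""
    return (rmse_single - rmse_cross) / rmse_single

# (intercept, user size coefficient) taken from Table 28.
CD_CCA = (0.4244, 0.0401)
SD_SVD = (0.4319, 0.1512)

for user_size in (0.0, 0.5, 1.0):   # hypothetical values on the regression's scale
    cd = predicted_rmse(*CD_CCA, user_size)
    sd = predicted_rmse(*SD_SVD, user_size)
    print(f"user size {user_size:.1f}: CD-CCA {cd:.4f}, SD-SVD {sd:.4f}, "
          f"IR {improvement_ratio(sd, cd):.4f}")
```

Under these assumptions the predicted RMSE of CD-CCA rises with user size while its IR rises as well, because the predicted SD-SVD error grows faster.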
In addition to these relationships, the IR of CD-CCA improves with a taller source domain rating matrix, and the IR of CD-SVD improves as the average user-based KL-divergence between the two domains decreases, that is, as the users' average rating distributions in the two domains become more similar. Also, as the mode of the target domain ratings increases, which can be an indicator of skewed ratings, the IR of RMGM decreases.