Activities for levels B1-B2 - PROPOSALS FOR DIDACTIC USE


In this dataset, we end up with 10 variables after performing the multicollinearity analysis. These variables and the results of the regression on the RMSE of the algorithms are listed in Table 28. This table shows the coefficient of each variable, with stars representing its significance, together with the RMSE, the R², and the p-value for the R² of each model.
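Read this way, each algorithm column of Table 28 corresponds to one fitted model. As a sketch only, assuming an ordinary linear regression (the text reports an intercept, coefficients, and an R², but does not spell the model out), the model for algorithm a could be written as below. The symbols x_{j,i} (value of variable j for observation i), β_{j,a}, and ε_{a,i} are notation introduced here for illustration, not taken from the source.

```latex
\mathrm{RMSE}_{a,i} \;=\; \beta_{0,a} \;+\; \sum_{j=1}^{10} \beta_{j,a}\, x_{j,i} \;+\; \varepsilon_{a,i}
```

Under this reading, a negative β_{j,a} means that larger values of variable j are associated with a lower error for algorithm a, and a positive one with a higher error.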

1. We tried starting the multicollinearity analysis with all of the variables. The results were similar to, or

Table 28: RMSE regression analysis results for the Supermarket dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001

| Variable | CD-CCA | CD-SVD | RMGM | CMF | SD-SVD |
|---|---|---|---|---|---|
| intercept | 0.4244*** | 0.3917*** | 0.4287*** | 0.1095 | 0.4319*** |
| target mode rating | 0.0018 | 0.0412 | 0.0811* | -0.2233 | 0.0256 |
| mean user KL-divergence | -0.0129 | 0.0249 | -0.0349 | 0.1211 | -0.0143 |
| user to source item ratio | -0.1115*** | -0.0158 | -0.0111 | 0.2977 | -0.0099 |
| CCA correlation ≥ 0.95 | -0.1316*** | -0.0268 | -0.0419 | 0.1632 | -0.0266 |
| target density | 0.0028 | -0.0566 | -0.2134*** | -0.138 | -0.0583 |
| target item size | 0.0149 | -0.0168 | -0.0661 | 0.005 | -0.025 |
| target skewness rating | -0.0135 | -0.0543 | 0.0794* | 0.4595** | -0.0567 |
| source to target density ratio | 0.0154 | 0.2252*** | 0.1584*** | -0.2605 | -0.0787 |
| user size | 0.0401** | 0.1261*** | -0.1154*** | 0.1354 | 0.1512*** |
| variance user KL-divergence | -0.0156 | -0.0439 | 0.1452** | 0.5374* | -0.1199* |
| RMSE | 0.0202 | 0.0514 | 0.0441 | 0.23 | 0.0474 |
| P value | 1.55E-06 | 1.12E-05 | 1.73E-19 | 1.66E-03 | 1.79E-08 |
| R2 | 0.572 | 0.52 | 0.913 | 0.351 | 0.666 |

Based on the p-values reported for the R² values, we can see that all five models are significant. However, although the p-values for all of the R² values are significant, the p-values of the individual variables are not all significant.
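The star notation from the caption of Table 28 can be expressed as a small helper. The sketch below applies the same three thresholds to the model-level p-values reported in the table; the function name `significance_stars` is hypothetical, and using the coefficient thresholds for the model p-values is only an illustration, not something the source states.

```python
def significance_stars(p_value):
    """Return the star label from the table captions for a given p-value."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


# p-values for the R2 of each model, copied from the "P value" row of Table 28.
MODEL_P_VALUES = {
    "CD-CCA": 1.55e-06,
    "CD-SVD": 1.12e-05,
    "RMGM": 1.73e-19,
    "CMF": 1.66e-03,
    "SD-SVD": 1.79e-08,
}

labels = {algo: significance_stars(p) for algo, p in MODEL_P_VALUES.items()}
for algo, stars in labels.items():
    print(f"{algo}: p = {MODEL_P_VALUES[algo]:.2e} {stars}")
```

Every model receives at least one star, which matches the statement that all five regressions are significant; CMF is the only one whose p-value falls above 0.001.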

For CD-CCA, the RMSE has a significant negative relationship with both the number of components whose CCA correlation is at least 0.95 and the ratio between the number of users and the number of source domain items. This means that the more highly correlated canonical components there are, the lower the error of CD-CCA. Also, the taller the source domain rating matrix is, the lower the error of CD-CCA. However, an increase in the number of users by itself increases the error. For CD-SVD, the density ratio and the number of users both increase the RMSE. This means that the denser the source domain is compared to the target domain, the larger the error of CD-SVD.

For SD-SVD, we see a positive relationship with the number of users and a negative one with the variance of the user-based KL-divergences between the source and target domains. The latter relationship is not meaningful, since the SD-SVD algorithm does not use source domain information.

For RMGM, we can see that the more skewed the target domain ratings are, the larger the error. Also, the variance of the user-based KL-divergence, the ratio between source and target domain densities, and the mode of the ratings in the target domain all have a positive relationship with the error. However, the denser the target domain and the larger the number of users, the lower the error of RMGM.

The variance of the user-based KL-divergence and the target domain skewness also have a positive relationship with the error of CMF.

The significant coefficient with the largest magnitude belongs to the number of canonical correlations ≥ 0.95 for CD-CCA, the source to target domain density ratio for CD-SVD, the number of users for SD-SVD, the target domain density for RMGM, and the variance of the user-based KL-divergence between domains for CMF.
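The selection described above can be reproduced directly from the printed coefficients of Table 28. The following sketch assumes that "maximum" refers to the largest absolute value among the starred (significant) coefficients, with the intercept excluded; the helper names `parse_cell` and `largest_significant` are hypothetical.

```python
ALGORITHMS = ["CD-CCA", "CD-SVD", "RMGM", "CMF", "SD-SVD"]

# Coefficients copied from Table 28 as printed (stars mark significance).
ROWS = {
    "target mode rating": ["0.0018", "0.0412", "0.0811*", "-0.2233", "0.0256"],
    "mean user KL-divergence": ["-0.0129", "0.0249", "-0.0349", "0.1211", "-0.0143"],
    "user to source item ratio": ["-0.1115***", "-0.0158", "-0.0111", "0.2977", "-0.0099"],
    "CCA correlation >= 0.95": ["-0.1316***", "-0.0268", "-0.0419", "0.1632", "-0.0266"],
    "target density": ["0.0028", "-0.0566", "-0.2134***", "-0.138", "-0.0583"],
    "target item size": ["0.0149", "-0.0168", "-0.0661", "0.005", "-0.025"],
    "target skewness rating": ["-0.0135", "-0.0543", "0.0794*", "0.4595**", "-0.0567"],
    "source to target density ratio": ["0.0154", "0.2252***", "0.1584***", "-0.2605", "-0.0787"],
    "user size": ["0.0401**", "0.1261***", "-0.1154***", "0.1354", "0.1512***"],
    "variance user KL-divergence": ["-0.0156", "-0.0439", "0.1452**", "0.5374*", "-0.1199*"],
}


def parse_cell(cell):
    """Split a printed cell such as '-0.1316***' into (value, number_of_stars)."""
    value = cell.rstrip("*")
    return float(value), len(cell) - len(value)


def largest_significant(rows, algorithms):
    """Per algorithm, pick the starred variable with the largest |coefficient|."""
    result = {}
    for col, algo in enumerate(algorithms):
        best = None
        for variable, cells in rows.items():
            coef, stars = parse_cell(cells[col])
            if stars > 0 and (best is None or abs(coef) > abs(best[1])):
                best = (variable, coef)
        result[algo] = best
    return result


best = largest_significant(ROWS, ALGORITHMS)
for algo, (variable, coef) in best.items():
    print(f"{algo}: {variable} ({coef:+.4f})")
```

Comparing magnitudes across variables is only meaningful if the regressors are on comparable scales, which the text does not state explicitly.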

In general, we can see that the variance of the user-based KL-divergence and the target domain skewness are both positively related to the errors of CMF and RMGM; the number of users can have a positive or a negative relationship with the RMSE, depending on the algorithm; and the density ratio between the source and target domains has a positive relationship with the error of both RMGM and CD-SVD.

The relationships become clearer if we look at the improvement ratios (IR) in Table 29. Here, we see that the variance of the user-based KL-divergence is associated with less improvement over SD-SVD in all of the cross-domain recommenders. This means that the more the KL-divergence between each user's source and target domain ratings varies across users, the less the source domain information helps in cross-domain recommendation. The next important factor is the density ratio between the source and target domains, which is significant for the IR of RMGM, CD-SVD, and CD-CCA. The denser the source domain is compared to the target domain, the less improvement we will have in the RMSE of these algorithms compared to SD-SVD. Skewness of ratings in the target domain has a negative effect on the IR of RMGM and CMF, in accordance with its relationship with the error of these algorithms. The number of users has a seemingly contradictory effect on CD-CCA, but its relationship with RMGM is consistent. Although it has a positive relationship with the RMSE of CD-CCA, it has a positive relationship with its IR too. In other words, although the error of CD-CCA grows with the number of users, we will also see more improvement over SD-SVD with a larger number of users. This is because the error of SD-SVD is also positively correlated with the number of users.

Table 29: Improvement ratio regression analysis results for the Supermarket dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001

| Variable | CD-CCA | CD-SVD | RMGM | CMF |
|---|---|---|---|---|
| intercept | 0.0685 | 0.1671 | 0.06 | 1.2936 |
| target mode rating | 0.0063 | -0.0795 | -0.3454* | 0.4984 |
| mean user KL-divergence | -0.0118 | -0.161* | 0.0601 | -0.4848 |
| user to source item ratio | 0.2434* | 0.0464 | -0.0177 | -0.9167 |
| CCA correlation ≥ 0.95 | 0.2415 | 0.0163 | 0.0094 | -0.5688 |
| target density | -0.1324 | -0.0589 | 0.4745** | 0.0809 |
| target item size | -0.1064 | -0.0295 | 0.1086 | -0.1517 |
| target skewness rating | -0.1239 | -0.0147 | -0.4594** | -1.9305** |
| source to target density ratio | -0.2716** | -1.0848*** | -0.7398*** | 0.6203 |
| user size | 0.2434*** | 0.1104 | 0.6934*** | 0.0982 |
| variance user KL-divergence | -0.3685** | -0.3047** | -0.9634*** | -2.5057** |
| RMSE | 0.1023 | 0.0972 | 0.1911 | 0.7658 |
| P value | 2.62E-12 | 4.74E-18 | 4.59E-19 | 1.84E-05 |

Table 30: RMSE regression analysis results for the Yelp dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001

| Variable | CD-CCA | CD-SVD | RMGM | CMF | SD-SVD |
|---|---|---|---|---|---|
| intercept | 0.4889*** | 0.5221*** | 1.9023*** | 0.2964 | 0.5979*** |
| source kurtosis rating | -0.0022 | 0.1295 | -0.0134 | 0.1188 | 0.0287 |
| first component correlation | -0.0099 | -0.105* | -0.4184*** | -0.4066 | -0.0746 |
| target mode rating | 0.2317*** | 0.2401*** | 0.2979*** | 0.2769 | 0.2486*** |
| target density | -0.2123** | -0.3217*** | 0.1057 | -0.3196 | -0.3424*** |
| target median rating | 0.1951*** | 0.1286** | 0.078 | 0.2482 | 0.1379** |
| target skewness rating | 0.8091*** | 0.9822*** | -0.4937** | 1.7259 | 0.9036*** |
| RMSE | 0.1213 | 0.1317 | 0.1673 | 0.9004 | 0.1381 |
| P value | 2.33E-18 | 2.81E-19 | 1.06E-35 | 2.86E-01 | 6.50E-18 |
| R2 | 0.449 | 0.459 | 0.682 | 0.00949 | 0.435 |

In addition to these relationships, the IR of CD-CCA improves as the source domain rating matrix gets taller, and the IR of CD-SVD improves as the average user-based KL-divergence between the two domains decreases, that is, as the average user rating distributions become more similar. Also, as the mode of the target domain ratings increases, which can be an indicator of skewness of the ratings, the IR of RMGM decreases.
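The seemingly contradictory effect of the number of users on CD-CCA can be illustrated with a short numeric sketch. It uses only the intercepts and user size coefficients from Table 28 and makes three assumptions that are not confirmed by the text: the improvement ratio is taken to be the relative RMSE reduction over SD-SVD, all other variables are held at zero, and the user size values are hypothetical points on the regressor's (unknown) scale.

```python
# Intercept and user size coefficients copied from Table 28.
CD_CCA = {"intercept": 0.4244, "user_size": 0.0401}
SD_SVD = {"intercept": 0.4319, "user_size": 0.1512}


def predicted_rmse(model, user_size):
    """Predicted RMSE when only user size varies (other variables held at 0)."""
    return model["intercept"] + model["user_size"] * user_size


def improvement_ratio(baseline_rmse, model_rmse):
    """ASSUMED definition: relative RMSE reduction over the baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse


rows = []
for user_size in (0.0, 0.5, 1.0):  # hypothetical values, scale is an assumption
    cd = predicted_rmse(CD_CCA, user_size)
    sd = predicted_rmse(SD_SVD, user_size)
    rows.append((user_size, cd, sd, improvement_ratio(sd, cd)))

for user_size, cd, sd, ir in rows:
    print(f"user size {user_size:.1f}: CD-CCA {cd:.4f}, SD-SVD {sd:.4f}, IR {ir:.4f}")
```

In this sketch both predicted errors rise with the user size, but the SD-SVD error rises faster because its coefficient is larger, so the assumed improvement ratio of CD-CCA rises as well.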
