In closing, the question remains whether any worthwhile avenues of research are left unexplored with these data. While more exploratory analyses could be conducted at this point, their utility must be considered. As previously stated in the section Lack of Participants, these data may be too few to justify more detailed analyses. We have only N=39 students who responded to open-response measures, and N=40 who responded to forced-choice measures. This limited data set must be accounted for in considering extensions to this work, starting with the potential strategy of collecting more data.
7.5.1 Collecting Additional Data
One possibility to extend this work would be to continue data collection. Collecting additional data will likely require an additional year to make contact with a teacher, schedule additional sessions with this teacher, and administer MathSpring again. After the data are collected, we must then consider where to begin in coding them: initial open-response coding, inter-rater agreement testing, or simply having the first author apply tags to the new data set. Starting from the first step, initial open-response coding, would be the most time-intensive option. It would require that we consider changing the existing coding scheme to accommodate new tags that may appear in the new cohort of participants. In addition to experienced coders, new coders who are unfamiliar with the existing coding scheme would have to be found. They would have to tag the new responses and then schedule time to discuss their coding schemes. The current data set required 3 months during the summer of 2017 to coordinate all volunteers’ schedules, so it is reasonable to plan that a new coding scheme would require another 3 months.
Starting from the second step, inter-rater reliability testing, would require less time. Two coders would have to tag the new data set using the existing lexicon of tags. This would require at least one volunteer to make time for this project. Further, it is possible that, despite following the existing coding scheme, the coders would identify responses that deviate from this lexicon. That would require reconsidering which tags to add to or remove from the lexicon. While this work could take longer, it is reasonable to plan that this re-application and inter-rater reliability check would take at least a month.
Finally, if the existing scheme were simply applied by the first author, the approach would take only a week. However, as previously stated, collecting a new data set could take up to a year.
7.5.2 Additional Analyses: Detector of Affect
Detectors of affect have been built using similarly small data sets (Wixon et al., 2014) and performed relatively poorly compared to detectors built using larger data sets. In the open-response condition, moreover, students could report a wider range of emotions. This makes detecting each possible affective state more challenging, because the additional specificity reduces the number of cases available for any one state.
The problem of highly specific tags could be addressed by considering only tags and tag combinations that meet minimum sample size criteria: a set minimum for the number of reports and a set minimum for the number of students who report a particular tag. Then only these more common tags would be considered. However, this still leaves the additional problem of bias due to all the analyses that have been conducted up to this point.
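The minimum sample size filter described above can be illustrated with a minimal sketch. The function name, the input layout (pairs of a student identifier and that report's tags), and the default thresholds are all assumptions made for illustration; no such thresholds were chosen in this work.

```python
from collections import defaultdict


def common_tags(reports, min_reports=10, min_students=5):
    """Keep only tags and tag combinations that are reported often enough.

    `reports` is an iterable of (student_id, tags) pairs, one per self-report.
    Both thresholds are hypothetical placeholders, not values from this study.
    Returns {tag_set: (n_reports, n_students)} for the retained tag sets.
    """
    report_counts = defaultdict(int)
    students = defaultdict(set)

    for student_id, tags in reports:
        tags = frozenset(tags)
        if not tags:
            continue
        # Count every individual tag and the full combination once per report.
        # Using a set avoids double counting when a report has a single tag.
        keys = {frozenset([tag]) for tag in tags} | {tags}
        for key in keys:
            report_counts[key] += 1
            students[key].add(student_id)

    return {
        key: (report_counts[key], len(students[key]))
        for key in report_counts
        if report_counts[key] >= min_reports
        and len(students[key]) >= min_students
    }
```

Requiring a minimum number of distinct students, and not only a minimum number of reports, prevents a tag from being retained because one student reported it many times.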
The approach used up to this point has been to test specific research questions using basic statistical tests regarding which events precede or follow each other. This approach has undermined the potential for building an unbiased detector: having already examined in detail which events appear to precede particular reports, the author has observed which features are likely to predict self-reported emotions.
Finally, we should consider that the resulting affect detector would be less of a means of detecting self-reported affect in an unbiased manner than a means of describing the data and quantifying what features would have the greatest impact on students’ self-reports.
7.5.3 Additional Analyses: Structural Equation Modeling
Structural equation modeling (SEM) might not be a suitable approach for these data due to the particularly small sample size. As a general guideline, sample sizes of N < 200 are often excluded from SEM analyses (Boomsma, 1982). Given that we have only N=39 students who responded to open-response measures and N=40 who responded to forced-choice measures, we are far below the expected minimum. Further, Table 44 shows that many specific constructs are expressed by only a very small subset of the participants. Note the number of tests that were not performed because fewer than 8 participants would have been counted in the given t-test. Of the tests we could run, the total sample sizes are quite small compared to the recommended total of N > 200. However, if more data were available, structural equation modeling would be an interesting further avenue to explore. It could relate a first tier of pretest incoming variables that describe the student, to a second tier of behaviors and emotions inside the tutoring system, to a third tier of post-tutor and outcome variables.