
Two of the four criteria for including a study related to its research design. Although a comparison group and pre-test and post-test assessment make it possible to investigate the counterfactual and to examine the pre-test equivalence of the two groups, the studies included in the systematic literature review did not provide equally trustworthy results.

The design varied amongst the studies, so their trustworthiness was evaluated in a separate stage. The scale used for this evaluation was based on the approach to judging trustworthiness suggested by Gorard (2015b) and Gorard, See and Siddiqui (2017), and on the scale used in Siddiqui and Ventista (2018). In estimating the trustworthiness of a study, these researchers considered the strength of the research design in relation to the research question, the scale of the study, dropout, data quality and other threats to validity.

I created a system based on these recommendations to evaluate controlled trials which examine causal research questions. However, I did not follow the authors' scale completely, because I prioritised consistency of rating and the replicability of my findings. Their scale is intuitive, which means that the person evaluating a study often has to make consequential decisions without clear thresholds. For example, for the scale of the study the authors recommend a large number of cases for the highest rating and a medium number of cases for the next category. These are not clear thresholds but relatively vague judgements. Without the assistance of a rubric, different raters might classify the same study as having a high or a medium number of cases.

According to the system I developed (Table 4.1), three areas are considered. The indicators of the quality of the studies are expressed as stars, and each area contributes a set number of stars to the final grade. A study can receive a maximum of five stars and a minimum of zero. The areas do not all contribute the same number of stars.

This grading system refers to the specifications of the research design and the reporting of the studies in this systematic literature review. It grades only three of the areas suggested by Gorard, See and Siddiqui (2017). These authors evaluated the research design in relation to the research question, whereas the system I developed applies only to controlled trials that aim to answer causal research questions. First, the system rates the research design of any controlled trial according to how the participants were assigned to the comparison or intervention group. If the assignment was random, the study receives two stars towards the overall score. If the groups were matched on specific criteria, the study receives one star, because the matching relied on known criteria and the assignment to groups might not have been effective had the criteria been different. Finally, a study receives zero stars if there was only a comparator group.

When the number of participating units in a study is very small, the randomisation cannot be considered trustworthy (Gorard, 2013, p.128). For this reason, I did not give two stars to studies with small samples, even if they claimed random allocation of participants to the groups.

The second indicator of quality is the sample size of the smaller group. The idea of evaluating the N of the smaller comparison group is based on the scale of Gorard, See and Siddiqui (2017), who evaluate the sample size based on the comparison group. However, as already mentioned, these authors did not set a clear threshold to distinguish high, medium and low numbers of cases. I decided on a threshold of 100. This number is arbitrary, but I chose it for reasons of consistency and to make my results replicable. I recognise that a study with 99 cases in the smaller comparison group does not differ meaningfully from one with 100. Nevertheless, a threshold had to be set somewhere: if a study with 99 cases were allowed into the higher category, the same argument could be made for a study with 98 cases, and so on.

Finally, the third indicator was the attrition of the study from the pre-test to the post-test. A study which does not report dropout is graded with zero stars, because it is untrustworthy. A study which reports attrition higher than 15% of the overall sample raises serious concerns about its results and is therefore rated with one star. A study which reports attrition of no more than 15% of the overall sample is graded with two stars. As with the threshold for the sample size, 15% is an arbitrary threshold, applied for consistency in the grading and to enable the replicability of the study. One main weakness of this approach is that the attrition of both groups is treated as a single figure. In some cases participants drop out from both groups, whilst in others participants drop out from only one.
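The weakness of treating attrition as a single figure can be illustrated with a small calculation. The sketch below is in Python; the function name `attrition_rate` and all participant counts are invented for illustration and are not taken from any study in the review. It shows a hypothetical study whose overall attrition falls under the 15% threshold even though all of the dropout comes from one group.

```python
def attrition_rate(n_pre: int, n_post: int) -> float:
    """Share of participants lost between pre-test and post-test."""
    if n_pre <= 0 or not 0 <= n_post <= n_pre:
        raise ValueError("need n_pre > 0 and 0 <= n_post <= n_pre")
    return (n_pre - n_post) / n_pre


# Invented figures for one hypothetical study (an assumption for illustration).
intervention_pre, intervention_post = 100, 75
control_pre, control_post = 100, 100

# Attrition of the overall sample, as used by the grading system.
overall = attrition_rate(intervention_pre + control_pre,
                         intervention_post + control_post)

# Attrition of each group separately, which the grading system does not see.
intervention_only = attrition_rate(intervention_pre, intervention_post)
control_only = attrition_rate(control_pre, control_post)

print(f"overall:      {overall:.1%}")
print(f"intervention: {intervention_only:.1%}")
print(f"control:      {control_only:.1%}")
```

In this invented case the overall attrition is 12.5%, which would earn two stars, while the intervention group alone has lost a quarter of its participants.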

Table 4.1. Trustworthiness indicators for evaluating the research design of the studies evaluating the impact of the programme.

Trustworthiness indicator | 2 stars | 1 star | 0 stars | Total marks per indicator
Research design | Randomised controlled trial with big sample (at least 100 participants in each group) | Matched comparison group or randomisation within groups (with smaller sample) | Comparator group | 0-2
Sample size (of the smallest group in the study) | Not applicable | N ≥ 100 | N < 100 | 0-1
Attrition (from pre-test to post-test) | Reported attrition which is ≤ 15% of the overall sample | Reported attrition which is > 15% of the overall sample | Attrition not reported | 0-2

Total stars for the study: 0-5
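Because the purpose of Table 4.1 is consistent and replicable rating, the rubric can also be written as a short program. The following Python sketch is one possible encoding, not the procedure actually used in the review; the `Study` class, its field names, the allocation labels ("random", "matched", "comparator") and the function names are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional

SAMPLE_THRESHOLD = 100      # minimum N in the smallest group (Table 4.1)
ATTRITION_THRESHOLD = 0.15  # maximum share of the overall sample lost


@dataclass
class Study:
    allocation: str             # "random", "matched" or "comparator" (assumed labels)
    smallest_group_n: int       # N of the smallest group in the study
    attrition: Optional[float]  # share of overall sample lost; None if not reported


def design_stars(study: Study) -> int:
    """Research design: 0-2 stars."""
    if study.allocation == "random":
        # Randomisation with a small sample is not rated as fully trustworthy.
        return 2 if study.smallest_group_n >= SAMPLE_THRESHOLD else 1
    if study.allocation == "matched":
        return 1
    if study.allocation == "comparator":
        return 0
    raise ValueError(f"unknown allocation: {study.allocation!r}")


def sample_stars(study: Study) -> int:
    """Sample size of the smallest group: 0-1 stars."""
    return 1 if study.smallest_group_n >= SAMPLE_THRESHOLD else 0


def attrition_stars(study: Study) -> int:
    """Attrition from pre-test to post-test: 0-2 stars."""
    if study.attrition is None:
        return 0  # attrition not reported
    return 2 if study.attrition <= ATTRITION_THRESHOLD else 1


def total_stars(study: Study) -> int:
    """Overall trustworthiness rating: 0-5 stars."""
    return design_stars(study) + sample_stars(study) + attrition_stars(study)


# Hypothetical studies, invented to exercise each branch of the rubric.
print(total_stars(Study("random", 120, 0.10)))     # 2 + 1 + 2
print(total_stars(Study("random", 40, None)))      # 1 + 0 + 0
print(total_stars(Study("matched", 150, 0.20)))    # 1 + 1 + 1
print(total_stars(Study("comparator", 99, 0.15)))  # 0 + 0 + 2
```

Writing the thresholds as named constants keeps the two arbitrary cut-offs visible in one place, so a different rater applying the same rubric to the same reported figures would arrive at the same rating.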

This evaluation system is not exhaustive, and these are not the only indicators that can be used to evaluate the studies. Other factors can reduce the trustworthiness of the findings, such as the measurement tools used. A measurement tool which focuses on the exact skills targeted by the school-based intervention is more likely to show a bigger impact for the intervention group. Similarly, pre-test equivalence can play a significant role in the results, and it was not examined by the grading system. These factors were excluded from the general scale because they would have made it excessive and overcomplicated. They are instead included in the discussion and the judgement of the individual studies.

This system for rating the quality of a study does not indicate anything about the impact that the study found for P4C. A high-quality study can find any type of impact (positive, negative or no impact). The impact of the programme should be examined separately from the quality of the study. In this systematic literature review, the impact of the programme was examined after the inclusion and the evaluation of the studies.
