GENERAL CLASS DIAGRAM Figure No. 43 Class Diagram

Prepared by: María Elsa Alomoto Alomoto.


For the thinking problems, which are multiple-choice, I looked for specific indicators that I judged could help me understand to what extent the tools were successful.

9.5.1. Item difficulty and item discrimination

Two main theories of testing have been developed over the last two centuries: Classical Test Theory and Item Response Theory. Classical Test Theory claims that observed test scores are a combination of the true score and measurement error (DeVallis, 2006; Koretz, 2006). The true score is the average score that a person would obtain if their performance were measured repeatedly by similar assessments, assuming that there is no practice effect whereby the person improves by getting used to the assessments (Cronbach, 1961, p.129). However, in this pilot study it was not possible to calculate the true score because repeated measurement tools were lacking.
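The decomposition described above is commonly written as follows. The symbols X (observed score), T (true score) and E (measurement error) are conventional notation assumed here for illustration; they are not taken from the cited sources.

```latex
X = T + E, \qquad T = \mathbb{E}[X]
```

Here the expectation is taken over hypothetical repeated, independent administrations of similar assessments to the same person, which is why T cannot be estimated from a single administration.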

Instead, a Rasch model approach was used. The Rasch model primarily holds that the score attributed to a student depends on the student's ability and on the difficulty of the items (Magno, 2009). This analysis considers item difficulty and item discrimination. According to the Rasch model, students’ ability, item difficulty and discrimination are measured on the same scale. Item difficulty, as the name suggests, is the level of difficulty of one of the constructed thinking problems, and it is calculated as the proportion of students who got the item wrong. Item discrimination refers to the extent to which ‘an item differentiates correctly among test takers in the behaviour that the test is designed to measure’ (Anastasi, 1988, p. 210). In other words, it refers to the extent to which an item effectively distinguishes high-performing students from low-performing students on the measured traits.
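As a general illustration of the idea that ability and difficulty sit on the same scale, the standard dichotomous Rasch model can be sketched as below. This is a minimal sketch of the textbook form, in which all items share the same discrimination; the function and argument names are hypothetical, and it is not the computation carried out in the pilot study.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch model.

    Both arguments are assumed to be on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A student whose ability equals the item difficulty has a 50% chance.
p_equal = rasch_probability(0.0, 0.0)

# A more able student has a higher chance on the same item.
p_higher = rasch_probability(1.0, 0.0)
```

The probability depends only on the difference between ability and difficulty, which is what allows the two quantities to be compared directly.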

Specifically, correct answers were scored 1 and wrong answers 0, so the items were scored dichotomously. After the test has been administered and scored, the item difficulty and the item discrimination can be estimated. For dichotomously scored items, item difficulty can be obtained from the mean of each item in SPSS (Descriptive Statistics > Frequencies). The mean of each item represents its facility: an item with mean = 1 (where the label 1 means that the student answered the item correctly) has been answered correctly by all the students, so it has a facility of 1 and a difficulty of 0. The item difficulty is then obtained by a simple subtraction (1 - mean). The item difficulty and discrimination for both forms are presented in Tables 9.1 to 9.4.
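The calculation described above (item mean as facility, 1 - mean as difficulty) can be sketched in a few lines. The score matrix below is hypothetical and is not the pilot data; the function names are illustrative only.

```python
# Hypothetical 0/1 scores: one row per student, one column per item.
scores = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]


def item_facility(rows):
    """Mean of each column: the proportion of students answering correctly."""
    n_students = len(rows)
    n_items = len(rows[0])
    return [sum(row[j] for row in rows) / n_students for j in range(n_items)]


def item_difficulty(rows):
    """Difficulty as defined in the text: 1 minus the item mean."""
    return [1 - facility for facility in item_facility(rows)]


facility = item_facility(scores)      # [0.75, 0.25, 0.75]
difficulty = item_difficulty(scores)  # [0.25, 0.75, 0.25]
```

An item answered correctly by every student would have a facility of 1 and a difficulty of 0, which is the case the analysis was designed to detect.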

Table 9.1. Item Difficulty for Form 1

Item                                                     Difficulty
Thinking Problem 1: Does James ride a bicycle?           0.36
Thinking Problem 2: Who do you believe?                  0.44
Thinking Problem 3: The meeting                          0.17
Thinking Problem 4: Listening to classical music         0.31
Thinking Problem 5: Two friends were talking             0.25
Thinking Problem 6: The weather                          0.92

Table 9.2. Item Difficulty for Form 2

Item                                                     Difficulty
Thinking Problem 1: Does your brother learn the guitar?  0.36
Thinking Problem 2: Who do you believe?                  0.26
Thinking Problem 3: The road                             0.20
Thinking Problem 4: Pocket money                         0.29
Thinking Problem 5: An announcement                      0.26
Thinking Problem 6: For the end...let’s eat a cake!


The item difficulty is an extremely important factor. If an item is answered correctly by all the students, it takes up space in the assessment form and requires time and effort from the students, but it does not provide any additional information about pupils’ ability. Therefore, the above analysis was important because it confirmed that none of the items was too easy, since no item was answered correctly by everyone. These values were interpreted according to the purpose of the assessment. The purpose of this assessment was not to rank the participants or to select the higher-performing participants. On the contrary, its purpose was to evaluate to what extent students were creative and had developed the skill of critical thinking. In other words, this assessment could be perceived as an assessment of mastery of the critical thinking skill for group comparisons. According to Anastasi (1988, p.210), item difficulty should be interpreted according to the use of the test; for mastery skill tests in particular, she recommended items with difficulty around 0.80.

Based on this recommendation, it could be argued that the items were too easy. However, I decided not to change them because, if my hypothesis was correct and P4C leads to improved thinking skills and creativity, then these students were a cohort with more developed thinking skills than an average group, as they had been involved in P4C for the last 4 years. Furthermore, since both forms were administered towards the end of the school year, the students were more mature than the participants in the trial. The problem ‘Let’s eat a cake!’ should have been parallel to ‘The Weather’ problem, since it was the corresponding problem for evaluating the problem-solving skill in the other form. Nevertheless, ‘Let’s eat a cake!’ was too easy compared with ‘The Weather’. Therefore, the former was removed. Two reasoning items were included in both forms (thinking problems 3 and 4) in order to have reasoning items of different difficulty. As I intended when I designed the assessments, the first reasoning problem was more difficult than the second one (Table 9.1 and Table 9.2).

The discrimination of test items is discussed in Item Response Theory in the 2- or 3-parameter models (Sick, 2008). An item has good discrimination when all or some of the high-scoring students get it right, while all or almost all of the low-scoring students get it wrong. Conversely, it has poor discrimination when high- and low-scoring students are equally likely to get the item right, and it has negative discrimination when only low-scoring students, and not high-scoring students, get the item right. For dichotomously scored items with normally distributed scores, discrimination was estimated with a Pearson correlation in SPSS. Even though Pearson correlation was used to reveal the item discrimination, I do not report statistical significance; that is, I do not discuss whether the correlations were found to be statistically significant. Statistical significance testing is based on the assumption of randomisation in the sampling method (Gorard & Gorard, 2016) and hence was not appropriate in this case. However, this process allowed me to identify items with low discrimination.
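The correlation-based discrimination index described above can be sketched as the Pearson correlation between each item's 0/1 scores and the students' total scores. The data below are hypothetical, the function names are illustrative, and the sketch assumes that the total score includes the item itself; the text does not state whether a corrected total (excluding the item) was used in SPSS.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when a variable has no variance
    return cov / (sd_x * sd_y)


def item_discrimination(rows):
    """Correlate each item column with the (uncorrected) total score."""
    totals = [sum(row) for row in rows]
    n_items = len(rows[0])
    return [pearson([row[j] for row in rows], totals) for j in range(n_items)]


# Hypothetical 0/1 scores: one row per student, one column per item.
scores = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

discrimination = item_discrimination(scores)
```

In this toy data the second item separates high and low scorers perfectly, while the third item is answered correctly only by a low scorer and therefore receives a negative index.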

Table 9.3. Item Discrimination for Form 1

Item                                                     Discrimination
Thinking Problem 1: Does James ride a bicycle?           0.392
Thinking Problem 2: Who do you believe?                  0.729
Thinking Problem 3: The meeting                          0.443
Thinking Problem 4: Listening to classical music         0.516
Thinking Problem 5: Two friends were talking             0.421
Thinking Problem 6: The weather                          -0.251

Table 9.4. Item Discrimination for Form 2

Item                                                     Discrimination
Thinking Problem 1: Does your brother learn the guitar?  0.539
Thinking Problem 2: Who do you believe?                  0.485
Thinking Problem 3: The road                             0.388
Thinking Problem 4: Pocket money                         0.484
Thinking Problem 5: An announcement                      0.485
Thinking Problem 6: For the end...let’s eat a cake!      0.578

Concerning the correlation of each item with the students' overall performance in the critical thinking test, the desirable correlations were found. A negative correlation means that low-performing students get the question right. This might be due to guessing or potentially to construct irrelevance, neither of which is desirable for a reliable and valid test. However, the number of students was too low to draw any conclusion. These correlations only provided some indication of the item discrimination. Furthermore, in a test of a multi-faceted construct like this, low correlations were expected because each item provided different information about the student's performance on a different task and aspect of the construct.


Anastasi (1988, p. 211) argued that item discrimination is not a useful indicator for a criterion-referenced mastery skill test. In other words, since this assessment examined whether and to what extent students are critical and creative, the item discrimination was not examined further. Instead of item discrimination, Anastasi suggested examining criterion validity between the piloted assessment and a criterion assessment. However, in the area of critical thinking, as I hope has already become apparent to the reader, there is no gold standard assessment to measure these two specific thinking skills for students of this age. Thus, criterion validation by correlating this assessment with an external criterion was not a feasible option.

9.5.2. Missing Data

Missing data was also examined. If students had left some responses blank, this would have indicated that an item was difficult or less interesting. Furthermore, if such an item happened to be at the end of the assessment, it could have meant that there was insufficient time to complete the test. All the students replied to all the questions. The pilot study therefore had no missing data and provided no indicators of such issues.

9.5.3. Pattern of correct and wrong answers

One thing that was revealed, though, was the pattern of correct answers. The correct answers in a multiple-choice test cannot always be the "a" or the "c" option. Thus, an effort was made to balance the pattern of correct answers. For the critical thinking test, which had multiple-choice items, I became more aware of the correct response pattern while I was marking the forms.

The pattern for form A was: 1-C, 2-A, 3-B, 4-C, 5-C, 6-A

The pattern for form B was: 1-C, 2-A, 3-B, 4-C, 5-C, 6-C

Option C was the correct answer most of the time. However, I did not want to keep a fully balanced pattern, with each letter being correct twice (2 times · 3 letters for 6 problems), because key balancing might lead to predictability of the correct answers, testwiseness and guessing (Bar-Hillel & Attali, 2002). Nevertheless, I took the pattern of correct options into account for the assessments used in the trial.
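The key patterns reported above can be tallied to see how often each option letter is the correct answer. The sketch below uses the two patterns as given in the text; the variable and function names are illustrative.

```python
from collections import Counter

# Answer keys as reported in the text, in problem order 1 to 6.
form_a_key = ["C", "A", "B", "C", "C", "A"]
form_b_key = ["C", "A", "B", "C", "C", "C"]


def key_distribution(key):
    """Count how often each option letter is the correct answer."""
    return Counter(key)


dist_a = key_distribution(form_a_key)  # C: 3, A: 2, B: 1
dist_b = key_distribution(form_b_key)  # C: 4, A: 1, B: 1
```

A fully balanced key for six problems with three options would show a count of 2 for every letter, which is the pattern that was deliberately avoided.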

Concerning the quality of the wrong options (distractors), which were used to distract some of the students from providing the correct answer, the answers provided were analysed for each thinking problem separately. Specifically, charts were created for each of the problems in order to shed light on the possibility of a misleading distractor that confuses more students than usual (see appendix 4.b.). Even though the sample was small, the students' answers to each thinking problem show that the two distractors were equally misleading, but that students were generally able to identify the correct answer.
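The counts behind such per-problem charts can be sketched as a simple tally of the options chosen. The responses below are hypothetical and do not reproduce the pilot data; the three options A, B and C and the function name are assumptions made for illustration.

```python
from collections import Counter


def option_counts(responses, options=("A", "B", "C")):
    """Count how many students chose each option, including unchosen ones."""
    counts = Counter(responses)
    return {option: counts.get(option, 0) for option in options}


# Hypothetical responses to one problem whose correct option is "C".
responses = ["C", "A", "C", "B", "C", "C", "A", "B"]
correct_option = "C"

counts = option_counts(responses)
distractor_counts = {
    option: n for option, n in counts.items() if option != correct_option
}
```

A distractor whose count is far higher than the other one, or higher than the count for the correct option, would be a candidate for the kind of misleading distractor discussed above.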
