Handling of empirical data requires the use of various statistical methods. In the following, the reliability, validity, and significance testing methods used in this work are described.
RELIABILITY
The results of an assessment should be reproducible under different conditions. In many cases, different observers, or even the same observer at a different time, may reach different conclusions. The concept of reliability provides an estimate of how consistently the studied behavior is observed and scored. In addition to this, agreement between observers reflects whether the target behavior or activity is defined well enough (Kazdin 1978).
Intra-observer reliability measures the variation which occurs when one
observer performs multiple judgments at different times. Inter-observer
reliability, on the other hand, measures the variation that occurs when two or
more persons make the judgments independently. Usually, both of these reliability tests should be used when the reliability of a new measurement method is evaluated. However, if it can be assumed that inter-observer reliability contains all the sources of error contributing to intra-observer reliability, plus any differences which may arise between observers, then it may be sufficient to use only inter-observer reliability tests (Streiner & Norman 1995).
Cohen (1960) has presented Kappa (κ) as a coefficient of agreement for nominal scales. The proportion of agreement corrected for chance is the following:
\[ \kappa = \frac{p_o - p_c}{1 - p_c} \tag{1} \]
where po is the observed proportion of agreement, and pc is the proportion of agreement expected by chance. It can be seen that when the agreement equals the chance agreement, κ = 0. Greater than chance agreement leads to positive values of κ, and less than chance agreement leads to negative values. The upper limit of κ is +1.00, occurring when there is perfect agreement between observers. Originally, Kappa was restricted to the case where the number of observers is two. Kappa was later generalized to the case where more than two observers rate each subject (Fleiss 1971).
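As an illustration of Equation (1), the following is a minimal sketch of the two-observer Kappa computation. The function name `cohen_kappa` and the example ratings are hypothetical and are not taken from the study; chance agreement is computed from the observers' marginal proportions, as in Cohen (1960).

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers rating the same subjects on a nominal scale."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty set of subjects")
    n = len(ratings_a)
    # Observed proportion of agreement (po in Equation 1).
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement (pc): product of the two observers' marginal
    # proportions, summed over all categories.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_c = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_c == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_c) / (1 - p_c)


# Hypothetical ratings of ten subjects by two observers (illustrative only).
obs_1 = ["safe", "safe", "unsafe", "safe", "unsafe",
         "unsafe", "safe", "safe", "unsafe", "safe"]
obs_2 = ["safe", "unsafe", "unsafe", "safe", "unsafe",
         "safe", "safe", "safe", "unsafe", "safe"]
kappa = cohen_kappa(obs_1, obs_2)
print(round(kappa, 3))
```

In this example the observers agree on 8 of 10 subjects (po = 0.80) and the chance agreement is pc = 0.52, so κ = 0.28 / 0.48 ≈ 0.58.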
Kappa considers only the total agreement and does not provide partial credit. This is often inappropriate for scaled responses where the responses may differ by only one or two categories. A solution to this is an extension of Kappa, called
weighted Kappa (κw) (Cohen 1968), which assigns some weight also to the
disagreements between observers. The formula for weighted Kappa is:
\[ \kappa_w = 1.0 - \frac{\sum w_{ij}\, p_{o,ij}}{\sum w_{ij}\, p_{c,ij}} \tag{2} \]
where wij is the weight assigned to the i,j cell, and poij and pcij are the observed and expected proportions in the i,j cell. In principle, the weights could be assigned arbitrary values between 0 and 1. However, unless there are strong prior reasons to do otherwise, the most commonly used scheme, called quadratic weights, should be used (Streiner & Norman 1995).
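A minimal sketch of Equation (2) with quadratic weights is given below. It assumes that the categories are coded as integers 0 to k − 1 and that the weights are the usual quadratic disagreement weights, wij = (i − j)² / (k − 1)², which are 0 on the diagonal and 1 for the most distant categories. The function name `weighted_kappa` and the example ratings are hypothetical.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories):
    """Weighted kappa with quadratic disagreement weights.

    Ratings are assumed to be integers in 0 .. n_categories - 1.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty set of subjects")
    if n_categories < 2:
        raise ValueError("at least two categories are required")
    n = len(ratings_a)
    k = n_categories
    # Cross-tabulate the two observers' ratings (cell counts).
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        counts[a][b] += 1
    row_totals = [sum(counts[i]) for i in range(k)]
    col_totals = [sum(counts[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2            # quadratic weight w_ij
            weighted_observed += w * counts[i][j] / n  # w_ij * po_ij
            weighted_expected += w * row_totals[i] * col_totals[j] / (n * n)  # w_ij * pc_ij
    if weighted_expected == 0.0:
        raise ValueError("weighted kappa is undefined for these marginals")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a five-point scale (0-4) from two observers.
scores_1 = [0, 1, 2, 3, 4, 2, 1, 3]
scores_2 = [0, 2, 2, 3, 3, 2, 1, 4]
kappa_w = weighted_kappa(scores_1, scores_2, n_categories=5)
print(round(kappa_w, 3))
```

Here the three disagreements are all between adjacent categories, so they are penalized only lightly and κw = 20/23 ≈ 0.87, whereas unweighted Kappa would treat them as complete disagreements.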
The relative strength of agreement associated with Kappa has been determined by Landis & Koch (1977) as follows:
Value of κ     Strength of agreement
< 0            Poor
.00 – .20      Slight
.21 – .40      Fair
.41 – .60      Moderate
.61 – .80      Substantial
.81 – 1.0      Almost perfect
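The Landis & Koch (1977) scale can be applied as a simple lookup. The sketch below is illustrative only; the function name `strength_of_agreement` is hypothetical, and values falling between the printed bands (for example 0.205) are assigned to the lower band, which is an assumption rather than part of the original scale.

```python
def strength_of_agreement(kappa):
    """Map a kappa value to the Landis & Koch (1977) verbal label."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0:
        return "Poor"
    # Upper bound of each band, in ascending order.
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


print(strength_of_agreement(0.58))
```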
According to Fleiss (1973), both Kappa and weighted Kappa can be employed as a measure of reliability for quantitative scales. Since Kappa considers only the perfect match between observers, it should be used for nominal scales only, while weighted Kappa is preferable for ordered scales.
In this study, weighted Kappa with quadratic weights has been used. Reliability considerations were arranged using inter-observer reliability tests as described in Sections 5.1.1 and 5.1.2.
The standard deviation (SD) describes how uniform the assessments between observers are. Thus, SD can also be considered a measure of the reliability of a method. In this work, the deviations among students’ assessments in Cases VII-IX were studied using SD computations.
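As a small illustration of using SD to describe the uniformity of assessments, the sketch below computes the SD of the scores that several observers gave to the same item. The item names and scores are hypothetical, and the use of the sample SD (n − 1 in the denominator) is an assumption; the source does not state which form was used.

```python
import statistics

# Hypothetical scores given by five observers to the same three items.
scores_by_item = {
    "item_1": [2, 2, 3, 2, 2],
    "item_2": [0, 1, 3, 2, 1],
    "item_3": [3, 3, 3, 3, 3],
}

# Sample standard deviation per item: a small SD means the observers'
# assessments were uniform, a large SD means they diverged.
sd_by_item = {
    item: statistics.stdev(scores) for item, scores in scores_by_item.items()
}

for item, sd in sd_by_item.items():
    print(item, round(sd, 2))
```

In this example the observers agree completely on item_3 (SD = 0), nearly agree on item_1, and diverge most on item_2.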
VALIDITY
An observation method should measure what we think it measures. This leads to the concept of the method’s or scale’s validity. Validity is linked to reliability: the higher the reliability, the higher the maximum possible validity. Many names are used to describe the different kinds of validity, especially in the educational and psychological literature. For simplicity, the concept of validity is often reduced to three general groups: content validity, criterion validity, and construct validity. (Downie & Heath 1970, Streiner & Norman 1995)
The higher the content validity of a measure, the broader are the inferences that can validly be drawn about the observed phenomena (Streiner & Norman 1995). According to Downie & Heath (1970), content validity is a non-statistical type of validity that is usually associated with achievement tests. An adequate sampling of items by the test constructor is usually enough to assure that a test has content validity.
The criterion validity can be defined as the correlation of a scale with some other measure of the phenomena under study. Ideally, this other measure is a standard that has been widely accepted in the field of study. (Streiner & Norman 1995)
Construct validity differs from the two other types of validity in many ways.
Content and criterion validity can often be established in one or two studies while construct validation is an on-going process of learning more about the phenomenon, making new predictions, and then testing them. Thus, construct validity usually arises from larger theories and observations carried out during a long period of time. Furthermore, with construct validity, both the theory and the measure are assessed at the same time. Both a wrong theory, and a measurement method which cannot discriminate the studied object, can result in invalid conclusions. (Streiner & Norman 1995)
In this study, no specific validity studies were carried out. This was mainly due to the complexity of the studied phenomena, making it difficult to find parameters that would describe an audit method’s validity reliably enough.
Criterion validity could have been studied by comparing the findings of the audit to the accident types in each of the case study companies. This was, however, not feasible since accidents correlate poorly with safety activities and the overall safety level (e.g. Groeneweg 1992). Criterion validity could also have been assessed by using some other work analysis method as a reference. However, this would have required that the other method had been validated, and that it covered the scope of the audit method well enough. This kind of validation also proved to be difficult to carry out.
Emphasis was put on improving the construct validity of the MISHA method by studying in detail a wide range of criteria for a healthy and safe workplace. These criteria were then incorporated into the MISHA method. An attempt to increase the construct validity was also made by studying several organizational assessment methods, and auditing procedures. All these concepts were also discussed with several experts from the Tampere University of Technology, the Finnish Institute of Occupational Health, the University of Louisville, and the VTT Technical Research Centre.
SIGNIFICANCE
The difference between two statistics can be a real difference or it can be only a chance variation. Case studies I-IX included the analysis of the differences in safety performance between companies in the USA and Finland. In these analyses, the assumption of normal distribution for the sum scores was not reasonable for the collected data, and therefore the significance of the differences was evaluated using Mann-Whitney’s U-test (also known as the Wilcoxon rank sum test). The calculation of Wilcoxon-based confidence intervals would have needed a minimum of four units per group (Downie & Heath 1970). In this study, the confidence intervals were not determined since the required number of units was not available.
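The rank-based test described above can be sketched as follows. This is a minimal illustration, not the computation used in the study: the function names and the sum scores are hypothetical, and the exact two-sided p-value obtained by enumerating all possible group assignments is an assumption that is practical only for small samples.

```python
from itertools import combinations


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = average_rank
        pos = end + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return U for group_a and an exact two-sided permutation p-value."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = midranks(list(group_a) + list(group_b))

    def u_from(indices):
        # U = rank sum of the chosen group minus its minimum possible rank sum.
        return sum(ranks[i] for i in indices) - n_a * (n_a + 1) / 2

    u_observed = u_from(range(n_a))
    mean_u = n_a * n_b / 2
    observed_distance = abs(u_observed - mean_u)
    # Count the group assignments whose U lies at least as far from the
    # mean as the observed U does.
    extreme = total = 0
    for subset in combinations(range(n_a + n_b), n_a):
        total += 1
        if abs(u_from(subset) - mean_u) >= observed_distance - 1e-12:
            extreme += 1
    return u_observed, extreme / total


# Hypothetical sum scores for two groups of companies (illustrative only).
group_1 = [41, 45, 52]
group_2 = [58, 60, 63, 66]
u_stat, p_value = mann_whitney_u(group_1, group_2)
print(u_stat, round(p_value, 4))
```

With these scores every value in the first group is lower than every value in the second, giving U = 0. Only 2 of the 35 possible group assignments are this extreme, so the exact two-sided p-value is 2/35 ≈ 0.057, which shows how little resolution the test has with so few units per group.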