Instrumentation or measurement validity is the critical first step in quantitative, positivist research (Straub, Gefen & Boudreau 2005). If the measuring instruments employed in a study were not acceptable at a minimal level, the research findings would be meaningless. Three forms of instrumentation validity are mandatory: content validity, reliability, and construct validity. Some pertinent details are discussed next.
Content validity, reliability, and construct validity
Content validity is concerned with the assurance that the measure includes an adequate and representative set of items that tap the concept (Sekaran 2003). It is “a function of how well the dimensions and elements of a concept have been delineated” (p. 206). Reliability is the extent to which individual items used in a construct are consistent in their measurements (Nunnally 1978; Straub, Gefen & Boudreau 2005). It concerns the assurance that the items posited to measure a construct are, as a set, sufficiently correlated to be reliable (i.e., low in measurement error) (Cronbach 1951).
Construct validity refers to how well the instrument taps the concept as theorized (Sekaran 2003). It is broadly defined as the extent to which an operationalization measures the concept it is supposed to measure (e.g., Bagozzi, Yi
& Phillips 1991; Cook & Campbell 1979). Some authors determined construct validity by assessing the extent to which each measurement item correlates with the total score (e.g., Kerlinger 1986; Yap, Soh & Raman 1992). However, a more stringent assessment of construct validity is through both convergent and discriminant validity (e.g., Campbell & Fiske 1959; Sekaran 2003; Straub, Gefen &
Boudreau 2005; Trochim & Donnelly 2007), which can be established in several ways, as discussed in a later subsection.
In this study, the content validity of the intention-to-return instrument, as discussed above, was attested by the supervisors of the study, following Kidder & Judd (1986) and Sekaran (2003), who suggest that content validity can be attested by a panel of judges. All other measuring instruments, adapted from existing scales with validated psychometric properties, were considered to have content validity.
The reliability and construct validity of all measuring instruments were assessed using SPSS graduate pack 16.0. First, factor analysis was conducted to determine the underlying factor structure of each of the ten scales independently.
Second, reliability analysis based on Cronbach’s alpha model was run to examine the internal consistency reliability of each scale and, through the pilot study, to determine the inclusion or exclusion of measurement items so as to produce a reliable scale.
Finally, convergent and discriminant validity analyses were performed to establish the construct validity of each scale. These analytical procedures are outlined in the subsections that follow.
Factor analysis
There are several factor analytical models, with the most common being PAF (principal axis factoring) and PCA (principal components analysis) (Coakes & Steed 2005). Despite the debate in the literature over which model is most appropriate, PAF was considered the preferred model for this study.
To determine the underlying factor structure of each of the ten scales independently, the factor analytical procedure was carried out separately for each scale as follows. First, using Kaiser’s (1960) criterion of extracting factors with eigenvalues greater than 1, a number of factor extraction statistics, including the total variance explained statistics, the scree plot, the communalities and factor loadings, were generated for each scale. This factor extraction criterion is based on the idea that an eigenvalue of 1 represents a substantial amount of variation explained by a factor (Field 2005, p. 633). Next, the communalities and factor loadings of the measurement items for each construct were examined to assess whether they tapped into the same construct as predicted (Coakes & Steed 2005). The complete process is illustrated in Appendix 2-15, while some issues concerning communality and factor loading are discussed in the paragraphs that follow.
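The eigenvalue-greater-than-1 rule described above can be sketched in a few lines. This is only an illustrative sketch, not the SPSS procedure used in the study: it assumes the criterion is applied to the eigenvalues of the unreduced item correlation matrix, and the function name `kaiser_retained_factors` and the synthetic data are hypothetical.

```python
import numpy as np

def kaiser_retained_factors(item_scores):
    """Return the eigenvalues (descending) of the item correlation matrix
    and the number of factors with an eigenvalue greater than 1."""
    corr = np.corrcoef(item_scores, rowvar=False)   # items are columns
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]    # eigvalsh returns ascending order
    n_retained = int(np.sum(eigenvalues > 1.0))
    return eigenvalues, n_retained

# Hypothetical data: 200 respondents, 4 items driven by one latent construct.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
items = latent + 0.5 * rng.normal(size=(200, 4))

eigenvalues, n_retained = kaiser_retained_factors(items)
# Proportion of total variance attributable to each factor.
proportion_explained = eigenvalues / eigenvalues.sum()
print(n_retained, np.round(proportion_explained, 3))
```

Because the eigenvalues of a correlation matrix sum to the number of items, an eigenvalue of 1 corresponds to a factor that accounts for as much variance as a single standardized item.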
Communality: Communality for an item measuring a predicted construct is the percent of variance in that item explained by the predicted construct (Field 2005).
It is a measure of the substantive importance of a measurement item to the predicted construct. In general, low communalities across a set of measurement items indicate that the items are only weakly related to one another. A construct comprising an item with a low communality raises the concern that the construct might not work well for that item. However, an item with a low communality may still be meaningful if it contributes to the interpretation of a well-defined construct, although a high communality often reflects a greater contribution. To determine whether a measurement item has substantive importance to the predicted construct, Stevens (1992) recommends a minimum threshold of 0.16 for communality or 0.4 for the loading associated with that item.
Factor loading: Factor loading (FL) for an item measuring a predicted construct can be thought of as the Pearson correlation (r) between the measurement item and the predicted construct (Coakes & Steed 2005; Field 2005). Thus, squared FL (like squared r) would give an estimate of the percent of variance in the measurement item explained by the predicted construct (Field 2005). This means that
‘squared item-loading is communality’, and that loading (like communality) is a gauge of the substantive importance of a measurement item to the predicted construct. In general, the higher the loading, the more meaningful it is, or the greater the impact of the measurement item on the predicted construct (Pedhazur & Schmelkin 1991).
A finding that measurement items have high loadings on the predicted construct indicates that the measurement items posited to represent the construct really tap into the same construct (Carmines & Zeller 1979; Pedhazur & Schmelkin 1991; Pett, Lackey & Sullivan 2003). There is, however, no consensus on how high a loading needs to be. Some researchers used a minimum threshold of 0.3 or 0.35, while others used a minimum loading equal to 5.152/[SQRT(N-2)] when the sample size (N) was 100 or more (Norman & Streiner 1994). Still other researchers used 0.4 for the central construct and 0.25 for other constructs (Raubenheimer 2004). Typically, researchers have treated a loading greater than 0.3 as important (Field 2005). Stevens (1992) recommends a minimum loading of 0.4, which explains around 16% of the variance in the measurement item.
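The arithmetic behind these loading thresholds can be illustrated with a short sketch. The function names below are hypothetical and are introduced only for illustration; the formula is the 5.152/[SQRT(N-2)] rule quoted above, and the squared loading is taken as the proportion of item variance explained.

```python
import math

def norman_streiner_min_loading(n):
    """Minimum loading 5.152 / sqrt(N - 2), stated for samples of 100 or more."""
    if n < 100:
        raise ValueError("the rule is stated for sample sizes of 100 or more")
    return 5.152 / math.sqrt(n - 2)

def variance_explained(loading):
    """Proportion of item variance explained by the construct (squared loading)."""
    return loading ** 2

# A loading of 0.4 explains about 16% of the variance in the item.
print(round(variance_explained(0.4), 2))

# The sample-size-based threshold falls as N grows.
for n in (102, 200, 400):
    print(n, round(norman_streiner_min_loading(n), 4))
```

Under this rule a larger sample permits a lower minimum loading, which is one reason fixed cut-offs such as 0.3 or 0.4 are often treated as rules of thumb.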
Reliability analysis
Reliability and factor analysis are complementary procedures in scale construction and definition (Coakes & Steed 2005, p. 164). Cronbach’s alpha reliability model was considered the preferred model for this study. This procedure examines Cronbach’s alpha internal consistency reliability coefficient of a given scale and determines the inclusion or exclusion of measurement items to produce a reliable scale. Cronbach’s alpha reliability coefficient indicates how well the measurement items in a set are positively correlated with one another (Sekaran 2003). The commonly used threshold value for Cronbach’s alpha is 0.7 (Hair et al. 1995; Nunnally 1978). Some researchers suggest a reliability alpha of 0.6 as the minimum acceptable level (e.g., Churchill 1991; Sekaran 1992; Slater 1995). In Sekaran’s (2003) terms,
“reliabilities less than 0.60 are considered to be poor, those in the 0.70 range, acceptable, and those over 0.80, good” (p. 311).
The output of a reliability analysis for a given scale comprises three important statistics (Coakes & Steed 2005). First, the ‘corrected item-total correlation’ statistics show the Pearson correlation coefficient (r) between the score on the individual item and the sum of the scores on the remaining items. Here, Field (2005) holds that r for each item in a reliable scale should not be less than 0.3, although this depends slightly on sample size: a smaller r is acceptable with a larger sample.
Items with r < 0.3 may have to be dropped because they do not correlate very well with the scale overall. Second, the ‘Cronbach’s alpha if item deleted’ statistics display the alpha coefficient that would result if the item were removed from the scale. Finally, the ‘reliability’ statistics show the Cronbach’s alpha reliability coefficient for the overall scale.
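The three statistics described above can be reproduced with a minimal sketch. It is not the SPSS routine used in the study; it assumes the standard variance-based formula for Cronbach’s alpha, a scale of at least three items, and hypothetical function names and synthetic data.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

def item_analysis(items):
    """For each item, return (index, corrected item-total r, alpha if item deleted).
    Requires at least three items so that alpha is defined after a deletion."""
    items = np.asarray(items, dtype=float)
    rows = []
    for j in range(items.shape[1]):
        rest = np.delete(items, j, axis=1)   # all items except item j
        r = np.corrcoef(items[:, j], rest.sum(axis=1))[0, 1]
        rows.append((j, r, cronbach_alpha(rest)))
    return rows

# Hypothetical data: 200 respondents, 4 items sharing one latent construct.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
scores = latent + 0.5 * rng.normal(size=(200, 4))

alpha = cronbach_alpha(scores)
rows = item_analysis(scores)
# Items whose corrected item-total correlation falls below 0.3 are flagged.
flagged = [j for j, r, _ in rows if r < 0.3]
print(round(alpha, 3), flagged)
```

An item is a candidate for exclusion when its corrected item-total correlation is below 0.3 or when the alpha-if-item-deleted value exceeds the overall alpha.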
Convergent and discriminant validity analysis: Construct validity
The idea of establishing construct validity through convergent and discriminant validity was proposed by Campbell & Fiske (1959). Convergent validity means that two or more valid measures of the same concept should correlate highly, whereas discriminant validity means that valid measures of different concepts should not correlate too highly (Bagozzi, Yi & Phillips 1991). Put another way, measures that are theoretically supposed to be highly correlated are so in practice (convergent validity), whereas measures that are theoretically unrelated to one another are in fact not correlated (discriminant validity) (Trochim & Donnelly 2007).
To establish construct validity, the construct should have not only convergent validity, but also discriminant validity (Churchill 1979). There are several ways by which convergent and discriminant validity can be tested, including factor methods, correlational methods, the AVE (average variance extracted) method, SEM (structural equation modelling) methods, and the multitrait-multimethod approach. The AVE method employed in this study is briefly discussed below.
AVE (average variance extracted) method: AVE is a measure of the average variance extracted from the measurement items by each construct, which is computed as the average communality, that is, the mean of the squared item loadings (Straub, Gefen & Boudreau 2005). According to Fornell & Larcker (1981), convergent and discriminant validity of a given construct can be established as follows. First, a construct is considered to display convergent validity when the average variance explained by that construct’s items (i.e., the construct’s AVE) is at least 0.50, that is, when the variance explained by the construct is greater than the variance due to measurement error. Next, a construct is considered to have discriminant validity when the construct’s AVE is greater than the construct’s shared variance (i.e., the squared correlation) with every other construct.
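The two Fornell & Larcker (1981) checks can be sketched as follows. The sketch assumes that AVE is taken as the mean of the squared standardized item loadings, which is the quantity to which the 0.50 threshold and the squared-correlation comparison apply. The function names, loadings, and construct correlation are hypothetical values chosen only for illustration.

```python
import numpy as np

def average_variance_extracted(loadings):
    """AVE as the mean squared standardized loading of a construct's items."""
    loadings = np.asarray(loadings, dtype=float)
    return float(np.mean(loadings ** 2))

def fornell_larcker(ave_by_construct, construct_corr):
    """Check convergent validity (AVE >= 0.50) and discriminant validity
    (AVE greater than the squared correlation with every other construct).
    Rows and columns of construct_corr follow the order of ave_by_construct."""
    corr = np.asarray(construct_corr, dtype=float)
    results = {}
    for i, (name, ave) in enumerate(ave_by_construct.items()):
        shared_variance = np.delete(corr[i] ** 2, i)   # drop the diagonal entry
        results[name] = {
            "ave": ave,
            "convergent": ave >= 0.50,
            "discriminant": bool(np.all(ave > shared_variance)),
        }
    return results

# Hypothetical loadings for two constructs and their correlation.
aves = {
    "construct_a": average_variance_extracted([0.8, 0.7, 0.9]),
    "construct_b": average_variance_extracted([0.6, 0.5, 0.7]),
}
construct_corr = [[1.0, 0.5],
                  [0.5, 1.0]]

results = fornell_larcker(aves, construct_corr)
for name, outcome in results.items():
    print(name, outcome)
```

In this illustration the second construct passes the discriminant check but fails the convergent check, which shows that the two criteria are assessed separately.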
Assumptions underlying factor, reliability, and correlational analysis, and their test procedures
When selecting a data analysis technique that involves parametric statistics, one should ensure that the parametric assumptions related to the technique are satisfied (e.g., normality, linearity, and lack of multicollinearity) (Straub, Gefen & Boudreau 2005). However, it is noteworthy that researchers have found moderate violations of parametric assumptions to have little or no effect on substantive conclusions in most instances (e.g., Cohen 1969).
Underlying the application of PAF (principal axis factoring) factor analysis are a number of assumptions related to sample size, normality, linearity, homoscedasticity, absence of outlying cases, absence of extreme multicollinearity and singularity, factorability of the correlation matrix, and absence of outliers among variables (Coakes & Steed 2005). Of these, normality, linearity, homoscedasticity, and absence of outlying cases are also underlying assumptions of both reliability and correlational analysis. These assumptions and their test procedures are outlined in Appendix 2-14.
In brief, the foregoing factor and reliability analyses, and the convergent and discriminant validity analyses, were conducted in both the pilot and main studies to assess instrumentation validity. Some salient points of the pilot study are discussed in the subsection that follows.