5. Introduction to Hypothesis Testing
5.2. Mechanics of Hypothesis Testing
Univariate statistical analysis was utilized for the majority of the PHEV operating and emissions data analysis. Since most of the objectives evaluated different on-road responses to various driving, environmental, and implementation (transit versus civilian driving patterns) scenarios, basic analysis of variance (ANOVA) and the non-parametric equivalent Kruskal-Wallis tests provided the framework for the bulk of the statistical work performed for the PHEV Sprinter study.
Analysis of variance (ANOVA) examines the variance in the data in order to determine if that variance is attributable to specific sources, determining whether the means of several variables or groups are equivalent to a set degree of statistical
significance (α). This test design essentially allows the experimenter to ascertain if the variation in a response variable (for example, carbon dioxide emissions) can be
statistically explained by a predetermined factor variable (such as vehicle specific power bin or roadway type). ANOVA imposes the assumptions of normality, independence, and homoscedasticity.
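The variance partition that ANOVA relies on can be sketched as follows. This is a minimal illustration written for this section, not the analysis code used in the study; the function name `one_way_anova_f` and the emission values are hypothetical. It computes the F statistic as the ratio of the between-group mean square to the within-group mean square.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` holds one list of response values per factor level.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(
        len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means)
    )
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum(
        (x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical CO2 emission rates in three VSP bins (illustrative values only).
co2_by_bin = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_anova_f(co2_by_bin)
print(f_stat, df_b, df_w)
```

The resulting F value would then be compared against the F distribution with the returned degrees of freedom at the chosen α. The sketch does not check the normality, independence, or homoscedasticity assumptions.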
While a number of the measured variables in the on-road dataset (both emission and operating) cannot be deemed independent due to the cause-and-effect nature of the physical and chemical processes measured during the PHEV’s operation, ANOVA and its non-parametric equivalent, Kruskal-Wallis, were suitable for investigating the effects of route and roadway factors on both PHEV operation and emissions.
Additionally, continuous, on-road data possess a degree of intrinsic autocorrelation. In time-series data, such as on-road emissions and vehicle operating data, observations close together in time have some degree of relation to one another. Without any categorization
or additional filtering, the individual measurements in a time series cannot be considered independent of each other. Deliberate modal categorization of the on-road data, such as by VSP bin, removes the autocorrelation associated with continuous on-road data (Zhai, 2008). Additionally, filtering the second-by-second datasets down to every 5 seconds, or to 5-second averages, has been found to remove essentially all autocorrelation associated with continuous on-road data (Zhai, 2008). For all analyses performed, one of three methods of establishing independence was used: the data were subset into unique modes or bins according to modal models, the data were averaged over every 5 seconds of data collection, or the data were aggregated into overall averages for a set link, section of road, or route, creating a sample- or run-based value. Analyses based on aggregate sample-route data, yielding total emissions, power output, and fuel use for an entire sample route or roadway link, also eliminated the impact of autocorrelation within the dataset.
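The 5-second averaging step, together with a simple check of serial dependence, might look like the sketch below. The function names are hypothetical, and dropping a trailing partial block is an assumption of this example rather than a documented choice of the study. The lag-1 autocorrelation coefficient is used here only as one common diagnostic.

```python
from statistics import mean


def block_average(series, width=5):
    """Average consecutive, non-overlapping blocks of `width` samples.

    A trailing partial block is dropped (an assumption of this sketch).
    """
    n_blocks = len(series) // width
    return [mean(series[i * width:(i + 1) * width]) for i in range(n_blocks)]


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation coefficient of a sequence."""
    m = mean(series)
    dev = [x - m for x in series]
    numerator = sum(a * b for a, b in zip(dev[:-1], dev[1:]))
    denominator = sum(d * d for d in dev)
    return numerator / denominator


# Hypothetical second-by-second signal (illustrative values only).
one_hz = [0.0, 0.4, 0.9, 1.1, 1.0, 0.7, 0.2, 0.0, 0.3, 0.8,
          1.2, 1.1, 0.6, 0.1, 0.0, 0.5, 1.0, 1.3, 0.9, 0.4]
five_second = block_average(one_hz, width=5)
print(lag1_autocorrelation(one_hz), lag1_autocorrelation(five_second))
```

Comparing the coefficient before and after averaging gives a quick indication of how much serial dependence remains. Whether it is small enough to treat observations as independent is a judgment the sketch does not make.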
Continuous on-road data are also inherently non-normal. This is particularly true of the PHEV data, where a disproportionate number of zero-valued observations exists for several of the measured variables because of the vehicle’s dual-mode operation (electric-only or hybrid operation). Even within individual VSP bins, the PHEV dataset proved non-normal with respect to ICE power output and carbon dioxide emissions during charge-sustaining mode due to the presence of electric-only operation within each bin.
Normality, however, can be achieved by imposing a log transformation on continuous on-road data. Unfortunately, zero-valued data recorded under electric operation continued to violate the requirements of normality for the larger datasets regardless of transformation method. Removing the electric-only data from each VSP bin resulted in normal datasets in bins 1 through 5. With the electric-only data removed, the log-based transformation was applied to the residual charge-sustaining dataset, creating a semblance of normality. However, since electric-only operation throughout charge-sustaining and charge-depleting operation is a critical feature of PHEV operation, there was no reasonable justification for removing the zero-emissions data. Since normality could not be rationally developed in the on-road PHEV data, and in an effort to compensate for violations of the ANOVA assumptions, all ANOVA analyses were duplicated using the non-parametric equivalent, the Kruskal-Wallis test. Where ANOVA evaluates population means and variance, the Kruskal-Wallis analysis tests for the equality of population medians among groups, negating the normality requirement of the parametric test.
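The rank-based statistic behind the Kruskal-Wallis test can be sketched as below. The implementation and its function names are illustrative assumptions, not the software used in the study. Tied values receive average ranks and the standard tie correction is applied, which is relevant here because electric-only operation produces many identical zero values.

```python
from collections import Counter


def average_ranks(values):
    """Rank values 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    weighted_rank_sums = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted_rank_sums += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * weighted_rank_sums - 3 * (n + 1)

    # Tie correction; undefined (division by zero) if all values are identical.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    return h / correction


# Hypothetical emissions in two groups, with zeros from electric-only operation.
print(kruskal_wallis_h([[0.0, 0.0, 1.0], [0.0, 2.0, 3.0]]))
```

For moderately large groups, H is usually compared against a chi-square distribution with one fewer degree of freedom than the number of groups. The p-value calculation is omitted from this sketch.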
Pearson’s correlation coefficients were used to gauge the existence of linear relationships between the measured variables. While Pearson’s correlation coefficient provides a tested measure of linearity, it is not suitable for discerning non-linear relationships between variables. The nature of the on-road PHEV data suggests linear cause-and-effect, or co-cause-and-effect, relationships between variables. Pearson’s correlation coefficients were used primarily to distinguish variable pairs with strong linear relationships from those with weak or no linear relationships. Because correlation analysis assumes independence between observations, all data used in correlation analyses were either filtered to 5-second averages or consisted of compiled summary data (i.e., on a per-sample-run basis) and, therefore, no longer represented a time-series dataset.
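The limitation noted above, that Pearson’s coefficient captures only linear association, can be seen in a short sketch. The function name and data are hypothetical. A perfectly linear pair gives a coefficient of 1, while a perfectly determined but symmetric quadratic pair gives 0.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Linear relationship: y = 2x (e.g., an idealized fuel use vs. CO2 pairing).
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))

# Non-linear relationship: y = x^2 on a symmetric range.
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```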
Multivariate analysis techniques were investigated for their suitability in application to the on-road continuous dataset. One of the primary requirements for
multivariate analysis is the assumption of independence between measured variables.
With regard to operating and emissions data, it cannot be assumed that the variables are independent of each other, since they have a direct physical or chemical cause-and-effect relationship (e.g., fuel use and carbon dioxide emissions). Additionally, the ultimate product of traditional multivariate techniques such as principal component analysis (PCA) or factor analysis (FA) did not suit the investigations of this study. The clustering and subgrouping of data offered by PCA and FA were considered redundant given that the experimental design systematically created data subgroups such as roadway type, ambient temperature, and on-road operation scheme for use in objective design.
Multivariate analysis of variance (MANOVA), however, was employed as a justification tool for the univariate ANOVA tests performed. MANOVA is a multivariate generalization of ANOVA that safeguards against the inflated risk of Type I errors that arises when running a series of separate ANOVAs on a multivariate dataset (Johnson, 1998). Using a multivariate technique allows several populations to be compared by utilizing all of the measured variables simultaneously. Having established statistical significance with MANOVA, the researcher felt confident relying on univariate analysis for the greater part of the data analysis.
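The Type I error inflation that motivates a multivariate test first can be illustrated with a short calculation. This is not MANOVA itself. It assumes the separate univariate tests are independent, which is a simplification for correlated on-road variables, and the function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one Type I error across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Error inflation as more separate ANOVAs are run at the same alpha.
for n_tests in (1, 3, 5, 10):
    print(n_tests, round(familywise_error_rate(0.05, n_tests), 4))
```

Under this assumption, five separate tests at α = 0.05 carry roughly a 23% chance of at least one false positive.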
Due to the size of the datasets being investigated (filtered or unfiltered), appropriate levels of significance were determined for each test. Statistical significance for MANOVA tests was set at α<0.05. With statistical significance found in the multivariate analysis of variance, an α<0.025 was used to determine significance for the ANOVA and Kruskal-Wallis tests performed on the continuous datasets. Since the run-based datasets were essentially summarized compilations of the continuous on-road data, the criteria demanded by ANOVA were no longer violated for these analyses and an α<0.05 was accepted. The datasets were subgrouped according to investigation for all ANOVA and Kruskal-Wallis tests, so the sample sizes of the populations of interest were smaller than in other tests. Correlation analysis, however, was assessed more conservatively with an α<0.025 due to the large size of the continuous datasets.
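The significance levels described in this section can be collected into a small lookup, sketched below. The dictionary keys and the function name are hypothetical labels introduced for illustration. Only the α values come from the text, and significance is read here as a p-value strictly below α.

```python
# Significance levels stated in the text; key names are hypothetical labels.
ALPHA = {
    "manova": 0.05,
    "anova_continuous": 0.025,
    "kruskal_wallis_continuous": 0.025,
    "run_based": 0.05,
    "correlation_continuous": 0.025,
}


def is_significant(p_value, analysis):
    """Return True if `p_value` falls below the alpha for `analysis`."""
    return p_value < ALPHA[analysis]


# The same p-value can pass one criterion and fail a stricter one.
print(is_significant(0.03, "manova"))
print(is_significant(0.03, "correlation_continuous"))
```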
Chapter 4: Summary and Overview of Data