Both regression methods and multivariate ordination methods were used to investigate the nature of vegetation and edaphic relationships.
6.4.4.4.1 Multivariate analyses
The use of modern ordination techniques makes it possible to analyse multiple response variables (plant species) simultaneously with multiple explanatory variables (edaphic variables).
This alternative approach is particularly useful in this study because only a limited number of species could be modelled using the univariate regression approaches described in the next section, owing to their restricted occurrence over the full range of sites.
Using multivariate ordination techniques effectively lifts this restriction, enabling the incorporation of a greater number of plant species for analysis. This provides a broader overview of the relationships between species and edaphic factors of the study region (Guisan et al., 1999, Legendre, 2008). Additionally, “Despite the variety of statistical methods available for static modelling of plant distribution, few studies directly compare methods on a common data set” (Guisan et al., 1999, p. 107).
6.4.4.4.1.1 Indirect gradient analysis (PATN-SSH-NMDS)
Using the Semi-strong Hybrid Multidimensional Scaling (SSH) approach in the PATN™ software package, the 136 sites used for the TWINSPAN classification were subjected to indirect ordination. The multidimensional scaling family of ordination procedures, of which SSH is a member, is deemed to be amongst the most powerful and suitable of ordination techniques for ecological data (McCune and Grace, 2002). SSH provides a more appropriate measure of distance, particularly in cases where many sites share few or no species in common, as is the case with the Cooper Creek floodplain data.
The ordination was derived from a dissimilarity matrix generated using the Bray-Curtis dissimilarity measure (Belbin, 1987). The ordination was then viewed in three dimensions using combinations of the first three ordination axes to determine hypothetical gradients influencing the arrangement of sites based on plant associations. To investigate the influence of soil variables upon the distribution of vegetation, a range of the most physiologically relevant variables was fitted post hoc to the ordination space. Following the ordination procedure above, the selected soil variables were standardized and subjected to the Principal Axis Correlation (PCC) routine in PATN™. The aim of the procedure is to relate the soil variables to the ordination space, to determine whether the species data are responding to the soil variables. PCC, a multiple-linear regression program, provides the best linear fit of the selected soil variables in the supplied two-dimensional ordination space. Additionally, a Monte Carlo randomization technique (MCAO routine in PATN™) was used to test the statistical significance of each PCC edaphic vector. Edaphic vectors were deemed significant at the p<0.001 level.
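The two computational ideas in this paragraph can be illustrated with a minimal sketch. It is not the PATN™ implementation: the function names are hypothetical, the fit is restricted to two ordination axes, and the permutation scheme (shuffling the soil variable across sites and recomputing R²) is an assumption about how a Monte Carlo test of a fitted vector might be carried out.

```python
import random

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two site abundance vectors
    (0 = identical composition, 1 = no species in common)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den > 0 else 0.0

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    suu = sum((x - mu) ** 2 for x in u)
    svv = sum((y - mv) ** 2 for y in v)
    return suv / (suu * svv) ** 0.5

def r_squared_2d(y, ax1, ax2):
    """R^2 of the least-squares regression of y on two ordination axes,
    using the closed form for two predictors."""
    r1, r2, r12 = pearson(y, ax1), pearson(y, ax2), pearson(ax1, ax2)
    return (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)

def permutation_p(y, ax1, ax2, n_perm=999, seed=0):
    """Monte Carlo test (assumed scheme): shuffle the soil variable
    across sites and count how often the permuted R^2 reaches the
    observed value. Returns (observed R^2, p-value)."""
    rng = random.Random(seed)
    observed = r_squared_2d(y, ax1, ax2)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if r_squared_2d(shuffled, ax1, ax2) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Here `y` would hold one standardized soil variable per site, and `ax1` and `ax2` the site scores on two ordination axes. The p-value counts the observed arrangement as one of the permutations, so its smallest attainable value is 1/(n_perm + 1).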
6.4.4.4.1.2 Direct gradient analysis (CANOCO-CCA)
The statistical software package CANOCO 4.5 (CANOnical Community Ordination) was used to conduct a range of gradient analyses. The approaches adopted for analyses and statistical tests followed the standard procedures outlined by ter Braak and Smilauer (2002).
Canonical Correspondence Analysis (CCA) integrates multivariate ordination analysis and constrained regression techniques, enabling a combined analysis of edaphic variables and species; in this study, edaphic variables are considered simultaneously with species abundance data (Guisan et al., 1999). It is assumed that environmental variables, rather than factors such as succession and disturbance, are the main drivers causing variation in vegetation composition (ter Braak, 1986, Jongman et al., 1995). The ordination axes are in effect ‘constrained’ by a multiple linear regression on the edaphic variables used for the analysis.
CCA and its associated underlying multiple regression approach are deemed to be robust techniques of analysis (ter Braak, 1987, Leps and Smilauer, 2003). The robustness is attested to by Palmer, who states that the process “performs quite well with skewed species distributions, with quantitative noise in species abundance data, with samples taken from unusual sampling designs, with highly intercorrelated environmental variables, and with situations where not all of the factors determining species composition are known” (Palmer, 1993, p. 2215). In CCA a range of predictor variable types (continuous, categorical and binary) can be analysed. Categorical factors are re-coded in CANOCO into dummy variables.
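As a generic illustration of dummy coding (not of CANOCO's internal procedure), a categorical factor can be expanded into one 0/1 indicator column per level. The function name and the soil texture classes below are hypothetical.

```python
def dummy_code(values):
    """Expand a categorical factor into 0/1 indicator columns.

    Returns the sorted list of levels and, for each observation,
    a row with a 1 in the column of its level and 0 elsewhere."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical soil texture classes recorded at four sites.
levels, rows = dummy_code(["clay", "sand", "clay", "loam"])
```

Each row then enters the analysis as a set of binary predictors, one per level of the factor.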
Initially, species data are subjected to a detrended correspondence analysis (DCA) in order to determine the appropriate response model for analytical processes. The results of DCA provide an indication of the maximum length of the canonical axes. This represents the apparent environmental gradient which is suggested by the diversity or species turnover. A high species turnover indicates that a long environmental gradient has been sampled. In that case a unimodal response model is assumed and the use of CCA is deemed appropriate for analytical purposes. Should a short gradient be identified, then a redundancy analysis (RDA), a constrained linear ordination assuming a linear response, is a more appropriate procedure (Borcard et al., 1992).
Ordination diagrams produced from analyses portray the combination of patterns of species distribution and the relationship of these species to the edaphic variables selected for analysis (Hettrich and Rosenzweig, 2003). CANOCO generates only the first four ordination axes in order of variance explained by the linear combinations of edaphic variables. These axes represent the directions of greatest data variability explained by the edaphic variables included in the analyses. CanoDraw (ter Braak, 1986), a program included with CANOCO, is used to present the diagrammatic outputs of results.
Additional outputs generated from CANOCO are tables of eigenvalue outputs for all models produced. Eigenvalues provide a relative measure of importance for an ordination axis (explaining the highest proportion of variance in the species data). The selection of the most significant edaphic variables influencing species is based on their ability to explain the highest proportion of variance in the species data as a whole. The statistical significance of relationships is determined using Monte Carlo permutation tests (999 runs).
6.4.4.4.2 Univariate analyses
As previously discussed, the multi-level or hierarchical sampling design structure was developed with the intent of using generalised linear (GLM) and generalised linear mixed modelling (GLMM) approaches (Hirzel and Guisan, 2002, Latimer et al., 2006, Bolker et al., 2009).
The first stage involved the use of logistic regression to investigate those edaphic parameters that most significantly influence the presence of a species across the study region (initially excluding then including random structure).
During the second stage, conditional on presence only, that is those sites where a species was found to be present, generalised linear mixed modelling was used to investigate the relationship between the abundance (percentage cover) of a selected species and the edaphic variables most significantly influencing their abundance; again initially excluding and then subsequently including random structure. For both stages the GenStat statistical system (7th edition) was used (Payne et al., 2003).
Generalised linear modelling techniques build upon the foundation of linear regression approaches. They extend the application of linear regression by allowing for non-linearity and non-constant variance structures, thus moving beyond the limitations imposed by the usual assumptions underlying linear regression (Austin, 1987, Potvin and Roff, 1995, Guisan et al., 2002). The modelling approach assumes a relationship between a linear combination of explanatory variables (edaphic) and the mean of the response variable (the species) (Keitt et al., 2002). This relationship is via a link function. In dealing with an investigation of presence and absence of a species, the ‘logit’ link is used (hence the term ‘logistic regression’) and the error structure is assumed to be binomial (Rushton et al., 2004).
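The logit link and binomial error structure can be made concrete with a minimal sketch of a presence/absence model with a single edaphic predictor, fitted by Newton-Raphson. This is an illustration only, not the GenStat procedure used in the study; the function names and the single-predictor restriction are assumptions, and no random structure is included.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson fit of logit(P(present)) = b0 + b1 * x.

    x: values of one edaphic variable per site
    y: 1 where the species is present, 0 where absent
    Returns the maximum likelihood estimates (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        p = [inv_logit(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]  # binomial variance function
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Entries of the 2 x 2 information matrix.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

A positive `b1` indicates that the probability of presence increases with the predictor. The sketch assumes the data are not perfectly separable, since otherwise the estimates do not converge.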
Generalised linear mixed modelling (GLMM) represents an amalgam of regression techniques, linear mixed modelling and generalised linear modelling. This facilitates the incorporation of mixed effects (random and fixed). In this study the random effects are associated with the multi-level structure of the sampling design (incorporating effects of spatial dependency and autocorrelation) and the fixed effects with the range of continuous and categorical edaphic variables. GLMMs can also deal flexibly with unbalanced sampling and with missing and non-normal data (Bolker et al., 2009). Importantly, this mixed model framework does not assume some form of parametric relationship between the nature of the response and explanatory variables; instead the response relationship is directly determined by the data (Austin, 1999).
A step-wise process is used to select parameters during the model selection process. Ultimately the aim of the above statistical procedures is to produce a suitably parsimonious and ecologically and physiologically appropriate model that reveals the most significant edaphic variables influencing the presence and/or abundance of selected perennial species (Hettrich and Rosenzweig, 2003).
The mixed model technique of Restricted Maximum Likelihood (REML) is able to account for the multiple sources of variation (deviance in REML) associated with multi-level sampling designs. This multilevel structure facilitates a greater insight into the overall sources of random ‘noise’ in the data. Partitioning this noise enables a reduction of the ‘error’ that remains unexplained by the model, thus increasing the statistical power of the model to detect real patterns in the response variable (minimising the probability of type II error), and reducing the likelihood of false detection of responses that do not actually exist (type I error) (Legendre et al., 2002). The significance of this approach is that, unlike more orthodox regression, REML is able to account for more than a single source of variation in the model. It is able to provide an estimate of variance associated with all random terms in the model. The variance component is fitted using residual (or restricted) maximum likelihood. The statistical framework of Maximum Likelihood (ML) determines model parameters in a process that maximises the probability of the observed data (the likelihood) (Bolker et al., 2009). ML assumes that the fixed-effects estimates are precisely correct in determining estimates of the standard deviations of random effects, whereas REML “averages over some of the uncertainty in the fixed effect parameters” (ibid). It is assumed that the plant species (response variable) varies in response to a combination of edaphic variables (predictor variables) which comprise both fixed and random elements.
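The difference between ML and REML can be seen in the simplest possible case: an ordinary linear model with fixed effects only and a single residual variance. This is a deliberate simplification of the multi-level models described here, and the function name is hypothetical. ML divides the residual sum of squares by the number of observations n, whereas REML divides by n minus the number of fixed-effect parameters p, which allows for the fact that the fixed effects were themselves estimated.

```python
def variance_estimates(residual_ss, n_obs, n_fixed):
    """Residual variance under ML and REML for a fixed-effects linear model.

    residual_ss: residual sum of squares after fitting the fixed effects
    n_obs: number of observations (n)
    n_fixed: number of estimated fixed-effect parameters (p)
    Returns (ml_estimate, reml_estimate)."""
    ml = residual_ss / n_obs
    reml = residual_ss / (n_obs - n_fixed)
    return ml, reml

# Intercept-only example: the single fixed effect is the mean.
y = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(y) / len(y)
rss = sum((v - mean) ** 2 for v in y)
ml, reml = variance_estimates(rss, len(y), 1)
```

The ML estimate is always the smaller of the two, and the gap widens as more fixed-effect parameters are estimated relative to the number of observations.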
The significance of variables ultimately selected for inclusion in the model is determined by the Wald statistic, which approximately follows a χ² distribution. Variables were included in the model at a significance level of α = 0.05.
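For a term with a single parameter, the Wald test can be sketched as follows. The sketch assumes one degree of freedom, so that the χ² tail probability can be written with the complementary error function; terms with several parameters, such as multi-level factors, would need the general χ² distribution. The function names are hypothetical.

```python
import math

def wald_statistic(estimate, std_error):
    """Wald statistic for a single parameter: (estimate / standard error)^2."""
    return (estimate / std_error) ** 2

def chi2_p_value_1df(w):
    """Upper-tail probability of a chi-squared variate with 1 degree of freedom.

    Uses P(chi2_1 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2))."""
    return math.erfc(math.sqrt(w / 2.0))

def is_significant(estimate, std_error, alpha=0.05):
    """True if the single-parameter term would be retained at level alpha."""
    return chi2_p_value_1df(wald_statistic(estimate, std_error)) < alpha
```

An estimate about 1.96 standard errors from zero gives a Wald statistic near 3.84 and a p-value close to 0.05, which is the familiar two-sided normal threshold.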