estimates for the model parameters. Typically, these initial parameters would be estimated by plotting the data and identifying key values that relate to the parameters or, if the range of the parameters is known, by performing a grid search over a subset of values (Ritz and Streibig, 2008). However, since the regression analyses in this study are performed over the large Arabidopsis datasets (Section 1.2), the process needs to be automated, and to this end self-starter functions were used. Self-starter functions are pieces of code that automate the search for starting values. Each function is specific to a particular nonlinear model and can be used to calculate starting values for that model for any given dataset. A self-starter may not always result in successful convergence, but in general it should provide parameter estimates that are close enough to allow the estimation algorithm to converge. Several collections of self-starters exist, including several in the base R installation as well as in third-party packages such as the drc and HydroMe packages. However, these self-starters are for specialised models in a specific field, or for a different parameterisation of the same shape. Thus, for the majority of the selected models, self-starters were developed and used in the fitting process. The process of estimating the initial parameters for each shape is described below.
3.3.1. Sigmoid curves - logistic and Gompertz
Both the logistic and Gompertz self-starter functions were adapted from existing sources. The logistic self-starter function was derived from the SSfpl function in the built-in stats package in R (R Development Core Team, 2011), and is a simple re-parameterisation from the original $y = a + \frac{b - a}{1 + \exp((m - t)/s)}$ to $y = a + \frac{b}{1 + \exp((m - t)/s)}$ (i.e. changing the range value to a single parameter).
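The re-parameterisation of the logistic curve can be illustrated with a short Python sketch (the study itself used R). The function names below are hypothetical and chosen for illustration only; the sketch simply shows that replacing the upper asymptote with a single range parameter leaves the curve unchanged.

```python
import math


def logistic_ssfpl(t, a, b, m, s):
    """Four-parameter logistic with lower asymptote a and upper asymptote b."""
    return a + (b - a) / (1.0 + math.exp((m - t) / s))


def logistic_range(t, a, rng, m, s):
    """Re-parameterised form: rng is the range between the two asymptotes."""
    return a + rng / (1.0 + math.exp((m - t) / s))


def ssfpl_to_range(a, b, m, s):
    """Convert asymptote-style parameters to the range parameterisation."""
    return a, b - a, m, s
```

Starting values obtained for the asymptote-style form could therefore be passed through a conversion like `ssfpl_to_range` before fitting the range form.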
The Gompertz self-starter function was adapted from the gompertz function in the drc package (Ritz and Streibig, 2005). Similarly to the logistic self-starter, the function was re-parameterised so that the range value is a single parameter: $y = a + b \cdot \exp(-\exp(s \cdot (t - m)))$. As the Gompertz model is asymmetric, multiple shapes are possible, depending on the values of the parameters. Specifically, if $s > 0$, the curve exhibits accelerated growth towards the right asymptote (slower initial growth). In contrast, if $s < 0$, the curve exhibits accelerated growth from the left asymptote. The original gompertz function in the drc package only took the Gompertz1 form (growth rate is faster on the left of the midpoint) into account, so an additional self-starter function was added to detect the Gompertz2 shape (growth rate is faster on the right of the midpoint).
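As a minimal Python sketch of the re-parameterised Gompertz curve (the function name is hypothetical, and the study itself used R), the following evaluates $y = a + b \cdot \exp(-\exp(s \cdot (t - m)))$. It can be used to check that flipping the sign of s mirrors the curve about t = m, which is why the two shapes need to be detected separately.

```python
import math


def gompertz(t, a, b, m, s):
    """Gompertz curve with base level a, range b, location m and rate s.

    Note: math.exp overflows for very large s * (t - m); this sketch does
    not guard against that.
    """
    return a + b * math.exp(-math.exp(s * (t - m)))
```

At t = m the curve always passes through a + b·exp(−1), roughly 37% of the range above a, rather than the 50% seen in the symmetric logistic curve.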
3.3.2. Exponential
A parametrisation of the standard exponential equation was identified by Ratkowsky (1990) that expresses the equation in terms of expected-value parameters:

$$y = y_1 + (y_2 - y_1)\,\frac{1 - k^{m-1}}{1 - k^{n-1}},$$

where $m - 1 = (n - 1)(T - T_1)/(T_2 - T_1)$, $n$ is the number of data points, and $k$ and $r$ are related in the following manner: $r = k^{(n-1)/(T_2 - T_1)}$. Another parametrisation was also shown such that

$$y = y_1 + (y_2 - y_1)\,\frac{1 - [(y_2 - y_3)/(y_3 - y_1)]^{q}}{1 - [(y_2 - y_3)/(y_3 - y_1)]^{2}},$$

where $q = 2(T - T_1)/(T_2 - T_1)$, and $y_1$, $y_2$ and $y_3$ correspond to the y-values at $T = T_1$, $T = T_2$ and $T = (T_1 + T_2)/2$, respectively. $T_1$ and $T_2$ are the first and last observed values (time points) in the dataset, respectively. From these two parametrisations, it was assumed that the denominators of both equations were equal, implying that $k^{n-1} \approx [(y_2 - y_3)/(y_3 - y_1)]^{2}$. Using this assumption, and the association of $k$ and $r$ described above, it was possible to estimate $r \approx [(y_2 - y_3)/(y_3 - y_1)]^{2/(T_2 - T_1)}$.
Once this approximate solution for $r$ was found, the values of $a$ and $b$ could be easily identified. By using the equation $y_i = a + b \cdot \exp(-r \cdot t_i)$, where $y_i$ corresponds to the point $T = T_i$, and using the first and last data pairs ($(t_1, y_1)$ and $(t_2, y_2)$, respectively), it is possible to solve for $a$ and $b$ such that

$$b = \frac{y_2 - y_1}{\exp(-r \cdot t_2) - \exp(-r \cdot t_1)} \quad \text{and} \quad a = y_1 - b \cdot \exp(-r \cdot t_1).$$
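The steps above can be sketched in Python (the study itself used R; the function name is hypothetical). Two assumptions are made here that the text does not state. First, $y_3$ is taken from the observation closest in time to the midpoint $(T_1 + T_2)/2$. Second, for a curve of the form $y = a + b \cdot \exp(-r \cdot t)$, the ratio-based quantity works out algebraically to $\exp(-r)$, so the sketch converts it to a rate with a logarithm before solving for a and b; the text instead describes using the quantity directly as the approximation of r.

```python
import math


def exponential_start(t, y):
    """Rough starting values (a, b, r) for y = a + b * exp(-r * t)."""
    T1, T2 = t[0], t[-1]
    y1, y2 = y[0], y[-1]
    # Assumption: use the observation nearest the midpoint time as y3.
    t_mid = (T1 + T2) / 2.0
    i_mid = min(range(len(t)), key=lambda i: abs(t[i] - t_mid))
    y3 = y[i_mid]
    if y3 == y1:
        raise ValueError("y3 equals y1; the ratio is undefined")
    ratio = (y2 - y3) / (y3 - y1)
    if ratio <= 0.0 or ratio == 1.0:
        raise ValueError("data do not look like a monotone exponential")
    # Quantity approximated in the text: ratio ** (2 / (T2 - T1)).
    base = ratio ** (2.0 / (T2 - T1))
    # Assumption of this sketch: base corresponds to exp(-r).
    r = -math.log(base)
    # Solve for b and a from the first and last data pairs.
    b = (y2 - y1) / (math.exp(-r * T2) - math.exp(-r * T1))
    a = y1 - b * math.exp(-r * T1)
    return a, b, r
```

On noise-free data generated from the model with an observation exactly at the midpoint, this recovers the generating parameters; on real data it should only be read as a rough starting point for the fitting algorithm.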
3.3.3. Critical exponential
Since the r parameter is the primary parameter that influences the shape of the curve, it is the most important to identify first. To find approximate values for the parameters, the data was divided into two parts, separated by the maximum absolute y-value. The absolute maximum was taken to ensure that curves with a dip instead of a peak were also identified. Thus, the data was divided from $(t_1, y_1)$ to $(t_{max}, y_{max})$, and from $(t_{max}, y_{max})$ to $(t_n, y_n)$, where $t_{max}$ is the time point of the maximum absolute y-value in the dataset. The data was further divided halfway between the first value and the maximum value (called mid1), and between the maximum value and the last value (mid2) (Figure 3.8). The differences between the y-values of mid1 and the first value, and of the last value and mid2, were calculated (diff1 and diff2, respectively) and compared. If diff1 was greater than diff2, it implied that the curve had a faster growth rate on the left side and tailed towards an asymptote on the right side, meaning r>0, and r was therefore set to 0.2 (Figure 3.4A). This value was set arbitrarily as a push in the right direction. The other parameters could then be estimated, where a was approximately the last value (the asymptotic value), and b was approximately equal to the difference between the first y-value and a (since $a + b \approx y_1$). The converse was true if diff2 was greater than diff1, so r was set to -0.2 (Figure 3.4B), and a and b were approximately equal to the first value. The c parameter is the difference between the maximum value and the asymptotic value.

Figure 3.8: Illustration of the self-starter process for the critical exponential function. The data was first divided at the maximum value, and each part was then further subdivided into two halves (at mid1 and mid2). The differences between the first data point and mid1, and between mid2 and the last data point, were calculated (diff1 and diff2, respectively). If the first segment of the graph has a faster growth rate (diff1>diff2), the asymptote is on the right side of the graph, and thus r>0. Conversely, if diff2>diff1, the graph has the asymptote on the left-hand side, and r<0. The other parameters could then be estimated based on the aspects of the curves they influence.
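The heuristic described above can be sketched in Python as follows (the study itself used R; the function name is hypothetical). The sketch assumes that the midpoints are taken by index position, that the differences are compared as absolute values, and that a tie between the two differences falls into the r<0 branch; none of these details are stated in the text.

```python
def critical_exponential_start(y):
    """Rough starting values (a, b, c, r) for the critical exponential curve."""
    n = len(y)
    # Index of the maximum absolute y-value (covers dips as well as peaks).
    i_max = max(range(n), key=lambda i: abs(y[i]))
    # Assumption: midpoints are chosen halfway by index position.
    i_mid1 = i_max // 2
    i_mid2 = (i_max + n - 1) // 2
    # Assumption: differences are compared in absolute value.
    diff1 = abs(y[i_mid1] - y[0])
    diff2 = abs(y[-1] - y[i_mid2])
    if diff1 > diff2:
        # Faster change on the left, asymptote on the right.
        r = 0.2
        a = y[-1]
        b = y[0] - a
    else:
        # Asymptote on the left; a and b both follow the first value,
        # as described in the text.
        r = -0.2
        a = y[0]
        b = y[0]
    # c is the difference between the extreme value and the asymptote.
    c = y[i_max] - a
    return a, b, c, r
```

The fixed magnitude of 0.2 for r only indicates the direction of the curve; the fitting algorithm is expected to refine it.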
3.3.4. Linear+exponential
To find approximate starting values for the linear+exponential curve, advantage was taken of the fact that a portion of the curve is linear. A similar approach to that for the critical exponential was used, where the data was divided at the maximum absolute y-value, i.e. the two datasets ran from $(x_1, y_1)$ to $(x_{max}, y_{max})$ and from $(x_{max}, y_{max})$ to $(x_n, y_n)$, where $x_{max}$ is the x-value of the maximum absolute y-value in the dataset. If the function was monotonic, the data was divided in half.

Once again, the primary parameter influencing the shape of the curve was the r parameter. As shown in Figure 3.5, the side of the exponential portion is determined by the sign of r. To determine an estimate for this parameter, a linear regression was performed on each section of the data to determine which was more linear. This comparison was performed using the $R^2$ value from each regression. The data point at $(x_{max}, y_{max})$ was used in both linear regressions. If the first section was more linear (Figure 3.9B), it implied that r<0, and r was approximated as -0.2; conversely, if the second section was more linear, r was set to 0.2 (Figure 3.9A). The rest of the parameters could then be estimated, with a and c approximately equal to the intercept and slope of the linear regression, respectively. The b parameter affects the concavity of the graph, and is estimated as $b \approx y_1 - a$.
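A minimal Python sketch of this procedure is given below (the study itself used R; all function names are hypothetical). It assumes that the intercept and slope are taken from the more linear of the two segments, and that the data is split in half when the extreme value falls on the first or last point; the text does not specify either detail.

```python
def linear_fit(x, y):
    """Ordinary least-squares line; returns (intercept, slope, r_squared)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return intercept, slope, r2


def linexp_start(x, y):
    """Rough starting values (a, b, c, r) for the linear+exponential curve."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least three data points")
    steps = [y[i + 1] - y[i] for i in range(n - 1)]
    monotonic = all(d >= 0 for d in steps) or all(d <= 0 for d in steps)
    split = n // 2 if monotonic else max(range(n), key=lambda i: abs(y[i]))
    # Assumption: fall back to halves if the extreme is an end point.
    if split == 0 or split == n - 1:
        split = n // 2
    # The split point is shared by both regressions.
    first = linear_fit(x[: split + 1], y[: split + 1])
    second = linear_fit(x[split:], y[split:])
    if first[2] > second[2]:
        r, (a, c, _) = -0.2, first
    else:
        r, (a, c, _) = 0.2, second
    b = y[0] - a
    return a, b, c, r
```

Comparing $R^2$ values is a simple way to decide which segment is closer to a straight line, although with very short segments the comparison can be unstable.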
3.3.5. Gaussian
The starting values for the Gaussian curve are estimated from various aspects of the curve. The m parameter is the time point where the maximum absolute y-value occurs, and was calculated using the which.max function in R. The a parameter is the average of the first and last y-values, which estimates the base level; the b parameter (range) is the difference between the base level and the maximum absolute y-value; and the s parameter (t-spread around m) is estimated as the difference between the m estimate and the time point where half the maximum response occurs ($y = a + b/2$). Since there are two time points where this occurs, the first was selected.
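These estimates can be sketched in Python as follows (the study itself used R with which.max; the function name here is hypothetical). The sketch assumes that the half-response time point is the observation before the peak whose y-value is closest to $a + b/2$, since the text does not say how that point is located.

```python
def gaussian_start(t, y):
    """Rough starting values (a, b, m, s) for a Gaussian-shaped curve."""
    n = len(y)
    # Position of the maximum absolute y-value.
    i_max = max(range(n), key=lambda i: abs(y[i]))
    if i_max == 0:
        raise ValueError("peak is at the first point; spread cannot be estimated")
    m = t[i_max]
    # Base level: average of the first and last y-values.
    a = (y[0] + y[-1]) / 2.0
    # Range: signed distance from the base level to the extreme value.
    b = y[i_max] - a
    # Assumption: first half-response point is the pre-peak observation
    # whose y-value is closest to a + b / 2.
    half = a + b / 2.0
    i_half = min(range(i_max), key=lambda i: abs(y[i] - half))
    s = m - t[i_half]
    return a, b, m, s
```

With coarsely sampled data the estimate of s is limited by the spacing of the time points, so it should be treated only as a starting value.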
3.3.6. Hyperbola
Figure 3.9: Illustration of the self-starter process for the linear+exponential function. The data was divided at the maximum (B), or in half if the function was monotonic (A). A linear regression was performed on each segment to determine which portion was more linear. If the first segment was more linear, r<0 (A), and if the second segment was more linear, r>0 (B).

Like the logistic self-starter, the self-starter for the hyperbola was a re-parameterisation of the SSmicmen function in the built-in stats package, $y = \frac{V_m \cdot \mathrm{input}}{K + \mathrm{input}}$. An additional parameter, c, was added to allow the function to shift on the time axis. Since c is the time point where y = 0, this value was estimated as the time point where the y-value is closest to 0.
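The estimate of c can be sketched in Python as follows (the study itself used R; the function names are hypothetical). The shifted curve shown is an assumed form, obtained by replacing the input of the Michaelis-Menten equation with t − c; the exact shifted equation is not given in the text.

```python
def hyperbola_shift_start(t, y):
    """Estimate c as the time point whose y-value is closest to zero."""
    i_zero = min(range(len(y)), key=lambda i: abs(y[i]))
    return t[i_zero]


def shifted_hyperbola(t, vm, k, c):
    """Assumed shifted form: y = vm * (t - c) / (k + (t - c))."""
    x = t - c
    return vm * x / (k + x)
```

Under this assumed form the curve passes through y = 0 at t = c, which is consistent with the way c is estimated.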