4. Case study
4.2. Drum boiler optimization problem
$$
\begin{bmatrix}
\vdots & \vdots & \vdots & \ddots & \vdots \\
x(k-M)x(k) & x(k-M)x(k-1) & x(k-M)x(k-2) & \cdots & x^2(k-M)
\end{bmatrix} \tag{2.18}
$$
and
$$D = E\{y(k)X^T(k)\}$$
denote the cross-correlation vector of the input and the reference signal
$$D = E\left[\, y(k)x(k) \;\; y(k)x(k-1) \;\; y(k)x(k-2) \;\; \cdots \;\; y(k)x(k-M) \,\right]^T, \tag{2.19}$$
it follows that the mean square error or statistical mean value of the error signal is
$$J = E\{y^2(k)\} + \hat{H}^T R \hat{H} - 2D^T\hat{H}. \tag{2.20}$$
2.3.2 Minimization of the Criterion of Mean Square Error (Risk)
In the general case, if the number of coefficients to be estimated in an adaptive process is equal to M, then (2.20) represents a surface in the M-dimensional parametric space. Adaptation is the process of searching for the point on that surface which corresponds to the minimal value of the MSE in (2.20), i.e. to the optimal value of the parameter vector $\hat{H}$.
To determine the global minimum of a criterion function, one most often uses one of the random search algorithms. An important property of adaptive systems implemented as FIR filters is that their criterion function is a second-order surface, i.e. a quadratic function of $\hat{H}$. The MSE criterion function then has only one, global minimum. In that case one can use much more powerful deterministic minimization methods, based on gradients or on some estimate of the gradients, instead of the stochastic methods used when the criterion function has several local minima, as is the case in IIR systems [12, 24, 25].
If there is only one parameter, the MSE is described by a parabola (Fig. 2.4); for two parameters, the MSE is a paraboloid (Fig. 2.5); and in the general case, when the number of parameters M is larger than 2, the surface is a hyperparaboloid. Since the MSE is by definition non-negative, the criterion function is a bowl-shaped surface that opens upward (concave upward, i.e. convex), so it spreads in the direction of increasing MSE.
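The parabola for the single-parameter case can be checked numerically. The following is a minimal sketch, assuming illustrative values for the input power `r`, the cross-correlation `d` and the reference power `ey2` (none of these numbers come from the text); it evaluates the scalar form of (2.20) on a grid and confirms that the second differences are constant and positive, as expected for a parabola that opens upward.

```python
# Assumed (hypothetical) second-order statistics for a single-parameter filter.
r = 2.0      # E{x^2(k)}, input power
d = 3.0      # E{y(k) x(k)}, cross-correlation of reference and input
ey2 = 6.0    # E{y^2(k)}, reference power


def mse(b):
    """Scalar form of (2.20): J(b) = E{y^2} + r*b^2 - 2*d*b."""
    return ey2 + r * b * b - 2.0 * d * b


step = 0.5
grid = [i * step for i in range(-4, 9)]      # b from -2.0 to 4.0
values = [mse(b) for b in grid]

# For a parabola the second differences are constant: 2 * r * step^2.
second_diffs = [values[i - 1] - 2.0 * values[i] + values[i + 1]
                for i in range(1, len(values) - 1)]

j_min = min(values)
b_min = grid[values.index(j_min)]
print("grid minimum at b =", b_min, "with J =", j_min)
print("second differences:", second_diffs)
```

With these assumed values the grid happens to contain the exact minimizer b = d / r, so the grid minimum coincides with the analytical one.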
The minimum of the criterion function can be determined using the gradient method [12, 24, 25]. Namely, the MSE gradient is a vector that always points in the direction of the fastest increase of the criterion function, with a magnitude equal to the slope of the tangent to the criterion function. At the minimum of the criterion function the slope is zero, so it is necessary to determine the gradient of the criterion function and set it equal to zero in order to obtain the optimal values of the parameters minimizing the criterion function. The gradient of the criterion function J, denoted as $\nabla J$ or simply $\nabla$, is obtained by differentiating expression (2.20) with respect to $\hat{H}$,
Fig. 2.4 The form of the MSE for the case when M = 1. The criterion function is represented by a parabola, and the parameter vector contains only a single parameter, b (horizontal axis: parameter b; vertical axis: J; the minimum $J_{min}$ is reached at $H_{opt} = b_{opt}$)
Fig. 2.5 The form of the MSE for the case when M = 2. The contours for constant values of the MSE are represented by the ellipses at the bottom of the graph (axes: $b_1$, $b_2$, MSE)
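$$\nabla = \frac{\partial J}{\partial \hat{H}} = \left[\frac{\partial J}{\partial b_0} \;\; \frac{\partial J}{\partial b_1} \;\; \frac{\partial J}{\partial b_2} \;\; \cdots \;\; \frac{\partial J}{\partial b_M}\right]^T = 2R\hat{H} - 2D. \tag{2.21}$$
If (2.21) is set equal to zero, we obtain the Wiener-Hopf equation, and by solving it we obtain the optimal solution for the parameter vector
$$\nabla = 2R\hat{H} - 2D = 0, \tag{2.22}$$
$$H_{opt} = R^{-1}D. \tag{2.23}$$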
$H_{opt}$ represents the vector of optimal values of the FIR filter parameters, i.e. those values of the parameters for which the criterion function attains its minimum $J_{min}$. According to (2.20) and (2.23), one obtains
$$J_{min} = E\{y^2(k)\} + H_{opt}^T R H_{opt} - 2D^T H_{opt} = E\{y^2(k)\} - H_{opt}^T R H_{opt}. \tag{2.24}$$
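Equations (2.21) to (2.24) can be illustrated for two parameters. This is a minimal sketch under assumed values of `R`, `D` and `ey2` (illustrative numbers, not taken from the text); the helper names `solve_2x2`, `mat_vec`, `dot`, `mse` and `gradient` are hypothetical. It solves the Wiener-Hopf equation, checks that the gradient vanishes at the solution, and evaluates the minimum MSE in both forms given in (2.24).

```python
def solve_2x2(R, D):
    """Solve R h = D for a 2x2 matrix R using Cramer's rule."""
    (a, b), (c, d) = R
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("correlation matrix is singular")
    return [(D[0] * d - b * D[1]) / det,
            (a * D[1] - c * D[0]) / det]


def mat_vec(A, v):
    """Matrix-vector product A v."""
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]


def dot(u, v):
    """Inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def mse(ey2, R, D, h):
    """Criterion (2.20): J = E{y^2} + h^T R h - 2 D^T h."""
    return ey2 + dot(h, mat_vec(R, h)) - 2.0 * dot(D, h)


def gradient(R, D, h):
    """Gradient (2.21): 2 R h - 2 D."""
    Rh = mat_vec(R, h)
    return [2.0 * Rh[i] - 2.0 * D[i] for i in range(len(h))]


# Assumed second-order statistics (illustrative values only).
R = [[2.0, 1.0], [1.0, 2.0]]   # input correlation matrix
D = [4.0, 5.0]                 # cross-correlation vector
ey2 = 15.0                     # E{y^2(k)}

h_opt = solve_2x2(R, D)                               # (2.23)
grad_at_opt = gradient(R, D, h_opt)                   # (2.22): should vanish
j_min_long = mse(ey2, R, D, h_opt)                    # first form of (2.24)
j_min_short = ey2 - dot(h_opt, mat_vec(R, h_opt))     # second form of (2.24)

print("H_opt =", h_opt)
print("gradient at H_opt =", grad_at_opt)
print("J_min =", j_min_long, "=", j_min_short)
```

Solving the linear system directly, rather than forming the inverse explicitly, is a common numerical choice; for the 2x2 case used here the two are equivalent.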
Starting from expressions (2.20) and (2.23), and introducing (2.24), the criterion function may be represented as
$$J = J_{min} + \left(\hat{H} - H_{opt}\right)^T R \left(\hat{H} - H_{opt}\right). \tag{2.26}$$
It is obvious from (2.26) that J depends quadratically on $\hat{H}$ and that this function reaches its minimum for $\hat{H} = H_{opt}$.
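The step from (2.20) to (2.26) can be verified by completing the square. The sketch below uses only (2.20), (2.23), (2.24) and the symmetry of the correlation matrix, $R = R^T$, which is assumed here.

```latex
\begin{aligned}
J &= E\{y^2(k)\} + \hat{H}^T R \hat{H} - 2 D^T \hat{H} \\
  &= E\{y^2(k)\} + \hat{H}^T R \hat{H} - 2 H_{opt}^T R \hat{H}
     && \text{since } D = R H_{opt},\ R = R^T \\
  &= E\{y^2(k)\} - H_{opt}^T R H_{opt}
     + \left(\hat{H} - H_{opt}\right)^T R \left(\hat{H} - H_{opt}\right)
     && \text{adding and subtracting } H_{opt}^T R H_{opt} \\
  &= J_{min} + \left(\hat{H} - H_{opt}\right)^T R \left(\hat{H} - H_{opt}\right)
     && \text{by (2.24)}
\end{aligned}
```

Because $R$ is a correlation matrix, the quadratic term is non-negative, so $J \ge J_{min}$ with equality at $\hat{H} = H_{opt}$.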
It should be mentioned that other criterion functions may be utilized besides MSE, for instance the (mean) absolute value of the estimated errors, higher order moments, etc.; however, such a choice, contrary to the MSE, leads to nonlinear optimization problems [9, 10, 16]. Namely, the complexity of their application and analysis is fundamentally increased, but nonlinear criterion functions nevertheless have an important role in some applications [9, 16].
In most practical cases neither the shape of the criterion function nor its analytical description is known. From (2.23) it follows that to determine $J_{min}$ it is necessary to know the statistical properties of the input and the reference signal, i.e. the values of the correlation matrix R and the correlation vector D. Most often one knows only measured sequences of these signals, and their statistical properties can be obtained only by estimation based on experimental data. The values of the points on the surface defining the criterion function may be measured or estimated by averaging the MSE in time, i.e. by approximating the mathematical expectation with the appropriate arithmetic means. The problem of determining the optimal values of the filter parameters reduces to defining an adequate numerical procedure or algorithm able to describe the curve or, in the general case, the surface determined by the criterion function, as well as to determine its minimum. The values of the parameters defining the minimum of the criterion function represent the optimal vector $H_{opt}$, which is often also referred to as the "accurate" values of the parameters.
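The replacement of expectations by arithmetic means can be sketched as follows. The example is an illustration under stated assumptions: a hypothetical two-coefficient FIR system `b_true` generates the reference from white Gaussian input plus a small measurement noise, and the names `R_hat`, `D_hat` and `avg_sq_error` are introduced here only for illustration. The time averages are then used in place of R, D and E{y²(k)} in (2.20) and (2.23).

```python
import random

random.seed(0)

N = 5000
b_true = [0.8, -0.3]   # hypothetical FIR system that generates the reference

# Input sequence and noisy reference y(k) = b0 x(k) + b1 x(k-1) + noise.
x = [random.gauss(0.0, 1.0) for _ in range(N)]
y = [0.0] * N
for k in range(1, N):
    y[k] = b_true[0] * x[k] + b_true[1] * x[k - 1] + random.gauss(0.0, 0.1)

ks = range(1, N)       # samples for which x(k-1) is available
n = len(ks)

# Arithmetic means in place of the mathematical expectations.
R_hat = [[sum(x[k - i] * x[k - j] for k in ks) / n for j in range(2)]
         for i in range(2)]
D_hat = [sum(y[k] * x[k - i] for k in ks) / n for i in range(2)]
ey2_hat = sum(y[k] ** 2 for k in ks) / n

# Solve the estimated Wiener-Hopf equation R_hat h = D_hat (2x2 Cramer's rule).
det = R_hat[0][0] * R_hat[1][1] - R_hat[0][1] * R_hat[1][0]
h_hat = [(D_hat[0] * R_hat[1][1] - R_hat[0][1] * D_hat[1]) / det,
         (R_hat[0][0] * D_hat[1] - R_hat[1][0] * D_hat[0]) / det]


def avg_sq_error(h):
    """Time-averaged squared error of the filter with coefficients h."""
    return sum((y[k] - h[0] * x[k] - h[1] * x[k - 1]) ** 2 for k in ks) / n


# (2.20) evaluated with the estimated statistics.
quad = sum(h_hat[i] * R_hat[i][j] * h_hat[j] for i in range(2) for j in range(2))
j_from_stats = ey2_hat + quad - 2.0 * (D_hat[0] * h_hat[0] + D_hat[1] * h_hat[1])

print("estimated parameters:", h_hat)
print("time-averaged squared error:", avg_sq_error(h_hat))
print("J from estimated statistics:", j_from_stats)
```

When the same samples are used for all averages, the time-averaged squared error and the value of (2.20) computed from the estimated statistics agree up to rounding, which is why the two printed values match.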
The majority of adaptive algorithms are based on standard iterative procedures for solving minimization problems in real time. To clarify the properties of the usual adaptive algorithms for the minimization of the criterion function, we will consider two basic numerical methods for its iterative minimization: Newton's method and the steepest descent method.
Both methods use an estimate of the gradient, $\nabla$, to determine the minimum of the criterion function, instead of the exact value of the gradient, which in the general case is not even known [12, 24, 25].
2.3.2.1 Newton’s Method
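By multiplying (2.21) from the left by $\frac{1}{2}R^{-1}$ we obtain
$$\frac{1}{2}R^{-1}\nabla = \hat{H} - R^{-1}D. \tag{2.27}$$
By combining (2.23) and (2.27) it follows that
$$H_{opt} = \hat{H} - \frac{1}{2}R^{-1}\nabla. \tag{2.28}$$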
Equation (2.28) represents Newton's method for determining the root of the vector equation obtained by setting the gradient of the criterion function equal to zero (the necessary condition for the minimum of the adopted criterion). Knowing the value of $\hat{H}$ at any moment of time, together with R and the corresponding gradient $\nabla$, one can determine the optimal solution $H_{opt}$ in just a single step. In practical situations, however, the available information is insufficient to perform a single-step adaptation. The correlation matrix of the input signal, R, changes with time under nonstationary conditions and, in the best case, can only be estimated, as can the unknown gradient of the criterion function $\nabla$, which must be estimated in each iteration. In order to reduce the effect of "noisy" or fluctuating values of these estimates, one modifies (2.28) to obtain an algorithm which updates the parameter vector $\hat{H}$ in small increments and converges to $H_{opt}$ after a number of iterations. In this manner, starting from (2.28), one reaches Newton's method in iterative (recursive) form [12, 24, 25]
$$\hat{H}(k+1) = \hat{H}(k) - \frac{1}{2}R^{-1}\nabla(k), \quad k = 0, 1, 2, \ldots \tag{2.29}$$
where the index k on the gradient of the criterion function denotes that it is estimated in each iteration according to (2.21). Expression (2.29) can be generalized by introducing a constant $\mu$, i.e. a dimensionless variable determining the convergence speed of the iterative process
$$\hat{H}(k) = \hat{H}(k-1) - \mu R^{-1}\nabla(k-1). \tag{2.30}$$
According to (2.21) it follows that $\nabla(k-1) = 2R\hat{H}(k-1) - 2D$, and thus according to (2.30) one obtains
$$\hat{H}(k) = (1 - 2\mu)\hat{H}(k-1) + 2\mu R^{-1}D. \tag{2.31}$$
Rearranging (2.31) further, and taking into account (2.23), one can write
$$
\begin{aligned}
\hat{H}(k) &= (1 - 2\mu)\hat{H}(k-1) + 2\mu H_{opt} \\
\hat{H}(k) &= (1 - 2\mu)^2\hat{H}(k-2) + 2\mu\left[1 + (1 - 2\mu)\right]H_{opt} \\
&\;\;\vdots \\
\hat{H}(k) &= (1 - 2\mu)^k\hat{H}(0) + 2\mu H_{opt}\sum_{i=0}^{k-1}(1 - 2\mu)^i.
\end{aligned}
\tag{2.32}
$$
The vector $\hat{H}$ obviously converges to the optimal value $H_{opt}$ only if the geometric series $\sum_{i=0}^{k-1}(1 - 2\mu)^i$ is convergent, i.e.
$$|1 - 2\mu| < 1, \tag{2.33}$$
that is
$$0 < \mu < 1, \tag{2.34}$$
and in that case
$$\hat{H}(k) = (1 - 2\mu)^k\hat{H}(0) + H_{opt}\left[1 - (1 - 2\mu)^k\right]. \tag{2.35}$$
From (2.35) it follows that the final solution can be reached in one step for $\mu = 0.5$, but only under the condition that one knows the exact values of the inverse correlation matrix of the input signal, $R^{-1}$, and of the gradient of the criterion function, $\nabla$, i.e. the cross-correlation vector D. When $R^{-1}$ and $\nabla$ are estimated, one usually uses values $\mu \ll 1$, typically smaller than 0.01, to overcome the problems caused by the error introduced by estimating the unknown quantities R and $\nabla$.
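The recursion (2.30) and its closed form (2.35) can be compared numerically. This is a minimal sketch assuming that R and D are known exactly (illustrative values, not from the text) and that the iteration starts from a zero vector; the helper names are hypothetical. It checks the single-step convergence for μ = 0.5 and the geometric decay of the error for a smaller μ.

```python
def inv_2x2(R):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = R
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("correlation matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(A, v):
    """Matrix-vector product A v."""
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]


def newton_step(R, R_inv, D, h, mu):
    """One iteration of (2.30): h - mu * R^{-1} * (2 R h - 2 D)."""
    Rh = mat_vec(R, h)
    grad = [2.0 * Rh[i] - 2.0 * D[i] for i in range(len(h))]   # (2.21)
    corr = mat_vec(R_inv, grad)
    return [h[i] - mu * corr[i] for i in range(len(h))]


# Assumed exact statistics (illustrative values only).
R = [[2.0, 1.0], [1.0, 2.0]]
D = [4.0, 5.0]
R_inv = inv_2x2(R)
h_opt = mat_vec(R_inv, D)                  # (2.23)

# mu = 0.5: the optimum is reached in a single step.
one_step = newton_step(R, R_inv, D, [0.0, 0.0], mu=0.5)

# mu = 0.1: error shrinks by the factor (1 - 2*mu) in every iteration.
mu, steps = 0.1, 10
h = [0.0, 0.0]
for _ in range(steps):
    h = newton_step(R, R_inv, D, h, mu)

# Closed form (2.35) with H(0) = 0.
closed_form = [(1.0 - (1.0 - 2.0 * mu) ** steps) * h_opt[i] for i in range(2)]

print("H_opt       =", h_opt)
print("one step    =", one_step)
print("10 steps    =", h)
print("closed form =", closed_form)
```

The sketch uses exact statistics on purpose; with estimated R and gradient the single-step property is lost, which is the motivation for small μ given in the text.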
Newton's method is fundamentally important from the mathematical point of view; however, it is very demanding in practical applications because of the need to estimate R and $\nabla$ in each step. It is a gradient search method, a consequence of which is that all elements of the vector $\hat{H}$ change in each iteration, with the goal of determining the optimal values of the parameters. These changes are always directed toward the minimum of the criterion function but, as (2.30) shows, not necessarily in the direction of the gradient itself.
As mentioned, the main problem with Newton's algorithm is its application under conditions in which one does not know the inverse correlation matrix of the input signal or the gradient of the criterion function, i.e. the cross-correlation of the input and the reference signal. Regrettably, this is a common case in practice [6]. In that case one most often assumes that the off-diagonal elements of the correlation matrix are equal to zero. The methods based on this assumption bear the common name of the steepest descent method, and we consider them in what follows.
2.3.2.2 Steepest Descent Method
The steepest descent method is an optimization technique that uses the gradient of the criterion function to determine its minimum. In contrast to Newton's method, in each iteration it updates the vector $\hat{H}$ only in the direction of the negative gradient. Since the gradient points in the direction of the fastest increase of the criterion function, moving in the direction of the negative gradient should ensure the fastest approach to the minimum of the criterion function, which is how the method got its name.
According to its definition, the steepest descent method can be described in the following manner [12, 24, 25]
$$\hat{H}(k+1) = \hat{H}(k) + \beta\left(-\nabla(k)\right). \tag{2.36}$$
The steepest descent method starts from some initial value $\hat{H}(0)$. The estimate in the next step, $\hat{H}(k+1)$, is equal to the current estimate $\hat{H}(k)$ corrected by a value in the direction opposite to the direction of the fastest increase of the function, i.e. of the gradient, at the point $\hat{H}(k)$. The last term in Eq. (2.36) represents the estimated gradient of the criterion function in the k-th iteration. The scalar parameter $\beta$ is the convergence factor, which determines the size of the correction step and influences the stability and the adaptation speed of the algorithm. The dimension of this factor is equal to the reciprocal of the dimension of the input signal power.
The graphical presentation of this method for M = 1 is given in Fig. 2.6. It can be shown that the convergence conditions are satisfied for [6]
$$0 < \beta < \frac{1}{\lambda_{max}}, \tag{2.37}$$
where $\lambda_{max}$ is the largest eigenvalue of the correlation matrix of the input signal R, which depends on the input signal power, i.e. on the mean expected value of the squared amplitude of the input signal.
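The role of the bound (2.37) can be illustrated with a short experiment. The following sketch assumes exact knowledge of R and D (illustrative values only) and uses the exact gradient (2.21) in the update (2.36); the function names are hypothetical. One run uses a convergence factor below the bound and one uses a factor above it.

```python
import math


def largest_eigenvalue_sym_2x2(R):
    """Largest eigenvalue of a symmetric 2x2 matrix."""
    a, b, d = R[0][0], R[0][1], R[1][1]
    return 0.5 * (a + d) + math.sqrt((0.5 * (a - d)) ** 2 + b * b)


def steepest_descent(R, D, beta, steps, h0=(0.0, 0.0)):
    """Iterate (2.36) with the exact gradient 2 R h - 2 D."""
    h = list(h0)
    for _ in range(steps):
        Rh = [R[i][0] * h[0] + R[i][1] * h[1] for i in range(2)]
        grad = [2.0 * Rh[i] - 2.0 * D[i] for i in range(2)]
        h = [h[i] - beta * grad[i] for i in range(2)]
    return h


def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


# Assumed exact statistics (illustrative values only).
R = [[2.0, 1.0], [1.0, 2.0]]
D = [4.0, 5.0]
h_opt = [1.0, 2.0]                 # solves R h = D for the values above

lam_max = largest_eigenvalue_sym_2x2(R)
beta_ok = 0.9 / lam_max            # inside the bound (2.37)
beta_bad = 1.2 / lam_max           # outside the bound (2.37)

h_ok = steepest_descent(R, D, beta_ok, steps=200)
h_bad = steepest_descent(R, D, beta_bad, steps=200)

print("lambda_max =", lam_max)
print("error with beta inside the bound: ", distance(h_ok, h_opt))
print("error with beta outside the bound:", distance(h_bad, h_opt))
```

In the diverging run the error grows geometrically along the eigenvector that belongs to the largest eigenvalue, which is why the bound depends only on $\lambda_{max}$.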
When comparing Eqs. (2.36) and (2.29), one should note that in Newton's method the information about the gradient is corrected with the inverse correlation matrix of the input signal, $R^{-1}$, and with the scalar parameter $\mu$.
This means that in Newton's method the search direction is corrected so that it always points toward the minimum of the criterion function, while in the steepest descent method this direction coincides with that of the fastest decrease of the function. The two directions need not coincide in the general case, and the search path over the criterion function in Newton's method is shorter, which suggests that the optimization process is faster than with the steepest descent method (Fig. 2.7). This advantage stems from the fact that Newton's method uses much more information about the criterion function than the steepest descent method. On the other hand, Newton's algorithm is much more complex than the steepest descent method, since it requires the calculation or estimation of the inverse correlation matrix of the input in each iteration.
However, under real circumstances, in the presence of noise in the estimates of the gradient and of the input data correlation matrix, it may happen that the steepest descent method converges much more slowly toward the minimum of the MSE than Newton's method, or that, for the sake of speed, it converges to a larger value of the MSE criterion.
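The difference in convergence speed between the two methods can be seen even without noise when the eigenvalues of R are widely spread. The sketch below is an illustration under several assumptions: R and D are known exactly, the correlation matrix is an invented example with eigenvalues 3.8 and 0.2, and the same numerical value is used for $\mu$ and $\beta$ purely for convenience (the two factors are not directly comparable, since $\beta$ is not dimensionless). The helper `iterations_to_converge` is hypothetical.

```python
def mat_vec(A, v):
    """Matrix-vector product A v for 2x2 A."""
    return [A[i][0] * v[0] + A[i][1] * v[1] for i in range(2)]


def gradient(R, D, h):
    """Exact gradient (2.21): 2 R h - 2 D."""
    Rh = mat_vec(R, h)
    return [2.0 * Rh[i] - 2.0 * D[i] for i in range(2)]


def iterations_to_converge(step, h_opt, tol=1e-6, max_iter=100000):
    """Count iterations of `step` from the zero vector until every parameter is within tol."""
    h, k = [0.0, 0.0], 0
    while max(abs(h[i] - h_opt[i]) for i in range(2)) > tol and k < max_iter:
        h = step(h)
        k += 1
    return k


# Assumed, strongly correlated input (eigenvalues 3.8 and 0.2).
R = [[2.0, 1.8], [1.8, 2.0]]
h_opt = [1.0, 2.0]
D = mat_vec(R, h_opt)              # chosen so that R h_opt = D

det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
R_inv = [[R[1][1] / det, -R[0][1] / det], [-R[1][0] / det, R[0][0] / det]]

mu = 0.1      # Newton convergence factor, 0 < mu < 1 by (2.34)
beta = 0.1    # steepest descent factor, below 1 / 3.8 as required by (2.37)


def newton(h):
    """Newton update (2.30)."""
    corr = mat_vec(R_inv, gradient(R, D, h))
    return [h[i] - mu * corr[i] for i in range(2)]


def descent(h):
    """Steepest descent update (2.36)."""
    g = gradient(R, D, h)
    return [h[i] - beta * g[i] for i in range(2)]


newton_iters = iterations_to_converge(newton, h_opt)
descent_iters = iterations_to_converge(descent, h_opt)
print("Newton iterations:          ", newton_iters)
print("steepest descent iterations:", descent_iters)
```

In this setting Newton's method contracts the error by the same factor in every direction, while steepest descent is limited by the direction associated with the smallest eigenvalue. The example says nothing about behaviour with noisy estimates, which is the situation the preceding paragraph addresses.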