First, we will discuss several equivalent formulations of (5.12). We will adopt the nomenclature and terminology used in [63]. In this framework, the optimization problem solved in (5.12) is referred to as basis pursuit de-noising (BPDN), or BP$_\sigma$. The same problem solved in the noise-free setting with $\sigma = 0$ is called simply basis pursuit (BP), and its solution is denoted $\hat{x}_{BP}$. The theory of Lagrange multipliers indicates that we can solve an unconstrained problem that will yield the same solution, provided that the Lagrange multiplier is selected correctly. We will refer to this unconstrained problem as the $\ell_1$-penalized quadratic program and denote it as QP$_\lambda$. Similarly, we can solve a constrained optimization problem, but with the constraint placed on the $\ell_1$ norm of the unknown vector instead of the $\ell_2$ norm of the reconstruction error, to obtain yet a third equivalent problem. We will use the name LASSO [64], popular in the statistics community, interchangeably with the notation LS$_\tau$ for this problem. The three equivalent optimization problems can be written as
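$$(\mathrm{BP}_\sigma)\quad \hat{x}_\sigma = \arg\min_{x} \|x\|_1 \ \text{ subject to } \ \|Ax - y\|_2 \le \sigma \tag{5.18}$$
$$(\mathrm{QP}_\lambda)\quad \hat{x}_\lambda = \arg\min_{x} \lambda \|x\|_1 + \|Ax - y\|_2^2 \tag{5.19}$$
$$(\mathrm{LS}_\tau)\quad \hat{x}_\tau = \arg\min_{x} \|Ax - y\|_2 \ \text{ subject to } \ \|x\|_1 \le \tau \tag{5.20}$$
We note that a fourth problem formulation known as the Dantzig selector also appears in the literature [65] and can be expressed as
$$(\mathrm{DS}_\zeta)\quad \hat{x}_\zeta = \arg\min_{x} \|x\|_1 \ \text{ subject to } \ \|A^H(Ax - y)\|_\infty \le \zeta \tag{5.21}$$
but this problem does not yield the same set of solutions as the other three. For a treatment of the relationship between DS$_\zeta$ and the other problems, see [41].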
The first three problems are all different ways of arriving at the same set of solutions.
To be explicit, the solution to any one of these problems is characterized by a triplet of values $(\sigma, \lambda, \tau)$ that renders $\hat{x}_\sigma = \hat{x}_\lambda = \hat{x}_\tau$. Unfortunately, it is very difficult to map
Melvin-5220033 book ISBN : 9781891121531 September 14, 2012 17:41 169
5.3 SR Algorithms 169
the value of one parameter into the values for the other two. However, once a solution to one problem is available, we can calculate (to at least some accuracy) the parameters for the other two solutions.$^{22}$
First, notice that only a certain range of parameters makes sense. Consider solving BP$_\sigma$ with $\sigma = \|y\|_2$. The solution to this problem is obviously $\hat{x}_\sigma = 0$. Any larger value of $\sigma$ will yield the same solution. Similarly, imagine solving LS$_\tau$ with $\tau = \|\hat{x}_{BP}\|_1$. (Recall that $\hat{x}_{BP}$ is the solution to BP$_\sigma$ with $\sigma = 0$.) In other words, this is the minimum $\ell_1$ solution such that $A\hat{x}_{BP} = y$. Any larger value of $\tau$ will produce the same solution. Thus, the solution with $x = 0$ corresponds to a large value of $\lambda$, while the solution $\hat{x}_{BP}$ corresponds to the limit of the solution to QP$_\lambda$ as $\lambda$ approaches zero. Values outside this range will not alter the resulting solution.
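The two end points of the useful $\lambda$ range can be checked numerically in a special case. The sketch below is only an illustration under strong assumptions that are not part of the text: $A$ is a real, square, orthonormal matrix (a hypothetical random one is generated here), so that QP$_\lambda$ has the closed-form solution given by soft thresholding $A^T y$ at level $\lambda/2$. In that setting the first-order optimality condition gives $\hat{x}_\lambda = 0$ exactly when $\lambda \ge 2\|A^T y\|_\infty$, and $\hat{x}_\lambda$ tends to the unique solution of $Ax = y$ as $\lambda \to 0$. The function names are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of the real vector z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def solve_qp_orthonormal(A, y, lam):
    """Closed-form QP_lambda solution; valid only when A.T @ A = I (assumption)."""
    return soft_threshold(A.T @ y, lam / 2.0)

rng = np.random.default_rng(0)
N = 64
A, _ = np.linalg.qr(rng.standard_normal((N, N)))  # hypothetical orthonormal A
x_true = np.zeros(N)
x_true[[3, 17, 40]] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.05 * rng.standard_normal(N)

lam_max = 2.0 * np.max(np.abs(A.T @ y))  # smallest lambda that yields x = 0
x_bp = A.T @ y                           # the only solution of A x = y for this A

x_zero = solve_qp_orthonormal(A, y, lam_max)         # upper end of the range
x_same = solve_qp_orthonormal(A, y, 10.0 * lam_max)  # larger lambda changes nothing
x_small = solve_qp_orthonormal(A, y, 1e-9)           # lambda -> 0 approaches BP

print(np.count_nonzero(x_zero), np.count_nonzero(x_same))
print(np.max(np.abs(x_small - x_bp)))
```

For a general (non-orthonormal, underdetermined) $A$ no such closed form exists, and an iterative solver would be needed to observe the same behavior.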
The fact that the BP solution is the limit of the solution to QP$_\lambda$ is important. Because no positive $\lambda$ corresponds to $\sigma = 0$, the algorithms that solve the unconstrained problem cannot be used to precisely compute BP solutions. Algorithms that solve QP$_\lambda$ exhibit a fundamental deficiency in solving BP, as can be seen by their phase transition. See [66] for results on this issue. Notice that this problem does not arise when dealing with noisy data and solving the problem for $\sigma > 0$, as the corresponding positive $\lambda$ then exists. We will emphasize recovery from noisy data throughout this chapter. In contrast, much of the CS literature centers around solving the noise-free BP problem. From a coding or compression standpoint, this makes a great deal of sense. This distinction, $\sigma > 0$ vs. $\sigma = 0$, colors our discussion, since algorithms that work beautifully for BPDN may work poorly for BP and vice versa. One example is the approximate message passing (AMP) algorithm [66], whose development was at least partially motivated by the inability of algorithms like the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) to solve the BP problem exactly.
We can create a plot of $\|A\hat{x} - y\|_2$ versus $\|\hat{x}\|_1$ which is parametrized by $\lambda$ (or by $\tau$ or $\sigma$) to obtain what is known as the Pareto frontier for our problem of interest. We will denote the Pareto frontier as $\phi(\tau)$. This curve represents the minimum $\ell_2$ error that can be achieved for a given $\ell_1$ bound on the solution norm. Pairs above this curve are suboptimal, and pairs below the curve are unattainable. It turns out that this curve is convex. Furthermore, for a given point on the curve, the three parameters associated with the corresponding solution $\hat{x}$ are given by $\phi(\tau) = \sigma = \|A\hat{x} - y\|_2$, $\tau = \|\hat{x}\|_1$, and $\lambda$ is related to the slope of the Pareto curve at that point [63]. In particular, the slope of the Pareto curve can be calculated explicitly from the solution $\hat{x}$ at that point as
$$\phi'(\tau) = -\frac{\|A^H r\|_\infty}{\|r\|_2}$$
where $r = y - A\hat{x}_\tau$ [63]. This expression is closely related to $\lambda$, which is given by $\lambda = 2\|A^H r\|_\infty$, as shown in [67].$^{23}$ These results are proven and discussed in detail in [63].
Thus, much like the L-curve [68,69] that may be familiar from Tikhonov regularization, the parameter $\lambda$ can be viewed as a setting which allows a tradeoff between a family of Pareto optimal solutions. An example Pareto frontier plot is shown in Figure 5-6. In the figure, we have labeled the values of the end points already discussed.
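The relationships among $\sigma$, $\tau$, $\lambda$, and the slope of the Pareto curve can be checked numerically in a simple case. The following sketch is an illustration only and rests on assumptions not made in the text: $A$ is a real, square, orthonormal matrix (a hypothetical random one), so the QP$_\lambda$ solution is obtained by soft thresholding $A^T y$ at $\lambda/2$. It recovers $\lambda$ from the residual via $\lambda = 2\|A^H r\|_\infty$ and compares the slope formula with a finite-difference estimate taken between two nearby points on the curve.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of the real vector z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pareto_point(A, y, lam):
    """Return (sigma, tau, x) for the QP_lambda solution; assumes A.T @ A = I."""
    x = soft_threshold(A.T @ y, lam / 2.0)
    r = y - A @ x
    return np.linalg.norm(r), np.sum(np.abs(x)), x

rng = np.random.default_rng(1)
N = 32
A, _ = np.linalg.qr(rng.standard_normal((N, N)))  # hypothetical orthonormal A
y = rng.standard_normal(N)

lam = 1.0
sigma, tau, x_hat = pareto_point(A, y, lam)
r = y - A @ x_hat

corr = np.max(np.abs(A.T @ r))      # ||A^H r||_inf
lam_recovered = 2.0 * corr          # should reproduce lam
slope = -corr / np.linalg.norm(r)   # slope formula for phi'(tau)

# Finite-difference slope between two neighboring points on the curve.
sigma2, tau2, _ = pareto_point(A, y, lam + 1e-6)
slope_fd = (sigma2 - sigma) / (tau2 - tau)

print(lam_recovered, slope, slope_fd)
```

The triplet $(\sigma, \lambda, \tau)$ printed for one solution is exactly the characterization described above: once any one of the three problems has been solved, the other two parameters follow from the solution and its residual.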
$^{22}$ A good discussion of the numerical issues in moving between the parameters is provided in [61]. In a nutshell, determining $\lambda$ from the solution to one of the constrained problems is fairly difficult. The other mappings are somewhat more reliable.
$^{23}$ Note that the factor of 2 stems from the choice to not include a 1/2 in the definition of $\hat{x}_\lambda$ in (5.19).
170 C H A P T E R 5 Radar Applications of Sparse Reconstruction
FIGURE 5-6 An example of the Pareto frontier for the linear model. Points above the curve are suboptimal for any choice of the parameters, and those below the curve represent an unattainable combination of the two cost function terms. At a given point on the curve, $\sigma = \|A\hat{x} - y\|_2$, $\tau = \|\hat{x}\|_1$, and $\lambda$ is related to the slope of the curve. The example was generated using the SPGL1 software [63].
If $A$ is orthogonal, then we can approximately map $\lambda = \sigma\sqrt{2 \log N}$ [63,70]. Otherwise, it is very difficult to determine the value of one parameter given another without first solving the problem, as discussed at length in [61]. This is significant, because the parameter is often easier to choose based on physical considerations for the constrained problems, particularly BP$_\sigma$, but the constrained problems are generally harder to solve. As a result, many algorithms solve the unconstrained problem and accept the penalty of more difficult parameter selection. As already mentioned, this issue can be somewhat alleviated by solving the problem for a series of parameter values using a warm-starting or continuation approach. As we shall see in Section 5.4, the unconstrained problem is also beneficial in that we can tack on additional penalty terms to enforce various solution properties and still obtain relatively simple algorithms. Indeed, this addition of multiple penalty terms in a somewhat ad hoc, albeit effective, manner is common in practice [71–73].
Nonetheless, understanding the Pareto frontier and the relationships between the various forms of the optimization problem is highly instructive in interpreting the results. In addition, this explicit mapping between the three problems forms the foundation of the first algorithm discussed in the following section.
5.3.1.2 Solvers
In the last several years, a plethora of solvers has been created for attacking the three $\ell_1$ minimization problems defined in the previous section. We will mention a handful of those approaches here. Our emphasis will be on fast algorithms that do not require explicit access to $A$ and can handle complex-valued signals.
Our first example is SPGL1 [63], the algorithm whose primary reference inspired the discussion of the Pareto frontier in the previous section. This algorithm seeks solutions to BP$_\sigma$, which is, as we have already mentioned, more difficult than solving QP$_\lambda$. The algorithm
takes special advantage of the structure of the Pareto frontier to obtain the desired solution.
In particular, van den Berg and Friedlander develop a fast projected gradient technique for obtaining approximate solutions to LS$_\tau$. The goal is to approximately solve a sequence of these LASSO problems so that $\tau_0, \tau_1, \ldots, \tau_k$ approaches $\tau_\sigma$, which is the value of $\tau$ that renders the problem equivalent to BP$_\sigma$. While slower than solving the unconstrained problem, the fast approximate solutions to these intermediate problems allow the algorithm to solve BP$_\sigma$ in a reasonable amount of time.
Let us consider a step of the algorithm starting with $\tau_k$. First, we compute the corresponding solution $\hat{x}^k_\tau$. As discussed already, this provides both the value and an estimate of the slope of the Pareto curve as
$$\phi(\tau_k) = \|A\hat{x}^k_\tau - y\|_2, \qquad \phi'(\tau_k) = -\frac{\|A^H r^k\|_\infty}{\|r^k\|_2}, \qquad r^k = A\hat{x}^k_\tau - y \tag{5.22}$$
We will choose the next parameter value as $\tau_{k+1} = \tau_k + \Delta\tau_k$. To compute $\Delta\tau_k$, the authors of [63] apply Newton's method. We can linearize the Pareto curve at $\tau_k$ to obtain
$$\phi(\tau) \approx \phi(\tau_k) + \phi'(\tau_k)\,\Delta\tau_k \tag{5.23}$$
We set this expression equal to $\sigma$ and solve for the desired step to obtain
$$\Delta\tau_k = \frac{\sigma - \phi(\tau_k)}{\phi'(\tau_k)} \tag{5.24}$$
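The Newton update in (5.22) through (5.24) can be sketched in a few lines. This is not the SPGL1 implementation: it is a minimal illustration under the assumption that $A$ is a real, square, orthonormal matrix (a hypothetical random one), in which case each LS$_\tau$ subproblem is solved exactly by projecting $A^T y$ onto the $\ell_1$ ball of radius $\tau$, so no inner projected gradient loop is needed. The names `project_l1_ball` and `pareto_root_find` are illustrative, and the projection uses the standard sort-based method.

```python
import numpy as np

def project_l1_ball(v, tau):
    """Euclidean projection of a real vector v onto {x : ||x||_1 <= tau}."""
    if tau <= 0.0:
        return np.zeros_like(v)
    a = np.abs(v)
    if a.sum() <= tau:
        return v.copy()
    u = np.sort(a)[::-1]                        # magnitudes in descending order
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - tau)[0][-1]  # last index meeting the condition
    theta = (css[rho] - tau) / (rho + 1.0)      # soft-threshold level
    return np.sign(v) * np.maximum(a - theta, 0.0)

def pareto_root_find(A, y, sigma, max_iter=50, tol=1e-10):
    """Newton iteration (5.24) on phi(tau); assumes A.T @ A = I."""
    z = A.T @ y
    tau = 0.0
    for _ in range(max_iter):
        x = project_l1_ball(z, tau)             # exact LS_tau solution for this A
        r = A @ x - y                           # residual r^k as in (5.22)
        phi = np.linalg.norm(r)                 # phi(tau_k)
        if abs(phi - sigma) <= tol:
            break
        dphi = -np.max(np.abs(A.T @ r)) / phi   # phi'(tau_k)
        tau += (sigma - phi) / dphi             # Newton step (5.24)
    return x, tau

rng = np.random.default_rng(0)
N = 64
A, _ = np.linalg.qr(rng.standard_normal((N, N)))  # hypothetical orthonormal A
x_true = np.zeros(N)
x_true[[3, 17, 40]] = [2.0, -1.5, 1.0]
y = A @ x_true + 0.05 * rng.standard_normal(N)

sigma = 0.4
x_sigma, tau_sigma = pareto_root_find(A, y, sigma)
print(tau_sigma, np.linalg.norm(A @ x_sigma - y))
```

Because the Pareto curve is convex and decreasing, starting from $\tau_0 = 0$ the Newton iterates approach $\tau_\sigma$ from the left. With a general $A$ the projection step would be replaced by an approximate LASSO solve, which is the situation the convergence results in [63] address.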
The authors of [63] provide an explicit expression for the duality gap, which provides a bound on the current iteration error, and prove several results on guaranteed convergence despite the approximate solution of the sub-problems. Further details can be found in [63], and a MATLAB implementation is readily available online. We should also mention that the SPGL1 algorithm can be used for solving more general problems, including weighted norms, sums of norms, the nuclear norm for matrix-valued unknowns, and other cases [74].
We will now discuss two closely related algorithms that were developed in the radar community for SAR imaging for solving generalizations of QP$_\lambda$. The algorithms can be used to solve the $\ell_1$ problem specifically, and hence inherit our RIP-based performance guarantees, but they can also solve more general problems of potential interest to radar practitioners. First, we will consider the algorithm developed in [75], which addresses the modified cost function
$$\hat{x} = \arg\min_{x} \lambda_1 \|x\|_p^p + \lambda_2 \|D|x|\|_p^p + \|Ax - y\|_2^2 \tag{5.25}$$
where $D$ is an approximation of the 2-D gradient of the magnitude image whose voxel values are encoded in $|x|$.
This second term, for $p = 1$, is the total variation (TV) norm of the magnitude image. The TV norm is the $\ell_1$ norm of the gradient. In essence, this norm penalizes rapid variation and tends to produce smooth images. As Cetin and Karl point out, this term can help to eliminate speckle and promote sharp edges in SAR imagery. Indeed, TV minimization has seen broad application in the radar, CS, and image processing communities [76]. Notice
in (5.25) that the TV norm of the magnitude of the image, rather than the complex-valued reflectivity, is penalized. This choice is made to allow rapid phase variations.$^{24}$ Notice also that the $\ell_p$ norm is used with $0 < p \le 2$. As we have mentioned, selecting $p < 1$ can improve performance but yields a nonconvex problem. Cetin and Karl replace the $\ell_p$ terms in the cost function with differentiable approximations, derive an approximation for the Hessian of the cost function, and implement a quasi-Newton method.
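The effect of penalizing the magnitude image rather than the complex reflectivity can be seen in a small numerical example. The sketch below is illustrative only: it assumes that $D$ is built from simple first differences along rows and columns (one possible gradient approximation, not necessarily the one used in [75]), takes $p = 1$, and uses a hypothetical test image with a constant-magnitude block and random phase. The function names are not from the source.

```python
import numpy as np

def magnitude_tv(image):
    """l1 norm of first differences of |image| (assumed form of D|x|, p = 1)."""
    mag = np.abs(image)
    dx = np.diff(mag, axis=1)   # horizontal differences
    dy = np.diff(mag, axis=0)   # vertical differences
    return np.sum(np.abs(dx)) + np.sum(np.abs(dy))

def complex_tv(image):
    """Same differences applied directly to the complex-valued image."""
    dx = np.diff(image, axis=1)
    dy = np.diff(image, axis=0)
    return np.sum(np.abs(dx)) + np.sum(np.abs(dy))

rng = np.random.default_rng(0)
mag = np.zeros((16, 16))
mag[4:12, 4:12] = 1.0                          # piecewise-constant magnitude block
phase = rng.uniform(0.0, 2.0 * np.pi, mag.shape)
image = mag * np.exp(1j * phase)               # random phase from pixel to pixel

tv_mag = magnitude_tv(image)    # sees only the block edges
tv_cplx = complex_tv(image)     # also charged for every phase jump inside the block

print(tv_mag, tv_cplx)
```

The magnitude-based penalty counts only the edges of the block, whereas the complex-valued version is also charged for every pixel-to-pixel phase change inside it, which is why the magnitude form tolerates rapid phase variation.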
Kragh [77] developed a closely related algorithm (for the case with $\lambda_2 = 0$) along with additional convergence guarantees leveraging ideas from majorization minimization (MM).$^{25}$ For the case with no TV penalty, both algorithms$^{26}$ end up with an iteration of the form
$$\hat{x}^{k+1} = \left[ A^H A + h(\hat{x}^k) \right]^{-1} A^H y \tag{5.26}$$
where $h(\cdot)$ is a function based on the norm choice $p$. The matrix inverse can be implemented with preconditioned conjugate gradients to obtain a fast algorithm. Notice that $A^H A$ often represents a convolution that can be calculated using fast Fourier transforms (FFTs). A more detailed discussion of these algorithms and references to various extensions to radar problems of interest, including nonisotropic scattering, can be found in [17].
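An iteration of the form (5.26) can be sketched as follows. The text does not specify $h(\cdot)$, so the code assumes one common choice: each $|x_i|^p$ is replaced by the smooth surrogate $(|x_i|^2 + \epsilon)^{p/2}$, which leads to the diagonal matrix $h(x) = \lambda \frac{p}{2}\,\mathrm{diag}\big((|x_i|^2 + \epsilon)^{p/2 - 1}\big)$. The smoothing constant `eps`, the starting point $A^H y$, the random test problem, and the use of a dense solve in place of preconditioned conjugate gradients are all assumptions made for brevity, not details taken from [75] or [77].

```python
import numpy as np

def smoothed_cost(A, y, x, lam, p, eps):
    """lam * sum((|x_i|^2 + eps)^(p/2)) + ||A x - y||_2^2."""
    penalty = np.sum((np.abs(x) ** 2 + eps) ** (p / 2.0))
    return lam * penalty + np.linalg.norm(A @ x - y) ** 2

def reweighted_iteration(A, y, lam, p=1.0, eps=1e-6, n_iter=30):
    """Repeat x <- (A^H A + h(x))^{-1} A^H y with an assumed diagonal h."""
    AH = A.conj().T
    gram = AH @ A
    rhs = AH @ y
    x = rhs.copy()                              # start from the back-projection A^H y
    costs = [smoothed_cost(A, y, x, lam, p, eps)]
    for _ in range(n_iter):
        h = lam * (p / 2.0) * (np.abs(x) ** 2 + eps) ** (p / 2.0 - 1.0)
        x = np.linalg.solve(gram + np.diag(h), rhs)   # dense solve stands in for PCG
        costs.append(smoothed_cost(A, y, x, lam, p, eps))
    return x, costs

rng = np.random.default_rng(0)
M, N = 40, 80
A = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2 * M)
x_true = np.zeros(N, dtype=complex)
x_true[[5, 22, 47, 63]] = [1.0 + 1.0j, -2.0, 1.5j, 0.8 - 0.6j]
noise = 0.01 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
y = A @ x_true + noise

x_hat, costs = reweighted_iteration(A, y, lam=0.1, p=1.0)
print(costs[0], costs[-1])
```

With this choice of $h$, each update minimizes a quadratic majorizer of the smoothed cost, so the recorded cost should not increase from one iteration to the next for $0 < p \le 2$. For large problems the dense solve would be replaced by an iterative method that only needs products with $A$ and $A^H$.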
These algorithms do not begin to cover the plethora of existing solvers. Nonetheless, these examples have proven useful in radar applications. The next section will consider thresholding algorithms for SR. As we shall see, these algorithms trade generality for faster computation while still providing solutions to QPλ.