The generalisation from the Euclidean distance to Bregman distances is significant to optimisation and regularisation theory. In what follows, we briefly consider the Bregman proximal method and show that the derivative convergence result, Theorem 7.33, extends to this case under certain conditions.
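Denote by $J : \mathbb{R}^n \to \mathbb{R}$ a function that is $C^1$-smooth on $\operatorname{int}\operatorname{dom} J$ and 1-convex.$^3$ For $F = V + R$ given by (7.18), the iteration map for the Bregman proximal method is given by
\[
\mathcal{A}_k(y, \vartheta) := \operatorname*{arg\,min}_{x \in \mathbb{R}^n} f_k^J(x, y, \vartheta), \qquad
f_k^J(x, y, \vartheta) := \frac{1}{\tau_k} D_J(x, y) + R(x, \vartheta) + \langle \nabla_x V(y, \vartheta), x - y \rangle. \tag{7.29}
\]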
Algorithm 6 Bregman proximal method
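Input: starting point $x_0 \in \mathbb{R}^n$, parameter $\vartheta \in \Omega$, time steps $(\tau_k)_{k \in \mathbb{N}} \subset [\varepsilon, 1/L]$ for some $\varepsilon > 0$
for $k = 0, 1, 2, \dots$ do
    $x_{k+1} = \mathcal{A}_k(x_k, \vartheta)$, with $\mathcal{A}_k$ given in (7.29)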
end for
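The loop in Algorithm 6 can be illustrated with a minimal sketch. It is not the author's implementation: it assumes the scalar case with the Euclidean generating function $J(x) = x^2/2$ (so $D_J(x, y) = (x - y)^2/2$), $V(x, \vartheta) = (x - \vartheta)^2/2$ (so $L = 1$) and $R(x) = \lambda |x|$, for which the iteration map (7.29) reduces to a soft-thresholding step. The names `soft_threshold`, `euclidean_iteration_map` and `bregman_proximal_method` are hypothetical.

```python
def soft_threshold(z, t):
    """Proximal map of t * |x| for scalar z (assumed choice of R)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def euclidean_iteration_map(y, theta, tau, lam):
    """Iteration map (7.29) under the assumptions J(x) = x**2 / 2,
    V(x, theta) = (x - theta)**2 / 2 and R(x) = lam * |x|."""
    grad_v = y - theta  # gradient of V with respect to x, evaluated at y
    return soft_threshold(y - tau * grad_v, tau * lam)


def bregman_proximal_method(iteration_map, x0, theta, step_sizes):
    """Algorithm 6: repeatedly apply the iteration map with step sizes tau_k."""
    x = x0
    for tau in step_sizes:
        x = iteration_map(x, theta, tau)
    return x


# Constant step size tau_k = 0.5 <= 1/L, with L = 1 for this V.
x_final = bregman_proximal_method(
    lambda y, theta, tau: euclidean_iteration_map(y, theta, tau, lam=0.5),
    x0=0.0,
    theta=2.0,
    step_sizes=[0.5] * 100,
)
```

For these assumed choices the minimiser of $V + R$ is the soft-thresholding of $\vartheta$ at level $\lambda$, so `x_final` should agree with `soft_threshold(2.0, 0.5)` up to rounding. Passing the iteration map as a callable keeps the loop independent of the particular choice of $J$.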
As before, we assume that $\tau_k \to \tau$. In comparison to Algorithm 5, this algorithm is more restrictive, as there is no inertial step, i.e. $a_k = 0$, and $\tau_k \le 1/L$. Regarding the restriction on $a_k$: as is pointed out in [212], FISTA does not seem to be directly extendible to the Bregman distance setting, and while other acceleration variants have been proposed [213], we do not consider these here. Regarding the step sizes, time steps $\tau_k$ up to $2/L - \varepsilon$ are possible for suitable choices of the Bregman distance generating function $J$; see [212, Definition 4.1] and the surrounding discussion.

$^3$ We only consider smooth functions, since otherwise the Bregman proximal map would depend on a subgradient choice $p \in \partial J(y)$, further complicating the algorithmic differentiation.
Suppose the objective function $V + R$ satisfies Assumption 7.23. Arguing as in Section 7.5.1 and noting the $L$-smoothness of $V$, one sees that $f_k^J$ also satisfies Assumption 7.23 when $(y, \vartheta)$ are treated as the parameters. Denote by $I^J_{\tau_k V, \tau_k R}(y, \vartheta)$ the index set (7.13) corresponding to (7.29). The following result is analogous to Lemma 7.31 and Proposition 7.32 for the Bregman proximal method.
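Lemma 7.35. Suppose $F = V + R : \mathbb{R}^n \times \Omega \to \mathbb{R}$ is given by (7.18) and satisfies Assumption 7.23. Then the Bregman proximal mapping $\mathcal{A}_k(y, \vartheta)$ in (7.29) is piecewise smooth in both arguments, with differential $D\mathcal{A}_k(y, \vartheta) = [\nabla_x \mathcal{A}_k(y, \vartheta), D_\vartheta \mathcal{A}_k(y, \vartheta)]^T$ having a minimal local representation of
\[
\begin{bmatrix}
(\nabla^2_{M_i} J(x) + \tau_k \nabla^2_{M_i} R(x, \vartheta))^\dagger (\nabla^2 J(y) - \tau_k \nabla^2_x V(y, \vartheta)) \\
-\tau_k (\nabla^2_{M_i} J(x) + \tau_k \nabla^2_{M_i} R(x, \vartheta))^\dagger (D_\vartheta \nabla_{M_i} R(x, \vartheta) + D_\vartheta \nabla_x V(y, \vartheta))
\end{bmatrix}_{i \in I^J_k(y, \vartheta)}, \tag{7.30}
\]
where $x = \mathcal{A}_k(y, \vartheta)$.
Furthermore, if (ND) holds for $F(\cdot, \vartheta)$ at $x^*$, then $\mathcal{A}_k$ is locally continuously differentiable near $(x^*, \vartheta)$.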
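Proof. Piecewise smoothness follows from Theorem 7.29 applied to $f_k^J(x, y, \vartheta)$.
For the second part, it is sufficient to show that $0 \in \operatorname{ri} \partial_x f_k^J(x^*, x^*, \vartheta)$ and apply [130, Theorem 5.7]. We have
\[
\partial_x f_k^J(x^*, x^*, \vartheta) = \frac{1}{\tau_k} (\nabla J(x^*) - \nabla J(x^*)) + \partial_x R(x^*, \vartheta) + \nabla_x V(x^*, \vartheta) = \partial_x F(x^*, \vartheta),
\]
so the claim follows from (ND), and the proof is complete.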
Theorem 7.36. Let the function $F \equiv V + R : \mathbb{R}^n \times \Omega \to \mathbb{R}$ be given by (7.18) and suppose it satisfies Assumption 7.23. Furthermore, suppose for $\vartheta \in \Omega$ that the iterates $x_k(\vartheta)$ given by Algorithm 6 converge to a minimiser $x^* \in \operatorname{int}\operatorname{dom} J$ of $F(\cdot, \vartheta)$, and that (ND) holds for $F(\cdot, \vartheta)$ at $x^*$. Then the sequence of (semi)derivatives $Dx_k(\vartheta)$ converges linearly to the single-valued limit $Dx(\vartheta)$.
7.5 Algorithmic differentiation 163
Proof. We argue along the same lines as in the proof of Theorem 7.33. Let $M \subset \mathbb{R}^n$ be a smooth manifold such that $F$ is partly smooth at $(x^*, \vartheta)$ relative to $M \times \mathbb{R}^n$. By Lemma 7.35, there is $K \in \mathbb{N}$ such that for all $k \ge K$, $f_k$ is continuously differentiable near $g_k(x_k, \vartheta)$.
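Applying (7.20) to (7.29), we have
\[
Dx_{k+1}(\vartheta) = A_k Dx_k(\vartheta) + b_k, \tag{7.31}
\]
where
\[
A_k := \nabla_x f_k^J(x_k, \vartheta), \qquad b_k := D_\vartheta f_k^J(x_k, \vartheta).
\]
Write $f^J := \lim_{k \to \infty} f_k^J$. By Lemma 7.35, there is $K \in \mathbb{N}$ such that for all $k \ge K$, the iterations $\mathcal{A}_k(x_k, \vartheta)$ are locally continuously differentiable, and we have
\begin{align*}
A_k &\to (\nabla^2_M J(x^*) + \tau \nabla^2_M R(x^*, \vartheta))^\dagger (\nabla^2 J(x^*) - \tau \nabla^2_x V(x^*, \vartheta)) =: A \in \mathbb{R}^{n,n}, \\
b_k &\to -\tau (\nabla^2_M J(x^*) + \tau \nabla^2_M R(x^*, \vartheta))^\dagger (D_\vartheta \nabla_M R(x^*, \vartheta) + D_\vartheta \nabla_x V(x^*, \vartheta)) =: b \in \mathbb{R}^{n,m}.
\end{align*}
Write for shorthand
\[
M_J := \nabla^2_M J(x^*), \qquad M_R := \nabla^2_M R(x^*, \vartheta), \qquad M_V := \nabla^2_x V(x^*, \vartheta),
\]
so that $A = (M_J + \tau M_R)^\dagger (M_J - \tau M_V)$.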
We need to show that $\rho(A) < 1$. Suppose $Ax = \lambda x$ for some $x \in \mathbb{C}^n$, $\lambda \in \mathbb{C} \setminus \{0\}$. Note that any such eigenvector $x$ of $A$ must lie in the subspace $T_{x^*}M$, so the spectrum of $A$ in $\mathbb{R}^n$ coincides with its spectrum restricted to $T_{x^*}M$. Furthermore, restricted to this subspace, $A$ satisfies the conditions of Proposition 2.8, meaning $\lambda \in \mathbb{R}$. Since $x \in T_{x^*}M$, we can rearrange $\lambda x = Ax$ to get
\[
(1 - \lambda) M_J x = \tau (\lambda M_R + M_V) x.
\]
Taking the inner product of each side with $x$, we get
\[
(1 - \lambda) \langle x, M_J x \rangle = \tau \lambda \langle x, M_R x \rangle + \tau \langle x, M_V x \rangle. \tag{7.32}
\]
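By strong convexity of $F$ and $J$, there are $\mu, \nu \ge 0$ with $\mu + \nu > 0$ such that $\langle x, M_J x \rangle \ge \|x\|^2$, $\tau \langle x, M_R x \rangle \ge \tau \nu \|x\|^2$, and $\tau \langle x, M_V x \rangle \in [\varepsilon \mu \|x\|^2, \|x\|^2]$. One can then verify that for (7.32) to hold, we must have $|\lambda| < 1$, and hence $\rho(A) < 1$.
Therefore, by Lemma 7.35, $Dx_k(\vartheta)$ converges linearly to $(I - A)^{-1} b$. It remains to show that $(I - A) Dx(\vartheta) = b$. Writing
\[
D_V = D_\vartheta \nabla_M V(x(\vartheta), \vartheta), \qquad D_R = D_\vartheta \nabla_M R(x(\vartheta), \vartheta),
\]
we have
\begin{align*}
(I - A) Dx(\vartheta) &= -\left[ I - (M_J + \tau M_R)^\dagger (M_J - \tau M_V) \right] (M_V + M_R)^\dagger (D_V + D_R) \\
&= -\tau (M_J + \tau M_R)^\dagger (D_V + D_R) = b.
\end{align*}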
This concludes the proof.
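The spectral bound used in the proof can be made explicit by solving (7.32) for $\lambda$. The following is a sketch under the bounds stated in the proof; the abbreviations $a$, $r$ and $v$ are introduced here only for illustration. Set

\[
a := \langle x, M_J x \rangle \ge \|x\|^2, \qquad
r := \tau \langle x, M_R x \rangle \ge \tau \nu \|x\|^2, \qquad
v := \tau \langle x, M_V x \rangle \in [\varepsilon \mu \|x\|^2, \|x\|^2].
\]

Then (7.32) reads $(1 - \lambda) a = \lambda r + v$, so that

\[
\lambda = \frac{a - v}{a + r}.
\]

Since $v \le \|x\|^2 \le a$, the numerator is nonnegative, and since $r + v \ge (\tau \nu + \varepsilon \mu) \|x\|^2 > 0$ for $x \ne 0$, the numerator is strictly smaller than the denominator. Hence $0 \le \lambda < 1$ for every nonzero eigenvalue, which is consistent with $\rho(A) < 1$.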
As mentioned earlier, we do not consider nonsmooth Bregman distance generating functions $J : \mathbb{R}^n \to \mathbb{R}$, as this would involve differentiation with respect to an additional variable, namely the subgradients $p_k \in \partial J(x_k)$. We therefore leave this for future research.
Second, in Theorem 7.36, we assume that $x^* \in \operatorname{int}\operatorname{dom} J$. This ensures that $A_k$ converges to a unique limit. However, this assumption does not hold in general, including for some popular Bregman distances such as the Kullback–Leibler divergence $D_J(x, y) = x(\log x - \log y) - (x - y)$ generated by the entropy function $J(x) = x \log x$ (in one dimension). Furthermore, as was demonstrated in [166, 167], one can construct iterative methods that solve nonsmooth variational problems, yet whose iteration map $\mathcal{A}(x, \vartheta)$ is continuously differentiable, provided the nonsmoothness can be expressed as convex constraints that coincide with $\operatorname{cl}\operatorname{dom} J$. In these settings, one expects $x^* \notin \operatorname{dom} J$.
While we do not prove convergence results for the case where $x^* \notin \operatorname{int}\operatorname{dom} J$, we show for a simple example with the Kullback–Leibler divergence that the algorithmic iterates $Dx_k$ do converge to the implicit derivative $Dx$ even when $x^* = 0 \notin \operatorname{dom} J$.
Example 7.37. Consider the simple example
\[
x(\vartheta) = \operatorname*{arg\,min}_{x \in \mathbb{R}^n} V(x, \vartheta) + \delta_{\ge 0}(x),
\]
and $J(x) = \sum_{i=1}^n x_i \log x_i$. The Bregman distance is the Kullback–Leibler divergence given by
\[
D_J(x, y) = \sum_{i=1}^n x_i (\log x_i - \log y_i) - (x_i - y_i).
\]
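The identification of this Bregman distance with the Kullback–Leibler divergence can be checked numerically. The sketch below assumes the standard definition $D_J(x, y) = J(x) - J(y) - \langle \nabla J(y), x - y \rangle$, which is not restated in this section, and strictly positive inputs; all function names are hypothetical.

```python
import math


def entropy(x):
    """Entropy generating function J(x) = sum_i x_i * log(x_i), for x_i > 0."""
    return sum(xi * math.log(xi) for xi in x)


def entropy_grad(x):
    """Gradient of J: component i is log(x_i) + 1."""
    return [math.log(xi) + 1.0 for xi in x]


def bregman_distance(x, y):
    """Assumed standard definition: J(x) - J(y) - <grad J(y), x - y>."""
    g = entropy_grad(y)
    inner = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
    return entropy(x) - entropy(y) - inner


def kl_divergence(x, y):
    """Closed form: sum_i x_i * (log x_i - log y_i) - (x_i - y_i)."""
    return sum(
        xi * (math.log(xi) - math.log(yi)) - (xi - yi) for xi, yi in zip(x, y)
    )


x = [0.2, 0.7]
y = [0.5, 0.4]
gap = abs(bregman_distance(x, y) - kl_divergence(x, y))
```

Here `gap` should vanish up to floating-point rounding, since the two expressions agree term by term once $\nabla J(y) = \log y + 1$ is substituted.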
We assume that $x_0 \in \mathbb{R}^n$ is such that $\{x : V(x) \le V(x_0)\} \subset [0, 1]^n$, as this ensures that $J$ is 1-convex on the region containing the iterates.
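For $\tau_k \in [\varepsilon, 1/L]$, the iterates of Algorithm 6 yield the updates
\[
x_{k+1}(\vartheta) = x_k \exp(-\tau_k \nabla V(x_k(\vartheta), \vartheta)) \to x(\vartheta) =: x^*.
\]
We differentiate this with respect to $\vartheta$ and, by the product rule, obtain
\[
Dx_{k+1}(\vartheta) = Dx_k(\vartheta) \exp(-\tau_k \nabla V(x_k(\vartheta), \vartheta)) + x_k D_\vartheta \big( \exp(-\tau_k \nabla V(x_k(\vartheta), \vartheta)) \big),
\]
where $\exp$ and the products of vectors are applied element-wise. For each $i$, if $[x_k]_i \to 0$, then
\[
[Dx_{k+1}(\vartheta)]_i = [Dx_k(\vartheta)]_i \exp(-\tau_k \nabla_i V(x_k(\vartheta), \vartheta)) + O(|[x_k]_i|).
\]
In this setting, the condition (ND) holds if and only if, for each $i$ such that $x^*_i = 0$, one has $[\nabla V(x^*, \vartheta)]_i > 0$. When this is the case, the factor $\exp(-\tau_k \nabla_i V(x_k(\vartheta), \vartheta))$ is eventually bounded away from 1, and we see that $[Dx_k]_i \to 0$ linearly. In conclusion, we have
\[
Dx_k(\vartheta) \to Dx(\vartheta).
\]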
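The behaviour described in Example 7.37 can be checked numerically. The following is a minimal sketch, not the author's implementation. It assumes the separable quadratic $V(x, \vartheta) = \tfrac{1}{2} \|x - \vartheta\|^2$ (so $L = 1$), a constant step size, and a starting point that does not depend on $\vartheta$; the sublevel-set assumption of the example is not verified for these values. Because this $V$ is separable, only the diagonal entries $\partial x_i / \partial \vartheta_i$ of the Jacobian are tracked. The function name `kl_bregman_iterates` is hypothetical.

```python
import math


def kl_bregman_iterates(theta, x0, tau, num_iters):
    """Run x_{k+1} = x_k * exp(-tau * grad V(x_k)) together with its derivative.

    Assumes V(x, theta) = 0.5 * ||x - theta||^2, so grad V = x - theta and the
    Jacobian of the iterates with respect to theta is diagonal.
    Returns the final iterate x and the diagonal dx of its Jacobian.
    """
    x = list(x0)
    dx = [0.0] * len(x)  # x0 is assumed independent of theta
    for _ in range(num_iters):
        for i in range(len(x)):
            grad = x[i] - theta[i]
            w = math.exp(-tau * grad)
            # Chain rule: d/dtheta_i of exp(-tau * (x_i - theta_i)).
            dw = -tau * w * (dx[i] - 1.0)
            # Product rule for x_{k+1} = x_k * w, using the old x[i] and dx[i].
            x[i], dx[i] = x[i] * w, dx[i] * w + x[i] * dw
    return x, dx


# theta_1 > 0 gives an interior minimiser; theta_2 < 0 gives x*_2 = 0,
# where grad V = 0.3 > 0, so (ND) holds in the sense stated in the example.
x_final, dx_final = kl_bregman_iterates(
    theta=[0.5, -0.3], x0=[0.8, 0.8], tau=0.5, num_iters=400
)
```

For this assumed $V$, the minimiser over $x \ge 0$ is $\max(\vartheta, 0)$ component-wise, so the implicit derivative is 1 in the first component and 0 in the second. The entries of `dx_final` should approach these values, including in the component where the iterate tends to the boundary point 0.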