5.4.1 Computation for Images
For each image view, we measure the similarity of each sample pair using the neighbors of each point. The construction of $W_i$ is illustrated below via the $\ell_1$-graph [26], which has been shown to be robust to data noise, automatically sparse, and adaptive to the neighborhood.
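For each $X^i_p$, we find the coefficients $\boldsymbol{\beta} \in \mathbb{R}^{N-1}$ such that $X^i_p = B\boldsymbol{\beta}$, where
$$B = [X^i_1, \cdots, X^i_{p-1}, X^i_{p+1}, \cdots, X^i_N] \in \mathbb{R}^{D_i \times (N-1)}.$$
Considering the noise effect, we can rewrite it as $X^i_p = B'\boldsymbol{\beta}'$, where $B' = [B, I] \in \mathbb{R}^{D_i \times (D_i+N-1)}$ and $\boldsymbol{\beta}' \in \mathbb{R}^{D_i+N-1}$. Thus, seeking the sparse representation for $X^i_p$ leads to the following optimization problem:
$$\arg\min_{\boldsymbol{\beta}'} \|X^i_p - B'\boldsymbol{\beta}'\|_2, \quad \text{s.t. } \|\boldsymbol{\beta}'\|_1 < \varepsilon, \tag{5.24}$$
where $\varepsilon$ is a parameter with a small value. This problem can be solved by orthogonal matching pursuit [116].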
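As a rough illustration of how a problem such as (5.24) might be attacked greedily, the following NumPy sketch implements a basic orthogonal matching pursuit over the augmented dictionary $B' = [B, I]$. It is a minimal sketch under assumptions that are not in the text: the function name `omp`, the stopping rule (a sparsity budget `max_atoms` and a residual tolerance `tol`), and the toy data are all hypothetical, and the sparsity budget stands in for the $\ell_1$ bound $\varepsilon$ rather than enforcing it directly.

```python
import numpy as np

def omp(dictionary, x, max_atoms, tol=1e-8):
    """Greedy sparse coding of x over the columns of `dictionary` (hypothetical helper)."""
    n_atoms = dictionary.shape[1]
    coef = np.zeros(n_atoms)
    support = []                       # indices of the columns selected so far
    residual = x.astype(float).copy()
    for _ in range(min(max_atoms, n_atoms)):
        if np.linalg.norm(residual) <= tol:
            break
        # Pick the column most correlated with the current residual.
        corr = np.abs(dictionary.T @ residual)
        if support:
            corr[support] = -1.0       # never reselect a column
        support.append(int(np.argmax(corr)))
        # Refit all selected coefficients jointly by least squares.
        sol, *_ = np.linalg.lstsq(dictionary[:, support], x, rcond=None)
        coef[:] = 0.0
        coef[support] = sol
        residual = x - dictionary @ coef
    return coef

# Toy data (assumed): D_i = 4 dimensions, N - 1 = 3 other samples.
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
B_prime = np.hstack([B, np.eye(B.shape[0])])   # B' = [B, I]
x_p = np.array([0.0, 2.0, 0.0, 0.0])           # exactly twice the second sample
beta_prime = omp(B_prime, x_p, max_atoms=2)
```

Only the first $N-1$ entries of the returned vector correspond to other samples; the remaining $D_i$ entries absorb noise through the identity block.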
Considering the different probability distributions that exist over the data points and the natural locality of the data, we first fit a Gaussian mixture model (GMM) to the training data of each view. On the one hand, it has been observed that data in a high-dimensional space do not always follow a single distribution but are naturally clustered into several groups. On the other hand, realistic data distributions are commonly assumed to follow the same form, i.e., the Gaussian distribution. Unsupervised GMM clustering therefore yields G clusters for each view. We can then solve problem (5.24) by representing each point with the data from its own cluster rather than with all the data points in B, which also reduces the computational cost of problem (5.24).
In particular, for $\boldsymbol{\beta}' = (\beta_1, \cdots, \beta_{D_i+N-1})$, we can first set $\beta_q = 0$ if $X^i_q$ and $X^i_p$ are in different clusters, $\forall q \neq p$, then solve the above problem. Now the similarity matrix $W_i \in \mathbb{R}^{N \times N}$ can be defined as: $(W_i)_{pp} = 0$, $\forall p$, $(W_i)_{pq} = |\beta_q|$ if $q < p$, and $(W_i)_{pq} = |\beta_{q-1}|$ if $q > p$. To ensure symmetry, we update $W_i \leftarrow (W_i^T + W_i)/2$. Then we set the diagonal matrix $D_i \in \mathbb{R}^{N \times N}$ with $(D_i)_{pp} = \sum_q (W_i)_{pq}$ and the Laplacian matrix $L_i = D_i - W_i$ for each view $i$.
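The index bookkeeping in this construction is easy to get wrong, so the following pure-Python sketch spells it out with zero-based indices. It assumes a hypothetical input `betas`, where `betas[p]` holds only the first $N-1$ (sample) coefficients of point $p$, with the noise coefficients already dropped; the function name is likewise an illustration and not part of the text.

```python
def similarity_and_laplacian(betas):
    """Build W (symmetrized) and L = D - W from per-point coefficients.

    betas[p] is assumed to hold the N-1 sample coefficients of point p,
    ordered as the columns of B (all samples except p itself).
    """
    n = len(betas)
    W = [[0.0] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            if q < p:
                W[p][q] = abs(betas[p][q])        # sample q sits at column q of B
            elif q > p:
                W[p][q] = abs(betas[p][q - 1])    # shifted because sample p is left out
    # Symmetrize: W <- (W^T + W) / 2.
    W = [[(W[p][q] + W[q][p]) / 2.0 for q in range(n)] for p in range(n)]
    # Degrees are the row sums; the Laplacian is L = D - W.
    degrees = [sum(row) for row in W]
    L = [[(degrees[p] if p == q else 0.0) - W[p][q] for q in range(n)]
         for p in range(n)]
    return W, L

# Toy coefficients for N = 3 points (assumed values).
W, L = similarity_and_laplacian([[0.5, 0.0], [1.0, 0.0], [0.0, 0.2]])
```

Each row of the resulting Laplacian sums to zero, which is a quick sanity check for the construction.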
Fig. 5.1 Illustration of selected middle frames from actions “Handwaving” and “Diving”.
5.4.2 Computation for Videos
Incremental Naive Bayes Keyframe Selection
In a video sequence, however, not all poses are informative and discriminative for action recognition. Some poses carry neither complete nor accurate information and may even contain common patterns shared by various action types. Since such poses cannot represent the action well and would cause confusion during classification, a weakly supervised method, termed Incremental Naive Bayes Filter (INBF), is used to filter out the noisy representations and keep the relatively representative and discriminative poses, i.e., the key poses.
For each action category, ten action sequences are randomly selected. From each action sequence, we choose a small set of discriminative poses for the given action type as the initial positive samples of INBF (labeled as y = 1), and the remaining frames are used as the negative ones (y = 0). As illustrated in Fig. 5.1, the five frames in the middle of an action sequence are selected as the discriminative poses. We repeat this procedure for each action type. After this initialization, INBF is regarded as an unsupervised online learning strategy.
For the i-th feature view, the representation of each pose (frame) $s$ can be written as $x^i(s) = (x^i_1(s), \cdots, x^i_D(s)) \in \mathbb{R}^D$. Since all the features we extracted are based on statistical histograms, we assume all elements in $x^i$ are independently distributed and model them with a naive Bayes classifier:
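$$P(x^i) = \log\left(\frac{\prod_{m=1}^{D}\Pr(x^i_m \mid y=1)\,\Pr(y=1)}{\prod_{m=1}^{D}\Pr(x^i_m \mid y=0)\,\Pr(y=0)}\right) = \sum_{m=1}^{D}\log\left(\frac{\Pr(x^i_m \mid y=1)}{\Pr(x^i_m \mid y=0)}\right), \tag{5.25}$$
where $y \in \{0, 1\}$ is a binary variable whose values 0 and 1 represent the negative and positive sample labels, respectively. The second equality holds when the two priors are equal, i.e., $\Pr(y=1) = \Pr(y=0)$.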
Furthermore, in both statistics and physics, real-world data are often empirically modeled by the same form of distribution, i.e., the Gaussian distribution. Thus, the conditional distributions $x^i_m \mid y=1$ and $x^i_m \mid y=0$ in the classifier $P(x^i)$ are assumed to be Gaussian with the four-tuple $(\mu^{y=1}_m, \mu^{y=0}_m, \sigma^{y=1}_m, \sigma^{y=0}_m)$, which satisfy
$$x^i_m \mid y=1 \sim N(\mu^{y=1}_m, \sigma^{y=1}_m) \quad \text{and} \quad x^i_m \mid y=0 \sim N(\mu^{y=0}_m, \sigma^{y=0}_m).$$
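To make the classifier score concrete, the sketch below evaluates the per-feature log-likelihood ratio of Eq. (5.25) under the Gaussian assumption. It is an illustration only: the function names are hypothetical, equal class priors are assumed so that the prior terms cancel, and each $\sigma$ is treated as a standard deviation, which is how the square roots in the update rules suggest it is used.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log-density of a univariate Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def naive_bayes_score(x, mu_pos, sigma_pos, mu_neg, sigma_neg):
    """Sum over features of log Pr(x_m | y=1) - log Pr(x_m | y=0), equal priors assumed."""
    return sum(
        gaussian_log_pdf(xm, mp, sp) - gaussian_log_pdf(xm, mn, sn)
        for xm, mp, sp, mn, sn in zip(x, mu_pos, sigma_pos, mu_neg, sigma_neg)
    )

# Toy two-feature model (assumed values).
mu_pos, sigma_pos = [1.0, 2.0], [1.0, 1.0]
mu_neg, sigma_neg = [0.0, 0.0], [1.0, 1.0]
score_at_pos = naive_bayes_score([1.0, 2.0], mu_pos, sigma_pos, mu_neg, sigma_neg)
score_at_neg = naive_bayes_score([0.0, 0.0], mu_pos, sigma_pos, mu_neg, sigma_neg)
```

A positive score favors the key-pose class ($y = 1$) and a negative score favors the background class ($y = 0$).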
So far, for a given feature view, we can initialize a group of naive Bayes models for each action type, and each training sequence is successively passed through all the models. The Gaussian parameters in INBF can then be incrementally updated as follows:
$$\begin{aligned} \mu^{y=1}_m &\leftarrow \lambda \mu^{y=1}_m + (1-\lambda)\mu^{y=1}, \\ \sigma^{y=1}_m &\leftarrow \sqrt{\lambda(\sigma^{y=1}_m)^2 + (1-\lambda)(\sigma^{y=1})^2 + \lambda(1-\lambda)(\mu^{y=1}_m - \mu^{y=1})^2}, \end{aligned} \tag{5.26}$$
where $\mu^{y=1} = \frac{1}{S}\sum_{s \mid y(s)=1} x^i_m(s)$, $\sigma^{y=1} = \sqrt{\frac{1}{S}\sum_{s \mid y(s)=1}(x^i_m(s) - \mu^{y=1})^2}$, $\lambda > 0$ denotes the learning rate of INBF, and $S = |\{s \mid y(s) = 1\}|$. The parameters $\mu^{y=0}_m$ and $\sigma^{y=0}_m$ have similar update rules. The above solutions are easily obtained by maximum likelihood estimation. In this way, we can use INBF to keep the representative frames for the later learning phase and discard irrelevant frames to decrease the influence of noise. The process of INBF is summarized in Algorithm 7.
Algorithm 7 Incremental Naive Bayes Keyframe Selection
Input: 10 randomly selected action sequences from each category; the total number of actions in each category $N_c$.
Output: The selected keyframes for action sequences.
1: Manually select 5 representative frames from each sequence of the target category as the positive samples and label them as $y = 1$, otherwise $y = 0$;
2: for $m = 1, \cdots, N_c$ do
3: Calculate $\mu^{y=1}_m$, $\sigma^{y=1}_m$, $\mu^{y=0}_m$ and $\sigma^{y=0}_m$;
4: Update $\mu^{y=1}_{m+1} = \lambda \mu^{y=1}_m + (1-\lambda)\mu^{y=1}$;
5: Update $\sigma^{y=1}_{m+1} = \sqrt{\lambda(\sigma^{y=1}_m)^2 + (1-\lambda)(\sigma^{y=1})^2 + \lambda(1-\lambda)(\mu^{y=1}_m - \mu^{y=1})^2}$;
6: Update $\mu^{y=0}_m$ and $\sigma^{y=0}_m$ by using similar rules;
7: end for
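The update rules of Eq. (5.26), which Algorithm 7 applies once per incoming sequence, can be sketched for a single feature and a single class as follows. This is a minimal illustration rather than the original implementation: the names `batch_stats`, `inbf_update` and `lam` are hypothetical, and the toy values are assumed.

```python
import math

def batch_stats(values):
    """Mean and population standard deviation of the newly labeled samples."""
    S = len(values)
    mu = sum(values) / S
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / S)
    return mu, sigma

def inbf_update(mu_old, sigma_old, values, lam):
    """One incremental update of (mu_m, sigma_m) following Eq. (5.26)."""
    mu_new, sigma_new = batch_stats(values)
    mu = lam * mu_old + (1.0 - lam) * mu_new
    var = (lam * sigma_old ** 2
           + (1.0 - lam) * sigma_new ** 2
           + lam * (1.0 - lam) * (mu_old - mu_new) ** 2)
    return mu, math.sqrt(var)

# Toy update (assumed values): old model N(0, 2), new samples with mean 2 and std 1.
mu, sigma = inbf_update(0.0, 2.0, [1.0, 3.0], lam=0.5)
```

With $\lambda = 1$ the old parameters are kept unchanged and with $\lambda = 0$ they are replaced by the statistics of the new samples. For $0 < \lambda < 1$ the result matches the mean and standard deviation of a two-component mixture with weights $\lambda$ and $1 - \lambda$, which is where the cross term $\lambda(1-\lambda)(\mu^{y=1}_m - \mu^{y=1})^2$ comes from.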
Fig. 5.2 Illustration of the similarity matrix construction. (Panel labels: the procedure of DTW between the A frames of video q and the B frames of video p; Gaussian kernel; similarity matrix.)
RBF Sequential Kernel Construction
For the i-th view, since we extract features from the frames of video sequences, each video sequence can be described by a set of features with a sequential order (along the temporal axis). The similarity between video $v_p$ and video $v_q$ under view $i$, denoted $k_i(v_p, v_q)$, can be measured via Dynamic Time Warping (DTW) [9]. Therefore, the kernel function can be defined as
$$k_i(v_p, v_q) = \exp\left(-\frac{\mathrm{DTW}(X^i_p, X^i_q)^2}{2\sigma^2}\right),$$
where $\mathrm{DTW}(X^i_p, X^i_q)$ indicates the sequential distance computed via DTW and $\sigma$ is the standard deviation in the RBF kernel. In this way, we can easily obtain the kernel matrices for different views using the above equation.
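The kernel above can be sketched in a few lines of pure Python. The sketch assumes the textbook dynamic-programming form of DTW with a Euclidean frame-to-frame cost and no warping-window constraint; the text does not specify which DTW variant is used, and the function names are hypothetical.

```python
import math

def frame_distance(fa, fb):
    """Euclidean distance between two frame feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(fa, fb)))

def dtw_distance(seq_a, seq_b):
    """Classic O(A*B) dynamic time warping between two frame sequences."""
    A, B = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (B + 1) for _ in range(A + 1)]
    cost[0][0] = 0.0
    for a in range(1, A + 1):
        for b in range(1, B + 1):
            d = frame_distance(seq_a[a - 1], seq_b[b - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[a][b] = d + min(cost[a - 1][b], cost[a][b - 1], cost[a - 1][b - 1])
    return cost[A][B]

def rbf_dtw_kernel(seq_a, seq_b, sigma):
    """k(v_p, v_q) = exp(-DTW(X_p, X_q)^2 / (2 sigma^2))."""
    return math.exp(-dtw_distance(seq_a, seq_b) ** 2 / (2.0 * sigma ** 2))

# Toy sequences (assumed): the second repeats the first frame, so DTW aligns them at zero cost.
video_p = [[0.0, 0.0], [1.0, 0.0]]
video_q = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
k_same = rbf_dtw_kernel(video_p, video_q, sigma=1.0)
k_far = rbf_dtw_kernel([[3.0, 4.0]], [[0.0, 0.0]], sigma=5.0)
```

Because DTW absorbs differences in length and speed, two videos with different numbers of frames still yield a single scalar kernel value, which is what allows all views to share $N \times N$ kernel matrices.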
Similarity Calculation
Based on the above kernel construction, we can obtain kernel matrices $K_1, \cdots, K_M \in \mathbb{R}^{N \times N}$ of the same size for the M views with different dimensions. Furthermore, we use the labels of the training video sequences to supervise the calculation of the similarity matrix $W_i$ for the i-th view. Each component of $W_i$ is then computed as follows:
$$(W_i)_{pq} = \begin{cases} \exp\left(-\dfrac{\mathrm{DTW}(X^i_p, X^i_q)^2}{2\sigma^2}\right), & C(p) = C(q) \\ 0, & \text{otherwise} \end{cases}, \tag{5.27}$$
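where $C(p)$ is the label function which indicates the label of video $v_p$, and $p, q = 1, \cdots, N$. That is, $W_i$ keeps the entries of the kernel matrix $K_i$ for same-label pairs and sets the rest to zero, as illustrated in Fig. 5.2. Then we have the diagonal matrix $D_i$ in which $(D_i)_{pp} = \sum_q (W_i)_{pq}$ and the Laplacian matrix $L_i = D_i - W_i$ for each view $i$.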
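The label-supervised masking of Eq. (5.27) amounts to zeroing the kernel entries of pairs with different labels. The following sketch shows this on a small precomputed kernel matrix; the function name, the nested-list representation and the toy values are assumptions made for illustration.

```python
def supervised_similarity(K, labels):
    """Keep K[p][q] when videos p and q share a label, otherwise set the entry to 0."""
    n = len(labels)
    return [[K[p][q] if labels[p] == labels[q] else 0.0 for q in range(n)]
            for p in range(n)]

# Toy kernel matrix for N = 3 videos (assumed values) and their class labels.
K = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
labels = [0, 0, 1]
W = supervised_similarity(K, labels)
degrees = [sum(row) for row in W]   # diagonal entries (D_i)_pp
```

Note that, read literally, Eq. (5.27) leaves the diagonal of $W_i$ equal to 1, since every video shares a label with itself and its DTW distance to itself is zero. These entries cancel in $L_i = D_i - W_i$, so they do not affect the Laplacian.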