Planning requirements - REQUIREMENTS FOR RPAS ACCORDING TO ATM

Chapter 3 AIRSPACE MANAGEMENT (ATM)

3.3 REQUIREMENTS FOR RPAS ACCORDING TO ATM

3.3.1 Planning requirements

Sensitive Matrix Model (COSMo), is adapted from the skip-gram learning method in vector space (Mikolov et al., 2013b), explained in Chapter 2.1.3. In skip-gram, word vector embeddings are trained using a feedforward two-layer neural network (NN). COSMo likewise trains its word matrix embeddings using a feedforward two-layer NN. As explained in Section 5.2.1, a corpus consisting of a large set of sentences serves as the training dataset for the word matrices. A vocabulary $\Sigma$ of size $V$ is created by pre-processing the corpus and extracting all words. Recall that each word $w$ is assigned a word matrix $M_w$ and a context matrix $C_w$. As in the original skip-gram method, the task is to train the NN to predict the context words of a given input word based on the available corpus of sentences.

Learning algorithm of PMI-based CMSM

input: $\{(b_1, y_1), \ldots, (b_i, y_i), \ldots, (b_n, y_n)\}$, with $b_i = w_1 w_2$, $w_j \in \Sigma$, $y_i \in [-1, 1]$

output: $\{M_w\}_{w \in \Sigma}$

1. Model parameters to train:

word matrix: $M_w \in \mathbb{R}^{d \times d}$, $w \in \Sigma$

context matrix: $C_w \in \mathbb{R}^{d \times d}$, $w \in \Sigma$

mapping vector: $\beta \in \mathbb{R}^{d^2}$

2. Objective: minimize the objective function

$$E = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Figure 5.2: Supervised learning procedure of PMI-based CMSM. $\{(b_1, y_1), \ldots, (b_n, y_n)\}$ is a training set of size $n$. $\Sigma$ is the vocabulary extracted from the corpus.

The input is a sentence $s = w_1 w_2 \ldots w_{i-1} w_i w_{i+1} \ldots w_c$ from which a predefined number $n$ of words, called center words, are randomly selected to serve as the input to the network. A set of context words with a window size of $l$ is also extracted from the same input sentence for each center word. Note that this learning method considers two versions of training with respect to the position of the context words:

1. Asymmetric training: From a given center word $w_i$, a specified number of context words is selected to the right. Given an input sentence $s$ and a context window size of $l$, the context words of a center word $w_i$ are $\{w_{i+1}, \ldots, w_{i+l}\}$;

2. Symmetric training: From a given center word $w_i$, a specified number of context words is selected to both sides. Therefore, the context words of a center word $w_i$ in $s$ with a window size of $l$ are $\{w_{i-l}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+l}\}$.
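The two window variants can be illustrated with a minimal sketch. The function name `context_positions`, the 0-based indexing, and the use of `None` for positions that fall outside the sentence are assumptions made for illustration only; they are not taken from the original implementation.

```python
def context_positions(i, sentence_len, l, symmetric=True):
    """Return the context positions of the center word at 0-based index i.

    Positions outside the sentence are returned as None so that a caller
    can decide how to treat them (hypothetical convention).
    """
    # Right-hand context: w_{i+1}, ..., w_{i+l}
    right = [i + offset for offset in range(1, l + 1)]
    # Left-hand context (symmetric training only): w_{i-l}, ..., w_{i-1}
    left = [i - offset for offset in range(l, 0, -1)] if symmetric else []
    return [p if 0 <= p < sentence_len else None for p in left + right]


sentence = ["the", "cat", "sat", "on", "mat"]
print(context_positions(2, len(sentence), 2, symmetric=True))   # [0, 1, 3, 4]
print(context_positions(2, len(sentence), 2, symmetric=False))  # [3, 4]
print(context_positions(0, len(sentence), 1, symmetric=True))   # [None, 1]
```

The left-hand positions are listed in sentence order so that the result can be used directly for an order-sensitive composition.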

During training, a matrix $Q^{i}_{pos} \in \mathbb{R}^{d \times d}$ representing the input sequence for each given center word $w_i$ and its context words is computed by matrix multiplication. For a symmetric context window of size $l$, $Q^{i}_{pos}$ is calculated as follows, where the composition operation is matrix multiplication:

$$Q^{i}_{pos} = C_{w_{i-l}} \cdots C_{w_{i-1}} \, M_{w_i} \, C_{w_{i+1}} \cdots C_{w_{i+l}}, \tag{5.4}$$

and, in the case of asymmetric training, $Q^{i}_{pos}$ is computed as:

$$Q^{i}_{pos} = M_{w_i} \, C_{w_{i+1}} \cdots C_{w_{i+l}},$$

where $M_w$ and $C_w$ denote the word and context matrices of a word $w$, respectively. Note that if the randomly chosen center word happens to be the first or last word of the input sentence, so that there is no preceding or following context word, each “missing” position is filled with the global Out-Of-Sentence (OOS) matrix $C_{oos}$, which is trained alongside the other word matrices.
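As a rough sketch of how the positive composition could be computed, the following NumPy example multiplies the context and word matrices in sentence order and substitutes an OOS matrix for missing positions. The toy vocabulary, the random initialisation, the dimension `d = 3`, and the helper name `q_pos` are all illustrative assumptions.

```python
from functools import reduce

import numpy as np

d = 3
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat"]
M = {w: rng.normal(size=(d, d)) for w in vocab}  # word matrices M_w
C = {w: rng.normal(size=(d, d)) for w in vocab}  # context matrices C_w
C_oos = rng.normal(size=(d, d))                  # out-of-sentence matrix


def q_pos(sentence, i, l, symmetric=True):
    """Compose the center word at 0-based index i with its context words."""
    def ctx(p):
        # Positions outside the sentence are filled with the OOS matrix.
        return C[sentence[p]] if 0 <= p < len(sentence) else C_oos

    left = [ctx(p) for p in range(i - l, i)] if symmetric else []
    right = [ctx(p) for p in range(i + 1, i + l + 1)]
    # Multiply left to right so that word order is preserved.
    return reduce(np.matmul, left + [M[sentence[i]]] + right)


s = ["the", "cat", "sat"]
Q_sym = q_pos(s, 1, 1, symmetric=True)    # C_the @ M_cat @ C_sat
Q_asym = q_pos(s, 1, 1, symmetric=False)  # M_cat @ C_sat
Q_edge = q_pos(s, 0, 1, symmetric=True)   # C_oos @ M_the @ C_cat
```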

In line with the original skip-gram with negative sampling (Mikolov et al., 2013b), we choose $k$ negative samples for each center word in the input sentence, drawn from a unigram probability distribution with a smoothing parameter $\alpha$:

$$q(w_j) = \frac{f(w_j)^{\alpha}}{\sum_{z=1}^{V} f(w_z)^{\alpha}},$$

where $f(w)$ represents the frequency of a word $w$ in the given corpus, and $V$ is the size of the vocabulary. Frequencies are raised to the power of $\alpha$ ($0 < \alpha \le 1$) to increase the selection probability of less frequent words and decrease that of more frequent words. $q(w_j)$ is the probability that the word $w_j$ is selected as a negative sample.
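The effect of the smoothing parameter can be checked on a toy corpus. In the sketch below, the function name `negative_sampling_distribution`, the toy token list, and the value $\alpha = 0.75$ are illustrative assumptions; the text above only requires $0 < \alpha \le 1$.

```python
import random
from collections import Counter


def negative_sampling_distribution(corpus_tokens, alpha=0.75):
    """Return q(w) = f(w)^alpha / sum_z f(w_z)^alpha for every word w."""
    freq = Counter(corpus_tokens)
    total = sum(f ** alpha for f in freq.values())
    return {w: (f ** alpha) / total for w, f in freq.items()}


tokens = ["a"] * 8 + ["b"] + ["c"]
q = negative_sampling_distribution(tokens, alpha=0.75)    # smoothed
raw = negative_sampling_distribution(tokens, alpha=1.0)   # plain unigram

# Smoothing lowers the probability of the frequent word "a"
# and raises that of the rare words "b" and "c".
print(q["a"] < raw["a"], q["b"] > raw["b"])  # True True

# Draw k = 5 negative samples according to q.
words = list(q)
negatives = random.Random(0).choices(words, weights=[q[w] for w in words], k=5)
```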

Since the multiplication of the center word matrix $M_{w_i}$ with the negative sample matrices $C_{w_j}$ is order-sensitive, we consider the following two variations of negative sampling in our method:

1. Asymmetric context of negative samples, similar to skip-gram. In this case, the matrix of each center word $w_i$ is multiplied with $k$ negative samples as follows:

$$Q^{i}_{neg} = \sum_{j=1}^{k} M_{w_i} C_{w_j}; \tag{5.5}$$

2. Symmetric context of negative samples. In other words, negative samples are evenly distributed to the left and to the right of the center word in the matrix multiplication process. In this case, $Q^{i}_{neg}$ is computed as follows:

$$Q^{i}_{neg} = \sum_{j=1}^{k} M_{w_i} C_{w_j} + \sum_{j=1}^{k} C_{w_j} M_{w_i}.$$
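Both variants of the negative composition can be sketched in a few lines of NumPy. The function name `q_neg` and the random test matrices are assumptions for illustration; the sums follow the two formulas above as written.

```python
import numpy as np


def q_neg(M_center, C_negs, symmetric=False):
    """Compose a center word matrix with k negative-sample context matrices."""
    # Negative samples placed to the right of the center word.
    right = sum(M_center @ C_j for C_j in C_negs)
    if not symmetric:
        return right
    # Symmetric variant: the same samples also placed to the left.
    left = sum(C_j @ M_center for C_j in C_negs)
    return right + left


rng = np.random.default_rng(1)
d, k = 3, 4
M_center = rng.normal(size=(d, d))
C_negs = [rng.normal(size=(d, d)) for _ in range(k)]

Q_asym = q_neg(M_center, C_negs, symmetric=False)
Q_sym = q_neg(M_center, C_negs, symmetric=True)
```

Because matrix multiplication distributes over addition, the asymmetric variant equals the center matrix multiplied by the sum of the negative-sample matrices, which can reduce the number of matrix products from $k$ to one.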

The objective function (also called the loss function) to maximize sums over both the positive and negative samples of the center words and is computed as follows:

$$L = \sum_{i=1}^{m} \left[ \log \sigma\left(RS(Q^{i}_{pos})\right) + \log \sigma\left(-RS(Q^{i}_{neg})\right) \right],$$

where $Q^{i}_{pos}$ and $Q^{i}_{neg}$ represent the composition of a center word with the specified correct context words and with the $k$ negative samples, respectively. Since the composition operation results in matrices, $RS$ denotes an operation that sums all elements of the matrices $Q^{i}_{pos}$ and $Q^{i}_{neg}$; in other words, for any matrix $X \in \mathbb{R}^{d \times d}$:

$$RS(X) = \sum_{i', j'} X(i', j')$$

for $1 \le i', j' \le d$. The $\log(\sigma(\cdot))$ function is then applied to the resulting values before the two terms are added. $\sigma$ is the sigmoid function, and $m$ is the total number of center words in the training batch at each training iteration. It is computed by multiplying the batch size (the number of input sentences used for training at a time) by $n$, the number of randomly drawn center words from each sentence in the batch. Note that the objective is to minimize the value of the composition of the center word with the negative samples; therefore, the negative of $RS(Q^{i}_{neg})$ is used in the equation. The composition of the center word with the context words is to be maximized; therefore, the positive of $RS(Q^{i}_{pos})$ is used in the equation. Similar objective functions can be found in the literature, such as the work by Kaji and Kobayashi (2017).
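The objective can be evaluated numerically as in the following sketch. The helper names `rs`, `log_sigmoid`, and `objective`, as well as the small constant matrices, are illustrative assumptions; the log-sigmoid is computed via `np.logaddexp` for numerical stability, which is a design choice of this sketch rather than something stated in the text.

```python
import numpy as np


def rs(X):
    """RS(X): sum of all elements of the matrix X."""
    return float(np.sum(X))


def log_sigmoid(x):
    """log sigma(x) = -log(1 + exp(-x)), computed in a stable way."""
    return -np.logaddexp(0.0, -x)


def objective(Q_pos_list, Q_neg_list):
    """Sum of log sigma(RS(Q_pos)) + log sigma(-RS(Q_neg)) over center words."""
    return sum(
        log_sigmoid(rs(Q_p)) + log_sigmoid(-rs(Q_n))
        for Q_p, Q_n in zip(Q_pos_list, Q_neg_list)
    )


# One center word (m = 1) with toy 2x2 compositions.
Q_pos = [np.full((2, 2), 0.5)]    # RS = 2.0
Q_neg = [np.full((2, 2), -0.5)]   # RS = -2.0
L = objective(Q_pos, Q_neg)       # equals 2 * log sigma(2)
loss = -L                         # quantity minimized during training
```

The objective is never positive, and it approaches zero as $RS(Q^{i}_{pos})$ grows and $RS(Q^{i}_{neg})$ becomes more negative.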

Finally, the negative of the objective function $L$ is minimized in the training process.

Model parameters for training are the word and context matrices, $M_w$ and $C_w$ for all $w \in \Sigma$. As the optimizer, we experimentally determined gradient descent to be the most efficient choice. Word and context matrices are updated at each training iteration. A batch at each training iteration is defined as a set of input sentences from the corpus, and an epoch is defined as a complete pass over all sentences in the corpus. Training of the matrices therefore stops after $T$ epochs (i.e., after iterating $T$ times over the corpus). Finally, the word matrices $M_w$ for all $w \in \Sigma$ are extracted and used as the matrix embeddings, that is, the word representation model. Fig. 5.3 illustrates the learning procedure.

Input corpus: $\{s_1, \ldots, s_t, \ldots, s_N\}$, with $s_t = w_1 \cdots w_i \cdots w_{c_t}$, $w_i \in \Sigma$

input sentence: $s_t$

random center word: $w_i$

context words with $l = 1$: $\{w_{i-1}, w_{i+1}\}$

$k$ negative samples: $\{w_{j_1}, \ldots, w_{j_k}\}$

1. Compute objective (loss) function $L$

2. Update word and context matrices, $M_w$ and $C_w$, using gradient descent

Repeat until all sentences have been seen. Repeat for $T$ epochs over the corpus.

output: $\{M_w\}_{w \in \Sigma}$

Figure 5.3: Learning procedure for COSMo. Each epoch of training is a pass over all sentences in the corpus. After $T$ epochs, $\{M_w\}_{w \in \Sigma}$ are extracted as the learned word representations.

We can denote all word and context matrices by $V \times d \times d$-dimensional tensors $T$ and $T'$, respectively; that is, each slice of a tensor is a $d \times d$-dimensional matrix corresponding to a word in the vocabulary. Fig. 5.4 shows our feedforward two-layer NN, which, as opposed to skip-gram, consists of weight tensors instead of weight matrices. In this figure, we assume that the input to the training iteration is a center word $w_i$ and that $w_{i+1}$ is the context word to predict; both are extracted from an input sentence $s$. The connections in the input-to-hidden and hidden-to-output layers show the word and context matrices of $w_i$ and $w_{i+1}$, respectively, which are extracted from the i-th slice of the corresponding tensors $T$ and $T'$.
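The role of the one-hot input can be made concrete with a short NumPy sketch: contracting a one-hot vector with a $V \times d \times d$ tensor selects exactly one $d \times d$ slice, which then forms the hidden layer. The sizes `V = 5` and `d = 3`, the random tensor, and the 0-based index are illustrative assumptions.

```python
import numpy as np

V, d = 5, 3
rng = np.random.default_rng(2)
T = rng.normal(size=(V, d, d))   # word matrices stacked as a tensor

i = 2                            # 0-based vocabulary index of the center word
one_hot = np.zeros(V)
one_hot[i] = 1.0

# Contract the vocabulary axis: sum_v one_hot[v] * T[v, :, :]
hidden = np.einsum("v,vjk->jk", one_hot, T)

# The hidden layer is the d x d word matrix of the center word.
print(np.allclose(hidden, T[i]))  # True
```

In practice the contraction is usually replaced by direct indexing (`T[i]`), which avoids multiplying by the zeros of the one-hot vector.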

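[Figure 5.4 diagram: input layer with a one-hot encoding over $w_1, \ldots, w_i, \ldots, w_V$ (1 at $w_i$); hidden layer with units $h_{1,1}, \ldots, h_{d,d}$; output layer with a one-hot encoding over $w_1, \ldots, w_{i+1}, \ldots, w_V$ (1 at $w_{i+1}$); weight tensors $T_{V \times d \times d}$ and $T'^{\top}_{V \times d \times d}$.]

Figure 5.4: COSMo neural network architecture for an input word $w_i$ to predict an output context word $w_{i+1}$. $T$ and $T'$ are tensors representing the word and context matrices. The hidden layer is a matrix of size $d \times d$. $V$ is the size of the vocabulary. The input and output layers are one-hot encodings of the corresponding words.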

5.2.4 COSMo versus skip-gram

In skip-gram, the composition of center and context words for training is based on element-wise vector multiplication, which is not sensitive to word order in the sentences. In other words, skip-gram discards the position of the context word relative to the center word during the training of vector embeddings. In a target space $S$ of matrices in $\mathbb{R}^{d \times d}$, the composition function in CMSMs requires ordered sequences. It ensures that the composition of two matrices $M_i \in \mathbb{R}^{d \times d}$ and $M_j \in \mathbb{R}^{d \times d}$ in the order $M_i M_j$ generally differs from its reverse order $M_j M_i$. COSMo introduces order-sensitivity during training by using matrix multiplication when composing the center and context word matrices, as shown in Equation 5.4. The same argument holds for composing center words with negative samples. Therefore, symmetric and asymmetric learning methods are introduced in COSMo.
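The contrast between the two composition operations can be verified on a tiny example. The specific matrices and vectors below are arbitrary illustrative values, not learned embeddings.

```python
import numpy as np

# Two 2x2 matrices that do not commute under matrix multiplication.
M_i = np.array([[1.0, 1.0],
                [0.0, 1.0]])
M_j = np.array([[1.0, 0.0],
                [1.0, 1.0]])

print(M_i @ M_j)  # [[2. 1.] [1. 1.]]
print(M_j @ M_i)  # [[1. 1.] [1. 2.]]
matrix_order_matters = not np.array_equal(M_i @ M_j, M_j @ M_i)

# Element-wise vector multiplication gives the same result in either order.
v_i = np.array([1.0, 2.0])
v_j = np.array([3.0, 4.0])
vector_order_matters = not np.array_equal(v_i * v_j, v_j * v_i)

print(matrix_order_matters, vector_order_matters)  # True False
```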

In document Bases y procesos de certificación de sistemas aéreos no tripulados (pages 78-86)