Figure 5.1 illustrates the architecture of the feed-forward neural network used in the simulations. The input units are shown on the left and activation propagates from left to right to the output. The size of the input layer is the size of the vocabulary (ca. 120000) and the size of the representation layer is the size of the word embeddings (300). We substitute the input→representation matrix $W_{ri}$ with the embeddings matrix $M$ obtained in §3.4 and keep it ‘frozen’ during training (i.e., its weights are not updated). Using this scheme, we present each word to the network as a one-hot vector (e.g., $\langle 1\ 0\ 0\ 0\ 0 \ldots 0 \rangle^\top \in \mathbb{R}^{|V|}$, where the 1 is at the index of the target word in $V$), whose dot product with the input→representation matrix yields the neural embedding in the representation layer. Since the representation layer has linear activation, no transformation is applied to the embeddings. The entire set of units for the hidden and output layers is shown in the figure. Since all the behavioural experiments reviewed use a four-class system (either determiners or verbs), the size of the output layer remained the same throughout all the simulations. For every simulation, the network is trained to turn on the units in the output layer that correspond to the categories (either determiners or verbs) in the behavioural experiment.
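The lookup described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the simulations: the six-word vocabulary, the 3-dimensional embedding values and the helper names `one_hot` and `embed` are all assumptions made for the example (the text uses a vocabulary of ca. 120000 words and 300-dimensional embeddings).

```python
# Toy vocabulary and embeddings (assumed values, one row per word).
VOCAB = ["box", "dog", "cat", "duck", "bear", "snail"]
EMBEDDINGS = [
    [0.1, 0.2, 0.3],
    [0.4, -0.5, 0.6],
    [0.7, 0.8, -0.9],
    [-0.1, 0.0, 0.2],
    [0.3, 0.3, 0.3],
    [-0.6, 0.5, 0.4],
]

def one_hot(word, vocab):
    """Return a vector with a 1 at the index of `word` and 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def embed(word, vocab, embeddings):
    """Dot product of the one-hot vector with the frozen embeddings matrix."""
    x = one_hot(word, vocab)
    dim = len(embeddings[0])
    return [sum(x[i] * embeddings[i][d] for i in range(len(vocab)))
            for d in range(dim)]

# The product simply selects the row of the matrix that belongs to the word.
cat_vector = embed("cat", VOCAB, EMBEDDINGS)
```

Because the one-hot vector has a single non-zero entry, the product reduces to a row lookup, which is why the representation layer needs no further transformation.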
Given this architecture, the overall objective of the model is to learn to associate word representations, as extracted from large linguistic corpora, with novel determiner classes as in the behavioural experiments. For example, a sentence in the Williams (2005) dataset which read ‘The fire brigade had to rescue ul cat from the top of the tree.’ would become the input–output pair cat–ul, as depicted in the figure. Since the network has access to the entire vocabulary, which we activate given an external probe, we can think of this process as activating a long-term memory trace of this word (e.g., Kintsch & Mangalath, 2011), which is subsequently paired with the novel element. Our main goal in the simulations is to see how the network responds to unseen words once it has learned to associate the pairs in the training sets.
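[Figure 5.1: network diagram with four layers, from left to right: Input (example units: box, dog, cat, duck, bear, snail), Representation, Hidden and Output (units: gi, ro, ul, ne).]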
Figure 5.1 Depiction of the connectionist model of classification used in the simulations.
Input units are shown on the left and activation propagates from left to right. For illustration purposes we only show a subset of the units used in the input and the representation layers. The size of the input layer is the size of the vocabulary (ca. 120000) and the size of the representation layer is the size of the word embeddings (300). The entire set of units for the hidden and output layers is shown in the figure. Each unit in the input layer corresponds to a word in the corpus and its activation in the representation layer corresponds to the neural embedding as described above. For every simulation, the network is trained to turn on the units in the output layer, which correspond to the classes (either determiners or verbs) in the behavioural experiment.
The function that the network in Fig. 5.1 ends up computing, given a semantic representation $x$ and parameters $\theta$, is

$$f(x, \theta) = S\left(W_{oh} \cdot \sigma\left(W_{hr} W_{ri} x + b_h\right) + b_o\right) \tag{5.7}$$

where $\sigma$ is a nonlinear function (for the simulations reported here we use the hyperbolic tangent), $S$ is the softmax function, and $W_{hr}, W_{oh}, b_h, b_o \subset \theta$ are the learnable parameters: the weight matrices and the bias vectors of the hidden and output layers ($W_{ri}$ is also a subset of $\theta$ but is not a trainable matrix).
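To make Eq. 5.7 concrete, the following is a minimal sketch of the forward pass in plain Python. The layer sizes and all weight values are toy assumptions chosen for readability, and the helper names `matvec`, `softmax` and `forward` are illustrative only.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Softmax S; subtracting max(z) only improves numerical stability."""
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W_ri, W_hr, b_h, W_oh, b_o):
    """Compute S(W_oh . tanh(W_hr W_ri x + b_h) + b_o) for a one-hot input x."""
    r = matvec(W_ri, x)  # linear representation layer: the frozen embedding
    h = [math.tanh(a + b) for a, b in zip(matvec(W_hr, r), b_h)]
    return softmax([a + b for a, b in zip(matvec(W_oh, h), b_o)])

# Toy sizes (assumed): |V| = 3, embedding size 2, 2 hidden units, 4 classes.
W_ri = [[0.5, -1.0, 0.2], [0.1, 0.3, -0.7]]  # each column is a word embedding
W_hr = [[0.4, -0.6], [0.9, 0.1]]
b_h = [0.0, 0.1]
W_oh = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8], [0.0, 0.2]]
b_o = [0.0, 0.0, 0.0, 0.0]

probs = forward([0.0, 1.0, 0.0], W_ri, W_hr, b_h, W_oh, b_o)
```

The returned list holds one probability per output class, so its entries are positive and sum to 1.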
Initially, the connections between the units (i.e., the matrices $W_{hr}$ and $W_{oh}$) have small random values so that no category is preferred a priori by the network. The initialisation of those weights is an integral part of the learning procedure, as the network may be unable to learn the patterns if its weights are improperly initialised. We follow the initialisation procedure proposed by Glorot & Bengio (2010), which takes into account the size of each layer in the network (more details in §B.1) and has been shown to give better results in multilayer networks. The network learns to perform the task by finding a configuration of weights such that, given a semantic representation in the input layer, it activates the node for the correct class in the output layer while inhibiting the activation of the incorrect classes.
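As an illustration of how layer sizes enter the initialisation, here is a sketch of the uniform ("normalised") scheme from Glorot & Bengio (2010), which draws each weight from $U[-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})}]$. Whether this exact variant matches the details in §B.1 is an assumption; the function name, the seed and the hidden-layer size of 5 are illustrative.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Return a fan_out x fan_in matrix drawn from U[-limit, limit]."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)               # fixed seed, only for reproducibility
W_hr = glorot_uniform(300, 5, rng)   # representation (300) -> hidden (5, assumed)
W_oh = glorot_uniform(5, 4, rng)     # hidden (5) -> output (4 classes)
```

Larger layers get a smaller range, which keeps the scale of the activations roughly comparable from one layer to the next.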
Finding the appropriate configuration of weights is not a straightforward process, and many algorithms used in the literature have been criticised as biologically implausible. Although deriving biologically plausible learning algorithms is an active area of research (Scellier & Bengio, 2017), we train the network with the commonly used backpropagation algorithm (Rumelhart, Hinton & Williams, 1986a). As noted in §1.3, backpropagation is an iterative process by which the network makes small adjustments to its weights every time it makes an incorrect prediction. The objective is that the next time the same activation pattern appears in the input, the prediction will be closer to the teaching pattern. Effectively, after a number of training cycles (which we call ‘epochs’), in each of which the network sees all the items in the training set in random order, the network will reach a state where, given an activation pattern in the input layer, it activates the correct nodes in the output layer.
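The epoch structure described above can be sketched as follows. This is a toy illustration under stated assumptions, not the training code of the simulations: it uses plain stochastic gradient descent with a cross-entropy error, small uniform initial weights, a made-up learning rate, and four hypothetical 2-dimensional "embeddings" as training items.

```python
import math
import random

def forward(x, params):
    """Return hidden activations and softmax output for an embedding x."""
    W_hr, b_h, W_oh, b_o = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_hr, b_h)]
    z = [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W_oh, b_o)]
    m = max(z)
    e = [math.exp(a - m) for a in z]
    total = sum(e)
    return h, [a / total for a in e]

def mean_loss(data, params):
    """Average cross-entropy between teaching patterns and predictions."""
    return -sum(math.log(forward(x, params)[1][t]) for x, t in data) / len(data)

def train(data, n_hidden, n_classes, epochs, lr=0.1, seed=0):
    """Train with per-item weight updates; items are reshuffled every epoch."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    W_hr = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_hidden)]
    W_oh = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
            for _ in range(n_classes)]
    b_h, b_o = [0.0] * n_hidden, [0.0] * n_classes
    params = (W_hr, b_h, W_oh, b_o)
    items = list(data)
    for _ in range(epochs):
        rng.shuffle(items)  # random presentation order within each epoch
        for x, target in items:
            h, q = forward(x, params)
            # Error signal at the output: prediction minus teaching pattern.
            dz = [q[k] - (1.0 if k == target else 0.0) for k in range(n_classes)]
            # Propagate the error back through the tanh hidden layer.
            dh = [(1.0 - h[j] ** 2) * sum(dz[k] * W_oh[k][j]
                                          for k in range(n_classes))
                  for j in range(n_hidden)]
            for k in range(n_classes):
                for j in range(n_hidden):
                    W_oh[k][j] -= lr * dz[k] * h[j]
                b_o[k] -= lr * dz[k]
            for j in range(n_hidden):
                for i in range(dim):
                    W_hr[j][i] -= lr * dh[j] * x[i]
                b_h[j] -= lr * dh[j]
    return params

# Four toy items, one per class (assumed data).
DATA = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, 0.0], 2), ([0.0, -1.0], 3)]
untrained = train(DATA, n_hidden=5, n_classes=4, epochs=0)
trained = train(DATA, n_hidden=5, n_classes=4, epochs=200)
```

Comparing `mean_loss(DATA, untrained)` with `mean_loss(DATA, trained)` shows the effect of the repeated small adjustments: the predictions move closer to the teaching patterns as the epochs accumulate.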
How we quantify the ‘prediction error’ the network makes is a major factor in the discussion, as it can change not only the results but also our interpretation of them. Intuitively, we want to quantify the difference between what the network predicted for its output and what the output was supposed to be. From a probabilistic perspective, each teaching pattern can be interpreted as a degenerate discrete probability distribution over classes, as the correct alternative always has probability 1 and the rest 0. On the other hand, the normalisation factor in the denominator of the output layer’s softmax function (5.6) ensures that the network’s predictions sum to 1, prompting us to look for a measure of distance between two probability distributions. A commonly used function in information theory which measures this distance is the cross-entropy error. Given a true distribution (the teaching pattern) $p$ and a coding distribution (the network’s prediction) $q$, their distance can be quantified as
$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
where $p$ is the true distribution, $q$ is the network’s prediction and $x$ is a particular example. In other words, the participant during the experiment learns the probability that a certain determiner precedes a noun. For example, given the word ‘monkey’ and the potential labels ‘gi’, ‘ul’, ‘ro’ and ‘ne’, an output layer of $\langle 0.3\ 0.6\ 0.03\ 0.07 \rangle^\top$ would mean that the most probable label for ‘monkey’ is ‘ul’. However, the activation of ‘gi’ is still higher than those of the determiners that co-occur with inanimate nouns, rendering it the preferred choice when the network is forced to choose between ‘gi’ and either ‘ro’ or ‘ne’.
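The ‘monkey’ example can be worked through numerically. The sketch below assumes that ‘ul’ is the teaching pattern for this item and that the logarithm is natural (the text does not state the base); the function names are illustrative.

```python
import math

LABELS = ["gi", "ul", "ro", "ne"]

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log q(x); terms with p(x) = 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def forced_choice(q, labels, options):
    """Pick the option with the highest activation among the allowed labels."""
    return max(options, key=lambda label: q[labels.index(label)])

q = [0.3, 0.6, 0.03, 0.07]   # network output from the example in the text
p = [0.0, 1.0, 0.0, 0.0]     # teaching pattern, assuming 'ul' is correct

loss = cross_entropy(p, q)   # reduces to -log(0.6), roughly 0.51
choice = forced_choice(q, LABELS, ["gi", "ro"])
```

With a one-hot teaching pattern the sum collapses to the negative log-probability that the network assigns to the correct class, and the forced choice between ‘gi’ and ‘ro’ comes out as ‘gi’ because 0.3 exceeds 0.03.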
There are three interrelated issues with our training procedure, all of which stem from the limited amount of training data and the high dimensionality of the input. Firstly, the number of free parameters in a neural network is quite large. More specifically, the number of parameters in a neural network with one hidden layer is $D \times H + H \times O + H + O$,⁴ where $D$ is the dimensionality of the input vector, $H$ the size of the hidden layer and $O$ the size of the output layer. As an example, in a neural network where the size of the input vector is 300, the size of the hidden layer is 5 and that of the output layer is 4, the total number of parameters is $300 \times 5 + 5 \times 4 + 5 + 4 = 1529$.⁵ We counter this issue by applying a penalty to the model parameters during learning. Concretely, we add a term to our cost function that prefers smaller weights (commonly called weight decay), $\frac{\lambda}{2n}\sum_w w^2$, where $\lambda$ controls how strongly large weights are penalised. This way, large weights, which would place undue importance on spurious elements of the input, are penalised in the solution learnt by the model. We experimented with various values for $\lambda$ and found that, empirically, a value of 0.1 (i.e., favouring minimisation of the original cost over small weights) worked best in terms of minimising the error on the test set.
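Both the parameter count and the penalty term are easy to check with a short sketch. Two points are assumptions rather than statements from the text: $n$ is taken to be the number of training items, and the penalty is applied to the weight matrices only, not to the biases. The function names are illustrative.

```python
def n_parameters(layer_sizes):
    """Weights plus biases of a fully-connected feed-forward network."""
    return sum(prev * cur + cur
               for prev, cur in zip(layer_sizes, layer_sizes[1:]))

def l2_penalty(weight_matrices, lam, n):
    """Weight-decay term lam / (2n) * sum of squared weights."""
    squared = sum(w * w for W in weight_matrices for row in W for w in row)
    return lam / (2 * n) * squared

total = n_parameters([300, 5, 4])   # 300*5 + 5*4 + 5 + 4

# Toy weights (assumed values) to show how the penalty grows with magnitude.
small = l2_penalty([[[0.1, -0.1], [0.2, 0.0]]], lam=0.1, n=10)
large = l2_penalty([[[1.0, -1.0], [2.0, 0.0]]], lam=0.1, n=10)
```

The count for the 300, 5, 4 network reproduces the 1529 parameters given above, and multiplying every weight by 10 multiplies the penalty by 100, which is what pushes the optimiser towards smaller weights.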
Secondly, the neural embeddings may contain a lot of noise, which prevents the network from discovering interesting regions in the input. To counter this problem, we apply dropout (Hinton, Srivastava, Krizhevsky, Sutskever & Salakhutdinov, 2012) to weights from the input layer. Dropout is a simple technique by which, during training, some nodes of the matrix are randomly ‘turned off’ (i.e., set to 0). This technique has been widely used in machine learning to keep a model from overfitting by fitting noisy regions of the dataset. Moreover, zeroing elements of the feature matrix has been used extensively in computational modelling of memory processes (Hintzman, 1986) to denote imperfect recall. In other words, an interpretation of this procedure would be that, during the training phase, the participants do not perfectly retrieve the distributional representation from their semantic memory.
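A minimal sketch of the zeroing step is given below. The dropout rate is not stated in this section, so the value used here is an assumption, as are the function name and the choice not to rescale the surviving elements.

```python
import random

def dropout(vector, rate, rng):
    """Set each element to 0 with probability `rate` (training time only)."""
    return [0.0 if rng.random() < rate else v for v in vector]

rng = random.Random(0)                    # fixed seed for reproducibility
embedding = [0.7, 0.8, -0.9, 0.2, -0.4]   # toy embedding (assumed values)
noisy = dropout(embedding, rate=0.5, rng=rng)
```

Each training presentation of a word thus yields a slightly different, partially zeroed version of its embedding, which matches the imperfect-recall interpretation; at test time the full embedding would be used.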
Thirdly, because the number of datapoints is quite small, the optimisation algorithm is more prone to local minima. In other words, the network settles on a solution that does not minimise the error but cannot move away from it because there is no other solution in the immediate region with a lower cost. Although the solutions to the first two problems help with this issue too, we opt for re-running each simulation 30 times and averaging the results. We can then consider each run as an independent learner who might get stuck in a local solution; by averaging their performance on the test set we effectively perform a by-subjects analysis.

⁴ More generally, in any fully-connected feedforward neural network the number of parameters is $\sum_{i=2}^{L} (L_{i-1} L_i + L_i)$, where $L$ is the number of layers in the network and $L_i$ the size of the $i$-th layer.
⁵ Since we do not continue training the semantic representations, this count does not include the first weight matrix (input→representation) shown in Fig. 5.1.