
The first step in most NLP models (supervised or unsupervised, classification or tagging, type-based or token-based) is to represent words with numerical features. Distributional representation is one of the pioneering ideas, based on Firth (1957). In models based on this idea, words are represented as vectors in a high-dimensional semantic space, where each dimension corresponds to a (context) word and the values are based on statistical analysis of the co-occurrences of target words with context words. These models have the following parameters.

Context type: The context of a token can simply be the neighbouring words of its token-level occurrences. However, to enrich the contextual information and reduce the effect of polysemy, we can also consider other information from co-occurring words, such as part-of-speech tags or dependency relations. We can also decide whether to ignore some context words, such as stopwords (for example determiners), which are frequent and carry little or no semantic content.

Context window size: The number of neighbouring words around a target can also be tuned. The context scope can be a sentence, a paragraph or even a document. It can also be based on a context window of a specific number of words on the left side, on the right side or on both sides of the target word.

Context vector values: The values of a distributional vector represent the degree of association between the target and context words. This association can be measured using raw co-occurrence frequency, a binary co-occurrence value, or any of the association measures defined in Section 2.3.1.1.
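The three parameters above can be illustrated with a minimal sketch, assuming a symmetric word window, a tiny hypothetical stopword list, and raw or binary co-occurrence values. The function name `cooccurrence_vectors`, the stopword set and the toy sentence are illustrative assumptions and are not taken from any of the cited studies.

```python
from collections import Counter, defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of"}

def cooccurrence_vectors(tokens, window=2, binary=False, stopwords=STOPWORDS):
    """Map each target word to a sparse vector {context word: value}.

    window: number of neighbouring tokens considered on each side.
    binary: if True, store 1 for any co-occurrence instead of the raw count.
    Stopwords are ignored as contexts but still occupy window positions.
    """
    vectors = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a token is not its own context
            context = tokens[j]
            if context in stopwords:
                continue  # context type: drop low-content words
            vectors[target][context] += 1
    if binary:
        return {t: {c: 1 for c in ctx} for t, ctx in vectors.items()}
    return {t: dict(ctx) for t, ctx in vectors.items()}

tokens = "the cat sat on the mat and the cat slept".split()
vectors = cooccurrence_vectors(tokens, window=2)
binary_vectors = cooccurrence_vectors(tokens, window=2, binary=True)
```

Changing `window` widens or narrows the context scope, and replacing the raw counts with an association measure would only require a post-processing step over `vectors`.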

2.4.1.1 Vector Space Models

Vector Space Models (VSMs) are promising approaches in distributional semantics (Turney and Pantel, 2010). Since the number of dimensions, and therefore the vector size, in these models usually ends up being very large, other methodologies have been devised to reduce dimensionality while preserving the necessary information. These include singular value decomposition (SVD), used by Schütze (1998), and latent semantic analysis (LSA), proposed by Deerwester et al. (1990). The studies that use these methodologies in MWE identification are discussed in both Section 2.3.1.2 and Section 2.5.

2.4.1.2 Word Embeddings

Word embedding is the name for language-modelling-based methodologies that still follow the principles of distributional semantics but aim at learning low-dimensional vectors of real numbers. In practice, they can be derived by feeding one-hot⁶ or randomly initialised vectors into a neural network and updating the weights in subsequent iterations. Dense vectors of this kind have only a few hundred dimensions, which makes them more practical because less memory is required to train them.

⁶ One-hot vector encoding of a target word is a vector of dimension size equal to the vocabulary size, with all entries zero except for the single entry corresponding to the target word itself, which is assigned the value of one.

These models are originally derived from neural network language modelling techniques (Bengio et al., 2003; Collobert et al., 2011). Various methods have been proposed to learn these mappings (Pennington et al., 2014; Lebret and Collobert, 2014). However, the introduction of the efficient word2vec approach, proposed by Mikolov et al. (2013c), brought this particular variant into widespread use. According to Mikolov et al. (2013a), these models perform significantly better than LSA and are also computationally less expensive. One standard implementation of word2vec is provided by Gensim (Řehůřek and Sojka, 2010).

word2vec uses two new neural network architectures to accomplish word representation learning, namely the Skip-gram model and the Continuous Bag Of Words (CBOW) model. In both cases, a feed-forward neural network is used in which the standard non-linear hidden layer of neural network language models is removed and a projection layer is shared for all words. The Skip-gram model receives the target word type as input to predict the context. The CBOW model, on the other hand, receives the context as input to predict the target word type. The Skip-gram model is further improved by computing hierarchical softmax probability and by using negative sampling and subsampling (Mikolov et al., 2013c).
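The difference between the two architectures can be seen in how training examples are formed. The sketch below is a minimal illustration, assuming a symmetric window and a toy sentence. The helper names `one_hot` and `training_examples` are hypothetical, and no network is trained here.

```python
def one_hot(word, vocabulary):
    """Vector of vocabulary size with a single 1 at the word's index."""
    return [1 if w == word else 0 for w in vocabulary]

def training_examples(tokens, window=2):
    """Return (skipgram_pairs, cbow_examples) for a token list.

    skipgram_pairs: (input target word, output context word) tuples.
    cbow_examples: (input context word list, output target word) tuples.
    """
    skipgram_pairs = []
    cbow_examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # Skip-gram: the target predicts each context word separately.
        skipgram_pairs.extend((target, c) for c in context)
        # CBOW: the whole context predicts the target.
        cbow_examples.append((context, target))
    return skipgram_pairs, cbow_examples

tokens = ["we", "learn", "word", "vectors"]
vocabulary = sorted(set(tokens))
skipgram_pairs, cbow_examples = training_examples(tokens, window=1)
word_vector = one_hot("word", vocabulary)
```

For a given window, Skip-gram therefore produces one example per target and context word, whereas CBOW produces one example per target.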

Softmax (a generalisation of the logistic function) is a normalised exponential function which is used to model probability distributions. In the case of Skip-gram word embedding, given the words $w_O$ and $w_I$, it is defined as

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w} \exp\left({v'_{w}}^{\top} v_{w_I}\right)},$$

where $v_w$ and $v'_w$ are the “input” and “output” vector representations of $w$, respectively, and the sum runs over all words $w$ in the vocabulary. In negative sampling, the model is trained to distinguish the observed word from noise words drawn from a noise distribution. In the case of subsampling, the intuition is that frequent words usually provide less information than rare words, and therefore the probability of dropping a training example increases with the frequency of occurrence of the target word.
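The three quantities discussed above can be made concrete with a short sketch. It assumes hand-picked two-dimensional toy vectors, which are invented for illustration. The negative sampling objective and the subsampling rule $P(w) = 1 - \sqrt{t / f(w)}$ follow the formulation given by Mikolov et al. (2013c). All function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def skipgram_softmax(w_in, input_vecs, output_vecs):
    """Return {w: p(w | w_in)} using the full softmax over the vocabulary."""
    v_in = input_vecs[w_in]
    scores = {w: dot(v_out, v_in) for w, v_out in output_vecs.items()}
    m = max(scores.values())  # subtract the maximum for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

def negative_sampling_objective(v_in, v_out_observed, v_out_noise):
    """log sigma(v'_O . v_I) plus the sum of log sigma(-v'_n . v_I) over noise words."""
    value = math.log(sigmoid(dot(v_out_observed, v_in)))
    for v_noise in v_out_noise:
        value += math.log(sigmoid(-dot(v_noise, v_in)))
    return value

def discard_probability(rel_freq, t=1e-5):
    """Subsampling: probability of dropping a word with relative frequency rel_freq."""
    return max(0.0, 1.0 - math.sqrt(t / rel_freq))

# Toy vectors, invented for illustration only.
input_vecs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
output_vecs = {"cat": [0.5, 0.1], "dog": [0.2, 0.3], "sat": [1.0, -1.0]}

probs = skipgram_softmax("cat", input_vecs, output_vecs)
objective = negative_sampling_objective(
    input_vecs["cat"], output_vecs["sat"], [output_vecs["dog"]]
)
```

The full softmax requires one dot product per vocabulary word for every training example, whereas the negative sampling objective only touches the observed word and the sampled noise words.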

In word2vec, the vector size and the context window size are again parameters that should be defined beforehand. In terms of context type, Levy and Goldberg (2014) generalise the word2vec Skip-gram model by replacing the word contexts with arbitrary contexts. In that study, they used dependency structures as contexts to train the model, which is called word2vecf. This approach will be further explained in Section 5.3.1.
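As a rough illustration of dependency-based contexts, the sketch below turns dependency arcs into (word, context) pairs: a head receives each modifier together with the relation label, and a modifier receives its head together with the inverse relation. The arcs are hand-written rather than parser output, the string format of the contexts (including the `-1` suffix for inverse relations) is an assumption, and any further preprocessing in the original study is omitted.

```python
def dependency_contexts(arcs):
    """Turn (head, label, modifier) arcs into (word, context) pairs.

    Each arc yields two pairs: the head sees "modifier/label" and the
    modifier sees "head/label-1" (the inverse relation).
    """
    pairs = []
    for head, label, modifier in arcs:
        pairs.append((head, f"{modifier}/{label}"))
        pairs.append((modifier, f"{head}/{label}-1"))
    return pairs

# Hand-written arcs for "australian scientist discovers star" (illustrative).
arcs = [
    ("scientist", "amod", "australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "star"),
]
pairs = dependency_contexts(arcs)
```

Unlike a linear window, these contexts can link words that are far apart in the sentence while skipping adjacent words that are not syntactically related.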