

of V. As a result, word embeddings offer dense data representations of reduced dimensionality even when the vocabulary size is very large. Moreover, studies such as [26] and [28], which present the Skip-Gram and Glove methods, confirm that word embeddings trained on large text corpora preserve syntactic and semantic word relations. They test this property by means of word analogy tasks and report very good results. It is, however, important to note that the quality of word features depends on the training data, the number of word samples and the size of the vectors, and that obtaining high-quality representations takes a lot of computation time. The following sections describe in detail some of the most popular training methods available today.

5.2 Popular Word Vector Generation Methods

5.2.1 Continuous Bag of Words

The CBOW architecture proposed in [26] is a simplification of the feed-forward neural model of [5]. The hidden layer that introduces non-linearity is removed, and the input window of Q words is projected into a P-sized projection layer. Q future words are used as well, and the objective is to correctly predict the middle word. The authors use a log-linear classifier with a binary tree representation of the vocabulary. This way, the number of units in the output layer drops from V to log2(V). As a result, the total training complexity of CBOW becomes:

C = E × T × (Q × P + P × log2(V))    (5.1)

where E is the number of training epochs and T is the total number of tokens appearing in the training text bundle. Throughout this thesis, the term “token” denotes a raw word occurrence; tokens usually repeat within a text document or text bundle. The unique words that form the vocabulary of that text structure are called “vocabulary words” or simply “words”. The architecture of CBOW is presented schematically in Figure 5.1 (a), which shows the use of distributed context words to predict the current word.
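
To give a feeling for Equation 5.1, the following minimal sketch evaluates it for one set of hyperparameters and compares it with a flat output layer of V units. The function names, the flat-output variant and all numeric settings are illustrative assumptions, not values taken from [26].

```python
import math


def cbow_complexity(epochs, tokens, window, proj_size, vocab_size):
    """Equation 5.1: E * T * (Q * P + P * log2(V))."""
    per_token = window * proj_size + proj_size * math.log2(vocab_size)
    return epochs * tokens * per_token


def flat_output_complexity(epochs, tokens, window, proj_size, vocab_size):
    """Assumed comparison case: the output layer keeps all V units."""
    per_token = window * proj_size + proj_size * vocab_size
    return epochs * tokens * per_token


# Hypothetical settings, chosen only for illustration.
E, T, Q, P, V = 3, 1_000_000, 8, 300, 2**20

tree = cbow_complexity(E, T, Q, P, V)
flat = flat_output_complexity(E, T, Q, P, V)
print(f"binary tree: {tree:.3e}  flat: {flat:.3e}  ratio: {flat / tree:.0f}x")
```

With a large vocabulary the P × V term dominates the flat variant, which is why replacing V output units with log2(V) has such a strong effect on the total cost.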

62 Distributed Word Representations

Fig. 5.1 CBOW and Skip-Gram neural architectures

5.2.2 Skip-Gram

The Skip-Gram architecture shown in Figure 5.1 (b) is similar to that of CBOW. However, it starts from the current word and predicts the context words appearing near it [26]. A log-linear classifier with a projection layer takes the current word as input and predicts nearby words that appear before and after it (inside a window). The training complexity of this architecture is

C = E × T × (W × (P + P × log2(V)))    (5.2)

where W is the word window size. The authors report that enlarging the window improves the quality of the generated vectors but also increases the computation cost, as Equation 5.2 suggests. To benchmark the vector quality of different architectures, they create an analogical reasoning task that consists of analogies like Italy : Rome == France : ___. These tasks are solved by finding the vector of a word x (in this case x is Paris) such that vec(x) is closest in cosine distance to

vec(Rome) − vec(Italy) + vec(France).

Besides semantic analogies, the task dataset also contains syntactic analogies (e.g., walk : walking == run : running) and is available online.¹ The authors trained a collection of word vectors on a huge Google News corpus of 100 billion tokens and released them for public use.²
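
The analogy search can be sketched in a few lines. The example below is only an illustration under stated assumptions: the three-dimensional vectors are hand-made rather than trained, the function names are hypothetical, the offset is taken as vec(Rome) − vec(Italy) + vec(France), and the three query words are excluded from the candidates, which is a common convention and an assumption here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def solve_analogy(vectors, a, b, c):
    """Return the word x that best completes 'a : b == c : x'."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Assumption: the query words themselves are not valid answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors, not trained embeddings.
toy = {
    "Italy":  [1.0, 0.0, 0.2],
    "Rome":   [1.0, 1.0, 0.2],
    "France": [0.0, 0.0, 1.0],
    "Paris":  [0.0, 1.0, 1.0],
    "Berlin": [0.5, 1.0, -1.0],
}

print(solve_analogy(toy, "Italy", "Rome", "France"))  # prints "Paris"
```

In the toy data the second coordinate acts as a "capital" direction, so adding the Rome minus Italy offset to France lands exactly on Paris; with trained vectors the match is only approximate.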

¹ https://code.google.com/archive/p/word2vec/source/default/source
² https://code.google.com/archive/p/word2vec/


5.2.3 Glove

Glove (GLObal VEctors), described in [28], is a log-bilinear regression model that generates word vectors based on global co-occurrences of words. The authors argue that ratios of word co-occurrence probabilities can be used to unveil aspects of word meaning. For example, they observe that P(solid|ice) / P(solid|steam) is about 8.9, whereas P(gas|ice) / P(gas|steam) is only 0.085. This is what we would logically expect, since “solid” is semantically more related to “ice” than to “steam”. Similarly, “gas” is closer to “steam” than to “ice”. Based on these premises, they build a weighted least squares regression model that learns word vectors from word-word co-occurrence statistics. Calculations on real text corpora show a complexity of O(|T|^0.8) for the model, where T is again the total number of tokens appearing in the training texts. To evaluate the quality of the word vectors, the authors create several large datasets of varying content.³ They also use the word analogy task described above and report that Glove performs slightly better than other baselines such as CBOW or Skip-Gram, especially with larger training corpora. Moreover, its scalability yields substantial improvements as the training corpus grows further.
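
The weighted least squares idea can be illustrated with a small sketch of the objective. It assumes the commonly quoted form of the Glove loss, in which the dot product of a word vector and a context vector plus two biases is fitted to the logarithm of the co-occurrence count. The function names and data layout are hypothetical, and the weighting constants x_max = 100 and alpha = 0.75 should be treated as assumed defaults rather than values verified here.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function: damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(cooc, w, w_ctx, b, b_ctx):
    """Weighted squared error over the nonzero co-occurrence counts.

    cooc maps (i, j) index pairs to counts X_ij > 0; w and w_ctx hold the
    word and context vectors, b and b_ctx the corresponding biases.
    """
    total = 0.0
    for (i, j), x_ij in cooc.items():
        dot = sum(p * q for p, q in zip(w[i], w_ctx[j]))
        err = dot + b[i] + b_ctx[j] - math.log(x_ij)
        total += weight(x_ij) * err * err
    return total


# Toy example: two words, 2-d vectors initialised to zero (assumed values).
cooc = {(0, 1): 100.0, (1, 0): 4.0}
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(glove_loss(cooc, zeros, zeros, [0.0, 0.0], [0.0, 0.0]))
```

Because only nonzero counts enter the sum, the cost of one pass depends on the number of observed word pairs rather than on the square of the vocabulary size.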

5.2.4 Paragraph Vectors

A common limitation of all the word vector generation methods above in text mining applications is that they produce fixed-length vectors for words but not for variable-length texts like sentences or paragraphs. Many text datasets contain documents of different lengths, which must therefore be clipped and/or padded to a fixed length beforehand. To overcome this limitation, the authors of [105] propose Paragraph Vector, a method for learning fixed-length continuously distributed representations of variable-length text excerpts such as sentences, paragraphs or entire documents. As in CBOW, the model predicts the next word (the word following a few words in a fixed window) from context words sampled from the paragraph, and the paragraph vector contributes to this prediction: the paragraph vector and the word vectors inside each paragraph are concatenated to predict the next word. Once the word and paragraph vectors have been obtained from the training texts (training phase), vectors for unseen paragraphs are predicted (inference phase). The authors report significant improvements over the BOW representation on supervised sentiment analysis of short texts like sentences. For longer documents like movie reviews, the accuracy gains of paragraph vectors are lower.
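
The concatenation step can be sketched as follows. This is a simplified illustration and not the implementation of [105]: it assumes a flat softmax output layer, and the function names, vector sizes and weights are all made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next_word(paragraph_vec, context_vecs, out_weights):
    """Concatenate the paragraph vector with the context word vectors and
    return a probability distribution over the vocabulary."""
    features = list(paragraph_vec)
    for vec in context_vecs:
        features.extend(vec)
    scores = [sum(wt * f for wt, f in zip(row, features))
              for row in out_weights]
    return softmax(scores)


# Toy setup: 2-d vectors, two context words, three vocabulary words.
paragraph_vec = [0.5, -0.5]
context_vecs = [[1.0, 0.0], [0.0, 1.0]]
out_weights = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]  # 3 words x 6 features

probs = predict_next_word(paragraph_vec, context_vecs, out_weights)
print(probs)
```

The input always has the same size (one paragraph vector plus a fixed number of context vectors), regardless of how long the paragraph is, which is what lets the method handle variable-length texts without clipping or padding.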

5.3 Performance of Word Embeddings on Sentiment
