III. Índice de Figuras
8. Conclusiones
A popular approach is to model the problem as a supervised learning problem. The idea behind this approach is to train a function capable of determining the sentiment orientation of an unseen document using a corpus of documents labelled by sentiment. For example, the training and testing data can be ob- tained from websites of product reviews where each review is composed of a free text comment and a reviewer-assigned rating. A list of available training
corpora from opinion reviews can be found in (Pang and Lee, 2008). After- wards, the text data and the ratings are transformed into a feature-vector and a target value respectively. For example, if the rating is on a 1-5 star scale, high-starred reviews can be labelled as positive opinions and low-starred re- views as negative in the same way. The problem can also be formulated as an ordinal regression problem using the number of stars as the target variable (Pang and Lee, 2005). In (Pang, Lee and Vaithyanathan, 2002), the authors trained a binary classifier (positive/negative) over movie reviews from the In- ternet Movie Database (IMDb). They used the following features: unigrams, bigrams, and part of speech tags, and the following learning algorithms: Sup- port Vector Machines (SVM), Maximum Entropy Classifier, and Naive Bayes. The best average classification accuracy obtained through a three-fold cross-
validation was 82.9%. This result was achieved using purely unigrams as fea-
tures and an SVM as learning algorithm.
More recent models have adopted the “representation learning” approach of learning the document representation directly from the data using neu- ral networks (Collobert and Weston, 2008). Word embeddings are a popular choice among those approaches. They are low-dimensional continuous dense word vectors trained from unlabelled document corpora capable of capturing rich semantic information. A state-of-the-art word embedding model is the skip-gram model (Mikolov, Sutskever, Chen, Corrado and Dean, 2013) imple-
mented in the Word2vec1 library. In this method, a neural network with one
hidden layer is trained for predicting the words surrounding a centre word,
within a window of size k that is shifted along the input corpus. The centre
and surroundingk words correspond to the input and output layers of the net-
work, respectively, and are represented by 1-hot vectors, which are vectors of
the size of the vocabulary (|V|) with zero values in all entries except for the
corresponding word index that receives a value of 1. Note that the output
layer is formed by the concatenation of thek 1-hot vectors of the surrounding
words. The hidden layer has a dimensionality l, which determines the size of
the embeddings (normally l |V|). The word-embedding for each word can
be obtained in two ways: 1) from the projection matrix connecting the input layer with the hidden one, and 2) from the projection matrix connecting the hidden layer with the output one. The network is efficiently trained using an algorithm called “negative-sampling”, and the rationale of the model is that words occurring in similar contexts will receive similar vectors. There is an-
other word embedding model called continuous bag of words (CBOW), which is analogous to the skip-gram model after swapping its input and output lay- ers. Thus, the learning task of CBOW consist of predicting a centre word given its surrounding words in a window.
A simple approach for using word-embeddings in sentence-level polarity classification is to use the average word vector as the feature representation of a sentence, which is latter used as input for training a sentence-level polarity classifier (Castellucci, Croce and Basili, 2015).
There are syntactic dependencies such as negations and but clauses known as “opinion shifters” (Liu, 2009) that can strongly alter the overall polarity of a sentence, e.g., “I didn’t like the movie”, “I like you but I’m married”. Repre- sentations based on unigrams or averaging word embeddings lack information about the sentence structure. Hence, they are unable to capture long-distance dependencies in the passage. In the following, we discuss two representation learning techniques for modelling semantic compositionality in sentences that were successfully employed for sentiment analysis.
A recursive neural tensor network for learning the sentiment of pieces of texts of different granularities, such as words, phrases, and sentences, was proposed in (Socher, Perelygin, Wu, Chuang, Manning, Ng and Potts, 2013).
The network was trained on a sentiment annotated treebank2 of parsed sen-
tences for learning compositional vectors of words and phrases. Every node in the parse tree receives a vector, and there is a matrix capturing how the meaning of adjacent nodes changes. The main drawback of this model is that it relies on parsing. Thus, it would be difficult to apply it to Twitter data because of the lack of Twitter-specific sentiment treebanks and robust constituency parsers for Twitter (Foster, Cetinoglu, Wagner, Le Roux, Nivre, Hogan and van Genabith, 2011).
A paragraph vector-embedding model that learns vectors for sequences of words of arbitrary length (e.g, sentences, paragraphs, or documents) without relying on parsing was proposed in (Le and Mikolov, 2014). The paragraph vectors are obtained by training a similar network as the one used for training the CBOW embeddings. The words surrounding a centre word in a window are used as input together with a paragraph-level vector for predict the centre word. The paragraph-vector acts as a memory token that is used for all the centre words in the paragraph during the training the phase.
The recursive neural tensor network and the paragraph-vector embedding
were evaluated on the same movie review dataset used in (Pang et al., 2002), obtaining an accuracy of 85.4% and 87.8%, respectively. Both models outper- formed the results obtained by classifiers trained on representations based on bag-of-words features.
Most sentiment analysis datasets are imbalanced in favour of positive ex- amples (Li, Wang, Zhou and Lee, 2011). This is presumably because users are more likely to report positive than negative opinions. The shortcoming of training sentiment classifiers from imbalanced datasets is that many classifica- tion algorithms tend to predict test samples as the majority class (Japkowicz and Stephen, 2002) when trained from this type of data. A semi-supervised model for imbalanced sentiment classification is proposed in (Li et al., 2011). The model exploits both labelled and unlabelled documents by iteratively per- forming under-sampling of the majority class in a co-training framework using random subspaces of features.