This thesis is structured as follows. A review of the literature on sentiment analysis and social media is presented in Chapter 2. The word-sentiment association method for polarity lexicon induction is described and evaluated in Chapter 3. Chapter 4 describes the tweet centroid model for polarity lexicon induction and for determining word-emotion associations. In Chapter 5, the tweet centroid model is used for transferring sentiment knowledge between words and tweets. The partitioned version of the model for distant supervision is also described in that chapter. The annotate-sample-average distant supervision method is described and evaluated in Chapter 6. Chapter 7 presents the main findings and contributions of this thesis, as well as a perspective for future work.
Sentiment Analysis and Social Media
In the early stages of the Web, its content was usually published by website owners associated with traditional information sources such as news media and companies, among other organisations. Additionally, the content was mainly about “facts”, which are objective statements on particular entities or topics. In the 2000s, the rise of Web 2.0 platforms (O’Reilly, 2007), e.g., blogs, online social networks and microblogging services, changed this situation by allowing users to generate and share textual content in a simpler way. This change caused an explosive growth of subjective information (i.e., personal opinions) available on the Web, which in turn provided new opportunities for information system developers. Whereas factual information has traditionally been processed using techniques such as information retrieval and topic classification, different types of methods are required to process the “subjective” content generated by users. In this chapter, we review those methods, which are commonly referred to in the research literature as opinion mining and sentiment analysis techniques. We discuss works addressing sentiment classification of documents, sentences, and tweets, as well as methods for polarity lexicon induction. Popular existing opinion lexicons are also reviewed and analysed. Moreover, we discuss work conducting aggregated analysis of opinions and applications of sentiment analysis and social media mining, including predictions about stock market prices and election outcomes. Finally, we provide a discussion of existing developments in the field in the context of the research problem addressed in this thesis.
2.1 Primary Definitions
Let d be an opinionated document (e.g., a product review) composed of a list of sentences $s_1, \ldots, s_n$. As stated by Liu (2009), the basic components of an opinion are:
• Entity: a product, person, event, organisation, or topic on which an opinion is expressed (the opinion target). An entity is composed of a hierarchy of components and sub-components, where each component can have a set of attributes. For example, a cell phone is composed of a screen and a battery, among other components, whose attributes could be the size and the weight. For simplicity, components and attributes are both referred to as aspects.
• Opinion holder: the person or organisation that holds a specific opinion
on a particular entity. While in reviews or blog posts the holders are
usually the authors of the documents, in news articles the holders are commonly indicated explicitly (Bethard, Yu, Thornton, Hatzivassiloglou and Jurafsky, 2004).
• Opinion: a view, attitude, or appraisal of an object from an opinion holder. An opinion can have a positive, negative or neutral orientation, where the neutral orientation is commonly interpreted as no opinion. The orientation is also named sentiment orientation, semantic orientation (Turney, 2002), or polarity.
Considering the components of the opinions presented above, an opinion is defined as a quintuple $(e_i, a_{ij}, oo_{ijkl}, h_k, t_l)$ (Liu, 2010). Here, $e_i$ is an entity, $a_{ij}$ is an aspect of $e_i$, and $oo_{ijkl}$ is the opinion orientation of $a_{ij}$ expressed by the holder $h_k$ during time period $t_l$. Possible values for $oo_{ijkl}$ are the categories positive, negative and neutral, or different strength/intensity levels. In cases when the opinion refers to the whole entity, $a_{ij}$ takes a special value named GENERAL.
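As an illustration only, the quintuple can be written down as a small record type. The sketch below is a minimal one, assuming the categorical form of the orientation (positive, negative, neutral) rather than intensity levels. The class name `Opinion`, its field names, the helper `targets_whole_entity`, and the example values are hypothetical and do not come from Liu (2010).

```python
from dataclasses import dataclass

# Special aspect value used when the opinion refers to the whole entity.
GENERAL = "GENERAL"

# Assumption: only the categorical orientations are modelled here;
# strength/intensity levels would need a different field type.
ORIENTATIONS = {"positive", "negative", "neutral"}


@dataclass(frozen=True)
class Opinion:
    """Hypothetical container for one opinion quintuple."""
    entity: str       # e_i: the opinion target
    aspect: str       # a_ij: a component or attribute of e_i, or GENERAL
    orientation: str  # oo_ijkl: polarity of the opinion on a_ij
    holder: str       # h_k: who holds the opinion
    time: str         # t_l: time period in which it was expressed

    def __post_init__(self):
        # Reject values outside the assumed categorical scale.
        if self.orientation not in ORIENTATIONS:
            raise ValueError(f"unknown orientation: {self.orientation!r}")

    def targets_whole_entity(self) -> bool:
        """True when the opinion is about the entity as a whole."""
        return self.aspect == GENERAL


# Made-up example: one holder, one entity, two opinions.
battery = Opinion("cell phone", "battery", "negative", "reviewer_1", "2015")
whole = Opinion("cell phone", GENERAL, "positive", "reviewer_1", "2015")
```

Keeping the aspect as a plain field with a reserved GENERAL value mirrors the definition above, where entity-level and aspect-level opinions share the same quintuple form.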
It is important to consider that, within an opinionated document, several opinions about different entities and from different holders can be found. In this context, a more general opinion mining problem can be addressed, consisting of discovering all opinion quintuples $(e_i, a_{ij}, oo_{ijkl}, h_k, t_l)$ from a collection D of opinionated documents. Approaches to this problem are referred to as aspect-based or feature-based opinion mining methods (Liu, 2009). As we can see, working with opinionated documents involves tasks such as identifying entities, extracting aspects from the entities, identifying opinion holders (Bethard et al., 2004), and evaluating the sentiment of the opinions (Pang and Lee, 2008).
In addition to the orientation or polarity, there are other affective dimensions by which an opinion can be characterised, such as subjectivity and emotion.
A sentence of a document is defined as subjective when it expresses personal feelings, views or beliefs. It is common to treat neutral sentences as objective and opinionated sentences as subjective.
Emotions are subjective feelings and thoughts. According to Parrot (2001), people have six primary emotions: love, joy, surprise, anger, sadness, and fear. Another categorisation, proposed by Ekman (1992), consists of six basic emotions: anger, fear, joy, sadness, surprise, and disgust. It was later extended by Plutchik (2001) to include two additional emotion states: anticipation and trust.
The affective dimension of a document can be represented using different variable types. Nominal variables are used to represent hard associations with the affective dimension (e.g., positive or non-positive), while ordinal and numeric variables are used to represent intensity or strength levels, such as weakly positive, strongly positive, or 40% negative.
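The relation between the three variable types can be made concrete with a small sketch. It assumes a numeric polarity score in [-1, 1] and derives an ordinal level and a nominal label from it. The score range, the 0.5 cut-off, the five level names, and the function names `to_ordinal` and `to_nominal` are illustrative assumptions, not values taken from the literature reviewed here.

```python
# Assumed ordinal scale, ordered from most negative to most positive.
ORDINAL_LEVELS = [
    "strongly negative",
    "weakly negative",
    "neutral",
    "weakly positive",
    "strongly positive",
]


def to_ordinal(score: float) -> str:
    """Map a numeric score in [-1, 1] to an ordinal intensity level.

    The 0.5 threshold is an arbitrary choice made for illustration.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score <= -0.5:
        return ORDINAL_LEVELS[0]
    if score < 0:
        return ORDINAL_LEVELS[1]
    if score == 0:
        return ORDINAL_LEVELS[2]
    if score < 0.5:
        return ORDINAL_LEVELS[3]
    return ORDINAL_LEVELS[4]


def to_nominal(score: float) -> str:
    """Collapse a numeric score to a hard positive / non-positive label."""
    return "positive" if score > 0 else "non-positive"
```

Going from numeric to ordinal to nominal discards information at each step, which is why the conversion is only defined in that direction here.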