For this thesis I have chosen to use data from the microblogging site Twitter to represent consumer sentiment. Particular attention is paid to how the data was collected, because the collection method was developed specifically for this thesis and should therefore be transparent.
The social media platform Twitter is currently the most popular micro-blogging site in the world. On Twitter, users have 140 characters to express themselves to their ‘followers’ and the rest of the world. There are currently over 284 million active users each month, and over 500 million tweets are sent each day (Twitter Inc., 2014). Tweets are public by default; they are seen by a user's followers and can be found by anyone searching for a term that the user has written about. It is also possible to ‘retweet’ what other users have written, that is, to share another user's tweet on your own Twitter feed.
Twitter is known to be heavily populated by consumer opinions, and has been used to analyze both customer and consumer sentiment in several studies (Chamlertwat, Bhattarakosol, Rungkasiri, & Haruechaiyasak, 2012; He et al., 2013; Mostafa, 2013; Pak & Paroubek, 2010). Part of the reason for this is that Twitter, as opposed to other social media platforms, has given developers access to part of its Application Programming Interface (API). This makes the data more accessible than data from other social networks such as Facebook, Instagram and Snapchat, which are also largely picture- and video-based. By signing up as a third-party developer, anyone can therefore access a selection of contemporary tweets, within the confines of what Twitter has found appropriate (Twitter Inc., 2014). This has made Twitter a particularly interesting avenue of research for academia, and is much of the reason why this platform has been used to find consumer data for this thesis.6
5 It could be argued that this data can be skewed towards either the negative or the positive. If a customer has an unresolved issue, it might motivate the customer to write a very negative message despite having had a pleasant interaction with customer service or a positive impression of the company as a whole. Similarly, the data could be skewed towards the positive: a customer whose problem has not been solved, but who believes it will be resolved because of a pleasant interaction with a customer service agent, might answer positively regardless of the actual outcome of the situation.
In order to archive results from Twitter, I first had to obtain a developer license to gain access to the Twitter API. Within this API I created a search string containing the key phrase “Telenor”. Further, as my focus is on Telenor Norway, I limited the search to Norwegian tweets by setting the language to “NO” (the ISO 639-1 code for Norwegian). This query creates a stream of tweets that is automatically updated every hour. However, the data is still in a data interchange format. Twitter uses an open standard called JSON, a format that uses human-readable text to transmit data objects (JSON.org, 2014). Even though JSON is one of the more readable data interchange formats, it cannot be imported directly into the text analysis software at hand. I have therefore used a script that converts the JSON into a standard spreadsheet format (xls/csv).7 This gives me the text of each tweet as well as other metadata in a format that is easy to import into the analytical software.
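To illustrate the conversion step, the following is a minimal Python sketch of how tweet objects in JSON can be flattened into spreadsheet rows. It is not the script used for this thesis (see footnote 7); the field names (`id_str`, `created_at`, `lang`, `text`, `user.screen_name`) are assumed to follow Twitter's tweet object layout, and the two sample tweets are invented for illustration.

```python
import csv
import io
import json

# Assumed field layout, loosely following Twitter's tweet objects.
# The sample tweets are invented for illustration only.
SAMPLE_JSON = """
[
  {"id_str": "1", "created_at": "Tue Nov 25 10:00:00 +0000 2014",
   "lang": "no", "text": "Telenor har god dekning her",
   "user": {"screen_name": "example_user"}},
  {"id_str": "2", "created_at": "Tue Nov 25 11:30:00 +0000 2014",
   "lang": "no", "text": "Venter fortsatt hos Telenor kundeservice",
   "user": {"screen_name": "another_user"}}
]
"""

# Column order of the resulting spreadsheet (a hypothetical choice).
COLUMNS = ["id_str", "created_at", "screen_name", "lang", "text"]


def flatten_tweet(tweet):
    """Pull the nested user name up so each tweet becomes one flat row."""
    return {
        "id_str": tweet.get("id_str", ""),
        "created_at": tweet.get("created_at", ""),
        "screen_name": tweet.get("user", {}).get("screen_name", ""),
        "lang": tweet.get("lang", ""),
        "text": tweet.get("text", ""),
    }


def tweets_to_csv(json_text):
    """Convert a JSON array of tweets into CSV text with a header row."""
    rows = [flatten_tweet(t) for t in json.loads(json_text)]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = tweets_to_csv(SAMPLE_JSON)
print(csv_text)
```

The resulting text can be saved with a `.csv` extension and opened in any spreadsheet program, with one tweet per row and one metadata field per column.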
The collection of tweets began on 25/11/14, and ended on 27/03/15. This gathered all tweets in Norwegian that mentioned the word “Telenor” in this time period. After removing irrelevant tweets8, the remaining dataset analyzed contains 5440 tweets on the subject of Telenor.
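The cleaning step described in footnote 8 can be sketched along the following lines. This is only an illustration under stated assumptions: the thesis does not specify how bot and spam accounts were identified, so the sketch assumes a manually curated list of account names. The account `telenor_service` is taken from footnote 8; `example_spam_bot` and the sample rows are hypothetical.

```python
# Account names to exclude. "telenor_service" is mentioned in footnote 8;
# "example_spam_bot" is a hypothetical placeholder for a curated bot list.
COMPANY_ACCOUNTS = {"telenor_service"}
KNOWN_BOT_ACCOUNTS = {"example_spam_bot"}


def is_consumer_tweet(row):
    """Return True if the tweet's author is neither the company nor a known bot."""
    name = row["screen_name"].lower()
    return name not in COMPANY_ACCOUNTS and name not in KNOWN_BOT_ACCOUNTS


def keep_consumer_tweets(rows):
    """Drop tweets written by company or bot accounts."""
    return [row for row in rows if is_consumer_tweet(row)]


# Invented sample rows, in the flat one-row-per-tweet layout of a spreadsheet.
sample_rows = [
    {"screen_name": "telenor_service", "text": "Hei! Send oss en DM."},
    {"screen_name": "example_spam_bot", "text": "Telenor Telenor Telenor"},
    {"screen_name": "example_user", "text": "Telenor har god dekning her"},
]

cleaned = keep_consumer_tweets(sample_rows)
print(len(cleaned), "of", len(sample_rows), "tweets kept")
```

Filtering on account name rather than on tweet text keeps consumer tweets that merely mention or reply to the company's accounts.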
There are of course many ways to retrieve, store and analyze textual data from a platform such as Twitter. The method used here is specifically designed to produce data compatible with the Provalis Research Suite, so that the tweets will not only be retrieved and stored, but can also be analyzed by the same software application as used on the NPS customer feedback data.
6 Academic research on the platform has already found that Twitter data could fairly accurately predict the stock market (Bollen et al., 2011) and could function as a real-time earthquake detection system (Sakaki, Okazaki, & Matsuo, 2010).
7 For more on the script used, visit: https://tags.hawksey.info/. Note that I have also modified this script to perform a Norwegian language search.
8 Tweets from Telenor’s own accounts (@telenor_service etc.) were deleted from the dataset, as this thesis is focused on the consumer’s sentiment and not the company’s. Tweets that were automatically generated by Twitterbots or other spamming accounts were also removed, as they cannot be said to contain consumer feedback and are therefore irrelevant in this context.
These two sources of textual data (NPS and tweets) are in a sense complementary. Both are textual feedback on a company; they are usually fairly short and colloquial, and often contain a positive or negative sentiment. As data they provide much more detailed and vivid information than common surveys, as free-form text can be about anything: customer service, the company as a whole, or its services.
However, unstructured text is also difficult to handle. One central challenge is the messiness and ambiguity of written colloquial text. It is often riddled with spelling errors, jargon and slang, or may even be meant ironically, which can be difficult to pick up on. This can also make it difficult to find all cases on the same topic if they are worded completely differently. However, much of this is less problematic than it used to be, thanks to advances in text mining software, which can now easily build dictionaries and word categorizations that include common misspellings or slang. So despite the complexities of textual data, it can still be considered a rich source of information. Having discussed textual data as a source, I will next discuss the validity and reliability of this thesis.