When we use the term descriptive analytics, the question to ask is this: What attributes would we use to describe what is contained in this specific sample of data? Put another way, how can we summarize the dataset?
To further illustrate the concept of descriptive analytics, we use the results from a system called Simple Social Metrics that we developed at IBM. It’s nothing more than a system that “follows” a filtered set of Twitter traffic and attempts to provide a quantitative description of the data that was collected (we talk more about this in Chapter 8, where we address real-time data).
In this example, we use a dataset of tweets made by IBMers who are members of IBM’s Academy of Technology. The IBM Academy of Technology is a society of IBM technical leaders organized to advance the understanding of key technical areas, to improve communications in and development of IBM’s global technical community, and to engage its clients in technical pursuits of common value. These are some of IBM’s top technical minds, so an analysis of their conversation could be quite useful.
One of the first questions we want to answer is this: “Who is contributing the most?” or rather, “Who is tweeting the most or being tweeted about?” One way to answer it is to count the number of contributions per author. We simply call this the “top authors,” and for the month of November, a breakdown of the top contributors looked something like that shown in Figure 4.4. Even though this diagram tells us which author posted the largest number of tweets, we need to go beyond the machine-based analysis and leverage human analysis to determine who really “contributed” the most.
[Figure 4.4: chart of top authors. Handles shown: Manuel_Avalos_V, iGEASmit, JFPuget, dr_rick, cboulangfr, amartin171, jmrod8, andysc, Dr_Casimer, KMarzantowicz]
Figure 4.4 Top 10 authors from IBM’s Academy of Technology during November 2014.
While this data is interesting, we need to remember that these types of descriptive metrics represent just a summary over a given period of time. The view could be quite different if we change the time frame. For example, consider the same data, but compare a view of the whole month with a view of only the last half of the month (see Figure 4.5).
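[Figure 4.5: two charts of top authors. Handles shown in one panel: jmrod8, andysc, cboulangfr, Dr_Casimer, iGEASmit. Handles shown in the other panel: jmrod8, iGEASmit, kmarzantowicz, Dr_Casimer, andysc]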
Figure 4.5 Top authors during the whole month and just at the end.
An important fact that comes across here is that one of the users, kmarzantowicz, came on strong during the last half of the month with a heavy amount of tweeting to move into the top five of all individuals. Perhaps this person was attending a conference and tweeting about various presentations
4: Timing Is Everything 55
or speeches; or perhaps this person said something intriguing and there was a flurry of activity around him or her. From an analyst’s perspective, it would be interesting to pull the conversation that was generated by that user for the last 15 days of the month to understand why there was such a large upsurge in traffic.
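The comparison of time windows described above can be sketched in a few lines of Python. This is only an illustration, not the Simple Social Metrics implementation: the author handles, dates, and counts are invented, and the `top_authors` helper is a hypothetical name introduced here.

```python
from collections import Counter
from datetime import date

# Invented sample records: (author handle, date the tweet was posted).
tweets = [
    ("author_a", date(2014, 11, 2)),
    ("author_a", date(2014, 11, 3)),
    ("author_a", date(2014, 11, 5)),
    ("author_a", date(2014, 11, 8)),
    ("author_a", date(2014, 11, 11)),
    ("author_a", date(2014, 11, 20)),
    ("author_b", date(2014, 11, 5)),
    ("author_b", date(2014, 11, 12)),
    ("author_c", date(2014, 11, 18)),
    ("author_c", date(2014, 11, 21)),
    ("author_c", date(2014, 11, 27)),
    ("author_c", date(2014, 11, 29)),
]

def top_authors(records, start, end, n=5):
    """Return the n most active authors between start and end (inclusive)."""
    counts = Counter(author for author, day in records if start <= day <= end)
    return counts.most_common(n)

# Same data, two different time frames.
whole_month = top_authors(tweets, date(2014, 11, 1), date(2014, 11, 30))
last_half = top_authors(tweets, date(2014, 11, 16), date(2014, 11, 30))

print("Whole month:", whole_month)
print("Last half:  ", last_half)
```

In this made-up data, author_a leads over the whole month, but author_c leads when only the last half of the month is counted. That is the kind of shift an analyst would want to follow up on by reading the underlying conversation.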
Sentiment
One of the more popular descriptive metrics that people like to use is sentiment. Sentiment is usually associated with the emotion (positive or negative) that an individual is feeling about the topic being discussed. In this chapter, which is focused on time, we discuss sentiment as it changes over time.
Sentiment analytics involves analyzing the comments or words of individuals to quantify the thoughts or feelings those words are intended to convey. Basically, it’s an attempt to understand the positive or negative feelings individuals have toward a brand, company, individual, or any other entity. In our experience, most of the sentiment collected around topics tends to be “neutral” (that is, it conveys no positive or negative feelings or meanings). It’s easiest to think about sentiment analytics when we look at Twitter data (or any other social site where people express a single thought or make a single statement). We can compute the sentiment of a document (such as a wiki post or blog entry) by looking at the overall scoring of the sentiment words that it contains. For example, if a document contains 2,000 words that are considered negative versus 300 words that are considered positive in meaning, we may choose to classify that document as overall negative in sentiment. If the numbers are closer together (say 3,000 negative words versus 2,700 positive words, an almost equal distribution), we may choose to say that the document is neutral in sentiment.
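The counting rule in the preceding paragraph can be written down as a small Python function. This is a minimal sketch under our own assumptions: the text does not say how close the two counts must be before a document is called neutral, so the `neutral_margin` threshold of 10 percent is a hypothetical value chosen only so that both examples from the text come out as described.

```python
def classify_document(positive_words, negative_words, neutral_margin=0.10):
    """Label a document from its counts of positive and negative words.

    neutral_margin is an assumed threshold: if the two counts differ by
    less than this share of all sentiment words, the document is neutral.
    """
    total = positive_words + negative_words
    if total == 0:
        return "neutral"  # no sentiment words at all
    balance = (positive_words - negative_words) / total
    if abs(balance) < neutral_margin:
        return "neutral"
    return "positive" if balance > 0 else "negative"

# The two examples from the text.
print(classify_document(positive_words=300, negative_words=2000))
print(classify_document(positive_words=2700, negative_words=3000))
```

With this margin, the first document is labeled negative and the second neutral. A different margin would move the boundary, which is why the threshold should be treated as a tuning choice rather than a fixed rule.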
Consider this simple message from LinkedIn:
Hot off the press! Check out this week’s enlightening edition of the #companyname Newsletter http://bit.ly/xxxx
A sentiment analysis of this message would indicate that it’s positive in tone. The sentiment analysis done by software is usually based on a sentiment dictionary for that language. The basic package comes with a predefined list of words that are considered positive. Similarly, there is also a long list of words that are considered negative. For many projects, the standard dictionary can be used to determine sentiment. In some special cases, you may have to modify the dictionary to include domain-specific positive and negative words. For example, the word “disaster” is a negative sentiment word in the majority of contexts, except when it is used to refer to a category of system such as “Disaster Recovery Systems.”
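A dictionary-based approach with a domain-specific exception can be sketched as follows. The word lists here are tiny, invented stand-ins for a real sentiment dictionary, and the choice to blank out neutral phrases before counting words is just one simple way to handle a case like “Disaster Recovery”; it is not taken from any particular product.

```python
import re

# Tiny illustrative word lists (assumed); a real dictionary is far larger.
POSITIVE = {"hot", "enlightening", "great", "excellent"}
NEGATIVE = {"disaster", "terrible", "outage", "failure"}

# Domain-specific phrases that should not count toward sentiment (assumed).
NEUTRAL_PHRASES = ["disaster recovery"]

def score_message(text):
    """Return (positive_count, negative_count) for one message."""
    lowered = text.lower()
    # Remove domain phrases first so their words are not scored individually.
    for phrase in NEUTRAL_PHRASES:
        lowered = lowered.replace(phrase, " ")
    words = re.findall(r"[a-z']+", lowered)
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive, negative

print(score_message("Hot off the press! Check out this week's enlightening edition"))
print(score_message("We upgraded our Disaster Recovery Systems"))
print(score_message("The launch was a disaster"))
```

Under these assumed word lists, the newsletter message scores as positive, the “Disaster Recovery Systems” message scores as neutral, and only the last message registers a negative word.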
Understanding the general tone of a dataset can be an interesting metric, if indeed there is some overwhelming skew toward a particular tone in the message.
Consider the descriptive set of metrics of sentiment shown in Figure 4.6, taken from an analysis we did for a customer in the financial industry over a one-month period. This represents the tone of the messages posted in social media about this particular company.
[Figure 4.6: chart of message tone. Categories shown: Negative, Neutral, Positive]
Figure 4.6 Customer sentiment over a one-month period.
On the surface, this looks like a good picture. The amount of positive conversation is clearly greater than the amount of negative, and the neutral sentiment (which is neither bad nor good) overwhelms both. So in summary, this appears to be quite acceptable.
However, if we take the negative sentiment and look at it over time, a different picture emerges, as illustrated in Figure 4.7.
[Figure 4.7: plot. x-axis: Day of the Month (1 to 28); y-axis: Number of Messages (0 to 30)]
Figure 4.7 Plot of one month’s count of negative sentiment.
While cumulatively the negative sentiment was much smaller than the positive, there was one particular date range (from approximately the 16th to the 18th of the month) when there was a large spike in negative messaging centered around our client. Although this was just an isolated spike in traffic, the event could have lingering effects if not addressed.
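To show how such a spike might be surfaced automatically, here is a minimal Python sketch. The daily values are invented and are not the client’s data, and the rule used (flag any day whose count exceeds the monthly mean by more than two standard deviations) is a simple assumption of ours rather than the method behind Figure 4.7.

```python
from collections import Counter
from statistics import mean, pstdev

# Invented data: one day-of-month entry per negative message.
negative_message_days = (
    [2, 5, 5, 9, 12, 14]
    + [16] * 16 + [17] * 25 + [18] * 18
    + [21, 24, 27]
)

def daily_counts(days, days_in_month=28):
    """Count messages per day, including days with zero messages."""
    counts = Counter(days)
    return [counts.get(day, 0) for day in range(1, days_in_month + 1)]

def spike_days(counts, z_threshold=2.0):
    """Return 1-based days whose count exceeds mean + z_threshold * std dev."""
    cutoff = mean(counts) + z_threshold * pstdev(counts)
    return [day for day, count in enumerate(counts, start=1) if count > cutoff]

counts = daily_counts(negative_message_days)
print("Total negative messages:", sum(counts))
print("Spike days:", spike_days(counts))
```

The monthly total alone hides the pattern; the per-day counts reveal that most of the negative messages in this made-up sample fall on three consecutive days.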