Summarization
5.2 Multi-Document Opinion-Oriented Summarization
Language is itself the collective art of expression,
5.2.1 Some Problem Considerations
There never was in the world two opinions alike, no more than two hairs, or two grains; the most universal quality is diversity.
— Michel de Montaigne, Essays
Where an opinion is general, it is usually correct.
— Jane Austen, Mansfield Park
We briefly discuss here some points to keep in mind with regard to multi-document sentiment summarization, although, to a certain degree, work in sentiment summarization has not yet reached a level where these problems have come to the fore.
Determining which documents or portions of documents express the same opinion is not always an easy task, but it clearly needs to be addressed in the summarization setting: readers of sentiment summaries are surely interested in the overall sentiment in the corpus, which means the system must determine shared sentiments within the document collection at hand.
This issue can still arise even when labels have been predetermined, if the items that have been pre-labeled come from different sub-collections. For instance, some documents may have polarity labels, whereas others may contain ratings on a 1-to-5 scale. And even when the ratings are drawn from the same set, calibration issues may arise. Consider the following from Rotten Tomatoes’ frequently-asked-questions page (http://www.rottentomatoes.com/pages/faq#judge):
On the Blade 2 reviews page, you have a negative review from James Berardinelli (2.5/4 stars), and a positive review from Eric Lurio (2.5/5). Why is Berardinelli’s review labeled Rotten and Lurio’s review labeled Fresh?
You’re seeing this discrepancy because star systems are not consistent between critics. For critics like Roger Ebert and James Berardinelli, 2.5 stars or lower out of 4 stars is always negative. For other critics, 2.5 stars can either be positive or negative. Even though Eric Lurio uses a 5 star system, his grading is very relaxed. So, 2 stars can be positive. Also, there’s always the possibility of the webmaster or critic putting the wrong rating on a review.
As another example, in reconciling reviews of conference submissions, program-committee members must often take into account the fact that certain reviewers tend to assign low scores to papers, while others have the opposite tendency. Indeed, we believe this calibration issue may be the reason why reviews of cars on Epinions come not only with a “number of stars” annotation, but also with a “thumbs up/thumbs down” indicator, which clarifies whether, regardless of the rating assigned, the review author actually intends to make a positive recommendation.
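The calibration problem described above can be made concrete with a small sketch. The Python below is an illustration only, not a method from the works cited here: it assumes that each reviewer's scores can be standardized against that reviewer's own mean and spread (a z-score), and the reviewer names, papers, and scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def calibrate(ratings):
    """Map (reviewer, item) to a z-score relative to that reviewer's own scores.

    ratings: iterable of (reviewer, item, score) triples (hypothetical format).
    """
    ratings = list(ratings)

    # Collect every score each reviewer has given.
    by_reviewer = defaultdict(list)
    for reviewer, _item, score in ratings:
        by_reviewer[reviewer].append(score)

    # Per-reviewer mean and population standard deviation.
    stats = {r: (mean(s), pstdev(s)) for r, s in by_reviewer.items()}

    calibrated = {}
    for reviewer, item, score in ratings:
        mu, sigma = stats[reviewer]
        # A reviewer who always gives the same score carries no relative signal.
        calibrated[(reviewer, item)] = 0.0 if sigma == 0 else (score - mu) / sigma
    return calibrated


# Hypothetical data: one harsh and one lenient reviewer scoring the same papers.
ratings = [
    ("harsh_reviewer", "paper_A", 1),
    ("harsh_reviewer", "paper_B", 3),
    ("lenient_reviewer", "paper_A", 3),
    ("lenient_reviewer", "paper_B", 5),
]
calibrated = calibrate(ratings)
```

In this toy data the raw scores for paper_B differ (3 versus 5), yet both reviewers rank it above paper_A by the same standardized margin. Standardization of this kind only expresses a score relative to a reviewer's habits; it does not by itself say whether a review is a positive recommendation, which is the information an explicit thumbs up/thumbs down indicator supplies.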
A further observation is that when two reviewers agree on a rating, they may have different reasons for doing so, and it may be important to indicate these reasons in the summary. A related point is that when a reviewer assigns a middling rating, it may be because he or she thinks that most aspects of the item under discussion are so-so, but it may also be because he or she sees both strong positives and strong negatives. Or, reviewers may have the same opinions about individual item features but weight these factors differently, leading to different overall sentiments.
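A minimal sketch of these two situations follows. It assumes, purely for illustration, that per-feature scores on a 1-to-5 scale are already available and that an overall rating is a weighted average of them; the feature names, scores, and weights are hypothetical and do not come from any system discussed here.

```python
from statistics import mean, pstdev


def describe(feature_scores):
    """Return (mean, spread) of a reviewer's per-feature scores."""
    values = list(feature_scores.values())
    return mean(values), pstdev(values)


def overall(feature_scores, weights):
    """Weighted average of per-feature scores (an assumed aggregation rule)."""
    total_weight = sum(weights.values())
    return sum(feature_scores[f] * weights[f] for f in feature_scores) / total_weight


# Two middling reviews with the same mean but very different make-up.
so_so = {"plot": 3, "acting": 3, "effects": 3}
mixed = {"plot": 1, "acting": 5, "effects": 3}

# Same per-feature opinions, different priorities.
plot_lover = {"plot": 3, "acting": 1, "effects": 1}
acting_lover = {"plot": 1, "acting": 3, "effects": 1}

so_so_mean, so_so_spread = describe(so_so)
mixed_mean, mixed_spread = describe(mixed)
plot_lover_overall = overall(mixed, plot_lover)
acting_lover_overall = overall(mixed, acting_lover)
```

Both reviews average 3, but only the spread separates "uniformly so-so" from "strong positives and strong negatives", which is why a summary that reports the mean alone loses the distinction. Likewise, the two hypothetical weightings turn identical feature opinions into different overall ratings.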
Indeed, Rotten Tomatoes summarizes a set of reviews both with the Tomatometer — percentage of reviews judged to be positive — and an average rating on a 1-to-10 scale. The idea, again according to the FAQ (http://www.rottentomatoes.com/pages/faq#avgvstmeter), is as follows:
The Average Rating measures the overall quality of a product based on an average of individual critic scores.
The Tomatometer simply measures the percentage of critics who recommend a certain product.
For example, while “Men in Black” scored 90% on the Tomatometer, the average rating is only 7.5/10. That means that while you’re likely to enjoy MIB, it probably wasn’t a contender for Best Picture at the Oscars.
In contrast, “Toy Story 2” received a perfect 100% on the Tomatometer with an average rating of 9.6/10. That means, not only are you certain to enjoy it, you’ll also be impressed with the direction, story, cinematography, and all the other things that make truly great films great.
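The difference between the two statistics quoted above can be sketched in a few lines. The functions below follow the FAQ's verbal definitions (percentage of positive reviews, and mean of individual scores); the review data are hypothetical and are not actual Rotten Tomatoes figures.

```python
def tomatometer(reviews):
    """Percentage of reviews judged positive."""
    positive = sum(1 for is_positive, _score in reviews if is_positive)
    return 100.0 * positive / len(reviews)


def average_rating(reviews):
    """Mean of the individual scores (here on a 1-to-10 scale)."""
    return sum(score for _is_positive, score in reviews) / len(reviews)


# Each review is (is_positive, score_out_of_10); both sets are made up.
mildly_liked = [(True, 6.5)] * 9 + [(False, 4.0)]
acclaimed = [(True, 9.5)] * 9 + [(False, 5.0)]

for name, reviews in [("mildly liked", mildly_liked), ("acclaimed", acclaimed)]:
    print(f"{name}: {tomatometer(reviews):.0f}% positive, "
          f"average {average_rating(reviews):.2f}/10")
```

Both hypothetical sets have nine positive reviews out of ten, so the percentage of positive reviews is identical, while the average rating separates a film that is mildly liked from one that is strongly praised.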
The problem of deciding whether two sentences or text passages have the same semantic content is faced not just by opinion-oriented multi-document summarizers, but by topic-based multi-document summarizers as well [247]; this has been one of the motivations behind work on paraphrase recognition [29, 30, 231] and textual entailment [28]. But as Ku et al. [170] point out, while redundant information is often discarded in traditional summarization, in opinion summarization one wants to track and report the degree of “redundancy,” since in the opinion-oriented setting the user is typically interested in the (relative) number of times a given sentiment is expressed in the corpus.
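As a rough illustration of tracking rather than discarding repeated opinions, the sketch below tallies how often each sentiment occurs. It assumes the hard part, deciding that two passages express the same opinion, has already been done, so that each opinion arrives as a hypothetical (feature, polarity) pair; it is not the procedure of Ku et al. [170].

```python
from collections import Counter


def opinion_distribution(opinions):
    """Map each distinct opinion to (count, share of all opinions)."""
    counts = Counter(opinions)
    total = sum(counts.values())
    # most_common() orders opinions from most to least frequent.
    return {op: (n, n / total) for op, n in counts.most_common()}


# Hypothetical extracted opinions, one per mention in the corpus.
opinions = [
    ("battery", "positive"),
    ("battery", "positive"),
    ("battery", "negative"),
    ("screen", "positive"),
]
distribution = opinion_distribution(opinions)
```

A topic-based summarizer would typically keep one of the two identical ("battery", "positive") mentions and drop the other; here the duplicate is exactly the signal of interest, since it shows that this sentiment accounts for half of the opinions expressed.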
Carenini et al. [52] note that a challenge in sentiment summarization is that the pieces of information to be summarized (people’s opinions) are often conflicting. This differs from the usual situation in topic-based summarization, where one typically does not assume that the document set contains conflicting sets of facts (although there are exceptions [301, 302]).