Publicly Available Resources
7.2 Evaluation Campaigns
reviews from several different product types taken from Amazon.com, some with 1-to-5 star labels, some unlabeled.
NTCIR multilingual corpus [registration required]
The corpus for the NTCIR 6 pilot task consists of news articles in Japanese, Chinese, and English and formed the basis of the Opinion Analysis Task at NTCIR6 [267]. The training data contains annotations regarding opinion holders, the opinions held by opinion holder, and sentiment polarity, as well as relevance information for a set of pre- determined topics.
The corpus of the NTCIR Multilingual Opinion-Analysis Task (MOAT) is drawn from Japanese, Chinese, and English blogs.
Review-search results sets
URL: http://www.cs.cornell.edu/home/llee/data/search-subj.html This corpus, used by Pang and Lee [234], consists of the top 20 results returned by the Yahoo! search engine in response to each of a set of 69 queries containing the word “review.” The queries were drawn from the publicly available list of real MSN users’ queries released for the 2005 KDD Cup competition [185]; the KDD data itself is available at http://www.acm.org/sigs/sigkdd/kdd2005/Labeled800Queries.zip.
The search-engine results in the corpus are annotated as to whether they are subjective or not. Note that “sales pitches” were marked objec- tive on the premise that they represent biased reviews that users might wish to avoid seeing.
blog posts expressing an opinion about a specified topic. Fourteen groups participated; Ounis et al. [227] give an overview of the results.
Some findings are as follows. With respect to performance on opinion detection, the participating systems seemed to fall into two groups.
Opinion-detection ability and relevance-determination ability seemed to be strongly correlated. While the best systems were about equally good at detecting negative sentiment as positive sentiment, systems performing at the median seemed to be a bit more effective at locat- ing documents with negative sentiment. Most participants followed a pipelined approach, where first topic relevance was tackled, and then opinion detection was applied upon the results. Perhaps the most sur- prising observation was that the organizers discovered that it was pos- sible to achieve very good relative performance by omitting the second phase of the pipeline; but we take heart in the fact that the field is still relatively young and has room to grow and mature.
TREC 2007 Blog Track. The TREC 2007 Blog track retained the opinion retrieval task and instituted determining the sentiment status (positive, negative, or mixed) of the retrieved opinions as a subtask.
The 2007 and 2006 Blog Track results are analyzed in Ounis et al.
[228]. They found that lexicon-based approaches — either where the discriminativeness of terms was determined on labeled training data or where the terms were manually compiled — constituted the main effective approaches.
TREC 2008 Blog Track. In the TREC 2008 Blog track, the polarity- identification problem was re-posed as one of ranking of positive- polarity retrieved documents by degree of positivity, and, similarly, ranking of negative-polarity retrieved documents by degree of negativ- ity. (“Mixed opinionated documents” were not to be included in these rankings.)
7.2.2 NTCIR Opinion-Related Competitions
The National Institute of Informatics (NII) runs annual meetings code- named NTCIR (NII Test Collection for Information Retrieval Systems).
Opinion analysis was featured at an NTCIR-5 workshop, and served as a pilot task at NTCIR-6 and a full-blown task at NTCIR-7.
NTCIR-6 opinion analysis pilot task. The dataset consists of newswire documents in Chinese, Japanese, and English; the organiz- ers describe this as “what we believe to be the first multilingual opin- ion analysis data set over comparable data” [93]. The four constituent tasks, intentionally designed to be fairly simple so as to encourage par- ticipation from many groups, were as follows:
• Detection of opinionated sentences.
• Detection of opinion holders.
• (optional) Polarity labeling of opinionated sentences as pos- itive, negative, or neutral.
• (optional) Detection of sentences relevant to a given topic.
Due to variation in annotator labelings, two evaluation standards were defined. In the strict evaluation, an answer is considered correct if all three annotators agreed on it. In the lenient evaluation, only a majority (i.e., two) of the annotators were required to agree with an answer for it to be considered correct.
Seki et al. [267] give an overview and the results of this evaluation exercise, noting that differences between languages make direct compar- ison difficult, especially since precision and recall were defined (slightly) differently across languages. A shortened version of this overview also exists [93].
NTCIR-7 Multilingual opinion analysis task (MOAT), 2008. Subse- quent to the NTCIR-6 pilot task, a new dataset was selected, drawn from blogs in Japanese, traditional and simplified Chinese, and English;
according to the organizers, “We plan to select and balance useful top- ics for opinion mining researchers, such as topics concerning product reviews, movie reviews, and so on.” This exercise involves six subtasks:
• Detection of opinionated sentences and opinion fragments within opinionated sentences.
• Polarity labeling of opinion fragments as positive, negative or neutral.
• (optional) Strength labeling of opinion fragments as very weak, average, or very strong.
• (optional) Detection of opinion holders.
• (optional) Detection of opinion targets.
• (optional) Detection of sentences that are relevant to a given topic.
As in the previous competition, both strict and lenient evaluation stan- dards are to be applied.
OpQA Corpus [available by request]
Stoyanov et al. [283] describes the construction of this corpus, which is a collection of opinion questions and answers together with 98 documents selected from the MPQA dataset.