
Methods for Automatically Determining Review Quality

Summarization

5.2 Multi-Document Opinion-Oriented Summarization

5.2.4 Review(er) Quality

5.2.4.1 Methods for Automatically Determining Review Quality

In a way, one could consider the review-quality determination problem as a type of readability assessment and apply essay-scoring techniques [19, 99]. However, while some of the systems described below do try to take into account some readability-related features, they are tailored specifically to product reviews.

Kim et al. [161], Zhang and Varadarajan [328], and Ghose and Ipeirotis [106] attempt to automatically rank certain sets of reviews on the Amazon.com website according to their helpfulness or utility, using a regression formulation of the problem. The domains considered differ somewhat: MP3 players and digital cameras in the first case; Canon electronics, engineering books, and PG-13 movies in the second case; and AV players plus digital cameras in the third case.

Liu et al. [193] convert the problem into one of low-quality review detection (i.e., binary classification), experimenting mostly with manually (re-)annotated reviews of digital cameras, although CNet editorial ratings were also considered on the assumption that these can be treated as trustworthy. Rubin and Liddy [261] also sketch a proposal for assessing whether reviews are credible.

Kim et al. [161] study which of a multitude of length-based, lexical, POS-count, product-aspect-mention count, and metadata features are most effective when utilizing SVM regression. The best feature combination turned out to be review length plus tf-idf scores for lemmatized unigrams in the review plus the number of “stars” the reviewer assigned to the product. Somewhat disappointingly, the best pair of features among these was the length of the review and the number of stars. (Using “number of stars” as the only feature yielded similar results to using just the deviation of the number of stars given by the particular reviewer from the average number of stars granted by all reviewers for the item.) The effectiveness of using all unigrams appears to subsume that of using a select subset, such as sentiment-bearing words from the General Inquirer lexicon [281].
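To make the feature combination above concrete, here is a minimal sketch of how review length, unigram tf-idf, star rating, and star deviation might be computed for the reviews of one product. It is not the implementation of Kim et al. [161]: the tokenizer stands in for lemmatization, length is measured in tokens, the idf formula log(N/df) is one common choice, and the function names (`tokenize`, `tfidf_vectors`, `review_features`) are hypothetical. The resulting features would then be passed to a regressor such as an SVM, which is omitted here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for lemmatized unigrams (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(token_lists):
    """Return one {term: tf-idf} dict per document, using idf = log(N / df) (assumption)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append(
            {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        )
    return vectors


def review_features(reviews):
    """reviews: list of (text, stars) pairs, all written about the same product."""
    token_lists = [tokenize(text) for text, _ in reviews]
    mean_stars = sum(stars for _, stars in reviews) / len(reviews)
    rows = []
    for tokens, tfidf, (_, stars) in zip(token_lists, tfidf_vectors(token_lists), reviews):
        rows.append({
            "length": len(tokens),                 # review length in tokens
            "stars": stars,                        # rating given by this reviewer
            "star_deviation": stars - mean_stars,  # deviation from the item's average
            "tfidf": tfidf,                        # sparse unigram weights
        })
    return rows


rows = review_features([("Great camera, great lens", 5), ("Bad battery", 1)])
```

With two toy reviews rated 5 and 1 stars, the item average is 3, so the first review gets a star deviation of +2 and the second of -2.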

Zhang and Varadarajan [328] use a different feature set. They employ a finer classification of lexical types, and more sources for subjective terms, but do not include any meta-data information. Interestingly, they also consider the similarity between the review in question and the product specification, on the premise that a good review should discuss many aspects of the product; and they include the review’s similarity to editorial reviews, on the premise that editorial reviews represent high-quality examples of opinion-oriented text. (David and Pinch [70] observe, however, that editorial reviews for books are paid for and are meant to induce sales of the book.) However, these latter two original features do not appear to enhance performance. The features that appear to contribute the most are the class of shallow syntactic features, which, the authors speculate, seem to characterize style; examples include counts of words, sentences, wh-words, comparatives and superlatives, proper nouns, etc. Review length seems to be very weakly correlated with utility score.
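As an illustration of two of the feature types just described, the sketch below computes a few shallow style counts and a bag-of-words cosine similarity between a review and a product specification. This is an assumed simplification, not the feature extraction of Zhang and Varadarajan [328]: the wh-word list, the regex tokenizer, and the use of raw term counts in the cosine are illustrative choices, and counts that need a part-of-speech tagger (comparatives, superlatives, proper nouns) are left out.

```python
import math
import re
from collections import Counter

# Illustrative wh-word list (assumption; not taken from the cited paper).
WH_WORDS = {"what", "which", "who", "whom", "whose", "when", "where", "why", "how"}


def words(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def shallow_counts(review):
    """Counts of words, sentences, and wh-words in a review."""
    tokens = words(review)
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    return {
        "words": len(tokens),
        "sentences": len(sentences),
        "wh_words": sum(1 for t in tokens if t in WH_WORDS),
    }


def cosine_similarity(text_a, text_b):
    """Cosine between raw term-count vectors; 0.0 if either text has no tokens."""
    a, b = Counter(words(text_a)), Counter(words(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


counts = shallow_counts("Why buy it? The zoom is sharp.")
spec_similarity = cosine_similarity("The zoom is sharp", "10x optical zoom, 3 inch screen")
```

The same `cosine_similarity` helper could be applied to a review and an editorial review to approximate the second similarity feature.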

We thus see that Kim et al. [161] find that meta-data and very simple term statistics suffice, whereas Zhang and Varadarajan [328] observe that more sophisticated cues, apparently correlated with linguistic aspects, are the most important. Possibly, the difference is a result of the difference in domain choice: we speculate that book and movie reviews can involve more sophisticated language use than what is exhibited in reviews of electronics.

Declaring themselves influenced by prior work on creating subjectivity extracts [232], Ghose and Ipeirotis [106] take a different approach. They focus on the relationship between the subjectivity of a review and its helpfulness. The basis for measuring review subjectivity is as follows: using a classifier that outputs the probability of a sentence being subjective, one can compute for a given review the average subjectiveness-probability over all its sentences, or the standard deviation of the subjectivity scores of the sentences within the review. They found that both the standard deviation of the sentence subjectivity scores and a readability score (review length in characters divided by number of sentences) have a strongly statistically significant effect on utility evaluations, and that this is sometimes true of the average subjectiveness-probability as well. They then suggest on the basis of this and other evidence that it is extreme reviews that are considered to be most helpful, and develop a helpfulness predictor based on their analysis.
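The three review-level quantities described above can be sketched as follows. The sentence-level subjectivity classifier is passed in as a function; `toy_prob` below is a hypothetical placeholder, not the classifier used by Ghose and Ipeirotis [106]. The regex sentence splitter and the choice of the population standard deviation are likewise assumptions, since the text does not specify them.

```python
import re
import statistics


def split_sentences(review):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace (assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]


def subjectivity_features(review, subjectivity_prob):
    """subjectivity_prob: callable mapping a sentence to P(sentence is subjective)."""
    sentences = split_sentences(review)
    if not sentences:
        raise ValueError("review contains no sentences")
    probs = [subjectivity_prob(s) for s in sentences]
    return {
        "mean_subjectivity": statistics.mean(probs),
        # Population standard deviation; the sample version is an equally plausible reading.
        "std_subjectivity": statistics.pstdev(probs),
        # Readability score as described: length in characters / number of sentences.
        "readability": len(review) / len(sentences),
    }


def toy_prob(sentence):
    """Hypothetical stand-in classifier: 'love' marks a sentence as subjective."""
    return 0.9 if "love" in sentence.lower() else 0.1


features = subjectivity_features("I love it. It has 12 megapixels.", toy_prob)
```

A review that mixes one strongly subjective sentence with one objective sentence, as in this toy input, has a moderate mean but a high standard deviation, which is the kind of contrast these features are meant to capture.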

Liu et al. [193] considered features related to review and sentence length; brand, product and product-aspect mentions, with special consideration for appearances in review titles; sentence subjectivity and polarity; and “paragraph structure.” The latter refers to paragraphs as delimited by automatically determined keywords. Interestingly, the technique of taking the 30 most frequent pairs of nouns or noun phrases that appear at the beginning of a paragraph as keywords yields separator pairs such as “pros”/“cons,” “strength”/“weakness,” and “the upsides”/“downsides.” (Note that this differs from identifying pro or con reasons themselves [157], or identifying the polarity of sentences. Note also that other authors have claimed that different techniques are needed for situations in which pro/con delimiters are mandated by the format imposed by a review aggregation site but a separate detailed textual description must also be included, as in Epinions, as opposed to settings where such delimiters need not be present or where all text is placed in the context of such delimiters [191].) Somewhat unconventionally with respect to other text-categorization work, the baseline was taken as SVMlight run with three sentence-level statistics as features; that is, the performance of a classifier trained using bag-of-word features is not reported. Given this unconventional starting point, the addition of the features that do not reflect subjectivity or sentiment helps. Including subjectivity and polarity on top of what has already been mentioned does not yield further improvement, and use of title-appearance for mentions did not seem to help.

Review- or opinion-spam detection, that is, the identification of deliberately misleading reviews, is a line of work in the same vein by Jindal and Liu ([141], short version available as Jindal and Liu [140]). One challenge these researchers faced was the difficulty in obtaining ground truth. Therefore, for experimental purposes they first re-framed the problem as one of trying to recognize duplicate reviews, since a priori it is hard to see why posting repeats of reviews is justified. (However, one potential problem with the assumption that repeated reviews constitute some sort of manipulation attempt, at least for the Amazon data that was considered, is that Amazon itself cross-posts reviews across different products, where “different” includes different instantiations (e.g., e-book vs. hardcover) or subsequent editions of the same item (Gueorgi Kossinets and Cristian Danescu Niculescu-Mizil, personal communication). Specifically, in a sample of over 1 million Amazon book reviews, about one-third were duplicates, but these were all due to Amazon’s cross-posting. Human error (e.g., accidentally hitting the “submit” button twice) causes other cases of non-malicious duplicates.) A second round of experiments attempted to identify “reviews on brands only,” ads, and “other irrelevant reviews containing no opinions” (e.g., questions, answers, and random texts). Some of the features used were similar to those employed in the studies described above; others included features on the review author and the utility evaluations themselves. The overall message was that this kind of spam is relatively easy to detect.
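The duplicate-recognition step mentioned above can be illustrated with a simple near-duplicate detector. The passage does not say how duplicates were recognized, so everything here is an assumption made for illustration: reviews are compared by Jaccard similarity over word bigrams ("shingles"), the 0.9 threshold is arbitrary, and the function names are hypothetical. All pairs are compared, which is quadratic in the number of reviews and only suitable for small samples.

```python
import re
from itertools import combinations


def shingles(text, n=2):
    """Set of word n-grams after lowercasing and stripping punctuation."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(reviews, threshold=0.9):
    """Return index pairs (i, j), i < j, whose shingle sets overlap at or above threshold."""
    sets = [shingles(r) for r in reviews]
    return [
        (i, j)
        for i, j in combinations(range(len(reviews)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


pairs = near_duplicate_pairs([
    "Great phone, works well.",
    "great phone works well",
    "Terrible battery life.",
])
```

As the cross-posting caveat above suggests, pairs flagged this way are only candidates: a detector like this cannot by itself tell manipulation apart from site-side cross-posting or accidental double submission.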