Basic results
Figure 3.6 shows the percentage of extracted constructions that are actual PRCs, i.e., that are contained in our list of gold PRCs (the complete results are in Table A.2). For Lemmas, adding more context by increasing k leads to fewer actual reversing words being found with frel and MI+. This is understandable, as most reversing words occur close to the sentiment word. Adding more words from the context without position information only adds noise. As an example, consider the word “without”. At k = 1, it occurs more often in reversing contexts, e.g., “without worrying_neg”. Using k = 2 adds a large number of occurrences in non-reversing contexts, e.g., “great_pos pictures without flash”. The difference in the number of occurrences is still large enough that “without” receives a high MI score, but it gets discarded with MI+ as it occurs more often in non-reversing contexts.
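The contrast between MI and MI+ in the “without” example can be illustrated with a minimal sketch. It rests on assumptions that are not taken from this section: MI is computed as the standard mutual information between two binary variables (construction present or absent, context reversing or non-reversing), and MI+ is approximated by setting the score to zero whenever the construction is relatively more frequent in non-reversing contexts. The function names and all counts are hypothetical.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) of a 2x2 contingency table.

    n11: construction present, reversing context
    n10: construction present, non-reversing context
    n01: construction absent, reversing context
    n00: construction absent, non-reversing context
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    mi = 0.0
    for n_xy, n_x, n_y in cells:
        if n_xy > 0:  # empty cells contribute nothing
            mi += (n_xy / n) * math.log2(n * n_xy / (n_x * n_y))
    return mi


def mi_plus(n11, n10, n01, n00):
    """Assumed MI+ variant: keep the MI score only if the construction
    is relatively more frequent in reversing contexts."""
    rate_reversing = n11 / (n11 + n01)
    rate_non_reversing = n10 / (n10 + n00)
    if rate_reversing > rate_non_reversing:
        return mutual_information(n11, n10, n01, n00)
    return 0.0


# Hypothetical counts: one construction tied to reversing contexts,
# and its mirror image tied to non-reversing contexts.
reversing_like = (80, 20, 120, 780)
non_reversing_like = (20, 80, 780, 120)

print(mutual_information(*reversing_like), mi_plus(*reversing_like))
print(mutual_information(*non_reversing_like), mi_plus(*non_reversing_like))
```

Under these assumptions both constructions receive the same MI score, because MI does not distinguish which class a construction indicates, while only the first one survives the MI+ criterion.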
Simple Paths profits from the introduction of siblings at k = 2, which is a very frequent construction for polarity reversal in predicative sentences:
60 3. Polarity reversing constructions
[Figure 3.6 plots: panels (a) Lemmas, (b) Simple Paths, (c) Abstract Paths; x-axis k (1 to 6), y-axis P (0 to 0.3)]
Figure 3.6.: Percentage of correctly extracted PRCs at n = 70 for different representations (Lemmas, Simple Paths and Abstract Paths), path length k between 1 and 6 and scoring with frel, MI and MI+.
Both “not” and “good” are dependents of “is”. With bigger k, occurrences further away cannot be distinguished from close occurrences and add noise, just like with Lemmas.
For Abstract Paths increasing k has a positive effect until about k = 4. This makes sense as Abstract Paths explicitly includes the distance k and uses it to distinguish different constructions. After k = 4 the results do not change much, as only very few longer constructions occur sufficiently often to make it into the top of the list.
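To make the role of the distance k concrete, the following sketch builds an Abstract Path string from a small dependency tree. The notation is inferred from the examples in this section, not from the formal definition: the path starts with the part-of-speech tag of the sentiment word, “<” marks a step up to the head, “>” marks a step down to a dependent, intermediate nodes are given by their tag only, and the last node is given as lemma_TAG. The function name, the dictionary-based tree encoding and the two toy trees are assumptions made for illustration.

```python
def abstract_path(start, end, head, pos):
    """Return (path string, path length) from `start` to `end`.

    head: maps each word to its head (None for the root)
    pos:  maps each word to its part-of-speech tag
    """
    def chain(node):
        # The node followed by all of its ancestors up to the root.
        nodes = [node]
        while head[node] is not None:
            node = head[node]
            nodes.append(node)
        return nodes

    up, down = chain(start), chain(end)
    common = next(n for n in up if n in down)  # lowest common ancestor

    # Steps up to the common ancestor, then down to the end node.
    steps = [("<", n) for n in up[1:up.index(common) + 1]]
    steps += [(">", n) for n in reversed(down[:down.index(common)])]

    parts = [pos[start]]
    for i, (direction, node) in enumerate(steps):
        is_last = i == len(steps) - 1
        label = f"{node}_{pos[node]}" if is_last else pos[node]
        parts.append(direction + label)
    return "".join(parts), len(steps)


# "... is not good": both "not" and "good" depend on "is".
head = {"is": None, "not": "is", "good": "is"}
pos = {"is": "V", "not": "ADV", "good": "ADJ"}
print(abstract_path("good", "not", head, pos))

# "battery life could be better" (tree assumed for illustration)
head = {"could": None, "be": "could", "better": "be",
        "life": "could", "battery": "life"}
pos = {"could": "O", "be": "V", "better": "ADJ",
       "life": "N", "battery": "N"}
print(abstract_path("better", "battery", head, pos))
```

Because every step contributes one symbol to the string, two constructions that end at the same word but at different distances receive different representations, which is what distinguishes Abstract Paths from Lemmas and Simple Paths.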
The highest number of actual PRCs is found when representing syntactic context with Abstract Paths, but in general the numbers are rather low. Of the top 70 Abstract Paths extracted as PRCs at k = 4 with MI, only 13 are correct (19%). For the same setting but using frel, only 14 are correct (20%). Results for MI+ are better, but still noisy: 20 out of 70 constructions are correct (29%).
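The percentages above are the share of the top n = 70 extracted constructions that appear in the gold list. A minimal sketch of this measure follows; the function name and the placeholder construction identifiers are hypothetical, and only the counts 13, 14 and 20 come from the text.

```python
def precision_at_n(ranked, gold, n=70):
    """Fraction of the top-n ranked constructions that are in the gold set."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for construction in top if construction in gold) / len(top)


# Placeholder ranking of 100 constructions, of which 20 of the top 70 are gold.
ranked = [f"c{i}" for i in range(100)]
gold = {f"c{i}" for i in range(20)}
print(round(100 * precision_at_n(ranked, gold)))  # 29

# The three settings reported for Abstract Paths at k = 4.
for name, correct in (("MI", 13), ("frel", 14), ("MI+", 20)):
    print(name, f"{round(100 * correct / 70)}%")  # 19%, 20%, 29%
```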
Not only the numbers, but also the syntactic constructions extracted with the different scoring methods are very different. Figure 3.7 shows the top ten constructions extracted for Abstract Paths with each of the three scoring methods.
Out of the top 10 constructions extracted with frel, none is an actual PRC. The top construction for scoring with frel, ADJ<V<O>N>battery_N, comes from sentences like “battery life could be better”, which express a negative sentiment. The related construction ADJ<V<O>life_N is at position 11. Although “could/would/should be ADJ” (ADJ<V<O) is used in combination with other aspects as well (“lcd quality could be a bit better”, “10x zoom would be nice”, ...), the phrase “battery life” is the one most mentioned in this construction (39 times compared to 14 times for the second-most frequent word “quality”). The second-ranked construction, N<cards_N, is an example of a construction extracted because of the occurrence of sentiment words in aspects like “flash cards”.
3.4. Experiments on PRC extraction 61
Rank  (a) frel              (b) MI           (c) MI+
1     ADJ<V<O>N>battery_N   ADJ>O>use_V      ADJ>not_ADV
2     N<cards_N             ADJ>to_O         ADJ<V>not_ADV
3     ADV<ADJ<O<small_ADJ   ADJ>not_ADV      N>no_DT
4     PR<V>ADJ>quite_ADV    ADJ<V>not_ADV    ADJ<V<could_O
5     ADV<N<V<that_PR       ADJ<V>N>the_DT   N>low_ADJ
6     PR>V>and_O            N>no_DT          N<N>low_ADJ
7     PR<V>ADJ>O>use_V      ADJ<V<could_O    N<in_PR
8     PR<V>feature_N        ADJ>very_ADV     ADJ<N>not_ADV
9     ADJ<O<ADJ>bit_N       N>low_ADJ        PR<be_V
10    ADV<ADJ<but_O         ADJ<V>and_O      V<V>not_ADV
Figure 3.7.: Top 10 extracted PRCs for Abstract Paths at k = 4 with different scoring methods. The original figure marks correct PRCs and extracted non-PRCs with distinct symbols.
With MI, some actual PRCs are found in the top 10 constructions. The two top constructions for MI, ADJ>O>use_V and ADJ>to_O, are both indicators for the non-reversed class, as they are often used in constructions like “easy to use” in positive contexts. The negated version, “not easy to use”, appears to be less frequent.
In the results for scoring with MI+, the top two constructions of MI are filtered out, as they are indicators for the non-reversed class. All actual PRCs extracted by MI are kept. The first errors for MI+, N<N>low_ADJ at position 6 and N<in_PR at position 7, are examples of descriptions of the environment or settings, e.g., “low light shooting” or “in low light”, which we have already discussed in the context of filters.
Results with filters
After finding the best settings for the basic parameters, we now apply the different filters discussed in Section 3.2.5. We use the best performing system to test the filters, i.e., Abstract Paths with MI+ scoring, which is included in the plots as BL. Figure 3.8 shows some results split into two plots for better readability.
We can see that the Subjstrength filter performs worst of all, introducing errors instead of removing them. When we apply this filter, the number of found sentiment words drops drastically from about 83000 to only about 30000. This is expected, but the expectation was that the remaining training examples are of better quality because non-subjective uses of words where there is no real reverser present are excluded. A manual inspection of the filtered words confirms that this is not always the case. Even if there is overall a slightly better quality of training examples, this cannot compensate
for the smaller amount of training examples overall.
For the Domain filter, we train the Stanford MaxEnt classifier⁷ (Manning and Klein, 2003) with unigram features and default settings on the ProsCons data to distinguish positive from negative sentences. We use the c features with the highest weights for each class to filter the dictionary. We exclude non-word features, which leaves about 10500 features. For illustration, the top 10 positive features from the classifier are “capable”, “amazing”, “wonderful”, “convenience”, “tons”, “inexpensive”, “telephone”, “pros”, “excellent”, “solid”. The top 10 negative features are “lacks”, “worst”, “heats”, “poor”, “horrible”, “scratches”, “dislike”, “ll” (from “I’ll”), “fragile”, “concern”.
The plot shows the result for 1000 features, but the results are very similar for all settings up to the top 7000 features. For the top 1000 features, the resulting filtered Mpqa dictionary contains only 131 positive and 145 negative words. As we can see, performance drops sharply, nearly as much as for Subjstrength. Only when using the top 8000 features do the numbers rise again to the baseline, but they never improve upon it. At this point the dictionary contains about 600 words of each polarity and 75000 sentiment words are found, nearly all that are contained in the data. The reason for the low performance is that many bad sentiment words are not filtered out: even though they do not occur as sentiment words, they still occur in contexts that fit their dictionary polarity. For example, while the word “brightness” is not a positive sentiment word in our domain, it is often mentioned in positive contexts such as “you can adjust the brightness”, so it still gets a high weight for the class positive from the classifier.
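The dictionary filtering step of the Domain filter can be sketched as follows. This is only an illustration under assumptions: the classifier training itself is not reproduced, the per-class feature weights are taken as given, a dictionary word is kept only if it is among the top c features of the class matching its dictionary polarity, and “non-word features” are approximated by a purely alphabetic check. All names and weights are hypothetical.

```python
def top_features(weights, c):
    """The c highest-weighted word features of one class."""
    words = {f: w for f, w in weights.items() if f.isalpha()}  # drop non-words
    return set(sorted(words, key=words.get, reverse=True)[:c])


def domain_filter(dictionary, class_weights, c):
    """Keep dictionary entries that are top-c features of their own polarity."""
    top = {label: top_features(w, c) for label, w in class_weights.items()}
    return {word: polarity for word, polarity in dictionary.items()
            if word in top.get(polarity, set())}


# Hypothetical classifier weights (not taken from the experiments).
class_weights = {
    "positive": {"amazing": 1.9, "excellent": 1.7, "brightness": 1.2,
                 "!!": 2.5, "poor": -1.5},
    "negative": {"poor": 1.8, "horrible": 1.6, "amazing": -1.4, ":(": 2.0},
}
# Hypothetical excerpt of a sentiment dictionary.
dictionary = {"amazing": "positive", "brightness": "positive",
              "fine": "positive", "poor": "negative", "fragile": "negative"}

print(domain_filter(dictionary, class_weights, c=3))
```

In this toy setting “brightness” survives the filter because it has a high weight for the positive class, which mirrors the failure mode described above: a word that merely co-occurs with positive contexts is not removed.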
Using the filter Intensifier does not have a significant influence on the results. This filter only affects the extraction of PRCs that have adverbs in their scope. Constructions of length one which end at a sentiment word, like ADV<useless_ADJ or ADV<bad_ADJ, are dropped completely. Most other constructions just change position a few places up or down, because intensifying uses are found with roughly the same percentage in reversing and non-reversing contexts. As a result, the final extraction performance stays roughly the same with the filter as without using the filter.
The difference to the baseline when filtering for Aspects is even smaller than for Intensifier. Aspects are extracted from the ProsCons corpus. There are a total of 17484 bigram phrases; 32 occur 100 times or more, and only 200 occur more than 20 times. The plot contains the results for using the top 100 aspects. Varying the number of aspects in the ignore list between 0 and 150 does not make a big difference in the result; beyond that, results drop slightly. When we look at the two constructions discussed as examples in Section 3.2.5, N<N>low_ADJ disappears completely from the list and N<in_PR is relegated to position 2106. Unfortunately, they are replaced with equally bad alternatives: PR>be_V, PR<V>N>the_DT (mainly because of “although the X is” with “although” as a sentiment word) and ADJ<at_PR (from “at least” or “at best”).
[Figure 3.8 plots: panels (a) Subjectivity filters (BL, Subjstrength, Aspects, POSignore) and (b) Other filters (BL, Domain, Intensifier, Singletons); x-axis k (1 to 6), y-axis P (0 to 0.3)]
Figure 3.8.: Percentage of correctly extracted PRCs at n = 70 for different path lengths k with filters. BL refers to the best system of the base settings, i.e., Abstract Paths scored with MI+.
The filter Singletons is the first to improve upon the baseline, although only slightly. The first change is to (correctly) remove ADJ<O>at_PR at position 29, which only occurs in combination with the sentiment word “least” when used as in “at least 130k pixels”, where “least” is not a sentiment word at all. Next are ADJ>O>white_ADJ (“black and white”) and some constructions with the sentiment word “although”: PR>V>it_O, PR>V>i_O and PR>V>not_ADV. The example that prompted the filter, ADJ<drive_N, is still in the list because the data contains several modifiers for the word “drive”; some examples are “free [zip] drive”, “powerful [high-speed lens] drive” and “portable [hard] drive”.
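The behaviour described for Singletons can be sketched as follows, assuming (as the examples in this paragraph suggest) that the filter drops every construction that co-occurs with only one distinct sentiment word. The function name, the threshold parameter and the toy occurrence list are hypothetical.

```python
from collections import defaultdict


def singletons_filter(occurrences, min_words=2):
    """Keep constructions seen with at least `min_words` distinct sentiment words.

    occurrences: iterable of (construction, sentiment_word) pairs
    """
    words_per_construction = defaultdict(set)
    for construction, sentiment_word in occurrences:
        words_per_construction[construction].add(sentiment_word)
    return {construction
            for construction, words in words_per_construction.items()
            if len(words) >= min_words}


# Toy occurrences modelled on the examples in the text.
occurrences = [
    ("ADJ<O>at_PR", "least"), ("ADJ<O>at_PR", "least"), ("ADJ<O>at_PR", "least"),
    ("ADJ<drive_N", "free"), ("ADJ<drive_N", "powerful"), ("ADJ<drive_N", "portable"),
    ("ADJ>not_ADV", "good"), ("ADJ>not_ADV", "easy"),
]
print(sorted(singletons_filter(occurrences)))
```

In this sketch ADJ<O>at_PR is removed because it is only ever seen with “least”, while ADJ<drive_N stays because three different modifiers of “drive” count as three distinct sentiment words.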
The biggest improvement over the baseline comes from using the filter POSignore, although it still only improves precision from 29% to 31% for k = 4. POSignore allows us to get rid of PR>be_V, PR<V>N>the_DT and many other constructions that are based on occurrences of very questionable sentiment words like “although”, “at least” or “above”, which almost always occur in non-sentiment contexts. We also tried combining POSignore with the other filters, but got no or only marginal improvement. In all cases, correctly excluded constructions were replaced with equally bad alternatives, so that the overall performance change was minimal.
Our final set of PRCs that we use in the following experiments is the set extracted with the following settings: Abstract Paths, MI+ scoring, path length k = 4 and POSignore filter. Even our best result does not look very promising (only 31% of the extracted constructions are really PRCs). As we will see, however, we can still use noisy PRCs successfully in polarity classification.