The form of Equation 5.3 suggests one very versatile type of constraint, sometimes called feature labeling [Druck et al., 2008]. This is most easily viewed from the point of view of a log-linear model, such as a maximum entropy model [Jaynes, 1957, Good, 1963] or conditional random field [Lafferty et al., 2001]. Recall that the form of the model for a conditional random field (CRF) is given by:

$$p(y) \propto \exp\big(f(x, y) \cdot \theta\big) \qquad (5.4)$$

where, in order for the normalizer of p(y) to be efficiently computable, f(x, y) must decompose as a sum across the cliques of the model, and the model must have low treewidth. Consequently, any feature function that we would be able to use for a CRF or maximum entropy model would be a decomposable constraint feature φ(x, y).
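As a minimal sketch of the log-linear form in Equation 5.4, the snippet below normalizes exp(f(x, y) · θ) by brute-force enumeration over a tiny label set. This is only an illustration for a maximum entropy classifier; a real CRF would compute the normalizer with dynamic programming over the cliques. The function name, the feature vectors, and the weights are all hypothetical.

```python
import math

def log_linear_distribution(features, theta):
    """Return p(y) proportional to exp(f(x, y) . theta) for an enumerable label set.

    features: dict mapping each label y to its feature vector f(x, y)
    theta:    weight vector of the same length as each feature vector
    """
    scores = {
        y: sum(f_k * t_k for f_k, t_k in zip(f, theta))
        for y, f in features.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    unnormalized = {y: math.exp(s - max_score) for y, s in scores.items()}
    normalizer = sum(unnormalized.values())
    return {y: u / normalizer for y, u in unnormalized.items()}

# Hypothetical features for one document that contains the word "puck":
# the first feature fires only for the pair ("puck" present, label "hockey").
features = {
    "hockey": [1.0, 0.0],
    "baseball": [0.0, 0.0],
}
theta = [2.0, 1.5]  # illustrative weights, not learned from data
p = log_linear_distribution(features, theta)
```

With these made-up weights the model puts most of its mass on "hockey", since only that label activates the positively weighted feature.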

For example, Druck et al. [2009] train a dependency parser where they ask a user to label some features of the form “given that a sentence has a verb and a noun, what is the probability that the verb dominates the noun?” The user then states a constraint in the form of an approximate probability for that event. In earlier work, Druck et al. [2008] give the following example of a feature labeling constraint. Suppose we are interested in classifying documents into the categories “baseball” vs. “hockey.” The word “puck” appearing in the document should give us a fairly strong signal that the correct label is “hockey.” We can encode this knowledge as the constraint that, for the collection of articles where the word “puck” appears, the probability of a label of “hockey” should be at least α, where α is a confidence value, e.g. 95%. Specifically, to encode this using our notation, we would have:

$$\phi(X, Y) = \sum_{i=1}^{|X|} \begin{cases} 1 & \text{if article } x_i \text{ contains the word “puck” and } y_i = \text{“hockey”} \\ 0 & \text{otherwise,} \end{cases} \qquad (5.5)$$

and require the expectation of this feature to be at least an α fraction of the articles that contain the word “puck”:

$$E_q[\phi(X, Y)] \geq \alpha \sum_{i=1}^{|X|} \begin{cases} 1 & \text{if article } x_i \text{ contains the word “puck”} \\ 0 & \text{otherwise.} \end{cases}$$
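The constraint can be checked numerically once we have per-article posteriors. The sketch below assumes that q factorizes over articles, so that by linearity the expectation of φ is the sum of q(y_i = “hockey”) over the articles containing “puck”. The function name, the tokenized documents, and the posterior values are hypothetical.

```python
def puck_constraint(docs, q_hockey, alpha):
    """Evaluate the corpus-wide "puck" constraint.

    docs:     list of token lists, one per article
    q_hockey: q_hockey[i] is the posterior probability q(y_i = "hockey")
    alpha:    required fraction of "puck" articles labeled "hockey"
    Returns (expected feature value, required lower bound, satisfied?).
    """
    has_puck = [1.0 if "puck" in doc else 0.0 for doc in docs]
    # E_q[phi(X, Y)]: only articles containing "puck" contribute.
    expected_phi = sum(h * q for h, q in zip(has_puck, q_hockey))
    # Right-hand side: alpha times the number of articles containing "puck".
    bound = alpha * sum(has_puck)
    return expected_phi, bound, expected_phi >= bound

docs = [["the", "puck", "slid"], ["home", "run"], ["puck", "drop"]]
q_hockey = [0.99, 0.20, 0.93]  # illustrative posteriors
expected_phi, bound, ok = puck_constraint(docs, q_hockey, alpha=0.95)
```

The second article does not contain “puck”, so its low posterior for “hockey” has no effect on either side of the inequality.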

It is worth noting that even though there might be a large number of articles, we have chosen to encode the information as a single constraint. The alternative would have been to include a constraint for each article that contains the word “puck” and require that the expectation of each of these constraints is at least α. However, this would have required that each article has a label of “hockey” with probability at least α, rather than requiring that an α fraction of the articles are labeled as “hockey” in expectation. In the multiple constraint case, we have to prefer “hockey” over “baseball” for each article, while in the single constraint case, we can prefer “baseball” over “hockey” in a few articles as long as there are a sufficient number where we strongly prefer “hockey” over “baseball.” For this particular piece of prior knowledge, a single corpus-wide constraint is more appropriate.
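The difference between the two encodings can be made concrete with a small sketch. It assumes we already hold the posteriors q(y_i = “hockey”) for the articles that contain “puck”; both function names and the numbers are hypothetical.

```python
def satisfies_corpus_constraint(q_hockey, alpha):
    """Single constraint: an alpha fraction is "hockey" in expectation."""
    return sum(q_hockey) >= alpha * len(q_hockey)

def satisfies_per_article_constraints(q_hockey, alpha):
    """One constraint per article: each must be "hockey" with probability >= alpha."""
    return all(q >= alpha for q in q_hockey)

# Twenty "puck" articles: nineteen confidently "hockey", one leaning "baseball".
q_hockey = [1.0] * 19 + [0.3]
alpha = 0.95

corpus_ok = satisfies_corpus_constraint(q_hockey, alpha)             # 19.3 >= 19.0
per_article_ok = satisfies_per_article_constraints(q_hockey, alpha)  # 0.3 < 0.95
```

Here the corpus-wide constraint holds while the per-article version fails, which matches the case of one article where “baseball” is preferred.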

These kinds of feature-labeling constraints have been used extensively by a number of researchers [Liang et al., 2009, Druck et al., 2008, 2009, Mann and McCallum, 2007, Bellare et al., 2009, Mann and McCallum, 2008]. The difficulty in creating them is that choosing α for a number of features can be hard. In the dependency parsing verb-noun example above, it might not be immediately clear what the probability that a verb dominates a noun should be. If there is only one noun and one verb, then it should probably be close to one. In a sentence with only one noun and several verbs, it is not clear which of the noun-verb links should be active. One way to gather the constraint feature expectations arises in a semi-supervised setting where, in addition to unlabeled data, we are given a labeled corpus. Liang et al. [2009] generate feature expectations in this way and report significant improvements in performance. Quadrianto et al. [2009] use a similar type of constraint, although in a slightly different framework, and also report state-of-the-art results.
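One simple way to obtain a value of α from a labeled corpus is an empirical frequency, sketched below. This is an assumption made for illustration and not necessarily the estimator used in the cited work; the function name and the toy corpus are hypothetical.

```python
def estimate_alpha(labeled_docs, word, label):
    """Fraction of labeled documents containing `word` whose label is `label`.

    labeled_docs: list of (tokens, label) pairs
    Returns None when `word` never occurs, since no estimate is available.
    """
    labels_with_word = [y for tokens, y in labeled_docs if word in tokens]
    if not labels_with_word:
        return None
    hits = sum(1 for y in labels_with_word if y == label)
    return hits / len(labels_with_word)

labeled_docs = [
    (["puck", "goal"], "hockey"),
    (["puck", "bat"], "baseball"),
    (["puck"], "hockey"),
    (["pitcher"], "baseball"),
]
alpha_hat = estimate_alpha(labeled_docs, "puck", "hockey")  # 2 of 3 "puck" documents
```

On a small labeled corpus such an estimate is noisy, so in practice one might set the constraint bound somewhat below the observed frequency.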

An alternative source of feature labeling constraints would be a rule-based system for solving the problem. For example, suppose that we have available a rule-based part-of-speech tagger that we are confident has an accuracy of at least 95%. Additionally, we might have a small labeled corpus. If we want to combine these two resources, we could use the PR framework to train a tagger on the labeled data, but use the constraint that on unlabeled data, our learned tagger should agree with the rule-based tagger on at least 95% of the part-of-speech tags, in expectation. We could use other sources of noisy labels: Chapter 8 describes our experiments with training both discriminative and generative dependency parsers using parallel text and a foreign language parser as a source of noisy labels. Other possibilities include labels obtained from untrained annotators, or game-based data collection methods. It is also possible to get similar feature labels from related tasks. For example, if we want to train a part-of-speech tagger, but have a corpus annotated with named entities, we could add a constraint that the tokens corresponding to named entities should have proper noun, noun or adjective tags 95% of the time. If instead we want to induce both a part-of-speech tagger and a named entity recognizer, we could include agreement constraints as described in the next section.
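The rule-based tagger example can be sketched in the same way as the “puck” constraint: the constraint feature counts the tokens on which the learned tagger and the rule-based tagger agree, and its expectation must reach 95% of the token count. The snippet assumes per-token posterior tag distributions are available; the function names, tags, and probabilities are hypothetical.

```python
def expected_agreement(q_tags, rule_tags):
    """Expected number of tokens on which the learned tagger matches the rules.

    q_tags:    list of dicts, one per token, mapping tag -> posterior probability
    rule_tags: list of tags produced by the rule-based tagger, one per token
    """
    return sum(q.get(tag, 0.0) for q, tag in zip(q_tags, rule_tags))

def satisfies_agreement(q_tags, rule_tags, min_rate=0.95):
    """Check that the expected agreement is at least min_rate of all tokens."""
    return expected_agreement(q_tags, rule_tags) >= min_rate * len(rule_tags)

q_tags = [
    {"DT": 0.99, "NN": 0.01},
    {"NN": 0.98, "VB": 0.02},
    {"VB": 0.90, "NN": 0.10},
]
rule_tags = ["DT", "NN", "VB"]
agreement = expected_agreement(q_tags, rule_tags)  # 0.99 + 0.98 + 0.90
ok = satisfies_agreement(q_tags, rule_tags)
```

As with the document classification example, this is a single corpus-wide constraint, so the learned tagger may disagree with the rules on individual tokens as long as the expected agreement rate stays high.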
