Our theoretical and empirical analysis highlights the fundamental differences between word-based (semi-Markov tagging) and character-based (Markov tagging) models, and these differences suggest directions for designing new models. The analysis indicates that the theoretical differences lead to different error distributions, since each approach is based on a particular view of segmentation. It also points out several drawbacks of each approach, which may help both kinds of model overcome their shortcomings. We may naturally ask what other methods may prove fruitful. For example, one weakness of the word-based model is its word induction ability, which is partially caused by its neglect of the internal structure of words; a word-based model may be improved by addressing this problem. Character-based segmenters, on the other hand, need a way to utilize dynamic word token information. For example, Zhang et al. [2006] proposed a subword-based model, in which the basic predicting unit is larger than a character yet smaller than a word.
While the two mechanisms overlap in their overall numerical results, they are not redundant. Each segmentation model has strengths and weaknesses for certain design problems. We may construct a single system integrating the strengths of each segmenter. In this chapter, we try this direction by using an ensemble learning technique. The question of how to combine systems with different architectures is currently a hot topic in a majority of NLP tasks. System combination strategies can be roughly divided into two categories: (1) learning-based post-inference and (2) learning-free post-inference. For example, in dependency parsing, several methods have been proposed to integrate transition-based and graph-based parsers [Surdeanu and Manning, 2010]. Previous work pays much attention to feature-based integration, in which one system serves as the main problem solver and uses features generated from the other systems [Nivre and McDonald, 2008; Torres Martins et al., 2008]. This kind of combination method involves learning in the training phase: a meta-learner is trained to provide combination decisions. The other kind of integration architecture directly combines the outputs of different systems, for example by voting. Note that this kind of combination method may involve a complex inference procedure. For example, a re-parsing technique was successfully developed to combine the outputs provided by multiple parsers [Sagae and Lavie, 2006b]. In their method, the dependency structures given by different parsers are first used to create a weighted graph. Finding the optimal dependency structure is then formulated as a maximum spanning tree (MST) inference problem over this graph.
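To make the idea of learning-free output combination concrete for segmentation, the following is a minimal sketch of majority voting over word boundaries. It is only an illustration under stated assumptions: it is neither the Bagging-based method of this chapter nor the re-parsing technique cited above, and the function names `boundaries` and `vote` are hypothetical.

```python
from collections import Counter


def boundaries(words):
    """Return the set of character offsets at which a word ends."""
    ends, pos = set(), 0
    for word in words:
        pos += len(word)
        ends.add(pos)
    return ends


def vote(segmentations):
    """Combine several segmentations of the same sentence by majority vote.

    A word-end position is kept when more than half of the systems
    propose it. No parameters are learned (assumed toy scheme).
    """
    text = "".join(segmentations[0])
    # Every system must have segmented the same character string.
    assert all("".join(seg) == text for seg in segmentations)

    counts = Counter(p for seg in segmentations for p in boundaries(seg))
    threshold = len(segmentations) / 2
    kept = sorted(p for p, c in counts.items() if c > threshold)

    # Rebuild the words from the surviving boundaries. The sentence end
    # is proposed by every system, so it always survives the vote.
    words, start = [], 0
    for end in kept:
        words.append(text[start:end])
        start = end
    return words


outputs = [
    ["北京", "大学生"],
    ["北京大学", "生"],
    ["北京", "大学", "生"],
]
print(vote(outputs))
```

Voting on boundaries rather than on whole words keeps the combined output a valid segmentation of the input, since any set of boundaries that includes the sentence end defines one.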
The Bagging-based combination method proposed in this chapter is a learning-free inference method. In the next chapter, we present a learning-based inference method, namely sub-word tagging, which enhances word segmentation by combining three tagging methods.
Chapter 3
Stacked Sub-word Tagging for Joint Word Segmentation and POS Tagging
The large combined search space of joint word segmentation and Part-of-Speech (POS) tagging makes efficient decoding very hard. As a result, effective high-order features representing rich contexts are inconvenient to use. In this chapter, we propose a novel stacked sub-word tagging model for this task that addresses both efficiency and effectiveness. Our solution is a two-step process. First, multiple heterogeneous solvers are trained to produce coarse segmentation and POS information. Second, the outputs of these predictors are merged into sequences of largest non-overlapped strings, which are further bracketed and labeled with POS tags by a fine-grained sub-word tagger. The coarse-to-fine search scheme is efficient, while in the sub-word tagging step rich contextual features can be approximately derived. We also study the annotation ensemble problem and show that sub-word tagging is a robust solution, in the sense that the coarse-grained solvers can be trained on heterogeneous annotations. Evaluation on the Penn Chinese Treebank and People’s Daily data shows that our model yields significant improvements over the best system reported in the literature.
3.1 Background
3.1.1 The Problem
Word segmentation and part-of-speech (POS) tagging are fundamental steps for more advanced Chinese language processing tasks, such as parsing and semantic role labeling. Joint approaches that resolve the two tasks simultaneously have received much attention in recent research. Previous work has shown that joint solutions lead to accuracy improvements over pipelined systems by avoiding segmentation error propagation and by exploiting POS information to help segmentation. A challenge for joint approaches is the large combined search space, which makes efficient decoding and structured learning of parameters very hard. Moreover, the representation ability of the models is limited, since using rich contextual word features makes the search intractable. To overcome these efficiency and effectiveness limitations, approximate inference and reranking techniques have been explored in previous work [Jiang et al., 2008b; Zhang and Clark, 2010].
Given a sequence of characters $c = (c_1, \ldots, c_{\#c})$, the task of word segmentation and POS tagging is to predict a sequence of word and POS tag pairs $y = (\langle w_1, p_1 \rangle, \ldots, \langle w_{\#y}, p_{\#y} \rangle)$, where $w_i$ is a word, $p_i$ is its POS tag, and the “#” symbol denotes the number of elements in a variable. In order to avoid error propagation and make use of POS information for word segmentation, the two tasks should be resolved jointly. Previous research has shown that integrated methods outperform pipelined systems [Jiang et al., 2008a; Ng and Low, 2004; Zhang and Clark, 2008a]. A major challenge for such joint systems is the large search space faced by the decoder, which can make decoding inefficient.
3.1.2 Character-Based and Word-Based Methods
As in word segmentation, both word-based (semi-Markov tagging) and character-based (Markov tagging) methods are popular for joint word segmentation and POS tagging. Word segmentation can be viewed as a bracketing problem, while joint segmentation and tagging can be viewed as a labeled bracketing problem.
In the “word-based” approach, the basic predicting units are words themselves. This kind of solver sequentially decides whether a local sequence of characters makes up a word and, if so, what its POS tag is. In particular, a word-based solver reads the input sentence from left to right and predicts whether the current span of contiguous characters is a word token and which class it belongs to. Solvers may use previously predicted words and their POS information as clues for finding a new word. After one word is found and classified, the solver moves on and searches for the next possible word. This word-by-word method for segmentation was first proposed in [Zhang and Clark, 2007], and was then extended to POS tagging in [Zhang and Clark, 2008a].
In the “character-based” approach, the basic processing units are the characters that compose words, and joint segmentation and tagging is formulated as the classification of characters into POS tags with boundary information. For example, the label B-NN indicates that a character is located at the beginning of a noun. Using this method, POS information is allowed to interact with segmentation. This character-by-character method for segmentation was first proposed in [Xue, 2003], and was then extended to POS tagging in [Ng and Low, 2004]. One main disadvantage of this model is the difficulty of incorporating whole-word information. Note that the hybrid approach described in [Kruengkrai et al., 2009; Nakagawa and Uchimoto, 2007] is also a character-based approach, since the word information it uses is word type information.
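The two views of the same analysis can be illustrated with a short sketch that converts word and POS pairs into labeled brackets (spans) and into per-character labels, and back. The text above mentions only the label B-NN; the two-way B/I boundary scheme used here, the function names, and the example sentence are assumptions made for illustration, and other boundary tag sets are equally possible.

```python
def to_spans(pairs):
    """Labeled bracketing view: (start, end, pos) over character offsets."""
    spans, start = [], 0
    for word, pos in pairs:
        spans.append((start, start + len(word), pos))
        start += len(word)
    return spans


def to_char_tags(pairs):
    """Character-based view: one 'B-POS' or 'I-POS' label per character.

    Assumed scheme: B marks the first character of a word, I any other.
    """
    chars, tags = [], []
    for word, pos in pairs:
        for i, ch in enumerate(word):
            chars.append(ch)
            tags.append(("B-" if i == 0 else "I-") + pos)
    return chars, tags


def from_char_tags(chars, tags):
    """Recover word and POS pairs from per-character labels.

    The POS of a word is taken from its first character; the POS part
    of the following I- labels is ignored in this sketch.
    """
    pairs = []
    for ch, tag in zip(chars, tags):
        boundary, pos = tag.split("-", 1)
        if boundary == "B" or not pairs:
            pairs.append([ch, pos])   # start a new word
        else:
            pairs[-1][0] += ch        # extend the current word
    return [(word, pos) for word, pos in pairs]


pairs = [("我", "PN"), ("喜欢", "VV"), ("北京", "NR")]
chars, tags = to_char_tags(pairs)
print(to_spans(pairs))
print(list(zip(chars, tags)))
print(from_char_tags(chars, tags) == pairs)
```

The round trip shows why the character-based formulation is a plain sequence labeling problem, while the span view makes explicit that a word-based solver scores whole brackets at once.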
3.1.3 Stacked Learning
Stacked generalization is a meta-learning algorithm that was first proposed in [Wolpert, 1992] and [Breiman, 1996b]. The idea is to include two “levels” of predictors. The first level includes one or more predictors $g_1, \ldots, g_K : \mathbb{R}^d \to \mathbb{R}$; each receives input $x \in \mathbb{R}^d$ and outputs a prediction $g_k(x)$. The second level consists of a single function $h : \mathbb{R}^{d+K} \to \mathbb{R}$ that takes as input $\langle x, g_1(x), \ldots, g_K(x) \rangle$ and outputs a final prediction $\hat{y} = h(x, g_1(x), \ldots, g_K(x))$.
Training is done as follows. The training data $S = \{(x_t, y_t) : t \in [1, T]\}$ is split into $L$ equal-sized disjoint subsets $S_1, \ldots, S_L$. Then functions $g^1, \ldots, g^L$ (where $g^l = \langle g^l_1, \ldots, g^l_K \rangle$) are separately trained on $S - S_l$, and are used to construct the augmented data set $\hat{S} = \{(\langle x_t, \hat{y}^1_t, \ldots, \hat{y}^K_t \rangle, y_t) : \hat{y}^k_t = g^l_k(x_t) \text{ and } x_t \in S_l\}$. Finally, each $g_k$ is trained on the original data set and the second-level predictor $h$ is trained on $\hat{S}$. The intent of the cross-validation scheme is that $\hat{y}^k_t$ is similar to the prediction produced by a predictor that is learned on a sample that does not include $x_t$.
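The training procedure can be sketched in a few lines of Python. This is a minimal illustration of the cross-validation scheme described above, not the implementation used in this chapter: the names `stack_fit`, `fit_mean` and `fit_nearest` are hypothetical, the two toy learners stand in for real first-level predictors, and the folds are formed by interleaving, which only approximates equal-sized subsets.

```python
def fit_mean(data):
    """Toy learner: always predicts the mean target of its training data."""
    mean = sum(y for _, y in data) / len(data)
    return lambda x: mean


def fit_nearest(data):
    """Toy learner: 1-nearest-neighbour regression (squared Euclidean)."""
    def predict(x):
        _, y = min(data, key=lambda ex: sum((a - b) ** 2
                                            for a, b in zip(ex[0], x)))
        return y
    return predict


def stack_fit(data, level0_fitters, level1_fitter, num_folds):
    """Train a two-level stacked predictor.

    data: list of (x, y) pairs, x a tuple of d floats, y a float.
    Returns the final predictor and the augmented data set S-hat.
    """
    # Split S into L disjoint subsets S_1..S_L (interleaved; an assumption).
    folds = [data[i::num_folds] for i in range(num_folds)]

    augmented = []
    for l, held_out in enumerate(folds):
        # Train g^l = <g^l_1..g^l_K> on S - S_l.
        rest = [ex for j, fold in enumerate(folds) if j != l for ex in fold]
        g_l = [fit(rest) for fit in level0_fitters]
        # Each x_t in S_l is labeled by predictors that never saw it.
        for x, y in held_out:
            augmented.append((tuple(x) + tuple(g(x) for g in g_l), y))

    # Final first-level predictors use all of S; h is trained on S-hat.
    g = [fit(data) for fit in level0_fitters]
    h = level1_fitter(augmented)

    def predict(x):
        return h(tuple(x) + tuple(g_k(x) for g_k in g))

    return predict, augmented


data = [((float(i),), 10.0 * i) for i in range(6)]
predict, augmented = stack_fit(data, [fit_mean, fit_nearest],
                               fit_nearest, num_folds=3)
for z, y in augmented:
    print(z, y)
```

In the printed augmented set, the nearest-neighbour feature of each example never equals that example's own target, which is exactly the effect the held-out scheme is meant to produce: the second level sees first-level predictions of the kind it will receive at test time.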
Stacked learning has been applied as a system ensemble method in several NLP tasks, such as named entity recognition [Wu et al., 2003] and dependency parsing [Nivre and McDonald, 2008]. This framework has also been explored as a solution for learning non-local features in [Torres Martins et al., 2008]. In machine learning research, stacked learning has been applied to structured prediction [Cohen and Carvalho,
sub-word tagging.
3.1.4 Annotation Ensemble
A majority of data-driven NLP systems rely on large-scale, manually annotated corpora. These corpora are important for training statistical systems but are very expensive to build. Nowadays, for many NLP tasks, multiple heterogeneous annotated corpora have been built and are publicly available. For example, the Penn Treebank is popular for training PCFG-based parsers, while the Redwoods Treebank is well known for HPSG research; PropBank is favored for building general semantic role labeling systems, while FrameNet is attractive for predicate-specific labeling. However, the annotation schemes of different projects usually differ, since the underlying linguistic theories vary and explain the same language phenomena in different ways.
The co-existence of heterogeneous annotation data, i.e. labeled data in different representations, presents a new challenge to the consumers of such resources. While many state-of-the-art statistical NLP systems are not bound to specific annotation standards, almost all of them assume homogeneous annotation in the training corpus. Therefore, such heterogeneous resources cannot simply be put together when training systems. In this chapter, we address the question of annotation ensemble, that is, learning from instances that have multiple independent representations, which is a natural yet non-standard problem setting. There has been a feature-engineering solution for segmentation and POS tagging [Jiang et al., 2009]. Different from their work, we incorporate heterogeneous taggers into our sub-word tagging model, which more explicitly explores the relation between heterogeneous annotations.