VI. PERIOD OF THE PERMANENT DENTITION
VI.1 Morphological characteristics of the permanent teeth
3.3. Text Categorization
Typical approaches to text categorization require a training set of documents from which the necessary knowledge is inferred. To be usable and to yield optimal results, a training set must generally have certain characteristics: it must contain an adequate number of documents, these must be as representative as possible of the documents that will be classified, and they must be coherently labeled with the same categories that will be assigned to new documents. To be representative, training documents should intuitively contain the same words or, more generally, the same features that will be extracted to predict the categories of the documents to be classified, which are referred to as target documents. In other words, training documents must belong to the same domain as the target documents. Intuitively, the domain of a set of documents is the context in which they are produced and/or consumed; it dictates the words that are used and the categories under which the documents are or must be filed.
Retrieving a set of documents from exactly the same domain as the target documents can often be difficult. As discussed previously, one solution would be to manually label some target documents, but creating a suitable training set may require labeling a large number of documents, which implies significant human effort. In some cases, however, a set of labeled documents from a slightly different domain may be available. The difference between the domains may consist in the use of some different terms, or in an organization into categories that represent slightly different concepts.
In this setting, the traditional text categorization techniques seen so far could be used to infer a classifier from the available training documents and apply it to the target documents. However, due to the differences between the two domains, this would likely not result in an accurate classification, as many of the features known to the classifier would not be found in the target documents. Such cases require specific methods to transfer the knowledge extracted from the available training data to the target domain. Throughout the last decade, techniques for transfer learning have been devised to address these cases (Pan and Yang, 2010). Transfer learning generally involves solving a problem in a target domain by leveraging knowledge from a source domain whose data is fully available. Cross-domain text categorization refers to the specific task of classifying a set of target documents into predefined categories, using as support a set of pre-labeled documents from a slightly different domain.
Unlike traditional text categorization problems, where target documents are generally assumed not to be known in advance, typical cross-domain methods require that unlabeled target documents be given in advance, at least in part, as some knowledge of the target domain is necessary. Also, the majority of cross-domain methods consider a single-label setting (i.e. exactly one category label for each document), which is assumed by default for the rest of this section.
The input consists of a set $D_S$ of source documents constituting the source domain, a set $D_T$ of target documents making up the target domain (we denote their union by $D = D_S \cup D_T$), a set $C$ of categories, and a labeling $C_S : D_S \to C$ that associates a single category with each source document. The required output is a predicted labeling $C_T : D_T \to C$ for the documents of the target domain.
As for the relationship between the two domains, in the general case of transductive transfer learning, under which cross-domain text categorization falls (see below), they must share the same feature space $\mathcal{X}$ and the same class labels $\mathcal{Y}$: in the case of text documents, the first condition can be satisfied simply by selecting the same terms for the source and target domains. The common assumption about the difference between the two domains is that the labels are conditioned on the input data in the same way, while the input data itself is distributed differently in each of them. Denoting by $X_S$ and $Y_S$ the data and labels of the source domain and by $X_T$ and $Y_T$ those of the target domain, we have $P(Y_S \mid X_S) = P(Y_T \mid X_T)$ but $P(X_S) \neq P(X_T)$: this condition is known as covariate shift (Shimodaira, 2000).
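The covariate shift condition can be made concrete with a toy example. The sketch below assumes a single binary feature and a binary label, with probabilities invented purely for illustration: both domains share the same conditional distribution of the label given the feature, but differ in how the feature itself is distributed.

```python
# Toy distributions over one binary feature x and a binary label y.
# All numbers are invented for illustration only.
P_Y1_GIVEN_X = {0: 0.2, 1: 0.9}   # P(y=1 | x), shared by both domains

P_X_SOURCE = {0: 0.8, 1: 0.2}     # P(X_S): feature value 1 is rare
P_X_TARGET = {0: 0.3, 1: 0.7}     # P(X_T): feature value 1 is common


def marginal_y1(p_x):
    """P(y=1) = sum over x of P(x) * P(y=1 | x)."""
    return sum(p_x[x] * P_Y1_GIVEN_X[x] for x in p_x)


p_y1_source = marginal_y1(P_X_SOURCE)   # 0.8 * 0.2 + 0.2 * 0.9
p_y1_target = marginal_y1(P_X_TARGET)   # 0.3 * 0.2 + 0.7 * 0.9
```

Although the rule linking features to labels is identical, the class proportions observed in the two domains differ, which is one reason a classifier fitted to source data alone may behave poorly on target data.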
Two major approaches to transductive transfer learning are often distinguished: (Pan et al., 2010) and other works refer to them as instance-transfer and feature-representation-transfer.
Instance-transfer-based approaches generally work by re-weighting instances (data samples) from the source domain to adapt them to the target domain, in order to compensate for the discrepancy between $P(X_S)$ and $P(X_T)$: this generally involves estimating an importance $\frac{P(x_T)}{P(x_S)}$ for each source instance $x_S$, so that it can be reused as a training instance $x_T$ under the target domain.
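As a minimal sketch of instance re-weighting, the code below assumes a single discrete feature and estimates the importance of each value as the ratio of its relative frequency in the target sample to that in the source sample. The function names, the additive smoothing and the toy data are assumptions of this example, not part of any cited method.

```python
from collections import Counter


def importance_weights(source_xs, target_xs, smoothing=1.0):
    """Estimate w(x) = P_T(x) / P_S(x) for a discrete x from two samples.

    Additive smoothing (an assumption of this sketch) keeps the ratio finite
    for values that are missing from one of the two samples.
    """
    values = set(source_xs) | set(target_xs)
    src, tgt = Counter(source_xs), Counter(target_xs)
    n_src = len(source_xs) + smoothing * len(values)
    n_tgt = len(target_xs) + smoothing * len(values)
    return {
        v: ((tgt[v] + smoothing) / n_tgt) / ((src[v] + smoothing) / n_src)
        for v in values
    }


def weighted_mean(values, weights):
    """Self-normalized importance-weighted average."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Hypothetical data: x takes values "a"/"b"; the loss depends only on x.
source_xs = ["a"] * 8 + ["b"] * 2
target_xs = ["a"] * 3 + ["b"] * 7
loss_by_x = {"a": 0.1, "b": 0.5}

w = importance_weights(source_xs, target_xs, smoothing=0.0)
source_losses = [loss_by_x[x] for x in source_xs]

plain_estimate = sum(source_losses) / len(source_losses)
weighted_estimate = weighted_mean(source_losses, [w[x] for x in source_xs])
target_truth = sum(loss_by_x[x] for x in target_xs) / len(target_xs)
```

In this toy case the weighted average over source instances recovers the average loss on the target sample, while the unweighted one does not. With real text data the feature space is high-dimensional, so density ratios cannot be obtained by simple counting.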
Some works mainly address the related problem of sample selection bias, where a classifier must be learned from a training set with a biased data distribution. (Zadrozny, 2004) analyzes the impact of the bias on various learning methods and proposes a correction method that uses knowledge of the selection probabilities.
The kernel mean matching method (Huang et al., 2007) learns re-weighting factors by matching the means of the two domains' data in a reproducing kernel Hilbert space (RKHS); this is done without estimating $P(X_S)$ and $P(X_T)$ from a possibly limited quantity of samples. Among other works operating under this restriction is the Kullback-Leibler importance estimation procedure (Sugiyama et al., 2007), a model that estimates importance by minimizing the Kullback-Leibler divergence between the real and expected $P(X_T)$.
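To illustrate what matching means in an RKHS involves, the sketch below only evaluates the squared distance between the re-weighted source mean and the target mean under a Gaussian kernel. It does not perform the constrained optimization over the weights that kernel mean matching actually requires; the kernel choice, the one-dimensional data and the hand-picked weights are all assumptions of this example.

```python
import math


def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def mean_discrepancy(source, target, beta, gamma=1.0):
    """Squared RKHS distance between the beta-weighted source mean and the
    plain target mean, expanded into kernel evaluations."""
    n_s, n_t = len(source), len(target)
    ss = sum(beta[i] * beta[j] * rbf(source[i], source[j], gamma)
             for i in range(n_s) for j in range(n_s)) / n_s ** 2
    st = sum(beta[i] * rbf(source[i], t, gamma)
             for i in range(n_s) for t in target) / (n_s * n_t)
    tt = sum(rbf(a, b, gamma) for a in target for b in target) / n_t ** 2
    return ss - 2.0 * st + tt


# Hypothetical 1-D data: most source points lie far from the target points.
source = [(0.0,), (0.1,), (0.2,), (2.0,)]
target = [(1.9,), (2.0,), (2.1,)]

uniform = [1.0, 1.0, 1.0, 1.0]   # no re-weighting
shifted = [0.0, 0.0, 0.0, 4.0]   # hand-picked: all mass on the point near the target

d_uniform = mean_discrepancy(source, target, uniform)
d_shifted = mean_discrepancy(source, target, shifted)
```

Weights that concentrate on the source instance resembling the target data give a much smaller discrepancy than uniform weights. The actual method would search for such weights by solving a quadratic program subject to bounds on the weights, rather than taking them as given.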
Among works specifically considering text classification, (Dai et al., 2007b) trains a Naïve Bayes classifier on the source domain and transfers it to the target domain through an iterative Expectation-Maximization algorithm. In (Gao et al., 2008), multiple classifiers are trained on possibly multiple source domains and combined in a locally weighted ensemble, based on similarity to a clustering of the target documents, to classify them.
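The general idea of fitting Naïve Bayes on the source domain and then adapting it with Expectation-Maximization can be sketched as follows. This is a simplified illustration and not the algorithm of (Dai et al., 2007b): it assumes a plain multinomial model with Laplace smoothing, weights source and target documents equally in the M-step, and uses invented toy documents and function names.

```python
import math
from collections import defaultdict


def fit_nb(docs, memberships, classes, vocab, alpha=1.0):
    """Fit multinomial Naive Bayes from fractional class memberships."""
    class_mass = {c: alpha for c in classes}
    word_mass = {c: defaultdict(float) for c in classes}
    for words, probs in zip(docs, memberships):
        for c in classes:
            class_mass[c] += probs[c]
            for w in words:
                word_mass[c][w] += probs[c]
    total = sum(class_mass.values())
    log_prior = {c: math.log(class_mass[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(word_mass[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_mass[c][w] + alpha) / denom)
                       for w in vocab}
    return log_prior, log_like


def posterior(words, model, classes):
    """P(class | document) under the model, as a dict over classes."""
    log_prior, log_like = model
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in words)
              for c in classes}
    top = max(scores.values())
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def transfer_nb(src_docs, src_labels, tgt_docs, classes, iterations=10):
    """Source-trained Naive Bayes refined by EM over unlabeled target docs."""
    vocab = {w for d in src_docs + tgt_docs for w in d}
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for y in src_labels]
    model = fit_nb(src_docs, hard, classes, vocab)    # source-only start
    for _ in range(iterations):
        soft = [posterior(d, model, classes) for d in tgt_docs]   # E-step
        model = fit_nb(src_docs + tgt_docs, hard + soft, classes, vocab)  # M-step
    return [max(p, key=p.get)
            for p in (posterior(d, model, classes) for d in tgt_docs)]


# Hypothetical toy domains sharing only the words "team" and "vote".
src_docs = [["match", "goal", "team"], ["team", "coach", "goal"],
            ["vote", "party", "law"], ["party", "election", "vote"]]
src_labels = ["sport", "sport", "politics", "politics"]
tgt_docs = [["team", "playoff", "dunk"], ["dunk", "playoff", "rebound"],
            ["vote", "senate", "filibuster"], ["senate", "filibuster", "caucus"]]

predictions = transfer_nb(src_docs, src_labels, tgt_docs, ["sport", "politics"])
```

The second and fourth target documents share no words with the source domain, so the source-only model cannot separate them. During the EM iterations, evidence propagates through words that co-occur with the shared terms in other target documents.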
On the other hand, feature-representation-transfer-based approaches generally work by finding a new feature space in which to represent instances of both source and target domains, where their differences are reduced and standard learning methods can be applied.
The structural correspondence learning method (Blitzer et al., 2006) works by introducing pivot features and learning linear predictors for them, whose resulting weights are transformed through Singular Value Decomposition and then used to augment training data instances. The paper (Daumé III, 2007) presents a simple method based on augmenting instances with features differentiating source and target domains, possibly improvable through nonlinear kernel mapping. In (Ling et al., 2008a) a spectral classification-based framework is introduced, using an objective function which balances the source domain supervision and the target domain structure. With the Maximum Mean Discrepancy (MMD) Embedding method (Pan et al., 2008), source and target instances are brought to a common low-dimensional space where differences between data distributions are reduced; transfer component analysis (Pan et al., 2011) improves this approach in terms of efficiency and generalization to unseen target data.
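Of the methods above, the feature augmentation of (Daumé III, 2007) is simple enough to sketch in a few lines. The version below assumes sparse feature dictionaries and copies each feature into a shared block and a domain-specific block; the key layout, function name and example document are choices made for this illustration only.

```python
def augment(features, domain):
    """Map a sparse feature dict into the augmented feature space.

    Every feature is copied into a "shared" block and into the block of its
    own domain; the other domain's block stays implicitly zero.
    """
    if domain not in ("source", "target"):
        raise ValueError("domain must be 'source' or 'target'")
    augmented = {}
    for name, value in features.items():
        augmented[("shared", name)] = value
        augmented[(domain, name)] = value
    return augmented


# Hypothetical term weights for one document
doc = {"monitor": 2.0, "cheap": 1.0}
src_vec = augment(doc, "source")
tgt_vec = augment(doc, "target")
```

Any standard learner can then be trained on the augmented vectors. Weights on the shared block capture behavior common to both domains, while the domain-specific blocks let the learner model how a term behaves differently in each.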
The following works focus on text classification. In (Dai et al., 2007a) an approach based on co-clustering of words and documents is used, where labels are transferred across domains using word clusters as a bridge. The topic-bridged PLSA method (Xue et al., 2008a) is instead based on Probabilistic Latent Semantic Analysis, which is extended to accept unlabeled data. In (Zhuang et al., 2011) a framework for joint non-negative matrix tri-factorization of both domains is proposed. Topic correlation analysis (Li et al., 2012b) extracts both shared and domain-specific latent features and groups them, in order to support larger distribution gaps between domains.
As in traditional text classification, some methods leverage external knowledge bases, which can help to link knowledge across domains. The method presented in (Wang et al., 2008a) improves the co-clustering-based approach cited above (Dai et al., 2007a) by representing documents with concepts extracted from Wikipedia. The bridging information gap method (Xiang et al., 2010) instead exploits an auxiliary domain acting as a bridge between source and target, using Wikipedia articles as a practical example. These methods usually offer very high performance, but need a knowledge base suitable for the context of the analyzed documents, which might not be easily available for overly specialized domains.
Beyond the works presented so far, where domains differ only in the distribution of terms, there are methods for cross-language text classification, where source and target documents are written in different languages, so that the two domains have few or no words in common. This scenario generally requires either some labeled documents for the target domain or an external knowledge base: a dictionary for the translation of single terms is often used. As examples, (Ling et al., 2008b) presents an approach based on information bottleneck, where Chinese texts are translated into English to be classified, while the method in (Prettenhofer and Stein, 2010) is based on the structural correspondence learning cited above (Blitzer et al., 2006).
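The use of a term-level dictionary can be sketched as below, assuming bag-of-words documents and a one-to-one mapping of terms. The tiny Chinese-to-English dictionary and the function name are hypothetical, and the sketch does not reproduce either of the cited methods.

```python
def translate_bag(term_counts, dictionary):
    """Translate a bag of words term by term, dropping unknown terms."""
    translated = {}
    for term, count in term_counts.items():
        target_term = dictionary.get(term)
        if target_term is not None:
            # Several source terms may map to the same target term.
            translated[target_term] = translated.get(target_term, 0) + count
    return translated


# Hypothetical toy dictionary and document (term -> frequency)
zh_to_en = {"足球": "football", "比赛": "match", "球赛": "match"}
zh_doc = {"足球": 2, "比赛": 1, "球赛": 1, "未知": 3}

en_doc = translate_bag(zh_doc, zh_to_en)
```

After translation, the document lives in the same feature space as the labeled documents of the other language, at the cost of losing the terms the dictionary does not cover.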