In this section we take a closer look at recent methods for taxonomy construction from text and discuss their applicability to Expertise Mining. In particular, we compare our approach for constructing a topical hierarchy with two approaches for taxonomy construction. The first is based on doubly-anchored patterns [KH10], while the second is the OntoLearn Reloaded approach [NVF11]. Both rely on the massive amount of data available on the web to collect hyponym-hypernym relations between terms of interest. The former uses doubly-anchored patterns and a root concept as a seed to gather relation pairs in a bootstrapping fashion, whereas the latter exploits explicit is-a relations found in term definitions.
Figure 4.11 shows the hierarchy generated in [KH10] when reconstructing the WordNet hierarchy for plants presented in Figure 4.4. Compared to occurrence-based methods for identifying relations, such as the subsumption method proposed in [SC99], pattern-based methods generally achieve high precision but lower recall. Additionally, patterns work better for single-word terms that are frequently used on the web, such as plants and herbs. Recall is considerably lower for multi-word terms, such as vascular plants (the second largest node, at the right of the figure, in yellow) and woody plants (the second topmost node, in blue). Although the gold standard taxonomy provides a similar number of edges for plants and vascular plants, and for herbs and woody plants, only about half the number of edges were found for
4. CONSTRUCTING TOPICAL HIERARCHIES FOR EXPERTISE MINING
the longer terms compared to single-word terms. This problem is even more pervasive in technical domains, where a large number of terms are multi-word expressions.
Figure 4.11: Automatically constructed is-a taxonomy for the plants domain
A more promising approach for constructing a hierarchy of technical terms is the OntoLearn Reloaded approach described in [NVF11]. This method is specifically designed for constructing taxonomies for technical domains, such as Artificial Intelligence (AI). The extracted taxonomy for AI can be seen in Figure 4.12, and additional information about the size of the graph is given in Table 4.4. Note that this is a different visualisation from the ones presented in [NVF11], but the underlying structure is the one made available online by the authors. The main difference is that in our visualisation the size and the colour of a node represent its degree, in order to highlight nodes that have a large number of descendants.
The root of the OntoLearn Reloaded taxonomy for Artificial Intelligence is the node abstraction. This node has as subordinate nodes other abstract terms such as event, property, knowledge, communication, data/information, and system. These terms are relevant to the domain, but they are too broad to identify an expertise area. Although this approach has a higher coverage of technical terms than the pattern-based approach,
4.4 Experimental evaluation
Figure 4.12: OntoLearn Reloaded is-a taxonomy for Artificial Intelligence
about half of the nodes are not connected to any other node. Furthermore, the six most connected nodes (i.e., algorithm, function, methodology/method, procedure/process, act/action/activity, and model) are also abstract terms, not expertise topics. The edges between these six highly connected nodes and their direct descendants represent about 30% of the total number of edges in the graph. The taxonomy contains a large number of multi-word terms as well, but this is not evident in our visualisation because these terms have a small number of child nodes and appear closer to the leaves of the graph.
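The connectivity figures quoted above (share of unconnected nodes, most connected nodes, share of edges that touch them) can be computed from any edge list. The following is a minimal sketch, assuming the taxonomy is available as a collection of node labels and (child, parent) pairs; the function name `connectivity_stats` and the toy graph are hypothetical and are not taken from the OntoLearn Reloaded data.

```python
from collections import Counter


def connectivity_stats(nodes, edges, k=6):
    """Summarise how well connected a taxonomy graph is.

    nodes: iterable of node labels
    edges: iterable of (child, parent) pairs
    k:     number of highest-degree nodes to report
    """
    edges = set(edges)
    # Make sure every edge endpoint is counted as a node.
    nodes = set(nodes) | {n for edge in edges for n in edge}

    # Undirected degree: each edge counts once for both endpoints.
    degree = Counter()
    for child, parent in edges:
        degree[child] += 1
        degree[parent] += 1

    isolated = [n for n in nodes if degree[n] == 0]
    # Highest degree first; ties are broken alphabetically.
    hubs = sorted(nodes, key=lambda n: (-degree[n], n))[:k]
    hub_set = set(hubs)
    hub_edges = [e for e in edges if e[0] in hub_set or e[1] in hub_set]

    return {
        "isolated_share": len(isolated) / len(nodes),
        "hubs": hubs,
        "hub_edge_share": len(hub_edges) / len(edges) if edges else 0.0,
    }


# Toy graph, invented for illustration only.
toy_nodes = ["abstraction", "event", "model", "language model",
             "translation model", "parsing", "word sense"]
toy_edges = [("model", "abstraction"), ("event", "abstraction"),
             ("language model", "model"), ("translation model", "model")]

stats = connectivity_stats(toy_nodes, toy_edges, k=1)
print(stats)
```

In the toy graph two of the seven nodes are isolated, and the single most connected node, model, touches three of the four edges, which mirrors on a small scale the pattern described for the AI taxonomy.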
The same can be said about the OntoLearn Reloaded taxonomy constructed for Computational Linguistics, a subfield of Artificial Intelligence, which is shown in Figure 4.13. The same root node, abstraction, was identified. This node is connected to a similar list of abstract concepts, including system, data/information, knowledge, communication, quantity/measure, event, and property. Again, about 40% of the nodes are
not connected by any edge. In this case, the six nodes with the highest degree are procedure/process, model, rule, function, grammar, and system. With the exception of grammar, these nodes are too general to be considered acceptable descriptors of expertise. Altogether, the edges incident to these six nodes account for about a quarter of the edges in the graph.
Figure 4.13: OntoLearn Reloaded is-a taxonomy for Computational Linguistics
To come back to our small example based on seven Computational Linguistics terms, used in Section 4.1.3, we analyse a subset of the OntoLearn Reloaded taxonomy, which is shown in Figure 4.14. The majority of the seven terms used in our example can be found in the OntoLearn Reloaded taxonomy as well, with the exception of alignment model, natural language, and translation system, which appear as part of longer terms
such as statistical alignment model, natural language processing, and machine translation system, respectively. The small taxonomy presented in Figure 4.14 was constructed by selecting the above-mentioned nodes in the OntoLearn Reloaded taxonomy, as well as all their ancestors up to their common root, which is the node abstraction. A first observation is that the OntoLearn Reloaded hierarchy is rich at the abstract levels, but shallower for specialised terms. Most of the nodes that are close to the root in our hierarchy presented in Figure 4.8 appear as leaves in the OntoLearn Reloaded taxonomy, including the name of the field, natural language processing. This is a side effect of iteratively searching for hypernyms in increasingly general definitions. The OntoLearn Reloaded approach enforces strict semantics on relations between pairs of terms, resulting in highly accurate structures that nevertheless have a low coverage when identifying relations between specialised terms, such as expertise topics.
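The selection step described above, keeping a set of nodes together with all of their ancestors up to the root, can be sketched as an upward traversal. This is an illustrative sketch only: the function name `ancestor_subgraph`, the dictionary representation, and the hypernym edges in the toy data are assumptions made for the example and do not reproduce the actual OntoLearn Reloaded edges.

```python
def ancestor_subgraph(parents, selected):
    """Collect the selected nodes plus every ancestor reachable from them.

    parents:  dict mapping a node to a list of its direct hypernyms
    selected: iterable of starting nodes
    Returns (nodes, edges), where edges are (child, parent) pairs.
    """
    keep_nodes = set()
    keep_edges = set()
    stack = list(selected)
    while stack:
        node = stack.pop()
        if node in keep_nodes:
            continue  # already expanded via another path
        keep_nodes.add(node)
        for parent in parents.get(node, []):
            keep_edges.add((node, parent))
            stack.append(parent)
    return keep_nodes, keep_edges


# Hypothetical hypernym edges, invented for illustration only.
toy_parents = {
    "natural language processing": ["process"],
    "machine translation system": ["system"],
    "statistical alignment model": ["model"],
    "grammar": ["rule"],
    "process": ["abstraction"],
    "system": ["abstraction"],
    "model": ["abstraction"],
    "rule": ["abstraction"],
}

sub_nodes, sub_edges = ancestor_subgraph(
    toy_parents,
    ["natural language processing", "machine translation system"],
)
print(sorted(sub_nodes))
print(sorted(sub_edges))
```

Because the traversal only moves upwards, terms that are leaves in the full taxonomy remain leaves in the extracted subgraph, and the abstract ancestors they share are included once.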
Figure 4.14: OntoLearn Reloaded taxonomy for seven Computational Linguistics terms
The application of automatically constructed is-a taxonomies to Expertise Mining is hindered by the low connectivity of the resulting graph, by the large number of abstract nodes in the top levels of the taxonomy, and by the considerable number of edges that are connected to a high-level concept. In comparison, our algorithm for
Approach            Domain  #Nodes  #Edges
OntoLearn Reloaded  AI         868     415
OntoLearn Reloaded  CL        1626     926
Topical Hierarchy   CL        1626    1609
Table 4.4: Graph size of OntoLearn Reloaded taxonomies for Artificial Intelligence and Computational Linguistics compared to a topical hierarchy for Computational Linguistics
constructing a topical hierarchy relies on co-occurrence information that is available for a much larger number of nodes, even when using a moderately sized corpus. This allows us to avoid using out-of-domain corpora, an additional source of noise that is likely to introduce out-of-domain nodes and edges.
Take for example the topical hierarchy presented in Figure 4.15, which was constructed using a Computational Linguistics corpus. The same number of nodes was considered as for the OntoLearn Reloaded taxonomy constructed for the same domain. The root of the tree is the node natural language, which is connected to two accepted names of the field, natural language processing and computational linguistics, through the node language processing. Several subfields, such as speech recognition, information retrieval and statistical machine translation, can be identified as highly connected nodes. A much larger percentage of nodes can be connected using co-occurrence information than by relying on patterns or existing definitions from the web. Only 1% of the nodes are not connected in the topical hierarchy, compared to 40% in the case of the OntoLearn Reloaded taxonomy, even though in our case all the terms used to construct the topical hierarchy are multi-word expressions and not generic single-word terms. Topical hierarchies, such as the one presented in Figure 4.15, have a richer structure and a larger number of edges than OntoLearn Reloaded taxonomies.
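The difference in connectivity can also be read directly from the counts in Table 4.4. The sketch below computes the number of edges per node and, under the assumption that each graph is acyclic (a forest), the number of connected components as nodes minus edges. The forest assumption is made only for this illustration and is not stated for either approach; the function name `forest_summary` is likewise hypothetical.

```python
# Node and edge counts copied from Table 4.4.
table_4_4 = {
    ("OntoLearn Reloaded", "AI"): (868, 415),
    ("OntoLearn Reloaded", "CL"): (1626, 926),
    ("Topical Hierarchy", "CL"): (1626, 1609),
}


def forest_summary(n_nodes, n_edges):
    """Return (edges per node, component count if the graph is a forest).

    A forest with n nodes and e edges has exactly n - e connected
    components; this only holds under the acyclicity assumption.
    """
    return n_edges / n_nodes, n_nodes - n_edges


summary = {key: forest_summary(*counts) for key, counts in table_4_4.items()}

for (approach, domain), (ratio, components) in summary.items():
    print(f"{approach} ({domain}): {ratio:.2f} edges per node, "
          f"{components} components if acyclic")
```

Under this assumption the topical hierarchy would consist of 17 components, which is in line with roughly 1% of its 1626 nodes being unconnected, whereas the OntoLearn Reloaded taxonomy for the same domain would split into 700 components.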
A disadvantage of relying on co-occurrence information is that it requires a sufficient number of domain-specific documents. OntoLearn Reloaded, on the other hand, can be applied to construct a taxonomy even for a single document, because relations are extracted from the web.
We conclude that acquiring is-a relations is an easier task for single-word nodes and abstract nodes than for expertise topics, which are longer, specific, technical terms. On a first inspection, topical hierarchies constructed from co-occurrence information are more informative for the task at hand. In the following section, we
Figure 4.15: Topical hierarchy for Computational Linguistics
discuss an application-based evaluation of concept hierarchies in the context of Expertise Mining. This application domain is well suited for this purpose, because relatively clean datasets about expertise can be gathered, as shown in Section 4.3.1. In this way, we can automatically evaluate longer paths between concepts, not just relation pairs, as was previously done through user studies.