Mehler (2008) hypothesizes that “agents build communities in the form of small worlds in order to generate knowledge networks which themselves are small worlds” and that “word networks tend to evolve as small worlds though very differently” (Mehler, 2008, p. 622). The analysis of the WordNet network structure supports this assumption, with the caveat that the clustering coefficient is quite low. The network analysis has shown that WordNet is a small-world network with respect to its degree distribution, and hence the existence of hubs, and with respect to its small average geodesic path, but that its clustering coefficient is much lower than in other complex networks such as neural or social networks.
Small-world networks exhibit a degree distribution that follows a power law and a high clustering coefficient. Social networks show just these features. Some vertices have a great number of connections (i.e., a high degree), while others have close to zero connections. In social networks, vertices are very likely to be connected to the same vertices as their neighbors, hence the high clustering coefficient. In WordNet, by contrast, the synset structure causes a very low clustering coefficient (0.0004). From the analysis of the WordNet ontology, one
[Figure 26: NCP plots of the WordNet POS subsets. Axes: k (number of nodes in the cluster) and Φ (conductance). Panels: (a) noun subset, G(251396, 296729); (b) verb subset, G(56153, 77798); (c) adjective subset, G(38626, 41241); (d) adverb subset, G(32, 31).]
can suspect that this very low value is caused by the structure of the synsets and by how large synsets are connected to their members: these synonymous word forms are not connected to each other. Also, co-hyponyms are not necessarily connected. The semantic relations that were found in WordNet order the network hierarchically, while only the lexical relations interconnect vertices in different parts of the hypernym/hyponym tree. All this leads to a relative sparseness of edges in the network.
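The effect described above can be illustrated with a minimal sketch. It assumes the average local clustering coefficient as the measure (the text does not state which variant was used) and a hypothetical toy graph in which one synset vertex is linked to its word forms while the word forms are not linked to each other. The vertex names and helper functions are illustrative only and are not taken from WordNet or from the analysis itself.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from a list of vertex pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def local_clustering(adj, v):
    """Fraction of pairs of neighbours of v that are themselves connected."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair can form a triangle
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Hypothetical synset: members hang off the synset vertex but are not
# connected to each other (a star).
synset_star = build_graph([
    ("synset", "car"), ("synset", "auto"), ("synset", "automobile"),
])

# Social-style triangle: every neighbour of a vertex is also a neighbour
# of the other neighbours.
triangle = build_graph([("a", "b"), ("b", "c"), ("a", "c")])

star_cc = average_clustering(synset_star)
triangle_cc = average_clustering(triangle)
print(star_cc, triangle_cc)
```

In this toy setting the star contributes no closed triangles at all, whereas the triangle is fully clustered, which mirrors the contrast between the synset structure and social networks.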
These assumptions based on the ontology structure are confirmed by the network analysis. Looking at the subsets built from the different POS shows especially clearly how sparsely connected the parts of WordNet can be: adjectives and adverbs form many unconnected components. The community structure caused by these components also contributes to the low clustering coefficient. The verb subset shows an NCP plot that indicates that the whole component is the most community-like set within the verbs, while the unconnected sets of adjectives and adverbs show that the components found in the data are actually very small: about 100 vertices for the adjectives, and only 10 for the adverbs.
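Conductance, the quantity plotted in the NCP plots, can be sketched as follows. The sketch assumes the common definition Φ(S) = cut(S) / min(vol(S), vol(V \ S)), where cut(S) counts the edges leaving S and vol sums vertex degrees; whether the plots use exactly this variant is an assumption. The toy graph and all names are hypothetical.

```python
def build_graph(edges):
    """Build an undirected adjacency map from a list of vertex pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def conductance(adj, cluster):
    """Conductance of a vertex set: boundary edges over the smaller volume."""
    cluster = set(cluster)
    # Edges with exactly one endpoint inside the cluster.
    cut = sum(1 for v in cluster for u in adj[v] if u not in cluster)
    vol_in = sum(len(adj[v]) for v in cluster)
    vol_out = sum(len(adj[v]) for v in adj if v not in cluster)
    denom = min(vol_in, vol_out)
    if denom == 0:
        # Convention of this sketch only: an empty side has no boundary.
        return 0.0
    return cut / denom


# Two triangles joined by one bridge edge, plus a separate two-vertex component.
graph = build_graph([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
    ("x", "y"),
])

phi_triangle = conductance(graph, {"a", "b", "c"})
phi_component = conductance(graph, {"x", "y"})
print(phi_triangle, phi_component)
```

A disconnected component has no boundary edges and therefore a conductance of zero, so in a graph that splits into many small components the best-scoring clusters are simply those components.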
Steyvers and Tenenbaum (2005) describe a model that should be able to explain the WordNet structure. After looking at the degree distribution of the WordNet graph and the subsets separately, as well as the NCP plots, their model seems unlikely to explain WordNet’s structure. While the NCP plot of the noun subset in Fig. 26(a) resembles that of a network generated using preferential attachment found by Leskovec et al. (2008a, p. 42), the other subsets as well as the total graph do not.
The assumption that vertices that have existed in the graph for a longer time are more likely to be well connected is not generally true. Other factors might be the frequency of usage as well as the POS. The assumption of Steyvers and Tenenbaum (2005) that the meaning of a word becomes more differentiated the longer the word is used cannot be proved using WordNet. The analysis undertaken here cannot confirm the findings and interpretations of Steyvers and Tenenbaum (2005). Their model does not account for the existence and evolution of the large number of components, and it was shown how strong an influence this division has on the network analysis. In fact, to confirm their finding, Steyvers and Tenenbaum (2005) only look at WordNet’s biggest components, ignoring the strong division of the network into components.
WordNet’s structure is hence different from that of other complex networks. One can conclude that measures or features that take the neighbors of a vertex into account, as seen in the social network approaches to link prediction, as well as purely geodesic-path-based approaches, will not be good fits for this network structure. A new set of features has to be found that differs from previous attempts on other networks in order to meet WordNet’s special properties.
It has been assumed that the so-called cousin relations might be very precise indicators of polysemy as defined in the context of this thesis. Still, given how infrequently these relations are used, their coverage is expected to be very low. From the existence of different components in the subsets, as well as the range and domain restrictions that de facto exist for some relations, one can conclude that the relations that interconnect different POS might be of special interest. Also, it can be assumed that some POS are more likely to take part in the polysemy relation that is to be predicted in the following. The geodesic path has been claimed to be very precise, but it is not sufficient to classify an instance as being polysemous or homonymous.
With these properties and the topology of WordNet and its subsets in mind, appropriate properties of the vertices and the graphs can be taken into account when attributes for the machine learning algorithms are chosen. WordNet’s structure is very different from that of social networks. Second-degree neighbors with which a vertex shares many connections are not good candidates to connect the vertex to. This is due to the synset structure as well as the hierarchical structure formed by hyponymy and hypernymy.
It has been stated before that checking every possible combination of any two vertices of a graph for similarity or probability of connectedness is not very economical (cf. Clauset et al., 2008). Since only forms with the same lemma can be polysemous or homonymous, the number of possible candidates to check (i.e., the number of pairs of word forms in WordNet with a shared lemma) is reduced significantly compared to checking every vertex against every other vertex of the network. While this reduction follows from the definition of the relation in question, we will see later, when looking at DBpedia, how the network topology can also be used to solve the problem.
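The candidate reduction can be sketched as follows: group the word forms by lemma and generate pairs only within each group. This is a minimal illustration under assumed inputs. The `(form_id, lemma)` tuples and identifiers such as `bank#1` are hypothetical and do not reflect WordNet's actual sense keys or the data format used in the experiments.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(word_forms):
    """Return all pairs of word-form ids that share a lemma."""
    by_lemma = defaultdict(list)
    for form_id, lemma in word_forms:
        by_lemma[lemma].append(form_id)
    pairs = []
    for ids in by_lemma.values():
        # Only forms with the same lemma can be polysemous or homonymous.
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


# Hypothetical word forms, each given as (identifier, lemma).
word_forms = [
    ("bank#1", "bank"),
    ("bank#2", "bank"),
    ("bank#3", "bank"),
    ("river#1", "river"),
    ("shore#1", "shore"),
]

pairs = candidate_pairs(word_forms)
n = len(word_forms)
all_pairs = n * (n - 1) // 2  # pairs when every vertex is checked against every other
print(len(pairs), "candidate pairs instead of", all_pairs)
```

Even on five word forms the candidate set shrinks from ten pairs to three, and the gap widens as the vocabulary grows, because most lemmas occur in only a few word forms.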