7.1. FINE-TUNING THE CLUSTERING ALGORITHM

When the GGS is used as ground truth for the evaluation, the best results are achieved with t = 0.50, whereas for the comparison with the HGS the clusterings created with the threshold t = 0.25 obtain the highest evaluation scores. How is the parameter t linked to the human behaviour of hunting and gathering? What influence does t have on the flattening of a hierarchical cluster tree, and how is it connected to the similarity with the clusterings created by humans? As described in section 6.2.1, the gatherers created more clusters and included more sentences in their clusters. It also seems that the sentences in a cluster created by a gatherer are less similar to each other than the sentences in a cluster created by a hunter.

Figure 7.1 shows two dendrograms for the EgyptAir dataset. The cluster trees in both dendrograms are identical. As described in section 4.1, the height of the links between the objects (clusters or single sentences) represents the distance between them, known as the cophenetic distance. When the threshold t is used as a cut-off criterion, the tree is cut at a horizontal line where distance = t, and all links above this line, i.e., with a height > t, are ignored.

[Figure: two dendrograms of the EgyptAir data set with the panel titles "EgyptAir data set: split into cluster when t<0.5" and "EgyptAir data set: split into cluster when t<0.25"; axes: Sentences, Distance (0.0 to 1.0)]
Figure 7.1: Dendrograms for clusterings of EgyptAir dataset with t = 0.5 and t = 0.25

The upper dendrogram shows the cluster tree of the EgyptAir dataset cut at t = 0.5, which results in 46 clusters and 53 singletons. The lower dendrogram shows the cluster tree cut at t = 0.25. Here the separation results in only 16 clusters and considerably more singletons (151, see Table 7.4). Table 7.4 shows the number of clusters[11] and singletons for the EgyptAir sentence set for different t. A lower t results in fewer clusters and more singletons than a higher value of t. This corresponds to the behaviour of the humans considered to be hunters and gatherers: hunters create fewer clusters and use fewer sentences than gatherers.

[11] These numbers include all clusters created by the algorithm. Later these clusters are filtered. Only clusters [...]
t      clusters   singletons
0.10   7          176
0.25   16         151
0.50   46         53
0.75   48         7

Table 7.4: Number of clusters and singletons in relation to t for the EgyptAir sentence set with k=75
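The effect of the cut-off threshold can be illustrated with a minimal sketch. It assumes average linkage (the linkage method is not stated in this section), and the distance matrix `TOY_DIST` as well as the helper names `cut_tree` and `count_clusters_and_singletons` are hypothetical. Because average linkage is monotone, stopping the agglomeration as soon as the closest pair of clusters is farther apart than t yields the same flat clustering as cutting the finished dendrogram at height t.

```python
from itertools import combinations

# Hypothetical pairwise distances between six sentences (not thesis data).
TOY_DIST = [
    [0.00, 0.10, 0.90, 0.90, 0.90, 0.90],
    [0.10, 0.00, 0.90, 0.90, 0.90, 0.90],
    [0.90, 0.90, 0.00, 0.20, 0.90, 0.90],
    [0.90, 0.90, 0.20, 0.00, 0.90, 0.90],
    [0.90, 0.90, 0.90, 0.90, 0.00, 0.40],
    [0.90, 0.90, 0.90, 0.90, 0.40, 0.00],
]


def cut_tree(dist, t):
    """Average-linkage agglomeration that ignores all links with height > t."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        d, x, y = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p, q in combinations(range(len(clusters)), 2)
        )
        if d > t:  # every remaining link lies above the cut line
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters


def count_clusters_and_singletons(clusters):
    """Return (number of clusters with >= 2 sentences, number of singletons)."""
    multi = sum(1 for c in clusters if len(c) > 1)
    return multi, len(clusters) - multi


for t in (0.10, 0.25, 0.50):
    print(t, count_clusters_and_singletons(cut_tree(TOY_DIST, t)))
# 0.1 (1, 4)
# 0.25 (2, 2)
# 0.5 (3, 0)
```

On this toy input a lower t leaves more sentences unmerged, which mirrors the direction of the trend in Table 7.4 for t between 0.10 and 0.50; the absolute numbers are of course not comparable.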
The analysis of human behaviour in sentence clustering in section 6.2.1 suggested that gatherers tend to build clusters from sentences that are not equally similar to each other. A comparison of clusters created by a gatherer with clusters created by a hunter showed that the two groups agreed on the general topic of a cluster, but that the gatherer included additional sentences whose connection to the topic of the cluster was not immediately visible. The assumption was that this might result in lower intra-cluster similarity. However, intra- and inter-cluster similarity could not be calculated for the human-generated clusters, since the human annotators did not determine or rate the degree of membership of a sentence to a cluster. It is possible, though, to calculate these internal evaluation measures for the automatically generated clusterings.

Table 7.5 shows the intra- and inter-cluster similarity for the clusterings of the EgyptAir sentence set created with different t. The intra-cluster similarity decreases with increasing t, whereas the inter-cluster similarity, i.e., the similarity between clusters, grows when the value of t increases. These results are consistent with the assumptions made above.

t      intra   inter
0.1    0.93    0.02
0.25   0.85    0.04
0.5    0.65    0.05
0.75   0.49    0.06

Table 7.5: Intra- and inter-cluster similarity for clusterings of the EgyptAir sentence set with t = 0.5 and t = 0.25 and k=75
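One common way to compute such internal measures is sketched below. The exact definitions used for Table 7.5 are not given in this section, so the pairwise-average definitions here are an assumption, and the similarity matrix `TOY_SIM` and both function names are hypothetical. The toy data only shows how the two quantities are computed; it is not meant to reproduce the values or trends in Table 7.5.

```python
from itertools import combinations

# Hypothetical pairwise similarities between five sentences (not thesis data).
TOY_SIM = [
    [1.0, 0.9, 0.2, 0.2, 0.0],
    [0.9, 1.0, 0.2, 0.2, 0.0],
    [0.2, 0.2, 1.0, 0.8, 0.5],
    [0.2, 0.2, 0.8, 1.0, 0.5],
    [0.0, 0.0, 0.5, 0.5, 1.0],
]


def intra_cluster_similarity(sim, clusters):
    """Mean similarity over all sentence pairs that share a cluster.

    Singletons contribute no pairs and are therefore skipped.
    """
    pairs = [sim[i][j] for c in clusters for i, j in combinations(c, 2)]
    return sum(pairs) / len(pairs) if pairs else 0.0


def inter_cluster_similarity(sim, clusters):
    """Mean similarity over all sentence pairs drawn from two different clusters."""
    pairs = [
        sim[i][j]
        for a, b in combinations(clusters, 2)
        for i in a
        for j in b
    ]
    return sum(pairs) / len(pairs) if pairs else 0.0


# A "hunter-like" clustering keeps sentence 4 apart; a "gatherer-like"
# clustering adds it to a cluster it is only loosely related to.
tight = [[0, 1], [2, 3], [4]]
loose = [[0, 1], [2, 3, 4]]

print(intra_cluster_similarity(TOY_SIM, tight))  # about 0.85
print(intra_cluster_similarity(TOY_SIM, loose))  # about 0.675
```

Adding the loosely related sentence lowers the intra-cluster similarity, which is the effect assumed for gatherer-style clusters in the text above.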
In conclusion, specific human behaviour with regard to sentence clustering for MDS can be emulated to a certain degree by fine-tuning the clustering algorithm. The cut-off threshold t can be used to adjust the clustering algorithm so that it produces clusterings that exhibit typical features of a hunter or a gatherer.
Having said that, it is striking that the comparison with the HGS almost always receives higher evaluation values than the comparison with the GGS. As can be seen in tables 7.1 and 7.2, only for t = 0.75 is the Vbeta for the HGS smaller than for the GGS. In section 6.2.1 I made the assumption that it might be harder for gatherers to agree on clusters. In addition to sentence clusters which represent key topics of a sentence set, the gatherers create clusters for less important topics. Humans can reach a consensus about the main topic of a document collection reasonably well (Barzilay and Elhadad, 1997; Marcu, 1997), but the agreement seems to diminish as the level of importance decreases. Hence the probability that different clusterings are created increases with every additional cluster. This could lead to continuously lower evaluation values.

To test this hypothesis I calculated the normalized Vbeta as described in section 6.3. The normalized Vbeta puts the result into perspective with regard to the upper bound and lower bound of the evaluation scale. In principle the Vbeta can range between 0 and 1, but the interjudge agreement (J) acts like an upper bound for the performance of the system (Radev et al., 2000). I expected the normalized Vbeta scores for the two gold standard subsets to be similar. Unfortunately the results did not confirm this hypothesis. In section 6.2.3 both groups of human annotators receive a similar average inter-annotator agreement (hunter: 0.7275, gatherer: 0.725), and therefore, even with the normalized Vbeta, the comparison with the HGS receives higher values than the comparison with the GGS.

This might be due to the fact that this kind of clustering algorithm favours the creation of clusterings that are more similar to the clusterings of hunters. On the other hand, it might be due to the selection of sentence sets. The sentence sets were chosen so that one generic summary can be created. The requirement was that a set describes a single person or event. This selection might already favour hunter-like clusterings. The gatherer subset of the gold standard might be more useful for summarization systems that generate topic-focused summaries. However, these are only the results for the Iran EgyptAir subset. Whether this finding holds true for the whole data set remains to be seen.
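The reason why normalization changes so little here can be sketched numerically. The normalization itself is defined in section 6.3, which is not reproduced in this section; the sketch below assumes the form used by Radev et al. (2000), i.e., (S - R) / (J - R) with system score S, lower bound R and interjudge agreement J as upper bound. The function name, the system score and the lower bound are hypothetical; only the two agreement values are taken from the text.

```python
def normalized_score(system, lower_bound, upper_bound):
    """Rescale a score so that lower_bound maps to 0 and upper_bound to 1.

    Assumed form: (S - R) / (J - R), following Radev et al. (2000).
    """
    if upper_bound <= lower_bound:
        raise ValueError("upper_bound must be greater than lower_bound")
    return (system - lower_bound) / (upper_bound - lower_bound)


# Inter-annotator agreement reported in the text (used as upper bound J).
J_HUNTER = 0.7275
J_GATHERER = 0.725

# Hypothetical values for illustration only (not results from the thesis).
HYPOTHETICAL_SCORE = 0.5
HYPOTHETICAL_LOWER_BOUND = 0.2

n_hunter = normalized_score(HYPOTHETICAL_SCORE, HYPOTHETICAL_LOWER_BOUND, J_HUNTER)
n_gatherer = normalized_score(HYPOTHETICAL_SCORE, HYPOTHETICAL_LOWER_BOUND, J_GATHERER)
print(round(n_hunter, 4), round(n_gatherer, 4))
```

Because the two upper bounds are almost identical, the same raw score is rescaled by almost the same factor for both subsets, so under this assumed form a gap between the raw HGS and GGS scores carries over nearly unchanged to the normalized scores.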
Nonetheless, the clustering algorithm can be tuned to act more like a gatherer or more like a hunter by changing the value of t. Following this experiment, the threshold value t will be set to 0.5 to create clusterings that are compared to the GGS and to 0.25 to create clusterings that are compared to the HGS.