This section tests experimentally how the choice of the global corpus used as the basis for token and phrase frequencies influences the obtained rankings. Two frequency sources have been prepared for use during term generation. First, the Web 1T 5-gram Version 1 corpus (Brants and Franz, 2006), with its 4.4 million 1-, 2-, 3-, 4-, and 5-grams based on text of 1,024,908,267,229 tokens, has been encapsulated as a web service (Google-N-Grams). It contains n-grams from web sites indexed by Google in 2005. Second, the sentence-based occurrence and co-occurrence counts of words in PubMed have been calculated and likewise encapsulated as a web service (PubMed-Cooc).
Hypothesis
Reviewing the results presented in Section 3.4 (Evaluation of the quality of generated terms), it can be hypothesised that the background knowledge has an evident but small influence on the predicted candidate terms as well as on the ranking: with a total difference in precision of 2.5 to 5% for the range from rank 10 to 50, the difference cannot be regarded as significant.
Fig. 3.11. Mean precision for the retrieval of terminology from the lipoprotein metabolism domain, see Figure 3.5.
Experiment
For the 28 PubMed queries listed in Table 3.9, term rankings have been generated following the pipeline presented in Figure 3.1. The experiment has been repeated with two configurations that use the different corpus statistics, Google-N-Grams and PubMed-Cooc, for scoring.
Besides the pure ranking, special interest in this experiment has been devoted to the number of terms extracted with one but not the other configuration. Different Part-Of-Speech tags lead to different noun phrases and hence different candidate terms.
Results
Single examples. Figure 3.14 shows, for the three example domains Blood Pressure, Obesity, and Insulin Resistance, how the ranking is influenced when the corpus statistics obtained from PubMed are replaced by the statistics obtained from Google. The experiment was repeated with 50, 100, 500, and 2000 PubMed abstracts containing the words “Blood Pressure”, “Obesity”, or “Insulin Resistance”.
Summary over 28 experiments. Figure 3.12 and Figure 3.13 show summary plots over all 28 experiments for document sets of 50, 100, 500, 1000, and 2000 PubMed abstracts. For each ranked term scored using occurrences and co-occurrences from PubMed (x-axis), the plots illustrate the change in rank (y-axis) when the PubMed occurrences and co-occurrences are exchanged for the Google n-gram statistics, and vice versa.
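The change in rank plotted on the y-axis can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the function name `rank_changes`, the sign convention (rank in the other configuration minus rank in the reference configuration), and the toy rankings are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def rank_changes(
    reference: List[str],
    other: List[str],
    top_n: int = 100,
) -> List[Tuple[int, str, Optional[int]]]:
    """Return (reference rank, term, change in rank) for the top_n reference terms.

    Ranks are 1-based. The change is other_rank - reference_rank (assumed
    convention), so a positive value means the term is ranked lower in the
    other configuration. A term absent from the other ranking gets None.
    """
    other_rank: Dict[str, int] = {term: i for i, term in enumerate(other, start=1)}
    changes = []
    for ref_rank, term in enumerate(reference[:top_n], start=1):
        found = other_rank.get(term)
        delta = None if found is None else found - ref_rank
        changes.append((ref_rank, term, delta))
    return changes


# Hypothetical toy rankings, not taken from the experiments.
pubmed_ranking = ["risk", "waist", "serum", "fat"]
google_ranking = ["serum", "risk", "fat", "waist"]

# Both directions of the comparison, as in the two plot columns.
pubmed_vs_google = rank_changes(pubmed_ranking, google_ranking)
google_vs_pubmed = rank_changes(google_ranking, pubmed_ranking)
```

Calling the function once per direction mirrors the two plot columns (PubMed vs. Google and Google vs. PubMed); accumulating the resulting (rank, change) pairs over all queries would give the input for a summary plot.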
[Figure panels: PubMed vs. Google and Google vs. PubMed, for 50, 100, and 500 documents]
Fig. 3.12. Summary: PubMed vs. Google based corpus statistics (part 1)
Summary plot of term generation results on the basis of 50, 100, and 500 documents. The plot illustrates for each ranked term (x-axis) the change in rank (y-axis) when the corpus statistics obtained from PubMed are replaced by the statistics obtained from Google, and vice versa. Results are shown for the top 25 ranked terms. The differences in rank have been accumulated over the 28 experiments and are visualized using gray shadings in a hexagon plot. The darker a hexagon is displayed, the more agreement was observed between the 28 experiments. The results are shown for the top 100 ranked terms.
[Figure panels: PubMed vs. Google and Google vs. PubMed, for 1000 and 2000 documents]
Fig. 3.13. Summary: PubMed vs. Google based corpus statistics (part 2)
Summary plot of term generation results on the basis of 1000 and 2000 documents. The plot illustrates for each ranked term (x-axis) the change in rank (y-axis) when the corpus statistics obtained from PubMed are replaced by the statistics obtained from Google, and vice versa. Results are shown for the top 100 ranked terms. The differences in rank have been accumulated over the 28 experiments and are visualized using gray shadings in a hexagon plot. The darker a hexagon is displayed, the more agreement was observed between the 28 experiments.
Missing terms. As the difference in scoring does not affect the extraction of a candidate term, the extracted set of terms is identical for both configurations. This means no terms will be missed due to the change of the corpus statistics. From an application's point of view, all terms can be found by searching and filtering.
Difference in rank. The difference in rank observed in the plots contradicts the hypothesis derived from the experiment in Section 3.4 (Evaluation of the quality of generated terms). The summary plots in Figure 3.12 and Figure 3.13 show a change in rank greater than expected for a majority of candidate terms. The observation is independent of the direction of the comparison. The transition from
[Figure panels: Blood Pressure, Obesity, and Insulin Resistance, for 50, 100, 500, and 2000 documents]
Fig. 3.14. Selected experiments: PubMed vs. Google based corpus statistics
Examples for rankings based on 50, 100, 500, and 2000 PubMed abstracts. The plot illustrates for each ranked term (x-axis) the change in rank (y-axis) when the background knowledge obtained from PubMed is exchanged for that obtained from Google. Results are shown for the top 100 ranked terms. Terms missing in the ranking are plotted with negative distance in red. Terms which show a difference in rank below 2 ± (rank × 5%) are plotted in black, others in blue. The threshold is illustrated by the gently inclined blue line.
(a) Top rank with PubMed-Cooc → lower rank with Google-N-Grams

PubMed  Google  Concept
     8     134  [risk, Risk, risks, Risks]
    19     203  [Waist, waist]
    13     386  [Levels, level, levels]
    12    1042  [Heart, hearts, heart]
    15    9781  [index, Index]
    15    9307  [Men, men]
    17    9005  [WOMEN, Women, women]
    15   11389  [Production, productions, production]
     5   11503  [oils, Oils, oil]
    16   12163  [fat, fats]
    17   16450  [AI, A-I]
    28   17065  [Alzheimer’s disease, Alzheimer’s Disease]
    14   35395  [omega-3, Omega-3, omega3]

(b) Top rank with Google-N-Grams → lower rank with PubMed-Cooc

Google  PubMed  Concept
    19      53  [obese patients, obese patient]
    19      59  [Serum, serum]
    10      83  [Rats, Rat, rat, rats, RATS]
    46      92  [protein, Proteins, proteins, Protein]
    18      97  [lesions, lesion]
    54     250  [concentration, concentrations, Concentrations]
     9     427  [patients, patient, Patients]
    29   10865  [adipokines, adipokine, Adipokines]
    47   21094  [ApoA1, Apo-A1, apoA1, apoA-1, apo A-1, apo A1]
    38   21159  [ApoCI, ApoC-I, apoC-I, apoCI, Apo C-I]
    26   21327  [ApoA5, APOA5, apoa5]
Table 3.15. Examples for changes of the term ranking depending on the corpus statistics. Listing of concepts which are ranked significantly differently when exchanging Google with PubMed corpus statistics and vice versa; (a) PubMed-Cooc→Google-N-Grams and (b) Google-N-Grams→PubMed-Cooc.
PubMed-Cooc to Google-N-Grams yields as many changes in rank as the transition from Google-N-Grams to PubMed-Cooc. The summary plots show that the absolute difference in rank within the top 25 ranked terms is often below 6, but equally often above 25. Gene names and chemical compounds in particular are prone to large differences in rank, depending on how many (replicated) web sites mentioning the gene are indexed by the web search engine. Table 3.15 lists a selection of concepts for which the change in rank is especially high. The examples range from
• commonly used terms like man, woman, fat, risk to
• terms not used in common language, e.g. ApoA1, adipokines, ApoCI, and
• terms with different meaning in non-biomedical text like heart, level, production, AI, and omega-3.
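The rank pairs of Table 3.15 can be checked against a rank-dependent threshold like the one drawn in Figure 3.14. The sketch below assumes that the threshold reads as 2 plus 5% of the rank, which is only one possible reading of the caption; the names `RankPair`, `threshold`, and `is_large_change` are hypothetical, and the last data row is made up for contrast.

```python
from typing import List, NamedTuple


class RankPair(NamedTuple):
    concept: str
    rank_a: int  # rank in the configuration where the concept ranks high
    rank_b: int  # rank after exchanging the corpus statistics


def threshold(rank: int, offset: float = 2.0, slope: float = 0.05) -> float:
    """Assumed reading of the Figure 3.14 threshold: 2 + 5% of the rank."""
    return offset + slope * rank


def is_large_change(pair: RankPair) -> bool:
    """True if the absolute rank difference is not below the assumed threshold."""
    return abs(pair.rank_b - pair.rank_a) >= threshold(pair.rank_a)


pairs: List[RankPair] = [
    RankPair("risk", 8, 134),    # Table 3.15 (a): PubMed rank, Google rank
    RankPair("oil", 5, 11503),   # Table 3.15 (a): PubMed rank, Google rank
    RankPair("serum", 19, 59),   # Table 3.15 (b): Google rank, PubMed rank
    RankPair("hypothetical stable term", 10, 12),  # made up, not from the table
]

# Concepts whose rank changes by at least the assumed threshold.
large = [p.concept for p in pairs if is_large_change(p)]
```

Under this assumed threshold, all three concepts taken from Table 3.15 count as large changes, while the made-up term with ranks 10 and 12 stays below the line.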