Table 3.6 summarizes the results of the recognition experiments performed on German corpora using sub-word based LMs built on unsupervised morphemes generated with the Morfessor tool (see Section 3.3.2), along with the baseline experiment. We follow the two-pass recognition setup of the German testing system described in Appendix A, where a 4- or 6-gram LM is used to construct the search space without a subsequent lattice or N-best rescoring. In the baseline experiment, a traditional LM based on full-words without any morphemes is used. For this initial set of experiments, the total vocabulary size is fixed to 100k. The number of the most frequent decomposable full-words retained without decomposition is optimized over the development corpus. To this end, the number of retained full-words is increased gradually starting from zero. The objective of these initial experiments is to find the best number of full-words and the optimum order of the sub-word based LM.
Table 3.6. Recognition experiments on German corpora using morpheme-based LMs with 100k vocabularies.
                                 gr-dev09          gr-eval09
LM      full-words  morphemes    OOV [%]  WER [%]  OOV [%]  WER [%]
4-gram  100k        -            5.0      33.9     4.8      29.7
        -           100k         1.0      32.2     -        -
        2k          98k          1.2      31.8     -        -
        5k          95k          1.5      31.7     1.4      28.5
        7k          93k          1.6      31.7     -        -
        10k         90k          1.8      31.8     -        -
        20k         80k          1.9      31.8     -        -
        30k         70k          2.1      31.9     -        -
6-gram  5k          95k          1.5      31.6     1.4      28.5
Table 3.6 shows that the best number of full-words to retain in the sub-word based vocabulary is 5k.
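The vocabulary construction described above can be illustrated with a minimal sketch. It assumes that a segmentation of each decomposable word into morphemes is already available (for example from Morfessor); the function name `build_mixed_vocabulary`, the toy counts, and the toy segmentations are hypothetical and are not taken from the actual experimental setup.

```python
from collections import Counter


def build_mixed_vocabulary(word_counts, segmentations, n_full_words, vocab_size):
    """Return a vocabulary of full-words and morphemes (illustrative only).

    word_counts   : Counter mapping each word to its corpus frequency
    segmentations : dict mapping a decomposable word to its list of morphemes
                    (hypothetical input, e.g. produced by a segmentation tool)
    n_full_words  : number of most frequent decomposable words kept intact
    vocab_size    : total number of units in the final vocabulary
    """
    # Rank decomposable words by frequency and keep the top n_full_words intact.
    ranked = [w for w, _ in word_counts.most_common()]
    decomposable = [w for w in ranked if len(segmentations.get(w, [w])) > 1]
    kept = set(decomposable[:n_full_words])

    # Re-count units after decomposing every word that is not kept intact.
    unit_counts = Counter()
    for word, count in word_counts.items():
        units = [word] if word in kept else segmentations.get(word, [word])
        for unit in units:
            unit_counts[unit] += count

    # The final vocabulary holds the vocab_size most frequent units.
    return [u for u, _ in unit_counts.most_common(vocab_size)]


# Toy data, assumed for illustration only.
word_counts = Counter({"haus": 50, "hausboot": 30, "boote": 10, "bootshaus": 5})
segmentations = {
    "hausboot": ["haus", "boot"],
    "boote": ["boot", "e"],
    "bootshaus": ["boots", "haus"],
}
vocab = build_mixed_vocabulary(word_counts, segmentations, n_full_words=1, vocab_size=4)
```

Sweeping `n_full_words` from zero upwards while holding `vocab_size` fixed mirrors the tuning procedure of Table 3.6, under the stated assumptions.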
Chapter 3 Sub-Word Based Language Models
The minimum observed WERs are achieved using 5k full-words + 95k morphemes. Thereby, the WERs are reduced by [gr-dev09: 2.2% absolute (6.5% relative); gr-eval09: 1.2% absolute (4.0% relative)] compared to the full-word baseline. In addition, significant reductions can be observed in the sub-word OOV rates compared to the full-word OOV rates. On the other hand, it is noted that using a 6-gram rather than 4-gram LM does not help as almost the same WERs are observed for both corpora.
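The absolute and relative reductions quoted above follow directly from the WERs of the baseline and the 5k + 95k system. As a small worked check (the helper name `wer_reduction` is chosen here for illustration):

```python
def wer_reduction(baseline_wer, new_wer):
    """Return (absolute, relative) WER reduction in percent, rounded to one decimal."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return round(absolute, 1), round(relative, 1)


# WERs taken from Table 3.6: full-word baseline vs. 5k full-words + 95k morphemes.
print(wer_reduction(33.9, 31.7))  # gr-dev09  -> (2.2, 6.5)
print(wer_reduction(29.7, 28.5))  # gr-eval09 -> (1.2, 4.0)
```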
Table 3.7 compares the previously known fragment-based graphones to the newly proposed morpheme-based graphones. The generation of graphones is based on G2P models as discussed previously in Section 3.4.2. To train these G2P models, we use a base-lexicon containing pronunciations for about 118k words, divided into a training set of 112k words and a test set of 6k words. Multiple G2P models are trained using different model parameters. For each model, the phoneme error rate (PER) is measured on the test set. The morpheme-based graphones are obtained by modifying a set of graphones based on a G2P model trained with a graphone size parameter L = 4, since it gives the lowest PER (see Section 3.4.2). The size of the baseline full-word vocabulary is set to 100k words, on top of which different types of graphones are added. It is worth noting that the number of fragment-based graphones used represents all the graphones found in the training data other than the original 100k full-words. This explains the very low OOV rates observed in these cases. Nevertheless, we could not set the graphone size parameter L to a value larger than 4, as this increases the graphone inventory and leads to impractically large resource requirements during G2P model training.
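The PER used to compare the G2P models can be sketched as follows, assuming the usual definition: the total edit distance between reference and hypothesized phoneme sequences, divided by the total number of reference phonemes. The exact scoring procedure of the experiments is not specified here, and the phoneme symbols below are toy values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """PER in percent over paired lists of phoneme sequences."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total


# Toy test set (assumed): one correct pronunciation, one with two errors.
refs = [["h", "aU", "s"], ["b", "o:", "t"]]
hyps = [["h", "aU", "s"], ["b", "o", "t", "@"]]
per = phoneme_error_rate(refs, hyps)  # 2 errors over 6 reference phonemes
```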
Table 3.7. Recognition experiments on German corpora using 100k full-words as a baseline vocabulary and adding different fragment-based and morpheme-based graphones.
                                                     gr-dev09          gr-eval09
experiment                  voc. size  graphones     OOV [%]  WER [%]  OOV [%]  WER [%]
full-words                  100k       -             5.0      33.9     4.8      29.7
fragment-based graphones
  graphone size (L) = 2     102k       2k            0.1      34.2     -        -
  graphone size (L) = 3     110k       10k           0.1      32.8     -        -
  graphone size (L) = 4     124k       24k           0.1      32.4     0.1      29.4
morpheme-based graphones    177k       77k           2.8      32.5     2.6      29.5
                            300k       200k          1.0      32.1     1.1      29.3
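It can be seen from Table 3.7 that the morpheme-based graphones slightly outperform the fragment-based graphones: with 200k morpheme-based graphones, the WERs are [gr-dev09: 32.1%; gr-eval09: 29.3%] compared to [gr-dev09: 32.4%; gr-eval09: 29.4%] for the fragment-based graphones with L = 4. Therefore, morpheme-based graphones are utilized in further experiments.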
Similar to the experiments on Arabic, Table 3.8 presents a set of experiments in which both the full-word and the sub-word based vocabulary sizes are increased gradually up to one million in order to find the best operating point for each type of LM. In addition, an extended hybrid LM is evaluated which includes full-words, morphemes, and morpheme-based graphones. Therein, full-words are the most frequent units in the vocabulary, morphemes are less frequent units, and morpheme-based graphones are the least frequent part of the vocabulary. In the recognition lexicon, multiple pronunciation variants are provided for every unit except for graphones, where a single pronunciation is provided that corresponds to the phonemic part of the graphone. The probability distribution over graphones can be seen as a combination of pronunciation probability and LM probability in one joint distribution.
Table 3.8 shows that the best operating point for the full-word LM occurs at a vocabulary size of 750k full-words, whereas the best operating point for the morpheme-based LM occurs at a vocabulary size of 500k (5k full-words + 495k morphemes). Using this morpheme-based LM, WER reductions of [gr-dev09: 0.3% absolute (1.0% relative); gr-eval09: 0.2% absolute (0.7% relative)] are achieved compared to the best full-word LM. At the same time, the OOV rates drop from [gr-dev09: 2.3%; gr-eval09: 2.1%] for the best full-word LM to [gr-dev09: 0.9%; gr-eval09: 0.7%] for the best morpheme-based LM.
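The OOV rates compared above can be illustrated with a short sketch, assuming the usual definition as the percentage of running tokens in the test text that are not covered by the vocabulary. For a sub-word based LM, the tokens are the units obtained after decomposing the text. The toy tokens and vocabulary below are assumed for illustration only.

```python
def oov_rate(tokens, vocabulary):
    """Percentage of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return 100.0 * missing / len(tokens)


vocabulary = {"das", "ist", "neu", "haus", "boot"}

# Full-word view: the compound is not covered by the vocabulary.
full_word_tokens = ["das", "hausboot", "ist", "neu"]
# Sub-word view: the same text after a (hypothetical) decomposition.
sub_word_tokens = ["das", "haus", "boot", "ist", "neu"]

print(oov_rate(full_word_tokens, vocabulary))  # 25.0
print(oov_rate(sub_word_tokens, vocabulary))   # 0.0
```

The example shows why decomposition lowers the OOV rate: an unseen compound can still be covered by sub-word units that are in the vocabulary.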
Using an extended hybrid LM containing 5k full-words + 295k morphemes + 200k morpheme-based graphones, WER reductions of [gr-dev09: 0.3% absolute (1.0% relative); gr-eval09: 0.4% absolute (1.5% relative)] are achieved compared to the best full-word LM. Table 3.9 shows the word- and character-level perplexities (PPLs) for the most important experiments listed in Table 3.8.
3.5 Experimental Results
Table 3.8. Recognition experiments on German corpora using full-words, morphemes, and morphemic graphones for LMs with very large vocabularies.
                                              gr-dev09                               gr-eval09
voc.   full-   morphemes  morphemic   OOV   WER(ins/del)   CER(ins/del)   OOV   WER(ins/del)   CER(ins/del)
size   words              graphones   [%]   [%]            [%]            [%]   [%]            [%]
100k   100k    -          -           5.0   33.9(5.3/6.8)  15.1(2.7/6.9)  4.8   29.7(3.4/7.1)  13.8(2.5/6.9)
200k   200k    -          -           3.8   32.7(4.7/7.0)  14.7(2.7/6.8)  3.5   28.8(3.0/7.3)  13.5(2.4/6.8)
300k   300k    -          -           3.3   32.2(4.4/7.0)  14.6(3.0/6.9)  3.0   28.4(2.9/7.3)  13.2(2.3/6.7)
500k   500k    -          -           2.7   32.0(4.0/7.3)  14.7(2.9/7.1)  2.4   28.6(2.7/7.8)  13.4(2.2/7.0)
750k   750k    -          -           2.3   31.3(4.6/6.0)  14.3(3.5/5.9)  2.1   27.4(3.2/6.5)  12.8(2.7/5.9)
1M     1M      -          -           2.2   31.4(4.6/6.0)  14.3(3.5/6.0)  1.9   27.5(3.1/6.5)  12.7(2.7/5.8)
2.5M   2.5M    -          -           1.7   -              -              1.4   -              -
100k   5k      95k        -           1.5   31.7(3.8/7.3)  14.6(2.6/6.7)  1.4   28.5(2.8/7.5)  13.3(2.3/6.8)
500k   5k      495k       -           0.9   31.0(4.4/5.8)  14.2(3.5/5.8)  0.7   27.2(3.1/6.1)  12.5(2.7/5.6)
750k   5k      745k       -           0.8   31.0(4.3/5.9)  14.2(3.5/5.9)  0.7   27.2(3.1/6.2)  12.5(2.7/5.6)
1M     5k      995k       -           0.8   31.2(4.3/6.1)  14.3(3.5/6.0)  0.7   27.2(3.1/6.1)  12.5(2.7/5.6)
2.1M   5k      2095k      -           0.7   -              -              0.5   -              -
300k   100k    -          200k        1.0   32.1(4.4/7.0)  14.5(3.0/6.8)  1.1   29.3(3.2/7.1)  13.5(2.4/6.8)
500k   5k      295k       200k        0.9   31.0(4.7/5.8)  14.1(3.5/5.8)  0.8   27.0(3.3/5.9)  12.3(2.7/5.6)
Table 3.9. Word- and character-level perplexities for full-word and sub-word based LMs on German corpora (inv: perplexity for in-vocabulary text excluding the unk symbol; all: perplexity for the whole text including the unk symbol).
                                                    word-level PPL                char-level PPL
           voc.   full-   morphemes  morphemic
corpus     size   words              graphones   inv(#units)   all(#units)   inv(#chars)    all(#chars)
gr-dev09   750k   750k    -          -           509.0(69548)  490.4(71133)  2.818(418447)  2.725(439560)
           500k   5k      495k       -           403.9(72391)  393.2(73906)  2.799(422086)  2.713(442333)
           500k   5k      295k       200k        398.5(74650)  397.1(76633)  2.889(421293)  2.802(445060)
gr-eval09  750k   750k    -          -           520.0(35591)  503.3(36319)  2.793(216684)  2.713(226395)
           500k   5k      495k       -           403.0(37151)  393.8(37845)  2.772(218582)  2.697(227921)
           500k   5k      295k       200k        382.7(38433)  382.5(39288)  2.838(219126)  2.769(229364)
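Because the full-word and sub-word based LMs segment the same text into different numbers of units, their unit-level perplexities are not directly comparable, which motivates the character-level normalization. A minimal sketch of this renormalization is given below, assuming that the total log-probability of the text is simply redistributed over the number of characters instead of the number of units. The figures in Table 3.9 appear to be consistent with this assumption, but the function name and the exact character-counting convention are assumptions made here.

```python
import math


def char_level_ppl(unit_ppl, n_units, n_chars):
    """Renormalize a unit-level perplexity to the character level.

    The total log-probability n_units * log(unit_ppl) is kept fixed and
    divided by the number of characters instead of the number of units.
    """
    return math.exp(math.log(unit_ppl) * n_units / n_chars)


# gr-dev09, in-vocabulary text, values taken from Table 3.9.
full_word = char_level_ppl(509.0, 69548, 418447)  # close to the reported 2.818
morpheme = char_level_ppl(403.9, 72391, 422086)   # close to the reported 2.799
```

Under this normalization, the large gap between the unit-level PPLs of the two LMs (509.0 versus 403.9) shrinks to a small difference at the character level, since the morpheme-based LM distributes its probability mass over more units.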