Semanto-Phonetic Writing Systems
Semanto-phonetic writing systems are also called logophonetic, morphophonemic, logographic, and logosyllabic. In such writing systems, the symbols may represent both sound and meaning [GN14]. The following types of symbols are included: (1) pictograms or pictographs, (2) logograms, (3) ideograms or ideographs, and (4) compound characters. Pictograms or pictographs resemble what they represent. Logograms represent parts of words or whole words. Ideograms or ideographs graphically represent abstract ideas. Compound characters consist of a phonetic element and a semantic element [omn14].
Japanese Kanji and the Chinese script are two semanto-phonetic writing systems. The Chinese script comprises ideograms and, for the most part, compound characters [omn14]. Each character represents one syllable. However, multiple characters correspond to one syllable, each with a different meaning.
Thus, semanto-phonetic scripts are a challenge for pronunciation generation [SW01]. For Chinese and Japanese, existing tools can convert the semanto-phonetic characters [RSW99, SW01, SCT04]. The pronunciation can then be derived more easily from the converted strings. In the case of Chinese, the characters are often converted to Pinyin, a widely used transcription in the Latin alphabet, where the grapheme-to-phoneme relationship is much more straightforward. However, the transcription itself is a complex task since about 13% of the Chinese characters have more than one pronunciation [RSW99].
1.4.3 Manual Pronunciation Generation
Generating pronunciations by hand, without any automatic support, can be very time-consuming. For example, [DB04c] report that producing a pronunciation for an Afrikaans word takes between 19.2 and 30 seconds on average. The average times were computed on a set of 1,000 words. The value of 19.2 seconds was the fastest average time observed in their lab using a proficient phonetic transcriber, and represents an optimistic time estimate. Consequently, with only one annotator it would take between 8.9 and 13.9 days of pure annotation time to compile a whole Afrikaans dictionary containing 40,000 words, plus periods for pauses. Note that Afrikaans is a Germanic language with a fairly regular grapheme-to-phoneme relationship.
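The day figures above can be reproduced with a short calculation. The following is a minimal sketch; it assumes that a day is counted as 24 hours of uninterrupted work, which is our reading of the numbers rather than something stated in [DB04c]. The function name `annotation_days` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # assumption: a "day" is 24 hours of non-stop work


def annotation_days(num_words, seconds_per_word):
    """Return the pure annotation time in days for a single annotator."""
    return num_words * seconds_per_word / SECONDS_PER_DAY


fast = annotation_days(40_000, 19.2)  # optimistic per-word time
slow = annotation_days(40_000, 30.0)  # pessimistic per-word time
print(f"{fast:.1f} to {slow:.1f} days")  # prints "8.9 to 13.9 days"
```

With a more realistic working day of eight hours, both figures would triple, which makes the case for automatic support even stronger.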
The annotation process can be done in parallel. However, multiple annotators may produce inconsistent and flawed pronunciations due to different subjective judgments, small typographical errors, and ‘convention drift’ [SDV+12].
Therefore, the manual work is usually supported with automatic methods.
1.4.4 Automatic Pronunciation Generation
Rule-based Grapheme-to-Phoneme Conversion
In the case of a close relationship between graphemes and phonemes, defining rules by hand can be efficient. The best case is a 1:1 relationship, where the number of grapheme-to-phoneme conversion rules to be defined corresponds to the number of letters. If no wide context information is necessary for a good grapheme-to-phoneme conversion, it can be implemented by search-and-replace rules. Knowledge-based approaches with rule-based conversion systems can typically be expressed as finite-state automata [KK94, BLP98].
However, for languages with loose grapheme-to-phoneme relationships, these methods often require specific linguistic skills and exception rules formulated by human experts.
The advantage of a rule-based grapheme-to-phoneme conversion in contrast to data-driven approaches is that no sample word-pronunciation pairs are required to train the converters. However, if defining the rules for languages with a more complex grapheme-to-phoneme relationship becomes an elaborate task, it is better to generate training data in terms of sample word-pronunciation pairs and use data-driven approaches. For example, for the creation of the Ukrainian GlobalPhone dictionary, we developed and applied 882 search-and-replace rules based on [BMR08] to produce phoneme sequences for our Ukrainian words, as described in [SVYS13].
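To illustrate how context-free conversion rules can be applied, the following minimal sketch scans a word from left to right and always prefers the longest matching grapheme string. This is a longest-match variant of search-and-replace, chosen here for clarity. The rule table `TOY_RULES`, its phoneme symbols, and the function `apply_rules` are hypothetical; they are not the 882 Ukrainian rules mentioned above.

```python
# Hypothetical toy rules: (grapheme string, phoneme symbol).
# They only serve as an illustration and do not describe a real language.
TOY_RULES = [
    ("sh", "S"),
    ("s", "s"),
    ("h", "h"),
    ("a", "a"),
    ("i", "i"),
    ("p", "p"),
    ("t", "t"),
]


def apply_rules(word, rules):
    """Convert a word to phonemes, trying longer grapheme strings first."""
    ordered = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
    phonemes = []
    i = 0
    while i < len(word):
        for graphemes, phoneme in ordered:
            if word.startswith(graphemes, i):
                phonemes.append(phoneme)
                i += len(graphemes)
                break
        else:
            # No rule matched at this position.
            raise ValueError(f"no rule covers {word[i]!r} at position {i}")
    return phonemes


print(apply_rules("ship", TOY_RULES))  # ['S', 'i', 'p']
print(apply_rules("hat", TOY_RULES))   # ['h', 'a', 't']
```

Rules that depend on the neighboring letters would need additional context conditions, which is where the effort for languages with a looser grapheme-to-phoneme relationship grows.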
Data-driven Grapheme-to-Phoneme Conversion
In contrast to knowledge-based approaches, data-driven approaches are based on the idea that, given enough examples, it should be possible to predict the pronunciation of unseen words purely by analogy. The benefit of the data-driven approach is that it trades the time- and cost-consuming task of designing rules, which requires linguistic knowledge, for the much simpler one of providing example pronunciations.
[Bes94] propose data-driven approaches with heuristic and statistical methods. In [Kne00], the alignment between graphemes and phonemes is generated using a variant of the Baum-Welch expectation maximization algorithm. [Che03, VALR03, JKS07, BN08] apply a joint-sequence model to the grapheme-to-phoneme conversion task. [Nov11] and [NMH12] utilize weighted finite-state transducers for decoding as a representation of the joint-sequence model. [GF09], [LDM09], and [KL10] apply statistical machine translation-based methods to grapheme-to-phoneme conversion. A good overview of state-of-the-art grapheme-to-phoneme conversion methods is given in [HVB12]. Data-driven methods are commonly applied, and cross-checks of the generated pronunciations are often not performed [LCD+11] because a manual check is time-consuming and expensive, especially if native speakers or linguists need to be hired for this task. Following the literature, we denote a statistical model for grapheme-to-phoneme conversion as a grapheme-to-phoneme model.
As Sequitur G2P, which uses joint-sequence models for the grapheme-to-phoneme conversion task, usually gives the best performance [BN08, HVB12, SQS14], we used it for training and applying grapheme-to-phoneme models for most of our experiments. Therefore, we introduce Sequitur G2P in the following paragraph.
Sequitur G2P
Sequitur G2P is a data-driven grapheme-to-phoneme converter developed at RWTH Aachen University [BN08]. It is open source software, and in experiments it has shown favorable results when compared to other grapheme-to-phoneme tools. Sequitur G2P employs a probabilistic framework based on joint-sequence models. Such models consist of units called multigrams or graphones, which carry both input and output symbols (graphemes and phonemes). The effective context range covered by a model is controlled by the maximum length L of graphones and the n-gram order M of the sequence model. A graphone of length L carries 0 to L graphemes and 0 to L phonemes, with the non-productive case of both 0 graphemes and 0 phonemes being excluded. The simplest case of L=1 is called a singular graphone and corresponds to the conventional definition of a finite-state transducer. Several previous experiments have shown that singular graphones in combination with long-range M-grams yield the best performance [BN08].
Typically, M should be on the order of the average word length for maximum accuracy, as described in [BN08]. This value naturally differs between our project languages, so we settled on the lowest common average value for the sake of comparability. Therefore, in most experiments we chose to train models using the parameters L=1 and M=6.
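The graphone definition can be made concrete by enumerating all ways in which a word and its pronunciation can be split jointly into graphones of maximum length L. The following is a minimal sketch for illustration only; the function `cosegmentations` and the toy symbols are hypothetical, and the sketch says nothing about how Sequitur G2P handles segmentations internally.

```python
def cosegmentations(graphemes, phonemes, max_len=1):
    """Yield every split of a (graphemes, phonemes) pair into graphones.

    A graphone is a pair (g, p) carrying 0..max_len graphemes and
    0..max_len phonemes; the pair with no graphemes and no phonemes
    is excluded as non-productive.
    """
    if not graphemes and not phonemes:
        yield []
        return
    for n_g in range(min(max_len, len(graphemes)) + 1):
        for n_p in range(min(max_len, len(phonemes)) + 1):
            if n_g == 0 and n_p == 0:
                continue  # skip the non-productive graphone
            head = (graphemes[:n_g], tuple(phonemes[:n_p]))
            tails = cosegmentations(graphemes[n_g:], phonemes[n_p:], max_len)
            for tail in tails:
                yield [head] + tail


# Toy example with singular graphones (L=1): two letters, two phonemes.
segmentations = list(cosegmentations("ab", ["A", "B"], max_len=1))
print(len(segmentations))  # 13
```

Even for this two-letter toy word there are 13 joint segmentations with L=1. The count grows quickly with word length, which is why the model is estimated statistically instead of by choosing one segmentation by hand.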
Figure 1.10 – Sequitur G2P model performance over n-gram order M.
Figure 1.10 also confirms that M=6 is a reasonable choice. We illustrate the performance of grapheme-to-phoneme models of four languages with grapheme-to-phoneme relationships of differing regularity and with varying model sizes M, trained with 10k word-pronunciation pairs. The performance is expressed by the edit distance between the hypotheses and the canonical reference pronunciations at the phoneme level, denoted as phoneme error rate (PER). We regard pronunciations which are closer to our high-quality reference pronunciations as better since they minimize the human editing effort in a semi-automatic pronunciation dictionary development, as we show in Section 5.3.
For all four languages, we observe that the error rates level off for M larger than 3 and show little further improvement. Consequently, M=6 is a choice which is not too conservative.
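The phoneme error rate used above can be sketched as follows. The sketch computes the Levenshtein distance between each hypothesis and its reference and, as an assumption on our side, normalizes the summed distance by the total number of reference phonemes and reports a percentage. The function names and the toy pronunciations are hypothetical.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,             # drop h from the hypothesis
                curr[j - 1] + 1,         # add the missing reference phoneme r
                prev[j - 1] + (h != r),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(hyps, refs):
    """PER in percent; assumes normalization by the reference length."""
    errors = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


# Toy data: one substitution and one missing phoneme in 7 reference phonemes.
hyps = [["h", "a", "t"], ["S", "i", "p"]]
refs = [["h", "E", "t"], ["S", "i", "p", "s"]]
print(f"{phoneme_error_rate(hyps, refs):.1f}% PER")  # 28.6% PER
```

Each edit operation counted here corresponds to one correction a human editor would have to make, which is why a lower PER translates into less editing effort.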
During the iterative training process, Sequitur G2P uses an expectation maximization algorithm to estimate the model. To avoid the problem of overfitting, evidence trimming and discounting are performed. Since the discount values need to be optimized on a data set that is separate from the data that is used to compute the evidence values, we let Sequitur G2P select a random 5% of our training data as a held-out development set. The removal of this data from the training set has a noticeable negative effect on model performance. Because of this, Sequitur G2P does a ‘fold-back’ training: the held-out data is added back to the training set after the training process has converged. The training then continues with the previously calculated discount parameters until further convergence.
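The data handling behind the held-out set and the fold-back step can be sketched as follows. This is only an illustration of the split, not Sequitur G2P's own code: the function `split_held_out`, the fixed seed, and the dummy word-pronunciation pairs are assumptions, and the actual model training is not shown.

```python
import random


def split_held_out(pairs, fraction=0.05, seed=0):
    """Randomly split word-pronunciation pairs into training and held-out sets."""
    rng = random.Random(seed)  # fixed seed, an assumption for reproducibility
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_dev = max(1, round(len(shuffled) * fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Dummy data standing in for 10k word-pronunciation pairs.
pairs = [(f"w{i}", f"p{i}") for i in range(10_000)]

# Phase 1: estimate the model on `train`, tune the discounts on `dev`.
train, dev = split_held_out(pairs)

# Phase 2 (fold-back): continue training on all data with fixed discounts.
fold_back = train + dev

print(len(train), len(dev), len(fold_back))  # 9500 500 10000
```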