There are a number of applications in telecommunications where synthesis-by-analysis represents an attractive option. For example, services such as database enquiry systems require the storage of good quality speech for message output. Efficient storage of such messages is of considerable importance, and so present day systems which require the storage of large amounts of speech use compression techniques to reduce the storage overhead. Table 2.1 shows a number of speech coding strategies commonly used for storing messages.
Data rate (kbit/s)    Method of analysis
64                    Pulse Code Modulation (PCM) of the speech waveform, 8 bits per sample
32                    Adaptive Differential Coding (ADPCM)
10                    Multi-Pulse Linear Predictive Coding (MPLPC)
6                     Parallel Formant Synthesis - Fixed Frame Rate
4                     Linear Predictive Coding (LPC)
2-4                   Parallel Formant Synthesis - Variable Frame Rate
0.1                   Text-to-Speech System

Table 2.1 Commonly used methods of speech analysis
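To give a feel for the storage overhead implied by Table 2.1, the following minimal sketch converts a data rate into the storage needed for a message of a given length. The function name and the convention of 1 kbit = 1000 bits and 1 kbyte = 1000 bytes are assumptions made for illustration only; the variable frame rate entry (2-4 kbit/s) is omitted because it is a range rather than a single figure.

```python
# Data rates taken from Table 2.1, in kbit/s.
# Assumption: 1 kbit = 1000 bits and 1 kbyte = 1000 bytes.
RATES_KBIT_S = {
    "PCM": 64.0,
    "ADPCM": 32.0,
    "MPLPC": 10.0,
    "Parallel formant synthesis (fixed frame rate)": 6.0,
    "LPC": 4.0,
    "Text-to-speech": 0.1,
}


def storage_kbytes(rate_kbit_s: float, seconds: float) -> float:
    """Return the storage, in kbytes, for `seconds` of speech at `rate_kbit_s`."""
    return rate_kbit_s * seconds / 8.0


if __name__ == "__main__":
    # Storage required for a one-minute message under each coding method.
    for method, rate in RATES_KBIT_S.items():
        print(f"{method}: {storage_kbytes(rate, 60.0):.2f} kbytes per minute")
```

Under these assumptions a one-minute message needs 480 kbytes as 64 kbit/s PCM but only 45 kbytes as fixed frame rate parallel formant control data.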
It is clear from Table 2.1 that as the data rate drops, more sophisticated models of production are required. Stored speech generated by synthesis-by-analysis represents a compromise between efficiency of storage and speech quality. Text-to-speech systems, while being far more efficient in storage, cannot presently match the naturalness of speech produced by copy synthesis techniques. A further advantage in using speech generated by formant synthesis is the comparative ease with which message segments can be concatenated.

[Figure blocks: source speech; feature extraction; parameters; synthesiser]
Fig. 2.1: Formant analysis by directed feature extraction

[Figure blocks: re-circulating store for source speech; synthesiser; synthesiser parameters; compare; error measure; speech output]
Apart from the above applications, experimenting with synthesis-by-analysis in a formal manner aids researchers in understanding how to drive formant synthesisers. Such information can be used to improve the naturalness of speech generated by synthesis-by-rule systems which employ formant synthesis. For example, improvements in the naturalness of speech generated from text may be observed when formant synthesisers are driven by a sophisticated model of the voiced excitation (Karlsson, 1989).
2.5 Overview of the Thesis
The majority of the thesis describes a computationally efficient method of synthesis-by-analysis which is capable of producing copy synthetic speech of comparable quality to that produced by the reference method of synthesis-by-analysis described above.
The problem of obtaining suitable formant control signals can be viewed simplistically in two ways (Hughes, 1988): either as a feature estimation and extraction problem or as a parametric optimisation problem (see fig. 2.1 and fig. 2.2). In the first approach, formant synthesiser control signals are estimated using techniques which assume an underlying vocal tract model during analysis. Inevitably, the performance of such techniques is limited by the underlying assumptions upon which they are based. When these assumptions are false or difficult to implement, the synthesis produced is poor. The difficulty encountered by researchers in reliably obtaining good formant tracks is an example of this. An alternative approach is to regard control signal estimation as a parametric optimisation procedure, where the control signal values are elements of a multi-dimensional vector. In this approach fewer assumptions about the nature of the speech signal are made. As a consequence, the task of finding an appropriate parameter vector is greatly complicated. For example, in the JSRU parallel formant synthesiser (Holmes, 1973; Rye and Holmes, 1982) there are 2^[…] different choices for each control signal in the synthesis vector, which consists of a minimum of ten control signals, giving a total of 2^[…] possible combinations. While in theory there is a vector which, measured against some criterion, will minimise the difference between the synthesised frame and the original speech, traversing such a high-dimensional space to find that point is a difficult task. Fortunately, given reasonable initial estimates, iterative procedures such as the one described earlier can find local minima in an acceptable length of time. The disadvantages of such methods are as follows: the performance of parametric optimisation methods is dependent on the initial estimates; iterative procedures are time consuming; and finally, while such procedures are optimal in one sense, they are limited in others, as they give little insight into the nature of speech production or the theories used to model it.
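The scale of the search problem, and the role played by the initial estimates, can be illustrated with a minimal sketch. It is not the reference method of the thesis. The figure of 6 bits per control signal is an assumption chosen purely for illustration, the squared-distance error is a toy stand-in for a real comparison between the synthesised frame and the original speech, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def search_space_size(bits_per_signal: int, n_signals: int) -> int:
    """Number of distinct control vectors when each signal has 2**bits values."""
    return (2 ** bits_per_signal) ** n_signals


def local_search(
    initial: Sequence[int],
    error: Callable[[List[int]], float],
    levels: int,
    max_sweeps: int = 50,
) -> Tuple[List[int], float]:
    """Coordinate-wise descent over a quantised control vector.

    Each sweep tries moving every element one quantisation step up or down
    and keeps any move that lowers the error. The search stops at a local
    minimum, so the result depends on the initial estimate.
    """
    current = list(initial)
    best = error(current)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(current)):
            for step in (-1, 1):
                candidate = list(current)
                candidate[i] += step
                if not 0 <= candidate[i] < levels:
                    continue  # stay inside the quantiser range
                e = error(candidate)
                if e < best:
                    current, best = candidate, e
                    improved = True
        if not improved:
            break
    return current, best


def toy_error(target: Sequence[int]) -> Callable[[List[int]], float]:
    """Toy error: squared distance to a known target vector (illustration only)."""
    return lambda v: float(sum((a - b) ** 2 for a, b in zip(v, target)))


if __name__ == "__main__":
    # Assumed 6 bits per signal, ten signals: far too many vectors to enumerate.
    print(search_space_size(6, 10))
    # Starting close to the target, the local search needs only a few sweeps.
    print(local_search([12, 18, 33], toy_error([10, 20, 30]), levels=64))
```

With the assumed 6 bits per signal and ten signals, exhaustive search would have to visit 2^60 vectors, whereas the local search evaluates only a handful of neighbours per sweep. The price is that it settles on whichever local minimum lies nearest the starting point.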
The synthesis method described in this thesis is a hybrid, falling mainly into the first category, as it attempts to produce good quality synthetic speech by considering each synthesiser control signal as a feature to be extracted from the speech signal. The initial estimation stage attempts to determine good control signal values using the most robust¹ and appropriate analysis methods. However, once initial values have been extracted, a non-parametric optimisation strategy is adopted to improve these estimates. The method is implemented as an automatic procedure to ensure information is incorporated in a formal manner. The resulting synthesis is compared against the reference parametric method of synthesis to produce, where possible, statistically meaningful results in an attempt to avoid idiosyncratic quality assessment. The synthesis-by-analysis method is shown schematically in fig. 2.3 and follows closely the contents of the thesis chapters. This diagram is now described through an outline of the contents of the following chapters.

¹ The term ‘robust’ is used here to suggest that the estimation method considered is reliable, i.e. not prone to failure.

[Figure blocks: Sp; Lx; determine excitation points (Tx) in voiced speech; determine voicing from Tx; determine Fn from Tx; raw formant estimation and tracking; perform amplitude mapping; perform amplitude transformation using MLPs; convert to synthesiser control signals; OQ modifications to fixed excitation; inclusion of dynamic excitation; synthesis. Blocks are annotated as described in appendix A, chapter 3, chapter 4, chapter 5, appendix G and chapter six.]
Fig. 2.3: Schematic diagram of the synthesis-by-analysis method.
Each of the following chapters can be roughly divided into three parts: an introduction, in which the topic of the chapter (and a brief description of the sectioning within the chapter) is presented; the body of the chapter, where the topic is discussed in general terms; and a final section, where the subject of the chapter is related to the design of the JSRU synthesiser. In a number of chapters this final section also includes results of experiments conducted using the JSRU synthesiser.
Chapter three considers the problems of determining accurate estimates for voicing, fundamental frequency and formant frequency tracks. These subjects are considered in some detail as they are fundamental to the production of good synthesis using the new analysis procedure.
Chapter four introduces a method of obtaining good formant control signal amplitude estimates called ‘automatic amplitude mapping’ (AAM), which takes care to accommodate specific synthesiser design requirements. Subjective testing is used to compare the quality of synthesis produced using this method with the ‘reference’ method of synthesis-by-analysis. This chapter uses extensively the results of the analyses described in chapter three.
Chapter five introduces a non-parametric method of amplitude transformation which attempts to transform the formant control signal amplitude estimates obtained using AAM (described in chapter four) into optimised control signal values. MLPs are used to accomplish this. As in chapter four, subjective tests are used to assess the quality of the resulting synthesis.
Chapter six investigates various aspects of the voiced excitation source and examines how specific voiced source effects can be incorporated into the JSRU synthesiser. Excitation specific effects are considered in two ways. Firstly, through modifications to two control signals, the mark/space ratio (which controls the duration of the open phase) and the degree of voicing. These experiments used the internally generated fixed voiced excitation waveform. Secondly, by the inclusion of a dynamically varying voiced excitation source, derived from analyses of the speech and laryngographic signals.
The final chapter pulls together the results described in previous chapters and attempts to draw conclusions. The chapter ends with a brief discussion of future work.
The thesis also contains a number of comprehensive appendices. Appendix A considers laryngographic techniques in detail. Appendix B introduces in some detail the theory of linear predictive analysis. Appendix C gives a brief history of the development of neural networks up to and including present day trends and, in addition, presents a detailed discussion of many of the most important considerations when designing MLP networks.
There are also four other substantial appendices in addition to the ‘background’ appendices mentioned above. Appendices D, E and F contain, respectively, details of the data used in the training and assessment of the new synthesis-by-analysis method, a detailed account of the statistical analyses carried out in the assessment of the synthetic speech produced by the AAM analysis step, and a detailed account of the statistical analyses carried out in the assessment of the transform synthesis analysis step. Finally, appendix G consists of a colloquium paper which contains experimental results of subjective tests conducted to assess the improvement in synthetic speech produced through modifications to the JSRU synthesiser’s excitation control signal (considered in chapter 6). These important topics were included as appendices to allow the reader to gain quickly an overview of the major points of the thesis. A reader interested in the details of a particular procedure can then re-read the relevant chapter, while referring to the appropriate appendices as required.