Recording a voicebank of samples from the UTAUloid's voice donor is a process that both involves a certain amount of linguistic knowledge about the target UTAUloid language, and will feel familiar to linguists, as it closely resembles collecting wordlist data for
phonetics experiments.
The first step is to assemble the list of sounds to be recorded, into what the UTAUloid community calls a 'reclist'. The reclist should consist of all possible combinations of
phonemes needed to reproduce any syllable (or mora) in the target language. This requires a linguistic understanding of the phonetic inventory, phonotactics, and syllabic (or moraic) structure of the language, as well as an understanding of how UTAUloid works in order to know exactly what phoneme combinations need to be recorded.
The phonetic inventory of Cherokee (Pulte and Feeling 1975, Scancarelli 2005; Montgomery-Anderson 2008; Uchihara 2016) is presented in Tables 4.2 and 4.3 below, based on Montgomery-Anderson (2008) and Uchihara (2016):
Table 4.2: Cherokee consonant inventory
BILABIAL ALVEOLAR PALATAL VELAR LABIO- VELAR GLOTTAL PLOSIVES t k kw ʔ AFFRICATES CENTRAL ʧ LATERAL tɬ FRICATIVES s h NASALS m n APPROXIMANTS CENTRAL j w LATERAL l <https://dantedesigns.net/nib.html>.
Table 4.3: Cherokee vowel inventory FRONT CENTRAL BACK
HIGH i iː u uː
MID e eː ə̃ ə̃ː o oː
LOW a aː
Using this inventory, the basic syllable structure (before various morphophonological processes) is as follows (adapted from Uchihara 2016:122):
Table 4.4: Cherokee syllable structure
ONSET NUCLEUS CODA
(s) (t ʧ tɬ l k kw)
(h)
V(V) (ʔ h)
(n j w) (s m h ʔ)
Even with Cherokee's relatively small phonetic inventory, this maximal syllable structure would make recording each possible potential syllable for sampling a considerable task. Thankfully, however, the architecture of UTAUloid allows for combining samples on a single note, so that a smaller number of samples can approximate the entire syllabic
inventory of a language. CVC syllables, for instance, can be produced by a combination of two appropriate CV and VC segments, and in languages with phonemic vowel length (such as Cherokee), it is not necessary to record separate samples for V and V: (or CV and CV:) syllables; vowels can be lengthened either by note length or with the addition of a following sample of the same vowel quality. Likewise, complex onsets and codas can be 'cropped' via volume manipulation in the score, so that a C1C2V segment could also be rendered as C2V.
Of course, because of the effects of co-articulation and gestural timing, the more an UTAUloid voicebank relies on 'shortcuts' such as syllable cropping, the less 'natural' it will sound; in general, recording more unique segments will result in a more natural sound. It is
worth noting, however, that 'natural' is not necessarily the goal of UTAUloid, and many UTAUloids and Vocaloids are not made with voicebanks that take every phonetic process of a language into account. Hatsune Miku, for instance, does not produce the vowel devoicing between voiceless consonants characteristic of much spoken Japanese (Okada 1999).
Taking the above factors into account, the reclist for the Cherokee UTAUloid consists of the following segments:
Table 4.5: Segments for Cherokee reclist, by syllable type
Category: Examples: Number:
1. CV: all initial consonants + all vowels /tɬo/ /se/ /ʔa/ 78 2. CCV: /s/ + {/k kw t ʧ tɬ l/} + all vowels /skwa/ /stə̃/ /stɬi/ 36 3. CCV: {/k kw t ʧ tɬ l n j w/} + /h/ + all
vowels
/khi/ /tho/ /nhe/ 54
4. V: all vowels /a/ /e/ /i/ 6
5. VC: all vowels + final /h/ /eh/ /oh/ /uh/ 6
6. VC: all vowels + final /ʔ/ /aʔ / /iʔ / /ə̃ʔ/ 6
Total: 186
Once the reclist is complete, it can then be recorded. The choice of voice donor is an important one, and will have the most significant impact on the sound of the finished
UTAUloid. In language revitalization contexts – particularly for speech communities in advanced stages of language shift – older speakers and/or skilled language users (as recognized by the community) may have the phonetic pronunciations considered most representative or most desirable by the community. On the other hand, the act of recording for an UTAUloid voicebank can be a powerful act of language reclamation for heritage speakers and learners – an equally valid consideration. Ultimately, any speaker or learner of a language can voice an UTAUloid6, with the 'best' choice differing by community; recording
6 One important exception to this is that UTAU usage guidelines forbid the creation of a voicebank from a donor without their express consent. This includes (specifically) deceased persons, which in language
multiple UTAUloids in a given language is also, of course, both feasible and desirable. For the purposes of this Cherokee UTAUloid example, I acted as the voice donor to be recorded, as part of my own language reclamation process as a heritage learner of Cherokee.
For recording, UTAU requires samples to be in WAV format, but these can come from any source. They can be recorded directly onto a computer using software like Praat
(Boersma & Weenink 2018) or Audacity (Audacity Team 2018), or on a dedicated recording device and copied to the computer for use in UTAU after the fact. For this Cherokee
UTAUloid, the reclist was recorded as a single WAV file in Praat, at 44.1 kHz in 16bit, using an Audio-Technica ATR2100-USB dynamic cardioid microphone, and then split into
individual samples using Audacity.
Aside from a quiet environment and the use of a windscreen – to cut down on high frequencies which can make it difficult for the UTAU synthesis engine to blend sibilants smoothly – the main considerations when recording specifically for UTAUloid is that the recordings should be at as close to a constant pitch as possible, and that nucleus of the syllable or mora should be held long enough to ensure that the synthesis engine has enough steady state of the nucleus to adjust as needed. Figure 4.9 illustrates the raw samples for the sequences /ta kʷa ʧa/ of the Cherokee UTAUloid:
revitalization contexts, means that archival recordings of speakers who have passed away should not serve as voicebank sources (UTAU Software Usage Policy).
Figure 4.5: Raw samples for /ta kʷa ʧa/
After all raw audio files are recorded, they should be placed in a single directory, and can then be 'tuned' for synthesis.