Main publications
• Pierre Godard, Gilles Adda, Martine Adda-Decker, Alexandre Allauzen, Laurent Besacier, Hélène Bonneau-Maynard, Guy-Noël Kouarata, Kevin Löser, Annie Rialland, and François Yvon. Preliminary Experiments on Unsupervised Word Discovery in Mboshi. In Proceedings of Interspeech, San Francisco, California, USA, 2016
• Pierre Godard, Gilles Adda, Martine Adda-Decker, Juan Benjumea, Laurent Besacier, Jamison Cooper-Leavitt, Guy-Noël Kouarata, Lori Lamel, Hélène Maynard, Markus Müller, Annie Rialland, Sebastian Stüker, François Yvon, and Marcely Zanon Boito. A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments. In Proceedings of LREC, Miyazaki, Japan, 2018a
• Pierre Godard, Laurent Besacier, François Yvon, Martine Adda-Decker, Gilles Adda, Hélène Maynard, and Annie Rialland. Adaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages. In Proceedings of the 15th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology (SIGMORPHON), Brussels, Belgium, 2018b
• Pierre Godard, Kevin Löser, Alexandre Allauzen, Laurent Besacier, and François Yvon. Unsupervised Learning of Word Segmentation: Does Tone Matter? In Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing (CICLING), Hanoi, Vietnam, 2018c
• Pierre Godard, Marcely Zanon Boito, Lucas Ondel, Alexandre Berard, François Yvon, Aline Villavicencio, and Laurent Besacier. Unsupervised Word Segmentation from Speech with Attention. In Proceedings of Interspeech, Hyderabad, India, 2018d
Collaborations
• Lucas Ondel, Pierre Godard, Laurent Besacier, Elin Larsen, Mark Hasegawa-Johnson, Odette Scharenborg, Emmanuel Dupoux, Lukas Burget, François Yvon, and Sanjeev Khudanpur. Bayesian Models for Unit Discovery on a Very Low Resource Language. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Alberta, Canada, 2018
• Odette Scharenborg, Laurent Besacier, Alan W. Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stüker, Pierre Godard, Markus Müller, Lucas Ondel, Shruti Palaskar, Philip Arthur, Francesco Ciannella, Mingxing Du, Elin Larsen, Danny Merkx, Rachid Riad, Liming Wang, and Emmanuel Dupoux. Linguistic Unit Discovery from Multi-Modal Inputs in Unwritten Languages: Summary of the “Speaking Rosetta” JSALT 2017 Workshop. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Alberta, Canada, 2018a
• Gilles Adda, Sebastian Stüker, Martine Adda-Decker, Odette Ambouroue, Laurent Besacier, David Blachon, Hélène Bonneau-Maynard, Pierre Godard, Fatima Hamlaoui, Dmitri Idiatov, Guy-Noël Kouarata, Lori Lamel, Emmanuel-Moselly Makasso, Annie Rialland, Mark Van de Velde, François Yvon, and Sabine Zerbian. Breaking the Unwritten Language Barrier: The Bulb Project. In Proceedings of SLTU (Spoken Language Technologies for Under-Resourced Languages), Yogyakarta, Indonesia, 2016
• Annie Rialland, Martine Adda-Decker, Guy-Noël Kouarata, Gilles Adda, Laurent Besacier, Lori Lamel, Élodie Gauthier, Pierre Godard, and Jamison Cooper-Leavitt. Parallel Corpora in Mboshi (Bantu C25, Congo-Brazzaville). In Proceedings of LREC, Miyazaki, Japan, 2018
• Sebastian Stüker, Gilles Adda, Martine Adda-Decker, Odette Ambouroue, Laurent Besacier, David Blachon, Hélène Bonneau-Maynard, Pierre Godard, Fatima Hamlaoui, Dmitri Idiatov, Guy-Noël Kouarata, Lori Lamel, Emmanuel-Moselly Makasso, Annie Rialland, Mark Van de Velde, François Yvon, and Sabine Zerbian. Innovative Technologies for Under-Resourced Language Documentation: The Bulb Project. In Proceedings of CCURL (Collaboration and Computing for Under-Resourced Languages: Toward an Alliance for Digital Language Diversity), Portorož, Slovenia, 2016
• Graham Neubig, Matthias Sperber, Xinyi Wang, Matthieu Felix, Austin Matthews, Sarguna Padmanabhan, Ye Qi, Devendra Sachan, Philip Arthur, Pierre Godard, John Hewitt, Rachid Riad, and Liming Wang. XNMT: The eXtensible Neural Machine Translation Toolkit. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (AMTA), Boston, Massachusetts, USA, 2018
Chapter 2
Background
Contents
2.1 Word segmentation and alignment
    2.1.1 Two sides of the same problem
    2.1.2 Evaluation
    2.1.3 Remarks
2.2 Early models for unsupervised string segmentation
    2.2.1 Pioneer work
    2.2.2 Multigrams
    2.2.3 Minimum description length principle
2.3 Learning paradigms
    2.3.1 Signatures
    2.3.2 Signatures as finite state automata
    2.3.3 Paradigms
2.4 Nonparametric Bayesian models
    2.4.1 Stochastic processes
    2.4.2 Sampling
    2.4.3 Goldwater’s language models
    2.4.4 Nested language models
    2.4.5 Adaptor Grammars
2.5 Automatic word alignment
    2.5.1 Probabilistic formulation
    2.5.2 A series of increasingly complex parameterizations
    2.5.3 Parameters estimation
    2.5.4 Alignments extraction
2.6 Joint models for segmentation and alignment
    2.6.1 Segment, then align
    2.6.2 Jointly segment and align
2.7 Conclusion and open questions
In Chapter 1, we introduced computational language documentation (CLD) and gave a broad picture of the challenges facing field linguists and computer scientists in preserving and documenting endangered languages. The present chapter narrows the scope and introduces the particular problem examined in this thesis: the unsupervised segmentation of a stream of symbols into words, and its connection with the automatic word alignment task. In the BULB project’s methodology, this corresponds to a subpart of the automatic processing step that follows speech data collection and translation (Section 1.2.2). We review the literature and tools necessary to understand our work, and introduce notations and metrics. Parts of this chapter have appeared in (Godard and Yvon, 2016), and Section 2.5 borrows from (Godard, 2014).
2.1 Word segmentation and alignment
As discussed in Chapter 1, collecting annotated data is costly, and impractical given the challenge of documenting a large number of endangered languages. Consequently, the work presented in this thesis is concerned with unsupervised, or minimally supervised, automatic processing of the “raw” bilingual data at our disposal after collection (see Section 1.2.2).
Such data, in the BULB project’s methodology, consist of pairs of mutually translated sentences in the unwritten language¹ (henceforth UL) and in the well-resourced language² (WL). A sentence $\pi$ in the UL is a sequence of $L$ units, $\pi = \pi_1, \ldots, \pi_l, \ldots, \pi_L$, and a sentence $w$ in the WL is a sequence of $I$ units, $w = w_1, \ldots, w_i, \ldots, w_I$. In practice, units $\pi_l$ in the UL can correspond to transcribed characters, phones, pseudo-phones,³ phonemes, or even speech frames. Units $w_i$ in the WL, on the other hand, correspond to transcribed words.
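To make this notation concrete, the sketch below represents one sentence pair and one candidate segmentation of its UL side. It is only an illustration under stated assumptions: the names `SentencePair` and `segment`, the encoding of a segmentation as a list of L - 1 boundary flags, and the toy symbols are all hypothetical and do not come from the corpus or from any toolkit used in this work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentencePair:
    """One bilingual example (hypothetical container, for illustration only)."""
    ul_units: List[str]  # pi_1 .. pi_L: characters, phones, pseudo-phones, ...
    wl_words: List[str]  # w_1 .. w_I: transcribed words

    @property
    def L(self) -> int:
        # Number of units on the unwritten-language side.
        return len(self.ul_units)

    @property
    def I(self) -> int:
        # Number of words on the well-resourced-language side.
        return len(self.wl_words)


def segment(units: List[str], boundaries: List[bool]) -> List[List[str]]:
    """Group units into word-like chunks.

    Assumed encoding: boundaries[l] is True when a word boundary falls
    right after units[l]; there is one flag per gap, i.e. L - 1 flags.
    """
    if len(boundaries) != max(len(units) - 1, 0):
        raise ValueError("expected one boundary flag per gap between units")
    words: List[List[str]] = []
    current: List[str] = []
    for l, unit in enumerate(units):
        current.append(unit)
        is_last = l == len(units) - 1
        if is_last or boundaries[l]:
            words.append(current)
            current = []
    return words


# Toy placeholder symbols, not real Mboshi or French data.
pair = SentencePair(ul_units=list("abcab"), wl_words=["x", "y"])
hypothesis = segment(pair.ul_units, [False, False, True, False])
```

Under this encoding, an unsupervised segmentation model amounts to choosing the L - 1 boundary flags for each UL sentence, with or without access to the WL side.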