Speech to sign language translation system for Spanish

R. San-Segundo, R. Barra, R. Cordoba, L.F. D'Haro, F. Fernandez, J. Ferreiros, J.M. Lucas, J. Macias-Guarasa, J.M. Montero, J.M. Pardo

Grupo de Tecnologia del Habla, Departamento de Ingenieria Electronica, ETSI Telecomunicacion, Universidad Politecnica de Madrid, Ciudad Universitaria s/n, 28040 Madrid, Spain

Department of Electronics, University of Alcalá, Spain

Abstract

This paper describes the development of and the first experiments in a Spanish to sign language translation system in a real domain. The developed system focuses on the sentences spoken by an official when assisting people applying for, or renewing, their Identity Card. The system translates official explanations into Spanish Sign Language (LSE: Lengua de Signos Espanola) for Deaf people. The translation system is made up of a speech recognizer (for decoding the spoken utterance into a word sequence), a natural language translator (for converting a word sequence into a sequence of signs belonging to the sign language), and a 3D avatar animation module (for playing back the hand movements). Two proposals for natural language translation have been evaluated: a rule-based translation module (that computes sign confidence measures from the word confidence measures obtained in the speech recognition module) and a statistical translation module (in this case, parallel corpora were used for training the statistical model). The best configuration reported 31.6% SER (Sign Error Rate) and 0.5780 BLEU (BiLingual Evaluation Understudy). The paper also describes the eSIGN 3D avatar animation module (considering the sign confidence), and the limitations found when implementing a strategy for reducing the delay between the spoken utterance and the sign sequence animation.

1. Introduction

During the last two decades, there have been important advances in the three technological areas that support the implementation of an automatic speech to sign language translation system: sign language studies, spoken language translation and 3D avatar animation.

Sign language presents great variability depending on the country, and even between different areas of the same country. Because of this, since 1960 sign language studies have appeared not only in the USA (Stokoe, 1960; Christopoulos and Bonvillian, 1985; Pyers, in press), but also in Europe (Engberg-Pedersen, 2003; Atherton, 1999; Meurant, 2004), Africa (Nyst, 2004) and Asia (Abdel-Fattah, 2005; Masataka et al., 2006). In Spain, there have been several proposals for normalizing Spanish Sign Language (LSE: Lengua de Signos Espanola), but none of them has been accepted by the Deaf community. From their point of view, these proposals tend to constrain the sign language, limiting its flexibility. Rodriguez (1991) carried out a detailed analysis of Spanish Sign Language showing its main characteristics. She showed the differences between the sign language used by Deaf people and the standard proposals. This work has been expanded with new studies (Gallardo and Montserrat, 2002; Herrero-Blanco and Salazar-Garcia, 2005; Reyes, 2005).


tasks within rather limited domains (like traveling and tourism) and medium-sized vocabularies. The best-performing translation systems are based on various types of statistical approaches (Och and Ney, 2002), including example-based methods (Sumita et al., 2003), finite-state transducers (Casacuberta and Vidal, 2004) and other data-driven approaches. The progress achieved over the last 10 years results from several factors such as automatic error measures (Papineni et al., 2002), efficient algorithms for training (Och and Ney, 2003), context-dependent models (Zens et al., 2002), efficient algorithms for generation (Koehn et al., 2003), more powerful computers and larger parallel corpora.

The third technology is 3D avatar animation. An important community of scientists worldwide is developing and evaluating virtual agents embedded in spoken language systems. These systems provide a great variety of services in very different domains. Some researchers have embedded animated agents in information kiosks in public places (Cassell et al., 2002). At KTH in Stockholm, Gustafson (2002), Granstrom et al. (2002) and their colleagues have developed several multimodal dialogue systems where animated agents were incorporated to improve the interface. These include Waxholm (Bertenstam et al., 1995), a travel planning system for ferryboats in the Stockholm archipelago; August (Lundeberg and Beskow, 1999), an information system at the Culture Center in Stockholm; and AdApt (Gustafson and Bell, 2003), a mixed-initiative spoken dialogue system in which users interact with a virtual agent to locate apartments in Stockholm. Another application for combining language and animated agent technologies has been interactive books for learning. The CSLU Toolkit integrates an animated agent named Baldi. This toolkit was developed at CSLU (Oregon Graduate Institute, OGI) (Sutton and Cole, 1998; Cole et al., 1999) and is now being expanded at CSLR (University of Colorado) (Cole et al., 2003). This toolkit permits interactive books to be developed quickly with multimedia resources and natural interaction.

The eSIGN European Project (Essential Sign Language Information on Government Networks) (eSIGN project) constitutes one of the most important research efforts in developing tools for the automatic generation of sign language contents. In this project, the main result has been a 3D avatar (VGuido) with enough flexibility to represent signs from the sign language, and a visual environment for creating sign animations in a rapid and easy way. The tools developed in this project were mainly oriented to translating web content into sign language: sign language is the first language of many Deaf people, and their ability to understand written language may be poor in some cases. The project is currently working on local government websites in Germany, the Netherlands and the United Kingdom.

When developing systems for translating speech transcriptions into sign language, it is necessary to have a parallel corpus to be able to train the language and translation models, and to evaluate the systems. Unfortunately, most of the currently available corpora are too small or too general for the aforementioned task. From among the available sign language corpora mentioned in the literature, it is possible to highlight the following. The European Cultural Heritage Online organization (ECHO) presents a multilingual corpus in Swedish, British and The Netherlands sign languages (ECHO corpus). It is made up of five fables and several poems, a small lexicon and interviews with the sign language performers. Another interesting corpus (ASL corpus) is made up of a set of videos in American Sign Language created by The American Sign Language Linguistic Research group at Boston University. In (Bungeroth et al., 2006), a corpus called Phoenix for German and German Sign Language (DGS) in a restricted domain related to weather reports was presented. It comes with a rich annotation of video data, a bilingual text-based sentence corpus and a monolingual German corpus. In Spanish, it is difficult to find this kind of corpus. The most important one is currently available at the Instituto Cervantes; it consists of several videos with poetry, literature for kids and small texts from classic Spanish books. However, this corpus does not provide any text or speech transcriptions and it cannot be used for our application, a citizen care application where Deaf people can obtain general information about administrative procedures. For this reason, it has been necessary to create a new corpus.

In this paper, spoken language translation and sign language generation technologies are combined to develop a fully automatic speech to sign language translator. This paper includes the first experiments in translating spoken language into sign language in a real domain. This system is the first one developed specifically for Spanish Sign Language (LSE: Lengua de Signos Espanola). This system complements the research efforts in sign recognition for translating sign language into speech (Sylvie and Surendra, 2005).

This paper is organized as follows: in Section 2, an analysis of the translation problem is described. Section 3 presents an overview of the system including a description of the task domain and the database. Section 4 presents the speech recognizer. In Section 5, the natural language translation module is explained. Section 6 shows the sign playing module using a 3D avatar, and finally, Section 7 summarizes the main conclusions of the work.

2. Main issues in translating Spanish into Sign Language (LSE)


Montserrat, 2002). Bearing this aspect in mind, it is possible to identify four main situations when translating Spanish into LSE. These situations are explained below.

2.1. One semantic concept corresponds to a specific sign

In this case, a semantic concept is directly mapped onto a specific sign. The translation is simple and it consists of assigning one sign to each semantic concept extracted from the text. This sign can be a default translation, independent of the word string, or can differ depending on the word string from which the semantic concept is generated (Fig. 1).

2.2. Several semantic concepts are mapped onto a unique sign

The second situation appears when several concepts generate a unique sign. This situation should be solved by unifying the semantic concepts (resulting in just one concept) to proceed as in the previous situation. This union requires a concept hierarchy and the definition of a more general concept including the original concepts (Fig. 2).

2.3. One semantic concept generates several signs

The third situation occurs when it is necessary to generate several signs from a unique concept. Similar to the previous sections, the sign sequence and its order may depend on the concept and its value, or just the concept. This situation appears in many translation situations:

• VERBS. A verb concept generates a sign related to the action proposed by the verb, and auxiliary signs provide information about the action tense (past, present or future), the action subject and the gerund action (Fig. 3).

• GENERAL and SPECIFIC NOUNS. In sign language, there is a tendency to refer to objects with high precision or concretion. As a result of this, there are many domains where several specific nouns exist, but there is no general noun to refer to them collectively. For example, this happens with metals: there are different signs to refer to gold, silver, copper, etc., but there is no general sign to refer to the concept of metal. The same thing happens when considering furniture: there are several signs for table, chair, bed, etc., but there is no general sign to refer to the concept of furniture. This problem is solved in sign language by introducing several specific signs (Fig. 4).

• LEXICAL-VISUAL PARAPHRASES. Frequently, new concepts appear in Spanish which do not correspond to any sign in the sign language. In order to solve this problem, Deaf people use paraphrases to represent a new concept with a sequence of known signs. This solution is the first step in representing a new concept. If the concept appears frequently, the sign sequence is replaced by a new sign to reduce the representation time. Some examples of Lexical-Visual Paraphrases are shown in Fig. 5.


[Saludos](buenas [periodo_del_dia](tardes)) -> {S_SALUDOS}
[Greetings](good [period_of_day](afternoon)) -> {S_GREETINGS}

Independently of the word string ("buenas tardes") which generated the concept [Saludos], the system always produces the same sign: {S_SALUDOS}.

[Saludos](buenos [periodo_del_dia](dias)) -> {S_SALUDOS_MANANA}
[Greetings](good [period_of_day](morning)) -> {S_GREETINGS_MORNING}

[Saludos](buenas [periodo_del_dia](tardes)) -> {S_SALUDOS_TARDES}
[Greetings](good [period_of_day](afternoon)) -> {S_GREETINGS_AFTERNOON}

[Saludos](hola) -> {S_SALUDOS_HOLA}
[Greetings](hello) -> {S_GREETINGS_HELLO}

In these cases, the generated sign depends on the concept and its value.

Fig. 1. Examples of assigning a unique sign to a single semantic concept.

[Pregunta](Dime) [Hora](la hora) -> {S_PREGUNTA-HORA}
[Asking](tell me) [Time](the time) -> {S_ASK-TIME}

Two concepts ([Pregunta] and [Hora]) generate a unique sign. To solve the problem, both concepts are unified in a concept hierarchy and we proceed as in the previous situation:

[PreguntaHora](Dime la hora) -> {S_PREGUNTA-HORA}
[AskingTime](tell me the time) -> {S_ASK-TIME}

Fig. 2. Example of several semantic concepts mapped onto a unique sign.


ACTION TENSE

Jugaba al futbol / I played football
[Subject](_omitted) [Play](jugaba) [Football](futbol)
{S_PASADO} {S_YO} {S_JUGAR} {S_FUTBOL} -> {S_PASADO} {S_YO} {S_FUTBOL} {S_JUGAR}
{S_PAST} {S_I} {S_FOOTBALL} {S_PLAY}

In this example, the verb generates two signs for action and tense. The tense sign must be introduced at the beginning of the sign sequence. The sign language distinguishes three verb tenses: past, present and future. When the tense is present, it does not need to be signed. Note the order change necessary between the S_FUTBOL and S_JUGAR signs.

SUBJECT

In Spanish (as opposed to English), it is quite common to omit the subject of the verb, but in LSE the subject is compulsory. In Spanish, this fact does not cause any ambiguity because the verb conjugation is different according to the action subject. But when translating into LSE, the verb concept must generate three signs: action, tense and subject. In the previous example, it is shown how the verb [Play](jugué) must generate three signs: {S_PASADO} {S_YO} {S_JUGAR}.

GERUND

For indicating that the action is (or was) in process, the sign associated with the verb action is repeated twice.

Estaba jugando al futbol cuando llegaste / I was playing football when you arrived
[Subject](_omitted) [Play](estaba jugando) [Football](futbol) [Subject](_omitted) [Arrive](llegaste)
{S_PASADO} {S_YO} {S_JUGAR} {S_JUGAR} {S_FUTBOL} {S_PASADO} {S_TU} {S_LLEGAR} -> {S_PASADO} {S_YO} {S_FUTBOL} {S_JUGAR} {S_JUGAR} {S_PASADO} {S_TU} {S_LLEGAR}
{S_PAST} {S_I} {S_FOOTBALL} {S_PLAY} {S_PLAY} {S_PAST} {S_YOU} {S_ARRIVE}

Fig. 3. Types of sign sequences generated by verb concepts.

Necesito muebles para mi casa / I need furniture for my house
[Subject](_omitted) [Need](necesito) [Furniture](muebles) [Possessive](mi) [House](casa)
{S_YO} {S_NECESITAR} {S_MESA} {S_SILLA} {S_MI} {S_CASA}
{S_I} {S_NEED} {S_TABLE} {S_CHAIR} {S_MY} {S_HOUSE}

In this example, the furniture concept has no associated sign, so it must be represented by several signs related to specific nouns (table and chair) included in this general noun.

Fig. 4. Signs for general nouns not present in the sign language.

[Barro] -> {S_TIERRA} {S_CON} {S_AGUA}
[Clay] -> {S_LAND} {S_WITH} {S_WATER}

[Madriguera] -> {S_AGUJERO} {S_EXACTAMENTE} {S_CONEJO} {S_CASA}
[Burrow] -> {S_HOLE} {S_EXACTLY} {S_RABBIT} {S_HOME}

Fig. 5. Examples of Lexical-Visual Paraphrases.

The signs are language representations which are more difficult to memorize and distinguish than words. Because of this, the sign dictionary is smaller than the Spanish word dictionary. This fact makes it necessary to combine signs in order to represent other concepts.

• DATE AND TIME. As shown in Fig. 6, a date representation can be made with one or several signs. The time generally requires several signs for a full representation.


[Fecha](3 de mayo de 2007) -> {S_TERCERO} {S_MAYO} {S_DOS} {S_MIL} {S_SIETE}
[Date](May 3rd, 2007) -> {S_THIRD} {S_MAY} {S_TWO} {S_THOUSAND} {S_SEVEN}

[Hora](4:35 am) -> {S_HORA} {S_CUATRO} {S_Y} {S_TREINTA} {S_CINCO} {S_MANANA}
[Time](4:35 am) -> {S_HOUR} {S_FOUR} {S_AND} {S_THIRTY} {S_FIVE} {S_MORNING}

Fig. 6. Dates and times examples.

[Casa](casas) -> {S_CASA} {S_CASA}
[House](houses) -> {S_HOUSE} {S_HOUSE}

[Manzana](manzanas) -> {S_MANZANA_2MANOS}
[Apple](apples) -> {S_APPLE_2HANDS}

[Coche](coches) -> {S_VARIOS} {S_COCHE}
[Car](cars) -> {S_SEVERAL} {S_CAR}

Fig. 7. Plural noun examples.

• EMPHASIS. A concept can be emphasized by repeating the associated sign. For example, in order to emphasize the possessive "my" in the sentence "this is my house", the associated sign is repeated: {S_THIS} {S_MY} {S_MY} {S_HOUSE}.

• PLURAL NOUNS. There are several ways of specifying an object in plural (all of them with the same meaning): repeating the sign, introducing an adverbial sign or representing the sign with both hands. Several examples are shown in Fig. 7.

• GENDER. A new sign can be introduced into the sequence to indicate the gender of an object. Usually the gender can be deduced by context and it is not necessary to specify it. This sign appears when the gender is necessary for the meaning or when the user wants to highlight it.

2.4. Several semantic concepts generate several signs

Finally, the most complicated situation appears when it is necessary to generate several signs from several concepts with dependencies between them. These cases are less frequent than those presented in Sections 2.1, 2.2 and 2.3. Some examples are as follows:

• Verb/Action sign depending on the subject of the action. For example, the verb "fly" is represented with different signs depending on the subject of the action: bird, plane, etc.

• A similar situation arises when the sign associated with an adjective changes depending on the qualified object. For example, the sign for the adjective "good" is different when referring to a person or a material object.

3. Translation system overview

Fig. 8 shows the module diagram of the system developed for translating spoken language into Spanish Sign Language (LSE). The main modules are as follows:

• The first module, the speech recognizer, converts natural speech into a sequence of words (text). One important characteristic of this module is the confidence measure estimation, where every recognized word is tagged with a confidence value between 0.0 (lowest confidence) and 1.0 (highest confidence).

• The natural language translation module converts a word sequence into a sign sequence. For this module, the paper presents two proposals. The first one consists of a rule-based translation strategy, where a set of translation rules (defined by an expert) guides the translation process. The second alternative is based on a statistical translation approach where parallel corpora are used for training language and translation models.

• The sign animation is carried out by VGuido, the eSIGN 3D avatar developed in the eSIGN project (eSIGN project). It has been incorporated as an ActiveX control. The sign descriptions are generated beforehand through the eSIGN Editor environment.

Fig. 8. Spoken Language to Sign Language translation system.

Fig. 9 presents the user interface. In this interface, it is possible to see the virtual agent (VGuido) and other controls and windows for user interaction: a read-only text window for presenting the sequence of recognized words, a text input window for introducing a Spanish sentence, a set of slots for presenting the translated signs and their confidence, etc.

3.1. Domain and database

The experimental framework is restricted to a limited domain that consists of sentences spoken by an official when assisting people who are applying for their Identity Card or requesting related information. In this context, a speech to sign language translation system is very useful since most of the officials do not know sign language and have difficulties when interacting with Deaf people.


Fig. 9. User Interface for the Spanish to Sign Language Translation System.

The corpus created for this domain contains a total of 416 Spanish sentences with more than 650 different words. In order to represent Spanish Sign Language (LSE) in terms of text symbols, each sign has been represented by a word written in capital letters. For example, the sentence "you have to pay 20 euros as document fee" is translated into "FUTURE YOU TWENTY EURO DOC FEE PAY COMPULSORY".

Once the LSE encoding was established, a professional translator translated the original set into LSE making use of more than 320 different signs. Then, the 416 pairs were randomly divided into two disjoint sets: 266 for training and 150 for testing purposes. The main features of the corpus are summarized in Table 1. For both text-to-sign and speech-to-sign translation purposes the same test set has been used. The test sentences were recorded by two speakers (1 male and 1 female).

As shown in Table 1, the size of the vocabulary compared to the overall amount of running words in the training set is very high (every word appears 6 times on average). In addition, the perplexity of the test set is high considering the small vocabulary.

Table 1
Statistics of the bilingual corpus in Spanish and Spanish Sign Language (LSE)

                           Spanish    LSE
Training
  Sentence pairs               266
  Different sentences          259     253
  Running words               3153    2952
  Vocabulary                   532     290
Test
  Sentence pairs               150
  Running words               1776    1688
  Unknown words (OOV)           93      30
  Perplexity (3-grams)        15.4    10.7

The aforementioned ratio, together with the high perplexity, shows the high data dispersion of this database.

In these circumstances, another important aspect is the large number of unknown (OOV) words in the test set. In this task, there are 93 OOV words (out of a 532-word source-language vocabulary) and 30 OOV signs (target language).

4. Speech recognition module

The speech recognizer used is a state-of-the-art speech recognition system developed at GTH-UPM (GTH). It is an HMM (Hidden Markov Model) based system with the following main characteristics:

• It is a continuous speech recognition system: it recognizes utterances made up of several continuously spoken words. In this application, the vocabulary size is 532 Spanish words.

• Speaker independence: the recognizer has been trained with speech from 4000 speakers, making it robust across a wide range of potential speakers without the need for further training by actual users.

• The system uses a front-end with PLP coefficients derived from a Mel-scale filter bank (MF-PLP), with 13 coefficients including c0 and their first- and second-order differentials, giving a total of 39 parameters for each 10 ms frame. This front-end applies CMN and CVN (cepstral mean and variance normalization) techniques; a short sketch of this normalization is given after this list.

• For Spanish, the speech recognizer uses a set of 45 units: it differentiates between stressed/unstressed/nasalized vowels, it includes different variants for the vibrant 'r' in Spanish, different units for the diphthongs, the fricative versions of 'b', 'd', 'g', and the affricate version of 'y' (as in 'ayer' and 'conyuge'). The system also has 16 silence and noise models for detecting acoustic sounds (non-speech events like background noise, speaker artifacts, filled pauses, etc.) that appear in spontaneous speech. The system uses context-dependent continuous Hidden Markov Models (HMMs) built using decision-tree state clustering: 1807 states and seven mixture components per state. These models have been trained with more than 40 h of speech from the SpeechDat database. Although SpeechDat is a telephone speech database, the acoustic models can be used in a microphone application because CMN and CVN techniques have been used to compensate for the channel differences. The influence of this aspect on the speech recognition results is small.

• Regarding the language model, the recognition module uses statistical language modeling: 2-grams, as the database is not large enough to estimate reliable 3-grams.

• The recognition system can generate one optimal word sequence.

• The recognizer provides one confidence measure for each word recognized in the word sequence. The confidence measure is a value between 0.0 (lowest confidence) and 1.0 (highest confidence) (Ferreiros et al., 2005). This measure is important because the speech recognizer performance varies depending on several aspects: level of noise in the environment, non-native speakers, more or less spontaneous speech, or the acoustic similarity between different words contained in the vocabulary.
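As an illustration of the front-end normalization mentioned in the list above, the following is a minimal sketch (not the system's actual front-end code) of per-utterance cepstral mean and variance normalization applied to a matrix of MF-PLP feature frames; the feature extraction itself is assumed to be done elsewhere.

import numpy as np

def cmn_cvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Apply cepstral mean and variance normalization to one utterance.

    `features` is a (num_frames x num_coefficients) matrix, e.g. 39-dimensional
    MF-PLP vectors (13 coefficients plus first- and second-order differentials).
    Each coefficient track is shifted to zero mean and scaled to unit variance,
    which compensates for stationary channel differences (telephone vs. microphone).
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Usage: normalized = cmn_cvn(mfplp_frames) before passing the frames to the HMM decoder.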

For the speech recognition experiments, the following three situations have been considered:

Exp 1: In the first situation, the language model and the vocabulary were generated from the training set. This is the real situation. As we can see in Table 2, the WER is quite high. The reasons for this WER are the high number of OOV words (93 OOV words out of 532) in the test set, and the very small number of sentences available to train the language model. In order to demonstrate the validity of these reasons, the following two situations were considered.

Exp 2: In this case, the language model was generated from the training set (as in Exp 1) but the vocabulary included all words (training and testing sets). The WER difference between Exp 1 and Exp 2 reveals the influence of the OOV words (as the speech recognizer is not able to handle OOV words).

Exp 3: Finally, the third experiment tried to estimate an upper limit on the speech recognizer performance, considering all the available sentences for training the language model and generating the vocabulary. In this case, the WER difference between Exp 2 and Exp 3 shows the influence of the small amount of data used for training the language model. In this case, the WER is 4.04: a very good value for a 500-word task.

The speech recognition results for this task are presented in Table 2. These results show the outstanding influence of data sparseness (due to the small amount of data) on the decoding process: comparing Exp 1 to Exp 2, it is shown that OOV words are responsible for increasing the WER from 15.04 to 23.50. Comparing Exp 2 to Exp 3, the poor language model makes the WER increase from 4.04 to 15.04.

Table 2
Final speech recognition results: Word Error Rate (WER)

        WER      Ins (%)   Del (%)   Sub (%)
Exp 1   23.50     2.60      6.45     14.45
Exp 2   15.04     1.19      5.43      8.42
Exp 3    4.04     0.66      1.64      1.74

5. Natural language translation

The natural language translation module converts the word sequence, obtained from the speech recognizer, into a sign sequence that will be animated by the 3D avatar. For this module, two approaches have been implemented and evaluated: rule-based translation and statistical translation.

5.1. Rule-based translation

In this approach, the natural language translation module has been implemented using a rule-based technique considering a bottom-up strategy. In this case, the relationships between signs and words are defined by hand by an expert. In a bottom-up strategy, the translation analysis is carried out by starting from each word individually and extending the analysis to neighboring context words or already-formed signs (generally named blocks). This extension is made to find specific combinations of words and/or signs (blocks) that generate another sign. Not all the blocks contribute or need to be present to generate the final translation. The rules implemented by the expert define these relations. Depending on the scope of the block relations defined by the rules, it is possible to achieve different compromises between the reliability of the translated sign (higher with larger scopes) and robustness against recognition errors: when the block relations involve a large number of concepts, one recognition error can cause the rules not to be executed.

The translation process is carried out in two steps. In the first one, every word is mapped to one or several syntactic-pragmatic tags. After that, the translation module applies different rules that convert the tagged words into signs by means of grouping concepts or signs (blocks) and defining new signs. These rules can define short- and long-scope relationships between the concepts or signs. At the end of the process, the block sequence is expected to correspond to the sign sequence resulting from the translation process.
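As a rough illustration of this two-step process, the following sketch tags words and then applies pattern rules over the resulting blocks. The tag lexicon, the rules and the sign names are invented for the example (loosely based on Figs. 1 and 2), GARBAGE blocks are simply discarded, and the sketch is much simpler than the real system, which uses 153 expert-written rules and a proprietary rule interpreter.

# Hypothetical illustration of the two-step, bottom-up rule strategy.
TAG_LEXICON = {
    "dime": "ASKING", "la": "GARBAGE", "hora": "TIME",
    "buenas": "GREETING", "tardes": "PERIOD_OF_DAY",
}

# Each rule maps a pattern of consecutive blocks (tags) to the signs it generates.
RULES = [
    (("ASKING", "TIME"), ["S_PREGUNTA-HORA"]),
    (("GREETING", "PERIOD_OF_DAY"), ["S_SALUDOS"]),
]

def translate(words):
    # Step 1: map every word to a syntactic-pragmatic tag; drop GARBAGE blocks,
    # which do not contribute to the translation (a simplification of the real system).
    blocks = [t for t in (TAG_LEXICON.get(w.lower(), "GARBAGE") for w in words)
              if t != "GARBAGE"]
    signs = []
    i = 0
    # Step 2: bottom-up scan; when a rule pattern matches a group of neighboring
    # blocks, the group generates the corresponding signs.
    while i < len(blocks):
        for pattern, output in RULES:
            if tuple(blocks[i:i + len(pattern)]) == pattern:
                signs.extend(output)
                i += len(pattern)
                break
        else:
            i += 1  # no rule fired for this block; skip it
    return signs

print(translate("Dime la hora".split()))    # ['S_PREGUNTA-HORA']
print(translate("buenas tardes".split()))   # ['S_SALUDOS']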

The rule-based translation module contains 153 translation rules. The translation module has been evaluated with the test set presented in Table 1. For evaluating the performance of the systems, the following evaluation measures have been considered: SER (Sign Error Rate), PER (Position Independent SER), BLEU (BiLingual Evaluation Understudy), and NIST. The first two measures are error measures (the higher the value, the worse the quality) whereas the last two are accuracy measures (the higher, the better). The final results reported by this module are presented in Table 3.
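For reference, SER is computed, like WER, as the edit distance between the hypothesis and the reference sign sequences divided by the reference length, while PER ignores sign order. A small illustrative implementation, assuming the standard definitions (this is not code from the evaluation toolkit actually used), is given below.

from collections import Counter

def sign_error_rate(reference, hypothesis):
    """SER (%): Levenshtein distance between sign sequences over the reference length."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[n][m] / n

def position_independent_error_rate(reference, hypothesis):
    """PER (%): errors counted on the sign multisets, ignoring sign order."""
    matches = sum((Counter(reference) & Counter(hypothesis)).values())
    errors = max(len(reference), len(hypothesis)) - matches
    return 100.0 * errors / len(reference)

ref = "FUTURE YOU TWENTY EURO DOC FEE PAY COMPULSORY".split()
hyp = "FUTURE YOU EURO TWENTY DOC PAY COMPULSORY".split()
print(sign_error_rate(ref, hyp), position_independent_error_rate(ref, hyp))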


Table 3
Results obtained with the rule-based translation system

        SER      PER      BLEU     NIST
Exp 1   31.60    27.02    0.5780   7.0945
Exp 2   24.94    20.21    0.6143   7.8345
Exp 3   18.23    14.87    0.7072   8.4961
REF     16.75    13.17    0.7217   8.5992

Compared to the REF (text-to-text) results, the speech recognizer introduces recognition mistakes that produce more translation errors: the percentage of wrong signs increases and the BLEU decreases.

Analyzing the results in detail, it is possible to report that the most frequent errors committed by the translation module have the following causes:

• In Spanish, it is very common to omit the subject of a sentence, but in Sign Language it is compulsory to use it. In order to deal with this characteristic, several rules have been implemented to verify whether every verb has a subject and to include a subject for any verb without one. When applying these rules, some errors are inserted: typically, a wrong subject is associated with a verb.

• Several possible translations. One sentence can be translated into different sign sequences. When one of the possibilities is not considered in the evaluation, some errors are reported by mistake. This situation appears, for example, when the passive form is omitted in several examples.

• In Sign Language, a verb complement is introduced by a specific sign: for example, a time complement is introduced with the sign WHEN, or a mode complement is introduced with the sign HOW. There are several rules for detecting the type of complement, but sometimes it is very difficult to detect the difference between a place complement and a time complement. Moreover, when the verb complement is very short (made up of one word: "today", "now", "here", ...), this introductory sign is omitted for simplicity (Deaf people do not sign the introductory sign, to reduce the signing time). When the system estimates the complement length wrongly, an error occurs.

• In the test set, there is a large number of unknown words that generate a significant number of errors.

5.1.1. Sign confidence measure

The translation module generates one confidence value for every sign: a value between 0.0 (lowest confidence) and 1.0 (highest confidence). This sign confidence is computed from the word confidence obtained from the speech recognizer. This confidence computation is carried out by an internal procedure that is coded inside the proprietary language interpreter that executes the rules of the translation module.

In this internal engine, there are "primitive functions" responsible for the execution of the translation rules written by the experts. Each primitive has its own way of generating the confidence for the elements it produces. One common case is that of primitives which check for the existence of a sequence of words/concepts (source language) in order to generate some signs (target language): the primitive usually assigns to the newly created elements the average confidence of the blocks on which it has relied.

In other more complex cases, the confidence for the generated signs may depend on a weighted combination of confidences from a mixture of words and/or internal or final signs. This combination can consider different weights for the words or concepts involved in the rule. These weights are defined by the expert at the same time the rule is coded. For example, in Fig. 10, the confidence measures for the signs "DNI" and "SER" (0.7 in both cases) have been computed as the average of the confidences of "denei" (0.6) and "es" (0.8). The confidence values of the words tagged as GARBAGE are not used to compute sign confidence values. In Ferreiros' work (Ferreiros et al., 2005), it is possible to find a detailed description of confidence measure computation.
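A minimal sketch of this kind of confidence propagation (a simplified stand-in for the proprietary rule interpreter, with optional rule weights that here are purely illustrative) could look as follows.

def sign_confidence(word_confidences, weights=None):
    """Combine the confidences of the words/blocks a rule relied on.

    `word_confidences` holds the speech-recognition confidences (0.0-1.0) of the
    contributing blocks; words tagged as GARBAGE should already be excluded.
    With no weights this is the plain average used by the simplest primitives;
    expert-defined weights give the weighted combination used by more complex rules.
    """
    if weights is None:
        weights = [1.0] * len(word_confidences)
    total = sum(weights)
    return sum(c * w for c, w in zip(word_confidences, weights)) / total

# Example from Fig. 10: "el" is tagged GARBAGE and ignored; the rule mapping
# "denei es" onto the signs DNI and SER averages their confidences.
conf = sign_confidence([0.6, 0.8])   # 0.7, assigned to both {DNI} and {SER}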

This system is one of the few natural language translation modules that generates a confidence measure for signs (target language). Section 6.1 describes the use of sign confidence measures when representing the sign.

5.2. Statistical translation

The Phrase-based translation system is based on the software released to support the shared task at the 2006 NAACL Workshop on Statistical Machine Translation (http://www.statmt.org/wmt06/).

The phrase model has been trained following these steps:

• Word alignment computation. At this step, the GIZA++ software (Och and Ney, 2000) has been used to calculate the alignments between words and signs. The parameter "alignment" was fixed to "grow-diag-final" as the best option.

• Phrase extraction (Koehn et al., 2003). All phrase pairs that are consistent with the word alignment are collected (a sketch of this step is given after this list). The maximum size of a phrase has been fixed to 7.

• Phrase scoring. In this step, the translation probabilities are computed for all phrase pairs. Both translation probabilities are calculated: forward and backward.
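As an illustration of the phrase-extraction step, the sketch below collects all phrase pairs consistent with a word alignment, up to a maximum phrase length. It is a simplified version of the standard algorithm (it does not extend spans over unaligned boundary words) and is not the actual toolkit code.

def extract_phrases(alignment, src_len, max_len=7):
    """Collect phrase pairs consistent with the word alignment.

    `alignment` is a set of (src_pos, trg_pos) links (0-based). A source span and
    a target span form a consistent pair when no link crosses the span boundary.
    """
    phrases = set()
    for s_start in range(src_len):
        for s_end in range(s_start, min(s_start + max_len, src_len)):
            # Target positions linked to the chosen source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency: every link into the target span must come from inside the source span.
            consistent = all(s_start <= s <= s_end
                             for (s, t) in alignment if t_start <= t <= t_end)
            if consistent:
                phrases.add(((s_start, s_end), (t_start, t_end)))
    return phrases

# Toy example: "dos mil siete" aligned monotonically to "DOS MIL SIETE".
links = {(0, 0), (1, 1), (2, 2)}
print(sorted(extract_phrases(links, src_len=3)))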

The Pharaoh decoder is used for the translation process. This program is a beam search decoder for phrase-based statistical machine translation models (Koehn, 2004). In order to obtain the 3-gram language model needed by Pharaoh, the SRI language modeling toolkit has been used. The Carmel software was used for the K-best list generation.

el (0.5)    denei (0.6)   es (0.8)
GARBAGE     DNI           VER_SER
            DNI (0.7)     SER (0.7)

Fig. 10. Example of sign confidence computation from the word confidences provided by the speech recognizer.


Table 4
Results obtained with the statistical system

        SER      PER      BLEU     NIST
Exp 1   38.72    34.25    0.4941   6.4123
Exp 2   36.08    32.00    0.4998   6.4865
Exp 3   34.22    30.04    0.5046   6.5596
REF     33.74    29.14    0.5152   6.6505


The speech-input translation results are shown in Table 4 for the three experiments described previously. As a baseline (denoted in the table as REF), the text-to-text translation results (considering the utterance transcription directly) are included.

The rule-based strategy has provided better results on this task because it is a restricted domain and it has been possible to develop a complete set of rules with a reasonable effort. Another important aspect is that the amount of data for training is very small and the statistical models cannot be trained properly. In these circumstances, the rules defined by an expert introduce knowledge not seen in the data, making the system more robust to new sentences.

6. Sign animation with the eSIGN avatar: VGuido

The signs are represented by means of VGuido (the eSIGN 3D avatar) animations. An avatar animation consists of a temporal sequence of frames, each of which defines a static posture of the avatar at the appropriate moment. Each of these postures can be defined by specifying the configuration of the avatar's skeleton, together with some characteristics which define additional distortions to be applied to the avatar.

In order to make an avatar sign, it is necessary to send pre-specified animation sequences to the avatar. A signed animation is generated automatically from an input script in the Signing Gesture Markup Language (SiGML) notation. SiGML is an XML application which supports the definition of sign sequences. The signing system constructs human-like motion from scripted descriptions of signing motions. These signing motions belong to "Gestural SiGML", a subset of the full SiGML notation, which is based on the HamNoSys notation for Sign Language transcription (Prillwitz et al., 1989).

The concept of synthetic animation used in eSIGN is to create scripted descriptions for individual signs and store them in a database. Populating this database may take some time, but considering a minimum amount of one hundred signs, it is possible to obtain signed phrases for a restricted domain. This process is carried out by selecting the required signs from the database and assembling them in the correct order.
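A minimal sketch of this assembly step is shown below: per-sign SiGML scripts are looked up in a database and concatenated, in the order given by the translation module, into one document for the avatar. The database contents and the helper function are hypothetical; only the element names follow the example in Fig. 11.

# Hypothetical database of per-sign SiGML descriptions (contents abbreviated).
SIGN_DB = {
    "OBLIGATORIO": '<hns_sign gloss="GLOSS:OBLIGATORIO"><hamnosys_manual/></hns_sign>',
    "DNI": '<hns_sign gloss="GLOSS:DNI"><hamnosys_manual/></hns_sign>',
}

def build_sigml(sign_sequence):
    """Concatenate the stored per-sign scripts, in order, into one SiGML document."""
    body = "\n".join(SIGN_DB[sign] for sign in sign_sequence)
    return "<sigml>\n{}\n</sigml>".format(body)

print(build_sigml(["DNI", "OBLIGATORIO"]))   # script sent to the VGuido control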

The major advantage of this approach is its flexibility: the lexicon-building task does not require special equipment, just a database. The morphological richness of sign languages can be modeled using a sign language editing environment (the eSIGN editor) without the need of manually describing each inflected form.

HamNoSys and other components of SiGML intentionally mix primitives for static gestures (such as parts of the initial posture of a sign) with dynamics (such as movement directions). This allows the transcriber to focus on essential characteristics of the signs when describing a sign. This information, together with knowledge regarding common aspects of human motion as used in signing, such as speed and size of movement, is also used by the movement generation process to compute the avatar's movements from the scripted instructions. Fig. 11 shows the process for specifying a sign from the HamNoSys description.

6.1. Incorporating confidence measures in sign animation

As described above, the result of the natural language translation process is a sign sequence. Every sign in the sequence can be tagged with a confidence measure ranging from 0.0 (lowest confidence) to 1.0 (highest confidence). Depending on its confidence value, each sign is represented in a different way. There are three confidence levels defined:


Fig. 11 (excerpt): SiGML script generated for the sign glossed OBLIGATORIO.

<sigml>
  <hns_sign gloss="GLOSS:OBLIGATORIO">
    <hamnosys_nonmanual>
    </hamnosys_nonmanual>
    <hamnosys_manual>
      <hampinch12open/>
      <hamextfingero/>
      <hampalmul/>
      <hamchest/>
      <hamlrat/>
      <hamarmextended/>
      <hamseqbegin/>
      <hammovedl/>
      <hamsmallmod/>
      <hamrepeatreverse/>
      <hamseqend/>
      <hamrepeatfromstartseveral/>
    </hamnosys_manual>
  </hns_sign>
</sigml>


• High confidence. Confidence value higher than 0.5 defines a level where all signs are represented in their standard way.

• Medium confidence. Confidence value between 0.25 and 0.5. In this case, the sign is represented but an additional low confidence signal is presented: an interrogative mark or a confused avatar face.

• Low confidence. Confidence value of less than 0.25. At this level, the sign is not played. During the time associated to this sign, an interrogative mark or a confused avatar expression is presented.
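A compact sketch of this three-level policy follows, with the thresholds as listed above; the returned display actions are simplified placeholders for the avatar behaviour.

def animation_action(sign, confidence):
    """Map a sign confidence (0.0-1.0) to how the avatar should present it."""
    if confidence > 0.5:                     # high confidence: standard representation
        return {"play_sign": sign, "show_doubt_marker": False}
    if confidence >= 0.25:                   # medium confidence: sign plus a doubt marker
        return {"play_sign": sign, "show_doubt_marker": True}
    # low confidence: the sign is skipped; only a question mark / confused face is shown
    return {"play_sign": None, "show_doubt_marker": True}

print(animation_action("S_DNI", 0.7))
print(animation_action("S_SER", 0.18))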

6.2. Reducing the delay between the spoken utterance and the sign animation

One important aspect to be considered in a speech to sign language translation system is the delay between the spoken utterance and the animation of the sign sequence. This delay is around 1-2 s and it slows down the interaction. In order to reduce this delay, the speech recognition system has been modified to report partial recognition results every 100 ms. These partial results are translated into partial sign sequences that can be animated without the need to wait until the end of the spoken utterance.

When implementing this solution, an important problem appeared due to the fact that the translation is not a linear alignment process between spoken words and signs. Words that are spoken in the middle of the utterance can provide information about the first signs. So, until these words are spoken, the first sign is not completely defined and it cannot be represented.

This problem appears in two situations:

• Verb tense signs. These signs are FUTURE and PAST (there is not a PRESENT sign because it is the default) and they must appear at the beginning of the sign sequence independently of the verb position. In partial translations, this sign appears when the action verb is spoken and this usually happens approximately in the middle of the utterance.

• Verb subject signs. The second aspect is the verb subject. In colloquial Spanish, the subject can frequently be omitted, but in Sign Language, every verb must have a subject. In the translation process, it is necessary to check whether a verb has a subject and the system must include one (at the beginning of the sentence) if there is none.

In order to solve this problem, two restrictions were imposed for representing a partial sign sequence:

• The first sign must be a verb tense or a subject sign: in this case, all sign language sentences start with one of these signs. Considering that these signs are defined when a verb appears in the Spanish sentence, this restriction can be reformulated as follows: the partial sequence should contain a verb sign in order to start the sign representation.

• The first sign should be the same for at least three con-secutive partial sign sequences.
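A minimal sketch of this gating policy over the stream of partial translations is given below; the sign names and helper class are hypothetical, and the partial results are assumed to arrive roughly every 100 ms from the recognizer.

VERB_RELATED_FIRST_SIGNS = {"S_FUTURO", "S_PASADO", "S_YO", "S_TU"}  # tense/subject signs (illustrative)

class PartialSignGate:
    """Decide when a partial sign sequence is stable enough to start animating."""

    def __init__(self, required_repeats=3):
        self.required_repeats = required_repeats
        self.last_first_sign = None
        self.repeats = 0

    def ready(self, partial_signs):
        if not partial_signs:
            return False
        first = partial_signs[0]
        # Restriction 1: the sequence must already start with a tense or subject sign,
        # which only appears once a verb has been spoken.
        if first not in VERB_RELATED_FIRST_SIGNS:
            self.last_first_sign, self.repeats = None, 0
            return False
        # Restriction 2: the first sign must stay the same for three consecutive partials.
        if first == self.last_first_sign:
            self.repeats += 1
        else:
            self.last_first_sign, self.repeats = first, 1
        return self.repeats >= self.required_repeats

gate = PartialSignGate()
partials = [["S_DNI"], ["S_FUTURO", "S_DNI"], ["S_FUTURO", "S_DNI"], ["S_FUTURO", "S_DNI", "S_PAGAR"]]
for p in partials:
    print(gate.ready(p))   # False, False, False, True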

With these restrictions, a 40% delay reduction is achieved without affecting the translation process performance.

7. Conclusions

This paper has presented the implementation and the first experiments on a speech to sign language translation system for a real domain. The domain consists of sentences spoken by an official when assisting people who apply for, or renew, their Identity Card. The translation system implemented is made up of three modules. A speech recognizer is used for decoding the spoken utterance into a word sequence. After that, a natural language translation module converts the word sequence into a sequence of signs belonging to the Spanish Sign Language (LSE). In the last module, a 3D avatar plays the sign sequence.

In these experiments, two proposals for the natural language translation module have been implemented and evaluated. The first one consists of a rule-based translation module reaching a 31.60% SER (Sign Error Rate) and a 0.5780 BLEU (BiLingual Evaluation Understudy). In this proposal, confidence measures from the speech recognizer have been used to compute a confidence measure for every sign. This confidence measure is used during the sign animation process to inform the user about the reliability of the translated sign.

The second alternative for natural language translation is based on a statistical translation approach where parallel corpora were used for training. The best configuration has reported a 38.72% SER and a 0.4941 BLEU.

The rule-based strategy has provided better results on this task because it is a restricted domain and it has been possible to develop a complete set of rules with a reasonable effort. Another important aspect is that the amount of data for training is very small and the statistical models cannot be estimated reliably. In these circumstances, the rules defined by an expert introduce knowledge not seen in the data, making the system more robust against new sentences. The rule-based translation module has presented a very high percentage of deletions compared to the rest of the errors. This is due to the rule-based strategy: when the speech recognition makes an error, some concept patterns do not appear (they do not fit into the defined rules) and some signs are not generated. On the other hand, the statistical translation module has generated a greater percentage of insertions and substitutions compared to the rule-based system.


Finally, the paper has described a strategy for reducing the delay between the spoken utterance and the sign animation. With the solution proposed, a 40% delay reduction was achieved without affecting the translation process performance.

Acknowledgements

The authors would like to thank the eSIGN (Essential Sign Language Information on Government Networks) consortium for giving us permission to use the eSIGN Editor and the 3D avatar in this research work. This work has been supported by the following projects: ATINA (UPM-DGUI-CAM, Ref: CCG06-UPM/COM-516), ROBINT (MEC, Ref: DPI2004-07908-C02) and EDECAN (MEC, Ref: TIN2005-08660-C04). The work presented here was carried out while Javier Macias-Guarasa was a member of the Speech Technology Group (Department of Electronic Engineering, ETSIT de Telecomunicacion, Universidad Politecnica de Madrid). The authors also want to thank Mark Hallett for the English revision.

References

Abdel-Fattah, M.A., 2005. Arabic sign language: a perspective. Journal of Deaf Studies and Deaf Education 10 (2), 212-221.

ASL corpus: <http://www.bu.edu/asllrp/>.

Atherton, M., 1999. Welsh today BSL tomorrow. In: Deaf Worlds 15 (1), pp. 11-15.

Bertenstam, J. et al., 1995. The Waxholm system-A progress report. In: Proc. Spoken Dialogue Systems, Vigso, Denmark.

Bungeroth, J. et al., 2006. A German Sign Language Corpus of the Domain Weather Report. In: 5th Internat. Conf. on Language Resources and Evaluation, Genoa, Italy.

Casacuberta, F., Vidal, E., 2004. Machine translation with inferred stochastic finite-state transducers. Computational Linguistics 30 (2), 205-225.

Cassell, J., Stocky, T., Bickmore, T., Gao, Y., Nakano, Y., Ryokai, K., Tversky, D., Vaucelle, C., Vilhjalmsson, 2002. MACK: media lab autonomous conversational kiosk. In: Proc. of Imagina: Intelligent Autonomous Agents, Monte Carlo, Monaco.

Christopoulos, C., Bonvillian, J., 1985. Sign language. Journal of Communication Disorders 18, 1-20.

Cole, R. et al., 1999. New tools for interactive speech and language training: using animated conversational agents in the classrooms of profoundly deaf children. In: Proc. ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education, London, pp. 45-52.

Cole, R., Van Vuuren, S., Pellom, B., Hacioglu, K., Ma, J., Movellan, J., Schwartz, S., Wade-Stein, D., Ward, W., Yan, J., 2003. Perceptive animated interfaces: first steps toward a new paradigm for human computer interaction. IEEE Transactions on Multimedia: Special Issue on Human Computer Interaction 91 (9), 1391-1405.

ECHO corpus: <http://www.let.kun.nl/sign-lang/echo/>.

Engberg-Pedersen, E., 2003. From pointing to reference and predication: pointing signs, eyegaze, and head and body orientation in Danish Sign Language. In: Kita, Sotaro (Ed.), Pointing: Where Language, Culture, and Cognition Meet. Erlbaum, Mahwah, NJ, pp. 269-292.

eSIGN project: <http://www.sign-lang.uni-hamburg.de/eSIGN/>.

Ferreiros, J., San-Segundo, R., Fernandez, F., D'Haro, L., Sama, V., Barra, R., Mellen, P., 2005. New Word-Level and Sentence-Level Confidence Scoring Using Graph Theory Calculus and its Evaluation on Speech Understanding. In: Interspeech 2005, Lisboa, Portugal, September, pp. 3377-3380.

Gallardo, B., Montserrat, V., 2002. Estudios Lingüísticos sobre la lengua de signos española. Universidad de Valencia. Ed. AGAPEA. ISBN: 8437055261. ISBN-13: 9788437055268.

Granstrom, B., House, D., Beskow, J., 2002. Speech and Signs for Talking Faces in Conversational Dialogue Systems. Multimodality in Language and Speech Systems. Kluwer Academic Publishers, pp. 209-241.

GTH: <http://lorien.die.upm.es>.

Gustafson, J., 2002. Developing multimodal spoken dialogue systems - empirical studies of spoken human-computer interactions. PhD Dissertation. Dept. of Speech, Music and Hearing, Royal Inst. of Technology, Stockholm, Sweden.

Gustafson, J., Bell, L., 2003. Speech technology on trial: experiences from the august system. Journal of Natural Language Engineering: Special Issue on Best Practice in Spoken Dialogue Systems, 273-286.

Herrero-Blanco, A., Salazar-Garcia, V., 2005. Non-verbal predictability and copula support rule in Spanish Sign Language. In: de Groot, Casper, Hengeveld, Kees (Eds.), Morphosyntactic expression in functional grammar. (Functional Grammar Series; 27) Berlin [u.a.]: de Gruyter, pp. 281-315.

Instituto Cervantes: <http://www.cervantesvirtual.com/seccion/signos/>.

Koehn, P., Och, F.J., Marcu, D., 2003. Statistical phrase-based translation. In: Human Language Technology Conference 2003 (HLT-NAACL 2003), Edmonton, Canada, pp. 127-133.

Koehn, P., 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. AMTA.

Lundeberg, M., Beskow, J., 1999. Developing a 3D-agent for the August dialogue system. In: Proc. Audio-Visual Speech Processing, Santa Cruz, CA.

Masataka, N. et al., 2006. Neural correlates for numerical processing in the manual mode. Journal of Deaf Studies and Deaf Education 11 (2), 144-152.

Meurant, L., 2004. Anaphora, role shift and polyphony in Belgian sign language. Poster. In: TISLR 8 Barcelona, September 30-October 2. Programme and Abstracts. (Internat. Conf. on Theoretical Issues in Sign Language Research; 8), pp. 113-115.

Nyst, V., 2004. Verbs of motion in Adamorobe Sign Language. Poster. In: TISLR 8 Barcelona, September 30-October 2. Programme and Abstracts. (Internat. Conf. on Theoretical Issues in Sign Language Research; 8), pp. 127-129.

Och, F.J., Ney, H., 2000. Improved statistical alignment models. In: Proc. of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, pp. 440-447.

Och, F.J., Ney, H., 2002. Discriminative training and maximum entropy models for statistical machine translation. In: Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, pp. 295-302.

Och, F.J., Ney, H., 2003. A systematic comparison of various alignment models. Computational Linguistics 29 (1), 19-51.

Papineni, K., Roukos, S., Ward, T., Zhu, W.J., 2002. BLEU: a method for automatic evaluation of machine translation. In: 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, pp. 311-318.

Prillwitz, S., Leven, R., Zienert, H., Hanke, T., Henning, J., et al., 1989. Hamburg Notation System for Sign Languages - An Introductory Guide. In: International Studies on Sign Language and the Communication of the Deaf, Vol. 5. Institute of German Sign Language and Communication of the Deaf, University of Hamburg.

Pyers, J.E., 2006. Indicating the body: Expression of body part terminology in American Sign Language. Language Sciences 28 (2-3), 280-303.


Rodriguez, M.A., 1991. Lenguaje de signos. PhD Dissertation. Confederacion Nacional de Sordos Espanoles (CNSE) and Fundacion ONCE, Madrid, Spain.

Stokoe, W., 1960. Sign Language structure: an outline of the visual communication systems of the American deaf. Studies in Linguistics. Buffalo, Univ. Paper 8.

Sumita, E., Akiba, Y., Doi, T., et al., 2003. A Corpus-Centered Approach to Spoken Language Translation. In: Conf. of the European Chapter of the Association for Computational Linguistics (EACL), Budapest, Hungary, pp. 171-174.

Sutton, S., Cole, R., 1998. Universal speech tools: the CSLU toolkit. In: Proc. of Internat. Conf. on Spoken Language Processing, Sydney, Australia, pp. 3221-3224.

Sylvie, O., Surendra, R., 2005. Automatic sign language analysis: a survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (6).
