At first blush, it seems rather surprising that short phone sequences (phone n-grams) can capture sufficient meaning to support SCR. Generally, semantics in human language is considered to be situated at the word level and higher. However, recall from Automatic Speech Recognition that the morpheme is the basic unit of meaning in speech, and that this unit is (often) smaller than a word. Single phones cannot be expected to encode much, if any, semantic information, but phone sequences of length two (phone bigrams) and especially of length three (phone trigrams) and higher come close to morphemes in length, and for this reason it is plausible that such units can be used to capture semantic information.

Indexing features consisting of sequences of phones were initially proposed by [241] and tested on a small corpus of German-language radio news. The phone transcripts are generated by a phone-based ASR system that uses a phone-bigram language model. The phone sequences extracted from the phone-based ASR transcripts to be used as indexing features are maximally overlapping phone sequences, 3–6 phones in length. The sequences are chosen by a method that eliminates both very frequent and very infrequent sequences. To perform retrieval, the query is first mapped to phone sequences and the VSM is used to compare the query and the documents in the collection. The system described in [254] was an early prototype that made use of this approach, adopting triphone indexing features.
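The indexing scheme just described can be sketched in a few lines. This is a minimal illustration, not the implementation of [241]: the function names, the document-frequency thresholds (`min_df`, `max_df_ratio`), and the use of raw term counts with cosine similarity as the VSM weighting are all assumptions made for the example.

```python
import math
from collections import Counter


def phone_ngrams(phones, min_n=3, max_n=6):
    """Return maximally overlapping phone n-grams (stride of one phone)."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(phones) - n + 1):
            grams.append(" ".join(phones[i:i + n]))
    return grams


def select_features(doc_grams, min_df=2, max_df_ratio=0.9):
    """Keep n-grams that are neither very rare nor very frequent.

    The thresholds are hypothetical; the source only states that both
    extremes are eliminated.
    """
    df = Counter(g for grams in doc_grams for g in set(grams))
    n_docs = len(doc_grams)
    return {g for g, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}


def tf_vector(grams, vocab):
    """Term-frequency vector restricted to the selected features."""
    return Counter(g for g in grams if g in vocab)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

At query time, the query phone string would be passed through the same `phone_ngrams` and `tf_vector` steps and compared with each document vector using `cosine`.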

In [304], phone sequences are created by decomposing word-level transcripts generated by a word-based ASR system into phone-based transcripts with the help of a phonetic dictionary. All phone sequences of 3–6 phones in length are used as indexing features. Queries are also converted into phone strings via the dictionary. It is important to point out that the dictionary used for this conversion process is larger than the lexicon of the ASR system. If it were not, the phone-based method would not help to compensate for OOV. Again, the VSM is used for retrieval. The method was tested on a collection of English-language news stories. The use of phone strings in [304] is intended to emulate the effect of wordspotting in phoneme lattices, which is a computationally more expensive technique. In [174], the method, using phone 5-grams that overlapped by four phones, was shown to perform well for English-language broadcast news retrieval.
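The dictionary-based decomposition can be illustrated with a small sketch. The pronunciation dictionary below is a hypothetical toy example with illustrative entries, and silently skipping words that are missing from the dictionary is an assumption of this sketch, not a detail reported in [304]. Taking 5-grams with a stride of one phone yields neighbors that overlap by four phones, as in [174].

```python
# Hypothetical toy pronunciation dictionary (illustrative entries only).
PRON_DICT = {
    "spoken": ["s", "p", "ow", "k", "ax", "n"],
    "content": ["k", "aa", "n", "t", "eh", "n", "t"],
}


def words_to_phones(words, pron_dict):
    """Decompose a word-level transcript into a single phone string."""
    phones = []
    for word in words:
        # Assumption: words without a dictionary entry contribute no phones.
        phones.extend(pron_dict.get(word.lower(), []))
    return phones


def overlapping_ngrams(phones, n):
    """All phone n-grams taken with a stride of one phone."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


transcript_phones = words_to_phones(["Spoken", "content"], PRON_DICT)
five_grams = overlapping_ngrams(transcript_phones, 5)
```

Because the same conversion is applied to the query, a query word absent from the recognizer lexicon can still be matched at the phone level, provided the conversion dictionary covers it.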

In [195], different methods for extracting overlapping phone-sequence indexing features for SCR are explored in detail. This article arrives at the general conclusion that phone-based retrieval is not as effective as word-based retrieval, but there are certain situations where it is appropriate. Specifically, phone-based retrieval is effective for addressing the OOV problem. Further, if speech recognition must be performed on a platform with limited capacity (e.g., a hand-held device), then a small language model, such as a phoneme bigram model, makes the ASR system lightweight and compact. The authors of [195] find that in terms of phone-sequence-based indexing features, a combination of phone 3-grams and 4-grams proved most effective. This result confirms the findings of [304] that phone-based features derived from word-level transcripts are able to help compensate for word-level error. Further, [195] shows that ignoring word boundaries when extracting phone-based features does not affect retrieval performance significantly. Similar results are achieved by [197], which investigates a wide variety of different subword indexing terms derived from speech transcripts produced by a recognizer with phoneme-level acoustic models and a phoneme-bigram language model. Retrieval experiments were performed using the VSM and 50 topical queries. Overlapping phone trigrams yielded the best retrieval performance. The authors conclude that the overlap of the strings is important because it provides more opportunities for a partial match to be made between the query and the ASR transcript.

Experiments reported in [197] start with phoneme monograms and gradually increase the length of the phoneme sequences. These reveal that retrieval performance first increases and then falls off. This behavior clearly demonstrates the importance of using indexing features that are specific enough to be representative, but not overly specific to the point where they fail to generalize.
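One side of this trade-off can be seen in a toy example. The sketch below does not reproduce the experiments of [197]; the phone strings and the single substitution error are invented for illustration. It measures what fraction of the distinct query n-grams still appear in an errorful transcript as n grows.

```python
def ngram_set(phones, n):
    """Distinct phone n-grams taken with a stride of one phone."""
    return {tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)}


def matched_fraction(query, transcript, n):
    """Fraction of distinct query n-grams that also occur in the transcript."""
    query_grams = ngram_set(query, n)
    if not query_grams:
        return 0.0
    return len(query_grams & ngram_set(transcript, n)) / len(query_grams)


# Hypothetical query phones and a transcript with one substitution (t -> d).
query = ["k", "aa", "n", "t", "eh", "n", "t"]
recognized = ["k", "aa", "n", "d", "eh", "n", "t"]

fractions = {n: matched_fraction(query, recognized, n) for n in range(1, 5)}
# fractions == {1: 1.0, 2: 0.8, 3: 0.4, 4: 0.0}
```

In this toy case a single recognition error removes every 4-gram match, while monograms match fully but would also match almost any other document. Intermediate lengths retain partial matches while remaining reasonably specific.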

We close the discussion on phone sequence indexing units by mentioning work that makes use of sequences of units that are approximately phones. In [87], specialized indexing features are used that are defined as “the maximum sequence of consonants enclosed by two maximum sequences of vowels at both ends.” Note that these sequences do not overlap, but rather the speech signal is cut in the middle of a vowel. A vowel is easily identifiable within the speech signal and also more stable than a phoneme transition, making this choice of segmentation boundary a natural and robust one. The indexing features are extracted from the speech signal using a keyword spotting system. In subsequent work [241], the authors point out that using phone-sequence features instead of specialized phone-like sequences is beneficial because it greatly increases the ease of feature extraction.
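The definition quoted from [87] can be approximated on a symbolic phone string, as in the sketch below. This is only an analogy: [87] extracts the features from the speech signal with a keyword spotting system, whereas the sketch works on phone symbols, and the vowel inventory is a hypothetical toy set. At the symbol level, neighboring features share a whole vowel run, whereas in the signal that vowel is cut in the middle so that the segments do not overlap.

```python
from itertools import groupby

# Hypothetical toy vowel inventory, for illustration only.
VOWELS = {"a", "e", "i", "o", "u"}


def vcv_features(phones, vowels=VOWELS):
    """Emit each maximal consonant run together with the maximal vowel
    runs that enclose it on both sides."""
    # Split the phone string into alternating maximal vowel/consonant runs.
    runs = [(is_vowel, list(group))
            for is_vowel, group in groupby(phones, key=lambda p: p in vowels)]
    features = []
    for i in range(1, len(runs) - 1):
        is_vowel, run = runs[i]
        # Interior consonant runs are always flanked by vowel runs.
        if not is_vowel:
            features.append(tuple(runs[i - 1][1] + run + runs[i + 1][1]))
    return features
```

For the phone string `["a", "s", "t", "o", "i", "k", "e"]`, this yields the two features `("a", "s", "t", "o", "i")` and `("o", "i", "k", "e")`.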

A similar observation has been established in the area of spoken content classification. In [286], two topic classification systems are tested that use phone-sized units as their indexing features. The first uses codebook class sequences (CCS), which are sequences of phone-sized (80 ms) units that have been created by an automatic vector quantization process. The second uses phones generated by a phone-based recognizer. Both feature sets are demonstrated to be suitable for topic classification and in both cases sequences of three units were the top performers.

In [192], experiments are presented that demonstrate the relative benefit of extracting phone-string features from lattices rather than from 1-best phone-level recognizer output. This approach bears affinity to the techniques for detecting spoken terms that will be discussed in subsection 4.6.