In this part I describe the construction and development of the Japanese-Chinese MT system, which is based on the basic principle and idea of SFBMT.
[Figure 4.1 diagram. Processing stages: Input (Japanese sentence), Morphological Analysis (with unknown word processing), Super Function Matching, Morphological Agreement, Output (Chinese sentence). Resources shown: Disambiguities Analysis Dictionary, SF Base, Bilingual Dictionary, Morphological Translation Tables.]
Figure 4.1: Outline of the System
Based on the special functionalities and properties of business letters introduced above, our system treats only nouns as the variable parts of a sentence and uses the SF-based approach to translate between Japanese and Chinese. The outline of the translation is discussed below (see also Figure 4.1).
The major parts of the translation system are as follows:
1. Input: The input Japanese sentence is written to a file.
2. Morphological Analysis: The Japanese sentence is morphologically analyzed by ChaSen (a free Japanese morphological analyzer). ChaSen analyzes the specified file morpheme by morpheme and outputs the result.
3. Super-Function Matching: First, the nouns are extracted from the file produced by the morphological analysis. Then the words between nouns are tied together to build a node of an SF. We call the node parts of an SF "SF parts". The nouns are written to a noun file, and the SF parts are written to a node file, which is matched against the SF base to find the corresponding Chinese SF parts. The nouns are then translated into Chinese using the bilingual dictionary.
4. Morphological Agreement: The nouns are rearranged within the Chinese node parts according to the order of nouns in an ETB.
5. Output: The translated sentence is output to a browser.
In step 1, the Japanese input must be written to a file so that the subsequent morphological analysis can read it.
In step 2, the process is started by executing ChaSen from the command line. ChaSen analyzes the specified file morpheme by morpheme and outputs the result to another file. ChaSen version 1.0 was officially released in 1997 by the Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology. It is a free Japanese morphological analyzer. For the input Japanese sentence string, ChaSen consults its morpheme dictionaries and records all possible morphemes that are substrings of the input string. Next, ChaSen calculates the following two types of costs. The first is the Morpheme Cost: a cost assigned to each morpheme, calculated as the product of the cost of the corresponding part of speech, the relative weight of morpheme costs, and the surface form cost. The second is the Connectivity Cost: a cost assigned to each bigram of morphemes, calculated as the product of the connectivity cost defined in the connectivity rules file and the relative weight of connectivity costs.
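The two cost definitions above can be written down directly. The following is a minimal sketch, not ChaSen's actual code: the names `POS_COST`, `MORPHEME_WEIGHT`, `CONNECTIVITY_WEIGHT`, `CONNECTIVITY_RULES`, and `DEFAULT_RULE_COST`, as well as all numeric values, are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass

# All values below are hypothetical placeholders, not ChaSen defaults.
POS_COST = {"noun": 1.5, "particle": 1.0}          # cost per part of speech
MORPHEME_WEIGHT = 2.0                              # relative weight of morpheme costs
CONNECTIVITY_WEIGHT = 3.0                          # relative weight of connectivity costs
CONNECTIVITY_RULES = {("noun", "particle"): 10.0}  # cost per part-of-speech bigram
DEFAULT_RULE_COST = 100.0                          # assumed cost for bigrams without a rule


@dataclass(frozen=True)
class Morpheme:
    surface: str
    pos: str
    surface_form_cost: float


def morpheme_cost(m: Morpheme) -> float:
    """Product of part-of-speech cost, morpheme weight, and surface form cost."""
    return POS_COST[m.pos] * MORPHEME_WEIGHT * m.surface_form_cost


def connectivity_cost(left: Morpheme, right: Morpheme) -> float:
    """Product of the rule cost for the bigram and the connectivity weight."""
    rule_cost = CONNECTIVITY_RULES.get((left.pos, right.pos), DEFAULT_RULE_COST)
    return rule_cost * CONNECTIVITY_WEIGHT
```

With these placeholder values, the noun 手紙 with surface form cost 2.0 gets a morpheme cost of 1.5 × 2.0 × 2.0, and the bigram noun + particle gets a connectivity cost of 10.0 × 3.0.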
For the input Japanese sentence string, every possible segmentation into morpheme sequences, together with its part-of-speech tagging, is considered, and the sum of the morpheme costs and connectivity costs described above is calculated. The results with the minimum cost are then returned. A cost width for the beam search is defined in the chasenrc resource file, and at every position in the input string the morphological analysis results are pruned using this cost width.
When ChaSen consults its morpheme dictionaries with a substring of the input string and cannot find any morpheme, it treats the substring as a morpheme and behaves as if the substring were contained in its morpheme dictionaries, although the substring is assigned an extremely high cost compared with morphemes that do exist in the dictionaries. The details of this handling of unknown words are as follows. For the hiragana, kanji (Chinese characters), number, and symbol character types, ChaSen treats each single character as a possible unknown morpheme. For other character types (katakana, typically used for foreign words, the (English) alphabet, etc.), ChaSen treats the longest string whose characters all share the same character type as a possible unknown morpheme. Morphemes that are not contained in the morpheme dictionaries are given the part of speech for unknown words and the cost for unknown words, both of which are defined in the chasenrc resource file.
After ChaSen has performed the morphological analysis, the unknown words need some further processing. If an unknown word is located directly before a case particle, it is treated as a noun. If an unknown word is located directly after the case particle "WO", it is treated as a verb.
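The two rules for unknown words can be sketched as a single pass over the analyzed tokens. The token representation as (surface, part of speech) pairs and the labels `"unknown"`, `"case_particle"`, `"noun"`, and `"verb"` are assumptions made for this example; ChaSen's real output uses its own part-of-speech names. The source does not say which rule wins when both apply, so this sketch arbitrarily checks the noun rule first.

```python
UNKNOWN = "unknown"
CASE_PARTICLE = "case_particle"
WO = "を"  # the case particle written "WO" in the text


def resolve_unknown_words(tokens):
    """Relabel unknown words using their neighbouring case particles.

    tokens: list of (surface, part of speech) pairs in sentence order.
    """
    resolved = []
    for i, (surface, pos) in enumerate(tokens):
        if pos == UNKNOWN:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            prev = tokens[i - 1] if i > 0 else None
            if nxt is not None and nxt[1] == CASE_PARTICLE:
                pos = "noun"   # unknown word directly before a case particle
            elif prev is not None and prev == (WO, CASE_PARTICLE):
                pos = "verb"   # unknown word directly after the particle WO
        resolved.append((surface, pos))
    return resolved
```

Unknown words that match neither rule keep the `"unknown"` label in this sketch.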
In step 3, the nouns are first extracted from the file produced by the morphological analysis. Then the words between nouns are tied together to build a node of an SF. The nouns are written to a noun file, and the SF parts are written to a node file. The node file is then matched against the SF base to find the corresponding Chinese SF parts.
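A minimal sketch of this step follows, using in-memory lists in place of the noun file and node file. The token format, the `"noun"` label, the example sentence, and the single `SF_BASE` entry are hypothetical and only illustrate the idea; the actual SF base format is not given in the text.

```python
def split_nouns_and_nodes(tokens):
    """Separate nouns from the SF parts (the words between the nouns).

    tokens: list of (surface, part of speech) pairs in sentence order.
    Returns (nouns, nodes), where nodes has exactly one more element than
    nouns; an empty string marks a position with no words.
    """
    nouns, nodes, current = [], [], []
    for surface, pos in tokens:
        if pos == "noun":
            nodes.append("".join(current))  # close the node before this noun
            current = []
            nouns.append(surface)
        else:
            current.append(surface)
    nodes.append("".join(current))          # words after the last noun
    return nouns, nodes


# Hypothetical SF base: Japanese SF parts -> Chinese SF parts.
SF_BASE = {
    ("", "に", "を送ります"): ("寄", "给", ""),
}


def match_sf(nodes):
    """Look up the Chinese SF parts for the Japanese nodes, or None if absent."""
    return SF_BASE.get(tuple(nodes))
```

For the tokens of 田中に手紙を送ります, this yields the nouns 田中 and 手紙 and the nodes `""`, `"に"`, and `"を送ります"`, which match the single entry in the toy SF base.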
The nouns are then translated into Chinese using the bilingual dictionary. This translation follows the kind mentioned above as well as the positional relationship described in the ETB; that is, in step 4 the nouns are rearranged within the Chinese node parts.
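The noun translation and rearrangement can be sketched as follows. How the ETB encodes noun order is not specified here, so this example assumes it can be reduced to a list of indices (`noun_order`); that representation, the `BILINGUAL_DICTIONARY` entries, and the example sentence are hypothetical.

```python
# Hypothetical bilingual dictionary: Japanese noun -> Chinese noun.
BILINGUAL_DICTIONARY = {"田中": "田中", "手紙": "信"}


def assemble_chinese(chinese_nodes, japanese_nouns, noun_order):
    """Translate the nouns, reorder them, and interleave them with the nodes.

    chinese_nodes: Chinese SF parts, one more element than there are nouns.
    noun_order: for each Chinese noun slot, the index of the Japanese noun
    that fills it (assumed to be derived from the ETB).
    """
    # Nouns missing from the dictionary are passed through unchanged.
    translated = [BILINGUAL_DICTIONARY.get(n, n) for n in japanese_nouns]
    reordered = [translated[i] for i in noun_order]
    parts = [chinese_nodes[0]]
    for noun, node in zip(reordered, chinese_nodes[1:]):
        parts.append(noun)
        parts.append(node)
    return "".join(parts)
```

With the Chinese nodes `("寄", "给", "")`, the Japanese nouns 田中 and 手紙, and the order `[1, 0]`, the second Japanese noun fills the first Chinese slot, giving 寄信给田中.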
Figure 4.2: Translation Interface