How many languages does Clearwell support?

We support processing, language identification, search, rendering, and export for over 50 languages including Asian (Chinese, Japanese, Korean), Eastern European (Russian, Bulgarian), Western European (French, German), etc. See Officially Supported Languages in Appendix A for the detailed list.

Does Clearwell process Unicode natively?

Unicode text always exists as some byte-level encoding of Unicode characters. Clearwell can process Unicode in all of its encodings, as well as non-Unicode encodings such as JIS, Shift-JIS, Big-5, GB, and ASCII.

During processing, Clearwell represents all Unicode data internally as UTF-8 across most of its components. On export, all documents are exported in their original encoding, but the metadata contained in the XML output is in UTF-8 format.
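To make the distinction between Unicode text and its byte-level encodings concrete, here is a minimal Python sketch. It is illustrative only and does not use any Clearwell API; the function name `byte_lengths` and the sample string are assumptions chosen for the example.

```python
# Illustrative only: one Unicode string has several byte-level encodings,
# and each of them decodes back to the same text.
SAMPLE = "日本語"  # hypothetical sample text


def byte_lengths(text: str, encodings: tuple) -> dict:
    """Return the number of bytes `text` occupies in each named encoding."""
    return {name: len(text.encode(name)) for name in encodings}


lengths = byte_lengths(SAMPLE, ("utf-8", "utf-16-le", "shift_jis"))

# Normalizing to UTF-8 yields identical bytes whatever the source encoding was.
from_sjis = SAMPLE.encode("shift_jis").decode("shift_jis").encode("utf-8")
from_utf16 = SAMPLE.encode("utf-16-le").decode("utf-16-le").encode("utf-8")
```

The same three characters occupy a different number of bytes in each encoding, which is why a common internal representation simplifies later processing.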

What encodings does Clearwell use?

For processing, Clearwell uses a mix of UTF-16 and UTF-8 depending on the document source, type, and stage of processing. Note that many documents that we process may not be in Unicode form to begin with; they may be in simple ASCII (fixed single-byte) form or some other encoding. For example, Notes documents are stored in the LMBCS ("Lotus Multi-byte Character Set") format. MIME encodings are converted to Unicode by Clearwell and processed natively. For rendering in the UI and for export, Clearwell uses UTF-8, since this is the best option for displaying content in HTML and representing it in XML.

How does Clearwell handle different encodings?

Clearwell handles encoding conversions through two separate "paths":

• Email processing

For emails, Clearwell uses the MAPI and Notes APIs to convert each email from its original encoding into Unicode. Clearwell does not directly do any sort of encoding conversion but instead relies completely on the email APIs to do this. This applies to .msg and .eml files as well.

• Loose file processing

Loose files and email attachments, other than Windows-1252-encoded text files, are passed through Oracle Outside-In. Outside-In can convert from a set of input encodings (listed below) to any output encoding. We have configured Outside-In to accept any of the following encodings as input and to produce only Unicode encoded as UTF-8 as output. Again, Clearwell does not directly do any sort of encoding conversion but relies completely on Outside-In to do the relevant mappings.

Multiple Language Handling: Frequently Asked Questions PAGE: 171

Supported Outside-In Character Encodings

Encoding name Description

iso8859-1 Latin-1

iso8859-2 Latin-2

iso8859-3 Latin-3

iso8859-4 Latin-4

iso8859-5 Cyrillic

iso8859-6 Arabic

iso8859-7 Greek

iso8859-8 Hebrew

iso8859-9 Turkish

macroman Mac Roman

macce Mac CE

macgreek Mac Greek

maccyrillic Mac Cyrillic

macturkish Mac Turkish

gb2312 Simplified Chinese

big5 Traditional Chinese

shiftjis Japanese

eucjp Japanese

iso2022-jp Japanese

koi8r Russian

windows1250 Eastern European

windows1251 Cyrillic

windows1252 Western European

windows1253 Greek

windows1254 Turkish

windows1255 Hebrew

windows1256 Arabic

windows1257 Baltic

thai874 Thai

koreanhangul Korean Hangul

utf8 UTF-8
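The "many input encodings, one UTF-8 output" arrangement described for loose files can be sketched in Python as follows. This is not how Outside-In is invoked; the `PYTHON_CODECS` mapping and the `to_utf8` helper are hypothetical, and the mapping covers only a few of the names in the table, paired with the Python codecs assumed to correspond to them.

```python
# Hypothetical mapping from a few encoding names in the table above to Python
# codec names. The pairing is an assumption made for this sketch only.
PYTHON_CODECS = {
    "iso8859-1": "iso8859_1",
    "gb2312": "gb2312",
    "big5": "big5",
    "shiftjis": "shift_jis",
    "koi8r": "koi8_r",
    "windows1251": "cp1251",
    "windows1252": "cp1252",
    "utf8": "utf_8",
}


def to_utf8(data: bytes, encoding_name: str) -> bytes:
    """Decode `data` from the named input encoding and re-encode it as UTF-8."""
    codec = PYTHON_CODECS[encoding_name]  # raises KeyError for unmapped names
    return data.decode(codec).encode("utf-8")


# The same Russian word stored in two legacy encodings normalizes to one result.
from_koi8 = to_utf8("Привет".encode("koi8_r"), "koi8r")
from_cp1251 = to_utf8("Привет".encode("cp1251"), "windows1251")
```

Fixing the output side to a single encoding means downstream components never need to know which encoding a file arrived in.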


Can Clearwell handle Shift-JIS-encoded file names and file paths in container files (such as ZIP/LHZ)?

Normally, file names and file paths are encoded in Unicode. In a few cases, however, Clearwell has found them to be encoded in Shift-JIS, specifically when the data originates from Japan. If you expect such data in your case, Clearwell recommends enabling the following property in Support Features.

To enable container file encoding detection:

1. On the top navigation bar, in System view, click Support Features.

2. Select the support feature Property Browser.

3. In the Name of property to change field, type the property: esa.container.filesname.conversion

4. In the New value field, type true to enable the feature (or false to disable it).

5. Select the "Confirm change. Are you sure?" checkbox.

6. Click Submit.

For more information on Japanese language encodings and problems with Shift-JIS to Unicode round-tripping, see:

• http://en.wikipedia.org/wiki/Japanese_language_and_computers

• http://web.archive.org/web/20060527013315/http://www.cs.mcgill.ca/~aelias4/encodings.html

• http://support.microsoft.com/kb/170559

There are at least a couple of known instances of an encoding mapping problem caused by Japanese characters in file names. This mis-mapping is referred to as Mojibake (http://en.wikipedia.org/wiki/Mojibake).
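The following Python sketch shows how this kind of mis-mapping arises and how it can be undone when the wrong decoding is lossless. It is an illustration under stated assumptions, not Clearwell's detection logic: the helper names `garble` and `recover` and the sample file name are hypothetical, and code page 437 is assumed as the wrongly applied encoding because it is the default that the ZIP format specifies for names not flagged as UTF-8.

```python
# Illustrative only: a Shift-JIS file name read with the wrong code page.
def garble(name: str, actual: str = "shift_jis", assumed: str = "cp437") -> str:
    """Simulate Mojibake: bytes written in `actual` are decoded as `assumed`."""
    return name.encode(actual).decode(assumed)


def recover(garbled: str, actual: str = "shift_jis", assumed: str = "cp437") -> str:
    """Undo the wrong decoding, then decode the bytes with the real encoding."""
    return garbled.encode(assumed).decode(actual)


original = "報告書.txt"      # hypothetical file name inside a container
broken = garble(original)    # unreadable text a user would see
restored = recover(broken)   # the original file name again
```

Recovery works here only because code page 437 assigns a character to every byte value, so no information is lost by the wrong decoding. With other code-page pairs the round trip can be lossy.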

Is any client-side software required to use Clearwell on multi-language cases?

No. However, you may need to install fonts (such as Chinese, Japanese, and Korean) if your Windows configuration does not have them installed and enabled by default. If you see characters not being rendered properly in your browser, this should be the first thing that you check because it is the most common reason for the problem.

What CJK characters are actually supported by Clearwell?

As noted above, we only process Unicode in the BMP (Basic Multilingual Plane) space. Every CJK Unicode character in the BMP is processed: CJK Unified Ideographs in the range U+4E00 to U+9FFF (20,992 characters), CJK Unified Ideographs Extension A in the range U+3400 to U+4DFF (6,656 characters), and CJK Compatibility Ideographs in the range U+F900 to U+FAFF (512 characters). The following CJK characters are not supported: CJK Unified Ideographs Extension B, U+20000 to U+2A6DF (42,720 characters), and CJK Compatibility Ideographs Supplement, U+2F800 to U+2FA1F (544 characters).
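As a quick way to check whether a given piece of text stays within the BMP, and to see where the character counts for each range come from, consider the Python sketch below. The helper names `is_bmp`, `non_bmp_chars`, and `block_size` are hypothetical; the code points are the ones quoted in the answer above.

```python
# Illustrative only: BMP membership test and range-size arithmetic.
BMP_LAST = 0xFFFF  # last code point of the Basic Multilingual Plane


def is_bmp(ch: str) -> bool:
    """Return True if the single character `ch` lies in the BMP."""
    return ord(ch) <= BMP_LAST


def non_bmp_chars(text: str) -> list:
    """Return the characters of `text` that fall outside the BMP."""
    return [ch for ch in text if not is_bmp(ch)]


def block_size(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


unified = block_size(0x4E00, 0x9FFF)        # CJK Unified Ideographs
extension_b = block_size(0x20000, 0x2A6DF)  # CJK Unified Ideographs Extension B
outside = non_bmp_chars("日\U00020000本")    # U+20000 is an Extension B character
```

Running `non_bmp_chars` over sample data before loading it is one way to estimate how many characters would fall outside the supported space.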


How about stemming support for non-English languages?

Stemming can be enabled as needed for non-English languages during case setup. Clearwell supports stemming in the languages listed in the Stemming and Language Support table.

Note: See examples in the next section for the types of stemming that can occur for Japanese.

Can my tags, projects, saved searches, etc. now use international characters?

Yes, almost all user text input may contain international characters.

What encoding format do you use in exporting documents?

Documents are exported in their native encodings. All metadata is exported in normalized UTF-8, the most widely used Unicode encoding.

Is Clearwell using any 3rd-party components for its multi-language support?

Along with Clearwell's own language processing technology, the Clearwell Language Processing Engine uses components from Oracle, Basis Technologies (also used by Google), and ICU (International Components for Unicode).

Stemming and Language Support

Language

Dutch

English (Linguistic and suffix-based)

French

German

Japanese

Korean

Portuguese

Russian
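For readers unfamiliar with the term, the general idea of suffix-based stemming can be sketched with the toy Python example below. It is deliberately simplistic and is not Clearwell's stemmer: the suffix list, the minimum stem length, and the `stem` function are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer; an illustration of the idea only.
SUFFIXES = ("ing", "ed", "es", "s")  # assumed suffix list, checked in order
MIN_STEM_LENGTH = 3                  # assumed guard against over-stripping


def stem(word: str) -> str:
    """Lower-case `word` and strip the first matching suffix, if any."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and lowered.endswith("ss"):
            continue  # keep words such as "process" intact
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= MIN_STEM_LENGTH:
            return lowered[: -len(suffix)]
    return lowered


# Different surface forms reduce to one stem, so a search for any of them
# can match all of them.
stems = {stem(w) for w in ("process", "Processed", "processes", "processing")}
```

Linguistic stemming goes further than this by using language-specific rules and dictionaries, which is what makes it applicable to languages such as Japanese, where word forms do not change by simple suffix removal.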


Stemming Examples

Japanese

In general, there are two types of stemming that can occur in Japanese: one for verb conjugation and another for meaning changes on Kanji (Chinese characters within the Japanese language). Following are examples of each type.

Verb Conjugations:

• Example: "To Do":

The verb "to do" is one of the most commonly used verbs in Japanese, as shown in these various conjugations that can occur in stemming:

• Example: "To Make":

Following are similar verb conjugations that can occur in stemming:

Meaning Changes on Kanji:

• Example: "Tokyo University":

This name consists of the words "Tokyo" and "University", which are two separate Kanji groups that are broken up by stemming:

• Example: Special Summoning (as in court):

Multiple Language Handling: Officially Supported Languages PAGE: 175
