
This section outlines the AnaMed medical terminology language analyser. In addition to analysing linguistic information, AnaMed also identifies SNOMED CT terms and eponyms in a given text. It has been developed for English and Basque, and the aim is to adapt it also to the Spanish language in the future.

Initially, only the English version of the AnaMed analyser was developed, in response to our need for a tool to search for the information required by the KabiTerm system outlined in this chapter. In other words, the information gathered by AnaMed is information that may prove necessary for automatic machine translation into Basque. However, because AnaMed can easily be adapted to other languages, we decided to develop a Basque-language version as well, which we believe may prove useful for the drafting of medical reports in Basque.

AnaMed is based on an automatic analysis system and integrates the identification of both eponyms and SNOMED CT terms. Eponyms are proper nouns that appear in the designation of certain concepts. The architecture of AnaMed is shown in Figure 5.1.

The Stanford CoreNLP tool (Manning et al., 2014) was used as the starting point for the development of the English analyser, along with the Python wrapper for Stanford CoreNLP developed by Dustin Smith1. The Eustagger analyser (Ezeiza et al., 1998) was used for the Basque version (AnaMed_eu). The morphological tokenisers and taggers from the linguistic analyser were used to identify tokens’ lemmas and parts of speech. In addition to this information, token offsets and, in the case of AnaMed-en, named entity tags, were also integrated into the analyser (Figure 5.1 shows the output of the first module).

By adding a second module (see Figure 5.1, module 2) we gave the analyser eponym identification capability. Eponyms are very common in medical terminology, particularly in the names of diseases and syndromes. The terms Down syndrome and Alzheimer’s disease are good examples of this2.

1 https://github.com/dasmith/stanford-corenlp-python (accessed May 9, 2017)

Generating complex terms from nested terms

Figure 5.1 – AnaMed analyser architecture.

Finally, we also added a SNOMED CT term identifier to the AnaMed analyser (see Figure 5.1, module 3).

2 Both terms were extracted from the Euskalterm Public Terminology Database.

5 - COMPLEX TERMS

No changes were made to the linguistic analyser itself during the course of this thesis project. The subsections below describe the modules developed during this phase.

Eponym recogniser

The most obvious eponyms are found in terms similar to the two examples given above: Down syndrome and Alzheimer’s disease, in which the eponym itself appears explicitly (Down and Alzheimer, in this case). However, there are also a number of terms that are derivatives of eponyms, such as Daltonism. The term Daltonism was established in honour of the British chemist John Dalton, the first person to describe the condition3. However, the eponym recogniser used here does not identify eponym derivatives.

The eponym recogniser was developed with Basque grammar in mind. In other words, the composition of proper nouns (referring to both people and places) differs depending on the declension. No agreement was found regarding the definition of eponyms, and sometimes reference is made to place names also4,5. In relation to place names, for example, Stockholm syndrome was named after an event that occurred in the city of Stockholm6, and as such, the Basque equivalent, “Stockholmgo sindrome”, uses the locative genitive case. When referring to the names of specific people, the declension used is the possessive genitive, as is the case with the Weber test, the equivalent of which in Basque is “Weber-en proba”.

Before starting work on the eponym identifier, we analysed different named entity recognition systems (Nadeau and Sekine, 2007; Tjong Kim Sang and De Meulder, 2003), testing some of the state-of-the-art systems with SNOMED CT descriptions. In this manual analysis, the best results were obtained by the Stanford CoreNLP named entity recognition tool (Finkel et al., 2005). However, since even with the best available tool the majority of eponyms remained undetected, we conducted an Internet search for lists of the most common eponyms and, on the basis of the results, developed our own eponym recogniser. Thus, both the persons identified by the Stanford CoreNLP named entity recognition tool and the eponyms identified by our system are tagged as eponyms.

3 https://en.wikipedia.org/wiki/John_Dalton (accessed May 9, 2017)

4 https://en.wikipedia.org/wiki/Eponym (accessed May 9, 2017)

5 http://www.dictionary.com/browse/eponym (accessed May 9, 2017)

6 https://en.wikipedia.org/wiki/Stockholm_syndrome (accessed May 9, 2017)

The eponym recogniser searches the words contained in the complex term for the eponyms on the list. Sometimes, compound eponyms are used in the names of certain diseases, as in the case of Verner-Morrison syndrome, for example. With such cases in mind, when drawing up the list of eponyms we extracted simple eponyms from compound ones. Thus, in the example given above, two eponyms were included on the list. With the aim of broadening the recogniser’s coverage, when recognising compound eponyms, the system is designed to identify the whole compound eponym from just one of its components. When compiling the list of eponyms, we analysed all the terms in SNOMED CT, adding all previously unidentified components of compound eponyms to the list. The final list contains around 3,000 proper names for identifying eponyms.
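The list-based matching described above can be illustrated with a minimal Python sketch. It is not the AnaMed implementation: the function names `build_eponym_list` and `find_eponyms`, the four-entry list, and the case-sensitive matching are assumptions made for illustration only. The sketch splits compound eponyms into simple ones when building the list, and reports a whole hyphenated word as an eponym as soon as one of its components is on the list.

```python
import re


def build_eponym_list(eponymous_names):
    """Split compound eponyms (e.g. 'Verner-Morrison') into simple ones.

    Hypothetical helper: the real list (around 3,000 names) was compiled
    from Internet sources and from SNOMED CT terms.
    """
    simple = set()
    for name in eponymous_names:
        for part in name.split("-"):
            if part:
                simple.add(part)
    return simple


def find_eponyms(term, eponym_list):
    """Return the words of `term` that are tagged as eponyms.

    A hyphenated word is returned whole when at least one of its
    components is on the list. Matching is case-sensitive (an assumption),
    so the ordinary word 'down' is not confused with 'Down'.
    """
    found = []
    for word in term.split():
        # Drop a trailing possessive marker: "Alzheimer's" -> "Alzheimer".
        core = re.sub(r"['’]s$", "", word)
        parts = core.split("-")
        if any(part in eponym_list for part in parts):
            found.append(core)
    return found


# Toy list built from the examples mentioned in the text.
EPONYMS = build_eponym_list(["Down", "Alzheimer", "Verner-Morrison", "Weber"])

print(find_eponyms("Verner-Morrison syndrome", EPONYMS))
print(find_eponyms("Alzheimer's disease", EPONYMS))
```

Because the compound is split when the list is built, a hypothetical term such as "Verner-Smith syndrome" would still be tagged as a whole compound eponym even though "Smith" is not on the toy list, which is the coverage-broadening behaviour described above.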

TermZerSCT: SNOMED CT term recogniser

The principal aim of AnaMed is to identify nested terms within terms. Although there are many term extractors currently available, none of them are specifically adapted to the needs of KabiTerm. We are not interested here in identifying general terms, only those included in SNOMED CT, using that system’s own hierarchy.

We therefore adapted the TermZerSCT terminology server to identify SNOMED CT terms. SNOMED CT contains a vast amount of terminology (around 300,000 concepts) which takes time to process. TermZerSCT enables faster terminology content management, and when the server is running we receive information about SNOMED CT almost instantly, with only a minimum waiting period.

As stated earlier, the server prepares the terminological content of SNOMED CT in order to provide the client (in this case AnaMed) with the information it requires as efficiently as possible. Among other things, it uses the original SNOMED CT files to classify active concepts into hierarchies. Thus, when the system is given a SNOMED CT concept identifier, in addition to providing that concept’s FSN, preferred term and synonyms, it also specifies the hierarchy to which it belongs. This information is added to that provided by AnaMed and the eponym recogniser, as shown in the output section of Figure 5.1.

As we can see in the table below (Table 5.1), we can obtain the SNOMED CT concept identifier for a given term (in this case, diabetes mellitus), and once we have that code, all the information about that concept becomes immediately available, including its fully specified name (FSN), its preferred term (PT) and its synonyms.

Explanation          Function   Result
Obtain code          desc2sct   73211009
Obtain hierarchies   sct2hie    DISORDER
Obtain FSN           sct2fsn    Diabetes mellitus (disorder)
Obtain PT            sct2term   Diabetes mellitus
Obtain synonyms      sct2syn    DM - Diabetes mellitus

Table 5.1 – The information that can be obtained regarding the term diabetes mellitus using TermZerSCT.
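The lookups in Table 5.1 can be mimicked with a small in-memory index. The following Python sketch is only an illustration: the function names are taken from the table, but their signatures, the `Concept` and `TermIndex` classes, and the one-description-to-one-concept simplification are assumptions, and the actual TermZerSCT server builds its data from the original SNOMED CT files rather than from a hand-written record.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One SNOMED CT concept, reduced to the fields shown in Table 5.1."""
    sctid: str
    hierarchy: str
    fsn: str
    preferred_term: str
    synonyms: list = field(default_factory=list)


class TermIndex:
    """Hypothetical in-memory stand-in for the terminology server."""

    def __init__(self, concepts):
        self._by_id = {c.sctid: c for c in concepts}
        # Index every description once, so lookups are dictionary accesses.
        # Assumption: each description maps to a single concept.
        self._by_desc = {}
        for c in concepts:
            for desc in [c.fsn, c.preferred_term, *c.synonyms]:
                self._by_desc[desc.lower()] = c.sctid

    def desc2sct(self, description):
        """Description -> concept identifier (None if unknown)."""
        return self._by_desc.get(description.lower())

    def sct2hie(self, sctid):
        return self._by_id[sctid].hierarchy

    def sct2fsn(self, sctid):
        return self._by_id[sctid].fsn

    def sct2term(self, sctid):
        return self._by_id[sctid].preferred_term

    def sct2syn(self, sctid):
        return list(self._by_id[sctid].synonyms)


# The single record below reproduces the values in Table 5.1.
index = TermIndex([
    Concept("73211009", "DISORDER", "Diabetes mellitus (disorder)",
            "Diabetes mellitus", ["DM - Diabetes mellitus"]),
])

code = index.desc2sct("diabetes mellitus")
print(code, index.sct2hie(code), index.sct2fsn(code))
```

Building both dictionaries once at start-up is what makes later queries fast, which mirrors the motivation given above for running the terminology as a server.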

We have developed English, Spanish and Basque versions of the server, since these are the languages in which we have the SNOMED CT terminology. More work has been carried out on the English and Basque versions, which also offer the option of searching for lemmatised terms by means of a lemmatiser (Stanford CoreNLP for the English version and Eustagger for the Basque version). This option will also be incorporated into the Spanish version in the future.

Using the TermZerSCT server, AnaMed identifies the nested terms located within complex terms, enabling us to analyse the structure of said complex terms. Moreover, it also groups nested terms together using underscores (“_”). For example, in the complex term unstable diabetes mellitus it identifies two nested terms: the qualifier unstable and the disorder diabetes mellitus. Thanks to this identification, in addition to providing the complete analysis, AnaMed also gives us the structure (qualifier+disorder) and the grouping (unstable diabetes_mellitus), information which is extremely useful for KabiTerm.
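A minimal Python sketch of how the structure and grouping could be derived is given below. The matching strategy (greedy longest match from left to right, skipping the complete term itself) is an assumption chosen for illustration, since the text does not specify how AnaMed selects nested terms. The `analyse_structure` function and the two-entry `LEXICON` are likewise hypothetical stand-ins for the queries sent to the terminology server.

```python
def analyse_structure(term, lexicon):
    """Return (structure, grouping) for a complex term.

    `lexicon` maps a known term to its hierarchy. Nested terms are found
    by greedy longest match; this strategy is an assumption, not
    necessarily the one used by AnaMed.
    """
    tokens = term.lower().split()
    nested = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate starting at position i first.
        for j in range(len(tokens), i, -1):
            # Skip the complete term: we want the terms nested inside it.
            if i == 0 and j == len(tokens) and len(tokens) > 1:
                continue
            candidate = " ".join(tokens[i:j])
            if candidate in lexicon:
                match = (candidate, lexicon[candidate], j)
                break
        if match is None:
            # Unknown token: keep it, with no hierarchy attached.
            nested.append((tokens[i], None))
            i += 1
        else:
            nested.append((match[0], match[1]))
            i = match[2]
    structure = "+".join(hierarchy or "unknown" for _, hierarchy in nested)
    grouping = " ".join(text.replace(" ", "_") for text, _ in nested)
    return structure, grouping


# Toy lexicon covering only the example discussed in the text.
LEXICON = {"unstable": "qualifier", "diabetes mellitus": "disorder"}

print(analyse_structure("unstable diabetes mellitus", LEXICON))
# -> ('qualifier+disorder', 'unstable diabetes_mellitus')
```

Preferring the longest match keeps diabetes mellitus together as one disorder instead of splitting it into two shorter entries, which is what the underscore grouping is meant to record.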
