• No se han encontrado resultados

Cuestionario de dinámica de grupo: escuela

The aim of research presented in this thesis is to automate semantic interpretation on clinical narratives. The intention is that by overcoming obstacles caused by costly manual efforts with automated interpretation processes on clinical narratives, we will be able to reuse clinical narratives to support epidemiologic research.

The use of specific document types (knee MRI reports) and ontology (TRAK) allowed us to test a hypothesis about the feasibility of our approach to rapid development of high performing IE systems. Nonetheless, the approach itself is not restricted to a particular domain and as such represents a general contribution to computer science in addition to project deliverables that include an extended ontology and two systems it enables to perform information extraction (KneeTex) and information retrieval (KneeBase).

Our achievements include: a reusable rapid ontology development framework demonstrated by expanding TRAK; an expanded TRAK ontology; an ontology-based information extraction system KneeTex; and a web-based information retrieval system KneeBase. Section 8.1 reviews achievements in this research, followed by a discussion of limitations in Section 8.2. Possible future work is discussed in Section 8.3.

8.1 Summary of contributions

Rapid ontology development framework

We adopted an alternative approach based on a set of strategies that can be used to systematically expand the coverage of existing ontologies or to develop them from scratch. Three of these strategies are data–driven and as such are more likely to ensure that the ontology effectively supports the intended NLP application. Each data–driven strategy utilises a different approach to extracting the relevant terminology from the data either manually or automatically. The fourth strategy is based on integration of concepts from other relevant knowledge sources. The two main aims of this strategy are: (1) to avoid the overfitting of the ontology to limited data available, and (2) to provide an initial taxonomic structure to incorporate new concepts.

In this study, we illustrated how these strategies were implemented in practice to expand the coverage of the TRAK ontology to make it suitable for a specific NLP application.

Expanded TRAK ontology

We practically demonstrated the approach to update the TRAK ontology in order to allow interpretation of information contained in knee MRI reports. We expanded TRAK into a fine-grained lexicalised ontology. The expanded TRAK contains 1,621 concepts, 2,550 synonyms and 560 relationship instances, compared with 1,292 concepts, 1,720 synonyms and 518 relationships in the original one.

The TRAK ontology-based information extraction system KneeTex exhibited human- level performance, which proves that the TRAK ontology has adequate coverage and been highly attuned to the given sublanguage.

KneeTex

We developed KneeTex as an ontology–driven system for information extraction from narrative reports that describe an MRI scan of the knee. The system exhibited human– level performance on a gold standard, attributed partly to the use of expanded TRAK ontology, which serves as a very fine-grained lexico-semantic knowledge base and plays a pivotal role in guiding and constraining text analysis.

The evaluation results confirm that KneeTex succeeded in making effective use of the ontology to support information extraction from knee MRI reports.

KneeBase

We demonstrated KneeBase as an example of integrated use of ontology and extracted information to build an information retrieval system. The system structure and programming framework can be reused for other similar ontology-based information retrieval tasks. As it has been proved with many previous studies that information retrieval systems have positive impact on clinical practices, we would hope that KneeBase can be used to to support large–scale multi–faceted epidemiologic studies of knee conditions.

8.2 Generalisability and limitation

Most efforts we have put in to this research are generalisable and can be exported to be applied to other similar tasks. Here we provide a step-by-step discussion on generalisability and limitations of our work.

The rapid ontology development framework is a partially automated process. It utilises primarily data-driven strategies. The two strategies, dictionary-based and automatic term

recognition, are automated processes using off-the-shelf software tools, and therefore are highly exportable to other similar tasks and require little human operation. Human effort is inevitable as required by manual data annotation and manual terminology search, though domain expert are not necessaries required in these procedures. However, to achieve satisfactory coverage for such a specific domain, domain-specific knowledge is still essential. As mentioned, we have referenced to our domain expert for many suggestions, helping solving non-agreed annotations and manual curations on suggested terms from these strategies. We used manual annotation for the purpose of identifying non-standard and infrequent terms. The reason underneath is that we have a rather small sized unannotated training set. A large and proper annotated training set would have provided sufficient information in the first two automated strategies.

The KneeTex information extraction system structure can be reused or referenced for similar tasks. However, we have used a large amount of lexical patterns and rules for term extraction and ambiguity resolution. These patterns and rules are propably restricted to the domain of knee injury MRI reports. Although we have not evaluated such patterns and rules on other dataset, it is not recommended to apply KneeTex directly on other domains without prior domain adaption.

8.3 Future work

Although our system achieved satisfactory performance, it also raises some requirements of future work, which could still be done to solve errors and further improve performance.

Domain adapted POS tagger

As mentioned in Section 6.3, POS tagging was not included in our linguistic pre- processes. This is due to declined performances of general POS taggers when applied on clinical domains. Correct assigned POS tags could be helpful to solve errors as discussed in Section 6.9.2. Ferraro et al have seen an increase of 6.2% to 11.4% in accuracy after performed domain adaption processes (Ferraro et al., 2013). Therefore, more effort can be put into this.

Dependency based negation resolution

Negation resolution now assumes that findings that locate to the right of the negation term and are within the same segment with the negation term are negated. Although it worked quite well in our system, it would be better to have dependency relations involved

Flexible co-reference resolution

Our current system uses a simple heuristic semantic-rule-based approach to solve co- reference problems. Exceptions have caused a few unsolved co-references in our system as our rules are not flexible enough. Many hybrid and supervised co-reference resolution systems participated and performed well in the 2011 i2b2 co-reference challenges (Uzuner et al., 2012). To introduce supervised machine learning to create flexible co- reference resolution is worth considering for future work.

Reduce human effort

Human effort still takes a significant part in this project. Manual annotation, terminology search and curation all require some human interferences and domain knowledge. Although at current stage human effort is to some extent required, it is still possible to reduce some human effort with the availability of more data sets, which would improve the performance of the two automated strategies we have. There also exist many rule induction algorithms, which could potentially be applied to replace part of human effort in recognising rules and patterns.

8.4 Summary

In this research, we successfully tested our hypotheses that semantic interpretation of clinical narratives can be automated with ontology-based text mining and it is also feasible to expand an existing ontology effectively in a systematic pipeline.

We achieved satisfactory performance in the strategies we developed to expand the TRAK ontology. Based on this expanded TRAK ontology, we successfully automated the semantic interpretation process with support from ontology-based text mining. Most of outcomes in this research can be reused or referenced in future research, including a rapid ontology development framework, the structure of our information extraction system KneeTex, the expanded TRAK ontology and a reusable information retrieval system framework.

Although there is still room for improvement, our research demonstrated the state-of-the- art performance. We hope that this research could provide useful insights for future text mining support on epidemiologic studies.