
The automatic extraction of structured information from unstructured sources has been an active area of research for more than two decades. According to the taxonomy of information extraction proposed by [Sarawagi, 2008], this research area can be categorized along five dimensions: the type of structure extracted, the type of unstructured source, the type of input resources available for extraction, the method used for extraction and the output of extraction. Under these dimensions, the problem of key information extraction is to extract and output entities (i.e., words and sentences that help to determine the applicability and validity of an article) from unstructured texts (i.e., research articles). For the rest of this section, we first review the relevant methods and input resources for extraction tasks of a similar nature, and then move on to the work specific to key information extraction.

3.2.1 Entity Extraction from Unstructured Texts

The methods for entity extraction from unstructured texts can be broadly classified into two categories: rule-based and statistical.

Rule-based Approaches: As the name suggests, rule-based approaches rely on a set of rules to perform extraction. Rules usually consist of two parts. The first part is a contextual pattern which describes the properties and context of the entities to be extracted in terms of textual features. As summarized in [Muslea, 1999], early information extraction systems for newspaper articles make use of lexical features (e.g., the words themselves), phrase features (e.g., noun/verb/prepositional groups), voice features (e.g., active/passive) and word type features (e.g., physical object) to construct complex patterns. The second part of a rule is the action to be taken when the pattern is matched, which is usually to mark a series of words as an entity to be extracted.
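The two-part structure of a rule can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the `Rule` class, the `apply_rules` helper and the `PERSON_RULE` pattern are hypothetical names, and a regular expression stands in for the richer lexical, phrase and word type features described above.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    # Part 1: contextual pattern (here a plain regular expression, an assumption).
    pattern: "re.Pattern[str]"
    # Part 2: action that turns a match into a (label, entity text) pair.
    action: Callable[["re.Match[str]"], Tuple[str, str]]


def apply_rules(text: str, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Run every rule over the text and collect the entities its action emits."""
    found = []
    for rule in rules:
        for match in rule.pattern.finditer(text):
            found.append(rule.action(match))
    return found


# Hypothetical rule: capitalized words that follow a title are a person name.
PERSON_RULE = Rule(
    pattern=re.compile(r"\b(?:Mr|Ms|Dr)\.\s+((?:[A-Z][a-z]+\s?)+)"),
    action=lambda m: ("PERSON", m.group(1).strip()),
)

entities = apply_rules("Yesterday Dr. Jane Smith met Mr. Brown in Paris.", [PERSON_RULE])
```

Here the title ("Dr.", "Mr.") is the context and the capitalized sequence is the entity; the action only labels the captured words and leaves the context out.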

These rules can be hand-crafted by experts or learnt from an annotated corpus. Hand-crafted rules are able to encode domain knowledge which is hard to capture otherwise and feature widely in early systems [Hobbs et al., 1997; Cunningham et al., 2002]. To reduce the cost of acquiring domain knowledge, rule-learning algorithms have been developed to induce the best set of rules based on an annotated corpus and rule templates. The learning of rules may start by instantiating very specific rules from the templates to cover instances of the information to be extracted, followed by a generalization process that removes some of the text features or replaces rules with more general ones. This is bottom-up rule learning, as is done in [Ciravegna, 2001]. Alternatively, the learning can be done in a top-down manner. In [Soderland, 1999], generic rules are made more specialized by adding more text features or replacing them with more specific ones. Nevertheless, these algorithms may still rely on existing hand-crafted rules as a better starting point and involve experts in instance selection and rule refinement for better results.
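The bottom-up idea can be sketched as a greedy loop that starts from a rule copied from one annotated instance and drops feature constraints as long as precision on the corpus does not fall. This is a simplified illustration under our own assumptions, not the algorithm of [Ciravegna, 2001]; the feature names (`word`, `cap`, `prev`), the toy corpus and the `min_precision` threshold are all hypothetical.

```python
from typing import Dict, List, Tuple

Token = Dict[str, str]          # feature name -> feature value
Example = Tuple[Token, bool]    # (token features, True if the token is an entity)


def matches(rule: Token, token: Token) -> bool:
    """A rule matches when every constraint it holds is satisfied by the token."""
    return all(token.get(name) == value for name, value in rule.items())


def precision(rule: Token, corpus: List[Example]) -> float:
    """Fraction of tokens matched by the rule that are annotated as entities."""
    labels = [label for token, label in corpus if matches(rule, token)]
    return sum(labels) / len(labels) if labels else 0.0


def generalize(seed: Token, corpus: List[Example], min_precision: float = 1.0) -> Token:
    """Greedily remove constraints from a very specific seed rule."""
    rule = dict(seed)
    for feature in list(seed):
        candidate = {k: v for k, v in rule.items() if k != feature}
        # Keep the more general rule only if it is non-empty and still precise enough.
        if candidate and precision(candidate, corpus) >= min_precision:
            rule = candidate
    return rule


# Toy annotated corpus (hypothetical).
corpus = [
    ({"word": "Boston", "cap": "yes", "prev": "in"}, True),
    ({"word": "Paris", "cap": "yes", "prev": "in"}, True),
    ({"word": "The", "cap": "yes", "prev": "<s>"}, False),
    ({"word": "spring", "cap": "no", "prev": "in"}, False),
]

# Seed rule instantiated from the first annotated instance.
learned = generalize(corpus[0][0], corpus)
```

On this toy corpus the lexical constraint can be dropped without matching any negative token, while the capitalization and context constraints have to stay. A top-down learner would run the same check in the opposite direction, adding constraints to a generic rule.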

Despite the growth of statistical approaches, rule-based approaches remain an active area of research and efforts have been made to improve them in various aspects, such as scalability [Reiss et al., 2008], uncertainty management [Michelakis et al., 2009] and the refinement process [Liu et al., 2010].

Statistical Approaches: In statistical approaches, the extraction of entities is done by classifying whether a word is (part of) an entity to be extracted using statistical models. The words in such approaches are commonly described by a set of text features consisting of word features (e.g., the words themselves), orthographical features (e.g., capitalization pattern), linguistic features (e.g., part-of-speech tags) and dictionary features (e.g., whether the word appears in the entity dictionary). Under this formulation, various statistical models have been examined by different researchers. Hidden Markov Models (HMMs), which naturally capture the dependency between adjacent words, feature prominently in early research. For example, [Bikel et al., 1997] use an ergodic HMM with internal states representing named entity classes. They calculate the most likely state for each word using the Viterbi decoding algorithm. Later works employing HMMs in information extraction focus on finding a suitable model structure [Seymore et al., 1999] or employing more sophisticated variants of HMMs such as Hierarchical HMMs [Skounakis et al., 2003]. Besides HMMs, Support Vector Machines (SVMs) and Maximum Entropy modeling (MaxEnt) have also been applied in [Isozaki and Kazawa, 2002] and [Chieu and Ng, 2002] for their capability in handling large numbers of features. [McCallum et al., 2000] propose the Maximum Entropy Markov Model, which combines the strengths of HMMs and MaxEnt in capturing sequential dependency while offering more freedom in the choice of features. This leads to the current state-of-the-art model, Conditional Random Fields (CRFs) [Lafferty et al., 2001], which is able to take into account larger context (instead of just the previous word) for individual input and construct a consistent sequence of labels as the output. As a more recent trend, efforts have been made to solve multiple related information extraction tasks together via joint inference [McCallum, 2006; Poon and Domingos, 2007] so that the results of one classification can be used to inform another and vice versa.
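As an illustration of how an HMM assigns a state to each word, the sketch below implements textbook Viterbi decoding in log space. It is not the model of [Bikel et al., 1997]: the two states (`O`, `ENT`), the single orthographical observation per word (`cap` or `low`) and all probabilities are made-up assumptions, and the observation sequence is assumed to be non-empty.

```python
import math
from typing import Dict, List


def viterbi(
    obs: List[str],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most likely state sequence for a non-empty observation sequence."""
    # Log-score of the best path ending in each state after the first observation.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backpointers = []
    for o in obs[1:]:
        prev = scores[-1]
        current, pointer = {}, {}
        for s in states:
            # Best predecessor state for reaching s at this position.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            current[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][o])
            pointer[s] = best
        scores.append(current)
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


# Toy parameters (assumptions, not estimated from any corpus).
STATES = ["O", "ENT"]
START = {"O": 0.8, "ENT": 0.2}
TRANS = {"O": {"O": 0.8, "ENT": 0.2}, "ENT": {"O": 0.5, "ENT": 0.5}}
EMIT = {"O": {"low": 0.9, "cap": 0.1}, "ENT": {"low": 0.2, "cap": 0.8}}

labels = viterbi(["low", "cap", "cap", "low"], STATES, START, TRANS, EMIT)
```

With these toy numbers the two capitalized words are labeled `ENT` and the lowercase words `O`. In a real system the emission model would be conditioned on the word itself and on richer features.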

Both categories of approaches rely on the presence of an annotated corpus, which is often expensive to obtain. To alleviate the tedium and cost of building large corpora, semi-supervised learning [Nadeau, 2007; Carlson et al., 2010] and unsupervised learning [Etzioni et al., 2005; Dalvi et al., 2012] methods have also been studied for various entity extraction tasks.

Our approach to key information extraction is statistical, as such approaches require less domain knowledge than rule-based approaches (where experts are involved in crafting and tuning the rules). This domain independence allows our approach to be applied in different domains without having to acquire expensive domain knowledge, and makes our findings more applicable to domain-specific IR in general.

3.2.2 Key Information Extraction

In the healthcare domain, the identification and utilization of PICO elements and their variants have been studied extensively for various purposes. Most of the previous works in this area are based on supervised learning with natural language processing techniques. For example, [Demner-Fushman and Lin, 2007] perform sentence extraction on abstracts to obtain information for clinical question answering. They consider the sentences for elements P, I and C to be more recognizable by patterns due to the presence of medical concepts, while the ones for element O have no predictable patterns. Therefore, they extract the former using hand-crafted patterns but employ linear regression of text features for the latter. [Chung and Coiera, 2007] seek a better understanding of the structure of clinical abstracts by classifying their individual sentences into five classes: aim, method, participants, result and conclusion. [Kim et al., 2010] explore the use of lexical, semantic, structural and sequential information with CRFs, while [Boudin et al., 2010] test and combine multiple classifiers, such as Decision Trees, SVM and Naïve Bayes. Both of these later works improve the accuracy of sentence classification.

In comparison, research on more fine-grained extraction of EBP information is less common. Existing works usually start by classifying the sentences in abstracts or articles to identify the possible locations of EBP information and then proceed to extract the information from those locations. For example, [Bruijn et al., 2008] make use of an SVM-based sentence classifier with n-gram features and a rule-based pattern extractor to identify the key trial design elements from clinical trial publications. [Chung, 2009] extracts interventions from method sentences in RCTs using lexical and syntactical features.
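The two-stage structure of these works (locate candidate sentences, then extract from them) can be sketched as follows. This is only an illustration of the control flow: a cue-word heuristic stands in for the trained sentence classifiers used in the cited works, and the cue list, the dose pattern and all function names are hypothetical.

```python
import re
from typing import List, Tuple

# Stage 1 stand-in: hypothetical cue words replacing a trained sentence classifier.
INTERVENTION_CUES = ("received", "were assigned", "administered")

# Stage 2: hypothetical pattern for dosage mentions such as "0.25 ml" or "5 mg".
DOSE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?(?:ml|mg)\b", re.IGNORECASE)


def is_candidate_sentence(sentence: str) -> bool:
    """Decide whether a sentence is a likely location of intervention details."""
    lowered = sentence.lower()
    return any(cue in lowered for cue in INTERVENTION_CUES)


def extract_doses(sentences: List[str]) -> List[Tuple[int, str]]:
    """Return (sentence index, matched text) pairs from candidate sentences only."""
    results = []
    for index, sentence in enumerate(sentences):
        if not is_candidate_sentence(sentence):
            continue  # fine-grained extraction is skipped outside candidate sentences
        for match in DOSE_PATTERN.finditer(sentence):
            results.append((index, match.group(0)))
    return results


doses = extract_doses([
    "Children 6 to 35 months of age received 0.25 ml of intramuscular inactivated vaccine.",
    "A dose of 5 mg is common in routine practice.",
])
```

The second sentence contains a dose but no cue word, so it is filtered out in the first stage. This shows how the quality of sentence classification bounds what the later extraction step can recover.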

The above works either focus on sentence extraction or use sentence extraction as a basis for keyword extraction. While each task is important on its own, we believe that combining the two is synergistic and would lessen the effort needed in applicability and validity assessment.

• Sentence extraction is important because not all key information is modeled well by individual words. For example, research results are commonly described in prose, and it is difficult to extract only a few words to represent the entire text. Extraction at the sentence level is ideal in this case. Even for information such as patient demographics that can be represented by a few words, sentence extraction still provides evidence that the specific keywords are being used in an appropriate context.

• Keyword extraction is also important because the recognized keywords represent the exact information users need. With the extracted keywords highlighted based on their classes for ease of reading and assessment, users may quickly locate the desired information from the sentences without having to go through each of them in detail. Furthermore, keyword extraction aims at a smaller unit of text and hence can be represented in a more compact manner (e.g., keyword clouds) than sentences. This is useful in presenting more information within the limited screen real estate.

Table 3.4: Classes for sentences.

| Name | Definition | Example |
|---|---|---|
| Patient | A sentence containing information about the patients in a study. | A convenience sample of 24 critically ill, endotracheally intubated children was enrolled before initiation of suctioning and after consent had been obtained. |
| Result | A sentence containing information about the results of a study. | Large effect sizes were found for reducing PTSD symptom severity (d = −.72), psychological distress (d = −.73) and increasing quality of life (d = −.70). |
| Intervention | A sentence containing information about the procedures of interest and the ones as the comparison/control in a study. | Children 6 to 35 months of age received 0.25 ml of intramuscular inactivated vaccine, and those 36 to 59 months of age received 0.5 ml of intramuscular inactivated vaccine. (Note: This is also a patient sentence.) |
| Study Design | A sentence containing information about the design of a study. | A prospective international observational cohort study, with a nested comparative study performed in 349 intensive care units in 23 countries. |
| Research Goal | A sentence containing information about what a study aims to achieve. | The aim of this study was to investigate the balance between pro- and anti-inflammatory mediators in SA. |
