Business rules describe, in plain language, the main and distinguishing characteristics of the data as the company sees them.
1. Data model: the relational approach
System          Population  Intervention  Background  Outcome  Study Design  Other
Marco Lui       0.58        0.34          0.80        0.89     0.59          0.85
A MQ            0.51        0.35          0.78        0.86     0.58          0.84
Macquarie Test  0.56        0.34          0.75        0.84     0.52          0.80
Starling        0.32        0.20          0.80        0.87     0.00          0.82
DPMCNA          0.28        0.12          0.70        0.78     0.48          0.73
Mix             0.45        0.19          0.68        0.82     0.40          0.81
System Ict      0.30        0.15          0.68        0.84     0.35          0.83
Dalibor         0.30        0.15          0.68        0.84     0.40          0.83
Naive           0.00        0.00          0.59        0.68     0.00          0.15
CRF official    0.33        0.22          0.55        0.78     0.67          0.81
CRF corrected   0.58        0.18          0.80        0.86     0.68          0.83
Aggregate       0.38        0.21          0.71        0.83     0.42          0.76
Table A.4: F-scores across each individual label class and the aggregate. The best results per column are given in bold.
shared tasks.
A.6 Description of the top systems
The following text was written by the competing teams, who kindly agreed to send us descriptions of their systems.
Team Marco (Marco Lui)
A full description of this system is given in [Lui 2012]. We used a stacked logistic regression classifier with a variety of feature sets to attain the highest result. The stacking was carried out using a 10-fold cross-validation on the training data, generating a pseudo-distribution over class labels for each training instance for each feature set. These distribution vectors were concatenated to generate the full feature vector for each instance, which was used to train another logistic regression classifier. The test data was projected into the stacked vector space by logistic regression
CHAPTER A: APPENDIX 1
System          F-score  p-value
Marco Lui       0.82     0.0012
CRF corrected   0.80     0.482
A MQ            0.80     0.03
Starling        0.78     0.3615
Macquarie Test  0.78     0.0001
Mix             0.74     0.1646
System Ict      0.73     0.5028
Dalibor         0.73     0.0041
DPMCNA          0.71     0
Naive           0.55     -
Table A.5: Ranking of systems according to F-score, and pairwise statistical significance test between the target row and the one immediately below. The horizontal lines cluster systems according to statistically significant differences.
classifiers trained on each feature set over the entire training collection. No sequential learning algorithms were used; the sequential information is captured entirely in the features. The feature sets we used are an elaboration of the lexical, semantic, structural and sequential features described by Kim et al. [Kim et al. 2011]. The key differences are: (1) we used part-of-speech (POS) features differently. Instead of POS-tagging individual terms, we represented a document as a sequence of POS-tags (as opposed to a sequence of words), and generated features based on POS-tag n-grams, (2) we added features to describe sentence length, both in absolute (number of bytes) and relative (bytes in sentence / bytes in abstract) terms, (3) we expanded the range of dependency features to cover bag-of-words (BOW) of not just preceding but also subsequent sentences, (4) we considered the distribution of preceding and subsequent POS-tag n-grams, (5) we considered the distribution of preceding and subsequent headings. We also did not investigate some of the techniques of Kim et al., including: (1) we did not use any external resources (e.g. MetaMap) to introduce additional semantic information, (2) we did not use rhetorical roles of headings for
structural information, (3) we did not use any direct dependency features.
Team A MQ (Abeed Sarker)
In our approach, we divide the multi-class classification problem into several binary classification problems, and apply SVMs as the machine learning algorithm. Overall, we use six classifiers, one for each of the six PIBOSO categories. Each sentence, therefore, is classified by each of the six classifiers to indicate whether it belongs to a specific category or not. An advantage of using binary classifiers is that we can customise the features to each classification task. This means that if there are features that are particularly useful for identifying a specific class, we can use those features for the classification task involving that class, and leave them out if they are not useful for other classes. We use RBF kernels for each of our SVM classifiers, and optimise the parameters using 10-fold cross-validation over the training data for each class. We use the MetaMap toolbox to identify medical concepts (CUIs) and semantic types for all the medical terms in each sentence. We use the MedPost/SKR part-of-speech tagger to annotate each word, and further pre-process the text by lowercasing, stemming and removing stopwords. For features, we use n-grams, sentence positions (absolute and relative), sentence lengths, section headings (if available), CUIs and semantic types for each medical concept, and previous sentence n-grams. For the outcome classification task, we use a class-specific feature called ‘cue-word-count’. We use a set of key-words that have been shown to occur frequently in sentences representing outcomes, and, for each sentence, we use the number of occurrences of those key-words as a feature. Our experiments on the training data showed that such a class-specific feature can improve classifier performance for the associated class.
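As an illustration only, the class-specific ‘cue-word-count’ feature and the positional features described above might be computed along the following lines. This is a minimal sketch, not the team's implementation: the cue-word list, the tokeniser and the function names are assumptions made for the example, and stemming, MetaMap concepts and the SVM classifiers themselves are left out.

```python
import re

# Hypothetical cue words; the team's actual key-word list is not given in the text.
OUTCOME_CUE_WORDS = {"significant", "improved", "reduced", "increase", "decrease"}


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens (no stemming)."""
    return re.findall(r"[a-z]+", sentence.lower())


def cue_word_count(sentence, cue_words=OUTCOME_CUE_WORDS):
    """Count the token occurrences that belong to the cue-word set."""
    return sum(1 for token in tokenize(sentence) if token in cue_words)


def sentence_features(sentences, index, label):
    """Build features for one sentence of an abstract, for one binary task.

    The cue-word count is added only for the 'outcome' classifier, which
    mirrors the idea of customising the features per class.
    """
    sentence = sentences[index]
    features = {
        "abs_position": index,
        "rel_position": index / len(sentences),
        "length": len(tokenize(sentence)),
    }
    if label == "outcome":
        features["cue_word_count"] = cue_word_count(sentence)
    return features
```

Each of the six binary classifiers would then receive its own feature dictionary for the same sentence, so a feature that only helps one class never reaches the others.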
Team Macquarie Test (Diego Molla)
A full description of this system is given in [Molla 2012]. The system is the result of a series of experiments where we tested the impact of using cluster-based
features for the task of sentence classification in medical texts. The rationale is that, presumably, different types of medical texts will have specific types of distributions of sentence types. But since we don’t know the document types, we cluster the documents according to their distribution of sentence types and use the resulting clusters as the document types. We first trained a classifier to obtain a first prediction of the sentence types. Then the documents were clustered based on the distribution of sentence types. The resulting cluster information, plus additional features, were used to train the final set of classifiers. Since a sentence may have multiple labels we used binary classifiers, one per sentence type. At the classification stage, the sentences were classified using the first set of classifiers. Then their documents were assigned the closest cluster, and this information was fed to the second set of classifiers. The submission with the best results used Maxent classifiers. All classifiers used uni-gram features plus the normalised sentence position, and the second set of classifiers used, in addition, the cluster information. The number of clusters was 4.
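The step between the two sets of classifiers can be sketched as follows: a document is summarised by the distribution of its predicted sentence types and then assigned to the closest cluster. This is a hedged sketch under stated assumptions. The text does not name the clustering algorithm or the distance measure, so Euclidean distance to precomputed centroids is assumed here, and the function names are hypothetical.

```python
import math
from collections import Counter

# The six PIBOSO sentence types, in a fixed order for the distribution vector.
LABELS = ["background", "intervention", "other", "outcome", "population", "study design"]


def label_distribution(predicted_labels):
    """Proportion of each sentence type among a document's first-stage predictions."""
    total = len(predicted_labels)
    if total == 0:
        return [0.0] * len(LABELS)
    counts = Counter(predicted_labels)
    return [counts[label] / total for label in LABELS]


def closest_cluster(distribution, centroids):
    """Index of the centroid nearest to the distribution (assumed Euclidean distance)."""
    distances = [math.dist(distribution, centroid) for centroid in centroids]
    return distances.index(min(distances))
```

The returned cluster index would then be appended to each sentence's features before the second set of classifiers is applied.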
Team DPMCNA (Daniel McNamara)
We took all of the rows in the training set with a 1 in the prediction column and treated each row as a series of predictors and a class label corresponding to sentence type (‘background’, ‘population’, etc.). We pre-processed the training and test sets by stemming and by removing case, punctuation and extra white space. We then calculated the training set mutual information of each 1-gram with respect to the class labels, recording the top 1000 features. We converted each sentence into a feature vector whose entries were the frequencies of the top features, plus an entry for the sentence number. We then trained a Random Forest (using R’s randomForest package with the default settings) using these features and class labels. We used the Random Forest to predict class probabilities for each test response variable. Note that we ignored the multi-label nature of the problem, considering that most sentences only had a single label.
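The feature selection step described above can be sketched as below. The team worked in R, so this Python version is only an illustration, and it assumes one common definition of mutual information: the one between the presence or absence of a 1-gram in a sentence and the sentence's class label. The function names are hypothetical, and the Random Forest itself is not shown.

```python
import math
from collections import Counter


def mutual_information(sentences, labels, term):
    """Mutual information (in nats) between presence of `term` and the class label."""
    n = len(sentences)
    joint = Counter((term in tokens, label) for tokens, label in zip(sentences, labels))
    term_counts = Counter()
    label_counts = Counter()
    for (present, label), count in joint.items():
        term_counts[present] += count
        label_counts[label] += count
    mi = 0.0
    for (present, label), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) p(y)) expressed with raw counts.
        mi += p_xy * math.log(count * n / (term_counts[present] * label_counts[label]))
    return mi


def top_terms(sentences, labels, k):
    """The k 1-grams with the highest mutual information on the training set."""
    vocabulary = sorted(set().union(*sentences))
    ranked = sorted(vocabulary, key=lambda t: mutual_information(sentences, labels, t), reverse=True)
    return ranked[:k]


def to_vector(tokens, terms, sentence_number):
    """Frequencies of the selected terms, followed by the sentence number."""
    counts = Counter(tokens)
    return [counts[term] for term in terms] + [sentence_number]
```

With `k = 1000` this yields vectors of 1001 entries per sentence, matching the layout described by the team.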
Team System Ict (Spandana Gella, Duong Thanh Long)
A full description of this system is given in [Gella and Long 2012]. Our top 5 sentence classifiers use Support Vector Machines (SVMs) and Conditional Random Fields (CRFs) as learning algorithms. For SVMs we used the libsvm package and for CRFs we used the CRF++ package. We used 10-fold cross-validation to tune and test the most suitable hyperparameters for our methods. We observed that our systems performed very well under cross-validation on the training data but suffered from overfitting. To avoid this, we used the training data plus test data labelled by one of the best performing systems as our new training data. We observed that this improved our results by approximately 3%. We trained our classifiers with different sets of features, which include lexical, structural and sequential features. Lexical features include collocational information, lemmatized bag-of-words features, part-of-speech information (we used the MedPost part-of-speech tagger) and dependency relations. Structural features include position of the sentence in the abstract, normalised sentence position, reverse sentence position, number of content words in the sentence, and abstract section headings with and without modification as mentioned in [Kim et al. 2011]. Sequential features were implemented the same way as in [Kim et al. 2011] with the direct and indirect features. Having built the pool of features defined above, we perform feature selection to ensure that we always have the most informative features. We used the information gain algorithm from the R system to do feature selection.
CHAPTER B: Appendix 2
This work was carried out in collaboration with two other authors (Diego Molla and David Martinez), and as one of the authors I contributed to forming the idea and running the experiments. However, unlike the rest of this thesis, I was not the first author of this work. I would like to acknowledge that the writing of this appendix was a collaboration between the three authors, mainly Diego Molla, who was the first author of this particular work, taken from the following paper: Molla et al. [Mollá et al. 2013].
We propose a document distance-based approach to automatically expand the number of available relevance judgements when those are limited and reduced to only positive judgements. This may happen, for example, when the only available judgements are extracted from a list of references in a published clinical systematic review. We show that evaluations based on these expanded relevance judgements are more reliable than those using only the initially available judgements. We also show the impact of such an evaluation approach as the number of initial judgements
decreases.
B.1 Semi Automatic Relevance Judgement
There are applications that benefit from an information retrieval (IR) stage, but which do not have enough sample documents for a full assessment of the retrieval quality. Furthermore, the few sample documents available only represent positive relevant documents. For example, within the area of Evidence Based Medicine (EBM), clinical systematic reviews provide the medical doctor with clinical evidence together with a list of relevant documents. We envisage the development of tools that will facilitate the production of such systematic reviews. One of the first stages of such an application consists of an IR step that retrieves all key relevant documents. But the references in a systematic review cover only a small sample of all relevant references [Dickersin et al. 1994], and only a fraction of the documents of a systematic review can be retrieved after performing exhaustive searches, mostly because the queries are complex and there are several document repositories [Martinez et al. 2008]. Furthermore, the list of references only indicates relevant documents, and there are no lists of non-relevant documents readily available. It is therefore expected that any evaluation metric that is based solely on the references from the systematic review will show unreliable results.
Previous work has shown that by expanding an initial set of document assessments for given queries, one can perform a more accurate automatic evaluation of IR systems. For example, Büttcher et al. [Büttcher et al. 2007] used Machine Learning methods trained over a subset of relevance judgements in order to expand the set of relevance judgements. They showed that evaluation results with the expanded set of relevance judgements had better quality than using the source subset of judgements. Quality of the evaluation was measured by ranking a set of IR systems according to
the new expanded relevance judgements, and comparing it against the system ordering produced by the original set of judgements. In the clinical domain, Martinez et al. [Martinez et al. 2008] explored the use of re-ranking methods based on reduced judgements, and found that the use of automatic classifiers could considerably reduce the time required for clinicians to identify a large portion (95%) of the relevant documents. Both these articles reported limitations of the classifiers when the initial number of documents was small. Furthermore, in the scenario that we contemplate, where we rely on the list of references of a systematic review as the set of relevant documents, we do not have information about negative judgements, and therefore a classifier-based approach to expand the set of relevant documents would have to deal with this issue.
More recent work [Sakai and Lin 2010] has shown that by relying on documents retrieved frequently by a diverse set of systems, it is possible to build relevance assessments automatically, and achieve high correlation with manually judged data. However, this approach has been tested by building on a set of competing runs from different research groups, which is not always available, and this method does not benefit from existing qrels.
We propose to automatically expand the set of relevant documents by adding documents that are reasonably close to the original, reduced set. We show the result of several experiments that test the impact of such automatic expansion. For our experiments, we rely on the OHSUMED test collection [Hersh et al. 1994]. This is a corpus containing clinical queries and assessments, and we focus on the set of 63 queries that was used in the TREC-9 Filtering Track. The OHSUMED queries were generated to address actual information needs of clinicians, and the assessed documents were retrieved in two iterations, by relying on the MEDLINE search interface and the SMART retrieval system respectively. The retrieved documents were judged
by a group of domain experts separate from the group performing the search. As the document collection we rely on the 1988-91 subset of MEDLINE that was released as test data for the TREC-9 challenge, which contains 293,856 documents. For evaluation we apply a variety of IR systems implemented in the Terrier open source package [Macdonald et al. 2012].
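To make the expansion idea proposed in this section concrete, the sketch below adds to the set of relevant documents every document that lies close enough to at least one known relevant document. It is only an illustration under stated assumptions: the document distance and the notion of "reasonably close" are not specified at this point in the text, so cosine similarity over raw term counts and a fixed threshold are assumed, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_judgements(collection, relevant_ids, threshold):
    """Return the relevant ids plus every document similar enough to one of them.

    `collection` maps document ids to text; `threshold` is an assumed
    similarity cut-off, not a value taken from the experiments.
    """
    vectors = {doc_id: Counter(text.lower().split()) for doc_id, text in collection.items()}
    expanded = set(relevant_ids)
    for doc_id, vector in vectors.items():
        if doc_id in expanded:
            continue
        if any(cosine_similarity(vector, vectors[rel]) >= threshold for rel in relevant_ids):
            expanded.add(doc_id)
    return expanded
```

Only positive judgements are needed as input, which matches the scenario where the list of references of a systematic review is the sole source of relevance information.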