
Information Retrieval is the process of searching for documents, or for information within documents, carried out by a human user or an automated agent. A system that supports a human user's querying aims to increase the ratio of relevant to non-relevant documents returned for a query, e.g. web search engines. An automated system's task is to aggregate filtered information in order to reduce the number of documents requiring further processing, e.g. customizable RSS feed services.

Improved querying

PubMed expands user queries using MeSH headings and additional vocabularies such as drugs or chemicals. If a query contains such a term, the query is expanded so that it also matches articles that were manually annotated with this term. The user can review the expanded query in the PubMed interface, and the E-Utilities can also be called to compute the expansion. This query expansion helps retrieve relevant articles that would otherwise be missed.
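The expansion can also be inspected programmatically. The following is a minimal sketch, assuming the ESearch endpoint of the E-Utilities, which reports the expanded query in a QueryTranslation element of its XML response. The helper names and the abbreviated sample response are illustrative assumptions; no network call is made.

```python
from typing import Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query: str) -> str:
    """Return an ESearch URL for a PubMed query (hypothetical helper)."""
    return ESEARCH_URL + "?" + urlencode({"db": "pubmed", "term": query})


def extract_query_translation(xml_text: str) -> Optional[str]:
    """Pull the expanded query out of an ESearch XML response."""
    root = ET.fromstring(xml_text)
    node = root.find("QueryTranslation")
    return node.text if node is not None else None


# Abbreviated, hand-written response used only to illustrate the parsing step.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <QueryTranslation>"aspirin"[MeSH Terms] OR "aspirin"[All Fields]</QueryTranslation>
</eSearchResult>"""

print(build_esearch_url("aspirin"))
print(extract_query_translation(SAMPLE_RESPONSE))
```

In a real client the URL would be fetched and the response passed to the parsing helper; the sample string stands in for that response here.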

Some tools aim to improve the querying of PubMed by supporting the user during query formulation. Features range from language translation and graphical aids to the pre-processing of full English questions:

askMedline. The text-based website askMEDLINE [83] takes a natural language question as input. The system removes irrelevant words and tests whether the remaining words relate to MeSH headings by querying PubMed. Terms classified as “other eligible entries” are also eliminated if the search returns only a few results. The result is always a list of citation titles with links to the abstract and full text.

PubMedInteract. PubMedInteract [183] is a web interface to PubMed that presents slider bars for setting PubMed search limits and parameters. A “Preview Count” option computes the number of articles to expect with the current settings.

PICO Linguist. PICO Linguist [84] offers non-English-speaking medical professionals the option to build a structured clinical query using the PICO framework, with medical terms that may be difficult to express in English. Users may specify the patient’s problem, the therapy, alternative therapies and the outcome in their own language. The primary vocabulary sources for translation are UMLS, MeSH, WHO EMRO and UMLF.

BabelMeSH. The BabelMeSH [84] website maps search terms to a multilingual MeSH in 12 different languages. Only terms listed in the multilingual vocabulary can be used for the query.

PubFinder. This service [101] aims to automatically extract PubMed abstracts that deal with a specific scientific subject. The user enters a representative set of PubMed IDs. From the corresponding abstracts, a list of discriminating words is calculated, which is then used to rank PubMed abstracts by their probability of belonging to the user-defined topic. The list consists of the 100 words with the largest difference between their frequency in the selected abstracts and their global PubMed frequency as recorded in a reference dictionary. A set of abstracts dealing with literature mining yields, for example, these words: abstracts, medline, information, articles, names, precision, database, recall, protein, literature, databases, references, system, automatically, interactions, set, mining, scientific, automated, motivation and others.
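The selection of discriminating words can be sketched as follows. This is a simplified illustration, not PubFinder's actual statistic: it assumes a plain difference of relative word frequencies and ranks candidate abstracts by the fraction of their tokens that appear in the word list. All function names and the toy data are hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequencies(abstracts):
    """Relative frequency of each word over a set of abstracts."""
    counts = Counter(w for a in abstracts for w in tokenize(a))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def discriminating_words(selected, reference_freq, top_n=100):
    """Words whose frequency in the selected abstracts most exceeds
    their global frequency in the reference dictionary."""
    sel = relative_frequencies(selected)
    diff = {w: f - reference_freq.get(w, 0.0) for w, f in sel.items()}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(diff, key=lambda w: (-diff[w], w))[:top_n]


def rank_abstracts(candidates, words):
    """Order candidates by the share of tokens found in the word list
    (an assumed stand-in for the topic probability)."""
    vocab = set(words)

    def score(text):
        tokens = tokenize(text)
        return sum(t in vocab for t in tokens) / len(tokens) if tokens else 0.0

    return sorted(candidates, key=score, reverse=True)


# Toy data, for illustration only.
selected = ["text mining of medline abstracts",
            "mining protein interactions from abstracts"]
reference_freq = {"of": 0.05, "from": 0.03, "protein": 0.01}
words = discriminating_words(selected, reference_freq, top_n=2)
ranked = rank_abstracts(["cell biology of yeast", "mining medline abstracts"], words)
```

With the toy data, the two highest-scoring words are the ones that occur in both selected abstracts, and the candidate that mentions them is ranked first.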

CiteXplore. CiteXplore indexes documents from sources such as Medline, the European Patent Office, Chinese Biological Abstracts and Citeseer using the Lucene full-text index. Advanced searches, such as wildcard search on selected attributes, are offered. Another option is query expansion with synonyms. Information gathered from other applications such as InterPro, SwissProt/Trembl and Alternative Splicing is cross-referenced. The external WhatIsIt text-mining service is used to highlight proteins, genes and protein-protein interactions. References can be exported in EndNote, RIS and BibTeX format.

Results processing

Some systems process search results further to facilitate browsing of a large number of documents or to link to further related citations based on the content of the search result. Examples are evidence highlighting, document re-ranking and information organization. Evidence highlighting visually emphasizes text passages in source documents. For example, the word in a sentence stating a relation of two entities is underlined. This supports readers when scanning through relevant text passages. Documents can be sorted according to selected criteria such as date, type of citation, usage of vocabulary or reputation. Hyperlinks to documents not in the original search result, for example referenced papers or papers with similar content, help researchers find all relevant material. Information organization is the process of organizing information such that it becomes useful. For example, tables or network graphs support understanding.

BioIE. BioIE is a rule-based system that extracts informative sentences from MEDLINE documents or uploaded texts. Informative sentences refer to structures, functions, diseases and therapeutic compounds, localisations or familial relationships of biological entities, particularly proteins. The selected text base can be visualized in tabular form as word, MeSH term and word phrase frequency tables. Textual templates are used to identify informative sentences of a selected type, e.g. functional descriptions. The sentences can be further filtered for co-occurrence with additional keywords.

ReleMed. ReleMed [246] expands a user’s query automatically using UMLS and MeSH. Names of proteins and genes are expanded as well, and lexical variants of words are generated. The user has the option to undo these expansions selectively. Matches in separate sentences are highlighted. ReleMed uses the relational MySQL database to implement a full-text index over single sentences. MeSH headings associated with the abstracts are concatenated and treated as an additional sentence. The relevance of an article is defined in eight levels depending on the co-occurrence of all keywords in one or more sentences.
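The idea of sentence-level relevance can be sketched as follows. The eight levels are not spelled out here, so this sketch assumes a reduced three-level scheme: all keywords in one sentence, all keywords spread over several sentences, or not all keywords present. The function names, the naive sentence splitter and the substring matching are assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_units(abstract, mesh_headings):
    """Abstract sentences plus the concatenated MeSH headings,
    which are treated as one additional sentence."""
    units = split_sentences(abstract)
    if mesh_headings:
        units.append(" ".join(mesh_headings))
    return units


def contains(sentence, keyword):
    """Case-insensitive substring match (a crude stand-in for an index lookup)."""
    return keyword.lower() in sentence.lower()


def relevance_level(abstract, mesh_headings, keywords):
    """Assumed three-level scheme, 1 being the most relevant."""
    units = sentence_units(abstract, mesh_headings)
    if any(all(contains(u, k) for k in keywords) for u in units):
        return 1  # all keywords co-occur in a single sentence
    if all(any(contains(u, k) for u in units) for k in keywords):
        return 2  # all keywords present, but in different sentences
    return 3      # at least one keyword is missing


# Made-up abstract and headings, for illustration only.
abstract = "Rab5 regulates endosome fusion. Rabaptin-5 is an effector."
mesh = ["Endocytosis", "Membrane Fusion"]
```

A finer-grained scheme could, for example, also distinguish how many sentences are needed to cover all keywords.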

PubMed PubReMiner. PubMed PubReMiner [146] shows the user the journals in which his/her keywords are mentioned most often. It displays the authors who have published the most articles mentioning the keywords, and it shows the words that are used most often in the titles and abstracts of the articles. Queries can be refined based on document attributes such as address, substances, MeSH headers, publication year, author and others.
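Frequency tables of this kind amount to counting attribute values over the result set. The following is a minimal sketch, assuming each citation is available as a dictionary with journal, authors, title and abstract fields. The record layout and the function name are hypothetical and do not reflect PubReMiner's implementation.

```python
from collections import Counter
import re


def frequency_tables(citations, top_n=3):
    """Count journals, authors and title/abstract words over a result set."""
    journals = Counter(c["journal"] for c in citations)
    authors = Counter(a for c in citations for a in c["authors"])
    words = Counter()
    for c in citations:
        text = f'{c["title"]} {c["abstract"]}'.lower()
        words.update(re.findall(r"[a-z0-9]+", text))
    return {
        "journals": journals.most_common(top_n),
        "authors": authors.most_common(top_n),
        "words": words.most_common(top_n),
    }


# Made-up records, for illustration only.
citations = [
    {"journal": "J Cell Biol", "authors": ["Smith J", "Lee K"],
     "title": "Rab5 in endocytosis", "abstract": "Rab5 controls fusion."},
    {"journal": "J Cell Biol", "authors": ["Smith J"],
     "title": "Endosome fusion", "abstract": "Rab5 effectors."},
    {"journal": "Traffic", "authors": ["Lee K", "Smith J"],
     "title": "Vesicle transport", "abstract": "Transport of vesicles."},
]
tables = frequency_tables(citations, top_n=1)
```

Each table entry pairs a value with its count, so the most frequent journal, author or word can be offered directly as a refinement of the query.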

ClusterMed. Vivisimo applies clustering methods in ClusterMed and BioMetaCluster [263]. In ClusterMed, PubMed results are clustered in various ways. Document distances are computed based on strings in (1) title, abstract and Medical Subject Headings, (2) title and abstract only, (3) MeSH only, (4) author names only, (5) affiliation only and (6) date of publication only. Vivisimo uses words found in these document attributes to label clusters. The clusters are ordered by the number of documents contained in them. The cluster hierarchy is computed using statistical language processing. For the query “rab5”, ClusterMed returned several clusters such as Vacuoles, Phagosomes, Rabaptin-5, Rab5a and others. The labels are computed on the basis of word occurrence statistics in the retrieved article abstracts. The cluster “Rabaptin-5” contains sub-clusters such as “Ubiquitin”, “GAT domain”, “Vesicular transport”, “Nucleotide exchange”, “Dimerization Of Rabaptin-5”, “Endocytic membrane fusion”, “Correlated, Tissue”, “FRET microscopy”, “Cleaved in apoptotic” and other labels. Most of the labels could be categorized in the context of biomedicine as proteins, cellular components, molecular functions, diseases, techniques and others.

ClusterMed gives the option to compute the clusters only on the MeSH headings. The same string-based clustering techniques are applied, but using only words from MeSH. A clustering result for the query “rab5” displays clusters labeled with MeSH headings such as “Guanine Nucleotide Exchange Factors”, “Virology” and “Pathology”, but also concatenated labels such as “Analysis, Liver”, “Chromatography, Affinity, Cattle” and “Phagosomes, Microbiology”, which do not correspond to a single MeSH heading or sub-heading but to a combination of them. A cluster does not necessarily comprise sub-clusters reflected by a relation in the UMLS. In the examples, the cluster “Guanine Nucleotide Exchange Factors” comprises labels of cellular components, diseases, peptides, proteins and algorithms. The clustering algorithm grouped them on the basis of statistical co-occurrence in the result set. No information about relations between headings is used.

Another feature of ClusterMed is clustering by authors. Here, the strings consisting of the last name plus the initials are clustered. Sub-clusters contain co-authors. Because only these strings are compared, a cluster may contain PubMed citations of different authors with the same last name and initials.
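The author clustering and its ambiguity can be illustrated with a short sketch. It assumes citations are given as dictionaries holding a PubMed ID and a list of (last name, first names) pairs, groups them by the string "last name + initials", and orders the clusters by the number of documents they contain. This is an illustration of the described behaviour, not Vivisimo's algorithm; all names and records are hypothetical.

```python
from collections import defaultdict


def author_key(last_name, first_names):
    """Build the clustered string: last name plus initials."""
    initials = "".join(part[0].upper() for part in first_names.split())
    return f"{last_name.strip()} {initials}"


def cluster_by_author(citations):
    """Group citations by author string; co-authors form the sub-clusters."""
    clusters = defaultdict(lambda: {"pmids": [], "coauthors": set()})
    for cit in citations:
        # Deduplicate keys so a citation is counted once per author string.
        keys = sorted({author_key(last, first) for last, first in cit["authors"]})
        for key in keys:
            clusters[key]["pmids"].append(cit["pmid"])
            clusters[key]["coauthors"].update(k for k in keys if k != key)
    # Largest clusters first; ties are ordered alphabetically.
    return sorted(clusters.items(), key=lambda kv: (-len(kv[1]["pmids"]), kv[0]))


# Made-up records: two different people share the string "Smith J".
citations = [
    {"pmid": 1, "authors": [("Smith", "John"), ("Zerial", "Marino")]},
    {"pmid": 2, "authors": [("Smith", "Jane"), ("Stenmark", "Harald")]},
]
clusters = cluster_by_author(citations)
```

In the toy data, both citations end up in the same "Smith J" cluster although they belong to different people, which is exactly the ambiguity described above.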

BioMetaCluster. BioMetaCluster is a meta search engine based on the Vivisimo clustering architecture. It queries 22 web resources relevant to the biomedical domain and applies string-based clustering to the search results.
