Other techniques that are potentially useful for automated content-based quality assessments include natural language processing and machine learning methods. Few published works in quality assessment have employed these methods. The effectiveness of these techniques for health care information quality rating is explored in this study.
Natural language processing and related techniques are applied in this study to implement semantic parsing and processing. Through semantic parsing, sentences expressed in natural human language can be mapped to a formal representation of semantic concepts and relationships. Thus, computer programs can be developed to rate health information quality based on the semantics of the text in web pages. The development of information extraction techniques has attracted great interest from researchers in multiple disciplines, mainly computer science. Extensive research activity started in the 1980s, including the Message Understanding Conferences, initiated in 1987 and financed by DARPA, and the Text Retrieval Conferences (TREC), run by the National Institute of Standards and Technology (NIST). Techniques such as named entity recognition have proven effective for dealing with text in diverse domains, including the biomedical domain (e.g., Nadeau & Sekine, 2007).
The technology for semantic processing and analysis is in a growth stage. Although there is as yet no universal tool that can solve domain-independent text understanding problems in general, semantic analysis and processing have been successful in some domain-specific applications. In the health care domain in particular, semantic parsing and processing have been used successfully for biomedical concept annotation (e.g., CONANN) and biomedical text summarization (Reeve, 2007), for classifying patient medical records (Chen et al., 2010), and for extracting medication information from textual clinical records (Deléger et al., 2010). Given the success of these studies, it is worthwhile to attempt automated quality assessment of web health care information based on shallow semantic analysis of text, so that the content of web documents and the rating criteria can be compared through their semantics.
In the health care knowledge domain, many tools are available to facilitate text processing functions, including morphological processing, syntactic processing, and the grouping of synonymous terms into concepts. Controlled vocabulary resources are also available, such as MeSH, published by the National Library of Medicine. In addition, the SNOMED CT
the U.S. National Library of Medicine provides a tool called the Unified Medical Language System (UMLS). The UMLS integrates more than 60 families of controlled biomedical vocabularies, including MeSH and SNOMED CT. It was developed to reduce the barriers to effective retrieval of machine-readable information by covering the variety of ways in which the same concepts are expressed in different sources and by different people, and also to enable the representation and distribution of health care information among systems (Humphreys et al., 1998). With these two advantages, the UMLS can serve as an excellent infrastructure for transforming a great variety of text in the biomedical domain into normalized semantic annotations. Therefore, the UMLS is used in this study to generate semantic representations of both the web health care content, written in natural-language English, and the rating criteria, so that the two can be compared.
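The idea of normalizing different surface expressions to shared concepts, and then comparing page content with a rating criterion at the concept level, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the lexicon `TOY_LEXICON`, its concept identifiers, and the functions `annotate` and `concept_overlap` are hypothetical and do not use the UMLS or any of its actual tools. Matching is done by naive substring search, which a real annotator would not rely on.

```python
# Hypothetical mini-lexicon: surface phrase -> made-up concept identifier.
# These are NOT real UMLS identifiers; they only stand in for the idea.
TOY_LEXICON = {
    "heart attack": "C_MI",
    "myocardial infarction": "C_MI",
    "aspirin": "C_ASPIRIN",
    "acetylsalicylic acid": "C_ASPIRIN",
    "chest pain": "C_CHEST_PAIN",
}


def annotate(text, lexicon=TOY_LEXICON):
    """Return the set of concept IDs whose phrases occur in the text.

    Uses case-insensitive substring matching, a deliberate simplification.
    """
    lowered = text.lower()
    return {concept for phrase, concept in lexicon.items() if phrase in lowered}


def concept_overlap(page_text, criterion_text, lexicon=TOY_LEXICON):
    """Fraction of the criterion's concepts that also appear in the page."""
    page_concepts = annotate(page_text, lexicon)
    criterion_concepts = annotate(criterion_text, lexicon)
    if not criterion_concepts:
        return 0.0
    return len(page_concepts & criterion_concepts) / len(criterion_concepts)


page = "Take acetylsalicylic acid after a heart attack."
criterion = "Aspirin is recommended following myocardial infarction."
score = concept_overlap(page, criterion)
```

In this toy example the page and the criterion share no content words, yet both map to the same two concepts, so the overlap is complete. A keyword comparison of the same two sentences would find no match.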
Another potentially useful technique for quality assessment is text classification and its related machine learning algorithms. In our semantics-based quality assessment approach, quality scores are assigned by semantically comparing the text with the evidence-based quality rating criteria. This study attempts to implement the comparison through text classification, in which text is classified into predefined classes according to its content. A number of statistical and machine learning techniques have been developed for text classification, including rule-based decision systems, Naïve Bayes, support vector machines, and maximum entropy models (Sebastiani, 2002). Many early text classifiers were based on keywords extracted from the documents, on the assumption that a keyword is a unique representative of a distinctive concept or semantic unit. Thus, these earlier classifiers do not include
a word may represent multiple different meanings, and people can choose different words to refer to the same meaning. However, in some text classification studies (e.g., Wiener et al., 1995; Liu et al., 2004; Wang et al., 2005; Wang & Liu, 2009), a natural language processing technique, latent semantic analysis (LSA), was shown to partially resolve the problems of word choice and redundant semantic relationships in text classification. This technique analyzes the associative semantic relationships between a set of documents and the terms they contain by constructing a latent semantic space related to the documents and terms. The LSA studies suggest that the use of semantics is likely to benefit text classification in this study, and hence it is worthwhile to attempt text classification in combination with semantic parsing and analysis.
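A small worked sketch may help make the latent semantic space concrete. It is not the implementation used in the cited studies. It makes several simplifying assumptions: raw term counts instead of a weighted term-document matrix, an invented five-document corpus (`DOCS`), and power iteration with deflation on the document-by-document matrix of inner products in place of a library singular value decomposition, which keeps the example dependency-free. All function names are hypothetical.

```python
import math
from collections import Counter

# Invented toy corpus. Documents 0 and 1 share no words; document 2 links them.
DOCS = [
    "heart attack",
    "myocardial infarction",
    "heart attack myocardial infarction",
    "flu vaccine",
    "flu vaccine schedule",
]


def gram_matrix(docs):
    """Document-by-document inner products of raw term-count vectors."""
    counts = [Counter(d.lower().split()) for d in docs]
    return [[float(sum(ci[t] * cj[t] for t in ci)) for cj in counts] for ci in counts]


def top_eigenpairs(matrix, k, iters=300):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for j in range(k):
        v = [1.0 + 0.1 * (i + j) for i in range(n)]  # fixed, non-uniform start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(a * b for a, b in zip(row, v)) for row in work]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(work[a][b] * v[b] for b in range(n)) for a in range(n))
        pairs.append((lam, v))
        for a in range(n):  # deflate: remove the component just found
            for b in range(n):
                work[a][b] -= lam * v[a] * v[b]
    return pairs


def latent_coordinates(docs, k):
    """Coordinates of each document in a k-dimensional latent space."""
    pairs = top_eigenpairs(gram_matrix(docs), k)
    return [[math.sqrt(max(lam, 0.0)) * v[d] for lam, v in pairs]
            for d in range(len(docs))]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


coords = latent_coordinates(DOCS, k=2)
similar = cosine(coords[0], coords[1])    # "heart attack" vs "myocardial infarction"
unrelated = cosine(coords[0], coords[3])  # "heart attack" vs "flu vaccine"
```

In the original term space, documents 0 and 1 have no words in common and therefore zero similarity. In the two-dimensional latent space they become nearly identical, because document 2 uses both vocabularies together, while the flu documents remain nearly orthogonal to them. This is the sense in which LSA can partially address differences in word choice.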
Text classification has been successfully used in various domains to solve different application problems, such as e-mail spam filtering (e.g., Sahami et al., 1998), categorizing news articles into topics (Schapire & Singer, 2000), and assigning international clinical codes to patient clinical records (Chen et al., 2010). This thesis explores the application of text classification to a new area: rating the content quality of health documents on the web.
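To illustrate what keyword-based text classification involves, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, one of the techniques listed above. It is not the classifier developed in this thesis. The class name `NaiveBayesText`, the two labels, and the four training sentences are invented for illustration, and tokenization is reduced to lower-casing and splitting on whitespace.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # number of documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data with hypothetical class labels.
train_texts = [
    "aspirin reduces risk after heart attack",
    "call emergency services for chest pain",
    "miracle cure melts fat overnight",
    "buy this secret cure now",
]
train_labels = ["evidence_based", "evidence_based", "unsupported", "unsupported"]

model = NaiveBayesText().fit(train_texts, train_labels)
label_a = model.predict("secret miracle cure")
label_b = model.predict("aspirin after chest pain")
```

Because the classifier treats each keyword as an independent feature, it has no notion that two different words may express the same concept, which is the limitation that motivates combining classification with semantic analysis.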