A database is an organized collection of information on a specific subject. Telephone books, library catalogs, and hospital and pharmacy information systems (see Chapters 6 and 7) are all examples of databases. The subject of a database could be scholarly literature, protein structures used in drug discovery, or a set of patients. Most databases described in this textbook are relational databases, which consist of a set of related tables that look like individual sheets in a spreadsheet. Each row in a table (also called a record or tuple) rep- resents one instance of the entity described by the table. A hypothetical inpatient hospital patient database might contain a table of patient information, a table of drug information, and a table of physician information. These tables would be related because physicians treat patients and prescribe drugs, and patients have physicians and are taking drugs.
Each record in the patient table has information about one patient. The columns of a table are called fields. Each record consists of a collection of fields that include data describ- ing the attributes of that record. Fields in our patient table might include name, sex, age, height, weight, diagnoses, allergies, and other patient-specific information. Each field is populated with data that define that attribute (e.g., Smith, female, 64 years, 62 inches, 125 pounds, etc.). To keep track of the drugs that Ms. Smith is taking, we could include fields for all features of the drugs in the patient table. However, this would be inefficient to main- tain the data because the recommended dosage, route of administration, etc. are the same for many patients, so the table would have a large amount of redundant data. Instead, a separate drug table with fields for drug name, dose, cost, etc. should be maintained.
For efficiency and compactness, there should be no redundant or duplicate data. The database parlance for this situation is to say that the database is “highly normalized.” Additionally, each record must be identified uniquely by one or more attributes because
the database must be able to distinguish each patient individually. If some patients had the same last name or the same diagnosis, there might be a chance of confusing the patient records. Every table should have a field that contains unique data for each patient. This unique identifier could be a social security number because no two people should have the same social security number, or in the interest of privacy, each record might include a field for a patient record number. This unique identifier field is called the primary key.
The relationships between tables allow users to extract information from a variety of tables through a query that involves a “join” operation, which allows the user to pull together information from fields in different tables. This is possible because the unique identifier fields in each table (called the primary and foreign keys) are linked. For example, joining patient name and drug name would generate a list of all patients, the drugs they are taking, and their associated dosage information. In performing a query, the user can search all fields of the database or select specific fields. In a large and complex system such as an electronic health record, the relational database consists of many tables, all of which would be linked through primary–foreign key relationships.
4.4.2 Database Searching
The process of specifying what the user is seeking is done through the use of filters, which can take many forms. The filter can be a word or a group of words, a numerical value or range, or a combination of both (e.g., systolic blood pressure over 130). Filters must con- form to database controlled vocabularies. By carefully selecting an appropriate filter and specifying which field to search, a user can retrieve the needed information. For example, one could search the diagnosis field for “myocardial infarction.” However, someone enter- ing data on a patient might have entered “heart attack” or “coronary artery thrombosis” as alternative terms for the same condition. This then becomes a problem if one queries the database to find out how many patients admitted to the hospital were diagnosed as having a myocardial infarction. The records containing heart attack or coronary artery thrombo- sis would not be retrieved.
4.4.3 Role of Vocabularies
From the preceding example, the importance of controlled vocabularies should be obvious. How can data be retrieved and analyzed if they are not entered consistently? In this hypo- thetical case, the controlled vocabulary represents an agreement to use a single term to identify the disease. If, at some later date, it becomes necessary to compare data from both the inpatient and outpatient databases, the task will be much easier if both databases use the same vocabulary. Ideally, only one database would be used for the whole institution and it would use a consistent vocabulary. This situation is rarely the case, and the reconciliation of data across multiple institutional databases has become a major challenge to the future of efficient and cost-effective healthcare provision (see Chapters 16 and 18 for more detail).
To facilitate the entry of the correct terms into the database, data are often entered using a form (or menu on a computer screen) on which, for example, an enumerated list of con- ditions is provided and the person entering the data picks from the list. Two potential problems exist with this scenario. First, the list of conditions could become very long and
42 ◾ Philip E. Bourne and Susan M. McGuinness
the notion of categories and subcategories becomes helpful, as in the discussion of tax- onomies and ontologies later. Second, one of the entries on the enumerated list may not quite fit the clinical situation, so there should be some flexibility in what can be entered. A review of terms that do not fit can lead to further systemization or an extension of the vocabulary. Controlled vocabularies are not only important for patient databases, but also are extremely important to bibliographic databases.