Record structures work best when there is an exact one-to-one correspondence between entities and their names (representations), i.e., no synonyms or ambiguities. And they work best when all entities of a given type have the same name format (representation). Under these conditions, it is feasible to have a single format specified for a field in which these entities might occur. And it is easy to detect references to the same entity: just match the contents of the fields.
Real entities don't always behave so simply. The employees of a multinational corporation might not all have Social Security numbers, or employee numbers (or they might be in different formats in different countries). But many employees have both, and some may have several Social Security numbers. Some books don't have International Standard Book Numbers (ISBN), others don't have Library of Congress numbers, and some have neither. But many books have both—and some have several ISBNs. And Library of Congress numbers apply to a larger class of entities than do ISBNs; they are also assigned to films, recordings, and other forms of publication, in addition to books. Oil companies have their own conventions for naming their own oil wells, and the American Petroleum Institute has also assigned “standard” names to some wells—but not all.
For all practical purposes, record systems can't cope with partially applicable names. In order to use records for an application, it is necessary that some naming convention be adopted which applies to all occurrences of the entity type.
Synonyms are not really managed at all, as far as the structure and description of data are concerned. If fields in two different record types contain employee numbers, then the system can perceive that some of these records might refer to the same person. (This is, in fact, the fundamental mechanism for expressing relationships in the relational model: matching field values imply that two records are related, and can be “joined.”) But if one record type contains Social Security numbers instead, then this knowledge is lost. As far as the system is concerned, there are no potential relationships here. It is only in the minds of users, and in procedural logic buried in programs, that any suspicion lurks that these might in fact refer to the same people.
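The point about matching field values can be made concrete with a small sketch. Everything here is hypothetical: the record types, the field names (`emp_no`, `ssn`), and the values are invented for illustration, and `join` is a bare-bones stand-in for a relational join, not any particular system's implementation.

```python
# Hypothetical record types; all field names and values are invented.
payroll = [
    {"emp_no": "E100", "salary": 52000},
    {"emp_no": "E200", "salary": 61000},
]
parking = [
    {"emp_no": "E100", "space": "A7"},
]
benefits = [
    {"ssn": "000-00-0001", "plan": "dental"},  # same person as E100, but named differently
]


def shared_fields(left, right):
    """Field names common to both record types: the only links the system can see."""
    return set(left[0]) & set(right[0])


def join(left, right, field):
    """Pair up records whose values in `field` match exactly."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if field in l and field in r and l[field] == r[field]
    ]


payroll_parking = join(payroll, parking, "emp_no")
payroll_benefits_links = shared_fields(payroll, benefits)
```

Here `payroll_parking` contains one joined record for `E100`, while `payroll_benefits_links` is empty: nothing in the data tells the system that the benefits record concerns the same person. That knowledge would have to live in a separate mapping maintained by users or programs.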
And in all of this, we haven't bothered to mention simple synonyms. Many skills, jobs, companies, people, colors, etc., etc., have more than one name. We might have to deal with them in multiple languages, as well. We have many ways to represent the same date. Quantifiable things are written in different ways depending on the unit of measure, data type, number base, and so on. Our systems are usually inconsistent in handling these: they will help with such things as conversion algorithms in some cases, but not in others.
It can be very difficult to model, in a record-based system, the knowledge that different representations in different records might refer to a single underlying entity (cf. [Stamper 77], [Hall 76], [Falkenberg 76b], [Kent 77a]).
Perhaps the most blatant illustration of this is our inability to manage mailing lists. I don't know how to explain to my non-technical friends why sophisticated modern computers can't eliminate the duplications in a mailing list. The most trivial variation in the way a person writes, abbreviates, or punctuates his name or address is enough to confuse the system, and prevent it from recognizing references to the same person.
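A minimal sketch of why this is hard, using invented names and addresses. The `normalize` function is a hypothetical cleanup rule (lowercasing, stripping punctuation, expanding one abbreviation), not a recommended deduplication method.

```python
import re

# Four mailing-list entries for one (invented) person.
mailing_list = [
    ("Pat Example", "12 Main Street"),
    ("Pat Example", "12 Main Street"),  # exact duplicate
    ("Example, Pat", "12 Main St."),
    ("P. Example", "12 Main St"),
]

# Exact matching removes only the verbatim duplicate.
exact_unique = set(mailing_list)


def normalize(text):
    """Lowercase, replace punctuation with spaces, and expand 'st' to 'street'."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join("street" if w == "st" else w for w in words)


# Normalizing reconciles the addresses but not the three spellings of the name.
normalized_unique = {(normalize(name), normalize(addr)) for name, addr in mailing_list}
normalized_addresses = {addr for _, addr in normalized_unique}
```

Exact matching leaves three entries. The cleanup rule brings the addresses down to one form, yet three entries still remain, because word order and initials in the name are not resolved by simple string rules.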
You would think that Kent's point about getting the data right for a mailing list would have been solved over these past forty years, but not so! I have a rather strange hobby of collecting examples of bizarre data situations that indicate an issue with information and how it has been modeled. Below is an actual letter I received in the mail recently. The address is correct, but look at who it is addressed to: “Fname Lname”! And it is an “Exclusive Invitation” for me!
STRUCTURED NAMES
Additional confusion arises when the synonyms of an entity exhibit different kinds of structure. A person's name might be structured into three fields for first, middle, and last names; his other synonyms are single fields: employee number, Social Security number. A date (if you will accept that as an entity) has three fields in the traditional representation, but only one in Julian notation. (A Julian date is a single integer combining year and day of year: the last day of 1977 is 77365.) Now, every relationship involving a person or a date will have an uncertainty, not only with respect to the data items the fields might contain, but also with respect to the number of fields occurring in the record. Thus a binary relationship between people and dates (e.g., birthdates) could be represented in two, four, or six fields, depending on the representations chosen. But it is still fundamentally a binary relationship. Thus there is potentially a poor (and unstable) correspondence between the degree of a relationship and the number of fields used to represent it. Note that this differs from an earlier situation where we had different kinds of entities. Here we have the same entities, but different names.
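The field-count arithmetic can be illustrated with a short sketch. The function `to_julian` implements the year-plus-day-of-year form described above; the record layouts, field names, and the sample person are hypothetical.

```python
from datetime import date


def to_julian(d):
    """Two-digit year followed by three-digit day of year, as a single integer."""
    return (d.year % 100) * 1000 + d.timetuple().tm_yday


last_day_1977 = to_julian(date(1977, 12, 31))

# One binary relationship (person, birthdate) in three hypothetical record layouts.
birthday = date(1951, 2, 1)

two_fields = {"emp_no": "E100", "birth_julian": to_julian(birthday)}
four_fields = {
    "emp_no": "E100",
    "birth_year": birthday.year,
    "birth_month": birthday.month,
    "birth_day": birthday.day,
}
six_fields = {
    "first": "Pat",
    "middle": "Q",
    "last": "Example",
    "birth_year": birthday.year,
    "birth_month": birthday.month,
    "birth_day": birthday.day,
}

field_counts = [len(two_fields), len(four_fields), len(six_fields)]
```

All three records state the same fact about the same two entities; only the chosen representations, and therefore the number of fields, differ.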
COMPOSITE NAMES AND THE SEMANTICS OF RELATIONSHIPS
Composite (e.g., qualified) names occurring in records tend to confuse the purpose and semantics (and degree) of the relationships being represented. This is especially noticeable when the composite names are themselves based on relationships. Consider, for example, the naming of an employee's dependents by the two fields consisting of the employee identification plus the dependent's first name (as in Chapter 3).
The dependents in this illustration might occur in any number of relationships, being related, e.g., to benefits programs for which they are eligible, histories of claims and payments, employees responsible for them as counselors, other employee records because the dependents are themselves employees, etc. From an informational point of view, the employee on whom the person is dependent comprises a distinct, independent relationship. Yet, due to the naming convention, this information is gratuitously carried around in all the other relationships. For all of the other information, there is a single well-defined relationship that must be accessed to get the facts; but for this particular information, any relationship will do. (Of course, that gratuitous information would suddenly disappear if the naming convention for dependents were switched from qualified naming to Social Security numbers.)
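As a rough sketch of this effect, consider two unrelated record types that both name a dependent by the qualified pair. The record types, field names, and values are all hypothetical.

```python
# Dependents named by (employee id, first name); all values are invented.
benefit_eligibility = [
    {"emp_no": "E100", "dep_first": "Robin", "program": "dental"},
]
claims = [
    {"emp_no": "E100", "dep_first": "Robin", "claim_no": 1, "amount": 120.0},
]


def supporting_employee(record):
    """Read the 'is a dependent of' fact out of a record that is about something else."""
    return record["emp_no"]


# Any relationship will do: both records yield the same employee.
from_benefits = supporting_employee(benefit_eligibility[0])
from_claims = supporting_employee(claims[0])

# The same claim with the dependent named by a (fake) Social Security number.
claims_by_ssn = [
    {"dep_ssn": "000-00-0002", "claim_no": 1, "amount": 120.0},
]
employee_still_visible = "emp_no" in claims_by_ssn[0]
```

With qualified naming, the employee can be recovered from either record. Under the alternative naming, the claim record says nothing about the employee, although the claim itself has not changed.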
A basic information model should be able to represent dependents as individual entities in these relationships, without dragging their related employees into every such context. If it is useful for applications to see dependents so identified in various relationships, then it is appropriate to define such derived “views” for the benefit of these applications. But the underlying information model need not confuse relationships with identification. A given relationship (e.g., between a dependent and a benefit program) exists independent of the means of identifying the dependent. That relationship should not be perturbed by problems or changes which might arise in the identification scheme.
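One way to picture this separation is sketched below. It is an illustration under stated assumptions, not a prescribed design: the `Dependent` class, its opaque `key`, and the `qualified_view` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependent:
    """An individual entity; `key` is an opaque, hypothetical internal identifier."""
    key: int


robin = Dependent(key=1)

# Each relationship is stored on its own, as pairs involving the entity itself.
dependent_of = {(robin, "E100")}
eligible_for = {(robin, "dental")}

# Identification is kept apart from the relationships.
first_names = {robin: "Robin"}


def qualified_view(pairs):
    """Derived view: present each dependent as (employee id, first name, other)."""
    employee_for = {dep: emp for dep, emp in dependent_of}
    return [(employee_for[dep], first_names[dep], other) for dep, other in pairs]


view_before = qualified_view(eligible_for)

# A change in the identification scheme leaves the relationship untouched.
first_names[robin] = "Robyn"
view_after = qualified_view(eligible_for)
```

Applications that want the qualified form get it from the derived view, and a correction to the name changes only what the view displays; the stored eligibility relationship is the same before and after.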