• storage
• database design
REDUNDANCY, VALIDATION, AND CALIBRATION
When we are not conducting a poll, our measurements very often involve converting information from one form into another. For instance, we may need information that has been stored in non-computerized records. The most common way to extract it is to have people review the records and enter what they find into a computer. This process, called data entry, is one of the most error-prone processes in all of statistics. The only method for ensuring reliability is to have two or more people independently enter every single piece of data, and then to compare the entries and resolve any disagreements against the source. This procedure is called data validation. Obviously, data validation doubles the data collection cost, at a very minimum. However, this doubling of cost is far less expensive than the disasters that can befall us if we do not validate.
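As a minimal sketch of the comparison step, suppose each person's keyed data is available as a dictionary mapping a record identifier to its field values. The function name find_mismatches, the field names, and the sample records below are hypothetical, chosen only for illustration. Every field on which the two entries disagree is reported so that it can be checked against the original record:

```python
def find_mismatches(entry_a, entry_b):
    """Compare two independent keyings of the same records.

    entry_a, entry_b: dict mapping record id -> dict of field -> value.
    Returns a list of (record_id, field, value_a, value_b) tuples for
    every field where the two keyings disagree or one side is missing.
    """
    mismatches = []
    for record_id in sorted(set(entry_a) | set(entry_b)):
        fields_a = entry_a.get(record_id, {})
        fields_b = entry_b.get(record_id, {})
        for field in sorted(set(fields_a) | set(fields_b)):
            value_a = fields_a.get(field)  # None if this side lacks the field
            value_b = fields_b.get(field)
            if value_a != value_b:
                mismatches.append((record_id, field, value_a, value_b))
    return mismatches


# Hypothetical example: two people key the same two paper records.
first_pass = {
    101: {"age": 34, "salary": 52000},
    102: {"age": 45, "salary": 61000},
}
second_pass = {
    101: {"age": 34, "salary": 52000},
    102: {"age": 54, "salary": 61000},  # digits transposed by the second person
}

disagreements = find_mismatches(first_pass, second_pass)
for record_id, field, a, b in disagreements:
    print(f"record {record_id}, field {field!r}: {a!r} vs {b!r}")
```

Agreement between the two entries does not prove that a value is correct, since both people could make the same mistake. It does catch the independent slips, such as transposed digits, that account for much of the error in data entry.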
We also need to minimize the number of times the information is transformed and/or transported. Every time data are copied or moved or translated from one form into another, there is a potential for additional error. Once data have been safely and reliably collected, the less that is done to disturb them, the better.
Another issue is whether automatic computerized translation and copying of data are "better" than doing things by hand. When the procedure is just a matter of copying, then automatic means, whether mechanical or computer-based, are much better than doing things by hand.
On the other hand, using the computer is not always the best answer. It is not so much that computers make fewer errors than people, but that they make different kinds of errors. When the risk of computer-type error is high, we should use people. When the risk of human error is high, we should use the computer.
SURVIVAL STRATEGIES
The problem with people doing data entry is called garbage in, garbage out. On the other hand, the problem with automated data collection is summed up in the adage: To err is human, but to really foul things up takes a computer.
The lesson: We are responsible for ensuring the accurate input and translation of data, whether by computer, by machine, or by people.
The most reliable form of data entry is probably the electromechanical device, where a piece of mechanical equipment, such as a thermometer with a bimetallic strip or an infrared detector, measures temperature or the length of products coming off the assembly line, or some other physical attribute. The measurement is translated into an electrical impulse and recorded on computer-readable media. Even these systems are prone to error, but that is a problem for engineers, not statisticians. Even in these situations, redundancy is the best method of calibration for eliminating bias and error, whether through repeated entry of the same data followed by comparison, or through entry and checking of known data.
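The idea of entering and checking known data can be sketched as follows. We assume that a set of reference items with known values has been run through the measuring device, and we estimate the bias as the average difference between the recorded and the known values. The function name calibration_summary and the reference lengths are hypothetical, and a real calibration procedure would be designed with the engineers responsible for the equipment:

```python
from statistics import mean, stdev


def calibration_summary(known_values, measured_values):
    """Summarize device error against reference items of known value.

    Returns (bias, spread): the mean of (measured - known), and the
    sample standard deviation of those differences.
    """
    if len(known_values) != len(measured_values):
        raise ValueError("each known value needs exactly one measurement")
    if len(known_values) < 2:
        raise ValueError("need at least two reference items")
    differences = [m - k for k, m in zip(known_values, measured_values)]
    return mean(differences), stdev(differences)


# Hypothetical reference lengths (mm) and what the device recorded for them.
known = [100.0, 150.0, 200.0, 250.0]
measured = [100.25, 150.75, 200.5, 250.5]

bias, spread = calibration_summary(known, measured)

# Subtracting the estimated bias removes the systematic part of the error;
# the spread indicates how much random error remains.
corrected = [m - bias for m in measured]
print(f"estimated bias: {bias:.3f} mm, spread: {spread:.3f} mm")
```

A bias that is large relative to the spread suggests a systematic problem with the device, which can be corrected. A large spread suggests random error, which a simple correction of this kind cannot remove.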
We can use computer input devices, such as optical scanners, to input survey data in a similar fashion. Or we can design a computer interface, such as a web page, where our population enters the data directly. These methods can be highly reliable, but it is important to realize that the reliability does not happen automatically, or by default. Computer interface design and testing is an engineering discipline and art in its own right, and we should make sure that we work with experts who understand bias and its sources, redundancy, testing, calibration, and error correction.
If people are entering the data, there are several possible systems, including manual recording with later transcription, standardized forms to be scanned, or direct data entry.
Each of these has advantages and disadvantages. It may seem that direct data entry is best. But consider this: suppose the survey workers are under pressure to meet an impossible deadline. They might stay up late, just punching numbers into the computer and making up data. If they filled out paper forms, there would be physical evidence of this, such as the pattern of handwriting. Computer data entry leaves no physical trace. So, we would need to devise other means to detect such sources of error, such as a hidden timestamp on each survey, showing when it was done and how long it took, perhaps tied to the phone system used for the survey.
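A hidden timestamp check of this kind might look like the sketch below. It assumes that the data entry system automatically records a start and finish time for each survey. The function name flag_suspect_surveys, the two-minute threshold, and the sample log are all hypothetical, and a sensible threshold would have to come from timing genuine surveys:

```python
from datetime import datetime, timedelta


def flag_suspect_surveys(surveys, min_duration=timedelta(minutes=2)):
    """Return the ids of surveys completed faster than min_duration.

    surveys: iterable of (survey_id, started_at, finished_at) tuples,
    where both timestamps are datetime objects recorded automatically.
    A finish time earlier than the start time is also flagged, because
    the resulting negative duration is below any positive threshold.
    """
    suspects = []
    for survey_id, started_at, finished_at in surveys:
        duration = finished_at - started_at
        if duration < min_duration:
            suspects.append(survey_id)
    return suspects


# Hypothetical log: survey S3 was keyed in 20 seconds, in the middle of the night.
log = [
    ("S1", datetime(2024, 3, 4, 14, 0, 0), datetime(2024, 3, 4, 14, 9, 30)),
    ("S2", datetime(2024, 3, 4, 15, 10, 0), datetime(2024, 3, 4, 15, 21, 0)),
    ("S3", datetime(2024, 3, 5, 2, 5, 0), datetime(2024, 3, 5, 2, 5, 20)),
]

suspects = flag_suspect_surveys(log)
print("surveys to review:", suspects)
```

A flagged survey is not proof of fabrication. It only tells us which records deserve a closer look, for example by comparing them with the phone system's call records.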
STORAGE
If we design a convenient, secure data storage system that allows for easy input, secure storage of all the types of data and file formats we need, and easy, appropriate retrieval that does not compromise security, we will need to copy and transfer the data fewer times, and we will have fewer errors. The data should also be backed up and archived appropriately, with proper security on the archives. Good security is a balance of protection and appropriate access. We should review all of these issues with the appropriate data systems manager, because the storage, encryption, and security requirements for a statistical study, especially one with HR or other sensitive data, are different from the requirements for storage of ordinary business