Wherever the data originates, it is important that it is processed and stored in a methodical manner. Data capture needs to be disciplined, with data entered into prescribed fields and validated with software. The customer record is broken into fields, with each field holding an item of data such as surname, product number, time of purchase, telephone number, etc.
Raw data entered into a database, whether by a call centre operator, by a typist or from an external data file, needs to go through a number of steps before it goes live on the database. These are described in the following subsections.
Formatting
Formatting aims to remove the inconsistencies and data ‘noise’ that appear within incoming data. Data should be in the correct sequence and length to fit within the various fields of the customer record. This may mean that punctuation and spaces need to be removed. Lower-case data may also need to be converted to upper case. Abbreviations may also be substituted for longer or common words (e.g. Limited may become Ltd). Where the databases are international and therefore cover a number of countries, the database must take account of the different conventions in terms of addresses. For example, in France, the number of the house may be listed after the street name rather than before it, as is the case in the UK.
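The formatting steps above can be illustrated with a minimal Python sketch. The function name format_field, the one-entry abbreviation table and the field length are hypothetical and are not taken from any particular database package. The sketch also assumes that surplus spaces are collapsed rather than removed entirely.

```python
import re

# Hypothetical abbreviation table; a real system would hold many more entries.
ABBREVIATIONS = {"LIMITED": "LTD"}


def format_field(raw: str, max_length: int) -> str:
    """Normalise one raw field value before it is stored."""
    text = raw.upper()                        # lower case to upper case
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    words = text.split()                      # drops leading, trailing and repeated spaces
    words = [ABBREVIATIONS.get(word, word) for word in words]
    # Cut to the field length, then drop any space left at the cut point.
    return " ".join(words)[:max_length].rstrip()


print(format_field("  Acme   Widgets, Limited. ", 30))  # ACME WIDGETS LTD
```

Address conventions that differ by country, such as the position of the house number, would need separate rules for each country and are not covered here.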
Validation
Software will be used to validate the accuracy of the information. Validation involves checking that the data is complete, appropriate and consistent, and may occur as the data is being entered or once the data is in the system. The database software will be able to identify fields where there are missing or invalid values. For example, the computer is likely to check name prefixes against a table of reference data such as:
01 Mr, 02 Mrs, 03 Master, 04 Miss, 05 Sir
06 Lord, 07 Lady, 08 Dr, 09 Rev, 10 Admiral
11 General, 12 Major, 13 Viscount, 14 Hon., 15 Professor
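A check against this reference table might look like the following Python sketch. The function names, the record field names (prefix, surname) and the decision to ignore case and a trailing full stop are all assumptions made for illustration.

```python
# Reference data copied from the table above; codes 01 to 15 follow list order.
PREFIXES = ["Mr", "Mrs", "Master", "Miss", "Sir", "Lord", "Lady", "Dr", "Rev",
            "Admiral", "General", "Major", "Viscount", "Hon.", "Professor"]


def _key(prefix: str) -> str:
    """Comparison key: ignore case, surrounding spaces and a trailing full stop."""
    return prefix.strip().rstrip(".").lower()


CODE_BY_PREFIX = {_key(p): f"{i:02d}" for i, p in enumerate(PREFIXES, start=1)}


def prefix_code(value: str):
    """Return the two-digit reference code, or None if the prefix is not listed."""
    return CODE_BY_PREFIX.get(_key(value))


def validate(record: dict, required=("prefix", "surname")) -> list:
    """List the problems found in one record (hypothetical field names)."""
    problems = [f"missing {field}" for field in required
                if not (record.get(field) or "").strip()]
    prefix = (record.get("prefix") or "").strip()
    if prefix and prefix_code(prefix) is None:
        problems.append("invalid prefix")
    return problems


print(validate({"prefix": "Mx", "surname": ""}))  # ['missing surname', 'invalid prefix']
```

An empty list means the record passed these two checks only; completeness of the address and consistency between fields would need further rules.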
Similar tables of reference data can be used to check job titles, brands of products, models of cars, etc. Addresses are checked against the national Postcode Address File (PAF) for completeness and accuracy. The PAF in each country will hold the full postal address and postcode for each property in the country and also for large organisations that have their own unique postcode. The file can be obtained either free or for a small charge from the national Post Office in each country.
In the UK the Postcode Address File is the official Post Office file of postcodes and addresses. It includes over 26 million addresses and approximately 1.7 million postcodes. The PAF contains no data about the occupants of these addresses. It is available on CD-ROM with quarterly updates.
The data fields may also be subject to a rejection process that recognises spurious data or names given by pranksters, such as Mickey Mouse and Donald Duck, although care must be taken to ensure that no real Mr M. Mouse or Mr D. Duck is wrongly branded a prankster. Software can also be used to identify inconsistencies in the data; for example, a file may suggest that an individual does not have a mobile phone while the contact details give a mobile phone number.
Deduplication
Once the data has been entered into the database and validated, the process of deduplication will be undertaken. Deduplication is the process through which data belonging to different transactions or service events is united for a particular customer. Duplication may occur because address data is incomplete or entered in slightly different styles such as:
Mr David P. Jackson, Valley Green Gatehouse, Abbeyfield, Grampian, AB12 4ZT
Mr P. Jackson, The Gatehouse, Valley Green, Abbeyfield, Grampian, AB12 4ZT
Mr & Mrs Jackson, V.G. Gatehouse, Abeyfield, Near Aberdeen
Mr D.P. Jakson, The Gatehouse, Valley Green Road, Abbeyfield
In order to maintain the integrity of a customer database, the software should either automatically eliminate duplicates or identify potential duplicates that require a manual inspection and a decision to be taken. Where the process is done automatically, the software is adjusted to opt for overkill or underkill. Overkill is the technical term to describe the situation where the system removes all entries that may potentially be duplicates. Underkill is where the system may fail to detect duplicates as it only removes entries where the likelihood of duplication is very high. Overkill would be chosen by a credit card company to ensure a customer does not open two accounts by pretending to be two people. Underkill would be chosen where an organisation wanted to have as many prospects as possible on its database. Deduplication is more difficult in business-to-business databases as a result of organisations using multiple trading names, multiple locations, PO boxes and various abbreviations.
Deduplication needs to be undertaken each time data is added to a database and particularly when data files are merged as a result of corporate takeovers or the purchasing of external databases or mailing lists.
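The choice between overkill and underkill can be thought of as a matching threshold. The Python sketch below is one possible illustration, not a description of any commercial deduplication package. It assumes a simple character-based similarity score from the standard library's difflib module, and the function names and threshold values are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(text: str) -> str:
    """Upper-case, drop full stops, treat commas as spaces, collapse spacing."""
    return " ".join(text.upper().replace(".", "").replace(",", " ").split())


def similarity(a: str, b: str) -> float:
    """Score between 0.0 and 1.0; 1.0 means identical after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def candidate_duplicates(records, threshold):
    """Return index pairs whose similarity reaches the threshold."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(records), 2)
            if similarity(a, b) >= threshold]


records = [
    "Mr David P. Jackson, Valley Green Gatehouse, Abbeyfield, Grampian, AB12 4ZT",
    "Mr P. Jackson, The Gatehouse, Valley Green, Abbeyfield, Grampian, AB12 4ZT",
    "Mr & Mrs Jackson, V.G. Gatehouse, Abeyfield, Near Aberdeen",
    "Mr D.P. Jakson, The Gatehouse, Valley Green Road, Abbeyfield",
]

# Hypothetical settings: a low threshold leans towards overkill,
# a high threshold leans towards underkill.
overkill_pairs = candidate_duplicates(records, threshold=0.6)
underkill_pairs = candidate_duplicates(records, threshold=0.9)
```

Every pair found at the higher threshold is also found at the lower one, so lowering the threshold can only add candidate pairs. In practice the flagged pairs could be removed automatically or passed to a person for manual inspection, as described above.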