Monitoring and evaluation - UNIVERSIDAD COLEGIO MAYOR DE CUNDINAMARCA

6.4.1 Setting the cut-off threshold

A trade-off exists between the level of false positive matches and the level of false negative matches. It is important to consider the objectives of the matching exercise when determining cut-off thresholds. For example, if it is critical to avoid false matches, then set the cut-off threshold higher, mindful that some true matches will be missed.

The (non-negative) cut-off threshold is the composite weight value that separates the links which the analyst considers to be matches from those which the analyst does not. All record pairs whose composite weight is greater than or equal to the cut-off are regarded as links. Deciding on the cut-off value is one of the more difficult tasks the analyst faces in a data integration project, as the boundary is not clear-cut. It is acknowledged that even experienced analysts could produce significantly different linked outputs.³⁷
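The cut-off rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_pairs`, the `(id_a, id_b, composite_weight)` tuple layout and the sample weights are assumptions made for the example, not part of any particular data integration software.

```python
def classify_pairs(pairs, cutoff=0.0):
    """Split scored record pairs into links and residuals for one pass.

    pairs  -- iterable of (id_a, id_b, composite_weight) tuples (assumed layout)
    cutoff -- non-negative cut-off threshold for the pass
    """
    if cutoff < 0:
        raise ValueError("the cut-off threshold must be non-negative")
    # Pairs at or above the cut-off are regarded as links.
    links = [p for p in pairs if p[2] >= cutoff]
    # Pairs below the cut-off are left over for later passes.
    residuals = [p for p in pairs if p[2] < cutoff]
    return links, residuals


# Hypothetical scored pairs, for illustration only.
pairs = [
    ("a1", "b1", 24.5),
    ("a2", "b7", 21.07),
    ("a3", "b2", 3.2),
    ("a4", "b9", -6.0),
]
links, residuals = classify_pairs(pairs, cutoff=21.07)
```

Raising the `cutoff` argument moves pairs from `links` to `residuals`, which reduces false positives at the cost of missing some true matches.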

In practice, the cut-off is initially set at zero for a given pass and is iteratively changed before proceeding to the next pass. After running the pass, the weights histogram can be examined to aid in deciding the cut-off score for the pass. Ideally, the frequencies of matched records trail off as the weights become lower, while the frequencies of unmatched records trail off as the weights become higher. This ideal situation produces a ‘bimodal’ distribution. The farther apart from each other the modes are, the better the discrimination between the matched and unmatched records. This scenario is represented by the figure below.

36 Winkler WE (1988). “Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage”, Proceedings of the Survey Research Methods Section, American Statistical Association, 667–671.

37 Gomatam S, Carter R, Ariet M, Mitchell G (2002). “An empirical comparison of record linkage procedures”, Statistics in Medicine 21, 1485–1496.

[Figure: Distribution of Composite Weights Across All Possible Comparison Pairs. Horizontal axis: Weights (-150 to 100). Vertical axis: Number of comparison pairs. Curves: Non-Matches, Matches, Observed.]

In reality, the distribution is far more complex. Multimodal distributions are not uncommon and the trailing of frequencies described above may not be as observable. Also, in some software, as comparisons are not made for records that have no chance of matching, comparisons with negative weights do not appear in the histogram.
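Where the linking software does not draw the weights histogram itself, one can be tallied from the composite weights of a pass. The sketch below is a minimal illustration; the function name `weight_histogram`, the bin width of 1 and the sample weights are all assumptions made for the example.

```python
import math
from collections import Counter


def weight_histogram(weights, bin_width=1.0):
    """Count composite weights per bin, keyed by the lower edge of each bin."""
    counts = Counter(math.floor(w / bin_width) * bin_width for w in weights)
    return dict(sorted(counts.items()))


def print_histogram(hist):
    """Print a crude text histogram, one row per non-empty bin."""
    for lower_edge, count in hist.items():
        print(f"{lower_edge:>8.1f} {'#' * count}")


# Hypothetical composite weights from one pass.
weights = [2.3, 2.9, 3.1, 20.4, 21.07, 21.9, 27.5, -4.2]
hist = weight_histogram(weights)
print_histogram(hist)
```

Several separated clusters of non-empty bins in the output would correspond to the multimodal shape described above.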

As the ideal situation above is not often encountered in practice (although such an ideal distribution has been noted for the Student Loan Data Integration Project in Statistics NZ), it is good to produce a file of linked records for examination. The file can be sorted by weight in descending order. The record pairs with high composite weights represent (relatively) good links. As the weight value lowers, the links become dubious. The sorted record pairs are examined for increasing patterns of field disagreements as the weights decrease, to determine an appropriate cut-off level for the pass. Of course this is easier said than done, but as an analyst gains experience and familiarity with the data undergoing integration, a certain level of confidence is gained in setting the cut-off scores.
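The review file described above might be prepared as follows. This is a sketch under stated assumptions: each linked pair is represented as a dictionary with a `weight` and a per-field `agreement` flag, and the field names and values are invented for the example.

```python
def review_listing(pairs):
    """Sort linked pairs by weight, highest first, and list disagreeing fields.

    Each pair is assumed to be a dict with a 'weight' and an 'agreement'
    mapping of field name -> True/False.
    """
    ordered = sorted(pairs, key=lambda p: p["weight"], reverse=True)
    rows = []
    for pair in ordered:
        disagreeing = sorted(
            field for field, agrees in pair["agreement"].items() if not agrees
        )
        rows.append((pair["weight"], len(disagreeing), disagreeing))
    return rows


# Hypothetical linked pairs.
pairs = [
    {"weight": 12.4,
     "agreement": {"surname": True, "first_name": False, "date_of_birth": True}},
    {"weight": 30.2,
     "agreement": {"surname": True, "first_name": True, "date_of_birth": True}},
    {"weight": 4.1,
     "agreement": {"surname": False, "first_name": True, "date_of_birth": False}},
]
rows = review_listing(pairs)
for weight, n_disagreements, fields in rows:
    print(weight, n_disagreements, fields)
```

Reading down the listing, the analyst looks for the weight at which the number of disagreeing fields starts to grow. The choice of cut-off remains a judgement that the code does not make.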

A sample actual histogram of weights under non-ideal conditions is shown below. After a visual assessment of the file of linked records, the cut-off score of 21.07 was set for this pass. Note the multiple peaks and the not-so-distinct trailing frequencies near the chosen cut-off.

[Figure: Weight Distribution Histogram. Horizontal axis: Weight (0 to 32). Vertical axis: Freq (0 to 300). The chosen cut-off is marked on the weight axis.]

A side-effect of adjusting the cut-off threshold is the possibility of creating duplicate pairs.

Record pairs whose composite weights fall below the cut-off become residuals and are eligible for linking in the next pass. However, one record in file A may form a pair with a weight above the chosen threshold for more than one record in file B.

Depending on the nature of the integration exercise, these may be treated as genuine duplicates, possibly for further review, and not available in other passes. If the matching is treated strictly as one to one, then the record pair with the highest weight is taken as the link, and the rest become eligible for linking in the next pass. In the case where record pairs have the same weight, one might be chosen at random as the link. Data integration software may be equipped with options for handling cases where duplicates with the same or different weights exist. Carrying out deduplication simplifies the subsequent linking by generating confidence that no genuine duplicates exist.
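The strict one-to-one treatment described above can be sketched as follows. The function name `resolve_one_to_one` and the tuple layout are assumptions. The sketch only resolves the case named in the text, where one record in file A links to several records in file B; the mirror case would need the same treatment keyed on the file B identifier. The seeded random choice among equal weights is one possible tie-breaking rule, not a recommendation.

```python
import random
from collections import defaultdict


def resolve_one_to_one(links, seed=0):
    """Keep the highest-weight link per file A record; release the rest.

    links -- list of (id_a, id_b, composite_weight) tuples above the cut-off
    Returns (kept, released); released pairs go forward to the next pass.
    """
    rng = random.Random(seed)  # seeded so the tie-break is reproducible
    by_a = defaultdict(list)
    for link in links:
        by_a[link[0]].append(link)

    kept, released = [], []
    for candidates in by_a.values():
        top_weight = max(weight for _, _, weight in candidates)
        best = [c for c in candidates if c[2] == top_weight]
        chosen = rng.choice(best)  # random pick only matters when weights tie
        kept.append(chosen)
        released.extend(c for c in candidates if c is not chosen)
    return kept, released


# Hypothetical links above the cut-off: a1 pairs with both b1 and b2.
links = [("a1", "b1", 25.0), ("a1", "b2", 22.0), ("a2", "b3", 23.0)]
kept, released = resolve_one_to_one(links)
```

If the duplicates are instead to be treated as genuine and sent for review, the `released` list would be set aside rather than returned to the next pass.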

6.4.2 False positives, false negatives and match rates

False positives are record pairs that are deemed to be links but which are actually true non-matches. False negatives are true matches which remain unlinked.

Generally, there is no good method for automatic estimation of error rates, so false positive rates have been estimated by manually checking samples of linked records. In large datasets, analysis of false positives can be time-consuming work and it is often useful to group the linked data prior to selecting a sample.

For example, in the Student Loan Data Integration Project, the passes constitute groups from which samples for false positive analysis were drawn. Alternatively, new groups different from the groups induced by the passes can be constructed for sampling purposes.

In the Injury Statistics Project, for example, the linked Accident Compensation Corporation (ACC) and New Zealand Health Information Services (NZHIS) records may fall in any one of the sample groups below:

Group 1: linked on injury date, National Health Index (NHI) number, first name and surname, all the same

Group 2: linked on the same injury date and date of birth, plus the same NHI number if present on both records

Group 3: linked on injury date, first name, surname and date of birth.

Samples from each of the groups can be selected and analysed for false positives.
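Drawing a sample from each group and turning the clerical review results into a rate could look like the sketch below. The function names, the `group` key on each linked record and the sample size of five are assumptions for illustration. The review outcomes themselves come from a human reviewer and are not produced by the code.

```python
import random


def sample_by_group(linked, group_of, n_per_group, seed=0):
    """Draw a simple random sample of linked records from each group."""
    rng = random.Random(seed)  # seeded so the sample can be reproduced
    groups = {}
    for record in linked:
        groups.setdefault(group_of(record), []).append(record)
    return {
        group: rng.sample(records, min(n_per_group, len(records)))
        for group, records in groups.items()
    }


def false_positive_rate(review_outcomes):
    """Share of reviewed links judged false.

    review_outcomes -- list of booleans, True where the reviewer decided
                       the link was a false positive
    """
    if not review_outcomes:
        raise ValueError("no reviewed links supplied")
    return sum(review_outcomes) / len(review_outcomes)


# Hypothetical linked records, each already assigned to a sample group.
linked = [{"id": i, "group": 1 + i % 3} for i in range(30)]
samples = sample_by_group(linked, lambda r: r["group"], n_per_group=5)

# Hypothetical clerical review result for one group's sample.
rate = false_positive_rate([True, False, False, False])
```

The rate is computed per group, so groups linked on weaker criteria can be examined separately from those linked on stronger ones.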

The clerical review of these samples is done by visually comparing the records, and while this method is able to draw upon subject-matter knowledge and other information, it still involves the subjective view of the reviewer. If it is understood where errors are most likely to occur in the datasets, it may be necessary to target the sample to these areas with a view to improving the quality of the match. Several iterations of clerical review and adjustment of match criteria may be necessary before a linked dataset is confirmed and final false positive error rates calculated.

If at least one of the files is expected to match completely and the false positive rate is low, then the false negative rate may be calculated simply as one minus the match rate (where the match rate for a given file is the number of matched records over total records). However, in other situations, such as when the integrated dataset is the union of two files, expected matches are unknown and the false negative rate is difficult to estimate.
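The calculation for the simple case can be written out as below. The record counts are invented for illustration, and the result is only meaningful under the conditions stated above: the file is expected to match completely and the false positive rate is low.

```python
def match_rate(matched_records, total_records):
    """Number of matched records over total records for a given file."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return matched_records / total_records


def false_negative_rate(matched_records, total_records):
    """One minus the match rate, valid only when every record in the file
    is expected to match and false positives are rare."""
    return 1 - match_rate(matched_records, total_records)


# Hypothetical file of 10,000 records, of which 9,400 were linked.
rate = match_rate(9400, 10000)
fn_rate = false_negative_rate(9400, 10000)
```

With these invented counts the match rate is 0.94 and the false negative rate is about 0.06.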

6.4.3 Measurement error in integration

Measurement error affects inference – it can lead to bias in estimation, which can be severe.

Best-practice procedures in data analysis examine the data being used for measurement errors and incorporate known measurement error properties into the analysis (Chesher and Nesheim, 2004).³⁸

The measurement error processes that arise when there is probabilistic record linkage are complex and non-standard. Chesher and Nesheim list causes of measurement error in data linking, including:

• units incorrectly linked, so that data from one unit is incorrectly associated with another unit (also known as false positive links)

• in many-to-one linking, statistics computed using only a few sub-units are used to measure characteristics of all sub-units

• in many-to-one linking, characteristics of sub-units inferred from features of major units (and vice versa).

Chesher and Nesheim go on to say that, from a practical perspective, measurement error is inevitable and since the potential effects are so damaging, one should avoid using data-linking procedures which are likely to generate large amounts of measurement error.

The first step in estimating the quality of linked datasets is often the estimation of rates of false positives and false negatives. In record linkage projects carried out by Statistics NZ to date, quality measurement has focused on these two dimensions of quality, with the aim of minimising false positive links.

38 Chesher A and Nesheim L (2004). “Review of the Literature on the Statistical Properties of Linked Datasets”, Report to the Department of Trade and Industry, United Kingdom.
