The potential use of COMPASS as a hypothesis-generating tool for identifying drug safety issues is analogous to signal detection, so the measures of performance used for diagnostic and screening tests are well suited to studying it. The aim is to predict a binary classification of drug-condition status (there is, or is not, a causal relationship between exposure and outcome). The method's prediction is a continuous-valued score, but it can be dichotomized at a defined threshold. In this context, the test cases can be categorized into the following 2x2 contingency table (Figure 17), and various measures of performance can be estimated.
Figure 17: Performance measures for 2x2 contingency table
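As an illustration of the dichotomization described above, the following minimal sketch fills the four cells of a 2x2 table at a chosen threshold and derives three common measures from them. The function name, the convention that a score strictly above the threshold counts as a flagged pair, and the example data are all assumptions made for illustration; the exact set of measures listed in Figure 17 is not reproduced here.

```python
def confusion_measures(scores, labels, threshold):
    """Dichotomize scores at `threshold` and summarize the 2x2 table.

    labels: 1 for a positive control, 0 for a negative control.
    Assumption: a pair is flagged when its score is strictly above the threshold.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > threshold and y == 1)
    fp = sum(1 for s, y in pairs if s > threshold and y == 0)
    fn = sum(1 for s, y in pairs if s <= threshold and y == 1)
    tn = sum(1 for s, y in pairs if s <= threshold and y == 0)

    def ratio(num, den):
        # Return NaN rather than raising when a margin of the table is empty.
        return num / den if den else float("nan")

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # 1 - false positive rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
    }


# Hypothetical scores and labels, for illustration only.
scores = [2.1, 0.4, -0.3, 1.2, -1.0, 0.0]
labels = [1, 1, 0, 0, 0, 1]
result = confusion_measures(scores, labels, threshold=0.0)
```

Changing the threshold moves pairs between the flagged and unflagged rows of the table, which is why each threshold-based measure has to be reported together with the threshold used.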
Measures of accuracy that are not tied to a particular dichotomization of the method score can also be applied within the experiment. In addition to studying COMPASS performance at logical thresholds, such as ARDLB>0, the performance of COMPASS was characterized through multiple measures of accuracy, including mean average precision, precision-at-k, and area under the receiver operating characteristic (ROC) curve.
‘Mean average precision’ (MAP) can be thought of as the average of the precision values at each threshold that corresponds to a ‘true positive’ association. MAP is effectively equivalent to the area under the precision-recall curve. MAP can be formally defined as follows. Let $y_{dc} = 1$ if the $d$th drug is associated with the $c$th condition (‘positive control’) and zero otherwise, $d = 1, \ldots, D$, $c = 1, \ldots, C$. Let $M = \sum_{d,c} y_{dc}$ denote the number of causal combinations and $N = D \times C$ the total number of combinations. Let $z_{dc}$ denote the predicted value for the $d$th drug and the $c$th condition. For a given set of predicted values $\vec{z} = (z_{11}, \ldots, z_{DC})$, we define $P^{(K)}(\vec{z})$, the precision at $K$, as the fraction of causal combinations among the $K$ largest predicted values in $\vec{z}$. Specifically, let $z_{(1)} > \cdots > z_{(N)}$ denote the ordered values of $\vec{z}$. Then:

$$P^{(K)}(\vec{z}) = \frac{1}{K} \sum_{i=1}^{K} y_{(i)},$$

where $y_{(i)}$ is the true status of the combination corresponding to $z_{(i)}$. “Mean Average Precision” is then defined as:

$$S(\vec{z}) = \frac{1}{M} \sum_{K \,:\, y_{(K)} = 1} P^{(K)}(\vec{z}).$$

Unscored conditions are treated as if they produced a minimum score, such that methods receive the maximum penalty for not classifying ‘positive controls’.
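The MAP computation described above can be sketched as follows: rank the pairs by descending score, record the precision at every rank that holds a positive control, and average those values over the number of positive controls. The function name and example data are hypothetical. Two conventions here are assumptions rather than something the text specifies: unscored pairs are represented as `None` and assigned a value below every observed score, and ties are broken pessimistically (negatives ranked ahead of positives) so that unscored positive controls incur the largest penalty.

```python
def mean_average_precision(scores, labels):
    """MAP over drug-condition pairs.

    scores: method scores; None marks an unscored pair (assumed convention).
    labels: 1 for a positive control, 0 otherwise.
    """
    observed = [s for s in scores if s is not None]
    # Unscored pairs get a value strictly below every observed score.
    floor = (min(observed) if observed else 0.0) - 1.0
    filled = [floor if s is None else s for s in scores]

    # Descending score; within ties, negatives (label 0) come first.
    ranked = sorted(zip(filled, labels), key=lambda p: (-p[0], p[1]))

    n_positive = sum(labels)
    if n_positive == 0:
        return float("nan")

    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k  # precision at the rank of this positive control
    return total / n_positive


# Hypothetical example: the fourth pair is an unscored positive control.
scores = [0.9, 0.8, 0.7, None, 0.1]
labels = [1, 0, 1, 1, 0]
map_value = mean_average_precision(scores, labels)
```

In this example the positive controls land at ranks 1, 3 and 5, so the result is the mean of 1/1, 2/3 and 3/5.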
‘Precision-at-k’ (P@k) is commonly used in information retrieval, and reflects the proportion of correctly classified objects at a defined cutoff (k) among an ordered set. In a drug safety context with k=100, P@k can be interpreted as: ‘among the top 100 estimates produced by the method, what proportion of the drug-condition pairs reflect positive controls’.
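A minimal sketch of P@k, assuming scores where larger values indicate stronger evidence of an association; the function name and the example data are hypothetical.

```python
def precision_at_k(scores, labels, k):
    """Proportion of positive controls among the k highest-scoring pairs."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of scored pairs")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Hypothetical scores and labels (1 = positive control, 0 = negative control).
scores = [3.2, 2.5, 1.9, 1.1, 0.4, -0.2]
labels = [1, 0, 1, 1, 0, 0]
p_at_3 = precision_at_k(scores, labels, k=3)
p_at_4 = precision_at_k(scores, labels, k=4)
```

The value depends on the choice of k, so comparisons between methods are only meaningful at the same cutoff.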
An additional tool for assessing accuracy is the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (sensitivity) against the false positive rate (1-specificity) as the threshold varies. The area under the ROC curve (AUC) provides a scalar measure of performance across all potential thresholds.
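One way to compute AUC without tracing the curve uses a standard equivalence: the area equals the probability that a randomly chosen positive control receives a higher score than a randomly chosen negative control, with ties counted as one half. The sketch below uses that pairwise formulation for clarity rather than speed; the function name and example data are hypothetical, and nothing here is meant to describe how the original analysis was implemented.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        return float("nan")  # undefined without both classes

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(positives) * len(negatives))


# Hypothetical scores and labels (1 = positive control, 0 = negative control).
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

Here five of the six positive-negative pairs are ordered correctly. A value of 0.5 corresponds to a method that ranks no better than chance.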
Finally, we define ‘recall-at-FP’ (R@fp) as the sensitivity obtained at a defined tolerance for the false positive rate. So, for example, setting FP=5%, R@fp can be interpreted as: ‘what proportion of true positives can a method identify before 5% of negative controls would also be identified’.
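R@fp can be sketched by walking down the ranked list and stopping once the share of negative controls flagged would exceed the tolerance. The function name and data below are hypothetical. Two details are assumptions, since the text does not settle them: the tolerance is treated as inclusive (a false positive rate equal to the tolerance is allowed), and ties are broken pessimistically by ranking negatives ahead of positives.

```python
def recall_at_fp(scores, labels, max_fpr):
    """Highest sensitivity reachable while the false positive rate stays
    at or below `max_fpr` (inclusive tolerance is an assumption)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        return float("nan")

    # Descending score; within ties, negatives come first (pessimistic).
    ranked = sorted(zip(scores, labels), key=lambda p: (-p[0], p[1]))

    tp = 0
    fp = 0
    best = 0.0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
            if fp / n_neg > max_fpr:
                break  # tolerance exceeded; stop lowering the threshold
        best = tp / n_pos
    return best


# Hypothetical scores and labels (1 = positive control, 0 = negative control).
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
r_at_20 = recall_at_fp(scores, labels, max_fpr=0.2)
r_at_0 = recall_at_fp(scores, labels, max_fpr=0.0)
```

With six negative controls, a 20% tolerance admits one false positive, which lets the method reach three of the four positive controls; with zero tolerance only the two positives ranked above the first negative are counted.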
Mean average precision, precision-at-k, area under curve, and recall-at-fp all provide scalar measures of performance, but each reflects a complementary component for interpretation. None is sufficient on its own, since each has inherent limitations. Precision-at-k and recall-at-fp are inherently threshold-based, insofar as a subjective choice of k and fp is required. In contrast, MAP and AUC are threshold-independent, but they provide a composite score that may reflect boundary conditions of little practical use. For example, AUC integrates over all levels of specificity, including high false positive rates that would likely be unacceptable in a drug safety context. Similarly, MAP integrates over all levels of recall, even though it may be unrealistic to expect a given method to identify all adverse events with high precision; a focus on more modest levels of detection may be more appropriate. A method that produces higher performance scores across all summary measures can be considered to have superior aggregate performance. However, it is possible for one method to score higher on some summary measures and lower on others.
Moreover, summary performance measures do not reflect expectations for performance for any specific adverse event, as each condition can have different attributes (such as background prevalence, time-to-onset, strength of association, and degree of confounding) that could alter a method’s behavior for that relationship. For each drug-condition pair, a method produces a score, but performance for that pair cannot be measured without placing the score in the context of the scores the method produces for other drug-condition pairs. As such, for each event, it is possible to measure precision and false positive rate at the score produced, by treating the event score as the threshold for dichotomizing scores, as shown in Figure 17. Event-based performance measures were provided to explore differential method performance across the positive controls.
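The event-based idea can be sketched by taking each positive control in turn, using its score as the threshold, and computing precision and false positive rate over all pairs at that threshold. The function name, the output field names and the example data are hypothetical. Treating a pair as flagged when its score is greater than or equal to the event's score (so that the event itself counts as flagged) is an assumption, not something the text specifies.

```python
def event_based_measures(scores, labels):
    """For each positive control, dichotomize all scores at that event's
    score and report precision and false positive rate."""
    n_neg = sum(1 for y in labels if y == 0)
    results = []
    for i, (event_score, y_i) in enumerate(zip(scores, labels)):
        if y_i != 1:
            continue  # only positive controls serve as events
        # Assumption: flagged means score >= the event's own score.
        flagged = [y for s, y in zip(scores, labels) if s >= event_score]
        tp = sum(flagged)
        fp = len(flagged) - tp
        results.append({
            "index": i,
            "threshold": event_score,
            "precision": tp / (tp + fp),  # tp >= 1, the event flags itself
            "false_positive_rate": fp / n_neg if n_neg else float("nan"),
        })
    return results


# Hypothetical scores and labels (1 = positive control, 0 = negative control).
scores = [2.0, 1.5, 1.0, 0.5, 0.2]
labels = [1, 0, 1, 0, 0]
per_event = event_based_measures(scores, labels)
```

In this example the first positive control is ranked above every negative control, while reaching the second one requires accepting one of the three negative controls as a false positive.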