
Logistic regression is used when the response variable $y_i$ is binary. In this setting, $y_i$ is treated as a realization of a random variable $Y_i$ with a Bernoulli distribution, whose probability mass function is

$$\Pr(Y_i = y_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$$

for $y_i \in \{0, 1\}$. The expected value and variance of $Y_i$ are $E[Y_i] = \mu_i = \pi_i$ and $\mathrm{var}[Y_i] = \sigma_i^2 = \pi_i(1 - \pi_i)$. Because both the mean and the variance of $Y_i$ depend on the probability $\pi_i$, an ordinary linear model, which assumes constant variance, is not adequate. A generalized linear model (GLM) is required instead.
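As a quick numerical check of the moments stated above, the following minimal sketch computes the mean and variance directly from the Bernoulli probability mass function. The function names `bernoulli_pmf` and `bernoulli_moments` are illustrative and do not come from the source.

```python
def bernoulli_pmf(y, pi):
    """Pr(Y = y) = pi**y * (1 - pi)**(1 - y) for y in {0, 1}."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return pi ** y * (1 - pi) ** (1 - y)


def bernoulli_moments(pi):
    """Mean and variance obtained by summing over the two outcomes."""
    mean = sum(y * bernoulli_pmf(y, pi) for y in (0, 1))
    var = sum((y - mean) ** 2 * bernoulli_pmf(y, pi) for y in (0, 1))
    return mean, var


# The variance changes with pi, which is why constant variance fails here.
for pi in (0.1, 0.5, 0.9):
    mean, var = bernoulli_moments(pi)
    print(f"pi={pi}: mean={mean:.3f}, var={var:.3f}, pi*(1-pi)={pi * (1 - pi):.3f}")
```

The variance is largest at $\pi_i = 0.5$ and shrinks toward zero as $\pi_i$ approaches 0 or 1.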

The logit transformation. To set up the logistic regression model systematically, a relationship is required between the probabilities $\pi_i$ and the observed covariates $x_i$. The relationship $\pi_i = x_i \beta$ is not adequate, because $\pi_i$ is restricted to the interval $[0, 1]$ whereas the linear predictor on the right-hand side can take any real value. A transformation of the probabilities removes this range restriction. In particular,

$$\eta_i = \operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1 - \pi_i}$$

The logit is a one-to-one transformation that maps probabilities in $(0, 1)$ to $\mathbb{R}$. In GLM terminology, the logit is the link function. The inverse link function maps the linear predictor back to a probability:

$$\pi_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}$$

Figure 5.1: The logit function (axes: logit, from −5 to 5, and probability, from 0 to 1).
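The link and its inverse can be sketched in a few lines. This is a minimal illustration using only the standard library; the names `logit` and `inv_logit` are chosen here, and the branch on the sign of `eta` is an implementation choice to avoid overflow in `math.exp`, not something taken from the source.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the real line: log(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Map a real number back to (0, 1): exp(eta) / (1 + exp(eta))."""
    if eta >= 0:
        # Algebraically the same expression, written to avoid a large exp().
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# Round trip over the range shown in the figure.
for eta in (-5.0, -2.5, 0.0, 2.5, 5.0):
    p = inv_logit(eta)
    print(f"eta={eta:+.1f} -> p={p:.4f} -> logit(p)={logit(p):+.4f}")
```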

Specification of the model. The structure of a logistic regression model is defined by the random variables $Y_i \sim B(n_i, \pi_i)$, where $i = 1, \dots, k$ indexes the distinct observations. Formally, the logistic regression model is

$$\log\frac{\pi_i}{1 - \pi_i} = \beta_0 + x_i \beta$$

Estimation methods are applied to obtain the coefficients $\beta_0$ and $\beta$, and the inverse link function then yields the fitted probabilities. Suppose an observation is classified as $Y_i = 1$ (positive) when $\pi_i \geq 0.5$ and as $Y_i = 0$ (negative) when $\pi_i < 0.5$. Since $\pi_i \geq 0.5$ exactly when the linear predictor satisfies $\eta_i \geq 0$, the logistic regression model is equivalent to a linear classifier [18]. In general, the decision boundary separating the positive and negative classes is given by the solution to $\beta_0 + x\beta = 0$. Those familiar with linear algebra will notice that the decision boundary is a point if $x$ is one-dimensional, a line if $x$ is two-dimensional, a plane if $x$ is three-dimensional, and so on.
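The equivalence between thresholding the probability at 0.5 and checking the sign of the linear predictor can be illustrated as follows. This is only a sketch: the coefficients `BETA0` and `BETA` are made-up values for a two-dimensional covariate, not estimates from any data, and the function names are hypothetical.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
BETA0 = -1.0
BETA = [2.0, -0.5]


def linear_predictor(x, beta0=BETA0, beta=BETA):
    """eta = beta0 + x . beta"""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def probability(x):
    """Inverse logit of the linear predictor, written to avoid overflow."""
    eta = linear_predictor(x)
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def classify_by_probability(x):
    return 1 if probability(x) >= 0.5 else 0


def classify_by_sign(x):
    # Same rule expressed on the linear scale: positive side of the boundary.
    return 1 if linear_predictor(x) >= 0 else 0


points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0), (2.0, 8.0)]
for x in points:
    print(x, linear_predictor(x), classify_by_probability(x), classify_by_sign(x))
```

With these coefficients the boundary is the line $-1 + 2x_1 - 0.5x_2 = 0$. In exact arithmetic the two rules always coincide. In floating point they can differ for values of the linear predictor extremely close to zero.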

6 A summary of Receiver Operating Characteristics

6.1 Introduction

In the interest of keeping this thesis self-contained, this chapter provides a brief exposition of methods for assessing the performance of binary classifiers. The Receiver Operating Characteristic (ROC) curve is one of the best developed statistical tools for evaluating binary classifiers. ROC curves have gained tremendous popularity since their development by World War II engineers for signal detection theory. Their use has quickly expanded into many other fields, including biosciences, psychology, finance and sociology. In particular, they are widely applied in medicine to evaluate how well diagnostic tests discriminate diseased from normal cases [47]. For instance, radioactive imaging is a common diagnostic test in which the test results are real numbers. A higher (or lower) continuous value of the test indicates the presence (or absence) of a disease. ROC analysis evaluates the discriminatory ability of the radioactive imaging diagnostic test, assuming the true status of the disease is known. In general, ROC analysis returns a measure of the discriminatory ability of any continuous, two-group classifier (true/false, yes/no, positive/negative, diseased/non-diseased), as long as the true status of the cases is known by an independent means of testing. This chapter provides an overview of some inference and estimation methods for constructing ROC curves and their associated summary measures.

The object of interest in ROC analysis is the so-called ROC curve, which is a graphical representation of the relationship between the false positive rate and the true positive rate of a classifier. The true positive rate, also known as the sensitivity Se, of a classifier is the probability that a TRUE object is correctly classified by the model. Similarly, the specificity Sp is the probability that a FALSE object is correctly rejected by the model, and the false positive rate is therefore 1 − Sp. In the context of medical diagnostics, Se represents the probability that a truly diseased individual has a positive test result and Sp is the probability that a truly non-diseased individual has a negative result. The ROC curve characterizes Se as a function of 1 − Sp. In other words, the ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) for various threshold values. The sensitivity and specificity allow us to analyze the classifier rigorously through the conditional probabilities of being assigned to a particular class given the true classification. In statistical terms, these curves display the trade-off between the power and size of the test with rejection regions $X > \theta$ as the threshold $\theta$ is varied.
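To make the construction concrete, the following sketch computes empirical (FPR, TPR) pairs by sweeping the threshold over the observed scores. The scores and labels are made-up illustrative data, the function name `roc_points` is hypothetical, and the rule "positive if score > threshold" is an assumed convention that matches the rejection region described above.

```python
def roc_points(scores, labels):
    """Empirical ROC points (FPR, TPR), one per threshold.

    A case is called positive when its score is strictly greater than the
    threshold. Thresholds run from the largest observed score down to
    minus infinity, so the points go from (0, 0) to (1, 1).
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        points.append((fp / n_neg, tp / n_pos))  # (1 - Sp, Se)
    return points


# Made-up test results: label 1 = diseased, 0 = non-diseased.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add positive calls, so both rates are non-decreasing along the sweep.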

The area under the ROC curve (AUC) has been established as a fundamental summary measure of a classifier's accuracy. The AUC is interpreted as the probability of correctly ordering a randomly selected pair of TRUE and FALSE objects. More intuitively, given a randomly selected pair of non-diseased and diseased individuals, it is the probability that the classifier assigns a higher score to the diseased subject. AUC values close to 1 suggest an almost perfect classifier. On the other hand, values close to 0.5 suggest an essentially useless classifier. In other words, an area of 0.5 means that the diagnostic test ranks the diseased subject above the non-diseased subject in only 50% of such pairs, which is essentially no better than flipping a coin.
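The pairwise interpretation of the AUC can be computed directly by comparing every diseased score with every non-diseased score. The sketch below uses made-up scores and a hypothetical function name `auc_pairwise`. Counting a tied pair as one half is a common convention that is assumed here rather than taken from the source.

```python
def auc_pairwise(non_diseased, diseased):
    """Fraction of (non-diseased, diseased) pairs ranked correctly.

    A pair counts 1 if the diseased score is higher, 0.5 if the two scores
    are tied (assumed convention), and 0 otherwise.
    """
    if not non_diseased or not diseased:
        raise ValueError("both groups must be non-empty")
    total = 0.0
    for x in non_diseased:
        for y in diseased:
            if y > x:
                total += 1.0
            elif y == x:
                total += 0.5
    return total / (len(non_diseased) * len(diseased))


# Made-up scores for the two groups.
x_scores = [0.1, 0.4]   # non-diseased
y_scores = [0.35, 0.8]  # diseased
print(auc_pairwise(x_scores, y_scores))
```

For these four cases, three of the four pairs are ordered correctly, so the estimate is 0.75.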

For the rest of this chapter, we present a few theoretical results of ROC analysis and describe methods for creating ROC curves. The terminology used for the rest of the chapter is that of a medical test. To this end, the binary classifier is some diagnostic test which returns a continuous result. The populations are continuous random variables grouped into non-diseased ($X$) and diseased ($Y$), with sizes $n_N$ and $n_D$, respectively.
