We wish to test the association between a given SNP and the secondary phenotype.

Specifically, our null and alternative hypotheses can be represented as:

$H_0$: There is no association between $X_j$ and $Y$.

$H_a$: There is an association between $X_j$ and $Y$.

The motivation for the proposed method is the following: if a given SNP is associated with a secondary phenotype, we would expect the vector of minor allele counts for the SNP, $X_j$, to be highly correlated with the secondary phenotype, $Y$. If $Y$ were permuted to form a new vector, $Y^*$, this association would be eliminated. We would therefore expect the (absolute) correlation between $X_j$ and $Y$ to be greater than the correlation between $X_j$ and $Y^*$. Thus, we may test the null hypothesis by comparing the correlation between $X_j$ and $Y$ to the correlations between $X_j$ and a series of $Y^*$s, where each $Y^*$ is a permutation of $Y$.

The procedure can be summarized as follows:

1. To account for the over-representation of disease cases in the data, assign a weight of $w_0 = 1$ to controls and a weight of $w_1 = \frac{p \, n_0}{(1 - p) \, n_1}$ to the cases.

2. Calculate the weighted correlation between $X_j$ and $Y$. Denote this correlation as $R_j$.

3. Generate $B$ permutations of $Y$, denoted $Y^*_1, \ldots, Y^*_B$.

4. For each $Y^*_b$, $b = 1, \ldots, B$, calculate the weighted correlation between each $X_k$, $k = 1, \ldots, N$, and $Y^*_b$. Denote this correlation as $R_{kb}$.

5. The p-value for the test of the null hypothesis is then given by

\[
p_j = \frac{1}{NB} \sum_{k=1}^{N} \sum_{b=1}^{B} I\left(|R_{kb}| \ge |R_j|\right) \tag{4.38}
\]
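The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the author's implementation: the function names `weighted_corr` and `pooled_permutation_pvalues` are hypothetical, `X` is assumed to be an n-by-N matrix of minor allele counts with no constant columns, and the weights are assumed to stay attached to individuals while only `y` is shuffled (the text does not say how the weights are handled under permutation).

```python
import numpy as np

def weighted_corr(X, y, w):
    """Weighted Pearson correlation of every column of X with y."""
    w = w / w.sum()                 # normalize weights to sum to 1
    Xc = X - w @ X                  # subtract weighted column means
    yc = y - w @ y                  # subtract weighted mean of y
    cov = (w * yc) @ Xc             # weighted covariances, shape (N,)
    var_x = w @ (Xc ** 2)           # weighted variances of the columns
    var_y = w @ (yc ** 2)           # weighted variance of y
    return cov / np.sqrt(var_x * var_y)

def pooled_permutation_pvalues(X, y, w, B=25, seed=0):
    """p_j as in (4.38): share of pooled |R_kb| that are >= |R_j|."""
    rng = np.random.default_rng(seed)
    R = weighted_corr(X, y, w)      # observed R_j for all N SNPs
    # Assumption: weights stay with individuals; only y is permuted.
    null = np.concatenate([
        np.abs(weighted_corr(X, rng.permutation(y), w)) for _ in range(B)
    ])                              # pooled null, N * B values
    null.sort()
    # Count pooled null values that are >= |R_j| for each SNP j.
    n_ge = null.size - np.searchsorted(null, np.abs(R), side="left")
    return n_ge / null.size
```

Sorting the pooled null once and using a binary search keeps the comparison step at roughly $NB \log(NB)$ operations instead of comparing every $R_j$ against every $R_{kb}$.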

This procedure is valid even if the secondary phenotype $Y$ is dichotomous, since the squared correlation between $X_j$ and $Y$ is proportional to the Armitage trend $\chi^2$ statistic in this case (Price et al., 2006). Note that this test assumes that the distribution of the $R_{kb}$'s does not depend on $k$. In practice, this assumption is unlikely to be perfectly satisfied. However, it yields enormous computational savings, because the null correlations can be pooled across all $N$ SNPs. Under this assumption, $B = 25$ provides a sufficient number of permutations to demonstrate that a SNP is associated with a secondary phenotype at a Bonferroni-corrected threshold for genome-wide significance. Without this assumption, millions of permutations would be required for each SNP, which is computationally intractable.
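A rough check of the $B = 25$ claim, assuming the Bonferroni threshold is $\alpha / N$ with $\alpha = 0.05$ (the value of $\alpha$ is not stated in the text and is an assumption here): with a pooled null of $NB$ correlations, the smallest nonzero p-value the test can produce is $1/(NB)$, so

\[
\frac{1}{NB} \le \frac{\alpha}{N} \iff B \ge \frac{1}{\alpha} = 20 .
\]

With $B = 25$, $1/(NB) = 0.04/N < 0.05/N$, so a SNP can reach the threshold when at most one pooled null correlation is as large as $|R_j|$. By contrast, a separate null for each SNP has resolution $1/B$ and would need $B \ge N/\alpha$; for an illustrative $N = 10^6$ SNPs this is $2 \times 10^7$ permutations per SNP.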

Now suppose one wishes to evaluate the association between an allele count $X_j$ and a secondary phenotype $Y$ after controlling for covariates $Z = Z_1, \ldots, Z_K$, such as demographic covariates or eigenvectors corresponding to race or ancestry. The above procedure can be modified as follows:

1. Perform a weighted regression of $X_j$ on $Z$ and find the vector $X_j'$ of residuals from the resulting model.

2. Similarly, perform a weighted regression of $Y$ on $Z$ and find the vector $Y'$ of residuals from the resulting model.

3. Apply the permutation test procedure above using $X_j'$ and $Y'$ in place of $X_j$ and $Y$.

Note that this procedure requires one to regress $X_j$ on the covariates for each SNP. Thus, a naïve application of this procedure would require the computation of $N$ regression models, which would be computationally expensive. The required computing time can be significantly reduced, however, by noting that each regression model has exactly the same covariates. The only difference between these regression models is the outcome variable $X_j$. Suppose we are performing a weighted regression of $X_j$ on $Z$. Let $W$ be a diagonal matrix of the weights. Then the regression coefficients are given by $(Z^T W Z)^{-1} Z^T W X_j$, and the estimated values of $X_j$ are therefore given by $Z (Z^T W Z)^{-1} Z^T W X_j$. Let $H = Z (Z^T W Z)^{-1} Z^T W$. (This $H$ is commonly known as the hat matrix.) Note that $H$ does not depend on $X_j$. Thus, by calculating and storing the hat matrix $H$, one may calculate the residuals of the regression model to predict $X_j$ based on $Z$ by calculating $X_j - H X_j$, which requires only a single matrix multiplication rather than recomputing the entire regression model for each SNP. This approach is likely to substantially reduce the computation time needed for the procedure.
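The hat-matrix shortcut can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions rather than the author's code: the function name `weighted_residuals` is hypothetical, `Z` is assumed to already contain an intercept column if one is wanted, `w` holds the diagonal of $W$, and no values are missing.

```python
import numpy as np

def weighted_residuals(X, Z, w):
    """Residuals X_j - H X_j for every column X_j of X at once.

    X : (n, N) outcomes, one column per SNP
    Z : (n, K) covariates shared by all regressions
    w : (n,)  regression weights (the diagonal of W)
    """
    ZtW = Z.T * w                           # Z^T W without building the diagonal matrix
    H = Z @ np.linalg.solve(ZtW @ Z, ZtW)   # hat matrix Z (Z^T W Z)^{-1} Z^T W
    return X - H @ X                        # one matrix product covers all SNPs
```

Because $H$ is n-by-n, storing it may be costly for large samples; computing `X - Z @ np.linalg.solve(ZtW @ Z, ZtW @ X)` gives the same residuals without forming $H$ explicitly.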

Now suppose that some minor allele counts are missing at random for some of the individuals in the study. We will show how using the Cholesky decomposition of $Z^T W Z$ can speed the computation of $H$. Note that we can determine $L$, the lower triangular matrix with real and positive diagonal entries which solves the expression $Z^T W Z = L^T L$. Then we have that $(Z^T W Z)^{-1} = L^{-1} (L^T)^{-1}$. When we have missingness in $X_j$, we will need to recompute $H$. Let $Z^*$ represent the matrix of covariates with the individuals who have missing values of $X_j$ removed, $W^*$ represent the matrix of weights with the individuals who have missing values of $X_j$ removed, $z$ represent the matrix of covariates for individuals who have missing values of $X_j$, and $w$ represent the matrix of weights for individuals who have missing values of $X_j$. Then $H = Z^* (Z^{*T} W^* Z^*)^{-1} Z^{*T} W^* = Z^* (Z^T W Z - z^T w z)^{-1} Z^{*T} W^*$. We can compute the down-dated Cholesky factor, $U$, which solves $U^T U = Z^T W Z - z^T w z$, in a single step. Then we have that $H = Z^* U^{-1} (U^T)^{-1} Z^{*T} W^*$. Down-dating the Cholesky
