We simulated data sets to approximate realistic cohort-study scenarios for the association of a biallelic SNP, the predictor, with a survival outcome. We considered biallelic SNPs under an additive, dominant, or recessive genetic model for the risk allele. Each data set comprised n = 1,000 unrelated individuals, as in a medium-sized cohort study. We generated the SNPs with different minor allele frequencies (MAF): 10%, 25%, 35%, or 50%. We did not use MAFs below 10%, to avoid convergence problems when fitting the Cox regression model. Even so, convergence problems were very frequent when the Cox model included a SNP under a recessive genetic model, so we simulated the recessive models with a minimum MAF of 15%. We also simulated different effect sizes for the risk allele. In genetic association studies, the effect sizes of risk alleles tend not to be large. We therefore simulated risk alleles with moderate hazard ratios (HR) of 1.25, 1.5, or 2.0 relative to the wildtype allele.

The data sets were simulated with a total of 60% censoring. This percentage was meant to cover the three forms of censoring that occur in a cohort study: losses to follow-up, withdrawals from the study, and discontinued follow-up due to the end of the study. We assumed that censoring occurred completely at random, i.e. censoring was independent of the predictor. The study of Müller et al. (2008) showed that the percentage of censoring has little influence on the estimates of $R^2$ unless it is very high, e.g. >80%. Thus, we did not investigate the effects of varying the amount of censoring within the small-to-moderate range.

To simulate each data set, we took the minor allele to be the risk allele. Then, given the parameters listed above for MAF, HR, and censoring, we proceeded as follows:

We generated a genotype vector $X$ of size n = 1,000, where $X$ takes values {0, 1, 2} for the {wildtype, heterozygous, homozygous} genotype, respectively. The values were assigned randomly to the n unrelated individuals with genotype probabilities $p_j$, for $j = 0, 1, 2$. The probabilities $p_j$ were computed as expected under Hardy-Weinberg equilibrium:

$$
\begin{aligned}
p_0 &= (1 - \mathrm{MAF})^2, \\
p_1 &= 2 \times (1 - \mathrm{MAF}) \times \mathrm{MAF}, \\
p_2 &= \mathrm{MAF}^2.
\end{aligned} \tag{5.1}
$$
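The genotype step can be sketched in a few lines of Python. This is only an illustration of equation (5.1), not the code used for the study; the function names, the seed, and the choice of MAF = 0.25 are assumptions made for the example.

```python
import random

def hwe_probabilities(maf):
    """Genotype probabilities (p0, p1, p2) expected under Hardy-Weinberg equilibrium."""
    p0 = (1.0 - maf) ** 2          # wildtype homozygote, X = 0
    p1 = 2.0 * (1.0 - maf) * maf   # heterozygote, X = 1
    p2 = maf ** 2                  # risk-allele homozygote, X = 2
    return p0, p1, p2

def simulate_genotypes(n, maf, rng):
    """Draw n independent genotypes coded 0, 1, 2 with the HWE probabilities."""
    return rng.choices([0, 1, 2], weights=hwe_probabilities(maf), k=n)

rng = random.Random(2024)  # arbitrary seed, used only for reproducibility
genotypes = simulate_genotypes(1000, 0.25, rng)
```

Because the three probabilities sum to one for any MAF in [0, 1], they can be passed directly as sampling weights.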

Next, the vector of times to event $T$ was generated using the derivation of Bender et al. (2005), given in equation (5.2). The reasoning behind this derivation is as follows. Let $F(T)$ be the cumulative distribution function, so that $F(T) = 1 - S(T)$. Let $U_1$ be a random variable uniformly distributed over the interval [0,1], $U_1 \sim \text{Uniform}[0,1]$. By statistical theory (Mood et al. 1974), it holds that $F^{-1}(U_1) = T$ and $U_1 = F(T)$. In addition, if $U_1 \sim \text{Uniform}[0,1]$, then $U = 1 - U_1$ has the same distribution. Hence, $U = 1 - F(T) = S(T) \sim \text{Uniform}[0,1]$.

Thus, taking the conditional survival function under the Cox regression model (equation (3.10)), it follows that

$$
U = S(T \mid X) = \exp\left[-\Lambda_0(T)\exp(\hat{\beta}'X)\right] \sim \text{Uniform}[0,1].
$$

Solving for the cumulative baseline hazard, we have

$$
\Lambda_0(T) = -\frac{\log(U)}{\exp(\hat{\beta}'X)}.
$$

If $\lambda_0(t) > 0$ for all $t$, then

$$
T = \Lambda_0^{-1}\left[-\frac{\log(U)}{\exp(\hat{\beta}'X)}\right]. \tag{5.2}
$$

According to this equation, simulating the survival time $T$ requires knowledge of the cumulative baseline hazard function $\Lambda_0$. For simplicity, we assumed that $T$ was exponentially distributed, because this gives a constant baseline hazard function $\lambda_0$ (Bender et al. 2005), and the cumulative baseline hazard function is $\Lambda_0(T) = \lambda_0 T$. Hence, $T = \Lambda_0^{-1}(\lambda_0 T)$, and from equation (5.2) we have

$$
\lambda_0 T = -\frac{\log(U)}{\exp(\hat{\beta}'X)}, \quad \text{and} \quad T = -\frac{\log(U)}{\lambda_0 \exp(\hat{\beta}'X)}, \tag{5.3}
$$

where the denominator $\lambda_0 \exp(\hat{\beta}'X)$ is the hazard, which is constant over time and depends on the genotypes, i.e. $\lambda_0 \exp(\hat{\beta}'X) = \lambda(X)$, $\forall t$.
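As a numerical check on this inverse-transform argument, the following Python sketch draws times from equation (5.3) and evaluates the conditional survival function at them. If the derivation holds, those survival values behave like Uniform[0,1] draws, so their mean should be near 0.5, and the mean time should be near the reciprocal of the hazard. The function names, the seed, the sample size, and the values $\lambda_0 = 0.30$ and $\exp(\hat{\beta}'X) = 1.5$ are assumptions chosen for illustration.

```python
import math
import random

def simulate_time(lambda0, linear_predictor, rng):
    """One survival time under a constant baseline hazard lambda0 (equation 5.3)."""
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1] and log(u) is finite
    u = 1.0 - rng.random()
    return -math.log(u) / (lambda0 * math.exp(linear_predictor))

def survival(t, lambda0, linear_predictor):
    """Conditional survival function S(t | X) for the same model."""
    return math.exp(-lambda0 * t * math.exp(linear_predictor))

rng = random.Random(1)  # arbitrary seed
lambda0, lp = 0.30, math.log(1.5)
times = [simulate_time(lambda0, lp, rng) for _ in range(20000)]

# Mean of S(T | X) over the simulated times; expected to be close to 0.5
mean_surv = sum(survival(t, lambda0, lp) for t in times) / len(times)
# Mean simulated time; expected to be close to 1 / (0.30 * 1.5)
mean_time = sum(times) / len(times)
```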

Given that our simulations considered only a single gene predictor, i.e. $K = 1$, we generated $T$ such that

$$
T = -\frac{\log(U)}{\lambda_0 \, \mathrm{HR}^{X_m}}, \tag{5.4}
$$

where $U$ was a random draw from the uniform distribution on the interval [0,1], HR was the effect size assumed for the association of the risk allele with the event, and $X_m$ was the value of $X$ recoded according to the genetic model assumed for the risk allele, i.e. the mode of inheritance of the event. We also let $\lambda_0$ vary according to the assumed genetic model.

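Hence,

for an additive genetic model: $X_m = X$, and $\lambda_0 = 0.12$;

for a dominant genetic model: $X_m = \begin{cases} 0 & \text{if } X = 0 \\ 1 & \text{if } X = 1, 2 \end{cases}$, and $\lambda_0 = 0.30$;

for a recessive genetic model: $X_m = \begin{cases} 0 & \text{if } X = 0, 1 \\ 1 & \text{if } X = 2 \end{cases}$, and $\lambda_0 = 0.30$.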
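The recoding rules and equation (5.4) can be combined into a short Python sketch. The baseline hazards are the values quoted above; the function names, the seed, and the five-genotype example vector are hypothetical and serve only to show the mechanics.

```python
import math
import random

# Baseline hazards per genetic model, as given in the text
BASELINE_HAZARD = {"additive": 0.12, "dominant": 0.30, "recessive": 0.30}

def recode(x, model):
    """Map a genotype x in {0, 1, 2} to X_m under the given genetic model."""
    if model == "additive":
        return x
    if model == "dominant":
        return 1 if x >= 1 else 0
    if model == "recessive":
        return 1 if x == 2 else 0
    raise ValueError(f"unknown genetic model: {model}")

def simulate_event_times(genotypes, hr, model, rng):
    """Draw one event time per individual: T = -log(U) / (lambda0 * HR**X_m)."""
    lambda0 = BASELINE_HAZARD[model]
    times = []
    for x in genotypes:
        u = 1.0 - rng.random()  # uniform on (0, 1], keeps log(u) finite
        times.append(-math.log(u) / (lambda0 * hr ** recode(x, model)))
    return times

rng = random.Random(11)      # arbitrary seed
genotypes = [0, 1, 2, 1, 0]  # toy genotype vector for the example
times = simulate_event_times(genotypes, 1.5, "dominant", rng)
```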

The baseline hazards $\lambda_0 = 0.12$ and $\lambda_0 = 0.30$ were chosen from the interval [0,1], with the only condition that none of the hazards $\lambda(t, X)$ exceed 1, so that they also fall in the interval [0,1]. For instance, for the simulated parameters HR = {1.25, 1.5, 2.0}, the maximum simulated hazard under the dominant and recessive models was $\lambda(t, X) = 0.30 \times 2.0 = 0.60$, whereas under the additive model it was $\lambda(t, X) = 0.12 \times 2.0^2 = 0.48$. However, the value of this parameter did not appear to influence the results (data not shown).
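The bound on the hazards can be verified over the whole parameter grid with a small Python check. The parameter values are those quoted in the text; the variable names and the layout of the dictionary are assumptions for the example.

```python
HAZARD_RATIOS = [1.25, 1.5, 2.0]

# (baseline hazard, largest value X_m can take) for each genetic model
MODELS = {
    "additive": (0.12, 2),
    "dominant": (0.30, 1),
    "recessive": (0.30, 1),
}

# Largest hazard lambda0 * HR**X_m reached under each model
max_hazard = {
    model: max(lambda0 * hr ** max_xm for hr in HAZARD_RATIOS)
    for model, (lambda0, max_xm) in MODELS.items()
}
```

The maxima are attained at HR = 2.0 in every model, which matches the two products worked out in the paragraph above.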

Drawing $t$ repeatedly for the $X_m$ of each of the n individuals provided the time-to-event vector $T$.

Next, the time-to-censoring vector $C$ was generated as a completely random variable from a uniform distribution on the interval [0, t.censor]. The upper limit t.censor was set so as to produce 60% censoring, which is roughly the amount of censoring we observed in our study data. The upper limit t.censor can be viewed as the end time of the cohort study. It was chosen through pilot simulations in which we set t.censor to various quantiles of the vector $T$ and generated the corresponding vector $C$. We then derived the censoring indicator $\delta = I(T \le C)$ and selected the value of t.censor that produced roughly 60% censoring, i.e. 60% of 0's in the vector $\delta$. The 66.5th percentile of the vector $T$ ($q_{0.665}$ of $T$) approximated the required 60% censoring.

Then, we generated the time-to-censoring vector $C$ from Uniform[0, t.censor], with t.censor = $q_{0.665}$ of $T$. Finally, the observed survival time vector was obtained as $T^* = \min(T, C)$, and the censoring status vector as $\delta = I(T \le C)$. The data sets were generated with 100 replications.
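The censoring step can be sketched in Python as follows. This is a hedged illustration rather than the original code: the quantile function (linear interpolation between order statistics), the function names, and the seed are assumptions, and the event times in the example are drawn from a single exponential distribution with rate 0.30, without any genotype effect, purely to have input data.

```python
import math
import random

def quantile(values, q):
    """Empirical quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

def apply_censoring(event_times, rng, q=0.665):
    """Return observed times T* = min(T, C), status delta = I(T <= C), and t_censor."""
    t_censor = quantile(event_times, q)  # upper limit of the censoring distribution
    observed, status = [], []
    for t in event_times:
        c = rng.uniform(0.0, t_censor)   # censoring time, independent of T and X
        observed.append(min(t, c))
        status.append(1 if t <= c else 0)
    return observed, status, t_censor

rng = random.Random(7)  # arbitrary seed
# Illustrative event times only: exponential with rate 0.30 for every individual
event_times = [rng.expovariate(0.30) for _ in range(1000)]
observed, status, t_censor = apply_censoring(event_times, rng)
censored_fraction = status.count(0) / len(status)
```

With exponential event times, setting the upper limit at the 66.5th percentile should give a censored fraction close to the 60% targeted in the text, up to sampling variation.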
