Because we clip the reads, we need another parameter that sets the minimum length of a clipped read. We call this parameter λ. If every short read were aligned to the graph, we would more often find matches in multiple locations, and reads originating from other regions would be more likely to align. At the same time, rejecting every read that has been clipped is probably not a good idea, because we would end up with very few reads. Training λ is cheap, so we can try many different values.

We trained using the following minimum sequence lengths: λ = 10, 20, 30, 40, 50, 60, 70, 80, 90. We align a read to the graph only if its length is at least the lowest minimum sequence length, which in our case is 10. When the alignment has finished, we copy the results and add them to the total results for each minimum sequence length that is lower than or equal to the cutoff length. For instance, if the length of a read is 40 we use 40 as the cutoff length and add the results to the totals for 40, 30, 20, and 10. However, if a read has length 9 we reject it entirely.

The computational time therefore does not scale with the number of parameter values we try. It depends only on the smallest minimum sequence length we use.
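The bookkeeping described above can be sketched as follows. This is a minimal illustration, not Gyper's actual code: the function name `accumulate_by_min_length`, the callable `align`, and the use of a single number as the "result" of an alignment are all assumptions made for the example.

```python
from collections import defaultdict

# Minimum clipped-read lengths tried during training (values from the text).
LAMBDAS = [10, 20, 30, 40, 50, 60, 70, 80, 90]


def accumulate_by_min_length(reads, align, lambdas=LAMBDAS):
    """Align each read once and credit the result to every lambda <= its length.

    `reads` is an iterable of clipped read strings and `align` is a
    hypothetical callable that returns a numeric result for one read.
    """
    totals = defaultdict(float)
    smallest = min(lambdas)
    for read in reads:
        if len(read) < smallest:
            continue  # shorter than every threshold: rejected without aligning
        result = align(read)  # the expensive step, done once per read
        for lam in lambdas:
            if lam <= len(read):
                totals[lam] += result
    return dict(totals)
```

The number of calls to `align` depends only on how many reads reach the smallest threshold, which is why adding more values of λ costs almost nothing.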

3.4.3 Mismatches

In section 3.4 we mentioned that the aligner can allow mismatches in some scenarios. We chose to allow a mismatch only when all of the following conditions are met:

• The current node has one outgoing edge.

• The current node has one incoming edge.

• The quality of the read is below a given threshold, ρ.

The parameter we need to train is the quality threshold ρ. The first condition is needed because, if we allowed mismatches on nodes with more than one outgoing edge, a single alignment could map to multiple paths in the graph. If we allow separate paths, we are likely to face many cases where a single read starts at some node and ends at multiple other places. When this happens there is no good way to decide which of these paths is the most correct one. To avoid this issue we simply forbid mismatches on nodes with more than one outgoing edge. The second condition is not mandatory. However, it gives us the nice feature of a symmetric alignment to the graph.
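The three conditions amount to a simple predicate. The sketch below is illustrative only: the function name and its arguments (`out_degree`, `in_degree`, `quality`, `rho`) are hypothetical and do not come from Gyper's source.

```python
def mismatch_allowed(out_degree, in_degree, quality, rho):
    """Return True only if all three conditions from the text hold.

    out_degree / in_degree: number of outgoing / incoming edges of the
    current graph node; quality: quality value of the read at this
    position; rho: the trained quality threshold.
    """
    return out_degree == 1 and in_degree == 1 and quality < rho
```

Dropping the `in_degree == 1` test would still avoid ambiguous paths, but the alignment would then no longer be symmetric.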

We chose to train the mismatch quality threshold using these values: ρ = 20, 25, 30, 35, and 37. Again, we chose only five values because each value requires a separate alignment to the graph and is thus computationally expensive.

3.4.4 Zygosity factor

When deciding which alleles explain the most reads, a naive score will always heavily favor heterozygous over homozygous genotypes, because two different alleles can always explain at least as many reads as a single allele. In previous work, such as [Szolek et al., 2014], this issue was resolved by using a constant factor that increases the scores of homozygous results. We believe this method fails to provide a good genotype if both true alleles are very similar. In other words, if an individual’s true alleles are very similar, then the score of each of those alleles alone is expected to be almost as high as the score of the two in combination. To counter this issue we use a weighted average of how many reads two alleles can explain together and the average of how many reads they can explain individually. More formally, suppose we have selected n alleles in a set $L = \{a_1, a_2, \ldots, a_n\}$. We also have m read pairs from sequencing an individual. Given a read pair $r_j$ and an allele $a_i$, where $1 \le j \le m$ and $1 \le i \le n$, we define

$$
C_{r_j,a_i} =
\begin{cases}
1, & \text{if allele } a_i \text{ can explain the read pair } r_j \\
0, & \text{otherwise}
\end{cases}
\tag{3.1}
$$

If a read pair is explained by an allele, the hit counter for that allele and read pair is 1, and otherwise it is 0. Read pair $r_j$ can be explained by any number of alleles:

$$
0 \le \sum_{i=1}^{n} C_{r_j,a_i} \le n \tag{3.2}
$$

The total score $S_A$ of an allele $A \in L$ for the given read pairs $r_1, r_2, \ldots, r_j, \ldots, r_m$ is

$$
S_A = \sum_{j=1}^{m} C_{r_j,A} \tag{3.3}
$$

Since each individual has two chromosomes we need a method to convert the single allele score, $S_A$, to a combined allele score we call $S_{A,B}$. If we have two alleles $A, B \in L$ we only require that one allele can explain the read. We calculate their heterozygous score as

$$
S_{A,B} = \sum_{j=1}^{m} \max\left(C_{r_j,A},\, C_{r_j,B}\right) \tag{3.4}
$$

Here we note that for a homozygous genotype, that is when A = B in equation 3.4, we get

$$
S_{A,A} = \sum_{j=1}^{m} \max\left(C_{r_j,A},\, C_{r_j,A}\right) = \sum_{j=1}^{m} C_{r_j,A} = S_A \tag{3.5}
$$

Individuals that have the same allele A on both chromosomes simply get a score $S_{A,A} = S_A$. This scoring scheme heavily favors heterozygous scores because $S_{A,B} \ge S_A$ for any two alleles A and B. Instead, we decide the zygosity of the individual by using the number of reads explained by one allele but not the other:

$$
S_{A \setminus B} = \sum_{j=1}^{m} \max\left(C_{r_j,A} - C_{r_j,B},\, 0\right) \tag{3.6}
$$

We also define $S_{A \cap B}$ to be the number of reads explained by both allele A and allele B:

$$
S_{A \cap B} = \sum_{j=1}^{m} \max\left(C_{r_j,A} + C_{r_j,B} - 1,\, 0\right) \tag{3.7}
$$

Figure 3.8 shows the relation among scores S.
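Equations 3.3, 3.4, 3.6, and 3.7 can be computed directly from two 0/1 hit vectors. The sketch below assumes that `hits_a[j]` and `hits_b[j]` hold the hit counters of read pair j for alleles A and B; the function name and the dictionary keys are illustrative choices, not names from the source.

```python
def allele_scores(hits_a, hits_b):
    """Compute the scores of equations 3.3, 3.4, 3.6 and 3.7 for two alleles.

    hits_a and hits_b are equal-length sequences of 0/1 hit counters,
    one entry per read pair.
    """
    if len(hits_a) != len(hits_b):
        raise ValueError("both alleles need one hit counter per read pair")
    pairs = list(zip(hits_a, hits_b))
    return {
        "S_A": sum(hits_a),                                # eq. 3.3
        "S_B": sum(hits_b),                                # eq. 3.3
        "S_AB": sum(max(a, b) for a, b in pairs),          # eq. 3.4
        "S_A_not_B": sum(max(a - b, 0) for a, b in pairs), # eq. 3.6
        "S_B_not_A": sum(max(b - a, 0) for a, b in pairs), # eq. 3.6, roles swapped
        "S_both": sum(max(a + b - 1, 0) for a, b in pairs) # eq. 3.7
    }
```

For any input, `S_AB` equals `S_A_not_B + S_both + S_B_not_A`, which is the set relation drawn in Figure 3.8.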

If $S_{A \setminus B}$ is much larger than $S_{B \setminus A}$, it is likely that the individual is homozygous with two A alleles. However, if $S_{A \setminus B}$ and $S_{B \setminus A}$ are relatively similar, then it is likely that the individual is heterozygous.

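Figure 3.8: Alleles A and B are represented as sets (a Venn diagram with regions $S_{A \setminus B}$, $S_{A \cap B}$, and $S_{B \setminus A}$). Each set contains the reads that the allele explains. The number of reads explained by either allele A or B is $S_{A,B} = S_{A \setminus B} + S_{A \cap B} + S_{B \setminus A}$.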

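We can use these results to decide zygosity using a constant β.

$$
\frac{\left|S_{A \setminus B} - S_{B \setminus A}\right|}{S_{A \setminus B} + S_{B \setminus A}}
\begin{cases}
> \beta, & \text{homozygous solution} \\
\le \beta, & \text{heterozygous solution}
\end{cases}
\tag{3.8}
$$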

Here 0 ≤ β ≤ 1. Figure 3.9 further explains our reasoning. A cross represents a read. If allele A can explain the read, then the cross is in set A.

Figure 3.9: Two examples of the distribution of reads (crosses) between two alleles, A and B. a) $S_{A \setminus B} = 8$, $S_{A \cap B} = 8$, $S_{B \setminus A} = 1$. It is likely that the individual is homozygous even though $S_A = 16$ and $S_{A,B} = 17$. The read inside the $B \setminus A$ region is likely an error. b) $S_{A \setminus B} = 6$, $S_{A \cap B} = 8$, $S_{B \setminus A} = 5$. Here, however, $S_{A \setminus B}$ and $S_{B \setminus A}$ are relatively similar, and thus we would rather expect the individual to be heterozygous.
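The decision rule of equation 3.8 can be tried on the two panels of Figure 3.9. The sketch below is only an illustration: the function name is hypothetical, β = 0.5 is picked for the example rather than taken from training, and raising an error when both counts are zero is an assumption, since the rule is undefined in that case.

```python
def call_zygosity(s_a_not_b, s_b_not_a, beta):
    """Apply equation 3.8 to the reads explained by only one of two alleles."""
    total = s_a_not_b + s_b_not_a
    if total == 0:
        # Assumption: the ratio is undefined when no read separates the alleles.
        raise ValueError("no read is explained by exactly one allele")
    ratio = abs(s_a_not_b - s_b_not_a) / total
    return "homozygous" if ratio > beta else "heterozygous"


panel_a = call_zygosity(8, 1, beta=0.5)  # ratio 7/9
panel_b = call_zygosity(6, 5, beta=0.5)  # ratio 1/11
```

With this β, panel a) is called homozygous and panel b) heterozygous, matching the reasoning in the figure caption.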

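Now we can scale the heterozygous solutions down using β. First, we assume that A has the highest score, $S_{max} = S_A \ge S_B$. This also means that $S_{A \setminus B} \ge S_{B \setminus A}$ because $S_A - S_B = S_{A \setminus B} - S_{B \setminus A}$. Furthermore, we can see from figure 3.8 that $S_{A \setminus B} + S_{B \setminus A} = S_A + S_B - 2S_{A \cap B}$. Assuming $S_{A \setminus B} + S_{B \setminus A} > 0$, the condition for a heterozygous solution is

$$
S_A - S_B - \beta\left(S_A + S_B - 2S_{A \cap B}\right) \le 0 \tag{3.9}
$$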

To compare scores between homozygous and heterozygous solutions we introduce a scaled score $S'_{A,B}$. If an individual is heterozygous, we want

$$
S'_{A,B} \ge S_A \tag{3.10}
$$

One solution to equations 3.9 and 3.10 is:

$$
S'_{A,B} = \beta S_{A,B} + (1 - \beta)\,\frac{S_A + S_B}{2} \tag{3.11}
$$
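As a check on why equation 3.11 works, one can substitute $S_{A,B} = S_A + S_B - S_{A \cap B}$, which follows from the set relation in Figure 3.8, and compare the scaled score with $S_A$. This short derivation is added here as a worked step and uses only the symbols defined above:

$$
\begin{aligned}
S'_{A,B} - S_A &= \beta\left(S_A + S_B - S_{A \cap B}\right) + (1 - \beta)\,\frac{S_A + S_B}{2} - S_A \\
&= \frac{\beta}{2}\left(S_A + S_B - 2S_{A \cap B}\right) - \frac{S_A - S_B}{2} \\
&= -\frac{1}{2}\left[\,S_A - S_B - \beta\left(S_A + S_B - 2S_{A \cap B}\right)\right]
\end{aligned}
$$

The bracket is the left-hand side of equation 3.9, so $S'_{A,B} \ge S_A$ holds exactly when the heterozygous condition 3.9 is satisfied.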

Equation 3.11 has a nice feature: it makes scaling unnecessary for homozygous solutions because

$$
S'_{A,A} = \beta S_{A,A} + (1 - \beta)\,\frac{S_A + S_A}{2} = S_A \tag{3.12}
$$

Therefore, we only need to scale heterozygous solutions. If we use β = 1 there is no scaling, $S'_{A,B} = S_{A,B}$, and heterozygous solutions are favored. If β = 0 the scoring scheme favors homozygous solutions. We suggest β = 0.5 as a middle ground, but training this parameter is computationally cheap.

Using individuals in deCODE’s dataset we trained β using 9 different values: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. We expect our scoring scheme, at least to some extent, to solve the problem of having two highly similar alleles. Given all this, we formulate our optimization problem as

$$
S'_{max} = \max_{A,B \in L} S'_{A,B} = \max_{A,B \in L} \left[\beta S_{A,B} + (1 - \beta)\,\frac{S_A + S_B}{2}\right] \tag{3.13}
$$
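The optimization in equation 3.13 can be illustrated with a brute-force search over all unordered allele pairs. This is a sketch under stated assumptions, not Gyper's implementation: the names `scaled_score` and `best_genotype` and the dictionary layout (allele name mapped to a 0/1 hit vector) are hypothetical, and the example data encode panel a) of Figure 3.9.

```python
from itertools import combinations_with_replacement


def scaled_score(hits_a, hits_b, beta):
    """Scaled score of equation 3.11 for two 0/1 hit vectors."""
    s_a = sum(hits_a)
    s_b = sum(hits_b)
    s_ab = sum(max(a, b) for a, b in zip(hits_a, hits_b))
    return beta * s_ab + (1 - beta) * (s_a + s_b) / 2


def best_genotype(hits, beta):
    """Return (score, (A, B)) maximizing the scaled score over all pairs.

    Pairs with A == B are the homozygous candidates.
    """
    pairs = combinations_with_replacement(sorted(hits), 2)
    scored = [(scaled_score(hits[a], hits[b], beta), (a, b)) for a, b in pairs]
    return max(scored, key=lambda item: item[0])


# Figure 3.9 a): 8 reads only in A, 8 reads in both, 1 read only in B.
fig_3_9a = {"A": [1] * 16 + [0], "B": [0] * 8 + [1] * 9}
```

With β = 0.5 the homozygous pair (A, A) wins for this data, whereas β = 1 reproduces the unscaled scheme and selects (A, B).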

3.5 Parameter training

To train the parameters we used an in-house program that uses imputation to determine the correct allele of an individual. When imputing we use the relation data among individuals in the dataset to expand the results for the Icelandic population. The program requires a list of alleles, the most likely allele, and the likelihoods of all combinations of alleles on the Phred scale. The data was stored in VCF format.

We define $e_{A,B}$ to be the event that a read is explained by some alleles $A, B \in L$ that are not the true genotype. The probability of that event occurring is $P(e_{A,B}) = \varepsilon$. In general, we estimate the number of such events to be

$$
d = S'_{max} - S'_{A,B} \tag{3.14}
$$

We assume all such events are mutually independent, so the binomial probability of the event occurring d times is in general:

$$
P(e_{A,B}) = \varepsilon^{d} \tag{3.15}
$$

The score is not always an integer, but it does roughly correspond to the number of mismatched read pairs, so we believe it is a reasonable metric to use. The probability of each genotype is then estimated to be

$$
P_{A,B} = P(e_{A,B})\,P_{max} = \varepsilon^{d} P_{max} \tag{3.16}
$$

Our imputation tool uses a reversed Phred score, meaning that the allele with a score of 0 is the most likely one. So instead of the probability of error $P(e)$, we use $P(e_{A,B})$ in equation 2.1. We also put a limit on the Phred score so that it is never higher than 255, which means an allele is never considered less likely to be true than $10^{-25.5}$. The Phred score is then calculated using:

$$
\mathrm{Pred}_{A,B} = \min\left(-10 \cdot \log_{10}\left(P(e_{A,B})\right),\, 255\right) \tag{3.17}
$$

We arbitrarily chose $\varepsilon = 1\%$, so we can simplify equation 3.17 to

$$
\mathrm{Pred}_{A,B} = \min\left(20d,\, 255\right) \tag{3.18}
$$
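Equations 3.14 to 3.18 reduce to a few lines of arithmetic. The sketch below is a hedged illustration: the function and argument names are hypothetical, and the per-event error probability is passed as `epsilon` with the 1% value from the text as its default.

```python
import math


def phred_score(s_max_scaled, s_scaled, epsilon=0.01, cap=255):
    """Capped Phred-scaled score of a genotype relative to the best one.

    s_max_scaled is the best scaled score and s_scaled the scaled score of
    the genotype in question, so d = s_max_scaled - s_scaled (eq. 3.14).
    """
    d = s_max_scaled - s_scaled
    if d < 0:
        raise ValueError("a genotype cannot score above the maximum")
    # -10 * log10(epsilon ** d) written as -10 * d * log10(epsilon)
    return min(-10.0 * d * math.log10(epsilon), cap)
```

With `epsilon = 0.01` this equals `min(20 * d, 255)`, so the best genotype gets a score of 0 and any genotype at least 12.75 score units behind it is capped at 255.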

In section 3.3.2 we defined the 4 parameters trained. In total our input space is large: its total size is 2,025 for each gene for a single individual. The most computationally expensive part is the alignment to the partial order graph, and only two parameters require separate alignments. These two parameters are the clipping and mismatch thresholds, which have a combined input space of 25. They are the dominant factors in computational time.

In our training set we used a dataset of 3,894 Icelanders who have been sequenced at deCODE. These individuals have all passed the in-house quality control tests with high scores. The quality control score is based on sequencing depth coverage, contamination, and more. If we assume that the time to genotype an individual for one gene is 40 CPU seconds, then the time required to genotype 3,894 individuals for six genes using 25 different parameter settings is about 270 CPU days. The parameter training was therefore carried out on multiple nodes in deCODE’s computer cluster. The results contain about 47.3 million data points, which were then imputed for more than 150,000 chip-genotyped Icelanders using an in-house imputation tool.
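The two figures quoted above follow from the numbers given in the text. The short calculation below reproduces them; the variable names are illustrative, and the 40 CPU seconds per genotype is the assumption stated in the text.

```python
SECONDS_PER_GENOTYPE = 40  # assumed CPU seconds per individual and gene
INDIVIDUALS = 3894
GENES = 6
ALIGNMENT_SETTINGS = 25    # clipping x mismatch thresholds
TOTAL_SETTINGS = 2025      # full input space per gene and individual

cpu_seconds = SECONDS_PER_GENOTYPE * INDIVIDUALS * GENES * ALIGNMENT_SETTINGS
cpu_days = cpu_seconds / 86400  # 86,400 seconds per day
data_points = INDIVIDUALS * GENES * TOTAL_SETTINGS
```

Here `cpu_days` comes out at roughly 270 and `data_points` at roughly 47.3 million, in line with the text.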

The tool evaluates an INFO score, which indicates how well the genotypes fit with the relational data of the individuals. For example, let’s say that Gyper predicts a father to have two HLA-A*01:01 alleles and the mother to have HLA-A*02:01 and HLA-A*03:01 alleles. If Gyper predicted that their child has two HLA-A*68:01 alleles, the INFO score would be lowered because this is essentially impossible. Moreover, even if Gyper predicted the child to have HLA-A*01:01 and HLA-A*02:01, that might also be false if it has already been determined that the mother passed her chromosome 6 with the HLA-A*03:01 allele to the child. Therefore, in this particular example the only possible genotype of the child, given that the parents were genotyped correctly, is HLA-A*01:01 and HLA-A*03:01.

The in-house imputation tool outputs for each genotype its minor allele frequency (MAF) and an INFO score estimating how well the genotyper predicted this allele. We weight each INFO score by its MAF and find the average over all genotypes. Our assumption is that whichever combination of parameters provides the highest weighted average INFO score is the best combination.
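The selection criterion can be sketched as a MAF-weighted mean. The text does not spell out the normalization, so dividing by the sum of the MAFs is an assumption here, as are the function names and the input layout (one list of `(maf, info)` pairs per parameter combination).

```python
def weighted_info(records):
    """MAF-weighted average INFO score over (maf, info) pairs.

    Assumption: the weights are normalized by the sum of the MAFs.
    """
    records = list(records)
    total_weight = sum(maf for maf, _ in records)
    if total_weight == 0:
        raise ValueError("all MAF weights are zero")
    return sum(maf * info for maf, info in records) / total_weight


def best_parameters(results):
    """Pick the parameter combination with the highest weighted INFO score.

    `results` maps a parameter combination (any hashable key) to its list
    of (maf, info) pairs.
    """
    return max(results, key=lambda params: weighted_info(results[params]))
```

Weighting by MAF means that a poorly imputed common allele lowers the average more than a poorly imputed rare one.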

3.6 Implementation

Gyper is implemented in C++ and depends on both SeqAn 2.1.0 [Döring et al., 2008] and Boost 1.58.0. The project is open-source and maintained on GitHub at

https://github.com/hannespetur/gyper

The program is licensed under the simplified BSD license. SeqAn 2.1.0 has not been released yet, but its development is ongoing. Until it is released, it is possible to use SeqAn’s development branch on GitHub instead.

4 Results

4.1 Preprocessing the data