
Several other HLA genotypers are publicly available, as discussed in section 2.5. One of the best current HLA genotypers is OptiType, so we focus on comparing Gyper to it in terms of both accuracy and time.

4.4.1 Accuracy

OptiType’s accuracy was measured for both genotype calling and zygosity calling on samples from the 1000 Genomes project. On both datasets Gyper showed the same or slightly better calling accuracy than OptiType.

1000 Genomes exome dataset

According to the calling results reported in the OptiType article [Szolek et al., 2014], OptiType’s 4 digit accuracy on the exome dataset was 97.8% (Table 4.12), while Gyper’s accuracy was marginally higher at 97.9%. The exome dataset had previously been typed by Major et al. [2013] with an accuracy of 93.9%.

OptiType typed 1056 of 1080 alleles correctly, while Gyper typed only a single allele more correctly. OptiType’s accuracy is higher than Gyper’s on the HLA-B gene but lower on the other two genes.

Table 4.12: OptiType’s 4 digit call accuracy on 1000 Genomes’ exome dataset compared to Erlich et al. [2011].

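| Gene | 0 errors | 1 error | 2 errors | Correct alleles | OptiType’s accuracy | Gyper’s accuracy |
|-----------|-----|----|---|--------------|-------|-------|
| HLA-A | 169 | 11 | 0 | 349 of 360 | 96.9% | 97.5% |
| HLA-B | 175 | 5 | 0 | 355 of 360 | 98.6% | 96.7% |
| HLA-C | 172 | 8 | 0 | 352 of 360 | 97.8% | 99.4% |
| All genes | 516 | 24 | 0 | 1056 of 1080 | 97.8% | 97.9% |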
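The "Correct alleles" and accuracy columns follow from the per-sample error counts, since every sample carries two alleles per gene. A minimal sketch of that arithmetic, using the OptiType counts from Table 4.12 (the function name `allele_accuracy` is ours and purely illustrative):

```python
from typing import Tuple


def allele_accuracy(zero_err: int, one_err: int, two_err: int) -> Tuple[int, int, float]:
    """Return (correct alleles, total alleles, accuracy in percent).

    Each sample contributes two alleles: a sample with 0 errors has two
    correct alleles, a sample with 1 error has one, and a sample with
    2 errors has none.
    """
    samples = zero_err + one_err + two_err
    total = 2 * samples
    correct = 2 * zero_err + one_err
    return correct, total, 100.0 * correct / total


# Sample counts (0 errors, 1 error, 2 errors) taken from Table 4.12.
optitype_exome = {
    "HLA-A": (169, 11, 0),
    "HLA-B": (175, 5, 0),
    "HLA-C": (172, 8, 0),
}

for gene, counts in optitype_exome.items():
    correct, total, pct = allele_accuracy(*counts)
    print(f"{gene}: {correct} of {total} alleles correct ({pct:.1f}%)")
```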

Gyper’s zygosity calling accuracy was 100.0% for these samples, compared to OptiType’s 98.5%.

4.4 Comparison with other DNA sequencing data genotypers

1000 Genomes WGS dataset

We also compared Gyper to OptiType on low-coverage WGS data. This dataset had previously been genotyped with HLAminer, which achieved 80.2% accuracy at 4 digit resolution [Warren et al., 2012]. Both OptiType and Gyper achieved 95.0% genotype calling accuracy (Table 4.13).

Table 4.13: OptiType’s 4 digit WGS genotype calling accuracy compared to Erlich et al. [2011].

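| Gene | 0 errors | 1 error | 2 errors | Correct alleles | OptiType’s accuracy | Gyper’s accuracy |
|-----------|----|---|---|------------|-------|-------|
| HLA-A | 18 | 2 | 0 | 38 of 40 | 95.0% | 95.0% |
| HLA-B | 18 | 1 | 1 | 37 of 40 | 92.5% | 92.5% |
| HLA-C | 19 | 1 | 0 | 39 of 40 | 97.5% | 97.5% |
| All genes | 55 | 4 | 1 | 114 of 120 | 95.0% | 95.0% |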

Furthermore, both Gyper and OptiType called zygosity correctly in 59 of 60 cases (98.3%).

4.4.2 Time

The main feature of Gyper is its efficiency when the user stores their WGS reads in an alignment file in SAM/BAM format. Storing reads as alignments is now widespread and has become the industry standard. Raw reads in FASTQ files are hard to work with because they provide no context for the user. OptiType only supports FASTQ files, which means that users who are interested in, say, the HLA-A genotype of an individual, and who store their reads only as alignments, will need to:

1. Sort the SAM/BAM using read name.

2. Convert SAM/BAM to two separate FASTQ files.

3. Preprocess both FASTQ files using a read mapper.

4. Run OptiType on the preprocessed files.
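The first two steps can be sketched with samtools, assuming it is installed and that the reads are paired-end. The helper name and the output file names below are hypothetical. Steps 3 and 4 depend on the chosen read mapper and on the OptiType installation, so they are left out here.

```python
import subprocess
from typing import List


def bam_to_fastq_commands(bam: str, prefix: str) -> List[List[str]]:
    """Build the samtools commands for steps 1 and 2 (file names are illustrative)."""
    name_sorted = f"{prefix}.namesorted.bam"
    return [
        # Step 1: sort by read name so that mates are adjacent.
        ["samtools", "sort", "-n", "-o", name_sorted, bam],
        # Step 2: write first and second mates to two separate FASTQ files.
        ["samtools", "fastq", "-1", f"{prefix}_1.fastq", "-2", f"{prefix}_2.fastq", name_sorted],
    ]


def run_steps(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call run_steps() to execute.
    for cmd in bam_to_fastq_commands("individual.bam", "individual"):
        print(" ".join(cmd))
```

Both commands process every read in the file, which is why this route scales with the size of the whole alignment file rather than with the size of the HLA region.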

When we tested this process on an in-house 90 GiB indexed BAM file, it took a couple of days. Using Gyper on the same computer, we genotyped the individual in less than 40 seconds. Despite this massive time difference, Gyper has been shown to be comparably or more accurate in calling the correct HLA genotypes.

5 Conclusions

5.1 Summary

Gyper is a fast HLA genotyper that outperformed all other publicly available HLA genotypers in terms of speed. Its high speed is due to the fact that Gyper only uses a very small subset of reads, namely those believed to be relevant to the genotyping. Gyper requires sorted and indexed alignment files to fetch these reads quickly. Even though we genotype with such a small portion of the reads, Gyper is one of the most accurate HLA genotypers publicly available. Compared with OptiType, which has been reported to be an accurate HLA genotyper, Gyper’s genotype and zygosity call accuracy was equal to or higher than OptiType’s.
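To illustrate why sorted input makes fetching a small region cheap, the following toy model keeps reads sorted by start position and uses binary search to locate those overlapping a region. It is a simplified sketch, not the BAM index format and not Gyper's implementation. The class name, the read tuples, and the coordinates are all hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

Read = Tuple[int, int, str]  # (start, end, name), 0-based, end exclusive


class SortedReads:
    """Toy stand-in for a coordinate-sorted, indexed alignment file."""

    def __init__(self, reads: List[Read]) -> None:
        self.reads = sorted(reads)
        # Built once, like an index; later queries only do binary searches.
        self.starts = [r[0] for r in self.reads]
        self.max_len = max((r[1] - r[0] for r in self.reads), default=0)

    def fetch(self, region_start: int, region_end: int) -> List[Read]:
        """Return reads overlapping [region_start, region_end)."""
        # A read overlapping the region must start after region_start - max_len
        # and before region_end, so only that slice needs to be inspected.
        lo = bisect_left(self.starts, region_start - self.max_len + 1)
        hi = bisect_left(self.starts, region_end)
        return [r for r in self.reads[lo:hi] if r[1] > region_start]


reads = SortedReads([
    (0, 100, "r1"), (950, 1050, "r2"), (1000, 1100, "r3"),
    (1999, 2099, "r4"), (2000, 2100, "r5"), (5000, 5100, "r6"),
])
print([name for _, _, name in reads.fetch(1000, 2000)])
```

The work per query depends on the number of reads near the region, not on the total number of reads in the file.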

We also measured Gyper’s accuracy using the coefficient of determination, r², on 4 digit resolution genotypes. Gyper achieved r² > 0.8 for six HLA genes using WGS samples and r² > 0.95 for the three main HLA class I genes using exome samples. The previous methods we compared Gyper with did not measure their accuracy this way. We believe it is a better quality measure than genotype call accuracy.
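As an illustration, r² can be computed as the squared Pearson correlation between the true and the called dosage (0, 1 or 2 copies) of a given allele across samples. This is one common definition and is an assumption here, as the exact formulation used for Gyper is not restated in this section. The dosage vectors below are made up for the example.

```python
from typing import Sequence


def r_squared(truth: Sequence[float], called: Sequence[float]) -> float:
    """Squared Pearson correlation between true and called allele dosages."""
    n = len(truth)
    if n != len(called) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_t = sum(truth) / n
    mean_c = sum(called) / n
    cov = sum((t - mean_t) * (c - mean_c) for t, c in zip(truth, called))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_c = sum((c - mean_c) ** 2 for c in called)
    if var_t == 0 or var_c == 0:
        raise ValueError("r^2 is undefined when either sequence is constant")
    return cov * cov / (var_t * var_c)


# Hypothetical dosages of one allele in six samples.
true_dosage = [0, 1, 2, 1, 0, 2]
called_dosage = [0, 1, 2, 1, 1, 2]  # one sample called with an extra copy
print(f"r^2 = {r_squared(true_dosage, called_dosage):.3f}")
```

Unlike call accuracy, this measure accounts for allele frequency: for a rare allele, calling it absent in every sample gives high accuracy, while r² is low or undefined.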

Gyper’s high accuracy is achieved by creating partial order graphs that represent all the different alleles, and then aligning read pairs to them. Aligning all read pairs independently to every available reference allele, of which there can sometimes be up to 4000, is extremely time consuming. Instead we create a single graph and align the read pairs to that, resulting in much faster typing. Partial order graphs have proven to be a good way to represent variation, but we had not seen them used for our purpose before. They allow us to add a wide variety of constraints to the genotyper to extract as much information as we possibly can. Because the genotyper is so quick, we are able to optimize the parameters for these constraints by training at a large scale.
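The idea of aligning once to a shared graph, rather than once per allele, can be illustrated with a heavily simplified toy. It assumes equal-length, pre-aligned alleles without insertions or deletions, and exact matching at a known offset. The allele names and sequences are made up, and this is not Gyper's actual graph structure or alignment algorithm.

```python
from typing import Dict, List, Set

Graph = List[Dict[str, Set[str]]]  # per position: base -> alleles carrying it


def build_graph(alleles: Dict[str, str]) -> Graph:
    """Collapse equal-length allele sequences into one column-wise graph."""
    lengths = {len(seq) for seq in alleles.values()}
    if len(lengths) != 1:
        raise ValueError("this toy graph requires equal-length alleles")
    graph: Graph = [dict() for _ in range(lengths.pop())]
    for name, seq in alleles.items():
        for pos, base in enumerate(seq):
            # Alleles sharing a base at a position share a single node.
            graph[pos].setdefault(base, set()).add(name)
    return graph


def compatible_alleles(graph: Graph, read: str, offset: int) -> Set[str]:
    """Walk the graph once and return the alleles the read matches exactly."""
    if not read or offset < 0 or offset + len(read) > len(graph):
        return set()
    candidates = set(graph[offset].get(read[0], set()))
    for i in range(1, len(read)):
        candidates &= graph[offset + i].get(read[i], set())
        if not candidates:
            break
    return candidates


# Hypothetical alleles that differ at two positions.
alleles = {"A*01": "ACGTAC", "A*02": "ACGAAC", "A*03": "ATGAAC"}
graph = build_graph(alleles)
print(sorted(compatible_alleles(graph, "CGT", 1)))
print(sorted(compatible_alleles(graph, "GAA", 2)))
```

A single pass over the read yields the set of supporting alleles, whereas the per-allele approach repeats the comparison for every reference sequence.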

Gyper is very extensible: it was created as a generic genotyper that is not restricted to genotyping HLA types. It can easily be extended to genotype any genomic variation (e.g. SNPs, insertions, and deletions). We are certain that it can be used in a wide variety of applications.
