For all of the retrotransposon analysis in this project, I used the annotation of repetitive regions created by RepeatMasker [8] for the UCSC Genome Browser [200].

RepeatMasker is a set of software tools designed to screen DNA for interspersed repeats and low complexity regions. The output is an annotation of the repetitive regions in the query sequence, and, if desired, a copy of the query sequence with the repeats “masked”: either converted to Ns or to lower-case characters. For some commonly-used genomes, such as human and mouse, ready-made repeat annotations are available. These annotations include the location and type of each repeat, and summary statistics about its identification. In addition, recent updates from RepeatMasker also include information about how fragments of repeats, particularly retrotransposons, may be related. For example, if an ERV has had a SINE inserted in the middle of it, the now separated pieces of the ERV would be represented as a single element in two pieces.
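The two masking styles described above can be illustrated with a minimal Python sketch. This is not RepeatMasker's own code: the function name `mask_repeats`, the 0-based half-open coordinate convention, and the example sequence and intervals are all assumptions made for illustration.

```python
def mask_repeats(sequence, repeats, soft=False):
    """Return a copy of `sequence` with the given repeat intervals masked.

    `repeats` is an iterable of (start, end) pairs, assumed here to be
    0-based and half-open. Hard masking replaces repeat bases with N;
    soft masking converts them to lower case.
    """
    masked = list(sequence.upper())
    for start, end in repeats:
        # Clip each interval to the bounds of the sequence.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower() if soft else "N"
    return "".join(masked)


query = "ACGTACGTACGT"
annotation = [(2, 5), (8, 10)]  # made-up repeat coordinates

print(mask_repeats(query, annotation))             # hard-masked copy
print(mask_repeats(query, annotation, soft=True))  # soft-masked copy
```

Soft masking keeps the underlying bases available to downstream tools, whereas hard masking discards them.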

RepeatMasker was originally developed by Smit et al. in the 1990s [201]. The underlying method for identifying repeats has not changed radically since then.

RepeatMasker takes a set of reference repeat sequences, and searches for matching sequences in the query. For the ready-made annotations available for mouse and human, the reference sequences are based on two libraries of consensus repeat sequences:

• Repbase [202]: a library of manually curated submissions from researchers, maintained by the Genetic Information Research Institute (GIRI)

• Dfam [203]: a more recent library that uses hidden Markov models to identify consensus sequences

Given a consensus library, RepeatMasker can use one of several tools to perform the search step, depending on the relative importance of factors such as speed and precision.

There are several advantages to using RepeatMasker. It is actively maintained and updated, and represents many years of expertise in the field. It is also popular amongst groups working in the field, so comparisons with other studies are more straightforward, especially given the established nomenclature and categorisation used by RepeatMasker. The integration with the UCSC Genome Browser makes visualisation and comparison with other annotations (e.g., reference transcriptomes) easy. In general, the availability of a ready-made, high-quality annotation saves the significant time and resources required to produce one. Repbase, the repeat reference library used by RepeatMasker, is manually curated and is the most comprehensive library of repeats available, reflecting many years of experience in the field.

However, there are also problems associated with RepeatMasker. While it is maintained, it is not always up to date with the latest reference sequences.

Similarly, while the UCSC Genome Browser advertises itself as having the most recent RepeatMasker releases, this is not always the case. For example, the most recent Repbase update was in 2015, while the most recent mm10 repeat annotation available from RepeatMasker was created in 2013. Likewise, the RepeatMasker annotation currently available on the UCSC Genome Browser was created in 2012, but the most recent data on the RepeatMasker website is labelled as 2013-04-22.

The use of Repbase may affect RepeatMasker results. While manual curation can be advantageous, as it leverages human intelligence and expertise, it can also lead to biases, as acknowledged by the maintainers of Repbase [204], although they have taken steps to reduce unintended bias in the submissions to Repbase. In addition, Repbase data and methods are not openly available, and so it is difficult to assess their methods in comparison with others.

The sensitivity of RepeatMasker is difficult to assess, as there is not yet a standard set of benchmarks for repeat annotation [205,206]. There are now many tools designed to identify and annotate repeats, many of which are specific to particular species or types of repeat [206], further increasing the complexity of meaningful comparisons. While these tools may be able to identify repeats with high specificity and sensitivity, the repeats they find cannot be categorised without the use of a reference library such as Repbase. Aside from Repbase, few reference libraries exist. Dfam, for example, can be used, and it has been incorporated into RepeatMasker. However, the methodology described for Dfam suggests that its library is in fact produced using Repbase and RepeatMasker, and so its results may be influenced by the same biases. In addition, de novo repeat annotation is a computationally intensive and time-consuming process.

I decided to use the RepeatMasker annotation available on the UCSC Genome Browser, and the work in this thesis was carried out using the version available there as of December 2015. This decision was motivated by the ease of use of RepeatMasker and the expertise behind it. In addition, being able to quickly visualise the repetitive regions alongside my own datasets proved extremely useful.
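As an illustration of working with such an annotation, the following Python sketch reads a tab-separated dump of the UCSC RepeatMasker table and keeps only retrotransposon classes. It is a sketch under stated assumptions, not the pipeline used in this thesis. The column order in `RMSK_COLUMNS` is assumed from the UCSC `rmsk` table schema and should be checked against the downloaded file. The choice of LINE, SINE and LTR as the retrotransposon classes, the function name, and the two example rows are likewise made up for illustration.

```python
import csv
import io

# Assumed column order of a UCSC rmsk table dump (tab-separated, no header).
RMSK_COLUMNS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily", "repStart", "repEnd",
    "repLeft", "id",
]

# Assumed set of repeat classes to treat as retrotransposons.
RETRO_CLASSES = {"LINE", "SINE", "LTR"}


def read_retrotransposons(handle):
    """Yield (chrom, start, end, strand, name, class, family) for each
    row whose repeat class is in RETRO_CLASSES."""
    reader = csv.DictReader(handle, fieldnames=RMSK_COLUMNS, delimiter="\t")
    for row in reader:
        if row["repClass"] in RETRO_CLASSES:
            yield (
                row["genoName"], int(row["genoStart"]), int(row["genoEnd"]),
                row["strand"], row["repName"], row["repClass"], row["repFamily"],
            )


# Two made-up rows standing in for a real table dump.
example = (
    "585\t1000\t100\t0\t0\tchr1\t3000000\t3000500\t-192471471\t-\t"
    "L1_Mus3\tLINE\tL1\t-3000\t3500\t3000\t1\n"
    "585\t300\t50\t0\t0\tchr1\t3001000\t3001040\t-192470931\t+\t"
    "(TA)n\tSimple_repeat\tSimple_repeat\t1\t40\t0\t2\n"
)
records = list(read_retrotransposons(io.StringIO(example)))
print(records)
```

For a real file, the `io.StringIO` handle would be replaced by a text-mode handle on the downloaded dump, for example one opened with `gzip.open(path, "rt")`.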

The recent inclusion of fragment-joining information was also extremely useful in accurately quantifying retrotransposon transcription. At the time of writing, more recent versions of the RepeatMasker mouse annotation have become available, and it would be of interest to repeat the analysis described here with the new annotation. It would also be advisable to experiment with other repeat identification software and compare the results. More accurate retrotransposon identification should reduce noise and clarify the existing results.
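To show how the fragment-joining information mentioned above could feed into quantification, the sketch below groups fragments that share an element identifier and sums their lengths and read counts. It is a hypothetical illustration, not the method used in this thesis. The record layout, the field names (`element_id`, `reads`), the function name, and the example values are all assumptions, loosely modelled on the idea that fragments of one insertion share an identifier.

```python
from collections import defaultdict


def join_fragments(fragments):
    """Group fragments by element identifier and summarise each element.

    `fragments` is an iterable of dicts with the (hypothetical) keys
    'element_id', 'chrom', 'start', 'end', 'name' and 'reads'.
    """
    grouped = defaultdict(list)
    for frag in fragments:
        grouped[frag["element_id"]].append(frag)

    elements = {}
    for element_id, frags in grouped.items():
        elements[element_id] = {
            "name": frags[0]["name"],
            "n_fragments": len(frags),
            # Total annotated length across all fragments of the element.
            "length": sum(f["end"] - f["start"] for f in frags),
            # Reads summed over fragments, so the element is counted once.
            "reads": sum(f["reads"] for f in frags),
        }
    return elements


# Made-up example: an ERV split into two pieces by a SINE insertion.
fragments = [
    {"element_id": 1, "chrom": "chr1", "start": 100, "end": 400,
     "name": "ERV_example", "reads": 12},
    {"element_id": 2, "chrom": "chr1", "start": 400, "end": 550,
     "name": "SINE_example", "reads": 3},
    {"element_id": 1, "chrom": "chr1", "start": 550, "end": 900,
     "name": "ERV_example", "reads": 20},
]
joined = join_fragments(fragments)
for element_id, summary in joined.items():
    print(element_id, summary)
```

Without the shared identifier, the two ERV pieces would be counted as separate elements, each with a shorter length and fewer reads.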