CHAPTER II: THEORETICAL FRAMEWORK
2.5 EPISTEMIC FOUNDATIONS
2.5.1 Basis of the Exclusionary Rule
The neighborhood of cas genes (comprising more than 20 genes) was
initially identified and characterized by Makarova et al. in 2002 by genomic context
analysis, but it was wrongly predicted to be a novel DNA repair system specific for thermophiles, as no connection with CRISPR was detected at the time. Almost
simultaneously, Jansen et al. identified by in silico analysis four genes located in the
vicinity of CRISPR loci that were designated CRISPR-associated (cas1-4; Jansen et al.
2002). The first protein found to bind to CRISPR loci was a genus-specific
uncharacterized protein in Sulfolobus species corresponding to sso454 (Peng et al.
2003), recognizing double and single repeat DNA sequences and producing an
opening on the opposite side. Haft et al. in 2005 identified a guild of 45 Cas protein
families using hidden Markov models, a categorization refined by Makarova et al. in 2006
taking into account genomic context information, resulting in 25 Cas protein families
(Makarova et al. 2006). These families are proposed to be involved in the generation,
expansion, maintenance, transfer between genomes and function of the CRISPR elements.
With the rapid growth of experimental characterisation and the identification of novel CRISPR systems in more prokaryotic genomes, it became apparent that existing CRISPR/Cas classification systems had become increasingly inadequate and did not reflect the emerging phylogenetic relationships between the system components. Moreover, with the elucidation of many Cas protein structures from different families and the analysis of an increasing number of gene sequences, previously undetected homologous relationships emerged, which enabled the unification of certain Cas families and the
identification of novel ones (Makarova et al. 2011b). As a result, Makarova and colleagues (2011a) recently proposed an updated, polythetic classification of CRISPR/Cas systems based on gene composition, operon organisation and the phylogenetic and functional relationships between Cas genes. According to this classification, CRISPR/Cas systems are organised into three phylogenetically distinct types (I-III),
and each major type can be further divided into individual subtypes (Makarova et al.
2011a and b). This classification is summarised in figure 1.7 and the subtype distribution in table 1.1.
Figure 1.7: Outline of the main types and subtypes of the CRISPR/Cas systems and their phylogenetic relations
The most common composition and arrangement of cas genes is shown for each subtype, but gene order may vary in each organism. Gene families are color-coded and the family name can be seen under each gene. Signature genes for each main type are highlighted in green, and for each subtype in red. The star in gene cas10d indicates a putative inactivated polymerase - HD domain. The letters above certain genes stand for: RE: processing endonuclease for crRNA maturation; L: large subunits of effector complexes mediating interference; S: small subunits of effector complexes; R: subunits of effector complexes that belong to the RAMP superfamily (Repeat Associated Mysterious Proteins; described in chapter 3). Dashed genes in type III systems may not be part of the same operon. Adapted from Makarova et al. 2011a.
The three main CRISPR/Cas types share a common core of two genes, cas1
and cas2, which are highly conserved and are found in almost all CRISPR-containing
species. Cas1 is a highly conserved, basic protein that belongs to COG1518 (all COG
groups mentioned in this text refer to the analysis performed by Makarova et al. 2002).
Comparative sequence analysis and certain conserved residue patterns indicate that it
might be a putative novel nuclease and/or integrase (Makarova et al. 2002). Metal-
dependent nuclease activity on ss/ds DNA (non-sequence specific) was confirmed by
Wiedenheft et al. (2009) along with the elucidation of the Cas1 structure from P.
aeruginosa which revealed a unique fold (figure 1.8). Additionally, Cas1 from S. solfataricus exhibited a high binding affinity for ss/ds DNA, ss/ds RNA and DNA-RNA
hybrids, as well as strand annealing activity (Han et al. 2009).
Figure 1.8: Crystal structure of Cas1
Cartoon representation of the P. aeruginosa Cas1 homodimer (adapted from Wiedenheft et al. 2009). The N-terminal domain of chain A is colored in yellow, and the C-terminal α-helical domain, which contains the active site, in gray. Chain B is colored in light blue. Conserved residues making up the active site are in red. Three of the residues (E190, H254 and D268) coordinate a manganese ion (green sphere).
The cas2 gene encodes a small (80-120 aa) protein that is a member of COG1343. Distant
similarities were found between members of this COG and a class of sequence-dependent, single-strand RNA nucleases called PIN-domain nucleases (after their identification in the N-terminus of the pilin biogenesis PilT protein), leading to the
speculation that Cas2 might also possess ribonuclease activity (Makarova et al. 2006).
The structure of Cas2 from S. solfataricus was solved by Beloglazova et al. (2008)
revealing an RRM-like domain (RNA recognition motif; structural motif consisting of
four β-strands and two helices arranged in an α/β sandwich) (figure 1.9), while the
protein exhibited metal-dependent ssRNase activity. The universal distribution of this gene pair, along with experimental evidence discussed in subsequent paragraphs, has led to the assumption that Cas1 and Cas2 mediate the integration of novel spacer
sequences into the CRISPR loci (reviewed in Sorek et al. 2008; van der Oost et al.
Sontheimer, 2010; Deveau et al. 2010; Al-Attar et al. 2011). The role of these core proteins in the current scheme of the CRISPR mode of action will be discussed later.
Figure 1.9: Crystal structure of Cas2
Structure of Cas2 from S. solfataricus, solved by the SSPF (PDB code: 2IVY). The active conformation is a homodimer, with the interface formed by the tandem β-sheets in each monomer that make up the RRM motif. Conserved residues are located on the loops at the edge of the central cleft, at the bottom of the structure.
Type I systems are characterised by the presence of cas3 (COG1203), a gene
encoding a protein with conserved superfamily II helicase motifs and an additional
HD-nuclease domain, encoded separately in certain subtypes (Makarova et al. 2002).
Type I systems also contain multiple representatives of the RAMP superfamily (Repeat Associated Mysterious Proteins), which are suggested to form large heteromeric
complexes and take part in invader silencing (Brouns et al. 2008). The RAMP
superfamily encompasses a large variety of protein families with ferredoxin-like folds,
predicted to have RNA-binding activity (Makarova et al. 2002, 2006; Haft et al. 2005)
and will be discussed in more detail in chapter 3. Characteristic RAMP families associated with type I subtypes include Cas5, Cas6 and Cas7 (COG1857) protein
families (Makarova et al. 2011a). Cas6 has been shown to possess metal-independent,
sequence-specific RNase activity, and is the processing endonuclease that generates the mature interfering RNA units (hereafter referred to as crRNAs) from the primary CRISPR transcript in every type/subtype with which it is associated. An additional protein found in four out of six type I subtypes and a type II subtype is Cas4
(COG1468), a member of the RecB exonuclease family (Jansen et al. 2002, Makarova
et al. 2002). A number of studies have concluded that the targets of type I systems are
DNA viruses and plasmids (among others Brouns et al. 2008; Marraffini et al. 2008,
Garneau et al. 2010).
Type II systems have been found only in bacteria and contain only the
signature gene cas9 (COG3513), the core cas1/cas2 genes and either cas4 or csn2, a
modular gene (Makarova et al. 2011a). Cas9 family members are predicted to be large
(about 1000 residues), multidomain proteins including an N-terminal RuvC-like domain
(RuvC is a Holliday junction resolvase that belongs to the RNase H fold; Aravind et al.
2000) and an HNH nuclease domain, common in restriction endonucleases (Makarova et al. 2002). Targeting of plasmid and phage DNA was demonstrated in vivo for this
system, and Cas9 is implicated in the interference stage although no biochemical
characterisation has been presented (Barrangou et al. 2007; Garneau et al. 2010).
Type III systems are characterised by the presence of cas10 (COG1353).
Among the identified domains of this large multidomain protein (~1000 residues) are a permuted HD-superfamily hydrolase domain near the N-terminus, a globular uncharacterised
α+β domain, a zinc ribbon (a well-known nucleic acid-interacting domain) and the core
palm domain of DNA/RNA polymerases and nucleotide cyclases near the C-terminus
(Makarova et al. 2002, 2006). The function of this protein is yet to be elucidated, but it
has been shown to form multimeric complexes with the additional RAMP Cas proteins
in type III-B operons which can effectively target RNA in vitro (Hale et al. 2009).
Targeting of DNA has also been demonstrated in vivo for type III-A systems (Marraffini
and Sontheimer, 2008). cas6 is also part of type III systems. The core cas1 and cas2
genes are occasionally missing from type III operons, but in these cases they are
found to co-exist with other CRISPR/Cas systems (type I or type II) encoding cas1 and
cas2 in the same genome. This supports the theory that cas1 and cas2 are involved in
a different stage of CRISPR functioning, and co-regulation is not necessary (Makarova et al. 2011a). Mechanistic details of each stage in every CRISPR/Cas type will be discussed in subsequent sections.
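The signature-gene rule described above (cas3 for type I, cas9 for type II, cas10 for type III, with cas1 and cas2 as the shared core) can be illustrated with a minimal sketch. This is a deliberate simplification: the actual classification of Makarova et al. (2011a) is polythetic and also weighs operon organisation and phylogeny, which this sketch ignores. The function names and the representation of a locus as a plain list of gene names are assumptions made for illustration only.

```python
# Signature genes for the three main types, as stated in the text.
# Simplification: subtypes, operon organisation and phylogeny are ignored.
SIGNATURE_GENES = {"cas3": "I", "cas9": "II", "cas10": "III"}

# Core genes shared by the three main types.
CORE_GENES = {"cas1", "cas2"}


def classify_locus(genes):
    """Return the main types whose signature gene occurs in `genes`.

    `genes` is a hypothetical flat list of gene names for one locus.
    An empty list means no signature gene was found.
    """
    present = {g.lower() for g in genes}
    return sorted(t for sig, t in SIGNATURE_GENES.items() if sig in present)


def has_core(genes):
    """Return True if both core genes (cas1 and cas2) are present."""
    return CORE_GENES <= {g.lower() for g in genes}


if __name__ == "__main__":
    type_i_like = ["cas1", "cas2", "cas3", "cas5", "cas6", "cas7"]
    type_iii_without_core = ["cas10", "cas6"]
    print(classify_locus(type_i_like), has_core(type_i_like))
    print(classify_locus(type_iii_without_core), has_core(type_iii_without_core))
```

The second example mirrors the case mentioned in the text in which a type III operon lacks cas1 and cas2; in such genomes the core genes would be expected in another CRISPR/Cas locus, which this per-locus sketch does not check.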