3. Theoretical Framework
3.5 Conceptions of Family
About 90% of residues in proteins are found in α-helices (38%), β-strands (20%), or
reverse turns (32%). Predicting secondary structure from amino acid sequence as a step
towards predicting total protein structure is very widespread (Creighton, 1993),
although this method was not used in this thesis. If the fold structure has not been
determined for the superfamily, or if the sequences are very divergent, prediction of
secondary structure from the amino acid sequence is a major means of identifying the
correct fold. The secondary structure prediction methods discussed here range from
statistically based approaches to neural networks, all of which have advantages and
disadvantages. Combining the results of these different techniques increases the
reliability of the overall prediction, and shows which α-helices and β-sheets are
consistently predicted by all of the methods. The accuracy of secondary structure
prediction methods is vastly improved
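
The idea of keeping only those elements that every method predicts can be sketched as follows. This is a minimal illustration, not a procedure taken from the thesis: the function name `consensus`, the one-letter state codes (H, E, C) and the three example predictions are all assumptions made for the example.

```python
def consensus(predictions):
    """Keep a state only where every method agrees.

    predictions: list of equal-length strings over H (helix),
    E (strand) and C (coil/other). Positions without unanimous
    agreement are marked '-'.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    return "".join(
        column[0] if len(set(column)) == 1 else "-"
        for column in zip(*predictions)
    )


# Hypothetical outputs of three methods for the same 10-residue sequence.
methods = {
    "method_a": "HHHHHCCEEE",
    "method_b": "HHHHCCCEEE",
    "method_c": "HHHHHCEEEE",
}
agreed = consensus(list(methods.values()))
print(agreed)
```

Requiring unanimous agreement is the strictest possible choice; a majority vote would retain more positions at the cost of lower confidence in each one.
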
The Chou-Fasman method is a statistical prediction method that was derived from the
propensities of each amino acid to occur either in an α-helix or in a β-sheet in 15
protein structures (Chou & Fasman, 1978). According to these propensities, each amino
acid is allocated to one of six classes depending on its likelihood of forming an
α-helix, which range from strong α-helix former to strong α-helix breaker, and to one of
six classes depending on its likelihood of forming a β-sheet, which range from strong
β-sheet former to strong β-sheet breaker. A series of rules is used to assign secondary
structure elements to clusters of probable α-helix and β-sheet residues in an amino acid
sequence. Although this method is fairly uncomplicated in concept, it has been
criticised for its simple statistical approach, its arbitrary prediction rules, and
its failure to consider the chemical and physical properties of the amino acids
(King et al., 1996).
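
A propensity-based assignment of this kind might look roughly like the sketch below. It is a deliberately simplified stand-in: the propensity values are illustrative placeholders rather than the published Chou-Fasman table, and the single rule used here (a window whose mean propensity exceeds a threshold) replaces the six classes and the rule set, which the text does not spell out. The names `HELIX_PROPENSITY`, `helix_nuclei` and `mark_helix` are hypothetical.

```python
# Illustrative placeholder propensities (NOT the published table).
# Values above 1.0 stand for helix formers, below 1.0 for breakers.
HELIX_PROPENSITY = {
    "E": 1.5, "A": 1.4, "L": 1.2, "K": 1.1,
    "S": 0.8, "G": 0.6, "P": 0.6,
}


def helix_nuclei(sequence, window=6, threshold=1.0, default=1.0):
    """Start indices of windows whose mean helix propensity exceeds threshold."""
    starts = []
    for i in range(len(sequence) - window + 1):
        segment = sequence[i:i + window]
        mean = sum(HELIX_PROPENSITY.get(aa, default) for aa in segment) / window
        if mean > threshold:
            starts.append(i)
    return starts


def mark_helix(sequence, window=6, threshold=1.0):
    """Mark every residue covered by a qualifying window as 'H', others as '-'."""
    states = ["-"] * len(sequence)
    for start in helix_nuclei(sequence, window, threshold):
        for j in range(start, start + window):
            states[j] = "H"
    return "".join(states)


print(mark_helix("AELKAEGPGSPG"))
```

Because the window average ignores where in the window the weak residues sit, this sketch can extend a helix over breaker residues at its edge, which is one concrete form of the arbitrariness the method has been criticised for.
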
The GOR (Garnier, Osguthorpe and Robson) method is a more complex statistical method
(GOR-I; Garnier et al., 1978). The method was developed using a database of 26 protein
structures, which was later updated to a database containing 75 structures (GOR-III;
Gibrat et al., 1987). Each residue is unambiguously assigned to one of four possible
conformations: α-helix, β-sheet, β-turn (2-residue turn) or random coil. The basis of
this method is that the amino acid sequence and the secondary structure are two
distinct messages related by a translation process that can be examined using
information theory. Although in theory the conformation of any particular residue
depends on every other amino acid in the protein, the most significant influence on
the conformation of a residue is exerted by the eight residues on either side of it
(Robson & Pain, 1971; Robson & Suzuki, 1976). Structure prediction uses three kinds of
information: the information a residue carries about its own secondary structure; the
information a residue carries about the secondary structure of a second residue within
eight residues along the sequence, irrespective of the second residue’s type; and the
information a residue carries about the secondary structure of a second residue,
taking into account the second residue’s type. This method is theoretically elegant
and allows the separation of the different types of information involved in the
folding of a protein. However, it also neglects the physical and chemical properties
of the amino acids and it deliberately neglects protein folding
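
The window-based scoring can be sketched as below. This is an illustration under stated assumptions, not the GOR implementation: the lookup table `info`, its layout and its values are hypothetical, and only the first two kinds of information are covered (a residue's information about itself at offset 0, and about neighbours up to eight positions away regardless of the neighbour's type).

```python
WINDOW = 8  # residues considered on either side, as stated in the text
STATES = ("H", "E", "T", "C")  # helix, sheet, turn, coil (assumed codes)


def predict_state(sequence, position, info):
    """Pick the state with the largest summed information at one position.

    info[state][(offset, amino_acid)] is a hypothetical lookup of the
    information that an amino acid at position + offset carries about
    `state` at `position`; missing entries contribute 0.
    """
    scores = {}
    for state in STATES:
        total = 0.0
        for offset in range(-WINDOW, WINDOW + 1):
            j = position + offset
            if 0 <= j < len(sequence):
                total += info[state].get((offset, sequence[j]), 0.0)
        scores[state] = total
    # Ties are resolved in favour of the earlier entry in STATES.
    return max(STATES, key=lambda s: scores[s]), scores


# Toy information values, invented for the example.
info = {state: {} for state in STATES}
info["H"][(0, "A")] = 0.5
info["H"][(-1, "E")] = 0.3
info["E"][(0, "V")] = 0.6
info["C"][(0, "G")] = 0.4

sequence = "EAVG"
prediction = "".join(predict_state(sequence, i, info)[0] for i in (1, 2, 3))
print(prediction)
```

Summing independent contributions per offset is what keeps the different kinds of information separable; adding the third, pair-dependent kind would require a table keyed on both residue types.
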
PHD (Profile network system from HeiDelberg) is a secondary structure prediction
algorithm that is based on a neural network learning system and a multiple sequence
alignment. PHD had an accuracy greater than 70% when cross-validated on more than 100
unique structures (Rost & Sander, 1993; Rost et al., 1994). Homologous proteins have
the same three-dimensional fold and approximately equivalent secondary structure
profiles down to a level of around 25-30% identical residues, and hence a multiple
sequence alignment of the protein family can contain more structural information than
a single sequence (Rost & Sander, 1993a). First, a profile of the frequencies of amino
acids occurring at each sequence position is calculated from a multiple sequence
alignment, and this is then processed by a three-level network system. The first level
is a neural network that has been trained to classify residues according to three
states of secondary structure: α-helix, β-strand and loop. In the second level,
stretches of predicted residues are analysed, contiguous regions of residues that are
predicted to have the same structure are assigned as secondary structure elements, and
unlikely stretches of secondary structure elements are discarded; e.g. an unlikely
prediction from the first level such as HHHEEHH (H, helix; E, strand) will be altered
to HHHHHHH. At this stage, the overall prediction accuracy is not significantly
improved; only the lengths of the predicted segments become noticeably more consistent
with those observed in protein structures than in the output from level one. The third
level, or jury decision, is effectively a noise reduction step that arithmetically
averages 12 different network predictions. These networks have been trained
differently on 'balanced' and 'unbalanced' datasets. In 'balanced' training the
numbers of examples of α-helix, β-strand and loop presented in the training set are
equal, as opposed to the 31% α-helix, 22% β-strand and 47% loop found in the database
or 'unbalanced' training set (Rost & Sander, 1993b).
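
Two of the steps described above, the alignment profile and the jury decision, can be sketched without any neural network code. This is a minimal illustration rather than the PHD implementation: the function names, the gap character `-`, the H/E/L state order and the example scores are assumptions, and the network outputs are simply given as numbers instead of being computed.

```python
from collections import Counter


def alignment_profile(alignment):
    """Per-column amino acid frequencies for equal-length aligned sequences."""
    profile = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]  # ignore gaps
        total = len(residues)
        counts = Counter(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def jury_decision(network_outputs):
    """Average (helix, strand, loop) scores over several networks and
    return the state with the highest mean at each position."""
    states = "HEL"
    n_networks = len(network_outputs)
    prediction = []
    for per_position in zip(*network_outputs):
        means = [
            sum(scores[k] for scores in per_position) / n_networks
            for k in range(3)
        ]
        prediction.append(states[means.index(max(means))])
    return "".join(prediction)


# Toy alignment and toy outputs of two hypothetical networks.
profile = alignment_profile(["ACD", "ACE", "A-E"])
net1 = [(0.6, 0.3, 0.1), (0.4, 0.5, 0.1), (0.2, 0.2, 0.6)]
net2 = [(0.8, 0.1, 0.1), (0.5, 0.3, 0.2), (0.1, 0.3, 0.6)]
print(profile[0])
print(jury_decision([net1, net2]))
```

In the toy data the first network alone would call the second position a strand, while the averaged scores favour helix, which is the noise reduction effect the jury step is meant to provide.
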
The SAPIENS (Secondary structure and Accessibility class Prediction Including
ENvironment-dependent Substitution tables) prediction method is based on the
amino acids in aligned sequences (Wako & Blundell, 1994a & b). Initially, sequences
are considered individually and the preferred secondary structure state (α-helix, β-sheet,
buried coil or exposed coil) is assigned to each residue using propensity and substitution
tables. These assignments are modified for neighbouring residue cooperativity and
according to the positions of residues that are typically found at the N- and C-terminal
caps of secondary structure elements. Next, the secondary structure assignments are
altered using predicted solvent accessibility patterns, which are compared to those
observed for secondary structure elements in known protein structures. Finally, the
conformational state at each residue position is averaged across the multiple sequence
alignment and the most dominant state is used.
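
The final step, choosing the dominant state at each alignment position, can be sketched as a column-wise majority vote. This is a simplification of the averaging described above, not the SAPIENS code: the function name `dominant_states`, the one-letter codes (H helix, E sheet, B buried coil, X exposed coil) and the example assignments are all hypothetical.

```python
from collections import Counter


def dominant_states(aligned_states):
    """Most common state per alignment column.

    aligned_states: equal-length strings of per-sequence state
    assignments; '-' marks an alignment gap and is ignored.
    A column consisting only of gaps yields '-'.
    """
    result = []
    for column in zip(*aligned_states):
        counts = Counter(state for state in column if state != "-")
        # On a tie, most_common returns the state encountered first,
        # which is an arbitrary choice in this sketch.
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)


# Hypothetical per-sequence assignments for three aligned sequences.
assignments = ["HHHEEX", "HHHEBX", "HB-EEX"]
print(dominant_states(assignments))
```
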