
About 90% of residues in proteins are found in either α-helices (38%), β-strands (20%), or reverse turns (32%). The practice of predicting secondary structure from amino acid sequence on the way to predicting total protein structure is very widespread (Creighton, 1993), although this method was not used in this thesis. If the fold structure has not been determined for the superfamily or if the sequences are very divergent, prediction of secondary structure from the amino acid sequence is a major means of identifying the correct fold. The secondary structure prediction methods discussed range from statistically based approaches to neural networks, all of which have advantages and disadvantages. By combining the results of these different techniques, the reliability of the overall prediction is increased, and it can then be seen which α-helices and β-sheets are consistently predicted by all these methods. The accuracy of secondary structure prediction methods is vastly improved
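
The idea of keeping only the elements that every method agrees on can be sketched as follows. This is a minimal illustration, not the procedure used in the thesis: the function name `unanimous_consensus`, the one-letter state codes (H, E, C) and the three example prediction strings are all hypothetical.

```python
from typing import List


def unanimous_consensus(predictions: List[str], unknown: str = "-") -> str:
    """Keep a state only where every method predicts the same state."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same residues")
    consensus = []
    for column in zip(*predictions):
        # A column holds the state each method assigned to one residue.
        consensus.append(column[0] if len(set(column)) == 1 else unknown)
    return "".join(consensus)


# Hypothetical outputs of three methods for the same 12-residue sequence
# (H = helix, E = strand, C = coil); these are not real predictions.
methods = [
    "CHHHHHCCEEEC",
    "CHHHHCCCEEEC",
    "CCHHHHCEEEEC",
]
print(unanimous_consensus(methods))  # C-HHH-C-EEEC
```

Residues marked `-` are those on which the methods disagree; the cores of the helix and the strand survive, while their boundaries remain uncertain.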

The Chou-Fasman method is a statistical prediction method that was derived from the propensities for each amino acid to occur either in an α-helix or in a β-sheet in 15 protein structures (Chou & Fasman, 1978). According to these propensities, each amino acid is allocated to one of six classes depending on its likelihood of forming an α-helix, which range from strong α-helix former to strong α-helix breaker, and to one of six classes depending on its likelihood of forming a β-sheet, which range from strong β-sheet former to strong β-sheet breaker. A series of rules is used to assign secondary structure elements to clusters of probable α-helix and β-sheet residues in an amino acid sequence. Although this method is fairly uncomplicated in concept, it has been criticised for its simple statistical approach, its arbitrary prediction rules, and its failure to consider the chemical and physical properties of the amino acids (King et al., 1996).
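
A propensity-based rule of this kind might look roughly like the sketch below. It is a simplified illustration under stated assumptions: the propensity values in `HELIX_PROPENSITY` are placeholders and not the published Chou-Fasman values, and the window length, former cutoff and minimum number of formers are hypothetical parameters chosen for the example.

```python
from typing import Dict, List

# Illustrative placeholder helix propensities (NOT the published
# Chou-Fasman values); above 1.0 favours helix, below 1.0 disfavours it.
HELIX_PROPENSITY: Dict[str, float] = {
    "A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4, "K": 1.1,
    "G": 0.6, "P": 0.6, "S": 0.8, "N": 0.7, "T": 0.8,
}


def helix_nuclei(sequence: str, window: int = 6, min_formers: int = 4,
                 former_cutoff: float = 1.0) -> List[int]:
    """Return start indices of windows containing enough helix formers.

    Simplified rule (an assumption for illustration): a window is a
    candidate helix nucleus when at least `min_formers` of its residues
    have a propensity above `former_cutoff`. Residues missing from the
    table are treated as neutral (1.0) and so do not count as formers.
    """
    starts = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        formers = sum(
            HELIX_PROPENSITY.get(res, 1.0) > former_cutoff for res in segment
        )
        if formers >= min_formers:
            starts.append(start)
    return starts


print(helix_nuclei("GSAELMKAGPGS"))  # [0, 1, 2, 3, 4]
```

The sketch also makes the criticism concrete: the outcome depends entirely on a lookup table and on thresholds that have no physical justification.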

The GOR (Garnier, Osguthorpe, and Robson) method is a more complex statistical method (GOR-I; Garnier et al., 1978). The method was developed using a database of 26 protein structures, which was later updated to a database containing 75 structures (GOR-III; Gibrat et al., 1987). Each residue is unambiguously assigned to one of 4 possible conformations: α-helix, β-sheet, β-turn (2-residue turn) or random coil. The basis of this method is that the amino acid sequence and the secondary structure are two distinct messages that are related by a translation process that can be examined using information theory. Although in theory the conformation of any particular residue is dependent on every other amino acid in the protein, the most significant influence on the conformation of a residue is exerted by the eight residues either side of it (Robson & Pain, 1971; Robson & Suzuki, 1976). Structure prediction uses the information a residue carries about its own secondary structure, the information a residue carries about the secondary structure of a second residue within eight residues along the sequence irrespective of the second residue's type, and the information a residue carries about the secondary structure of a second residue dependent on the second residue's type. This method is theoretically elegant and it allows the separation of the different types of information involved in the folding of a protein. However, it also neglects the physical and chemical properties of the amino acids and it deliberately neglects protein folding
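
The window-based scoring described above can be sketched as a sum of information values over the eight residues either side of the position being predicted. This is a hedged illustration only: the table `toy_info` holds made-up values rather than parameters derived from any protein database, the function and state names are hypothetical, and the pair information that depends on the type of the residue being predicted is left out for brevity.

```python
from typing import Dict, Tuple

# Coil is listed first so that ties default to coil (max() keeps the
# first maximal item). H = helix, E = sheet, T = turn, C = coil.
STATES = ("C", "H", "E", "T")
HALF_WINDOW = 8  # eight residues either side of the predicted position

# (state, offset, residue type) -> information value; offset 0 is the
# information a residue carries about its own state.
InfoTable = Dict[Tuple[str, int, str], float]


def predict_state(sequence: str, position: int, info: InfoTable) -> str:
    """Pick the state with the largest summed information in the window."""
    scores = {}
    for state in STATES:
        total = 0.0
        for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
            neighbour = position + offset
            if 0 <= neighbour < len(sequence):
                key = (state, offset, sequence[neighbour])
                total += info.get(key, 0.0)  # missing entries add nothing
        scores[state] = total
    return max(STATES, key=lambda s: scores[s])


# Made-up information values, for illustration only.
toy_info: InfoTable = {
    ("H", 0, "A"): 0.5, ("H", -1, "E"): 0.3,
    ("E", 0, "V"): 0.6, ("E", 1, "V"): 0.2,
}
sequence = "EAVV"
print("".join(predict_state(sequence, i, toy_info) for i in range(len(sequence))))
# CHEE
```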

PHD (Profile network system from HeiDelberg) is a secondary structure prediction algorithm that is based on a neural network learning system and a multiple sequence alignment. PHD had an accuracy greater than 70% when cross-validated on more than 100 unique structures (Rost & Sander, 1993; Rost et al., 1994). Homologous proteins have the same three-dimensional fold and approximately equivalent secondary structure profiles at a level of around 25-30% identical residues, and hence a multiple sequence alignment of the protein family can contain more structural information than a single sequence (Rost & Sander, 1993a). First, a profile of the frequencies of amino acids occurring at each sequence position is calculated from a multiple sequence alignment, and this is then processed by a three-layered network. The first layer is a neural network that has been trained to classify residues according to three states of secondary structure: α-helix, β-strand and loop. In the second layer, stretches of predicted residues are analysed, contiguous regions of residues that are predicted to have the same structure are assigned as secondary structure elements, and unlikely stretches of secondary structure elements are discarded, e.g. an unlikely prediction from the first level such as HHHEEHH (H, helix; E, strand) will be altered to HHHHHHH. At this stage, the predicted segment lengths agree noticeably better with those observed in protein structures than the output from level one, but the overall prediction accuracy is not significantly improved. The third level, or jury decision, is effectively a noise reduction step that takes the arithmetic average of 12 different network predictions. These networks have been trained differently on 'balanced' and 'unbalanced' datasets. In 'balanced' training the numbers of examples of α-helix, β-strand and loop presented in the training set are equal, as opposed to the 31% α-helix, 22% β-strand and 47% loop found in the database or 'unbalanced' training set (Rost & Sander, 1993b).
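
Two of the steps described above, the profile calculation and the jury decision, are simple enough to sketch directly. The code below is an illustration under stated assumptions and not the PHD implementation: the function names, the toy alignment and the per-residue outputs of three hypothetical networks (PHD averages 12) are invented for the example, and the neural networks themselves are not modelled.

```python
from collections import Counter
from typing import Dict, List, Sequence

STATES = ("H", "E", "L")  # helix, strand, loop


def column_profile(alignment: Sequence[str]) -> List[Dict[str, float]]:
    """Amino acid frequencies at each alignment position (gaps ignored)."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append(
            {res: n / total for res, n in counts.items()} if total else {}
        )
    return profile


def jury_decision(
    network_outputs: Sequence[Sequence[Sequence[float]]],
) -> str:
    """Average per-residue (H, E, L) outputs over networks, pick the top state."""
    n_networks = len(network_outputs)
    n_residues = len(network_outputs[0])
    prediction = []
    for i in range(n_residues):
        mean = [
            sum(net[i][k] for net in network_outputs) / n_networks
            for k in range(len(STATES))
        ]
        prediction.append(STATES[mean.index(max(mean))])
    return "".join(prediction)


# Toy alignment of three sequences; "-" marks a gap.
print(column_profile(["ACD", "AC-", "GCD"])[1])  # {'C': 1.0}

# Hypothetical outputs of three networks for two residues.
nets = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.4, 0.5, 0.1], [0.1, 0.6, 0.3]],
    [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]],
]
print(jury_decision(nets))  # HE
```

In the example the second network on its own would call the first residue a strand, but the average over all three networks favours helix, which is the noise reduction effect the jury is meant to provide.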

The SAPIENS (Secondary structure and Accessibility class Prediction Including ENvironment-dependent Substitution tables) prediction method is based on the amino acids in aligned sequences (Wako & Blundell, 1994a & b). Initially, sequences are considered individually and the preferred secondary structure state (α-helix, β-sheet, buried coil or exposed coil) is assigned to each residue using propensity and substitution tables. These assignments are modified for neighbouring residue cooperativity and according to the positions of residues that are typically found at the N- and C-terminal caps of secondary structure elements. Next, the secondary structure assignments are altered using predicted solvent accessibility patterns, which are compared to those observed for secondary structure elements in known protein structures. Finally, the conformational state at each residue position is averaged across the multiple sequence alignment and the most dominant state is used.
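
The final step, taking the dominant state at each alignment position, can be approximated by a column-wise majority vote. This is a minimal sketch and not the SAPIENS code: the state letters (H, E, b for buried coil, e for exposed coil), the function name and the example assignments are hypothetical, and a plain vote stands in for whatever averaging the method actually applies.

```python
from collections import Counter
from typing import Sequence


def dominant_states(aligned_assignments: Sequence[str], gap: str = "-") -> str:
    """Return the most frequent state in each alignment column."""
    result = []
    for column in zip(*aligned_assignments):
        counts = Counter(state for state in column if state != gap)
        # most_common(1) gives the most frequent state; on a tie the
        # state encountered first in the column is returned.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


# Hypothetical per-sequence assignments, already aligned ("-" = gap).
assignments = [
    "HHHbEE",
    "HHbbEE",
    "-HHeEb",
]
print(dominant_states(assignments))  # HHHbEE
```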