Chapter III. Research Methods
3.6 System of variables and indicators
3.6.2. Dependent variable
3.6.2.2 Indicators of the dependent variable
Some of the inherent redundancies in the PDB were introduced in Chapter 2. Any analysis of interface properties that does not take such biases into account is likely to be skewed by over-represented systems. Therefore, before interface properties were analysed, the data sets were filtered and clustered to provide a reliable non-redundant set.
Applying PISA-derived rotation-translation matrices to generate biological assemblies removes artefactual, non-specific crystal packing interfaces. Despite this, a small number of insignificant interfaces remain in the PISA-derived assemblies. These typically comprise only a handful of residues, and manual examination reveals that they are almost exclusively due to peripheral contacts between non-neighbouring chains in higher-order multiprotein systems.
No chain length filtering criteria were applied prior to generation of PICCOLO. This was a deliberate choice: interactions of proteins with small peptides are of interest when considering the effects of mutations on protein function. However, for the purposes of systematically deriving properties of protein interfaces, it is the interaction surfaces of globular proteins that are of most interest. Interactions involving short peptide chains of fewer than 15 valid amino acid residues were therefore removed.
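The chain length filter described above can be sketched as follows. This is a minimal illustration, not the code used to build the data set: the `Interface` record and its field names are hypothetical and do not reflect the PICCOLO schema. Only the threshold of 15 valid residues comes from the text.

```python
from dataclasses import dataclass

MIN_VALID_RESIDUES = 15  # threshold stated in the text


@dataclass(frozen=True)
class Interface:
    # Hypothetical record: field names are illustrative only.
    interface_id: str
    chain_a_valid_residues: int  # number of valid residues in chain A
    chain_b_valid_residues: int  # number of valid residues in chain B


def involves_short_peptide(iface, min_len=MIN_VALID_RESIDUES):
    """True if either chain has fewer than min_len valid residues."""
    shortest = min(iface.chain_a_valid_residues, iface.chain_b_valid_residues)
    return shortest < min_len


def drop_peptide_interfaces(interfaces):
    """Keep only interfaces in which both chains meet the length threshold."""
    return [i for i in interfaces if not involves_short_peptide(i)]
```

A chain of exactly 15 valid residues is retained under this reading, since the text removes chains of fewer than 15.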
3.2 Methods
Figure 3.1 shows the number of residues contributed by each side of the interface (Ri and Rj). For clarity the pairs of interfaces have been ordered by size, with the chain contributing the most residues shown on the x-axis. The inset shows a close-up of the smallest interfaces. Although a threshold of a minimum of 5 contact residues per protein (Ri ≥ 5 and Rj ≥ 5) was initially considered (red dashed line), this would exclude a small number of genuine interfaces. Instead, the criterion that the product of the number of residues from each side of the interface is greater than or equal to 25 was used (Ri × Rj ≥ 25; solid red line). Collectively these filters remove 28,152 interfaces (21.6% of the original 130,336).
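The difference between the two size criteria can be made concrete with a short sketch. The function names are illustrative assumptions. The thresholds and the interface counts are those quoted in the text.

```python
def passes_per_side_threshold(r_i, r_j, minimum=5):
    """Criterion initially considered: each side contributes at least `minimum` residues."""
    return r_i >= minimum and r_j >= minimum


def passes_product_threshold(r_i, r_j, minimum_product=25):
    """Criterion adopted: the product of the two residue counts is at least 25."""
    return r_i * r_j >= minimum_product


# Illustrative (Ri, Rj) pairs, not taken from the data set.
examples = [(5, 5), (13, 2), (3, 9), (4, 6), (2, 2)]
for r_i, r_j in examples:
    print(r_i, r_j,
          passes_per_side_threshold(r_i, r_j),
          passes_product_threshold(r_i, r_j))

# Fraction of interfaces removed, using the counts quoted in the text.
REMOVED, ORIGINAL = 28_152, 130_336
percent_removed = round(100 * REMOVED / ORIGINAL, 1)
```

Any pair that satisfies the per-side criterion also satisfies the product criterion, because two counts of at least 5 multiply to at least 25. The product criterion additionally retains asymmetric interfaces such as (13, 2), in which one side contributes few residues but the other contributes many.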
It would be anticipated that each side of a protein-protein interface would contribute approximately the same number of residues, and this is borne out by Figure 3.1. The largest single interface, in terms of the number of residues involved, is that of homodimeric pyruvate-ferredoxin oxidoreductase (PDB entry 1kek), with more than 300 residues contributed from each partner in an extended, interdigitated surface, as shown in Figure 3.2.
Typical procedures to deal with redundant data involve performing cluster analysis, whereby the objects are partitioned into subsets such that the data in each agglomerated subset are co-proximal, as defined by a particular distance measure. Selection of one representative from each subset provides a non-redundant set. An example of such a procedure for clustering homologous protein sequences is described in Chapter 4. However, identifying a non-redundant set from a pairwise set of proteins, such as that in PICCOLO, is not so straightforward. Upstream sequence-based clustering of PDB polypeptides cannot be performed, as two protein structures with identical sequences may exist in different states: one may be uncomplexed and the other bound; even if both are bound, they may be bound to different partners; and even if both bind the same partner, there is no guarantee that the interaction surface or mode of interaction will be maintained.
For this reason the following clustering procedure was devised. All pairwise interfaces were first grouped by the unique ordered combination of UniProt identifiers of both component proteins. Then, within these UniProt pair clusters, each cluster member pair was compared to all other cluster member pairs and the overlap of unique UniProt residue numberings (pre-calculated and stored in the
Figure 3.1: Scatter plot of the number of residues contributed by the larger side of each PICCOLO interface (Ri on y-axis) against the number of residues contributed by the smaller side (Rj on x-axis). Colour indicates the total number of interfaces at each point, reflecting the fact that many interfaces share the same number of contributing residues. The red dashed line indicates the threshold of a minimum of 5 contact residues per side of the interface (Ri ≥ 5 and Rj ≥ 5) that was initially considered. The solid red line indicates the threshold that was used, where the product of the number of residues from each side is greater than or equal to 25 (Ri × Rj ≥ 25). The inset shows a close-up of the lower left corner.
Figure 3.2: Cartoon representation of the homodimeric interface of pyruvate-ferredoxin oxidoreductase (PDB entry 1kek). With more than 300 residues contributed by each surface, it is the single largest interface in PICCOLO. Chain A is shown in blue and chain B in red. Figure generated using PyMOL (DeLano, 2002).
ResMap table) for both constituents was assessed reciprocally. If both sides of an interface share more than 75% of their unique residue positions with another pairwise interaction, the two interfaces were co-clustered. The 75% threshold was chosen because it falls in a sparsely populated region of overlap values and gave good separation of some manually selected test cases.
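The clustering procedure can be sketched as below. This is a sketch under stated assumptions rather than the original implementation. The text does not specify how the overlap fraction is normalised or how pairwise matches are merged into clusters. Here the overlap is measured against the larger of the two residue sets, so that the 75% criterion holds in both directions, and matches are merged transitively (single linkage). The `Interface` record and all function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

OVERLAP_THRESHOLD = 0.75  # "more than 75%" in the text


@dataclass(frozen=True)
class Interface:
    # Hypothetical record: residue sets hold UniProt residue numbers.
    interface_id: str
    uniprot_a: str
    uniprot_b: str
    residues_a: frozenset
    residues_b: frozenset


def canonical(iface):
    """Order the two sides by UniProt identifier so (P1, P2) and (P2, P1) match."""
    if iface.uniprot_a <= iface.uniprot_b:
        return iface
    return Interface(iface.interface_id, iface.uniprot_b, iface.uniprot_a,
                     iface.residues_b, iface.residues_a)


def reciprocal_overlap(x, y):
    """Shared fraction relative to the larger set (assumed normalisation)."""
    if not x or not y:
        return 0.0
    return len(x & y) / max(len(x), len(y))


def similar(p, q, threshold=OVERLAP_THRESHOLD):
    """Both sides of the interface must exceed the overlap threshold."""
    return (reciprocal_overlap(p.residues_a, q.residues_a) > threshold
            and reciprocal_overlap(p.residues_b, q.residues_b) > threshold)


def cluster_interfaces(interfaces):
    """Group by UniProt pair, then merge similar interfaces (single linkage)."""
    groups = defaultdict(list)
    for iface in map(canonical, interfaces):
        groups[(iface.uniprot_a, iface.uniprot_b)].append(iface)

    clusters = []
    for members in groups.values():
        parent = list(range(len(members)))  # union-find forest for this group

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                if similar(members[i], members[j]):
                    parent[find(i)] = find(j)

        by_root = defaultdict(list)
        for i, member in enumerate(members):
            by_root[find(i)].append(member.interface_id)
        clusters.extend(sorted(ids) for ids in by_root.values())
    return clusters
```

For homomeric pairs, where both sides carry the same UniProt identifier, the assignment of sides is arbitrary, so a fuller treatment would probably need to compare both orientations. That case is not handled in this sketch.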
To form the non-redundant set, representatives were not chosen arbitrarily from each cluster. Instead, the representative complex for each cluster was the one with the highest mean QScore of its two constituent chains (QScore is a property of each polypeptide chain, as it depends partly on the number of missing residues). Note that this process results in a non-redundant set of interfaces, not of oligomeric assemblies.
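Selection of the representative can be illustrated with a brief sketch. The `ClusterMember` record and its field names are hypothetical. The QScore values are assumed to be available per chain, and how QScore itself is computed is outside the scope of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterMember:
    # Hypothetical record: one pairwise interface with a QScore per chain.
    interface_id: str
    qscore_a: float
    qscore_b: float


def mean_qscore(member):
    """Mean QScore of the two constituent chains."""
    return (member.qscore_a + member.qscore_b) / 2


def pick_representative(cluster):
    """Return the member with the highest mean QScore."""
    return max(cluster, key=mean_qscore)


def non_redundant_set(clusters):
    """One representative per non-empty cluster."""
    return [pick_representative(c) for c in clusters if c]
```

When two members tie, Python's `max` returns the first one encountered. The text does not say how ties were resolved, so that behaviour is an artefact of this sketch.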