
The method below applies to two separate analyses, one performed on HSEu data and the other on side-chain ASA data. The procedure is the same for both, except for the determination of the Expected in the ASA analysis, which had to allow for the variation in side-chain size between amino acids, as described later.

Preparing the data

The representative structures for Pfam domains were selected using the database described in Chapter 3. The selection criteria produced two sets of domains, one exclusively eukaryotic and the other exclusively prokaryotic. Within each set, the domains were further restricted to be exclusively cytoplasmic, non-membrane, non-DNA-binding and non-RNA-binding, i.e. domains exclusively from cytosolic globular proteins. Each quaternary structure selected from the PiQSi database contained at least one of the chosen Pfam domains. Initial results included some unusual points inconsistent with the trend indicated by the rest of the data. Investigation revealed errors in some of the structure files taken from PiQSi. A “cleaned” set of PiQSi structures was therefore assembled with the problem structures removed, leaving a total of 12,234 unique structures from which to select representatives for our Pfam families.

A script calling the program Whatif [74] was used to check the quality of the selected structures. Three operations were performed on the structures: i) Missing side-chain atoms were added. This was especially important for the ASA analysis, as missing atoms would reduce the measured ASA and introduce an error into the results. ii) Bond lengths and bond angles were checked and corrected. iii) The numbering of amino acids was checked and residues were renumbered to correct for duplications. This was done to resolve problems that had arisen with the BioPython PDBParser module, which could not handle inconsistencies in residue numbering.

For each of the 12,234 structures, the solvent exposure of every residue was measured. Solvent exposure here refers to both HSEu and the side-chain ASA in Å². The hsexpo.py script included in the BioPython module [75] for Python was used as the basis for a script that calculated HSEu within the workflow. The Naccess [4] program was used to determine the ASA of the side chains. A copy of each structure file was made with the B-factor column used to store the solvent exposure measure. Thus the solvent exposure was measured for each residue in its crystallised biological unit, as defined by PiQSi. Quaternary structures were used for the calculation of solvent accessibility values so that the analysis had the greatest possible biological and physical relevance.
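Storing a per-residue value in the B-factor column can be sketched as follows. This is a minimal illustration and not the script used in this work: the function name `set_bfactor` and the `exposure` mapping are hypothetical, and the only assumption about the file is the fixed-column PDB layout, in which the temperature factor occupies columns 61-66 of ATOM and HETATM records.

```python
def set_bfactor(pdb_lines, exposure):
    """Return PDB lines with the B-factor field (columns 61-66) replaced.

    `exposure` is a hypothetical mapping of (chain_id, residue_number)
    to a solvent exposure value; residues without an entry are left
    unchanged. Values are written as %6.2f, so they must be below 1000
    to fit the six-character field.
    """
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            # Chain ID is column 22, residue number is columns 23-26.
            key = (line[21], int(line[22:26]))
            if key in exposure:
                line = line[:60] + f"{exposure[key]:6.2f}" + line[66:]
        out.append(line)
    return out
```

Every atom of a residue receives the same value, so the exposure can be read back from any atom of that residue.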

The selected Pfam domains were located in, and extracted from, the crystal structures provided by PiQSi. Locating a domain in a structure required cross-referencing the chosen Pfam families with the UniProt ID associated with each protein structure. Although Pfam does provide start and end points for domains in protein structures, these are not recorded in a consistent fashion and therefore could not be used. A purpose-built Python script was used to find the start and end points of the domains in the structure, using the regular expression matcher Tre mentioned in Chapter 3. The script located the sequence provided by Pfam in the structure sequence from PiQSi, within a 10% margin of error. The PDBParser module in BioPython provided a method for extracting the structure of the domain from the PiQSi structure file in PDB format. The representative structure for each Pfam domain was stored individually in a labelled ASCII file in PDB format. The sets of representative structures stored with HSEu values and with ASA values were consistent with each other.
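The idea of locating a Pfam sequence in a structure sequence within a 10% margin of error can be illustrated with a simplified sliding-window search. This sketch is an assumption-laden stand-in for the Tre-based script: it counts substitutions only, whereas an approximate regular expression matcher may also allow insertions and deletions, and the name `locate_domain` and its parameters are hypothetical.

```python
def locate_domain(structure_seq, domain_seq, max_error=0.10):
    """Return (start, end) of the best-matching window, or None.

    A window is accepted when the number of mismatching positions is at
    most max_error * len(domain_seq). Substitutions only; this is a
    simplification of approximate regular expression matching.
    """
    n = len(domain_seq)
    if n == 0 or n > len(structure_seq):
        return None
    allowed = int(max_error * n)
    best = None  # (mismatches, start) of the best window seen so far
    for start in range(len(structure_seq) - n + 1):
        window = structure_seq[start:start + n]
        mismatches = sum(a != b for a, b in zip(window, domain_seq))
        if mismatches <= allowed and (best is None or mismatches < best[0]):
            best = (mismatches, start)
    if best is None:
        return None
    return best[1], best[1] + n
```

The returned indices are zero-based and half-open, so `structure_seq[start:end]` is the matched region.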

Differences in Calculating the Expected for ASA Compared with HSEu Data

Differences in the size of amino acid side chains required a different method of determining the Expected E for the ASA analysis compared with the HSEu analysis. HSEu is completely independent of residue size: it considers the amino acid population in the half sphere of a specified radius in the direction of the Cα-Cβ vector for a given residue. To determine the Expected for HSEu it was therefore sufficient to use the method described in Chapter 2, Section 2.4.2, which details the application of OE to solvent exposure analysis.
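The size independence of HSEu follows from its definition as a neighbour count, which a short sketch makes concrete. This is an illustration of the half-sphere count only, not the hsexpo.py implementation: the function name `hse_up` is hypothetical, the default radius of 13 Å is an assumption (the radius is not stated in this section), and every residue is assumed to have a Cβ coordinate, so glycine would need a constructed pseudo-Cβ.

```python
import math


def hse_up(ca_coords, cb_coords, index, radius=13.0):
    """Count C-alpha atoms in the half sphere on the C-beta side.

    `ca_coords` and `cb_coords` are lists of (x, y, z) tuples, one per
    residue. A neighbour is counted when it lies within `radius` of the
    C-alpha atom of residue `index` and on the same side of it as the
    C-alpha to C-beta vector.
    """
    ca = ca_coords[index]
    direction = [b - a for a, b in zip(ca, cb_coords[index])]
    count = 0
    for j, other in enumerate(ca_coords):
        if j == index:
            continue
        offset = [o - a for a, o in zip(ca, other)]
        distance = math.sqrt(sum(c * c for c in offset))
        if distance >= radius:
            continue
        # A positive dot product means the neighbour is in the "up" half.
        if sum(d * c for d, c in zip(direction, offset)) > 0:
            count += 1
    return count
```

The count depends only on where neighbouring Cα atoms lie, never on the size of the residue's own side chain.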

ASA, in contrast, is explicitly size dependent and restricted to a maximum value that varies between residue types. The role of the Expected in the analysis is to offer an unbiased or ‘unconditioned’ distribution with which to compare the Observed distribution. However, residues such as alanine and lysine, which are of very different sizes, could never be ‘expected’ to share the same Expected value of absolute ASA. To estimate the Expected for each residue type in the ASA analysis, 100 randomised data sets were created from the data for the selected Pfam domain structures, yielding 100 bootstrapped Observed values. The average of these bootstrapped Observed values was taken as an approximation of the Expected ASA given no distributional bias of residues. This average was used as the Expected value for each residue type when calculating the OE from the normal, unrandomised data. It was also used as the Expected when calculating the OE for each residue type in each analysis of randomised bootstrap data. The bootstrap analyses were used to determine the chance variation possible in the data analysis, and thus the likely statistical significance of the true data, as described later in Section 4.2.2.
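One way to picture the averaging of bootstrapped Observed values is sketched below. The randomisation scheme is not specified in this section, so the scheme here is an assumption made purely for illustration: ASA values are resampled with replacement within each residue type, which keeps every residue type within its own attainable ASA range. The names `observed` and `bootstrap_expected`, the record layout and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict

BIN_WIDTH = 10.0  # Å², the range-bin size used for ASA


def observed(records):
    """P(a|r): fraction of the residues in bin r that are of type a."""
    per_bin = defaultdict(lambda: defaultdict(int))
    for res_type, asa in records:
        per_bin[int(asa // BIN_WIDTH)][res_type] += 1
    return {
        (res_type, b): n / sum(counts.values())
        for b, counts in per_bin.items()
        for res_type, n in counts.items()
    }


def bootstrap_expected(records, rounds=100, seed=0):
    """Average the Observed over `rounds` resampled data sets.

    Assumed scheme: each residue keeps its type and receives an ASA
    value drawn with replacement from the pool for that type. A
    (type, bin) pair absent from a round contributes zero to the mean.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for res_type, asa in records:
        pools[res_type].append(asa)
    totals = defaultdict(float)
    for _round in range(rounds):
        resampled = [
            (res_type, rng.choice(values))
            for res_type, values in pools.items()
            for _unused in values
        ]
        for key, value in observed(resampled).items():
            totals[key] += value
    return {key: total / rounds for key, total in totals.items()}
```

Under this assumed scheme a small residue such as alanine never acquires an Expected in bins that only a large residue such as lysine can reach.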

The Analysis

For each Pfam family, the frequency with which each residue type had a solvent exposure value within a given range (a range-bin) was determined. For HSEu the range-bin size was set to 4 HSEu counts and the highest range-bin was 56-60 counts, while for ASA it was set to 10 Å² with the highest range-bin at 240-250 Å². The maximum values for each analysis were chosen by searching the data set for the highest single value of solvent exposure for each residue type. Each bin ran from its lower value up to, but not including, its upper value, i.e. the range (0 to 10) is in fact i ∈ [0, 10), so that a value of 10 goes into the next bin (10 to 20), j ∈ [10, 20). The frequency of each residue type in a range-bin and the frequency of each residue type in the entire structure were then used to determine the OE for each residue type in each range-bin. For the HSEu analysis the OE was calculated using the equations given in Section 2.4.2:

Observed:

$$O_1 = P(a \mid r) = \frac{\sum a_r}{\sum A_r} \tag{4.1}$$

Expected:

$$E_1 = P(a) = \frac{\sum a_R}{\sum A_R} \tag{4.2}$$

$$\frac{O_1}{E_1} = \frac{P(a \mid r)}{P(a)} \tag{4.3}$$
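The half-open binning and the ratio in equation 4.3 can be sketched for a single structure as follows. This is an illustrative reading of the equations, with a taken as a residue type, r as a range-bin, and A and R as all residue types and all range-bins; the function names and the input layout are hypothetical.

```python
from collections import Counter


def bin_index(value, width):
    """Half-open bins [k*width, (k+1)*width).

    A value equal to an upper bin edge falls into the next bin.
    """
    return int(value // width)


def observed_over_expected(residues, width):
    """Return {(residue_type, bin): O/E} for one structure.

    `residues` is an iterable of (residue_type, exposure) pairs.
    """
    residues = list(residues)
    in_bin = Counter()      # residues of type a in bin r
    bin_total = Counter()   # residues of any type in bin r
    type_total = Counter()  # residues of type a over all bins
    for res_type, exposure in residues:
        b = bin_index(exposure, width)
        in_bin[(res_type, b)] += 1
        bin_total[b] += 1
        type_total[res_type] += 1
    n = len(residues)
    result = {}
    for (res_type, b), count in in_bin.items():
        o = count / bin_total[b]       # equation 4.1, P(a|r)
        e = type_total[res_type] / n   # equation 4.2, P(a)
        result[(res_type, b)] = o / e  # equation 4.3
    return result
```

A width of 4 corresponds to the HSEu range-bins and a width of 10 to the ASA range-bins; for the ASA analysis the Expected would be supplied from the bootstrap estimate instead of being computed as in equation 4.2.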

For the ASA analysis, the Observed was determined as per equation 4.1 above, while the Expected had already been calculated as described in Section 4.2.1. N.B. The Expected for the ASA analysis was calculated only once and was used for both the analysis of the normal data and the analyses of the randomised bootstrap data. This was done for practical reasons: it would have been prohibitively expensive in computational resources (both CPU cycles and data storage) to recalculate the Expected separately for the normal data and for each bootstrap analysis.

Our objective was to calculate a value of OE that was generic and unbiased by differences in the number of representative structures available for each Pfam domain. Thus the OE for each residue type in each range-bin for a given structure needed to be weighted to adjust for the similarity of its sequence to the others in that family. To achieve this, the Henikoff & Henikoff weighting [64] described in Section 2.5.2 was calculated for the sequences of all the representative structures of each Pfam family being considered. This weighting was then applied to each OE for each residue type in each range-bin for each structural example of the Pfam family. The weighted OE values for a given range-bin were summed over all structures, giving a weighted OE for the Pfam family. This provided a representative OE for each residue type in each range-bin for each Pfam family. Finally, an average of the OE values from all Pfam families was calculated to give the values reported in the results section. This final average of averages was calculated by summing all OE values in a given range-bin and dividing by the number of families that had contributed to the value in that range-bin for that residue type.
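The weighting and averaging steps can be sketched as below. This is a minimal illustration under stated assumptions, not the code used in this work: the weights follow the position-based scheme of Henikoff & Henikoff as it is usually formulated, the sequences are assumed to be aligned to equal length, gap characters are treated as an ordinary symbol, weights are normalised to sum to one, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def henikoff_weights(aligned):
    """Position-based sequence weights, normalised to sum to 1.

    In each alignment column a sequence receives 1 / (k * n), where k is
    the number of distinct symbols in the column and n is the number of
    sequences sharing this sequence's symbol.
    """
    weights = [0.0] * len(aligned)
    for column in zip(*aligned):
        counts = Counter(column)
        k = len(counts)
        for i, symbol in enumerate(column):
            weights[i] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]


def family_oe(weights, per_structure_oe):
    """Weighted sum of the per-structure O/E values of one family."""
    combined = defaultdict(float)
    for w, oe in zip(weights, per_structure_oe):
        for key, value in oe.items():
            combined[key] += w * value
    return dict(combined)


def average_over_families(family_values):
    """Mean O/E per (residue type, bin) over contributing families."""
    sums = defaultdict(float)
    counts = Counter()
    for oe in family_values:
        for key, value in oe.items():
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

For the alignment `["AAA", "AAA", "CCC"]` the two identical sequences each receive a weight of 0.25 and the divergent one 0.5, so a family dominated by near-duplicate structures does not dominate its own average.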