

So far, the analysis of UniProtKB has examined annotation quality in bulk, without considering how individual records mature. If quality is a function of the maturity or age of a record, then individual entries would be expected to improve over time, even if the data as a whole does not because of the rapid growth of UniProtKB.

Each entry within UniProtKB contains three date stamps indicating when the entry was first introduced into the database, the last modification date of the entry, and the last modification date of the sequence. By extracting the creation date from each UniProtKB entry, the average record age can be calculated, as illustrated in Figure 4.11a. This shows that the average age of a record has increased only slowly over the lifespan of UniProtKB as a whole. From this graph, it can be calculated that, although Swiss-Prot is currently around 25 years old, the average record is only around eight years old. This difference between the release date and the average entry creation date for all versions of UniProtKB is illustrated in Figure 4.11b. For example, Figure 4.11b shows that Swiss-Prot Version 9 has an age difference of 1 year and 4 months, calculated as the difference between the release date (November 1988) and the average entry creation date (July 1987).
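The age-difference arithmetic above can be sketched as follows. This is a minimal illustration, not the procedure used to produce Figure 4.11: the function names are hypothetical, creation dates are assumed to have already been extracted as `datetime.date` values, and the two dates quoted for Swiss-Prot Version 9 are pinned to the first of the month because only month and year are given.

```python
from datetime import date


def mean_creation_date(creation_dates):
    """Return the mean of a collection of dates, rounded down to a whole day."""
    ordinals = [d.toordinal() for d in creation_dates]
    return date.fromordinal(sum(ordinals) // len(ordinals))


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


# Worked check using the dates quoted for Swiss-Prot Version 9.
# The day of month (1) is an assumption; the text gives only month and year.
release = date(1988, 11, 1)           # release date: November 1988
average_creation = date(1987, 7, 1)   # average entry creation date: July 1987

years, months = divmod(months_between(average_creation, release), 12)
print(years, months)  # prints: 1 4
```

Averaging proleptic ordinals is one simple way to take a mean of dates; any consistent numeric encoding of the creation date would serve equally well.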

Figure 4.11 shows similar patterns for both Swiss-Prot and TrEMBL, allowing for the fact that Swiss-Prot is ten years older than TrEMBL. However, in Figure 4.11a it is noticeable that Swiss-Prot, and to a lesser extent TrEMBL, maintain the same average age for a number of recent releases. This constant average age coincides with the introduction of more regular releases of UniProtKB, a period that has also seen a reduction in the number of Swiss-Prot entries being added, as shown in Figure 2.9a.

These figures emphasise the increasing size of UniProtKB and the corresponding effect on the average age of entries. Therefore, in order to assess whether individual records appear to be maturing, it is necessary to abstract away from the increasing size of the database. Such an analysis, however, is not straightforward; essentially, a set of records which relate to a defined set of proteins is needed.

[Figure 4.11 plots omitted. Panel (a), Average entry age: average creation date against release date (1990 to 2010), for Swiss-Prot and TrEMBL. Panel (b), Age difference: difference in years (0 to 8) against release date (1990 to 2010), for Swiss-Prot and TrEMBL.]

Figure 4.11: The average entry age and the difference between release date and average age for each version of UniProtKB.

To achieve such an analysis, the annotations from entries common to Swiss-Prot Version 9 and other versions of Swiss-Prot were extracted. Extracting annotation from entries common across database versions allows a like-for-like comparison of one set of records over the history of the database. The resulting α values from this analysis are shown in Figure 4.12, together with the α value for the entries remaining in the database (i.e. those entries that are not in Swiss-Prot Version 9). This result shows that the α value for the mature set of entries has decreased over time, in line with the Swiss-Prot database as a whole. However, the overall decrease in α value is reasonably small compared to the α values for the remaining entries. Although the difference in α value between the subset of common entries and the remaining entries initially increased significantly, it has started to slowly reduce, with later versions showing only a minimal change in α value.
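The partition described above, entries shared with Swiss-Prot Version 9 versus the remaining entries, amounts to a set intersection and a set difference. The sketch below assumes that entries can be matched across versions by a stable identifier such as an accession; the function name and the accessions in the usage example are hypothetical.

```python
def split_by_reference(reference_accessions, version_accessions):
    """Partition one database version against a reference version.

    Returns (common, remaining): identifiers present in both versions,
    and identifiers present only in the later version.
    """
    reference = set(reference_accessions)
    version = set(version_accessions)
    common = version & reference      # the "mature" subset
    remaining = version - reference   # everything added since the reference
    return common, remaining


# Hypothetical accessions, for illustration only.
sp9 = {"P00001", "P00002", "P00003"}
later = {"P00002", "P00003", "P00004", "P00005"}

common, remaining = split_by_reference(sp9, later)
print(sorted(common))     # prints: ['P00002', 'P00003']
print(sorted(remaining))  # prints: ['P00004', 'P00005']
```

Annotations would then be collected separately from each of the two subsets before computing an α value for each.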

[Figure 4.12 plot omitted. α (1.6 to 2.0) against database version (SP9, SP25, SP40, UP5, UP15, UP2010_05, UP2011_05, UP2012_05), for entries common with Swiss-Prot Version 9 and for the remaining entries.]

Figure 4.12: The α value for all entries contained within Swiss-Prot Version 9 that are also in various other Swiss-Prot versions. In addition, the α value of the remaining entries for each Swiss-Prot version is shown (i.e. the annotation from all entries that were not in Swiss-Prot Version 9).

Given that the α value for mature entries has generally decreased over time, it is of interest to investigate the α values of entries that are new to each version of Swiss-Prot. To perform this analysis, the annotations from entries that appeared for the first time in a given database version were extracted. The results from this analysis are shown in Figure 4.13. Again, the α value appears to decrease over time, as in the other Swiss-Prot graphs, with later versions of Swiss-Prot starting to show improvement.
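Selecting the entries that are new to each version can be sketched as a running set difference over versions in release order. This is an illustration under stated assumptions rather than the extraction code behind Figure 4.13: it assumes entries are matched by a stable identifier, and the function name, version labels and accessions are hypothetical.

```python
def new_entries_per_version(versions):
    """Map each version label to the identifiers first seen in that version.

    `versions` is a list of (label, identifiers) pairs in release order.
    """
    seen = set()
    first_seen = {}
    for label, identifiers in versions:
        current = set(identifiers)
        first_seen[label] = current - seen  # not present in any earlier version
        seen |= current
    return first_seen


# Hypothetical version contents, for illustration only.
history = [
    ("v1", {"P00001", "P00002"}),
    ("v2", {"P00001", "P00002", "P00003"}),
    ("v3", {"P00002", "P00003", "P00004"}),
]

for label, fresh in new_entries_per_version(history).items():
    print(label, sorted(fresh))
# prints:
# v1 ['P00001', 'P00002']
# v2 ['P00003']
# v3 ['P00004']
```

Keeping a cumulative `seen` set means an entry that is dropped and later reinstated is not counted as new a second time; comparing only against the immediately preceding version would be an alternative design with different behaviour in that case.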

Since the introduction of the new release cycle, the α values for Swiss-Prot annotations have steadily increased, with the age difference in UniProtKB/Swiss-Prot Version 2012_05 being at a high of eight years. It appears that changes to the release cycle and annotation procedure have started to slowly improve the quality of both new and existing annotations. From these analyses, we conclude that there are differences between bulk annotation and individual sets of proteins, either as they mature over time, or as they first enter the database. However, the broad direction of change in the annotation is similar for these subsets as it is for the database as a whole. Therefore, we also conclude that the change in α value that we see in bulk is unlikely to result only from the increase in size of the database.

However, age is not the only factor that can have an impact on annotation quality. UniProtKB categorises proteins in relation to species and taxonomy; analysing these categories allows additional subsets of annotations to be analysed. Specifically, given that some species are model organisms, it would be expected that the quality and wealth of knowledge attached to these proteins would be higher than those

[Figure 4.13 plot omitted. α (1.6 to 2.0) against database version (SP11, SP20, SP30, SP40, UP6, UP2010_01, UP2010_11, UP2011_09, UP2012_05).]

Figure 4.13: α value of annotations from entries new to each version of Swiss-Prot.
