users in entering complex laboratory parameters and resulting data. When appropriate, the software should resemble the workflow users are accustomed to, while at the same time providing guidance on which steps to take next.
In order to capture the wide variety of possible protocols and their variations completely, the underlying data model of the software is likely to be highly complex itself. It should not be exposed to the user; instead, the worksteps of an analysis should be decomposed into steps that are easier to comprehend.
Explorative analysis of acquired and transformed data may benefit from a high level of interactivity of the system. This means that for every relevant detail, such as genes and data fields, it should be possible to increase the level of background information displayed by simple mouse interactions. This additional level of information should always be available in the same representation, regardless of where in the interface a hybridization, biological sample, specific gene, or data entry occurs.
Furthermore, the user interface has to be readily available without complex software installation procedures, to reduce software maintenance costs. The users work with highly heterogeneous software and hardware and require a high level of compatibility with the different infrastructures.
Regarding different modes of interaction, such as graphical user interfaces and command-line interfaces, the optimal combination of flexibility and ease of use would be an interface as flexible as a command-line solution (for example, R) and as easy to use as a graphical user interface. In a real-world application, however, there will almost certainly have to be a trade-off between flexibility and ease of use.
5.3 Technical Requirements and Preconditions
5.3.1 Data Handling
As a high-throughput method, microarrays create large amounts of data, which have to be handled efficiently and reliably. Any step of data analysis usually creates further derived, transformed datasets. The amount of data is multiplied further as new experiments are added over time. This makes it necessary to use scalable technologies that keep the data manageable and searchable while the underlying database grows.
Redundancies can occur during use of the system when identical datasets are added many times and when analysis methods are applied repeatedly with the same parameters. In that way, the amount of stored data is multiplied without any added information. Making existing data reusable for other users is one way of addressing this problem. The user needs to be able to use already uploaded microarray datasets for different types of analyses and to re-organize them into virtual experiments. If there is already an appropriate normalized dataset for a specific microarray, there should be no need to recompute it. Adding more normalized datasets to check the effects of different methods or parameters should remain possible, and datasets based on previous normalization steps should be preserved.
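The reuse of existing normalized datasets can be illustrated with a minimal sketch. It assumes that a normalized dataset is fully identified by the microarray, the method name, and the method parameters; the names `NormalizationStore`, `cache_key`, and `shift_by_offset` are hypothetical, and the "normalization" is a toy offset subtraction, not one of the methods discussed in this work.

```python
import hashlib
import json


def cache_key(array_id, method, params):
    """Derive a stable key from the array, the method name, and its parameters."""
    # Sorting the keys makes the key independent of parameter order.
    payload = json.dumps([array_id, method, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class NormalizationStore:
    """In-memory stand-in (assumption) for a repository of normalized datasets."""

    def __init__(self):
        self._results = {}
        self.computations = 0  # counts how often a method was actually run

    def get_or_compute(self, array_id, method, params, compute):
        key = cache_key(array_id, method, params)
        if key not in self._results:
            # Only compute when no identical dataset exists yet.
            self._results[key] = compute(params)
            self.computations += 1
        return self._results[key]


def shift_by_offset(raw):
    # Toy "normalization": subtract a constant offset from every value.
    return lambda params: [value - params["offset"] for value in raw]


store = NormalizationStore()
raw = [10.0, 12.0, 14.0]
first = store.get_or_compute("array-1", "shift", {"offset": 2.0}, shift_by_offset(raw))
again = store.get_or_compute("array-1", "shift", {"offset": 2.0}, shift_by_offset(raw))
other = store.get_or_compute("array-1", "shift", {"offset": 4.0}, shift_by_offset(raw))
```

The second request with identical parameters returns the stored dataset, while the request with a different offset adds a further normalized dataset next to the first one, so earlier results are preserved.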
Security by authentication and authorization for each individual dataset is a high-priority requirement, especially in large distributed projects where data privacy is often a serious concern. Another rationale is observed during data collection and analysis: it is often unclear whether data quality justifies further analysis efforts. Therefore, experimenters tend to make only data of good quality visible to other project members.
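Authorization at the level of individual datasets might be sketched as follows. The example assumes that authentication has already taken place and that user names are trustworthy; the class `Dataset`, its fields, and the function `visible_datasets` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str
    readers: set = field(default_factory=set)  # users granted read access

    def share_with(self, acting_user, other_user):
        # Only the owner may widen the audience of a dataset.
        if acting_user != self.owner:
            raise PermissionError(f"{acting_user} may not share {self.name}")
        self.readers.add(other_user)

    def can_read(self, user):
        return user == self.owner or user in self.readers


def visible_datasets(datasets, user):
    """Return the names of the datasets the given user is allowed to see."""
    return [d.name for d in datasets if d.can_read(user)]


good = Dataset("hyb-good-quality", owner="alice")
doubtful = Dataset("hyb-doubtful-quality", owner="alice")
good.share_with("alice", "bob")  # alice releases only the good hybridization
```

In this sketch a dataset stays private to its owner until it is shared explicitly, which matches the observation that experimenters release data only once its quality is established.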
It is possible to set up individual repositories for each project, but this has a severe impact on database maintenance. The administrative effort increases with the number of individual repositories. Data scattered over several repositories is also an obstacle for data mining, as the data mining process has to be repeated for each repository. It is therefore required to be able to reduce the number of repositories by forming common data repositories. In addition, a way of interchanging data between repositories is required in order to create a federated repository from several independent ones.
5.3.2 Data Analysis Capabilities
The available methods for pre-processing and analysis should include most of the algorithms presented in Chapter 3 whose applicability to microarray data has been shown in the corresponding publications. Each of the existing systems presented in Sections 4.4 and 4.7 tends to implement at least a subset of this functionality.
To be able to carry out self-contained analyses within the system, methods for normalization and preprocessing, data filtering, statistical tests, and machine learning should be integrated. Additional visualization methods are required to enable researchers to interactively locate interesting genes, experiments, and groups thereof. For the purpose of publishing experimental results, it should also be possible to export these visualizations in printable formats.
The analysis methods examined so far can be applied in sequential steps. While not every possible arrangement of such an analysis sequence makes sense, there are still many possible combinations of methods. Each method accepts data of a specific type and produces data of another type. It is important to be able to store the data of consecutive analysis steps for reference, but if this were mandatory even for all preliminary steps, storage requirements would grow excessively. Accordingly, it should be possible to combine methods without keeping intermediate results.
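The idea of chaining typed methods without mandatory intermediate storage can be sketched as follows. The type names ("raw", "normalized", "filtered", "summary"), the class `Method`, and the three toy methods are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Method:
    name: str
    accepts: str   # data type consumed, e.g. "raw"
    produces: str  # data type produced, e.g. "normalized"
    run: Callable


def validate(methods):
    """Raise ValueError if the output type of a step does not fit the next step."""
    for current, following in zip(methods, methods[1:]):
        if current.produces != following.accepts:
            raise ValueError(
                f"{current.name} produces {current.produces!r}, "
                f"but {following.name} accepts {following.accepts!r}"
            )


def run_pipeline(methods, data, keep=()):
    """Run the methods in order, storing only the results of steps named in keep."""
    validate(methods)
    stored = {}
    for method in methods:
        data = method.run(data)
        if method.name in keep:
            stored[method.name] = data
    return data, stored


# Toy methods standing in for real normalization, filtering, and summary steps.
normalize = Method("normalize", "raw", "normalized", lambda xs: [x - min(xs) for x in xs])
threshold = Method("filter", "normalized", "filtered", lambda xs: [x for x in xs if x >= 1.0])
average = Method("average", "filtered", "summary", lambda xs: sum(xs) / len(xs))

result, stored = run_pipeline(
    [normalize, threshold, average], [5.0, 5.5, 7.0, 9.0], keep={"filter"}
)
```

Here only the filtered dataset is kept for reference, the normalized intermediate result is discarded, and an arrangement such as filtering before normalizing is rejected before any computation starts.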
Apart from the possibility of arranging methods in many different ways, almost every analysis method has a set of parameters controlling its behavior. Hierarchical cluster analysis, for example, requires specifying the method for computing distances between genes (e.g. Euclidean, Manhattan, correlation coefficient) and the method for computing the inter-cluster distances (e.g. average linkage, complete linkage, Ward's method). Pre-filtering steps are necessary to compute a gene-expression matrix, in which repeated measurements must be consolidated by computing an empirical location statistic, namely the mean, the median, or a trimmed mean. It can also be reasonable to scale the resulting gene-expression matrix to have equal variance and means of columns, and for time-series to scale the mean of each row.
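The consolidation of repeated measurements and the column scaling can be illustrated with a short sketch based on the Python standard library. The function names, the trimming proportion of 0.2, and the use of the sample standard deviation for scaling are assumptions; the text above does not prescribe them.

```python
import statistics


def trimmed_mean(values, proportion=0.2):
    """Mean after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)
    kept = ordered[cut:len(ordered) - cut] if cut else ordered
    return statistics.fmean(kept)


LOCATION_STATISTICS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "trimmed_mean": trimmed_mean,
}


def consolidate(replicates, statistic="median"):
    """Collapse the repeated measurements of each gene into a single value."""
    summarize = LOCATION_STATISTICS[statistic]
    return {gene: summarize(values) for gene, values in replicates.items()}


def standardize_columns(matrix):
    """Scale each column (one hybridization) to mean 0 and sample variance 1."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    deviations = [statistics.stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, deviations)]
        for row in matrix
    ]


# geneA contains one outlier (100.0) among its five replicate spots.
replicates = {"geneA": [1.0, 2.0, 3.0, 4.0, 100.0], "geneB": [2.0, 2.0, 4.0]}
by_mean = consolidate(replicates, "mean")
by_median = consolidate(replicates, "median")
by_trimmed = consolidate(replicates, "trimmed_mean")
scaled = standardize_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```

For the outlier-affected gene, the mean is pulled towards the outlier, whereas the median and the trimmed mean are not, which is why the choice of location statistic is itself a parameter that has to be recorded.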
Exploring different parameters might result in an inefficient trial-and-error approach. It is therefore important to provide guidelines on how to set up a consecutive series of analysis methods, and to provide inexperienced users with a set of standardized parameters to start with. All parameters used in an application of a method need to be recorded. The choice of some methods depends on the microarray platform: normalization of two-color cDNA arrays differs from normalization of single-channel arrays (see Section 3.3.2). The system should support the user in this decision process based on the type of microarray. As many new methods are constantly being developed, the software needs a mechanism to incorporate new methods in order to extend its analysis capabilities without further programming effort.
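One conceivable way to combine standardized parameters, parameter recording, platform-dependent method choice, and extensibility is a method registry, sketched below. Everything in it is hypothetical: the registry, the platform mapping, and the two toy methods `center_log_ratios` and `scale_to_target` are placeholders and do not represent the normalization methods of Section 3.3.2.

```python
import copy
import statistics

REGISTRY = {}       # method name -> (function, standard parameters)
PARAMETER_LOG = []  # one record per application of a method

# Hypothetical mapping from platform type to a suggested method.
PLATFORM_DEFAULTS = {
    "two-color": "center_log_ratios",
    "single-channel": "scale_to_target",
}


def register(name, **defaults):
    """Decorator that adds a method and its standard parameters to the registry."""
    def decorator(func):
        REGISTRY[name] = (func, defaults)
        return func
    return decorator


@register("center_log_ratios", location="median")
def center_log_ratios(values, location):
    # Toy method: subtract the median (or mean) from every log ratio.
    center = statistics.median(values) if location == "median" else statistics.fmean(values)
    return [v - center for v in values]


@register("scale_to_target", target=100.0)
def scale_to_target(values, target):
    # Toy method: rescale intensities so that their mean equals the target.
    factor = target / statistics.fmean(values)
    return [v * factor for v in values]


def apply_method(name, values, **overrides):
    """Run a registered method with defaults plus overrides and log the parameters."""
    func, defaults = REGISTRY[name]
    params = {**defaults, **overrides}
    PARAMETER_LOG.append({"method": name, "parameters": copy.deepcopy(params)})
    return func(values, **params)


def normalize_for_platform(platform, values, **overrides):
    """Pick the suggested method for the platform type and apply it."""
    return apply_method(PLATFORM_DEFAULTS[platform], values, **overrides)


centered = normalize_for_platform("two-color", [0.5, 1.0, 3.0])
rescaled = normalize_for_platform("single-channel", [50.0, 150.0], target=200.0)
```

In such a design, a new method is added by registering one more function, while the code that selects, parameterizes, and logs methods remains unchanged.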
5.3.3 Data Integration
Some analysis methods can benefit from or even require the presence of prior knowledge. Supervised learning methods, for example, require class labels for the purpose of training and performance evaluation. Cluster analysis methods could also benefit from the presence of external labels: functional classifications of genes from genome annotation systems can be used to assess the quality of clusters. Another possible use case is the projection of measured data onto the chromosomal location of genes. This approach can be used for tiling microarrays to detect transcribed regions of the chromosome. The exact location of representative sequences on the chromosome can be retrieved from specialized genome annotation systems such as GenDB (Meyer et al., 2003).
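As an illustration of how external labels can be used to assess clusters, the following sketch computes cluster purity, a common and simple agreement measure. Purity is chosen here as an example only; it is not necessarily the measure used in this work, and the gene names and functional classes are invented.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_of, class_of):
    """Fraction of annotated genes whose class is the majority class of their cluster."""
    members = defaultdict(list)
    for gene, cluster in cluster_of.items():
        if gene in class_of:  # genes without annotation are ignored
            members[cluster].append(class_of[gene])
    total = sum(len(classes) for classes in members.values())
    if total == 0:
        raise ValueError("no annotated genes to evaluate")
    majority = sum(Counter(classes).most_common(1)[0][1] for classes in members.values())
    return majority / total


# Cluster assignment from a cluster analysis (invented example data).
cluster_of = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1, "g6": 1}
# Functional classes as they might come from a genome annotation system; g6 is unannotated.
class_of = {"g1": "transport", "g2": "transport", "g3": "stress", "g4": "stress", "g5": "stress"}

purity = cluster_purity(cluster_of, class_of)
```

In this example four of the five annotated genes carry the majority class of their cluster, giving a purity of 0.8. Purity increases trivially with the number of clusters, so it is best compared between clusterings of equal size.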
Allowing the retrieval of microarray data by remote software is also an important feature, for instance in order to map the expression levels of genes to metabolic pathways. In that way, specialized pathway software can display the metabolic network together with the actual expression level of each gene involved. Gene expression information may also serve as additional annotation in genome annotations, stating that a specific gene was differentially expressed under a specific experimental condition. This requirement raises two new problems, both motivated by the complexity of microarray data.
The technical problem to solve is one of appropriate interfacing. Neither can the remote application (client) be required to support MAGE-ML, nor should the complex structure of MAGE-ML be exposed to the client; this would be an unwarranted overhead for simple queries. On the other hand, there can be more complex queries which combine many different aspects of the experimental annotations with external data sources. For this purpose it seems justified to provide full interoperability for the programmers of remote applications.
A very large number of measured and derived single datum points can exist for a gene within the system. It is a non-trivial task to decide which single-point measurement or which expression vector should be exported to external applications. Hence, the external user has to be able to restrict queries for expression data not only to genes, but also to a defined set of experimental conditions and data types. Although applications often require a single measurement for a gene, one has to keep in mind that this is a very reductionist view of gene-expression data. This problem needs to be solved in the design of an appropriate query interface.
In conclusion, it is necessary to provide at least two views on data interoperability:
• A simplified access model, by which simple queries for a single datum point can be processed, and
• a complex access model, by which complex queries can be implemented; these queries are not specified at this time and may use all information in the repository.
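The simplified access model from the list above could, for instance, take the following shape. The sketch assumes a flat list of measurements as a stand-in for the repository; the record type `Measurement`, the function names, and the choice of the median for reducing several matching values to one are assumptions for illustration, not a specified interface.

```python
from dataclasses import dataclass
import statistics


@dataclass(frozen=True)
class Measurement:
    gene: str
    condition: str
    data_type: str  # e.g. "raw" or "normalized"
    value: float


def query_expression(measurements, gene, conditions=None, data_type=None):
    """Return all values for a gene, optionally restricted by condition and data type."""
    return [
        m.value
        for m in measurements
        if m.gene == gene
        and (conditions is None or m.condition in conditions)
        and (data_type is None or m.data_type == data_type)
    ]


def single_value(measurements, gene, conditions, data_type):
    """Reduce the matching measurements to one number (here: the median)."""
    values = query_expression(measurements, gene, conditions, data_type)
    if not values:
        raise LookupError(f"no {data_type} data for {gene} under {sorted(conditions)}")
    return statistics.median(values)


# Invented example content of a repository.
repository = [
    Measurement("geneA", "heat-shock", "normalized", 2.0),
    Measurement("geneA", "heat-shock", "normalized", 4.0),
    Measurement("geneA", "heat-shock", "raw", 900.0),
    Measurement("geneA", "control", "normalized", 0.5),
    Measurement("geneB", "heat-shock", "normalized", 1.0),
]

everything = query_expression(repository, "geneA")
datum = single_value(repository, "geneA", {"heat-shock"}, "normalized")
```

Without restrictions, the query for one gene mixes raw and normalized values from different conditions; only the restriction to a condition set and a data type makes the reduction to a single datum point meaningful.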