All the aims of the project rely on having all data, whether captured or computed, accessible at all times. The data underlying the entire metabolomics workflow therefore need to be centralised in a common structure. Sharing data between users (aim number 5) depends heavily on the data structure and is addressed in this section. Because the modularity of the tool is another key objective of the system (aim number 3), the data structure needs to follow a modular design to support scalability and rapid development. The data structure proposed below is implemented as a relational database developed using MySQL.
3.4. Untargeted metabolomics pipeline
The structure is organised in modules to separate the different types of data. Captured data and computed data have been identified as the two main data types. These two data types are then organised in sub-modules to form a coherent data structure that supports the capture of all the information and parameters required for an untargeted metabolomics experiment. Figure 3.3 shows the general organisation of the data structure implemented in PiMP. The following modules structure the data capture: projects, fileupload, groups and experiments. The computed data, which form the results of the analysis pipeline, are stored in the data and compound modules.
The first module, called “projects” and shown in Figure 3.4, records a project’s metadata such as its name, its creation and modification dates, and its owner. The module also captures the users who are granted access to the project through the UserProject table, which additionally records the level of permission each user has on an individual project.
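The projects module can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than MySQL so that it runs on its own, and every table name, column name and permission level below ("read", "edit", "admin") is an assumption made for illustration, not the actual PiMP schema.

```python
import sqlite3

# Illustrative schema only: table names, column names and the permission
# levels are assumptions, not the actual PiMP definitions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE account (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE
);
CREATE TABLE project (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    created TEXT NOT NULL,
    modified TEXT NOT NULL,
    owner_id INTEGER NOT NULL REFERENCES account(id)
);
-- Joining table: which user may access which project, and at what level.
CREATE TABLE user_project (
    user_id INTEGER NOT NULL REFERENCES account(id),
    project_id INTEGER NOT NULL REFERENCES project(id),
    permission TEXT NOT NULL CHECK (permission IN ('read', 'edit', 'admin')),
    PRIMARY KEY (user_id, project_id)
);
""")

# Made-up example rows.
conn.executemany("INSERT INTO account (id, username) VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.execute(
    "INSERT INTO project (id, name, created, modified, owner_id) "
    "VALUES (1, 'pilot study', '2016-01-01', '2016-01-02', 1)"
)
conn.executemany(
    "INSERT INTO user_project (user_id, project_id, permission) VALUES (?, ?, ?)",
    [(1, 1, "admin"), (2, 1, "read")],
)


def permission_for(username, project_name):
    """Return the permission level of a user on a project, or None."""
    row = conn.execute(
        """
        SELECT up.permission
        FROM user_project AS up
        JOIN account AS a ON a.id = up.user_id
        JOIN project AS p ON p.id = up.project_id
        WHERE a.username = ? AND p.name = ?
        """,
        (username, project_name),
    ).fetchone()
    return row[0] if row else None


print(permission_for("bob", "pilot study"))
print(permission_for("carol", "pilot study"))
```

Keeping the permission on the joining table, rather than on the user or the project, is what lets one user hold different access levels on different projects.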
The second module, represented in Figure 3.5, stores and organises the raw files generated by the instrument. The user may upload two types of samples, hence the separation of the module into two similar structures. The first type, simply called “sample”, corresponds to the biological samples of the experiment. Each sample run on the instrument can be acquired in positive polarity, negative polarity or both, with each polarity contained in a separate file. The files describing the same sample are stored in the “file” table and are organised using the “SampleFileGroup” joining table. The other sample type, stored in the “CalibrationSample” table, is used for quality control purposes: it stores and organises pooled, blank and external standard samples. The main difference from the biological samples is that the standard files can be stored as a CSV file containing both polarities. This difference is reflected by the “data” field in the “StandardFileGroup” table. This table and its equivalent for the biological samples, “SampleFileGroup”, also define the format of the stored file. Finally, the “Curve” table stores the total ion chromatogram (TIC) of each sample. This table was created for optimisation purposes: the TIC can also be obtained from the file itself, but doing so requires more time and processing power than a simple query.
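The biological-sample half of the fileupload module might look like the following sketch. It again uses sqlite3 for self-containment. The column names, the "mzXML" format value, the JSON encoding of the TIC and the rule of at most one file per polarity per sample are all assumptions made for illustration.

```python
import json
import sqlite3

# Illustrative schema only; names and encodings are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE file (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
-- Joining table: groups the files that describe the same sample.
CREATE TABLE sample_file_group (
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    file_id INTEGER NOT NULL REFERENCES file(id),
    polarity TEXT NOT NULL CHECK (polarity IN ('positive', 'negative')),
    format TEXT NOT NULL,
    UNIQUE (sample_id, polarity)
);
-- Cached total ion chromatogram, so it need not be re-read from the file.
CREATE TABLE curve (
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    polarity TEXT NOT NULL,
    tic_json TEXT NOT NULL,
    PRIMARY KEY (sample_id, polarity)
);
""")

# Made-up example rows: one sample acquired in both polarities.
conn.execute("INSERT INTO sample (id, name) VALUES (1, 'A')")
conn.executemany("INSERT INTO file (id, path) VALUES (?, ?)",
                 [(1, "A_pos.mzXML"), (2, "A_neg.mzXML")])
conn.executemany(
    "INSERT INTO sample_file_group (sample_id, file_id, polarity, format) "
    "VALUES (?, ?, ?, ?)",
    [(1, 1, "positive", "mzXML"), (1, 2, "negative", "mzXML")],
)
# Each TIC point is a (retention time, summed intensity) pair.
conn.execute(
    "INSERT INTO curve (sample_id, polarity, tic_json) VALUES (?, ?, ?)",
    (1, "positive", json.dumps([[0.0, 1200.0], [1.5, 3400.0]])),
)


def files_for_sample(sample_id):
    """Map each polarity of a sample to the path of its raw file."""
    rows = conn.execute(
        """
        SELECT g.polarity, f.path
        FROM sample_file_group AS g
        JOIN file AS f ON f.id = g.file_id
        WHERE g.sample_id = ?
        """,
        (sample_id,),
    ).fetchall()
    return dict(rows)


def cached_tic(sample_id, polarity):
    """Return the stored TIC as a list of [time, intensity] pairs, or None."""
    row = conn.execute(
        "SELECT tic_json FROM curve WHERE sample_id = ? AND polarity = ?",
        (sample_id, polarity),
    ).fetchone()
    return json.loads(row[0]) if row else None


print(files_for_sample(1))
print(cached_tic(1, "positive"))
```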
The “groups” module (Figure 3.6) captures the experimental information of a particular study. This adds an extra layer to the organisation of the samples. Two primary pieces of information are stored in this module: the levels and the factors, stored in the Attribute and Group tables respectively. A factor represents a category of a biological sample, and a level represents its condition within that category. For instance, if a factor is “gender”, the level could be “male”, “female” or “undefined”. For flexibility, sample entries are attached to the Attribute table through a joining table. This structure allows the storage of one level per factor for each sample. For example, sample A could be annotated with the level “male” under the factor “gender”, and “time 0” under the factor “time”. One sample can only be attached to one level under a specific factor; however, the number of factors is not limited to allow the definition
Figure 3.3: Database structure showing the general organisation of the data storage in modules. Four data modules (projects, fileupload, groups and experiments) are used to store the data captured from the user; the other two modules (data and compound) are used to store the processed and biological data generated by the data processing pipeline.
Figure 3.4: Detailed structure of the Projects module showing the organisation of metadata and user permission capabilities.
Figure 3.5: Detailed structure of the Fileupload module showing the organisation of the different files required for LCMS data analysis.
Figure 3.6: Detailed structure of the Groups module showing the organisation of the biological samples between factors (groups) and levels (attributes).
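The rule of the groups module that a sample carries at most one level per factor can be sketched as a database constraint. This is one possible way to enforce it, not necessarily how PiMP does: the factor identifier is repeated on the joining table so that a uniqueness constraint can cover it. The table and column names are assumptions, and sqlite3 stands in for MySQL.

```python
import sqlite3

# Illustrative schema only; names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Factors (the Group table in the text), e.g. "gender" or "time".
CREATE TABLE factor_group (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Levels (the Attribute table in the text), each belonging to one factor.
CREATE TABLE attribute (
    id INTEGER PRIMARY KEY,
    group_id INTEGER NOT NULL REFERENCES factor_group(id),
    name TEXT NOT NULL,
    UNIQUE (id, group_id)
);
CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Joining table: UNIQUE (sample_id, group_id) allows one level per factor.
CREATE TABLE sample_attribute (
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    attribute_id INTEGER NOT NULL,
    group_id INTEGER NOT NULL,
    FOREIGN KEY (attribute_id, group_id) REFERENCES attribute(id, group_id),
    UNIQUE (sample_id, group_id)
);
""")

conn.executemany("INSERT INTO factor_group (id, name) VALUES (?, ?)",
                 [(1, "gender"), (2, "time")])
conn.executemany(
    "INSERT INTO attribute (id, group_id, name) VALUES (?, ?, ?)",
    [(1, 1, "male"), (2, 1, "female"), (3, 1, "undefined"), (4, 2, "time 0")],
)
conn.execute("INSERT INTO sample (id, name) VALUES (1, 'A')")


def annotate(sample_id, factor, level):
    """Attach a sample to a level of a factor."""
    row = conn.execute(
        """
        SELECT a.id, a.group_id
        FROM attribute AS a
        JOIN factor_group AS g ON g.id = a.group_id
        WHERE g.name = ? AND a.name = ?
        """,
        (factor, level),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown level {level!r} for factor {factor!r}")
    conn.execute(
        "INSERT INTO sample_attribute (sample_id, attribute_id, group_id) "
        "VALUES (?, ?, ?)",
        (sample_id, row[0], row[1]),
    )


def levels_of(sample_id):
    """Map each factor to the level recorded for the sample."""
    rows = conn.execute(
        """
        SELECT g.name, a.name
        FROM sample_attribute AS sa
        JOIN attribute AS a ON a.id = sa.attribute_id
        JOIN factor_group AS g ON g.id = sa.group_id
        WHERE sa.sample_id = ?
        """,
        (sample_id,),
    ).fetchall()
    return dict(rows)


annotate(1, "gender", "male")
annotate(1, "time", "time 0")
try:
    annotate(1, "gender", "female")  # second level under the same factor
except sqlite3.IntegrityError:
    print("rejected: sample A already has a level for 'gender'")
print(levels_of(1))
```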
The next module, represented in Figure 3.7 and named “experiments”, captures two types of information: the analysis parameters and the different levels to compare. The “params” table stores all the parameters and information required for the back-end pipeline to run, and the “experiment” table stores the information about the comparisons to perform. The analysis table brings together those two sets of information along with extra fields such as the status of the analysis (i.e. “Running” or “Finished”) and time stamps.
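A minimal sketch of how an analysis record could tie a parameter set to an experiment and track its status is given here. The parameter columns (ppm, rt_window), the submit and finish helpers and all other names are hypothetical. Only the two status values come from the text above, and sqlite3 is used in place of MySQL.

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative schema only; parameter and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE params (
    id INTEGER PRIMARY KEY,
    ppm REAL NOT NULL,
    rt_window REAL NOT NULL
);
CREATE TABLE experiment (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- One row per pair of levels to compare within an experiment.
CREATE TABLE comparison (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    control_level TEXT NOT NULL,
    test_level TEXT NOT NULL
);
-- Brings parameters and comparisons together, with status and time stamp.
CREATE TABLE analysis (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    params_id INTEGER NOT NULL REFERENCES params(id),
    status TEXT NOT NULL CHECK (status IN ('Running', 'Finished')),
    submitted TEXT NOT NULL
);
""")

conn.execute("INSERT INTO params (id, ppm, rt_window) VALUES (1, 3.0, 0.5)")
conn.execute("INSERT INTO experiment (id, title) VALUES (1, 'gender effect')")
conn.execute(
    "INSERT INTO comparison (id, experiment_id, control_level, test_level) "
    "VALUES (1, 1, 'female', 'male')"
)


def submit(experiment_id, params_id):
    """Create an analysis in the 'Running' state and return its id."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO analysis (experiment_id, params_id, status, submitted) "
        "VALUES (?, ?, 'Running', ?)",
        (experiment_id, params_id, now),
    )
    return cur.lastrowid


def finish(analysis_id):
    """Mark an analysis as finished."""
    conn.execute("UPDATE analysis SET status = 'Finished' WHERE id = ?",
                 (analysis_id,))


def status_of(analysis_id):
    """Return the current status string of an analysis."""
    return conn.execute("SELECT status FROM analysis WHERE id = ?",
                        (analysis_id,)).fetchone()[0]


analysis_id = submit(experiment_id=1, params_id=1)
print(status_of(analysis_id))
finish(analysis_id)
print(status_of(analysis_id))
```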
Figure 3.7: Detailed structure of the Experiments module showing the organisation of the levels to compare and the analysis parameters.
The “data” module in Figure 3.8 corresponds to the extracted and computed data that result from running the sample files through the back-end pipeline with the selected set of parameters and comparisons. The main information stored in this module is the peaks (in the “peak” table). The other tables are all joining tables that store extra information in relation to other data entries. The dataset table represents the set of peaks that has been extracted from the sample files for a particular analysis. The presence or absence of a peak (and its intensity if present) in a specific sample is stored in the joining table called “peakDTsample”. Further information about the peak, such as the mass, the retention time or the polarity, is stored directly within the peak table. The “peakQCsample” table stores the same information as the “peakDTsample” table but for the calibration samples (pooled and blank samples). The last table of this module (“PeakComparison”) stores precomputed data resulting from the analysis pipeline, such as the p-value and log fold change of a peak between two different conditions. This table is a joining table between the peak and comparison tables.
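The benefit of storing precomputed statistics in a joining table can be seen in a short sketch. It uses sqlite3 rather than MySQL. The column names, the use of a NULL intensity to mean "peak absent" and all numeric values are assumptions for illustration only.

```python
import sqlite3

# Illustrative schema only; names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, analysis_id INTEGER NOT NULL);
CREATE TABLE peak (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL REFERENCES dataset(id),
    mass REAL NOT NULL,
    rt REAL NOT NULL,
    polarity TEXT NOT NULL
);
CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Joining table: intensity of a peak in a sample (NULL when absent).
CREATE TABLE peak_dt_sample (
    peak_id INTEGER NOT NULL REFERENCES peak(id),
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    intensity REAL,
    PRIMARY KEY (peak_id, sample_id)
);
CREATE TABLE comparison (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Joining table: statistics precomputed by the pipeline.
CREATE TABLE peak_comparison (
    peak_id INTEGER NOT NULL REFERENCES peak(id),
    comparison_id INTEGER NOT NULL REFERENCES comparison(id),
    p_value REAL NOT NULL,
    log_fc REAL NOT NULL,
    PRIMARY KEY (peak_id, comparison_id)
);
""")

conn.execute("INSERT INTO dataset (id, analysis_id) VALUES (1, 1)")
conn.executemany(
    "INSERT INTO peak (id, dataset_id, mass, rt, polarity) VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 180.0634, 312.5, "positive"), (2, 1, 132.1019, 95.0, "negative")],
)
conn.executemany("INSERT INTO sample (id, name) VALUES (?, ?)",
                 [(1, "A"), (2, "B")])
conn.executemany(
    "INSERT INTO peak_dt_sample (peak_id, sample_id, intensity) VALUES (?, ?, ?)",
    [(1, 1, 15000.0), (1, 2, None), (2, 1, 800.0), (2, 2, 950.0)],
)
conn.execute("INSERT INTO comparison (id, name) VALUES (1, 'male vs female')")
conn.executemany(
    "INSERT INTO peak_comparison (peak_id, comparison_id, p_value, log_fc) "
    "VALUES (?, ?, ?, ?)",
    [(1, 1, 0.001, 2.3), (2, 1, 0.4, -0.1)],
)


def significant_peaks(comparison_id, alpha=0.05):
    """Return (mass, rt, log_fc) for peaks with p_value below alpha."""
    return conn.execute(
        """
        SELECT p.mass, p.rt, pc.log_fc
        FROM peak_comparison AS pc
        JOIN peak AS p ON p.id = pc.peak_id
        WHERE pc.comparison_id = ? AND pc.p_value < ?
        ORDER BY pc.p_value
        """,
        (comparison_id, alpha),
    ).fetchall()


def intensities(peak_id):
    """Map sample name to intensity (None when the peak is absent)."""
    rows = conn.execute(
        """
        SELECT s.name, ps.intensity
        FROM peak_dt_sample AS ps
        JOIN sample AS s ON s.id = ps.sample_id
        WHERE ps.peak_id = ?
        """,
        (peak_id,),
    ).fetchall()
    return dict(rows)


print(significant_peaks(1))
print(intensities(1))
```

Because the p-values and fold changes are read from a table rather than recomputed, filtering peaks for a comparison reduces to a single indexed join.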
Figure 3.8: Detailed structure of the Data module showing the organisation of extracted and processed raw data into features (peaks) with attached values.
The last module is attached to the rest of the data structure only through the peak table. The “compound” module, whose structure is shown in Figure 3.9, stores data from external resources about biological compounds and pathways. Entries in the compound table must be unique, and the information about the external database can be found in the “RepositoryCompound” table. One compound can have many repository entries, which keeps the structure flexible and extendable to any external database. The four other tables in this module store and organise pathway information. The “pathway”, “superPathway” and “DataSource” tables respectively store the names of pathways, the super-pathways (sets of pathways related to one another) and the external data source from which the information has been extracted. One large joining table brings all the information together by joining a pathway to a super-pathway, a data source and many compounds.
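The compound module can be sketched along the same lines, again using sqlite3 instead of MySQL. All table and column names are assumptions, uniqueness of a compound is approximated here by its name, and the compound, repository, pathway and data source values are deliberately made-up placeholders rather than real database entries.

```python
import sqlite3

# Illustrative schema only; names and values are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- One compound may have entries in many external repositories.
CREATE TABLE repository_compound (
    id INTEGER PRIMARY KEY,
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    repository TEXT NOT NULL,
    identifier TEXT NOT NULL,
    UNIQUE (repository, identifier)
);
CREATE TABLE pathway (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE super_pathway (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE data_source (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Joining table: pathway + super-pathway + data source + compound.
CREATE TABLE compound_pathway (
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    pathway_id INTEGER NOT NULL REFERENCES pathway(id),
    super_pathway_id INTEGER NOT NULL REFERENCES super_pathway(id),
    data_source_id INTEGER NOT NULL REFERENCES data_source(id),
    PRIMARY KEY (compound_id, pathway_id, data_source_id)
);
""")

conn.execute("INSERT INTO compound (id, name) VALUES (1, 'compound X')")
conn.executemany(
    "INSERT INTO repository_compound (compound_id, repository, identifier) "
    "VALUES (?, ?, ?)",
    [(1, "repoA", "A0001"), (1, "repoB", "B-42")],
)
conn.executemany("INSERT INTO pathway (id, name) VALUES (?, ?)",
                 [(1, "pathway P"), (2, "pathway Q")])
conn.execute("INSERT INTO super_pathway (id, name) VALUES (1, 'super-pathway S')")
conn.execute("INSERT INTO data_source (id, name) VALUES (1, 'source D')")
conn.executemany(
    "INSERT INTO compound_pathway "
    "(compound_id, pathway_id, super_pathway_id, data_source_id) "
    "VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 1), (1, 2, 1, 1)],
)


def pathways_for(compound_name):
    """Return (pathway, super-pathway, data source) rows for a compound."""
    return conn.execute(
        """
        SELECT p.name, sp.name, ds.name
        FROM compound_pathway AS cp
        JOIN compound AS c ON c.id = cp.compound_id
        JOIN pathway AS p ON p.id = cp.pathway_id
        JOIN super_pathway AS sp ON sp.id = cp.super_pathway_id
        JOIN data_source AS ds ON ds.id = cp.data_source_id
        WHERE c.name = ?
        ORDER BY p.name
        """,
        (compound_name,),
    ).fetchall()


def repository_ids(compound_name):
    """Return the (repository, identifier) pairs recorded for a compound."""
    return conn.execute(
        """
        SELECT rc.repository, rc.identifier
        FROM repository_compound AS rc
        JOIN compound AS c ON c.id = rc.compound_id
        WHERE c.name = ?
        ORDER BY rc.repository
        """,
        (compound_name,),
    ).fetchall()


print(pathways_for("compound X"))
print(repository_ids("compound X"))
```

Supporting a new external database in this layout only requires new rows in the repository table, not a schema change.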
Figure 3.9: Detailed structure of the Compound module showing the organisation of the biological data used to enrich the processed data described in the Data module.
The modular design of the data structure has been implemented for flexibility and scalability purposes (Figure 3.3). Although the current structure supports any type of metabolomics data, the field evolves rapidly, and it is important that modules can be extended or replaced with ease to support long-term changes. For example, the file module supports the files currently generated by mass spectrometers. However, the file structure generated by these instruments may change in the future. It is therefore important to be able to adapt the module with minimal repercussions on the rest of the database.