
Both public and user data are transparently accessible over the cloud VLAN, but not from outside it. This limits the adoption of cloud computing federations: even though PMES enables remote executions across multiple clouds, data dependencies would hamper a full implementation. Hence, the available multi-institution storage solutions are explored with the aim of integrating the most convenient one into the current MuG DMP, preserving the current model as much as possible.

The oneData system is chosen as the most suitable solution. It permits a totally transparent use of the remote data resources by offering a virtual POSIX file system mounted via FUSE, so no major disruptions are expected in the DMP already in place. The data redundancy rule applied is “copy on read”, which is convenient for unbalanced data clouds like MuG's, very diverse in terms of proximity to public repositories, capacity and accessibility. In this way, data are transferred from one cloud to another only on demand during job executions. Furthermore, the activation of a user's virtual storage is a POSIX-based, non-interactive process, which is essential for later integrating it into the dynamic, user-specific contextualization of the storage via OCCI at boot time. Additionally, oneData supports the OpenID Connect protocol, which makes it trivial to couple it with the MuG authentication system. Finally, the system supports mechanisms for sharing datasets, a feature that could well complement MuGVRE.
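The “copy on read” rule can be illustrated with a minimal sketch. It is not oneData code: two local directories stand in for two clouds, and the class and function names are hypothetical. It only shows that a file is replicated the first time it is read from a site and served locally afterwards.

```python
import shutil
import tempfile
from pathlib import Path


class CopyOnReadStore:
    """Toy model of a "copy on read" rule between two storage sites.

    Hypothetical illustration: plain directories stand in for cloud storage.
    """

    def __init__(self, local_dir: Path, remote_dir: Path) -> None:
        self.local_dir = local_dir
        self.remote_dir = remote_dir
        self.transfers = 0  # number of remote-to-local copies performed

    def read(self, rel_path: str) -> bytes:
        local = self.local_dir / rel_path
        if not local.exists():
            # First access from this site: replicate the file on demand.
            remote = self.remote_dir / rel_path
            local.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(remote, local)
            self.transfers += 1
        return local.read_bytes()


def demo() -> int:
    """Read the same remote file twice and return the number of transfers."""
    with tempfile.TemporaryDirectory() as tmp:
        remote = Path(tmp, "cloud_a")
        local = Path(tmp, "cloud_b")
        (remote / "uploads").mkdir(parents=True)
        local.mkdir()
        (remote / "uploads" / "G1.bam").write_bytes(b"reads")
        store = CopyOnReadStore(local, remote)
        store.read("uploads/G1.bam")  # triggers the transfer
        store.read("uploads/G1.bam")  # served from the local replica
        return store.transfers


if __name__ == "__main__":
    print(demo())
```

Under this rule a second job on the same site finds the file already in place, which is the behaviour the pilot installation relies on.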

Although oneData is not yet part of the MuG platform, a pilot installation has been built and configured at BSC premises in order to design a detailed integration plan. The installation is composed of two VMs (oneProviderEBI and oneProviderBSC) emulating two geographically separated clouds, each contributing storage capacity. A third VM federates the providers under a single view (oneZone), and a last VM represents a client (OneClient) accessing the storage via oneZone, for instance MuGVRE or any of the tool VMs. Together they emulate a scenario where a oneProvider server sits on top of each MuG cloud, with its corresponding local NAS, while a Zone has access to them via TCP/IP ports. The system already interconnects with MuG users via the MuG authentication server, and specific OpenID group memberships and scopes are set up, as oneData uses them to manage data access across clouds. According to these, MuG users have access to certain oneData Spaces, which are virtual views of data volumes supported by the distinct cloud NASs. MuG users are configured to have two Spaces accessible by default: a private Space, analogous to the current MuGVRE user workspace, and a so-called “public” Space, planned for caching data from public repositories. Upon user login from the client (the MuGVRE server), their Spaces are transparently mounted via FUSE, making the user's files accessible via POSIX regardless of their physical cloud datastores. Thus, MuGVRE visualizers running on the web server could gain access to remote files, which are copied locally only when accessed. Similarly, if the system login were set up during the contextualization of PMES-enabled VMs being deployed into the Embassy Cloud, the VM would access the user's uploaded input file, initially sitting on the cloud hosting MuGVRE and transparently transferred to Embassy's data block. A second execution there would already find the file in place. Scripts for user registration and Space configuration have been prepared, all using the oneData REST API. The next step is integrating them into the MuGVRE backend.

Metadata

Metadata creation and handling have sharpened important MuGVRE features in terms of user experience, automatic application administration, and reproducibility. Both backend and frontend rely heavily on the metadata stored in MongoDB documents, whose collection entries are modeled after the PHP objects used in MuGVRE. The most relevant data models are the following:

Data model | Definition | MuGVRE DB | Annex
File | Descriptive and operational metadata defining files and directories. MuG partners prepared REST APIs also based on this model. Its specification is annexed. | “Files” collection: operational metadata; “Files Metadata”: rest of metadata | 8.4.1 Data Model: “File”
Tool | Metadata describing the applications to be executed on the cloud. Defined using JSON schemas: one for tool developers willing to register their tool, and a second one, for internal use, with some extra fields (e.g. identifiers). | “Tools” collection | 8.4.2 Data Model: “Tool”

Such data models include descriptive and technical metadata that help MuGVRE to accomplish some basic functionalities like job accounting, data provenance, new tool and visualizer integration, tools interoperability, etc. “File” and “Tool” models structure the necessary data.

“File” data model

“File” defines a file or directory resource stored in the MuG infrastructure and is represented as exemplified in Snippet 4.7. “File” is the junction point between MuGVRE and the metadata management APIs implemented by MuG partners. The model corresponds to two synchronized collections in the MuGVRE database: (i) the “Files” collection, strictly storing the operational metadata that cover the minimal functionalities of MuGVRE as a file server application, and (ii) the rest of the metadata required to build the VRE, i.e. semantic and descriptive metadata.

{
  "file_id": "MuGUSER59e5ead574743_5cf8c3d1b43156.06448028",
  "path_type": "file",
  "file_path": "MuGUSER59e5ead574743/__PROJ5c6c417267b522/uploads/G1.bam",
  "cloud": "mug-irb",
  "user_id": "MuGUSER59e5ead574743",
  "project": "__PROJ5c6c417267b522.33832470",
  "size": 725988911,
  "parent_dir": "MuGUSER59e5ead574743_5c6c417280d635.07060077",
  "expiration_time": {"sec": 1559806931, "usec": 0},
  "creation_time": {"sec": 1559806929, "usec": 0},
  "source_id": ["MuGUSER59e5ead574743_566041g28dd456.675598"],
  "tool_id": "BAMindex",
  "arguments": {"sort": true},
  "data_type": "data_mnase_seq",
  "file_type": "BAM",
  "compressed": false,
  "metadata": {
    "refGenome": "R64-1-1",
    "taxon_id": 4932,
    "description": "MNase-seq for S. cerevisiae cells synchronized in G1",
    "paired": "paired",
    "sorted": "sorted",
    "associated_files": ["MuGUSER59e5ead574743_5cf8c3bf2151a3.62251628"],
    "validated": true,
    "visible": true
  }
}

Snippet 4.7 : Example for “File” data model in MuGVRE
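How one “File” record could be kept as two synchronized documents can be sketched as follows. This is an illustration only: the set of keys treated as operational is an assumption made for the example, and the function names are hypothetical, not MuGVRE code.

```python
from typing import Any, Dict, Tuple

# Assumption: which keys count as "operational" is a guess for illustration.
OPERATIONAL_KEYS = {
    "file_id", "path_type", "file_path", "cloud", "user_id",
    "project", "size", "parent_dir", "expiration_time", "creation_time",
}


def split_file_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a "File" record into an operational and a descriptive document.

    Both documents carry the same "file_id" so they can be re-joined.
    """
    operational = {k: v for k, v in record.items() if k in OPERATIONAL_KEYS}
    descriptive = {k: v for k, v in record.items() if k not in OPERATIONAL_KEYS}
    descriptive["file_id"] = record["file_id"]
    return operational, descriptive


def merge_file_record(operational: Dict[str, Any], descriptive: Dict[str, Any]) -> Dict[str, Any]:
    """Re-join the two documents, checking that they describe the same file."""
    if operational["file_id"] != descriptive["file_id"]:
        raise ValueError("documents belong to different files")
    return {**descriptive, **operational}


if __name__ == "__main__":
    record = {
        "file_id": "f1",
        "file_path": "user/project/uploads/G1.bam",
        "size": 725988911,
        "file_type": "BAM",
        "data_type": "data_mnase_seq",
        "metadata": {"taxon_id": 4932},
    }
    files_doc, files_metadata_doc = split_file_record(record)
    print(files_doc)
    print(files_metadata_doc)
```

Keeping the shared “file_id” in both documents is what allows the two collections to stay synchronized.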

The fields in Snippet 4.7 fall into four groups: operational metadata, file provenance, minimal description and descriptive metadata. Operational metadata is collected for the user's input and output files. It provides the platform with a flexible data hierarchy and a dynamic allocation system based on a unique “file_id” and on “file_path” addresses relative to a cloud storage. Full data access is resolved at the application level, either by MuGVRE or by the MuG data APIs. Together with other metadata objects like “JobProcess” or “User”, MuGVRE provides a job accounting and data provenance system by storing: the file lineage in “source_id”, which records job transformation operations; the input files and argument values for each run; file and job timestamps with a full registry; tool version control; operation logging; etc. Such metadata, together with the use of sandboxed executions in virtual environments, is essential for achieving reproducibility on the system.
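As an illustration of how lineage recorded in “source_id” could be walked back to the original inputs, here is a minimal sketch. The in-memory dictionary stands in for the “Files” collection, and the identifiers and the function name are hypothetical.

```python
from typing import Any, Dict, List


def trace_lineage(file_id: str, files: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the ancestors of `file_id`, nearest first, following "source_id"."""
    lineage: List[str] = []
    seen = {file_id}
    queue = list(files[file_id].get("source_id", []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # guard against repeated or cyclic references
        seen.add(current)
        lineage.append(current)
        parent = files.get(current)
        if parent is not None:
            queue.extend(parent.get("source_id", []))
    return lineage


if __name__ == "__main__":
    # Hypothetical records: an uploaded file, a sorted copy, and its index.
    files = {
        "raw": {"source_id": []},
        "sorted": {"source_id": ["raw"], "tool_id": "example_sort_tool"},
        "indexed": {"source_id": ["sorted"], "tool_id": "BAMindex"},
    }
    print(trace_lineage("indexed", files))
```

Combined with the stored “tool_id” and “arguments” of each record, such a walk lists the steps needed to regenerate a given file.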

The descriptive metadata accompanying files, as well as applications and visualizers, forms the basis of tool interoperability in the platform. The “file_type” and “data_type” fields form the minimal set of descriptive metadata required for MuGVRE to operate. They semantically define the content (e.g. “DNA sequence”) and the format (e.g. “FASTA”) of a “File” record. In turn, when applied to “Tool” or “Visualizer” definitions, they constrain the suitable input files and specify the expected output files. Matching the metadata of “Tools” and “Files” makes it possible to chain the output files of one tool as input files of another, as well as to guide the user experience, for instance by dynamically building toolkits of “Available Tools” that respond to the file selections in the user's workspace.
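The matching between “Tools” and “Files” can be sketched as below. This is a simplified illustration, not the MuGVRE implementation: it assumes a tool is offered when every required input can be satisfied by some selected file, and the second tool record is invented for the example.

```python
from typing import Any, Dict, List


def accepts(input_spec: Dict[str, Any], file_record: Dict[str, Any]) -> bool:
    """True if the file's format and content match the tool input definition."""
    return (
        file_record["file_type"] in input_spec["file_type"]
        and file_record["data_type"] in input_spec["data_type"]
    )


def available_tools(selected: List[Dict[str, Any]], tools: List[Dict[str, Any]]) -> List[str]:
    """Ids of the tools whose required inputs are all covered by the selection."""
    result = []
    for tool in tools:
        required = [s for s in tool["input_files"] if s.get("required", False)]
        if all(any(accepts(spec, f) for f in selected) for spec in required):
            result.append(tool["_id"])
    return result


if __name__ == "__main__":
    tools = [
        {"_id": "naflex", "input_files": [
            {"name": "pdb", "file_type": ["PDB"],
             "data_type": ["na_structure"], "required": True}]},
        # Hypothetical tool record, invented for this example.
        {"_id": "example_bam_tool", "input_files": [
            {"name": "reads", "file_type": ["BAM"],
             "data_type": ["data_mnase_seq"], "required": True}]},
    ]
    selection = [{"file_type": "BAM", "data_type": "data_mnase_seq"}]
    print(available_tools(selection, tools))
```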

“Tool” data model

The “Tool” entity is the MuGVRE PaaS building block: it defines “what” is to be executed for a Tool Developer's application, and “how”. The following snippet shows a simplified example:

{
  "_id" : "naflex",
  "name" : "NAFlex analyses",
  "title" : "Nucleic Acids Flexibility Analysis",
  "short_description" : "Set of analyses to extract [...]",
  "long_description" : "NAFlex provides a [...]",
  "url" : "http://mmb.irbbarcelona.org/NAFlex/",
  "owner" : {
    "author" : "Adam Hospital", [...]
  "status" : 1,
  "keywords" : ["dna", "rna", "dynamics"],
  "keywords_tool" : ["nucleic acid NA", "flexibility", "curves"],
  "infrastructure" : {
    "memory" : 16,
    "cpus" : 1,
    "executable" : "/home/MuG/NAFlex/NAFlex_Wrapper.py",
    "clouds" : {
      "mug-irb" : {
        "launcher" : "PMES",
        "minimumVMs" : 1,
        "initialVMs" : 1,
        "imageName" : "uuid_mugMD_99"
      }
    }
  },
  "input_files" : [
    {
      "name" : "pdb",
      "description" : "Input Structure, pdb format",
      "help" : "Input representative structure [...]",
      "file_type" : ["PDB"],
      "data_type" : ["na_structure"],
      "required" : true,
      "allow_multiple" : false
    }, {
      "name" : "top",
      "description" : "Input Topology, Amber Parmtop v7 format", [...]
    }, {
      "name" : "crd",
      "description" : "Input Trajectory, Amber mdcrd format", [...]
  "input_files_combinations" : [
    {
      "description" : "Analyses from trajectory",
      "input_files" : ["pdb", "top", "crd"]
    }, {
      "description" : "Analyses from structure",
      "input_files" : ["pdb"]
    }
  ],
  "arguments" : [
    {
      "name" : "operations",
      "description" : "Flexibility Analysis to be computed",
      "type" : "enum_multiple",
      "enum_items" : {
        "name" : ["Curves", "Nmr_NOEs", [...]],
        "description" : ["Curves", "NMR NOEs", [...]]
      },
      "required" : true,
      "allow_multiple" : false,
      "default" : ["Curves"]
    }
  ],
  "output_files" : [
    {
      "name" : "NAFlex_report",
      "required" : true,
      "allow_multiple" : false,
      "file" : {
        "file_type" : "TAR",
        "data_type" : "tool_report",
        "meta_data" : {
          "description" : "NAFlex analyses [...]",
          "compressed" : "gzip",
          "visible" : true
        }
      }
    }, {
      "name" : "CURVES_torsions", [...]
  ]
}

Snippet 4.8: Example for “Tool” data model in MuGVRE

Snippet 4.8 is annotated in five sections: tool description, deployment details, input file requirements, arguments and expected output files. The use of descriptive fields like “Title”, “Description” or “keywords” allows MuGVRE to automatically create help and usage applets, tool discovery and browsing functionalities, etc. The “infrastructure” nested object defines how to remotely invoke the application from MuGVRE. It includes the location of the main application “executable” in the VM, how to reach it (either via PMES or via SGE, depending on the particular VM configuration), the computational resources that the tool developer estimated necessary, and the parameters bounding elasticity.

On top of that, the enumeration and description of “input_files” and “arguments” for each tool allows MuGVRE to build the tool launching forms automatically (Figure 4.35), based on PHP web templates. Some JavaScript may be added manually, for instance to control dependencies between argument fields.
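The idea of deriving a launch form from a “Tool” record can be sketched as follows. MuGVRE does this with PHP web templates; the Python below is only an illustration of the mapping, and the shape of the resulting field descriptors is an assumption.

```python
from typing import Any, Dict, List


def build_form_fields(tool: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn "input_files" and "arguments" of a "Tool" record into form fields."""
    fields: List[Dict[str, Any]] = []
    for spec in tool.get("input_files", []):
        fields.append({
            "name": spec["name"],
            "kind": "file",
            "label": spec.get("description", spec["name"]),
            "required": spec.get("required", False),
            "multiple": spec.get("allow_multiple", False),
            "accepts": spec.get("file_type", []),
        })
    for arg in tool.get("arguments", []):
        field = {
            "name": arg["name"],
            "kind": arg.get("type", "string"),
            "label": arg.get("description", arg["name"]),
            "required": arg.get("required", False),
            "default": arg.get("default"),
        }
        if "enum_items" in arg:
            # Pair each enumerated value with its human-readable description.
            items = arg["enum_items"]
            field["choices"] = list(zip(items["name"], items["description"]))
        fields.append(field)
    return fields


if __name__ == "__main__":
    tool = {
        "input_files": [{"name": "pdb", "description": "Input Structure, pdb format",
                         "file_type": ["PDB"], "required": True}],
        "arguments": [{"name": "operations", "type": "enum_multiple",
                       "enum_items": {"name": ["Curves", "Nmr_NOEs"],
                                      "description": ["Curves", "NMR NOEs"]},
                       "required": True, "default": ["Curves"]}],
    }
    for form_field in build_form_fields(tool):
        print(form_field)
```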

Figure 4.35: Tool web form in MuGVRE. It is automatically built from the “Tool” record registered in the database. The example corresponds to the “Tool” instance in Snippet 4.8. Figure labels: “input_files” objects' array; “arguments” objects' array; common form section.

Such metadata-driven automation is the first stage of the complete tool management lifecycle (Figure 4.36), which is fully controlled by “Tool” and “File” metadata. Such data (i) defines the MuGVRE submission to the PaaS components, (ii) circulates to the deployed VMs, and (iii) validates the application results. To do so:

I. “input_files”, “arguments” and “input_files_public”, as defined in the “Tool” object, are used to build the web form, which the researcher fills in with the values for a particular run.

II. These values are processed and written to two auxiliary JSON files (i.e. “in_metadata.json” and “config.json”), which are stored in the newly created “Run folder”, the future tool working directory. Annex 8.4.3 Job Auxiliary Files contains examples of such files.

III. The “infrastructure” fields of the “Tool” object (e.g. “memory”, “cores”, “comps” enabling, SGE “queue”, VM “image”, application “executable”, etc.) are used to compose the job request for the selected job processor (i.e. PMES or SGE), which triggers the execution on the underlying cloud.

IV. In the triggered command, the two JSON files are passed as arguments to the virtual instance. The MG-TOOL wrapper parses these files and composes the actual application command with all the necessary information regarding input files and arguments.

V. Once the application finishes, the MG-TOOL wrapper writes to “results.json” some relevant metadata about the output files just generated. Primarily, it includes the metadata that change on each run: “file_path” locations, the file “source_id”, or custom metadata attributes that tool developers want to inject into MuGVRE metadata (e.g. “docking_grid_size”).

VI. After job completion, MuGVRE builds a “File” object for each output file by aggregating (a) the information stored in the “Tool” object under the “output_files” section (e.g. “file_type”), (b) dynamic fields read from “results.json” (i.e. “path”), and (c) operational data (e.g. “owner”, “creation_time”, etc.). Once the output files are registered in the “Files” collection, they can be selected from the user's workspace.

Figure 4.36: Metadata flow among MuG elements during a tool life cycle. Illustration of how the user's selection of input files and arguments is passed from the web to the virtual machine via two auxiliary files (config.json and in_metadata.json). The “Tools” registry is the primary source.
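Steps II and VI of the lifecycle can be sketched as below. The actual layout of “config.json”, “in_metadata.json” and “results.json” is given in Annex 8.4.3 and is not reproduced here. The key names inside those files and both function names are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_job_files(run_dir: Path, input_files: Dict[str, str],
                    arguments: Dict[str, Any],
                    input_metadata: Dict[str, Any]) -> None:
    """Step II: store the form values as two JSON files in the run folder.

    Assumption: the keys "input_files" and "arguments" are hypothetical.
    """
    run_dir.mkdir(parents=True, exist_ok=True)
    config = {"input_files": input_files, "arguments": arguments}
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    (run_dir / "in_metadata.json").write_text(json.dumps(input_metadata, indent=2))


def build_output_records(tool: Dict[str, Any], results: Dict[str, Any],
                         operational: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Step VI: aggregate static, per-run and operational metadata."""
    specs = {spec["name"]: spec["file"] for spec in tool["output_files"]}
    records = []
    for out in results["output_files"]:
        spec = specs[out["name"]]
        record = {
            # (a) static definition taken from the "Tool" object
            "file_type": spec["file_type"],
            "data_type": spec["data_type"],
            "metadata": dict(spec.get("meta_data", {})),
            # (b) fields that change on each run, read from results.json
            "file_path": out["file_path"],
            "source_id": out.get("source_id", []),
        }
        record["metadata"].update(out.get("meta_data", {}))
        # (c) operational data supplied by the platform
        record.update(operational)
        records.append(record)
    return records


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_job_files(Path(tmp, "run001"), {"pdb": "file_1"},
                        {"operations": ["Curves"]}, {"file_1": {"file_type": "PDB"}})
        print(sorted(p.name for p in Path(tmp, "run001").iterdir()))
```

Keeping the static output definition in the “Tool” record means the wrapper only has to report what actually varies between runs.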

Use case: Nucleosome Dynamics

The present section describes the deployment and usage of one of the analysis tools offered on the MuGVRE platform, the Nucleosome Dynamics suite. The use case illustrates how a scientific method development can benefit from the integration into community-driven computational platforms, like MuGVRE or the well-known Galaxy. The example covers the integration of the tool at the MuG infrastructure as well as other implementation models. The corresponding publication is annexed at 8.7 Publications.

Nucleosome Dynamics suite

Nucleosome Dynamics is a suite of programs to define nucleosome architecture and dynamics from noisy experimental data. Different studies demonstrated that nucleosome positioning is coupled to gene function [230] and that transcriptional activity and nucleosome architecture are tightly coupled. The package allows both the definition of nucleosome architectures and the detection of changes in nucleosome organization due to changes in cellular conditions from MNase-seq and ATAC-seq experimental data. Results are annotated sequence files (GFFs and BED) that can be displayed in the genomic context thanks to sequence browsers, allowing the user a holistic, multidimensional view of the genome/transcriptome. The package shows good performance for both locating equilibrium nucleosome architecture and nucleosome dynamics.

Two programs, nucleR and NucDyn, have been specifically developed to perform such studies:

- nucleR performs Fourier transform filtering and peak calling, in order to efficiently and accurately define and classify the location of nucleosomes.

- NucDyn is a method to detect changes in nucleosome architectures based on MNase-seq experiments. It identifies nucleosomes’ insertions, evictions and shifts between two experiments at the read level.

The suite also covers the following analyses:

- Location of nucleosome-free regions (NFRs)

- Classification of transcription start sites based on the surrounding nucleosomes

- Study of nucleosome periodicity at the gene level

- Stiffness of the nucleosomes derived from fitting a Gaussian function to nucleosome profiles
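The general idea behind Fourier-transform filtering followed by peak calling, as mentioned for nucleR, can be illustrated with a small pure-Python sketch. It is not nucleR's implementation: the naive discrete Fourier transform, the number of frequency components kept, the threshold and the synthetic coverage profile are all assumptions chosen for the example.

```python
import cmath
import math
from typing import List


def dft(signal: List[float]) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2)), enough for short profiles."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def low_pass(signal: List[float], keep: int) -> List[float]:
    """Keep the `keep` lowest frequencies (and their mirrors), drop the rest."""
    n = len(signal)
    spectrum = dft(signal)
    filtered = [c if (k <= keep or k >= n - keep) else 0
                for k, c in enumerate(spectrum)]
    # Inverse transform; the imaginary part is numerical noise for real input.
    return [(sum(filtered[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def call_peaks(profile: List[float], threshold: float) -> List[int]:
    """Positions of local maxima above `threshold`."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] > threshold
            and profile[i] > profile[i - 1]
            and profile[i] >= profile[i + 1]]


if __name__ == "__main__":
    n = 80
    # Synthetic coverage: two nucleosome-like bumps plus high-frequency noise.
    clean = [math.exp(-((t - 20) ** 2) / 50.0) + math.exp(-((t - 60) ** 2) / 50.0)
             for t in range(n)]
    noisy = [c + 0.3 * (-1) ** t for t, c in enumerate(clean)]
    print(call_peaks(low_pass(noisy, keep=10), threshold=0.5))
```

Calling peaks directly on the noisy profile would report many spurious maxima; filtering first leaves one maximum per bump.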

Implementation Models

Nucleosome Dynamics is implemented as a set of R packages and libraries provided under several distribution models to fulfill the needs of different users. Moreover, it is also offered as a service in two different research platforms. All available distributions are explained on the Nucleosome Dynamics landing page and summarized in Table 4.5:

Category | Distribution | Location
Landing page | | http://mmb.irbbarcelona.org/NucleosomeDynamics/
Code distribution: standalone installation | Nucleosome Dynamics CLI | https://github.com/nucleosome-dynamics/nucleosome_dynamics
Code distribution: standalone installation | nucleR R package | https://github.com/nucleosome-dynamics/nucleR ; Bioconductor: https://www.bioconductor.org/packages/release/bioc/html/nucleR.html
Code distribution: standalone installation | NucDyn R package | https://github.com/nucleosome-dynamics/NucDyn ; Bioconductor: (in review)
Code distribution: containerized installation | Docker | https://github.com/nucleosome-dynamics/docker ; Docker-hub: mmbirb/nucleosome-dynamics
Code distribution: containerized installation | Singularity | https://github.com/nucleosome-dynamics/nucleosome_dynamics_singularity ; Singularity-hub: https://singularity-hub.org/collections/2579
Platforms in use | MuG Virtual Research Environment | https://vre.multiscalegenomics.eu/workspace/?from=nucldynwf
Platforms in use | Galaxy Platform | https://dev.usegalaxy.es (in development) ; Galaxy Tool-Shed: https://toolshed.g2.bx.psu.edu/repository?repository_id=822e9c879cf92fd0

Table 4.5: Implementation models for Nucleosome Dynamics
