4. RESULTS
4.1 VARIABLES NEEDED FOR THE REPRESENTATION OF A
4.1.2 Regarding the geographic variables linked to the territory under study
Docker is a cross-platform virtualization system that ensures research is computationally reproducible, regardless of where the software is run or what environment and hardware the institution may possess [100]. Additionally, Docker images are easily distributed, and image hashes can be used to validate that the proper image is always being used, giving researchers confidence that the results do not contain computational batch effects stemming from differences in software versioning, dependencies, and environment. The RNA-seq workflow that Treehouse uses, toil-rnaseq, encapsulates individual tools within immutable Docker images so that each tool's unique set of dependencies is isolated and does not conflict with any other tool or dependency (Figure B.3) [101]. Treehouse employs gene expression data obtained from one path in the workflow: sequence data is aligned with STAR, and gene expression is then quantified with RSEM [82, 57]. This data is then publicly hosted on the Xena Browser, a platform for analyzing and visualizing genomics data sets [62].
Using this distributable compute system, Treehouse reached out to collaborators who possess controlled-access data that they are unable to share. For example, collaborators in Canada are unable to move pediatric sequencing data out of the hospital due to Canadian law. An ambassador for Treehouse contacted several pediatric hospitals and explained that they could contribute valuable information to pediatric research without exposing the underlying sequence data, violating patient privacy, or exposing the data stewards to risk. Treehouse then worked with amenable hospitals to get the software running on each hospital's own compute infrastructure, securing more than 100 additional pediatric samples, with more than 150 additional samples on the way (Table 3.3). Treehouse has also taken advantage of the St. Jude Cloud, which allows Dockerized analysis software to run on hundreds of pediatric RNA-seq samples and illustrates the collaborative opportunities that emerge when data is moved to cloud-based repositories [34].
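The image-hash validation described above can be illustrated with a short sketch. This is not the Treehouse implementation: the pinned digest value and the names `normalize_digest` and `digest_matches` are hypothetical, and the sketch only compares digest strings (as a container runtime might report them) rather than invoking Docker itself.

```python
# Hypothetical pinned digest for an approved workflow image (placeholder value,
# not a real image digest).
PINNED_DIGEST = "sha256:" + "0" * 64

HEX_CHARS = set("0123456789abcdef")


def normalize_digest(digest: str) -> str:
    """Lower-case a digest and check that it looks like 'sha256:<64 hex chars>'."""
    digest = digest.strip().lower()
    algo, sep, hexpart = digest.partition(":")
    if algo != "sha256" or sep != ":" or len(hexpart) != 64:
        raise ValueError(f"not a sha256 digest: {digest!r}")
    if not set(hexpart) <= HEX_CHARS:
        raise ValueError(f"digest contains non-hex characters: {digest!r}")
    return digest


def digest_matches(reported: str, pinned: str = PINNED_DIGEST) -> bool:
    """Return True only if the reported image digest equals the pinned one."""
    return normalize_digest(reported) == normalize_digest(pinned)
```

A collaborating site could run a check like this before executing the workflow, refusing to proceed if the digest of the image it pulled differs from the one published alongside the compendium.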
Institution                                      # of Patient Samples   Deposited in Compendium / Available
Canada’s Michael Smith Genome Sciences Centre    39                     Yes
The Hospital for Sick Children                   130                    Awaiting data
Nationwide Children's Hospital                   65                     Yes
Children's Hospital of Los Angeles               30                     Awaiting data
St. Jude                                         905                    106 in compendium and 796 out of 799 remaining samples processed
Table 3.1: Number of patient samples obtained from different pediatric hospitals using the methodology we describe of sending the compute directly to the data. This allowed expeditious acquisition of pediatric samples that might otherwise never have been obtainable, or would have taken months to procure, due to the stringent data agreements required to transfer patients' sequence data. Underlying sequence data for patients was never exposed to non-credentialed individuals: all gene expression information was generated on-site by the hospital, patient labels were anonymized, and the results were then transferred to Treehouse.
3.4 Discussion
The FAIR principles of scientific data management (Findable, Accessible, Interoperable, and Reusable) are critical for scientific progress because they ease the sharing of both data and analytical tools [28]. By designing deterministic, portable, and cloud-compatible workflows, we enable interoperability and reproducibility between large compendium datasets and small silo datasets. This leaves a need for findable and accessible datasets, ideally centralized in a cloud environment with broad accessibility to research groups who can easily download the data. For more controlled datasets, there needs to be a standard system for sending analytical tools that can run on the protected data so that researchers can gather the necessary information. A common repository for data notes, which describe datasets and encourage reuse, could be coupled with analysis notes that allow identification of downstream compendiums and how they were processed, enabling researchers to coordinate compute efforts more efficiently. By leveraging portable and reproducible compute and participating in persistent outreach efforts, Treehouse has expanded its access to critical pediatric data that otherwise would have been almost impossible to access in an expeditious manner. We believe this is a useful model for enabling data sharing. As the genomics community continues to fund large-scale projects that will generate petabyte-scale datasets, the future of bioinformatic collaboration will likely revolve around cloud-based storage where Dockerized analysis workflows can be sent to run on the data. This removes the costs and legal concerns of storing and transferring the data to a new location, as well as the lengthy back-and-forth exchanges that often take place between research groups and data stewards before critical analysis can be done (Figure 3.1). Despite
these advantages, the genomics community will have to remain vigilant to protect patient privacy and stay aware of potential risks that will be exposed during this paradigm shift, such as maliciously-designed software that scrapes personal data and attempts to export it alongside approved output. We at Treehouse hope our experience inspires other research groups to persist in acquiring critical data by circumventing these common data barriers that impede collaborative research.
Figure 3.1: Deploying a single workflow to both large-scale cloud environments as well as individual repositories containing controlled access data. Non-controlled secondary output from these workflows can then be consolidated into large public data repositories that any researcher can access. This method of shipping a portable and reproducible workflow directly to the data is an effective way to target large genomic cloud repositories as well as individual data centers. Costs surrounding data transfer and storage are circumvented as well as any requirements to obtain access to the underlying data the workflow is running on, assuming the output of the workflow does not require controlled access.
4
A Bayesian Framework for Detecting Gene Expression Outliers in Individual Samples
“My greatest concern was what to call it. I thought of calling it ‘information,’ but the word was overly used, so I decided to call it ‘uncertainty.’ When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.’”
– Claude Shannon, Scientific American (1971)
In this chapter, I describe my work on a robust statistical framework for detecting gene expression outliers by appropriately choosing subsets of a large background comparison cohort. This unsupervised process weights background datasets and then produces a posterior
distribution for each gene of interest that can be used to calculate posterior predictive p-values. Our method has several advantages over existing methods in the same space: continuous p-values over binary output, automatic selection of the appropriate background dataset(s), as well as being robust to false positives, imperfect or sparse comparison sets, and samples of unknown origin or mixed lineage. This manuscript was submitted as a preprint in June 2019 under the title “A Bayesian Framework for Detecting Gene Expression Outliers in Individual Samples,” with the following co-authors: Jordan Eizenga, Holly C. Beale, Olena Morozova Vaske, and Benedict Paten.
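To make the notion of a posterior predictive p-value concrete, the following is a minimal sketch under stated assumptions. It assumes that draws from a gene's posterior predictive distribution are already available, and it substitutes a toy weighted mixture of Gaussian backgrounds for the actual model, which is not reproduced here. The function names, weights, and numbers are all illustrative.

```python
import random


def posterior_predictive_pvalue(draws, observed):
    """Fraction of posterior predictive draws at least as large as `observed`.

    This is an upper-tail p-value: small values flag over-expression.
    """
    if not draws:
        raise ValueError("need at least one draw")
    return sum(1 for d in draws if d >= observed) / len(draws)


def toy_mixture_draws(backgrounds, weights, n, seed=0):
    """Draw n values from a weighted mixture of Gaussian backgrounds.

    backgrounds: list of (mean, sd) pairs, one per background dataset
                 (a stand-in for the real posterior, assumed for illustration).
    weights:     non-negative mixture weights (need not sum to 1).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        mean, sd = rng.choices(backgrounds, weights=weights, k=1)[0]
        draws.append(rng.gauss(mean, sd))
    return draws


# Toy example: the first background dominates the mixture.
draws = toy_mixture_draws([(5.0, 1.0), (9.0, 1.0)], [0.9, 0.1], n=5000)
p_typical = posterior_predictive_pvalue(draws, observed=5.0)
p_extreme = posterior_predictive_pvalue(draws, observed=13.0)
```

In this toy setting, an expression value near the center of the dominant background yields a p-value far from zero, while a value well above both backgrounds yields a p-value near zero. The output is continuous, so a downstream threshold can be chosen per analysis instead of being fixed in advance.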