Many tools have emerged to capture the prospective and retrospective provenance of scripts. They represent and store this provenance in different ways, and most of them keep the provenance data in a traditional database. In our work, we aim to represent the provenance information of script execution semantically. This helps to describe the whole experiment semantically, including its computational provenance. The provenance information of the scripts is combined with other experimental metadata, which provides the context of the results and makes the experiments understandable along with that context. In this section, we focus on how we convert computational notebooks into the Resource Description Framework (RDF), while Section 5.3 discusses how we convert scripts in general into RDF.
ProvBook allows the user to convert notebooks into RDF along with their provenance traces and execution environment attributes. The REPRODUCE-ME ontology is used to describe the computational tasks of the notebook. The ontology extends PROV-O and P-Plan to describe the provenance information of the notebook.
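As a rough illustration of the execution environment attributes mentioned above, the following sketch reads the kernel, programming language, and version from the metadata of a notebook file. It assumes the standard nbformat 4 layout (`metadata.kernelspec` and `metadata.language_info`); the function name `environment_attributes`, the inlined notebook, and its values are hypothetical, and the sketch does not claim to reproduce how ProvBook itself collects these attributes.

```python
import json

# A minimal nbformat-4 style notebook, inlined so the sketch is self-contained.
# The metadata values are illustrative only.
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {
    "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"},
    "language_info": {"name": "python", "version": "3.6.5"}
  },
  "cells": []
}
"""


def environment_attributes(notebook: dict) -> dict:
    """Collect the kernel, language and version recorded in the notebook metadata.

    Hypothetical helper: missing entries are returned as None instead of raising.
    """
    metadata = notebook.get("metadata", {})
    kernelspec = metadata.get("kernelspec", {})
    language_info = metadata.get("language_info", {})
    return {
        "kernel": kernelspec.get("name"),
        "programming_language": language_info.get("name"),
        "version": language_info.get("version"),
    }


notebook = json.loads(NOTEBOOK_JSON)
attributes = environment_attributes(notebook)
print(attributes)
```

These three values correspond to the kind of information that the Kernel, ProgrammingLanguage, and Version concepts are meant to capture.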
We define the following competency questions related to computational provenance.
CQ11 What is the complete path taken by a user for a computational notebook experiment?
CQ12 What is the sequence of steps in the execution of a computational notebook?
CQ13 How many trials were performed for a particular cell in a computational notebook?
CQ14 How long did a particular trial of a computational notebook take?
CQ15 What was the source for a particular trial of a computational notebook?
CQ16 What was the output for a particular trial of a computational notebook?
CQ17 Who are the agents responsible for a computational notebook?
CQ18 When was a particular trial of a computational notebook last executed?
CQ19 What are the environmental attributes of a notebook execution?
The aim of this module is to semantically describe the prospective and retrospective provenance of a computational notebook. The module contains the concepts needed to represent the different elements of a computational notebook and the properties to relate the several trials of the notebook. We use RDF to represent the computational notebooks, as we have discussed the benefits of using semantic web technologies to represent provenance information in Chapter 4. Figure 5.5 shows the semantic representation of a computational notebook.
Figure 5.5: The semantic representation of a computational notebook [Samuel and König-Ries, 2018b]
We define how the notebook is semantically described.
• Notebook
The computational notebook is represented as a Notebook, which is a subclass of p-plan:Plan. The Settings describe the execution environment of the Notebook. The Settings are Kernel, ProgrammingLanguage, and Version.
• Cell
The cell of a notebook is represented as Cell, which is a p-plan:Step. The Cell is a step of Notebook, and the relationship is described using p-plan:isStepOfPlan.
• Source
The input of each cell is described as Source which is related to Cell using the object property p-plan:hasInputVar. The Source is a p-plan:Variable. The value of the Source variable is represented using rdf:value.
• Output
The output of each cell is described as Output which is related to Cell using the object property p-plan:hasOutputVar. The Output is a p-plan:Variable. The value of the Output variable is represented using rdf:value.
• CellExecution
Each execution of a cell is described as CellExecution, which is a p-plan:Activity. The input of each CellExecution is a prov:Entity, which is related using the property prov:used. The output of each CellExecution is a prov:Entity, which is related using the property prov:generated. The data properties prov:startedAtTime, prov:endedAtTime and repr:executionTime are used to represent the starting time, ending time and the total time taken for the cell execution, respectively.
Figure 5.6: A Notebook which can be downloaded in RDF
Figure 5.6 shows a notebook which can be downloaded in RDF using ProvBook. The RDF can be downloaded as a Turtle file either from the user interface of the notebook or using the command line. Figure 5.7 shows a part of a Jupyter Notebook in RDF represented using the REPRODUCE-ME ontology. ProvBook allows the user to share a notebook along with its provenance in RDF. It also provides a reproducibility service that converts the provenance graph back to a computational notebook along with its provenance; this conversion is run from the command line. We answer the competency questions (CQ11-CQ19) using SPARQL in Chapter 7.
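To make the description above more concrete, the following minimal sketch builds a small set of triples for a notebook with two cells and then answers CQ13 and CQ14 over them in plain Python. It is only an illustration under stated assumptions: the `ex:` prefix, the resource naming scheme, the helper names, the sample data, and the use of seconds for repr:executionTime are hypothetical, the link from a CellExecution to its Cell is assumed to use p-plan:correspondsToStep, and outputs as well as prov:used and prov:generated are omitted for brevity. The actual conversion is done by ProvBook and the competency questions are answered with SPARQL.

```python
from collections import Counter
from typing import NamedTuple


class Literal(NamedTuple):
    """A plain literal value, kept distinct from prefixed resource names."""
    value: object


def describe_notebook(name, cells):
    """Return (subject, predicate, object) triples for a notebook.

    `cells` is a list of dicts with a 'source' string and a list of 'trials';
    each trial holds 'started', 'ended' and 'seconds' (assumed unit).
    """
    nb = f"ex:{name}"  # hypothetical naming scheme
    triples = [(nb, "rdf:type", "repr:Notebook")]
    for i, cell in enumerate(cells, start=1):
        cell_id = f"{nb}_cell{i}"
        source_id = f"{cell_id}_source"
        triples += [
            (cell_id, "rdf:type", "repr:Cell"),
            (cell_id, "p-plan:isStepOfPlan", nb),
            (cell_id, "p-plan:hasInputVar", source_id),
            (source_id, "rdf:type", "repr:Source"),
            (source_id, "rdf:value", Literal(cell["source"])),
        ]
        for j, trial in enumerate(cell["trials"], start=1):
            run_id = f"{cell_id}_execution{j}"
            triples += [
                (run_id, "rdf:type", "repr:CellExecution"),
                (run_id, "p-plan:correspondsToStep", cell_id),
                (run_id, "prov:startedAtTime", Literal(trial["started"])),
                (run_id, "prov:endedAtTime", Literal(trial["ended"])),
                (run_id, "repr:executionTime", Literal(trial["seconds"])),
            ]
    return triples


def trials_per_cell(triples):
    """CQ13: count the executions that correspond to each cell."""
    return Counter(o for _, p, o in triples if p == "p-plan:correspondsToStep")


def execution_time(triples, execution_id):
    """CQ14: look up the recorded duration of one execution, or None."""
    for s, p, o in triples:
        if s == execution_id and p == "repr:executionTime":
            return o.value
    return None


# Illustrative data only: two cells, the first executed twice.
cells = [
    {"source": "x = 1", "trials": [
        {"started": "2018-05-01T10:00:00", "ended": "2018-05-01T10:00:01", "seconds": 1.0},
        {"started": "2018-05-01T10:05:00", "ended": "2018-05-01T10:05:02", "seconds": 2.0},
    ]},
    {"source": "print(x)", "trials": [
        {"started": "2018-05-01T10:06:00", "ended": "2018-05-01T10:06:01", "seconds": 1.0},
    ]},
]

graph = describe_notebook("demo", cells)
counts = trials_per_cell(graph)
print(counts["ex:demo_cell1"])                             # 2
print(execution_time(graph, "ex:demo_cell1_execution2"))   # 2.0
```

Keeping each trial as its own CellExecution resource is what allows questions about individual trials, such as their number or duration, to be answered without overwriting earlier runs.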