11.3 Reproducible analysis and output
The key idea of reproducible analysis is that data analysis code, results, and interpretation should all be located together. This stems from the concept of “literate programming” (in the sense of Knuth [86]) and facilitates transparent and repeatable analysis [95, 52]. Reproducible analysis systems, which are becoming more widely adopted [13], help to provide a clear audit trail and automate report creation. Ultimately, the goal is to avoid post-analysis cut-and-paste processing, which has a high probability of introducing errors.
There are various implementations of reproducible analysis in R [95, 199], several of which made the production of this book possible. Each of these systems functions by allowing the analyst to combine code and text into a single file. This file is processed to extract the code, run it through the statistical systems in batch mode, collect the results, then integrate the text, code, output, and graphical displays into the final document. The systems available in R are extensive and are an active area of development.
The most powerful and flexible system is the knitr package (due to Yihui Xie [199]). The package can be used by writing a file in the LaTeX document markup language, but another useful option is to write it in the far simpler Markdown format. Markdown files can be converted to a variety of common display and editing formats, such as PDF and Microsoft Word, using Pandoc (http://johnmcfarlane.net/pandoc, a “Swiss Army knife” of file conversion).
The knitr package is well-integrated with RStudio, and both LaTeX/PDF and Markdown/Pandoc conversions to several formats are provided via single-click mechanisms. More details can be found in [199] and [47] as well as the CRAN reproducible analysis task view (see also http://yihui.name/knitr).
As an example of how these systems work, we demonstrate a document written in the Markdown format using data from the built-in cars data frame. Within RStudio, a new template R Markdown file can be generated by selecting R Markdown from the New File option on the File menu. This generates the dialog box displayed in Figure 11.1. The default output format is HTML, but other options are available.
Figure 11.2 displays this default Markdown input file. The file is given a title (Sample R Markdown example) with output format set by default to HTML. Simple markup (such as bolding) is added through use of the ** characters before and after the word Knit. Blocks of code are begun using the ```{r} command and closed with a ``` command (three back quotes). In this example, summary statistics for the cars data frame are calculated and a scatterplot is generated.
The formatted output can be generated and displayed by clicking the Knit HTML button in RStudio, or by using the commands in the following code block, which can also be used when running R without the benefit of RStudio.
> library(markdown); library(knitr)
> knit("filename.Rmd") # creates filename.md
> markdownToHTML("filename.md", "filename.html")
> browseURL("filename.html")
The knit() function extracts the R commands from a specially formatted R Markdown input file (filename.Rmd), evaluates them, and integrates the resulting output, including text and graphics, into an intermediate file (filename.md). This file is then processed (using markdownToHTML()) to create a final display file in HTML format. A screenshot of the results of performing these steps on the .Rmd file displayed in Figure 11.2 is displayed in Figure 11.3.
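The two-step pipeline described above can be tried end to end without an existing input file. The following is a minimal sketch, assuming that the knitr package and a version of the markdown package that provides markdownToHTML() are installed. The file contents and the temporary file names are hypothetical and serve only as an illustration.

```r
library(knitr)
library(markdown)

# Three back quotes, built programmatically to keep this listing readable
fence <- strrep("`", 3)

# Write a tiny R Markdown input file (hypothetical content) to a temporary path
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "Correlation between speed and stopping distance in the cars data:",
  "",
  paste0(fence, "{r}"),
  "with(cars, cor(speed, dist))",
  fence
), rmd)

# Step 1: evaluate the code chunks and write the intermediate Markdown file
md <- sub("\\.Rmd$", ".md", rmd)
knit(rmd, output = md, quiet = TRUE)
md_lines <- readLines(md)

# Step 2: convert the intermediate Markdown file to HTML
html <- sub("\\.Rmd$", ".html", rmd)
markdownToHTML(file = md, output = html)
```

Calling browseURL(html) afterwards would open the result in a browser, as in the commands shown earlier.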
Figure 11.1: Generating a new R Markdown file in RStudio
The knit() function operates, by default, on the convention that input files ending with .Rmd generate a .md (Markdown) file, and files ending with .Rnw generate a .tex (LaTeX) file.
Alternatively, a PDF or Microsoft Word file can be generated in RStudio by selecting New from the R Markdown menu, then clicking on the PDF or Word options. RStudio also supports the creation of R Presentations using a variant of the R Markdown language. Instructions and an example can be found by opening a new R presentations document in RStudio.
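These conversions can also be scripted rather than selected from a menu. The sketch below assumes the rmarkdown package, which is not discussed in the text above, together with a working Pandoc installation. It uses render() with the word_document output format; the input file is a hypothetical example written to a temporary location.

```r
library(rmarkdown)

# Three back quotes, built programmatically to keep this listing readable
fence <- strrep("`", 3)

# A small R Markdown file with a YAML header (hypothetical content)
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Sample R Markdown example\"",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd)

# render() needs Pandoc; skip the conversion if it is not available
if (pandoc_available()) {
  docx <- render(rmd, output_format = "word_document", quiet = TRUE)
} else {
  docx <- NA_character_
}
```

Replacing "word_document" with "pdf_document" would request PDF output instead, which additionally requires a LaTeX installation.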
A LaTeX file can be generated using the following commands, where filename.Rnw is a LaTeX file with specific codes indicating the presence of R statements.
> library(knitr)
> knit("filename.Rnw")
The resulting filename.tex file could then be compiled with pdflatex in the operating system, resulting in a PDF file. This is done automatically using the Compile to PDF button in RStudio.
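To make the "specific codes" concrete, the following sketch writes a small .Rnw file and knits it to .tex. It assumes the knitr package is installed. The document text and the chunk label mean-dist are hypothetical; the <<>>= and @ lines are the standard delimiters that mark an R code chunk in .Rnw input.

```r
library(knitr)

# A minimal LaTeX document with one embedded R chunk (hypothetical content)
rnw <- tempfile(fileext = ".Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "Mean stopping distance in the cars data:",
  "<<mean-dist>>=",
  "mean(cars$dist)",
  "@",
  "\\end{document}"
), rnw)

# knit() evaluates the chunk and writes a .tex file containing code and output
tex <- sub("\\.Rnw$", ".tex", rnw)
knit(rnw, output = tex, quiet = TRUE)
tex_lines <- readLines(tex)
```

The resulting .tex file is ordinary LaTeX and can be passed to pdflatex as described above.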
It’s often useful to evaluate the code separately. The Stangle() function creates a file containing the code chunks and omitting the text. The resulting file could be run as a script using source(), and would generate just the results seen in the woven document.
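Stangle() belongs to the Sweave tooling that ships with R. For input files written for knitr, the analogous function is purl(). The sketch below, which assumes the knitr package is installed and uses a hypothetical input file, extracts the code into a script and then runs that script with source().

```r
library(knitr)

# Three back quotes, built programmatically to keep this listing readable
fence <- strrep("`", 3)

# An R Markdown file mixing narrative and code (hypothetical content)
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "Some narrative text.",
  "",
  paste0(fence, "{r}"),
  "x <- mean(cars$speed)",
  fence
), rmd)

# Extract only the code chunks into an R script
script <- sub("\\.Rmd$", ".R", rmd)
purl(rmd, output = script, quiet = TRUE)
script_lines <- readLines(script)

# Run the extracted script in its own environment
env <- new.env()
source(script, local = env)
```

Sourcing into a separate environment keeps the objects created by the script (here env$x) apart from the current workspace.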
---
title: "Sample R Markdown example"
author: "Nick Horton"
date: "October 4, 2014"
output: html_document
---

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

```{r}
summary(cars)
```

You can also embed plots, for example:

```{r, echo=FALSE}
plot(cars)
```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
Figure 11.2: Sample Markdown input file
The spin() function in the knitr package takes a formatted R script and produces an R Markdown document. This can be helpful for those moving from the use of scripts to more structured Markdown files.
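As an illustration of the script format that spin() expects, the sketch below passes two lines of a hypothetical script through the text argument instead of a file, with knit = FALSE so that only the conversion to R Markdown is performed. It assumes the knitr package is installed. By default, lines beginning with #' are treated as narrative text and all other lines as R code.

```r
library(knitr)

# A formatted R script held as a character vector (hypothetical content):
# the #' line becomes narrative text, the other line becomes a code chunk
script_lines <- c(
  "#' A short report on the cars data.",
  "summary(cars)"
)

# Convert to R Markdown without knitting; the result is the document text
rmd_text <- spin(text = script_lines, knit = FALSE)
rmd_joined <- paste(rmd_text, collapse = "\n")
```

Printing rmd_joined with cat() shows the generated R Markdown, which can then be saved to an .Rmd file and edited further.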
11.4 Advanced statistical methods
In this section, we discuss implementations of modern statistical methods and techniques, including Bayesian methods, propensity score analysis, missing data methods, and estimation of finite mixture models.
11.4.1 Bayesian methods
Bayesian methods are increasingly widely used, and implementations of many models are available in R.
We focus here on Markov Chain Monte Carlo (MCMC) methods for model fitting, which are quite general and much more flexible than closed form solutions. Diagnosis of convergence is a critical part of any MCMC model fitting (see Gelman et al. [50] for an accessible introduction). Support for model assessment is provided, for example, in the