
This book does not attempt to be a comprehensive introduction to using R. Some basic familiarity with R will be gained through our travels in data mining using the Rattle interface and some excursions into R. In this respect, most of what we need to know about R is contained within the book. But there is much more to learn about R and its associated packages. We list and comment on here a number of books that provide an entrée to R.

A good starting point for handling data in R is Data Manipulation with R (Spector, 2008). The book covers the basic data structures, reading and writing data, subscripting, manipulating, aggregating, and reshaping data.

Introductory Statistics with R (Dalgaard, 2008), as mentioned earlier, is a good introduction to statistics using R. Modern Applied Statistics with S (Venables and Ripley, 2002) is quite an extensive introduction to statistics using R. Moving more towards areas related to data mining, Data Analysis and Graphics Using R (Maindonald and Braun, 2007) provides excellent practical coverage of many aspects of exploring and modelling data using R. The Elements of Statistical Learning (Hastie et al., 2009) is a more mathematical treatise, covering all of the machine learning techniques discussed in this book in quite some mathematical depth. If you are coming to R from a SAS or SPSS background, then R for SAS and SPSS Users (Muenchen, 2008) is an excellent choice. Even if you are not a SAS or SPSS user, the book provides a straightforward introduction to using R.

Quite a few specialist books using R are now available, including Lattice: Multivariate Data Visualization with R (Sarkar, 2008), which covers the extensive capabilities of one of the graphics/plotting packages available for R. A newer graphics framework is detailed in ggplot2: Elegant Graphics for Data Analysis (Wickham, 2009). Bivand et al. (2008) cover applied spatial data analysis, Kleiber and Zeileis (2008) cover applied econometrics, and Cowpertwait and Metcalfe (2009) cover time series, to name just a few books in the R library.

Moving on from R itself and into data mining, there are very many general introductions available. One that is commonly used for teaching in computer science is Han and Kamber (2006). It provides a comprehensive generic introduction to most of the algorithms used by a data miner. It is presented at a level suitable for information technology and database graduates.

Chapter 2

Getting Started

New ideas are often most effectively understood and appreciated by actually doing something with them. So it is with data mining. Fundamentally, data mining is about practical application: the application of the algorithms developed by researchers in artificial intelligence, machine learning, computer science, and statistics. This chapter is about getting started with data mining.

Our aim throughout this book is to provide hands-on practice in data mining, and to do so we need some computer software. There is a choice of software packages available for data mining. These include commercial closed source software (which is also often quite expensive) as well as free open source software. Open source software (whether freely available or commercially available) is always the best option, as it offers us the freedom to do whatever we like with it, as discussed in Chapter 1. This includes extending it, verifying it, tuning it to suit our needs, and even selling it. Such software is often of higher quality than commercial closed source software because of its open nature.

For our purposes, we need some good tools that are freely available to everyone and can be freely modified and extended by anyone. Therefore we use the open source and free data mining tool Rattle, which is built on the open source and free statistical software environment R. See Appendix A for instructions on obtaining the software. Now is a good time to install R. Much of what follows for the rest of the book, and specifically this chapter, relies on interacting with R and Rattle.

We can, quite quickly, begin our first data mining project, with Rattle's support. The aim is to build a model that captures the essence of the knowledge discovered from our data. Be careful, though: there is a lot of effort required in getting our data into shape. Once we have quality data, Rattle can build a model with just four mouse clicks, but the effort is in preparing the data and understanding and then fine-tuning the models.

G. Williams, Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, DOI 10.1007/978-1-4419-98 - _2, © Springer Science+Business Media, LLC 2011

In this chapter, we use Rattle to build our first data mining model, a simple decision tree model, which is one of the most common models in data mining. We cover starting up (and quitting from) R, an overview of how we interact with Rattle, and then how to load a dataset and build a model. Once the enthusiasm for building a model is satisfied, we then review the larger tasks of understanding the data and evaluating the model. Each element of Rattle's user interface is then reviewed before we finish by introducing some basic concepts related to interacting directly with and writing instructions for R.
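For readers curious about what a decision tree model looks like outside the GUI, the following is a minimal sketch of fitting one directly in R. It assumes the rpart package (distributed with standard R installations) and uses R's built-in iris data purely as a stand-in dataset. Neither choice comes from this chapter, and the variable names are illustrative only.

```r
# Minimal sketch: fit a classification tree with rpart.
# Assumptions: rpart is installed; iris stands in for a real dataset.
library(rpart)

# Model Species as a function of all other columns.
model <- rpart(Species ~ ., data = iris, method = "class")

# Print the tree's splitting rules as text.
print(model)

# Predict the class of each observation and tabulate against the truth.
predicted <- predict(model, newdata = iris, type = "class")
confusion <- table(actual = iris$Species, predicted = predicted)
print(confusion)
```

Note that the table here compares predictions against the same data used to build the tree, which tends to flatter the model. Evaluating a model properly is one of the larger tasks taken up later in the chapter.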

2.1 Starting R

R is a command line tool that is initiated either by typing the letter R (capital R, since R is case-sensitive) into a command line window (e.g., a terminal in GNU/Linux) or by opening R from the desktop icon (e.g., in Microsoft Windows and Mac OS X). This assumes that we have already installed R, as detailed in Appendix A.

One way or another, we should see a window (Figure 2.1) displaying the R prompt (> ), indicating that R is waiting for our commands. We will generally refer to this as the R Console.
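Case sensitivity applies to what we type at the prompt as well as to the command that starts R. As a small illustration (the variable names are made up for this sketch), names that differ only in case refer to different objects:

```r
# Two names that differ only in the case of the first letter.
total <- 1 + 2
Total <- 30

# Each name holds its own value.
print(total)
print(Total)

# The two objects are distinct, so this comparison is FALSE.
print(identical(total, Total))
```

When typed interactively, each of these lines would be entered after the > prompt.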

The Microsoft Windows R Console provides additional menus specifically for working with R. These include options for working with script files, managing packages, and obtaining help.

We start Rattle by loading rattle into the R library using library(). We supply the name of the package to load as the argument to the command. The rattle() command is then entered with an empty argument list, as shown below. We will then see the Rattle GUI displayed, as in Figure 2.2.

> library(rattle)
> rattle()
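If the package has not yet been installed, library(rattle) stops with an error. The sketch below wraps the two commands above in a hypothetical helper, start_rattle(), which is not part of Rattle, and which checks that the package is available and that the session is interactive before opening the GUI. The messages are illustrative.

```r
# Hypothetical helper (not part of Rattle): start the GUI only when possible.
start_rattle <- function() {
  # requireNamespace() returns FALSE instead of raising an error
  # when the package is not installed.
  if (!requireNamespace("rattle", quietly = TRUE)) {
    message("Package 'rattle' is not installed; see Appendix A.")
    return(invisible(FALSE))
  }
  # A GUI only makes sense in an interactive session.
  if (!interactive()) {
    message("Not an interactive session; skipping the Rattle GUI.")
    return(invisible(FALSE))
  }
  # Call rattle() from the rattle package to open the GUI.
  rattle::rattle()
  invisible(TRUE)
}

# TRUE if the GUI was launched, FALSE otherwise.
started <- start_rattle()
```

The helper calls rattle::rattle() rather than attaching the package with library(), a design choice made here only to keep the sketch free of side effects. In everyday use, the two commands shown above are all that is needed.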

The Rattle user interface is a simple tab-based interface, with the idea being to work from the leftmost tab to the rightmost tab, mimicking the typical data mining process.


Figure 2.1: The R Console for GNU/Linux and Microsoft Windows. The prompt indicates that R is awaiting user commands.


Figure 2.2: The initial Rattle window displays a welcome message and a little introduction to Rattle and R.

Tip: The key to using Rattle, as hinted at in the status bar on starting up Rattle, is to supply the appropriate information for a particular tab and to then click the Execute button to perform the action. Always make sure you have clicked the Execute button before proceeding to the next step.
