
We present a method of predictive habitat modelling in which relationships between the automatically extracted image features mentioned above (quantifiable environmental variables) and human-scored habitat classes are investigated using the random forests classifier (Breiman, 2001). Statistical modelling techniques that have been used to predict habitat distribution include Classification And Regression Trees (CART; Holmes et al., 2008) and Quick, Unbiased and Efficient Statistical Trees (QUEST; Rattray et al., 2009). A decision tree is shaped by a collection of recursive rules based on predictors, which determine the position of its branches and leaves. Random forests is a classification and regression method that derives a classifier by ‘growing’ an ensemble of decision trees and letting them vote for the most popular class. We used the random forests method for two reasons: (1) the random forests

Table 3.2: Habitat class code, habitat class, brief habitat description and frequency of occurrence of each habitat class within the AUV transect on the continental shelf study area off south-eastern Tasmania.

| Code | Habitat class | Habitat description | Occurrence |
|------|---------------|---------------------|------------|
| RSE | reef-sand ecotone | interface between hard (reef) and soft (sand) substrate, cf. patch reef (PR) | 6.13% |
| LRR | low relief reef | hard substrate but low relief (<20 cm excluding benthos) | 14.22% |
| CS | coarse sand | usually shell gravel mixed with sand, however, not screw shells | 6.00% |
| PR | patch reef | patchy hard substrate (reef) covers <50% within soft substrate (sand) | 6.27% |
| S | sand | fine sand with/without sand ripples or waves | 2.89% |
| SSR | screw shell rubble | substrate with >50% covered by screw shells (Maoricolpus roseus), cf. SSRS | 21.25% |
| SSRS | screw shell rubble/sand | substrate dominated by sand, screw shell cover <50%, cf. SSR | 2.52% |
| HRR | high relief reef | hard substrate (reef) with high relief >20 cm | 33.72% |
| ECK | Ecklonia radiata | hard substrate covered by kelp (Ecklonia radiata) | |

Figure 3.3: Example images for each habitat class identified in the continental shelf study area off Tasmania, south-eastern Australia: (a) reef-sand ecotone, (b) low relief reef, (c) coarse sand, (d) patch reef, (e) sand, (f) screw shell rubble, (g) screw shell rubble/sand, (h) high

approach achieved the highest prediction accuracy, albeit by a small margin, in a comparison of the CART, QUEST and random forests approaches, and (2) the random forests approach gives useful internal estimates of classification error, predictor strength, case correlation and variable importance (Breiman, 2001). A subset of 500 randomly sampled images was used to create an ensemble classifier using the randomForest package (Liaw and Wiener, 2002) for R (R Development Core Team, 2009). In addition, the importance of predictors was assessed by extracting the variable importance measures produced by random forests, and a proximity measure among rows was calculated to identify similarities between habitats as predicted by random forests. Different subsets of predictors were used to investigate the impact of fewer predictor variables on the classification error rate: first only the patch-gap summaries predictor set, then adding the HSV predictor set, then the local binary pattern predictor set and finally rugosity. The model was run with combinations of the three predictor sets and rugosity, culminating in a final model that included all 26 predictors. Each model run produced an error rate estimate based on bootstrapping. The different random forests models derived from the training data set were then applied to predict habitat classes for the remaining 3086 images.
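The bootstrap-based error estimate mentioned above rests on the fact that each tree in a random forest is grown on a sample of cases drawn with replacement, so some cases are left out of that tree's sample and can serve as its test set (the out-of-bag cases). The following is a minimal sketch of that sampling step only, written in Python for illustration rather than in the R used for the analysis; the function name and the seed are hypothetical, and only the training subset size of 500 comes from the text.

```python
import random


def bootstrap_with_oob(n_cases, rng):
    """Draw one bootstrap sample of case indices.

    Returns (in_bag, out_of_bag): in_bag has n_cases indices drawn with
    replacement; out_of_bag lists every index that was never drawn.
    """
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_cases) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)      # arbitrary seed, for reproducibility only
n_training_images = 500      # training subset size stated in the text
in_bag, out_of_bag = bootstrap_with_oob(n_training_images, rng)
oob_fraction = len(out_of_bag) / n_training_images

print(f"{len(set(in_bag))} distinct in-bag cases, "
      f"{len(out_of_bag)} out-of-bag cases ({oob_fraction:.1%})")
```

On average roughly a third of the cases end up out-of-bag for any one tree, which is why the forest can estimate its own error rate without a separate validation set.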

Fleiss’ exact κ and habitat class-wise κ were computed to evaluate prediction accuracy compared to observed habitat classes (Fleiss, 1971). In addition, confusion matrices, a common visualisation tool in machine learning, were used to further clarify model strengths and weaknesses. Each matrix column represents the predicted instances and each row represents the actual observed class (habitat). This makes it easy to see which classes were misclassified: misclassifications appear on either side of the diagonal line of numbers (Table 3.3, numbers in bold).

The random forests algorithm estimates the importance of a variable (predictor) from the increase in prediction error when the out-of-bag data (the data randomForest holds out for its intrinsic prediction error estimation) for that variable are permuted while the remaining variables are left unchanged. Calculations are carried out tree by tree as the random forest is constructed. The more the estimated error rate increases, the more important the predictor is, i.e., leaving an important predictor out decreases prediction accuracy.

We performed a χ² goodness-of-fit test to assess differences between observed and predicted habitat classes. Bootstrapping and the calculation of 95% confidence intervals helped to visualise which habitat classes the random forests model was able to predict within confidence boundaries. The entire vector containing all observed habitat classes was resampled with replacement 25 times. These 25 resampled habitat distributions were used to calculate 95% confidence intervals to visually assess prediction performance for each habitat class.

Random forests provides intrinsic proximity values for each case (image), in our case culminating in a square (500 × 500) proximity matrix with value 1 on the diagonal and values between 0 and 1 in the off-diagonal positions. Multi-dimensional scaling (MDS) was used to plot the scaling coordinates derived from the proximity matrix to visualise case (image) similarities. The rationale behind MDS is to represent samples (images) as points in two-, sometimes three-dimensional space so that distances between points correspond to similarities in the intrinsic random forests proximity matrix. Applying this principle, points in an MDS plot that are close together stand for samples (image classification outcomes using random forests) that are very similar, and points far apart stand for samples that are very different.
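The agreement statistic and the confusion matrix layout described above can be illustrated with a short sketch. It treats the observed and predicted labels as two "raters" and applies the standard Fleiss (1971) formula; the "exact" variant reported in the text may compute chance agreement differently, so this is an approximation of the idea rather than the analysis itself. The function names and the six-image toy data are hypothetical.

```python
from collections import Counter


def confusion_matrix(observed, predicted, classes):
    """Rows are observed classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for obs, pred in zip(observed, predicted):
        matrix[index[obs]][index[pred]] += 1
    return matrix


def fleiss_kappa_two_raters(observed, predicted):
    """Standard Fleiss kappa with observation and prediction as two raters."""
    n = len(observed)
    # With two raters, per-case agreement is 1 if the labels match, else 0.
    agreement = sum(o == p for o, p in zip(observed, predicted)) / n
    # Chance agreement uses class proportions pooled over both raters.
    pooled = Counter(observed) + Counter(predicted)
    chance = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (agreement - chance) / (1 - chance)


# Toy labels (hypothetical), using habitat codes from Table 3.2.
observed = ["S", "S", "PR", "PR", "CS", "CS"]
predicted = ["S", "S", "PR", "CS", "CS", "CS"]
classes = sorted(set(observed) | set(predicted))  # ['CS', 'PR', 'S']

matrix = confusion_matrix(observed, predicted, classes)
kappa = fleiss_kappa_two_raters(observed, predicted)

for label, row in zip(classes, matrix):
    print(label, row)
print(f"kappa = {kappa:.4f}")
```

Correct predictions accumulate on the diagonal; the single off-diagonal entry in the PR row shows one patch reef image predicted as coarse sand.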

Observed and predicted habitat classes were superimposed on bathymetry for visual assessment. All statistical analyses described in this section were performed using the R base package and the MASS package (R Development Core Team, 2009).
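The bootstrap step described in this section, resampling the vector of observed habitat classes with replacement 25 times and deriving 95% intervals per class, can be sketched as follows. The text does not say how the intervals were derived from the resamples, so the nearest-rank percentile rule used here is an assumption, as are the function names, the toy class vector and the hypothetical predicted counts. The sketch is in Python for illustration, not the R used for the analysis.

```python
import math
import random
from collections import Counter


def nearest_rank(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (0 < q <= 1)."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def bootstrap_class_intervals(observed, n_resamples, rng):
    """Return {class: (lower, upper)} bounds on per-class counts."""
    n = len(observed)
    classes = sorted(set(observed))
    counts = {c: [] for c in classes}
    for _ in range(n_resamples):
        tally = Counter(rng.choices(observed, k=n))  # sample with replacement
        for c in classes:
            counts[c].append(tally[c])  # Counter gives 0 for absent classes
    return {
        c: (nearest_rank(sorted(v), 0.025), nearest_rank(sorted(v), 0.975))
        for c, v in counts.items()
    }


# Toy data (hypothetical): 72 observed labels and made-up predicted counts.
observed = ["HRR"] * 34 + ["SSR"] * 21 + ["LRR"] * 14 + ["S"] * 3
predicted_counts = {"HRR": 36, "SSR": 20, "LRR": 12, "S": 4}

intervals = bootstrap_class_intervals(observed, 25, random.Random(1))
for habitat, (low, high) in intervals.items():
    status = "within" if low <= predicted_counts[habitat] <= high else "outside"
    print(f"{habitat}: [{low}, {high}] predicted={predicted_counts[habitat]} ({status})")
```

With only 25 resamples, the nearest-rank 2.5% and 97.5% bounds coincide with the smallest and largest resampled counts, so the intervals are coarse; a larger number of resamples would give smoother bounds.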
