
Ensembles of Ensembles: Combining the Predictions from Multiple Machine

5.2 Methods

5.2.1 Data Set

Marine traffic data was obtained for the Scotian shelf offshore region of Nova Scotia, originating as hourly automatic identification system (AIS) tracking positions recorded between June 22, 2014 and June 22, 2015. All vessels larger than 500 tons are required to transmit AIS signals, although many smaller vessels are also equipped with the technology to support rescue during an emergency response.

An automated procedure was employed to construct vessel tracks from point location data when gaps between successive positions were less than 4 h in duration; otherwise, vessel locations were modelled as isolated positions that were not part of a continuous cruise (Brawn 2016). For the purposes of this study, analysis was limited to vessels of type “fishing” and to the third quarters (Q3) of the 2014–2015 period.
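The 4 h gap rule can be illustrated with a minimal Python sketch. This is not the authors' procedure (their code is in R and is not reproduced here); the function name `split_into_tracks` and the treatment of a single unconnected position as "isolated" are assumptions made for illustration.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=4)  # gap threshold stated in the text


def split_into_tracks(timestamps, max_gap=MAX_GAP):
    """Group one vessel's position times into tracks and isolated positions.

    Successive positions less than max_gap apart are joined into a track.
    A position with no neighbour within max_gap is returned as isolated.
    (Hypothetical helper; a sketch, not the published implementation.)
    """
    tracks, isolated = [], []
    current = []

    def close_segment(segment):
        # Segments with two or more positions count as tracks.
        if len(segment) > 1:
            tracks.append(segment)
        elif segment:
            isolated.append(segment[0])

    for t in sorted(timestamps):
        if current and (t - current[-1]) >= max_gap:
            close_segment(current)
            current = []
        current.append(t)
    close_segment(current)
    return tracks, isolated


# Hourly positions with two long gaps (hours since an arbitrary start time).
start = datetime(2014, 7, 1)
times = [start + timedelta(hours=h) for h in (0, 1, 2, 8, 20, 21)]
tracks, isolated = split_into_tracks(times)
print(len(tracks), "tracks;", len(isolated), "isolated position(s)")
```

In this toy input the positions at hours 0 to 2 and 20 to 21 form two tracks, while the position at hour 8 is more than 4 h from both neighbours and is kept as an isolated point.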

Using a 20-km grid cell size, which corresponded approximately to the 90th percentile of nearest-neighbour distance for points on a cruise track, line (scLineDensity) and point (scPointDensity) densities were calculated and scaled relative to the maximum density over the entire surface (Fig. 5.1). As complete vessel tracks were deemed more informative than solitary point positions, the two vessel density measures were averaged to form a single measure called “scCombined” with a weighting of 80% assigned to line information and 20% to point information.
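As a worked illustration of the scaling and weighting just described, the following Python sketch rescales each density by its maximum and forms the 80/20 weighted average. The function names and the toy per-cell values are assumptions for illustration only; the actual densities were computed in a GIS workflow that is not shown here.

```python
def scale_to_max(values):
    """Scale densities relative to the maximum over the whole surface."""
    peak = max(values)
    if peak == 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]


def combine_densities(line_density, point_density, w_line=0.8, w_point=0.2):
    """Weighted average of scaled line and point densities (scCombined).

    Weights follow the text: 80% line information, 20% point information.
    """
    sc_line = scale_to_max(line_density)    # scLineDensity
    sc_point = scale_to_max(point_density)  # scPointDensity
    return [w_line * l + w_point * p for l, p in zip(sc_line, sc_point)]


# Toy values for three grid cells (hypothetical, not study data).
line_density = [0.0, 5.0, 10.0]
point_density = [4.0, 0.0, 2.0]
sc_combined = combine_densities(line_density, point_density)
print([round(v, 3) for v in sc_combined])  # [0.2, 0.4, 0.9]
```

A cell with only isolated points can therefore reach at most 0.2 on the combined scale, which reflects the stated preference for track information.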

Environmental data consisted of physical and oceanographic variables with the potential to identify areas of high marine productivity and, therefore, areas likely to be exploited by fishing vessels (Table 5.1). This data was combined within a geodatabase and aggregated to the level of 20-km grid cells to maintain consistency with the vessel traffic response data.

5.2.2 Accuracy Assessment

Model construction was conducted 500 times for each combination of tuning parameters (see Tables 5.2, 5.3). For each of the 500 iterations, we randomly selected 50% of the data for model training and passed this subset to the BRT and RF algorithms. Mean squared error (MSE), calculated on the withheld testing data, was used to cross-validate predictive accuracy. It should be noted, however, that the RF method has its own internal procedure for randomly selecting a portion of the data for model training, and reserving a portion for model testing (the so-called “OOB”, or out-of-bag, portion of the data; see Modeling Algorithms). The data and R code developed for this study are available online at https://doi.org/10.5281/zenodo.1318352 (Lieske et al. 2018) and serve as documentation of methods and results.
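The repeated 50% hold-out scheme can be sketched as follows. This is a minimal Python illustration rather than the study's R code: `fit_mean` is a hypothetical stand-in for the BRT and RF fitting calls, and the seed and toy data are assumptions.

```python
import random


def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def split_half(n_rows, rng):
    """Randomly assign 50% of row indices to training, the rest to testing."""
    idx = list(range(n_rows))
    rng.shuffle(idx)
    half = n_rows // 2
    return idx[:half], idx[half:]


def fit_mean(x_train, y_train):
    """Placeholder learner (assumption): always predicts the training mean."""
    mean_y = sum(y_train) / len(y_train)
    return lambda row: mean_y


def repeated_holdout(x, y, fit, n_iter=500, seed=1):
    """Refit on a fresh random half n_iter times; return the test MSEs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        train, test = split_half(len(y), rng)
        model = fit([x[i] for i in train], [y[i] for i in train])
        predicted = [model(x[i]) for i in test]
        scores.append(mse([y[i] for i in test], predicted))
    return scores


# Toy data (hypothetical): 40 grid cells with one predictor.
x = [float(i) for i in range(40)]
y = [2.0 * xi + 1.0 for xi in x]
scores = repeated_holdout(x, y, fit_mean, n_iter=500)
print("mean test MSE over", len(scores), "iterations:", sum(scores) / len(scores))
```

Keeping the per-iteration MSE values, rather than only their mean, makes it possible to compare the spread of accuracy across tuning-parameter settings.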

Fig. 5.1 Relative intensity of fishing-vessel traffic for the Scotian shelf region, based on automatic identification system (AIS) tracking data for the third quarters of the period 22 June 2014 to 22 June 2015

Table 5.1 Summary of the thirteen predictor variables used in the analysis, as well as the data sources

| Layer group | Resolution/scale | Source | Description |
|---|---|---|---|
| CHLA_xxxx | 4 km, monthly | MODIS^a | Concentration of the photosynthetic pigment chlorophyll a (mg m^-2), where xxxx indicates the month and year of summary: July 2014, 2015; August 2014, 2015; September 2014. |
| SST_xxxx | 4 km, monthly | MODIS^a | Sea surface temperature derived from long-wave (11–12 μm) thermal radiation, where xxxx indicates the month and year of summary: July 2014, 2015; August 2014, 2015; September 2014. |
| WIND_yyyy | 1° × 1° | QuikSCAT^b | Surface wind (m s^-1), August 1999–October 2009, where yyyy indicates the month averaged over the entire time series: July, August, and September. |
| DISTCOAST | 4 km | GML^c | Distance to coast (km). |
| DISTPORT | 4 km | GML^c | Distance to closest marina and/or port (km). |
| DEPTH | 4 km | GML^c | Sea depth (m), derived from ETOPO2v2 2006 product^d |
| RUGGED | 3 nearest neighbours | GML^c | Seafloor ruggedness, derived using Benthic Terrain Modeler Extension^e |

^a NASA OceanColor Web (http://oceancolor.gsfc.nasa.gov/cms/), downloaded 1 September 2016.

^b QuikSCAT (https://podaac.jpl.nasa.gov/QuikSCAT), downloaded 1 September 2016.

^c Mount Allison University Geospatial Modelling Lab (GML, http://arcgis.mta.ca).

^d National Geophysical Data Center. 2006. 2-minute gridded global relief data (ETOPO2) v2. National Geophysical Data Center, NOAA.

^e Wright, D.J., M. Pendleton, J. Boulware, S. Walbridge, B. Gerlt, D. Eslinger, D. Sampson, and E. Huntley. 2012. ArcGIS Benthic Terrain Modeler (BTM), v. 3.0, Environmental Systems Research Institute, NOAA Coastal Services Center, Massachusetts Office of Coastal Zone Management. https://www.arcgis.com/home/item.html?id=b0d0be66fd33440d97e8c83d220e7926, downloaded 8 April 2016.

Table 5.2 Key tuning parameters for the random forests (RF) method, as implemented in the R package of Liaw and Wiener (2002)

Random forests^a

| Parameter | Software-specific implementation | Typical values | Comment |
|---|---|---|---|
| Number of iterations, T | ntree | 5,000^b | |
| Size of predictor subset, m | mtry | Classification: p^{1/2}; regression: p | m = p (the number of predictors) is equivalent to bagging. |
| Number of observations in terminal node | nodesize | Default: 5 for regression, 1 for classification | |

^a Package randomForest (Liaw and Wiener 2002). Note: p = the number of predictor variables (covariates).

^b Breiman (2002)

5.2.3 Modeling Algorithms

Random forests (RF, Breiman 2001) is, in itself, a form of ensemble technique that generates continuous-value predictions by averaging the predictions from multiple regression trees. Stochasticity is introduced into the learning procedure by: (1) drawing a bootstrap sample of the data (with replacement) for training purposes, which contains about 63% of the distinct observations, and reserving the remainder for model testing (referred to as the “out of bag” results, or OOB), and (2) randomly selecting a subset of the predictor variables at each step, thereby “decorrelating” the trees and ensuring that the resulting average is less variable and of higher predictive accuracy (Hegel et al. 2010; James et al. 2013). RF was implemented using the randomForest R package of Liaw and Wiener (2002).

Important tuning parameters for this package are described in Table 5.2. It should be noted that other implementations are available for RF, for example, Salford Systems’ (2016) RandomForests package. Our analysis focused on the more readily available open-source version of Liaw and Wiener (2002), but we acknowledge that different software implementations may yield different results from those reported here.
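The roughly 63% figure quoted for RF bootstrap sampling follows from drawing n observations with replacement: the expected share of distinct observations is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) as n grows. The Python sketch below checks this by simulation; it is a generic illustration of bootstrap and OOB partitioning, not a reproduction of the internals of the randomForest package, and the sample size and seed are arbitrary assumptions.

```python
import random


def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement; the rows never drawn are OOB."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)      # arbitrary seed for repeatability
n = 10_000                   # arbitrary sample size
in_bag, out_of_bag = bootstrap_split(n, rng)

unique_fraction = len(set(in_bag)) / n
oob_fraction = len(out_of_bag) / n
print("distinct in-bag share:", round(unique_fraction, 3))
print("out-of-bag share:", round(oob_fraction, 3))
```

The in-bag sample has the same length as the original data, but because of repeated draws only about two thirds of the distinct rows appear in it; the remaining third is what each tree is tested on.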

The boosted regression tree algorithm (BRT, Friedman 2001) can be described as a “slow learning” technique where predictions are constructed additively, sequentially, and incrementally (Elith et al. 2008; James et al. 2013). The data is not bootstrap sampled, nor is the response variable modelled directly; rather, the attention of the model-fitting procedure is focused on the residual, or unexplained, variation (James et al. 2013). The weight of each tree in the final prediction is controlled by the learning rate parameter (λ, Table 5.3), which ensures slow improvements in the model in the areas of the response “space” where predictions are poor (James et al. 2013).

BRT was implemented using the gbm R package of Ridgeway (2012), though it should be noted that alternative commercial boosting packages are available (e.g., Salford Systems 2016).
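To make the "fit the residuals, add a shrunken tree" idea concrete, here is a minimal Python sketch of boosting for squared-error loss using single-split trees (stumps) on one predictor. It is a teaching illustration under simplifying assumptions, not the gbm algorithm: it omits the subsampling controlled by bag.fraction, uses a tree depth of one split, and the function names and toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump: one split on a single predictor.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees=200, shrinkage=0.1):
    """Sequentially fit stumps to residuals, adding each scaled by shrinkage."""
    base = sum(y) / len(y)              # start from the mean response
    prediction = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        tree = fit_stump(x, residuals)  # model the unexplained variation
        trees.append(tree)
        prediction = [pi + shrinkage * tree(xi) for pi, xi in zip(prediction, x)]
    return lambda xi: base + shrinkage * sum(tree(xi) for tree in trees)


# Toy data (hypothetical): a curved response on ten points.
x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]
model = boost(x, y)

mean_y = sum(y) / len(y)
baseline_mse = sum((yi - mean_y) ** 2 for yi in y) / len(y)
boosted_mse = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print("training MSE, mean only:", baseline_mse)
print("training MSE, boosted:", boosted_mse)
```

A smaller shrinkage value makes each tree contribute less, so more trees are needed to reach the same fit, which mirrors the trade-off between λ and the number of iterations in Table 5.3.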

Table 5.3 Key tuning parameters for the boosted regression tree (BRT) method, as implemented in the R Package of Ridgeway (2012)

Boosted regression tree^a

| Parameter | Software-specific implementation | Typical values | Comment |
|---|---|---|---|
| Number of iterations, T | n.trees | 3,000 to 10,000 | |
| Shrinkage (learning rate), λ | shrinkage | 0.01 to 0.001 | Elith et al. (2008) recommend ↓λ with ↑K |
| Number of splits (depth of each tree), K | interaction.depth | 1 to number of variables in dataset | |
| Subsampling rate, p | bag.fraction | 0.5 (recommended) | Controls the level of stochasticity in model selection. |

^a Package gbm (Ridgeway 2012)

Tuning parameter settings (Tables 5.2 and 5.3) were varied systematically for both algorithms. For RF, the size of the predictor subset (m, Table 5.2) was allowed to vary from 1 to the total number of available predictor variables. Simultaneously, the number of iterations varied from 500 to 8000 (T, Table 5.2). In the case of BRT, the learning (or shrinkage) rate (λ) took on a value of either 0.01 or 0.001; interaction depth (or the number of splits, K) was allowed to vary from 1 to the total number of available predictor variables; and the number of iterations was varied from 500 to 8000 (T, Table 5.3).
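The systematic variation of tuning parameters amounts to a grid over the stated ranges, which might be enumerated as in the Python sketch below. The text gives only the end points of the iteration range (500 to 8000), so the intermediate values in `ITERATION_VALUES` are an assumption, as is the use of thirteen predictors (taken from the caption of Table 5.1). The dictionary keys reuse the R argument names from Tables 5.2 and 5.3 purely as labels.

```python
from itertools import product

N_PREDICTORS = 13                                # from the Table 5.1 caption
ITERATION_VALUES = (500, 1000, 2000, 4000, 8000)  # assumed steps, 500 to 8000


def rf_grid(n_predictors, ntree_values):
    """All (mtry, ntree) combinations for random forests."""
    return [
        {"mtry": m, "ntree": t}
        for m, t in product(range(1, n_predictors + 1), ntree_values)
    ]


def brt_grid(n_predictors, n_trees_values, shrinkage_values=(0.01, 0.001)):
    """All (shrinkage, interaction.depth, n.trees) combinations for BRT."""
    return [
        {"shrinkage": s, "interaction.depth": k, "n.trees": t}
        for s, k, t in product(
            shrinkage_values, range(1, n_predictors + 1), n_trees_values
        )
    ]


rf_settings = rf_grid(N_PREDICTORS, ITERATION_VALUES)
brt_settings = brt_grid(N_PREDICTORS, ITERATION_VALUES)
print(len(rf_settings), "RF settings;", len(brt_settings), "BRT settings")
```

Because each setting is refit 500 times, the size of the grid directly determines the computational cost of the comparison.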

Ensemble (ENS) predictions were generated, at each iteration of model construction, by calculating a weighted average of the predictions from RF and BRT. The weightings were based on model performance using the cross-validation data, and were defined as the inverse of the MSE (MSE⁻¹).
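The inverse-MSE weighting can be written out in a few lines of Python. This is a sketch of the stated rule, with one assumption made explicit: the weights are normalised to sum to one so that the result is a weighted average. The MSE values and predictions are hypothetical numbers chosen for easy checking.

```python
def inverse_mse_weights(mse_values):
    """Weights proportional to 1/MSE, normalised to sum to one."""
    inverse = [1.0 / m for m in mse_values]
    total = sum(inverse)
    return [w / total for w in inverse]


def ensemble_predict(predictions_by_model, mse_values):
    """Weighted average of per-model predictions for each observation."""
    weights = inverse_mse_weights(mse_values)
    n_obs = len(predictions_by_model[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions_by_model))
        for i in range(n_obs)
    ]


# Hypothetical cross-validation MSEs and test-set predictions.
rf_mse, brt_mse = 0.02, 0.04
rf_pred = [0.3, 0.6]
brt_pred = [0.6, 0.3]

weights = inverse_mse_weights([rf_mse, brt_mse])
ens_pred = ensemble_predict([rf_pred, brt_pred], [rf_mse, brt_mse])
print([round(w, 3) for w in weights])   # [0.667, 0.333]
print([round(p, 3) for p in ens_pred])  # [0.4, 0.5]
```

Here the model with half the error receives twice the weight, so the ensemble prediction sits closer to the RF values.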

In: Machine Learning for Ecology and Sustainable Natural Resource Management (pages 126-130)