Sampling effort is another significant element when quantifying and qualifying diversity, in particular when comparing the diversities of different microbiomes. Whatever the criterion of measurement (richness or evenness, with or without weighting on genetic relatedness) and the specific index applied (e.g. Shannon, unweighted UniFrac), the size of the sample (e.g. number of sequences found per soil sample) will affect the final results (Gotelli and Colwell, 2001; Lozupone, 2007; Magurran, 2004). One common understanding in statistical biodiversity is that the number of types of organisms observed increases with sampling effort until all types (e.g. OTUs) are observed (Hughes et al., 2001; Magurran, 2004). In practice, this pattern is illustrated by plotting ‘species accumulation curves’ which, on the basis of DNA sequencing data, record the cumulative number of OTUs as a function of sequencing depth (e.g. from 1 to more than 10,000 sequences/sample, as in the case of the bacterial and fungal datasets in this survey). Sampling effort and species accumulation curves are thus closely linked.
The species accumulation curve (and closely related surrogates, e.g. rarefaction curves and individual-based taxon sampling curves) provides useful information on the relationship between diversity and sampling effort by showing the rate at which new species are found (Gotelli and Colwell, 2001; Hughes et al., 2001; Magurran, 2004). Species accumulation curves are constructed from left to right as the sampling effort increases along the x-axis. In general terms, these curves rise relatively rapidly at first and much more slowly in later samples, as increasingly rare species are added, until they are expected to reach an asymptote (Gotelli and Colwell, 2001). The initial steepness of most accumulation curves reflects the discovery of taxa that correspond to more abundant organisms, which have a higher probability of being detected with minimal sampling. As the curve begins to plateau, the new OTUs detected come from lower-abundance or rarer populations. The richer and more uneven the community, the longer it takes for the curve to level off, since new species continue to be found as sampling continues. For example, assuming a sufficient sampling effort, curves for bacteria rarely approach the plateau, whereas those for archaeal communities can reach it at a level of diversity tenfold lower than that of bacteria (Reid et al., 2011).
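As a minimal illustration of how such a curve is built, the following Python sketch counts the cumulative number of distinct OTUs over increasing sequencing depths. The toy community, the OTU labels and the function name `accumulation_curve` are hypothetical and are not taken from the datasets of this survey; the reads are also assumed to arrive in random order.

```python
import random

def accumulation_curve(reads, depths):
    """Return the number of distinct OTUs seen in the first d reads, for each depth d."""
    return [len(set(reads[:d])) for d in depths]

# Hypothetical toy community: 3 abundant OTUs (200 reads each)
# and 27 rare OTUs (2 reads each).
abundances = {f"otu_{i}": (200 if i < 3 else 2) for i in range(30)}
reads = [otu for otu, n in abundances.items() for _ in range(n)]
random.Random(0).shuffle(reads)  # assumption: sequencing order is random

depths = [10, 50, 100, 300, len(reads)]
curve = accumulation_curve(reads, depths)
for d, s in zip(depths, curve):
    print(f"{d:>4} reads -> {s} OTUs observed")
```

In a community this uneven, the abundant OTUs are typically picked up within the first few reads, while the rare ones keep appearing until the full depth is reached.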
In addition, the curve can be ‘normalized’ by randomizing the subsamples as it is created, with or without replacement (sampling without replacement is recommended, although it penalizes the calculation of variance) (Colwell, 2013). This procedure is suggested because the shape of the sampling curve is very sensitive to the order in which the subsamples are taken at each depth of sampling (e.g. 10; 50; 1,000; 10,000 sequences). For instance, a first subsample taken at a depth of 10 sequences can yield a higher or lower number of observed species than a second subsample of the same size from the same dataset. Consequently, different subsampling runs can produce completely different curves, leading to completely different interpretations of diversity patterns. For this reason, the accumulation curve is normalized by randomizing the subsampling protocol prior to the use of diversity estimators and indices. This procedure is distinct from the randomization and rarefaction used for comparative analyses (‘rarefying’), which is discussed below (Colwell, 2013; Gotelli and Colwell, 2001; Magurran, 2004).
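The randomization described above can be sketched as follows: the order of the reads is shuffled many times (sampling without replacement) and the observed OTU count at each depth is averaged across shuffles. This is only an illustrative sketch; the function name `mean_accumulation_curve`, the number of permutations and the toy dataset are assumptions, not the procedure or settings used in this study.

```python
import random

def mean_accumulation_curve(reads, depths, n_perm=200, seed=0):
    """Average the observed OTU count at each depth over random read orders."""
    rng = random.Random(seed)
    totals = [0] * len(depths)
    for _ in range(n_perm):
        # A random permutation of the reads, i.e. sampling without replacement.
        order = rng.sample(reads, len(reads))
        for j, d in enumerate(depths):
            totals[j] += len(set(order[:d]))
    return [t / n_perm for t in totals]

# Hypothetical toy dataset: 2 abundant OTUs and 10 rare OTUs (2 reads each).
reads = ["a"] * 50 + ["b"] * 30 + [f"rare_{i}" for i in range(10) for _ in range(2)]
depths = [1, 10, 50, len(reads)]
smooth = mean_accumulation_curve(reads, depths)
print(dict(zip(depths, smooth)))
```

Averaging removes the dependence on any single subsampling order, so the resulting curve rises smoothly instead of reflecting one arbitrary sequence of draws.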
Limitations in sample size can be an important concern in environmental microbial studies. Diversity metrics differ in their sensitivity to sample size. For example, estimators based on species richness, e.g. the Chao and Jackknife estimators, are highly sensitive to sampling effort (Hughes et al., 2001; Magurran, 2004). Other indices, such as those based on taxonomic differences, are found to be more accurate when measuring diversity at a low sampling density. When the sampling effort is not exhaustive enough, the accumulation curve can be ‘extrapolated’.
Extrapolating this curve predicts the increase in species richness as the sampling effort is intensified, rather than providing an estimate of total richness (Magurran, 2004). However, it is argued that this technique has limited applicability in microbial studies, since these communities are often so abundant and diverse that the accumulation curve either has not yet begun to approach the asymptote or does not fit the extrapolation model well enough to predict where it levels off (Lozupone, 2007). These issues are largely controlled by relying on a deep sequencing dataset, which brings the sampling effort as close as possible to the asymptote, as in the case of this study.
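To make the sensitivity of richness estimators to sampling effort more concrete, the sketch below computes one standard form of the Chao estimator, the bias-corrected Chao1, S_obs + F1(F1 − 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singletons and doubletons. The choice of this variant and the count vectors are assumptions made for illustration only; the text does not specify which Chao variant is meant.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from per-OTU sequence counts."""
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)
    f1 = sum(1 for c in observed if c == 1)  # singletons
    f2 = sum(1 for c in observed if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical counts for one community at a shallow and at a deeper sampling effort.
shallow = [10, 5, 1, 1, 1, 2]
deep = [100, 50, 12, 9, 11, 20, 1, 2, 2]
print("shallow:", chao1(shallow))
print("deep:   ", chao1(deep))
```

Because the estimate depends only on the observed richness and on the rarest classes (singletons and doubletons), it shifts as deeper sampling converts singletons into better-represented OTUs.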
The statistical expectation of the corresponding accumulation curve is estimated by ‘interpolation’, most commonly referred to as ‘rarefaction’ (Colwell et al., 2012). Rarefaction generates the ‘expected’ number of species in a small collection of n individuals drawn from a larger pool of N individuals (the entire collection, i.e. the curve depends upon every individual in the pool at the accumulation curve’s right-hand end). In contrast to accumulation curves, rarefaction curves move from right to left, as the full dataset is increasingly rarefied (Gotelli and Colwell, 2001).
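For individual-based data, this expectation can be computed analytically with the classical hypergeometric rarefaction formula, E[S_n] = Σ_i [1 − C(N − N_i, n) / C(N, n)], where N_i is the number of individuals of species i and N is the pool size. The following Python sketch applies it to a hypothetical vector of OTU counts; the function name `expected_richness` and the data are illustrative assumptions.

```python
from math import comb

def expected_richness(counts, n):
    """Expected number of OTUs in a random subsample of n reads (without replacement)."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total number of reads")
    denom = comb(total, n)
    # For each OTU: 1 - probability that the OTU is absent from the subsample.
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)

# Hypothetical pool of N = 100 reads across 7 OTUs, rarefied from right to left.
counts = [50, 30, 10, 5, 3, 1, 1]
for n in (100, 50, 10, 1):
    print(f"n = {n:>3}: expected OTUs = {expected_richness(counts, n):.2f}")
```

At n = N the formula returns the observed richness of the full collection, and the expected value decreases as n is reduced, which mirrors the right-to-left reading of a rarefaction curve.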
The purpose of generating rarefaction curves is to make direct comparisons among communities on the basis of the number of individuals in the smallest sample (Crist and Veech, 2006; Magurran, 2004). The method has been widely applied in microbial ecology, especially to estimate how effectively the sampling effort represents the diversity of the total microbial community, which is particularly critical when working with millions of DNA sequences distributed unevenly across the species/OTUs. For example, it is unreliable to analyze diversity when some OTUs end up with millions of DNA sequences whereas others are represented by only one (singleton) or two (doubleton) sequences. For this reason, diversity is estimated using a rarefied dataset, and all the measures are made using the same number of sequences, i.e. at the same depth of sampling.
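A minimal sketch of rarefying to an even depth is given below: each sample is randomly subsampled, without replacement, down to the size of the smallest library before richness is compared. The sample names, OTU tables and the helper `rarefy` are hypothetical; this is not the pipeline used in this study.

```python
import random

def rarefy(counts, depth, rng):
    """Randomly subsample `depth` reads without replacement from one sample's OTU counts."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied

# Hypothetical OTU tables for two samples with very different library sizes.
samples = {
    "soil_A": {"otu_1": 900, "otu_2": 80, "otu_3": 15, "otu_4": 5},
    "soil_B": {"otu_1": 60, "otu_2": 30, "otu_5": 10},
}
rng = random.Random(42)
depth = min(sum(c.values()) for c in samples.values())  # smallest library size
even = {name: rarefy(c, depth, rng) for name, c in samples.items()}
richness = {name: len(c) for name, c in even.items()}
print(depth, richness)
```

After this step every sample contains the same number of sequences, so any diversity measure computed on `even` is made at the same depth of sampling.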
McMurdie and Holmes (2014) stated that rarefaction is one of the common procedures for addressing differences in sequencing effort across samples (different library sizes); another classic one uses the proportional abundance of each species in a library. These authors are strong critics of ‘rarefaction’, arguing that it throws away data from the larger libraries, which is a waste of valuable information. Nevertheless, they noted that rarefaction is adequate when comparing ‘obviously different’ microbiomes, such as those in this investigation.
‘Interpolation’ (rarefaction) and ‘extrapolation’ are more clearly explained by Colwell et al. (2012), who pointed out that interpolation estimates the ‘expected’ number of species in a random sample of a smaller number of individuals or a smaller area sampled, whereas extrapolation estimates the number of species that ‘might be expected’ in a larger number of individuals or a larger area sampled.
In summary, whether based on raw observed, normalized, extrapolated or rarefied data, the final shape of the accumulation curve provides an estimate of diversity as well as of how effectively the sampling effort represents that diversity. As indicated by Hughes et al. (2001), differences in both richness and relative abundance in the sampled communities underlie the differences in the shape of the curves. These curves therefore: (i) indicate the total diversity of the community that has been sampled; (ii) show how representative different depths of sampling are for estimating the total diversity of the community, i.e. the effectiveness of the sampling effort; (iii) can be extrapolated to estimate the total species richness when the sampling effort is not sufficiently exhaustive; and (iv) through a surrogate of this type of curve (rarefaction curves), provide a way to compare communities that were unevenly sampled, i.e. with very different numbers of sequences (Crist and Veech, 2006; Hughes et al., 2001; Magurran, 2004).