
During training, only the best-performing models (in terms of validation accuracy) were saved. The models were trained to return P(GFP+), the probability that the input image shows a GFP+ cell. A perfect model would return P(GFP+)=0.0 for each non-rod and P(GFP+)=1.0 for each rod, but in reality we obtain a broad distribution of probability values, as shown in the histograms in Figure 3.21. The histograms were obtained by predicting the cells from training set 1 (see Table 5) using MLP1 (top), MLP2 (middle) and CNN1 (bottom), which were trained without (0%, left histograms) and with 20% generated data (20%, right histograms). As expected, GFP+ cells (green histograms) return larger P(GFP+) values than GFP- cells (gray histograms), but the two distributions overlap to some extent for all models.

The analysis routine and the data are identical to those used for Figure 3.11 (area gating), Figure 3.12 (random forest based on online features) and Figure 3.14 (random forest based on online and offline features), which allows the methods to be compared directly. The overlap of the green and gray histograms is smaller for all DNNs (MLPs and CNN) than for all other methods (compare Figure 3.21 to Figure 3.11 A, Figure 3.12 A, and Figure 3.14 A), suggesting superior classification performance of the DNNs.

Usually, the threshold P(GFP+)thresh, above which an event is predicted as GFP+, is 0.5. By increasing P(GFP+)thresh (indicated by a red line in the upper left histogram in Figure 3.21), the number of false positives within the target region decreases, which increases the concentration of rods, as shown by the middle plots in Figure 3.21. Simultaneously, the number of rods in the target region (“yield”) decreases, as shown by the plots on the right in Figure 3.21. For all three DNNs, the concentration of rods increases until approximately P(GFP+)thresh≈0.9. Above 0.9, the yield drops to zero. MLPs trained with 20% generated data show improved performance in terms of rod concentration and yield. This is not the case for CNN1, but CNN1 generally performs better than the MLPs. A concentration above 90% is obtained by MLP2 for thresholds above 0.85 and by CNN1 already for P(GFP+)thresh≥0.8. For the validation data and the testing data, the distributions might look different, and one might prefer a different threshold. However, since P(GFP+)thresh modulates the prediction, it belongs to the model parameters and has to be defined using the training data alone.
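The trade-off between concentration and yield described above can be illustrated with a minimal sketch. It assumes that an event enters the target region when its P(GFP+) is at or above the threshold, that the concentration is the fraction of events in the target region that are rods, and that the yield is the fraction of all rods that end up in the target region. The function name and the toy values are hypothetical and are not taken from the analysis routine used here.

```python
def concentration_and_yield(probs, is_rod, thresh):
    """Return (concentration, yield) of rods in the target region.

    probs  : predicted P(GFP+) per event
    is_rod : True if the event is a rod (GFP+), else False
    thresh : classification threshold; events with P(GFP+) >= thresh
             are placed in the target region (assumption)
    """
    in_target = [p >= thresh for p in probs]
    n_target = sum(in_target)
    n_rods_in_target = sum(1 for t, r in zip(in_target, is_rod) if t and r)
    n_rods_total = sum(is_rod)
    # Concentration: share of target-region events that are rods.
    concentration = n_rods_in_target / n_target if n_target else float("nan")
    # Yield: share of all rods that reach the target region (assumed definition).
    rod_yield = n_rods_in_target / n_rods_total if n_rods_total else float("nan")
    return concentration, rod_yield


# Hypothetical toy data: eight events with model outputs and true labels.
probs = [0.05, 0.20, 0.40, 0.60, 0.70, 0.85, 0.95, 0.97]
is_rod = [False, False, True, False, True, True, True, True]

for thresh in (0.5, 0.8, 0.9):
    c, y = concentration_and_yield(probs, is_rod, thresh)
    print(f"thresh={thresh:.1f}  concentration={c:.2f}  yield={y:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.9 raises the concentration from 0.80 to 1.00 while the yield falls from 0.80 to 0.40, which mirrors the qualitative behaviour shown in Figure 3.21.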

Figure 3.21 Performance of DNN based cell classification

Histograms show the probability distributions for GFP+ (green) and GFP- (grey) cells resulting from training set 1 using MLP1, MLP2 and CNN1, which were trained without generated data (0%, blue) and with 20% generated data (red), respectively. The red line indicates the threshold P(GFP+)thresh, which is the decision boundary between GFP+ and GFP-. The concentration and number of rods (yield) within the target region depend on the threshold, as shown in the middle and right plots. The target concentrations and yields were determined for all three training sets individually, and the plots show the resulting mean and standard error of the mean. This routine has been applied to models that were trained without generated data (0%, blue) and to models that were trained with 20% generated data (20%, red).

To allow comparison with the area gating method (see section 3.3.2) and random forest based classification (see section 3.3.3), the identical analysis routine was applied to obtain the confusion matrices: for each DNN, a threshold P(GFP+)thresh was determined that delivers, on average, a yield of 40% on the training sets. This threshold is denoted as P(GFP+)thresh 40. The confusion matrices that result from applying the models to the validation and test sets are shown in Figure 3.22. They indicate that each DNN performs better on the validation set than the random forests (see Figure 3.14 B) or the area gating method (see Figure 3.11).
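The procedure of fixing the threshold on the training data and then evaluating on held-out data can be sketched as follows. This is only an illustration under stated assumptions: the yield is taken as the fraction of rods with P(GFP+) at or above the threshold, the confusion matrix is normalized per true class, and a single training set is used instead of averaging over three. All function names and numbers are hypothetical.

```python
def threshold_for_yield(probs, is_rod, target_yield=0.40):
    """Return the highest threshold whose training-set yield reaches target_yield.

    Candidate thresholds are the P(GFP+) values of the rods themselves,
    visited from highest to lowest. With tied values the actual yield at
    the returned threshold can be slightly above target_yield.
    """
    rod_probs = sorted((p for p, r in zip(probs, is_rod) if r), reverse=True)
    for k, thresh in enumerate(rod_probs, start=1):
        if k / len(rod_probs) >= target_yield:
            return thresh
    raise ValueError("no rods in the data, cannot determine a threshold")


def normalized_confusion(probs, is_rod, thresh):
    """Return {true_label: [frac predicted non-rod, frac predicted rod]}.

    Each row is normalized by the number of events of that true class
    (assumed normalization; both classes must be present).
    """
    counts = {False: [0, 0], True: [0, 0]}
    for p, r in zip(probs, is_rod):
        counts[r][int(p >= thresh)] += 1
    return {label: [c / sum(row) for c in row] for label, row in counts.items()}


# Hypothetical training data: the threshold is defined here only.
train_probs = [0.10, 0.30, 0.55, 0.92, 0.35, 0.65, 0.80, 0.90, 0.99]
train_is_rod = [False, False, False, False, True, True, True, True, True]
thresh_40 = threshold_for_yield(train_probs, train_is_rod, target_yield=0.40)

# Hypothetical validation data: the fixed threshold is applied unchanged.
val_probs = [0.05, 0.45, 0.91, 0.60, 0.50, 0.93, 0.95, 0.97]
val_is_rod = [False, False, False, False, True, True, True, True]
cm = normalized_confusion(val_probs, val_is_rod, thresh_40)

print("threshold for 40% yield:", thresh_40)
print("true non-rod:", cm[False])
print("true rod:    ", cm[True])
```

Keeping the threshold search and the evaluation in separate functions makes it explicit that the validation and testing data never influence the choice of threshold.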

Figure 3.22 Confusion matrices for DNNs

Normalized confusion matrices show the classification performance of each DNN model. The adjusted classification threshold that results in a yield of 40% on the training set (P(GFP+)thresh 40) is used.

An analysis of the concentration and yield in the target region for each model (MLP1, MLP2 and CNN1) is shown in Figure 3.23. Besides the usual threshold P(GFP+)thresh 40, a second threshold, which delivers a yield of 20% on the training sets (denoted as P(GFP+)thresh 20), is used to check whether higher concentrations can be achieved. As one would expect, the achievable concentration is higher for the validation data (dark gray bars) than for the testing data (light gray bars), since the testing data correspond to new biological replicates, which might show slightly different phenotypes and therefore deviate from the training and validation data. Furthermore, the testing data were recorded several months after the training data, and during that time the measurement system was in continuous use and was adjusted and modified (a different LED was installed). Both the validation and the testing dataset contain data from three biological replicates, which were analyzed individually. This allows a mean and a standard error of the mean to be computed for concentration and yield, which are shown in Figure 3.23. The left plot in Figure 3.23 A shows the concentration of rods in the target region when using P(GFP+)thresh 40. For the validation data, the concentrations are above 80% for all MLPs and even above 85% for CNN1. The concentrations for the testing set are on average approximately 8% lower than for the validation set. The right plot in Figure 3.23 A shows the same analysis when using P(GFP+)thresh 20. Interestingly, MLP2 reaches a concentration of 89.2% for the validation set, but at the same time the yield drops by approximately 20% and 10% for the validation data and testing data, respectively (see Figure 3.23 B). One can therefore conclude that increasing P(GFP+)thresh increases the resulting concentration, but at the cost of a large decrease in yield.
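The mean and standard error of the mean over the three replicates can be computed as in the following sketch. It assumes that the standard error is the sample standard deviation (n-1 in the denominator) divided by the square root of the number of replicates; the text does not state which estimator was used. The replicate values are hypothetical and serve only to show the arithmetic.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of replicate values.

    Assumption: SEM = sample standard deviation (n-1) / sqrt(n).
    """
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical rod concentrations from three replicates, analyzed individually.
replicate_concentrations = [0.80, 0.85, 0.90]
mean, sem = mean_and_sem(replicate_concentrations)
print(f"concentration = {mean:.3f} ± {sem:.3f}")
```

With only three replicates per dataset, the standard error is itself a rough estimate, so small differences between bars should be read with that in mind.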

Figure 3.23 Performance of models on validation and testing set

(A) Barplots show the concentration of rods in the target region when the threshold P(GFP+)thresh is adjusted such that a yield of 40% (P(GFP+)thresh 40, left) or 20% (P(GFP+)thresh 20, right) is achieved. Models were either trained using 0% or 20% generated data. Each model was applied to three validation and three testing datasets individually, resulting in a mean concentration and standard error of the mean, which is displayed by error bars.

Table 8 summarizes c40 of each algorithm when applied to the training and validation sets:

Approach                                     c40 (training set)    c40 (validation set)
Area gating                                  66.5% ± 3.8%          69.20% ± 4.6%
Random forest (online features)              75.7% ± 4.1%          72.29% ± 4.7%
Random forest (online & offline features)    83.5% ± 2.6%          77.91% ± 1.8%
MLP1 (0% gen. data)                          83.0% ± 3.5%          82.9% ± 1.7%
MLP1 (20% gen. data)                         84.2% ± 3.8%          82.5% ± 3.6%
MLP2 (0% gen. data)                          85.2% ± 3.5%          80.4% ± 3.1%
MLP2 (20% gen. data)                         86.1% ± 3.6%          84.9% ± 2.1%
CNN1 (0% gen. data)                          92.0% ± 3.1%          87.7% ± 1.8%
CNN1 (20% gen. data)                         91.7% ± 3.0%          87.3% ± 4.1%

Table 8 Comparison of c40 for all classification methods

The training datasets were tightly restricted to very low (GFP-) and very high (GFP+) fluorescence intensities in order to avoid training on mislabeled cells (see section 3.3.1). In the following, the performance of the models is checked on unfiltered datasets spanning the full range of fluorescence values. In the worst case, cells with moderate fluorescence levels would show a different phenotype and would be incorrectly classified by the models.

The blue vertical histograms in Figure 3.24 show the distributions of GFP fluorescence intensities of two testing sets (without the gating to fluorescence ranges shown in section 3.3.1). Testing set 1 was captured at a higher laser power, which causes relatively higher fluorescence values compared to testing set 3. The blue horizontal histograms show the corresponding area distributions. Cells of testing set 1 tend to be slightly larger than those of testing set 3, which could be due to biological variation. The square plots show scatterplot contours at 95% (solid blue line) and 50% (dashed blue line) of the maximum event density. Here, the initial data of both testing sets show one population at low and one at high fluorescence intensities. MLP1 was used to predict which events are rods, and the area and fluorescence distributions of the identified cells are shown by green histograms and contours. While the fluorescence histogram of the initial dataset (blue vertical histograms) shows two peaks, the distribution of the cells that are predicted to be rods has only one peak, at high fluorescence intensity. Even though there is a difference in cell size between testing sets 1 and 3, the MLP1-based selection of rods succeeds in shifting the fluorescence distributions towards higher values for both datasets. This indicates that the model is robust to such differences in phenotype.

Figure 3.24 Rod-identification in full range of fluorescence values using MLP1

The plots show the area (horizontal histograms) and fluorescence (vertical histograms) distributions of two testing sets. Square plots between the histograms show the 95% (solid line) and 50% (dashed line) scatterplot contour lines. The distribution of the initial data is shown in blue. MLP1 was used to predict which events correspond to rods, and the resulting populations are shown in green.
