In recent years, neural network methods have been the most accurate approaches to FD. They learn statistical patterns from training samples and use the learned patterns to classify new samples [Specht, 1990]. They are sophisticated models that provide robust FD. In this section, we present Hybrid CNN, 3D CNN, Multi-Scale CNN, Encoder-Decoder CNN, RNN and Attention Neural Network methods.
3.2.1 Hybrid CNN methods
Hybrid methods combine handcrafted or statistical methods with a CNN to perform FD. The model proposed by [Simonyan and Zisserman, 2014a] is composed of two CNN streams: one processes spatial information and the other processes temporal information. The spatial stream receives a single frame and the temporal stream takes a sequence of optical flow frames; their outputs are then combined by a fusion method. The model exploits both spatial and temporal information for the action recognition task; nevertheless, it has a drawback when applied to videos with camera motion.
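The fusion of the two streams can be sketched as follows. This is a minimal illustration, assuming that each stream has already produced per-class scores and that the fusion is a weighted average of softmax probabilities; the function names, the `temporal_weight` parameter and the example scores are hypothetical and not taken from [Simonyan and Zisserman, 2014a].

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def fuse_two_streams(spatial_scores, temporal_scores, temporal_weight=0.5):
    """Weighted average of the per-class probabilities of both streams.

    Averaging is an assumption of this sketch; the text only states
    that "a fusion method" combines the outputs.
    """
    p_spatial = softmax(np.asarray(spatial_scores, dtype=float))
    p_temporal = softmax(np.asarray(temporal_scores, dtype=float))
    return (1.0 - temporal_weight) * p_spatial + temporal_weight * p_temporal

# Made-up scores for three action classes.
fused = fuse_two_streams([2.0, 0.5, 0.1], [0.2, 3.0, 0.1])
predicted_class = int(np.argmax(fused))
```

In this toy case the temporal stream is more confident than the spatial one, so the fused prediction follows the temporal stream.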
DeepCNN [Babaee et al., 2018] mixes two handcrafted methods and a CNN. The background model is generated by combining the segmentation mask from SuBSENSE [St-Charles et al., 2014] with the output of the Flux Tensor algorithm [Wang et al., 2014a], which is the time variation of the optical flow field. A CNN then processes the input frame together with the background model. Next, DeepCNN computes the average of a video sequence queue with adaptive length. Finally, in the segmentation step, a spatial median filter takes the median over a neighborhood to avoid mislabeled foreground objects. DeepCNN feeds the output of the two handcrafted methods to the CNN in order to improve the foreground segmentation; however, the model depends on the performance of SuBSENSE.
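The spatial median filtering used as post-processing can be illustrated with a small sketch. It assumes a binary mask, a 3×3 neighborhood and zero padding at the borders; these choices and the function name are assumptions for illustration, not details reported for DeepCNN.

```python
import numpy as np

def median_filter_mask(mask, size=3):
    """Apply a size x size spatial median filter to a 2D binary mask.

    Borders are zero-padded (an assumption of this sketch).
    """
    pad = size // 2
    padded = np.pad(mask, pad, mode="constant")
    h, w = mask.shape
    # Collect every shifted view of the neighbourhood, then take the
    # pixel-wise median across them.
    windows = [padded[dy:dy + h, dx:dx + w]
               for dy in range(size) for dx in range(size)]
    return np.median(np.stack(windows, axis=0), axis=0).astype(mask.dtype)

noisy = np.zeros((5, 5), dtype=np.uint8)
noisy[1:4, 1:4] = 1   # a 3x3 foreground object
noisy[0, 4] = 1       # an isolated, mislabeled foreground pixel
noisy[2, 2] = 0       # a hole inside the object
cleaned = median_filter_mask(noisy)
```

The isolated pixel is removed and the hole is filled because, in both cases, most of the neighborhood disagrees with the pixel. Note that the filter also erodes the corners of small objects.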
3.2.2 3D CNN methods
3D CNN methods add 3D CNN modules to the pipeline of the FD method, which leads to tensors with an extra dimension. For the FD problem, this extra dimension carries temporal information, that is, the sequence of frames of the input video. The model proposed by [Gao et al., 2018] takes four past frames as input. Its architecture consists of four 3D CNN layers with max pooling, two fully connected layers and a tanh activation function. Each pixel in the output segmentation has a value between 0 and 255. To eliminate ambiguity, pixels with a value less than 255 are labeled as background and all others as foreground. Because the network is shallow, the resulting segmentation does not correctly delimit the foreground objects.
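To make the role of the extra dimension concrete, the following sketch applies a single-channel 3D convolution (implemented as cross-correlation, without padding) to a stack of four frames. The kernel size and the all-ones data are arbitrary choices for illustration and are not taken from [Gao et al., 2018].

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_valid(clip, kernel):
    """Single-channel 3D cross-correlation without padding.

    clip:   array of shape (T, H, W), a stack of T grayscale frames
    kernel: array of shape (kt, kh, kw)
    """
    # windows has shape (T-kt+1, H-kh+1, W-kw+1, kt, kh, kw)
    windows = sliding_window_view(clip, kernel.shape)
    return np.einsum("thwijk,ijk->thw", windows, kernel)

clip = np.ones((4, 5, 5))      # four past frames of 5x5 pixels
kernel = np.ones((2, 3, 3))    # spans 2 frames and a 3x3 spatial window
features = conv3d_valid(clip, kernel)
```

Each output value depends on a 3×3 spatial window in two consecutive frames, so the filter responds to changes over time as well as over space.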
The model of [Sakkos et al., 2018] takes ten frames as input and divides them into four groups with a stride of two. Its architecture consists of five 3D CNN layers with max pooling; the feature map from each layer is up-sampled, and the up-sampled maps are concatenated and fed to a convolution layer, followed by a sigmoid activation function. This model employs the same threshold strategy as the previous one. It segments the foreground effectively; however, it has a high computational complexity.
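The temporal grouping of the input can be sketched as a sliding window over the frame axis. The text gives only the number of frames, the number of groups and the stride; the group length of four frames used here is an assumption chosen because it is the length that yields exactly four groups from ten frames with a stride of two.

```python
import numpy as np

def group_frames(frames, group_size=4, stride=2):
    """Split a (T, H, W) clip into overlapping temporal groups.

    group_size=4 is an assumption of this sketch (see the lead-in).
    """
    starts = range(0, frames.shape[0] - group_size + 1, stride)
    return np.stack([frames[s:s + group_size] for s in starts], axis=0)

# Ten 2x2 frames whose pixel values equal the frame index.
clip = np.arange(10).reshape(10, 1, 1) * np.ones((10, 2, 2))
groups = group_frames(clip)
```

The groups start at frames 0, 2, 4 and 6, so consecutive groups overlap by two frames.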
3.2.3 Multi-Scale CNN methods
Multi-scale CNN methods employ different scales of the input to extract local and global features from an input frame. Depending on the proximity of the foreground objects and the position of the camera, foreground objects of different sizes appear in the video sequence; a multi-scale CNN method can overcome this problem.
PSPnet [Zhao et al., 2017] employs a CNN to process the input frame; a pyramid parsing module is then applied to harvest different sub-region representations, which are up-sampled and concatenated to generate the feature representation. Finally, the feature representation is fed into a convolutional layer to get the target segmentation. The different scales generated by the pyramid parsing module provide additional contextual information for the scene parsing problem. An overview of the PSPnet architecture is presented in Fig. 3.2. The model of [Zeng et al., 2019] uses PSPnet to perform the BS task. It employs two streams: one performs BS using SuBSENSE [St-Charles et al., 2014] and the other performs semantic segmentation using the PSPnet model. Finally, to generate the output, the model uses the semantic segmentation pixels if they are greater than a threshold; otherwise, it uses the BS segmentation pixels.
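The pixel-wise selection rule that combines the two streams can be written in a few lines. This is a sketch under stated assumptions: both outputs are taken to be maps in [0, 1] of the same size, and the threshold value of 0.5 and the function name are hypothetical.

```python
import numpy as np

def fuse_masks(semantic_prob, bs_mask, threshold=0.5):
    """Keep the semantic output where it exceeds the threshold,
    otherwise fall back to the background subtraction (BS) output."""
    semantic_prob = np.asarray(semantic_prob, dtype=float)
    bs_mask = np.asarray(bs_mask, dtype=float)
    return np.where(semantic_prob > threshold, semantic_prob, bs_mask)

semantic = [[0.9, 0.2],
            [0.6, 0.1]]
bs = [[0.0, 1.0],
      [0.0, 0.0]]
combined = fuse_masks(semantic, bs)
```

Pixels where the semantic stream is confident keep the semantic value, and the remaining pixels inherit the BS decision.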
[Figure 3.2 diagram: Input Image, CNN, Feature Map, Pyramid Pooling Module (Pool, CONV, Upsample, CONCAT), CONV, Final Prediction]
Figure 3.2: Overview of the PSPnet Architecture [Zhao et al., 2017].
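The pooling, up-sampling and concatenation steps of the pyramid module can be sketched in NumPy. This is a simplified illustration: the bin sizes (1, 2, 3, 6) are assumed here, nearest-neighbor up-sampling is used for brevity, and the convolutions applied to each pooled branch are omitted.

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) feature map down to (C, bins, bins)."""
    c, h, w = feat.shape
    out = np.empty((c, bins, bins), dtype=float)
    for i in range(bins):
        # Row range: floor(i*h/bins) to ceil((i+1)*h/bins).
        r0, r1 = (i * h) // bins, -((-(i + 1) * h) // bins)
        for j in range(bins):
            c0, c1 = (j * w) // bins, -((-(j + 1) * w) // bins)
            out[:, i, j] = feat[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def upsample_nearest(feat, h, w):
    """Nearest-neighbour up-sampling of a (C, fh, fw) map to (C, h, w)."""
    _, fh, fw = feat.shape
    rows = (np.arange(h) * fh) // h
    cols = (np.arange(w) * fw) // w
    return feat[:, rows][:, :, cols]

def pyramid_pooling(feat, bin_sizes=(1, 2, 3, 6)):
    """Concatenate the input with its pooled and up-sampled versions."""
    _, h, w = feat.shape
    branches = [feat]
    for bins in bin_sizes:
        pooled = adaptive_avg_pool(feat, bins)
        branches.append(upsample_nearest(pooled, h, w))
    return np.concatenate(branches, axis=0)

feat = np.random.default_rng(0).random((8, 12, 12))
context = pyramid_pooling(feat)
```

The coarsest branch (one bin) carries the global average of each channel to every pixel, while finer bins preserve progressively more local context.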
The method proposed by [Wang et al., 2017] has an architecture of four CNN layers with max pooling and ReLU activation functions, two fully connected layers and a sigmoid activation function. The input frame is scaled by factors of 0.75, 0.5 and 1. Each scaled input is fed to the network; the three output segmentations are then up-sampled to the original dimensions and averaged. The averaged result is concatenated to the input frame and fed into a second network with the same architecture. Finally, a threshold is applied to the output segmentation. This cascaded CNN model obtains more spatial information from the coherence of adjacent pixels; however, it only uses isolated frames as input.
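The multi-scale averaging step can be sketched as follows. The network itself is replaced by a hypothetical stand-in (`toy_predict`), nearest-neighbor resizing is assumed, and the final threshold of 0.5 is an illustrative choice; the second, cascaded stage is not shown.

```python
import numpy as np

def resize_nearest(image, h, w):
    """Nearest-neighbour resize of a 2D array to (h, w)."""
    rows = (np.arange(h) * image.shape[0]) // h
    cols = (np.arange(w) * image.shape[1]) // w
    return image[rows][:, cols]

def multiscale_average(frame, predict, scales=(0.75, 0.5, 1.0)):
    """Run `predict` at several scales and average the up-sampled outputs."""
    h, w = frame.shape
    outputs = []
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        scaled = resize_nearest(frame, sh, sw)
        outputs.append(resize_nearest(predict(scaled), h, w))
    return np.mean(outputs, axis=0)

def toy_predict(img):
    """Hypothetical stand-in for the CNN: maps intensities to [0, 1]."""
    return img / 255.0

frame = np.tile(np.linspace(0, 255, 8), (8, 1))
probability = multiscale_average(frame, toy_predict)
mask = probability > 0.5   # threshold value is an assumption
```

Any function that maps a 2D frame to a 2D map of the same size can be passed as `predict`, which keeps the scale handling separate from the model.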
3.2.4 Encoder-Decoder CNN methods
Encoder-decoder methods have an architecture of two modules. The first module extracts feature maps from the input. The second module transforms these feature maps to generate the output. Encoder-decoder methods are the most widely used methods to solve the FD problem. The model proposed by [Lim and Keles, 2018] employs a VGG network as an encoder and a Transposed Convolutional Network (TCNN) as a decoder. The encoder is computed three times in parallel, with the input frame scaled by factors of 1/3, 2/3 and 1. The output feature maps are concatenated and then fed into the TCNN to up-sample the features. Finally, the model employs a sigmoid activation function and a threshold to label the pixels. This model uses only isolated frames to segment the foreground, which is undesirable for the FD task.
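The decoder side of such a model, that is, up-sampling with a transposed convolution followed by a sigmoid and a threshold, can be illustrated with a single-channel sketch. The kernel, the stride and the threshold of 0.5 are assumptions chosen to keep the example small; they are not the settings of [Lim and Keles, 2018].

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Single-channel transposed convolution with a square kernel, no padding."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            # Each input value scatters a scaled copy of the kernel.
            out[i * stride:i * stride + k,
                j * stride:j * stride + k] += x[i, j] * kernel
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits_small = np.array([[-2.0, 3.0],
                         [1.0, -4.0]])
kernel = np.ones((2, 2))   # with stride 2, copies each value into a 2x2 block
logits = transposed_conv2d(logits_small, kernel)
mask = sigmoid(logits) > 0.5   # threshold value is an assumption
```

With a 2×2 kernel and a stride of two the scattered copies do not overlap, so the 2×2 map is doubled in resolution before each pixel is labeled.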
One of the best proposals for medical image segmentation is the U-net architecture [Ronneberger et al., 2015]. The model of [Kim and Ha, 2021] modifies this architecture to take a 10-channel input, formed by the differences between the last 10 frames and the current frame. This strategy provides temporal information from the video sequence. On the other hand, BSUV-Net [Tezcan et al., 2019] takes three frames as input: the first is an “empty” background frame with no foreground objects, the second is the temporal median of the latest 100 frames and the third is the current frame. Each frame is accompanied by a foreground probability map (FPM); therefore, each input is composed of four channels (R, G, B and FPM). The residual connections from the encoder to the decoder help the network to combine low-level visual information obtained in the initial layers with high-level visual information obtained in the deeper layers. However, this model does not have an update mechanism to modify the sequential input in case of consecutive frames with foreground objects. The architecture of BSUV-Net is presented in Fig. 3.3.
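The two input constructions mentioned above, a stack of frame differences and a temporal median background, can be sketched as follows. Grayscale frames and a signed difference (past minus current) are assumptions of this sketch, and the frames are random placeholders rather than real data.

```python
import numpy as np

def difference_channels(past_frames, current_frame):
    """Stack (past - current) differences into an (N, H, W) input.

    The sign convention is an assumption; the text only says "difference".
    """
    past = np.asarray(past_frames, dtype=np.float32)
    current = np.asarray(current_frame, dtype=np.float32)
    return past - current[None, :, :]

def temporal_median(frames):
    """Pixel-wise median over the time axis of a (T, H, W) stack."""
    return np.median(np.asarray(frames, dtype=np.float32), axis=0)

rng = np.random.default_rng(0)
history = rng.integers(0, 256, size=(100, 4, 4)).astype(np.uint8)
current = rng.integers(0, 256, size=(4, 4)).astype(np.uint8)

unet_input = difference_channels(history[-10:], current)   # 10 channels
recent_background = temporal_median(history)                # median of 100 frames
```

Casting to a floating-point type before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise produce.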
[Figure 3.3 diagram: inputs are the Empty Reference Frame, the Recent Reference Frame and the Current Frame; channel widths 64, 128, 256, 512, 512, 512, 256, 128, 64, 1; legend: 3x3 Conv + BN, SD + 2x2 Maxpooling + 3x3 Conv, Concatenation, 3x3 Up-Conv + BN, Sigmoid]
Figure 3.3: Overview of the BSUV-Net Architecture [Tezcan et al., 2019].
3.2.5 Recurrent Neural Network methods
RNN methods have the advantage of collecting information over a time window, which is useful in time series prediction. RNN methods have been used in image segmentation [Ye et al., 2020] and semantic segmentation [Visin et al., 2016], tasks that are related to FD. The model proposed by [Akilan et al., 2019] performs FD by capturing short-term signals with 3D convolution modules and long-term spatio-temporal signals with LSTM modules. Its architecture has an encoder with three micro-autoencoder blocks followed by a Conv-LSTM layer. Each micro-autoencoder block has a 3D Conv layer, a 3D Transpose Conv layer, a concatenation with the input of the micro-autoencoder block and a 3D Conv layer. The decoder has four blocks, each with a 3D Transpose Conv layer, a concatenation with its respective encoder block and a 3D Conv layer, and a Conv-LSTM module is added at the end of the decoder. Finally, a sigmoid activation function is applied to obtain the target segmentation, aiming at higher scores for foreground pixels and lower scores for background pixels, and a threshold is applied to the output segmentation. This method is part of the state of the art and we compare against it. The architecture of this method is presented in Fig. 3.4.
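For reference, a Conv-LSTM layer replaces the matrix products of a standard LSTM with convolutions. The update below is the common peephole-free formulation, given here as background rather than as the exact layer of [Akilan et al., 2019], whose activation functions may differ from the tanh shown. The symbols are assumptions of this sketch: $X_t$ is the input feature map, $H_t$ the hidden state, $C_t$ the cell state, $*$ denotes convolution and $\odot$ the element-wise product.

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o) \\
\tilde{C}_t &= \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
H_t &= o_t \odot \tanh(C_t)
\end{aligned}
```

Because the gates are computed by convolutions, the hidden and cell states keep their spatial layout, which is what allows the layer to accumulate spatio-temporal information over a sequence of feature maps.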
[Figure 3.4 diagram: Input, stacked Conv3D + ReLU layers with spatial subsampling, Batch Normalization, Conv3DT, ConvLSTM2D + ReLU, Dropout, Conv3D + Sigmoid, Output]
Figure 3.4: A layer-wise schematic of the 3D CNN-LSTM architecture [Akilan et al., 2019].
3.2.6 Attention Neural Network methods
Attention Neural Network methods restrict processing from the entire input to a subset of the visual field. The model proposed by [Oktay et al., 2018] uses a U-net architecture to capture a sufficiently large receptive field and, thus, semantic contextual information. Before each concatenation step, the model employs an attention mechanism called the additive attention gate, which computes the element-wise multiplication of the input feature maps and the attention coefficients. To obtain the coefficients, a linear transformation, implemented as a convolution with 1×1×1 kernels, is applied to the context information and to the input features. The resulting feature maps are added, and a ReLU activation function is applied, followed by a new linear transformation. Finally, the attention coefficients are obtained by applying a sigmoid activation function. The additive attention gate highlights the salient features coming from the skip connections. This allows model parameters in shallower layers to be updated primarily in relevant regions; however, the additive attention gate only uses linear transformations with 1×1×1 kernels, without any spatial support, to generate the attention coefficients.
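The additive attention gate can be sketched in NumPy. The sketch makes several simplifying assumptions: it works on 2D feature maps (so the kernels are 1×1 rather than 1×1×1), the input and context features already have the same spatial size, the two transformed maps are summed as the name "additive" suggests, and all weights are random placeholders. The function and variable names are hypothetical.

```python
import numpy as np

def conv1x1(feat, weight, bias):
    """1x1 convolution: a per-pixel linear map over channels.

    feat: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, feat) + bias[:, None, None]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def additive_attention_gate(x, g, w_x, b_x, w_g, b_g, w_psi, b_psi):
    """Gate skip-connection features x with context features g."""
    # Linear transformations of both inputs, summed, then ReLU.
    q = np.maximum(conv1x1(x, w_x, b_x) + conv1x1(g, w_g, b_g), 0.0)
    # Second linear transformation and sigmoid: (1, H, W) coefficients.
    alpha = sigmoid(conv1x1(q, w_psi, b_psi))
    # Element-wise multiplication, broadcast over the channels of x.
    return x * alpha, alpha

rng = np.random.default_rng(0)
c_x, c_g, c_int, h, w = 4, 6, 3, 5, 5
x = rng.standard_normal((c_x, h, w))
g = rng.standard_normal((c_g, h, w))
gated, alpha = additive_attention_gate(
    x, g,
    rng.standard_normal((c_int, c_x)), np.zeros(c_int),
    rng.standard_normal((c_int, c_g)), np.zeros(c_int),
    rng.standard_normal((1, c_int)), np.zeros(1),
)
```

Since every operation acts on one pixel at a time, the coefficient at a location depends only on the features at that location, which is the lack of spatial support noted above.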
On the other hand, the model of [Perreault et al., 2020] performs the object detection task. This model employs two CNN backbones, one to perform object segmentation and the other to produce a foreground segmentation map. The model combines these two backbones with a self-attention mechanism, which takes as input the object segmentation map and, as context, the foreground segmentation map. The self-attention mechanism highlights locations containing an object of interest. A similar model proposed by [Liang and Liu, 2021] feeds the input frame and its respective optical flow to two encoders with eight convolution layers. The decoder has eight convolution layers with up-sampling. The input of each decoder layer is the output of its respective attention module. The attention module is similar to a self-attention mechanism: it takes as input the concatenation of the outputs of the corresponding layer of the two encoders and the output of the previous layer of the decoder. Then, the model applies a sigmoid activation function to obtain the foreground segmentation. These models demonstrated that attention mechanisms are able to highlight relevant features for the FD task; hence, we include attention modules in our proposed models.