Whilst traditional Re-ID methods have focused on describing person images with low-level features such as colour and texture, these feature types can be heavily influenced by variations in illumination or other visual characteristics. However, characteristics such as age, gender and general colour of shirt are significantly more invariant to these changes. Thus, several works in the literature have researched how best to incorporate such characteristics, typically referred to as attributes, into the Re-ID pipeline.

Figure 2.11: Example attributes and corresponding positive and negative examples. Images are taken from the VIPeR [59] data set with attribute labellings taken from the PETA [39] data set.

Layne et al. [101] define a mid-level attribute as a physical characteristic that is unambiguous in interpretation. Fifteen attributes are chosen, namely shorts, skirt, sandals, backpack, jeans, logo, v-neck, open-outwear, stripes, sunglasses, headphones, long-hair, short-hair, gender and carryingobject. As some of these attributes will only be present on certain parts of a person's body, the authors extract a 464-dimensional feature descriptor from each of six equal-sized stripes, resulting in a 2784-dimensional feature descriptor consisting of colour and texture information. The authors then train a Support Vector Machine (SVM) [161] to detect the presence of each of the fifteen attributes. Given that some attributes will be more reliable than others, due to their prevalence within the imbalanced data as well as how useful they are for discriminating between different individuals, the authors learn a weighted ℓ2-norm

authors found that the highest Re-ID matching results could be obtained when the attribute features were combined with low-level SDALF [43] features for matching. Khamis et al. [90] also combine attribute features with traditional hand-crafted features. The authors learn a distance metric that provides a discriminative projection into a joint appearance-attribute subspace. By jointly optimising the ranking loss and the attribute classification loss, the authors achieve some invariance to illumination and pose, and demonstrate improved matching rates over using appearance or attribute information alone. Su et al. [170] utilise the correlation between attributes, such as female and long hair, to allow attributes of the same person across multiple cameras to be embedded into a low-rank space. Using a low-rank space allows noisy attributes to be pruned, and missing attributes, such as those which are incorrectly labelled by a human annotator, to be rectified. However, such a task is computationally expensive, and therefore the authors incorporate a Multi-Task Learning (MTL) [18] algorithm. By considering Re-ID between multiple cameras as related tasks, the authors are able to use MTL to exploit features and attributes shared across multiple cameras, increasing efficiency and learning from multiple cameras simultaneously.
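As a rough illustration of how attribute scores might be combined with a low-level appearance distance for matching, the sketch below computes a weighted ℓ2 distance between two attribute score vectors and mixes it with a precomputed low-level distance. It is not the formulation used in any of the cited works: the function names, the convex-combination fusion rule, the mixing coefficient `beta` and the toy weights are all assumptions made for illustration only.

```python
import math


def weighted_attribute_distance(attrs_a, attrs_b, weights):
    """Weighted l2 distance between two attribute score vectors.

    attrs_a, attrs_b: per-attribute detector scores for two images.
    weights: one non-negative weight per attribute (hypothetical values;
             in practice these would be learnt from training data).
    """
    if not (len(attrs_a) == len(attrs_b) == len(weights)):
        raise ValueError("attribute vectors and weights must have equal length")
    return math.sqrt(
        sum(w * (a - b) ** 2 for a, b, w in zip(attrs_a, attrs_b, weights))
    )


def fused_distance(low_level_dist, attrs_a, attrs_b, weights, beta=0.5):
    """Convex combination of a low-level distance and the attribute distance.

    low_level_dist: distance already computed from low-level features.
    beta: hypothetical mixing coefficient in [0, 1]; an assumption of this
          sketch, not a parameter taken from the cited papers.
    """
    attr_dist = weighted_attribute_distance(attrs_a, attrs_b, weights)
    return (1.0 - beta) * low_level_dist + beta * attr_dist


# Toy example with three attributes; only the first attribute differs.
probe = [0.9, 0.1, 0.8]
gallery = [0.1, 0.1, 0.8]
weights = [1.0, 0.25, 4.0]

print(weighted_attribute_distance(probe, gallery, weights))
print(fused_distance(2.0, probe, gallery, weights, beta=0.5))
```

A larger weight makes disagreement on that attribute more costly, which is one simple way to reflect that some attribute detectors are more reliable or more discriminative than others.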

Shi et al. [164] discuss the problem of there being insufficient training data to train an attribute-based Re-ID framework capable of producing state-of-the-art results. To counter this issue, the authors propose using two fashion data sets, named Clothing-Attribute [25] and Colourful-Fashion [122]. However, given the large difference in visual characteristics between the fashion and Re-ID data sets, a model trained on these data sets would typically transfer poorly to Re-ID data sets. The authors therefore take a generative modelling approach based on the Indian Buffet Process (IBP) [60], and exploit attribute features at patch-level rather than image-level. Patch-based features are used in combination with Bayesian Adaptation to ensure that the learnt model can output a strong patch-level feature capable of being used within a wide range of different domains, including Re-ID.

More recently, CNNs have been used for attribute detection. In [171], the authors propose a three-stage attribute prediction network. In the first stage, the authors train a deep CNN to predict 105 attributes using the PETA [39] data set. The second stage involves fine-tuning the model on the MOTChallenge [104] data set, this time training with person ID labels and a triplet loss. The final stage fine-tunes the model on a combination of all previous training data sets, with the output of this stage being named by the authors as deep attributes. The authors extend this work in [174] by dividing the attributes into a set of 15 types, such as Age, Gender, CarryObject and HairStyle. Encoding attribute information in this way ensures contradictory information such as short hair and long hair cannot

co-exist, by enforcing only one positive attribute per type. The final attribute feature is therefore shortened from length 105 to a set of $K$ attributes belonging to $C$ types, $A = \{A_1, A_2, \ldots, A_C\}$, where $A_c = \{a^c_1, a^c_2, \ldots, a^c_{K_c}\}$ and $a \in \{0, 1\}$ represents the presence or otherwise of a specific attribute.
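The one-positive-per-type constraint can be illustrated with a small sketch. Given a flat list of attribute scores ordered type by type, the function below marks only the highest-scoring attribute within each type as positive. Selecting the positive by taking the maximum score is an assumption of this sketch, as are the function name `encode_one_per_type` and the toy type sizes; the cited work may enforce the constraint differently.

```python
def encode_one_per_type(scores, type_sizes):
    """Binarise attribute scores so each type has exactly one positive.

    scores: flat list of attribute scores, ordered type by type.
    type_sizes: number of attributes in each type (each must be >= 1);
                the sizes must sum to len(scores).
    Returns a list of per-type binary lists.
    """
    if any(size < 1 for size in type_sizes):
        raise ValueError("every type needs at least one attribute")
    if sum(type_sizes) != len(scores):
        raise ValueError("type_sizes must sum to the number of scores")

    encoded = []
    start = 0
    for size in type_sizes:
        group = scores[start:start + size]
        # Index of the highest score in this type (first one wins on ties).
        best = max(range(size), key=group.__getitem__)
        encoded.append([1 if k == best else 0 for k in range(size)])
        start += size
    return encoded


# Toy example: a hypothetical 3-way type followed by a 2-way type.
scores = [0.2, 0.7, 0.1, 0.6, 0.4]
print(encode_one_per_type(scores, [3, 2]))
```

Because exactly one entry per type is set, mutually exclusive attributes within a type can never both be positive in the encoded feature.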

Ye et al. [209] propose a body parts-based approach which combines colour, texture and attribute features. First, LOMO [115] features are extracted and used both to contribute to the person image's feature descriptor, and to train a LIBSVM [20] classifier for each attribute. Furthermore, a Sample-Specific SVM (SSSVM) [213] is used to weight each body part according to its contribution to Re-ID matching. Once the weights have been calculated, the weighted distances between corresponding parts of different images are fused, forming the final distance between two images. Zhao et al. [218] utilise video sequences to improve Re-ID matching rates. Feature descriptors are first extracted from individual video frames, and are then divided into groups of sub-feature descriptors corresponding to specific attributes. These sub-feature descriptors are weighted according to the confidence of the corresponding attribute prediction. Following weighting, the feature descriptors extracted from each frame are aggregated across the temporal dimension to produce the final sequence feature descriptor.
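The confidence-weighted temporal aggregation described for video sequences can be sketched as follows. Each frame descriptor is split into contiguous sub-descriptors, each sub-descriptor is scaled by the confidence of its attribute prediction, and the weighted frame descriptors are then pooled over time. Mean pooling, the contiguous group layout and the name `aggregate_sequence` are assumptions of this sketch; the source states only that descriptors are aggregated across the temporal dimension.

```python
def aggregate_sequence(frame_features, frame_confidences, group_sizes):
    """Pool confidence-weighted frame descriptors into one sequence descriptor.

    frame_features: list of T frame descriptors, each a flat list of D floats.
    frame_confidences: list of T lists, one confidence per attribute group.
    group_sizes: length of each contiguous sub-descriptor; must sum to D.
    Returns a list of D floats (mean over frames; pooling rule is assumed).
    """
    num_frames = len(frame_features)
    if num_frames == 0 or num_frames != len(frame_confidences):
        raise ValueError("need one confidence list per frame, and >= 1 frame")

    dim = sum(group_sizes)
    sequence = [0.0] * dim
    for feats, confs in zip(frame_features, frame_confidences):
        if len(feats) != dim or len(confs) != len(group_sizes):
            raise ValueError("frame shape does not match group_sizes")
        start = 0
        for size, conf in zip(group_sizes, confs):
            # Scale every element of this sub-descriptor by its confidence.
            for i in range(start, start + size):
                sequence[i] += conf * feats[i]
            start += size
    return [value / num_frames for value in sequence]


# Toy example: two frames, a 2-dim group followed by a 1-dim group.
frames = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
confidences = [[1.0, 0.5], [0.0, 0.5]]
print(aggregate_sequence(frames, confidences, [2, 1]))
```

In this toy case the first group of the second frame has zero confidence, so it contributes nothing to the pooled descriptor, which is the intended effect of down-weighting unreliable attribute predictions.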