
In this section, we review some of the popular methods for automatic image annotation. We roughly divide them into four groups: parametric topic models, nonparametric mixture models, discriminative models and local nearest neighborhood based models.

Parametric Topic Models

The first group of models is based on topic models. Monay and Gatica-Perez [112] extend probabilistic latent semantic analysis [80], and Barnard et al. [7] extend latent Dirichlet allocation [18], to multi-modal data.

Unlike traditional topic models, these models require special treatment to capture the co-occurrence of image features and tags. It is often assumed that the latent space is shared by the two modalities; in other words, the same mixture of topics is shared by both image features and text features. Each annotated image is modeled as a mixture of topics, and each topic has a distinct distribution over the visual features and the tags, from which it generates the corresponding image features and tags. Typically, a multinomial distribution over the tag dictionary and Gaussian distributions over the visual features are employed. The mixture of topics and the distributions of the tags and the visual features given the hidden topics are the parameters that need to be inferred.
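To make the shared-topic assumption concrete, the following is a minimal sketch of how such a model could score tags for a new image once its parameters are known: the topic posterior is computed from the visual features, and the per-topic multinomials over tags are mixed accordingly. The function names, the diagonal-Gaussian assumption, and the toy parameter values are ours for illustration and are not taken from any of the cited models.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def tag_posterior(x, topic_prior, topic_means, topic_vars, topic_tag_probs):
    """p(tag | x) = sum_k p(k | x) * p(tag | k) for a shared-topic model."""
    # Unnormalised log p(k, x) for each topic k.
    log_joint = [
        math.log(pk) + gaussian_logpdf(x, mean, var)
        for pk, mean, var in zip(topic_prior, topic_means, topic_vars)
    ]
    # Normalise in a numerically stable way to obtain p(k | x).
    top = max(log_joint)
    weights = [math.exp(lj - top) for lj in log_joint]
    total = sum(weights)
    resp = [w / total for w in weights]
    # Mix the per-topic multinomials over the tag dictionary.
    n_tags = len(topic_tag_probs[0])
    return [
        sum(r * probs[t] for r, probs in zip(resp, topic_tag_probs))
        for t in range(n_tags)
    ]

# Toy example (hypothetical values): two topics, 2-D features, three tags.
TAGS = ["sky", "water", "tree"]
posterior = tag_posterior(
    x=[0.9, 0.1],
    topic_prior=[0.5, 0.5],
    topic_means=[[1.0, 0.0], [0.0, 1.0]],
    topic_vars=[[0.5, 0.5], [0.5, 0.5]],
    topic_tag_probs=[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
)
```

In this toy setting the query lies close to the first topic's mean, so the tag favoured by that topic ("sky") receives the highest probability. Parameter inference, which is the expensive part of these models, is not shown.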

Duygulu et al. [57] introduced a machine-translation-based method for automatic image annotation. The idea is to treat annotation as a translation process that translates image regions into words from the annotation vocabulary. It can also be understood as a topic model that has one hidden topic for every annotation tag.

Although topic models offer strong explanatory power, their predictive power is limited by the number of topics that can be included in the model. The complex parameter estimation process often limits the number of topics to the hundreds. Moreover, since the number of parameters grows linearly with the number of topics, the models overfit easily as more topics are used, and Bayesian parameter estimation or other forms of regularization are needed.

Nonparametric Mixture Models

The second group of methods models the joint distribution of the image features and the tags with mixture models. Examples of models in this class include the Continuous-space Relevance Model [85], the Cross-Media Relevance Model [94], and the Multiple Bernoulli Relevance Model [60]. Carneiro et al. [29] model the distribution over the image features of the entire image using Gaussian mixture models with a fixed number of mixture components per keyword. Other methods [85, 94, 60] model the distribution over image patches or segmented image regions, approximating it with kernel density estimation [128]. After the model is trained, the conditional probability of each keyword given the visual features is used to annotate new images.
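The kernel-density idea can be illustrated with a short sketch: each training image contributes a kernel centred on its features, and the probability of a keyword for a new image is the kernel-weighted average of smoothed per-image keyword indicators. This is a simplified illustration operating on whole-image feature vectors; the Gaussian kernel, the smoothing scheme, and all names and values below are our assumptions and do not reproduce any of the cited relevance models exactly.

```python
import math

def gaussian_kernel(x, xi, bandwidth):
    """Isotropic Gaussian kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def annotate_kde(x, train_feats, train_tags, vocab, bandwidth=1.0, smooth=0.1):
    """Estimate p(word | x) from a kernel density over training images."""
    # Kernel weight of every training image with respect to the query.
    kernel = [gaussian_kernel(x, f, bandwidth) for f in train_feats]
    total = sum(kernel)
    probs = {}
    for word in vocab:
        joint = 0.0
        for k_i, tags in zip(kernel, train_tags):
            # Smoothed Bernoulli estimate of p(word | training image i).
            present = 1.0 if word in tags else 0.0
            p_word = (present + smooth) / (1.0 + 2.0 * smooth)
            joint += k_i * p_word
        # Dividing by the kernel mass turns the joint into a conditional.
        probs[word] = joint / total
    return probs

# Toy example (hypothetical data): three training images, 2-D features.
train_feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
train_tags = [{"sky", "water"}, {"sky"}, {"tree"}]
probs = annotate_kde([0.0, 0.1], train_feats, train_tags,
                     vocab=["sky", "water", "tree"])
```

Because the query sits next to the first two training images, their tags dominate the estimate. Every prediction sums over all training examples, so the model has no fixed parameter count and its test cost grows with the size of the training set.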

Discriminative Models

In generative models, the parameters are estimated to maximize the likelihood of generating the training data, which does not necessarily yield good predictive performance. The third group of methods instead trains discriminative models, such as SVM [48], ranking SVM [72], and boosting [78], to predict tags from image features.

Local Nearest Neighborhood Based Models

The methods introduced above have achieved promising annotation results. However, their complex training processes limit the number of descriptors that can be incorporated. Recently proposed models such as the Joint Equal Contribution (JEC) model of Makadia et al. [104] and the TagProp model of Guillaumin et al. [75] rely on local nearest neighborhoods and work surprisingly well despite their simplicity. JEC assigns equal weight to different visual descriptors when computing the distances between data points, while TagProp carefully learns the weights for different visual descriptors to maximize the predictive performance. TagProp is the current state-of-the-art method for image annotation. Its success can be attributed to three elements: 1) its ability to incorporate a large number of different visual descriptors; 2) its capacity, which grows as the training data increases, alleviating the effect of sparse training tags; 3) its special treatment of rare tags.
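The difference between equal and learned descriptor weights can be sketched as follows. Distances from several descriptors are combined into one weighted sum, and tags are transferred from the nearest training images with weights that decay with distance. This is only a simplified illustration in the spirit of these methods: the actual label-transfer rule of JEC and the weight-learning procedure of TagProp are not reproduced, and the function names, data, and weight values are hypothetical.

```python
import math

def combined_distance(descs_a, descs_b, weights):
    """Weighted sum of per-descriptor Euclidean distances."""
    return sum(
        w * math.dist(a, b)
        for w, a, b in zip(weights, descs_a, descs_b)
    )

def neighbor_tag_scores(query, train_descs, train_tags, weights, k=2):
    """Score tags by distance-weighted votes from the k nearest images."""
    dists = [combined_distance(query, d, weights) for d in train_descs]
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Neighbour weights decay exponentially with distance and sum to one.
    raw = [math.exp(-dists[i]) for i in nearest]
    total = sum(raw)
    scores = {}
    for r, i in zip(raw, nearest):
        for tag in train_tags[i]:
            scores[tag] = scores.get(tag, 0.0) + r / total
    return scores

# Toy example (hypothetical data): each image has a 2-D and a 1-D descriptor.
train_descs = [([0.0, 0.0], [1.0]), ([0.2, 0.1], [0.0]), ([3.0, 3.0], [1.0])]
train_tags = [{"snow", "lake"}, {"lake", "boat"}, {"trees"}]
query = ([0.1, 0.0], [0.9])

# Equal weights for every descriptor, as in a JEC-like setting.
scores = neighbor_tag_scores(query, train_descs, train_tags, weights=[0.5, 0.5])
```

Replacing `weights=[0.5, 0.5]` with values fitted on training data would correspond to the learned-weight setting. Note that each query requires distances to every training image, which is the source of the O(n) test cost discussed next.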

Although TagProp achieves superior performance on several benchmark datasets, its O(n^2) training complexity and O(n) test complexity, where n is the number of examples in the training set, hinder its applicability to large-scale datasets. In this work, we introduce a new model that incorporates the three elements for successful annotation much more cheaply. Our model matches the performance of TagProp in terms of annotation precision and recall, but is much faster to train and test.

Most existing models assume that a complete list of relevant tags for each image is available at training time. In practice, however, obtaining such a list is either impractical or impossible for a large training set. It is much easier to tag an image with a few of its most prominent visual features than to obtain the complete list from a tag dictionary. To alleviate the need for complete labeling, several existing approaches [61, 136, 145] resort to semi-supervised learning to leverage unlabeled or weakly labeled data from the web. We adopt the same assumption of sparse training tags and incorporate partial supervision in our work.

[Figure 3.13 schematic. Training panel: incomplete user tags y and visual features x, the mappings B and W, and the predicted relevant tags, annotated with the terms $\|Wx - By\|_2^2$, $E[\|y_i - B\tilde{y}_i\|^2]_{p(\tilde{y}_i)}$, and $\|W\|_2^2$. Testing panel: visual features x, the mapping Wx, and the predicted relevant tags.]

Figure 3.13: Schematic illustration of FastTag. During training two linear mappings B and W are learned and co-regularized to predict similar results. At testing time, a simple linear mapping x → Wx predicts tags from image features.
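The test-time step named in the caption of Figure 3.13, the linear mapping x → Wx, can be sketched in a few lines. The sketch assumes that W has already been learned and that each row of W corresponds to one tag in the dictionary; the function name, the top-k selection rule, and the numeric values are hypothetical and chosen only for illustration.

```python
def predict_tags(W, x, vocab, top_k=3):
    """Rank tags by the scores of the linear mapping x -> Wx."""
    # One score per tag: the dot product of that tag's row of W with x.
    scores = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in ranked[:top_k]]

# Toy example (hypothetical values): three tags, 2-D visual features.
vocab = ["sky", "lake", "boat"]
W = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 0.2]]
predicted = predict_tags(W, x=[0.8, 0.4], vocab=vocab, top_k=2)
```

The cost of a prediction depends only on the feature dimension and the size of the tag dictionary, not on the number of training examples.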