Learning good fine-grained visual representations is essential for fine-grained recognition, image generation, and semantic segmentation. The methods proposed in this thesis admit many possible improvements and extensions, some of which are discussed below.
Although Generative Adversarial Networks (GANs) have shown remarkable success in various tasks, they still face challenges in generating high-quality images. In this thesis, our AttnGAN significantly improves the performance of GAN models for text-to-image generation through the proposed attention mechanisms. However, the AttnGAN still struggles to generate photo-realistic images on multi-class datasets (e.g., COCO). By carefully examining the generated samples, we observe that it is very difficult for the AttnGAN to capture geometric or structural patterns that occur consistently in some objects (e.g., animals). As discussed by Zhang et al. [124], one possible explanation is that relying heavily on convolution prevents GAN models from learning long-range dependencies. To tackle this challenge, they introduced a self-attention mechanism into convolutional GANs. The self-attention module has shown a strong ability to model long-range, multi-level dependencies across image regions. Unlike our fine-grained attention discussed in Chapter 4 (i.e., the attention over word embeddings within an input sequence), the self-attention proposed in [124] is attention over internal model states. In the future, we can explore integrating our fine-grained attention mechanism with the self-attention mechanism for text-to-image generation on multi-class datasets.
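To make "attention over internal model states" concrete, the following is a minimal sketch of self-attention applied to a flattened feature map, written in plain Python. It is a simplification under stated assumptions and not the module of [124]: the learned projections that would normally produce queries, keys, and values are replaced by the identity, and the names `self_attention` and `gamma` (a residual scale) are chosen here for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features, gamma=1.0):
    """Self-attention over N positions of a flattened feature map.

    features: list of N vectors (one per spatial position), each a list of C floats.
    gamma:    hypothetical residual scale; gamma = 0 returns the input unchanged.

    Assumption: queries, keys and values are the raw features themselves
    (identity projections), which keeps the sketch dependency-free.
    """
    channels = len(features[0])
    output = []
    for query in features:
        # Similarity of this position to every position, including distant ones.
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        weights = softmax(scores)
        # Weighted sum of all positions' features.
        attended = [
            sum(w * value[c] for w, value in zip(weights, features))
            for c in range(channels)
        ]
        # Residual connection: attended features are added to the input.
        output.append([gamma * a + x for a, x in zip(attended, query)])
    return output


if __name__ == "__main__":
    feature_map = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in self_attention(feature_map, gamma=0.5):
        print(row)
```

Because every position attends to every other position, the cost grows quadratically with the number of positions, which is one reason such a module is usually applied to feature maps of moderate resolution.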
Moreover, as the first GAN-inspired framework adapted specifically for the segmentation task, our proposed SegAN has produced superior segmentation accuracy. However, the comparison results also show that the SegAN model still has some drawbacks when segmenting the core and Gd-enhanced regions. While the SegAN can extract features at different levels, segmenting relatively small regions such as the core and Gd-enhanced regions may require more focus on pixel-level features. Thus, under some circumstances, previous methods that use a pixel-level loss could outperform the proposed SegAN on these small regions. One possible improvement for future work is to use different network architectures for segmenting different types of regions. Another drawback is that, although our model can be easily extended to semantic segmentation tasks with many label classes, the computational cost can be quite high when the number of classes is large. For instance, in a task with m different classes, to achieve the best performance we can build m S1-1C models (i.e., one segmentor and one critic per class) to generate segmentation masks for the m classes; such a model, however, becomes computationally expensive when m is very large. In future work, we can investigate variants of the SegAN architecture that reduce computational cost without sacrificing accuracy.
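The linear growth in cost with m can be illustrated with a back-of-the-envelope sketch. The function name `s1_1c_ensemble_cost` and the per-network parameter counts below are hypothetical values chosen for illustration; they are not measurements of the SegAN networks.

```python
def s1_1c_ensemble_cost(num_classes, segmentor_params, critic_params):
    """Return (number of networks, total parameters) for m S1-1C models.

    Each class gets its own segmentor and its own critic, so both
    quantities grow linearly with the number of classes m.
    """
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    networks = 2 * num_classes  # one segmentor + one critic per class
    total_params = num_classes * (segmentor_params + critic_params)
    return networks, total_params


# Hypothetical per-network sizes, used only to show the trend.
SEGMENTOR_PARAMS = 3_000_000
CRITIC_PARAMS = 1_000_000

if __name__ == "__main__":
    for m in (1, 4, 20, 80):
        networks, params = s1_1c_ensemble_cost(m, SEGMENTOR_PARAMS, CRITIC_PARAMS)
        print(f"m={m:>2}: {networks:>3} networks, {params / 1e6:.0f}M parameters")
```

Training and inference time scale in the same way, since every class requires its own forward and backward passes.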
In addition, building a bridge between fine-grained image recognition and generation with Generative Adversarial Networks is a promising direction. The large number of categories and the lack of training data are the main factors that make fine-grained categorization challenging. Thus, one intuitive bridge is to apply GANs to produce more training images for fine-grained classification. Another option is to use GANs to generate object parts with different poses and viewpoints.
Last but not least, as general frameworks, our multimodal deep neural network in Chapter 3 and SegAN in Chapter 5 are not limited to the medical domain. We can investigate their potential for general image classification and segmentation tasks in the future.
References
[1] R. Appel, T. Fuchs, P. Dollár, and P. Perona. Quickly boosting decision trees pruning underachieving features early. In ICML, volume 28, pages 594–602, 2013. 44
[2] Pablo Andrés Arbeláez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marqués, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014. 18,28
[3] Hossein Azizpour and Ivan Laptev. Object detection using strongly-supervised deformable part models. In ECCV, 2012. 26,27,29,30,35
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014. 12
[5] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: A review and new perspectives. PAMI, 35(8):1798–1828, 2013. 3
[6] Thomas Berg and Peter N. Belhumeur. POOF: part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, 2013. 5,16,26,32,33
[7] Lubomir D. Bourdev and Jitendra Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, 2009. 6
[8] Steve Branson, Grant Van Horn, Pietro Perona, and Serge Belongie. Improved bird species recognition using pose normalized deep convolutional nets. In BMVC, 2014. 6,7,16,23,26
[9] Andrew Brock, Theodore Lim, J. M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2017. 4,10
[10] João Carreira and Cristian Sminchisescu. CPMC: automatic object segmentation using constrained parametric min-cuts. TPAMI, 2012. 18
[11] Yuning Chai, Victor S. Lempitsky, and Andrew Zisserman. Symbiotic segmentation and part localization for fine-grained categorization. In ICCV, 2013. 5,6,16,26,32,33,34
[12] S. K. Chang, Y. N. Mirabal, et al. Combined reflectance and fluorescence spectroscopy for in vivo detection of cervical pre-cancer. Journal of Biomedical Optics, 10(2):024031, 2005. 8,49,52,55,56
[13] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016. 10,58
[14] Dan C. Ciresan, Alessandro Giusti, Luca Maria Gambardella, and Jürgen Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. In MICCAI, pages 411–418, 2013. 9
[15] Jifeng Dai, Kaiming He, and Jian Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015. 24
[16] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005. 3
[17] Jia Deng, Jonathan Krause, and Fei-Fei Li. Fine-grained crowdsourcing for fine-grained recognition. In CVPR, 2013. 5,16
[18] Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015. 4,10,11,12,58
[19] T. DeSantis, N. Chakhtoura, et al. Spectroscopic imaging as a triage test for cervical disease: a prospective multicenter clinical trial. Journal of Lower Genital Tract Disease, 11(1):18–24, 2007. 8,49,52,55,56
[20] Carl Doersch. Tutorial on variational autoencoders. arXiv:1606.05908, 2016. 62
[21] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015. 26
[22] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, June 2010. 13
[23] Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Kumar Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. From captions to visual concepts and back. In CVPR, 2015. 67
[24] Ryan Farrell, Om Oza, Ning Zhang, Vlad I. Morariu, Trevor Darrell, and Larry S. Davis. Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In ICCV, 2011. 6
[25] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010. 6
[26] Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, pages 4476–4484, 2017. 4,5
[27] Jon Gauthier. Conditional generative adversarial networks for convolutional face generation. Technical Report, 2015. 59
[28] Efstratios Gavves, Basura Fernando, Cees G. M. Snoek, Arnold W. M. Smeulders, and Tinne Tuytelaars. Fine-grained categorization by alignments. In ICCV, 2013. 6,19
[29] Ross B. Girshick. Fast R-CNN. In ICCV, 2015. xii,18,20,22
[30] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 6
[31] G. Gkioxari, R. Girshick, and J. Malik. Actions and attributes from wholes and parts. In ICCV, 2015. 6
[32] Ian J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. CoRR, abs/1701.00160, 2017. 10,58
[33] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville. Deep Learning. Adaptive computation and machine learning. MIT Press, 2016. 3,4
[34] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014. 9,11,59
[35] Christoph Göring, Erik Rodner, Alexander Freytag, and Joachim Denzler. Nonparametric part transfer for fine-grained recognition. In CVPR, 2014. xii,5,6,16,18,19,25,26,31,32,33,34
[36] Stephen Gould, Richard Fulton, and Daphne Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, pages 1–8, 2009. 13
[37] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015. 11
[38] Karol Gregor and Yann LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. CoRR, abs/1006.0448, 2010. 23
[39] Bharath Hariharan, Pablo Andrés Arbeláez, Ross B. Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015. 6
[40] Mohammad Havaei, Axel Davy, David Warde-Farley, Antoine Biard, Aaron Courville, Yoshua Bengio, Chris Pal, Pierre-Marc Jodoin, and Hugo Larochelle.
Brain tumor segmentation with deep neural networks. Medical Image Analysis, 35:18–31, 2017. 4,12,14,91
[41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 2015. 22
[42] Xiaodong He, Li Deng, and Wu Chou. Discriminative learning in sequential pattern recognition. IEEE Signal Processing Magazine, 25(5):14–36, 2008. 67
[43] R. Herrero, M. Schiffman, C. Bratti, et al. Design and methods of a population-based natural history study of cervical neoplasia in a rural province of Costa Rica: the Guanacaste project. Rev Panam Salud Publica, 1:362–375, 1997. 39
[44] Devon R. Hjelm, Vince D. Calhoun, Ruslan Salakhutdinov, et al. Restricted boltzmann machines for neuroimaging: An application in identifying intrinsic networks. NeuroImage, 96:245–260, 2014. 9
[45] Gary B. Huang, Honglak Lee, and Erik G. Learned-Miller. Learning hierarchical representations for face verification with convolutional deep belief networks. In CVPR, 2012. 23
[46] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. 67
[47] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015. 50
[48] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. 4, 10, 11, 12,14
[49] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In NIPS, 2015. 5,32,33
[50] Q. Ji, J. Engel, and E. Craine. Classifying cervix tissue patterns with texture analysis. Pattern Recognition, 33(9):1561–1574, 2000. 42
[51] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 26
[52] Biing-Hwang Juang, Wu Chou, and Chin-Hui Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 5(3):257–265, 1997. 67
[53] Konstantinos Kamnitsas, Christian Ledig, Virginia FJ Newcombe, Joanna P Simpson, Andrew D Kane, David K Menon, Daniel Rueckert, and Ben Glocker. Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical Image Analysis, 36:61–78, 2017. 12,14,91
[54] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset for fine-grained image categorization. In CVPR Workshop, 2011. 5
[55] E. Kim and X. Huang. A data driven approach to cervigram image analysis and classification. In Color Medical Image Analysis, Lecture Notes in Computational Vision and Biomechanics, volume 6, pages 1–13, 2013. 41,42
[56] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014. 11,62
[57] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshop, 2013. 5
[58] Jonathan Krause, Timnit Gebru, Jia Deng, Li-Jia Li, and Fei-Fei Li. Learning features and parts for fine-grained recognition. In ICPR, 2014. 7
[59] Jonathan Krause, Hailin Jin, Jianchao Yang, and Fei-Fei Li. Fine-grained recognition without part annotations. In CVPR, 2015. 5,6,7,16,26,32,33
[60] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 4,9,21,26,48
[61] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016. 62
[62] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic
single image super-resolution using a generative adversarial network. In CVPR, 2017. 4,10,11,12
[63] W. Li, J. Gu, D. Ferris, and A. Poirson. Automated image analysis of uterine cervical images. In SPIE Medical Imaging, 2007. 42
[64] Zhongyu Li, Xiaofan Zhang, Henning Müller, and Shaoting Zhang. Large-scale retrieval for medical image analytics: A comprehensive review. Medical Image Analysis, 43:66–84, 2018. 4
[65] Di Lin, Xiaoyong Shen, Cewu Lu, and Jiaya Jia. Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In CVPR, 2015. 4,5, 6,7,16,23,25,26,27,29,30,31,32,33,34,35
[66] Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Ian Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, pages 3194–3203, 2016. 4,12,14
[67] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 10,69,70
[68] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In ICCV, 2015. 5,32,33
[69] Yen-Liang Lin, Vlad I. Morariu, Winston Hsu, and Larry S. Davis. Jointly optimizing 3d model fitting and fine-grained classification. In ECCV, 2014. 5
[70] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015. 4,12,13,14
[71] Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408, 2016. 10,13,93
[72] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013. 5
[73] Elman Mansimov, Emilio Parisotto, Lei Jimmy Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. In ICLR, 2016. 11,12
[74] Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (brats). TMI, 34(10):1993–2024, 2015. 87,89
[75] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014. 59
[76] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In ICML, pages 689–696, 2011. 7, 9,47
[77] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017. xi,11,69,75
[78] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In ICCV, pages 1520–1528, 2015. 4,12,13
[79] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In ICML, 2017. 4,10,58
[80] T. Ojala, M. Pietikäinen, and D. Harwood. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition, 29:51–59, 1996. 3
[81] Sérgio Pereira, Adriano Pinto, Victor Alves, and Carlos A Silva. Brain tumor segmentation using convolutional neural networks in mri images. TMI, 35(5):1240–1251, 2016. 12,14,91
[82] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. 4,10,11,12,76,94
[83] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, 2014. 4,9,48
[84] Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. In NIPS, 2016. xi,10,11,12,60,61,62,69,75
[85] Scott Reed, Zeynep Akata, Bernt Schiele, and Honglak Lee. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016. 72
[86] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text-to-image synthesis. In ICML, 2016. xi,4,10,11,12,58,60,61,62,69,75
[87] Scott E. Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gomez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. In ICML, 2017. 11
[88] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 24,25
[89] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015. 4,12,14,91
[90] Holger R. Roth, Le Lu, Ari Seff, Kevin M. Cherry, Joanne Hoffman, Shijun Wang, Jiamin Liu, Evrim Turkbey, and Ronald M. Summers. A new 2.5d representation for lymph node detection using random sets of deep convolutional neural network observations. In MICCAI, pages 520–527, 2014. 9
[91] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015. 26,65
[92] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NIPS, 2016. 10,11,12,69
[93] R. Sankaranarayanan, L. Gaffikin, M. Jacob, et al. A critical assessment of screening methods for cervical neoplasia. International Journal of Gynecology and Obstetrics, 89:4–12, 2005. 38
[94] Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997. 65
[95] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014. 24
[96] Ya-Fang Shih, Yang-Ming Yeh, Yen-Yu Lin, Ming-Fang Weng, Yi-Chang Lu, and Yung-Yu Chuang. Deep co-occurrence feature learning for visual object recognition. In CVPR, pages 7302–7311, 2017. 4,5
[97] Hoo-Chang Shin, Matthew Orton, et al. Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4d patient data. TPAMI, 35(8):1930–1943, 2013. 9
[98] Marcel Simon and Erik Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV, 2015. 5,7,26,32,33
[99] D. Song, E. Kim, X. Huang, et al. Multi-modal entity coreference for cervical dysplasia diagnosis. TMI, 34(1):229–245, 2015. xiii,8,9,38,40,42,48,49,50,52,55,56
[100] Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep boltzmann machines. Journal of Machine Learning Research, 15(1):2949–2980, 2014. 7,9,47
[101] Heung-Il Suk, Seong-Whan Lee, and Dinggang Shen. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage, 101:569–582, 2014. 9
[102] Heung-Il Suk and Dinggang Shen. Deep learning-based feature representation for AD/MCI classification. In MICCAI, pages 583–590, 2013. 9
[103] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016. 65
[104] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszar. Amortised map inference for image super-resolution. In ICLR, 2017. 10
[105] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. In ICLR, 2017. 10
[106] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014. 23
[107] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. IJCV, 2013. 6,18,20,28,29,30
[108] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. In NIPS, 2016. 10,11,58
[109] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv:1706.03762, 2017. 12
[110] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011. 5,6,25,30,69,70
[111] J. M. Walboomers, M. V. Jacobs, M. M. Manos, et al. Human papillo-