Cha et al. (2007) discuss a bottleneck created by a lack of textual data that results in poor search and poor recommendation engines. This is defined by other authors as
the semantic gap. Enser (2008) describes the semantic gap as the distance between the information that can be extracted automatically from a visual resource and the way humans interpret the content of that resource. He argues that a description of the content of a visual resource lies in the semantic inferences represented in textual metadata rather than in perceptual features. Perceptual features are indexed using content-based search and textual data using text-based search; neither method indexes the high-level semantics required for image or video search. The information that can be retrieved from low-level features cannot be transformed into high-level features that represent objects in the image or video. Moreover, users formulate queries using high-level semantics, not low-level features, so what can be retrieved does not match what is being queried (Hare et al., 2006). Enser et al. (2007) found that the majority of terms people use to describe an image were not present in the image, indicating a practice of subjective interpretation of the content. Tjondronegoro et al. (2009) describe bridging the gap between low-level visual features and high-level semantic text. Bai et al. (2008) suggest mapping high-level semantic concepts to low-level features, e.g., celebrity name to person. Enser (2008) states that, at the time of writing, most attempts at bridging the semantic gap had not successfully addressed the problem of distance between object labelling and high-level reasoning.
The most active research in this area focuses on training algorithms to extract high-level features from online video in order to improve the precision and recall of video search. Morsillo et al. (2010) emphasise that current methods of automatic indexing are not transferable to the web and to large-scale corpora of UGC such as YouTube. The majority of concept detection algorithms are trained on small, professionally annotated corpora, predominantly TRECVID, whereas YouTube is a large, user-annotated corpus. TRECVID (Smeaton et al., 2009) is a collection of
professionally annotated videos, mainly from the news genre but, since 2010, increasingly from other media outlets, namely the BBC, so that the dataset more closely resembles web video (Over, 2014). An alternative dataset has been proposed that tries to emulate more closely the video content found on YouTube. Loui et al. (2007) created a benchmark dataset, the ‘Kodak consumer video benchmark dataset’, of annotated UGC videos. Videos are categorised by semantic concepts. There are two datasets: one containing videos uploaded by participants in a Kodak user study (1358 videos) and one of YouTube videos (4539 videos). Videos are annotated with predefined concepts rather than free natural language tags. The authors create an ontology consisting of seven categories, with 25 concepts for each category. However, little research has been published that uses the Kodak dataset. The TRECVID dataset annotates individual shots, whereas YouTube annotations tend to refer to the whole video.
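The precision and recall referred to above can be made concrete with a small worked example. The sketch below is illustrative only: the function name and the video identifiers are hypothetical and are not taken from any of the studies cited. It computes both measures for a single query, given the set of videos a search returned and the set judged relevant.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the results of one query.

    precision = relevant videos retrieved / all videos retrieved
    recall    = relevant videos retrieved / all relevant videos
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical identifiers: four videos returned, five judged relevant,
# two of which appear in the results.
retrieved = ["v1", "v2", "v3", "v4"]
relevant = ["v2", "v4", "v7", "v8", "v9"]
precision, recall = precision_recall(retrieved, relevant)
```

In this invented case half of the returned videos are relevant, while three of the five relevant videos are never returned. The latter corresponds to relevant videos that remain unfound because their textual data does not match the query.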
Morsillo et al.'s (2010) experiments with YouTube, whilst offering some success, still generated only basic-level vocabulary, and did so at a computational cost that is impractical for a home user. They acknowledge that video is more difficult to index by concept detection, as many single shots make up one video and content comes from audio as well as visuals. Jaimes et al. (2003) used speech from videos to create keywords to enhance the low-level visual features automatically extracted. Keywords are grouped into perceptual concepts based on the five senses. Min et al. (2003) also discuss a method of turning the audio commentary of a video into searchable keywords. Whilst this method is useful for extracting high-level semantic concepts, the problem lies in how reliable the transcribing software is. Ulges et al. (2008a) propose a system that learns from the low quality data available from YouTube. Although they improved annotations for a selection of videos, the
annotations were still at a basic descriptive level. Their approach is to use these methods to enhance textual data for existing text-based search rather than to categorise videos in semantic categories for content-based search. Despite research into content-based video search, the most popular method for users to find video online is text-based search (Halvey and Jose, 2012), yet all research agrees that query by text is currently inadequate because of insufficient textual data and poor descriptions associated with online video. What is not agreed is which method should be used either to replace query by text or to improve it.
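The idea of turning a video's speech into searchable keywords can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the method of Jaimes et al. (2003) or Min et al. (2003): it assumes a transcript is already available as plain text, uses a small hand-picked stopword list, and ranks the remaining words by frequency. The function name, stopword list and transcript are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"the", "and", "is", "of", "a", "an", "to", "in", "it", "that"}


def transcript_keywords(transcript, top_n=5):
    """Return the top_n most frequent non-stopword terms in a transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical transcript of a short sports clip.
transcript = (
    "The goalkeeper saves the penalty and the crowd cheers. "
    "The goalkeeper is the hero of the match, and the penalty is forgotten."
)
keywords = transcript_keywords(transcript, top_n=2)
```

Any word the transcription software misrecognises is counted exactly as transcribed, so in a sketch like this the quality of the keywords depends directly on the reliability of the transcript.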
With users as content producers as well as content consumers, vast quantities of videos are being published with no editorial control. There is no control over metadata, resulting in poor labelling and inadequate descriptions (Morsillo et al., 2010). Bridging the semantic gap by creating improved annotations for videos is a lively research area, with a number of different approaches, both manual and automatic. Automatic methods concentrate on improving concept detection algorithms so that they are able to extract high-level visual features and high-level semantic information. Manual methods look at employing or encouraging people to annotate videos.
Manual annotation is expensive and difficult to use for large-scale repositories like YouTube. Shih-Fu et al. (2007), Tjondronegoro and Spink (2008), Ulges et al. (2008a) and Morsillo et al. (2010) argue that professionally annotated datasets like TRECVID (Smeaton et al., 2009) are inadequate because the categories professionals use do not correspond to the natural language users employ in search. The authors found that search terms used in YouTube did not correspond to the TRECVID semantic categories.
They argue that videos with bad metadata are invisible to users, which explains why the majority of videos on YouTube are difficult to find. Just because PGC is professional content does not mean it is professionally annotated. Although Halvey and Keane (2007) found that promoted videos have more descriptive information, the dominance of PGC in YouTube is a result of promotion rather than improved description or textual data. Higher quantity does not necessarily equal higher quality. There is to date no research that analyses the semantic vocabulary of this textual data to ascertain whether it is of an adequate quality. PGC might not fully meet the user's search goal, yet videos that could satisfy their requirements are described poorly and therefore remain unfound.