As mentioned in section 5.1, one of the main challenges for CBNDVC detection is finding semantic similarity between videos. Knowledge obtained from the color histogram or other low-level data is difficult to interpret in terms of higher-level semantics. Machine learning techniques, however, let us use low-level features to detect the presence of high-level features, so that the mined results can be interpreted at a semantic level. Here, semantic concepts are events whose occurrence or absence a user can judge, for example Outdoor and Singing. Semantic concepts are also often referred to as high-level features because of their meaning to humans, and as visual concepts because they occur predominantly in the visual part of the video.
Semantic concept detection in video has been treated as a pattern recognition problem. Given a pattern $\vec{x}$ (e.g. color moment, color histogram, etc.) that is part of a shot $i$, the aim is to obtain a probability measure that indicates whether semantic concept $\omega_j$ (e.g. Outdoor) is present in
assumptions. Hence, it cannot form the basis of comparison between different methods. Therefore, probability is used as a confidence value, defined as $p(\omega_j \mid \vec{x})$ [133]. In practice, a Support Vector Machine (SVM) with Platt’s conversion method is used to obtain such a confidence value. SVM classifiers trained in this way for $\omega_j$ yield an estimate $p(\omega_j \mid \vec{x}, \vec{q})$, where $\vec{q}$ represents the parameters of the SVM.
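The conversion from a raw SVM output to a confidence value can be illustrated with a minimal sketch. It assumes the usual sigmoid form of Platt's method, p = 1 / (1 + exp(A f + B)), where f is the SVM decision value. The function names, the toy decision values, and the plain gradient-descent fit are illustrative assumptions; Platt's original procedure uses regularized targets and a more robust optimizer, and the text does not specify the implementation used here.

```python
import math

def platt_probability(score, a, b):
    """Map a raw SVM decision value to a confidence via a sigmoid."""
    return 1.0 / (1.0 + math.exp(a * score + b))

def fit_platt(scores, labels, lr=0.1, epochs=5000):
    """Fit the sigmoid parameters (a, b) by gradient descent on the log loss.

    Simplified stand-in for Platt's fitting procedure (assumption).
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for f, y in zip(scores, labels):
            p = platt_probability(f, a, b)
            # With z = a*f + b and p = 1/(1+exp(z)), d(log loss)/dz = y - p.
            grad_a += (y - p) * f
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical decision values for the concept "Outdoor" (1 = present).
scores = [-2.1, -1.3, -0.4, -0.1, 0.2, 0.9, 1.7, 2.4]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
confidence = platt_probability(1.2, a, b)  # confidence for a new keyframe
```

The fitted slope `a` comes out negative, so larger decision values map to confidences closer to 1.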
For many applications, the concept detectors for these semantic concepts are assumed to be binary classifiers with a threshold of 0.5, which differentiates between presence (0.5 or greater) and absence (less than 0.5) of a concept. As explained with the semantic distance example for video 1 and video 2 in section 5.1, it is desirable to consider the confidence value for a more accurate comparison. In our earlier work [22], we advocated that detected high-level features be represented by the posterior probabilities, or confidence values, associated with their concepts, in order to overcome the inaccuracies of binary classification for multimedia data mining. Employing confidence values for data mining also provides the additional information that a given shot contains a given concept with a given confidence, which helps match semantic similarity accurately.
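The loss of information under binary classification can be seen in a small sketch. The two keyframes, their confidence values, and the mean absolute difference used as a distance are all hypothetical, chosen only to illustrate the argument; they are not the example from section 5.1 or a distance measure taken from this work.

```python
def binarize(confidences, threshold=0.5):
    """Presence (1) if the confidence is at least the threshold, else absence (0)."""
    return [1 if p >= threshold else 0 for p in confidences]

def mean_abs_difference(u, v):
    """Illustrative distance between two equal-length vectors (assumption)."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

# Hypothetical confidences for the concepts (Outdoor, Singing).
keyframe_a = [0.51, 0.95]
keyframe_b = [0.98, 0.52]

# After thresholding at 0.5 both keyframes become [1, 1] and look identical.
binary_distance = mean_abs_difference(binarize(keyframe_a), binarize(keyframe_b))

# The confidence values keep the difference between the two keyframes.
confidence_distance = mean_abs_difference(keyframe_a, keyframe_b)
```

Here `binary_distance` is zero while `confidence_distance` is clearly positive, which is the distinction the confidence-based representation is meant to preserve.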
For CBNDVC detection, we need to match the semantic concepts along the time axis. Therefore, a video should be represented by a time-series of the semantic concepts it contains. A time-series is an ordered sequence of observations. Although the ordering is usually through time, particularly in terms of equally spaced time intervals, it may also be taken through other dimensions, such as space [142]. We represent a video as a time-series of its discovered concept confidence values. Each detected semantic concept confidence value is considered an observation, and its corresponding shot length is considered the time dimension. The CBNDVC detection problem is not limited to traditional content-level one-to-one keyframe matching, or to matching within a small window. It extends to semantically identical videos for which one-to-one matching may not be possible, because the videos can vary in length and have diverse content, leading to different numbers of shots. Thus, the video representation should be flexible enough to match concepts between videos that are not alike but are visually similar and semantically related.
We construct the time-series of semantic concept confidence values for a video as shown in Figure 5.3. For a given video, semantic concept detection is performed at the shot level. A video is first partitioned into a set of shots based on editing cuts and transitions between frames, and then a representative keyframe is extracted to represent each shot. Extracting a representative keyframe from the middle of a shot is relatively reliable for obtaining essentially similar keyframes from different near-duplicates. This mapping of video to keyframes reduces the number of frames that need to be analyzed.
Figure 5.3: Video representation as time-series of semantic concept confidence values
A video sequence, denoted as $V$, is first segmented into $N$ shots such that $V = \{s_1, s_2, \ldots, s_N\}$, where $s_i$ stands for the $i$th shot of $V$. Visual feature $X$
such as color (e.g. 225-dimensional grid color moment [134]) is extracted from each keyframe, thus $X = \{X_1, X_2, \ldots, X_N\}$. Let $C = \{c_1, c_2, \ldots, c_M\}$ be a set consisting of $M$ semantic concepts, with $c_k$ denoting the $k$th semantic concept. Also, let $D = \{d_1, d_2, \ldots, d_M\}$ be the set of classifiers corresponding to the $M$ semantic concepts, where $d_k$ denotes the classifier whose output is a confidence value for concept $c_k$. Given the visual feature $X_i$ extracted from shot $s_i$, the classifier $d_k$ outputs the posterior probability $P(c_k \mid X_i)$. This posterior probability represents the relevance or
Each of the $N$ shots contains detected confidence values of $M$ semantic concepts. Let $VTS = \{vts_1, vts_2, \ldots, vts_M\}$ be a set consisting of $M$ time-series of semantic concepts’ confidence values for the $N$ shots of the video $V$, where $vts_k = \{P(c_k \mid X_1), P(c_k \mid X_2), \ldots, P(c_k \mid X_N)\}$.
Finally, we construct the dataset as shown in Figure 5.3 using the aforementioned video representation as time-series of detected semantic concept confidence values. We assume that the detectors are independent of each other and that each detector emits a single real-valued confidence score for each shot.
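The construction of the M time-series from per-shot detector outputs can be sketched as follows. The function name `build_vts`, the dictionary layout, and the confidence values are hypothetical and serve only to illustrate the representation; the sketch assumes that each shot already carries one confidence value per concept, as produced by the M detectors.

```python
def build_vts(shot_confidences, concepts):
    """Turn N per-shot confidence records into M per-concept time-series.

    shot_confidences: list of N dicts, one per shot in temporal order,
                      mapping each concept name to its confidence value.
    concepts:         list of the M concept names.
    Returns a dict mapping each concept to its length-N series.
    """
    return {c: [shot[c] for shot in shot_confidences] for c in concepts}

# Hypothetical detector outputs for a video with N = 3 shots and M = 2 concepts.
concepts = ["Outdoor", "Singing"]
shot_confidences = [
    {"Outdoor": 0.9, "Singing": 0.1},  # shot s1
    {"Outdoor": 0.2, "Singing": 0.8},  # shot s2
    {"Outdoor": 0.7, "Singing": 0.6},  # shot s3
]

vts = build_vts(shot_confidences, concepts)
```

Each entry of `vts` is one time-series: the confidence of a single concept across the shots of the video, in shot order.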