First, we show that multi-represented similarity search is usually more effective than similarity search using only one representation. In addition, we show in this subsection that weighting the different representations yields a significant benefit compared to unweighted multi-represented similarity detection.
In a first experiment, we performed video similarity search. As a setup step, we picked 50 query videos from our database and, for each of them, manually selected a set of similar videos. We compared the recall and precision achieved with the best single representation to the query results computed using the ε-neighborhood and entropy weighting functions. Furthermore, we investigated the performance of our weighting strategies on three summarization techniques, namely video signatures (ViSig), K-Means, and expectation maximization (EM). The results of this comparison are depicted in Figure 6.4. For all evaluated summarization techniques, we observed a significant performance improvement when using multiple representations in comparison to the best single representation. Furthermore, our weighted approach leads to better results on all considered summarization techniques.
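The evaluation measure used above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the experiments; the function names, the video identifiers, and the default recall levels are assumptions chosen for the example.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall of the top-k entries of a ranked result list."""
    relevant = set(relevant_ids)
    hits = sum(1 for vid in ranked_ids[:k] if vid in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at_recall_levels(ranked_ids, relevant_ids, levels=(0.33, 0.66, 1.0)):
    """Precision at the first rank where each recall level is reached.

    A level that is never reached is reported as None.
    """
    relevant = set(relevant_ids)
    result = {}
    for level in levels:
        result[level] = None
        hits = 0
        for rank, vid in enumerate(ranked_ids, start=1):
            if vid in relevant:
                hits += 1
            if relevant and hits / len(relevant) >= level:
                result[level] = hits / rank
                break
    return result


# Hypothetical query result: "a", "b", "c" are the manually selected similar videos.
ranking = ["a", "x", "b", "y", "c"]
ground_truth = {"a", "b", "c"}
curve = precision_at_recall_levels(ranking, ground_truth)
```

Averaging such per-query values over all query videos gives one precision vs. recall curve per method.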
Using the same test setup as before, we compared standard combination techniques for multi-represented objects to our weighted combination method; the results are shown in Figure 6.5. We investigated the performance of commonly used combination techniques such as product, sum, minimum, and maximum. In most cases, our weighted approach is more effective than the standard combination algorithms. In particular, the ε-neighborhood and entropy weighting methods show good precision and recall values for all considered summarization strategies.

Figure 6.4: Precision vs. recall for different summarization techniques on the best single representation and the two best weighting functions. Panels: (a) ViSig, (b) K-Means, (c) EM. Curves: single, combi (entropy), combi (epsilon).

Figure 6.5: Precision vs. recall for different summarization techniques on standard combination strategies and the proposed weighted combination strategies. Panels: (a) ViSig, (b) K-Means, (c) EM. Curves: epsilon, entropy, quality, support, product, sum, min, max.
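The standard combination rules named above (product, sum, minimum, maximum) can be sketched as follows. The "weighted" rule shown here is a generic normalized weighted sum included only for contrast; it is an assumption of this example and not necessarily the weighted combination proposed in this chapter. The function name and the sample distances are hypothetical.

```python
import math


def combine(distances, rule="sum", weights=None):
    """Combine the per-representation distances of one (query, object) pair."""
    if not distances:
        raise ValueError("need at least one distance")
    if rule == "sum":
        return sum(distances)
    if rule == "product":
        return math.prod(distances)
    if rule == "min":
        return min(distances)
    if rule == "max":
        return max(distances)
    if rule == "weighted":
        # Assumption: one non-negative weight per representation,
        # normalized so that the weights sum to one.
        if weights is None or len(weights) != len(distances):
            raise ValueError("need one weight per representation")
        total = sum(weights)
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        return sum(w * d for w, d in zip(weights, distances)) / total
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical distances of one object in two representations (e.g. audio, video).
d = [0.2, 0.5]
combined = {rule: combine(d, rule) for rule in ("sum", "product", "min", "max")}
weighted = combine(d, "weighted", weights=[3.0, 1.0])
```

Unlike the four fixed rules, the weighted variant lets a representation that is more reliable for a given object contribute more to the combined distance.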
6.4.2 Multi-Represented Similarity Search Applications
In the following, we identify two common applications that may pose different challenges to multimedia similarity search techniques and propose the most appropriate weighting functions for these tasks.
Application 1: Finding Similar Videos. Our first application addresses copyright issues. In order to detect plagiarism, we want to find videos that are similar to a given query video. We argue that in this application, similarity should be considered more locally because several representations are usually almost identical. This is the case if, e.g., the image or audio part of a video is encoded in different resolutions or sampling rates. To distinguish these videos from the rest of the database, it is necessary to examine a small neighborhood. Otherwise, we would obtain results which are similar but do not violate the copyright.
The ε-neighborhood weighting function follows this idea and can be applied successfully to this task, as shown in Figure 6.4 and Figure 6.5.
Application 2: Finding Videos of a Given Artist. In our second application, we address content-based multimedia retrieval in music video databases. Given a query video of a specific artist, we want all videos of this artist in our database. Obviously, in this application, a more global notion of similarity is necessary.
In order to demonstrate this idea, we selected a set of 20 query videos associated with different artists. For each video in our query set, we extracted all videos of the same artist from our database. The results of our artist search are depicted in Figure 6.6. In all experiments, the entropy-based weighting function outperforms the ε-neighborhood approach. This can be explained by the fact that the entropy weighting function takes all distances into account, in contrast to the local character of the ε-neighborhood function.
6.5 Conclusions
Similarity search in multimedia databases can be improved by using multiple representations of the multimedia objects. When searching for similar videos, one can e.g. use audio features such as rhythm and pitch as well as video features such as color histograms and textures.
Figure 6.6: Precision vs. recall for different weighting strategies when performing similarity search for videos of the same artist. Panels: (a) ViSig, (b) K-Means, (c) EM. Curves: epsilon, entropy.
In this chapter, we presented a method for effective similarity search in multimedia databases that takes multiple representations of the database objects into account. In particular, we proposed several weighting functions for summarization vectors of different representations of each database object. Our concepts are independent of the underlying summarization method and compute a weight for each summarization vector of each representation for each object separately. Using these weighting factors, we further showed how well-known distance measures for non-multi-represented, multi-instance objects can be adapted to multi-represented objects. In our experiments, we evaluated the proposed methods and showed the benefits of our approach.
6 Effective Similarity Search in Multimedia Databases using Multiple Representations
Part III
Data Mining Techniques
Chapter 7
Using Uncertainty to Provide Privacy Preservation for Distributed Clustering
Privacy preservation is a new area in data mining research that deals with obtaining valid data mining results without learning the underlying data. In this chapter, we introduce a novel method for clustering distributed data that achieves an arbitrary level of privacy preservation by obfuscating the original data through aggregation into mixtures of Gaussians. This chapter starts with an introduction to privacy preservation for distributed clustering in Section 7.1. In Section 7.2, we survey related work on distributed and parallel clustering. In Section 7.3, we describe our novel privacy-preserving clustering algorithm that describes the original data by uncertain models. Section 7.4 provides an extensive experimental evaluation of the performance and the accuracy of the proposed approach. In Section 7.5, we summarize this chapter.
7.1 Introduction
As discussed in Chapter 1, advanced applications often have to perform data mining tasks on distributed data under privacy preservation requirements. A good distributed data mining framework performs data mining operations based on the type and the availability of the distributed resources. As suggested in [PK03], a distributed data mining solution consists of the following steps. First, a data mining algorithm is applied locally to each of the k sites separately and independently. The results are k local sets of patterns called local models. Second, the local models are transferred to a central server, which combines them to generate a global model. Third, the global model may optionally be sent back to the local sites.
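The three steps can be sketched generically as follows. This is only an illustration of the data flow, assuming that the local mining algorithm and the merge operation are passed in as functions; the names `distributed_mining`, `summarize`, and `merge_summaries` are hypothetical, and the toy local model (count and mean of one-dimensional values) merely stands in for a real set of patterns.

```python
def distributed_mining(sites, mine_locally, merge_models):
    """Run the three-step scheme on a list of per-site data sets."""
    # Step 1: each site applies the mining algorithm to its own data only.
    local_models = [mine_locally(site_data) for site_data in sites]
    # Step 2: only the local models (not the raw data) reach the server,
    # which combines them into a global model.
    global_model = merge_models(local_models)
    # Step 3 (optional): the global model is returned so that it can be
    # sent back to the local sites.
    return global_model, local_models


def summarize(values):
    """Toy local model: number of objects and their mean."""
    return (len(values), sum(values) / len(values))


def merge_summaries(models):
    """Toy global model: pooled count and count-weighted mean."""
    total = sum(n for n, _ in models)
    return (total, sum(n * m for n, m in models) / total)


# Two hypothetical sites holding one-dimensional data.
global_model, local_models = distributed_mining(
    [[1.0, 3.0], [5.0]], summarize, merge_summaries
)
```

The server never sees the values 1.0, 3.0, and 5.0 themselves, only one summary per site.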
The data mining technique we address in this chapter is clustering, which aims at partitioning the data objects into distinct groups (clusters) while maximizing the intra-cluster similarity and minimizing the inter-cluster similarity. Many clustering algorithms for the centralized approach have been proposed so far using different clustering notions, e.g. distribution-based (or model-based), center-based, or density-based (cf. [HK06] for an overview). In general, all those methods are applicable to a distributed solution as long as they produce a local model in Step 1 of the distributed data mining process that is as compact as possible but provides as much information as needed for building a global model in Step 2. Unfortunately, many traditional clustering algorithms produce a clustering that cannot easily be described by a simple prototype. For example, density-based clustering [EKSX96] detects clusters of arbitrary shape. However, describing a cluster with a complex shape might become quite expensive, possibly causing large transfer volumes. Thus, a local model should describe each cluster by a "suitable" prototype. Obviously, such prototyping should also meet privacy constraints. We argue that the expectation maximization (EM) clustering algorithm provides exactly such prototypes. EM describes the dataset by a set of Gaussian distributions, each consisting of the cluster center (mean) and a covariance matrix. The latter describes the density of points around the center of the cluster. If certain constraints are met, privacy is preserved because the exact values of the data objects cannot be retrieved from the distribution.
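To illustrate why a Gaussian prototype hides the exact data values, the following sketch computes the mean and the maximum-likelihood covariance of a single cluster. It is not an EM implementation: it assumes that the points of one cluster are already known, and the function name and the sample points are hypothetical.

```python
def gaussian_prototype(points):
    """Return (mean, covariance) of a cluster given as equal-length vectors."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate a covariance")
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    # Maximum-likelihood estimate (division by n, not n - 1).
    cov = [
        [
            sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov


# Two different one-dimensional clusters that share the same prototype.
cluster_a = [[-1.0], [-1.0], [2.0]]
cluster_b = [[1.0], [1.0], [-2.0]]
proto_a = gaussian_prototype(cluster_a)
proto_b = gaussian_prototype(cluster_b)
```

Both clusters yield mean 0 and variance 2, so a receiver of the prototype cannot tell which of the two data sets produced it. How much protection this gives in practice depends on the constraints mentioned above, e.g. on how many objects a cluster contains.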
We propose a novel distributed clustering algorithm called DMBC (Distributed Model-Based Clustering). The local models are acquired using EM clustering. Since the necessary number of clusters may vary strongly from site to site, DMBC automatically determines a suitable number of local clusters based on privacy and performance constraints. The constraints control the maximum transfer volume allowed from an individual site, ensure that each local data object is described as well as possible, and prohibit the transfer of clusters that could lead to a violation of privacy. To combine the local clusters at the central server, the aggregation step of DMBC can employ two variants of parametrization to derive either a global clustering with k clusters or an arbitrary set of clusters that are considerably different from each other. In both cases, DMBC efficiently derives a meaningful global Gaussian mixture model. Our broad experimental evaluation shows that DMBC is a scalable solution for clustering in a distributed environment that achieves results comparable to those of a centralized EM-based approach.
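As an illustration of how a server can combine Gaussian clusters without access to the raw data, the following sketch merges two weighted Gaussian components by standard moment matching, i.e. the merged component has the same overall weight, mean, and covariance as the two-component mixture. This is a generic technique shown under that assumption; it is not the aggregation step of DMBC, which is described in Section 7.3. The function name and the sample values are hypothetical.

```python
def merge_components(w1, mean1, cov1, w2, mean2, cov2):
    """Moment-preserving merge of two weighted Gaussian components."""
    w = w1 + w2
    if w <= 0:
        raise ValueError("total weight must be positive")
    dim = len(mean1)
    mean = [(w1 * mean1[i] + w2 * mean2[i]) / w for i in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for wk, mk, ck in ((w1, mean1, cov1), (w2, mean2, cov2)):
        for i in range(dim):
            for j in range(dim):
                # Within-component spread plus spread of the component
                # means around the merged mean.
                spread = (mk[i] - mean[i]) * (mk[j] - mean[j])
                cov[i][j] += wk * (ck[i][j] + spread) / w
    return w, mean, cov


# Two hypothetical local clusters of equal weight in one dimension.
weight, mean, cov = merge_components(1.0, [0.0], [[1.0]], 1.0, [2.0], [[1.0]])
```

In this example the merged component has weight 2, mean 1, and variance 2: the variance grows because the two local means lie apart.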
7.2 Related Work
In the following, we review recent work on parallel and distributed clustering. Parallel clustering is related to the problem of distributed clustering because the data objects are also distributed over several clients where a local clustering is performed. The local clusterings are merged to produce the final model. However, parallel clustering methods can control the assignment of data objects to each site. Thus, the merge step is usually less complex and poses different problems than the merge step of distributed data mining approaches. Nevertheless, several recent approaches for distributed data mining are adaptations of parallel clustering algorithms and do not consider privacy preservation issues.
Parallel versions of k-means, k-Harmonic-Means, EM [DM00, FZ00], and DBSCAN [XJK03] are all not applicable within distributed environments because all methods rely on a centralized view of the data during or before the clustering is computed.
In [SS00], a parallel algorithm is proposed for clustering web documents distributed randomly over several sites. Any clustering algorithm can be used to generate local clusters. The entire local clusters, rather than compact prototypes, are sent back to the server. Clusters are merged if they share a given number of documents, which is determined by deriving maximum-sized itemsets from the documents. Obviously, since all local documents are transferred to the server, this approach does not consider any privacy issues.
In [JKP04], a distributed version of DBSCAN [EKSX96] is presented. The local clusters are represented by special objects that have the best representative power. This representative power is based on two quality measures that take the density-based clustering concepts into account. For each representative, a covering radius and a covering number are aggregated for the global merge step. The performance of the proposed method depends heavily on the number of representatives. If it is chosen too small, the accuracy decreases significantly; if it is chosen too large, the runtime increases due to high transfer cost. In addition, since real data objects are sent to the global server, this approach also does not consider any privacy issues.
In [JK99], a single-link hierarchical clustering algorithm for vertically distributed data is proposed. However, our new approach DMBC is focused on horizontally distributed data.