
FreqyWM: Frequency WaterMarking for the New Data Economy


Academic year: 2023


Devriş İşler
IMDEA Networks Institute & Universidad Carlos III de Madrid

Elisa Cabana
IMDEA Networks Institute

Nikolaos Laoutaris
IMDEA Networks Institute

Abstract

We present a novel technique for modulating the appearance frequency of a few tokens within a dataset for encoding an invisible watermark that can be used to protect ownership rights upon data. We develop optimal as well as fast heuristic algorithms for creating and verifying such watermarks. We also demonstrate the robustness of our technique against various attacks and derive analytical bounds for the false positive probability of erroneously “detecting” a watermark on a dataset that does not carry it. Our technique is applicable to both single dimensional and multidimensional datasets, is independent of token type, and can be used in a variety of use cases that involve buying and selling data in contemporary data marketplaces.

1 Introduction

Data-driven decision making powered by Machine Learning (ML) algorithms is changing how society and the economy work and is having a profound positive impact on our daily lives. A McKinsey report predicted that data-driven decision-making could reach US$2.5 trillion globally by 2025 [4]. ML is driving up the demand for data in what has been called the fourth industrial revolution. To satisfy this demand, several data marketplaces (hereinafter, DMs) have appeared in the last few years. DMs are mediation platforms that aim to connect the two primary stakeholders of the data value chain, namely the data providers/sellers and the data buyers. DMs take care of tasks such as searching/cataloging, facilitating price negotiation, storing and transferring data, etc. This ecosystem includes open data repositories (e.g., Dataverse, Kaggle), general-purpose DMs (e.g., AWS, DataRade, DAWEX, DIH, Advaneo), and specialised or niche DMs for specific industries, such as automotive (e.g., Otonomo, Caruso), financial (e.g., Refinitiv, Battlefin), marketing (e.g., LOTAME, LiveRamp), and logistics (e.g., Veracity), to name a few. See [6] for an extensive survey of close to 200 DMs and other data providers.

The problems: The majority of DMs and data sellers have explicit clauses indicating that purchased datasets cannot be resold or used outside the jurisdiction of the buyer. But as with all digital assets, being able to copy/store/transmit datasets with close to zero cost makes creating illegal copies very easy. Even worse, unlike media content and software, the issue of ownership is less obvious when it comes to datasets.

Any popular movie, song, e-book, or piece of software can usually be attributed to a director/producer/actor, musician, author, or company, respectively, but the same is hard to do for large datasets. Consider an anonymised mobility dataset logging the movement of people in a city. Such a dataset may have been produced by collecting GPS readings from the smartphones of individuals using a map application, or it may be deduced by analysing cell phone traces [26] or Call Detail Records (CDRs) maintained by mobile operators.

With weak to nonexistent ownership guarantees, it is difficult to imagine that the data economy will ever flourish and reach its projected potential. Indeed, any sold copy of a dataset can be ‘pirated’ by a buyer-turned-seller that can then resell the same dataset in a DM at a potentially lower price, thereby undercutting the rightful owner and rendering his investment useless. Piracy has threatened the software and media industries in the past and may very well do the same to the fledgling data economy by discouraging companies from investing in expensive measurement probes, big data infrastructures, and human labor costs to clean and curate commercial datasets.

A novel watermarking technique for data: In this paper we develop a technique for protecting dataset ownership from piracy. We present a novel Frequency Watermarking technique, henceforth FreqyWM,¹ for hiding a secret within a dataset in a manner that makes the said secret indistinguishable from the data that it protects. The main idea behind FreqyWM is to modify slightly the appearance frequency of existing tokens within a dataset in order to create a secret in the form of a complex relationship between the frequencies of different tokens. By making this relationship complex enough, we can reduce the probability that it appeared by chance close to zero. Therefore, by revealing knowledge of such a secret relationship, a party can claim ownership over a dataset because the only practical way of knowing such a secret is to have inserted it in the data in the first place.

¹ Freqy is pronounced as "freaky".

A token may be a word, a database record, a URL, or any repeating value within a structured or semi-structured commercial dataset. Our secret is created by first selecting a number of token pairs. Then for each pair, we slightly modulate the frequency counts of its tokens in order to make their difference yield zero under modulo N arithmetic. This can be easily done by adding or removing some instances of one, the other, or both tokens. By increasing the number of selected pairs we can make our watermark more resistant to attacks, as well as less likely to have appeared by chance.

FreqyWM can be used to achieve several things. First and foremost, by revealing knowledge of the secret encoded by the watermark, a data seller can prove rightful ownership of a dataset to a third party, such as a DM. This can be used to distinguish a rightful owner from a pirate that may, for example, attempt to upload and monetize a pirated dataset in one or several DMs. If the DM or the rightful owner detects such an event, the latter can demand that its dataset be removed and the pirate be banned. This would mimic what websites like YouTube do to protect copyrighted content. Detecting the presence of pirated copies can be achieved using content similarity [18], locality sensitive hashing [31], and even hashing similarity [7], all of which go beyond the scope of watermarking.

In addition to proving ownership, our watermarking technique can also be used to reveal who may have leaked (copied/pirated) a dataset in the first place. A dataset seller or a DM may create a different watermark for every buyer and, in addition to encoding it into the data, also store a description of it in some immutable index, such as a blockchain. Then, if an unauthorized copy of the dataset is found at a later point, the culprit can be identified by looking up its watermark against the said index.

Our contributions: Our first and foremost contribution is the idea of using the appearance frequency of tokens to encode invisible watermarks upon datasets traded in modern DMs. We establish a family of such watermarks using frequency pairs and modulo arithmetic and prove that creating an optimal FreqyWM reduces to solving a Maximum Weighted Matching (MWM) problem [15,28] combined with a polynomial special version of the 0/1 Knapsack problem [11] involving items of equal value but different weights. Our second major contribution is the extension of frequency-based watermarking to make it resilient against a series of attacks. In particular, we protect our data engineering technique against a Guess Attack that attempts to identify our watermarked pairs and secrets in order to impersonate the rightful owner. We make such an attack computationally hard by introducing a high-entropy secret while generating the watermark. We also protect against a Re-watermarking Attack, in which a pirate injects his own watermark upon an already watermarked dataset and then presents it as a false proof of ownership. We thwart such an attack by describing a simple protocol capable of chronologically ordering multiple watermarks that may be carried by different versions of the same dataset. We also protect against a Destroy Attack mounted by an attacker attempting to destroy our watermark by changing the frequency of different tokens in the dataset. By relaxing the modulo arithmetic rule used during the verification of a particular watermark pair, as well as the percentage of pairs to be detected before the entire watermark is verified (accepted), we oblige the attacker to effectively also destroy the actual data in the process of attempting to destroy the watermark. Finally, we show that our technique is robust to a Sampling Attack, in which the attacker attempts to pirate only a random sample of the watermarked data.

Our final contribution is an extensive performance evaluation study aiming to explain how the main parameters of FreqyWM affect the major performance metrics under different attack scenarios using synthetic and real-world datasets.

Our findings: Next we summarize the results from our performance evaluation study.

• We show that as long as there exists sufficient variation in the frequencies of different tokens, FreqyWM can encode robust watermarks with minimal distortion of the initial data. Our technique becomes inapplicable only when token appearance frequencies are almost uniform, because in this case there is not a sufficient gap between different frequencies for encoding a watermark.

• Regarding the false positive probability, i.e., “detecting” a watermark on a dataset that does not carry it, our analytical bounds show that it quickly goes to zero as we increase the number of pairs in our watermark.

• We demonstrate that a Guess Attack has negligible probability of success, making it impractical in almost all cases. On the upside, the rightful owner, or any party that is given the secret key for verifying the watermark, can verify it very fast, in linear time.

• In terms of Sampling Attacks, we show that with the exception of very small samples (of correspondingly small value), our verification algorithm is still capable of detecting our watermark. Achieving this requires using the relaxed detection algorithm, which trades robustness to attacks against false positives. For example, on a sample of 20% and with thresholds that impose a tiny false positive probability, the watermark detection probability exceeds 90%.

• In terms of Destroy Attacks, we show that a watermark that imposes (costs) a tiny 0.0002% distortion on the original data remains detectable even under attacks that add random noise that imposes a 90% modification on the original data.

• Compared to existing solutions from the literature [3,30] that are applicable only to numerical data and preserve only the mean value of the watermarked data, FreqyWM allows a data owner to control the exact amount of distortion introduced by the watermark in terms of cosine or other similarity metrics which, under [3,30], may become unbounded. For example, a FreqyWM watermark that imposes only 0.0002% distortion in terms of cosine similarity is stronger than a watermark from [30] that imposes a 46.72% distortion under the same metric.

2 Related Work

Database watermarking is the closest type of watermarking to our work. There are of course other types of watermarking and fingerprinting, e.g., for sequential datasets [5] and genomic datasets [21], but as they focus on very different data types we do not go into more detail about them. Going back to database watermarking, the first known technique was proposed by Agrawal et al. [3] in 2002. It is applicable to numerical data only and guarantees that the introduced distortion does not dramatically change the median and standard deviation of the data. The watermark information is normally embedded in the Least Significant Bit (LSB) of features of relational databases to minimize distortion.

Other database watermarking solutions introduce distortion by considering the statistics of numeric values [14,17,25,30].

The solutions proposed in [3,30] focus on keeping the change in basic statistics (i.e., median and standard deviation) to a minimum.

To avoid distortion, distortion-free database watermarking schemes have also been proposed. [12,13] are distortion-free techniques that introduce fake tuples or columns in the original database. The fake tuples or columns are created based on a watermark secret by computing a secret function which makes watermarking visible and easy to remove.

Survey papers such as [22,24,29] compare database watermarking techniques in terms of verifiability, distortion, supported data types, and other aspects. Many of these solutions are applicable only to numerical data and thus cannot be applied to a range of commercial datasets, e.g., to web-browsing click-streams.² Moreover, these solutions do not allow a data owner to limit the amount of distortion introduced by the watermark but merely to preserve the basic statistics of the data in terms of its first few moments.

² Note that categorical watermarking techniques would replace tokens in a dataset with another token in the same category, which causes an undesired distortion and requires certain predefined categories. Text watermarking, in turn, would change a token (e.g., replacing a word with another similar word). This (insecure) change may invalidate a token (e.g., producing an invalid URL).

3 Preliminaries

Let $\lambda \in \mathbb{N}$ be a security parameter. A probabilistic polynomial time (PPT) algorithm $A$ is a probabilistic algorithm that takes $1^\lambda$ as input and whose running time is bounded by a polynomial in $\lambda$. The symbol $||$ denotes concatenation. Table 1 presents the notation used throughout the paper.

Hash Function.

Definition 1 A hash function $H$ (chosen from a family of such functions) is a deterministic function from an arbitrary-size input to a fixed-size output, denoted $H: \{0,1\}^* \rightarrow \{0,1\}^\lambda$. The hash function is collision resistant if it is hard to find two different inputs $m_0 \neq m_1$ that hash to the same output, $H(m_0) = H(m_1)$.

Maximum Weight (Perfect) Matching (MWM).

Definition 2 Let $G = (V, E)$ be a graph with a set of vertices $V = \{v_1, v_2, \ldots, v_{|V|}\}$ and a set of edges $E = \{e_1, e_2, \ldots, e_{|E|}\}$, where each edge has a weight denoted by $w(e_i)$. A set $M \subseteq E$ is a matching if no two edges in $M$ have a common vertex. A vertex $v$ is matched by $M$ if it is an endpoint of an edge of $M$, and unmatched otherwise. The maximum weight matching problem asks for a matching $M$ of maximum weight in a given input graph $G = (V, E)$, formulated as

$$\max \sum_{e_i \in E} w(e_i)\, x_i \quad \text{s.t. } x_i \in \{0, 1\},$$

where $x_i = 1$ if $e_i \in M$ and $x_i = 0$ otherwise.
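To make Definition 2 concrete, the following minimal sketch searches all matchings of a tiny graph and returns one of maximum total weight. It is exponential-time and purely illustrative; the function name `max_weight_matching` and the `(u, v, weight)` edge format are assumptions of this sketch, and a polynomial-time algorithm such as those cited in [15,28] would be used on real inputs.

```python
from typing import List, Tuple

Edge = Tuple[str, str, float]  # (u, v, weight); format assumed for illustration


def max_weight_matching(edges: List[Edge]) -> Tuple[float, List[Edge]]:
    """Exhaustively search for a matching of maximum total weight."""
    if not edges:
        return 0.0, []
    (u, v, w), rest = edges[0], edges[1:]
    # Option 1: leave the first edge out of the matching.
    best_w, best_m = max_weight_matching(rest)
    # Option 2: take it, and discard every edge that shares a vertex with it.
    compatible = [e for e in rest if e[0] not in (u, v) and e[1] not in (u, v)]
    take_w, take_m = max_weight_matching(compatible)
    if w + take_w > best_w:
        return w + take_w, [(u, v, w)] + take_m
    return best_w, best_m
```

On the path a-b-c-d with weights 3, 5, 3, the heaviest single edge (b, c) is not part of the best matching: taking (a, b) and (c, d) together yields a larger total.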

Equally Valued 0/1 Knapsack Problem (QKP).

Definition 3 Let $T = \{t_1, t_2, \ldots, t_n\}$ be the set of items. Each item $t_i$ is worth the same value $val$ and has a weight $w_{t_i}$. The problem is to fill a knapsack of capacity $W$. The assumption is that the weights $w_{t_i}$ and the total weight $W$ are integers. The greedy solution fills the knapsack by adding the (sorted) items until the capacity $W$ is reached. While the general 0/1 Knapsack problem is known to be NP-Hard [11], this special equally valued 0/1 Knapsack problem has a polynomial time (greedy) solution.
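The greedy solution of Definition 3 can be sketched as follows. Because every item has the same value, maximizing the total value means maximizing the number of items, so the sketch assumes the items are sorted by ascending weight; the function name and the dictionary input are illustrative choices, not part of the paper.

```python
from typing import Dict, List


def equal_value_knapsack(weights: Dict[str, int], capacity: int) -> List[str]:
    """Pick as many equally valued items as possible within the capacity."""
    chosen: List[str] = []
    used = 0
    # Lightest items first: each one costs the least capacity per unit of value.
    for item, weight in sorted(weights.items(), key=lambda kv: kv[1]):
        if used + weight > capacity:
            break  # every remaining item is at least as heavy
        chosen.append(item)
        used += weight
    return chosen
```

Sorting dominates the running time, so the sketch runs in O(n log n) for n items.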

Table 1: Notation

Symbol     Description
$D_o$      The original data to watermark.
$D_w$      Watermarked (data) version of $D_o$.
$tk_i$     $i$th token.
$f_i^o$    Frequency of the $i$th token in $D_o$.
$f_i^w$    Frequency of the $i$th token in $D_w$.
$R$        A high-entropy secret.
$L_{wm}$   A list of chosen token pairs for watermarking.
$L_{sc}$   A list of secrets required for watermark detection.
$L_e$      A list of eligible token pairs for watermarking.
$k$        Threshold for detecting a watermark.
$t$        Threshold to accept a pair as watermarked.
$b$        A budget threshold for the distortion that watermarking can introduce.

4 Frequency-based Watermarking

4.1 Overview

Our new private-key-based watermarking scheme, called FreqyWM, is blind (it does not require the original data), primary-key free, robust, and secure against guess, sampling, destroy, and false-claim attacks, with high utility and a good trade-off between the complexity of the transformation and the algorithmic efficiency of the solution we present.

Our FreqyWM scheme consists of two main algorithms: the watermark generation algorithm WMGenerate and the watermark detection algorithm WMDetect. WMGenerate generates watermarked data based on a budget $b$ capturing how much the watermarked data may differ from the original, e.g., in terms of the cosine (or other) similarity of their corresponding token frequency distributions. By calling WMGenerate, the owner creates a watermarked version of her token data that allows her to prove her ownership. WMDetect detects whether a suspected dataset holds the watermark of the owner using the owner secrets produced by WMGenerate and two thresholds ($k$ and $t$). If WMDetect outputs accept/verified, this is evidence that the owner can claim ownership of the watermark and thus of the data. By nature, WMDetect can be computed as many times as desired in private, but only once in public, because public verification requires the data owner to reveal the secret behind the watermark to the public (or to whoever must verify it, e.g., a judging third party). As part of our future work we are also looking at public verifiability without revealing the private key (see Section 8). We introduce the following two constraints that must be satisfied by FreqyWM:

Constraint 1: WMGenerate should not change the ranking of tokens. For instance, let us assume that tokens are URLs. If the most frequent token in the original histogram of the dataset is YouTube, then YouTube must preserve its rank in the watermarked histogram despite any increase or decrease in the frequencies of the histogram tokens. Preservation of ranking does not of course imply that the frequencies of individual URLs need to remain intact; they can be modified as long as their ranking is preserved.

Constraint 2: The (cosine) similarity³ between the original (histogram) data and its transformation to watermarked (histogram) data should not be less than $(100 - b)\%$, where $b$ is a budget determined by the owner. For instance, the owner may want to keep the distortion due to watermark generation within a given budget.
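A minimal sketch of how Constraint 2 could be checked on two token histograms, using the standard definition of cosine similarity over frequency vectors. The helper names `cosine_similarity` and `within_budget`, and the dictionary representation of a histogram, are assumptions made for illustration.

```python
import math
from typing import Dict


def cosine_similarity(h1: Dict[str, int], h2: Dict[str, int]) -> float:
    """Cosine similarity of two histograms viewed as frequency vectors."""
    tokens = set(h1) | set(h2)
    dot = sum(h1.get(t, 0) * h2.get(t, 0) for t in tokens)
    norm1 = math.sqrt(sum(v * v for v in h1.values()))
    norm2 = math.sqrt(sum(v * v for v in h2.values()))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for an empty histogram")
    return dot / (norm1 * norm2)


def within_budget(original: Dict[str, int], watermarked: Dict[str, int], b: float) -> bool:
    """True if the similarity is at least (100 - b) percent."""
    return 100.0 * cosine_similarity(original, watermarked) >= 100.0 - b
```

For example, changing the frequencies (1098, 537) to (1075, 559) keeps the similarity at roughly 99.97%, which satisfies a budget of b = 2 but not a budget of b = 0.01.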

Next we describe the general idea behind FreqyWM, also illustrated in Figure 1. To make the description easier we assume that tokens are TLD URLs, to be consistent also with our evaluation section later. Of course our technique is general and can be applied to any repeating token beyond just URLs, as we explain in Section 4.6.

³ We take advantage of the definition of cosine similarity by NIST [27].

1. The data owner, holding a list of URLs visited while browsing the Internet, creates a dataset $D_o$ in which the domain of each visited URL is a token.

2. FreqyWM creates a histogram of the original dataset $D_o$: it computes the frequency of each unique token and sorts the tokens in descending order (e.g., YouTube is the most visited, Facebook is the second, and so on).

3. FreqyWM identifies eligible pairs $L_e$ of tokens that are candidates to be watermarked using some secrets (i.e., $R$).

4. FreqyWM selects pairs of tokens from the eligible pairs for watermarking, denoted by $L_{wm}$, based on budget constraints (parameter $b$). To select which tokens should be paired, FreqyWM benefits from the solutions of two well-known problems: Maximum Weight Matching (MWM) and the Equally Valued 0/1 Knapsack problem (QKP). Eligible pairs are converted to a graph representation where vertices represent tokens $tk_i$ and an edge represents a pair. FreqyWM applies maximum weight matching to select pairs. After pairs are selected, the Equally Valued 0/1 Knapsack problem is employed to satisfy Constraint 2, and the final pairs for watermarking are chosen.

5. FreqyWM modifies the frequencies of the selected tokens so that the difference between the frequencies of each pair equals 0 modulo a value calculated from the secrets and the tokens in the pair. To make this concrete, assume that the frequencies of a chosen pair, e.g., youtube.com and instagram.com, are 1098 and 537, respectively, in a commercial click-stream dataset such as those sold by companies like comScore. Assume also that we select a modulo value of 129. The difference between the two frequencies modulo 129 is 45. To set the difference to 0, we need to change the appearance frequencies of youtube.com and instagram.com in the dataset. The remainder 45 is divided by 2 into 23 (ceiling) and 22 (floor). The new frequencies of youtube.com and instagram.com become 1098 − 23 = 1075 and 537 + 22 = 559, so that (1075 − 559) mod 129 = 0. We can do that by removing 23 instances of youtube.com from the dataset and adding 22 more instances of instagram.com. However, when the remainder is greater than half of the modulo (e.g., (1098 − 537) mod 21 = 15), we instead raise the difference to the next multiple of the modulo, ⌈(1098 − 537) ÷ 21⌉ × 21 = 567, by adding 21 − 15 = 6 to it. This way, we never have to eliminate remainders that exceed half of the modulo.

6. FreqyWM inserts or removes tokens depending on the frequencies and produces a watermarked dataset $D_w$.
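The frequency modification of step 5 can be sketched as below. The function name `modulate_pair` is hypothetical, and the way the change is split between the two tokens when the difference is enlarged (ceiling to the more frequent token, floor to the less frequent one) is an assumption of this sketch that mirrors the split described for the shrinking case.

```python
from typing import Tuple


def modulate_pair(f_high: int, f_low: int, s: int) -> Tuple[int, int]:
    """Return new frequencies whose difference is 0 modulo s."""
    if s < 2:
        raise ValueError("the modulus must be at least 2")
    if f_high < f_low:
        raise ValueError("f_high must be the larger frequency")
    r = (f_high - f_low) % s
    if 2 * r <= s:
        # Small remainder: shrink the difference by r (ceil/floor split).
        return f_high - (r + 1) // 2, f_low + r // 2
    # Large remainder: grow the difference to the next multiple of s.
    gap = s - r
    return f_high + (gap + 1) // 2, f_low - gap // 2
```

With frequencies 1098 and 537 and modulus 129, the remainder 45 is removed and the result is (1075, 559). With modulus 21 the remainder is 561 mod 21 = 15, which exceeds half the modulus, so the difference is enlarged by 6 and the result is (1101, 534), whose difference 567 is a multiple of 21. A full implementation would also have to check the ranking boundaries of Constraint 1, which this sketch omits.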

[Figure 1 (diagram not reproduced). Watermark generation takes the original data and the budget through: 1. Histogram Generation, 2. Eligible Tokens, 3. Optimal Selection (3.1 Graph Representation, 3.2 MWM, 3.3 QKP), 4. Frequency Modification, 5. Data Transformation, producing the watermarked data and the watermarking secrets. Watermark detection takes the watermarked data, the secrets, and the detection thresholds. MWM: Maximum Weight Matching; QKP: Equally Valued 0/1 Knapsack.]

Figure 1: FreqyWM overview, illustrated on a (Top Level Domain, TLD) URL dataset. URLs chosen as a pair for watermarking are represented with the same colored frequencies (e.g., YouTube and Google), while the ones that are not selected for watermarking are colored black (e.g., CNN).

4.2 Detailed Description of FreqyWM

4.2.1 Watermark Generation

The data owner holds data $D_o$ (which we refer to as the original data) and defines a budget $b$ that determines how much distortion a watermark can introduce (e.g., $b = 2$ means the cosine similarity must be at least 98%). Given these two inputs, the watermarking algorithm follows the steps of Algorithm I in order to generate the optimal watermark, i.e., the watermark with the largest number of watermarked pairs within the given budget $b$. For our optimal solution, we reduce our problem to the MWM and QKP problems (see Section 4.3). We also devise two faster heuristic approaches: greedy and random. In our solutions, we do not consider parameter optimizations such as selecting the optimal threshold and budget, since this is out of scope for this work.

Now we explain the watermark generation as follows:

1. pre-processes $D_o$ to generate the frequencies (number of appearances of each unique token) of token domains and to calculate the boundaries of each token, as $D_o^{hist} = Preprocess(D_o)$. Each token domain $tk_i$ holds an (original) frequency $f_i^o$, an upper boundary $u_i$ (except for the most frequent item), and a lower boundary $l_i$. Initially, the upper boundary of $tk_0$, the token with the highest frequency in the histogram, is set to infinity, $u_0 = \infty$, since we can increase the frequency of $tk_0$ as much as we want, and the lower boundary of $tk_{|D_o^{hist}|-1}$ is set to its frequency, $l_{|D_o^{hist}|-1} = f^o_{|D_o^{hist}|-1}$. For the rest of the boundaries, $u_i$ is defined as the difference between $f_{i-1}^o$ and $f_i^o$, while $l_i$ is assigned as $f_i^o - f_{i+1}^o$.
Remark: We need these upper and lower boundaries to keep the distortion introduced by the watermark to a minimum. For instance, after watermarking, YouTube is still the most visited and Facebook is still the second most visited, but their frequencies will have changed.

2. generates a random number $R \leftarrow \{0,1\}^\lambda$ and an integer $z \in \mathbb{Z}^+$.

3. computes the $s_{ij}$ values for the modulo operation as $s_{ij} = H(tk_i || tk_j || R) \bmod z$.

4. defines all eligible pairs as $L_e \leftarrow Eligible(D_o^{hist}, \{s_{ij}\})$ using the generated $s_{ij}$ values and the associated token pairs.
Remark: The eligible items are needed for creating a relation that eventually transforms dataset $D_o$ into dataset $D_w$.

5. runs the optimal matching algorithm, which chooses the optimal number of pairs of tokens (frequencies) from the eligible pairs using the frequencies and the $s_{ij}$ values, as $L_{wm} \leftarrow OptMatch(D_o^{hist}, L_e, \{s_{ij}\}, b)$, where $b$ represents the budget.

6. creates new frequencies for the tokens chosen by the optimal matching algorithm such that $(f_i^w - f_j^w) \bmod s_{ij} \equiv 0$ (see Section 4.3 for the optimal matching solution).
Remark: The positions at which tokens are added are important for the security of FreqyWM against the guess attack. Therefore, new tokens should be added in random positions (see Section 4.6 for more discussion).

7. generates or removes tokens based on the new frequencies.

8. returns the list of tokens $D_w$ and stores $L_{wm}$, the value $z$, and the random value $R$ as a list $L_{sc}$.

Algorithm I: Watermark Generation
Input: D_o, b
Output: D_w, L_sc
    D_o^hist = Preprocess(D_o)
    R ← {0, 1}^λ, z ← Z^+
    foreach {tk_i, tk_j}, i ≠ j, ∈ D_o do
        s_ij = H(tk_i || tk_j || R) mod z
    end
    L_e ← Eligible(D_o^hist, {s_ij})
    L_wm ← OptMatch(D_o^hist, L_e, {s_ij}, b)
    foreach {tk_i, tk_j} ∈ L_wm do
        D_o^hist.Update(f_i^w, f_j^w, s_ij)
    end
    D_w ← Create(D_o^hist, D_o)
    L_sc = {L_wm, R, z}
Result: D_w, L_sc = {L_wm, R, z}
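The first steps of the generation algorithm (histogram with boundaries, then the per-pair modulus) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: SHA-256 stands in for the hash function H, tokens are UTF-8 strings concatenated without a separator, and the names `preprocess` and `pair_modulus` are hypothetical.

```python
import hashlib
import math
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, int, float, float]  # (token, frequency, upper, lower)


def preprocess(tokens: Iterable[str]) -> List[Row]:
    """Histogram sorted by descending frequency, with ranking boundaries."""
    hist = Counter(tokens).most_common()
    rows: List[Row] = []
    for i, (tok, f) in enumerate(hist):
        # u_0 is infinite; otherwise u_i = f_{i-1} - f_i.
        upper = math.inf if i == 0 else hist[i - 1][1] - f
        # The last token may drop by its whole frequency; otherwise l_i = f_i - f_{i+1}.
        lower = f if i == len(hist) - 1 else f - hist[i + 1][1]
        rows.append((tok, f, upper, lower))
    return rows


def pair_modulus(tk_i: str, tk_j: str, R: bytes, z: int) -> int:
    """s_ij = H(tk_i || tk_j || R) mod z, with SHA-256 assumed as H."""
    digest = hashlib.sha256(tk_i.encode() + tk_j.encode() + R).digest()
    return int.from_bytes(digest, "big") % z
```

The secret R could be drawn with `secrets.token_bytes(32)` from the Python standard library. Note that the hash input depends on the order of the two tokens, so detection must present each pair in the same order in which it was stored.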

Generation of Eligible Pairs: A set of eligible pairs ($L_e$) is generated by an algorithm Eligible based on the given pairs $\{tk_i, tk_j\}$ and the corresponding $s_{ij}$ values. A pair is accepted as eligible if the boundaries of each token in the pair are at least $\lceil s_{ij}/2 \rceil$, where $s_{ij} \geq 2$. Here $s_{ij}$ cannot be 0 or 1, because modulo 0 is undefined and any integer modulo 1 is 0.
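A minimal sketch of the eligibility test. The text does not spell out which boundaries are compared, so this sketch assumes the strictest reading: both the upper and the lower boundary of both tokens must be at least ⌈s_ij/2⌉. The function name and the `bounds` mapping are hypothetical.

```python
import math
from typing import Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]  # token -> (upper, lower); assumed layout


def is_eligible(tk_i: str, tk_j: str, s_ij: int, bounds: Bounds) -> bool:
    """Check that each token can absorb a change of up to ceil(s_ij / 2)."""
    if s_ij < 2:
        return False  # modulo 0 is undefined and modulo 1 is always 0
    need = math.ceil(s_ij / 2)
    # Assumption: both boundaries of both tokens must leave enough room.
    return all(min(bounds[t]) >= need for t in (tk_i, tk_j))
```

A looser reading would compare only the boundary in the direction in which each token's frequency is actually moved.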

4.2.2 Watermark Detection

Algorithm II gives our watermark detection algorithm as pseudo-code. In the detection setting, the data owner suspects a token dataset $D'_w$ and wishes to know whether it carries the owner's watermark, in order to claim ownership. The owner holds its secret list $L_{sc}$. Given a list of tokens ($D'_w$), the detection algorithm checks whether the list belongs to the owner based on the secret list $L_{sc}$ and the pre-defined parameters $k$ and $t$. The detection algorithm can be run by the owner (or by whoever holds the secrets). If the owner wants to prove to a third party that it is indeed the owner, it has to reveal its secrets to that party. Consequently, ownership can be proven in public only once (see Section 6.4 for further discussion). The algorithm proceeds as follows:

1. generates the histogram list $D_w^{hist}$ of the suspected dataset $D'_w$ as in the watermark generation algorithm. However, the algorithm does not need to calculate the boundaries (as it did in WMGenerate). It only calculates the frequencies of tokens.

2. sets a counter as $count = 0$.

3. for each token pair $\{tk_i, tk_j\}$ in $L_{wm}$, checks whether it exists in $D_w^{hist}$. If it is found, it computes the $s_{ij}$ value as $s_{ij} = H(tk_i || tk_j || R) \bmod z$.

4. checks for each found $\{tk_i, tk_j\}$ whether $(f_i - f_j) \bmod s_{ij} \leq t$ holds. If it holds, the counter is increased by 1.
Remark: $t$ is a value between 0 and $s_{ij} - 1$. How to set $t$ (and $k$) depends on the robustness the owner wants. The owner decides these parameters by itself (see Sections 4.3 and 5 for further discussion of $t$ and $k$).

5. returns accept (verified) if $count$ is greater than or equal to the threshold $k$, and reject otherwise.

Algorithm II: Watermark Detection
Input: D'_w, L_sc = {L_wm, R, z}, k, t
Output: accept/reject
    D_w^hist = Preprocess(D'_w)
    count = 0, result = reject
    foreach {tk_i, tk_j} ∈ L_wm do
        if Found(tk_i, tk_j, D_w^hist) then
            s_ij = H(tk_i || tk_j || R) mod z
            if (f_i − f_j) mod s_ij ≤ t then
                count++
            end
        end
    end
    if count ≥ k then
        result = accept
    end
Result: result
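Algorithm II translates almost line by line into Python. The sketch below assumes SHA-256 as the hash function H, string tokens concatenated without a separator, and a list of ordered token pairs as L_wm; the function names are hypothetical, and the guard that skips moduli below 2 is an addition of this sketch.

```python
import hashlib
from collections import Counter
from typing import Iterable, List, Tuple


def pair_modulus(tk_i: str, tk_j: str, R: bytes, z: int) -> int:
    """s_ij = H(tk_i || tk_j || R) mod z, with SHA-256 assumed as H."""
    digest = hashlib.sha256(tk_i.encode() + tk_j.encode() + R).digest()
    return int.from_bytes(digest, "big") % z


def wm_detect(tokens: Iterable[str], L_wm: List[Tuple[str, str]],
              R: bytes, z: int, k: int, t: int) -> bool:
    """Return True (accept) if at least k stored pairs verify within t."""
    hist = Counter(tokens)  # frequencies only; no boundaries are needed
    count = 0
    for tk_i, tk_j in L_wm:
        if tk_i in hist and tk_j in hist:
            s_ij = pair_modulus(tk_i, tk_j, R, z)
            if s_ij < 2:
                continue  # guard added in this sketch: such pairs are never eligible
            if (hist[tk_i] - hist[tk_j]) % s_ij <= t:
                count += 1
    return count >= k
```

One pass over the data builds the histogram and one pass over L_wm checks the pairs, so the running time is linear in the size of the dataset plus the number of watermarked pairs.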

4.3 Optimal and Heuristic Approaches

We now discuss our optimal and heuristic solutions for the matching function in the watermark generation algorithm. First, we show that the optimal set of watermark pairs for a given budget $b$ is given by the solution of an MWM problem followed by a QKP on its output. Given that all watermarked pairs have equal value in terms of proving ownership of the data, an optimal watermark is just a watermark of maximum size in terms of watermarked pairs, within the defined constraints. Later, we describe a greedy and a random heuristic for the same problem. Last, we give our experimental results by comparing the two heuristic approaches to the optimal.

4.3.1 Optimal Matching

We prove that Algorithm I inserts an optimal watermark within a given budget $b$, considering the constraints described in Section 4. Our pairing problem is constructed based on Definition 2 and Definition 3. Now, let us define our optimal watermarking:

Definition 4 (Optimal Watermarking) Let $OptWM(G(V, E), b)$ be the optimal watermarking with a budget of $b$ among an eligible set of items $L_e$, which is defined as a connected undirected graph $G(V, E)$. The optimal watermarking produces the maximum number of edges (pairs) while not exceeding the budget $b$.

Let $G = (V, E)$ be a connected undirected graph which is the representation of the frequencies derived from the eligible pairs $L_e$. $V = \{v_1, v_2, \ldots, v_{|V|}\}$, where $v_i$ represents $tk_i$, and $E = \{e_1, e_2, \ldots, e_{|E|}\}$, where $e(v_i, v_j)$ is the edge between $v_i$ and $v_j$. The weight of an edge $e(v_i, v_j)$ is $w(e(v_i, v_j)) = T - ((f_i - f_j) \bmod s_{ij})$ (recall that $f_i$ is the frequency of $tk_i$), where $T$ is a big value (e.g., $T > H$, where $H$ is the highest difference between two frequencies in the eligible pairs). We have now represented the eligible pairs as a graph. With this in mind, our optimal watermarking problem reduces to finding the maximum number of edges (pairs) such that no two edges have a common vertex and the given budget $b$ is not exceeded. To solve this problem we take advantage of two known problems that have polynomial time solutions: Maximum Weight Matching (MWM) and the Equally Valued 0/1 Knapsack problem (QKP), the special case of the Knapsack problem in which all values are equal. After defining the eligible pairs as a graph, the following steps are taken:

• Find the maximum weight matching $M = \{e_1, e_2, \ldots, e_{|M|}\}$ as $M = MWM(G(V, E))$.
Remark: Notice that $M$ includes the edges (where no two edges share a common vertex) whose sum gives the maximum weight. Maximum weight in the matching corresponds to minimum distortion in our watermarking, since we defined the weights as $T - ((f_i - f_j) \bmod s_{ij})$. This conversion gives the pair with the highest (modular) frequency difference the smallest weight and the pair with the smallest one the highest weight. This is required because we want to identify the edges that will distort the (histogram) dataset the least.

• Find the set of edges $L_{wm}$ in $M$ such that the selected edges do not exceed the budget $b$ by employing the QKP as $L_{wm} = QKP(M, b)$, where $L_{wm} = \{e_1, e_2, \ldots, e_{|L_{wm}|}\}$ and the value of each $e_i$ is 1 ($val(e_i) = 1$).
Remark: After finding the edges via MWM, we have one more constraint in the watermarking, which is the budget $b$. Our optimal watermarking solution returns the maximum number of matchings for which the distortion (e.g., based on cosine similarity) does not exceed $b$. This is an Equally Valued 0/1 Knapsack problem where the value of each item is 1 and the weight is recomputed as $T - w(e_i)$. Recomputation is necessary since for the QKP we want to add as many items as possible within the bound $b$.

Now, in Theorem 1, we prove by contradiction that the resulting watermark is optimal.

Theorem 1 The watermarking solution $O_{WM}$ is optimal (according to Definition 4) if the solutions produced from the Maximum Weight Matching (according to Definition 2) and the Equally Valued 0/1 Knapsack problem (according to Definition 3) are optimal.

Proof 1 (Proof by contradiction) We prove that our watermarking is optimal via proof by contradiction. Let O_WM = {e_1, e_2, ..., e_|O_WM|} be the optimal watermarking solution.

To prove that this solution (which is derived from the optimal solution O_MWM generated by the Maximum Weight Matching (MWM) algorithm and the optimal solution O_QKP generated by the Equally Valued 0/1 Knapsack problem (QKP)) is optimal, we assume that O_WM is not optimal although the solutions derived from MWM and the 0/1 Knapsack are optimal. If O_WM is not optimal, then there must be another solution O'_WM which is optimal (according to Definition 4) such that |O'_WM| > |O_WM|. In other words, O'_WM must include at least one different edge (pair) e_o that does not exist in O_WM, i.e., e_o ∉ O_WM. W.l.o.g., considering e_o ∉ O_WM, the following two cases arise:

• Case 1: e_o ∈ O_MWM but e_o ∉ O_QKP. This states that the QKP solution did not choose e_o for O_WM. That would mean that QKP did not optimally generate the number of items such that W(w(e_i))_{i=1,2,...,|O_WM|} ≤ b. However, this contradicts our assumption that QKP is optimal. If e_o is (or must be) in the optimal solution, then it should have been chosen by QKP. Therefore, O_WM cannot be non-optimal (or O'_WM cannot be an optimal solution).

• Case 2: e_o ∉ O_MWM (and hence, logically, e_o ∉ O_QKP). This case occurs when MWM does not choose e_o, yet e_o comes from the graph G (which is the graph representation of all the eligible items) and appears in O'_WM. Similar to Case 1, while creating the maximum weight matching, e_o was not good enough to appear in the optimal matching. However, the optimal watermarking solution O'_WM would contain e_o since it also creates the minimum distortion, which is the equivalent of maximum weight in MWM. Then O_MWM is not optimal, since it did not choose the maximum weight matching. However, this contradicts our assumption that O_MWM is optimal. If O'_WM is optimal, then e_o would have been included in O_MWM. Therefore, there cannot exist an edge e_o that makes the solution optimal but was not chosen by O_MWM.

To conclude our proof, considering the two cases above, we showed that there cannot exist an edge that was not chosen by any of the optimal algorithms and yet it exists in the optimal watermarking solution.

4.3.2 Greedy Matching

In our greedy solution, all the eligible token pairs are sorted in ascending order of their remainders, defined as rm_ij ≡ (f_i^o − f_j^o) mod s_ij. The algorithm visits the pairs in this order and selects a pair for watermarking if the budget b would not be exceeded once it is chosen (i.e., by comparing the similarity of the original and watermarked datasets). This continues until the budget is exhausted or there are no more items to visit.

4.3.3 Random Matching

Our random matching solution differs from the greedy solution in the order in which the pairs are selected: it picks pairs at random. For each selection, it follows the same steps as the greedy algorithm (e.g., checking the similarities).
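The two heuristics differ only in visiting order, so they can share one selection loop. The sketch below is illustrative and rests on several assumptions not stated in this section: each token is used in at most one pair, a pair is embedded by lowering the larger frequency by the remainder, and the budget is read as the maximum allowed distortion in percent, measured as (1 − cosine similarity) × 100. The function and parameter names are hypothetical.

```python
import math
import random


def cosine_similarity(h1, h2):
    """Cosine similarity of two histograms defined over the same tokens."""
    dot = sum(h1[t] * h2[t] for t in h1)
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2)


def select_pairs(hist, pairs, budget_pct, order="greedy", seed=0):
    """pairs: {(ti, tj): s_ij} with hist[ti] >= hist[tj].

    Returns (chosen pairs, watermarked histogram).
    """
    items = list(pairs.items())
    if order == "greedy":
        # Ascending remainder: cheapest pairs first.
        items.sort(key=lambda it: (hist[it[0][0]] - hist[it[0][1]]) % it[1])
    else:
        random.Random(seed).shuffle(items)

    marked, chosen, used = dict(hist), [], set()
    for (ti, tj), s in items:
        if ti in used or tj in used:       # assumption: pairs must be disjoint
            continue
        r = (hist[ti] - hist[tj]) % s
        trial = dict(marked)
        trial[ti] -= r                     # assumption: remove the remainder from f_i
        distortion = (1 - cosine_similarity(hist, trial)) * 100
        if distortion <= budget_pct:       # keep the pair only if within budget
            marked = trial
            chosen.append((ti, tj))
            used.update((ti, tj))
    return chosen, marked


hist = {"a": 100, "b": 60, "c": 30, "d": 10}
pairs = {("a", "b"): 7, ("c", "d"): 6}
print(select_pairs(hist, pairs, budget_pct=5))
print(select_pairs(hist, pairs, budget_pct=5, order="random", seed=1))
```

This loop skips a pair that would exceed the budget and keeps visiting the rest, which is one possible reading of "until the budget is exhausted".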

4.4 Algorithm Comparison and Effect of Main Parameters

We evaluate how the parameters (the modulo value z, the budget b, and the skewness parameter α) affect the number of pairs chosen for watermarking, and how the optimal, greedy, and random approaches compare in terms of the number of chosen pairs. To provide experimental results, we used synthetic datasets generated from a power-law distribution [10] with skewness values α in [0.05, 0.2, 0.5, 0.7, 0.9, 1]. The sample size is 1M and the number of tokens is 1K for each value of α.

When α is 0, the distribution is uniform and there are no eligible tokens to watermark. When α is 1, the original dataset D_o is skewed, with a very long tail of almost equal values.
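A synthetic histogram of this kind can be produced with the standard library alone. The sketch below assumes a Zipf-like form in which the token of rank r is drawn with probability proportional to r^(−α), which is consistent with α = 0 giving a uniform distribution; the exact generator used for the experiments is not specified here, and the function name is hypothetical.

```python
import random
from collections import Counter


def power_law_histogram(num_tokens, sample_size, alpha, seed=0):
    """Draw sample_size tokens with P(rank r) proportional to r ** (-alpha)."""
    rng = random.Random(seed)
    weights = [rank ** (-alpha) for rank in range(1, num_tokens + 1)]
    draws = rng.choices(range(num_tokens), weights=weights, k=sample_size)
    return Counter(draws)   # token id -> appearance frequency


# Smaller than the 1M-sample, 1K-token setting of the text, to keep it quick.
for alpha in (0.05, 0.5, 1.0):
    hist = power_law_histogram(num_tokens=100, sample_size=10_000, alpha=alpha)
    print(alpha, hist.most_common(3))
```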

Figure 2a shows the correlation between the skewness α of a dataset and the number of chosen pairs when the budget is b = 2 and the modulo value is z = 1031. When a dataset is almost uniform (i.e., α = 0.05), the solutions can select very few pairs since there are not many eligible items (i.e., the upper and lower boundaries are too small; in fact, many of them are 0). As α increases, the differences between the frequencies of tokens increase. Thus, the number of eligible items increases, which also increases the number of chosen pairs under a given budget. However, at some point (i.e., α = 0.7), the number of chosen pairs decreases because the tail of the frequency histogram becomes uniform. The same figure shows the superior performance of the optimal solution. The gap between the optimal and the heuristics is around 20% for most α values, while the two heuristics perform similarly to each other (differing by 0.02% on average).

Figure 2b illustrates how the modulo value z affects the number of chosen pairs. A smaller modulo value z yields a higher number of chosen pairs. The reason is that a smaller z leads to smaller remainders (modulo s_ij) that need to be eliminated, thereby yielding more selected pairs within a given budget b. When z is very small (i.e., 10), the three approaches are very close (see also Section 6.1 for the effect of z in terms of security). However, as z increases, our optimal approach selects many more pairs than greedy and random.

Figure 2c shows how the budget selection affects the performance comparison between the heuristics and the optimal. We set the modulo value z = 1031 and use the dataset with skewness α = 0.7. When we increase the budget b, the heuristics get closer to the optimal performance. This is expected, since even the optimal algorithm cannot select more than all the eligible pairs, and with a large budget even the heuristics can approach that.

4.5 Validation Using Real World Datasets

Next we apply FreqyWM to three real world datasets from different domains: (1) the Chicago Taxi dataset [9]; (2) a real click-stream dataset logging the URLs visited by a group of users of the eyeWnder advertisement detection add-on [19]; (3) the Adult dataset [2]. Our intention is to validate, using real data, the main conclusion of the previous evaluation with synthetic data, i.e., that the heuristic approaches perform well enough compared to the optimal. A second evaluation objective is to report the real processing time on an ordinary machine for generating and detecting the watermark using these real datasets.

Our experimental results are produced on a standard laptop with a dual-core Intel Core(TM) i7-5600U CPU at 2.5GHz, 16.00 GB RAM, and a 64-bit OS, and the implementation is in Python. For our watermark generation, we set the modulo value z = 131 and the budget b = 2, and employed SHA256. We run our algorithm 30 times and take the mean of the total computation time. Table 2 presents our validation results.

Taxi ID, URL, and Age were chosen as tokens for Chicago Taxi, eyeWnder, and Adult, respectively. After generation, for the Chicago Taxi, eyeWnder, and Adult datasets, our optimal solution chose 805, 38, and 21 pairs, respectively. Considering the heuristic approaches, greedy chose 770 pairs for Chicago Taxi, 33 pairs for eyeWnder, and 20 pairs for Adult, while random chose 773, 31, and 17 pairs for Chicago Taxi, eyeWnder, and Adult, respectively. Running times of computations for watermark generation on the Chicago Taxi,


(a) Effect of different skewness parameters (α) on chosen pairs by Optimal, Greedy, and Random.

(b) Effect of different modulo values (z) on chosen pairs by Optimal, Greedy, and Random.

(c) Chosen pairs by Greedy and Random with respect to Optimal.

Figure 2: Effects of parameters on the size of chosen pairs for watermarking.

Dataset            Size     Token     Distinct Tokens   |Le|    Optimal   Greedy   Random   Gen (sec)   Detect (sec)
Chicago Taxi [9]   9.68GB   Taxi ID   6573              33308   805       770      773      182.51      0.609
eyeWnder [19]      247MB    URL       11479             257     38        33       31       420.81      0.053
Adult [2]          4MB      Age       73                72      21        20       17       0.03        0.001

Table 2: Validation results on real world datasets. Dataset: The dataset used. Size: The size of the original dataset. Token: Definition of the token (e.g., the name(s) of the attributes). Distinct Tokens: The number of distinct tokens. |Le|: The number of eligible pairs. Optimal: The number of pairs chosen by the optimal matching. Greedy: The number of pairs chosen by the greedy matching. Random: The number of pairs chosen by the random matching. Gen: Running time of watermark generation. Detect: Running time of watermark detection.

eyeWnder, and Adult datasets were 182.51 secs, 420.81 secs, and 0.03 secs, respectively (excluding the generation of the histogram and of the watermarked data). For watermark detection, the total detection time for each watermarked dataset was less than 1 sec. As can be seen from Table 2, the number of chosen pairs increases with the number of eligible pairs. For instance, while eyeWnder has more distinct tokens (11479) than Chicago Taxi (6573), it has fewer eligible pairs (257) than Chicago Taxi (33308). Thus, FreqyWM selected more pairs for Chicago Taxi (805) than for eyeWnder (38).

4.6 Watermarking Multi-Dimensional Data

During our evaluation so far, on both real world and synthetic datasets, we set the token to be a single attribute (e.g., URL, Taxi ID, Age). However, as we previously stated, a token does not need to be restricted to a single attribute of a multi-dimensional dataset. A token can also be defined as a combination of more than one attribute (e.g., [Age, WorkClass] in the Adult dataset). We ran FreqyWM on the token [Age, WorkClass] with the same parameter setting as in Section 4.5; the number of tokens (i.e., distinct [Age, WorkClass] combinations in the real dataset) was 481, and the number of pairs chosen for watermarking was 20. With multi-dimensional data, removing a token appearance is as simple as with single-dimensional data. Increasing a token's frequency, however, is more involved. The reason is that just repeating the value of the token would leave the fields that are not part of the token without a value, e.g., all the fields other than Age and WorkClass in the Adult dataset. A naive solution would be to select a random appearance of the token and copy its other fields every time an additional instance of the token needs to be added to the watermarked dataset. This, however, could create semantic inconsistencies if there are constraints to be respected for individual attributes or combinations of them. Added appearances of a token must not lead to semantic inconsistencies, which, in addition to degrading the quality of the data, could also give away the existence of a watermarked pair to an attacker; ensuring this requires domain knowledge about what each dataset represents. Such knowledge, however, is orthogonal to all previous steps of our algorithms and can thus be applied as a last step, based on one's domain knowledge of the data, whenever a token's frequency needs to be increased.
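The naive duplication strategy described above can be sketched as follows. This is an illustration only: the rows are toy records shaped like the Adult dataset, the function name is hypothetical, and no semantic-consistency check is attempted, which is exactly the limitation the text points out.

```python
import random


def add_token_instances(rows, token_fields, token_value, count, seed=0):
    """Append `count` rows for a token by copying randomly chosen existing rows.

    rows: list of dicts; token_fields: attributes forming the token;
    token_value: tuple of values identifying the token.
    """
    rng = random.Random(seed)
    matches = [r for r in rows if tuple(r[f] for f in token_fields) == token_value]
    if not matches:
        raise ValueError("token does not occur in the dataset")
    # Each new row copies every field (token and non-token) of a random match.
    return rows + [dict(rng.choice(matches)) for _ in range(count)]


rows = [
    {"Age": 39, "WorkClass": "Private", "Education": "HS-grad"},
    {"Age": 39, "WorkClass": "Private", "Education": "Bachelors"},
    {"Age": 50, "WorkClass": "Self-emp", "Education": "Masters"},
]
out = add_token_instances(rows, ("Age", "WorkClass"), (39, "Private"), count=3)
print(len(out))
```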


5 Probabilistic Analysis of False Positives

In this section we develop a statistical bound, derived from Markov's inequality, to demonstrate that the false positive probability (i.e., the probability of accepting a dataset as watermarked when it is not) goes to zero if we increase the minimum number of pairs that have to be accepted, or if we decrease the threshold for accepting a pair as watermarked. We focus on an example where the remainders follow a Uniform distribution to show the results and some numerical examples, but Markov's bound is a general property and can be applied regardless of the distribution of the remainders. The threshold t for accepting a pair as watermarked determines how amenable the system is to false hits. By choosing lower values of t, the owner can increase its confidence that, if the detection algorithm finds its watermark in a suspected dataset, that dataset is probably an illegal copy of its watermarked token list. Recall that the i-th pair of tokens is accepted as watermarked if (f_i − f_j) mod s_ij ≤ t holds for this pair. The probability that this "watermarking statement" holds for the i-th pair of tokens is represented as P(X_i = 1) = p_i for i = 1, ..., n. The value of p_i depends on t. If t is zero, the number of false negatives will increase. If t increases, detection becomes more robust against slight modifications. But if we increase t too much, the chance of false positives (accepting a dataset as watermarked when it is not) also increases.
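The acceptance rule for a pair translates directly into code. The sketch below counts the pairs satisfying (f_i − f_j) mod s_ij ≤ t and declares the dataset watermarked when at least a minimum number of pairs is accepted; the function name, the histogram, and the pair moduli are hypothetical values chosen for illustration.

```python
def detect(hist, pairs, t, k):
    """pairs: {(ti, tj): s_ij}. Returns (number of accepted pairs, verdict).

    A pair is accepted when (f_i - f_j) mod s_ij <= t; the dataset is
    accepted as watermarked when at least k pairs are accepted.
    """
    accepted = sum(
        1
        for (ti, tj), s in pairs.items()
        if ti in hist and tj in hist and (hist[ti] - hist[tj]) % s <= t
    )
    return accepted, accepted >= k


hist = {"a": 95, "b": 60, "c": 28, "d": 10, "e": 50, "f": 7}
pairs = {("a", "b"): 7, ("c", "d"): 6, ("e", "f"): 5}
print(detect(hist, pairs, t=0, k=2))
print(detect(hist, pairs, t=3, k=3))
```

Raising t accepts the third pair as well, which shows the trade-off between robustness and false positives discussed in this section.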

Let us now define the random variable S_n = Σ_{i=1}^{n} X_i as the sum of n independent and not necessarily identically distributed Bernoulli random variables. Then S_n follows a Poisson-Binomial distribution [8] and takes values in {0, 1, 2, ..., n}, since it counts the number of pairs accepted as watermarked in a set of n total pairs of tokens. The probabilities p_i do not necessarily have to be the same for each pair of tokens; in fact, in reality they will be different. If they are the same, we have the particular case of the Binomial distribution.

Let us assume that the probabilities p_i follow a Uniform[0,1] distribution (see footnote 4). This way, the probability of success for a pair (i.e., of accepting a pair as a watermarked pair) is p_i = |t|/s_ij in our watermarking solution. The probability of having at least k successes in n trials can be written as P(S_n ≥ k) = Σ_{i=k}^{n} P(S_n = i). Note that k represents the minimum number of pairs that have to be "accepted" in order to say that the dataset D'_w is watermarked by the owner.

We are interested in studying the behavior of the probability P(S_n ≥ k) as a function of the parameters t and k. For this, we can use the Sandwich Rule of probability and Markov's upper bound, given by the inequality P(S_n ≥ k) ≤ μ/k, where μ = Σ_{i=1}^{n} p_i is the mean of S_n.
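To make the bound concrete, the exact survival probability P(S_n ≥ k) of a Poisson-Binomial variable can be compared numerically with Markov's bound μ/k. The sketch below uses a simple dynamic-programming convolution rather than a Fourier-based method, and draws the p_i from Uniform[0,1] with an arbitrary seed; the function names are hypothetical.

```python
import random


def poisson_binomial_pmf(ps):
    """PMF of the sum of independent Bernoulli(p_i), built one trial at a time."""
    pmf = [1.0]                        # distribution of the empty sum
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for j, q in enumerate(pmf):
            nxt[j] += q * (1 - p)      # trial fails: count stays at j
            nxt[j + 1] += q * p        # trial succeeds: count becomes j + 1
        pmf = nxt
    return pmf


def survival(ps, k):
    """P(S_n >= k)."""
    return sum(poisson_binomial_pmf(ps)[k:])


rng = random.Random(1)
ps = [rng.random() for _ in range(50)]     # n = 50 pairs, p_i ~ Uniform[0,1]
mu = sum(ps)
for k in (10, 25, 40, 50):
    print(k, survival(ps, k), min(1.0, mu / k))
```

Markov's bound is loose for moderate k, but it holds for every k ≥ 1 and for any choice of the p_i.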

What happens in the limit of t? The parameter t defines the robustness we want to have. The individual probabilities p_i directly depend on the value of t. If t tends to zero, the probabilities p_i will tend to zero for all i = 1, ..., n, and then μ will also tend to zero. Therefore, we need to see what happens when μ goes to zero. It is easy to see from Markov's bound that lim_{μ→0} P(S_n ≥ k) = 0. This means that if we decrease t, the probability of accepting a dataset as watermarked will decrease to zero.

Footnote 4: The probabilities p_i can follow different distributions, such as Uniform, Normal, or Beta.

What happens in the limit of k? The parameter k defines the minimum number of "accepted" pairs of tokens that we need in order to conclude that the dataset is watermarked. If we increase k, it becomes harder to "accept" a dataset as watermarked, so the probability of "acceptance" should decrease. If we make k tend to infinity, lim_{k→∞} P(S_n ≥ k) = 0.

Let us see what happens if we make k tend to 0. From the definition of the cumulative distribution function of a random variable, and since S_n is non-negative, we have that lim_{k→0} P(S_n ≥ k) = 1. Figure 3 shows the survival probabilities P(S_n ≥ k) against the value of k (where the probabilities p_i are Uniform[0,1]), computed using the Discrete Fourier Transform of the characteristic function (DFT-CF) of the Poisson-Binomial distribution for n = 50.

Figure 3: Survival probabilities for n = 50.

6 Security and Robustness Analysis

This section discusses the security and robustness of our FreqyWM method. We consider four attack scenarios: guess, sampling, destroy, and re-watermarking (false-claim) attacks.

6.1 Guess (Brute-Force) Attack

In the guess attack, the probabilistic polynomial time (PPT) adversary tries to guess the watermark, i.e., the secret embedded in the data. This is possible only if he can figure out a subset of token pairs {tk_i, tk_j}^N (where |D_w^hist|/2 ≥ N ≥ k) based on the watermarked data D_w, the random value R, and the modulo value z, such that the watermark detection algorithm run on these inputs (for some fixed k and t) returns accept.

Assuming that the hash function is collision resistant, R is random, and z is an integer, the probability of the attacker being successful is 1 / (|z| × C(K, N) × 2^λ), where K = |D_w^hist|/2 and C(K, N) is the binomial coefficient, i.e., it becomes negligible for typical parameter values (e.g., λ = 256) such as those used in our evaluation.
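A quick numerical check shows how small this probability is. The sketch below takes the success probability to be 1 / (|z| × C(K, N) × 2^λ), with C(K, N) the binomial coefficient, and works in log2 to avoid underflow. The values K = 500 and N = 139 are illustrative assumptions (1K distinct tokens and the 139 pairs mentioned for the α = 0.5 dataset), not figures reported for this attack, and the function name is hypothetical.

```python
import math


def log2_guess_probability(z, K, N, lam=256):
    """log2 of 1 / (z * C(K, N) * 2**lam)."""
    return -(math.log2(z) + math.log2(math.comb(K, N)) + lam)


# Illustrative parameters: z = 131, K = 500 candidate pairs, N = 139 chosen pairs.
print(log2_guess_probability(z=131, K=500, N=139))
```

Even without the binomial term, the 2^λ factor alone already bounds the probability by 2^(−256) for λ = 256.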

6.2 Sampling Attack

In this attack, A copies a random subsample from the watermarked dataset D_w in an attempt to exploit (pirate/steal) it, hoping that the watermark will not be detectable within the extracted sample. Figure 4 illustrates the attack for sample sizes from 1% to 90%, extracted from the watermarked dataset D_w, which was generated from an original dataset D_o with skewness parameter α = 0.5, 1M tokens in total, and 1K unique tokens. The number of pairs chosen by our optimal matching was 139. For each percentage and each subsample we apply the detection algorithm and compute the percentage of accepted pairs. Since the extraction of the subsamples is performed 100 times for each fixed percentage, we compute the average number of accepted pairs over all repetitions. Also, for each subsample detection experiment, we use different values of the threshold t for accepting a pair as watermarked, namely t ∈ {0, 1, 2, 4, 10}.

Figure 4: Sampling attack results with α = 0.5.

The attack scenario is as follows: A randomly selects x% of D_w, where x defines the percentage for the sampling attack (e.g., 1), giving a subsample of size 1M × x/100. For instance, for a 1% sampling attack, a subsample would have a total of 1M × 0.01 = 10K tokens. Note that if the sample size is greater than the number of distinct tokens, which is the number of items/tuples in D_w^hist, the sample would contain all the distinct tokens with high probability. This would mean that all the pairs chosen for the watermark would exist in the subsample.

Figure 4 shows that the size of the extracted subsample does not greatly affect the number of accepted pairs if it is greater than the number of unique tokens (1K). The reason is that every unique token will be found in the extracted subsample with high probability, and thus every watermarked pair of tokens will also be found. Under the sampling attack the frequencies of the tokens vary, which is why the value of t does affect the result. For example, with t = 0 the detection algorithm is able to detect around 36% (on average) of the watermarked pairs. When t increases from 1 to 10, the detection performance increases (on average) from 72% to 99.5%.

Let us now see the results when the size of the extracted subsample is very small, so that some of the 1K unique tokens of the original watermarked dataset may not appear in it at all. Figure 5 shows the results for sample size proportions between 0.0007% and 0.5%. It can be seen that if the sample size is greater than 5 times the number of unique tokens (1K), the detection algorithm stabilizes its performance for detecting the watermark. Below 2 times (2K), the performance starts to degrade more rapidly. In these extreme cases, the detection algorithm will have more difficulty detecting the data as watermarked. However, the utility of the data is highly compromised, since these are very small subsample sizes compared with the original size of 1M tokens. In the most extreme cases, the utility is also destroyed in the sense that only a small number of distinct tokens are found in the subsample, compared to the original number of 1K. We also tested the sampling attack on other watermarked datasets with different values of the skewness parameter α and obtained similar results.

Figure 5: Sampling attack results with very low sample size and α = 0.5.

6.3 Destroy Attack

In this case the attacker A tries to damage the watermark. Under the no-security-by-obscurity principle [23], A knows how the scheme works and tries to destroy the watermark so that it cannot be detected by the owner. A computes the histogram of the watermarked data D_w. A then modifies the frequencies of tokens as he pleases, either allowing re-ordering (changing the popularity/rank of the tokens) or not. We define these two types of destroy attack in detail below and discuss the robustness of FreqyWM against them. In order to measure the robustness, we run our optimal solution on a dataset where the skewness parameter α = 0.5, the modulo value z = 131, and the budget b = 2.

6.3.1 Destroy Attack without re-ordering

In this attack type, A can modify the frequencies without changing their order. We consider two variants: (1) the attacker changes the frequencies randomly within the given boundaries; (2) the attacker changes the frequencies by (at most) some percentage.

Changing the frequencies randomly within the upper and lower boundaries. In this attack type, A calculates the boundaries for each token. Then, A chooses a random value r_i for each tk_i as r_i ← (−l_i, u_i). A changes the frequency of the token and updates the upper boundary of the next token in the list by r_i.

Changing the frequencies by (at most) some percentage. A changes the frequencies of tokens by up to some percentage (e.g., 1%). To illustrate, A calculates the boundaries u_i (upper boundary) and l_i (lower boundary) for each tk_i and sets the percentage to 1%. He calculates u'_i = floor(u_i × 0.01) and l'_i = floor(l_i × 0.01). Then he draws a random value r_i in (−l'_i, u'_i) and thereby changes tk_i by at most ±1%. l'_i and u'_i already lie within the boundaries and cannot exceed them. After every change (f'_i = f_i + r_i), the boundary of the next element is updated as well. Thus, the attack never changes the ranking/ordering.
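Both variants of the attack without re-ordering can be simulated with one routine. The sketch below is an assumption-laden illustration: the boundaries are taken to be the gaps to the neighbouring tokens in the descending frequency list (their exact definition appears earlier in the paper), the top token's upper boundary and the last token's lower boundary are set to the token's own frequency, and the random offset is drawn from a closed integer interval. Setting `pct=1.0` corresponds to the first variant and `pct=0.01` to the 1% variant; the function name is hypothetical.

```python
import math
import random


def perturb_without_reordering(freqs, pct=0.01, seed=0):
    """freqs: frequencies sorted in descending order. Returns a perturbed copy
    whose descending order is preserved."""
    rng = random.Random(seed)
    out = list(freqs)
    for i, f in enumerate(freqs):
        # Assumed boundaries: gap to the already perturbed higher-ranked token
        # and gap to the not yet perturbed lower-ranked token.
        upper = out[i - 1] - f if i > 0 else f
        lower = f - freqs[i + 1] if i + 1 < len(freqs) else f
        u = math.floor(upper * pct)        # u'_i
        l = math.floor(lower * pct)        # l'_i
        out[i] = f + rng.randint(-l, u)    # f'_i = f_i + r_i
    return out


freqs = [1000, 700, 400, 100]
print(perturb_without_reordering(freqs, pct=1.0, seed=3))
print(perturb_without_reordering(freqs, pct=0.01, seed=3))
```

Because each token's upper gap is measured against the already perturbed neighbour, the new frequency can never overtake it, which is why the ranking is preserved.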

Figure 6: Percentage of verified pairs for the following datasets: (1) D_w: the original watermarked dataset (power-law popularity with skewness α = 0.5) without any attack, (2) D_non: a non-watermarked dataset defined over the same token space but with a different power-law popularity (skewness α = 0.7), (3) D_w^r: the watermarked dataset D_w after the random attack without re-ordering, (4) D_w^1: the watermarked dataset D_w after the attack that changes frequencies by at most 1%.

Figure 6 shows how robust FreqyWM is against these two destroy attacks. We perform the two attacks on the watermarked sample with α = 0.5. We then compare the success rate of the detection algorithm (the percentage of accepted token pairs for a given pair-acceptance threshold t) on the watermarked data modified by each attack. We also include in the figure a second dataset of skewness α = 0.7 that does not carry the watermark, and report how many of its pairs would be falsely verified for different values of t.

The blue line in the plot represents the watermarked data D_w without any modification. For the attack in which the frequencies are changed by (at most) some percentage (represented by the red line in the figure), FreqyWM can detect around 90% of the pairs when t = 0. When t is increased beyond the point where t ≥ (f_i − f_j) mod s_ij, the success rate converges at around 90%. For the attack where the frequencies are changed randomly within the upper and lower frequency boundaries (green line in Figure 6), FreqyWM can detect more than 35% of the pairs when t = 0. Note that the latter attack is more powerful than the former. The success rate grows with t; as shown, it reaches 90% when t goes to 10. From Figure 6, we can determine the parameter settings under which false negatives (rejecting a pair although it is watermarked) and false positives (accepting a pair as watermarked although it is not) can be avoided, so that the detection algorithm successfully detects an attacked watermarked dataset and rejects a dataset that was not watermarked. For instance, the false positive rate increases when the threshold t for accepting a pair increases while the minimum number of accepted pairs k required for detection decreases; this corresponds to the area under the results of the non-watermarked dataset with a different skewness parameter (the area under the orange line).

On the other hand, the false negative rate increases when the threshold t for accepting a pair decreases while the threshold k for detecting a watermark increases; if we consider a very strong attack, this corresponds to the area above the results of the attack without re-ordering (the area above the green line). For the owner, to avoid false negatives and false positives, convenient settings for the threshold t for accepting a pair and the minimum number of accepted pairs k for detecting a watermark lie between these two areas (between the orange and the green lines in Figure 6). However, if a weaker attack (changing the frequencies by some percentage) is considered, the range of these parameters increases (the area between the red and orange lines). Therefore, with a careful parameter setting, the detection algorithm can detect a watermarked dataset and reject a dataset that was not watermarked by the owner. Tuning these parameters depends largely on the specifics of each application use case and, as we explain in Section 8, is a topic of our future work plans.

6.3.2 Destroy Attack with re-ordering

In this attack type, an attacker A can modify the frequencies as he pleases without observing any ordering restrictions. Note that this attack introduces more noise than
