FreqyWM: Frequency WaterMarking for the New Data Economy



Devriş İşler

IMDEA Networks Institute

& Universidad Carlos III de Madrid [email protected]

Elisa Cabana IMDEA Networks Institute

[email protected]

Nikolaos Laoutaris IMDEA Networks Institute [email protected]

Abstract

We present a novel technique for modulating the appearance frequency of a few tokens within a dataset for encoding an invisible watermark that can be used to protect ownership rights upon data. We develop optimal as well as fast heuristic algorithms for creating and verifying such watermarks.

We also demonstrate the robustness of our technique against various attacks and derive analytical bounds for the false positive probability of erroneously “detecting” a watermark on a dataset that does not carry it. Our technique is applicable to both single dimensional and multidimensional datasets, is independent of token type, and can be used in a variety of use cases that involve buying and selling data in contemporary data marketplaces.

1 Introduction

Data-driven decision making powered by Machine Learning (ML) algorithms is changing how society and the economy work and is having a profound positive impact on our daily lives. A McKinsey report predicted that data-driven decision-making could reach US$2.5 trillion globally by 2025 [4]. ML is driving up the demand for data in what has been called the fourth industrial revolution. To satisfy this demand, several data marketplaces (hereinafter, DMs) have appeared in the last few years. DMs are mediation platforms that aim to connect the two primary stakeholders of the data value chain, namely the data providers/sellers and the data buyers. DMs take care of tasks such as searching/cataloging, facilitating price negotiation, storing and transferring data, etc. This ecosystem includes open data repositories (e.g., Dataverse, Kaggle), general-purpose DMs (e.g., AWS, DataRade, DAWEX, DIH, Advaneo), and specialised or niche DMs for specific industries, such as automotive (e.g., Otonomo, Caruso), financial (e.g., Refinitiv, Battlefin), marketing (e.g., LOTAME, LiveRamp), and logistics (e.g., Veracity), to name a few. See [6] for an extensive survey of close to 200 DMs and other data providers.

The problems: The majority of DMs and data sellers have explicit clauses indicating that purchased datasets cannot be resold or used outside the jurisdiction of the buyer. But as with all digital assets, being able to copy/store/transmit datasets with close to zero cost makes creating illegal copies very easy. Even worse, unlike media content and software, the issue of ownership is less obvious when it comes to datasets.

Any popular movie, song, e-book, or piece of software can usually be attributed to a director/producer/actor, musician, author, or company, respectively, but the same is hard to do for large datasets. Consider an anonymised mobility dataset logging the movement of people in a city. Such a dataset may have been produced by collecting GPS readings from the smartphones of individuals using a map application, or it may be deduced by analysing cell phone traces [26] or Call Detail Records (CDRs) maintained by mobile operators.

With weak to nonexistent ownership guarantees, it is difficult to imagine that the data economy will ever flourish and reach its projected potential. Indeed, any sold copy of a dataset can be ‘pirated’ by a buyer-turned-seller that can then resell the same dataset in a DM at a potentially lower price, thereby undercutting the rightful owner and rendering his investment useless. Piracy has threatened the software and media industries in the past and may very well do the same to the fledgling data economy by discouraging companies from investing in expensive measurement probes, big data infrastructures, and human labor costs to clean and curate commercial datasets.

A novel watermarking technique for data: In this paper we develop a technique for protecting dataset ownership from piracy. We present a novel Frequency Watermarking technique, henceforth FreqyWM,¹ for hiding a secret within a dataset in a manner that makes the said secret indistinguishable from the data that it protects. The main idea behind FreqyWM is to modify slightly the appearance frequency of existing tokens within a dataset in order to create a secret in the form of a complex relationship between the frequencies of different tokens. By making this relationship complex enough, we can reduce the probability that it appeared by chance close to zero. Therefore, by revealing knowledge of such a secret relationship, a party can claim ownership over a dataset because the only practical way of knowing such a secret is to have inserted it in the data in the first place.

¹ Freqy is pronounced as freaky.

A token may be a word, a database record, a URL, or any repeating value within a structured or semi-structured commercial dataset. Our secret is created by first selecting a number of token pairs. Then for each pair, we slightly modulate the frequency counts of its tokens in order to make their difference yield zero under modulo N arithmetic. This can be easily done by adding or removing some instances of one, the other, or both tokens. By increasing the number of selected pairs we can make our watermark more resistant to attacks, as well as less likely to have appeared by chance.

FreqyWM can be used to achieve several things. First and foremost, by revealing knowledge of the secret encoded by the watermark, a data seller can prove rightful ownership of a dataset to a third party, such as a DM. This can be used to distinguish a rightful owner from a pirate that may, for example, attempt to upload and monetize a pirated dataset in one or several DMs. If the DM or the rightful owner detects such an event, the owner can demand that the dataset be removed and the pirate be banned. This would mimic what websites like YouTube do to protect copyrighted content. Detecting the presence of pirated copies can be achieved using content similarity [18], locality sensitive hashing [31], and even hashing similarity [7], techniques that go beyond the scope of watermarking.

In addition to proving ownership, our watermarking technique can also be used to reveal who may have leaked (copied/pirated) a dataset in the first place. A dataset seller or a DM may create a different watermark for every buyer and, in addition to encoding it into the data, also store a description of it in some immutable index, such as a blockchain. Then, if an unauthorized copy of the dataset is found at a later point, the culprit can be identified by looking up its watermark against the said index.

Our contributions: Our first and foremost contribution is the idea of using the appearance frequency of tokens to encode invisible watermarks upon datasets traded in modern DMs. We establish a family of such watermarks using frequency pairs and modulo arithmetic and prove that creating an optimal FreqyWM reduces to solving a Maximum Weighted Matching (MWM) problem [15,28] combined with a polynomial special version of the 0/1 Knapsack problem [11] involving items of equal value but different weights. Our second major contribution is the extension of frequency-based watermarking to make it resilient against a series of attacks. In particular, we protect our data engineering technique against a Guess Attack attempting to identify our watermarked pairs and secrets to impersonate the rightful owner. We make such an attack computationally hard by introducing a high-entropy secret while generating the watermark. We also protect against a Re-watermarking Attack mounted by having a pirate inject his own watermark upon an already watermarked dataset, and then present the former as a false proof of ownership. We thwart such an attack by describing a simple protocol capable of ordering chronologically multiple watermarks that may be carried by different versions of the same dataset. We also protect against a Destroy Attack mounted by an attacker attempting to destroy our watermark by changing the frequency of different tokens in the dataset. By relaxing the modulo arithmetic rule used during the verification of a particular watermark pair, as well as the percentage of pairs to be detected before the entire watermark is verified (accepted), we oblige the attacker to effectively also destroy the actual data in the process of attempting to destroy the watermark.

Finally, we show that our technique is robust to a Sampling Attack in which the attacker attempts to pirate only a random sample of the watermarked data.

Our final contribution is an extensive performance evaluation study aiming to explain how the main parameters of FreqyWM impact major performance metrics under different attack scenarios using synthetic and real-world datasets.

Our findings: Next we summarize the results from our performance evaluation study.

• We show that as long as there exists sufficient variation in the frequencies of different tokens, FreqyWM can encode robust watermarks with minimal distortion of the initial data. Our technique becomes inapplicable only when token appearance frequencies are almost uniform, because in this case there is not a sufficient gap between different frequencies for encoding a watermark.

• Regarding the false positive probability, i.e., “detecting” a watermark on a dataset that does not carry it, our analytical bounds show that it quickly goes to zero as we increase the number of pairs in our watermark.

• We demonstrate that a Guess Attack has negligible probability of success, which makes it infeasible in almost all practical cases. On the other hand, the rightful owner, or any party that is given the secret key for verifying the watermark, can verify it very fast, in linear time.

• In terms of Sampling Attacks, we show that with the exception of very small samples (of correspondingly small value), our verification algorithm is still capable of detecting our watermark. Achieving this requires using the relaxed detection algorithm that trades robustness to attacks with false positives. For example, on a sample of 20% and with thresholds that impose tiny false positive probability, the watermark detection probability exceeds 90%.

• In terms of Destroy Attacks, we show that a watermark that imposes (costs) a tiny 0.0002% distortion on the original data remains detectable even under attacks that add random noise imposing a 90% modification on the original data.

• Compared to existing solutions from the literature [3,30] that are applicable only to numerical data and preserve only the mean value of the watermarked data, FreqyWM allows a data owner to control the exact amount of distortion introduced by the watermark in terms of cosine or other similarity metrics which, under [3,30], may become unbounded. For example, a FreqyWM watermark that imposes only 0.0002% distortion in terms of cosine similarity is stronger than a watermark from [30] that imposes a 46.72% distortion under the same metric.

2 Related Work

Database watermarking is the closest type of watermarking to our work. There are of course other types of watermarking and fingerprinting, e.g., for sequential datasets [5] and genomic datasets [21], but as they focus on very different data types we do not go into more detail about them. Going back to database watermarking, the first known technique was proposed by Agrawal et al. [3] in 2002. It is applicable to numerical data only and guarantees that the introduced distortion does not dramatically change the median and standard deviation of the data. The watermark information is normally embedded in the Least Significant Bit (LSB) of features of relational databases to minimize distortion.

Other database watermarking solutions introduce distortion by considering the statistics of numeric values [14,17,25,30].

The solutions proposed in [3,30] focus on keeping the change in the data's statistics (i.e., median and standard deviation) to a minimum.

To avoid distortion, distortion-free database watermarking schemes have also been proposed. The techniques in [12,13] are distortion-free and introduce fake tuples or columns in the original database. The fake tuples or columns are created from a watermark secret by computing a secret function, which makes the watermark visible and easy to remove.

Survey papers such as [22,24,29] compare database watermarking techniques in terms of verifiability, distortion, supported data types, and other aspects. Many of these solutions are applicable only to numerical data and thus cannot be applied to a range of commercial datasets, e.g., to web-browsing click-streams.² Moreover, these solutions do not allow a data owner to limit the amount of distortion introduced by the watermark but merely to preserve the basic statistics of the data in terms of its first few moments.

² Note that categorical watermarking techniques would replace tokens in a dataset with another token in the same category, which causes an undesired distortion and requires certain predefined categories. Text watermarking, in turn, would change a token (e.g., by replacing a word with another similar word); this (insecure) change may invalidate a token (e.g., causing an invalid URL).

3 Preliminaries

Let λ ∈ N be a security parameter. A probabilistic polynomial time (PPT) algorithm A is a probabilistic algorithm that takes 1^λ as input and has running time bounded by a polynomial in λ. || denotes concatenation. Table 1 presents the notation used throughout the paper.

Hash Function.

Definition 1 A hash function H (chosen from a family of such functions) is a deterministic function from an arbitrary size input to a fixed size output, denoted H: {0, 1}^∗ → {0, 1}^λ. The hash function is collision resistant if it is hard to find two different inputs m₀ ≠ m₁ that hash to the same output H(m₀) = H(m₁).

Maximum Weight (Perfect) Matching (MWM).

Definition 2 Let G = (V, E) be a graph with a set of vertices V = {v₁, v₂, . . . , v_{|V|}} and a set of edges E = {e₁, e₂, . . . , e_{|E|}} where each edge has a weight denoted by w(e_i). A set M ⊆ E is a matching if no two edges in M have a common vertex. A vertex v is matched by M if it is an endpoint of an edge in M, and unmatched otherwise. The maximum weight matching problem asks for a matching M of maximum weight in a given input graph G = (V, E), which is formulated as

max ∑_{e_i ∈ E} w(e_i) x_i  s.t.  x_i ∈ {0, 1},

where x_i = 1 if and only if e_i ∈ M.

Equally Valued 0/1 Knapsack Problem (QKP).

Definition 3 Let T = {t₁, t₂, . . . , t_n} be the set of items. Each item t_i is worth the same value val and has a weight w_{t_i}. The problem is to fill a knapsack of capacity W with as many items as possible. The assumption is that the weights w_{t_i} and the total weight W are integers. The greedy solution fills the knapsack by adding the items, sorted by increasing weight, until the capacity W is reached. While the general 0/1 Knapsack problem is known to be NP-Hard [11], this special equally valued 0/1 Knapsack problem has a polynomial time (greedy) solution.
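The greedy solution of Definition 3 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `equal_value_knapsack` and the example items are hypothetical.

```python
def equal_value_knapsack(weights, capacity):
    """Greedy solution to the equally valued 0/1 knapsack problem.

    weights:  dict mapping item -> non-negative integer weight
    capacity: integer capacity W
    Returns the chosen items; since every item has the same value,
    taking the lightest items first maximises the number of items.
    """
    chosen, used = [], 0
    for item, w in sorted(weights.items(), key=lambda kv: kv[1]):
        if used + w > capacity:
            break  # every remaining item is at least as heavy
        chosen.append(item)
        used += w
    return chosen


# Hypothetical items and capacity, for illustration only.
items = {"t1": 5, "t2": 2, "t3": 9, "t4": 3}
picked = equal_value_knapsack(items, capacity=10)
```

Sorting dominates the running time, so the sketch runs in O(n log n).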

Table 1: Notation

Symbol   Description
D_o      The original data to watermark.
D_w      Watermarked version of D_o.
tk_i     The ith token.
f_i^o    Frequency of the ith token in D_o.
f_i^w    Frequency of the ith token in D_w.
R        A high-entropy secret.
L_wm     A list of chosen token pairs for watermarking.
L_sc     A list of secrets required for watermark detection.
L_e      A list of eligible token pairs for watermarking.
k        Threshold for detecting a watermark.
t        Threshold to accept a pair as watermarked.
b        A budget threshold for the distortion that watermarking can introduce.


4 Frequency-based Watermarking

4.1 Overview

Our new private-key based watermarking scheme, called FreqyWM, is blind (does not require the original data), primary-key free, robust, and secure against guess, sampling, destroy, and false-claim attacks, with high utility and a good trade-off between the complexity of the transformation and the algorithmic efficiency of the solution we present.

Our FreqyWM scheme consists of two main algorithms: the watermark generation algorithm WMGenerate and the watermark detection algorithm WMDetect. WMGenerate generates watermarked data based on a budget b capturing how much the watermarked data may differ from the original, e.g., in terms of the cosine (or another) similarity metric between their token frequency distributions. By calling WMGenerate, the owner creates a watermarked version of her data, consisting of tokens, that allows her to prove her ownership. WMDetect detects whether a suspected dataset holds the owner's watermark using the owner's secrets produced by WMGenerate and two thresholds (k and t). If WMDetect outputs accept (verified), this is evidence that the owner can claim ownership of the watermark and thus of the data. By nature, WMDetect can be run as many times as desired in private but only once in public, because a public run requires the data owner to reveal the secret behind the watermark to the public (or to whoever must verify it, e.g., a judging third party). As part of our future work we are also looking at public verifiability without revealing the private key (see Section 8). We introduce the following two constraints that must be satisfied by FreqyWM:

Constraint 1: WMGenerate should not change the ranking of tokens. For instance, let us assume that tokens are URLs. If the most frequent token in the original histogram of the dataset is YouTube, then in the watermarked histogram YouTube must preserve its rank despite any increase or decrease in the frequencies of the histogram tokens. Preservation of ranking does not of course imply that the frequencies of individual URLs need to remain intact; they can be modified as long as their ranking is preserved.

Constraint 2: The (cosine) similarity³ between the original (histogram) data and its transformation to watermarked (histogram) data should not be any less than (100 − b)%, where b is a budget. The input b is determined by the owner. For instance, the owner may want to keep the distortion due to watermark generation within a given budget.
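The two constraints can be checked directly on a pair of histograms. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, frequencies are given as lists in the original rank order, and ties between neighbouring frequencies are treated as rank-preserving.

```python
import math


def cosine_similarity(f_orig, f_wm):
    """Cosine similarity between two frequency vectors of equal length."""
    dot = sum(a * b for a, b in zip(f_orig, f_wm))
    norm = math.sqrt(sum(a * a for a in f_orig)) * math.sqrt(sum(b * b for b in f_wm))
    return dot / norm


def ranking_preserved(f_wm):
    """Constraint 1: frequencies listed in original rank order stay non-increasing."""
    return all(f_wm[i] >= f_wm[i + 1] for i in range(len(f_wm) - 1))


def within_budget(f_orig, f_wm, b):
    """Constraint 2: similarity, as a percentage, is at least (100 - b)."""
    return 100.0 * cosine_similarity(f_orig, f_wm) >= 100.0 - b


f_orig = [1098, 800, 537]  # hypothetical histogram, sorted by rank
f_wm = [1075, 800, 559]    # the same tokens after a hypothetical watermark
ok = ranking_preserved(f_wm) and within_budget(f_orig, f_wm, b=2)
```

Other similarity metrics could replace `cosine_similarity` without changing the structure of the check.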

Next we describe the general idea behind FreqyWM, also illustrated in Figure 1. To make the description easier, we assume that tokens are TLD URLs, to be consistent with our evaluation section later. Of course, our technique is general and can be applied to any repeating token beyond just URLs, as we explain in Section 4.6.

³ We take advantage of the definition of cosine similarity by NIST [27].

1. The data owner, holding a list of URLs visited while browsing the Internet, creates a dataset D_o using the domain of each URL as a token.

2. FreqyWM creates a histogram of the original dataset D_o such that it computes the frequency of each unique token and sorts them in a descending order (e.g., YouTube is the most visited, Facebook is the second, and so on).

3. FreqyWM identifies eligible pairs L_e of tokens that are candidates to be watermarked using some secrets (i.e., R).

4. FreqyWM selects pairs of tokens for watermarking from the eligible pairs, denoted by L_wm, based on budget constraints (parameter b). To select which tokens should be paired, FreqyWM benefits from the solutions of two well-known problems: Maximum Weight Matching (MWM) and the Equally Valued 0/1 Knapsack problem (QKP). Eligible pairs are converted to a graph representation where a vertex represents a token tk_i and an edge represents a pair. FreqyWM applies maximum weight matching to select pairs. After the pairs are selected, the Equally Valued 0/1 Knapsack problem is employed to satisfy Constraint 2, and the final pairs for watermarking are chosen.

5. FreqyWM modifies the frequencies of the selected tokens so that the difference between the frequencies of each pair is equal to 0 under a modulus that is calculated from the secrets and the tokens in the pair. To make this concrete, assume that the frequencies of a chosen pair, e.g., youtube.com and instagram.com, are 1098 and 537, respectively, in a commercial click-stream dataset such as those sold by companies like comScore. Assume also that we select a modulus of 129. The difference between the two frequencies modulo 129 is 45. To set it to 0, we need to change the appearance frequencies of youtube.com and instagram.com in the dataset. The remainder 45 is split in two as 23 (ceiling) and 22 (floor). The new frequencies of youtube.com and instagram.com become 1098 − 23 = 1075 and 537 + 22 = 559, so that (1075 − 559) mod 129 ≡ 0. We achieve this by removing 23 instances of youtube.com from the dataset and adding 22 instances of instagram.com. However, when the remainder is greater than half of the modulus (e.g., (1098 − 537) mod 21 ≡ 15), we instead raise the difference to the next multiple of the modulus, ⌈(1098 − 537) ÷ 21⌉ × 21 = 567, which changes it by 21 − 15 = 6 rather than by 15. This way, we never have to eliminate remainders that exceed half of the modulus.

6. FreqyWM inserts or removes tokens depending on the frequencies and produces a watermarked dataset D_w.
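The arithmetic of step 5 can be reproduced with a short sketch. It assumes f_i ≥ f_j and reads the large-remainder case as widening the difference to the next multiple of the modulus; the function name `modulate_pair` and the way the widening is split between the two tokens are assumptions made for illustration, not details given in the text.

```python
import math


def modulate_pair(f_i, f_j, s):
    """Return new (f_i, f_j) such that (f_i - f_j) % s == 0, assuming f_i >= f_j."""
    r = (f_i - f_j) % s
    if r <= s / 2:
        # Shrink the difference by r: ceil(r/2) off the larger count,
        # floor(r/2) onto the smaller count.
        return f_i - math.ceil(r / 2), f_j + r // 2
    # Remainder above half the modulus: widen the difference by s - r instead.
    # Splitting the widening across both tokens is an assumption of this sketch.
    gap = s - r
    return f_i + math.ceil(gap / 2), f_j - gap // 2


shrunk = modulate_pair(1098, 537, 129)  # remainder 45, at most half of 129
widened = modulate_pair(1098, 537, 21)  # remainder 15, more than half of 21
```

In both branches each token moves by at most ⌈s/2⌉, which is the quantity the eligibility test of Section 4.2.1 compares against the token boundaries.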

[Figure 1 (diagram). Watermark generation, from original data to watermarked data: 1. Histogram Generation; 2. Eligible Tokens; 3. Optimal Selection (3.1 Graph Representation, 3.2 MWM, 3.3 QKP) under a budget; 4. Frequency Modification; 5. Data Transformation (final token selection). Watermark detection: Detection on the watermarked data using the watermarking secrets and the detection thresholds. MWM: Maximum Weight Matching; QKP: Equally Valued 0/1 Knapsack.]

Figure 1: FreqyWM Overview illustrated based on a (Top Level Domain, TLD) URL dataset. URLs chosen as a pair for watermarking are represented with the same colored frequencies (e.g., Youtube and Google) while the ones that are not selected for watermarking are colored black (e.g., CNN).

4.2 Detailed Description of FreqyWM

4.2.1 Watermark Generation

The data owner holds data D_o (which we refer to as the original data) and defines a budget b that determines how much distortion a watermark can introduce (e.g., b = 2 means cosine similarity must be at least 98%). Then, given these two inputs, the watermarking algorithm follows the steps of Algorithm I in order to generate the optimal watermark, i.e., the watermark with the largest number of watermarked pairs within the given budget b. For our optimal solution, we reduce our problem to the MWM and QKP problems (see Section 4.3). We also devise two faster heuristic approaches: greedy and random. In our solutions, we do not consider parameter optimizations such as selecting the optimal threshold and budget, since this is out of scope for this work.

The watermark generation algorithm proceeds as follows:

1. pre-processes D_o to generate the frequencies (number of appearances of each unique token) of the token domains and calculates the boundaries of each token as D^hist_o = Preprocess(D_o). Each token domain tk_i holds an (original) frequency f_i^o, an upper boundary u_i (unbounded for the most frequent token), and a lower boundary l_i. Initially, the upper boundary of tk_0, the token with the highest frequency in the histogram, is set to infinity, u_0 = ∞, since we can increase the frequency of tk_0 as much as we want, and the lower boundary of tk_{|D^hist_o|−1} is set to its frequency, l_{|D^hist_o|−1} = f^o_{|D^hist_o|−1}. For the rest of the boundaries, u_i is defined as the difference between f^o_{i−1} and f^o_i, while l_i is assigned as f^o_i − f^o_{i+1}.

Remark: We need these upper and lower boundaries to keep the distortion introduced by the watermark at a minimum. For instance, after watermarking, YouTube is still the most visited and Facebook is still the second most visited, but their frequencies will have changed.

2. generates a random number R ← {0, 1}^λ and an integer z ∈ Z⁺.

3. computes s_{i j} values for modulo operation as s_{i j} = H(tk_i||tk_j||R) mod z.

4. defines all eligible pairs as L_e ← Eligible(D^hist_o, {s_{i j}}) using the generated s_{i j} values and the (token) pairs associated with them.

Remark: The eligible pairs are needed for creating the relation that eventually transforms dataset D_o into dataset D_w.


5. runs the optimal matching algorithm, which chooses the optimal set of token pairs from the eligible pairs using the frequencies and the s_{i j} values, as L_wm ← OptMatch(D^hist_o, L_e, {s_{i j}}, b), where b represents the budget.

6. creates new frequencies for the tokens chosen by the optimal matching algorithm such that (f_i^w − f_j^w) mod s_{i j} ≡ 0 (see Section 4.3 for the optimal matching solution).

Remark: The positions at which tokens are added are important for the security of FreqyWM against the guess attack. Therefore, new tokens should be added at random positions (see Section 4.6 for more discussion).

7. generates or removes tokens based on the new frequencies.

8. returns the list of tokens D_w and stores L_wm, the value z, and the random value R as a list L_sc.

Algorithm I: Watermark Generation
Input: D_o, b
Output: D_w, L_sc
    D^hist_o = Preprocess(D_o)
    R ← {0, 1}^λ, z ← Z⁺
    foreach {tk_i, tk_j}_{i ≠ j} ∈ D_o do
        s_{i j} = H(tk_i || tk_j || R) mod z
    end
    L_e ← Eligible(D^hist_o, {s_{i j}})
    L_wm ← OptMatch(D^hist_o, L_e, {s_{i j}}, b)
    foreach {tk_i, tk_j} ∈ L_wm do
        D^hist_o.Update(f_i^w, f_j^w, s_{i j})
    end
    D_w ← Create(D^hist_o, D_o)
    L_sc = {L_wm, R, z}
Result: D_w, L_sc = {L_wm, R, z}
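The first steps of Algorithm I (Preprocess and the computation of s_{i j}) can be sketched as follows. This is an illustrative sketch under assumptions rather than the authors' code: SHA-256 stands in for the unspecified hash H, tokens are concatenated as UTF-8 bytes, and the names `preprocess` and `pair_modulus` as well as the toy data are hypothetical.

```python
import hashlib
import secrets
from collections import Counter


def preprocess(tokens):
    """Histogram in descending frequency order, with per-token boundaries."""
    hist = Counter(tokens).most_common()  # [(token, freq), ...], most frequent first
    n = len(hist)
    rows = []
    for i, (tok, f) in enumerate(hist):
        upper = float("inf") if i == 0 else hist[i - 1][1] - f  # u_i
        lower = f if i == n - 1 else f - hist[i + 1][1]         # l_i
        rows.append({"token": tok, "freq": f, "upper": upper, "lower": lower})
    return rows


def pair_modulus(tk_i, tk_j, R, z):
    """s_ij = H(tk_i || tk_j || R) mod z, with SHA-256 assumed for H."""
    digest = hashlib.sha256(tk_i.encode() + tk_j.encode() + R).digest()
    return int.from_bytes(digest, "big") % z


tokens = ["a.com"] * 10 + ["b.com"] * 6 + ["c.com"] * 3  # toy dataset
rows = preprocess(tokens)
R = secrets.token_bytes(32)  # high-entropy secret, here lambda = 256 bits
s_ab = pair_modulus("a.com", "b.com", R, z=131)
```

A production implementation would likely add an unambiguous separator or length prefix between the concatenated fields; plain concatenation is kept here because it mirrors the formula in the text.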

Generation of Eligible Pairs: A set of eligible pairs (L_e) is generated by an algorithm Eligible based on the given pairs {tk_i, tk_j} and the corresponding s_{i j} values. A pair is accepted as eligible if the boundaries of each token in the pair are at least ⌈s_{i j}/2⌉, where s_{i j} ≥ 2. Here s_{i j} cannot be 0 or 1 because modulo 0 is undefined and any value modulo 1 is 0.
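The eligibility test can be sketched as below. The sketch assumes that "the boundaries of each token" means both the upper and the lower boundary of both tokens; the function name `eligible_pairs`, the row layout, and the constant modulus used in the example are hypothetical.

```python
import math
from itertools import combinations


def eligible_pairs(rows, modulus_of):
    """Return (tk_i, tk_j, s_ij) for every pair that passes the eligibility test.

    rows:       list of dicts with keys 'token', 'upper', 'lower', in rank order
    modulus_of: callable (tk_i, tk_j) -> s_ij
    """
    out = []
    for a, b in combinations(rows, 2):
        s = modulus_of(a["token"], b["token"])
        if s < 2:
            continue  # modulo 0 is undefined, modulo 1 is always 0
        need = math.ceil(s / 2)
        # Assumption: both boundaries of both tokens must leave room for ceil(s/2).
        if all(min(r["upper"], r["lower"]) >= need for r in (a, b)):
            out.append((a["token"], b["token"], s))
    return out


rows = [
    {"token": "a.com", "upper": float("inf"), "lower": 10},
    {"token": "b.com", "upper": 10, "lower": 8},
    {"token": "c.com", "upper": 8, "lower": 1},
]
L_e = eligible_pairs(rows, lambda x, y: 6)  # constant modulus for illustration
```

Here only the pair (a.com, b.com) survives, because c.com has a lower boundary of 1, which is below ⌈6/2⌉ = 3.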

4.2.2 Watermark Detection

Algorithm II gives our watermark detection algorithm as pseudo-code. In the detection algorithm, the data owner suspects a token dataset D′_w and wishes to know whether it carries the owner's watermark in order to claim ownership. The owner holds its secret list L_sc. Given a list of tokens (D′_w), the detection algorithm checks whether the given list belongs to the owner based on the secret list L_sc and the pre-defined parameters k and t. The detection algorithm can be run by the owner (or whoever holds the secrets). If the owner wants to prove to a third party that it is indeed the owner, it has to reveal its secrets to that party. Consequently, ownership can be proven in public only once (see Section 6.4 for further discussion). The algorithm proceeds as follows:

1. generates the histogram list D^hist_w of the suspected dataset D′_w as in the watermark generation algorithm. However, the algorithm does not need to calculate the boundaries (as it did in WMGenerate). It only calculates the frequencies of the tokens.

2. sets a counter as count = 0.

3. for each token pair {tk_i, tk_j} in L_wm, checks if it exists in D^hist_w. If it is found, it creates the s_{i j} values as s_{i j} = H(tk_i || tk_j || R) mod z.

4. checks for each found {tk_i, tk_j} if the following statement holds: (f_i − f_j) mod s_{i j} ≤ t. If it holds, the counter is increased by 1.

Remark: t is a value between 0 and s_{i j} − 1. How to set t (and k) depends on the robustness the owner wants. The owner decides these parameters by itself (see Sections 4.3 and 5 for further discussion of t and k).

5. returns accept (verified) if count is greater than or equal to the threshold k. Returns reject otherwise.

Algorithm II: Watermark Detection
Input: D′_w, L_sc = {L_wm, R, z}, k, t
Output: accept/reject
    D^hist_w = Preprocess(D′_w)
    count = 0, result = reject
    foreach {tk_i, tk_j} ∈ L_wm do
        if Found(tk_i, tk_j, D^hist_w) then
            s_{i j} = H(tk_i || tk_j || R) mod z
            if (f_i − f_j) mod s_{i j} ≤ t then
                count++
            end
        end
    end
    if count ≥ k then
        result = accept
    end
Result: result
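Algorithm II translates almost line by line into Python. The sketch below is illustrative only: SHA-256 is assumed for H, the names `pair_modulus` and `wm_detect` are hypothetical, and the example picks a one-byte R purely so that the toy pair gets a usable modulus (s ≥ 2); a real R would be a high-entropy secret.

```python
import hashlib
from collections import Counter


def pair_modulus(tk_i, tk_j, R, z):
    """s_ij = H(tk_i || tk_j || R) mod z, with SHA-256 assumed for H."""
    digest = hashlib.sha256(tk_i.encode() + tk_j.encode() + R).digest()
    return int.from_bytes(digest, "big") % z


def wm_detect(tokens, L_wm, R, z, k, t):
    """Count pairs whose frequency difference is within t of 0 modulo s_ij."""
    hist = Counter(tokens)  # frequencies only; no boundaries are needed
    count = 0
    for tk_i, tk_j in L_wm:
        if tk_i not in hist or tk_j not in hist:
            continue
        s = pair_modulus(tk_i, tk_j, R, z)
        if s < 2:
            continue  # cannot occur for pairs that passed the eligibility test
        if (hist[tk_i] - hist[tk_j]) % s <= t:
            count += 1
    return "accept" if count >= k else "reject"


pair = ("youtube.com", "instagram.com")
z = 97
# Toy secret: the first one-byte value that yields s >= 2 for the example pair.
R = next(bytes([n]) for n in range(256) if pair_modulus(*pair, bytes([n]), z) >= 2)
s = pair_modulus(*pair, R, z)
marked = [pair[0]] * (5 + 2 * s) + [pair[1]] * 5  # difference is a multiple of s
noisy = marked + [pair[0]]                        # one extra token shifts it by 1
```

With t = 0 the noisy copy is rejected, while t = 1 tolerates the single inserted token, which shows how the relaxed threshold t trades robustness against false positives.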

4.3 Optimal and Heuristic Approaches

We now discuss our optimal and heuristic solutions for the matching function in the watermark generation algorithm.

First, we show that the optimal set of watermark pairs for a given budget b is given by the solution of the MWM problem followed by a QKP on its output. Given that all watermarked pairs have equal value in terms of proving ownership of the data, an optimal watermark is just a watermark of maximum size in terms of watermarked pairs, within the defined constraints. Later, we describe a greedy and a random heuristic for the same problem. Last, we give our experimental results by comparing the two heuristic approaches to the optimal.

4.3.1 Optimal Matching

We prove that Algorithm I inserts an optimal watermark with a given budget b considering the constraints described in Section 4. Our pairing problem is constructed based on Definition 2 and Definition 3. Now, let us define our optimal watermarking:

Definition 4 (Optimal Watermarking) Let OptWM(G(V, E), b) be the optimal watermarking with a budget of b among an eligible set of pairs L_e, which is represented as a connected undirected graph G(V, E). The optimal watermarking produces the maximum number of edges (pairs) while not exceeding the budget b.

Let G = (V, E) be a connected undirected graph which is the representation of the frequencies derived from the eligible pairs L_e. V = {v₁, v₂, . . . , v_{|V|}} where v_i represents tk_i, and E = {e₁, e₂, . . . , e_{|E|}} where e(v_i, v_j) is the edge between v_i and v_j. The weight of an edge e(v_i, v_j), w(e_i), is equal to T − ((f_i − f_j) mod s_{i j}) (recall that f_i is the frequency of tk_i), where T is a large value (e.g., T > H, where H is the highest difference between two frequencies in the eligible pairs). With the eligible pairs represented as a graph, our optimal watermarking problem reduces to finding the maximum number of edges (pairs) such that no two edges share a common vertex and the given budget b is not exceeded. To solve this problem we take advantage of two known problems that have polynomial time solutions: Maximum Weight Matching (MWM) and the Equally Valued 0/1 Knapsack problem (QKP), the special case of the knapsack problem in which all values are equal. After defining the eligible pairs as a graph, the following steps are taken:

• Find the maximum weight matching M = {e₁, e₂, . . . , e_{|M|}} as M = MWM(G(V, E)).

Remark: Notice that M includes the edges (where no two edges share a common vertex) whose weights sum to the maximum. The maximum weight for our watermarking corresponds to the minimum distortion, since we defined the weights as T − ((f_i − f_j) mod s_{i j}). This conversion gives the highest frequency difference the smallest weight and the smallest one the highest weight. This is required because we want to identify the edges that will distort the (histogram) dataset the least.

• Find the set of edges L_wm in M such that the selected edges do not exceed the budget b by employing the QKP as L_wm = QKP(M, b), where L_wm = {e₁, e₂, . . . , e_{|L_wm|}} and the value of each e_i is 1 (val(e_i) = 1).

Remark: After finding the edges via MWM, we have one more constraint in the watermarking, which is the budget b. Our optimal watermarking solution returns the maximum number of matched pairs for which the distortion (e.g., based on cosine similarity) does not exceed b. This is an Equally Valued 0/1 Knapsack problem where the value of each item is 1 and the weight is recomputed as T − w(e_i). The recomputation is necessary since for the QKP we want to add as many items as possible within the bound b.
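The two steps above can be sketched end to end on a toy graph. The sketch makes several assumptions that should not be read as the authors' implementation: the matching is found by exhaustive search (practical only for tiny graphs; a polynomial-time MWM algorithm would be used in practice), the budget is expressed directly as a total remainder rather than as a cosine-similarity bound, and all names and numbers are hypothetical.

```python
def max_weight_matching(edges):
    """Exhaustive maximum weight matching; edges is a list of (u, v, weight)."""
    best, best_w = [], 0

    def extend(start, used, chosen, total):
        nonlocal best, best_w
        if total > best_w:
            best, best_w = list(chosen), total
        for n in range(start, len(edges)):
            u, v, w = edges[n]
            if u not in used and v not in used:
                chosen.append(edges[n])
                extend(n + 1, used | {u, v}, chosen, total + w)
                chosen.pop()

    extend(0, frozenset(), [], 0)
    return best


def opt_match(pairs, T, budget):
    """pairs: (tk_i, tk_j, r) with r = (f_i - f_j) mod s_ij, and T > every r."""
    edges = [(a, b, T - r) for a, b, r in pairs]  # w(e) = T - remainder
    matching = max_weight_matching(edges)
    selected, spent = [], 0
    # QKP step: every edge has value 1 and cost T - w(e); take the cheapest first.
    for a, b, w in sorted(matching, key=lambda e: T - e[2]):
        cost = T - w
        if spent + cost > budget:
            break
        selected.append((a, b))
        spent += cost
    return selected


pairs = [("a", "b", 4), ("b", "c", 1), ("c", "d", 3), ("a", "d", 2)]
L_wm = opt_match(pairs, T=100, budget=3)
```

Because T is much larger than any remainder, the matching step first maximises the number of disjoint pairs and then prefers the pairs with the smallest remainders, which matches the role the Remark assigns to the weights.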

Now, in Theorem 1 we prove that the resulting watermark is optimal via proof-by-contradiction.

Theorem 1 The watermarking solution O_{WM} is optimal (according to Definition 4) if the solutions produced from the Maximum Weight Matching (according to Definition 2) and the Equally Valued 0/1 Knapsack problem (according to Definition 3) are optimal.

Proof 1 (Proof by contradiction) We prove that our watermarking is optimal via proof by contradiction. Let O_{WM} = {e₁, e₂, . . . , e_|O_{WM}|} be the optimal watermarking solution.

To prove that this solution (which is derived from the optimal solution O_{MWM} generated by the Maximum Weight Matching (MWM) algorithm and the optimal solution O_QKP generated by the Equally Valued 0/1 Knapsack problem (QKP)) is optimal, we assume that the solution O_{WM} is not optimal but the solutions derived from MWM and 0/1 Knapsack are optimal. If it is not optimal, then there must be another solution O′_{WM} which is optimal (according to Definition 4) such that |O′_{WM}| > |O_{WM}|. In other words, O′_{WM} must include at least one different edge (pair) e_o that does not exist in O_{WM}, i.e., e_o ∉ O_{WM}. W.l.o.g., considering e_o ∉ O_{WM}, the following two cases can occur:

• Case 1: e_o ∈ O_{MWM} but e_o ∉ O_QKP. This states that the QKP solution did not choose e_o for O_{WM}. That would mean that QKP did not optimally select the number of items such that W(w(e_i))_{i=1,2,...,|O_{WM}|} ≤ b. However, this contradicts our assumption that QKP is optimal. If e_o is (or must be) in the optimal solution, then it should have been chosen by QKP. Therefore, O_{WM} cannot be non-optimal (or O′_{WM} cannot be an optimal solution).

• Case 2: e_o ∉ O_{MWM} (and therefore e_o ∉ O_QKP). This case occurs when MWM does not choose e_o, and e_o comes from the graph G (which is the graph representation of all the eligible items) and appears in O′_{WM}. Similar to Case 1, while creating the maximum weight matching, e_o was not good enough to appear in the optimal matching. However, the optimal watermarking solution O′_{WM} would have e_o since it also creates the minimum distortion, which is equivalent to the maximum weight in MWM. Then, O_{MWM} is not optimal since it did not choose the maximum weight matching. However, this contradicts our assumption that O_{MWM} is optimal. If O′_{WM} is optimal, then e_o would have been included in O_{MWM}. Therefore, there cannot exist an edge e_o that makes the solution optimal but was not chosen by O_{MWM}.

To conclude our proof, considering the two cases above, we showed that there cannot exist an edge that was not chosen by any of the optimal algorithms and yet it exists in the optimal watermarking solution.

4.3.2 Greedy Matching

In our greedy solution, all the eligible token pairs are sorted in ascending order by their remainders, defined as rm_{ij} ≡ (f_i^o − f_j^o) mod s_{ij}. The algorithm then visits the pairs in that order and selects a pair for watermarking whenever choosing it would not exceed the budget b (i.e., by comparing the similarity of the original and watermarked datasets). This continues until the budget is exhausted or there are no more pairs to visit.

4.3.3 Random Matching

Our random matching solution differs from the greedy solution only in the order in which pairs are selected. The random matching solution visits the pairs in random order. For each selection, it follows the same steps as the greedy algorithm (e.g., checking the similarities).
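A minimal sketch of the two heuristics is given below. It rests on assumptions that are not part of the description above: the budget check is simplified to a running sum of remainders (instead of comparing dataset similarities), and each token is allowed to appear in at most one selected pair. The function name select_pairs and the toy pairs are hypothetical.

```python
import random


def select_pairs(pairs, budget, order="greedy", seed=None):
    """Select pairs for watermarking under a budget.

    pairs: list of (token_i, token_j, remainder).
    order: "greedy" visits pairs by ascending remainder,
           "random" visits them in a shuffled order.
    """
    candidates = list(pairs)
    if order == "greedy":
        candidates.sort(key=lambda p: p[2])
    else:
        random.Random(seed).shuffle(candidates)

    used, chosen, spent = set(), [], 0
    for tok_i, tok_j, rem in candidates:
        if tok_i in used or tok_j in used:
            continue  # assumption: a token belongs to at most one pair
        if spent + rem > budget:
            continue  # choosing this pair would exceed the budget
        chosen.append((tok_i, tok_j))
        used.update((tok_i, tok_j))
        spent += rem
    return chosen


# Hypothetical eligible pairs with their remainders.
pairs = [("a", "b", 3), ("b", "c", 1), ("c", "d", 4), ("a", "d", 2), ("e", "f", 5)]
greedy_choice = select_pairs(pairs, budget=6, order="greedy")
random_choice = select_pairs(pairs, budget=6, order="random", seed=7)
print(greedy_choice, random_choice)
```

Both variants share the selection loop; only the visiting order changes, which mirrors how the random matching is described relative to the greedy one.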

4.4 Algorithm Comparison and Effect of Main Parameters

We evaluate how the parameters (the modulo value z, the budget b, and the skewness parameter α) affect the number of pairs chosen for watermarking, and we compare the optimal, greedy, and random approaches in terms of the number of chosen pairs. For the experiments, we used synthetic datasets generated from a power-law distribution [10] with skewness values α ∈ {0.05, 0.2, 0.5, 0.7, 0.9, 1}. The sample size is 1M and the number of tokens is 1K for each α value.

When α is 0, it is a uniform distribution in which there are no eligible tokens to watermark. When α is 1, the original dataset D_o is skewed with a very long tail with almost equal values.
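As a rough illustration of such a synthetic dataset, the sketch below draws tokens with probability proportional to rank^(−α). This parameterization is an assumption chosen to be consistent with the description above (α = 0 gives a uniform distribution), not necessarily the exact generator used for the experiments; the function name power_law_histogram is hypothetical, and a smaller sample size than 1M is used to keep the example fast.

```python
import random
from collections import Counter


def power_law_histogram(num_tokens, sample_size, alpha, seed=0):
    """Sample token ids 0..num_tokens-1 with P(rank r) proportional to r**(-alpha)."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_tokens + 1)]
    sample = rng.choices(range(num_tokens), weights=weights, k=sample_size)
    return Counter(sample)  # token id -> appearance frequency


hist = power_law_histogram(num_tokens=100, sample_size=100_000, alpha=1.0)
print(hist.most_common(3))
```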

Figure 2a shows the relation between the skewness α of a dataset and the number of chosen pairs when budget b = 2 and modulo value z = 1031. When a dataset is almost uniform (i.e., α = 0.05), the solutions can select very few pairs since there are not many eligible items (i.e., the upper and lower boundaries are not large enough; in fact many of them are 0). As α increases, the differences between the frequencies of tokens increase. Thus, the number of eligible items increases, which also increases the number of pairs chosen under a given budget. However, at some point (i.e., α = 0.7), the number of chosen pairs decreases because the tail of the frequency histogram becomes uniform. The same figure shows the superior performance of the optimal solution. The gap between the optimal and the heuristics is around 20% for most α values, while the two heuristics perform similarly to each other (0.02% on average).

Figure 2b illustrates how the modulo value z affects the number of chosen pairs. When we pick a smaller modulo value z, we obtain a higher number of chosen pairs. The reason is that a smaller z leads to smaller remainders (modulo s_{ij}) that need to be eliminated, thereby yielding more selected pairs within a given budget b. When z is very small (i.e., 10), the three approaches are very close (see also Section 6.1 for the effect of z in terms of security). However, when z increases, our optimal approach selects many more pairs than greedy and random.

Figure 2c shows how the budget selection affects the performance comparison between the heuristics and the optimal. We set modulo value z = 1031 and use the dataset with the skewness α = 0.7. When we increase the budget b, the heuristics get closer to the optimal performance. This is expected since even the optimal algorithm cannot select more than all the eligible pairs, and with a large budget even the heuristics can approach that.

4.5 Validation Using Real World Datasets

Next we apply FreqyWM to three real world datasets from different domains: (1) the Chicago Taxi dataset [9]; (2) a real click-stream dataset logging the URLs visited by a group of users of the eyeWnder advertisement detection add-on [19]; (3) the Adult dataset [2]. Our intention is to validate, using real data, the main conclusion from the previous evaluation with synthetic data, i.e., that the heuristic approaches perform well enough compared to the optimal. A second evaluation objective is to report the real processing time on an ordinary machine for generating and detecting the watermark using these real datasets.

Our experimental results are produced on a standard laptop with a dual-core Intel Core(TM) i7-5600U CPU at 2.5GHz, 16.00 GB RAM, and a 64-bit OS, and our implementation is written in Python. For our watermark generation, we set the modulo value z = 131 and the budget b = 2, and employed SHA256. We ran our algorithm 30 times and report the mean of the total computation times. Table 2 presents our validation results.

Taxi ID, URL, and Age were chosen as tokens for Chicago Taxi, eyeWnder, and Adult, respectively. After generation, for the Chicago Taxi, eyeWnder, and Adult datasets, our optimal solution chose 805, 38, and 21 pairs, respectively. Considering the heuristic approaches, greedy chose 770 pairs for Chicago Taxi, 33 pairs for eyeWnder, and 20 pairs for Adult, while random chose 773, 31, and 17 pairs for Chicago Taxi, eyeWnder, and Adult, respectively. Running times for watermark generation on the Chicago Taxi, eyeWnder, and Adult datasets were 182.51 secs, 420.81 secs, and 0.03 secs, respectively (excluding histogram and watermarked data generation). For watermark detection, the total detection time for each watermarked dataset was less than 1 sec. As can be seen from Table 2, the number of chosen pairs increases when the number of eligible pairs increases. For instance, while eyeWnder has more distinct tokens (11479) than Chicago Taxi (6573), eyeWnder has fewer eligible pairs (257) than Chicago Taxi (33308). Thus, FreqyWM has selected more pairs for Chicago Taxi (805) than it did for eyeWnder (38).

(a) Effect of different skewness parameters (α) on chosen pairs by Optimal, Greedy, and Random.

(b) Effect of different modulo values (z) on chosen pairs by Optimal, Greedy, and Random.

(c) Chosen pairs by Greedy and Random with respect to Optimal.

Figure 2: Effects of parameters on the size of chosen pairs for watermarking.

| Dataset | Size | Token | Distinct Tokens | Eligible Pairs | Optimal | Greedy | Random | Gen (sec) | Detect (sec) |
|---|---|---|---|---|---|---|---|---|---|
| Chicago Taxi [9] | 9.68GB | Taxi ID | 6573 | 33308 | 805 | 770 | 773 | 182.51 | 0.609 |
| eyeWnder [19] | 247MB | URL | 11479 | 257 | 38 | 33 | 31 | 420.81 | 0.053 |
| Adult [2] | 4MB | Age | 73 | 72 | 21 | 20 | 17 | 0.03 | 0.001 |

Table 2: Validation results on real world datasets. Dataset: The dataset used. Size: The size of the original dataset. Token: Definition of the token (e.g., the name(s) of the attributes). Distinct Tokens: The number of distinct tokens. Eligible Pairs (|L_e|): The number of eligible pairs. Optimal: The number of pairs chosen by the optimal matching. Greedy: The number of pairs chosen by the greedy matching. Random: The number of pairs chosen by the random matching. Gen: Running time of watermark generation. Detect: Running time of watermark detection.

4.6 Watermarking Multi-Dimensional Data

During our evaluation so far, on both real world and synthetic datasets, we set the token as a single attribute (e.g., URL, Taxi ID, Age). However, as we previously stated, a token does not need to be restricted to a single attribute of a multi-dimensional dataset. A token can also be defined as a combination of more than one attribute (e.g., [Age, WorkClass] in the Adult dataset). We ran FreqyWM on such a token, represented as [Age, WorkClass], with the same parameter setting as in Section 4.5; the number of tokens (i.e., distinct [Age, WorkClass] values in the real dataset) was 481. The number of pairs chosen for watermarking was 20. With multi-dimensional data, removing a token appearance is as simple as with single-dimensional data. Increasing a token’s frequency, however, is more involved. The reason is that just repeating the value of the token would leave the other fields that are not part of the token without a value, e.g., all the fields beyond Age and WorkClass in the Adult dataset. A naive solution would be to select a random appearance of the token and copy its other fields every time an additional instance of the token needs to be added to the watermarked dataset. This, however, could create semantic inconsistencies if there are constraints to be respected for individual attributes or combinations of them. Such inconsistencies, in addition to degrading the quality of the data, could also give away the existence of a watermarked pair to an attacker, and avoiding them requires domain knowledge about what each dataset represents. Such knowledge, however, is orthogonal to all previous steps of our algorithms and, thus, can be applied as a last step whenever a token’s frequency needs to be increased.


5 Probabilistic Analysis of False Positives

In this section we develop a statistical bound derived from Markov’s inequality to demonstrate that the false positive probability (i.e., accepting a dataset as watermarked when it is not) goes to zero if we increase the minimum number of pairs that have to be accepted, or if we decrease the threshold for accepting a pair as watermarked. We focus on an example where the remainders follow a Uniform distribution to show the results and some numerical examples, but Markov’s bound is a general property and can be applied regardless of the distribution of the remainders. The significance level t for accepting a pair as watermarked determines how amenable the system is to false hits. By choosing lower values of t, the owner can increase its confidence that if the detection algorithm finds its watermark in a suspected dataset, it probably is an illegal copy of its watermarked token list. Recall that if the i-th pair of tokens is accepted as watermarked, this means that (f_i − f_j) mod s_{ij} ≤ t holds for this pair of tokens. The probability that this “watermarking statement” holds for the i-th pair of tokens is represented as P(X_i = 1) = p_i for i = 1, ..., n. The value of p_i depends on t. The logic is as follows: if t is zero, false negatives will increase. If t increases, detection becomes more robust against slight modifications. But if we increase t too much, the chance of false positives (accepting a dataset as watermarked when it is not) grows.

Let us now define the random variable S_n = ∑_{i=1}^{n} X_i as the sum of n independent and not necessarily identically distributed Bernoulli random variables. Then, S_n follows a Poisson-Binomial distribution [8] and takes values in {0, 1, 2, ..., n}, since it counts the number of pairs accepted as watermarked in a set of n total pairs of tokens. The probabilities p_i do not necessarily have to be the same for each pair of tokens; in fact, in reality they will be different. If they are the same, then we have the particular case of the Binomial distribution.

Let us assume that the probabilities p_i follow a Uniform[0,1] distribution.⁴ This way, the probability of success for a pair (i.e., accepting the pair as a watermarked pair) is p_i = {|t|/s_{ij}} in our watermarking solution. The probability of having at least k successes in n trials can be written as P(S_n ≥ k) = ∑_{i=k}^{n} P(S_n = i). Note that k represents the minimum number of pairs that have to be “accepted” in order to say that the dataset D′_w is watermarked by the owner.
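The roles of t and k in detection can be illustrated with a short sketch. It only encodes the acceptance rule stated above, (f_i − f_j) mod s_{ij} ≤ t, and the requirement of at least k accepted pairs. How the values s_{ij} are derived from the owner's secret is not reproduced here, so they are passed in as given; the function names and the toy frequencies are hypothetical.

```python
def count_accepted_pairs(freq, pairs, t):
    """Count pairs whose remainder (f_i - f_j) mod s_ij is at most t.

    freq:  dict mapping token -> appearance frequency.
    pairs: list of (token_i, token_j, s_ij); s_ij is assumed to be known.
    """
    accepted = 0
    for tok_i, tok_j, s_ij in pairs:
        remainder = (freq.get(tok_i, 0) - freq.get(tok_j, 0)) % s_ij
        if remainder <= t:
            accepted += 1
    return accepted


def is_watermarked(freq, pairs, t, k):
    """Accept the dataset as watermarked if at least k pairs are accepted."""
    return count_accepted_pairs(freq, pairs, t) >= k


# Hypothetical histogram and pairs.
freq = {"a": 120, "b": 50, "c": 33, "d": 10}
pairs = [("a", "b", 7), ("c", "d", 5)]  # remainders: 70 % 7 = 0, 23 % 5 = 3
print(count_accepted_pairs(freq, pairs, t=0))
print(is_watermarked(freq, pairs, t=3, k=2))
```

Raising t from 0 to 3 in this toy example accepts the second pair as well, which is the trade-off between robustness and false positives discussed in this section.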

We are interested in studying the behavior of the probability P(S_n ≥ k) as a function of the parameters t and k. For this, we can use the Sandwich Rule of probability and Markov’s upper bound, obtained from Markov’s inequality, P(S_n ≥ k) ≤ µ/k, where µ = ∑_{i=1}^{n} p_i is the mean of S_n.

What happens in the limit of t? The parameter t defines the robustness we want to have. The individual probabilities p_i directly depend on the value of t. If t tends to zero, the probabilities p_i will tend to zero for all i = 1, ..., n, and then µ will also tend to zero. Therefore, we need to see what happens when µ goes to zero. It is easy to see from Markov’s bound that lim_{µ→0} P(S_n ≥ k) = 0. This means that if we decrease t, the probability of accepting a dataset as watermarked will decrease to zero.

⁴ They can follow different distributions, such as Uniform, Normal, Beta.

What happens in the limit of k? The parameter k defines the minimum number of “accepted” pairs of tokens that we need in order to conclude that the dataset is watermarked. If we increase k, it becomes harder to “accept” a dataset as watermarked. Thus the probability of “acceptance” should decrease. If we make k tend to infinity, lim_{k→∞} P(S_n ≥ k) = 0.

Let us see what happens if we make k tend to 0. From the definition of the cumulative distribution function of a random variable, and since S_n is non-negative, we have that lim_{k→0} P(S_n ≥ k) = 1. Figure 3 shows the survival probabilities P(S_n ≥ k) against the value of k (where the probabilities p_i are Uniform[0,1]), computed using the Discrete Fourier Transform of the characteristic function (DFT-CF) of the Poisson-Binomial distribution for n = 50.

Figure 3: Survival probabilities for n = 50.
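The behavior of the survival probability and of Markov’s bound can be reproduced numerically with a small sketch. Note that, for simplicity, it computes the Poisson-Binomial distribution with a direct dynamic-programming recursion rather than with the DFT-CF method mentioned above; the random seed and the function names are assumptions made for illustration only.

```python
import random


def poisson_binomial_pmf(probs):
    """Return [P(S_n = 0), ..., P(S_n = n)] for independent Bernoulli(p_i)."""
    pmf = [1.0]  # distribution of the empty sum
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            nxt[s] += mass * (1.0 - p)   # pair i rejected
            nxt[s + 1] += mass * p       # pair i accepted
        pmf = nxt
    return pmf


def survival(pmf, k):
    """P(S_n >= k)."""
    return sum(pmf[k:])


rng = random.Random(0)
probs = [rng.random() for _ in range(50)]  # p_i ~ Uniform[0, 1], n = 50
pmf = poisson_binomial_pmf(probs)
mu = sum(probs)                            # mean of S_n

for k in (10, 25, 40):
    markov = min(1.0, mu / k)              # Markov: P(S_n >= k) <= mu / k
    print(k, survival(pmf, k), markov)
```

The exact survival probability drops much faster than the Markov bound as k grows, which is consistent with the bound being a general but loose guarantee.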

6 Security and Robustness Analysis

This section discusses the security and robustness of our FreqyWM method. We consider four attack scenarios: guess, sampling, destroy, and re-watermarking (false-claim) attacks.

6.1 Guess (Brute-Force) Attack

In the guess attack, the probabilistic polynomial time (PPT) adversary tries to guess the watermark, i.e., the secret embedded in the data. This is possible only if he can figure out a subset of token pairs {tk_i, tk_j}_N (where |D^hist_w|/2 ≥ N ≥ k) based on the watermarked data D_w, the random value R, and the modulo value z, such that the watermark detection algorithm run on these inputs (for some fixed k and t) returns accept.

Assuming that the hash function is collision resistant, R is random, and z is an integer, the probability of the attacker being successful is 1 / (|z| × C(K, N) × 2^λ), where K = |D^hist_w|/2 and C(K, N) denotes the binomial coefficient “K choose N”, i.e., it becomes negligible for typical parameter values (i.e., λ = 256) such as those used in our evaluation.
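To get a feeling for the magnitude of this probability, the sketch below evaluates its base-2 logarithm. It assumes that |z| is simply the value of z, and the values chosen for K and N are hypothetical and serve only as an illustration; the function name log2_guess_success is likewise an assumption.

```python
import math


def log2_guess_success(z, K, N, lam):
    """log2 of 1 / (z * C(K, N) * 2**lam), computed without underflow."""
    return -(math.log2(z) + math.log2(math.comb(K, N)) + lam)


# Hypothetical setting: z = 131, K = 500 candidate pairs, N = 100 guessed pairs,
# and a 256-bit hash output (lambda = 256).
exponent = log2_guess_success(z=131, K=500, N=100, lam=256)
print(f"success probability is about 2^{exponent:.1f}")
```

Even without the combinatorial term, the 2^λ factor alone keeps the success probability below 2^−256 in this setting.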

6.2 Sampling Attack

In this attack, A copies a random subsample from the watermarked dataset D_w in an attempt to exploit (pirate/steal) it while hoping that the watermark will not be detectable within the extracted sample. Figure 4 illustrates the attack for different sample sizes from 1% to 90%, extracted from the original watermarked dataset D_w, which is generated from an original dataset D_o with skewness parameter α = 0.5 that has 1M tokens in total and 1K unique tokens. The number of pairs chosen by our optimal matching was 139. For each percentage and each subsample we apply the detection algorithm and compute the percentage of accepted pairs. Since the extraction of the subsamples is performed 100 times for each fixed percentage, we compute the average number of accepted pairs over all repetitions. Also, for each subsample detection experiment, we use different values of the threshold t for accepting a pair as watermarked, namely t ∈ {0, 1, 2, 4, 10}.

Figure 4: Sampling attack results with α = 0.5.

The attack scenario is as follows: A randomly selects x% of D_w, where x defines the percentage for the sampling attack (e.g., 1), giving a subsample of size 1M × x/100. For instance, for a 1% sampling attack, a subsample would have a total of 1M × 0.01 = 10K tokens. Note that if the sample size is greater than the number of distinct tokens, which is the number of items/tuples in D^hist_w, the sample would contain all the distinct tokens with high probability. This would mean that all the pairs chosen for the watermark would exist in the subsample.

Figure 4 shows that the size of the extracted subsample does not greatly affect the number of accepted pairs if it is greater than the number of unique tokens (1K). This is because every unique token will then be found in the extracted subsample with high probability, and thus every watermarked pair of tokens will also be found. With the sampling attack, the frequencies of the tokens vary, which is why the value of t does affect the result. For example, with t = 0 the detection algorithm is able to detect around 36% (on average) of the watermarked pairs. When t increases from 1 to 10, the performance of the detection increases (on average) from 72% to 99.5%.

Let us now see the results when the size of the extracted subsample is very low, so that it might miss at least one of the 1K unique tokens that the original watermarked dataset has. Figure 5 shows the results for sample size proportions between 0.0007% and 0.5%. It can be seen that if the sample size is greater than 5 times the number of unique tokens (1K), the performance of the detection algorithm stabilizes. Below 2 times (2K), the performance starts to decrease more rapidly. In these extreme cases, the detection algorithm will have more difficulty detecting the data as watermarked. However, the utility of the data is highly compromised since these are very small subsample sizes compared with the original size of 1M tokens. In the most extreme cases, the utility is also destroyed in the sense that only a small number of distinct tokens are found in the subsample, compared to the original number of 1K. We also tested the sampling attack on other watermarked datasets with different values of the skewness parameter α and obtained similar results.

Figure 5: Sampling attack results with very low sample size and α = 0.5.

6.3 Destroy Attack

In this case the attacker A tries to damage the watermark. The no-security-by-obscurity principle [23] allows A to know that he can destroy the watermark in a way that it cannot be detected by the owner. A computes the histogram of the watermarked data D_w. A then modifies the frequencies of tokens as he pleases, either allowing re-ordering (changing the popularity/rank of the tokens) or without allowing re-ordering. We define these two types of destroy attacks in detail below and discuss the robustness of FreqyWM against them. In order to measure the robustness, we run our optimal solution on a dataset where the skewness parameter α = 0.5, the modulo value z = 131, and the budget b = 2.

6.3.1 Destroy Attack without re-ordering

In this attack type, A can modify the frequencies without changing the order of the frequencies. We consider two variants: (1) the attacker changes the frequencies randomly within the given boundaries; (2) the attacker changes the frequencies by (at most) some percentage.

Changing the frequencies randomly within the upper and lower boundaries. In this attack type, A calculates the boundaries for each token. Then, A chooses a random value r_i for each tk_i as r_i ← (−l_i, u_i). A changes the frequency of the token and updates the upper boundary of the next token in the list by r_i.

Changing the frequencies by (at most) some percentage. A changes the frequencies of tokens by up to some percentage (e.g., 1%). To illustrate, A calculates the boundaries u_i (the upper boundary) and l_i (the lower boundary) for each tk_i, with the percentage set to 1%. He calculates u′_i = floor(u_i × 0.01) and l′_i = floor(l_i × 0.01). Then he draws a random value r_i between (−l′_i, u′_i). He thereby changes tk_i by at most ±1%. l′_i and u′_i are already within the boundaries and cannot exceed them. After every change (f′_i = f_i + r_i), the boundary of the next element is updated as well. Thus, the attack never changes the ranking/ordering.
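The percentage-based attack can be sketched as follows. The exact definition of the boundaries is given earlier in the paper and is not reproduced here; this sketch assumes, for illustration only, that the upper boundary of a token is the gap to its (already modified) higher-ranked neighbour and the lower boundary is the gap to its lower-ranked neighbour, with the first and last tokens bounded by their own frequency. The function name percentage_attack and the toy histogram are hypothetical.

```python
import math
import random


def percentage_attack(freqs, pct=0.01, seed=None):
    """Perturb frequencies (sorted in descending order) without re-ordering them."""
    rng = random.Random(seed)
    attacked = []
    for i, f in enumerate(freqs):
        # Assumed upper boundary: gap to the already modified higher-ranked token.
        upper = attacked[i - 1] - f if i > 0 else f
        # Assumed lower boundary: gap to the next lower-ranked token.
        lower = f - freqs[i + 1] if i + 1 < len(freqs) else f
        u_scaled = math.floor(upper * pct)  # u'_i = floor(u_i * pct)
        l_scaled = math.floor(lower * pct)  # l'_i = floor(l_i * pct)
        r = rng.randint(-l_scaled, u_scaled)
        attacked.append(f + r)              # f'_i = f_i + r_i
    return attacked


original = [100_000, 70_000, 40_000, 39_000, 1_000]
modified = percentage_attack(original, pct=0.01, seed=1)
print(modified)
```

Because each change is limited to a fraction of the gap to the neighbouring tokens, the modified frequencies remain in descending order in this sketch.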

Figure 6: Percentage of verified pairs for the following datasets: (1) D_w: the original watermarked dataset (power-law popularity with skewness α = 0.5) without any attack, (2) D_non: a non-watermarked dataset defined over the same token space but with a different power-law popularity with skewness α = 0.7, (3) D^r_w: the watermarked dataset D_w after being attacked by the random attack without re-ordering, (4) D¹_w: the watermarked dataset D_w after being attacked by changing frequencies by at most 1%.

Figure 6 shows how robust FreqyWM is against these two destroy attacks. We perform these two attacks on the watermarked sample in which α = 0.5. We then compare the success rate of the detection algorithm (the percentage of accepted token pairs for a given pair-acceptance threshold t) on the modified watermarked data after the attacks. We also include in the figure a second dataset of skewness α = 0.7 that does not carry the watermark, and report how many of its pairs would be falsely verified for different values of t.

The blue line in the plot represents the watermarked data D_w without any modification. For an attack in which the frequencies are changed by (at most) some percentage (represented by the red line in the figure), FreqyWM can detect around 90% of the pairs when t = 0. When t is increased, after a point where t ≥ (f_i − f_j) mod s_{ij}, the success rate converges at around 90%. For an attack where the frequencies are changed randomly within the upper and lower frequency boundaries (green line in Figure 6), FreqyWM can detect more than 35% of the pairs when t = 0. Note that the latter attack is more powerful than the former. The success rate increases with t: as shown, it reaches 90% when t goes to 10. From Figure 6, we can determine the parameter settings under which false negatives (rejecting a pair although it is watermarked) and false positives (accepting a pair as watermarked although it is not) can be avoided, so that the detection algorithm successfully detects an attacked watermarked dataset and rejects a dataset that was not watermarked. For instance, the false positive rate increases when the threshold t for accepting a pair increases while the minimum number of accepted pairs k for detection decreases; this corresponds to the area under the results of the non-watermarked dataset with a different skewness parameter (the area under the orange line).

On the other hand, the false negative rate increases when the threshold t for accepting a pair decreases while the threshold k for detecting a watermark increases; this corresponds to the area above the results of the attack without re-ordering (the area above the green line), if we consider a very strong attack. For the owner, to avoid false negatives and false positives, convenient settings for the pair-acceptance threshold t and the minimum number of accepted pairs k lie between these two areas (between the orange and the green lines in Figure 6). However, if a weaker attack (changing the frequencies by some percentage) is considered, the range of these parameters increases (the area between the red and orange lines). Therefore, with a careful parameter setting, the detection algorithm can detect a watermarked dataset and reject a dataset that was not watermarked by the owner. Tuning these parameters depends largely on the specifics of each application use case and, as we explain in Section 8, is a topic of our future work plans.

6.3.2 Destroy Attack with re-ordering

In this attack type, an attacker A can modify the frequencies as he pleases without observing any ordering restrictions. Note that this attack introduces more noise than