Anonymization techniques can be classified into generalization (replacing a value with a less specific but semantically consistent one) and suppression (not releasing a value at all) tech- niques.
6.8.1 K-Anonymity
One of the first approaches to anonymization is k-Anonymity [92]. Several researchers have provided practical implementations of these algorithms ranging from Top-down specialization vs. bottom-up generalization to Global (single dimensional) vs. local (multidimensional) to Com- plete (optimal) vs. greedy (approximate) to Hierarchy-based vs. partition based [61, 59, 20].
K-anonymity ensures that individuals cannot be uniquely identified by a record linking attack, but does not necessarily prevent sensitive attribute disclosure. The algorithm primarily provides a clustering of nodes into equivalence classes where each node is indistinguishable in its quasi- identifying attributes QID from some minimum number of other nodes. When there is not much diversity in the sensitive attributes inside an equivalence class, the sensitive attribute of everyone in the equivalence class becomes known with high certainty.
6.8.2 L-Diversity and T-Closeness
L-diversity [71] alleviates the problem of sensitive attribute disclosure inherent to k-anonymity by ensuring that sensitive attribute values in each equivalent class are diverse. A set of records in an equivalence class C is l-diverse if it contains at least l well-represented (measured by several ways including frequency counts and entropy). One drawback thus lies in the possibility to infer sensitive attributes when the sensitive distribution in a class is very different from the overall distribution of the same attribute. If the overall distribution is skewed, then the belief of someones value may change drastically in the anonymized data. Also, the possibility to detect equivalent classes which contain very similar sensitive attribute values. Also, l-diversity based approaches implicitly assume that each sensitive attribute takes values uniformly over its class, so when frequencies of sensitive attribute values are not similar there may be large utility loss on data. For example, consider a dataset that contains 1000 patients with some quasi-identifying (QID) attributes and a single sensitive attribute disease with two possible values cancer or flu. If there are only 5 patients with cancer in the table, to achieve 2-diversity, at least one patient with cancer is needed in each QID group. So at most 5 groups can be formed which causes information loss. T-closeness [63], on the other hand, considers the sensitive attribute distribution in each class, and its distance to the overall attribute distribution. The distance is measured using similarity scores for distributions.
All these algorithms prevent uniquely identifying individuals through record linking. The main drawback, on the other hand, to all generalization and suppression algorithms lie in the utility or information loss incurred, since they rely on frequency of an item, in some cases, certain co- occurrences of items are considered the source of utility especially when record linkage is per- formed. If an item occurs in no frequent itemset, then suppressing that item (attribute) incurs no information loss. However, if an item occurs in many frequent itemsets (tuples) then suppress-
ing the item incurs a large information loss, since all frequent itemsets containing that item are removed from the data.
6.8.3 K-Anonymity Applied to Relational Data
k-anonymity has been initially proposed by Samarati and Sweeny [92] to prevent linking an individual to a record in a relational data table through a quasi-identifier. Later, several researchers have applied the notion of K-Anonymity to privacy. However, most of the K-Anonymity protection models are at the relational data table level not at the Web service operation level. Therefore, the private resource in their case is the identity of the subject whose information is contained in the data. The K-Anonymity concept have been implemented in several real world systems, including DataFly, mArgus, and K-Similar. Recentrly, few researchers have studied probabilistic notions of k-Anonymity [49].
6.8.4 K-anonymity Applied to Graphs
The literature has a number of definitions derived from anonymity tailored to structural prop- erties of network data. For example, k-degree anonymity [66], k-candidate anonymity [45], k- automorphism anonymity [110], k-neighborhood anonymity [109], and (k,l)-grouping [30]. k- degree anonymity indicates a graph as k degree anonymous if for every node v in V there exist at least k 1 other nodes that have the same degree as v. This anonymization technique can pre- serve privacy by preventing the re-identification of individual nodes by adversaries with a priori knowledge of the degree of certain nodes. However, the anonymized graph may not be useful. k-neighborhood anonymity has been first proposed by Wu et al. [109] and states that a graph is k-neighborhood anonymous if every node has a 1.5-hop neighborhood graph isomorphic to the 1.5-hop neighborhood graph of at least k-1 other nodes. k-candidate anonymity states that an anonymized graph satisifies k-candidate anonyity with respect to a structural query Q if there is a set of at least k nodes which match Q, and the likelihood of every candidate for a node in this set with respect to Q is less than or equal to 1/k. Zou et al. have proposed k-automorphism anonymity. A graph is k-automporphic if every node has the same subgraph signature as at least k-1 other graph nodes, and the likelihood of every candidate for that node is less than or equal to 1/k. Finally, (k,l)- grouping is a privacy mechanism that has been proposed by Cormode et al. to handle attacks in affiliation networks. It assumes that affiliation links can be predicted based on node attributes and
the structure of the network. They use a greedy algorithm that generalizes node attributes without modifying the network structure. They assume that each node is indistinguishable from at least k-1 other nodes in terms of node attributes.