Social media and the Internet can also be used to re-identify individuals in the data. For instance, the New York Times wrote an article about Tyquan Brehon. Mr. Brehon lives “in one of the most heavily policed neighborhoods in Brooklyn” and was stopped more than 60 times before his 18th birthday [9]. The pictures in the article suggest that he is African American, and we know from the article that he is male. There is only one profile belonging to someone named Tyquan Brehon on Facebook, and the publicly visible images match the man in the article. His birthday is also publicly visible on Facebook. Using his visible characteristics, those mentioned in the article, and the information gleaned from the Internet we can query for stops matching Tyquan Brehon in the dataset.
The results of querying for Mr. Brehon’s characteristics are presented in Table 5.1. It assumes his hair and eye color were both recorded as either black or brown and only queries the data from 2003 to 2008 (before his 18th birthday, the period in which the article tells us he was stopped multiple times). In addition, we search for stops in Brooklyn for some of the quasi-identifiers, assuming that he was most likely stopped in his “heavily policed” neighborhood (though he may have been stopped in other cities). We note that city is not recorded in the 2003 dataset, so the considered quasi-identifiers involving city will undercount the number of matches. Nevertheless, it may be safe to assume that most of his stops would have occurred after 2003, since Mr. Brehon was only 12 or 13 in that year and thus outside of the age range in which people are usually stopped.
We do not know for sure that all of these stops belong to Mr. Brehon. However, if at least some of the time his relevant information was coded correctly, we can assume that some of his stops are captured in the matching stops. This significantly narrows the field of stops that could belong to Mr. Brehon. Perhaps more importantly, though, even if none of these stops are Mr. Brehon, an adversary may falsely believe that they do belong to him and discriminate against him accordingly. An incorrect assumption based on this data may still lead to negative effects for those in the dataset.
Quasi-identifier # of matches DOB, race, sex, hair, eye, city 29
DOB, race, sex, city 32 DOB, race, sex, hair, eye 70
DOB, race, sex 76
Table 5.1: Number of matches for quasi-identifier instances that match Tyquan Bre- hon’s characteristics during the time period (2003-2008) in which we know he was stopped.
5.3
Criminal Records
Another method of re-identifying individuals in the data would be to search for crim- inal records or mugshots, as suggested in the Harvard study [30]. This has some limitations, however. We know that 88.9% of the stops in the data do not result in a summons or arrest. Additionally, arrest records and other police information are not made public in the state of New York. What would be public, however, are the court records associated records. “Court records are subject to disclosure under Section 255 of the Judiciary Law” according to O’Connor, who used them to map all the summonses in New York City [20]. While this would re-identify those stops resulting in court proceedings, it seems that there would be more information in court proceedings themselves than in the the stop-and-frisk dataset, and thus the dataset might not significantly contribute to the harm faced by the individual. This paper does not pursue this method of re-identification.
5.4
Voter Records
Finally, the Harvard study suggests using public voter information to re-identify in- dividuals in this dataset [30]. They suggest using the sex, race, date of birth, height, hair color, eye color, build quasi-identifier to identify the 100 most frequently stopped individuals, and then to assume that the zip code where they were most frequently stopped is their home zip code. The combination of sex, date of birth, and zip code should be enough to uniquely identify 87% of individuals in the voting records [24]. There are a number of shortcomings to this, which the authors address. For instance,
people may be stopped more often in locations that do not correspond to their home location. Also, the young African American men that make up the bulk of the stops are less likely to be registered voters, both because they are often too young to be eligible and because low-income People of Color are less likely to vote [30].
By submitting a FOIL request, we were able to obtain the voter records for New York State. We tested both the Harvard quasi-identifier as well as the photo ID quasi-identifier, using the methods suggested by Harvard on the 2012 NYCLU data. This involved only using entries with photo identification in order to increase the likelihood that the information was correctly coded in the stop-and-frisk data.
A complication arose from the fact that the zip code field is never filled out in the stop-and-frisk data. Instead, we used the State Plane Coordinate System of 1983 x- and y-coordinates from the data, which were automatically generated at each stop. We converted them to latitude and longitude using Earth Point [3] and got the corresponding zip codes using Pygeocoder [31]. Occasionally, one of the top 100 most frequently stopped individuals had more than one most common stop zip code. In that case, we searched for all of the most common stop zip codes associated with the quasi-identifier instance.
Of the top 100 most frequently stopped individuals in the 2012 NYCLU data who had photo identification, using the Harvard quasi-identifier, 28 had only one match for their date of birth, sex, and zip code in the voter records. 75 had one or more matches in the voter records, with the average number of matches being 3.01 (Table 5.2). This shows that voter records may be a reasonable way to re-identify individuals in this data, though it is possible that this process incorrectly links people, since it can only correctly identify individuals who have registered to vote who were also stopped in their home zip code. Likewise, since we used the specific Harvard quasi-identifier, it is more likely that all the stops selected as belonging to one of the top 100 most frequently stopped individuals do indeed correspond to the same person. However, this method may miss some of an individual’s stops, if, for example, their build was coded different across the dataset.
Percent with 1 [sex,dob,zip] Match in the Voter Records 28% Percent with any number of [sex,dob,zip] Matches in Voter Records 75% Average Number of [sex,dob,zip] Matches 3.01 Maximum Number of [sex,dob,zip] Matches 13
Table 5.2: Results of the re-identification experiment with the top 100 most frequently stopped individuals with photo ID in the 2012 NYCLU data, using Harvard’s quasi- identifier [30].
Chapter 6
Conclusion
This paper examined the risks associated with the release of New York City’s stop- and-frisk data. While this data release was instrumental in ending the racially biased program, the release itself may bring further harm upon the people stopped under the program.
This paper found that each stop contains an average of 7.7 pieces of sensitive information. Moreover, an adversary who knows only three well-chosen pieces of basic demographic information can uniquely identify 64% of all stops. The vast majority of people can be uniquely identified in the 2012 NYCLU data by an adversary who knows enough information about their target: for instance, knowing a target’s sex, race, date of birth, eye color and exact height and weight makes 89.9% of the 2012 NYCLU stops unique and leads to just 1.1 average matches. In the NYPD data from 2003 to 2012, this combination of attributes makes 67.5% of the stops unique and leads to 1.8 average matches.
Moreover, this paper demonstrated that there are real ways for an adversary to re-identify this unique data: the most straightforward example of such an attack is a nosy neighbor who, by virtue of being acquainted with their target, would know basic information about the target. Since the data contains sensitive information that might not otherwise be known, such as the reason a police officer believed someone was suspicious, there is a risk of harm associated with its release.
tive information about individuals, particularly information about individuals from marginalized groups who might suffer the effects of discrimination more keenly than people of a higher socio-economic status. The NYPD’s 2013 data mitigates some of the threat by not including the date of birth field. Without this field, the studies in this paper would be much less successful in uniquely identifying individuals. Never- theless, removing the date of birth field is not sufficient: the data contain additional information that could be used by a nosy neighbor to identify their target, such as the exact location where a stop was made (which might correspond to the target’s home address), and the time and date of the stop (which could be observed). More work must be done to assess the best way to this release data in a way that preserves its utility for activists seeking to challenge racist police practices. Techniques such as differential privacy, used by Mir et al. to release cellular records, might prove promising [16].
Bibliography
[1] Nate Anderson. “anonymized data really isn’t and here’s why not. http://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-
databases-of-ruin/. Accessed: 2014-11-18.
[2] Daniel C Barth-Jones. The re-identification’of governor william weld’s medical information: a critical re-examination of health data identification risks and privacy protections, then and now. Then and Now (June 4, 2012), 2012.
[3] Bill Clark. Earth point. http://www.earthpoint.us/. Accessed: 2014-4-7. [4] Nick Corasaniti and Josh Williams. Evolution of the new york drivers
license. http://www.nytimes.com/interactive/2013/03/17/nyregion/ 17licenses-evolution.html, March 2013. Accessed: 2015-04-07.
[5] Tore Dalenius. Finding a needle in a haystack-or identifying anonymous census record. Journal of official statistics, 2(3):329–336, 1986.
[6] Luis Daniel. Analysis finds racial disparities, ineffectiveness in nypd stop-and-frisk program; links tactic to soaring marijuana arrest rate. http://blog.luisdaniel.com/de-anonymizing-stop-and-frisk-data/. Accessed: 2014-11-18.
[7] Yves-Alexandre de Montjoye, C´esar A Hidalgo, Michel Verleysen, and Vincent D Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific reports, 3, 2013.
[8] New York Police Department. The stop, question and frisk data. http://www.nyc.gov/html/nypd/html/analysis_and_planning/stop_ question_and_frisk_report.shtml. Accessed: 2015-04-10.
[9] Julie Dressner and Edwin Maritnez. The scars of stop-and-frisk, June 2012. [Online; posted 12-June-2012].
[10] Cynthia Dwork. Differential privacy. In Encyclopedia of Cryptography and Se- curity, pages 338–340. Springer, 2011.
[11] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 106–115. IEEE, 2007.
[12] Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakr- ishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3, 2007.
[13] Nick Manfredi, Darakhshan J Mir, Shannon Lu, and Dominick Sanchez. Dif- ferentially private models of tollgate usage: The milan tollgate data set. In Proceedings of the 2014 IEEE Big Data Conference, 2014.
[14] Nick Manfredi, Darakhshan J Mir, and Claire Schlenker. Identifiability of vehicle tollgate records: The milan tollgate dataset. In The Sixth ASE International Conference on Privacy, Security, Risk and Trust, 2014.
[15] Darakhshan J Mir. Differential privacy: an exploration of the privacy-utility landscape. PhD thesis, Rutgers University-Graduate School-New Brunswick, 2013.
[16] Darakhshan J Mir, Sibren Isaacman, Ram´on C´aceres, Margaret Martonosi, and Rebecca N Wright. Dp-where: Differentially private modeling of human mobility. In Big Data, 2013 IEEE International Conference on, pages 580–588. IEEE, 2013.
[17] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 111–125. IEEE, 2008.
[18] NYCLU. De-anonymizing stop and frisk data.
http://www.nyclu.org/news/analysis-finds-racial-disparities-ineffectiveness- nypd-stop-and-frisk-program-links-tactic-soar. Accessed: 2014-11-18.
[19] NYCLU. Stop-and-frisk data. http://www.nyclu.org/content/ stop-and-frisk-data. Accessed: 2014-3-3.
[20] Maura O’Connor. How we mapped nypd summonses across the city: Part ii, October 2012. [Online; posted 8-October-2012].
[21] City of New York. Mayor de blasio announces agreement in landmark stop- and-frisk case. http://www1.nyc.gov/office-of-the-mayor/news/726-14/mayor- de-blasio-agreement-landmark-stop-and-frisk-case. Accessed: 2014-11-18.
[22] City of New York. Non-driver state id. http://www.nyc.gov/html/id/html/ how/state_id.shtml. Accessed: 2014-4-7.
[23] Ayla Solomon, Raquel Hill, Erick Janssen, Stephanie A Sanders, and Julia R Heiman. Uniqueness and how it impacts privacy in health-related social science datasets. In Proceedings of the 2nd ACM SIGHIT International Health Infor- matics Symposium, pages 523–532. ACM, 2012.
[24] Latanya Sweeney. Simple demographics often identify people uniquely. Health (San Francisco), 671:1–34, 2000.
[25] Latanya Sweeney. Computational disclosure control. A Primer on Data Privacy Protection, 2001.
[26] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557– 570, 2002.
[27] Latanya Sweeney. Matching known patients to health records in washington state data. Available at SSRN 2289850, 2013.
[28] Latanya Sweeney, Akua Abu, and Julia Winn. Identifying participants in the personal genome project by name. Available at SSRN 2257732, 2013.
[29] New York Civil Liberties Union. Stop-and-frisk data. http://www.nyclu.org/ content/stop-and-frisk-data. Accessed: 2014-11-18.
[30] Keyon Vafa, Christian Haigh, Kang Daniel Peinado, Samuel, and Ben Misch. Stop and frisk. 2015.
[31] Xiao Yu. pygeocoder. http://code.xster.net/pygeocoder/wiki/Home. Ac- cessed: 2014-4-7.
Appendix A
Additional Figures
The following sections contain additional graphs from the sampled (A.1) and binning (A.2) uniqueness studies. They provide additional information about the uniqueness of stops for each racial groups, women, and individuals who are at least twenty-five years of age.
(a) Average number of matches
(b) Percent unique
Figure A.1: Sampled uniqueness results for African Americans in the 2012 NYCLU dataset.
(a) Average number of matches
(b) Percent unique
Figure A.2: Sampled uniqueness results for people aged 25 and older in the 2012 NYCLU dataset.
(a) Average number of matches
(b) Percent unique
Figure A.3: Sampled uniqueness results for people aged 25 and older with photo identification in the 2012 NYCLU dataset.
(a) Average number of matches
(b) Percent unique
(a) Average number of matches
(b) Percent unique
Figure A.5: Sampled uniqueness results for Asians with photo identification in the 2012 NYCLU dataset.
(a) Average number of matches
(b) Percent unique
Figure A.6: Sampled uniqueness results for Black Hispanics in the 2012 NYCLU dataset.
(a) Average number of matches
(b) Percent unique
Figure A.7: Sampled uniqueness results for Black Hispanics with photo identification in the 2012 NYCLU dataset.