Data Privacy and Consumer Protection: Anonymizing User Data is Necessary, and Difficult

Soren Heitmann, IFC-Mastercard Foundation Partnership for Financial Inclusion

Next-generation data analytics are driving innovative products, services and new FinTech business models. Many of these products draw on individual consumer data. Managing data privacy responsibly and protecting consumer data are critical to mitigating operational and reputational risks. In many markets, regulators are still catching up, and many innovators identify risks only after it is too late. This post explores data anonymization and encryption through three cases in which individually identifying data was exposed even though providers had taken steps to anonymize and encrypt it.

Difficulties in Anonymizing Data are Well-Documented
In 2006, America Online (AOL), an internet service provider, made 20 million search queries publicly available for research. Users were anonymized by replacing each identity with a random number. In a New York Times article, journalists Michael Barbaro and Tom Zeller describe how customer number 4417749 was identified and subsequently interviewed. While user 4417749 was anonymous, her searches were not. She was an avid internet user whose search terms were revealing: ‘numb fingers’; ‘60 single men’; ‘dog that urinates on everything’. Her searches also included people’s names and other specific information, such as ‘landscapers in Lilburn, Georgia, United States of America’. No individual search is identifying, but for a sleuth, or a journalist, it is easy to identify a sixty-something woman with a misbehaving dog and a nice yard in Lilburn, Georgia. Thelma Arnold was found and confirmed the searches were hers. It was a public relations debacle for AOL.

Another data breach made headlines in 2014 when Vijay Pandurangan, a software engineer, de-anonymized 173 million taxi records released by the city of New York for an Open Data initiative. The identifiers were encrypted with a one-way technique, designed so that the original value cannot practically be computed back from the encrypted value. The dataset had no identifying search information as in the case of Arnold above, but the encrypted taxi registration numbers had a publicly known structure, such as number, letter, number, number (e.g., 5H32). Pandurangan calculated that there were only 23 million combinations, so he simply fed every possible input into the encryption algorithm until it yielded matching outputs. Given today’s computing power, he was able to de-anonymize the records in only two hours.
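The brute-force idea can be sketched in a few lines of Python. This is an illustration only, not a reconstruction of Pandurangan's code: it assumes an unsalted MD5 hash (the post does not name the algorithm) and covers only the single number, letter, number, number pattern, which has just 10 × 26 × 10 × 10 = 26,000 possible values. The function names are hypothetical.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(value):
    """Unsalted one-way hash of an identifier (MD5 is an assumption here)."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def build_lookup():
    """Hash every identifier shaped digit-letter-digit-digit, e.g. 5H32."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"
        # Map hash -> original value so a released hash can be looked up.
        table[md5_hex(candidate)] = candidate
    return table


lookup = build_lookup()

# Stand-in for one "anonymized" value as it would appear in a released file.
released_value = md5_hex("5H32")

# The hash is never inverted; the attacker simply finds it in the table.
recovered = lookup[released_value]
print(len(lookup), recovered)
```

The hash function is never broken here. Because the input space is small and its structure is public, every possible output can be precomputed, which is why a one-way function alone does not anonymize structured identifiers.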

Netflix, an online movie and media company, sponsored a crowdsourced competition challenging data scientists to improve its internal algorithm for predicting customer movie ratings by 10 percent. One of the teams de-anonymized the movie-watching habits of anonymized users in the competition dataset. The team cross-referenced the public Internet Movie Database (IMDb), which provides a social media platform for users to rate movies and write their own reviews, and identified users by matching sets of identically rated movies across the public IMDb data and the anonymized Netflix data. Netflix settled lawsuits filed by identified users and faced consumer privacy inquiries brought by the United States government.
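A toy version of this kind of cross-referencing, often called a linkage attack, might look like the sketch below. All names, titles and ratings are invented for illustration, and the matching rule (a minimum number of identical movie and rating pairs) is a simplifying assumption rather than the method actually used against the Netflix data.

```python
# Hypothetical "anonymized" dataset: pseudonym -> set of (movie, rating) pairs.
anonymized_ratings = {
    "user_0001": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4), ("Movie D", 1)},
    "user_0002": {("Movie A", 3), ("Movie E", 5), ("Movie F", 4)},
}

# Hypothetical public profiles: real username -> set of (movie, rating) pairs.
public_reviews = {
    "jane_public": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4)},
    "sam_public": {("Movie E", 5), ("Movie F", 4), ("Movie G", 2)},
}


def link_profiles(anonymized, public, min_overlap=3):
    """Match each pseudonym to the public profile sharing the most ratings."""
    matches = {}
    for anon_id, anon_set in anonymized.items():
        best_name, best_overlap = None, 0
        for name, public_set in public.items():
            # Count identical (movie, rating) pairs in both datasets.
            overlap = len(anon_set & public_set)
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        # Only report a match when enough ratings coincide.
        if best_overlap >= min_overlap:
            matches[anon_id] = best_name
    return matches


matches = link_profiles(anonymized_ratings, public_reviews)
print(matches)
```

With the default threshold only `user_0001` is linked, because `user_0002` shares just two ratings with any public profile. Lowering `min_overlap` links more pseudonyms but with less certainty, which mirrors the trade-off an attacker faces with real data.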

Properly anonymizing data is very difficult, with many ways to reconstruct information. In these examples, cross-referencing public resources (Netflix), brute force and powerful computers (New York Taxis), and old-fashioned sleuthing (AOL) led to privacy breaches. If data are released for open data projects, research or other purposes, great care is needed to avoid de-anonymization risks and serious legal and public relations consequences.

There are many good reasons to provide access to data. Academic researchers may need to give peer reviewers access. Firms may crowdsource innovative techniques to solve problems. Products may provide public Application Programming Interfaces (APIs) to enable derivative services. Consider first whether needs can be met without providing any identifiable information. Understand what unstructured data, such as user-generated memo fields, could contain, like names or places; and consider whether these notes, when grouped together, might be attributed to a specific individual. Where encryption is required, ensure industry standards are used, and also add randomly generated information to each identifier before encrypting it. This is known as a salt, and it can greatly reduce the risk of unlocking an entire dataset with a single key. Much has been written on how to anonymize data. The first thing to remember is that it is not a trivial task: it should be undertaken only after purposeful planning and with consideration of the data at hand.
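As a rough sketch of salting, the Python example below gives each identifier its own random salt before hashing. The class name is hypothetical, SHA-256 is chosen only as an example of a standard hash, and the design assumes the salt table is stored privately and never released with the data. It is a starting point under those assumptions, not a complete anonymization scheme.

```python
import hashlib
import secrets


class SaltedPseudonymizer:
    """Hypothetical helper: replaces identifiers with salted hash tokens."""

    def __init__(self):
        # identifier -> random salt; must stay private and never be published.
        self._salts = {}

    def token(self, identifier):
        # Reuse the salt for a known identifier so its records stay linkable.
        if identifier not in self._salts:
            self._salts[identifier] = secrets.token_bytes(16)
        salt = self._salts[identifier]
        return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


pseudonymizer = SaltedPseudonymizer()
first = pseudonymizer.token("5H32")
again = pseudonymizer.token("5H32")
other = pseudonymizer.token("5H33")

# What an attacker would precompute without knowing the salt.
unsalted = hashlib.sha256(b"5H32").hexdigest()

print(first == again, first != other, first != unsalted)
```

The same identifier always maps to the same token within one release, so records can still be grouped, but a table of precomputed hashes over every possible identifier no longer matches anything. If the salts leak, that protection is lost, so they need the same care as the original identifiers.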

Note: Adapted from a case study presented in the Data Analytics and Digital Financial Services Handbook (June 2017). This post was authored by Soren Heitmann, IFC-Mastercard Foundation Partnership for Financial Inclusion, for the Responsible Finance Forum Blog, November 2017.