PET Lab.

K-Anonymity - Basic Privacy Guarantees for Data Sharing

Ryan Tenorio

Consider this scenario: a state government decides to release medical data about its citizens to support research into whether the healthcare system is meeting the specific medical needs of local residents. The dataset includes details such as each citizen’s name, zip code, age, ethnicity, and medical conditions. To try to do this safely, the government removes the names from the dataset before release. Fantastic!

However, something was not taken into account for this release: in a high-density zip code, many people may share the same age and ethnicity, so removing the names does provide some privacy. A low-density zip code in a rural area, however, may have only a single individual of a particular age and ethnicity. A crafty, curious individual could combine this dataset with an external dataset, such as census data or even a phone book, to figure out the identity of some of the citizens in the dataset, even without their names. In today’s world of thousands of published datasets and countless ways to connect information, this type of privacy attack is real and commonplace.

What is it?

K-Anonymity is one technique used to protect the privacy of individuals in a dataset against this type of external knowledge attack. It involves removing direct identifiers (such as name) from a dataset in addition to generalizing, truncating, or redacting what are known as quasi-identifiers. Quasi-identifiers are attributes that, while shared by multiple individuals (like zip code, age, and ethnicity), can be combined with external knowledge to re-identify an individual. A dataset is k-anonymous when every combination of quasi-identifier values it contains is shared by at least k records.
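That definition can be checked mechanically. Here is a minimal sketch (the function name and record layout are illustrative, not from any particular library) that tests whether a set of records satisfies k-anonymity over chosen quasi-identifiers:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records."""
    groups = Counter(
        tuple(record[qi] for qi in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

records = [
    {"zipcode": "55501", "age": 34, "ethnicity": "A"},
    {"zipcode": "55501", "age": 34, "ethnicity": "A"},
    {"zipcode": "90210", "age": 71, "ethnicity": "B"},  # unique -> re-identifiable
]

# The third record is one of a kind, so the dataset is 1-anonymous
# but not 2-anonymous.
print(is_k_anonymous(records, ["zipcode", "age", "ethnicity"], k=2))  # False
```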

When should I use it?

The de facto use case for applying a technique like k-anonymity is sharing sensitive information while reducing the risk of re-identification of individuals in a dataset. As with similar techniques, trade-offs should be considered between the strength of its application (in this case, the chosen value of k) and the usefulness of the data after the technique is applied.

De-identification through k-anonymity can also be useful in internal scenarios. If your organization has two business groups that need to share data for analytics to improve products, and you want to ensure that access to the data will not enable targeted advertising or onward data sharing, access can be provided to a purpose-built table that represents a k-anonymized version of the production data. Combined with techniques like data access controls, this methodology can preserve privacy guarantees and help your organization exercise purpose limitation.

How does it work?

To achieve k-anonymity, the quasi-identifiers in a dataset have to be generalized using a combination of techniques, such as redaction, truncation, generalization hierarchies, and dropping records.

Redaction

Redaction is most commonly used for direct identifiers. Removing the column entirely serves the same purpose.
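A minimal illustration, assuming records stored as dictionaries with a "name" field as the direct identifier:

```python
records = [
    {"name": "Alice", "zip": "55501", "condition": "flu"},
    {"name": "Bob", "zip": "90210", "condition": "asthma"},
]

# Redact the direct identifier by removing the field from every
# record -- equivalent to dropping the column entirely.
redacted = [{k: v for k, v in r.items() if k != "name"} for r in records]
```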

Truncation

Truncating values reduces the granularity of a particular field. For example, the last two digits of each zip code may be masked (55555 → 555**) to reduce risk for areas with low density.
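A small sketch of that zip code example (the helper name and the choice of three retained digits are illustrative):

```python
def truncate_zip(zipcode, keep=3):
    """Keep the first `keep` digits of a zip code and mask the rest."""
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

print(truncate_zip("55555"))  # 555**
```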

Generalization hierarchies

Generalization hierarchies work well for categorical or discrete values. For example, instead of an exact age, each individual’s age may be presented as a range (18 → 18-25).
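One way to sketch an age hierarchy; the bin boundaries here are assumptions chosen for illustration, not a standard:

```python
# Each tuple is (low, high, label); an exact age maps to the coarser label.
AGE_BINS = [(0, 17, "0-17"), (18, 25, "18-25"), (26, 40, "26-40"),
            (41, 65, "41-65"), (66, 150, "66+")]

def generalize_age(age):
    for low, high, label in AGE_BINS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")

print(generalize_age(18))  # 18-25
```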

Dropping Data

It may be necessary to drop entire rows of data to remove outliers: individuals who remain easily identifiable even after the other de-identification techniques have been applied.
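This is often called suppression. An illustrative sketch, dropping any record whose quasi-identifier combination is shared by fewer than k records:

```python
from collections import Counter

def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier combination appears
    fewer than k times."""
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]

rows = [
    {"zip": "555**", "age": "18-25"},
    {"zip": "555**", "age": "18-25"},
    {"zip": "902**", "age": "66+"},  # lone outlier -> dropped for k=2
]
kept = suppress_small_groups(rows, ["zip", "age"], k=2)
```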

How do I know what’s a quasi-identifier?

There is no mathematical formula to determine this; it is where the art of privacy begins to mix with the science. You will have to assess the data’s context, the risk to individuals, and the purpose of the dataset to determine which fields are direct identifiers, which are quasi-identifiers, and which are sensitive attributes.

The Code

One of the most common algorithms used to apply k-anonymization is the Mondrian algorithm. It works by partitioning a dataset into “boxes” that represent the attributes of individuals in the dataset. The algorithm attempts to arrange the boxes such that each resulting box contains at least k individuals. From the dataset’s perspective, each of those boxes represents the combination of generalized quasi-identifiers that those individuals will now be represented by.
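The greedy flavor of Mondrian can be sketched compactly for numeric quasi-identifiers (the full algorithm also handles categorical hierarchies; the names and data below are illustrative). Each partition is split on its widest-ranging quasi-identifier at the median, until a split would leave a box with fewer than k people; each final box then generalizes its quasi-identifiers to their (min, max) range:

```python
def mondrian(rows, qis, k):
    """Greedy Mondrian sketch for numeric quasi-identifiers."""
    def split(part):
        # Pick the quasi-identifier with the widest value range,
        # then cut the sorted partition at its median.
        spans = {q: max(r[q] for r in part) - min(r[q] for r in part)
                 for q in qis}
        widest = max(spans, key=spans.get)
        ordered = sorted(part, key=lambda r: r[widest])
        mid = len(ordered) // 2
        return ordered[:mid], ordered[mid:]

    def anonymize(part):
        lhs, rhs = split(part)
        if len(lhs) >= k and len(rhs) >= k:
            return anonymize(lhs) + anonymize(rhs)
        # Cannot split further: generalize each quasi-identifier to the
        # (min, max) range of the partition -- one "box" of >= k people.
        generalized = []
        for r in part:
            g = dict(r)
            for q in qis:
                g[q] = (min(x[q] for x in part), max(x[q] for x in part))
            generalized.append(g)
        return [generalized]

    return anonymize(rows)

people = [{"age": a, "zip": z} for a, z in
          [(21, 55501), (24, 55501), (22, 55502),
           (60, 90210), (61, 90211), (65, 90212)]]
classes = mondrian(people, ["age", "zip"], k=3)
# Two boxes of three people each, e.g. age (21, 24) x zip (55501, 55502).
```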

If you are implementing k-anonymization, it is useful to be aware of the impact the technique has on the statistical properties of the data. In the Repl below, you can see how various values of k affect the statistical distributions of the underlying dataset.
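As a tiny, self-contained illustration of that trade-off (separate from the Repl), the sketch below compares the mean age before and after replacing each age with the midpoint of an assumed 10-year generalization bin; the per-record error is bounded by half the bin width:

```python
ages = [18, 19, 22, 23, 35, 38, 41, 44, 52, 58]

def bin_midpoint(age, width=10):
    low = (age // width) * width  # e.g. 35 falls in the bin [30, 40)
    return low + width / 2        # represent the age by the bin midpoint

true_mean = sum(ages) / len(ages)
anon_mean = sum(bin_midpoint(a) for a in ages) / len(ages)
# Generalization shifts each value by at most width/2, so the two
# means stay within 5 of each other for this bin width.
```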