Data protection and keeping sensitive information private are very important for companies and their customers. Several huge data leaks in the past have led to trust issues with the companies involved. Getting value from text data with machine learning requires large collections of documents, but accessing them can be a privacy issue for customers in finance, legal, medicine, and many other fields. As a freelancer in data science and machine learning, you need access to large quantities of data to build accurate and useful models. But for some of your clients, it might be (with good reason) scary or impossible to disclose their data raw and unprotected.
So how can you work with sensitive text data at scale yet keep the contained information as secure as possible? In this article, I'm going to show you some of my methods to work with sensitive text data and discuss what the caveats are.
To explain the methods, we use the 20 Newsgroups dataset, which is easily available through scikit-learn. The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. So we're facing a document classification problem here. We start by loading the data through the scikit-learn API.
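Assuming scikit-learn is installed, loading the predefined train/test splits could look like this (the variable names are my own):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch the predefined train/test splits (downloaded and cached on first use)
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

print(len(train.data), "training documents")
print(len(test.data), "test documents")
```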
We have the following 20 categories for the documents:

- alt.atheism
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- misc.forsale
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- soc.religion.christian
- talk.politics.guns
- talk.politics.mideast
- talk.politics.misc
- talk.religion.misc
Let's have a look at an example.
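With the data loaded through scikit-learn, we can print a document together with its label (a sketch; the dataset is shuffled by default, so your first document may differ from the one shown in this article):

```python
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")

# Show the first training document and its newsgroup label
print(train.data[0])
print("Label:", train.target_names[train.target[0]])
```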
So, this document is about cars, which we would probably have guessed.
We first set up the machine learning pipeline we will use throughout the article. For simplicity, we use a bag-of-words TF-IDF model with a naive Bayes classifier, a simple, effective, and popular method for text classification. But the proposed methods would also work with more complex models such as neural networks.
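A minimal version of such a pipeline could look like this (the helper name `make_text_pipeline` is my own, not from the original code):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def make_text_pipeline():
    # Bag-of-words TF-IDF features fed into a multinomial naive Bayes classifier
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", MultinomialNB()),
    ])
```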
To compute a performance baseline, we first run the pipeline with unprotected raw data.
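Putting it together (the data is reloaded here so the snippet stands alone; the exact score will vary with preprocessing choices):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(train.data, train.target)
print(f"Baseline accuracy: {pipeline.score(test.data, test.target):.3f}")
```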
Since we want to do machine learning with the documents, we want to preserve as much information as possible. Depending on how critical your data is, you can pick from several ways to do this. I'll show you three different approaches and compare their performance on our simple machine learning model. The basic idea here is that (most) machine learning models treat the token Hello the same way as 16ff566c558eb688e. As long as the relationships between the tokens are preserved, everything will work fine. In the end, what the models see is just numbers.
We start out with a method to automatically remove personally identifiable information such as names and locations. Depending on your dataset, you might also remove credit card numbers or certain IDs this way. A simple, basic approach is to use a named entity tagger to find this information in the text and then replace it with a placeholder such as its entity type. For tokenization and named entity recognition we will use the awesome spaCy library.
Let's see what the entity tagger found.
So, the tagger detected the person Lerxst and the location University of Maryland. We now replace these entities with their entity type. You can already see a caveat of this method: it didn't recognize the email address. But you could filter that out with a regular expression.
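One way to implement the replacement step (a sketch; the function name `anonymize` and the chosen entity labels are my own assumptions):

```python
def anonymize(text, nlp, labels=("PERSON", "GPE", "ORG", "LOC")):
    """Replace named entities of the given types with their type label."""
    doc = nlp(text)
    out = text
    # Replace from the end of the text so character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ in labels:
            out = out[:ent.start_char] + ent.label_ + out[ent.end_char:]
    return out
```

As noted, email addresses slip through the tagger; a regular expression such as `r"\S+@\S+"` could catch most of them as an extra pass.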
We can clearly see that the method performs only slightly worse than the baseline. But this depends on the dataset and problem at hand. On the downside, the method has no guarantee that all relevant personal information is found. So, keep that in mind.
For some use cases, removing all personal information by type can cause problems. For example, specific locations can carry relevant information for your problem. So we modify the previous approach: instead of replacing the detected personal information with its type, we replace it with a unique hash value. This keeps the information available to the machine learning model while preserving its privacy.
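Sticking with the same setup, the replacement step can hash the entity text instead (a sketch; the salt handling here is my own assumption, and an unsalted hash of a name is easy to attack with a dictionary):

```python
import hashlib

def pseudonymize(text, nlp, labels=("PERSON", "GPE", "ORG", "LOC"), salt="my-secret"):
    """Replace named entities with a salted hash, so the same entity
    always maps to the same opaque token."""
    doc = nlp(text)
    out = text
    # Replace from the end of the text so character offsets stay valid
    for ent in reversed(doc.ents):
        if ent.label_ in labels:
            digest = hashlib.sha256((salt + ent.text).encode("utf-8")).hexdigest()[:16]
            out = out[:ent.start_char] + digest + out[ent.end_char:]
    return out
```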
This method performs well by keeping more relevant information. However, it still suffers from the privacy problems of the previous approach. Also, the hashes can be cracked by brute force or count-based statistical approaches.
One way to mitigate the problems of the previous two methods is to go fully encrypted. To keep as much information as possible, we map every word to a unique secret value that cannot easily be inverted; a hash function works well here. This doesn't change the vector space of our bag-of-words model, but it makes the text impossible for humans to understand.
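A minimal sketch with a salted SHA-256 per token; I split on whitespace for brevity, but the spaCy tokenizer from above would be the more careful choice:

```python
import hashlib

def encrypt_tokens(text, salt="my-secret"):
    """Map every token to a salted one-way hash; identical tokens keep
    identical hashes, so bag-of-words statistics are preserved."""
    return " ".join(
        hashlib.sha256((salt + token).encode("utf-8")).hexdigest()[:16]
        for token in text.split()
    )

print(encrypt_tokens("Hello from the University of Maryland"))
```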
We can see that we keep most of the performance compared to the baseline. One serious downside of this method is that you cannot interpret the dataset after applying it. This makes it very secure, but it also makes the machine learning workflow more tedious. Also note that the method is still vulnerable to statistical and brute-force attacks to decrypt the data.
We saw three methods for working with text datasets while keeping sensitive personal or commercial information safe. One serious drawback is that they make it harder (or impossible) to diagnose your models. But they can help you get started with your projects faster, and you can build on them to craft the method that fits your use case best.
Here is a quick overview of the covered methods:

- Replace named entities with their type: easy to apply, but may miss some personal information and removes detail the model could use.
- Replace named entities with their hash: keeps entity identity available to the model, but hashes can be attacked by brute force or count-based statistics.
- Hash every token: keeps the full vector space and hides the most, but makes the dataset uninterpretable and is still vulnerable to statistical attacks.
I hope this article helps you in your day-to-day work as a data scientist or machine learning engineer, and especially as a freelancer in these fields. Nonetheless, always keep in mind that none of these methods is completely safe against attacks!