Finding British names, phone numbers, and addresses in text

Posted: March 22, 2022   Posted By: Abhinay Mehta

This article grew out of a very specific problem on a real-life project, which we solved successfully with the approach described below.

The Problem: Anonymizing Text

Imagine that you have a huge number of documents (in the millions) containing personal data about some people. You need to label these documents using crowdsourcing websites like Mechanical Turk.

Before you can send millions of documents to strangers around the world, you need to make sure all personally identifiable information (PII) is removed from the text.

The list of data entities you need to consider is rather long: it includes names, home addresses, phone numbers, credit card numbers, IP addresses, etc. Going through 10 million documents manually would take far too long, so this requires an automated solution.

The Solution: London Analytics

Trying to solve this problem on your own is rather difficult. Identifying names, locations, etc. in text is known in the world of AI and Natural Language Processing as Named Entity Recognition (NER), and there are several open source tools and service providers that attempt to help you with it.

But if you are based in the UK (as we are) and need to find British-specific data in text, such as UK addresses and UK phone numbers, you will find there aren’t many options to choose from. One online service, London Analytics, attempts to do exactly this.

Example Usage: With Python

London Analytics provides an online tool and an API to help find PII in text. After contacting London Analytics for an API Key, you can do the following.

Prerequisite:

> $ python -m pip install requests

Get a list of items the API could identify as potentially meaningful from some text:
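As a minimal sketch of what such a request could look like: the endpoint URL, the bearer-token header, the `{"text": ...}` payload and the helper name `find_entities` below are all assumptions made for illustration, not the documented London Analytics API, so check the details you receive with your API Key before using it.

```python
# Placeholder values: the real endpoint URL, authentication scheme and
# request format are assumptions here, not the documented API.
API_URL = "https://api.example.com/pii"
API_KEY = "YOUR_API_KEY"


def find_entities(text, api_url=API_URL, api_key=API_KEY, post=None):
    """POST the text to the service and return the decoded JSON reply."""
    if post is None:
        # Imported lazily so the function can also be exercised with a stub.
        import requests
        post = requests.post
    response = post(
        api_url,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        json={"text": text},  # assumed payload shape
        timeout=30,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()
```

With a real URL and key in place, `print(find_entities("some text"))` would show whatever the service returns for that text.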

This prints:

Now you can go through the list of data types you’re interested in. For us this included: PERSON, STREET_ADDRESS, POST_CODE, PHONE, EMAIL, IP, IPV6, CREDIT_CARD, and REFERENCE. We used this information to anonymize our text like this:
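A rough sketch of that replacement step, assuming each detected item is a dictionary with a `type` plus `start` and `end` character offsets into the text. Those field names and the `anonymize` helper are hypothetical, so adapt them to the response format you actually get back.

```python
# The data types we chose to redact (taken from the list above).
WANTED_TYPES = {
    "PERSON", "STREET_ADDRESS", "POST_CODE", "PHONE", "EMAIL",
    "IP", "IPV6", "CREDIT_CARD", "REFERENCE",
}


def anonymize(text, entities, wanted=WANTED_TYPES):
    """Replace each wanted entity span with a [TYPE] placeholder."""
    selected = [e for e in entities if e["type"] in wanted]
    # Work from the end of the text backwards so that replacing one span
    # does not shift the offsets of the spans still to be processed.
    for e in sorted(selected, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + "[" + e["type"] + "]" + text[e["end"]:]
    return text


# Hand-written sample data in the assumed format (not real API output).
sample = "Call John Smith on 020 7946 0958."
entities = [
    {"type": "PERSON", "start": 5, "end": 15},
    {"type": "PHONE", "start": 19, "end": 32},
]
print(anonymize(sample, entities))  # Call [PERSON] on [PHONE].
```

This sketch assumes the spans do not overlap. If the service can return overlapping items, they would need to be merged before replacing.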

This prints:

Caveat: No service is perfect, and some PII will slip through the cracks here and there, so you need to do your own risk assessment on how best to use this service.

Conclusion

Apart from helping us with anonymization, finding out whether documents contain PII has been useful for several other reasons:

  • Choosing retention periods for different document types
  • Choosing storage security strategies
  • Deciding which types of documents can or can’t be shared with third parties
  • Helping compliance departments assess whether documents follow compliance rules