Anonymize Data

Cleansing unlabelled data of personally identifiable material before upload


🦸 SuperPower Objectives:

  • Use Microsoft Presidio to replace person names and telephone or account-like numbers in the dataset
  • Create an unlabelled workspace from the anonymized utterances
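
The first objective can be sketched with Presidio's analyzer and anonymizer engines. This is a minimal illustration, not the code in abcd_unlabelled.py: the function names, the entity list, and the sample sentence are assumptions. It needs the presidio-analyzer and presidio-anonymizer packages and a spaCy English model installed. Account-like numbers may need additional entity types or custom recognizers that are not shown here.

```python
# Entity types matching the objectives above (an assumption about what the
# script targets; account-like numbers may need further recognizers).
ENTITIES = ["PERSON", "PHONE_NUMBER"]


def build_engines():
    """Create the Presidio engines once; loading the spaCy model is slow."""
    # Imported lazily so this module can be loaded without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    return AnalyzerEngine(), AnonymizerEngine()


def anonymize_utterance(text, analyzer, anonymizer, entities=ENTITIES):
    """Detect the requested entity types and replace them in the text."""
    results = analyzer.analyze(text=text, entities=entities, language="en")
    # With no operators given, Presidio replaces each hit with <ENTITY_TYPE>.
    return anonymizer.anonymize(text=text, analyzer_results=results).text


def demo():
    """Hypothetical usage: anonymize one made-up utterance and print it."""
    analyzer, anonymizer = build_engines()
    sample = "Hi, this is Jane Smith, you can call me on 212-555-0123."
    print(anonymize_utterance(sample, analyzer, anonymizer))
```

Calling demo() prints the sample with the detected spans replaced by placeholders. Building the engines once and reusing them for every utterance avoids reloading the language model on each call.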

Example

./abcd_download.sh
source venv/bin/activate
python abcd_unlabelled.py --sample 100 --anonymize

If you open ./data/abcd_unlabelled05.json, you will see the anonymized sample of utterances.

Anonymization is expensive in CPU and memory because Presidio uses a large spaCy language model, so running the script on the full dataset will take a while.
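
One way to gauge how long a full run might take is to time a small sample and extrapolate. The sketch below is an assumption-laden estimate, not part of the exercise scripts: both function names are hypothetical, the anonymize argument stands for whatever per-utterance function you use, and the extrapolation is linear and ignores the one-off cost of loading the model.

```python
import time
from typing import Callable, Iterable


def seconds_per_item(anonymize: Callable[[str], str], sample: Iterable[str]) -> float:
    """Average wall-clock seconds spent per utterance over a sample."""
    items = list(sample)
    if not items:
        raise ValueError("sample must contain at least one utterance")
    start = time.perf_counter()
    for text in items:
        anonymize(text)
    return (time.perf_counter() - start) / len(items)


def estimate_total_seconds(per_item: float, total_items: int) -> float:
    """Linear extrapolation from the per-utterance cost to the full set."""
    return per_item * total_items
```

For example, if a sample averages 0.5 seconds per utterance, a set of 200 utterances would be estimated at about 100 seconds.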

Microsoft Presidio is available under the MIT license.

See the Load Data exercise for how to upload your data.