Anonymize Data

Cleansing unlabelled data of personally identifiable material before upload

Use Microsoft Presidio to replace person names and telephone or account-like numbers in the dataset
Create an unlabelled workspace from the anonymized utterances

./abcd_download.sh
source venv/bin/activate
python abcd_unlabelled.py --sample 100 --anonymize

If you check ./data/abcd_unlabelled05.json you will see the anonymization of a sample of utterances.

Anonymization is expensive in CPU and memory (it uses a large spaCy language model), so running the script on the full set will take a while.
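For orientation, here is a minimal sketch of how Presidio can replace names and numbers in a list of utterances. It is not the code inside abcd_unlabelled.py, which may work differently. The function names, the entity list (PERSON, PHONE_NUMBER, US_BANK_NUMBER) and the sample utterance are assumptions made for illustration. It assumes the presidio-analyzer and presidio-anonymizer packages and a spaCy English model are installed.

```python
def build_engines():
    """Create the Presidio engines (hypothetical helper).

    Imports are local so the rest of the module can be loaded without Presidio.
    Building the analyzer loads a spaCy model, which is the slow, memory-heavy part.
    """
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    return AnalyzerEngine(), AnonymizerEngine()


def anonymize_utterances(utterances, analyzer, anonymizer,
                         entities=("PERSON", "PHONE_NUMBER", "US_BANK_NUMBER")):
    """Return a copy of utterances with detected entities replaced.

    The entity list is an assumption; adjust it to the data being cleansed.
    """
    cleaned = []
    for text in utterances:
        # Find spans that look like the requested entity types.
        findings = analyzer.analyze(text=text, entities=list(entities), language="en")
        # Replace each span; with no operators given, Presidio applies its default replacement.
        result = anonymizer.anonymize(text=text, analyzer_results=findings)
        cleaned.append(result.text)
    return cleaned


def main():
    # Build the engines once and reuse them for every utterance.
    analyzer, anonymizer = build_engines()
    sample = ["Hi, this is Jane Smith, my number is 212-555-0147."]  # made-up utterance
    for line in anonymize_utterances(sample, analyzer, anonymizer):
        print(line)
```

Calling main() runs the sketch on the sample utterance. Building the engines once and reusing them matters because loading the spaCy model dominates start-up time.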

See the Load Data exercise for how to upload your data.