Load Data

Loading Data Basics#

If you just want to do the exercises, you don't need to do this! We provide all the datasets and workspaces as demo workspaces in the Studio GUI.

If you want to load your own data this page is for you.

You can also easily load unlabelled data as utterances from a txt file, or from one of the pre-built integrations.
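As a rough illustration (the file name and utterances below are made up, and one utterance per line is the assumed layout), preparing such a txt file in Python might look like this:

# Sketch only: write unlabelled utterances one per line to a txt file.
# The file name and example utterances are illustrative.
utterances = [
    "I want to cancel my subscription",
    "Where is my order?",
    "Can I change my delivery address?",
]
with open("unlabelled_utterances.txt", "w", encoding="utf-8") as file_out:
    file_out.write("\n".join(utterances) + "\n")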

If you want to upload your own data manually you can also use a simple csv file: for unlabelled data this lets you preserve the conversational context and filter by date, and for labelled data it lets you upload the intent each utterance belongs to.
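Purely as an illustration (the column names here are hypothetical; you map your own columns during the upload), the two CSV shapes might be prepared like this:

# Sketch of the two CSV shapes described above.
# Column names and values are illustrative; map them to the right fields on upload.
import csv

# Unlabelled: utterance plus conversation id, speaker and timestamp
with open("unlabelled.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["conversation_id", "speaker", "created_at", "utterance"])
    writer.writerow(["conv-001", "client", "2022-05-03T10:15:00Z", "I want to cancel my subscription"])
    writer.writerow(["conv-001", "agent", "2022-05-03T10:15:30Z", "Sure, I can help with that"])

# Labelled: utterance paired with the intent it belongs to
with open("labelled.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["utterance", "intent"])
    writer.writerow(["I want to cancel my subscription", "cancel_subscription"])
    writer.writerow(["Where is my order?", "order_status"])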

What that doesn't let you do is:

  • Preserve detailed metadata present at the point the utterance was captured
  • Preserve any tags an annotator may have provided in addition to the intent label
  • Annotate entity spans within your labelled utterances

To do any of these outside a pre-built integration you'll need to use the HumanFirst JSON format.

If you'd like to do that, this is the place for you!
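To give a feel for the format before you reach the helper files below, an unlabelled example carries the utterance text plus its conversational context, metadata and tags. The field names in this sketch are indicative only; the helper files and humanfirst.py in the academy repo are the reference:

# Indicative sketch only - check the JSON helper files and humanfirst.py in the
# academy repo for the authoritative field names and structure.
import json

example = {
    "id": "conv-001-utt-0",
    "text": "I want to cancel my subscription",
    "created_at": "2022-05-03T10:15:00Z",
    # metadata captured at the point the utterance was recorded
    "metadata": {"channel": "web-chat", "customer_tier": "gold"},
    # tags applied on upload to make interesting cases easy to find
    "tags": [{"name": "cancellation"}],
    # context linking the utterance back to its conversation and speaker
    "context": {"context_id": "conv-001", "type": "conversation", "role": "client"},
}

with open("unlabelled.json", "w", encoding="utf-8") as f:
    json.dump({"examples": [example]}, f, indent=2)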


Loading data, maximising the conversational context for annotators#

🦸 SuperPower Objectives#

Part 1 - Technical#

  • Get helper files describing the JSON objects
  • Run an example producing the HumanFirst Academy ABCD May and June files

Part 2 - In the Studio - what that lets you do#

  • View all utterances within the context of the conversation
  • Preserve important metadata already present in the data set coming from your existing chat, NLU or bot system
  • Filter based on that metadata
  • Tag the data on upload to make it easy to find particular cases the business may be interested in
  • Index the speakers so you can zoom in on the first and second utterances when bootstrapping a bot
  • Or on the last client utterance when looking at fallbacks
  • Split the data into a train and validate set by day/week/month (a sketch of deriving this kind of per-utterance metadata follows this list)
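As a rough sketch of the kind of per-utterance metadata behind the last few points (this is the idea, not the academy script itself), you might derive a speaker index and a month field like this:

# Rough sketch: derive per-utterance metadata supporting speaker indexing and
# date-based splits. Conversation content and field names are illustrative.
from datetime import datetime

conversation = [
    {"speaker": "client", "text": "Hi, my card was declined", "created_at": "2022-05-03T10:15:00Z"},
    {"speaker": "agent", "text": "Let me check that for you", "created_at": "2022-05-03T10:15:30Z"},
    {"speaker": "client", "text": "Thanks, it happened twice", "created_at": "2022-05-03T10:16:02Z"},
]

client_turn = 0
for i, utterance in enumerate(conversation):
    created = datetime.fromisoformat(utterance["created_at"].replace("Z", "+00:00"))
    if utterance["speaker"] == "client":
        client_turn += 1
    utterance["metadata"] = {
        "seq": str(i),                       # position within the conversation
        "client_turn": str(client_turn),     # zoom in on first/second client utterances
        "month": created.strftime("%Y-%m"),  # supports a train/validate split by month
    }

print(conversation[0]["metadata"])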

Loading Data with Metadata and Tags into HumanFirst

Instructions#

Acquiring and running the loading script examples#

We provide a humanfirst.py helper class which represents the objects in the HumanFirst JSON, along with example scripts to load data into HumanFirst in various ways.

  • Example scripts are provided for converting popular formats into the HumanFirst JSON with Python
  • They are kept in this repository: https://github.com/zia-ai/academy
  • The repository provides an ubuntu:focal based Dockerfile to run the scripts
  • They should also run fine in a Linux Python 3 environment
  • If you are a Mac user, it's reported that you can run the scripts directly if you wish, but we don't test for this
  • We have made the scripts run on Windows for clients, but you will need to adjust the commands for path names, environment variables and other such Windows things
  • Using the Dockerfile and mounting your Windows directory is much easier
  • For instructions see the README; the licence is also in the repo.

General usage#

We don't provide the ABCD dataset directly, but we do provide a downloader if you wish to use it.

./abcd_download.sh

Make sure your Python venv is set up, and then run the unlabelled script. You probably don't want to process all 10,000 conversations just to get an example of how to load data, so you can randomly sample a smaller number:

source venv/bin/activate
python ./abcd_unlabelled.py --sample 100
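Conceptually the sampling step is just a random subset of the conversations taken before conversion; the sketch below illustrates the idea (it is not the actual abcd_unlabelled.py code, and the input file name is made up):

# Conceptual sketch of sampling a subset of conversations;
# see abcd_unlabelled.py in the repo for what the script actually does.
import json
import random

# File name and structure are illustrative only.
with open("abcd_conversations.json", "r", encoding="utf-8") as f:
    conversations = json.load(f)

random.seed(42)  # fixed seed so the sample is reproducible
sample = random.sample(conversations, 100)
print(f"Kept {len(sample)} of {len(conversations)} conversations")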

Running the script produces two unlabelled data sets, one for May and one for June. Go to data management in the left bar of the tool. Upload each file to a different dataset, so you can select one for initial creation and one for validation, or compare coverage across the two months.

Importing the ABCD demo dataset from the demo workspaces for an exercise doesn't use up datapoints; those workspaces were created from the above files. But the system will treat any data you upload yourself as a new copy, so uploading these files again will use up some of your datapoint allowance.

If you previously had an older version of the academy alpha scripts, they used to produce a labelled file as well. With the split of metadata and tags it is no longer necessary to couple them: you can upload your labelled and unlabelled data entirely separately.

The repo contains examples of loading data from various NLU systems we don't have native integrations for at the moment.

But the best way to get started quickly on the ABCD data with a labelled set, is to use one of the academy demo workspaces.