Loading Data Overview

If you just want to do the exercises, you don't need to do this! We provide all the datasets and workspaces as demo workspaces in the Studio GUI.

If you want to load your own data, this page is for you.

You can also load unlabelled data as utterances from a txt file or from one of the pre-built integrations.

If you want to upload your own labelled or unlabelled data manually, you can also use a simple CSV file.

What uploading labelled or unlabelled data in CSV format doesn't give you access to is:

  • Detailed metadata present at the point of capture of the utterance (this can really help the annotator)
  • Tags to track your unlabelled data as you turn it into a model
  • Entities and annotations within intent training data

To do any of these outside a pre-built integration you'll need to use the HumanFirst JSON format.

If you'd like to do that, this is the place for you!

Helpful core doc references:

Loading data, maximising the conversational context for annotators

🦸 SuperPower Objectives:

Part 1 - Technical

  • Get the HumanFirst Python module and Academy scripts to help make data uploading easy
  • Use the module and scripts to convert an example CSV containing conversations and other useful metadata into a HumanFirst JSON object
  • Upload it to the HumanFirst tool
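
As a rough illustration of the conversion step, the sketch below reads a small conversation CSV and builds a JSON document with one example per utterance, keeping the conversation context and any extra columns as metadata. It uses only the Python standard library, not the HumanFirst module or Academy scripts. The CSV column names (`conversation_id`, `speaker`, `created_at`, `text`, `intent_name`, `confidence`) and the JSON field names are assumptions made for illustration; check the HumanFirst JSON schema in the core docs for the exact format.

```python
import csv
import io
import json

# Hypothetical CSV layout; real exports will have different column names.
SAMPLE_CSV = """conversation_id,speaker,created_at,text,intent_name,confidence
c1,client,2023-11-01T09:00:00Z,I lost my card,report_lost_card,0.91
c1,expert,2023-11-01T09:00:05Z,Sorry to hear that. I can block it now.,,
c2,client,2023-11-02T10:30:00Z,what are your opening hours,fallback,0.22
"""

# Columns mapped to dedicated fields; everything else becomes metadata.
CORE_COLUMNS = ("conversation_id", "speaker", "created_at", "text")


def csv_to_examples(csv_text):
    """Turn CSV rows into a dict holding one example per utterance."""
    examples = []
    for row_number, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # Keep every non-empty extra column as string metadata.
        metadata = {
            key: value
            for key, value in row.items()
            if key not in CORE_COLUMNS and value
        }
        examples.append({
            "id": f"example-{row_number}",
            "text": row["text"],
            "created_at": row["created_at"],
            # The context ties each utterance to its conversation and speaker.
            # These field names are assumed, not taken from the official schema.
            "context": {
                "context_id": row["conversation_id"],
                "type": "conversation",
                "role": row["speaker"],
            },
            "metadata": metadata,
        })
    return {"examples": examples}


workspace = csv_to_examples(SAMPLE_CSV)
print(json.dumps(workspace, indent=2))
```

Keeping the runtime intent name and confidence as metadata on each utterance is what makes filtering on them possible after upload.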

Part 2 - In the Studio - what does that let you do

  • View all utterances within context of the conversation

  • Preserve important metadata already present in the data set that has come from your existing chat, NLU or bot system.

  • Filter based on that metadata

  • If you can, it is highly recommended to upload the conversations with the following metadata. Each of the metadata fields below can support powerful flows to improve the performance of a chatbot or NLU model.

    • Conversation ID:

      • Purpose: The Conversation ID is a unique identifier for each interaction. This ID is essential for tracking and referencing specific conversations within larger datasets.

      • Reason for Inclusion: Including the ID allows users to correlate chatbot interactions with other data in their systems, such as customer profiles or transaction history. This correlation enables a more holistic understanding of the conversation in the context of the user's journey and needs.

    • First/Second Utterance:

      • Purpose: This indicates whether an utterance is the first or second in a conversation.

      • Reason for Inclusion: This information is useful in routing analysis as it provides initial information about the topic, tone, and direction of the conversation. Analysts can quickly identify the starting point of a dialogue and understand the initial intent or inquiry.

    • Runtime Classification (Intent Name and Confidence):

      • Purpose: Runtime classification includes the name of the intent recognized by the chatbot and the confidence level of this recognition.

      • Reason for Inclusion: This information is useful for identifying and analyzing the performance of intents. Understanding which intents are recognized with high or low confidence helps in pinpointing areas where the chatbot may need additional training. Focusing on problematic intents (those with lower confidence scores) can significantly improve the accuracy and effectiveness of the chatbot.

    • Fallback Indicator:

      • Purpose: This indicates instances where the chatbot could not understand or process the user's input and defaulted to a generic response or a fallback mechanism.

      • Reason for Inclusion: Tracking fallback occurrences is crucial for identifying gaps in the chatbot's understanding or areas lacking in training. Regular analysis of fallback incidents can guide improvements in the chatbot's design, ensuring better handling of similar queries in the future.

    • Escalation Indicator:

      • Purpose: This marks conversations that were escalated to a human agent or involved other significant events.

      • Reason for Inclusion: The escalation indicator helps in understanding the contexts or conversation types that the chatbot struggles to handle. It provides insights into the scenarios where human intervention is necessary, allowing for targeted improvements in the chatbot's capabilities. It also helps in assessing the overall efficiency of the automated system and the balance between automation and human support.

    • In summary, including these metadata elements in the uploaded conversations enables a comprehensive analysis of whether each conversation was successful at runtime. This identifies specific areas for improvement, guides training efforts, and ultimately leads to a more effective and efficient chatbot system. If you don't have this information you can still improve the model, but you will need to run simulations in the tool of what we think the NLU would have predicted at runtime. Including the data up front saves time and API calls.

  • Tag the data on upload to make it easily accessible for use cases your team might be interested in.

  • Index the speakers in the data so you can zoom in on the first and second utterances when bootstrapping a bot

  • Split the data into a train and validate set by day/week/month
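
The speaker indexing and date-based split above can be prepared before upload. The following is a minimal sketch in plain Python, assuming utterance records with hypothetical `conversation_id`, `role`, `created_at` and `text` fields. It does not use the HumanFirst module, and the `client_index` field name is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical utterance records; field names are illustrative only.
utterances = [
    {"conversation_id": "c1", "role": "client",
     "created_at": "2023-11-01T09:00:00", "text": "I lost my card"},
    {"conversation_id": "c1", "role": "expert",
     "created_at": "2023-11-01T09:00:05", "text": "I can block it now."},
    {"conversation_id": "c1", "role": "client",
     "created_at": "2023-11-01T09:00:20", "text": "Yes please, and send a new one"},
    {"conversation_id": "c2", "role": "client",
     "created_at": "2023-11-08T10:30:00", "text": "what are your opening hours"},
]


def index_client_utterances(records):
    """Number each client utterance within its conversation, starting at 0."""
    counters = defaultdict(int)
    indexed = []
    ordered = sorted(records, key=lambda r: (r["conversation_id"], r["created_at"]))
    for record in ordered:
        record = dict(record)  # copy so the input list is left untouched
        if record["role"] == "client":
            record["client_index"] = counters[record["conversation_id"]]
            counters[record["conversation_id"]] += 1
        indexed.append(record)
    return indexed


def split_by_date(records, cutoff):
    """Records created before the cutoff go to train, the rest to validate."""
    train, validate = [], []
    for record in records:
        created = datetime.fromisoformat(record["created_at"])
        (train if created < cutoff else validate).append(record)
    return train, validate


indexed = index_client_utterances(utterances)
first_utterances = [r["text"] for r in indexed if r.get("client_index") == 0]
train, validate = split_by_date(indexed, datetime(2023, 11, 6))

print(first_utterances)
print(len(train), len(validate))
```

Splitting on a date cutoff keeps the validate set strictly later than the train set, which mirrors how a model trained on past traffic is judged on newer traffic. If a conversation could straddle the cutoff, you may prefer to split on the timestamp of its first utterance instead.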