Evaluation Tool

HumanFirst lets you run evaluations on your workspace to uncover problems and optimize your NLU training data.

What problems does this solve?#

  • Identify which intents need additional training examples.
  • Identify which intents have high confusion (typically intents with a mix of very different training examples).
  • Identify which training examples belong to another intent in your corpus.
  • Re-label problematic training examples.

Getting an evaluation report#

Evaluation reports are generated on demand and can only be run on workspaces that contain two or more intents, each with five or more training examples. The more intents and training examples your workspace has, the better your evaluation reports will be.

To get your evaluation report, go to the Evaluation section and click Run evaluation. This operation can take a few minutes to complete.

Identifying problematic intents#

The evaluation report provides a list of your intents with their F1 scores (higher is better).
Give extra attention to intents with low scores, especially those highlighted in red.

The colors are general indicators of possible problems, but not a guarantee that anything needs to be changed.
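
HumanFirst computes these F1 scores for you. Purely to illustrate what an intent-level F1 score captures, here's a minimal sketch that derives per-intent F1 from true versus predicted labels with scikit-learn; the labels are made up, and the way HumanFirst actually trains and validates its evaluation model may differ.

```python
# Illustrative only: how an intent-level F1 score can be derived from
# true vs. predicted intent labels on held-out utterances.
from sklearn.metrics import f1_score

# Hypothetical held-out utterances: true intent vs. the intent the model predicted.
y_true = ["billing", "billing", "cancel", "cancel", "cancel", "greeting"]
y_pred = ["billing", "cancel", "cancel", "cancel", "billing", "greeting"]

intents = ["billing", "cancel", "greeting"]
scores = f1_score(y_true, y_pred, labels=intents, average=None)

for intent, score in zip(intents, scores):
    # F1 is the harmonic mean of precision and recall; higher is better.
    print(f"{intent}: F1 = {score:.2f}")
```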

Identifying problematic training examples#

Once you've selected an intent from the evaluation report, you'll see its training examples ranked by confusion (entropy).

Confusion score: a measure of how confused the model is by this training example. A high confusion score means the training example is often misclassified.

If your intent has very few training examples (fewer than 20), it's likely that the intent would benefit from additional training examples.
For any intent with sufficient data, training examples with high confusion scores are prime candidates for re-labeling.
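
The report computes this score for you. Purely to illustrate what an entropy-based confusion measure captures, here's a minimal sketch; the function and probability values are hypothetical and not HumanFirst's actual implementation.

```python
import math

def confusion_score(intent_probs):
    """Shannon entropy of the predicted intent distribution for one
    training example. Illustrative only; HumanFirst's exact formula
    (normalization, log base, etc.) is not documented here."""
    return -sum(p * math.log(p) for p in intent_probs.values() if p > 0)

# A confidently classified example: probability mass on one intent -> low entropy.
confident = {"billing": 0.95, "cancel": 0.03, "greeting": 0.02}

# A confused example: probability spread across several intents -> high entropy.
confused = {"billing": 0.40, "cancel": 0.35, "greeting": 0.25}

print(f"confident example: {confusion_score(confident):.2f}")  # ~0.23
print(f"confused example:  {confusion_score(confused):.2f}")   # ~1.08
```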

Re-labeling problematic training examples#

When you select a training example to re-label, we provide a ranked list of likely intents by leveraging the evaluation report.

If the recommendations don't contain the intent you wish to use for re-labeling, you can toggle between the recommended intents and your full intent tree.
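
HumanFirst builds this list for you from the evaluation report. As a rough sketch of the underlying idea, one way such a ranking could be produced is by sorting the intents the model predicts for that example by probability; the intent names and probabilities below are hypothetical.

```python
# Hypothetical predicted probabilities for a single confused training example.
predicted_probs = {
    "cancel_subscription": 0.46,
    "billing_question": 0.31,
    "refund_request": 0.18,
    "greeting": 0.05,
}

# Rank candidate intents from most to least likely as re-labeling suggestions.
suggestions = sorted(predicted_probs.items(), key=lambda kv: kv[1], reverse=True)

for rank, (intent, prob) in enumerate(suggestions, start=1):
    print(f"{rank}. {intent} ({prob:.0%})")
```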

Evaluation report lifecycle#

The evaluation report is a snapshot of your model data at the time the report is generated. It only includes metrics for intents and training examples that existed when the report was requested.

Over time, you'll update your model (creating or removing intents, adding or moving training examples, and so on). These changes will not appear in your existing evaluation report. The more your workspace has changed since the report was generated, the more you should consider re-running the evaluation to get reliable metrics.