Improve model accuracy

Quick description

Did you know

HumanFirst lets you run tests on your NLU data to uncover problems and optimize it.

You can:

  • Identify which intents need additional training examples
  • Identify which intents have high confusion (typically intents with a mix of very different training examples)
  • Identify which training examples belong to another intent in your corpus
  • Re-label problematic training examples

You must be on the Team or Enterprise tier to use this set of features.

Run evaluation

Evaluation reports are an on-demand feature that can be run on workspaces that contain two or more intents with five or more training examples each. The more intents and training examples your workspace has, the better your evaluation reports will be.
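As a rough illustration of the requirement above, the check can be sketched in a few lines of Python. This is not HumanFirst's implementation: the function name `can_run_evaluation` and the `(text, intent)` pair format are hypothetical, and the sketch assumes the rule means "at least two intents that each have five or more examples".

```python
from collections import Counter

MIN_INTENTS = 2   # from the text: two or more intents
MIN_EXAMPLES = 5  # from the text: five or more training examples each


def can_run_evaluation(labeled_examples):
    """Return True if at least MIN_INTENTS intents have MIN_EXAMPLES or more examples.

    `labeled_examples` is a hypothetical list of (text, intent_name) pairs.
    """
    counts = Counter(intent for _, intent in labeled_examples)
    eligible = [name for name, n in counts.items() if n >= MIN_EXAMPLES]
    return len(eligible) >= MIN_INTENTS
```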

To get your evaluation report, go to the Evaluation section and click Run evaluation. This operation can take a few minutes to complete.

Technical description

Evaluation runs a 5-fold cross-validation test.
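For readers unfamiliar with the term, k-fold cross-validation splits the data into k parts, then trains on k-1 parts and tests on the remaining part, rotating until every part has been used for testing once. The sketch below shows only the splitting step. The function name `k_fold_splits`, the shuffling, and the seed are assumptions for illustration; the exact splitting strategy HumanFirst uses (for example, whether folds are stratified by intent) is not stated here.

```python
import random


def k_fold_splits(examples, k=5, seed=0):
    """Yield (train, test) pairs; each example lands in exactly one test fold."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # assumption: shuffle before splitting
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test
```

With 10 examples and k=5, each round trains on 8 examples and tests on the other 2.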


F1 scores

The evaluation report provides a list of intents with their F1 scores (higher is better). Pay extra attention to intents with low scores, especially those in red.


The colors are general indicators of possible problems, but not a guarantee that anything needs to be changed.
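An F1 score is the harmonic mean of precision and recall, so an intent scores well only when the model both finds its examples and avoids assigning other intents' examples to it. The following is a minimal sketch of the standard per-class F1 calculation, not HumanFirst's own code; the function name `f1_per_intent` and the label lists are hypothetical.

```python
def f1_per_intent(true_labels, predicted_labels):
    """Return {intent: F1}, where F1 = 2 * P * R / (P + R)."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for intent in set(true_labels) | set(predicted_labels):
        tp = sum(1 for t, p in pairs if t == intent and p == intent)
        fp = sum(1 for t, p in pairs if t != intent and p == intent)
        fn = sum(1 for t, p in pairs if t == intent and p != intent)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[intent] = 2 * precision * recall / total if total else 0.0
    return scores
```

For example, if one of two "greet" examples is predicted as "bye", "greet" has perfect precision but a recall of 0.5, which gives an F1 of about 0.67.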

Training example confusion scores

Once you've selected an intent from the evaluation report, you'll see its training examples ranked by confusion score (entropy).

Confusion score: a measure of how confused the model is by a training example. A high confusion score means the training example is often misclassified.
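Entropy is low when the model puts nearly all of its probability on one intent and high when the probability is spread across several. The sketch below computes Shannon entropy over a list of predicted intent probabilities. It assumes the score is taken over the model's per-intent probabilities for one example and uses log base 2; neither detail is specified here, and the function name `confusion_score` is hypothetical.

```python
import math


def confusion_score(probabilities):
    """Shannon entropy (in bits) of a predicted intent distribution."""
    # Zero-probability entries contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

A confident prediction such as `[1.0, 0.0]` scores 0, while an even split between two intents, `[0.5, 0.5]`, scores 1 bit.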


If your intent has very few training examples (fewer than 20), it's likely the intent would benefit from additional training examples. For any intent with sufficient data, training examples with high confusion scores are prime candidates for re-labeling.

Take action

Select training phrase

When you select a training example to be re-labeled, HumanFirst provides a ranked list of likely intents by leveraging the evaluation report.

Recommended most similar

If the recommendations don't contain the intent you wish to use for re-labeling, you can toggle between the recommended intents and your full intent tree.

Evaluation report lifecycle

The evaluation report is a snapshot of your model data at the time the report is generated. You'll only get report metrics for intents and training examples that existed at the time of the report request.