Evaluation reports help identify weaknesses, improvements or regressions in your model.
You can run a k-fold cross-validation evaluation at any time within HumanFirst. When running a k-fold evaluation you can specify the number of folds (more folds increase the evaluation run time but yield more accurate results) and include or exclude intents or training examples from the evaluation.
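As a generic illustration of the k-fold idea (a conceptual sketch, not how HumanFirst implements it internally), the data is partitioned into k folds, and each fold takes one turn as the held-out evaluation set while the remaining folds act as training data. The example utterances below are hypothetical:

```python
# Generic sketch of k-fold splitting: every example lands in the held-out
# test fold exactly once across the k runs. Example data is made up.

def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]  # round-robin partition into k folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

examples = [f"utterance_{n}" for n in range(10)]  # stand-in training phrases
for fold_no, (train, test) in enumerate(k_fold_splits(examples, k=5), 1):
    print(f"fold {fold_no}: train={len(train)} test={len(test)}")
```

This also shows why more folds cost more time: each of the k folds requires training and evaluating a separate model.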
You can run a regression evaluation using a dedicated test set. Your test set is defined using tags.
Both evaluations will expose the following metrics:
Precision measures how many of your model's positive predictions are actually correct. It is calculated using the following formula:
# of true positives / (# of true positives + # of false positives).
Recall measures how many of the actual positives your model successfully predicts. It is calculated using the following formula:
# of true positives / (# of true positives + # of false negatives).
F1 is the harmonic mean of precision and recall. The formula is the following:
2 x [(Precision*Recall) / (Precision + Recall)]
Accuracy is calculated using the following formula:
# of correct predictions / # of all predictions
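The four formulas above can be sketched in plain Python. The counts used here are hypothetical illustration values, not output from an actual evaluation run:

```python
# Sketch of the four evaluation metrics computed from raw prediction counts.
# All counts below are made-up illustrative numbers.

def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that were correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives the model found."""
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * (p * r) / (p + r)

def accuracy(correct: int, total: int) -> float:
    """Fraction of all predictions that were correct."""
    return correct / total

tp, fp, fn = 80, 20, 10          # hypothetical counts
p = precision(tp, fp)            # 80 / (80 + 20) = 0.8
r = recall(tp, fn)               # 80 / (80 + 10) ≈ 0.889
print(round(p, 3), round(r, 3), round(f1(p, r), 3), accuracy(90, 100))
```

Note that precision and recall trade off against each other, which is why F1 is useful as a single summary number.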
By selecting multiple evaluations, you can compare them to visualize how your models' performance has changed over time.
You can easily export the full results of an evaluation report by clicking the menu next to the summary and selecting
Export to CSV. The CSV will contain a summary for intents, phrases, and entities.
When running an evaluation you can exclude parts of your training data (intents and/or training examples) using tags. Tags can be set on both intents and training examples. This is particularly useful if your model contains context-specific flows (like Dialogflow Pages/Flows). It allows you to test specific parts of your model and reflect the mutual exclusivity that intents might have.
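Conceptually, a tag-based exclusion filter could be sketched as below. The data layout, intent names, and the "exclude-from-eval" tag are all hypothetical; HumanFirst handles this for you in the evaluation settings:

```python
# Hypothetical sketch of excluding tagged intents from an evaluation run.
# Intent names and the "exclude-from-eval" tag are illustrative only.

intents = [
    {"name": "greet",         "tags": []},
    {"name": "page2_confirm", "tags": ["exclude-from-eval"]},
    {"name": "book_flight",   "tags": []},
]

def exclude_by_tag(intents, tag):
    """Keep only intents that do not carry the given tag."""
    return [i for i in intents if tag not in i["tags"]]

evaluated = exclude_by_tag(intents, "exclude-from-eval")
print([i["name"] for i in evaluated])  # → ['greet', 'book_flight']
```

Tagging the context-specific intents and excluding them keeps them from being counted as confusions against intents they could never compete with at runtime.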
Learn more about NLU and the various metrics we expose at the Machine Learning University (MLU).