Semantic Clustering

Overview

HumanFirst uses a language model to maintain a latent representation of every example. These representations enable features such as semantic search and clustering.

Starting from any utterance list (whether part of unlabelled data or intents), you can turn on clustering in order to filter the results and group semantically similar items together. This operation looks at the top ~2500 items and identifies clusters of semantically similar items. Any selected ordering is reflected in the order of the returned clusters.
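
As a rough illustration of how a selected ordering can carry over to the returned clusters, here is a minimal sketch. It is not HumanFirst's implementation: the function name `group_by_cluster`, the use of negative labels for unclustered items, and the rule that a cluster is ranked by its best-ranked member are all assumptions made for the example.

```python
def group_by_cluster(ranked_items, labels, limit=2500):
    """Group the top `limit` ranked items by cluster label.

    Clusters are returned in order of their best-ranked member, and items
    inside each cluster keep their original ranking order.
    """
    clusters = {}  # dicts preserve insertion order (Python 3.7+)
    for item, label in zip(ranked_items[:limit], labels[:limit]):
        if label < 0:
            # Assumed convention: a negative label means "not in any cluster".
            continue
        clusters.setdefault(label, []).append(item)
    return list(clusters.values())


# Hypothetical utterances, already sorted by the selected ordering.
ranked = ["reset my password", "loan rates", "forgot password", "hello", "mortgage rates"]
labels = [0, 1, 0, -1, 1]

print(group_by_cluster(ranked, labels))
# [['reset my password', 'forgot password'], ['loan rates', 'mortgage rates']]
```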

Each cluster is shown with the number of items it contains. Clicking on a cluster brings up a Cluster Confirmation screen, where you can review the items and de-select any that should not be part of the selection. Clicking Confirm selects the remaining phrases, at which point you can either assign them to an intent or use the selection for semantic search (see below).

Parameters

Because clustering is inherently dependent on the underlying data distribution, it is sometimes necessary to tune parameters to get better results. When you turn clustering on, a small popup appears with some parameters. You can change them and immediately see new results based on the new settings.

  • Minimum Cluster Size

    • Defines the minimum number of utterances in each cluster. If you are getting many small clusters that are very similar to each other, try increasing this value to group them together.
  • Granularity

    • Roughly defines how granular the concepts captured by each cluster are. Try lowering it if the clusters reference concepts that are too precise. For example, if you want to group all financial loan types together, you may need to lower the granularity if you are getting many clusters, each covering a different loan type.
    • If you are used to hdbscan, granularity = min_samples - 30
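
For readers coming from hdbscan, the relation above can be inverted to get an equivalent `min_samples` value. The sketch below only restates that arithmetic. The helper name `to_hdbscan_params` is hypothetical, and mapping Minimum Cluster Size to hdbscan's `min_cluster_size` is an assumption rather than something this page confirms.

```python
def to_hdbscan_params(minimum_cluster_size, granularity):
    """Translate the two popup parameters into hdbscan-style keyword arguments.

    Uses the relation stated above: granularity = min_samples - 30,
    hence min_samples = granularity + 30.
    """
    return {
        # Assumption: Minimum Cluster Size corresponds to min_cluster_size.
        "min_cluster_size": minimum_cluster_size,
        "min_samples": granularity + 30,
    }


print(to_hdbscan_params(minimum_cluster_size=5, granularity=10))
# {'min_cluster_size': 5, 'min_samples': 40}
```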

Recipes

Clustering really shines when combined with other filters, allowing you to sift through larger amounts of data.

Semantic Similarity

When clustering is on and items have been selected, clicking on Sort by similarity or Show similar to selection will perform a semantic search using all the selected phrases and then cluster the top matches. The resulting clusters will respect the similarity sort, with the most similar phrases shown in the main view.
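
To make the idea of searching with several selected phrases more concrete, here is a minimal sketch. It assumes the selection is collapsed into a single query vector by averaging and that candidates are ranked by cosine similarity. Both choices, the function names, and the toy two-dimensional vectors are illustrative assumptions, not a description of how HumanFirst performs the search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank_by_similarity(selected, candidates, top_k=2500):
    """Return candidate texts sorted from most to least similar to the selection."""
    query = centroid(selected)
    scored = sorted(
        candidates.items(),
        key=lambda pair: cosine(query, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:top_k]]


# Hypothetical latent vectors for the selected phrases and the candidates.
selected = [[1.0, 0.0], [0.9, 0.1]]
candidates = {
    "car loan": [0.8, 0.2],
    "weather": [0.0, 1.0],
    "home loan": [1.0, 0.05],
}

print(rank_by_similarity(selected, candidates))
# ['home loan', 'car loan', 'weather']
```

The ranked list would then be passed to the clustering step, so the clusters follow the similarity ordering.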

Expect a higher number of clusters in this mode, because the distribution of the data is now biased towards semantically similar items.

Active Learning

While in the data section, set your Active NLU engine to HumanFirst NLU (or another enabled engine, if using a third party). This unlocks new sorting options that allow you to sort all unlabelled examples by active learning metrics (Uncertainty, Margin score, Entropy). Sorting can be combined with clustering to identify clusters of examples on which the model performs poorly. Poor performance on a cluster is a good proxy for intent novelty, which helps with intent discovery.
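
The three metric names correspond to common active learning scores. The sketch below uses their usual textbook definitions, computed from a model's predicted intent probabilities for one example. The exact formulas HumanFirst uses are not given here, so treat these as assumptions, along with the function names and sample probabilities.

```python
import math


def uncertainty(probs):
    """Least-confidence score: 1 minus the top predicted probability."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely intents (needs at least two).

    A smaller margin means the model is less sure.
    """
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def entropy(probs):
    """Shannon entropy (natural log) of the predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical predicted probabilities over three intents.
predictions = {
    "hi there": [0.90, 0.05, 0.05],
    "need help with the thing": [0.40, 0.35, 0.25],
}

# Sort unlabelled examples so the ones the model is least sure about come first.
by_entropy = sorted(predictions, key=lambda text: entropy(predictions[text]), reverse=True)
print(by_entropy)
# ['need help with the thing', 'hi there']
```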