Data management

Datasets

Datasets help organize the unlabelled data files inside a namespace. They can be seen as folders of uploaded files that can be linked to any workspace in that namespace. Both utterances and conversations can be uploaded to datasets.

About data de-duplication

Utterances are de-duplicated within a dataset. This means that only one copy of each unique utterance is made available for labeling when you link a dataset to a workspace. However, de-duplication does not apply across datasets: if you link multiple datasets to a workspace and the same utterance exists in more than one of them, the duplicates will appear in your workspace.
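The behaviour described above can be illustrated with a small model. This is only a sketch, not the product's API: the function names `dedupe` and `workspace_view` are hypothetical, and it assumes that "unique" means an exact string match, which the text above does not specify.

```python
# Hypothetical model of the de-duplication rule; not the product's API.
# Assumption: two utterances are duplicates only if their text matches exactly.

def dedupe(utterances):
    """Keep the first occurrence of each utterance, preserving order."""
    return list(dict.fromkeys(utterances))


def workspace_view(linked_datasets):
    """Return the utterances a workspace would see.

    Each dataset is de-duplicated on its own; nothing is de-duplicated
    across datasets, so cross-dataset duplicates remain.
    """
    view = []
    for dataset in linked_datasets:
        view.extend(dedupe(dataset))
    return view


billing = ["reset my password", "reset my password", "cancel my plan"]
support = ["reset my password", "talk to an agent"]

view = workspace_view([billing, support])
# "reset my password" appears once per dataset, so twice in the workspace.
```

If cross-dataset duplicates are a problem, one option under this model is to consolidate the files into a single dataset before linking it.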

Linking data to workspaces

When you add data to a workspace, you are in fact telling the workspace to look at the data in one or more datasets. Each of these is called a "linked dataset". This means that multiple workspaces can work on the same data. It also means that adding data to, or removing data from, a dataset affects the data available to every workspace linked to that dataset.

Finally, workspaces provide structure to the data, but changes made within a workspace do not affect the linked datasets or their data.
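The linking behaviour described in this section can be pictured as workspaces holding references to datasets rather than copies of them. The sketch below is a toy model under that assumption; the `Dataset` and `Workspace` classes and their methods are hypothetical and do not correspond to the product's API.

```python
# Hypothetical model of linked datasets; not the product's API.

class Dataset:
    def __init__(self, name, utterances=()):
        self.name = name
        self.utterances = list(utterances)


class Workspace:
    def __init__(self, name):
        self.name = name
        self.linked = []   # references to datasets, not copies
        self.labels = {}   # structure that lives only in this workspace

    def link(self, dataset):
        self.linked.append(dataset)

    def data(self):
        """Read the current contents of every linked dataset."""
        return [u for d in self.linked for u in d.utterances]

    def label(self, utterance, intent):
        """Record a label locally; the dataset itself is not modified."""
        self.labels[utterance] = intent


shared = Dataset("chat-logs", ["hi", "refund please"])
ws_a, ws_b = Workspace("team-a"), Workspace("team-b")
ws_a.link(shared)
ws_b.link(shared)

# Adding data to the dataset is visible in every linked workspace.
shared.utterances.append("where is my order")

# Labeling in one workspace changes neither the dataset nor the other workspace.
ws_a.label("hi", "greeting")
```

In this model the dataset is the single source of the data, while each workspace keeps its own structure on top of it.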