CLI data management

Managing datasets

As documented in the Data Management section, datasets organize unlabelled data files. A dataset can be thought of as a folder of uploaded files that can be linked to any workspace in a namespace.

All dataset commands are available via the data sub-commands of hf. Execute hf data --help to see the list of available sub-commands.

Listing

To list datasets, execute hf data sets list. The --sources and --workspaces options can be used to display the unique identifiers of the linked data sources and workspaces.

$ hf data sets list
id name utterances workspaces sources
convset-B4IY4M5LENAOJKKDPFPU2XRQ set1 42 playbook-XEUSQXIS3RHIZMXTC6W7HQI7 convsrc-CBY6YDXCRVGAJKUKZPVCW7PH
convset-BT7QT6YCRFHLNDINXQP2KWG7 set2 3117
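
When scripting, it can be handy to look up a dataset identifier by name instead of copying it by hand. The following is a minimal sketch, not part of hf: dataset_id_by_name is a hypothetical helper, and it assumes the listing keeps the layout shown above (a header row, then whitespace-separated rows with the id in the first column and the name in the second) and that dataset names contain no spaces.

```shell
# Hypothetical helper: print the id of the dataset whose name equals $1.
# Assumes `hf data sets list` prints a header row followed by
# whitespace-separated rows with the id in column 1 and the name in column 2.
dataset_id_by_name() {
    hf data sets list | awk -v name="$1" 'NR > 1 && $2 == name { print $1 }'
}
```

Calling dataset_id_by_name set2 against the listing above would print the identifier of set2. Nothing is printed when no dataset has that name.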

Creation

To create a new dataset, execute hf data sets create <name of the dataset>. To link this dataset to a workspace, consult the workspaces documentation.

$ hf data sets create my-dataset-name
Created dataset my-dataset-name (id: convset-DXTQTEBJGRCUZDPG4G4NELK6)
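
Later commands need the identifier of the new dataset, so a script may want to capture it from the creation message. This is a sketch under two assumptions: that the message is written to standard output, and that it always ends with "(id: <identifier>)" as in the example above. created_dataset_id is a hypothetical helper name.

```shell
# Hypothetical helper: read the creation message on stdin and print the id.
# Assumes the message contains "(id: <identifier>)" as in the example output.
created_dataset_id() {
    sed -n 's/.*(id: \([^)]*\)).*/\1/p'
}
```

It would be used by piping the create command into it, for example hf data sets create my-dataset-name | created_dataset_id.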

Deletion

To delete a dataset, execute hf data sets --dataset <id of the dataset> delete. A dataset that is still linked to a workspace cannot be deleted. To find the linked workspaces, run the list command with --workspaces, then unlink the dataset from each of them with the unlink command of the workspace data sub-command.

$ hf data sets --dataset convset-DXTQTEBJGRCUZDPG4G4NELK6 delete
Deleted dataset convset-DXTQTEBJGRCUZDPG4G4NELK6

Linking to workspaces

Refer to the workspaces documentation for more information on how to link datasets to workspaces and unlink them again.

Managing files

File formats

Different file formats can be uploaded into a dataset, as documented in the Data Management section.

The CLI supports every file format that Studio supports. Each format is referenced by its shorthand name:

Shorthand name  Description
json            HumanFirst JSON format
txt             Utterance text format
csv             Simple conversation CSV format

Validating

Before uploading a file, its format can be validated using the hf data validate [--format format] <filename> command. Refer to the formats section for more information about the supported file formats.

$ hf data validate --format txt my_utterances.txt
my_utterances.txt: File is valid and contains 3 conversations
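
Validation pairs naturally with the import command described in the Importing section: a script can refuse to upload a file that does not validate. The sketch below assumes that hf data validate exits with a non-zero status when the file is invalid, which the text above does not confirm. validate_and_import is a hypothetical helper name.

```shell
# Hypothetical helper: import a file only if validation succeeds.
# Usage: validate_and_import <dataset id> <format> <filename>
# Assumes `hf data validate` exits non-zero for an invalid file.
validate_and_import() {
    dataset="$1"; format="$2"; file="$3"
    if hf data validate --format "$format" "$file"; then
        hf data sets --dataset "$dataset" files import --format "$format" "$file"
    else
        echo "skipping import: $file failed validation" >&2
        return 1
    fi
}
```

If the exit status turns out not to reflect validity, the check would have to match the printed message instead.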

Listing

To list files in a dataset, execute hf data sets --dataset <id of the dataset> files list.

$ hf data sets --dataset convset-CG5CDTBVBVFYPIUHAXYVK6OG files list
file format time
myfile1.txt txt 2022-02-22T20:42:08Z
myfile2.txt txt 2022-03-30T15:31:00Z
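
A script can use the file listing to check whether a file is already present in a dataset, for example before importing it again. This is a minimal sketch that assumes the layout shown above (a header row, then the file name in the first column) and file names without spaces. dataset_has_file is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if file $2 is listed in dataset $1.
# Assumes the listing has a header row and the file name in column 1.
dataset_has_file() {
    hf data sets --dataset "$1" files list |
        awk -v file="$2" 'NR > 1 && $1 == file { found = 1 } END { exit !found }'
}
```

The helper prints nothing and reports the result through its exit status, so it can be used directly in an if statement.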

Importing

To import a file into a dataset, execute hf data sets --dataset <id of the dataset> files import [--format format] [filename].

Refer to the formats section for more information about the supported file formats.

The file content must be encoded in UTF-8 and can be passed either as a filename or on stdin.

$ hf data sets --dataset convset-BT7QT6YCRFHLNDINXQP2KWG7 files import --format txt my_file.txt
my_file.txt: File uploaded successfully
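
The UTF-8 requirement and the stdin option can be combined in a small wrapper. The sketch below is an illustration only: is_utf8 and import_from_stdin are hypothetical helpers, the encoding check relies on the standard iconv tool rather than on hf, and it assumes that omitting the filename makes the import command read the content from stdin.

```shell
# Hypothetical helper: succeed only if the file decodes as UTF-8.
# iconv fails on byte sequences that are not valid UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: stream a file to the import command over stdin.
# Usage: import_from_stdin <dataset id> <format> <filename>
# Assumes the import command reads stdin when no filename is given.
import_from_stdin() {
    is_utf8 "$3" || { echo "$3 is not valid UTF-8" >&2; return 1; }
    hf data sets --dataset "$1" files import --format "$2" < "$3"
}
```

Checking the encoding locally first avoids sending content that the import would presumably reject.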

Deleting

To delete a file from a dataset, execute hf data sets --dataset <id of the dataset> files delete <filename>.

$ hf data sets --dataset convset-CG5CDTBVBVFYPIUHAXYVK6OG files delete my_file.txt
Deleted file 'my_file.txt' from dataset