Upload Unlabeled Data

Overview#

Private datasets#

info

We know that uploading data is sensitive, so we think it's important to mention that:

  • You own your data
  • We will never share your data with anyone
  • You can export your data at any time
  • You can delete your data at any time

Public datasets#

HumanFirst makes public datasets readily available within workspaces. To try one, navigate to Unlabeled Data, click Manage Data > Add Data, and select the demo dataset you'd like to experiment with.

Data format#

Multi-turn (conversations)#

This format is useful if you have bi-directional conversations (example: client + agent) that contain one or more utterances per conversation. The format is a simple file with a .csv extension in which each line has 4 comma-delimited columns describing a single utterance that is part of a conversation.

The columns of the file need to be:

  1. Conversation ID (string): A unique identifier of the conversation. If the conversation contains multiple utterances, they must all share the same identifier.
  2. Utterance timestamp (number): The time at which the client or agent made the utterance, as a Unix epoch timestamp with millisecond precision. In Excel, you can convert a date stored in cell A1 with the formula =(A1-DATE(1970,1,1))*86400000. Example: 08/27/2020 at 5:28pm UTC should have value 1598549339000.
  3. Utterance source (string): The party who made the utterance in the conversation. This value must be set to either client or expert (use expert for the agent side).
  4. Utterance (string): The text of the utterance.
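
If the timestamps are produced in code rather than in Excel, the same conversion can be sketched in Python. This is a minimal illustration, not part of HumanFirst's tooling; the function name to_epoch_millis is hypothetical, and it assumes you can supply timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone

# Reference point for Unix epoch timestamps.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Return the Unix epoch timestamp of `moment` in whole milliseconds.

    Hypothetical helper: it rejects naive datetimes so the result never
    depends on the local timezone of the machine running it.
    """
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # Integer division of two timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


# The example from the column description, read as 17:28:59 UTC.
example = to_epoch_millis(datetime(2020, 8, 27, 17, 28, 59, tzinfo=timezone.utc))
print(example)  # 1598549339000
```

The documented example value corresponds to 17:28:59 in UTC, so a local time should be converted to UTC (or carry its own offset) before calling the helper.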

Example#

conversations.csv

convId1,1565799760013,client,"No thank you, I'm done for today"
convId1,1565799760014,expert,Very well
convId1,1565799760015,expert,Have a lovely day!
convId1,1565799760016,client,Likewise
convId1,1565799760017,expert,Thank you
convId2,1565803942001,client,"Hi, I'm looking for a new car"
convId2,1565803942002,expert,"Good morning. Sure, can you tell me what you are looking for?"
convId2,1565803942003,client,"What brands are you selling ?"
convId2,1565803942004,expert,"You can visit http://www.somesite.com to consult our inventory."
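
Before uploading, it can help to check a file against the four-column layout described above. The following Python sketch is one possible pre-flight check; the function name find_row_problems and its messages are hypothetical, and it only covers the rules stated on this page (four columns, a whole-number timestamp, a source of client or expert, non-empty text).

```python
import csv
import io

# The two source values the format accepts.
ALLOWED_SOURCES = {"client", "expert"}


def find_row_problems(csv_text: str) -> list[str]:
    """Return a list of human-readable problems found in conversation CSV text."""
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # ignore completely blank lines
        if len(row) != 4:
            problems.append(f"row {line_no}: expected 4 columns, found {len(row)}")
            continue
        conv_id, timestamp, source, text = row
        if not conv_id:
            problems.append(f"row {line_no}: conversation ID is empty")
        if not (timestamp.isascii() and timestamp.isdigit()):
            problems.append(f"row {line_no}: timestamp is not a whole number")
        if source not in ALLOWED_SOURCES:
            problems.append(f"row {line_no}: source must be client or expert")
        if not text.strip():
            problems.append(f"row {line_no}: utterance text is empty")
    return problems
```

An empty list means no problems were detected by these checks; it does not guarantee that the import itself will succeed.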

Notes#

  • The CSV file should not have headers (i.e. no column names in the first row).
  • The file needs to be encoded in UTF-8 in order to properly support non-ASCII characters (ex: accents, emojis).
  • If saving from Excel, save the file in the CSV UTF-8 (Comma delimited) (.csv) format.
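
When the file is generated by a script instead of Excel, the notes above translate into a short Python sketch like the one below. The function name write_conversations_csv is hypothetical; the sketch assumes each row is a tuple of conversation ID, millisecond timestamp, source, and text.

```python
import csv


def write_conversations_csv(path, rows):
    """Write rows to `path` as a headerless, UTF-8 encoded CSV file.

    Each row is (conversation_id, timestamp_ms, source, text).
    """
    # newline="" lets the csv module control line endings itself;
    # encoding="utf-8" keeps accents and emojis intact.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        # No header row is written, matching the first note above.
        for conv_id, timestamp_ms, source, text in rows:
            # Fields containing commas or quotes are quoted automatically.
            writer.writerow([conv_id, timestamp_ms, source, text])
```

A call such as write_conversations_csv("conversations.csv", [("convId1", 1565799760013, "client", "No thank you, I'm done for today")]) produces a line in the same shape as the example file.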

Single utterances#

This format is useful if you have a single list of utterances that you want to import into HumanFirst. The format is a simple file with a .txt extension in which each line contains exactly 1 utterance.

Example#

utterances.txt

I have a problem signing up to your service.
I’m looking for a Honda Civic 2015.
How can I help you today?

Notes#

  • The file needs to be encoded in UTF-8 in order to properly support non-ASCII characters (ex: accents, emojis).
  • Imported utterances are dated with the date of the file upload.
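
A script that exports utterances from another system can produce this file with a few lines of Python. This is a minimal sketch; write_utterances_file is a hypothetical name, and collapsing whitespace is an assumption made here so that an utterance containing a line break does not get split into two.

```python
def write_utterances_file(path, utterances):
    """Write one utterance per line to `path` in UTF-8 and return the count written."""
    written = 0
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for utterance in utterances:
            # Collapse every run of whitespace (including line breaks) into
            # a single space so each utterance occupies exactly one line.
            single_line = " ".join(utterance.split())
            if single_line:  # skip utterances that are empty or only whitespace
                handle.write(single_line + "\n")
                written += 1
    return written
```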

Longer documents (beta)#

How to upload#

From Studio#

Navigate to Unlabeled Data and click on Manage Data > Add Data.

From Command Line Tool#

For conversations:#

hf conversation import --workspace [workspace id] conversations.csv

For utterances:#

hf utterance import --workspace [workspace id] utterances.txt

info

Options: --workspace [workspace id] specifies the workspace to import into.
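
If the import is part of an automated pipeline, the two commands above can be assembled from a script. The Python sketch below only rebuilds the command lines shown on this page; the helper names are hypothetical, "ws-123" in the usage note is a placeholder workspace ID, and running the command assumes the hf tool is installed and already authenticated.

```python
import subprocess


def build_import_command(kind, workspace_id, file_path):
    """Return the argument list for an hf import command.

    `kind` is "conversation" (for a .csv file) or "utterance" (for a .txt file).
    """
    if kind not in ("conversation", "utterance"):
        raise ValueError("kind must be 'conversation' or 'utterance'")
    return ["hf", kind, "import", "--workspace", workspace_id, file_path]


def run_import(kind, workspace_id, file_path):
    """Run the import and raise CalledProcessError if hf exits with an error."""
    subprocess.run(build_import_command(kind, workspace_id, file_path), check=True)
```

For example, run_import("conversation", "ws-123", "conversations.csv") would execute hf conversation import --workspace ws-123 conversations.csv.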