DATA COLLECTION FOR AI SPEECH PROJECTS

Introduction

Many AI projects include speech recognition components such as voice assistants. Speech recognition models require training to understand new domains, and that training is driven by data collection. This post explains the steps involved in speech data collection and training.

 

Components of Speech Data Collection

 

1. Find out what users need

2. Determine the domain-specific language

3. Create a "script"

4. Identify target population

5. Record people reading your script

6. Capture what was actually said

7. Make a test set

8. Train a language model

9. Make an acoustic model

 

The rest of this article goes into more detail about each step. The ultimate goal is to gather enough audio data to train a speech model as effectively as possible. We currently aim to collect 25 hours of audio for speech model training.
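As a rough sizing exercise, the 25-hour target can be translated into utterance and session counts. The 5-second average utterance length and 100-line script below are illustrative assumptions, not figures from any particular project:

```python
# Rough sizing for the collection effort: a sketch assuming an
# average utterance length of 5 seconds (an assumption, not a rule).
TARGET_HOURS = 25
AVG_UTTERANCE_SECONDS = 5.0

target_seconds = TARGET_HOURS * 3600
utterances_needed = round(target_seconds / AVG_UTTERANCE_SECONDS)

# With 100 lines per script, this tells us how many sessions to book.
LINES_PER_SCRIPT = 100
sessions_needed = -(-utterances_needed // LINES_PER_SCRIPT)  # ceiling division

print(utterances_needed, sessions_needed)  # 18000 utterances, 180 sessions
```

Adjust the assumed utterance length to match your domain; short commands shrink it, conversational turns stretch it.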

Step 1: What do users need?

To train a voice assistant, log user interactions. If you have call center recordings, transcribe them. If you are creating a chat dialog, look at the responses your users provide. Use data from real users whenever you can, and augment it only when necessary.

Step 2: Determine the domain-specific language

This step separates the "general" language from the "domain-specific" language. The speech model inherits "general" training data, so we only need to supply data that falls outside it. The "general" language is the language your smartphone already understands; the "domain-specific" language is what you say and hear only at work.

 

Although it may seem tempting to simply extract isolated terms such as "neuter", "spay", or "Corgi", the speech model will benefit from being able to understand these words in context. Speech training uses word sequences, not individual words. Knowing that "to neuter" is likely in your domain tells the model that "tune her" is an unlikely transcription despite the phonetic similarity.
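The sequence point above can be illustrated with a toy bigram count. The tiny corpus below is a made-up assumption standing in for your collected domain text; real language models use far more sophisticated statistics, but the intuition is the same:

```python
from collections import Counter

# Toy illustration of why word sequences matter: counting bigrams over
# domain text (this corpus is invented for the example) makes "to neuter"
# far more likely than the phonetically similar "tune her".
corpus = (
    "we need to neuter the dog . please book to neuter the corgi . "
    "call to neuter and microchip"
).split()

bigrams = Counter(zip(corpus, corpus[1:]))
print(bigrams[("to", "neuter")], bigrams[("tune", "her")])  # 3 0
```

A sequence that never occurs in domain text gets no support, so the model prefers the in-domain reading even when the audio is ambiguous.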

Step 3: Create a "script" (the data we will collect from people)

Take the list of statements from the previous steps and compile them into a "script" that you will ask people to read aloud while you record them. The script should contain a representative sample from the previous step: if "neuter" appointments are twice as common as "spay" or "microchip" appointments, your script should include twice as many "neuter" statements.

 

Your script should begin with an introduction that identifies the subject's accent, environment and device. This metadata will come in handy later.
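A frequency-weighted script generator could be sketched as follows. The phrase list and weights are illustrative assumptions; the opening line implements the metadata introduction described above:

```python
import random

# A sketch of frequency-weighted script generation; the statements and
# weights below are illustrative assumptions, not data from the post.
STATEMENTS = {
    "I'd like to book a neuter appointment for my dog.": 2,  # twice as common
    "Can I schedule a spay for next Tuesday?": 1,
    "I need to get my Corgi microchipped.": 1,
}

def build_script(n_lines, seed=0):
    rng = random.Random(seed)
    phrases = list(STATEMENTS)
    weights = [STATEMENTS[p] for p in phrases]
    # Start with the metadata introduction, then weighted statements.
    lines = ["Please state your accent, environment, and device."]
    lines += rng.choices(phrases, weights=weights, k=n_lines)
    return lines

print("\n".join(build_script(8)))
```

Seeding the generator keeps scripts reproducible, so every subject in a cohort can read an identical script if you want controlled comparisons.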

Step 4: Identify the target population

Identify your target population, then create a data collection plan for that population. You will want to record data from a variety of people (to cover speaking styles and accents) as well as a variety of environments and devices.

Step 5: Record people reading your script

Give your script to your data collection subjects and have them call in from their assigned environment. Subjects should not worry about any mistakes they make; they should simply continue reading the script. Make the process as simple as possible for them. You can always fix their mistakes later.

Step 6: Capture what was actually said

You need to capture what the callers actually said, because they will make mistakes. While this is more work, it has the added benefit of producing more training data.

 

Although this step requires human intervention (yours or ours), you can make it easier. Instead of transcribing the whole call from scratch, ask a speech-to-text engine to produce a draft, and have your human transcriber correct any errors.
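The draft-then-correct workflow could be sketched as below. `speech_to_text` is a stand-in for whatever STT engine you actually use, and the dictionary of corrections represents the human reviewer's fixes; both are purely illustrative:

```python
# A sketch of the draft-then-correct transcription workflow.
# `speech_to_text` is a placeholder for your real STT engine call.

def speech_to_text(audio_path):
    """Placeholder: call your STT engine here and return its draft text."""
    return "i'd like to book a nude her appointment"  # typical STT error

def apply_corrections(draft, corrections):
    """Apply human-reviewed fixes to the machine draft."""
    for wrong, right in corrections.items():
        draft = draft.replace(wrong, right)
    return draft

draft = speech_to_text("call_1234_line007.wav")
final = apply_corrections(draft, {"nude her": "neuter"})
print(final)  # i'd like to book a neuter appointment
```

The reviewer only touches the spans the engine got wrong, which is usually far faster than transcribing from silence.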

Step 7: Create the test set

Segment the audio and its transcription so that each segment is one statement. Segmentation can be based on newlines (for text files) or long pauses (for audio files). You will generate files like 1234_us-south_landline_noisy_line001, 1234_us-south_landline_noisy_line002, and so on. Ensure that audio and text are segmented the same way and that the segments line up (line079 of the text should be the transcription of line079 of the audio).
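The naming convention above can be sketched as a small helper. The session ID and metadata values are assumptions taken from the example filenames:

```python
# A sketch of generating aligned audio/text file names under the
# convention above; the session metadata values are example assumptions.

def segment_names(session, accent, device, environment, n_segments):
    """Generate aligned file stems like 1234_us-south_landline_noisy_line001."""
    prefix = f"{session}_{accent}_{device}_{environment}"
    return [f"{prefix}_line{i:03d}" for i in range(1, n_segments + 1)]

stems = segment_names("1234", "us-south", "landline", "noisy", 3)
# Pair each audio segment with its transcription by shared stem.
pairs = [(stem + ".wav", stem + ".txt") for stem in stems]
print(pairs[0])
```

Deriving both file names from one stem makes misalignment impossible by construction: line079 audio and line079 text always share a prefix.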

 

Remove the segments that correspond to the subject's introduction (e.g. "I am calling with a US South accent from a landline in a busy office").

Step 8: Train a language model

If you have domain-specific inputs such as ICD-10 codes, you can list all valid codes, though language models have a limit on the number of words they can contain. The language model can and should include additional variations beyond what you have recorded audio for. Marco's speech-training post contains additional information about training.
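One common way to add variations beyond the recorded audio is template expansion. The template and slot values below are illustrative assumptions for a veterinary domain:

```python
from itertools import product

# A sketch of expanding templates into extra language-model sentences
# beyond what was recorded; the slot values are illustrative assumptions.
TEMPLATE = "I'd like to {action} my {animal} on {day}"
SLOTS = {
    "action": ["neuter", "spay", "microchip"],
    "animal": ["dog", "cat", "Corgi"],
    "day": ["Monday", "Tuesday"],
}

keys = list(SLOTS)
sentences = [
    TEMPLATE.format(**dict(zip(keys, combo)))
    for combo in product(*(SLOTS[k] for k in keys))
]
print(len(sentences))  # 3 * 3 * 2 = 18 variations
```

A handful of templates can generate far more word sequences than you could ever record, which is exactly the kind of text-only data a language model can absorb.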

Step 9: Make an acoustic model

For many solutions, a language model alone provides enough accuracy without building an acoustic model. You can actually test your language model against 100% of the audio that you have collected. Marco Noel's Part II speech training post has more information about whether you need an acoustic model.

How GTS Can Assist You With Audio Datasets

Global Technology Solutions (GTS) provides all the data you could need to power your technology in whatever dimension of speech, language, or voice you require. We have the means and expertise to handle any project involving natural language corpus construction, truth data collection, semantic analysis, and transcription, along with data annotation services and OCR training datasets to train your ML models.
