Tuning Automatic Speech Recognition Models

Understand how Corti tunes ASR models to your organization's needs

Before reading this page, we recommend first reading Tuning Corti's AI Models, which gives an overview of the types of data tuning we work on with customers and the circumstances in which we do so.

Overview

To ensure that documents and dictations are as precise and accurate as possible, there may be an opportunity to tune our proprietary ASR. Reasons we've tuned our models in the past range from something as large as training a whole new language to something smaller, such as adapting to a new dialect or medical specialty. The data collection for such tuning can be broken down into three components: Data Types, Data Format and Data Volume.

Data Types

Data for automatic speech recognition (ASR) training or fine-tuning takes the form of pairs of audio recordings and target transcripts of that audio.

Audio Recordings

The audio recordings should reflect the domain in which the ASR is expected to function, including background noise and the type of recording equipment, as well as speaker dialect, accent, gender, race, age and other characteristics that might affect the speech.

Target Transcripts

The transcripts must be verbatim (or near-verbatim) to be useful for training, fine-tuning and/or evaluating an ASR model. Any syntactic property expected of the delivered ASR model's output must also be reflected in the target transcripts (e.g. capitalisation and punctuation).

Transcripts generated by other highly accurate dictation software vendors can also be excellent sources of data, provided they meet the criteria for verbatim or near-verbatim accuracy. This ensures alignment with the expected outputs of the ASR model. Leveraging such transcripts can often be an effective way to produce high-quality datasets for training or evaluation.

Associated Metadata

A data sample can also include metadata such as speaker_id, gender, race, age, language, dialect, location and date, and should generally include as much metadata as possible. This metadata may be used for training, and it is always valuable for estimating model performance on subpopulations.
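
For illustration, a single data sample with metadata attached might look like the sketch below; the field names and values are illustrative examples, not a fixed schema.

{
  "file": "file_name_1.wav",
  "speaker_id": 0,
  "gender": "female",
  "age": 42,
  "language": "en",
  "dialect": "en-GB",
  "location": "London",
  "date": "2024-05-01",
  "transcript": "Hello, this is the captain speaking."
}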

Data Format

Below you will find an example submission of recordings and transcription/metadata files. A submission contains the individual recordings, the transcription file(s), and a single mapping document that maps the recording files to the transcription file(s).

Mapping File:

delivered_data/ 
├── file_name_1.wav
├── file_name_1.json
├── file_name_2.mp3
├── file_name_2.json
├── file_name_3.ogg
├── file_name_3.json
├── ...
└── file_name_N.json

Transcription File:

[
  {
    "file": "file_name_1.wav",
    "start": "0:00:00.1",
    "stop": "0:00:09.4",
    "speaker_id": 0,
    "transcript": "Hello, this is the captain speaking."
  },
  {
    "file": "file_name_1.wav",
    "start": "0:00:11.6",
    "stop": "0:00:15.2",
    "speaker_id": 1,
    "transcript": "I bet this one drew their license in a machine."
  },
  ...
  {
    "file": "file_name_2.mp3",
    "start": "0:00:04.4",
    "stop": "0:00:13.6",
    "speaker_id": 2,
    "transcript": "Nonetheless, the plane landed safely."
  }
]

Data Volume

As mentioned before, diversity within the dataset is key to a better-tuned ASR model that meets all of your needs. That said, even a relatively small amount of data can improve our already best-in-class models for your more nuanced needs, such as new dialects or specialties.

To achieve the best results in mapping Audio Conversations (X) to Transcripts (Y), the quality and quantity of data are critical. These are the recommended levels of data availability:

🥇 (Ideal) 2,000+ hours of recordings with corresponding transcripts, involving at least 20 different speakers.

🥈 (Preferred) 500+ hours of recordings with corresponding transcripts, involving at least 20 different speakers.

🥉 (Minimum) 50+ hours of recordings with corresponding transcripts, involving at least 10 different speakers.
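
To get a quick sense of which tier a delivery reaches, a small sketch like the one below can tally the transcribed hours and distinct speakers from the transcription files; it assumes the folder layout and the H:MM:SS.s timestamp format shown above.

import json
from pathlib import Path

def to_seconds(timestamp: str) -> float:
    # Timestamps in the example segments use an H:MM:SS.s format.
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

total_seconds = 0.0
speakers = set()
for json_path in Path("delivered_data").glob("*.json"):
    for segment in json.loads(json_path.read_text(encoding="utf-8")):
        total_seconds += to_seconds(segment["stop"]) - to_seconds(segment["start"])
        speakers.add(segment["speaker_id"])

print(f"{total_seconds / 3600:.1f} transcribed hours across {len(speakers)} speakers")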
