How to Evaluate the Accuracy of a Speech Recognition Model

Introduction

Ensuring accurate speech recognition is essential when evaluating any ASR service, especially one deployed in healthcare settings where trust and accuracy are paramount. This guide is designed to help you get the most accurate and reliable results when assessing ASR capabilities, so you can choose the right service for your needs. Here, we walk through the key steps to properly measure transcription accuracy, highlight best practices, and help you avoid common pitfalls. By following these guidelines, you will be able to conduct a fair and meaningful evaluation that aligns with real-world usage scenarios.

Steps to Evaluate an ASR Model Correctly

Step 1: Creating a High-Quality Reference Transcript and Transcription Format

  • Ensure transcripts are manually annotated following strict transcription guidelines.

  • Avoid using automatically generated transcripts as the reference, as they often contain errors.

  • Use a dataset that reflects real-world variability (e.g. different accents, noise levels, and speaking speeds).

Please refer to our Transcription Guidelines to ensure accuracy and consistency.

Step 2: Normalize the Hypothesis and Reference Transcripts

Text normalization is critical for accurate Word Error Rate (WER) calculation. Differences in formatting, punctuation, and casing can lead to inflated error rates, so it’s important to standardize both the reference and hypothesis transcripts before evaluating WER.

Here’s how to properly normalize your transcripts:

  • Convert all text to lowercase to avoid case mismatches affecting WER scores.

  • Remove all punctuation (except apostrophes when necessary for clarity, such as in contractions like “don’t”).

  • Normalize numbers: Convert numerals into words or vice versa (e.g., "twenty-five" to "25" or "$450" to "four hundred fifty dollars"). Choose a consistent format and apply it to both the reference and hypothesis transcripts.

  • Standardize compound words: Some ASR systems may output words as compounds or separate entities (e.g., "ballpark" vs. "ball park"). Choose a format and ensure consistency across both transcripts.

  • Tokenize correctly: In languages where spacing rules vary (e.g., Chinese, Japanese), ensure word segmentation follows a consistent rule.

  • Remove filler words and hesitations: Words such as "um," "uh," and "hmm" should generally be excluded unless they carry meaning relevant to the evaluation.

  • Expand or contract abbreviations consistently (e.g., "Dr." → "Doctor").

By applying these normalizations to both the reference transcript (gold standard) and ASR output (hypothesis), you ensure a fair and accurate WER calculation. This will minimize false discrepancies that do not reflect actual recognition errors.
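If you want to script these normalizations, here is a minimal sketch in Python using only the standard library. The filler-word list, the ASCII-only character set, and the example sentence are illustrative assumptions; adapt them to the conventions you chose above (number format, compound words, abbreviations, language).

```python
import re
import string

# Minimal normalization sketch (illustrative, not Corti's official tooling).
# Lowercases, strips punctuation except straight apostrophes, collapses
# whitespace, and removes a small list of filler words. The character set is
# ASCII-only; extend it for accented letters or non-Latin scripts, and add
# number/abbreviation handling to match your chosen conventions.

FILLERS = {"um", "uh", "hmm", "mhm", "erm"}  # example list; adjust as needed

def normalize(text: str) -> str:
    text = text.lower()
    # Keep letters, digits, spaces, and apostrophes so contractions like "don't" survive.
    keep = set(string.ascii_lowercase + string.digits + " '")
    text = "".join(ch if ch in keep else " " for ch in text)
    text = re.sub(r"\s+", " ", text).strip()
    words = [w for w in text.split() if w not in FILLERS]
    return " ".join(words)

print(normalize("Um, the patient takes 25 mg of Lisinopril -- don't stop it."))
# -> "the patient takes 25 mg of lisinopril don't stop it"
```

Apply the same function to both the reference and the hypothesis so that any formatting choice affects them equally.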

Step 3: Proper Averaging of WER

When calculating WER across multiple examples, it’s important to avoid averaging the WER of individual examples directly. Instead, the correct approach is:

Average WER = (Total errors across dataset) ÷ (Total words in dataset)

This method weights every word in the dataset equally, so a short transcript with a single error does not disproportionately inflate the final metric the way it would if per-example WERs were averaged directly.
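The sketch below illustrates the difference. It computes word-level edit distance with a standard dynamic-programming routine and compares the pooled WER (total errors divided by total words) with a naive mean of per-example WERs; the example sentences are invented.

```python
# Pooled averaging vs. naive per-example averaging (illustrative sketch).

def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum word-level edit distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr_row = [i]
        for j, h in enumerate(hyp, start=1):
            curr_row.append(min(
                prev_row[j] + 1,              # deletion
                curr_row[j - 1] + 1,          # insertion
                prev_row[j - 1] + (r != h),   # substitution (0 if words match)
            ))
        prev_row = curr_row
    return prev_row[-1]

examples = [
    ("the patient denies chest pain", "the patient denies chess pain"),  # 1 error / 5 words
    ("no known drug allergies", "no known allergies"),                   # 1 error / 4 words
]

total_errors = sum(word_errors(ref, hyp) for ref, hyp in examples)
total_words = sum(len(ref.split()) for ref, _ in examples)

pooled_wer = total_errors / total_words  # 2 / 9 ≈ 0.222
naive_mean = sum(word_errors(r, h) / len(r.split()) for r, h in examples) / len(examples)  # 0.225

print(f"Pooled WER: {pooled_wer:.3f}, naive average of per-example WERs: {naive_mean:.3f}")
```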

Step 4: Proper Averaging for Datasets

When evaluating ASR models across multiple datasets, we must also consider dataset size. Simply averaging WERs from different datasets can be misleading if the datasets are not of equal size. The correct approach is:

Weighted Average WER = (Sum of total errors for each dataset) ÷ (Sum of total words for each dataset)

This ensures that smaller datasets do not skew the final result disproportionately.
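To illustrate with made-up numbers, the short sketch below contrasts the weighted average (pooling errors and words across datasets) with a naive mean of per-dataset WERs; the dataset names and counts are hypothetical.

```python
# Hypothetical error/word counts per dataset (illustrative numbers only).
datasets = {
    "clinic_notes":   {"errors": 120, "words": 2_000},  # WER 6.0%
    "phone_triage":   {"errors": 450, "words": 5_000},  # WER 9.0%
    "short_commands": {"errors":  30, "words":   200},  # WER 15.0%
}

total_errors = sum(d["errors"] for d in datasets.values())
total_words = sum(d["words"] for d in datasets.values())

weighted_wer = total_errors / total_words
naive_mean = sum(d["errors"] / d["words"] for d in datasets.values()) / len(datasets)

print(f"Weighted average WER: {weighted_wer:.3%}")       # 600 / 7200 ≈ 8.333%
print(f"Naive mean of dataset WERs: {naive_mean:.3%}")   # (6% + 9% + 15%) / 3 = 10.000%
```

Note how the small 200-word dataset pulls the naive mean up to 10%, even though it contributes very few words to the overall evaluation.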

Step 5: Interpreting Evaluation Results

Beyond simply looking at WER percentages, it is essential to analyze the types of errors the model makes. Categorizing errors into meaningful groups can provide more actionable insights.

Here are some useful error categories:

  • Minor spelling mistake (e.g., "color" vs. "colour")

  • Major spelling mistake (word is unrecognizable but phonetically similar)

  • Missing text (a word or phrase is omitted from the transcript)

  • Hallucination/made-up text (the model introduces words that weren’t spoken)

  • Error in clinically important word (misrecognizing critical medical terms)

  • Error in clinically irrelevant word (e.g., mishearing filler words)

Considering error significance (e.g., "important" vs. "not important") helps refine evaluations. While this is a qualitative approach, our ML team is working on tools to formalize these evaluations and make comparisons across ASR models easier.

📌 Visualizing ASR Errors

While WER gives a numerical measure of accuracy, visualizing transcription mistakes makes it easier to understand where and how errors occur. This helps in identifying common ASR failure points, such as misrecognitions, insertions, and deletions.
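One simple way to produce such a view is to align the reference and hypothesis word by word and print where they differ. The sketch below uses the Python standard library's difflib; the sentences are invented, and SequenceMatcher's alignment is an approximation of the exact edit-distance alignment, but it is usually good enough for spotting substitutions, deletions, and insertions.

```python
import difflib

# Align a reference and a hypothesis word by word and flag the differences.
# The example sentences are invented for illustration.
reference = "the patient was prescribed 50 milligrams of metoprolol twice daily"
hypothesis = "the patient was prescribed 15 milligrams of metoprolol daily"

ref_words, hyp_words = reference.split(), hypothesis.split()
matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words)

for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "equal":
        print("OK  ", " ".join(ref_words[i1:i2]))
    elif op == "replace":
        print("SUB ", " ".join(ref_words[i1:i2]), "->", " ".join(hyp_words[j1:j2]))
    elif op == "delete":
        print("DEL ", " ".join(ref_words[i1:i2]))
    elif op == "insert":
        print("INS ", " ".join(hyp_words[j1:j2]))
```

In this example the output flags "50 -> 15" as a substitution and "twice" as a deletion, both clinically important errors that a raw WER figure alone would not surface.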

Common Mistakes to Avoid

  • 🚫 Comparing ASR Transcripts Directly
    Always compare ASR output against a manually created reference transcript, not another ASR system’s output. Comparing two ASR-generated transcripts only shows differences but doesn’t indicate which system is more accurate.

  • 🚫 Skipping Text Normalization

    Different ASR services may handle casing, punctuation, and number formatting differently. Before calculating Word Error Rate (WER), standardize both the reference and ASR output to ensure a fair comparison.

  • 🚫 Relying on Visual Inspection Alone

    A structured, quantitative evaluation is essential. Visual assessment can be misleading, especially when small formatting differences (e.g., capitalization, punctuation, or numerals) stand out more than actual recognition errors.

  • 🚫 Ignoring Your Use Case

    Not all ASR evaluations have the same goal. If you’re extracting keywords, WER might not be the best metric. Instead, consider precision, recall, or content word accuracy depending on how the transcripts will be used.

Further Reading:
For a more detailed breakdown of ASR evaluation metrics and best practices, check out AWS’s guide on evaluating ASR services.

Conclusion

By following these best practices, you can ensure that your evaluation of Corti’s ASR is fair and accurate. If you have any questions or need assistance, feel free to reach out to our support team via Intercom.
