Tuning Medical Text Summarization Models

Understand how Corti tunes summarization models to your organizational needs

Updated this week

We recommend first reading Tuning Corti's AI Models, which gives an overview of the types of data tuning we sometimes work on with customers and the circumstances in which we may do so.

Overview

Documentation needs differ between the organizations we work with, so we may need to tune our summarization models to include the right level of detail in your summaries. As you may have already seen, we have a library of Corti Assistant Section Templates, but if you need something more, we're here to help! The data collection can be broken down into three components: Data Types, Data Format, and Data Volume.

Data Types

Data for training or fine-tuning summarization models takes the form of pairs: a source (an audio recording or a transcript) and one or more target summaries of that source.

Audio Recordings or Transcripts

A source for summarization must be either text or audio for which a transcript can be created. The source file allows us to train our models to extract the organizationally relevant facts from the topics covered in the consultation.

Medical Summary

A target summary is a string of text that provides the same essential information as the source file but in a more concise format. It might follow a certain medical summarization standard, such as SOAP, with different sections that group the information. The quality of the model can improve further if several target summaries are provided for each source, along with a score that indicates how “good” each summary is.

Data Format

Source audio files can be provided in any audio format as individual files. Source transcripts can be provided in many ways, e.g. individual text files or all in a single file.

Summaries often have a structure and therefore might be provided as structured text such as Markdown or in JSON format. In the simplest case where the summary is not formatted, it might be provided in a plain text file.

Example Data File (Transcripts; for audio, "source_file" would instead name an audio file such as "audio.wav"):

{
  "source_file": "transcript.txt",
  "target_summaries": [
    {
      "summary": "Here is a summary of the transcript.",
      "score": 1
    },
    {
      "summary": "Here is another, but worse, summary, oh noooo...",
      "score": 0
    }
  ]
}
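Before sending data files, it can help to check that each one parses as JSON and has the expected fields. The following is a minimal Python sketch of such a check, not an official Corti tool. The function name validate_record is hypothetical, and the rules it applies (a non-empty source_file, at least one target summary, and an optional numeric score) are assumptions drawn from the example above rather than a published schema.

```python
import json


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one data-file record (empty if none).

    Hypothetical helper: the checks mirror the example data file only.
    """
    problems = []

    source = record.get("source_file")
    if not isinstance(source, str) or not source:
        problems.append("source_file must be a non-empty string")

    summaries = record.get("target_summaries")
    if not isinstance(summaries, list) or not summaries:
        problems.append("target_summaries must be a non-empty list")
        return problems

    for i, item in enumerate(summaries):
        if not isinstance(item, dict):
            problems.append(f"target_summaries[{i}] must be an object")
            continue
        text = item.get("summary")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"target_summaries[{i}].summary must be non-empty text")
        # Assumption: the score is optional, but must be numeric when present.
        score = item.get("score")
        if score is not None and (
            isinstance(score, bool) or not isinstance(score, (int, float))
        ):
            problems.append(f"target_summaries[{i}].score must be a number")

    return problems


raw = """
{
  "source_file": "transcript.txt",
  "target_summaries": [
    {"summary": "Here is a summary of the transcript.", "score": 1},
    {"summary": "Here is another, but worse, summary.", "score": 0}
  ]
}
"""

record = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
print(validate_record(record))
```

Note that json.loads rejects comments and trailing commas, so a file that contains either will fail at the parsing step before any field checks run.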

Data Volume

It should be stressed that every little bit helps us tune documentation templates to best support your users. Even 1 example is better than none, so don't be scared off by some of our asks! We expect useful improvements in summary quality with 10,000+ pairs of target summaries and corresponding source files.

To achieve the best results in mapping Transcripts + EHR (X) to Clinical Notes (Y), the quality and quantity of data are critical. These are the recommended levels of data availability:

🥇 (Ideal) 100,000 examples of corresponding X and Y

🥈 (Preferred) 1,000 examples of corresponding X and Y

🥉 (Minimum) At least 1 example of any X with its corresponding Y
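As a rough way to see where a dataset falls among these levels, the counts can be read as lower bounds. The Python sketch below does that. The function name data_volume_tier is hypothetical, and treating each level as a threshold ("at least this many pairs") is an assumption made for illustration.

```python
def data_volume_tier(num_pairs: int) -> str:
    """Map a count of (X, Y) example pairs to the recommended level it reaches.

    Assumption: each level is read as a minimum count of pairs.
    """
    if num_pairs >= 100_000:
        return "ideal"
    if num_pairs >= 1_000:
        return "preferred"
    if num_pairs >= 1:
        return "minimum"
    return "none"


for count in (0, 1, 2_500, 100_000):
    print(count, data_volume_tier(count))
```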
