Before reading this page, we recommend first reading Tuning Corti's AI Models, which gives an overview of the types of tuning we carry out with customers and the circumstances in which we do so.
Overview
The documentation needs of the organizations we work with differ, so we may need to tune our summarization models to include the right level of detail in your summaries. As you may have already seen, we have a library of Corti Assistant Section Templates, but if you need something more, we're here to help! Collecting data for this tuning can be broken down into three components: Data Types, Data Format, and Data Volume.
Data Types
Data for training or fine-tuning summarization models takes the form of pairs: a source (an audio recording or a transcript) and one or more target summaries of that source.
Audio Recordings or Transcripts
A source for summarization must be text, or audio from which a transcript can be created. The source file allows us to train our models to extract the facts from the consultation that are relevant to your organization.
Medical Summary
A target summary is a text that conveys the same essential information as the source file in a more concise form. It may follow a medical summarization standard, such as SOAP, with sections that group the information. Model quality can improve further if several target summaries are provided for each source, each with a score that indicates how “good” that summary is.
Data Format
Source audio files can be provided in any audio format as individual files. Source transcripts can be provided in many ways, e.g. individual text files or all in a single file.
Summaries often have a structure and therefore might be provided as structured text such as Markdown or in JSON format. In the simplest case where the summary is not formatted, it might be provided in a plain text file.
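As an illustration of the structured options above, the following minimal sketch holds one sectioned summary as a Python dictionary and renders it both as Markdown and as JSON. The SOAP section names are standard; the function name `summary_to_markdown`, the `##` heading level, and the example text are assumptions made for illustration, not a format Corti requires.

```python
import json

# SOAP section names, in their conventional order.
SOAP_SECTIONS = ("Subjective", "Objective", "Assessment", "Plan")


def summary_to_markdown(sections: dict[str, str]) -> str:
    """Render a sectioned summary as Markdown, one heading per section."""
    parts = []
    for name in SOAP_SECTIONS:
        text = sections.get(name, "").strip()
        if text:  # skip sections that have no content
            parts.append(f"## {name}\n{text}")
    return "\n\n".join(parts)


# Made-up placeholder content, for illustration only.
example = {
    "Subjective": "Patient reports a dry cough for three days.",
    "Plan": "Rest and fluids; follow up if symptoms persist.",
}

markdown_summary = summary_to_markdown(example)  # structured text
json_summary = json.dumps(example, indent=2)     # same content as JSON
```

Either representation carries the same information, so the choice can follow whatever your existing systems already export.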
Example Data File (Transcripts; for an audio source, "source_file" would instead name the audio file, e.g. "audio.wav"):
{
  "source_file": "transcript.txt",
  "target_summaries": [
    {
      "summary": "Here is a summary of the transcript.",
      "score": 1
    },
    {
      "summary": "Here is another, but worse, summary, oh noooo...",
      "score": 0
    }
  ]
}
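Before sending data, it can help to check each record locally. The sketch below parses one record with Python's standard `json` module and reports any structural problems. The field names come from the example above; the function name `validate_record` and the specific rules (a non-empty source, at least one summary, an optional numeric score) are assumptions for illustration rather than Corti's official validation. Note that strict JSON parsers such as `json.loads` reject comments and trailing commas.

```python
import json


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one data record (empty if none)."""
    problems = []

    source = record.get("source_file")
    if not isinstance(source, str) or not source:
        problems.append("source_file must be a non-empty string")

    summaries = record.get("target_summaries")
    if not isinstance(summaries, list) or not summaries:
        problems.append("target_summaries must be a non-empty list")
        return problems

    for i, item in enumerate(summaries):
        if not isinstance(item, dict):
            problems.append(f"target_summaries[{i}] must be an object")
            continue
        summary = item.get("summary")
        if not isinstance(summary, str) or not summary.strip():
            problems.append(f"target_summaries[{i}].summary must be non-empty text")
        # Assumption: the score is optional, but must be a number if present.
        score = item.get("score")
        if score is not None and (
            isinstance(score, bool) or not isinstance(score, (int, float))
        ):
            problems.append(f"target_summaries[{i}].score must be a number")

    return problems


raw = """
{
  "source_file": "transcript.txt",
  "target_summaries": [
    {"summary": "Here is a summary of the transcript.", "score": 1},
    {"summary": "Here is another, but worse, summary.", "score": 0}
  ]
}
"""
record = json.loads(raw)
problems = validate_record(record)  # an empty list means the record passed
```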
Data Volume
It should be stressed that every little bit helps us tune documentation templates to best support your users. Even one example is better than none, so don't be put off by some of our asks! We expect useful improvements in summary quality with 10,000+ pairs of target summaries and corresponding source files.
To achieve the best results in mapping Transcripts + EHR (X) to Clinical Notes (Y), the quality and quantity of data are critical. These are the recommended levels of data availability:
🥇 (Ideal) 100,000 examples of corresponding X and Y
🥈 (Preferred) 1,000 examples of corresponding X and Y
🥉 (Minimum) At least 1 example of any X with its corresponding Y
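To see where an existing dataset falls against these levels, a small helper along the following lines could be used. The thresholds are taken from the list above; the function name `data_tier` and the returned labels are hypothetical.

```python
def data_tier(n_examples: int) -> str:
    """Map a count of corresponding X/Y examples to a recommended level."""
    if n_examples >= 100_000:
        return "ideal"
    if n_examples >= 1_000:
        return "preferred"
    if n_examples >= 1:
        return "minimum"
    return "none"


tier = data_tier(2_500)  # falls between the preferred and ideal thresholds
```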