Tuning Medical Coding Models

Before reading this page, we recommend first reading Tuning Corti's AI Models, which gives an overview of the types of data tuning we sometimes work on with customers and the circumstances in which we may do so.

Overview

Organizations can have very specific needs regarding the codes they use, the coding workflows that need support, and the type of coding needed (diagnostic, procedure, etc.). As you have likely seen, the more relevant data you are able to send, the better the outcome of the tuning exercise. The data collection can be broken down into two components: Data Format and Data Volume.

Data Format

Below is an example of what the data extraction should look like. The more of this data we can get the better, but at an absolute minimum we need the clinical notes and the code(s) used. Ideally, the format is JSON. If it is not JSON, the data still needs to be machine readable and structured (e.g. PDF images or copies of the chart aren't sufficient). This data typically comes from an EHR export.

{
    "clinical_note": "The patient was admitted following (...)",
    "target_codes": {
        "ICD-10": ["A02.1", "K70.1"],
        "CPT": [...]
    },
    "medications": ["paracetamol"],
    "test_results": {
        "blood_pressure": "180/120"
    },
    "physician_id": "95847362",
    "medical_coder_id": "10450438",
    "patient_id": "123456-9876",
    "admission_id": "918273645",
    "date_of_admission": "01-01-2024",
    "department": "Oncology",
    "patient_age": 66,
    "patient_gender": "male",
    "patient_ethnicity": "caucasian",
    "patient_survived": true
}
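
As a rough illustration of the minimum requirement described above (clinical notes plus the codes used), the following Python sketch checks a single exported record before it is sent. The function name `check_record` is hypothetical and the field names are taken from the example format; this is not Corti's own validation logic.

```python
import json


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one exported record (empty if OK)."""
    problems = []

    # The clinical note must be present and non-empty.
    note = record.get("clinical_note")
    if not isinstance(note, str) or not note.strip():
        problems.append("clinical_note is missing or empty")

    # At least one code system must contain at least one code.
    codes = record.get("target_codes")
    if not isinstance(codes, dict) or not any(codes.values()):
        problems.append("target_codes is missing or contains no codes")

    return problems


# Illustrative record only, reduced to the two minimum fields.
raw = (
    '{"clinical_note": "The patient was admitted following (...)", '
    '"target_codes": {"ICD-10": ["A02.1", "K70.1"]}}'
)
record = json.loads(raw)
print(check_record(record))
```

Running a check like this over the whole export before handover can surface records that lack the two minimum fields; the optional fields (medications, test results, and so on) are deliberately not enforced here.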

Data Volume

One way to approach the amount of data needed to develop a medical coding system is on a per-code level. To learn to use any given code to a useful degree, our system needs at least 10 examples of that code, and likely upwards of 100.

The underlying challenge is the long-tail distribution of medical codes. Because most codes are used very rarely and a few codes are used very frequently, millions of documents may need to be collected before rare codes occur more than 10 times. Corti has found that the best balance is to hand over the data once each of the N most common codes occurs in at least 100 cases. The exact value of N comes down to the set of codes that the system should support with high accuracy. Typically, we see these codes covering around 50% of the total number of consultations (not 50% of the codes used!).
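
The handover rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `codes_ready` and the parameters `top_n` and `min_cases` are hypothetical, records are assumed to follow the example format, and the toy data is invented purely to show the mechanics.

```python
from collections import Counter


def codes_ready(records, top_n, min_cases=100):
    """Check whether each of the top_n most common codes has at least min_cases records.

    Returns (ready, coverage), where coverage is the share of records
    that contain at least one of the top_n most common codes.
    """
    # Collect (code system, code) pairs, counting each code once per record.
    per_record = [
        {(system, code) for system, codes in r["target_codes"].items() for code in codes}
        for r in records
    ]
    counts = Counter(pair for pairs in per_record for pair in pairs)

    top = counts.most_common(top_n)
    ready = len(top) == top_n and all(n >= min_cases for _, n in top)

    top_codes = {pair for pair, _ in top}
    covered = sum(1 for pairs in per_record if pairs & top_codes)
    coverage = covered / len(records) if records else 0.0
    return ready, coverage


# Invented toy data: one frequent code and one rarer code.
records = (
    [{"target_codes": {"ICD-10": ["A02.1"]}}] * 120
    + [{"target_codes": {"ICD-10": ["K70.1"]}}] * 30
)
print(codes_ready(records, top_n=1))
print(codes_ready(records, top_n=2))
```

In this toy data the single most common code already meets the 100-case threshold and covers most records, while extending the set to the second code fails the threshold; that is the long-tail effect in miniature.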

To achieve the best results in mapping Clinical Notes + EHR Data (X) to Medical/CPT Codes (Y), the quality and quantity of data are critical. These are the recommended levels of data availability:

🥇 (Ideal) - 1,000+ examples of each individual code (Y) with corresponding data (X)

🥈 (Preferred) - 100+ examples of each individual code (Y) with corresponding data (X)

🥉 (Minimum) - 10+ examples of each individual code (Y) with corresponding data (X)
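
To see where an existing dataset stands against these levels, each code's example count can be mapped to a tier. The sketch below is a hypothetical helper (the name `availability_tier` and the sample counts are invented for illustration); only the thresholds of 10, 100, and 1,000 come from the list above.

```python
def availability_tier(example_count: int) -> str:
    """Map the number of examples for one code to a recommended availability level."""
    if example_count >= 1000:
        return "Ideal"
    if example_count >= 100:
        return "Preferred"
    if example_count >= 10:
        return "Minimum"
    return "Below minimum"


# Invented per-code counts, purely for illustration.
code_counts = {"A02.1": 1500, "K70.1": 240, "J18.9": 12, "Z99.9": 3}
tiers = {code: availability_tier(n) for code, n in code_counts.items()}
print(tiers)
```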
