Check out our new paper: Updating Clinical Risk Stratification Models Using Rank-Based Compatibility: Approaches for Evaluating and Optimizing Joint Clinician-Model Team Performance. It was accepted to the 2023 Machine Learning for Healthcare Conference.

Download paper.
Paper on arXiv.

Code for the new measure, loss function, and experimental analysis can be found at this GitHub repo.

Abstract

As data shift or new data become available, updating clinical machine learning models may be necessary to maintain or improve performance over time. However, updating a model can introduce compatibility issues when the behavior of the updated model does not align with user expectations, resulting in poor user-model team performance. Existing compatibility measures depend on model decision thresholds, limiting their applicability in settings where models are used to generate rankings based on estimated risk. To address this limitation, we propose a novel rank-based compatibility measure, \(C^R\), and a new loss function that optimizes discriminative performance while encouraging good compatibility. Applied to a case study in mortality risk stratification leveraging data from MIMIC, our approach yields more compatible models while maintaining discriminative performance compared to existing model selection techniques, with an increase in \(C^R\) of \(0.019\) (\(95\%\) confidence interval: \(0.005\), \(0.035\)). This work provides new tools to analyze and update risk stratification models used in settings where rankings inform clinical care.

Here’s a 30,000-foot summary of the paper.

Updating Clinical Risk Models While Maintaining User Trust

As machine learning models become more integrated into clinical care, it’s crucial that we understand how updating these models affects end users. Models may need to be retrained on new data to maintain predictive performance. But if an updated model behaves differently than expected, it can negatively affect how clinicians use it.

My doctoral advisors (Dr. Brian T. Denton and Dr. Jenna Wiens) and I recently explored this updating challenge for clinical risk stratification models. These models estimate a patient’s risk of some outcome, like mortality or sepsis. They’re used to identify high-risk patients who may need intervention.

Backwards Trust Compatibility

An existing compatibility measure is backwards trust compatibility, developed by Bansal et al. It measures how often patients that the original model labels correctly are also labeled correctly by the updated model. But it depends on setting a decision “threshold” to convert risk scores into labels.
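To make that concrete, here is a minimal sketch of backwards trust compatibility under one common formulation: the fraction of patients the original model gets right that the updated model also gets right. The function name, the `threshold=0.5` default, and the exact handling of edge cases are illustrative assumptions, not necessarily the formulation used by Bansal et al.

```python
import numpy as np

def backwards_trust_compatibility(y_true, risk_old, risk_new, threshold=0.5):
    """Fraction of patients the original model labels correctly that the
    updated model also labels correctly (higher = more compatible)."""
    y_true = np.asarray(y_true)
    pred_old = (np.asarray(risk_old) >= threshold).astype(int)
    pred_new = (np.asarray(risk_new) >= threshold).astype(int)
    old_correct = pred_old == y_true
    if old_correct.sum() == 0:
        return np.nan  # undefined if the original model gets nothing right
    both_correct = old_correct & (pred_new == y_true)
    return both_correct.sum() / old_correct.sum()
```

Note how everything hinges on `threshold`: change it and both the labels and the resulting compatibility score change.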

In many clinical settings, like ICUs, physicians may use risk scores directly without thresholds. So we wanted a compatibility measure that works for continuous risk estimates, not just thresholded labels.

Rank-Based Compatibility

We introduced a new rank-based compatibility measure. It doesn’t require thresholds. Instead, it checks if the updated model ranks patients in the same order as the original model.

For example, if the original model ranked patient A’s risk higher than patient B, does the updated model preserve this ordering? The more patient pair orderings it preserves, the higher its rank-based compatibility.
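As a rough illustration of the pairwise idea, here is a sketch that counts how many patient-pair orderings the updated model preserves. This is a simplification for exposition: the paper’s \(C^R\) is defined more carefully (e.g., with respect to observed outcomes and ties), so treat the function below as an assumption-laden approximation rather than the paper’s definition.

```python
import numpy as np
from itertools import combinations

def rank_based_compatibility_sketch(risk_old, risk_new):
    """Fraction of patient pairs whose ordering under the original model's
    risk scores is preserved by the updated model's risk scores."""
    risk_old, risk_new = np.asarray(risk_old), np.asarray(risk_new)
    preserved, total = 0, 0
    for i, j in combinations(range(len(risk_old)), 2):
        if risk_old[i] == risk_old[j]:
            continue  # skip ties for simplicity
        total += 1
        # does the updated model order this pair the same way?
        if (risk_old[i] > risk_old[j]) == (risk_new[i] > risk_new[j]):
            preserved += 1
    return preserved / total if total else float("nan")
```

Because it only compares orderings, no threshold is needed; the exact implementation we used is in the GitHub repo linked above.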

Training Models to Prioritize Compatibility

But simply measuring compatibility isn’t enough; we also want to optimize for it during model training. So we proposed a new loss function that balances predictive performance with rank-based compatibility, sketched below.
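Conceptually, one way to do this is a weighted sum of a standard classification loss and a differentiable pairwise surrogate that rewards agreeing with the original model’s orderings. The sketch below uses PyTorch; the weight `lam`, the sigmoid surrogate, and treating the new model’s scores as logits are all illustrative choices of mine, not necessarily the exact loss from the paper (see the repo for that).

```python
import torch
import torch.nn.functional as F

def compatibility_aware_loss(risk_new, risk_old, y_true, lam=0.5):
    """Weighted sum of a discriminative loss and a pairwise surrogate that
    rewards preserving the original model's risk orderings.
    `lam` trades off discriminative performance vs. compatibility."""
    # Standard discriminative term (binary cross-entropy on the new scores).
    bce = F.binary_cross_entropy_with_logits(risk_new, y_true.float())

    # Pairwise compatibility surrogate: for each patient pair, encourage the
    # new score difference to have the same sign as the old score difference.
    diff_old = risk_old.unsqueeze(0) - risk_old.unsqueeze(1)   # (n, n)
    diff_new = risk_new.unsqueeze(0) - risk_new.unsqueeze(1)   # (n, n)
    sign_old = torch.sign(diff_old)
    mask = sign_old != 0  # ignore ties and the diagonal
    # sigmoid(-sign_old * diff_new) is small when the orderings agree
    compat_penalty = torch.sigmoid(-sign_old * diff_new)[mask].mean()

    return (1 - lam) * bce + lam * compat_penalty
```

Sweeping `lam` traces out a trade-off between the two objectives, which is how one can search for updated models that are both accurate and compatible.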

Using a mortality prediction task built on MIMIC data, we compared models trained with a standard loss to models trained with our compatibility-aware loss function. The compatibility-optimized models achieved significantly better rank-based compatibility while maintaining discriminative performance.

Why This Matters

Model updating is inevitable as new data emerge. But unintended changes in model behavior can violate user expectations. By considering compatibility explicitly, we can develop clinical AI that better aligns with physician mental models.

This helps ensure updated models are readily adopted, instead of met with skepticism. It’s a small but important step as we integrate machine learning into high-stakes medical settings. We’re excited to continue improving these models collaboratively with end users.

Please let me know if you have any questions.

Cheers,
Erkin

N.B. this blog post was written in collaboration with Anthropic’s Claude.