Ophthalmol Sci
April 2026 · Journal Article

Improving Fairness and Mitigating Bias in Multicenter Electronic Health Records Models to Predict Glaucoma Outcomes.

Artificial Intelligence · Disease Progression

Summary

In-processing methods best mitigate bias in glaucoma progression prediction models, improving fairness and performance across diverse patient groups. FairOdds-AUC is a new, flexible metric for evaluating fair clinical AI.

Abstract

PURPOSE

To evaluate the effectiveness and generalizability of bias mitigation methods in glaucoma progression prediction models across a multicenter electronic health records (EHRs) repository and to propose a novel evaluation metric that balances performance and fairness in clinical artificial intelligence (AI).

DESIGN

A cohort study.

PARTICIPANTS

A total of 50 656 glaucoma patients drawn from seven participating institutions in the SOURCE consortium, a harmonized EHR repository spanning ophthalmology departments in the United States.

METHODS

We trained five model architectures (e.g., XGBoost, neural networks, and transformers) to predict progression to surgery. Each model was evaluated with and without five bias mitigation methods spanning pre-processing, in-processing, and post-processing approaches. Performance and fairness were assessed on 1 internal and 2 external test sets. We introduced FairOdds-AUC, a composite metric that adjusts the area under the receiver operating characteristic curve (AUROC) by equalized odds gaps across sex and race/ethnicity. The FairOdds-AUC metric was implemented in Python and is available as an open-source package for reproducibility and future use.
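The abstract describes FairOdds-AUC only at a high level (AUROC adjusted by equalized odds gaps across sex and race/ethnicity); the exact formula lives in the paper's open-source package. The sketch below is a hypothetical illustration of that idea, not the authors' implementation: the `weight` and `threshold` parameters, the penalty form `AUROC * (1 - weight * gap)`, and the averaging of gaps across the two attributes are all assumptions made here for clarity.

```python
# Hypothetical sketch of an AUROC-times-fairness-penalty metric in the
# spirit of FairOdds-AUC. The actual published formula may differ.
import numpy as np
from sklearn.metrics import roc_auc_score


def equalized_odds_gap(y_true, y_pred, groups):
    """Largest between-group gap in TPR or FPR for binary predictions.

    Groups with no positives (or no negatives) are skipped for the
    corresponding rate rather than counted as a zero rate.
    """
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    tprs, fprs = [], []
    for g in np.unique(groups):
        m = groups == g
        pos, neg = y_true[m] == 1, y_true[m] == 0
        if pos.any():
            tprs.append(y_pred[m][pos].mean())  # group-level TPR
        if neg.any():
            fprs.append(y_pred[m][neg].mean())  # group-level FPR
    rate_gap = lambda r: max(r) - min(r) if len(r) > 1 else 0.0
    return max(rate_gap(tprs), rate_gap(fprs))


def fair_odds_auc(y_true, y_score, sex, race, threshold=0.5, weight=0.5):
    """AUROC discounted by the mean equalized odds gap over sex and
    race/ethnicity (assumed penalty form; `weight` tunes the tradeoff)."""
    auroc = roc_auc_score(y_true, y_score)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    gap = np.mean([equalized_odds_gap(y_true, y_pred, sex),
                   equalized_odds_gap(y_true, y_pred, race)])
    return auroc * (1.0 - weight * gap)
```

Under this assumed form, a model with perfect parity (zero gap) keeps its full AUROC, while larger equalized odds gaps shrink the score toward `weight`-scaled fractions of it, which matches the abstract's description of a metric that "consistently reflected the balance between AUROC and equalized odds."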

MAIN OUTCOME MEASURES

Area under the receiver operating characteristic curve (AUROC), equalized odds for sex and race/ethnicity, and FairOdds-AUC.

RESULTS

In-processing methods, particularly inverse propensity weighting (IPW) and the adversarial fairness classifier, achieved more favorable fairness-performance tradeoffs than baseline and other mitigation approaches across all evaluation sets. For example, on the internal test set, IPW improved FairOdds-AUC from 0.562 (95% confidence interval 0.540, 0.581) to 0.600 (0.575, 0.629) for the transformer model and from 0.556 (0.534, 0.577) to 0.592 (0.530, 0.619) for a fully connected network, while maintaining essentially the same discrimination. The adversarial fairness classifier achieved the highest FairOdds-AUC in several settings (up to 0.613 [0.595, 0.629] for the deep learning fully connected network), with substantial reductions in the equalized odds difference for sex. Post-processing and pre-processing bias mitigation strategies yielded more variable FairOdds-AUC changes (-0.009 to +0.021) and generalized less well across external sites. FairOdds-AUC consistently reflected the balance between AUROC and equalized odds, with the optimal mitigation strategy depending on fairness-utility priorities.

CONCLUSIONS

Across a large, diverse glaucoma cohort, in-processing bias mitigation methods most consistently promoted fairness while preserving performance across evaluation sites. FairOdds-AUC offers a flexible, interpretable way to evaluate clinical AI where fairness matters. Our findings support incorporating fairness evaluations and fairness-aware model training into future ophthalmic AI applications.

FINANCIAL DISCLOSURES

Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

Keywords

Artificial intelligence · Electronic health records · Fairness · Generalizability · Glaucoma

