ChatGPT-Assisted Glaucoma Diagnosis: A Health-Equitable Multi-Ancestry Analysis Using Visual Field and Optical Coherence Tomography Data.
Huang Andy S, Fam Anthony, Zhao Hetince, Paulescu Nicole, Fabczak-Kubicka Anna, Wiggs Janey L, Zebardast Nazlee, Friedman David S, Do Ron, Aziz Kanza
AI Summary
ChatGPT diagnosed glaucoma with high sensitivity (96%) and reasonable specificity (83.7%) from visual field/OCT data. Its performance was consistent across diverse ancestries and genetic risks, suggesting LLMs could be unbiased screening tools for early glaucoma detection.
Abstract
Purpose
Early glaucoma detection is challenging due to variable ocular anatomy, non-glaucomatous optic neuropathy impacting optical coherence tomography (OCT) results, and the subjective nature of visual field (VF) tests. Multimodal large language models may overcome these challenges to provide equitable and accurate screening diagnoses across ancestries and glaucoma genetic predispositions. We evaluated ChatGPT o1 Pro's accuracy in identifying glaucoma using circumpapillary retinal nerve fiber layer (RNFL) OCT and VF data, and its consistency across ancestries and glaucoma polygenic risk scores (PRS).
Design
Cross-sectional diagnostic accuracy study.
Settings and participants
We enrolled 204 participants from the Mount Sinai BioMe Biobank for a comprehensive ophthalmic examination from November 2022 to March 2025. This cross-sectional diagnostic accuracy study included 38% European (EUR) and 62% non-European (non-EUR) participants stratified by low/intermediate (n = 107) and high-risk glaucoma PRS (n = 97). Two glaucoma specialists masked to PRS status provided a consensus reference diagnosis. ChatGPT received only de-identified VFs and OCT-RNFL numerical outputs to determine glaucoma status. Performance metrics were compared with the reference diagnosis. Subgroup comparisons by ancestry (EUR versus non-EUR) and PRS (high versus low/intermediate) were conducted. We used logistic regression models to assess the impacts of ancestry, PRS, and ocular parameters on classification accuracy.
Main outcome measures
ChatGPT o1 Pro's diagnostic performance in detecting glaucoma compared to consensus specialist diagnoses, stratified by ancestry and genetic risk.
Results
ChatGPT o1 Pro exhibited 96.0% sensitivity (95% confidence interval (CI): 88.3%-100%), 83.7% specificity (95% CI: 78.3%-89.1%), 85.2% accuracy (95% CI: 80.3%-90.1%), an area under the receiver operating characteristic curve (AUC) of 0.899, a positive predictive value (PPV) of 45.3% (95% CI: 31.9%-58.7%), and a negative predictive value (NPV) of 99.3% (95% CI: 98.0%-100%); κ for agreement with the consensus reference was 0.538. No significant differences were observed between EUR and non-EUR subgroups (AUC: 0.894 vs 0.906, P = .79; accuracy: 88.3% vs 83.3%, P = .44) or between high and low/intermediate-PRS subgroups (AUC: 0.889 vs 0.922, P = .45; accuracy: 85.4% vs 85.0%, P = .50). Global RNFL was the only determinant of reference disease classification (OR = 1.1 per micron, P < .001).
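The combination of a low PPV with high sensitivity and NPV follows from the modest share of glaucoma cases in the cohort. The sketch below is illustrative only: it assumes nothing beyond the reported point estimates and Bayes' rule, and the function names are hypothetical rather than taken from the study. It backs out the prevalence implied by the reported sensitivity, specificity, and PPV, then recomputes PPV and NPV from that prevalence.

```python
def implied_prevalence(sens, spec, ppv):
    """Prevalence implied by sensitivity, specificity and PPV (Bayes' rule on odds)."""
    lr_pos = sens / (1.0 - spec)      # positive likelihood ratio
    post_odds = ppv / (1.0 - ppv)     # odds of disease given a positive call
    pre_odds = post_odds / lr_pos     # odds of disease before testing
    return pre_odds / (1.0 + pre_odds)


def predictive_values(sens, spec, prev):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sens * prev                  # expected true-positive fraction
    fn = (1.0 - sens) * prev          # expected false-negative fraction
    fp = (1.0 - spec) * (1.0 - prev)  # expected false-positive fraction
    tn = spec * (1.0 - prev)          # expected true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Point estimates reported in the Results paragraph above.
SENS, SPEC, PPV = 0.960, 0.837, 0.453

prev = implied_prevalence(SENS, SPEC, PPV)
ppv, npv = predictive_values(SENS, SPEC, prev)
print(f"implied prevalence: {prev:.3f}")
print(f"PPV: {ppv:.3f}  NPV: {npv:.3f}")
```

Under these assumptions the implied prevalence is roughly 12%, and the recomputed NPV lands close to the reported 99.3%. Applying the same sensitivity and specificity to a population with lower glaucoma prevalence would yield a lower PPV.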
Conclusion
ChatGPT o1 Pro diagnosed glaucoma similarly to specialists using only VF and OCT data. The model performance was similar across ancestral groups and genetic predispositions to glaucoma.
Key Concepts
ChatGPT o1 Pro exhibited 96.0% sensitivity (95% CI: 88.3%-100%), 83.7% specificity (95% CI: 78.3%-89.1%), and 85.2% accuracy (95% CI: 80.3%-90.1%) in identifying glaucoma using circumpapillary retinal nerve fiber layer (RNFL) OCT and visual field (VF) data, compared to consensus specialist diagnoses.
ChatGPT o1 Pro demonstrated an area under the receiver operating characteristic curve (AUC) of 0.899, a positive predictive value (PPV) of 45.3% (95% CI: 31.9%-58.7%), and a negative predictive value (NPV) of 99.3% (95% CI: 98.0%-100%) for glaucoma diagnosis using circumpapillary retinal nerve fiber layer (RNFL) OCT and visual field (VF) data.
No significant differences were observed in ChatGPT o1 Pro's glaucoma diagnostic performance between European (EUR) and non-European (non-EUR) subgroups (AUC: 0.894 vs 0.906, P = .79; accuracy: 88.3% vs 83.3%, P = .44) or between high and low/intermediate-polygenic risk score (PRS) subgroups (AUC: 0.889 vs 0.922, P = .45; accuracy: 85.4% vs 85.0%, P = .50).
Global retinal nerve fiber layer (RNFL) was the only determinant of reference disease classification, with an odds ratio (OR) of 1.1 per micron (P < .001) in a study assessing ChatGPT o1 Pro's glaucoma diagnostic accuracy.
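Sensitivity, specificity, accuracy, PPV, NPV, and Cohen's κ are all derived from a single 2x2 table of model calls against the reference diagnosis. The abstract does not report that table, so the counts in the sketch below are hypothetical, chosen only so that the resulting values land close to the reported percentages. They should not be read as the study's data, and the function name is likewise illustrative.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard diagnostic accuracy metrics and Cohen's kappa from 2x2 counts."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of each rater.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (accuracy - p_chance) / (1.0 - p_chance),
    }


# HYPOTHETICAL counts (not reported in the abstract): 204 participants,
# 25 reference-positive and 179 reference-negative.
metrics = diagnostic_metrics(tp=24, fn=1, fp=29, tn=150)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, the computed values fall within about 0.1 percentage point of the reported sensitivity, specificity, accuracy, PPV, and NPV, and κ comes out near the reported 0.538. This shows how a moderate κ can coexist with high sensitivity when false positives outnumber true positives.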