ChatGPT-Assisted Glaucoma Diagnosis: A Health-Equitable Multi-Ancestry Analysis Using Visual Field and Optical Coherence Tomography Data.
Andy S Huang, Anthony Fam, Hetince Zhao, Nicole Paulescu, Anna Fabczak-Kubicka, Janey L Wiggs, Nazlee Zebardast, David S Friedman, Ron Do, Kanza Aziz, Jae Hee Kang, Tobias Elze, Mengyu Wang, Alon Harris, Tak Yee Tania Tai, James C Tsai, Louis R Pasquale
Summary
ChatGPT o1 Pro diagnosed glaucoma similarly to specialists using only VF and OCT data. The model performance was similar across ancestral groups and genetic predispositions to glaucoma.
Abstract
PURPOSE
Early glaucoma detection is challenging due to variable ocular anatomy, non-glaucomatous optic neuropathy impacting optical coherence tomography (OCT) results, and the subjective nature of visual field (VF) tests. Multimodal large language models may overcome these challenges to provide equitable and accurate screening diagnoses across ancestries and glaucoma genetic predispositions. We evaluated ChatGPT o1 Pro's accuracy in identifying glaucoma using circumpapillary retinal nerve fiber layer (RNFL) OCT and VF data, and its consistency across ancestries and glaucoma polygenic risk scores (PRS).
DESIGN
Cross-sectional diagnostic accuracy study.
SETTINGS AND PARTICIPANTS
We enrolled 204 participants from the Mount Sinai BioMe Biobank for a comprehensive ophthalmic examination from November 2022 to March 2025. This cross-sectional diagnostic accuracy study included 38% European (EUR) and 62% non-European (non-EUR) participants stratified by low/intermediate (n = 107) and high-risk glaucoma PRS (n = 97). Two glaucoma specialists masked to PRS status provided a consensus reference diagnosis. ChatGPT received only de-identified VFs and OCT-RNFL numerical outputs to determine glaucoma status. Performance metrics were compared with the reference diagnosis. Subgroup comparisons by ancestry (EUR versus non-EUR) and PRS (high versus low/intermediate) were conducted. We used logistic regression models to assess the impacts of ancestry, PRS and ocular parameters on classification accuracy.
MAIN OUTCOME MEASURES
ChatGPT o1 Pro's diagnostic performance in detecting glaucoma compared to consensus specialist diagnoses, stratified by ancestry and genetic risk.
RESULTS
ChatGPT o1 Pro exhibited 96.0% sensitivity (95% confidence interval (CI): 88.3%-100%), 83.7% specificity (95% CI: 78.3%-89.1%), 85.2% accuracy (95% CI: 80.3%-90.1%), an area under the receiver operating characteristic curve (AUC) of 0.899, a positive predictive value (PPV) of 45.3% (95% CI: 31.9%-58.7%), and a negative predictive value (NPV) of 99.3% (95% CI: 98.0%-100%); κ for agreement with the consensus reference was 0.538. No significant differences were observed between EUR and non-EUR subgroups (AUC: 0.894 vs 0.906, P = .79; accuracy: 88.3% vs 83.3%, P = .44) or between high and low/intermediate-PRS subgroups (AUC: 0.889 vs 0.922, P = .45; accuracy: 85.4% vs 85.0%, P = .50). Global RNFL thickness was the only determinant of reference disease classification (OR = 1.1 per micron, P < .001).
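The relationship between the reported percentages can be illustrated with a small sketch. The abstract does not report the underlying 2x2 table, so the counts below (24 true positives, 1 false negative, 29 false positives, 149 true negatives) are an assumption: they are one set of counts, back-calculated here, that happens to reproduce the reported point estimates. They sum to 203 rather than the 204 enrolled, and the abstract does not explain that difference. The interval method is also an assumption: a simple Wald (normal-approximation) interval truncated to [0, 1] is used because it matches the reported bounds, not because the source names it. The function names are hypothetical.

```python
from math import sqrt

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def wald_ci(successes, total, z=Z_95):
    """Wald interval for a proportion, truncated to [0, 1] (assumed method)."""
    p = successes / total
    half_width = z * sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def diagnostic_metrics(tp, fn, fp, tn):
    """Return {name: (estimate, (ci_low, ci_high))} for a 2x2 table."""
    n = tp + fn + fp + tn
    pairs = {
        "sensitivity": (tp, tp + fn),   # detected among reference-positive
        "specificity": (tn, tn + fp),   # cleared among reference-negative
        "accuracy": (tp + tn, n),       # all correct calls
        "ppv": (tp, tp + fp),           # correct among model-positive
        "npv": (tn, tn + fn),           # correct among model-negative
    }
    return {name: (k / m, wald_ci(k, m)) for name, (k, m) in pairs.items()}


def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa: agreement beyond what the marginals predict by chance."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical counts, NOT reported in the abstract (see lead-in).
ASSUMED_COUNTS = dict(tp=24, fn=1, fp=29, tn=149)

metrics = diagnostic_metrics(**ASSUMED_COUNTS)
kappa = cohen_kappa(**ASSUMED_COUNTS)

if __name__ == "__main__":
    for name, (estimate, (low, high)) in metrics.items():
        print(f"{name}: {estimate:.1%} (95% CI: {low:.1%}-{high:.1%})")
    print(f"kappa: {kappa:.3f}")
```

Under these assumed counts, the low PPV alongside the high NPV follows directly from the small number of reference-positive cases: 29 false positives outnumber 24 true positives, while only a single case is missed.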
CONCLUSION
ChatGPT o1 Pro diagnosed glaucoma similarly to specialists using only VF and OCT data. The model performance was similar across ancestral groups and genetic predispositions to glaucoma.