Am J Ophthalmol
March 2026, Journal Article

ChatGPT-Assisted Glaucoma Diagnosis: A Health-Equitable Multi-Ancestry Analysis Using Visual Field and Optical Coherence Tomography Data.

Diagnosis & Screening | Optic Nerve & Disc

Summary

ChatGPT o1 Pro diagnosed glaucoma similarly to specialists using only VF and OCT data. The model performance was similar across ancestral groups and genetic predispositions to glaucoma.

Abstract

PURPOSE

Early glaucoma detection is challenging due to variable ocular anatomy, non-glaucomatous optic neuropathy impacting optical coherence tomography (OCT) results, and the subjective nature of visual field (VF) tests. Multimodal large language models may overcome these challenges to provide equitable and accurate screening diagnoses across ancestries and glaucoma genetic predispositions. We evaluated ChatGPT o1 Pro's accuracy in identifying glaucoma using circumpapillary retinal nerve fiber layer (RNFL) OCT and VF data, and its consistency across ancestries and glaucoma polygenic risk scores (PRS).

DESIGN

Cross-sectional diagnostic accuracy study.

SETTINGS AND PARTICIPANTS

We enrolled 204 participants from the Mount Sinai BioMe Biobank for a comprehensive ophthalmic examination from November 2022 to March 2025. The cohort comprised 38% European (EUR) and 62% non-European (non-EUR) participants, stratified by low/intermediate (n = 107) and high-risk glaucoma PRS (n = 97). Two glaucoma specialists masked to PRS status provided a consensus reference diagnosis. ChatGPT received only de-identified VFs and OCT-RNFL numerical outputs to determine glaucoma status. Performance metrics were compared with the reference diagnosis. Subgroup comparisons by ancestry (EUR versus non-EUR) and PRS (high versus low/intermediate) were conducted. We used logistic regression models to assess the impacts of ancestry, PRS, and ocular parameters on classification accuracy.

MAIN OUTCOME MEASURES

ChatGPT o1 Pro's diagnostic performance in detecting glaucoma compared to consensus specialist diagnoses, stratified by ancestry and genetic risk.

RESULTS

ChatGPT o1 Pro exhibited 96.0% sensitivity (95% confidence interval [CI]: 88.3%-100%), 83.7% specificity (95% CI: 78.3%-89.1%), 85.2% accuracy (95% CI: 80.3%-90.1%), an area under the receiver operating characteristic curve (AUC) of 0.899, a positive predictive value (PPV) of 45.3% (95% CI: 31.9%-58.7%), and a negative predictive value (NPV) of 99.3% (95% CI: 98.0%-100%); κ for agreement with the consensus reference was 0.538. No significant differences were observed between EUR and non-EUR subgroups (AUC: 0.894 vs 0.906, P = .79; accuracy: 88.3% vs 83.3%, P = .44) or high and low/intermediate-PRS subgroups (AUC: 0.889 vs 0.922, P = .45; accuracy: 85.4% vs 85.0%, P = .50). Global RNFL was the only determinant of reference disease classification (OR = 1.1 per micron, P < .001).
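The reported metrics all derive from a single 2x2 table of model output against the consensus reference. The abstract does not report that table, so the counts below are hypothetical: they were back-calculated as one set of integers consistent with the published percentages, and they are not the authors' data. The normal-approximation (Wald) interval, the helper names `wald_ci` and `cohen_kappa`, and the single-operating-point AUC are likewise assumptions made for illustration, not the study's stated methods.

```python
import math


def wald_ci(successes, total, z=1.96):
    """Proportion with a normal-approximation (Wald) 95% CI, clipped to [0, 1]."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for agreement between two binary raters."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement: product of marginal "positive" rates plus
    # product of marginal "negative" rates.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (observed - expected) / (1 - expected)


# ASSUMPTION: hypothetical counts, back-calculated from the reported
# percentages. The abstract itself does not report the 2x2 table.
TP, FN, FP, TN = 24, 1, 29, 149

metrics = {
    "sensitivity": wald_ci(TP, TP + FN),
    "specificity": wald_ci(TN, TN + FP),
    "accuracy": wald_ci(TP + TN, TP + FN + FP + TN),
    "ppv": wald_ci(TP, TP + FP),
    "npv": wald_ci(TN, TN + FN),
}
kappa = cohen_kappa(TP, FN, FP, TN)

# For a binary output with one operating point, the trapezoidal ROC area
# reduces to the mean of sensitivity and specificity.
auc_single_point = (metrics["sensitivity"][0] + metrics["specificity"][0]) / 2

for name, (estimate, lower, upper) in metrics.items():
    print(f"{name}: {estimate:.1%} (95% CI {lower:.1%}-{upper:.1%})")
print(f"kappa: {kappa:.3f}")
print(f"AUC (single operating point): {auc_single_point:.3f}")
```

With these assumed counts, the high NPV and modest PPV follow directly from the low proportion of reference-positive cases: a single false negative among 150 negative calls, against 29 false positives among 53 positive calls.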

CONCLUSION

ChatGPT o1 Pro diagnosed glaucoma similarly to specialists using only VF and OCT data. The model performance was similar across ancestral groups and genetic predispositions to glaucoma.
