JAMA Ophthalmol | April 2024 | 89 citations

Assessment of a Large Language Model's Responses to Questions and Cases About Glaucoma and Retina Management.

Huang Andy S, Hirabayashi Kyle, Barna Laura, Parikh Deep, Pasquale Louis R

DOI: 10.1001/jamaophthalmol.2023.6917

AI Summary

This study found that GPT-4 was rated higher than glaucoma specialists on both accuracy and completeness, and matched retina specialists on accuracy while being rated higher on completeness, suggesting its promise as an ophthalmic clinical adjunct.

Abstract

Importance

Large language models (LLMs) are revolutionizing medical diagnosis and treatment, offering unprecedented accuracy and ease surpassing conventional search engines. Their integration into medical assistance programs will become pivotal for ophthalmologists as an adjunct for practicing evidence-based medicine. Therefore, the diagnostic and treatment accuracy of LLM-generated responses compared with fellowship-trained ophthalmologists can help assess their accuracy and validate their potential utility in ophthalmic subspecialties.

Objective

To compare the diagnostic accuracy and comprehensiveness of responses from an LLM chatbot with those of fellowship-trained glaucoma and retina specialists on ophthalmological questions and real patient case management.

Design, Setting, and Participants

This comparative cross-sectional study recruited 15 participants aged 31 to 67 years, including 12 attending physicians and 3 senior trainees, from eye clinics affiliated with the Department of Ophthalmology at Icahn School of Medicine at Mount Sinai, New York, New York. Glaucoma and retina questions (10 of each type) were randomly selected from the American Academy of Ophthalmology's commonly asked questions Ask an Ophthalmologist. Deidentified glaucoma and retinal cases (10 of each type) were randomly selected from ophthalmology patients seen at Icahn School of Medicine at Mount Sinai-affiliated clinics. The LLM used was GPT-4 (version dated May 12, 2023). Data were collected from June to August 2023.

Main Outcomes and Measures

Responses were assessed via a Likert scale for medical accuracy and completeness. Statistical analysis involved the Mann-Whitney U test and the Kruskal-Wallis test, followed by pairwise comparison.

Results

The combined question-case mean rank for accuracy was 506.2 for the LLM chatbot and 403.4 for glaucoma specialists (n = 831; Mann-Whitney U = 27976.5; P < .001), and the mean rank for completeness was 528.3 and 398.7, respectively (n = 828; Mann-Whitney U = 25218.5; P < .001). The mean rank for accuracy was 235.3 for the LLM chatbot and 216.1 for retina specialists (n = 440; Mann-Whitney U = 15518.0; P = .17), and the mean rank for completeness was 258.3 and 208.7, respectively (n = 439; Mann-Whitney U = 13123.5; P = .005). The Dunn test revealed a significant difference between all pairwise comparisons, except specialist vs trainee in rating chatbot completeness. The overall pairwise comparisons showed that both trainees and specialists rated the chatbot's accuracy and completeness more favorably than those of their specialist counterparts, with specialists noting a significant difference in the chatbot's accuracy (z = 3.23; P = .007) and completeness (z = 5.86; P < .001).

Conclusions and Relevance

This study accentuates the comparative proficiency of LLM chatbots in diagnostic accuracy and completeness compared with fellowship-trained ophthalmologists in various clinical scenarios. The LLM chatbot outperformed glaucoma specialists and matched retina specialists in diagnostic and treatment accuracy, substantiating its role as a promising diagnostic adjunct in ophthalmology.

MeSH Terms

Humans; United States; Cross-Sectional Studies; Glaucoma; Retina; Ophthalmologists

Shields Classification

Intro: An Overview of Glaucoma

Ch. 27: Management of the Glaucoma Patient

Key Concepts (3)

The LLM chatbot (GPT-4, version dated May 12, 2023) outperformed fellowship-trained glaucoma specialists in diagnostic accuracy (mean rank 506.2 for LLM vs 403.4 for specialists; n=831; Mann-Whitney U=27976.5; P<.001) and completeness (mean rank 528.3 for LLM vs 398.7 for specialists; n=828; Mann-Whitney U=25218.5; P<.001) when responding to ophthalmological questions and real patient case management.

Comparative Effectiveness; Cross-sectional; Comparative Cross-sectional Study; n=15 participants (12 attending physici…; Ch1; Ch28

The LLM chatbot (GPT-4, version dated May 12, 2023) matched fellowship-trained retina specialists in diagnostic accuracy (mean rank 235.3 for LLM vs 216.1 for specialists; n=440; Mann-Whitney U=15518.0; P=.17) but was rated significantly higher on completeness (mean rank 258.3 for LLM vs 208.7 for specialists; n=439; Mann-Whitney U=13123.5; P=.005) when responding to ophthalmological questions and real patient case management.

Comparative Effectiveness; Cross-sectional; Comparative Cross-sectional Study; n=15 participants (12 attending physici…; Ch5; Ch28

Both trainees and specialists rated the LLM chatbot's (GPT-4, version dated May 12, 2023) accuracy and completeness more favorably than those of their specialist counterparts, with specialists noting a significant difference in the chatbot's accuracy (z=3.23; P=.007) and completeness (z=5.86; P<.001).
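For readers unfamiliar with the statistics quoted above, the following is a minimal sketch of how mean ranks and a Mann-Whitney U statistic are derived from two groups of Likert ratings. It is not the authors' analysis code. The function names (`average_ranks`, `mann_whitney_summary`) and the ratings are hypothetical and chosen only for illustration; no study data are reproduced. The abstract does not state which U convention was reported, so the sketch returns the U value for each group along with the smaller of the two, and it does not compute a P value.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the full run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_summary(group_a, group_b):
    """Return mean ranks and U statistics for two independent samples."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    rank_sum_b = sum(ranks[n_a:])
    # U for a group is its rank sum minus the smallest rank sum it could have.
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = rank_sum_b - n_b * (n_b + 1) / 2
    return {
        "mean_rank_a": rank_sum_a / n_a,
        "mean_rank_b": rank_sum_b / n_b,
        "u_a": u_a,
        "u_b": u_b,
        "u": min(u_a, u_b),
    }


# Hypothetical Likert ratings, for illustration only (not study data).
chatbot_ratings = [5, 5, 4, 5, 3]
specialist_ratings = [4, 3, 5, 2, 4, 3]

summary = mann_whitney_summary(chatbot_ratings, specialist_ratings)
print(summary)
```

A higher mean rank means that group's ratings tend to sit higher in the pooled ordering, which is how the mean ranks for the LLM and the specialists are compared above. The two U values always sum to the product of the group sizes, which gives a quick sanity check on any reported figure.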