Ophthalmol Sci, November 2025

Accuracy and Readability of Chat Generative Pre-Trained Transformer-4 Omni in Answering Ophthalmology Patient Questions.

Hamzeh Nikoo, Lidder Alcina K, Feder Robert S, Sarmiento Emmanuel A, Mirza Rukhsana G, Thau Avrey J, Tanna Angelo P

DOI: 10.1016/j.xops.2025.101007

AI Summary

ChatGPT-4o accurately answered most patient ophthalmology questions, but often at a 12th-grade reading level. Prompting for a 6th-grade level improved readability without losing accuracy, suggesting potential for patient education tools.

Abstract

Purpose

To assess the quality of Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to questions submitted by patients through Epic MyChart.

Design

Retrospective cross-sectional study.

Participants

One hundred sixty-five patients who submitted ophthalmology-related questions via Epic MyChart.

Methods

Questions related to the subspecialties of glaucoma, retina, and cornea that ophthalmology clinic patients submitted via Epic MyChart at a single institution were evaluated. Nonclinical questions were excluded. Each question was submitted to ChatGPT-4o twice, first without limitations and then after priming the large language model (LLM) to respond at a sixth-grade reading level. The ChatGPT-4o output and subsequent conversations were graded by 2 independent ophthalmologist reviewers as "accurate and complete," "incomplete," or "unacceptable" with respect to the quality of the output. A third subspecialist reviewer provided adjudication in cases of disagreement. Readability of the ChatGPT-4o output was assessed using the Flesch-Kincaid Grade Level and other readability indices.
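For readers unfamiliar with the Flesch-Kincaid Grade Level named above, the following is a minimal sketch of the standard formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The function names and the vowel-group syllable counter are assumptions made for illustration; the abstract does not say which tool or syllable-counting method the authors used, and dedicated readability software will give somewhat different values.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Standard Flesch-Kincaid Grade Level with naive tokenization."""
    # Split sentences on terminal punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )
```

Longer sentences and longer words both raise the score, which is why prompting for simpler wording can lower the computed grade level.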

Main outcome measures

Quality and readability of answers generated by ChatGPT-4o.

Results

Two hundred eighty-five queries asked by 165 patients were analyzed. Overall, 220 (77%) responses were graded as accurate and complete, 49 (17%) as incomplete, and 16 (6%) as unacceptable. The initial 2 reviewers agreed in 87% of the responses generated by ChatGPT-4o. The overall mean Flesch-Kincaid reading grade level was 12.1 ± 2.1. When asked to respond at a sixth-grade reading level, 242 (85%) responses were graded as accurate and complete, 38 (13%) were incomplete, and 5 (2%) were graded as unacceptable.
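The reported percentages follow directly from the counts out of 285 queries. The short sketch below recomputes them; the variable and function names are hypothetical, and only the counts come from the abstract.

```python
TOTAL_QUERIES = 285

# Counts as reported in the Results section.
unprompted = {"accurate and complete": 220, "incomplete": 49, "unacceptable": 16}
sixth_grade = {"accurate and complete": 242, "incomplete": 38, "unacceptable": 5}


def percentages(counts: dict, total: int = TOTAL_QUERIES) -> dict:
    """Convert grade counts to whole-number percentages of all queries."""
    if sum(counts.values()) != total:
        raise ValueError("counts do not add up to the total number of queries")
    return {grade: round(100 * n / total) for grade, n in counts.items()}


print(percentages(unprompted))
print(percentages(sixth_grade))
```

Both sets of counts sum to 285, so each question contributes one graded response per prompting condition.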

Conclusions

Chat Generative Pre-Trained Transformer-4 Omni usually provides accurate and complete answers to the questions posed by patients to their glaucoma, retina, and cornea subspecialists. A substantial proportion of the responses were, however, graded as incomplete or unacceptable. Chat Generative Pre-Trained Transformer-4 Omni responses required a 12th-grade education level as assessed by Flesch-Kincaid and other readability indices, which may make them difficult for many patients to understand; however, when prompted to do so, the LLM can generate responses at a sixth-grade reading level without a compromise in response quality. Chat Generative Pre-Trained Transformer-4 Omni can potentially be used to answer clinical ophthalmology questions posed by patients; however, additional refinement will be required prior to implementation of such an approach.

Financial disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

Shields Classification

Intro: An Overview of Glaucoma

Ch. 27: Management of the Glaucoma Patient

Key Concepts

Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to questions submitted by patients through Epic MyChart were graded as accurate and complete in 77% of cases (220 of 285), incomplete in 17% (49 of 285), and unacceptable in 6% (16 of 285), based on evaluation by two independent ophthalmologist reviewers with adjudication by a third subspecialist reviewer.

Methodology: Cross-sectional; retrospective cross-sectional study; n=285 queries from 165 patients; Ch28

The overall mean Flesch-Kincaid reading grade level for Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to ophthalmology patient questions was 12.1 ± 2.1, indicating a 12th-grade education level was required for understanding.

When Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) was primed to respond at a sixth-grade reading level to ophthalmology patient questions, 85% (242 out of 285) of responses were graded as accurate and complete, 13% (38 out of 285) were incomplete, and 2% (5 out of 285) were unacceptable, demonstrating that the large language model can generate responses at a lower reading level without compromising quality.

A retrospective cross-sectional study evaluated 285 queries asked by 165 patients related to glaucoma, retina, and cornea subspecialties via Epic MyChart, with each question submitted to Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) twice (once without limitations and once primed for a sixth-grade reading level) and graded by two independent ophthalmologist reviewers.
