Ophthalmol Sci, November 2025 (0 citations)

Accuracy and Readability of Chat Generative Pre-Trained Transformer-4 Omni in Answering Ophthalmology Patient Questions.

Hamzeh Nikoo, Lidder Alcina K, Feder Robert S, Sarmiento Emmanuel A, Mirza Rukhsana G, Thau Avrey J, Tanna Angelo P


AI Summary

ChatGPT-4o accurately answered most patient ophthalmology questions, but often at a 12th-grade reading level. Prompting for a 6th-grade level improved readability without losing accuracy, suggesting potential for patient education tools.

Abstract

Purpose

To assess the quality of Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to questions submitted by patients through Epic MyChart.

Design

Retrospective cross-sectional study.

Participants

One hundred sixty-five patients who submitted ophthalmology-related questions via Epic MyChart.

Methods

Questions related to the subspecialties of glaucoma, retina, and cornea that ophthalmology clinic patients submitted via Epic MyChart at a single institution were evaluated. Nonclinical questions were excluded. Each question was submitted to ChatGPT-4o twice, first without limitations and then after priming the large language model (LLM) to respond at a sixth-grade reading level. The ChatGPT-4o output and subsequent conversations were graded by 2 independent ophthalmologist reviewers as "accurate and complete," "incomplete," or "unacceptable" with respect to the quality of the output. A third subspecialist reviewer adjudicated cases of disagreement. Readability of the ChatGPT-4o output was assessed using the Flesch-Kincaid Grade Level and other readability indices.
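
The abstract does not state how the Flesch-Kincaid Grade Level was computed. As a rough illustration only, the standard formula, grade = 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, can be sketched in Python as below. The function names and the vowel-group syllable heuristic are assumptions made for this example, not the study's tooling; dedicated readability software counts syllables more accurately.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (heuristic, an assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Discount a likely silent trailing 'e' (e.g. "use"), but not "-le" or "-ee".
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level using the standard published coefficients."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    plain = "The eye is red. Use the drops."
    dense = "Intraocular pressure elevation necessitates immediate ophthalmological evaluation."
    print(round(flesch_kincaid_grade(plain), 2))
    print(round(flesch_kincaid_grade(dense), 2))
```

Short sentences built from short words score low, while a single sentence of polysyllabic clinical terms scores far higher, which is the pattern the readability comparison in this study relies on.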

Main outcome measures

Quality and readability of answers generated by ChatGPT-4o.

Results

Two hundred eighty-five queries asked by 165 patients were analyzed. Overall, 220 (77%) responses were graded as accurate and complete, 49 (17%) as incomplete, and 16 (6%) as unacceptable. The initial 2 reviewers agreed on 87% of the responses generated by ChatGPT-4o. The overall mean Flesch-Kincaid reading grade level was 12.1 ± 2.1. When ChatGPT-4o was asked to respond at a sixth-grade reading level, 242 (85%) responses were graded as accurate and complete, 38 (13%) as incomplete, and 5 (2%) as unacceptable.
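
The reported percentages follow directly from the counts above. A minimal Python sketch of that arithmetic is shown here; the variable and function names are hypothetical, and only the counts come from the abstract.

```python
TOTAL_QUERIES = 285  # queries analyzed, per the abstract

# Grade counts reported in the abstract for each prompting condition.
unprompted = {"accurate and complete": 220, "incomplete": 49, "unacceptable": 16}
sixth_grade = {"accurate and complete": 242, "incomplete": 38, "unacceptable": 5}


def percentages(counts: dict[str, int], total: int = TOTAL_QUERIES) -> dict[str, int]:
    """Convert grade counts to whole-number percentages of all queries."""
    if sum(counts.values()) != total:
        raise ValueError("grade counts do not sum to the total number of queries")
    return {grade: round(100 * n / total) for grade, n in counts.items()}


if __name__ == "__main__":
    for label, counts in (("unprompted", unprompted), ("sixth-grade prompt", sixth_grade)):
        print(label, percentages(counts))
```

Both sets of counts sum to 285, and rounding each share to the nearest whole number reproduces the figures in the abstract (77%, 17%, 6% and 85%, 13%, 2%).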

Conclusions

Chat Generative Pre-Trained Transformer-4 Omni usually provides accurate and complete answers to the questions posed by patients to their glaucoma, retina, and cornea subspecialists. A substantial proportion of the responses were, however, graded as incomplete or unacceptable. Chat Generative Pre-Trained Transformer-4 Omni responses required a 12th-grade education level as assessed by Flesch-Kincaid and other readability indices, which may make them difficult for many patients to understand; however, when prompted to do so, the LLM can generate responses at a sixth-grade reading level without a compromise in response quality. Chat Generative Pre-Trained Transformer-4 Omni can potentially be used to answer clinical ophthalmology questions posed by patients; however, additional refinement will be required prior to implementation of such an approach.

Financial disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.


Key Concepts (4)

Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to questions submitted by patients through Epic MyChart were graded as accurate and complete in 77% (220 out of 285) of cases, incomplete in 17% (49 out of 285) of cases, and unacceptable in 6% (16 out of 285) of cases, based on evaluation by two independent ophthalmologist reviewers with adjudication by a third subspecialist reviewer.

Methodology: cross-sectional (retrospective cross-sectional study); n = 285 queries from 165 patients; Ch. 28

The overall mean Flesch-Kincaid reading grade level for Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to ophthalmology patient questions was 12.1 ± 2.1, indicating a 12th-grade education level was required for understanding.

When Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) was primed to respond at a sixth-grade reading level to ophthalmology patient questions, 85% (242 out of 285) of responses were graded as accurate and complete, 13% (38 out of 285) were incomplete, and 2% (5 out of 285) were unacceptable, demonstrating that the large language model can generate responses at a lower reading level without compromising quality.

A retrospective cross-sectional study evaluated 285 queries asked by 165 patients related to glaucoma, retina, and cornea subspecialties via Epic MyChart, with each question submitted to Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) twice (once without limitations and once primed for a sixth-grade reading level) and graded by two independent ophthalmologist reviewers.
