Accuracy and Readability of Chat Generative Pre-Trained Transformer-4 Omni in Answering Ophthalmology Patient Questions.
Hamzeh Nikoo, Lidder Alcina K, Feder Robert S, Sarmiento Emmanuel A, Mirza Rukhsana G, Thau Avrey J, Tanna Angelo P
AI Summary
ChatGPT-4o accurately answered most patient ophthalmology questions, but often at a 12th-grade reading level. Prompting for a 6th-grade level improved readability without losing accuracy, suggesting potential for patient education tools.
Abstract
Purpose
To assess the quality of Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to questions submitted by patients through Epic MyChart.
Design
Retrospective cross-sectional study.
Participants
One hundred sixty-five patients who submitted ophthalmology-related questions via Epic MyChart.
Methods
Questions that ophthalmology clinic patients submitted via Epic MyChart at a single institution, related to the subspecialties of glaucoma, retina, and cornea, were evaluated. Nonclinical questions were excluded. Each question was submitted to ChatGPT-4o twice, first without limitations and then after priming the large language model (LLM) to respond at a sixth-grade reading level. The ChatGPT-4o output and subsequent conversations were graded by 2 independent ophthalmologist reviewers as "accurate and complete," "incomplete," or "unacceptable" with respect to the quality of the output. A third subspecialist reviewer provided adjudication in cases of disagreement. Readability of the ChatGPT-4o output was assessed using the Flesch-Kincaid Grade Level and other readability indices.
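For readers unfamiliar with the Flesch-Kincaid Grade Level, the following is a minimal sketch of the standard formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. It is not the tool used in the study, which the abstract does not name. The function names and the vowel-group syllable counter are illustrative assumptions; dedicated readability software counts syllables more carefully, so scores from this sketch will differ somewhat.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count groups of consecutive vowels and drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level of a block of English text."""
    # Sentences are approximated by splitting on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the study data.
    plain = "The eye is red. Use the drops."
    dense = "Intraocular pressure elevation necessitates pharmacological intervention."
    print(flesch_kincaid_grade(plain), flesch_kincaid_grade(dense))
```

Short sentences built from short words score low, while long sentences with polysyllabic clinical terms score high, which is the contrast the study draws between a sixth-grade and a 12th-grade reading level.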
Main outcome measures
Quality and readability of answers generated by ChatGPT-4o.
Results
Two hundred eighty-five queries asked by 165 patients were analyzed. Overall, 220 (77%) responses were graded as accurate and complete, 49 (17%) as incomplete, and 16 (6%) as unacceptable. The initial 2 reviewers agreed on 87% of the responses generated by ChatGPT-4o. The overall mean Flesch-Kincaid reading grade level was 12.1 ± 2.1. When ChatGPT-4o was asked to respond at a sixth-grade reading level, 242 (85%) responses were graded as accurate and complete, 38 (13%) as incomplete, and 5 (2%) as unacceptable.
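The reported percentages follow directly from the grade counts over 285 queries. The short sketch below reproduces that arithmetic as a consistency check; the variable and function names are illustrative, and only the counts come from the abstract.

```python
TOTAL_QUERIES = 285

# Grade counts as reported in the abstract.
unprimed = {"accurate and complete": 220, "incomplete": 49, "unacceptable": 16}
sixth_grade = {"accurate and complete": 242, "incomplete": 38, "unacceptable": 5}


def to_percentages(counts: dict[str, int], total: int) -> dict[str, int]:
    """Convert grade counts to whole-number percentages of the total."""
    if sum(counts.values()) != total:
        raise ValueError("grade counts do not add up to the total")
    return {grade: round(100 * n / total) for grade, n in counts.items()}


if __name__ == "__main__":
    print(to_percentages(unprimed, TOTAL_QUERIES))
    print(to_percentages(sixth_grade, TOTAL_QUERIES))
```

Both sets of counts sum to 285, and rounding to whole percentages gives 77%, 17%, and 6% for the unrestricted responses and 85%, 13%, and 2% for the sixth-grade responses, matching the figures above.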
Conclusions
Chat Generative Pre-Trained Transformer-4 Omni usually provides accurate and complete answers to the questions posed by patients to their glaucoma, retina, and cornea subspecialists. A substantial proportion of the responses were, however, graded as incomplete or unacceptable. Chat Generative Pre-Trained Transformer-4 Omni responses required a 12th-grade education level as assessed by Flesch-Kincaid and other readability indices, which may make them difficult for many patients to understand; however, when prompted to do so, the LLM can generate responses at a sixth-grade reading level without a compromise in response quality. Chat Generative Pre-Trained Transformer-4 Omni can potentially be used to answer clinical ophthalmology questions posed by patients; however, additional refinement will be required prior to implementation of such an approach.
Financial disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
Key Concepts
Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to questions submitted by patients through Epic MyChart were graded as accurate and complete in 77% (220 out of 285) of cases, incomplete in 17% (49 out of 285) of cases, and unacceptable in 6% (16 out of 285) of cases, based on evaluation by two independent ophthalmologist reviewers with adjudication by a third subspecialist reviewer.
The overall mean Flesch-Kincaid reading grade level for Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) responses to ophthalmology patient questions was 12.1 ± 2.1, indicating a 12th-grade education level was required for understanding.
When Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) was primed to respond at a sixth-grade reading level to ophthalmology patient questions, 85% (242 out of 285) of responses were graded as accurate and complete, 13% (38 out of 285) were incomplete, and 2% (5 out of 285) were unacceptable, demonstrating that the large language model can generate responses at a lower reading level without compromising quality.
A retrospective cross-sectional study evaluated 285 queries asked by 165 patients related to glaucoma, retina, and cornea subspecialties via Epic MyChart, with each question submitted to Chat Generative Pre-Trained Transformer-4 Omni (ChatGPT-4o) twice (once without limitations and once primed for a sixth-grade reading level) and graded by two independent ophthalmologist reviewers.
Related Articles
Age-period-cohort analysis of the global burden of visual impairment according to major causes: an analysis of the Global Burden of Disease Study 2019. (Cohort Study)
Assessment of a Large Language Model's Responses to Questions and Cases About Glaucoma and Retina Management. (Cross-Sectional Study)
Evaluation of ChatGPT-4 responses to glaucoma patients' questions: Can artificial intelligence become a trusted advisor between doctor and patient? (Observational Study)
European Glaucoma Society research priorities for glaucoma care. (Observational Study)
An Analysis of ChatGPT4 to Respond to Glaucoma-Related Questions. (Cross-Sectional Study)