Transl Vis Sci Technol | January 2026 | Journal Article

Assessing the Accuracy of Artificial Intelligence-Generated Clinical Summaries From Ambulatory Glaucoma Subspecialty Clinical Encounters.

Authors

Yapei Zhang, Min Shi, In Young Chung, Daniel L Liebman, Laura E Barna, Louis R Pasquale, David S Friedman, Michael V Boland, Lucy Q Shen, Mengyu Wang

Abstract

PURPOSE

The purpose of this study was to evaluate the accuracy of the large language model (LLM) LLaMA 2-70B in summarizing glaucoma clinic notes into patient-friendly language and generating educational material.

METHODS

A random sample of 147 clinic notes from unique patients who visited the Glaucoma Service at a tertiary center was analyzed. LLaMA 2 generated paragraph and bullet-point summaries covering five topics: (1) glaucoma diagnosis and type, (2) disease progression, (3) treatment plan, (4) treatment changes, and (5) surgical/laser interventions. Two ophthalmologists reviewed the responses for accuracy and categorized each as "correct," "partially correct," or "incorrect." Discrepancies were adjudicated by a glaucoma specialist. A comparison using identical prompts was performed with ChatGPT-4 on a subset of notes (n = 50).

RESULTS

LLaMA 2 correctly summarized 97 notes (66%) in paragraph format and 103 (70%) in bullet-point format. Another 44 (30%) and 41 (28%) were partially correct, respectively. Paragraph summaries were more accurate and complete for glaucoma suspects than for diagnosed patients (82% vs. 53%, P < 0.001). For targeted clinical questions, LLaMA 2 accurately identified glaucoma diagnosis in 118 notes (80%), disease stability/progression in 129 (88%), treatment plans in 127 (87%), treatment changes in 134 (91%), and surgical/laser interventions in 124 (84%). On the 50-note subset, ChatGPT-4 achieved 46% correct paragraph summaries and 50% correct bullet-point summaries, with accuracies of 96%, 88%, 64%, 78%, and 82% on the five targeted questions, listed in the same order as above.
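
The summary-format percentages can be recomputed from the reported counts. The following is a minimal sketch, not the authors' analysis: the variable and function names are illustrative, and the "incorrect" count is not reported in the abstract. It is derived here as the remainder, assuming the three rating categories described in the methods are exhaustive.

```python
# Counts reported in the abstract for the 147 LLaMA 2 summaries.
TOTAL_NOTES = 147
reported = {
    "paragraph": {"correct": 97, "partially_correct": 44},
    "bullet": {"correct": 103, "partially_correct": 41},
}


def summarize(counts, total=TOTAL_NOTES):
    """Return (count, rounded percent) per rating category.

    ASSUMPTION: 'incorrect' is not given in the abstract; it is derived as
    the remainder, treating the three categories as exhaustive.
    """
    incorrect = total - counts["correct"] - counts["partially_correct"]
    full = {**counts, "incorrect": incorrect}
    return {name: (n, round(100 * n / total)) for name, n in full.items()}


results = {fmt: summarize(counts) for fmt, counts in reported.items()}

for fmt, breakdown in results.items():
    print(fmt, breakdown)
```

Under that assumption, the remainder is 6 notes (about 4%) for paragraph summaries and 3 notes (about 2%) for bullet-point summaries.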

CONCLUSIONS

Although LLaMA 2 is not yet reliable as a standalone clinical tool, it shows promise to improve clinical communication.

TRANSLATIONAL RELEVANCE

LLMs may enhance patient experience and health literacy by standardizing patient-friendly language in clinical care.