Assessing the Accuracy of Artificial Intelligence-Generated Clinical Summaries From Ambulatory Glaucoma Subspecialty Clinical Encounters.
Yapei Zhang, Min Shi, In Young Chung, Daniel L Liebman, Laura E Barna, Louis R Pasquale, David S Friedman, Michael V Boland, Lucy Q Shen, Mengyu Wang
Abstract
PURPOSE
The purpose of this study was to evaluate the accuracy of the large language model (LLM) LLaMA 2-70B in summarizing glaucoma clinic notes into patient-friendly language and generating educational material.
METHODS
A random sample of 147 clinic notes from unique patients who visited the Glaucoma Service at a tertiary center was analyzed. LLaMA 2 generated paragraph and bullet-point summaries covering five subjects: (1) glaucoma diagnosis and type, (2) disease progression, (3) treatment plan, (4) treatment changes, and (5) surgical/laser interventions. Two ophthalmologists reviewed the responses for accuracy and categorized each as "correct," "partially correct," or "incorrect." Discrepancies were adjudicated by a glaucoma specialist. A comparison using identical prompts was performed with ChatGPT-4 on a subset of notes (n = 50).
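The two-reviewer grading with adjudication described above can be sketched as follows. This is only an illustration of the workflow, not the study's actual procedure or code: the names `final_grade` and `tally` are hypothetical, and the sketch assumes that the adjudicator's grade is used only when the two reviewers disagree.

```python
from collections import Counter
from typing import Iterable, Optional

# The three categories named in the Methods.
GRADES = ("correct", "partially correct", "incorrect")


def final_grade(reviewer_a: str, reviewer_b: str,
                adjudicator: Optional[str] = None) -> str:
    """Return the agreed grade, or the adjudicator's grade on disagreement.

    Assumption (hypothetical): the adjudicator is consulted only when the
    two reviewers' grades differ.
    """
    for grade in (reviewer_a, reviewer_b):
        if grade not in GRADES:
            raise ValueError(f"unknown grade: {grade!r}")
    if reviewer_a == reviewer_b:
        return reviewer_a
    if adjudicator is None:
        raise ValueError("reviewers disagree; an adjudicated grade is required")
    if adjudicator not in GRADES:
        raise ValueError(f"unknown grade: {adjudicator!r}")
    return adjudicator


def tally(final_grades: Iterable[str]) -> dict:
    """Count final grades per category, including categories with zero notes."""
    counts = Counter(final_grades)
    return {grade: counts.get(grade, 0) for grade in GRADES}
```

A note graded "correct" by one reviewer and "incorrect" by the other would take whatever grade the adjudicator assigns, while a note on which both reviewers agree never reaches the adjudicator.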
RESULTS
LLaMA 2 correctly summarized 97 notes (66%) in paragraph and 103 (70%) in bullet format. Another 44 (30%) and 41 (28%) were partially correct, respectively. Paragraph summaries were more accurate and complete for glaucoma suspects than for diagnosed patients (82% vs. 53%, P < 0.001). For targeted clinical questions, LLaMA 2 accurately identified glaucoma diagnosis in 118 notes (80%), disease stability/progression in 129 (88%), treatment plans in 127 (87%), treatment changes in 134 (91%), and surgical/laser interventions in 124 (84%). On the comparison subset, ChatGPT-4 achieved 46% correct paragraph summaries, 50% correct bullet summaries, and accuracies of 96%, 88%, 64%, 78%, and 82% on the same five targeted questions, respectively.
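The overall summary percentages follow directly from the reported counts over 147 notes. The sketch below reproduces that arithmetic. It assumes that every note received exactly one of the three grades, so the "incorrect" count is taken as the remainder; that remainder is not stated in the abstract. The helper names `pct` and `breakdown` are hypothetical.

```python
TOTAL_NOTES = 147  # sample size reported in the Methods

# Counts reported in the Results for the overall summaries.
reported = {
    "paragraph": {"correct": 97, "partially correct": 44},
    "bullet": {"correct": 103, "partially correct": 41},
}


def pct(count: int, total: int = TOTAL_NOTES) -> int:
    """Percentage of `total`, rounded to the nearest whole number."""
    return round(100 * count / total)


def breakdown(counts: dict, total: int = TOTAL_NOTES) -> dict:
    """Map each grade to (count, percent).

    Assumption: each note has exactly one grade, so the 'incorrect'
    count is whatever remains after the reported categories.
    """
    full = dict(counts)
    full["incorrect"] = total - sum(counts.values())
    return {grade: (n, pct(n, total)) for grade, n in full.items()}


for summary_format, counts in reported.items():
    print(summary_format, breakdown(counts))
```

Under that assumption, the remainder would be 6 paragraph summaries (4%) and 3 bullet-point summaries (2%) graded incorrect.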
CONCLUSIONS
Although LLaMA 2 is not yet reliable as a standalone clinical tool, it shows promise to improve clinical communication.
TRANSLATIONAL RELEVANCE
LLMs may enhance patient experience and health literacy by standardizing patient-friendly language in clinical care.