Evaluation of Glaucoma Treatment Information on Social Media Using Large Language Models.
Asha Bulusu, Paul R Cotran, Amer M Alwreikat, Ying Jiang, Michael Lee Cooper, Kathryn Moynihan Ramsey, Ashwin P Verghese, David J Ramsey
Abstract
PRÉCIS
This study investigates the accuracy, readability, utility, and educational value of glaucoma treatment content on social media platforms and explores how large language models assess the quality of social media posts compared with glaucoma experts.
PURPOSE
To assess the quality of information on glaucoma treatment available on social media platforms.
METHODS
A 30-question survey consisting of the "top posts" from three social media platforms (X, Instagram, and Reddit) was assessed by 5 board-certified glaucoma experts across four domains (readability, utility, educational value, and accuracy) using a 5-point Likert scale. The overall quality of each post was calculated as the average of the median scores assigned to each of the four domains to create a reference standard. Expert agreement was assessed using Kendall's coefficient of concordance (W). A large language model (LLM), GPT-4 (OpenAI), was then prompted to evaluate the same posts with identical instructions. Agreement with expert consensus was measured using Cohen weighted kappa (κ), and the difference in favorability of each post was assessed using the McNemar exact test.
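As an illustration of two of the calculations described above, the following minimal Python sketch computes a per-post overall quality score (the mean of the per-domain median expert ratings) and a two-sided exact McNemar P value from discordant pair counts. It is not the authors' code: the function names, the domain keys, and the example ratings are hypothetical, and the threshold used to label a post "favorable" is not stated in the abstract, so the sketch takes the discordant counts as given.

```python
from math import comb
from statistics import mean, median

# Hypothetical keys for the four rated domains named in the abstract.
DOMAINS = ("readability", "utility", "educational_value", "accuracy")


def overall_quality(ratings):
    """Mean of the per-domain median Likert scores for one post.

    `ratings` maps each domain name to the list of expert scores (1-5).
    """
    return mean(median(ratings[d]) for d in DOMAINS)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value.

    `b` and `c` are the counts of discordant pairs (for example, posts
    rated favorable by one rater but unfavorable by the other, and the
    reverse). Under the null hypothesis each discordant pair falls in
    either cell with probability 0.5, so the test is a binomial test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up ratings from 5 raters for a single post (illustration only).
example_post = {
    "readability": [4, 4, 5, 3, 4],        # median 4
    "utility": [3, 2, 3, 4, 3],            # median 3
    "educational_value": [2, 3, 3, 2, 4],  # median 3
    "accuracy": [4, 5, 4, 4, 3],           # median 4
}

if __name__ == "__main__":
    print(overall_quality(example_post))
    print(mcnemar_exact(7, 0))
```

Using the median within each domain limits the influence of a single outlying rater before the four domains are averaged. The exact form of the McNemar test is the usual choice when the number of discordant pairs is small, as it would be with 30 posts.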
RESULTS
Fewer than half of social media posts on glaucoma treatment were judged favorably by glaucoma experts (40%). GPT-4 was less critical of social media content and provided a favorable rating nearly twice as often (77%, P=0.017). Despite this difference, there was moderate agreement between the LLM and the glaucoma experts (κ=0.421, P=0.005). The lack of agreement predominantly stemmed from cases where the experts rated the content unfavorably, with disagreement occurring in 56% of such cases, compared with 0% when the experts deemed the content favorable (P=0.005).
CONCLUSIONS
Although glaucoma experts and artificial intelligence (AI)-based systems were in moderate agreement when evaluating the quality of posts, the LLM was less able to discriminate posts of low quality.