J Glaucoma
March 2026, Journal Article

Evaluation of Glaucoma Treatment Information on Social Media Using Large Language Models.

Artificial Intelligence

Abstract

PRCIS

This study investigates the accuracy, readability, utility, and educational value of glaucoma treatment content on social media platforms and explores how large language models assess the quality of social media posts compared with glaucoma experts.

PURPOSE

To assess the quality of information on glaucoma treatment available on social media platforms.

METHODS

A 30-question survey consisting of the "top posts" from three social media platforms (X, Instagram, and Reddit) was assessed by 5 board-certified glaucoma experts across four domains (readability, utility, educational value, and accuracy) using a 5-point Likert scale. To create a reference standard, the overall quality of each post was calculated as the average of the median scores assigned to each of the four domains. Expert agreement was assessed using Kendall's coefficient of concordance (W). A large language model (LLM), GPT-4 (OpenAI), was then prompted to evaluate the same posts with identical instructions. Agreement with expert consensus was assessed using Cohen weighted kappa (κ), and the difference in favorability of each post was assessed using the McNemar exact test.
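The scoring and agreement statistics described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, the linear weighting scheme for kappa, and all example ratings and counts are assumptions chosen for illustration, and Kendall's W is omitted for brevity.

```python
from math import comb
from statistics import mean, median


def reference_score(ratings):
    """Overall quality of one post: mean of the per-domain median expert scores.

    `ratings` maps a domain name to the list of expert Likert scores (1-5).
    """
    return mean(median(scores) for scores in ratings.values())


def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Cohen's weighted kappa for two raters over ordered categories.

    The weighting scheme is an assumption; the abstract does not state
    whether linear or quadratic weights were used.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions of (rater_a, rater_b) category pairs.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(rater_a, rater_b):
        observed[index[x]][index[y]] += 1 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    num = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    den = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    if den == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1 - num / den


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical ratings from 5 experts for a single post (not study data).
example_post = {
    "readability": [4, 4, 5, 3, 4],
    "utility": [3, 3, 2, 4, 3],
    "educational value": [2, 3, 3, 3, 4],
    "accuracy": [5, 4, 4, 4, 3],
}
print(reference_score(example_post))

# Hypothetical favorable (1) / unfavorable (0) calls from two raters.
print(weighted_kappa([1, 1, 0, 0], [1, 0, 0, 0], categories=[0, 1]))

# Hypothetical discordant-pair counts (not study data).
print(mcnemar_exact(9, 2))
```

With only two categories, the weighted kappa reduces to the ordinary Cohen kappa, so the same function covers both binary favorability calls and ordinal ratings.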

RESULTS

Fewer than half of social media posts on glaucoma treatment were judged favorably by glaucoma experts (40%). GPT-4 was less critical of social media content and provided a favorable rating nearly twice as often (77%, P = 0.017). Despite this difference, there was moderate agreement between the LLM and the glaucoma experts (κ = 0.421, P = 0.005). The lack of agreement stemmed predominantly from cases in which the experts rated the content unfavorably, with disagreement occurring in 56% of those cases, compared with 0% when the experts deemed the content favorable (P = 0.005).

CONCLUSIONS

Although glaucoma experts and artificial intelligence (AI)-based systems were in moderate agreement when evaluating the quality of posts, the LLM was less able to discriminate posts of low quality.

Keywords

artificial intelligence (AI); glaucoma; health literacy; large language models; social media
