Comparing the Accuracy of Four Artificial Intelligence Models in PubMed Citation Generation for Glaucoma Research.
Summary
DeepSeek, a biomedically enriched AI model, generated glaucoma PubMed citations most accurately and outperformed the general-purpose models. However, every model produced errors, underscoring the need for human verification to prevent citation inaccuracies.
Abstract
PRÉCIS
DeepSeek, a biomedically enriched AI model, achieved the highest accuracy in generating PubMed citations for glaucoma research, outperforming general-purpose models and highlighting the necessity of human oversight to mitigate AI-related citation errors.
PURPOSE
This study evaluated the accuracy and reliability of four artificial intelligence (AI) models, ChatGPT (OpenAI GPT-3.5), Copilot (GitHub/Microsoft), DeepSeek (DeepSeek AI), and Gemini (Google AI), in generating PubMed citations for glaucoma research. The aim was to assess the potential of AI tools for academic reference generation and to identify their limitations, particularly in specialized fields of ophthalmology.
METHODS
Thirty-five standardized clinical paragraphs from The Review of Ophthalmology (4th edition) were used to test citation accuracy. Each model was instructed to generate PubMed citations formatted in AMA Manual of Style (11th edition) style. Citations were evaluated for accuracy, DOI matching, and clinical relevance. Expert reviewers validated the outputs and classified each as "Fully Cited," "Partially Cited," or "Not Cited."
RESULTS
DeepSeek, a biomedically enriched model, outperformed the others with an accuracy of 92.0%. Copilot reached 66.7% and Gemini 25.8%, while ChatGPT had the lowest citation accuracy at 19.4%. Frequent errors included DOI mismatches, incorrect journal names, and irrelevant references. Expert review confirmed that even the best-performing model produced citation errors, underscoring the need for human oversight. We interpret DeepSeek's apparent advantage cautiously, because model versions, updates, and changes in underlying training data may influence performance.
CONCLUSION
AI models, particularly biomedically enriched tools such as DeepSeek, can accelerate citation drafting, but citation hallucinations and metadata errors remain common. AI should serve as a decision-support tool for reference retrieval and formatting, not as a substitute for rigorous manual verification before submission.