Assessing the Accuracy and Precision of Artificial Intelligence for Diabetes Mellitus and Hypertension Management

Posted on 26/06/2026
by Abdullah Al Hamid

J Clin Med. 2026 Jun 7;15(12):4419. doi: 10.3390/jcm15124419.

ABSTRACT

Background/Objectives: Diabetes mellitus and hypertension are major chronic conditions that markedly affect patients' health and quality of life worldwide. With the rapid development of technology, there has been growing interest in the potential role of artificial intelligence (AI) in managing these diseases. This study aims to assess the accuracy and reliability of AI tools in providing information for diabetes mellitus and hypertension management. Methods: This study assessed the accuracy and reliability of the information provided by six major AI tools: ChatGPT, Gemini, Poe, Claude, Consensus, and Perplexity. Twenty questions essential to the management of diabetes mellitus and hypertension were constructed from the chapters of the respective guidelines and submitted to the AI tools. The answers were compared with evidence-based treatment guidelines, such as those from the American Diabetes Association (ADA), the American Heart Association (AHA), the European Society of Cardiology (ESC), and the National Institute for Health and Care Excellence (NICE), and were classified as "accurate", "inaccurate", or "accurate with missing information". Three rounds of evaluation were conducted at six-week intervals to assess accuracy and reliability and, by comparing answers across rounds, to evaluate whether the tools' data had been updated. Results: In round 1, ChatGPT and Poe showed the highest accuracy, both at 65% (95% CI: 41.0-83.7), followed by Claude at 60% (95% CI: 41.0-83.7). ChatGPT had the lowest inaccuracy rate at 5% (95% CI: 1.75-33.1), while Claude had the smallest percentage of responses with missing information, at only 6% (95% CI: 12.8-54.3). In round 2, Claude markedly outperformed all other tools, achieving an accuracy rate of 95% (95% CI: 73.0-99.7) with no responses with missing information (0%).
In round 3, ChatGPT came second with 70% (95% CI: 45.7-87.2) accuracy and maintained the lowest inaccuracy rate of 5% (95% CI: 0.26-26.9). Consensus had the highest inaccuracy rate at 40% (95% CI: 20.0-63.6) and the lowest accuracy rate at 40% (95% CI: 20.0-63.6). Overall, pairwise comparisons showed that Claude's round 2 accuracy was significantly higher than that of Poe (p = 0.0154), Gemini (p = 0.0421), Consensus (p = 0.0035), and Perplexity (p = 0.0302). In the assessment of performance shift from round 1 to round 2, Claude achieved the greatest improvement in accuracy at 40%. From round 2 to round 3, Poe improved the most with an accuracy increase of 25%, followed by ChatGPT with 20%. For all AI tools, McNemar's test comparing unprompted and guideline-prompted questions did not reveal a statistically significant difference in the proportion of accurate responses (p > 0.05). Conclusions: Throughout the three rounds, ChatGPT maintained the best performance, with the fewest responses missing information. Claude and Poe followed, showing high accuracy with relatively low inaccuracy rates. Perplexity and Gemini performed moderately, while Consensus had the lowest accuracy.
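
The interval and test statistics quoted in the abstract can be illustrated with a short Python sketch. The abstract does not say how the 95% confidence intervals were computed or which variant of McNemar's test was used, so both choices below are assumptions: the Wilson score interval with continuity correction is used because, for 20 questions, it comes within roughly 0.1 percentage point of several of the reported bounds (for example 65%, 95%, and 40%), and the exact binomial form of McNemar's test is used because it suits small samples. The function names and the discordant-pair counts in the example are hypothetical and are not taken from the study.

```python
from math import comb, sqrt
from statistics import NormalDist


def wilson_cc_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval with continuity correction for a proportion.

    Assumed method: the abstract does not name the interval it reports.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    p = successes / n
    q = 1 - p
    denom = 2 * (n + z * z)
    # By convention the bound is pinned at 0 or 1 when the proportion is 0 or 1.
    if successes == 0:
        lower = 0.0
    else:
        root = sqrt(z * z - 2 - 1 / n + 4 * p * (n * q + 1))
        lower = max(0.0, (2 * n * p + z * z - 1 - z * root) / denom)
    if successes == n:
        upper = 1.0
    else:
        root = sqrt(z * z + 2 - 1 / n + 4 * p * (n * q - 1))
        upper = min(1.0, (2 * n * p + z * z + 1 + z * root) / denom)
    return lower, upper


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: questions accurate only when unprompted (hypothetical labelling)
    c: questions accurate only when guideline-prompted (hypothetical labelling)
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 13 of 20 accurate answers corresponds to the 65% accuracy figure.
low, high = wilson_cc_interval(13, 20)
print(f"65% accuracy, n=20: {low:.3f} to {high:.3f}")

# Hypothetical discordant counts, for illustration only.
print(f"McNemar exact p-value: {mcnemar_exact_p(1, 5):.4f}")
```

With only 20 questions per tool, each answer shifts a rate by 5 percentage points and the intervals span roughly 40 points, which helps explain why only the largest differences between tools reach statistical significance.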

PMID:42355586 | PMC:PMC13300920 | DOI:10.3390/jcm15124419