Adherence of Free-Tier Large Language Models to the 2024 European Society of Cardiology (ESC) Guidelines for the Management of Elevated Blood Pressure and Hypertension: A Comparative Study

Written on 27/03/2026
by Aleksander Polus

Cureus. 2026 Feb 23;18(2):e104111. doi: 10.7759/cureus.104111. eCollection 2026 Feb.

ABSTRACT

Background: Hypertension remains the leading modifiable risk factor for cardiovascular disease and premature death worldwide. In 2024, the European Society of Cardiology (ESC) released updated guidelines for the management of elevated blood pressure and hypertension. Concurrently, the integration of artificial intelligence into healthcare has accelerated, with large language models (LLMs) becoming accessible tools for information retrieval.

Objective: This study aims to evaluate and compare the accuracy and guideline adherence of three popular free-tier LLMs (ChatGPT-5.2, Gemini 3 Flash, and Claude 4.5 Sonnet) in responding to questions based strictly on the 2024 ESC Guidelines.

Methods: We conducted a comparative cross-sectional study in January 2026 to evaluate the performance of the three LLMs. The primary source of ground truth was the 2024 ESC Guidelines. A dataset of 40 questions was generated, covering key domains including diagnosis, treatment targets, lifestyle modifications, and comorbidities. Questions comprised both factual queries and clinical case scenarios. Responses were categorized by a qualified physician as correct, inaccurate, or incorrect, based strictly on the guidelines. Statistical analysis was performed using the Fisher-Freeman-Halton exact test to evaluate differences in performance.

Results: The overall accuracy across all models was high, with no statistically significant differences in performance observed (p>0.99). Claude 4.5 Sonnet achieved the highest numerical accuracy, providing correct responses to 33 out of 40 questions (82.5%). ChatGPT-5.2 and Gemini 3 Flash achieved identical correctness rates of 80.0% (32 out of 40 correct answers). A qualitative analysis revealed a distinct tendency toward overly aggressive management in complex clinical scenarios, suggesting a "safety bias" in which the models default to intensive intervention rather than nuanced guideline steps.
Conclusions: The evaluated free-tier LLMs demonstrated comparable, high proficiency in interpreting the 2024 ESC Guidelines. Despite this proficiency, the study identified a recurrent safety bias manifesting as a tendency toward over-medicalization. While these models are promising auxiliary tools for medical education, verification of AI-generated advice against the official guideline documents remains essential.
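The Results section compares per-model correctness counts (32/40, 32/40, and 33/40) with the Fisher-Freeman-Halton exact test. SciPy does not ship that exact test for tables larger than 2×2, so the sketch below approximates it with a Monte Carlo permutation test on the chi-square statistic over the correct/not-correct counts reported above; the simulation size and variable names are illustrative, not from the study.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Correct / not-correct counts per model, taken from the Results section.
table = np.array([
    [32, 8],   # ChatGPT-5.2
    [32, 8],   # Gemini 3 Flash
    [33, 7],   # Claude 4.5 Sonnet
])

# Observed chi-square statistic for the 3x2 table (no Yates correction).
chi2_obs = chi2_contingency(table, correction=False)[0]

# Rebuild per-question labels: which model answered, and the outcome
# (1 = correct, 0 = not correct), then permute outcomes under the null.
models = np.repeat(np.arange(3), table.sum(axis=1))
outcomes = np.concatenate([np.repeat([1, 0], row) for row in table])

rng = np.random.default_rng(0)
n_sim, hits = 10_000, 0
for _ in range(n_sim):
    perm = rng.permutation(outcomes)
    sim = np.array([[np.sum(perm[models == m] == c) for c in (1, 0)]
                    for m in range(3)])
    # Count permuted tables at least as extreme as the observed one.
    if chi2_contingency(sim, correction=False)[0] >= chi2_obs - 1e-12:
        hits += 1

p_mc = hits / n_sim
print(f"chi2 = {chi2_obs:.3f}, Monte Carlo p = {p_mc:.3f}")
```

With rows this similar the observed statistic is tiny, so the estimated p-value comes out near 1, consistent with the non-significant result (p>0.99) reported in the abstract.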

PMID:41890463 | PMC:PMC13015764 | DOI:10.7759/cureus.104111