BMC Oral Health. 2026 May 16. doi: 10.1186/s12903-026-08581-3. Online ahead of print.
ABSTRACT
BACKGROUND: Artificial intelligence (AI) chatbots have increasing applications in healthcare; however, their accuracy and readability across different languages remain unclear. Therefore, this study aimed to compare the performance of ChatGPT-5 and DeepSeek-V3 on bruxism-related questions in English and Turkish.
METHODS: Responses generated by ChatGPT-5 and DeepSeek-V3 to 20 questions were evaluated in Turkish and English, yielding four chatbot-language conditions. Accuracy was independently scored by two prosthodontists using a Modified Global Quality Score, and inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Readability was defined as the ease of reading and was assessed using the Flesch Reading Ease (FRE) scale for English texts and the Ateşman readability formula for Turkish texts. Descriptive statistics were calculated, differences in accuracy across conditions were analyzed using the Friedman test, and the relationship between accuracy and readability was examined using Spearman's rank correlation coefficient.
RESULTS: Accuracy scores of responses generated by ChatGPT-5 and DeepSeek-V3 were generally high in both Turkish and English. No statistically significant differences in accuracy were observed among the four model-language combinations based on the Friedman test (χ² (3) = 3.204, p = 0.361). Inter-rater reliability analysis demonstrated moderate agreement for single measures and good agreement for average measures between the two evaluators (ICC = 0.623 and 0.767, respectively). Comparison of Turkish (Ateşman) and English (FRE) readability scores revealed no statistically significant differences between the two chatbots, as confirmed by Mann-Whitney U tests (Turkish: p = 0.646; English: p = 0.745).
CONCLUSION: ChatGPT-5 and DeepSeek-V3 demonstrated comparable accuracy in bruxism-related responses in both Turkish and English. However, the high reading level of the responses limits patient accessibility, and no association was found between accuracy and readability, indicating that simplification is necessary for effective patient education.
PMID:42143341 | DOI:10.1186/s12903-026-08581-3

