ABS0912 (2025)
EVALUATING ChatGPT’s PERFORMANCE IN DIAGNOSING LOW BACK PAIN: A COMPARISON WITH CLINICIANS AND IMPACT OF PROMPTED SPECIALTIES
Keywords: Pain, Interdisciplinary research, Artificial Intelligence, Telemedicine, Digital health and measuring health
A. Nack1, X. Michelena2, P. Maymo Paituvi3, C. Calomarde-Gómez1, D. Lobo4, A. García-Alija5, R. Ugena-García1, M. Aparicio-Espinar1, P. Vidal-Montal3, D. Benavent3
1Germans Trias i Pujol University Hospital, Rheumatology, Badalona, Spain
2Servei Català de la Salut and Rheumatology Research Group, Vall d’Hebron Hospital Campus, Digitalization for the Sustainability of the Healthcare System (DS3), Barcelona, Spain
3Bellvitge University Hospital, University of Barcelona, Rheumatology, Hospitalet de Llobregat, Spain
4Doctor Josep Trueta University Hospital, Rheumatology, Girona, Spain
5La Santa Creu i Sant Pau University Hospital, Rheumatology, Barcelona, Spain

Background: Low back pain (LBP) is a multifactorial condition managed by various specialists besides rheumatologists. Artificial intelligence (AI) chatbots, such as ChatGPT, may support clinicians in identifying probable diagnoses. However, because semantic variations in query prompts can influence chatbot outputs, we hypothesized that ChatGPT’s diagnostic responses could vary depending on the specialty indicated in the prompt.


Objectives: We aimed to assess whether ChatGPT’s responses differ when simulating different medical specialties in the evaluation of LBP, and to compare the diagnostic accuracy of ChatGPT with that of clinicians.


Methods: A total of 10 clinical cases related to LBP were included from official public examinations for rheumatologists in Spain, designed to assess expertise for permanent specialist positions. These comprised 5 cases of rheumatologic disease and 5 of other causes of LBP. The exercise was conducted in December 2024 using ChatGPT-4o. Ten clinicians with at least 5 years of experience in managing rheumatic and musculoskeletal diseases (RMDs) participated in the study. Each case was first answered independently by the clinicians; at a later stage, each participant queried ChatGPT while prompting it to simulate five specialties (Rheumatology, Neurology, Internal Medicine, Rehabilitation, and Orthopedics). The gold standard was the official diagnosis listed for each exam question. Diagnostic performance was evaluated using precision (percentage of cases in which the top diagnosis matched the gold standard) and sensitivity (percentage of cases in which the gold standard was included in the top three probable diagnoses). The time taken to answer all 10 clinical cases was recorded both for clinicians working alone and when using ChatGPT, starting when the case was reviewed and stopping when three differential diagnoses and the most probable diagnosis were finalized. Statistical significance of differences was assessed with parametric tests (t-test or ANOVA) or non-parametric tests (Wilcoxon rank-sum and Kruskal-Wallis tests), depending on the data distribution.
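
As an illustration of the diagnostic metrics described above, the following Python sketch shows how per-participant precision and sensitivity could be computed; the data structures, diagnosis labels, and helper names are hypothetical and are not taken from the study.

# Illustrative sketch, not the authors' code: per-participant precision and
# sensitivity as defined in the Methods. All data below are placeholders.

# Ranked top-three diagnoses per (participant, case), after mapping free-text
# answers to standardized diagnostic categories.
responses = {
    ("clinician_01", "case_01"): ["axial spondyloarthritis", "mechanical low back pain", "vertebral fracture"],
    # ... remaining participant/case pairs
}

# Official exam diagnosis per case (the gold standard); placeholder values.
gold = {"case_01": "axial spondyloarthritis"}

def diagnostic_scores(participant, cases):
    """Return (precision %, sensitivity %) for one participant over the given cases."""
    top_match = sum(responses[(participant, c)][0] == gold[c] for c in cases)  # top diagnosis correct
    in_top3 = sum(gold[c] in responses[(participant, c)][:3] for c in cases)   # gold standard in top three
    n = len(cases)
    return 100 * top_match / n, 100 * in_top3 / n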


Results: In total, 528 free-text diagnoses were generated and standardized into 39 diagnostic categories. The percentage of correct answers for each participant and each prompted specialty is illustrated in Figure 1. Median precision ranged from 70% to 80% across the five specialties simulated by ChatGPT, and median sensitivity ranged from 80% to 90%. Statistical analysis revealed no significant differences in precision (p=0.80) or sensitivity (p=0.68) between the specialties simulated by ChatGPT, indicating consistent performance regardless of the prompted specialty. For clinicians, the median precision was 60% and the median sensitivity was 80%. Compared with clinicians, ChatGPT had significantly higher diagnostic precision (median 75% vs. 60%, p<0.001) and significantly higher sensitivity (median 85% vs. 80%, p=0.02). The mean time taken by participants to complete the task was 12.35±5.62 minutes, compared with 2.33±0.03 minutes when using ChatGPT (p<0.01).
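
For context, the following is a minimal sketch, assuming SciPy, of how the reported group comparisons could be run; the per-participant values are placeholders rather than the study data, and the tests actually applied depended on the data distribution.

# Hedged sketch, not the authors' analysis script. All values are placeholders.
from scipy import stats

clinician_precision = [60, 50, 70, 60, 60, 70, 50, 60, 70, 60]  # hypothetical per-clinician precision (%)
chatgpt_precision = [80, 70, 80, 70, 80, 90, 70, 80, 70, 80]    # hypothetical ChatGPT precision (%)

# Wilcoxon rank-sum test for the ChatGPT vs. clinician comparison.
stat, p_value = stats.ranksums(chatgpt_precision, clinician_precision)

# Kruskal-Wallis test for precision across the five prompted specialties.
by_specialty = {
    "Rheumatology": [80, 70, 80], "Neurology": [70, 80, 70],           # hypothetical
    "Internal Medicine": [80, 80, 70], "Rehabilitation": [70, 70, 80],
    "Orthopedics": [80, 70, 80],
}
h_stat, p_specialty = stats.kruskal(*by_specialty.values())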


Conclusion: ChatGPT provides consistent diagnostic performance across simulated specialties, unaffected by the prompt’s semantic framing. Furthermore, ChatGPT may outperform clinicians in both diagnostic precision and sensitivity, highlighting its potential as a valuable complementary tool for generating fast, accurate and comprehensive differential diagnoses in cases of low back pain. Further research is needed to explore its application in clinical practice and its ability to enhance diagnostic workflows.


REFERENCES: NIL.


Acknowledgements: NIL.


Disclosure of Interests: Annika Nack: AbbVie, Janssen, MSD, Amgen, Xabier Michelena: None declared, Pol Maymo Paituvi: None declared, Cristina Calomarde-Gómez: None declared, David Lobo: None declared, Asier García-Alija: None declared, Raquel Ugena-García: None declared, María Aparicio-Espinar: None declared, Paola Vidal-Montal: None declared, Diego Benavent: Lilly, Galapagos, UCB, Janssen, MSD and Novartis; works part-time at Savana Research.

© The Authors 2025. This abstract is an open access article published in Annals of the Rheumatic Diseases under the CC BY-NC-ND license ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ). Neither EULAR nor the publisher makes any representation as to the accuracy of the content. The authors are solely responsible for the content of their abstract, including the accuracy of facts, statements, results, conclusions, cited resources, etc.


DOI: annrheumdis-2025-eular.B3476
Citation: Annals of the Rheumatic Diseases, volume 84, supplement 1, year 2025, page 1806
Session: Other topics (Publication Only)