EULAR Abstract Archive

POS0765 (2026)

PERFORMANCE OF LARGE LANGUAGE MODELS IN DIAGNOSING RARE ENDEMIC AUTOINFLAMMATORY DISEASES

Keywords: Artificial Intelligence, Descriptive Studies

N. M. Drzeniek^1,2, F. Reis³, S. D. Boie³, F. Balzer³, A. Pankow¹, D. Simon^1,4,5, G. Krönke^1,4,5, A. Kleyer^1,4,5

¹Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Department of Rheumatology and Clinical Immunology, Berlin, Germany
²BIH Center for Regenerative Therapies (BCRT), Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
³Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Informatics, Berlin, Germany
⁴German Rheumatism Research Centre (DRFZ) Berlin, a Leibniz Institute, Berlin, Germany
⁵Fraunhofer Institute for Translational Medicine and Pharmacology, Allergology and Immunology, Berlin, Germany

Background: The diagnosis and classification of rare autoinflammatory diseases (AID) are challenging. Large language models (LLMs) have the potential to support earlier and more accurate diagnostic decision making and are increasingly explored for differential diagnosis [1]. However, it remains unclear how ethnicity-related cues, often implicitly treated as homogeneous or disease-defining, influence LLM-generated diagnostic outputs, particularly when comparing endemic versus non-endemic rare diseases.

Objectives: To (i) investigate whether LLMs of different parameter sizes can accurately predict the diagnosis of AID, such as adult-onset Still’s disease (AOSD), TNF-receptor-associated periodic fever syndrome (TRAPS), familial Mediterranean fever (FMF) or Behçet’s disease (BD), based on the first-presentation anamnesis recorded in electronic health records, and (ii) assess whether the inclusion of ethnicity-related information improves or confounds LLM-based diagnostic performance in endemic conditions.

Methods: In this exploratory study, we constructed clinical case vignettes based on the first presentations of patients newly diagnosed with FMF, BD, or non-endemic autoinflammatory diseases at the EULAR Centre of Excellence, Charité-Universitätsmedizin Berlin. Cases of systemic lupus erythematosus (SLE) were included as a comparator representing a more common rheumatic disease. To evaluate the performance of models suitable for use within hospital infrastructure, six open-weights LLMs deployed locally (Charité Cluster) were prompted to generate their top three differential diagnoses, both with and without disclosure of patient ethnicity. Diagnostic accuracy was quantified as Top-3 accuracy (“How often does the correct diagnosis appear in the top 3 suggestions?”) or mean reciprocal rank (MRR) and benchmarked against the final diagnosis confirmed by a rheumatologist.
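The two metrics named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the exact string matching of diagnosis labels, and the toy cases are all assumptions made for illustration, and the abstract does not state how model output was matched to the confirmed diagnosis.

```python
from typing import List, Sequence, Tuple

# One case = (ranked list of suggested diagnoses, confirmed diagnosis).
Case = Tuple[Sequence[str], str]


def top_k_hit(ranked: Sequence[str], truth: str, k: int = 3) -> bool:
    """True if the confirmed diagnosis is among the first k suggestions.

    Assumption: labels are compared by exact string equality.
    """
    return truth in ranked[:k]


def reciprocal_rank(ranked: Sequence[str], truth: str) -> float:
    """1/rank of the confirmed diagnosis, or 0.0 if it was never suggested."""
    for position, diagnosis in enumerate(ranked, start=1):
        if diagnosis == truth:
            return 1.0 / position
    return 0.0


def evaluate(cases: List[Case]) -> Tuple[float, float]:
    """Return (Top-3 accuracy, mean reciprocal rank) over all cases."""
    n = len(cases)
    top3 = sum(top_k_hit(ranked, truth) for ranked, truth in cases) / n
    mrr = sum(reciprocal_rank(ranked, truth) for ranked, truth in cases) / n
    return top3, mrr


# Invented toy data for illustration only; these are NOT study results.
toy_cases: List[Case] = [
    (["FMF", "AOSD", "TRAPS"], "FMF"),   # correct at rank 1
    (["AOSD", "BD", "SLE"], "BD"),       # correct at rank 2
    (["SLE", "AOSD", "FMF"], "TRAPS"),   # correct diagnosis missing
    (["AOSD", "SLE", "BD"], "BD"),       # correct at rank 3
]

top3, mrr = evaluate(toy_cases)
print(f"Top-3 accuracy: {top3:.2f}, MRR: {mrr:.3f}")
```

Top-3 accuracy counts any hit among the three suggestions equally, whereas MRR also rewards placing the correct diagnosis higher in the list, so the two can diverge for the same model.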

Results: 62 case vignettes of 31 patients were included. Across 1,116 AI-generated diagnoses (three for each combination of vignette and LLM), all models performed well in the diagnosis of SLE control cases, with Top-3 accuracy of ≥60% for all models (100% for MedGemma-4B, 60% for Mistral-7B, and 80% for MedGemma-27B and DeepSeek-R1-32B/-70B/671B). However, accuracy in rare diseases largely tracked model size, with 671B-parameter DeepSeek-V3 performing best. The specialized medical LLM MedGemma with 27B parameters broke this trend and performed better than several non-specialized models with larger parameter size (Figure 1). Removing ethnic information from cases of FMF or BD only modestly influenced accuracy. However, adding artificial ethnic information to vignettes of clinically similar but non-endemic autoinflammatory diagnoses, such as AOSD, markedly increased false-positive FMF/BD suggestions in the largest model, 671B-parameter DeepSeek-V3 (Figure 2).

Conclusions: Diagnostic performance varied substantially across models, with model capacity emerging as a key determinant of accuracy in the differentiation of rare diseases. Ethnicity-related information may act as a double-edged sword, potentially enhancing recognition of endemic conditions while also increasing the risk of diagnostic misattribution. These findings underscore the need for bias-aware evaluation and careful handling of demographic information in LLM-assisted clinical diagnostics.

Figure 1.

Diagnostic accuracy (Top-3 accuracy) of six LLMs by patient cohorts (from left to right): n = 9 patients with FMF or BD; n = 2 AOSD, n = 1 TRAPS, n = 1 PFAPA; n = 9 AOSD; n = 2 FMF, n = 2 BD; control cohort n = 5 SLE. Values shown as mean. An ethnic background endemic for FMF or BD was either mentioned in the anamnesis (endemic background) or the ethnic background was not specified (non-endemic background).

Figure 2.

Impact of manipulating ethnic information on LLM accuracy in diagnosing rare endemic diseases (FMF and BD combined). a) Patient cohort structure and distribution of diagnoses (n = number of patients). b) Impact of removing given ethnic information (cohort A+B). c) Impact of adding artificial ethnic information (cohort C+D).

REFERENCES: [1] Kremer et al. EULAR Rheumatology Open 2025.

Acknowledgments: NIL.

Disclosure of Interests: Norman Michael Drzeniek: None declared; Florian Reis reports having an employment contract with Pfizer outside the submitted work; Sebastian-Daniel Boie reports having an employment contract with Pfizer outside the submitted work; Felix Balzer: None declared; Anne Pankow: None declared; David Simon received speaker honoraria from AbbVie, Amgen, Alfasigma, Bristol-Myers Squibb, Janssen-Cilag, Lilly, Novartis, and UCB, and has served on scientific advisory boards for AbbVie, Bristol-Myers Squibb, Gilead Sciences, Janssen-Cilag, Lilly, Novartis, and UCB; Gerhard Krönke: None declared; Arnd Kleyer: UCB, AbbVie, Lilly, Novartis, Lilly, Celgene, Medac, Novartis.

DOI: annrheumdis-2026-eular.B.4000

Citation: , volume 85, supplement 1, year 2026, page s899

Session: Poster View III (Poster View)