
Background: Delayed diagnosis of inflammatory rheumatic diseases remains a major challenge in clinical practice and is associated with worse long-term outcomes, increased disability, and higher healthcare costs. Early disease stages are often characterised by non-specific musculoskeletal symptoms, fluctuating inflammatory markers, and incomplete expression of disease-defining features. As a result, many patients initially access the healthcare system through primary care or non-rheumatology hospital services, where they undergo repeated consultations, laboratory testing, imaging studies, or evaluation of extra-articular manifestations before being referred to a rheumatologist.
These early healthcare encounters generate structured information within electronic health records (EHRs), creating digital clinical “footprints” that may contain meaningful early signals of inflammatory rheumatic disease. However, such data are rarely integrated or systematically analysed to support early diagnostic orientation or referral prioritisation. European research strategies increasingly emphasise the use of routinely collected real-world data and integrative modelling approaches to improve early diagnosis and patient stratification. In this context, machine learning techniques applied to EHR data offer an opportunity to identify latent diagnostic patterns and support earlier access to specialist care.
Objectives: To develop and evaluate a multi-label machine learning model capable of identifying inflammatory rheumatic diseases based on EHR-derived clinical footprints generated during non-rheumatology healthcare encounters, and to assess the clinical coherence and interpretability of associations between recorded variables and subsequent rheumatologic diagnoses.
Methods: We conducted a retrospective cohort study using de-identified EHR data from a community-based hospital and its associated primary care network, serving a stable and geographically defined catchment population of approximately 200,000 inhabitants. The dataset included 24,143 patients with healthcare encounters recorded between January 2020 and September 2024 outside rheumatology clinics.
Patients were identified by the presence of structured EHR traces suggestive of inflammatory arthritis, including musculoskeletal symptoms (polyarthritis, arthralgia, inflammatory low back pain), laboratory test requests that included immunological markers or acute-phase reactants, axial or peripheral radiographic imaging, and/or documentation of extra-articular manifestations such as inflammatory bowel disease or uveitis.
These digital clinical footprints were retrospectively linked to rheumatologic diagnoses established later by a rheumatologist. Seven inflammatory rheumatic diseases defined by ICD-10 codes were modelled: psoriatic arthropathy (L40.5), seropositive rheumatoid arthritis (M05), other/seronegative rheumatoid arthritis (M06), systemic lupus erythematosus (M32), systemic sclerosis (M34), axial spondyloarthritis (M45), and myositis (M60.9).
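As an illustration of how the seven ICD-10-defined diagnoses could be turned into a multi-label target, the following minimal sketch encodes a patient's diagnosis codes as a 0/1 vector. The helper name `encode_labels` and the use of simple prefix matching are assumptions made for illustration; the abstract does not describe the actual encoding procedure.

```python
# Target ICD-10 codes as listed in the Methods, in a fixed order.
TARGET_PREFIXES = ["L40.5", "M05", "M06", "M32", "M34", "M45", "M60.9"]


def encode_labels(icd10_codes):
    """Return one 0/1 entry per target disease (hypothetical helper).

    Assumption: a diagnosis counts for a label when any recorded code
    starts with the target prefix (e.g. "M05.79" counts as M05).
    """
    return [
        int(any(code.upper().startswith(prefix) for code in icd10_codes))
        for prefix in TARGET_PREFIXES
    ]


# A patient with two overlapping diagnoses gets two positive labels.
example = encode_labels(["M05.79", "L40.5"])
```

Because each patient receives a vector rather than a single class, overlapping diagnoses are represented without forcing a choice between them.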
A multi-label classification framework was applied to reflect diagnostic overlap and real-world clinical complexity. Tree-based ensemble models (XGBoost, Random Forest, Gradient Boosting) were trained with diagnosis-specific class weighting to address class imbalance, together with probability calibration and per-label threshold optimisation based on F1-score maximisation. Model performance was assessed using exact-match accuracy, Hamming loss, and micro- and macro-averaged F1-scores. To enhance clinical interpretability, Pearson correlation analyses were conducted between numerical clinical and laboratory variables and diagnostic labels.
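The per-label threshold search and the multi-label metrics named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the threshold grid (0.05 to 0.95 in steps of 0.05), and the toy inputs are all hypothetical, and in practice thresholds would normally be tuned on a validation split separate from the data used for final evaluation.

```python
def f1(y_true, y_pred):
    """Binary F1 from 0/1 lists; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probs, grid=None):
    """Pick the cut-off on calibrated probabilities that maximises F1 for one label."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]  # assumed grid
    best_t, best_f = 0.5, -1.0
    for t in grid:
        score = f1(y_true, [int(p >= t) for p in probs])
        if score > best_f:  # keep the lowest threshold on ties
            best_t, best_f = t, score
    return best_t, best_f


def exact_match_accuracy(Y_true, Y_pred):
    """Share of patients whose full label vector is predicted correctly."""
    return sum(1 for t, p in zip(Y_true, Y_pred) if t == p) / len(Y_true)


def hamming_loss(Y_true, Y_pred):
    """Share of individual patient-label cells that are wrong."""
    wrong = sum(a != b for t, p in zip(Y_true, Y_pred) for a, b in zip(t, p))
    return wrong / (len(Y_true) * len(Y_true[0]))


def micro_macro_f1(Y_true, Y_pred):
    """Micro-F1 pools all cells; macro-F1 averages the per-label F1 scores."""
    n_labels = len(Y_true[0])
    per_label = [
        f1([row[j] for row in Y_true], [row[j] for row in Y_pred])
        for j in range(n_labels)
    ]
    micro = f1([v for row in Y_true for v in row],
               [v for row in Y_pred for v in row])
    return micro, sum(per_label) / n_labels
```

Macro-F1 weights every diagnosis equally, so it is more sensitive than micro-F1 to weak performance on rare labels. Exact-match accuracy and Hamming loss, by contrast, are dominated by the many patients with no target diagnosis.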
Results: Among the 24,143 included patients, 7.5% were subsequently diagnosed with at least one of the target inflammatory rheumatic diseases by a rheumatologist. The most frequent diagnoses were other/seronegative rheumatoid arthritis (n=770), axial spondyloarthritis (n=506), and psoriatic arthropathy (n=497), while connective tissue diseases and myositis were less prevalent.
Across the evaluated models, XGBoost demonstrated the best overall performance, achieving an exact-match accuracy of 0.98, a Hamming loss of 0.003, a micro-F1 score of 0.86, and a macro-F1 score of 0.80. Diagnostic performance was consistently high for prevalent inflammatory conditions, with F1-scores of 0.90 for axial spondyloarthritis, 0.89 for seropositive rheumatoid arthritis, 0.87 for psoriatic arthropathy, and 0.83 for other/seronegative rheumatoid arthritis. Performance for systemic lupus erythematosus was also robust (F1-score 0.82).
Lower and more variable performance was observed for rare diseases, particularly systemic sclerosis, reflecting limited case numbers rather than systematic model failure. Correlation analyses revealed clinically coherent patterns, with rheumatoid arthritis diagnoses strongly associated with anti-CCP positivity and rheumatoid factor abnormalities, psoriatic arthropathy associated with psoriasis-related variables, and axial spondyloarthritis associated with extra-articular manifestations such as uveitis and inflammatory bowel disease. Several negative correlations involved “not tested” indicators, reflecting test-ordering behaviour and missingness patterns rather than biological protection.
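To illustrate why a “not tested” indicator can correlate negatively with a diagnosis, the sketch below computes a Pearson coefficient on a small synthetic example. The data are invented purely for illustration and are not taken from the study. The variable names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # Undefined when either variable is constant.
    return cov / (sx * sy) if sx and sy else float("nan")


# Synthetic toy data (NOT study data): 1 = anti-CCP was never requested.
anti_ccp_not_tested = [1, 1, 1, 0, 0, 0, 1, 0]
# 1 = later diagnosed with rheumatoid arthritis.
ra_diagnosis = [0, 0, 0, 1, 1, 0, 0, 1]

r = pearson(anti_ccp_not_tested, ra_diagnosis)  # negative in this toy example
```

In this toy example the coefficient is negative simply because the diagnosed patients are the ones for whom the test was ordered, which mirrors the test-ordering interpretation given above.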
Conclusions: Digital clinical footprints generated during non-rheumatology healthcare encounters contain meaningful early signals of inflammatory rheumatic disease. Machine learning models applied to routinely collected EHR data can accurately and simultaneously identify multiple inflammatory rheumatic diseases while preserving clinical interpretability. This real-world, population-based approach supports earlier selection and prioritisation of patients for rheumatology assessment, with the potential to shorten diagnostic pathways, reduce unnecessary patient circulation across specialties, and align routine clinical practice with contemporary European strategies for early diagnosis and precision medicine.
Figure: Age distribution.
Figure: Distribution of ICD-10 diagnoses.
REFERENCES: NIL.
Acknowledgments: NIL.
Disclosure of Interests: None declared.