
Background: Autoimmunity precedes the clinical onset of rheumatoid arthritis (RA), often by several years. Within this pre-arthritis stage, disease progression is not inevitable and clinical phenotype is not uniform, and individuals may follow distinct trajectories that are not fully understood. Artificial intelligence (AI), including machine learning (ML) techniques, offers a flexible and powerful tool for modelling the complex, non-linear interactions between biomarkers that define preclinical RA. Unlike traditional statistical models, AI approaches do not require predefined assumptions about the relationships between variables and can uncover latent patterns within high-dimensional, multimodal datasets [1]. A growing body of literature has demonstrated the potential of AI in RA diagnosis, prognosis, and treatment prediction [2]. However, most AI models to date have been developed using datasets from individuals with established RA rather than from individuals in the at-risk phase. Salehi et al. (2025) [3] applied supervised survival ML models to estimate individual risk and time to onset of RA in anti-CCP positive (CCP+) at-risk individuals, achieving an Uno’s C-index of 79.8%, suggesting promising predictive performance. However, these approaches optimise risk prediction without explicitly accounting for biological and clinical heterogeneity within the preclinical phase, potentially grouping individuals with different immunological or inflammatory profiles into similar risk categories. In contrast to supervised survival models, which directly optimise prediction of RA onset, unsupervised ML can first identify biologically and clinically distinct preclinical phenotypes, after which differences in progression risk and disease trajectories across these groups can be evaluated.
Objectives: To use unsupervised ML methods to identify phenotypes among anti-CCP positive individuals at risk of developing RA, and to investigate how pre-arthritis disease trajectories differ across these subgroups.
Methods: Baseline data from individuals recruited into the Leeds CCP study, a prospective cohort of anti-CCP positive (CCP+) individuals with musculoskeletal (MSK) symptoms but no clinical synovitis, were analysed. An unsupervised k-means clustering algorithm was applied to 17 demographic, clinical, serological, and imaging biomarkers from 451 individuals to identify phenotypic clusters. Statistical validation was carried out using the Wilcoxon rank-sum test to confirm distinct biomarker profiles between clusters. To assess the separability of the clusters, a Support Vector Machine (SVM) classifier was trained to predict the cluster membership of individual patients based on the same 17 biomarkers. Survival analysis was performed using Kaplan-Meier curves, which were compared between clusters using the log-rank test. Cox proportional hazards analysis was performed to assess the association between cluster assignment and progression to RA.
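The clustering step described above can be illustrated with a minimal sketch of k-means (Lloyd's algorithm) on standardised features. This is not the study's implementation: the function names are hypothetical, the toy rows (CRP in mg/L, tender joint count, joints with US synovitis) are invented values rather than study data, only three of the 17 biomarkers are represented, and a simple deterministic farthest-first initialisation is assumed in place of whatever initialisation the authors used.

```python
from statistics import mean, pstdev


def standardise(rows):
    """Z-score each column so that no single biomarker dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(rows, k):
    """Assumed initialisation: start at row 0, then add the farthest remaining row."""
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(
            max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        )
    return centroids


def kmeans(rows, k=2, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = farthest_first(rows, k)
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows
        ]
        if new_labels == labels:  # assignments unchanged, so converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [mean(col) for col in zip(*members)]
    return labels


# Hypothetical toy data: (CRP mg/L, tender joints, joints with US synovitis)
toy = [
    [4.0, 2.0, 0.0], [5.0, 3.0, 1.0], [4.5, 2.0, 0.0], [5.5, 2.0, 1.0],
    [12.0, 6.0, 3.0], [11.0, 5.0, 2.0], [13.0, 7.0, 3.0],
]
labels = kmeans(standardise(toy), k=2)
print(labels)
```

Standardising before clustering matters here because k-means relies on Euclidean distance, so biomarkers measured on larger scales would otherwise carry more weight.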
Results: K-means clustering identified two distinct phenotypes (Table 1). Cluster 1 (N = 411) had less systemic inflammation (blood CRP: 4.73 vs 6.35 mg/L, p = 0.014), less subclinical synovitis (US synovitis present in 0.41 vs 1.67 joints, p = 3.8×10⁻⁹), fewer tender joints (2.37 vs 3.43 joints, p = 0.023), lower HAQ scores (0.54 vs 0.78, p = 0.033) and lower anti-CCP levels (27.0 vs 30.6, p = 0.049) than cluster 2 (N = 44). The SVM classifier assigned patients to their respective clusters based on biomarker profiles with an accuracy of 99%. In cluster 1, 30.2% of patients progressed to RA compared with 54.5% in cluster 2. This difference was also evident in the Kaplan–Meier survival curves (Figure 1), with more rapid progression in cluster 2. The median time to progression was substantially longer in cluster 1 than in cluster 2 (536 vs 205 weeks). Cluster 2 was associated with a higher risk of progression to RA (HR = 2.11, p = 0.0008).
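As an illustration of how a median time to progression and a log-rank comparison of the kind reported above can be computed, the following sketch implements the Kaplan-Meier estimator and a two-group log-rank test in plain Python. The function names are hypothetical, the follow-up times (in weeks) and event indicators are invented toy values rather than study data, and the code is a sketch under these assumptions, not the analysis used in the study.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1 = progressed, 0 = censored)."""
    curve, surv = [], 1.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        progressed = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - progressed / at_risk
        curve.append((t, surv))
    return curve


def median_survival(curve):
    """First time at which S(t) falls to 0.5 or below; None if never reached."""
    return next((t for t, s in curve if s <= 0.5), None)


def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value on 1 df)."""
    all_times = times_a + times_b
    all_events = events_a + events_b
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({u for u, e in zip(all_times, all_events) if e}):
        n_a = sum(1 for u in times_a if u >= t)
        n_b = sum(1 for u in times_b if u >= t)
        n = n_a + n_b
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        d = d_a + sum(1 for u, e in zip(times_b, events_b) if u == t and e)
        obs_minus_exp += d_a - d * n_a / n  # observed minus expected events in group a
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / var
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p


# Hypothetical toy follow-up data in weeks (1 = progressed to RA, 0 = censored)
t1, e1 = [50, 120, 200, 260, 300, 300, 300], [1, 0, 1, 0, 0, 0, 0]
t2, e2 = [20, 40, 60, 90, 150], [1, 1, 1, 1, 0]

median_1 = median_survival(kaplan_meier(t1, e1))
median_2 = median_survival(kaplan_meier(t2, e2))
chi2, p_value = logrank(t1, e1, t2, e2)
print(median_1, median_2, round(chi2, 2), round(p_value, 3))
```

In this toy example the first group never falls to 50% progression-free survival, so its median is undefined (returned as None), which is a common situation when fewer than half of a group progress during follow-up.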
Conclusions: Two distinct preclinical phenotypes, with significantly different risks and trajectories of progression to RA, can be identified using unsupervised ML methods alone. Cluster membership was associated with inflammatory characteristics, antibody levels and time to disease onset, highlighting heterogeneity in disease trajectories. These findings suggest that AI may offer a promising approach for identifying clinically and biologically distinct phenotypic clusters in individuals at risk of RA, which may improve management strategies and risk stratification.
REFERENCES: [1] Rajula, H. S. R., Verlato, G., Manchia, M., Antonucci, N. and Fanos, V. 2020. Comparison of Conventional Statistical Methods with Machine Learning in Medicine: Diagnosis, Drug Development, and Treatment. Medicina 56(9), 455.
[2] Momtazmanesh, S., Nowroozi, A. and Rezaei, N. 2022. Artificial Intelligence in Rheumatoid Arthritis: Current Status and Future Perspectives: A State-of-the-Art Review. Rheumatology and Therapy 9(5), 1249.
[3] Salehi, F., Bayat, S., Schett, G., Kleyer, A., Altstidl, T. and Eskofier, B. M. 2025. ExSMART-PreRA: Explainable Survival and Risk Assessment Using Machine Learning for Time Estimation in Preclinical Rheumatoid Arthritis. IEEE Journal of Biomedical and Health Informatics PP.
Acknowledgments: NIL.
Disclosure of Interests: None declared.