
Background: There is a shortage of rheumatologists in Europe, Asia, and the Americas, creating a risk that patients with severe ANA-related connective tissue diseases (ARCTD) may not receive timely rheumatology care. The antinuclear-antibody (ANA) screening test can help clinicians identify and refer patients with signs and symptoms of connective tissue diseases. However, this test lacks specificity, and 20% of healthy individuals test positive without developing rheumatic diseases. The test is often ordered in the setting of nonspecific clinical presentations, which may result in avoidable referrals and increased demand for rheumatology services, thereby prolonging wait times at rheumatology practices. To ensure that individuals with positive ANAs are appropriately triaged and receive rheumatology care promptly, we leveraged machine learning (ML) methods to identify and risk-stratify these individuals according to their likelihood of having ARCTD upon first referral to rheumatology.
Objectives: We developed ML models using electronic health record (EHR) data to identify ANA-positive (ANA+) individuals likely to have ARCTD. We also examined how much each variable contributed to the model's predictions to gain insight into clinical risk factors.
Methods: We collected clinical variables from the EHR of ANA+ individuals referred for rheumatology evaluation from 2013 to 2024 at a tertiary care center in the United States of America. The ARCTD of interest included Systemic Lupus Erythematosus, Sjogren’s Disease, Immune-mediated Myositis, and Scleroderma. Clinical variables were collected at the time of the first rheumatology visit or ±3 months from the first visit, whichever came first. Patients’ ultimate diagnoses were also obtained from the EHR. Board-certified rheumatologists on our team reviewed prior literature and determined a priori the most appropriate variables to collect. To retain the most robust variables in the model, we included only those significantly associated with the outcome of interest (development of ARCTD). We performed chi-square tests for categorical variables and Kruskal-Wallis tests for continuous variables; variables with a p-value < 0.05 were included for model development (Table 1), except for one variable (C-reactive protein values) that was included based on expert opinion. Candidate variables were used to train predictive models with two distinct ML algorithms, Random Forest and XGBoost. Data from 80% of individuals were randomly selected as the training set, while the remaining 20% served as the test set. The classification models were assessed with the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. To interpret the relative importance of variables in the model’s predictions, we used SHAP (SHapley Additive exPlanations) scores based on the best-performing model, which was the XGBoost model.
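As an illustration of the variable screening and the 80/20 split described in the Methods, the following is a minimal sketch using only the Python standard library. It is not the study's code: the abstract does not state which software was used, the feature names and counts in `toy_tables` are hypothetical, and the helper names `split_indices` and `chi_square_2x2` are our own. The chi-square test shown is the Pearson statistic for a 2x2 table without continuity correction, which covers only binary variables; continuous variables would need a Kruskal-Wallis test from a statistics library.

```python
import math
import random


def split_indices(n, train_frac=0.8, seed=0):
    """Shuffle range(n) and split it into train and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)  # floor of the training fraction
    return idx[:n_train], idx[n_train:]


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, NOT study data. Rows: feature present / absent.
# Columns: ARCTD / no ARCTD.
toy_tables = {
    "feature_a": (30, 70, 10, 90),
    "feature_b": (20, 80, 18, 82),
}

# Keep only variables whose association with the outcome has p < 0.05.
selected = [
    name for name, table in toy_tables.items()
    if chi_square_2x2(*table)[1] < 0.05
]

# Split a cohort of the size reported in the abstract (2,304 patients).
train_idx, test_idx = split_indices(2304)

print(selected, len(train_idx), len(test_idx))
# prints: ['feature_a'] 1843 461
```

With a cohort of 2,304, flooring 80% gives 1,843 training and 461 test patients, matching the split sizes reported in the Results. A variable retained on expert opinion, such as C-reactive protein in this study, would simply be appended to `selected` regardless of its p-value.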
Results: A total of 2,304 ANA+ patients were included in the analysis; among them, 1,357 (58.9%) had ARCTD, while 947 (41.1%) did not. The cohort was randomly split into a training set (80%; 1,843 patients) and a test set (20%; 461 patients). Characteristics of patients in the training and test sets were comparable (Table 1). Female, middle-aged, and privately insured patients each made up more than half of the cohort. At their first rheumatology visit, more than half of patients reported fatigue and weakness, and about 40% reported rashes. ANA titers in the cohort ranged from 1:320 to 1:1280. Of the two ML approaches, the XGBoost model had the higher AUC for identifying ANA+ individuals with ARCTD, with an AUC of 92.4%, a specificity of 88%, and a sensitivity of 84%. The Random Forest model had an AUC of 91%, a specificity of 80%, and a sensitivity of 83% (Figure 1a). The SHAP analysis based on the XGBoost model demonstrated that the features contributing significantly to increased risk of ARCTD at the first point of rheumatology referral were older age, ANA titer of at least 1:320, presence of Raynaud’s phenomenon, and rashes (Figure 1b).
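For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, and AUC can be computed from test-set labels and predicted scores. It is an illustration only: the labels and scores are toy values rather than study data, the 0.5 decision threshold is an assumption (the abstract does not report the threshold used), and the function names are our own. The AUC is computed in its rank form, as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = ARCTD, 0 = no ARCTD) and model scores, NOT study data.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

# Assumed decision threshold of 0.5 for turning scores into predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(sens, spec, auc)
# prints: 0.75 0.75 0.9375
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking performance across all thresholds, which is why a model can be compared on AUC independently of its operating point.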
Conclusions: We developed an ML model that was highly specific, sensitive, and accurate in identifying ARCTD among ANA+ individuals upon first referral in a diverse population in the United States. Our model has the potential to help clinicians risk-stratify ANA+ patients referred for ARCTD, thereby enabling earlier intervention, timely access to treatment, closer monitoring, and more personalized care planning. Our next step is to incorporate unstructured variables from the clinical notes of our study cohort to further improve the model's ability to identify ANA+ individuals who would go on to develop ARCTD. In addition, we will validate the model on an external dataset.
Acknowledgments: NIL.
Disclosure of Interests: Eugenia Chock: None declared, Yang Ren: None declared, Michelle Bernabeo: None declared, Mei Xue Dong: None declared, David T. Felson: Merck, AposHealth, Hua Xu: None declared, Na Hong: None declared.