EULAR Abstract Archive

POS0154 (2026)

EULAR AI-DRIVEN SCREENING: EVALUATING ASReview FOR SYSTEMATIC LITERATURE REVIEWS INFORMING EULAR RECOMMENDATIONS AND POINTS-TO-CONSIDER

Keywords: Quality of care, Artificial Intelligence, Systematic review

L. Bischof¹, J. J. Teijema², E. Westerbeek², G. E. Fragoulis³,⁴, U. Kiltz⁵,⁶, L. Könemann⁷, R. van de Schoot², S. Ramiro⁸, P. Bosch¹

¹Medical University of Graz, Department of Rheumatology and Immunology, Graz, Austria
²Faculty of Social and Behavioral Sciences, Utrecht University, Department of Methodology and Statistics, Utrecht, Netherlands
³Joint Academic Rheumatology Program, National and Kapodistrian University of Athens, First Department of Propaedeutic and Internal Medicine, Athens, Greece
⁴Institute of Infection, Immunity and Inflammation, University of Glasgow, Glasgow, United Kingdom
⁵Rheumazentrum Ruhrgebiet, Marien Hospital, Herne, Germany
⁶Ruhr-University Bochum, Department of Rheumatology, Bochum, Germany
⁷European Alliance of Associations for Rheumatology, EULAR IT Office, Zürich, Switzerland
⁸Leiden University Medical Center, Leiden, and Zuyderland Medical Center, Department of Rheumatology, Heerlen, Netherlands

Background: Screening titles and abstracts is a resource-intensive phase of systematic literature reviews (SLRs). As the volume of medical literature expands, it becomes more difficult to design search strategies that balance inclusivity and feasibility. Active learning–based machine learning models can substantially reduce the screening burden by iteratively prioritizing records for human review, allowing the reviewer to stop screening once the relevant records have been identified [1]. Although such models are increasingly applied, their use in SLRs for medical recommendation development is not yet universal [2,3]. Available machine learning tools lack validated stopping rules, specific to recommendation development within rheumatology, that ensure sufficiently high sensitivity (“recall”) for relevant studies.

Objectives: To develop stopping rules for active learning–assisted abstract screening, capable of achieving at least 95% recall, primarily for SLRs informing EULAR recommendations and points to consider.

Methods: Simulation studies were conducted using SLR datasets derived from completed and ongoing EULAR projects on recommendations or points to consider. Datasets were obtained from project steering committees (all projects potentially eligible, i.e. starting in 2004) and curated to ensure consistency and completeness. Eligible datasets were randomly split into simulation and validation subsets in a 90:10 ratio. The balance between the simulation and validation subsets was assessed using the standardized mean difference (SMD) of the ratio of relevant records. ASReview was chosen as the active learning–assisted screening platform because it is open-source. Three of its active learning models (h3, u3, u4) were evaluated across all datasets using ASReview’s workflow automation tool “Makita”. Previous studies have shown these models to be widely applicable and a suitable starting point for simulation studies [4,5]. Model performance was compared using a loss metric based on normalized recall regret, a measure quantifying the difference between observed recall and an ideal benchmark. Time-to-event analyses and Kaplan–Meier curves were used to estimate the proportion of articles that needed to be screened to reach 95% recall in at least 95% of all available datasets. These cut-offs were set to obtain conservative stopping rules that minimize the risk of missing relevant studies. Furthermore, thresholds for the longest observed sequence of consecutively irrelevant records before reaching 95% recall (“minimum safe streaks”) were calculated. From these, stopping rules were derived based on (i) the proportion of records screened, (ii) the length of irrelevant-record sequences, and (iii) combinations of both. The resulting stopping criteria were subsequently tested in the independent validation subset.
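The two per-dataset quantities described above can be illustrated with a minimal Python sketch. It is not the authors' code or part of ASReview; the function names `screening_metrics` and `coverage_threshold` are hypothetical. The sketch assumes that each simulated run is available as a list of labels in screening order (1 = relevant, 0 = irrelevant), that the streak is expressed as a proportion of all records, and that every run eventually reaches the recall target. Under that last assumption there is no censoring, so a Kaplan–Meier estimate coincides with the empirical distribution and the cut-off covering 95% of datasets can be read from the sorted values.

```python
import math
from typing import Sequence


def screening_metrics(labels: Sequence[int], target_recall: float = 0.95) -> tuple[float, float]:
    """Return (proportion of records screened, longest irrelevant streak as a
    proportion of all records) at the point where target recall is first reached.

    labels: 1 = relevant, 0 = irrelevant, in the order the records were screened.
    """
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("dataset contains no relevant records")
    # Number of relevant records needed; the small tolerance guards against
    # floating-point products such as 18.999999999 being rounded up wrongly.
    needed = math.ceil(target_recall * total_relevant - 1e-9)
    found = streak = longest = 0
    for position, label in enumerate(labels, start=1):
        if label:
            found += 1
            streak = 0
            if found >= needed:
                return position / len(labels), longest / len(labels)
        else:
            streak += 1
            longest = max(longest, streak)
    raise RuntimeError("target recall was never reached")


def coverage_threshold(values: Sequence[float], coverage: float = 0.95) -> float:
    """Smallest observed value that is not exceeded by at least `coverage` of the datasets."""
    if not values:
        raise ValueError("no values given")
    ordered = sorted(values)
    index = math.ceil(coverage * len(ordered) - 1e-9) - 1
    return ordered[index]


# Toy example: 10 records, 4 of them relevant.
example = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]
print(screening_metrics(example))  # (0.8, 0.3)
```

Applying `coverage_threshold` to the per-dataset results of `screening_metrics` would give one conservative cut-off per metric, in the spirit of the 95%-of-datasets criterion.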

Results: Forty-five datasets (40 simulation, five validation) were analyzed [6]. Of these, 37 informed EULAR recommendations, six points to consider, one classification criteria, and one criteria for disease activity and treatment response. Most SLRs addressed disease management (n = 28), followed by imaging (n = 6), research (n = 3), and diagnosis and classification (n = 3); one review each addressed remote care, rheumatology education, work participation, disease monitoring criteria, and self-management. The simulation and validation subsets were well balanced (SMD <0.02). Overall, the u4 model showed the most favorable balance between performance and computational efficiency, while h3 showed similar results with higher computational demands. To achieve the target of 95% sensitivity using a single metric, model u4 required screening 34.8% of total references or reaching a minimum safe streak of 13.9%; model h3 required screening 33.7% or reaching a minimum safe streak of 18.8%. A combined stopping strategy required screening 26.0% of the dataset with a minimum safe streak of 9.8% for model u4, and screening 28.0% of records with a minimum safe streak of 2.9% for model h3 (Table 1). Figure 1 shows the proportion of screened records and the minimum safe streak for 95% recall for each individual dataset. The empirical stopping rules were confirmed in the independent validation subset, achieving a sensitivity of 95.6% against the predefined 95% recall target while satisfying all rule criteria.
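As an illustration of the balance check reported above, the following Python sketch computes a standardized mean difference between two groups of per-dataset values, such as the ratio of relevant records in the simulation and validation subsets. The abstract does not state which SMD formula was used; this sketch assumes the common form with the pooled standard deviation taken as the square root of the average of the two sample variances, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def standardized_mean_difference(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Absolute difference in means divided by the pooled standard deviation.

    Assumes each group has at least two values (needed for a sample variance).
    """
    pooled_sd = sqrt((variance(group_a) + variance(group_b)) / 2)
    if pooled_sd == 0:
        # Both groups are constant; report no difference only if the means agree.
        return 0.0 if mean(group_a) == mean(group_b) else float("inf")
    return abs(mean(group_a) - mean(group_b)) / pooled_sd


# Made-up values for illustration only, not data from the study.
print(standardized_mean_difference([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # 1.0
```

Values close to zero indicate that the two subsets have similar average proportions of relevant records.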

Conclusions: Active learning–assisted screening with empirically derived stopping rules can achieve a high inclusion rate of relevant records in EULAR SLRs, while reducing the reviewer's screening effort to roughly one quarter to one third of total references. Further validation in a larger number of datasets is desirable prior to implementation.

Table 1.

Minimal observed metric values corresponding to 95% recall in 95% of datasets.

Model	Proportion of records screened (%)	Minimum safe streak (% irrelevant)	Combined thresholds (%)
u4	34.8	13.9	26.0 screened + 9.8 streak
h3	33.7	18.8	28.0 screened + 2.9 streak
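To show how a combined threshold from Table 1 might be applied during screening, here is a minimal Python sketch. It rests on two assumptions that the abstract does not confirm: that both conditions must hold at the same time, and that the streak is expressed as a percentage of all records in the dataset. The function name `stopping_position` is hypothetical and not part of ASReview.

```python
from typing import Optional, Sequence


def stopping_position(
    labels: Sequence[int],
    total_records: int,
    min_screened: float,
    min_streak: float,
) -> Optional[int]:
    """Return the 1-based position at which screening may stop, or None.

    labels: reviewer decisions in screening order (1 = relevant, 0 = irrelevant).
    min_screened, min_streak: thresholds as proportions of total_records.
    Assumption: screening stops once BOTH thresholds are met simultaneously.
    """
    streak = 0
    for position, label in enumerate(labels, start=1):
        streak = 0 if label else streak + 1
        screened_enough = position >= min_screened * total_records
        streak_long_enough = streak >= min_streak * total_records
        if screened_enough and streak_long_enough:
            return position
    return None


# Toy run with the u4 combined thresholds from Table 1 (26.0% screened, 9.8% streak):
# 100 records, the last relevant one found at position 10.
decisions = [1] * 5 + [0] * 3 + [1] * 2 + [0] * 90
print(stopping_position(decisions, total_records=100, min_screened=0.26, min_streak=0.098))  # 26
```

In this toy run the streak condition is already met at record 20, but screening continues until 26% of the records have been seen.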

REFERENCES: [1] van de Schoot R et al. Nat Mach Intell. 2021;3:125–133.

[2] Harmsen W et al. Syst Rev. 2024;13:177.

[3] Federatie Medisch Specialisten. AI in richtlijnontwikkeling. 2026.

[4] Teijema JJ et al. Int J Data Sci Anal. 2025.

[5] de Bruin J et al. SSRN 5136987 (2025).

[6] Bischof L. OSF (2025); Teijema JJ. DataverseNL (2025); Teijema JJ. PhD thesis, Utrecht Univ.

Simulation materials: DataverseNL: Teijema, J. J. (2025). doi: 10.34894/YSXMVV.

Acknowledgments: NIL.

Disclosure of Interests: Lea Bischof: Alfasigma; Jelle Jasper Teijema: None declared; Emily Westerbeek: None declared; George E. Fragoulis: Eli Lilly, AbbVie, Pfizer, Boehringer Ingelheim, Novartis, UCB, MSD, Johnson & Johnson, and AENORASIS; Uta Kiltz: AbbVie, UCB, Novartis, and the Rheumaakademie; U. Kiltz received speaker/consultancy fees from AbbVie, Amgen, Biocad, Biogen, Chugai, Eli Lilly, Fresenius, Gilead, Grünenthal, GSK, Janssen, MSD, Novartis, Pfizer, Roche, and UCB; Lewin Könemann: None declared; Rens van de Schoot: None declared; Sofia Ramiro: AbbVie, Eli Lilly, Galapagos/Alfasigma, Janssen, MSD, Novartis, Pfizer, UCB; Philipp Bosch: AbbVie, Novartis, UCB, Johnson & Johnson, Pfizer.

DOI: annrheumdis-2026-eular.B.4314

Citation: , volume 85, supplement 1, year 2026, page s430

Session: Clinical Poster Tours: An RMD pick-n-mix (Poster Tours)

version:	1.02