Background: Rheumatology has experienced notable changes over the last two decades, fueled by the advent of new therapeutic strategies (e.g., treat-to-target), the introduction of novel drugs such as biologic agents and Janus kinase inhibitors, and the emergence of key clinical concepts like the window of opportunity, arthralgia suspicious for progression, or difficult-to-treat rheumatoid arthritis. Concurrently, other approaches—statistical learning methods, gene therapy, telemedicine, and precision medicine—have gained relevance. Considering the exponential increase in the volume of publications, leveraging advanced natural language processing (NLP) techniques, including state-of-the-art topic modeling methods, is essential to efficiently uncover dominant themes, trends, and research gaps in rheumatology.
Objectives: To apply BERTopic, a transformer-based topic modeling technique, to a large corpus of rheumatology-specific literature to identify central themes, characterize historical and emerging research trends, and highlight shifts in thematic foci over a 23-year period.
Methods: We analyzed 96,004 abstracts published between 2000 and December 31, 2023, from 34 specialized rheumatology journals indexed in PubMed and classified by the Journal Citation Reports. BERTopic, a novel topic modeling approach that considers semantic relationships among words and their context, was used to uncover topics. Thirty BERTopic models were trained, varying sentence embeddings, dimensionality reduction seeds, and minimum cluster sizes. The final model selection was guided by the number of topics generated, the proportion of outliers (i.e., abstracts not classified into any topic), and the topic coherence score (u_mass). Two rheumatologists manually labeled and interpreted the final set of topics. Temporal trends were quantified by computing yearly mean topic probabilities and applying linear regression to identify “hot” (positive slopes) and “cold” (negative slopes) topics.
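The trend step described above can be illustrated with a minimal sketch. It assumes that, for a single topic, each abstract has a publication year and a topic probability; the function names (`yearly_mean_probabilities`, `ols_slope`, `classify_topic`) and the toy values are hypothetical and are not taken from the study. The slope is an ordinary least-squares fit of yearly mean probability against year, which may differ in detail from the regression the authors ran.

```python
from collections import defaultdict


def yearly_mean_probabilities(years, probs):
    """Average one topic's per-abstract probabilities within each publication year."""
    buckets = defaultdict(list)
    for year, prob in zip(years, probs):
        buckets[year].append(prob)
    # Sort by year so the regression input is in chronological order.
    return {year: sum(vals) / len(vals) for year, vals in sorted(buckets.items())}


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct years to fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def classify_topic(years, probs):
    """Label a topic 'hot' (positive slope), 'cold' (negative slope), or 'flat'."""
    means = yearly_mean_probabilities(years, probs)
    slope = ols_slope(list(means.keys()), list(means.values()))
    if slope > 0:
        return "hot", slope
    if slope < 0:
        return "cold", slope
    return "flat", slope


if __name__ == "__main__":
    # Toy data (assumed, not from the study): two abstracts per year.
    toy_years = [2000, 2000, 2001, 2001, 2002, 2002]
    toy_probs = [0.10, 0.20, 0.20, 0.30, 0.30, 0.40]
    print(classify_topic(toy_years, toy_probs))
```

In practice one would likely also check the statistical significance of each slope before labeling a topic, since a sign test alone treats very small slopes as trends.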
Results: The selected BERTopic model yielded 47 topics with a u_mass score of -0.279 and identified 22,628 outliers (23.56%). Key disease areas, such as rheumatoid arthritis, systemic lupus erythematosus, and osteoarthritis, consistently appeared as the most frequently studied. Dynamic topic modeling successfully detected both expected (e.g., COVID-19) and emerging themes (e.g., spinal surgery, bone fractures), alongside established but now declining research areas (e.g., antiphospholipid syndrome, septic arthritis). Figure 1 illustrates the temporal evolution of selected topics, and Figure 2 presents hot and cold topics, underscoring ongoing changes in rheumatology’s research landscape.
Conclusion: Our study used advanced natural language processing techniques to analyze the rheumatology research landscape and identify key themes and emerging trends. The results highlight the dynamic and varied nature of rheumatology research, illustrating how interest in certain topics has shifted over time. As the number of scientific publications increases, natural language processing techniques will be necessary to analyze and synthesize information efficiently, helping to identify trends, gaps, and emerging areas of interest across medical fields.
Figure 1. Evolution of the identified topics over time. B: Basic science. C: Clinical science.
Figure 2. Bar chart of hot and cold topics. B: Basic science. C: Clinical science.
Acknowledgements: NIL.
Disclosure of Interests: None declared.
© The Authors 2025. This abstract is an open access article published in Annals of Rheumatic Diseases under the CC BY-NC-ND license (