Background: Sarcoidosis is a heterogeneous granulomatous disease characterized by a wide range of clinical manifestations stemming from multiple organ involvement. While clustering techniques offer a robust method for uncovering these patterns, traditional approaches may fail to fully capture the complexity of multisystem diseases like sarcoidosis. Leveraging generative artificial intelligence (AI) offers a unique opportunity to improve data analysis and interpretation in complex systemic settings, providing novel insights into multifaceted disease patterns and guiding both hypothesis generation and clinical decision-making.
Objectives: This study aimed to identify distinct clusters of organ involvement in patients with sarcoidosis, assess their corresponding epidemiological characteristics, and evaluate the feasibility and advantages of AI-driven methods for handling complex multisystem data in this heterogeneous disease.
Methods: We conducted an AI-assisted analysis to identify organ-involvement clusters in a dataset of 2,187 anonymized patients with sarcoidosis from the Spanish National Registry SarcoGEAS, all fulfilling the 1999 ATS/ERS/WASOG criteria. Organ involvement was determined retrospectively for each patient at the time of diagnosis using the 2014 WASOG organ assessment instrument. Clustering was performed with the k-means algorithm in Python’s scikit-learn library (version 1.0.2). The optimal number of clusters was determined using the elbow method, with silhouette scores used to evaluate cluster quality. Differences between clusters were characterized with ANOVA, Kruskal-Wallis, and chi-square tests (with exact tests for low-frequency data), and significance was set at p < 0.05. The analysis was conducted in a secure computational environment using generative AI (OpenAI’s GPT-4 model) and Python (version 3.9), with pandas (1.4.3) for data manipulation, numpy (1.21.5) for numerical computations, and matplotlib (3.5.1) and seaborn (0.11.2) for visualizations. Data processing and analysis workflows adhered to GDPR standards to ensure patient privacy. All patient data were anonymized prior to analysis, and no identifiable information was accessed at any point. Code modularity and reproducibility were prioritized, with all scripts managed in version control systems (e.g., Git) to enable transparency.
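The cluster-number selection described in the Methods can be illustrated with a minimal sketch. It is not the study's code: the data are synthetic, and the helper names (make_synthetic_involvement, scan_k), the six-organ layout, and the range of k values are assumptions made only for illustration. It fits k-means for several values of k and records the inertia (for the elbow plot) and the silhouette score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def make_synthetic_involvement(n_patients=300, n_organs=6, seed=0):
    """Build a hypothetical binary organ-involvement matrix.

    Rows are patients, columns are organs (1 = involved, 0 = not involved).
    This is synthetic data, not registry data.
    """
    rng = np.random.default_rng(seed)
    # Sparse background involvement (about 10% per organ)...
    X = (rng.random((n_patients, n_organs)) < 0.1).astype(float)
    # ...plus one dominant organ per patient, so that groups exist to be found.
    dominant = rng.integers(0, n_organs, size=n_patients)
    X[np.arange(n_patients), dominant] = 1.0
    return X


def scan_k(X, k_values=range(2, 9), seed=0):
    """Fit k-means for each k and collect inertia and silhouette score."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        results[k] = {
            "inertia": model.inertia_,  # within-cluster sum of squares (elbow method)
            "silhouette": silhouette_score(X, labels),  # cluster separation quality
        }
    return results


if __name__ == "__main__":
    X = make_synthetic_involvement()
    for k, scores in scan_k(X).items():
        print(k, round(scores["inertia"], 1), round(scores["silhouette"], 3))
```

In practice the inertia values would be plotted against k to look for the elbow, with the silhouette scores used as a second check on the candidate k.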
Results: The cohort comprised 2,187 patients, with a female predominance (61.4%), a mean age at diagnosis of 48.6 years (range: 5-95), and a majority identifying as White (88%). Cluster quality analysis identified five as the optimal number of clusters; an additional clinically significant cluster (hepato-splenic) was identified manually and confirmed post hoc through statistical validation. Ultimately, we defined six distinct clusters of systemic involvement: the lymphadenopathic cluster (Cluster 1, characterized by 100% lymphadenopathy), the pulmonary cluster (Cluster 2, characterized by 100% lung involvement with co-occurring 100% lymphadenopathy), the cutaneous cluster (Cluster 3, 100% cutaneous involvement), the ocular cluster (Cluster 4, 100% ocular involvement), the hepato-splenic cluster (Cluster 5, defined by 100% hepatic and splenic involvement), and the multisystemic cluster (Cluster 6, exhibiting generalized, but not predominant, organ involvement). The clusters differed significantly in their epidemiological characteristics (Figure 1). For age, the lymphadenopathic cluster had the highest mean (51.7 years), whereas the cutaneous cluster had the lowest (42.9 years) (p = 0.00056). For sex, the proportion of females ranged from 49.0% in the hepato-splenic cluster to 65.9% in the ocular cluster (p = 0.000017). For ethnicity, the proportion of White patients ranged from 81.4% in the ocular cluster to 94.6% in the lymphadenopathic cluster (p = 0.00135).
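Between-cluster comparisons of the kind reported above (age and sex across clusters) could be computed along the following lines. This is a sketch under stated assumptions, not the study's analysis: the data are randomly generated, the function name compare_clusters and the variable names are hypothetical, and exact tests for low-frequency cells are not shown.

```python
import numpy as np
from scipy import stats


def compare_clusters(age, is_female, labels):
    """Compare age and sex across clusters (illustrative helper).

    age       : 1-D float array, age at diagnosis
    is_female : 1-D array of 0/1
    labels    : 1-D array of cluster labels
    """
    clusters = np.unique(labels)
    age_groups = [age[labels == c] for c in clusters]

    # Continuous variable: parametric (ANOVA) and rank-based (Kruskal-Wallis) tests.
    _, p_anova = stats.f_oneway(*age_groups)
    _, p_kruskal = stats.kruskal(*age_groups)

    # Categorical variable: 2 x k contingency table (sex by cluster), chi-square test.
    table = np.array(
        [[np.sum(is_female[labels == c] == value) for c in clusters] for value in (0, 1)]
    )
    _, p_chi2, dof, _ = stats.chi2_contingency(table)

    return {"anova_p": p_anova, "kruskal_p": p_kruskal, "chi2_p": p_chi2, "chi2_dof": dof}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 600
    labels = rng.integers(0, 6, size=n)        # six hypothetical clusters
    age = rng.normal(48.0, 14.0, size=n)       # synthetic ages, not registry data
    is_female = rng.binomial(1, 0.6, size=n)   # synthetic sex indicator
    print(compare_clusters(age, is_female, labels))
```

Reporting both the ANOVA and Kruskal-Wallis results is a design choice in this sketch; which one applies to a given variable depends on whether its distribution is approximately normal within clusters.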
Conclusion: This generative AI-driven clustering study identified six distinct patterns of systemic involvement in sarcoidosis, offering a deeper understanding of the disease’s heterogeneity. Each cluster exhibited a specific epidemiological profile: the cutaneous cluster was associated with the youngest age at sarcoidosis diagnosis, the lymphadenopathic cluster with the oldest age and the highest frequency of White patients, the ocular cluster with the highest frequencies of women and of non-White patients, and the hepato-splenic cluster with the highest proportion of men. The significant epidemiological differences among clusters underscore the disease’s variability and offer a framework for refined patient stratification.
Disclosure of Interests: None declared.
© The Authors 2025. This abstract is an open access article published in Annals of Rheumatic Diseases under the CC BY-NC-ND license (