POS0275 (2025)
EVALUATION OF A LARGE LANGUAGE MODEL’S PERFORMANCE IN SCORING CUTANEOUS MANIFESTATIONS IN DERMATOMYOSITIS: A COMPARISON WITH EXPERT ASSESSORS
Keywords: Artificial Intelligence, Skin
M. Fornaro1, S. Panigrahi4, S. Sabbagh2, F. Iannone1, V. Venerito1, L. Gupta3
1University of Bari, Unit of Rheumatology, Department of Precision and Regenerative Medicine, Area Jonica (DiMePRe-J), Bari, Italy
2Division of Rheumatology, Department of Pediatrics, Medical College of Wisconsin, Milwaukee, United States of America
3Division of Musculoskeletal and Dermatological Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre The University of Manchester, Manchester, United Kingdom
4Medical Student, University of Birmingham, Birmingham, United Kingdom

Background: Large Language Models (LLMs) show increasing promise in medicine, and a recent validation study demonstrated strong correlation between LLM and expert scoring of the Myositis Disease Activity Assessment Tool (MDAAT) [1]. However, their performance in cutaneous assessment remains unexplored, particularly for rare diseases such as dermatomyositis (DM).


Objectives: This study evaluated the LLM “Claude v. 3.5 Sonnet” in scoring cutaneous manifestations of DM against expert assessors, with implications for automated clinical trial screening where competing trials often limit recruitment from an already restricted patient pool.


Methods: Twenty-seven DM cases with standardized clinical photographs were identified through systematic PubMed review. Two rheumatologists with expertise in Cutaneous Dermatomyositis Disease Area and Severity Index (CDASI) scoring and trial recruitment independently assessed the images. Claude analysed the same images using chain-of-thought prompting to score the CDASI domains: erythema, scale, erosion/ulceration, poikiloderma, and calcinosis. Hand lesions were scored with specific attention to papules (which require doubled erythema scores) and periungual changes. Agreement was quantified with the Intraclass Correlation Coefficient (ICC), estimated from a two-way random-effects model in Stata 18.
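The two-way random-effects ICC named in the Methods can be illustrated with a minimal sketch. The abstract does not state whether the single-rater or average-rater form was used, nor whether agreement was absolute or consistency-based, so the single-rater, absolute-agreement form ICC(2,1) is assumed here. The function name `icc2_1` and the example matrix are hypothetical and are not study data; the authors ran their analysis in Stata 18, not Python.

```python
import numpy as np


def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a matrix with one row per case and one column per rater.
    This form is an assumption; the abstract does not specify it.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape  # n cases, k raters
    grand_mean = x.mean()

    # Sums of squares from the two-way ANOVA decomposition
    ss_rows = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)  # between cases
    ss_cols = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)  # between raters
    ss_total = np.sum((x - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols                   # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative matrix only: 6 cases scored by 4 raters (not study data).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_value = icc2_1(example)
print(round(icc_value, 2))  # approximately 0.29
```

In this form, systematic differences between raters (the column term) lower the coefficient, which is why absolute agreement is the stricter choice when one rater's scores are meant to stand in for another's.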


Results: Global ICC analysis demonstrated excellent agreement between Claude and expert assessors (0.92, 95% CI: 0.89-0.94), comparable to inter-expert reliability (0.87, 95% CI: 0.82-0.91). Domain-specific analysis revealed:

  • Moderate agreement for core features:
      • Erythema (0.61, 95% CI: 0.27-0.81).
      • Scaling (0.57, 95% CI: 0.19-0.79).
      • Erosions (0.57, 95% CI: 0.21-0.79).
      • Poikiloderma (0.47, 95% CI: 0.10-0.73).
  • Strong concordance for hand assessment, a ubiquitous and specific feature of the disease:
      • Global hand score (0.95, 95% CI: 0.91-0.97).
      • Hand erythema (0.78, 95% CI: 0.24-0.95).
      • Perfect agreement for periungual vasculitis (ICC 1.0).
  • Lower reliability for damage assessment:
      • Hand damage (0.37, 95% CI: 0.10-0.85).
  • Time efficiency analysis:
      • Expert assessors: mean 8.4 minutes per case (range 6-12 minutes).
      • LLM assessment: mean 42 seconds per case (range 35-50 seconds).
      • Total time saved: 93% reduction in scoring time.
      • Additional efficiency: simultaneous batch processing by the LLM versus sequential expert assessment.
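As a rough check on the time comparison, the relative reduction can be worked out from the two reported means. The sketch below uses only the mean values given in the abstract; the helper name `percent_reduction` is hypothetical. Computed from the means alone, the reduction comes to roughly 92%, so the reported 93% presumably derives from per-case timings or a rounding convention that the abstract does not provide.

```python
def percent_reduction(baseline_seconds, new_seconds):
    """Relative reduction in time, expressed as a percentage of the baseline."""
    return 100.0 * (1.0 - new_seconds / baseline_seconds)


expert_mean_s = 8.4 * 60  # reported expert mean: 8.4 minutes per case
llm_mean_s = 42.0         # reported LLM mean: 42 seconds per case

reduction = percent_reduction(expert_mean_s, llm_mean_s)
speedup = expert_mean_s / llm_mean_s

print(f"reduction: {reduction:.1f}%")  # reduction: 91.7%
print(f"speed-up: {speedup:.0f}x")     # speed-up: 12x
```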


Conclusion: The LLM demonstrates excellent reliability for global disease assessment (ICC 0.92) and objective features like periungual changes (ICC 1.0), with significant time efficiency (93% reduction in scoring time) and batch processing capabilities that could enhance clinical trial recruitment workflows. However, important limitations persist in assessing subtle features (poikiloderma ICC 0.47, damage ICC 0.37) and technical constraints including image quality dependencies, suggesting its current optimal use is as a screening tool to support, rather than replace, expert assessment.


    REFERENCES: [1] Vincenzo Venerito, Marco Fornaro, Sara Sabbagh, et al. Integrating large language models in medicine: a study of Claude 2’s performance in MDAAT scoring for idiopathic inflammatory myopathies. Rheumatology, Volume 63, Issue 10, October 2024, Pages e292–e293.

    Figure: Flowchart of Analysis Process

    Table: Intraclass Correlation Coefficient of CDASI in 27 Dermatomyositis cases

    Domain                           ICC            95% CI
    Global (Experts 1 vs 2)          0.87           0.82–0.91
    Global (Claude vs Expert 1)      0.92           0.89–0.94
    Global (Claude vs Expert 2)      0.87           0.82–0.91
    Erythema                         0.61           0.27–0.81
    Scaling                          0.57           0.19–0.79
    Erosions                         0.57           0.21–0.79
    Poikiloderma                     0.47           0.10–0.73
    Calcinosis                       Not detected   N/A
    Hands (Global)                   0.95           0.91–0.97
    Hands (Erythema)                 0.78           0.24–0.95
    Hands (Ulcers)                   0.65           0.10–0.92
    Hands (Damage)                   0.37           0.10–0.85
    Hands (Periungual Vasculitis)    1.0            1.0–1.0

    Acknowledgements: Lisa Traboco, Shounak Ghosh, Meera Shahs, VIJ Pallavi on behalf of MyoLLM group.


    Disclosure of Interests: None declared.

    © The Authors 2025. This abstract is an open access article published in Annals of Rheumatic Diseases under the CC BY-NC-ND license ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ). Neither EULAR nor the publisher make any representation as to the accuracy of the content. The authors are solely responsible for the content in their abstract including accuracy of the facts, statements, results, conclusion, citing resources etc.


    DOI: annrheumdis-2025-eular.B3477
    Citation: Annals of the Rheumatic Diseases, volume 84, supplement 1, year 2025, page 540
    Session: Clinical Poster Tours: Inflammatory Myopathies (Poster Tours)