
SAT0589 (2019)
A VALIDATED TEXT-MINING ALGORITHM TO EXTRACT RA MEDICATION CONTAINED IN FORMAT-FREE FIELDS OF EMRS
Tjardo Maarseveen1, Thomas Huizinga1, Erik van den Akker2,3, Rachel Knevel1,4
1Leiden University Medical Center (LUMC), Rheumatology, Leiden, Netherlands
2Leiden University Medical Center (LUMC), Leiden Computational Biology Centre, Leiden, Netherlands
3Leiden University Medical Center (LUMC), Molecular Epidemiology, Leiden, Netherlands
4Brigham and Women’s Hospital, Rheumatology, Boston, United States of America

Background: Rapidly expanding collections of Electronic Medical Records (EMRs) form a valuable resource for clinical research. Besides entries with a standardized format, EMRs often also contain free-text fields intended for noting specifications of the treatment policy. While these free-text fields contain essential information, their unstructured nature makes them hard to parse, as they often contain typos or acronyms. As a result, data extraction from EMRs is often performed manually, or is performed while excluding the format-free fields.


Objectives: To develop and validate a text-mining approach to extract medication prescribed for Rheumatoid Arthritis as contained in format-free fields of an EMR.


Methods: The EMR dataset comprised 2,771 patients who visited the rheumatology outpatient clinic of the Leiden University Medical Center between 2007 and 2018, yielding a total of 45,012 EMR entries. We randomly selected 15% and 7.5% of the entries to create a training and a test set, with 5,992 and 2,993 entries respectively, which were manually annotated by a trained rheumatologist. The training set was used to design the algorithm, whereas the test set served as an independent validation of the algorithm’s performance in identifying each of the DMARDs and biologicals routinely prescribed for treating RA.

Using methods derived from Natural Language Processing, we developed an algorithm that consecutively performs three tasks: (1) text pre-formatting, (2) acronym recognition, and (3) typo correction. Text pre-formatting consisted of several simple operations to deal with the most prevalent textual artifacts, including the separation of special characters, numbers and punctuation attached to words.
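
As an illustration of the pre-formatting step, the following is a minimal sketch, not the authors' implementation. The function name `preformat`, the lowercasing step and the exact regular expressions are assumptions made for this example; the abstract only states that special characters, numbers and punctuation were separated from words.

```python
import re


def preformat(text: str) -> list[str]:
    """Split a free-text entry into word, number and symbol tokens.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    # Lowercase so that later comparisons are case-insensitive (assumption).
    text = text.lower()
    # Put spaces around every character that is not a letter, digit,
    # underscore or whitespace, e.g. "mg/wk" -> "mg / wk".
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # Detach digits from letters that stick to them, e.g. "mtx15mg" -> "mtx 15 mg".
    text = re.sub(r"([a-z])(\d)", r"\1 \2", text)
    text = re.sub(r"(\d)([a-z])", r"\1 \2", text)
    return text.split()


tokens = preformat("MTX15mg/wk,start pred.")
```

Note that this simple rule set also splits decimal numbers at the decimal separator; a production version would probably need to treat doses such as "7.5" as a single token.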

Ten independent clinicians compiled lists of acronyms for each of the drugs routinely prescribed to treat RA. Lastly, for typo correction, we employed the Damerau-Levenshtein (DL) distance [1], which measures the similarity between two words by counting the number of single-character operations (deletion, insertion, substitution, or transposition of two adjacent characters) required to transform one word into the other. Using the training set, we computed, for each drug, the DL distances between all words in the free-text fields of the EMRs and the drug name or its acronyms. Using the annotations of the training set, we then determined the DL distance that optimally distinguishes a typo from a similar word with a different meaning.
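
The typo-correction step can be sketched as follows. This is a hedged illustration rather than the authors' code: `dl_distance` implements the optimal string alignment variant of the Damerau-Levenshtein distance (the abstract does not say which variant was used), and `best_cutoff` assumes that "optimal" means the cutoff classifying the most annotated word pairs correctly, which is also not specified in the abstract. The annotated pairs in the usage example are made up.

```python
def dl_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance, optimal string alignment variant."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def best_cutoff(annotated, max_cutoff=5):
    """Return the largest distance still accepted as a typo.

    `annotated` holds (distance, is_same_drug) pairs taken from a manually
    labelled training set. The cutoff with the most correct classifications
    wins; ties go to the smaller cutoff (both choices are assumptions).
    """
    def n_correct(cutoff):
        return sum((dist <= cutoff) == label for dist, label in annotated)

    return max(range(max_cutoff + 1), key=n_correct)


# Made-up annotations for one drug: (DL distance to the drug name, same drug?)
example = [(0, True), (1, True), (2, True), (3, False), (4, False)]
cutoff = best_cutoff(example)
is_typo = dl_distance("etanercetp", "etanercept") <= cutoff
```

In this sketch a separate cutoff would be computed per drug, in line with the per-drug distances described above.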


Results: Fifteen medications for the treatment of RA were present in our sample (see Figure 1). In total, medication was present in 1,789 out of the 2,993 entries. The median DL cutoff for typos was 2 with a standard deviation of 0.96.

Across all medications, the overall accuracy of our drug-identification algorithm was high (0.97), as were the other test characteristics: sensitivity=0.98, specificity=0.95, PPV=0.98 and NPV=0.95. Performance was also high at the level of individual drugs: accuracy>=0.99, sensitivity>=0.89, specificity>=0.99, NPV>=0.99 and PPV>=0.90 for all medications except golimumab, which had a low prevalence in our dataset.


Conclusion: We developed and validated an algorithm enabling a highly accurate automated extraction of RA medication from format-free fields of Electronic Medical Records.


REFERENCES:

[1] Damerau, Fred J. (March 1964), "A technique for computer detection and correction of spelling errors", Communications of the ACM, 7 (3): 171–176, doi:10.1145/363958.363994


Disclosure of Interests: Tjardo Maarseveen: None declared, Thomas Huizinga Consultant for: Merck, UCB, Bristol Myers Squibb, Biotest AG, Pfizer, GSK, Novartis, Roche, Sanofi-Aventis, Abbott, Crescendo Bioscience Inc., Nycomed, Boehringer, Takeda, Zydus, Epirus, Eli Lilly, Erik van den Akker: None declared, Rachel Knevel: None declared

DOI: 10.1136/annrheumdis-2019-eular.4835


Citation: Ann Rheum Dis, volume 78, supplement 2, year 2019, page A1387
Session: Epidemiology, risk factors for disease or disease progression (Scientific Abstracts)