ABSTRACT
Objective We developed a post-processing algorithm that converts raw natural language processing (NLP) output from electronic health records into a format usable for analysis. The algorithm is designed specifically to create datasets for medication-based studies.
Materials and Methods The algorithm was developed using output from two natural language processing systems, MedXN and medExtractR. We extracted medication information from deidentified clinical notes in Vanderbilt’s electronic health record system for two medications, tacrolimus and lamotrigine, which have widely different prescribing patterns. The algorithm consists of two parts: Part I parses the raw output and connects related entities, and Part II removes redundancies and calculates dose intake and daily dose. We evaluated both parts of the algorithm by comparing their output to gold standards that were generated using approximately 300 records from 10 subjects for both medications and both NLP systems.
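
As a rough illustration of the Part II calculations, the following minimal Python sketch removes exact duplicate mentions and then computes dose intake and daily dose. It is not the authors' implementation: the `Mention` record, its field names, and the formulas (dose intake = strength × dose amount; daily dose = dose intake × doses per day) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical paired entities for one drug mention (Part I output)."""
    note_date: str        # date of the clinical note
    strength_mg: float    # strength per unit, e.g. 1.0 for a 1 mg capsule
    dose_amount: float    # units taken at one time, e.g. 3 capsules
    freq_per_day: float   # number of times taken per day, e.g. 2 for "bid"


def dose_intake(m: Mention) -> float:
    # Assumed definition: amount of drug taken at one time.
    return m.strength_mg * m.dose_amount


def daily_dose(m: Mention) -> float:
    # Assumed definition: amount of drug taken per day.
    return dose_intake(m) * m.freq_per_day


def summarize(mentions: list[Mention]) -> list[dict]:
    """Drop exact duplicate mentions, then compute both dose quantities."""
    seen: set[Mention] = set()
    rows = []
    for m in mentions:
        if m in seen:  # redundant mention: same date and same dose details
            continue
        seen.add(m)
        rows.append({
            "note_date": m.note_date,
            "dose_intake_mg": dose_intake(m),
            "daily_dose_mg": daily_dose(m),
        })
    return rows
```

For example, a mention of three 1 mg capsules taken twice daily would yield a dose intake of 3 mg and a daily dose of 6 mg under these assumed definitions.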
Results Both parts of the algorithm performed well. For MedXN, the F-measures for Part I were at or above 0.94 and for Part II they were at or above 0.98. For medExtractR, the F-measures for Part I were at or above 0.98 and for Part II they were at or above 0.91.
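
For reference, the F-measure is conventionally the harmonic mean of precision $P$ and recall $R$. The abstract does not state the weighting used, so the balanced (F1) form below is an assumption:

```latex
F = \frac{2PR}{P + R},
\qquad
P = \frac{TP}{TP + FP},
\qquad
R = \frac{TP}{TP + FN}
```

Here $TP$, $FP$, and $FN$ denote the counts of true positives, false positives, and false negatives relative to the gold standard.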
Discussion Our post-processing algorithm is useful for drug-based studies because it converts NLP output to analyzable data. It performed well, although it cannot handle highly complicated cases, which usually occurred when an NLP system incorrectly extracted dose information. Future work will focus on identifying the most likely correct dose when conflicting doses are extracted on the same day.