Learning regular expressions for clinical text classification

Research and applications

Duy Duc An Bui,1,2 Qing Zeng-Treitler1,2

1 Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
2 VA Salt Lake City Health Care System, Salt Lake City, Utah, USA

Correspondence to Dr Qing Zeng-Treitler, Department of Biomedical Informatics, 26 S 2000 E, HSEB 5700, University of Utah, Salt Lake City, UT 84112-5750, USA; [email protected]

Received 14 October 2013
Revised 27 December 2013
Accepted 27 January 2014
Published Online First 27 February 2014

▸ Additional material is published online only. To view please visit the journal online (http://dx.doi.org/10.1136/amiajnl-2013-002411).

ABSTRACT

Objectives Natural language processing (NLP) applications typically use regular expressions that have been developed manually by human experts. Our goal is to automate both the creation and utilization of regular expressions in text classification.

Methods We designed a novel regular expression discovery (RED) algorithm and implemented two text classifiers based on RED. The RED+ALIGN classifier combines RED with an alignment algorithm, and RED+SVM combines RED with a support vector machine (SVM) classifier. Two clinical datasets were used for testing and evaluation: the SMOKE dataset, containing 1091 text snippets describing smoking status; and the PAIN dataset, containing 702 snippets describing pain status. We performed 10-fold cross-validation to calculate accuracy, precision, recall, and F-measure metrics. In the evaluation, an SVM classifier was trained as the control.

Results The two RED classifiers achieved 80.9–83.0% in overall accuracy on the two datasets, which is 1.3–3% higher than SVM’s accuracy (p