Discovering Sublanguages in a Large Clinical Corpus through Unsupervised Machine Learning and Information Gain
Document Type
Conference Proceeding
Publication Date
12-1-2019
Journal
Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019
DOI
10.1109/BigData47090.2019.9006492
Keywords
clinical text; information gain; k-means clustering; relative entropy; sublanguages
Abstract
© 2019 IEEE. Sublanguages are domain-centered subsets of general or colloquial language. Their identification drives several language analysis tasks, but it is difficult to discern separate sublanguages in large clinical corpora. We applied k-means clustering of semantic properties, and a novel implementation of relative entropy as an information gain indicator, to identify sublanguages within a large clinical corpus (~1.6 million documents), visualizing the results in a heat map. Patterns both within and across clusters reveal sublanguage trends. These findings are significant for sublanguage analysis, and have implications at both regional and international levels.
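The abstract names relative entropy as its information gain indicator but does not say how it is computed. As a rough illustration only, the sketch below uses the textbook Kullback-Leibler definition, D(P || Q) = sum of p_i * log2(p_i / q_i), to compare one cluster's distribution with a corpus-wide distribution. It is not the authors' novel implementation. The function names and the semantic-type counts are hypothetical and do not come from the paper.

```python
import math


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must not sum to zero")
    return [c / total for c in counts]


def relative_entropy(p, q):
    """Textbook Kullback-Leibler divergence D(P || Q), in bits.

    p and q are probability distributions over the same categories.
    Terms with p_i == 0 contribute nothing; if q_i == 0 where p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log2(p_i / q_i)
    return total


# Hypothetical counts over three made-up semantic categories.
corpus_counts = [50, 30, 20]   # whole corpus (assumed values)
cluster_counts = [10, 10, 80]  # one cluster (assumed values)

gain = relative_entropy(normalize(cluster_counts), normalize(corpus_counts))
```

In this toy setup, a larger `gain` means the cluster's distribution departs further from the corpus-wide one. Values like this, computed per cluster and per property, are the sort of numbers that could be laid out in a heat map.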
APA Citation
Workman, T., DIvita, G., & Zeng-Treitler, Q. (2019). Discovering Sublanguages in a Large Clinical Corpus through Unsupervised Machine Learning and Information Gain. Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019. http://dx.doi.org/10.1109/BigData47090.2019.9006492