Discovering Sublanguages in a Large Clinical Corpus through Unsupervised Machine Learning and Information Gain
Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019
clinical text; information gain; k-means clustering; relative entropy; sublanguages
© 2019 IEEE. Sublanguages are domain-centered subsets of general or colloquial language. Identifying them drives several language analysis tasks, but it is difficult to discern separate sublanguages in large clinical corpora. We applied k-means clustering of semantic properties, together with a novel implementation of relative entropy as an information gain indicator, to identify sublanguages within a large clinical corpus (~1.6 million documents), and visualized the results in a heat map. Patterns both within and across clusters reveal sublanguage trends. These findings are significant for sublanguage analysis and have implications at both regional and international levels.
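The abstract does not describe how relative entropy was computed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each cluster and the whole corpus can be summarized as counts of semantic categories; the category names, the counts, and the function name `relative_entropy` are all hypothetical. A cluster whose distribution diverges strongly from the corpus-wide distribution would be a candidate for a distinct sublanguage.

```python
import math
from collections import Counter


def relative_entropy(p_counts, q_counts):
    """Return the relative entropy D(P || Q) in bits.

    p_counts and q_counts map a category to a raw count; both are
    normalized to probability distributions before comparison.
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    divergence = 0.0
    for category, count in p_counts.items():
        if count == 0:
            continue  # a zero-probability term contributes nothing
        p = count / p_total
        q = q_counts.get(category, 0) / q_total
        if q == 0:
            # D(P || Q) is undefined when Q assigns zero mass to a
            # category that P uses; smoothing would be needed here.
            raise ValueError(f"category {category!r} missing from reference")
        divergence += p * math.log2(p / q)
    return divergence


# Hypothetical semantic-category counts, for illustration only.
corpus = Counter({"Disease": 50, "Drug": 30, "Procedure": 20})
cluster = Counter({"Disease": 10, "Drug": 80, "Procedure": 10})

gain = relative_entropy(cluster, corpus)
print(f"relative entropy of cluster vs. corpus: {gain:.3f} bits")
```

A value of zero means the cluster uses categories in the same proportions as the corpus as a whole; larger values indicate a more distinctive distribution. Per-cluster values of this kind could be arranged in a grid and rendered as a heat map.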
Workman, T., Divita, G., & Zeng-Treitler, Q. (2019). Discovering Sublanguages in a Large Clinical Corpus through Unsupervised Machine Learning and Information Gain. Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019. http://dx.doi.org/10.1109/BigData47090.2019.9006492