Discovering Sublanguages in a Large Clinical Corpus through Unsupervised Machine Learning and Information Gain

Document Type

Conference Proceeding

Publication Date

12-1-2019

Journal

Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019

DOI

10.1109/BigData47090.2019.9006492

Keywords

clinical text; information gain; k-means clustering; relative entropy; sublanguages

Abstract

© 2019 IEEE. Sublanguages are domain-centered subsets of general or colloquial language. Identifying them drives several language analysis tasks, but separate sublanguages are difficult to discern in large clinical corpora. We applied k-means clustering of semantic properties, together with a novel implementation of relative entropy as an information gain indicator, to identify sublanguages within a large clinical corpus (~1.6 million documents), and visualized the results in a heat map. Patterns both within and across clusters reveal sublanguage trends. These findings are significant for sublanguage analysis and have implications at both regional and international levels.
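
The abstract does not describe how relative entropy was computed, so the following is only a minimal sketch of the textbook Kullback-Leibler divergence, used here as a stand-in for an information gain score between one cluster and the whole corpus. It is not the authors' implementation. The function name `relative_entropy`, the `smoothing` parameter, and the toy semantic-type counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def relative_entropy(p_counts, q_counts, smoothing=1e-9):
    """Return the KL divergence D(P || Q) in bits for two count tables.

    A small additive smoothing term (an assumption of this sketch) keeps
    every probability non-zero so the logarithm is always defined.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(vocab)
    q_total = sum(q_counts.values()) + smoothing * len(vocab)

    divergence = 0.0
    for term in vocab:
        p = (p_counts.get(term, 0) + smoothing) / p_total
        q = (q_counts.get(term, 0) + smoothing) / q_total
        divergence += p * math.log2(p / q)
    return divergence


# Hypothetical toy data: counts of semantic types in one cluster
# versus the corpus as a whole.
cluster = Counter({"medication": 40, "dosage": 35, "symptom": 5})
corpus = Counter({"medication": 100, "dosage": 80, "symptom": 120, "procedure": 100})

gain = relative_entropy(cluster, corpus)
print(f"Relative entropy of cluster vs. corpus: {gain:.3f} bits")
```

Under these assumptions, a cluster whose distribution matches the corpus scores close to zero, while a cluster dominated by a few semantic types scores higher. Scores of this kind, computed per cluster, could be arranged in a grid to produce a heat map like the one the abstract mentions.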
