School of Medicine and Health Sciences Poster Presentations

NCI Multi-omics Mislabeling Challenge: A Machine Learning Approach

Yeshwant Chillakuru, George Washington University
Arjun Panda, George Washington University
Sindhu Kubendran, George Washington University
Norman Lee, George Washington University

Document Type

Poster

Status

Medical Student

Abstract Category

Cancer/Oncology

Keywords

proteomics, cancer, mislabeling, data

Publication Date

Spring 2019

Abstract

Sample mislabeling is a pervasive problem in biomedical research, especially in large-scale multi-omics studies, where it contributes to errors and leads to false conclusions. The Food and Drug Administration (FDA) and the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (NCI-CPTAC) launched a data science challenge to address this problem. We developed a novel approach that combines traditional machine learning with knowledge drawn from the cancer genomics literature to identify mislabeled tumor samples in the NCI-CPTAC Multi-omics Mislabeling Challenge.

The training data contained one tumor sample from each of 80 patients. Each sample was annotated with gender, microsatellite instability (MSI) status, and proteomics data for up to 4119 proteins. Competition organizers systematically mislabeled 10% of the data, which led to gender or MSI labels that were inconsistent with the proteomics data for most mislabeled samples.

To identify mislabeled samples, we used the proteomics data to predict both gender and MSI status and compared the predictions to the given labels. This comparison identifies mislabeling caused by sample swapping and, potentially, by duplication and shifting as well. Gender was predicted using genes that are unique to the Y chromosome and associated with cancer. We converted these genes into dummy variables (present/not present) and evaluated each protein's predictive value using kappa statistics. Two genes, DDX3Y and RPS4Y1, far outperformed the rest, and together they gave us the gender prediction in our test set. MSI was predicted by applying dimensionality reduction followed by a logistic regression classifier. First, we conducted an F-test with an adjusted p-threshold of 0.05 to identify 31 proteins that are dysregulated in unstable tumor genomes (high MSI) compared to stable tumor genomes (low MSI). In addition, 38 proteins in our dataset that the medical literature reports as associated with MSI were included as well. We used these 62 proteins in a logistic regression model to predict MSI.
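The gender step can be illustrated with a minimal sketch. It assumes that a protein counts as "present" when it has a non-missing measurement and that a sample is called male when either DDX3Y or RPS4Y1 is present. The abstract does not state how the two proteins were combined, so this rule, the function names, and the sample layout are illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, List, Optional

# The two Y-linked proteins named in the abstract.
Y_LINKED = ("DDX3Y", "RPS4Y1")


def predict_gender(sample: Dict[str, Optional[float]]) -> str:
    """Call a sample male if any Y-linked protein was detected.

    Assumption: "present" means the protein has a non-missing value;
    the "any" combination rule is also an assumption.
    """
    detected = any(sample.get(protein) is not None for protein in Y_LINKED)
    return "Male" if detected else "Female"


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    labels = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:  # both lists contain a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy samples, not data from the challenge.
toy_samples = [
    {"DDX3Y": 1.4, "RPS4Y1": 0.9},
    {"DDX3Y": None, "RPS4Y1": None},
    {"DDX3Y": None, "RPS4Y1": 0.3},
]
toy_predictions = [predict_gender(s) for s in toy_samples]
```

A kappa near 1 for a protein's presence flag against the recorded gender would indicate a strong predictor, while a value near 0 would indicate agreement no better than chance.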

We compared the gender and MSI status predicted by our model with the given labels and flagged any sample with a mismatch as mislabeled. On the unseen test set, this approach yielded a sensitivity of 0.83 and a specificity of 0.50.
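The comparison and scoring step can be sketched as follows, assuming each sample carries a (gender, MSI) label pair and that a sample is flagged whenever either element disagrees with the model's prediction. The function names and toy values are hypothetical and are not taken from the challenge data.

```python
from typing import List, Sequence, Tuple

Label = Tuple[str, str]  # (gender, MSI status)


def flag_mislabeled(given: Sequence[Label], predicted: Sequence[Label]) -> List[bool]:
    """Flag a sample when its given labels differ from the predicted labels."""
    return [g != p for g, p in zip(given, predicted)]


def sensitivity_specificity(truth: Sequence[bool], flagged: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for detecting truly mislabeled samples."""
    tp = sum(t and f for t, f in zip(truth, flagged))
    fn = sum(t and not f for t, f in zip(truth, flagged))
    tn = sum(not t and not f for t, f in zip(truth, flagged))
    fp = sum(not t and f for t, f in zip(truth, flagged))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one mislabeled and one correct sample")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy labels.
given = [("Male", "MSI-High"), ("Female", "MSI-Low"), ("Male", "MSI-Low")]
predicted = [("Male", "MSI-High"), ("Male", "MSI-Low"), ("Male", "MSI-High")]
flags = flag_mislabeled(given, predicted)  # [False, True, True]
```

Sensitivity here is the fraction of truly mislabeled samples that were flagged, and specificity is the fraction of correctly labeled samples that were left unflagged.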

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Open Access

Comments

Presented at Research Days 2019.

Included in

Oncology Commons
