School of Medicine and Health Sciences Poster Presentations

Document Type

Poster

Status

Medical Student

Abstract Category

Cancer/Oncology

Keywords

proteomics, cancer, mislabeling, data

Publication Date

Spring 2019

Abstract

Sample mislabeling is a pervasive problem in biomedical research, especially in large-scale multi-omics studies, where it contributes to errors and can lead to false conclusions. The Food and Drug Administration (FDA) and the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (NCI-CPTAC) launched a data science challenge to address this problem. We developed a novel machine-learning approach that combines traditional machine learning with knowledge drawn from the cancer genomics literature to identify mislabeled tumors in the NCI-CPTAC Multi-omics Mislabeling Challenge.

The training data contained one tumor sample from each of 80 patients; each sample included gender, microsatellite instability (MSI) status, and proteomics data for up to 4119 proteins. Competition organizers systematically mislabeled 10% of the data, which led to an incorrect gender or MSI status, relative to the proteomics data, for most mislabeled samples.

To identify mislabeled samples, we used the proteomics data to predict both gender and MSI status and compared the predictions with the given labels. This would identify mislabeling caused by sample swapping and, potentially, duplication and shifting as well. Gender was predicted using proteins encoded by genes that are unique to the Y chromosome and associated with cancer. We turned these proteins into dummy variables (present/not present) and evaluated each protein’s predictive value using kappa statistics. Two far outperformed the rest, DDX3Y and RPS4Y1, and together they gave us the gender prediction in our test set. MSI was predicted by applying dimensionality reduction and a logistic regression classifier. First, we conducted an F-test with an adjusted p-value threshold of 0.05 to identify 31 proteins that are dysregulated in unstable tumor genomes (high MSI) compared to stable tumor genomes (low MSI). In addition, 38 proteins in our dataset were identified in the medical literature as associated with MSI and were included as features. We used these 62 proteins in a logistic regression model to predict MSI.
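The two screening statistics named above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the convention that a missing measurement (`None`) means "not present", the `ACTB` control column, and all toy values are assumptions made for the example. It computes Cohen's kappa between a protein's presence indicator and the reported gender, and a two-group one-way F statistic of the kind used to screen proteins against MSI status.

```python
from statistics import mean


def presence(values):
    """Dummy-code a protein: 1 if a measurement exists, 0 if it is missing (None)."""
    return [0 if v is None else 1 for v in values]


def cohen_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each list's marginal label frequencies.
    expected = sum((a.count(k) / n) * (b.count(k) / n) for k in set(a) | set(b))
    if expected == 1.0:
        return 0.0  # kappa is undefined here; treated as "no information"
    return (observed - expected) / (1.0 - expected)


def f_statistic(group_a, group_b):
    """One-way ANOVA F statistic for two groups (between-group df = 1)."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    grand = mean(group_a + group_b)
    ss_between = (len(group_a) * (mean_a - grand) ** 2
                  + len(group_b) * (mean_b - grand) ** 2)
    ss_within = (sum((x - mean_a) ** 2 for x in group_a)
                 + sum((x - mean_b) ** 2 for x in group_b))
    df_within = len(group_a) + len(group_b) - 2
    return ss_between / (ss_within / df_within)


# Toy data (invented): 1 = reported male, 0 = reported female.
reported_male = [1, 1, 1, 0, 0, 0]
toy_proteins = {
    "DDX3Y": [0.8, 1.1, None, None, None, None],
    "RPS4Y1": [0.5, 0.7, 0.9, None, None, None],
    "ACTB": [2.0, 2.1, 1.9, 2.2, 2.0, 2.1],  # hypothetical control, present in all
}
kappas = {name: cohen_kappa(presence(vals), reported_male)
          for name, vals in toy_proteins.items()}

# Toy abundances of one protein in high-MSI vs. low-MSI tumors (invented).
f_value = f_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Ranking proteins by kappa singles out presence indicators that track the reported gender. Converting the F statistic into an adjusted p-value requires the F distribution and a multiple-testing correction, both of which are omitted here.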

We then compared our model's predicted gender and MSI status with the labels provided for each sample; any sample with a mismatch was flagged as mislabeled. Results on the unseen test set yielded a sensitivity of 0.83 and a specificity of 0.50.
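As a rough illustration of this final step, the sketch below flags a sample when either predicted attribute disagrees with its given label, then scores the flags against a known list of mislabeled samples. The function names and the toy labels are assumptions for the example and are unrelated to the challenge data or the reported results.

```python
def flag_mismatches(given, predicted):
    """Flag a sample when its predicted (gender, MSI) pair differs from the given pair."""
    return [g != p for g, p in zip(given, predicted)]


def sensitivity_specificity(flagged, truly_mislabeled):
    """Return (sensitivity, specificity) of the flags against the ground truth.

    Assumes at least one truly mislabeled and one correctly labeled sample.
    """
    pairs = list(zip(flagged, truly_mislabeled))
    tp = sum(f and t for f, t in pairs)              # mislabeled and flagged
    fn = sum((not f) and t for f, t in pairs)        # mislabeled but missed
    tn = sum((not f) and (not t) for f, t in pairs)  # correct and not flagged
    fp = sum(f and (not t) for f, t in pairs)        # correct but flagged
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (invented): (gender, MSI status) per sample.
given = [("Male", "High"), ("Female", "Low"), ("Male", "Low"),
         ("Female", "High"), ("Male", "Low"), ("Female", "Low")]
predicted = [("Male", "High"), ("Male", "Low"), ("Male", "Low"),
             ("Female", "Low"), ("Male", "High"), ("Female", "Low")]
truly_mislabeled = [False, True, False, True, False, False]

flagged = flag_mismatches(given, predicted)
sensitivity, specificity = sensitivity_specificity(flagged, truly_mislabeled)
```

Sensitivity here is the fraction of truly mislabeled samples that get flagged, and specificity is the fraction of correctly labeled samples that are left alone.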

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Open Access

Yes

Comments

Presented at Research Days 2019.

Included in

Oncology Commons


NCI Multi-omics Mislabeling Challenge: A Machine Learning Approach
