Feature Selection & Stability

Scientific Research Project | 2019

Summary

In this project I proposed using the frequency with which features (proteins) appear in top positions of ranks as an approach to feature selection in the context of Discovery Proteomics for biomarker research. This field usually has high-dimensional datasets with a much smaller number of samples.

I used controlled samples in which proteins were added at different concentrations under three conditions: cancer, tumor removed and no cancer. The biological experiments were conducted by our collaborators from the Brazilian Bioscience National Laboratory. Proteins had their intensities quantified through Discovery Proteomics methods.

My method was able to select the true markers and to reveal information about the instability of Discovery Proteomics and of the feature selection methods applied to this type of data. Visualization is of great importance for this kind of work, since the experts play a key role and need to make decisions while considering the high risk of bias and the high false discovery rate caused by the contrast between the number of samples and the number of variables in the dataset.

All results and research are reported in my doctoral dissertation.

Category: Machine Learning, Feature Selection, Statistics

My job: Design, Develop, Research, Report

Technology: R, Python, Visualization, Classifiers, Statistical tests, False discovery rate

Gallery

Machine Learning analysis overview

Step by step, from creating ranks to computing candidate signatures and variables' stability through double cross-validation.

Intensities of added markers

We can see that the proteins' intensities are not stable after quantification by Discovery Proteomics.

False Discovery Rate

Because we use p-values for filtering the variables, the False Discovery Rate analysis is fundamental.
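To illustrate why this matters, here is a minimal sketch of a false discovery rate correction. It assumes the Benjamini-Hochberg procedure, which the project page does not name, and it uses made-up p-values; the function name `benjamini_hochberg` and the 0.05 threshold are illustrative choices, not the project's actual settings.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per protein (not data from the project).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)

# Keep only the proteins whose adjusted p-value is at most 0.05.
selected = [i for i, q in enumerate(q_values) if q <= 0.05]
print(selected)
```

In this toy example, four proteins have a raw p-value below 0.05, but only the first two remain after the correction. With thousands of proteins and few samples, the gap between raw and adjusted p-values tends to be much larger.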

Ratios between classes

One way to understand the logic behind using proteins as markers is the ratio between classes. The idea is to make statements such as: 'low intensity of protein A is linked to class B'.
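A ratio of this kind could be sketched as follows. The exact definition used in the project is not given here, so this assumes a log2 ratio of mean intensities between two classes; the intensity values and the helper `log2_ratio` are hypothetical.

```python
import math
from statistics import mean

# Hypothetical intensities of a single protein, grouped by class.
intensities = {
    "control": [7.0, 8.0, 9.0],
    "cancer": [1.5, 2.0, 2.5],
    "tumor removed": [3.5, 4.0, 4.5],
}


def log2_ratio(values_a, values_b):
    """Log2 of the ratio between the mean intensities of two classes."""
    return math.log2(mean(values_a) / mean(values_b))


# Negative values mean the protein is less intense in the first class.
cancer_vs_control = log2_ratio(intensities["cancer"], intensities["control"])
print(cancer_vs_control)
```

Here the ratio is negative, which would support a statement like 'low intensity of this protein is linked to the cancer class'.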

Features correlation

This plot shows how the variables correlate with each other.

Highest rank scores

In each of the 9 Double Cross Validation loops, 40 ranks were computed. The plot highlights the highest scores and reveals a set of good variables for classification tasks.

Frequency of variables

The plot shows the frequency of variables in the top-10 positions of the 40 ranks from each Double Cross Validation loop. It reveals not only the frequency but also the stability, and ranks the proteins according to this information.
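The counting behind such a plot can be sketched in a few lines. This is a simplified illustration, not the code used in the dissertation: the function `top_k_frequency`, the protein names and the toy ranks are all hypothetical, and each rank is assumed to be a list of distinct variable names ordered from best to worst.

```python
from collections import Counter


def top_k_frequency(ranks, k=10):
    """Fraction of ranks in which each variable appears in the top k."""
    counts = Counter()
    for rank in ranks:
        # Count each variable found in the first k positions of this rank.
        counts.update(rank[:k])
    n = len(ranks)
    # Sort by decreasing frequency, breaking ties by name.
    return sorted(
        ((name, count / n) for name, count in counts.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Toy example: 4 ranks over 4 proteins, looking at the top 2 positions.
ranks = [
    ["P1", "P2", "P3", "P4"],
    ["P2", "P1", "P4", "P3"],
    ["P1", "P3", "P2", "P4"],
    ["P1", "P2", "P4", "P3"],
]
freq = top_k_frequency(ranks, k=2)
print(freq)
```

A variable that stays near the top across many ranks and loops gets a frequency close to 1, while a variable that only does well on a few data splits gets a low frequency, which is what makes the count usable as a stability measure.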

Heat map of best variables

The top-10 variables were selected to examine their intensities across the samples from the three classes: healthy (control), cancer and tumor removed.
