Master's thesis topic introduction

Monday, November 25, 2019 - 1 min

Feature Selection for patient population discrimination

The thesis aims to develop a general machine-learning-based classification framework to investigate the heterogeneity between two ICU patient populations that have common data variables. The data variables are transformed into a feature set whose subsets are input to various binary classifiers in order to predict the population data source.

We have two patient datasets from separate population sources: MIMIC-III, from Beth Israel Deaconess Medical Center, Boston, USA, and a second dataset from Uniklinik Aachen, Germany. Also see this post.

We first run a binary classification with all the available features and take its accuracy as the baseline. Various feature selection techniques are then used to find a subset of features that gives an accuracy equal to or better than the baseline. Many of the available features may not be discriminatory.
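The baseline-then-select idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the thesis: the data is synthetic (four Gaussian features, of which only features 0 and 1 differ between the two sources), the classifier is a simple nearest-centroid rule, and greedy forward selection stands in for the "various feature selection techniques". All function and variable names are hypothetical.

```python
import random

def make_population(rng, n, means):
    """Draw n synthetic patients; feature j is Normal(means[j], 1)."""
    return [[rng.gauss(mu, 1.0) for mu in means] for _ in range(n)]

def accuracy(train, test, cols):
    """Hold-out accuracy of a nearest-centroid classifier restricted to cols."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        cents[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]

    def predict(x):
        def dist(label):
            return sum((x[c] - m) ** 2 for c, m in zip(cols, cents[label]))
        return min((0, 1), key=dist)

    return sum(predict(x) == y for x, y in test) / len(test)

def forward_select(train, test, n_features):
    """Greedily add the feature that most improves hold-out accuracy."""
    selected, best = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        score, feat = max(
            (accuracy(train, test, selected + [f]), f) for f in remaining
        )
        if score <= best:  # no candidate improves the score: stop
            break
        selected.append(feat)
        remaining.remove(feat)
        best = score
    return selected, best

rng = random.Random(0)
# Assumption: source A is label 0, source B is label 1; only features 0, 1 differ.
pop_a = make_population(rng, 400, [0.0, 0.0, 0.0, 0.0])
pop_b = make_population(rng, 400, [3.0, 3.0, 0.0, 0.0])
data = [(x, 0) for x in pop_a] + [(x, 1) for x in pop_b]
rng.shuffle(data)
train, test = data[:600], data[600:]

baseline_acc = accuracy(train, test, list(range(4)))  # all features
selected, selected_acc = forward_select(train, test, 4)
print("baseline accuracy:", baseline_acc)
print("selected features:", selected, "accuracy:", selected_acc)
```

Note that this sketch selects features on the same hold-out split it reports accuracy on, which makes the reported figure optimistic. A real experiment would use a separate validation split or nested cross-validation.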

The feature-set subsets with high classification accuracy are discriminatory and can be used to predict the data source of a patient drawn at random from the two populations. However, we are equally interested in those subsets that yield accuracy close to chance, because classifiers trained on them cannot distinguish the data source of a randomly drawn patient. The patient data for such features can then be used as part of a computational data-merging strategy.
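Sorting subsets into "discriminatory" and "close to chance" could look roughly like the sketch below. It is an illustration only and rests on several assumptions: the data is synthetic, the test set is class-balanced so that chance level is 0.5, the tolerance band of 0.1 is an arbitrary choice, the classifier is a nearest-centroid rule, and only subsets of size one and two are enumerated. All names are hypothetical.

```python
import random
from itertools import combinations

CHANCE = 0.5     # valid only because the test set below is class-balanced
TOLERANCE = 0.1  # assumed width of the "close to chance" band

def draw(rng, n, means):
    """Draw n synthetic patients; feature j is Normal(means[j], 1)."""
    return [[rng.gauss(m, 1.0) for m in means] for _ in range(n)]

def holdout_accuracy(train, test, cols):
    """Nearest-centroid accuracy using only the feature indices in cols."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        cents[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    correct = 0
    for x, y in test:
        dist = {
            lab: sum((x[c] - m) ** 2 for c, m in zip(cols, cents[lab]))
            for lab in (0, 1)
        }
        correct += min(dist, key=dist.get) == y
    return correct / len(test)

rng = random.Random(1)
# Assumption: features 0 and 1 differ between the sources, 2 and 3 do not.
means_a = [0.0, 0.0, 0.0, 0.0]
means_b = [3.0, 3.0, 0.0, 0.0]
train = [(x, 0) for x in draw(rng, 500, means_a)] + \
        [(x, 1) for x in draw(rng, 500, means_b)]
test = [(x, 0) for x in draw(rng, 500, means_a)] + \
       [(x, 1) for x in draw(rng, 500, means_b)]

# Score every feature subset of size 1 or 2.
scores = {
    cols: holdout_accuracy(train, test, cols)
    for size in (1, 2)
    for cols in combinations(range(4), size)
}
near_chance = sorted(
    c for c, acc in scores.items() if abs(acc - CHANCE) <= TOLERANCE
)
discriminatory = sorted(
    c for c, acc in scores.items() if acc - CHANCE > TOLERANCE
)
print("near chance:", near_chance)
print("discriminatory:", discriminatory)
```

With unbalanced populations the chance level would be the majority-class proportion rather than 0.5, so the band would have to be centered there instead.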