Predicting Diabetes from Medical Records

Using data from an old Kaggle competition, my colleagues Likhita Devireddy, David Dunn, and I attempted to predict whether a patient has type II diabetes by mining electronic health record data. The project broke down into several phases.

First, we had to take the fragmented health record data and turn it into a single feature table to model against. We generated approximately 1000 features from the original health data using logical and mathematical transformations performed in R and Excel. We then determined which of those features were useful, employing multiple methods of feature selection.
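
To make the first step concrete, here is a minimal sketch of collapsing fragmented per-event records into one feature row per patient. It is written in Python for illustration rather than the R and Excel used in the project, and the record layout, field names (patient_id, icd9, bmi), and derived features are all hypothetical, not the ones from the actual data set.

```python
from collections import defaultdict

# Hypothetical fragmented records: one row per diagnosis, one row per measurement.
diagnoses = [
    {"patient_id": 1, "icd9": "401.9"},
    {"patient_id": 1, "icd9": "272.4"},
    {"patient_id": 2, "icd9": "401.9"},
]
measurements = [
    {"patient_id": 1, "bmi": 31.0},
    {"patient_id": 1, "bmi": 33.0},
    {"patient_id": 2, "bmi": 24.0},
]


def build_feature_table(diagnoses, measurements):
    """Collapse per-event rows into a single feature dict per patient."""
    counts = defaultdict(int)        # number of diagnosis rows per patient
    hypertension = defaultdict(int)  # logical feature: any ICD-9 401.x code
    bmis = defaultdict(list)         # raw BMI values, summarized below

    for row in diagnoses:
        pid = row["patient_id"]
        counts[pid] += 1
        if row["icd9"].startswith("401"):
            hypertension[pid] = 1

    for row in measurements:
        bmis[row["patient_id"]].append(row["bmi"])

    table = {}
    for pid in sorted(set(counts) | set(bmis)):
        values = bmis.get(pid, [])
        table[pid] = {
            "n_diagnoses": counts.get(pid, 0),
            "has_hypertension": hypertension.get(pid, 0),
            # Mathematical features: summaries of repeated measurements.
            "mean_bmi": sum(values) / len(values) if values else None,
            "max_bmi": max(values) if values else None,
        }
    return table


feature_table = build_feature_table(diagnoses, measurements)
```

Each patient ends up with a fixed-width row that mixes logical features (a flag for a diagnosis code family) with mathematical ones (counts and summary statistics). A table of roughly 1000 features would presumably be built up from many transformations of this kind.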

We then evaluated our feature selection methods using 25 different models. From the different combinations of models and feature sets we were able to narrow down our options, eventually settling on an ensemble of model/feature set combinations (evaluated in terms of lift).
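
For readers unfamiliar with lift, the sketch below shows one common way to compute it: the rate of positives among the highest-scored fraction of patients, divided by the overall positive rate. The function name, the default fraction, and the toy labels and scores are assumptions made for illustration; the report may define the cutoff differently.

```python
def lift_at_fraction(y_true, scores, fraction=0.1):
    """Lift of the top `fraction` of cases ranked by predicted score.

    y_true: sequence of 0/1 labels; scores: predicted risk, higher = riskier.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(y_true) != len(scores) or not y_true:
        raise ValueError("y_true and scores must be non-empty and equal length")

    n = len(y_true)
    base_rate = sum(y_true) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positive cases")

    # Number of cases in the top slice (at least one).
    k = max(1, int(round(n * fraction)))

    # Rank cases from highest to lowest score and keep the top k labels.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:k]) / k

    return top_rate / base_rate


# Toy example: 10 patients, 3 of whom have the condition.
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
risk = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
top_fifth_lift = lift_at_fraction(labels, risk, fraction=0.2)
```

A lift above 1 means the model concentrates true cases among its highest-scored patients more than random selection would. This makes it possible to compare different model and feature set combinations on the same scale.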

The full details (including references) can be found in the complete report.