Creating, serializing and deploying a machine learning model for healthcare: Part 1

Machine learning (ML) and artificial intelligence (AI) are the current buzzwords, and it is heartening to find local HSPs scrambling to get on the bandwagon. The emphasis is mostly on creating models, which requires technical as well as clinical expertise. Building the quintessential ‘blackbox’ model is a good healthcare analytics exercise, but deploying the model so that it is useful at the bedside belongs to the IT domain.

This article is about creating a simple model using the Discharge Abstract Database (DAD) as the data source and Apache Spark as the framework, serializing it into a format that can be used externally, and building a simple website that deploys the model so that users can make predictions. To make this interesting, we will create the website using Java 11 and Spring Boot 2.1, both of which are yet to be released at the time of writing but should be out by the time we get there. Please note that this is about deploying a model/pipeline created with Spark, which may be overkill for most projects. Here are some good resources if you have small data or a simple model.

This post is actually a note to myself as I explore the process. As always, the focus is on understanding the process and not on the utility of the model. Feel free to comment below and add your own notes/ideas.

TL;DR: the code will be available on our GitHub repository as we progress.


First, let us start with a brief description of Apache Spark. Apache Spark is an open-source big-data framework with built-in cluster computing ability. Spark is highly accessible and offers simple APIs in Python, Java, Scala, and R. I have picked Python as I can use the Python interpreter at CC right from the PyCharm IDE. PySpark is the Python library for interacting with Spark, and it can be added to sys.path at runtime using the findspark library. Most machine learning pipelines are available in PySpark. We will be building a simple logistic regression model. The necessary libraries can be imported as below.
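The listing that follows is only a minimal sketch of what such a setup might look like, not necessarily the exact imports used for this series. It assumes a local Spark installation that findspark can locate (for example through SPARK_HOME). The column names `age`, `length_of_stay` and `readmitted`, the application name, and the helper functions `init_spark` and `build_pipeline` are hypothetical placeholders, not actual DAD fields or code from the repository.

```python
# Hypothetical column names used for illustration only (not real DAD fields).
FEATURE_COLS = ["age", "length_of_stay"]
LABEL_COL = "readmitted"


def init_spark(app_name="dad-logistic-regression"):
    """Make pyspark importable and return a SparkSession.

    findspark.init() adds the local Spark installation to sys.path,
    so the pyspark import has to come after that call.
    """
    import findspark

    findspark.init()

    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def build_pipeline(feature_cols, label_col):
    """Return an unfitted pipeline: vector assembler, then logistic regression.

    Call init_spark() first: the pyspark.ml stages need an active Spark
    session when they are constructed.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Combine the numeric input columns into the single vector column
    # that Spark ML estimators expect.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, lr])
```

With Spark available, one would call `spark = init_spark()` and then `build_pipeline(FEATURE_COLS, LABEL_COL)`, and fit the resulting pipeline on a DataFrame that contains those columns.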

I will be back again with more next week. In the meantime, have a look at the DAD and its data dictionary. As always, the customary disclaimer is below:

Read Part 2.

Parts of this material are based on the Canadian Institute for Health Information Discharge Abstract Database Research Analytic Files (sampled from fiscal years 2014-15). However the analysis, conclusions, opinions and statements expressed herein are those of the author(s) and not those of the Canadian Institute for Health Information.

Bell Eapen

A dermatologist and an eHealth expert with expertise in healthcare analytics, mHealth, health information exchange, benefits evaluation research, change management, and population informatics.