Machine Learning

Creating, serializing and deploying a machine learning model for healthcare: Part 2

This is a series on serializing and deploying machine learning pipelines developed using PySpark. Part 1 is here. The series is specific to Apache Spark and is basically a set of notes to myself.

We will use MLeap to serialize the model. Below is a brief introduction to MLeap, copied from their website. For more information, please visit the MLeap website.

MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and Tensorflow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark for batch-mode scoring or the MLeap runtime to power realtime API services.

This series is about serializing and deploying. If you are interested in model building, Susan’s article here is an excellent resource.

In Part 1 we imported the dependencies. The next step is to initialize Spark and import the data.

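    _logger = logging.getLogger(__name__)

    # SparkConf and SparkSession must also be imported:
    # from pyspark import SparkConf; from pyspark.sql import SparkSession

    # Configuration (the settings chained here, such as the Spark home, are not shown)
    conf = SparkConf()

    # Spark Session replaces SparkContext
    # MLeap packages are loaded here (config key assumed to be spark.jars.packages)
    spark = SparkSession.builder \
        .appName("BellSparkTest1") \
        .config('spark.jars.packages',
                'ml.combust.mleap:mleap-spark-base_2.11:0.9.3,ml.combust.mleap:mleap-spark_2.11:0.9.3') \
        .config(conf=conf) \
        .getOrCreate()

    # Read csv (dad_csv_path is a placeholder for the path to your DAD csv file)
    df = spark.read.csv(dad_csv_path, header=True, inferSchema=True)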

In the above code, you have to set the Spark home and the path to the DAD csv file. You can name your app whatever you like. The MLeap packages are loaded when the Spark session is created.
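
The MLeap package string follows the usual Maven coordinate pattern of group:artifact:version, with the Scala version appended to the artifact name. Here is a minimal sketch of how that string is put together. The helper name `mleap_packages` is my own invention for illustration, and I am assuming the result is passed to Spark's `spark.jars.packages` setting. The Scala suffix should match the Scala build of your Spark installation.

```python
def mleap_packages(mleap_version="0.9.3", scala_version="2.11"):
    """Build the comma-separated package string used when creating the session.

    Hypothetical helper: the artifact names and default versions are the ones
    that appear in this post; the function itself is only for illustration.
    """
    artifacts = ["mleap-spark-base", "mleap-spark"]
    coordinates = [
        "ml.combust.mleap:{}_{}:{}".format(name, scala_version, mleap_version)
        for name in artifacts
    ]
    # Spark expects multiple packages as one comma-separated string
    return ",".join(coordinates)


packages = mleap_packages()
print(packages)
```

Keeping the versions in one place like this makes it harder to end up with mismatched MLeap artifacts when you upgrade.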

To keep it simple, we are going to create a logistic regression model. The required variables are selected:

    # Select the columns that we need
    df = df.select('TLOS_CAT', 'ACT_LCAT', 'ALC_LCAT',
                   'ICDCOUNT', 'CCICOUNT')

TLOS_CAT (total length of stay) is the dependent variable (DV) and the rest are independent variables (IVs). Please note that the choice of variables may not be ideal, but that is not our focus.

Now, recode TLOS_CAT to binary as we are going to build a logistic regression model.

    # Change all NA to 0
    df = df.fillna(0)

    # Recode TLOS_CAT to binary
    df = df.withColumn('TLOS_CAT_NEW', F.when(df.TLOS_CAT <= 5, 0).otherwise(1))
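
To make the recoding rule concrete, here is a plain Python sketch of what the two steps do to a single value. It is not Spark code, the function name `recode_tlos` is hypothetical, and the sample values are made up for illustration. It also shows why the order matters: as far as I understand Spark's `when`, a missing value does not satisfy the condition and would fall through to `otherwise(1)` if NA were not replaced with 0 first.

```python
def recode_tlos(tlos_cat):
    """Mirror the two steps for one value: NA -> 0, then <= 5 -> 0, else 1."""
    if tlos_cat is None:
        # Same effect as changing all NA to 0 before recoding
        tlos_cat = 0
    return 0 if tlos_cat <= 5 else 1


# Made-up category values, including one missing value
sample = [1, 5, 6, 10, None]
recoded = [recode_tlos(value) for value in sample]
print(recoded)  # [0, 0, 1, 1, 0]
```

Whether coding a missing length of stay as 0 is appropriate is a modelling decision; here it simply follows the NA-to-0 step above.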


We will create and serialize the pipeline next week. I promised to deploy using Java 11 and Spring Boot 2.1. Java 11 was released on Sept 25, and I feel it can have a huge impact on Java-based EMRs like OSCAR and OpenMRS. More about that story soon on NuChange Blog!


Creating, serializing and deploying a machine learning model for healthcare: Part 1

Machine Learning (ML) and Artificial Intelligence (AI) are the buzzwords lately and it is heartening to find local HSPs scrambling to get on the bandwagon. The emphasis is mostly on creating models which require technical as well as clinical expertise. The quintessential ‘blackbox’ model is a good healthcare analytics exercise, but deploying the model to be useful at the bedside belongs to the IT domain.

This article is about creating a simple model using the discharge abstract database (DAD) as the data source and Apache Spark as the framework, serializing it into a format that can be used externally, and building a simple website that deploys the model for users to make predictions. To make this interesting, we will create the website using Java 11 and Spring Boot 2.1, both of which are yet to be released at the time of writing but should be out by the time we get there. Please note that this is about deploying a model/pipeline created with Spark, which may be overkill for most projects. Here are some good resources if you have small data or a simple model.

This post is actually a note to myself as I explore the process. As always the focus is on understanding the process and not on the utility of the model. Feel free to comment below and add your own notes/ideas.

TL;DR the code will be available on our GitHub repository as we progress.


First, let us start with a brief description of Apache Spark. Apache Spark is an open-source big-data framework with built-in cluster computing ability. Spark is highly accessible and offers simple APIs in Python, Java, Scala, and R. I have picked Python as I can use the Python interpreter at CC right from the PyCharm IDE. PySpark is the Python library for interacting with Spark, and it can be added to sys.path at runtime using the findspark library. Most machine learning pipelines are available in PySpark. We will be building a simple logistic regression model. The necessary libraries can be imported as below.

import logging

import findspark
import pyspark.sql.functions as F
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils
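
For the curious, the sys.path linking that findspark takes care of can be sketched in plain Python roughly as below. This is my own approximation of the idea, not findspark's actual source, and the function name `add_pyspark_to_path` is hypothetical. It assumes the usual Spark layout, where the Python package lives under python/ and py4j ships as a zip under python/lib.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put Spark's bundled Python packages on sys.path.

    Rough approximation of what findspark does; not its real implementation.
    """
    spark_python = os.path.join(spark_home, "python")
    # py4j is shipped as a zip whose file name includes its version
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    if not py4j_zips:
        raise FileNotFoundError("No py4j zip found under " + spark_python)
    new_entries = [spark_python, py4j_zips[0]]
    os.environ["SPARK_HOME"] = spark_home
    # Prepend so these entries take precedence over other installations
    sys.path[:0] = new_entries
    return new_entries
```

In practice you would just call findspark.init(), optionally passing the Spark home, before importing pyspark.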

I will be back again with more next week. In the meantime, have a look at DAD and the data dictionary. Read Part 2.

As always, the customary disclaimer below:

Parts of this material are based on the Canadian Institute for Health Information Discharge Abstract Database Research Analytic Files (sampled from fiscal years 2014-15). However the analysis, conclusions, opinions and statements expressed herein are those of the author(s) and not those of the Canadian Institute for Health Information.