Machine Learning

Machine Learning in population health: Creating conditions that ensure good health.

Machine learning (ML) in healthcare tends to focus on patient-centred care and individual-level predictions. Population health, by contrast, deals with health outcomes in a group of individuals and the distribution of those outcomes within the group. Individual health and population health are not opposed, but they are not the same either, and they may require different approaches. ML applications in public health receive far less attention.
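
To make the difference concrete, here is a minimal sketch of moving from patient-level rows to a group-level outcome distribution. The records, community names and outcome labels are all made up for illustration; they are not from any real dataset.

```python
from collections import Counter

# Hypothetical patient-level records: (patient_id, community, outcome)
records = [
    ("p1", "north", "readmitted"),
    ("p2", "north", "not_readmitted"),
    ("p3", "north", "not_readmitted"),
    ("p4", "south", "readmitted"),
    ("p5", "south", "readmitted"),
]


def outcome_distribution(rows):
    """Return {community: {outcome: proportion}} from patient-level rows."""
    counts = {}
    for _patient_id, community, outcome in rows:
        counts.setdefault(community, Counter())[outcome] += 1
    return {
        community: {o: n / sum(c.values()) for o, n in c.items()}
        for community, c in counts.items()
    }


distribution = outcome_distribution(records)
print(distribution)
```

An individual-level model would ask whether p4 will be readmitted; the population-level view asks how readmission is distributed across communities.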

Public health organizations have limited skills available for the transition towards integrated data analytics. As a result, the latest advances in ML and artificial intelligence (AI) have made very little impact on public health analytics and decision making. The biggest barrier is the lack of expertise in designing and implementing data warehouse systems for public health that can integrate the health information systems currently in use.

The data in public health organizations are generally scattered across disparate information systems within a region, or even within the same organization. Efficient and effective health data warehousing requires a common data model for integrated data analytics. The OHDSI OMOP Common Data Model allows for the systematic analysis of disparate observational databases and EMRs. However, its emphasis is on patient-level prediction. Research on how to adapt patient-centred data models into observation-centred population health data models is the need of the hour.

We are making a difficult yet important transition towards integrated health by providing new ways of delivering services in local communities by local health teams. The emphasis is clearly on digital health. We need efficient and effective digital tools and techniques. Motivated by the Ontario Health Teams’ digital strategy, I have been working on tools to support this transition.

Hephaestus is a software tool for ETL (Extract-Transform-Load) for open-source EMR systems such as OSCAR EMR and national datasets such as the Discharge Abstract Database (DAD). It is organized into modules to allow code reuse. Hephaestus uses SQLAlchemy for database connections and for auto-mapping tables to classes, and bonobo for managing ETL. It aims to support common machine learning workflows such as model building with Apache Spark and model deployment using a serverless architecture. I am also working on FHIR-based standards for ML model deployments.
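
For readers new to ETL, the three stages can be sketched with nothing but the Python standard library. This is not Hephaestus code: Hephaestus uses SQLAlchemy and bonobo, while the sketch below swaps in csv and sqlite3 as stand-ins. The table name, column names and sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; the column names are illustrative only.
SOURCE_CSV = """patient_id,admit_date,los_days
1,2018-01-03,4
2,2018-01-05,
3,2018-01-09,12
"""


def extract(text):
    """Extract: read rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types and replace a missing length of stay with 0."""
    return [
        (int(r["patient_id"]), r["admit_date"], int(r["los_days"] or 0))
        for r in rows
    ]


def load(rows, connection):
    """Load: write the cleaned rows into a target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS stay "
        "(patient_id INTEGER, admit_date TEXT, los_days INTEGER)"
    )
    connection.executemany("INSERT INTO stay VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)
loaded = connection.execute("SELECT * FROM stay ORDER BY patient_id").fetchall()
print(loaded)
```

Keeping each stage as its own function is what makes the code reusable across sources: a different EMR only needs a different extract step.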

Hephaestus is a work in progress and any help will be highly appreciated. It is an open-source project on GitHub. If you are looking for an open-source project to contribute to during Hacktoberfest, consider Hephaestus!


Creating, serializing and deploying a machine learning model for healthcare: Part 2

This is a series on serializing and deploying machine learning pipelines developed using PySpark. Part 1 is here. It is specific to Apache Spark and is basically notes to myself.

We will be using MLeap for serializing the model. Below is a brief introduction to MLeap, copied from their website. For more information, please visit the MLeap website.

MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and Tensorflow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark for batch-mode scoring or the MLeap runtime to power realtime API services.

This series is about serializing and deploying. If you are interested in model building, Susan’s article here is an excellent resource.

In part one we imported the dependencies. The next step is to initialize Spark and import the data.

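    # logging, SparkConf and SparkSession are assumed to be imported as in Part 1
    _logger = logging.getLogger(__name__)

    # Configuration: replace the placeholder with your Spark home
    conf = SparkConf(). \
        setSparkHome('<path to spark home>')
    # Spark Session replaces SparkContext; the MLeap packages are loaded here
    spark = SparkSession.builder. \
        appName("BellSparkTest1"). \
        config('spark.jars.packages',
               'ml.combust.mleap:mleap-spark-base_2.11:0.9.3,ml.combust.mleap:mleap-spark_2.11:0.9.3'). \
        config(conf=conf). \
        getOrCreate()

    # Read csv: replace the placeholder with the path to your DAD csv file
    df = spark.read.csv('<path to DAD csv file>', header=True, inferSchema=True)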

In the above code, you have to set the Spark home and the path to the DAD csv file. You can name your app whatever you like. The MLeap packages are loaded in the Spark session.

To keep it simple, we are going to create a logistic regression model. The required variables are selected:

    # Select the columns that we need
    df = df.select('TLOS_CAT', 'ACT_LCAT', 'ALC_LCAT',
                   'ICDCOUNT', 'CCICOUNT')

TLOS_CAT (total length of stay) is the dependent variable (DV) and the rest are independent variables (IVs). Please note that the choice of variables may not be ideal, but that is not our focus.

Now, recode TLOS_CAT to binary as we are going to build a logistic regression model.

    # Change all NA to 0
    df = df.fillna(0)

    # Recode TLOS_CAT to binary (F is assumed to be pyspark.sql.functions)
    df = df \
        .withColumn('TLOS_CAT_NEW', F.when(df.TLOS_CAT <= 5, 0).otherwise(1))
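
If you want to check the recoding rule without starting Spark, the same logic can be sketched in plain Python. The function name recode_tlos and the sample values are my own, for illustration only; the threshold of 5 is the one used in the Spark code.

```python
def recode_tlos(tlos_cat, threshold=5):
    """Mirror the two Spark steps: treat a missing value as 0, then binarize."""
    # Same effect as filling NA with 0 before recoding
    value = 0 if tlos_cat is None else tlos_cat
    # Same rule as F.when(TLOS_CAT <= 5, 0).otherwise(1)
    return 0 if value <= threshold else 1


samples = [None, 1, 5, 6, 12]  # hypothetical TLOS_CAT values
recoded = [recode_tlos(v) for v in samples]
print(recoded)  # [0, 0, 0, 1, 1]
```

Note that because missing values are set to 0 first, they all end up in the 0 class. That is fine for this walkthrough, but you would probably want to handle missing data more carefully in a real model.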


We will create and serialize the pipeline next week. I promised to deploy using Java 11 and Spring Boot 2.1. Java 11 was released on Sept 25 and I feel it can have a huge impact on Java-based EMRs like OSCAR and OpenMRS. More about that story soon on the NuChange Blog!