
Creating, serializing and deploying a machine learning model for healthcare: Part 1

Machine Learning (ML) and Artificial Intelligence (AI) are the buzzwords of late, and it is heartening to see local HSPs scrambling to get on the bandwagon. The emphasis is mostly on creating models, which requires technical as well as clinical expertise. The quintessential ‘black box’ model is a good healthcare analytics exercise, but deploying the model so that it is useful at the bedside belongs to the IT domain.

This article is about creating a simple model using the Discharge Abstract Database (DAD) as the data source and Apache Spark as the framework, serializing it into a format that can be used externally, and building a simple website that deploys the model for users to make predictions. To make this interesting, we will create the website using Java 11 and Spring Boot 2.1, both of which are yet to be released at the time of writing; they should be out by the time we get there. But please note that this series is about deploying a model/pipeline created with Spark (which may be overkill for most projects). If you have small data or a simple model, here are some good resources:

https://github.com/mtobeiyf/keras-flask-deploy-webapp

https://towardsdatascience.com/deploying-keras-deep-learning-models-with-flask-5da4181436a2

https://blog.keras.io/building-a-simple-keras-deep-learning-rest-api.html
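To ground the plan above, below is a minimal sketch of the first step: pulling a flat-file DAD extract into Spark and deriving a binary label to predict. The file name (dad_sample.csv), the column names, and the long-stay cutoff are hypothetical placeholders for illustration, not actual CIHI data dictionary fields.

import findspark
findspark.init()  # make pyspark importable outside a Spark distribution

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dad-explore").getOrCreate()

# Hypothetical flat-file extract of the DAD; replace the path and
# column names with fields from the real data dictionary.
dad = spark.read.csv("dad_sample.csv", header=True, inferSchema=True)

# Example target: flag stays longer than a week as the class to predict.
dad = dad.withColumn("long_stay", (F.col("acute_days") > 7).cast("integer"))
dad.groupBy("long_stay").count().show()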

This post is actually a note to myself as I explore the process. As always, the focus is on understanding the process, not on the utility of the model. Feel free to comment below and add your own notes/ideas.

TL;DR: the code will be available on our GitHub repository as we progress.

 

First, let us start with a brief description of Apache Spark. Apache Spark is an open-source big-data processing framework with built-in cluster-computing support. Spark is highly accessible and offers simple APIs in Python, Java, Scala, and R. I have picked Python, as I can use the Python interpreter at CC right from the PyCharm IDE. PySpark, the Python library for interacting with Spark, can be linked to sys.path at runtime using the findspark library. Most common machine learning algorithms and pipelines are available in PySpark. We will be building a simple logistic regression model. The necessary libraries can be imported as below.

import logging

import findspark
findspark.init()  # locate the Spark install and put pyspark on sys.path

import pyspark.sql.functions as F  # DataFrame column helpers
from pyspark import SparkContext  # entry point for RDD-based Spark
from pyspark.mllib.classification import LogisticRegressionWithLBFGS  # our classifier
from pyspark.mllib.util import MLUtils  # helpers for loading/saving ML data
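Continuing from the imports above, here is a rough preview of where the series is headed: training a logistic regression model on toy records and serializing it with the model's save method so it can be reloaded elsewhere. The feature values and the output path are invented for illustration; the real features will come from the DAD.

from pyspark.mllib.classification import LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="dad-lr-sketch")

# Toy stand-ins for DAD-derived features: [age, acute_days, num_diagnoses];
# label 1.0 marks the outcome of interest.
points = sc.parallelize([
    LabeledPoint(0.0, [55.0, 2.0, 1.0]),
    LabeledPoint(0.0, [61.0, 3.0, 2.0]),
    LabeledPoint(1.0, [78.0, 14.0, 6.0]),
    LabeledPoint(1.0, [83.0, 21.0, 8.0]),
])

model = LogisticRegressionWithLBFGS.train(points, iterations=100)
print(model.predict([70.0, 10.0, 4.0]))  # predicts class 0 or 1

# Serialization: save() writes the model data plus metadata to a directory,
# which load() can read back in a separate process.
model.save(sc, "dad_lr_model")
reloaded = LogisticRegressionModel.load(sc, "dad_lr_model")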

I will be back again with more next week. In the meantime, have a look at the DAD and its data dictionary. As always, the customary disclaimer is below:

Read Part 2.

Parts of this material are based on the Canadian Institute for Health Information Discharge Abstract Database Research Analytic Files (sampled from fiscal years 2014-15). However, the analysis, conclusions, opinions and statements expressed herein are those of the author(s) and not those of the Canadian Institute for Health Information.