This is a series on serializing and deploying machine learning pipelines developed with PySpark. Part 1 is here. The series is specific to Apache Spark and is basically a set of notes to myself.
We will be using MLeap to serialize the model. Below is a brief introduction to MLeap, copied from their website. For more information, please visit the MLeap website.
MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and Tensorflow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark for batch-mode scoring or the MLeap runtime to power realtime API services.
This series is about serializing and deploying. If you are interested in model building, Susan’s article here is an excellent resource.
In part one we imported the dependencies. The next step is to initialize Spark and import the data.
    _logger = logging.getLogger(__name__)
    findspark.init(ConfigParams.__SPARK_HOME__)
    # Configuration
    conf = SparkConf(). \
        setAppName('BellSpark')
    # Spark Session replaces SparkContext
    spark = SparkSession.builder. \
        appName("BellSparkTest1"). \
        config('spark.jars.packages', 'ml.combust.mleap:mleap-spark-base_2.11:0.9.3,ml.combust.mleap:mleap-spark_2.11:0.9.3'). \
        config(conf=conf). \
        getOrCreate()
    # Read csv
    df = spark.read.csv(ConfigParams.__DAD_PATH__, header=True, inferSchema=True)
In the above code, you have to set the Spark home and the path to the DAD CSV file, both of which are read from ConfigParams. You can name your app whatever you like. The MLeap packages are loaded into the Spark session through the spark.jars.packages setting.
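For readers who do not have the ConfigParams class from part one at hand, here is a minimal sketch of what it could look like. Only the two attribute names (`__SPARK_HOME__` and `__DAD_PATH__`) come from the code above. The environment variable `DAD_PATH`, the fallback paths, and the helper `mleap_packages` are my own assumptions for illustration, so adjust them to your setup.

```python
import os


class ConfigParams:
    """Hypothetical settings holder; only the attribute names come from the post."""

    # Assumed defaults: export SPARK_HOME / DAD_PATH before starting Python,
    # or replace the fallback paths with the locations on your machine.
    __SPARK_HOME__ = os.environ.get("SPARK_HOME", "/opt/spark")
    __DAD_PATH__ = os.environ.get("DAD_PATH", "data/dad.csv")


def mleap_packages(mleap_version="0.9.3", scala_version="2.11"):
    """Build the comma-separated Maven coordinates for spark.jars.packages.

    Hypothetical helper: it only assembles the same string that is
    hard-coded in the Spark session above.
    """
    artifacts = ("mleap-spark-base", "mleap-spark")
    return ",".join(
        "ml.combust.mleap:{}_{}:{}".format(name, scala_version, mleap_version)
        for name in artifacts
    )


print(ConfigParams.__SPARK_HOME__)
print(mleap_packages())
```

Building the coordinates from two version arguments keeps the Scala suffix and the MLeap version in one place, which may help when you later move to a different Spark or MLeap release.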
To keep it simple, we are going to create a logistic regression model. The required variables are selected:
    # Select the columns that we need
    df = df.select('TLOS_CAT', 'ACT_LCAT', 'ALC_LCAT',
                   'ICDCOUNT', 'CCICOUNT')
TLOS_CAT (total length of stay) is the dependent variable (DV) and the rest are independent variables (IVs). Please note that this choice of variables may not be ideal, but that is not our focus here.
Now, recode TLOS_CAT to binary as we are going to build a logistic regression model.
    # Change all NA to 0
    df = df.na.fill(0)
    # Recode TLOS_CAT to binary
    df = df \
        .withColumn('TLOS_CAT_NEW', F.when(df.TLOS_CAT <= 5, 0).otherwise(1)) \
        .drop(df.TLOS_CAT)
    df.printSchema()
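To make the recoding rule concrete, here is a small plain-Python sketch that mirrors the two Spark steps on a handful of made-up values. It does not touch Spark at all, and the function name `recode_tlos` and the sample values are hypothetical. It assumes TLOS_CAT was inferred as a numeric column, so that filling with 0 applies to it.

```python
def recode_tlos(tlos_cat, threshold=5):
    """Mirror the Spark recode: missing -> 0, then <= threshold -> 0, else 1."""
    # Same effect as df.na.fill(0) on a numeric column
    value = 0 if tlos_cat is None else tlos_cat
    # Same rule as F.when(df.TLOS_CAT <= 5, 0).otherwise(1)
    return 0 if value <= threshold else 1


# Made-up TLOS_CAT values, including one missing entry
samples = [1, 5, 6, 12, None]
recoded = [recode_tlos(v) for v in samples]
print(recoded)  # [0, 0, 1, 1, 0]
```

Note the last value: because missing entries are filled with 0 before the recode, a row with no length of stay ends up in the 0 class. That is fine for these notes, but you may want to handle missing values differently in a real model.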
We will create and serialize the pipeline next week. I promised to deploy using Java 11 and Spring Boot 2.1. Java 11 was released on Sept 25, and I feel it can have a huge impact on Java-based EMRs like OSCAR and OpenMRS. More about that story soon on NuChange Blog!