<h1>Logging for ML Model Deployments</h1>
<p>Brian Schmidt, 2023-04-20</p>
<p>In previous blog posts we <a href="https://www.tekhnoal.com/ml-model-decorators.html">introduced the decorator pattern</a> for ML model deployments and then showed how to use the pattern to build extensions for an ML model deployment. For example, in <a href="https://www.tekhnoal.com/data-enrichment-for-ml-models.html">this blog post</a> we did data enrichment using a PostgreSQL database. The extensions were added without modifying the machine learning model code at all; the decorator pattern made that possible. In this blog post we'll add logging to a model deployment, again using a decorator and without modifying the model code.</p>
<p>This blog post is written in a Jupyter notebook, and we'll be switching between Python code and shell commands; the formatting will reflect this.</p>
<h2>Introduction</h2>
<p>As software systems become more and more complex, the people that build and operate these systems are finding that they are very hard to debug and inspect. To be able to solve this issue, a software system needs to be observable. An observable system is a system that allows an outside observer to infer the internal state of the system based purely on the data that it generates. The quality of "observability" helps the operators of a system to understand the inner workings of the system and to solve issues that may come up, even when the issues may be unprecedented.</p>
<p>Observability is a non-functional requirement (NFR) of a system. An NFR is a requirement that is placed on the operation of a system that has nothing to do with the specific functions of the system. Rather, it is a cross-cutting concern that needs to be addressed within the whole system design. Logging is a way that we can implement observability in a software system. </p>
<p>In the world of software systems, a "log" is a record of events that happen as software runs. A log is made up of individual records called log records that each represent a single event in the software system. Logs are useful for debugging the system, keeping a permanent record of its activities, and many other purposes. In general, log records are designed for debugging, alerting, and auditing the activities of the system.</p>
<p>Just like any other software component, machine learning models need to create a log of events that may be useful later on. For example, we may want to know how many predictions the model made, how many errors occurred, and any other interesting events that we may want to keep track of. In this blog post we'll create a decorator that creates a log for a machine learning model.</p>
<p>This post is not meant to be a full guide for doing logging in Python, but we'll include some background information to make it easier to understand. Logging in Python can get complicated and there are other places that cover it more thoroughly. <a href="https://realpython.com/python-logging/">Here</a> is a good place to learn more about Python logging.</p>
<p>All of the code is available in <a href="https://github.com/schmidtbri/logging-for-ml-models">this github repository</a>.</p>
<h2>Software Architecture</h2>
<p>The logging decorator will operate within the model service, but it requires outside services to handle the logs that it produces. This makes the software architecture more complicated and requires that we add several more services to the mix. </p>
<p><img alt="Software Architecture" src="https://www.tekhnoal.com/software_architecture_lfmlm.png" width="100%"></p>
<p>The logging decorator executes right after the prediction request is received from the client and a prediction is made by the model; it then sends the logs to be handled by other services. The other services are:</p>
<ul>
<li>Log Forwarder: a service that runs on each cluster node and forwards logs from the local hard drive to the log storage service.</li>
<li>Log Storage: a service that can store logs and also query them.</li>
<li>Log User Interface: a service with a web interface that provides access to the logs stored in the log storage service.</li>
</ul>
<p>The specific services that we'll use will be detailed later in the blog post.</p>
<h2>Logging Best Practices</h2>
<p>There are certain things we can do when creating a log for an application that make it more useful, especially in production settings. For example, attaching a "level" to each log record makes it easy to filter the log according to the severity of the events. A log record is at the "INFO" level when it communicates a simple action that the system has taken, while a "WARNING" record indicates a possible problem that the system can nevertheless continue running through. A good description of the common log levels is <a href="https://sematext.com/blog/logging-levels/">here</a>.</p>
<p>Another good practice for logs is to include contextual information that can help to debug any problems that may arise in the execution of the code. For example, we can include the location in the codebase where the log record was generated. This information is very helpful during debugging and helps to quickly find the code that caused the event to happen. The information is often presented as the function name, code file name, and line number where the log record was generated. Another piece of useful contextual information is the hostname of the machine where the log was generated.</p>
<p>Logs should be easy to interpret for both humans and machines. Log records are often written as free-form text strings, which humans can read easily but machines find complicated to parse. A good middle ground is JSON formatting: JSON-formatted logs are easy to parse programmatically, while still allowing a human to quickly read and understand a log message.</p>
<p>Unique identifiers are useful to include in logs because they allow us to correlate many different log records together into a cohesive picture. For example, a correlation id is a unique ID that is generated to identify a specific transaction or query in a system. Adding unique identifiers to each log record can make it possible to debug complex problems that happen across system boundaries. A good description of correlation ids is <a href="https://hilton.org.uk/blog/microservices-correlation-id">here</a>.</p>
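<p>As a minimal sketch of this idea, the snippet below generates a correlation id with Python's uuid module and attaches it to each log record through the logging module's "extra" mechanism (both are covered in more detail below). The logger and field names here are made up for illustration:</p>

```python
import logging
import sys
import uuid

# a logger whose format includes the correlation id of each record
logger = logging.getLogger("correlation_example")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(correlation_id)s : %(message)s"))
logger.addHandler(handler)

# one correlation id is generated per transaction and attached to every
# log record emitted while handling that transaction
correlation_id = str(uuid.uuid4())
logger.warning("Prediction requested.", extra={"correlation_id": correlation_id})
logger.warning("Prediction returned.", extra={"correlation_id": correlation_id})
```

<p>In a real service, the correlation id would typically be read from an incoming request header, or generated if it is absent, so that records from every service that handled the same transaction can be joined together.</p>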
<h2>Logging in Python</h2>
<p>The Python standard library has a module that simplifies logging. The logging module is imported and used like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">()</span>
<span class="n">logger</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="s2">"Warning message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Warning message.
</code></pre></div>
<p>To start logging, we instantiated a logger object using the logging.getLogger() function. Then we used the logger object to log a WARNING message.</p>
<p>The log records are being sent to the stderr output of the process by default. We'll change that by instantiating a StreamHandler and pointing it at the stdout stream:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="n">stream_handler</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">StreamHandler</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">addHandler</span><span class="p">(</span><span class="n">stream_handler</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="s2">"Warning message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Warning message.
</code></pre></div>
<p>We just replaced the default behavior, which logs to stderr, with a handler that logs to stdout. A log handler is a software component that sends log records to destinations outside of the running process.</p>
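<p>Because handlers are independent of the logger, we can attach several of them and send the same records to different destinations at once. Here is a small sketch (the logger name and file path are arbitrary) that logs to stdout and to a file at the same time:</p>

```python
import logging
import os
import sys
import tempfile

logger = logging.getLogger("handler_example")
logger.addHandler(logging.StreamHandler(sys.stdout))  # first destination: stdout

# second destination: a log file on the local disk
log_path = os.path.join(tempfile.gettempdir(), "handler_example.log")
file_handler = logging.FileHandler(log_path, mode="w")
logger.addHandler(file_handler)

logger.warning("Warning message.")  # sent to both handlers
file_handler.flush()

with open(log_path) as log_file:
    contents = log_file.read()
print(contents)
```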
<p>We can also log messages at other levels; here are a WARNING and a DEBUG message:</p>
<div class="highlight"><pre><span></span><code><span class="n">logger</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="s2">"Warning message."</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"Debug message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Warning message.
</code></pre></div>
<p>When the code above executed, only the WARNING message was printed because, by default, the logger only emits messages at the WARNING level or above. This filtering is helpful when you are only interested in logs above a certain level. We can change the threshold by configuring the logger:</p>
<div class="highlight"><pre><span></span><code><span class="n">logger</span><span class="o">.</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="s2">"Warning message."</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"Debug message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Warning message.
Debug message.
</code></pre></div>
<p>Now we can see the debug message. </p>
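<p>Note that handlers also have a level of their own, and a record must pass both the logger's level and the handler's level before it is emitted. Here is a small sketch of that behavior, using an in-memory stream so the output is easy to inspect:</p>

```python
import io
import logging

logger = logging.getLogger("level_example")
logger.setLevel(logging.DEBUG)      # the logger lets everything through

buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setLevel(logging.WARNING)   # but this handler drops records below WARNING
logger.addHandler(handler)

logger.debug("Debug message.")      # dropped by the handler
logger.warning("Warning message.")  # written to the buffer
print(buffer.getvalue())
```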
<p>We can add more information to each log record by attaching a formatter to the log handler:</p>
<div class="highlight"><pre><span></span><code><span class="n">formatter</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">Formatter</span><span class="p">(</span><span class="s1">'</span><span class="si">%(asctime)s</span><span class="s1">:</span><span class="si">%(name)s</span><span class="s1">:</span><span class="si">%(levelname)s</span><span class="s1">: </span><span class="si">%(message)s</span><span class="s1">'</span><span class="p">)</span>
<span class="n">stream_handler</span><span class="o">.</span><span class="n">setFormatter</span><span class="p">(</span><span class="n">formatter</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="s2">"Warning message."</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"Debug message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">2023</span><span class="o">-</span><span class="mf">04</span><span class="o">-</span><span class="mf">23</span><span class="w"> </span><span class="mf">21</span><span class="p">:</span><span class="mf">28</span><span class="p">:</span><span class="mf">47</span><span class="p">,</span><span class="mf">875</span><span class="p">:</span><span class="n">root</span><span class="p">:</span><span class="n">WARNING</span><span class="p">:</span><span class="w"> </span><span class="n">Warning</span><span class="w"> </span><span class="n">message</span><span class="mf">.</span><span class="w"></span>
<span class="mf">2023</span><span class="o">-</span><span class="mf">04</span><span class="o">-</span><span class="mf">23</span><span class="w"> </span><span class="mf">21</span><span class="p">:</span><span class="mf">28</span><span class="p">:</span><span class="mf">47</span><span class="p">,</span><span class="mf">876</span><span class="p">:</span><span class="n">root</span><span class="p">:</span><span class="n">DEBUG</span><span class="p">:</span><span class="w"> </span><span class="n">Debug</span><span class="w"> </span><span class="n">message</span><span class="mf">.</span><span class="w"></span>
</code></pre></div>
<p>A formatter is a software component that can format log messages according to a desired format. The log record now contains the date and time of the event, the name of the logger that generated the message, the level of the log, and the log message. These are all standard fields that are attached to log messages when they are created, more information about these fields can be found in the Python documentation <a href="https://docs.python.org/3/library/logging.html#logrecord-attributes">here</a>.</p>
<p>Each logger has a name attached to it when it is created; the name of the current logger is "root" because we created the logger without specifying a name. We can create a new logger with a name like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s2">"test_logger"</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"Debug message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">2023</span><span class="o">-</span><span class="mf">04</span><span class="o">-</span><span class="mf">23</span><span class="w"> </span><span class="mf">21</span><span class="p">:</span><span class="mf">28</span><span class="p">:</span><span class="mf">47</span><span class="p">,</span><span class="mf">881</span><span class="p">:</span><span class="n">test_logger</span><span class="p">:</span><span class="n">DEBUG</span><span class="p">:</span><span class="w"> </span><span class="n">Debug</span><span class="w"> </span><span class="n">message</span><span class="mf">.</span><span class="w"></span>
</code></pre></div>
<p>The log record now shows the name of the new logger rather than the root logger that we were using before.</p>
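<p>Logger names also form a dot-separated hierarchy, and records propagate up that hierarchy to the handlers attached to ancestor loggers. Here is a sketch with made-up logger names, again using an in-memory stream for the output:</p>

```python
import io
import logging

buffer = io.StringIO()

# a handler attached to a parent logger also receives records from its
# children, because records propagate up the dot-separated name hierarchy
parent = logging.getLogger("ml_model")
parent.setLevel(logging.DEBUG)
parent.addHandler(logging.StreamHandler(buffer))

child = logging.getLogger("ml_model.predict")
child.debug("Prediction made.")  # handled by the parent's handler

print(buffer.getvalue())
```

<p>This is why libraries conventionally call logging.getLogger(__name__): every module gets its own logger, while handlers and levels can be configured once on a common ancestor.</p>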
<h3>Logging Environment Variables</h3>
<p>To log extra information that is not available by default within each log record, we have to extend the logging module by creating Filter classes. A Filter is simply a class that filters log records and can also modify them. In our case, the extra information will come from the environment variables of the process in which the logger is running. </p>
<p>To do this we'll create a Filter that is able to pick up information from the environment variables and add it to each log record. </p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">from</span> <span class="nn">logging</span> <span class="kn">import</span> <span class="n">Filter</span>
<span class="k">class</span> <span class="nc">EnvironmentInfoFilter</span><span class="p">(</span><span class="n">Filter</span><span class="p">):</span>
    <span class="sd">"""Logging filter that adds information to log records from environment variables."""</span>

    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_variables</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_env_variables</span> <span class="o">=</span> <span class="n">env_variables</span>

    <span class="k">def</span> <span class="nf">filter</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">record</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">env_variable</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_env_variables</span><span class="p">:</span>
            <span class="nb">setattr</span><span class="p">(</span><span class="n">record</span><span class="p">,</span> <span class="n">env_variable</span><span class="o">.</span><span class="n">lower</span><span class="p">(),</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">env_variable</span><span class="p">,</span> <span class="s2">"N/A"</span><span class="p">))</span>
        <span class="k">return</span> <span class="kc">True</span>
</code></pre></div>
<p>To try it out we'll have to add an environment variable that will be logged:</p>
<div class="highlight"><pre><span></span><code><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">"NODE_IP"</span><span class="p">]</span> <span class="o">=</span> <span class="s2">"198.197.196.195"</span>
</code></pre></div>
<p>Next, we'll instantiate the Filter class and add it to a logger instance to see how it works.</p>
<div class="highlight"><pre><span></span><code><span class="n">environment_info_filter</span> <span class="o">=</span> <span class="n">EnvironmentInfoFilter</span><span class="p">(</span><span class="n">env_variables</span><span class="o">=</span><span class="p">[</span><span class="s2">"NODE_IP"</span><span class="p">])</span>
<span class="n">logger</span><span class="o">.</span><span class="n">addFilter</span><span class="p">(</span><span class="n">environment_info_filter</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">formatter</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">Formatter</span><span class="p">(</span><span class="s1">'</span><span class="si">%(asctime)s</span><span class="s1"> : </span><span class="si">%(name)s</span><span class="s1"> : </span><span class="si">%(levelname)s</span><span class="s1"> : </span><span class="si">%(node_ip)s</span><span class="s1"> : </span><span class="si">%(message)s</span><span class="s1">'</span><span class="p">)</span>
<span class="n">stream_handler</span><span class="o">.</span><span class="n">setFormatter</span><span class="p">(</span><span class="n">formatter</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span><span class="s2">"Warning message."</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"Debug message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">2023</span><span class="o">-</span><span class="mf">04</span><span class="o">-</span><span class="mf">23</span><span class="w"> </span><span class="mf">21</span><span class="p">:</span><span class="mf">28</span><span class="p">:</span><span class="mf">47</span><span class="p">,</span><span class="mf">910</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">test_logger</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">WARNING</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mf">198.197.196.195</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">Warning</span><span class="w"> </span><span class="n">message</span><span class="mf">.</span><span class="w"></span>
<span class="mf">2023</span><span class="o">-</span><span class="mf">04</span><span class="o">-</span><span class="mf">23</span><span class="w"> </span><span class="mf">21</span><span class="p">:</span><span class="mf">28</span><span class="p">:</span><span class="mf">47</span><span class="p">,</span><span class="mf">911</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">test_logger</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">DEBUG</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mf">198.197.196.195</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">Debug</span><span class="w"> </span><span class="n">message</span><span class="mf">.</span><span class="w"></span>
</code></pre></div>
<p>The log record now contains the IP address that we set in the environment variables.</p>
<h3>Logging in JSON</h3>
<p>So far, the logs we've generated have been in a slightly structured format that we came up with, using colons to separate the different sections of each log record. If we want to easily parse the logs to extract information from them, we should instead use JSON records. In this section we'll use the python-json-logger package to format the log records as JSON strings. </p>
<p>First, we'll install the package:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">python</span><span class="o">-</span><span class="n">json</span><span class="o">-</span><span class="n">logger</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>We'll instantiate a JsonFormatter object that will convert the logs to JSON:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pythonjsonlogger</span> <span class="kn">import</span> <span class="n">jsonlogger</span>
<span class="n">json_formatter</span> <span class="o">=</span> <span class="n">jsonlogger</span><span class="o">.</span><span class="n">JsonFormatter</span><span class="p">(</span><span class="s2">"</span><span class="si">%(asctime)s</span><span class="s2"> </span><span class="si">%(name)s</span><span class="s2"> </span><span class="si">%(levelname)s</span><span class="s2"> </span><span class="si">%(node_ip)s</span><span class="s2"> </span><span class="si">%(message)s</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<p>We'll add the formatter to the stream handler that we created above like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">stream_handler</span><span class="o">.</span><span class="n">setFormatter</span><span class="p">(</span><span class="n">json_formatter</span><span class="p">)</span>
</code></pre></div>
<p>Now when we log, the output will be a JSON string:</p>
<div class="highlight"><pre><span></span><code><span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">"Error message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"asctime": "2023-04-23 21:28:50,037", "name": "test_logger", "levelname": "ERROR", "node_ip": "198.197.196.195", "message": "Error message."}
</code></pre></div>
<p>We can easily add more fields from the log record to make it more comprehensive:</p>
<div class="highlight"><pre><span></span><code><span class="n">json_formatter</span> <span class="o">=</span> <span class="n">jsonlogger</span><span class="o">.</span><span class="n">JsonFormatter</span><span class="p">(</span><span class="s2">"</span><span class="si">%(asctime)s</span><span class="s2"> </span><span class="si">%(node_ip)s</span><span class="s2"> </span><span class="si">%(process)s</span><span class="s2"> </span><span class="si">%(thread)s</span><span class="s2"> </span><span class="si">%(pathname)s</span><span class="s2"> </span><span class="si">%(lineno)s</span><span class="s2"> </span><span class="si">%(levelname)s</span><span class="s2"> </span><span class="si">%(message)s</span><span class="s2">"</span><span class="p">)</span>
<span class="n">stream_handler</span><span class="o">.</span><span class="n">setFormatter</span><span class="p">(</span><span class="n">json_formatter</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">"Error message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s2">"asctime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-04-23 21:28:50,047"</span><span class="p">,</span><span class="w"> </span><span class="s2">"node_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"198.197.196.195"</span><span class="p">,</span><span class="w"> </span><span class="s2">"process"</span><span class="p">:</span><span class="w"> </span><span class="mi">793</span><span class="p">,</span><span class="w"> </span><span class="s2">"thread"</span><span class="p">:</span><span class="w"> </span><span class="mi">140704422703936</span><span class="p">,</span><span class="w"> </span><span class="s2">"pathname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/folders/vb/ym0r3p412kg598rdky_lb5_w0000gn/T/ipykernel_793/2505421541.py"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lineno"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="s2">"levelname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ERROR"</span><span class="p">,</span><span class="w"> </span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Error message."</span><span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Some of these fields were added by the Filter that we built above, other fields are <a href="https://docs.python.org/3/library/logging.html#logrecord-attributes">default fields</a> provided by the Python logging module.</p>
<p>The JSON formatter can also add extra fields and values to the log record by using the "extra" parameter:</p>
<div class="highlight"><pre><span></span><code><span class="n">extra</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"action"</span><span class="p">:</span> <span class="s2">"predict"</span><span class="p">,</span>
<span class="s2">"model_qualified_name"</span><span class="p">:</span> <span class="s2">"model_qualified_name"</span><span class="p">,</span>
<span class="s2">"model_version"</span><span class="p">:</span> <span class="s2">"model_version"</span><span class="p">,</span>
<span class="s2">"status"</span><span class="p">:</span><span class="s2">"error"</span><span class="p">,</span>
<span class="s2">"error_info"</span><span class="p">:</span> <span class="s2">"error_info"</span>
<span class="p">}</span>
<span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">"message"</span><span class="p">,</span> <span class="n">extra</span><span class="o">=</span><span class="n">extra</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s2">"asctime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-04-23 21:28:50,057"</span><span class="p">,</span><span class="w"> </span><span class="s2">"node_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"198.197.196.195"</span><span class="p">,</span><span class="w"> </span><span class="s2">"process"</span><span class="p">:</span><span class="w"> </span><span class="mi">793</span><span class="p">,</span><span class="w"> </span><span class="s2">"thread"</span><span class="p">:</span><span class="w"> </span><span class="mi">140704422703936</span><span class="p">,</span><span class="w"> </span><span class="s2">"pathname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/folders/vb/ym0r3p412kg598rdky_lb5_w0000gn/T/ipykernel_793/1433050719.py"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lineno"</span><span class="p">:</span><span class="w"> </span><span class="mi">9</span><span class="p">,</span><span class="w"> </span><span class="s2">"levelname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ERROR"</span><span class="p">,</span><span class="w"> </span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"message"</span><span class="p">,</span><span class="w"> </span><span class="s2">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"predict"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_qualified_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"model_qualified_name"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"model_version"</span><span class="p">,</span><span class="w"> </span><span 
class="s2">"status"</span><span class="p">:</span><span class="w"> </span><span class="s2">"error"</span><span class="p">,</span><span class="w"> </span><span class="s2">"error_info"</span><span class="p">:</span><span class="w"> </span><span class="s2">"error_info"</span><span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>The extra fields are:</p>
<ul>
<li>action: the method called on the MLModel instance</li>
<li>model_qualified_name: the qualified name of the model</li>
<li>model_version: the version of the model</li>
<li>status: whether the action succeeded or not, can be "success" or "error"</li>
<li>error_info: extra error information, only present if an error occurred</li>
</ul>
<p>This information would normally be included in the "message" field of the log record as unstructured text, but by breaking it out and putting it into individual fields in the JSON log record we'll be able to parse it later.</p>
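<p>To see how simple the parsing becomes, here is a small sketch that loads a JSON log line (abbreviated from the record shown above) back into a Python dictionary:</p>

```python
import json

# a JSON-formatted log line parses directly into a dictionary, so any
# field can be read without writing a custom text parser
log_line = (
    '{"asctime": "2023-04-23 21:28:50,057", "levelname": "ERROR", '
    '"message": "message", "action": "predict", "status": "error"}'
)
record = json.loads(log_line)
print(record["action"], record["status"])
```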
<h3>Putting It All Together</h3>
<p>We've configured several pieces of the logging module individually; now we'll combine them into a single configuration that sets up the logger the way we want it.</p>
<p>The logging.config.dictConfig() function accepts the configuration for the loggers, formatters, handlers, and filters and sets them all up with a single function call.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">logging.config</span>
<span class="n">logging_config</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"version"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="s2">"disable_existing_loggers"</span><span class="p">:</span> <span class="kc">True</span><span class="p">,</span>
<span class="s2">"loggers"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"root"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"level"</span><span class="p">:</span> <span class="s2">"INFO"</span><span class="p">,</span>
<span class="s2">"handlers"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"stdout"</span><span class="p">],</span>
<span class="s2">"propagate"</span><span class="p">:</span> <span class="kc">False</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="s2">"filters"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"environment_info_filter"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"()"</span><span class="p">:</span> <span class="s2">"__main__.EnvironmentInfoFilter"</span><span class="p">,</span>
<span class="s2">"env_variables"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"NODE_IP"</span><span class="p">]</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="s2">"formatters"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"json_formatter"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"class"</span><span class="p">:</span> <span class="s2">"pythonjsonlogger.jsonlogger.JsonFormatter"</span><span class="p">,</span>
<span class="s2">"format"</span><span class="p">:</span> <span class="s2">"</span><span class="si">%(asctime)s</span><span class="s2"> </span><span class="si">%(node_ip)s</span><span class="s2"> </span><span class="si">%(name)s</span><span class="s2"> </span><span class="si">%(pathname)s</span><span class="s2"> </span><span class="si">%(lineno)s</span><span class="s2"> </span><span class="si">%(levelname)s</span><span class="s2"> </span><span class="si">%(message)s</span><span class="s2">"</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="s2">"handlers"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"stdout"</span><span class="p">:{</span>
<span class="s2">"level"</span><span class="p">:</span><span class="s2">"INFO"</span><span class="p">,</span>
<span class="s2">"class"</span><span class="p">:</span><span class="s2">"logging.StreamHandler"</span><span class="p">,</span>
<span class="s2">"stream"</span><span class="p">:</span> <span class="s2">"ext://sys.stdout"</span><span class="p">,</span>
<span class="s2">"formatter"</span><span class="p">:</span> <span class="s2">"json_formatter"</span><span class="p">,</span>
<span class="s2">"filters"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"environment_info_filter"</span><span class="p">]</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">logging</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">dictConfig</span><span class="p">(</span><span class="n">logging_config</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">()</span>
<span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"Debug message."</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Info message."</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">"Error message."</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s2">"asctime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-04-23 21:28:50,074"</span><span class="p">,</span><span class="w"> </span><span class="s2">"node_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"198.197.196.195"</span><span class="p">,</span><span class="w"> </span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"root"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pathname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/folders/vb/ym0r3p412kg598rdky_lb5_w0000gn/T/ipykernel_793/4067465749.py"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lineno"</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="s2">"levelname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"INFO"</span><span class="p">,</span><span class="w"> </span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Info message."</span><span class="p">}</span><span class="w"></span>
<span class="p">{</span><span class="s2">"asctime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-04-23 21:28:50,076"</span><span class="p">,</span><span class="w"> </span><span class="s2">"node_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"198.197.196.195"</span><span class="p">,</span><span class="w"> </span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"root"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pathname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/folders/vb/ym0r3p412kg598rdky_lb5_w0000gn/T/ipykernel_793/4067465749.py"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lineno"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"> </span><span class="s2">"levelname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ERROR"</span><span class="p">,</span><span class="w"> </span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Error message."</span><span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>The logger behaved in the same way as when we created it programmatically.</p>
<h2>Installing a Model</h2>
<p>We won't be training an ML model from scratch in this blog post because it would take up too much space. Instead, we'll reuse a model that we built in a <a href="https://www.tekhnoal.com/health-checks-for-ml-model-deployments.html">previous blog post</a>. The model's code is hosted in <a href="https://github.com/schmidtbri/health-checks-for-ml-model-deployments">this GitHub repository</a>. The model is used to predict credit risk.</p>
<p>The model itself can be installed as a normal Python package, using the pip command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">e</span> <span class="n">git</span><span class="o">+</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">schmidtbri</span><span class="o">/</span><span class="n">health</span><span class="o">-</span><span class="n">checks</span><span class="o">-</span><span class="k">for</span><span class="o">-</span><span class="n">ml</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">deployments</span><span class="c1">#egg=credit_risk_model</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Making a prediction with the model is done through the CreditRiskModel class, which we'll import like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">credit_risk_model.prediction.model</span> <span class="kn">import</span> <span class="n">CreditRiskModel</span>
</code></pre></div>
<p>Now we'll instantiate the model class in order to make a prediction.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">CreditRiskModel</span><span class="p">()</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>In order to make a prediction with the model instance, we'll need to instantiate the input:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">credit_risk_model.prediction.schemas</span> <span class="kn">import</span> <span class="n">CreditRiskModelInput</span><span class="p">,</span> <span class="n">EmploymentLength</span><span class="p">,</span> <span class="n">HomeOwnership</span><span class="p">,</span> \
<span class="n">LoanPurpose</span><span class="p">,</span> <span class="n">Term</span><span class="p">,</span> <span class="n">VerificationStatus</span>
<span class="n">model_input</span> <span class="o">=</span> <span class="n">CreditRiskModelInput</span><span class="p">(</span>
<span class="n">annual_income</span><span class="o">=</span><span class="mi">273000</span><span class="p">,</span>
<span class="n">collections_in_last_12_months</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
<span class="n">delinquencies_in_last_2_years</span><span class="o">=</span><span class="mi">39</span><span class="p">,</span>
<span class="n">debt_to_income_ratio</span><span class="o">=</span><span class="mf">42.64</span><span class="p">,</span>
<span class="n">employment_length</span><span class="o">=</span><span class="n">EmploymentLength</span><span class="o">.</span><span class="n">less_than_1_year</span><span class="p">,</span>
<span class="n">home_ownership</span><span class="o">=</span><span class="n">HomeOwnership</span><span class="o">.</span><span class="n">MORTGAGE</span><span class="p">,</span>
<span class="n">number_of_delinquent_accounts</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span>
<span class="n">interest_rate</span><span class="o">=</span><span class="mf">28.99</span><span class="p">,</span>
<span class="n">last_payment_amount</span><span class="o">=</span><span class="mf">36475.59</span><span class="p">,</span>
<span class="n">loan_amount</span><span class="o">=</span><span class="mi">35000</span><span class="p">,</span>
<span class="n">derogatory_public_record_count</span><span class="o">=</span><span class="mi">86</span><span class="p">,</span>
<span class="n">loan_purpose</span><span class="o">=</span><span class="n">LoanPurpose</span><span class="o">.</span><span class="n">debt_consolidation</span><span class="p">,</span>
<span class="n">revolving_line_utilization_rate</span><span class="o">=</span><span class="mf">892.3</span><span class="p">,</span>
<span class="n">term</span><span class="o">=</span><span class="n">Term</span><span class="o">.</span><span class="n">thirty_six_months</span><span class="p">,</span>
<span class="n">total_payments_to_date</span><span class="o">=</span><span class="mf">57777.58</span><span class="p">,</span>
<span class="n">verification_status</span><span class="o">=</span><span class="n">VerificationStatus</span><span class="o">.</span><span class="n">source_verified</span>
<span class="p">)</span>
</code></pre></div>
<p>The model's input schema is called CreditRiskModelInput and it holds all of the features required by the model to make a prediction.</p>
<p>Now we can make a prediction with the model by calling the predict() method with an instance of the CreditRiskModelInput class.</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>CreditRiskModelOutput(credit_risk=<CreditRisk.safe: 'safe'>)
</code></pre></div>
<p>The model predicts that the client's risk is safe.</p>
<p>The output is also provided as an object, and because the model is a classification model, the output is an Enum. We can view the schema of the model output by requesting the JSON schema from the object:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span><span class="o">.</span><span class="n">output_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'CreditRiskModelOutput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Credit risk model output schema.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'credit_risk'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Whether or not the loan is risky.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/CreditRisk'</span><span class="p">}]}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'required'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'credit_risk'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'definitions'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'CreditRisk'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'CreditRisk'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Indicates if loan is risky.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'safe'</span><span class="p">,</span><span class="w"> </span><span class="s1">'risky'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}}}</span><span class="w"></span>
</code></pre></div>
<p>The two possible outputs of the model are "safe" and "risky".</p>
<h2>Creating the Logging Decorator</h2>
<p>As you saw above, the model did not produce any logs. To emit logs about the model's activity, we'll create a decorator that performs logging around an MLModel instance. </p>
<p>In order to build an MLModel decorator class, we'll need to inherit from the MLModelDecorator class and add some functionality.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Optional</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">from</span> <span class="nn">ml_base.decorator</span> <span class="kn">import</span> <span class="n">MLModelDecorator</span>
<span class="kn">from</span> <span class="nn">ml_base.ml_model</span> <span class="kn">import</span> <span class="n">MLModelSchemaValidationException</span>
<span class="k">class</span> <span class="nc">LoggingDecorator</span><span class="p">(</span><span class="n">MLModelDecorator</span><span class="p">):</span>
<span class="sd">"""Decorator to do logging around an MLModel instance."""</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_fields</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">output_fields</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">input_fields</span><span class="o">=</span><span class="n">input_fields</span><span class="p">,</span> <span class="n">output_fields</span><span class="o">=</span><span class="n">output_fields</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_logger"</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_logger"</span><span class="p">]</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_logger"</span><span class="p">]</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s2">"</span><span class="si">{}</span><span class="s2">_</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span> <span class="s2">"logger"</span><span class="p">))</span>
<span class="c1"># extra fields to be added to the log record</span>
<span class="n">extra</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"action"</span><span class="p">:</span> <span class="s2">"predict"</span><span class="p">,</span>
<span class="s2">"model_qualified_name"</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span>
<span class="s2">"model_version"</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">version</span>
<span class="p">}</span>
<span class="c1"># adding model input fields to the extra fields to be logged</span>
<span class="n">new_extra</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">extra</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"input_fields"</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">for</span> <span class="n">input_field</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"input_fields"</span><span class="p">]:</span>
<span class="n">new_extra</span><span class="p">[</span><span class="n">input_field</span><span class="p">]</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">input_field</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_logger"</span><span class="p">]</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Prediction requested."</span><span class="p">,</span> <span class="n">extra</span><span class="o">=</span><span class="n">new_extra</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="n">extra</span><span class="p">[</span><span class="s2">"status"</span><span class="p">]</span> <span class="o">=</span> <span class="s2">"success"</span>
<span class="c1"># adding model output fields to the extra fields to be logged</span>
<span class="n">new_extra</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">extra</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"output_fields"</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">for</span> <span class="n">output_field</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"output_fields"</span><span class="p">]:</span>
<span class="n">new_extra</span><span class="p">[</span><span class="n">output_field</span><span class="p">]</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">prediction</span><span class="p">,</span> <span class="n">output_field</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_logger"</span><span class="p">]</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Prediction created."</span><span class="p">,</span> <span class="n">extra</span><span class="o">=</span><span class="n">new_extra</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="n">extra</span><span class="p">[</span><span class="s2">"status"</span><span class="p">]</span> <span class="o">=</span> <span class="s2">"error"</span>
<span class="n">extra</span><span class="p">[</span><span class="s2">"error_info"</span><span class="p">]</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_logger"</span><span class="p">]</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">"Prediction exception."</span><span class="p">,</span> <span class="n">extra</span><span class="o">=</span><span class="n">extra</span><span class="p">)</span>
<span class="k">raise</span> <span class="n">e</span>
</code></pre></div>
<p>The LoggingDecorator class keeps most of its logic in the predict() method. The method lazily instantiates a logger object and emits a message before the prediction is made, after it completes, and when an exception is raised. Notice that the exception information is logged, but the exception is immediately re-raised; we don't want to prevent the exception from being handled by whatever code is using the model, we only need to record the event.</p>
<p>The decorator also adds a few fields to the log message:</p>
<ul>
<li>action: the action that the model is performing, in this case "predict"</li>
<li>model_qualified_name: the qualified name of the model performing the action</li>
<li>model_version: the version of the model performing the action</li>
<li>status: the result of the action, can be either "success" or "error"</li>
<li>error_info: an optional field that adds error information when an exception is raised</li>
</ul>
<p>These fields are added on top of all the regular fields that the logging package provides. The extra information should allow us to easily filter logs later.</p>
<h2>Decorating the Model</h2>
<p>To test out the decorator we’ll first instantiate the model object that we want to use with the decorator.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">CreditRiskModel</span><span class="p">()</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Next, we’ll instantiate the decorator:</p>
<div class="highlight"><pre><span></span><code><span class="n">logging_decorator</span> <span class="o">=</span> <span class="n">LoggingDecorator</span><span class="p">()</span>
</code></pre></div>
<p>We can add the model instance to the decorator after it’s been instantiated like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span> <span class="o">=</span> <span class="n">logging_decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>We can see the decorator and the model objects by printing the reference to the decorator:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>LoggingDecorator(CreditRiskModel)
</code></pre></div>
<p>The decorator object is printing out its own type along with the type of the model that it is decorating.</p>
<p>Now we can try out the logging decorator by making a prediction:</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s2">"asctime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-04-23 21:28:57,431"</span><span class="p">,</span><span class="w"> </span><span class="s2">"node_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"198.197.196.195"</span><span class="p">,</span><span class="w"> </span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"credit_risk_model_logger"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pathname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/folders/vb/ym0r3p412kg598rdky_lb5_w0000gn/T/ipykernel_793/3804123212.py"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lineno"</span><span class="p">:</span><span class="w"> </span><span class="mi">33</span><span class="p">,</span><span class="w"> </span><span class="s2">"levelname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"INFO"</span><span class="p">,</span><span class="w"> </span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Prediction requested."</span><span class="p">,</span><span class="w"> </span><span class="s2">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"predict"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_qualified_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"credit_risk_model"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.1.0"</span><span class="p">}</span><span class="w"></span>
<span class="p">{</span><span class="s2">"asctime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-04-23 21:28:57,452"</span><span class="p">,</span><span class="w"> </span><span class="s2">"node_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"198.197.196.195"</span><span class="p">,</span><span class="w"> </span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"credit_risk_model_logger"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pathname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/folders/vb/ym0r3p412kg598rdky_lb5_w0000gn/T/ipykernel_793/3804123212.py"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lineno"</span><span class="p">:</span><span class="w"> </span><span class="mi">44</span><span class="p">,</span><span class="w"> </span><span class="s2">"levelname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"INFO"</span><span class="p">,</span><span class="w"> </span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Prediction created."</span><span class="p">,</span><span class="w"> </span><span class="s2">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"predict"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_qualified_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"credit_risk_model"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.1.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"status"</span><span class="p">:</span><span class="w"> </span><span class="s2">"success"</span><span class="p">}</span><span class="w"></span>
<span class="n">CreditRiskModelOutput</span><span class="p">(</span><span class="n">credit_risk</span><span class="o">=<</span><span class="n">CreditRisk</span><span class="o">.</span><span class="n">safe</span><span class="p">:</span><span class="w"> </span><span class="s1">'safe'</span><span class="o">></span><span class="p">)</span><span class="w"></span>
</code></pre></div>
<p>Calling the predict() method on the decorated model now emits two log messages. The first, "Prediction requested.", is emitted before the model's predict method is called. The second, "Prediction created.", is emitted after the model returns the prediction to the decorator. The decorator can also log any exceptions raised by the model.</p>
<p>The logging decorator is also able to grab fields from the model's input and output and log those alongside the other fields. Here is how to configure the logging decorator to do this:</p>
<div class="highlight"><pre><span></span><code><span class="n">logging_decorator</span> <span class="o">=</span> <span class="n">LoggingDecorator</span><span class="p">(</span><span class="n">input_fields</span><span class="o">=</span><span class="p">[</span><span class="s2">"collections_in_last_12_months"</span><span class="p">,</span> <span class="s2">"debt_to_income_ratio"</span><span class="p">],</span>
<span class="n">output_fields</span><span class="o">=</span><span class="p">[</span><span class="s2">"credit_risk"</span><span class="p">])</span>
<span class="n">decorated_model</span> <span class="o">=</span> <span class="n">logging_decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s2">"asctime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-04-23 21:28:57,461"</span><span class="p">,</span><span class="w"> </span><span class="s2">"node_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"198.197.196.195"</span><span class="p">,</span><span class="w"> </span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"credit_risk_model_logger"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pathname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/folders/vb/ym0r3p412kg598rdky_lb5_w0000gn/T/ipykernel_793/3804123212.py"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lineno"</span><span class="p">:</span><span class="w"> </span><span class="mi">33</span><span class="p">,</span><span class="w"> </span><span class="s2">"levelname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"INFO"</span><span class="p">,</span><span class="w"> </span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Prediction requested."</span><span class="p">,</span><span class="w"> </span><span class="s2">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"predict"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_qualified_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"credit_risk_model"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.1.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"collections_in_last_12_months"</span><span class="p">:</span><span class="w"> </span><span class="mi">20</span><span class="p">,</span><span class="w"> 
</span><span class="s2">"debt_to_income_ratio"</span><span class="p">:</span><span class="w"> </span><span class="mf">42.64</span><span class="p">}</span><span class="w"></span>
<span class="p">{</span><span class="s2">"asctime"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2023-04-23 21:28:57,480"</span><span class="p">,</span><span class="w"> </span><span class="s2">"node_ip"</span><span class="p">:</span><span class="w"> </span><span class="s2">"198.197.196.195"</span><span class="p">,</span><span class="w"> </span><span class="s2">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"credit_risk_model_logger"</span><span class="p">,</span><span class="w"> </span><span class="s2">"pathname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/folders/vb/ym0r3p412kg598rdky_lb5_w0000gn/T/ipykernel_793/3804123212.py"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lineno"</span><span class="p">:</span><span class="w"> </span><span class="mi">44</span><span class="p">,</span><span class="w"> </span><span class="s2">"levelname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"INFO"</span><span class="p">,</span><span class="w"> </span><span class="s2">"message"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Prediction created."</span><span class="p">,</span><span class="w"> </span><span class="s2">"action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"predict"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_qualified_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"credit_risk_model"</span><span class="p">,</span><span class="w"> </span><span class="s2">"model_version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.1.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"status"</span><span class="p">:</span><span class="w"> </span><span class="s2">"success"</span><span class="p">,</span><span class="w"> </span><span class="s2">"credit_risk"</span><span 
class="p">:</span><span class="w"> </span><span class="s2">"safe"</span><span class="p">}</span><span class="w"></span>
<span class="n">CreditRiskModelOutput</span><span class="p">(</span><span class="n">credit_risk</span><span class="o">=<</span><span class="n">CreditRisk</span><span class="o">.</span><span class="n">safe</span><span class="p">:</span><span class="w"> </span><span class="s1">'safe'</span><span class="o">></span><span class="p">)</span><span class="w"></span>
</code></pre></div>
<p>The "Prediction requested." log message now has two extra fields, the "collections_in_last_12_months" field and the "debt_to_income_ratio" field which were directly copied from the model input. The "Prediction created." log message also has the "credit_risk" field, which is the prediction returned by the model.</p>
<p>We now have a working logging decorator that can add logging around a model that does not log for itself.</p>
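<p>The core of such a decorator can be sketched in a few lines. This is an illustrative simplification, not the actual LoggingDecorator implementation from the package; it assumes the model exposes <code>qualified_name</code>, <code>version</code>, and <code>predict()</code> attributes, as in the examples above.</p>

```python
import logging


class SimpleLoggingDecorator:
    """Illustrative sketch of a logging decorator for an ML model."""

    def __init__(self, input_fields=None, output_fields=None):
        self._input_fields = input_fields or []
        self._output_fields = output_fields or []
        self._model = None
        self._logger = None

    def set_model(self, model):
        self._model = model
        self._logger = logging.getLogger("{}_logger".format(model.qualified_name))
        return self

    def predict(self, data):
        extra = {
            "action": "predict",
            "model_qualified_name": self._model.qualified_name,
            "model_version": self._model.version,
        }
        # copy the configured fields out of the model input
        extra.update({f: getattr(data, f) for f in self._input_fields})
        self._logger.info("Prediction requested.", extra=extra)
        try:
            prediction = self._model.predict(data)
        except Exception:
            # exceptions raised by the model are logged and re-raised
            self._logger.exception("Prediction failed.",
                                   extra={**extra, "status": "error"})
            raise
        # copy the configured fields out of the model output
        extra.update({f: getattr(prediction, f) for f in self._output_fields})
        self._logger.info("Prediction created.", extra={**extra, "status": "success"})
        return prediction
```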
<h2>Adding the Decorator to a Deployed Model</h2>
<p>Now that we have a decorator that works locally, we can deploy it with a model inside of a service. The <a href="https://pypi.org/project/rest-model-service/">rest_model_service package</a> is able to host ML models and create a RESTful API for each individual model. We don't need to write any code to do this, because the service can wrap the models that it hosts with decorators that we provide. You can learn more about the package in <a href="https://www.tekhnoal.com/rest-model-service.html">this blog post</a>, and about configuring it to add decorators to a model in <a href="https://www.tekhnoal.com/ml-model-decorators.html">this blog post</a>.</p>
<p>To install the service package, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="s2">&quot;rest_model_service&gt;=0.3.0&quot;</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The configuration for our model and decorator looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Credit Risk Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">credit_risk_model.prediction.model.CreditRiskModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
<span class="w"> </span><span class="nt">decorators</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ml_model_logging.logging_decorator.LoggingDecorator</span><span class="w"></span>
<span class="w"> </span><span class="nt">configuration</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">input_fields</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="s">"collections_in_last_12_months"</span><span class="p p-Indicator">,</span><span class="w"> </span><span class="s">"debt_to_income_ratio"</span><span class="p p-Indicator">]</span><span class="w"></span>
<span class="w"> </span><span class="nt">output_fields</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="s">"credit_risk"</span><span class="p p-Indicator">]</span><span class="w"></span>
<span class="nt">logging</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span><span class="w"></span>
<span class="w"> </span><span class="nt">disable_existing_loggers</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">false</span><span class="w"></span>
<span class="w"> </span><span class="nt">formatters</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">json_formatter</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pythonjsonlogger.jsonlogger.JsonFormatter</span><span class="w"></span>
<span class="w"> </span><span class="nt">format</span><span class="p">:</span><span class="w"> </span><span class="s">"%(asctime)s</span><span class="nv"> </span><span class="s">%(node_ip)s</span><span class="nv"> </span><span class="s">%(name)s</span><span class="nv"> </span><span class="s">%(levelname)s</span><span class="nv"> </span><span class="s">%(message)s"</span><span class="w"></span>
<span class="w"> </span><span class="nt">filters</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">environment_info_filter</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="s">"()"</span><span class="p p-Indicator">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ml_model_logging.filters.EnvironmentInfoFilter</span><span class="w"></span>
<span class="w"> </span><span class="nt">env_variables</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">NODE_IP</span><span class="w"></span>
<span class="w"> </span><span class="nt">handlers</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">stdout</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">level</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">INFO</span><span class="w"></span>
<span class="w"> </span><span class="nt">class</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">logging.StreamHandler</span><span class="w"></span>
<span class="w"> </span><span class="nt">stream</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ext://sys.stdout</span><span class="w"></span>
<span class="w"> </span><span class="nt">formatter</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">json_formatter</span><span class="w"></span>
<span class="w"> </span><span class="nt">filters</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">environment_info_filter</span><span class="w"></span>
<span class="w"> </span><span class="nt">loggers</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">root</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">level</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">INFO</span><span class="w"></span>
<span class="w"> </span><span class="nt">handlers</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">stdout</span><span class="w"></span>
<span class="w"> </span><span class="nt">propagate</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
</code></pre></div>
<p>The two main sections in the file are the "models" section and the "logging" section. The models section lists the CreditRiskModel along with the LoggingDecorator and its configuration; when the service starts up, it adds an instance of the LoggingDecorator to the CreditRiskModel.</p>
<p>The logging configuration is set up exactly like we set it up in the examples above except that it is in YAML format. The YAML is converted to a dictionary and passed directly into the logging.config.dictConfig() function.</p>
<p>To run the service locally, execute these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">NODE_IP</span><span class="o">=</span><span class="m">123</span>.123.123.123
<span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>./configuration/rest_configuration.yaml
uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>The NODE_IP environment variable is set so that the value can be added to the log messages through the filter we built above. The service should come up and can be accessed in a web browser at http://127.0.0.1:8000. When you access that URL you will be redirected to the documentation page that is generated by the FastAPI package:</p>
<p><img alt="Service Documentation" src="https://www.tekhnoal.com/service_documentation_lfmlm.png" width="100%"></p>
<p>The documentation allows you to make requests against the API in order to try it out. Here's a prediction request against the credit risk model:</p>
<p><img alt="Prediction Request" src="https://www.tekhnoal.com/prediction_request_lfmlm.png" width="100%"></p>
<p>And the prediction result:</p>
<p><img alt="Prediction Response" src="https://www.tekhnoal.com/prediction_response_lfmlm.png" width="100%"></p>
<p>The prediction made by the model had to go through the logging decorator that we configured into the service, so we got these two log records from the process:</p>
<p><img alt="Prediction Log" src="https://www.tekhnoal.com/prediction_log_lfmlm.png" width="100%"></p>
<p>The local web service process emits the logs to stdout just as we configured it.</p>
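<p>The node IP value reaches the log records through the environment-info filter that the logging configuration references. A minimal version of such a filter (a sketch of the one built earlier in the post) subclasses logging.Filter and copies environment variables onto each record:</p>

```python
import logging
import os


class EnvironmentInfoFilter(logging.Filter):
    """Sketch of a filter that adds environment variables to log records.

    Each configured variable name becomes a lower-cased attribute on the
    record, so NODE_IP appears as the "node_ip" field in the output.
    """

    def __init__(self, env_variables=None):
        super().__init__()
        self._env_variables = env_variables or []

    def filter(self, record):
        for name in self._env_variables:
            setattr(record, name.lower(), os.environ.get(name, ""))
        return True  # never drop the record, only annotate it
```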
<h2>Deploying the Model Service</h2>
<p>Now that we have a working service that is running locally, we can work on deploying it to Kubernetes.</p>
<h3>Creating a Docker Image</h3>
<p>Kubernetes needs a Docker image in order to deploy anything, so we'll build an image using this Dockerfile:</p>
<div class="highlight"><pre><span></span><code><span class="c"># syntax=docker/dockerfile:1</span>
<span class="k">FROM</span><span class="w"> </span><span class="s">python:3.9-slim</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="s">base</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">/dependencies</span>
<span class="c"># installing git because we need to install the model package from the github repository</span>
<span class="k">RUN</span><span class="w"> </span>apt-get update -y <span class="o">&&</span> <span class="se">\</span>
apt-get install -y --no-install-recommends git
<span class="c"># creating and activating a virtual environment</span>
<span class="k">ENV</span><span class="w"> </span><span class="nv">VIRTUAL_ENV</span><span class="o">=</span>/opt/venv
<span class="k">RUN</span><span class="w"> </span>python3 -m venv <span class="nv">$VIRTUAL_ENV</span>
<span class="k">ENV</span><span class="w"> </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$VIRTUAL_ENV</span><span class="s2">/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>
<span class="c"># installing dependencies</span>
<span class="k">COPY</span><span class="w"> </span>./service_requirements.txt ./service_requirements.txt
<span class="k">RUN</span><span class="w"> </span>pip install --no-cache -r service_requirements.txt
<span class="k">FROM</span><span class="w"> </span><span class="s">base</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="s">runtime</span>
<span class="k">ARG</span><span class="w"> </span>DATE_CREATED
<span class="k">ARG</span><span class="w"> </span>REVISION
<span class="k">ARG</span><span class="w"> </span>VERSION
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.title<span class="o">=</span><span class="s2">"Logging for ML Models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.description<span class="o">=</span><span class="s2">"Logging for machine learning models."</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.created<span class="o">=</span><span class="nv">$DATE_CREATED</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.authors<span class="o">=</span><span class="s2">"6666331+schmidtbri@users.noreply.github.com"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.source<span class="o">=</span><span class="s2">"https://github.com/schmidtbri/logging-for-ml-models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.version<span class="o">=</span><span class="nv">$VERSION</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.revision<span class="o">=</span><span class="nv">$REVISION</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.licenses<span class="o">=</span><span class="s2">"MIT License"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.base.name<span class="o">=</span><span class="s2">"python:3.9-slim"</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">/service</span>
<span class="c"># install packages</span>
<span class="k">RUN</span><span class="w"> </span>apt-get update -y <span class="o">&&</span> <span class="se">\</span>
apt-get install -y --no-install-recommends libgomp1 <span class="o">&&</span> <span class="se">\</span>
apt-get clean <span class="o">&&</span> <span class="se">\</span>
rm -rf /var/lib/apt/lists/*
<span class="k">COPY</span><span class="w"> </span>--from<span class="o">=</span>base /opt/venv ./venv
<span class="k">COPY</span><span class="w"> </span>./ml_model_logging ./ml_model_logging
<span class="k">COPY</span><span class="w"> </span>./LICENSE ./LICENSE
<span class="k">ENV</span><span class="w"> </span>PATH /service/venv/bin:<span class="nv">$PATH</span>
<span class="k">ENV</span><span class="w"> </span><span class="nv">PYTHONPATH</span><span class="o">=</span><span class="s2">"</span><span class="si">${</span><span class="nv">PYTHONPATH</span><span class="si">}</span><span class="s2">:/service"</span>
<span class="k">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">"uvicorn"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rest_model_service.main:app"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--host"</span><span class="p">,</span><span class="w"> </span><span class="s2">"0.0.0.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--port"</span><span class="p">,</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">]</span>
</code></pre></div>
<p>The Dockerfile includes a set of labels from the <a href="https://github.com/opencontainers/image-spec/blob/main/annotations.md">Open Containers annotations specification</a>. Most of the labels are hardcoded in the Dockerfile, but there are three that we need to add from the outside: the date created, the version, and the revision. To do this we'll pull some information into environment variables:</p>
<div class="highlight"><pre><span></span><code><span class="n">DATE_CREATED</span><span class="o">=</span><span class="err">!</span><span class="n">date</span> <span class="o">+</span><span class="s2">"%Y-%m-</span><span class="si">%d</span><span class="s2"> %T"</span>
<span class="n">REVISION</span><span class="o">=</span><span class="err">!</span><span class="n">git</span> <span class="n">rev</span><span class="o">-</span><span class="n">parse</span> <span class="n">HEAD</span>
<span class="err">!</span><span class="n">echo</span> <span class="s2">"$DATE_CREATED"</span>
<span class="err">!</span><span class="n">echo</span> <span class="s2">"$REVISION"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="k">['2023-04-23 21:30:31']</span><span class="w"></span>
<span class="k">['88a78deb3ed38e5bff5f0633fa4a4bf6202b868f']</span><span class="w"></span>
</code></pre></div>
<p>Now we can use the values to build the image. We'll also provide the version as a build argument.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">build</span> \
<span class="o">--</span><span class="n">build</span><span class="o">-</span><span class="n">arg</span> <span class="n">DATE_CREATED</span><span class="o">=</span><span class="s2">"$DATE_CREATED"</span> \
<span class="o">--</span><span class="n">build</span><span class="o">-</span><span class="n">arg</span> <span class="n">VERSION</span><span class="o">=</span><span class="s2">"0.1.0"</span> \
<span class="o">--</span><span class="n">build</span><span class="o">-</span><span class="n">arg</span> <span class="n">REVISION</span><span class="o">=</span><span class="s2">"$REVISION"</span> \
<span class="o">-</span><span class="n">t</span> <span class="n">credit_risk_model_service</span><span class="p">:</span><span class="mf">0.1.0</span> <span class="o">..</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To find the image we just built, we'll search through the local docker images:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">images</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">credit_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>credit_risk_model_service 0.1.0 10985e3d96bd 9 seconds ago 922MB
</code></pre></div>
<p>Next, we'll start the image to see if everything is working as expected.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">8000</span><span class="p">:</span><span class="mi">8000</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">REST_CONFIG</span><span class="o">=./</span><span class="n">configuration</span><span class="o">/</span><span class="n">rest_configuration</span><span class="o">.</span><span class="n">yaml</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">NODE_IP</span><span class="o">=</span><span class="s2">"123.123.123.123"</span> \
<span class="o">-</span><span class="n">v</span> <span class="err">$</span><span class="p">(</span><span class="n">pwd</span><span class="p">)</span><span class="o">/../</span><span class="n">configuration</span><span class="p">:</span><span class="o">/</span><span class="n">service</span><span class="o">/</span><span class="n">configuration</span> \
<span class="o">--</span><span class="n">name</span> <span class="n">credit_risk_model_service</span> \
<span class="n">credit_risk_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">265</span><span class="n">c3f15cae7c9b9788f0c1c96c66dcf28e7bba7b48f002671dc674cf1982f19</span><span class="w"></span>
</code></pre></div>
<p>Notice that we added an environment variable called NODE_IP. This is just so we have a value to pull into the logs later; it's not the real node IP address.</p>
<p>The service is up and running in the docker container. To view the logs coming out of the process, we'll use the docker logs command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">logs</span> <span class="n">credit_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{<span class="s2">"</span><span class="s">asctime</span><span class="s2">"</span>: <span class="s2">"</span><span class="s">2023-04-24 01:31:03,901</span><span class="s2">"</span>, <span class="s2">"</span><span class="s">node_ip</span><span class="s2">"</span>: <span class="s2">"</span><span class="s">123.123.123.123</span><span class="s2">"</span>, <span class="s2">"</span><span class="s">name</span><span class="s2">"</span>: <span class="s2">"</span><span class="s">rest_model_service.helpers</span><span class="s2">"</span>, <span class="s2">"</span><span class="s">levelname</span><span class="s2">"</span>: <span class="s2">"</span><span class="s">INFO</span><span class="s2">"</span>, <span class="s2">"</span><span class="s">message</span><span class="s2">"</span>: <span class="s2">"</span><span class="s">Creating FastAPI app for: 'Credit Risk Model Service'.</span><span class="s2">"</span>}
<span class="nv">INFO</span>: <span class="nv">Started</span> <span class="nv">server</span> <span class="nv">process</span> [<span class="mi">1</span>]
<span class="nv">INFO</span>: <span class="nv">Waiting</span> <span class="k">for</span> <span class="nv">application</span> <span class="nv">startup</span>.
<span class="nv">INFO</span>: <span class="nv">Application</span> <span class="nv">startup</span> <span class="nv">complete</span>.
<span class="nv">INFO</span>: <span class="nv">Uvicorn</span> <span class="nv">running</span> <span class="nv">on</span> <span class="nv">http</span>:<span class="o">//</span><span class="mi">0</span>.<span class="mi">0</span>.<span class="mi">0</span>.<span class="mi">0</span>:<span class="mi">8000</span> <span class="ss">(</span><span class="nv">Press</span> <span class="nv">CTRL</span><span class="o">+</span><span class="nv">C</span> <span class="nv">to</span> <span class="nv">quit</span><span class="ss">)</span>
</code></pre></div>
<p>As we expected, most of the logs are coming out in JSON format, although some are not. The non-JSON logs are emitted by logger objects that were initialized before the rest_model_service package had a chance to configure logging.</p>
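<p>One benefit of JSON-formatted logs is that downstream tools can parse each line into structured fields instead of scraping text with regular expressions. For example (a sketch using an abbreviated log line like the ones above):</p>

```python
import json

# an abbreviated log line like the ones emitted by the service
line = ('{"asctime": "2023-04-24 01:31:03,901", "node_ip": "123.123.123.123", '
        '"levelname": "INFO", "message": "Prediction created.", '
        '"model_qualified_name": "credit_risk_model", "status": "success"}')

record = json.loads(line)
# structured fields can be queried directly
if record["levelname"] == "INFO" and record.get("status") == "success":
    print(record["model_qualified_name"], record["message"])
```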
<p>The service should be accessible on port 8000 of localhost, so we'll try to make a prediction using the curl command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:8000/api/models/credit_risk_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s1">'{ </span><span class="se">\</span>
<span class="s1"> "annual_income": 273000, </span><span class="se">\</span>
<span class="s1"> "collections_in_last_12_months": 20, </span><span class="se">\</span>
<span class="s1"> "delinquencies_in_last_2_years": 39, </span><span class="se">\</span>
<span class="s1"> "debt_to_income_ratio": 42.64, </span><span class="se">\</span>
<span class="s1"> "employment_length": "< 1 year", </span><span class="se">\</span>
<span class="s1"> "home_ownership": "MORTGAGE", </span><span class="se">\</span>
<span class="s1"> "number_of_delinquent_accounts": 6, </span><span class="se">\</span>
<span class="s1"> "interest_rate": 28.99, </span><span class="se">\</span>
<span class="s1"> "last_payment_amount": 36475.59, </span><span class="se">\</span>
<span class="s1"> "loan_amount": 35000, </span><span class="se">\</span>
<span class="s1"> "derogatory_public_record_count": 86, </span><span class="se">\</span>
<span class="s1"> "loan_purpose": "debt_consolidation", </span><span class="se">\</span>
<span class="s1"> "revolving_line_utilization_rate": 892.3, </span><span class="se">\</span>
<span class="s1"> "term": " 36 months", </span><span class="se">\</span>
<span class="s1"> "total_payments_to_date": 57777.58, </span><span class="se">\</span>
<span class="s1"> "verification_status": "Source Verified" </span><span class="se">\</span>
<span class="s1">}'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"credit_risk":"safe"}
</code></pre></div>
<p>We're done with the docker container, so we'll stop it and remove it.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">credit_risk_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">credit_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>credit_risk_model_service
credit_risk_model_service
</code></pre></div>
<h2>Creating a Kubernetes Cluster</h2>
<p>To show the system in action, we’ll deploy the model service to a Kubernetes cluster. A local cluster can be easily started by using <a href="https://minikube.sigs.k8s.io/docs/">minikube</a>. Installation instructions can be found <a href="https://minikube.sigs.k8s.io/docs/start/">here</a>.</p>
<p>To start the minikube cluster execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">start</span> <span class="o">--</span><span class="n">memory</span> <span class="mi">4196</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>😄 <span class="nv">minikube</span> <span class="nv">v1</span>.<span class="mi">30</span>.<span class="mi">1</span> <span class="nv">on</span> <span class="nv">Darwin</span> <span class="mi">13</span>.<span class="mi">3</span>.<span class="mi">1</span>
✨ <span class="nv">Using</span> <span class="nv">the</span> <span class="nv">docker</span> <span class="nv">driver</span> <span class="nv">based</span> <span class="nv">on</span> <span class="nv">existing</span> <span class="nv">profile</span>
👍 <span class="nv">Starting</span> <span class="nv">control</span> <span class="nv">plane</span> <span class="nv">node</span> <span class="nv">minikube</span> <span class="nv">in</span> <span class="nv">cluster</span> <span class="nv">minikube</span>
🚜 <span class="nv">Pulling</span> <span class="nv">base</span> <span class="nv">image</span> ...
🔄 <span class="nv">Restarting</span> <span class="nv">existing</span> <span class="nv">docker</span> <span class="nv">container</span> <span class="k">for</span> <span class="s2">"</span><span class="s">minikube</span><span class="s2">"</span> ...
🐳 <span class="nv">Preparing</span> <span class="nv">Kubernetes</span> <span class="nv">v1</span>.<span class="mi">26</span>.<span class="mi">3</span> <span class="nv">on</span> <span class="nv">Docker</span> <span class="mi">23</span>.<span class="mi">0</span>.<span class="mi">2</span> ...
🔗 <span class="nv">Configuring</span> <span class="nv">bridge</span> <span class="nv">CNI</span> <span class="ss">(</span><span class="nv">Container</span> <span class="nv">Networking</span> <span class="nv">Interface</span><span class="ss">)</span> ...
🔎 <span class="nv">Verifying</span> <span class="nv">Kubernetes</span> <span class="nv">components</span>...
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">gcr</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">k8s</span><span class="o">-</span><span class="nv">minikube</span><span class="o">/</span><span class="nv">storage</span><span class="o">-</span><span class="nv">provisioner</span>:<span class="nv">v5</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">docker</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">kubernetesui</span><span class="o">/</span><span class="nv">dashboard</span>:<span class="nv">v2</span>.<span class="mi">7</span>.<span class="mi">0</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">docker</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">kubernetesui</span><span class="o">/</span><span class="nv">metrics</span><span class="o">-</span><span class="nv">scraper</span>:<span class="nv">v1</span>.<span class="mi">0</span>.<span class="mi">8</span>
💡 <span class="nv">Some</span> <span class="nv">dashboard</span> <span class="nv">features</span> <span class="nv">require</span> <span class="nv">the</span> <span class="nv">metrics</span><span class="o">-</span><span class="nv">server</span> <span class="nv">addon</span>. <span class="nv">To</span> <span class="nv">enable</span> <span class="nv">all</span> <span class="nv">features</span> <span class="nv">please</span> <span class="nv">run</span>:
<span class="nv">minikube</span> <span class="nv">addons</span> <span class="nv">enable</span> <span class="nv">metrics</span><span class="o">-</span><span class="nv">server</span>
🌟 <span class="nv">Enabled</span> <span class="nv">addons</span>: <span class="nv">storage</span><span class="o">-</span><span class="nv">provisioner</span>, <span class="nv">default</span><span class="o">-</span><span class="nv">storageclass</span>, <span class="nv">dashboard</span>
🏄 <span class="nv">Done</span><span class="o">!</span> <span class="nv">kubectl</span> <span class="nv">is</span> <span class="nv">now</span> <span class="nv">configured</span> <span class="nv">to</span> <span class="nv">use</span> <span class="s2">"</span><span class="s">minikube</span><span class="s2">"</span> <span class="nv">cluster</span> <span class="nv">and</span> <span class="s2">"</span><span class="s">default</span><span class="s2">"</span> <span class="nv">namespace</span> <span class="nv">by</span> <span class="nv">default</span>
</code></pre></div>
<p>Let's view all of the pods running in the minikube cluster to make sure we can connect to it using the kubectl command.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">-</span><span class="n">A</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAMESPACE              NAME                                        READY   STATUS    RESTARTS       AGE
kube-system            coredns-787d4945fb-48bzx                    1/1     Running   2 (3d5h ago)   4d23h
kube-system            etcd-minikube                               1/1     Running   2 (3d5h ago)   4d23h
kube-system            kube-apiserver-minikube                     1/1     Running   2 (3d5h ago)   4d23h
kube-system            kube-controller-manager-minikube            1/1     Running   2 (3d5h ago)   4d23h
kube-system            kube-proxy-jj4pz                            1/1     Running   2 (3d5h ago)   4d23h
kube-system            kube-scheduler-minikube                     1/1     Running   2 (3d5h ago)   4d23h
kube-system            storage-provisioner                         1/1     Running   6 (33s ago)    4d23h
kubernetes-dashboard   dashboard-metrics-scraper-5c6664855-fgpqq   1/1     Running   2 (3d5h ago)   4d23h
kubernetes-dashboard   kubernetes-dashboard-55c4cbbc7c-ddx2q       1/1     Running   4 (32s ago)    4d23h
</code></pre></div>
<p>It looks like we can connect, so we're ready to start deploying the model service to the cluster.</p>
<h3>Creating a Namespace</h3>
<p>Now that we have a cluster and are connected to it, we'll create a namespace to hold the resources for our model deployment. The resource definition is in the kubernetes/namespace.yaml file. To apply the manifest to the cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace/model-services created
</code></pre></div>
<p>To take a look at the namespaces, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">namespace</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                   STATUS   AGE
default                Active   4d23h
kube-node-lease        Active   4d23h
kube-public            Active   4d23h
kube-system            Active   4d23h
kubernetes-dashboard   Active   4d23h
model-services         Active   1s
</code></pre></div>
<p>The new namespace should appear in the listing along with other namespaces created by default by the system. </p>
<h3>Creating the Model Service</h3>
<p>The model service is deployed by using Kubernetes resources. These are:</p>
<ul>
<li>ConfigMap: a set of configuration options; in this case it is a simple YAML file that will be loaded into the running container as a volume mount. This resource allows us to change the configuration of the model service without having to rebuild the Docker image.</li>
<li>Deployment: a declarative way to manage a set of Pods; the model service pods are managed through the Deployment.</li>
<li>Service: a way to expose the Pods in a Deployment; the model service is made available to the outside world through the Service.</li>
</ul>
<p>These resources are defined in the kubernetes/model_service.yaml file; the file is long, so we won't list it here. The env section of the container definition in the Deployment has a special section that allows us to access information about the pod and the node:</p>
<div class="highlight"><pre><span></span><code><span class="nn">...</span><span class="w"></span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">REST_CONFIG</span><span class="w"></span>
<span class="w"> </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">./configuration/kubernetes_rest_config.yaml</span><span class="w"></span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">POD_NAME</span><span class="w"></span>
<span class="w"> </span><span class="nt">valueFrom</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">fieldRef</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">fieldPath</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">metadata.name</span><span class="w"></span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">NODE_NAME</span><span class="w"></span>
<span class="w"> </span><span class="nt">valueFrom</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">fieldRef</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">fieldPath</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">spec.nodeName</span><span class="w"></span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">APP_NAME</span><span class="w"></span>
<span class="w"> </span><span class="nt">valueFrom</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">fieldRef</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">fieldPath</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">metadata.labels['app']</span><span class="w"></span>
<span class="nn">...</span><span class="w"></span>
</code></pre></div>
<p>The pod definition uses the <a href="https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/">downward API provided by Kubernetes</a> to access the node name, the pod name, and the contents of the 'app' label. This information is made available as environment variables. We'll add it to the log by listing the names of the environment variables in the logger configuration that we'll give to the model service. Earlier, we built a logging context class for exactly this purpose: adding environment variables to log records.</p>
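<p>To recap how that kind of logging context works, a minimal sketch is a <code>logging.Filter</code> that copies named environment variables onto every record before it is formatted. The class and field names below are illustrative, not necessarily the exact ones used earlier in this post:</p>

```python
import logging
import os


class EnvironmentLoggingFilter(logging.Filter):
    """Copy selected environment variables onto every log record.

    The mapping's keys are the field names added to the record; its
    values are the environment variables to read. This is a hypothetical
    stand-in for the logging context class built earlier.
    """

    def __init__(self, fields: dict) -> None:
        super().__init__()
        self.fields = fields

    def filter(self, record: logging.LogRecord) -> bool:
        for field_name, env_var in self.fields.items():
            # fall back to a placeholder when the variable is not set
            setattr(record, field_name, os.environ.get(env_var, "unknown"))
        return True  # never drop records, only annotate them


# attaching the filter makes pod_name, node_name, and app_name
# available to any formatter, including a JSON formatter
handler = logging.StreamHandler()
handler.addFilter(EnvironmentLoggingFilter({
    "pod_name": "POD_NAME",
    "node_name": "NODE_NAME",
    "app_name": "APP_NAME",
}))
```

<p>Because the filter reads the variables at logging time, the same Docker image picks up whatever values the downward API injects into each pod.</p>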
<p>We're almost ready to deploy the model service, but before starting it we'll need to send the docker image from the local docker daemon to the minikube image cache:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">load</span> <span class="n">credit_risk_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<p>We can view the images in the minikube cache with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">credit_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>docker.io/library/credit_risk_model_service:0.1.0
</code></pre></div>
<p>The model service will need to access the YAML configuration file that we used for the local service above. This file is in the /configuration folder and is called "kubernetes_rest_config.yaml"; it's customized for the Kubernetes environment we're building.</p>
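<p>We won't reproduce the whole file here, but the relevant idea is that its logging configuration refers to the environment variables injected by the downward API. A hypothetical excerpt (the key names are illustrative, not the actual schema of the file) might look like this:</p>

```yaml
# hypothetical excerpt of kubernetes_rest_config.yaml; key names are illustrative
logging:
  filters:
    environment_filter:
      # log record field name -> environment variable to read
      fields:
        pod_name: POD_NAME
        node_name: NODE_NAME
        app_name: APP_NAME
```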
<p>To create a <a href="https://kubernetes.io/docs/concepts/configuration/configmap/">ConfigMap</a> for the service, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="n">configmap</span> <span class="o">-</span><span class="n">n</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span> <span class="n">model</span><span class="o">-</span><span class="n">service</span><span class="o">-</span><span class="n">configuration</span> \
<span class="o">--</span><span class="n">from</span><span class="o">-</span><span class="n">file</span><span class="o">=../</span><span class="n">configuration</span><span class="o">/</span><span class="n">kubernetes_rest_config</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap/model-service-configuration created
</code></pre></div>
<p>The service is deployed to the Kubernetes cluster with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">n</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps/credit-risk-model-deployment created
service/credit-risk-model-service created
</code></pre></div>
<p>The deployment and service for the model service were created together. Let's view the Deployment to see if it is available yet:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">deployments</span> <span class="o">-</span><span class="n">n</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
credit-risk-model-deployment   1/1     1            1           33s
</code></pre></div>
<p>You can also view the pods that are running the service:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">-</span><span class="n">n</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span> <span class="o">-</span><span class="n">l</span> <span class="n">app</span><span class="o">=</span><span class="n">credit</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                                           READY   STATUS    RESTARTS   AGE
credit-risk-model-deployment-554575f4f-5rl5s   1/1     Running   0          35s
</code></pre></div>
<p>The Kubernetes Service details look like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">services</span> <span class="o">-</span><span class="n">n</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                        TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
credit-risk-model-service   NodePort   10.104.38.144   <none>        80:32268/TCP   37s
</code></pre></div>
<p>We'll run a proxy process locally to be able to access the model service endpoint:</p>
<div class="highlight"><pre><span></span><code>minikube service credit-risk-model-service --url -n model-services
</code></pre></div>
<p>The command outputs this URL:</p>
<p>http://127.0.0.1:50222</p>
<p>We can send a request to the model service through the local endpoint like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:50222/api/models/credit_risk_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s1">'{ </span><span class="se">\</span>
<span class="s1"> "annual_income": 273000, </span><span class="se">\</span>
<span class="s1"> "collections_in_last_12_months": 20, </span><span class="se">\</span>
<span class="s1"> "delinquencies_in_last_2_years": 39, </span><span class="se">\</span>
<span class="s1"> "debt_to_income_ratio": 42.64, </span><span class="se">\</span>
<span class="s1"> "employment_length": "< 1 year", </span><span class="se">\</span>
<span class="s1"> "home_ownership": "MORTGAGE", </span><span class="se">\</span>
<span class="s1"> "number_of_delinquent_accounts": 6, </span><span class="se">\</span>
<span class="s1"> "interest_rate": 28.99, </span><span class="se">\</span>
<span class="s1"> "last_payment_amount": 36475.59, </span><span class="se">\</span>
<span class="s1"> "loan_amount": 35000, </span><span class="se">\</span>
<span class="s1"> "derogatory_public_record_count": 86, </span><span class="se">\</span>
<span class="s1"> "loan_purpose": "debt_consolidation", </span><span class="se">\</span>
<span class="s1"> "revolving_line_utilization_rate": 892.3, </span><span class="se">\</span>
<span class="s1"> "term": " 36 months", </span><span class="se">\</span>
<span class="s1"> "total_payments_to_date": 57777.58, </span><span class="se">\</span>
<span class="s1"> "verification_status": "Source Verified" </span><span class="se">\</span>
<span class="s1">}'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"credit_risk":"safe"}
</code></pre></div>
<p>The model is deployed within Kubernetes!</p>
<h3>Accessing the Logs</h3>
<p>Kubernetes has a built-in system that receives the stdout and stderr outputs of the running containers and saves them to the hard drive of the node for a limited time. You can view the logs emitted by the containers by using this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">logs</span> <span class="o">-</span><span class="n">n</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span> <span class="n">credit</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">deployment</span><span class="o">-</span><span class="mi">554575</span><span class="n">f4f</span><span class="o">-</span><span class="mi">5</span><span class="n">rl5s</span> <span class="o">-</span><span class="n">c</span> <span class="n">credit</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span> <span class="o">|</span> <span class="n">grep</span> <span class="s2">"</span><span class="se">\"</span><span class="s2">action</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">predict</span><span class="se">\"</span><span class="s2">"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"asctime": "2023-04-24 01:36:40,696", "pod_name": "credit-risk-model-deployment-554575f4f-5rl5s", "node_name": "minikube", "app_name": "credit-risk-model-service", "name": "credit_risk_model_logger", "levelname": "INFO", "message": "Prediction requested.", "action": "predict", "model_qualified_name": "credit_risk_model", "model_version": "0.1.0", "collections_in_last_12_months": 20, "debt_to_income_ratio": 42.64}
{"asctime": "2023-04-24 01:36:40,781", "pod_name": "credit-risk-model-deployment-554575f4f-5rl5s", "node_name": "minikube", "app_name": "credit-risk-model-service", "name": "credit_risk_model_logger", "levelname": "INFO", "message": "Prediction created.", "action": "predict", "model_qualified_name": "credit_risk_model", "model_version": "0.1.0", "status": "success", "credit_risk": "safe"}
</code></pre></div>
<p>The logs contain every field that we configured and they are in JSON format, as we expected. The log records also contain the pod_name, node_name, and app_name fields that we added through the downward API.</p>
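<p>Because each record is a single JSON object on its own line, the output of kubectl logs is easy to post-process. As a small illustration (using two records in the same shape as the output above), we can parse and filter them with a few lines of Python:</p>

```python
import json

# two example records in the same shape as the service log output above
log_lines = [
    '{"asctime": "2023-04-24 01:36:40,696", "levelname": "INFO", '
    '"action": "predict", "message": "Prediction requested."}',
    '{"asctime": "2023-04-24 01:36:40,781", "levelname": "INFO", '
    '"action": "predict", "message": "Prediction created.", "status": "success"}',
]

# parse each line and keep only the prediction events
records = [json.loads(line) for line in log_lines]
predictions = [r for r in records if r.get("action") == "predict"]

for record in predictions:
    print(record["asctime"], record["message"])
```

<p>This is essentially what a log aggregation pipeline does at scale: parse structured records, then filter and index them by their fields.</p>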
<p>Although we can view the logs this way, it is not an ideal way to manage them, because we need to be able to search through the logs generated across the whole system. To do this, we'll export the logs to an external logging system, which we'll set up in a later section of this blog post.</p>
<h2>Creating the Logging System</h2>
<p>The complexity of modern cloud environments makes it hard to manage logs on individual servers, since we don't know ahead of time where our workloads will be scheduled. Kubernetes workloads are highly distributed, meaning that an application can be replicated across many different nodes in a cluster. This makes it necessary to gather logs together in one place so that we can more easily view and analyze them.</p>
<p>A logging system is responsible for gathering log records from all of the instances of a running application and making them searchable from one centralized location. In this section, we'll add such a logging system to the cluster and use it to monitor the model service we've deployed.</p>
<p>We'll be installing the Elastic Cloud on Kubernetes operator in order to view our logs. The operator installs and manages ElasticSearch, Kibana, and Filebeat services.</p>
<p>To begin, lets install the <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">custom resource definitions</a> needed by the operator:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="o">-</span><span class="n">f</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">download</span><span class="o">.</span><span class="n">elastic</span><span class="o">.</span><span class="n">co</span><span class="o">/</span><span class="n">downloads</span><span class="o">/</span><span class="n">eck</span><span class="o">/</span><span class="mf">2.7.0</span><span class="o">/</span><span class="n">crds</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>customresourcedefinition.apiextensions.k8s.io/agents.agent.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/apmservers.apm.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/beats.beat.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/elasticmapsservers.maps.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/elasticsearchautoscalers.autoscaling.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/elasticsearches.elasticsearch.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/enterprisesearches.enterprisesearch.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/kibanas.kibana.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/stackconfigpolicies.stackconfigpolicy.k8s.elastic.co created
</code></pre></div>
<p>We'll be using these CRDs:</p>
<ul>
<li>elasticsearch.k8s.elastic.co, to deploy ElasticSearch for storing and indexing logs</li>
<li>kibana.k8s.elastic.co, to deploy Kibana for viewing logs</li>
<li>beat.k8s.elastic.co, to deploy Filebeat on each node to forward logs to ElasticSearch</li>
</ul>
<p>The CRDs are used by the ECK operator to manage resources in the cluster. To install the ECK operator itself, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">download</span><span class="o">.</span><span class="n">elastic</span><span class="o">.</span><span class="n">co</span><span class="o">/</span><span class="n">downloads</span><span class="o">/</span><span class="n">eck</span><span class="o">/</span><span class="mf">2.7.0</span><span class="o">/</span><span class="n">operator</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">namespace</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">system</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">serviceaccount</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">operator</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">secret</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">webhook</span><span class="o">-</span><span class="n">server</span><span class="o">-</span><span class="n">cert</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">configmap</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">operator</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">clusterrole</span><span class="p">.</span><span class="n">rbac</span><span class="p">.</span><span class="n">authorization</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">operator</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">clusterrole</span><span class="p">.</span><span class="n">rbac</span><span class="p">.</span><span class="n">authorization</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">operator</span><span class="o">-</span><span class="n">view</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">clusterrole</span><span class="p">.</span><span class="n">rbac</span><span class="p">.</span><span class="n">authorization</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">operator</span><span class="o">-</span><span class="n">edit</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">clusterrolebinding</span><span class="p">.</span><span class="n">rbac</span><span class="p">.</span><span class="n">authorization</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">operator</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">service</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">webhook</span><span class="o">-</span><span class="n">server</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">statefulset</span><span class="p">.</span><span class="n">apps</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">operator</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="n">validatingwebhookconfiguration</span><span class="p">.</span><span class="n">admissionregistration</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">elastic</span><span class="o">-</span><span class="n">webhook</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">elastic</span><span class="p">.</span><span class="n">co</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
</code></pre></div>
<h3>ElasticSearch</h3>
<p>We'll be storing logs in <a href="https://www.elastic.co/elasticsearch/">ElasticSearch</a>. ElasticSearch is a distributed full-text search engine with a RESTful API. The ElasticSearch service is ideal for our needs because our logs are made up of text strings.</p>
<p>Now we're ready to install the service by applying an "Elasticsearch" custom resource:</p>
<div class="highlight"><pre><span></span><code><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">elasticsearch.k8s.elastic.co/v1</span><span class="w"></span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Elasticsearch</span><span class="w"></span>
<span class="nt">metadata</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">quickstart</span><span class="w"></span>
<span class="nt">spec</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8.7.0</span><span class="w"></span>
<span class="w"> </span><span class="nt">nodeSets</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">default</span><span class="w"></span>
<span class="w"> </span><span class="nt">count</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span><span class="w"></span>
<span class="w"> </span><span class="nt">config</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">node.store.allow_mmap</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">false</span><span class="w"></span>
</code></pre></div>
<p>The manifest is stored in the kubernetes/elastic_search.yaml file and is applied with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">elastic_search</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>elasticsearch.elasticsearch.k8s.elastic.co/quickstart created
</code></pre></div>
<p>To get a list of ElasticSearch clusters currently defined in the cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">elasticsearch</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME HEALTH NODES VERSION PHASE AGE
quickstart green 1 8.7.0 Ready 116s
</code></pre></div>
<p>We can look at the pods running the ElasticSearch cluster:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">--</span><span class="n">selector</span><span class="o">=</span><span class="s1">'elasticsearch.k8s.elastic.co/cluster-name=quickstart'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME READY STATUS RESTARTS AGE
quickstart-es-default-0 1/1 Running 0 116s
</code></pre></div>
<p>A Kubernetes service is created to make the ElasticSearch service available to other services in the cluster:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">service</span> <span class="n">quickstart</span><span class="o">-</span><span class="n">es</span><span class="o">-</span><span class="n">http</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
quickstart-es-http ClusterIP 10.106.185.54 <none> 9200/TCP 2m2s
</code></pre></div>
<p>A user named "elastic" is automatically created in the ElasticSearch service, with its password stored in a Kubernetes secret. Let's retrieve the password:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">secret</span> <span class="n">quickstart</span><span class="o">-</span><span class="n">es</span><span class="o">-</span><span class="n">elastic</span><span class="o">-</span><span class="n">user</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">-</span><span class="n">o</span><span class="o">=</span><span class="n">jsonpath</span><span class="o">=</span><span class="s1">'{.data.elastic}'</span> <span class="o">|</span> <span class="n">base64</span> <span class="o">--</span><span class="n">decode</span><span class="p">;</span> <span class="n">echo</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>DD097Fe67Qs320Uw6JHIy2Vb
</code></pre></div>
<h3>Kibana</h3>
<p>To view the logs we'll be using <a href="https://www.elastic.co/kibana/">Kibana</a>. Kibana is a web application that can provide access to and visualize logs stored in ElasticSearch.</p>
<p>The custom resource for Kibana looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">kibana.k8s.elastic.co/v1</span><span class="w"></span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Kibana</span><span class="w"></span>
<span class="nt">metadata</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">quickstart</span><span class="w"></span>
<span class="nt">spec</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8.7.0</span><span class="w"></span>
<span class="w"> </span><span class="nt">count</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span><span class="w"></span>
<span class="w"> </span><span class="nt">elasticsearchRef</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">quickstart</span><span class="w"></span>
</code></pre></div>
<p>We'll apply the manifest with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">kibana</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>kibana.kibana.k8s.elastic.co/quickstart created
</code></pre></div>
<p>Similar to ElasticSearch, you can retrieve details about Kibana instances:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">kibana</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME HEALTH NODES VERSION AGE
quickstart green 1 8.7.0 51s
</code></pre></div>
<p>We can also view the associated Pods:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pod</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">--</span><span class="n">selector</span><span class="o">=</span><span class="s1">'kibana.k8s.elastic.co/name=quickstart'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME READY STATUS RESTARTS AGE
quickstart-kb-589dc4f75b-ncpd7 1/1 Running 0 53s
</code></pre></div>
<p>A ClusterIP Service is automatically created for Kibana:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">service</span> <span class="n">quickstart</span><span class="o">-</span><span class="n">kb</span><span class="o">-</span><span class="n">http</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
quickstart-kb-http ClusterIP 10.111.166.26 <none> 5601/TCP 57s
</code></pre></div>
<p>We'll use kubectl port-forward to access Kibana from a local web browser:</p>
<div class="highlight"><pre><span></span><code>kubectl port-forward service/quickstart-kb-http <span class="m">5601</span> -n elastic-system
</code></pre></div>
<p>Now we can access the Kibana service from this URL:</p>
<div class="highlight"><pre><span></span><code>http://localhost:5601
</code></pre></div>
<p>Open the URL in your browser to view the Kibana UI. Login as the "elastic" user. The password is the one we retrieved above.</p>
<h3>Filebeat</h3>
<p>In order to centralize access to logs, we'll first need a way to get the logs off of the individual cluster nodes and forward them to the aggregator service. The service we'll use to do this is called <a href="https://www.elastic.co/beats/filebeat">Filebeat</a>. Filebeat is a lightweight service that can forward logs stored in files to an outside service. We'll deploy Filebeat as a DaemonSet to ensure there’s a running instance on each node of the cluster.</p>
<p>The Filebeat custom resource looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">beat.k8s.elastic.co/v1beta1</span><span class="w"></span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Beat</span><span class="w"></span>
<span class="nt">metadata</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">quickstart</span><span class="w"></span>
<span class="nt">spec</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">filebeat</span><span class="w"></span>
<span class="w"> </span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8.7.0</span><span class="w"></span>
<span class="w"> </span><span class="nt">elasticsearchRef</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">quickstart</span><span class="w"></span>
<span class="w"> </span><span class="nt">kibanaRef</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">quickstart</span><span class="w"></span>
<span class="w"> </span><span class="nt">config</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">processors</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">decode_json_fields</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">fields</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="s">"message"</span><span class="p p-Indicator">]</span><span class="w"></span>
<span class="w"> </span><span class="nt">max_depth</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">3</span><span class="w"></span>
<span class="w"> </span><span class="nt">target</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">parsed_message</span><span class="w"></span>
<span class="w"> </span><span class="nt">add_error_key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">false</span><span class="w"></span>
<span class="w"> </span><span class="nt">filebeat.inputs</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">container</span><span class="w"></span>
<span class="w"> </span><span class="nt">paths</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/var/log/containers/*.log</span><span class="w"></span>
<span class="w"> </span><span class="nt">daemonSet</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">podTemplate</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">spec</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">dnsPolicy</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ClusterFirstWithHostNet</span><span class="w"></span>
<span class="w"> </span><span class="nt">hostNetwork</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
<span class="w"> </span><span class="nt">securityContext</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">runAsUser</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">0</span><span class="w"></span>
<span class="w"> </span><span class="nt">containers</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">filebeat</span><span class="w"></span>
<span class="w"> </span><span class="nt">volumeMounts</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">varlogcontainers</span><span class="w"></span>
<span class="w"> </span><span class="nt">mountPath</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/var/log/containers</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">varlogpods</span><span class="w"></span>
<span class="w"> </span><span class="nt">mountPath</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/var/log/pods</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">varlibdockercontainers</span><span class="w"></span>
<span class="w"> </span><span class="nt">mountPath</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/var/lib/docker/containers</span><span class="w"></span>
<span class="w"> </span><span class="nt">volumes</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">varlogcontainers</span><span class="w"></span>
<span class="w"> </span><span class="nt">hostPath</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/var/log/containers</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">varlogpods</span><span class="w"></span>
<span class="w"> </span><span class="nt">hostPath</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/var/log/pods</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">varlibdockercontainers</span><span class="w"></span>
<span class="w"> </span><span class="nt">hostPath</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/var/lib/docker/containers</span><span class="w"></span>
</code></pre></div>
<p>The host's container log folder (/var/log/containers) is mounted into the Filebeat container. The Filebeat process also has a processor defined:</p>
<ul>
<li>decode_json_fields, which parses fields containing JSON strings and replaces each string with the resulting JSON object</li>
</ul>
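<p>To make the processor's effect concrete, here is a rough Python equivalent of what decode_json_fields does to each event. This is a sketch for illustration, not Filebeat's actual implementation:</p>

```python
import json


def decode_json_fields(event: dict, fields: list[str], target: str) -> dict:
    """Roughly emulate Filebeat's decode_json_fields processor: parse any
    JSON strings found in `fields` and store the result under `target`."""
    for field in fields:
        value = event.get(field)
        if isinstance(value, str):
            try:
                event[target] = json.loads(value)
            except json.JSONDecodeError:
                pass  # add_error_key: false -> leave the event unchanged
    return event


event = {"message": '{"action": "predict", "status": "success"}'}
decode_json_fields(event, fields=["message"], target="parsed_message")
print(event["parsed_message"]["status"])  # the raw string is now a nested object
```

<p>After this step, the JSON string produced by our Python logger becomes a nested object under parsed_message, so ElasticSearch can index fields like parsed_message.action individually instead of treating the record as one opaque string.</p>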
<p>Let's apply the manifest to create the Filebeat DaemonSet:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">filebeat</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>beat.beat.k8s.elastic.co/quickstart created
</code></pre></div>
<p>Details about the Filebeat service can be viewed like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">beat</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME HEALTH AVAILABLE EXPECTED TYPE VERSION AGE
quickstart green 1 1 filebeat 8.7.0 35s
</code></pre></div>
<p>The pods running the service can be listed like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">--</span><span class="n">selector</span><span class="o">=</span><span class="s1">'beat.k8s.elastic.co/name=quickstart'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME READY STATUS RESTARTS AGE
quickstart-beat-filebeat-znwsf 1/1 Running 0 38s
</code></pre></div>
<p>The Filebeat service is running on the single node in the cluster.</p>
<p>The logs are being forwarded to ElasticSearch and can be viewed in Kibana:</p>
<p><img alt="Prediction Log Stream" src="https://www.tekhnoal.com/log_stream_lfmlm.png" width="100%"></p>
<p>We have logs arriving from the model service and can view them in Kibana!</p>
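<p>Beyond the Kibana UI, the same records can be searched through ElasticSearch's _search API. Here is a minimal sketch of a query body that matches one model's log records; the parsed_message field name comes from the Filebeat processor configured above, while the filebeat-* index pattern and @timestamp sort field are assumptions about the default Filebeat setup:</p>

```python
import json


def build_log_query(model_name: str, size: int = 10) -> dict:
    """Build an ElasticSearch _search body matching one model's log records."""
    return {
        "size": size,
        "query": {"match": {"parsed_message.model_qualified_name": model_name}},
        "sort": [{"@timestamp": {"order": "desc"}}],
    }


# The body would be POSTed to the _search endpoint, for example
# https://localhost:9200/filebeat-*/_search (index pattern assumed), after
# port-forwarding 9200 and authenticating as the "elastic" user with the
# password retrieved earlier.
print(json.dumps(build_log_query("credit_risk_model"), indent=2))
```

<p>Queries like this are useful for automated checks, for example alerting when no prediction logs have arrived for a model in the last few minutes.</p>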
<h2>Deleting the Resources</h2>
<p>To delete the Filebeat DaemonSet, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">filebeat</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>beat.beat.k8s.elastic.co "quickstart" deleted
</code></pre></div>
<p>To delete the Kibana service, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">kibana</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>kibana.kibana.k8s.elastic.co "quickstart" deleted
</code></pre></div>
<p>To delete the ElasticSearch service, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">n</span> <span class="n">elastic</span><span class="o">-</span><span class="n">system</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">elastic_search</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>elasticsearch.elasticsearch.k8s.elastic.co "quickstart" deleted
</code></pre></div>
<p>To remove all Elastic resources in all namespaces:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">namespaces</span> <span class="o">--</span><span class="n">no</span><span class="o">-</span><span class="n">headers</span> <span class="o">-</span><span class="n">o</span> <span class="n">custom</span><span class="o">-</span><span class="n">columns</span><span class="o">=</span><span class="p">:</span><span class="n">metadata</span><span class="o">.</span><span class="n">name</span> <span class="o">|</span> <span class="n">xargs</span> <span class="o">-</span><span class="n">n1</span> <span class="n">kubectl</span> <span class="n">delete</span> <span class="n">elastic</span> <span class="o">--</span><span class="nb">all</span> <span class="o">-</span><span class="n">n</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>No resources found
No resources found
No resources found
No resources found
No resources found
No resources found
No resources found
</code></pre></div>
<p>To uninstall the ECK operator:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">download</span><span class="o">.</span><span class="n">elastic</span><span class="o">.</span><span class="n">co</span><span class="o">/</span><span class="n">downloads</span><span class="o">/</span><span class="n">eck</span><span class="o">/</span><span class="mf">2.7.0</span><span class="o">/</span><span class="n">operator</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">namespace</span><span class="w"> </span><span class="s">"elastic-system"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">serviceaccount</span><span class="w"> </span><span class="s">"elastic-operator"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">secret</span><span class="w"> </span><span class="s">"elastic-webhook-server-cert"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">configmap</span><span class="w"> </span><span class="s">"elastic-operator"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">clusterrole</span><span class="p">.</span><span class="n">rbac</span><span class="p">.</span><span class="n">authorization</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="w"> </span><span class="s">"elastic-operator"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">clusterrole</span><span class="p">.</span><span class="n">rbac</span><span class="p">.</span><span class="n">authorization</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="w"> </span><span class="s">"elastic-operator-view"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">clusterrole</span><span class="p">.</span><span class="n">rbac</span><span class="p">.</span><span class="n">authorization</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="w"> </span><span class="s">"elastic-operator-edit"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">clusterrolebinding</span><span class="p">.</span><span class="n">rbac</span><span class="p">.</span><span class="n">authorization</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="w"> </span><span class="s">"elastic-operator"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">service</span><span class="w"> </span><span class="s">"elastic-webhook-server"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">statefulset</span><span class="p">.</span><span class="n">apps</span><span class="w"> </span><span class="s">"elastic-operator"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">validatingwebhookconfiguration</span><span class="p">.</span><span class="n">admissionregistration</span><span class="p">.</span><span class="n">k8s</span><span class="p">.</span><span class="n">io</span><span class="w"> </span><span class="s">"elastic-webhook.k8s.elastic.co"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="o">^</span><span class="n">C</span><span class="w"></span>
</code></pre></div>
<p>To uninstall the Custom Resource Definitions for the ECK operator:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">download</span><span class="o">.</span><span class="n">elastic</span><span class="o">.</span><span class="n">co</span><span class="o">/</span><span class="n">downloads</span><span class="o">/</span><span class="n">eck</span><span class="o">/</span><span class="mf">2.7.0</span><span class="o">/</span><span class="n">crds</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>customresourcedefinition.apiextensions.k8s.io "agents.agent.k8s.elastic.co" deleted
customresourcedefinition.apiextensions.k8s.io "apmservers.apm.k8s.elastic.co" deleted
customresourcedefinition.apiextensions.k8s.io "beats.beat.k8s.elastic.co" deleted
customresourcedefinition.apiextensions.k8s.io "elasticmapsservers.maps.k8s.elastic.co" deleted
customresourcedefinition.apiextensions.k8s.io "elasticsearchautoscalers.autoscaling.k8s.elastic.co" deleted
customresourcedefinition.apiextensions.k8s.io "elasticsearches.elasticsearch.k8s.elastic.co" deleted
customresourcedefinition.apiextensions.k8s.io "enterprisesearches.enterprisesearch.k8s.elastic.co" deleted
customresourcedefinition.apiextensions.k8s.io "kibanas.kibana.k8s.elastic.co" deleted
customresourcedefinition.apiextensions.k8s.io "stackconfigpolicies.stackconfigpolicy.k8s.elastic.co" deleted
</code></pre></div>
<p>To delete the model service kubernetes resources, we'll execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">n</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps "credit-risk-model-deployment" deleted
service "credit-risk-model-service" deleted
</code></pre></div>
<p>We'll also delete the ConfigMap:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">n</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span> <span class="n">configmap</span> <span class="n">model</span><span class="o">-</span><span class="n">service</span><span class="o">-</span><span class="n">configuration</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap "model-service-configuration" deleted
</code></pre></div>
<p>Then the model service namespace:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace "model-services" deleted
</code></pre></div>
<p>To shut down the minikube cluster:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">stop</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>✋ Stopping node "minikube" ...
🛑 Powering off "minikube" via SSH ...
🛑 1 node stopped.
</code></pre></div>
<h2>Closing</h2>
<p>In this blog post we showed how to use the Python logging package and how to create a decorator that adds logging around an MLModel instance. We also set up a logging system within a Kubernetes cluster and used it to aggregate and view logs. Logging is usually the first capability implemented when we need to monitor how a system performs, and machine learning models are no exception. The logging decorator allowed us to do complex logging without having to modify the implementation of the model at all, thus simplifying a common aspect of software observability.</p>
<p>One of the benefits of using the decorator pattern is that we are able to build up complex behaviors around an object. The LoggingDecorator class is very configurable, since we are able to tell it which input and output fields from the model to log. This approach makes the implementation very flexible, since we do not need to modify the decorator's code to add fields to the log. The EnvironmentInfoFilter class that we implemented to grab information from the environment for logs is also built this way. We were able to get information about the Kubernetes deployment into the logs without having to modify the code.</p>
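<p>As a rough illustration of this idea (a hypothetical sketch, not the LoggingDecorator class from the post itself), a field-configurable logging decorator might look like this, with the field names chosen at construction time:</p>

```python
import logging

logger = logging.getLogger(__name__)


class SimpleLoggingDecorator:
    """Hypothetical sketch: wraps any object with a predict() method and
    logs a configurable subset of the input and output fields."""

    def __init__(self, model, input_fields=None, output_fields=None):
        self._model = model
        self._input_fields = input_fields or []
        self._output_fields = output_fields or []

    def predict(self, data: dict) -> dict:
        # log only the configured input fields
        logger.info("prediction requested",
                    extra={f: data.get(f) for f in self._input_fields})
        try:
            result = self._model.predict(data)
        except Exception:
            # exceptions raised by the model are logged and re-raised
            logger.exception("prediction failed")
            raise
        # log only the configured output fields
        logger.info("prediction returned",
                    extra={f: result.get(f) for f in self._output_fields})
        return result
```

<p>Because the decorator exposes the same predict() interface as the wrapped model, callers cannot tell the difference, which is what lets us add fields to the log purely through configuration.</p>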
<p>The LoggingDecorator class is designed to work with MLModel classes, and this is the only hard requirement of the code. This makes the decorator very portable, because we are able to deploy it inside any other model deployment service we may choose to build in the future. For example, we could just as easily decorate an MLModel instance running inside a gRPC service, since the decorator would work exactly the same way. This is due to the interface-driven approach that we took when designing the MLModel interface.</p>
<p>We added logging to the ML model instance from the "outside" and we were not able to access information about the internals of the model. This is a limitation of the decorator approach to logging which only has access to the model inputs, model outputs, and exceptions raised by the model. This approach is best used to add logging functionality to an ML model implementation that we do not control, or in simple situations in which the limitations of the approach do not affect us. If any logging of internal model state is needed, we'll need to generate logs from within the MLModel class. </p>Signed Parameters for Secure ML Model Deployments2023-03-17T22:00:00-05:002023-03-17T22:00:00-05:00Brian Schmidttag:www.tekhnoal.com,2023-03-17:/securing-parameters-for-ml-models.html<p>In the Python ecosystem, using pickle to serialize machine learning models is very common. Pickle is a built-in Python library module that makes it easy to convert in-memory objects into bytestreams that can be saved to a hard drive or sent over networks. Pickling an object is very quick and simple and is the easiest way to persist a complex Python object for later use. However, pickle is not a secure serialization standard. The documentation for the pickle module in the Python standard library explicitly mentions the insecure nature of the pickle format: Warning The pickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. In this blog post, we'll be downloading a dataset, exploring it, training a model, signing the model parameters, and deploying the model parameters and model to a Kubernetes cluster as a RESTful service. 
We'll also be loading the model parameters from a network storage service to show how to secure the model parameters while they are stored separately from the model deployment.</p><h1>Signed Parameters for Secure ML Model Deployments</h1>
<p>This blog post was written in a Jupyter notebook, the code and commands found in it reflect this.</p>
<p>All of the code for this blog post is in <a href="https://github.com/schmidtbri/securing-parameters-for-ml-models">this github repository</a>.</p>
<h2>Introduction</h2>
<p>In the Python ecosystem, using pickle to serialize machine learning models is very common. Pickle is a built-in Python library module that makes it easy to convert in-memory objects into bytestreams that can be saved to a hard drive or sent over networks. Pickling an object is very quick and simple and is the easiest way to persist a complex Python object for later use. However, pickle is not a secure serialization standard. The <a href="https://docs.python.org/3/library/pickle.html">documentation</a> for the pickle module in the Python standard library explicitly mentions the insecure nature of the pickle format:</p>
<div class="highlight"><pre><span></span><code><span class="n">Warning</span><span class="w"> </span><span class="n">The</span><span class="w"> </span><span class="n">pickle</span><span class="w"> </span><span class="n">module</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="n">secure</span><span class="o">.</span><span class="w"> </span><span class="n">Only</span><span class="w"> </span><span class="n">unpickle</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">trust</span><span class="o">.</span><span class="w"></span>
<span class="n">It</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">possible</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">construct</span><span class="w"> </span><span class="n">malicious</span><span class="w"> </span><span class="n">pickle</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="n">which</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">execute</span><span class="w"> </span><span class="n">arbitrary</span><span class="w"> </span><span class="n">code</span><span class="w"> </span><span class="n">during</span><span class="w"> </span><span class="n">unpickling</span><span class="o">.</span><span class="w"> </span><span class="n">Never</span><span class="w"> </span><span class="n">unpickle</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="n">that</span><span class="w"> </span><span class="n">could</span><span class="w"> </span><span class="n">have</span><span class="w"> </span><span class="n">come</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="n">an</span><span class="w"> </span><span class="n">untrusted</span><span class="w"> </span><span class="n">source</span><span class="p">,</span><span class="w"> </span><span class="ow">or</span><span class="w"> </span><span class="n">that</span><span class="w"> </span><span class="n">could</span><span class="w"> </span><span class="n">have</span><span class="w"> </span><span class="n">been</span><span class="w"> </span><span class="n">tampered</span><span class="w"> </span><span class="n">with</span><span class="o">.</span><span class="w"></span>
</code></pre></div>
<p>What can we do about this? Pickling is the easiest way to save model objects, and using pickle for model serialization is ubiquitous in Data Science. One thing that we can do is make sure that the pickle files that hold our models are not modified in the time between the training process and the prediction process. This way, we can be sure that the contents of the file are benign. This is especially important for models that are deployed in production services running in sensitive environments. If we allow the model service that is hosting the model to load a pickle file that has been compromised, we open the door to arbitrary code execution on the server. </p>
<p>One way to prevent the pickle file from being modified is by "signing" it. Signing a file means processing the data and creating a "signature" that we can use later to make sure that the contents of the file have not been changed since it was signed. In order to still be able to use pickle in a production setting, we'll require that the model parameters be signed right after they are created, then we'll check the signature before we load the parameters within the model service. If the signature does not match, we'll know that the model parameters are not safe to load. However, signing model parameters does not encrypt them, so it is still possible for someone with access to the pickle files to view the model parameters.</p>
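<p>The sign-then-verify scheme described above can be sketched with Python's built-in hmac module. The key value and parameter dictionary below are illustrative assumptions for the sketch, not the code used later in the post; a real deployment would fetch the key from a secrets manager shared by the training and serving environments.</p>

```python
import hashlib
import hmac
import pickle

# Illustrative key; in practice this would come from a secrets manager.
SECRET_KEY = b"training-pipeline-secret"


def sign(data: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 signature over the serialized parameters."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(data, key), signature)


# After training: serialize the parameters and sign the resulting bytes.
parameters = {"coef": [0.1, -0.3], "intercept": 0.7}
payload = pickle.dumps(parameters)
signature = sign(payload, SECRET_KEY)

# In the model service: verify first, and only unpickle on success.
if verify(payload, SECRET_KEY, signature):
    loaded = pickle.loads(payload)
else:
    raise RuntimeError("Parameters failed signature check, refusing to unpickle.")
```

<p>Note that, as the post says, this protects integrity but not confidentiality: anyone with access to the file can still read the pickled parameters.</p>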
<p>In this blog post, we'll be downloading a dataset, exploring it, training a model, signing the model parameters, and deploying the model parameters and model to a Kubernetes cluster as a RESTful service. We'll also be loading the model parameters from a network storage service to show how to secure the model parameters while they are stored separately from the model deployment. </p>
<h2>Getting Data</h2>
<p>In order to train a model, we'll need a dataset. The dataset we've chosen is the <a href="https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset">Diabetes Health Indicators Dataset</a> available from Kaggle. The dataset contains data about health indicators and the incidence of diabetes. We'll be using the dataset to train a model that predicts whether or not a person is likely to have diabetes.</p>
<p>To make it easy to download the data, we'll install the <a href="https://pypi.org/project/kaggle/">kaggle python package</a>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
<span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="n">kaggle</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Next, we'll execute these commands to download the data and unzip it into the data folder in the project:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">mkdir</span> <span class="o">-</span><span class="n">p</span> <span class="o">../</span><span class="n">data</span>
<span class="err">!</span><span class="n">kaggle</span> <span class="n">datasets</span> <span class="n">download</span> <span class="o">-</span><span class="n">d</span> <span class="n">alexteboul</span><span class="o">/</span><span class="n">diabetes</span><span class="o">-</span><span class="n">health</span><span class="o">-</span><span class="n">indicators</span><span class="o">-</span><span class="n">dataset</span> <span class="o">-</span><span class="n">p</span> <span class="o">../</span><span class="n">data</span> <span class="o">--</span><span class="n">unzip</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The files downloaded look like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">ls</span> <span class="o">-</span><span class="n">la</span> <span class="o">../</span><span class="n">data</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>total 101232
drwxr-xr-x 5 brian staff 160 Mar 17 22:55 [34m.[m[m
drwxr-xr-x 25 brian staff 800 Mar 17 22:55 [34m..[m[m
-rw-r--r-- 1 brian staff 22738151 Mar 17 22:55 diabetes_012_health_indicators_BRFSS2015.csv
-rw-r--r-- 1 brian staff 6347570 Mar 17 22:55 diabetes_binary_5050split_health_indicators_BRFSS2015.csv
-rw-r--r-- 1 brian staff 22738154 Mar 17 22:55 diabetes_binary_health_indicators_BRFSS2015.csv
</code></pre></div>
<p>We'll focus on the "diabetes_binary_5050split_health_indicators_BRFSS2015.csv" dataset. Let's load the dataset into a Pandas dataframe:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="sa">f</span><span class="s1">'../data/diabetes_binary_5050split_health_indicators_BRFSS2015.csv'</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(70692, 22)
</code></pre></div>
<p>The unprocessed dataset has 70692 rows and 22 columns.</p>
<p>The dataframe columns are these:</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Diabetes_binary 70692 non-null float64
1 HighBP 70692 non-null float64
2 HighChol 70692 non-null float64
3 CholCheck 70692 non-null float64
4 BMI 70692 non-null float64
5 Smoker 70692 non-null float64
6 Stroke 70692 non-null float64
7 HeartDiseaseorAttack 70692 non-null float64
8 PhysActivity 70692 non-null float64
9 Fruits 70692 non-null float64
10 Veggies 70692 non-null float64
11 HvyAlcoholConsump 70692 non-null float64
12 AnyHealthcare 70692 non-null float64
13 NoDocbcCost 70692 non-null float64
14 GenHlth 70692 non-null float64
15 MentHlth 70692 non-null float64
16 PhysHlth 70692 non-null float64
17 DiffWalk 70692 non-null float64
18 Sex 70692 non-null float64
19 Age 70692 non-null float64
20 Education 70692 non-null float64
21 Income 70692 non-null float64
dtypes: float64(22)
memory usage: 11.9 MB
</code></pre></div>
<p>The column names are not all easy to understand, so we'll rename some of them:</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"Diabetes_binary"</span><span class="p">:</span> <span class="s2">"Diabetes"</span><span class="p">,</span>
<span class="s2">"HighBP"</span><span class="p">:</span> <span class="s2">"HighBloodPressure"</span><span class="p">,</span>
<span class="s2">"HighChol"</span><span class="p">:</span> <span class="s2">"HighCholesterol"</span><span class="p">,</span>
<span class="s2">"CholCheck"</span><span class="p">:</span> <span class="s2">"CholesterolChecked"</span><span class="p">,</span>
<span class="s2">"HeartDiseaseorAttack"</span><span class="p">:</span> <span class="s2">"HeartDiseaseOrHeartAttack"</span><span class="p">,</span>
<span class="s2">"PhysActivity"</span><span class="p">:</span> <span class="s2">"PhysicalActivity"</span><span class="p">,</span>
<span class="s2">"HvyAlcoholConsump"</span><span class="p">:</span> <span class="s2">"HeavyAlchoholConsumption"</span><span class="p">,</span>
<span class="s2">"NoDocbcCost"</span><span class="p">:</span> <span class="s2">"NoDoctorsVisitBecauseOfCost"</span><span class="p">,</span>
<span class="s2">"GenHlth"</span><span class="p">:</span> <span class="s2">"GeneralHealth"</span><span class="p">,</span>
<span class="s2">"MentHlth"</span><span class="p">:</span> <span class="s2">"MentalHealth"</span><span class="p">,</span>
<span class="s2">"PhysHlth"</span><span class="p">:</span> <span class="s2">"PhysicalHealth"</span><span class="p">,</span>
<span class="s2">"DiffWalk"</span><span class="p">:</span> <span class="s2">"DifficultyWalking"</span>
<span class="p">})</span>
</code></pre></div>
<h2>Profiling the Data</h2>
<p>In order to profile the data, we'll use the <a href="https://github.com/fbdesignpro/sweetviz">sweetviz</a> package. Let's install the package:</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="n">sweetviz</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To profile the data, all that is needed is two lines of code:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">sweetviz</span> <span class="k">as</span> <span class="nn">sv</span>
<span class="n">report</span> <span class="o">=</span> <span class="n">sv</span><span class="o">.</span><span class="n">analyze</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Once the report is created, we'll save it to disk as an HTML file.</p>
<div class="highlight"><pre><span></span><code><span class="n">report</span><span class="o">.</span><span class="n">show_html</span><span class="p">(</span><span class="n">filepath</span><span class="o">=</span><span class="s2">"../diabetes_risk_model/model_files/data_report.html"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">Report</span><span class="w"> </span><span class="p">..</span><span class="o">/</span><span class="n">diabetes_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">data_report</span><span class="p">.</span><span class="n">html</span><span class="w"> </span><span class="n">was</span><span class="w"> </span><span class="n">generated</span><span class="o">!</span><span class="w"> </span><span class="n">NOTEBOOK</span><span class="o">/</span><span class="n">COLAB</span><span class="w"> </span><span class="nl">USERS:</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">web</span><span class="w"> </span><span class="n">browser</span><span class="w"> </span><span class="n">MAY</span><span class="w"> </span><span class="k">not</span><span class="w"> </span><span class="n">pop</span><span class="w"> </span><span class="n">up</span><span class="p">,</span><span class="w"> </span><span class="n">regardless</span><span class="p">,</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">report</span><span class="w"> </span><span class="n">IS</span><span class="w"> </span><span class="n">saved</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="n">your</span><span class="w"> </span><span class="n">notebook</span><span class="o">/</span><span class="n">colab</span><span class="w"> </span><span class="n">files</span><span class="p">.</span><span class="w"></span>
</code></pre></div>
<p>Right away the profile will tell us a few key details about the dataset:</p>
<p><img alt="Data Overview" src="https://www.tekhnoal.com/data_overview_spfmlm.png" width="100%"></p>
<p>The dataset has 1635 duplicate rows and 22 features, 18 of which are categorical and 4 of which are numerical. The profile has a description for each variable. Here's the description for the "Diabetes" variable, which we'll use as the target variable.</p>
<p><img alt="Variable Overview" src="https://www.tekhnoal.com/variable_overview_sdfmlm.png" width="100%"></p>
<p>By using the sweetviz package we can avoid writing the most common data profiling code. From the report we can tell that there are a few things we'll need to deal with:</p>
<ul>
<li>There are highly correlated variables.</li>
<li>Some variables have outliers.</li>
</ul>
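<p>Both findings are easy to double-check by hand with pandas. The sketch below runs on a tiny synthetic frame for illustration; in the notebook the same functions could be applied to the renamed <code>data</code> frame.</p>

```python
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    return [(a, b, corr.loc[a, b])
            for i, a in enumerate(corr.columns)
            for b in corr.columns[i + 1:]
            if corr.loc[a, b] > threshold]


def iqr_outlier_counts(df: pd.DataFrame) -> pd.Series:
    """Count values outside 1.5 * IQR per column, a common outlier heuristic."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()


# Synthetic example: "b" is perfectly correlated with "a", and both have one outlier.
demo = pd.DataFrame({"a": [1, 2, 3, 4, 100],
                     "b": [2, 4, 6, 8, 200],
                     "c": [5, 1, 4, 2, 3]})
print(correlated_pairs(demo))
print(iqr_outlier_counts(demo))
```

<p>These checks matter because, as we'll see below, pycaret's setup can be configured to remove multicollinear features and outliers automatically.</p>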
<h2>Training a Model</h2>
<p>To train a model we'll be using the <a href="https://pycaret.org/">pycaret package</a>.</p>
<p>Let's install the package first:</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="o">--</span><span class="n">pre</span> <span class="n">pycaret</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>We'll setup the experiment like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pycaret.classification</span> <span class="kn">import</span> <span class="n">setup</span>
<span class="n">diabetes_experiment</span> <span class="o">=</span> <span class="n">setup</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span>
<span class="n">target</span><span class="o">=</span><span class="s2">"Diabetes"</span><span class="p">,</span>
<span class="n">data_split_stratify</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">fix_imbalance</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">remove_outliers</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">normalize</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">feature_selection</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">remove_multicollinearity</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">session_id</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
</code></pre></div>
<style type="text/css">
#T_8dfdd_row8_col1, #T_8dfdd_row12_col1, #T_8dfdd_row14_col1, #T_8dfdd_row16_col1, #T_8dfdd_row18_col1 {
background-color: lightgreen;
}
</style>
<table id="T_8dfdd">
<thead>
<tr>
<th class="blank level0" > </th>
<th id="T_8dfdd_level0_col0" class="col_heading level0 col0" >Description</th>
<th id="T_8dfdd_level0_col1" class="col_heading level0 col1" >Value</th>
</tr>
</thead>
<tbody>
<tr>
<th id="T_8dfdd_level0_row0" class="row_heading level0 row0" >0</th>
<td id="T_8dfdd_row0_col0" class="data row0 col0" >Session id</td>
<td id="T_8dfdd_row0_col1" class="data row0 col1" >42</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row1" class="row_heading level0 row1" >1</th>
<td id="T_8dfdd_row1_col0" class="data row1 col0" >Target</td>
<td id="T_8dfdd_row1_col1" class="data row1 col1" >Diabetes</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row2" class="row_heading level0 row2" >2</th>
<td id="T_8dfdd_row2_col0" class="data row2 col0" >Target type</td>
<td id="T_8dfdd_row2_col1" class="data row2 col1" >Binary</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row3" class="row_heading level0 row3" >3</th>
<td id="T_8dfdd_row3_col0" class="data row3 col0" >Original data shape</td>
<td id="T_8dfdd_row3_col1" class="data row3 col1" >(70692, 22)</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row4" class="row_heading level0 row4" >4</th>
<td id="T_8dfdd_row4_col0" class="data row4 col0" >Transformed data shape</td>
<td id="T_8dfdd_row4_col1" class="data row4 col1" >(68683, 5)</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row5" class="row_heading level0 row5" >5</th>
<td id="T_8dfdd_row5_col0" class="data row5 col0" >Transformed train set shape</td>
<td id="T_8dfdd_row5_col1" class="data row5 col1" >(47421, 5)</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row6" class="row_heading level0 row6" >6</th>
<td id="T_8dfdd_row6_col0" class="data row6 col0" >Transformed test set shape</td>
<td id="T_8dfdd_row6_col1" class="data row6 col1" >(21208, 5)</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row7" class="row_heading level0 row7" >7</th>
<td id="T_8dfdd_row7_col0" class="data row7 col0" >Numeric features</td>
<td id="T_8dfdd_row7_col1" class="data row7 col1" >21</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row8" class="row_heading level0 row8" >8</th>
<td id="T_8dfdd_row8_col0" class="data row8 col0" >Preprocess</td>
<td id="T_8dfdd_row8_col1" class="data row8 col1" >True</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row9" class="row_heading level0 row9" >9</th>
<td id="T_8dfdd_row9_col0" class="data row9 col0" >Imputation type</td>
<td id="T_8dfdd_row9_col1" class="data row9 col1" >simple</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row10" class="row_heading level0 row10" >10</th>
<td id="T_8dfdd_row10_col0" class="data row10 col0" >Numeric imputation</td>
<td id="T_8dfdd_row10_col1" class="data row10 col1" >mean</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row11" class="row_heading level0 row11" >11</th>
<td id="T_8dfdd_row11_col0" class="data row11 col0" >Categorical imputation</td>
<td id="T_8dfdd_row11_col1" class="data row11 col1" >mode</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row12" class="row_heading level0 row12" >12</th>
<td id="T_8dfdd_row12_col0" class="data row12 col0" >Remove multicollinearity</td>
<td id="T_8dfdd_row12_col1" class="data row12 col1" >True</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row13" class="row_heading level0 row13" >13</th>
<td id="T_8dfdd_row13_col0" class="data row13 col0" >Multicollinearity threshold</td>
<td id="T_8dfdd_row13_col1" class="data row13 col1" >0.900000</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row14" class="row_heading level0 row14" >14</th>
<td id="T_8dfdd_row14_col0" class="data row14 col0" >Remove outliers</td>
<td id="T_8dfdd_row14_col1" class="data row14 col1" >True</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row15" class="row_heading level0 row15" >15</th>
<td id="T_8dfdd_row15_col0" class="data row15 col0" >Outliers threshold</td>
<td id="T_8dfdd_row15_col1" class="data row15 col1" >0.050000</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row16" class="row_heading level0 row16" >16</th>
<td id="T_8dfdd_row16_col0" class="data row16 col0" >Normalize</td>
<td id="T_8dfdd_row16_col1" class="data row16 col1" >True</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row17" class="row_heading level0 row17" >17</th>
<td id="T_8dfdd_row17_col0" class="data row17 col0" >Normalize method</td>
<td id="T_8dfdd_row17_col1" class="data row17 col1" >zscore</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row18" class="row_heading level0 row18" >18</th>
<td id="T_8dfdd_row18_col0" class="data row18 col0" >Feature selection</td>
<td id="T_8dfdd_row18_col1" class="data row18 col1" >True</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row19" class="row_heading level0 row19" >19</th>
<td id="T_8dfdd_row19_col0" class="data row19 col0" >Feature selection method</td>
<td id="T_8dfdd_row19_col1" class="data row19 col1" >classic</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row20" class="row_heading level0 row20" >20</th>
<td id="T_8dfdd_row20_col0" class="data row20 col0" >Feature selection estimator</td>
<td id="T_8dfdd_row20_col1" class="data row20 col1" >lightgbm</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row21" class="row_heading level0 row21" >21</th>
<td id="T_8dfdd_row21_col0" class="data row21 col0" >Number of features selected</td>
<td id="T_8dfdd_row21_col1" class="data row21 col1" >0.200000</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row22" class="row_heading level0 row22" >22</th>
<td id="T_8dfdd_row22_col0" class="data row22 col0" >Fold Generator</td>
<td id="T_8dfdd_row22_col1" class="data row22 col1" >StratifiedKFold</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row23" class="row_heading level0 row23" >23</th>
<td id="T_8dfdd_row23_col0" class="data row23 col0" >Fold Number</td>
<td id="T_8dfdd_row23_col1" class="data row23 col1" >10</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row24" class="row_heading level0 row24" >24</th>
<td id="T_8dfdd_row24_col0" class="data row24 col0" >CPU Jobs</td>
<td id="T_8dfdd_row24_col1" class="data row24 col1" >-1</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row25" class="row_heading level0 row25" >25</th>
<td id="T_8dfdd_row25_col0" class="data row25 col0" >Use GPU</td>
<td id="T_8dfdd_row25_col1" class="data row25 col1" >False</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row26" class="row_heading level0 row26" >26</th>
<td id="T_8dfdd_row26_col0" class="data row26 col0" >Log Experiment</td>
<td id="T_8dfdd_row26_col1" class="data row26 col1" >False</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row27" class="row_heading level0 row27" >27</th>
<td id="T_8dfdd_row27_col0" class="data row27 col0" >Experiment Name</td>
<td id="T_8dfdd_row27_col1" class="data row27 col1" >clf-default-name</td>
</tr>
<tr>
<th id="T_8dfdd_level0_row28" class="row_heading level0 row28" >28</th>
<td id="T_8dfdd_row28_col0" class="data row28 col0" >USI</td>
<td id="T_8dfdd_row28_col1" class="data row28 col1" >bd08</td>
</tr>
</tbody>
</table>
<p>We're telling pycaret that the target column is target="Diabetes". We're also asking the pycaret package to take care of several problems in the dataset. The fix_imbalance parameter tells pycaret to balance the target classes before training. The remove_outliers parameter tells the package to drop outlying samples from the training set. The feature_selection option tells the package to drop uninformative features from the training set. The remove_multicollinearity option tells the package to drop a feature if it is highly linearly correlated with another feature.</p>
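<p>For reference, the options described above map onto pycaret's setup() keyword arguments roughly as in the sketch below. This is reconstructed from the configuration table, not the exact call used in this post; any option not listed is assumed to keep its default value.</p>

```python
# Sketch of the setup() arguments described above, reconstructed from the
# configuration table printed by pycaret; values are assumptions, not the
# exact call used in this post.
setup_kwargs = dict(
    target="Diabetes",              # column the model will predict
    fix_imbalance=True,             # balance the target classes before training
    remove_outliers=True,           # drop outlying samples from the training set
    outliers_threshold=0.05,        # fraction of samples treated as outliers
    remove_multicollinearity=True,  # drop features highly correlated with others
    multicollinearity_threshold=0.9,
    feature_selection=True,         # keep only the most informative features
    normalize=True,                 # z-score scale the numeric features
)

# The actual call would then look like:
# from pycaret.classification import setup
# clf = setup(data=df, **setup_kwargs)  # df holds the diabetes dataset
```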
<p>After analyzing the dataset, we can see that pycaret removed some samples and some columns. The original dataset had 70,692 samples; after preprocessing, 68,683 remain. Pycaret also removed features: we started with 21 features, and only 5 remained after preprocessing. Pycaret has also added data imputers to the prediction pipeline; we'll use these later to deal with missing values when making predictions.</p>
<p>Once pycaret has been set up, we're ready to train some models. </p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pycaret.classification</span> <span class="kn">import</span> <span class="n">compare_models</span>
<span class="n">best_model</span> <span class="o">=</span> <span class="n">compare_models</span><span class="p">()</span>
</code></pre></div>
<style type="text/css">
#T_990fb th {
text-align: left;
}
#T_990fb_row0_col0, #T_990fb_row0_col3, #T_990fb_row0_col4, #T_990fb_row1_col0, #T_990fb_row1_col1, #T_990fb_row1_col2, #T_990fb_row1_col4, #T_990fb_row1_col5, #T_990fb_row1_col6, #T_990fb_row1_col7, #T_990fb_row2_col0, #T_990fb_row2_col1, #T_990fb_row2_col2, #T_990fb_row2_col3, #T_990fb_row2_col4, #T_990fb_row2_col5, #T_990fb_row2_col6, #T_990fb_row2_col7, #T_990fb_row3_col0, #T_990fb_row3_col1, #T_990fb_row3_col2, #T_990fb_row3_col3, #T_990fb_row3_col4, #T_990fb_row3_col5, #T_990fb_row3_col6, #T_990fb_row3_col7, #T_990fb_row4_col0, #T_990fb_row4_col1, #T_990fb_row4_col2, #T_990fb_row4_col3, #T_990fb_row4_col4, #T_990fb_row4_col5, #T_990fb_row4_col6, #T_990fb_row4_col7, #T_990fb_row5_col0, #T_990fb_row5_col1, #T_990fb_row5_col2, #T_990fb_row5_col3, #T_990fb_row5_col5, #T_990fb_row5_col6, #T_990fb_row5_col7, #T_990fb_row6_col0, #T_990fb_row6_col1, #T_990fb_row6_col2, #T_990fb_row6_col3, #T_990fb_row6_col4, #T_990fb_row6_col5, #T_990fb_row6_col6, #T_990fb_row6_col7, #T_990fb_row7_col0, #T_990fb_row7_col1, #T_990fb_row7_col2, #T_990fb_row7_col3, #T_990fb_row7_col4, #T_990fb_row7_col5, #T_990fb_row7_col6, #T_990fb_row7_col7, #T_990fb_row8_col0, #T_990fb_row8_col1, #T_990fb_row8_col2, #T_990fb_row8_col3, #T_990fb_row8_col4, #T_990fb_row8_col5, #T_990fb_row8_col6, #T_990fb_row8_col7, #T_990fb_row9_col0, #T_990fb_row9_col1, #T_990fb_row9_col2, #T_990fb_row9_col3, #T_990fb_row9_col4, #T_990fb_row9_col5, #T_990fb_row9_col6, #T_990fb_row9_col7, #T_990fb_row10_col0, #T_990fb_row10_col1, #T_990fb_row10_col2, #T_990fb_row10_col3, #T_990fb_row10_col4, #T_990fb_row10_col5, #T_990fb_row10_col6, #T_990fb_row10_col7, #T_990fb_row11_col0, #T_990fb_row11_col1, #T_990fb_row11_col2, #T_990fb_row11_col3, #T_990fb_row11_col4, #T_990fb_row11_col5, #T_990fb_row11_col6, #T_990fb_row11_col7, #T_990fb_row12_col0, #T_990fb_row12_col1, #T_990fb_row12_col2, #T_990fb_row12_col3, #T_990fb_row12_col4, #T_990fb_row12_col5, #T_990fb_row12_col6, #T_990fb_row12_col7, #T_990fb_row13_col0, 
#T_990fb_row13_col1, #T_990fb_row13_col2, #T_990fb_row13_col3, #T_990fb_row13_col4, #T_990fb_row13_col5, #T_990fb_row13_col6, #T_990fb_row13_col7 {
text-align: left;
}
#T_990fb_row0_col1, #T_990fb_row0_col2, #T_990fb_row0_col5, #T_990fb_row0_col6, #T_990fb_row0_col7, #T_990fb_row1_col3, #T_990fb_row5_col4 {
text-align: left;
background-color: yellow;
}
#T_990fb_row0_col8, #T_990fb_row1_col8, #T_990fb_row2_col8, #T_990fb_row3_col8, #T_990fb_row4_col8, #T_990fb_row5_col8, #T_990fb_row6_col8, #T_990fb_row7_col8, #T_990fb_row8_col8, #T_990fb_row9_col8, #T_990fb_row10_col8, #T_990fb_row11_col8, #T_990fb_row12_col8 {
text-align: left;
background-color: lightgrey;
}
#T_990fb_row13_col8 {
text-align: left;
background-color: yellow;
background-color: lightgrey;
}
</style>
<table id="T_990fb">
<thead>
<tr>
<th class="blank level0" > </th>
<th id="T_990fb_level0_col0" class="col_heading level0 col0" >Model</th>
<th id="T_990fb_level0_col1" class="col_heading level0 col1" >Accuracy</th>
<th id="T_990fb_level0_col2" class="col_heading level0 col2" >AUC</th>
<th id="T_990fb_level0_col3" class="col_heading level0 col3" >Recall</th>
<th id="T_990fb_level0_col4" class="col_heading level0 col4" >Prec.</th>
<th id="T_990fb_level0_col5" class="col_heading level0 col5" >F1</th>
<th id="T_990fb_level0_col6" class="col_heading level0 col6" >Kappa</th>
<th id="T_990fb_level0_col7" class="col_heading level0 col7" >MCC</th>
<th id="T_990fb_level0_col8" class="col_heading level0 col8" >TT (Sec)</th>
</tr>
</thead>
<tbody>
<tr>
<th id="T_990fb_level0_row0" class="row_heading level0 row0" >gbc</th>
<td id="T_990fb_row0_col0" class="data row0 col0" >Gradient Boosting Classifier</td>
<td id="T_990fb_row0_col1" class="data row0 col1" >0.7318</td>
<td id="T_990fb_row0_col2" class="data row0 col2" >0.8069</td>
<td id="T_990fb_row0_col3" class="data row0 col3" >0.7818</td>
<td id="T_990fb_row0_col4" class="data row0 col4" >0.7108</td>
<td id="T_990fb_row0_col5" class="data row0 col5" >0.7446</td>
<td id="T_990fb_row0_col6" class="data row0 col6" >0.4636</td>
<td id="T_990fb_row0_col7" class="data row0 col7" >0.4660</td>
<td id="T_990fb_row0_col8" class="data row0 col8" >0.6810</td>
</tr>
<tr>
<th id="T_990fb_level0_row1" class="row_heading level0 row1" >lightgbm</th>
<td id="T_990fb_row1_col0" class="data row1 col0" >Light Gradient Boosting Machine</td>
<td id="T_990fb_row1_col1" class="data row1 col1" >0.7313</td>
<td id="T_990fb_row1_col2" class="data row1 col2" >0.8056</td>
<td id="T_990fb_row1_col3" class="data row1 col3" >0.7823</td>
<td id="T_990fb_row1_col4" class="data row1 col4" >0.7100</td>
<td id="T_990fb_row1_col5" class="data row1 col5" >0.7444</td>
<td id="T_990fb_row1_col6" class="data row1 col6" >0.4627</td>
<td id="T_990fb_row1_col7" class="data row1 col7" >0.4652</td>
<td id="T_990fb_row1_col8" class="data row1 col8" >0.2000</td>
</tr>
<tr>
<th id="T_990fb_level0_row2" class="row_heading level0 row2" >ada</th>
<td id="T_990fb_row2_col0" class="data row2 col0" >Ada Boost Classifier</td>
<td id="T_990fb_row2_col1" class="data row2 col1" >0.7309</td>
<td id="T_990fb_row2_col2" class="data row2 col2" >0.8053</td>
<td id="T_990fb_row2_col3" class="data row2 col3" >0.7616</td>
<td id="T_990fb_row2_col4" class="data row2 col4" >0.7176</td>
<td id="T_990fb_row2_col5" class="data row2 col5" >0.7389</td>
<td id="T_990fb_row2_col6" class="data row2 col6" >0.4618</td>
<td id="T_990fb_row2_col7" class="data row2 col7" >0.4627</td>
<td id="T_990fb_row2_col8" class="data row2 col8" >0.3710</td>
</tr>
<tr>
<th id="T_990fb_level0_row3" class="row_heading level0 row3" >ridge</th>
<td id="T_990fb_row3_col0" class="data row3 col0" >Ridge Classifier</td>
<td id="T_990fb_row3_col1" class="data row3 col1" >0.7281</td>
<td id="T_990fb_row3_col2" class="data row3 col2" >0.0000</td>
<td id="T_990fb_row3_col3" class="data row3 col3" >0.7444</td>
<td id="T_990fb_row3_col4" class="data row3 col4" >0.7210</td>
<td id="T_990fb_row3_col5" class="data row3 col5" >0.7325</td>
<td id="T_990fb_row3_col6" class="data row3 col6" >0.4562</td>
<td id="T_990fb_row3_col7" class="data row3 col7" >0.4565</td>
<td id="T_990fb_row3_col8" class="data row3 col8" >0.1030</td>
</tr>
<tr>
<th id="T_990fb_level0_row4" class="row_heading level0 row4" >lda</th>
<td id="T_990fb_row4_col0" class="data row4 col0" >Linear Discriminant Analysis</td>
<td id="T_990fb_row4_col1" class="data row4 col1" >0.7281</td>
<td id="T_990fb_row4_col2" class="data row4 col2" >0.8007</td>
<td id="T_990fb_row4_col3" class="data row4 col3" >0.7444</td>
<td id="T_990fb_row4_col4" class="data row4 col4" >0.7210</td>
<td id="T_990fb_row4_col5" class="data row4 col5" >0.7325</td>
<td id="T_990fb_row4_col6" class="data row4 col6" >0.4562</td>
<td id="T_990fb_row4_col7" class="data row4 col7" >0.4565</td>
<td id="T_990fb_row4_col8" class="data row4 col8" >0.1040</td>
</tr>
<tr>
<th id="T_990fb_level0_row5" class="row_heading level0 row5" >lr</th>
<td id="T_990fb_row5_col0" class="data row5 col0" >Logistic Regression</td>
<td id="T_990fb_row5_col1" class="data row5 col1" >0.7279</td>
<td id="T_990fb_row5_col2" class="data row5 col2" >0.8015</td>
<td id="T_990fb_row5_col3" class="data row5 col3" >0.7409</td>
<td id="T_990fb_row5_col4" class="data row5 col4" >0.7222</td>
<td id="T_990fb_row5_col5" class="data row5 col5" >0.7314</td>
<td id="T_990fb_row5_col6" class="data row5 col6" >0.4559</td>
<td id="T_990fb_row5_col7" class="data row5 col7" >0.4561</td>
<td id="T_990fb_row5_col8" class="data row5 col8" >1.6740</td>
</tr>
<tr>
<th id="T_990fb_level0_row6" class="row_heading level0 row6" >svm</th>
<td id="T_990fb_row6_col0" class="data row6 col0" >SVM - Linear Kernel</td>
<td id="T_990fb_row6_col1" class="data row6 col1" >0.7265</td>
<td id="T_990fb_row6_col2" class="data row6 col2" >0.0000</td>
<td id="T_990fb_row6_col3" class="data row6 col3" >0.7552</td>
<td id="T_990fb_row6_col4" class="data row6 col4" >0.7148</td>
<td id="T_990fb_row6_col5" class="data row6 col5" >0.7338</td>
<td id="T_990fb_row6_col6" class="data row6 col6" >0.4529</td>
<td id="T_990fb_row6_col7" class="data row6 col7" >0.4545</td>
<td id="T_990fb_row6_col8" class="data row6 col8" >0.1080</td>
</tr>
<tr>
<th id="T_990fb_level0_row7" class="row_heading level0 row7" >qda</th>
<td id="T_990fb_row7_col0" class="data row7 col0" >Quadratic Discriminant Analysis</td>
<td id="T_990fb_row7_col1" class="data row7 col1" >0.7265</td>
<td id="T_990fb_row7_col2" class="data row7 col2" >0.7940</td>
<td id="T_990fb_row7_col3" class="data row7 col3" >0.7610</td>
<td id="T_990fb_row7_col4" class="data row7 col4" >0.7119</td>
<td id="T_990fb_row7_col5" class="data row7 col5" >0.7356</td>
<td id="T_990fb_row7_col6" class="data row7 col6" >0.4530</td>
<td id="T_990fb_row7_col7" class="data row7 col7" >0.4541</td>
<td id="T_990fb_row7_col8" class="data row7 col8" >0.1000</td>
</tr>
<tr>
<th id="T_990fb_level0_row8" class="row_heading level0 row8" >nb</th>
<td id="T_990fb_row8_col0" class="data row8 col0" >Naive Bayes</td>
<td id="T_990fb_row8_col1" class="data row8 col1" >0.7210</td>
<td id="T_990fb_row8_col2" class="data row8 col2" >0.7939</td>
<td id="T_990fb_row8_col3" class="data row8 col3" >0.7207</td>
<td id="T_990fb_row8_col4" class="data row8 col4" >0.7212</td>
<td id="T_990fb_row8_col5" class="data row8 col5" >0.7209</td>
<td id="T_990fb_row8_col6" class="data row8 col6" >0.4420</td>
<td id="T_990fb_row8_col7" class="data row8 col7" >0.4420</td>
<td id="T_990fb_row8_col8" class="data row8 col8" >0.1090</td>
</tr>
<tr>
<th id="T_990fb_level0_row9" class="row_heading level0 row9" >rf</th>
<td id="T_990fb_row9_col0" class="data row9 col0" >Random Forest Classifier</td>
<td id="T_990fb_row9_col1" class="data row9 col1" >0.7001</td>
<td id="T_990fb_row9_col2" class="data row9 col2" >0.7617</td>
<td id="T_990fb_row9_col3" class="data row9 col3" >0.7204</td>
<td id="T_990fb_row9_col4" class="data row9 col4" >0.6923</td>
<td id="T_990fb_row9_col5" class="data row9 col5" >0.7061</td>
<td id="T_990fb_row9_col6" class="data row9 col6" >0.4002</td>
<td id="T_990fb_row9_col7" class="data row9 col7" >0.4006</td>
<td id="T_990fb_row9_col8" class="data row9 col8" >1.1040</td>
</tr>
<tr>
<th id="T_990fb_level0_row10" class="row_heading level0 row10" >knn</th>
<td id="T_990fb_row10_col0" class="data row10 col0" >K Neighbors Classifier</td>
<td id="T_990fb_row10_col1" class="data row10 col1" >0.6942</td>
<td id="T_990fb_row10_col2" class="data row10 col2" >0.7485</td>
<td id="T_990fb_row10_col3" class="data row10 col3" >0.7166</td>
<td id="T_990fb_row10_col4" class="data row10 col4" >0.6859</td>
<td id="T_990fb_row10_col5" class="data row10 col5" >0.7008</td>
<td id="T_990fb_row10_col6" class="data row10 col6" >0.3883</td>
<td id="T_990fb_row10_col7" class="data row10 col7" >0.3888</td>
<td id="T_990fb_row10_col8" class="data row10 col8" >0.2640</td>
</tr>
<tr>
<th id="T_990fb_level0_row11" class="row_heading level0 row11" >et</th>
<td id="T_990fb_row11_col0" class="data row11 col0" >Extra Trees Classifier</td>
<td id="T_990fb_row11_col1" class="data row11 col1" >0.6900</td>
<td id="T_990fb_row11_col2" class="data row11 col2" >0.7467</td>
<td id="T_990fb_row11_col3" class="data row11 col3" >0.6760</td>
<td id="T_990fb_row11_col4" class="data row11 col4" >0.6955</td>
<td id="T_990fb_row11_col5" class="data row11 col5" >0.6856</td>
<td id="T_990fb_row11_col6" class="data row11 col6" >0.3800</td>
<td id="T_990fb_row11_col7" class="data row11 col7" >0.3802</td>
<td id="T_990fb_row11_col8" class="data row11 col8" >1.0550</td>
</tr>
<tr>
<th id="T_990fb_level0_row12" class="row_heading level0 row12" >dt</th>
<td id="T_990fb_row12_col0" class="data row12 col0" >Decision Tree Classifier</td>
<td id="T_990fb_row12_col1" class="data row12 col1" >0.6867</td>
<td id="T_990fb_row12_col2" class="data row12 col2" >0.7368</td>
<td id="T_990fb_row12_col3" class="data row12 col3" >0.6688</td>
<td id="T_990fb_row12_col4" class="data row12 col4" >0.6937</td>
<td id="T_990fb_row12_col5" class="data row12 col5" >0.6810</td>
<td id="T_990fb_row12_col6" class="data row12 col6" >0.3735</td>
<td id="T_990fb_row12_col7" class="data row12 col7" >0.3738</td>
<td id="T_990fb_row12_col8" class="data row12 col8" >0.1070</td>
</tr>
<tr>
<th id="T_990fb_level0_row13" class="row_heading level0 row13" >dummy</th>
<td id="T_990fb_row13_col0" class="data row13 col0" >Dummy Classifier</td>
<td id="T_990fb_row13_col1" class="data row13 col1" >0.5000</td>
<td id="T_990fb_row13_col2" class="data row13 col2" >0.5000</td>
<td id="T_990fb_row13_col3" class="data row13 col3" >0.0000</td>
<td id="T_990fb_row13_col4" class="data row13 col4" >0.0000</td>
<td id="T_990fb_row13_col5" class="data row13 col5" >0.0000</td>
<td id="T_990fb_row13_col6" class="data row13 col6" >0.0000</td>
<td id="T_990fb_row13_col7" class="data row13 col7" >0.0000</td>
<td id="T_990fb_row13_col8" class="data row13 col8" >0.0890</td>
</tr>
</tbody>
</table>
<p>The function displays a table of the model metrics, highlighting the models with the highest metrics in each category. The function also returns the best model found:</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">best_model</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='log_loss', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_iter_no_change=None,
random_state=42, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
</code></pre></div>
<p>In this case, pycaret returned the GradientBoostingClassifier as the best model. The selected model has the highest accuracy, AUC, F1 score, Kappa, and MCC, but does not have the highest recall or precision. This first step only gives us an idea of how the different types of models perform on the problem; we'll still need to choose the model that meets our requirements. </p>
<p>There are other things to take into account when selecting a model. For example, some models need much more memory and CPU time to make predictions. In some situations, it is better to select a slightly less accurate model that is able to meet the resource requirements of the deployment environment.</p>
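<p>One way to compare candidates on this dimension is to time their prediction calls directly. The helper below is a generic sketch; the model and data names in the usage comment are placeholders, not objects defined in this post.</p>

```python
import time

def mean_latency_ms(predict, batch, n_runs=100):
    """Average wall-clock milliseconds per call of predict(batch)."""
    predict(batch)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(batch)
    return (time.perf_counter() - start) / n_runs * 1000.0

# Usage sketch, assuming gbc_model and ridge_model are fitted pipelines
# and X_test is a held-out feature set (hypothetical names):
# print(mean_latency_ms(gbc_model.predict, X_test))
# print(mean_latency_ms(ridge_model.predict, X_test))
```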
<p>The Gradient Boosting Classifier has the highest F1 score while also having high accuracy, so we'll keep working with it. To train a gbc model, we'll call the pycaret create_model() function.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pycaret.classification</span> <span class="kn">import</span> <span class="n">create_model</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">create_model</span><span class="p">(</span><span class="s2">"gbc"</span><span class="p">)</span>
</code></pre></div>
<style type="text/css">
#T_95af0_row10_col0, #T_95af0_row10_col1, #T_95af0_row10_col2, #T_95af0_row10_col3, #T_95af0_row10_col4, #T_95af0_row10_col5, #T_95af0_row10_col6 {
background: yellow;
}
</style>
<table id="T_95af0">
<thead>
<tr>
<th class="blank level0" > </th>
<th id="T_95af0_level0_col0" class="col_heading level0 col0" >Accuracy</th>
<th id="T_95af0_level0_col1" class="col_heading level0 col1" >AUC</th>
<th id="T_95af0_level0_col2" class="col_heading level0 col2" >Recall</th>
<th id="T_95af0_level0_col3" class="col_heading level0 col3" >Prec.</th>
<th id="T_95af0_level0_col4" class="col_heading level0 col4" >F1</th>
<th id="T_95af0_level0_col5" class="col_heading level0 col5" >Kappa</th>
<th id="T_95af0_level0_col6" class="col_heading level0 col6" >MCC</th>
</tr>
<tr>
<th class="index_name level0" >Fold</th>
<th class="blank col0" > </th>
<th class="blank col1" > </th>
<th class="blank col2" > </th>
<th class="blank col3" > </th>
<th class="blank col4" > </th>
<th class="blank col5" > </th>
<th class="blank col6" > </th>
</tr>
</thead>
<tbody>
<tr>
<th id="T_95af0_level0_row0" class="row_heading level0 row0" >0</th>
<td id="T_95af0_row0_col0" class="data row0 col0" >0.7264</td>
<td id="T_95af0_row0_col1" class="data row0 col1" >0.8041</td>
<td id="T_95af0_row0_col2" class="data row0 col2" >0.7770</td>
<td id="T_95af0_row0_col3" class="data row0 col3" >0.7057</td>
<td id="T_95af0_row0_col4" class="data row0 col4" >0.7396</td>
<td id="T_95af0_row0_col5" class="data row0 col5" >0.4528</td>
<td id="T_95af0_row0_col6" class="data row0 col6" >0.4551</td>
</tr>
<tr>
<th id="T_95af0_level0_row1" class="row_heading level0 row1" >1</th>
<td id="T_95af0_row1_col0" class="data row1 col0" >0.7367</td>
<td id="T_95af0_row1_col1" class="data row1 col1" >0.8050</td>
<td id="T_95af0_row1_col2" class="data row1 col2" >0.7907</td>
<td id="T_95af0_row1_col3" class="data row1 col3" >0.7137</td>
<td id="T_95af0_row1_col4" class="data row1 col4" >0.7502</td>
<td id="T_95af0_row1_col5" class="data row1 col5" >0.4734</td>
<td id="T_95af0_row1_col6" class="data row1 col6" >0.4762</td>
</tr>
<tr>
<th id="T_95af0_level0_row2" class="row_heading level0 row2" >2</th>
<td id="T_95af0_row2_col0" class="data row2 col0" >0.7298</td>
<td id="T_95af0_row2_col1" class="data row2 col1" >0.8048</td>
<td id="T_95af0_row2_col2" class="data row2 col2" >0.7684</td>
<td id="T_95af0_row2_col3" class="data row2 col3" >0.7133</td>
<td id="T_95af0_row2_col4" class="data row2 col4" >0.7398</td>
<td id="T_95af0_row2_col5" class="data row2 col5" >0.4597</td>
<td id="T_95af0_row2_col6" class="data row2 col6" >0.4611</td>
</tr>
<tr>
<th id="T_95af0_level0_row3" class="row_heading level0 row3" >3</th>
<td id="T_95af0_row3_col0" class="data row3 col0" >0.7357</td>
<td id="T_95af0_row3_col1" class="data row3 col1" >0.8053</td>
<td id="T_95af0_row3_col2" class="data row3 col2" >0.7979</td>
<td id="T_95af0_row3_col3" class="data row3 col3" >0.7096</td>
<td id="T_95af0_row3_col4" class="data row3 col4" >0.7511</td>
<td id="T_95af0_row3_col5" class="data row3 col5" >0.4714</td>
<td id="T_95af0_row3_col6" class="data row3 col6" >0.4751</td>
</tr>
<tr>
<th id="T_95af0_level0_row4" class="row_heading level0 row4" >4</th>
<td id="T_95af0_row4_col0" class="data row4 col0" >0.7318</td>
<td id="T_95af0_row4_col1" class="data row4 col1" >0.8098</td>
<td id="T_95af0_row4_col2" class="data row4 col2" >0.7765</td>
<td id="T_95af0_row4_col3" class="data row4 col3" >0.7128</td>
<td id="T_95af0_row4_col4" class="data row4 col4" >0.7433</td>
<td id="T_95af0_row4_col5" class="data row4 col5" >0.4636</td>
<td id="T_95af0_row4_col6" class="data row4 col6" >0.4655</td>
</tr>
<tr>
<th id="T_95af0_level0_row5" class="row_heading level0 row5" >5</th>
<td id="T_95af0_row5_col0" class="data row5 col0" >0.7314</td>
<td id="T_95af0_row5_col1" class="data row5 col1" >0.8075</td>
<td id="T_95af0_row5_col2" class="data row5 col2" >0.7765</td>
<td id="T_95af0_row5_col3" class="data row5 col3" >0.7123</td>
<td id="T_95af0_row5_col4" class="data row5 col4" >0.7430</td>
<td id="T_95af0_row5_col5" class="data row5 col5" >0.4628</td>
<td id="T_95af0_row5_col6" class="data row5 col6" >0.4647</td>
</tr>
<tr>
<th id="T_95af0_level0_row6" class="row_heading level0 row6" >6</th>
<td id="T_95af0_row6_col0" class="data row6 col0" >0.7268</td>
<td id="T_95af0_row6_col1" class="data row6 col1" >0.7999</td>
<td id="T_95af0_row6_col2" class="data row6 col2" >0.7817</td>
<td id="T_95af0_row6_col3" class="data row6 col3" >0.7043</td>
<td id="T_95af0_row6_col4" class="data row6 col4" >0.7410</td>
<td id="T_95af0_row6_col5" class="data row6 col5" >0.4535</td>
<td id="T_95af0_row6_col6" class="data row6 col6" >0.4563</td>
</tr>
<tr>
<th id="T_95af0_level0_row7" class="row_heading level0 row7" >7</th>
<td id="T_95af0_row7_col0" class="data row7 col0" >0.7357</td>
<td id="T_95af0_row7_col1" class="data row7 col1" >0.8104</td>
<td id="T_95af0_row7_col2" class="data row7 col2" >0.7890</td>
<td id="T_95af0_row7_col3" class="data row7 col3" >0.7129</td>
<td id="T_95af0_row7_col4" class="data row7 col4" >0.7490</td>
<td id="T_95af0_row7_col5" class="data row7 col5" >0.4713</td>
<td id="T_95af0_row7_col6" class="data row7 col6" >0.4740</td>
</tr>
<tr>
<th id="T_95af0_level0_row8" class="row_heading level0 row8" >8</th>
<td id="T_95af0_row8_col0" class="data row8 col0" >0.7310</td>
<td id="T_95af0_row8_col1" class="data row8 col1" >0.8060</td>
<td id="T_95af0_row8_col2" class="data row8 col2" >0.7789</td>
<td id="T_95af0_row8_col3" class="data row8 col3" >0.7108</td>
<td id="T_95af0_row8_col4" class="data row8 col4" >0.7433</td>
<td id="T_95af0_row8_col5" class="data row8 col5" >0.4620</td>
<td id="T_95af0_row8_col6" class="data row8 col6" >0.4641</td>
</tr>
<tr>
<th id="T_95af0_level0_row9" class="row_heading level0 row9" >9</th>
<td id="T_95af0_row9_col0" class="data row9 col0" >0.7328</td>
<td id="T_95af0_row9_col1" class="data row9 col1" >0.8164</td>
<td id="T_95af0_row9_col2" class="data row9 col2" >0.7813</td>
<td id="T_95af0_row9_col3" class="data row9 col3" >0.7122</td>
<td id="T_95af0_row9_col4" class="data row9 col4" >0.7452</td>
<td id="T_95af0_row9_col5" class="data row9 col5" >0.4656</td>
<td id="T_95af0_row9_col6" class="data row9 col6" >0.4678</td>
</tr>
<tr>
<th id="T_95af0_level0_row10" class="row_heading level0 row10" >Mean</th>
<td id="T_95af0_row10_col0" class="data row10 col0" >0.7318</td>
<td id="T_95af0_row10_col1" class="data row10 col1" >0.8069</td>
<td id="T_95af0_row10_col2" class="data row10 col2" >0.7818</td>
<td id="T_95af0_row10_col3" class="data row10 col3" >0.7108</td>
<td id="T_95af0_row10_col4" class="data row10 col4" >0.7446</td>
<td id="T_95af0_row10_col5" class="data row10 col5" >0.4636</td>
<td id="T_95af0_row10_col6" class="data row10 col6" >0.4660</td>
</tr>
<tr>
<th id="T_95af0_level0_row11" class="row_heading level0 row11" >Std</th>
<td id="T_95af0_row11_col0" class="data row11 col0" >0.0034</td>
<td id="T_95af0_row11_col1" class="data row11 col1" >0.0042</td>
<td id="T_95af0_row11_col2" class="data row11 col2" >0.0081</td>
<td id="T_95af0_row11_col3" class="data row11 col3" >0.0031</td>
<td id="T_95af0_row11_col4" class="data row11 col4" >0.0040</td>
<td id="T_95af0_row11_col5" class="data row11 col5" >0.0068</td>
<td id="T_95af0_row11_col6" class="data row11 col6" >0.0070</td>
</tr>
</tbody>
</table>
<p>Once the model has been created, we can do hyperparameter tuning with the tune_model() function.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pycaret.classification</span> <span class="kn">import</span> <span class="n">tune_model</span>
<span class="n">tuned_model</span> <span class="o">=</span> <span class="n">tune_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">n_iter</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">optimize</span><span class="o">=</span><span class="s2">"F1"</span><span class="p">)</span>
</code></pre></div>
<style type="text/css">
#T_c0a20_row10_col0, #T_c0a20_row10_col1, #T_c0a20_row10_col2, #T_c0a20_row10_col3, #T_c0a20_row10_col4, #T_c0a20_row10_col5, #T_c0a20_row10_col6 {
background: yellow;
}
</style>
<table id="T_c0a20">
<thead>
<tr>
<th class="blank level0" > </th>
<th id="T_c0a20_level0_col0" class="col_heading level0 col0" >Accuracy</th>
<th id="T_c0a20_level0_col1" class="col_heading level0 col1" >AUC</th>
<th id="T_c0a20_level0_col2" class="col_heading level0 col2" >Recall</th>
<th id="T_c0a20_level0_col3" class="col_heading level0 col3" >Prec.</th>
<th id="T_c0a20_level0_col4" class="col_heading level0 col4" >F1</th>
<th id="T_c0a20_level0_col5" class="col_heading level0 col5" >Kappa</th>
<th id="T_c0a20_level0_col6" class="col_heading level0 col6" >MCC</th>
</tr>
<tr>
<th class="index_name level0" >Fold</th>
<th class="blank col0" > </th>
<th class="blank col1" > </th>
<th class="blank col2" > </th>
<th class="blank col3" > </th>
<th class="blank col4" > </th>
<th class="blank col5" > </th>
<th class="blank col6" > </th>
</tr>
</thead>
<tbody>
<tr>
<th id="T_c0a20_level0_row0" class="row_heading level0 row0" >0</th>
<td id="T_c0a20_row0_col0" class="data row0 col0" >0.7296</td>
<td id="T_c0a20_row0_col1" class="data row0 col1" >0.8041</td>
<td id="T_c0a20_row0_col2" class="data row0 col2" >0.7826</td>
<td id="T_c0a20_row0_col3" class="data row0 col3" >0.7077</td>
<td id="T_c0a20_row0_col4" class="data row0 col4" >0.7433</td>
<td id="T_c0a20_row0_col5" class="data row0 col5" >0.4593</td>
<td id="T_c0a20_row0_col6" class="data row0 col6" >0.4619</td>
</tr>
<tr>
<th id="T_c0a20_level0_row1" class="row_heading level0 row1" >1</th>
<td id="T_c0a20_row1_col0" class="data row1 col0" >0.7377</td>
<td id="T_c0a20_row1_col1" class="data row1 col1" >0.8051</td>
<td id="T_c0a20_row1_col2" class="data row1 col2" >0.7952</td>
<td id="T_c0a20_row1_col3" class="data row1 col3" >0.7133</td>
<td id="T_c0a20_row1_col4" class="data row1 col4" >0.7520</td>
<td id="T_c0a20_row1_col5" class="data row1 col5" >0.4754</td>
<td id="T_c0a20_row1_col6" class="data row1 col6" >0.4786</td>
</tr>
<tr>
<th id="T_c0a20_level0_row2" class="row_heading level0 row2" >2</th>
<td id="T_c0a20_row2_col0" class="data row2 col0" >0.7254</td>
<td id="T_c0a20_row2_col1" class="data row2 col1" >0.8024</td>
<td id="T_c0a20_row2_col2" class="data row2 col2" >0.7668</td>
<td id="T_c0a20_row2_col3" class="data row2 col3" >0.7081</td>
<td id="T_c0a20_row2_col4" class="data row2 col4" >0.7363</td>
<td id="T_c0a20_row2_col5" class="data row2 col5" >0.4508</td>
<td id="T_c0a20_row2_col6" class="data row2 col6" >0.4524</td>
</tr>
<tr>
<th id="T_c0a20_level0_row3" class="row_heading level0 row3" >3</th>
<td id="T_c0a20_row3_col0" class="data row3 col0" >0.7341</td>
<td id="T_c0a20_row3_col1" class="data row3 col1" >0.8054</td>
<td id="T_c0a20_row3_col2" class="data row3 col2" >0.7971</td>
<td id="T_c0a20_row3_col3" class="data row3 col3" >0.7078</td>
<td id="T_c0a20_row3_col4" class="data row3 col4" >0.7498</td>
<td id="T_c0a20_row3_col5" class="data row3 col5" >0.4682</td>
<td id="T_c0a20_row3_col6" class="data row3 col6" >0.4720</td>
</tr>
<tr>
<th id="T_c0a20_level0_row4" class="row_heading level0 row4" >4</th>
<td id="T_c0a20_row4_col0" class="data row4 col0" >0.7316</td>
<td id="T_c0a20_row4_col1" class="data row4 col1" >0.8088</td>
<td id="T_c0a20_row4_col2" class="data row4 col2" >0.7736</td>
<td id="T_c0a20_row4_col3" class="data row4 col3" >0.7136</td>
<td id="T_c0a20_row4_col4" class="data row4 col4" >0.7424</td>
<td id="T_c0a20_row4_col5" class="data row4 col5" >0.4632</td>
<td id="T_c0a20_row4_col6" class="data row4 col6" >0.4649</td>
</tr>
<tr>
<th id="T_c0a20_level0_row5" class="row_heading level0 row5" >5</th>
<td id="T_c0a20_row5_col0" class="data row5 col0" >0.7357</td>
<td id="T_c0a20_row5_col1" class="data row5 col1" >0.8055</td>
<td id="T_c0a20_row5_col2" class="data row5 col2" >0.7793</td>
<td id="T_c0a20_row5_col3" class="data row5 col3" >0.7167</td>
<td id="T_c0a20_row5_col4" class="data row5 col4" >0.7467</td>
<td id="T_c0a20_row5_col5" class="data row5 col5" >0.4713</td>
<td id="T_c0a20_row5_col6" class="data row5 col6" >0.4731</td>
</tr>
<tr>
<th id="T_c0a20_level0_row6" class="row_heading level0 row6" >6</th>
<td id="T_c0a20_row6_col0" class="data row6 col0" >0.7241</td>
<td id="T_c0a20_row6_col1" class="data row6 col1" >0.7996</td>
<td id="T_c0a20_row6_col2" class="data row6 col2" >0.7805</td>
<td id="T_c0a20_row6_col3" class="data row6 col3" >0.7014</td>
<td id="T_c0a20_row6_col4" class="data row6 col4" >0.7389</td>
<td id="T_c0a20_row6_col5" class="data row6 col5" >0.4483</td>
<td id="T_c0a20_row6_col6" class="data row6 col6" >0.4511</td>
</tr>
<tr>
<th id="T_c0a20_level0_row7" class="row_heading level0 row7" >7</th>
<td id="T_c0a20_row7_col0" class="data row7 col0" >0.7411</td>
<td id="T_c0a20_row7_col1" class="data row7 col1" >0.8086</td>
<td id="T_c0a20_row7_col2" class="data row7 col2" >0.7882</td>
<td id="T_c0a20_row7_col3" class="data row7 col3" >0.7204</td>
<td id="T_c0a20_row7_col4" class="data row7 col4" >0.7528</td>
<td id="T_c0a20_row7_col5" class="data row7 col5" >0.4822</td>
<td id="T_c0a20_row7_col6" class="data row7 col6" >0.4844</td>
</tr>
<tr>
<th id="T_c0a20_level0_row8" class="row_heading level0 row8" >8</th>
<td id="T_c0a20_row8_col0" class="data row8 col0" >0.7342</td>
<td id="T_c0a20_row8_col1" class="data row8 col1" >0.8055</td>
<td id="T_c0a20_row8_col2" class="data row8 col2" >0.7858</td>
<td id="T_c0a20_row8_col3" class="data row8 col3" >0.7123</td>
<td id="T_c0a20_row8_col4" class="data row8 col4" >0.7473</td>
<td id="T_c0a20_row8_col5" class="data row8 col5" >0.4685</td>
<td id="T_c0a20_row8_col6" class="data row8 col6" >0.4710</td>
</tr>
<tr>
<th id="T_c0a20_level0_row9" class="row_heading level0 row9" >9</th>
<td id="T_c0a20_row9_col0" class="data row9 col0" >0.7330</td>
<td id="T_c0a20_row9_col1" class="data row9 col1" >0.8149</td>
<td id="T_c0a20_row9_col2" class="data row9 col2" >0.7789</td>
<td id="T_c0a20_row9_col3" class="data row9 col3" >0.7134</td>
<td id="T_c0a20_row9_col4" class="data row9 col4" >0.7447</td>
<td id="T_c0a20_row9_col5" class="data row9 col5" >0.4660</td>
<td id="T_c0a20_row9_col6" class="data row9 col6" >0.4680</td>
</tr>
<tr>
<th id="T_c0a20_level0_row10" class="row_heading level0 row10" >Mean</th>
<td id="T_c0a20_row10_col0" class="data row10 col0" >0.7327</td>
<td id="T_c0a20_row10_col1" class="data row10 col1" >0.8060</td>
<td id="T_c0a20_row10_col2" class="data row10 col2" >0.7828</td>
<td id="T_c0a20_row10_col3" class="data row10 col3" >0.7115</td>
<td id="T_c0a20_row10_col4" class="data row10 col4" >0.7454</td>
<td id="T_c0a20_row10_col5" class="data row10 col5" >0.4653</td>
<td id="T_c0a20_row10_col6" class="data row10 col6" >0.4677</td>
</tr>
<tr>
<th id="T_c0a20_level0_row11" class="row_heading level0 row11" >Std</th>
<td id="T_c0a20_row11_col0" class="data row11 col0" >0.0050</td>
<td id="T_c0a20_row11_col1" class="data row11 col1" >0.0039</td>
<td id="T_c0a20_row11_col2" class="data row11 col2" >0.0088</td>
<td id="T_c0a20_row11_col3" class="data row11 col3" >0.0051</td>
<td id="T_c0a20_row11_col4" class="data row11 col4" >0.0051</td>
<td id="T_c0a20_row11_col5" class="data row11 col5" >0.0099</td>
<td id="T_c0a20_row11_col6" class="data row11 col6" >0.0100</td>
</tr>
</tbody>
</table>
<div class="highlight"><pre><span></span><code><span class="nv">Fitting</span> <span class="mi">10</span> <span class="nv">folds</span> <span class="k">for</span> <span class="nv">each</span> <span class="nv">of</span> <span class="mi">10</span> <span class="nv">candidates</span>, <span class="nv">totalling</span> <span class="mi">100</span> <span class="nv">fits</span>
</code></pre></div>
<p>We asked pycaret to maximize the F1 score of the model. By tuning the hyperparameters, we were able to raise the F1 score from 0.7446 to 0.7454. </p>
<h2>Validating the Model</h2>
<p>Pycaret is integrated with the <a href="https://www.scikit-yb.org/en/latest/">yellowbrick package</a> for creating visualizations. We can easily generate many standard plots to show the performance of our model.</p>
<p>The area under the curve plot can be generated like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pycaret.classification</span> <span class="kn">import</span> <span class="n">plot_model</span>
<span class="n">plot_model</span><span class="p">(</span><span class="n">tuned_model</span><span class="p">,</span> <span class="n">plot</span><span class="o">=</span><span class="s2">"auc"</span><span class="p">,</span> <span class="n">save</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p><img alt="AUC" src="https://www.tekhnoal.com/auc_sdfmlm.png" width="50%"></p>
<p>The AUC plot is useful for understanding the tradeoffs between the true positive rate and the false positive rate of the model's predictions.</p>
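<p>To make that tradeoff concrete, the AUC itself can be interpreted as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. Here is a minimal pure-Python sketch of that interpretation (for illustration only, this is not how yellowbrick computes the plot):</p>

```python
from itertools import product


def auc_score(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Pure-Python illustration, not an efficient
    implementation.
    """
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos_scores, neg_scores)
    )
    return wins / (len(pos_scores) * len(neg_scores))


# A perfect ranker scores every positive above every negative:
print(auc_score([0.9, 0.8], [0.2, 0.1]))  # 1.0
```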
<p>The confusion matrix can be plotted like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">plot_model</span><span class="p">(</span><span class="n">tuned_model</span><span class="p">,</span> <span class="n">plot</span><span class="o">=</span><span class="s2">"confusion_matrix"</span><span class="p">,</span> <span class="n">save</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Confusion Matrix" src="https://www.tekhnoal.com/confusion_matrix_sdfmlm.png" width="50%"></p>
<p>The confusion matrix is useful for understanding which classes are being "confused" for each other by the model. The confusion matrix shows how many predictions were correctly and incorrectly made for each combination of classes.</p>
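<p>The underlying computation is just counting (true label, predicted label) pairs. A hypothetical minimal version, with rows for true labels and columns for predictions:</p>

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) label pairs; rows are true labels,
    columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# One class-0 sample was misclassified as class 1; both class-1 samples were correct.
print(confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], labels=[0, 1]))  # [[1, 1], [0, 2]]
```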
<p>The classification report can be plotted like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">plot</span> <span class="o">=</span> <span class="n">plot_model</span><span class="p">(</span><span class="n">tuned_model</span><span class="p">,</span> <span class="n">plot</span><span class="o">=</span><span class="s2">"class_report"</span><span class="p">,</span> <span class="n">save</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Classification Report" src="https://www.tekhnoal.com/class_report_sdfmlm.png" width="50%"></p>
<p>The classification report shows the precision, recall, F1, and support metrics of the model for each class.</p>
<p>The class prediction error can be plotted like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">plot</span> <span class="o">=</span> <span class="n">plot_model</span><span class="p">(</span><span class="n">tuned_model</span><span class="p">,</span> <span class="n">plot</span><span class="o">=</span><span class="s2">"error"</span><span class="p">,</span> <span class="n">save</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Class Prediction Error" src="https://www.tekhnoal.com/prediction_error_sdfmlm.png" width="50%"></p>
<p>The class prediction error is similar to the classification report and confusion matrix, but highlights the per-class prediction error of the model.</p>
<p>The feature importance can be plotted like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">plot</span> <span class="o">=</span> <span class="n">plot_model</span><span class="p">(</span><span class="n">tuned_model</span><span class="p">,</span> <span class="n">plot</span><span class="o">=</span><span class="s2">"feature"</span><span class="p">,</span> <span class="n">save</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Feature Importance" src="https://www.tekhnoal.com/feature_importance_sdfmlm.png" width="50%"></p>
<p>The feature importance plot helps us understand which features contribute most to making accurate predictions.</p>
<p>The learning curve can be plotted like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">plot</span> <span class="o">=</span> <span class="n">plot_model</span><span class="p">(</span><span class="n">tuned_model</span><span class="p">,</span> <span class="n">plot</span><span class="o">=</span><span class="s2">"learning"</span><span class="p">,</span> <span class="n">save</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Learning Curve" src="https://www.tekhnoal.com/learning_curve_sdfmlm.png" width="50%"></p>
<p>The learning curve shows how the model's performance on the training and validation sets changes as the number of training samples grows. This is useful for diagnosing whether the model is underfitting or overfitting the dataset.</p>
<h2>Finalizing the Model</h2>
<p>Once we have a tuned and validated model, we can use the entire dataset to train it again, in order to leverage the data samples that were held out for the testing and validation sets. </p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pycaret.classification</span> <span class="kn">import</span> <span class="n">finalize_model</span>
<span class="n">finalized_model</span> <span class="o">=</span> <span class="n">finalize_model</span><span class="p">(</span><span class="n">tuned_model</span><span class="p">)</span>
</code></pre></div>
<p>Now that we have a trained, validated, and finalized model, we'll save it to disk for later use.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pickle</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"../diabetes_risk_model/model_files/model.pkl"</span><span class="p">,</span> <span class="s2">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">finalized_model</span><span class="p">,</span> <span class="n">file</span><span class="p">)</span>
</code></pre></div>
<h2>Signing the Model Parameters</h2>
<p>Once we have the model parameters saved as a pickle file, we can sign the model parameters cryptographically. Signing the model parameters will enable us to ensure that the bytes that we are saving are exactly the same bytes that will be used to make predictions. The process involves creating a "signature" for the model parameters, and later verifying the signature.</p>
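<p>Conceptually, signing appends a keyed message authentication code (MAC) to the payload, and verification recomputes the code and compares it against the stored one. A minimal sketch of the idea with the standard library's <code>hmac</code> module (signing libraries add key derivation and encoding on top of this):</p>

```python
import hashlib
import hmac


def sign(data: bytes, key: bytes) -> bytes:
    """Append an HMAC-SHA256 tag computed with the secret key."""
    tag = hmac.new(key, data, hashlib.sha256).digest()
    return data + tag


def verify(signed: bytes, key: bytes) -> bytes:
    """Recompute the tag and compare in constant time; return the payload."""
    data, tag = signed[:-32], signed[-32:]  # SHA-256 tags are 32 bytes
    expected = hmac.new(key, data, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature does not match")
    return data


signed = sign(b"model bytes", key=b"secret")
print(verify(signed, b"secret"))  # b'model bytes'
```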
<p>To sign the model parameters we'll use the <a href="https://itsdangerous.palletsprojects.com/en/2.1.x/">itsdangerous package</a>. This package is useful for sending data through untrusted channels, where there is a chance that an attacker can modify the data.</p>
<p>Let's install the package:</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="n">itsdangerous</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Signing messages requires that we come up with a secret key that is only known to us. We'll create a key and store it in a string variable:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">secrets</span>
<span class="kn">import</span> <span class="nn">string</span>
<span class="n">secret_key</span> <span class="o">=</span> <span class="s2">""</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">secrets</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">string</span><span class="o">.</span><span class="n">ascii_uppercase</span> <span class="o">+</span> <span class="n">string</span><span class="o">.</span><span class="n">ascii_lowercase</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">64</span><span class="p">))</span>
<span class="n">secret_key</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>'wjtRFppXQpxTChQnNcQJKGlLHKJBmAHMepfFbqvOoUrnuxIsKdiLCrrypYFQsqcw'
</code></pre></div>
<p>Next, we'll load the model parameters that we just saved into a bytes object so that we can sign them:</p>
<div class="highlight"><pre><span></span><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"../diabetes_risk_model/model_files/model.pkl"</span><span class="p">,</span> <span class="s2">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="n">model_bytes</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
</code></pre></div>
<p>The signing process looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">itsdangerous</span> <span class="kn">import</span> <span class="n">Signer</span>
<span class="n">signer</span> <span class="o">=</span> <span class="n">Signer</span><span class="p">(</span><span class="n">secret_key</span><span class="p">)</span>
<span class="n">signed_model_bytes</span> <span class="o">=</span> <span class="n">signer</span><span class="o">.</span><span class="n">sign</span><span class="p">(</span><span class="n">model_bytes</span><span class="p">)</span>
</code></pre></div>
<p>The signed model bytes now have a signature appended to them, which means that the model can't be deserialized with pickle anymore. We have to unsign the bytes first. Here is how the unsigning process looks:</p>
<div class="highlight"><pre><span></span><code><span class="n">unsigned_model_bytes</span> <span class="o">=</span> <span class="n">signer</span><span class="o">.</span><span class="n">unsign</span><span class="p">(</span><span class="n">signed_model_bytes</span><span class="p">)</span>
</code></pre></div>
<p>The model bytes were verified using the secret key, and the signature was removed from the bytes object. Now we can unpickle the model object as we normally would:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pickle</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">unsigned_model_bytes</span><span class="p">)</span>
<span class="nb">type</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>pycaret.internal.pipeline.Pipeline
</code></pre></div>
<p>To show how the process would go if the model bytes were modified, let's add a single byte to the end of the signed bytes:</p>
<div class="highlight"><pre><span></span><code><span class="n">changed_signed_model_bytes</span> <span class="o">=</span> <span class="n">signed_model_bytes</span> <span class="o">+</span> <span class="nb">bytes</span><span class="p">([</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div>
<p>Now let's try to unsign the bytes object:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">itsdangerous</span> <span class="kn">import</span> <span class="n">BadSignature</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">signer</span><span class="o">.</span><span class="n">unsign</span><span class="p">(</span><span class="n">changed_signed_model_bytes</span><span class="p">)</span>
<span class="k">except</span> <span class="n">BadSignature</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"BadSignature exception raised!"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>BadSignature exception raised!
</code></pre></div>
<p>Because the bytes were modified, the unsign method raised an exception.</p>
<p>Let's save the signed model bytes to disk, alongside the original model pickle file we created above:</p>
<div class="highlight"><pre><span></span><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"../diabetes_risk_model/model_files/signed_model.pkl"</span><span class="p">,</span> <span class="s2">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">signed_model_bytes</span><span class="p">)</span>
</code></pre></div>
<h2>Packaging the Model Parameters</h2>
<p>We now have signed model parameters. In order to deploy them we'll package them together with other results of the training process.</p>
<p>The model parameters are in the model_files folder:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">ls</span> <span class="o">-</span><span class="n">la</span> <span class="o">../</span><span class="n">diabetes_risk_model</span><span class="o">/</span><span class="n">model_files</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>total 11936
drwxr-xr-x 6 brian staff 192 Mar 17 23:31 .
drwxr-xr-x 8 brian staff 256 Feb 25 23:50 ..
-rw-r--r--@ 1 brian staff 6148 Mar 15 22:40 .DS_Store
-rw-r--r--@ 1 brian staff 1261313 Mar 17 22:57 data_report.html
-rw-r--r-- 1 brian staff 2419848 Mar 17 23:20 model.pkl
-rw-r--r-- 1 brian staff 2419876 Mar 17 23:31 signed_model.pkl
</code></pre></div>
<p>In the process of training this model, we also created a few files, such as the descriptive report of the training set. We'll save those files alongside the model parameters in a zip file.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">shutil</span>
<span class="n">shutil</span><span class="o">.</span><span class="n">make_archive</span><span class="p">(</span><span class="s2">"../diabetes_risk_model/diabetes_risk_model-0.1.0-2023_03_17"</span><span class="p">,</span>
<span class="s2">"zip"</span><span class="p">,</span>
<span class="s2">"../diabetes_risk_model/model_files"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="s1">'</span><span class="s">/Users/brian/Code/securing-parameters-for-ml-models/diabetes_risk_model/diabetes_risk_model-0.1.0-2023_03_17.zip</span><span class="s1">'</span>
</code></pre></div>
<p>The command created a .zip file with all of the files in the model_files folder. The name of the zip file has the model name, model version, and today's date in it. This allows us to easily understand what the contents of the zip file are.</p>
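<p>The naming scheme can be generated programmatically. A small helper (hypothetical, not part of the project code) that builds the same <code>name-version-date</code> pattern:</p>

```python
from datetime import date
from typing import Optional


def archive_name(model_name: str, version: str, day: Optional[date] = None) -> str:
    """Build a 'name-version-YYYY_MM_DD' archive name,
    e.g. 'diabetes_risk_model-0.1.0-2023_03_17'."""
    day = day or date.today()
    return f"{model_name}-{version}-{day.strftime('%Y_%m_%d')}"


print(archive_name("diabetes_risk_model", "0.1.0", date(2023, 3, 17)))
# diabetes_risk_model-0.1.0-2023_03_17
```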
<p>Now that we have the model files in a .zip file, we can delete the original files from the folder:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">rm</span> <span class="o">../</span><span class="n">diabetes_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">data_report</span><span class="o">.</span><span class="n">html</span>
<span class="err">!</span><span class="n">rm</span> <span class="o">../</span><span class="n">diabetes_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">model</span><span class="o">.</span><span class="n">pkl</span>
<span class="err">!</span><span class="n">rm</span> <span class="o">../</span><span class="n">diabetes_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">signed_model</span><span class="o">.</span><span class="n">pkl</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">mv</span> <span class="o">../</span><span class="n">diabetes_risk_model</span><span class="o">/</span><span class="n">diabetes_risk_model</span><span class="o">-</span><span class="mf">0.1.0</span><span class="o">-</span><span class="mf">2023_03_17.</span><span class="n">zip</span> <span class="o">../</span><span class="n">diabetes_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">diabetes_risk_model</span><span class="o">-</span><span class="mf">0.1.0</span><span class="o">-</span><span class="mf">2023_03_17.</span><span class="n">zip</span>
</code></pre></div>
<p>This packaging process ensures that all of the results of the model training process end up in one archive that we can use later. The data and model check results are packaged along with the serialized model, so it's easy to review the model training process.</p>
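<p>When reviewing a packaged model later, we can list the archive's contents without extracting it. A short sketch using the standard library's <code>zipfile</code> module, with an in-memory archive standing in for the real model package:</p>

```python
import io
import zipfile


def package_contents(zip_source) -> list:
    """Return the sorted file names inside a model archive.

    Accepts a file path or any file-like object."""
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(archive.namelist())


# Demo: build a small in-memory archive and inspect it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("signed_model.pkl", b"...")
    archive.writestr("data_report.html", b"...")
print(package_contents(buffer))  # ['data_report.html', 'signed_model.pkl']
```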
<h2>Storing the Model Parameters</h2>
<p>In order to store the model parameters, we'll be using a local S3 compatible service called <a href="https://min.io/">minio</a>. The minio project replicates the S3 API, and also provides a docker image. </p>
<p>To use the minio service, we'll first need a folder to store the files that it will host:</p>
<div class="highlight"><pre><span></span><code><span class="n">mkdir</span> <span class="o">-</span><span class="n">p</span> <span class="o">../</span><span class="n">minio_data</span>
</code></pre></div>
<p>To run an instance of minio locally, use this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">9000</span><span class="p">:</span><span class="mi">9000</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">9001</span><span class="p">:</span><span class="mi">9001</span> \
<span class="o">-</span><span class="n">e</span> <span class="s2">"MINIO_ACCESS_KEY=TEST"</span> \
<span class="o">-</span><span class="n">e</span> <span class="s2">"MINIO_SECRET_KEY=ASDFGHJKL"</span> \
<span class="o">--</span><span class="n">name</span> <span class="n">minio</span> \
<span class="o">-</span><span class="n">v</span> <span class="err">$</span><span class="p">(</span><span class="n">pwd</span><span class="p">)</span><span class="o">/../</span><span class="n">minio_data</span><span class="p">:</span><span class="o">/</span><span class="n">data</span> \
<span class="n">quay</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">minio</span><span class="o">/</span><span class="n">minio</span> <span class="n">server</span> <span class="o">/</span><span class="n">data</span> <span class="o">--</span><span class="n">console</span><span class="o">-</span><span class="n">address</span> <span class="s2">":9001"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>d5283c718b1b1dc8d60eadbc03a2834647088474431c66bd032eab726670c1d7
</code></pre></div>
<p>The minio service instance running in the docker container serves files from the local filesystem. When minio runs in this way, it exposes the folders it finds in the local filesystem as buckets, through the same API as the AWS S3 service.</p>
<p>In order to easily interact with the minio service, we'll use the <a href="https://pypi.org/project/minio/">minio package</a>.</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="n">minio</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Let's create a client to connect to the minio service:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">minio</span> <span class="kn">import</span> <span class="n">Minio</span>
<span class="n">minio_client</span> <span class="o">=</span> <span class="n">Minio</span><span class="p">(</span><span class="s2">"127.0.0.1:9000"</span><span class="p">,</span>
<span class="n">access_key</span><span class="o">=</span><span class="s1">'TEST'</span><span class="p">,</span>
<span class="n">secret_key</span><span class="o">=</span><span class="s1">'ASDFGHJKL'</span><span class="p">,</span>
<span class="n">secure</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div>
<p>Let's make a bucket for the model files:</p>
<div class="highlight"><pre><span></span><code><span class="n">minio_client</span><span class="o">.</span><span class="n">make_bucket</span><span class="p">(</span><span class="s2">"model-files"</span><span class="p">)</span>
</code></pre></div>
<p>Now let's upload the packaged model parameters to the bucket so that we can make predictions with the model parameters later.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">io</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"../diabetes_risk_model/model_files/diabetes_risk_model-0.1.0-2023_03_17.zip"</span><span class="p">,</span> <span class="s2">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="n">zip_bytes</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">minio_client</span><span class="o">.</span><span class="n">put_object</span><span class="p">(</span>
<span class="n">bucket_name</span><span class="o">=</span><span class="s2">"model-files"</span><span class="p">,</span>
<span class="n">object_name</span><span class="o">=</span><span class="s2">"diabetes_risk_model-0.1.0-2023_03_17.zip"</span><span class="p">,</span>
<span class="n">data</span><span class="o">=</span><span class="n">io</span><span class="o">.</span><span class="n">BytesIO</span><span class="p">(</span><span class="n">zip_bytes</span><span class="p">),</span>
<span class="n">length</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">zip_bytes</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div>
<p>The model parameters are now in place to be used for making predictions. The zip file shows up in the Minio UI:</p>
<p><img alt="Minio UI" src="https://www.tekhnoal.com/minio_ui_sdfmlm.png" width="100%"></p>
<p>We uploaded the model parameters to an external storage service to show how they can be hosted outside of the deployment itself. By signing the model parameters before storing them in the minio service, we can be sure that the parameters have not been tampered with even if the minio service is compromised. Because the parameters are signed, an attacker would also need the secret key in order to modify the parameters that the deployed model uses.</p>
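<p>Putting this together, the deployed model's load path becomes: download the signed bytes from storage, verify the signature, and only then unpickle. A hypothetical sketch of that order of operations, using a raw stdlib HMAC tag in place of the itsdangerous signature format:</p>

```python
import hashlib
import hmac
import pickle


def load_model(signed: bytes, key: bytes):
    """Verify the trailing HMAC-SHA256 tag, then unpickle the payload.

    Never call pickle.loads on bytes that have not been verified first."""
    payload, tag = signed[:-32], signed[-32:]  # SHA-256 tags are 32 bytes
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("model parameters failed signature verification")
    return pickle.loads(payload)


# Demo: sign a stand-in "model" and load it back through verification.
key = b"secret-key"
payload = pickle.dumps({"model": "demo"})
signed = payload + hmac.new(key, payload, hashlib.sha256).digest()
print(load_model(signed, key))  # {'model': 'demo'}
```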
<h2>Making Predictions with the Model</h2>
<p>We now have a working model that accepts Pandas dataframes as input and also returns predictions in dataframes. This is useful in the context of model training, but makes integrating the model with other software components a lot more complicated. To make the model easier to use, we'll need to create input and output schemas for the model and also create a wrapper class that provides a consistent interface for the model.</p>
<p>We'll create the model's input and output schemas with the <a href="https://docs.pydantic.dev/">pydantic package</a>, which is used for data validation. By creating the schemas using this package we're able to fully document the inputs that the model accepts and the expected outputs of the model we're going to deploy.</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="n">pydantic</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To begin, we'll define the allowed values for the categorical variables. The model uses three categorical variables, so we'll define three Enum classes that contain the values accepted for these variables. By using enumerated values, we can ensure that the model can only receive values in these inputs that it has previously seen in the training set.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">enum</span> <span class="kn">import</span> <span class="n">Enum</span>
<span class="k">class</span> <span class="nc">GeneralHealth</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""How would you say that in general your health is?"""</span>
<span class="n">EXCELLENT</span> <span class="o">=</span> <span class="s2">"EXCELLENT"</span>
<span class="n">VERY_GOOD</span> <span class="o">=</span> <span class="s2">"VERY_GOOD"</span>
<span class="n">GOOD</span> <span class="o">=</span> <span class="s2">"GOOD"</span>
<span class="n">FAIR</span> <span class="o">=</span> <span class="s2">"FAIR"</span>
<span class="n">POOR</span> <span class="o">=</span> <span class="s2">"POOR"</span>
<span class="nd">@staticmethod</span>
<span class="k">def</span> <span class="nf">map</span><span class="p">(</span><span class="n">value</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="n">mapping</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"EXCELLENT"</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
<span class="s2">"VERY_GOOD"</span><span class="p">:</span> <span class="mf">2.0</span><span class="p">,</span>
<span class="s2">"GOOD"</span><span class="p">:</span> <span class="mf">3.0</span><span class="p">,</span>
<span class="s2">"FAIR"</span><span class="p">:</span> <span class="mf">4.0</span><span class="p">,</span>
<span class="s2">"POOR"</span><span class="p">:</span> <span class="mf">5.0</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">mapping</span><span class="p">[</span><span class="n">value</span><span class="p">]</span>
<span class="k">class</span> <span class="nc">Age</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""How old are you?"""</span>
<span class="n">EIGHTEEN_TO_TWENTY_FOUR</span> <span class="o">=</span> <span class="s2">"EIGHTEEN_TO_TWENTY_FOUR"</span>
<span class="n">TWENTY_FIVE_TO_TWENTY_NINE</span> <span class="o">=</span> <span class="s2">"TWENTY_FIVE_TO_TWENTY_NINE"</span>
<span class="n">THIRTY_TO_THIRTY_FOUR</span> <span class="o">=</span> <span class="s2">"THIRTY_TO_THIRTY_FOUR"</span>
<span class="n">THIRTY_FIVE_TO_THIRTY_NINE</span> <span class="o">=</span> <span class="s2">"THIRTY_FIVE_TO_THIRTY_NINE"</span>
<span class="n">FORTY_TO_FORTY_FOUR</span> <span class="o">=</span> <span class="s2">"FORTY_TO_FORTY_FOUR"</span>
<span class="n">FORTY_FIVE_TO_FORTY_NINE</span> <span class="o">=</span> <span class="s2">"FORTY_FIVE_TO_FORTY_NINE"</span>
<span class="n">FIFTY_TO_FIFTY_FOUR</span> <span class="o">=</span> <span class="s2">"FIFTY_TO_FIFTY_FOUR"</span>
<span class="n">FIFTY_FIVE_TO_FIFTY_NINE</span> <span class="o">=</span> <span class="s2">"FIFTY_FIVE_TO_FIFTY_NINE"</span>
<span class="n">SIXTY_TO_SIXTY_FOUR</span> <span class="o">=</span> <span class="s2">"SIXTY_TO_SIXTY_FOUR"</span>
<span class="n">SIXTY_FIVE_TO_SIXTY_NINE</span> <span class="o">=</span> <span class="s2">"SIXTY_FIVE_TO_SIXTY_NINE"</span>
<span class="n">SEVENTY_TO_SEVENTY_FOUR</span> <span class="o">=</span> <span class="s2">"SEVENTY_TO_SEVENTY_FOUR"</span>
<span class="n">SEVENTY_FIVE_TO_SEVENTY_NINE</span> <span class="o">=</span> <span class="s2">"SEVENTY_FIVE_TO_SEVENTY_NINE"</span>
<span class="n">EIGHTY_OR_OLDER</span> <span class="o">=</span> <span class="s2">"EIGHTY_OR_OLDER"</span>
<span class="nd">@staticmethod</span>
<span class="k">def</span> <span class="nf">map</span><span class="p">(</span><span class="n">value</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="n">mapping</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"EIGHTEEN_TO_TWENTY_FOUR"</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
<span class="s2">"TWENTY_FIVE_TO_TWENTY_NINE"</span><span class="p">:</span> <span class="mf">2.0</span><span class="p">,</span>
<span class="s2">"THIRTY_TO_THIRTY_FOUR"</span><span class="p">:</span> <span class="mf">3.0</span><span class="p">,</span>
<span class="s2">"THIRTY_FIVE_TO_THIRTY_NINE"</span><span class="p">:</span> <span class="mf">4.0</span><span class="p">,</span>
<span class="s2">"FORTY_TO_FORTY_FOUR"</span><span class="p">:</span> <span class="mf">5.0</span><span class="p">,</span>
<span class="s2">"FORTY_FIVE_TO_FORTY_NINE"</span><span class="p">:</span> <span class="mf">6.0</span><span class="p">,</span>
<span class="s2">"FIFTY_TO_FIFTY_FOUR"</span><span class="p">:</span> <span class="mf">7.0</span><span class="p">,</span>
<span class="s2">"FIFTY_FIVE_TO_FIFTY_NINE"</span><span class="p">:</span> <span class="mf">8.0</span><span class="p">,</span>
<span class="s2">"SIXTY_TO_SIXTY_FOUR"</span><span class="p">:</span> <span class="mf">9.0</span><span class="p">,</span>
<span class="s2">"SIXTY_FIVE_TO_SIXTY_NINE"</span><span class="p">:</span> <span class="mf">10.0</span><span class="p">,</span>
<span class="s2">"SEVENTY_TO_SEVENTY_FOUR"</span><span class="p">:</span> <span class="mf">11.0</span><span class="p">,</span>
<span class="s2">"SEVENTY_FIVE_TO_SEVENTY_NINE"</span><span class="p">:</span> <span class="mf">12.0</span><span class="p">,</span>
<span class="s2">"EIGHTY_OR_OLDER"</span><span class="p">:</span> <span class="mf">13.0</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">mapping</span><span class="p">[</span><span class="n">value</span><span class="p">]</span>
<span class="k">class</span> <span class="nc">Income</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""What is your income?"""</span>
<span class="n">LESS_THAN_10K</span> <span class="o">=</span> <span class="s2">"LESS_THAN_10K"</span>
<span class="n">BETWEEN_10K_AND_15K</span> <span class="o">=</span> <span class="s2">"BETWEEN_10K_AND_15K"</span>
<span class="n">BETWEEN_15K_AND_20K</span> <span class="o">=</span> <span class="s2">"BETWEEN_15K_AND_20K"</span>
<span class="n">BETWEEN_20K_AND_25K</span> <span class="o">=</span> <span class="s2">"BETWEEN_20K_AND_25K"</span>
<span class="n">BETWEEN_25K_AND_35K</span> <span class="o">=</span> <span class="s2">"BETWEEN_25K_AND_35K"</span>
<span class="n">BETWEEN_35K_AND_50K</span> <span class="o">=</span> <span class="s2">"BETWEEN_35K_AND_50K"</span>
<span class="n">BETWEEN_50K_AND_75K</span> <span class="o">=</span> <span class="s2">"BETWEEN_50K_AND_75K"</span>
<span class="n">SEVENTY_FIVE_THOUSAND_OR_MORE</span> <span class="o">=</span> <span class="s2">"SEVENTY_FIVE_THOUSAND_OR_MORE"</span>
<span class="nd">@staticmethod</span>
<span class="k">def</span> <span class="nf">map</span><span class="p">(</span><span class="n">value</span><span class="p">)</span> <span class="o">-></span> <span class="nb">float</span><span class="p">:</span>
<span class="n">mapping</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"LESS_THAN_10K"</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
<span class="s2">"BETWEEN_10K_AND_15K"</span><span class="p">:</span> <span class="mf">2.0</span><span class="p">,</span>
<span class="s2">"BETWEEN_15K_AND_20K"</span><span class="p">:</span> <span class="mf">3.0</span><span class="p">,</span>
<span class="s2">"BETWEEN_20K_AND_25K"</span><span class="p">:</span> <span class="mf">4.0</span><span class="p">,</span>
<span class="s2">"BETWEEN_25K_AND_35K"</span><span class="p">:</span> <span class="mf">5.0</span><span class="p">,</span>
<span class="s2">"BETWEEN_35K_AND_50K"</span><span class="p">:</span> <span class="mf">6.0</span><span class="p">,</span>
<span class="s2">"BETWEEN_50K_AND_75K"</span><span class="p">:</span> <span class="mf">7.0</span><span class="p">,</span>
<span class="s2">"SEVENTY_FIVE_THOUSAND_OR_MORE"</span><span class="p">:</span> <span class="mf">8.0</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">mapping</span><span class="p">[</span><span class="n">value</span><span class="p">]</span>
</code></pre></div>
<p>The enum classes contain the values found in the original training dataset. These variables were encoded as numbers in the dataset, so we also added a map() method to each Enum class that returns the number corresponding to the enumerated value passed in. We'll use these map() methods later. </p>
<p>If we didn't provide these enumerated values and the mapping, we'd be asking the user of the model to encode the values before sending them to the model. They would have to read and understand the dataset documentation to create their prediction request. By creating enumerations for the categorical values, we make it much easier to use the model. </p>
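<p>To make this concrete, here is the GeneralHealth enum exercised on its own. This is a minimal, self-contained sketch that repeats the class definition from above so it can run standalone:</p>

```python
from enum import Enum


class GeneralHealth(str, Enum):
    """How would you say that in general your health is?"""
    EXCELLENT = "EXCELLENT"
    VERY_GOOD = "VERY_GOOD"
    GOOD = "GOOD"
    FAIR = "FAIR"
    POOR = "POOR"

    @staticmethod
    def map(value) -> float:
        # converts the enum's string value to the numeric encoding used in the dataset
        mapping = {
            "EXCELLENT": 1.0,
            "VERY_GOOD": 2.0,
            "GOOD": 3.0,
            "FAIR": 4.0,
            "POOR": 5.0
        }
        return mapping[value]


# the user works with readable string values...
health = GeneralHealth("FAIR")

# ...and map() recovers the numeric encoding the model was trained on
encoded = GeneralHealth.map(health.value)
```

<p>Because the enum inherits from str, the user can pass plain strings in requests while the map() method handles the translation back to the dataset's numeric encoding.</p>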
<p>Now that we have the categorical variables defined, we can define the input schema for the model:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span><span class="p">,</span> <span class="n">Field</span>
<span class="k">class</span> <span class="nc">DiabetesRiskModelInput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">body_mass_index</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Body Mass Index."</span><span class="p">)</span>
<span class="n">general_health</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">GeneralHealth</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"How would you say that in general your health is?"</span><span class="p">)</span>
<span class="n">age</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Age</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"How old are you?"</span><span class="p">)</span>
<span class="n">income</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Income</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"What is your income?"</span><span class="p">)</span>
</code></pre></div>
<p>The schema is called "DiabetesRiskModelInput" and contains a field for each variable. We're using the Enum classes defined above for the categorical fields, and a numerical field for body_mass_index. The numerical field has a range of allowed values that matches the range of the variable found in the dataset. Each field also has a description that helps the user of the model feed it the correct data. We only have 4 input variables because the feature selection process removed 17 features from the model's training set.</p>
<p>The process of creating an input data schema exposes information about the dataset the model was trained on to the user of the model. For example, the body_mass_index field only allows values between 15 and 60, which is the range of that variable in the training set.</p>
<p>To show how the schema classes work, let's try to instantiate the DiabetesRiskModelInput class:</p>
<div class="highlight"><pre><span></span><code><span class="n">input_instance</span> <span class="o">=</span> <span class="n">DiabetesRiskModelInput</span><span class="p">(</span>
<span class="n">body_mass_index</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
<span class="n">general_health</span><span class="o">=</span><span class="n">GeneralHealth</span><span class="o">.</span><span class="n">VERY_GOOD</span><span class="p">,</span>
<span class="n">age</span><span class="o">=</span><span class="n">Age</span><span class="o">.</span><span class="n">THIRTY_TO_THIRTY_FOUR</span><span class="p">,</span>
<span class="n">income</span><span class="o">=</span><span class="n">Income</span><span class="o">.</span><span class="n">BETWEEN_20K_AND_25K</span>
<span class="p">)</span>
<span class="n">input_instance</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>DiabetesRiskModelInput(body_mass_index=20, general_health=<GeneralHealth.VERY_GOOD: 'VERY_GOOD'>, age=<Age.THIRTY_TO_THIRTY_FOUR: 'THIRTY_TO_THIRTY_FOUR'>, income=<Income.BETWEEN_20K_AND_25K: 'BETWEEN_20K_AND_25K'>)
</code></pre></div>
<p>The instance of the schema class contains all of the information needed to make a prediction.</p>
<p>Now let's try to instantiate it with values that are not allowed by the schema:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">ValidationError</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">input_instance</span> <span class="o">=</span> <span class="n">DiabetesRiskModelInput</span><span class="p">(</span>
<span class="n">body_mass_index</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
<span class="n">general_health</span><span class="o">=</span><span class="n">GeneralHealth</span><span class="o">.</span><span class="n">VERY_GOOD</span><span class="p">,</span>
<span class="n">age</span><span class="o">=</span><span class="n">Age</span><span class="o">.</span><span class="n">THIRTY_TO_THIRTY_FOUR</span><span class="p">,</span>
<span class="n">income</span><span class="o">=</span><span class="n">Income</span><span class="o">.</span><span class="n">BETWEEN_20K_AND_25K</span><span class="p">)</span>
<span class="k">except</span> <span class="n">ValidationError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"ValidationError exception raised!"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>1 validation error for DiabetesRiskModelInput
body_mass_index
  ensure this value is greater than or equal to 15 (type=value_error.number.not_ge; limit_value=15)
ValidationError exception raised!
</code></pre></div>
<p>The class was not instantiated successfully because the value for body_mass_index is too low and the model cannot accept it. By using the pydantic package, we're able to describe exactly what values the model is able to accept.</p>
<p>We can also omit values that are marked as optional:</p>
<div class="highlight"><pre><span></span><code><span class="n">input_instance</span> <span class="o">=</span> <span class="n">DiabetesRiskModelInput</span><span class="p">(</span>
<span class="n">body_mass_index</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
<span class="n">age</span><span class="o">=</span><span class="n">Age</span><span class="o">.</span><span class="n">THIRTY_TO_THIRTY_FOUR</span><span class="p">,</span>
<span class="n">income</span><span class="o">=</span><span class="n">Income</span><span class="o">.</span><span class="n">BETWEEN_20K_AND_25K</span><span class="p">)</span>
<span class="n">input_instance</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>DiabetesRiskModelInput(body_mass_index=20, general_health=None, age=<Age.THIRTY_TO_THIRTY_FOUR: 'THIRTY_TO_THIRTY_FOUR'>, income=<Income.BETWEEN_20K_AND_25K: 'BETWEEN_20K_AND_25K'>)
</code></pre></div>
<p>In this case, we did not provide a value for general_health, which defaults to None. We can do this because the model has built-in imputers that fill in a value whenever the user does not provide one. </p>
<p>Now that we have the model's input schema defined, we'll define the output schema:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">DiabetesRisk</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""Risk of diabetes."""</span>
<span class="n">NO_DIABETES</span> <span class="o">=</span> <span class="s2">"NO_DIABETES"</span>
<span class="n">DIABETES</span> <span class="o">=</span> <span class="s2">"DIABETES"</span>
<span class="nd">@staticmethod</span>
<span class="k">def</span> <span class="nf">map</span><span class="p">(</span><span class="n">value</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="n">mapping</span> <span class="o">=</span> <span class="p">{</span>
<span class="mf">0.0</span><span class="p">:</span> <span class="n">DiabetesRisk</span><span class="o">.</span><span class="n">NO_DIABETES</span><span class="p">,</span>
<span class="mf">1.0</span><span class="p">:</span> <span class="n">DiabetesRisk</span><span class="o">.</span><span class="n">DIABETES</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">mapping</span><span class="p">[</span><span class="n">value</span><span class="p">]</span>
<span class="k">class</span> <span class="nc">DiabetesRiskModelOutput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="sd">"""Diabetes risk model output."""</span>
<span class="n">diabetes_risk</span><span class="p">:</span> <span class="n">DiabetesRisk</span>
</code></pre></div>
<p>The model is a classification model and the output schema simply enumerates the classes that the model can predict. We also added a map() method to the DiabetesRisk class in order to map the number that is output by the model to a value that can be returned to the user.</p>
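<p>To show the mapping direction, here is the DiabetesRisk class exercised on its own. This is a self-contained sketch that repeats the definition from above so it can run standalone:</p>

```python
from enum import Enum


class DiabetesRisk(str, Enum):
    """Risk of diabetes."""
    NO_DIABETES = "NO_DIABETES"
    DIABETES = "DIABETES"

    @staticmethod
    def map(value: float) -> str:
        # converts the model's raw numeric prediction into the enumerated class
        mapping = {
            0.0: DiabetesRisk.NO_DIABETES,
            1.0: DiabetesRisk.DIABETES
        }
        return mapping[value]


# the classifier outputs 0.0 or 1.0; map() turns that into a user-friendly value
prediction = DiabetesRisk.map(1.0)
```

<p>Note that this map() goes in the opposite direction from the input enums: instead of encoding a string for the model, it decodes the model's numeric output for the user.</p>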
<p>One of the benefits of using the pydantic package is that each schema class can create a JSON Schema description for itself:</p>
<div class="highlight"><pre><span></span><code><span class="n">json_schema</span> <span class="o">=</span> <span class="n">DiabetesRiskModelOutput</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
<span class="n">json_schema</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'DiabetesRiskModelOutput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Diabetes risk model output.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'diabetes_risk'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/DiabetesRisk'</span><span class="p">}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'required'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'diabetes_risk'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'definitions'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'DiabetesRisk'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'DiabetesRisk'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Risk of diabetes.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'NO_DIABETES'</span><span class="p">,</span><span class="w"> </span><span class="s1">'DIABETES'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}}}</span><span class="w"></span>
</code></pre></div>
<p>JSON schemas are useful for documenting data structures. We'll use this JSON schema later in order to automatically build documentation for the deployed model.</p>
<p>Now that we have the input and output schemas defined, we can tie it all together by creating a wrapper class for the model. To do this we'll use the <a href="https://pypi.org/project/ml-base/">ml_base package</a>. </p>
<p>To install the ml_base package, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="n">ml_base</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The ml_base package defines a simple base class for model prediction code that allows us to "wrap" the prediction code for a model in a class that follows the MLModel interface. The interface publishes the following information about the model:</p>
<ul>
<li>Qualified Name, a unique identifier for the model.</li>
<li>Display Name, a friendly name for the model used in user interfaces.</li>
<li>Description, a description for the model.</li>
<li>Version, semantic version of the model codebase.</li>
<li>Input Schema, an object that describes the model's input data.</li>
<li>Output Schema, an object that describes the model's output schema.</li>
</ul>
<p>The MLModel interface dictates that the model class implements two methods:</p>
<ul>
<li>__init__(), the initialization method, which loads any model parameters needed to make predictions</li>
<li>predict(), the prediction method, which receives model inputs, makes a prediction, and returns model outputs</li>
</ul>
<p>By using the MLModel base class we'll be able to do more interesting things later with the model. If you'd like to learn more about the ml_base package, <a href="https://schmidtbri.github.io/ml-base/basic/">here</a> is some documentation about it.</p>
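<p>The shape of the interface can be sketched with a plain abstract base class. This is an illustration only, not the actual ml_base implementation, and the EchoModel class is a made-up example:</p>

```python
from abc import ABC, abstractmethod


class MLModelSketch(ABC):
    """Simplified stand-in for the ml_base MLModel interface."""

    @property
    @abstractmethod
    def qualified_name(self) -> str:
        """Unique identifier for the model."""

    @abstractmethod
    def predict(self, data):
        """Receive model input, make a prediction, and return model output."""


class EchoModel(MLModelSketch):
    """Trivial model that returns its input unchanged."""

    @property
    def qualified_name(self) -> str:
        return "echo_model"

    def predict(self, data):
        return data


model = EchoModel()
result = model.predict({"body_mass_index": 20})
```

<p>Because every model exposes the same properties and predict() method, code that serves or decorates models can treat them interchangeably, which is what makes the decorator pattern possible later on.</p>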
<p>We'll define the wrapper class like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">pickle</span>
<span class="kn">from</span> <span class="nn">io</span> <span class="kn">import</span> <span class="n">BytesIO</span>
<span class="kn">import</span> <span class="nn">zipfile</span>
<span class="kn">from</span> <span class="nn">itsdangerous</span> <span class="kn">import</span> <span class="n">Signer</span>
<span class="kn">from</span> <span class="nn">minio</span> <span class="kn">import</span> <span class="n">Minio</span>
<span class="kn">from</span> <span class="nn">ml_base</span> <span class="kn">import</span> <span class="n">MLModel</span>
<span class="k">class</span> <span class="nc">DiabetesRiskModel</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="sd">"""Prediction logic for the Diabetes Risk Model."""</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">display_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"Diabetes Risk Model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">qualified_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"diabetes_risk_model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">description</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"Model to predict the diabetes risk of a patient."</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">version</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"0.1.0"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="n">DiabetesRiskModelInput</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="n">DiabetesRiskModelOutput</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model_parameters_version</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="n">model_files_bucket</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="n">minio_url</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="n">minio_access_key</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="n">minio_secret_key</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="n">parameters_signing_key</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
<span class="c1"># retrieving values from environment variables if the values provided have ${} around them</span>
<span class="k">if</span> <span class="n">minio_access_key</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"${"</span> <span class="ow">and</span> <span class="n">minio_access_key</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"}"</span><span class="p">:</span>
<span class="n">minio_access_key</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="n">minio_access_key</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
<span class="k">if</span> <span class="n">minio_secret_key</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"${"</span> <span class="ow">and</span> <span class="n">minio_secret_key</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"}"</span><span class="p">:</span>
<span class="n">minio_secret_key</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="n">minio_secret_key</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
<span class="k">if</span> <span class="n">parameters_signing_key</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"${"</span> <span class="ow">and</span> <span class="n">parameters_signing_key</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"}"</span><span class="p">:</span>
<span class="n">parameters_signing_key</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="n">parameters_signing_key</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
<span class="n">minio_client</span> <span class="o">=</span> <span class="n">Minio</span><span class="p">(</span><span class="n">minio_url</span><span class="p">,</span>
<span class="n">access_key</span><span class="o">=</span><span class="n">minio_access_key</span><span class="p">,</span>
<span class="n">secret_key</span><span class="o">=</span><span class="n">minio_secret_key</span><span class="p">,</span>
<span class="n">secure</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="c1"># accessing the model file stored in minio</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">minio_client</span><span class="o">.</span><span class="n">get_object</span><span class="p">(</span><span class="n">model_files_bucket</span><span class="p">,</span>
<span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">qualified_name</span><span class="si">}</span><span class="s2">-</span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">version</span><span class="si">}</span><span class="s2">-</span><span class="si">{</span><span class="n">model_parameters_version</span><span class="si">}</span><span class="s2">.zip"</span><span class="p">)</span>
<span class="n">zip_bytes</span> <span class="o">=</span> <span class="n">BytesIO</span><span class="p">(</span><span class="n">response</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="n">response</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">response</span><span class="o">.</span><span class="n">release_conn</span><span class="p">()</span>
<span class="c1"># unzipping the parameters</span>
<span class="k">with</span> <span class="n">zipfile</span><span class="o">.</span><span class="n">ZipFile</span><span class="p">(</span><span class="n">zip_bytes</span><span class="p">)</span> <span class="k">as</span> <span class="n">zf</span><span class="p">:</span>
<span class="k">if</span> <span class="s2">"signed_model.pkl"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">zf</span><span class="o">.</span><span class="n">namelist</span><span class="p">():</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Could not find signed model file in zip file."</span><span class="p">)</span>
<span class="n">signed_model_bytes</span> <span class="o">=</span> <span class="n">zf</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s2">"signed_model.pkl"</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">RuntimeError</span><span class="p">(</span><span class="s2">"Could not access model file."</span><span class="p">)</span> <span class="kn">from</span> <span class="nn">e</span>
<span class="c1"># checking the signed parameters</span>
<span class="n">signer</span> <span class="o">=</span> <span class="n">Signer</span><span class="p">(</span><span class="n">parameters_signing_key</span><span class="p">)</span>
<span class="n">unsigned_model_bytes</span> <span class="o">=</span> <span class="n">signer</span><span class="o">.</span><span class="n">unsign</span><span class="p">(</span><span class="n">signed_model_bytes</span><span class="p">)</span>
<span class="c1"># unpickling the model object</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">unsigned_model_bytes</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">DiabetesRiskModelInput</span><span class="p">)</span> <span class="o">-></span> <span class="n">DiabetesRiskModelOutput</span><span class="p">:</span>
<span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="n">DiabetesRiskModelInput</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Input must be of type 'DiabetesRiskModelInput'"</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([[</span>
<span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">body_mass_index</span><span class="p">,</span>
<span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">GeneralHealth</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">general_health</span><span class="p">),</span>
<span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">Age</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">age</span><span class="p">),</span>
<span class="kc">None</span><span class="p">,</span>
<span class="n">Income</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">income</span><span class="p">),</span>
<span class="p">]],</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">'HighBloodPressure'</span><span class="p">,</span> <span class="s1">'HighCholesterol'</span><span class="p">,</span> <span class="s1">'CholesterolChecked'</span><span class="p">,</span> <span class="s1">'BMI'</span><span class="p">,</span> <span class="s1">'Smoker'</span><span class="p">,</span> <span class="s1">'Stroke'</span><span class="p">,</span>
<span class="s1">'HeartDiseaseOrHeartAttack'</span><span class="p">,</span> <span class="s1">'PhysicalActivity'</span><span class="p">,</span> <span class="s1">'Fruits'</span><span class="p">,</span> <span class="s1">'Veggies'</span><span class="p">,</span>
<span class="s1">'HeavyAlchoholConsumption'</span><span class="p">,</span> <span class="s1">'AnyHealthcare'</span><span class="p">,</span> <span class="s1">'NoDoctorsVisitBecauseOfCost'</span><span class="p">,</span>
<span class="s1">'GeneralHealth'</span><span class="p">,</span> <span class="s1">'MentalHealth'</span><span class="p">,</span> <span class="s1">'PhysicalHealth'</span><span class="p">,</span> <span class="s1">'DifficultyWalking'</span><span class="p">,</span> <span class="s1">'Sex'</span><span class="p">,</span>
<span class="s1">'Age'</span><span class="p">,</span> <span class="s1">'Education'</span><span class="p">,</span> <span class="s1">'Income'</span><span class="p">])</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">return</span> <span class="n">DiabetesRiskModelOutput</span><span class="p">(</span><span class="n">diabetes_risk</span><span class="o">=</span><span class="n">DiabetesRisk</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">y_hat</span><span class="p">))</span>
</code></pre></div>
<p>The model class's __init__() method loads the model parameters from the minio service, verifies their signature, and deserializes the pickle into a model object that we can use to make predictions. The predict() method uses that model object to make predictions. Notice that the enumerated input values are mapped into the numbers that the model expects before the prediction is made, and the model's numeric prediction is mapped back into an enumerated value that can be returned to the user.</p>
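<p>The verify-then-deserialize step is the important security detail here: the pickle is only loaded after the signature checks out. As a rough stdlib analogue of what the signing library behind the Signer class does (this HMAC sketch and its key are illustrative only, not the post's actual Signer implementation):</p>

```python
import hashlib
import hmac
import pickle

# illustrative key only; the real service receives its key through configuration
SIGNING_KEY = b"example-signing-key"

def sign(payload: bytes, key: bytes) -> bytes:
    # append an HMAC-SHA256 signature after a separator
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    return payload + b"." + sig

def unsign(signed: bytes, key: bytes) -> bytes:
    # split off the signature and verify it before trusting the payload
    payload, _, sig = signed.rpartition(b".")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("Signature check failed, refusing to deserialize.")
    return payload

model_object = {"weights": [0.1, 0.2]}  # stand-in for a trained model
signed_bytes = sign(pickle.dumps(model_object), SIGNING_KEY)
restored = pickle.loads(unsign(signed_bytes, SIGNING_KEY))
```

<p>The key point is that pickle.loads() is never called on bytes that failed verification, since unpickling untrusted data can execute arbitrary code.</p>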
<p>Let's instantiate the model class we defined above:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">DiabetesRiskModel</span><span class="p">(</span>
<span class="n">model_parameters_version</span><span class="o">=</span><span class="s2">"2023_03_17"</span><span class="p">,</span>
<span class="n">model_files_bucket</span><span class="o">=</span><span class="s2">"model-files"</span><span class="p">,</span>
<span class="n">minio_url</span><span class="o">=</span><span class="s2">"127.0.0.1:9000"</span><span class="p">,</span>
<span class="n">minio_access_key</span><span class="o">=</span><span class="s2">"TEST"</span><span class="p">,</span>
<span class="n">minio_secret_key</span><span class="o">=</span><span class="s2">"ASDFGHJKL"</span><span class="p">,</span>
<span class="n">parameters_signing_key</span><span class="o">=</span><span class="s2">"wjtRFppXQpxTChQnNcQJKGlLHKJBmAHMepfFbqvOoUrnuxIsKdiLCrrypYFQsqcw"</span><span class="p">)</span>
</code></pre></div>
<p>When the model object was instantiated, the __init__() method loaded the zip file containing the model pickle from the minio service, verified that the bytes had not been changed using the secret key, and deserialized the model. The model object is now ready to make predictions.</p>
<p>We can make a prediction with the model by first building a DiabetesRiskModelInput object:</p>
<div class="highlight"><pre><span></span><code><span class="n">input_instance</span> <span class="o">=</span> <span class="n">DiabetesRiskModelInput</span><span class="p">(</span>
<span class="n">body_mass_index</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
<span class="n">general_health</span><span class="o">=</span><span class="n">GeneralHealth</span><span class="o">.</span><span class="n">VERY_GOOD</span><span class="p">,</span>
<span class="n">age</span><span class="o">=</span><span class="n">Age</span><span class="o">.</span><span class="n">THIRTY_TO_THIRTY_FOUR</span><span class="p">,</span>
<span class="n">income</span><span class="o">=</span><span class="n">Income</span><span class="o">.</span><span class="n">BETWEEN_20K_AND_25K</span>
<span class="p">)</span>
</code></pre></div>
<p>Then use the input object with the model's predict() method:</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">input_instance</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>DiabetesRiskModelOutput(diabetes_risk=<DiabetesRisk.NO_DIABETES: 'NO_DIABETES'>)
</code></pre></div>
<p>The model predicted that the patient does not have diabetes.</p>
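<p>The GeneralHealth.map(), Age.map(), and Income.map() calls inside predict() are what translate the user-facing enumerated values into the integer codes the model was trained on. A minimal sketch of that pattern (the enum members and integer codes below are illustrative, not the model's actual coding scheme):</p>

```python
from enum import Enum

class ExampleGeneralHealth(Enum):
    # hypothetical members; the real enums live in the model's input schema module
    EXCELLENT = "EXCELLENT"
    VERY_GOOD = "VERY_GOOD"
    GOOD = "GOOD"
    FAIR = "FAIR"
    POOR = "POOR"

    @classmethod
    def map(cls, value: "ExampleGeneralHealth") -> int:
        # translate the enumerated value into the numeric code the model expects
        codes = {cls.EXCELLENT: 1, cls.VERY_GOOD: 2, cls.GOOD: 3,
                 cls.FAIR: 4, cls.POOR: 5}
        return codes[value]
```

<p>Keeping the mapping next to the enum definition means the API can expose readable string values while the model still receives the numeric encoding it was trained with.</p>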
<h2>Creating a RESTful Service</h2>
<p>Now that we have a working model that loads and verifies parameters from minio, we can deploy the model into a service. To do this, we won't need to write any extra code; we can leverage the <a href="https://pypi.org/project/rest-model-service/">rest_model_service package</a> to provide the RESTful API for the service. You can learn more about the package in <a href="https://www.tekhnoal.com/rest-model-service.html">this blog post</a>.</p>
<p>To install the package, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">pip</span> <span class="n">install</span> <span class="n">rest_model_service</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To create a service for our model, all we need to do is add a YAML configuration file to the project. The configuration file looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Diabetes Risk Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">diabetes_risk_model.prediction.model.DiabetesRiskModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
<span class="w"> </span><span class="nt">configuration</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">model_parameters_version</span><span class="p">:</span><span class="w"> </span><span class="s">"2023_03_17"</span><span class="w"></span>
<span class="w"> </span><span class="nt">model_files_bucket</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">model-files</span><span class="w"></span>
<span class="w"> </span><span class="nt">minio_url</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">127.0.0.1:9000</span><span class="w"></span>
<span class="w"> </span><span class="nt">minio_access_key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">TEST</span><span class="w"></span>
<span class="w"> </span><span class="nt">minio_secret_key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ASDFGHJKL</span><span class="w"></span>
<span class="w"> </span><span class="nt">parameters_signing_key</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">wjtRFppXQpxTChQnNcQJKGlLHKJBmAHMepfFbqvOoUrnuxIsKdiLCrrypYFQsqcw</span><span class="w"></span>
</code></pre></div>
<p>The "service_title" field is the name of the service as it will appear in the documentation. The "models" field is an array that contains the details of the models we would like to deploy in the service. The "class_path" points at the MLModel class that implements the model's prediction logic. </p>
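<p>Under the hood, a "class_path" string like this is typically resolved with a dynamic import. A sketch of how such a loader might work (the rest_model_service package's actual implementation may differ):</p>

```python
import importlib

def load_class(class_path: str) -> type:
    # split "package.module.ClassName" into a module path and a class name
    module_path, _, class_name = class_path.rpartition(".")
    # import the module, then look the class up by name
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# any importable class can be resolved this way, for example:
ordered_dict_class = load_class("collections.OrderedDict")
```

<p>This is why the PYTHONPATH needs to include the project root when the service starts: the module named in "class_path" must be importable from the service process.</p>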
<p>Using the configuration file, we're able to create an OpenAPI specification file for the model service by executing these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
generate_openapi --configuration_file<span class="o">=</span>configuration/rest_config.yaml --output_file<span class="o">=</span>service_contract.yaml
</code></pre></div>
<p>The generate_openapi command generated the service_contract.yaml file, which contains the <a href="https://en.wikipedia.org/wiki/OpenAPI_Specification">OpenAPI specification</a> for the model service. The "/api/models/diabetes_risk_model/prediction" endpoint is the one we'll call to make predictions with the model. The model's input and output schemas were automatically extracted and added to the specification. The service_contract.yaml file is available in the root of the GitHub repository.</p>
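<p>The generated contract includes an entry for the prediction endpoint along the lines of the excerpt below (illustrative only; the generated service_contract.yaml file is the source of truth):</p>

```yaml
paths:
  /api/models/diabetes_risk_model/prediction:
    post:
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/DiabetesRiskModelInput'
      responses:
        '200':
          description: Successful Response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/DiabetesRiskModelOutput'
```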
<p>To run the model service locally, execute these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>./configuration/rest_config.yaml
uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>The service comes up and can be accessed in a web browser at http://127.0.0.1:8000. When you access that URL you will be redirected to the documentation page that is generated by the FastAPI package:</p>
<p><img alt="FastAPI Documentation" src="https://www.tekhnoal.com/fastapi_documentation_sdfmlm.png" width="100%"></p>
<p>The documentation allows you to make requests against the API in order to try it out. Here's a prediction request against the diabetes risk model:</p>
<p><img alt="Prediction Request" src="https://www.tekhnoal.com/prediction_request_sdfmlm.png" width="100%"></p>
<p>And the prediction result:</p>
<p><img alt="Prediction Result" src="https://www.tekhnoal.com/prediction_result_sdfmlm.png" width="100%"></p>
<p>By using the MLModel base class provided by the ml_base package and the REST service framework provided by the rest_model_service package, we're able to quickly stand up a service to host the model. We're done with the model service for now, so we'll stop it with CTRL+C.</p>
<h2>Creating a Docker Image</h2>
<p>Now that we have a working model and model service, we'll need to deploy it somewhere. We'll start by deploying the service locally using Docker.</p>
<p>Let's create a docker image and run it locally. The docker image is generated using instructions in the Dockerfile:</p>
<div class="highlight"><pre><span></span><code><span class="c"># syntax=docker/dockerfile:1</span>
<span class="k">FROM</span><span class="w"> </span><span class="s">python:3.9-slim</span>
<span class="k">ARG</span><span class="w"> </span>BUILD_DATE
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.title<span class="o">=</span><span class="s2">"Diabetes Risk Model Service"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.description<span class="o">=</span><span class="s2">"Diabetes Risk Model Service."</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.created<span class="o">=</span><span class="nv">$BUILD_DATE</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.authors<span class="o">=</span><span class="s2">"6666331+schmidtbri@users.noreply.github.com"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.source<span class="o">=</span><span class="s2">"https://github.com/schmidtbri/securing-parameters-for-ml-models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.version<span class="o">=</span><span class="s2">"0.1.0"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.licenses<span class="o">=</span><span class="s2">"MIT License"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.base.name<span class="o">=</span><span class="s2">"python:3.9-slim"</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">/service</span>
<span class="k">ARG</span><span class="w"> </span><span class="nv">USERNAME</span><span class="o">=</span>service-user
<span class="k">ARG</span><span class="w"> </span><span class="nv">USER_UID</span><span class="o">=</span><span class="m">10000</span>
<span class="k">ARG</span><span class="w"> </span><span class="nv">USER_GID</span><span class="o">=</span><span class="m">10000</span>
<span class="c"># install packages</span>
<span class="k">RUN</span><span class="w"> </span>apt-get update -y <span class="o">&&</span> <span class="se">\</span>
apt-get install -y --no-install-recommends sudo <span class="o">&&</span> <span class="se">\</span>
apt-get install -y --no-install-recommends libgomp1 <span class="o">&&</span> <span class="se">\</span>
apt-get clean <span class="o">&&</span> <span class="se">\</span>
rm -rf /var/lib/apt/lists/*
<span class="c"># create a user</span>
<span class="k">RUN</span><span class="w"> </span>groupadd --gid <span class="nv">$USER_GID</span> <span class="nv">$USERNAME</span> <span class="o">&&</span> <span class="se">\</span>
useradd --uid <span class="nv">$USER_UID</span> --gid <span class="nv">$USER_GID</span> -m <span class="nv">$USERNAME</span> <span class="o">&&</span> <span class="se">\</span>
<span class="nb">echo</span> <span class="nv">$USERNAME</span> <span class="nv">ALL</span><span class="o">=</span><span class="se">\(</span>root<span class="se">\)</span> NOPASSWD:ALL > /etc/sudoers.d/<span class="nv">$USERNAME</span> <span class="o">&&</span> <span class="se">\</span>
chmod <span class="m">0440</span> /etc/sudoers.d/<span class="nv">$USERNAME</span>
<span class="c"># installing dependencies</span>
<span class="k">COPY</span><span class="w"> </span>./service_requirements.txt ./service_requirements.txt
<span class="k">RUN</span><span class="w"> </span>pip install --no-cache -r service_requirements.txt
<span class="c"># copying model code and license</span>
<span class="k">COPY</span><span class="w"> </span>./diabetes_risk_model ./diabetes_risk_model
<span class="k">COPY</span><span class="w"> </span>./LICENSE ./LICENSE
<span class="k">USER</span><span class="w"> </span><span class="s">$USERNAME</span>
<span class="k">RUN</span><span class="w"> </span>sudo chown <span class="nv">$USERNAME</span>:<span class="nv">$USERNAME</span> -R /service <span class="o">&&</span> <span class="se">\</span>
sudo chmod -R +rw /service <span class="o">&&</span> <span class="se">\</span>
sudo mkdir -p /var/folders/vb <span class="o">&&</span> <span class="se">\</span>
sudo chown <span class="nv">$USERNAME</span>:<span class="nv">$USERNAME</span> -R /var/folders/vb <span class="o">&&</span> <span class="se">\</span>
sudo chmod -R +rw /var/folders/vb
<span class="k">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">"uvicorn"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rest_model_service.main:app"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--host"</span><span class="p">,</span><span class="w"> </span><span class="s2">"0.0.0.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--port"</span><span class="p">,</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">]</span>
</code></pre></div>
<p>This Dockerfile is used by this docker command to create a docker image:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">build</span> <span class="o">-</span><span class="n">t</span> <span class="n">diabetes_risk_model_service</span><span class="p">:</span><span class="mf">0.1.0</span> <span class="o">../</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make sure everything worked as expected, we'll look through the docker images in our system:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">diabetes_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>diabetes_risk_model_service 0.1.0 92d771f815ee 48 seconds ago 1.2GB
</code></pre></div>
<p>The diabetes_risk_model_service image is listed. To test the model service docker image with the minio docker container that is already running, we'll need to create a network for them first.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">create</span> <span class="n">local</span><span class="o">-</span><span class="n">test</span><span class="o">-</span><span class="n">network</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">7</span><span class="n">e66d4b4dd92e454d4a662c51678a3e05d61ca1389b566ec07afef7630cb1b93</span><span class="w"></span>
</code></pre></div>
<p>Next, we'll connect the running minio container to the network.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">connect</span> <span class="n">local</span><span class="o">-</span><span class="n">test</span><span class="o">-</span><span class="n">network</span> <span class="n">minio</span>
</code></pre></div>
<p>Now we can start the model service docker image connected to the same network as the minio container.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> \
<span class="o">--</span><span class="n">name</span> <span class="n">diabetes_risk_model_service</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">8000</span><span class="p">:</span><span class="mi">8000</span> \
<span class="o">--</span><span class="n">net</span> <span class="n">local</span><span class="o">-</span><span class="n">test</span><span class="o">-</span><span class="n">network</span> \
<span class="o">-</span><span class="n">v</span> <span class="err">$</span><span class="p">(</span><span class="n">pwd</span><span class="p">)</span><span class="o">/../</span><span class="n">configuration</span><span class="p">:</span><span class="o">/</span><span class="n">service</span><span class="o">/</span><span class="n">configuration</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">REST_CONFIG</span><span class="o">=./</span><span class="n">configuration</span><span class="o">/</span><span class="n">docker_rest_config</span><span class="o">.</span><span class="n">yaml</span> \
<span class="n">diabetes_risk_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>a9b9f3b22af0c2b2e74f1c01e062c56c921b9f689c0284b308a3e93ed6990eba
</code></pre></div>
<p>Notice that we provided the configuration YAML file to the service running in the docker image by mounting the local configuration folder.</p>
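<p>The docker_rest_config.yaml file cannot point at 127.0.0.1 for minio, because inside the container that address refers to the container itself. It presumably references the minio container by name, which resolves on the shared docker network; a sketch of what that file might contain (an assumption, since the file isn't shown here):</p>

```yaml
service_title: Diabetes Risk Model Service
models:
  - class_path: diabetes_risk_model.prediction.model.DiabetesRiskModel
    create_endpoint: true
    configuration:
      model_parameters_version: "2023_03_17"
      model_files_bucket: model-files
      minio_url: minio:9000  # the container name resolves on the shared network
      minio_access_key: TEST
      minio_secret_key: ASDFGHJKL
      parameters_signing_key: wjtRFppXQpxTChQnNcQJKGlLHKJBmAHMepfFbqvOoUrnuxIsKdiLCrrypYFQsqcw
```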
<p>To make sure the server process started up correctly, we'll look at the logs:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">logs</span> <span class="n">diabetes_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">server</span><span class="w"> </span><span class="n">process</span><span class="w"> </span><span class="o">[</span><span class="mi">1</span><span class="o">]</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Waiting</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">application</span><span class="w"> </span><span class="n">startup</span><span class="o">.</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Application</span><span class="w"> </span><span class="n">startup</span><span class="w"> </span><span class="n">complete</span><span class="o">.</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Uvicorn</span><span class="w"> </span><span class="n">running</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">http</span><span class="o">://</span><span class="mf">0.0</span><span class="o">.</span><span class="mf">0.0</span><span class="o">:</span><span class="mi">8000</span><span class="w"> </span><span class="o">(</span><span class="n">Press</span><span class="w"> </span><span class="n">CTRL</span><span class="o">+</span><span class="n">C</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">quit</span><span class="o">)</span><span class="w"></span>
</code></pre></div>
<p>The logs don't show any errors; it looks like the model parameters were loaded and verified correctly from the minio service when the service started up.</p>
<p>The service should be accessible on port 8000 of localhost, so we'll try to make a prediction using the curl command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://0.0.0.0:8000/api/models/diabetes_risk_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s1">'{ </span><span class="se">\</span>
<span class="s1"> "body_mass_index": 20, </span><span class="se">\</span>
<span class="s1"> "general_health": "EXCELLENT", </span><span class="se">\</span>
<span class="s1"> "age": "EIGHTEEN_TO_TWENTY_FOUR", </span><span class="se">\</span>
<span class="s1"> "income": "LESS_THAN_10K" </span><span class="se">\</span>
<span class="s1">}'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"diabetes_risk":"NO_DIABETES"}
</code></pre></div>
<p>The model predicted that the patient does not have diabetes.</p>
<p>We're done with the docker containers, so we'll shut them down along with the docker network.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">diabetes_risk_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">diabetes_risk_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">minio</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">minio</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">rm</span> <span class="n">local</span><span class="o">-</span><span class="n">test</span><span class="o">-</span><span class="n">network</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>diabetes_risk_model_service
diabetes_risk_model_service
minio
minio
local-test-network
</code></pre></div>
<h2>Creating a Kubernetes Cluster</h2>
<p>To show the system in action, we’ll deploy the model service and the minio service to a Kubernetes cluster. A local cluster can be easily started by using <a href="https://minikube.sigs.k8s.io/docs/">minikube</a>. Installation instructions can be found <a href="https://minikube.sigs.k8s.io/docs/start/">here</a>.</p>
<p>To start the minikube cluster execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">start</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="err">😄</span><span class="w"> </span><span class="n">minikube</span><span class="w"> </span><span class="n">v1</span><span class="o">.</span><span class="mf">28.0</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">Darwin</span><span class="w"> </span><span class="mf">13.2</span><span class="o">.</span><span class="mi">1</span><span class="w"></span>
<span class="err">🎉</span><span class="w"> </span><span class="n">minikube</span><span class="w"> </span><span class="mf">1.29</span><span class="o">.</span><span class="mi">0</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">available</span><span class="o">!</span><span class="w"> </span><span class="n">Download</span><span class="w"> </span><span class="n">it</span><span class="p">:</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">minikube</span><span class="o">/</span><span class="n">releases</span><span class="o">/</span><span class="n">tag</span><span class="o">/</span><span class="n">v1</span><span class="o">.</span><span class="mf">29.0</span><span class="w"></span>
<span class="err">💡</span><span class="w"> </span><span class="n">To</span><span class="w"> </span><span class="n">disable</span><span class="w"> </span><span class="n">this</span><span class="w"> </span><span class="n">notice</span><span class="p">,</span><span class="w"> </span><span class="n">run</span><span class="p">:</span><span class="w"> </span><span class="s1">'minikube config set WantUpdateNotification false'</span><span class="w"></span>
<span class="err">✨</span><span class="w"> </span><span class="n">Using</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">docker</span><span class="w"> </span><span class="n">driver</span><span class="w"> </span><span class="n">based</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">existing</span><span class="w"> </span><span class="n">profile</span><span class="w"></span>
<span class="err">👍</span><span class="w"> </span><span class="n">Starting</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="n">plane</span><span class="w"> </span><span class="n">node</span><span class="w"> </span><span class="n">minikube</span><span class="w"> </span><span class="ow">in</span><span class="w"> </span><span class="n">cluster</span><span class="w"> </span><span class="n">minikube</span><span class="w"></span>
<span class="err">🚜</span><span class="w"> </span><span class="n">Pulling</span><span class="w"> </span><span class="n">base</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="o">...</span><span class="w"></span>
<span class="err">🔄</span><span class="w"> </span><span class="n">Restarting</span><span class="w"> </span><span class="n">existing</span><span class="w"> </span><span class="n">docker</span><span class="w"> </span><span class="n">container</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="s2">"minikube"</span><span class="w"> </span><span class="o">...</span><span class="w"></span>
<span class="err">🐳</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span><span class="n">Kubernetes</span><span class="w"> </span><span class="n">v1</span><span class="o">.</span><span class="mf">25.3</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">Docker</span><span class="w"> </span><span class="mf">20.10</span><span class="o">.</span><span class="mi">20</span><span class="w"> </span><span class="o">...</span><span class="w"></span>
<span class="err">🔎</span><span class="w"> </span><span class="n">Verifying</span><span class="w"> </span><span class="n">Kubernetes</span><span class="w"> </span><span class="n">components</span><span class="o">...</span><span class="w"></span>
<span class="w"> </span><span class="err">▪</span><span class="w"> </span><span class="n">Using</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="n">gcr</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">k8s</span><span class="o">-</span><span class="n">minikube</span><span class="o">/</span><span class="n">storage</span><span class="o">-</span><span class="n">provisioner</span><span class="p">:</span><span class="n">v5</span><span class="w"></span>
<span class="w"> </span><span class="err">▪</span><span class="w"> </span><span class="n">Using</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="n">docker</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">kubernetesui</span><span class="o">/</span><span class="n">metrics</span><span class="o">-</span><span class="n">scraper</span><span class="p">:</span><span class="n">v1</span><span class="o">.</span><span class="mf">0.8</span><span class="w"></span>
<span class="w"> </span><span class="err">▪</span><span class="w"> </span><span class="n">Using</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="n">docker</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">kubernetesui</span><span class="o">/</span><span class="n">dashboard</span><span class="p">:</span><span class="n">v2</span><span class="o">.</span><span class="mf">7.0</span><span class="w"></span>
<span class="err">💡</span><span class="w"> </span><span class="n">Some</span><span class="w"> </span><span class="n">dashboard</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="n">require</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">metrics</span><span class="o">-</span><span class="n">server</span><span class="w"> </span><span class="n">addon</span><span class="o">.</span><span class="w"> </span><span class="n">To</span><span class="w"> </span><span class="n">enable</span><span class="w"> </span><span class="n">all</span><span class="w"> </span><span class="n">features</span><span class="w"> </span><span class="n">please</span><span class="w"> </span><span class="n">run</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="n">minikube</span><span class="w"> </span><span class="n">addons</span><span class="w"> </span><span class="n">enable</span><span class="w"> </span><span class="n">metrics</span><span class="o">-</span><span class="n">server</span><span class="w"></span>
<span class="err">🌟</span><span class="w"> </span><span class="n">Enabled</span><span class="w"> </span><span class="n">addons</span><span class="p">:</span><span class="w"> </span><span class="n">storage</span><span class="o">-</span><span class="n">provisioner</span><span class="p">,</span><span class="w"> </span><span class="n">default</span><span class="o">-</span><span class="n">storageclass</span><span class="p">,</span><span class="w"> </span><span class="n">dashboard</span><span class="w"></span>
<span class="err">🏄</span><span class="w"> </span><span class="n">Done</span><span class="o">!</span><span class="w"> </span><span class="n">kubectl</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">now</span><span class="w"> </span><span class="n">configured</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="s2">"minikube"</span><span class="w"> </span><span class="n">cluster</span><span class="w"> </span><span class="ow">and</span><span class="w"> </span><span class="s2">"default"</span><span class="w"> </span><span class="n">namespace</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">default</span><span class="w"></span>
</code></pre></div>
<p>Let's view all of the pods running in the minikube cluster to make sure we can connect to it using the kubectl command.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">-</span><span class="n">A</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAMESPACE              NAME                                        READY   STATUS    RESTARTS       AGE
kube-system            coredns-565d847f94-2v6l9                    1/1     Running   15 (82s ago)   72d
kube-system            etcd-minikube                               1/1     Running   15 (2d ago)    72d
kube-system            kube-apiserver-minikube                     1/1     Running   14 (2d ago)    72d
kube-system            kube-controller-manager-minikube            1/1     Running   15 (82s ago)   72d
kube-system            kube-proxy-ztbgd                            1/1     Running   14 (2d ago)    72d
kube-system            kube-scheduler-minikube                     1/1     Running   14 (2d ago)    72d
kube-system            storage-provisioner                         1/1     Running   26 (2d ago)    72d
kubernetes-dashboard   dashboard-metrics-scraper-b74747df5-x559p   1/1     Running   14 (2d ago)    72d
kubernetes-dashboard   kubernetes-dashboard-57bbdc5f89-9jvln       1/1     Running   18 (82s ago)   72d
</code></pre></div>
<p>It looks like we can connect, so we're ready to start deploying the model service to the cluster.</p>
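<p>If the pod listing ever fails or hangs, a quick sanity check (an optional aside, not part of the original walkthrough) is to ask kubectl about the cluster and context it is currently pointed at:</p>

```shell
# Print the control-plane endpoint of the current cluster;
# this fails quickly if the cluster is unreachable.
kubectl cluster-info

# Show which context (cluster + user + namespace) is active,
# which should be "minikube" in this setup.
kubectl config current-context
```

<p>Both commands are read-only, so they are safe to run at any time.</p>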
<h2>Creating a Namespace</h2>
<p>We'll first create a namespace to hold the resources for our model service. The resource definition is in the kubernetes/namespace.yaml file. To apply the manifest to the cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace/model-services created
</code></pre></div>
<p>The namespace was created. To take a look at the namespaces, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">namespace</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                   STATUS   AGE
default                Active   72d
kube-node-lease        Active   72d
kube-public            Active   72d
kube-system            Active   72d
kubernetes-dashboard   Active   72d
model-services         Active   3s
</code></pre></div>
<p>The new namespace appears in the listing along with other namespaces created by default by the system. To use the new namespace for the rest of the operations, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">config</span> <span class="nb">set</span><span class="o">-</span><span class="n">context</span> <span class="o">--</span><span class="n">current</span> <span class="o">--</span><span class="n">namespace</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Context "minikube" modified.
</code></pre></div>
<p>Now the rest of the kubectl commands that we execute will automatically be applied in the "model-services" namespace.</p>
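<p>To double-check that the context change took effect, you can print the namespace recorded in the active context (a hedged extra step; the jsonpath expression below assumes a standard kubeconfig layout):</p>

```shell
# Show only the active context and extract its configured namespace;
# this should print "model-services" after the set-context command above.
kubectl config view --minify --output 'jsonpath={..namespace}'
```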
<h2>Creating the Storage Service</h2>
<p>To store the model parameters, we'll need to deploy minio to the cluster as a service. We can do this with the helm tool and the helm chart provided by minio.</p>
<p>First let's add the minio helm repository:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">helm</span> <span class="n">repo</span> <span class="n">add</span> <span class="n">minio</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">charts</span><span class="o">.</span><span class="n">min</span><span class="o">.</span><span class="n">io</span><span class="o">/</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>"minio" has been added to your repositories
</code></pre></div>
<p>The minio helm repository is now available for use.</p>
<p>Let's apply the minio helm chart:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">helm</span> <span class="n">install</span> <span class="n">minio</span> <span class="o">--</span><span class="nb">set</span> <span class="n">rootUser</span><span class="o">=</span><span class="n">TEST</span><span class="p">,</span><span class="n">rootPassword</span><span class="o">=</span><span class="n">ASDFGHJKL</span> \
<span class="o">--</span><span class="nb">set</span> <span class="n">persistence</span><span class="o">.</span><span class="n">enabled</span><span class="o">=</span><span class="n">true</span> \
<span class="o">--</span><span class="nb">set</span> <span class="n">persistence</span><span class="o">.</span><span class="n">size</span><span class="o">=</span><span class="mi">2</span><span class="n">Gi</span> \
<span class="o">--</span><span class="nb">set</span> <span class="n">resources</span><span class="o">.</span><span class="n">requests</span><span class="o">.</span><span class="n">cpu</span><span class="o">=</span><span class="mi">1</span> \
<span class="o">--</span><span class="nb">set</span> <span class="n">resources</span><span class="o">.</span><span class="n">limits</span><span class="o">.</span><span class="n">cpu</span><span class="o">=</span><span class="mi">2</span> \
<span class="o">--</span><span class="nb">set</span> <span class="n">resources</span><span class="o">.</span><span class="n">requests</span><span class="o">.</span><span class="n">memory</span><span class="o">=</span><span class="mi">250</span><span class="n">Mi</span> \
<span class="o">--</span><span class="nb">set</span> <span class="n">resources</span><span class="o">.</span><span class="n">limits</span><span class="o">.</span><span class="n">memory</span><span class="o">=</span><span class="mi">500</span><span class="n">Mi</span> \
<span class="o">--</span><span class="nb">set</span> <span class="n">mode</span><span class="o">=</span><span class="n">distributed</span><span class="p">,</span><span class="n">replicas</span><span class="o">=</span><span class="mi">2</span> \
<span class="n">minio</span><span class="o">/</span><span class="n">minio</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">NAME</span><span class="p">:</span><span class="w"> </span><span class="n">minio</span><span class="w"></span>
<span class="n">LAST</span><span class="w"> </span><span class="n">DEPLOYED</span><span class="p">:</span><span class="w"> </span><span class="n">Sat</span><span class="w"> </span><span class="n">Mar</span><span class="w"> </span><span class="mi">18</span><span class="w"> </span><span class="mi">00</span><span class="p">:</span><span class="mi">15</span><span class="p">:</span><span class="mi">07</span><span class="w"> </span><span class="mi">2023</span><span class="w"></span>
<span class="n">NAMESPACE</span><span class="p">:</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="w"></span>
<span class="n">STATUS</span><span class="p">:</span><span class="w"> </span><span class="n">deployed</span><span class="w"></span>
<span class="n">REVISION</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"></span>
<span class="n">TEST</span><span class="w"> </span><span class="n">SUITE</span><span class="p">:</span><span class="w"> </span><span class="n">None</span><span class="w"></span>
<span class="n">NOTES</span><span class="p">:</span><span class="w"></span>
<span class="n">MinIO</span><span class="w"> </span><span class="n">can</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">accessed</span><span class="w"> </span><span class="n">via</span><span class="w"> </span><span class="n">port</span><span class="w"> </span><span class="mi">9000</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">DNS</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="n">within</span><span class="w"> </span><span class="n">your</span><span class="w"> </span><span class="n">cluster</span><span class="p">:</span><span class="w"></span>
<span class="n">minio</span><span class="o">.</span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">.</span><span class="n">svc</span><span class="o">.</span><span class="n">cluster</span><span class="o">.</span><span class="n">local</span><span class="w"></span>
<span class="n">To</span><span class="w"> </span><span class="n">access</span><span class="w"> </span><span class="n">MinIO</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="n">localhost</span><span class="p">,</span><span class="w"> </span><span class="n">run</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">below</span><span class="w"> </span><span class="n">commands</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="mf">1.</span><span class="w"> </span><span class="k">export</span><span class="w"> </span><span class="n">POD_NAME</span><span class="o">=$</span><span class="p">(</span><span class="n">kubectl</span><span class="w"> </span><span class="n">get</span><span class="w"> </span><span class="n">pods</span><span class="w"> </span><span class="o">--</span><span class="n">namespace</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="w"> </span><span class="o">-</span><span class="n">l</span><span class="w"> </span><span class="s2">"release=minio"</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">jsonpath</span><span class="o">=</span><span class="s2">"{.items[0].metadata.name}"</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="mf">2.</span><span class="w"> </span><span class="n">kubectl</span><span class="w"> </span><span class="n">port</span><span class="o">-</span><span class="n">forward</span><span class="w"> </span><span class="o">$</span><span class="n">POD_NAME</span><span class="w"> </span><span class="mi">9000</span><span class="w"> </span><span class="o">--</span><span class="n">namespace</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="w"></span>
<span class="n">Read</span><span class="w"> </span><span class="n">more</span><span class="w"> </span><span class="n">about</span><span class="w"> </span><span class="n">port</span><span class="w"> </span><span class="n">forwarding</span><span class="w"> </span><span class="n">here</span><span class="p">:</span><span class="w"> </span><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">kubernetes</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">docs</span><span class="o">/</span><span class="n">user</span><span class="o">-</span><span class="n">guide</span><span class="o">/</span><span class="n">kubectl</span><span class="o">/</span><span class="n">kubectl_port</span><span class="o">-</span><span class="n">forward</span><span class="o">/</span><span class="w"></span>
<span class="n">You</span><span class="w"> </span><span class="n">can</span><span class="w"> </span><span class="n">now</span><span class="w"> </span><span class="n">access</span><span class="w"> </span><span class="n">MinIO</span><span class="w"> </span><span class="n">server</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">localhost</span><span class="p">:</span><span class="mf">9000.</span><span class="w"> </span><span class="n">Follow</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">below</span><span class="w"> </span><span class="n">steps</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">connect</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">MinIO</span><span class="w"> </span><span class="n">server</span><span class="w"> </span><span class="n">with</span><span class="w"> </span><span class="n">mc</span><span class="w"> </span><span class="n">client</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="mf">1.</span><span class="w"> </span><span class="n">Download</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">MinIO</span><span class="w"> </span><span class="n">mc</span><span class="w"> </span><span class="n">client</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="nb">min</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">docs</span><span class="o">/</span><span class="n">minio</span><span class="o">/</span><span class="n">linux</span><span class="o">/</span><span class="n">reference</span><span class="o">/</span><span class="n">minio</span><span class="o">-</span><span class="n">mc</span><span class="o">.</span><span class="n">html</span><span class="c1">#quickstart</span><span class="w"></span>
<span class="w"> </span><span class="mf">2.</span><span class="w"> </span><span class="k">export</span><span class="w"> </span><span class="n">MC_HOST_minio</span><span class="o">-</span><span class="n">local</span><span class="o">=</span><span class="n">http</span><span class="p">:</span><span class="o">//$</span><span class="p">(</span><span class="n">kubectl</span><span class="w"> </span><span class="n">get</span><span class="w"> </span><span class="n">secret</span><span class="w"> </span><span class="o">--</span><span class="n">namespace</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="w"> </span><span class="n">minio</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">jsonpath</span><span class="o">=</span><span class="s2">"{.data.rootUser}"</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">base64</span><span class="w"> </span><span class="o">--</span><span class="n">decode</span><span class="p">):</span><span class="o">$</span><span class="p">(</span><span class="n">kubectl</span><span class="w"> </span><span class="n">get</span><span class="w"> </span><span class="n">secret</span><span class="w"> </span><span class="o">--</span><span class="n">namespace</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="w"> </span><span class="n">minio</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">jsonpath</span><span class="o">=</span><span class="s2">"{.data.rootPassword}"</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">base64</span><span class="w"> </span><span class="o">--</span><span class="n">decode</span><span class="p">)</span><span class="err">@</span><span class="n">localhost</span><span 
class="p">:</span><span class="mi">9000</span><span class="w"></span>
<span class="w"> </span><span class="mf">3.</span><span class="w"> </span><span class="n">mc</span><span class="w"> </span><span class="n">ls</span><span class="w"> </span><span class="n">minio</span><span class="o">-</span><span class="n">local</span><span class="w"></span>
</code></pre></div>
<p>The MinIO service is now installed. We can view the pods to check that it's running correctly:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME READY STATUS RESTARTS AGE
minio-0 1/1 Running 0 82s
minio-1 1/1 Running 0 82s
</code></pre></div>
<p>The MinIO service is running in two pods, and it is accessible through a set of Kubernetes Services:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
minio ClusterIP 10.108.159.154 <none> 9000/TCP 2m4s
minio-console ClusterIP 10.110.151.171 <none> 9001/TCP 2m4s
minio-svc ClusterIP None <none> 9000/TCP 2m4s
</code></pre></div>
<p>We'll upload the model parameters by accessing the minio-console service. To do that, we'll need to connect to the MinIO instance using port forwarding. Port forwarding is a simple way to reach a service running in the cluster from the local environment: it forwards all traffic from a local port to a remote port that is hosting the service.</p>
<p>To start port forwarding the minio-console service, execute this command:</p>
<div class="highlight"><pre><span></span><code>minikube service minio-console --url -n model-services
</code></pre></div>
<p>This command has to run continuously for the port forwarding to work. The UI of the minio instance that is running in the cluster is now available locally:</p>
<p><img alt="Minio UI" src="https://www.tekhnoal.com/minio_kubernetes_ui_sdfmlm.png" width="100%"></p>
<p>To keep things short, I created the "model-files" bucket and uploaded the model .zip file that we were working with above.</p>
<p>We now have model parameters for the model service to access, and we're ready to deploy the model service to the cluster.</p>
<h2>Creating a Deployment and Service</h2>
<p>The model service is deployed by using Kubernetes resources. These are:</p>
<ul>
<li>Secret: a set of configuration strings that are stored by Kubernetes and can be provided to Pods running within the cluster. The Secrets will hold the minio login details and the secret key used to verify the model parameters.</li>
<li>ConfigMap: a set of configuration options, in this case it is a simple YAML file that will be loaded into the running container as a volume mount. This resource allows us to change the configuration of the model service without having to modify the Docker image. </li>
<li>Deployment: a declarative way to manage a set of Pods; the model service Pods are managed through the Deployment.</li>
<li>Service: a way to expose the set of Pods in a Deployment; the model service is made available to the outside world through the Service.</li>
</ul>
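<p>For illustration, here is a minimal sketch of what the Deployment and Service in the model_service.yaml file might look like. The resource names match the ones used in this post, but the labels, mount path, and some port details shown here are assumptions; the actual manifest in the repository is the source of truth.</p>

```yaml
# Hypothetical sketch of ../kubernetes/model_service.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: diabetes-risk-model-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: diabetes-risk-model
  template:
    metadata:
      labels:
        app: diabetes-risk-model
    spec:
      containers:
        - name: diabetes-risk-model
          image: diabetes_risk_model_service:0.1.0
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef:
                name: diabetes-risk-model-service-secrets
          volumeMounts:
            - name: configuration
              mountPath: /service/configuration  # assumed mount path
      volumes:
        - name: configuration
          configMap:
            name: model-service-configuration
---
apiVersion: v1
kind: Service
metadata:
  name: diabetes-risk-model-service
spec:
  type: NodePort
  selector:
    app: diabetes-risk-model
  ports:
    - port: 80
      targetPort: 8000
```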
<p>We're almost ready to deploy the model service, but before starting it we'll need to send the docker image from the local docker daemon to the minikube image cache:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">load</span> <span class="n">diabetes_risk_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<p>We can view the images in the minikube cache with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">diabetes_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>docker.io/library/diabetes_risk_model_service:0.1.0
</code></pre></div>
<p>The model service will need to access the YAML configuration file that we used for the local service above. This file is in the /configuration folder and is called "kubernetes_rest_config.yaml"; it's customized for the Kubernetes environment we're building.</p>
<p>To create a <a href="https://kubernetes.io/docs/concepts/configuration/configmap/">ConfigMap</a> for the service, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="n">configmap</span> <span class="n">model</span><span class="o">-</span><span class="n">service</span><span class="o">-</span><span class="n">configuration</span> \
<span class="o">--</span><span class="n">from</span><span class="o">-</span><span class="n">file</span><span class="o">=../</span><span class="n">configuration</span><span class="o">/</span><span class="n">kubernetes_rest_config</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap/model-service-configuration created
</code></pre></div>
<p>The model service also needs to access three secrets:</p>
<ul>
<li>minio access key, used for accessing the minio service</li>
<li>minio secret key, used for accessing the minio service</li>
<li>parameters signing key, used for verifying the model parameters</li>
</ul>
<p>These values can't safely be added to the ConfigMap because they are sensitive. Kubernetes stores sensitive values separately as <a href="https://kubernetes.io/docs/concepts/configuration/secret/">Secrets</a>, which can be encrypted at rest and given tighter access controls. We'll create the Secrets with these commands:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="n">secret</span> <span class="n">generic</span> <span class="n">diabetes</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">service</span><span class="o">-</span><span class="n">secrets</span> \
<span class="o">--</span><span class="n">from</span><span class="o">-</span><span class="n">literal</span><span class="o">=</span><span class="n">minio</span><span class="o">-</span><span class="n">access</span><span class="o">-</span><span class="n">key</span><span class="o">=</span><span class="n">TEST</span> \
<span class="o">--</span><span class="n">from</span><span class="o">-</span><span class="n">literal</span><span class="o">=</span><span class="n">minio</span><span class="o">-</span><span class="n">secret</span><span class="o">-</span><span class="n">key</span><span class="o">=</span><span class="n">ASDFGHJKL</span> \
<span class="o">--</span><span class="n">from</span><span class="o">-</span><span class="n">literal</span><span class="o">=</span><span class="n">parameters</span><span class="o">-</span><span class="n">signing</span><span class="o">-</span><span class="n">key</span><span class="o">=</span><span class="n">wjtRFppXQpxTChQnNcQJKGlLHKJBmAHMepfFbqvOoUrnuxIsKdiLCrrypYFQsqcw</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>secret/diabetes-risk-model-service-secrets created
</code></pre></div>
<p>The model service Deployment and Service are created within the Kubernetes cluster with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps/diabetes-risk-model-deployment created
service/diabetes-risk-model-service created
</code></pre></div>
<p>Let's view the Deployment to see if it is available yet:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">deployments</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME READY UP-TO-DATE AVAILABLE AGE
diabetes-risk-model-deployment 1/1 1 1 65s
</code></pre></div>
<p>To get an idea of how the service went through the startup process, let's look at the service logs. First, we'll get the names of the pods that are running the service:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">diabetes</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>diabetes-risk-model-deployment-ff7887475-5q2j5 1/1 Running 0 68s
</code></pre></div>
<p>Using the pod name, we can get the logs from Kubernetes:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">logs</span> <span class="n">diabetes</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">deployment</span><span class="o">-</span><span class="n">ff7887475</span><span class="o">-</span><span class="mi">5</span><span class="n">q2j5</span> <span class="o">-</span><span class="n">c</span> <span class="n">diabetes</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">server</span><span class="w"> </span><span class="n">process</span><span class="w"> </span><span class="o">[</span><span class="mi">1</span><span class="o">]</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Waiting</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">application</span><span class="w"> </span><span class="n">startup</span><span class="o">.</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Application</span><span class="w"> </span><span class="n">startup</span><span class="w"> </span><span class="n">complete</span><span class="o">.</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Uvicorn</span><span class="w"> </span><span class="n">running</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">http</span><span class="o">://</span><span class="mf">0.0</span><span class="o">.</span><span class="mf">0.0</span><span class="o">:</span><span class="mi">8000</span><span class="w"> </span><span class="o">(</span><span class="n">Press</span><span class="w"> </span><span class="n">CTRL</span><span class="o">+</span><span class="n">C</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">quit</span><span class="o">)</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">35258</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/startup HTTP/1.1"</span><span class="w"> </span><span class="mi">503</span><span class="w"> </span><span class="n">Service</span><span class="w"> </span><span class="n">Unavailable</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">35272</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/startup HTTP/1.1"</span><span class="w"> </span><span class="mi">503</span><span class="w"> </span><span class="n">Service</span><span class="w"> </span><span class="n">Unavailable</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">55252</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/startup HTTP/1.1"</span><span class="w"> </span><span class="mi">503</span><span class="w"> </span><span class="n">Service</span><span class="w"> </span><span class="n">Unavailable</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">55264</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/startup HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">55270</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/ready HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">49028</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
</code></pre></div>
<p>The startup health check returned 503 a few times while the service was starting up, and then the health endpoints began returning 200: the process started up correctly.</p>
<p>The Kubernetes Service details look like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">services</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">diabetes</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>diabetes-risk-model-service NodePort 10.99.180.41 <none> 80:31452/TCP 2m29s
</code></pre></div>
<p>We'll run another proxy process locally to be able to access the model service endpoint:</p>
<div class="highlight"><pre><span></span><code>minikube service diabetes-risk-model-service --url -n model-services
</code></pre></div>
<p>The command outputs this URL:</p>
<p>http://127.0.0.1:55659</p>
<p>We can send a request to the model service through the local endpoint like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:55659/api/models/diabetes_risk_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s1">'{ </span><span class="se">\</span>
<span class="s1"> "body_mass_index": 60, </span><span class="se">\</span>
<span class="s1"> "general_health": "EXCELLENT", </span><span class="se">\</span>
<span class="s1"> "age": "EIGHTEEN_TO_TWENTY_FOUR", </span><span class="se">\</span>
<span class="s1"> "income": "LESS_THAN_10K" </span><span class="se">\</span>
<span class="s1">}'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"diabetes_risk":"NO_DIABETES"}
</code></pre></div>
<p>The model is deployed within Kubernetes!</p>
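<p>The same prediction request can also be made from Python. This sketch mirrors the curl command above; only the request construction runs as written, with the actual network call left commented out since it requires the port-forwarded URL to still be active:</p>

```python
import json
import urllib.request

# Payload mirroring the curl request above
payload = {
    "body_mass_index": 60,
    "general_health": "EXCELLENT",
    "age": "EIGHTEEN_TO_TWENTY_FOUR",
    "income": "LESS_THAN_10K",
}

def build_request(base_url: str) -> urllib.request.Request:
    """Build a POST request for the model's prediction endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/api/models/diabetes_risk_model/prediction",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )

request = build_request("http://127.0.0.1:55659")

# To actually send the request (with the minikube service tunnel running):
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```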
<h2>Deleting the Resources</h2>
<p>We're done working with the Kubernetes resources, so we will delete them and shut down the cluster.</p>
<p>To delete the model service Deployment and Service, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps "diabetes-risk-model-deployment" deleted
service "diabetes-risk-model-service" deleted
</code></pre></div>
<p>We'll also delete the ConfigMap:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="n">configmap</span> <span class="n">model</span><span class="o">-</span><span class="n">service</span><span class="o">-</span><span class="n">configuration</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap "model-service-configuration" deleted
</code></pre></div>
<p>Next, we'll delete the secrets:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="n">secret</span> <span class="n">diabetes</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">service</span><span class="o">-</span><span class="n">secrets</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>secret "diabetes-risk-model-service-secrets" deleted
</code></pre></div>
<p>To delete the minio deployment execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">helm</span> <span class="n">delete</span> <span class="n">minio</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>release "minio" uninstalled
</code></pre></div>
<p>The minio service used <a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/">Persistent Volume Claims</a> to store data. Since these are not deleted along with the minio Helm chart, we'll delete them with a kubectl command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="n">pvc</span> <span class="o">-</span><span class="n">l</span> <span class="n">app</span><span class="o">=</span><span class="n">minio</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">persistentvolumeclaim</span><span class="w"> </span><span class="s2">"export-minio-0"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
<span class="n">persistentvolumeclaim</span><span class="w"> </span><span class="s2">"export-minio-1"</span><span class="w"> </span><span class="n">deleted</span><span class="w"></span>
</code></pre></div>
<p>To delete the model-services namespace, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace "model-services" deleted
</code></pre></div>
<p>To shut down the minikube cluster:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">stop</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>✋ Stopping node "minikube" ...
🛑 Powering off "minikube" via SSH ...
🛑 1 node stopped.
</code></pre></div>
<h2>Closing</h2>
<p>In this blog post we trained, validated, signed, and verified a set of model parameters to ensure that they remain secure. This process is needed because of the inherent security problems that Python pickles bring with them. The signing and verification process added a little bit of complexity, but it's worth it to ensure the security of the model deployment.</p>
<p>We also showed how to deploy the serialized model parameters to a storage service, and how to access them from the deployed model. We did this to highlight a common vulnerability of machine learning model deployments. Model parameters are often not deployed alongside the prediction code; instead, they are kept in a separate storage service from which they are loaded at runtime. This practice makes the deployment of model parameters faster, but it adds another attack vector that needs to be secured: an attacker who gains access to the storage service can modify the model parameters in order to achieve arbitrary code execution on the server where the model is deployed. By adding a signature verification step before the model parameters are deserialized, we made the deployment of model parameters more secure.</p>
<p>One way to improve this process is to make it into a plug-in that can be easily added to model training and prediction code, making it simpler to add to a training pipeline and model deployment. Another way to improve it is by adding a key rotation mechanism to ensure that secret keys do not remain in production for a long time.</p>
<p>Health Checks for ML Model Deployments, published 2023-01-15 by Brian Schmidt.</p>
<p>Deploying machine learning models in RESTful services is a common way to make the model available for use within a software system. RESTful services are the most common type of service deployed, since they are very simple to build, have wide compatibility, and have lots of tooling available for them. In order to monitor the availability of the service, RESTful APIs often provide health check endpoints which make it easy for an outside system to verify that the service is up and running. A health check endpoint is a simple endpoint that can be called by a process manager to ascertain whether the application is running correctly. In this blog post we'll be working with Kubernetes, so we'll focus on the health checks supported by Kubernetes.</p>
<h1>Health Checks for ML Model Deployments</h1>
<p>In a <a href="https://www.tekhnoal.com/rest-model-service.html">previous blog post</a> we showed how to create a RESTful model service for a machine learning model. In this blog post, we'll extend the model service API by adding health checks to it.</p>
<p>This blog post was written in a Jupyter notebook; some of the code and commands found in it reflect this.</p>
<p>All of the code for this blog post is in <a href="https://github.com/schmidtbri/health-checks-for-ml-model-deployments">this github repository</a>.</p>
<h2>Introduction</h2>
<p>Deploying machine learning models in RESTful services is a common way to make the model available for use within a software system. In general, RESTful services are the most common type of service deployed, since they are simple to build, have wide compatibility, and have lots of tooling available for them. In order to monitor the availability of the service, RESTful APIs often provide health check endpoints which make it easy for an outside system to verify that the service is up and running. A health check endpoint is a simple endpoint that can be called by a process manager to ascertain whether the application is running correctly. In this blog post we'll be working with Kubernetes so we'll focus on the health checks supported by Kubernetes. </p>
<p>There are several types of health check endpoints supported by Kubernetes: startup, readiness, and liveness checks. Startup checks are used to check whether an application has started. If the container has a startup check configured on it, Kubernetes will wait until the application has finished starting up before moving on with the process of making the application available to clients. Startup checks are useful for applications that take a while to start up. They are only called during application startup; once an application has finished starting up, the startup check is not called again.</p>
<p>Readiness checks are used to check if a container is ready to start accepting requests. Once the application has finished starting up, Kubernetes uses the readiness check to make sure that the application is able to accept requests. Service readiness can change during the service lifecycle so the check is called continuously until the application is stopped.</p>
<p>Liveness checks are used to restart a pod if the application is not responding. They are the simplest type of check to implement because they should always succeed while the server process is running. Liveness checks are useful for detecting when the application is in an unsafe state; if the liveness check fails, the process manager restarts the application to get it out of that state. Liveness checks are also called continuously while the application is running.</p>
<p>In this blog post, we’ll be adding startup, readiness, and liveness checks to a RESTful model service that is hosting a machine learning model. We'll also build a model that requires health checks in order to be deployed correctly.</p>
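<p>The division of labor between the three checks can be sketched in plain Python, independent of any web framework. The class and method names below are illustrative stand-ins, not the API we'll build later in this post:</p>

```python
from enum import Enum


class HealthStatus(Enum):
    """Statuses that a health check endpoint can report."""
    HEALTHY = 200
    UNHEALTHY = 503


class Service:
    """Toy service holding the state that the three probes inspect."""

    def __init__(self):
        self.started = False       # flipped once startup work completes
        self.model_loaded = False  # readiness may change over the lifecycle

    def startup(self):
        # stands in for slow startup work, e.g. deserializing model parameters
        self.model_loaded = True
        self.started = True

    def startup_check(self) -> HealthStatus:
        # polled only until it succeeds, then never called again
        return HealthStatus.HEALTHY if self.started else HealthStatus.UNHEALTHY

    def readiness_check(self) -> HealthStatus:
        # polled continuously: can we serve predictions right now?
        return HealthStatus.HEALTHY if self.model_loaded else HealthStatus.UNHEALTHY

    def liveness_check(self) -> HealthStatus:
        # should succeed whenever the process is responsive at all
        return HealthStatus.HEALTHY


service = Service()
print(service.startup_check().name)    # UNHEALTHY until startup finishes
service.startup()
print(service.startup_check().name)    # HEALTHY
print(service.readiness_check().name)  # HEALTHY
```

<p>In a real deployment each method would back one HTTP endpoint, and Kubernetes would be configured to poll each endpoint on a schedule.</p>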
<h2>Getting Data</h2>
<p>In order to train a model, we first need to have a dataset. We went into Kaggle and found a dataset that contained loan information. To make it easy to download the data, we installed the <a href="https://pypi.org/project/kaggle/">kaggle python package</a>. Then we executed these commands to download the data and unzip it into the data folder in the project:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">mkdir</span> <span class="o">-</span><span class="n">p</span> <span class="o">../</span><span class="n">data</span>
<span class="err">!</span><span class="n">kaggle</span> <span class="n">datasets</span> <span class="n">download</span> <span class="o">-</span><span class="n">d</span> <span class="n">ranadeep</span><span class="o">/</span><span class="n">credit</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">dataset</span> <span class="o">-</span><span class="n">p</span> <span class="o">../</span><span class="n">data</span> <span class="o">--</span><span class="n">unzip</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Let's look at the data files:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">ls</span> <span class="o">-</span><span class="n">la</span> <span class="o">../</span><span class="n">data</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>total 67648
drwxr-xr-x 6 brian staff 192 Nov 21 14:45 [34m.[m[m
drwxr-xr-x 27 brian staff 864 Jan 11 21:23 [34m..[m[m
-rw-r--r--@ 1 brian staff 6148 Oct 31 17:30 .DS_Store
-rw-r--r-- 1 brian staff 20995 Nov 21 13:49 LCDataDictionary.xlsx
-rw-r--r-- 1 brian staff 34603008 Nov 21 14:58 credit-risk-dataset.zip
drwxr-xr-x 3 brian staff 96 Nov 16 09:33 [34mloan[m[m
</code></pre></div>
<p>Looks like the data is in a .csv file in the /loan folder and the data dictionary is in an Excel spreadsheet file.</p>
<p>Let's load the data .csv file into a pandas dataframe.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"../data/loan/loan.csv"</span><span class="p">,</span> <span class="n">low_memory</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">data</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><class 'pandas.core.frame.DataFrame'>
RangeIndex: 887379 entries, 0 to 887378
Data columns (total 74 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 887379 non-null int64
1 member_id 887379 non-null int64
2 loan_amnt 887379 non-null float64
3 funded_amnt 887379 non-null float64
4 funded_amnt_inv 887379 non-null float64
5 term 887379 non-null object
6 int_rate 887379 non-null float64
7 installment 887379 non-null float64
8 grade 887379 non-null object
9 sub_grade 887379 non-null object
10 emp_title 835917 non-null object
11 emp_length 842554 non-null object
12 home_ownership 887379 non-null object
13 annual_inc 887375 non-null float64
14 verification_status 887379 non-null object
15 issue_d 887379 non-null object
16 loan_status 887379 non-null object
17 pymnt_plan 887379 non-null object
18 url 887379 non-null object
19 desc 126028 non-null object
20 purpose 887379 non-null object
21 title 887227 non-null object
22 zip_code 887379 non-null object
23 addr_state 887379 non-null object
24 dti 887379 non-null float64
25 delinq_2yrs 887350 non-null float64
26 earliest_cr_line 887350 non-null object
27 inq_last_6mths 887350 non-null float64
28 mths_since_last_delinq 433067 non-null float64
29 mths_since_last_record 137053 non-null float64
30 open_acc 887350 non-null float64
31 pub_rec 887350 non-null float64
32 revol_bal 887379 non-null float64
33 revol_util 886877 non-null float64
34 total_acc 887350 non-null float64
35 initial_list_status 887379 non-null object
36 out_prncp 887379 non-null float64
37 out_prncp_inv 887379 non-null float64
38 total_pymnt 887379 non-null float64
39 total_pymnt_inv 887379 non-null float64
40 total_rec_prncp 887379 non-null float64
41 total_rec_int 887379 non-null float64
42 total_rec_late_fee 887379 non-null float64
43 recoveries 887379 non-null float64
44 collection_recovery_fee 887379 non-null float64
45 last_pymnt_d 869720 non-null object
46 last_pymnt_amnt 887379 non-null float64
47 next_pymnt_d 634408 non-null object
48 last_credit_pull_d 887326 non-null object
49 collections_12_mths_ex_med 887234 non-null float64
50 mths_since_last_major_derog 221703 non-null float64
51 policy_code 887379 non-null float64
52 application_type 887379 non-null object
53 annual_inc_joint 511 non-null float64
54 dti_joint 509 non-null float64
55 verification_status_joint 511 non-null object
56 acc_now_delinq 887350 non-null float64
57 tot_coll_amt 817103 non-null float64
58 tot_cur_bal 817103 non-null float64
59 open_acc_6m 21372 non-null float64
60 open_il_6m 21372 non-null float64
61 open_il_12m 21372 non-null float64
62 open_il_24m 21372 non-null float64
63 mths_since_rcnt_il 20810 non-null float64
64 total_bal_il 21372 non-null float64
65 il_util 18617 non-null float64
66 open_rv_12m 21372 non-null float64
67 open_rv_24m 21372 non-null float64
68 max_bal_bc 21372 non-null float64
69 all_util 21372 non-null float64
70 total_rev_hi_lim 817103 non-null float64
71 inq_fi 21372 non-null float64
72 total_cu_tl 21372 non-null float64
73 inq_last_12m 21372 non-null float64
dtypes: float64(49), int64(2), object(23)
memory usage: 501.0+ MB
</code></pre></div>
<p>We'll be predicting credit risk. Let's select the most promising columns in the dataset so that we won't have to deal with all of the columns.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[[</span>
<span class="s2">"annual_inc"</span><span class="p">,</span>
<span class="s2">"collections_12_mths_ex_med"</span><span class="p">,</span>
<span class="s2">"delinq_2yrs"</span><span class="p">,</span>
<span class="s2">"dti"</span><span class="p">,</span>
<span class="s2">"emp_length"</span><span class="p">,</span>
<span class="s2">"home_ownership"</span><span class="p">,</span>
<span class="s2">"acc_now_delinq"</span><span class="p">,</span>
<span class="s2">"installment"</span><span class="p">,</span>
<span class="s2">"int_rate"</span><span class="p">,</span>
<span class="s2">"last_pymnt_amnt"</span><span class="p">,</span>
<span class="s2">"loan_amnt"</span><span class="p">,</span>
<span class="s2">"loan_status"</span><span class="p">,</span>
<span class="s2">"pub_rec"</span><span class="p">,</span>
<span class="s2">"purpose"</span><span class="p">,</span>
<span class="s2">"revol_util"</span><span class="p">,</span>
<span class="s2">"term"</span><span class="p">,</span>
<span class="s2">"total_pymnt_inv"</span><span class="p">,</span>
<span class="s2">"verification_status"</span>
<span class="p">]]</span>
</code></pre></div>
<p>To make the data easier to explore we'll rename the columns to make their names more user friendly.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"annual_inc"</span><span class="p">:</span> <span class="s2">"AnnualIncome"</span><span class="p">,</span>
<span class="s2">"collections_12_mths_ex_med"</span><span class="p">:</span> <span class="s2">"CollectionsInLast12Months"</span><span class="p">,</span>
<span class="s2">"delinq_2yrs"</span><span class="p">:</span> <span class="s2">"DelinquenciesInLast2Years"</span><span class="p">,</span>
<span class="s2">"dti"</span><span class="p">:</span> <span class="s2">"DebtToIncomeRatio"</span><span class="p">,</span>
<span class="s2">"emp_length"</span><span class="p">:</span> <span class="s2">"EmploymentLength"</span><span class="p">,</span>
<span class="s2">"home_ownership"</span><span class="p">:</span> <span class="s2">"HomeOwnership"</span><span class="p">,</span>
<span class="s2">"acc_now_delinq"</span><span class="p">:</span> <span class="s2">"NumberOfDelinquentAccounts"</span><span class="p">,</span>
<span class="s2">"installment"</span><span class="p">:</span> <span class="s2">"MonthlyInstallmentPayment"</span><span class="p">,</span>
<span class="s2">"int_rate"</span><span class="p">:</span> <span class="s2">"InterestRate"</span><span class="p">,</span>
<span class="s2">"last_pymnt_amnt"</span><span class="p">:</span> <span class="s2">"LastPaymentAmount"</span><span class="p">,</span>
<span class="s2">"loan_amnt"</span><span class="p">:</span> <span class="s2">"LoanAmount"</span><span class="p">,</span>
<span class="s2">"loan_status"</span><span class="p">:</span> <span class="s2">"LoanStatus"</span><span class="p">,</span>
<span class="s2">"pub_rec"</span><span class="p">:</span> <span class="s2">"DerogatoryPublicRecordCount"</span><span class="p">,</span>
<span class="s2">"purpose"</span><span class="p">:</span> <span class="s2">"LoanPurpose"</span><span class="p">,</span>
<span class="s2">"revol_util"</span><span class="p">:</span> <span class="s2">"RevolvingLineUtilizationRate"</span><span class="p">,</span>
<span class="s2">"term"</span><span class="p">:</span> <span class="s2">"Term"</span><span class="p">,</span>
<span class="s2">"total_pymnt_inv"</span><span class="p">:</span> <span class="s2">"TotalPaymentsToDate"</span><span class="p">,</span>
<span class="s2">"verification_status"</span><span class="p">:</span> <span class="s2">"VerificationStatus"</span><span class="p">,</span>
<span class="p">})</span>
</code></pre></div>
<p>We'll also build a simple data dictionary with the column descriptions that were downloaded with the dataset.</p>
<div class="highlight"><pre><span></span><code><span class="n">data_dictionary</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"AnnualIncome"</span><span class="p">:</span> <span class="s2">"The self-reported annual income provided by the borrower during registration."</span><span class="p">,</span>
<span class="s2">"CollectionsInLast12Months"</span><span class="p">:</span> <span class="s2">"Number of collections in 12 months excluding medical collections."</span><span class="p">,</span>
<span class="s2">"DelinquenciesInLast2Years"</span><span class="p">:</span> <span class="s2">"The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years."</span><span class="p">,</span>
<span class="s2">"DebtToIncomeRatio"</span><span class="p">:</span> <span class="s2">"A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income."</span><span class="p">,</span>
<span class="s2">"EmploymentLength"</span><span class="p">:</span> <span class="s2">"Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years."</span><span class="p">,</span>
<span class="s2">"HomeOwnership"</span><span class="p">:</span> <span class="s2">"The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER."</span><span class="p">,</span>
<span class="s2">"NumberOfDelinquentAccounts"</span><span class="p">:</span> <span class="s2">"The number of accounts on which the borrower is now delinquent."</span><span class="p">,</span>
<span class="s2">"MonthlyInstallmentPayment"</span><span class="p">:</span> <span class="s2">"The monthly payment owed by the borrower if the loan originates."</span><span class="p">,</span>
<span class="s2">"InterestRate"</span><span class="p">:</span> <span class="s2">"Interest Rate on the loan."</span><span class="p">,</span>
<span class="s2">"LastPaymentAmount"</span><span class="p">:</span> <span class="s2">"Last total payment amount received."</span><span class="p">,</span>
<span class="s2">"LoanAmount"</span><span class="p">:</span> <span class="s2">"The listed amount of the loan applied for by the borrower."</span><span class="p">,</span>
<span class="s2">"LoanStatus"</span><span class="p">:</span> <span class="s2">"Current status of the loan."</span><span class="p">,</span>
<span class="s2">"DerogatoryPublicRecordCount"</span><span class="p">:</span> <span class="s2">"Number of derogatory public records."</span><span class="p">,</span>
<span class="s2">"LoanPurpose"</span><span class="p">:</span> <span class="s2">"A category provided by the borrower for the loan request."</span><span class="p">,</span>
<span class="s2">"RevolvingLineUtilizationRate"</span><span class="p">:</span> <span class="s2">"Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit."</span><span class="p">,</span>
<span class="s2">"Term"</span><span class="p">:</span> <span class="s2">"The number of payments on the loan. Values are in months and can be either 36 or 60."</span><span class="p">,</span>
<span class="s2">"TotalPaymentsToDate"</span><span class="p">:</span> <span class="s2">"Payments received to date for portion of total amount funded by investors."</span><span class="p">,</span>
<span class="s2">"VerificationStatus"</span><span class="p">:</span> <span class="s2">"Indicates if income was verified."</span><span class="p">,</span>
<span class="p">}</span>
</code></pre></div>
<h2>Building a Model</h2>
<p>Now that we have the dataset, we'll start working on training a model. We'll be doing data exploration, data preparation, model training, and model validation.</p>
<h3>Profiling the Data</h3>
<p>Profiling the data can tell us a lot about the internal structure of the dataset. To profile the data, we'll use the <a href="https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/">pandas_profiling package</a>.</p>
<p>To install the package, we'll execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">pandas_profiling</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The profile is built with this code:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pandas_profiling</span> <span class="kn">import</span> <span class="n">ProfileReport</span>
<span class="n">profile</span> <span class="o">=</span> <span class="n">ProfileReport</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>
<span class="n">title</span><span class="o">=</span><span class="s2">"Credit Risk Analysis Dataset Report"</span><span class="p">,</span>
<span class="n">pool_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">progress_bar</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">dataset</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"description"</span><span class="p">:</span> <span class="s2">"Lending Club loan data, complete loan data for all loans issued through the 2007-2015."</span>
<span class="p">},</span>
<span class="n">variables</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"descriptions"</span><span class="p">:</span> <span class="n">data_dictionary</span>
<span class="p">})</span>
</code></pre></div>
<p>We passed the data dictionary we built to the profile; it will be saved in the generated report.</p>
<p>Once the report is created, we'll save it to disk as an HTML file.</p>
<div class="highlight"><pre><span></span><code><span class="n">profile</span><span class="o">.</span><span class="n">to_file</span><span class="p">(</span><span class="s2">"../credit_risk_model/model_files/data_exploration_report.html"</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Right away the profile will tell us a few key details about the dataset:</p>
<p><img alt="Data Overview" src="https://www.tekhnoal.com/data_overview_hcfmlm.png" width="100%"></p>
<p>The profile also contains a few alerts about the data:</p>
<p><img alt="Data Alerts" src="https://www.tekhnoal.com/data_alerts_hcfmlm.png" width="100%"></p>
<p>The profile has a description for each variable. Here's the description for the "LoanStatus" variable, which we will use to build the target variable.</p>
<p><img alt="Loan Status Variable Description" src="https://www.tekhnoal.com/loan_status_description_hcfmlm.png" width="100%"></p>
<p>By using the pandas_profiling package we can avoid rewriting the routine data analysis code that we would otherwise write for every dataset.</p>
<h3>Preparing the Data</h3>
<p>The column that we're interested in is the "LoanStatus" column which tells us the current status of the loan. The values in the column are:</p>
<div class="highlight"><pre><span></span><code><span class="nb">list</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s2">"LoanStatus"</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">())</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>['Fully Paid',
'Charged Off',
'Current',
'Default',
'Late (31-120 days)',
'In Grace Period',
'Late (16-30 days)',
'Does not meet the credit policy. Status:Fully Paid',
'Does not meet the credit policy. Status:Charged Off',
'Issued']
</code></pre></div>
<p>We'll be using this column to predict how risky a loan is. To do this, we'll need to create a target column that maps the values above into values that represent the riskiness of the loan. To keep things simple we'll simply create two categories for loans:</p>
<ul>
<li>"safe", for all loans that are current, fully paid off, in grace period, or the payment plan has not started yet</li>
<li>"risky", for all other loans</li>
</ul>
<p>Now we'll map the values above to the categories we want:</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span><span class="p">[</span><span class="s2">"LoanRisk"</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s2">"LoanStatus"</span><span class="p">]</span><span class="o">.</span><span class="n">replace</span><span class="p">({</span>
<span class="s2">"Fully Paid"</span><span class="p">:</span> <span class="s2">"safe"</span><span class="p">,</span>
<span class="s2">"Charged Off"</span><span class="p">:</span> <span class="s2">"risky"</span><span class="p">,</span>
<span class="s2">"Current"</span><span class="p">:</span> <span class="s2">"safe"</span><span class="p">,</span>
<span class="s2">"Default"</span><span class="p">:</span> <span class="s2">"risky"</span><span class="p">,</span>
<span class="s2">"Late (31-120 days)"</span><span class="p">:</span> <span class="s2">"risky"</span><span class="p">,</span>
<span class="s2">"In Grace Period"</span><span class="p">:</span> <span class="s2">"safe"</span><span class="p">,</span>
<span class="s2">"Late (16-30 days)"</span><span class="p">:</span> <span class="s2">"risky"</span><span class="p">,</span>
<span class="s2">"Does not meet the credit policy. Status:Fully Paid"</span><span class="p">:</span> <span class="s2">"safe"</span><span class="p">,</span>
<span class="s2">"Does not meet the credit policy. Status:Charged Off"</span><span class="p">:</span> <span class="s2">"risky"</span><span class="p">,</span>
<span class="s2">"Issued"</span><span class="p">:</span> <span class="s2">"safe"</span>
<span class="p">})</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(887379, 19)
</code></pre></div>
<p>Now we can remove the "LoanStatus" column since we created a new target column.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s2">"LoanStatus"</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(887379, 18)
</code></pre></div>
<p>Now that we have a defined target column, we can move on to fixing some things in the dataset. From the profile we can see that there are several problems with the data that we need to fix.</p>
<p>The profile tells us that there are rows with missing data. To simplify the data modeling, we'll drop these rows.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">dropna</span><span class="p">()</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(841954, 18)
</code></pre></div>
<p>All of the rows with missing values are now gone. </p>
<p>The "AnnualIncome" column is highly skewed. This is due to some rows which have outlier values, for example the max value for this column is $9,500,000. We'll fix this by removing rows with outlier values in that column. We'll remove the rows with values in this column that are more than three standard deviations from the mean like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">stats</span><span class="o">.</span><span class="n">zscore</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s2">"AnnualIncome"</span><span class="p">]))</span> <span class="o"><</span> <span class="mi">3</span><span class="p">)]</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(835167, 18)
</code></pre></div>
<p>Another column in the dataset that is highly skewed is "DebtToIncomeRatio". For example, the maximum value in this column is 9999, which is probably not correct since most of the values in the column fall between 0 and 100.</p>
<p>We'll use the same code to remove the outlier values for DebtToIncomeRatio.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">stats</span><span class="o">.</span><span class="n">zscore</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s2">"DebtToIncomeRatio"</span><span class="p">]))</span> <span class="o"><</span> <span class="mi">3</span><span class="p">)]</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(835112, 18)
</code></pre></div>
<p>The column "NumberOfDelinquentAccounts" is highly skewed because of a single record that has a value of 14. We'll remove the outliers by simply filtering out the rows with values above 6.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s2">"NumberOfDelinquentAccounts"</span><span class="p">]</span> <span class="o"><=</span> <span class="mi">6</span><span class="p">]</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(835111, 18)
</code></pre></div>
<p>The "HomeOwnership" column has several values that stand in for missing data. These values make up a small portion of the dataset, so we'll just remove the rows instead of doing data imputation.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s2">"HomeOwnership"</span><span class="p">]</span> <span class="o">!=</span> <span class="s2">"OTHER"</span><span class="p">]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s2">"HomeOwnership"</span><span class="p">]</span> <span class="o">!=</span> <span class="s2">"NONE"</span><span class="p">]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s2">"HomeOwnership"</span><span class="p">]</span> <span class="o">!=</span> <span class="s2">"ANY"</span><span class="p">]</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(834889, 18)
</code></pre></div>
<p>Looks like we only lost a few hundred rows from the dataset.</p>
<p>The variable "CollectionsInLast12Months" is not highly skewed, but it contains values that appear only once or very few times. There are very few samples with a value above 5; these samples are unlikely to be useful, so we'll remove them.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s2">"CollectionsInLast12Months"</span><span class="p">]</span> <span class="o"><=</span> <span class="mi">5</span><span class="p">]</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(834884, 18)
</code></pre></div>
<p>The same is true for the "DelinquenciesInLast2Years" and "DerogatoryPublicRecordCount" variables. There are very few samples with a value above 10 for these variables. We'll remove those samples.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s2">"DelinquenciesInLast2Years"</span><span class="p">]</span> <span class="o"><=</span> <span class="mi">10</span><span class="p">]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s2">"DerogatoryPublicRecordCount"</span><span class="p">]</span> <span class="o"><=</span> <span class="mi">10</span><span class="p">]</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(834415, 18)
</code></pre></div>
<p>The variable "RevolvingLineUtilizationRate" is a percentage whose values must be between 0 and 100. It shouldn't be possible to use more than 100% of a revolving line of credit, yet this variable has values above 100; we'll remove those samples because they're bad data.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s2">"RevolvingLineUtilizationRate"</span><span class="p">]</span> <span class="o"><=</span> <span class="mf">100.0</span><span class="p">]</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(831103, 18)
</code></pre></div>
<h3>Validating the Data</h3>
<p>Next, we'll use the <a href="https://docs.deepchecks.com/stable/getting-started/welcome.html">deepchecks package</a> to run ML-specific checks on the data. These checks look for data issues that might affect an ML model.</p>
<p>Let's install the package:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">deepchecks</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Before we can run these checks, we need to specify which variables are categorical and numerical, and which variable is the target variable. We'll create lists of variable names for this purpose.</p>
<div class="highlight"><pre><span></span><code><span class="n">categorical_variables</span> <span class="o">=</span> <span class="p">[</span>
<span class="s2">"EmploymentLength"</span><span class="p">,</span>
<span class="s2">"HomeOwnership"</span><span class="p">,</span>
<span class="s2">"LoanPurpose"</span><span class="p">,</span>
<span class="s2">"VerificationStatus"</span><span class="p">,</span>
<span class="s2">"Term"</span>
<span class="p">]</span>
<span class="n">numerical_variables</span> <span class="o">=</span> <span class="p">[</span>
<span class="s2">"AnnualIncome"</span><span class="p">,</span>
<span class="s2">"CollectionsInLast12Months"</span><span class="p">,</span>
<span class="s2">"DelinquenciesInLast2Years"</span><span class="p">,</span>
<span class="s2">"DebtToIncomeRatio"</span><span class="p">,</span>
<span class="s2">"NumberOfDelinquentAccounts"</span><span class="p">,</span>
<span class="s2">"MonthlyInstallmentPayment"</span><span class="p">,</span>
<span class="s2">"InterestRate"</span><span class="p">,</span>
<span class="s2">"LastPaymentAmount"</span><span class="p">,</span>
<span class="s2">"LoanAmount"</span><span class="p">,</span>
<span class="s2">"DerogatoryPublicRecordCount"</span><span class="p">,</span>
<span class="s2">"RevolvingLineUtilizationRate"</span><span class="p">,</span>
<span class="s2">"TotalPaymentsToDate"</span>
<span class="p">]</span>
<span class="n">target_variable</span> <span class="o">=</span> <span class="s2">"LoanRisk"</span>
<span class="n">all_variables</span> <span class="o">=</span> <span class="n">categorical_variables</span> <span class="o">+</span> <span class="n">numerical_variables</span> <span class="o">+</span> <span class="p">[</span><span class="n">target_variable</span><span class="p">]</span>
<span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">all_variables</span><span class="p">)</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">columns</span><span class="p">),</span> <span class="s2">"A column is missing from the lists of variables."</span>
</code></pre></div>
<p>The assert statement at the end of the code above is a simple check that ensures we haven't forgotten to include any of the columns in the lists. If a column is missing from the lists, it stops execution of the notebook with an error message.</p>
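<p>The length check tells us <em>that</em> a column is missing, but not <em>which</em> one. A set difference reports the offending names directly; a minimal sketch using stand-in column names:</p>

```python
# Report exactly which columns were left out of the variable lists,
# instead of only failing a length comparison. Names are illustrative.
all_variables = ["AnnualIncome", "LoanAmount", "LoanRisk"]
dataframe_columns = ["AnnualIncome", "LoanAmount", "InterestRate", "LoanRisk"]

missing = sorted(set(dataframe_columns) - set(all_variables))
print(missing)  # ['InterestRate'] -- the column we forgot to list
```

<p>In the notebook, <code>dataframe_columns</code> would be <code>data.columns</code>.</p>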
<p>To avoid errors during training, we'll need to change the column type of the categorical columns in the pandas dataframe to "category".</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">column_name</span> <span class="ow">in</span> <span class="n">categorical_variables</span><span class="p">:</span>
<span class="n">data</span><span class="p">[</span><span class="n">column_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">column_name</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s2">"category"</span><span class="p">)</span>
</code></pre></div>
<p>Let's start the data validation checks.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">deepchecks.tabular</span> <span class="kn">import</span> <span class="n">Dataset</span>
<span class="kn">from</span> <span class="nn">deepchecks.tabular.suites</span> <span class="kn">import</span> <span class="n">data_integrity</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>
<span class="n">cat_features</span><span class="o">=</span><span class="n">categorical_variables</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="n">target_variable</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The Dataset object contains a reference to the original Dataframe that we've been working with, and also contains the metadata about the columns in the dataframe that is needed to analyze the data.</p>
<p>We'll run the checks on the data like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">data_integrity_suite</span> <span class="o">=</span> <span class="n">data_integrity</span><span class="p">()</span>
<span class="n">suite_result</span> <span class="o">=</span> <span class="n">data_integrity_suite</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">suite_result</span><span class="o">.</span><span class="n">save_as_html</span><span class="p">(</span><span class="s2">"../credit_risk_model/model_files/deepchecks_data_integrity_results.html"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>'../credit_risk_model/model_files/deepchecks_data_integrity_results.html'
</code></pre></div>
<p>The checks done by this suite are geared towards datasets used for machine learning.</p>
<p>The results of the data integrity suite look like this:</p>
<p><img alt="Data Integrity Suite" src="https://www.tekhnoal.com/data_integrity_suite_hcfmlm.png" width="100%"></p>
<p>The suite contains many checks that execute on the data set, the checks that passed are:</p>
<ul>
<li>Feature Label Correlation, predictive power score is less than 0.8 for all features.</li>
<li>Single Value in Column, column does not contain only a single value</li>
<li>Special Characters, ratio of samples containing solely special character is less or equal to 0.1%</li>
<li>Mixed Nulls, number of different null types is less or equal to 1</li>
<li>Mixed Data Types, rare data types in column are either more than 10% or less than 1% of the data</li>
<li>Data Duplicates, duplicate data ratio is less or equal to 0%</li>
<li>String Length Out Of Bounds, ratio of string length outliers is less or equal to 0%</li>
<li>Conflicting Labels, ambiguous sample ratio is less or equal to 0%</li>
</ul>
<p>The checks are fully explained in the <a href="https://docs.deepchecks.com/0.9/checks_gallery/tabular.html">deepchecks documentation</a>.</p>
<p>For now, we're more interested in the checks that did not pass:</p>
<p><img alt="Data Integrity Suite Fail" src="https://www.tekhnoal.com/data_integrity_suite_fail_hcfmlm.png" width="100%"></p>
<p>The two checks that didn't pass are:</p>
<ul>
<li>Feature-Feature Correlation, not more than 0 pairs are correlated above 0.9</li>
<li>String Mismatch, no string variants</li>
</ul>
<p>The deepchecks package found that the "LoanAmount" and "MonthlyInstallmentPayment" variables are highly correlated, which makes sense because an increase in loan amount will always cause an increase in payment amount. We can safely drop the MonthlyInstallmentPayment column from the dataset.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s2">"MonthlyInstallmentPayment"</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">numerical_variables</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="s2">"MonthlyInstallmentPayment"</span><span class="p">)</span>
<span class="n">data</span><span class="o">.</span><span class="n">shape</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(831103, 17)
</code></pre></div>
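<p>The Feature-Feature Correlation check can also be reproduced directly with pandas, which is handy for a quick look outside of deepchecks. A sketch on toy data, where "MonthlyInstallmentPayment" is deliberately made an almost-linear function of "LoanAmount" (column names mirror the dataset, the data itself is synthetic):</p>

```python
import numpy as np
import pandas as pd

# "MonthlyInstallmentPayment" tracks "LoanAmount" almost exactly,
# while "InterestRate" is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "LoanAmount": x,
    "MonthlyInstallmentPayment": x * 0.02 + rng.normal(scale=0.001, size=200),
    "InterestRate": rng.normal(size=200),
})

corr = df.corr().abs()
# walk the upper triangle so each pair is considered once
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.9
]
print(pairs)  # the LoanAmount / MonthlyInstallmentPayment pair, correlation near 1
```

<p>This is the same idea deepchecks applies, though its actual implementation also handles categorical variables.</p>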
<p>The deepchecks package also found that the "EmploymentLength" variable contains string values that are similar to each other. For example, two levels found in the categorical variable are "1 year" and "< 1 year". This is a warning that we can ignore because the levels are correctly set.</p>
<p>We're now getting closer to a dataset that we can use to train a model. We'll be using deepchecks to do train/test dataset checks and model checks later on.</p>
<h3>Training a Model</h3>
<p>To train a model, we'll first create a training and testing set. We'll use 80% of the rows for training and 20% of the rows for testing.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">))</span> <span class="o"><</span> <span class="mf">0.80</span>
<span class="n">training_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">mask</span><span class="p">]</span>
<span class="n">testing_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="o">~</span><span class="n">mask</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">training_data</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">testing_data</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>(664983, 17)
(166120, 17)
</code></pre></div>
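<p>One caveat about the split above: the random mask is unseeded, so re-running the notebook produces a different train/test partition each time. Seeding the generator makes the split reproducible; a small sketch of the same masking approach:</p>

```python
import numpy as np

# Same 80/20 masking idea as above, but with a fixed seed so the
# split is identical on every run.
rng = np.random.default_rng(42)
n_rows = 100_000
mask = rng.random(n_rows) < 0.80

train_count = int(mask.sum())
test_count = int((~mask).sum())
assert train_count + test_count == n_rows  # every row lands in exactly one set
```

<p>The same effect can also be had by passing <code>random_state</code> to scikit-learn's <code>train_test_split</code>.</p>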
<p>Next, we'll run a deepchecks test suite on the newly created training and testing sets. The deepchecks package requires two Dataset objects, one for the training set and one for the testing set.</p>
<div class="highlight"><pre><span></span><code><span class="n">train_dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="n">target_variable</span><span class="p">,</span>
<span class="n">cat_features</span><span class="o">=</span><span class="n">categorical_variables</span><span class="p">)</span>
<span class="n">test_dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">(</span><span class="n">testing_data</span><span class="p">,</span>
<span class="n">label</span><span class="o">=</span><span class="n">target_variable</span><span class="p">,</span>
<span class="n">cat_features</span><span class="o">=</span><span class="n">categorical_variables</span><span class="p">)</span>
</code></pre></div>
<p>Now we can run the train-test validation suite.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">deepchecks.tabular.suites</span> <span class="kn">import</span> <span class="n">train_test_validation</span>
<span class="n">validation_suite</span> <span class="o">=</span> <span class="n">train_test_validation</span><span class="p">()</span>
<span class="n">suite_result</span> <span class="o">=</span> <span class="n">validation_suite</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">test_dataset</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>We'll save the results to files for this suite as well.</p>
<div class="highlight"><pre><span></span><code><span class="n">suite_result</span><span class="o">.</span><span class="n">save_as_html</span><span class="p">(</span><span class="s2">"../credit_risk_model/model_files/deepchecks_train_test_results.html"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>'../credit_risk_model/model_files/deepchecks_train_test_results.html'
</code></pre></div>
<p>The results of the suite look like this:</p>
<p><img alt="Train Test Suite" src="https://www.tekhnoal.com/train_test_suite_hcfmlm.png" width="100%"></p>
<p>All of the checks in this suite passed:</p>
<ul>
<li>Datasets Size Comparison, Test-Train size ratio is greater than 0.01</li>
<li>Category Mismatch Train Test, ratio of samples with a new category is less or equal to 0%</li>
<li>Feature Label Correlation Change, Train-Test features' Predictive Power Score difference is less than 0.2</li>
<li>Feature Label Correlation Change, Train features' Predictive Power Score is less than 0.7</li>
<li>Train Test Feature Drift, categorical drift score < 0.2 and numerical drift score < 0.1</li>
<li>Train Test Label Drift, categorical drift score < 0.2 and numerical drift score < 0.1 for label drift (the label's Cramer's V drift score is 0)</li>
<li>New Label Train Test, number of new label values is less or equal to 0</li>
<li>String Mismatch Comparison, no new variants allowed in test data</li>
<li>Train Test Samples Mix, percentage of test data samples that appear in train data is less or equal to 10%</li>
<li>Multivariate Drift, drift value is less than 0.25</li>
</ul>
<p>These checks are more fully explained in the <a href="https://docs.deepchecks.com/0.9/checks_gallery/tabular/train_test_validation/">documentation</a>.</p>
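<p>For intuition, the Train Test Samples Mix check boils down to counting test rows that also appear verbatim in the training data, which can be approximated with a pandas inner merge. A toy sketch (not deepchecks' actual implementation):</p>

```python
import pandas as pd

# Toy frames: one of the two test rows also appears verbatim in the training data.
train = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
test = pd.DataFrame({"a": [3, 4], "b": ["z", "w"]})

# With no `on` argument, merge joins on all shared columns,
# so the result is exactly the rows present in both frames.
overlap = test.merge(train, how="inner")
mix_ratio = len(overlap) / len(test)
print(mix_ratio)  # 0.5
```

<p>The check passes when this ratio stays at or below 10%.</p>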
<p>Now that we have verified the contents of the training and testing sets, we're finally ready to train a model. To do this, we'll need to create separate dataframes for the predictor and target columns:</p>
<div class="highlight"><pre><span></span><code><span class="n">feature_columns</span> <span class="o">=</span> <span class="n">categorical_variables</span> <span class="o">+</span> <span class="n">numerical_variables</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">training_data</span><span class="p">[</span><span class="n">feature_columns</span><span class="p">]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">training_data</span><span class="p">[</span><span class="n">target_variable</span><span class="p">]</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">testing_data</span><span class="p">[</span><span class="n">feature_columns</span><span class="p">]</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">testing_data</span><span class="p">[</span><span class="n">target_variable</span><span class="p">]</span>
</code></pre></div>
<p>We'll be using the <a href="https://lightgbm.readthedocs.io/en/latest/index.html">LightGBM package</a> to train a GBM model and the <a href="https://microsoft.github.io/FLAML/">FLAML package</a> for doing automated machine learning. Let's install the packages:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">lightgbm</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">flaml</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Let's train a model using the default hyperparameters to have a baseline.</p>
<div class="highlight"><pre><span></span><code><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">lightgbm</span> <span class="kn">import</span> <span class="n">LGBMClassifier</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">LGBMClassifier</span><span class="p">()</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>CPU times: user 13.1 s, sys: 657 ms, total: 13.8 s
Wall time: 3.46 s
</code></pre></div>
<p>Now let's calculate the classification metrics for this simple model:</p>
<div class="highlight"><pre><span></span><code><span class="o">%%</span><span class="n">time</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">classification_report</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> precision recall f1-score support
risky 0.92 0.14 0.24 11456
safe 0.94 1.00 0.97 154664
accuracy 0.94 166120
macro avg 0.93 0.57 0.61 166120
weighted avg 0.94 0.94 0.92 166120
CPU times: user 8.96 s, sys: 191 ms, total: 9.15 s
Wall time: 6.52 s
</code></pre></div>
<p>The "safe" class has good metrics but the "risky" class does not, because the classes are heavily imbalanced: "risky" samples make up only about 7% of the test set.</p>
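<p>One common remedy for imbalance, not used in this notebook, is to reweight the classes so that errors on the rare class cost more. A sketch of the "balanced" weighting heuristic used by scikit-learn (and accepted by LightGBM's <code>class_weight</code> parameter):</p>

```python
from collections import Counter

def balanced_class_weights(labels):
    """The 'balanced' heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

# Roughly the safe/risky ratio seen in the test set above
weights = balanced_class_weights(["safe"] * 93 + ["risky"] * 7)
print(weights)  # safe weighted near 0.54, risky near 7.14
```

<p>The rare class ends up weighted roughly 13x higher, which pushes the model to stop ignoring it.</p>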
<p>We'll try to fix this issue by doing automated ML with the FLAML package. The automated hyperparameter search will hopefully find some parameters that can improve the metrics of the "risky" class.</p>
<p>The settings are:</p>
<ul>
<li>time_budget: amount of time allowed for the auto ML algorithm to run</li>
<li>metric: the metric that should be maximized by the auto ML algorithm</li>
<li>estimator_list: the types of estimators that can be used by FLAML, in this case we only want to try LightGBM</li>
<li>task: the type of task that the estimator should be solving</li>
<li>log_file_name: name of the log file output by the auto ML algorithm</li>
<li>seed: the random seed to be used by the auto ML algorithm</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">flaml</span> <span class="kn">import</span> <span class="n">AutoML</span>
<span class="n">automl</span> <span class="o">=</span> <span class="n">AutoML</span><span class="p">()</span>
<span class="n">settings</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"time_budget"</span><span class="p">:</span> <span class="mi">1200</span><span class="p">,</span>
<span class="s2">"metric"</span><span class="p">:</span> <span class="s2">"roc_auc"</span><span class="p">,</span>
<span class="s2">"estimator_list"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"lgbm"</span><span class="p">],</span>
<span class="s2">"task"</span><span class="p">:</span> <span class="s2">"classification"</span><span class="p">,</span>
<span class="s2">"log_file_name"</span><span class="p">:</span> <span class="s2">"experiment.log"</span><span class="p">,</span>
<span class="s2">"seed"</span><span class="p">:</span> <span class="mi">42</span>
<span class="p">}</span>
<span class="n">automl</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="o">=</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="o">=</span><span class="n">y_train</span><span class="p">,</span> <span class="o">**</span><span class="n">settings</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The hyperparameter search has found an optimal set of hyperparameters using the training set and cross validation. These are the hyperparameters found:</p>
<div class="highlight"><pre><span></span><code><span class="n">automl</span><span class="o">.</span><span class="n">best_config</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{'</span><span class="n">n_estimators</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">10707</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">num_leaves</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">7</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">min_child_samples</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">62</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">learning_rate</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">0.24185440044608203</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">log_max_bin</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">10</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">colsample_bytree</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">0.9914098492087268</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">reg_alpha</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">2.551067627605118</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">reg_lambda</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">0.0010846951681516895</span><span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Let's train a model using the optimal hyperparameters:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">LGBMClassifier</span><span class="p">(</span><span class="o">**</span><span class="n">automl</span><span class="o">.</span><span class="n">best_config</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div>
<p>Let's get the classification metrics for the best model:</p>
<div class="highlight"><pre><span></span><code><span class="n">y_pred</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> precision recall f1-score support
risky 0.89 0.46 0.61 11456
safe 0.96 1.00 0.98 154664
accuracy 0.96 166120
macro avg 0.93 0.73 0.79 166120
weighted avg 0.96 0.96 0.95 166120
</code></pre></div>
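<p>The improvement is easiest to read in the "risky" row: recall jumped from 0.14 to 0.46. To make the columns concrete, here is how precision, recall, and f1 fall out of the confusion counts, with illustrative numbers chosen to roughly mirror the "risky" row:</p>

```python
def precision_recall_f1(tp, fp, fn):
    # precision: of the samples predicted positive, how many were right
    # recall: of the actual positives, how many were found
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(precision_recall_f1(90, 10, 110))  # precision 0.9, recall 0.45, f1 near 0.6
```

<p>Notice that precision barely moved between the two models; nearly all of the f1 gain came from recall, meaning the tuned model finds far more of the risky loans without flagging many more safe ones.</p>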
<h3>Validating the Model</h3>
<p>Deepchecks is also able to validate the model with the model_evaluation suite of checks.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">deepchecks.tabular.suites</span> <span class="kn">import</span> <span class="n">model_evaluation</span>
<span class="n">evaluation_suite</span> <span class="o">=</span> <span class="n">model_evaluation</span><span class="p">()</span>
<span class="n">suite_result</span> <span class="o">=</span> <span class="n">evaluation_suite</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">,</span> <span class="n">test_dataset</span><span class="p">,</span> <span class="n">model</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">suite_result</span><span class="o">.</span><span class="n">save_as_html</span><span class="p">(</span><span class="s2">"../credit_risk_model/model_files/deepchecks_model_evaluation_results.html"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>'../credit_risk_model/model_files/deepchecks_model_evaluation_results.html'
</code></pre></div>
<h3>Packaging the Model Files</h3>
<p>Let's serialize the best model to disk:</p>
<div class="highlight"><pre><span></span><code><span class="o">%%</span><span class="n">time</span>
<span class="kn">import</span> <span class="nn">joblib</span>
<span class="n">joblib</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="s2">"../credit_risk_model/model_files/model.joblib"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>CPU times: user 1.33 s, sys: 82.1 ms, total: 1.41 s
Wall time: 226 ms
['../credit_risk_model/model_files/model.joblib']
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">ls</span> <span class="o">-</span><span class="n">la</span> <span class="o">../</span><span class="n">credit_risk_model</span><span class="o">/</span><span class="n">model_files</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>total 82824
drwxr-xr-x 8 brian staff 256 Jan 15 20:07 [34m.[m[m
drwxr-xr-x 8 brian staff 256 Dec 13 11:15 [34m..[m[m
-rw-r--r--@ 1 brian staff 6148 Jan 15 19:09 .DS_Store
-rw-r--r-- 1 brian staff 9426966 Jan 15 19:12 data_exploration_report.html
-rw-r--r-- 1 brian staff 7750291 Jan 15 19:15 deepchecks_data_integrity_results.html
-rw-r--r--@ 1 brian staff 7964754 Jan 15 20:06 deepchecks_model_evaluation_results.html
-rw-r--r-- 1 brian staff 7793791 Jan 15 19:17 deepchecks_train_test_results.html
-rw-r--r-- 1 brian staff 9452707 Jan 15 20:07 model.joblib
</code></pre></div>
<p>The serialized model is 9.5 megabytes in size and took 226 milliseconds to write to disk. This is important to note because we will need to deserialize the model later in order to make predictions with it.</p>
<p>In the process of training this model, we created a few files. We'll package them up and save them in a location that the prediction code can access later, using a .zip file for this purpose.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">shutil</span>
<span class="n">shutil</span><span class="o">.</span><span class="n">make_archive</span><span class="p">(</span><span class="s2">"../credit_risk_model/model_files/1"</span><span class="p">,</span> <span class="s2">"zip"</span><span class="p">,</span> <span class="s2">"../credit_risk_model/model_files"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="s1">'</span><span class="s">/Users/brian/Code/health-checks-for-ml-model-deployments/credit_risk_model/model_files/1.zip</span><span class="s1">'</span>
</code></pre></div>
<p>The command created a .zip file containing all of the files in the model_files folder. The archive is named "1.zip", a simple name denoting that it is the first model trained for the credit_risk_model package.</p>
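<p>The packaging step can be exercised end-to-end on a throwaway folder using only the standard library; this sketch builds a fake model_files directory, archives it the same way, and inspects the archive's contents (file names are stand-ins):</p>

```python
import pathlib
import shutil
import tempfile
import zipfile

# Recreate the packaging step on a temporary folder and verify the archive.
with tempfile.TemporaryDirectory() as tmp:
    model_files = pathlib.Path(tmp) / "model_files"
    model_files.mkdir()
    (model_files / "model.joblib").write_bytes(b"serialized model placeholder")
    (model_files / "deepchecks_results.html").write_text("<html></html>")

    # make_archive appends ".zip" and returns the archive's full path
    archive = shutil.make_archive(str(pathlib.Path(tmp) / "1"), "zip", model_files)
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())

print(names)  # ['deepchecks_results.html', 'model.joblib']
```

<p>Because the third argument (<code>root_dir</code>) is the model_files folder itself, entries in the archive are stored relative to it, which is what the prediction code will expect when it unpacks the files.</p>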
<p>Now that we have the model files in a .zip file, we can delete the original files from the folder:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">rm</span> <span class="o">../</span><span class="n">credit_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">data_exploration_report</span><span class="o">.</span><span class="n">html</span>
<span class="err">!</span><span class="n">rm</span> <span class="o">../</span><span class="n">credit_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">deepchecks_data_integrity_results</span><span class="o">.</span><span class="n">html</span>
<span class="err">!</span><span class="n">rm</span> <span class="o">../</span><span class="n">credit_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">deepchecks_train_test_results</span><span class="o">.</span><span class="n">html</span>
<span class="err">!</span><span class="n">rm</span> <span class="o">../</span><span class="n">credit_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">deepchecks_model_evaluation_results</span><span class="o">.</span><span class="n">html</span>
<span class="err">!</span><span class="n">rm</span> <span class="o">../</span><span class="n">credit_risk_model</span><span class="o">/</span><span class="n">model_files</span><span class="o">/</span><span class="n">model</span><span class="o">.</span><span class="n">joblib</span>
</code></pre></div>
<p>This packaging process ensures that all of the results of the model training process end up in one archive that we can use later. All of the data and model check results are packaged along with the serialized model, so it's easy to review the model training process.</p>
<h2>Making Predictions with the Model</h2>
<p>We now have a working model that accepts Pandas dataframes as input and returns predictions as dataframes. This is useful during model training, but it makes integrating the model with other software components more complicated. To make the model easier to use, we'll create input and output schemas for the model, along with a wrapper class that provides a consistent interface.</p>
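<p>The wrapper idea can be sketched as an abstract interface that every deployed model implements; callers then depend only on <code>predict</code>, never on the underlying LightGBM object. This is a hypothetical illustration of the pattern, not the actual class we'll build:</p>

```python
import abc


class MLModel(abc.ABC):
    """Hypothetical wrapper interface: a sketch of the idea, not the real base class."""

    @abc.abstractmethod
    def predict(self, data: dict) -> dict:
        """Validate the input, make a prediction, and return a validated output."""


class CreditRiskModelStub(MLModel):
    # Stand-in: the real wrapper would deserialize model.joblib, build a
    # dataframe from the validated input, and call the LightGBM predict().
    def predict(self, data: dict) -> dict:
        return {"loan_risk": "safe"}


print(CreditRiskModelStub().predict({"AnnualIncome": 50000}))  # {'loan_risk': 'safe'}
```

<p>Because decorators implement the same interface, logging or data enrichment can wrap any model without the caller noticing.</p>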
<p>We'll create the model's input and output schemas with the <a href="https://pydantic-docs.helpmanual.io/">pydantic package</a>, which is a package used for data validation. By creating the schemas using this package we're able to fully document the inputs that the model accepts and the expected outputs of the model we're going to deploy.</p>
<p>To begin, we'll define the allowed values for the categorical variables.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span><span class="p">,</span> <span class="n">Field</span>
<span class="kn">from</span> <span class="nn">enum</span> <span class="kn">import</span> <span class="n">Enum</span>
<span class="k">class</span> <span class="nc">EmploymentLength</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""Employment length in years."""</span>
<span class="n">less_than_1_year</span> <span class="o">=</span> <span class="s2">"< 1 year"</span>
<span class="n">one_year</span> <span class="o">=</span> <span class="s2">"1 year"</span>
<span class="n">two_years</span> <span class="o">=</span> <span class="s2">"2 years"</span>
<span class="n">three_years</span> <span class="o">=</span> <span class="s2">"3 years"</span>
<span class="n">four_years</span> <span class="o">=</span> <span class="s2">"4 years"</span>
<span class="n">five_years</span> <span class="o">=</span> <span class="s2">"5 years"</span>
<span class="n">six_years</span> <span class="o">=</span> <span class="s2">"6 years"</span>
<span class="n">seven_years</span> <span class="o">=</span> <span class="s2">"7 years"</span>
<span class="n">eight_years</span> <span class="o">=</span> <span class="s2">"8 years"</span>
<span class="n">nine_years</span> <span class="o">=</span> <span class="s2">"9 years"</span>
<span class="n">ten_years_or_more</span> <span class="o">=</span> <span class="s2">"10+ years"</span>
<span class="k">class</span> <span class="nc">HomeOwnership</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""The home ownership status provided by the borrower during registration."""</span>
<span class="n">MORTGAGE</span> <span class="o">=</span> <span class="s2">"MORTGAGE"</span>
<span class="n">RENT</span> <span class="o">=</span> <span class="s2">"RENT"</span>
<span class="n">OWN</span> <span class="o">=</span> <span class="s2">"OWN"</span>
<span class="k">class</span> <span class="nc">LoanPurpose</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""A category provided by the borrower for the loan request."""</span>
<span class="n">debt_consolidation</span> <span class="o">=</span> <span class="s2">"debt_consolidation"</span>
<span class="n">credit_card</span> <span class="o">=</span> <span class="s2">"credit_card"</span>
<span class="n">home_improvement</span> <span class="o">=</span> <span class="s2">"home_improvement"</span>
<span class="n">other</span> <span class="o">=</span> <span class="s2">"other"</span>
<span class="n">major_purchase</span> <span class="o">=</span> <span class="s2">"major_purchase"</span>
<span class="n">small_business</span> <span class="o">=</span> <span class="s2">"small_business"</span>
<span class="n">car</span> <span class="o">=</span> <span class="s2">"car"</span>
<span class="n">medical</span> <span class="o">=</span> <span class="s2">"medical"</span>
<span class="n">moving</span> <span class="o">=</span> <span class="s2">"moving"</span>
<span class="n">vacation</span> <span class="o">=</span> <span class="s2">"vacation"</span>
<span class="n">wedding</span> <span class="o">=</span> <span class="s2">"wedding"</span>
<span class="n">house</span> <span class="o">=</span> <span class="s2">"house"</span>
<span class="n">renewable_energy</span> <span class="o">=</span> <span class="s2">"renewable_energy"</span>
<span class="n">educational</span> <span class="o">=</span> <span class="s2">"educational"</span>
<span class="k">class</span> <span class="nc">Term</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""The number of payments on the loan."""</span>
<span class="n">thirty_six_months</span> <span class="o">=</span> <span class="s2">" 36 months"</span>
<span class="n">sixty_months</span> <span class="o">=</span> <span class="s2">" 60 months"</span>
<span class="k">class</span> <span class="nc">VerificationStatus</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""Indicates if income was verified."""</span>
<span class="n">source_verified</span> <span class="o">=</span> <span class="s2">"Source Verified"</span>
<span class="n">verified</span> <span class="o">=</span> <span class="s2">"Verified"</span>
<span class="n">not_verified</span> <span class="o">=</span> <span class="s2">"Not Verified"</span>
</code></pre></div>
<p>The dataset contains 5 categorical variables, so we defined 5 Enum classes that contain the values accepted for these variables. Each enumeration has a key and a value, with the value being the value as the model expects to see it. By using enumerated values, we ensure that the model only receives input values that it has previously seen in the training set.</p>
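<p>Since each enumeration inherits from both str and Enum, constructing a member from a raw string validates it automatically. Here is a minimal, standard-library-only sketch (reusing the HomeOwnership values from above) showing how an unseen category is rejected:</p>

```python
from enum import Enum

class HomeOwnership(str, Enum):
    """The home ownership status provided by the borrower during registration."""
    MORTGAGE = "MORTGAGE"
    RENT = "RENT"
    OWN = "OWN"

# A value seen in the training set maps to an enum member.
status = HomeOwnership("RENT")

# A value the model has never seen raises a ValueError.
try:
    HomeOwnership("CONDO")
    rejected = False
except ValueError:
    rejected = True
```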
<p>Now that we have the categorical variables defined, we can define the input schema for the model:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">CreditRiskModelInput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="sd">"""Inputs for predicting credit risk."""</span>
<span class="n">annual_income</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mi">1896</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">273000</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"The self-reported annual income provided by the borrower during registration."</span><span class="p">)</span>
<span class="n">collections_in_last_12_months</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Number of collections in 12 months excluding medical collections."</span><span class="p">)</span>
<span class="n">delinquencies_in_last_2_years</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">39</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years."</span><span class="p">)</span>
<span class="n">debt_to_income_ratio</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">42.64</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income."</span><span class="p">)</span>
<span class="n">employment_length</span><span class="p">:</span> <span class="n">EmploymentLength</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"Employment length in years."</span><span class="p">)</span>
<span class="n">home_ownership</span><span class="p">:</span> <span class="n">HomeOwnership</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"The home ownership status provided by the borrower during registration. Values are: RENT, OWN, MORTGAGE."</span><span class="p">)</span>
<span class="n">number_of_delinquent_accounts</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"The number of accounts on which the borrower is now delinquent."</span><span class="p">)</span>
<span class="n">interest_rate</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mf">5.32</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">28.99</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Interest Rate on the loan."</span><span class="p">)</span>
<span class="n">last_payment_amount</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">36475.59</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Last total payment amount received."</span><span class="p">)</span>
<span class="n">loan_amount</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mf">500.0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">35000.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"The listed amount of the loan applied for by the borrower."</span><span class="p">)</span>
<span class="n">derogatory_public_record_count</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">86.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Number of derogatory public records."</span><span class="p">)</span>
<span class="n">loan_purpose</span><span class="p">:</span> <span class="n">LoanPurpose</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"A category provided by the borrower for the loan request."</span><span class="p">)</span>
<span class="n">revolving_line_utilization_rate</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">892.3</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit."</span><span class="p">)</span>
<span class="n">term</span><span class="p">:</span> <span class="n">Term</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"The number of payments on the loan. Values are in months and can be either 36 or 60."</span><span class="p">)</span>
<span class="n">total_payments_to_date</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">ge</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">57777.58</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Payments received to date for portion of total amount funded by investors."</span><span class="p">)</span>
<span class="n">verification_status</span><span class="p">:</span> <span class="n">VerificationStatus</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"Indicates if income was verified."</span><span class="p">)</span>
</code></pre></div>
<p>The schema is called "CreditRiskModelInput" and contains fields for each variable found in the dataset. We're using the Enum classes we defined above for the categorical fields, and we defined fields for all of the numerical variables. Each numerical field has a range of allowed values that matches the range of the numerical variable found in the dataset. Each field also has a description of the variable that helps the user of the model to correctly feed data to the model.</p>
<p>The process of creating an input data schema makes the model much more user friendly and exposes information found in the dataset that the model was originally trained on to the user of the model. Doing this also allows us to build documentation for the model service automatically.</p>
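<p>To see the validation in action, here is a small, hypothetical schema with just two of the fields above; pydantic rejects any input that falls outside the declared ranges (this sketch assumes pydantic is installed and works with both pydantic v1 and v2):</p>

```python
from pydantic import BaseModel, Field, ValidationError

class LoanInput(BaseModel):
    """Hypothetical, trimmed-down version of CreditRiskModelInput."""
    annual_income: int = Field(ge=1896, le=273000)
    interest_rate: float = Field(ge=5.32, le=28.99)

# Values inside the training-set ranges are accepted.
valid = LoanInput(annual_income=50000, interest_rate=7.5)

# Values outside the ranges raise a ValidationError.
try:
    LoanInput(annual_income=0, interest_rate=7.5)
    raised = False
except ValidationError:
    raised = True
```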
<p>Now that we have the model's input schema defined, we'll define the output schema:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">CreditRisk</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="sd">"""Indicates whether or not loan is risky."""</span>
<span class="n">safe</span> <span class="o">=</span> <span class="s2">"safe"</span>
<span class="n">risky</span> <span class="o">=</span> <span class="s2">"risky"</span>
<span class="k">class</span> <span class="nc">CreditRiskModelOutput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">credit_risk</span><span class="p">:</span> <span class="n">CreditRisk</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"Whether or not the loan is risky."</span><span class="p">)</span>
</code></pre></div>
<p>The model is a classification model and the output schema simply enumerates the classes that the model can predict. </p>
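<p>Because CreditRisk inherits from str as well as Enum, its members compare equal to plain strings and serialize naturally, which keeps the model's output easy to consume. A standard-library-only sketch:</p>

```python
from enum import Enum

class CreditRisk(str, Enum):
    """Indicates whether or not loan is risky."""
    safe = "safe"
    risky = "risky"

prediction = CreditRisk.safe

# str-backed enum members compare equal to their string values...
is_safe = prediction == "safe"

# ...and lookup by name mirrors the CreditRisk[y_hat] call used in predict().
recovered = CreditRisk["risky"]
```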
<p>We now have the input and output schemas defined, so we can tie it all together by creating a wrapper class for the model. The <a href="https://pypi.org/project/ml-base/">ml_base package</a> defines a simple base class for model prediction code that allows us to "wrap" the prediction code in a class that follows the MLModel interface. This interface publishes this information about the model:</p>
<ul>
<li>Qualified Name, a unique identifier for the model</li>
<li>Display Name, a friendly name for the model used in user interfaces</li>
<li>Description, a description for the model</li>
<li>Version, semantic version of the model codebase</li>
<li>Input Schema, an object that describes the model's input data</li>
<li>Output Schema, an object that describes the model's output data</li>
</ul>
<p>The MLModel interface also dictates that the model class implements two methods:</p>
<ul>
<li>__init__, the initialization method, which loads any model artifacts needed to make predictions </li>
<li>predict, the prediction method, which receives model inputs, makes a prediction, and returns model outputs </li>
</ul>
<p>By using the MLModel base class we'll be able to do more interesting things later with the model. If you'd like to learn more about the ml_base package, <a href="https://schmidtbri.github.io/ml-base/basic/">here</a> is some documentation about it.</p>
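<p>As an illustration only (not the actual class from the ml_base package), an interface of this kind can be sketched with the standard library's abc module:</p>

```python
from abc import ABC, abstractmethod

class ModelInterface(ABC):
    """Illustrative stand-in for an MLModel-style base class."""

    @property
    @abstractmethod
    def qualified_name(self) -> str:
        """A unique identifier for the model."""

    @abstractmethod
    def predict(self, data):
        """Receive model input, make a prediction, and return model output."""

class EchoModel(ModelInterface):
    """Trivial implementation used only to show the contract."""

    @property
    def qualified_name(self) -> str:
        return "echo_model"

    def predict(self, data):
        return data

model = EchoModel()
```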
<p>To install the ml_base package, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">ml_base</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>We'll define the wrapper class like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">joblib</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">ml_base</span> <span class="kn">import</span> <span class="n">MLModel</span>
<span class="kn">import</span> <span class="nn">zipfile</span>
<span class="vm">__file__</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">realpath</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="s1">''</span><span class="p">))),</span> <span class="s2">"credit_risk_model"</span><span class="p">,</span> <span class="s2">"prediction"</span><span class="p">,</span> <span class="s2">"model.py"</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">CreditRiskModel</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="sd">"""Prediction logic for the Credit Risk Model."""</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">display_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="sd">"""Return display name of model."""</span>
<span class="k">return</span> <span class="s2">"Credit Risk Model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">qualified_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="sd">"""Return qualified name of model."""</span>
<span class="k">return</span> <span class="s2">"credit_risk_model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">description</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="sd">"""Return description of model."""</span>
<span class="k">return</span> <span class="s2">"Model to predict the credit risk of a loan."</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">version</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="sd">"""Return version of model."""</span>
<span class="k">return</span> <span class="s2">"0.1.0"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="sd">"""Return input schema of model."""</span>
<span class="k">return</span> <span class="n">CreditRiskModelInput</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="sd">"""Return output schema of model."""</span>
<span class="k">return</span> <span class="n">CreditRiskModelOutput</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="sd">"""Class constructor that loads and deserializes the model parameters."""</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">realpath</span><span class="p">(</span><span class="vm">__file__</span><span class="p">)))</span>
<span class="n">file_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">dir_path</span><span class="p">,</span> <span class="s2">"model_files"</span><span class="p">,</span> <span class="s2">"1.zip"</span><span class="p">)</span>
<span class="k">with</span> <span class="n">zipfile</span><span class="o">.</span><span class="n">ZipFile</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span> <span class="k">as</span> <span class="n">zf</span><span class="p">:</span>
<span class="k">if</span> <span class="s2">"model.joblib"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">zf</span><span class="o">.</span><span class="n">namelist</span><span class="p">():</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Could not find model file in zip file."</span><span class="p">)</span>
<span class="n">model_file</span> <span class="o">=</span> <span class="n">zf</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s2">"model.joblib"</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">model_file</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">CreditRiskModelInput</span><span class="p">)</span> <span class="o">-></span> <span class="n">CreditRiskModelOutput</span><span class="p">:</span>
<span class="sd">"""Make a prediction with the model.</span>
<span class="sd"> Params:</span>
<span class="sd"> data: Data for making a prediction with the model.</span>
<span class="sd"> Returns:</span>
<span class="sd"> The result of the prediction.</span>
<span class="sd"> """</span>
<span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="n">CreditRiskModelInput</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Input must be of type 'CreditRiskModelInput'"</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([[</span>
<span class="n">data</span><span class="o">.</span><span class="n">employment_length</span><span class="o">.</span><span class="n">value</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">home_ownership</span><span class="o">.</span><span class="n">value</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">loan_purpose</span><span class="o">.</span><span class="n">value</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">verification_status</span><span class="o">.</span><span class="n">value</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">term</span><span class="o">.</span><span class="n">value</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">annual_income</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">collections_in_last_12_months</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">delinquencies_in_last_2_years</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">debt_to_income_ratio</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">number_of_delinquent_accounts</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">interest_rate</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">last_payment_amount</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">loan_amount</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">derogatory_public_record_count</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">revolving_line_utilization_rate</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">total_payments_to_date</span><span class="p">,</span>
<span class="p">]],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span>
<span class="s2">"EmploymentLength"</span><span class="p">,</span>
<span class="s2">"HomeOwnership"</span><span class="p">,</span>
<span class="s2">"LoanPurpose"</span><span class="p">,</span>
<span class="s2">"VerificationStatus"</span><span class="p">,</span>
<span class="s2">"Term"</span><span class="p">,</span>
<span class="s2">"AnnualIncome"</span><span class="p">,</span>
<span class="s2">"CollectionsInLast12Months"</span><span class="p">,</span>
<span class="s2">"DelinquenciesInLast2Years"</span><span class="p">,</span>
<span class="s2">"DebtToIncomeRatio"</span><span class="p">,</span>
<span class="s2">"NumberOfDelinquentAccounts"</span><span class="p">,</span>
<span class="s2">"InterestRate"</span><span class="p">,</span>
<span class="s2">"LastPaymentAmount"</span><span class="p">,</span>
<span class="s2">"LoanAmount"</span><span class="p">,</span>
<span class="s2">"DerogatoryPublicRecordCount"</span><span class="p">,</span>
<span class="s2">"RevolvingLineUtilizationRate"</span><span class="p">,</span>
<span class="s2">"TotalPaymentsToDate"</span>
<span class="p">])</span>
<span class="n">categorical_variables</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"EmploymentLength"</span><span class="p">,</span>
<span class="s2">"HomeOwnership"</span><span class="p">,</span>
<span class="s2">"LoanPurpose"</span><span class="p">,</span>
<span class="s2">"VerificationStatus"</span><span class="p">,</span>
<span class="s2">"Term"</span><span class="p">]</span>
<span class="k">for</span> <span class="n">column_name</span> <span class="ow">in</span> <span class="n">categorical_variables</span><span class="p">:</span>
<span class="n">X</span><span class="p">[</span><span class="n">column_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">column_name</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s2">"category"</span><span class="p">)</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">return</span> <span class="n">CreditRiskModelOutput</span><span class="p">(</span><span class="n">credit_risk</span><span class="o">=</span><span class="n">CreditRisk</span><span class="p">[</span><span class="n">y_hat</span><span class="p">])</span>
</code></pre></div>
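<p>The loading pattern in the __init__ method (checking the archive's member list before opening the model file) can be exercised with the standard library alone, using an in-memory zip as a stand-in for the real model_files/1.zip archive:</p>

```python
import io
import zipfile

# Build a small in-memory archive standing in for model_files/1.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("model.joblib", b"fake model bytes")

# Mirror the checks done in CreditRiskModel.__init__.
with zipfile.ZipFile(buffer) as zf:
    if "model.joblib" not in zf.namelist():
        raise ValueError("Could not find model file in zip file.")
    with zf.open("model.joblib") as model_file:
        contents = model_file.read()
```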
<p>We can make a prediction with the model by first building a CreditRiskModelInput object:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">CreditRiskModelInput</span><span class="p">(</span>
<span class="n">annual_income</span><span class="o">=</span><span class="mi">273000</span><span class="p">,</span>
<span class="n">collections_in_last_12_months</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
<span class="n">delinquencies_in_last_2_years</span><span class="o">=</span><span class="mi">39</span><span class="p">,</span>
<span class="n">debt_to_income_ratio</span><span class="o">=</span><span class="mf">42.64</span><span class="p">,</span>
<span class="n">employment_length</span><span class="o">=</span><span class="n">EmploymentLength</span><span class="o">.</span><span class="n">less_than_1_year</span><span class="p">,</span>
<span class="n">home_ownership</span><span class="o">=</span><span class="n">HomeOwnership</span><span class="o">.</span><span class="n">MORTGAGE</span><span class="p">,</span>
<span class="n">number_of_delinquent_accounts</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span>
<span class="n">interest_rate</span><span class="o">=</span><span class="mf">28.99</span><span class="p">,</span>
<span class="n">last_payment_amount</span><span class="o">=</span><span class="mf">36475.59</span><span class="p">,</span>
<span class="n">loan_amount</span><span class="o">=</span><span class="mi">35000</span><span class="p">,</span>
<span class="n">derogatory_public_record_count</span><span class="o">=</span><span class="mi">86</span><span class="p">,</span>
<span class="n">loan_purpose</span><span class="o">=</span><span class="n">LoanPurpose</span><span class="o">.</span><span class="n">debt_consolidation</span><span class="p">,</span>
<span class="n">revolving_line_utilization_rate</span><span class="o">=</span><span class="mf">892.3</span><span class="p">,</span>
<span class="n">term</span><span class="o">=</span><span class="n">Term</span><span class="o">.</span><span class="n">thirty_six_months</span><span class="p">,</span>
<span class="n">total_payments_to_date</span><span class="o">=</span><span class="mf">57777.58</span><span class="p">,</span>
<span class="n">verification_status</span><span class="o">=</span><span class="n">VerificationStatus</span><span class="o">.</span><span class="n">source_verified</span>
<span class="p">)</span>
</code></pre></div>
<p>Next, we'll instantiate the model class we defined above:</p>
<div class="highlight"><pre><span></span><code><span class="o">%%</span><span class="n">time</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">CreditRiskModel</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>CPU times: user 372 ms, sys: 29.2 ms, total: 401 ms
Wall time: 113 ms
</code></pre></div>
<p>Notice that the model object took 113 milliseconds to instantiate. This is because the model parameters take up a lot of disk space and take a while to load from disk. This is something that we'll need to deal with later.</p>
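<p>Outside of a notebook, where the <code>%%time</code> magic is not available, the same measurement can be made with the standard library. This is a minimal sketch; the <code>factory</code> argument and the <code>timed_init</code> helper are illustrative, standing in for any model constructor such as <code>CreditRiskModel</code>:</p>

```python
import time


def timed_init(factory):
    """Call factory() and return the instance along with the elapsed wall time in seconds."""
    start = time.perf_counter()
    instance = factory()
    elapsed = time.perf_counter() - start
    return instance, elapsed


# Example with a stand-in for a slow-loading model:
_, elapsed = timed_init(lambda: time.sleep(0.1))
```

<p>Measuring instantiation time this way makes it clear why the startup and readiness checks discussed later matter: the service cannot accept traffic until this time has elapsed.</p>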
<p>We'll use the CreditRiskModelInput instance to make a prediction like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>CreditRiskModelOutput(credit_risk=<CreditRisk.safe: 'safe'>)
</code></pre></div>
<p>The model predicted that the loan is "safe".</p>
<h2>Creating a RESTful Service</h2>
<p>Now that we have a model, we can deploy it in a service that allows clients to make predictions. To do this, we won't need to write any extra code; we can leverage the <a href="https://pypi.org/project/rest-model-service/">rest_model_service package</a> to provide the RESTful API for the service. You can learn more about the package in <a href="https://www.tekhnoal.com/rest-model-service.html">this blog post</a>.</p>
<p>To install the package, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">rest_model_service</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To create a service for our model, all we need to do is add a YAML configuration file to the project. The configuration file looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Credit Risk Model Service</span><span class="w"></span>
<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="s">"Service</span><span class="nv"> </span><span class="s">hosting</span><span class="nv"> </span><span class="s">the</span><span class="nv"> </span><span class="s">Credit</span><span class="nv"> </span><span class="s">Risk</span><span class="nv"> </span><span class="s">Model."</span><span class="w"></span>
<span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="s">"0.1.0"</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">credit_risk_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">credit_risk_model.prediction.model.CreditRiskModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
</code></pre></div>
<p>At the root of the YAML, the "service_title" field is the name of the service as it will appear in the documentation. The "description" and "version" fields will also be used to create the service documentation.</p>
<p>The models field is an array that contains the details of the models we would like to deploy in the service. The "qualified_name" field is the name we gave to the model. The "class_path" field points at the MLModel class that implements the model's prediction logic, in this case it is pointing to the class we built earlier in this blog post. The "create_endpoint" field tells the service to create an endpoint for the model.</p>
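<p>The "class_path" lookup works like any dotted-path import. The rest_model_service package handles this internally, but a minimal sketch of how such dynamic loading can be done with importlib looks like this (the <code>load_class</code> helper is illustrative, and a standard-library class is loaded here in place of the model class):</p>

```python
from importlib import import_module


def load_class(class_path: str):
    """Load a class from a dotted path such as
    'credit_risk_model.prediction.model.CreditRiskModel'."""
    module_path, _, class_name = class_path.rpartition(".")
    module = import_module(module_path)
    return getattr(module, class_name)


# Demonstrated with a standard-library class:
decoder_class = load_class("json.JSONDecoder")
```

<p>This is why the model package must be importable from the service's working directory, which is also why PYTHONPATH is set before running the commands below.</p>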
<p>Using the configuration file, we can create an OpenAPI specification file for the model service by executing these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
generate_openapi --configuration_file<span class="o">=</span>./configuration/rest_configuration.yaml --output_file<span class="o">=</span><span class="s2">"service_contract.yaml"</span>
</code></pre></div>
<p>The generated service_contract.yaml file contains the OpenAPI specification for the model service, including a description of the model's endpoint. The model's input and output schemas are automatically extracted and added to the specification. The file can be found at the root of the GitHub repository, named service_contract.yaml.</p>
<p>To run the service locally, execute these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>./configuration/rest_configuration.yaml
uvicorn rest_model_service.main:app
</code></pre></div>
<p>The service process starts up and can be accessed in a web browser at http://127.0.0.1:8000. The service renders the OpenAPI specification as a webpage that looks like this:</p>
<p><img alt="Service Documentation" src="https://www.tekhnoal.com/service_documentation_hcfmlm.png" width="100%"></p>
<p>By using the MLModel base class provided by the ml_base package and the REST service framework provided by the rest_model_service package we're able to quickly stand up a service to host the model.</p>
<p>We can make a prediction using the model running in the service with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:8000/api/models/credit_risk_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">annual_income</span><span class="se">\"</span><span class="s2">: 273000, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">collections_in_last_12_months</span><span class="se">\"</span><span class="s2">: 20, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">delinquencies_in_last_2_years</span><span class="se">\"</span><span class="s2">: 39, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">debt_to_income_ratio</span><span class="se">\"</span><span class="s2">: 42.64, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">employment_length</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">< 1 year</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">home_ownership</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">MORTGAGE</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">number_of_delinquent_accounts</span><span class="se">\"</span><span class="s2">: 6, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">interest_rate</span><span class="se">\"</span><span class="s2">: 28.99, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">last_payment_amount</span><span class="se">\"</span><span class="s2">: 36475.59, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">loan_amount</span><span class="se">\"</span><span class="s2">: 35000, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">derogatory_public_record_count</span><span class="se">\"</span><span class="s2">: 86, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">loan_purpose</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">debt_consolidation</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">revolving_line_utilization_rate</span><span class="se">\"</span><span class="s2">: 892.3, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">term</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2"> 36 months</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">total_payments_to_date</span><span class="se">\"</span><span class="s2">: 57777.58, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">verification_status</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">Source Verified</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2">}"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"credit_risk":"safe"}
</code></pre></div>
<p>The model returned a prediction of "safe" for the loan.</p>
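<p>The same request can be made from Python instead of curl. This is a sketch using only the standard library; the <code>predict</code> helper is illustrative, the payload is abbreviated (any of the fields from the curl example above can be included), and the service is assumed to be running on localhost:8000:</p>

```python
import json
from urllib.request import Request, urlopen


def predict(base_url: str, model_input: dict) -> dict:
    """POST a JSON payload to the model's prediction endpoint and parse the JSON response."""
    request = Request(
        url=base_url + "/api/models/credit_risk_model/prediction",
        data=json.dumps(model_input).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

<p>A client like this is useful for smoke-testing the deployment from a script rather than from the command line.</p>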
<h2>Understanding Health Checks</h2>
<p>The service exposes health information about itself through health endpoints. The health endpoints are:</p>
<ul>
<li>/api/health: indicates whether the service process is running. This endpoint will return a 200 status once the service has started.</li>
<li>/api/health/ready: indicates whether the service is ready to respond to requests. This endpoint will return a 200 status only if all the models and decorators have finished being instantiated without errors.</li>
<li>/api/health/startup: indicates whether the service is started. This endpoint will return a 200 status only if all the models and decorators have finished being instantiated without errors.</li>
</ul>
<p>These endpoints are important for our use case because our model takes a while to load and become ready to serve predictions over the API. The service will not be ready to serve traffic for a while, so the readiness and startup checks will fail until the models are ready.</p>
<p>The service is running so we'll try out each endpoint with a request:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'GET'</span> \
<span class="s1">'http://127.0.0.1:8000/api/health'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"health_status":"HEALTHY"}
</code></pre></div>
<p>The health endpoint returned a status of "HEALTHY". </p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'GET'</span> \
<span class="s1">'http://127.0.0.1:8000/api/health/ready'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"readiness_status":"ACCEPTING_TRAFFIC"}
</code></pre></div>
<p>The readiness status endpoint returned a status of "ACCEPTING_TRAFFIC".</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'GET'</span> \
<span class="s1">'http://127.0.0.1:8000/api/health/startup'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"startup_status":"STARTED"}
</code></pre></div>
<p>The startup status endpoint returned a status of "STARTED".</p>
<p>During normal operation, the health endpoints are not very interesting. However, in special situations they are very useful. For example, if a model takes a long time to start up, the startup check endpoint will not return a 200 status response until each model is initialized and ready to make predictions. The readiness endpoint will also not return a 200 status until the model is ready. We'll use the health check endpoints to integrate with Kubernetes.</p>
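<p>For example, a deployment script could block until the service reports readiness before sending it any traffic. This is a minimal sketch of such a poller; the <code>wait_until_ready</code> helper is illustrative, and the <code>_get</code> parameter exists only to make the function testable (by default it performs a real HTTP GET):</p>

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def wait_until_ready(url, timeout=60.0, interval=1.0, _get=None):
    """Poll a health endpoint until it returns HTTP 200 or the timeout expires."""
    get = _get or (lambda u: urlopen(u).status)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if get(url) == 200:
                return True
        except URLError:
            pass  # service not listening yet; keep polling
        time.sleep(interval)
    return False
```

<p>Kubernetes readiness and startup probes implement the same idea natively, which is what makes these endpoints so useful when deploying to a cluster.</p>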
<h2>Creating a Docker Image</h2>
<p>Now that we have a working model and model service, we'll need to deploy it somewhere. We'll start by deploying the service locally using Docker.</p>
<p>Let's create a Docker image and run it locally. The image is built using the instructions in the Dockerfile:</p>
<div class="highlight"><pre><span></span><code><span class="c"># syntax=docker/dockerfile:1</span>
<span class="k">FROM</span><span class="w"> </span><span class="s">python:3.9-slim</span>
<span class="k">ARG</span><span class="w"> </span>BUILD_DATE
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.title<span class="o">=</span><span class="s2">"Health Checks for ML Models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.description<span class="o">=</span><span class="s2">"Health checks for ML models."</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.created<span class="o">=</span><span class="nv">$BUILD_DATE</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.authors<span class="o">=</span><span class="s2">"6666331+schmidtbri@users.noreply.github.com"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.source<span class="o">=</span><span class="s2">"https://github.com/schmidtbri/health-checks-for-ml-models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.version<span class="o">=</span><span class="s2">"0.1.0"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.licenses<span class="o">=</span><span class="s2">"MIT License"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.base.name<span class="o">=</span><span class="s2">"python:3.9-slim"</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">/service</span>
<span class="k">ARG</span><span class="w"> </span><span class="nv">USERNAME</span><span class="o">=</span>service-user
<span class="k">ARG</span><span class="w"> </span><span class="nv">USER_UID</span><span class="o">=</span><span class="m">10000</span>
<span class="k">ARG</span><span class="w"> </span><span class="nv">USER_GID</span><span class="o">=</span><span class="m">10000</span>
<span class="c"># install packages</span>
<span class="k">RUN</span><span class="w"> </span>apt-get update <span class="se">\</span>
<span class="o">&&</span> apt-get install --assume-yes --no-install-recommends sudo <span class="se">\</span>
<span class="o">&&</span> apt-get install --assume-yes --no-install-recommends git <span class="se">\</span>
<span class="o">&&</span> apt-get install -y --no-install-recommends apt-utils <span class="se">\</span>
<span class="o">&&</span> apt-get install -y --no-install-recommends libgomp1 <span class="se">\</span>
<span class="o">&&</span> apt-get clean <span class="se">\</span>
<span class="o">&&</span> rm -rf /var/lib/apt/lists/*
<span class="c"># create a user</span>
<span class="k">RUN</span><span class="w"> </span>groupadd --gid <span class="nv">$USER_GID</span> <span class="nv">$USERNAME</span> <span class="se">\</span>
<span class="o">&&</span> useradd --uid <span class="nv">$USER_UID</span> --gid <span class="nv">$USER_GID</span> -m <span class="nv">$USERNAME</span> <span class="se">\</span>
<span class="o">&&</span> <span class="nb">echo</span> <span class="nv">$USERNAME</span> <span class="nv">ALL</span><span class="o">=</span><span class="se">\(</span>root<span class="se">\)</span> NOPASSWD:ALL > /etc/sudoers.d/<span class="nv">$USERNAME</span> <span class="se">\</span>
<span class="o">&&</span> chmod <span class="m">0440</span> /etc/sudoers.d/<span class="nv">$USERNAME</span>
<span class="c"># installing dependencies</span>
<span class="k">COPY</span><span class="w"> </span>./service_requirements.txt ./service_requirements.txt
<span class="k">RUN</span><span class="w"> </span>pip install -r service_requirements.txt
<span class="c"># copying code and license</span>
<span class="k">COPY</span><span class="w"> </span>./credit_risk_model ./credit_risk_model
<span class="k">COPY</span><span class="w"> </span>./LICENSE ./LICENSE
<span class="k">USER</span><span class="w"> </span><span class="s">$USERNAME</span>
<span class="k">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">"uvicorn"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rest_model_service.main:app"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--host"</span><span class="p">,</span><span class="w"> </span><span class="s2">"0.0.0.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--port"</span><span class="p">,</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">]</span>
</code></pre></div>
<p>The Dockerfile is used by this docker command to create a docker image:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">build</span> <span class="o">-</span><span class="n">t</span> <span class="n">credit_risk_model_service</span><span class="p">:</span><span class="mf">0.1.0</span> <span class="o">../</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make sure everything worked as expected, we'll look through the docker images in our system:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">credit_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>credit_risk_model_service 0.1.0 fc3c3c747e2b 4 seconds ago 614MB
</code></pre></div>
<p>The credit_risk_model_service image is listed. Next, we'll run a container from the image to see if the service is working correctly.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">8000</span><span class="p">:</span><span class="mi">8000</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">REST_CONFIG</span><span class="o">=./</span><span class="n">configuration</span><span class="o">/</span><span class="n">rest_configuration</span><span class="o">.</span><span class="n">yaml</span> \
<span class="o">-</span><span class="n">v</span> <span class="err">$</span><span class="p">(</span><span class="n">pwd</span><span class="p">)</span><span class="o">/../</span><span class="n">configuration</span><span class="p">:</span><span class="o">/</span><span class="n">service</span><span class="o">/</span><span class="n">configuration</span> \
<span class="o">--</span><span class="n">name</span> <span class="n">credit_risk_model_service</span> \
<span class="n">credit_risk_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">706</span><span class="n">ff17d8db159568989d8a74221b8bc3bbcb52074ca61da1ee4015297035dc6</span><span class="w"></span>
</code></pre></div>
<p>To make sure the server process started up correctly, we'll look at the logs:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">logs</span> <span class="n">credit_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">server</span><span class="w"> </span><span class="n">process</span><span class="w"> </span><span class="o">[</span><span class="mi">1</span><span class="o">]</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Waiting</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">application</span><span class="w"> </span><span class="n">startup</span><span class="o">.</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Application</span><span class="w"> </span><span class="n">startup</span><span class="w"> </span><span class="n">complete</span><span class="o">.</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Uvicorn</span><span class="w"> </span><span class="n">running</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">http</span><span class="o">://</span><span class="mf">0.0</span><span class="o">.</span><span class="mf">0.0</span><span class="o">:</span><span class="mi">8000</span><span class="w"> </span><span class="o">(</span><span class="n">Press</span><span class="w"> </span><span class="n">CTRL</span><span class="o">+</span><span class="n">C</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">quit</span><span class="o">)</span><span class="w"></span>
</code></pre></div>
<p>The logs look good and the service is up and running.</p>
<p>The service should be accessible on port 8000 of localhost, so we'll try to make a prediction using the curl command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:8000/api/models/credit_risk_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">annual_income</span><span class="se">\"</span><span class="s2">: 273000, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">collections_in_last_12_months</span><span class="se">\"</span><span class="s2">: 20, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">delinquencies_in_last_2_years</span><span class="se">\"</span><span class="s2">: 39, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">debt_to_income_ratio</span><span class="se">\"</span><span class="s2">: 42.64, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">employment_length</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">< 1 year</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">home_ownership</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">MORTGAGE</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">number_of_delinquent_accounts</span><span class="se">\"</span><span class="s2">: 6, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">interest_rate</span><span class="se">\"</span><span class="s2">: 28.99, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">last_payment_amount</span><span class="se">\"</span><span class="s2">: 36475.59, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">loan_amount</span><span class="se">\"</span><span class="s2">: 35000, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">derogatory_public_record_count</span><span class="se">\"</span><span class="s2">: 86, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">loan_purpose</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">debt_consolidation</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">revolving_line_utilization_rate</span><span class="se">\"</span><span class="s2">: 892.3, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">term</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2"> 36 months</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">total_payments_to_date</span><span class="se">\"</span><span class="s2">: 57777.58, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">verification_status</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">Source Verified</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2">}"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"credit_risk":"safe"}
</code></pre></div>
<p>The model predicted that the loan is safe.</p>
<p>We're done testing, so we'll shut down and remove the model service container.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">credit_risk_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">credit_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>credit_risk_model_service
credit_risk_model_service
</code></pre></div>
<h2>Creating a Kubernetes Cluster</h2>
<p>To show the system in action, we’ll deploy the service to a Kubernetes cluster. A local cluster can be easily started by using <a href="https://minikube.sigs.k8s.io/docs/">minikube</a>. Installation instructions can be found <a href="https://minikube.sigs.k8s.io/docs/start/">here</a>.</p>
<p>To start the minikube cluster execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">start</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>😄 <span class="nv">minikube</span> <span class="nv">v1</span>.<span class="mi">28</span>.<span class="mi">0</span> <span class="nv">on</span> <span class="nv">Darwin</span> <span class="mi">13</span>.<span class="mi">0</span>.<span class="mi">1</span>
✨ <span class="nv">Using</span> <span class="nv">the</span> <span class="nv">docker</span> <span class="nv">driver</span> <span class="nv">based</span> <span class="nv">on</span> <span class="nv">existing</span> <span class="nv">profile</span>
👍 <span class="nv">Starting</span> <span class="nv">control</span> <span class="nv">plane</span> <span class="nv">node</span> <span class="nv">minikube</span> <span class="nv">in</span> <span class="nv">cluster</span> <span class="nv">minikube</span>
🚜 <span class="nv">Pulling</span> <span class="nv">base</span> <span class="nv">image</span> ...
🔄 <span class="nv">Restarting</span> <span class="nv">existing</span> <span class="nv">docker</span> <span class="nv">container</span> <span class="k">for</span> <span class="s2">"</span><span class="s">minikube</span><span class="s2">"</span> ...
🐳 <span class="nv">Preparing</span> <span class="nv">Kubernetes</span> <span class="nv">v1</span>.<span class="mi">25</span>.<span class="mi">3</span> <span class="nv">on</span> <span class="nv">Docker</span> <span class="mi">20</span>.<span class="mi">10</span>.<span class="mi">20</span> ...
🔎 <span class="nv">Verifying</span> <span class="nv">Kubernetes</span> <span class="nv">components</span>...
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">docker</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">kubernetesui</span><span class="o">/</span><span class="nv">dashboard</span>:<span class="nv">v2</span>.<span class="mi">7</span>.<span class="mi">0</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">gcr</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">k8s</span><span class="o">-</span><span class="nv">minikube</span><span class="o">/</span><span class="nv">storage</span><span class="o">-</span><span class="nv">provisioner</span>:<span class="nv">v5</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">docker</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">kubernetesui</span><span class="o">/</span><span class="nv">metrics</span><span class="o">-</span><span class="nv">scraper</span>:<span class="nv">v1</span>.<span class="mi">0</span>.<span class="mi">8</span>
💡 <span class="nv">Some</span> <span class="nv">dashboard</span> <span class="nv">features</span> <span class="nv">require</span> <span class="nv">the</span> <span class="nv">metrics</span><span class="o">-</span><span class="nv">server</span> <span class="nv">addon</span>. <span class="nv">To</span> <span class="nv">enable</span> <span class="nv">all</span> <span class="nv">features</span> <span class="nv">please</span> <span class="nv">run</span>:
<span class="nv">minikube</span> <span class="nv">addons</span> <span class="nv">enable</span> <span class="nv">metrics</span><span class="o">-</span><span class="nv">server</span>
🌟 <span class="nv">Enabled</span> <span class="nv">addons</span>: <span class="nv">storage</span><span class="o">-</span><span class="nv">provisioner</span>, <span class="nv">default</span><span class="o">-</span><span class="nv">storageclass</span>, <span class="nv">dashboard</span>
🏄 <span class="nv">Done</span><span class="o">!</span> <span class="nv">kubectl</span> <span class="nv">is</span> <span class="nv">now</span> <span class="nv">configured</span> <span class="nv">to</span> <span class="nv">use</span> <span class="s2">"</span><span class="s">minikube</span><span class="s2">"</span> <span class="nv">cluster</span> <span class="nv">and</span> <span class="s2">"</span><span class="s">default</span><span class="s2">"</span> <span class="nv">namespace</span> <span class="nv">by</span> <span class="nv">default</span>
</code></pre></div>
<p>Let's view all of the pods running in the minikube cluster to make sure we can connect.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">-</span><span class="n">A</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAMESPACE              NAME                                        READY   STATUS    RESTARTS         AGE
kube-system            coredns-565d847f94-2v6l9                    0/1     Running   10 (3d20h ago)   11d
kube-system            etcd-minikube                               0/1     Running   10 (3d20h ago)   11d
kube-system            kube-apiserver-minikube                     0/1     Running   10 (3d20h ago)   11d
kube-system            kube-controller-manager-minikube            0/1     Running   10 (25s ago)     11d
kube-system            kube-proxy-ztbgd                            1/1     Running   10 (25s ago)     11d
kube-system            storage-provisioner                         1/1     Running   18 (25s ago)     11d
kubernetes-dashboard   dashboard-metrics-scraper-b74747df5-x559p   1/1     Running   9 (25s ago)      11d
kubernetes-dashboard   kubernetes-dashboard-57bbdc5f89-9jvln       1/1     Running   14 (3d20h ago)   11d
</code></pre></div>
<p>The pods running the Kubernetes dashboard and other cluster services appear in the kube-system and kubernetes-dashboard namespaces.</p>
<h2>Creating a Kubernetes Namespace</h2>
<p>Now that we have a cluster and are connected to it, we'll create a namespace to hold the resources for our model deployment. The resource definition is in the kubernetes/namespace.yaml file. To apply the manifest to the cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace/model-services created
resourcequota/model-services-resource-quota created
</code></pre></div>
<p>The namespace was created, along with a ResourceQuota that limits the amount of resources that objects within the namespace can consume.</p>
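<p>A namespace manifest paired with a ResourceQuota generally looks something like this. This is an illustrative sketch, not the repository's actual kubernetes/namespace.yaml; the resource names match the kubectl output above, but the quota limits are assumptions:</p>

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: model-services
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: model-services-resource-quota
  namespace: model-services
spec:
  hard:
    requests.cpu: "2"      # total CPU all pods in the namespace may request
    requests.memory: 4Gi   # total memory all pods in the namespace may request
    limits.cpu: "4"
    limits.memory: 8Gi
```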
<p>To take a look at the namespaces, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">namespace</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                   STATUS   AGE
default                Active   11d
kube-node-lease        Active   11d
kube-public            Active   11d
kube-system            Active   11d
kubernetes-dashboard   Active   11d
model-services         Active   0s
</code></pre></div>
<p>The new namespace appears in the listing along with other namespaces created by default by the system. To use the new namespace for the rest of the operations, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">config</span> <span class="nb">set</span><span class="o">-</span><span class="n">context</span> <span class="o">--</span><span class="n">current</span> <span class="o">--</span><span class="n">namespace</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Context "minikube" modified.
</code></pre></div>
<h2>Creating a Kubernetes Deployment and Service</h2>
<p>The model service is deployed by using Kubernetes resources. These are:</p>
<ul>
<li>Model Service ConfigMap: a set of configuration options; in this case, a simple YAML file that is loaded into the running container as a volume mount. This resource allows us to change the configuration of the model service without having to rebuild the Docker image. The mounted configuration file overrides the configuration files that were included with the Docker image.</li>
<li>Deployment: a declarative way to manage a set of pods; the model service pods are managed through the Deployment. This Deployment includes the model service as well as the OPA service running as a sidecar container.</li>
<li>Service: a way to expose the set of pods in a Deployment; the model service is made available to the outside world through the Service.</li>
</ul>
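<p>To make the ConfigMap idea concrete, a minimal manifest might look like this. This is illustrative only; the ConfigMap name matches the one created later in this post, but the data key and its contents are assumptions, not the repository's actual file:</p>

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-service-configuration
  namespace: model-services
data:
  # Each key under "data" becomes a file when the ConfigMap is
  # mounted into the container as a volume.
  service_config.yaml: |
    service_title: Credit Risk Model Service
    logging:
      level: INFO
```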
<p>The Deployment resource will be created with some special options that can leverage the health endpoints of the model service. These options look like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">livenessProbe</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">httpGet</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">scheme</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HTTP</span><span class="w"></span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/api/health</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8000</span><span class="w"></span>
<span class="w"> </span><span class="nt">initialDelaySeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">0</span><span class="w"></span>
<span class="w"> </span><span class="nt">periodSeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="nt">timeoutSeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span><span class="w"></span>
<span class="w"> </span><span class="nt">failureThreshold</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="nt">successThreshold</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span><span class="w"></span>
<span class="nt">readinessProbe</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">httpGet</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">scheme</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HTTP</span><span class="w"></span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/api/health/ready</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8000</span><span class="w"></span>
<span class="w"> </span><span class="nt">initialDelaySeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">0</span><span class="w"></span>
<span class="w"> </span><span class="nt">periodSeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="nt">timeoutSeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span><span class="w"></span>
<span class="w"> </span><span class="nt">failureThreshold</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="nt">successThreshold</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span><span class="w"></span>
<span class="nt">startupProbe</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">httpGet</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">scheme</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HTTP</span><span class="w"></span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/api/health/startup</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8000</span><span class="w"></span>
<span class="w"> </span><span class="nt">initialDelaySeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">0</span><span class="w"></span>
<span class="w"> </span><span class="nt">periodSeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="nt">timeoutSeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span><span class="w"></span>
<span class="w"> </span><span class="nt">failureThreshold</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="nt">successThreshold</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span><span class="w"></span>
</code></pre></div>
<p>This is not the complete YAML file; the full Deployment is defined in the ./kubernetes/model_service.yaml file.</p>
<p>The model service container has options defined for each type of health check, and each check is configured the same way. The options are:</p>
<ul>
<li>initialDelaySeconds: how long Kubernetes waits after the container starts before making the first health check call.</li>
<li>periodSeconds: how often the health check endpoint is called.</li>
<li>timeoutSeconds: how long Kubernetes waits for a response from the service before counting the check as failed.</li>
<li>failureThreshold: how many consecutive failures are required before Kubernetes acts on them; for the liveness and startup probes this means restarting the container, while for the readiness probe the pod is removed from service.</li>
<li>successThreshold: how many consecutive successes are required before Kubernetes considers the check to be passing.</li>
</ul>
<p>We decided to have Kubernetes check 5 times before labelling the container as unhealthy, with a period of 5 seconds. This means that the service has 25 seconds for the model to finish loading. This is a value we know would work with this model, based on our timing measurements above.</p>
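<p>As a quick sanity check, the startup window implied by the probe settings can be computed directly. A small sketch, using the values from the startupProbe YAML above:</p>

```python
def max_startup_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Worst-case time the startup probe allows before the container is
    marked unhealthy: the initial delay plus one probe period per
    allowed failure."""
    return initial_delay + period * failure_threshold


# Values from the startupProbe configuration above.
print(max_startup_seconds(initial_delay=0, period=5, failure_threshold=5))  # 25
```

If the model ever takes longer than this to load, raising failureThreshold or periodSeconds extends the window without changing the service code.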
<p>We're almost ready to start the model service, but first we need to load the Docker image from the local Docker daemon into the minikube image cache:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">load</span> <span class="n">credit_risk_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<p>We can view the images in the minikube cache with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">credit_risk_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>docker.io/library/credit_risk_model_service:0.1.0
</code></pre></div>
<p>The model service resources are created within the Kubernetes cluster with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap/model-service-configuration created
deployment.apps/credit-risk-model-deployment created
service/credit-risk-model-service created
</code></pre></div>
<p>Let's view the Deployment to see if it is available yet:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">deployments</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
credit-risk-model-deployment   0/2     2            0           2s
</code></pre></div>
<p>Looks like the replicas are not ready yet. Let's wait a bit and try again:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">deployments</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
credit-risk-model-deployment   2/2     2            2           23s
</code></pre></div>
<p>Once the model service finished loading the model, the startup and readiness probes began to pass, which made the replicas available to serve traffic.</p>
<p>To get an idea of how the service went through the startup process, let's look at the service logs. First, we'll get the names of the pods that are running the service:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                                            READY   STATUS    RESTARTS   AGE
credit-risk-model-deployment-55654498f4-2bw9k   1/1     Running   0          29s
credit-risk-model-deployment-55654498f4-rxznw   1/1     Running   0          29s
</code></pre></div>
<p>Using one of the pod names, we'll get the logs from Kubernetes:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">logs</span> <span class="n">credit</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">deployment</span><span class="o">-</span><span class="mi">55654498</span><span class="n">f4</span><span class="o">-</span><span class="mi">2</span><span class="n">bw9k</span> <span class="o">-</span><span class="n">c</span> <span class="n">credit</span><span class="o">-</span><span class="n">risk</span><span class="o">-</span><span class="n">model</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Started</span><span class="w"> </span><span class="n">server</span><span class="w"> </span><span class="n">process</span><span class="w"> </span><span class="o">[</span><span class="mi">1</span><span class="o">]</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Waiting</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">application</span><span class="w"> </span><span class="n">startup</span><span class="o">.</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Application</span><span class="w"> </span><span class="n">startup</span><span class="w"> </span><span class="n">complete</span><span class="o">.</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="n">Uvicorn</span><span class="w"> </span><span class="n">running</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">http</span><span class="o">://</span><span class="mf">0.0</span><span class="o">.</span><span class="mf">0.0</span><span class="o">:</span><span class="mi">8000</span><span class="w"> </span><span class="o">(</span><span class="n">Press</span><span class="w"> </span><span class="n">CTRL</span><span class="o">+</span><span class="n">C</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">quit</span><span class="o">)</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">57232</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/startup HTTP/1.1"</span><span class="w"> </span><span class="mi">503</span><span class="w"> </span><span class="n">Service</span><span class="w"> </span><span class="n">Unavailable</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">49828</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/startup HTTP/1.1"</span><span class="w"> </span><span class="mi">503</span><span class="w"> </span><span class="n">Service</span><span class="w"> </span><span class="n">Unavailable</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">49844</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/startup HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">49858</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/ready HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">40210</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/ready HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">40212</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">40224</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/ready HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">40236</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">58908</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/ready HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">58910</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">58926</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health/ready HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
<span class="n">INFO</span><span class="o">:</span><span class="w"> </span><span class="mf">172.17</span><span class="o">.</span><span class="mf">0.1</span><span class="o">:</span><span class="mi">58942</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"GET /api/health HTTP/1.1"</span><span class="w"> </span><span class="mi">200</span><span class="w"> </span><span class="n">OK</span><span class="w"></span>
</code></pre></div>
<p>Looks like the process started up correctly: the /api/health/startup endpoint was called three times, succeeding on the third request. Right after the startup check succeeded, the /api/health/ready endpoint was called and immediately succeeded, followed by the /api/health endpoint, which also succeeded. This startup process ensured that clients could not reach the model service before it finished loading the model into memory.</p>
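<p>The behavior visible in the logs (503 until the model finishes loading, then 200) amounts to a simple readiness gate. A minimal sketch of the idea; the actual service is a FastAPI application, and the class and method names here are hypothetical:</p>

```python
class HealthState:
    """Minimal readiness gate mirroring the probe behavior in the logs:
    the health endpoints return 503 until the model is loaded."""

    def __init__(self) -> None:
        self.model_loaded = False

    def startup(self) -> int:
        # Startup probe: 200 only once the model is in memory.
        return 200 if self.model_loaded else 503

    def ready(self) -> int:
        # Readiness probe: the same condition in this simple sketch.
        return 200 if self.model_loaded else 503


state = HealthState()
print(state.startup())   # 503 while the model is still loading
state.model_loaded = True
print(state.startup())   # 200 once loading has finished
```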
<p>To access the model service, we created a Kubernetes service. The service details look like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                        TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
credit-risk-model-service   NodePort   10.110.134.65   &lt;none&gt;        80:31004/TCP   68s
</code></pre></div>
<p>Minikube can expose the service on a local port, but we need to run a proxy process. The proxy is started like this:</p>
<div class="highlight"><pre><span></span><code>minikube service credit-risk-model-service --url -n model-services
</code></pre></div>
<p>The command outputs this URL:</p>
<p>http://127.0.0.1:59091</p>
<p>The command must keep running to keep the tunnel open to the running model service in the minikube cluster.</p>
<p>We can send a request to the model service through the local endpoint like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:59091/api/models/credit_risk_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">annual_income</span><span class="se">\"</span><span class="s2">: 273000, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">collections_in_last_12_months</span><span class="se">\"</span><span class="s2">: 20, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">delinquencies_in_last_2_years</span><span class="se">\"</span><span class="s2">: 39, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">debt_to_income_ratio</span><span class="se">\"</span><span class="s2">: 42.64, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">employment_length</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">< 1 year</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">home_ownership</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">MORTGAGE</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">number_of_delinquent_accounts</span><span class="se">\"</span><span class="s2">: 6, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">interest_rate</span><span class="se">\"</span><span class="s2">: 28.99, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">last_payment_amount</span><span class="se">\"</span><span class="s2">: 36475.59, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">loan_amount</span><span class="se">\"</span><span class="s2">: 35000, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">derogatory_public_record_count</span><span class="se">\"</span><span class="s2">: 86, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">loan_purpose</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">debt_consolidation</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">revolving_line_utilization_rate</span><span class="se">\"</span><span class="s2">: 892.3, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">term</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2"> 36 months</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">total_payments_to_date</span><span class="se">\"</span><span class="s2">: 57777.58, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">verification_status</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">Source Verified</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2">}"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"credit_risk":"safe"}
</code></pre></div>
<p>The health check endpoints of the model service are also available to clients:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'GET'</span> \
<span class="s1">'http://127.0.0.1:59091/api/health'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"health_status":"HEALTHY"}
</code></pre></div>
<h2>Deleting the Resources</h2>
<p>We're done working with the Kubernetes resources, so we will delete them and shut down the cluster.</p>
<p>To delete the model service pods, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap "model-service-configuration" deleted
deployment.apps "credit-risk-model-deployment" deleted
service "credit-risk-model-service" deleted
</code></pre></div>
<p>To delete the model-services namespace, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace "model-services" deleted
resourcequota "model-services-resource-quota" deleted
</code></pre></div>
<p>To shut down the Kubernetes cluster:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">stop</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>✋ Stopping node "minikube" ...
🛑 Powering off "minikube" via SSH ...
🛑 1 node stopped.
</code></pre></div>
<h2>Closing</h2>
<p>In this blog post we showed how to deal with a common issue that arises when a large model is deployed. When the model parameters take a long time to load, the model service needs to make sure that no clients are depending on it to provide predictions until it is finished starting up. We accomplished this on the Kubernetes platform by adding health check endpoints to the model service and configuring Kubernetes to check on the endpoints. By doing this we are able to guarantee that the model service will only become available to clients once it has finished starting up. </p>
<p>In order to build the health checks, the model did not need to change at all. We were able to build the logic into the model service package, which means that the model prediction logic did not have to change to deal with this requirement. We were able to isolate a deployment concern from the model that we were trying to deploy. This also means that we can reuse the model to make predictions in other contexts without carrying this extra logic along with it.</p>Policies for ML Models2022-09-21T22:00:00-05:002022-09-21T22:00:00-05:00Brian Schmidttag:www.tekhnoal.com,2022-09-21:/policies-for-ml-models.html<p>Machine learning models are being used to make ever more important decisions in the modern world. Because of the power of data modeling, ML models are able to learn the nuances of a domain and make accurate predictions even in situations where a human expert would not be able to. However, ML models are not omniscient and they should not run without oversight from their operators. To handle situations in which we don't want to have an ML model make predictions, we can create a policy that steps in before the prediction is returned to the user. A policy that is applied to an ML model is simply a rule that ensures that the model will never make predictions that are unsafe to use. For example, we can create a policy that makes sure that a machine learning model that makes predictions about optimal airline ticket prices never makes predictions that cost the airline money. A good policy for an ML model is one that allows the model some leeway while also ensuring that the model’s predictions are safe to use. In this blog post, we'll write policies for ML models and deploy the policies alongside the model using the decorator pattern.</p><h1>Policies for ML Model Deployments</h1>
<p>In a <a href="https://www.tekhnoal.com/ml-model-decorators.html">previous blog post</a> we introduced the decorator pattern for ML model deployments and then showed how to use the pattern to build extensions for a deployed model. For example, in <a href="https://www.tekhnoal.com/data-enrichment-for-ml-models.html">this blog post</a> we added data enrichment to a deployed model. In <a href="https://www.tekhnoal.com/caching-for-ml-models.html">this blog post</a> we added prediction caching to a deployed model. These extensions were added without having to modify the machine learning model prediction code at all, we were able to do it using the <a href="https://en.wikipedia.org/wiki/Decorator_pattern">decorator pattern</a>. In this blog post we’ll add policies to a deployed model in the same way.</p>
<p>This blog post was written in a Jupyter notebook; some of the code and commands in it reflect this.</p>
<h2>Introduction</h2>
<p>Machine learning models are being used to make ever more important decisions in the modern world. Because of the power of data modeling, ML models are able to learn the nuances of a domain and make accurate predictions even in situations where a human expert would not be able to. However, ML models are not omniscient and they should not run without oversight from their operators. To handle situations in which we don't want to have an ML model make predictions, we can create a policy that steps in before the prediction is returned to the user. A policy that is applied to an ML model is simply a rule that ensures that the model will never make predictions that are unsafe to use. For example, we can create a policy that makes sure that a machine learning model that makes predictions about optimal airline ticket prices never makes predictions that cost the airline money. A good policy for an ML model is one that allows the model some leeway while also ensuring that the model’s predictions are safe to use. In this blog post, we'll write policies for ML models and deploy the policies alongside the model using the decorator pattern.</p>
<p>A policy is a system of guidelines that are used to make decisions. A software-defined policy is simply a policy that is written as code and can be executed. Most of the time, the policies followed by a software system are hard-coded into the system using whichever programming language the system is written in. This is often good enough, but sometimes the policies are complex enough or change often enough to warrant writing them in a specialized language that is specifically designed for policies. By writing policies separately from the system that they will work in, we can decouple them from the system and make the system simpler to maintain. Policies can also be written by domain experts and integrated into the software system more easily this way.</p>
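<p>To make the contrast concrete, here is what a hard-coded policy might look like for the airline ticket pricing example above (a hypothetical Python sketch, not code from any real system):</p>

```python
def apply_pricing_policy(predicted_price: float, ticket_cost: float) -> float:
    """Hard-coded policy: never return a ticket price below what the ticket
    costs the airline, regardless of what the model predicted."""
    if predicted_price < ticket_cost:
        return ticket_cost
    return predicted_price


# The model predicted $85.00, but the ticket costs the airline $120.00,
# so the policy overrides the prediction.
print(apply_pricing_policy(85.0, 120.0))  # 120.0
```

<p>Because the rule lives inside the application code, changing it means changing and redeploying the system, which is exactly the coupling that a dedicated policy language avoids.</p>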
<p>In this blog post we'll write policies for a deployed machine learning model, and we'll use the <a href="https://www.openpolicyagent.org/docs/latest/policy-language/">Rego policy language</a>. Policy decisions are made by querying policies written in Rego that are executed by the <a href="https://www.openpolicyagent.org/">Open Policy Agent</a>, which is a service that can be integrated into software systems. Other services can offload policy management and execution to the OPA service, accessing it through a RESTful API. The OPA service is specifically built for low-latency evaluation of policies. Rego and OPA are already used to review <a href="https://www.openpolicyagent.org/docs/latest/kubernetes-introduction/">Kubernetes manifests</a> for best practices, to review infrastructure deployments by <a href="https://www.openpolicyagent.org/docs/latest/terraform/">checking Terraform plans</a>, and to check for authorization within the <a href="https://www.openpolicyagent.org/docs/latest/envoy-introduction/">Envoy service mesh</a>. </p>
<p>In this blog post we’ll also build a decorator that applies policies to the input and output of a model by using the OPA service. By using the decorator pattern that we’ve shown in previous blog posts, we’ll be able to show how to integrate policies separately from the model itself. We'll show how to deploy the ML model inside of a RESTful service along with the decorator, all by modifying a simple configuration file.</p>
<h2>Software Architecture</h2>
<p>The system we'll build will ultimately look like this:</p>
<p><img alt="Software Architecture" src="https://www.tekhnoal.com/software_architecture_pfmlm.png" width="100%"></p>
<h2>Installing a Model</h2>
<p>To make this blog post a little shorter we won't train a completely new model. Instead we'll install a model that <a href="https://www.tekhnoal.com/regression-model.html">we've built in a previous blog post</a>. The code for the model is in <a href="https://github.com/schmidtbri/regression-model">this github repository</a>.</p>
<p>The model is called the "Insurance Charges Model" and predicts medical insurance charges based on features of a customer. To install the model, we can use the pip command and point it at the GitHub repo of the model.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">e</span> <span class="n">git</span><span class="o">+</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">schmidtbri</span><span class="o">/</span><span class="n">regression</span><span class="o">-</span><span class="n">model</span><span class="c1">#egg=insurance_charges_model</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make a prediction with the model, we'll import the model's class.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.model</span> <span class="kn">import</span> <span class="n">InsuranceChargesModel</span>
</code></pre></div>
<p>Now we can instantiate the model using the class.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">InsuranceChargesModel</span><span class="p">()</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The model object contains everything needed to make a prediction. When the object was instantiated, it loaded the necessary model parameters.</p>
<p>The model object publishes some metadata about the model as attributes:</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">display_name</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">version</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">description</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>insurance_charges_model
Insurance Charges Model
0.1.0
Model to predict the insurance charges of a customer.
</code></pre></div>
<p>To make a prediction, we need to use the model's input schema class. The input schema class is a <a href="https://pydantic-docs.helpmanual.io/">Pydantic</a> class that defines a data structure that can be used by the model's predict() method to make a prediction. </p>
<p>The input schema can be accessed directly from the model object like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span><span class="o">.</span><span class="n">input_schema</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>insurance_charges_model.prediction.schemas.InsuranceChargesModelInput
</code></pre></div>
<p>We can view the input schema of the model as a JSON schema document by calling the .schema() method on the Pydantic class.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'InsuranceChargesModelInput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Schema for input of the model's predict method."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'age'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age of primary beneficiary in years.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">65</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'sex'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Sex'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Gender of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/SexEnum'</span><span class="p">}]},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'bmi'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body Mass Index'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body mass index of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mf">15.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mf">50.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'number'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'children'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Children'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Number of children covered by health insurance.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'smoker'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Smoker'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Whether beneficiary is a smoker.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'boolean'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'region'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region where beneficiary lives.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/RegionEnum'</span><span class="p">}]}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'definitions'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'SexEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'SexEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'sex' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'male'</span><span class="p">,</span><span class="w"> </span><span class="s1">'female'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'region' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'southwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'southeast'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northeast'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}}}</span><span class="w"></span>
</code></pre></div>
<p>The model's input schema is called InsuranceChargesModelInput. The model expects five fields to be provided in order to make a prediction. All of the fields have type information as well as the allowed values. For example, the input schema states that the minimum allowed value for "bmi" is 15 and the maximum allowed value is 50.</p>
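<p>These constraints are enforced by Pydantic when the input object is instantiated; a value outside of the allowed range raises a validation error. The sketch below illustrates this with a stand-in class carrying the same "bmi" constraint, rather than the model's actual schema class:</p>

```python
from pydantic import BaseModel, Field, ValidationError


class BmiInputSketch(BaseModel):
    """Stand-in for the constrained 'bmi' field of the model's input schema."""
    bmi: float = Field(ge=15.0, le=50.0)


# A value inside the allowed range is accepted.
BmiInputSketch(bmi=24.0)

# A value outside the allowed range of 15.0 to 50.0 is rejected.
try:
    BmiInputSketch(bmi=100.0)
except ValidationError:
    print("bmi failed validation")
```

<p>This means that invalid inputs are rejected before the model's prediction code ever runs.</p>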
<p>To make a prediction, all we need to do is instantiate the input schema class and give it to the model object's predict() method:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.schemas</span> <span class="kn">import</span> <span class="n">SexEnum</span><span class="p">,</span> <span class="n">RegionEnum</span>
<span class="n">model_input</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">input_schema</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=8640.78)
</code></pre></div>
<p>The prediction is another Pydantic class, this one is of type InsuranceChargesModelOutput. The output contains a single field called "charges", which is the predicted amount of charges in dollars. The model predicts that the charges will be $8640.78. Notice that we needed to import two Enum classes in order to fill in the categorical fields with allowed values.</p>
<p>The policies that we'll write need to interact with the model through these schemas, so it's important to review them.</p>
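<p>Since Rego policies operate on JSON documents, the Pydantic input objects will need to be serialized into plain dictionaries before they can be evaluated. The sketch below shows the document shape that the policy will receive; the field values are made up, and the "model_input" wrapper key matches the field references used in the policy in the next section:</p>

```python
# Sketch of the "input" document sent with a policy query. The policy will
# reference these fields as input.model_input.smoker and input.model_input.age.
policy_query = {
    "input": {
        "model_input": {
            "age": 62,
            "sex": "male",
            "bmi": 24.0,
            "children": 0,
            "smoker": True,
            "region": "northwest",
        }
    }
}

print(policy_query["input"]["model_input"]["smoker"])  # True
```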
<h2>Creating a Policy</h2>
<p><a href="https://www.openpolicyagent.org/docs/latest/policy-language/">Rego policies</a> are assertions on data, in this blog post that data is the ML model's input and output data structures. Using the Insurance Charges Model we installed above, we'll create a policy for this situation:</p>
<p>"Smokers over the age of 60 should not have a prediction made."</p>
<p>This policy is completely made up; it's an example of a situation in which we would not want to return a prediction from the model for reasons other than the model's capabilities. The prediction that the model makes would still be valid, because the model is capable of predicting the insurance charges for a 62-year-old smoker, but business requirements may prevent the prediction from being used. This is a good place to add a policy that will enforce this business requirement. The policy looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nv">package</span> <span class="nv">insurance_charges_model</span>
<span class="nv">customer_is_a_smoker_over_60</span> <span class="k">if</span> {
<span class="nv">input</span>.<span class="nv">model_input</span>.<span class="nv">smoker</span>
<span class="nv">input</span>.<span class="nv">model_input</span>.<span class="nv">age</span> <span class="o">></span> <span class="mi">60</span>
}
</code></pre></div>
<p>The policy is defined in the "insurance_charges_model" package. The policy uses the model input fields "smoker", which is a boolean field, and "age", which is an integer. The value "customer_is_a_smoker_over_60" is set to "true" if the conditions in the body of the rule are true. This policy is very simple and it does not actually make a decision about what to do with the model's prediction; all it does is detect whether the customer is a smoker over the age of 60. To create a decision we'll add another rule:</p>
<div class="highlight"><pre><span></span><code><span class="nv">allow</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="no">true</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nv">customer_is_a_smoker_over_60</span><span class="w"></span>
<span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="o">:=</span><span class="w"> </span><span class="no">false</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nv">customer_is_a_smoker_over_60</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>We've added a rule called "allow" to the policy. Very simply, the value for "allow" is set to true if the customer is not a smoker over the age of 60; otherwise it is set to false. We'll use this rule to actually make a decision as to what to do with the prediction. It would also be nice to have a description as to why the decision was made, so we'll add one last rule to the insurance_charges_model policy package:</p>
<div class="highlight"><pre><span></span><code><span class="nv">messages</span><span class="w"> </span><span class="nv">contains</span><span class="w"> </span><span class="nv">msg</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nv">customer_is_a_smoker_over_60</span><span class="w"></span>
<span class="w"> </span><span class="nv">msg</span><span class="o">:=</span><span class="w"> </span><span class="s">"Prediction cannot be made if customer is a smoker over the age of 60."</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>The last rule creates a "messages" array with explanations for the rules. If the "customer_is_a_smoker_over_60" rule is true, the messages array will contain an explanation for that particular decision. The structure of this policy package is designed to be extendable, so extra clauses can be added to the "allow" rule and "messages" rule as needed.</p>
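<p>Taken together, the three rules produce a small decision document. As a sanity check, here is a pure-Python sketch of the same logic; this is only an illustration of the result we expect back from a policy query, not the OPA engine itself:</p>

```python
def evaluate_policy_sketch(model_input: dict) -> dict:
    """Pure-Python rendering of the Rego rules above, for illustration only."""
    smoker_over_60 = bool(model_input.get("smoker")) and model_input.get("age", 0) > 60
    messages = []
    if smoker_over_60:
        messages.append("Prediction cannot be made if customer is a smoker over the age of 60.")
    return {"allow": not smoker_over_60, "messages": messages}


print(evaluate_policy_sketch({"age": 62, "smoker": True}))
print(evaluate_policy_sketch({"age": 42, "smoker": False}))
```

<p>The first input is denied with an explanatory message, while the second is allowed with an empty messages collection.</p>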
<p>The policy file is called "insurance_charges_model.rego" and it is saved in the "policies" folder of the repository. </p>
<h2>Trying Out the Policy</h2>
<p>To show how the policy works, we'll start up the Open Policy Agent service in a Docker container.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">8181</span><span class="p">:</span><span class="mi">8181</span> \
<span class="o">--</span><span class="n">name</span> <span class="n">opa</span> \
<span class="n">openpolicyagent</span><span class="o">/</span><span class="n">opa</span> <span class="n">run</span> <span class="o">--</span><span class="n">server</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">84</span><span class="n">f6c5264e3b1c06e5d20891932e4e682cfd45754fac52dfd0a76ee1574f1302</span><span class="w"></span>
</code></pre></div>
<p>Once the container is up and running, we'll install the <a href="https://github.com/Turall/OPA-python-client">OPA python package</a> to make the integration a little easier. By using the package we won't need to make individual REST calls to the service ourselves; we'll let the package handle that.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">OPA</span><span class="o">-</span><span class="n">python</span><span class="o">-</span><span class="n">client</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To contact the OPA service running in the Docker image, we'll create a client object:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">opa_client.opa</span> <span class="kn">import</span> <span class="n">OpaClient</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">OpaClient</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">8181</span><span class="p">,</span> <span class="n">version</span><span class="o">=</span><span class="s2">"v1"</span><span class="p">)</span>
<span class="n">client</span><span class="o">.</span><span class="n">check_connection</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>"Yes I'm here :)"
</code></pre></div>
<p>The check_connection() method on the client reached out to the OPA service and checked for connectivity.</p>
<p>We can create policies in the OPA service by loading the policies from a file and sending it to the service.</p>
<div class="highlight"><pre><span></span><code><span class="n">client</span><span class="o">.</span><span class="n">update_opa_policy_fromfile</span><span class="p">(</span><span class="s2">"../policies/insurance_charges_model.rego"</span><span class="p">,</span>
<span class="n">endpoint</span><span class="o">=</span><span class="s2">"insurance_charges_model"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>True
</code></pre></div>
<p>The policy was created succesfully in the service, but just to make sure we can ask for a list of the policies:</p>
<div class="highlight"><pre><span></span><code><span class="n">client</span><span class="o">.</span><span class="n">get_policies_list</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>['insurance_charges_model']
</code></pre></div>
<p>Looks like the insurance_charges_model package is loaded, now we can try it out with some data. We'll create some data using the model's input and output schemas:</p>
<div class="highlight"><pre><span></span><code><span class="n">policy_input_data</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"model_qualified_name"</span><span class="p">:</span> <span class="s2">"insurance_charges_model"</span><span class="p">,</span>
<span class="s2">"model_version"</span><span class="p">:</span> <span class="s2">"0.1.0"</span><span class="p">,</span>
<span class="s2">"model_input"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"age"</span><span class="p">:</span> <span class="mi">62</span><span class="p">,</span>
<span class="s2">"sex"</span><span class="p">:</span> <span class="s2">"female"</span><span class="p">,</span>
<span class="s2">"bmi"</span><span class="p">:</span> <span class="mf">24.0</span><span class="p">,</span>
<span class="s2">"children"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s2">"smoker"</span><span class="p">:</span> <span class="kc">True</span><span class="p">,</span>
<span class="s2">"region"</span><span class="p">:</span> <span class="s2">"northwest"</span>
<span class="p">},</span>
<span class="s2">"model_output"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"charges"</span><span class="p">:</span> <span class="mf">12345.0</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>We'll be sending the model's qualified name and version, along with the model input and model output.</p>
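<p>To make this payload easier to reuse, the structure above can be assembled by a small helper function. This is an illustrative sketch only; the helper name is ours and is not part of the OPA client package:</p>

```python
def build_policy_input(qualified_name, version, model_input, model_output):
    """Assemble the document that is sent to the OPA service for a policy check."""
    return {
        "model_qualified_name": qualified_name,
        "model_version": version,
        "model_input": model_input,
        "model_output": model_output,
    }

# build the same payload as shown above
payload = build_policy_input(
    "insurance_charges_model",
    "0.1.0",
    {"age": 62, "sex": "female", "bmi": 24.0, "children": 2,
     "smoker": True, "region": "northwest"},
    {"charges": 12345.0},
)
```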
<p>We can execute the policy against this data like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">result</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">check_policy_rule</span><span class="p">(</span><span class="n">input_data</span><span class="o">=</span><span class="n">policy_input_data</span><span class="p">,</span>
<span class="n">package_path</span><span class="o">=</span><span class="s2">"insurance_charges_model"</span><span class="p">)</span>
<span class="n">result</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{<span class="s1">'</span><span class="s">result</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">allow</span><span class="s1">'</span>: <span class="nv">False</span>,
<span class="s1">'</span><span class="s">customer_is_a_smoker_over_60</span><span class="s1">'</span>: <span class="nv">True</span>,
<span class="s1">'</span><span class="s">messages</span><span class="s1">'</span>: [<span class="s1">'</span><span class="s">Prediction cannot be made if customer is a smoker over the age of 60.</span><span class="s1">'</span>]}}
</code></pre></div>
<p>The "allow" rule evaluated to False, the reason being that the customer is a smoker over the age of 60. Let's try it again:</p>
<div class="highlight"><pre><span></span><code><span class="n">policy_input_data</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"model_qualified_name"</span><span class="p">:</span> <span class="s2">"insurance_charges_model"</span><span class="p">,</span>
<span class="s2">"model_version"</span><span class="p">:</span> <span class="s2">"0.1.0"</span><span class="p">,</span>
<span class="s2">"model_input"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"age"</span><span class="p">:</span> <span class="mi">45</span><span class="p">,</span>
<span class="s2">"sex"</span><span class="p">:</span> <span class="s2">"female"</span><span class="p">,</span>
<span class="s2">"bmi"</span><span class="p">:</span> <span class="mf">24.0</span><span class="p">,</span>
<span class="s2">"children"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s2">"smoker"</span><span class="p">:</span> <span class="kc">True</span><span class="p">,</span>
<span class="s2">"region"</span><span class="p">:</span> <span class="s2">"northwest"</span>
<span class="p">},</span>
<span class="s2">"model_output"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"charges"</span><span class="p">:</span> <span class="mf">12345.0</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">check_policy_rule</span><span class="p">(</span><span class="n">input_data</span><span class="o">=</span><span class="n">policy_input_data</span><span class="p">,</span>
<span class="n">package_path</span><span class="o">=</span><span class="s2">"insurance_charges_model"</span><span class="p">)</span>
<span class="n">result</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{'result': {'allow': True, 'messages': []}}
</code></pre></div>
<p>This time, the "allow" rule evaluated to true, because the age of the customer is below 60, however they are still a smoker. The rule works as expected because we wanted to disallow a prediction if the customer is a smoker AND also over the age of 60.</p>
<p>In this section we showed how to execute the Rego policy using the Open Policy Agent. </p>
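<p>Under the hood, the client's check_policy_rule() method maps onto OPA's REST data API, which accepts a POST to /v1/data/&lt;package_path&gt; with the input document wrapped in an "input" field. The same call can be sketched with only the standard library (the helper names here are ours, not part of any package):</p>

```python
import json
import urllib.request

def opa_data_url(host, port, package_path):
    """Build the URL of OPA's data API for a given policy package."""
    return f"http://{host}:{port}/v1/data/{package_path}"

def check_policy(host, port, package_path, input_data):
    """POST the input document to the OPA service and return the decision."""
    body = json.dumps({"input": input_data}).encode("utf-8")
    request = urllib.request.Request(
        opa_data_url(host, port, package_path),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```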
<h2>Testing the Policy</h2>
<p>Rego policies can be tested by creating other Rego policies that assert that the policy outputs the correct decision given fake data. A Rego test looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">package</span> <span class="n">insurance_charges_model</span>
<span class="kn">import</span> <span class="nn">future.keywords</span>
<span class="n">test_customer_is_a_smoker_over_60</span> <span class="k">if</span> <span class="p">{</span>
<span class="n">customer_is_a_smoker_over_60</span> <span class="k">with</span> <span class="nb">input</span> <span class="k">as</span> <span class="p">{</span>
<span class="s2">"model_input"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"age"</span><span class="p">:</span> <span class="mi">62</span><span class="p">,</span>
<span class="s2">"sex"</span><span class="p">:</span> <span class="s2">"female"</span><span class="p">,</span>
<span class="s2">"bmi"</span><span class="p">:</span> <span class="mf">24.0</span><span class="p">,</span>
<span class="s2">"children"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
<span class="s2">"smoker"</span><span class="p">:</span> <span class="n">true</span><span class="p">,</span>
<span class="s2">"region"</span><span class="p">:</span> <span class="s2">"northwest"</span>
<span class="p">},</span>
<span class="s2">"model_output"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"charges"</span><span class="p">:</span> <span class="mf">12345.0</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>The unit test is named "test_customer_is_a_smoker_over_60" and it asserts that the rule evaluates to "true" given the input. This unit test, along with nine others, can be found in the insurance_charges_model_test.rego file in the policies folder of the project repository.</p>
<p>We'll run the test with this Docker command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">it</span> <span class="o">--</span><span class="n">rm</span> \
<span class="o">-</span><span class="n">v</span> <span class="s2">"$(pwd)"</span><span class="o">/../</span><span class="n">policies</span><span class="p">:</span><span class="o">/</span><span class="n">policies</span> \
<span class="n">openpolicyagent</span><span class="o">/</span><span class="n">opa</span><span class="p">:</span><span class="mf">0.43.0</span> <span class="n">test</span> <span class="o">./</span><span class="n">policies</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">PASS</span><span class="o">:</span><span class="w"> </span><span class="mi">10</span><span class="o">/</span><span class="mi">10</span><span class="w"></span>
</code></pre></div>
<p>The "opa test" command found all 10 tests and executed them. The tests are loaded by sharing the folder containing the policies with the Docker container as a volume. The command then automatically found the insurance_charges_model_test.rego file and executed all of the tests inside it. The tests all passed.</p>
<p>One of the benefits of building policies as code is that the policies themselves can be tested, which adds quality control to the policy codebase.</p>
<h2>Creating the Policy Decorator</h2>
<p>In order to cleanly integrate a deployed ML model with the Open Policy Agent, we'll create a decorator that handles the application of policies. The decorator will execute "around" the model's output_schema property and predict() method.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Union</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>
<span class="kn">from</span> <span class="nn">ml_base.decorator</span> <span class="kn">import</span> <span class="n">MLModelDecorator</span>
<span class="kn">from</span> <span class="nn">opa_client.opa</span> <span class="kn">import</span> <span class="n">OpaClient</span>
<span class="k">class</span> <span class="nc">PredictionNotAvailable</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="sd">"""Schema returned when a prediction is not available because of a policy decision."""</span>
<span class="n">messages</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
<span class="k">class</span> <span class="nc">OPAPolicyDecorator</span><span class="p">(</span><span class="n">MLModelDecorator</span><span class="p">):</span>
<span class="sd">"""Decorator to do policy checks using the Open Policy Agent service.</span>
<span class="sd"> Args:</span>
<span class="sd"> host: Hostname of the OPA service.</span>
<span class="sd"> port: Port of the OPA service.</span>
<span class="sd"> policy_package: Name of the policy to apply to the model.</span>
<span class="sd"> """</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">host</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">port</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">policy_package</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">host</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="n">port</span><span class="p">,</span> <span class="n">policy_package</span><span class="o">=</span><span class="n">policy_package</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_client"</span><span class="p">]</span> <span class="o">=</span> <span class="n">OpaClient</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">host</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="n">port</span><span class="p">,</span>
<span class="n">version</span><span class="o">=</span><span class="s2">"v1"</span><span class="p">)</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="n">BaseModel</span><span class="p">:</span>
<span class="sd">"""Decorator method that modifies the model's output schema to accomodate the policy decision.</span>
<span class="sd"> Note:</span>
<span class="sd"> This method will create a Union of the model's output schema and the PredictionNotAvailable</span>
<span class="sd"> schema and return it.</span>
<span class="sd"> """</span>
<span class="k">class</span> <span class="nc">NewUnion</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">__root__</span><span class="p">:</span> <span class="n">Union</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">output_schema</span><span class="p">,</span> <span class="n">PredictionNotAvailable</span><span class="p">]</span>
<span class="n">NewUnion</span><span class="o">.</span><span class="vm">__name__</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">output_schema</span><span class="o">.</span><span class="vm">__name__</span>
<span class="k">return</span> <span class="n">NewUnion</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="sd">"""Decorate the model's predict() method, calling the OPA service with the model's input and output.</span>
<span class="sd"> Note:</span>
<span class="sd"> If a prediction is allowed the OPAPolicyDecorator predict() method will return an</span>
<span class="sd"> instance of the model's output schema. If a prediction is not allowed because of a policy </span>
<span class="sd"> violation, the decorator will return an instance of PredictionNotAvailable.</span>
<span class="sd"> """</span>
<span class="c1"># make a prediction with the model</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># build up data structure to send to the OPA service</span>
<span class="n">policy_check_data</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"model_qualified_name"</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span>
<span class="s2">"model_version"</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">version</span><span class="p">,</span>
<span class="s2">"model_input"</span><span class="p">:</span> <span class="n">data</span><span class="o">.</span><span class="n">dict</span><span class="p">(),</span>
<span class="s2">"model_output"</span><span class="p">:</span> <span class="n">prediction</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="p">}</span>
<span class="c1"># call OPA service with model input and output </span>
<span class="n">response</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_client"</span><span class="p">]</span><span class="o">.</span><span class="n">check_policy_rule</span><span class="p">(</span><span class="n">input_data</span><span class="o">=</span><span class="n">policy_check_data</span><span class="p">,</span>
<span class="n">package_path</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"policy_package"</span><span class="p">])</span>
<span class="c1"># if "allow" is True, then return the prediction</span>
<span class="k">if</span> <span class="n">response</span><span class="p">[</span><span class="s2">"result"</span><span class="p">][</span><span class="s2">"allow"</span><span class="p">]:</span>
<span class="k">return</span> <span class="n">prediction</span>
<span class="c1"># otherwise, return an instance of PredictionNotAvailable</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">PredictionNotAvailable</span><span class="p">(</span><span class="n">messages</span><span class="o">=</span><span class="n">response</span><span class="p">[</span><span class="s2">"result"</span><span class="p">][</span><span class="s2">"messages"</span><span class="p">])</span>
<span class="k">def</span> <span class="fm">__del__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_client"</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">del</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_client"</span><span class="p">]</span>
<span class="k">except</span> <span class="ne">KeyError</span><span class="p">:</span>
<span class="k">pass</span>
</code></pre></div>
<p>The OPAPolicyDecorator class implements the decorator. The <code>__init__()</code> method is used to configure the decorator when it is instantiated. It has parameters for the hostname and port of the OPA service, and the policy package that we want to apply to the model.</p>
<p>The decorator actually modifies the output schema of the model that it is decorating. The output schema becomes a Union of the model's output schema and a schema called PredictionNotAvailable. The decorator needs to add this Union because it needs to be able to inform the users of the model when the policy does not allow a prediction to be returned. The modification of the output schema happens transparently to the user of the model; all they need to do is handle the PredictionNotAvailable output when it is returned.</p>
<p>The predict() method is where the action happens. Every time we make a prediction, the decorator passes the prediction input to the model instance and receives the prediction output from the model. The decorator then sends the model's input and output to the OPA service along with the name of the policy package that we want to apply. If the "allow" result comes back as True, the prediction is returned to the calling code; if the "allow" result is False, the decorator returns a PredictionNotAvailable instance containing the "messages" array produced by the policy.</p>
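<p>Stripped of the decorator machinery, the core decision logic of the predict() method can be sketched as a standalone function. The function name is hypothetical and plain dictionaries stand in for the schema objects:</p>

```python
def apply_policy_decision(response, prediction):
    """Return the prediction if the policy allows it, otherwise the messages."""
    result = response["result"]
    if result["allow"]:
        return prediction
    return {"messages": result["messages"]}

# a response that allows the prediction passes it through unchanged
allowed = apply_policy_decision(
    {"result": {"allow": True, "messages": []}},
    {"charges": 8640.78})

# a denying response replaces the prediction with the policy's messages
denied = apply_policy_decision(
    {"result": {"allow": False,
                "messages": ["Prediction cannot be made."]}},
    {"charges": 8640.78})
```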
<h2>Decorating the Model</h2>
<p>To test out the decorator we’ll first instantiate the model object that we want to use with the decorator.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">InsuranceChargesModel</span><span class="p">()</span>
</code></pre></div>
<p>Next, we’ll instantiate the decorator with the parameters.</p>
<div class="highlight"><pre><span></span><code><span class="n">decorator</span> <span class="o">=</span> <span class="n">OPAPolicyDecorator</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="mi">8181</span><span class="p">,</span>
<span class="n">policy_package</span><span class="o">=</span><span class="s2">"insurance_charges_model"</span>
<span class="p">)</span>
</code></pre></div>
<p>We can add the model instance to the decorator after it’s been instantiated like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span> <span class="o">=</span> <span class="n">decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>We can see the decorator and the model objects by printing the reference to the decorator:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>OPAPolicyDecorator(InsuranceChargesModel)
</code></pre></div>
<p>The decorator object prints out its own type along with the type of the model that it is decorating.</p>
<p>The JSON Schema of the model output schema also reflects the Union that was created by the decorator:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span><span class="o">.</span><span class="n">output_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">InsuranceChargesModelOutput</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">anyOf</span><span class="s1">'</span>: [{<span class="s1">'</span><span class="s">$ref</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">#/definitions/insurance_charges_model__prediction__schemas__InsuranceChargesModelOutput</span><span class="s1">'</span>},
{<span class="s1">'</span><span class="s">$ref</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">#/definitions/PredictionNotAvailable</span><span class="s1">'</span>}],
<span class="s1">'</span><span class="s">definitions</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">insurance_charges_model__prediction__schemas__InsuranceChargesModelOutput</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">InsuranceChargesModelOutput</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">description</span><span class="s1">'</span>: <span class="s2">"</span><span class="s">Schema for output of the model's predict method.</span><span class="s2">"</span>,
<span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">object</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">properties</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">charges</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">Charges</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">description</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">Individual medical costs billed by health insurance to customer in US dollars.</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">number</span><span class="s1">'</span>}}},
<span class="s1">'</span><span class="s">PredictionNotAvailable</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">PredictionNotAvailable</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">description</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">Schema returned when a prediction is not available because of a policy decision.</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">object</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">properties</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">messages</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">Messages</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">array</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">items</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">string</span><span class="s1">'</span>}}},
<span class="s1">'</span><span class="s">required</span><span class="s1">'</span>: [<span class="s1">'</span><span class="s">messages</span><span class="s1">'</span>]}}}
</code></pre></div>
<p>As we explained, the PredictionNotAvailable output is added by the OPAPolicyDecorator instance whenever the policy does not allow a prediction to be returned from the model. The Union is shown in the JSON Schema document using the "anyOf" field.</p>
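<p>A caller consuming this union can tell the two variants apart by their required fields, since PredictionNotAvailable is the only one that carries "messages". A sketch of such client-side handling, using plain dictionaries in place of the schema objects:</p>

```python
def interpret_model_output(output):
    """Dispatch on the two possible shapes of the decorated model's output."""
    if "messages" in output:
        # the policy denied the prediction
        return ("denied", output["messages"])
    # a normal prediction came back
    return ("prediction", output["charges"])
```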
<h2>Trying out the Decorator</h2>
<p>Now that we have some policies in the OPA service and a decorated model, we can try to make predictions with the decorated model.</p>
<p>To begin, we'll try a prediction that we know will succeed:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.schemas</span> <span class="kn">import</span> <span class="n">InsuranceChargesModelInput</span>
<span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=8640.78)
</code></pre></div>
<p>Since the customer is not a smoker over the age of 60, the policy allowed the prediction and we got it back from the model. Next, we'll try another prediction:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">62</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="nv">PredictionNotAvailable</span><span class="ss">(</span><span class="nv">messages</span><span class="o">=</span>[<span class="s1">'</span><span class="s">Prediction cannot be made if customer is a smoker over the age of 60.</span><span class="s1">'</span>]<span class="ss">)</span>
</code></pre></div>
<p>The policy decorator stepped in when the OPA service returned a result with "allow" set to false. The decorator discarded the model's prediction and returned an instance of PredictionNotAvailable containing the messages array created by the policy running in the OPA service.</p>
<h2>Deploying the Decorator and Model</h2>
<p>Now that we have a model and a decorator, we can combine them in a service that makes predictions and also performs policy checks. To do this, we won't need to write any extra code; we can leverage the <a href="https://pypi.org/project/rest-model-service/">rest_model_service package</a> to provide the RESTful API for the service. You can learn more about the package in <a href="https://www.tekhnoal.com/rest-model-service.html">this blog post</a>.</p>
<p>To install the package, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">rest_model_service</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To create a service for our model, all we need to do is add a YAML configuration file to the project. The configuration file looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Insurance Charges Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model.prediction.model.InsuranceChargesModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
<span class="w"> </span><span class="nt">decorators</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">policy_decorator.policy_decorator.OPAPolicyDecorator</span><span class="w"></span>
<span class="w"> </span><span class="nt">configuration</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">host</span><span class="p">:</span><span class="w"> </span><span class="s">"localhost"</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8181</span><span class="w"></span>
<span class="w"> </span><span class="nt">policy_package</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model</span><span class="w"></span>
</code></pre></div>
<p>The service_title field is the name of the service as it will appear in the documentation. The models field is an array that contains the details of the models we would like to deploy in the service. The class_path field points at the MLModel class that implements the model's prediction logic. The decorators field contains the details of the decorators that we want to attach to the model instance. In this case, we want to use the OPAPolicyDecorator decorator class with the configuration we've used for local testing.</p>
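<p>The class_path strings in the configuration can be resolved into Python classes with a dynamic import. This is a hypothetical sketch of the mechanism; rest_model_service's actual loader may differ.</p>

```python
import importlib

def load_class(class_path: str):
    """Import and return the class named by a dotted path like 'pkg.module.ClassName'."""
    module_path, class_name = class_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# The service can then wrap the model in each configured decorator, passing the
# "configuration" mapping as keyword arguments, e.g.:
#   decorator_class = load_class(entry["class_path"])
#   model = decorator_class(model, **entry.get("configuration", {}))
```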
<p>Using the configuration file, we're able to create an OpenAPI specification file for the model service by executing these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>./configuration/local_rest_config.yaml
generate_openapi --output_file<span class="o">=</span><span class="s2">"service_contract.yaml"</span>
</code></pre></div>
<p>The generated service_contract.yaml file contains the OpenAPI specification for the model service. The insurance_charges_model endpoint is the one we'll call to make predictions with the model. The model's input and output schemas were automatically extracted and added to the specification. If you inspect the contract, you'll find that the model's output schema was automatically modified by the decorator in the same way as in the example above: the output schema is a Union of the model's original output schema and the PredictionNotAvailable type. The generated OpenAPI specification file can be found at the root of the repository, named service_contract.yaml.</p>
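<p>The widened output schema can be illustrated with simple dataclasses. The class names follow the examples above, but these definitions are illustrative stand-ins, not the model's actual schema classes.</p>

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class InsuranceChargesModelOutput:  # stand-in for the model's output schema
    charges: float

@dataclass
class PredictionNotAvailable:  # returned when the policy denies a prediction
    messages: List[str]

# After decoration, the output type exposed in the OpenAPI contract is the
# union of the original output type and the policy-denial type.
DecoratedModelOutput = Union[InsuranceChargesModelOutput, PredictionNotAvailable]
```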
<p>To run the service locally, execute this command:</p>
<div class="highlight"><pre><span></span><code>uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>The service process starts up and can be accessed in a web browser at http://127.0.0.1:8000. The service renders the OpenAPI specification as a webpage that looks like this:</p>
<p><img alt="Service Documentation" src="https://www.tekhnoal.com/service_documentation_pfmlm.png" width="100%"></p>
<p>By using the MLModel base class provided by the ml_base package and the REST service framework provided by the rest_model_service package we're able to quickly stand up a service to host the model. The decorator that we want to deploy can also be attached to the model through configuration, including all of its parameters.</p>
<p>We won't be testing the service right now, so we can stop the service process by hitting CTRL+C.</p>
<h2>Creating a Docker Image</h2>
<p>Now that we have a working model and model service, we'll need to deploy it somewhere. We'll start by deploying the service locally using Docker.</p>
<p>Let's build a Docker image and run it locally. The image is built from the instructions in the Dockerfile:</p>
<div class="highlight"><pre><span></span><code><span class="c"># syntax=docker/dockerfile:1</span>
<span class="k">FROM</span><span class="w"> </span><span class="s">python:3.9-slim</span>
<span class="k">ARG</span><span class="w"> </span>BUILD_DATE
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.title<span class="o">=</span><span class="s2">"Policies for ML Models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.description<span class="o">=</span><span class="s2">"Policies for machine learning models."</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.created<span class="o">=</span><span class="nv">$BUILD_DATE</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.authors<span class="o">=</span><span class="s2">"6666331+schmidtbri@users.noreply.github.com"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.source<span class="o">=</span><span class="s2">"https://github.com/schmidtbri/policies-for-ml-models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.version<span class="o">=</span><span class="s2">"0.1.0"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.licenses<span class="o">=</span><span class="s2">"MIT License"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.base.name<span class="o">=</span><span class="s2">"python:3.9-slim"</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">/service</span>
<span class="k">ARG</span><span class="w"> </span><span class="nv">USERNAME</span><span class="o">=</span>service-user
<span class="k">ARG</span><span class="w"> </span><span class="nv">USER_UID</span><span class="o">=</span><span class="m">10000</span>
<span class="k">ARG</span><span class="w"> </span><span class="nv">USER_GID</span><span class="o">=</span><span class="m">10000</span>
<span class="c"># install packages</span>
<span class="k">RUN</span><span class="w"> </span>apt-get update <span class="se">\</span>
<span class="o">&&</span> apt-get install --assume-yes --no-install-recommends sudo <span class="se">\</span>
<span class="o">&&</span> apt-get install --assume-yes --no-install-recommends git <span class="se">\</span>
<span class="o">&&</span> apt-get clean <span class="se">\</span>
<span class="o">&&</span> rm -rf /var/lib/apt/lists/*
<span class="c"># create a user</span>
<span class="k">RUN</span><span class="w"> </span>groupadd --gid <span class="nv">$USER_GID</span> <span class="nv">$USERNAME</span> <span class="se">\</span>
<span class="o">&&</span> useradd --uid <span class="nv">$USER_UID</span> --gid <span class="nv">$USER_GID</span> -m <span class="nv">$USERNAME</span> <span class="se">\</span>
<span class="o">&&</span> <span class="nb">echo</span> <span class="nv">$USERNAME</span> <span class="nv">ALL</span><span class="o">=</span><span class="se">\(</span>root<span class="se">\)</span> NOPASSWD:ALL > /etc/sudoers.d/<span class="nv">$USERNAME</span> <span class="se">\</span>
<span class="o">&&</span> chmod <span class="m">0440</span> /etc/sudoers.d/<span class="nv">$USERNAME</span>
<span class="c"># installing dependencies</span>
<span class="k">COPY</span><span class="w"> </span>./service_requirements.txt ./service_requirements.txt
<span class="k">RUN</span><span class="w"> </span>pip install -r service_requirements.txt
<span class="c"># copying code, configuration, and license</span>
<span class="k">COPY</span><span class="w"> </span>./configuration ./configuration
<span class="k">COPY</span><span class="w"> </span>./policy_decorator ./policy_decorator
<span class="k">COPY</span><span class="w"> </span>./LICENSE ./LICENSE
<span class="k">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">"uvicorn"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rest_model_service.main:app"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--host"</span><span class="p">,</span><span class="w"> </span><span class="s2">"0.0.0.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--port"</span><span class="p">,</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">]</span>
<span class="k">USER</span><span class="w"> </span><span class="s">$USERNAME</span>
</code></pre></div>
<p>To build the Docker image from the Dockerfile, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">build</span> <span class="o">-</span><span class="n">t</span> <span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="mf">0.1.0</span> <span class="o">../</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make sure everything worked as expected, we'll look through the docker images in our system:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">insurance_charges_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>insurance_charges_model_service 0.1.0 4b2747668a67 18 seconds ago 1.37GB
</code></pre></div>
<p>The insurance_charges_model_service image is listed. Next, we'll run the image to see if everything is working as expected. However, the service container needs to be able to reach the OPA container, so we first need to connect both containers to the same Docker network. Let's create the network:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">create</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">8</span><span class="n">a7d2d05523d01dd0fc082adac84bda01a012d7e847dcd4ffcc35df1031e18ab</span><span class="w"></span>
</code></pre></div>
<p>Next, we'll connect the running OPA Docker image to the network.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">connect</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span> <span class="n">opa</span>
</code></pre></div>
<p>Now we can start the service docker image connected to the same network as the OPA container.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">8000</span><span class="p">:</span><span class="mi">8000</span> \
<span class="o">--</span><span class="n">net</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">REST_CONFIG</span><span class="o">=./</span><span class="n">configuration</span><span class="o">/</span><span class="n">docker_rest_config</span><span class="o">.</span><span class="n">yaml</span> \
<span class="o">--</span><span class="n">name</span> <span class="n">insurance_charges_model_service</span> \
<span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">02</span><span class="n">dd79117cfe53949b30dea9e1aa8834bf2509e2cc707f42972eec955c3364ae</span><span class="w"></span>
</code></pre></div>
<p>Notice that we're using the "docker_rest_config.yaml" configuration file, which has a different hostname for the OPA service instance. From inside the network, the opa container is not accessible at localhost, so the configuration uses the hostname "opa" instead.</p>
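<p>Based on the local configuration shown earlier, the decorator section of docker_rest_config.yaml likely differs only in the host field. This is an assumed excerpt for illustration, not the verbatim file contents:</p>

```yaml
decorators:
  - class_path: policy_decorator.policy_decorator.OPAPolicyDecorator
    configuration:
      host: "opa"   # the container name resolves as a hostname on local-network
      port: 8181
      policy_package: insurance_charges_model
```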
<p>To make sure the server process started up correctly, we'll look at the logs:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">logs</span> <span class="n">insurance_charges_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="o">/</span><span class="n">usr</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">lib</span><span class="o">/</span><span class="n">python3</span><span class="mf">.9</span><span class="o">/</span><span class="n">site</span><span class="o">-</span><span class="n">packages</span><span class="o">/</span><span class="n">tpot</span><span class="o">/</span><span class="n">builtins</span><span class="o">/</span><span class="fm">__init__</span><span class="o">.</span><span class="n">py</span><span class="p">:</span><span class="mi">36</span><span class="p">:</span> <span class="ne">UserWarning</span><span class="p">:</span> <span class="ne">Warning</span><span class="p">:</span> <span class="n">optional</span> <span class="n">dependency</span> <span class="err">`</span><span class="n">torch</span><span class="err">`</span> <span class="ow">is</span> <span class="ow">not</span> <span class="n">available</span><span class="o">.</span> <span class="o">-</span> <span class="n">skipping</span> <span class="kn">import</span> <span class="nn">of</span> <span class="n">NN</span> <span class="n">models</span><span class="o">.</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">warn</span><span class="p">(</span><span class="s2">"Warning: optional dependency `torch` is not available. - skipping import of NN models."</span><span class="p">)</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Started</span> <span class="n">server</span> <span class="n">process</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Waiting</span> <span class="k">for</span> <span class="n">application</span> <span class="n">startup</span><span class="o">.</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Application</span> <span class="n">startup</span> <span class="n">complete</span><span class="o">.</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Uvicorn</span> <span class="n">running</span> <span class="n">on</span> <span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="mf">0.0.0.0</span><span class="p">:</span><span class="mi">8000</span> <span class="p">(</span><span class="n">Press</span> <span class="n">CTRL</span><span class="o">+</span><span class="n">C</span> <span class="n">to</span> <span class="n">quit</span><span class="p">)</span>
</code></pre></div>
<p>The service should be accessible on port 8000 of localhost, so we'll try to make a prediction using the curl command running inside a container connected to the network:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">it</span> <span class="o">--</span><span class="n">rm</span> \
<span class="o">--</span><span class="n">net</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span> \
<span class="n">curlimages</span><span class="o">/</span><span class="n">curl</span> \
<span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://insurance_charges_model_service:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 42, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">female</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 24.0, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 2, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: false, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">northwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":8640.78}
</code></pre></div>
<p>The model predicted that the insurance charges will be $8640.78.</p>
<p>We'll try a prediction that will fail the policy check as well:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">it</span> <span class="o">--</span><span class="n">rm</span> \
<span class="o">--</span><span class="n">net</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span> \
<span class="n">curlimages</span><span class="o">/</span><span class="n">curl</span> \
<span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://insurance_charges_model_service:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 62, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">female</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 24.0, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 2, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">northwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{<span class="s2">"</span><span class="s">messages</span><span class="s2">"</span>:[<span class="s2">"</span><span class="s">Prediction cannot be made if customer is a smoker over the age of 60.</span><span class="s2">"</span>]}
</code></pre></div>
<p>We're done with the local environment, so we'll shut down the OPA container, the model service container and the network we created for them.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">opa</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">opa</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">insurance_charges_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">insurance_charges_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">rm</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>opa
opa
insurance_charges_model_service
insurance_charges_model_service
local-network
</code></pre></div>
<h2>Deploying the Model</h2>
<p>To show the system in action, we’ll deploy the service and the OPA instance to a Kubernetes cluster. A local cluster can be easily started by using <a href="https://minikube.sigs.k8s.io/docs/">minikube</a>. Installation instructions can be found <a href="https://minikube.sigs.k8s.io/docs/start/">here</a>.</p>
<p>To start the minikube cluster execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">start</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="err">😄</span><span class="w"> </span><span class="n">minikube</span><span class="w"> </span><span class="n">v1</span><span class="o">.</span><span class="mf">26.1</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">Darwin</span><span class="w"> </span><span class="mf">12.5</span><span class="o">.</span><span class="mi">1</span><span class="w"></span>
<span class="err">🎉</span><span class="w"> </span><span class="n">minikube</span><span class="w"> </span><span class="mf">1.27</span><span class="o">.</span><span class="mi">0</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">available</span><span class="o">!</span><span class="w"> </span><span class="n">Download</span><span class="w"> </span><span class="n">it</span><span class="p">:</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">minikube</span><span class="o">/</span><span class="n">releases</span><span class="o">/</span><span class="n">tag</span><span class="o">/</span><span class="n">v1</span><span class="o">.</span><span class="mf">27.0</span><span class="w"></span>
<span class="err">💡</span><span class="w"> </span><span class="n">To</span><span class="w"> </span><span class="n">disable</span><span class="w"> </span><span class="n">this</span><span class="w"> </span><span class="n">notice</span><span class="p">,</span><span class="w"> </span><span class="n">run</span><span class="p">:</span><span class="w"> </span><span class="s1">'minikube config set WantUpdateNotification false'</span><span class="w"></span>
<span class="err">✨</span><span class="w"> </span><span class="n">Using</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">virtualbox</span><span class="w"> </span><span class="n">driver</span><span class="w"> </span><span class="n">based</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">existing</span><span class="w"> </span><span class="n">profile</span><span class="w"></span>
<span class="err">👍</span><span class="w"> </span><span class="n">Starting</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="n">plane</span><span class="w"> </span><span class="n">node</span><span class="w"> </span><span class="n">minikube</span><span class="w"> </span><span class="ow">in</span><span class="w"> </span><span class="n">cluster</span><span class="w"> </span><span class="n">minikube</span><span class="w"></span>
<span class="err">🔄</span><span class="w"> </span><span class="n">Restarting</span><span class="w"> </span><span class="n">existing</span><span class="w"> </span><span class="n">virtualbox</span><span class="w"> </span><span class="n">VM</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="s2">"minikube"</span><span class="w"> </span><span class="o">...</span><span class="w"></span>
<span class="err">🐳</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span><span class="n">Kubernetes</span><span class="w"> </span><span class="n">v1</span><span class="o">.</span><span class="mf">24.3</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">Docker</span><span class="w"> </span><span class="mf">20.10</span><span class="o">.</span><span class="mi">17</span><span class="w"> </span><span class="o">...</span><span class="err"></span><span class="p">[</span><span class="n">K</span><span class="err"></span><span class="p">[</span><span class="n">K</span><span class="err"></span><span class="p">[</span><span class="n">K</span><span class="err"></span><span class="p">[</span><span class="n">K</span><span class="w"></span>
<span class="w"> </span><span class="err">▪</span><span class="w"> </span><span class="n">controller</span><span class="o">-</span><span class="n">manager</span><span class="o">.</span><span class="n">horizontal</span><span class="o">-</span><span class="n">pod</span><span class="o">-</span><span class="n">autoscaler</span><span class="o">-</span><span class="n">sync</span><span class="o">-</span><span class="n">period</span><span class="o">=</span><span class="mi">5</span><span class="n">s</span><span class="w"></span>
<span class="err">🔎</span><span class="w"> </span><span class="n">Verifying</span><span class="w"> </span><span class="n">Kubernetes</span><span class="w"> </span><span class="n">components</span><span class="o">...</span><span class="w"></span>
<span class="w"> </span><span class="err">▪</span><span class="w"> </span><span class="n">Using</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="n">k8s</span><span class="o">.</span><span class="n">gcr</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">metrics</span><span class="o">-</span><span class="n">server</span><span class="o">/</span><span class="n">metrics</span><span class="o">-</span><span class="n">server</span><span class="p">:</span><span class="n">v0</span><span class="o">.</span><span class="mf">6.1</span><span class="w"></span>
<span class="w"> </span><span class="err">▪</span><span class="w"> </span><span class="n">Using</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="n">kubernetesui</span><span class="o">/</span><span class="n">dashboard</span><span class="p">:</span><span class="n">v2</span><span class="o">.</span><span class="mf">6.0</span><span class="w"></span>
<span class="w"> </span><span class="err">▪</span><span class="w"> </span><span class="n">Using</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="n">gcr</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">k8s</span><span class="o">-</span><span class="n">minikube</span><span class="o">/</span><span class="n">storage</span><span class="o">-</span><span class="n">provisioner</span><span class="p">:</span><span class="n">v5</span><span class="w"></span>
<span class="w"> </span><span class="err">▪</span><span class="w"> </span><span class="n">Using</span><span class="w"> </span><span class="n">image</span><span class="w"> </span><span class="n">kubernetesui</span><span class="o">/</span><span class="n">metrics</span><span class="o">-</span><span class="n">scraper</span><span class="p">:</span><span class="n">v1</span><span class="o">.</span><span class="mf">0.8</span><span class="w"></span>
<span class="err">🌟</span><span class="w"> </span><span class="n">Enabled</span><span class="w"> </span><span class="n">addons</span><span class="p">:</span><span class="w"> </span><span class="n">storage</span><span class="o">-</span><span class="n">provisioner</span><span class="w"></span>
<span class="err">🏄</span><span class="w"> </span><span class="n">Done</span><span class="o">!</span><span class="w"> </span><span class="n">kubectl</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">now</span><span class="w"> </span><span class="n">configured</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="s2">"minikube"</span><span class="w"> </span><span class="n">cluster</span><span class="w"> </span><span class="ow">and</span><span class="w"> </span><span class="s2">"default"</span><span class="w"> </span><span class="n">namespace</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="n">default</span><span class="w"></span>
</code></pre></div>
<p>We'll use the <a href="https://github.com/kubernetes/dashboard">Kubernetes Dashboard</a> to view details about the model service. We can start it up in the minikube cluster with this command:</p>
<div class="highlight"><pre><span></span><code>minikube dashboard --url
</code></pre></div>
<p>The command starts up a proxy that must keep running in order to forward the traffic to the dashboard UI in the minikube cluster.</p>
<p>Let's view all of the pods running in the minikube cluster to make sure we can connect.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">-</span><span class="n">A</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAMESPACE              NAME                                         READY   STATUS    RESTARTS       AGE
kube-system            coredns-6d4b75cb6d-wrrwr                     1/1     Running   19 (23h ago)   43d
kube-system            etcd-minikube                                1/1     Running   19 (23h ago)   43d
kube-system            kube-apiserver-minikube                      1/1     Running   19 (23h ago)   43d
kube-system            kube-controller-manager-minikube             1/1     Running   5 (23h ago)    20d
kube-system            kube-proxy-5n4t9                             1/1     Running   18 (23h ago)   43d
kube-system            kube-scheduler-minikube                      1/1     Running   17 (23h ago)   43d
kube-system            metrics-server-8595bd7d4c-ptcsp              1/1     Running   15 (23h ago)   23d
kube-system            storage-provisioner                          1/1     Running   29             43d
kubernetes-dashboard   dashboard-metrics-scraper-78dbd9dbf5-xslpl   1/1     Running   11 (23h ago)   23d
kubernetes-dashboard   kubernetes-dashboard-5fd5574d9f-vbtnd        1/1     Running   14 (23h ago)   23d
</code></pre></div>
<p>The pods running the Kubernetes Dashboard and other cluster services appear in the kube-system and kubernetes-dashboard namespaces.</p>
<h3>Creating a Kubernetes Namespace</h3>
<p>Now that we have a cluster and are connected to it, we'll create a namespace to hold the resources for our model deployment. The resource definition is in the kubernetes/namespace.yaml file. To apply the manifest to the cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace/model-services created
resourcequota/model-services-resource-quota created
</code></pre></div>
<p>The namespace was created, along with a ResourceQuota that limits the resources that can be consumed by objects within the namespace.</p>
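<p>For reference, a minimal manifest that creates both objects might look like the sketch below. The quota values here are assumptions; the actual kubernetes/namespace.yaml may use different limits.</p>

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: model-services
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: model-services-resource-quota
  namespace: model-services
spec:
  hard:
    requests.cpu: "1"      # assumed values, adjust to your cluster
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
```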
<p>To take a look at the namespaces, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">namespace</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                   STATUS   AGE
default                Active   43d
kube-node-lease        Active   43d
kube-public            Active   43d
kube-system            Active   43d
kubernetes-dashboard   Active   23d
model-services         Active   3s
</code></pre></div>
<p>The new namespace appears in the listing along with other namespaces created by default by the system. To use the new namespace for the rest of the operations, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">config</span> <span class="nb">set</span><span class="o">-</span><span class="n">context</span> <span class="o">--</span><span class="n">current</span> <span class="o">--</span><span class="n">namespace</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Context "minikube" modified.
</code></pre></div>
<h3>Creating a Kubernetes Deployment and Service</h3>
<p>The model service is deployed by using Kubernetes resources. These are:</p>
<ul>
<li>Model Service ConfigMap: a set of configuration options, in this case a simple YAML file that is loaded into the running container as a volume mount. This resource lets us change the configuration of the model service without modifying the Docker image; the mounted file overrides the configuration files included in the image.</li>
<li>Deployment: a declarative way to manage a set of pods; the model service pods are managed through this Deployment, which includes the model service container as well as the OPA service running as a sidecar container.</li>
<li>Service: a way to expose a set of pods running under a Deployment; the model service is made available to the outside world through the Service.</li>
</ul>
<p>The software architecture will look like this when it is running in the Kubernetes cluster:</p>
<p><img alt="Software Architecture" src="https://www.tekhnoal.com/better_software_architecture_pfmlm.png" width="100%"></p>
<p>This way of deploying the OPA service is called the "sidecar" pattern because each service Pod contains the main model service container and the OPA container running right beside it in the same Pod.</p>
<p>The sidecar OPA container is added to the model service pod with this YAML:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">...</span><span class="w"></span>
<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">- name</span><span class="p p-Indicator">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">opa</span><span class="w"></span>
<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">image</span><span class="p p-Indicator">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">openpolicyagent/opa:0.43.0</span><span class="w"></span>
<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ports</span><span class="p p-Indicator">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">http</span><span class="w"></span>
<span class="w"> </span><span class="nt">containerPort</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8181</span><span class="w"></span>
<span class="w"> </span><span class="w w-Error"> </span><span class="nt">imagePullPolicy</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Never</span><span class="w"></span>
<span class="w"> </span><span class="nt">resources</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">requests</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s">"100m"</span><span class="w"></span>
<span class="w"> </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="s">"250Mi"</span><span class="w"></span>
<span class="w"> </span><span class="nt">limits</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s">"200m"</span><span class="w"></span>
<span class="w"> </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="s">"250Mi"</span><span class="w"></span>
<span class="w"> </span><span class="nt">args</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"run"</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"--ignore=.*"</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"--server"</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"/policies"</span><span class="w"></span>
<span class="w"> </span><span class="nt">volumeMounts</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">readOnly</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
<span class="w"> </span><span class="nt">mountPath</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/policies</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">policies</span><span class="w"></span>
<span class="w"> </span><span class="nt">livenessProbe</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">httpGet</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">scheme</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HTTP</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8181</span><span class="w"></span>
<span class="w"> </span><span class="nt">initialDelaySeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="nt">periodSeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="nt">readinessProbe</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">httpGet</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/health?bundle=true</span><span class="w"></span>
<span class="w"> </span><span class="nt">scheme</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HTTP</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">8181</span><span class="w"></span>
<span class="w"> </span><span class="nt">initialDelaySeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="nt">periodSeconds</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">5</span><span class="w"></span>
<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">...</span><span class="w"></span>
</code></pre></div>
<p>This is not the complete YAML file; the full Deployment is defined in the ./kubernetes/model_service.yaml file.</p>
<p>You'll notice that the policy is not loaded through the OPA API. Instead, we'll add the policy as a volume mounted at the /policies path within the OPA container. The contents of the volume come from a ConfigMap that we'll create with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="n">configmap</span> <span class="n">policies</span> <span class="o">--</span><span class="n">from</span><span class="o">-</span><span class="n">file</span> <span class="o">../</span><span class="n">policies</span><span class="o">/</span><span class="n">insurance_charges_model</span><span class="o">.</span><span class="n">rego</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap/policies created
</code></pre></div>
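<p>Inside the Deployment, the ConfigMap reaches the OPA container's /policies mount through a pod-level volume. The fragment below is a sketch of that wiring; the real definition lives in ./kubernetes/model_service.yaml.</p>

```yaml
volumes:
  - name: policies       # matched by the opa container's volumeMounts entry
    configMap:
      name: policies     # the ConfigMap created above
```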
<p>The ConfigMap is managed separately from the OPA service running in the Pod. Let's view the ConfigMap to make sure it was created successfully.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">describe</span> <span class="n">configmaps</span> <span class="n">policies</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">Name</span><span class="p">:</span> <span class="n">policies</span>
<span class="n">Namespace</span><span class="p">:</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span>
<span class="n">Labels</span><span class="p">:</span> <span class="o"><</span><span class="n">none</span><span class="o">></span>
<span class="n">Annotations</span><span class="p">:</span> <span class="o"><</span><span class="n">none</span><span class="o">></span>
<span class="n">Data</span>
<span class="o">====</span>
<span class="n">insurance_charges_model</span><span class="o">.</span><span class="n">rego</span><span class="p">:</span>
<span class="o">----</span>
<span class="n">package</span> <span class="n">insurance_charges_model</span>
<span class="kn">import</span> <span class="nn">future.keywords.contains</span>
<span class="kn">import</span> <span class="nn">future.keywords.if</span>
<span class="n">customer_is_a_smoker_over_60</span> <span class="k">if</span> <span class="p">{</span>
<span class="nb">input</span><span class="o">.</span><span class="n">model_input</span><span class="o">.</span><span class="n">smoker</span>
<span class="nb">input</span><span class="o">.</span><span class="n">model_input</span><span class="o">.</span><span class="n">age</span> <span class="o">></span> <span class="mi">60</span>
<span class="p">}</span>
<span class="n">allow</span> <span class="o">:=</span> <span class="n">true</span> <span class="k">if</span> <span class="p">{</span>
<span class="ow">not</span> <span class="n">customer_is_a_smoker_over_60</span>
<span class="p">}</span> <span class="k">else</span> <span class="o">:=</span> <span class="n">false</span> <span class="p">{</span>
<span class="n">customer_is_a_smoker_over_60</span>
<span class="p">}</span>
<span class="n">messages</span> <span class="n">contains</span> <span class="n">msg</span> <span class="k">if</span> <span class="p">{</span>
<span class="n">customer_is_a_smoker_over_60</span>
<span class="n">msg</span><span class="o">:=</span> <span class="s2">"Prediction cannot be made if customer is a smoker over the age of 60."</span>
<span class="p">}</span>
<span class="n">BinaryData</span>
<span class="o">====</span>
<span class="n">Events</span><span class="p">:</span> <span class="o"><</span><span class="n">none</span><span class="o">></span>
</code></pre></div>
<p>The contents of the ConfigMap match the contents of the original insurance_charges_model.rego file.</p>
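<p>As a mental model, the two Rego rules are equivalent to the following plain-Python predicate. This is illustrative only; in the deployment, the evaluation happens inside OPA:</p>

```python
def evaluate_policy(model_input):
    """Plain-Python mirror of the Rego rules (illustrative only)."""
    # customer_is_a_smoker_over_60 if { input.model_input.smoker; input.model_input.age > 60 }
    customer_is_a_smoker_over_60 = (
        model_input.get("smoker") is True and model_input.get("age", 0) > 60
    )

    # allow := true if the customer is not a smoker over 60, false otherwise
    allow = not customer_is_a_smoker_over_60

    # messages contains msg if the denial condition holds
    messages = []
    if customer_is_a_smoker_over_60:
        messages.append(
            "Prediction cannot be made if customer is a smoker over the age of 60."
        )
    return {"allow": allow, "messages": messages}
```

<p>Calling it with the request bodies used later in this post returns allow=False with one message for a 62-year-old smoker, and allow=True with no messages for a 42-year-old non-smoker.</p>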
<p>We're almost ready to start the model service, but first we'll need to load the Docker image from the local Docker daemon into the minikube image cache:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">load</span> <span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<p>We can view the images in the minikube cache with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">insurance_charges_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>docker.io/library/insurance_charges_model_service:0.1.0
</code></pre></div>
<p>The model service resources are created within the Kubernetes cluster with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap/model-service-configuration created
deployment.apps/insurance-charges-model-deployment created
service/insurance-charges-model-service created
</code></pre></div>
<p>Let's get the names of the pods that are running the service:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                                                 READY   STATUS    RESTARTS   AGE
insurance-charges-model-deployment-66ff696fd-zbzdv   2/2     Running   0          29s
</code></pre></div>
<p>To make sure the service started up correctly, we'll check the logs of the model service:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">logs</span> <span class="n">insurance</span><span class="o">-</span><span class="n">charges</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">deployment</span><span class="o">-</span><span class="mi">66</span><span class="n">ff696fd</span><span class="o">-</span><span class="n">zbzdv</span> <span class="o">-</span><span class="n">c</span> <span class="n">insurance</span><span class="o">-</span><span class="n">charges</span><span class="o">-</span><span class="n">model</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="o">/</span><span class="n">usr</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">lib</span><span class="o">/</span><span class="n">python3</span><span class="mf">.9</span><span class="o">/</span><span class="n">site</span><span class="o">-</span><span class="n">packages</span><span class="o">/</span><span class="n">tpot</span><span class="o">/</span><span class="n">builtins</span><span class="o">/</span><span class="fm">__init__</span><span class="o">.</span><span class="n">py</span><span class="p">:</span><span class="mi">36</span><span class="p">:</span> <span class="ne">UserWarning</span><span class="p">:</span> <span class="ne">Warning</span><span class="p">:</span> <span class="n">optional</span> <span class="n">dependency</span> <span class="err">`</span><span class="n">torch</span><span class="err">`</span> <span class="ow">is</span> <span class="ow">not</span> <span class="n">available</span><span class="o">.</span> <span class="o">-</span> <span class="n">skipping</span> <span class="kn">import</span> <span class="nn">of</span> <span class="n">NN</span> <span class="n">models</span><span class="o">.</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">warn</span><span class="p">(</span><span class="s2">"Warning: optional dependency `torch` is not available. - skipping import of NN models."</span><span class="p">)</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Started</span> <span class="n">server</span> <span class="n">process</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Waiting</span> <span class="k">for</span> <span class="n">application</span> <span class="n">startup</span><span class="o">.</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Application</span> <span class="n">startup</span> <span class="n">complete</span><span class="o">.</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Uvicorn</span> <span class="n">running</span> <span class="n">on</span> <span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="mf">0.0.0.0</span><span class="p">:</span><span class="mi">8000</span> <span class="p">(</span><span class="n">Press</span> <span class="n">CTRL</span><span class="o">+</span><span class="n">C</span> <span class="n">to</span> <span class="n">quit</span><span class="p">)</span>
</code></pre></div>
<p>Looks like the server process started correctly in the Docker container. The UserWarning is generated when we instantiate the model object, which means everything is running as expected.</p>
<p>We can also view the logs of the OPA service sidecar:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">logs</span> <span class="n">insurance</span><span class="o">-</span><span class="n">charges</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">deployment</span><span class="o">-</span><span class="mi">66</span><span class="n">ff696fd</span><span class="o">-</span><span class="n">zbzdv</span> <span class="o">-</span><span class="n">c</span> <span class="n">opa</span> <span class="o">|</span> <span class="n">head</span> <span class="o">-</span><span class="n">n</span> <span class="mi">5</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"addrs":[":8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server.","time":"2022-09-21T14:09:05Z"}
{"level":"warning","msg":"OPA running with uid or gid 0. Running OPA with root privileges is not recommended. Use the -rootless image to avoid running with root privileges. This will be made the default in later OPA releases.","time":"2022-09-21T14:09:05Z"}
{"client_addr":"172.17.0.1:48928","level":"info","msg":"Received request.","req_id":1,"req_method":"GET","req_path":"/","time":"2022-09-21T14:09:13Z"}
{"client_addr":"172.17.0.1:48928","level":"info","msg":"Sent response.","req_id":1,"req_method":"GET","req_path":"/","resp_bytes":1391,"resp_duration":2.031405,"resp_status":200,"time":"2022-09-21T14:09:13Z"}
{"client_addr":"172.17.0.1:48930","level":"info","msg":"Received request.","req_id":2,"req_method":"GET","req_path":"/health","time":"2022-09-21T14:09:13Z"}
</code></pre></div>
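<p>For debugging, the policy can also be evaluated directly through the OPA sidecar's Data API, bypassing the model service. The sketch below assumes the sidecar is reachable at localhost:8181 (for example through kubectl port-forward); the package path comes from the Rego package declaration:</p>

```python
import json
from urllib.request import Request, urlopen

def query_opa(policy_input, base_url="http://localhost:8181"):
    """Evaluate the policy through OPA's Data API.

    The package path matches the `package insurance_charges_model`
    declaration; the base_url is an assumption about where the
    sidecar is reachable from.
    """
    request = Request(
        f"{base_url}/v1/data/insurance_charges_model",
        data=json.dumps({"input": policy_input}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.loads(response.read())["result"]

# Example input shaped like the decorator's policy query:
policy_input = {"model_input": {"age": 62, "sex": "male", "bmi": 22,
                                "children": 5, "smoker": True,
                                "region": "southwest"}}
```

<p>The returned result dictionary contains the allow and messages documents computed by the policy.</p>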
<p>The deployment and service for the model service were created together. You can see the new service with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                              TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
insurance-charges-model-service   NodePort   10.107.89.237   <none>        80:30468/TCP   59s
</code></pre></div>
<p>Minikube exposes the service on a local port; we can get a link to the endpoint with this command:</p>
<div class="highlight"><pre><span></span><code>minikube service insurance-charges-model-service --url -n model-services
</code></pre></div>
<p>The command output this URL:</p>
<div class="highlight"><pre><span></span><code>http://192.168.59.100:30468
</code></pre></div>
<p>The command must keep running to hold the tunnel open to the model service in the minikube cluster.</p>
<p>To make a prediction, we'll hit the service with a request:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://192.168.59.100:30468/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 62, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">male</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 22, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 5, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">southwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{<span class="s2">"</span><span class="s">messages</span><span class="s2">"</span>:[<span class="s2">"</span><span class="s">Prediction cannot be made if customer is a smoker over the age of 60.</span><span class="s2">"</span>]}
</code></pre></div>
<p>We have the model service up and running in the local minikube cluster!</p>
<p>Looks like the policy was evaluated and the PredictionNotAvailable schema was returned. Let's try it with a request that we know will return a prediction:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://192.168.59.100:30468/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 42, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">male</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 22, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 5, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: false, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">southwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":9762.69}
</code></pre></div>
<p>The service is up and running with the OPA sidecar and the decorator is able to interact with the sidecar correctly to evaluate the policy we created.</p>
<h3>Deleting the Resources</h3>
<p>We're done working with the Kubernetes resources, so we will delete them and shut down the cluster.</p>
<p>To delete the policies ConfigMap, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="n">configmap</span> <span class="n">policies</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap "policies" deleted
</code></pre></div>
<p>To delete the model service pods, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap "model-service-configuration" deleted
deployment.apps "insurance-charges-model-deployment" deleted
service "insurance-charges-model-service" deleted
</code></pre></div>
<p>To delete the model-services namespace, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace "model-services" deleted
resourcequota "model-services-resource-quota" deleted
</code></pre></div>
<p>To shut down the Kubernetes cluster:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">stop</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>✋ Stopping node "minikube" ...
🛑 1 node stopped.
</code></pre></div>
<h2>Closing</h2>
<p>In this blog post we showed how to deploy a machine learning model with a decorator that applies policies to the model's predictions. We built the policy using the Rego language and executed it with the Open Policy Agent. By adding the policy as a decorator, we're able to decouple the model's prediction logic from the policy logic, which makes both components more reusable and easier to test. In fact, the policy decorator can easily be reused in other ML deployments, as long as we write a policy that matches our model's needs. </p>
<p>By writing the policy in an industry-standard language we’re enabling people that don’t have experience with ML or ML deployments to create complex policies that can be deployed alongside an ML model. The person that writes these policies is often a subject matter expert that understands the domain within which the model is working and the effect that the model’s operation will have on it. By using a policy-based approach to the problem of checking ML model predictions we’re able to simplify the deployment process as well, since a policy can be developed and deployed separately from the ML model deployment.</p>
<p>Adding the OPA sidecar to the deployment increased the complexity of the software because we now have to deploy an extra container in the Kubernetes pod to run the policy. This approach also increased the latency of each prediction, since executing the policy requires inter-process communication for every prediction that the model service makes. For both of these reasons, Rego and the Open Policy Agent may not be the ideal choice for every model deployment. In some situations it may be better to write the policy in Python and deploy it as a decorator alongside the model, so that the policy decision adds less time to the total prediction time.</p>Load Tests for ML Models2022-09-01T07:00:00-05:002022-09-01T07:00:00-05:00Brian Schmidttag:www.tekhnoal.com,2022-09-01:/load-tests-for-ml-models.html<p>In a <a href="https://www.tekhnoal.com/rest-model-service.html">previous blog post</a> we showed how to create a RESTful model service for a machine learning model that we want to deploy. A common requirement for RESTful services is to be able to continue working while being used by many users at the same time. In this blog post we'll show how to create a load testing script for an ML model service.</p><h1>Load Tests for ML Models</h1>
<p>In a <a href="https://www.tekhnoal.com/rest-model-service.html">previous blog post</a> we showed how to create a RESTful model service for a machine learning model that we want to deploy. A common requirement for RESTful services is to be able to continue working while being used by many users at the same time. In this blog post we'll show how to create a load testing script for an ML model service.</p>
<p>This blog post was written in a Jupyter notebook; some of the code and commands found in it reflect this.</p>
<h2>Introduction</h2>
<p>Deploying machine learning models is always done in the context of a bigger software system into which the ML model is being integrated. The ML model needs to be integrated correctly, and the deployed model needs to meet the requirements of the system into which it is being deployed. The requirements that a system must meet are often categorized into two types: functional requirements and non-functional requirements. <a href="https://en.wikipedia.org/wiki/Functional_requirement">Functional requirements</a> are the specific behaviors that a system must have in order to do its assigned tasks. <a href="https://en.wikipedia.org/wiki/Non-functional_requirement">Non-functional requirements</a> are the operational standards that the system must meet in order to do its assigned tasks. An example of a non-functional requirement is resilience, which is the quality that allows a system to experience errors in its operation and still provide an acceptable level of service. Non-functional requirements are often hard to measure objectively, but we can definitely tell when they are missing from a system. In this blog post we'll be dealing with non-functional requirements related to load.</p>
<p>Non-functional requirements can be stated by using <a href="https://en.wikipedia.org/wiki/Service_level_indicator">Service Level Indicators (SLI)</a>. An SLI is simply a metric that measures an aspect of the function of the system. For example, the latency of a system is the amount of time it takes for the system to fulfill one request from beginning to end. An SLI needs to be well-defined and understood by both the clients and operators of a system because it forms the basis for service level objectives. Some examples of SLIs are latency, throughput, availability, error rate, and durability.</p>
<p><a href="https://en.wikipedia.org/wiki/Service-level_objective">Service level objectives (SLO)</a> are requirements on the operation of a system as measured through the SLIs of the system. SLOs are defined and agreed-upon ways to tell when a system is operating outside of the required performance standard. For example, when measuring latency a valid SLO would be something like this: "the latency of the system must be 500 ms or less for 90% of requests". When measuring error rates, an SLO might say "the number of errors must not exceed 10 for every 10,000 requests made to the system".</p>
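<p>The percentile calculation behind a latency SLO like the one above can be sketched in a few lines of Python. The latency samples below are hypothetical, purely to illustrate the check:</p>

```python
# Hypothetical latency samples, in milliseconds, for ten requests.
latencies_ms = [120, 340, 95, 450, 490, 220, 130, 480, 180, 250]

# SLO: "the latency of the system must be 500 ms or less for 90% of requests".
# We check it by computing the 90th-percentile latency (nearest-rank method).
latencies_sorted = sorted(latencies_ms)
rank = max(1, round(0.9 * len(latencies_sorted)))
p90_latency = latencies_sorted[rank - 1]

slo_met = p90_latency <= 500.0
```

<p>A monitoring system would compute the same percentile over a rolling window of real request latencies rather than a fixed list.</p>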
<p><a href="https://en.wikipedia.org/wiki/Service-level_agreement">Service Level Agreements (SLA)</a> are an agreement between a system and its clients about the "level" at which the system will provide its services. SLAs can contain many different types of clauses, the ones we are interested today are the non-functional aspects of the system as measured by SLIs and constrained by SLOs. </p>
<p>Load testing is the process by which we verify that an ML model deployed as a service is able to meet the SLA of the service while under load. Some of the SLIs that we will be measuring are latency, throughput, and error rate.</p>
<p>All of the code for this blog post is available in <a href="https://github.com/schmidtbri/load-tests-for-ml-models">this github repository</a>.</p>
<h2>Installing the Model</h2>
<p>To make this blog post a little shorter we won't train a completely new model. Instead we'll install a model that we've <a href="https://www.tekhnoal.com/regression-model.html">built in a previous blog post</a>. The code for the model is in <a href="https://github.com/schmidtbri/regression-model">this github repository</a>.</p>
<p>To install the model, we can use the pip command and point it at the github repo of the model.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">Markdown</span> <span class="k">as</span> <span class="n">md</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">e</span> <span class="n">git</span><span class="o">+</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">schmidtbri</span><span class="o">/</span><span class="n">regression</span><span class="o">-</span><span class="n">model</span><span class="c1">#egg=insurance_charges_model</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make a prediction with the model, we'll import the model's class.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.model</span> <span class="kn">import</span> <span class="n">InsuranceChargesModel</span>
</code></pre></div>
<p>Now we can instantiate the model:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">InsuranceChargesModel</span><span class="p">()</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make a prediction, we'll need to use the model's input schema class.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.schemas</span> <span class="kn">import</span> <span class="n">InsuranceChargesModelInput</span><span class="p">,</span> \
<span class="n">SexEnum</span><span class="p">,</span> <span class="n">RegionEnum</span>
<span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
</code></pre></div>
<p>The model's input schema is called InsuranceChargesModelInput and it encompasses all of the features required by the model to make a prediction.</p>
<p>Now we can make a prediction with the model by calling the predict() method with an instance of the InsuranceChargesModelInput class.</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=8640.78)
</code></pre></div>
<p>The model predicts that the charges will be $8640.78.</p>
<p>We can view the input schema of the model as a JSON schema document by calling the .schema() method on the class.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'InsuranceChargesModelInput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Schema for input of the model's predict method."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'age'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age of primary beneficiary in years.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">65</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'sex'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Sex'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Gender of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/SexEnum'</span><span class="p">}]},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'bmi'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body Mass Index'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body mass index of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mf">15.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mf">50.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'number'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'children'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Children'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Number of children covered by health insurance.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'smoker'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Smoker'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Whether beneficiary is a smoker.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'boolean'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'region'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region where beneficiary lives.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/RegionEnum'</span><span class="p">}]}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'definitions'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'SexEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'SexEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'sex' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'male'</span><span class="p">,</span><span class="w"> </span><span class="s1">'female'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'region' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'southwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'southeast'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northeast'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}}}</span><span class="w"></span>
</code></pre></div>
<p>We'll make use of the model's input schema to create the load testing script.</p>
<h2>Profiling the Model</h2>
<p>In order to get an idea of how much time it takes for our model to make a prediction, we'll profile it by making predictions with random data. To do this, we'll use the <a href="https://faker.readthedocs.io/en/master/">Faker package</a>. We can install it with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">Faker</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>We'll create a function that can generate a random sample that meets the model's input schema:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">faker</span> <span class="kn">import</span> <span class="n">Faker</span>
<span class="n">faker</span> <span class="o">=</span> <span class="n">Faker</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">generate_record</span><span class="p">()</span> <span class="o">-></span> <span class="n">InsuranceChargesModelInput</span><span class="p">:</span>
<span class="n">record</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"age"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">18</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">65</span><span class="p">),</span>
<span class="s2">"sex"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_choices</span><span class="p">(</span><span class="n">elements</span><span class="o">=</span><span class="p">(</span><span class="s2">"male"</span><span class="p">,</span> <span class="s2">"female"</span><span class="p">),</span> <span class="n">length</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">],</span>
<span class="s2">"bmi"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">15000</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">50000</span><span class="p">)</span><span class="o">/</span><span class="mf">1000.0</span><span class="p">,</span>
<span class="s2">"children"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span>
<span class="s2">"smoker"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">boolean</span><span class="p">(),</span>
<span class="s2">"region"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_choices</span><span class="p">(</span><span class="n">elements</span><span class="o">=</span><span class="p">(</span><span class="s2">"southwest"</span><span class="p">,</span> <span class="s2">"southeast"</span><span class="p">,</span> <span class="s2">"northwest"</span><span class="p">,</span> <span class="s2">"northeast"</span><span class="p">),</span> <span class="n">length</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span><span class="o">**</span><span class="n">record</span><span class="p">)</span>
</code></pre></div>
<p>The function returns an instance of the InsuranceChargesModelInput class, which is the type required by the model's predict() method. We'll use this function to profile the predict() method of the model.</p>
<p>It's really hard to get a complete picture of the model's performance from one sample, so we'll run the test with many random samples. To start, we'll generate 1000 samples and save them:</p>
<div class="highlight"><pre><span></span><code><span class="n">samples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
<span class="n">samples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">generate_record</span><span class="p">())</span>
</code></pre></div>
<p>By using the timeit module from the standard library, we can measure how much time it takes to call the model's predict method with a random sample. We'll make 1000 predictions.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">timeit</span>
<span class="n">total_seconds</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="s2">"[model.predict(sample) for sample in samples]"</span><span class="p">,</span>
<span class="n">number</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="nb">globals</span><span class="o">=</span><span class="nb">globals</span><span class="p">())</span>
<span class="n">seconds_per_sample</span> <span class="o">=</span> <span class="n">total_seconds</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">milliseconds_per_sample</span> <span class="o">=</span> <span class="n">seconds_per_sample</span> <span class="o">*</span> <span class="mf">1000.0</span>
</code></pre></div>
<p>The model took 31.74 seconds to perform 1000 predictions, so a single prediction took about 31.74 milliseconds (0.032 seconds).</p>
<p>We now have enough information to establish an SLO for the model itself. An acceptable amount of time for the model to make a prediction is 100 ms (this is made up for the sake of the example). Based on the results from the test above, we're pretty sure that the model meets this standard. However, we want to write the requirement directly into the code of the notebook. To do this in a notebook cell, we can simply write an assert statement which checks for the condition:</p>
<div class="highlight"><pre><span></span><code><span class="k">assert</span> <span class="n">milliseconds_per_sample</span> <span class="o"><</span> <span class="mi">100</span><span class="p">,</span> <span class="s2">"Model does not meet the latency SLO."</span>
</code></pre></div>
<p>The assertion above did not fail, so the model meets the requirement. This is an example of a way to encode an SLO for the model so that it is checked programmatically. We can add code like this to the training code of a model so that we always check the SLO right after a model is trained. If the requirement is not met, the assert statement will cause the notebook to stop executing immediately.</p>
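<p>A mean latency can hide outliers, so it is often useful to record per-prediction timings and compute percentiles as well. The sketch below uses a trivial stand-in for the model's predict() method (the real model call would replace <code>predict_fn</code>), since the timing logic is the same either way:</p>

```python
import time

def predict_fn(sample):
    # Stand-in for model.predict(sample); replace with the real model call.
    return sum(sample)

samples = [[1.0, 2.0, 3.0]] * 100

# Record the latency of each call individually instead of only the total.
latencies_ms = []
for sample in samples:
    start = time.perf_counter()
    predict_fn(sample)
    latencies_ms.append((time.perf_counter() - start) * 1000.0)

latencies_ms.sort()
p50_ms = latencies_ms[len(latencies_ms) // 2]
p90_ms = latencies_ms[int(len(latencies_ms) * 0.9) - 1]
```

<p>An SLO assertion written against the 90th-percentile latency would then catch tail-latency regressions that a check on the mean might miss.</p>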
<p>We've profiled the model and this gave us some information about its performance, but a real load test can only be performed on the model once it is deployed. In the real world, users of the model will access it concurrently, while in the example above the model made predictions serially for a single caller. The model was also running in the local memory of the computer, whereas in a real deployment a RESTful service would wrap the model and it would be accessed over the network.</p>
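<p>The effect of concurrency is easy to see with a small simulation. The sketch below uses a fake request function that simply sleeps for 10 ms, standing in for a network call to the model service; with ten simulated users the same 50 requests finish much sooner than they do serially:</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(sample):
    # Stand-in for an HTTP call to the model service.
    time.sleep(0.01)
    return {"charges": 0.0}

samples = list(range(50))

# Serial execution: one request at a time, as in the profiling example.
start = time.perf_counter()
serial_results = [fake_request(s) for s in samples]
serial_seconds = time.perf_counter() - start

# Concurrent execution: ten simulated users sending requests at once.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    concurrent_results = list(pool.map(fake_request, samples))
concurrent_seconds = time.perf_counter() - start
```

<p>This is exactly the gap that a load testing tool like locust explores: throughput and latency under concurrent access, not just serial execution speed.</p>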
<h2>Creating the Model Service</h2>
<p>Now that we have profiled the model, we can deploy the model inside of a RESTful service and do a load test on it. To do this, we'll use the <a href="https://pypi.org/project/rest-model-service/">rest_model_service package</a> to quickly create a RESTful service. You can learn more about this package in <a href="https://www.tekhnoal.com/rest-model-service.html">this blog post</a>.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">rest_model_service</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To create a service for our model, all we need to do is add a YAML configuration file to the project. The configuration file looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Insurance Charges Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model.prediction.model.InsuranceChargesModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
</code></pre></div>
<p>This YAML file is in the "configuration" folder of the project repository.</p>
<p>The service_title field is the name of the service as it will appear in the documentation. The models field is an array that contains the details of the models we would like to deploy in the service. The class_path points at the MLModel class that implements the model's prediction logic; in this case we'll be using the same model as in the examples above.</p>
<p>To run the service locally, execute these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>./configuration/local_rest_config.yaml
uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>The service should come up and can be accessed in a web browser at http://127.0.0.1:8000. When you access that URL using a web browser you will be redirected to the documentation page that is generated by the FastAPI package. The documentation looks like this:</p>
<p><img alt="Service Documentation" src="https://www.tekhnoal.com/service_documentation_ltfmlm.png" width="100%"></p>
<p>As you can see, the Insurance Charges Model got its own endpoint.</p>
<p>We can try out the service with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 42, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">female</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 24.0, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 2, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: false, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">northwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":8640.78}
</code></pre></div>
<p>By accessing the model's endpoint we were able to make a prediction, and we got exactly the same result as when we called the model directly in the example above.</p>
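<p>The same request can be made from Python. The payload below matches the curl example; the actual HTTP call (shown commented out, using the requests package) assumes the service is running locally:</p>

```python
import json

# Payload matching the model's input schema, same values as the curl example.
payload = {
    "age": 42,
    "sex": "female",
    "bmi": 24.0,
    "children": 2,
    "smoker": False,
    "region": "northwest",
}

body = json.dumps(payload)

# With the service running at http://127.0.0.1:8000, the prediction could be
# requested like this (requires the requests package, not executed here):
# import requests
# response = requests.post(
#     "http://127.0.0.1:8000/api/models/insurance_charges_model/prediction",
#     json=payload,
# )
# print(response.json())
```
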
<p>By using the MLModel base class provided by the ml_base package and the REST service framework provided by the rest_model_service package we're able to quickly stand up a service to host the model. </p>
<h2>Creating a Load Testing Script</h2>
<p>To create a load testing script, we'll use the <a href="https://locust.io/">locust package</a>. We'll install the package with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">locust</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>In order to run a load test with locust, we need to define what requests locust will make to the model service. To do this we define a subclass of HttpUser.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">locust</span> <span class="kn">import</span> <span class="n">HttpUser</span><span class="p">,</span> <span class="n">constant_throughput</span><span class="p">,</span> <span class="n">task</span>
<span class="kn">from</span> <span class="nn">faker</span> <span class="kn">import</span> <span class="n">Faker</span>
<span class="k">class</span> <span class="nc">ModelServiceUser</span><span class="p">(</span><span class="n">HttpUser</span><span class="p">):</span>
<span class="n">wait_time</span> <span class="o">=</span> <span class="n">constant_throughput</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="nd">@task</span>
<span class="k">def</span> <span class="nf">post_prediction</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">faker</span> <span class="o">=</span> <span class="n">Faker</span><span class="p">()</span>
<span class="n">record</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"age"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">18</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">65</span><span class="p">),</span>
<span class="s2">"sex"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_choices</span><span class="p">(</span><span class="n">elements</span><span class="o">=</span><span class="p">(</span><span class="s2">"male"</span><span class="p">,</span> <span class="s2">"female"</span><span class="p">),</span> <span class="n">length</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">],</span>
<span class="s2">"bmi"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">15000</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">50000</span><span class="p">)</span> <span class="o">/</span> <span class="mf">1000.0</span><span class="p">,</span>
<span class="s2">"children"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span>
<span class="s2">"smoker"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">boolean</span><span class="p">(),</span>
<span class="s2">"region"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_choices</span><span class="p">(</span>
<span class="n">elements</span><span class="o">=</span><span class="p">(</span><span class="s2">"southwest"</span><span class="p">,</span> <span class="s2">"southeast"</span><span class="p">,</span> <span class="s2">"northwest"</span><span class="p">,</span> <span class="s2">"northeast"</span><span class="p">),</span> <span class="n">length</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="p">}</span>
<span class="bp">self</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="s2">"/api/models/insurance_charges_model/prediction"</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="n">record</span><span class="p">)</span>
</code></pre></div>
<p>The class above makes a single request to the prediction endpoint in the model service, generating a random sample using the same code that we used to profile the model above. The load test consists of a single task that will be executed over and over, but we can easily add other tasks if we want to exercise the model in different ways. The wait_time attribute of the class is set to a constant throughput of 1, which means that each task will be executed at most one time per second by each simulated user in the load test. We can use this throughput and the number of concurrent users to create a realistic load test profile.</p>
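<p>As a quick sanity check on that arithmetic, the expected request rate is simply the product of the user count and the per-user throughput. The sketch below is purely illustrative; the expected_rps helper is our own and not part of the locust API:</p>

```python
# constant_throughput(1) caps each simulated user at one task per
# second, so the swarm's expected request rate is the product of the
# user count and the per-user throughput. This helper is only an
# illustration of that arithmetic, not part of the locust API.
def expected_rps(users, per_user_throughput=1.0):
    """Requests per second the swarm generates, ignoring response latency."""
    return users * per_user_throughput

print(expected_rps(1))   # one user at one request per second -> 1.0
print(expected_rps(10))  # ten users -> 10.0 requests per second
```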
<p>The code above is saved in the load_test.py file in the tests folder in the repository. We can launch a load test with this command:</p>
<div class="highlight"><pre><span></span><code>locust -f tests/load_test.py
</code></pre></div>
<p>The load test process starts up a web app that can be accessed locally on http://127.0.0.1:8089.</p>
<p><img alt="Locust UI" src="https://www.tekhnoal.com/locust_ui_ltfmlm.png" width="100%"></p>
<p>To start a load test, the locust web app asks for the number of users to simulate, the spawn rate of users, and the base url of the service to send requests to. We set the number of users to 1, the spawn rate to 1 per second, and the url to the service instance that is currently running on the local host.</p>
<p>When we click on the "Start swarming" button, the load test starts and we can see this screen:</p>
<p><img alt="Locust Load Test" src="https://www.tekhnoal.com/locust_load_test_ltfmlm.png" width="100%"></p>
<p>The load test is running and sending requests to the model service at the rate of one request per second from one user. The web UI also shows some charts in a separate tab in the UI, for example the total requests per second:</p>
<p><img alt="Total Requests Per Second" src="https://www.tekhnoal.com/total_requests_per_second_ltfmlm.png" width="100%"></p>
<p>The response time in milliseconds:</p>
<p><img alt="Response Time in Milliseconds" src="https://www.tekhnoal.com/response_time_in_milliseconds_ltfmlm.png" width="100%"></p>
<p>And the number of users:</p>
<p><img alt="Number Of Users" src="https://www.tekhnoal.com/number_of_users_ltfmlm.png" width="100%"></p>
<p>When we're ready to stop the load test, we can click on the "Stop" button in the upper right corner.</p>
<p>Determining whether the model service meets the SLO is as simple as inspecting the "Statistics" tab.</p>
<p><img alt="Statistics Tab" src="https://www.tekhnoal.com/statistics_ltfmlm.png" width="100%"></p>
<p>We can see that the maximum latency of the prediction requests was 122 milliseconds, which does not meet our SLO of 100 ms. However, the maximum is often a noisy measurement because it can be affected by many environmental factors. It's better to use the 90th or 99th percentile. In this case the 99th percentile is 89 ms, which does meet our SLO.</p>
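<p>To see why a tail percentile is a steadier measurement than the max, here is a small standalone sketch using the nearest-rank percentile method; the latency samples are made up for illustration:</p>

```python
# Nearest-rank percentile over a list of observed response times (in
# milliseconds). The sample latencies below are made up; a real load
# test would collect thousands of them.
def percentile(samples, p):
    """Return the value at percentile p (0-100) using the nearest-rank method."""
    ordered = sorted(samples)
    k = max(0, round(p / 100.0 * len(ordered)) - 1)
    return ordered[k]

latencies = [52, 60, 61, 63, 70, 72, 75, 80, 89, 122]
print(max(latencies))             # 122 - dominated by a single outlier
print(percentile(latencies, 90))  # 89 - a steadier tail measurement
```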
<p>This load test is not very realistic because it only has one concurrent user. In the next load tests, we'll add more concurrent users to make it more realistic.</p>
<h2>Adding Shape to the Load Test</h2>
<p>Right now the load test script simulates one concurrent user making one request to the service per second. This is a good place to start, but we should test the service with more users. The load test is also designed to run indefinitely with the same number of users. We'll add "shape" to the load test by raising the number of users over time and then lowering the number of users back down. This will show us the performance of the service under many load conditions. We'll also stop the load test once the load returns to the baseline, which will help us to automate the load test later.</p>
<p>To add a "shape" to the load test, we'll add a class that is a subclass of LoadTestShape to the load test file:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">locust</span> <span class="kn">import</span> <span class="n">LoadTestShape</span>
<span class="k">class</span> <span class="nc">StagesShape</span><span class="p">(</span><span class="n">LoadTestShape</span><span class="p">):</span>
<span class="sd">"""Simple load test shape class."""</span>
<span class="n">stages</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s2">"duration"</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span> <span class="s2">"users"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s2">"spawn_rate"</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"duration"</span><span class="p">:</span> <span class="mi">60</span><span class="p">,</span> <span class="s2">"users"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> <span class="s2">"spawn_rate"</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"duration"</span><span class="p">:</span> <span class="mi">90</span><span class="p">,</span> <span class="s2">"users"</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span> <span class="s2">"spawn_rate"</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"duration"</span><span class="p">:</span> <span class="mi">120</span><span class="p">,</span> <span class="s2">"users"</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span> <span class="s2">"spawn_rate"</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"duration"</span><span class="p">:</span> <span class="mi">150</span><span class="p">,</span> <span class="s2">"users"</span><span class="p">:</span> <span class="mi">5</span><span class="p">,</span> <span class="s2">"spawn_rate"</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"duration"</span><span class="p">:</span> <span class="mi">180</span><span class="p">,</span> <span class="s2">"users"</span><span class="p">:</span> <span class="mi">4</span><span class="p">,</span> <span class="s2">"spawn_rate"</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"duration"</span><span class="p">:</span> <span class="mi">210</span><span class="p">,</span> <span class="s2">"users"</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span> <span class="s2">"spawn_rate"</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"duration"</span><span class="p">:</span> <span class="mi">240</span><span class="p">,</span> <span class="s2">"users"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span> <span class="s2">"spawn_rate"</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"duration"</span><span class="p">:</span> <span class="mi">270</span><span class="p">,</span> <span class="s2">"users"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s2">"spawn_rate"</span><span class="p">:</span> <span class="mi">1</span><span class="p">}</span>
<span class="p">]</span>
<span class="k">def</span> <span class="nf">tick</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">run_time</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">get_run_time</span><span class="p">()</span>
<span class="k">for</span> <span class="n">stage</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">stages</span><span class="p">:</span>
<span class="k">if</span> <span class="n">run_time</span> <span class="o"><</span> <span class="n">stage</span><span class="p">[</span><span class="s2">"duration"</span><span class="p">]:</span>
<span class="n">tick_data</span> <span class="o">=</span> <span class="p">(</span><span class="n">stage</span><span class="p">[</span><span class="s2">"users"</span><span class="p">],</span> <span class="n">stage</span><span class="p">[</span><span class="s2">"spawn_rate"</span><span class="p">])</span>
<span class="k">return</span> <span class="n">tick_data</span>
<span class="c1"># returning None to stop the load test</span>
<span class="k">return</span> <span class="kc">None</span>
</code></pre></div>
<p>The tick() method is called once per second by the locust framework to determine how many users are needed and how fast to spawn them. It looks up the desired number of users and spawn rate in the stages list, simply iterating through the list until it finds the correct stage based on the number of seconds elapsed since the beginning of the load test. We defined 9 stages in the stages list, each lasting 30 seconds, and the maximum number of concurrent users will be 5.</p>
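<p>The lookup logic is easy to verify on its own. This standalone sketch reproduces it with a shortened three-stage list:</p>

```python
# A standalone reproduction of the stage-lookup logic in tick(), using
# a shortened stages list. Durations are cumulative: the first stage is
# active until 30 seconds, the second until 60, and so on.
stages = [
    {"duration": 30, "users": 1, "spawn_rate": 1},
    {"duration": 60, "users": 2, "spawn_rate": 1},
    {"duration": 90, "users": 3, "spawn_rate": 1},
]

def users_for(run_time, stages):
    """Return (users, spawn_rate) for the elapsed run_time, or None to stop."""
    for stage in stages:
        if run_time < stage["duration"]:
            return stage["users"], stage["spawn_rate"]
    return None  # past the last stage, so the load test stops

print(users_for(45, stages))   # (2, 1): the second stage covers 30-60 seconds
print(users_for(100, stages))  # None: past the last stage
```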
<p>To run the load test, simply execute the same command as above:</p>
<div class="highlight"><pre><span></span><code>locust -f tests/load_test.py
</code></pre></div>
<p>The load test will start when we press the "Start swarming" button, as before. However, this load test will vary the number of users according to the shape defined in the class. Since the number of users and the spawn rate are determined by the shape class, we don't need to provide these values to start the load test.</p>
<p>The load test runs for four and a half minutes and the number of users chart looks like this:</p>
<p><img alt="Number Of Users" src="https://www.tekhnoal.com/shaped_number_of_users_ltfmlm.png" width="100%"></p>
<p>The response time chart looks like this:</p>
<p><img alt="Response Time" src="https://www.tekhnoal.com/shaped_response_time_ltfmlm.png" width="100%"></p>
<p>The response time of the service definitely suffered when the number of users went above one, and the maximum response time of the service was 225 ms. It looks like a single instance of the model service cannot handle much more than one concurrent user making one request per second.</p>
<p>The requests per second chart looks like this:</p>
<p><img alt="Requests Per Second" src="https://www.tekhnoal.com/shaped_requests_per_second_ltfmlm.png" width="100%"></p>
<p>The number of requests per second scaled with the number of users because we're making one request per second per user.</p>
<h2>Adding Service Level Objectives</h2>
<p>Right now, the load test script simply runs the load test and displays the results on a webpage. However, we can make it more useful by adding support for SLOs. For example, we can have the load test fail if the latency of any request is above a certain threshold, or if the average latency of all requests is above a certain threshold.</p>
<p>We'll add support for checking the following SLOs:
- latency, we'll check that the latency at the 99th percentile is less than 100 ms
- error rate, we'll check that there are no errors returned on any request
- throughput, we'll check that the service can handle at least 5 requests per second</p>
<p>To do this we'll add a listener function that receives events from the locust package:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
<span class="nd">@events</span><span class="o">.</span><span class="n">test_stop</span><span class="o">.</span><span class="n">add_listener</span>
<span class="k">def</span> <span class="nf">on_test_stop</span><span class="p">(</span><span class="n">environment</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="n">process_exit_code</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">max_requests_per_second</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span>
<span class="p">[</span><span class="n">requests_per_second</span> <span class="k">for</span> <span class="n">requests_per_second</span> <span class="ow">in</span> <span class="n">environment</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">total</span><span class="o">.</span><span class="n">num_reqs_per_sec</span><span class="o">.</span><span class="n">values</span><span class="p">()])</span>
<span class="k">if</span> <span class="n">environment</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">total</span><span class="o">.</span><span class="n">fail_ratio</span> <span class="o">></span> <span class="mf">0.0</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">"Test failed because there was one or more errors."</span><span class="p">)</span>
<span class="n">process_exit_code</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">environment</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">total</span><span class="o">.</span><span class="n">get_response_time_percentile</span><span class="p">(</span><span class="mf">0.99</span><span class="p">)</span> <span class="o">></span> <span class="mi">100</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="p">(</span><span class="s2">"Test failed because the response time at the 99th percentile was above 100 ms. The 99th "</span>
<span class="s2">"percentile latency is '</span><span class="si">{}</span><span class="s2">'."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">environment</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">total</span><span class="o">.</span><span class="n">get_response_time_percentile</span><span class="p">(</span><span class="mf">0.99</span><span class="p">)))</span>
<span class="n">process_exit_code</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">max_requests_per_second</span> <span class="o"><</span> <span class="mi">5</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">error</span><span class="p">(</span>
<span class="s2">"Test failed because the max requests per second never reached 5. The max requests per second "</span>
<span class="s2">"is: '</span><span class="si">{}</span><span class="s2">'."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">max_requests_per_second</span><span class="p">))</span>
<span class="n">process_exit_code</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">environment</span><span class="o">.</span><span class="n">process_exit_code</span> <span class="o">=</span> <span class="n">process_exit_code</span>
</code></pre></div>
<p>The on_test_stop function executes at the end of every load test. The function has access to all of the statistics gathered during the load test, so we can check each SLO condition against them. If any of the SLOs are not met, we set the process exit code to 1, which signals a failure to the operating system.</p>
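<p>The way the checks combine into a single exit code can be sketched as a pure function, which is also easy to unit test. The thresholds below match the SLOs defined above; the slo_exit_code helper is our own illustration, not part of the load test file:</p>

```python
# A distilled version of the SLO checks in the listener above: any
# failed check flips the exit code to 1, and it stays 1 regardless of
# how many other checks pass. The thresholds match the SLOs defined
# earlier in the post.
def slo_exit_code(fail_ratio, p99_ms, max_rps):
    """Return 0 if all SLOs are met, 1 otherwise."""
    exit_code = 0
    if fail_ratio > 0.0:   # error-rate SLO: no failed requests allowed
        exit_code = 1
    if p99_ms > 100:       # latency SLO: 99th percentile under 100 ms
        exit_code = 1
    if max_rps < 5:        # throughput SLO: at least 5 requests per second
        exit_code = 1
    return exit_code

print(slo_exit_code(0.0, 89, 5))   # 0: every SLO met
print(slo_exit_code(0.0, 180, 5))  # 1: latency SLO violated
```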
<p>To run the load test, execute the same command as above. When the load test finishes, the process will output the results to the command line. In this case the load test failed with this output:</p>
<div class="highlight"><pre><span></span><code>Test failed because the response <span class="nb">time</span> at the 99th percentile was above <span class="m">100</span> ms. The 99th percentile latency is <span class="s1">'180.0'</span>.
</code></pre></div>
<h2>Running a Headless Load Test</h2>
<p>The locust package can also run a load test without the web UI. This is useful for automated load tests that run on a server, without anyone watching the UI. The command is:</p>
<div class="highlight"><pre><span></span><code>locust -f tests/load_test.py --host<span class="o">=</span>http://127.0.0.1:8000 --headless --loglevel ERROR --csv<span class="o">=</span>./load_test_report/load_test --html ./load_test_report/load_test_report.html
</code></pre></div>
<p>Once the test finishes, we see the same error as above because the load test did not meet the required latency SLO. The error message is:</p>
<div class="highlight"><pre><span></span><code>Test failed because the response time at the 99th percentile was above 100 ms. The 99th percentile latency is '180.0'.
</code></pre></div>
<p>All of the code for the load test script is found in the "tests/load_test.py" file in the repository for this blog post. The results are stored in CSV files and an HTML file in the "load_test_report" folder.</p>
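<p>The CSV report can be post-processed without re-running the test. The sketch below parses an in-memory sample that mimics the "&lt;prefix&gt;_stats.csv" file that locust writes for the --csv option; note that the exact column names can vary between locust versions, and the sample rows are made up for illustration:</p>

```python
# Parse the aggregated row from a locust stats CSV report. The sample
# below stands in for the "<prefix>_stats.csv" file written by the
# --csv option; column names can vary between locust versions.
import csv
import io

sample_report = io.StringIO(
    '"Type","Name","Request Count","Failure Count","99%"\n'
    '"POST","/api/models/insurance_charges_model/prediction","270","0","180"\n'
    '"","Aggregated","270","0","180"\n'
)

reader = csv.DictReader(sample_report)
aggregated = next(row for row in reader if row["Name"] == "Aggregated")
print(aggregated["99%"])  # 180: the 99th percentile latency in milliseconds
```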
<h2>Building a Docker Image</h2>
<p>Now that we have a working model and model service, we'll need to deploy it somewhere. We'll start by deploying the service locally using Docker. </p>
<p>Let's create a docker image and run it locally. The docker image is generated using instructions in the Dockerfile:</p>
<div class="highlight"><pre><span></span><code><span class="c"># syntax=docker/dockerfile:1</span>
<span class="k">FROM</span><span class="w"> </span><span class="s">python:3.9-slim</span>
<span class="k">ARG</span><span class="w"> </span>BUILD_DATE
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.title<span class="o">=</span><span class="s2">"Load Tests for ML Models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.description<span class="o">=</span><span class="s2">"Load tests for machine learning models."</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.created<span class="o">=</span><span class="nv">$BUILD_DATE</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.authors<span class="o">=</span><span class="s2">"6666331+schmidtbri@users.noreply.github.com"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.source<span class="o">=</span><span class="s2">"https://github.com/schmidtbri/load-tests-for-ml-models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.version<span class="o">=</span><span class="s2">"0.1.0"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.licenses<span class="o">=</span><span class="s2">"MIT License"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.base.name<span class="o">=</span><span class="s2">"python:3.9-slim"</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">/service</span>
<span class="k">ARG</span><span class="w"> </span><span class="nv">USERNAME</span><span class="o">=</span>service-user
<span class="k">ARG</span><span class="w"> </span><span class="nv">USER_UID</span><span class="o">=</span><span class="m">1000</span>
<span class="k">ARG</span><span class="w"> </span><span class="nv">USER_GID</span><span class="o">=</span><span class="m">1000</span>
<span class="k">RUN</span><span class="w"> </span>apt-get update
<span class="c"># create a user</span>
<span class="k">RUN</span><span class="w"> </span>groupadd --gid <span class="nv">$USER_GID</span> <span class="nv">$USERNAME</span> <span class="se">\</span>
<span class="o">&&</span> useradd --uid <span class="nv">$USER_UID</span> --gid <span class="nv">$USER_GID</span> -m <span class="nv">$USERNAME</span> <span class="se">\</span>
<span class="o">&&</span> apt-get install --assume-yes --no-install-recommends sudo <span class="se">\</span>
<span class="o">&&</span> <span class="nb">echo</span> <span class="nv">$USERNAME</span> <span class="nv">ALL</span><span class="o">=</span><span class="se">\(</span>root<span class="se">\)</span> NOPASSWD:ALL > /etc/sudoers.d/<span class="nv">$USERNAME</span> <span class="se">\</span>
<span class="o">&&</span> chmod <span class="m">0440</span> /etc/sudoers.d/<span class="nv">$USERNAME</span>
<span class="c"># installing git because we need to install the model package from it's own github repository</span>
<span class="k">RUN</span><span class="w"> </span>apt-get install --assume-yes --no-install-recommends git
<span class="k">RUN</span><span class="w"> </span>apt-get clean <span class="se">\</span>
<span class="o">&&</span> rm -rf /var/lib/apt/lists/*
<span class="c"># installing dependencies first in order to speed up build by using cached layers</span>
<span class="k">COPY</span><span class="w"> </span>./service_requirements.txt ./service_requirements.txt
<span class="k">RUN</span><span class="w"> </span>pip install -r service_requirements.txt
<span class="k">COPY</span><span class="w"> </span>./configuration ./configuration
<span class="k">COPY</span><span class="w"> </span>./LICENSE ./LICENSE
<span class="k">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">"uvicorn"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rest_model_service.main:app"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--host"</span><span class="p">,</span><span class="w"> </span><span class="s2">"0.0.0.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--port"</span><span class="p">,</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">]</span>
<span class="k">USER</span><span class="w"> </span><span class="s">$USERNAME</span>
</code></pre></div>
<p>The Dockerfile is used by this docker command to create a docker image:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">build</span> <span class="o">-</span><span class="n">t</span> <span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="n">latest</span> <span class="o">../</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make sure everything worked as expected, we'll look through the docker images in our system:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">insurance_charges_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>insurance_charges_model_service latest 446f5f06805f 37 seconds ago 1.25GB
</code></pre></div>
<p>Next, we'll start the image to see if everything is working as expected.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">8000</span><span class="p">:</span><span class="mi">8000</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">REST_CONFIG</span><span class="o">=./</span><span class="n">configuration</span><span class="o">/</span><span class="n">local_rest_config</span><span class="o">.</span><span class="n">yaml</span> \
<span class="o">--</span><span class="n">name</span> <span class="n">insurance_charges_model_service</span> \
<span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="n">latest</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">44</span><span class="n">c4794160f941e44d1670b70c7fd5722c41bf0c2e470a0b0c8648c966b9923b</span><span class="w"></span>
</code></pre></div>
<p>The service should be accessible on port 8000 of localhost, so we'll try to make a prediction:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 42, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">female</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 24.0, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 2, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: false, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">northwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":8640.78}
</code></pre></div>
<p>We'll use the model service Docker image to deploy the model service and automate the load test later.</p>
<p>Now that we're done with the local instance of the model service, we'll stop and remove the docker container.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">insurance_charges_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">insurance_charges_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>insurance_charges_model_service
insurance_charges_model_service
</code></pre></div>
<h2>Deploying the Model Service</h2>
<p>To show the system in action, we'll deploy the model service to a Kubernetes cluster. A local cluster can be easily started by using <a href="https://minikube.sigs.k8s.io/docs/">minikube</a>. Installation instructions can be found <a href="https://minikube.sigs.k8s.io/docs/start/">here</a>.</p>
<p>To start the minikube cluster execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">start</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>😄 <span class="nv">minikube</span> <span class="nv">v1</span>.<span class="mi">26</span>.<span class="mi">1</span> <span class="nv">on</span> <span class="nv">Darwin</span> <span class="mi">12</span>.<span class="mi">5</span>
✨ <span class="nv">Using</span> <span class="nv">the</span> <span class="nv">virtualbox</span> <span class="nv">driver</span> <span class="nv">based</span> <span class="nv">on</span> <span class="nv">existing</span> <span class="nv">profile</span>
👍 <span class="nv">Starting</span> <span class="nv">control</span> <span class="nv">plane</span> <span class="nv">node</span> <span class="nv">minikube</span> <span class="nv">in</span> <span class="nv">cluster</span> <span class="nv">minikube</span>
🔄 <span class="nv">Restarting</span> <span class="nv">existing</span> <span class="nv">virtualbox</span> <span class="nv">VM</span> <span class="k">for</span> <span class="s2">"</span><span class="s">minikube</span><span class="s2">"</span> ...
🐳 <span class="nv">Preparing</span> <span class="nv">Kubernetes</span> <span class="nv">v1</span>.<span class="mi">24</span>.<span class="mi">3</span> <span class="nv">on</span> <span class="nv">Docker</span> <span class="mi">20</span>.<span class="mi">10</span>.<span class="mi">17</span> ...[<span class="nv">K</span>[<span class="nv">K</span>[<span class="nv">K</span>[<span class="nv">K</span>
▪ <span class="nv">controller</span><span class="o">-</span><span class="nv">manager</span>.<span class="nv">horizontal</span><span class="o">-</span><span class="nv">pod</span><span class="o">-</span><span class="nv">autoscaler</span><span class="o">-</span><span class="nv">sync</span><span class="o">-</span><span class="nv">period</span><span class="o">=</span><span class="mi">5</span><span class="nv">s</span>
🔎 <span class="nv">Verifying</span> <span class="nv">Kubernetes</span> <span class="nv">components</span>...
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">k8s</span>.<span class="nv">gcr</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">metrics</span><span class="o">-</span><span class="nv">server</span><span class="o">/</span><span class="nv">metrics</span><span class="o">-</span><span class="nv">server</span>:<span class="nv">v0</span>.<span class="mi">6</span>.<span class="mi">1</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">gcr</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">k8s</span><span class="o">-</span><span class="nv">minikube</span><span class="o">/</span><span class="nv">storage</span><span class="o">-</span><span class="nv">provisioner</span>:<span class="nv">v5</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">kubernetesui</span><span class="o">/</span><span class="nv">dashboard</span>:<span class="nv">v2</span>.<span class="mi">6</span>.<span class="mi">0</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">kubernetesui</span><span class="o">/</span><span class="nv">metrics</span><span class="o">-</span><span class="nv">scraper</span>:<span class="nv">v1</span>.<span class="mi">0</span>.<span class="mi">8</span>
🌟 <span class="nv">Enabled</span> <span class="nv">addons</span>: <span class="nv">dashboard</span>
🏄 <span class="nv">Done</span><span class="o">!</span> <span class="nv">kubectl</span> <span class="nv">is</span> <span class="nv">now</span> <span class="nv">configured</span> <span class="nv">to</span> <span class="nv">use</span> <span class="s2">"</span><span class="s">minikube</span><span class="s2">"</span> <span class="nv">cluster</span> <span class="nv">and</span> <span class="s2">"</span><span class="s">default</span><span class="s2">"</span> <span class="nv">namespace</span> <span class="nv">by</span> <span class="nv">default</span>
</code></pre></div>
<p>We'll need to use the <a href="https://github.com/kubernetes/dashboard">Kubernetes Dashboard</a> to view details about the model service. We can start it up in the minikube cluster with this command:</p>
<div class="highlight"><pre><span></span><code>minikube dashboard --url
</code></pre></div>
<p>The command starts up a proxy that must keep running in order to forward the traffic to the dashboard UI in the minikube cluster.</p>
<p>The dashboard UI looks like this:</p>
<p><img alt="Kubernetes Dashboard" src="https://www.tekhnoal.com/kubernetes_dashboard_ltfmlm.png" width="100%"></p>
<p>We'll also need to use the <a href="https://github.com/kubernetes-sigs/metrics-server#readme">metrics server</a> in Kubernetes. We can enable that in minikube with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">addons</span> <span class="n">enable</span> <span class="n">metrics</span><span class="o">-</span><span class="n">server</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>💡 <span class="nv">metrics</span><span class="o">-</span><span class="nv">server</span> <span class="nv">is</span> <span class="nv">an</span> <span class="nv">addon</span> <span class="nv">maintained</span> <span class="nv">by</span> <span class="nv">Kubernetes</span>. <span class="k">For</span> <span class="nv">any</span> <span class="nv">concerns</span> <span class="nv">contact</span> <span class="nv">minikube</span> <span class="nv">on</span> <span class="nv">GitHub</span>.
<span class="nv">You</span> <span class="nv">can</span> <span class="nv">view</span> <span class="nv">the</span> <span class="nv">list</span> <span class="nv">of</span> <span class="nv">minikube</span> <span class="nv">maintainers</span> <span class="nv">at</span>: <span class="nv">https</span>:<span class="o">//</span><span class="nv">github</span>.<span class="nv">com</span><span class="o">/</span><span class="nv">kubernetes</span><span class="o">/</span><span class="nv">minikube</span><span class="o">/</span><span class="nv">blob</span><span class="o">/</span><span class="nv">master</span><span class="o">/</span><span class="nv">OWNERS</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">k8s</span>.<span class="nv">gcr</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">metrics</span><span class="o">-</span><span class="nv">server</span><span class="o">/</span><span class="nv">metrics</span><span class="o">-</span><span class="nv">server</span>:<span class="nv">v0</span>.<span class="mi">6</span>.<span class="mi">1</span>
🌟 <span class="nv">The</span> <span class="s1">'</span><span class="s">metrics-server</span><span class="s1">'</span> <span class="nv">addon</span> <span class="nv">is</span> <span class="nv">enabled</span>
</code></pre></div>
<p>Let's view all of the pods running in the minikube cluster to make sure we can connect.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">-</span><span class="n">A</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-6d4b75cb6d-wrrwr 1/1 Running 16 (22h ago) 23d
kube-system etcd-minikube 1/1 Running 16 (22h ago) 23d
kube-system kube-apiserver-minikube 1/1 Running 16 (22h ago) 23d
kube-system kube-controller-manager-minikube 1/1 Running 2 (22h ago) 24h
kube-system kube-proxy-5n4t9 1/1 Running 15 (22h ago) 23d
kube-system kube-scheduler-minikube 1/1 Running 14 (22h ago) 23d
kube-system metrics-server-8595bd7d4c-ptcsp 1/1 Running 12 (22h ago) 4d2h
kube-system storage-provisioner 1/1 Running 25 (24s ago) 23d
kubernetes-dashboard dashboard-metrics-scraper-78dbd9dbf5-xslpl 1/1 Running 8 (22h ago) 4d2h
kubernetes-dashboard kubernetes-dashboard-5fd5574d9f-vbtnd 1/1 Running 10 (22h ago) 4d2h
</code></pre></div>
<p>The pods running the Kubernetes Dashboard and metrics server appear in the kube-system and kubernetes-dashboard namespaces.</p>
<h3>Creating a Kubernetes Namespace</h3>
<p>Now that we have a cluster and are connected to it, we'll create a namespace to hold the resources for our model deployment. The resource definition is in the kubernetes/namespace.yaml file. To apply the manifest to the cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace/model-services created
</code></pre></div>
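<p>For reference, a Namespace manifest needs little more than a name. A sketch that matches the namespace name shown in the output above (the real file in the repository is the source of truth):</p>

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: model-services
```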
<p>To take a look at the namespaces, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">namespace</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME STATUS AGE
default Active 23d
kube-node-lease Active 23d
kube-public Active 23d
kube-system Active 23d
kubernetes-dashboard Active 4d2h
model-services Active 1s
</code></pre></div>
<p>The new namespace appears in the listing along with other namespaces created by default by the system. To use the new namespace for the rest of the operations, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">config</span> <span class="nb">set</span><span class="o">-</span><span class="n">context</span> <span class="o">--</span><span class="n">current</span> <span class="o">--</span><span class="n">namespace</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Context "minikube" modified.
</code></pre></div>
<h3>Creating a Model Deployment and Service</h3>
<p>The model service is deployed by using Kubernetes resources. These are:</p>
<ul>
<li>ConfigMap: a set of configuration options; in this case, a simple YAML file that is loaded into the running container as a volume mount. This resource allows us to change the configuration of the model service without having to modify the Docker image.</li>
<li>Deployment: a declarative way to manage a set of pods; the model service pods are managed through the Deployment.</li>
<li>Service: a way to expose the set of pods in a Deployment; the model service is made available to the outside world through the Service. The service type is LoadBalancer, which means that a load balancer will be created for the service.</li>
</ul>
<p>These resources are defined in the ./kubernetes/model_service.yaml file in the project repository.</p>
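<p>To make the structure concrete, here is an illustrative sketch of the three resources. The label names, mounted file name, and some values are assumptions for illustration; the names, ports, and CPU request match those that appear elsewhere in this post, and the actual manifest in the repository is the source of truth:</p>

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-service-configuration
data:
  # configuration file mounted into the container as a volume
  service_config.yaml: |
    # (service configuration goes here)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: insurance-charges-model-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: insurance-charges-model-service
  template:
    metadata:
      labels:
        app: insurance-charges-model-service
    spec:
      containers:
        - name: insurance-charges-model
          image: insurance_charges_model_service:latest
          imagePullPolicy: Never   # use the image loaded into the minikube cache
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: 100m
---
apiVersion: v1
kind: Service
metadata:
  name: insurance-charges-model-service
spec:
  type: LoadBalancer
  selector:
    app: insurance-charges-model-service
  ports:
    - port: 80
      targetPort: 8000
```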
<p>To start the model service, first we'll need to send the docker image from the local docker daemon to the minikube image cache:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">load</span> <span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="n">latest</span>
</code></pre></div>
<p>We can view the images in the minikube cache like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">cache</span> <span class="nb">list</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">insurance_charges_model_service</span><span class="o">:</span><span class="n">latest</span><span class="w"></span>
</code></pre></div>
<p>The model service resources are created within the Kubernetes cluster with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap/model-service-configuration created
deployment.apps/insurance-charges-model-deployment created
service/insurance-charges-model-service created
</code></pre></div>
<p>Let's get the names of the pods that are running the service:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME READY STATUS RESTARTS AGE
insurance-charges-model-deployment-5454fc7cfb-rhl2t 1/1 Running 0 4s
</code></pre></div>
<p>To make sure the service started up correctly, we'll check the logs of the single pod running the service:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">logs</span> <span class="n">insurance</span><span class="o">-</span><span class="n">charges</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">deployment</span><span class="o">-</span><span class="mi">5454</span><span class="n">fc7cfb</span><span class="o">-</span><span class="n">rhl2t</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="o">/</span><span class="n">usr</span><span class="o">/</span><span class="n">local</span><span class="o">/</span><span class="n">lib</span><span class="o">/</span><span class="n">python3</span><span class="mf">.9</span><span class="o">/</span><span class="n">site</span><span class="o">-</span><span class="n">packages</span><span class="o">/</span><span class="n">tpot</span><span class="o">/</span><span class="n">builtins</span><span class="o">/</span><span class="fm">__init__</span><span class="o">.</span><span class="n">py</span><span class="p">:</span><span class="mi">36</span><span class="p">:</span> <span class="ne">UserWarning</span><span class="p">:</span> <span class="ne">Warning</span><span class="p">:</span> <span class="n">optional</span> <span class="n">dependency</span> <span class="err">`</span><span class="n">torch</span><span class="err">`</span> <span class="ow">is</span> <span class="ow">not</span> <span class="n">available</span><span class="o">.</span> <span class="o">-</span> <span class="n">skipping</span> <span class="kn">import</span> <span class="nn">of</span> <span class="n">NN</span> <span class="n">models</span><span class="o">.</span>
<span class="n">warnings</span><span class="o">.</span><span class="n">warn</span><span class="p">(</span><span class="s2">"Warning: optional dependency `torch` is not available. - skipping import of NN models."</span><span class="p">)</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Started</span> <span class="n">server</span> <span class="n">process</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Waiting</span> <span class="k">for</span> <span class="n">application</span> <span class="n">startup</span><span class="o">.</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Application</span> <span class="n">startup</span> <span class="n">complete</span><span class="o">.</span>
<span class="n">INFO</span><span class="p">:</span> <span class="n">Uvicorn</span> <span class="n">running</span> <span class="n">on</span> <span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="mf">0.0.0.0</span><span class="p">:</span><span class="mi">8000</span> <span class="p">(</span><span class="n">Press</span> <span class="n">CTRL</span><span class="o">+</span><span class="n">C</span> <span class="n">to</span> <span class="n">quit</span><span class="p">)</span>
</code></pre></div>
<p>Looks like the server process started correctly in the Docker container. The UserWarning is generated when we instantiate the model object, which means everything is running as expected.</p>
<p>The deployment and service for the model service were created together. You can see the new service with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
insurance-charges-model-service NodePort 10.98.168.223 <none> 80:31687/TCP 48s
</code></pre></div>
<p>Minikube exposes the service on a local port. We can get a link to the endpoint with this command:</p>
<div class="highlight"><pre><span></span><code>minikube service insurance-charges-model-service --url -n model-services
</code></pre></div>
<p>The command output this URL:</p>
<div class="highlight"><pre><span></span><code>http://192.168.59.100:31687
</code></pre></div>
<p>The command must keep running to keep the tunnel open to the running model service in the minikube cluster.</p>
<p>To make a prediction, we'll hit the service with a request:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://192.168.59.100:31687/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 65, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">male</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 22, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 5, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">southwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":25390.95}
</code></pre></div>
<p>We have the model service up and running in the local minikube cluster!</p>
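<p>The same request can also be made from Python instead of curl. This is an illustrative sketch using only the standard library; the URL comes from the <code>minikube service</code> tunnel above and will differ on each machine:</p>

```python
import json
from urllib import request

# input fields expected by the insurance charges model
payload = {
    "age": 65,
    "sex": "male",
    "bmi": 22,
    "children": 5,
    "smoker": True,
    "region": "southwest",
}

def predict(url: str, data: dict) -> dict:
    """POST the model input to the prediction endpoint and return the parsed JSON response."""
    req = request.Request(
        url,
        data=json.dumps(data).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
    )
    with request.urlopen(req) as response:
        return json.loads(response.read())

# the URL printed by the `minikube service` command; it will differ on each machine:
# predict("http://192.168.59.100:31687/api/models/insurance_charges_model/prediction", payload)
```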
<h3>Running the Load Test</h3>
<p>We can run the load test by using the IP address and port of the service running in minikube.</p>
<div class="highlight"><pre><span></span><code>locust -f tests/load_test.py --host<span class="o">=</span>http://192.168.59.100:31687 --headless --loglevel ERROR --csv<span class="o">=</span>./load_test_report/load_test --html ./load_test_report/load_test_report.html
</code></pre></div>
<p>While the load test is running, we'll check the CPU usage of the single pod running the model service every 15 seconds:</p>
<div class="highlight"><pre><span></span><code>%%bash
kubectl top pods
<span class="k">while</span> sleep <span class="m">15</span><span class="p">;</span> <span class="k">do</span>
kubectl top pods <span class="p">|</span> grep insurance-charges-model-deployment
<span class="k">done</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME CPU(cores) MEMORY(bytes)
insurance-charges-model-deployment-5454fc7cfb-rhl2t 4m 104Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 27m 104Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 27m 104Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 27m 104Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 27m 104Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 132m 105Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 132m 105Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 132m 105Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 132m 105Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 198m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 198m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 198m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 198m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 200m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 200m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 200m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 200m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 94m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 94m 107Mi
insurance-charges-model-deployment-5454fc7cfb-rhl2t 94m 107Mi
Process is interrupted.
</code></pre></div>
<p>We can clearly see how the CPU usage rises as the load goes from 1 user to 5 users. The CPU request for the deployment is 100 millicores, and the CPU usage goes as high as 200 millicores. The memory usage did not change much under load.</p>
<p>The load test output this error message right before stopping:</p>
<div class="highlight"><pre><span></span><code> Test failed because the response time at the 99th percentile was above 100 ms. The 99th percentile latency is '3300.0'.
</code></pre></div>
<p>We can see that the single instance of the service running in Kubernetes is not enough to meet the requirements of the load test, and that the CPU usage is the limiting factor.</p>
<h2>Adding Autoscaling to the Model Service</h2>
<p>Kubernetes supports autoscaling, which is the ability to change the resources assigned to a service based on the current load on the service. We'll be doing horizontal scaling, which means that the number of replicas increases and decreases according to the load. Kubernetes supports this kind of autoscaling through the <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/">HorizontalPodAutoscaler</a> resource.</p>
<p>The HorizontalPodAutoscaler resource is defined like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">autoscaling/v2</span><span class="w"></span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">HorizontalPodAutoscaler</span><span class="w"></span>
<span class="nt">metadata</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance-charges-model-autoscaler</span><span class="w"></span>
<span class="nt">spec</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">scaleTargetRef</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">apps/v1</span><span class="w"></span>
<span class="w"> </span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Deployment</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance-charges-model-deployment</span><span class="w"></span>
<span class="w"> </span><span class="nt">minReplicas</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span><span class="w"></span>
<span class="w"> </span><span class="nt">maxReplicas</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">10</span><span class="w"></span>
<span class="w"> </span><span class="nt">metrics</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Resource</span><span class="w"></span>
<span class="w"> </span><span class="nt">resource</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">cpu</span><span class="w"></span>
<span class="w"> </span><span class="nt">target</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Utilization</span><span class="w"></span>
<span class="w"> </span><span class="nt">averageUtilization</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">50</span><span class="w"></span>
</code></pre></div>
<p>This resource is defined in the /kubernetes/autoscaler.yaml file in the repository.</p>
<p>The HorizontalPodAutoscaler resource states that the pods of the deployment should be kept at an average of 50% CPU utilization, measured against the CPU request. Since the pods of our service request 100 millicores, the autoscaler controller will step in whenever the average CPU usage goes above 50 millicores per pod and add replicas to the deployment.</p>
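<p>Under the hood, the autoscaler computes the desired replica count from a simple ratio of current to target utilization. A minimal sketch of the formula from the Kubernetes HPA documentation, in Python:</p>

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """Compute the replica count the HorizontalPodAutoscaler will request.

    Utilization values are expressed as a percentage of the pod's CPU request.
    """
    return math.ceil(current_replicas * (current_utilization / target_utilization))

# one pod running at 200% of its 100m CPU request, with a 50% target:
print(desired_replicas(1, 200.0, 50.0))  # -> 4
```

<p>This matches what we saw in the load test: a single pod at 200 millicores is at 200% of its 100 millicore request, so the autoscaler asks for four replicas to bring the average back down toward 50%.</p>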
<p>We can deploy the HorizontalPodAutoscaler resource with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">autoscaler</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>horizontalpodautoscaler.autoscaling/insurance-charges-model-autoscaler created
</code></pre></div>
<p>We can view the number of replicas in the Deployment in the Kubernetes Dashboard:</p>
<p><img alt="Kubernetes Deployments" src="https://www.tekhnoal.com/deployments_ltfmlm.png" width="100%"></p>
<p>The deployment currently has 1 pod, with 1 requested pod.</p>
<p>We can also see the HorizontalPodAutoscaler:</p>
<p><img alt="Kubernetes HPA" src="https://www.tekhnoal.com/hpa_ltfmlm.png" width="100%"></p>
<p>The number of replicas is currently set to 1, the autoscaler will increase and decrease this number automatically.</p>
<p>Let's try running the load test with more concurrent users and see if we can trigger an autoscaling event.</p>
<div class="highlight"><pre><span></span><code>locust -f tests/load_test.py --host<span class="o">=</span>http://192.168.59.100:31687 --headless --loglevel ERROR --csv<span class="o">=</span>./load_test_report/load_test --html ./load_test_report/load_test_report.html
</code></pre></div>
<p>While it's running, let's watch the deployment for the number of replicas:</p>
<div class="highlight"><pre><span></span><code>%%bash
kubectl get deployment insurance-charges-model-deployment
<span class="k">while</span> sleep <span class="m">15</span><span class="p">;</span> <span class="k">do</span>
kubectl get deployment insurance-charges-model-deployment <span class="p">|</span> grep insurance-charges-model-deployment
<span class="k">done</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME READY UP-TO-DATE AVAILABLE AGE
insurance-charges-model-deployment 1/1 1 1 14m
insurance-charges-model-deployment 1/1 1 1 14m
insurance-charges-model-deployment 1/1 1 1 15m
insurance-charges-model-deployment 2/2 2 2 15m
insurance-charges-model-deployment 2/2 2 2 15m
insurance-charges-model-deployment 2/2 2 2 15m
insurance-charges-model-deployment 2/2 2 2 16m
insurance-charges-model-deployment 4/4 4 4 16m
insurance-charges-model-deployment 4/4 4 4 16m
insurance-charges-model-deployment 4/4 4 4 16m
insurance-charges-model-deployment 4/4 4 4 17m
insurance-charges-model-deployment 6/6 6 6 17m
insurance-charges-model-deployment 6/6 6 6 17m
insurance-charges-model-deployment 6/6 6 6 17m
insurance-charges-model-deployment 6/6 6 6 18m
insurance-charges-model-deployment 6/6 6 6 18m
insurance-charges-model-deployment 6/6 6 6 18m
insurance-charges-model-deployment 6/6 6 6 18m
insurance-charges-model-deployment 6/6 6 6 19m
insurance-charges-model-deployment 6/6 6 6 19m
insurance-charges-model-deployment 6/6 6 6 19m
insurance-charges-model-deployment 6/6 6 6 19m
insurance-charges-model-deployment 6/6 6 6 20m
insurance-charges-model-deployment 6/6 6 6 20m
insurance-charges-model-deployment 6/6 6 6 20m
insurance-charges-model-deployment 6/6 6 6 20m
insurance-charges-model-deployment 6/6 6 6 21m
insurance-charges-model-deployment 6/6 6 6 21m
Process is interrupted.
</code></pre></div>
<p>The increased load caused the number of replicas to go up to 6.</p>
<p>Autoscaling can also be triggered by other metrics, such as memory usage. It ensures that a service can scale to meet the current needs of the clients of the system.</p>
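<p>For example, to scale on memory instead of CPU, the metrics section of the HorizontalPodAutoscaler above could be changed like this (the 70% target is an illustrative value, not one used in this deployment):</p>

```yaml
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```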
<h2>Deleting the Resources</h2>
<p>Now that we're done with the service we need to destroy the resources. </p>
<p>To delete the service autoscaler, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">autoscaler</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>horizontalpodautoscaler.autoscaling "insurance-charges-model-autoscaler" deleted
</code></pre></div>
<p>To delete the model service, we'll execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>configmap "model-service-configuration" deleted
deployment.apps "insurance-charges-model-deployment" deleted
service "insurance-charges-model-service" deleted
</code></pre></div>
<p>To delete the namespace:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace "model-services" deleted
</code></pre></div>
<p>Lastly, to stop the Kubernetes cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">stop</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>✋ Stopping node "minikube" ...
🛑 1 node stopped.
</code></pre></div>
<h2>Closing</h2>
<p>In this blog post we showed how to create a load testing script for a machine learning model that is deployed within a RESTful service. The load testing script is able to generate random inputs for the model. We also showed how to add a shape to the load test in order to simplify load testing and how to add SLOs to the load testing script so that we can quickly tell if the model and model service are able to meet the requirements of the deployment. Lastly, we deployed the model service to a Kubernetes and showed how to implement autoscaling so that the model service can meet the SLO adaptively.</p>Caching for ML Model Deployments2022-08-10T07:00:00-05:002022-08-10T07:00:00-05:00Brian Schmidttag:www.tekhnoal.com,2022-08-10:/caching-for-ml-models.html<p>In a software system, a <a href="https://en.wikipedia.org/wiki/Cache_(computing)">cache</a> is a data store that is used to temporarily store computation results or frequently-accessed data. When accessing the results of a computation from a cache, we are able to avoid paying the cost of recomputing the result. When accessing a frequently accessed piece of data we are able to avoid paying the cost of accessing the data from a slower data store. This type of caching is used when accessing data from a slower data store than the cache. When a cache hit occurs, the data being sought is found and returned to the caller. When a “miss” occurs, the data is not found and must be recomputed or accessed from the slower data store by the caller. A data cache is generally built using storage that has low latency, which means that it is more expensive to run. Machine learning model deployments can benefit from caching because making predictions with a model can be a CPU-intensive process, especially for large and complex models. Predictions that take a long time to make can be cached and returned later when the same prediction is requested. 
This type of caching is also known as <a href="https://en.wikipedia.org/wiki/Memoization">memoization</a>. Another reason that a prediction can take a long time to create is if data enrichment is needed. Data enrichment is the process of adding fields to a model's input from a data store before a prediction is made, this process can add latency to the prediction and can benefit from caching.</p><h1>Caching for ML Model Deployments</h1>
<p>In a <a href="https://www.tekhnoal.com/ml-model-decorators.html">previous blog post</a> we introduced the decorator pattern for ML model deployments and then showed how to use the pattern to build extensions to a normal model deployment. For example, in <a href="https://www.tekhnoal.com/data-enrichment-for-ml-models.html">this blog post</a> we added data enrichment to a deployed model. This extension was added without having to modify the machine learning model code at all, we were able to do it by using the decorator pattern. In this blog post we’ll add caching functionality to a model in the same way.</p>
<p>This blog post was written in a Jupyter notebook, so some of the code and commands found in it reflect this.</p>
<h2>Introduction</h2>
<p>In a software system, a <a href="https://en.wikipedia.org/wiki/Cache_(computing)">cache</a> is a data store that is used to temporarily store computation results or frequently-accessed data. By reading the result of a computation from a cache, we avoid paying the cost of recomputing it; by reading a frequently-accessed piece of data from a cache, we avoid paying the cost of fetching it from a slower data store. When a cache hit occurs, the data being sought is found and returned to the caller. When a “miss” occurs, the data is not found and must be recomputed or accessed from the slower data store by the caller. A data cache is generally built using storage that has low latency, which means that it is more expensive to run. </p>
<p>Machine learning model deployments can benefit from caching because making predictions with a model can be a CPU-intensive process, especially for large and complex models. Predictions that take a long time to make can be cached and returned later when the same prediction is requested. This type of caching is also known as <a href="https://en.wikipedia.org/wiki/Memoization">memoization</a>. Another reason that a prediction can take a long time to create is if data enrichment is needed. Data enrichment is the process of adding fields to a model's input from a data store before a prediction is made; this process can add latency to the prediction and can benefit from caching.</p>
<p>In order to make prediction caching possible, we need to make sure that the model produces deterministic predictions. Determinism is a property of an algorithm which guarantees that it will always return the same output for the same input. If a model returns different predictions for the same input, we can't cache its predictions at all, because we can't guarantee that a cached prediction matches what the model would have returned.</p>
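<p>The determinism requirement is what makes memoization safe in the first place. As a minimal in-process sketch (using a made-up pricing formula in place of a real model), Python's functools.lru_cache caches results keyed on the function's hashable arguments:</p>

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def predict_charges(age: int, bmi: float, children: int) -> float:
    # Stand-in for an expensive, deterministic model prediction;
    # the formula here is invented purely for illustration.
    return round(250.0 * age + 120.0 * bmi + 425.0 * children, 2)

first = predict_charges(42, 24.0, 2)   # computed on the first call
second = predict_charges(42, 24.0, 2)  # served from the cache

print(first == second)                    # True
print(predict_charges.cache_info().hits)  # 1
```

<p>This only works because the function is deterministic; for a non-deterministic function the cache would silently return results that the function would no longer produce.</p>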
<p>In this blog post, we’ll show how to create a simple decorator that is able to cache predictions for an ML model that is deployed to a production system. We'll also show how to deploy the decorator along with the model to a RESTful service.</p>
<p>All of the code is available in this <a href="https://github.com/schmidtbri/caching-for-ml-models">github repository</a>.</p>
<h2>Software Architecture</h2>
<p><img alt="Software Architecture" src="https://www.tekhnoal.com/software_architecture_cfmlm.png" width="100%"></p>
<p>For caching predictions, we’ll be using <a href="https://en.wikipedia.org/wiki/Redis">Redis</a>. Redis is a data structure store that allows users to save and modify data structures in a remote service. This allows many clients to safely access the same data from a centralized service. Redis supports many different data structures, but we’ll be using the key-value store functionality to save our predictions.</p>
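<p>The only key-value operations we need from Redis are a write with a time-to-live and a read that signals a miss by returning nothing. As a rough sketch of those semantics (a plain dictionary standing in for Redis; the class is hypothetical, though the setex/get method names mirror redis-py's):</p>

```python
import time
from typing import Dict, Optional, Tuple

class TTLKeyValueStore:
    """Dictionary-based stand-in for the Redis SETEX/GET operations."""

    def __init__(self) -> None:
        # key -> (value, absolute expiry timestamp)
        self._store: Dict[str, Tuple[bytes, float]] = {}

    def setex(self, key: str, ttl_seconds: float, value: bytes) -> None:
        # store the value along with the time at which it expires
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key: str) -> Optional[bytes]:
        entry = self._store.get(key)
        if entry is None:
            return None  # cache miss
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired entry, treat as a miss
            return None
        return value

cache = TTLKeyValueStore()
cache.setex("prediction/abc123", 60.0, b'{"charges": 8640.78}')
print(cache.get("prediction/abc123"))  # b'{"charges": 8640.78}'
print(cache.get("missing-key"))        # None
```

<p>Redis adds what this sketch lacks: a centralized store that outlives any single process, so every replica of the model service shares one cache.</p>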
<h2>Installing the Model</h2>
<p>To make this blog post a little shorter we won't train a completely new model. Instead we'll install a model that we've <a href="https://www.tekhnoal.com/regression-model.html">built in a previous blog post</a>. The code for the model is in <a href="https://github.com/schmidtbri/regression-model">this github repository</a>.</p>
<p>To install the model, we can use the pip command and point it at the github repo of the model.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">Markdown</span> <span class="k">as</span> <span class="n">md</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">e</span> <span class="n">git</span><span class="o">+</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">schmidtbri</span><span class="o">/</span><span class="n">regression</span><span class="o">-</span><span class="n">model</span><span class="c1">#egg=insurance_charges_model</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make a prediction with the model, we'll import the model's class.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.model</span> <span class="kn">import</span> <span class="n">InsuranceChargesModel</span>
</code></pre></div>
<p>Now we can instantiate the model:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">InsuranceChargesModel</span><span class="p">()</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make a prediction, we'll need to use the model's input schema class.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.schemas</span> <span class="kn">import</span> <span class="n">InsuranceChargesModelInput</span><span class="p">,</span> \
    <span class="n">SexEnum</span><span class="p">,</span> <span class="n">RegionEnum</span>
<span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
    <span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
    <span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
    <span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
    <span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
    <span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
</code></pre></div>
<p>The model's input schema is called InsuranceChargesModelInput and it encompasses all of the features required by the model to make a prediction.</p>
<p>Now we can make a prediction with the model by calling the predict() method with an instance of the InsuranceChargesModelInput class.</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=8640.78)
</code></pre></div>
<p>The model predicts that the charges will be $8640.78.</p>
<p>We can view the input schema of the model as a JSON schema document by calling the .schema() method on the input schema class.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'InsuranceChargesModelInput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Schema for input of the model's predict method."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'age'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age of primary beneficiary in years.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">65</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'sex'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Sex'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Gender of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/SexEnum'</span><span class="p">}]},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'bmi'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body Mass Index'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body mass index of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mf">15.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mf">50.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'number'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'children'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Children'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Number of children covered by health insurance.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'smoker'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Smoker'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Whether beneficiary is a smoker.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'boolean'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'region'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region where beneficiary lives.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/RegionEnum'</span><span class="p">}]}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'definitions'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'SexEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'SexEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'sex' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'male'</span><span class="p">,</span><span class="w"> </span><span class="s1">'female'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'region' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'southwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'southeast'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northeast'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}}}</span><span class="w"></span>
</code></pre></div>
<h2>Profiling the Model</h2>
<p>In order to get an idea of how much time it takes for our model to make a prediction, we'll profile it by making predictions with random data. To do this, we'll use the <a href="https://faker.readthedocs.io/en/master/">Faker package</a>. We can install it with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">Faker</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>We'll create a function that can generate a random sample that meets the model's input schema:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">faker</span> <span class="kn">import</span> <span class="n">Faker</span>
<span class="n">faker</span> <span class="o">=</span> <span class="n">Faker</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">generate_record</span><span class="p">()</span> <span class="o">-></span> <span class="n">InsuranceChargesModelInput</span><span class="p">:</span>
    <span class="n">record</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s2">"age"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">18</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">65</span><span class="p">),</span>
        <span class="s2">"sex"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_choices</span><span class="p">(</span><span class="n">elements</span><span class="o">=</span><span class="p">(</span><span class="s2">"male"</span><span class="p">,</span> <span class="s2">"female"</span><span class="p">),</span> <span class="n">length</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">],</span>
        <span class="s2">"bmi"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">15000</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">50000</span><span class="p">)</span><span class="o">/</span><span class="mf">1000.0</span><span class="p">,</span>
        <span class="s2">"children"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span>
        <span class="s2">"smoker"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">boolean</span><span class="p">(),</span>
        <span class="s2">"region"</span><span class="p">:</span> <span class="n">faker</span><span class="o">.</span><span class="n">random_choices</span><span class="p">(</span><span class="n">elements</span><span class="o">=</span><span class="p">(</span><span class="s2">"southwest"</span><span class="p">,</span> <span class="s2">"southeast"</span><span class="p">,</span> <span class="s2">"northwest"</span><span class="p">,</span> <span class="s2">"northeast"</span><span class="p">),</span> <span class="n">length</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span><span class="o">**</span><span class="n">record</span><span class="p">)</span>
</code></pre></div>
<p>The function returns an instance of the InsuranceChargesModelInput class, which is the type required by the model's predict() method. We'll use this function to profile the predict() method of the model.</p>
<p>It's really hard to measure performance from a single prediction, so we'll run the test with many random samples. To start, we'll generate 1000 samples and save them:</p>
<div class="highlight"><pre><span></span><code><span class="n">samples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
    <span class="n">samples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">generate_record</span><span class="p">())</span>
</code></pre></div>
<p>By using the timeit module from the standard library, we can measure how much time it takes to call the model's predict method with a random sample. We'll make 1000 predictions.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">timeit</span>
<span class="n">total_seconds</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="s2">"[model.predict(sample) for sample in samples]"</span><span class="p">,</span>
                              <span class="n">number</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="nb">globals</span><span class="o">=</span><span class="nb">globals</span><span class="p">())</span>
<span class="n">seconds_per_sample</span> <span class="o">=</span> <span class="n">total_seconds</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">milliseconds_per_sample</span> <span class="o">=</span> <span class="n">seconds_per_sample</span> <span class="o">*</span> <span class="mf">1000.0</span>
</code></pre></div>
<p>The model took 32.997 seconds to make 1000 predictions, which works out to about 32.997 milliseconds (0.033 seconds) per prediction.</p>
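<p>A single timing run can be noisy. If steadier numbers are needed, timeit.repeat plus simple summary statistics works well; the sketch below times a trivial stand-in function, since the pattern is the same for the real model:</p>

```python
import statistics
import timeit

def fake_predict() -> float:
    # trivial stand-in for model.predict(sample)
    return float(sum(i * i for i in range(1000)))

# five independent timing runs of 100 calls each
runs = timeit.repeat(fake_predict, repeat=5, number=100)
per_call_ms = [run / 100 * 1000.0 for run in runs]

print(f"min:    {min(per_call_ms):.4f} ms")
print(f"median: {statistics.median(per_call_ms):.4f} ms")
```

<p>The minimum over several runs is usually the most stable estimate, since it is the measurement least affected by scheduler and background noise.</p>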
<h2>Hashing Model Inputs</h2>
<p>Before we can build a caching decorator, we'll need to understand a little bit about hashing and how to use it for caching. A hashing operation is an operation that takes in data of arbitrary size as input and returns data of a fixed size. A "hash" value refers to the fixed-size data that is returned from a hashing operation. Hashing has many uses in computer science; in this application we'll use hashing to uniquely identify the inputs that are provided to the ML model that we are decorating.</p>
<p>Hashing is already built into the Python standard library through the hash() function, but it is only supported on certain types of objects. We can try it out using an instance of the model's input schema:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
    <span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
    <span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
    <span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
    <span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
    <span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">model_input_dict</span> <span class="o">=</span> <span class="n">model_input</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="n">frozen_dict</span> <span class="o">=</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">keys</span><span class="p">()),</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
<span class="nb">hash</span><span class="p">(</span><span class="n">frozen_dict</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>-4360805119606244359
</code></pre></div>
<p>To try out hashing, we converted an instance of the model's input schema into a dictionary, and then converted the keys and values of the dictionary into <a href="https://docs.python.org/3/library/stdtypes.html#frozenset">frozensets</a>, since dictionaries themselves are not hashable. We then used the frozensets with the hash() function to create an integer value. The integer is the hashed value that we need to uniquely identify the inputs to the model.</p>
<p>To see how hashing works, we'll create a separate input instance for the model that has the exact same values and hash it:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
    <span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
    <span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
    <span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
    <span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
    <span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">model_input_dict</span> <span class="o">=</span> <span class="n">model_input</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="n">frozen_dict</span> <span class="o">=</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">keys</span><span class="p">()),</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
<span class="nb">hash</span><span class="p">(</span><span class="n">frozen_dict</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>-4360805119606244359
</code></pre></div>
<p>The hashed values are exactly the same, as we expected. The hashed value should be different if any of the values in the model input change:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
    <span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
    <span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
    <span class="n">bmi</span><span class="o">=</span><span class="mf">24.2</span><span class="p">,</span>
    <span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
    <span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
    <span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">model_input_dict</span> <span class="o">=</span> <span class="n">model_input</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="n">frozen_dict</span> <span class="o">=</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">keys</span><span class="p">()),</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
<span class="nb">hash</span><span class="p">(</span><span class="n">frozen_dict</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>-7065881474845529459
</code></pre></div>
<p>The "bmi" field changed from 24.0 to 24.2, so we got a completely different hashed value.</p>
<p>Hashing is a quick and easy way to identify inputs, which allows us to store the predictions of the model in the cache and retrieve them later.</p>
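<p>One caveat before relying on hash() for a shared cache: since Python 3.3 the hashes of strings are randomized per interpreter process (see PYTHONHASHSEED), so two replicas of the model service, or the same service after a restart, would compute different hashes for the same input. For keys stored in an external cache like Redis, a stable digest over the serialized input is a safer sketch (the function name here is made up for illustration):</p>

```python
import hashlib
import json

def stable_cache_key(model_input_dict: dict) -> str:
    # serialize with sorted keys so identical inputs always produce
    # identical bytes, then take a SHA-256 digest of those bytes
    serialized = json.dumps(model_input_dict, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

key_a = stable_cache_key({"age": 42, "bmi": 24.0, "children": 2})
key_b = stable_cache_key({"children": 2, "age": 42, "bmi": 24.0})
key_c = stable_cache_key({"age": 42, "bmi": 24.2, "children": 2})

print(key_a == key_b)  # True: same input, same key, regardless of field order
print(key_a == key_c)  # False: changing bmi changes the key
```

<p>Unlike hash(), this digest is identical across processes and restarts, so every service replica computes the same Redis key for the same input.</p>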
<h2>Creating the Redis Cache Decorator</h2>
<p>We'll be using Redis to hold the cached predictions of the model. To access the Redis instance, we'll use the redis python package, which we'll install with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">redis</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Now we can implement the decorator class:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Optional</span>
<span class="kn">from</span> <span class="nn">ml_base.decorator</span> <span class="kn">import</span> <span class="n">MLModelDecorator</span>
<span class="kn">import</span> <span class="nn">redis</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="k">class</span> <span class="nc">RedisCachingDecorator</span><span class="p">(</span><span class="n">MLModelDecorator</span><span class="p">):</span>
    <span class="sd">"""Decorator for caching around an MLModel instance."""</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">host</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">port</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">database</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">prefix</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
                 <span class="n">hashing_fields</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
        <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">host</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="n">port</span><span class="p">,</span> <span class="n">database</span><span class="o">=</span><span class="n">database</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="n">prefix</span><span class="p">,</span>
                         <span class="n">hashing_fields</span><span class="o">=</span><span class="n">hashing_fields</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_redis_client"</span><span class="p">]</span> <span class="o">=</span> <span class="n">redis</span><span class="o">.</span><span class="n">Redis</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">host</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="n">port</span><span class="p">,</span> <span class="n">db</span><span class="o">=</span><span class="n">database</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"prefix"</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
            <span class="n">prefix</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">/</span><span class="si">{}</span><span class="s2">/</span><span class="si">{}</span><span class="s2">/"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"prefix"</span><span class="p">],</span>
                                        <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span>
                                        <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">version</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">prefix</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">/</span><span class="si">{}</span><span class="s2">/"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">version</span><span class="p">)</span>
<span class="c1"># select hashing fields from input</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"hashing_fields"</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">data_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">key</span><span class="p">:</span> <span class="n">data</span><span class="o">.</span><span class="n">dict</span><span class="p">()[</span><span class="n">key</span><span class="p">]</span> <span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"hashing_fields"</span><span class="p">]}</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">data_dict</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="c1"># creating a key for the prediction inputs provided</span>
<span class="n">frozen_data</span> <span class="o">=</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">data_dict</span><span class="o">.</span><span class="n">items</span><span class="p">())</span>
<span class="n">key</span> <span class="o">=</span> <span class="n">prefix</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="nb">hash</span><span class="p">(</span><span class="n">frozen_data</span><span class="p">))</span>
<span class="c1"># check if the prediction is in the cache</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_redis_client"</span><span class="p">]</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="c1"># if the prediction is present in the cache, then deserialize it and return the prediction</span>
<span class="k">if</span> <span class="n">prediction</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">prediction</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">output_schema</span><span class="p">(</span><span class="o">**</span><span class="n">prediction</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
<span class="c1"># if the prediction is not present in the cache, then make a prediction, save it to the cache, and return the prediction</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">serialized_prediction</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">prediction</span><span class="o">.</span><span class="n">dict</span><span class="p">())</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_redis_client"</span><span class="p">]</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">serialized_prediction</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
</code></pre></div>
<p>The caching decorator works simply. When it receives inputs for the model, it:</p>
<ul>
<li>creates a key for the model input using hashing</li>
<li>checks if the key is present in the cache</li>
<li>if the key is present:<ul>
<li>retrieves the prediction for that key </li>
<li>deserializes the contents of the cache into the output type of the model</li>
<li>returns the prediction to the caller</li>
</ul>
</li>
<li>if the key is not present:<ul>
<li>makes a prediction with the model it is decorating</li>
<li>serializes the prediction to a JSON string</li>
<li>saves the prediction to the cache with the key created</li>
<li>returns the prediction to the caller</li>
</ul>
</li>
</ul>
<p>The key created for each cache entry is made up of an optional prefix, the model's qualified name, and the model version, followed by a hash of the input. The prefix makes it possible to partition the cached predictions beyond just the model name and version. The caching decorator uses JSON as the serialization format for the predictions it stores in the cache. </p>
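<p>One caveat with this scheme: Python's built-in <code>hash()</code> is randomized per process for strings, so keys built with it will not match across service restarts or between worker processes. A deterministic digest avoids that. The sketch below is an illustrative alternative, not the decorator's actual implementation; the prefix, model name, and version values are just examples:</p>

```python
import hashlib
import json

def cache_key(data_dict, prefix="prefix", name="insurance_charges_model", version="0.1.0"):
    # Serialize with sorted keys so identical inputs always produce the
    # same byte string, then hash it deterministically with SHA-256.
    payload = json.dumps(data_dict, sort_keys=True, default=str).encode()
    digest = hashlib.sha256(payload).hexdigest()
    return "{}/{}/{}/{}".format(prefix, name, version, digest)

key = cache_key({"age": 46, "sex": "female", "bmi": 24.0})
print(key)
```

<p>Because the digest is stable, the same input maps to the same Redis key no matter which process computed it or when.</p>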
<h2>Using the Redis Cache Decorator</h2>
<p>In order to try out the decorator, we'll need to run a local Redis instance. We can start one using Docker with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> <span class="o">-</span><span class="n">p</span> <span class="mi">6379</span><span class="p">:</span><span class="mi">6379</span> <span class="o">--</span><span class="n">name</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">/</span><span class="n">redis</span><span class="o">-</span><span class="n">stack</span><span class="o">-</span><span class="n">server</span><span class="p">:</span><span class="n">latest</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">836</span><span class="n">c0d557926df641a2e657bcf0d935ec7b1e361b4de5dab6a9abad9371262ea</span><span class="w"></span>
</code></pre></div>
<p>To test out the decorator we first need to instantiate the model object that we want to use with the decorator.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">InsuranceChargesModel</span><span class="p">()</span>
</code></pre></div>
<p>Next, we’ll instantiate the decorator with the connection parameters for the Redis docker container.</p>
<div class="highlight"><pre><span></span><code><span class="n">caching_decorator</span> <span class="o">=</span> <span class="n">RedisCachingDecorator</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="mi">6379</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">prefix</span><span class="o">=</span><span class="s2">"prefix"</span><span class="p">)</span>
</code></pre></div>
<p>We can add the model instance to the decorator after it’s been instantiated like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span> <span class="o">=</span> <span class="n">caching_decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>We can see the decorator and the model objects by printing the reference to the decorator:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>RedisCachingDecorator(InsuranceChargesModel)
</code></pre></div>
<p>The decorator object prints its own type along with the type of the model that it is decorating.</p>
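<p>One way to produce this nested representation is for the decorator's <code>__repr__</code> to wrap the decorated model's type name in its own. This is a minimal sketch with hypothetical class names standing in for <code>MLModelDecorator</code> and the model class, not the library's actual code:</p>

```python
class Model:
    """Stand-in for a decorated MLModel instance."""
    pass

class Decorator:
    """Stand-in for an MLModelDecorator subclass."""

    def __init__(self, model=None):
        self._model = model

    def set_model(self, model):
        # attach the model and return self so calls can be chained
        self._model = model
        return self

    def __repr__(self):
        # render as Outer(Inner), e.g. RedisCachingDecorator(InsuranceChargesModel)
        inner = type(self._model).__name__ if self._model is not None else ""
        return "{}({})".format(type(self).__name__, inner)

print(Decorator().set_model(Model()))  # → Decorator(Model)
```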
<p>Now we’ll try to use the decorator and the model together by making a few predictions.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">46</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=9612.64)
</code></pre></div>
<p>The first time we make a prediction with a given input, we'll get the prediction made by the model and the decorator will store the prediction in the cache. </p>
<p>We can view the key in the redis database to see how it is stored.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">SCAN</span> <span class="mi">0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">0</span><span class="w"></span>
<span class="n">prefix</span><span class="o">/</span><span class="n">insurance_charges_model</span><span class="o">/</span><span class="mf">0.1.0</span><span class="o">/</span><span class="mf">5926980192354242260</span><span class="w"></span>
</code></pre></div>
<p>There is a single key in the redis database. We'll access the contents of the key like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">GET</span> <span class="n">prefix</span><span class="o">/</span><span class="n">insurance_charges_model</span><span class="o">/</span><span class="mf">0.1.0</span><span class="o">/</span><span class="mi">5926980192354242260</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges": 9612.64}
</code></pre></div>
<p>The prediction is stored in the key as a JSON string.</p>
<p>We'll try the same prediction again:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">46</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=9612.64)
</code></pre></div>
<p>This time the prediction was not made by the model; it was found in the Redis cache and returned by the decorator instead.</p>
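<p>The cache-aside flow that produced these two results can be sketched without Redis, using a plain dict as the store and a hypothetical model function in place of the real model:</p>

```python
import json

cache = {}   # stands in for the Redis database
calls = []   # records each time the underlying model actually runs

def model_fn(data):
    # hypothetical stand-in for the decorated model's predict() method
    calls.append(data)
    return {"charges": 9612.64}

def predict_with_cache(key, data):
    cached = cache.get(key)
    if cached is not None:
        # cache hit: deserialize and return without calling the model
        return json.loads(cached)
    # cache miss: run the model, store the serialized result, return it
    prediction = model_fn(data)
    cache[key] = json.dumps(prediction)
    return prediction

p1 = predict_with_cache("k1", {"age": 46})
p2 = predict_with_cache("k1", {"age": 46})
print(p1 == p2, len(calls))  # identical predictions, and the model ran once
```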
<p>Next, we'll use the 1000 samples we generated above to make predictions with the decorated model:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_total_seconds</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="s2">"[decorated_model.predict(sample) for sample in samples]"</span><span class="p">,</span>
<span class="n">number</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="nb">globals</span><span class="o">=</span><span class="nb">globals</span><span class="p">())</span>
<span class="n">decorated_seconds_per_sample</span> <span class="o">=</span> <span class="n">decorated_total_seconds</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">decorated_milliseconds_per_sample</span> <span class="o">=</span> <span class="n">decorated_seconds_per_sample</span> <span class="o">*</span> <span class="mf">1000.0</span>
</code></pre></div>
<p>The decorated model took 36.419 seconds to make 1000 predictions the first time it saw these inputs, which works out to about 36.4 milliseconds per prediction.</p>
<p>We'll run the same samples through again:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_total_seconds</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="s2">"[decorated_model.predict(sample) for sample in samples]"</span><span class="p">,</span>
<span class="n">number</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="nb">globals</span><span class="o">=</span><span class="nb">globals</span><span class="p">())</span>
<span class="n">decorated_seconds_per_sample</span> <span class="o">=</span> <span class="n">decorated_total_seconds</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">decorated_milliseconds_per_sample</span> <span class="o">=</span> <span class="n">decorated_seconds_per_sample</span> <span class="o">*</span> <span class="mf">1000.0</span>
</code></pre></div>
<p>The second time through, the decorated model took 0.88 seconds to make the same 1000 predictions, which works out to about 0.88 milliseconds per prediction.</p>
<p>It took less time because we requested the same predictions again, so they were served from the cache instead of being recomputed by the model.</p>
<p>We can get the amount of memory used by the cache by accessing the keys and summing up the number of bytes.</p>
<div class="highlight"><pre><span></span><code><span class="n">r</span> <span class="o">=</span> <span class="n">redis</span><span class="o">.</span><span class="n">StrictRedis</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s1">'localhost'</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="mi">6379</span><span class="p">,</span> <span class="n">db</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">decorated_number_of_bytes</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">decorated_total_entries</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">r</span><span class="o">.</span><span class="n">scan_iter</span><span class="p">(</span><span class="s2">"prefix*"</span><span class="p">):</span>
<span class="n">decorated_number_of_bytes</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">))</span>
<span class="n">decorated_total_entries</span> <span class="o">=</span> <span class="n">decorated_total_entries</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">decorated_average_number_of_bytes</span> <span class="o">=</span> <span class="n">decorated_number_of_bytes</span> <span class="o">/</span> <span class="n">decorated_total_entries</span>
</code></pre></div>
<p>The keys in the cache take up a total of 20624 bytes. The average number of bytes per cache entry is 20.6.</p>
<p>We'll clear the redis database to make sure the contents don't interfere with the next things we want to try.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">FLUSHDB</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>OK
</code></pre></div>
<h2>Selecting Fields For Hashing</h2>
<p>In certain situations, not all of the fields in the model's input should be used to create a hash. This may be because not all of the model's input fields are actually used for making a prediction. Some fields may be used for logging or debugging and do not actually affect the prediction created by the model. If changing the value of a field does not affect the value of the prediction created by the model, it should not be used to create the hashed key for the cache.</p>
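<p>The effect of restricting the hash to selected fields can be sketched in plain Python; the key function here is illustrative, not the decorator's code:</p>

```python
import hashlib
import json

def cache_key(data_dict, hashing_fields=None):
    # keep only the fields that actually influence the prediction
    if hashing_fields is not None:
        data_dict = {key: data_dict[key] for key in hashing_fields}
    payload = json.dumps(data_dict, sort_keys=True, default=str).encode()
    return hashlib.sha256(payload).hexdigest()

fields = ["age", "sex", "bmi", "children", "smoker"]
a = {"age": 52, "sex": "female", "bmi": 24.0, "children": 3,
     "smoker": False, "region": "northwest"}
b = dict(a, region="southeast")  # differs only in the excluded field

print(cache_key(a, fields) == cache_key(b, fields))  # True: same key
print(cache_key(a) == cache_key(b))                  # False: region included
```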
<p>The caching decorator supports selecting specific fields from the input to create the cache key. The option is called "hashing_fields" and is provided to the decorator instance like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">caching_decorator</span> <span class="o">=</span> <span class="n">RedisCachingDecorator</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="mi">6379</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">prefix</span><span class="o">=</span><span class="s2">"prefix"</span><span class="p">,</span>
<span class="n">hashing_fields</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span> <span class="s2">"children"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">])</span>
<span class="n">decorated_model</span> <span class="o">=</span> <span class="n">caching_decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>The decorator now uses all of the input fields except for the "region" field to create the key.</p>
<p>To try out the functionality, we'll create a prediction with the decorated model. The prediction will get saved in the cache.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">52</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=15219.19)
</code></pre></div>
<p>We'll now make the same prediction, but this time the prediction will come from the cache because it was saved there previously.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">52</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=15219.19)
</code></pre></div>
<p>We'll make the prediction one more time, but this time we'll change the value of the "region" field.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">52</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">southeast</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=15219.19)
</code></pre></div>
<p>The predicted value should have changed when the region changed, but it didn't: the prediction was served from the cache instead of being recomputed. This happened because the "region" field was ignored when creating the hashed key for the cache.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">FLUSHDB</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>OK
</code></pre></div>
<h2>Improving the Performance of the Decorator</h2>
<p>When a prediction is stored in the cache, it is currently serialized using the JSON format. This format is simple and easy to understand, but it is not the most efficient format for serialization in terms of the size of the data and the time it takes to do the serialization.</p>
<p>To try to improve the efficiency of the caching decorator we'll add options for other serialization formats and also try to use compression. Another way to reduce the memory usage of the cache is to reduce the precision of the numbers given to the model. These approaches will be fully explained below.</p>
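<p>The trade-offs can be previewed with the standard library before adding any dependencies; zlib stands in for Snappy here, and the payloads are illustrative:</p>

```python
import json
import zlib

# A single small prediction: compression overhead can outweigh the savings.
small = json.dumps({"charges": 9612.64}).encode()

# A batch of predictions: repetitive JSON compresses well.
batch = json.dumps([{"charges": 9612.64 + i} for i in range(100)]).encode()

print(len(small), len(zlib.compress(small)))
print(len(batch), len(zlib.compress(batch)))

# Reduced precision: rounding inputs collapses near-identical requests
# onto the same cache key, which raises the hit rate.
print(round(24.0137, 1) == round(24.04, 1))  # True: both round to 24.0
```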
<p>We'll be using <a href="https://msgpack.org/index.html">MessagePack</a> to do serialization and <a href="https://en.wikipedia.org/wiki/Snappy_(compression)">Snappy</a> for compression, so we need to install the packages:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">msgpack</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">python</span><span class="o">-</span><span class="n">snappy</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>We'll recreate the RedisCachingDecorator class with the code needed to support the new features we want to work with.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">msgpack</span>
<span class="kn">import</span> <span class="nn">snappy</span>
<span class="k">class</span> <span class="nc">RedisCachingDecorator</span><span class="p">(</span><span class="n">MLModelDecorator</span><span class="p">):</span>
<span class="sd">"""Decorator for caching around an MLModel instance."""</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">host</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">port</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">database</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">prefix</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span>
<span class="n">hashing_fields</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span> <span class="n">serder</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s2">"JSON"</span><span class="p">,</span>
<span class="n">use_compression</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="kc">False</span><span class="p">,</span>
<span class="n">reduced_precision</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="kc">False</span><span class="p">,</span>
<span class="n">number_of_places</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>
<span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="k">if</span> <span class="n">serder</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">"JSON"</span><span class="p">,</span> <span class="s2">"MessagePack"</span><span class="p">]:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Serder option not supported."</span><span class="p">)</span>
<span class="k">if</span> <span class="n">reduced_precision</span> <span class="ow">is</span> <span class="kc">True</span> <span class="ow">and</span> <span class="n">number_of_places</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"number_of_places must be provided when reduced_precision is True."</span><span class="p">)</span>
<span class="k">if</span> <span class="n">number_of_places</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">reduced_precision</span> <span class="ow">is</span> <span class="kc">False</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"reduced_precision must be True when number_of_places is provided."</span><span class="p">)</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">host</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="n">port</span><span class="p">,</span> <span class="n">database</span><span class="o">=</span><span class="n">database</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="n">prefix</span><span class="p">,</span>
<span class="n">hashing_fields</span><span class="o">=</span><span class="n">hashing_fields</span><span class="p">,</span> <span class="n">serder</span><span class="o">=</span><span class="n">serder</span><span class="p">,</span>
<span class="n">use_compression</span><span class="o">=</span><span class="n">use_compression</span><span class="p">,</span>
<span class="n">reduced_precision</span><span class="o">=</span><span class="n">reduced_precision</span><span class="p">,</span>
<span class="n">number_of_places</span><span class="o">=</span><span class="n">number_of_places</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_redis_client"</span><span class="p">]</span> <span class="o">=</span> <span class="n">redis</span><span class="o">.</span><span class="n">Redis</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">host</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="n">port</span><span class="p">,</span> <span class="n">db</span><span class="o">=</span><span class="n">database</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"prefix"</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">prefix</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">/</span><span class="si">{}</span><span class="s2">/</span><span class="si">{}</span><span class="s2">/"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"prefix"</span><span class="p">],</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">version</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">prefix</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">/</span><span class="si">{}</span><span class="s2">/"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">version</span><span class="p">)</span>
<span class="c1"># reducing the precision of the numerical fields, if it is enabled</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"reduced_precision"</span><span class="p">]</span> <span class="ow">is</span> <span class="kc">True</span><span class="p">:</span>
<span class="k">for</span> <span class="n">field_name</span><span class="p">,</span> <span class="n">field_attributes</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()[</span><span class="s2">"properties"</span><span class="p">]</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
<span class="k">if</span> <span class="s2">"type"</span> <span class="ow">in</span> <span class="n">field_attributes</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span> <span class="ow">and</span> <span class="n">field_attributes</span><span class="p">[</span><span class="s2">"type"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"number"</span><span class="p">:</span>
<span class="n">field_value</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">field_name</span><span class="p">)</span>
<span class="nb">setattr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">field_name</span><span class="p">,</span> <span class="nb">round</span><span class="p">(</span><span class="n">field_value</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"number_of_places"</span><span class="p">]))</span>
<span class="c1"># select hashing fields from input</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"hashing_fields"</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">data_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">key</span><span class="p">:</span> <span class="n">data</span><span class="o">.</span><span class="n">dict</span><span class="p">()[</span><span class="n">key</span><span class="p">]</span> <span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"hashing_fields"</span><span class="p">]}</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">data_dict</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="c1"># creating a key from the prediction inputs, pairing each field name with its value</span>
<span class="n">frozen_data</span> <span class="o">=</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">data_dict</span><span class="o">.</span><span class="n">items</span><span class="p">())</span>
<span class="n">key</span> <span class="o">=</span> <span class="n">prefix</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="nb">hash</span><span class="p">(</span><span class="n">frozen_data</span><span class="p">))</span>
<span class="c1"># check if the prediction is in the cache</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_redis_client"</span><span class="p">]</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="c1"># if the prediction is present in the cache</span>
<span class="k">if</span> <span class="n">prediction</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="c1"># optionally decompressing the bytes</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"use_compression"</span><span class="p">]:</span>
<span class="n">decompressed_prediction</span> <span class="o">=</span> <span class="n">snappy</span><span class="o">.</span><span class="n">decompress</span><span class="p">(</span><span class="n">prediction</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">decompressed_prediction</span> <span class="o">=</span> <span class="n">prediction</span>
<span class="c1"># deserializing from bytes</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"serder"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"JSON"</span><span class="p">:</span>
<span class="n">deserialized_prediction</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">decompressed_prediction</span><span class="o">.</span><span class="n">decode</span><span class="p">())</span>
<span class="k">elif</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"serder"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"MessagePack"</span><span class="p">:</span>
<span class="n">deserialized_prediction</span> <span class="o">=</span> <span class="n">msgpack</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">decompressed_prediction</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Serder option not supported."</span><span class="p">)</span>
<span class="c1"># creating the output instance</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">output_schema</span><span class="p">(</span><span class="o">**</span><span class="n">deserialized_prediction</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
<span class="c1"># if the prediction is not present in the cache</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># making a prediction with the model</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># serializing to bytes</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"serder"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"JSON"</span><span class="p">:</span>
<span class="n">serialized_prediction</span> <span class="o">=</span> <span class="nb">str</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">prediction</span><span class="o">.</span><span class="n">dict</span><span class="p">()))</span>
<span class="k">elif</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"serder"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"MessagePack"</span><span class="p">:</span>
<span class="n">serialized_prediction</span> <span class="o">=</span> <span class="n">msgpack</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">prediction</span><span class="o">.</span><span class="n">dict</span><span class="p">())</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Serder option not supported."</span><span class="p">)</span>
<span class="c1"># optionally compressing the bytes</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"use_compression"</span><span class="p">]:</span>
<span class="n">serialized_prediction</span> <span class="o">=</span> <span class="n">snappy</span><span class="o">.</span><span class="n">compress</span><span class="p">(</span><span class="n">serialized_prediction</span><span class="p">)</span>
<span class="c1"># saving the prediction to the cache</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_redis_client"</span><span class="p">]</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">serialized_prediction</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
</code></pre></div>
<p>The new implementation above includes options to enable MessagePack for serialization/deserialization, snappy for compression, and the ability to reduce the precision of numerical fields in the model input. We'll try out each option individually.</p>
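<p>One subtlety in the implementation above is worth flagging: the cache key is derived with Python's built-in <code>hash()</code>, which is randomized per process for strings (controlled by <code>PYTHONHASHSEED</code>), so two service replicas sharing one Redis instance would not generate the same keys for the same inputs. A hedged sketch of a stable alternative, using a <code>hashlib</code> digest over the canonicalized input (the function name and the example prefix below are illustrative, not part of the original code):</p>

```python
import hashlib
import json


def stable_cache_key(prefix: str, data_dict: dict) -> str:
    """Build a cache key that is identical across processes and restarts."""
    # sort_keys makes the byte representation canonical for a given input
    payload = json.dumps(data_dict, sort_keys=True).encode()
    # sha256 is deterministic, unlike the built-in hash() of strings
    return prefix + hashlib.sha256(payload).hexdigest()


key = stable_cache_key("insurance_model/0.1.0/", {"age": 55, "bmi": 25.0})
```

<p>With a scheme like this, cache entries written by one replica can be read by any other, and the field order of the input cannot produce distinct keys for the same data.</p>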
<h3>MessagePack Serialization</h3>
<p><a href="https://msgpack.org/index.html">MessagePack</a> is a binary serialization format designed to be compact, fast, and flexible; it fills the same role as JSON but produces smaller payloads.</p>
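<p>To get a feel for the difference before wiring it into the decorator, we can serialize a small record both ways and compare the payload sizes (the record shown is illustrative, not the model's actual output schema):</p>

```python
import json

import msgpack

record = {"age": 55, "bmi": 25.0, "children": 4, "charges": 15113.29}

json_bytes = json.dumps(record).encode()  # text-based encoding
msgpack_bytes = msgpack.dumps(record)     # binary encoding

# the binary encoding avoids quotes, braces, and whitespace,
# so it comes out smaller for the same record
print(len(json_bytes), len(msgpack_bytes))
```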
<p>To enable MessagePack, we'll instantiate the decorator setting the "serder" option to "MessagePack". We'll use a prefix to separate the cache entries that use MessagePack from the other cache entries.</p>
<div class="highlight"><pre><span></span><code><span class="n">msgpack_caching_decorator</span> <span class="o">=</span> <span class="n">RedisCachingDecorator</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="mi">6379</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">prefix</span><span class="o">=</span><span class="s2">"msgpack"</span><span class="p">,</span>
<span class="n">serder</span><span class="o">=</span><span class="s2">"MessagePack"</span><span class="p">)</span>
<span class="n">mspgpack_decorated_model</span> <span class="o">=</span> <span class="n">msgpack_caching_decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>The first time we make a prediction, the model will be used and the prediction will get serialized to MessagePack and saved to the cache.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">55</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">25.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">mspgpack_decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=15113.29)
</code></pre></div>
<p>The second time we make a prediction, the cache entry will be used instead.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">55</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">25.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">mspgpack_decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=15113.29)
</code></pre></div>
<p>The MessagePack format works; now we'll do some testing to see if it improves the serialization/deserialization performance.</p>
<p>As before, we'll make the predictions on the samples to fill in the cache with predictions. We'll be using the 1000 samples generated above to keep the comparison fair.</p>
<div class="highlight"><pre><span></span><code><span class="n">msgpack_total_seconds</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="s2">"[mspgpack_decorated_model.predict(sample) for sample in samples]"</span><span class="p">,</span>
<span class="n">number</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="nb">globals</span><span class="o">=</span><span class="nb">globals</span><span class="p">())</span>
<span class="n">msgpack_seconds_per_sample</span> <span class="o">=</span> <span class="n">msgpack_total_seconds</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">msgpack_milliseconds_per_sample</span> <span class="o">=</span> <span class="n">msgpack_seconds_per_sample</span> <span class="o">*</span> <span class="mf">1000.0</span>
</code></pre></div>
<p>The decorated model that uses MessagePack took 35.627 seconds to perform 1000 predictions the first time that it saw the prediction inputs. The decorated model takes about 35.627 milliseconds to make a single prediction.</p>
<p>Most of the time for this step is taken up by the model's prediction algorithm, which is why it takes a similar amount of time as the JSON serder we used before.</p>
<p>Now we can try the same predictions again. This time, they'll be accessed from the cache and returned more quickly.</p>
<div class="highlight"><pre><span></span><code><span class="n">msgpack_total_seconds</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="s2">"[mspgpack_decorated_model.predict(sample) for sample in samples]"</span><span class="p">,</span>
<span class="n">number</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="nb">globals</span><span class="o">=</span><span class="nb">globals</span><span class="p">())</span>
<span class="n">msgpack_seconds_per_sample</span> <span class="o">=</span> <span class="n">msgpack_total_seconds</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">msgpack_milliseconds_per_sample</span> <span class="o">=</span> <span class="n">msgpack_seconds_per_sample</span> <span class="o">*</span> <span class="mf">1000.0</span>
</code></pre></div>
<p>The model that uses MessagePack took 0.955 seconds to perform 1000 predictions the second time that it saw the prediction inputs. The decorated model takes about 0.955 milliseconds to access a single prediction and return it.</p>
<p>The MessagePack serder performs at around the same speed as the JSON serder: the test we did with JSON above took about 0.88 ms for each sample, while the MessagePack serder took 0.955 ms per sample.</p>
<p>We can see how much space the cache entries are taking up by querying each key and summing up the number of bytes:</p>
<div class="highlight"><pre><span></span><code><span class="n">msgpack_number_of_bytes</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">msgpack_total_entries</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">r</span><span class="o">.</span><span class="n">scan_iter</span><span class="p">(</span><span class="s2">"msgpack*"</span><span class="p">):</span>
<span class="n">msgpack_number_of_bytes</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">))</span>
<span class="n">msgpack_total_entries</span> <span class="o">=</span> <span class="n">msgpack_total_entries</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">msgpack_average_number_of_bytes</span> <span class="o">=</span> <span class="n">msgpack_number_of_bytes</span> <span class="o">/</span> <span class="n">msgpack_total_entries</span>
</code></pre></div>
<p>The keys in the original JSON cache took up a total of 20624 bytes. The keys in the MessagePack cache take up a total of 18018 bytes and the average number of bytes per MessagePack cache entry is 18.0.</p>
<p>By using MessagePack serialization we were able to use less memory in the cache.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">FLUSHDB</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>OK
</code></pre></div>
<h3>Snappy Compression</h3>
<p><a href="https://github.com/google/snappy">Snappy</a> is a compression algorithm built by Google that aims for very high compression and decompression speeds, at the cost of somewhat lower compression ratios. We can try to reduce the memory used by the cache by compressing the cache entries with the Snappy algorithm. This approach was inspired by <a href="https://doordash.engineering/2019/01/02/speeding-up-redis-with-compression/">another blog post</a>.</p>
<p>Enabling compression on the decorator is simple: we just set the "use_compression" parameter to "True" when instantiating the caching decorator. In this example we'll use JSON serialization combined with compression.</p>
<div class="highlight"><pre><span></span><code><span class="n">compressing_caching_decorator</span> <span class="o">=</span> <span class="n">RedisCachingDecorator</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="mi">6379</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">prefix</span><span class="o">=</span><span class="s2">"json+compression"</span><span class="p">,</span>
<span class="n">serder</span><span class="o">=</span><span class="s2">"JSON"</span><span class="p">,</span>
<span class="n">use_compression</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">compressing_decorated_model</span> <span class="o">=</span> <span class="n">compressing_caching_decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>The first time we make a prediction, the model will be used and the prediction will get serialized to JSON, then compressed, and saved to the cache.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">53</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">25.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">compressing_decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=15207.01)
</code></pre></div>
<p>The second time we make a prediction, the compressed cache entry will be used instead.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">53</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">25.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">compressing_decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=15207.01)
</code></pre></div>
<p>The compression works; now we'll do some testing to see if it improves the serialization/deserialization performance.</p>
<div class="highlight"><pre><span></span><code><span class="n">compressed_total_seconds</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="s2">"[compressing_decorated_model.predict(sample) for sample in samples]"</span><span class="p">,</span>
<span class="n">number</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="nb">globals</span><span class="o">=</span><span class="nb">globals</span><span class="p">())</span>
<span class="n">compressed_seconds_per_sample</span> <span class="o">=</span> <span class="n">compressed_total_seconds</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">compressed_milliseconds_per_sample</span> <span class="o">=</span> <span class="n">compressed_seconds_per_sample</span> <span class="o">*</span> <span class="mf">1000.0</span>
</code></pre></div>
<p>The decorator that does compression took around 35.224 ms to make a prediction and add it to the cache the first time that it saw the prediction inputs.</p>
<p>Most of the time for this step is taken up by the model's prediction algorithm.</p>
<p>Now we can try the same predictions again.</p>
<div class="highlight"><pre><span></span><code><span class="n">compressed_total_seconds</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="s2">"[compressing_decorated_model.predict(sample) for sample in samples]"</span><span class="p">,</span>
<span class="n">number</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="nb">globals</span><span class="o">=</span><span class="nb">globals</span><span class="p">())</span>
<span class="n">compressed_seconds_per_sample</span> <span class="o">=</span> <span class="n">compressed_total_seconds</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">compressed_milliseconds_per_sample</span> <span class="o">=</span> <span class="n">compressed_seconds_per_sample</span> <span class="o">*</span> <span class="mf">1000.0</span>
</code></pre></div>
<p>The decorator that uses compressed JSON took 0.906 ms to make a prediction the second time that it saw the prediction inputs.</p>
<p>The serder that uses JSON serialization and compression performs around the same as the JSON serder. The test we did with uncompressed JSON above took about 0.88 ms for each sample.</p>
<p>We can see how much space the cache entries is taking up by querying each key and summing up the number of bytes:</p>
<div class="highlight"><pre><span></span><code><span class="n">compressed_number_of_bytes</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">compressed_total_entries</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">r</span><span class="o">.</span><span class="n">scan_iter</span><span class="p">(</span><span class="s2">"json+compression*"</span><span class="p">):</span>
<span class="n">compressed_number_of_bytes</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">key</span><span class="p">))</span>
<span class="n">compressed_total_entries</span> <span class="o">=</span> <span class="n">compressed_total_entries</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">compressed_average_number_of_bytes</span> <span class="o">=</span> <span class="n">compressed_number_of_bytes</span> <span class="o">/</span> <span class="n">compressed_total_entries</span>
</code></pre></div>
<p>The keys in the original JSON cache took up a total of 20624 bytes. The keys in the MessagePack cache take up a total of 18018 bytes. The keys in the compressed JSON cache take up a total of 22627 bytes, and the average number of bytes per cache entry is 22.6.</p>
<p>The keys that were serialized with JSON and compressed were a few bytes bigger than the keys serialized and not compressed. It seems that compression is not saving memory in the cache, this is probably due to the small size of the entries and the fact that information was not repeated inside of the serialized data structures.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">FLUSHDB</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>OK
</code></pre></div>
<h3>Reducing the Precision of the Inputs</h3>
<p>We can also try to limit the size of the cache by reducing the number of possible inputs to the hashing function. We'll demonstrate this with a few examples.</p>
<p>We'll start by hashing a single sample of the input of the model:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.12345</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">model_input_dict</span> <span class="o">=</span> <span class="n">model_input</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="n">frozen_dict</span> <span class="o">=</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">keys</span><span class="p">()),</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
<span class="nb">hash</span><span class="p">(</span><span class="n">frozen_dict</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>-2801283067008197552
</code></pre></div>
<p>Next, we'll hash a very similar model input:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.12346</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">model_input_dict</span> <span class="o">=</span> <span class="n">model_input</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="n">frozen_dict</span> <span class="o">=</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">keys</span><span class="p">()),</span> <span class="nb">frozenset</span><span class="p">(</span><span class="n">model_input_dict</span><span class="o">.</span><span class="n">values</span><span class="p">())</span>
<span class="nb">hash</span><span class="p">(</span><span class="n">frozen_dict</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">5034586836711654789</span><span class="w"></span>
</code></pre></div>
<p>The hash value produced is the second time is completely different even though the "bmi" field only changed by 0.00001. This means that these two predictions will have two different cache entries even though they are very lilely to produce exactly the same prediction. Just to make sure, we'll make the predictions using these inputs:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.12345</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=8640.78)
</code></pre></div>
<p>Let's try the prediction and hash with a different value for the "bmi" field:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.12346</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=8640.78)
</code></pre></div>
<p>The prediction came out to be the same for both values of "bmi". However, the hashed value of the input was completely different. These predictions would be saved separately from each other in the cache, event though they are exactly the same. We can cut down on the number of entries in the cache by reducing the precision of floating point numbers so that these predictions can be cached one time instead of many. By rounding down the number we'll be reducing the number of cache entries that will be placed in the cache but also affecting the accuracy of the model's predictions. </p>
<p>The caching decorator supports this feature, we'll just enable it by adding the "reduced_precision" and "number_of_places" options to the configuration:</p>
<div class="highlight"><pre><span></span><code><span class="n">low_precision_caching_decorator</span> <span class="o">=</span> <span class="n">RedisCachingDecorator</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="mi">6379</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">prefix</span><span class="o">=</span><span class="s2">"low_precision"</span><span class="p">,</span>
<span class="n">reduced_precision</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">number_of_places</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">low_precision_decorated_model</span> <span class="o">=</span> <span class="n">low_precision_caching_decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>The first time we make a prediction, the model will be used and the prediction input will get the precision of the "bmi" field reduced to one decimal place, then the prediction will get serialized to JSON, and saved to the cache.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.12345</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">low_precision_decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=8640.78)
</code></pre></div>
<p>The second time the prediction is requested, the precision of the "bmi" field is reduced again in the same way, making the prediction input the same as before even though the values for the "bmi" field are not exactly the same. This will create the same hashed value which will retrieve the prediction from the cache and return it to the user.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.4321</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">low_precision_decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=8640.78)
</code></pre></div>
<p>The predictions are the same even though the inputs were different. We can view the keys in the cache like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">SCAN</span> <span class="mi">0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">0</span><span class="w"></span>
<span class="n">low_precision</span><span class="o">/</span><span class="n">insurance_charges_model</span><span class="o">/</span><span class="mf">0.1.0</span><span class="o">/-</span><span class="mf">4360805119606244359</span><span class="w"></span>
</code></pre></div>
<p>There's only one entry in the cache, which means that the first prediction was reused and no new entry was created for the second set of inputs.</p>
<p>Although this is not always an ideal way to save memory, there are some model deployments that can benefit from this approach. All that is needed is to analyze how much precision the model needs from its numerical inputs. It rarely makes sense to store predictions with an unlimited precision in their numerical inputs in the cache.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">FLUSHDB</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>OK
</code></pre></div>
<h2>Adding the Decorator to a Deployed Model</h2>
<p>Now that we have a working decorator, we can use it inside of a service alongside the model. To do this, we'll use the <a href="https://pypi.org/project/rest-model-service/">rest_model_service</a> package to quickly create a RESTful service. You can learn more about this package in <a href="https://www.tekhnoal.com/rest-model-service.html">this blog post</a>.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">rest_model_service</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To create a service for our model, all that is needed is that we add a YAML configuration file to the project. The configuration file looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Insurance Charges Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model.prediction.model.InsuranceChargesModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
<span class="w"> </span><span class="nt">decorators</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ml_model_caching.redis.RedisCachingDecorator</span><span class="w"></span>
<span class="w"> </span><span class="nt">configuration</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">host</span><span class="p">:</span><span class="w"> </span><span class="s">"localhost"</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">6379</span><span class="w"></span>
<span class="w"> </span><span class="nt">database</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">0</span><span class="w"></span>
</code></pre></div>
<p>The service_title field is the name of the service as it will appear in the documentation. The models field is an array that contains the details of the models we would like to deploy in the service. The class_path points at the MLModel class that implement's the model's prediction logic, in this case we'll be using the same model as in the examples above. The decorators field contains the details of the decorators that we want to attach to the model instance. We want to use the RedisCachingDecorator decorator class with the configuration we've used for local testing.</p>
<p>To run the service locally, execute these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>./configuration/rest_configuration.yaml
uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>The service should come up and can be accessed in a web browser at http://127.0.0.1:8000. When you access that URL using a web browser you will be redirected to the documentation page that is generated by the FastAPI package.</p>
<p>We can try out the service with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 65, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">male</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 50, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 5, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">southwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":46277.67}
</code></pre></div>
<p>We can check the Redis instance to make sure that the cache is being used:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">SCAN</span> <span class="mi">0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">0</span><span class="w"></span>
<span class="n">insurance_charges_model</span><span class="o">/</span><span class="mf">0.1.0</span><span class="o">/-</span><span class="mf">3948524794153351987</span><span class="w"></span>
</code></pre></div>
<p>By using the MLModel base class provided by the ml_base package and the REST service framework provided by the rest_model_service package we're able to quickly stand up a service to host the model. The decorator that we want to test can also be added to the model through configuration, including all of its parameters. </p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">FLUSHDB</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>OK
</code></pre></div>
<h2>Deploying the Caching Decorator</h2>
<p>Now that we have a working model and model service, we'll need to deploy it somewhere. We'll start by deploying the service locally using Docker. Once we have the service and Redis working locally, we'll deploy everything to a local Minikube instance.</p>
<h3>Creating a Docker Image</h3>
<p>Let's create a docker image and run it locally. The docker image is generated using instructions in the Dockerfile:</p>
<div class="highlight"><pre><span></span><code><span class="k">FROM</span><span class="w"> </span><span class="s">python:3.9-slim</span>
<span class="k">ARG</span><span class="w"> </span>BUILD_DATE
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.title<span class="o">=</span><span class="s2">"Caching for ML Models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.description<span class="o">=</span><span class="s2">"Caching for machine learning models."</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.created<span class="o">=</span><span class="nv">$BUILD_DATE</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.authors<span class="o">=</span><span class="s2">"6666331+schmidtbri@users.noreply.github.com"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.source<span class="o">=</span><span class="s2">"https://github.com/schmidtbri/caching-for-ml-models"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.version<span class="o">=</span><span class="s2">"0.1.0"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.licenses<span class="o">=</span><span class="s2">"MIT License"</span>
<span class="k">LABEL</span><span class="w"> </span>org.opencontainers.image.base.name<span class="o">=</span><span class="s2">"python:3.9-slim"</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">./service</span>
<span class="c"># installing git because we need to install the model package from the github repository</span>
<span class="k">RUN</span><span class="w"> </span>apt-get update
<span class="k">RUN</span><span class="w"> </span>apt-get --assume-yes install git
<span class="k">COPY</span><span class="w"> </span>./ml_model_caching ./ml_model_caching
<span class="k">COPY</span><span class="w"> </span>./configuration ./configuration
<span class="k">COPY</span><span class="w"> </span>./LICENSE ./LICENSE
<span class="k">COPY</span><span class="w"> </span>./service_requirements.txt ./service_requirements.txt
<span class="k">RUN</span><span class="w"> </span>pip install -r service_requirements.txt
<span class="k">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">"uvicorn"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rest_model_service.main:app"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--host"</span><span class="p">,</span><span class="w"> </span><span class="s2">"0.0.0.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--port"</span><span class="p">,</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">]</span>
</code></pre></div>
<p>The Dockerfile is used by this command to create a docker image:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">build</span> <span class="o">-</span><span class="n">t</span> <span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="n">latest</span> <span class="o">../</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make sure everything worked as expected, we'll look through the docker images in our system:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">insurance_charges_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>insurance_charges_model_service latest 2c8c19151e65 32 hours ago 1.26GB
</code></pre></div>
<p>Next, we'll start the image to see if everything is working as expected. To do this we'll create a local docker network and connect the redis container and the model service container to it.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">create</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">1</span><span class="n">d8ad0b59ad831f1c6205cea3e799ee31f40109006b9a02d39db8207a7e3f339</span><span class="w"></span>
</code></pre></div>
<p>We'll connect the running redis container that we were working with to the network.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">connect</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span>
</code></pre></div>
<p>Now we can start the service docker image connected to the same network as the redis container.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">8000</span><span class="p">:</span><span class="mi">8000</span> \
<span class="o">--</span><span class="n">net</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">REST_CONFIG</span><span class="o">=./</span><span class="n">configuration</span><span class="o">/</span><span class="n">local_rest_config</span><span class="o">.</span><span class="n">yaml</span> \
<span class="o">--</span><span class="n">name</span> <span class="n">insurance_charges_model_service</span> \
<span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="n">latest</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">83</span><span class="n">db77417dfa5cd33c3d7fabea8349df8b3932ef0cd2544a94b7d4958eed93bc</span><span class="w"></span>
</code></pre></div>
<p>Notice that we're using a different configuration file that has a different hostname for the Redis instance. The Redis container is not accessible at localhost from inside the network, so we need to use the hostname "local-redis" in the configuration.</p>
<p>The service should be accessible on port 8000 of localhost, so we'll try to make a prediction using the curl command running inside of a container connected to the network:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">it</span> <span class="o">--</span><span class="n">rm</span> \
<span class="o">--</span><span class="n">net</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span> \
<span class="n">curlimages</span><span class="o">/</span><span class="n">curl</span> \
<span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://insurance_charges_model_service:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 65, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">male</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 50, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 5, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">southwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":46277.67}
</code></pre></div>
<p>The model predicted that the insurance charges would be $46277.67 and also saved the prediction to the Redis cache. We can view the cache entries in Redis with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">SCAN</span> <span class="mi">0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">0</span><span class="w"></span>
<span class="n">insurance_charges_model</span><span class="o">/</span><span class="mf">0.1.0</span><span class="o">/</span><span class="mf">7732985413081947687</span><span class="w"></span>
</code></pre></div>
<p>The key in the cache has this value:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">exec</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">GET</span> <span class="n">insurance_charges_model</span><span class="o">/</span><span class="mf">0.1.0</span><span class="o">/</span><span class="mi">7732985413081947687</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges": 46277.67}
</code></pre></div>
<p>Since we didn't use MessagePack or Snappy compression the value is easily read as a plain JSON string.</p>
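<p>The key and value above hint at how the caching decorator builds its cache entries: the model's qualified name, the model version, and a hash of the prediction inputs. The sketch below illustrates one way such a key could be derived; the helper name and hashing scheme are assumptions for illustration, not the decorator's actual code.</p>

```python
import hashlib
import json


def make_cache_key(model_name: str, model_version: str, inputs: dict) -> str:
    # Serialize with sorted keys so that logically identical requests always
    # produce the same bytes, then apply a stable hash. (Python's built-in
    # hash() is salted per process, so it would not survive a service restart.)
    canonical = json.dumps(inputs, sort_keys=True).encode("utf-8")
    digest = int.from_bytes(hashlib.sha256(canonical).digest()[:8], "big", signed=True)
    return f"{model_name}/{model_version}/{digest}"


key = make_cache_key(
    "insurance_charges_model", "0.1.0",
    {"age": 65, "sex": "male", "bmi": 50, "children": 5,
     "smoker": True, "region": "southwest"},
)
print(key)
```

<p>Because the key is derived only from the model identity and the inputs, any repeat of the same request maps to the same cache entry.</p>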
<p>Now that we're done with the local redis instance we'll stop and remove the docker container.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">local</span><span class="o">-</span><span class="n">redis</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">insurance_charges_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">insurance_charges_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">rm</span> <span class="n">local</span><span class="o">-</span><span class="n">network</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>local-redis
local-redis
insurance_charges_model_service
insurance_charges_model_service
local-network
</code></pre></div>
<h2>Deploying the Solution</h2>
<p>To show the system in action, we’ll deploy the service and the Redis instance to a Kubernetes cluster. A local cluster can be easily started by using <a href="https://minikube.sigs.k8s.io/docs/">minikube</a>. Installation instructions can be found <a href="https://minikube.sigs.k8s.io/docs/start/">here</a>.</p>
<h3>Creating the Kubernetes Cluster</h3>
<p>To start the minikube cluster execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">start</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>😄 <span class="nv">minikube</span> <span class="nv">v1</span>.<span class="mi">26</span>.<span class="mi">1</span> <span class="nv">on</span> <span class="nv">Darwin</span> <span class="mi">12</span>.<span class="mi">5</span>
✨ <span class="nv">Using</span> <span class="nv">the</span> <span class="nv">virtualbox</span> <span class="nv">driver</span> <span class="nv">based</span> <span class="nv">on</span> <span class="nv">existing</span> <span class="nv">profile</span>
👍 <span class="nv">Starting</span> <span class="nv">control</span> <span class="nv">plane</span> <span class="nv">node</span> <span class="nv">minikube</span> <span class="nv">in</span> <span class="nv">cluster</span> <span class="nv">minikube</span>
🔄 <span class="nv">Restarting</span> <span class="nv">existing</span> <span class="nv">virtualbox</span> <span class="nv">VM</span> <span class="k">for</span> <span class="s2">"</span><span class="s">minikube</span><span class="s2">"</span> ...
🐳 <span class="nv">Preparing</span> <span class="nv">Kubernetes</span> <span class="nv">v1</span>.<span class="mi">24</span>.<span class="mi">3</span> <span class="nv">on</span> <span class="nv">Docker</span> <span class="mi">20</span>.<span class="mi">10</span>.<span class="mi">17</span> ...
🔎 <span class="nv">Verifying</span> <span class="nv">Kubernetes</span> <span class="nv">components</span>...
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">gcr</span>.<span class="nv">io</span><span class="o">/</span><span class="nv">k8s</span><span class="o">-</span><span class="nv">minikube</span><span class="o">/</span><span class="nv">storage</span><span class="o">-</span><span class="nv">provisioner</span>:<span class="nv">v5</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">kubernetesui</span><span class="o">/</span><span class="nv">dashboard</span>:<span class="nv">v2</span>.<span class="mi">6</span>.<span class="mi">0</span>
▪ <span class="nv">Using</span> <span class="nv">image</span> <span class="nv">kubernetesui</span><span class="o">/</span><span class="nv">metrics</span><span class="o">-</span><span class="nv">scraper</span>:<span class="nv">v1</span>.<span class="mi">0</span>.<span class="mi">8</span>
🌟 <span class="nv">Enabled</span> <span class="nv">addons</span>: <span class="nv">default</span><span class="o">-</span><span class="nv">storageclass</span>, <span class="nv">storage</span><span class="o">-</span><span class="nv">provisioner</span>, <span class="nv">dashboard</span>
🏄 <span class="nv">Done</span><span class="o">!</span> <span class="nv">kubectl</span> <span class="nv">is</span> <span class="nv">now</span> <span class="nv">configured</span> <span class="nv">to</span> <span class="nv">use</span> <span class="s2">"</span><span class="s">minikube</span><span class="s2">"</span> <span class="nv">cluster</span> <span class="nv">and</span> <span class="s2">"</span><span class="s">default</span><span class="s2">"</span> <span class="nv">namespace</span> <span class="nv">by</span> <span class="nv">default</span>
</code></pre></div>
<p>Let's view all of the pods running in the minikube cluster to make sure we can connect.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">-</span><span class="n">A</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAMESPACE              NAME                                         READY   STATUS    RESTARTS      AGE
kube-system            coredns-6d4b75cb6d-wrrwr                     1/1     Running   7 (9h ago)    2d10h
kube-system            etcd-minikube                                1/1     Running   7 (9h ago)    2d10h
kube-system            kube-apiserver-minikube                      0/1     Running   7 (9h ago)    2d10h
kube-system            kube-controller-manager-minikube             0/1     Running   6 (9h ago)    2d10h
kube-system            kube-proxy-5n4t9                             1/1     Running   7 (9h ago)    2d10h
kube-system            kube-scheduler-minikube                      1/1     Running   6 (9h ago)    2d10h
kube-system            storage-provisioner                          1/1     Running   12 (9h ago)   2d10h
kubernetes-dashboard   dashboard-metrics-scraper-78dbd9dbf5-d4zv8   1/1     Running   4 (9h ago)    2d10h
kubernetes-dashboard   kubernetes-dashboard-5fd5574d9f-7mjlt        1/1     Running   5 (9h ago)    2d10h
</code></pre></div>
<h3>Creating a Kubernetes Namespace</h3>
<p>Now that we have a cluster and are connected to it, we'll create a namespace to hold the resources for our model deployment. The resource definition is in the kubernetes/namespace.yaml file. To apply the manifest to the cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace/model-services created
</code></pre></div>
<p>To take a look at the namespaces, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">namespace</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME                   STATUS   AGE
default                Active   2d10h
kube-node-lease        Active   2d10h
kube-public            Active   2d10h
kube-system            Active   2d10h
kubernetes-dashboard   Active   2d10h
model-services         Active   2s
</code></pre></div>
<p>The new namespace should appear in the listing along with other namespaces created by default by the system. To use the new namespace for the rest of the operations, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">config</span> <span class="nb">set</span><span class="o">-</span><span class="n">context</span> <span class="o">--</span><span class="n">current</span> <span class="o">--</span><span class="n">namespace</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Context "minikube" modified.
</code></pre></div>
<h3>Creating the Redis Service</h3>
<p>Before we can deploy the model service we need to create the Redis service that will hold the cached predictions. For this service we will create a StatefulSet that manages two instances of the Redis service. We will use both instances from the decorator running in the model service.</p>
<p>A StatefulSet is similar to a Deployment because it deploys Pods that are based on an identical specification. However, a StatefulSet will maintain an identity for each Pod and each one will be able to keep internal state. This is important because the Redis service is saving the cache for us, which is stateful. </p>
<p>Using Redis in this manner is an example of sharding. Sharding is the process of splitting up data that is too big to fit on a single computer across multiple computers. By using sharding we make our data layer distributed, which makes it easier to scale in the future.</p>
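<p>The routing idea behind sharding can be sketched in a few lines: hash each cache key and map the hash onto one of the shards, so every key deterministically belongs to exactly one Redis instance. This is a simplified illustration of the concept, not the actual routing logic used by the service.</p>

```python
import zlib

SHARDS = ["redis-st-0", "redis-st-1"]  # the two Redis pods in the StatefulSet


def shard_for(key: str) -> str:
    # Hash the cache key and take it modulo the number of shards, so each
    # key always lives on exactly one Redis instance.
    return SHARDS[zlib.crc32(key.encode("utf-8")) % len(SHARDS)]


print(shard_for("insurance_charges_model/0.1.0/7732985413081947687"))
```

<p>As long as every client hashes keys the same way, reads and writes for a given key always land on the same instance.</p>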
<p>A more detailed diagram of our software architecture looks like this:</p>
<p><img alt="Better Software Architecture" src="https://www.tekhnoal.com/better_software_architecture_cfmlm.png" width="100%"></p>
<p>The Redis service is defined in the kubernetes/redis_service.yaml file. We can create it with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">redis_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>service/redis-service created
statefulset.apps/redis-st created
</code></pre></div>
<p>We can view the pods associated with this service:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">redis</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>redis-st-0   1/1   Running             0   4s
redis-st-1   0/1   ContainerCreating   0   1s
</code></pre></div>
<p>We wanted to create two instances of Redis in the StatefulSet. Because the pods are part of a StatefulSet, their names end with an ordinal index, and we'll be able to reach each individual pod from the model service.</p>
<p>The .yaml file also created a Service for the StatefulSet pods, which makes them accessible through DNS. We can view the service with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
redis-service   ClusterIP   None         <none>        6379/TCP   7s
</code></pre></div>
<h3>Creating a Model Deployment and Service</h3>
<p>The model service now has a Redis instance to access, so we'll be creating the model service resources. These are:</p>
<ul>
<li>Deployment: a declarative way to manage a set of pods; the model service pods are managed through the Deployment.</li>
<li>Service: a way to expose the pods in a Deployment; the model service is made available to the outside world through the Service. The service type is LoadBalancer, which means that a load balancer will be created for the service.</li>
</ul>
<p>The model service pod requires an extra container running inside of it to enable easy access to the Redis service. Because we sharded the Redis service into two instances, the caching decorator would need to be aware of both instances of Redis in order to access the right one for each cache entry. We can avoid this by adding an ambassador service to the model service pod. An ambassador takes care of interactions between the application
and any outside services. In this case, the ambassador container will take care of routing the cache request to the right Redis instance. We'll use <a href="https://github.com/twitter/twemproxy">Twemproxy</a> to act as the ambassador between the model service and the Redis instances.</p>
<p>The YAML for the ambassador container is defined in the Deployment resource of the model service and it looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nn">...</span><span class="w"></span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ambassador</span><span class="w"></span>
<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">image</span><span class="p p-Indicator">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">malexer/twemproxy</span><span class="w"></span>
<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">env</span><span class="p p-Indicator">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">REDIS_SERVERS</span><span class="w"></span>
<span class="w"> </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">redis-st-0.redis-service.model-services.svc.cluster.local:6379:1,redis-st-1.redis-service.model-services.svc.cluster.local:6379:1</span><span class="w"></span>
<span class="w">  </span><span class="nt">ports</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">containerPort</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">6380</span><span class="w"></span>
<span class="nn">...</span><span class="w"></span>
</code></pre></div>
<p>Notice that the ambassador is listening on localhost port 6380. We'll need to set this correctly in the caching decorator's configuration.</p>
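<p>Since the ambassador runs inside the same pod, the caching decorator can treat it as if it were a single local Redis instance. A hypothetical configuration fragment might look like this; the field names are illustrative, not the service's actual configuration schema.</p>

```python
# Illustrative settings for the caching decorator; the field names here are
# assumptions for illustration, not the service's actual schema.
cache_settings = {
    "host": "localhost",  # the ambassador container runs in the same pod
    "port": 6380,         # Twemproxy's listening port, not Redis's 6379
}
```

<p>The decorator never needs to know how many Redis shards exist; Twemproxy handles the routing.</p>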
<p>To start the model service, first we'll need to send the docker image from the local docker daemon to the minikube image cache:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">image</span> <span class="n">load</span> <span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="n">latest</span>
</code></pre></div>
<p>We can view the images in the minikube cache like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">cache</span> <span class="nb">list</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">insurance_charges_model_service</span><span class="o">:</span><span class="n">latest</span><span class="w"></span>
</code></pre></div>
<p>The model service, along with the ambassador, is created within the Kubernetes cluster with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps/insurance-charges-model-deployment created
service/insurance-charges-model-service created
</code></pre></div>
<p>The deployment and service for the model service were created together. You can see the new service with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">services</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">insurance</span><span class="o">-</span><span class="n">charges</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>insurance-charges-model-service NodePort 10.107.94.124 <none> 80:32440/TCP 3s
</code></pre></div>
<p>Minikube exposes the service on a local port; we can get a link to the endpoint with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">service</span> <span class="n">insurance</span><span class="o">-</span><span class="n">charges</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">service</span> <span class="o">--</span><span class="n">url</span> <span class="o">-</span><span class="n">n</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>http://192.168.59.100:32440
</code></pre></div>
<p>To make a prediction, we'll hit the service with a request:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">time</span> <span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://192.168.59.100:32440/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 65, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">male</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 22, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 5, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">southwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":25390.95}curl -X 'POST' -H 'accept: application/json' -H -d 0.01s user 0.01s system 8% cpu 0.158 total
</code></pre></div>
<p>The service and decorator are working! The prediction request took 0.158 seconds. We'll try the same prediction one more time to see if it takes less time.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">time</span> <span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://192.168.59.100:32440/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 65, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">male</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 22, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 5, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">southwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":25390.95}curl -X 'POST' -H 'accept: application/json' -H -d 0.01s user 0.01s system 55% cpu 0.022 total
</code></pre></div>
<p>The second time we made the prediction it took 0.022 seconds, which is faster than the first time we made the prediction. This tells us that the caching is working as expected.</p>
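<p>The speedup pattern we just measured is easy to demonstrate with a toy in-memory cache decorator; this is a simplified stand-in for the Redis-backed decorator, not the service's actual code.</p>

```python
import time
from functools import wraps


def cached(func):
    # Toy in-memory cache illustrating the pattern: the first call pays the
    # full inference cost, repeat calls with the same inputs are served
    # straight from the cache.
    store = {}

    @wraps(func)
    def wrapper(*args):
        if args not in store:
            store[args] = func(*args)
        return store[args]

    return wrapper


@cached
def slow_predict(age, bmi):
    time.sleep(0.05)  # stand-in for real model inference
    return 25390.95


start = time.perf_counter()
slow_predict(65, 22)
miss = time.perf_counter() - start

start = time.perf_counter()
slow_predict(65, 22)
hit = time.perf_counter() - start

print(f"cache miss: {miss:.3f}s, cache hit: {hit:.3f}s")
```

<p>Just as with the deployed service, the second call skips the expensive work entirely and only pays the cost of a lookup.</p>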
<p>We can review the contents of the Redis caches by executing the Redis CLI in the pods:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">exec</span> <span class="o">--</span><span class="n">stdin</span> <span class="o">--</span><span class="n">tty</span> <span class="n">redis</span><span class="o">-</span><span class="n">st</span><span class="o">-</span><span class="mi">1</span> <span class="o">--</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">SCAN</span> <span class="mi">0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">1</span><span class="p">)</span><span class="w"> </span><span class="s">"0"</span><span class="w"></span>
<span class="mf">2</span><span class="p">)</span><span class="w"> </span><span class="mf">1</span><span class="p">)</span><span class="w"> </span><span class="s">"insurance_charges_model/0.1.0/-4784352684431719157"</span><span class="w"></span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">exec</span> <span class="o">--</span><span class="n">stdin</span> <span class="o">--</span><span class="n">tty</span> <span class="n">redis</span><span class="o">-</span><span class="n">st</span><span class="o">-</span><span class="mi">1</span> <span class="o">--</span> <span class="n">redis</span><span class="o">-</span><span class="n">cli</span> <span class="n">GET</span> <span class="n">insurance_charges_model</span><span class="o">/</span><span class="mf">0.1.0</span><span class="o">/-</span><span class="mi">4784352684431719157</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>"{\"charges\": 25390.95}"
</code></pre></div>
<p>Notice that the cache entry was found in the second instance of Redis in the StatefulSet.</p>
<h3>Adding a Prediction ID</h3>
<p>The model has a single decorator working on it within the model service but we can add any number of decorators to add functionality. In a <a href="https://www.tekhnoal.com/ml-model-decorators.html">previous blog post</a> we created a decorator that added a unique prediction id to every prediction returned by the model. We can add this decorator to the service by simply changing the configuration:</p>
<div class="highlight"><pre><span></span><code><span class="nn">...</span><span class="w"></span>
<span class="nt">decorators</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">data_enrichment.prediction_id.PredictionIDDecorator</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ml_model_caching.redis.RedisCachingDecorator</span><span class="w"></span>
<span class="w"> </span><span class="nt">configuration</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">host</span><span class="p">:</span><span class="w"> </span><span class="s">"localhost"</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">6380</span><span class="w"></span>
<span class="w"> </span><span class="nt">database</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">0</span><span class="w"></span>
<span class="w"> </span><span class="nt">hashing_fields</span><span class="p">:</span><span class="w"> </span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">age</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">sex</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bmi</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">children</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">smoker</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">region</span><span class="w"></span>
<span class="nn">...</span><span class="w"></span>
</code></pre></div>
<p>The PredictionIDDecorator adds a unique identifier field to the prediction input data structure before the request is passed to the caching decorator. We need to leave this field out of the list of hashing fields because it should not contribute to the cache key; if the prediction_id field were included, every prediction request would hash to a unique key and the cache would never be hit.</p>
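<p>To make the exclusion concrete, here is an illustrative sketch of how a cache key might be derived from only the configured hashing fields. The real decorator appears to use a numeric hash (as seen in the Redis keys above); this hypothetical version uses SHA-256 so that the key is stable across processes.</p>

```python
import hashlib

def make_cache_key(model_name: str, model_version: str,
                   data: dict, hashing_fields: list) -> str:
    # Serialize only the configured hashing fields, in a fixed order,
    # so that fields like prediction_id never influence the key.
    serialized = "|".join(f"{f}={data[f]}" for f in sorted(hashing_fields))
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()[:16]
    return f"{model_name}/{model_version}/{digest}"
```

<p>Two requests that differ only in their prediction_id produce the same key, so the cached prediction is reused.</p>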
<p>This configuration is in the ./configuration/kubernetes_rest_config2.yaml file. We'll change the configuration file being used and recreate the Deployment again:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps/insurance-charges-model-deployment configured
service/insurance-charges-model-service unchanged
</code></pre></div>
<p>We'll try the service one more time:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://192.168.59.100:32440/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 65, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">sex</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">male</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 22, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">children</span><span class="se">\"</span><span class="s2">: 5, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">region</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">southwest</span><span class="se">\"</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":25390.95,"prediction_id":"1aed2c71-9451-4cba-8d42-640d4b9695d8"}
</code></pre></div>
<p>The service returned a unique identifier field called "prediction_id" along with the prediction. This field was generated by the decorator we added through configuration. A full explanation of how the prediction ID decorator works can be found in the previous blog post.</p>
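<p>The core idea of the prediction ID decorator can be sketched in a few lines. This is a hypothetical simplification, not the actual PredictionIDDecorator from the earlier post, which inherits from the MLModelDecorator base class and works with the model's pydantic schemas.</p>

```python
import uuid

class PredictionIDSketch:
    """Wraps a model and attaches a UUID to every prediction result."""

    def __init__(self, model):
        self._model = model

    def predict(self, data: dict) -> dict:
        result = dict(self._model.predict(data))
        # A new UUID4 per call makes each prediction individually traceable.
        result["prediction_id"] = str(uuid.uuid4())
        return result
```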
<p>This shows how easily decorators can be combined with models to perform more complex operations.</p>
<h3>Deleting the Resources</h3>
<p>Now that we're done with the service, we'll clean up the resources. To delete the Redis deployment, we'll delete the Kubernetes resources:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">redis_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>service "redis-service" deleted
statefulset.apps "redis-st" deleted
</code></pre></div>
<p>To delete the model service, we'll execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps "insurance-charges-model-deployment" deleted
service "insurance-charges-model-service" deleted
</code></pre></div>
<p>To delete the namespace:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="o">../</span><span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yaml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace "model-services" deleted
</code></pre></div>
<p>Lastly, to stop the kubernetes cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">minikube</span> <span class="n">stop</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>✋ Stopping node "minikube" ...
🛑 1 node stopped.
</code></pre></div>
<h2>Closing</h2>
<p>In this blog post, we showed how to build a decorator class that is able to cache predictions made by a machine learning model. Caching is a simple way to speed up predictions that we know can be reused and are requested often from a model. </p>
<p>The cache decorator classes can be applied to any model that uses the MLModel base class without having to modify the model class at all. The caching functionality is contained completely in the RedisCacheDecorator class. The same thing is true for the RESTful model service, the cache functionality did not need to be added to the service because we separated the concerns of the service and the cache decorator. We were able to add caching to the deployed model by modifying the configuration. By using decorators we’re able to create software components that can be reused in many different contexts. For example, if we chose to deploy the cache decorator in a gRPC service we should be able to do so as long as we instantiate and manage the decorator instance correctly.</p>
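<p>When managing the decorator instances yourself, for example in a gRPC service, the stacking that the YAML configuration expresses can be done directly in code. The helper below is a sketch that assumes the first entry in the decorators list becomes the outermost wrapper, matching the request flow described above; the constructor signatures are illustrative, not the libraries' actual APIs.</p>

```python
def build_decorated_model(model, decorator_specs):
    """Wrap a model with decorators; decorator_specs is a list of
    (decorator_class, config_dict) pairs, outermost first."""
    wrapped = model
    # Apply in reverse so the first spec ends up as the outer wrapper.
    for decorator_class, config in reversed(decorator_specs):
        wrapped = decorator_class(wrapped, **config)
    return wrapped
```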
<p>Combining the caching decorator with other decorators that require I/O like data enrichment is very easy because of the way that decorators can be "stacked" together. We showed how to do this in this blog post by adding a decorator that adds a unique identifier to each prediction.</p>Data Enrichment for ML Model Deployments2022-05-01T07:00:00-05:002022-05-01T07:00:00-05:00Brian Schmidttag:www.tekhnoal.com,2022-05-01:/data-enrichment-for-ml-models.html<p>Machine learning models need data to make predictions. When deploying a model to a production setting, this data is not necessarily available from the client system that is requesting the prediction. When this happens, some other source is needed for the data that is required by the model but not provided by the client system. The process of accessing the data and joining it to the client's prediction request is called data enrichment. In all cases, the model itself should not need to be modified in order to do data enrichment, the process should be transparent to the model. In this blog post, we'll show a method for doing data enrichment that does not require the model itself to be modified.</p><h1>Data Enrichment for ML Model Deployments</h1>
<p>In the <a href="https://www.tekhnoal.com/ml-model-decorators.html">previous blog post</a> we introduced the decorator pattern for ML model deployments and then showed how to use the pattern to build extensions for machine learning models. The extensions that we showed in the previous post were added without having to modify the machine learning model code at all, we were able to do it by creating a decorator class that wrapped the model. In this blog post we’ll use decorators to add data enrichment capabilities to an ML model.</p>
<h2>Introduction</h2>
<p>Machine learning models need data to make predictions. When deploying a model to a production setting, this data is not necessarily available from the client system that is requesting the prediction. When this happens, some other source is needed for the data that is required by the model but not provided by the client system. The process of accessing the data and joining it to the client's prediction request is called data enrichment. In all cases, the model itself should not need to be modified in order to do data enrichment, the process should be transparent to the model. In this blog post, we'll show a method for doing data enrichment that does not require the model itself to be modified.</p>
<p>Data enrichment is often done because the client system does not have access to the data that the model needs to make a prediction. In this case, the client must provide a field that can be used to look up the missing data; we'll call this the "index field". For example, in order to load the customer details needed for a prediction, we need a customer id field that uniquely identifies the customer record. Once the data is loaded from the data source, the model can be called to make a prediction using the fields that it expects.</p>
<p>Other times, the client system is simply not the right place to manage the data that the model needs for predictions because of its complexity. In this case, we would like to prevent the client system from having to manage data that does not really fall within its responsibilities. To allow the client system to use the model without having to manage the extra data, we can add data enrichment capabilities to the model deployment.</p>
<p>Data enrichment simplifies the work of the client system because the client only needs to tell the deployed ML model how to find the correct data. The model deployment is then responsible for fetching the correct record, joining it to the data provided by the client system, and making a prediction. Data enrichment also keeps the client system from having to manage the data needed by the model, which prevents the two systems from becoming too tightly coupled. </p>
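<p>The fetch-join-predict flow just described can be sketched with a minimal wrapper. An in-memory dictionary stands in for the database and the index field name is hypothetical; the real decorator built later in this post issues a query against PostgreSQL instead.</p>

```python
class SimpleEnrichmentDecorator:
    """Illustrative enrichment wrapper: look up a record by the index
    field, merge it with the client's request, and call the model."""

    def __init__(self, model, data_source: dict, index_field: str):
        self._model = model
        self._data_source = data_source   # stand-in for a database table
        self._index_field = index_field

    def predict(self, data: dict):
        # Fetch the record identified by the index field and merge it
        # with the fields the client supplied.
        record = self._data_source[data[self._index_field]]
        enriched = {**data, **record}
        # In this sketch the index field is dropped so the model only
        # receives the fields it expects.
        del enriched[self._index_field]
        return self._model.predict(enriched)
```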
<p>One more benefit of doing data enrichment is that the model can evolve by using new fields for predictions without affecting the client system at all. By having the model access the data that it needs to make a prediction, the model can access new data and the client system is not responsible for providing or managing the new fields. This allows the deployed model to evolve more easily.</p>
<p>In this blog post, we’ll show how to create a simple decorator that is able to access a database in order to do data enrichment for an ML model that is deployed to a production system. We'll also show how to deploy the decorator along with the model to a RESTful service, and how to create the necessary database to hold the data.</p>
<p>All of the code is available in this <a href="https://github.com/schmidtbri/data-enrichment-for-ml-models">github repository</a>.</p>
<h2>Software Architecture</h2>
<p>The decorator that we will be building requires an outside database in order to access data to do data enrichment. The software architecture will be a little more complicated because we’ll have to deploy a service for the model as well as a database for the data.</p>
<p><img alt="Software Architecture" src="https://www.tekhnoal.com/software_architecture_defml.png" width="100%"></p>
<p>The client system accesses the model by reaching out to the model service which hosts both the model and the decorator that we will be building in this blog post. The decorator is the software component that does the data enrichment needed by the model. The decorator reaches out to the database to access data needed by the model, provides the data to the model to make a prediction, and then returns the prediction to the client system. </p>
<p>To store the data that we want to use for enrichment, we’ll use a PostgreSQL database.</p>
<h1>Installing a Model</h1>
<p>To make this blog post a little shorter we won't train a completely new model. Instead we'll install a model that we've built in <a href="https://www.tekhnoal.com/regression-model.html">a previous blog post</a>. The code for the model is in <a href="https://github.com/schmidtbri/regression-model">this github repository</a>.</p>
<p>To install the model, we can use the pip command and point it at the github repo of the model.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">e</span> <span class="n">git</span><span class="o">+</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">schmidtbri</span><span class="o">/</span><span class="n">regression</span><span class="o">-</span><span class="n">model</span><span class="c1">#egg=insurance_charges_model</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make a prediction with the model, we'll import the model's class.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.model</span> <span class="kn">import</span> <span class="n">InsuranceChargesModel</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Now we can instantiate the model:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">InsuranceChargesModel</span><span class="p">()</span>
</code></pre></div>
<p>To make a prediction, we'll need to use the model's input schema class.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.schemas</span> <span class="kn">import</span> <span class="n">InsuranceChargesModelInput</span><span class="p">,</span> \
<span class="n">SexEnum</span><span class="p">,</span> <span class="n">RegionEnum</span>
<span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span>
<span class="n">age</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">female</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">24.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">northwest</span><span class="p">)</span>
</code></pre></div>
<p>The model's input schema is called InsuranceChargesModelInput and it encompasses all of the features required by the model to make a prediction.</p>
<p>Now we can make a prediction with the model by calling the predict() method with an instance of the InsuranceChargesModelInput class.</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=8640.78)
</code></pre></div>
<p>The model predicts that the charges will be $8640.78.</p>
<p>When deploying the model we’ll pretend that the age, sex, bmi, children, smoker, and region fields are not available from the client system that is calling the model. Because of this, we’ll need to add them to the model input by loading the data from the database.</p>
<p>We can view the input schema of the model as a JSON schema document by calling the .schema() method on the input schema class.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'InsuranceChargesModelInput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Schema for input of the model's predict method."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'age'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age of primary beneficiary in years.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">65</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'sex'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Sex'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Gender of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/SexEnum'</span><span class="p">}]},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'bmi'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body Mass Index'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body mass index of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mf">15.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mf">50.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'number'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'children'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Children'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Number of children covered by health insurance.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'smoker'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Smoker'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Whether beneficiary is a smoker.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'boolean'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'region'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region where beneficiary lives.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/RegionEnum'</span><span class="p">}]}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'definitions'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'SexEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'SexEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'sex' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'male'</span><span class="p">,</span><span class="w"> </span><span class="s1">'female'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'region' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'southwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'southeast'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northeast'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}}}</span><span class="w"></span>
</code></pre></div>
<h2>Creating the Data Enrichment Decorator</h2>
<p>A decorator needs to inherit from the MLModelDecorator base class, which requires that a specific set of methods and properties be implemented. The decorator that can access PostgreSQL looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span><span class="p">,</span> <span class="n">create_model</span>
<span class="kn">import</span> <span class="nn">psycopg2</span>
<span class="kn">from</span> <span class="nn">ml_base.decorator</span> <span class="kn">import</span> <span class="n">MLModelDecorator</span>
<span class="kn">from</span> <span class="nn">ml_base.ml_model</span> <span class="kn">import</span> <span class="n">MLModelSchemaValidationException</span>
<span class="k">class</span> <span class="nc">PostgreSQLEnrichmentDecorator</span><span class="p">(</span><span class="n">MLModelDecorator</span><span class="p">):</span>
<span class="sd">"""Decorator to do data enrichment using a PostgreSQL database."""</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">host</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">port</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">username</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">password</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">database</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="n">table</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">index_field_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">index_field_type</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="n">enrichment_fields</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="c1"># if password has ${}, then replace with environment variable</span>
<span class="k">if</span> <span class="n">password</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"${"</span> <span class="ow">and</span> <span class="n">password</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"}"</span><span class="p">:</span>
<span class="n">password</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="n">password</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="n">host</span><span class="p">,</span> <span class="n">port</span><span class="o">=</span><span class="n">port</span><span class="p">,</span> <span class="n">username</span><span class="o">=</span><span class="n">username</span><span class="p">,</span> <span class="n">password</span><span class="o">=</span><span class="n">password</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="n">database</span><span class="p">,</span> <span class="n">table</span><span class="o">=</span><span class="n">table</span><span class="p">,</span> <span class="n">index_field_name</span><span class="o">=</span><span class="n">index_field_name</span><span class="p">,</span>
<span class="n">index_field_type</span><span class="o">=</span><span class="n">index_field_type</span><span class="p">,</span> <span class="n">enrichment_fields</span><span class="o">=</span><span class="n">enrichment_fields</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_connection"</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="n">BaseModel</span><span class="p">:</span>
<span class="c1"># converting the index field type from a string to a class</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">index_field_type</span> <span class="o">=</span> <span class="n">__builtins__</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"index_field_type"</span><span class="p">]]</span>
<span class="k">except</span> <span class="ne">TypeError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="n">index_field_type</span> <span class="o">=</span> <span class="n">__builtins__</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"index_field_type"</span><span class="p">]]</span>
<span class="n">input_schema</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">input_schema</span>
<span class="c1"># adding index field to schema because it is required in order to retrieve</span>
<span class="c1"># the right record in the database</span>
<span class="n">fields</span> <span class="o">=</span> <span class="p">{</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"index_field_name"</span><span class="p">]:</span> <span class="p">(</span><span class="n">index_field_type</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">field_name</span><span class="p">,</span> <span class="n">schema</span> <span class="ow">in</span> <span class="n">input_schema</span><span class="o">.</span><span class="n">__fields__</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
<span class="c1"># remove enrichment_fields from schema because they'll be added from the</span>
<span class="c1"># database and don't need to be provided by the client</span>
<span class="k">if</span> <span class="n">field_name</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"enrichment_fields"</span><span class="p">]:</span>
<span class="k">if</span> <span class="n">schema</span><span class="o">.</span><span class="n">required</span><span class="p">:</span>
<span class="n">fields</span><span class="p">[</span><span class="n">field_name</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">schema</span><span class="o">.</span><span class="n">type_</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">fields</span><span class="p">[</span><span class="n">field_name</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">schema</span><span class="o">.</span><span class="n">type_</span><span class="p">,</span> <span class="n">schema</span><span class="o">.</span><span class="n">default</span><span class="p">)</span>
<span class="n">new_input_schema</span> <span class="o">=</span> <span class="n">create_model</span><span class="p">(</span>
<span class="n">input_schema</span><span class="o">.</span><span class="vm">__name__</span><span class="p">,</span>
<span class="o">**</span><span class="n">fields</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">new_input_schema</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="c1"># create a connection to the database, if it doesn't exist already</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_connection"</span><span class="p">]</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_connection"</span><span class="p">]</span> <span class="o">=</span> <span class="n">psycopg2</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"host"</span><span class="p">],</span>
<span class="n">port</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"port"</span><span class="p">],</span>
<span class="n">database</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"database"</span><span class="p">],</span>
<span class="n">user</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"username"</span><span class="p">],</span>
<span class="n">password</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"password"</span><span class="p">])</span>
<span class="n">cursor</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_connection"</span><span class="p">]</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="c1"># build a SELECT statement using the index_field and the enrichment_fields</span>
<span class="n">enrichment_fields</span> <span class="o">=</span> <span class="s2">", "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"enrichment_fields"</span><span class="p">])</span>
<span class="n">sql_statement</span> <span class="o">=</span> <span class="s2">"SELECT </span><span class="si">{}</span><span class="s2"> FROM </span><span class="si">{}</span><span class="s2"> WHERE </span><span class="si">{}</span><span class="s2"> = </span><span class="si">%s</span><span class="s2">;"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
<span class="n">enrichment_fields</span><span class="p">,</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"table"</span><span class="p">],</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"index_field_name"</span><span class="p">])</span>
<span class="c1"># executing the SELECT statement</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">sql_statement</span><span class="p">,</span>
<span class="p">(</span><span class="nb">getattr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"index_field_name"</span><span class="p">]),</span> <span class="p">))</span>
<span class="n">records</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">fetchall</span><span class="p">()</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">records</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Could not find a record for data enrichment."</span><span class="p">)</span>
<span class="k">elif</span> <span class="nb">len</span><span class="p">(</span><span class="n">records</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
<span class="n">record</span> <span class="o">=</span> <span class="n">records</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Query returned more than one record."</span><span class="p">)</span>
<span class="c1"># creating an instance of the model's input schema using the fields that</span>
<span class="c1"># came back from the database and fields that are provided by calling code</span>
<span class="n">input_schema</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">input_schema</span>
<span class="n">enriched_data</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">field_name</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">__fields__</span><span class="o">.</span><span class="n">keys</span><span class="p">():</span>
<span class="k">if</span> <span class="n">field_name</span> <span class="o">==</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"index_field_name"</span><span class="p">]:</span>
<span class="k">pass</span>
<span class="k">elif</span> <span class="n">field_name</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"enrichment_fields"</span><span class="p">]:</span>
<span class="n">field_index</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_configuration</span><span class="p">[</span><span class="s2">"enrichment_fields"</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">field_name</span><span class="p">)</span>
<span class="n">enriched_data</span><span class="p">[</span><span class="n">field_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">record</span><span class="p">[</span><span class="n">field_index</span><span class="p">]</span>
<span class="k">elif</span> <span class="n">field_name</span> <span class="ow">in</span> <span class="n">data</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span><span class="o">.</span><span class="n">keys</span><span class="p">():</span>
<span class="n">enriched_data</span><span class="p">[</span><span class="n">field_name</span><span class="p">]</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">field_name</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Could not find value for field '</span><span class="si">{}</span><span class="s2">'."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">field_name</span><span class="p">))</span>
<span class="c1"># making a prediction with the model, using the enriched fields</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">enriched_data</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">input_schema</span><span class="p">(</span><span class="o">**</span><span class="n">enriched_data</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">ValueError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">MLModelSchemaValidationException</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">))</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">enriched_data</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
<span class="k">def</span> <span class="fm">__del__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_connection"</span><span class="p">]</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_connection"</span><span class="p">]</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">except</span> <span class="ne">KeyError</span><span class="p">:</span>
<span class="k">pass</span>
</code></pre></div>
<p>The code is quite long, but it is mainly made up of two methods: the <code>input_schema</code> property and the <code>predict</code> method. The <code>input_schema</code> property modifies the model's input schema according to the requirements of the data enrichment we want to do. The <code>predict</code> method is responsible for retrieving the data needed by the model from the database and joining it to the data already provided by the client system.</p>
<p>The <code>__init__()</code> method accepts configuration that is used to customize the way that the decorator finds data in the database. The decorator accepts these parameters:</p>
<ul>
<li>host: hostname for connecting to the database server</li>
<li>port: port for connecting to the database server</li>
<li>username: username for accessing the database</li>
<li>password: password for accessing the database</li>
<li>database: name of the database in which the enrichment data is stored</li>
<li>table: name of the table in the database where data used for enrichment is found</li>
<li>index_field_name: name of the field used for selecting a record</li>
<li>index_field_type: type of the index field</li>
<li>enrichment_fields: names of the fields that will be added to the data sent to the model to make a prediction</li>
</ul>
<p>The configuration is saved by passing it up to the super class using the <code>super().__init__()</code> method. The configuration values can then be accessed inside of the decorator instance in the <code>self._configuration</code> attribute, which is a dictionary.</p>
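<p>The password parameter can also reference an environment variable using the <code>${VAR}</code> syntax, so that secrets don't need to be hard-coded in configuration. The substitution logic from the <code>__init__()</code> method can be sketched on its own (the <code>DB_PASSWORD</code> variable name below is just an example):</p>

```python
import os


def resolve_password(password: str) -> str:
    # passwords of the form "${VAR}" are replaced with the value
    # of the corresponding environment variable
    if password.startswith("${") and password.endswith("}"):
        return os.environ[password[2:-1]]
    return password


os.environ["DB_PASSWORD"] = "s3cret"
print(resolve_password("${DB_PASSWORD}"))  # s3cret
print(resolve_password("plain_password"))  # plain_password
```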
<p>When the decorator is applied to a model, it modifies the input_schema of the model. It removes the enrichment_fields from the input schema because these fields are going to be added from the database. This means that the client does not need to provide values for them anymore. It also adds the index_field to the input schema because the decorator needs to use this field to access the correct record in the database table. The index_field is added as a required field in the model’s input_schema because the decorator always needs it.</p>
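<p>The schema rebuild relies on pydantic's <code>create_model</code> function. As a simplified sketch (the field names here are illustrative), removing an enrichment field and adding the required index field looks like this:</p>

```python
from pydantic import BaseModel, create_model


class InputSchema(BaseModel):
    age: int    # will be enriched from the database
    bmi: float  # still provided by the client


# drop "age" from the schema and add the required index field "ssn"
fields = {"ssn": (str, ...), "bmi": (float, ...)}
NewInputSchema = create_model("InputSchema", **fields)

instance = NewInputSchema(ssn="123-45-6789", bmi=22.5)
print(instance.ssn)  # 123-45-6789
```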
<p>When a prediction request is made to the decorator, it uses the value in the index_field to access the record in the database table. If the decorator finds the record in the table, it selects the enrichment fields, creates a new input object for the model, and sends it to the model. If the record is not found, or if more than one record is returned from the database, the decorator raises an exception. The index_field itself is not sent to the model at all; it is used purely to access the data needed by the model in the database.</p>
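<p>The merge step can be illustrated in isolation. Enrichment fields are taken from the database record by position, and the remaining fields are taken from the client request (the field names and values below are made up for the example):</p>

```python
enrichment_fields = ["age", "sex"]
record = (65, "male")  # row returned by the SELECT, in enrichment_fields order
client_data = {"ssn": "123-45-6789", "bmi": 22.5, "children": 1}
model_fields = ["age", "sex", "bmi", "children"]

enriched_data = {}
for field_name in model_fields:
    if field_name in enrichment_fields:
        # value comes from the database record
        enriched_data[field_name] = record[enrichment_fields.index(field_name)]
    else:
        # value comes from the client request ("ssn" is never passed along)
        enriched_data[field_name] = client_data[field_name]

print(enriched_data)
# {'age': 65, 'sex': 'male', 'bmi': 22.5, 'children': 1}
```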
<p>The SQL statement is built dynamically based on the fields required by the model and the index field selected through configuration. For example, if we wanted to do enrichment with all of the input fields of the InsuranceChargesModel, the SELECT statement would look like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">sex</span><span class="p">,</span><span class="w"> </span><span class="n">bmi</span><span class="p">,</span><span class="w"> </span><span class="n">children</span><span class="p">,</span><span class="w"> </span><span class="n">smoker</span><span class="p">,</span><span class="w"> </span><span class="n">region</span><span class="w"></span>
<span class="k">FROM</span><span class="w"> </span><span class="n">clients</span><span class="w"></span>
<span class="k">WHERE</span><span class="w"> </span><span class="n">ssn</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'123-45-6789'</span><span class="w"></span>
</code></pre></div>
<p>In this case we would be accessing a client record by using their social security number as the index field.</p>
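<p>The statement itself is assembled with simple string formatting for the field, table, and index names, while the index value is passed separately as a query parameter so that psycopg2 can escape it safely. A minimal sketch of the string building:</p>

```python
enrichment_fields = ["age", "sex", "bmi", "children", "smoker", "region"]
table = "clients"
index_field_name = "ssn"

# field, table, and index names come from trusted configuration;
# the index *value* is left as a %s placeholder for psycopg2 to fill in
field_list = ", ".join(enrichment_fields)
sql_statement = "SELECT {} FROM {} WHERE {} = %s;".format(
    field_list, table, index_field_name)

print(sql_statement)
# SELECT age, sex, bmi, children, smoker, region FROM clients WHERE ssn = %s;
```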
<h2>Decorating the Model</h2>
<p>To test out the decorator we’ll first instantiate the model object that we want to use with the decorator.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">InsuranceChargesModel</span><span class="p">()</span>
</code></pre></div>
<p>Next, we’ll instantiate the decorator with the parameters.</p>
<div class="highlight"><pre><span></span><code><span class="n">decorator</span> <span class="o">=</span> <span class="n">PostgreSQLEnrichmentDecorator</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">username</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">password</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">table</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">index_field_name</span><span class="o">=</span><span class="s2">"ssn"</span><span class="p">,</span>
<span class="n">index_field_type</span><span class="o">=</span><span class="s2">"str"</span><span class="p">,</span>
<span class="n">enrichment_fields</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span> <span class="s2">"children"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">])</span>
</code></pre></div>
<p>We won't fill in the database connection details because we don't have a database to connect to yet. However, we can still see how the model's input and output schemas change because of the decorator. In this example, we'll use a client's social security number to uniquely identify records in the database table.</p>
<p>We can add the model instance to the decorator after it’s been instantiated like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span> <span class="o">=</span> <span class="n">decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>We can see the decorator and the model objects by printing the reference to the decorator:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>PostgreSQLEnrichmentDecorator(InsuranceChargesModel)
</code></pre></div>
<p>The decorator object prints out its own type along with the type of the model that it is decorating.</p>
<p>Now we’ll try to use the decorator and the model together by doing a few things. First, we’ll look at the model input schema:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{'title': 'InsuranceChargesModelInput',
'type': 'object',
'properties': {'ssn': {'title': 'Ssn', 'type': 'string'}},
'required': ['ssn']}
</code></pre></div>
<p>As we can see, the input schema is no longer the same as what the model exposed: all of the model's input fields have been removed because they are now provided by the decorator from the database. The user of the model is not expected to provide values for those fields. However, there is a new field in the schema, the "ssn" field, which the decorator uses to select the correct record in the database.</p>
<p>We can also use a few fields from the database and require the client to provide the rest. To do this we'll instantiate the decorator with a few, but not all, of the fields required by the model as enrichment fields.</p>
<div class="highlight"><pre><span></span><code><span class="n">decorator</span> <span class="o">=</span> <span class="n">PostgreSQLEnrichmentDecorator</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">username</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">password</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">table</span><span class="o">=</span><span class="s2">""</span><span class="p">,</span>
<span class="n">index_field_name</span><span class="o">=</span><span class="s2">"ssn"</span><span class="p">,</span>
<span class="n">index_field_type</span><span class="o">=</span><span class="s2">"str"</span><span class="p">,</span>
<span class="n">enrichment_fields</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">])</span>
<span class="n">decorated_model</span> <span class="o">=</span> <span class="n">decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{'title': 'InsuranceChargesModelInput',
'type': 'object',
'properties': {'ssn': {'title': 'Ssn', 'type': 'string'},
'bmi': {'title': 'Bmi', 'minimum': 15.0, 'maximum': 50.0, 'type': 'number'},
'children': {'title': 'Children',
'minimum': 0,
'maximum': 5,
'type': 'integer'}},
'required': ['ssn']}
</code></pre></div>
<p>The model's input schema now requires the fields that are not listed as enrichment fields to be provided by the client. The "ssn" field is still added because the decorator needs it in order to retrieve the enrichment fields from the database.</p>
<p>Next, we’ll look at the decorated model’s output schema:</p>
<div class="highlight"><pre><span></span><code><span class="n">output_schema</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">output_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
<span class="n">output_schema</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">InsuranceChargesModelOutput</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">description</span><span class="s1">'</span>: <span class="s2">"</span><span class="s">Schema for output of the model's predict method.</span><span class="s2">"</span>,
<span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">object</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">properties</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">charges</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">Charges</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">description</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">Individual medical costs billed by health insurance to customer in US dollars.</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">number</span><span class="s1">'</span>}}}
</code></pre></div>
<p>The output schema has not changed at all; the decorator does not modify the prediction or the schema of the prediction returned by the model.</p>
<h2>Creating a Database</h2>
<p>Now that we have a model and a decorator that can add data to the input of the model, we need a database table to pull data from. To do this we’ll first start a PostgreSQL instance in a local Docker container.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">--</span><span class="n">name</span> <span class="n">postgres</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">5432</span><span class="p">:</span><span class="mi">5432</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">POSTGRES_USER</span><span class="o">=</span><span class="n">data_enrichment_user</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">POSTGRES_PASSWORD</span><span class="o">=</span><span class="n">data_enrichment_password</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">POSTGRES_DB</span><span class="o">=</span><span class="n">data_enrichment</span> \
<span class="o">-</span><span class="n">d</span> <span class="n">postgres</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">695889</span><span class="n">c4c39617d44b158d7307d431180b1358e62ad07bdf26347a85f725468e</span><span class="w"></span>
</code></pre></div>
<p>We can connect to the database by running the psql client in a separate container on the host network and executing a SQL statement.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">it</span> <span class="o">--</span><span class="n">rm</span> \
<span class="o">--</span><span class="n">network</span><span class="o">=</span><span class="s2">"host"</span> <span class="n">postgres</span> \
<span class="n">psql</span> <span class="n">postgresql</span><span class="p">:</span><span class="o">//</span><span class="n">data_enrichment_user</span><span class="p">:</span><span class="n">data_enrichment_password</span><span class="o">@</span><span class="mf">127.0.0.1</span><span class="p">:</span><span class="mi">5432</span><span class="o">/</span><span class="n">data_enrichment</span> \
<span class="o">-</span><span class="n">c</span> <span class="s2">"SELECT current_database();"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> current_database
------------------
data_enrichment
(1 row)
</code></pre></div>
<p>The current database within the server is called "data_enrichment" and it was created when the container started.</p>
<p>Next we'll execute a SQL statement that creates a table within the database.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">it</span> <span class="o">--</span><span class="n">rm</span> \
<span class="o">--</span><span class="n">network</span><span class="o">=</span><span class="s2">"host"</span> <span class="n">postgres</span> \
<span class="n">psql</span> <span class="n">postgresql</span><span class="p">:</span><span class="o">//</span><span class="n">data_enrichment_user</span><span class="p">:</span><span class="n">data_enrichment_password</span><span class="o">@</span><span class="mf">127.0.0.1</span><span class="p">:</span><span class="mi">5432</span><span class="o">/</span><span class="n">data_enrichment</span> \
<span class="o">-</span><span class="n">c</span> <span class="s2">"CREATE TABLE clients ( </span><span class="se">\</span>
<span class="s2"> ssn varchar(11) PRIMARY KEY, </span><span class="se">\</span>
<span class="s2"> first_name varchar(30) NOT NULL, </span><span class="se">\</span>
<span class="s2"> last_name varchar(30) NOT NULL, </span><span class="se">\</span>
<span class="s2"> age integer NOT NULL, </span><span class="se">\</span>
<span class="s2"> sex varchar(6) NOT NULL, </span><span class="se">\</span>
<span class="s2"> bmi integer NOT NULL, </span><span class="se">\</span>
<span class="s2"> children integer NOT NULL, </span><span class="se">\</span>
<span class="s2"> smoker boolean NOT NULL, </span><span class="se">\</span>
<span class="s2"> region varchar(10) NOT NULL </span><span class="se">\</span>
<span class="s2">);"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>CREATE TABLE
</code></pre></div>
<p>The table has been created. We can view its schema like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">it</span> <span class="o">--</span><span class="n">rm</span> \
<span class="o">--</span><span class="n">network</span> <span class="n">host</span> <span class="n">postgres</span> \
<span class="n">psql</span> <span class="n">postgresql</span><span class="p">:</span><span class="o">//</span><span class="n">data_enrichment_user</span><span class="p">:</span><span class="n">data_enrichment_password</span><span class="o">@</span><span class="mf">127.0.0.1</span><span class="p">:</span><span class="mi">5432</span><span class="o">/</span><span class="n">data_enrichment</span> \
<span class="o">-</span><span class="n">c</span> <span class="s2">"\d clients"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="n">Table</span><span class="w"> </span><span class="s2">"public.clients"</span><span class="w"></span>
<span class="w"> </span><span class="n">Column</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Type</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Collation</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Nullable</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Default</span><span class="w"> </span>
<span class="o">------------+-----------------------+-----------+----------+---------</span><span class="w"></span>
<span class="w"> </span><span class="n">ssn</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">character</span><span class="w"> </span><span class="n">varying</span><span class="p">(</span><span class="mi">11</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nb nb-Type">null</span><span class="w"> </span><span class="o">|</span><span class="w"> </span>
<span class="w"> </span><span class="n">first_name</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">character</span><span class="w"> </span><span class="n">varying</span><span class="p">(</span><span class="mi">30</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nb nb-Type">null</span><span class="w"> </span><span class="o">|</span><span class="w"> </span>
<span class="w"> </span><span class="n">last_name</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">character</span><span class="w"> </span><span class="n">varying</span><span class="p">(</span><span class="mi">30</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nb nb-Type">null</span><span class="w"> </span><span class="o">|</span><span class="w"> </span>
<span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">integer</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nb nb-Type">null</span><span class="w"> </span><span class="o">|</span><span class="w"> </span>
<span class="w"> </span><span class="n">sex</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">character</span><span class="w"> </span><span class="n">varying</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nb nb-Type">null</span><span class="w"> </span><span class="o">|</span><span class="w"> </span>
<span class="w"> </span><span class="n">bmi</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">integer</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nb nb-Type">null</span><span class="w"> </span><span class="o">|</span><span class="w"> </span>
<span class="w"> </span><span class="n">children</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">integer</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nb nb-Type">null</span><span class="w"> </span><span class="o">|</span><span class="w"> </span>
<span class="w"> </span><span class="n">smoker</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">boolean</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nb nb-Type">null</span><span class="w"> </span><span class="o">|</span><span class="w"> </span>
<span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">character</span><span class="w"> </span><span class="n">varying</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="ow">not</span><span class="w"> </span><span class="nb nb-Type">null</span><span class="w"> </span><span class="o">|</span><span class="w"> </span>
<span class="n">Indexes</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="s2">"clients_pkey"</span><span class="w"> </span><span class="n">PRIMARY</span><span class="w"> </span><span class="n">KEY</span><span class="p">,</span><span class="w"> </span><span class="n">btree</span><span class="w"> </span><span class="p">(</span><span class="n">ssn</span><span class="p">)</span><span class="w"></span>
</code></pre></div>
<p>The table has columns for all of the fields that the model requires to make a prediction, plus two columns for the client's first and last name. It also has a primary key column called "ssn" because we'll be referencing each record by a fake Social Security number. The ssn field uniquely identifies each record and is a convenient key for correlating data across systems.</p>
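<p>To make the lookup concrete, here is a sketch of how an enrichment query could be built from the table name, index field, and enrichment fields. The helper name is hypothetical and the decorator's internals may differ; the important detail is that the lookup value goes in as a <code>%s</code> parameter rather than being interpolated into the SQL string.</p>

```python
def build_enrichment_query(table, index_field, enrichment_fields):
    """Build a parameterized SELECT for fetching enrichment fields by key.

    The %s placeholder keeps the lookup value out of the SQL string,
    which is how psycopg2 expects query parameters to be passed.
    """
    columns = ", ".join(enrichment_fields)
    return f"SELECT {columns} FROM {table} WHERE {index_field} = %s;"


query = build_enrichment_query(
    "clients", "ssn", ["age", "sex", "bmi", "children", "smoker", "region"])
print(query)
# SELECT age, sex, bmi, children, smoker, region FROM clients WHERE ssn = %s;
```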
<p>Next we'll run some code that connects to the database and inserts fake data into the table. To do this we'll use the Faker package, which we need to install first.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">Faker</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To populate the table, we'll generate a value for each column in the database table and save the records into a list.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">faker</span> <span class="kn">import</span> <span class="n">Faker</span>
<span class="n">fake</span> <span class="o">=</span> <span class="n">Faker</span><span class="p">()</span>
<span class="n">records</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
<span class="n">sex</span> <span class="o">=</span> <span class="n">fake</span><span class="o">.</span><span class="n">random_choices</span><span class="p">(</span><span class="n">elements</span><span class="o">=</span><span class="p">(</span><span class="s2">"male"</span><span class="p">,</span> <span class="s2">"female"</span><span class="p">),</span> <span class="n">length</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">record</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"ssn"</span><span class="p">:</span> <span class="n">fake</span><span class="o">.</span><span class="n">ssn</span><span class="p">(),</span>
<span class="s2">"age"</span><span class="p">:</span> <span class="n">fake</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">18</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span>
<span class="s2">"sex"</span><span class="p">:</span> <span class="n">sex</span><span class="p">,</span>
<span class="s2">"bmi"</span><span class="p">:</span> <span class="n">fake</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">50</span><span class="p">),</span>
<span class="s2">"children"</span><span class="p">:</span> <span class="n">fake</span><span class="o">.</span><span class="n">random_int</span><span class="p">(</span><span class="nb">min</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span>
<span class="s2">"smoker"</span><span class="p">:</span> <span class="n">fake</span><span class="o">.</span><span class="n">boolean</span><span class="p">(),</span>
<span class="s2">"region"</span><span class="p">:</span> <span class="n">fake</span><span class="o">.</span><span class="n">random_choices</span><span class="p">(</span><span class="n">elements</span><span class="o">=</span><span class="p">(</span><span class="s2">"southwest"</span><span class="p">,</span> <span class="s2">"southeast"</span><span class="p">,</span> <span class="s2">"northwest"</span><span class="p">,</span> <span class="s2">"northeast"</span><span class="p">),</span> <span class="n">length</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">],</span>
<span class="s2">"first_name"</span><span class="p">:</span> <span class="n">fake</span><span class="o">.</span><span class="n">first_name_male</span><span class="p">()</span> <span class="k">if</span> <span class="n">sex</span> <span class="o">==</span><span class="s2">"male"</span> <span class="k">else</span> <span class="n">fake</span><span class="o">.</span><span class="n">first_name_female</span><span class="p">(),</span>
<span class="s2">"last_name"</span><span class="p">:</span> <span class="n">fake</span><span class="o">.</span><span class="n">last_name</span><span class="p">()</span>
<span class="p">}</span>
<span class="n">records</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">record</span><span class="p">)</span>
</code></pre></div>
<p>Notice that some fields generate data that does not necessarily fit the model's schema. For example, the maximum value the model allows for the "age" field is 65, but the fake data goes up to 80. We'll use records that don't match the model's schema to test the decorator's error handling later.</p>
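<p>A quick way to see which records fall outside the model's range is a simple check against the age bound. This is just a sketch with two hard-coded sample records standing in for the Faker output; the 65-year limit comes from the model's schema described above.</p>

```python
# Hypothetical sample records; in the notebook these come from Faker.
records = [
    {"ssn": "646-87-1351", "age": 31},
    {"ssn": "361-47-3850", "age": 72},
]


def fits_model_schema(record, max_age=65):
    # The model rejects ages above 65, so fake ages of 66-80 will
    # fail schema validation when they reach the model.
    return record["age"] <= max_age


invalid = [r for r in records if not fits_model_schema(r)]
print(len(invalid))  # one of the two sample records exceeds the limit
```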
<p>Let's take a look at the first record that matches the model schema:</p>
<div class="highlight"><pre><span></span><code><span class="n">valid_record</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">record</span> <span class="k">for</span> <span class="n">record</span> <span class="ow">in</span> <span class="n">records</span> <span class="k">if</span> <span class="n">record</span><span class="p">[</span><span class="s2">"age"</span><span class="p">]</span> <span class="o"><=</span> <span class="mi">65</span><span class="p">)</span>
<span class="n">valid_record</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{'</span><span class="n">ssn</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="mh">646</span><span class="o">-</span><span class="mh">87</span><span class="o">-</span><span class="mh">1351</span><span class="p">',</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">age</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">31</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">sex</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="n">female</span><span class="p">',</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">bmi</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">31</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">children</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">1</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">smoker</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="n">False</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">region</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="n">northeast</span><span class="p">',</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">first_name</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="n">Vickie</span><span class="p">',</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">last_name</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="n">Anderson</span><span class="p">'}</span><span class="w"></span>
</code></pre></div>
<p>Now let's find a record that does not fit the model's schema so we can use it later:</p>
<div class="highlight"><pre><span></span><code><span class="n">invalid_record</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">record</span> <span class="k">for</span> <span class="n">record</span> <span class="ow">in</span> <span class="n">records</span> <span class="k">if</span> <span class="n">record</span><span class="p">[</span><span class="s2">"age"</span><span class="p">]</span> <span class="o">></span> <span class="mi">65</span><span class="p">)</span>
<span class="n">invalid_record</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{'</span><span class="n">ssn</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="mh">361</span><span class="o">-</span><span class="mh">47</span><span class="o">-</span><span class="mh">3850</span><span class="p">',</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">age</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">72</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">sex</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="n">male</span><span class="p">',</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">bmi</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">34</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">children</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mh">4</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">smoker</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="n">False</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">region</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="n">northeast</span><span class="p">',</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">first_name</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="n">Michael</span><span class="p">',</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">last_name</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="p">'</span><span class="n">Pena</span><span class="p">'}</span><span class="w"></span>
</code></pre></div>
<p>We'll save the SSNs of these two records so we can test the decorator's error handling later.</p>
<div class="highlight"><pre><span></span><code><span class="n">valid_ssn</span> <span class="o">=</span> <span class="n">valid_record</span><span class="p">[</span><span class="s2">"ssn"</span><span class="p">]</span>
<span class="n">invalid_ssn</span> <span class="o">=</span> <span class="n">invalid_record</span><span class="p">[</span><span class="s2">"ssn"</span><span class="p">]</span>
</code></pre></div>
<p>Next, we'll insert the 1000 generated records into the database table that we created above.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">psycopg2</span>

<span class="n">connection</span> <span class="o">=</span> <span class="n">psycopg2</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="s2">"5432"</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="s2">"data_enrichment"</span><span class="p">,</span>
<span class="n">user</span><span class="o">=</span><span class="s2">"data_enrichment_user"</span><span class="p">,</span>
<span class="n">password</span><span class="o">=</span><span class="s2">"data_enrichment_password"</span><span class="p">)</span>
<span class="n">cursor</span> <span class="o">=</span> <span class="n">connection</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="k">for</span> <span class="n">record</span> <span class="ow">in</span> <span class="n">records</span><span class="p">:</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"INSERT INTO clients (ssn, first_name, last_name, age, sex, bmi, children, smoker, region)"</span>
<span class="s2">"VALUES (</span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">);"</span><span class="p">,</span>
<span class="p">(</span><span class="n">record</span><span class="p">[</span><span class="s2">"ssn"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"first_name"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"last_name"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"age"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"sex"</span><span class="p">],</span>
<span class="n">record</span><span class="p">[</span><span class="s2">"bmi"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"children"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"smoker"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"region"</span><span class="p">]))</span>
<span class="n">connection</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">connection</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div>
<p>The database table is now populated with records that we can use to try out the decorated model.</p>
<p>We'll query a few records to look at the data:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">it</span> <span class="o">--</span><span class="n">rm</span> \
<span class="o">--</span><span class="n">network</span> <span class="n">host</span> <span class="n">postgres</span> \
<span class="n">psql</span> <span class="n">postgresql</span><span class="p">:</span><span class="o">//</span><span class="n">data_enrichment_user</span><span class="p">:</span><span class="n">data_enrichment_password</span><span class="o">@</span><span class="mf">127.0.0.1</span><span class="p">:</span><span class="mi">5432</span><span class="o">/</span><span class="n">data_enrichment</span> \
<span class="o">-</span><span class="n">c</span> <span class="s2">"SELECT ssn, first_name, last_name FROM clients LIMIT 5;"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> ssn | first_name | last_name
-------------+------------+-----------
646-87-1351 | Vickie | Anderson
194-94-3733 | Patricia | Lee
709-08-5148 | Seth | James
132-30-5594 | Edward | Allen
096-55-1187 | Mark | Keith
(5 rows)
</code></pre></div>
<h2>Trying out the Decorator</h2>
<p>Now that we have some data in the database, we can try to make predictions with the decorated model.</p>
<div class="highlight"><pre><span></span><code><span class="n">decorator</span> <span class="o">=</span> <span class="n">PostgreSQLEnrichmentDecorator</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="s2">"5432"</span><span class="p">,</span>
<span class="n">username</span><span class="o">=</span><span class="s2">"data_enrichment_user"</span><span class="p">,</span>
<span class="n">password</span><span class="o">=</span><span class="s2">"data_enrichment_password"</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="s2">"data_enrichment"</span><span class="p">,</span>
<span class="n">table</span><span class="o">=</span><span class="s2">"clients"</span><span class="p">,</span>
<span class="n">index_field_name</span><span class="o">=</span><span class="s2">"ssn"</span><span class="p">,</span>
<span class="n">index_field_type</span><span class="o">=</span><span class="s2">"str"</span><span class="p">,</span>
<span class="n">enrichment_fields</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span> <span class="s2">"children"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">])</span>
<span class="n">decorated_model</span> <span class="o">=</span> <span class="n">decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="p">(</span><span class="n">ssn</span><span class="o">=</span><span class="n">valid_ssn</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=6416.86)
</code></pre></div>
<p>We provided a value for the ssn field and the decorator was able to retrieve the values for the other fields for the model to use.</p>
<p>Next, we'll see what happens when we try to do data enrichment with a record that does not exist in the database.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="p">(</span><span class="n">ssn</span><span class="o">=</span><span class="s2">"123-45-6789"</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">ValueError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="nv">Could</span> <span class="nv">not</span> <span class="nv">find</span> <span class="nv">a</span> <span class="nv">record</span> <span class="k">for</span> <span class="nv">data</span> <span class="nv">enrichment</span>.
</code></pre></div>
<p>The decorator raised a ValueError exception because it could not find the needed record.</p>
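<p>The error-handling behavior can be sketched as: run the lookup, and raise a ValueError when no row comes back. The function below is hypothetical and uses an in-memory sqlite3 database in place of PostgreSQL, but it mirrors the "missing record is an error, not a silent None" behavior shown above.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (ssn TEXT PRIMARY KEY, age INTEGER)")
conn.execute("INSERT INTO clients VALUES ('646-87-1351', 31)")


def fetch_enrichment(conn, ssn):
    """Look up the enrichment record, raising if it does not exist."""
    row = conn.execute(
        "SELECT age FROM clients WHERE ssn = ?", (ssn,)).fetchone()
    if row is None:
        # Mirrors the decorator: fail loudly instead of predicting
        # with missing enrichment data.
        raise ValueError("Could not find a record for data enrichment.")
    return {"age": row[0]}


try:
    fetch_enrichment(conn, "123-45-6789")
except ValueError as e:
    message = str(e)
print(message)  # Could not find a record for data enrichment.
```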
<p>We can also leave some fields for the client of the model to provide and pull all other fields from the database. We just need to instantiate the decorator a little differently.</p>
<div class="highlight"><pre><span></span><code><span class="n">decorator</span> <span class="o">=</span> <span class="n">PostgreSQLEnrichmentDecorator</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="s2">"5432"</span><span class="p">,</span>
<span class="n">username</span><span class="o">=</span><span class="s2">"data_enrichment_user"</span><span class="p">,</span>
<span class="n">password</span><span class="o">=</span><span class="s2">"data_enrichment_password"</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="s2">"data_enrichment"</span><span class="p">,</span>
<span class="n">table</span><span class="o">=</span><span class="s2">"clients"</span><span class="p">,</span>
<span class="n">index_field_name</span><span class="o">=</span><span class="s2">"ssn"</span><span class="p">,</span>
<span class="n">index_field_type</span><span class="o">=</span><span class="s2">"str"</span><span class="p">,</span>
<span class="n">enrichment_fields</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">])</span>
<span class="n">decorated_model</span> <span class="o">=</span> <span class="n">decorator</span><span class="o">.</span><span class="n">set_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>To see which fields are now required by the model, we'll take a look at the input schema of the decorated model.</p>
<div class="highlight"><pre><span></span><code><span class="n">input_schema</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
<span class="n">input_schema</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{'title': 'InsuranceChargesModelInput',
'type': 'object',
'properties': {'ssn': {'title': 'Ssn', 'type': 'string'},
'children': {'title': 'Children',
'minimum': 0,
'maximum': 5,
'type': 'integer'},
'smoker': {'title': 'Smoker', 'type': 'boolean'}},
'required': ['ssn']}
</code></pre></div>
<p>The decorator has removed the age, sex, bmi, and region fields from the input schema. It has left the smoker and children fields in place, and it has added the ssn field as we expected.</p>
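<p>The schema rewriting can be pictured as plain JSON-schema surgery: drop the enrichment fields from <code>properties</code> and <code>required</code>, then add the index field as a required string. This sketch operates on a dict rather than the decorator's actual pydantic models, and the helper name is hypothetical.</p>

```python
def rewrite_input_schema(schema, enrichment_fields, index_field="ssn"):
    """Remove enrichment fields from a JSON schema and require the index field."""
    props = {k: v for k, v in schema["properties"].items()
             if k not in enrichment_fields}
    props[index_field] = {"title": index_field.capitalize(), "type": "string"}
    required = [f for f in schema.get("required", [])
                if f not in enrichment_fields]
    if index_field not in required:
        required.append(index_field)
    return {**schema, "properties": props, "required": required}


original = {
    "title": "InsuranceChargesModelInput",
    "type": "object",
    "properties": {
        "age": {"type": "integer"},
        "children": {"type": "integer"},
        "smoker": {"type": "boolean"},
    },
    "required": ["age"],
}
rewritten = rewrite_input_schema(original, ["age"])
print(sorted(rewritten["properties"]))  # ['children', 'smoker', 'ssn']
```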
<p>Now we can try the decorator with this new input schema:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="p">(</span><span class="n">ssn</span><span class="o">=</span><span class="n">valid_ssn</span><span class="p">,</span> <span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=6123.85)
</code></pre></div>
<p>The decorator was able to bring in the values for the missing fields from the database and join them to the fields provided by the client in order to make a prediction. </p>
<p>Lastly, we'll select a client record in the database that does not meet the schema requirements of the model:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_input</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="p">(</span><span class="n">ssn</span><span class="o">=</span><span class="n">invalid_ssn</span><span class="p">,</span> <span class="n">children</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="k">except</span> <span class="n">MLModelSchemaValidationException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">1</span><span class="w"> </span><span class="nb">val</span><span class="n">idation</span><span class="w"> </span><span class="n">error</span><span class="w"> </span><span class="kr">for</span><span class="w"> </span><span class="n">InsuranceChargesModelInput</span><span class="w"></span>
<span class="n">age</span><span class="w"></span>
<span class="w"> </span><span class="n">ensure</span><span class="w"> </span><span class="n">this</span><span class="w"> </span><span class="nb">val</span><span class="n">ue</span><span class="w"> </span><span class="n">is</span><span class="w"> </span><span class="n">less</span><span class="w"> </span><span class="n">than</span><span class="w"> </span><span class="ow">or</span><span class="w"> </span><span class="n">equal</span><span class="w"> </span><span class="kr">to</span><span class="w"> </span><span class="mf">65</span><span class="w"> </span><span class="p">(</span><span class="n">type</span><span class="o">=</span><span class="nb">val</span><span class="n">ue_error</span><span class="mf">.</span><span class="n">number</span><span class="mf">.</span><span class="ow">not</span><span class="n">_le</span><span class="p">;</span><span class="w"> </span><span class="n">limit_value</span><span class="o">=</span><span class="mf">65</span><span class="p">)</span><span class="w"></span>
</code></pre></div>
<p>Because we put some records in the database that do not meet the model's input schema, a validation error was raised inside of the decorator instance. The record had an age value above 65, which is outside the range the model can predict for.</p>
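<p>The key point is that enriched values pass through the same validation as client-provided values. A minimal sketch of that re-validation step is below; the <code>validate_record</code> helper is hypothetical, and the age lower bound of 18 is an assumption (only the upper bound of 65 appears in the error above):</p>

```python
# Hypothetical re-validation step: even though the enriched values come
# from our own database, they still pass through the model's input
# schema, so out-of-range records fail just like bad client input would.
CONSTRAINTS = {
    "age": (18, 65),       # upper bound from the validation error above
    "children": (0, 5),    # bounds from the input schema shown earlier
}


def validate_record(record):
    """Return a list of validation error messages for an enriched record."""
    errors = []
    for field, (low, high) in CONSTRAINTS.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            errors.append(f"{field}: ensure this value is between {low} and {high}")
    return errors
```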
<h2>Adding a Decorator to a Deployed Model</h2>
<p>Now that we have a model and a decorator, we can combine them together into a service that is able to make predictions and also do data enrichment. To do this, we won't need to write any extra code, we can leverage the <a href="https://pypi.org/project/rest-model-service/">rest_model_service package</a> to provide the RESTful API for the service. You can learn more about the package in <a href="https://www.tekhnoal.com/rest-model-service.html">this blog post</a>.</p>
<p>To install the package, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">rest_model_service</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To create a service for our model, all we need to do is add a YAML configuration file to the project. The configuration file looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Insurance Charges Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model.prediction.model.InsuranceChargesModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
<span class="w"> </span><span class="nt">decorators</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">data_enrichment.postgresql.PostgreSQLEnrichmentDecorator</span><span class="w"></span>
<span class="w"> </span><span class="nt">configuration</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">host</span><span class="p">:</span><span class="w"> </span><span class="s">"localhost"</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="s">"5432"</span><span class="w"></span>
<span class="w"> </span><span class="nt">username</span><span class="p">:</span><span class="w"> </span><span class="s">"data_enrichment_user"</span><span class="w"></span>
<span class="w"> </span><span class="nt">password</span><span class="p">:</span><span class="w"> </span><span class="s">"data_enrichment_password"</span><span class="w"></span>
<span class="w"> </span><span class="nt">database</span><span class="p">:</span><span class="w"> </span><span class="s">"data_enrichment"</span><span class="w"></span>
<span class="w"> </span><span class="nt">table</span><span class="p">:</span><span class="w"> </span><span class="s">"clients"</span><span class="w"></span>
<span class="w"> </span><span class="nt">index_field_name</span><span class="p">:</span><span class="w"> </span><span class="s">"ssn"</span><span class="w"></span>
<span class="w"> </span><span class="nt">index_field_type</span><span class="p">:</span><span class="w"> </span><span class="s">"str"</span><span class="w"></span>
<span class="w"> </span><span class="nt">enrichment_fields</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"age"</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"sex"</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"bmi"</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"children"</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"smoker"</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"region"</span><span class="w"></span>
</code></pre></div>
<p>The service_title field is the name of the service as it will appear in the documentation. The models field is an array that contains the details of the models we would like to deploy in the service. The class_path field points at the MLModel class that implements the model's prediction logic. The decorators field contains the details of the decorators that we want to attach to the model instance. In this case, we're using the PostgreSQLEnrichmentDecorator class with the same configuration we used for local testing.</p>
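<p>Under the hood, a class_path string like the one in the configuration can be turned into a class with a dynamic import. This is a sketch of the mechanism, not the rest_model_service package's actual code:</p>

```python
import importlib


def load_class(class_path):
    """Import a class from a dotted path like 'package.module.ClassName'."""
    module_path, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# The service could then instantiate each decorator with its configuration
# mapping and wrap the model, roughly like:
#   decorator_class = load_class(
#       "data_enrichment.postgresql.PostgreSQLEnrichmentDecorator")
#   decorator = decorator_class(**configuration)
#   decorator.set_model(model)
```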
<p>Using the configuration file, we're able to create an OpenAPI specification file for the model service by executing these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>./configuration/rest_config.yaml
generate_openapi --output_file<span class="o">=</span><span class="s2">"service_contract.yaml"</span>
</code></pre></div>
<p>The service_contract.yaml file will be generated and it will contain the specification that was generated for the model service. The insurance_charges_model endpoint is the one we'll call to make predictions with the model. The model's input and output schemas were automatically extracted and added to the specification. If you inspect the contract, you'll find that the enrichment fields are not part of the input schema because they are being removed by the enrichment decorator. The ssn field has been added to the contract because it is needed to do data enrichment.</p>
<p>To run the service locally, execute this command:</p>
<div class="highlight"><pre><span></span><code>uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>The service should come up and can be accessed in a web browser at http://127.0.0.1:8000. When you access that URL you will be redirected to the documentation page that is generated by the FastAPI package:</p>
<p><img alt="FastAPI Documnetation" src="https://www.tekhnoal.com/fastapi_documentation_defml.png" width="100%"></p>
<p>The documentation allows you to make requests against the API in order to try it out. Here's a prediction request against the insurance charges model:</p>
<p><img alt="Prediction Request" src="https://www.tekhnoal.com/prediction_request_defml.png" width="100%"></p>
<p>And the prediction result:</p>
<p><img alt="Prediction Result" src="https://www.tekhnoal.com/prediction_result_defml.png" width="100%"></p>
<p>By using the MLModel base class provided by the ml_base package and the REST service framework provided by the rest_model_service package we're able to quickly stand up a service to host the model. The decorator that we want to test can also be added to the model through configuration, including all of its parameters.</p>
<h2>Deploying the Model</h2>
<p>Now that we have a working model and model service, we'll need to deploy it somewhere. We'll start by deploying the service locally. Once we have the service and database working locally, we'll deploy everything to the cloud using DigitalOcean's managed Kubernetes service.</p>
<h3>Creating a Docker Image</h3>
<p>Before moving forward, let's create a docker image and run it locally. The docker image is generated using instructions in the Dockerfile:</p>
<div class="highlight"><pre><span></span><code><span class="k">FROM</span><span class="w"> </span><span class="s">python:3.9-slim</span>
<span class="k">MAINTAINER</span><span class="w"> </span><span class="s">Brian Schmidt "6666331+schmidtbri@users.noreply.github.com"</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">./service</span>
<span class="k">RUN</span><span class="w"> </span>apt-get update
<span class="k">RUN</span><span class="w"> </span>apt-get --assume-yes install git
<span class="k">COPY</span><span class="w"> </span>./data_enrichment ./data_enrichment
<span class="k">COPY</span><span class="w"> </span>./configuration ./configuration
<span class="k">COPY</span><span class="w"> </span>./LICENSE ./LICENSE
<span class="k">COPY</span><span class="w"> </span>./requirements.txt ./requirements.txt
<span class="k">RUN</span><span class="w"> </span>pip install -r requirements.txt
<span class="k">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">"uvicorn"</span><span class="p">,</span><span class="w"> </span><span class="s2">"rest_model_service.main:app"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--host"</span><span class="p">,</span><span class="w"> </span><span class="s2">"0.0.0.0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"--port"</span><span class="p">,</span><span class="w"> </span><span class="s2">"8000"</span><span class="p">]</span>
</code></pre></div>
<p>The Dockerfile is used by this command to create a docker image:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">build</span> <span class="o">-</span><span class="n">t</span> <span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="mf">0.1.0</span> <span class="o">..</span>\
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>To make sure everything worked as expected, we'll look through the docker images in our system:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">image</span> <span class="n">ls</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">insurance_charges_model_service</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>insurance_charges_model_service 0.1.0 f5b85418ebc7 2 days ago 1.53GB
</code></pre></div>
<p>The insurance_charges_model_service image is listed. Next, we'll run a container from the image to see if everything is working as expected. However, we first need to connect the docker containers to the same network. </p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">create</span> <span class="n">data</span><span class="o">-</span><span class="n">enrichment</span><span class="o">-</span><span class="n">network</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>bcfa5ed0334b609c6f553caac67375c0571438f4541d75d63be79638a6e300f7
</code></pre></div>
<p>Next, we'll connect the running postgres image to the network.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">connect</span> <span class="n">data</span><span class="o">-</span><span class="n">enrichment</span><span class="o">-</span><span class="n">network</span> <span class="n">postgres</span>
</code></pre></div>
<p>Now we can start the service docker image connected to the same network as the postgres container.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">d</span> \
<span class="o">-</span><span class="n">p</span> <span class="mi">8000</span><span class="p">:</span><span class="mi">8000</span> \
<span class="o">--</span><span class="n">net</span> <span class="n">data</span><span class="o">-</span><span class="n">enrichment</span><span class="o">-</span><span class="n">network</span> \
<span class="o">-</span><span class="n">e</span> <span class="n">REST_CONFIG</span><span class="o">=./</span><span class="n">configuration</span><span class="o">/</span><span class="n">local_rest_config</span><span class="o">.</span><span class="n">yaml</span> \
<span class="o">--</span><span class="n">name</span> <span class="n">insurance_charges_model_service</span> \
<span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">6</span><span class="n">e1bc98063053f9260e078fb4bef3e36637bb84e73b04441791e2c75fd0ad833</span><span class="w"></span>
</code></pre></div>
<p>Notice that we're using a different configuration file that has a different hostname for the postgres instance. The postgres container is not accessible at localhost from inside the docker network, so the configuration uses the hostname "postgres" instead.</p>
<p>The service is also published on port 8000 of localhost, but we'll make a prediction using the curl command running inside of a container connected to the same network:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">run</span> <span class="o">-</span><span class="n">it</span> <span class="o">--</span><span class="n">rm</span> \
<span class="o">--</span><span class="n">net</span> <span class="n">data</span><span class="o">-</span><span class="n">enrichment</span><span class="o">-</span><span class="n">network</span> \
<span class="n">curlimages</span><span class="o">/</span><span class="n">curl</span> \
<span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://insurance_charges_model_service:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s1">'{"ssn": "646-87-1351"}'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":6416.86}
</code></pre></div>
<p>The model predicted that the insurance charges will be $6416.86 for the person whose SSN is 646-87-1351.</p>
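<p>For clients written in Python, the request body and response parsing are simple JSON operations. The helper functions below are a hypothetical client sketch that mirrors the curl call above:</p>

```python
import json


def build_prediction_request(ssn, children=None, smoker=None):
    """Build the JSON body for the prediction endpoint; only ssn is required."""
    body = {"ssn": ssn}
    if children is not None:
        body["children"] = children
    if smoker is not None:
        body["smoker"] = smoker
    return json.dumps(body)


def parse_prediction_response(response_text):
    """Extract the predicted charges from the service's JSON response."""
    return json.loads(response_text)["charges"]
```

<p>For example, <code>parse_prediction_response('{"charges":6416.86}')</code> returns the predicted charges as a float, matching the response we received above.</p>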
<p>We're done with the service and the database so we'll shut down the docker containers and the docker network.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">postgres</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">postgres</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">kill</span> <span class="n">insurance_charges_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">rm</span> <span class="n">insurance_charges_model_service</span>
<span class="err">!</span><span class="n">docker</span> <span class="n">network</span> <span class="n">rm</span> <span class="n">data</span><span class="o">-</span><span class="n">enrichment</span><span class="o">-</span><span class="n">network</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>postgres
postgres
insurance_charges_model_service
insurance_charges_model_service
data-enrichment-network
</code></pre></div>
<h3>Setting up Digital Ocean</h3>
<p>In order to deploy the model service to a DigitalOcean kubernetes cluster, we'll need to connect to the DigitalOcean API. </p>
<p>In this section we'll be using the doctl command line utility, which helps us interact with the DigitalOcean Kubernetes service. We followed <a href="https://docs.digitalocean.com/reference/doctl/how-to/install/">these instructions</a> to install the doctl utility. Before we can do anything with the DigitalOcean API, we need to authenticate, so we created an API token by following <a href="https://docs.digitalocean.com/reference/api/create-personal-access-token/">these instructions</a>. Once we have the token, we can add it to the doctl utility by creating a new authentication context with this command:</p>
<div class="highlight"><pre><span></span><code>doctl auth init --context model-services-context
</code></pre></div>
<p>The command asks for the API token we generated on the website and saves it into the tool's configuration file under a new context called "model-services-context", which we'll use to interact with the DigitalOcean API. To make sure that the context was created correctly and is the current context, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">doctl</span> <span class="n">auth</span> <span class="nb">list</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>default
model-services-context (current)
</code></pre></div>
<p>The newly created context should be listed and have "(current)" by its name. If the context we created is not the current context, we can switch to it with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">doctl</span> <span class="n">auth</span> <span class="n">switch</span> <span class="o">--</span><span class="n">context</span> <span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">-</span><span class="n">context</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Now using context [model-services-context] by default
</code></pre></div>
<p>Now that we have the credentials necessary, we can start creating the infrastructure for our deployment.</p>
<h3>Creating the Kubernetes Cluster</h3>
<p>To create the Kubernetes cluster and supporting infrastructure, we'll use <a href="https://www.terraform.io/">Terraform</a>. Terraform is an Infrastructure as Code (IaC) tool that allows us to declaratively define our infrastructure in configuration files, and then create, manage, and destroy it with simple commands. The command line Terraform tool can be installed by following <a href="https://learn.hashicorp.com/tutorials/terraform/install-cli">these instructions</a>.</p>
<p>We won't be doing a deep dive into Terraform for this blog post because it would make the post too long. The Terraform module that we'll apply is in the "terraform" folder of the source code attached to this post. </p>
<p>To begin, we'll switch into the terraform folder and add our API token to an environment variable.</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">cd</span> <span class="o">../</span><span class="n">terraform</span>
<span class="o">%</span><span class="n">env</span> <span class="n">DIGITALOCEAN_TOKEN</span><span class="o">=</span><span class="n">dop_v1_c857bb7bb4bed089000125513c49f642f03401253ec09178c41f94df665312a</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Next, we'll initialize the Terraform environment.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">terraform</span> <span class="n">init</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Initializing the backend...
Initializing provider plugins...
- Finding latest version of hashicorp/kubernetes...
- Finding digitalocean/digitalocean versions matching "~> 2.0"...
- Installing hashicorp/kubernetes v2.11.0...
- Installed hashicorp/kubernetes v2.11.0 (signed by HashiCorp)
- Installing digitalocean/digitalocean v2.19.0...
- Installed digitalocean/digitalocean v2.19.0 (signed by a HashiCorp partner, key ID F82037E524B9C0E8)
...
</code></pre></div>
<p>The terraform environment is now initialized and stored in the terraform folder. We can now create a plan for the deployment of the resources.</p>
<p>The plan command requires an input variable called "project_name", which gives the resources a shared naming convention. We provide the value through the -var command line option.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">terraform</span> <span class="n">plan</span> <span class="o">-</span><span class="n">var</span><span class="o">=</span><span class="s2">"project_name=model-services"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">Terraform</span><span class="w"> </span><span class="n">used</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">selected</span><span class="w"> </span><span class="n">providers</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="k">generate</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">execution</span><span class="w"> </span><span class="n">plan</span><span class="p">.</span><span class="w"></span>
<span class="n">Resource</span><span class="w"> </span><span class="n">actions</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">indicated</span><span class="w"> </span><span class="n">with</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="nl">symbols:</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">create</span><span class="w"></span>
<span class="n">Terraform</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">perform</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="nl">actions:</span><span class="w"></span>
<span class="w"> </span><span class="p">#</span><span class="w"> </span><span class="n">digitalocean_container_registry</span><span class="p">.</span><span class="n">container_registry</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">resource</span><span class="w"> </span><span class="s">"digitalocean_container_registry"</span><span class="w"> </span><span class="s">"container_registry"</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">endpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"model-services-registry"</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">server_url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">storage_usage_bytes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">subscription_tier_slug</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"basic"</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="w"> </span><span class="p">#</span><span class="w"> </span><span class="n">digitalocean_container_registry_docker_credentials</span><span class="p">.</span><span class="n">registry_credentials</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">resource</span><span class="w"> </span><span class="s">"digitalocean_container_registry_docker_credentials"</span><span class="w"> </span><span class="s">"registry_credentials"</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">credential_expiration_time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">docker_credentials</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">sensitive</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">expiry_seconds</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mh">1576800000</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">registry_name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"model-services-registry"</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">write</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">true</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">...</span><span class="w"></span>
<span class="nl">Plan:</span><span class="w"> </span><span class="mh">5</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">add</span><span class="p">,</span><span class="w"> </span><span class="mh">0</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">change</span><span class="p">,</span><span class="w"> </span><span class="mh">0</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">destroy</span><span class="p">.</span><span class="w"></span>
<span class="n">Changes</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="nl">Outputs:</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">kubernetes_cluster_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">registry_endpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="err">───────────────────────────────────────────────────────────────────────────────</span><span class="w"></span>
<span class="nl">Note:</span><span class="w"> </span><span class="n">You</span><span class="w"> </span><span class="n">didn</span><span class="p">'</span><span class="n">t</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="o">-</span><span class="n">out</span><span class="w"> </span><span class="n">option</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">save</span><span class="w"> </span><span class="n">this</span><span class="w"> </span><span class="n">plan</span><span class="p">,</span><span class="w"> </span><span class="n">so</span><span class="w"> </span><span class="n">Terraform</span><span class="w"> </span><span class="n">can</span><span class="p">'</span><span class="n">t</span><span class="w"></span>
<span class="n">guarantee</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">take</span><span class="w"> </span><span class="n">exactly</span><span class="w"> </span><span class="n">these</span><span class="w"> </span><span class="n">actions</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">run</span><span class="w"> </span><span class="s">"terraform apply"</span><span class="w"> </span><span class="n">now</span><span class="p">.</span><span class="w"></span>
</code></pre></div>
<p>The output of the plan command gives us a list of the resources that will be created. These resources are:</p>
<ul>
<li>docker registry, used to deploy images to the cluster</li>
<li>docker registry credentials, used to allow access to the images from the cluster</li>
<li>VPC, a private network for the cluster nodes</li>
<li>kubernetes cluster, used to host the services</li>
<li>kubernetes secret, to hold the docker registry credentials so that the cluster can load images from the docker registry</li>
</ul>
<p>We can create the resources with the apply command.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">terraform</span> <span class="n">apply</span> <span class="o">-</span><span class="n">var</span><span class="o">=</span><span class="s2">"project_name=model-services"</span> <span class="o">-</span><span class="n">auto</span><span class="o">-</span><span class="n">approve</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">Terraform</span><span class="w"> </span><span class="n">used</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">selected</span><span class="w"> </span><span class="n">providers</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="k">generate</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">execution</span><span class="w"> </span><span class="n">plan</span><span class="p">.</span><span class="w"></span>
<span class="n">Resource</span><span class="w"> </span><span class="n">actions</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">indicated</span><span class="w"> </span><span class="n">with</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="nl">symbols:</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">create</span><span class="w"></span>
<span class="n">Terraform</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">perform</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="nl">actions:</span><span class="w"></span>
<span class="w"> </span><span class="p">#</span><span class="w"> </span><span class="n">digitalocean_container_registry</span><span class="p">.</span><span class="n">container_registry</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">created</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">resource</span><span class="w"> </span><span class="s">"digitalocean_container_registry"</span><span class="w"> </span><span class="s">"container_registry"</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">endpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"model-services-registry"</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">server_url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">storage_usage_bytes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">known</span><span class="w"> </span><span class="n">after</span><span class="w"> </span><span class="n">apply</span><span class="p">)</span><span class="w"></span>
<span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">subscription_tier_slug</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"basic"</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">...</span><span class="w"></span>
<span class="nl">Outputs:</span><span class="w"></span>
<span class="n">kubernetes_cluster_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"7eda057c-501f-414c-ad36-e4a75feac4e0"</span><span class="w"></span>
<span class="n">registry_endpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"registry.digitalocean.com/model-services-registry"</span><span class="w"></span>
</code></pre></div>
<p>The Terraform stack returned the ID of the cluster that was created. We'll need this ID later to connect to the cluster.</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">cd</span> <span class="o">..</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="o">/</span><span class="nv">Users</span><span class="o">/</span><span class="nv">brian</span><span class="o">/</span><span class="nv">Code</span><span class="o">/</span><span class="nv">data</span><span class="o">-</span><span class="nv">enrichment</span><span class="o">-</span><span class="k">for</span><span class="o">-</span><span class="nv">ml</span><span class="o">-</span><span class="nv">models</span>
</code></pre></div>
<h3>Pushing the Image</h3>
<p>Now that we have a registry, we need to add its credentials to our local Docker daemon so that we can upload images. To do that, we'll use this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">doctl</span> <span class="n">registry</span> <span class="n">login</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">Logging</span><span class="w"> </span><span class="n">Docker</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">registry</span><span class="p">.</span><span class="n">digitalocean</span><span class="p">.</span><span class="n">com</span><span class="w"></span>
</code></pre></div>
<p>In order to upload the image, we need to tag it with the URL of the DigitalOcean registry, which was an output of the Terraform stack we applied above. The docker tag command looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">tag</span> <span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="mf">0.1.0</span> <span class="n">registry</span><span class="o">.</span><span class="n">digitalocean</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">-</span><span class="n">registry</span><span class="o">/</span><span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
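<p>The fully-qualified tag is just the registry endpoint, the image name, and the version joined together. As a quick sketch of the convention (the helper function name is ours, not part of the service code):</p>

```python
def qualified_tag(registry_endpoint: str, image: str, version: str) -> str:
    """Join a registry endpoint, image name, and version into a full Docker tag."""
    return f"{registry_endpoint}/{image}:{version}"

# The tag used in the docker tag command above
print(qualified_tag(
    "registry.digitalocean.com/model-services-registry",
    "insurance_charges_model_service",
    "0.1.0",
))
# registry.digitalocean.com/model-services-registry/insurance_charges_model_service:0.1.0
```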
<p>Now we can push the image to the DigitalOcean docker registry.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">docker</span> <span class="n">push</span> <span class="n">registry</span><span class="o">.</span><span class="n">digitalocean</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">-</span><span class="n">registry</span><span class="o">/</span><span class="n">insurance_charges_model_service</span><span class="p">:</span><span class="mf">0.1.0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">The</span><span class="w"> </span><span class="n">push</span><span class="w"> </span><span class="n">refers</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">repository</span><span class="w"> </span><span class="p">[</span><span class="n">registry</span><span class="p">.</span><span class="n">digitalocean</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">-</span><span class="n">registry</span><span class="o">/</span><span class="n">insurance_charges_model_service</span><span class="p">]</span><span class="w"></span>
<span class="p">[</span><span class="mh">1</span><span class="nl">B4e8c730f:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">B262abd28:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">B103cfdc5:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">Be6d9a4d6:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">Ba89df31c:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">B3bc716a2:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">Bb9727396:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">B7bf074b6:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">B85df8c54:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">Bafbe089a:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">B90f11bed:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">1</span><span class="nl">B50a1245f:</span><span class="w"> </span><span class="n">Preparing</span><span class="w"> </span>
<span class="p">[</span><span class="mh">13</span><span class="nl">Be8c730f:</span><span class="w"> </span><span class="n">Pushing</span><span class="w"> </span><span class="mf">426.5</span><span class="n">MB</span><span class="o">/</span><span class="mf">1.3</span><span class="n">GB</span><span class="w"></span>
<span class="p">...</span><span class="w"></span>
</code></pre></div>
<h3>Accessing the Kubernetes Cluster</h3>
<p>To access the cluster, doctl provides a command that will configure the kubectl tool for us:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">doctl</span> <span class="n">kubernetes</span> <span class="n">cluster</span> <span class="n">kubeconfig</span> <span class="n">save</span> <span class="mi">7</span><span class="n">eda057c</span><span class="o">-</span><span class="mi">501</span><span class="n">f</span><span class="o">-</span><span class="mi">414</span><span class="n">c</span><span class="o">-</span><span class="n">ad36</span><span class="o">-</span><span class="n">e4a75feac4e0</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">Notice</span><span class="o">:</span><span class="w"> </span><span class="n">Adding</span><span class="w"> </span><span class="n">cluster</span><span class="w"> </span><span class="n">credentials</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">kubeconfig</span><span class="w"> </span><span class="n">file</span><span class="w"> </span><span class="n">found</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="s2">"/Users/brian/.kube/config"</span><span class="w"></span>
<span class="n">Notice</span><span class="o">:</span><span class="w"> </span><span class="n">Setting</span><span class="w"> </span><span class="n">current</span><span class="o">-</span><span class="n">context</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="k">do</span><span class="o">-</span><span class="n">nyc1</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">-</span><span class="n">cluster</span><span class="w"></span>
</code></pre></div>
<p>The unique identifier passed to this command is the cluster ID that was output by the Terraform stack. When the command finishes, the current kubectl context should be switched to the newly created cluster. To list the contexts in kubectl, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">config</span> <span class="n">get</span><span class="o">-</span><span class="n">contexts</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="nv">CURRENT</span> <span class="nv">NAME</span> <span class="nv">CLUSTER</span> <span class="nv">AUTHINFO</span> <span class="nv">NAMESPACE</span>
<span class="o">*</span> <span class="k">do</span><span class="o">-</span><span class="nv">nyc1</span><span class="o">-</span><span class="nv">model</span><span class="o">-</span><span class="nv">services</span><span class="o">-</span><span class="nv">cluster</span> <span class="k">do</span><span class="o">-</span><span class="nv">nyc1</span><span class="o">-</span><span class="nv">model</span><span class="o">-</span><span class="nv">services</span><span class="o">-</span><span class="nv">cluster</span> <span class="k">do</span><span class="o">-</span><span class="nv">nyc1</span><span class="o">-</span><span class="nv">model</span><span class="o">-</span><span class="nv">services</span><span class="o">-</span><span class="nv">cluster</span><span class="o">-</span><span class="nv">admin</span>
<span class="nv">minikube</span> <span class="nv">minikube</span> <span class="nv">minikube</span>
</code></pre></div>
<p>A listing of the contexts currently in the kubectl configuration should appear, and there should be a star next to the new cluster's context. To make sure everything is working we can get a list of the nodes in the cluster with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">nodes</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME STATUS ROLES AGE VERSION
model-services-cluster-worker-pool-crmkf Ready <none> 55m v1.22.8
model-services-cluster-worker-pool-crmkx Ready <none> 55m v1.22.8
model-services-cluster-worker-pool-crmky Ready <none> 55m v1.22.8
</code></pre></div>
<h3>Creating a Kubernetes Namespace</h3>
<p>Now that we have a cluster and are connected to it, we'll create a namespace to hold the resources for our model deployment. The resource definition is in the kubernetes/namespace.yml file. To apply the manifest to the cluster, execute this command:</p>
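<p>The contents of the manifest file are not shown here, but a minimal namespace manifest like the one in kubernetes/namespace.yml would look roughly like this (the name matches the namespace created below; the rest is a sketch):</p>

```yaml
# Minimal Kubernetes Namespace manifest
apiVersion: v1
kind: Namespace
metadata:
  name: model-services
```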
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">create</span> <span class="o">-</span><span class="n">f</span> <span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace/model-services created
</code></pre></div>
<p>To take a look at the namespaces, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">namespace</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME STATUS AGE
default Active 164m
kube-node-lease Active 164m
kube-public Active 164m
kube-system Active 164m
model-services Active 2s
</code></pre></div>
<p>The new namespace should appear in the listing along with other namespaces created by default by the system. To use the new namespace for the rest of the operations, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">config</span> <span class="nb">set</span><span class="o">-</span><span class="n">context</span> <span class="o">--</span><span class="n">current</span> <span class="o">--</span><span class="n">namespace</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="nv">Context</span> <span class="s2">"</span><span class="s">do-nyc1-model-services-cluster</span><span class="s2">"</span> <span class="nv">modified</span>.
</code></pre></div>
<h3>Creating a Database</h3>
<p>To create a PostgreSQL database instance in Kubernetes, we'll use the <a href="https://github.com/bitnami/charts/tree/master/bitnami/postgresql">bitnami helm chart</a>. </p>
<p>Helm charts are packaged applications that can be easily installed on a Kubernetes cluster. To install PostgreSQL we'll first add the bitnami helm repository:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">helm</span> <span class="n">repo</span> <span class="n">add</span> <span class="n">bitnami</span> <span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">charts</span><span class="o">.</span><span class="n">bitnami</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">bitnami</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>"bitnami" has been added to your repositories
</code></pre></div>
<p>Now we can apply the PostgreSQL chart to the current cluster and namespace with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">helm</span> <span class="n">install</span> <span class="n">postgres</span> <span class="n">bitnami</span><span class="o">/</span><span class="n">postgresql</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">NAME</span><span class="o">:</span><span class="w"> </span><span class="n">postgres</span><span class="w"></span>
<span class="n">LAST</span><span class="w"> </span><span class="n">DEPLOYED</span><span class="o">:</span><span class="w"> </span><span class="n">Sun</span><span class="w"> </span><span class="n">May</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="mi">23</span><span class="o">:</span><span class="mi">36</span><span class="o">:</span><span class="mi">50</span><span class="w"> </span><span class="mi">2022</span><span class="w"></span>
<span class="n">NAMESPACE</span><span class="o">:</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="w"></span>
<span class="n">STATUS</span><span class="o">:</span><span class="w"> </span><span class="n">deployed</span><span class="w"></span>
<span class="n">REVISION</span><span class="o">:</span><span class="w"> </span><span class="mi">1</span><span class="w"></span>
<span class="n">TEST</span><span class="w"> </span><span class="n">SUITE</span><span class="o">:</span><span class="w"> </span><span class="n">None</span><span class="w"></span>
<span class="n">NOTES</span><span class="o">:</span><span class="w"></span>
<span class="n">CHART</span><span class="w"> </span><span class="n">NAME</span><span class="o">:</span><span class="w"> </span><span class="n">postgresql</span><span class="w"></span>
<span class="n">CHART</span><span class="w"> </span><span class="n">VERSION</span><span class="o">:</span><span class="w"> </span><span class="mf">11.1</span><span class="o">.</span><span class="mi">25</span><span class="w"></span>
<span class="n">APP</span><span class="w"> </span><span class="n">VERSION</span><span class="o">:</span><span class="w"> </span><span class="mf">14.2</span><span class="o">.</span><span class="mi">0</span><span class="w"></span>
<span class="o">**</span><span class="w"> </span><span class="n">Please</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">patient</span><span class="w"> </span><span class="k">while</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">chart</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">being</span><span class="w"> </span><span class="n">deployed</span><span class="w"> </span><span class="o">**</span><span class="w"></span>
<span class="n">PostgreSQL</span><span class="w"> </span><span class="n">can</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">accessed</span><span class="w"> </span><span class="n">via</span><span class="w"> </span><span class="n">port</span><span class="w"> </span><span class="mi">5432</span><span class="w"> </span><span class="n">on</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">DNS</span><span class="w"> </span><span class="n">names</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="n">within</span><span class="w"> </span><span class="n">your</span><span class="w"> </span><span class="n">cluster</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="o">.</span><span class="na">model</span><span class="o">-</span><span class="n">services</span><span class="o">.</span><span class="na">svc</span><span class="o">.</span><span class="na">cluster</span><span class="o">.</span><span class="na">local</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Read</span><span class="o">/</span><span class="n">Write</span><span class="w"> </span><span class="n">connection</span><span class="w"></span>
<span class="n">To</span><span class="w"> </span><span class="kd">get</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">password</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="s2">"postgres"</span><span class="w"> </span><span class="n">run</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">export</span><span class="w"> </span><span class="n">POSTGRES_PASSWORD</span><span class="o">=</span><span class="n">$</span><span class="o">(</span><span class="n">kubectl</span><span class="w"> </span><span class="kd">get</span><span class="w"> </span><span class="n">secret</span><span class="w"> </span><span class="o">--</span><span class="kd">namespace</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="w"> </span><span class="n">postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="w"> </span><span class="o">-</span><span class="n">o</span><span class="w"> </span><span class="n">jsonpath</span><span class="o">=</span><span class="s2">"{.data.postgres-password}"</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">base64</span><span class="w"> </span><span class="o">--</span><span class="n">decode</span><span class="o">)</span><span class="w"></span>
<span class="n">To</span><span class="w"> </span><span class="n">connect</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">your</span><span class="w"> </span><span class="n">database</span><span class="w"> </span><span class="n">run</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">command</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">kubectl</span><span class="w"> </span><span class="n">run</span><span class="w"> </span><span class="n">postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="o">-</span><span class="n">client</span><span class="w"> </span><span class="o">--</span><span class="n">rm</span><span class="w"> </span><span class="o">--</span><span class="n">tty</span><span class="w"> </span><span class="o">-</span><span class="n">i</span><span class="w"> </span><span class="o">--</span><span class="n">restart</span><span class="o">=</span><span class="s1">'Never'</span><span class="w"> </span><span class="o">--</span><span class="kd">namespace</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="w"> </span><span class="o">--</span><span class="n">image</span><span class="w"> </span><span class="n">docker</span><span class="o">.</span><span class="na">io</span><span class="sr">/bitnami/</span><span class="n">postgresql</span><span class="o">:</span><span class="mf">14.2</span><span class="o">.</span><span class="mi">0</span><span class="o">-</span><span class="n">debian</span><span class="o">-</span><span class="mi">10</span><span class="o">-</span><span class="n">r77</span><span class="w"> </span><span class="o">--</span><span class="n">env</span><span class="o">=</span><span class="s2">"PGPASSWORD=$POSTGRES_PASSWORD"</span><span class="w"> </span><span class="o">\</span><span class="w"></span>
<span class="w"> </span><span class="o">--</span><span class="n">command</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="n">psql</span><span class="w"> </span><span class="o">--</span><span class="n">host</span><span class="w"> </span><span class="n">postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="w"> </span><span class="o">-</span><span class="n">U</span><span class="w"> </span><span class="n">postgres</span><span class="w"> </span><span class="o">-</span><span class="n">d</span><span class="w"> </span><span class="n">postgres</span><span class="w"> </span><span class="o">-</span><span class="n">p</span><span class="w"> </span><span class="mi">5432</span><span class="w"></span>
<span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">NOTE</span><span class="o">:</span><span class="w"> </span><span class="n">If</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">access</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">container</span><span class="w"> </span><span class="n">using</span><span class="w"> </span><span class="n">bash</span><span class="o">,</span><span class="w"> </span><span class="n">make</span><span class="w"> </span><span class="n">sure</span><span class="w"> </span><span class="n">that</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">execute</span><span class="w"> </span><span class="s2">"/opt/bitnami/scripts/entrypoint.sh /bin/bash"</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">order</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">avoid</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">error</span><span class="w"> </span><span class="s2">"psql: local user with ID 1001} does not exist"</span><span class="w"></span>
<span class="n">To</span><span class="w"> </span><span class="n">connect</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">your</span><span class="w"> </span><span class="n">database</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="n">outside</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">cluster</span><span class="w"> </span><span class="n">execute</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">commands</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">kubectl</span><span class="w"> </span><span class="n">port</span><span class="o">-</span><span class="n">forward</span><span class="w"> </span><span class="o">--</span><span class="kd">namespace</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="w"> </span><span class="n">svc</span><span class="o">/</span><span class="n">postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="w"> </span><span class="mi">5432</span><span class="o">:</span><span class="mi">5432</span><span class="w"> </span><span class="o">&</span><span class="w"></span>
<span class="w"> </span><span class="n">PGPASSWORD</span><span class="o">=</span><span class="s2">"$POSTGRES_PASSWORD"</span><span class="w"> </span><span class="n">psql</span><span class="w"> </span><span class="o">--</span><span class="n">host</span><span class="w"> </span><span class="mf">127.0</span><span class="o">.</span><span class="mf">0.1</span><span class="w"> </span><span class="o">-</span><span class="n">U</span><span class="w"> </span><span class="n">postgres</span><span class="w"> </span><span class="o">-</span><span class="n">d</span><span class="w"> </span><span class="n">postgres</span><span class="w"> </span><span class="o">-</span><span class="n">p</span><span class="w"> </span><span class="mi">5432</span><span class="w"></span>
</code></pre></div>
<p>The output of the helm chart contains some info about the deployment that we'll need later. The DNS name of the new PostgreSQL service is used in the configuration of the decorator.</p>
<p>We can view the newly created database instance by looking for the pods that are hosting it:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">pods</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME READY STATUS RESTARTS AGE
postgres-postgresql-0 1/1 Running 0 104s
</code></pre></div>
<p>To access the database, we'll need to get the password created by the helm chart:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">secret</span> <span class="n">postgres</span><span class="o">-</span><span class="n">postgresql</span> <span class="o">-</span><span class="n">o</span> <span class="n">jsonpath</span><span class="o">=</span><span class="s2">"{.data.postgres-password}"</span> <span class="o">|</span> <span class="n">base64</span> <span class="o">--</span><span class="n">decode</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>SaF0fhHrRj
</code></pre></div>
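<p>Rather than copying the password by hand, we can also capture it in an environment variable in one step (the variable name here is just a convention, not something the chart requires):</p>
<div class="highlight"><pre><span></span><code>export POSTGRES_PASSWORD=$(kubectl get secret postgres-postgresql -o jsonpath="{.data.postgres-password}" | base64 --decode)
</code></pre></div>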
<p>We can test the database by executing a simple SELECT statement from another pod in the cluster:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">run</span> <span class="n">postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="o">-</span><span class="n">client</span> <span class="o">--</span><span class="n">rm</span> <span class="o">--</span><span class="n">tty</span> <span class="o">-</span><span class="n">i</span> \
<span class="o">--</span><span class="n">restart</span><span class="o">=</span><span class="s1">'Never'</span> \
<span class="o">--</span><span class="n">image</span> <span class="n">docker</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">bitnami</span><span class="o">/</span><span class="n">postgresql</span><span class="p">:</span><span class="mf">14.2.0</span><span class="o">-</span><span class="n">debian</span><span class="o">-</span><span class="mi">10</span><span class="o">-</span><span class="n">r77</span> \
<span class="o">--</span><span class="n">command</span> <span class="o">--</span> <span class="n">psql</span> <span class="n">postgresql</span><span class="p">:</span><span class="o">//</span><span class="n">postgres</span><span class="p">:</span><span class="n">SaF0fhHrRj</span><span class="nd">@postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="p">:</span><span class="mi">5432</span><span class="o">/</span><span class="n">postgres</span> \
<span class="o">-</span><span class="n">c</span> <span class="s2">"SELECT current_database();"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> current_database
------------------
postgres
(1 row)
pod "postgres-postgresql-client" deleted
</code></pre></div>
<p>To create a table in the database, we'll execute a SQL command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">run</span> <span class="n">postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="o">-</span><span class="n">client</span> <span class="o">--</span><span class="n">rm</span> <span class="o">--</span><span class="n">tty</span> <span class="o">-</span><span class="n">i</span> \
<span class="o">--</span><span class="n">restart</span><span class="o">=</span><span class="s1">'Never'</span> \
<span class="o">--</span><span class="n">image</span> <span class="n">docker</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">bitnami</span><span class="o">/</span><span class="n">postgresql</span><span class="p">:</span><span class="mf">14.2.0</span><span class="o">-</span><span class="n">debian</span><span class="o">-</span><span class="mi">10</span><span class="o">-</span><span class="n">r77</span> \
<span class="o">--</span><span class="n">command</span> <span class="o">--</span> <span class="n">psql</span> <span class="n">postgresql</span><span class="p">:</span><span class="o">//</span><span class="n">postgres</span><span class="p">:</span><span class="n">SaF0fhHrRj</span><span class="nd">@postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="p">:</span><span class="mi">5432</span><span class="o">/</span><span class="n">postgres</span> \
<span class="o">-</span><span class="n">c</span> <span class="s2">"CREATE TABLE clients ( </span><span class="se">\</span>
<span class="s2"> ssn varchar(11) PRIMARY KEY, </span><span class="se">\</span>
<span class="s2"> first_name varchar(30) NOT NULL, </span><span class="se">\</span>
<span class="s2"> last_name varchar(30) NOT NULL, </span><span class="se">\</span>
<span class="s2"> age integer NOT NULL, </span><span class="se">\</span>
<span class="s2"> sex varchar(6) NOT NULL, </span><span class="se">\</span>
<span class="s2"> bmi integer NOT NULL, </span><span class="se">\</span>
<span class="s2"> children integer NOT NULL, </span><span class="se">\</span>
<span class="s2"> smoker boolean NOT NULL, </span><span class="se">\</span>
<span class="s2"> region varchar(10) NOT NULL);"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>CREATE TABLE
pod "postgres-postgresql-client" deleted
</code></pre></div>
<p>Next, we'll add some data to the table using the same code that we used for the local Docker PostgreSQL instance. Before that, we'll need to connect to the instance using port forwarding. Port forwarding is a simple way to connect to a pod running in the cluster from the local environment: it forwards all traffic from a local port to a remote port in the pod.</p>

<p>To start port forwarding, execute this command:</p>
<div class="highlight"><pre><span></span><code>kubectl port-forward svc/postgres-postgresql <span class="m">5432</span>:5432
</code></pre></div>
<p>Now we can execute the python code that will add the data to the table:</p>
<div class="highlight"><pre><span></span><code><span class="n">connection</span> <span class="o">=</span> <span class="n">psycopg2</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span>
<span class="n">host</span><span class="o">=</span><span class="s2">"localhost"</span><span class="p">,</span>
<span class="n">port</span><span class="o">=</span><span class="s2">"5432"</span><span class="p">,</span>
<span class="n">database</span><span class="o">=</span><span class="s2">"postgres"</span><span class="p">,</span>
<span class="n">user</span><span class="o">=</span><span class="s2">"postgres"</span><span class="p">,</span>
<span class="n">password</span><span class="o">=</span><span class="s2">"SaF0fhHrRj"</span><span class="p">)</span>
<span class="n">cursor</span> <span class="o">=</span> <span class="n">connection</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="k">for</span> <span class="n">record</span> <span class="ow">in</span> <span class="n">records</span><span class="p">:</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"INSERT INTO clients (ssn, first_name, last_name, age, sex, bmi, children, smoker, region)"</span>
<span class="s2">"VALUES (</span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">, </span><span class="si">%s</span><span class="s2">);"</span><span class="p">,</span>
<span class="p">(</span><span class="n">record</span><span class="p">[</span><span class="s2">"ssn"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"first_name"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"last_name"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"age"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"sex"</span><span class="p">],</span>
<span class="n">record</span><span class="p">[</span><span class="s2">"bmi"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"children"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"smoker"</span><span class="p">],</span> <span class="n">record</span><span class="p">[</span><span class="s2">"region"</span><span class="p">]))</span>
<span class="n">connection</span><span class="o">.</span><span class="n">commit</span><span class="p">()</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">connection</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div>
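<p>The snippet above assumes that <code>records</code> is already defined as a list of dictionaries matching the columns of the clients table. For reference, a single record might look like this (the values beyond the name and SSN are purely illustrative):</p>
<div class="highlight"><pre><span></span><code># Each record maps directly onto the columns of the clients table.
records = [
    {
        "ssn": "646-87-1351",   # primary key
        "first_name": "Vickie",
        "last_name": "Anderson",
        "age": 34,              # illustrative value
        "sex": "female",        # illustrative value
        "bmi": 27,              # illustrative value
        "children": 1,          # illustrative value
        "smoker": False,        # illustrative value
        "region": "southwest",  # illustrative value
    },
]
</code></pre></div>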
<p>The remote database instance should now have the data needed to try out the decorator running in the service. We can view some of the data with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">run</span> <span class="n">postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="o">-</span><span class="n">client</span> <span class="o">--</span><span class="n">rm</span> <span class="o">--</span><span class="n">tty</span> <span class="o">-</span><span class="n">i</span> \
<span class="o">--</span><span class="n">restart</span><span class="o">=</span><span class="s1">'Never'</span> \
<span class="o">--</span><span class="n">image</span> <span class="n">docker</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">bitnami</span><span class="o">/</span><span class="n">postgresql</span><span class="p">:</span><span class="mf">14.2.0</span><span class="o">-</span><span class="n">debian</span><span class="o">-</span><span class="mi">10</span><span class="o">-</span><span class="n">r77</span> \
<span class="o">--</span><span class="n">command</span> <span class="o">--</span> <span class="n">psql</span> <span class="n">postgresql</span><span class="p">:</span><span class="o">//</span><span class="n">postgres</span><span class="p">:</span><span class="n">SaF0fhHrRj</span><span class="nd">@postgres</span><span class="o">-</span><span class="n">postgresql</span><span class="p">:</span><span class="mi">5432</span><span class="o">/</span><span class="n">postgres</span> \
<span class="o">-</span><span class="n">c</span> <span class="s2">"SELECT ssn, first_name, last_name FROM clients LIMIT 5;"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code> ssn | first_name | last_name
-------------+------------+-----------
646-87-1351 | Vickie | Anderson
194-94-3733 | Patricia | Lee
709-08-5148 | Seth | James
132-30-5594 | Edward | Allen
096-55-1187 | Mark | Keith
(5 rows)
pod "postgres-postgresql-client" deleted
</code></pre></div>
<p>Now that we're done putting data in the database, we can shut down the port forwarding process by pressing CTRL-C or with this command:</p>
<div class="highlight"><pre><span></span><code>pkill -f kubectl port-forward
</code></pre></div>
<h3>Creating a Kubernetes Deployment and Service</h3>
<p>The model service now has a database to access, so we'll be creating the model service resources. These are:</p>
<ul>
<li>Deployment: a declarative way to manage a set of pods; the model service pods are managed through the Deployment.</li>
<li>Service: a way to expose the set of pods in a Deployment; the model service is made available to the outside world through the Service. The service type is LoadBalancer, which means that a load balancer will be created for the service.</li>
</ul>
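<p>For reference, the kubernetes/model_service.yml file contains two resources along these lines (only the fields relevant to this discussion are shown; the container spec and remaining fields are abbreviated):</p>
<div class="highlight"><pre><span></span><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: insurance-charges-model-deployment
...
---
apiVersion: v1
kind: Service
metadata:
  name: insurance-charges-model-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
...
</code></pre></div>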
<p>They are created within the Kubernetes cluster with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps/insurance-charges-model-deployment created
service/insurance-charges-model-service created
</code></pre></div>
<p>The deployment and service for the model service were created together. You can see the new service with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">get</span> <span class="n">services</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
insurance-charges-model-service LoadBalancer 10.245.246.238 <pending> 80:31223/TCP 32s
postgres-postgresql ClusterIP 10.245.0.250 <none> 5432/TCP 15m
postgres-postgresql-hl ClusterIP None <none> 5432/TCP 15m
</code></pre></div>
<p>The Service type is LoadBalancer, which means that the cloud provider provisions a load balancer and public IP address through which we can contact the service. To view details about the load balancer provided by Digital Ocean for this Service, we'll execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">describe</span> <span class="n">service</span> <span class="n">insurance</span><span class="o">-</span><span class="n">charges</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">service</span> <span class="o">|</span> <span class="n">grep</span> <span class="s2">"LoadBalancer Ingress"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>LoadBalancer Ingress: 157.230.202.103
</code></pre></div>
<p>The load balancer can take a while longer than the service to come up; until it is running, the command won't return anything. Once the load balancer is ready, the command lists the public IP address assigned to it by Digital Ocean. </p>
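<p>If you'd rather wait for the address programmatically than re-run the command by hand, a small polling loop works (this is just a convenience sketch):</p>
<div class="highlight"><pre><span></span><code># poll until the load balancer has an external IP assigned
until kubectl get service insurance-charges-model-service \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' | grep -q .; do
  sleep 10
done
</code></pre></div>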
<p>Once the load balancer comes up, we can view the service through a web browser:</p>
<p><img alt="FastAPI Documentation" src="https://www.tekhnoal.com/service_documentation_defml.png" width="100%"></p>
<p>The same documentation is displayed as when we deployed the service locally.</p>
<p>To make a prediction, we'll send a request to the service's IP address:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://157.230.202.103/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s1">'{ "ssn": "646-87-1351" }'</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":6416.86}
</code></pre></div>
<p>The decorator is working and accessing data from the database!</p>
<p>The service is using the configuration file in ./configuration/kubernetes_rest_config.yaml right now, which is configuring the PostgreSQL decorator to accept the "ssn" field and use it to load all other features needed by the model from the database. This is not the only way that we can use the decorator, so we'll try out another configuration. </p>
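<p>Based on the decorator configuration format used elsewhere in this project, the relevant section of that file likely looks something like this (the field names follow the same schema; treat this as a sketch, not the exact file contents):</p>
<div class="highlight"><pre><span></span><code>decorators:
  - class_path: data_enrichment.postgresql.PostgreSQLEnrichmentDecorator
    configuration:
      host: "postgres-postgresql.model-services.svc.cluster.local"
      port: "5432"
...
</code></pre></div>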
<p>To load another configuration file, we'll just change the environment variable value in the Kubernetes Deployment resource for the model service:</p>
<div class="highlight"><pre><span></span><code><span class="nt">env</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">REST_CONFIG</span><span class="w"></span>
<span class="w"> </span><span class="nt">value</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">./configuration/kubernetes_rest_config2.yaml</span><span class="w"></span>
<span class="nn">...</span><span class="w"></span>
</code></pre></div>
<p>The new configuration file causes the decorator to accept more fields from the user of the service. After changing the Deployment, we'll recreate it in the cluster with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps/insurance-charges-model-deployment configured
service/insurance-charges-model-service unchanged
</code></pre></div>
<p>The Deployment's pods are restarted with the new configuration, while the Service remains unchanged. We can try out a request with this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://157.230.202.103/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">ssn</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">646-87-1351</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 65, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 50, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":46627.88}
</code></pre></div>
<p>The service now requires more fields from the caller because the decorator is no longer loading those features from the database.</p>
<p>We'll try out one more configuration to show how powerful decorators can be. In a <a href="https://www.tekhnoal.com/ml-model-decorators.html">previous blog post</a> we created a decorator that added a unique prediction id to every prediction returned by the model. We can add this decorator to the service by simply changing the configuration:</p>
<div class="highlight"><pre><span></span><code><span class="nt">decorators</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">data_enrichment.prediction_id.PredictionIDDecorator</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">data_enrichment.postgresql.PostgreSQLEnrichmentDecorator</span><span class="w"></span>
<span class="w"> </span><span class="nt">configuration</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">host</span><span class="p">:</span><span class="w"> </span><span class="s">"postgres-postgresql.model-services.svc.cluster.local"</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="s">"5432"</span><span class="w"></span>
<span class="nn">...</span><span class="w"></span>
</code></pre></div>
<p>This configuration is in the ./configuration/kubernetes_rest_config3.yaml file. We recreate the Deployment again, this time pointing at this configuration file:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">apply</span> <span class="o">-</span><span class="n">f</span> <span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps/insurance-charges-model-deployment configured
service/insurance-charges-model-service unchanged
</code></pre></div>
<p>We'll try the service one more time:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://157.230.202.103/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s2">"{ </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">ssn</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">646-87-1351</span><span class="se">\"</span><span class="s2">, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">age</span><span class="se">\"</span><span class="s2">: 65, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">bmi</span><span class="se">\"</span><span class="s2">: 50, </span><span class="se">\</span>
<span class="s2"> </span><span class="se">\"</span><span class="s2">smoker</span><span class="se">\"</span><span class="s2">: true </span><span class="se">\</span>
<span class="s2"> }"</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":46627.88,"prediction_id":"4db189c2-5200-44a6-b6af-0e341d0fb9bc"}
</code></pre></div>
<p>The service returned a unique identifier field called "prediction_id" along with the prediction. This field was generated by the decorator we added through configuration. A full explanation of how the prediction ID decorator works can be found in that blog post.</p>
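<p>To give a flavor of how such a decorator can work, here is a minimal sketch (this is an illustration, not the actual implementation from that post; the class and method names are assumptions):</p>
<div class="highlight"><pre><span></span><code>import uuid


class PredictionIDDecorator:
    """Wraps a model and attaches a unique id to every prediction."""

    def __init__(self, model):
        self._model = model

    def predict(self, data):
        prediction = self._model.predict(data)
        # add a UUID so individual predictions can be traced later
        prediction["prediction_id"] = str(uuid.uuid4())
        return prediction
</code></pre></div>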
<p>This shows how easily decorators can be combined with models to perform more complex operations.</p>
<h3>Deleting the Resources</h3>
<p>Now that we're done with the service, we need to destroy the resources. To delete the database deployment, we'll uninstall the Helm release:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">helm</span> <span class="n">delete</span> <span class="n">postgres</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>release "postgres" uninstalled
</code></pre></div>
<p>Since the persistent volume claim is not deleted with the chart, we'll delete it with a kubectl command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="n">pvc</span> <span class="o">-</span><span class="n">l</span> <span class="n">app</span><span class="o">.</span><span class="n">kubernetes</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">instance</span><span class="o">=</span><span class="n">postgres</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>persistentvolumeclaim "data-postgres-postgresql-0" deleted
</code></pre></div>
<p>To delete the model service, we'll execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="n">kubernetes</span><span class="o">/</span><span class="n">model_service</span><span class="o">.</span><span class="n">yml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>deployment.apps "insurance-charges-model-deployment" deleted
service "insurance-charges-model-service" deleted
</code></pre></div>
<p>To delete the namespace:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">kubectl</span> <span class="n">delete</span> <span class="o">-</span><span class="n">f</span> <span class="n">kubernetes</span><span class="o">/</span><span class="n">namespace</span><span class="o">.</span><span class="n">yml</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>namespace "model-services" deleted
</code></pre></div>
<p>Lastly, to destroy the Kubernetes cluster, execute these commands:</p>
<div class="highlight"><pre><span></span><code><span class="o">%</span><span class="n">cd</span> <span class="o">./</span><span class="n">terraform</span>
<span class="err">!</span><span class="n">terraform</span> <span class="n">plan</span> <span class="o">-</span><span class="n">var</span><span class="o">=</span><span class="s2">"project_name=model-services"</span> <span class="o">-</span><span class="n">destroy</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="o">/</span><span class="n">Users</span><span class="o">/</span><span class="n">brian</span><span class="o">/</span><span class="n">Code</span><span class="o">/</span><span class="n">data</span><span class="o">-</span><span class="n">enrichment</span><span class="o">-</span><span class="k">for</span><span class="o">-</span><span class="n">ml</span><span class="o">-</span><span class="n">models</span><span class="o">/</span><span class="n">terraform</span><span class="w"></span>
<span class="n">digitalocean_vpc</span><span class="p">.</span><span class="nl">cluster_vpc:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="mh">5</span><span class="n">a0e94f2</span><span class="o">-</span><span class="n">bb9d</span><span class="o">-</span><span class="mh">4814</span><span class="o">-</span><span class="n">bf5f</span><span class="o">-</span><span class="n">ccc2c2e98b84</span><span class="p">]</span><span class="w"></span>
<span class="n">digitalocean_container_registry</span><span class="p">.</span><span class="nl">container_registry:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">-</span><span class="n">registry</span><span class="p">]</span><span class="w"></span>
<span class="n">digitalocean_container_registry_docker_credentials</span><span class="p">.</span><span class="nl">registry_credentials:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">-</span><span class="n">registry</span><span class="p">]</span><span class="w"></span>
<span class="n">digitalocean_kubernetes_cluster</span><span class="p">.</span><span class="nl">cluster:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="mh">7</span><span class="n">eda057c</span><span class="o">-</span><span class="mf">501f</span><span class="o">-</span><span class="mh">414</span><span class="n">c</span><span class="o">-</span><span class="n">ad36</span><span class="o">-</span><span class="n">e4a75feac4e0</span><span class="p">]</span><span class="w"></span>
<span class="n">kubernetes_secret</span><span class="p">.</span><span class="nl">cluster_registry_crendentials:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="k">default</span><span class="o">/</span><span class="n">docker</span><span class="o">-</span><span class="n">cfg</span><span class="p">]</span><span class="w"></span>
<span class="n">Terraform</span><span class="w"> </span><span class="n">used</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">selected</span><span class="w"> </span><span class="n">providers</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="k">generate</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">execution</span><span class="w"> </span><span class="n">plan</span><span class="p">.</span><span class="w"></span>
<span class="n">Resource</span><span class="w"> </span><span class="n">actions</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">indicated</span><span class="w"> </span><span class="n">with</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="nl">symbols:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">destroy</span><span class="w"></span>
<span class="n">Terraform</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">perform</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="nl">actions:</span><span class="w"></span>
<span class="w"> </span><span class="p">#</span><span class="w"> </span><span class="n">digitalocean_container_registry</span><span class="p">.</span><span class="n">container_registry</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">destroyed</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">resource</span><span class="w"> </span><span class="s">"digitalocean_container_registry"</span><span class="w"> </span><span class="s">"container_registry"</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"2022-05-02 00:48:55 +0000 UTC"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">endpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"registry.digitalocean.com/model-services-registry"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"model-services-registry"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"model-services-registry"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"sfo3"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">server_url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"registry.digitalocean.com"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">storage_usage_bytes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mh">694379520</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">subscription_tier_slug</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"basic"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">...</span><span class="w"></span>
<span class="nl">Plan:</span><span class="w"> </span><span class="mh">0</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">add</span><span class="p">,</span><span class="w"> </span><span class="mh">0</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">change</span><span class="p">,</span><span class="w"> </span><span class="mh">5</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">destroy</span><span class="p">.</span><span class="w"></span>
<span class="n">Changes</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="nl">Outputs:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">kubernetes_cluster_id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"7eda057c-501f-414c-ad36-e4a75feac4e0"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">registry_endpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"registry.digitalocean.com/model-services-registry"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="err">───────────────────────────────────────────────────────────────────────────────</span><span class="w"></span>
<span class="nl">Note:</span><span class="w"> </span><span class="n">You</span><span class="w"> </span><span class="n">didn</span><span class="p">'</span><span class="n">t</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="o">-</span><span class="n">out</span><span class="w"> </span><span class="n">option</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">save</span><span class="w"> </span><span class="n">this</span><span class="w"> </span><span class="n">plan</span><span class="p">,</span><span class="w"> </span><span class="n">so</span><span class="w"> </span><span class="n">Terraform</span><span class="w"> </span><span class="n">can</span><span class="p">'</span><span class="n">t</span><span class="w"></span>
<span class="n">guarantee</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">take</span><span class="w"> </span><span class="n">exactly</span><span class="w"> </span><span class="n">these</span><span class="w"> </span><span class="n">actions</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">run</span><span class="w"> </span><span class="s">"terraform apply"</span><span class="w"> </span><span class="n">now</span><span class="p">.</span><span class="w"></span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">terraform</span> <span class="n">apply</span> <span class="o">-</span><span class="n">var</span><span class="o">=</span><span class="s2">"project_name=model-services"</span> <span class="o">-</span><span class="n">auto</span><span class="o">-</span><span class="n">approve</span> <span class="o">-</span><span class="n">destroy</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">digitalocean_container_registry</span><span class="p">.</span><span class="nl">container_registry:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">-</span><span class="n">registry</span><span class="p">]</span><span class="w"></span>
<span class="n">digitalocean_vpc</span><span class="p">.</span><span class="nl">cluster_vpc:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="mh">5</span><span class="n">a0e94f2</span><span class="o">-</span><span class="n">bb9d</span><span class="o">-</span><span class="mh">4814</span><span class="o">-</span><span class="n">bf5f</span><span class="o">-</span><span class="n">ccc2c2e98b84</span><span class="p">]</span><span class="w"></span>
<span class="n">digitalocean_container_registry_docker_credentials</span><span class="p">.</span><span class="nl">registry_credentials:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="n">model</span><span class="o">-</span><span class="n">services</span><span class="o">-</span><span class="n">registry</span><span class="p">]</span><span class="w"></span>
<span class="n">digitalocean_kubernetes_cluster</span><span class="p">.</span><span class="nl">cluster:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="mh">7</span><span class="n">eda057c</span><span class="o">-</span><span class="mf">501f</span><span class="o">-</span><span class="mh">414</span><span class="n">c</span><span class="o">-</span><span class="n">ad36</span><span class="o">-</span><span class="n">e4a75feac4e0</span><span class="p">]</span><span class="w"></span>
<span class="n">kubernetes_secret</span><span class="p">.</span><span class="nl">cluster_registry_crendentials:</span><span class="w"> </span><span class="n">Refreshing</span><span class="w"> </span><span class="n">state</span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">id</span><span class="o">=</span><span class="k">default</span><span class="o">/</span><span class="n">docker</span><span class="o">-</span><span class="n">cfg</span><span class="p">]</span><span class="w"></span>
<span class="n">Terraform</span><span class="w"> </span><span class="n">used</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">selected</span><span class="w"> </span><span class="n">providers</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="k">generate</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">execution</span><span class="w"> </span><span class="n">plan</span><span class="p">.</span><span class="w"></span>
<span class="n">Resource</span><span class="w"> </span><span class="n">actions</span><span class="w"> </span><span class="n">are</span><span class="w"> </span><span class="n">indicated</span><span class="w"> </span><span class="n">with</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="nl">symbols:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">destroy</span><span class="w"></span>
<span class="n">Terraform</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">perform</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="nl">actions:</span><span class="w"></span>
<span class="w"> </span><span class="p">#</span><span class="w"> </span><span class="n">digitalocean_container_registry</span><span class="p">.</span><span class="n">container_registry</span><span class="w"> </span><span class="n">will</span><span class="w"> </span><span class="n">be</span><span class="w"> </span><span class="n">destroyed</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">resource</span><span class="w"> </span><span class="s">"digitalocean_container_registry"</span><span class="w"> </span><span class="s">"container_registry"</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w">      </span><span class="o">-</span><span class="w"> </span><span class="n">created_at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"2022-05-02 00:48:55 +0000 UTC"</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">endpoint</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"registry.digitalocean.com/model-services-registry"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"model-services-registry"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"model-services-registry"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"sfo3"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">server_url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"registry.digitalocean.com"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">storage_usage_bytes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mh">694379520</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">subscription_tier_slug</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"basic"</span><span class="w"> </span><span class="o">-></span><span class="w"> </span><span class="n">null</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="p">...</span><span class="w"></span>
</code></pre></div>
<h2>Closing</h2>
<p>In this blog post, we showed how to use decorators to perform data enrichment for machine learning models. Data enrichment is a common requirement across many different ML model deployments. We went through the entire design and coding process for the decorator, tested it locally using Docker, created the infrastructure using Terraform, and then deployed the solution to Kubernetes.</p>
<p>One of the benefits of using a decorator for the ML model is that we keep the model prediction code and the data access code separate from each other. The model code did not have to change at all for us to be able to perform data enrichment for the model. The RESTful service package code also didn't have to be modified because it supports adding decorators to models through configuration rather than through code. In the end, it was possible to cleanly combine the model, decorator, and service components into one cohesive solution through configuration alone. The service is also able to host multiple decorators for each model, which allows for more complex use cases for decorators.</p>
<p>Another benefit is that we are able to reuse the decorator we built in this blog post to do data enrichment for any ML model deployment that needs to pull data from a PostgreSQL database. The same decorator class can easily be instantiated and added to any model instance that follows the MLModel interface. We can do this because the decorator is built for flexibility: it can be configured to load any number of fields from a database table and join the values into the model's input.</p>Decorator Pattern for ML Models2022-02-27T07:00:00-05:002022-02-27T07:00:00-05:00Brian Schmidttag:www.tekhnoal.com,2022-02-27:/ml-model-decorators.html<p>The decorator pattern is a software engineering pattern that allows software to be more flexible, more reusable, and more cohesive. In this blog post, we’ll explore how decorators work, how to implement them, and how to apply them to the MLModel base class.</p><h1>Decorator Pattern for ML Models</h1>
<h2>Introduction</h2>
<p>The decorator pattern is a software engineering pattern that allows software to be more flexible, more reusable, and more cohesive. In this blog post, we’ll explore how decorators work, how to implement them, how to apply them to the MLModel base class, and how to deploy them in a REST service.</p>
<p>We’ll be building on top of the MLModel base class that we’ve built in a <a href="https://www.tekhnoal.com/introducing-ml-base-package.html">previous blog post</a>. The MLModel base class is designed to be wrapped around the prediction functionality of a machine learning model. It has several properties that allow a model object to describe itself to the outside world, including its name, version, and input and output schemas. The MLModel base class also requires any class that inherits from it to implement the __init__() method and the predict() method. These two methods form the simplest functionality of a machine learning model: the __init__() method is where model parameters are loaded, and the predict() method is where predictions are made.</p>
<h2>The Decorator Pattern</h2>
<p>The decorator pattern is an object-oriented design pattern that is useful when behavior needs to be added to an object without changing or subclassing the object’s class. A decorator is an object that “decorates” the object it wraps without modifying that object’s API. The decorator executes its own behavior before and after the behavior of the decorated object; in this way, the decorator instance acts as a “gateway” to the decorated object.</p>
<h3>How to Build a Decorator</h3>
<p>A decorator is a class that has the same API as the class that we want to decorate. In order to build a decorator, we’ll first create a Decorator base class by following these steps:</p>
<ul>
<li>Subclass the class we want to decorate, creating a Decorator class with the same API.</li>
<li>In the Decorator class, add an instance attribute that can point to an instance of the class that we want to decorate.</li>
<li>When instantiating the Decorator class, receive an instance of the class we want to decorate and save it to the instance attribute.</li>
<li>In the Decorator class, implement the methods of the API of the class we want to decorate, calling the methods of the instance attribute and returning the results to the caller.</li>
</ul>
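<p>The steps above can be sketched in Python. The Service class and its decorator here are hypothetical stand-ins for illustration; they are not taken from the ml_base package:</p>

```python
class Service:
    """The class we want to decorate."""

    def greet(self, name: str) -> str:
        return f"Hello, {name}!"


class ServiceDecorator(Service):
    """Decorator base class: same API as Service, forwards every call."""

    def __init__(self, service: Service) -> None:
        # instance attribute pointing at the decorated object
        self._service = service

    def greet(self, name: str) -> str:
        # call the decorated object's method and return the result to the caller
        return self._service.greet(name)
```

<p>An instance of this base class just forwards calls, so the decorated object behaves exactly as it did before being decorated.</p>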
<p>If we instantiate the Decorator base class, the decorator instance will just forward all method calls to the decorated object, which is not very useful. To actually build a Decorator, we’ll need to create a subclass of it like this:</p>
<ul>
<li>Create a subclass of Decorator that overrides the methods that you want to modify, adding your own behavior.</li>
<li>Make sure that you call the corresponding methods in the instance attribute from the Decorator’s methods in order to allow the decorated object to still execute its own behavior.</li>
</ul>
<p>Notice that a decorator instance can actually decorate another decorator instance, which allows us to “stack” decorators together to do more complex things. </p>
<h3>Benefits of the Decorator Pattern</h3>
<p>One of the great benefits of decorators is the flexibility that they bring to software development. Without the use of decorators, an object’s class must be modified or subclassed in order to modify its behavior. By using decorators, we can modify the behavior just by attaching the decorator to the object. The “decoration” of an object can be done at runtime and can be configuration-driven, which means that we can change a program’s behavior quickly and easily by modifying its configuration instead of its source code.</p>
<p>Another benefit of decorators is that the API of the object that is decorated does not change at all. Any other object that depends on the API of the decorated object can use it without modification and without being aware that it is decorated. The only problems that arise when applying decorators are when an object depends on another object’s specific behavior instead of its API; however, this is an antipattern and should be avoided.</p>
<p>Yet another benefit of decorators is the ability to reuse them across different parts of an application. If we need to add the same behavior to many different objects which share the same API, we can create a decorator class that implements the behavior and attach it to the specific objects that we need to modify. If we had modified the behavior of the objects by changing their class, we would force all instances of the class to have the new behavior that we needed. If we subclassed the original class to add behavior, we would be adding another level of abstraction to the design, which makes everything more complicated. By using decorator instances and attaching them only to the objects that we actually need to modify, we simplify the application’s codebase.</p>
<p>By adding the decorator pattern to a codebase, we are able to make the whole codebase more cohesive. This is because we’re making individual classes that do one thing only. If we need to add some extra behavior to the class, we can attach a decorator that adds only that behavior instead of adding the behavior to the original class. The single responsibility principle tells us that a class should have only one reason to change; by using decorators we can make following this principle in our code a lot easier.</p>
<p>Decorators also encourage us to use a compositional approach to software development, which means that we create the desired behavior of the program by “composing” it from various smaller pieces of code. This is different from a hierarchical approach in which we define new behaviors by inheriting from and extending the behavior of base classes. Building software through composition is simpler in the long run because it incentivizes us to use simpler inheritance hierarchies that are easier to work with.</p>
<h3>Decorators in the Python Language</h3>
<p>The Python programming language already has a feature called decorators, which is syntactic sugar that allows a programmer to extend the functionality of a function or class. A decorator of this type is a function that takes a function or a class as a parameter and extends it with new behavior. Functions that are “decorated” have the name of the decorator function prepended with an “@” symbol:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@my_decorator</span>
<span class="k">def</span> <span class="nf">my_function</span><span class="p">():</span>
<span class="o">...</span>
</code></pre></div>
<p>In this case the decorated function is called my_function and the decorator function is called my_decorator. It’s important to understand that in this blog post, we are not talking about this kind of decorator, although it is a similar concept. A great place to learn about Python decorators is <a href="https://realpython.com/primer-on-python-decorators/">here</a>.</p>
<p>The decorator that is supported by the Python language allows you to decorate code, but does not allow for dynamic runtime behavior. That is to say, we can modify a function when it is defined, but not dynamically while the program runs. The type of decoration we will be building in this blog post will allow us to decorate MLModel objects at runtime, regardless of the actual code. This means that we’ll be able to add decorators at runtime from configuration, adding some flexibility to our software.</p>
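<p>To make the contrast concrete, here is a hedged sketch (all names are hypothetical) showing a function decorator applied at definition time next to an object decorator attached at runtime based on configuration:</p>

```python
def logged(func):
    """A Python function decorator, applied when the function is defined."""
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


@logged  # decoration happens here, at definition time
def double(x):
    return x * 2


class Doubler:
    """A simple object whose behavior we may want to extend."""

    def predict(self, x):
        return x * 2


class LoggingDecorator:
    """An object decorator that can be attached (or not) at runtime."""

    def __init__(self, model):
        self._model = model

    def predict(self, x):
        print("calling predict")
        return self._model.predict(x)


# configuration-driven decoration, decided while the program runs
config = {"decorate": True}
model = Doubler()
if config["decorate"]:
    model = LoggingDecorator(model)
```

<p>The @logged decoration is fixed in the source code, while the LoggingDecorator is attached only when the configuration asks for it.</p>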
<h2>Base Class for Decorators</h2>
<p>The decorator pattern requires that we define a base class for the decorators that we want to actually build. </p>
<p>First, we'll install the ml_base package:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
<span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="s2">"ml_base&gt;=0.2.0"</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The MLModelDecorator base class looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="kn">from</span> <span class="nn">ml_base.ml_model</span> <span class="kn">import</span> <span class="n">MLModel</span>
<span class="k">class</span> <span class="nc">MLModelDecorator</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">MLModel</span><span class="p">]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="k">if</span> <span class="n">model</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> <span class="ow">and</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">MLModel</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Only objects of type MLModel can be wrapped with MLModelDecorator instances."</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_model"</span><span class="p">]</span> <span class="o">=</span> <span class="n">model</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_configuration"</span><span class="p">]</span> <span class="o">=</span> <span class="n">kwargs</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">display_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_model"</span><span class="p">]</span><span class="o">.</span><span class="n">display_name</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">qualified_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_model"</span><span class="p">]</span><span class="o">.</span><span class="n">qualified_name</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">description</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_model"</span><span class="p">]</span><span class="o">.</span><span class="n">description</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">version</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_model"</span><span class="p">]</span><span class="o">.</span><span class="n">version</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_model"</span><span class="p">]</span><span class="o">.</span><span class="n">input_schema</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_model"</span><span class="p">]</span><span class="o">.</span><span class="n">output_schema</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="vm">__dict__</span><span class="p">[</span><span class="s2">"_model"</span><span class="p">]</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The MLModelDecorator base class is actually defined in the <a href="https://github.com/schmidtbri/ml-base">ml_base package</a> in version 0.2.0 and above. We can also import the class from the ml_base package like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">ml_base.decorator</span> <span class="kn">import</span> <span class="n">MLModelDecorator</span>
</code></pre></div>
<p>The base class for ML Model decorators is designed to hold a reference to an MLModel instance and add no behavior to it. Every method in the decorator just calls the corresponding method on the MLModel instance. This is done on purpose so that we can easily build simple decorators that only work on a single method while leaving all of the other methods and properties alone.</p>
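<p>To make the pass-through idea concrete, here is a minimal standalone sketch of the same pattern. It uses plain Python classes (a hypothetical <code>Model</code> standin rather than a real MLModel subclass) so it runs without the ml_base package:</p>

```python
class Model:
    """Hypothetical standin for an MLModel implementation."""
    display_name = "Example Model"

    def predict(self, data):
        return {"result": data * 2}


class PassThroughDecorator:
    """Minimal sketch of a pass-through decorator: it adds no behavior."""

    def __init__(self, model):
        self._model = model

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so every
        # attribute and method access is forwarded to the wrapped model.
        return getattr(self._model, name)


decorated = PassThroughDecorator(Model())
print(decorated.display_name)  # forwarded property
print(decorated.predict(2))    # forwarded method
```

<p>The real MLModelDecorator forwards each property and method explicitly instead of relying on <code>__getattr__</code>, which makes it easy for a subclass to override a single method while leaving the rest alone.</p>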
<h2>Installing a Model</h2>
<p>To make this blog post a little shorter we won't build a new model to work with. Instead we'll install a model that we've built in the past.</p>
<p>To install the model, we can use the pip command and point it at the github repo of the model:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="o">-</span><span class="n">e</span> <span class="n">git</span><span class="o">+</span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">schmidtbri</span><span class="o">/</span><span class="n">regression</span><span class="o">-</span><span class="n">model</span><span class="c1">#egg=insurance_charges_model</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>The model is used to estimate insurance charges and we built it in a <a href="https://www.tekhnoal.com/regression-model.html">previous blog post</a>. The code for the model is in <a href="https://github.com/schmidtbri/regression-model">this github repository</a>.</p>
<p>To make a prediction with the model, we'll import the model's class:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.model</span> <span class="kn">import</span> <span class="n">InsuranceChargesModel</span>
</code></pre></div>
<p>Now we can instantiate the model:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">InsuranceChargesModel</span><span class="p">()</span>
</code></pre></div>
<p>To make a prediction, we'll need to use the model's input schema class.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.schemas</span> <span class="kn">import</span> <span class="n">InsuranceChargesModelInput</span><span class="p">,</span> \
<span class="n">SexEnum</span><span class="p">,</span> <span class="n">RegionEnum</span>
<span class="n">model_input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span><span class="n">age</span><span class="o">=</span><span class="mi">21</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">male</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">20.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">southwest</span><span class="p">)</span>
</code></pre></div>
<p>Now we can make a prediction with the model by calling predict() with the input.</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=2231.7)
</code></pre></div>
<p>The model predicts that the charges will be $2231.70.</p>
<h2>Decorating the Model</h2>
<p>To show how the simplest possible decorator works, we'll wrap the model instance directly with the MLModelDecorator base class:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">ml_base</span> <span class="kn">import</span> <span class="n">MLModelDecorator</span>
<span class="n">decorator</span> <span class="o">=</span> <span class="n">MLModelDecorator</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">decorator</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>MLModelDecorator(InsuranceChargesModel)
</code></pre></div>
<p>The decorator instance is wrapping the model. When we print the decorator object, it shows us that it is wrapping the InsuranceChargesModel instance.</p>
<p>All of the properties of the InsuranceChargesModel instance can still be accessed:</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">decorator</span><span class="o">.</span><span class="n">display_name</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">decorator</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">decorator</span><span class="o">.</span><span class="n">description</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">decorator</span><span class="o">.</span><span class="n">version</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">decorator</span><span class="o">.</span><span class="n">input_schema</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">decorator</span><span class="o">.</span><span class="n">output_schema</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Insurance Charges Model
insurance_charges_model
Model to predict the insurance charges of a customer.
0.1.0
<class 'insurance_charges_model.prediction.schemas.InsuranceChargesModelInput'>
<class 'insurance_charges_model.prediction.schemas.InsuranceChargesModelOutput'>
</code></pre></div>
<p>The MLModelDecorator base class actually makes no modifications to the results that it "passes through" from the model instance.</p>
<p>We can also make predictions with the predict() method:</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">decorator</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=2231.7)
</code></pre></div>
<p>The MLModelDecorator base class is not very useful by itself; we need to subclass it to add custom behaviors.</p>
<h2>Creating a Simple Decorator</h2>
<p>We'll override the default implementation of the MLModelDecorator base class in order to add some behavior.</p>
<p>This decorator executes around the predict() method:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">SimplePredictDecorator</span><span class="p">(</span><span class="n">MLModelDecorator</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Executing before prediction."</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Executing after prediction."</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
</code></pre></div>
<p>The decorator wraps around the predict() method and does nothing except print a message before and after executing the predict method of the model.</p>
<p>We can try it out by wrapping the model instance again:</p>
<div class="highlight"><pre><span></span><code><span class="n">decorator</span> <span class="o">=</span> <span class="n">SimplePredictDecorator</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
</code></pre></div>
<p>Now, we'll call the predict method:</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">decorator</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">model_input</span><span class="p">)</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Executing before prediction.
Executing after prediction.
InsuranceChargesModelOutput(charges=2231.7)
</code></pre></div>
<p>The decorator instance executed before and after the model's predict() method and printed some messages.</p>
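<p>The same around-the-method structure supports more useful behaviors than printing. For example, a decorator could time each prediction. This is a sketch using plain classes and a hypothetical <code>Model</code> standin, not the ml_base API:</p>

```python
import time


class Model:
    """Hypothetical standin for an MLModel implementation."""

    def predict(self, data):
        return data + 1


class TimingDecorator:
    """Sketch: record how long the wrapped predict() call takes."""

    def __init__(self, model):
        self._model = model
        self.last_elapsed = None

    def predict(self, data):
        start = time.perf_counter()
        prediction = self._model.predict(data)
        self.last_elapsed = time.perf_counter() - start
        return prediction


timed_model = TimingDecorator(Model())
print(timed_model.predict(1))
print(f"predict() took {timed_model.last_elapsed:.6f} seconds")
```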
<h2>Adding UUIDs to Predictions</h2>
<p>Now we’ll build a decorator class that adds the ability to generate UUIDs for each prediction that a model makes. A UUID is a universally unique 128-bit identifier that can be generated for anything that we want to identify uniquely. In this case, we’d like to identify an individual prediction that an ML model makes. </p>
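<p>The standard library's uuid module is all we need to generate these identifiers. A quick illustration:</p>

```python
from uuid import uuid4

# uuid4() generates a random 128-bit identifier; str() renders it in the
# canonical 8-4-4-4-12 hexadecimal format.
prediction_id = str(uuid4())
print(prediction_id)       # e.g. 'e84ab429-acec-4630-83d2-12809f222ae2'
print(len(prediction_id))  # always 36 characters: 32 hex digits and 4 hyphens
```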
<p>To do this, we’ll have to do four things:</p>
<ul>
<li>Modify the description of the model to add info about the prediction id.</li>
<li>Modify the input schema of the model to add an optional field that accepts UUIDs.</li>
<li>Modify the output schema of the model to add a field for the UUID.</li>
<li>Modify the predict() method to generate a UUID and return it alongside the prediction.</li>
</ul>
<p>Here is the code for the decorator:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">create_model</span>
<span class="kn">from</span> <span class="nn">uuid</span> <span class="kn">import</span> <span class="n">uuid4</span>
<span class="k">class</span> <span class="nc">PredictionIDDecorator</span><span class="p">(</span><span class="n">MLModelDecorator</span><span class="p">):</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">description</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="n">decorator_description</span> <span class="o">=</span> <span class="s2">" This model also has an optional input called 'prediction_id' that accepts a UUID string to uniquely identify the prediction returned. If the prediction id is not provided, a UUID is generated and returned in a field called 'prediction_id' in the model output."</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">description</span> <span class="o">+</span> <span class="n">decorator_description</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">input_schema</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">input_schema</span>
<span class="n">new_input_schema</span> <span class="o">=</span> <span class="n">create_model</span><span class="p">(</span>
<span class="n">input_schema</span><span class="o">.</span><span class="vm">__name__</span><span class="p">,</span>
<span class="n">prediction_id</span><span class="o">=</span><span class="p">(</span><span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span> <span class="kc">None</span><span class="p">),</span>
<span class="n">__base__</span><span class="o">=</span><span class="n">input_schema</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">new_input_schema</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">output_schema</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">output_schema</span>
<span class="n">new_output_schema</span> <span class="o">=</span> <span class="n">create_model</span><span class="p">(</span>
<span class="n">output_schema</span><span class="o">.</span><span class="vm">__name__</span><span class="p">,</span>
<span class="n">prediction_id</span><span class="o">=</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="o">...</span><span class="p">),</span>
<span class="n">__base__</span><span class="o">=</span><span class="n">output_schema</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">new_output_schema</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="s2">"prediction_id"</span><span class="p">)</span> <span class="ow">and</span> <span class="n">data</span><span class="o">.</span><span class="n">prediction_id</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">prediction_id</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">prediction_id</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">prediction_id</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">uuid4</span><span class="p">())</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="n">wrapped_prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">output_schema</span><span class="p">(</span><span class="n">prediction_id</span><span class="o">=</span><span class="n">prediction_id</span><span class="p">,</span> <span class="o">**</span><span class="n">prediction</span><span class="o">.</span><span class="n">dict</span><span class="p">())</span>
<span class="k">return</span> <span class="n">wrapped_prediction</span>
</code></pre></div>
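<p>The schema modifications rely on pydantic's create_model() function, which builds a new model class at runtime. Passing the original class as <code>__base__</code> means the new class inherits every existing field and only adds the new one. Here is a minimal standalone sketch with a hypothetical <code>Original</code> schema:</p>

```python
from typing import Optional

from pydantic import BaseModel, create_model


class Original(BaseModel):
    """Hypothetical input schema with a single field."""
    age: int


# Build a subclass of Original that adds an optional prediction_id field,
# reusing the original class name so generated documentation stays consistent.
Extended = create_model(
    Original.__name__,
    prediction_id=(Optional[str], None),
    __base__=Original,
)

print(Extended(age=21).prediction_id)                       # None by default
print(Extended(age=21, prediction_id="abc").prediction_id)  # provided value
```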
<p>We'll try it out by instantiating the decorator with the InsuranceChargesModel instance:</p>
<div class="highlight"><pre><span></span><code><span class="n">uuid_decorated_model</span> <span class="o">=</span> <span class="n">PredictionIDDecorator</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">uuid_decorated_model</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>PredictionIDDecorator(InsuranceChargesModel)
</code></pre></div>
<p>The description should be different:</p>
<div class="highlight"><pre><span></span><code><span class="n">uuid_decorated_model</span><span class="o">.</span><span class="n">description</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="s2">"</span><span class="s">Model to predict the insurance charges of a customer. This model also has an optional input called 'prediction_id' that accepts an UUID string to uniquely identify the prediction returned. If the prediction id is not provided, a UUID is generated and returned in a field called 'prediction_id' in the model output.</span><span class="s2">"</span>
</code></pre></div>
<p>Next, we'll take a look at the input schema:</p>
<div class="highlight"><pre><span></span><code><span class="n">uuid_decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'InsuranceChargesModelInput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Schema for input of the model's predict method."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'age'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Age of primary beneficiary in years.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">65</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'sex'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Sex'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Gender of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/SexEnum'</span><span class="p">}]},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'bmi'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body Mass Index'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Body mass index of beneficiary.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mf">15.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mf">50.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'number'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'children'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Children'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Number of children covered by health insurance.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'minimum'</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'maximum'</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'integer'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'smoker'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Smoker'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Whether beneficiary is a smoker.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'boolean'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'region'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Region where beneficiary lives.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'allOf'</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/RegionEnum'</span><span class="p">}]},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'prediction_id'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Prediction Id'</span><span class="p">,</span><span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'definitions'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'SexEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'SexEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'sex' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'male'</span><span class="p">,</span><span class="w"> </span><span class="s1">'female'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'RegionEnum'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s2">"Enumeration for the value of the 'region' input of the model."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'southwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'southeast'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northwest'</span><span class="p">,</span><span class="w"> </span><span class="s1">'northeast'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}}}</span><span class="w"></span>
</code></pre></div>
<p>Even though the InsuranceChargesModel didn't have a "prediction_id" field in its input schema, the decorated model instance now has the field as an optional string. This new field was added by the decorator instance.</p>
<p>We can see the prediction_id field schema by selecting it from the properties:</p>
<div class="highlight"><pre><span></span><code><span class="n">uuid_decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()[</span><span class="s2">"properties"</span><span class="p">][</span><span class="s2">"prediction_id"</span><span class="p">]</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{'title': 'Prediction Id', 'type': 'string'}
</code></pre></div>
<p>The output schema of the model was also modified.</p>
<div class="highlight"><pre><span></span><code><span class="n">uuid_decorated_model</span><span class="o">.</span><span class="n">output_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">InsuranceChargesModelOutput</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">description</span><span class="s1">'</span>: <span class="s2">"</span><span class="s">Schema for output of the model's predict method.</span><span class="s2">"</span>,
<span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">object</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">properties</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">charges</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">Charges</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">description</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">Individual medical costs billed by health insurance to customer in US dollars.</span><span class="s1">'</span>,
<span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">number</span><span class="s1">'</span>},
<span class="s1">'</span><span class="s">prediction_id</span><span class="s1">'</span>: {<span class="s1">'</span><span class="s">title</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">Prediction Id</span><span class="s1">'</span>, <span class="s1">'</span><span class="s">type</span><span class="s1">'</span>: <span class="s1">'</span><span class="s">string</span><span class="s1">'</span>}},
<span class="s1">'</span><span class="s">required</span><span class="s1">'</span>: [<span class="s1">'</span><span class="s">prediction_id</span><span class="s1">'</span>]}
</code></pre></div>
<p>In the output schema, "prediction_id" is a required field; we did this because we always want a prediction_id associated with a prediction. To see how the decorator uses these new fields, we'll make a prediction:</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">uuid_decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span>
<span class="n">uuid_decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="p">(</span><span class="n">age</span><span class="o">=</span><span class="mi">21</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">male</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">20.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">southwest</span><span class="p">))</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=2231.7, prediction_id='e84ab429-acec-4630-83d2-12809f222ae2')
</code></pre></div>
<p>The prediction now has a randomly generated UUID attached to it by the decorator in the "prediction_id" field. </p>
<p>We had to use the input schema returned by the decorator because the original InsuranceChargesModelInput schema class is no longer the model's input schema. The decorator creates a new class that becomes the model's new input schema.</p>
<p>If we provide a prediction_id in the model's input, the decorator will not generate a new one; instead, it will return the prediction_id that was provided in the input.</p>
<div class="highlight"><pre><span></span><code><span class="n">prediction</span> <span class="o">=</span> <span class="n">uuid_decorated_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span>
<span class="n">uuid_decorated_model</span><span class="o">.</span><span class="n">input_schema</span><span class="p">(</span><span class="n">age</span><span class="o">=</span><span class="mi">21</span><span class="p">,</span>
<span class="n">sex</span><span class="o">=</span><span class="n">SexEnum</span><span class="o">.</span><span class="n">male</span><span class="p">,</span>
<span class="n">bmi</span><span class="o">=</span><span class="mf">20.0</span><span class="p">,</span>
<span class="n">children</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">smoker</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">region</span><span class="o">=</span><span class="n">RegionEnum</span><span class="o">.</span><span class="n">southwest</span><span class="p">,</span>
<span class="n">prediction_id</span><span class="o">=</span><span class="s2">"asdf-1234-asdf-1234"</span><span class="p">))</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>InsuranceChargesModelOutput(charges=2231.7, prediction_id='asdf-1234-asdf-1234')
</code></pre></div>
<p>The prediction now carries the same prediction_id that we provided in the model's input; the decorator did not generate a new one.</p>
<p>This decorator will work with any model that works with the MLModel base class, as long as the UUID field can be attached to the root of the input and output schemas.</p>
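<p>To make the mechanics concrete, here is a minimal sketch of how such a decorator can be built with pydantic's <code>create_model</code>. This is a simplified stand-in, not the real PredictionIDDecorator or MLModel code; the <code>ToyModel</code> class and its fields are invented for illustration.</p>

```python
import uuid
from typing import Optional

from pydantic import BaseModel, create_model


class ToyInput(BaseModel):
    x: float


class ToyOutput(BaseModel):
    y: float


class ToyModel:
    """Stand-in for an MLModel: one predict method, two schemas."""
    input_schema = ToyInput
    output_schema = ToyOutput

    def predict(self, data: ToyInput) -> ToyOutput:
        return ToyOutput(y=data.x * 2.0)


class PredictionIDDecorator:
    """Wraps a model, extending its schemas with a prediction_id field."""

    def __init__(self, model):
        self._model = model
        # new input schema: optional prediction_id added to the original fields
        self.input_schema = create_model(
            model.input_schema.__name__ + "WithID",
            prediction_id=(Optional[str], None),
            __base__=model.input_schema)
        # new output schema: prediction_id is always present in the output
        self.output_schema = create_model(
            model.output_schema.__name__ + "WithID",
            prediction_id=(str, ...),
            __base__=model.output_schema)

    def predict(self, data):
        # use the caller's prediction_id if given, otherwise generate a UUID
        prediction_id = data.prediction_id or str(uuid.uuid4())
        inner = self._model.input_schema(**data.dict(exclude={"prediction_id"}))
        result = self._model.predict(inner)
        return self.output_schema(**result.dict(), prediction_id=prediction_id)


decorated = PredictionIDDecorator(ToyModel())
generated = decorated.predict(decorated.input_schema(x=1.0))
provided = decorated.predict(
    decorated.input_schema(x=1.0, prediction_id="asdf-1234"))
```

<p>This mirrors the behavior shown above: the id is passed through when supplied and generated otherwise, and the decorated model exposes new input and output schema classes.</p>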
<h2>Adding Decorators to a Deployed Model</h2>
<p>In order to deploy a model with a decorator we'll need a service that can add decorators to the model instance right after it is instantiated. This is supported by the rest_model_service package in version 0.2.0 and above. We built the rest_model_service package in a <a href="https://www.tekhnoal.com/rest-model-service.html">previous blog post</a> to easily deploy MLModel instances.</p>
<p>First, we'll install the rest_model_service package.</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">rest_model_service</span><span class="o">>=</span><span class="mf">0.2.0</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>In order to deploy the InsuranceChargesModel class, we'll create a configuration YAML file for the service:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Insurance Charges Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model.prediction.model.InsuranceChargesModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
</code></pre></div>
<p>Notice that we're pointing to the InsuranceChargesModel class through its full class path in the insurance_charges_model package, so the service can import and instantiate it.</p>
<p>We can run the REST model service with these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>configuration/rest_config.yaml
uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>We can access the documentation at the root of the model service:</p>
<p><img alt="Service Documentation" src="https://www.tekhnoal.com/service_documentation.png" width="100%"></p>
<p>The model is running behind the "api/models/insurance_charges_model/prediction" endpoint. We can make a prediction with a curl command:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="p">(</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s1">'{"age": 65, "sex": "male", "bmi": 50, "children": 5, "smoker": true, "region": "southwest"}'</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":46277.67}
</code></pre></div>
<p>We were able to make a prediction with the undecorated model; notice that we haven't loaded the decorator for the model yet. We'll stop the service with Ctrl-C and try that next.</p>
<p>Adding a decorator to the InsuranceChargesModel instance is done by adding the "decorators" key to the configuration:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Insurance Charges Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model.prediction.model.InsuranceChargesModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
<span class="w"> </span><span class="nt">decorators</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ml_model_decorators.prediction_id_decorator.PredictionIDDecorator</span><span class="w"></span>
</code></pre></div>
<p>We'll point the service to the new config file and restart it:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>configuration/decorators_config.yaml
uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>With the service now restarted using the PredictionIDDecorator, we can view the documentation for this endpoint:</p>
<p><img alt="Endpoint Documentation" src="https://www.tekhnoal.com/endpoint_documentation.png" width="100%"></p>
<p>As you can see, the modified description of the model is now displayed instead of the old description and the example value has the prediction_id field. Now we can try to make a prediction again:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="p">(</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s1">'{"age": 65, "sex": "male", "bmi": 50, "children": 5, "smoker": true, "region": "southwest"}'</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":46277.67,"prediction_id":"5edbec33-ebec-4cdc-908b-e7d90d4bc2a2"}
</code></pre></div>
<p>We've made a prediction without providing a prediction_id, and we have the generated prediction_id in the response.</p>
<p>We can make another prediction request but with a provided prediction_id:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="p">(</span><span class="n">curl</span> <span class="o">-</span><span class="n">X</span> <span class="s1">'POST'</span> \
<span class="s1">'http://127.0.0.1:8000/api/models/insurance_charges_model/prediction'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'accept: application/json'</span> \
<span class="o">-</span><span class="n">H</span> <span class="s1">'Content-Type: application/json'</span> \
<span class="o">-</span><span class="n">d</span> <span class="s1">'{"age": 65, "sex": "male", "bmi": 50, "children": 5, "smoker": true, "region": "southwest", "prediction_id": "asdf-1234-asdf-1234"}'</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{"charges":46277.67,"prediction_id":"asdf-1234-asdf-1234"}
</code></pre></div>
<p>As we expected, the model is now returning prediction ids along with the predictions themselves. </p>
<p>Since the service is able to load decorators along with models, we can modify the runtime behavior of any model we wish, as long as we wrap the code in an MLModelDecorator class.</p>
<h2>Closing</h2>
<p>In this blog post, we showed how decorators work and how to create decorators that work with the MLModel base class. We also showed how to quickly deploy decorators on models inside a RESTful model service through configuration. Decorators are an easy way to add functionality to a model without having to modify the code of the model class itself: we deployed a UUID generator on an ML model instance without touching the model's class or the REST model service that hosts it. The combination of decorators and machine learning models can help us quickly and easily deploy common functionality to many different models.</p>
<h1>Property-Based Testing for ML Models</h1>
<p>Brian Schmidt, 2021-09-03</p>
<h2>Introduction</h2>
<p>Property-based testing is a form of software testing that allows
developers to write more comprehensive tests for software components.
Property-based tests work by asserting that certain properties of the
software component under test hold over a wide range of inputs.</p>
<p>Property-based tests rely on the generation of inputs for a component
and are a form of generative testing. When doing property-based testing
it is useful to think in terms of invariants within the software
component that we are testing. An invariant is a condition or assumption
that we expect will never be violated by the component.</p>
<p>Generative software testing is a type of testing in which a developer
does not have to come up with test cases manually. To accomplish this,
an engine is used that can come up with any number of test cases, as
long as we're able to state our requirements for the test cases clearly
and concisely. When the engine generates a test case for us, we send it
to the code that we are testing and see if any errors come up.
Generative testing is a form of black-box testing because we don't
need to know anything about the internals of the component under
test, only how to structure its input correctly. Once a
test case is generated, we test by making sure that the component
returns valid output or that it is not in an invalid state.</p>
<p>Machine learning models are just like any other software component, they
require input and provide an output. In fact, ML models are some of the
simplest software components that make up a software system because they
usually only have one function: the "predict" function. The prediction
usually only requires one object, and the prediction result is also a
single object. Because of these factors, ML models are actually great
candidates for property-based testing. In this blog post, we'll focus on
testing the input and output schemas of the model and we'll make sure
that the model is able to accept the inputs that it says that it can
accept. In terms of invariants, we'll be testing that the model can
handle any input that is within its stated input schema.</p>
<p>In this blog post, we'll do property-based testing of an ML model and a
RESTful model service that we'll build around the same model. To do
property-based testing we'll use the <a href="https://hypothesis.readthedocs.io/en/latest/">hypothesis
package</a>, and to do
property-based testing on the model service we'll use the <a href="https://schemathesis.readthedocs.io/en/stable/">schemathesis
package</a>.</p>
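<p>As a small taste of the style of test that hypothesis enables, here is a toy property-based test. The <code>clamp</code> function and its invariant are invented for illustration and are not part of the model we build below.</p>

```python
from hypothesis import given, strategies as st


def clamp(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Clamp a number into the [low, high] interval."""
    return max(low, min(high, value))


# invariant: for any finite float, the result never leaves the interval
@given(st.floats(allow_nan=False, allow_infinity=False))
def test_clamp_stays_in_bounds(value):
    result = clamp(value)
    assert 0.0 <= result <= 1.0


# a @given-decorated test can be called directly; hypothesis then
# generates and runs many input cases against it
test_clamp_stays_in_bounds()
```

<p>Instead of hand-picking a few example inputs, we state the invariant once and let the engine search for inputs that break it.</p>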
<h1>Package Structure</h1>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">mobile_handset_price_model</span>
<span class="o">-</span> <span class="nv">model_files</span> <span class="ss">(</span><span class="nv">output</span> <span class="nv">files</span> <span class="nv">from</span> <span class="nv">model</span> <span class="nv">training</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">prediction</span> <span class="ss">(</span><span class="nv">package</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">prediction</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">model</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">prediction</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">schemas</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">model</span> <span class="nv">input</span> <span class="nv">and</span> <span class="nv">output</span> <span class="nv">schemas</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">transformers</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">data</span> <span class="nv">transformers</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">training</span> <span class="ss">(</span><span class="nv">package</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">training</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">data_exploration</span>.<span class="nv">ipynb</span> <span class="ss">(</span><span class="nv">data</span> <span class="nv">exploration</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">data_preparation</span>.<span class="nv">ipynb</span> <span class="ss">(</span><span class="nv">data</span> <span class="nv">preparation</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">model_training</span>.<span class="nv">ipynb</span> <span class="ss">(</span><span class="nv">model</span> <span class="nv">training</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">model_validation</span>.<span class="nv">ipynb</span> <span class="ss">(</span><span class="nv">model</span> <span class="nv">validation</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span><span class="nv">unit</span> <span class="nv">tests</span> <span class="k">for</span> <span class="nv">model</span> <span class="nv">codel</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span> <span class="ss">(</span><span class="nv">list</span> <span class="nv">of</span> <span class="nv">dependencies</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">rest_config</span>.<span class="nv">yaml</span> <span class="ss">(</span><span class="nv">configuration</span> <span class="k">for</span> <span class="nv">REST</span> <span class="nv">model</span> <span class="nv">service</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">service_contract</span>.<span class="nv">yaml</span> <span class="ss">(</span><span class="nv">OpenAPI</span> <span class="nv">service</span> <span class="nv">contract</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">setup</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span> <span class="ss">(</span><span class="nv">test</span> <span class="nv">dependencies</span><span class="ss">)</span>
</code></pre></div>
<p>All of the code is available in a <a href="https://github.com/schmidtbri/property-based-testing-for-ml-models">github repository</a>.</p>
<h1>Creating a Model</h1>
<p>To be able to do property-based testing on an ML model, we'll need to
have a model to work with. In this section we will get a dataset,
explore it, preprocess it, train a model on it, and validate the
resulting model.</p>
<h2>Getting Data</h2>
<p>In order to train a model, we first need to have a dataset. We went into
Kaggle and found a dataset that contains mobile handset price
information. To make it easy to download the dataset, we installed the
kaggle python package and then we executed these commands to download
the data and unzip it into the data folder in the project:</p>
<div class="highlight"><pre><span></span><code>mkdir -p data
kaggle datasets download -d iabhishekofficial/mobile-price-classification -p ./data --unzip
</code></pre></div>
<p>To make it even easier to download the data, we added a Makefile target
for the commands:</p>
<div class="highlight"><pre><span></span><code><span class="nf">download-dataset</span><span class="o">:</span>
	mkdir -p data
	kaggle datasets download -d iabhishekofficial/mobile-price-classification -p ./data --unzip
</code></pre></div>
<p>Now all we need to do to get the data is execute this command:</p>
<div class="highlight"><pre><span></span><code>make download-data
</code></pre></div>
<p>Instead of having to remember how to get the data needed to do modeling,
I always try to create a repeatable and documented process for creating
the dataset. We also need to make sure to never store the dataset in
source control, so we'll add this line to the .gitignore file:</p>
<div class="highlight"><pre><span></span><code>data/
</code></pre></div>
<h2>Training a Model</h2>
<h3>Data Exploration</h3>
<p>In order to create a model, we'll first explore the data. Before we can
do that, we need to first load the data and do some basic housekeeping.</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"../../data/train.csv"</span><span class="p">)</span>
</code></pre></div>
<p>The datatypes of the columns are:</p>
<div class="highlight"><pre><span></span><code>data.dtypes
battery_power int64
blue int64
clock_speed float64
dual_sim int64
fc int64
four_g int64
int_memory int64
m_dep float64
mobile_wt int64
n_cores int64
pc int64
px_height int64
px_width int64
ram int64
sc_h int64
sc_w int64
talk_time int64
three_g int64
touch_screen int64
wifi int64
price_range int64
dtype: object
</code></pre></div>
<p>In order to more easily work with the data, we'll rename some of the
columns so that they have clearer names:</p>
<div class="highlight"><pre><span></span><code><span class="n">columns</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"blue"</span><span class="p">:</span> <span class="s2">"has_bluetooth"</span><span class="p">,</span>
<span class="s2">"dual_sim"</span><span class="p">:</span> <span class="s2">"has_dual_sim"</span><span class="p">,</span>
<span class="s2">"fc"</span><span class="p">:</span> <span class="s2">"front_camera_megapixels"</span><span class="p">,</span>
<span class="s2">"four_g"</span><span class="p">:</span> <span class="s2">"has_four_g"</span><span class="p">,</span>
<span class="s2">"int_memory"</span><span class="p">:</span> <span class="s2">"internal_memory"</span><span class="p">,</span>
<span class="s2">"m_dep"</span><span class="p">:</span> <span class="s2">"depth"</span><span class="p">,</span>
<span class="s2">"mobile_wt"</span><span class="p">:</span> <span class="s2">"weight"</span><span class="p">,</span>
<span class="s2">"n_cores"</span><span class="p">:</span> <span class="s2">"number_of_cores"</span><span class="p">,</span>
<span class="s2">"pc"</span><span class="p">:</span> <span class="s2">"primary_camera_megapixels"</span><span class="p">,</span>
<span class="s2">"px_height"</span><span class="p">:</span> <span class="s2">"pixel_resolution_height"</span><span class="p">,</span>
<span class="s2">"px_width"</span><span class="p">:</span> <span class="s2">"pixel_resolution_width"</span><span class="p">,</span>
<span class="s2">"sc_h"</span><span class="p">:</span> <span class="s2">"screen_height"</span><span class="p">,</span>
<span class="s2">"sc_w"</span><span class="p">:</span> <span class="s2">"screen_width"</span><span class="p">,</span>
<span class="s2">"three_g"</span><span class="p">:</span> <span class="s2">"has_three_g"</span><span class="p">,</span>
<span class="s2">"touch_screen"</span><span class="p">:</span> <span class="s2">"has_touch_screen"</span><span class="p">,</span>
<span class="s2">"wifi"</span><span class="p">:</span> <span class="s2">"has_wifi"</span>
<span class="p">}</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="n">columns</span><span class="p">)</span>
</code></pre></div>
<p>We also need to get the unique values of the variable we intend to use
as the target variable:</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span><span class="p">[</span><span class="s2">"price_range"</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>
<span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">])</span>
</code></pre></div>
<p>The target variable holds categorical values.</p>
<p>To finish the data exploration we'll use the
<a href="https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/">pandas_profiling</a>
package. This package is able to take a pandas dataframe and quickly
create a full report about the dataset in the dataframe. Here are some
simple statistics found by pandas_profiling:</p>
<p><img alt="Dataset Statistics" src="https://www.tekhnoal.com/dataset_statistics.png" width="100%"></p>
<p>The dataset has 21 variables in total, with 14 numeric variables and 7
categorical variables. There are 2000 samples, with no missing values or
duplicate values. After examining the report, we can see that the
categorical variables all hold only two values, for example the
"has_bluetooth" variable:</p>
<p><img alt="Variable Description" src="https://www.tekhnoal.com/variable_description.png" width="100%"></p>
<p>From this we can see that these are really just boolean values, we'll
use this later in order to simplify the input schema of the model.</p>
<p>The data exploration is in <a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/training/data_exploration.ipynb">this
notebook</a>.</p>
<h3>Preparing the Data</h3>
<p>To prepare the data for modeling, we'll first create lists of
categorical, numerical, and boolean variables:</p>
<div class="highlight"><pre><span></span><code><span class="n">categorical_cols</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">numerical_columns</span> <span class="o">=</span> <span class="p">[</span>
<span class="s2">"battery_power"</span><span class="p">,</span>
<span class="s2">"clock_speed"</span><span class="p">,</span>
<span class="s2">"front_camera_megapixels"</span><span class="p">,</span>
<span class="s2">"internal_memory"</span><span class="p">,</span>
<span class="s2">"depth"</span><span class="p">,</span>
<span class="s2">"weight"</span><span class="p">,</span>
<span class="s2">"number_of_cores"</span><span class="p">,</span>
<span class="s2">"primary_camera_megapixels"</span><span class="p">,</span>
<span class="s2">"pixel_resolution_height"</span><span class="p">,</span>
<span class="s2">"pixel_resolution_width"</span><span class="p">,</span>
<span class="s2">"ram"</span><span class="p">,</span>
<span class="s2">"screen_height"</span><span class="p">,</span>
<span class="s2">"screen_width"</span><span class="p">,</span>
<span class="s2">"talk_time"</span>
<span class="p">]</span>
<span class="n">boolean_columns</span> <span class="o">=</span> <span class="p">[</span>
<span class="s2">"has_bluetooth"</span><span class="p">,</span>
<span class="s2">"has_dual_sim"</span><span class="p">,</span>
<span class="s2">"has_four_g"</span><span class="p">,</span>
<span class="s2">"has_three_g"</span><span class="p">,</span>
<span class="s2">"has_touch_screen"</span><span class="p">,</span>
<span class="s2">"has_wifi"</span><span class="p">,</span>
<span class="p">]</span>
</code></pre></div>
<p>Because all of the categorical variables are in fact boolean variables,
we don't have any variables in the "categorical_cols" list. Next, we'll
create a transformer that will work with the numerical variables:</p>
<div class="highlight"><pre><span></span><code><span class="n">numerical_transformer</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">steps</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s2">"imputer"</span><span class="p">,</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">strategy</span><span class="o">=</span><span class="s2">"mean"</span><span class="p">)),</span>
<span class="p">(</span><span class="s2">"scaler"</span><span class="p">,</span> <span class="n">StandardScaler</span><span class="p">())</span>
<span class="p">])</span>
</code></pre></div>
<p>Next, we'll create a transformer that is able to convert the values in
the boolean columns to boolean values:</p>
<div class="highlight"><pre><span></span><code><span class="n">boolean_transformer</span> <span class="o">=</span> <span class="n">BooleanTransformer</span><span class="p">(</span><span class="n">true_value</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">false_value</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div>
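<p>The BooleanTransformer is a custom transformer that lives in the transformers.py module of the package. A minimal sketch of what such a transformer can look like follows; the real implementation may differ in details, this version simply maps the two sentinel values to booleans.</p>

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class BooleanTransformer(BaseEstimator, TransformerMixin):
    """Convert columns holding two sentinel values into boolean columns."""

    def __init__(self, true_value=1, false_value=0):
        self.true_value = true_value
        self.false_value = false_value

    def fit(self, X, y=None):
        # stateless transformer: there is nothing to learn from the data
        return self

    def transform(self, X):
        X = np.asarray(X)
        # values equal to true_value become True, everything else False
        return X == self.true_value


transformer = BooleanTransformer(true_value=1, false_value=0)
result = transformer.fit_transform(np.array([[1, 0], [0, 1]]))
```

<p>Deriving from BaseEstimator and TransformerMixin gives us <code>fit_transform</code> and parameter handling for free, which is what lets the transformer plug into a scikit-learn Pipeline or ColumnTransformer.</p>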
<p>Lastly, we'll combine both transformers using a ColumnTransformer:</p>
<div class="highlight"><pre><span></span><code><span class="n">column_transformer</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">(</span>
<span class="n">remainder</span><span class="o">=</span><span class="s2">"passthrough"</span><span class="p">,</span>
<span class="n">transformers</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s2">"numerical"</span><span class="p">,</span> <span class="n">numerical_transformer</span><span class="p">,</span> <span class="n">numerical_columns</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"boolean"</span><span class="p">,</span> <span class="n">boolean_transformer</span><span class="p">,</span> <span class="n">boolean_columns</span><span class="p">)</span>
<span class="p">]</span>
<span class="p">)</span>
</code></pre></div>
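<p>ColumnTransformer applies each listed transformer to its own subset of columns and concatenates the results, passing any remaining columns through untouched. A small standalone example of that behavior, with a built-in scaler standing in for the pipeline above and invented column names:</p>

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "battery": [1000.0, 2000.0, 3000.0],   # numerical column to scale
    "has_wifi": [1, 0, 1],                 # column left for the remainder
})

ct = ColumnTransformer(
    remainder="passthrough",
    transformers=[("numerical", StandardScaler(), ["battery"])])

result = ct.fit_transform(df)
# column 0 is the scaled battery values, column 1 is has_wifi passed through
```
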
<p>Now we can save the transformer object so we can fit it to the data
later:</p>
<div class="highlight"><pre><span></span><code><span class="n">joblib</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">column_transformer</span><span class="p">,</span> <span class="s2">"column_transformer.joblib"</span><span class="p">)</span>
</code></pre></div>
<p>The data preparation code is in <a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/training/data_preparation.ipynb">this
notebook</a>.</p>
<h3>Training a Model</h3>
<p>Now that we have the data transformations built, we can train a model.
To do that, we'll first create lists of the predictor variables and the
target column:</p>
<div class="highlight"><pre><span></span><code><span class="n">feature_columns</span> <span class="o">=</span> <span class="p">[</span>
<span class="s2">"battery_power"</span><span class="p">,</span>
<span class="s2">"has_bluetooth"</span><span class="p">,</span>
<span class="s2">"clock_speed"</span><span class="p">,</span>
<span class="s2">"has_dual_sim"</span><span class="p">,</span>
<span class="s2">"front_camera_megapixels"</span><span class="p">,</span>
<span class="s2">"has_four_g"</span><span class="p">,</span>
<span class="s2">"internal_memory"</span><span class="p">,</span>
<span class="s2">"depth"</span><span class="p">,</span>
<span class="s2">"weight"</span><span class="p">,</span>
<span class="s2">"number_of_cores"</span><span class="p">,</span>
<span class="s2">"primary_camera_megapixels"</span><span class="p">,</span>
<span class="s2">"pixel_resolution_height"</span><span class="p">,</span>
<span class="s2">"pixel_resolution_width"</span><span class="p">,</span>
<span class="s2">"ram"</span><span class="p">,</span>
<span class="s2">"screen_height"</span><span class="p">,</span>
<span class="s2">"screen_width"</span><span class="p">,</span>
<span class="s2">"talk_time"</span><span class="p">,</span>
<span class="s2">"has_three_g"</span><span class="p">,</span>
<span class="s2">"has_touch_screen"</span><span class="p">,</span>
<span class="s2">"has_wifi"</span>
<span class="p">]</span>
<span class="n">target_column</span> <span class="o">=</span> <span class="s2">"price_range"</span>
</code></pre></div>
<p>Next, we'll split the dataset into training, validation, and test sets
and then create dataframes for the predictor and target variables:</p>
<div class="highlight"><pre><span></span><code><span class="n">train</span><span class="p">,</span> <span class="n">validate</span><span class="p">,</span> <span class="n">test</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">data</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="p">[</span><span class="nb">int</span><span class="p">(</span><span class="mf">0.6</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)),</span> <span class="nb">int</span><span class="p">(</span><span class="mf">0.8</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">))])</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="n">feature_columns</span><span class="p">]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="n">target_column</span><span class="p">]</span>
<span class="n">X_validate</span> <span class="o">=</span> <span class="n">validate</span><span class="p">[</span><span class="n">feature_columns</span><span class="p">]</span>
<span class="n">y_validate</span> <span class="o">=</span> <span class="n">validate</span><span class="p">[</span><span class="n">target_column</span><span class="p">]</span>
</code></pre></div>
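<p>The <code>np.split</code> call above splits at the 60% and 80% marks of the shuffled data, which yields a 60/20/20 train/validate/test split. A minimal sketch (using a stand-in array rather than the real dataset) shows the proportions:</p>

```python
import numpy as np

# Stand-in for the shuffled dataset; only the length matters here.
data = np.arange(100)

# Splitting at the 60% and 80% indices produces three pieces:
# [0, 60), [60, 80), and [80, 100).
train, validate, test = np.split(data, [int(0.6 * len(data)), int(0.8 * len(data))])

print(len(train), len(validate), len(test))  # → 60 20 20
```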
<p>We'll need the transformer we created earlier, so we'll load it from
disk:</p>
<div class="highlight"><pre><span></span><code><span class="n">transformer</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"column_transformer.joblib"</span><span class="p">)</span>
</code></pre></div>
<p>Next, we'll create an XGBClassifier model:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">XGBClassifier</span><span class="p">()</span>
</code></pre></div>
<p>And combine it with the transformer to create a single pipeline:</p>
<div class="highlight"><pre><span></span><code><span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">steps</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s2">"preprocessor"</span><span class="p">,</span> <span class="n">transformer</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"model"</span><span class="p">,</span> <span class="n">model</span><span class="p">)</span>
<span class="p">])</span>
</code></pre></div>
<p>Next, we'll fit the pipeline to the training set:</p>
<div class="highlight"><pre><span></span><code><span class="n">pipeline</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div>
<p>Now we can make a single prediction to check that everything is working. Note that we call <code>predict</code> on the pipeline rather than on the bare model, so that the preprocessing step runs before the classifier sees the data:</p>
<div class="highlight"><pre><span></span><code><span class="n">result</span> <span class="o">=</span> <span class="n">pipeline</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_validate</span><span class="o">.</span><span class="n">iloc</span><span class="p">[[</span><span class="mi">0</span><span class="p">]])</span>
<span class="nb">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
<span class="n">array</span><span class="p">([</span><span class="mi">3</span><span class="p">])</span>
</code></pre></div>
<p>However, this is not the final model yet; we'll tune the hyperparameters
using the <a href="https://hyperopt.github.io/hyperopt/">hyperopt package</a>. The hyperparameter
space is defined like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">space</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"max_depth"</span><span class="p">:</span> <span class="n">hp</span><span class="o">.</span><span class="n">quniform</span><span class="p">(</span><span class="s2">"max_depth"</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">18</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="s2">"gamma"</span><span class="p">:</span> <span class="n">hp</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="s2">"gamma"</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">9</span><span class="p">),</span>
<span class="s2">"reg_alpha"</span><span class="p">:</span> <span class="n">hp</span><span class="o">.</span><span class="n">quniform</span><span class="p">(</span><span class="s2">"reg_alpha"</span><span class="p">,</span> <span class="mi">40</span><span class="p">,</span> <span class="mi">180</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="s2">"reg_lambda"</span><span class="p">:</span> <span class="n">hp</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="s2">"reg_lambda"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="s2">"colsample_bytree"</span><span class="p">:</span> <span class="n">hp</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="s2">"colsample_bytree"</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="s2">"min_child_weight"</span><span class="p">:</span> <span class="n">hp</span><span class="o">.</span><span class="n">quniform</span><span class="p">(</span><span class="s2">"min_child_weight"</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="s2">"n_estimators"</span><span class="p">:</span> <span class="mi">180</span><span class="p">,</span>
<span class="s2">"seed"</span><span class="p">:</span> <span class="mi">0</span>
<span class="p">}</span>
</code></pre></div>
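<p>One detail worth noting: <code>hp.quniform</code> draws uniformly from the range and rounds the result to a multiple of <code>q</code>, but the sampled value is still a float. A rough pure-Python sketch of that behavior (hyperopt itself computes this with numpy; the helper below is hypothetical, for illustration only):</p>

```python
import random

rng = random.Random(0)

def quniform_sample(low, high, q):
    # Mimics hp.quniform: round(uniform(low, high) / q) * q.
    # hyperopt computes this with numpy, so the result is a float
    # even when q == 1 -- which is why integer-valued parameters
    # like max_depth must be cast to int before use.
    return float(round(rng.uniform(low, high) / q) * q)

value = quniform_sample(3, 18, 1)
print(type(value).__name__, value)
```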
<p>And the objective function looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">objective</span><span class="p">(</span><span class="n">space</span><span class="p">):</span>
<span class="n">classifier</span> <span class="o">=</span> <span class="n">XGBClassifier</span><span class="p">(</span>
<span class="n">n_estimators</span><span class="o">=</span><span class="n">space</span><span class="p">[</span><span class="s2">"n_estimators"</span><span class="p">],</span>
<span class="n">max_depth</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">space</span><span class="p">[</span><span class="s2">"max_depth"</span><span class="p">]),</span>
<span class="n">gamma</span><span class="o">=</span><span class="n">space</span><span class="p">[</span><span class="s2">"gamma"</span><span class="p">],</span>
<span class="n">reg_alpha</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">space</span><span class="p">[</span><span class="s2">"reg_alpha"</span><span class="p">]),</span>
<span class="n">min_child_weight</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">space</span><span class="p">[</span><span class="s2">"min_child_weight"</span><span class="p">]),</span>
<span class="n">colsample_bytree</span><span class="o">=</span><span class="n">space</span><span class="p">[</span><span class="s2">"colsample_bytree"</span><span class="p">]</span>  <span class="c1"># a float in [0.5, 1]; must not be cast to int</span>
<span class="p">)</span>
<span class="n">evaluation</span> <span class="o">=</span> <span class="p">[(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">X_validate</span><span class="p">,</span> <span class="n">y_validate</span><span class="p">)]</span>
<span class="n">classifier</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">eval_set</span><span class="o">=</span><span class="n">evaluation</span><span class="p">,</span> <span class="n">eval_metric</span><span class="o">=</span><span class="s2">"merror"</span><span class="p">,</span> <span class="n">early_stopping_rounds</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">classifier</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_validate</span><span class="p">)</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_validate</span><span class="p">,</span> <span class="n">predictions</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"SCORE: "</span><span class="p">,</span> <span class="n">accuracy</span><span class="p">)</span>
<span class="k">return</span> <span class="p">{</span>
<span class="s2">"loss"</span><span class="p">:</span> <span class="o">-</span><span class="n">accuracy</span><span class="p">,</span>
<span class="s2">"status"</span><span class="p">:</span> <span class="n">STATUS_OK</span>
<span class="p">}</span>
</code></pre></div>
<p>We'll run the hyperparameter search like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">trials</span> <span class="o">=</span> <span class="n">Trials</span><span class="p">()</span>
<span class="n">best_hyperparameters</span> <span class="o">=</span> <span class="n">fmin</span><span class="p">(</span><span class="n">fn</span> <span class="o">=</span> <span class="n">objective</span><span class="p">,</span>
<span class="n">space</span> <span class="o">=</span> <span class="n">space</span><span class="p">,</span>
<span class="n">algo</span> <span class="o">=</span> <span class="n">tpe</span><span class="o">.</span><span class="n">suggest</span><span class="p">,</span>
<span class="n">max_evals</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span>
<span class="n">trials</span> <span class="o">=</span> <span class="n">trials</span><span class="p">)</span>
</code></pre></div>
<p>The best hyperparameters found are these:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">colsample_bytree</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">0.7805313948569044</span><span class="p">,</span><span class="w"> </span>
<span class="w"> </span><span class="p">'</span><span class="n">gamma</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">2.8457210780834963</span><span class="p">,</span><span class="w"> </span>
<span class="w"> </span><span class="p">'</span><span class="n">max_depth</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">8.0</span><span class="p">,</span><span class="w"> </span>
<span class="w"> </span><span class="p">'</span><span class="n">min_child_weight</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">8.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="p">'</span><span class="n">reg_alpha</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">86.0</span><span class="p">,</span><span class="w"> </span>
<span class="w"> </span><span class="p">'</span><span class="n">reg_lambda</span><span class="p">'</span><span class="o">:</span><span class="w"> </span><span class="mf">0.23805965814363095</span><span class="w"></span>
<span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>Now that we have found the best hyperparameters, we'll train the final
model. Because the quniform-sampled values come back as floats, we cast the
integer-valued hyperparameters to int before passing them to the
XGBClassifier:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">"max_depth"</span><span class="p">,</span> <span class="s2">"min_child_weight"</span><span class="p">,</span> <span class="s2">"reg_alpha"</span><span class="p">]:</span>
    <span class="n">best_hyperparameters</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">best_hyperparameters</span><span class="p">[</span><span class="n">key</span><span class="p">])</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">XGBClassifier</span><span class="p">(</span><span class="o">**</span><span class="n">best_hyperparameters</span><span class="p">)</span>
<span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">steps</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s2">"preprocessor"</span><span class="p">,</span> <span class="n">transformer</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"model"</span><span class="p">,</span> <span class="n">model</span><span class="p">)])</span>
<span class="n">pipeline</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div>
<p>Lastly, we can save the model object:</p>
<div class="highlight"><pre><span></span><code><span class="n">joblib</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="s2">"model.joblib"</span><span class="p">)</span>
</code></pre></div>
<p>The model training code is in <a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/training/model_training.ipynb">this
notebook</a>.</p>
<h3>Validating the Model</h3>
<p>To validate the model, we'll use the <a href="https://www.scikit-yb.org/en/latest/">yellowbrick
package</a>. First, we'll load
the fitted model object that was saved in a previous step:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"model.joblib"</span><span class="p">)</span>
</code></pre></div>
<p>The yellowbrick package can create a classification report like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">yellowbrick.classifier</span> <span class="kn">import</span> <span class="n">ClassificationReport</span>
<span class="n">visualizer</span> <span class="o">=</span> <span class="n">ClassificationReport</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">classes</span><span class="o">=</span><span class="n">classes</span><span class="p">,</span> <span class="n">support</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p>The resulting graph looks like this:</p>
<p><img alt="Classification Report" src="https://www.tekhnoal.com/classification_report.png" width="100%"></p>
<p>The classification report visualizer displays the precision, recall, F1,
and support scores for the model for each class in the target variable.</p>
<p>A confusion matrix is created like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">yellowbrick.classifier</span> <span class="kn">import</span> <span class="n">ConfusionMatrix</span>
<span class="n">visualizer</span> <span class="o">=</span> <span class="n">ConfusionMatrix</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">classes</span><span class="o">=</span><span class="n">classes</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Confusion Matrix" src="https://www.tekhnoal.com/confusion_matrix.png" width="100%"></p>
<p>The ROC/AUC plot is created like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">yellowbrick.classifier</span> <span class="kn">import</span> <span class="n">ROCAUC</span>
<span class="n">visualizer</span> <span class="o">=</span> <span class="n">ROCAUC</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">classes</span><span class="o">=</span><span class="n">classes</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img alt="ROCAUC" src="https://www.tekhnoal.com/roc_auc.png" width="100%"></p>
<p>The class prediction error plot is done like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">yellowbrick.classifier</span> <span class="kn">import</span> <span class="n">ClassPredictionError</span>
<span class="n">visualizer</span> <span class="o">=</span> <span class="n">ClassPredictionError</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">classes</span><span class="o">=</span><span class="n">classes</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img alt="Class Prediction Error" src="https://www.tekhnoal.com/class_prediction_error.png" width="100%"></p>
<p>Now that we have a fully trained and validated model and we understand
the underlying data that we used to create the model, we can move
forward with writing the code that we'll use to make predictions with
the model.</p>
<p>The model validation code is in <a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/training/model_validation.ipynb">this
notebook</a>.</p>
<h2>Creating the Model Schemas</h2>
<p>In order to use the model, we'll need to define its input and output
schemas. To do this, we'll use the <a href="https://docs.pydantic.dev/">pydantic
package</a> to
define two classes. The model input class looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MobileHandsetPriceModelInput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="sd">"""Schema for input of the model's predict method."""</span>
<span class="n">battery_power</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"battery_power"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">2000</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Total energy a battery can store in one time measured in mAh."</span><span class="p">)</span>
<span class="n">has_bluetooth</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_bluetooth"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has bluetooth."</span><span class="p">)</span>
<span class="n">clock_speed</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"clock_speed"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">3.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Speed of microprocessor in gHz."</span><span class="p">)</span>
<span class="n">has_dual_sim</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">bool</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_dual_sim"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has dual SIM slots."</span><span class="p">)</span>
<span class="n">front_camera_megapixels</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"front_camera_megapixels"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Front camera mega pixels."</span><span class="p">)</span>
<span class="n">has_four_g</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_four_g"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has 4G."</span><span class="p">)</span>
<span class="n">internal_memory</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"internal_memory"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">664</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Internal memory in gigabytes."</span><span class="p">)</span>
<span class="n">depth</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"depth"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Depth of mobile phone in cm."</span><span class="p">)</span>
<span class="n">weight</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"weight"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">80</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Weight of mobile phone."</span><span class="p">)</span>
<span class="n">number_of_cores</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"number_of_cores"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Number of cores of processor."</span><span class="p">)</span>
<span class="n">primary_camera_megapixels</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"primary_camera_megapixels"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Primary camera mega pixels."</span><span class="p">)</span>
<span class="n">pixel_resolution_height</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"pixel_resolution_height"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">1960</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Pixel resolution height."</span><span class="p">)</span>
<span class="n">pixel_resolution_width</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"pixel_resolution_width"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">1998</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Pixel resolution width."</span><span class="p">)</span>
<span class="n">ram</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"ram"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">3998</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Random access memory in megabytes."</span><span class="p">)</span>
<span class="n">screen_height</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"screen_height"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">19</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Screen height of mobile in cm."</span><span class="p">)</span>
<span class="n">screen_width</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"screen_width"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">18</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Screen width of mobile in cm."</span><span class="p">)</span>
<span class="n">talk_time</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"talk_time"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Longest time that a single battery charge will last when on phone call."</span><span class="p">)</span>
<span class="n">has_three_g</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_three_g"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has 3G or not."</span><span class="p">)</span>
<span class="n">has_touch_screen</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_touch_screen"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has a touchscreen or not."</span><span class="p">)</span>
<span class="n">has_wifi</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_wifi"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has wifi or not."</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/prediction/schemas.py#L6-L41">here</a>.</p>
<p>The input schema of the model defines what is acceptable input for the
model and also provides a user-friendly interface to the code that is
calling the model.</p>
<p>To make the model's input easier to understand, we've replaced the
binary categorical input variables with booleans that can have values
of "True" or "False". For example, the model expects the has_bluetooth
variable to contain either a "0" or a "1". Instead of forcing the user
to understand the semantics of these values in order to provide input
to the model, we convert "True" to "1" and "False" to "0" before
passing the input to the model.</p>
<p>Another example of user-friendliness is the addition of "greater
than or equal" and "less than or equal" limits to the numerical
variables. These limits are enforced by pydantic when the class is
instantiated, and they clearly communicate which values the model
accepts for the numerical variables. The bounds match the contents of
the training set; for example, "battery_power" has a lower bound of
500 and an upper bound of 2000, which are the minimum and maximum
values found in the training data for this variable.</p>
<p>The pydantic package allows us to add descriptions to each field that
help the user to understand the fields that the model expects. The
pydantic package also supports the generation of JSON schema documents
from a schema class. The JSON schema of the input class looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nt">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"MobileHandsetPriceModelInput"</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Schema for input of the model's predict method."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"object"</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"properties"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nt">"battery_power"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nt">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"battery_power"</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Total energy a battery can store in one time measured in mAh."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">500</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"maximum"</span><span class="p">:</span><span class="w"> </span><span class="mi">2000</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integer"</span><span class="w"></span>
<span class="w"> </span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="nt">"has_bluetooth"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nt">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"has_bluetooth"</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Whether the phone has bluetooth."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"boolean"</span><span class="w"></span>
<span class="w"> </span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="nt">"clock_speed"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nt">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"clock_speed"</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Speed of microprocessor in gHz."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.5</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"maximum"</span><span class="p">:</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"number"</span><span class="w"></span>
<span class="w"> </span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="nt">"has_dual_sim"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nt">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"has_dual_sim"</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Whether the phone has dual SIM slots."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"boolean"</span><span class="w"></span>
<span class="w"> </span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="nt">"front_camera_megapixels"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w"> </span><span class="nt">"title"</span><span class="p">:</span><span class="w"> </span><span class="s2">"front_camera_megapixels"</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Front camera mega pixels."</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"minimum"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"maximum"</span><span class="p">:</span><span class="w"> </span><span class="mi">20</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"integer"</span><span class="w"></span>
<span class="w"> </span><span class="p">}</span><span class="w"></span>
<span class="err">...</span><span class="w"></span>
</code></pre></div>
<p>The model also requires a schema for its output. Before we can define
it, we need to define the allowed values. To do that we'll use an Enum
class:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">PriceEnum</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="n">zero</span> <span class="o">=</span> <span class="s2">"zero"</span>
<span class="n">one</span> <span class="o">=</span> <span class="s2">"one"</span>
<span class="n">two</span> <span class="o">=</span> <span class="s2">"two"</span>
<span class="n">three</span> <span class="o">=</span> <span class="s2">"three"</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/prediction/schemas.py#L44-L48">here</a>.</p>
<p>The four allowed values match the output of the model. We defined this
as an enumeration because this is a classification model, even though
the outputs look like numbers.</p>
<p>Now we can define the output schema class:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MobileHandsetPriceModelOutput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">price_range</span><span class="p">:</span> <span class="n">PriceEnum</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"Price Range"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Price range class."</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/prediction/schemas.py#L51-L53">here</a>.</p>
<p>The "price_range" variable uses the PriceEnum enumeration to define what
the allowed values are.</p>
<h2>Creating the Model Class</h2>
<p>Now that we have the model's input and output schemas defined, we can
move on to creating a class that wraps the model and makes predictions.
This class makes the model much easier to use because it abstracts away
the model's low-level details.</p>
<p>To start, we'll define the class and add all of the required
properties:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MobileHandsetPriceModel</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">display_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"Mobile Handset Price Model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">qualified_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"mobile_handset_price_model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">description</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"Model to predict the price of a mobile phone."</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">version</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="n">__version__</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="n">MobileHandsetPriceModelInput</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="n">MobileHandsetPriceModelOutput</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/prediction/model.py#L19-L50">here</a>.</p>
<p>The properties of the class return metadata about the model. The
input and output schema classes are returned from the input_schema and
output_schema properties, allowing callers to introspect the model's
schemas.</p>
<p>The __init__ method of the class looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">realpath</span><span class="p">(</span><span class="vm">__file__</span><span class="p">)))</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">dir_path</span><span class="p">,</span> <span class="s2">"model_files"</span><span class="p">,</span> <span class="s2">"1"</span><span class="p">,</span> <span class="s2">"model.joblib"</span><span class="p">),</span> <span class="s1">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/prediction/model.py#L52-L61">here</a>.</p>
<p>The __init__ method initializes the model; after it completes, the
model object is ready to make predictions.</p>
<p>The predict() method is the last method we need to define:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">MobileHandsetPriceModelInput</span><span class="p">)</span> <span class="o">-></span> <span class="n">MobileHandsetPriceModelOutput</span><span class="p">:</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([[</span><span class="n">data</span><span class="o">.</span><span class="n">battery_power</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">has_bluetooth</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">clock_speed</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">has_dual_sim</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">front_camera_megapixels</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">has_four_g</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">internal_memory</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">depth</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">weight</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">number_of_cores</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">primary_camera_megapixels</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">pixel_resolution_height</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">pixel_resolution_width</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">ram</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">screen_height</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">screen_width</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">talk_time</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">has_three_g</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">has_touch_screen</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">has_wifi</span><span class="p">]],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">"battery_power"</span><span class="p">,</span> <span class="s2">"has_bluetooth"</span><span class="p">,</span> <span class="s2">"clock_speed"</span><span class="p">,</span>
<span class="s2">"has_dual_sim"</span><span class="p">,</span> <span class="s2">"front_camera_megapixels"</span><span class="p">,</span> <span class="s2">"has_four_g"</span><span class="p">,</span>
<span class="s2">"internal_memory"</span><span class="p">,</span> <span class="s2">"depth"</span><span class="p">,</span> <span class="s2">"weight"</span><span class="p">,</span> <span class="s2">"number_of_cores"</span><span class="p">,</span>
<span class="s2">"primary_camera_megapixels"</span><span class="p">,</span> <span class="s2">"pixel_resolution_height"</span><span class="p">,</span>
<span class="s2">"pixel_resolution_width"</span><span class="p">,</span> <span class="s2">"ram"</span><span class="p">,</span> <span class="s2">"screen_height"</span><span class="p">,</span>
<span class="s2">"screen_width"</span><span class="p">,</span> <span class="s2">"talk_time"</span><span class="p">,</span> <span class="s2">"has_three_g"</span><span class="p">,</span>
<span class="s2">"has_touch_screen"</span><span class="p">,</span> <span class="s2">"has_wifi"</span><span class="p">])</span>
<span class="c1"># making the prediction and extracting the result from the array</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="n">output_class_map</span><span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">])]</span>
<span class="k">return</span> <span class="n">MobileHandsetPriceModelOutput</span><span class="p">(</span><span class="n">price_range</span><span class="o">=</span><span class="n">y_hat</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/mobile_handset_price_model/prediction/model.py#L63-L85">here</a>.</p>
<p>This method accepts a pydantic object that matches the model's input
schema and returns a pydantic object that matches the model's output
schema.</p>
<h1>Adding the Property-Based Tests</h1>
<p>The model class is now ready for property-based testing. To write the
tests we'll use the hypothesis package, which we can install with this
command:</p>
<div class="highlight"><pre><span></span><code>pip install hypothesis
</code></pre></div>
<p>To launch a set of hypothesis tests, we'll write a simple test class:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">ModelPropertyBasedTests</span><span class="p">(</span><span class="n">TestCase</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">setUp</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">counter</span> <span class="o">=</span> <span class="mi">0</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">MobileHandsetPriceModel</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">tearDown</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Generated and tested </span><span class="si">{}</span><span class="s2"> examples."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">counter</span><span class="p">))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/tests/property_based_tests.py#L10-L17">here</a>.</p>
<p>The test class defines a setUp() method which initializes a counter
to 0 and instantiates the model object. The setUp() method is executed
before every test case, so by loading the model object here rather than
inside the test method, we avoid re-instantiating it for each of the
many examples hypothesis generates. The tearDown() method is executed
after each test case; we'll use it to print out how many examples we
tested.</p>
<div class="highlight"><pre><span></span><code><span class="nd">@settings</span><span class="p">(</span><span class="n">deadline</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">max_examples</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
<span class="nd">@given</span><span class="p">(</span><span class="n">strategies</span><span class="o">.</span><span class="n">builds</span><span class="p">(</span><span class="n">MobileHandsetPriceModelInput</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">test_model_input</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="c1"># act</span>
<span class="n">result</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># assert</span>
<span class="bp">self</span><span class="o">.</span><span class="n">assertTrue</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">result</span><span class="p">)</span> <span class="ow">is</span> <span class="n">MobileHandsetPriceModelOutput</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">assertTrue</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">price_range</span><span class="p">)</span> <span class="ow">is</span> <span class="n">PriceEnum</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">counter</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/tests/property_based_tests.py#L19-L29">here</a>.</p>
<p>The test_model_input test case is decorated with two decorators that
turn it into a hypothesis test. The @settings decorator tells the
hypothesis package that there is no deadline for completion of the test
case and that we would like to test with 1000 samples. The @given
decorator tells the hypothesis package to build samples for testing
from the MobileHandsetPriceModelInput schema. The hypothesis package
then generates 1000 samples from the schema class and calls the
test_model_input method 1000 times with the generated samples.</p>
<p>The test method itself is very simple: it makes a prediction with the
sample generated by hypothesis and asserts that the result is of the
right type. If any exception is raised during the execution of the
predict method, the test fails. The counter we initialized is
incremented every time a test case executes.</p>
<p>To execute the hypothesis tests, we'll use the pytest command:</p>
<div class="highlight"><pre><span></span><code>py.test ./tests/property_based_tests.py --hypothesis-show-statistics
</code></pre></div>
<p>The output of the command tells us a bit about the test:</p>
<div class="highlight"><pre><span></span><code><span class="o">==========================</span> Hypothesis <span class="nv">Statistics</span><span class="o">=============================</span>
tests/property_based_tests.py::ModelPropertyBasedTests::test_model_input:
- during reuse phase <span class="o">(</span><span class="m">0</span>.00 seconds<span class="o">)</span>:
- Typical runtimes: < 1ms, ~ <span class="m">86</span>% <span class="k">in</span> data generation
- <span class="m">0</span> passing examples, <span class="m">0</span> failing examples, <span class="m">1</span> invalid examples
- during generate phase <span class="o">(</span><span class="m">0</span>.36 seconds<span class="o">)</span>:
- Typical runtimes: <span class="m">6</span>-293 ms, ~ <span class="m">7</span>% <span class="k">in</span> data generation
- <span class="m">2</span> passing examples, <span class="m">7</span> failing examples, <span class="m">0</span> invalid examples
- Found <span class="m">1</span> failing example <span class="k">in</span> this phase
- during shrink phase <span class="o">(</span><span class="m">0</span>.10 seconds<span class="o">)</span>:
- Typical runtimes: <span class="m">0</span>-7 ms, ~ <span class="m">68</span>% <span class="k">in</span> data generation
- <span class="m">2</span> passing examples, <span class="m">6</span> failing examples, <span class="m">22</span> invalid examples
- Tried <span class="m">30</span> shrinks of which <span class="m">8</span> were successful
- Stopped because nothing left to <span class="k">do</span>
<span class="o">===========================</span> short <span class="nb">test</span> summary <span class="nv">info</span><span class="o">===========================</span>
FAILED
tests/property_based_tests.py::ModelPropertyBasedTests::test_model_input
- ValueError: Value: -1 cannot be mapped to a boolean value.
<span class="o">==============================</span> <span class="m">1</span> failed <span class="k">in</span> <span class="m">1</span>.70s<span class="o">=============================</span>
</code></pre></div>
<p>The test failed almost immediately. The error raised is:
"ValueError: Value: -1 cannot be mapped to a boolean value." in the
mobile_handset_price_model/prediction/transformers.py file. This error
is easy to debug because we introduced the problem ourselves in the
first place! The problem lies in the input schema of the model, where
the field called "has_bluetooth" is defined like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">has_bluetooth</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_bluetooth"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has bluetooth."</span><span class="p">)</span>
</code></pre></div>
<p>The problem is that the hypothesis package generated the value -1 for
the "has_bluetooth" field because the field's type is "int", and the
model could not process that value. This happened because we matched
the type of the field as it appears in the dataset instead of the
boolean type the field semantically represents. We can fix it easily by
defining the field like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">has_bluetooth</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_bluetooth"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has bluetooth."</span><span class="p">)</span>
</code></pre></div>
<p>Now we can try to run the tests again. The results came back like this:</p>
<div class="highlight"><pre><span></span><code><span class="o">=========================</span> Hypothesis <span class="nv">Statistics</span><span class="o">===========================</span>
tests/property_based_tests.py::ModelPropertyBasedTests::test_model_input:
- during reuse phase <span class="o">(</span><span class="m">0</span>.29 seconds<span class="o">)</span>:
- Typical runtimes: ~ 287ms, ~ <span class="m">0</span>% <span class="k">in</span> data generation
- <span class="m">0</span> passing examples, <span class="m">1</span> failing examples, <span class="m">0</span> invalid examples
- Found <span class="m">1</span> failing example <span class="k">in</span> this phase
- during shrink phase <span class="o">(</span><span class="m">0</span>.01 seconds<span class="o">)</span>:
- Typical runtimes: ~ 6ms, ~ <span class="m">8</span>% <span class="k">in</span> data generation
- <span class="m">0</span> passing examples, <span class="m">1</span> failing examples, <span class="m">0</span> invalid examples
- Tried <span class="m">1</span> shrinks of which <span class="m">0</span> were successful
- Stopped because nothing left to <span class="k">do</span>
<span class="o">=======================</span> short <span class="nb">test</span> summary <span class="nv">info</span><span class="o">==============================</span>
FAILED
tests/property_based_tests.py::ModelPropertyBasedTests::test_model_input
- ValueError: Value: None cannot be mapped to a boolean value.
<span class="o">==========================</span> <span class="m">1</span> failed <span class="k">in</span> <span class="m">1</span>.56s<span class="o">==================================</span>
</code></pre></div>
<p>The hypothesis test failed again, this time during the reuse phase,
in which hypothesis replays previously saved failing examples. The
error raised is: "ValueError: Value: None cannot be mapped to a
boolean value." in the
mobile_handset_price_model/prediction/transformers.py file. The problem
again lies in the input schema of the model, in the field called
"has_dual_sim", which is defined like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">has_dual_sim</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">bool</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_dual_sim"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has dual SIM slots."</span><span class="p">)</span>
</code></pre></div>
<p>The problem is that the model cannot impute a value for the
boolean inputs in the same way that it can for the numerical inputs.
This kind of error arises when we forget which fields the model is able
to impute and mark a field that must be provided as optional. We'll fix
the issue by making the "has_dual_sim" input field a required field:</p>
<div class="highlight"><pre><span></span><code><span class="n">has_dual_sim</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="o">...</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"has_dual_sim"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether the phone has dual SIM slots."</span><span class="p">)</span>
</code></pre></div>
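<p>To make the distinction concrete, here is a minimal sketch of why a numerical input can be imputed while a boolean input cannot. The function names are hypothetical and are not taken from the actual transformer code:</p>

```python
def impute_numeric(value, default=1500.0):
    # A missing numerical value can fall back to a statistic learned
    # from the training set (here, a made-up default value).
    return default if value is None else float(value)


def to_bool(value):
    # A missing boolean has no sensible middle ground to impute,
    # so the transformer can only reject it.
    if value is None:
        raise ValueError("Value: None cannot be mapped to a boolean value.")
    return bool(value)
```

<p>Any schema that marks a boolean field as Optional is therefore guaranteed to fail as soon as hypothesis generates a None for it.</p>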
<p>We ran the tests one last time and got back this result:</p>
<div class="highlight"><pre><span></span><code><span class="o">==============================</span> Hypothesis <span class="nv">Statistics</span> <span class="o">==========================</span>
tests/property_based_tests.py::ModelPropertyBasedTests::test_model_input:
- during generate phase <span class="o">(</span><span class="m">0</span>.52 seconds<span class="o">)</span>:
- Typical runtimes: <span class="m">6</span>-8 ms, ~ <span class="m">9</span>% <span class="k">in</span> data generation
- <span class="m">64</span> passing examples, <span class="m">0</span> failing examples, <span class="m">0</span> invalid examples
- Stopped because nothing left to <span class="k">do</span>
<span class="o">===============================</span> <span class="m">1</span> passed <span class="k">in</span> <span class="m">1</span>.41s <span class="o">==========================</span>
</code></pre></div>
<p>None of the samples generated by the hypothesis package were able to
raise an exception in the model's prediction class.</p>
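<p>The property these tests enforce is simple to state: for every input that the schema accepts, the model's prediction code must not raise an exception. Stripped of the hypothesis machinery, the check amounts to something like this sketch, which uses a toy predict function rather than the real model:</p>

```python
def check_no_exceptions(predict, samples):
    # Collect the samples (and their errors) for which prediction raised.
    failures = []
    for sample in samples:
        try:
            predict(sample)
        except Exception as error:
            failures.append((sample, error))
    return failures


def toy_predict(sample):
    # Stand-in for the model: fails exactly when a boolean field is missing.
    if sample.get("has_dual_sim") is None:
        raise ValueError("Value: None cannot be mapped to a boolean value.")
    return "three"


failures = check_no_exceptions(
    toy_predict, [{"has_dual_sim": True}, {"has_dual_sim": None}])
```

<p>Hypothesis adds the hard parts on top of this loop: generating samples from the schema and shrinking any failing sample down to a minimal counterexample.</p>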
<h1>Creating a RESTful Model Service</h1>
<p>Creating a RESTful model service is very simple because we'll be
leveraging the <a href="https://pypi.org/project/rest-model-service/">rest_model_service
package</a>. The
package works through a configuration file that points at the model
classes of the ML model that we would like to host in the service. If
you'd like to learn more about the rest_model_service package, here is
a <a href="https://www.tekhnoal.com/rest-model-service.html">blog post</a>
about it.</p>
<p>To install the package, execute this command:</p>
<div class="highlight"><pre><span></span><code>pip install rest_model_service
</code></pre></div>
<p>To create a service for our model, all we need to do is add a
YAML configuration file to the project. The configuration file looks
like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Mobile Handset Price Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mobile_handset_price_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mobile_handset_price_model.prediction.model.MobileHandsetPriceModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
</code></pre></div>
<p>The configuration file can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/rest_config.yaml">here</a>.</p>
<p>The configuration file sets up the service_title, which is the title
that will be shown in the documentation of the service. The models array
allows us to host any number of models within the service. The only
model we'll host today is the mobile_handset_price_model; its class_path
points at the location of the model class in the Python environment. The
create_endpoint setting is set to true, which means that the service will
create an endpoint for the model.</p>
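<p>The class_path setting implies that the service loads the model class dynamically from its dotted path. As an illustration of how such a lookup can work (this is a sketch, not the actual rest_model_service code), the standard library's importlib is enough:</p>

```python
import importlib


def load_class(class_path):
    # Split "package.module.ClassName" into a module path and a class
    # name, import the module, and return the class object.
    module_path, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Any importable class can be resolved this way, for example:
ordered_dict_class = load_class("collections.OrderedDict")
```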
<p>Now that we have the configuration set up, we can automatically generate
an OpenAPI specification file for the service, with these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
generate_openapi --output_file<span class="o">=</span>service_contract.yaml
</code></pre></div>
<p>The OpenAPI spec file can be found
<a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/service_contract.yaml">here</a>.
We can render the documentation using the <a href="https://editor.swagger.io/">Swagger
Editor</a>, which looks like this:</p>
<p><img alt="Swagger Editor" src="https://www.tekhnoal.com/swagger_editor.png" width="100%"></p>
<p>The service contract is set up, so now we can run the service locally,
with these commands:</p>
<div class="highlight"><pre><span></span><code>uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>The service should come up and can be accessed in a web browser at
http://127.0.0.1:8000. When you access that URL you will be redirected
to the documentation page that is generated by the FastAPI package:</p>
<p><img alt="API Documentation" src="https://www.tekhnoal.com/api_documentation.png" width="100%"></p>
<p>The service is running locally, so now we can try out a request against
the model's endpoint:</p>
<div class="highlight"><pre><span></span><code>curl -X <span class="s1">'POST'</span>
<span class="s1">'http://127.0.0.1:8000/api/models/mobile_handset_price_model/prediction'</span>
-H <span class="s1">'accept: application/json'</span>
-H <span class="s1">'Content-Type: application/json'</span>
-d <span class="s1">'{</span>
<span class="s1">"battery_power": 2000,</span>
<span class="s1">"has_bluetooth": true,</span>
<span class="s1">"clock_speed": 3,</span>
<span class="s1">"has_dual_sim": true,</span>
<span class="s1">"front_camera_megapixels": 20,</span>
<span class="s1">"has_four_g": true,</span>
<span class="s1">"internal_memory": 664,</span>
<span class="s1">"depth": 1,</span>
<span class="s1">"weight": 200,</span>
<span class="s1">"number_of_cores": 8,</span>
<span class="s1">"primary_camera_megapixels": 20,</span>
<span class="s1">"pixel_resolution_height": 1960,</span>
<span class="s1">"pixel_resolution_width": 1998,</span>
<span class="s1">"ram": 3998,</span>
<span class="s1">"screen_height": 19,</span>
<span class="s1">"screen_width": 18,</span>
<span class="s1">"talk_time": 20,</span>
<span class="s1">"has_three_g": true,</span>
<span class="s1">"has_touch_screen": true,</span>
<span class="s1">"has_wifi": true</span>
<span class="s1">}'</span>
</code></pre></div>
<p>The service responds with this result:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="nt">"price_range"</span><span class="p">:</span><span class="s2">"three"</span><span class="p">}</span><span class="w"></span>
</code></pre></div>
<p>By using the rest_model_service package we've just set up a RESTful API
service that is hosting our model. We can now move on to do
property-based testing on the model through the service.</p>
<h1>Adding Property-Based API Tests</h1>
<p>The <a href="https://schemathesis.readthedocs.io/en/stable/#">schemathesis
package</a> allows
us to use the hypothesis package against REST API services, doing all of
the things that the hypothesis package can do. The schemathesis package
uses the OpenAPI specification of the service to introspect the service
contract and generate test cases.</p>
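<p>The core idea can be illustrated with a toy generator that reads a tiny subset of JSON Schema and produces conforming samples. This is only a simplified sketch of the concept, not schemathesis's actual implementation:</p>

```python
import random


def sample_from_schema(schema, rng=None):
    # Generate a random value conforming to a small subset of JSON Schema.
    rng = rng or random.Random()
    schema_type = schema.get("type")
    if schema_type == "object":
        return {name: sample_from_schema(prop, rng)
                for name, prop in schema.get("properties", {}).items()}
    if schema_type == "integer":
        return rng.randint(schema.get("minimum", 0), schema.get("maximum", 100))
    if schema_type == "boolean":
        return rng.choice([True, False])
    raise ValueError(f"unsupported type: {schema_type}")


# A fragment resembling the model's input schema:
input_schema = {
    "type": "object",
    "properties": {
        "battery_power": {"type": "integer", "minimum": 500, "maximum": 2000},
        "has_dual_sim": {"type": "boolean"},
    },
}
sample = sample_from_schema(input_schema)
```

<p>Each generated sample can then be sent to the prediction endpoint; schemathesis adds negative data generation, shrinking, and response validation on top of this basic idea.</p>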
<p>There are two ways for schemathesis to execute the tests: by sending
requests to the service as it runs in its own process or by sending
request objects to the ASGI application object as it lives in the memory
of a process. The second way is very fast because it does not require
that we send requests over the network, so we'll execute the tests that
way.</p>
<p>To begin, we'll import the ASGI application object from the
rest_model_service package:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">rest_model_service.main</span> <span class="kn">import</span> <span class="n">app</span>
</code></pre></div>
<p>Next, we'll ask schemathesis to extract the schema from the
application object:</p>
<div class="highlight"><pre><span></span><code><span class="n">schema</span> <span class="o">=</span> <span class="n">schemathesis</span><span class="o">.</span><span class="n">from_asgi</span><span class="p">(</span><span class="s2">"/openapi.json"</span><span class="p">,</span> <span class="n">app</span><span class="p">,</span> <span class="n">data_generation_methods</span><span class="o">=</span><span class="p">[</span><span class="n">DataGenerationMethod</span><span class="o">.</span><span class="n">negative</span><span class="p">])</span>
</code></pre></div>
<p>Next, we'll generate two strategies from the schema, one strategy per
endpoint defined in the application:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_metadata_strategy</span> <span class="o">=</span> <span class="n">schema</span><span class="p">[</span><span class="s2">"/api/models"</span><span class="p">][</span><span class="s2">"GET"</span><span class="p">]</span><span class="o">.</span><span class="n">as_strategy</span><span class="p">()</span>
<span class="n">model_prediction_strategy</span> <span class="o">=</span> <span class="n">schema</span><span class="p">[</span><span class="s2">"/api/models/mobile_handset_price_model/prediction"</span><span class="p">][</span><span class="s2">"POST"</span><span class="p">]</span><span class="o">.</span><span class="n">as_strategy</span><span class="p">()</span>
</code></pre></div>
<p>Now we're ready to start writing the test class:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">APITests</span><span class="p">(</span><span class="n">TestCase</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">setUp</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">counter</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">def</span> <span class="nf">tearDown</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Generated and tested </span><span class="si">{}</span><span class="s2"> examples."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">counter</span><span class="p">))</span>
</code></pre></div>
<p>The test class keeps track of the number of test cases executed through
a counter that is initialized in the setUp method and printed in the
tearDown method.</p>
<p>The test case for the metadata endpoint is very simple:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@given</span><span class="p">(</span><span class="n">case</span><span class="o">=</span><span class="n">model_metadata_strategy</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">test_model_metadata_endpoint</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">case</span><span class="p">):</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">case</span><span class="o">.</span><span class="n">call_asgi</span><span class="p">()</span>
<span class="k">case</span><span class="o">.</span><span class="n">validate_response</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">counter</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div>
<p>The @given decorator uses the model_metadata_strategy to generate test
cases for the endpoint. This is a very simple endpoint that does not
accept input and provides a static output that contains metadata about
the model being hosted in the service.</p>
<p>The next test is much more interesting:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@given</span><span class="p">(</span><span class="n">case</span><span class="o">=</span><span class="n">model_prediction_strategy</span><span class="p">)</span>
<span class="nd">@settings</span><span class="p">(</span><span class="n">max_examples</span><span class="o">=</span><span class="mi">1000</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">test_model_prediction_endpoint</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">case</span><span class="p">):</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">case</span><span class="o">.</span><span class="n">call_asgi</span><span class="p">()</span>
<span class="k">case</span><span class="o">.</span><span class="n">validate_response</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">counter</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div>
<p>The model_prediction_strategy generates test cases for the model's
prediction endpoint. The @settings decorator raises the maximum number of
generated examples to 1000. The case.validate_response() method looks
for unexpected responses from the service endpoint.</p>
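<p>Conceptually, validating a response means checking it against the documented contract: the status code must be one that the OpenAPI specification declares, and the body must contain the fields the response schema requires. A simplified, hypothetical version of such a check might look like this:</p>

```python
def validate_response(status_code, body, documented):
    # 'documented' is a hypothetical summary of the endpoint contract:
    # the declared status codes and the required response body fields.
    if status_code not in documented["status_codes"]:
        raise AssertionError(f"Undocumented status code: {status_code}")
    for field in documented["required_fields"]:
        if field not in body:
            raise AssertionError(f"Missing response field: {field}")


contract = {"status_codes": {200, 422}, "required_fields": ["price_range"]}
validate_response(200, {"price_range": "three"}, contract)  # passes silently
```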
<p>We executed the API tests with this command:</p>
<div class="highlight"><pre><span></span><code>py.test ./tests/api_tests.py
</code></pre></div>
<p>The command provided this output:</p>
<div class="highlight"><pre><span></span><code><span class="o">==========================</span> <span class="nv">test</span> <span class="nv">session</span> <span class="nv">starts</span> <span class="o">============================</span>
<span class="nv">platform</span> <span class="nv">darwin</span> <span class="o">--</span> <span class="nv">Python</span> <span class="mi">3</span>.<span class="mi">8</span>.<span class="mi">10</span>, <span class="nv">pytest</span><span class="o">-</span><span class="mi">6</span>.<span class="mi">2</span>.<span class="mi">4</span>, <span class="nv">py</span><span class="o">-</span><span class="mi">1</span>.<span class="mi">10</span>.<span class="mi">0</span>,
<span class="nv">pluggy</span><span class="o">-</span><span class="mi">0</span>.<span class="mi">13</span>.<span class="mi">1</span>
<span class="nv">rootdir</span>: <span class="o">/</span><span class="nv">Users</span><span class="o">/</span><span class="nv">brian</span><span class="o">/</span><span class="nv">Code</span><span class="o">/</span><span class="nv">property</span><span class="o">-</span><span class="nv">based</span><span class="o">-</span><span class="nv">testing</span><span class="o">-</span><span class="k">for</span><span class="o">-</span><span class="nv">ml</span><span class="o">-</span><span class="nv">models</span>
<span class="nv">plugins</span>: <span class="nv">pylama</span><span class="o">-</span><span class="mi">7</span>.<span class="mi">7</span>.<span class="mi">1</span>, <span class="nv">hypothesis</span><span class="o">-</span><span class="mi">6</span>.<span class="mi">14</span>.<span class="mi">5</span>, <span class="nv">subtests</span><span class="o">-</span><span class="mi">0</span>.<span class="mi">5</span>.<span class="mi">0</span>,
<span class="nv">schemathesis</span><span class="o">-</span><span class="mi">3</span>.<span class="mi">9</span>.<span class="mi">7</span>, <span class="nv">anyio</span><span class="o">-</span><span class="mi">3</span>.<span class="mi">3</span>.<span class="mi">0</span>, <span class="nv">html</span><span class="o">-</span><span class="mi">3</span>.<span class="mi">1</span>.<span class="mi">1</span>, <span class="nv">metadata</span><span class="o">-</span><span class="mi">1</span>.<span class="mi">11</span>.<span class="mi">0</span>
<span class="nv">collected</span> <span class="mi">2</span> <span class="nv">items</span>
<span class="nv">tests</span><span class="o">/</span><span class="nv">api_tests</span>.<span class="nv">py</span> .. [<span class="mi">100</span><span class="o">%</span>]
<span class="o">=======================</span> <span class="mi">2</span> <span class="nv">passed</span> <span class="nv">in</span> <span class="mi">62</span>.<span class="mi">51</span><span class="nv">s</span> <span class="ss">(</span><span class="mi">0</span>:<span class="mi">01</span>:<span class="mi">02</span><span class="ss">)</span> <span class="o">===========================</span>
</code></pre></div>
<p>The code for the property-based API tests is in <a href="https://github.com/schmidtbri/property-based-testing-for-ml-models/blob/main/tests/api_tests.py">this
file</a>.</p>
<p>By executing property-based tests against the model service, we're able
to test the model deployment more thoroughly, because the tests exercise
the service code along with the model code. Although the service code
is very simple and lightweight, including it turns the hypothesis tests
into full integration tests that cover the entire service along with the
model.</p>
<h1>Conclusion</h1>
<p>Using property-based tests we were able to find two common errors that
can come up when deploying machine learning models. A mismatch between
the model's schema and the data that it is actually able to process can
cause many issues that are hard to debug. By using this type of
generative testing, we were able to find both errors that we introduced
to the schema pretty easily.</p>
<p>In this blog post we also saw the benefits of using a package like
pydantic for creating the input and output schemas for an ML model. By
stating the schemas as code, we're able to clearly show what data is
allowed as input and what data is returned by the model. The model's
designer does not have to write documentation to explain the input and
output data because it is already built into the input and output schema
classes. If we didn't have the model's schemas as pydantic classes, the
hypothesis and schemathesis packages would not even be able to generate
test cases for the model and model service.</p>Training and Deploying an ML Model2021-07-15T08:26:00-05:002021-07-15T08:26:00-05:00Brian Schmidttag:www.tekhnoal.com,2021-07-15:/regression-model.html<p>This post is a collection of several different techniques that I wanted to learn. In this blog post I'll be using open source python packages to do automated data exploration, automated feature engineering, automated machine learning, and model validation. I'll also be using docker and kubernetes to deploy the model. I'll cover the entire codebase of the model, from the initial data exploration to the deployment of the model behind a RESTful API in Kubernetes.</p><h1>Introduction</h1>
<p>This post is a collection of several different techniques that I wanted
to learn. In this blog post I'll be using open source python packages to
do automated data exploration, automated feature engineering, automated
machine learning, and model validation. I'll also be using docker and
kubernetes to deploy the model. I'll cover the entire codebase of the
model, from the initial data exploration to the deployment of the model
behind a RESTful API in Kubernetes.</p>
<p>Automated feature engineering is a technique that is used to automate
the creation of features from a dataset without having to manually
design them and write the code to create the features. Feature
engineering is very important for being able to create ML models that
work well on a dataset, but it takes a lot of time and effort. Automated
feature engineering is able to generate many candidate features from a
given dataset, from which we can then select the useful ones. In this
blog post, I'll be using the <a href="https://www.featuretools.com/">featuretools library</a>,
which automatically generates candidate features from a dataset using a
technique called deep feature synthesis.</p>
<p>Automated machine learning is a process through which we can create
machine learning models without having to explore many different model
types and hyperparameters. AutoML can automate the process of choosing
the best solution for a dataset, going from a raw dataset to a trained
model. AutoML tools allow non-experts to be able to create ML models
without having to understand everything that is happening under the
hood. All that is needed is a properly processed data set and anyone can
generate a model from the data. In this blog post, I'll be using the
<a href="https://epistasislab.github.io/tpot/">TPOT library</a>, which helps
to do feature preprocessing, feature selection, model selection, and
hyperparameter search.</p>
<p>In this blog post, I'll also show how to create a RESTful service for
the model that will allow us to deploy the model quickly and simply.
We'll also show how to deploy the model service using docker and
Kubernetes. This blog post covers a lot of different tools and
techniques for building and deploying ML models, and it is not meant to
be a deep dive into any of the individual techniques. I just wanted to
show how to take a model all the way from data exploration, to training,
and finally to deployment.</p>
<h1>Package Structure</h1>
<p>The package we'll develop in this blog post has this structure:</p>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">insurance_charges_model</span>
<span class="o">-</span> <span class="nv">model_files</span> <span class="ss">(</span><span class="nv">output</span> <span class="nv">files</span> <span class="nv">from</span> <span class="nv">model</span> <span class="nv">training</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">prediction</span> <span class="ss">(</span><span class="nv">package</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">prediction</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">model</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">prediction</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">schemas</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">model</span> <span class="nv">input</span> <span class="nv">and</span> <span class="nv">output</span> <span class="nv">schemas</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">transformers</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">data</span> <span class="nv">transformers</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">training</span> <span class="ss">(</span><span class="nv">package</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">training</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">data_exploration</span>.<span class="nv">ipynb</span> <span class="ss">(</span><span class="nv">data</span> <span class="nv">exploration</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">data_preparation</span>.<span class="nv">ipynb</span> <span class="ss">(</span><span class="nv">data</span> <span class="nv">preparation</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">model_training</span>.<span class="nv">ipynb</span> <span class="ss">(</span><span class="nv">model</span> <span class="nv">training</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">model_validation</span>.<span class="nv">ipynb</span> <span class="ss">(</span><span class="nv">model</span> <span class="nv">validation</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">kubernetes</span> <span class="ss">(</span><span class="nv">kubernetes</span> <span class="nv">manifests</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">deployment</span>.<span class="nv">yml</span>
<span class="o">-</span> <span class="nv">namespace</span>.<span class="nv">yml</span>
<span class="o">-</span> <span class="nv">service</span>.<span class="nv">yml</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span><span class="nv">unit</span> <span class="nv">tests</span> <span class="k">for</span> <span class="nv">model</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">Dockerfile</span> <span class="ss">(</span><span class="nv">instructions</span> <span class="k">for</span> <span class="nv">generating</span> <span class="nv">a</span> <span class="nv">docker</span> <span class="nv">image</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span> <span class="ss">(</span><span class="nv">list</span> <span class="nv">of</span> <span class="nv">dependencies</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">rest_config</span>.<span class="nv">yaml</span> <span class="ss">(</span><span class="nv">configuration</span> <span class="k">for</span> <span class="nv">REST</span> <span class="nv">model</span> <span class="nv">service</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">service_contract</span>.<span class="nv">yaml</span> <span class="ss">(</span><span class="nv">OpenAPI</span> <span class="nv">service</span> <span class="nv">contract</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">setup</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span> <span class="ss">(</span><span class="nv">test</span> <span class="nv">dependencies</span><span class="ss">)</span>
</code></pre></div>
<p>All of the code is available in a <a href="https://github.com/schmidtbri/regression-model">github repository.</a></p>
<h2>Getting the Data</h2>
<p>In order to train a regression model, we first need to have a dataset.
We went to Kaggle and found <a href="https://www.kaggle.com/mirichoi0218/insurance">a dataset</a> that
contained insurance charges information. To make it easy to download the
data, we installed the <a href="https://pypi.org/project/kaggle/">kaggle python package</a>. Then we executed these
commands to download the data and unzip it into the data folder in the
project:</p>
<div class="highlight"><pre><span></span><code>mkdir -p data
kaggle datasets download -d mirichoi0218/insurance -p ./data --unzip
</code></pre></div>
<p>To make it even easier to download the data, we added a Makefile target
for the commands:</p>
<div class="highlight"><pre><span></span><code><span class="nf">download-dataset</span><span class="o">:</span> <span class="c">## download dataset from Kaggle</span>
mkdir -p data
kaggle datasets download -d mirichoi0218/insurance -p ./data --unzip
</code></pre></div>
<p>Now all we need to do is execute this command:</p>
<div class="highlight"><pre><span></span><code>make download-dataset
</code></pre></div>
<p>Instead of having to remember how to get the data needed to do modeling,
I always try to create a repeatable and documented process for creating
the dataset. We also make sure to never store the dataset in source
control, so we'll add this line to the .gitignore file:</p>
<div class="highlight"><pre><span></span><code>data/
</code></pre></div>
<h1>Training a Regression Model</h1>
<p>Now that we have the dataset, we'll start working on training a
regression model. We'll be doing data exploration, data preparation,
feature engineering, automated model training and selection, and model
validation.</p>
<h2>Exploring the Data</h2>
<p>Data exploration is a key step that can tell us a lot about the dataset
that we have to model. Data exploration can be highly customized to the
specific dataset, but there are also tools that allow us to calculate
the most common things we want to learn about a dataset automatically.
<a href="https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/">pandas_profiling</a>
is a package that accepts a pandas data frame and creates an HTML report
with a profile of the dataset in the data frame. According to the
pandas_profiling documentation it has these capabilities:</p>
<ul>
<li>Type inference: detect the types of columns in a dataframe.</li>
<li>Essentials: type, unique values, missing values</li>
<li>Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range</li>
<li>Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness</li>
<li>Most frequent values</li>
<li>Histograms</li>
<li>Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices</li>
<li>Missing values matrix, count, heatmap and dendrogram of missing values</li>
<li>Duplicate rows Lists the most occurring duplicate rows</li>
<li>Text analysis learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data</li>
</ul>
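<p>Several of these statistics are simple to compute by hand for a single column. For example, the quartiles and interquartile range, shown here with made-up sample values rather than the insurance dataset:</p>

```python
import statistics

# A made-up "age" column, not values from the insurance dataset.
ages = [18, 23, 25, 28, 31, 33, 37, 46, 52, 60]

# quantiles() with n=4 returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(ages, n=4)
interquartile_range = q3 - q1
```

<p>pandas_profiling computes these statistics, and the rest of the list above, for every column at once, which is what makes the automated report so convenient.</p>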
<p>These are the things that we would be looking into to learn more about
the data set. To use the pandas_profiling package, we'll first load the
dataset into a pandas dataframe:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pandas_profiling</span> <span class="kn">import</span> <span class="n">ProfileReport</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"../../data/insurance.csv"</span><span class="p">)</span>
</code></pre></div>
<p>Now we can query the dataframe to find out the column types:</p>
<div class="highlight"><pre><span></span><code><span class="n">data</span><span class="o">.</span><span class="n">dtypes</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">age</span><span class="w"> </span><span class="n">int64</span><span class="w"></span>
<span class="n">sex</span><span class="w"> </span><span class="n">object</span><span class="w"></span>
<span class="n">bmi</span><span class="w"> </span><span class="n">float64</span><span class="w"></span>
<span class="n">children</span><span class="w"> </span><span class="n">int64</span><span class="w"></span>
<span class="n">smoker</span><span class="w"> </span><span class="n">object</span><span class="w"></span>
<span class="n">region</span><span class="w"> </span><span class="n">object</span><span class="w"></span>
<span class="n">charges</span><span class="w"> </span><span class="n">float64</span><span class="w"></span>
<span class="nl">dtype:</span><span class="w"> </span><span class="n">object</span><span class="w"></span>
</code></pre></div>
<p>To create the profile, we'll execute this code:</p>
<div class="highlight"><pre><span></span><code><span class="n">profile</span> <span class="o">=</span> <span class="n">ProfileReport</span><span class="p">(</span><span class="n">data</span><span class="p">,</span>
<span class="n">title</span><span class="o">=</span><span class="s1">'Insurance Dataset Profile Report'</span><span class="p">,</span>
<span class="n">pool_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">html</span><span class="o">=</span><span class="p">{</span><span class="s1">'style'</span><span class="p">:</span> <span class="p">{</span><span class="s1">'full_width'</span><span class="p">:</span> <span class="kc">True</span><span class="p">}})</span>
<span class="n">profile</span><span class="o">.</span><span class="n">to_notebook_iframe</span><span class="p">()</span>
</code></pre></div>
<p>Once the report is created, we'll save it to disk:</p>
<div class="highlight"><pre><span></span><code><span class="n">profile</span><span class="o">.</span><span class="n">to_file</span><span class="p">(</span><span class="s2">"data_exploration_report.html"</span><span class="p">)</span>
</code></pre></div>
<p>Right away the profile will tell us a few key details about the dataset:</p>
<p><img alt="Dataset Statistics" src="https://www.tekhnoal.com/1.png" width="100%"></p>
<p>The profile also contains a few warnings about the data:</p>
<p><img alt="Dataset Warnings" src="https://www.tekhnoal.com/2.png" width="100%"></p>
<p>None of these warnings is particularly surprising given what we know
about insurance charges: health insurance premiums go up with age, and
being a smoker increases them as well.</p>
<p>The profile has a description for each variable; here's the description
of the age variable:</p>
<p><img alt="Age Variable" src="https://www.tekhnoal.com/3.png" width="100%"></p>
<p>As well as interactions between variables:</p>
<p><img alt="Dataset Interactions" src="https://www.tekhnoal.com/4.png" width="100%"></p>
<p>And finally the correlations between the variables:</p>
<p><img alt="Dataset Correlations" src="https://www.tekhnoal.com/5.png" width="100%"></p>
<p>By using the pandas_profiling package we can avoid writing the most
common data analysis code that we write for all datasets. All of the
code for data exploration is in the
<a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/training/1.%20data_exploration.ipynb">data_exploration.ipynb</a>
notebook.</p>
<h2>Preparing the Data</h2>
<p>In order to model the dataset, we'll first need to prepare and
preprocess the data. To start, let's load the dataset into a dataframe
again:</p>
<div class="highlight"><pre><span></span><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"../../data/insurance.csv"</span><span class="p">)</span>
</code></pre></div>
<p>To do data preparation, we'll use the <a href="https://www.featuretools.com/">featuretools package</a>
to create new features from the data already in the dataset. To create features, we'll need
to tell the featuretools package about our data by identifying entities
in the data:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">featuretools</span> <span class="k">as</span> <span class="nn">ft</span>
<span class="n">entityset</span> <span class="o">=</span> <span class="n">ft</span><span class="o">.</span><span class="n">EntitySet</span><span class="p">(</span><span class="nb">id</span><span class="o">=</span><span class="s2">"Transactions"</span><span class="p">)</span>
<span class="n">entityset</span> <span class="o">=</span> <span class="n">entityset</span><span class="o">.</span><span class="n">entity_from_dataframe</span><span class="p">(</span><span class="n">entity_id</span><span class="o">=</span><span class="s2">"Transactions"</span><span class="p">,</span>
<span class="n">dataframe</span><span class="o">=</span><span class="n">df</span><span class="p">,</span>
<span class="n">make_index</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">index</span><span class="o">=</span><span class="s2">"index"</span><span class="p">)</span>
</code></pre></div>
<p>In the code above, we created an EntitySet with the id "Transactions",
which is the entity in the dataframe. The featuretools package
identified the variables associated with the Transactions entity:</p>
<div class="highlight"><pre><span></span><code><span class="n">entityset</span><span class="p">[</span><span class="s2">"Transactions"</span><span class="p">]</span><span class="o">.</span><span class="n">variables</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">[</span><span class="o"><</span><span class="nl">Variable:</span><span class="w"> </span><span class="n">index</span><span class="w"> </span><span class="p">(</span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">index</span><span class="p">)</span><span class="o">></span><span class="p">,</span><span class="w"></span>
<span class="o"><</span><span class="nl">Variable:</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="p">(</span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numeric</span><span class="p">)</span><span class="o">></span><span class="p">,</span><span class="w"></span>
<span class="o"><</span><span class="nl">Variable:</span><span class="w"> </span><span class="n">sex</span><span class="w"> </span><span class="p">(</span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">categorical</span><span class="p">)</span><span class="o">></span><span class="p">,</span><span class="w"></span>
<span class="o"><</span><span class="nl">Variable:</span><span class="w"> </span><span class="n">bmi</span><span class="w"> </span><span class="p">(</span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numeric</span><span class="p">)</span><span class="o">></span><span class="p">,</span><span class="w"></span>
<span class="o"><</span><span class="nl">Variable:</span><span class="w"> </span><span class="n">children</span><span class="w"> </span><span class="p">(</span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numeric</span><span class="p">)</span><span class="o">></span><span class="p">,</span><span class="w"></span>
<span class="o"><</span><span class="nl">Variable:</span><span class="w"> </span><span class="n">smoker</span><span class="w"> </span><span class="p">(</span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">categorical</span><span class="p">)</span><span class="o">></span><span class="p">,</span><span class="w"></span>
<span class="o"><</span><span class="nl">Variable:</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="p">(</span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">categorical</span><span class="p">)</span><span class="o">></span><span class="p">,</span><span class="w"></span>
<span class="o"><</span><span class="nl">Variable:</span><span class="w"> </span><span class="n">charges</span><span class="w"> </span><span class="p">(</span><span class="n">dtype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">numeric</span><span class="p">)</span><span class="o">></span><span class="p">]</span><span class="w"></span>
</code></pre></div>
<p>We can now generate some new features on the entity:</p>
<div class="highlight"><pre><span></span><code><span class="n">feature_dataframe</span><span class="p">,</span> <span class="n">features</span> <span class="o">=</span> <span class="n">ft</span><span class="o">.</span><span class="n">dfs</span><span class="p">(</span><span class="n">entityset</span><span class="o">=</span><span class="n">entityset</span><span class="p">,</span>
<span class="n">target_entity</span><span class="o">=</span><span class="s2">"Transactions"</span><span class="p">,</span>
<span class="n">trans_primitives</span><span class="o">=</span><span class="p">[</span><span class="s2">"add_numeric"</span><span class="p">,</span> <span class="s2">"subtract_numeric"</span><span class="p">,</span>
<span class="s2">"multiply_numeric"</span><span class="p">,</span> <span class="s2">"divide_numeric"</span><span class="p">,</span>
<span class="s2">"greater_than"</span><span class="p">,</span> <span class="s2">"less_than"</span><span class="p">],</span>
<span class="n">ignore_variables</span><span class="o">=</span><span class="p">{</span><span class="s2">"Transactions"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">,</span>
<span class="s2">"charges"</span><span class="p">]})</span>
</code></pre></div>
<p>The featuretools package uses a set of primitive operations to generate
new features from the data. In this case, the "add_numeric"
primitive generates a new feature by adding the values in each pair
of numeric variables. By combining the numerical variables in this way,
we'll generate three new columns:</p>
<ul>
<li>age + bmi</li>
<li>age + children</li>
<li>bmi + children</li>
</ul>
<p>The subtract_numeric, multiply_numeric, and divide_numeric primitives
also create new columns in a similar way, by applying subtraction,
multiplication, and division respectively. The greater_than and
less_than primitives create new boolean columns by comparing the values
in all pairs of numerical variables. The greater_than primitive
generated these new features:</p>
<ul>
<li>age > bmi</li>
<li>age > children</li>
<li>bmi > age</li>
<li>bmi > children</li>
<li>children > age</li>
<li>children > bmi</li>
</ul>
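<p>The feature counts follow from simple combinatorics: a symmetric primitive produces one feature per unordered pair of numeric columns, while an asymmetric primitive produces one per ordered pair. A quick sketch in plain Python (independent of featuretools):</p>

```python
from itertools import combinations, permutations

numeric_columns = ["age", "bmi", "children"]

# Symmetric primitives such as add_numeric act on unordered pairs.
unordered_pairs = list(combinations(numeric_columns, 2))

# Asymmetric primitives such as greater_than act on ordered pairs.
ordered_pairs = list(permutations(numeric_columns, 2))

print(len(unordered_pairs))  # 3 features per symmetric primitive
print(len(ordered_pairs))    # 6 features per asymmetric primitive
```

<p>Two primitives at 3 features each plus four primitives at 6 features each accounts for the 30 new features generated here.</p>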
<p>At the end of the feature generation, we have 30 new features in the
dataset that were generated from the data already there. Before we can
use these new features, we need to figure out how to integrate the
transformer with <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">scikit-learn
pipelines</a>,
which is what we will be using to build up our model. To accomplish this
we created a <a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/prediction/transformers.py#L56-L99">transformer</a>
which is instantiated like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">dfs_transformer</span> <span class="o">=</span> <span class="n">DFSTransformer</span><span class="p">(</span><span class="s2">"Transactions"</span><span class="p">,</span>
<span class="n">trans_primitives</span><span class="o">=</span><span class="p">[</span><span class="s2">"add_numeric"</span><span class="p">,</span> <span class="s2">"subtract_numeric"</span><span class="p">,</span>
<span class="s2">"multiply_numeric"</span><span class="p">,</span> <span class="s2">"divide_numeric"</span><span class="p">,</span>
<span class="s2">"greater_than"</span><span class="p">,</span> <span class="s2">"less_than"</span><span class="p">],</span>
<span class="n">ignore_variables</span><span class="o">=</span><span class="p">{</span><span class="s2">"Transactions"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span>
<span class="s2">"region"</span><span class="p">]})</span>
</code></pre></div>
<p>Since the feature generation sometimes creates infinite values, we'll
also need a
<a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/prediction/transformers.py#L102-L119">transformer</a>
to convert them to NaN values. This transformer is instantiated like
this:</p>
<div class="highlight"><pre><span></span><code><span class="n">infinity_transformer</span> <span class="o">=</span> <span class="n">InfinityToNaNTransformer</span><span class="p">()</span>
</code></pre></div>
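<p>The transformer linked above lives in the model's repository; a minimal sketch of the same idea, written as a custom scikit-learn transformer (an illustration, not the repository's exact code), could look like this:</p>

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class InfinityToNaNTransformer(BaseEstimator, TransformerMixin):
    """Replace +inf and -inf values with NaN so a downstream imputer can handle them."""

    def fit(self, X, y=None):
        # Stateless transformer, nothing to learn from the data.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.where(np.isinf(X), np.nan, X)
```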
<p>To handle the NaN values generated by the InfinityToNaNTransformer,
we'll use a
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">SimpleImputer</a>
from the scikit-learn library. It is instantiated like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">simple_imputer</span> <span class="o">=</span> <span class="n">SimpleImputer</span><span class="p">(</span><span class="n">missing_values</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="n">strategy</span><span class="o">=</span><span class="s1">'mean'</span><span class="p">)</span>
</code></pre></div>
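<p>A quick demonstration of the mean strategy on a small made-up matrix:</p>

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A small made-up matrix with one missing value per column.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
result = imputer.fit_transform(X)

# Each NaN is replaced with its column mean: 2.0 in the first column, 6.0 in the second.
print(result)
```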
<p>The SimpleImputer transformer has problems imputing values that are
not floats when using the 'mean' strategy. To fix this, we'll create
a
<a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/prediction/transformers.py#L36-L53">transformer</a>
that will convert all integer columns into floating-point columns:</p>
<div class="highlight"><pre><span></span><code><span class="n">int_to_float_transformer</span> <span class="o">=</span> <span class="n">IntToFloatTransformer</span><span class="p">()</span>
</code></pre></div>
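<p>Again, the linked transformer is in the repository; a minimal sketch of the idea (illustrative only) is to cast every integer column of the incoming dataframe to float:</p>

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class IntToFloatTransformer(BaseEstimator, TransformerMixin):
    """Cast integer columns to float so mean imputation can yield fractional values."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for column in X.columns:
            if pd.api.types.is_integer_dtype(X[column]):
                X[column] = X[column].astype(float)
        return X
```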
<p>Lastly, we'll put the DFSTransformer, IntToFloatTransformer,
InfinityToNaNTransformer, and SimpleImputer transformers into a Pipeline
so they'll all work together as a unit:</p>
<div class="highlight"><pre><span></span><code><span class="n">dfs_pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([</span>
<span class="p">(</span><span class="s2">"dfs_transformer"</span><span class="p">,</span> <span class="n">dfs_transformer</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"int_to_float_transformer"</span><span class="p">,</span> <span class="n">int_to_float_transformer</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"infinity_transformer"</span><span class="p">,</span> <span class="n">infinity_transformer</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"simple_imputer"</span><span class="p">,</span> <span class="n">simple_imputer</span><span class="p">),</span>
<span class="p">])</span>
</code></pre></div>
<p>Next, we'll deal with the boolean features in the dataset. To do this,
we created a
<a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/prediction/transformers.py#L7-L33">transformer</a>
that converts string values into the corresponding true or false values.
It's instantiated like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">boolean_transformer</span> <span class="o">=</span> <span class="n">BooleanTransformer</span><span class="p">(</span><span class="n">true_value</span><span class="o">=</span><span class="s2">"yes"</span><span class="p">,</span> <span class="n">false_value</span><span class="o">=</span><span class="s2">"no"</span><span class="p">)</span>
</code></pre></div>
<p>This transformer will be used to convert the "smoker" variable into a
boolean value. The values found in the dataset are "yes" and "no"; the
transformer is configured to convert "yes" to True and "no" to False.</p>
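<p>A sketch of what such a transformer might look like (the repository's implementation may differ in its details):</p>

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class BooleanTransformer(BaseEstimator, TransformerMixin):
    """Map a configurable pair of string values to True and False."""

    def __init__(self, true_value="yes", false_value="no"):
        self.true_value = true_value
        self.false_value = false_value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        mapping = {self.true_value: True, self.false_value: False}
        return pd.DataFrame(X).replace(mapping)
```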
<p>Next, we'll create an encoder for the categorical features, 'sex'
and 'region'. We'll use the
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html">OrdinalEncoder</a>
from the scikit-learn library:</p>
<div class="highlight"><pre><span></span><code><span class="n">ordinal_encoder</span> <span class="o">=</span> <span class="n">OrdinalEncoder</span><span class="p">()</span>
</code></pre></div>
<p>Now we can create a
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html">ColumnTransformer</a>
that combines all of the pipelines and transformers we created above
into one bigger pipeline:</p>
<div class="highlight"><pre><span></span><code><span class="n">column_transformer</span> <span class="o">=</span> <span class="n">ColumnTransformer</span><span class="p">(</span><span class="n">remainder</span><span class="o">=</span><span class="s2">"passthrough"</span><span class="p">,</span>
<span class="n">transformers</span><span class="o">=</span><span class="p">[</span>
<span class="p">(</span><span class="s2">"dfs_pipeline"</span><span class="p">,</span> <span class="n">dfs_pipeline</span><span class="p">,</span> <span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span>
<span class="s2">"children"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">]),</span>
<span class="p">(</span><span class="s2">"boolean_transformer"</span><span class="p">,</span> <span class="n">boolean_transformer</span><span class="p">,</span> <span class="p">[</span><span class="s2">"smoker"</span><span class="p">]),</span>
<span class="p">(</span><span class="s2">"ordinal_encoder"</span><span class="p">,</span> <span class="n">ordinal_encoder</span><span class="p">,</span> <span class="p">[</span><span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">])</span>
<span class="p">])</span>
</code></pre></div>
<p>The ColumnTransformer applies the deep feature synthesis pipeline to all
of the input variables, then it applies the boolean transformer to the
"smoker" variable, and the ordinal encoder to the "sex" and "region"
variables.</p>
<p>Now we do a small test to make sure that the transformations are
happening as expected:</p>
<div class="highlight"><pre><span></span><code><span class="n">test_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([[</span><span class="mi">65</span><span class="p">,</span> <span class="s2">"male"</span><span class="p">,</span> <span class="mf">12.5</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s2">"yes"</span><span class="p">,</span> <span class="s2">"southwest"</span><span class="p">],</span>
<span class="p">[</span><span class="mi">75</span><span class="p">,</span> <span class="s2">"female"</span><span class="p">,</span> <span class="mf">78.770</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s2">"no"</span><span class="p">,</span> <span class="s2">"southeast"</span><span class="p">]],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span> <span class="s2">"children"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">])</span>
<span class="n">column_transformer</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">test_df</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">column_transformer</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">test_df</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">!=</span> <span class="mi">33</span><span class="p">:</span> <span class="c1"># expecting 33 features to come out of the ColumnTransformer</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Unexpected number of columns found in the dataframe."</span><span class="p">)</span>
</code></pre></div>
<p>To test the pipeline, we created a dataframe with two rows, then we
fitted the pipeline to it and transformed the dataframe. We expect to
get 33 columns in the output dataframe because of the deep feature
synthesis, so we test for that and raise an exception if it is not the
case.</p>
<p>The column transformer can now be saved so we can use it later in the
model training process:</p>
<div class="highlight"><pre><span></span><code><span class="n">joblib</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">column_transformer</span><span class="p">,</span> <span class="s2">"transformer.joblib"</span><span class="p">)</span>
</code></pre></div>
<p>In this section we used scikit-learn pipelines to compose a complex
series of data transformations that will be executed when the model is
trained and also when it is used for predictions. By using pipelines, we
are able to make sure that the steps always happen in the same order and
with the same parameters. If we didn't use pipelines, we would end up
writing the transformations twice: once for model training and once
for prediction. All of the code for data preparation is in the
<a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/training/2.%20data_preparation.ipynb">data_preparation.ipynb</a>
notebook.</p>
<h2>Training a Model</h2>
<p>The next step after preparing the data is to train a model. For this,
we'll use the <a href="https://epistasislab.github.io/tpot/">TPOT
package</a>, which is an
automated machine learning tool that is able to search through many
possible model types and hyperparameters and find the best pipeline for
the dataset. The package uses <a href="https://en.wikipedia.org/wiki/Genetic_programming">genetic
programming</a> to
search the space of possible ML pipelines.</p>
<p>To train the model, we'll first load the dataset:</p>
<div class="highlight"><pre><span></span><code><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"../../data/insurance.csv"</span><span class="p">)</span>
</code></pre></div>
<p>Then, we'll create a training set and a test set by randomly selecting
samples. The training/testing split will be approximately 80:20.</p>
<div class="highlight"><pre><span></span><code><span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">))</span> <span class="o"><</span> <span class="mf">0.8</span>
<span class="n">training_set</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">mask</span><span class="p">]</span>
<span class="n">testing_set</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">mask</span><span class="p">]</span>
</code></pre></div>
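<p>The random mask yields an approximately 80:20 split. If an exact split is preferred, scikit-learn's train_test_split does the same job; here's a sketch with a stand-in dataframe (the notebook itself uses the mask approach above):</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A stand-in dataframe; in the notebook this would be the insurance dataset.
df = pd.DataFrame({"age": range(100), "charges": range(100)})

training_set, testing_set = train_test_split(df, test_size=0.2, random_state=42)
print(len(training_set), len(testing_set))  # 80 20
```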
<p>Next, we'll save both data sets to the data folder because we'll need
them later for model validation. Since we're choosing to do the
validation in another Jupyter notebook, we need to keep the data sets on
disk until then.</p>
<div class="highlight"><pre><span></span><code><span class="n">training_set</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s2">"../../data/training_set.csv"</span><span class="p">)</span>
<span class="n">testing_set</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s2">"../../data/testing_set.csv"</span><span class="p">)</span>
</code></pre></div>
<p>Now that we have a training set, we'll need to separate the feature
columns from the target column:</p>
<div class="highlight"><pre><span></span><code><span class="n">feature_columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span> <span class="s2">"children"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">]</span>
<span class="n">target_column</span> <span class="o">=</span> <span class="s2">"charges"</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">training_set</span><span class="p">[</span><span class="n">feature_columns</span><span class="p">]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">training_set</span><span class="p">[</span><span class="n">target_column</span><span class="p">]</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">testing_set</span><span class="p">[</span><span class="n">feature_columns</span><span class="p">]</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">testing_set</span><span class="p">[</span><span class="n">target_column</span><span class="p">]</span>
</code></pre></div>
<p>Next, we'll apply the preprocessing pipeline that we built in the data
preprocessing code. First we'll load the transformer that we saved to
disk:</p>
<div class="highlight"><pre><span></span><code><span class="n">transformer</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"transformer.joblib"</span><span class="p">)</span>
</code></pre></div>
<p>Now we can apply it to the features dataframe in order to calculate the
features that we created using automated feature engineering:</p>
<div class="highlight"><pre><span></span><code><span class="n">features</span> <span class="o">=</span> <span class="n">transformer</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
</code></pre></div>
<p>Now that we have a features dataframe that we can train a model with,
we'll launch the training by instantiating a TPOTRegressor object and
calling the fit method:</p>
<div class="highlight"><pre><span></span><code><span class="n">tpot_regressor</span> <span class="o">=</span> <span class="n">TPOTRegressor</span><span class="p">(</span><span class="n">generations</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
<span class="n">population_size</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
<span class="n">cv</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
<span class="n">n_jobs</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
<span class="n">verbosity</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">early_stop</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">tpot_regressor</span> <span class="o">=</span> <span class="n">tpot_regressor</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">features</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div>
<p>The TPOTRegressor uses genetic programming, so we need to provide some
parameters that define the size of the population and the number of
generations. The random_state parameter makes it easier to replicate the
training run, the cv parameter sets the number of cross-validation splits
that we want to use, and the n_jobs parameter tells TPOT how many processes
to launch to train the model. The early_stop parameter ends the search
early if the score does not improve for ten consecutive generations.</p>
<p>Here is a sample of the output of the tpot_regressor as it trains:</p>
<div class="highlight"><pre><span></span><code>Optimization Progress: 100%
2550/2550 [35:22<00:00, 1.15pipeline/s]
Generation 1 - Current best internal CV score: -19328040.90181576
Generation 2 - Current best internal CV score: -19328040.90181576
Generation 3 - Current best internal CV score: -19291161.694311526
Generation 4 - Current best internal CV score: -19216662.844604537
Generation 5 - Current best internal CV score: -19194856.36477192
...
Generation 48 - Current best internal CV score: -18848299.473418456
Generation 49 - Current best internal CV score: -18848299.473418456
Generation 50 - Current best internal CV score: -18848299.473418456
Best pipeline:
RandomForestRegressor(MaxAbsScaler(SGDRegressor(Normalizer(input_matrix,
norm=l2), alpha=0.01, eta0=1.0, fit_intercept=True, l1_ratio=0.0,
learning_rate=invscaling, loss=squared_loss, penalty=elasticnet,
power_t=0.1)), bootstrap=True, max_features=0.7500000000000001,
min_samples_leaf=16, min_samples_split=14, n_estimators=100)
</code></pre></div>
<p>It looks like the best pipeline found by TPOT includes a
RandomForestRegressor combined with several preprocessing steps. Now
that we have an optimal pipeline created by TPOT, we'll add our
own preprocessors to it. To do this we need an unfitted
pipeline object, which we don't have right now because the TPOTRegressor
pipeline has already been fitted.</p>
<p>To get an unfitted pipeline, we'll ask TPOT for the fitted pipeline and
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.clone.html">clone</a>
it:</p>
<div class="highlight"><pre><span></span><code><span class="n">unfitted_tpot_regressor</span> <span class="o">=</span> <span class="n">clone</span><span class="p">(</span><span class="n">tpot_regressor</span><span class="o">.</span><span class="n">fitted_pipeline_</span><span class="p">)</span>
</code></pre></div>
<p>Now that we have an unfitted Pipeline identical to the one that was
found by the TPOT package, we'll add our own preprocessors to the
pipeline. This ensures that the final pipeline will accept the
features in the original dataset and will process them
correctly. We'll compose the unfitted TPOT pipeline and the transformer
Pipeline into one Pipeline:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s2">"transformer"</span><span class="p">,</span> <span class="n">transformer</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"tpot_pipeline"</span><span class="p">,</span> <span class="n">unfitted_tpot_regressor</span><span class="p">)</span>
<span class="p">])</span>
</code></pre></div>
<p>Now we can train the model on the original, unprocessed dataset:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div>
<p>The final fitted pipeline contains all of the transformations that we
used to do deep feature synthesis and data preprocessing, and all of the
transformations that were added by TPOT. This is the final pipeline:</p>
<div class="highlight"><pre><span></span><code><span class="n">Pipeline</span><span class="p">(</span><span class="n">steps</span><span class="o">=</span><span class="p">[(</span><span class="s1">'transformer'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">ColumnTransformer</span><span class="p">(</span><span class="n">remainder</span><span class="o">=</span><span class="s1">'passthrough'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">transformers</span><span class="o">=</span><span class="p">[(</span><span class="s1">'dfs_pipeline'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">Pipeline</span><span class="p">(</span><span class="n">steps</span><span class="o">=</span><span class="p">[(</span><span class="s1">'dfs_transformer'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">DFSTransformer</span><span class="p">(</span><span class="n">ignore_variables</span><span class="o">=</span><span class="p">{</span><span class="s1">'Transactions'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'sex'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'smoker'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'region'</span><span class="p">]},</span><span class="w"></span>
<span class="w"> </span><span class="n">target_entity</span><span class="o">=</span><span class="s1">'Transactions'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">trans_primitives</span><span class="o">=</span><span class="p">[</span><span class="s1">'add_numeric'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'subtract_numeric'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'multiply_numeric'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'divide_numeric'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'greater_than'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'less_...</span>
<span class="w"> </span><span class="n">Pipeline</span><span class="p">(</span><span class="n">steps</span><span class="o">=</span><span class="p">[(</span><span class="s1">'normalizer'</span><span class="p">,</span><span class="w"> </span><span class="n">Normalizer</span><span class="p">()),</span><span class="w"></span>
<span class="w"> </span><span class="p">(</span><span class="s1">'stackingestimator'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">StackingEstimator</span><span class="p">(</span><span class="n">estimator</span><span class="o">=</span><span class="n">SGDRegressor</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">eta0</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">l1_ratio</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">penalty</span><span class="o">=</span><span class="s1">'elasticnet'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">power_t</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">))),</span><span class="w"></span>
<span class="w"> </span><span class="p">(</span><span class="s1">'maxabsscaler'</span><span class="p">,</span><span class="w"> </span><span class="n">MaxAbsScaler</span><span class="p">()),</span><span class="w"></span>
<span class="w"> </span><span class="p">(</span><span class="s1">'randomforestregressor'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">RandomForestRegressor</span><span class="p">(</span><span class="n">max_features</span><span class="o">=</span><span class="mf">0.7500000000000001</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">min_samples_leaf</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">min_samples_split</span><span class="o">=</span><span class="mi">14</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">))]))])</span><span class="w"></span>
</code></pre></div>
<p>Finally, we'll test the model with a single sample:</p>
<div class="highlight"><pre><span></span><code><span class="n">test_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([[</span><span class="mi">65</span><span class="p">,</span> <span class="s2">"male"</span><span class="p">,</span> <span class="mf">12.5</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s2">"yes"</span><span class="p">,</span> <span class="s2">"southwest"</span><span class="p">]],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span> <span class="s2">"children"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">])</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">test_df</span><span class="p">)</span>
</code></pre></div>
<p>The result is:</p>
<div class="highlight"><pre><span></span><code>array([19326.59077456])
</code></pre></div>
<p>In order to use the model later, we'll serialize it to disk:</p>
<div class="highlight"><pre><span></span><code><span class="n">joblib</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="s2">"model.joblib"</span><span class="p">)</span>
</code></pre></div>
<p>All of the code for training the model is in the
<a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/training/3.%20model_training.ipynb">model_training.ipynb</a>
notebook.</p>
<h2>Validating the Model</h2>
<p>In order to validate the model generated by the autoML process, we'll
use the <a href="https://www.scikit-yb.org/en/latest/">yellow_brick
library</a>.</p>
<p>First, we'll load the training and testing sets that we previously saved
to disk:</p>
<div class="highlight"><pre><span></span><code><span class="n">training_set</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"../../data/training_set.csv"</span><span class="p">)</span>
<span class="n">testing_set</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"../../data/testing_set.csv"</span><span class="p">)</span>
</code></pre></div>
<p>Next, we'll separate the predictor variables from the target variable:</p>
<div class="highlight"><pre><span></span><code><span class="n">feature_columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span> <span class="s2">"children"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">]</span>
<span class="n">target_column</span> <span class="o">=</span> <span class="s2">"charges"</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">training_set</span><span class="p">[</span><span class="n">feature_columns</span><span class="p">]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">training_set</span><span class="p">[</span><span class="n">target_column</span><span class="p">]</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">testing_set</span><span class="p">[</span><span class="n">feature_columns</span><span class="p">]</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">testing_set</span><span class="p">[</span><span class="n">target_column</span><span class="p">]</span>
</code></pre></div>
<p>We'll load the fitted model object that was saved in a previous step:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s2">"model.joblib"</span><span class="p">)</span>
</code></pre></div>
<p>We can now make predictions on the test set with the fitted
pipeline:</p>
<div class="highlight"><pre><span></span><code><span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</code></pre></div>
<p>The model's R² score and errors are calculated like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">r2</span> <span class="o">=</span> <span class="n">r2_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">)</span>
<span class="n">mse</span> <span class="o">=</span> <span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">)</span>
<span class="n">mae</span> <span class="o">=</span> <span class="n">mean_absolute_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">)</span>
</code></pre></div>
<p>The results are:</p>
<div class="highlight"><pre><span></span><code>r2 score: 0.827414647586443
mean squared error: 24830561.579995826
mean absolute error: 2713.6533067216383
</code></pre></div>
<p>Next, we'll create a yellow_brick visualizer for the model:</p>
<div class="highlight"><pre><span></span><code><span class="n">visualizer</span> <span class="o">=</span> <span class="n">ResidualsPlot</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p>The <a href="https://www.scikit-yb.org/en/latest/api/regressor/residuals.html">ResidualsPlot
visualizer</a>
shows us the difference between the observed value and the predicted
value of the target variable. This visualization is useful to see if
there are value ranges for the target variable that have more or less
error than other value ranges. The plot generated for our model looks
like this:</p>
<p><img alt="Residuals Plot" src="https://www.tekhnoal.com/6.png" width="100%"></p>
<p>Next, we'll generate the prediction error plot for the model using the
<a href="https://www.scikit-yb.org/en/latest/api/regressor/peplot.html">PredictionError
visualizer</a>:</p>
<div class="highlight"><pre><span></span><code><span class="n">visualizer</span> <span class="o">=</span> <span class="n">PredictionError</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
<span class="n">visualizer</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p>The prediction error plot shows the actual values of the target variable
against the predicted values generated by the model. This allows us to
see how much variance is in the predictions made by the model. The plot
generated for our model looks like this:</p>
<p><img alt="Prediction Error Plot" src="https://www.tekhnoal.com/7.png" width="100%"></p>
<p>All of the code for validating the model is in the
<a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/training/4.%20model_validation.ipynb">model_validation.ipynb</a>
notebook.</p>
<h1>Making Predictions with the Model</h1>
<p>The insurance charges model is now ready to make predictions, so we
need to make it available in an easy-to-use format. The
<a href="https://schmidtbri.github.io/ml-base/">ml_base package</a> defines
a simple base class for model prediction code that allows us to "wrap"
the code in a class that follows the MLModel interface. The interface
publishes the following information about the model:</p>
<ul>
<li>Qualified Name, a unique identifier for the model</li>
<li>Display Name, a friendly name for the model used in user interfaces</li>
<li>Description, a description for the model</li>
<li>Version, semantic version of the model codebase</li>
<li>Input Schema, an object that describes the model's input data</li>
<li>Output Schema, an object that describes the model's output data</li>
</ul>
<p>The MLModel interface also dictates that the model class implement two
methods:</p>
<ul>
<li>__init__, an initialization method that loads any model artifacts needed to make predictions</li>
<li>predict, a prediction method that receives model inputs, makes a prediction, and returns model outputs</li>
</ul>
<p>By using the MLModel base class we'll be able to do more interesting
things later with the model. If you'd like to learn more about the
ml_base package, there is a <a href="https://www.tekhnoal.com/introducing-ml-base-package.html">blog post</a>
about it. </p>
<p>To install the ml_base package, execute this command:</p>
<div class="highlight"><pre><span></span><code>pip install ml_base
</code></pre></div>
<h2>Creating Input and Output Schemas</h2>
<p>Before writing the model class, we'll need to define the input and
output schemas of the model. To do this, we'll use the <a href="https://pydantic-docs.helpmanual.io/">pydantic
package</a>.</p>
<p>The "sex" feature used by the model is a categorical feature that can be
stated as an enumeration because it has a limited number of allowed
values:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">SexEnum</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="n">male</span> <span class="o">=</span> <span class="s2">"male"</span>
<span class="n">female</span> <span class="o">=</span> <span class="s2">"female"</span>
</code></pre></div>
<p>We'll use this class as a type in the input schema of the model.</p>
<p>We'll also need another enumeration for the region feature:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">RegionEnum</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="n">southwest</span> <span class="o">=</span> <span class="s2">"southwest"</span>
<span class="n">southeast</span> <span class="o">=</span> <span class="s2">"southeast"</span>
<span class="n">northwest</span> <span class="o">=</span> <span class="s2">"northwest"</span>
<span class="n">northeast</span> <span class="o">=</span> <span class="s2">"northeast"</span>
</code></pre></div>
<p>Now we're ready to create the input schema for the model:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">InsuranceChargesModelInput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">age</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"Age"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">18</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">65</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Age of primary beneficiary in years."</span><span class="p">)</span>
<span class="n">sex</span><span class="p">:</span> <span class="n">SexEnum</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"Sex"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Gender of beneficiary."</span><span class="p">)</span>
<span class="n">bmi</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"Body Mass Index"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mf">15.0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mf">50.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Body mass index of beneficiary."</span><span class="p">)</span>
<span class="n">children</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"Children"</span><span class="p">,</span> <span class="n">ge</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">le</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Number of children covered by health insurance."</span><span class="p">)</span>
<span class="n">smoker</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"Smoker"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Whether beneficiary is a smoker."</span><span class="p">)</span>
<span class="n">region</span><span class="p">:</span> <span class="n">RegionEnum</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"Region"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Region where beneficiary lives."</span><span class="p">)</span>
</code></pre></div>
<p>We used the SexEnum and RegionEnum as types for the categorical
variables and added descriptions to each field. We also added the age,
bmi, children, and smoker fields, which are of type integer, float,
integer, and boolean respectively.</p>
<p>We can use the class to create an object like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">insurance_charges_model.prediction.schemas</span> <span class="kn">import</span> <span class="n">InsuranceChargesModelInput</span>
<span class="nb">input</span> <span class="o">=</span> <span class="n">InsuranceChargesModelInput</span><span class="p">(</span><span class="n">age</span><span class="o">=</span><span class="mi">22</span><span class="p">,</span> <span class="n">sex</span><span class="o">=</span><span class="s2">"male"</span><span class="p">,</span> <span class="n">bmi</span><span class="o">=</span><span class="mf">20.0</span><span class="p">,</span> <span class="n">children</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">region</span><span class="o">=</span><span class="s2">"southwest"</span><span class="p">)</span>
</code></pre></div>
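<p>Because the fields carry ge/le constraints, pydantic rejects out-of-range values automatically. Here is a self-contained sketch of that behavior, re-declaring a trimmed-down version of the schema (only three of the six fields) so it runs on its own:</p>

```python
from enum import Enum
from pydantic import BaseModel, Field, ValidationError

class SexEnum(str, Enum):
    male = "male"
    female = "female"

# trimmed-down stand-in for InsuranceChargesModelInput
class Input(BaseModel):
    age: int = Field(None, title="Age", ge=18, le=65)
    sex: SexEnum = Field(None, title="Sex")
    bmi: float = Field(None, title="Body Mass Index", ge=15.0, le=50.0)

ok = Input(age=22, sex="male", bmi=20.0)   # passes validation

try:
    Input(age=10, sex="male", bmi=20.0)    # age is below the ge=18 bound
except ValidationError:
    print("age=10 was rejected")
```

<p>Any input that fails validation never reaches the model's predict method, which keeps bad data out of the pipeline.</p>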
<p>Now that we have the model input defined, we'll move on to the model
output. This class is a lot simpler:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">InsuranceChargesModelOutput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">charges</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="kc">None</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"Charges"</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Individual medical costs billed by health insurance to customer in US dollars."</span><span class="p">)</span>
</code></pre></div>
<p>The model has a single output: the predicted charges in US dollars,
represented as a floating point field. The model schemas are in the
<a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/prediction/schemas.py">schemas
module</a>
in the prediction package.</p>
<h2>Creating the Model Class</h2>
<p>Since we now have the input and output schemas defined for the model,
we'll be able to create the class that wraps around the model.</p>
<p>To start, we'll define the class and add all of the required properties:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">InsuranceChargesModel</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">display_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"Insurance Charges Model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">qualified_name</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"insurance_charges_model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">description</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">"Model to predict the insurance charges of a customer."</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">version</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="n">__version__</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="n">InsuranceChargesModelInput</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="n">InsuranceChargesModelOutput</span>
</code></pre></div>
<p>The properties are required by the MLModel base class and they are used
to easily access metadata about the model. The input and output schema
classes are returned from the input_schema and output_schema properties.</p>
<p>The __init__ method of the class looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">realpath</span><span class="p">(</span><span class="vm">__file__</span><span class="p">)))</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">dir_path</span><span class="p">,</span> <span class="s2">"model_files"</span><span class="p">,</span> <span class="s2">"1"</span><span class="p">,</span> <span class="s2">"model.joblib"</span><span class="p">),</span> <span class="s1">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
</code></pre></div>
<p>The init method is used to load the model parameters from disk and store
the model object as an object attribute. The model object will be used
to make predictions. Once the init method completes, the model object
should be initialized and ready to make predictions.</p>
<p>The prediction method of the model class looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">InsuranceChargesModelInput</span><span class="p">)</span> <span class="o">-></span> <span class="n">InsuranceChargesModelOutput</span><span class="p">:</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">([[</span><span class="n">data</span><span class="o">.</span><span class="n">age</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">sex</span><span class="o">.</span><span class="n">value</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">bmi</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">children</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">smoker</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">region</span><span class="o">.</span><span class="n">value</span><span class="p">]],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"bmi"</span><span class="p">,</span> <span class="s2">"children"</span><span class="p">,</span> <span class="s2">"smoker"</span><span class="p">,</span> <span class="s2">"region"</span><span class="p">])</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="nb">round</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">]),</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">return</span> <span class="n">InsuranceChargesModelOutput</span><span class="p">(</span><span class="n">charges</span><span class="o">=</span><span class="n">y_hat</span><span class="p">)</span>
</code></pre></div>
<p>The predict method accepts an object of type InsuranceChargesModelInput
and returns an object of type InsuranceChargesModelOutput. First, the
method converts the incoming data into a pandas dataframe, then the
dataframe is used to make a prediction, and the result is converted to a
floating point number and rounded to two decimal places. Lastly, the
output object is created using the prediction and returned to the
caller.</p>
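<p>The same convert-predict-round flow can be sketched with a stub in place of the fitted pipeline. StubModel and its raw output value are made up for illustration; only the dataframe construction and rounding mirror the method above:</p>

```python
import pandas as pd

class StubModel:
    """Stand-in for the fitted scikit-learn pipeline."""
    def predict(self, X):
        return [19326.590774]  # made-up raw prediction

def predict(model, age, sex, bmi, children, smoker, region):
    # build a single-row dataframe, matching the training column order
    X = pd.DataFrame([[age, sex, bmi, children, smoker, region]],
                     columns=["age", "sex", "bmi", "children", "smoker", "region"])
    # predict, then round to two decimal places for a dollar amount
    return round(float(model.predict(X)[0]), 2)

print(predict(StubModel(), 65, "male", 12.5, 0, True, "southwest"))  # 19326.59
```

<p>Keeping the column order identical to the training data is essential, since the ColumnTransformer at the front of the pipeline selects features by name and position.</p>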
<p>The model class is defined in the <a href="https://github.com/schmidtbri/regression-model/blob/master/insurance_charges_model/prediction/model.py">model
module</a>
in the prediction package.</p>
<h1>Creating a RESTful Service</h1>
<p>Now that we have a model class defined, we are finally able to build the
RESTful service that will host the model when it is deployed. Luckily,
we don't actually need to write any code for this because we'll be using
the <a href="https://pypi.org/project/rest-model-service/">rest_model_service package</a>. If you'd
like to learn more about the rest_model_service package, there is a
<a href="https://www.tekhnoal.com/rest-model-service.html">blog post</a>
about it.</p>
<p>To install the package, execute this command:</p>
<div class="highlight"><pre><span></span><code>pip install rest_model_service
</code></pre></div>
<p>To create a service for our model, all we need to do is add a
YAML configuration file to the project. The <a href="https://github.com/schmidtbri/regression-model/blob/master/rest_config.yaml">configuration
file</a>
looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Insurance Charges Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance_charges_model.prediction.model.InsuranceChargesModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
</code></pre></div>
<p>The service title is the name we'll give the service in the
documentation. The models array contains references to the models that
we'd like to host within the service. Each model needs to have the
qualified name of the model along with the class path to the model's
MLModel class. The create_endpoint option is set to true to tell the
service to create an endpoint for the model.</p>
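<p>The class_path setting works because a dotted path can be resolved to a class at runtime. The rest_model_service package handles this internally; the helper below is only a sketch of the idea, demonstrated with a standard-library class:</p>

```python
import importlib


def load_class(class_path: str):
    """Resolve a dotted path like 'package.module.ClassName' to a class object."""
    module_path, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# demonstration with a class from the standard library
cls = load_class("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```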
<p>Using the configuration file, we're able to create an OpenAPI
specification file for the model service by executing this command:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
generate_openapi --output_file<span class="o">=</span>service_contract.yaml
</code></pre></div>
<p>The
<a href="https://github.com/schmidtbri/regression-model/blob/master/service_contract.yaml">service_contract.yaml</a>
file will be generated and it will contain the specification that was
generated for the model service. The
<a href="https://github.com/schmidtbri/regression-model/blob/master/service_contract.yaml#L183-L218">insurance_charges_model</a>
endpoint is the one we'll call to make predictions with the model. The
model's <a href="https://github.com/schmidtbri/regression-model/blob/master/service_contract.yaml#L183-L218">input and output
schemas</a>
were automatically extracted and added to the specification.</p>
<p>To run the service locally, execute this command:</p>
<div class="highlight"><pre><span></span><code>uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>The service should come up and can be accessed in a web browser at
<a href="http://127.0.0.1:8000">http://127.0.0.1:8000</a>. When you access
that URL you will be redirected to the documentation page that is
generated by the FastAPI package:</p>
<p><img alt="Documentation Page" src="https://www.tekhnoal.com/8.png" width="100%"></p>
<p>The documentation allows you to make requests against the API in order
to try it out. Here's a prediction request against the insurance charges
model:</p>
<p><img alt="Request" src="https://www.tekhnoal.com/9.png" width="100%"></p>
<p>And the prediction result:</p>
<p><img alt="Prediction Result" src="https://www.tekhnoal.com/10.png" width="100%"></p>
<p>By using the MLModel base class provided by the ml_base package and the
REST service framework provided by the rest_model_service package we're
able to quickly stand up a service to host the model.</p>
<h1>Deploying the Model</h1>
<p>Now that we have a working model and model service, we'll need to deploy
it somewhere. To do this, we'll use docker and kubernetes.</p>
<h2>Creating a Docker Image</h2>
<p>Before moving forward, let's create a docker image and run it locally.
The docker image is generated using instructions in the
<a href="https://github.com/schmidtbri/regression-model/blob/master/Dockerfile">Dockerfile</a>:</p>
<div class="highlight"><pre><span></span><code><span class="k">FROM</span><span class="w"> </span><span class="s">tiangolo/uvicorn-gunicorn-fastapi:python3.7</span>
<span class="k">MAINTAINER</span><span class="w"> </span><span class="s">Brian Schmidt</span>
<span class="s2">"6666331+schmidtbri@users.noreply.github.com"</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">./service</span>
<span class="k">COPY</span><span class="w"> </span>./insurance_charges_model ./insurance_charges_model
<span class="k">COPY</span><span class="w"> </span>./rest_config.yaml ./rest_config.yaml
<span class="k">COPY</span><span class="w"> </span>./service_requirements.txt ./service_requirements.txt
<span class="k">RUN</span><span class="w"> </span>pip install -r service_requirements.txt
<span class="k">ENV</span><span class="w"> </span><span class="nv">APP_MODULE</span><span class="o">=</span>rest_model_service.main:app
</code></pre></div>
<p>The Dockerfile is used by this command to create the docker image:</p>
<div class="highlight"><pre><span></span><code>docker build -t insurance_charges_model:0.1.0 .
</code></pre></div>
<p>To make sure everything worked as expected, we'll look through the
docker images in our system:</p>
<div class="highlight"><pre><span></span><code>docker image ls
</code></pre></div>
<p>The insurance_charges_model image should be listed. Next, we'll start
the image to see if everything is working as expected:</p>
<div class="highlight"><pre><span></span><code>docker run -d -p <span class="m">80</span>:80 insurance_charges_model:0.1.0
</code></pre></div>
<p>The service should be accessible on port 80 of localhost, so we'll try
to make a prediction using the curl command:</p>
<div class="highlight"><pre><span></span><code>curl -X <span class="s1">'POST'</span> <span class="se">\</span>
<span class="s1">'http://localhost/api/models/insurance_charges_model/prediction'</span> <span class="se">\</span>
-H <span class="s1">'accept: application/json'</span> <span class="se">\</span>
-H <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span>
-d <span class="s1">'{</span>
<span class="s1">"age": 65,</span>
<span class="s1">"sex": "male",</span>
<span class="s1">"bmi": 50,</span>
<span class="s1">"children": 5,</span>
<span class="s1">"smoker": true,</span>
<span class="s1">"region": "southwest"</span>
<span class="s1">}'</span>
</code></pre></div>
<p>We got back this output, which tells us that the service is working as
expected:</p>
<div class="highlight"><pre><span></span><code>{"charges":46918.68}
</code></pre></div>
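<p>The same request can be made from Python using only the standard library. The function below builds the POST request that the curl command above sends; actually sending it requires the container from the previous step to be running:</p>

```python
import json
import urllib.request


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Build the JSON POST request for the model's prediction endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST")


payload = {"age": 65, "sex": "male", "bmi": 50, "children": 5,
           "smoker": True, "region": "southwest"}
request = build_request(
    "http://localhost/api/models/insurance_charges_model/prediction", payload)

# to send the request (with the service running):
#   with urllib.request.urlopen(request) as response:
#       print(json.loads(response.read()))
```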
<p>If there are any problems, we should be able to debug them using the
logs. To see the logs emitted by the running container, execute this
command:</p>
<div class="highlight"><pre><span></span><code>docker logs <span class="k">$(</span>docker ps -lq<span class="k">)</span>
</code></pre></div>
<p>To stop the docker container, execute this command:</p>
<div class="highlight"><pre><span></span><code>docker <span class="nb">kill</span> <span class="k">$(</span>docker ps -lq<span class="k">)</span>
</code></pre></div>
<h2>Setting up Digital Ocean</h2>
<p>To show how to deploy the model service we created, we'll use <a href="https://www.digitalocean.com/">Digital
Ocean</a>. In this section we'll be
using the doctl command line utility which will help us to interact with
the Digital Ocean Kubernetes service. We followed <a href="https://docs.digitalocean.com/reference/doctl/how-to/install/">these
instructions</a>
to install the doctl utility. Before we can do anything with the Digital
Ocean API, we need to authenticate, so we created an API token by
following these instructions. Once we have the token we can add it to
the doctl utility by creating a new authentication context with this
command:</p>
<div class="highlight"><pre><span></span><code>doctl auth init <span class="se">\-</span>-context model-services-context
</code></pre></div>
<p>The command creates a new context called "model-services-context" that
we'll use to interact with the Digital Ocean API. The command asks for
the API token we generated and saves it into the configuration file of
the tool. To make sure that the context was created correctly and is the
current context, execute this command:</p>
<div class="highlight"><pre><span></span><code>doctl auth list
</code></pre></div>
<p>If the context we created is not the current context, we can switch to
it with this command:</p>
<div class="highlight"><pre><span></span><code>doctl auth switch <span class="se">\-</span>-context model-services-context
</code></pre></div>
<p>To make sure that we are working in the right account, execute this
command:</p>
<div class="highlight"><pre><span></span><code>doctl account get
</code></pre></div>
<p>The account details should match the account that you used to login. Now
that we are connecting to the right account in DO, we'll work on
uploading the docker image that contains the model service so that we
can use it in the Kubernetes cluster. First, we'll create a container
registry with this command:</p>
<div class="highlight"><pre><span></span><code>doctl registry create model-services-registry <span class="se">\-</span>-subscription-tier basic
</code></pre></div>
<p>We called the new registry "model-services-registry" and we used the
basic tier, which costs $5 a month.</p>
<h3>Pushing the Image</h3>
<p>Now that we have a registry, we need to add credentials to our local
docker daemon so that we can upload images. To do that, we'll use
this command:</p>
<div class="highlight"><pre><span></span><code>doctl registry login
</code></pre></div>
<p>In order to upload the image, we need to tag it with the URL of the DO
registry we created. The docker tag command looks like this:</p>
<div class="highlight"><pre><span></span><code>docker tag insurance_charges_model:0.1.0
registry.digitalocean.com/model-services-registry/insurance_charges_model:0.1.0
</code></pre></div>
<p>Now we can push the image to the DO registry:</p>
<div class="highlight"><pre><span></span><code>docker push registry.digitalocean.com/model-services-registry/insurance_charges_model:0.1.0
</code></pre></div>
<h3>Creating the Kubernetes Cluster</h3>
<p>The doctl tool provides an option for creating a Kubernetes cluster; the
command looks like this:</p>
<div class="highlight"><pre><span></span><code>doctl kubernetes cluster create model-services-cluster
</code></pre></div>
<p>The cluster should come up after a while. The default cluster size is 3
nodes, which should cost about $30 to run for a month. We'll shut the
cluster down later to save money.</p>
<p>Next, we need to integrate the Kubernetes cluster with Digital Ocean's
docker registry; this allows the cluster to pull images from the
docker registry we created above. To do this, execute this command:</p>
<div class="highlight"><pre><span></span><code>doctl kubernetes cluster registry add model-services-cluster
</code></pre></div>
<p>To access the cluster, doctl has another option that will set up the
kubectl tool for us:</p>
<div class="highlight"><pre><span></span><code>doctl kubernetes cluster kubeconfig save <span class="m">85866655</span>-708d-47a9-8797-bcca56a10401
</code></pre></div>
<p>The unique identifier is for the cluster that was just created and is
returned by the previous command. When the command finishes, the current
context in kubectl should be switched to the newly created cluster. To
list the contexts in kubectl, execute this command:</p>
<div class="highlight"><pre><span></span><code>kubectl config get-contexts
</code></pre></div>
<p>A listing of the contexts currently in the kubectl configuration should
appear, and there should be a star next to the new cluster's context. We
can get a list of the nodes in the cluster with this command:</p>
<div class="highlight"><pre><span></span><code>kubectl get nodes
</code></pre></div>
<p>Now that we have a cluster and are connected to it, we'll create a
namespace to hold the resources for our model deployment. We'll create a
namespace using this YAML manifest:</p>
<div class="highlight"><pre><span></span><code><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">v1</span><span class="w"></span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Namespace</span><span class="w"></span>
<span class="nt">metadata</span><span class="p">:</span><span class="w"></span>
<span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">model-services-namespace</span><span class="w"></span>
</code></pre></div>
<p>The manifest can be found in <a href="https://github.com/schmidtbri/regression-model/blob/master/kubernetes/namespace.yml">this
file</a>.
To apply the manifest to the cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code>kubectl create -f kubernetes/namespace.yml
</code></pre></div>
<p>To take a look at the namespaces, execute this command:</p>
<div class="highlight"><pre><span></span><code>kubectl get namespace
</code></pre></div>
<p>The new namespace should appear in the listing along with other
namespaces created by default by the system. To use the new
namespace for the rest of the operations, execute this command:</p>
<div class="highlight"><pre><span></span><code>kubectl config set-context --current --namespace=model-services-namespace
</code></pre></div>
<h2>Creating a Kubernetes Deployment</h2>
<p>We are now ready to actually create a deployment in the cluster. A
deployment is a resource created within the Kubernetes cluster that
provides declarative updates to individual pods and ReplicaSets. A pod
represents a single instance of the web service that is hosting our
model. We'll use a Deployment to launch a single instance of the service
in the cluster; the replica count can be raised later to run more
instances. The Deployment will manage the state of the Pods that hold
the service instances and make sure that the desired state is always
maintained in the cluster.</p>
<p>The Deployment is defined as YAML like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">apps/v1</span><span class="w"></span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Deployment</span><span class="w"></span>
<span class="nt">metadata</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance-charges-model-deployment</span><span class="w"></span>
<span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance-charges-model</span><span class="w"></span>
<span class="nt">spec</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">replicas</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span><span class="w"></span>
<span class="w"> </span><span class="nt">selector</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">matchLabels</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance-charges-model</span><span class="w"></span>
<span class="w"> </span><span class="nt">template</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">metadata</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">labels</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance-charges-model</span><span class="w"></span>
<span class="w"> </span><span class="nt">spec</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">containers</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance-charges-model</span><span class="w"></span>
<span class="w"> </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">registry.digitalocean.com/model-services-registry/insurance_charges_model:0.1.0</span><span class="w"></span>
<span class="w"> </span><span class="nt">ports</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">containerPort</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">80</span><span class="w"></span>
<span class="w"> </span><span class="nt">protocol</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">TCP</span><span class="w"></span>
<span class="w"> </span><span class="nt">imagePullPolicy</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Always</span><span class="w"></span>
<span class="w"> </span><span class="nt">resources</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">requests</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="s">"250m"</span><span class="w"></span>
</code></pre></div>
<p>The file containing the YAML is
<a href="https://github.com/schmidtbri/regression-model/blob/master/kubernetes/deployment.yml">here</a>.
The deployment specifies that there should be one replica of the docker
image running in the cluster. The "app=insurance-charges-model" label is
applied to the Pod and is used to select it later.
<p>The Deployment is created within the Kubernetes cluster with this
command:</p>
<div class="highlight"><pre><span></span><code>kubectl apply -f kubernetes/deployment.yml
</code></pre></div>
<p>Once the command finishes we can see the new deployment with this
command:</p>
<div class="highlight"><pre><span></span><code>kubectl get deployments
</code></pre></div>
<p>We can view the pods that are being managed by the deployment with this
command:</p>
<div class="highlight"><pre><span></span><code>kubectl get pods
</code></pre></div>
<p>The output should look something like this:</p>
<div class="highlight"><pre><span></span><code>NAME READY STATUS RESTARTS AGE
insurance-charges-model-deployment-7d58f6d569-zwjpw 1/1 Running 0 3m48s
</code></pre></div>
<h2>Creating a Kubernetes Service</h2>
<p>Now that we have a set of pods, we need to make them accessible to the
outside world. The Service resource within Kubernetes is used to select
a set of Pods and allow access to them through a single entry point. The
Service allows us to decouple the Pods and Deployment resources that
make up our REST service from the way that they are exposed to users.</p>
<p>The Service is defined as YAML like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">v1</span><span class="w"></span>
<span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Service</span><span class="w"></span>
<span class="nt">metadata</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance-charges-model-service</span><span class="w"></span>
<span class="nt">spec</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">type</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">LoadBalancer</span><span class="w"></span>
<span class="w"> </span><span class="nt">selector</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">insurance-charges-model</span><span class="w"></span>
<span class="w"> </span><span class="nt">ports</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">http</span><span class="w"></span>
<span class="w"> </span><span class="nt">protocol</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">TCP</span><span class="w"></span>
<span class="w"> </span><span class="nt">port</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">80</span><span class="w"></span>
<span class="w"> </span><span class="nt">targetPort</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">80</span><span class="w"></span>
</code></pre></div>
<p>The YAML file is
<a href="https://github.com/schmidtbri/regression-model/blob/master/kubernetes/service.yml">here</a>.
The Service is selecting the same Pods that are managed by the
Deployment resource which we created above by using the same selector.</p>
<p>The Service is created within the Kubernetes cluster with this command:</p>
<div class="highlight"><pre><span></span><code>kubectl apply -f kubernetes/service.yml
</code></pre></div>
<p>You can see the new service with this command:</p>
<div class="highlight"><pre><span></span><code>kubectl get services
</code></pre></div>
<p>The Service type is LoadBalancer, which means that the cloud provider is
providing a load balancer and public IP address through which we can
contact the service. To view details about the load balancer provided by
Digital Ocean for this Service, we'll execute this command:</p>
<div class="highlight"><pre><span></span><code>kubectl describe service insurance-charges-model-service <span class="p">|</span> grep <span class="s2">"LoadBalancer Ingress"</span>
</code></pre></div>
<p>The load balancer can take a while longer than the service to come up;
until it is running, the command won't return anything.
The IP address that the Digital Ocean load balancer sits behind will be
listed in the output of the command. To get access to the service, we'll
hit the IP address with a web browser:</p>
<p><img alt="Prediction Result" src="https://www.tekhnoal.com/11.png" width="100%"></p>
<p>We can access the service documentation through the load balancer; the
webpage is returned by the Pod that is running the REST service.</p>
<p>We'll try the same curl command as before to see if the model is
reachable:</p>
<div class="highlight"><pre><span></span><code>curl -X <span class="s1">'POST'</span> <span class="s1">'http://143.244.214.226/api/models/insurance_charges_model/prediction'</span> <span class="se">\</span>
-H <span class="s1">'accept: application/json'</span> <span class="se">\</span>
-H <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span>
-d <span class="s1">'{</span>
<span class="s1">"age": 65,</span>
<span class="s1">"sex": "male",</span>
<span class="s1">"bmi": 50,</span>
<span class="s1">"children": 5,</span>
<span class="s1">"smoker": true,</span>
<span class="s1">"region": "southwest"</span>
<span class="s1">}'</span>
</code></pre></div>
<p>A prediction was returned from the model:</p>
<div class="highlight"><pre><span></span><code>{"charges":46277.67}
</code></pre></div>
<h1>Deleting the Resources</h1>
<p>Now that we're done with the service we need to destroy the resources.
To destroy the load balancer, execute this command:</p>
<div class="highlight"><pre><span></span><code>doctl compute load-balancer delete <span class="se">\-</span>-force <span class="k">$(</span>kubectl get svc insurance-charges-model-service -o <span class="nv">jsonpath</span><span class="o">=</span><span class="s2">"{.metadata.annotations.kubernetes\.digitalocean\.com/load-balancer-id}"</span><span class="k">)</span>
</code></pre></div>
<p>To destroy the kubernetes cluster, execute this command:</p>
<div class="highlight"><pre><span></span><code>doctl k8s cluster delete <span class="m">85866655</span>-708d-47a9-8797-bcca56a10401
</code></pre></div>
<p>To destroy the docker registry, execute this command:</p>
<div class="highlight"><pre><span></span><code>doctl registry delete model-services-registry
</code></pre></div>
<h1>Closing</h1>
<p>This blog post was created as a demonstration of how to build and deploy
machine learning models quickly and easily. Although I didn't do any
deep explanations of how the different tools work, I made sure to link
to other resources from which you can learn more about them. The
techniques and packages used are all open source and can be easily
downloaded and used in other projects.</p>
<p>The dataset that we used happens to be useful for predicting insurance
charges, but the code in this project can be used to train a model based
on any regression data set because of the automated feature engineering
and automated machine learning techniques that we used. We should be
able to throw any dataset at the code and the automations that we built
will enable us to quickly build a model and deploy a RESTful service
with it.</p>
<p>Something that we can improve on in the future is to create a Helm chart
that we can use to deploy an ML model service quickly and easily. Since
the Kubernetes resources for the model service are likely to be very
similar to other model services, we should be able to create a Helm
chart that we can reuse to quickly spin up model services that follow
the same pattern as this one.</p>
<p>Another thing that we can improve on is the automated generation of
input and output schemas for the model. When we built the input and
output schemas for the model, we had to manually extract the field
information from the dataframes. By introspecting the dataframe
metadata, we should be able to automatically generate the input and
output schemas, which can be used to automatically generate the code in
the schemas.py module. This is just one way in which we can further
automate the deployment process of an ML model.</p>A RESTful ML Model Service2021-04-29T07:32:00-05:002021-04-29T07:32:00-05:00Brian Schmidttag:www.tekhnoal.com,2021-04-29:/rest-model-service.html<p>Sometimes you find yourself writing the same code over and over. When that starts happening you know it's time to take what you've learned and create a reusable piece of code that can be applied in the future. Because of the experience that we've gained in writing previous blog posts, I think that it is a good time to make a reusable service that can host any number of machine learning models.</p><h1>Introduction</h1>
<p>Sometimes you find yourself writing the same code over and over. When
that starts happening you know it's time to take what you've learned and
create a reusable piece of code that can be applied in the future.
Because of the experience that we've gained in writing previous blog
posts, I think that it is a good time to make a reusable service that
can host any number of machine learning models.</p>
<p>In previous blog posts we've built many different types of services that
can host ML models. In this blog post we'll aim at building a reusable
service that can host an ML model behind a RESTful API. APIs are called
RESTful when they follow the guidelines of the REST standard. REST
stands for <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">Representational State
Transfer</a>
and is an architectural style built on the HTTP protocol that is useful for building web
applications. RESTful APIs are widely used in production systems and are
an industry standard for integrating different systems.</p>
<p>The features that we want this reusable service to have are simple. We
want to be able to install the service code as a package through the
pip Python package manager. We want the API of the service
to follow well-established standards, in this case we'll follow the REST
standard for web APIs. We want to be able to configure the service to
host any number of ML models. Lastly, we want the service to be
self-documenting, so that we don't have to create OpenAPI documentation
for the service manually.</p>
<p>All of these things are possible and indeed easy to implement because we
will be relying on a common interface for all the ML models that the
service will host. This interface is the MLModel interface and it is
defined in another package that we've already created. This interface
and the package are fully described in a <a href="https://www.tekhnoal.com/introducing-ml-base-package.html">previous blog
post</a>.
By requiring that every model we want to host in the service fulfill the
requirements of the interface, we are able to write the service once
and reuse it.</p>
<p>The MLModel interface is very simple. It requires that a model class be
created that contains two methods: an __init__ method that
initializes the model object and a predict method that actually makes a
prediction. This approach is very similar to the one taken by Uber
in their internal ML platform; they describe how they structure their ML
model code
<a href="https://eng.uber.com/michelangelo-pyml/">here</a>.
SeldonCore, an open source project for deploying ML models, takes a
similar approach, which is described
<a href="https://docs.seldon.io/projects/seldon-core/en/latest/python/python_component.html">here</a>.
In this blog post we will leverage the standardization that the MLModel
interface makes possible to write a RESTful service that can host any
model that follows the standard.</p>
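<p>The contract can be sketched as an abstract base class with those two methods. The real MLModel base class in the ml_base package also carries metadata such as the model's qualified name (see the linked post); the toy model below is purely illustrative:</p>

```python
from abc import ABC, abstractmethod


class MLModel(ABC):
    """Sketch of the contract: initialize once, then predict many times."""

    @abstractmethod
    def __init__(self):
        """Load model parameters and any other state the model needs."""

    @abstractmethod
    def predict(self, data):
        """Make a prediction from the input data."""


class AddOneModel(MLModel):
    """Trivial model used only to show the interface being exercised."""

    def __init__(self):
        self._offset = 1

    def predict(self, data):
        return data + self._offset


model = AddOneModel()
print(model.predict(41))  # 42
```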
<h1>Package Structure</h1>
<p>The service codebase will be structured into the following files:</p>
<div class="highlight"><pre><span></span><code>- rest_model_service
- __init__.py
- configuration.py # data models for configuration
- generate_openapi.py # script to generate an openapi spec
- main.py # entry point for service
- routes.py # controllers for routes
- schemas.py # service schemas
- tests
- requirements.txt
- setup.py
- test_requirements.txt
</code></pre></div>
<p>This structure can be seen in the <a href="https://github.com/schmidtbri/rest-model-service">github
repository</a>.</p>
<h1>FastAPI</h1>
<p>Now that we have a set of requirements and have described our approach,
let's start building the REST service. For the web framework, we'll use
the popular FastAPI framework. FastAPI is a modern framework for
building web applications that uses python 3.6 and above. One of the
great things about it is that it uses type hints by default, which helps
to reduce the number of bugs in your code. By using the
<a href="https://pydantic-docs.helpmanual.io/">pydantic package</a> for
defining schemas, FastAPI can generate an OpenAPI specification for
your application without any extra effort. Because FastAPI supports
asynchronous operations, it is also one of the fastest python web
frameworks available. FastAPI is a great choice for our REST service
because it follows a number of best practices by default which will
raise the quality of our code. The ml_base package uses the pydantic
package to define model input and output schemas which makes interfacing
with FastAPI very easy.</p>
<p>We'll build up our understanding of how the service works by exploring
the individual endpoints of the service. An endpoint is simply a point
through which the service interacts with the outside world. The service
has two types of endpoints: the metadata endpoint and all of the model
endpoints. We'll talk about the metadata endpoint first.</p>
<h1>Model Metadata Endpoint</h1>
<p>The service needs to be able to expose information about the models that
it is hosting to client systems. To do this, we'll add an endpoint that
returns model metadata. The first thing we need to do is create the data
model for the information that the endpoint will return:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">ModelMetadata</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="sd">"""Metadata of a model."""</span>
<span class="n">display_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"The display name of the model."</span><span class="p">)</span>
<span class="n">qualified_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"The qualified name of the model."</span><span class="p">)</span>
<span class="n">description</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"The description of the model."</span><span class="p">)</span>
<span class="n">version</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"The version of the model."</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/rest_model_service/schemas.py#L6-L12">here</a>.</p>
<p>The ModelMetadata object represents one model that is being hosted by
the service. We actually want to be able to host many models within the
service, so we need to create a "collection" data model that can hold
many ModelMetadata objects:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">ModelMetadataCollection</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="sd">"""Collection of model metadata."""</span>
<span class="n">models</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">ModelMetadata</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"A collection of model descriptions."</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/rest_model_service/schemas.py#L15-L18">here</a>.</p>
<p>Now that we have the data models, we can build the function that the
client will interact with to get the model metadata:</p>
<div class="highlight"><pre><span></span><code><span class="k">async</span> <span class="k">def</span> <span class="nf">get_models</span><span class="p">():</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">models_metadata_collection</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
<span class="n">models_metadata_collection</span> <span class="o">=</span> <span class="n">ModelMetadataCollection</span><span class="p">(</span><span class="o">**</span><span class="p">{</span><span class="s2">"models"</span><span class="p">:</span> <span class="n">models_metadata_collection</span><span class="p">})</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="k">return</span> <span class="n">JSONResponse</span><span class="p">(</span><span class="n">status_code</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="n">models_metadata_collection</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="n">error</span> <span class="o">=</span> <span class="n">Error</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s2">"ServiceError"</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">))</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="k">return</span> <span class="n">JSONResponse</span><span class="p">(</span><span class="n">status_code</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="n">error</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/rest_model_service/routes.py#L18-L30">here</a>.</p>
<p>The function does not accept any parameters because we don't need to
select any specific model, we want to return metadata about all of the
models. The first thing the function does is instantiate the
ModelManager singleton. The ModelManager is a simple utility that we use
to manage model instances; we described how it operates in a <a href="https://brianschmidt-78145.medium.com/introducing-the-ml-base-package-1cc80ded39b4">previous
blog
post</a>.
The ModelManager object should already contain instances of models, and
by calling the get_models() method, we can get the metadata that we
will return to the client.</p>
<p>The models_metadata_collection object is instantiated using the data
model we created above, and returned as a JSONResponse to the client. If
anything goes wrong, the function catches the exception object and
returns a JSONResponse with the error details and a 500 status code.</p>
<h1>Prediction Endpoint</h1>
<p>To enable the service to host many instances of models, the code for the
prediction endpoint needs to be a bit more complex than the metadata
endpoint. We'll use a class instead of a function to create the
controller for the endpoint:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">PredictionController</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="n">MLModel</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="n">model</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/rest_model_service/routes.py#L33-L43">here</a>.</p>
<p>The class is initialized with a reference to the instance of the model
that it will be hosting. In this way, we can instantiate one controller
object for each model that is living inside of the model service. To
make predictions with the model, we'll add a method:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="k">return</span> <span class="n">JSONResponse</span><span class="p">(</span><span class="n">status_code</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="n">prediction</span><span class="p">)</span>
<span class="k">except</span> <span class="n">MLModelSchemaValidationException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="n">error</span> <span class="o">=</span> <span class="n">Error</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s2">"SchemaValidationError"</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">))</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="k">return</span> <span class="n">JSONResponse</span><span class="p">(</span><span class="n">status_code</span><span class="o">=</span><span class="mi">400</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="n">error</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="n">error</span> <span class="o">=</span> <span class="n">Error</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s2">"ServiceError"</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">))</span><span class="o">.</span><span class="n">dict</span><span class="p">()</span>
<span class="k">return</span> <span class="n">JSONResponse</span><span class="p">(</span><span class="n">status_code</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="n">error</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/rest_model_service/routes.py#L45-L55">here</a>.</p>
<p>The method is a dunder method named "__call__". This type of dunder
method makes an object instantiated from the class <a href="https://www.geeksforgeeks.org/__call__-in-python/">behave like a
function</a>,
which means that once we instantiate it, we'll be able to register it as
an endpoint on the service.</p>
<p>The method is pretty simple: it takes the data object and sends it to
the model to make a prediction. It then returns a JSONResponse that
contains the prediction and a 200 status code. This response will be
returned by the service if everything goes well. If the model raises an
MLModelSchemaValidationException, then the method will return a
JSONResponse with the 400 status code. For any other exceptions the
method will return a 500 status code.</p>
<p>In the next section we'll see how this class is instantiated in order to
allow the service to host any number of MLModel instances. We'll also
see how we use the input and output models provided by each model object
to create the documentation automatically.</p>
<h1>Application Startup</h1>
<p>At startup, the service does not know anything about which models it
will be hosting, so it needs to load a configuration file to find out.
In the main.py file, the configuration file is loaded from disk with
this code:</p>
<div class="highlight"><pre><span></span><code><span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"REST_CONFIG"</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">file_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">"REST_CONFIG"</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">file_path</span> <span class="o">=</span> <span class="s2">"rest_config.yaml"</span>
<span class="k">if</span> <span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span> <span class="ow">and</span> <span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">file_path</span><span class="p">):</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="n">configuration</span> <span class="o">=</span> <span class="n">yaml</span><span class="o">.</span><span class="n">full_load</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
<span class="n">configuration</span> <span class="o">=</span> <span class="n">Configuration</span><span class="p">(</span><span class="o">**</span><span class="n">configuration</span><span class="p">)</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">create_app</span><span class="p">(</span><span class="n">configuration</span><span class="o">.</span><span class="n">service_title</span><span class="p">,</span> <span class="n">configuration</span><span class="o">.</span><span class="n">models</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Could not find configuration file '</span><span class="si">{}</span><span class="s2">'."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">file_path</span><span class="p">))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/rest_model_service/main.py#L65-L79">here</a>.</p>
<p>The default configuration file path is "rest_config.yaml" which is used
if no other path is provided to the service. To provide an alternative
path, we can set it in the "REST_CONFIG" environment variable. Once we
have the yaml file loaded, we can call the create_app() function which
creates the FastAPI application object.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">create_app</span><span class="p">(</span><span class="n">service_title</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">models</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Model</span><span class="p">])</span> <span class="o">-></span> <span class="n">FastAPI</span><span class="p">:</span>
<span class="n">app</span><span class="p">:</span> <span class="n">FastAPI</span> <span class="o">=</span> <span class="n">FastAPI</span><span class="p">(</span><span class="n">title</span><span class="o">=</span><span class="n">service_title</span><span class="p">,</span> <span class="n">version</span><span class="o">=</span><span class="n">__version__</span><span class="p">)</span>
<span class="n">app</span><span class="o">.</span><span class="n">add_api_route</span><span class="p">(</span><span class="s2">"/"</span><span class="p">,</span>
<span class="n">get_root</span><span class="p">,</span>
<span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s2">"GET"</span><span class="p">])</span>
<span class="n">app</span><span class="o">.</span><span class="n">add_api_route</span><span class="p">(</span><span class="s2">"/api/models"</span><span class="p">,</span>
<span class="n">get_models</span><span class="p">,</span>
<span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s2">"GET"</span><span class="p">],</span>
<span class="n">response_model</span><span class="o">=</span><span class="n">ModelMetadataCollection</span><span class="p">,</span>
<span class="n">responses</span><span class="o">=</span><span class="p">{</span>
<span class="mi">500</span><span class="p">:</span> <span class="p">{</span><span class="s2">"model"</span><span class="p">:</span> <span class="n">Error</span><span class="p">}</span>
<span class="p">})</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/rest_model_service/main.py#L19-L35">here</a>.</p>
<p>The create_app() function first creates the app object with the service
title that we loaded from the configuration file and the version. We
then add two routes to the app: the root route and the model metadata
route. The root route simply reroutes the request to the /docs route
which hosts the auto-generated documentation. The model metadata route
returns metadata for all of the models hosted by the service.</p>
<p>The next thing the function does is actually load the models:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">models</span><span class="p">:</span>
<span class="n">model_manager</span><span class="o">.</span><span class="n">load_model</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">class_path</span><span class="p">)</span>
<span class="k">if</span> <span class="n">model</span><span class="o">.</span><span class="n">create_endpoint</span><span class="p">:</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">)</span>
<span class="n">controller</span> <span class="o">=</span> <span class="n">PredictionController</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">)</span>
<span class="n">controller</span><span class="o">.</span><span class="fm">__call__</span><span class="o">.</span><span class="vm">__annotations__</span><span class="p">[</span><span class="s2">"data"</span><span class="p">]</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">input_schema</span>
<span class="n">app</span><span class="o">.</span><span class="n">add_api_route</span><span class="p">(</span><span class="s2">"/api/models/</span><span class="si">{}</span><span class="s2">/prediction"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">),</span>
<span class="n">controller</span><span class="p">,</span>
<span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s2">"POST"</span><span class="p">],</span>
<span class="n">response_model</span><span class="o">=</span><span class="n">model</span><span class="o">.</span><span class="n">output_schema</span><span class="p">,</span>
<span class="n">description</span><span class="o">=</span><span class="n">model</span><span class="o">.</span><span class="n">description</span><span class="p">,</span>
<span class="n">responses</span><span class="o">=</span><span class="p">{</span>
<span class="mi">400</span><span class="p">:</span> <span class="p">{</span><span class="s2">"model"</span><span class="p">:</span> <span class="n">Error</span><span class="p">},</span>
<span class="mi">500</span><span class="p">:</span> <span class="p">{</span><span class="s2">"model"</span><span class="p">:</span> <span class="n">Error</span><span class="p">}</span>
<span class="p">})</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Skipped creating an endpoint for model:</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">))</span>
<span class="k">return</span> <span class="n">app</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/rest_model_service/main.py#L37-L62">here</a>.</p>
<p>The first thing we do is instantiate the ModelManager singleton. Next,
we'll process each model in the configuration. For each model, we'll
load it into the ModelManager and then create an endpoint for it. An
endpoint is only created for a model if the configuration sets the
"create_endpoint" option to true for that model.</p>
<p>Creating an endpoint for a model is a little tricky because we need to
dynamically create an endpoint and add all of the options that FastAPI
supports.</p>
<p>To create an endpoint for a model, we first need to get a reference to
the model from the ModelManager singleton. We then instantiate the
PredictionController class and pass the reference to the model to the
__init__() method of the class. We now have a function that we can
register with the FastAPI application as an endpoint controller. Before
we can do that, we need to add an annotation to the function that will
allow FastAPI to automatically create documentation for the endpoint.
We'll annotate the controller function with the pydantic type that the
model accepts as input. Now we are ready to register the function as a
controller, when we do that we also provide the FastAPI app with the
HTTP method, response pydantic model, description, and error response
models. All of these options give the FastAPI app information about the
endpoint which will be used later to auto-generate the documentation.</p>
<h1>Creating a Package</h1>
<p>This service will be most useful when it can be "added on" to a model
project so that it can provide the deployment functionality for a
machine learning model without becoming part of the codebase. If we take
this approach, then the rest_model_service package is installed in the
python environment and it will live as a dependency of the ml model
package.</p>
<p>To enable all of this, the rest_model_service package is published to
PyPI and can be installed with the pip package manager.
To install the package into your project, execute this command:</p>
<div class="highlight"><pre><span></span><code>pip install rest_model_service
</code></pre></div>
<p>Once the service package is installed, we can use it within an ML model
project to create a RESTful service for the model.</p>
<h1>Using the Service</h1>
<p>In order to try out the service we'll need a model that follows the
MLModel interface. There is a simple mocked model in the tests.mocks
module that we'll use to try out the service:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">IrisModelInput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">sepal_length</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">gt</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span> <span class="n">lt</span><span class="o">=</span><span class="mf">8.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Length of the sepal of the flower."</span><span class="p">)</span>
<span class="n">sepal_width</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">gt</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span> <span class="n">lt</span><span class="o">=</span><span class="mf">6.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Width of the sepal of the flower."</span><span class="p">)</span>
<span class="n">petal_length</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">gt</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">lt</span><span class="o">=</span><span class="mf">6.8</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Length of the petal of the flower."</span><span class="p">)</span>
<span class="n">petal_width</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">gt</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">lt</span><span class="o">=</span><span class="mf">3.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"Width of the petal of the flower."</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">Species</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="n">iris_setosa</span> <span class="o">=</span> <span class="s2">"Iris setosa"</span>
<span class="n">iris_versicolor</span> <span class="o">=</span> <span class="s2">"Iris versicolor"</span>
<span class="n">iris_virginica</span> <span class="o">=</span> <span class="s2">"Iris virginica"</span>
<span class="k">class</span> <span class="nc">IrisModelOutput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">species</span><span class="p">:</span> <span class="n">Species</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"Predicted species of the flower."</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">IrisModel</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="n">display_name</span> <span class="o">=</span> <span class="s2">"Iris Model"</span>
<span class="n">qualified_name</span> <span class="o">=</span> <span class="s2">"iris_model"</span>
<span class="n">description</span> <span class="o">=</span> <span class="s2">"Model for predicting the species of a flower based on its measurements."</span>
<span class="n">version</span> <span class="o">=</span> <span class="s2">"1.0.0"</span>
<span class="n">input_schema</span> <span class="o">=</span> <span class="n">IrisModelInput</span>
<span class="n">output_schema</span> <span class="o">=</span> <span class="n">IrisModelOutput</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="k">return</span> <span class="n">IrisModelOutput</span><span class="p">(</span><span class="n">species</span><span class="o">=</span><span class="s2">"Iris setosa"</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/tests/mocks.py#L7-L38">here</a>.</p>
<p>The mock model class works just like any other MLModel class, but it
always returns a prediction of "Iris setosa". As you can see, the model
references the IrisModelInput and IrisModelOutput pydantic models for
its input and output.</p>
<p>Once we have a model, we'll need to add a configuration file to the
project that will be used by the model service to find the models that
we want to deploy. The configuration file should look like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">service_title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">REST Model Service</span><span class="w"></span>
<span class="nt">models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">qualified_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">iris_model</span><span class="w"></span>
<span class="w"> </span><span class="nt">class_path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">tests.mocks.IrisModel</span><span class="w"></span>
<span class="w"> </span><span class="nt">create_endpoint</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
</code></pre></div>
<p>This file can be found in the examples folder
<a href="https://github.com/schmidtbri/rest-model-service/blob/0d1705cb62e6a942f90150da3bcf51e3e1265a25/examples/rest_config.yaml#L1-L5">here</a>.</p>
<p>To start up the service locally, we need to point the service at the
configuration file using an environment variable and then execute the
uvicorn command:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>examples/rest_config.yaml
uvicorn rest_model_service.main:app --reload
</code></pre></div>
<p>The service should start and we can view the documentation page on port
8000:</p>
<p><img alt="Documentation" src="https://www.tekhnoal.com/documentation.png" width="100%"></p>
<p>As you can see, the root endpoint and model metadata endpoint are part
of the API. We also have an automatically generated endpoint for the
iris_model mocked model that we added to the service through the
configuration. The model's input and output data models are also added
to documentation:</p>
<p><img alt="Model Endpoint" src="https://www.tekhnoal.com/model_endpoint.png" width="100%"></p>
<p>We can even try a prediction out:</p>
<p><img alt="Prediction" src="https://www.tekhnoal.com/prediction.png" width="100%"></p>
<p>Of course, the prediction will always be the same because it's a mocked
model.</p>
<h1>Generating the OpenAPI Contract</h1>
<p>The FastAPI application generates the OpenAPI service
specification at runtime, and it is available for download from the
documentation page. However, we'd like to generate the specification and
save it to a file in source control. To do this we can use a script
provided by the rest_model_service package called "generate_openapi".
The script is installed along with the package and is available on the
command line in any environment where the package is installed. Here is
how to use it:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">REST_CONFIG</span><span class="o">=</span>examples/rest_config.yaml
generate_openapi --output_file<span class="o">=</span>example.yaml
</code></pre></div>
<p>The script uses the same configuration that the service uses, but it
doesn't run the webservice. It instead uses the FastAPI framework to
generate the contract and saves it to the output file.</p>
<p>The generated contract will look like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">info</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">title</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">REST Model Service</span><span class="w"></span>
<span class="w"> </span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain"><version_placeholder></span><span class="w"></span>
<span class="nt">openapi</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">3.0.2</span><span class="w"></span>
<span class="nt">paths</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">/</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">get</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Root of API.</span><span class="w"></span>
<span class="w"> </span><span class="nt">operationId</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">get_root__get</span><span class="w"></span>
<span class="w"> </span><span class="nt">responses</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="s">'200'</span><span class="p p-Indicator">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">content</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">application/json</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">schema</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{}</span><span class="w"></span>
<span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Successful Response</span><span class="w"></span>
<span class="w"> </span><span class="nt">summary</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Get Root</span><span class="w"></span>
<span class="w"> </span><span class="nt">/api/models</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">get</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">List of models available.</span><span class="w"></span>
<span class="nn">...</span><span class="w"></span>
</code></pre></div>
<h1>Closing</h1>
<p>In this blog post we've shown how to create a web service that is
easy to install, configure, and deploy, and that can host any machine
learning model that we throw at it. By using the MLModel base class, any
model can be made to work with the service. When deploying machine
learning models to production systems, it's a common practice to create
a custom service that "wraps" around the model code and creates an
interface that other systems can use to access the model. With the
approach described in this blog post, the service is created
automatically by using the interface definition provided by the model
itself. Furthermore, the documentation is also created automatically by
using the tooling provided by FastAPI. Lastly, we've made the service
easy to add to any project by publishing the package to the PyPI
repository, from where it can be installed with a simple "pip
install" command.</p>
<p>The service currently does not allow any code other than model code
to be hosted. When deploying a model into a production setting, we often
have extra logic that needs to be deployed alongside the model but is
not technically part of it; this is usually called the "business logic"
of the solution. Granted, it is possible to put the business logic into
the model class and deploy it that way, but combining the code into one
class makes it harder to test and reason about. To fix this shortcoming,
we can add "plugin points" that let us run our own logic before and
after the model executes, which is where the business logic can
live.</p>
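<p>As a rough illustration of the "plugin point" idea, here is a minimal sketch of a decorator class that runs hooks before and after a wrapped model's predict() method. The class and hook names are hypothetical, chosen just for this example, and are not part of any package mentioned in this post.</p>

```python
class SimpleModel:
    """A stand-in model used only to demonstrate the decorator."""

    def predict(self, data: dict) -> dict:
        return {"result": data["value"] * 2}


class BusinessLogicDecorator:
    """Wraps a model and runs optional hooks around its predict() call."""

    def __init__(self, model, before=None, after=None):
        self._model = model
        self._before = before  # callable applied to the input dict
        self._after = after    # callable applied to the output dict

    def predict(self, data: dict) -> dict:
        if self._before is not None:
            data = self._before(data)
        prediction = self._model.predict(data)
        if self._after is not None:
            prediction = self._after(prediction)
        return prediction


# business logic: clamp negative inputs before prediction, tag outputs after
decorated = BusinessLogicDecorator(
    SimpleModel(),
    before=lambda d: {"value": max(d["value"], 0)},
    after=lambda p: {**p, "source": "decorated"},
)
```

<p>Because the decorator exposes the same predict() interface as the model it wraps, the business logic stays in its own class and can be tested independently of the model code.</p>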
<p>One of the ways in which we could improve the service in the future is
to allow more configuration of the models when they are instantiated by
the service. It's currently not possible to customize a model when it is
created by the service at startup time. In the future, it would be nice
to allow the service configuration to hold parameters that would be
passed to the model classes when they are instantiated.</p>Introducing the ml_base Package2021-02-22T07:54:00-05:002021-02-22T07:54:00-05:00Brian Schmidttag:www.tekhnoal.com,2021-02-22:/introducing-ml-base-package.html<p>The ml_base package defines a common set of base classes that are useful for working with machine learning model prediction code. The base classes define a set of interfaces that help to write ML code that is reusable and testable. The core of the ml_base package is the MLModel class which defines a simple interface for doing machine learning model prediction. In this blog post, we'll show how to use the MLModel class.</p><h1>Introducing the ml_base Package</h1>
<p>These examples run within a Jupyter notebook session. To clear out the results of cells that we don't want to see, we'll use the clear_output() function provided by Jupyter:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
</code></pre></div>
<p>To get started we'll install the ml_base package:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">ml_base</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<h2>Creating a Simple Model</h2>
<p>To show how to work with the MLModel base class we'll create a simple model that we can make predictions with. We'll use the scikit-learn library, so we'll need to install it:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">scikit</span><span class="o">-</span><span class="n">learn</span>
<span class="n">clear_output</span><span class="p">()</span>
</code></pre></div>
<p>Now we can write some code to train a model:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">svm</span>
<span class="kn">import</span> <span class="nn">pickle</span>
<span class="c1"># loading the Iris dataset</span>
<span class="n">iris</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_iris</span><span class="p">()</span>
<span class="c1"># instantiating an SVM model from scikit-learn</span>
<span class="n">svm_model</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">SVC</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span>
<span class="c1"># fitting the model</span>
<span class="n">svm_model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">iris</span><span class="o">.</span><span class="n">data</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">iris</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="c1"># serializing the model and saving it</span>
<span class="n">file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s2">"svc_model.pickle"</span><span class="p">,</span> <span class="s1">'wb'</span><span class="p">)</span>
<span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">svm_model</span><span class="p">,</span> <span class="n">file</span><span class="p">)</span>
<span class="n">file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div>
<h2>Creating a Wrapper Class for Your Model</h2>
<p>Now that we have a model object, we'll define a class that implements the prediction functionality for the code:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">array</span>
<span class="k">class</span> <span class="nc">IrisModel</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span>
<span class="n">file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">dir_path</span><span class="p">,</span> <span class="s2">"svc_model.pickle"</span><span class="p">),</span> <span class="s1">'rb'</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
<span class="n">file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="nb">dict</span><span class="p">):</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="n">data</span><span class="p">[</span><span class="s2">"sepal_length"</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s2">"sepal_width"</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s2">"petal_length"</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s2">"petal_width"</span><span class="p">]])</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">targets</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'setosa'</span><span class="p">,</span> <span class="s1">'versicolor'</span><span class="p">,</span> <span class="s1">'virginica'</span><span class="p">]</span>
<span class="n">species</span> <span class="o">=</span> <span class="n">targets</span><span class="p">[</span><span class="n">y_hat</span><span class="p">]</span>
<span class="k">return</span> <span class="p">{</span><span class="s2">"species"</span><span class="p">:</span> <span class="n">species</span><span class="p">}</span>
</code></pre></div>
<p>The class above wraps the pickled model object and makes the model easier to use by converting the inputs and outputs.
To use the model, all we need to do is this:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"sepal_length"</span><span class="p">:</span><span class="mf">1.0</span><span class="p">,</span>
<span class="s2">"sepal_width"</span><span class="p">:</span><span class="mf">1.1</span><span class="p">,</span>
<span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span>
<span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">1.3</span><span class="p">})</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{'species': 'virginica'}
</code></pre></div>
<h2>Creating an MLModel Class for Your Model</h2>
<p>The model is already much easier to use because it provides the prediction from a class. The user of the model doesn't
need to worry about loading the pickled model object, or converting the model's input into a numpy array. However, we
are still not using the MLModel abstract base class. Next, we'll implement part of the MLModel interface to show how
it works:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">ml_base</span> <span class="kn">import</span> <span class="n">MLModel</span>
<span class="k">class</span> <span class="nc">IrisModel</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">display_name</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s2">"Iris Model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">qualified_name</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s2">"iris_model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">description</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s2">"A model to predict the species of a flower based on its measurements."</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">version</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s2">"1.0.0"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span>
<span class="n">file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">dir_path</span><span class="p">,</span> <span class="s2">"svc_model.pickle"</span><span class="p">),</span> <span class="s1">'rb'</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
<span class="n">file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="nb">dict</span><span class="p">):</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="n">data</span><span class="p">[</span><span class="s2">"sepal_length"</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s2">"sepal_width"</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s2">"petal_length"</span><span class="p">],</span> <span class="n">data</span><span class="p">[</span><span class="s2">"petal_width"</span><span class="p">]])</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">targets</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'setosa'</span><span class="p">,</span> <span class="s1">'versicolor'</span><span class="p">,</span> <span class="s1">'virginica'</span><span class="p">]</span>
<span class="n">species</span> <span class="o">=</span> <span class="n">targets</span><span class="p">[</span><span class="n">y_hat</span><span class="p">]</span>
<span class="k">return</span> <span class="p">{</span><span class="s2">"species"</span><span class="p">:</span> <span class="n">species</span><span class="p">}</span>
</code></pre></div>
<p>The MLModel base class defines a set of properties that must be provided by any class that inherits from it. Because the IrisModel class now provides this metadata about the model, we can access it directly from the model object like this:</p>
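<p>To make the "must be provided" requirement concrete, here is a simplified sketch of how a base class can force subclasses to implement metadata properties using Python's standard abc module. This is an illustration of the pattern only, not the actual ml_base source, and the class names are hypothetical.</p>

```python
from abc import ABC, abstractmethod


class ModelBase(ABC):
    """A simplified base class that requires metadata properties."""

    @property
    @abstractmethod
    def qualified_name(self) -> str:
        """A unique, URL-safe identifier for the model."""

    @property
    @abstractmethod
    def display_name(self) -> str:
        """A human-readable name for the model."""


class ExampleModel(ModelBase):
    @property
    def qualified_name(self) -> str:
        return "example_model"

    @property
    def display_name(self) -> str:
        return "Example Model"


model = ExampleModel()
```

<p>A subclass that fails to implement every abstract property cannot be instantiated; Python raises a TypeError at construction time, which is how this pattern guarantees that the metadata is always available on a model object.</p>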
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>iris_model
</code></pre></div>
<p>The qualified name of the model uniquely identifies the instance of the model within the system. Right now the qualified name is hardcoded in the model's class, but this can be made more dynamic in the future. The qualified name should also be a string that is easy to embed in a URL, so it shouldn't have spaces or special characters.</p>
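<p>The URL-safety rule can be checked with a small helper. The function and the exact character set below are assumptions made for this sketch (lowercase letters, digits, and underscores, starting with a letter); they are not part of the ml_base package.</p>

```python
import re

# Hypothetical validator (not part of ml_base): accepts names made of
# lowercase letters, digits, and underscores, starting with a letter,
# so the name can be embedded in a URL path without escaping.
QUALIFIED_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def is_valid_qualified_name(name: str) -> bool:
    return QUALIFIED_NAME_PATTERN.match(name) is not None
```
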
<p>The model's display name is also available from the model object:</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">display_name</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Iris Model
</code></pre></div>
<p>The display name of a model should be a string that looks good in a user interface.</p>
<p>The model description is also available from the model object:</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">description</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>A model to predict the species of a flower based on its measurements.
</code></pre></div>
<p>The model version is also available as a string from the model object:</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">version</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">1.0.0</span><span class="w"></span>
</code></pre></div>
<p>As you can see, we didn't implement the input_schema and output_schema properties above, we'll add those next.</p>
<h2>Adding Schemas to Your Model</h2>
<p>To add schema information to the model class, we'll use the pydantic package. The pydantic package allows us to state the schema requirements of the model's input and output programmatically as Python classes:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span><span class="p">,</span> <span class="n">Field</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">ValidationError</span>
<span class="kn">from</span> <span class="nn">enum</span> <span class="kn">import</span> <span class="n">Enum</span>
<span class="k">class</span> <span class="nc">ModelInput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">sepal_length</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">gt</span><span class="o">=</span><span class="mf">5.0</span><span class="p">,</span> <span class="n">lt</span><span class="o">=</span><span class="mf">8.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"The length of the sepal of the flower."</span><span class="p">)</span>
<span class="n">sepal_width</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">gt</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span> <span class="n">lt</span><span class="o">=</span><span class="mf">6.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"The width of the sepal of the flower."</span><span class="p">)</span>
<span class="n">petal_length</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">gt</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">lt</span><span class="o">=</span><span class="mf">6.8</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"The length of the petal of the flower."</span><span class="p">)</span>
<span class="n">petal_width</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">gt</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">lt</span><span class="o">=</span><span class="mf">3.0</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s2">"The width of the petal of the flower."</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">Species</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
<span class="n">iris_setosa</span> <span class="o">=</span> <span class="s2">"Iris setosa"</span>
<span class="n">iris_versicolor</span> <span class="o">=</span> <span class="s2">"Iris versicolor"</span>
<span class="n">iris_virginica</span> <span class="o">=</span> <span class="s2">"Iris virginica"</span>
<span class="k">class</span> <span class="nc">ModelOutput</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
<span class="n">species</span><span class="p">:</span> <span class="n">Species</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s2">"The predicted species of the flower."</span><span class="p">)</span>
</code></pre></div>
<p>The ModelInput class inherits from the pydantic BaseModel class and it defines four required fields, all of them floating point numbers. The pydantic package allows for defining upper bounds and lower bounds for the values accepted by each field, and also a description for the field.</p>
<p>The ModelOutput is made up of a single field, an enumerated string that contains the predicted species of the flower.</p>
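<p>Before wiring the schemas into the model class, we can see the bounds enforcement in isolation. The snippet below repeats the ModelInput field constraints from above so that it is self-contained; constructing an instance with an out-of-range value raises a pydantic ValidationError, so invalid inputs never reach the model.</p>

```python
from pydantic import BaseModel, Field, ValidationError


# the same bounds as the ModelInput schema defined above
class ModelInput(BaseModel):
    sepal_length: float = Field(gt=5.0, lt=8.0)
    sepal_width: float = Field(gt=2.0, lt=6.0)
    petal_length: float = Field(gt=1.0, lt=6.8)
    petal_width: float = Field(gt=0.0, lt=3.0)


# a value inside every field's bounds is accepted
valid = ModelInput(sepal_length=6.0, sepal_width=2.1,
                   petal_length=1.2, petal_width=1.3)

# sepal_length=4.0 violates the gt=5.0 constraint, so pydantic raises
try:
    ModelInput(sepal_length=4.0, sepal_width=2.1,
               petal_length=1.2, petal_width=1.3)
    error_raised = False
except ValidationError:
    error_raised = True
```
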
<p>Now that we have the ModelInput and ModelOutput schemas defined as pydantic BaseModel classes, we'll add them to the IrisModel class by returning them from the input_schema and output_schema properties:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">ml_base.ml_model</span> <span class="kn">import</span> <span class="n">MLModel</span><span class="p">,</span> <span class="n">MLModelSchemaValidationException</span>
<span class="k">class</span> <span class="nc">IrisModel</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">display_name</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s2">"Iris Model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">qualified_name</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s2">"iris_model"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">description</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s2">"A model to predict the species of a flower based on its measurements."</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">version</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="s2">"1.0.0"</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="n">ModelInput</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="n">ModelOutput</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">dir_path</span><span class="p">,</span> <span class="s2">"svc_model.pickle"</span><span class="p">),</span> <span class="s1">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">ModelInput</span><span class="p">):</span>
<span class="c1"># creating a numpy array using the fields in the input object</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="n">data</span><span class="o">.</span><span class="n">sepal_length</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">sepal_width</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">petal_length</span><span class="p">,</span>
<span class="n">data</span><span class="o">.</span><span class="n">petal_width</span><span class="p">])</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># making a prediction, at this point its a number</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="c1"># converting the prediction from a number to a string</span>
<span class="n">targets</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"Iris setosa"</span><span class="p">,</span> <span class="s2">"Iris versicolor"</span><span class="p">,</span> <span class="s2">"Iris virginica"</span><span class="p">]</span>
<span class="n">species</span> <span class="o">=</span> <span class="n">targets</span><span class="p">[</span><span class="n">y_hat</span><span class="p">]</span>
<span class="c1"># returning the prediction inside an object</span>
<span class="k">return</span> <span class="n">ModelOutput</span><span class="p">(</span><span class="n">species</span><span class="o">=</span><span class="n">species</span><span class="p">)</span>
</code></pre></div>
<p>Notice that we are also using the pydantic models to validate the input before prediction and to
create an object that will be returned from the model's predict() method.</p>
<p>If we use the model class now, we'll get this result:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">ModelInput</span><span class="p">(</span>
<span class="n">sepal_length</span><span class="o">=</span><span class="mf">6.0</span><span class="p">,</span>
<span class="n">sepal_width</span><span class="o">=</span><span class="mf">2.1</span><span class="p">,</span>
<span class="n">petal_length</span><span class="o">=</span><span class="mf">1.2</span><span class="p">,</span>
<span class="n">petal_width</span><span class="o">=</span><span class="mf">1.3</span><span class="p">))</span>
<span class="n">prediction</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>ModelOutput(species=&lt;Species.iris_virginica: 'Iris virginica'&gt;)
</code></pre></div>
<p>By adding input and output schemas to the model, we can automate many other operations later on, and we can query
the model object itself for its schemas. The pydantic package generates a JSON schema from the fields of the model's input and output schema objects:</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>{'title': 'ModelInput',
'type': 'object',
'properties': {'sepal_length': {'title': 'Sepal Length',
'description': 'The length of the sepal of the flower.',
'exclusiveMinimum': 5.0,
'exclusiveMaximum': 8.0,
'type': 'number'},
'sepal_width': {'title': 'Sepal Width',
'description': 'The width of the sepal of the flower.',
'exclusiveMinimum': 2.0,
'exclusiveMaximum': 6.0,
'type': 'number'},
'petal_length': {'title': 'Petal Length',
'description': 'The length of the petal of the flower.',
'exclusiveMinimum': 1.0,
'exclusiveMaximum': 6.8,
'type': 'number'},
'petal_width': {'title': 'Petal Width',
'description': 'The width of the petal of the flower.',
'exclusiveMinimum': 0.0,
'exclusiveMaximum': 3.0,
'type': 'number'}},
'required': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">model</span><span class="o">.</span><span class="n">output_schema</span><span class="o">.</span><span class="n">schema</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'ModelOutput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'species'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/Species'</span><span class="p">}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'required'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'species'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'definitions'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'Species'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Species'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'An enumeration.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'Iris setosa'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Iris versicolor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Iris virginica'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}}}</span><span class="w"></span>
</code></pre></div>
<p>Although it is not required to use the pydantic package to create model schemas, it is recommended. The pydantic
package is installed as a dependency of the ml_base package.</p>
<h2>Using the ModelManager Class</h2>
<p>The ModelManager class is provided to help manage model objects. It is a singleton class that is designed to enable
model instances to be instantiated once during the lifecycle of a process and accessed many times:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">ml_base.utilities</span> <span class="kn">import</span> <span class="n">ModelManager</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
</code></pre></div>
<p>Because it is a singleton object, a reference to the same object is returned no matter how many times we instantiate it:</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="nb">id</span><span class="p">(</span><span class="n">model_manager</span><span class="p">))</span>
<span class="n">another_model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">id</span><span class="p">(</span><span class="n">another_model_manager</span><span class="p">))</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="mf">4505980208</span><span class="w"></span>
<span class="mf">4505980208</span><span class="w"></span>
</code></pre></div>
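<p>The singleton behavior can be implemented in a few ways; the sketch below shows one common approach, overriding __new__() so that every instantiation returns the same object. This is only an illustration of the pattern, not the actual ml_base implementation:</p>
<div class="highlight"><pre><span></span><code>class SingletonExample:
    _instance = None

    def __new__(cls):
        # create the instance only on the first call, then reuse it
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

first = SingletonExample()
second = SingletonExample()
assert first is second  # both names point at the same object
</code></pre></div>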
<p>You can add model instances to the ModelManager singleton by asking it to instantiate the model class:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span><span class="o">.</span><span class="n">load_model</span><span class="p">(</span><span class="s2">"__main__.IrisModel"</span><span class="p">)</span>
</code></pre></div>
<p>The load_model() method is able to find the MLModel class that we defined above and instantiate it; after that, it stores a reference to the instance internally.</p>
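<p>Finding a class from a string like "__main__.IrisModel" is typically done with dynamic importing. The helper below sketches the general technique; instantiate_by_path() is a hypothetical name for illustration and makes no claim about ml_base's actual implementation:</p>
<div class="highlight"><pre><span></span><code>import importlib

def instantiate_by_path(class_path):
    # split "package.module.ClassName" into its module and class parts
    module_name, class_name = class_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    model_class = getattr(module, class_name)
    return model_class()

# works with any importable class, e.g. from the standard library:
ordered_dict = instantiate_by_path("collections.OrderedDict")
</code></pre></div>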
<p>The ModelManager is also able to save references to model instances that were instantiated in some other way by using the add_model() method:</p>
<div class="highlight"><pre><span></span><code><span class="n">another_iris_model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">model_manager</span><span class="o">.</span><span class="n">add_model</span><span class="p">(</span><span class="n">another_iris_model</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">ValueError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">e</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>A model with the same qualified name is already in the ModelManager singleton.
</code></pre></div>
<p>In this case, the ModelManager did not save the instance of the IrisModel because we already had an instance of the model. The models are uniquely identified by their qualified name properties.</p>
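<p>The duplicate check itself is easy to picture as a dictionary keyed by qualified name. The registry below is a simplified sketch of that idea, not ml_base's code; the class and method names are illustrative:</p>
<div class="highlight"><pre><span></span><code>class ModelRegistryExample:
    def __init__(self):
        self._models = {}

    def add_model(self, qualified_name, model):
        # refuse to register two models under the same qualified name
        if qualified_name in self._models:
            raise ValueError(
                "A model with the same qualified name is already registered.")
        self._models[qualified_name] = model

registry = ModelRegistryExample()
registry.add_model("iris_model", object())
try:
    registry.add_model("iris_model", object())
except ValueError as e:
    print(e)
</code></pre></div>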
<p>The ModelManager instance can list the models that it contains with the get_models() method. The details of the instance of IrisModel that we just created are returned:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>[{'display_name': 'Iris Model',
'qualified_name': 'iris_model',
'description': 'A model to predict the species of a flower based on its measurements.',
'version': '1.0.0'}]
</code></pre></div>
<p>The ModelManager instance can return the metadata of any of the models. The metadata includes the input and output schemas as well:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span><span class="o">.</span><span class="n">get_model_metadata</span><span class="p">(</span><span class="s2">"iris_model"</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'display_name'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Iris Model'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'qualified_name'</span><span class="p">:</span><span class="w"> </span><span class="s1">'iris_model'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'A model to predict the species of a flower based on its measurements.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'version'</span><span class="p">:</span><span class="w"> </span><span class="s1">'1.0.0'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'input_schema'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'ModelInput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'sepal_length'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Sepal Length'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'The length of the sepal of the flower.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'exclusiveMinimum'</span><span class="p">:</span><span class="w"> </span><span class="mf">5.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'exclusiveMaximum'</span><span class="p">:</span><span class="w"> </span><span class="mf">8.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'number'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'sepal_width'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Sepal Width'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'The width of the sepal of the flower.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'exclusiveMinimum'</span><span class="p">:</span><span class="w"> </span><span class="mf">2.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'exclusiveMaximum'</span><span class="p">:</span><span class="w"> </span><span class="mf">6.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'number'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'petal_length'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Petal Length'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'The length of the petal of the flower.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'exclusiveMinimum'</span><span class="p">:</span><span class="w"> </span><span class="mf">1.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'exclusiveMaximum'</span><span class="p">:</span><span class="w"> </span><span class="mf">6.8</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'number'</span><span class="p">},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'petal_width'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Petal Width'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'The width of the petal of the flower.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'exclusiveMinimum'</span><span class="p">:</span><span class="w"> </span><span class="mf">0.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'exclusiveMaximum'</span><span class="p">:</span><span class="w"> </span><span class="mf">3.0</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'number'</span><span class="p">}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'required'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'sepal_length'</span><span class="p">,</span><span class="w"> </span><span class="s1">'sepal_width'</span><span class="p">,</span><span class="w"> </span><span class="s1">'petal_length'</span><span class="p">,</span><span class="w"> </span><span class="s1">'petal_width'</span><span class="p">]},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'output_schema'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'ModelOutput'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'object'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'properties'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'species'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'$ref'</span><span class="p">:</span><span class="w"> </span><span class="s1">'#/definitions/Species'</span><span class="p">}},</span><span class="w"></span>
<span class="w"> </span><span class="s1">'required'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'species'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'definitions'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'Species'</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="s1">'title'</span><span class="p">:</span><span class="w"> </span><span class="s1">'Species'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'description'</span><span class="p">:</span><span class="w"> </span><span class="s1">'An enumeration.'</span><span class="p">,</span><span class="w"></span>
<span class="w"> </span><span class="s1">'enum'</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s1">'Iris setosa'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Iris versicolor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Iris virginica'</span><span class="p">],</span><span class="w"></span>
<span class="w"> </span><span class="s1">'type'</span><span class="p">:</span><span class="w"> </span><span class="s1">'string'</span><span class="p">}}}}</span><span class="w"></span>
</code></pre></div>
<p>The ModelManager can return a reference to the instance of any model that it is holding:</p>
<div class="highlight"><pre><span></span><code><span class="n">iris_model</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model</span><span class="p">(</span><span class="s2">"iris_model"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">iris_model</span><span class="o">.</span><span class="n">display_name</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Iris Model
</code></pre></div>
<p>The instance is identified by the qualified name of the model.</p>
<p>Lastly, a model instance can be removed by calling the remove_model() method:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span><span class="o">.</span><span class="n">remove_model</span><span class="p">(</span><span class="s2">"iris_model"</span><span class="p">)</span>
<span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>[]
</code></pre></div>
<p>To clear the ModelManager instance, you can call the clear_instance() method:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span><span class="o">.</span><span class="n">clear_instance</span><span class="p">()</span>
</code></pre></div>
<p>To create a new singleton you have to instantiate the ModelManager again:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
</code></pre></div>
<h1>10 Ways to Deploy a Machine Learning Model</h1>
<p>Brian Schmidt, 2020-10-28, tag:www.tekhnoal.com,2020-10-28:/10-ways-to-deploy-an-ml-model.html</p>
<p>This blog post builds on the ideas started in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog
posts</a>.</p>
<p>This blog post also references previous blog posts in which I
deployed the same ML model in several different ways. I deployed
the model as a batch job in this <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog
post</a>,
as a task queue in this <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">blog
post</a>,
inside an AWS Lambda in this <a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">blog
post</a>,
as a Kafka streaming application in this <a href="https://www.tekhnoal.com/streaming-ml-model-deployment.html">blog
post</a>,
as a gRPC service in this <a href="https://www.tekhnoal.com/grpc-ml-model-deployment.html">blog
post</a>,
as a MapReduce job in this <a href="https://www.tekhnoal.com/map-reduce-ml-model-deployment.html">blog
post</a>,
as a Websocket service in this <a href="https://www.tekhnoal.com/websocket-ml-model-deployment.html">blog
post</a>,
as a ZeroRPC service in this <a href="https://www.tekhnoal.com/zerorpc-ml-model-deployment.html">blog
post</a>,
and as an Apache Beam job in this <a href="https://www.tekhnoal.com/apache-beam-ml-model-deployment.html">blog
post</a>.</p>
<h1>Introduction</h1>
<p>In previous blog posts we've seen how it is possible to deploy the same
model in ten different ways. The model itself was developed one time and
released as a package, which was then used in each deployment. These
blog posts started as an exercise in finding new and interesting ways to
deploy an ML model, so we decided to write this blog post about some of
the things that we've learned along the way.</p>
<p>In order to be able to deploy the same model in 10 different ways, we
needed to build the model so that it was compatible with all the
different ways we wanted to deploy it. We also needed to make it easy to
install and to make sure that the model published metadata about itself.
All of these features of the model became very important once we needed
to deploy it into a real software system.</p>
<p>In a <a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous blog
post</a>,
we developed a model that we called the "iris_model". This model was
designed for the purposes of the blog posts that we planned to write
later on, so it followed several best practices that we will be
describing in this blog post. To make sure that the model was compatible
with every deployment option we wanted to pursue, we needed to build it
to work as a software component, as a software library, and as a
software package. In this blog post we'll describe how and why these
approaches make it easier to deploy the model.</p>
<p>To be able to abstract away the details of an ML model from the code
that is using it, we developed the MLModel base class in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">these</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">blog
posts</a>.
The base class is used to create a standard interface for the prediction
code of an ML model, which makes it easier to deploy the model. This
approach made it possible to write the model deployment code in such a
way that it can support any model that implements the MLModel interface.
This approach can be thought of as applying the strategy design pattern
to machine learning models. In this blog post we'll describe how the
strategy pattern is useful in ML model deployments.</p>
<p>When we started implementing all of the different deployments for the
model, we started seeing patterns around the way that the model is
accessible to its clients. These patterns coalesced into a few different
classes of model deployments which help to talk about the strengths and
weaknesses of each approach to deploying the model. In this blog post,
we'll describe an ontology that can help developers to talk about and
choose the best approach to deploying an ML model.</p>
<h1>ML Models as Software Components</h1>
<p>To create an ML model that is easy to deploy, we need to build it as a
software component. A software component is simply a small part of a
bigger software system that can be easily isolated from the rest of the
system. That is to say, the component is not deeply tied to the rest of
the system and it exposes an interface so that the rest of the system
can access it. A software component is designed to fulfill a small part
of the requirements of a larger software system, and to be easy to
integrate with other software components in the system. Good software
components are designed to be reused in many contexts and must follow
good design patterns to achieve this goal.</p>
<p>One of the most important parts of a software component is the public
API of the component. The API of the IrisModel class has proven to be
very simple and adaptable to a wide variety of technologies. For
example, when we deployed the IrisModel as a <a href="https://www.tekhnoal.com/websocket-ml-model-deployment.html">Websocket
service</a>,
we didn't need to rewrite any of the model code to adapt it to the model
component's API. The reason for this is that the <a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/iris_model/iris_predict.py#L10-L67">IrisModel
class</a>
inherits from the <a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/iris_model/iris_predict.py#L10-L67">MLModel
interface</a>.
This interface has a few requirements: your model must instantiate
itself, it must receive prediction requests, and it must publish certain
metadata about itself. By creating a standard interface around these
requirements, the MLModel interface makes it possible to deploy a wide
range of machine learning models in the same way.</p>
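<p>The shape of such an interface can be sketched with Python's abc module. The class below captures the requirements mentioned above; the property and method names are illustrative and do not necessarily match ml_base's MLModel exactly:</p>
<div class="highlight"><pre><span></span><code>from abc import ABC, abstractmethod

class PredictionModel(ABC):
    """Illustrative interface: construct, publish metadata, predict."""

    @property
    @abstractmethod
    def qualified_name(self):
        """Metadata that uniquely identifies the model."""
        raise NotImplementedError()

    @abstractmethod
    def predict(self, data):
        """Receive a prediction request and return a prediction."""
        raise NotImplementedError()

class ExampleModel(PredictionModel):
    @property
    def qualified_name(self):
        return "example_model"

    def predict(self, data):
        return {"result": 42}

model = ExampleModel()
</code></pre></div>
<p>Any deployment code written against PredictionModel can then host ExampleModel, or any other implementation, without changes.</p>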
<p>When we designed the MLModel interface we made sure that it would not
enforce any specific technology on the user. For example, there is no
requirement that says that the models that implement the MLModel
interface must use a specific serialization and deserialization
standard. In all of the blog posts where we deployed the iris_model
package we used JSON for serialization and deserialization, but this was
an implementation detail that can easily be changed since the model code
itself does not do any serialization or deserialization. Another
important aspect of the design is the fact that the MLModel interface
does not enforce any particular integration pattern on the code. For
example, we were able to create a <a href="https://www.tekhnoal.com/using-ml-model-abc.html">RESTful
service</a>
and <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">a batch
job</a>
with the same model. In fact, the choice of deployment technology had no
effect on the model codebase. This makes it possible to reuse the same
model in many different contexts.</p>
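<p>To make this concrete, here is a sketch of how a deployment layer can own serialization while the model only ever sees native Python objects. DummyModel and handle_request are hypothetical names used only for this illustration:</p>
<div class="highlight"><pre><span></span><code>import json

class DummyModel:
    def predict(self, data):
        return {"species": "Iris setosa"}

def handle_request(model, request_body):
    # the deployment code deserializes the request...
    data = json.loads(request_body)
    # ...the model itself never touches JSON...
    prediction = model.predict(data)
    # ...and the deployment code serializes the response
    return json.dumps(prediction)

response = handle_request(DummyModel(), '{"sepal_length": 6.0}')
</code></pre></div>
<p>Swapping JSON for another format only changes handle_request(); the model code is untouched.</p>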
<p>Certain technologies required advanced knowledge of the schema of the
data that the model component would receive and send back. For example,
the <a href="https://www.tekhnoal.com/grpc-ml-model-deployment.html">gRPC
service</a>
required that we compile a protocol buffer from the input and output
schemas of the model. In this case we were able to isolate the
requirements of the deployment from the model itself by leveraging the
schema metadata provided by the model. In other cases, the schema
metadata was only useful for documentation purposes, since a user of the
model would need to know about the model's input and output schemas to be
able to use it. Because we return schema information from the API of the
ML model software component, we were able to handle this situation
smoothly.</p>
<h1>ML Models as Libraries</h1>
<p>To create an ML model that is easy to deploy, we must build it so that
it works as a software library. A software library is a collection of
reusable software components that can be used in many different
contexts. A library is designed and built so that it is reusable.</p>
<p>By treating a machine learning model as a library we gain many different
benefits. For example, models can easily be reused in many different
services and applications without having to copy and paste the model
code and parameters. There is no need to embed an ML model inside of a
codebase in such a way that it cannot be reused somewhere else because
the library can be installed into a project. When we used the
iris_model library in our deployments, all we had to do was execute
"from iris_model.iris_predict import IrisModel" and the model would be
available to be used.</p>
<p>Another benefit that we gain when we treat ML models as libraries is
that it is easy to version them. Since libraries are built and released
many times, everyone understands how to version them and release them
for use by other developers. The semantic versioning standard has been
used widely in the software world and we used it to version the
iris_model package. One of the main benefits of a strong versioning
standard for ML models is that everyone understands that the ML model
will be evolving in the future, and that they can access newer versions
of the model by installing a newer version of the library.</p>
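<p>Semantic version strings compare correctly once they are parsed into tuples of integers, which is a handy trick whenever code needs to reason about model versions. A small, self-contained illustration:</p>
<div class="highlight"><pre><span></span><code>def parse_semver(version):
    # "1.10.0" becomes the tuple (1, 10, 0), which compares numerically
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

versions = ["1.10.0", "1.2.0", "0.9.1"]
ordered = sorted(versions, key=parse_semver)
# a plain string sort would incorrectly place "1.10.0" before "1.2.0"
</code></pre></div>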
<p>By thinking about ML models as libraries we break the pattern of making
custom models for very specific use cases. If we are going to spend the
time and effort to build a complex ML model, why not make it easy to
reuse in different contexts? This requires a bit of realignment in most
cases, but it is certainly possible.</p>
<h1>ML Models as Packages</h1>
<p>To create an ML model that is easy to deploy, we must build it so that
it is a software package. A software package is a distributable file that
contains the necessary files to install a software component or library
in the programming environment. Software packages are usually managed
using package managers. Software libraries are usually released as
packages as well, to make them easy to install.</p>
<p>One of the most important factors that allowed us to deploy the
IrisModel in 10 different ways is the fact that the model code is
isolated inside of a Python package. The first two blog posts were
concerned with creating a model codebase that could be installed into
any python environment. Once we could install the model as a python
package with the pip install command, it was easy to reuse the same
model in many different contexts.</p>
<p>An important part of this approach is the fact that we can install all
of the dependencies of the model package automatically when the model
package is installed. Often, a model that runs on one person's computer
won't run on another's because dependency management is not taken care
of. To create a Python package, the dependencies of the package must be
listed in the setup.py file of the project; because of this, the ML
model is much easier to work with and can be installed by anybody. For
example, the iris_model package
lists the exact version of scikit-learn that it needs, which takes the
guesswork out of installing and using it.</p>
<p>Lastly, by distributing the ML model as a package, we're able to
download and install the model parameters along with the model code.
Oftentimes, an ML model is just a file that contains serialized model
parameters (often a pickle file). However, distributing a model this way
ignores the fact that we might need to install some custom prediction
code along with the model parameters. By using a package manager, we are
able to ensure that the model parameters and the prediction codebase are
installed correctly into the programming environment. In the case of the
IrisModel package, the model parameters were installed by including the
file in the package's manifest which ensures that the parameters are
copied into the distributable file.</p>
<h1>ML Models and the Strategy Pattern</h1>
<p>The strategy pattern is a design pattern used in object oriented design.
It is a behavioral design pattern that allows a software component to
select an appropriate algorithm at runtime to execute a task. The
strategy pattern is applied by defining an interface that every
implementation of the strategy must inherit and implement. The MLModel
class that the IrisModel class inherits from fulfills this purpose. The
benefit that we gain from using the strategy pattern is that we can
write code that doesn't care about the details of a machine learning
model's prediction algorithm, because it can use any algorithm that
meets the requirements of the interface.</p>
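<p>A minimal sketch of this arrangement is shown below; the real MLModel
base class also exposes metadata and input/output schemas, and the toy
IrisModel here only stands in for the real implementation:</p>

```python
# Minimal sketch of the strategy pattern for ML models; the real
# MLModel base class also exposes metadata and input/output schemas.
from abc import ABC, abstractmethod

class MLModel(ABC):
    """Interface that every deployable model strategy must implement."""

    @abstractmethod
    def predict(self, data: dict) -> dict:
        """Make a prediction from a dictionary of input features."""

class IrisModel(MLModel):
    """Toy stand-in for the real iris_model implementation."""

    def predict(self, data: dict) -> dict:
        # a real implementation would run the serialized classifier here
        return {"species": "setosa"}

# deployment code depends only on the interface, not on the algorithm
def deploy(model: MLModel, data: dict) -> dict:
    return model.predict(data)
```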
<p>In practice, this means that we were able to deploy an ML model simply
by installing the package and writing a reference to the class that
implements the MLModel interface into the configuration. The deployment
code reads the configuration at runtime, loads the right model, and
makes it available to the client. Some model deployments that we built
were even able to handle multiple models. For example, the ZeroRPC
service that we created in <a href="https://www.tekhnoal.com/zerorpc-ml-model-deployment.html">this blog
post</a>
is able to dynamically create an endpoint for every model that is listed
in the configuration.</p>
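<p>Configuration-driven loading of this kind can be sketched as follows;
the configuration structure shown is illustrative:</p>

```python
# Sketch of configuration-driven model loading; the config entries
# shown are illustrative.
import importlib

config = [
    {"module_name": "iris_model.iris_predict", "class_name": "IrisModel"},
]

def load_models(entries):
    """Instantiate every model class listed in the configuration."""
    models = []
    for entry in entries:
        module = importlib.import_module(entry["module_name"])
        model_class = getattr(module, entry["class_name"])
        models.append(model_class())
    return models

# in an environment with the iris_model package installed,
# load_models(config) would return a list containing an IrisModel object
```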
<p>By creating models as components and making them available as packages,
we're able to make models reusable in many different situations. When we
use the strategy pattern, we get a similar benefit, because the pattern
makes it possible to reuse the model deployment code to deploy any model
in the future. As long as the model we want to deploy implements the
MLModel interface, we are able to reuse the deployment codebase to
deploy it. In the future, it would be easy to build reusable codebases
that can deploy models: the code would be configured with the model that
needs to be deployed, and there would be no need to create a custom
service for each model we wanted to deploy.</p>
<h1>An Ontology of ML Model Deployments</h1>
<p>Now that we have deployed the same model in ten different ways, we can
compare and contrast the ways the model was deployed. This section tries
to build a complete picture of the effect that a deployment option can
have on the way we can use the model.</p>
<h2>Interactive and Non-Interactive Model Deployments</h2>
<p>ML models can be deployed in an interactive manner and a non-interactive
manner. A model is deployed "interactively" when a client of the model
is able to request predictions from the model and get a prediction
directly back without waiting an indeterminate amount of time to get the
prediction. Interactive model deployments make the model directly
available to the client through an API and make it possible for the
client to send in any data allowed by the model's input schema to make a
prediction. In "non-interactive" model deployments, the client is not
able to send data to the model directly, which usually means that the
client has to access predictions that were previously stored in a data
store. The distinction between interactive and non-interactive model
deployments can have a large impact on the design of the client systems
that make use of the ML model. If a model is deployed non-interactively,
the clients of the system don't have direct access to the model and they
can't send arbitrary data to it; the only predictions available from the
model are the ones previously made and stored.</p>
<p>An example of an interactive deployment is the REST service that we
built in <a href="https://www.tekhnoal.com/using-ml-model-abc.html">this blog
post</a>.
The service is designed to run continuously, which means that a client
can contact the service anytime, request a prediction, and get a
prediction back directly from the model. An example of a non-interactive
deployment is the batch job that we built in this <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog
post</a>,
since a user of the model can only access the predictions that are saved
by the batch job. At first sight, it would seem that the task queue
deployment that we built in <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">this blog
post</a>
is non-interactive because the user has to wait to get a prediction.
However, the task queue is actually interactive because the predictions
are always made from the input provided by the client and the
predictions become available to the client after the asynchronous task
completes.</p>
<h2>Single-Record and Batch Model Deployments</h2>
<p>Single-record model deployments are designed to receive inputs from
clients, make a single prediction, and return the results to the client.
Batch model deployments are designed to receive many inputs from the
client system, make predictions and return the results to the client as
a batch of records. Batch systems often make better use of resources
because they are able to
<a href="https://en.wikipedia.org/wiki/Array_programming">vectorize</a>
their operations, which makes them more efficient.
Single-record systems are usually more responsive to clients because
they are able to quickly return a result.</p>
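<p>The efficiency gain from vectorization can be seen in a toy example
like the one below, which assumes NumPy is available; the batch
computation replaces a per-record Python loop with a single call:</p>

```python
# Toy illustration of vectorization (assumes NumPy is available): the
# batch computation replaces a per-record Python loop with one call.
import numpy as np

inputs = np.random.rand(10000, 4)  # 10,000 records with 4 features each

# single-record style: process one input at a time
single_record_results = [row.sum() for row in inputs]

# batch style: one vectorized operation over the whole dataset,
# which is typically much faster and yields the same results
batch_results = inputs.sum(axis=1)
```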
<p>System performance can be measured in two ways: throughput and latency.
Throughput is defined as the number of records that can be processed by
the system in a given period of time. Latency is the amount of time it
takes the system to process a single request. A single-record model
deployment is often optimizing for the total latency of a single
request, and a batch model deployment is often optimizing for the total
throughput of the system.</p>
<p>An example of a single-record model deployment is the gRPC service that
we built in this <a href="https://www.tekhnoal.com/grpc-ml-model-deployment.html">blog
post</a>.
The gRPC service only allows one prediction to be made for each RPC call
to the model; this is enforced in the protocol buffer interface definition of
the service which does not allow arrays of prediction inputs to be
received by the service. An example of a batch model deployment is the
MapReduce job we built in this <a href="https://www.tekhnoal.com/map-reduce-ml-model-deployment.html">blog
post</a>.
The MapReduce system is specifically designed to allow massive parallel
batch jobs that run across multiple computers in a cluster. The system
is most efficient when processing large datasets because of the amount
of time it takes to start a processing run. The distinction between
single-record and batch deployments can sometimes be hard to draw
because we can support multiple predictions in the gRPC service API, as
long as the client is willing to wait for all of the predictions to
complete. As always, there are many tradeoffs that we can make between
the two extremes.</p>
<h2>Synchronous and Asynchronous Model Deployments</h2>
<p>Synchronous ML model deployments are characterized by the client being
blocked while the model is making a prediction. An asynchronous model
deployment allows the client system to request a prediction from the
model and not wait for the prediction to complete to continue
processing. Typically, an asynchronous deployment allows the client to
retrieve the model's prediction after it completes, but this is not
required for the system to be considered asynchronous. The predictions
made by a synchronous model deployment are returned to the client as
soon as they are completed.</p>
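<p>The difference between the two styles can be sketched with a
stand-in prediction function; the names below are illustrative:</p>

```python
# Contrast of synchronous and asynchronous prediction calls; the
# predict function is a stand-in for a real model.
import time
from concurrent.futures import ThreadPoolExecutor

def predict(features: dict) -> dict:
    time.sleep(0.01)  # simulated model latency
    return {"species": "setosa"}

# synchronous: the client blocks until the prediction is returned
sync_result = predict({"sepal_length": 1.1})

# asynchronous: the client submits the request, keeps working, and
# retrieves the prediction later (analogous to a task queue's
# result backend)
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(predict, {"sepal_length": 1.1})
    # ... the client is free to do other work here ...
    async_result = future.result()
```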
<p>An example of a synchronous model deployment is the AWS Lambda
deployment we built in this <a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">blog
post</a>.
The Lambda receives prediction requests through an AWS API Gateway,
makes a prediction and returns it while the client system waits for it.
An example of an asynchronous model deployment is the task queue we
built for this <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">blog
post</a>.
The task queue is specifically designed to receive predictions requests
from clients and fulfill them while the client system works on other
things. The task queue makes the prediction available to the client in a
"result backend" which can be accessed by the client once the prediction
is completed. Another asynchronous deployment is the Kafka stream
processor we built in this <a href="https://www.tekhnoal.com/streaming-ml-model-deployment.html">blog
post</a>,
although it is not designed to return the prediction results directly to
the client like the task queue deployment.</p>
<h2>Real-time and Non-real-time Model Deployments</h2>
<p>Another area of optimization for ML model deployments is the ability to
return a prediction very quickly. A real-time system needs to be
optimized to have very low and very predictable latency so that we can
ensure that interactions with the model can always happen quickly and
end within a defined period of time.</p>
<p>An example of a real time model deployment is the Websocket service that
we created in <a href="https://www.tekhnoal.com/websocket-ml-model-deployment.html">this blog
post.</a>
The Websocket service is particularly useful for this type of deployment
because websocket connections are designed to transfer data with very
low overhead. Two examples of non-real-time deployments are the Apache
Beam ETL job we built in <a href="https://www.tekhnoal.com/apache-beam-ml-model-deployment.html">this blog
post</a>
and the Hadoop MapReduce job we built in <a href="https://www.tekhnoal.com/map-reduce-ml-model-deployment.html">this blog
post</a>.
These deployments are designed to make millions of predictions and are
optimized for that purpose, which means that they are not useful in
situations in which we need real-time predictions.</p>
<p>In the blog posts that we wrote, we didn't try to deploy a model on a
consumer device like a phone or tablet. All of the approaches we took
were designed to execute the model on a server and return the prediction
to the client through the network. For a real-time system, being able to
execute directly on the client device would be more efficient and faster
since no network hop is required.</p>
<h2>Deterministic and Non-deterministic Models</h2>
<p>The last distinction we will make is between deterministic and
non-deterministic model prediction code. Deterministic models will always
return the same result when given the same input, while non-deterministic
models can return different results for the same input. This
distinction can have a large impact on the deployment of the model. If
we don't distinguish between models that are deterministic and
non-deterministic, doing things like storing predictions for later use
and prediction caching can become much more complicated. Any model that
is being deployed that is non-deterministic should publish that fact to
its users so that they can be ready to deal with the side effects of
non-determinism.</p>
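<p>A small sketch makes the caching problem concrete; both functions
below are illustrative stand-ins for real models:</p>

```python
# Sketch of why caching only suits deterministic prediction code;
# both functions are illustrative stand-ins for real models.
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def deterministic_predict(sepal_length: float) -> str:
    # the same input always yields the same output, so caching is safe
    return "setosa" if sepal_length < 5.0 else "versicolor"

def nondeterministic_predict(sepal_length: float) -> str:
    # repeated calls can disagree, so a cache would silently freeze one
    # arbitrary sample as "the" answer
    return random.choice(["setosa", "versicolor", "virginica"])
```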
<h1>Conclusion</h1>
<p>At the beginning of this series of blog posts we challenged ourselves to
come up with a simple base class that would enable us to abstract out
the details of a machine learning model. We started by creating a base
class that could hide the details of the ML model behind an abstraction,
then added features that we thought would be useful. From the beginning,
the base class was designed to make it easy to deploy machine learning
models. The base class was not designed for the training parts of a
model codebase.</p>
<p>To be able to introspect details about the model, we also added the
ability for the model to provide metadata about itself. The metadata
aspect of the model was not really required for most model deployments,
but it did become important for certain deployments. Model metadata like
the version and the input and output schemas of the model becomes more
important when we have to manage dozens or hundreds of deployed models.</p>
<p>To enable us to easily deploy any ML model, we also needed to make the
model codebase easy to install, which we accomplished by making the ML
model into a Python package that could be installed with the pip package
manager. By making the model codebase easy to install we enabled anybody
to reuse the model in whichever context they needed it without having to
understand the code or manually install the dependencies of the model.
Having the model inside of a package also allowed us to install the very
same model in 10 different applications with no changes to the model
code.</p>
<p>Overall, this series of blog posts is much less concerned with the
details of training a machine learning model. It is mainly concerned
with integrating the trained ML model with other software systems. To
this end, we sought to use a wide variety of integration technologies to
make sure that our approach worked in every situation. In every case,
the model codebase remained the same and we did not have to adapt it to
any of the integrations. This speaks to the flexibility of the approach,
which allowed us to isolate the details of the ML model from the
deployment and integration problems. Furthermore, we can reuse any of
the deployment codebases to deploy any ML model code that implements the
MLModel base class, which makes the deployment codebases reusable as
well.</p>
<p>To sum up, the best strategy for building an ML model that can be used
in many different contexts is to: code the model prediction code behind
an interface, build and release the model as a package, and then to
install it into the environment where it will be used. All deployment
details should be kept out of the model package so that we are able to
choose the right approach to model deployment later on.</p>An Apache Beam ML Model Deployment2020-07-31T19:00:00-05:002020-07-31T19:00:00-05:00Brian Schmidttag:www.tekhnoal.com,2020-07-31:/apache-beam-ml-model-deployment.html<p>Data processing pipelines are useful for solving a wide range of problems. For example, an Extract, Transform, and Load (ETL) pipeline is a type of data processing pipeline that is used to extract data from one system and save it to another system. Inside of an ETL, the data may be transformed and aggregated into more useful formats. ETL jobs are useful for making the predictions made by a machine learning model available to users or to other systems. The ETL for such an ML model deployment looks like this: extract features used for prediction from a source system, send the features to the model for prediction, and save the predictions to a destination system. In this blog post we will show how to deploy a machine learning model inside of a data processing pipeline that runs on the Apache Beam framework.</p><p>This blog post builds on the ideas started in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog
posts</a>.</p>
<p>In this blog post I'll show how to deploy the same ML model that we
deployed as a batch job in this <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog post</a>,
as a task queue in this <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">blog post</a>,
inside an AWS Lambda in this <a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">blog post</a>,
as a Kafka streaming application in this <a href="https://www.tekhnoal.com/streaming-ml-model-deployment.html">blog post</a>,
as a gRPC service in this <a href="https://www.tekhnoal.com/grpc-ml-model-deployment.html">blog post</a>,
as a MapReduce job in this <a href="https://www.tekhnoal.com/map-reduce-ml-model-deployment.html">blog post</a>,
as a Websocket service in this <a href="https://www.tekhnoal.com/websocket-ml-model-deployment.html">blog post</a>,
and as a ZeroRPC service in this <a href="https://www.tekhnoal.com/zerorpc-ml-model-deployment.html">blog post</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment">github repo</a>.</p>
<h1>Introduction</h1>
<p>Data processing pipelines are useful for solving a wide range of
problems. For example, an Extract, Transform, and Load (ETL) pipeline is
a type of data processing pipeline that is used to extract data from one
system and save it to another system. Inside of an ETL, the data may be
transformed and aggregated into more useful formats. ETL jobs are useful
for making the predictions made by a machine learning model available to
users or to other systems. The ETL for such an ML model deployment looks
like this: extract features used for prediction from a source system,
send the features to the model for prediction, and save the predictions
to a destination system. In this blog post we will show how to deploy a
machine learning model inside of a data processing pipeline that runs on
the Apache Beam framework.</p>
<p>Apache Beam is an open source framework for doing data processing. It is
most useful for doing parallel data processing that can easily be split
among many computers. The Beam framework is different from other data
processing frameworks because it supports batch and stream processing
using the same API, which allows developers to write the code one time
and deploy it in two different contexts without change. An interesting
feature of the Beam programming model is that once we have written the
code, we can deploy it onto an array of different runners like Apache
Spark, Apache Flink, and Google Cloud Dataflow, among others.</p>
<p>The Google Cloud Platform has a service that can run Beam pipelines. The
Dataflow service allows users to run their workloads in the cloud
without having to worry about managing servers, and it handles automated
provisioning and management of processing resources for the user. In
this blog post, we'll also be deploying the machine learning pipeline to
the Dataflow service to demonstrate how it works in the cloud.</p>
<h1>Building Beam Jobs</h1>
<p>A Beam job is defined by a driver process that uses the Beam SDK to
declare the data processing steps that the job performs. The Beam SDK can
be used from Python, Java, or Go processes. The driver process defines a
data processing pipeline of components which are executed in the right
order to load data, process it, and store the results. The driver
program also accepts execution options that can be set to modify the
behavior of the pipeline. In our example, we will be loading data from
an LDJSON file, sending it to a model to make predictions, and storing
the results in an LDJSON file.</p>
<p>The Beam programming model works by defining a PCollection, which is a
collection of data records that need to be processed. A PCollection is a
data structure that is created at the beginning of the execution of the
pipeline, and is received and processed by each step in a Beam pipeline.
Each step in the pipeline that modifies the contents of the PCollection
is called a PTransform. For this blog post we will create a PTransform
component that takes a PCollection, makes predictions with it, and
returns a PCollection with the prediction results. We will combine this
PTransform with other components to build a data processing pipeline.</p>
<h1>Package Structure</h1>
<p>The code used in this blog post is hosted in <a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment">this Github
repository.</a>
The codebase is structured like this:</p>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">data</span> <span class="ss">(</span> <span class="nv">data</span> <span class="k">for</span> <span class="nv">testing</span> <span class="nv">job</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">model_beam_job</span> <span class="ss">(</span><span class="nv">python</span> <span class="nv">package</span> <span class="k">for</span> <span class="nv">apache</span> <span class="nv">beam</span> <span class="nv">package</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">main</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">pipeline</span> <span class="nv">definition</span> <span class="nv">and</span> <span class="nv">launcher</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">ml_model_operator</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">prediction</span> <span class="nv">step</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span> <span class="nv">unit</span> <span class="nv">tests</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">README</span>.<span class="nv">md</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">setup</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span>
</code></pre></div>
<h1>Installing the Model</h1>
<p>As in previous blog posts, we'll be deploying a model that is packaged
separately from the deployment codebase. This approach allows us to
deploy the same model in many different systems and contexts. We'll
install the model package into the virtual environment; it can be
installed from a git repository with this
command:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/schmidtbri/ml-model-abc-improvements
</code></pre></div>
<p>Now that we have the model installed in the environment, we can try it
out by opening a python interpreter and entering this code:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">iris_model.iris_predict</span> <span class="kn">import</span> <span class="n">IrisModel</span>
<span class="o">>>></span> <span class="n">model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">({</span><span class="s2">"sepal_length"</span><span class="p">:</span><span class="mf">1.1</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">1.3</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.4</span><span class="p">})</span>
<span class="p">{</span><span class="s1">'species'</span><span class="p">:</span> <span class="s1">'setosa'</span><span class="p">}</span>
</code></pre></div>
<p>The IrisModel class implements the prediction logic of the iris_model
package. This class is a subtype of the MLModel class, which ensures
that a standard interface is followed. The MLModel interface allows us
to deploy any model we want into the Beam job, as long as it implements
the required interface. More details about this approach to deploying
machine learning models can be found in the first
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">blog posts</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">in this series.</a></p>
<h1>MLModelPredictOperation Class</h1>
<p>The first thing we'll do is create a PTransform class for the code that
receives records from the Beam framework and makes predictions with the
MLModel class. This is the class:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MLModelPredictOperation</span><span class="p">(</span><span class="n">beam</span><span class="o">.</span><span class="n">DoFn</span><span class="p">):</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/model_beam_job/ml_model_operator.py#L10">here</a>.</p>
<p>The class we'll be working with is called MLModelPredictOperation and it
is a subtype of the <a href="https://beam.apache.org/documentation/programming-guide/#core-beam-transforms">DoFn
class</a>
that is part of the Beam framework. The DoFn class defines a method
which will be applied to each record in the PCollection. To initialize
the object with the right model, we'll add an __init__ method:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">module_name</span><span class="p">,</span> <span class="n">class_name</span><span class="p">):</span>
<span class="n">beam</span><span class="o">.</span><span class="n">DoFn</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span>
<span class="n">model_module</span> <span class="o">=</span> <span class="n">importlib</span><span class="o">.</span><span class="n">import_module</span><span class="p">(</span><span class="n">module_name</span><span class="p">)</span>
<span class="n">model_class</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">model_module</span><span class="p">,</span> <span class="n">class_name</span><span class="p">)</span>
<span class="n">model_object</span> <span class="o">=</span> <span class="n">model_class</span><span class="p">()</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">model_object</span><span class="p">,</span> <span class="n">MLModel</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"The model object is not a subclass of MLModel."</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="n">model_object</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/model_beam_job/ml_model_operator.py#L22-L34">here</a>.</p>
<p>We'll start by calling the __init__ method of the DoFn superclass,
which initializes it. We then find and load the Python
module that contains the MLModel class that contains the prediction
code, get a reference to the class, and instantiate the MLModel class
into an object. Now that we have an instantiated model object, we check
the type of the object to make sure that it is a subtype of MLModel. If
it is a subtype, we store a reference to it.</p>
<p>Now that we have an initialized DoFn object with a model object inside
of it, we need to actually do the prediction:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="k">yield</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/model_beam_job/ml_model_operator.py#L36-L38">here</a>.</p>
<p>The prediction step is very simple: we take the record, pass it directly
to the model, and yield the result of the prediction. To make sure that
this code will work inside of a Beam pipeline, we need to make sure that
the pipeline feeds a PCollection of dictionaries to the DoFn object.
When we create the pipeline, we'll make sure that this is the case.</p>
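<p>A simplified stand-in for the kind of coder that turns LDJSON lines
into dictionaries is sketched below; the real class in the repository
would plug into Beam's coder machinery rather than stand alone:</p>

```python
# Simplified stand-in for a JSON coder used when reading LDJSON files;
# the real class would plug into Beam's coder machinery.
import json

class JsonCoder:
    """Translate between LDJSON lines and Python dictionaries."""

    def encode(self, record: dict) -> bytes:
        return json.dumps(record).encode("utf-8")

    def decode(self, line: bytes) -> dict:
        return json.loads(line)
```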
<h1>Creating the Pipeline</h1>
<p>Now that we have a class that can make a prediction with the model, we
need to build a simple pipeline around it that can load data, send it to
the model, and save the resulting predictions.</p>
<p>The creation of the Beam pipeline is done in the <a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/model_beam_job/main.py#L30-L50">run
function</a>
in the main.py module:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">argv</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">'--input'</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s1">'input'</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">'Input file to process.'</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">'--output'</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s1">'output'</span><span class="p">,</span> <span class="n">required</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">'Output file to write results to.'</span><span class="p">)</span>
<span class="n">known_args</span><span class="p">,</span> <span class="n">pipeline_args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_known_args</span><span class="p">(</span><span class="n">argv</span><span class="p">)</span>
<span class="n">pipeline_options</span> <span class="o">=</span> <span class="n">PipelineOptions</span><span class="p">(</span><span class="n">pipeline_args</span><span class="p">)</span>
<span class="n">pipeline_options</span><span class="o">.</span><span class="n">view_as</span><span class="p">(</span><span class="n">SetupOptions</span><span class="p">)</span><span class="o">.</span><span class="n">save_main_session</span> <span class="o">=</span> <span class="kc">True</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/model_beam_job/main.py#L30-L38">here</a>.</p>
<p>The pipeline options object is given to the Beam job to
modify the way that it runs. The parameters parsed from the command line
are fed directly to the PipelineOptions object. Two parameters
are loaded by the command line parser: the location of the input files,
and the location where the output of the job will be stored.</p>
<p>When we are done loading the pipeline options, we can arrange the steps
that make up the pipeline:</p>
<div class="highlight"><pre><span></span><code><span class="k">with</span> <span class="n">beam</span><span class="o">.</span><span class="n">Pipeline</span><span class="p">(</span><span class="n">options</span><span class="o">=</span><span class="n">pipeline_options</span><span class="p">)</span> <span class="k">as</span> <span class="n">p</span><span class="p">:</span>
<span class="p">(</span><span class="n">p</span>
<span class="o">|</span> <span class="s1">'read_input'</span> <span class="o">>></span> <span class="n">ReadFromText</span><span class="p">(</span><span class="n">known_args</span><span class="o">.</span><span class="n">input</span><span class="p">,</span> <span class="n">coder</span><span class="o">=</span><span class="n">JsonCoder</span><span class="p">())</span>
<span class="o">|</span> <span class="s1">'apply_model'</span> <span class="o">>></span> <span class="n">beam</span><span class="o">.</span><span class="n">ParDo</span><span class="p">(</span><span class="n">MLModelPredictOperation</span><span class="p">(</span><span class="n">module_name</span><span class="o">=</span><span class="s2">"iris_model.iris_predict"</span><span class="p">,</span> <span class="n">class_name</span><span class="o">=</span><span class="s2">"IrisModel"</span><span class="p">))</span>
<span class="o">|</span> <span class="s1">'write_output'</span> <span class="o">>></span> <span class="n">WriteToText</span><span class="p">(</span><span class="n">known_args</span><span class="o">.</span><span class="n">output</span><span class="p">,</span> <span class="n">coder</span><span class="o">=</span><span class="n">JsonCoder</span><span class="p">())</span>
<span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/model_beam_job/main.py#L40-L47">here</a>.</p>
<p>The pipeline object is created by providing it with the PipelineOptions
object that we created above. The pipeline is made up of three steps: a
step that loads data from an LDJSON file and creates a PCollection from
it, a step that makes predictions with that PCollection, and a step that
saves the resulting predictions as an LDJSON file. The input and output
steps use a class called
<a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/model_beam_job/main.py#L18-L27">JsonCoder</a>,
which takes care of serializing and deserializing the data in the LDJSON
files.</p>
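<p>As a rough illustration of what such a coder does, here is a minimal
standalone sketch of the encode/decode logic (the real JsonCoder in the
repository subclasses Beam's coder class; this version only shows the
serialization itself):</p>

```python
import json


class JsonCoder:
    """Minimal sketch of an LDJSON coder; the real implementation in the
    repository subclasses the Apache Beam coder base class."""

    def encode(self, value):
        # each record becomes one JSON-encoded line of bytes
        return json.dumps(value).encode("utf-8")

    def decode(self, encoded):
        # each line of bytes is parsed back into a Python object
        return json.loads(encoded.decode("utf-8"))
```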
<p>Now that we have a configured pipeline, we can run it:</p>
<div class="highlight"><pre><span></span><code><span class="n">result</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
<span class="n">result</span><span class="o">.</span><span class="n">wait_until_finish</span><span class="p">()</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/model_beam_job/main.py#L49-L50">here</a>.</p>
<p>The main.py module is responsible for arranging the steps of the
pipeline, receiving parameters, and running the Beam job. This script
will be used to run the job locally and in the cloud.</p>
<h1>Testing the Job Locally</h1>
<p>We can test the job locally by running it with the Python interpreter:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
python -m model_beam_job.main --input data/input.json --output data/output.json
</code></pre></div>
<p>The job takes as input the "input.json" file in the data folder, and
writes a file called "output.json" to the same folder.</p>
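<p>Each line of the input file is an independent JSON record. With the iris
model used in this series, an input line might look like the following (the
field names come from the model's input schema; the values are made up for
illustration):</p>

```python
import json

# hypothetical LDJSON input line for the iris model; values are illustrative
input_line = '{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}'

# each line parses independently into one prediction request
record = json.loads(input_line)
```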
<h1>Deploying to Google Cloud</h1>
<p>The next thing we'll do is run the same job that we ran locally in the
<a href="https://cloud.google.com/dataflow">Google Cloud Dataflow
service</a>. The Dataflow
service is an offering in the Google Cloud suite of services that can do
scalable data processing for batch and streaming jobs. The Dataflow
service runs Beam jobs exclusively, handling resource management and
performance optimization for them.</p>
<p>To run the model Beam job in the cloud, we'll need to create a project.
In the Cloud Console, in the <a href="https://console.cloud.google.com/projectselector2/home/dashboard">project selector
page</a>
click on "Create Cloud Project", then create a project for your
solution. Make sure the newly created project is the currently selected
project, so that any resources we create next will be held in it. In
order to use the GCP Dataflow service, we'll need to have
billing enabled for the project. To make sure that billing is working,
follow <a href="https://cloud.google.com/billing/docs/how-to/modify-project#confirm_billing_is_enabled_on_a_project">these
steps</a>.</p>
<p>To be able to create the Dataflow job, we'll need to enable the
Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage,
Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud
Resource Manager APIs in your new project. To enable access to these
APIs, follow <a href="https://console.cloud.google.com/flows/enableapi?apiid=dataflow,compute_component,logging,storage_component,storage_api,bigquery,pubsub,datastore.googleapis.com,cloudresourcemanager.googleapis.com">this
link</a>,
then select your new project and click the "Continue" button.</p>
<p>Next, we'll create a service account for our project. In the Cloud
Console, go to the <a href="https://console.cloud.google.com/apis/credentials/serviceaccountkey">Create service account key
page</a>.
From the Service account list, select "New service account". In the
Service account name field, enter a name. From the Role list, select
Project -> Owner and click on the "Create" button. A JSON file will be
created and downloaded to your computer, copy this file to the root of
the project directory. To use the file in the project, open a command
shell and set the GOOGLE_APPLICATION_CREDENTIALS environment variable
to the full path to the JSON file that you placed in the project root.
The command will look like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">export</span><span class="w"> </span><span class="n">GOOGLE_APPLICATION_CREDENTIALS</span><span class="o">=/</span><span class="n">Users</span><span class="o">/.../</span><span class="n">apache</span><span class="o">-</span><span class="n">beam</span><span class="o">-</span><span class="n">ml</span><span class="o">-</span><span class="n">model</span><span class="o">-</span><span class="n">deployment</span><span class="o">/</span><span class="n">model</span><span class="o">-</span><span class="n">beam</span><span class="o">-</span><span class="n">job</span><span class="o">-</span><span class="n">a7c5c1d9c22c</span><span class="o">.</span><span class="n">json</span><span class="w"></span>
</code></pre></div>
<p>To store the file we will be processing, we need to create a storage
bucket in the Google Cloud Storage service. To do this, go to the
<a href="https://console.cloud.google.com/storage/browser">bucket browser
page</a>,
click on the "Create Bucket" button, and fill in the details to create a
bucket. Now we can upload our test data to a bucket so that it can be
processed by the job. To upload the test data click on the "Upload
Files" button in the bucket details page and select the <a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/data/input.json">input.json
file</a>
in the data directory of the project.</p>
<p>Next, we need to create a tar.gz file that contains the model package
that will be run by the Beam job. This package is special because it
cannot be installed from the public PyPI repository, so it must be
uploaded to the Dataflow service along with the Beam job. To create the
tar.gz file, we created <a href="https://github.com/schmidtbri/apache-beam-ml-model-deployment/blob/master/Makefile#L10-L17">a target in the project
Makefile</a>
called "build-dependencies". When executed, the target downloads the
code for the iris_model package, builds a tar.gz distribution file, and
leaves it in the "dependencies" directory.</p>
<p>We're finally ready to send the job to be executed in the Dataflow
service. To do this, execute this command:</p>
<div class="highlight"><pre><span></span><code>python -m model_beam_job.main --region us-east1 <span class="se">\</span>
    --input gs://model-beam-job/input.json <span class="se">\</span>
    --output gs://model-beam-job/results/outputs <span class="se">\</span>
    --runner DataflowRunner <span class="se">\</span>
    --machine_type n1-standard-4 <span class="se">\</span>
    --project model-beam-job-294711 <span class="se">\</span>
    --temp_location gs://model-beam-job/tmp/ <span class="se">\</span>
    --extra_package dependencies/iris_model-0.1.0.tar.gz <span class="se">\</span>
    --setup_file ./setup.py
</code></pre></div>
<p>The job is sent by executing the same Python script that we used to
test the job locally, but with more command line options. The
input and output options work the same as in the local execution of the
job, but now they point to locations in the Google Cloud Storage bucket.
The runner option tells the Beam framework that we want to use the
Dataflow runner. The machine_type option tells the Dataflow service
which machine type to use when running the job. The
project option points to the Google Cloud project we created above. The
temp_location option tells the Dataflow service to store
temporary files in the same Google Cloud Storage bucket that we are
using for the input and output. The extra_package option points to the
iris_model distribution tar.gz file that we created above; this file
will be sent to the Dataflow service along with the job code. Lastly,
the setup_file option points at the setup.py file of the
model_beam_job package itself, which allows the command to package up
any code files that the job depends on.</p>
<p>Once we execute the command, the job will be started in the cloud. As
the job runs it will output a link to a webpage that can be used to
monitor the progress of the job. Once the job completes, the results
will be in the Google Cloud Storage bucket that we created above.</p>
<p><img alt="Dataflow UI" src="https://www.tekhnoal.com/dataflow_ui.png" width="100%"></p>
<h1>Closing</h1>
<p>By using the Beam framework, we are able to easily deploy a machine
learning prediction job to the cloud. Because of the simple design of
the Beam framework, a lot of the complexities of running a job on many
computers are abstracted out. Furthermore, we are able to leverage all
of the features of the Beam framework for advanced data processing.</p>
<p>One of the important features of this codebase is that it can
accept any machine learning model that implements the MLModel interface.
By installing another model package and importing the class that
inherits from the MLModel base class, we can easily deploy any number of
models in the same Beam job without changing the code. However, we do
need to change the pipeline definition to change or add models to it.
Once again, the MLModel interface allowed us to abstract the building
of a machine learning model away from the complexity of deploying
it.</p>
<p>One thing that we can improve about the code is the fact that the job
only accepts files encoded as LDJSON. We did this to make the code easy
to understand, but we could easily add support for other input data
formats, making the pipeline more flexible and easier to use.</p>
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog posts</a>.</p>
<p>In this blog post I'll show how to deploy the same ML model that we
deployed as a batch job in this <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog post</a>,
as a task queue in this <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">blog post</a>,
inside an AWS Lambda in this <a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">blog post</a>,
as a Kafka streaming application in this <a href="https://www.tekhnoal.com/streaming-ml-model-deployment.html">blog post</a>,
a gRPC service in this <a href="https://www.tekhnoal.com/grpc-ml-model-deployment.html">blog post</a>,
as a MapReduce job in this <a href="https://www.tekhnoal.com/map-reduce-ml-model-deployment.html">blog post</a>,
and as a Websocket service in this <a href="https://www.tekhnoal.com/websocket-ml-model-deployment.html">blog post</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment">github
repo</a>.</p>
<h1>Introduction</h1>
<p>There are many different ways for two software processes to communicate
with each other. When deploying a machine learning model, it's often
simpler to isolate the model code inside of its own process. Any code
that needs to use the model to make predictions then needs to
communicate with the process that is running the model code to make
predictions. This approach is easier than embedding the model code in
the process that needs the predictions because it saves us the trouble
of recreating the model's algorithm in the programming language of the
process that needs the predictions. RPC calls are also used widely to
connect code that is executing in different processes. In the last few
years, the rise in popularity of microservice architectures has also
caused the rise in popularity of RPC for integrating systems.</p>
<p>RPC stands for Remote Procedure Call. A remote procedure is just a
function call that is executed in a different process from the process
that initiated the call. The input parameters for the call come from the
calling process and the result of the call is returned to the calling
process. The function call looks as if it was executed locally. RPC
therefore executes as a request-response protocol. The process that
initiates the call is called the client and the process that executes
the call is the server. RPC is useful when you want to call a function
that is not implemented in the local process and you don't want to worry
about the complexities of inter-process communication. RPC is similar
to, but a lot simpler than, REST and other HTTP-based styles of
inter-process communication.</p>
<p>An RPC call follows a series of steps to complete the call. First, the
client code will call a piece of code called the "stub" in the client
process. The stub behaves like a normal function but actually calls the
remote procedure in the server. The stub then takes the parameters
provided by the client code and serializes them so that they can be
transported over the communication channel. The stub uses the
communication channel to communicate with the remote process, sending
the necessary information to execute the procedure. The server stub
receives the information and deserializes the parameters, then executes
the procedure. The series of steps are then executed in reverse order to
return the results of the procedure to the client code.</p>
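<p>The steps above can be illustrated with a toy, in-process sketch of a
client stub. Real RPC frameworks replace the direct function call below
with a network transport, and the JSON serialization here stands in for
whatever wire format the framework uses; everything in this sketch is
invented purely for illustration:</p>

```python
import json


def remote_add(a, b):
    """The 'remote procedure' that would live in the server process."""
    return a + b


class Stub:
    """Toy client stub: serializes arguments, 'transports' them to the
    server procedure, and deserializes the result."""

    def __init__(self, procedure):
        self._procedure = procedure

    def __call__(self, *args):
        request = json.dumps(args)              # client stub serializes
        server_args = json.loads(request)       # server stub deserializes
        result = self._procedure(*server_args)  # server executes the call
        response = json.dumps(result)           # server stub serializes
        return json.loads(response)             # client stub deserializes


add = Stub(remote_add)  # looks like a local function to the caller
```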
<p>In previous blog posts we showed how to do RPC with a RESTful service
and a gRPC service. In this blog post we'll continue exploring the
options available to us for interprocess communication with a ZeroRPC
service that can host machine learning models.</p>
<h1>ZeroRPC</h1>
<p>ZeroRPC is a simple RPC framework that works in many different
languages. ZeroRPC uses
<a href="https://msgpack.org/index.html">MessagePack</a> for
parameter serialization and deserialization, and it uses
<a href="https://zeromq.org/">ZeroMQ</a> for transporting data
between processes. ZeroRPC supports advanced features such as streamed
responses, heartbeats, and timeouts. The framework also supports
introspection of the service and exceptions.</p>
<p>The ZeroRPC framework uses the ZeroMQ messaging framework to transport
messages between processes. ZeroMQ is a high-performance low-level
messaging framework that can be used in many different types of
communication patterns. The ZeroRPC framework uses the ZeroMQ framework
in a request-response pattern to do RPC calls. ZeroMQ also supports the
publish-subscribe pattern along with other patterns. ZeroMQ is designed
to support highly distributed and concurrent applications. ZeroMQ works
in many different programming languages and in many operating systems.</p>
<p>The ZeroRPC framework uses the MessagePack format for serialization.
This format is similar to JSON but is binary, which makes it more space
efficient and allows for faster serialization and deserialization. The
MessagePack format is similar to the Protocol Buffer format that is used
by gRPC, but it allows us to serialize arbitrary data structures. This
is different from Protocol Buffers which require a schema for the data
to be serialized. MessagePack is also dynamically typed which makes
developing code with it faster and simpler, but lacks the documentation
and code generation features of Protocol Buffers.</p>
<h1>Package Structure</h1>
<p>The service codebase is structured like this:</p>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">model_zerorpc_service</span> <span class="ss">(</span> <span class="nv">python</span> <span class="nv">package</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">zerorpc</span> <span class="nv">service</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">config</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">ml_model_zerorpc_endpoint</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">ml_model_manager</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">service</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">scripts</span> <span class="ss">(</span><span class="nv">scripts</span> <span class="k">for</span> <span class="nv">testing</span> <span class="nv">the</span> <span class="nv">service</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span><span class="nv">unit</span> <span class="nv">tests</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">Dockerfile</span> <span class="ss">(</span><span class="nv">used</span> <span class="nv">to</span> <span class="nv">build</span> <span class="nv">a</span> <span class="nv">docker</span> <span class="nv">image</span> <span class="nv">of</span> <span class="nv">the</span> <span class="nv">service</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">README</span>.<span class="nv">md</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">setup</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span>
</code></pre></div>
<p>This structure can be seen in the <a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment">github
repository</a>.</p>
<h1>Installing the Model</h1>
<p>Our aim for this blog post is to show how to build a ZeroRPC service
that is able to host any ML model that works with the MLModel base
class. To show how this can be done, we'll use the same model that we've
deployed in previous blog posts. To install the model into the Python
environment, execute this command:</p>
<div class="highlight"><pre><span></span><code>pip install git+<span class="o">[</span>https://github.com/schmidtbri/ml-model-abc-improvements<span class="o">](</span>https://github.com/schmidtbri/ml-model-abc-improvements%5C<span class="o">)</span>
</code></pre></div>
<p>This command installs the model code and parameters from the model's git
repository. To understand how the model code works, check out <a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">this blog post</a>.
Once the model is installed, we can test it out by executing this Python
code in an interactive session:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">iris_model.iris_predict</span> <span class="kn">import</span> <span class="n">IrisModel</span>
<span class="o">>>></span> <span class="n">model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">({</span><span class="s2">"sepal_length"</span><span class="p">:</span><span class="mf">1.1</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">1.3</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.4</span><span class="p">})</span>
<span class="p">{</span><span class="s1">'species'</span><span class="p">:</span> <span class="s1">'setosa'</span><span class="p">}</span>
</code></pre></div>
<p>The code above imports the class that implements the MLModel interface,
instantiates it, and sends the model object a prediction request. The
model successfully responds with a prediction for the flower species.</p>
<p>In order for the ZeroRPC service to find the model that we want to
deploy, we'll create a configuration module that points to the model's
package and module:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Config</span><span class="p">(</span><span class="nb">dict</span><span class="p">):</span>
<span class="n">models</span> <span class="o">=</span> <span class="p">[{</span>
<span class="s2">"module_name"</span><span class="p">:</span> <span class="s2">"iris_model.iris_predict"</span><span class="p">,</span>
<span class="s2">"class_name"</span><span class="p">:</span> <span class="s2">"IrisModel"</span>
<span class="p">}]</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/config.py#L4-L12">here</a>.</p>
<p>This configuration gives us the flexibility to add and remove models
from the service dynamically. A service can host any number of models if
they are installed in the environment and added to the configuration.
The module_name and class_name fields in the configuration point to a
class that implements the MLModel interface, which allows the service to
make predictions using the model.</p>
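<p>Because the configuration is just a list of entries, hosting an
additional model is, in principle, only a matter of appending another
entry. The second entry below uses made-up module and class names purely
for illustration; they do not exist in the repository:</p>

```python
class Config(dict):
    models = [{
        "module_name": "iris_model.iris_predict",
        "class_name": "IrisModel"
    }]


class ExtendedConfig(Config):
    # hypothetical second model; the module and class names are invented
    # for illustration only
    models = Config.models + [{
        "module_name": "example_model.example_predict",
        "class_name": "ExampleModel"
    }]
```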
<p>As in previous blog posts, we'll use a singleton object to manage the
ML model objects that will be used to make predictions. The class that
the singleton object is instantiated from is called ModelManager. The
class is responsible for instantiating MLModel objects, managing the
instances, returning information about the MLModel objects, and
returning references to the objects when needed. The code for the
ModelManager class can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/model_manager.py">here</a>.
A complete explanation of the ModelManager class can be found in <a href="https://www.tekhnoal.com/using-ml-model-abc.html">this
blog
post</a>.</p>
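<p>At its core, the ModelManager turns each configuration entry into a
model object through a dynamic import. A minimal sketch of that lookup,
using only the standard library, might look like this (the real class in
the repository does more, such as caching the instantiated models):</p>

```python
from importlib import import_module


def load_class(module_name, class_name):
    """Import module_name and return the attribute class_name from it,
    mirroring how a configuration entry is resolved into a model class."""
    module = import_module(module_name)
    return getattr(module, class_name)
```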
<h1>ZeroRPC Endpoint</h1>
<p>In order to host a machine learning model, we have to handle incoming
prediction requests, produce responses for them, and integrate with the
ZeroRPC framework. The class described in this section will handle these
aspects of the service.</p>
<p>First, we'll declare the class:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MLModelZeroRPCCEndpoint</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/ml_model_zerorpc_endpoint.py#L10">here</a>.</p>
<p>Next, we'll add the code that will initialize the object when the class
is instantiated:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model_qualified_name</span><span class="p">):</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">model_instance</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="p">)</span>
<span class="k">if</span> <span class="n">model_instance</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"'</span><span class="si">{}</span><span class="s2">' not found in ModelManager instance."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="n">model_instance</span>
<span class="bp">self</span><span class="o">.</span><span class="vm">__doc__</span> <span class="o">=</span> <span class="s2">"Predict with the </span><span class="si">{}</span><span class="s2">."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">display_name</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/ml_model_zerorpc_endpoint.py#L19-L40">here</a>.</p>
<p>The __init__ method has one argument called
"model_qualified_name" which tells the endpoint class which model it
will be hosting. The __init__ method first gets a reference to the
ModelManager singleton instance that is initialized when the service
starts up. Then we get a reference to the specific model that is being
hosted by this instance of the MLModelZeroRPCCEndpoint class. Next, we
check whether the model reference is None, which happens when the
ModelManager can't find a model with the requested name; if there is no
model by that name, we raise an exception. If the model exists, we save a
reference to it on the self variable, which will make it easier to access
in the future. Lastly, we set the docstring property of the object, which
the service will return when doing introspection; we'll see how this
works later.</p>
<p>Now that we have an instance of the endpoint, we need to handle incoming
prediction requests:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/ml_model_zerorpc_endpoint.py#L42-L47">here</a>.</p>
<p>The code in this method is very simple: it receives a parameter called
"data" from the client, sends it to the model's predict method, and
returns the prediction object that is returned by the model. Behind the
scenes, the ZeroRPC framework handles serialization and
deserialization, the connection between the client and the server, and
any exceptions raised by the server.</p>
<p>This class uses a special feature of Python which is the <a href="https://www.journaldev.com/22761/python-callable-__call__">callable
magic
method</a>.
The __call__ method is a special method that turns any instance of
the MLModelZeroRPCCEndpoint class into a callable, which allows
instances of the class to be used as functions or methods. This will be
useful later when we need to initialize a dynamic number of endpoints in
the ZeroRPC service.</p>
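<p>As a small standalone illustration of the callable magic method (the
class and names below are invented for the example, not taken from the
repository):</p>

```python
class Greeter:
    """Instances of this class behave like functions because of __call__."""

    def __init__(self, name):
        self._name = name

    def __call__(self, greeting):
        # calling the instance invokes this method
        return "{}, {}!".format(greeting, self._name)


hello = Greeter("world")  # an object that can now be called like a function
```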
<h1>ZeroRPC Service</h1>
<p>Now that we have a model and a way to host the model within an endpoint,
we can go ahead and write the code that will create the service. Before
we can do that, we have to load the configuration:</p>
<div class="highlight"><pre><span></span><code><span class="n">configuration</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">config</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">"APP_SETTINGS"</span><span class="p">])</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/service.py#L15-L16">here</a>.</p>
<p>The configuration is loaded dynamically by importing a class from the
config.py module. The name of the class is received through an
environment variable called APP_SETTINGS.</p>
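<p>To illustrate the lookup (the config classes below are stand-ins for the project's config.py module, not its actual contents), a minimal sketch of environment-driven configuration selection:</p>

```python
# Sketch of environment-driven configuration selection: APP_SETTINGS
# names a class that is looked up with getattr on the config module.
# The config classes here are illustrative stand-ins.
import os


class ProdConfig:
    models = [{"module_name": "iris_model.iris_predict",
               "class_name": "IrisModel"}]


class TestConfig:
    models = []


class _ConfigModule:
    # stands in for "import config" in the service code
    ProdConfig = ProdConfig
    TestConfig = TestConfig


config = _ConfigModule
os.environ["APP_SETTINGS"] = "ProdConfig"
configuration = getattr(config, os.environ["APP_SETTINGS"])
```

Switching the environment variable to "TestConfig" would select the other class without any code changes.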
<p>A ZeroRPC service is built as a class that provides methods that are
exposed to the outside world as RPC calls. The class that will become
the service is defined like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">ModelZeroRPCService</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/service.py#L19">here</a>.</p>
<p>When the model service is started the __init__ method will be
executed:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model_manager</span><span class="o">.</span><span class="n">load_models</span><span class="p">(</span><span class="n">configuration</span><span class="o">=</span><span class="n">configuration</span><span class="o">.</span><span class="n">models</span><span class="p">)</span>
<span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">():</span>
<span class="n">endpoint</span> <span class="o">=</span> <span class="n">MLModelZeroRPCCEndpoint</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="o">=</span><span class="n">model</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">])</span>
<span class="n">operation_name</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">_predict"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">])</span>
<span class="nb">setattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">operation_name</span><span class="p">,</span> <span class="n">endpoint</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/service.py#L22-L30">here</a>.</p>
<p>The service starts by instantiating the ModelManager singleton, and
loading the models from the configuration. Next the service instantiates
one MLModelZeroRPCCEndpoint class for each model in the ModelManager and
attaches it to the "self" parameter with a dynamically created operation
name. The service method is mapped to the model's "predict" method by
the endpoint object that wraps it. This design lets the service host any
number of MLModel objects, since the code attaches them to the service
object dynamically. At the end of the initialization method, we have one
service method for each model hosted by the service.</p>
<p>The service is now able to receive prediction requests for the models
and return the predictions to the clients, but we can add some
functionality by exposing metadata about the models being hosted, the
get_models method does this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_models</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">models</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
<span class="k">return</span> <span class="n">models</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/service.py#L32-L36">here</a>.</p>
<p>The get_models procedure returns a list of models available for use,
but does not return all of the metadata available for a model. To
provide all of the metadata for a model, we'll add the
get_model_metadata method:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_model_metadata</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">qualified_name</span><span class="p">):</span>
<span class="n">model_metadata</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">model_manager</span><span class="o">.</span><span class="n">get_model_metadata</span><span class="p">(</span><span class="n">qualified_name</span><span class="o">=</span><span class="n">qualified_name</span><span class="p">)</span>
<span class="k">if</span> <span class="n">model_metadata</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">return</span> <span class="n">model_metadata</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Metadata not found for this model."</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/model_zerorpc_service/service.py#L38-L44">here</a>.</p>
<h1>Using the Service</h1>
<p>To show how to use the service, we wrote a few scripts in the <a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/tree/master/scripts">scripts
folder</a>
of the project. To execute the scripts we first have to start up the
service with these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">APP_SETTINGS</span><span class="o">=</span>ProdConfig
python model_zerorpc_service/service.py
</code></pre></div>
<p>The ZeroRPC Python package has a utility that allows for communication
with a ZeroRPC service from the command line. Now that we have a ZeroRPC
service running, we can execute this command to get a list of procedures
available on the service:</p>
<div class="highlight"><pre><span></span><code>zerorpc tcp://127.0.0.1:4242
connecting to <span class="s2">"tcp://127.0.0.1:4242"</span>
<span class="o">[</span>ModelZeroRPCService<span class="o">]</span>
get_model_metadata Return metadata about a model hosted by the service.
get_models Return list of models hosted <span class="k">in</span> this service.
iris_model_predict Predict with the Iris Model.
</code></pre></div>
<p>The ZeroRPC tool will return a description of the methods available in
the service. The iris_model_predict procedure's documentation string
was generated when we instantiated the model's endpoint.</p>
<p>Next, we'll call a procedure on the service with Python code. Getting
a list of the available models by calling the "get_models" procedure is
very simple:</p>
<div class="highlight"><pre><span></span><code><span class="n">client</span> <span class="o">=</span> <span class="n">zerorpc</span><span class="o">.</span><span class="n">Client</span><span class="p">()</span>
<span class="n">client</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s2">"tcp://127.0.0.1:4242"</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Result: </span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">result</span><span class="p">))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/scripts/get_models.py">here</a>.</p>
<p>Executing the code above should print out a list of models that are being
hosted by the service:</p>
<div class="highlight"><pre><span></span><code>Result: <span class="o">[{</span><span class="s1">'display_name'</span>: <span class="s1">'Iris Model'</span>, <span class="s1">'qualified_name'</span>:
<span class="s1">'iris_model'</span>, <span class="s1">'description'</span>: <span class="s1">'A machine learning model for</span>
<span class="s1">predicting the species of a flower based on its measurements.'</span>,
<span class="s1">'major_version'</span>: <span class="m">0</span>, <span class="s1">'minor_version'</span>: <span class="m">1</span><span class="o">}]</span>
</code></pre></div>
<p>Making a prediction with the service is just as easy:</p>
<div class="highlight"><pre><span></span><code><span class="n">client</span> <span class="o">=</span> <span class="n">zerorpc</span><span class="o">.</span><span class="n">Client</span><span class="p">()</span>
<span class="n">client</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s2">"tcp://127.0.0.1:4242"</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">iris_model_predict</span><span class="p">({</span><span class="s2">"sepal_length"</span><span class="p">:</span> <span class="mf">1.1</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.4</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">1.5</span><span class="p">})</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Result: </span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">result</span><span class="p">))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/zerorpc-ml-model-deployment/blob/master/scripts/predict_with_model.py#L5-L11">here</a>.</p>
<p>To see how exceptions are handled by the ZeroRPC service, we'll change
the code of the client to purposefully cause an exception in the MLModel
class:</p>
<div class="highlight"><pre><span></span><code><span class="n">client</span> <span class="o">=</span> <span class="n">zerorpc</span><span class="o">.</span><span class="n">Client</span><span class="p">()</span>
<span class="n">client</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s2">"tcp://127.0.0.1:4242"</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">iris_model_predict</span><span class="p">({</span><span class="s2">"sepal_length"</span><span class="p">:</span> <span class="mf">1.1</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.4</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="s2">"abc"</span><span class="p">})</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Result: </span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">result</span><span class="p">))</span>
</code></pre></div>
<p>When we execute the client code, we get this exception being thrown:</p>
<div class="highlight"><pre><span></span><code>python scripts/predict_with_model.py
Traceback <span class="o">(</span>most recent call last<span class="o">)</span>:
File <span class="s2">"scripts/predict_with_model.py"</span>, line <span class="m">15</span>, <span class="k">in</span> <module>
...
File /Users/brian/Code/zerorpc-ml-model-deployment/venv/lib/python3.7/site-packages/iris_model/iris_predict.py<span class="s2">",</span>
<span class="s2">line 51, in predict</span>
<span class="s2">raise MLModelSchemaValidationException("</span>Failed to validate input data:
<span class="o">{}</span><span class="s2">".format(str(e)))</span>
<span class="s2">ml_model_abc.MLModelSchemaValidationException: Failed to validate</span>
<span class="s2">input data: Key 'petal_width' error: 'abc' should be instance of 'float'</span>
</code></pre></div>
<h1>Closing</h1>
<p>In this blog post we've shown how it is possible to deploy a machine
learning model using the ZeroRPC framework. The service is able to host
any number of models that implement the MLModel interface. The service
codebase is simpler than that of a RESTful service, and the MessagePack
serialization format used by ZeroRPC is more lightweight than the JSON
format usually used by REST web services.
RPC services are also simpler to understand than REST services, since
they mimic a normal function call on the client side.</p>
<p>The ZeroRPC service has some benefits, but also has some drawbacks when
compared to gRPC. The ZeroRPC framework does not have any way to provide
schema information for the data structures that make up the request and
responses of the service. In comparison, gRPC Protocol Buffers require
the developer of the service to provide a full data contract for the
service, and REST services have JSON Schema and the OpenAPI
specification that can provide this information about the service. By
building the get_model_metadata endpoint, we've been able to provide
this information for each model hosted in the service, but not for the
whole service.</p>
<p>The ZeroRPC framework provides additional functionality for RPC calls
by allowing the server to stream responses back to the client. This lets
the server send prediction responses as they become available, through a
simple interface. In the future, it would be
interesting to leverage this feature of ZeroRPC to stream prediction
responses to the client.</p>A Websocket ML Model Deployment2020-04-04T09:25:00-05:002020-04-04T09:25:00-05:00Brian Schmidttag:www.tekhnoal.com,2020-04-04:/websocket-ml-model-deployment.html<p>In the world of web applications, the ability to create responsive and interactive experiences is limited when we do normal request-response requests against a REST API. In the request-response programming paradigm, requests are always initiated by the client system and fulfilled by the server and continuously sending and receiving data is not supported. To fix this problem, the Websocket standard was created. Websockets allow a client and service to exchange data in a bidirectional, full-duplex connection which stays open for a long period of time. This approach offers much higher efficiency in the communication between the server and client. Just like a normal HTTP connection, Websockets work in ports 80 and 443 and support proxies and load balancers. Websockets also allow the server to send data to the client without having first received a request from the client which helps us to build more interactive applications.</p><p>This blog post builds on the ideas started in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog posts</a>.</p>
<p>In this blog post I'll show how to deploy the same ML model that I
deployed as a batch job in this <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog post</a>,
as a task queue in this <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">blog post</a>,
inside an AWS Lambda in this <a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">blog post</a>,
as a Kafka streaming application in this <a href="https://www.tekhnoal.com/streaming-ml-model-deployment.html">blog post</a>,
as a gRPC service in this <a href="https://www.tekhnoal.com/grpc-ml-model-deployment.html">blog post</a>,
and as a MapReduce job in this <a href="https://www.tekhnoal.com/map-reduce-ml-model-deployment.html">blog post</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/websocket-ml-model-deployment">github repo</a>.</p>
<h1>Introduction</h1>
<p>In the world of web applications, the ability to create responsive and
interactive experiences is limited when we do normal request-response
requests against a REST API. In the request-response programming
paradigm, requests are always initiated by the client system and
fulfilled by the server and continuously sending and receiving data is
not supported. To fix this problem, the Websocket standard was created.
Websockets allow a client and service to exchange data in a
bidirectional, full-duplex connection which stays open for a long period
of time. This approach offers much higher efficiency in the
communication between the server and client. Just like a normal HTTP
connection, Websockets work in ports 80 and 443 and support proxies and
load balancers. Websockets also allow the server to send data to the
client without having first received a request from the client which
helps us to build more interactive applications.</p>
<p>Just like other web technologies, Websockets are useful for creating
applications that run in a web browser. Websockets are useful for
deploying machine learning models when the predictions made by the model
need to be available to a user interface running in a web browser. One
benefit of the Websocket protocol is that we are not limited to making a
prediction when the client requests it, since the server is able to send
a prediction from the model to the client at any time without waiting
for the client to make a prediction request. In this blog post we will
show how to build a Websocket service that works with machine learning
models.</p>
<h1>Package Structure</h1>
<p>To begin, we set up the project structure for the websocket service:</p>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">model_websocket_service</span> <span class="ss">(</span> <span class="nv">python</span> <span class="nv">package</span> <span class="k">for</span> <span class="nv">websocket</span> <span class="nv">service</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">static</span> <span class="ss">(</span><span class="nv">Javascript</span> <span class="nv">files</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">templates</span> <span class="ss">(</span><span class="nv">HTML</span> <span class="nv">templates</span> <span class="k">for</span> <span class="nv">UI</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">config</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">configuration</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">service</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">endpoints</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">Websocket</span> <span class="nv">handler</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">ml_model_manager</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">class</span> <span class="k">for</span> <span class="nv">managing</span> <span class="nv">models</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">schemas</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">schemas</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">API</span> <span class="nv">data</span> <span class="nv">objects</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">views</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">web</span> <span class="nv">views</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">UI</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">scripts</span> <span class="ss">(</span><span class="nv">test</span> <span class="nv">script</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span><span class="nv">unit</span> <span class="nv">tests</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">Dockerfile</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">README</span>.<span class="nv">md</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">setup</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span>
</code></pre></div>
<p>This structure can be seen here in the <a href="https://github.com/schmidtbri/websocket-ml-model-deployment">github repository</a>.</p>
<h1>Websockets</h1>
<p>Websockets are fundamentally different from normal HTTP connections.
They are full-duplex, which means that the client and server can
exchange data in both directions. Websocket connections are also
long-lived, which means that the connection stays open even when no
messages are being exchanged. Lastly, websocket connections are
event-based, which means that messages from the server are handled by
the client in an "event handler" function that is registered to an event
type. The same happens in the server code, which handles events from the
client by registering handlers. There are four default events that are
built into the Websocket protocol: open, message, error, and close.
Apart from these event types, we are free to add our own event types and
exchange messages through them.</p>
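<p>The registration-and-dispatch mechanism described above can be sketched in plain Python (this is a simplified stand-in for what a Websocket library does internally, not any library's actual API):</p>

```python
# Minimal sketch of event-based dispatch: handler functions register for
# an event type, and incoming events are routed to the matching handler.
handlers = {}


def on(event_type):
    """Decorator that registers a function as the handler for an event."""
    def register(func):
        handlers[event_type] = func
        return func
    return register


@on("message")
def handle_message(payload):
    return "received: {}".format(payload)


@on("close")
def handle_close(payload):
    return "connection closed"


def dispatch(event_type, payload=None):
    """Route an incoming event to its registered handler."""
    return handlers[event_type](payload)
```

Custom event types like "prediction_request" fit the same scheme: registering a handler under a new name is all that is needed to support a new message type.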
<h1>Installing the Model</h1>
<p>To begin working on a Websocket service that can host any ML model,
we'll need a model to work with. For this, we'll use the same model that
we've used in the previous blog posts, the iris_model package. The
package can be installed directly from the git repository where it is
hosted with this command:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/schmidtbri/ml-model-abc-improvements
</code></pre></div>
<p>This command should install the model code and parameters, along with
all of its dependencies. To make sure everything is working correctly,
we can make a prediction with the model in an interactive Python
session:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">iris_model.iris_predict</span> <span class="kn">import</span> <span class="n">IrisModel</span>
<span class="o">>>></span> <span class="n">model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">({</span><span class="s2">"sepal_length"</span><span class="p">:</span><span class="mf">1.1</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">1.3</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.4</span><span class="p">})</span>
<span class="p">{</span><span class="s1">'species'</span><span class="p">:</span> <span class="s1">'setosa'</span><span class="p">}</span>
</code></pre></div>
<p>Now that we have a working model in the Python environment, we'll need
to point the service to it. To do this, we'll add the IrisModel class to
the configuration in the config.py file:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Config</span><span class="p">(</span><span class="nb">dict</span><span class="p">):</span>
<span class="n">models</span> <span class="o">=</span> <span class="p">[{</span>
<span class="s2">"module_name"</span><span class="p">:</span> <span class="s2">"iris_model.iris_predict"</span><span class="p">,</span>
<span class="s2">"class_name"</span><span class="p">:</span> <span class="s2">"IrisModel"</span>
<span class="p">}]</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/config.py#L4-L15">here</a>.</p>
<p>This configuration gives us flexibility when adding and removing models
from the service. The service is able to host any number of models, as
long as they are installed in the environment and added to the
configuration. The module_name and class_name fields in the
configuration point to a class that implements the MLModel interface,
which allows the service to make predictions with the model.</p>
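<p>A hedged sketch of how a module_name/class_name entry can be resolved to a class at runtime (the helper name is illustrative, and a standard-library class stands in for the model class):</p>

```python
# Sketch of loading a class from a module_name/class_name config entry
# using importlib. A standard-library class stands in for a model class;
# load_model_class is an illustrative helper, not the project's code.
import importlib


def load_model_class(entry):
    """Import the module named in the entry and return the named class."""
    module = importlib.import_module(entry["module_name"])
    return getattr(module, entry["class_name"])


entry = {"module_name": "collections", "class_name": "OrderedDict"}
model_class = load_model_class(entry)
instance = model_class()
```

With this approach, adding a model to the service is a matter of installing its package and appending an entry to the configuration list.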
<p>As in previous blog posts, we'll use a singleton object to manage the
ML model objects that will be used to make predictions. The class that
the singleton object is instantiated from is called ModelManager. The
class is responsible for instantiating MLModel objects, managing the
instances, returning information about the MLModel objects, and
returning references to the objects when needed. The code for the
ModelManager class can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/model_manager.py">here</a>.
A complete explanation of the ModelManager class can be found in <a href="https://medium.com/@brianschmidt_78145/using-the-ml-model-base-class-7b984edf47c5">this
blog
post</a>.</p>
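<p>A minimal sketch of the singleton behavior (the internals shown here are illustrative; the project's ModelManager may implement it differently):</p>

```python
# Sketch of a singleton: every instantiation returns the same shared
# instance, so models loaded once are visible to all callers.
# Implementation details are illustrative, not the project's actual code.
class ModelManager:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = []
        return cls._instance

    def load_models(self, configuration):
        self._models.extend(configuration)

    def get_models(self):
        return list(self._models)


manager_a = ModelManager()
manager_a.load_models([{"class_name": "IrisModel"}])
manager_b = ModelManager()
```

Because both variables point at the same instance, models loaded at application startup are available anywhere the manager is instantiated.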
<h1>Defining the Service</h1>
<p>The websocket service is built around the <a href="https://flask.palletsprojects.com/en/1.1.x/">Flask
framework</a>,
which can be extended to support Websockets with the
<a href="https://flask-socketio.readthedocs.io/en/latest/">flask_socketio</a>
extension. The Flask application is initialized in the __init__.py
file of the package like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">app</span> <span class="o">=</span> <span class="n">Flask</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/__init__.py#L16">here</a>.</p>
<p>Now that we have an application object, we can load the configuration
into it:</p>
<div class="highlight"><pre><span></span><code><span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"APP_SETTINGS"</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">from_object</span><span class="p">(</span><span class="s2">"model_websocket_service.config.</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'APP_SETTINGS'</span><span class="p">]))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/__init__.py#L18-L20">here</a>.</p>
<p>The configuration is loaded according to the value in the APP_SETTINGS
environment variable. This allows us to change the settings based on the
environment we are running in. Now that we have the app configured we
can initialize the Flask extensions we'll be using:</p>
<div class="highlight"><pre><span></span><code><span class="n">bootstrap</span> <span class="o">=</span> <span class="n">Bootstrap</span><span class="p">(</span><span class="n">app</span><span class="p">)</span>
<span class="n">socketio</span> <span class="o">=</span> <span class="n">SocketIO</span><span class="p">(</span><span class="n">app</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/__init__.py#L22-L23">here</a>.</p>
<p>The Bootstrap extensions will be used to build a user interface and the
SocketIO extension will be used to handle the Websocket connections and
events. With the extensions loaded, we can now import the code that
handles the Websocket events, REST requests, and renders the views of
the UI:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">model_websocket_service.endpoints</span>
<span class="kn">import</span> <span class="nn">model_websocket_service.views</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/__init__.py#L26-L27">here</a>.</p>
<p>Lastly, we will instantiate the ModelManager singleton at application
startup. This function is executed by the Flask framework before the
application starts serving requests. The models that will be loaded are
retrieved from the configuration object that we loaded above.</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">before_first_request</span>
<span class="k">def</span> <span class="nf">instantiate_model_manager</span><span class="p">():</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">model_manager</span><span class="o">.</span><span class="n">load_models</span><span class="p">(</span><span class="n">configuration</span><span class="o">=</span><span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="p">[</span><span class="s2">"MODELS"</span><span class="p">])</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/__init__.py#L30-L36">here</a>.</p>
<p>With this code, we set up the basic Flask application that will handle
the Websocket events.</p>
<h1>Websocket Event Handler</h1>
<p>With the application set up, we can now work on the code that handles
the Websocket events. This code is in the <a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/endpoints.py">endpoints.py
module</a>.
To begin, we'll import the Flask app object and the socketio extension
object from the package:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">model_websocket_service</span> <span class="kn">import</span> <span class="n">app</span><span class="p">,</span> <span class="n">socketio</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/endpoints.py#L8">here</a>.</p>
<p>A Websocket handler is just a function that is decorated with the
@socketio.on() decorator. The decorator registers the function as a
Websocket event handler with the Flask framework, which will call the
function whenever an event of the type described in the decorator is
received by the application. We'll use the decorator here to handle
events of type "prediction_request", which will handle the prediction
requests that clients send to the server.</p>
<div class="highlight"><pre><span></span><code><span class="nd">@socketio</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s1">'prediction_request'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">message</span><span class="p">(</span><span class="n">message</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">prediction_request_schema</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">message</span><span class="p">)</span>
<span class="k">except</span> <span class="n">ValidationError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="n">response_data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s2">"DESERIALIZATION_ERROR"</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">))</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">error_response_schema</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">response_data</span><span class="p">)</span>
<span class="n">emit</span><span class="p">(</span><span class="s1">'prediction_error'</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span>
<span class="k">return</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/endpoints.py#L80-L90">here</a>.</p>
<p>The first thing we do when receiving a message from the client is to try
to deserialize it with the <a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/schemas.py#L64-L69">PredictionRequest
schema</a>.
This schema contains the inputs to the model predict() method and also
the model's qualified name. If the deserialization fails, we'll emit a
prediction error message back to the client using the
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/schemas.py#L55-L61">ErrorResponse
schema</a>.
The emit() function is provided by the socketio extension and is used to
send events to the client from the server.</p>
<p>Now that we have a deserialized prediction request from a client, we'll
try to get a reference to the model from the model manager. The service
will emit an ErrorResponse object back to the client system if it fails
to find the model that is requested by the client.</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">model_object</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model</span><span class="p">(</span><span class="n">qualified_name</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s2">"model_qualified_name"</span><span class="p">])</span>
<span class="k">if</span> <span class="n">model_object</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">response_data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s2">"model_qualified_name"</span><span class="p">],</span> <span class="nb">type</span><span class="o">=</span><span class="s2">"ERROR"</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="s2">"Model not found."</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">error_response_schema</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">response_data</span><span class="p">)</span>
<span class="n">emit</span><span class="p">(</span><span class="s1">'prediction_error'</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/endpoints.py#L92-L101">here</a>.</p>
<p>If the model is found, then this code will be executed:</p>
<div class="highlight"><pre><span></span><code><span class="k">else</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">model_object</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="s2">"input_data"</span><span class="p">])</span>
<span class="n">response_data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="o">=</span><span class="n">model_object</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span> <span class="n">prediction</span><span class="o">=</span><span class="n">prediction</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">prediction_response_schema</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">response_data</span><span class="p">)</span>
<span class="n">emit</span><span class="p">(</span><span class="s1">'prediction_response'</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span>
<span class="k">except</span> <span class="n">MLModelSchemaValidationException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="n">response_data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="o">=</span><span class="n">model_object</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s2">"SCHEMA_ERROR"</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">))</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">error_response_schema</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">response_data</span><span class="p">)</span>
<span class="n">emit</span><span class="p">(</span><span class="s1">'prediction_error'</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="n">response_data</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="o">=</span><span class="n">model_object</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s2">"ERROR"</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="s2">"Could not make a prediction."</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">error_response_schema</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">response_data</span><span class="p">)</span>
<span class="n">emit</span><span class="p">(</span><span class="s1">'prediction_error'</span><span class="p">,</span> <span class="n">response</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/endpoints.py#L102-L119">here</a>.</p>
<p>If the prediction is made successfully by the model, a
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/schemas.py#L72-L77">PredictionResponse
object</a>
is serialized and emitted back to the client through the
'prediction_response' event type. If the model raises an
MLModelSchemaValidationException error, the error is serialized and sent
back by emitting an <a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/76992cd67785476788c50add221d498310952ac9/model_websocket_service/schemas.py#L55-L61">ErrorResponse
object</a>
back to the client. If any other type of exception is raised, an
ErrorResponse object is created and sent back to the client.</p>
<p>The Websocket handler that we built in this section is the only one that
we need to add to the service in order to expose any machine learning
models to clients of the Websocket service. The handler is able to
forward prediction requests to any model that is loaded in the
ModelManager singleton. The handler is also able to handle any
exceptions raised by the model and return the error back to the client.</p>
<h2>REST Endpoints</h2>
<p>In order to make the Websocket service easy to use, we will be adding
two REST endpoints that expose data about the models that are being
hosted by the service. Even though the models can be reached directly by
connecting to the Websocket endpoint and sending prediction request
events, knowing which models are available and what data each model
expects is helpful for users of the service.</p>
<p>The first REST endpoint queries the ModelManager for information about
all of the models in it and returns the information as a JSON data
structure to the client.</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s2">"/api/models"</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s1">'GET'</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">get_models</span><span class="p">():</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">models</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
<span class="n">response_data</span> <span class="o">=</span> <span class="n">model_collection_schema</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">models</span><span class="o">=</span><span class="n">models</span><span class="p">))</span>
<span class="k">return</span> <span class="n">response_data</span><span class="p">,</span> <span class="mi">200</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/endpoints.py#L22-L41">here</a>.</p>
<p>The second REST endpoint is used to return metadata about a specific
model hosted by the service. The metadata returned includes the input
and output schemas that the model uses for its prediction function.</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s2">"/api/models/<qualified_name>/metadata"</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s1">'GET'</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">get_metadata</span><span class="p">(</span><span class="n">qualified_name</span><span class="p">):</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">metadata</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model_metadata</span><span class="p">(</span><span class="n">qualified_name</span><span class="o">=</span><span class="n">qualified_name</span><span class="p">)</span>
<span class="k">if</span> <span class="n">metadata</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">response_data</span> <span class="o">=</span> <span class="n">model_metadata_schema</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">metadata</span><span class="p">)</span>
<span class="k">return</span> <span class="n">Response</span><span class="p">(</span><span class="n">response_data</span><span class="p">,</span> <span class="n">status</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">mimetype</span><span class="o">=</span><span class="s1">'application/json'</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">response</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="nb">type</span><span class="o">=</span><span class="s2">"ERROR"</span><span class="p">,</span> <span class="n">message</span><span class="o">=</span><span class="s2">"Model not found."</span><span class="p">)</span>
<span class="n">response_data</span> <span class="o">=</span> <span class="n">error_response_schema</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
<span class="k">return</span> <span class="n">Response</span><span class="p">(</span><span class="n">response_data</span><span class="p">,</span> <span class="n">status</span><span class="o">=</span><span class="mi">400</span><span class="p">,</span> <span class="n">mimetype</span><span class="o">=</span><span class="s1">'application/json'</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/endpoints.py#L44-L77">here</a>.</p>
<h1>Using the Service</h1>
<p>In order to test the Websocket server we wrote a short python script
that connects through a websocket, sends a prediction request, and
receives and displays a prediction response. The script can be found in
the <a href="https://github.com/schmidtbri/websocket-ml-model-deployment/tree/master/scripts">scripts
folder</a>.</p>
<p>The script's main function connects to localhost on port 80 and sends a
single message to the prediction_request channel:</p>
<div class="highlight"><pre><span></span><code><span class="n">sio</span> <span class="o">=</span> <span class="n">socketio</span><span class="o">.</span><span class="n">Client</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="n">sio</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s1">'http://0.0.0.0:80'</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'model_qualified_name'</span><span class="p">:</span> <span class="s1">'iris_model'</span><span class="p">,</span> <span class="s1">'input_data'</span><span class="p">:</span> <span class="p">{</span><span class="s1">'sepal_length'</span><span class="p">:</span> <span class="mf">1.1</span><span class="p">,</span> <span class="s1">'sepal_width'</span><span class="p">:</span> <span class="mf">1.1</span><span class="p">,</span> <span class="s1">'petal_length'</span><span class="p">:</span> <span class="mf">1.1</span><span class="p">,</span> <span class="s1">'petal_width'</span><span class="p">:</span> <span class="mf">1.1</span><span class="p">}}</span>
<span class="n">sio</span><span class="o">.</span><span class="n">emit</span><span class="p">(</span><span class="s1">'prediction_request'</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/scripts/test_prediction.py#L4-L11">here</a>.</p>
<p>To receive a prediction response from the server, we register a function
that will be called on every message in the "prediction_response"
channel:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@sio</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s1">'prediction_response'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">on_message</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Prediction response: </span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">data</span><span class="p">)))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/scripts/test_prediction.py#L14-L16">here</a>.</p>
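<p>The script only listens for successful responses; a client will
usually also want a callback for the 'prediction_error' events that the
server emits. The sketch below shows such a handler as a plain function;
in the script it would be registered with the @sio.on('prediction_error')
decorator. The payload field names ('type' and 'message') mirror the
ErrorResponse schema, but the handler itself is hypothetical and not
part of the repository.</p>

```python
# Hypothetical handler for 'prediction_error' events; in the test script
# it would be registered with @sio.on('prediction_error').
def on_error(data):
    # The payload mirrors the ErrorResponse schema: a 'type' and a 'message'.
    formatted = 'Prediction error ({}): {}'.format(data.get('type'), data.get('message'))
    print(formatted)
    return formatted
```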
<p>To use the script, we first start the server with these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">APP_SETTINGS</span><span class="o">=</span>ProdConfig
gunicorn --worker-class eventlet -w <span class="m">1</span> -b <span class="m">0</span>.0.0.0:80 model_websocket_service:app
</code></pre></div>
<p>Then we can run the script with this command:</p>
<div class="highlight"><pre><span></span><code>python scripts/test_prediction.py
</code></pre></div>
<p>The script will send the prediction request and then print the response
from the server to the screen:</p>
<div class="highlight"><pre><span></span><code>Prediction response: {'prediction': {'species': 'setosa'}, 'model_qualified_name': 'iris_model'}
</code></pre></div>
<h1>Building a User Interface</h1>
<p>In order to show how to use the Websocket service in a real-world client
application we built a simple website around the Websocket and REST
endpoints that were described above. The user interface leverages the
models and metadata REST endpoints to display information about the
models being hosted by the service, and it uses the Websocket endpoint
to make predictions with the models.</p>
<p>This user interface is similar to the one we built for <a href="https://medium.com/@brianschmidt_78145/using-the-ml-model-base-class-7b984edf47c5">this blog
post</a>,
where we showed how to deploy models behind a Flask REST service. We are
reusing a lot of the same code here.</p>
<h2>Flask Views</h2>
<p>The Flask framework supports rendering HTML web pages through the
<a href="https://jinja.palletsprojects.com/en/2.11.x/">Jinja</a>
templating engine. We created an HTML template that displays the models
available through the service. The <a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/views.py#L11-L20">view
code</a>
uses the ModelManager object to get a list of the models being hosted,
then renders the list to an HTML document that is returned to the
client's web browser:</p>
<p><img alt="Index View" src="https://www.tekhnoal.com/index_view.png" width="100%"></p>
<p>In order to show a model's metadata, we built a view that queries the
model object directly and renders an HTML view with the information:</p>
<p><img alt="Metadata View" src="https://www.tekhnoal.com/metadata_view.png" width="100%"></p>
<p>Both of these views are rendered in the service and do not use the REST
endpoints to retrieve the information about the models.</p>
<h2>Dynamic Web Form</h2>
<p>The last webpage we'll build for the application is special because it
renders a dynamically generated form that is created from the model's
input schema. The webpage uses the model's metadata REST endpoint to get
the input schema of the model and uses the <a href="https://github.com/brutusin/json-forms">brutusin forms
package</a> to render
the form in the browser.</p>
<p>The form accepts input from the user and sends it to the server as a
Websocket event of type 'prediction_request'. The webpage also has a
Websocket event listener that is able to render all of the
'prediction_response' and 'prediction_error' Websocket events that the
server emits back to the client. The code for this webpage can be found
<a href="https://github.com/schmidtbri/websocket-ml-model-deployment/blob/master/model_websocket_service/templates/predict.html">here</a>.</p>
<p><img alt="Predict View" src="https://www.tekhnoal.com/predict_view.png" width="100%"></p>
<h1>Closing</h1>
<p>The Websocket protocol, which has wide support in modern browsers, is
a simple way to build more interactive web pages. By deploying ML models
in a Websocket service, we're able to integrate predictions from the
models into web applications quickly and easily. As in previous blog
posts, the service is built so that it is able to host any ML model that
implements the MLModel interface. Deploying a new ML model is as simple
as installing its Python package and adding the model to the
configuration of the service. Combining the Websocket protocol with
machine learning models is quick and easy if the code is written in the
right way.</p>A MapReduce ML Model Deployment2020-02-23T09:25:00-05:002020-02-23T09:25:00-05:00Brian Schmidttag:www.tekhnoal.com,2020-02-23:/map-reduce-ml-model-deployment.html<p>Because of the growing need to process large amounts of data across many computers, the Hadoop project was started in 2006. Hadoop is a set of software components that help to solve large scale data processing problems using clusters of computers. Hadoop supports mass data storage through the HDFS component and large scale data processing through the MapReduce component. Hadoop clusters have become a central part of the infrastructure of many companies because of their usefulness. In this blog post, we'll focus on the MapReduce component of Hadoop since we will be deploying a machine learning model, which is a compute-intensive process. MapReduce is a programming framework for data processing which is useful for processing large amounts of distributed data. MapReduce is able to handle errors and failures in the computation. MapReduce is also inherently parallel in nature but abstracts out that fact, making the code look like single-process code.</p><p>This blog post builds on the ideas started in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog posts</a>.</p>
<p>In this blog post I'll show how to deploy the same ML model that I
deployed as a batch job in this <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog
post</a>,
as a task queue in this <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">blog
post</a>,
inside an AWS Lambda in this <a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">blog
post</a>,
as a Kafka streaming application in this <a href="https://www.tekhnoal.com/streaming-ml-model-deployment.html">blog
post</a>,
and a gRPC service in this <a href="https://www.tekhnoal.com/grpc-ml-model-deployment.html">blog
post</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment">github
repo</a>.</p>
<h1>Introduction</h1>
<p>Because of the growing need to process large amounts of data across many
computers, the Hadoop project was started in 2006. Hadoop is a set of
software components that help to solve large scale data processing
problems using clusters of computers. Hadoop supports mass data storage
through the HDFS component and large scale data processing through the
MapReduce component. Hadoop clusters have become a central part of the
infrastructure of many companies because of their usefulness.</p>
<p>In this blog post, we'll focus on the MapReduce component of Hadoop
since we will be deploying a machine learning model, which is a
compute-intensive process. MapReduce is a programming framework for data
processing which is useful for processing large amounts of distributed
data. MapReduce is able to handle errors and failures in the
computation. MapReduce is also inherently parallel in nature but
abstracts out that fact, making the code look like single-process code.</p>
<p>Hadoop and MapReduce are used almost exclusively to process large
data sets. Even though machine learning models are trained over large
data sets, we'll focus on using MapReduce to execute predictions. Hadoop
and MapReduce should be considered when a prediction batch job needs to
be executed on millions or billions of records. This blog post is
similar to <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">a previous blog
post</a>
that deployed an ML model as a batch job, but that post was focused on
small scale batch jobs that could run quickly on single machines.</p>
<p>Because the results of a batch prediction job are stored and accessed
later by clients, the user can't interact with the model directly. A
client that needs the model's predictions cannot request them from the
ML model software component; it must read them from the data set
produced by the batch job.</p>
<h1>Package Structure</h1>
<p>To begin, I set up the project structure for the job package:</p>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">data</span> <span class="ss">(</span><span class="nv">data</span> <span class="nv">files</span> <span class="nv">used</span> <span class="k">for</span> <span class="nv">testing</span> <span class="nv">the</span> <span class="nv">job</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">model_map_reduce_job</span> <span class="ss">(</span><span class="nv">python</span> <span class="nv">package</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">map</span> <span class="nv">reduce</span> <span class="nv">job</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">config</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">ml_model_map_reduce_job</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">ml_model_manager</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span> <span class="nv">unit</span> <span class="nv">tests</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">mrjob</span>.<span class="nv">conf</span> <span class="ss">(</span><span class="nv">configuration</span> <span class="nv">file</span> <span class="k">for</span> <span class="nv">MapReduce</span> <span class="nv">framework</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">README</span>.<span class="nv">md</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">setup</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span>
</code></pre></div>
<p>This structure can be seen here in the <a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment">github
repository</a>.</p>
<h1>Building MapReduce Jobs</h1>
<p>A MapReduce job is made up of two basic steps: the map step and the
reduce step. Both steps are implemented as simple functions that receive
data, process it and return the results. The map step is responsible for
implementing filtering and sorting and the reduce step is responsible
for calculating aggregate results. The MapReduce system is responsible
for starting, managing, and stopping the code in the map and reduce
functions, for serializing and deserializing the data, and for managing
the redundancy and fault tolerance of the execution of the map and
reduce functions.</p>
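<p>To make the two steps concrete, here is a toy illustration in plain
Python, with no Hadoop involved: the mapper emits (key, value) pairs,
the framework sorts the pairs by key, and the reducer aggregates the
values for each key. The record fields are made up for the example.</p>

```python
from itertools import groupby

def mapper(record):
    # map step: transform each input record into (key, value) pairs
    yield record["species"], 1

def reducer(key, values):
    # reduce step: aggregate all of the values that share a key
    yield key, sum(values)

records = [{"species": "setosa"}, {"species": "setosa"}, {"species": "versicolor"}]

# the framework shuffles and sorts pairs by key before calling the reducer
pairs = sorted(pair for record in records for pair in mapper(record))
result = {}
for key, group in groupby(pairs, key=lambda pair: pair[0]):
    for out_key, out_value in reducer(key, (value for _, value in group)):
        result[out_key] = out_value

# result is now {"setosa": 2, "versicolor": 1}
```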
<p>The MapReduce implementation provided by Hadoop is able to do data
processing with map and reduce functions implemented in many different
programming languages by using the <a href="https://hadoop.apache.org/docs/r1.2.1/streaming.html">streaming
interface</a>.
In this blog post, we'll use this interface to run a model prediction
job using Python. This simplifies the deployment of the model greatly,
since we don't need to rewrite the model's prediction code in order to
deploy it to a Hadoop cluster. We'll be using the <a href="https://mrjob.readthedocs.io/en/latest/index.html">mrjob python
package</a>
to write the MapReduce job.</p>
<h1>Installing the Model</h1>
<p>In order to write a MapReduce job that is able to handle any machine
learning model, we'll start by installing a model into the environment.
For this we can use the same model we've used before, the iris_model
package. This package can be installed from a git repository with this
command:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/schmidtbri/ml-model-abc-improvements
</code></pre></div>
<p>Now that we have the model installed in the environment, we can try it
out by opening a python interpreter and entering this code:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">iris_model.iris_predict</span> <span class="kn">import</span> <span class="n">IrisModel</span>
<span class="o">>>></span> <span class="n">model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">({</span><span class="s2">"sepal_length"</span><span class="p">:</span><span class="mf">1.1</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">1.3</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.4</span><span class="p">})</span>
<span class="p">{</span><span class="s1">'species'</span><span class="p">:</span> <span class="s1">'setosa'</span><span class="p">}</span>
</code></pre></div>
<p>To load the model inside of the MapReduce job, we'll point at the
IrisModel class in a configuration file. The configuration file looks
like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Config</span><span class="p">(</span><span class="nb">dict</span><span class="p">):</span>
    <span class="n">models</span> <span class="o">=</span> <span class="p">[{</span>
        <span class="s2">"module_name"</span><span class="p">:</span> <span class="s2">"iris_model.iris_predict"</span><span class="p">,</span>
        <span class="s2">"class_name"</span><span class="p">:</span> <span class="s2">"IrisModel"</span>
    <span class="p">}]</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/model_mapreduce_job/config.py#L4-L12">here</a>.</p>
<p>This configuration will be used by the job to dynamically load the model
packages. The module_name and class_name fields allow the job to
import the class that contains the implementation of the model's
prediction algorithm. The models list can contain pointers to many
models, so there are no limitations to how many models can be hosted by
the MapReduce job.</p>
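<p>To make the dynamic loading concrete, here is a sketch of how a class
can be imported from such a configuration entry. The
load_model_class helper is hypothetical, and a standard-library class is
used as a stand-in for the real model class:</p>

```python
import importlib


def load_model_class(module_name: str, class_name: str):
    """Import a module by name and return the named class from it."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Using a standard-library class as a stand-in for the real model class:
model_class = load_model_class("collections", "OrderedDict")
print(model_class.__name__)  # OrderedDict
```

<p>The real job does essentially this for every entry in the models list,
then instantiates each class it finds.</p>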
<h1>Managing Models</h1>
<p>As in previous blog posts, we'll use a singleton object to manage the ML
model objects that will be used to make predictions. The class that the
singleton object is instantiated from is called "ModelManager". The
class is responsible for instantiating MLModel objects, managing the
instances, returning information about the MLModel objects, and
returning references to the objects when needed. The code for the
ModelManager class can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/model_mapreduce_job/model_manager.py">here</a>.
For a full explanation of the code in the class, read this <a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog
post</a>.</p>
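<p>Although the real ModelManager code is linked above, the singleton
behavior it relies on can be sketched in a few lines. This is a
simplified, hypothetical illustration rather than the actual class:</p>

```python
class ModelManager:
    """Simplified sketch of a singleton: every call to ModelManager()
    returns the same instance, so models are only loaded once."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            # Create the single instance on first use only.
            cls._instance = super().__new__(cls)
            cls._instance.models = {}
        return cls._instance


manager_a = ModelManager()
manager_b = ModelManager()
print(manager_a is manager_b)  # True
```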
<h1>MLModelMapReduceJob Class</h1>
<p>We now have the model package installed and the ModelManager class to
manage it, so we can start to write the MapReduce job itself. The
MapReduce job is defined as a subclass of the MRJob base class which
defines map() and reduce() methods that implement the functionality of
the job. To start, we'll load the right configuration by accessing the
APP_SETTINGS environment variable:</p>
<div class="highlight"><pre><span></span><code><span class="n">configuration</span> <span class="o">=</span> <span class="nb">__import__</span><span class="p">(</span><span class="s2">"model_mapreduce_job"</span><span class="p">)</span><span class="o">.</span> <span class="se">\</span>
    <span class="fm">__getattribute__</span><span class="p">(</span><span class="s2">"config"</span><span class="p">)</span><span class="o">.</span> <span class="se">\</span>
    <span class="fm">__getattribute__</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">"APP_SETTINGS"</span><span class="p">])</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/model_mapreduce_job/ml_model_map_reduce_job.py#L12-L15">here</a>.</p>
<p>With the configuration loaded, we'll instantiate the ModelManager
singleton which will hold the references to the model objects that we
want to host in this MapReduce job:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">model_manager</span><span class="o">.</span><span class="n">load_models</span><span class="p">(</span><span class="n">Config</span><span class="o">.</span><span class="n">models</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/model_mapreduce_job/ml_model_map_reduce_job.py#L17-L19">here</a>.</p>
<p>By putting this initialization at the top of the module, we can be sure
that the models are initialized one time only, when the module is loaded
by the python interpreter.</p>
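<p>This works because Python caches imported modules in sys.modules: the
module body, including the model-loading code, executes only on the first
import. A small illustration:</p>

```python
import sys

import json  # first import executes the module body and caches it
first_reference = sys.modules["json"]

import json  # second import returns the cached module object
print(sys.modules["json"] is first_reference)  # True
```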
<p>Now we can write the class that makes up the MapReduce job:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MLModelMapReduceJob</span><span class="p">(</span><span class="n">MRJob</span><span class="p">):</span>
<span class="n">INPUT_PROTOCOL</span> <span class="o">=</span> <span class="n">JSONValueProtocol</span>
<span class="n">OUTPUT_PROTOCOL</span> <span class="o">=</span> <span class="n">JSONProtocol</span>
<span class="n">DIRS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'../model_mapreduce_job'</span><span class="p">]</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/model_mapreduce_job/ml_model_map_reduce_job.py#L22-L29">here</a>.</p>
<p>The INPUT_PROTOCOL and OUTPUT_PROTOCOL class properties define the
input and output
<a href="https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html#protocols">protocols</a>
of the MapReduce steps. A protocol is a piece of code that reads and
writes data to the filesystem; it abstracts the map and reduce steps away
from the format in which the data is stored. The DIRS class
property tells the mrjob package that the code in this module depends on
code inside of the "model_mapreduce_job" directory, which causes mrjob to
copy that code whenever it creates a deployment package for this job.
These options help to simplify the code and deployment of the job.</p>
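<p>A protocol is simply an object with read() and write() methods. The
class below is a simplified stand-in for mrjob's own JSONValueProtocol,
shown only to illustrate what a protocol does; it is not the library's
implementation:</p>

```python
import json


class SimpleJSONValueProtocol:
    """Decode each input line to (None, value) and encode each output
    value back to a JSON line; the key is unused on the way in."""

    def read(self, line):
        return None, json.loads(line)

    def write(self, key, value):
        return json.dumps(value).encode("utf-8")


protocol = SimpleJSONValueProtocol()
key, value = protocol.read('{"sepal_length": 5.0}')
print(value)  # {'sepal_length': 5.0}
```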
<p>The job class needs to be initialized, so we'll add a __init__()
method:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
    <span class="nb">super</span><span class="p">(</span><span class="n">MLModelMapReduceJob</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">options</span><span class="o">.</span><span class="n">model_qualified_name</span><span class="p">)</span>
    <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
        <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"'</span><span class="si">{}</span><span class="s2">' not found in the ModelManager instance."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">options</span><span class="o">.</span><span class="n">model_qualified_name</span><span class="p">))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/model_mapreduce_job/ml_model_map_reduce_job.py#L31-L38">here</a>.</p>
<p>The __init__ method first calls the MRJob base class's __init__
method so that it can do framework-level initialization. Next, we ask
the ModelManager singleton for an instance of the model that we want to
host in the MapReduce job. The qualified name of the model is accessed
from the self.options.model_qualified_name variable, which is set by a
command line option. Lastly, we check that a model object was actually
returned by the ModelManager and raise an exception if it wasn't.</p>
<p>Next, the MapReduce job must be able to run on any model that is inside
of the ModelManager instance. To support this, we will add a command
line option to the job that accepts the qualified name of the model we
want to run:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">configure_args</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="nb">super</span><span class="p">(</span><span class="n">MLModelMapReduceJob</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">configure_args</span><span class="p">()</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">add_passthru_arg</span><span class="p">(</span><span class="s1">'--model_qualified_name'</span><span class="p">,</span> \
        <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">'Qualified name of the model.'</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/model_mapreduce_job/ml_model_map_reduce_job.py#L40-L43">here</a>.</p>
<p>This function allows us to extend the command line options already
supported by the MrJob framework. The command line argument passes
through the framework and is stored in the self.options object, which we
used in the code in the __init__ method to select the model we want
to use for the job.</p>
<p>Now that we have an initialized job class, we can write the code that
actually does the work of the MapReduce job. The mapper function looks
like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">mapper</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
    <span class="n">prediction</span> <span class="o">=</span> <span class="kc">None</span>
    <span class="k">try</span><span class="p">:</span>
        <span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
    <span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
        <span class="n">prediction</span> <span class="o">=</span> <span class="kc">None</span>
    <span class="k">yield</span> <span class="n">data</span><span class="p">,</span> <span class="n">prediction</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/model_mapreduce_job/ml_model_map_reduce_job.py#L45-L55">here</a>.</p>
<p>This function is simple: it receives a dictionary in the "data"
argument, makes a prediction with the model, and yields a tuple of the
prediction input and output. The data argument is a dictionary because
we used the "JSONValueProtocol" as the INPUT_PROTOCOL for this job.
This protocol deserializes a JSON string into a native Python object. By
using this protocol, we saved ourselves the trouble of having to
parse the input JSON ourselves in the mapper step. If the model fails to
make a prediction, then None is returned as the prediction. The
OUTPUT_PROTOCOL option is set to "JSONProtocol", which serializes the
key-value pair to two JSON strings separated by a tab character.</p>
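<p>The error-handling behavior of the mapper can be exercised in
isolation with a stand-in model object. The FakeModel class here is
hypothetical and only mimics the predict() interface:</p>

```python
class FakeModel:
    """Hypothetical stand-in that raises on inputs missing a field."""

    def predict(self, data):
        if "sepal_length" not in data:
            raise ValueError("bad schema")
        return {"species": "setosa"}


def map_record(model, data):
    """Mirror of the mapper logic: return (input, prediction),
    with None as the prediction when the model raises."""
    try:
        prediction = model.predict(data=data)
    except Exception:
        prediction = None
    return data, prediction


model = FakeModel()
good = map_record(model, {"sepal_length": 5.0})
bad = map_record(model, {"petal_width": 0.2})
print(good)  # ({'sepal_length': 5.0}, {'species': 'setosa'})
print(bad)   # ({'petal_width': 0.2}, None)
```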
<p>The output of the mapper step is always a key-value pair in which the
key must be unique across the inputs of the step. If any input is
repeated, the mapper step will make a prediction on it, but the
MapReduce framework will only return one result for the key to the next
step. This behavior sets up a limitation on our model: it must always
produce the same prediction given the same input, which is to say that
the model must make predictions deterministically. If the model is not
deterministic, the MapReduce framework will choose the first prediction
made for the input record. This may not matter in some situations but
may break any steps that use the results of this step if this behavior
is not handled correctly.</p>
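<p>The deduplication effect can be simulated directly: grouping
key-value pairs by their serialized key keeps only one value per
distinct input, which is why a non-deterministic model would silently
lose predictions. A small illustration:</p>

```python
import json

# Two identical inputs produce two mapper outputs...
mapper_output = [
    ({"sepal_length": 5.0}, {"species": "setosa"}),
    ({"sepal_length": 5.0}, {"species": "setosa"}),
]

# ...but grouping by key collapses them to a single result.
grouped = {json.dumps(key, sort_keys=True): value
           for key, value in mapper_output}
print(len(grouped))  # 1
```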
<p>This MapReduce job does not need a reduce step since we only need to
make predictions and return the results. However, this job can be
combined with other MapReduce jobs that do use reduce steps to make more
a complex data processing pipeline.</p>
<h1>Testing the Job</h1>
<p>Now that we have the code for the MapReduce job, we will test it locally
against a small data file. Because of the input and output protocol
options, the model is able to accept JSON files as input and it will
produce JSON files as output. Here is an example of the JSON that we
will feed to the job:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">"sepal_length"</span><span class="p">:</span><span class="w"> </span><span class="mf">5.0</span><span class="p">,</span><span class="w"> </span><span class="nt">"sepal_width"</span><span class="p">:</span><span class="w"> </span><span class="mf">3.2</span><span class="p">,</span><span class="w"> </span><span class="nt">"petal_length"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.2</span><span class="p">,</span><span class="w"> </span><span class="nt">"petal_width"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.2</span><span class="p">}</span><span class="w"></span>
<span class="p">{</span><span class="w"> </span><span class="nt">"sepal_length"</span><span class="p">:</span><span class="w"> </span><span class="mf">5.5</span><span class="p">,</span><span class="w"> </span><span class="nt">"sepal_width"</span><span class="p">:</span><span class="w"> </span><span class="mf">3.5</span><span class="p">,</span><span class="w"> </span><span class="nt">"petal_length"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.3</span><span class="p">,</span><span class="w"> </span><span class="nt">"petal_width"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.2</span><span class="p">}</span><span class="w"></span>
<span class="err">...</span><span class="w"></span>
</code></pre></div>
<p>The data file can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/data/input.ldjson">here</a>.</p>
<p>To execute the job locally, these commands need to be run:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">APP_SETTINGS</span><span class="o">=</span>ProdConfig
python model_mapreduce_job/ml_model_map_reduce_job.py <span class="se">\</span>
--model_qualified_name iris_model ./data/input.ldjson > data/output.ldjson
</code></pre></div>
<p>After the job runs, the output of the map step will be in the /data
folder. The input json string and resulting prediction will be on one
line of the file separated by a tab character. One input line had JSON
with a schema that the model could not accept, so the output should
contain a null prediction for that input. The --model_qualified_name
command line argument tells the job to use the iris_model model from
the ModelManager when running the job.</p>
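<p>Each output line can be parsed back into Python objects by splitting
on the tab character and decoding both sides as JSON. The helper below
is a hypothetical sketch of that parsing step:</p>

```python
import json


def parse_output_line(line):
    """Split a JSONProtocol output line into (key, value) objects."""
    key_json, value_json = line.rstrip("\n").split("\t")
    return json.loads(key_json), json.loads(value_json)


line = '{"sepal_length": 5.0}\t{"species": "setosa"}\n'
key, value = parse_output_line(line)
print(value)  # {'species': 'setosa'}
```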
<h1>Deploying to AWS</h1>
<p>The mrjob package supports running jobs in the <a href="https://aws.amazon.com/emr/">AWS Elastic Map
Reduce</a> (EMR) service. To run
the model job, we'll need an account in AWS. To interact with AWS, we'll
need to install the boto3 and awscli python packages:</p>
<div class="highlight"><pre><span></span><code>pip install boto3 awscli
</code></pre></div>
<p>Next we'll configure the API access keys. A set of access keys can be
generated and configured by <a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html">following these
instructions</a>.
The configuration will look like this:</p>
<div class="highlight"><pre><span></span><code>aws configure
AWS Access Key ID <span class="o">[</span>*******************<span class="o">]</span>: xxxxxxxxxxxxxxxxxx
AWS Secret Access Key <span class="o">[</span>******************<span class="o">]</span>: xxxxxxxxxxxxxxxxxxx
Default region name <span class="o">[</span>us-east-2<span class="o">]</span>: us-east-1
Default output format <span class="o">[</span>None<span class="o">]</span>:
</code></pre></div>
<p>In order to run the model job in AWS EMR, we'll first need to configure
a default role for the job to assume. A simple way to do this is already
supported in the AWS CLI tool. The command looks like this:</p>
<div class="highlight"><pre><span></span><code>aws emr create-default-roles
</code></pre></div>
<p>In order to set up the execution environment on the nodes before we
run the model prediction code, we'll need to execute a few commands. The
mrjob package supports this through a configuration file called
mrjob.conf. The config file is written in YAML and looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">runners</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">emr</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="nt">bootstrap</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">sudo yum update -y</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">sudo yum install git -y</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">sudo pip-3.6 install -r ./requirements.txt#</span><span class="w"></span>
<span class="w"> </span><span class="nt">setup</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">export PYTHONPATH=$PYTHONPATH:model_mapreduce_job/#</span><span class="w"></span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">export APP_SETTINGS=ProdConfig</span><span class="w"></span>
</code></pre></div>
<p>The file can be found
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/mrjob.conf">here</a>.</p>
<p>The file is able to hold configuration for several types of runners;
for now we'll only configure the EMR runner. The bootstrap section holds
commands that will be executed one time, when the cluster node is first
created. In this section we're updating the yum package manager,
installing the git client, and installing all of the python dependencies
we need to run the model package from the
<a href="https://github.com/schmidtbri/map-reduce-ml-model-deployment/blob/master/requirements.txt">requirements.txt</a>
file in the project.</p>
<p>The setup section holds commands that will be executed whenever the
MapReduce job starts up. In this section, we are setting up the
PYTHONPATH environment variable that the python interpreter will need in
order to find the code files that make up the job. We are also setting
the APP_SETTINGS environment variable that tells the job which
environment it is running in; for now we're running the job with the
ProdConfig settings.</p>
<p>Now that we have the credentials and configuration set up, we can run
the job in AWS. The command looks like this:</p>
<div class="highlight"><pre><span></span><code>python model_mapreduce_job/ml_model_map_reduce_job.py <span class="se">\</span>
--conf-path<span class="o">=</span>./mrjob.conf -r emr --iam-service-role EMR_DefaultRole <span class="se">\</span>
--model_qualified_name iris_model ./data/input.ldjson
</code></pre></div>
<p>The mrjob package will create an S3 bucket for the job, upload the code
and data to the S3 bucket, create an EMR cluster for the job, and run
the job. The results of the job will be stored into the same S3 bucket.</p>
<h1>Closing</h1>
<p>By using the MapReduce framework, we are able to make a large number of
predictions on a cluster of computers. Because of the simple design of
the MapReduce framework, a lot of the complexities of running a job on
many computers are abstracted out. This deployment option for machine
learning models enables us to deploy model prediction jobs against truly
massive data sets.</p>
<p>By building the prediction job so that it uses the MLModel interface,
the deployment of a model as a MapReduce job is greatly simplified. The
MapReduce job that we built in this blog post is able to host any
machine learning model that uses the MLModel interface which makes the
code highly reusable. Once again, the MLModel interface allowed us to
abstract out the complexities of building a machine learning model from
the complexities of deploying a machine learning model.</p>
<p>One of the drawbacks of the implementation is the fact that it only
accepts LDJSON encoded files as input to the job. This is for the sake
of simplicity, since having the field names along with the data makes
the code easier to understand. An improvement to the code would be to
enable other protocols so that we can use other file types with the job.
Furthermore, it would be easy to make the choice of input and output
protocols a command line option that can be chosen at execution time.</p>
<h1>A gRPC Service ML Model Deployment</h1>
<p>Published 2020-01-20 by Brian Schmidt.</p>
<p>This blog post builds on the ideas started in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog posts</a>.</p>
<p>In this blog post I'll show how to deploy the same ML model that I
deployed as a batch job in this <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog
post</a>,
as a task queue in this <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">blog post</a>,
inside an AWS Lambda in this <a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">blog post</a>,
and a Kafka streaming application in this <a href="https://www.tekhnoal.com/streaming-ml-model-deployment.html">blog post</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/grpc-ml-model-deployment">github repo</a>.</p>
<h1>Introduction</h1>
<p>With the rise of service oriented architectures and microservice
architectures, the <a href="https://grpc.io/">gRPC</a> system has
become a popular choice for building services. gRPC is a fairly new
system for doing inter-service communication through Remote Procedure
Calls (RPC) that started in Google in 2015. A remote procedure call is
an abstraction that allows a developer to make a call to a function that
runs in a separate process, but that looks like it executes locally.
gRPC is a standard for defining the data exchanged in an RPC call and
the API of the function through <a href="https://developers.google.com/protocol-buffers">protocol
buffers</a>.
gRPC also supports many other features, such as simple and streaming RPC
invocations, authentication, and load balancing.</p>
<p>Protocol buffers are defined through an interface definition language,
and the code that actually does the serialization/deserialization is
then generated from the definition. Once a protocol buffer definition
file is created, the protocol buffer definition can be compiled into
many different programming languages through a compiler. This allows
gRPC to be a cross-language standard for a common exchange format
between services.</p>
<p>gRPC services are coded in much the same way as a regular web service
but have several differences that will affect the service we'll build in
this blog post. First, protocol buffers are statically typed, which
makes the serialized data packages smaller but allows for less
flexibility in the code of the service. Second, protocol buffers must be
compiled to source code, which makes it harder to evolve services that
use them. Lastly, a protocol buffer is a binary data structure that is
optimized for size and processing speed, whereas a JSON data structure
is a string-based data structure optimized for simplicity and
readability. In <a href="https://dev.to/plutov/benchmarking-grpc-and-rest-in-go-565">performance
comparisons</a>,
protocol buffers have been found to be many times faster than JSON.</p>
<p>In previous blog posts, we've used JSON exclusively, to keep things
simple. JSON allowed the services and applications to deserialize the
data structure and send it directly to the model without having to worry
about the contents of the data structure. This is not possible with gRPC
since the service requires explicit knowledge of the schema of the
model's incoming and outgoing data.</p>
<h1>Package Structure</h1>
<div class="highlight"><pre><span></span><code>- model_grpc_service (python package for service)
    - __init__.py
    - config.py (configuration for the application)
    - ml_model_grpc_endpoint.py (MLModel gRPC endpoint class)
    - model_manager.py (model manager singleton class)
    - service.py (service code)
- scripts
    - client.py (single prediction test)
    - generate_proto.py
- tests (unit tests)
- Dockerfile
- Makefile
- model_service.proto (protocol buffer definition of gRPC service)
- model_service_pb2.py (python protocol buffer code)
- model_service_pb2_grpc.py (python gRPC service bindings)
- model_service_template.proto (protocol buffer template file)
- README.md
- requirements.txt
- setup.py
- test_requirements.txt
</code></pre></div>
<p>This structure can be seen in the <a href="https://github.com/schmidtbri/grpc-ml-model-deployment">github
repository</a>.</p>
<h1>Installing the Model</h1>
<p>In order to create a gRPC service for ML models we'll first install a
model package into the environment. We'll use the iris_model package,
which has been used in
<a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">several</a>
<a href="https://www.tekhnoal.com/streaming-ml-model-deployment.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog
posts</a>.
The model package itself was created in <a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">this blog
post</a>.
The model package can be installed from its git repository with this
command:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/schmidtbri/ml-model-abc-improvements
</code></pre></div>
<p>Now that we have the model package in the environment, we can add it to
the config.py module:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Config</span><span class="p">(</span><span class="nb">dict</span><span class="p">):</span>
    <span class="n">models</span> <span class="o">=</span> <span class="p">[{</span>
        <span class="s2">"module_name"</span><span class="p">:</span> <span class="s2">"iris_model.iris_predict"</span><span class="p">,</span>
        <span class="s2">"class_name"</span><span class="p">:</span> <span class="s2">"IrisModel"</span>
    <span class="p">}]</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_grpc_service/config.py#L4-L12">here</a>.</p>
<p>This configuration class is used by the service in all environments. The
module_name and class_name fields allow the application to find the
MLModel class that implements the prediction functionality of the
iris_model package. The list can hold information for many models, so
there's no limitation to how many models can be hosted by the service.</p>
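<p>For example, hosting a second model would only require adding another
entry to the list. Here is a sketch of what that might look like, with a
purely hypothetical second model package:</p>

```python
class Config(dict):
    """Configuration class listing the models hosted by the service."""
    models = [{
        "module_name": "iris_model.iris_predict",
        "class_name": "IrisModel"
    }, {
        # hypothetical second model, shown only to illustrate the list format
        "module_name": "other_model.other_predict",
        "class_name": "OtherModel"
    }]
```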
<p>The reason that we need to install the model package before we can write
any other code is that the model's input and output schemas are
needed to define the gRPC service's API.</p>
<h1>Generating a Protocol Buffer Definition</h1>
<p>Since we can't code the gRPC service until we have a .proto file with
the definition of the service's API, our first task is to generate a
.proto file from the models that will be hosted by the service. To
automatically generate the file from the iris_model's input and
output schemas we'll use the <a href="https://jinja.palletsprojects.com/en/2.10.x/">Jinja2 templating
tool</a>. Jinja2
generates documents by combining a template file with a data structure:
it allows a developer to isolate the unchanging parts of a document in
the template, and keeps the parts that change in the data structure.
First we'll create a template, and after that we'll add the schema
information to it to generate a .proto file for the service.</p>
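<p>As a quick illustration of how Jinja2 combines a template with a data
structure, here is a minimal sketch that renders a tiny template; it is
unrelated to the actual service template, which is shown below:</p>

```python
import jinja2

# the static text lives in the template; the variable parts come from the
# data structure passed to render()
template = jinja2.Template("message {{ name }}_input {}")

# rendering combines the template with the data to produce the document
rendered = template.render(name="iris_model")
```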
<h2>The Template File</h2>
<p>First we'll create the <a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_service_template.proto">template
file</a>
from which we'll generate the .proto file:</p>
<div class="highlight"><pre><span></span><code>syntax = "proto3";
package model_grpc_service;
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_service_template.proto#L1-L3">here</a>.</p>
<p>At the top of the template, we declare that we'll use the proto3 format,
and the name of the package is "model_grpc_service". Next, we'll
declare some data structures:</p>
<div class="highlight"><pre><span></span><code>message empty {}
message model {
string qualified_name = 1;
string display_name = 2;
string description = 3;
sint32 major_version = 4;
sint32 minor_version = 5;
string input_type = 6;
string output_type = 7;
string predict_operation = 8;
}
message model_collection {
repeated model models = 1;
}
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_service_template.proto#L5-L20">here</a>.</p>
<p>These data structures will be used by an operation that will be declared
further down in the template. The data structures hold information about
the models that are hosted by the service, including the names of the
input and output types and the name of the prediction operation for the
model. The model_collection type holds a list of model objects.</p>
<p>Next, we'll generate an input type for the models hosted by the service:</p>
<div class="highlight"><pre><span></span><code><span class="cp">{%</span> <span class="k">for</span> <span class="nv">model</span> <span class="k">in</span> <span class="nv">models</span> <span class="cp">%}</span><span class="x"></span>
<span class="x">message </span><span class="cp">{{</span> <span class="nv">model.qualified_name</span> <span class="cp">}}</span><span class="x">_input { </span>
<span class="x"> </span><span class="cp">{%</span> <span class="k">for</span> <span class="nv">field</span> <span class="k">in</span> <span class="nv">model.input_schema</span> <span class="cp">%}</span><span class="x"></span>
<span class="x"> </span><span class="cp">{{</span> <span class="nv">field.type</span> <span class="cp">}}</span><span class="x"> </span><span class="cp">{{</span> <span class="nv">field.name</span> <span class="cp">}}</span><span class="x"> = </span><span class="cp">{{</span> <span class="nv">field.index</span> <span class="cp">}}</span><span class="x">;</span>
<span class="x"> </span><span class="cp">{%</span> <span class="k">endfor</span> <span class="cp">%}</span><span class="x"></span>
<span class="x">}</span>
<span class="cp">{%</span> <span class="k">endfor</span> <span class="cp">%}</span><span class="x"></span>
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_service_template.proto#L22-L26">here</a>.</p>
<p>This template code uses the qualified name of a model and the schema of
the input of the model to generate a protocol buffer type that matches
the model's input. The name of the input type for a model always follows
this pattern: "<model_qualified_name>_input". Each field in the
input schema of the model is translated to the equivalent field type in
a protocol buffer and is given the same name. Lastly, an index is
generated and assigned to the field.</p>
<p>Next, we'll do the same for the output schema of the model:</p>
<div class="highlight"><pre><span></span><code><span class="cp">{%</span> <span class="k">for</span> <span class="nv">model</span> <span class="k">in</span> <span class="nv">models</span> <span class="cp">%}</span><span class="x"></span>
<span class="x">message </span><span class="cp">{{</span> <span class="nv">model.qualified_name</span> <span class="cp">}}</span><span class="x">_output { </span>
<span class="x"> </span><span class="cp">{%</span> <span class="k">for</span> <span class="nv">field</span> <span class="k">in</span> <span class="nv">model.output_schema</span> <span class="cp">%}</span><span class="x"></span>
<span class="x"> </span><span class="cp">{{</span> <span class="nv">field.type</span> <span class="cp">}}</span><span class="x"> </span><span class="cp">{{</span> <span class="nv">field.name</span> <span class="cp">}}</span><span class="x"> = </span><span class="cp">{{</span> <span class="nv">field.index</span> <span class="cp">}}</span><span class="x">;</span>
<span class="x"> </span><span class="cp">{%</span> <span class="k">endfor</span> <span class="cp">%}</span><span class="x"></span>
<span class="x">}</span>
<span class="cp">{%</span> <span class="k">endfor</span> <span class="cp">%}</span><span class="x"></span>
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_service_template.proto#L27-L31">here</a>.</p>
<p>Now we can start to define the service's API:</p>
<div class="highlight"><pre><span></span><code><span class="x">service ModelgRPCService {</span>
<span class="x"> rpc get_models (empty) returns (model_collection) {}</span>
<span class="x"> </span><span class="cp">{%</span> <span class="k">for</span> <span class="nv">model</span> <span class="k">in</span> <span class="nv">models</span> <span class="cp">%}</span><span class="x"></span>
<span class="x"> rpc </span><span class="cp">{{</span> <span class="nv">model.qualified_name</span> <span class="cp">}}</span><span class="x">_predict (</span><span class="cp">{{</span> <span class="nv">model.qualified_name</span> <span class="cp">}}</span><span class="x">_input) returns (</span><span class="cp">{{</span> <span class="nv">model.qualified_name</span> <span class="cp">}}</span><span class="x">_output) {}</span>
<span class="x"> </span><span class="cp">{%</span> <span class="k">endfor</span> <span class="cp">%}</span><span class="x"></span>
<span class="x">}</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_service_template.proto#L33-L38">here</a>.</p>
<p>This code defines the operations that the service implements. The first
operation is called "get_models" and it uses the first set of protobuf
data structures that we defined above. This operation is simple since it
does not change with the models that are being hosted by the gRPC
service. It accepts the "empty" type since it does not require any
inputs, and it returns the "model_collection" type.</p>
<p>Next, we will generate a set of prediction operations, one for each
model hosted by the service. The name of the predict operation always
follows this pattern: "<model_qualified_name>_predict". The model's
input and output types are added to the operation by name.</p>
<h2>Using the Template File</h2>
<p>This template file is now ready to be used, so we'll create a python
script that will take it and add information about the models that we
actually want to host in the service. The script to do this is in the
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/scripts/generate_proto.py">generate_proto.py
script</a>.</p>
<p>This code will make use of the ModelManager class that has been used in
<a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">several</a>
<a href="https://www.tekhnoal.com/streaming-ml-model-deployment.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog
posts</a>.
The ModelManager class is responsible for loading models from
configuration, maintaining references to the model objects, and
returning information about the models. In this section we'll use the
get_models() and get_model_metadata() operations to access the
information needed to generate the protocol buffer definition.</p>
<p>The script starts by instantiating the ModelManager and loading the
models from the configuration:</p>
<div class="highlight"><pre><span></span><code><span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">model_manager</span><span class="o">.</span><span class="n">load_models</span><span class="p">(</span><span class="n">Config</span><span class="o">.</span><span class="n">models</span><span class="p">)</span>
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/scripts/generate_proto.py#L9-L10">here</a>.</p>
<p>Then the script loads the Jinja2 template file:</p>
<div class="highlight"><pre><span></span><code><span class="n">template_loader</span> <span class="o">=</span> <span class="n">jinja2</span><span class="o">.</span><span class="n">FileSystemLoader</span><span class="p">(</span><span class="n">searchpath</span><span class="o">=</span><span class="s2">"./"</span><span class="p">)</span>
<span class="n">template_env</span> <span class="o">=</span> <span class="n">jinja2</span><span class="o">.</span><span class="n">Environment</span><span class="p">(</span><span class="n">loader</span><span class="o">=</span><span class="n">template_loader</span><span class="p">)</span>
<span class="n">template</span> <span class="o">=</span> <span class="n">template_env</span><span class="o">.</span><span class="n">get_template</span><span class="p">(</span><span class="s2">"model_service_template.proto"</span><span class="p">)</span>
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/scripts/generate_proto.py#L23-L25">here</a>.</p>
<p>Now that the template is loaded, we can generate the data structure that
will be passed to the template:</p>
<div class="highlight"><pre><span></span><code><span class="n">models</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">():</span>
<span class="n">model_details</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model_metadata</span><span class="p">(</span><span class="n">qualified_name</span><span class="o">=</span><span class="n">model</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">])</span>
<span class="n">models</span><span class="o">.</span><span class="n">append</span><span class="p">({</span>
<span class="s2">"qualified_name"</span><span class="p">:</span> <span class="n">model_details</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">],</span>
<span class="s2">"input_schema"</span><span class="p">:</span> <span class="p">[{</span>
<span class="s2">"index"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span>
<span class="s2">"name"</span><span class="p">:</span> <span class="n">field_name</span><span class="p">,</span>
<span class="s2">"type"</span><span class="p">:</span> <span class="n">type_mappings</span><span class="p">[</span><span class="n">model_details</span><span class="p">[</span><span class="s2">"input_schema"</span><span class="p">][</span><span class="s2">"properties"</span><span class="p">][</span><span class="n">field_name</span><span class="p">][</span><span class="s2">"type"</span><span class="p">]]</span>
<span class="p">}</span> <span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">field_name</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">model_details</span><span class="p">[</span><span class="s2">"input_schema"</span><span class="p">][</span><span class="s2">"properties"</span><span class="p">])],</span>
<span class="s2">"output_schema"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="s2">"index"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">),</span>
<span class="s2">"name"</span><span class="p">:</span> <span class="n">field_name</span><span class="p">,</span>
<span class="s2">"type"</span><span class="p">:</span> <span class="n">type_mappings</span><span class="p">[</span><span class="n">model_details</span><span class="p">[</span><span class="s2">"output_schema"</span><span class="p">]</span> <span class="p">[</span><span class="s2">"properties"</span><span class="p">][</span><span class="n">field_name</span><span class="p">][</span><span class="s2">"type"</span><span class="p">]]</span>
<span class="p">}</span> <span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">field_name</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">model_details</span><span class="p">[</span><span class="s2">"output_schema"</span><span class="p">][</span><span class="s2">"properties"</span><span class="p">])]</span>
<span class="p">})</span>
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/scripts/generate_proto.py#L28-L47">here</a>.</p>
<p>The code builds a dictionary for each model in the ModelManager,
containing the model's qualified name, input schema, and output schema.
The python data types are converted to the equivalent protocol buffer
types as it goes along. The resulting list of dictionaries is the data
structure that is used by the Jinja2 template defined above to generate
a protocol buffer definition.</p>
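<p>The type_mappings dictionary referenced in the code above translates
the type names found in the models' JSON schemas into protocol buffer
type names. A minimal sketch of such a mapping is shown here with assumed
type pairs; the actual mapping is defined in the generate_proto.py
script:</p>

```python
# hypothetical mapping from JSON schema types to protocol buffer types;
# the real mapping lives in the generate_proto.py script
type_mappings = {
    "number": "double",
    "integer": "sint64",
    "string": "string",
    "boolean": "bool",
}

# a field's schema type is translated just like in the script above
proto_type = type_mappings["number"]
```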
<p>Lastly, we'll render the template with the information we just extracted
from the models and then save the generated file to disk:</p>
<div class="highlight"><pre><span></span><code><span class="n">output_text</span> <span class="o">=</span> <span class="n">template</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="n">models</span><span class="o">=</span><span class="n">models</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s2">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">output_text</span><span class="p">)</span>
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/scripts/generate_proto.py#L50-L53">here</a>.</p>
<p>Now that the template and the script that uses it are complete, we can
generate a protocol buffer definition for the service with these
commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
python scripts/generate_proto.py --output_file<span class="o">=</span>model_service.proto
</code></pre></div>
<p>The file generated by the command above is called "model_service.proto"
and it can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_service.proto">here</a>.
The protocol buffer definition contains the types needed for the
get_models operation as well as the operation itself. It also contains
the types and operations needed to interact with the iris_model, which
were automatically extracted from the information provided by the model.</p>
<p>By using a template and script approach to generating a protocol buffer
definition we are able to host any number of models inside of the gRPC
service. This is possible because every model that will be hosted is
required to expose its input and output schema through the MLModel
interface.</p>
<h1>Defining the Service</h1>
<p>Now that we have a protocol buffer definition for the gRPC service we
can actually start writing the code to implement the service itself. To
do this, we first need to compile the protocol buffer into its python
implementation. This is done with this command:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
python -m grpc_tools.protoc --proto_path<span class="o">=</span>. --python_out<span class="o">=</span>. --grpc_python_out<span class="o">=</span>. model_service.proto
</code></pre></div>
<p>This command generates two files: the model_service_pb2.py file and
the model_service_pb2_grpc.py file. The model_service_pb2.py file
contains the python data structures that will serialize and deserialize
from native python types to the protocol buffer binary format. The
model_service_pb2_grpc.py file contains the bindings that will allow
us to write a service that implements the operations defined in the
protocol buffer definition and also to write client code that can call
the implementations.</p>
<p>We'll start by creating a python file that contains the main service
codebase. We'll also implement the get_models operation in this file,
since it is a static endpoint that does not depend on the presence of
any particular model.</p>
<p>The gRPC service is defined as a class that inherits from a "Servicer"
class that was generated by the protoc compiler:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">ModelgRPCServiceServicer</span><span class="p">(</span><span class="n">model_service_pb2_grpc</span><span class="o">.</span><span class="n">ModelgRPCServiceServicer</span><span class="p">):</span>
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_grpc_service/service.py#L20-L21">here</a>.</p>
<p>Within the class, each operation is defined as a method with the same
name as the operation in the .proto file. The get_models operation is
defined like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_models</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">request</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
<span class="n">model_data</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
<span class="n">models</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">model_data</span><span class="p">:</span>
<span class="n">response_model</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">qualified_name</span><span class="o">=</span><span class="n">m</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">],</span>
<span class="n">display_name</span><span class="o">=</span><span class="n">m</span><span class="p">[</span><span class="s2">"display_name"</span><span class="p">],</span>
<span class="n">description</span><span class="o">=</span><span class="n">m</span><span class="p">[</span><span class="s2">"description"</span><span class="p">],</span>
<span class="n">major_version</span><span class="o">=</span><span class="n">m</span><span class="p">[</span><span class="s2">"major_version"</span><span class="p">],</span>
<span class="n">minor_version</span><span class="o">=</span><span class="n">m</span><span class="p">[</span><span class="s2">"minor_version"</span><span class="p">],</span>
<span class="n">input_type</span><span class="o">=</span><span class="s2">"</span><span class="si">{}</span><span class="s2">_input"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">]),</span>
<span class="n">output_type</span><span class="o">=</span><span class="s2">"</span><span class="si">{}</span><span class="s2">_output"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">]),</span>
<span class="n">predict_operation</span><span class="o">=</span><span class="s2">"</span><span class="si">{}</span><span class="s2">_predict"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">m</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">]))</span>
<span class="n">models</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">response_model</span><span class="p">)</span>
<span class="n">response_models</span> <span class="o">=</span> <span class="n">model_collection</span><span class="p">()</span>
<span class="n">response_models</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">models</span><span class="p">)</span>
<span class="k">return</span> <span class="n">response_models</span>
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_grpc_service/service.py#L33-L52">here</a>.</p>
<p>The operation does not receive any data in the request and returns a
model_collection data structure in the response. The model_collection
data structure was defined in the .proto file and compiled into a python
class by the protoc compiler. In order to fill the model_collection, we
iterate through the data returned by the ModelManager creating a list of
model objects as we go along. We then create the model_collection from
the list and return it to the client.</p>
<h1>MLModelgRPCEndpoint Class</h1>
<p>In order for the service to host any model that uses the MLModel base
class, we'll need to create a class that translates the protocol buffer
data structures into the native python data structures used by the
models. This class will be instantiated for every model that is hosted
by the service.</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MLModelgRPCEndpoint</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_grpc_service/ml_model_grpc_endpoint.py#L12-L13">here</a>.</p>
<p>When the service starts up, we'll create one instance of this class
for every model. The __init__ method looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model_qualified_name</span><span class="p">):</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"'</span><span class="si">{}</span><span class="s2">' not found in ModelManager instance."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="p">))</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Initializing endpoint for model: </span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_grpc_service/ml_model_grpc_endpoint.py#L15-L30">here</a>.</p>
<p>The __init__ method has one argument called "model_qualified_name"
which tells the endpoint class which model it will be hosting. The
__init__ method gets a reference to the ModelManager object that is
managed by the service, then it gets a reference to the model object
from the ModelManager object using the model_qualified_name argument.
Lastly, we check that the model instance is actually available in the
ModelManager, raising a ValueError if it is not.</p>
<p>Now that we have an instance of the endpoint for the MLModel object, we
need to write a method that will make the predict method available as a
gRPC endpoint. We'll do this by defining the __call__ method on the
endpoint class. When a <a href="https://www.journaldev.com/22761/python-callable-__call__">__call__
method</a>
is attached to a class, it turns all instances of the class into
callables, which allows instances of the class to be used like
functions. This will be useful later when we need to initialize a
dynamic number of endpoints in the gRPC service.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">request</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">MessageToDict</span><span class="p">(</span><span class="n">request</span><span class="p">,</span> <span class="n">preserving_proto_field_name</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="n">output_protobuf_name</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">_output"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">)</span>
<span class="n">output_protobuf</span> <span class="o">=</span> <span class="n">MLModelgRPCEndpoint</span><span class="o">.</span><span class="n">_get_protobuf</span><span class="p">(</span><span class="n">output_protobuf_name</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">output_protobuf</span><span class="p">(</span><span class="o">**</span><span class="n">prediction</span><span class="p">)</span>
<span class="k">return</span> <span class="n">response</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_grpc_service/ml_model_grpc_endpoint.py#L32-L50">here</a>.</p>
<p>The method uses the MessageToDict function from the protobuf package to
turn a protocol buffer data structure into a Python dictionary. The
dictionary is then passed into the model's predict method and a
prediction is returned.</p>
<p>Now that we have a prediction, we have to find the right protocol buffer
data structure in which to return the prediction result to the client.
To do this, a special method called
"<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_grpc_service/ml_model_grpc_endpoint.py#L52-L54">_get_protobuf</a>"
is used, which goes into the model_service_pb2.py module where the
python protocol buffer definitions are stored and dynamically imports
the correct class for the output of the model. For example, the
iris_model's output protocol buffer definition is called
"iris_model_output". This lookup is possible because the output
protocol buffer of a model is always named according to the same
pattern. In the last step, we hand the model's prediction over to the
protocol buffer class, which initializes itself with the prediction
data, and we return the resulting object.</p>
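<p>This kind of lookup can be done with Python's built-in getattr
function. Here is a simplified sketch of the idea, using a stand-in
namespace object in place of the real model_service_pb2 module:</p>

```python
import types

# stand-in for the generated model_service_pb2 module; in the real service
# this attribute would be a protocol buffer message class
model_service_pb2 = types.SimpleNamespace(iris_model_output=dict)

def _get_protobuf(protobuf_name):
    # look up the protocol buffer class by name at runtime
    return getattr(model_service_pb2, protobuf_name)

# the output type name always follows the "{qualified_name}_output" pattern
output_class = _get_protobuf("iris_model_output")
```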
<h1>Creating gRPC Endpoints Dynamically</h1>
<p>Now that we have a class that can handle any model object, we need to
connect it to the service. To do this, we'll create an __init__
method in the service class that will execute when the service starts
up:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model_manager</span><span class="o">.</span><span class="n">load_models</span><span class="p">(</span><span class="n">configuration</span><span class="o">=</span><span class="n">Config</span><span class="o">.</span><span class="n">models</span><span class="p">)</span>
<span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">():</span>
<span class="n">endpoint</span> <span class="o">=</span> <span class="n">MLModelgRPCEndpoint</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="o">=</span><span class="n">model</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">])</span>
<span class="n">operation_name</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">_predict"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">])</span>
<span class="nb">setattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">operation_name</span><span class="p">,</span> <span class="n">endpoint</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/model_grpc_service/service.py#L23-L31">here</a>.</p>
<p>The __init__ method first instantiates the ModelManager class and
loads the models listed in the configuration. Once the models are in
memory, we create an endpoint object for each one in a loop. For each
model, we create an MLModelgRPCEndpoint object which is given the
model's qualified name. Then we generate the model's operation name
which matches the operation name for the model's predict operation
listed in the .proto file. For example, the iris_model's predict
operation is named "iris_model_predict". Lastly, we use the operation
name and dynamically set an attribute on the service class that attaches
the newly created endpoint to the class. This last step allows the
service to find the right endpoint for the operation when a call for a
prediction from a certain model is received. Because each endpoint
object is callable, the service can invoke it as if it were a method of
the class, even though it is actually an instance of a separate endpoint
class.</p>
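<p>The idea can be sketched with a minimal, hypothetical example: a callable object assigned to an instance with setattr can be invoked through the instance just like a method:</p>

```python
class Endpoint:
    """A callable object standing in for one service operation."""

    def __init__(self, name):
        self.name = name

    def __call__(self, request):
        # a real endpoint would call the model's predict method;
        # here we just echo the request back
        return "{} handled {}".format(self.name, request)


class Service:
    pass


service = Service()

# attach the endpoint under the generated operation name, just as the
# service's __init__ does with setattr
setattr(service, "iris_model_predict", Endpoint("iris_model_predict"))

# the callable attribute can now be invoked like a method
print(service.iris_model_predict("request"))
```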
<h1>Using the Service</h1>
<p>We now have a complete service that we can test out. To do this we'll
execute these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">APP_SETTINGS</span><span class="o">=</span>ProdConfig
python model_grpc_service/service.py
</code></pre></div>
<p>In order to test out the service, I created a simple script that sends a
single gRPC request to the service. The script is found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/scripts/client.py">here</a>.
To send a request to the get_models operation, the code looks like
this:</p>
<div class="highlight"><pre><span></span><code><span class="k">with</span> <span class="n">grpc</span><span class="o">.</span><span class="n">insecure_channel</span><span class="p">(</span><span class="s2">"localhost:50051"</span><span class="p">)</span> <span class="k">as</span> <span class="n">channel</span><span class="p">:</span>
<span class="n">stub</span> <span class="o">=</span> <span class="n">ModelgRPCServiceStub</span><span class="p">(</span><span class="n">channel</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">stub</span><span class="o">.</span><span class="n">get_models</span><span class="p">(</span><span class="n">empty</span><span class="p">())</span>
<span class="nb">print</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/grpc-ml-model-deployment/blob/master/scripts/client.py#L9-L15">here</a>.</p>
<p>To send a test request to the iris_model_predict operation of the
service, execute this command:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
python scripts/client.py --iris_model_predict
</code></pre></div>
<p>The script will contact the service running locally, make a prediction
with some sample data and print out the prediction result.</p>
<h1>Closing</h1>
<p>In this blog post we've shown how to deploy an ML model inside a gRPC
service. As gRPC becomes more popular, the option of deploying ML models
as gRPC services is becoming more attractive. As in previous blog posts,
we've built the service so that it can support any number of ML models,
as long as they implement the ML Model interface. This is one more type
of deployment that we implemented without having to modify the
iris_model package. The ability to deploy an ML model in different ways
without having to rewrite any part of the model code is very valuable
and ensures good software engineering practices.</p>
<p>By using gRPC to deploy an MLModel, we're able to take advantage of all
of the features of gRPC. These benefits include lightweight and fast
serialization of messages and built-in support for streaming. The
ability to document a service API using protocol buffers also simplifies
the documentation and roll out of a new service. Lastly, the ability to
compile service and client codebases from the protocol buffer
definitions allows us to avoid many common errors.</p>
<p>In previous blog posts, deploying a new model was as simple as
installing the model package into the environment and adding it to the
configuration of the application. The schema of the model's inputs and
outputs did not affect the application code at all. In the code of this
blog post, we have to do more work because of the nature of protocol
buffers, since the generated code in the project is specific to a set of
models. Because of this, adding a new model to the gRPC service requires
us to generate a new .proto file from the model's input and output
schemas, generate python code from the .proto file, and finally add the
model to the configuration of the service. The extra steps make it more
complex to deploy the service.</p>
<p>In the future, the service could be improved by handling more complex
schemas, since currently the schema mapping between native python types
and protocol buffers only supports simple data structures. Another way
to improve the service is to add support for streaming endpoints for
each model. Lastly, protocol buffers have a mechanism for evolving
message schemas; the code could be improved by safely evolving the schema
of the service through this mechanism when the model schema changes.</p>
<h1>A Streaming ML Model Deployment</h1>
<p>December 29, 2019, by Brian Schmidt</p>
<p>This blog post builds on the ideas started in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog posts</a>.</p>
<p>In this blog post I'll show how to deploy the same ML model that I
deployed as a batch job in this <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog
post</a>,
as a task queue in this <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">blog
post</a>,
and inside an AWS Lambda in this <a href="https://www.tekhnoal.com/lambda-ml-model-deployment.html">blog
post</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/streaming-ml-model-deployment">github
repo</a>.</p>
<h1>Introduction</h1>
<p>In general, when a client communicates with a software service two
patterns are available: synchronous and asynchronous communication. When
doing synchronous communication, a message is sent to the service which
blocks the sender until the operation is done and the result is returned
to the client. With an asynchronous message, the service receives the
message and does not block the sender of the message while it does the
processing. We've already seen an asynchronous deployment for a machine
learning model in a <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">previous blog
post</a>.
In this blog post, we'll show a similar type of deployment that is
useful in different situations. We'll be focusing on deploying an ML
model as part of a stream processing system.</p>
<p><a href="https://en.wikipedia.org/wiki/Stream_processing">Stream processing</a>
is a data processing paradigm that treats a dataset as an unending
stream of ordered records. A stream processor works by receiving a
record from a data stream, processing it, and putting it in another data
stream. This approach is different from <a href="https://en.wikipedia.org/wiki/Batch_processing">batch
processing</a>,
in which a process sees a data set as a batch of records that are
processed together in one processing run. Stream processing is
inherently asynchronous, since a producer of records does not have to
coordinate with the process that consumes the records.</p>
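<p>The distinction can be sketched with a toy example; the doubling "model" here is purely illustrative. A stream processor sees one record at a time, while a batch processor operates on the whole dataset in one run:</p>

```python
def double(record):
    return record * 2

def stream_process(records):
    # a stream processor consumes one record at a time and emits
    # results as it goes; it never holds the whole dataset
    for record in records:
        yield double(record)

def batch_process(records):
    # a batch processor receives the entire dataset in one run
    return [double(record) for record in records]

stream_results = list(stream_process(iter([1, 2, 3])))
batch_results = batch_process([1, 2, 3])
print(stream_results, batch_results)  # [2, 4, 6] [2, 4, 6]
```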
<p>In order for a stream processor to receive messages from producers, a
<a href="https://en.wikipedia.org/wiki/Message_broker">message broker</a> is
often used. In this case, the message broker acts as middleware that
enables producers and consumers to communicate without being explicitly
aware of each other. The message broker allows the system to be more
<a href="https://en.wikipedia.org/wiki/Service_loose_coupling_principle">decoupled</a>
than in other types of software architectures.</p>
<p>In a <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">previous blog
post</a>,
we used Redis as a message broker to deploy a model inside a task queue.
One difference between this blog post and that one is
the lack of a result backend, since we are not going to store the
results of a prediction into a result store for later retrieval. The ML
model stream processor we'll build will pick up data used for prediction
from the message broker and put the resulting predictions back into the
message broker. Instead of Redis, we'll be using Kafka as the message
broker.</p>
<h1>Software Architecture</h1>
<p><img alt="Architecture" src="https://www.tekhnoal.com/architecture.png" width="100%"></p>
<p>The model stream processor application we will build will communicate
with other software components through topics on a message broker. A
topic is a channel of communication that exists in a message broker. A
software service can "produce" messages to a topic and also "consume"
messages from a topic. Each model will need three topics for its own
use: an input topic from which it will receive data used to make
predictions, an output topic to which it will write the prediction
results, and an error topic to which it will write any input messages
that caused an error to occur. The error topic is essentially an
<a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/InvalidMessageChannel.html">invalid message
channel</a>
for the model.</p>
<h1>Kafka for Stream Processing</h1>
<p>To show how to deploy an ML model as a stream processor, we'll be using
<a href="https://en.wikipedia.org/wiki/Apache_Kafka">Kafka</a> as the
message broker service. Over the last few years, Kafka has become an
important tool for doing stream processing because of its high
performance and rich tool ecosystem.</p>
<p>To connect to Kafka from python, we'll use the <a href="https://aiokafka.readthedocs.io/en/stable/">aiokafka python
library</a>. This
library can be used to produce and consume messages on kafka as well as
other operations. The aiokafka library uses the <a href="https://realpython.com/async-io-python/">asyncio
library</a> to
improve the performance of the application. Asyncio is a library in
python that helps to write concurrent code that performs IO-bound
operations in a more performant manner. The async/await syntax will
appear in the code of this blog post; I won't go out of my way to
explain it since there are many better places to learn about this
programming paradigm.</p>
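<p>As a minimal illustration (unrelated to Kafka), a coroutine defined with async def suspends at each await, letting the event loop run other IO-bound work in the meantime:</p>

```python
import asyncio

async def fetch(name):
    # "await" suspends this coroutine so the event loop can run
    # other IO-bound work in the meantime
    await asyncio.sleep(0)
    return "result for {}".format(name)

# the application's main() uses asyncio.get_event_loop() the same way;
# new_event_loop() is used here so the sketch runs anywhere
loop = asyncio.new_event_loop()
result = loop.run_until_complete(fetch("iris_model"))
loop.close()
print(result)  # result for iris_model
```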
<h1>Package Structure</h1>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">model_stream_processor</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">app</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">application</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">config</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">configuration</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">application</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">ml_model_stream_processor</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">MLModel</span> <span class="nv">stream</span> <span class="nv">processor</span> <span class="nv">class</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">model_manager</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">model</span> <span class="nv">manager</span> <span class="nv">singleton</span> <span class="nv">class</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">scripts</span>
<span class="o">-</span> <span class="nv">create_topics</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">script</span> <span class="k">for</span> <span class="nv">automating</span> <span class="nv">topic</span> <span class="nv">creation</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">receive_messages</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">script</span> <span class="k">for</span> <span class="nv">receiving</span> <span class="nv">messages</span> <span class="nv">from</span> <span class="nv">a</span> <span class="nv">topic</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">send_messages</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">script</span> <span class="k">for</span> <span class="nv">sending</span> <span class="nv">messages</span> <span class="nv">to</span> <span class="nv">a</span> <span class="nv">topic</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span><span class="nv">unit</span> <span class="nv">test</span> <span class="nv">suite</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">README</span>.<span class="nv">md</span>
<span class="o">-</span> <span class="nv">docker</span><span class="o">-</span><span class="nv">compose</span>.<span class="nv">yml</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">setup</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span>
</code></pre></div>
<p>This structure can be seen in the <a href="https://github.com/schmidtbri/streaming-ml-model-deployment">github
repository</a>.</p>
<h1>MLModelStreamProcessor Class</h1>
<p>To be able to have an MLModel that sends and receives data from Kafka
topics, we'll write a class that wraps around an MLModel instance. The
class will take care of finding and connecting to Kafka brokers,
serializing and deserializing the messages from Kafka, and detecting
errors.</p>
<p>We'll start by creating the class:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MLModelStreamProcessor</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="sd">"""Processor class for MLModel stream processors."""</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/ml_model_stream_processor.py#L12-L13">here</a>.</p>
<p>The __init__() method of the class contains a lot of the
functionality of the class:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model_qualified_name</span><span class="p">,</span> <span class="n">loop</span><span class="p">,</span> <span class="n">bootstrap_servers</span><span class="p">):</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"'</span><span class="si">{}</span><span class="s2">' not found in ModelManager instance."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="p">))</span>
<span class="n">base_topic_name</span> <span class="o">=</span> <span class="s2">"model_stream_processor.</span><span class="si">{}</span><span class="s2">.</span><span class="si">{}</span><span class="s2">.</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">model_qualified_name</span><span class="p">,</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">major_version</span><span class="p">,</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">minor_version</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">consumer_topic</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">.inputs"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">base_topic_name</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">producer_topic</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">.outputs"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">base_topic_name</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">error_producer_topic</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">.errors"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">base_topic_name</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_consumer</span> <span class="o">=</span> <span class="n">AIOKafkaConsumer</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">consumer_topic</span><span class="p">,</span> <span class="n">loop</span><span class="o">=</span><span class="n">loop</span><span class="p">,</span>
<span class="n">bootstrap_servers</span><span class="o">=</span><span class="n">bootstrap_servers</span><span class="p">,</span> <span class="n">group_id</span><span class="o">=</span><span class="vm">__name__</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_producer</span> <span class="o">=</span> <span class="n">AIOKafkaProducer</span><span class="p">(</span><span class="n">loop</span><span class="o">=</span><span class="n">loop</span><span class="p">,</span>
<span class="n">bootstrap_servers</span><span class="o">=</span><span class="n">bootstrap_servers</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/ml_model_stream_processor.py#L15-L54">here</a>.</p>
<p>When the processor class is first instantiated, it first gets an
instance of the ModelManager class and retrieves from it the model it
will manage. The model is identified by
the qualified_name, which should be unique for the model we're trying
to deploy. The __init__ method also accepts an asyncio loop that is
created once for the whole application, and also the name of the kafka
bootstrap server to use. Before we try to finish initializing the stream
processor, we check that the model instance actually exists within the
ModelManager singleton, if the model can't be found we'll raise an
exception.</p>
<p>After that, we generate the kafka topic names for the three topics that
each model needs. The topic names follow a fixed pattern and cannot
be parameterized. The base_topic_name is the same for all three topics
and contains the name of the stream processing application, the
qualified name of the model, and the model's major and minor versions.
Then we can generate the three unique names of the topics we'll need for
the model from the base_topic_name. The consumer topic will contain
input data for the model, the producer topic will contain the output of
the model for successful predictions, and the error producer topic will
contain all of the input messages that caused errors in the model.</p>
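<p>As an illustration, here is a sketch of the naming pattern, assuming the iris_model at version 1.0:</p>

```python
# assumed example values for the iris model from the blog post
model_qualified_name = "iris_model"
major_version = 1
minor_version = 0

# base name shared by all three topics, as in the __init__ method
base_topic_name = "model_stream_processor.{}.{}.{}".format(
    model_qualified_name, major_version, minor_version)

consumer_topic = "{}.inputs".format(base_topic_name)
producer_topic = "{}.outputs".format(base_topic_name)
error_producer_topic = "{}.errors".format(base_topic_name)

print(consumer_topic)        # model_stream_processor.iris_model.1.0.inputs
print(producer_topic)        # model_stream_processor.iris_model.1.0.outputs
print(error_producer_topic)  # model_stream_processor.iris_model.1.0.errors
```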
<p>Once this is done, we are finally able to create the consumer and
producer objects that we'll use to read from and write to Kafka. These
objects are created once and reused throughout the lifecycle of the
stream processor. The producer and consumer classes are provided by the
aiokafka package.</p>
<p>Even though the stream processor object is initialized once the
__init__() method finishes executing, we still need to start the
producer and consumer objects within it. The start() method is called at
application startup to connect the
stream processor to the Kafka topics that it will use:</p>
<div class="highlight"><pre><span></span><code><span class="k">async</span> <span class="k">def</span> <span class="nf">start</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_consumer</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_producer</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/ml_model_stream_processor.py#L61-L65">here</a>.</p>
<p>Once the stream processor class is initialized and started, we need to
process messages:</p>
<div class="highlight"><pre><span></span><code><span class="k">async</span> <span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">async</span> <span class="k">for</span> <span class="n">message</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_consumer</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">message</span><span class="o">.</span><span class="n">value</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="n">serialized_prediction</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">prediction</span><span class="p">)</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_producer</span><span class="o">.</span><span class="n">send_and_wait</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">producer_topic</span><span class="p">,</span> <span class="n">serialized_prediction</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_producer</span><span class="o">.</span><span class="n">send_and_wait</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">error_producer_topic</span><span class="p">,</span> <span class="n">message</span><span class="o">.</span><span class="n">value</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/ml_model_stream_processor.py#L67-L77">here</a>.</p>
<p>The process() method uses an async for loop to continuously process
messages from the input Kafka topic. The message is then deserialized
using JSON, and the resulting data structure is sent to the model's
predict() method. The prediction result is then serialized to a JSON
string and encoded to a byte array. Lastly, the prediction is written to
the output Kafka topic. If any exceptions are raised during this
process, the input message that caused the error is written to the error
Kafka topic so that we can try to reprocess it later (or try some other
error handling method).</p>
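<p>The per-message logic can be sketched outside of Kafka as a plain function; the FakeModel class here is a stand-in for a real MLModel instance, and the function returns a destination name instead of writing to a topic:</p>

```python
import json

class FakeModel:
    """Stand-in for an MLModel instance; the real model's predict()
    accepts a data dict and returns a prediction dict."""
    def predict(self, data):
        return {"species": "setosa"}

def process_message(model, raw_message):
    # mirrors the body of process(): deserialize, predict, serialize;
    # on failure the original message is routed to the error topic
    try:
        data = json.loads(raw_message)
        prediction = model.predict(data=data)
        return "outputs", json.dumps(prediction).encode()
    except Exception:
        return "errors", raw_message

destination, payload = process_message(FakeModel(), b'{"sepal_length": 5.1}')
print(destination, payload)  # outputs b'{"species": "setosa"}'

destination, payload = process_message(FakeModel(), b"not json")
print(destination)  # errors
```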
<p>Just as the Kafka producer and consumer are started in the start()
method of the class, we need a stop() method so that they can be shut
down gracefully:</p>
<div class="highlight"><pre><span></span><code><span class="k">async</span> <span class="k">def</span> <span class="nf">stop</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_consumer</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_producer</span><span class="o">.</span><span class="n">stop</span><span class="p">()</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/ml_model_stream_processor.py#L79-L83">here</a>.</p>
<h1>Installing the Model</h1>
<p>Now that we have a streaming processor class, we can install a model
package that will be hosted by the class. To do this, we'll use the
iris_model package that we built in a <a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous blog
post</a>.
The model package can be installed from its git repository with this
command:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/schmidtbri/ml-model-abc-improvements
</code></pre></div>
<p>Now we can add the model's details to the config.py module so that we
can dynamically load the model into the application later:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Config</span><span class="p">(</span><span class="nb">dict</span><span class="p">):</span>
<span class="n">models</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span>
<span class="s2">"module_name"</span><span class="p">:</span> <span class="s2">"iris_model.iris_predict"</span><span class="p">,</span>
<span class="s2">"class_name"</span><span class="p">:</span> <span class="s2">"IrisModel"</span>
<span class="p">}</span>
<span class="p">]</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/config.py#L4-L12">here</a>.</p>
<p>This configuration class is used by the application in all environments.
The module_name and class_name fields allow the application to find
the MLModel class that implements the prediction functionality of the
iris_model package.</p>
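<p>As a rough sketch of how such a configuration entry can drive dynamic loading (the real work happens inside ModelManager.load_models; collections.Counter stands in here for the model class, since the iris_model package may not be installed):</p>

```python
import importlib

# hypothetical configuration entry; collections.Counter stands in for
# iris_model.iris_predict.IrisModel
model_config = {"module_name": "collections", "class_name": "Counter"}

# import the module by name, then look up the class inside it
module = importlib.import_module(model_config["module_name"])
model_class = getattr(module, model_config["class_name"])
model = model_class()
print(type(model).__name__)  # Counter
```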
<h1>Streaming Application</h1>
<p>In order to use the MLModelStreamProcessor class, we need to write code
that will dynamically instantiate it from configuration for each MLModel
class that will be hosted by the application. We'll do this in the
app.py module:</p>
<div class="highlight"><pre><span></span><code><span class="n">configuration</span> <span class="o">=</span> <span class="nb">__import__</span><span class="p">(</span><span class="s2">"model_stream_processor"</span><span class="p">)</span><span class="o">.</span> \
<span class="fm">__getattribute__</span><span class="p">(</span><span class="s2">"config"</span><span class="p">)</span><span class="o">.</span> \
<span class="fm">__getattribute__</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">"APP_SETTINGS"</span><span class="p">])</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">model_manager</span><span class="o">.</span><span class="n">load_models</span><span class="p">(</span><span class="n">Config</span><span class="o">.</span><span class="n">models</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/app.py#L13-L20">here</a>.</p>
<p>The application starts by importing a configuration class from the
config.py module; the class to use is selected by an environment
variable called "APP_SETTINGS". The application also
instantiates the ModelManager singleton that hosts the models. A full
explanation of the ModelManager class can be found in
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">previous</a>
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment">blog
posts</a>.</p>
<p>Next we'll create the function that actually starts and runs the
application:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="n">loop</span> <span class="o">=</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">get_event_loop</span><span class="p">()</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">set_event_loop</span><span class="p">(</span><span class="n">loop</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/app.py#L23-L27">here</a>.</p>
<p>The main() function of the application first gets a reference to an
asyncio event loop that will be shared by all of the stream processors in the
application. The loop allows the streaming processors to efficiently
cooperate to do IO-bound tasks like writing to the network.</p>
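<p>As a minimal illustration of why a shared loop matters, two coroutines
that each wait on a (simulated) IO operation can overlap their waits; this
sketch is independent of the application code:</p>

```python
import asyncio
import time

async def fake_io_task(name, delay):
    # stands in for an IO-bound operation, e.g. a network write
    await asyncio.sleep(delay)
    return name

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

start = time.monotonic()
# both coroutines wait concurrently, so this takes ~0.2s, not ~0.4s
results = loop.run_until_complete(
    asyncio.gather(fake_io_task("a", 0.2), fake_io_task("b", 0.2)))
elapsed = time.monotonic() - start
loop.close()
```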
<p>Once we have an event loop, we can start instantiating the streaming
processors:</p>
<div class="highlight"><pre><span></span><code><span class="n">stream_processors</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">():</span>
<span class="n">stream_processors</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">MLModelStreamProcessor</span><span class="p">(</span>
<span class="n">model_qualified_name</span><span class="o">=</span><span class="n">model</span><span class="p">[</span><span class="s2">"qualified_name"</span><span class="p">],</span>
<span class="n">loop</span><span class="o">=</span><span class="n">loop</span><span class="p">,</span>
<span class="n">bootstrap_servers</span><span class="o">=</span><span class="n">configuration</span><span class="o">.</span><span class="n">bootstrap_servers</span><span class="p">))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/app.py#L29-L34">here</a>.</p>
<p>Each stream processor is responsible for hosting one MLModel object from
the ModelManager singleton that we initialized above.</p>
<p>The stream processors are not started up and connected to a Kafka topic
yet, so we start them up like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">stream_processor</span> <span class="ow">in</span> <span class="n">stream_processors</span><span class="p">:</span>
<span class="n">loop</span><span class="o">.</span><span class="n">run_until_complete</span><span class="p">(</span><span class="n">stream_processor</span><span class="o">.</span><span class="n">start</span><span class="p">())</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/app.py#L36-L37">here</a>.</p>
<p>Each stream processor is started by calling its start() method. Since
the method is asynchronous, it is called by using the
run_until_complete() method of the asyncio loop.</p>
<div class="highlight"><pre><span></span><code><span class="k">try</span><span class="p">:</span>
<span class="k">for</span> <span class="n">stream_processor</span> <span class="ow">in</span> <span class="n">stream_processors</span><span class="p">:</span>
<span class="n">loop</span><span class="o">.</span><span class="n">run_until_complete</span><span class="p">(</span><span class="n">stream_processor</span><span class="o">.</span><span class="n">process</span><span class="p">())</span>
<span class="k">except</span> <span class="ne">KeyboardInterrupt</span><span class="p">:</span>
<span class="n">logging</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Process interrupted."</span><span class="p">)</span>
<span class="k">finally</span><span class="p">:</span>
<span class="k">for</span> <span class="n">stream_processor</span> <span class="ow">in</span> <span class="n">stream_processors</span><span class="p">:</span>
<span class="n">loop</span><span class="o">.</span><span class="n">run_until_complete</span><span class="p">(</span><span class="n">stream_processor</span><span class="o">.</span><span class="n">stop</span><span class="p">())</span>
<span class="n">loop</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">logging</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Successfully shutdown the processors."</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/app.py#L39-L48">here</a>.</p>
<p>When all of the stream processors are started up, we are ready to
process messages from Kafka. To do this, we call the process() method of
each stream processor with the asyncio loop. The loop will run the
processors forever, unless a keyboard interrupt is received. When an
interrupt happens, each processor is stopped by calling the stop()
method, then we close the asyncio loop itself, and then we can exit the
application.</p>
<p>The application is started from the command line with this code at the
bottom of the module:</p>
<div class="highlight"><pre><span></span><code><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">main</span><span class="p">()</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/model_stream_processor/app.py#L51-L52">here</a>.</p>
<p>Now that we have an application that can run the stream processor
classes, we can test things against a Kafka broker instance.</p>
<h1>Setting Up a Development Environment</h1>
<p>To set up a development environment we'll use docker images with the
docker compose tool. The docker images come from the <a href="https://hub.docker.com/u/confluentinc/">official
dockerhub repository</a> of
Confluent, the company founded by the original creators of Kafka. The
<a href="https://docs.docker.com/compose/">docker-compose tool</a> is
useful for building development environments because it automates a lot
of steps that would need to be performed manually.</p>
<p>The docker-compose.yml file in the project root contains configuration
for three services:</p>
<ul>
<li><a href="https://zookeeper.apache.org/">zookeeper</a>, a service for maintaining shared configuration and doing synchronization</li>
<li><a href="https://kafka.apache.org/">kafka</a>, the message broker, which depends on zookeeper</li>
<li><a href="https://www.confluent.io/confluent-control-center/">confluent control center</a>, a user interface service useful for debugging</li>
</ul>
<p>The <a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/docker-compose.yml">docker-compose.yml</a>
file contains the docker image information, configuration options, and
network settings for each service. It also contains dependency
information for each service so that they are started in the right
order.</p>
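<p>In outline, a compose file for this kind of environment looks something
like the following. This is an illustrative sketch only; the image versions,
ports, and options are assumptions, not copied from the project's
docker-compose.yml file.</p>

```yaml
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  control-center:
    image: confluentinc/cp-enterprise-control-center:latest
    depends_on:
      - kafka
    ports:
      - "9021:9021"
```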
<p>To start up the three services, we need to execute this command from the
root of the project:</p>
<div class="highlight"><pre><span></span><code>docker-compose up
</code></pre></div>
<p>To see if everything came up correctly, execute this command in another
shell:</p>
<div class="highlight"><pre><span></span><code>docker-compose ps
</code></pre></div>
<p>If everything looks good, there should be three docker containers running
and the confluent control center UI should be accessible at this URL:
<a href="http://localhost:9021/clusters">http://localhost:9021/</a>.</p>
<h1>Creating Kafka Topics</h1>
<p>In order to more easily create the topics needed to deploy the stream
processor for a model, I created a simple command line tool. The tool
reads the configuration of the streaming application, generates the
correct topic names, connects to the Kafka broker, and creates the topics
for each model. The tool can be found in the scripts folder in the
<a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/scripts/create_topics.py">create_topics.py</a>
module.</p>
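<p>The core of such a script can be sketched as follows. The helper names
here are my own and kafka-python is assumed as the client library; the real
create_topics.py script in the repository may differ.</p>

```python
def topic_names(model_qualified_name: str, model_version: str) -> list:
    """Build the three topic names used by one model's stream processor."""
    base = "model_stream_processor.{}.{}".format(model_qualified_name, model_version)
    return [base + "." + suffix for suffix in ("inputs", "outputs", "errors")]

def create_topics(bootstrap_servers: str, models: list) -> None:
    """Create the topics for every (qualified_name, version) pair given."""
    # kafka-python provides an admin client that can create topics
    from kafka.admin import KafkaAdminClient, NewTopic
    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    admin.create_topics(new_topics=[
        NewTopic(name=name, num_partitions=1, replication_factor=1)
        for qualified_name, version in models
        for name in topic_names(qualified_name, version)])

# usage: create_topics("localhost:9092", [("iris_model", "0.1")])
```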
<p>To use the tool, execute these commands from the root of the project:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
python scripts/create_topics.py --bootstrap_servers<span class="o">=</span>localhost:9092
</code></pre></div>
<p>The first command sets the PYTHONPATH environment variable so that the
configuration module can be found; the second command executes the CLI
tool that creates the topics.</p>
<p>Now we can go into the confluent control center UI and see the topics
that were just created:</p>
<p><img alt="Topics" src="https://www.tekhnoal.com/topics.png" width="100%"></p>
<p>Since the configuration points at the iris_model package, there are now
three topics for that model's stream processor. If more models are
listed in the configuration of the application, more topics would be
created by the tool.</p>
<h1>Running the Application</h1>
<p>Now that we have the broker and topics for the stream processor, we can
start up the application and send some messages to the model.</p>
<p>First, we'll start the application with these commands in a new command
shell:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">APP_SETTINGS</span><span class="o">=</span>ProdConfig
<span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
python model_stream_processor/app.py
</code></pre></div>
<p>The streaming processor for the iris_model wrote these messages to the
log:</p>
<div class="highlight"><pre><span></span><code>INFO:model_stream_processor:Initializing stream processor <span class="k">for</span> model: iris_model
INFO:model_stream_processor:iris_model stream processor: Consuming messages from topic..
INFO:model_stream_processor:iris_model stream processor: Producing messages to topics...
INFO:model_stream_processor:iris_model stream processor: Starting consumer and producer.
</code></pre></div>
<p>The stream processor is now ready to receive messages in the "inputs"
topic. To more easily send messages to a topic, I built a simple CLI
tool that reads messages from stdin and sends them to the topic. The tool
is in the <a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/scripts/send_messages.py">send_messages.py
module</a>.
To use the tool, execute this command in a new command shell:</p>
<div class="highlight"><pre><span></span><code>python scripts/send_messages.py --topic<span class="o">=</span>model_stream_processor.iris_model.0.1.inputs --bootstrap_servers<span class="o">=</span>localhost:9092
</code></pre></div>
<p>The tool will start and wait for input from the command line, every time
the ENTER key is pressed the contents of stdin will be sent to the
"inputs" topic.</p>
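<p>The heart of such a tool can be sketched like this. The function names
are my own and aiokafka is assumed as the asyncio-compatible Kafka client;
the actual send_messages.py module may be structured differently.</p>

```python
import sys

def encode_message(line: str) -> bytes:
    """Strip the trailing newline from a line of stdin and encode it."""
    return line.rstrip("\n").encode("utf-8")

async def send_lines(topic: str, bootstrap_servers: str) -> None:
    """Send each line typed on stdin to the given topic as one message."""
    from aiokafka import AIOKafkaProducer
    producer = AIOKafkaProducer(bootstrap_servers=bootstrap_servers)
    await producer.start()
    try:
        for line in sys.stdin:  # one message per ENTER press
            await producer.send_and_wait(topic, encode_message(line))
    finally:
        await producer.stop()
```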
<p>To be able to see the output messages produced by the stream processor I
built a similar CLI tool that consumes messages from a topic and prints
them to the screen. The tool is in the <a href="https://github.com/schmidtbri/streaming-ml-model-deployment/blob/master/scripts/receive_messages.py">receive_messages.py
module</a>.
To use it, execute this command in a new command shell:</p>
<div class="highlight"><pre><span></span><code>python scripts/receive_messages.py --topic<span class="o">=</span>model_stream_processor.iris_model.0.1.outputs --bootstrap_servers<span class="o">=</span>localhost:9092
</code></pre></div>
<p>Now we're ready to send some messages to the stream processor. To do
this, type the following JSON string into the send_messages command
that we started above:</p>
<div class="highlight"><pre><span></span><code>{"sepal_length": 1.1, "sepal_width": 1.2, "petal_length": 1.3, "petal_width": 1.4}
</code></pre></div>
<p>The receive_messages command should print out the prediction message
from the model stream processor:</p>
<div class="highlight"><pre><span></span><code>{"species": "setosa"}
</code></pre></div>
<p>The last thing we can test is the error handling of the stream
processor. To do this we have to listen to the "errors" topic of the
stream processor. We can do this by executing the receive_messages
command with the "errors" topic as an option:</p>
<div class="highlight"><pre><span></span><code>python scripts/receive_messages.py --topic<span class="o">=</span>model_stream_processor.iris_model.0.1.errors --bootstrap_servers<span class="o">=</span>localhost:9092
</code></pre></div>
<p>To cause an error in the stream processor we can send in a malformed
JSON string to the send_messages command that should still be running:</p>
<div class="highlight"><pre><span></span><code>{"sepal_length": 1.1, "sepal_width": 1.2, "petal_length": 1.3, "petal_width": 1.4
</code></pre></div>
<p>The stream processor will catch the exception and send the input that
caused the error to the "errors" topic. We can see the message that
caused the error in the confluent control center UI:</p>
<p><img alt="Error Message" src="https://www.tekhnoal.com/error_message.png" width="100%"></p>
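<p>The routing decision inside the stream processor can be pictured with a
small helper like the one below. This is a simplified sketch with names of
my own choosing; the actual MLModelStreamProcessor implementation in the
repository may differ.</p>

```python
import json

def route_message(raw: bytes, predict) -> tuple:
    """Decide which topic one consumed message's result belongs on.

    Valid input is deserialized and passed to the model's predict function;
    anything that raises an exception is sent to the errors topic unchanged.
    """
    try:
        data = json.loads(raw)
        prediction = predict(data)
        return ("outputs", json.dumps(prediction).encode("utf-8"))
    except Exception:
        return ("errors", raw)

# a malformed JSON string ends up on the errors topic with the original bytes
topic, payload = route_message(b'{"sepal_length": 1.1', lambda data: data)
```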
<h1>Closing</h1>
<p>In this blog post, we've shown how to deploy an ML model inside a
streaming application. This type of deployment is becoming more and more
useful in recent times, as the popularity of stream processing and Kafka
grows. As in previous blog posts, we've built an application that can
support any number of ML models that implement the MLModel interface. The
only requirement for deployment is that the model package is installed
in the environment and the configuration of the application is updated.
The flexibility of this approach has allowed us to deploy the
iris_model ML model in five different applications without any
modification of the model code itself.</p>
<p>Another benefit of the stream processing application shown in this blog
post is the fact that we are using an asyncio-compatible Kafka client
library. By using asynchronous programming, we are able to greatly
increase the performance of the code. In
<a href="https://stackabuse.com/asynchronous-vs-synchronous-python-performance-analysis/">tests</a>,
asynchronous Python code is able to significantly outperform normal
synchronous code. The performance boost is most pronounced when working
with file IO and network IO applications, which our streaming processor
application will definitely benefit from.</p>
<p>To keep things simple, we used JSON strings in the messages we sent
through Kafka. However, there are more efficient standards for
serializing data which we could have used. For example, the confluent
schema registry works with Avro schemas, and Avro is <a href="https://www.confluent.io/blog/avro-kafka-data/">well
supported</a>
in the Kafka ecosystem. Another area for improvement in the project is
the set of CLI tools built to test the application. They are very
simple and don't support many of the options that would be needed for a
real production application. For example, the create_topics.py script
only creates topics with a replication factor of one. We can improve
this tool by adding more of the options supported by Kafka's <a href="https://kafka.apache.org/quickstart#quickstart_createtopic">topic
creation CLI tool</a>.</p>
<h1>An AWS Lambda ML Model Deployment</h1>
<p>Brian Schmidt · 2019-11-10</p>
<p>This blog post builds on the ideas started in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog posts</a>.</p>
<p>I also showed how to deploy the same ML model used in this blog post as
a batch job in this <a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog
post</a>,
and in a task queue in this <a href="https://www.tekhnoal.com/task-queue-ml-model-deployment.html">blog
post</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/lambda-ml-model-deployment">github
repo</a>.</p>
<h1>Introduction</h1>
<p>In the last few years, a new cloud computing paradigm has emerged:
serverless computing. This new paradigm flips the normal way of
provisioning resources in a cloud environment on its head. Whereas a
normal application is deployed onto pre-provisioned servers that are
running before they are needed, a serverless application's codebase is
deployed and the servers are assigned to run the application as demand
for the application rises.</p>
<p>Although "serverless" can have several different interpretations, the
one that is most commonly used by developers is Functions as a Service
(FaaS). In this context, a function is a small piece of software that
does one thing, and hosting a function as a service means that the cloud
provider manages the server on which the function runs and allocates the
resources needed to run the function. Another interesting application of
the serverless paradigm is databases that are run and managed by cloud
providers, some examples of this are AWS Aurora, and Google Cloud
Datastore. However, these services don't run code that is provided by
the user, so they are not as interesting for deploying an ML model.</p>
<p>Serverless functions provide several benefits over
traditionally-deployed software. Serverless functions are inherently
elastic since they run only when an event triggers them, which makes them
easier to deploy and manage. They are also cheaper to run for the same
reason, since charges for execution time of a serverless function only
accrue when it is actually running. Lastly, using serverless functions
makes software engineers more productive, since a lot of deployment
details are abstracted out by the cloud provider, greatly simplifying
the deployment process.</p>
<p>Serverless functions have some drawbacks as well. The resources assigned
to a function are reclaimed by the cloud provider after a period of
inactivity, which means that the next time the function is executed
extra latency will be incurred when the resources are reassigned to the
function. Cloud providers often have limitations on the resources that a
function can consume in a given period of time, which means that a serverless
function might not be a good fit for certain workloads. Lastly, access
to the underlying server that is running the function is not available,
which limits the ability to control certain aspects of the execution
environment.</p>
<p>In this blog post, I will show how to deploy a machine learning model on
<a href="https://aws.amazon.com/lambda/">AWS Lambda</a>, which is the
AWS serverless function offering. The code for this blog post can run
locally, but to go through all of the scenarios explained it's necessary
to get an AWS account. We'll also show how to integrate the lambda with
AWS API Gateway, which will make the model hosted by the lambda
accessible through a REST API. To interact with the AWS API, the <a href="https://aws.amazon.com/cli/">AWS
CLI</a> package needs to be
installed as well.</p>
<h1>Serverless Framework</h1>
<p>The <a href="https://serverless.com/">serverless framework</a> is a
software framework for developing applications that use the serverless
FaaS model for deployment. The framework provides a command line
interface (CLI) that can operate across different cloud providers and
helps software engineers to develop, deploy, test, and monitor
serverless functions. We'll be using the serverless framework to work
with the AWS Lambda service.</p>
<p>In order to use the serverless framework, we need to first <a href="https://nodejs.org/en/download/">install the
node.js runtime</a>. After
this, we can install the serverless framework with this command:</p>
<div class="highlight"><pre><span></span><code>npm install -g serverless
</code></pre></div>
<p>After this, we need to get an AWS account and add permissions to allow
the framework to create resources, instructions can be found
<a href="https://serverless.com/framework/docs/providers/aws/guide/credentials/">here</a>.</p>
<h1>Package Structure</h1>
<p>To begin, I set up the project structure for the application package:</p>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">model_lambda</span> <span class="ss">(</span> <span class="nv">python</span> <span class="nv">package</span> <span class="k">for</span> <span class="nv">model</span> <span class="nv">lambda</span> <span class="nv">app</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">web_api</span> <span class="ss">(</span> <span class="nv">package</span> <span class="k">for</span> <span class="nv">handling</span> <span class="nv">http</span> <span class="nv">requests</span><span class="o">/</span><span class="nv">responses</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">controllers</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">schemas</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">config</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">lambda_function</span>.<span class="nv">py</span> <span class="ss">(</span> <span class="nv">lambda</span> <span class="nv">entry</span> <span class="nv">point</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">model_manager</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">scripts</span>
<span class="o">-</span> <span class="nv">openapi</span>.<span class="nv">py</span> <span class="ss">(</span><span class="nv">script</span> <span class="k">for</span> <span class="nv">generating</span> <span class="nv">an</span> <span class="nv">OpenAPI</span> <span class="nv">specification</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span> <span class="nv">unit</span> <span class="nv">tests</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">application</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">README</span>.<span class="nv">md</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">serverless</span>.<span class="nv">yaml</span> <span class="ss">(</span> <span class="nv">configuration</span> <span class="k">for</span> <span class="nv">serverless</span> <span class="nv">framework</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">setup</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span>
</code></pre></div>
<p>This structure can be seen in the <a href="https://github.com/schmidtbri/lambda-ml-model-deployment">github
repository</a>.</p>
<h1>Lambda Handler</h1>
<p>The AWS Lambda service is event-oriented, which means that it runs code
in response to events. The entry point for the code is a function
called the lambda handler. The lambda handler function is expected to
receive two parameters: event and context. The event parameter is
usually a dictionary that contains the details of the event that
triggered the execution of the lambda. The context parameter is a
dictionary that holds information about the function execution and the
execution environment.</p>
<p>To begin, we'll add an entry point for the lambda in the
lambda_function.py module:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">lambda_handler</span><span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
<span class="sd">"""Lambda handler function."""</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/lambda_function.py#L14-L15">here</a>.</p>
<p>We'll be adding code to the handler function later.</p>
<h1>Model Manager Class</h1>
<p>In order to manage a collection of MLModel objects in the lambda, we'll
reuse a piece of code that we've used before in a <a href="https://www.tekhnoal.com/using-ml-model-abc.html">previous blog
post</a>.
In the previous post, I wrote a class called "ModelManager" that is
responsible for instantiating MLModel classes from configuration,
returning information about the model objects being managed, and returning
references to the model objects upon request. We can reuse the class in
this project since we'll need the same functionality.</p>
<p>The ModelManager class has three methods: get_models(), which returns a
list of the models under management, get_model_metadata(), which returns
metadata about a single model, and get_model(), which returns a
reference to a model under management. The code for the ModelManager
class can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/model_manager.py">here</a>.
For a full explanation of the code in the class, please read the
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">original blog
post</a>.</p>
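<p>In condensed form, the behavior of the ModelManager class looks roughly
like this. This is a sketch for illustration only; the real implementation
linked above differs in its details, and the model attribute names used here
are assumptions.</p>

```python
import importlib

class ModelManager:
    """Holds a collection of instantiated MLModel objects."""

    def __init__(self):
        self._models = []

    def load_models(self, configuration):
        # configuration: list of dicts naming the module and class of each model
        for entry in configuration:
            module = importlib.import_module(entry["module_name"])
            model_class = getattr(module, entry["class_name"])
            self._models.append(model_class())

    def get_models(self):
        # summary information about every model under management
        return [{"display_name": model.display_name,
                 "qualified_name": model.qualified_name}
                for model in self._models]

    def get_model_metadata(self, qualified_name):
        # metadata about a single model, or None if it is not hosted here
        model = self.get_model(qualified_name)
        if model is None:
            return None
        return {"display_name": model.display_name,
                "qualified_name": model.qualified_name,
                "description": model.description}

    def get_model(self, qualified_name):
        # a reference to a single model under management
        return next((model for model in self._models
                     if model.qualified_name == qualified_name), None)
```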
<p>In order to use the ModelManager class within the model lambda we have
to first instantiate it, then call the load_model() method to load
MLModels objects we want to host in the lambda. Since the model classes
will load their parameters from disk when they are instantiated, it's
important that we only do this one time, when the lambda starts up. We
can do this by adding this code at the top of the lambda_function.py
module:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># instantiating the model manager class</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="c1"># loading the MLModel objects from configuration</span>
<span class="n">model_manager</span><span class="o">.</span><span class="n">load_models</span><span class="p">(</span><span class="n">configuration</span><span class="o">=</span><span class="n">Config</span><span class="o">.</span><span class="n">models</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/lambda_function.py#L7-L11">here</a>.</p>
<p>By putting this initialization at the top of the lambda function module,
we can be sure that the models are initialized one time only. The
configuration is loaded from the config.py module found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/config.py">here</a>.</p>
<h1>REST Endpoints</h1>
<p>An AWS Lambda function is able to handle events from <a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html">several
sources</a>
in the AWS ecosystem. In this blog post, we'll build a simple web
service that can serve predictions from the models that are hosted by
the lambda. To do this, we'll add an API Gateway as an event source to
the lambda function later. To be able to handle the HTTP requests sent
by the API Gateway, we'll copy the code from a <a href="https://www.tekhnoal.com/using-ml-model-abc.html">previous blog
post</a>
used to build a Flask web service. The code that defines the REST
endpoints is isolated inside of <a href="https://github.com/schmidtbri/lambda-ml-model-deployment/tree/master/model_lambda/web_api">a
subpackage</a>
inside of the model_lambda package, since we want to easily adapt the
model lambda for other types of integrations.</p>
<p>The data models accepted by the REST endpoints will be the same as in
the previous blog post. We'll
use the marshmallow schema package to define the schemas of the objects
accepted by and returned from the endpoints. The schemas can be found in
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/web_api/schemas.py">this
module</a>.
Since the API Gateway is handling all of the functionality normally
handled by a web application framework, we'll avoid using Flask for
building the application. However, we still have to define controller
functions that receive requests and return responses to a client. To do
this we'll reuse the controllers from the <a href="https://www.tekhnoal.com/using-ml-model-abc.html">previous blog
post</a>
and rewrite them a bit to remove the Flask dependency. The new
controller functions can be found in <a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/web_api/controllers.py">this
module</a>.</p>
<p>The web_api package within the model_lambda application is built along
the same lines as a web application. It is built in this way so that it
isolates the functionality to one package within the application. Now
that we have the ability to receive HTTP requests and return HTTP
responses, we have to integrate this code with the AWS Lambda service;
we'll do this in the next section.</p>
<h1>Handling API Gateway Events</h1>
<p>The AWS Lambda service integrates with other systems by using event
types. For this blog post, we'll be integrating with an <a href="https://aws.amazon.com/api-gateway/">AWS API
Gateway</a>, to do this
we'll need to handle AWS API Gateway
<a href="https://docs.aws.amazon.com/lambda/latest/dg/with-on-demand-https.html">events</a>.
The Lambda service sends events to our lambda by encoding all
information about an HTTP request into a dictionary data structure and
calling the lambda handler function with the dictionary as the "event"
parameter. In order to integrate our REST endpoint code with the API
Gateway, we'll need to recognize the event type, route the request to
the right REST endpoint, encode the HTTP response into a dictionary, and
return it to the Lambda service. The Lambda service will then return the
response to the API Gateway which will create the actual HTTP response
that will go back to the client.</p>
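<p>The response half of that contract can be sketched with a small helper
(my own illustration, not code from the repository) that encodes a controller
result into the dictionary shape API Gateway expects back from the lambda:</p>

```python
import json

def make_proxy_response(status_code: int, body: dict) -> dict:
    """Encode an HTTP response as an API Gateway proxy integration dict."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        # API Gateway expects the body to be a string, so it is JSON-encoded
        "body": json.dumps(body),
    }

response = make_proxy_response(200, {"species": "setosa"})
```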
<p>To recognize the API Gateway event type, we'll check for a few fields in
the event dictionary:</p>
<div class="highlight"><pre><span></span><code><span class="k">if</span> <span class="n">event</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"resource"</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> \
<span class="ow">and</span> <span class="n">event</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"path"</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> \
<span class="ow">and</span> <span class="n">event</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"httpMethod"</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/lambda_function.py#L16-L19">here</a>.</p>
<p>Once we're sure that we have an API Gateway event, we can choose which
REST endpoint to route the request to:</p>
<div class="highlight"><pre><span></span><code><span class="k">if</span> <span class="n">event</span><span class="p">[</span><span class="s2">"resource"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"/api/models"</span> <span class="ow">and</span> <span class="n">event</span><span class="p">[</span><span class="s2">"httpMethod"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"GET"</span><span class="p">:</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">get_models</span><span class="p">()</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/lambda_function.py#L21-L23">here</a>.</p>
<p>If the API Gateway event is a request for the "models" endpoint with the
GET verb, we'll route it to the get_models() controller function. This
function returns a list of the models available for prediction to the API
Gateway, which then returns it as an HTTP response to the client
system.</p>
<p>Next, we'll route to the metadata endpoint:</p>
<div class="highlight"><pre><span></span><code><span class="k">elif</span> <span class="n">event</span><span class="p">[</span><span class="s2">"resource"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"/api/models/</span><span class="si">{qualified_name}</span><span class="s2">/metadata"</span> \
<span class="ow">and</span> <span class="n">event</span><span class="p">[</span><span class="s2">"httpMethod"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"GET"</span><span class="p">:</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">get_metadata</span><span class="p">(</span><span class="n">qualified_name</span><span class="o">=</span><span class="n">event</span><span class="p">[</span><span class="s2">"pathParameters"</span><span class="p">][</span><span class="s2">"qualified_name"</span><span class="p">])</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/lambda_function.py#L25-L27">here</a>.</p>
<p>The get_metadata() function requires a parameter called
"qualified_name" which is the unique name of the model that the client
wants the metadata for. This parameter is parsed for us from the path of
the request by the API Gateway, and is sent in the "pathParameters"
field in the event dictionary.</p>
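<p>To make the event structure concrete, here is a simplified sketch of
the kind of event dictionary the API Gateway sends to the lambda. Only
the fields used by the routing code are shown, and the values are
illustrative:</p>
<div class="highlight"><pre><code>
# a simplified, hand-built example of an API Gateway proxy event,
# showing only the fields that the routing code relies on
example_event = {
    "resource": "/api/models/{qualified_name}/metadata",
    "path": "/api/models/iris_model/metadata",
    "httpMethod": "GET",
    "pathParameters": {"qualified_name": "iris_model"},
    "body": None
}

# the routing code can pull the model name out of the parsed path parameters
qualified_name = example_event["pathParameters"]["qualified_name"]
print(qualified_name)
</code></pre></div>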
<p>Next, we'll route to the "predict" endpoint:</p>
<div class="highlight"><pre><span></span><code><span class="k">elif</span> <span class="n">event</span><span class="p">[</span><span class="s2">"resource"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"/api/models/</span><span class="si">{qualified_name}</span><span class="s2">/predict"</span> \
<span class="ow">and</span> <span class="n">event</span><span class="p">[</span><span class="s2">"httpMethod"</span><span class="p">]</span> <span class="o">==</span> <span class="s2">"POST"</span> \
<span class="ow">and</span> <span class="n">event</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"pathParameters"</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> \
<span class="ow">and</span> <span class="n">event</span><span class="p">[</span><span class="s2">"pathParameters"</span><span class="p">]</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"qualified_name"</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">predict</span><span class="p">(</span><span class="n">qualified_name</span><span class="o">=</span><span class="n">event</span><span class="p">[</span><span class="s2">"pathParameters"</span><span class="p">][</span><span class="s2">"qualified_name"</span><span class="p">],</span> <span class="n">request_body</span><span class="o">=</span><span class="n">event</span><span class="p">[</span><span class="s2">"body"</span><span class="p">])</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/lambda_function.py#L29-L34">here</a>.</p>
<p>This endpoint takes a little more effort since it also requires that the
body of the request be sent to the predict() function.</p>
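<p>Note that the body of an API Gateway proxy event arrives as a JSON
string in the "body" field. A minimal sketch of how a controller
function might deserialize it (the field names in the payload are
illustrative, not necessarily the iris_model schema):</p>
<div class="highlight"><pre><code>
import json

# the request body arrives as a JSON string and must be deserialized
# before the model can use it; the field names here are made up
request_body = '{"sepal_length": 5.1, "sepal_width": 3.5}'
data = json.loads(request_body)
print(data["sepal_length"])
</code></pre></div>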
<p>Lastly, we'll raise an error for any resources in the API Gateway that
we can't handle:</p>
<div class="highlight"><pre><span></span><code><span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"This lambda cannot handle this resource."</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/lambda_function.py#L36-L37">here</a>.</p>
<p>This last statement raises an exception if the lambda can't handle the
resource that the API Gateway is requesting. This should never happen if
the API Gateway is created correctly, since only the three resources
listed above will be added to the API Gateway when we create it.</p>
<p>Now that the REST endpoint code has handled the request and created a
response, we have to encode it into a dictionary that the Lambda service
will send back to the API Gateway:</p>
<div class="highlight"><pre><span></span><code><span class="k">return</span> <span class="p">{</span>
<span class="s2">"isBase64Encoded"</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
<span class="s2">"statusCode"</span><span class="p">:</span> <span class="n">response</span><span class="o">.</span><span class="n">status</span><span class="p">,</span>
<span class="s2">"headers"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"Content-Type"</span><span class="p">:</span> <span class="n">response</span><span class="o">.</span><span class="n">mimetype</span><span class="p">},</span>
<span class="s2">"body"</span><span class="p">:</span> <span class="n">response</span><span class="o">.</span><span class="n">data</span>
<span class="p">}</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/lambda_function.py#L39-L44">here</a>.</p>
<p>Lastly, we close the lambda handler by throwing an exception if we can't
identify the event type:</p>
<div class="highlight"><pre><span></span><code><span class="k">else</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"This lambda cannot handle this event type."</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/model_lambda/lambda_function.py#L46-L47">here</a>.</p>
<p>The code in this section forms an adapter layer between the Lambda
service and the web application that we want to build. For the sake of
good engineering practice, we isolate the code that interfaces with the
AWS Lambda service from the code that handles HTTP requests and
responses. By building the code this way, we have a much easier time
writing unit tests for it.</p>
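<p>To illustrate why this isolation helps with testing, here is a minimal
sketch of a unit test. The simplified handler below is a stand-in for the
real lambda_handler, not the actual code; it only reproduces the event
type check:</p>
<div class="highlight"><pre><code>
def simple_handler(event, context):
    # simplified stand-in for the real lambda_handler: recognize an
    # API Gateway event by its fields, otherwise raise an error
    if event.get("resource") is not None \
            and event.get("path") is not None \
            and event.get("httpMethod") is not None:
        return {"statusCode": 200}
    raise ValueError("This lambda cannot handle this event type.")

# because the handler is a plain function that takes a dictionary, it can
# be exercised with hand-built events and no AWS services at all
response = simple_handler(
    {"resource": "/api/models", "path": "/api/models", "httpMethod": "GET"},
    context=None)
print(response["statusCode"])
</code></pre></div>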
<h1>Adding Serverless Configuration</h1>
<p>The serverless framework provides a command for starting a python lambda
project, but we'll skip it since we already created the lambda handler
code inside of the model_lambda package. Instead, we'll create the
settings file that the serverless framework uses by hand. The file is
named serverless.yml and it should be in the root of the project.</p>
<p>To begin we'll add a few basic things to the file:</p>
<div class="highlight"><pre><span></span><code><span class="n">service</span><span class="o">:</span><span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">service</span><span class="w"></span>
<span class="n">provider</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">name</span><span class="o">:</span><span class="w"> </span><span class="n">aws</span><span class="w"></span>
<span class="w"> </span><span class="n">runtime</span><span class="o">:</span><span class="w"> </span><span class="n">python3</span><span class="o">.</span><span class="mi">7</span><span class="w"></span>
<span class="n">stage</span><span class="o">:</span><span class="w"> </span><span class="n">dev</span><span class="w"></span>
<span class="n">region</span><span class="o">:</span><span class="w"> </span><span class="n">us</span><span class="o">-</span><span class="n">east</span><span class="o">-</span><span class="mi">2</span><span class="w"></span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/serverless.yml#L1-L8">here</a>.</p>
<p>These values will be used by the serverless framework to create a
service. A service can contain one or more functions plus any other
resources needed to support them. The name of the service is
"model-service", the provider will be AWS and the function runtime will
be python 3.7. The default stage will be "dev" and the default region
will be us-east-2. The values can be changed at deployment time.</p>
<p>Now we can add a function to the service:</p>
<div class="highlight"><pre><span></span><code><span class="nb">functions</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="n">model</span><span class="o">-</span><span class="n">lambda</span><span class="p">:</span><span class="w"></span>
<span class="w"> </span><span class="n">handler</span><span class="p">:</span><span class="w"> </span><span class="n">model_lambda</span><span class="p">.</span><span class="n">lambda_function</span><span class="p">.</span><span class="n">lambda_handler</span><span class="w"></span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/serverless.yml#L16-L18">here</a>.</p>
<p>The function will be named "model-lambda", and the handler points at the
location of the lambda_handler function that we put into the
lambda_function module. The lambda_handler function is located within
the lambda_function module, which is located in the model_lambda
package.</p>
<p>These lines are the only ones needed to get the basic settings in place
for the lambda. In the next sections we'll add more lines to the
serverless.yml file to handle other things.</p>
<h1>Building a Deployment Package</h1>
<p>The serverless framework can help us to build a deployment package for
the model-lambda, but to do this we need to add an extension called
"serverless-python-requirements". This extension allows the serverless
framework to create deployment packages that include all of the python
dependencies for the model-lambda code. The extension uses the
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/requirements.txt">requirements.txt</a>
file in the root of the project. To install the extension, use this
command:</p>
<div class="highlight"><pre><span></span><code>sls plugin install -n serverless-python-requirements
</code></pre></div>
<p>This command will add a node_modules folder to the project folder, and
some other files to keep track of the node.js dependencies of the
extension.</p>
<p>In order for the serverless framework to make use of the extension for
this project, we have to add this line to the serverless.yml file:</p>
<div class="highlight"><pre><span></span><code>plugins:
- serverless-python-requirements
</code></pre></div>
<p>This code can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/serverless.yml#L38-L39">here</a>.</p>
<p>Once serverless can find the extension, we can modify the way that the
extension will create the deployment package by adding these lines to
the serverless.yml file:</p>
<div class="highlight"><pre><span></span><code><span class="n">custom</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">pythonRequirements</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">dockerizePip</span><span class="o">:</span><span class="w"> </span><span class="kc">true</span><span class="w"></span>
<span class="w"> </span><span class="n">slim</span><span class="o">:</span><span class="w"> </span><span class="kc">true</span><span class="w"></span>
<span class="w"> </span><span class="n">noDeploy</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">apispec</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">PyYAML</span><span class="w"></span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/serverless.yml#L41-L47">here</a>.</p>
<p>The dockerizePip option makes the serverless-python-requirements
extension install the packages within the <a href="https://github.com/lambci/docker-lambda">docker-lambda
image</a>, which
guarantees that the deployment package will work in the Lambda service.
The slim option causes the extension to leave several unneeded file
types out of the deployment package, such as "__pycache__" directories.</p>
<p>The packages in the noDeploy list are excluded from the build; in this
case we don't need the apispec and PyYAML packages in the lambda.</p>
<p>Once we have all of this set up, we can test the creation of the
deployment package by using this command:</p>
<div class="highlight"><pre><span></span><code>sls package
</code></pre></div>
<p>After executing this command, the serverless framework will create a new
folder called ".serverless" inside of the project root. This folder
contains several different files that will be used when deploying the
service to AWS. The file we are interested in is called
"model-service.zip"; this file is the deployment package that will be
used to create the lambda. When we open this file, we'll see that the
serverless framework packaged almost all of the files in the
project folder into the deployment package, most of which are not
needed. To prevent this, we'll add these lines to the serverless.yml
file:</p>
<div class="highlight"><pre><span></span><code><span class="kd">package</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="n">exclude</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"**/**"</span><span class="w"></span>
<span class="w"> </span><span class="k">include</span><span class="o">:</span><span class="w"></span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="s2">"model_lambda/**"</span><span class="w"></span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/serverless.yml#L10-L14">here</a>.</p>
<p>These lines tell the serverless framework to only add the code in the
model_lambda python package to the lambda deployment package. This step
is important because the AWS Lambda service has a limit on the size of
deployment packages.</p>
<p>Having written scripts that build lambda deployment packages for lambdas
that have scikit-learn and numpy before, I can say that the
serverless-python-requirements extension makes everything much simpler.
The addition of the docker image for compiling source Python packages
makes everything even better since it guarantees that the deployment
package will work correctly in the AWS Lambda python environment. By
leveraging the serverless framework and the
serverless-python-requirements extension, we've avoided writing a lot of
code for deploying the lambda.</p>
<h1>Deploying the Model Lambda</h1>
<p>Now that we have the deployment package in hand, we can try to create
the lambda in AWS. To do this, we execute this command:</p>
<div class="highlight"><pre><span></span><code>sls deploy
</code></pre></div>
<p>This command will interact with the AWS API to create the lambda, using
a CloudFormation template. If we log in to the AWS console, we can see
the lambda listed in the user interface of the AWS Lambda service:</p>
<p><img alt="Lambda UI" src="https://www.tekhnoal.com/lambda_ui.png" width="100%"></p>
<p>We can execute the lambda in the cloud with this command:</p>
<div class="highlight"><pre><span></span><code>serverless invoke -f model-lambda -s dev -r us-east-1 -p tests/data/api_gateway_list_models_event.json
</code></pre></div>
<p>The command executes the lambda through the AWS API using a test event
from the unit tests folder.</p>
<h1>Adding a RESTful Interface</h1>
<p>Now that we have a lambda working inside of the AWS Lambda service, we
need to connect it to an event source. The serverless framework supports
this by adding an "events" array to the lambda function in
serverless.yml file:</p>
<div class="highlight"><pre><span></span><code>events:
- http:
path: api/models
method: get
- http:
path: api/models/{qualified_name}/metadata
method: get
request:
parameters:
paths:
qualified_name: true
- http:
path: api/models/{qualified_name}/predict
method: post
request:
parameters:
paths:
qualified_name: true
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/serverless.yml#L19-L36">here</a>.</p>
<p>The three events above correspond to three AWS API Gateway resources
that will trigger a lambda execution when they receive requests. After
adding these events, we can execute the deploy command again to create
the API Gateway:</p>
<div class="highlight"><pre><span></span><code>sls deploy
</code></pre></div>
<p>The API Gateway and its resources are added to the CloudFormation
template that serverless manages for the service, and serverless uses
the AWS API to create the API Gateway and route the events to the
lambda.</p>
<p>The deploy command returned the URL of the new API Gateway endpoints, so
to test out the new API Gateway I simply executed this command:</p>
<div class="highlight"><pre><span></span><code>curl https://ra2nrqnhrj.execute-api.us-east-1.amazonaws.com/dev/api/models
</code></pre></div>
<p>As expected, the endpoint returned information about the iris_model
MLModel that is configured. Note that the endpoint is not secured, so
it's not a good idea to keep the API Gateway running for a long time. To
delete the AWS resources we've been working with, execute this command:</p>
<div class="highlight"><pre><span></span><code>sls remove
</code></pre></div>
<p>Even though we can create an API Gateway by using the serverless
framework, the serverless.yml file is missing a lot of information that
is provided by an OpenAPI specification. In order to properly document
the API, I created an OpenAPI specification for it, which can be found
<a href="https://github.com/schmidtbri/lambda-ml-model-deployment/blob/master/openapi_specification.yaml">here</a>.</p>
<h1>Closing</h1>
<p>A benefit of deploying an ML model on an AWS Lambda is the simplicity of
the deployment. By removing the need to manage servers, the path to
deploying an ML model is much faster and simpler. Another benefit is the
number of integrations that AWS provides for the Lambda service. In this
blog post, we showed how to integrate the lambda with an API Gateway to
create a RESTful service, but there are many other options available.</p>
<p>A drawback of the Lambda service is that it suffers from cold start
latency. A cold start happens when a lambda is executed in response to an
event after not being used for 15 minutes; when this happens, the lambda
takes extra time to respond to the request. <a href="https://mikhail.io/serverless/coldstarts/aws/">This blog
post</a> goes
into the details of this problem. The cold start problem becomes even
more pronounced with a lambda that is hosting an ML model, because the
model parameters need to be deserialized when the lambda first starts
up, which adds to the cold start time.</p>
<p>Another problem that we might face when deploying an ML model inside a
lambda is the limits on the deployment package size. The AWS Lambda
service limit for the deployment package size is 50 MB. When packaging
model files along with the deployment package we might go beyond that
limit very easily. This can be fixed by having the lambda pick up the
model files from an S3 bucket. I will show details for a simple and
general way to do this in a later blog post.</p>
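<p>As a preview, a minimal sketch of that approach might look like the
following, assuming the boto3 package is available in the lambda; the
bucket and key names are hypothetical:</p>
<div class="highlight"><pre><code>
import boto3

def load_model_file(bucket, key, local_path="/tmp/model.pickle"):
    """Download a serialized model file from S3 into the lambda's /tmp
    directory, the only writable filesystem in the Lambda environment."""
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_path)
    return local_path

# calling this at module level means the download happens once per
# container on a cold start, not on every request; the names below are
# hypothetical
# MODEL_PATH = load_model_file("my-model-bucket", "iris_model/model.pickle")
</code></pre></div>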
<p>An interesting way to improve the code is to make it possible to
integrate other data sources in AWS with the model lambda. For example,
we can have the Lambda listen for events coming from a Simple Queueing
Service queue, make a prediction and put the prediction result in
another SQS queue. Another option is to do a similar integration with
the <a href="https://aws.amazon.com/kinesis/data-streams/">AWS Kinesis</a>
service for doing streaming analytics. Both of these services can be
integrated with AWS Lambda easily.</p>
<h1>A Task Queue ML Model Deployment</h1>
<p>2019-10-24, by Brian Schmidt</p>
<p>This blog post builds on the ideas started in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog posts</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/task-queue-ml-model-deployment">github repo</a>.</p>
<h1>Introduction</h1>
<p>When building software, we may come across situations in which we want
to execute a long-running operation behind the scenes while keeping the
main execution path of the code running. This is useful when the
software needs to remain responsive to a user, and the long running
operation would get in the way. These types of operations often involve
contacting another service over the network or writing data to IO. For
example, when a web service needs to send an email, often the best way
to do it is to launch a task in the background that will actually send
the email, and return a response to the client immediately.</p>
<p>These types of tasks are often handled in a task queue, which can also
be called a <a href="https://en.wikipedia.org/wiki/Job_queue">job queue</a>. A task
queue is a service that receives requests to perform tasks, and handles
finding the resources necessary for the task, and scheduling the task.
It can also store the results of the tasks for later retrieval. Tasks
usually execute asynchronously, which means that the client does not
wait for the result of the task, but synchronous execution can also be
supported.</p>
<p>A task queue can also execute tasks on many different physical
computers, which makes it a distributed system. To handle communication
between many machines, a task queue often makes use of a <a href="https://en.wikipedia.org/wiki/Message_broker">message
broker</a>
service to handle message passing between the worker processes that
execute the tasks and the clients of the tasks. A message broker service
acts as a middle man, receiving, storing, routing, and sending messages
between many different services. A message broker service is an
implementation of the
<a href="https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern">publish-subscribe</a>
pattern. The benefit of using this pattern is that the services that
communicate over the message broker remain decoupled from each other.</p>
<p>A task queue can be useful for machine learning model deployments, since
a machine learning model may take some time to make a prediction and
return a result. Most often, the ML prediction algorithm itself is
CPU-bound, which means that it is limited by the availability of CPU
time. This means that a task queue is usually not necessary for the
deployment of the ML model itself, but for dealing with the loading of
data that the prediction algorithm may need to make a prediction which
is an IO-bound process. Another situation in which a task queue may be
useful is when we need to make thousands of predictions and return them
as a result; in this case it would be useful to launch an asynchronous
task that will take care of the predictions behind the scenes and then
come back later to access the results.</p>
<h1>Task Queueing With Celery</h1>
<p>Celery is a python package that handles most of the complexity of
distributing and executing tasks across different processes. Celery is
able to use many different types of message brokers to distribute tasks;
for this blog post we'll use the Redis message broker. In order to
access task results, Celery supports several kinds of result storage
backends; for this blog we'll also use Redis to store the prediction
results of the model. As in previous blog posts, we'll be deploying the
iris_model package, which was developed as an example and has now been
deployed several times.</p>
<p>Since we are now dealing with more than one service and we are
communicating data between several different processes over a network,
it's useful to visualize the activity of the task queue with a software
architecture diagram:</p>
<p><img alt="Software Architecture" src="https://www.tekhnoal.com/software_architecture.png" width="100%"></p>
<p>The client application installs the Celery application package and sends
task requests through the tasks that are defined in it. Whenever a task
needs to be executed, the client sends a message to the task broker with
any parameters that the task needs to execute. The message broker
receives messages and holds them until they are picked up by the worker
processes. The workers run the Celery application and pick up messages
from the message broker; when a task is completed, they store the
results in the result storage backend.</p>
<h1>Package Structure</h1>
<p>To begin, I set up the project structure for the application package:</p>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">model_task_queue</span> <span class="ss">(</span> <span class="nv">python</span> <span class="nv">package</span> <span class="k">for</span> <span class="nv">task</span> <span class="nv">queue</span> <span class="nv">app</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">__main__</span>.<span class="nv">py</span> <span class="ss">(</span> <span class="nv">command</span> <span class="nv">line</span> <span class="nv">entry</span> <span class="nv">point</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">celery</span>.<span class="nv">py</span> <span class="ss">(</span> <span class="nv">celery</span> <span class="nv">application</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">config</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">ml_model_task</span>.<span class="nv">py</span> <span class="ss">(</span> <span class="nv">task</span> <span class="nv">class</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">scripts</span>
<span class="o">-</span> <span class="nv">simple_test</span>.<span class="nv">py</span> <span class="ss">(</span> <span class="nv">single</span> <span class="nv">prediction</span> <span class="nv">test</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">continuous_test</span>.<span class="nv">py</span> <span class="ss">(</span> <span class="nv">multiple</span> <span class="nv">prediction</span> <span class="nv">test</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span> <span class="nv">unit</span> <span class="nv">tests</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">README</span>.<span class="nv">md</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">setup</span>.<span class="nv">py</span>
</code></pre></div>
<p>This structure can be seen here in the <a href="https://github.com/schmidtbri/task-queue-ml-model-deployment">github
repository</a>.</p>
<h1>Model Async Task</h1>
<p>Creating an asynchronous task with the Celery package is simple: it's as
easy as putting a function decorator on a function. An example of how to
do this can be found in the <a href="https://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#application">Celery startup
guide</a>.
The function decorator allows the client application to call the
function just like a local function, while having the actual execution
of the code happen asynchronously in a worker process running in a
different computer. In the client code, the function acts as a facade
that hides the complexities of parameter serialization/deserialization,
network communication and other complexities of the distributed nature
of the task queue.</p>
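<p>As a point of comparison, the decorator-based approach from the Celery
startup guide looks roughly like this. The Redis URLs are illustrative,
and calling the task function directly (rather than through .delay())
runs it synchronously in the current process:</p>
<div class="highlight"><pre><code>
from celery import Celery

# a minimal Celery application using a local Redis instance as both the
# message broker and the result backend; the URLs are illustrative
app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def add(x, y):
    # the decorated function can be executed by a worker with
    # add.delay(2, 2), or called directly as a plain function
    return x + y

print(add(2, 2))
</code></pre></div>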
<p>The function decorator is a simple way to get started with Celery tasks,
but we have some special requirements that make it hard to create Celery
tasks this way. For example, Celery task functions don't maintain state
between requests. If we had to instantiate an MLModel object for every
task request, the model parameters would have to be loaded and
deserialized over and over for each request. To get around this
limitation we'll have to code the ML model async task in such a way that
it can maintain an instance of an MLModel object in memory between
requests. A way to do this can be found in the Celery documentation
<a href="https://docs.celeryproject.org/en/latest/userguide/tasks.html#custom-task-classes">here</a>.</p>
<p>Following the example in the documentation, we'll define a class that
inherits from the celery.Task base class:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">celery</span> <span class="kn">import</span> <span class="n">Task</span>
<span class="k">class</span> <span class="nc">MLModelPredictionTask</span><span class="p">(</span><span class="n">Task</span><span class="p">):</span>
<span class="sd">"""Celery Task for making ML Model predictions."""</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/ml_model_task.py#L3-L9">here</a>.</p>
<p>Now we'll define the task class' __init__ method:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">module_name</span><span class="p">,</span> <span class="n">class_name</span><span class="p">):</span>
<span class="sd">"""Class constructor."""</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="kc">None</span>
<span class="n">model_module</span> <span class="o">=</span> <span class="n">importlib</span><span class="o">.</span><span class="n">import_module</span><span class="p">(</span><span class="n">module_name</span><span class="p">)</span>
<span class="n">model_class</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">model_module</span><span class="p">,</span> <span class="n">class_name</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">issubclass</span><span class="p">(</span><span class="n">model_class</span><span class="p">,</span> <span class="n">MLModel</span><span class="p">)</span> <span class="ow">is</span> <span class="kc">False</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"MLModelPredictionTask can only be used with subtypes of MLModel."</span><span class="p">)</span>
<span class="c1"># saving the reference to the class to avoid having to import it again</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model_class</span> <span class="o">=</span> <span class="n">model_class</span>
<span class="c1"># adding a name to the task object</span>
<span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">.</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="vm">__name__</span><span class="p">,</span> <span class="n">model_class</span><span class="o">.</span><span class="n">qualified_name</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/ml_model_task.py#L11-L26">here</a>.</p>
<p>The __init__() method accepts two parameters: the name of the module
where we can find the MLModel-derived class, and the name of the class
in that module that implements the prediction functionality. The
__init__() method then calls the __init__() method of the Celery
Task base class to make sure that all of the required initialization
code is executed correctly. Then the "_model" property is set to None
(for now). After this, we dynamically import the MLModel class from the
environment, and check that it is a subclass of MLModel. Next, we save a
reference to the class in the "_model_class" property of the new task
object, but we do not instantiate the model class itself; the reason for
this is explained below. Lastly, we set a unique name for the Celery
task based on the name of the MLModelPredictionTask class' module and
the qualified name of the MLModel class that is being hosted inside of
this instance of the MLModelPredictionTask class. The name of the task
is set dynamically so that we are able to host many different models
within the same celery application, while guaranteeing that the tasks
will have unique names.</p>
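<p>The naming scheme itself is plain string formatting. A minimal sketch, using an illustrative stand-in class (the qualified_name attribute mirrors the one on MLModel subclasses):</p>

```python
class IrisModel:
    # illustrative stand-in for an MLModel subclass
    qualified_name = "iris_model"

# the hosting module's name plus the model's qualified name
# yields a task name that is unique per hosted model
task_name = "{}.{}".format("model_task_queue.ml_model_task",
                           IrisModel.qualified_name)
print(task_name)  # prints model_task_queue.ml_model_task.iris_model
```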
<p>Next, the initialize() method is responsible for instantiating
the model class and saving the reference as a property of the
MLModelPredictionTask object:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">initialize</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">model_object</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model_class</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="o">=</span> <span class="n">model_object</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/ml_model_task.py#L28-L31">here</a>.</p>
<p>Lastly, the run() method is responsible for doing the work of the async
task:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">initialize</span><span class="p">()</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/ml_model_task.py#L33-L37">here</a>.</p>
<p>The run() method checks if the model class is instantiated before it
attempts to make a prediction. If it is not instantiated, it calls the
initialize() method to create the model object before making a
prediction with it. The run() method is the one that defines the actual
functionality of the Celery task.</p>
<p>In
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">previous</a>
<a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog
posts</a>,
the instantiation of the model class happens in the __init__()
method of the class that is managing the model object. After this, we
can use the model class to make a prediction. We have to take a
different approach in this application because we need to keep the model
class from being instantiated in the client application that is using
the asynchronous task. This happens because the client application
instantiates and manages an instance of the task class in its own
process space, and uses it to communicate with the worker processes that
are actually doing the work. To keep the model class from being
instantiated in the client application, the run() method is actually
responsible for initializing the model class instead of the
__init__() method. The only downside to this approach is that when
the worker process instantiates the task class, it will not have an
instance of the model class in memory, it will only be created the first
time that a prediction is made.</p>
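<p>The lazy-initialization pattern described above can be sketched independently of Celery. The class and method names here are illustrative, not part of the Celery API:</p>

```python
class LazyPredictionTask:
    """Sketch of lazy model initialization: the expensive model object is
    only created in the process that actually runs the task."""

    def __init__(self, model_class):
        # only a reference to the class is stored; no model is loaded yet
        self._model_class = model_class
        self._model = None

    def run(self, data):
        # the model is instantiated on the first call, i.e. in the worker
        if self._model is None:
            self._model = self._model_class()
        return self._model.predict(data)


class FakeModel:
    """Stand-in for an MLModel subclass."""
    def predict(self, data):
        return {"echo": data}


task = LazyPredictionTask(FakeModel)
assert task._model is None           # nothing loaded in the "client"
result = task.run({"x": 1})          # first call triggers initialization
assert task._model is not None       # model now lives in the "worker"
```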
<h1>Celery Application</h1>
<p>Now that we have a Celery task that can host an MLModel-based class, we
can start building a Celery application that hosts the tasks.</p>
<p>First, we will install a machine learning model that will be hosted by
the Celery application. For this we'll use the iris_model package that
I've already shown in
<a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog
posts</a>:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/schmidtbri/ml-model-abc-improvements#egg<span class="o">=</span>iris_model
</code></pre></div>
<p>Then, we'll create a configuration class for the application:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Config</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="sd">"""Configuration for all environments."""</span>
<span class="n">models</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span>
<span class="s2">"module_name"</span><span class="p">:</span> <span class="s2">"iris_model.iris_predict"</span><span class="p">,</span>
<span class="s2">"class_name"</span><span class="p">:</span> <span class="s2">"IrisModel"</span>
<span class="p">}</span>
<span class="p">]</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/config.py#L4-L12">here</a>.</p>
<p>The configuration class defines a property called "models" that is a list
of dictionaries, each dictionary containing two keys. The "module_name"
key points at a module that contains an MLModel-derived class, and the
"class_name" key contains the name of the class. By storing the
locations of the classes in this way, adding a new MLModel class to the
application is as simple as adding an entry to the list. The
configuration above points at the IrisModel class that we just installed
in the iris_model package. This class is meant to hold configuration
that is shared by all of the environments.</p>
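<p>The mechanism that makes this configuration work is dynamic importing with importlib, as shown in the task's __init__() method. A minimal sketch, using a standard-library class as a stand-in for a real model class:</p>

```python
import importlib

def load_class(module_name, class_name):
    """Dynamically import a class given its module and class names."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# using a standard-library class as a stand-in for an MLModel subclass
encoder_class = load_class("json", "JSONEncoder")
encoder = encoder_class()
print(encoder.encode({"a": 1}))  # prints {"a": 1}
```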
<p>In the same file we also store configuration for different environments,
here is the configuration class for the production environment:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">ProdConfig</span><span class="p">(</span><span class="n">Config</span><span class="p">):</span>
<span class="sd">"""Configuration for the prod environment."""</span>
<span class="n">broker_url</span> <span class="o">=</span> <span class="s1">'redis://localhost:6379/0'</span>
<span class="n">result_backend</span> <span class="o">=</span> <span class="s1">'redis://localhost:6379/0'</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/config.py#L15-L19">here</a>.</p>
<p>The configuration points at a Redis service on localhost for
now. With the configuration taken care of, we can start building
the Celery application. To do this we start by instantiating a task
registry:</p>
<div class="highlight"><pre><span></span><code><span class="n">registry</span> <span class="o">=</span> <span class="n">TaskRegistry</span><span class="p">()</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/celery.py#L14">here</a>.</p>
<p>Next, we add tasks to the task registry:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">Config</span><span class="o">.</span><span class="n">models</span><span class="p">:</span>
<span class="n">registry</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="n">MLModelPredictionTask</span><span class="p">(</span><span class="n">module_name</span><span class="o">=</span><span class="n">model</span><span class="p">[</span><span class="s2">"module_name"</span><span class="p">],</span> <span class="n">class_name</span><span class="o">=</span><span class="n">model</span><span class="p">[</span><span class="s2">"class_name"</span><span class="p">]))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/celery.py#L17-L19">here</a>.</p>
<p>The loop iterates through the list of models in the configuration,
instantiates a MLModelPredictionTask for each model, and registers the
new task with the task registry object we defined above.</p>
<p>Celery tasks are usually automatically registered in a task registry as
soon as they are instantiated, but we have a special situation because
of the dynamic and configuration-driven nature of the Celery
application. The manual registration of the task shown above is needed
because we don't know how many tasks we will be hosting in the
application; we only know this once the application starts up and reads
the configuration.</p>
<p>Now that we have a task registry with tasks in it, we can create the
Celery application object:</p>
<div class="highlight"><pre><span></span><code><span class="n">app</span> <span class="o">=</span> <span class="n">Celery</span><span class="p">(</span><span class="vm">__name__</span><span class="p">,</span> <span class="n">tasks</span><span class="o">=</span><span class="n">registry</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/celery.py#L22-L23">here</a>.</p>
<p>The name of the application is pulled from the module name, and the
tasks parameter is the task registry object we defined above.</p>
<p>Lastly, we need to point the Celery application to a broker and result
backend so that the clients and workers can communicate. These settings
are loaded from the configuration classes we've already defined:</p>
<div class="highlight"><pre><span></span><code><span class="n">app</span><span class="o">.</span><span class="n">config_from_object</span><span class="p">(</span><span class="s2">"model_task_queue.config.</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'APP_SETTINGS'</span><span class="p">]))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/model_task_queue/celery.py#L26">here</a>.</p>
<p>The name of the environment is loaded from an environment variable
called "APP_SETTINGS". The environment variable is then used to load
the correct configuration object from the config.py file.</p>
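<p>The environment-driven lookup can be sketched without Celery; the class names below are illustrative, and a dictionary lookup stands in for the dotted-path resolution that config_from_object performs:</p>

```python
import os

class Config:
    """Configuration shared by all environments."""
    broker_url = None

class ProdConfig(Config):
    broker_url = "redis://localhost:6379/0"

class TestConfig(Config):
    broker_url = "redis://localhost:6380/0"

# the APP_SETTINGS environment variable names the configuration class
os.environ["APP_SETTINGS"] = "ProdConfig"

# resolve the name to a class defined in this module
config_class = globals()[os.environ["APP_SETTINGS"]]
print(config_class.broker_url)  # prints redis://localhost:6379/0
```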
<h1>Using the Task</h1>
<p>To use the iris_model task in the Celery application we just built,
we'll need to start up an instance of redis to serve as the message
broker and storage backend for the task queue. To do this, we can use a
docker image with this command:</p>
<div class="highlight"><pre><span></span><code>docker run -d -p <span class="m">6379</span>:6379 redis
</code></pre></div>
<p>Now that we have a redis instance to communicate with, we can start a
Celery worker process:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">OBJC_DISABLE_INITIALIZE_FORK_SAFETY</span><span class="o">=</span>YES
<span class="nb">export</span> <span class="nv">APP_SETTINGS</span><span class="o">=</span>ProdConfig
<span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
python3 -m model_task_queue --loglevel INFO
</code></pre></div>
<p>The OBJC_DISABLE_INITIALIZE_FORK_SAFETY environment variable is
needed on macOS to allow Celery to fork processes when handling task
execution. The APP_SETTINGS environment variable is needed so that the
Celery application will load the right configuration. The PYTHONPATH
environment variable allows the Python interpreter to find the dependencies of
the Celery application. The last command starts the Celery worker process
by calling the script in the __main__.py module.</p>
<p>Next, we can try out the task itself in a python interactive session:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">os</span>
<span class="o">>>></span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">"APP_SETTINGS"</span><span class="p">]</span> <span class="o">=</span> <span class="s2">"ProdConfig"</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">model_task_queue.celery</span> <span class="kn">import</span> <span class="n">app</span>
<span class="o">>>></span> <span class="n">task</span> <span class="o">=</span> <span class="n">app</span><span class="o">.</span><span class="n">tasks</span><span class="p">[</span><span class="s2">"model_task_queue.ml_model_task.iris_model"</span><span class="p">]</span>
<span class="o">>>></span> <span class="n">task</span><span class="o">.</span><span class="vm">__dict__</span>
<span class="p">{</span><span class="s1">'_model'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span> <span class="s1">'_model_class'</span><span class="p">:</span> <span class="o"><</span><span class="k">class</span>
<span class="err">'</span><span class="nc">iris_model</span><span class="o">.</span><span class="n">iris_predict</span><span class="o">.</span><span class="n">IrisModel</span><span class="s1">'>, '</span><span class="n">name</span><span class="s1">':</span>
<span class="s1">'model_task_queue.ml_model_task.iris_model'</span><span class="p">,</span> <span class="s1">'_exec_options'</span><span class="p">:</span>
<span class="p">{</span><span class="s1">'queue'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span> <span class="s1">'routing_key'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span> <span class="s1">'exchange'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span>
<span class="s1">'priority'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span> <span class="s1">'expires'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span> <span class="s1">'serializer'</span><span class="p">:</span> <span class="s1">'json'</span><span class="p">,</span>
<span class="s1">'delivery_mode'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span> <span class="s1">'compression'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span> <span class="s1">'time_limit'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span>
<span class="s1">'soft_time_limit'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span> <span class="s1">'immediate'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span> <span class="s1">'mandatory'</span><span class="p">:</span> <span class="kc">None</span><span class="p">,</span>
<span class="s1">'ignore_result'</span><span class="p">:</span> <span class="kc">False</span><span class="p">}}</span>
</code></pre></div>
<p>When using the Celery task, we first need to instantiate the Celery
application object that is hosting the task. This happens when we import
the model_task_queue.celery module. Once we have the application
object, we can query the app.tasks dictionary for the model task we are
interested in. The name of the task is dynamically generated from the
qualified name of the model that it is hosting.</p>
<p>As can be seen above, when the task is first instantiated, it does not
have an object reference in the _model property. This is as we
intended, since we are using the Celery application as a client and we
don't want the task to instantiate the model class which would cause the
model to be deserialized in the client process.</p>
<p>Now that we have an instance of the task, we can try to execute it:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="n">result</span> <span class="o">=</span> <span class="n">task</span><span class="o">.</span><span class="n">delay</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="p">{</span> <span class="s2">"sepal_length"</span><span class="p">:</span> <span class="mf">5.0</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">3.2</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">})</span>
<span class="o">>>></span> <span class="n">result</span><span class="o">.</span><span class="n">ready</span><span class="p">()</span>
<span class="kc">True</span>
<span class="o">>>></span> <span class="n">result</span><span class="o">.</span><span class="n">get</span><span class="p">()</span>
<span class="p">{</span><span class="s1">'species'</span><span class="p">:</span> <span class="s1">'setosa'</span><span class="p">}</span>
</code></pre></div>
<p>We use the task.delay() method to call the task asynchronously, getting
back a result object that can be used to get a result once the task is
completed. The ready() method of the result can be used to check on the
status of the result of the task. Once it is completed, the result can be
retrieved from the result backend with the get() method.</p>
<p>If the task throws an exception, the result will also throw an exception
when it is accessed:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="n">result</span> <span class="o">=</span> <span class="n">task</span><span class="o">.</span><span class="n">delay</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="p">{</span> <span class="s2">"sepal_length"</span><span class="p">:</span> <span class="mf">5.0</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">3.2</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="s2">"asdfg"</span><span class="p">})</span>
<span class="o">>>></span> <span class="n">result</span><span class="o">.</span><span class="n">ready</span><span class="p">()</span>
<span class="kc">True</span>
<span class="o">>>></span> <span class="n">result</span><span class="o">.</span><span class="n">get</span><span class="p">()</span>
<span class="n">Traceback</span> <span class="p">(</span><span class="n">most</span> <span class="n">recent</span> <span class="n">call</span> <span class="n">last</span><span class="p">):</span>
<span class="o">...</span>
<span class="n">ml_model_abc</span><span class="o">.</span><span class="n">MLModelSchemaValidationException</span><span class="p">:</span> <span class="n">Failed</span> <span class="n">to</span> <span class="n">validate</span> <span class="nb">input</span> <span class="n">data</span><span class="p">:</span> <span class="n">Key</span> <span class="s1">'petal_width'</span> <span class="n">error</span><span class="p">:</span> <span class="n">asdfg</span> <span class="n">should</span> <span class="n">be</span> <span class="n">instance</span> <span class="n">of</span> <span class="s1">'float'</span>
</code></pre></div>
<p>Because the "petal_width" field contains data that does not meet the
schema of the iris model, the model threw an exception of type
MLModelSchemaValidationException. The exception was caught by the celery
worker, serialized, and transported back to the client.</p>
<h1>Test Script</h1>
<p>In order to test the Celery application, we'll code a script that will
make use of the iris_model task asynchronously. To use the application,
we import the Celery application from the module where it is defined:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">model_task_queue.celery</span> <span class="kn">import</span> <span class="n">app</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/scripts/concurrent_test.py#L4">here</a>.</p>
<p>Next, we'll define a function that starts a task, waits for it to
complete, and returns the prediction result:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">request_task</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
<span class="n">task</span> <span class="o">=</span> <span class="n">app</span><span class="o">.</span><span class="n">tasks</span><span class="p">[</span><span class="s2">"model_task_queue.ml_model_task.iris_model"</span><span class="p">]</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">task</span><span class="o">.</span><span class="n">delay</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># waiting for the task to complete</span>
<span class="k">while</span> <span class="n">result</span><span class="o">.</span><span class="n">ready</span><span class="p">()</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">True</span><span class="p">:</span>
<span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">prediction</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">timeout</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">prediction</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/scripts/concurrent_test.py#L7-L17">here</a>.</p>
<p>Lastly, we'll define a function that uses the function above to test the
iris_model task concurrently:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">run_test</span><span class="p">():</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span><span class="s2">"sepal_length"</span><span class="p">:</span> <span class="mf">5.0</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">3.2</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"sepal_length"</span><span class="p">:</span> <span class="mf">5.5</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">3.5</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.3</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"sepal_length"</span><span class="p">:</span> <span class="mf">4.9</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">3.1</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.5</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">},</span>
<span class="p">{</span><span class="s2">"sepal_length"</span><span class="p">:</span> <span class="mf">4.4</span><span class="p">,</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">3.0</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.3</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">}</span>
<span class="p">]</span>
<span class="k">with</span> <span class="n">Executor</span><span class="p">(</span><span class="n">max_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span> <span class="k">as</span> <span class="n">exe</span><span class="p">:</span>
<span class="n">jobs</span> <span class="o">=</span> <span class="p">[</span><span class="n">exe</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">request_task</span><span class="p">,</span> <span class="n">d</span><span class="p">)</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">data</span><span class="p">]</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[</span><span class="n">job</span><span class="o">.</span><span class="n">result</span><span class="p">()</span> <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">jobs</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"The tasks returned these predictions: </span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">results</span><span class="p">))</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/task-queue-ml-model-deployment/blob/master/scripts/concurrent_test.py#L20-L30">here</a>.</p>
<p>The function sets up a few inputs for the model in the data list. It
then calls the task concurrently using the ThreadPoolExecutor context
manager from the concurrent.futures module in the Python standard
library. The context manager executes
the request_task() function concurrently in four worker threads.</p>
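<p>The concurrency pattern used by the script can be sketched with any function standing in for request_task() (the fake function below is purely illustrative):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def fake_request_task(data):
    # stand-in for the request_task() function that calls the Celery task
    return {"species": "setosa", "input": data}

inputs = [{"sepal_length": 5.0}, {"sepal_length": 5.5}]

# each submit() call schedules one task request on a worker thread;
# result() blocks until that request has completed
with ThreadPoolExecutor(max_workers=4) as executor:
    jobs = [executor.submit(fake_request_task, d) for d in inputs]
    results = [job.result() for job in jobs]

print(len(results))  # prints 2
```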
<p>To run the script, we'll need the redis docker image and the worker
process to be running. The script above can be executed from the command
line by using these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span>./
<span class="nb">export</span> <span class="nv">APP_SETTINGS</span><span class="o">=</span>ProdConfig
python3 scripts/concurrent_test.py
</code></pre></div>
<h1>Closing</h1>
<p>In this blog post I showed how to build a task queue application that is
able to host machine learning models. A task queue is very useful in
certain situations for deploying ML models because of the capabilities
it brings to the table. Task queues allow applications to do work
asynchronously in the background without blocking the main
application.</p>
<p>The ML model deployment strategy I showed in this blog post works in the
same way as the
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">previous</a>
<a href="https://www.tekhnoal.com/etl-job-ml-model-deployment.html">blog
posts</a>
I've published. The Celery application I built is not tied to a single
ML model; it works with any ML model that implements the MLModel base
class. The application can also host any number of models, and because
they are loaded from configuration, a new model can be
added to the Celery application without modifying the code. By following
good software engineering design practices, we are able to easily put
machine learning models into production without having to worry about
the implementation details of the models. All of these capabilities stem
from the design of the <a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">MLModel base
class</a>.</p>
<p>Another interesting feature of the Celery package is that we can launch
tasks from a variety of different languages. There are client libraries
for <a href="https://github.com/mher/node-celery">node.js</a> and
<a href="https://github.com/gjedeer/celery-php">PHP</a>. This
flexibility makes it possible to use Python for building and deploying
ML models, and to use other languages for the work that is best suited
for them.</p>
<p>A drawback of this approach is that when the Celery application is built
and deployed, the dependencies of the machine learning models that it is
hosting are installed along with it. This means that if two models
depend on different versions of scikit-learn or pandas, for example,
they won't be able to be installed in the same Celery application. This
limits the usefulness of the Celery application somewhat, since it can't
host models together that have conflicting requirements.</p>
<p>Another drawback of this approach is the extra complexity that it
entails, since it requires a message broker service, a result storage
service, and the worker processes to be running for the task queue to be
available to client applications. All of these requirements add extra
complexity to this deployment option.</p>
<p>The Celery application I built only handles single prediction
requests. Even though this is useful, it would make more sense for the
Celery application to run longer prediction jobs that make thousands of
predictions at a time. An improvement would be to launch prediction
tasks that take large files as input, feed the individual records in the
file to the model, and store the resulting predictions back into a
storage service.
The long-running task can also be instrumented to report its progress
back to the client that requested the predictions.</p>A Batch Job ML Model Deployment2019-09-20T09:24:00-05:002019-09-20T09:24:00-05:00Brian Schmidttag:www.tekhnoal.com,2019-09-20:/etl-job-ml-model-deployment.html<p>In previous blog posts I showed how to develop an ML model in such a way that makes it easy to deploy, and I showed how to create a web app that is able to deploy any model that follows the same design pattern. However, not all ML models are deployed within web apps. In this blog post I deploy the same model used in the previous blog posts as an ETL job.</p><p>This blog post continues the ideas started in
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">three</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/using-ml-model-abc.html">blog posts</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/etl-job-ml-model-deployment">github
repo</a>.</p>
<h1>Introduction</h1>
<p>In previous blog posts I showed how to develop an ML model in such a way
that makes it easy to deploy, and I showed how to create a web app that
is able to deploy any model that follows the same design pattern.
However, not all ML models are deployed within web apps.
In this blog post I deploy the same model used in the previous blog
posts as an ETL job.</p>
<p>An <a href="https://en.wikipedia.org/wiki/Extract,_transform,_load">ETL
job</a>
is a procedure for copying data from a source system into a destination
system, with some processing along the way. The acronym ETL stands for
extract, transform, and load; as in extract from a source system,
transform the data into a format compatible with the destination system,
and load the resulting data into the destination system. ETLs are most
commonly associated with <a href="https://en.wikipedia.org/wiki/Data_warehouse">data
warehousing</a>
systems, in which they are used to take data from a system of record and
transform it to make it useful for reporting.</p>
<p>ETL jobs are useful for making predictions available to end users or to
other systems. The ETL for such an ML model deployment looks like this:
extract features used for prediction from a source system, send the
features to the model for prediction, and save the predictions to a
destination system.</p>
<p>A big distinction between ML models that are deployed in an ETL job and
the Flask web application shown in the <a href="https://www.tekhnoal.com/using-ml-model-abc.html">previous blog
post</a>
is that the ETL job is <em>not</em> a real time system since it is not expected
to return predictions to the client quickly. ETLs are also meant to
process thousands of records at a time, whereas a web app processes one
record (request) at a time. A real-time deployment of an ML model should
be able to return single predictions in less than a second, whereas an ETL
deployment has a looser time constraint but makes many more predictions.</p>
<p>Another distinction between an ETL job deployment and a web service
deployment of an ML model is that an ETL saves predictions to data
storage, and the predictions are then accessed from there by the users
of the predictions. This means that the user of the predictions does not
interact with the model directly, and only has access to the predictions
saved since the ETL last ran. I call this distinction <em>interactive</em> vs.
<em>non-interactive</em> ML models. When an ML model is deployed
non-interactively, the users of the predictions have limitations as to
how they are able to use the model since they don't have direct access
to the model.</p>
<h1>Bonobo for ETL Jobs</h1>
<p>The <a href="https://www.bonobo-project.org/">bonobo package</a> is a
Python package for writing ETL jobs, offering a simple, Pythonic
interface for writing code that loads, transforms, and saves data. The
package works well for small datasets that can be processed in a single
process, but is less useful for larger datasets. Nevertheless, it is
perfect for small-scale data processing. The package has a strong
object-oriented bent and encourages good software engineering
practices through a well-designed API.</p>
<p>The bonobo package does data processing by running <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic
graphs
(DAG)</a>
of operations defined by the user. I won't get into the complex aspects
of what a DAG is in this post, so to define it simply: a DAG of data
processing steps is a set of steps that can be executed in a certain
order in time based on their dependencies. For example, in order to
transform a data record we must first load the record into memory,
therefore the Extract step must be done before the Transform step. Each
step in a DAG is called a "transformation"; a transformation can do one
of three things: load data, transform data, or save data.</p>
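<p>To make the DAG idea concrete, here is a minimal, hypothetical sketch of an extract, transform, and load chain built from plain Python generators (this is not the bonobo API, and all names in it are made up for illustration; bonobo wires up its transformations in a conceptually similar streaming fashion):</p>

```python
def extract():
    # Extract step: yield raw records from a source (here, an in-memory list).
    for record in [{"value": 1}, {"value": 2}, {"value": 3}]:
        yield record


def transform(records):
    # Transform step: modify each record as it streams through.
    for record in records:
        yield {"value": record["value"] * 10}


def load(records):
    # Load step: save each record to a destination (here, a list).
    destination = []
    for record in records:
        destination.append(record)
    return destination


result = load(transform(extract()))
print(result)  # [{'value': 10}, {'value': 20}, {'value': 30}]
```

<p>Because each step is a generator, records flow through the chain one at a time rather than being materialized all at once, which is what makes this style of DAG suitable for streaming data processing.</p>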
<h1>ETL Application</h1>
<p>To develop the ETL application with the Bonobo package, I first set up
the project structure:</p>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">data</span> <span class="ss">(</span><span class="nv">folder</span> <span class="k">for</span> <span class="nv">test</span> <span class="nv">data</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">model_etl</span> <span class="ss">(</span><span class="nv">folder</span> <span class="k">for</span> <span class="nv">application</span> <span class="nv">code</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">etl_job</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">model_node</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span><span class="nv">folder</span> <span class="k">for</span> <span class="nv">unit</span> <span class="nv">tests</span><span class="ss">)</span>
<span class="o">-</span> .<span class="nv">gitignore</span>
<span class="o">-</span> <span class="nv">Makefile</span>
<span class="o">-</span> <span class="nv">README</span>.<span class="nv">md</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span>
</code></pre></div>
<p>This folder structure can be seen
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment">here</a>
in the github repository.</p>
<p>This folder structure for the ETL application looks very similar to the
one used in the Flask application in the previous blog post. We will be
following the same practices as before, adding documentation, unit tests
and a Makefile to the application to ensure quality code and to make it
easier to use.</p>
<h1>MLModelTransformer Class</h1>
<p>Running a machine learning model prediction step inside an ETL DAG
requires many of the same things as running a model inside a web
application. In the <a href="https://www.tekhnoal.com/using-ml-model-abc.html">previous blog
post</a>
we managed instances of MLModel classes inside a ModelManager singleton
object. The ModelManager object was used by the web application to
maintain a list of MLModel objects, and returned information about them
on request.</p>
<p>When a model makes a prediction, it is making a transformation on an
incoming record and returning a prediction. Therefore, to embed an ML
model inside of a bonobo ETL job, we just need to write a
transformation. We can write a transformation as a class:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MLModelTransformer</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">module_name</span><span class="p">,</span> <span class="n">class_name</span><span class="p">):</span>
<span class="n">model_module</span> <span class="o">=</span> <span class="n">importlib</span><span class="o">.</span><span class="n">import_module</span><span class="p">(</span><span class="n">module_name</span><span class="p">)</span>
<span class="n">model_class</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">model_module</span><span class="p">,</span> <span class="n">class_name</span><span class="p">)</span>
<span class="n">model_object</span> <span class="o">=</span> <span class="n">model_class</span><span class="p">()</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">model_object</span><span class="p">,</span> <span class="n">MLModel</span><span class="p">)</span> <span class="ow">is</span> <span class="kc">False</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"The MLModelNode can only hold references to objects of type MLModel."</span><span class="p">)</span>
<span class="c1"># saving the model reference</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">model_object</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment/blob/master/model_etl/model_node.py#L7-L20">here</a>.</p>
<p>The __init__ method receives two parameters: module_name and
class_name. The __init__ method uses these parameters to
dynamically import and instantiate an MLModel class and saves a
reference to the newly created object. The __init__ method also
verifies that the class inherits from the MLModel base class.</p>
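<p>The dynamic import mechanism used by the __init__ method can be illustrated with a small stand-alone example that uses only the standard library (the class loaded here is from the standard library, purely for illustration):</p>

```python
import importlib


def load_class(module_name, class_name):
    # Dynamically import a module by name and look up a class inside it,
    # mirroring the pattern used in the MLModelTransformer __init__ method.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# For illustration, load a class from the standard library by name.
ordered_dict_class = load_class("collections", "OrderedDict")
instance = ordered_dict_class()
print(type(instance).__name__)  # OrderedDict
```

<p>This is what allows the module and class names to come from configuration or CLI parameters rather than being hardcoded as import statements.</p>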
<p>Just like the ModelManager class from the Flask app, the
MLModelTransformer class instantiates and maintains a reference to an
MLModel object internally. However, it is not meant to be a singleton
object and it only holds one MLModel object.</p>
<p>The MLModelTransformer class is meant to be plugged into a bonobo DAG
and exchange data with other transformations in the DAG. For that
purpose we implement a __call__ method:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="k">yield</span> <span class="bp">self</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="k">except</span> <span class="n">MLModelSchemaValidationException</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">e</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment/blob/master/model_etl/model_node.py#L22-L27">here</a>.</p>
<p>The __call__() method makes the class a
<a href="https://www.journaldev.com/22761/python-callable-__call__">callable</a>.
This mechanism is used by the bonobo package to feed data into the DAG
transformation and receive data back. The <em>yield</em> keyword allows bonobo
to run transformations asynchronously. By implementing the transformer
this way, we can compose many different DAGs that use MLModel derived
classes to do data transformations.</p>
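<p>To illustrate why implementing __call__ with yield is useful, here is a hypothetical stand-alone callable class (not part of this project's code) that can be invoked like a function while streaming its results lazily:</p>

```python
class UppercaseTransformer(object):
    """A hypothetical callable transformation that streams results with yield."""

    def __call__(self, data):
        # Yielding (rather than returning) lets the caller consume results
        # lazily, and lets a transformation emit zero, one, or many output
        # records per input record.
        yield data.upper()


transformer = UppercaseTransformer()
results = list(transformer("iris"))
print(results)  # ['IRIS']
```

<p>A framework that drives such a callable can iterate over the generator it returns, pushing each yielded record downstream as it becomes available.</p>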
<p>Now we can test the MLModelTransformer class to make sure it's
working as expected. First, we have to install a model into the
environment; we'll install the iris_model package that was built in a
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">previous blog
post</a>:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/schmidtbri/ml-model-abc-improvements
</code></pre></div>
<p>Now that we have a model package in the environment, we use a Python
interactive session to instantiate the class and try to make a
prediction:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">model_etl.model_node</span> <span class="kn">import</span> <span class="n">MLModelTransformer</span>
<span class="o">>>></span> <span class="n">model_transformer</span> <span class="o">=</span> <span class="n">MLModelTransformer</span><span class="p">(</span><span class="n">module_name</span><span class="o">=</span><span class="s2">"iris_model.iris_predict"</span><span class="p">,</span> <span class="n">class_name</span><span class="o">=</span><span class="s2">"IrisModel"</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">generator</span> <span class="o">=</span> <span class="n">model_transformer</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="p">{</span><span class="s2">"sepal_length"</span><span class="p">:</span> <span class="mf">4.4</span><span class="p">,</span>
<span class="o">...</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="mf">2.9</span><span class="p">,</span> <span class="s2">"petal_length"</span><span class="p">:</span> <span class="mf">1.4</span><span class="p">,</span> <span class="s2">"petal_width"</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">})</span>
<span class="o">>>></span> <span class="n">result</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">generator</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">result</span>
<span class="p">[{</span><span class="s1">'species'</span><span class="p">:</span> <span class="s1">'setosa'</span><span class="p">}]</span>
</code></pre></div>
<p>We first instantiate the transformer class by pointing it at the module
and class in the iris_model package that implements the MLModel base
class. Then we can make a prediction by calling the class with a single
dictionary object. The transformer yields its predictions, so we have
to cast the return value of the transformer into a list to view it on
the screen.</p>
<p>As in the previous blog posts, we are trying to write the code in such a
way as to make it reusable in many situations. The MLModelTransformer
class can be used to load and manage ML model objects in any bonobo ETL,
which saves time and work later. One caveat to this, however, is that
the ETL must feed records to the MLModelTransformer object exactly as
the MLModel expects them, since any schema differences will raise an
MLModelSchemaValidationException from the model within the transformer.
In practice, this means that the IrisModel.predict() method expects to
receive a dictionary containing several floating point numbers; if the
data source does not provide records with this schema, we have to
transform the incoming data to match it.</p>
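<p>For example, if a data source provided records with differently-named fields, a small adapter transformation placed in the graph before the model could rename them. This is a hypothetical sketch; the source field names are made up for illustration:</p>

```python
def adapt_record(record):
    # Hypothetical adapter: rename source fields to the schema the
    # IrisModel expects, and coerce string values to floats.
    return {
        "sepal_length": float(record["SepalLength"]),
        "sepal_width": float(record["SepalWidth"]),
        "petal_length": float(record["PetalLength"]),
        "petal_width": float(record["PetalWidth"]),
    }


source_record = {"SepalLength": "4.4", "SepalWidth": "2.9",
                 "PetalLength": "1.4", "PetalWidth": "0.2"}
print(adapt_record(source_record))
# {'sepal_length': 4.4, 'sepal_width': 2.9, 'petal_length': 1.4, 'petal_width': 0.2}
```

<p>An adapter like this would sit between the extractor and the MLModelTransformer in the chain, so that schema validation inside the model never fires on well-formed source data.</p>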
<h1>Creating a Graph</h1>
<p>A bonobo application runs an ETL from a Graph object that is defined at
application startup. Any number of transformations can be used, and they
can be arranged into complex DAGs. Every Graph object must contain at
least one extractor to get data from an outside source, and one loader
to save data to an outside destination. The bonobo package provides
several options for accessing data files; we'll use the LDJSON extractor
and loader transformations to define a simple Graph inside a function:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_graph</span><span class="p">(</span><span class="o">**</span><span class="n">options</span><span class="p">):</span>
<span class="n">graph</span> <span class="o">=</span> <span class="n">bonobo</span><span class="o">.</span><span class="n">Graph</span><span class="p">()</span>
<span class="n">graph</span><span class="o">.</span><span class="n">add_chain</span><span class="p">(</span>
<span class="n">LdjsonReader</span><span class="p">(</span><span class="n">options</span><span class="p">[</span><span class="s2">"input_file"</span><span class="p">],</span> <span class="n">mode</span><span class="o">=</span><span class="s1">'r'</span><span class="p">),</span>
<span class="n">MLModelTransformer</span><span class="p">(</span><span class="n">module_name</span><span class="o">=</span><span class="s2">"iris_model.iris_predict"</span><span class="p">,</span> <span class="n">class_name</span><span class="o">=</span><span class="s2">"IrisModel"</span><span class="p">),</span>
<span class="n">LdjsonWriter</span><span class="p">(</span><span class="n">options</span><span class="p">[</span><span class="s2">"output_file"</span><span class="p">],</span> <span class="n">mode</span><span class="o">=</span><span class="s1">'w'</span><span class="p">))</span>
<span class="k">return</span> <span class="n">graph</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment/blob/master/model_etl/graph.py#L8-L14">here</a>.</p>
<p>The function receives two file names as parameters. The input file name
is used to instantiate an LdjsonReader object that will load data from a
local JSON file, and the output file name is used to instantiate an
LdjsonWriter to write data to a local JSON file. The MLModelTransformer
class is instantiated by pointing it at the IrisModel class.</p>
<p>We can now instantiate the graph from an interactive Python session:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">model_etl.etl_job</span> <span class="kn">import</span> <span class="n">get_graph</span>
<span class="o">>>></span> <span class="n">graph</span> <span class="o">=</span> <span class="n">get_graph</span><span class="p">(</span><span class="s2">"data/input.json"</span><span class="p">,</span> <span class="s2">"data/output.json"</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">graph</span>
<span class="o"><</span><span class="n">bonobo</span><span class="o">.</span><span class="n">structs</span><span class="o">.</span><span class="n">graphs</span><span class="o">.</span><span class="n">Graph</span> <span class="nb">object</span> <span class="n">at</span> <span class="mh">0x10a52ffd0</span><span class="o">></span>
</code></pre></div>
<p>The great thing about this approach to building ETLs is that a different
reader or writer can be easily swapped in to add functionality, while
the core transformations of the ETL remain unchanged. For example, we
can implement a Graph that reads CSV files and writes TSV files in the
same module, and select it at runtime using a parameter.</p>
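<p>One way to select the file format at runtime is a small factory keyed by a format parameter. This is a hypothetical sketch; the reader classes here are simplified stand-ins for bonobo's file transformations, not the real API:</p>

```python
# Hypothetical stand-ins for bonobo's file reader transformations.
class LdjsonReader:
    def __init__(self, path):
        self.path = path


class CsvReader:
    def __init__(self, path):
        self.path = path


READERS = {
    "ldjson": LdjsonReader,
    "csv": CsvReader,
}


def get_reader(file_format, path):
    # Select the reader class at runtime based on a format parameter,
    # leaving the rest of the graph unchanged.
    try:
        return READERS[file_format](path)
    except KeyError:
        raise ValueError("Unsupported format: {}".format(file_format))


reader = get_reader("csv", "data/input.csv")
print(type(reader).__name__)  # CsvReader
```

<p>The same dictionary-dispatch approach could select an entire graph-building function, so a `--format` CLI parameter could switch the ETL between file types without touching the model transformation.</p>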
<h1>Running the ETL Process Locally</h1>
<p>The graph defined in the previous section works well when running it
from an interactive Python session, but it would be better to run it
from the command line. Before writing the code to create a simple command
line interface, we need to create some parameters for the input and
output file names:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_argument_parser</span><span class="p">(</span><span class="n">parser</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">bonobo</span><span class="o">.</span><span class="n">get_argument_parser</span><span class="p">(</span><span class="n">parser</span><span class="o">=</span><span class="n">parser</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"--input_file"</span><span class="p">,</span> <span class="s2">"-i"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Path of the input file."</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"--output_file"</span><span class="p">,</span> <span class="s2">"-o"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Path of the output file."</span><span class="p">)</span>
<span class="k">return</span> <span class="n">parser</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment/blob/master/model_etl/etl_job.py#L7-L14">here</a>.</p>
<p>The function retrieves the standard command line parser that is defined by
the bonobo package, and adds two parameters for the input and output
file names. The new parser object is then returned.</p>
<p>To create a CLI interface we define a __main__ block inside of
the etl_job.py module and use the parser defined above:</p>
<div class="highlight"><pre><span></span><code><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">get_argument_parser</span><span class="p">()</span>
<span class="k">with</span> <span class="n">bonobo</span><span class="o">.</span><span class="n">parse_args</span><span class="p">(</span><span class="n">parser</span><span class="p">)</span> <span class="k">as</span> <span class="n">options</span><span class="p">:</span>
<span class="n">bonobo</span><span class="o">.</span><span class="n">run</span><span class="p">(</span>
<span class="n">get_graph</span><span class="p">(</span><span class="o">**</span><span class="n">options</span><span class="p">),</span>
<span class="n">services</span><span class="o">=</span><span class="p">{}</span>
<span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment/blob/master/model_etl/etl_job.py#L17-L23">here</a>.</p>
<p>The graph can now be run from the command line with these commands:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span><span class="s2">"</span><span class="si">${</span><span class="nv">PYTHONPATH</span><span class="si">}</span><span class="s2">:./"</span>
python model_etl/etl_job.py --input_file<span class="o">=</span>data/input.json --output_file<span class="o">=</span>data/output.json
</code></pre></div>
<p>First, we add the current directory to the PYTHONPATH environment
variable to ensure that the Python modules will be found. Then we can
execute the graph through the command line interface in the etl_job.py
module, passing the CLI parameters. The input file is included in the
repository
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment/blob/master/data/input.json">here</a>,
and contains 15 records, which we can see were processed by the three
transformations in the graph. The output will be saved as an LDJSON file
in the data folder of the project.</p>
<p>The ETL graph looks pretty good now: it is able to run from the command
line and we can parametrize the input and output files. However, a
real-world ETL is probably not accessing data from the local hard drive,
so we'll add the ability to access data from other places.</p>
<h1>Accessing Data from a Service</h1>
<p>When testing an ETL job locally, it is easiest to load data from and
save data to the local hard drive. When running the ETL in a production
environment, the ETL code will most likely be accessing data from remote
storage systems. We could write new implementations of the
LdjsonReader and LdjsonWriter classes that access files from a remote
system, but this would not be a best practice.</p>
<p>To be able to write code once and reuse it in many different situations,
the bonobo package supports dependency injection through service
abstractions. A
<a href="https://en.wikipedia.org/wiki/Service_(systems_architecture)">service</a>
is a software component that provides functionality to other software
components. For example, the os Python package that is part of the
standard library can be thought of as a service, since it provides
access to the operating system. <a href="https://en.wikipedia.org/wiki/Dependency_injection">Dependency
injection</a>
is a software pattern that allows software components to be written in
such a way that makes them easier to reuse in many different
situations.</p>
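<p>A minimal hypothetical example of dependency injection (the class and function names here are made up for illustration): instead of a transformation opening files itself, it receives a filesystem-like object, so one implementation can later be swapped for another without changing the transformation:</p>

```python
class InMemoryFilesystem:
    # A hypothetical filesystem service backed by a dictionary; a local
    # or S3-backed implementation could expose the same read() interface.
    def __init__(self, files):
        self._files = files

    def read(self, path):
        return self._files[path]


def count_lines(filesystem, path):
    # The transformation depends on the injected service's interface,
    # not on a concrete filesystem, so any implementation can be swapped in.
    return len(filesystem.read(path).splitlines())


fs_service = InMemoryFilesystem({"data/input.json": "line1\nline2\n"})
print(count_lines(fs_service, "data/input.json"))  # 2
```

<p>This is exactly the property we want for the ETL: the graph keeps calling the same interface while the injected service decides whether the bytes come from the local disk or from S3.</p>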
<p>In the example we set up for this blog post, we are interested in
accessing files from a remote data source, but without changing the
ETL's Graph. In this way, we can easily change the data source of the
ETL in the future without changing the code of the ETL. To show how to
do this, I will swap the local filesystem for an S3 bucket as the file
source, without changing the bonobo Graph object.</p>
<p>The bonobo package provides a mechanism for injecting service instances
into a Graph at runtime. Right now, the JSON files are being accessed
through a local filesystem service that is injected by default into
every Graph. To be able to access files from a remote service, we'll
just replace the default filesystem service instance with another
service instance with the same interface that loads files from a remote
source.</p>
<p>As an example, we'll show how to access files stored in S3. To be able
to access files in an S3 bucket, we first have to install the fs-s3fs
package with this command:</p>
<div class="highlight"><pre><span></span><code>pip install fs-s3fs
</code></pre></div>
<p>Now we can instantiate a special type of filesystem that accesses files
from an AWS S3 bucket but has the same interface as a local filesystem.
The <a href="https://www.pyfilesystem.org/">fs package</a> already
provided this functionality when we accessed local files in the example
above, so we know that the code will work with the S3 filesystem.</p>
<p>To inject a service through the bonobo package we define a dictionary
like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_services</span><span class="p">(</span><span class="o">**</span><span class="n">options</span><span class="p">):</span>
<span class="k">return</span> <span class="p">{</span>
<span class="s1">'fs'</span><span class="p">:</span> <span class="n">S3FS</span><span class="p">(</span><span class="n">options</span><span class="p">[</span><span class="s2">"bucket"</span><span class="p">],</span>
<span class="n">aws_access_key_id</span><span class="o">=</span><span class="n">options</span><span class="p">[</span><span class="s2">"key"</span><span class="p">],</span>
<span class="n">aws_secret_access_key</span><span class="o">=</span><span class="n">options</span><span class="p">[</span><span class="s2">"secret_key"</span><span class="p">],</span>
<span class="n">endpoint_url</span><span class="o">=</span><span class="n">options</span><span class="p">[</span><span class="s2">"endpoint_url"</span><span class="p">],)</span>
<span class="p">}</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment/blob/master/model_etl/s3_etl_job.py#L8-L15">here</a>.</p>
<p>The new fs filesystem service replaces the service that bonobo
instantiates by default at startup. The extra options needed to connect
to S3 are received through keyword arguments; we'll provide them to the
function at runtime.</p>
<p>In order to run the new ETL, we'll create a new CLI interface for it:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">get_argument_parser</span><span class="p">(</span><span class="n">parser</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">bonobo</span><span class="o">.</span><span class="n">get_argument_parser</span><span class="p">(</span><span class="n">parser</span><span class="o">=</span><span class="n">parser</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"--input_file"</span><span class="p">,</span> <span class="s2">"-i"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Path of the input file."</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"--output_file"</span><span class="p">,</span> <span class="s2">"-o"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Path of the output file."</span><span class="p">)</span>
<span class="c1"># these parameters are added for accessing different S3 services</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"--bucket"</span><span class="p">,</span> <span class="s2">"-b"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Bucket name in S3 service."</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"--key"</span><span class="p">,</span> <span class="s2">"-k"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Key to access S3 service."</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"--secret_key"</span><span class="p">,</span> <span class="s2">"-sk"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Secret key to access the S3 service."</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"--endpoint_url"</span><span class="p">,</span> <span class="s2">"-ep"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Endpoint URL for S3 service."</span><span class="p">)</span>
<span class="k">return</span> <span class="n">parser</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment/blob/master/model_etl/s3_etl_job.py#L18-L31">here</a>.</p>
<p>The new command line argument parser still accepts input and output file
names, but now also receives parameters to access the S3 bucket where
the data to be processed is stored. The parameters are: the key and
secret key to access the bucket, and the endpoint URL for contacting the
S3 service.</p>
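<p>As a rough illustration, a parser with the same options can be exercised directly on a sample command line (argument names mirror the ones above; help strings are abbreviated, and this is a sketch rather than the exact parser from the repository):</p>

```python
import argparse

# Build a parser with the same options described above.
parser = argparse.ArgumentParser()
parser.add_argument("--input_file", "-i", type=str, default=None)
parser.add_argument("--output_file", "-o", type=str, default=None)
parser.add_argument("--bucket", "-b", type=str, default=None)
parser.add_argument("--key", "-k", type=str, default=None)
parser.add_argument("--secret_key", "-sk", type=str, default=None)
parser.add_argument("--endpoint_url", "-ep", type=str, default=None)

# Parsing an example command line yields a namespace of options.
options = parser.parse_args([
    "--input_file=input.json",
    "--output_file=output.json",
    "--bucket=data",
])
print(options.bucket)  # data
```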
<p>Lastly, we'll add a __main__ block that will actually run the ETL
job:</p>
<div class="highlight"><pre><span></span><code><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">'__main__'</span><span class="p">:</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">get_argument_parser</span><span class="p">()</span>
<span class="k">with</span> <span class="n">bonobo</span><span class="o">.</span><span class="n">parse_args</span><span class="p">(</span><span class="n">parser</span><span class="p">)</span> <span class="k">as</span> <span class="n">options</span><span class="p">:</span>
<span class="n">bonobo</span><span class="o">.</span><span class="n">run</span><span class="p">(</span>
<span class="n">get_graph</span><span class="p">(</span><span class="o">**</span><span class="n">options</span><span class="p">),</span>
<span class="n">services</span><span class="o">=</span><span class="n">get_services</span><span class="p">(</span><span class="o">**</span><span class="n">options</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/etl-job-ml-model-deployment/blob/master/model_etl/s3_etl_job.py#L34-L40">here</a>.</p>
<p>The bonobo graph that actually runs the ETL does not change at all,
since we are only injecting a new service for accessing the files. This
shows the power of accessing outside resources from your code through
interfaces, since it makes it possible to run the application in many
different contexts without changing the application code itself. In this
case, the code that actually accesses the files that will be processed
is injected at runtime into the DAG.</p>
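<p>The idea of injecting the file-access service can be sketched in a few lines. This is a minimal stand-in, not bonobo's actual API: the transform code depends only on an abstract "fs" service, and the concrete implementation is chosen at runtime.</p>

```python
import io

def extract(fs):
    # The transform only knows the service interface: open(path) -> file.
    with fs.open("input.json") as f:
        return f.read()

class InMemoryFS:
    """A stand-in service; a real run might inject an S3-backed one."""
    def __init__(self, files):
        self.files = files
    def open(self, path):
        return io.StringIO(self.files[path])

# The service is injected at runtime; swapping it changes where the
# data comes from without changing extract() at all.
services = {"fs": InMemoryFS({"input.json": '{"a": 1}'})}
result = extract(services["fs"])
print(result)  # {"a": 1}
```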
<p>In order to test the loading and saving of files to S3, we can run a
drop-in replacement service locally. The <a href="https://min.io/">minio
project</a> replicates the S3 API, and also
provides a docker image. To run an instance of minio locally, I used
this command:</p>
<div class="highlight"><pre><span></span><code>docker run -p <span class="m">9000</span>:9000 --name minio -e <span class="s2">"MINIO_ACCESS_KEY=TEST"</span> -e <span class="s2">"MINIO_SECRET_KEY=ASDFGHJKL"</span> -v /Users/brian/Code/etl-job-ml-model-deployment:/data minio/minio server/data
</code></pre></div>
<p>The minio service instance is accessing the local filesystem to serve
files, and I pointed it at the root of the project. When minio is
running in this way, it makes the folders it finds in the local
filesystem available as buckets through its interface. We can see the
files hosted by the minio service by accessing the minio web UI:</p>
<p><img alt="Minio UI" src="https://www.tekhnoal.com/minio_ui.png" width="100%"></p>
<p>Now we can try out the new ETL job by executing this command:</p>
<div class="highlight"><pre><span></span><code><span class="nb">export</span> <span class="nv">PYTHONPATH</span><span class="o">=</span><span class="s2">"</span><span class="si">${</span><span class="nv">PYTHONPATH</span><span class="si">}</span><span class="s2">:./"</span>
python model_etl/s3_etl_job.py --input_file<span class="o">=</span>input.json --output_file<span class="o">=</span>output.json --bucket<span class="o">=</span>data --key<span class="o">=</span>TEST --secret_key<span class="o">=</span>ASDFGHJKL --endpoint_url<span class="o">=</span>http://127.0.0.1:9000/
</code></pre></div>
<p>The command above will run the new ETL, providing it with the
credentials it needs to access the S3 service. This section showed how
by injecting dependencies into the bonobo Graph, we can change the way
the ETL accesses data without having to change the code of the ETL
itself.</p>
<h1>Closing</h1>
<p>In this blog post, I showed how to deploy the iris model developed in a
previous blog post inside of an ETL application. By splitting the
deployment code and the model code into separate packages, I'm able to
reuse the model in many different types of deployments. By structuring
the codebases in this way, I'm able to keep the machine learning code
separate from the deployment code very effectively.</p>
<p>In addition, by creating the MLModelTransformer class that works with
the bonobo package, we can leverage all of the tools that bonobo has for
building ETL applications. For example, the bonobo package provides
functionality to load data from CSV files, JSON files, and databases.
Bonobo also makes it easy to extend its capabilities with custom code
through its highly modular object-oriented design. It also enforces good
coding practices by supporting service dependency injection and
parametrization.</p>
<p>One downside of this example is that this ETL is not meant to handle
large-scale data processing, since it can only run on a single computer.
A better way to do data processing over data sets that don't fit in the
memory of a single computer is to use Apache Spark. Another drawback of
the Bonobo package is that it does not support joins and aggregations
over the data, since it only allows each incoming record to be processed
individually.</p>
<p>Even though the ETL application is able to make predictions with the
MLModelTransformer class, it is very common for business logic to also
be needed in a real-world deployment of an ML model. For example, we
might want to prevent the model from making a prediction in certain
locales or jurisdictions for legal reasons. For the sake of simplicity,
I didn't include any business logic in the DAG we defined. The business
logic should not be packaged inside of the MLModel class. We can keep it
separate by creating a separate transformer that implements the business
logic and putting it in the DAG. This way, we can apply the business
logic without mixing it with the machine learning code in the MLModel
class.</p>
<p>Another common situation in a real-world deployment of an ML model is
the need to keep track of the predictions made by the model outside of
the results that are provided to the clients of the system. This is a
special log that the model generates as it is operating. Some of the
contents of the prediction log would be: the inputs used to make a
prediction, internal data that the model generated as it was making a
prediction, and the output sent back to the client system. This is a
more advanced requirement of an ML model deployment that I may expand on
in another blog post.</p>Using the ML Model Base Class2019-07-28T09:12:00-05:002019-07-28T09:12:00-05:00Brian Schmidttag:www.tekhnoal.com,2019-07-28:/using-ml-model-abc.html<p>In previous blog posts I showed how to build a simple base class for abstracting machine learning models and how to create a python package that makes use of the base class. In this blog post I aim to use the ideas from the previous blog posts to build a simple application that uses the MLModel base class to deploy a model.</p><p>This blog post continues the ideas started in two
<a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">previous</a>
<a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">blog posts</a>.</p>
<p>The code in this blog post can be found in this <a href="https://github.com/schmidtbri/using-ml-model-abc">github repo</a>.</p>
<h1>Introduction</h1>
<p>In previous blog posts I showed how to build a simple base class for
abstracting machine learning models and how to create a python package
that makes use of the base class. In this blog post I aim to use the
ideas from the previous blog posts to build a simple application that
uses the MLModel base class to deploy a model. I will be using the
iris_model package built in <a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">this blog
post</a>.</p>
<p>When creating software, interacting with a component through an
abstraction makes the code easier to understand and evolve. In the
vocabulary of <a href="https://en.wikipedia.org/wiki/Software_design_pattern">software design
patterns</a>,
this is called the <a href="https://en.wikipedia.org/wiki/Strategy_pattern">strategy
pattern</a>.
When using the strategy pattern, the implementation details of a
software component (the "strategy") are not decided up front; they are
deferred until later. Instead, the interface between the component and
the code that uses it is designed and put into code first. The code
that uses the component is then written against this abstract
interface, trusting that any implementation will match the agreed-on
contract. Concrete implementations of the strategy can then be written
as needed. This approach makes it easy to switch between
implementations, and even to choose which implementation to use at
runtime, which makes the software more flexible.</p>
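<p>The strategy pattern described above can be sketched in a few lines (class names here are illustrative, not from the repository):</p>

```python
from abc import ABC, abstractmethod

class Strategy(ABC):
    """The agreed-on interface; implementations are decided later."""
    @abstractmethod
    def execute(self, value):
        ...

class DoubleStrategy(Strategy):
    def execute(self, value):
        return value * 2

class SquareStrategy(Strategy):
    def execute(self, value):
        return value ** 2

def run(strategy, value):
    # Client code is written against the abstraction, so the concrete
    # implementation can be swapped at runtime.
    return strategy.execute(value)

print(run(DoubleStrategy(), 3))  # 6
print(run(SquareStrategy(), 3))  # 9
```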
<p>By interacting with machine learning models through the MLModel
abstraction, it becomes possible to build applications that can host any
model that implements the MLModel interface. This way, simple model
deployments become much faster since a custom-made application is not
needed to put a model into production. The application I will show in
this blog post takes advantage of this fact to allow a software engineer
to install and deploy any number of models that implement the MLModel
base class inside a web application.</p>
<p>Overall, I aim to show how to deploy the model code in the iris_model
package into a simple web application. I also want to show how the
MLModel abstraction makes the use of machine learning models much easier
in production software.</p>
<h1>Flask Web Application</h1>
<p>One of the simplest ways to build a web application with python is to
use the <a href="https://www.fullstackpython.com/flask.html">Flask framework</a>.
Flask makes it easy to set up a simple web application that serves web
pages and a RESTful interface.</p>
<p>To begin, I set up the project structure for the application package:</p>
<div class="highlight"><pre><span></span><code><span class="o">-</span> <span class="nv">model_service</span>
<span class="o">-</span> <span class="nv">static</span> <span class="ss">(</span> <span class="nv">folder</span> <span class="nv">containing</span> <span class="nv">the</span> <span class="nv">static</span> <span class="nv">web</span> <span class="nv">assets</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">templates</span> <span class="ss">(</span> <span class="nv">folder</span> <span class="k">for</span> <span class="nv">the</span> <span class="nv">html</span> <span class="nv">templates</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">__init__</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">config</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">endpoints</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">model_manager</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">schemas</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">views</span>.<span class="nv">py</span>
<span class="o">-</span> <span class="nv">scripts</span> <span class="ss">(</span> <span class="nv">folder</span> <span class="nv">containing</span> <span class="nv">scripts</span> <span class="ss">)</span>
<span class="o">-</span> <span class="nv">tests</span> <span class="ss">(</span> <span class="nv">folder</span> <span class="nv">containing</span> <span class="nv">the</span> <span class="nv">unit</span> <span class="nv">test</span> <span class="nv">suite</span><span class="ss">)</span>
<span class="o">-</span> <span class="nv">requirements</span>.<span class="nv">txt</span>
<span class="o">-</span> <span class="nv">test_requirements</span>.<span class="nv">txt</span>
</code></pre></div>
<p>This structure can be seen
<a href="https://github.com/schmidtbri/using-ml-model-abc">here</a>
in the github repository.</p>
<p>The Flask application is set up with this code in the __init__.py
file:</p>
<div class="highlight"><pre><span></span><code><span class="n">app</span> <span class="o">=</span> <span class="n">Flask</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
<span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"APP_SETTINGS"</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">from_object</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'APP_SETTINGS'</span><span class="p">])</span>
<span class="n">bootstrap</span> <span class="o">=</span> <span class="n">Bootstrap</span><span class="p">(</span><span class="n">app</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/__init__.py#L8-L12">here</a>.</p>
<p>The Flask application is initialized by instantiating the Flask()
class. The configuration is imported from the configuration classes
found in the config.py file; there is one configuration class per
environment. The name of the configuration to use is read from the
"APP_SETTINGS" environment variable, which makes it easy to change the
configuration of the app at runtime.</p>
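<p>The per-environment configuration classes follow a pattern like the sketch below. The class names and settings here are hypothetical, not the repository's actual config; Flask's <code>app.config.from_object()</code> reads the uppercase attributes of whichever class is named.</p>

```python
import os

# One base class with defaults, and one subclass per environment.
class Config:
    DEBUG = False
    MODELS = []

class DevelopmentConfig(Config):
    DEBUG = True

class ProductionConfig(Config):
    DEBUG = False

# The APP_SETTINGS environment variable selects the class at runtime,
# e.g. APP_SETTINGS="config.ProductionConfig" in a real deployment.
os.environ["APP_SETTINGS"] = "DevelopmentConfig"
selected = globals()[os.environ["APP_SETTINGS"]]
print(selected.DEBUG)  # True
```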
<p>The configuration classes can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/config.py">here</a>.
More information about this pattern for managing and importing
configuration details in Flask applications can be found
<a href="https://flask.palletsprojects.com/en/1.1.x/config/#configuring-from-environment-variables">here</a>.
Lastly, I am using the
<a href="https://pythonhosted.org/Flask-Bootstrap/basic-usage.html">flask_bootstrap</a>
package for adding bootstrap elements to the web pages; this package is
initialized after loading the configuration.</p>
<p>So far, this is a simple Flask application that is not able to manage or
serve machine learning models, in the next section we will start to add
the functionality needed to do this.</p>
<h1>Model Manager Class</h1>
<p>In order to use the iris_model class within the Flask application we
are building, we need to have a way to manage the model object within
the Python process. To do this we will create a ModelManager class that
follows the <a href="https://en.wikipedia.org/wiki/Singleton_pattern">singleton
pattern</a>.
The ModelManager class will be instantiated one time at application
startup. The ModelManager singleton instantiates MLModel classes from
configuration, and returns information about the model objects being
managed as well as references to the model objects.</p>
<p>Let's get started, here is the class declaration:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">ModelManager</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="n">_models</span> <span class="o">=</span> <span class="p">[]</span>
</code></pre></div>
<p>The code above can be found <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/model_manager.py#L6-L8">here</a>.</p>
<p>The ModelManager class has a private list property called _models that
will contain the references to the model objects that are under
management.</p>
<p>Now we need a way to actually instantiate the model classes, the code to
do this is below:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@classmethod</span>
<span class="k">def</span> <span class="nf">load_models</span><span class="p">(</span><span class="bp">cls</span><span class="p">,</span> <span class="n">configuration</span><span class="p">):</span>
<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">configuration</span><span class="p">:</span>
<span class="n">model_module</span> <span class="o">=</span> <span class="n">importlib</span><span class="o">.</span><span class="n">import_module</span><span class="p">(</span><span class="n">c</span><span class="p">[</span><span class="s2">"module_name"</span><span class="p">])</span>
<span class="n">model_class</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">model_module</span><span class="p">,</span> <span class="n">c</span><span class="p">[</span><span class="s2">"class_name"</span><span class="p">])</span>
<span class="n">model_object</span> <span class="o">=</span> <span class="n">model_class</span><span class="p">()</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">model_object</span><span class="p">,</span> <span class="n">MLModel</span><span class="p">)</span> <span class="ow">is</span> <span class="kc">False</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"The ModelManager can only hold references to objects of type MLModel."</span><span class="p">)</span>
<span class="bp">cls</span><span class="o">.</span><span class="n">_models</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">model_object</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/model_manager.py#L10-L21">here</a>.</p>
<p>The load_models() class method receives a configuration dictionary
object and iterates through it, importing the classes from the
environment, instantiating the classes, and saving the references to the
objects in the _models class property. The method also checks that the
classes being imported and instantiated are instances of the MLModel
base class. The ModelManager singleton object is able to hold any number
of model objects.</p>
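<p>The dynamic import at the heart of load_models() can be tried on its own with a standard-library class (collections.OrderedDict stands in for a model class here purely for illustration):</p>

```python
import importlib

configuration = [
    {"module_name": "collections", "class_name": "OrderedDict"},
]

instances = []
for c in configuration:
    # Import the module by name, look up the class on it, and
    # instantiate it, just as load_models() does for MLModel subclasses.
    module = importlib.import_module(c["module_name"])
    cls = getattr(module, c["class_name"])
    instances.append(cls())

print(type(instances[0]).__name__)  # OrderedDict
```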
<p>The ModelManager class also provides three other methods that help to
use the models that it manages. The <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/model_manager.py#L23-L33">get_models()
method</a>
returns a list of dictionaries with information about the model object.
The
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/model_manager.py#L35-L52">get_model_metadata()</a>
method returns detailed data about a single model object, identified
with the qualified_name property of the model object. The metadata
returned by this method contains the input and output schemas of the
model encoded as JSON schema dictionaries. Lastly, the
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/model_manager.py#L54-L63">get_model()</a>
method searches the models in the _models list and returns a reference
to one model object. When searching through the list of model objects in
the _models class property, the qualified name of the model is used to
identify the model.</p>
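<p>The qualified-name lookup that get_model() performs amounts to a search like the following sketch (the model records here are hypothetical):</p>

```python
models = [
    {"qualified_name": "iris_model", "display_name": "Iris Model"},
    {"qualified_name": "other_model", "display_name": "Other Model"},
]

def get_model(qualified_name):
    # Return the first model whose qualified name matches, else None.
    return next(
        (m for m in models if m["qualified_name"] == qualified_name),
        None,
    )

print(get_model("iris_model")["display_name"])  # Iris Model
```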
<p>With the ModelManager class, we can now test it out with the iris_model
package from <a href="https://www.tekhnoal.com/improving-the-mlmodel-base-class.html">the previous blog
post</a>.
To do this we need to install the package from github by executing this
command:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/schmidtbri/ml-model-abc-improvements
</code></pre></div>
<p>Once we have the iris_model package installed in the environment, we
can use a python interactive session to execute this code to try out the
ModelManager class:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">model_service.model_manager</span> <span class="kn">import</span> <span class="n">ModelManager</span>
<span class="o">>>></span> <span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">model_manager</span><span class="o">.</span><span class="n">load_models</span><span class="p">(</span><span class="n">configuration</span><span class="o">=</span><span class="p">[</span>
<span class="o">...</span> <span class="p">{</span>
<span class="o">...</span> <span class="s2">"module_name"</span><span class="p">:</span> <span class="s2">"iris_model.iris_predict"</span><span class="p">,</span>
<span class="o">...</span> <span class="s2">"class_name"</span><span class="p">:</span> <span class="s2">"IrisModel"</span>
<span class="o">...</span> <span class="p">}</span>
<span class="o">...</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
<span class="p">[{</span><span class="s1">'display_name'</span><span class="p">:</span> <span class="s1">'Iris Model'</span><span class="p">,</span> <span class="s1">'qualified_name'</span><span class="p">:</span> <span class="s1">'iris_model'</span><span class="p">,</span> <span class="s1">'description'</span><span class="p">:</span> <span class="s1">'A machine learning model for predicting the species of a flower based on its measurements.'</span><span class="p">,</span> <span class="s1">'major_version'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">'minor_version'</span><span class="p">:</span> <span class="mi">1</span><span class="p">}]</span>
</code></pre></div>
<p>The ModelManager class is being used to load the IrisModel class which
is found in the iris_model package within the iris_predict module;
the information needed to find the class is held within the
configuration. Once the model object is instantiated, the get_models()
method is called to get data about the models in memory.</p>
<p>In order to use the ModelManager class within the Flask application we
have to instantiate it and call the load_models() method. Since the model
classes will load their parameters from disk when they are instantiated,
it's important that we only do this one time at application startup. We
can do that by adding this code to the __init__.py module:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">before_first_request</span>
<span class="k">def</span> <span class="nf">instantiate_model_manager</span><span class="p">():</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">model_manager</span><span class="o">.</span><span class="n">load_models</span><span class="p">(</span><span class="n">configuration</span><span class="o">=</span><span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="p">[</span><span class="s2">"MODELS"</span><span class="p">])</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/__init__.py#L18-L22">here</a>.</p>
<p>The @app.before_first_request decorator on the function causes it to
be executed before requests can be handled by the application. The model
manager configuration is loaded from the Flask application configuration
found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/config.py#L5-L10">here</a>.</p>
<p>The ModelManager class handles the complexities of instantiating and
managing model objects in memory. As long as a MLModel-derived class can
be found in the python environment, then it can be loaded and managed by
the ModelManager class.</p>
<h1>Flask REST Endpoints</h1>
<p>To make use of the models hosted in the ModelManager object, we will
first build a simple REST interface that will allow clients to find and
make predictions. To define the data models that are returned by the
REST interface we make use of the <a href="https://marshmallow.readthedocs.io/en/3.0/quickstart.html">marshmallow schema
package</a>.
Although it's not strictly necessary to use it to build a web app, the
marshmallow package provides a simple and quick way to build schemas and
do serialization and deserialization.</p>
<p>The Flask application has three endpoints: a models endpoint for getting
information about all models hosted by the app, a metadata endpoint for
getting information about a specific model, and a predict endpoint for
making predictions with a specific model.</p>
<p>The models endpoint is created by registering a function with the Flask
application:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s2">"/api/models"</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s1">'GET'</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">get_models</span><span class="p">():</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">models</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
<span class="n">response_data</span> <span class="o">=</span> <span class="n">model_collection_schema</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">models</span><span class="o">=</span><span class="n">models</span><span class="p">))</span><span class="o">.</span><span class="n">data</span>
<span class="k">return</span> <span class="n">response_data</span><span class="p">,</span> <span class="mi">200</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/endpoints.py#L14-L32">here</a>.</p>
<p>The function uses the ModelManager class to access data about all models
hosted within it. It uses the get_models() method in the same way that
the index view does. The response_data is serialized
using a marshmallow schema object which is instantiated from the schema
class defined
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/schemas.py#L4-L15">here</a>.</p>
<p>The metadata endpoint is built similarly to the models endpoint. The
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/endpoints.py#L35-L67">metadata endpoint function</a>
uses the ModelManager class to access information about the models. In
the same way as the models endpoint, the metadata endpoint also defines
a set of <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/schemas.py#L18-L37">schema
classes</a>
for serialization.</p>
<p>The <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/endpoints.py#L70-L130">predict endpoint</a>
functions differently from the previous endpoints since it does not
define a schema class for the input and output data that it expects. If
a client wants to know what fields it needs to send to a model to make a
prediction, it can find a description of the fields in the JSON schema
published by the metadata endpoint. If a new version of a model with new
input or output schemas is installed into the Flask application, the
code of the Flask app would not have to change at all to accommodate the
new model.</p>
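<p>A sketch of the idea: instead of a hard-coded schema, the endpoint can check incoming data against whatever JSON schema the model publishes. The minimal checker below only verifies required fields (a real endpoint would use a full JSON Schema validator), and the schema shown is a hypothetical example of what a model might publish.</p>

```python
# Hypothetical input schema, as a model might publish it in its metadata.
input_schema = {
    "type": "object",
    "required": [
        "sepal_length", "sepal_width", "petal_length", "petal_width",
    ],
}

def check_required(data, schema):
    # Report any required fields missing from the request body; the
    # endpoint code never needs to know the field names in advance.
    return [f for f in schema.get("required", []) if f not in data]

missing = check_required(
    {"sepal_length": 5.1, "sepal_width": 3.5}, input_schema
)
print(missing)  # ['petal_length', 'petal_width']
```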
<h1>Flask Views</h1>
<p>The Flask framework is also able to render web pages using Jinja
templates, a great guide for learning about this can be found
<a href="https://code.tutsplus.com/tutorials/templating-with-jinja2-in-flask-essentials--cms-25571">here</a>.
To add webpages rendered with Jinja templates to the web application I
added the <a href="https://github.com/schmidtbri/using-ml-model-abc/tree/master/model_service/templates">templates
folder</a>
to the application package. In it I created the <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/templates/base.html">base html
template</a>,
from which other templates inherit. The base template uses styles from
the bootstrap package. To render the templates into views I also added
the <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/views.py">views.py
module</a>.</p>
<p>In order to show some information about the models that are in the
ModelManager object, I added the <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/templates/index.html">index.html
template</a>.
To render the template, I added this code to the views.py module:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s1">'/'</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s1">'GET'</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">index</span><span class="p">():</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">models</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_models</span><span class="p">()</span>
<span class="k">return</span> <span class="n">render_template</span><span class="p">(</span><span class="s1">'index.html'</span><span class="p">,</span> <span class="n">models</span><span class="o">=</span><span class="n">models</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/views.py#L8-L17">here</a>.</p>
<p>The index view function is registered at the Flask application's
root URL, so it serves as the homepage. The ModelManager
is then instantiated, but since it is a singleton that was first
instantiated at application startup, the reference to the singleton
object is returned with all of the model objects already loaded. Next,
we use the singleton's get_models() method to get a list of models
available. Lastly, we send the list of models returned to the template
for rendering, and return the resulting webpage to the user. This view
also renders links to a model's metadata and prediction views. These
views are presented below. The index webpage looks like this:</p>
<p><img alt="Index View" src="https://www.tekhnoal.com/index_view.png" width="100%"></p>
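<p>The singleton behavior described above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern, not the actual ModelManager implementation:</p>

```python
class ModelManager:
    """Minimal sketch of a singleton that holds loaded model objects."""
    _instance = None

    def __new__(cls):
        # Return the already-created instance on every call after the first,
        # so models loaded at application startup stay available in views.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = []
        return cls._instance

    def load_model(self, model):
        self._models.append(model)

    def get_models(self):
        return list(self._models)
```

<p>Because every call to ModelManager() returns the same object, a view function can instantiate it cheaply and still see the models that were loaded at startup.</p>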
<p>A similar approach is followed for the metadata view, which displays an
individual model's metadata as well as the input and output schemas. The
template for this view is
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/templates/metadata.html">here</a>,
and the view function is
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/views.py#L20-L29">here</a>.
One difference between this view and the index view is that it accepts a
path parameter that determines which model's metadata is rendered in the
view. The metadata webpage looks like this:</p>
<p><img alt="Metadata View" src="https://www.tekhnoal.com/metadata_view.png" width="100%"></p>
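<p>As an illustration of a view that takes a path parameter, here is a self-contained sketch. The route and the in-memory metadata store are hypothetical stand-ins for the real views.py code and the ModelManager singleton:</p>

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical stand-in for the metadata held by the ModelManager singleton.
MODEL_METADATA = {
    "iris_model": {"display_name": "Iris Model", "version": "0.1.0"}
}

@app.route("/models/<qualified_name>/metadata", methods=["GET"])
def display_metadata(qualified_name):
    # Flask passes the path segment in as the qualified_name argument,
    # which determines which model's metadata is returned.
    metadata = MODEL_METADATA.get(qualified_name)
    if metadata is None:
        return jsonify(error="model not found"), 404
    return jsonify(metadata)
```

<p>The real view renders an HTML template instead of returning JSON, but the path-parameter mechanics are the same.</p>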
<h1>Dynamic Web Form</h1>
<p>The last webpage of the application uses both a view that renders the
page and the predict REST endpoint. The prediction web page for a model
renders a dynamic form from the input JSON schema provided by the model,
accepts user input and sends it to the prediction REST endpoint when
the user presses the "Predict" button, and lastly displays the
prediction results from the model.</p>
<p>The prediction web page is rendered like the other views:</p>
<div class="highlight"><pre><span></span><code><span class="nd">@app</span><span class="o">.</span><span class="n">route</span><span class="p">(</span><span class="s2">"/models/<qualified_name>/predict"</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s1">'GET'</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">display_form</span><span class="p">(</span><span class="n">qualified_name</span><span class="p">):</span>
<span class="n">model_manager</span> <span class="o">=</span> <span class="n">ModelManager</span><span class="p">()</span>
<span class="n">model_metadata</span> <span class="o">=</span> <span class="n">model_manager</span><span class="o">.</span><span class="n">get_model_metadata</span><span class="p">(</span><span class="n">qualified_name</span><span class="o">=</span><span class="n">qualified_name</span><span class="p">)</span>
<span class="k">return</span> <span class="n">render_template</span><span class="p">(</span><span class="s1">'predict.html'</span><span class="p">,</span> <span class="n">model_metadata</span><span class="o">=</span><span class="n">model_metadata</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/views.py#L32-L38">here</a>.</p>
<p>The template, however, is different because it uses JQuery to get the
input schema of the model from the metadata endpoint:</p>
<div class="highlight"><pre><span></span><code><span class="x">$(document).ready(function() {</span>
<span class="x">$.ajax({</span>
<span class="x"> url: '/api/models/</span><span class="cp">{{</span><span class="nv">model_metadata.qualified_name</span><span class="cp">}}</span><span class="x">/metadata',</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/templates/predict.html#L14-L16">here</a>.</p>
<p>If the request returns successfully then we use the <a href="https://github.com/brutusin/json-forms">brutusin forms
package</a> to
render a form from the model's input JSON schema. The webform created
from the JSON schema is dynamic, which allows a custom form to be
created for any model that is hosted by the application. Below is the
code to render the form:</p>
<div class="highlight"><pre><span></span><code><span class="n">success</span><span class="o">:</span><span class="w"> </span><span class="kd">function</span><span class="o">(</span><span class="n">data</span><span class="o">)</span><span class="w"> </span><span class="o">{</span><span class="w"></span>
<span class="w"> </span><span class="n">var</span><span class="w"> </span><span class="n">container</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">document</span><span class="o">.</span><span class="na">getElementById</span><span class="o">(</span><span class="s1">'prediction_form'</span><span class="o">);</span><span class="w"></span>
<span class="w"> </span><span class="n">var</span><span class="w"> </span><span class="n">BrutusinForms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">brutusin</span><span class="o">[</span><span class="s2">"json-forms"</span><span class="o">];</span><span class="w"></span>
<span class="w"> </span><span class="n">bf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">BrutusinForms</span><span class="o">.</span><span class="na">create</span><span class="o">(</span><span class="n">data</span><span class="o">.</span><span class="na">input_schema</span><span class="o">);</span><span class="w"></span>
<span class="w"> </span><span class="n">bf</span><span class="o">.</span><span class="na">render</span><span class="o">(</span><span class="n">container</span><span class="o">);</span><span class="w"></span>
<span class="o">}</span><span class="w"></span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/templates/predict.html#L17-L22">here</a>.</p>
<p>Lastly, there is a <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/templates/predict.html#L29-L36">JQuery
request</a>
to make the prediction when the user presses the "Predict" button, and a
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/templates/predict.html#L37-L45">callback
function</a>
that renders the prediction to the webpage.</p>
<p>Here is a screen shot of the prediction webpage:</p>
<p><img alt="Predict View" src="https://www.tekhnoal.com/predict_view.png" width="100%"></p>
<h1>Documentation</h1>
<p>To make the REST API easier to use we will produce documentation for it.
A common way to document RESTful interfaces is the <a href="https://swagger.io/docs/specification/about/">OpenAPI
specification</a>.
In order to automatically create an OpenAPI document for the RESTful API
that the model service provides, I used the python <a href="https://github.com/marshmallow-code/apispec">apispec
package</a>.
The apispec package is able to automatically extract schema information
from marshmallow Schema classes, and is able to extract endpoint
specifications from Flask @app.route decorated functions.</p>
<p>To be able to automatically extract the OpenAPI specification document
from the code, I created a python script called
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/scripts/openapi.py">openapi.py</a>.
The script creates an object to describe the document:</p>
<div class="highlight"><pre><span></span><code><span class="n">spec</span> <span class="o">=</span> <span class="n">APISpec</span><span class="p">(</span>
<span class="n">openapi_version</span><span class="o">=</span><span class="s2">"3.0.2"</span><span class="p">,</span>
<span class="n">title</span><span class="o">=</span><span class="s1">'Model Service'</span><span class="p">,</span>
<span class="n">version</span><span class="o">=</span><span class="s1">'0.1.0'</span><span class="p">,</span>
<span class="n">info</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="vm">__doc__</span><span class="p">),</span>
<span class="n">plugins</span><span class="o">=</span><span class="p">[</span><span class="n">FlaskPlugin</span><span class="p">(),</span> <span class="n">MarshmallowPlugin</span><span class="p">()],</span>
<span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/scripts/openapi.py#L9-L15">here</a>.</p>
<p>Then we can add the marshmallow schema classes, which are imported from
the <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/schemas.py">schemas.py
module</a>:</p>
<div class="highlight"><pre><span></span><code><span class="n">spec</span><span class="o">.</span><span class="n">components</span><span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="s2">"ModelSchema"</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">ModelSchema</span><span class="p">)</span>
<span class="n">spec</span><span class="o">.</span><span class="n">components</span><span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="s2">"ModelCollectionSchema"</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">ModelCollectionSchema</span><span class="p">)</span>
<span class="n">spec</span><span class="o">.</span><span class="n">components</span><span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="s2">"JsonSchemaProperty"</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">JsonSchemaProperty</span><span class="p">)</span>
<span class="n">spec</span><span class="o">.</span><span class="n">components</span><span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="s2">"JSONSchema"</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">JSONSchema</span><span class="p">)</span>
<span class="n">spec</span><span class="o">.</span><span class="n">components</span><span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="s2">"ModelMetadataSchema"</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">ModelMetadataSchema</span><span class="p">)</span>
<span class="n">spec</span><span class="o">.</span><span class="n">components</span><span class="o">.</span><span class="n">schema</span><span class="p">(</span><span class="s2">"ErrorSchema"</span><span class="p">,</span> <span class="n">schema</span><span class="o">=</span><span class="n">ErrorSchema</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/scripts/openapi.py#L17-L22">here</a>.</p>
<p>To document the paths of the API, the OpenAPI specification has to be
added to the docstrings of the controller functions that are registered
with the Flask application; an example of how to do this can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/model_service/endpoints.py#L16-L25">here</a>.
After this is done, we can add the paths to the OpenAPI document using
the code below:</p>
<div class="highlight"><pre><span></span><code><span class="k">with</span> <span class="n">app</span><span class="o">.</span><span class="n">test_request_context</span><span class="p">():</span>
<span class="n">spec</span><span class="o">.</span><span class="n">path</span><span class="p">(</span><span class="n">view</span><span class="o">=</span><span class="n">get_models</span><span class="p">)</span>
<span class="n">spec</span><span class="o">.</span><span class="n">path</span><span class="p">(</span><span class="n">view</span><span class="o">=</span><span class="n">get_metadata</span><span class="p">)</span>
<span class="n">spec</span><span class="o">.</span><span class="n">path</span><span class="p">(</span><span class="n">view</span><span class="o">=</span><span class="n">predict</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/scripts/openapi.py#L24-L27">here</a>.</p>
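<p>To show roughly what such a docstring looks like, here is a hedged sketch of a controller function annotated for apispec. The endpoint and response description are hypothetical; the actual docstrings are in the endpoints.py module linked above:</p>

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/models", methods=["GET"])
def get_models():
    """Get a list of hosted models.
    ---
    get:
      responses:
        200:
          description: List of model metadata objects.
    """
    # apispec reads everything after the '---' marker as OpenAPI YAML
    # when spec.path(view=get_models) is called.
    return jsonify([])
```
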
<p>Once all the components are loaded from the codebase, the OpenAPI
document can be saved to disk as a YAML file, using <a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/scripts/openapi.py#L29-L30">this
code</a>.
The resulting file can be found
<a href="https://github.com/schmidtbri/using-ml-model-abc/blob/master/openapi_specification.yaml">here</a>.
There is also an <a href="https://editor.swagger.io/">open source
viewer</a>
for OpenAPI documents, which can do automatic code generation and
render a webpage for viewing the document:</p>
<p><img alt="OpenAPI Documentation" src="https://www.tekhnoal.com/openapi_doc.png" width="100%"></p>
<h1>Conclusion</h1>
<p>In this blog post I showed how to create a web application that is able
to host any model that inherits from and follows the standards of the
MLModel base class. By using an abstraction to deal with machine
learning model code, it becomes possible to write an application that
can deploy any model, instead of building applications that can deploy
only one ML model.</p>
<p>A drawback of this blog post's approach is that the types of the fields
in objects given to and returned from the model object's predict() method
must be serializable to JSON and the schema package must be able to
create a JSON schema for them. This is not always easy to do with more
complicated data models. Since this is a web application, the use of
JSON schema makes a lot of sense, but there are situations in which a
JSON schema is not the best way to publish schema information.</p>
<p>A point I want to highlight is that I am purposefully maintaining
separate codebases for the model code and the application code. In this
approach, the model is a python package that is installed into the
application codebase. By separating the model code from the application
code, creating new versions of the model becomes simpler and more
straightforward. It also enables Data Scientists and engineers to
maintain separate codebases that better fit their needs, as well as
making it possible to deploy the same model package in multiple
applications and to deploy different versions of the same model.</p>Improving the MLModel Base Class2019-06-12T09:21:00-05:002019-06-12T09:21:00-05:00Brian Schmidttag:www.tekhnoal.com,2019-06-12:/improving-the-mlmodel-base-class.html<p>In the previous blog post in this series I showed an object oriented design for a base class that does Machine Learning model prediction. The design of the base class was intentionally very simple so that I could show a simple example of how to use the base class with a scikit-learn model. I showed an easy way to publish schema metadata about the model inputs and outputs, and how to write model deserialization code so that it is hidden from the users of the model. I also showed how to hide the implementation details of the model by translating the user's input to the model's input so that the user of the model doesn't have to know how to use pandas or numpy. In this blog post I will continue to make improvements to the MLModel class and the example that I used in the previous post.</p><p>This blog post continues with the ideas developed in the previous post
in this series.</p>
<p>All of the code shown in this post can be found in <a href="https://github.com/schmidtbri/ml-model-abc-improvements">this Github repository</a>.</p>
<p>In the previous blog post in this series I showed an object oriented
design for a base class that does Machine Learning model prediction. The
design of the base class was intentionally very simple so that I could
show a simple example of how to use the base class with a scikit-learn
model. I showed an easy way to publish schema metadata about the model
inputs and outputs, and how to write model deserialization code so that
it is hidden from the users of the model. I also showed how to hide the
implementation details of the model by translating the user's input to
the model's input so that the user of the model doesn't have to know how
to use pandas or numpy. In this blog post I will continue to make
improvements to the MLModel class and the example that I used in the
previous post.</p>
<p>In this blog post I will make the iris example code from the previous
post into a full python package with many features that will make the
iris model easier to install and use from other python packages. I will
also continue to improve the MLModel base class. In general, I want to
show how to make ML code easier to install and use.</p>
<p>When I was doing research for this blog post I found a great <a href="https://towardsdatascience.com/building-package-for-machine-learning-project-in-python-3fc16f541693">blog post</a>
by <a href="https://towardsdatascience.com/@mbednarski">Mateusz Bednarski</a>
showing how to build machine learning models as python packages. There
are some similarities between what I will show here and that blog post,
however, this post focuses more on the deployment of ML models into
production systems, whereas Mateusz's post focuses on packaging the
training code.</p>
<p>This blog post assumes that you have some experience with Python. I will
be referencing resources for learning the tools that I will be using in
the blog post.</p>
<h2>Making the Iris Model into a Python Package</h2>
<p>Another improvement that we can make to the example code is to make it
into a full-fledged Python package. This makes it easier to use and
install in other projects. The goal here is to treat ML models as just
another Python package, which makes it possible to leverage all of the
tools that Python has for packaging and reusing code. A good guide for
structuring python packages can be found
<a href="https://python-packaging.readthedocs.io/en/latest/#">here</a>.</p>
<p>A common pattern in ML code is that it is almost
always hard to use and deploy. This is something that teams that do
machine learning know very well, since the code written by a Data
Scientist almost always needs to be rewritten by a software engineer
before it is possible to deploy it into production systems. Luckily, we
have a lot of tools to make the transition from experimental model to
production model a smoother process. In this section I will show a few
simple steps that will make the example model from <a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">the last blog
post</a>
into an installable Python package. To accomplish this, we will add
version information to the package, add a command line interface to the
training script, add Sphinx documentation, and add a setup.py file to
the project. As an additional touch, we will automate the documentation
process for the interface of the ML model.</p>
<p>First of all we need to reorganize the code in the project a little bit:</p>
<div class="highlight"><pre><span></span><code>- project_root
    - docs (a folder, package documentation goes in here)
    - iris_model (a folder, iris package code goes in here)
        - model_files (a folder, the model files go in here)
        - __init__.py (this file is for the python package)
        - iris_predict.py (the prediction code goes here)
        - iris_train.py (the training script goes here)
    - tests (a folder, unit tests for the iris_model package go here)
    - ml_model_abc.py (the MLModel base class goes here)
    - requirements.txt
    - setup.py (the package installation script goes here)
</code></pre></div>
<p>A lot of this code is shared with the <a href="https://www.tekhnoal.com/a-simple-ml-model-base-class.html">previous blog
post</a>,
but it is reorganized here to make it possible to have an ML model that
can be installed as a Python package.</p>
<h2>Adding Package Versioning</h2>
<p>Python packages are usually versioned using <a href="https://semver.org/">semantic
versioning</a>. Software packages that
use semantic versioning must declare a public API. This is complicated
when we want to do versioning of ML models because we have two APIs: the
API for making model predictions and the API for training the model. We
can deal with this complexity by tying the different components of the
semantic version of the package to the prediction API and the training
API of the package.</p>
<p>I chose to version the prediction API of the model using the major and
minor version components of the semantic versioning standard. The
reasoning for this is that a lot of users are affected by changes in the
prediction API, but not as many users are affected by changes in the
training API. This is because ML models are usually used by many people
but trained by a few experts. The patch number of the version can be
used to version changes to the training API.</p>
<p>As an example, whenever the ML model prediction API changes in a
backward-incompatible way the major version number will go up, and
whenever it changes in a backwards-compatible way the minor version will
go up. This approach ensures that any user of the ML model package will
know how changes in the prediction API will affect them when they
install the package. A simple way to understand when to increase the
major or minor version numbers is to do so when the input and output
schemas of the model change. Lastly, any changes to the model training
API will cause the patch version number to go up.</p>
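<p>The versioning scheme described above can be summarized in a small helper function. This is a hypothetical illustration of the rules, not code from the package:</p>

```python
def bump_version(version, change):
    """Bump a semantic version according to the scheme described above:
    prediction API changes drive the major and minor numbers, training
    API changes drive the patch number."""
    major, minor, patch = (int(n) for n in version.split("."))
    if change == "prediction-breaking":      # backward-incompatible prediction API change
        return "{}.0.0".format(major + 1)
    elif change == "prediction-compatible":  # backward-compatible prediction API change
        return "{}.{}.0".format(major, minor + 1)
    elif change == "training":               # any training API change
        return "{}.{}.{}".format(major, minor, patch + 1)
    raise ValueError("unknown change type: {}".format(change))
```
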
<p>A <a href="https://packaging.python.org/guides/single-sourcing-package-version/">common approach</a>
for storing version information in a python package is to put a
"__version__" property into the __init__.py module in the root
of the package:</p>
<div class="highlight"><pre><span></span><code><span class="n">__version_info__</span> <span class="o">=</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">__version__</span> <span class="o">=</span> <span class="s2">"."</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="nb">str</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">__version_info__</span><span class="p">])</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/iris_model/__init__.py#L1-L2">here</a>.</p>
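<p>One way a setup.py can pick up this single-sourced version without importing the package is to parse the __version_info__ tuple out of the __init__.py source text. This is a sketch of the approach; the repository's setup.py may do it differently:</p>

```python
import re

def read_version(init_source):
    """Recover the version string from __init__.py source text that
    defines __version_info__ = (major, minor, patch)."""
    match = re.search(r"__version_info__\s*=\s*\(([^)]+)\)", init_source)
    if match is None:
        raise RuntimeError("__version_info__ not found")
    parts = [part.strip() for part in match.group(1).split(",")]
    return ".".join(parts)
```
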
<p>I like to think of an ML model as a software component like any other,
the only difference being that an ML model is statistically significant.
Of course, being statistically significant adds a lot of complexity, but
at the end of the day ML models are just code that can be managed just
like any other piece of code. In this section we can see how to take a
step in that direction by attaching version information to the IrisModel
package.</p>
<p>Although semantic versioning is not designed to be used for versioning
models, we can apply it here to version model code and gloss over the
more complicated aspects of ML models. For example, we can't use
semantic versioning to version model parameters since they are not part
of the codebase and don't have an API. This is a problem that I will
tackle in another blog post.</p>
<h2>Adding a CLI interface to the Training Script</h2>
<p>When building ML models, the training code is often written in Jupyter
notebooks. While there are ways to automate the training process with
notebooks, it's a lot easier to do it through the command line. To do
this we will add a simple command line interface to the Iris model
training script. We will create the interface using the
<a href="https://docs.python.org/3/library/argparse.html">argparse</a>
package and then create a function that calls the train() function when
the iris_train.py script is called from the command line.</p>
<p>To create the argparse ArgumentParser object, we create a dedicated
function (the reason for this will be explained below):</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">argument_parser</span><span class="p">():</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">(</span><span class="n">description</span><span class="o">=</span><span class="s1">'Command to train the Iris model.'</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">'-gamma'</span><span class="p">,</span> <span class="n">action</span><span class="o">=</span><span class="s2">"store"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s2">"gamma"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">float</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">'Gamma value used to train the SVM model.'</span><span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">'-c'</span><span class="p">,</span> <span class="n">action</span><span class="o">=</span><span class="s2">"store"</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s2">"c"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">float</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s1">'C value used to train the SVM model.'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">parser</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/iris_model/iris_train.py#L39-L50">here</a>.</p>
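<p>To see how the parser behaves, here is a self-contained sketch that reproduces the parser and feeds it an argument list directly. In the real script, parse_args() is called with no arguments so that it reads sys.argv:</p>

```python
import argparse

def argument_parser():
    # Same shape as the parser in iris_train.py: two optional SVM hyperparameters.
    parser = argparse.ArgumentParser(description="Command to train the Iris model.")
    parser.add_argument("-gamma", action="store", dest="gamma", type=float,
                        help="Gamma value used to train the SVM model.")
    parser.add_argument("-c", action="store", dest="c", type=float,
                        help="C value used to train the SVM model.")
    return parser

# Parse an explicit argument list instead of sys.argv for illustration.
args = argument_parser().parse_args(["-gamma", "0.001", "-c", "1.0"])
```

<p>Arguments that are not supplied come back as None, which is what the main() function below checks for when deciding how to call train().</p>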
<p>To call the train() function from the command line, I created a new
function called main(). The function gets a parser object, parses the
incoming parameters, and calls the train() function:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argument_parser</span><span class="p">()</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="k">try</span><span class="p">:</span>
<span class="k">if</span> <span class="n">results</span><span class="o">.</span><span class="n">gamma</span> <span class="ow">is</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">results</span><span class="o">.</span><span class="n">c</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">train</span><span class="p">()</span>
<span class="k">elif</span> <span class="n">results</span><span class="o">.</span><span class="n">gamma</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">results</span><span class="o">.</span><span class="n">c</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">train</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="n">results</span><span class="o">.</span><span class="n">gamma</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">results</span><span class="o">.</span><span class="n">gamma</span> <span class="ow">is</span> <span class="kc">None</span> <span class="ow">and</span> <span class="n">results</span><span class="o">.</span><span class="n">c</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">train</span><span class="p">(</span><span class="n">c</span><span class="o">=</span><span class="n">results</span><span class="o">.</span><span class="n">c</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">train</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="n">results</span><span class="o">.</span><span class="n">gamma</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="n">results</span><span class="o">.</span><span class="n">c</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="n">traceback</span><span class="o">.</span><span class="n">print_exc</span><span class="p">()</span>
<span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">EX_SOFTWARE</span><span class="p">)</span>
<span class="n">sys</span><span class="o">.</span><span class="n">exit</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">EX_OK</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/iris_model/iris_train.py#L53-L75">here</a>.</p>
<p>The reason for adding the main() function to wrap the train() function
is so that the main() function can be registered as an entry point when
the iris_model package is installed. The main() function also handles
parsing the command line arguments, calls the train() function, handles
exceptions and returns the success or error code to the operating system
when the training process is done. Another benefit of this approach is
that the train() function can still be imported into other code and called
as a function, while now also having a CLI interface.</p>
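<p>The four-way dispatch in main() can also be written more compactly by passing only the arguments that were actually supplied, letting train() fall back to its own defaults. Here is a minimal, self-contained sketch with a stand-in train() function (the real one trains the SVM model):</p>

```python
import argparse


def train(gamma=1.0, c=1.0):
    # stand-in for the real training function, which trains the SVM model
    return {"gamma": gamma, "c": c}


def argument_parser():
    parser = argparse.ArgumentParser(prog="iris_train")
    parser.add_argument('-gamma', action="store", dest="gamma", type=float)
    parser.add_argument('-c', action="store", dest="c", type=float)
    return parser


def main(argv=None):
    results = argument_parser().parse_args(argv)
    # keep only the arguments the user actually provided, so that
    # train()'s own defaults apply for anything left unset
    kwargs = {k: v for k, v in vars(results).items() if v is not None}
    return train(**kwargs)
```

<p>Calling main(["-gamma", "0.01"]) then trains with the supplied gamma and the default c, without spelling out each combination in an if/elif chain.</p>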
<h2>Adding Sphinx Documentation</h2>
<p>One of the great parts of working in the Python ecosystem is the Sphinx
package, which is used for creating documentation from source files.
There are a lot of
<a href="https://pythonhosted.org/an_example_pypi_project/sphinx.html">great</a>
<a href="https://www.sphinx-doc.org/en/1.5/tutorial.html">guides</a>
for documenting your package using Sphinx, so I won't go through it
again here. For this blog post, I followed these guides to create a
simple documentation page and <a href="https://schmidtbri.github.io/ml-model-abc-improvements/">hosted it on Github
pages</a>.
Adding documentation is a simple process and it is done by almost all
Python packages that have more than a few users. After putting together
the basic documentation, I followed a few simple extra steps to fully
automate the creation of the documentation for the model.</p>
<p>First of all, I added documentation strings to all classes and methods
in the iris_model package where it made sense. <a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/iris_model/iris_predict.py#L41-L47">Here is an
example</a>
of how I documented the predict() method using the docstring in the .py
file. The docstring is formatted so that it can be automatically built
by the sphinx autodoc extension. This extension makes it easy to extract
docstrings from python packages and modules and build documentation. A
good guide for using the autodoc extension can be found
<a href="https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html">here</a>.</p>
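<p>As an illustration (not the exact docstring from the repository), a predict() docstring formatted for the autodoc extension might look like this:</p>

```python
class IrisModel:
    def predict(self, data):
        """Make a prediction with the Iris model.

        :param data: dictionary with the four measurements of a flower
        :type data: dict
        :returns: dictionary with the predicted species
        :rtype: dict
        """
```

<p>The autodoc extension reads the :param:, :returns:, and :rtype: fields and renders them as structured documentation for the method.</p>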
<p>However, one problem with using the MLModel base class for writing code
is that the predict() method of a class that inherits from MLModel only
accepts a single parameter called "data" as input. This makes it hard to
document the input schema of the model through autodoc since the data
structure accepted by the model for prediction can't be easily described
in the docstring. The same problem happens when we try to document the
return type of the predict() method. Luckily, we can automatically
extract the JSON Schema representation of the input and output schemas
of the model. In order to leverage this, I used the
<a href="https://sphinx-jsonschema.readthedocs.io/en/latest/">sphinx-jsonschema</a>
extension to automatically add the schema information to a documentation
page. The process for adding it is simple, I just had to add this code
to an .rst file:</p>
<div class="highlight"><pre><span></span><code><span class="p">..</span> <span class="ow">jsonschema</span><span class="p">::</span> ../build/input_schema.json
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/docs/source/iris_predict.rst">here</a>.</p>
<p>The only problem is that the input and output JSON schema strings are
not saved to disk where the jsonschema extension can access them, but are
only available from an instance of the IrisModel class. To fix this, I added <a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/00d5e558f9af7571d824d597107412ed86681e8b/docs/source/conf.py#L29-L39">this
code</a>
to the conf.py file that creates the Sphinx documentation. The code
instantiates an IrisModel object, extracts the JSON Schema strings, and
saves them to a location that can then be read by the Sphinx documentation
generator. The documentation that is generated can be seen
<a href="https://schmidtbri.github.io/ml-model-abc-improvements/iris_predict.html#iris-model-input-schema">here</a>
and
<a href="https://schmidtbri.github.io/ml-model-abc-improvements//iris_predict.html#iris-model-output-schema">here</a>.</p>
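<p>A simplified version of that conf.py logic looks like this. The IrisModel class here is a stand-in whose schemas are already plain JSON Schema dictionaries (the real model exposes Schema objects that must first be converted), and the paths are illustrative:</p>

```python
import json
import os
import tempfile


class IrisModel:
    # stand-in model exposing schemas as JSON Schema dictionaries
    input_schema = {"type": "object",
                    "properties": {"sepal_length": {"type": "number"}}}
    output_schema = {"type": "object",
                     "properties": {"species": {"type": "string"}}}


def write_schemas(model, build_dir):
    # save the schemas to disk where the sphinx-jsonschema
    # directive can read them during the documentation build
    os.makedirs(build_dir, exist_ok=True)
    for name, schema_dict in [("input_schema", model.input_schema),
                              ("output_schema", model.output_schema)]:
        with open(os.path.join(build_dir, name + ".json"), "w") as f:
            json.dump(schema_dict, f)


build_dir = os.path.join(tempfile.mkdtemp(), "build")
write_schemas(IrisModel(), build_dir)
```

<p>Running this from conf.py before the build starts guarantees the schema files always reflect the current model code.</p>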
<p>Since we are using the argparse library for creating the CLI interface
for the training script, we can use the
<a href="https://sphinx-argparse.readthedocs.io/en/stable/index.html">sphinxarg.ext</a>
Sphinx extension to automatically generate the documentation. This was
as easy as adding this code to the .rst file that describes the training
script:</p>
<div class="highlight"><pre><span></span><code><span class="p">..</span> <span class="ow">argparse</span><span class="p">::</span>
<span class="nc">:module:</span> iris_model.iris_train
<span class="nc">:func:</span> argument_parser
<span class="nc">:prog:</span> iris_train
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/docs/source/iris_train.rst">here</a>.</p>
<p>The sphinxarg.ext extension then imports the iris_train module, calls
the argument_parser() function to get an ArgumentParser instance, and
uses that object to generate the documentation.
The results can be seen in the documentation
<a href="https://schmidtbri.github.io/ml-model-abc-improvements//iris_train.html#iris-training-code-cli-documentation">here</a>.</p>
<p>This section shows how it is possible to write the code of an ML model
in such a way that the documentation can be created automatically.
Exposing the input and output schemas of the model as JSON schema
strings makes it possible for a Data Scientist to communicate the
requirements of the model clearly to the end user of the model. At the
same time, by exposing the hyperparameters of the training script as
command line options, it becomes possible to automatically document the
training process. Writing the ML model code in this way ensures that any
changes to the code are documented automatically whenever the
documentation is regenerated.</p>
<h2>Adding a setup.py File</h2>
<p>Now that we have the ML model code structured as a Python package,
versioned, and documented, we'll add a
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/setup.py">setup.py</a>
file to the project folder. The setup.py file is used by the setuptools
package to install python packages and makes the ML model easily
installable in a virtual environment. A great guide for writing the
setup.py file for your package can be found
<a href="https://github.com/kennethreitz/setup.py">here</a>.</p>
<p>In the iris_model package setup.py file, most of the fields are very
easy to understand and they are better explained in other guides. In
this blog post, I'll focus on the sections of the setup.py file that had
to be specifically modified for the ML model package. First of all, we
want to point at the folder that contains the iris_model package; we
can do this with this line in the setup.py file:</p>
<div class="highlight"><pre><span></span><code><span class="n">packages</span><span class="o">=</span><span class="p">[</span><span class="s2">"iris_model"</span><span class="p">],</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/setup.py#L20">here</a>.</p>
<p>Next, we need to make sure that the ml_model_abc.py Python module is
installed along with the iris_model package. In the future, it would be
better to take this code and put it into another Python package that the
iris_model package would depend on, but for now we just need this line
of code:</p>
<div class="highlight"><pre><span></span><code><span class="n">py_modules</span><span class="o">=</span><span class="p">[</span><span class="s2">"ml_model_abc"</span><span class="p">],</span>
</code></pre></div>
<p>The code above can be found <a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/setup.py#L19">here</a>.</p>
<p>Next, we take care of the model parameters. The ML model requires that
the model parameters be available for loading at prediction time; the
setup.py file can handle this with these lines of code:</p>
<div class="highlight"><pre><span></span><code><span class="n">package_data</span><span class="o">=</span><span class="p">{</span><span class="s1">'iris_model'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'model_files/svc_model.pickle'</span><span class="p">]},</span>
<span class="n">include_package_data</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/setup.py#L25-L28">here.</a></p>
<p>This ensures that when the package is installed into an environment, the
model parameters will be copied along with the model_files folder.</p>
<p>Next, we have to register the iris_train.py script as an entry point.
This makes it possible to run the training script from the command line
inside of an environment where the iris_model package is installed:</p>
<div class="highlight"><pre><span></span><code><span class="n">entry_points</span><span class="o">=</span><span class="p">{</span> <span class="s1">'console_scripts'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'iris_train=iris_model.iris_train:main'</span><span class="p">,]</span>
</code></pre></div>
<p>The code above can be found <a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/setup.py#L29-L32">here</a>.</p>
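<p>Putting these pieces together, the relevant parts of the setup.py file look roughly like this. This is a sketch, not the exact file; standard fields such as name, version, and dependencies are elided:</p>

```python
from setuptools import setup

setup(
    # ... name, version, and other standard fields ...
    # install the shared base class module alongside the package
    py_modules=["ml_model_abc"],
    packages=["iris_model"],
    # ship the pickled model parameters with the package
    package_data={'iris_model': ['model_files/svc_model.pickle']},
    include_package_data=True,
    # register the training script as a CLI entry point
    entry_points={'console_scripts': ['iris_train=iris_model.iris_train:main']},
)
```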
<p>Once we have all of this in the setup.py file, we can try a pip
install in a new virtual environment. We will install the package
directly from the git repository to keep things simple. The shell
commands to do this are:</p>
<div class="highlight"><pre><span></span><code>mkdir example
<span class="nb">cd</span> example
<span class="c1"># creating a virtual environment</span>
python3 -m venv venv
<span class="c1">#activating the virtual environment, on a mac computer</span>
<span class="nb">source</span> venv/bin/activate
<span class="c1"># installing the iris_model package from the github repository</span>
pip install git+https://github.com/schmidtbri/ml-model-abc-improvements#egg<span class="o">=</span>iris_model
</code></pre></div>
<p>Now we can test the installation by starting an interactive Python
interpreter and executing this Python code:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">iris</span>\<span class="n">_model</span><span class="o">.</span><span class="n">iris</span>\<span class="n">_predict</span> <span class="kn">import</span> <span class="nn">IrisModel</span>
<span class="o">>>></span> <span class="n">model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">model</span>
<span class="o"><</span><span class="n">iris_model</span><span class="o">.</span><span class="n">iris_predict</span><span class="o">.</span><span class="n">IrisModel</span> <span class="nb">object</span> <span class="n">at</span> <span class="mh">0x105d1e940</span><span class="o">></span>
<span class="o">>>></span> <span class="n">model</span><span class="o">.</span><span class="n">input_schema</span>
<span class="n">Schema</span><span class="p">({</span><span class="s1">'sepal_length'</span><span class="p">:</span> <span class="o"><</span><span class="k">class</span> <span class="err">'</span><span class="nc">float</span><span class="s1">'>, </span>
<span class="s1">'sepal_width'</span><span class="p">:</span> <span class="o"><</span><span class="k">class</span> <span class="err">'</span><span class="nc">float</span><span class="s1">'>, </span>
<span class="s1">'petal_length'</span><span class="p">:</span> <span class="o"><</span><span class="k">class</span> <span class="err">'</span><span class="nc">float</span><span class="s1">'>, </span>
<span class="s1">'petal_width'</span><span class="p">:</span> <span class="o"><</span><span class="k">class</span> <span class="err">'</span><span class="nc">float</span><span class="s1">'>})</span>
<span class="o">>>></span> <span class="n">model</span><span class="o">.</span><span class="n">output_schema</span>
<span class="n">Schema</span><span class="p">({</span><span class="s1">'species'</span><span class="p">:</span> <span class="o"><</span><span class="k">class</span> <span class="err">'</span><span class="nc">str</span><span class="s1">'>})</span>
</code></pre></div>
<p>Next, we can test the CLI interface for the training code by executing
this command in a shell:</p>
<div class="highlight"><pre><span></span><code>iris_train -c<span class="o">=</span><span class="m">10</span>.0 -gamma<span class="o">=</span><span class="m">0</span>.01
</code></pre></div>
<p>This section showed how to install the iris_model Python package using
common Python packaging tools, and how to use and retrain the model in
different Python environments.</p>
<h2>Model Metadata in the MLModel Base Class</h2>
<p>In the <a href="https://towardsdatascience.com/a-simple-ml-model-base-class-ab40e2febf13">previous blog
post</a>
we showed an MLModel base class with two required abstract properties:
"input_schema" and "output_schema". These two properties were required
to be provided by any class that derived from the MLModel base class and
were used to publish schema metadata about the input and output data of
the model. In order to keep things simple, I chose not to expose more
metadata through class properties; however, there are several other
pieces of metadata that would be useful to expose to the outside world.
For example:</p>
<ul>
<li>display_name, a property that returns a display name for the model</li>
<li>qualified_name, a property that returns the qualified name of the model, a qualified name is an unambiguous identifier for the model</li>
<li>description, a property that returns a description of the model</li>
<li>major_version, a property that returns the model's major version as a string</li>
<li>minor_version, a property that returns the model's minor version as a string</li>
</ul>
<p>These properties are exposed as object properties and can be accessed
the same way as the input_schema and output_schema properties. The new
code for the MLModel base class now looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MLModel</span><span class="p">(</span><span class="n">ABC</span><span class="p">):</span>
<span class="nd">@property</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">display_name</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@property</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">qualified_name</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@property</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">description</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@property</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">major_version</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@property</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">minor_version</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@property</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@property</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">validate</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div>
<p>The code above can be found <a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/ml_model_abc.py#L4-L74">here</a>.</p>
<p>The new MLModel base class works just like the previous
implementation, but now also requires the properties described above to
be published as instance properties.</p>
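<p>A useful side effect of declaring the properties as abstract is that Python refuses to instantiate a derived class that forgets one of them. A minimal sketch, using a trimmed-down version of the base class:</p>

```python
from abc import ABC, abstractmethod


class MLModel(ABC):
    @property
    @abstractmethod
    def display_name(self):
        raise NotImplementedError()

    @abstractmethod
    def predict(self, data):
        raise NotImplementedError()


class CompleteModel(MLModel):
    # provides every abstract member, so it can be instantiated
    display_name = "Complete Model"

    def predict(self, data):
        return data


class IncompleteModel(MLModel):
    # forgets display_name, so instantiating it raises a TypeError
    def predict(self, data):
        return data
```

<p>CompleteModel() works, while IncompleteModel() raises a TypeError naming the missing abstract property, which catches incomplete model implementations at instantiation time rather than at first use.</p>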
<p>This metadata is added in the __init__.py file of the iris_model
package, since it is applicable to the whole package:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># a display name for the model</span>
<span class="n">__display_name__</span> <span class="o">=</span> <span class="s2">"Iris Model"</span>
<span class="c1"># returning the package name as the qualified name for the model</span>
<span class="n">__qualified_name__</span> <span class="o">=</span> <span class="vm">__name__</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"."</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># a description of the model</span>
<span class="n">__description__</span> <span class="o">=</span> <span class="s2">"A machine learning model for predicting the species of a flower based on its measurements."</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/iris_model/__init__.py#L4-L11">here</a>.</p>
<p>In order to show how a class that derives from the MLModel base class
can publish these properties, we can modify the Iris model example used
in the <a href="https://towardsdatascience.com/a-simple-ml-model-base-class-ab40e2febf13">previous blog
post</a>.
The Iris model class now looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">ml_model_abc</span> <span class="kn">import</span> <span class="n">MLModel</span>
<span class="kn">from</span> <span class="nn">iris_model</span> <span class="kn">import</span> <span class="n">__version_info__</span><span class="p">,</span> <span class="n">__display_name__</span><span class="p">,</span> <span class="n">__qualified_name__</span><span class="p">,</span> <span class="n">__description__</span>
<span class="k">class</span> <span class="nc">IrisModel</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="c1"># accessing the package metadata</span>
<span class="n">display_name</span> <span class="o">=</span> <span class="n">__display_name__</span>
<span class="n">qualified_name</span> <span class="o">=</span> <span class="n">__qualified_name__</span>
<span class="n">description</span> <span class="o">=</span> <span class="n">__description__</span>
<span class="n">major_version</span> <span class="o">=</span> <span class="n">__version_info__</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">minor_version</span> <span class="o">=</span> <span class="n">__version_info__</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="c1"># stating the input schema of the model as a Schema object</span>
<span class="n">input_schema</span> <span class="o">=</span> <span class="n">Schema</span><span class="p">({</span><span class="s1">'sepal_length'</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
<span class="s1">'sepal_width'</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
<span class="s1">'petal_length'</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
<span class="s1">'petal_width'</span><span class="p">:</span> <span class="nb">float</span><span class="p">})</span>
<span class="c1"># stating the output schema of the model as a Schema object</span>
<span class="n">output_schema</span> <span class="o">=</span> <span class="n">Schema</span><span class="p">({</span><span class="s1">'species'</span><span class="p">:</span> <span class="nb">str</span><span class="p">})</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">realpath</span><span class="p">(</span><span class="vm">__file__</span><span class="p">))</span>
<span class="n">file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">dir_path</span><span class="p">,</span>
<span class="s2">"model_files"</span><span class="p">,</span>
<span class="s2">"svc_model.pickle"</span><span class="p">),</span> <span class="s1">'rb'</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
<span class="n">file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="n">data</span><span class="p">[</span><span class="s2">"sepal_length"</span><span class="p">],</span>
<span class="n">data</span><span class="p">[</span><span class="s2">"sepal_width"</span><span class="p">],</span>
<span class="n">data</span><span class="p">[</span><span class="s2">"petal_length"</span><span class="p">],</span>
<span class="n">data</span><span class="p">[</span><span class="s2">"petal_width"</span><span class="p">]])</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">targets</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'setosa'</span><span class="p">,</span> <span class="s1">'versicolor'</span><span class="p">,</span> <span class="s1">'virginica'</span><span class="p">]</span>
<span class="n">species</span> <span class="o">=</span> <span class="n">targets</span><span class="p">[</span><span class="n">y_hat</span><span class="p">]</span>
<span class="k">return</span> <span class="p">{</span><span class="s2">"species"</span><span class="p">:</span> <span class="n">species</span><span class="p">}</span>
</code></pre></div>
<p>The code above can be found
<a href="https://github.com/schmidtbri/ml-model-abc-improvements/blob/master/iris_model/iris_predict.py#L1-L65">here</a>.</p>
<p>The display name, qualified name, and description properties are set as
string class properties in the IrisModel class, and they are accessed
from the __init__ module. The major and minor version properties are
extracted from the package's __version_info__ tuple.</p>
<p>There can be some situations in which a single Python package will hold
more than one MLModel derived class. In that case the display name,
qualified name, and description metadata would be set individually
within each MLModel-derived class itself instead of being read from the
package-wide metadata stored in the __init__ module.</p>
<p>The class properties are now easily accessible from the model object, to
show this we can instantiate the object and access the properties:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">iris_model.iris_predict</span> <span class="kn">import</span> <span class="n">IrisModel</span>
<span class="o">>>></span> <span class="n">iris_model</span> <span class="o">=</span> <span class="n">IrisModel</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">iris_model</span><span class="o">.</span><span class="n">qualified_name</span>
<span class="s1">'iris_model'</span>
<span class="o">>>></span> <span class="n">iris_model</span><span class="o">.</span><span class="n">display_name</span>
<span class="s1">'Iris Model'</span>
</code></pre></div>
<p>These new metadata properties make it easier to introspect information
about a model, which in turn makes it possible to manage many MLModel
objects in the same Python process.</p>
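<p>For example, several models can be collected into a simple registry keyed by qualified name. The two model classes here are hypothetical stand-ins that only carry metadata:</p>

```python
class ModelA:
    # hypothetical model exposing the metadata properties
    qualified_name = "model_a"
    display_name = "Model A"


class ModelB:
    qualified_name = "model_b"
    display_name = "Model B"


# build a registry keyed by each model's unambiguous qualified name
registry = {model.qualified_name: model for model in (ModelA(), ModelB())}


def display_name_of(qualified_name):
    # look up a model's metadata without knowing its concrete class
    return registry[qualified_name].display_name
```

<p>Because the qualified name is unambiguous, it makes a safe dictionary key for routing prediction requests to the right model object.</p>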
<h2>Future Improvements</h2>
<p>In this blog post we showed how to do versioning of an ML model using
standard conventions of python packages, however the model parameters of
the Iris model also need to be versioned over time and metadata about
them also needs to be kept. This is a problem that I will tackle in a
future blog post.</p>
<p>Another problem that we did not tackle in this blog post is how to have
a more complex API for ML models. For example, the Iris model is only
allowed to have one predict() method, this makes it impossible to do
more complex operations with the Iris model. In a future blog post I
will show how to modify the ML model base class to allow this.</p>A Simple ML Model Base Class2019-04-02T09:20:00-05:002019-04-02T09:20:00-05:00Brian Schmidttag:www.tekhnoal.com,2019-04-02:/a-simple-ml-model-base-class.html<p>When creating software it is often useful to write abstract classes to help define different interfaces that classes can implement and inherit from. By creating a base class, a standard can be defined that simplifies the design of the whole system and clarifies every decision moving forward.</p><p>When creating software it is often useful to write abstract classes to
help define different interfaces that classes can implement and inherit
from. By creating a base class, a standard can be defined that
simplifies the design of the whole system and clarifies every decision
moving forward.</p>
<p>The integration of ML models with other software components is often
complicated and can benefit greatly from using an Object Oriented
approach. Recently, I've been seeing this problem solved in many
different ways, so I decided to try to implement my own solution.</p>
<p>In this post I will describe a simple implementation of a base class for
Machine Learning Models. This post will focus on making predictions with
ML models, and integrating ML models with other software components.
Training code will not be shown to keep the code simple. The code in
this post will be written in Python, if you aren't familiar with
abstract base classes in Python,
<a href="https://www.python-course.eu/python3_abstract_classes.php">here</a>
is a good place to learn.</p>
<h2>Scikit-learn's Approach to Base Classes</h2>
<p>The most well known ML software package in python is scikit-learn, and
it provides a set of abstract base classes in the <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/base.py">base.py
module</a>.
The scikit-learn API is a great place to learn about machine learning
software engineering in general, but in this case we want to focus on
its approach to base classes for making predictions with ML models.</p>
<p>Scikit-learn defines an abstract base class called Estimator, which is
meant to be the base class for any class that is able to learn from a
data set; a class that derives from Estimator must implement a "fit"
method. Scikit-learn also defines a Predictor base class, meant to be
the base class for any class that is able to infer from learned
parameters when presented with new data; a class that derives from
Predictor must implement a "predict" method. These two base classes are
some of the most commonly used abstractions in the Scikit-learn package.
By defining these base classes, the Scikit-learn project provides a
strong foundation for coding ML algorithms.</p>
<p>These two interfaces are broad enough to take us far, but what about
serialization and deserialization? ML models need to be loaded from
storage before they can be used. On this front scikit-learn is mostly
silent, and no standard interface for hiding the details of model
serialization and deserialization is provided. Also, what if we need to
publish schema information about the input and output data that a model
needs for scoring? Scikit-learn does not provide a way to do this
either, since it uses numpy arrays for input and output.</p>
<p>Because of these factors, using Scikit-learn's API is not necessarily
the best way to integrate ML models with other software components.
Integrating a Scikit-learn model with other software components by using
the Scikit-learn API exposes internal details about how the model is
serialized and how information is passed into the model. For example, if
a Data Scientist hands over a scikit-learn model in a pickled file along
with some code, a software engineer would have to be familiar with how
to deserialize the model object and how to structure a Numpy array in
such a way that it will be accepted by the model's predict() method. The
best way to solve this problem is to hide these implementation details
behind an interface.</p>
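<p>To make the handoff problem concrete, here is a small sketch. It uses a plain Python class as a stand-in for a pickled scikit-learn model, so the details are hypothetical, but it shows what the consumer of such an artifact has to already know: the serialization format, the object's type, and the undocumented column order of the input.</p>

```python
import io
import pickle


# stand-in for a scikit-learn model: from the pickled artifact alone, the
# consumer cannot tell what type it is or how the input must be laid out
class OpaqueModel:
    def predict(self, X):
        # X must be a 2-D sequence with columns in a fixed, undocumented
        # order; here column 2 happens to be the one that matters
        return [row[2] * 0.5 for row in X]


# the Data Scientist pickles the model...
buffer = io.BytesIO()
pickle.dump(OpaqueModel(), buffer)

# ...and the engineer must know to unpickle it and how to shape the input
buffer.seek(0)
model = pickle.load(buffer)
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))  # prints [0.7]
```

Hiding both steps behind an interface removes this implicit knowledge from the handoff.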
<p>In summary, to simplify the use of ML models within production systems,
it would be useful to solve a couple of issues:</p>
<ul>
<li>
<p>How to consistently and transparently send data to the model</p>
</li>
<li>
<p>How to load serialized model assets when instantiating a model</p>
</li>
<li>
<p>How to publish input and output data schema information</p>
</li>
</ul>
<h2>Some Solutions</h2>
<p>Over the last few years, a few big tech companies have been developing
proprietary in-house machine learning infrastructure and software. Some
of these companies sell access to their ML platform and others have
published details about their approach to ML infrastructure. Also, there
have been a few open source projects that seek to simplify the
deployment of ML models to production systems. In this section I will
describe some solutions that have emerged recently for the problems
described above.</p>
<h3>AWS Sagemaker</h3>
<p>AWS Sagemaker is a platform for training and deploying ML models within
the AWS ecosystem. The platform has several ready-made ML algorithms
that can be leveraged without writing a lot of code. However, a way to
deploy custom ML code to the platform is provided. To deploy a
prediction endpoint on top of the Sagemaker service, a Python Flask
application with "/ping" and "/invocations" endpoints must be created
and deployed within a Docker container.</p>
<p>In the Sagemaker example published
<a href="https://github.com/awslabs/amazon-sagemaker-examples/blob/35941a33425b3a441275abc7243eb1f959a584e4/advanced_functionality/scikit_bring_your_own/container/decision_trees/predictor.py#L24-L43">here</a>,
we can see the recommended way to run the model prediction code within
the Flask application. In the example, the scikit-learn model object is
deserialized and saved as a class property, and the model is then
accessed by the "predict" method. This implementation does not provide a
way to publish schema metadata about the model and does not enforce any
specific implementation on the model code. The AWS Sagemaker library
does not provide a base class to help write the model code.</p>
<h3>Facebook</h3>
<p>Facebook published a blog post about their ML systems
<a href="https://code.fb.com/ml-applications/introducing-fblearner-flow-facebook-s-ai-backbone/">here</a>.
The FBLearner Flow system is made up of workflows and operators. A
workflow is a single unit of work with a specific set of inputs and
outputs; each workflow is made up of operators, which perform simple
operations on data. The blog post shows how to train a Decision Tree model on the
iris data set. The blog post does not provide many implementation
details about their internal Python packages. An interesting part of the
approach taken is the fact that schema metadata is attached to every
workflow created, ensuring type safety at runtime. There are no details
about loading and storing model assets. Facebook's FBFlow Python package
does not use base classes that developers can inherit from to write
code, but uses function annotations to attach metadata to ML model code.</p>
<h3>Uber</h3>
<p>Uber published a blog post about their approach to custom ML models
<a href="https://eng.uber.com/michelangelo-pyml/">here</a>. Uber's
PyML package is used to deploy ML models that are not natively supported
by Uber's Michelangelo ML platform, which is described
<a href="https://eng.uber.com/michelangelo/">here</a>. The PyML
package does not specify how to write model training code, but does
provide a base class for writing ML model prediction code. The base
class is called DataFrameModel. The interface is very simple: it has
only two methods, the __init__() method and the predict() method.
The model assets are required to be deserialized in the class
constructor and all prediction code is in the predict method of the
class.</p>
<p>The DataFrameModel interface requires the use of Pandas dataframes or
tensors when giving data to the model for prediction. This is a design
decision that can backfire because there is no way to tell the user of the
model how to structure the input data to the model. However, the use of
the __init__() method for loading model assets helps to hide the
complexity of the model from the user. Also, by using base classes that
must be inherited from in order to deploy code to the production
systems, certain requirements can be more easily checked.</p>
<h3>Seldon Core</h3>
<p>Seldon Core is an open source project for hosting ML models. It supports
custom Python models, as described
<a href="https://docs.seldon.io/projects/seldon-core/en/latest/python/python_component.html">here</a>.
The model code is required to be in a Python class with an
__init__() method and a predict() method. It follows Uber's design
closely but does not use an abstract base class to enforce the
interface. Another difference is that Seldon allows the model class to
return results in several different ways, and not just in Pandas
dataframes. Seldon also allows the model class to return column name
metadata for the model inputs, but no type metadata.</p>
<h1>A Simple ML Model Base Class</h1>
<p>NOTE: All of the code shown in this section can be found in <a href="https://github.com/schmidtbri/simple-ml-model-abc">this
Github
repository</a>.</p>
<p>In this section I will present a simple abstract base class that
combines the strengths of the approaches shown above into one abstract
base class for ML models. I will also explain the reasoning behind the
design.</p>
<p>Here is the code for the abstract base class:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">abc</span> <span class="kn">import</span> <span class="n">ABC</span><span class="p">,</span> <span class="n">abstractmethod</span>

<span class="k">class</span> <span class="nc">MLModel</span><span class="p">(</span><span class="n">ABC</span><span class="p">):</span>
<span class="sd">""" An abstract base class for ML model prediction code """</span>
<span class="nd">@property</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">input_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@property</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">output_schema</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">raise</span> <span class="ne">NotImplementedError</span><span class="p">()</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">validate</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
</code></pre></div>
<p>The code looks very similar to Uber's and Seldon Core's approaches. The
model file deserialization code is still expected to be implemented in
the __init__() method, and the prediction code is still expected to
be in the predict() method. Any model that needs to be used by other
software packages is expected to derive from the MLModel abstract base
class and implement these two methods.</p>
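<p>One benefit of using an abstract base class is that the contract is enforced at instantiation time. The sketch below repeats the base class so it runs standalone, and shows that a subclass which forgets to implement predict() cannot even be constructed:</p>

```python
from abc import ABC, abstractmethod


class MLModel(ABC):
    """ An abstract base class for ML model prediction code """
    @property
    @abstractmethod
    def input_schema(self):
        raise NotImplementedError()

    @property
    @abstractmethod
    def output_schema(self):
        raise NotImplementedError()

    @abstractmethod
    def __init__(self):
        raise NotImplementedError()

    @abstractmethod
    def predict(self, data):
        self.input_schema.validate(data)


# a subclass that forgets to implement predict() cannot be instantiated,
# because predict() is still marked abstract
class IncompleteModel(MLModel):
    input_schema = None
    output_schema = None

    def __init__(self):
        pass


try:
    IncompleteModel()
except TypeError as e:
    print("rejected:", e)
```

This kind of early failure is much easier to debug than a missing method surfacing at prediction time in production.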
<p>However, there are some differences. The input to the predict method is
not expected to be of any particular type, it can be any Python type as
long as the input data is packaged into a single input parameter called
"data". This is different from Seldon Core's and Uber's approaches, which
require Numpy arrays or Pandas dataframes.</p>
<p>Another difference is that the base class shown above requires the model
creator to attach schema metadata to their implementation. The base
class has two extra properties that are not present in the Seldon Core
and Uber implementations: the "input_schema" and "output_schema"
properties are meant to publish the schema of the data that the model
will accept in the predict method and the schema of the data that the
model will output from the predict method. To do this, I will use the
python schema package, but there are many options for writing and
enforcing schemas, for example the marshmallow and schematics
python packages.</p>
<p>We also need to define a way for a model creator to raise exceptions.
For this we can write a simple custom Exception:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">MLModelException</span><span class="p">(</span><span class="ne">Exception</span><span class="p">):</span>
<span class="sd">""" Exception type for use within MLModel derived classes """</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="ne">Exception</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
</code></pre></div>
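<p>A sketch of how a model implementation might use this exception type. The predict function below is hypothetical, but it shows the wrapping pattern: internal errors are re-raised as MLModelException, so callers only ever need to handle one exception type regardless of what went wrong inside the model.</p>

```python
class MLModelException(Exception):
    """ Exception type for use within MLModel derived classes """


# a hypothetical predict function showing the wrapping pattern
def predict(data):
    try:
        # any internal failure (missing key, bad value, library error)
        # surfaces here
        return {"score": 1.0 / data["value"]}
    except (KeyError, ZeroDivisionError) as e:
        # re-raise as the model's own exception type, keeping the cause
        raise MLModelException("prediction failed: {}".format(e)) from e


print(predict({"value": 4.0}))  # prints {'score': 0.25}
```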
<h1>Using the Base Class</h1>
<p>This blog post deals purely with the ML code that will be used for
predicting in production and not with the model training code. However,
we still need to have a model to work with. Here's a simple scikit-learn
model training script:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pickle</span>

<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span><span class="p">,</span> <span class="n">svm</span>

<span class="n">iris</span> <span class="o">=</span> <span class="n">datasets</span><span class="o">.</span><span class="n">load_iris</span><span class="p">()</span>
<span class="n">svm_model</span> <span class="o">=</span> <span class="n">svm</span><span class="o">.</span><span class="n">SVC</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="mf">100.0</span><span class="p">)</span>
<span class="n">svm_model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">iris</span><span class="o">.</span><span class="n">data</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">iris</span><span class="o">.</span><span class="n">target</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">realpath</span><span class="p">(</span><span class="vm">__file__</span><span class="p">))</span>
<span class="n">file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">dir_path</span><span class="p">,</span> <span class="s2">"model_files"</span><span class="p">,</span> <span class="s2">"svc_model.pickle"</span><span class="p">),</span> <span class="s1">'wb'</span><span class="p">)</span>
<span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">svm_model</span><span class="p">,</span> <span class="n">file</span><span class="p">)</span>
<span class="n">file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div>
<p>Now that we have a trained model, we can write the class that will inherit from MLModel and make predictions:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">pickle</span>

<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">array</span>
<span class="kn">from</span> <span class="nn">schema</span> <span class="kn">import</span> <span class="n">Schema</span><span class="p">,</span> <span class="n">Or</span>

<span class="k">class</span> <span class="nc">IrisSVCModel</span><span class="p">(</span><span class="n">MLModel</span><span class="p">):</span>
<span class="sd">""" A demonstration of how to use the MLModel base class """</span>
<span class="n">input_schema</span> <span class="o">=</span> <span class="n">Schema</span><span class="p">({</span><span class="s1">'sepal_length'</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
<span class="s1">'sepal_width'</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
<span class="s1">'petal_length'</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
<span class="s1">'petal_width'</span><span class="p">:</span> <span class="nb">float</span><span class="p">})</span>
<span class="c1"># the output of the model will be one of three strings</span>
<span class="n">output_schema</span> <span class="o">=</span> <span class="n">Schema</span><span class="p">({</span><span class="s1">'species'</span><span class="p">:</span> <span class="n">Or</span><span class="p">(</span><span class="s2">"setosa"</span><span class="p">,</span>
<span class="s2">"versicolor"</span><span class="p">,</span>
<span class="s2">"virginica"</span><span class="p">)})</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">dir_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">realpath</span><span class="p">(</span><span class="vm">__file__</span><span class="p">))</span>
<span class="n">file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">dir_path</span><span class="p">,</span> <span class="s2">"model_files"</span><span class="p">,</span> <span class="s2">"svc_model.pickle"</span><span class="p">),</span> <span class="s1">'rb'</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">file</span><span class="p">)</span>
<span class="n">file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="c1"># calling the super method to validate against the</span>
<span class="c1"># input_schema</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># converting the incoming dictionary into a numpy array</span>
<span class="c1"># that can be accepted by the scikit-learn model</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="n">data</span><span class="p">[</span><span class="s2">"sepal_length"</span><span class="p">],</span>
<span class="n">data</span><span class="p">[</span><span class="s2">"sepal_width"</span><span class="p">],</span>
<span class="n">data</span><span class="p">[</span><span class="s2">"petal_length"</span><span class="p">],</span>
<span class="n">data</span><span class="p">[</span><span class="s2">"petal_width"</span><span class="p">]])</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># making the prediction</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_svm_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
<span class="c1"># converting the prediction into a string that will match</span>
<span class="c1"># the output schema of the model, this list will map the</span>
<span class="c1"># output of the scikit-learn model to the string expected by</span>
<span class="c1"># the output schema</span>
<span class="n">targets</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'setosa'</span><span class="p">,</span> <span class="s1">'versicolor'</span><span class="p">,</span> <span class="s1">'virginica'</span><span class="p">]</span>
<span class="n">species</span> <span class="o">=</span> <span class="n">targets</span><span class="p">[</span><span class="n">y_hat</span><span class="p">]</span>
<span class="k">return</span> <span class="p">{</span><span class="s2">"species"</span><span class="p">:</span> <span class="n">species</span><span class="p">}</span>
</code></pre></div>
<p>One useful thing about using the schema package for building the input
and output schemas of the model is that it supports exporting the schema
in the JSON schema format:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="n">model</span> <span class="o">=</span> <span class="n">IrisSVCModel</span><span class="p">()</span>
<span class="o">>>></span> <span class="nb">print</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">input_schema</span><span class="o">.</span><span class="n">json_schema</span><span class="p">(</span><span class="s2">"https://example.com/my-schema.json"</span><span class="p">)))</span>
<span class="p">{</span><span class="s2">"type"</span><span class="p">:</span> <span class="s2">"object"</span><span class="p">,</span> <span class="s2">"properties"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"sepal_length"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"type"</span><span class="p">:</span> <span class="s2">"number"</span><span class="p">},</span> <span class="s2">"sepal_width"</span><span class="p">:</span> <span class="p">{</span><span class="s2">"type"</span><span class="p">:</span> <span class="s2">"number"</span><span class="p">},</span>
<span class="o">...</span>
<span class="o">...</span>
</code></pre></div>
<h2>Conclusion</h2>
<p>In this post I showed a few different approaches to deploying ML model
code to production systems. I also showed an implementation of a Python
base class that brings together the best features of the different
approaches discussed. In conclusion I will discuss some of the benefits
of the approach I sketched out above.</p>
<p>The MLModel base class has very few dependencies. It does not require
the model creator to use Pandas, numpy, or any other Python package to
transfer data to the model. This also means that it does not force the
user of the model to know any internal implementation details about the
model. On the other hand, Uber's solution requires that the user of the
model know how to work with Pandas dataframes. However, if the model
creator still wishes to accept numpy arrays or Pandas dataframes to
their model, the MLModel base class shown above still allows this.</p>
<p>By using python dictionaries for model input and output, the model is
easier to use. There is no need to understand how to use numpy arrays or
Pandas dataframes, remember the order of the columns, or know how the
output columns are encoded in order to use the model.</p>
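<p>To illustrate, here is a hypothetical rule-based model that exposes the same dictionary-in, dictionary-out interface; the thresholds are made up for illustration, but the caller needs no knowledge of arrays, column ordering, or label encoding:</p>

```python
# a hypothetical rule-based model exposing a dictionary interface;
# callers never see numpy arrays, column order, or integer labels
class RuleBasedIrisModel:
    def predict(self, data):
        # illustrative thresholds on petal length only
        if data["petal_length"] < 2.5:
            species = "setosa"
        elif data["petal_length"] < 5.0:
            species = "versicolor"
        else:
            species = "virginica"
        return {"species": species}


model = RuleBasedIrisModel()
print(model.predict({"sepal_length": 5.1, "sepal_width": 3.5,
                     "petal_length": 1.4, "petal_width": 0.2}))
# prints {'species': 'setosa'}
```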
<p>By stating the input and output schemas of a model programmatically, it
is possible to compare different models' schemas through automated
tools. This can be useful when tracking model changes across many
different versions of a model. Facebook's approach allows schema
metadata to be attached to ML models, but no other approach discussed
above does this.</p>
<p>By hiding the deserialization code behind the __init__() method, the
deserialization technique or the storage location of model files can be
changed without affecting the code that uses the model. In the same way,
I can replace the code in the predict() method without affecting the
user of the model, as long as the input and output schemas remain the
same. This is the benefit of using Object Oriented Programming to hide
implementation details from users of your code.</p>
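<p>A minimal sketch of this idea, assuming two hypothetical model versions: the storage mechanism changes between them, but because the interface is identical, the calling code does not change at all.</p>

```python
import json
import os
import tempfile


class ModelV1:
    """ loads its weights from a JSON file on local disk """
    def __init__(self, path):
        with open(path) as f:
            self._weights = json.load(f)

    def predict(self, data):
        return {"score": self._weights["w"] * data["x"]}


class ModelV2:
    """ same interface, but the weights now come from somewhere else
        (hard-coded here; could be a database or object storage) """
    def __init__(self, path):
        self._weights = {"w": 2.0}

    def predict(self, data):
        return {"score": self._weights["w"] * data["x"]}


# write a weights file for ModelV1 to load
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"w": 2.0}, f)
    path = f.name

# the calling code is identical for both versions
for model_class in (ModelV1, ModelV2):
    model = model_class(path)
    print(model.predict({"x": 3.0}))  # prints {'score': 6.0}

os.remove(path)
```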
<p>There are some other improvements that can be added to the MLModel base
class shown in this post, but these will be shown in a later blog post.</p>