A gRPC Service ML Model Deployment

Mon 20 January 2020

This blog post builds on the ideas started in three previous blog posts.

In this blog post I'll show how to deploy the same ML model that I deployed as a batch job in this blog post, as a task queue in this blog post, inside an AWS Lambda in this blog post, and as a Kafka streaming application in this blog post.

The code in this blog post can be found in this github repo.


With the rise of service oriented architectures and microservice architectures, the gRPC system has become a popular choice for building services. gRPC is a fairly new system for doing inter-service communication through Remote Procedure Calls (RPC) that started in Google in 2015. A remote procedure call is an abstraction that allows a developer to make a call to a function that runs in a separate process, but that looks like it executes locally. gRPC is a standard for defining the data exchanged in an RPC call and the API of the function through protocol buffers. gRPC also supports many other features, such as simple and streaming RPC invocations, authentication, and load balancing.

Protocol buffers are defined through an interface definition language, and the code that actually does the serialization/deserialization is then generated from the definition. Once a protocol buffer definition file is created, the protocol buffer definition can be compiled into many different programming languages through a compiler. This allows gRPC to be a cross-language standard for a common exchange format between services.

gRPC services are coded in much the same way as a regular web service but have several differences that will affect the service we'll build in this blog post. First, protocol buffers are statically typed, which makes the serialized data packages smaller but allows for less flexibility in the code of the service. Second, protocol buffers must be compiled to source code, which makes it harder to evolve services that use them. Lastly, a protocol buffer is a binary data structure that is optimized for size and processing speed, whereas a JSON data structure is a string-based data structure optimized for simplicity and readability. In performance comparisons, protocol buffers are generally reported to be several times faster than JSON, although the exact factor depends on the language and the payload.
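To get a rough feel for the size difference, the sketch below serializes the same record as JSON and as a fixed binary layout. It uses Python's struct module as a stand-in for protocol buffers (this is not the protobuf wire format), and the field names are hypothetical iris measurements that do not come from the model package.

```python
import json
import struct

# Hypothetical input record; the field names are illustrative only.
record = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
}

# JSON carries the field names and the numbers as text.
json_bytes = json.dumps(record).encode("utf-8")

# A fixed binary layout (four little-endian 64-bit floats) carries only the values.
# Both sides must already agree on the field order, much like a compiled schema.
binary_bytes = struct.pack("<4d", *record.values())

print("JSON bytes:", len(json_bytes))
print("binary bytes:", len(binary_bytes))

# Decoding the binary form requires the same layout string on the receiving side.
decoded = struct.unpack("<4d", binary_bytes)
```

The binary form is smaller because the schema lives in the code on both ends instead of travelling with every message, which is the same trade-off protocol buffers make.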

In previous blog posts, we've used JSON exclusively, to keep things simple. JSON allowed the services and applications to deserialize the data structure and send it directly to the model without having to worry about the contents of the data structure. This is not possible with gRPC since the service requires explicit knowledge of the schema of the model's incoming and outgoing data.

Package Structure

-   model_grpc_service (python package for service)
    -   __init__.py
    -   config.py (configuration for the application)
    -   ml_model_grpc_endpoint.py (MLModel gRPC endpoint class)
    -   model_manager.py (model manager singleton class)
    -   service.py (service code)
-   scripts
    -   client.py (single prediction test)
    -   generate_proto.py
-   tests (unit tests)
-   Dockerfile
-   Makefile
-   model_service.proto (protocol buffer definition of gRPC service)
-   model_service_pb2.py (python protocol buffer code)
-   model_service_pb2_grpc.py (python gRPC service bindings)
-   model_service_template.proto (protocol buffer template file)
-   README.md
-   requirements.txt
-   setup.py
-   test_requirements.txt

This structure can be seen in the github repository.

Installing the Model

In order to create a gRPC service for ML models we'll first install a model package into the environment. We'll use the iris_model package, which has been used in several previous blog posts. The model package itself was created in this blog post. The model package can be installed from its git repository with this command:

pip install git+https://github.com/schmidtbri/ml-model-abc-improvements

Now that we have the model package in the environment, we can add it to the config.py module:

class Config(dict):
    models = [{
        "module_name": "iris_model.iris_predict",
        "class_name": "IrisModel"
    }]

The code above can be found here.

This configuration class is used by the service in all environments. The module_name and class_name fields allow the application to find the MLModel class that implements the prediction functionality of the iris_model package. The list can hold information for many models, so there's no limitation to how many models can be hosted by the service.
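As a rough illustration of how a module_name and class_name pair can be turned into a class at runtime, here is a minimal sketch using importlib. The load_class helper is hypothetical and is not taken from the ModelManager code; the configuration entry points at a standard library class so the sketch runs without the iris_model package installed.

```python
import importlib


def load_class(module_name, class_name):
    """Import a module by name and return the named class from it."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in configuration entry with the same shape as Config.models.
models = [{"module_name": "collections", "class_name": "OrderedDict"}]

# Resolve every configured class, then instantiate each one.
loaded = [load_class(m["module_name"], m["class_name"]) for m in models]
instances = [cls() for cls in loaded]

print([cls.__name__ for cls in loaded])
```

Because the lookup is driven entirely by strings, adding another model to the list requires no change to the code that loads it.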

We need to install the model package before writing any other code because the model's input and output schemas are required to define the gRPC service's API.

Generating a Protocol Buffer Definition

Since we can't code the gRPC service until we have a .proto file that defines the API of the service, our first task is to generate a .proto file from the models that will be hosted by the service. To generate the file automatically from the iris_model's input and output schemas, we'll use the Jinja2 templating tool. Jinja2 generates documents by combining a template file with a data structure: the unchanging parts of a document live in the template, and the parts that change live in the data structure. First we'll create a template, and after that we'll add the schema information to it to generate a .proto file for the service.

The Template File

First we'll create the template file from which we'll generate the .proto file:

syntax = "proto3";

package model_grpc_service;

This code can be found here.

At the top of the template, we declare that we'll use the proto3 format, and the name of the package is "model_grpc_service". Next, we'll declare some data structures:

message empty {}

message model {
    string qualified_name = 1;
    string display_name = 2;
    string description = 3;
    sint32 major_version = 4;
    sint32 minor_version = 5;
    string input_type = 6;
    string output_type = 7;
    string predict_operation = 8;
}

message model_collection {
    repeated model models = 1;
}

This code can be found here.

These data structures will be used by an operation that will be declared further down in the template. The data structures hold information about the models that are hosted by the service, including the names of the input and output types and the name of the prediction operation for the model. The model_collection type holds a list of model objects.

Next, we'll generate an input type for the models hosted by the service:

{% for model in models %}
message {{ model.qualified_name }}_input {
    {% for field in model.input_schema %}
        {{ field.type }} {{ field.name }} = {{ field.index }};
    {% endfor %}
}
{% endfor %}

This code can be found here.

This template code uses the qualified name of a model and the schema of the input of the model to generate a protocol buffer type that matches the model's input. The name of the input type for a model always follows this pattern: "<model_qualified_name>_input". Each field in the input schema of the model is translated to the equivalent field type in a protocol buffer and is given the same name. Lastly, an index is generated and assigned to the field.
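To make the expansion concrete, here is a small plain-Python sketch that produces the same kind of text the template loop produces for one model. It deliberately avoids Jinja2 so it runs with the standard library alone; render_message is a hypothetical helper, and the field entries are made-up examples, not the real iris_model schema.

```python
def render_message(qualified_name, suffix, fields):
    """Build the text of one protocol buffer message from a list of field dicts."""
    lines = ["message {}_{} {{".format(qualified_name, suffix)]
    for field in fields:
        # Each field becomes "<type> <name> = <index>;" as in the template.
        lines.append("    {} {} = {};".format(field["type"], field["name"], field["index"]))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical schema entries; the real ones come from the model's input schema.
fields = [
    {"type": "double", "name": "sepal_length", "index": "1"},
    {"type": "double", "name": "sepal_width", "index": "2"},
]

proto_text = render_message("iris_model", "input", fields)
print(proto_text)
```

The same helper would cover the output type by passing "output" as the suffix, which mirrors how the template repeats the loop for the output schema.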

Next, we'll do the same for the output schema of the model:

{% for model in models %}
message {{ model.qualified_name }}_output {
    {% for field in model.output_schema %}
        {{ field.type }} {{ field.name }} = {{ field.index }};
    {% endfor %}
}
{% endfor %}

This code can be found here.

Now we can start to define the service's API:

service ModelgRPCService {
    rpc get_models (empty) returns (model_collection) {}
    {% for model in models %}
        rpc {{ model.qualified_name }}_predict ({{ model.qualified_name }}_input) returns ({{ model.qualified_name }}_output) {}
    {% endfor %}
}

The code above can be found here.

This code defines the operations that the service implements. The first operation is called "get_models" and it uses the first set of protobuf data structures that we defined above. This operation is simple since it does not change with the models that are being hosted by the gRPC service. It accepts the "empty" type since it does not require any inputs, and it returns the "model_collection" type.

Next, we will generate a set of prediction operations, one for each model hosted by the service. The name of the predict operation always follows this pattern: "<model_qualified_name>_predict". The model's input and output types are added to the operation by name.

Using the Template File

This template file is now ready to be used, so we'll create a Python script that takes it and adds information about the models that we actually want to host in the service. The code that does this is in the generate_proto.py script.

This code will make use of the ModelManager class that has been used in several previous blog posts. The ModelManager class is responsible for loading models from configuration, maintaining references to the model objects, and returning information about the models. In this section we'll use the get_models() and get_model_metadata() operations to access the information needed to generate the protocol buffer definition.

The script starts by instantiating the ModelManager and loading the models from the configuration:

model_manager = ModelManager()


This code can be found here.

Then the script loads the Jinja2 template file:

template_loader = jinja2.FileSystemLoader(searchpath="./")
template_env = jinja2.Environment(loader=template_loader)
template = template_env.get_template("model_service_template.proto")

This code can be found here.

Now that the template is loaded, we can generate the data structure that will be passed to the template:

models = []
for model in model_manager.get_models():
    model_details = model_manager.get_model_metadata(qualified_name=model["qualified_name"])
    # one entry per model, with the schema fields translated to protocol buffer types
    models.append({
        "qualified_name": model_details["qualified_name"],
        "input_schema": [{
            "index": str(index + 1),
            "name": field_name,
            "type": type_mappings[model_details["input_schema"]["properties"][field_name]["type"]]
        } for index, field_name in enumerate(model_details["input_schema"]["properties"])],
        "output_schema": [{
            "index": str(index + 1),
            "name": field_name,
            "type": type_mappings[model_details["output_schema"]["properties"][field_name]["type"]]
        } for index, field_name in enumerate(model_details["output_schema"]["properties"])]
    })

This code can be found here.

The code builds a dictionary for each model that contains the qualified name, input schema, and output schema of each model in the ModelManager. The python data types are converted to the equivalent protocol buffer types as it goes along. The resulting dictionary is the data structure that is used by the Jinja2 template defined above to generate a protocol buffer definition.
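The type conversion step can be sketched on its own. The contents of type_mappings are not shown in this post, so the mapping below is an assumption (JSON-schema-style type names mapped to protocol buffer scalar types), and the sample schema and the schema_to_fields helper are hypothetical.

```python
# Assumed mapping from schema type names to protocol buffer scalar types.
type_mappings = {
    "string": "string",
    "number": "double",
    "integer": "int64",
    "boolean": "bool",
}


def schema_to_fields(schema):
    """Turn a schema's properties into the field list expected by the template."""
    return [
        {
            "index": str(index + 1),  # protocol buffer field numbers start at 1
            "name": name,
            "type": type_mappings[schema["properties"][name]["type"]],
        }
        for index, name in enumerate(schema["properties"])
    ]


# Hypothetical schema with the same "properties" layout used in the script.
input_schema = {
    "properties": {
        "sepal_length": {"type": "number"},
        "species_hint": {"type": "string"},
    }
}

fields = schema_to_fields(input_schema)
print(fields)
```

A schema type that is missing from the mapping raises a KeyError here, which matches the limitation that only simple data structures are supported.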

Lastly, we'll render the template with the information we just extracted from the models and then save the generated file to disk:

output_text = template.render(models=models)
with open(output_file, "w") as f:
    f.write(output_text)

This code can be found here.

Now that we have the template and the script that uses the template completed, we can try to generate a protocol buffer definition for the service. The command to do this goes like this:

export PYTHONPATH=./
python scripts/generate_proto.py --output_file=model_service.proto

The file generated by the command above is called "model_service.proto" and it can be found here. The protocol buffer definition contains the types needed for the get_models operation as well as the operation itself. It also contains the types and operations needed to interact with the iris_model, which were automatically extracted from the information provided by the model.

By using a template and script approach to generating a protocol buffer definition we are able to host any number of models inside of the gRPC service. This is possible because every model that will be hosted is required to expose its input and output schema through the MLModel interface.

Defining the Service

Now that we have a protocol buffer definition for the gRPC service we can actually start writing the code to implement the service itself. To do this, we first need to compile the protocol buffer into its python implementation. This is done with this command:

export PYTHONPATH=./
python -m grpc_tools.protoc --proto_path=. --python_out=. --grpc_python_out=. model_service.proto

This command generates two files: the model_service_pb2.py file and the model_service_pb2_grpc.py file. The model_service_pb2.py file contains the python data structures that will serialize and deserialize from native python types to the protocol buffer binary format. The model_service_pb2_grpc.py file contains the bindings that will allow us to write a service that implements the operations defined in the protocol buffer definition and also to write client code that can call the implementations.

We'll start by creating a python file that contains the main service codebase. We'll also implement the get_models operation in this file, since it is a static endpoint that does not depend on any particular model being present.

The gRPC service is defined as a class that inherits from a "Servicer" class that was generated by the protoc compiler:

class ModelgRPCServiceServicer(model_service_pb2_grpc.ModelgRPCServiceServicer):

This code can be found here.

Within the class, each operation is defined as a method with the same name as the operation in the .proto file. The get_models operation is defined like this:

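def get_models(self, request, context):
    model_data = self.model_manager.get_models()
    models = []
    for m in model_data:
        # type and operation names follow the naming patterns used in the .proto file;
        # the remaining metadata fields of the model message are omitted here
        response_model = model(qualified_name=m["qualified_name"],
                               input_type="{}_input".format(m["qualified_name"]),
                               output_type="{}_output".format(m["qualified_name"]),
                               predict_operation="{}_predict".format(m["qualified_name"]))
        models.append(response_model)
    response_models = model_collection(models=models)
    return response_models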

This code can be found here.

The operation does not receive any data in the request and returns a model_collection data structure in the response. The model_collection data structure was defined in the .proto file and compiled into a python class by the protoc compiler. In order to fill the model_collection, we iterate through the data returned by the ModelManager creating a list of model objects as we go along. We then create the model_collection from the list and return it to the client.

MLModelgRPCEndpoint Class

In order for the service to host any model that uses the MLModel base class, we'll need to create a class that translates the protocol buffer data structures into the native python data structures used by the models. This class will be instantiated for every model that is hosted by the service.

class MLModelgRPCEndpoint(object):

The code above can be found here.

When the service starts, we'll create one instance of this class for every model. The __init__ method looks like this:

def __init__(self, model_qualified_name):
    model_manager = ModelManager()
    self._model = model_manager.get_model(model_qualified_name)
    if self._model is None:
        raise ValueError("'{}' not found in ModelManager instance.".format(model_qualified_name))

    logger.info("Initializing endpoint for model: {}".format(self._model.qualified_name))

The code above can be found here.

The __init__ method has one argument called "model_qualified_name" which tells the endpoint class which model it will be hosting. The __init__ method gets a reference to the ModelManager object that is managed by the service, then it gets a reference to the model object from the ModelManager object using the model_qualified_name argument. Lastly, before finishing we check that the model instance is actually available in the ModelManager.

Now that we have an instance of the endpoint for the MLModel object, we need to write a method that will make the predict method available as a gRPC endpoint. We'll do this by defining the __call__ method on the endpoint class. When a __call__ method is attached to a class, it turns all instances of the class into callables, which allows instances of the class to be used like functions. This will be useful later when we need to initialize a dynamic number of endpoints in the gRPC service.
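For readers who have not used __call__ before, here is a minimal sketch of the idea with stand-in classes. UppercaseModel and Endpoint are hypothetical and only mimic the shape of the real model and endpoint classes; plain dictionaries take the place of protocol buffer messages.

```python
class UppercaseModel:
    """Stand-in model with a predict method, used only for this sketch."""

    qualified_name = "uppercase_model"

    def predict(self, data):
        return {"text": data["text"].upper()}


class Endpoint:
    """Wraps a model so that the instance itself can be called like a function."""

    def __init__(self, model):
        self._model = model

    def __call__(self, request, context):
        # The context argument is accepted to match the gRPC handler signature.
        return self._model.predict(data=request)


endpoint = Endpoint(UppercaseModel())

# The instance is invoked exactly like a function taking (request, context).
result = endpoint({"text": "hello"}, None)
print(result)
```

Each instance keeps its own model reference, so many endpoints can coexist while sharing a single implementation of the call logic.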

def __call__(self, request, context):
    data = MessageToDict(request, preserving_proto_field_name=True)

    prediction = self._model.predict(data=data)

    output_protobuf_name = "{}_output".format(self._model.qualified_name)
    output_protobuf = MLModelgRPCEndpoint._get_protobuf(output_protobuf_name)

    response = output_protobuf(**prediction)

    return response

The code above can be found here.

The method uses the MessageToDict function from the protobuf package to turn a protocol buffer data structure into a Python dictionary. The dictionary is then passed into the model's predict method and a prediction is returned.

Now that we have a prediction, we have to find the right protocol buffer data structure to return the prediction result to the client. To do this, a special method called "_get_protobuf" is used, which goes into the model_service_pb2.py module where the python protocol buffer definitions are stored and dynamically imports the correct class for the output of the model. For example, the iris_model's output protocol buffer definition is called "iris_model_output". This lookup is possible because the output protocol buffer of a model is always named according to the same pattern. In the last step, we hand over the model's prediction to the protocol buffer class, which initializes itself with the prediction data, and return the resulting object.
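The body of _get_protobuf is not shown in this post, so the following is only a guess at how such a lookup could work: a getattr call on a module object, keyed by the naming pattern. The get_protobuf function, the fake_pb2 module, and the "species" field are all hypothetical stand-ins that let the sketch run without the generated code.

```python
import types


def get_protobuf(module, protobuf_name):
    """Return the class named protobuf_name from module, or raise a clear error."""
    try:
        return getattr(module, protobuf_name)
    except AttributeError:
        raise ValueError(
            "'{}' not found in module '{}'.".format(protobuf_name, module.__name__)
        )


# Stand-in for the generated model_service_pb2 module.
fake_pb2 = types.ModuleType("fake_pb2")


class iris_model_output:
    """Stand-in for a generated message class; it just stores its keyword arguments."""

    def __init__(self, **fields):
        self.fields = fields


fake_pb2.iris_model_output = iris_model_output

# Build the class name from the qualified name, then look it up dynamically.
output_class = get_protobuf(fake_pb2, "{}_output".format("iris_model"))
response = output_class(species="setosa")
print(type(response).__name__, response.fields)
```

Raising a descriptive error for an unknown name is a design choice of this sketch; it surfaces a mismatch between the hosted models and the compiled .proto file early.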

Creating gRPC Endpoints Dynamically

Now that we have a class that can handle any model object, we need to connect it to the service. To do this, we'll create an __init__ method in the service class that will execute when the service starts up:

def __init__(self):
    self.model_manager = ModelManager()

    for model in self.model_manager.get_models():
        endpoint = MLModelgRPCEndpoint(model_qualified_name=model["qualified_name"])

        operation_name = "{}_predict".format(model["qualified_name"])
        setattr(self, operation_name, endpoint)

The code above can be found here.

The __init__ method first instantiates the ModelManager class and loads the models listed in the configuration. Once the models are in memory, we create an endpoint object for each one in a loop. For each model, we create an MLModelgRPCEndpoint object which is given the model's qualified name. Then we generate the model's operation name which matches the operation name for the model's predict operation listed in the .proto file. For example, the iris_model's predict operation is named "iris_model_predict". Lastly, we use the operation name and dynamically set an attribute on the service class that attaches the newly created endpoint to the class. This last step allows the service to find the right endpoint for the operation when a call for a prediction from a certain model is received. The fact that each endpoint object is callable allows the service to call the endpoint object as if it was a method of the class even though the endpoint is actually another class.
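The dynamic registration step can be tried in isolation. The sketch below uses hypothetical stand-ins (Service, Doubler, and the "doubler_model" name) rather than the real servicer and endpoint classes, and it shows only the setattr and lookup mechanics.

```python
class Doubler:
    """Stand-in endpoint: a callable object with the (request, context) signature."""

    def __call__(self, request, context):
        return request * 2


class Service:
    """Attaches one callable per model under the name '<qualified_name>_predict'."""

    def __init__(self, endpoints):
        for qualified_name, endpoint in endpoints.items():
            operation_name = "{}_predict".format(qualified_name)
            setattr(self, operation_name, endpoint)


service = Service({"doubler_model": Doubler()})

# The attribute can be looked up by name, as a framework would do for an operation.
handler = getattr(service, "doubler_model_predict")

# It can also be called with ordinary method-call syntax.
result = service.doubler_model_predict(21, None)
print(result)
```

Because the endpoint is stored as an instance attribute, Python does not bind the service object as a first argument, so the endpoint receives exactly the request and context it expects.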

Using the Service

We now have a complete service that we can test out. To do this we'll execute these commands:

export PYTHONPATH=./
export APP_SETTINGS=ProdConfig
python model_grpc_service/service.py

In order to test out the service, I created a simple script that sends a single gRPC request to the service. The script is found here. To send a request to the get_models operation, the code looks like this:

with grpc.insecure_channel("localhost:50051") as channel:
    stub = ModelgRPCServiceStub(channel)
    response = stub.get_models(empty())

This code can be found here.

To send a test request to the iris_model_predict operation of the service, execute this command:

export PYTHONPATH=./
python scripts/client.py --iris_model_predict

The script will contact the service running locally, make a prediction with some sample data and print out the prediction result.


In this blog post we've shown how to deploy an ML model inside a gRPC service. As gRPC becomes more popular, the option of deploying ML models as gRPC services is becoming more attractive. As in previous blog posts, we've built the service so that it can support any number of ML models, as long as they implement the ML Model interface. This is one more type of deployment that we implemented without having to modify the iris_model package. The ability to deploy an ML model in different ways without having to rewrite any part of the model code is very valuable and ensures good software engineering practices.

By using gRPC to deploy an MLModel, we're able to take advantage of all of the features of gRPC. These benefits include lightweight and fast serialization of messages and built-in support for streaming. The ability to document a service API using protocol buffers also simplifies the documentation and rollout of a new service. Lastly, the ability to compile service and client codebases from the protocol buffer definitions allows us to avoid many common errors.

In previous blog posts, deploying a new model was as simple as installing the model package into the environment and adding it to the configuration of the application. The schema of the model's inputs and outputs did not affect the application code at all. In the code of this blog post, we have to do more work because of the nature of protocol buffers, since the generated code in the project is specific to a set of models. Because of this, adding a new model to the gRPC service requires us to generate a new .proto file from the model's input and output schemas, generate python code from the .proto file, and finally add the model to the configuration of the service. The extra steps make it more complex to deploy the service.

In the future, the service could be improved by handling more complex schemas, since currently the schema mapping between native python types and protocol buffers only supports simple data structures. Another way to improve the service is to add support for streaming endpoints for each model. Lastly, protocol buffers have a mechanism for evolving message schemas; the code could be improved by safely evolving the schema of the service through this mechanism when the model schema changes.