Options for model deployment
AWS has three main options for deploying ML models:
- SageMaker Endpoints: fully managed
- EKS: partially managed Kubernetes
- Lambda: so-called "serverless" compute
Of these, all but Lambda offer the option of using GPUs. Lambda works very well for ad-hoc integration, but it gets expensive if it needs to run for more than a short time.
Borderline options
Technically, there are three others: Elastic Container Service (ECS), Fargate, and Batch. In practice, none of these are used very often:
- ECS is a fully managed container orchestration service. It does support GPUs, so technically it could meet demanding technical requirements. However, by the time you need to put ML models in containers, your team is almost certainly capable of figuring out EKS. The only time it makes sense to use ECS for ML is if your company is already heavily invested in it.
- Fargate is an abstraction layer over EKS and ECS. Like ECS itself, it is a half-measure that should be skipped if you have a choice.
- Batch is an end-to-end managed orchestration service, including compute provisioning. While Batch does support GPUs, it suffers from the same fundamental pointlessness as Fargate: by the time your batch compute needs are complex enough for an orchestration layer, you probably have people who can figure out how to use Airflow. If you have a single, simple workflow, it might be a reasonable way to get a prototype off the ground, but you should quickly transition to a full-fledged orchestration layer and a more robust solution for inference.
Update 2024-11-30: I was interviewing with a company that uses ECS for ML inference. So there must be some rationale for doing so.
AI/ML as a service
Companies also have the option of treating ML as a service, and very often this is the right choice. (See ML on AWS for the specific services available.) In this case, the backend system is typically very thin. Hence:
- For real-time workflows, Lambda is likely to be cost-effective as an interface layer between a front-end and a managed ML service (a minimal handler sketch follows this list). For an existing backend application, it is usually satisfactory to integrate the request logic directly into the application. Take care to employ asynchronous requests where applicable, as tail latency can be substantial for ML workloads.
- For batch workflows, AWS often has native APIs for batch inference, so batch inference can be a stand-alone step in an orchestration pipeline. If this is not available, it may be worth considering an alternative service. If that is not an option, there is little to be done but write a custom handler. This can be done very easily in Python using asyncio and/or concurrent.futures. Used together (easier than you think!), they can deliver enough throughput that you are likely to quickly exceed a default usage cap; a sketch follows below.
Comparing the main options
Real-time deployments
Early-stage
For early-stage deployments, SageMaker Endpoints are an attractive option. They are very easy to manage, support auto-scaling, and handle their own ops (load balancing, health checks, etc.).
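As a rough illustration of how little is involved, here is a hedged sketch using the SageMaker Python SDK; the container image, model artifact path, role, and endpoint name are placeholders:

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()

# Placeholder values: substitute your inference container image,
# model artifact location, and SageMaker execution role.
model = Model(
    image_uri="<inference-container-image>",
    model_data="s3://<bucket>/<path>/model.tar.gz",
    role="<sagemaker-execution-role-arn>",
    sagemaker_session=session,
)

# SageMaker provisions the instance and handles load balancing and health
# checks; autoscaling can be attached to the endpoint afterwards.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-model-endpoint",  # hypothetical name
)
```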
At scale
On-demand instances for SageMaker endpoints are only about 30% more expensive than the equivalent bare EC2 instances. Although pricing is more nuanced than that, it can safely be said that SageMaker remains cost-competitive for many larger-scale deployments once labor and supporting infrastructure are factored in.
However, managed endpoints limit operational flexibility, and may not be necessary for an organization that already uses Kubernetes. Additionally, mature workloads are likely to require significant pre- and post-processing, such as batching and feature hydration. While SageMaker does offer these capabilities—it has a managed feature store and various strategies for data transformation—these situations can grow arbitrarily complex. For these complex cases, EKS or self-managed Kubernetes on EC2 start to make more sense.
Batch deployments
Early-stage
SageMaker Batch Transform is an attractive solution for batch inference. As with endpoints, you pay by the instance; the difference is that the instances run only while the job is executing, so you pay only for the hours (or fractions of an hour) actually spent performing inference.
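A hedged sketch of launching such a job with the SageMaker Python SDK (the model name, instance type, and S3 paths are placeholders):

```python
from sagemaker.transformer import Transformer

# Placeholder names and paths.
transformer = Transformer(
    model_name="<registered-model-name>",
    instance_count=2,
    instance_type="ml.g4dn.xlarge",
    output_path="s3://<bucket>/batch-output/",
)

# Instances are provisioned for the duration of the job and torn down
# afterwards, so billing covers only the time spent transforming.
transformer.transform(
    data="s3://<bucket>/batch-input/",
    content_type="application/jsonlines",
    split_type="Line",
)
transformer.wait()
```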
At scale
SageMaker Batch Transform remains a good option for standalone, embarrassingly parallel workloads. If the company already has existing compute infrastructure for ETL workloads, it may be preferable to use that. For example, if the company has a robust Kubernetes cluster capable of quickly adding capacity for Kubernetes jobs, it may be more cost-effective to create a GPU-enabled node group and let the cluster autoscale. Note, though, that this adds operational complexity, and managed compute is often cheaper than labor until very large scales.
For companies that already have complex, large-scale ETL processes using Spark, it is also possible to provision GPU-enabled instances for EMR. Note that this path is generally only suitable if either the model is being added to an existing Spark process or Spark is clearly the right tool for the other parts of the task.
Streaming / near real-time deployments
Early-stage
Streaming ML applications, of which near real-time is a special case, resemble real-time deployments with relaxed latency constraints. This typically allows request batching, which can make the deployment much more compute-efficient. SageMaker asynchronous endpoints queue incoming requests, process them as compute becomes available, and notify the requester when each result is ready. This can be a very convenient way to get started.
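On the client side, a hedged sketch of submitting a request to an asynchronous endpoint with boto3 looks like the following; the endpoint name and S3 locations are placeholders, and the payload must already be staged in S3:

```python
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")

# Placeholder endpoint and input location; the payload must already be in S3.
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="<async-endpoint-name>",
    InputLocation="s3://<bucket>/requests/payload-001.json",
    ContentType="application/json",
)

# The request is queued; poll (or subscribe to the endpoint's notification
# topic) and then read the result from the returned output location.
print(response["OutputLocation"])
```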
At scale
Both TorchServe and TensorFlow Serving support dynamic batching out of the box, returning the correct slice of the batch to each requester. However, these batching mechanisms are synchronous; you have to provide your own orchestration logic. For companies with robust message-broker support, this is achievable, but it can quickly become a morass. Hence, until a more complicated pre- and post-processing workflow necessitates a custom solution, SageMaker continues to make business sense even at scale.
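As an example of what the batching configuration looks like, TorchServe exposes its batch settings when a model is registered through the management API. The sketch below assumes a local TorchServe instance with a model archive already built; the archive name, batch size, and delay are illustrative values only:

```python
import requests

# Placeholder model archive; batch_size and max_batch_delay control how many
# requests TorchServe accumulates (and how long it waits) before inference.
resp = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "my_model.mar",
        "batch_size": 8,
        "max_batch_delay": 50,   # milliseconds to wait for a full batch
        "initial_workers": 1,
    },
)
resp.raise_for_status()
print(resp.json())
```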