The two usual options

There are basically two ways I’d expect a company to train a model within the AWS ecosystem: SageMaker Training and EKS.

SageMaker Training: bills like SageMaker Batch Transform, in that you pay for instances only while the job is running. Prices are somewhat lower than for the equivalent inference instances. You can train on SageMaker and then take the resulting model artifact to a different deployment environment.
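As a rough sketch of what this looks like in practice, here's the shape of a request to the SageMaker CreateTrainingJob API. The image URI, role ARN, S3 paths, and instance choices are all hypothetical placeholders, not recommendations:

```python
# Sketch: assembling a SageMaker training job request. All ARNs, URIs, and
# S3 paths below are hypothetical placeholders.
def build_training_job_request(job_name: str) -> dict:
    """Build the request dict for SageMaker's CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # Your own training container, pushed to ECR.
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-trainer:latest",
            "TrainingInputMode": "File",
        },
        "RoleArn": "arn:aws:iam::123456789012:role/SageMakerTrainingRole",
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/train/",
            }},
        }],
        # The model artifact (model.tar.gz) lands here; from S3 you can take
        # it to any deployment environment, not just SageMaker.
        "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/artifacts/"},
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        # Billing stops when the job finishes or hits this cap.
        "StoppingCondition": {"MaxRuntimeInSeconds": 4 * 3600},
    }

request = build_training_job_request("demo-training-job")
# To actually submit: boto3.client("sagemaker").create_training_job(**request)
```

The instances spin up when the job is submitted and are released when it finishes, which is where the pay-only-while-training billing comes from.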

EKS: You’ll package your ML environment into a container image and run training as a Kubernetes Job.
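Concretely, the training run is just a batch/v1 Job. Here's a minimal sketch of the manifest, written as the Python dict you'd serialize to YAML (or hand to the Kubernetes client); the image name, command, and GPU request are hypothetical:

```python
# Sketch: a Kubernetes Job manifest for a training run on EKS. The image,
# command, and resource request are hypothetical placeholders.
training_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "model-training"},
    "spec": {
        "backoffLimit": 2,  # retry a flaky run twice before giving up
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "trainer",
                    # Your ML environment, baked into a container image.
                    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-trainer:latest",
                    "command": ["python", "train.py"],
                    # Request a GPU; a cluster autoscaler can bring up a
                    # GPU node on demand and tear it down afterward.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }],
            },
        },
    },
}
# yaml.safe_dump(training_job) gives you the file for `kubectl apply -f`.
```

Kubernetes handles scheduling, retries, and node lifecycle, which is exactly the infrastructure you'd otherwise be hand-rolling on EC2.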

Why just these?

Why not EC2, ECS, Fargate, Batch, Glue, or Step Functions?

EC2: There are special cases where EC2 makes sense, such as the need for very particular environments or bare-metal access. Outside those cases, you’re self-managing exactly the kind of infrastructure that Kubernetes can handle for you: you’re choosing pets over cattle.

ECS and Fargate: My usual objection to these is that it’s hard to imagine who they’re really for. SageMaker handles straightforward workloads at all but the largest scale, and Kubernetes really isn’t that hard anymore. So by the time you have the model complexity or scale that forces you off SageMaker, you’ll have the technical maturity to handle EKS.

Batch and Glue: These are reasonable starting points for a team with a slightly involved ETL process surrounding their training job. However, most teams will soon want to transition to something more flexible, such as orchestrating one of the two preferred training options with Airflow.

Step Functions: The main thing this product offers over Airflow or Batch is lower invocation latency, which doesn’t matter for long-running training jobs.

EMR: a special case

For companies that already have large map-reduce processes in place, and where those processes are closely connected to ML training, something like Spark on EMR is reasonable. Note, though, that most training workloads are embarrassingly parallel — independent runs with no communication between them — making this level of complexity unnecessary.
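To illustrate the embarrassingly-parallel point: something like a hyperparameter sweep is just independent trials, so a plain worker pool (or one Kubernetes Job per trial) covers it with none of Spark's shuffle machinery. This toy sketch stands in for a real fit-and-score call; for CPU-bound real training you'd use separate processes or separate jobs rather than threads:

```python
# Sketch: a hyperparameter sweep is embarrassingly parallel -- trials are
# independent, so a simple pool suffices. run_trial is a toy stand-in for
# a real train-and-validate call.
from concurrent.futures import ThreadPoolExecutor

def run_trial(learning_rate: float) -> tuple[float, float]:
    """'Train' with one hyperparameter setting; return (score, lr)."""
    # Toy objective: pretend validation score peaks at lr = 0.1.
    score = -abs(learning_rate - 0.1)
    return score, learning_rate

grid = [0.001, 0.01, 0.1, 1.0]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_trial, grid))  # each trial is independent

best_score, best_lr = max(results)
print(best_lr)  # 0.1
```

No stage needs the output of a sibling, so there is nothing for a map-reduce framework to coordinate.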