Cheapest AWS EC2 Instance for ML Inference

CloudBench recommendation

The right inference instance depends on your model size and latency requirements. For most workloads, **g5.xlarge** at $1.0060/hr (~$724/mo) is the cost-effective starting point. Its NVIDIA A10G GPU has 24GB of VRAM (the instance itself has 16GB of system RAM), which is enough for quantized LLMs up to ~13B parameters, vision-language models, embeddings at scale, and most fine-tuned diffusion variants.
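As a rough sanity check on the "quantized LLMs up to ~13B parameters" sizing, the weight footprint can be estimated from parameter count and bits per weight. This is a minimal sketch: the helper name `weight_memory_gb` is illustrative, and it counts weights only, so KV cache, activations, and framework overhead need extra headroom on top.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same precision.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Weight footprint of a 13B-parameter model at common precisions.
for bits in (16, 8, 4):
    print(f"13B @ {bits}-bit: {weight_memory_gb(13, bits):.1f} GB")
```

Compare the result against the GPU's VRAM, leaving a few GB free for the KV cache and runtime overhead.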

For production LLM serving at any meaningful volume, the cheapest option is almost always AWS Inferentia (inf2). It is purpose-built for transformer inference and runs at roughly 50% of the per-hour cost of a GPU instance with comparable throughput (at the entry-level sizes listed below the gap is smaller: inf2.xlarge is about 25% cheaper per hour than g5.xlarge). The catch: you have to compile your model with the AWS Neuron SDK first, which adds engineering work and limits you to model architectures that have first-class Neuron support (most major LLMs do). If you're serving a fixed set of models and care about cost per token, Inferentia is the right answer.
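Because hourly price alone does not settle cost per token, it helps to divide price by measured throughput. The sketch below uses the On-Demand prices quoted in this guide; the throughput numbers are hypothetical placeholders, not benchmarks, and should be replaced with your own measurements.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# Hourly prices are from this guide; throughputs are HYPOTHETICAL.
candidates = {
    "g5.xlarge": (1.0060, 1000.0),
    "inf2.xlarge": (0.7582, 1000.0),
}
for name, (hourly, tps) in candidates.items():
    print(f"{name}: ${cost_per_million_tokens(hourly, tps):.3f} per 1M tokens")
```

With equal throughput the comparison reduces to the hourly price ratio; the per-token gap widens or narrows depending on how well your model runs on each target.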

For CPU-only inference, c-series Graviton instances (c7g, c8g) handle small classical ML models (sklearn pipelines, gradient-boosted trees, distilled transformers under ~100M parameters) at a fraction of the cost of any GPU instance. A rough rule of thumb: if your model takes under 50ms per request on a c7g.large, you don't need a GPU at all. Most CV embedding models, classical NLP pipelines, and tabular models fit comfortably under this threshold.
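One way to apply the 50ms rule of thumb is to time your model on the target CPU instance and compare the median latency to the threshold. This is a minimal sketch: `predict` and `sample` stand in for your own inference function and a representative input, and the helper names are illustrative.

```python
import statistics
import time


def median_latency_ms(predict, sample, warmup: int = 5, runs: int = 50) -> float:
    """Median wall-clock latency of predict(sample), in milliseconds."""
    for _ in range(warmup):  # warm caches and lazy initialisation first
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


def cpu_is_enough(latency_ms: float, threshold_ms: float = 50.0) -> bool:
    """True when the measured CPU latency is under the rule-of-thumb threshold."""
    return latency_ms < threshold_ms


if __name__ == "__main__":
    # Stand-in workload; replace with your real model call.
    latency = median_latency_ms(lambda x: sum(i * i for i in range(x)), 10_000)
    print(f"median latency: {latency:.2f} ms, CPU enough: {cpu_is_enough(latency)}")
```

Run the measurement on the instance type you intend to deploy on, since laptop timings do not transfer to a c7g.large.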

Spot pricing matters enormously for inference workloads because they're often interruption-tolerant. Both inf2 and g5 see 60-70% Spot discounts in most regions. The setup overhead is real (you need a request queue, retries, and a fallback to On-Demand if Spot capacity drops), but the cost savings on a meaningful inference fleet are substantial. For latency-sensitive synchronous workloads, stay on On-Demand; for batch processing, async APIs, or anything you'd describe as "throughput-bound," Spot is the right default.
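To see what a Spot-first fleet with On-Demand fallback might cost, the effective hourly rate can be blended from the Spot discount and the share of hours actually served on Spot. The sketch below is illustrative arithmetic only: `spot_fraction` is an assumed parameter you would estimate from your own interruption history, and real Spot prices vary by region and over time.

```python
def blended_hourly(on_demand_usd: float, spot_discount: float, spot_fraction: float) -> float:
    """Effective hourly cost when spot_fraction of hours run on Spot.

    spot_discount: fraction off On-Demand (0.65 means 65% off).
    spot_fraction: share of instance-hours served by Spot (0.0 to 1.0);
                   the remainder falls back to On-Demand.
    """
    if not (0.0 <= spot_discount <= 1.0 and 0.0 <= spot_fraction <= 1.0):
        raise ValueError("spot_discount and spot_fraction must be between 0 and 1")
    spot_usd = on_demand_usd * (1 - spot_discount)
    return spot_fraction * spot_usd + (1 - spot_fraction) * on_demand_usd


# g5.xlarge On-Demand price from this guide; 65% discount is the midpoint of
# the quoted 60-70% range; 90% Spot coverage is a HYPOTHETICAL assumption.
rate = blended_hourly(1.0060, spot_discount=0.65, spot_fraction=0.9)
print(f"blended g5.xlarge rate: ${rate:.4f}/hr")
```

Even with a sizeable On-Demand fallback share, the blended rate stays well below the On-Demand price, which is why the queue-and-retry overhead tends to pay off for throughput-bound work.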

Alternatives by tier

Budget (CPU)

c7g.large

$0.0725/hr · ~$52/mo

Small CPU-only models (tiny transformers, classical ML) — 2 vCPUs, 4GB RAM. Cheapest viable inference for sub-100M-param models.

Standard (GPU) ★

g5.xlarge

$1.0060/hr · ~$724/mo

General GPU inference: NVIDIA A10G, 24GB VRAM. Runs most current models (LLMs up to 13B parameters quantized, vision models, embeddings).

Scale (Inferentia)

inf2.xlarge

$0.7582/hr · ~$546/mo

Compiled transformer inference at ~50% of the cost of an equivalent GPU. Needs a Neuron SDK compilation step, but is the cheapest per token for production LLM serving.
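The monthly figures in the tiers above are consistent with multiplying the hourly price by a 720-hour month (30 days x 24 hours). That multiplier is inferred from the listed numbers rather than stated by the source, so treat it as an assumption in this short sketch.

```python
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours, always-on instance

# On-Demand hourly prices as listed in the tiers above.
PRICES = {
    "c7g.large": 0.0725,
    "g5.xlarge": 1.0060,
    "inf2.xlarge": 0.7582,
}


def monthly_usd(hourly_usd: float, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost of one instance running for the given number of hours."""
    return hourly_usd * hours


for name, hourly in PRICES.items():
    print(f"{name}: ~${monthly_usd(hourly):.0f}/mo")
```

Pass a smaller `hours` value to estimate the cost of an instance that only runs part of the month.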

Things to consider

For LLM inference at any volume, Inferentia (inf2) is dramatically cheaper than NVIDIA GPUs — but only if your model compiles with the Neuron SDK
For one-off batch inference, c-series CPU instances are often more cost-effective than provisioning a GPU you don't fully utilize
NVIDIA A10G (g5) is the right starting GPU for most workloads — better availability than newer H100s, plenty of VRAM for 7B-13B models
Spot pricing on inf2 and g5 is often 60-70% off On-Demand; for non-realtime inference, build interruption tolerance and use Spot
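The considerations above can be condensed into a simple decision helper. This is a deliberately simplified sketch, not a sizing tool: the function name, its inputs, and the ordering of the checks are illustrative assumptions, and the 50ms threshold is this guide's rule of thumb.

```python
from typing import Optional, Tuple


def suggest_instance(
    cpu_latency_ms: Optional[float],
    neuron_supported: bool,
    latency_sensitive: bool,
) -> Tuple[str, str]:
    """Return (instance type, purchase option) following this guide's heuristics.

    cpu_latency_ms: measured per-request latency on a c7g.large, or None if unmeasured.
    neuron_supported: whether the model compiles with the AWS Neuron SDK.
    latency_sensitive: True for synchronous, real-time serving.
    """
    if cpu_latency_ms is not None and cpu_latency_ms < 50:
        instance = "c7g.large"      # fast enough on CPU, no accelerator needed
    elif neuron_supported:
        instance = "inf2.xlarge"    # compiled transformer inference
    else:
        instance = "g5.xlarge"      # general-purpose GPU default
    purchase = "On-Demand" if latency_sensitive else "Spot"
    return instance, purchase


print(suggest_instance(cpu_latency_ms=20, neuron_supported=False, latency_sensitive=True))
print(suggest_instance(cpu_latency_ms=None, neuron_supported=True, latency_sensitive=False))
```

Treat the output as a starting point for benchmarking rather than a final answer.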
