Cheapest AWS EC2 Instance for ML Inference
The right inference instance depends on your model size and latency requirements. For most workloads, **g5.xlarge** at $1.0060/hr (~$724/mo at 720 hours) is the cost-effective starting point. Its NVIDIA A10G has 24GB of VRAM, enough for quantized LLMs up to ~13B parameters, vision-language models, embeddings at scale, and most fine-tuned diffusion variants.
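To make the sizing arithmetic concrete, here is a minimal sketch that reproduces the monthly figure and estimates whether a quantized model fits in GPU memory. The 720-hour month, the 20% overhead for KV cache and activations, and all function names are assumptions for illustration, not measured values.

```python
def monthly_cost(hourly_usd: float, hours_per_month: float = 720.0) -> float:
    """On-Demand cost of running one instance continuously for a month.

    720 hours (30 days) is an assumed month length.
    """
    return hourly_usd * hours_per_month


def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: int,
                 vram_gb: float, overhead_fraction: float = 0.2) -> bool:
    """Check weights plus an assumed overhead for KV cache and activations."""
    needed = weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
    return needed <= vram_gb


if __name__ == "__main__":
    print(f"g5.xlarge: ${monthly_cost(1.0060):.2f}/mo")
    a10g_vram_gb = 24  # A10G memory per NVIDIA's published spec
    for bits in (16, 8, 4):
        size = weight_memory_gb(13, bits)
        ok = fits_in_vram(13, bits, a10g_vram_gb)
        print(f"13B @ {bits}-bit: {size:.1f} GB weights, fits={ok}")
```

Real headroom depends on context length and batch size, so treat the overhead fraction as a knob to tune against your own serving configuration.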
For production LLM serving at any meaningful volume, the cheapest option is almost always AWS Inferentia (inf2). It is purpose-built for transformer inference and runs at roughly half the per-hour cost of an equivalent GPU. The catch: you have to compile your model with the AWS Neuron SDK first, which adds engineering work and limits you to model architectures that have first-class Neuron support (most major LLMs do). If you're serving a fixed set of models and care about cost per token, Inferentia is the right answer.
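Per-hour price only matters once it is combined with throughput, so a cost-per-token comparison might be sketched as follows. The throughput numbers are placeholders, not benchmarks, and the inf2 price is derived from the "roughly half" figure above rather than quoted from a price list; substitute your own measurements.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly instance price and sustained throughput to $/1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    g5_hourly = 1.0060              # g5.xlarge On-Demand price from the article
    inf2_hourly = g5_hourly * 0.5   # assumption: the "roughly half" figure, not a quoted price

    # Hypothetical sustained throughputs; replace with your own load-test results.
    candidates = {
        "g5 (GPU)": (g5_hourly, 1000.0),
        "inf2 (Inferentia)": (inf2_hourly, 1000.0),
    }
    for name, (price, tps) in candidates.items():
        print(f"{name}: ${cost_per_million_tokens(price, tps):.3f} per 1M tokens")
```

If the compiled model runs slower on Inferentia than on the GPU, the per-token gap narrows accordingly, which is why measuring throughput on both is worth the effort before committing.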
For CPU-only inference, c-series Graviton instances (c7g, c8g) handle small classical ML models, such as sklearn pipelines, gradient-boosted trees, and distilled transformers under ~100M parameters, at a fraction of the cost of any GPU instance. A rough rule of thumb: if your model takes <50ms per request on a c7g.large, you don't need a GPU at all. Most CV embedding models, classical NLP pipelines, and tabular models fit comfortably under this threshold.
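One way to apply the 50ms rule is to time your model's predict call on the target CPU instance. The sketch below uses only the standard library; `predict` stands in for whatever callable wraps your model, and checking the 95th percentile rather than the mean is an assumption, since the rule above does not name a percentile.

```python
import statistics
import time
from typing import Any, Callable, List, Sequence


def measure_latency_ms(predict: Callable[[Any], Any],
                       inputs: Sequence[Any],
                       warmup: int = 10) -> List[float]:
    """Time each predict(x) call in milliseconds after a short warmup."""
    for x in inputs[:warmup]:
        predict(x)  # warm caches and lazy initialization; not timed
    samples = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def needs_gpu(samples_ms: Sequence[float],
              threshold_ms: float = 50.0,
              percentile: int = 95) -> bool:
    """Return True if the chosen latency percentile misses the threshold.

    Requires at least two samples; percentile must be in 1..99.
    """
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return cuts[percentile - 1] >= threshold_ms


if __name__ == "__main__":
    # Stand-in workload; replace with your model's predict function.
    fake_predict = lambda x: sum(i * i for i in range(1000))
    samples = measure_latency_ms(fake_predict, list(range(200)))
    print("GPU needed:", needs_gpu(samples))
```

Run it on the instance type you intend to deploy, with realistic inputs, because latency measured on a laptop says little about a c7g.large.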
Spot pricing matters enormously for inference workloads because they're often interruption-tolerant. Both inf2 and g5 see 60-70% Spot discounts in most regions. The setup overhead is real (you need a request queue, retries, and a fallback to On-Demand if Spot capacity drops), but the cost savings on a meaningful inference fleet are substantial. For latency-sensitive synchronous workloads, stay on On-Demand; for batch processing, async APIs, or anything you'd describe as "throughput-bound," Spot is the right default.
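The queue, retry, and fallback pieces mentioned above can be sketched in plain Python. The launcher and job-runner callables, the two exception types, and the retry counts are all hypothetical; in practice they would wrap your provisioning tooling and your inference worker.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Dict, Hashable, Tuple


class CapacityUnavailable(Exception):
    """Raised by a (hypothetical) launcher when no Spot capacity is available."""


class Interrupted(Exception):
    """Raised by a (hypothetical) job runner when its Spot instance is reclaimed."""


def acquire_instance(launch_spot: Callable[[], Any],
                     launch_on_demand: Callable[[], Any],
                     spot_attempts: int = 3,
                     backoff_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Tuple[str, Any]:
    """Try Spot with exponential backoff, then fall back to On-Demand."""
    for attempt in range(spot_attempts):
        try:
            return "spot", launch_spot()
        except CapacityUnavailable:
            sleep(backoff_s * 2 ** attempt)
    return "on-demand", launch_on_demand()


def drain_queue(jobs: Deque[Hashable],
                run_job: Callable[[Hashable], Any],
                max_retries: int = 3) -> Dict[Hashable, Any]:
    """Process jobs, re-queueing any that are interrupted mid-run."""
    results: Dict[Hashable, Any] = {}
    attempts: Dict[Hashable, int] = {}
    while jobs:
        job = jobs.popleft()
        try:
            results[job] = run_job(job)
        except Interrupted:
            attempts[job] = attempts.get(job, 0) + 1
            if attempts[job] > max_retries:
                raise  # give up instead of looping forever
            jobs.append(job)  # retry later, after the other queued jobs
    return results
```

This only works if jobs are idempotent, since an interrupted job may have partially run before it is retried.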
Alternatives by tier
Things to consider
- For LLM inference at any volume, Inferentia (inf2) is dramatically cheaper than NVIDIA GPUs — but only if your model compiles with the Neuron SDK
- For one-off batch inference, c-series CPU instances are often more cost-effective than provisioning a GPU you don't fully utilize
- NVIDIA A10G (g5) is the right starting GPU for most workloads — better availability than newer H100s, plenty of VRAM for 7B-13B models
- Spot pricing on inf2 and g5 is often 60-70% off On-Demand; for non-realtime inference, build interruption tolerance and use Spot