Most people talk about spot instances like they’re a clever hack: the same cloud compute, but cheaper, if you’re willing to tolerate a bit of risk.
That framing is useful, but it’s also the source of a lot of disappointment.
Spot isn’t a discount programme. It’s a mechanism for cloud providers to even out utilisation on a busy commodity platform. You’re not buying “cheap infrastructure”; you’re buying access to spare capacity the provider would otherwise leave idle. When there’s slack in the system, you can get compute at a bargain. When the platform gets busy, the provider needs that capacity back.
So the real question isn’t “how do I get spot cheaper?” It’s this:
How do I build workloads that can ride a fluctuating capacity market without falling over?
TL;DR
Spot is interruptible spare capacity, not a stable pricing tier.
The biggest risk isn’t price — it’s reclaim events and failed replacement when pools tighten.
Use spot where interruption is a non-event: queues, stateless workers, idempotent jobs, fast replacement.
The safest pattern is baseline on stable capacity, burst on spot, with an automatic fallback to on-demand.
The misunderstanding: treating spot like a stable pricing tier
The classic spot mistake is to treat it like a permanent cost reduction:
“We’ll put our services on spot and save 60%”
“We’ll lift-and-shift to spot and accept occasional termination”
“We’ll use spot for everything except the database”
Those statements assume spot behaves like on-demand with a slightly worse SLA.
But spot behaves more like a market for leftover capacity. Markets don’t stay still, and you don’t control this one.
When the platform has spare capacity, spot is plentiful. When demand rises — seasonal traffic, a product launch, an incident elsewhere, a popular instance family becoming constrained — spare capacity disappears. The provider will reclaim instances, and your “cheap servers” evaporate exactly when the world is most inconvenient.
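Reclaim comes with a short warning you can act on. As a concrete sketch, here is how a worker might watch for that warning on AWS EC2, where a spot interruption notice appears in instance metadata roughly two minutes before reclaim (other clouds have equivalents with different endpoints; the endpoint shown assumes IMDSv1, and the `drain` body is a placeholder for your own shutdown logic):

```python
import json
import urllib.error
import urllib.request

# AWS-specific: EC2 publishes a spot interruption notice at this metadata
# path shortly before reclaiming the instance. The endpoint returns 404
# while no interruption is scheduled. Note: if IMDSv2 is enforced, a
# session token header is also required (omitted here for brevity).
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def check_interruption(fetch=None):
    """Return the parsed interruption notice as a dict, or None if none is scheduled."""
    if fetch is None:
        def fetch():
            with urllib.request.urlopen(METADATA_URL, timeout=1) as resp:
                return resp.read().decode()
    try:
        body = fetch()
    except urllib.error.HTTPError:
        return None  # 404: no interruption scheduled
    # Example payload: {"action": "terminate", "time": "2024-05-01T12:00:00Z"}
    return json.loads(body)

def drain(notice):
    # Placeholder: stop accepting new work, finish or checkpoint in-flight
    # jobs, deregister from the load balancer, then exit cleanly.
    print(f"Draining before {notice['time']} ({notice['action']})")
```

A supervisor loop would call `check_interruption` every few seconds and invoke `drain` on the first non-None result; the point is that reclaim becomes a routine code path, not an outage.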
What spot is actually optimised for
From the provider’s point of view, spot is how you monetise slack and smooth out utilisation. A fleet that’s 60–70% utilised all the time is expensive to run and hard to predict. Filling the gaps with interruptible compute turns idle metal into revenue without committing to long-lived guarantees.
From your point of view, spot is perfect for anything that can tolerate interruption:
stateless workers behind a queue
batch processing
CI runners and build farms
render farms
event-driven jobs
horizontally scalable services with graceful degradation
Spot is not “cheap compute”; it’s interruptible capacity. Cheap is an effect, not the product.
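What makes these workloads tolerate interruption is that replaying a job is harmless. A minimal sketch of that property, using an idempotency key per job (the in-memory set stands in for a durable store such as a database table or Redis set, and the field names are illustrative):

```python
# Restart-safe worker sketch: each job carries an idempotency key, and
# completed keys are recorded so a job redelivered after a spot
# interruption becomes a no-op instead of a duplicate side effect.

def process_job(job: dict, completed: set, handler) -> bool:
    """Run `handler` at most once per idempotency key; return True if work was done."""
    key = job["idempotency_key"]
    if key in completed:
        return False  # already processed before the interruption; skip
    handler(job)
    completed.add(key)  # record completion only after the handler succeeds
    return True
```

If the worker dies between `handler(job)` and recording completion, the job runs twice, which is exactly why the handler itself must be safe to repeat.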
The price isn’t the real risk — interruptions are
People tend to watch spot price and assume the cost risk is “it might go up.” In practice, especially in modern spot markets, the bigger risk is usually:
your instances get reclaimed
you can’t replace them in the same pool
your scaling fails because the pool is tight
The “cost” of spot isn’t primarily the hourly rate. It’s the engineering you need to make interruption a non-event.
If your workload can’t adapt automatically, spot savings are just deferred costs paid later — as downtime, paging, and emergency refactors.
And in the real world, spot tends to go wrong in two ways: it creates avoidable incidents, or it quietly becomes a tax on your team. You end up with brittle node groups, surprise capacity gaps, and engineers babysitting scaling behaviour that should be automatic.
The lever you need: feedback loops and control knobs
The mental shift is this:
Only use spot if you can actively respond to capacity conditions.
That doesn’t mean staring at a graph all day. It means building your platform so it can observe conditions and change behaviour quickly.
A safe operating model (baseline + burst)
A good default pattern is:
Buy stability for the minimum you must have (baseline)
Use spot for elastic capacity above that baseline (burst)
Fall back automatically to stable capacity when the market tightens
This keeps your core SLOs safe while still capturing savings when spare capacity is available.
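The split itself is simple arithmetic. A minimal sketch of the baseline + burst decision, with hypothetical names; managed equivalents exist (for example, an AWS Auto Scaling group’s mixed instances policy with an on-demand base capacity), but the logic is the same:

```python
from dataclasses import dataclass

@dataclass
class CapacityPlan:
    on_demand: int  # stable instances: the baseline, plus any fallback top-up
    spot: int       # interruptible instances serving the burst above baseline

def plan_capacity(desired: int, baseline: int, spot_available: int) -> CapacityPlan:
    """Split desired capacity into a stable baseline and a spot burst.

    `baseline` is the minimum stable capacity needed to hold SLOs. Anything
    above it is served from spot, and any spot shortfall falls back to
    on-demand, so total capacity never drops below `desired`.
    """
    stable = min(desired, baseline)
    burst = desired - stable
    spot = min(burst, spot_available)
    fallback = burst - spot  # spot pool is tight: top up with on-demand
    return CapacityPlan(on_demand=stable + fallback, spot=spot)
```

When the spot pool is healthy, the burst is cheap; when it tightens, the fallback term grows and you pay on-demand rates instead of paging anyone.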
The spot operating checklist
If you want spot savings to stick, you should be able to answer “yes” to most of these:
Baseline capacity defined: you know the minimum stable capacity required to hold SLOs.
Interruption is routine: workloads restart cleanly; jobs are idempotent and retry-safe.
Placement is flexible: multiple instance types and families; you’re not married to one SKU.
Spread is deliberate: multi-AZ (and multi-region where justified), avoiding correlated failures.
Scaling is deadline-aware: queue depth and job latency drive scaling, not CPU alone.
Fallback exists: automatic move to on-demand when launch failures spike or capacity errors appear.
Degradation is designed: you shed non-critical work first before user-facing impact.
Visibility is clear: interruption rate, launch failures, queue depth, and SLO burn are on one dashboard.
Ownership is explicit: someone owns the policies, the guardrails, and the incident playbooks.
If you don’t have these knobs, spot is just gambling with your uptime.
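Two of those knobs — deadline-aware scaling and automatic fallback — can be sketched in a few lines. The thresholds and window below are illustrative assumptions, not tuned values:

```python
import math

def desired_workers(queue_depth: int, avg_job_seconds: float,
                    deadline_seconds: float, per_worker_concurrency: int = 1) -> int:
    """Scale on outstanding work versus its deadline, not CPU: provision
    enough workers to drain the current queue before the deadline."""
    jobs_per_worker = (deadline_seconds / avg_job_seconds) * per_worker_concurrency
    return max(1, math.ceil(queue_depth / jobs_per_worker))

def should_fall_back(recent_launches: list, failure_threshold: float = 0.5) -> bool:
    """Switch burst capacity to on-demand when spot launch failures spike.
    `recent_launches` is a sliding window of attempts (True = success)."""
    if not recent_launches:
        return False
    failure_rate = recent_launches.count(False) / len(recent_launches)
    return failure_rate >= failure_threshold
```

For example, 600 queued jobs at 30 seconds each against a 5-minute deadline wants 60 workers regardless of how idle the current fleet’s CPUs look; and a burst of failed spot launches flips the fleet to on-demand before the queue deadline is blown.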
A practical rule of thumb
If losing 30–50% of your spot capacity in two minutes would cause an incident, you’re not “using spot” — you’re running production on a hope.
Spot is fantastic — when it’s treated like what it is: a variable, interruptible slice of a shared commodity platform.
Where teams usually need help
Most teams don’t fail on the concept — they fail on the operating model: instrumentation, autoscaling policy, safe fallbacks, and clear ownership when the market tightens.
That’s where StackTrack typically helps through Cloud Support and DevOps as a Service:
Stabilise the baseline: define minimum viable capacity and SLO guardrails
Make interruption boring: design restartability, idempotency, and fast replacement
Remove single points of capacity: diversify pools and instance families
Harden scaling: tune autoscaling against queue depth, latency, and deadlines
Run it with you: monitoring, incident response, and continuous improvement so savings don’t degrade into risk
Closing thought
Spot doesn’t reward people who find the cheapest instance type.
It rewards people who build systems that can adapt to a market.
If you want to save money with spot, don’t start with pricing. Start with architecture: queues, statelessness, idempotency, fast replacement, and the ability to shift across pools.
Then the savings show up as a side-effect — and they keep showing up even when everyone else piles onto the same capacity.