Shane Burrell

The Day After Migration: Why Cloud Savings Disappear — and How to Keep Them

Cloud migrations win headlines, but the savings often erode within 18 months. Here is how to institutionalize cost discipline, engineering guardrails, and reliability tradeoffs so your migration dividend actually sticks.

A migration is the part everyone remembers. The savings number gets a slide. Finance books it. Engineering moves on to the next initiative. And then, somewhere around eighteen months later, the cloud bill is back near where it started, and nobody can quite explain how.

I have watched this pattern enough times to recognize it as the rule, not the exception. The celebration leads to budget reallocation, which leads to new workloads, which leads to sprawl, which leads to a surprise bill. The migration team that earned the savings has long since disbanded, and the discipline that produced the number left with them.

The mistake is treating cost savings as a project outcome instead of an operating capability. Migration is the starting line, not the finish. And the thing that erodes savings the fastest is almost never the headline architecture decision—it is the dozens of small, unguided choices teams make once nobody is watching. The single biggest cost surprise I see in post-migration environments is Redis, and not because Redis is expensive. It is because teams use it improperly when there are no platform guardrails to stop them. More on that below.

The Migration Dividend Trap

The drop in spend right after a migration is real, but it is mostly a one-time event. It feels like structural change. It usually is not.

  • Lift-and-shift leaves the inefficiency baked in. Decommissioning old providers produces a clean, visible cost drop. The underlying waste—oversized instances, idle environments, chatty data transfer—came along for the ride.
  • The savings get spent before they are verified. The moment a number lands on a finance slide, it gets reallocated to headcount, a new initiative, or an AI pilot. The dividend is gone before anyone confirms it was durable.
  • New teams inherit the platform without the context. The people who run workloads a year later never sat in the migration war room. They do not know which defaults were deliberate and which were guardrails holding back a flood.
  • Commitments get locked at the wrong moment. Reserved capacity and savings plans purchased at peak migration usage can quietly overcommit you to spend you no longer need.
  • Managed services get adopted without patterns. A team decides “we need caching” and spins up a managed service with no platform review. The bill arrives weeks later, after the architecture is already load-bearing.

For how the original $14M in annual savings was actually earned during execution, see Leading Platform Migrations at Scale. This article is about the harder part: keeping it.

Redis: The #1 Cost Surprise

If I had to name the single most common post-migration cost surprise, it would be Redis—almost always run as Amazon ElastiCache. It is worth singling out because it is so easy to provision and so hard to right-size without discipline, which makes it the perfect illustration of why cost is a platform problem.

To be clear up front: Redis is excellent when it is used correctly. The failure is not the technology. It is unguided adoption after a migration, when every team is rebuilding patterns from scratch and nobody owns the defaults.

The misuse patterns repeat almost word for word across organizations:

  • Redis as a database. Teams store durable state, large payloads, or unbounded keys with no eviction policy. Memory grows until you are paying for a cluster you never planned to run—and treating an in-memory cache as your source of truth.
  • Oversized nodes “just in case.” The largest instance type with Multi-AZ replication, provisioned for a dev or staging environment or for a workload that does not actually need sub-millisecond latency.
  • Cluster mode by default. Sharded, multi-node topology stood up when a single small node with sensible TTLs would have done the job for a fraction of the cost.
  • No TTLs and no maxmemory policy. A cache that never evicts is not a cache. It is an expensive memory leak with a hostname.
  • Provisioned outside the paved path. Someone clicks “add Redis” in the console, or copies a Terraform block from another repo. Nobody reviews the sizing until finance asks why the line item tripled.

Here is why this is a platform problem and not a FinOps spreadsheet problem. A siloed cost team sees the ElastiCache line item spike in month three. They can write it up. They can send a Slack message. What they cannot do is change the architecture. A platform team, on the other hand, ships approved Redis patterns: sizing guidance, required TTLs, separate dev and production tiers, automated alerts on memory growth, and clear guidance on when not to reach for Redis at all.

The cache was supposed to make the product faster. Nobody approved the bill for making the company poorer.

What Actually Kept $14M on the Books

The savings from that migration held because the mechanisms that produced them were institutionalized, not improvised. A few mattered more than the rest:

  • Governance from day one. Landing zone standards, consistent tagging, and a deliberate account structure were built into the foundation—not bolted on after the final phase when the mess was already large.
  • Kill lists, not just right-sizing. The biggest wins came from workloads that should never have migrated at all: zombie services, orphaned environments, duplicated tooling. Right-sizing trims; killing eliminates.
  • Architecture choices that compound. Autoscaling defaults, instance family discipline, and awareness of data transfer costs pay off every single day, not once at cutover.
  • Managed service patterns, with Redis as the canonical example. Approved caching tiers, sizing guardrails, and TTL and eviction requirements turn a recurring surprise into a solved problem.
  • A platform team that owns cost as a product concern. Not an analyst with read-only access to billing data, but engineers who ship the defaults, guardrails, and paved paths that make the efficient choice the easy choice.

That last point is the crux. Cost retention requires the same team that owns landing zones, CI/CD, and self-service infrastructure—the people described in Building High-Performing Platform Engineering Teams. If no engineer’s job includes saying no to an expensive default, the default wins.

FinOps Belongs in Platform Engineering, Not a Silo

The most common organizational mistake I see is standing up a FinOps team whose entire job is to look at numbers. It fails, predictably, because watching cost is not the same as changing it.

The silo failure mode looks like this:

  • The FinOps function sits inside finance or a central operations group, publishes monthly reports, and runs reserved-instance and savings-plan spreadsheet exercises.
  • Application and platform teams treat cost as somebody else’s concern until finance escalates.
  • Dashboards exist. Guardrails do not. There is visibility without leverage.

Cost discipline sticks when it is built into platform thinking—the same operating model that already governs how teams deploy, scale, and consume infrastructure. In a platform-integrated model:

  • Cost is a platform product. Teams see their spend in the same portal where they provision infrastructure. Tagging and budgets are enforced at creation time, not audited months later.
  • Guardrails beat gatekeeping. Sensible defaults—instance families, autoscaling bounds, sandbox time-to-live, region policies—are baked into the paved path. Exceptions require justification, rather than good behavior requiring heroics.
  • Redis is the poster child for guardrails. Platform-approved cache templates set a maximum node size, require TTLs, and keep dev and production tiers distinct. Unconstrained ElastiCache provisioning in self-service flows gets flagged or blocked before the bill, not after.
  • FinOps skills live on the platform team. People who understand both unit economics and architecture tradeoffs, who can change a template, a policy, or a cluster configuration—not just send an alert.
  • Finance is a partner, not the owner. Finance sets targets and holds accountability. Platform engineering owns the mechanisms that keep spend aligned with those targets.
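
As a rough illustration of what a guardrail in a self-service flow can look like, here is a minimal Python sketch of a cache request check. Everything in it is an assumption for illustration: the CacheRequest fields, the review_cache_request function, and the node types allowed per tier are hypothetical, and a real platform would read its limits from its own policy source.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical per-tier limits. The node type names are only examples;
# the actual allow-list is a platform team decision.
ALLOWED_NODE_TYPES = {
    "dev": {"cache.t4g.micro", "cache.t4g.small"},
    "prod": {"cache.t4g.small", "cache.t4g.medium", "cache.r6g.large"},
}


@dataclass
class CacheRequest:
    team: str
    tier: str                            # "dev" or "prod"
    node_type: str
    default_ttl_seconds: Optional[int]   # None means the team set no TTL
    maxmemory_policy: str                # Redis maxmemory-policy value
    multi_az: bool = False


def review_cache_request(req: CacheRequest) -> List[str]:
    """Return guardrail violations; an empty list means the request passes."""
    allowed = ALLOWED_NODE_TYPES.get(req.tier)
    if allowed is None:
        return [f"unknown tier '{req.tier}'"]

    violations = []
    if req.node_type not in allowed:
        violations.append(f"{req.node_type} is not approved for tier '{req.tier}'")
    if req.default_ttl_seconds is None or req.default_ttl_seconds <= 0:
        violations.append("a positive default TTL is required")
    if req.maxmemory_policy == "noeviction":
        violations.append("maxmemory-policy must allow eviction")
    if req.tier == "dev" and req.multi_az:
        violations.append("Multi-AZ is not allowed in the dev tier")
    return violations


# Example: an oversized, never-evicting dev cache is flagged before it is built.
risky = CacheRequest("checkout", "dev", "cache.r6g.large", None, "noeviction", True)
for problem in review_cache_request(risky):
    print("BLOCKED:", problem)
```

The point of the design is that the check runs where provisioning happens, so an exception needs a justification instead of a cleanup project months later.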

To be clear, this is not an argument against cost visibility or against partnering with finance. Both are essential. It is an argument against the organizational separation that turns cost management into reporting theater.

The red flags of a siloed model are easy to spot:

  • Cost spikes are treated as finance surprises rather than platform design failures.
  • Monthly reviews happen, but the reviewers have no authority to change defaults, templates, or approval workflows.
  • Application teams cannot act on their own spend data without opening a ticket to another org.
  • The platform team has no cost metric sitting alongside reliability and developer experience.

When cost is a first-class platform outcome rather than a side project for an analyst, it stops being a surprise. That framing is the same one that justifies platform investment in the first place, which I cover in The Business Case for Platform Engineering.

When Cheap Becomes Expensive

Cost and reliability are the same conversation, and treating them as opposites is how organizations talk themselves into expensive mistakes. The cheapest line item is not the cheapest outcome once you account for the total cost of failure.

  • Under-provisioning leads to incidents, then emergency scale-ups that cost more than a right-sized baseline ever would have.
  • Aggressive use of spot and preemptible capacity without a real failover strategy trades a predictable bill for unpredictable outages.
  • Cutting observability to save money lengthens mean time to recovery, and a longer outage is almost always more expensive than the monitoring you removed.
  • Skipping disaster recovery testing is a saving right up until the day it is the most expensive decision the company ever made.

The executive framing is simple: optimize for the total cost of failure, not just the cloud bill. Experiments and aggressive cost moves are healthy when someone owns the tradeoff. The danger is the unowned cost experiment that nobody is accountable for when it fails.
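
To make that framing concrete, here is a small Python sketch that compares two options by cloud spend plus expected outage cost. It is a deliberately simple model, and every number in it is invented for illustration; the Option class and its fields are hypothetical names, not taken from any tool.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    annual_cloud_cost: float      # what shows up on the bill
    incidents_per_year: float     # expected number of outages
    hours_per_incident: float     # expected time to recover
    cost_per_outage_hour: float   # lost revenue, credits, staff time

    def expected_failure_cost(self) -> float:
        """Expected yearly cost of outages under this option."""
        return (self.incidents_per_year
                * self.hours_per_incident
                * self.cost_per_outage_hour)

    def total_cost(self) -> float:
        """Cloud bill plus expected cost of failure."""
        return self.annual_cloud_cost + self.expected_failure_cost()


# Invented numbers: the "lean" option has the smaller bill but more, longer outages.
lean = Option("lean", 400_000, 6, 3, 25_000)
right_sized = Option("right-sized", 520_000, 2, 1, 25_000)

cheapest = min([lean, right_sized], key=lambda o: o.total_cost())
for option in (lean, right_sized):
    print(f"{option.name}: bill={option.annual_cloud_cost:,.0f} "
          f"total={option.total_cost():,.0f}")
print("lowest total cost:", cheapest.name)
```

With these made-up inputs the option with the larger bill has the lower total cost. The model is crude, but it forces someone to write down the incident assumptions, which is most of what owning the tradeoff means.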

The 90-Day Savings Retention Playbook

If you have just finished a migration—or you are six months past one and watching the number drift—here is a concrete plan.

Days 1–30: Baseline and ownership

  • Lock the post-migration spend baseline and document exactly what was decommissioned to produce it.
  • Assign a named owner for the cloud spend trajectory. A person, not a committee.
  • Run a tagging audit and identify the top ten cost drivers by team and service.
  • Run a Redis and ElastiCache audit specifically: find every cluster, check sizing against the actual workload, verify TTL and eviction policies, and flag tier violations such as production-class nodes running in dev. If Redis is your top surprise, this is your fastest win.
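
The Redis audit in particular is easy to script once the inventory exists. The sketch below is one possible shape for it in Python, assuming you have already exported one record per cluster; the record fields, the thresholds, and the function names are all hypothetical. The key counts could come from the keys and expires counters that Redis reports in INFO keyspace, but how you collect them is up to you.

```python
from typing import Dict, List

# Hypothetical thresholds; tune them to your own workloads.
MIN_PEAK_MEMORY_UTILIZATION = 0.30   # below this, the node is likely oversized
MAX_KEYS_WITHOUT_TTL_RATIO = 0.05    # above this, TTL discipline is missing


def audit_cluster(cluster: Dict) -> List[str]:
    """Return audit findings for one cluster record (empty list if clean)."""
    findings = []

    utilization = cluster["peak_used_memory_gb"] / cluster["node_memory_gb"]
    if utilization < MIN_PEAK_MEMORY_UTILIZATION:
        findings.append(f"oversized: peak memory utilization {utilization:.0%}")

    if cluster["total_keys"] > 0:
        no_ttl_ratio = 1 - cluster["keys_with_ttl"] / cluster["total_keys"]
        if no_ttl_ratio > MAX_KEYS_WITHOUT_TTL_RATIO:
            findings.append(f"{no_ttl_ratio:.0%} of keys have no TTL")

    if cluster["environment"] != "prod" and cluster["replicas"] > 0:
        findings.append("non-production cluster is running replicas")

    return findings


def audit_inventory(inventory: List[Dict]) -> Dict[str, List[str]]:
    """Map cluster name to findings, skipping clusters with nothing to report."""
    report = {}
    for cluster in inventory:
        findings = audit_cluster(cluster)
        if findings:
            report[cluster["name"]] = findings
    return report


# Invented inventory records for illustration.
inventory = [
    {"name": "sessions-dev", "environment": "dev", "node_memory_gb": 26.0,
     "peak_used_memory_gb": 2.0, "total_keys": 1000, "keys_with_ttl": 400,
     "replicas": 1},
    {"name": "catalog-prod", "environment": "prod", "node_memory_gb": 6.0,
     "peak_used_memory_gb": 4.0, "total_keys": 1000, "keys_with_ttl": 1000,
     "replicas": 1},
]
report = audit_inventory(inventory)
for name, findings in report.items():
    print(name, "->", "; ".join(findings))
```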

Days 31–60: Platform guardrails, embedded rather than bolted on

  • Ship cost-aware defaults into platform templates: instance sizes, region policies, sandbox expiration, autoscaling bounds, and Redis and cache tier templates.
  • Enforce a pull-request checklist for new services through platform tooling, not a FinOps slide deck.
  • Hold a weekly fifteen-minute review with platform and application leads focused on variance drivers and which guardrail to change—not just which number moved.

Days 61–90: Institutionalize

  • Establish a quarterly architecture review for the top spenders.
  • Stand up a kill-list process for unused resources so cleanup is routine, not heroic.
  • Build an executive dashboard with three numbers only: total spend versus baseline, the top variance drivers, and the percentage of savings retained.
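
Those three numbers are simple enough to compute from a billing export. Here is a minimal Python sketch, assuming monthly spend figures broken down by service; the function names and all figures are invented for illustration. I define savings retained as the share of the original drop (pre-migration spend minus the post-migration baseline) that is still intact at current spend, which is one reasonable definition rather than a standard one.

```python
from typing import Dict, List, Tuple


def savings_retained_pct(pre_migration: float, baseline: float, current: float) -> float:
    """Percentage of the original migration savings still intact."""
    original_savings = pre_migration - baseline
    if original_savings <= 0:
        raise ValueError("baseline must be below pre-migration spend")
    return 100.0 * (pre_migration - current) / original_savings


def top_variance_drivers(
    baseline_by_service: Dict[str, float],
    current_by_service: Dict[str, float],
    n: int = 3,
) -> List[Tuple[str, float]]:
    """Services with the largest spend increase over baseline, largest first."""
    services = set(baseline_by_service) | set(current_by_service)
    deltas = [
        (s, current_by_service.get(s, 0.0) - baseline_by_service.get(s, 0.0))
        for s in services
    ]
    # Sort by increase (descending), then by name so ties are deterministic.
    deltas.sort(key=lambda item: (-item[1], item[0]))
    return deltas[:n]


# Invented monthly figures for illustration.
pre_migration_spend = 500_000.0
baseline = {"compute": 200_000.0, "elasticache": 20_000.0,
            "storage": 60_000.0, "data-transfer": 100_000.0}
current = {"compute": 205_000.0, "elasticache": 45_000.0,
           "storage": 58_000.0, "data-transfer": 102_000.0}

total_baseline = sum(baseline.values())
total_current = sum(current.values())
retained = savings_retained_pct(pre_migration_spend, total_baseline, total_current)
drivers = top_variance_drivers(baseline, current, n=2)

print(f"spend vs baseline: {total_current:,.0f} vs {total_baseline:,.0f}")
print(f"savings retained: {retained:.0f}%")
print("top variance drivers:", drivers)
```

In this invented example a quarter of the dividend is already gone, and most of the drift comes from a single service. That is the kind of answer the dashboard should give in one glance.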

In the migration itself, my north-star question was “Did anyone even know this happened?” Invisibility was the mark of success. The retention version of that question is sharper: would finance notice if we lost the migration savings? If the honest answer is that finance would eventually notice but engineering would not catch it until it was too late, the program is not working yet.

What to Do When Savings Are Already Gone

If the dividend has already evaporated, the answer is not to migrate again. It is to diagnose the sprawl.

  • Run a 30-day forensic review: untagged resources, duplicate services, over-provisioned non-production environments, and misused Redis clusters that are oversized, lack TTLs, or are quietly serving as a primary data store.
  • Decide whether you need outside leadership to reset the operating model or a permanent platform lead to own it going forward. A fractional CTO engagement can be the right way to install the discipline before you hire for it permanently.

Recovery is rarely about a single dramatic fix. It is about restoring the guardrails that should have been there the day after migration.

Conclusion

A migration builds capability. Retention builds culture. The $14M number is credible precisely because it was earned during execution—but the genuinely hard work begins after the project team disbands and the savings have to survive on their own.

Make cost a platform responsibility. Put the guardrails where the work happens. And treat the boring, recurring surprises—Redis chief among them—as the early warning system they are. That is how the savings stay saved.


Want to discuss cloud cost retention, platform engineering operating models, or post-migration FinOps? Connect with me on LinkedIn to continue the conversation.