Why your Azure VMs cost 2x more than they should
Most Azure subscriptions run VMs at p95 CPU under 40%. The deterministic ruleset for right-sizing them, the script that finds them, and the commit math.
Open the Cost Analysis blade for any Azure subscription that has been running production for more than a year. Sort by resource type descending. Virtual Machines are almost always at the top. Inside that line item, three or four SKUs are almost always doing 80% of the spend.
Now pull 30 days of CPU metrics for those VMs. The 95th percentile is usually under 40%. The median is usually under 15%. You are paying for headroom you never use.
Right-sizing decisions should follow a deterministic ruleset: p50/p95 thresholds, SKU-swap rules, an eviction-tolerance matrix. Vibes lose. An LLM cost agent that “looks at your bill and suggests savings” loses. Rules with traceable math win, because the next engineer can read them, the auditor can verify them, and the change can be reverted when it is wrong. The rest of this post is the ruleset, the script that surfaces candidates against it, and the commit math you apply after — in that order, never the other way around.
Why VMs end up oversized
Three patterns produce almost every oversized VM.
The “we’ll just match prod” dev environment. Someone provisions Standard_D8s_v5 for prod, then copies the Terraform module into dev/ and staging/ without changing the SKU. Dev sits at 3% CPU forever. The cost is 3x what it should be, because you are paying for three identical environments instead of one big and two small.
The “what if there’s a spike” buffer. The original engineer sized for the worst load they could imagine. Black Friday. An unannounced product launch. The database migration. The spike never came, or it came once and the VM handled it at 60%. The SKU never came back down.
The “VMSS scaled up and never scaled down” ratchet. Auto-scale triggered on a CPU spike at 2 a.m. last June. The cool-down logic was wrong, or the scale-in rule was missing, and the scale set stayed at 8 instances. Nobody noticed, because the dashboard only shows current state, not history.
None of these are mistakes anyone got fired for. They are the natural gravity of cloud infra under a team that does not have a dedicated person watching it. Right-sizing is not a one-time project. It is a maintenance task that has to live somewhere, run on a schedule, and produce a record of what it decided.
The B/D/E choice is a rule, not a judgment call
Microsoft’s general-purpose families overlap in confusing ways. The decision is deterministic once you commit to the inputs.
- B-series (burstable). You earn CPU credits while running below baseline (typically 20–40% of one vCPU) and spend them when you burst. Correct for workloads where the average is low but occasional spikes are normal: web tier, dev environments, internal tools, build agents that idle most of the day. Roughly 40–55% cheaper than the equivalent D-series for the same vCPU count.
- D-series (general purpose). Sustained baseline performance, no credit system. Correct when p50 CPU is consistently above 30%: databases under steady load, application servers serving real traffic, anything that would chew through credits in an hour.
- E-series (memory optimized). 8 GB RAM per vCPU instead of D's 4 GB. Correct for in-memory caches, JVM heap-heavy apps, Postgres with shared_buffers cranked up. Wrong for general workloads — you are paying for memory you will not use.
The mistake we see most often: D-series running at 5% CPU because someone read a 2019 blog post that said “B-series is for dev, D-series is for prod.” That advice was true when burstable credit caps were lower. In 2026, B-series goes up to Standard_B20ms (20 vCPU, 80 GB RAM). Comfortably production-grade for any web tier that is not sustained-CPU-bound.
The other mistake: E-series for “we might need the memory someday.” If current workload uses 30% of available RAM, you are paying double for headroom you will outgrow before you fill.
Write the rule down once. Apply it on every VM, every week. Stop relitigating B vs. D in Slack.
The signal that picks the candidates
The cleanest signal is 30-day p95 CPU plus 30-day p95 memory, joined against current SKU. Anything where both p95s are below 40% is a right-sizing candidate. Anything where p95 CPU is above 80% is an upsizing candidate — fix that first, because performance issues are more expensive than over-provisioning.
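Stated as code, the rule fits on one screen. A minimal sketch, assuming p50/p95 CPU and p95 memory percentages are already computed for a VM; the function name and output labels are illustrative, not part of the script below:

recommend() {
  # Inputs: p50 CPU %, p95 CPU %, p95 memory % over the 30-day window.
  local p50_cpu="$1" p95_cpu="$2" p95_mem="$3"
  awk -v p50="$p50_cpu" -v p95="$p95_cpu" -v mem="$p95_mem" 'BEGIN {
    if (p95 > 80)                  { print "UPSIZE: fix performance before cost" }
    else if (p95 < 40 && mem < 40) {
      if (p50 < 30) { print "DOWNSIZE: B-series candidate" }
      else          { print "DOWNSIZE: smaller SKU, same family" }
    }
    else { print "LEAVE: within thresholds" }
  }'
}

recommend 8.2 23.4 31.0   # prints the B-series line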
The script below pulls 30 days of CPU metrics from Azure Monitor for every running VM in a subscription, computes p50/p95, and prints candidates sorted against current SKU. Save as right-size-vms.sh:
#!/usr/bin/env bash
set -euo pipefail

# Requires: az CLI (logged in), jq, GNU date (on macOS, install coreutils and use gdate).
SUBSCRIPTION_ID="${SUBSCRIPTION_ID:?set SUBSCRIPTION_ID env var}"
DAYS="${DAYS:-30}"
END=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
START=$(date -u -d "$DAYS days ago" +"%Y-%m-%dT%H:%M:%SZ")

echo "Pulling VMs in $SUBSCRIPTION_ID..."
# --show-details adds powerState, so stopped and deallocated VMs can be filtered out.
az vm list \
  --subscription "$SUBSCRIPTION_ID" \
  --show-details \
  --query "[?powerState=='VM running'].{name:name, rg:resourceGroup, id:id, size:hardwareProfile.vmSize}" \
  -o json > /tmp/vms.json

jq -c '.[]' /tmp/vms.json | while read -r vm; do
  name=$(echo "$vm" | jq -r '.name')
  id=$(echo "$vm" | jq -r '.id')
  size=$(echo "$vm" | jq -r '.size')

  # Hourly CPU averages over the window; drop blank samples, skip VMs with no data.
  samples=$(az monitor metrics list \
    --resource "$id" \
    --metric "Percentage CPU" \
    --interval PT1H \
    --start-time "$START" \
    --end-time "$END" \
    --aggregation Average \
    --query "value[0].timeseries[0].data[].average" \
    -o tsv 2>/dev/null | grep -v '^$' || true)
  if [[ -z "$samples" ]]; then
    continue
  fi

  # Nearest-rank p50/p95 over the sorted samples.
  read -r p50 p95 < <(echo "$samples" | sort -n | awk '
    { a[NR]=$1 }
    END {
      if (NR == 0) { print "0 0"; exit }
      p50_idx = int(NR * 0.50); if (p50_idx < 1) p50_idx = 1
      p95_idx = int(NR * 0.95); if (p95_idx < 1) p95_idx = 1
      printf "%.1f %.1f\n", a[p50_idx], a[p95_idx]
    }
  ')

  printf "%-35s %-22s p50=%5s%% p95=%5s%%\n" "$name" "$size" "$p50" "$p95"
done
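Run it with the subscription ID as an environment variable (the GUID below is a placeholder):

chmod +x right-size-vms.sh
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000" ./right-size-vms.sh
# Longer window for seasonal workloads:
DAYS=60 SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000" ./right-size-vms.sh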
Sample output from a real subscription we ran this against:
api-prod-01                         Standard_D4s_v5        p50=  8.2% p95= 23.4%
api-prod-02                         Standard_D4s_v5        p50=  9.1% p95= 25.1%
db-prod                             Standard_E8s_v5        p50= 41.0% p95= 68.2%
worker-batch-01                     Standard_D8s_v5        p50=  3.4% p95= 91.2%
build-agent-01                      Standard_D4s_v5        p50=  1.1% p95=  6.5%
build-agent-02                      Standard_D4s_v5        p50=  0.9% p95=  5.2%
Reading those rows against the rule:
- api-prod-01/02 — p95 of 23–25%. Move to Standard_D2s_v5 with 50% headroom, or to Standard_B4ms if traffic is spiky enough that B-series credits keep up. Roughly $70/mo savings per VM either way.
- db-prod — p50 of 41%, p95 of 68%. Sustained load. Leave alone.
- worker-batch-01 — low p50, high p95. Batch workload that spikes during runs. Do not downsize without knowing how long those spikes last. Bursting on a smaller SKU might double job duration. Spot-priced Standard_D8s is the better lever, which cuts cost 60–80% if the workload tolerates eviction.
- build-agent-01/02 — p95 of 5–6%. Textbook B-series candidates. Standard_B4ms would save ~$60/mo each, and these workloads are exactly what burstable credits were designed for.
The rule made the call on every row in seconds. No one had to “have an opinion” about api-prod-01.
Do not size on CPU alone
CPU is the easy metric, because Azure Monitor collects it natively for every VM. Memory requires the Azure Monitor Agent (or the legacy Log Analytics agent) installed on the guest OS. If it is not installed, install it before making any sizing decision. Moving an E-series VM down to D-series purely on CPU evidence is how you cause an OOM at 2 a.m.
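For a Linux VM, the install itself is one CLI call; the resource names below are placeholders, and the agent still needs a data collection rule pointed at a workspace before guest metrics flow:

az vm extension set \
  --resource-group my-rg \
  --vm-name my-vm \
  --name AzureMonitorLinuxAgent \
  --publisher Microsoft.Azure.Monitor \
  --enable-auto-upgrade true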
If the agent is in place, the same script works with --metric "Available Memory Bytes" and a small adjustment to convert to a percentage against the SKU’s documented memory size. The same p50/p95 thresholds apply. The same rule decides.
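Inside the loop, the adjustment is one awk conversion before the percentile step. A sketch, where MEM_BYTES is the SKU's documented RAM looked up per size; the 16 GiB here matches a Standard_D4s_v5 and is illustrative:

# After swapping --metric "Percentage CPU" for --metric "Available Memory Bytes":
MEM_BYTES=$((16 * 1024 * 1024 * 1024))   # illustrative: Standard_D4s_v5 has 16 GiB
used=$(echo "$samples" | awk -v total="$MEM_BYTES" \
  '{ printf "%.1f\n", (1 - $1 / total) * 100 }')
# "$used" then feeds the same sort-and-percentile pipeline as CPU.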
Commit timing is also a rule
Once the fleet is right-sized, the next lever is committed-use discounts. Azure offers two mechanisms in 2026, savings plans and reserved instances, each on one- and three-year terms, with pay-as-you-go as the zero-commitment baseline:
| Option | Discount | Flexibility | Best for |
|---|---|---|---|
| Pay-as-you-go | 0% | Full | Bursty, unpredictable, pre-product-market-fit |
| 1-year Savings Plan | 11–15% | Any VM family, any region (compute only) | Most production workloads |
| 3-year Savings Plan | 28–35% | Any VM family, any region (compute only) | Steady baseline you are confident about |
| 1-year Reserved Instance | 30–40% | Locked to specific SKU + region | Database VMs you will not move |
| 3-year Reserved Instance | 50–65% | Locked to specific SKU + region | Long-term, pinned-architecture workloads |
The savings plan is what most teams should reach for first. Reserved Instances look cheaper on the per-hour rate, but the SKU lock-in creates a perverse incentive. You bought a 3-year RI on Standard_D8s, now your workload should be on Standard_B4ms, but you keep the D8s “to use the reservation.” That is not savings. That is sunk cost wearing a discount.
The honest math, for a typical dev-shop production workload:
- Right-sizing alone: 30–50% off the VM bill.
- Right-sizing + 1-year savings plan on the right-sized baseline: another 12% off.
- Right-sizing + 3-year savings plan: another 30% off.
Total realistic compounded savings on the VM bill: 50–65%. The right-sizing is the bigger lever.
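Worked through on a hypothetical $10,000/month VM bill: right-sizing at 40% takes it to $6,000, and a 3-year savings plan at 30% on that baseline takes it to $4,200. That is 58% off the original bill, and $4,000 of the $5,800 saved each month came from the right-sizing, not the commitment.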
The rule that should never bend: right-size first, commit against the post-right-sizing baseline, never before. Teams that commit before they right-size are using a 3-year contract to lock in their own waste.
Spot is a tolerance matrix, not a vibe
Spot VMs are 60–80% cheaper than on-demand and can be evicted with 30 seconds' notice. The eviction is the catch: your workload has to tolerate it. The decision is again a rule, not a judgment call. Write the matrix once:
Tolerates eviction:
- Build agents (the build re-runs).
- Batch jobs with checkpointing.
- Stateless web tier where a load balancer drains the node.
- Dev environments (you do not care if it dies on a Sunday).
Does not:
- Databases of any kind.
- Anything stateful without proper drain logic.
- Workloads where eviction during a job costs more than the savings.
In a typical Azure shop, 20–40% of the VM fleet could be Spot and is not. The friction is mostly cultural — “what if it gets evicted” — which is solvable with a five-minute writeup of which workloads sit on which side of the matrix. Put the matrix in the IaC repo next to the SKU-swap rules. Reference it in PR reviews. Stop relitigating it.
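For a workload on the tolerant side of the matrix, the switch is a priority flag, not an architecture change. A CLI sketch with placeholder names; --max-price -1 means pay up to the on-demand rate rather than a fixed cap, so eviction happens only for capacity:

az vm create \
  --resource-group build-rg \
  --name build-agent-03 \
  --image Ubuntu2204 \
  --size Standard_D4s_v5 \
  --priority Spot \
  --eviction-policy Deallocate \
  --max-price -1

The durable version of the change belongs in the Terraform, where azurerm_linux_virtual_machine exposes the same knobs as priority, eviction_policy, and max_bid_price.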
What to do this week
- Today: run the script above against your largest subscription. Sort the output by current monthly spend (multiply the hourly rate by 730; a price-lookup sketch follows this list).
- This week: pick the five worst offenders. For each, apply the rule — downsize within the same family, switch to B-series, or switch to Spot. Push the change through your normal IaC review process. Do not resize via the portal. That is the same drift pattern we wrote about in why we built TwoOps, where real cloud state diverges from the Terraform and nobody notices until an audit.
- This month: install the Azure Monitor Agent on any VM you are considering moving to a smaller memory tier.
- This quarter: once right-sizing changes have been stable for 30 days and the new baseline is the new normal, then shop for a savings plan against that baseline. Not before.
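The hourly rate for that first sort comes from the Azure Retail Prices API, which is public and needs no auth. A sketch for one SKU in one region; adjust armSkuName and armRegionName, and the jq filter that drops Windows and Spot line items is illustrative:

curl -s "https://prices.azure.com/api/retail/prices?\$filter=armSkuName eq 'Standard_D4s_v5' and armRegionName eq 'eastus' and serviceName eq 'Virtual Machines' and priceType eq 'Consumption'" \
  | jq '[.Items[]
         | select((.productName | test("Windows") | not)
              and (.skuName | test("Spot|Low Priority") | not))][0]
         | {skuName, retailPrice, unitOfMeasure}'

Multiply retailPrice by 730 for the monthly figure.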
The order matters. Right-size first, commit second, optimize-with-Spot third. Teams that do it backwards lock in waste with a 3-year reservation and have nowhere to go.
Applying the rules at scale
The script in this post is the easy version of the ruleset. It runs once, against one subscription, on CPU alone, and prints a list a human still has to act on. That is fine for a Tuesday afternoon. It is not fine as the answer to a recurring maintenance task across a multi-subscription estate.
The production version of this work has to handle subscriptions with thousands of VMs without rate-limiting Azure Monitor, factor in memory and disk IOPS alongside CPU, account for time-of-day patterns (a build agent idle 18 hours a day is a different recommendation than one idle for a week), and survive owners who ignore notifications until the recommendation is stale. It has to generate the SKU-swap PR against the IaC repo where the VM is defined, because the portal is not the source of truth. It has to track realized vs. predicted savings so the ledger shows what right-sizing actually banked, not what it claimed it would.
That is the loop TwoOps runs continuously. The rules in this post are the rules it applies. The script in this post is a single-pass version of one query it runs every day. We built it because the alternative — a team running the ruleset by hand once a quarter — is the same maintenance task that does not survive the engineer who set it up. Determinism is only useful if it keeps running after the person who wrote it moves teams. That is the gap we are closing in the deterministic-AI pillar: rules you can read, traceable math, and a pipeline that applies both without a human in the hot path.
If running the script feels like a good Tuesday afternoon project, do that. If the underlying problem feels bigger than what a script can hold, tell us what you are running and we will tell you whether TwoOps or a Twofold engagement is the right shape.
Conclusion
Right-sizing is not a hunch. It is a ruleset: p50/p95 thresholds against current SKU, a B/D/E decision tree, an eviction-tolerance matrix, and a commit-timing rule that refuses to lock in waste. Run the script, read the rows against the rules, push the changes through IaC, then commit against the new baseline. That order is the whole post.
The deterministic frame is the part that scales past the first run. Vibes do not survive a personnel change. An LLM cost agent without a ruleset cannot be audited. Rules with traceable math can be both. When the ruleset becomes a pipeline, the savings show up on a ledger instead of a slide. If you want help getting there, start a conversation.
FAQ
- Why not let an LLM cost agent suggest savings?
- A model without a ruleset can recommend the same wrong thing on Monday that it recommended on Friday, with no audit trail and no traceable math. The ruleset in this post is what the model ought to be applying. Once it is written down, you do not need the model to apply it.
- Does this work on AWS or GCP?
- The same shape works. Replace Azure SKU families with EC2 instance families, Azure Monitor with CloudWatch, savings plans with their AWS or GCP equivalents. The deterministic frame is portable. TwoOps is Azure-only today; AWS and GCP are on the roadmap.
- What about reserved instances on database VMs?
- Database VMs are usually the right place for a 1- or 3-year RI because the SKU does not move. Right-size the database first (it is the most common E-series misallocation), then commit against the post-right-sizing SKU.
- How often should we re-run this?
- Monthly is plenty for steady fleets. Weekly is right for teams shipping infrastructure changes daily. TwoOps runs the equivalent continuously because the cost of the run is zero and the cost of waiting is observable on the bill.
Related from the lab
We built TwoOps because nothing else fit
Why a two-person AI lab built its own cloud ops platform for Azure: detect, explain, act — AI-native by design, priced for teams without a platform engineer.
Detecting Terraform drift in Azure: a practical guide
Three flavors of drift, what terraform plan actually catches, the free tools that fill the gaps, and the point at which the homemade approach breaks down.