You know per-token billing from the direct APIs. Foundry adds a second mode — renting capacity by the hour — that changes the cost calculus for steady, high-volume workloads.
On the direct APIs there's essentially one knob: you pay per token, scaling smoothly from zero. Foundry keeps that — and adds Provisioned Throughput Units (PTUs): you rent a fixed block of model capacity, billed by the hour whether or not you use it. The entire decision is about which curve is cheaper for your traffic shape.
Pay-as-you-go (Standard) is a line through the origin — zero traffic costs zero, and cost rises with every token.1 PTU is a flat horizontal line — you pay $/PTU/hr × PTUs for reserved capacity regardless of token volume; it can't be paused (you stop billing by deleting the deployment).1 Where those two lines cross is your break-even utilization. Below it, per-token wins; above it, PTU wins.
A Provisioned Throughput Unit is a generic unit of model processing capacity — not tokens, not a specific model.1 Three properties matter for a decision:
rate × 300 per hour even at 0% utilization (prorated to the minute; resizing adjusts immediately).1Pay $/PTU/hr with no commitment. Spin up, spin down (by deleting). No term.
Commit to a fixed PTU count for a 1-month or 1-year term → a discounted effective $/PTU/hr.1
Always create the deployment first, then buy the reservation to cover it.1 Why: a reservation is a financial instrument — it does not guarantee capacity. Capacity is a separate, finite, shifting resource. Commit to PTUs you haven't confirmed you can deploy and you can end up paying for a reservation with nowhere to spend it. (Reservations are also flexible: scoped to RG/sub/management-group, consolidatable for Global deployments, cancelable with possible fees.)1
You don't need exact prices to reason about this (and they change constantly — always pull the live pricing page for a real quote). The shape is what to internalize:
PTUs are a feature of "models sold by Azure" (Azure OpenAI, Llama, DeepSeek…). Claude via Foundry is an "Azure Direct Model" — it bills through the Microsoft Marketplace at Anthropic's standard per-token API prices, so the PTU lever doesn't apply to it.2 If a cost model leans on PTU savings, confirm the model you're betting on is actually PTU-eligible. The same two-tier split that governs compliance (axis 1) governs pricing.
| Dimension | Pay-as-you-go per-token | PTU reserved capacity |
|---|---|---|
| You pay for | Tokens consumed | Capacity deployed (per PTU/hr) |
| Cost at idle | Zero | Full (can't pause; delete to stop) |
| Cost predictability | Variable with usage | Fixed / known |
| Throughput & latency | Best-effort, shared | Guaranteed, reserved |
| Commitment | None | None (hourly) or 1-mo/1-yr (reservation) |
| Discount lever | — | Azure Reservation (term commitment) |
| Applies to Claude? | Yes (Marketplace, Anthropic prices) | No — Claude is a direct model |
| Best when… | Spiky/low volume, dev/test, exploration | Steady high volume needing cost + latency certainty |
Judgement, instant feedback.
1. A startup runs unpredictable, bursty traffic — busy some hours, idle most. Which billing model, and why?
2. An ops lead buys a 1-year PTU reservation before creating any deployment, to "lock in the discount early." What's the risk?
3. A cost model for a new product assumes big PTU-reservation savings — and the product is standardizing on Claude. What do you flag?
You've now mapped the whole landscape: the decision map, build tooling & agents, enterprise & compliance, and the commercial model. The strongest next step for a decision mission is to run a real scenario through all four — bring me an actual org/workload and we'll produce the recommendation together, end-to-end.