Commercial Model — PTUs vs. Per-Token


On the direct APIs there's essentially one knob: you pay per token, scaling smoothly from zero. Foundry keeps that — and adds Provisioned Throughput Units (PTUs): you rent a fixed block of model capacity, billed by the hour whether or not you use it. The entire decision is about which curve is cheaper for your traffic shape.

The reframe: you're choosing a cost curve, not a price

Pay-as-you-go (Standard) is a line through the origin — zero traffic costs zero, and cost rises with every token.¹ PTU is a flat horizontal line — you pay $/PTU/hr × PTUs for reserved capacity regardless of token volume; it can't be paused (you stop billing by deleting the deployment).¹ Where those two lines cross is your break-even utilization. Below it, per-token wins; above it, PTU wins.
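The crossover is simple arithmetic. Here is a minimal sketch, assuming made-up placeholder rates (not real Azure prices) and hypothetical function names; it only shows where the two lines meet.

```python
def per_token_cost(tokens: float, price_per_million: float) -> float:
    """Pay-as-you-go: a line through the origin."""
    return tokens / 1_000_000 * price_per_million


def ptu_cost(ptus: int, rate_per_ptu_hour: float, hours: float = 1.0) -> float:
    """PTU: flat cost for deployed capacity, independent of token volume."""
    return ptus * rate_per_ptu_hour * hours


def break_even_tokens_per_hour(ptus: int, rate_per_ptu_hour: float,
                               price_per_million: float) -> float:
    """Hourly token volume at which both billing models cost the same."""
    return ptu_cost(ptus, rate_per_ptu_hour) / price_per_million * 1_000_000


# Placeholder numbers for illustration only, NOT real prices.
PTUS = 100
RATE = 1.0    # $/PTU/hr (hypothetical)
PRICE = 5.0   # $ per 1M tokens (hypothetical, one blended rate)

crossover = break_even_tokens_per_hour(PTUS, RATE, PRICE)
print(f"break-even: {crossover:,.0f} tokens/hour")
```

With these placeholders the lines cross at 20,000,000 tokens per hour. The sketch ignores whether the deployment can actually serve that volume and treats input and output tokens as one blended price, so read it as the shape of the calculation, not a quote.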

[Figure: cost vs. token volume. Pay-as-you-go: cost ∝ tokens. PTU: flat, capacity rented by the hour.]

What a PTU actually is

A Provisioned Throughput Unit is a generic unit of model processing capacity — not tokens, not a specific model.¹ Three properties matter for a decision:

Billed on deployed capacity, not consumption. A 300-PTU deployment costs rate × 300 per hour even at 0% utilization (prorated to the minute; resizing adjusts immediately).¹
Model-independent. Your PTU quota is shared across all supported models in a region/deployment-type — you don't buy PTUs "for GPT-5." Swap models within the pool freely.¹
Buys guarantees. Reserved capacity gives predictable cost and consistent throughput/latency — the thing per-token pricing can't promise under load.
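The first property, billing on deployed capacity with per-minute proration, can be sketched as below. The rate is a placeholder and `ptu_bill` is a hypothetical helper, not an Azure API; note that utilization never appears as an input.

```python
def ptu_bill(segments: list[tuple[int, int]], rate_per_ptu_hour: float) -> float:
    """Bill for deployed PTUs, prorated to the minute.

    segments: (deployed_ptus, minutes_at_that_size) pairs, so a resize
    simply starts a new segment. Utilization is deliberately not a parameter.
    """
    return sum(ptus * rate_per_ptu_hour * minutes / 60
               for ptus, minutes in segments)


RATE = 1.0  # $/PTU/hr, hypothetical placeholder

# 300 PTUs for a full hour, whether the deployment was busy or idle.
full_hour = ptu_bill([(300, 60)], RATE)

# 300 PTUs for 45 minutes, then resized down to 100 PTUs for 15 minutes.
resized = ptu_bill([(300, 45), (100, 15)], RATE)

print(full_hour, resized)
```

With the placeholder rate, the full hour costs 300.0 and the resized hour costs 250.0.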

Two ways to pay for PTUs

Hourly billing

Pay $/PTU/hr with no commitment. Spin up, spin down (by deleting). No term.

Use for: benchmarking a model before committing, or temporary capacity for an event/launch. Not for steady production — it's the expensive way to run PTUs long-term.

Azure Reservations

Commit to a fixed PTU count for a 1-month or 1-year term → a discounted effective $/PTU/hr.¹

Use for: sustained production. Significantly cheaper than running hourly long-term. A reservation is matched to deployments by deployment type, region, and scope (not by model); any deployed PTUs beyond the reserved count bill at the hourly rate.
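A rough sketch of how a reservation and hourly overage combine over a month. Both rates are invented placeholders, 730 hours per month is an assumed approximation, and `monthly_ptu_cost` is a hypothetical helper.

```python
HOURS_PER_MONTH = 730  # assumed approximation of hours in a month


def monthly_ptu_cost(deployed_ptus: int, reserved_ptus: int,
                     hourly_rate: float, reserved_rate: float,
                     hours: int = HOURS_PER_MONTH) -> float:
    """Reserved PTUs are paid at the discounted rate for the whole term,
    deployed or not; PTUs deployed beyond the reservation bill hourly."""
    overage = max(deployed_ptus - reserved_ptus, 0)
    return (reserved_ptus * reserved_rate + overage * hourly_rate) * hours


HOURLY = 1.0     # $/PTU/hr, hypothetical
RESERVED = 0.75  # effective $/PTU/hr under a reservation, hypothetical

all_hourly = monthly_ptu_cost(120, 0, HOURLY, RESERVED)
with_reservation = monthly_ptu_cost(120, 100, HOURLY, RESERVED)
stranded = monthly_ptu_cost(0, 100, HOURLY, RESERVED)  # reserved, nothing deployed

print(all_hourly, with_reservation, stranded)
```

Under these placeholders, 120 PTUs run hourly cost 87600.0 for the month, the same deployment with 100 PTUs reserved costs 69350.0, and a reservation with nothing deployed against it still costs 54750.0.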

The non-obvious operational rule

Always create the deployment first, then buy the reservation to cover it.¹ Why: a reservation is a financial instrument; it does not guarantee capacity. Capacity is a separate, finite, shifting resource. If you commit to PTUs you haven't confirmed you can deploy, you can end up paying for a reservation with nowhere to spend it. (Reservations are also flexible: they can be scoped to a resource group, subscription, or management group, consolidated for Global deployments, and canceled, possibly with fees.)¹

The break-even, in plain terms

You don't need exact prices to reason about this (and they change constantly — always pull the live pricing page for a real quote). The shape is what to internalize:

Low or spiky volume, lots of idle time → per-token. You'd be paying for reserved PTUs that sit unused. The flat line is above the rising line.
High, steady, predictable volume → PTU + reservation. You're past the crossover; the flat line sits below where per-token would have climbed — and you also get latency guarantees.
The reservation lowers the flat line further, pushing the break-even point left — so committed, sustained workloads cross into "PTU cheaper" sooner.
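The three cases above can be checked against concrete traffic shapes. This is a sketch under stated assumptions: all rates are placeholders, the reserved rate is an effective hourly figure applied to a one-day window purely for comparison, and the deployment is assumed able to serve every hour's volume.

```python
def compare(hourly_tokens: list[int], ptus: int, hourly_rate: float,
            reserved_rate: float, price_per_million: float):
    """Return the cheapest option and all three costs for a traffic profile."""
    hours = len(hourly_tokens)
    costs = {
        "per-token": sum(hourly_tokens) / 1_000_000 * price_per_million,
        "ptu-hourly": hours * ptus * hourly_rate,
        "ptu-reserved": hours * ptus * reserved_rate,
    }
    return min(costs, key=costs.get), costs


# Hypothetical placeholders, not real prices.
PTUS, HOURLY, RESERVED, PRICE = 100, 1.0, 0.75, 5.0

M = 1_000_000
spiky = [40 * M] * 2 + [0] * 22   # two busy hours, idle the rest of the day
middle = [18 * M] * 24            # steady, between the two break-evens
steady = [30 * M] * 24            # steady and high

for name, profile in [("spiky", spiky), ("middle", middle), ("steady", steady)]:
    winner, costs = compare(profile, PTUS, HOURLY, RESERVED, PRICE)
    print(name, winner, costs)
```

The middle profile is the interesting one: per-token (2160.0) still beats hourly PTUs (2400.0), but the reserved rate (1800.0) undercuts both, which is the break-even point moving left.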

Per-model caveat (the thread from axes 1 & 3)

PTUs are a feature of "models sold by Azure" (Azure OpenAI, Llama, DeepSeek…). Claude via Foundry sits in the other tier: it is a partner model, not one sold directly by Azure, and it bills through the Microsoft Marketplace at Anthropic's standard per-token API prices, so the PTU lever doesn't apply to it.² If a cost model leans on PTU savings, confirm the model you're betting on is actually PTU-eligible. The same two-tier split that governs compliance (axis 1) governs pricing.
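One way to keep this caveat from slipping through is a guard at the top of the cost model. The eligibility map below is a hypothetical stand-in with invented model keys; populate it from the current Foundry documentation, not from this sketch.

```python
# Hypothetical eligibility map. The keys are illustrative labels, not real
# deployment names; verify each entry against the live documentation.
PTU_ELIGIBLE = {
    "azure-sold-model-a": True,
    "claude-via-foundry": False,  # per the caveat above: per-token only
}


def assert_ptu_savings_allowed(model: str) -> None:
    """Fail fast if a cost model assumes PTU pricing for an ineligible model.

    Unknown models are treated as ineligible until someone confirms otherwise.
    """
    if not PTU_ELIGIBLE.get(model, False):
        raise ValueError(
            f"{model!r} is not confirmed PTU-eligible; "
            "model its cost per token instead"
        )


for candidate in ("azure-sold-model-a", "claude-via-foundry"):
    try:
        assert_ptu_savings_allowed(candidate)
        print(candidate, "-> PTU pricing may be assumed")
    except ValueError as err:
        print("flag:", err)
```

Defaulting unknown models to ineligible is a deliberate choice here: the failure mode to avoid is banking a discount that does not exist.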

Decision table

Per-token = the model you already know from the direct APIs. PTU = the Azure-specific addition.
Dimension	Pay-as-you-go per-token	PTU reserved capacity
You pay for	Tokens consumed	Capacity deployed (per PTU/hr)
Cost at idle	Zero	Full (can't pause; delete to stop)
Cost predictability	Variable with usage	Fixed / known
Throughput & latency	Best-effort, shared	Guaranteed, reserved
Commitment	None	None (hourly) or 1-mo/1-yr (reservation)
Discount lever	—	Azure Reservation (term commitment)
Applies to Claude?	Yes (Marketplace, Anthropic prices)	No (Claude is a partner model, not sold by Azure)
Best when…	Spiky/low volume, dev/test, exploration	Steady high volume needing cost + latency certainty

Check yourself — 3 scenarios


1. A startup runs unpredictable, bursty traffic — busy some hours, idle most. Which billing model, and why?

Idle-heavy, spiky load lives left of the break-even point: per-token (which costs zero when idle) beats paying for reserved capacity you're not using.

2. An ops lead buys a 1-year PTU reservation before creating any deployment, to "lock in the discount early." What's the risk?

The risk is paying for a year of reserved PTUs with nowhere to deploy them. Reservations are purely financial; capacity is a separate, finite resource. Microsoft's explicit guidance: create the deployment first to confirm capacity, then buy the reservation to cover it.

3. A cost model for a new product assumes big PTU-reservation savings — and the product is standardizing on Claude. What do you flag?

The two-tier split strikes again: PTU economics are a "models sold by Azure" feature. A Claude-based cost model can't bank PTU savings; it pays Anthropic's standard token rates.

That's all four axes — what now?

You've now mapped the whole landscape: the decision map, build tooling & agents, enterprise & compliance, and the commercial model. The strongest next step for a decision mission is to run a real scenario through all four — bring me an actual org/workload and we'll produce the recommendation together, end-to-end.