Azure AI · Axis 4 — Commercial model

PTUs vs. per-token

You know per-token billing from the direct APIs. Foundry adds a second mode — renting capacity by the hour — that changes the cost calculus for steady, high-volume workloads.

On the direct APIs there's essentially one knob: you pay per token, scaling smoothly from zero. Foundry keeps that — and adds Provisioned Throughput Units (PTUs): you rent a fixed block of model capacity, billed by the hour whether or not you use it. The entire decision is about which curve is cheaper for your traffic shape.

The reframe: you're choosing a cost curve, not a price

Pay-as-you-go (Standard) is a line through the origin — zero traffic costs zero, and cost rises with every token.1 PTU is a flat horizontal line — you pay $/PTU/hr × PTUs for reserved capacity regardless of token volume; it can't be paused (you stop billing by deleting the deployment).1 Where those two lines cross is your break-even utilization. Below it, per-token wins; above it, PTU wins.

[Figure: monthly cost versus utilization (sustained throughput). Pay-as-you-go cost is proportional to tokens; PTU cost is flat, with capacity rented by the hour. Per-token is cheaper left of the break-even utilization; PTU is cheaper right of it.]
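
The crossover of the two curves can be written out explicitly. This is a sketch under simplifying assumptions that are not in the original: a single blended per-token price p, a PTU count N at an hourly rate r, H billable hours per month, and a deployment that can process at most T_max tokens per month. All four symbols are illustrative.

```latex
\begin{align*}
C_{\text{per-token}}(T) &= p\,T
  && \text{line through the origin, } T = \text{tokens per month} \\
C_{\text{PTU}} &= r\,N\,H
  && \text{flat, independent of } T \\
p\,T^{*} = r\,N\,H \;&\Longrightarrow\; T^{*} = \frac{r\,N\,H}{p}
  && \text{break-even token volume} \\
u^{*} &= \frac{T^{*}}{T_{\max}} = \frac{r\,N\,H}{p\,T_{\max}}
  && \text{break-even utilization}
\end{align*}
```

If u* comes out above 1 under these assumptions, the deployment cannot process enough tokens to reach the crossover, so per-token stays cheaper at any achievable load.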

What a PTU actually is

A Provisioned Throughput Unit is a generic unit of model processing capacity — not tokens, not a specific model.1 Three properties matter for a decision:

Two ways to pay for PTUs

Hourly billing

Pay $/PTU/hr with no commitment. Spin up, spin down (by deleting). No term.

Use for: benchmarking a model before committing, or temporary capacity for an event/launch. Not for steady production — it's the expensive way to run PTUs long-term.

Azure Reservations

Commit to a fixed PTU count for a 1-month or 1-year term → a discounted effective $/PTU/hr.1

Use for: sustained production. Significantly cheaper than long-term hourly. The discount is matched by deployment type + region + scope (not by model); PTUs deployed beyond the reserved count bill at the hourly rate.

The non-obvious operational rule

Always create the deployment first, then buy the reservation to cover it.1 Why: a reservation is a financial instrument; it does not guarantee capacity. Capacity is a separate, finite, shifting resource. If you commit to PTUs you haven't confirmed you can deploy, you can end up paying for a reservation with nowhere to spend it. (Reservations are also flexible: they can be scoped to a resource group, subscription, or management group, consolidated for Global deployments, and canceled, possibly with fees.)1
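
To see why the ordering matters, here is a minimal sketch of how a reservation interacts with what is actually deployed. The function name and both rates are hypothetical placeholders, not Azure prices, and the model is deliberately simplified: the reservation is paid in full every hour, and any PTUs deployed beyond the reserved count bill at the hourly rate.

```python
def hourly_bill(deployed_ptus: int, reserved_ptus: int,
                hourly_rate: float, reserved_rate: float) -> dict:
    """Split one hour of PTU spend into covered, overage, and unused parts.

    Simplified model (an assumption, not Azure's billing engine):
    - the reservation costs reserved_ptus * reserved_rate whether or not
      any deployment exists to consume it;
    - deployed PTUs beyond the reserved count bill at hourly_rate.
    """
    covered = min(deployed_ptus, reserved_ptus)
    overage = max(deployed_ptus - reserved_ptus, 0)
    unused = max(reserved_ptus - deployed_ptus, 0)
    cost = reserved_ptus * reserved_rate + overage * hourly_rate
    return {
        "covered": covered,
        "overage": overage,
        "unused_reserved": unused,
        "cost_per_hour": cost,
    }


if __name__ == "__main__":
    # Placeholder rates in $/PTU/hr, chosen only to make the arithmetic easy.
    HOURLY, RESERVED = 1.00, 0.75

    # Reservation bought first, capacity never obtained: all of it is wasted.
    print(hourly_bill(0, 100, HOURLY, RESERVED))
    # Deployment confirmed first, reservation sized to match it.
    print(hourly_bill(100, 100, HOURLY, RESERVED))
    # Deployment grew past the reservation: the extra 20 PTUs bill hourly.
    print(hourly_bill(120, 100, HOURLY, RESERVED))
```

The first case is the failure mode described above: the reservation costs the same per hour as in the second case, but nothing is deployed to use it.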

The break-even, in plain terms

You don't need exact prices to reason about this (and they change constantly — always pull the live pricing page for a real quote). The shape is what to internalize:
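
A minimal sketch of that shape in code, assuming a single blended per-token price (real pricing separates input and output tokens) and roughly 730 billable hours in a month. Every number below is a made-up placeholder, not a quote; the function names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a calendar month


def monthly_ptu_cost(ptus: int, rate_per_ptu_hour: float) -> float:
    """Flat line: reserved capacity costs the same at any token volume."""
    return ptus * rate_per_ptu_hour * HOURS_PER_MONTH


def monthly_token_cost(tokens: float, price_per_million: float) -> float:
    """Line through the origin: cost is proportional to tokens consumed."""
    return tokens / 1_000_000 * price_per_million


def break_even_tokens(ptus: int, rate_per_ptu_hour: float,
                      price_per_million: float) -> float:
    """Monthly token volume at which the two curves cross."""
    return monthly_ptu_cost(ptus, rate_per_ptu_hour) / price_per_million * 1_000_000


def cheaper_option(tokens: float, ptus: int, rate_per_ptu_hour: float,
                   price_per_million: float) -> str:
    """Return 'ptu' only when the flat cost undercuts the per-token cost."""
    ptu = monthly_ptu_cost(ptus, rate_per_ptu_hour)
    paygo = monthly_token_cost(tokens, price_per_million)
    return "ptu" if ptu < paygo else "per-token"


if __name__ == "__main__":
    # Placeholders: 100 PTUs at $1.00/PTU/hr versus $5.00 per million tokens.
    PTUS, RATE, PRICE = 100, 1.0, 5.0
    print(break_even_tokens(PTUS, RATE, PRICE))               # 14600000000.0
    print(cheaper_option(1_000_000_000, PTUS, RATE, PRICE))   # per-token
    print(cheaper_option(20_000_000_000, PTUS, RATE, PRICE))  # ptu
```

Swap in figures from the live pricing page and your own measured monthly volume; the comparison is only meaningful if the PTU count you enter can actually serve that volume.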

Per-model caveat (the thread from axes 1 & 3)

PTUs are a feature of "models sold by Azure" (Azure OpenAI, Llama, DeepSeek…). Claude via Foundry sits in the other tier: it is a partner model that bills through the Microsoft Marketplace at Anthropic's standard per-token API prices, so the PTU lever doesn't apply to it.2 If a cost model leans on PTU savings, confirm the model you're betting on is actually PTU-eligible. The same two-tier split that governs compliance (axis 1) governs pricing.

Decision table

Per-token = the model you already know from the direct APIs. PTU = the Azure-specific addition.

| Dimension | Pay-as-you-go per-token | PTU reserved capacity |
| --- | --- | --- |
| You pay for | Tokens consumed | Capacity deployed (per PTU/hr) |
| Cost at idle | Zero | Full (can't pause; delete to stop) |
| Cost predictability | Variable with usage | Fixed / known |
| Throughput & latency | Best-effort, shared | Guaranteed, reserved |
| Commitment | None | None (hourly) or 1-mo/1-yr (reservation) |
| Discount lever | - | Azure Reservation (term commitment) |
| Applies to Claude? | Yes (Marketplace, Anthropic prices) | No (Claude is not PTU-eligible) |
| Best when… | Spiky/low volume, dev/test, exploration | Steady high volume needing cost + latency certainty |
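
The table can be condensed into a small decision helper. This is an illustrative sketch only: the function, its parameters, and the returned labels are hypothetical, and it considers cost alone, ignoring the throughput and latency row.

```python
def recommend_billing(ptu_eligible: bool,
                      expected_utilization: float,
                      break_even_utilization: float,
                      sustained: bool) -> str:
    """Pick a billing mode from the cost-related rows of the decision table.

    ptu_eligible: whether the chosen model can be deployed on PTUs at all.
    expected_utilization / break_even_utilization: fractions between 0 and 1.
    sustained: True for steady production, False for a benchmark or event.
    """
    if not ptu_eligible:
        # No PTU lever for this model, so per-token is the only option.
        return "per-token"
    if expected_utilization < break_even_utilization:
        # Left of the break-even point: idle capacity would be wasted spend.
        return "per-token"
    # Right of the break-even point: reserve for steady load, else go hourly.
    return "ptu-reservation" if sustained else "ptu-hourly"


if __name__ == "__main__":
    # Bursty, mostly idle traffic on a PTU-eligible model.
    print(recommend_billing(True, 0.10, 0.60, sustained=False))
    # Steady production load well past break-even.
    print(recommend_billing(True, 0.85, 0.60, sustained=True))
    # A model without PTU support, regardless of load.
    print(recommend_billing(False, 0.85, 0.60, sustained=True))
```

The utilization figures are placeholders; in practice the break-even value comes from live prices and your own measured throughput.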

Check yourself — 3 scenarios

Judgement, instant feedback.

1. A startup runs unpredictable, bursty traffic — busy some hours, idle most. Which billing model, and why?

Idle-heavy, spiky load lives left of the break-even point: per-token (which costs zero when idle) beats paying for reserved capacity you're not using.

2. An ops lead buys a 1-year PTU reservation before creating any deployment, to "lock in the discount early." What's the risk?

Reservations are purely financial; capacity is a separate, finite resource. Microsoft's explicit guidance: create the deployment first to confirm capacity, then buy the reservation to cover it.

3. A cost model for a new product assumes big PTU-reservation savings — and the product is standardizing on Claude. What do you flag?

The two-tier split strikes again: PTU economics are a "models sold by Azure" feature. A Claude-based cost model can't bank PTU savings; it pays Anthropic's standard token rates.

That's all four axes: what now?

You've now mapped the whole landscape: the decision map, build tooling & agents, enterprise & compliance, and the commercial model. The strongest next step for a decision mission is to run a real scenario through all four — bring me an actual org/workload and we'll produce the recommendation together, end-to-end.