Traditional SaaS economics are forgiving on the margin side. Once software is written, the marginal cost of serving one more user approaches zero. Gross margins of 80% or higher are expected, and the financial model is built around this assumption. You can spend heavily on sales and on R&D, and the margin still holds because serving customers is cheap.
AI-native products break this model. Every request hits an LLM provider API that charges by the token. Your cost to serve scales with usage. A customer who doubles their activity doubles your COGS. A customer who sends long prompts with large context windows costs more than one who sends short ones. Unlike traditional SaaS where all users cost roughly the same, your AI users have meaningfully different unit economics depending on how they interact with the product.
The calculator above lets you model this before it surprises you. Enter your model, token volumes, caching assumptions, and the price you charge end-users. The calculator shows you where your gross margin lands and how sensitive it is to each variable.
Why AI COGS Is Different
Token-based pricing has a structure that catches teams off guard when they first work through the math.
Input and output tokens are priced differently. Most LLM providers charge more for output tokens than input tokens, often by a factor of four to six. GPT-5.4 is $2.50 per million input tokens and $15.00 per million output tokens. Claude Sonnet 4.6 is $3.00 per million input and $15.00 per million output. This means the ratio of generated text to submitted text in your application has an outsized effect on per-call cost. An application that generates long, detailed responses has a very different cost profile from one that generates brief confirmations.
Context window size is a hidden cost driver. Every token in the context window is charged on every request. If your application includes a 4,000-token system prompt, you are paying for those 4,000 tokens on every single API call. At 50,000 calls per month, that is 200 million input tokens from system instructions alone, before any user input counts. Teams that don’t measure this routinely underestimate their input token costs by a wide margin.
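As a rough sketch of that arithmetic, the function below computes the monthly input spend attributable to the system prompt alone. The function name is illustrative, and the price used is the GPT-5.4 input rate quoted earlier; caching is ignored here.

```python
def system_prompt_cost(prompt_tokens: int, calls_per_month: int,
                       input_price_per_million: float) -> float:
    """Monthly input spend from the system prompt alone, with no caching."""
    total_tokens = prompt_tokens * calls_per_month
    return total_tokens / 1_000_000 * input_price_per_million


# 4,000-token system prompt, 50,000 calls per month, $2.50 per million input tokens
monthly = system_prompt_cost(4_000, 50_000, 2.50)
print(f"${monthly:,.2f}")  # 200 million tokens -> $500.00
```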
Token counts vary more than you expect. User inputs vary in length. Generated outputs vary in length. When you model COGS using average tokens per call, you are modeling the expected value of a distribution that can have a long, expensive tail. A customer with unusually long queries or a use case that generates verbose outputs will cost you more than your average assumptions predict. Modeling your 90th percentile user, not just the average or median one, gives you a more defensible cost forecast.
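One way to see the gap between the typical user and the tail is to summarize per-user monthly cost with a median, a mean, and a 90th percentile. This is a minimal sketch: the sample costs are made up for illustration, and `cost_summary` is a hypothetical helper, not part of any billing library.

```python
import statistics


def cost_summary(monthly_costs: list[float]) -> dict[str, float]:
    """Median, mean, and 90th-percentile monthly COGS across users."""
    deciles = statistics.quantiles(monthly_costs, n=10)  # 9 cut points
    return {
        "median": statistics.median(monthly_costs),
        "mean": statistics.fmean(monthly_costs),
        "p90": deciles[8],  # last cut point is the 90th percentile
    }


# Hypothetical per-user monthly COGS in dollars, with one heavy user
sample = [1.10, 1.40, 1.80, 2.00, 2.20, 2.40, 2.90, 3.50, 6.00, 24.00]
summary = cost_summary(sample)
for name, value in summary.items():
    print(f"{name}: ${value:.2f}")
```

In a skewed sample like this one, the 90th percentile sits well above both the mean and the median, which is the gap an average-only forecast hides.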
The Gross Margin Surprise
Here is how the math plays out for a typical early-stage AI product that hasn’t worked through its unit economics.
A team builds an AI writing assistant. They price it at $49 per month, targeting 1,000 active users as the milestone to feel like the business is working. On a per-seat SaaS mental model, the gross margin looks strong.
Now the token math: the average user makes 150 API calls per month. Each call includes a 1,000-token system prompt, 500 tokens of user input, and generates 800 tokens of output. They’re using GPT-5.4.
Input tokens per call: 1,500 (system + user)
Output tokens per call: 800
Cost per call (GPT-5.4): (1,500 / 1,000,000 × $2.50) + (800 / 1,000,000 × $15.00)
= $0.00375 + $0.01200
= $0.01575
Monthly calls per user: 150
Monthly COGS per user: 150 × $0.01575 = $2.36
Monthly revenue per user: $49
Gross margin per user: ($49 - $2.36) / $49 = 95.2%
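The same calculation can be written as a short script so the inputs are easy to vary. This is a sketch using the figures from the example; the function names are illustrative.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Dollar cost of one call; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def gross_margin(revenue: float, cogs: float) -> float:
    """Gross margin as a fraction of revenue."""
    return (revenue - cogs) / revenue


per_call = cost_per_call(1_500, 800, 2.50, 15.00)
monthly_cogs = per_call * 150            # 150 calls per user per month
margin = gross_margin(49.0, monthly_cogs)

print(f"cost per call: ${per_call:.5f}")      # $0.01575
print(f"monthly COGS:  ${monthly_cogs:.4f}")  # $2.3625
print(f"gross margin:  {margin:.1%}")         # 95.2%
```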
At first glance, 95.2% gross margin looks excellent. But this is average-user math. The real question is what happens at the tail.
A power user who makes 1,500 calls instead of 150 costs you $23.63 per month in COGS. Your revenue from that user is still $49. Their gross margin is 51.8%, not 95.2%. At 10x the average usage, you are still profitable, but at barely half the margin your model assumed. At 20x average usage (3,000 calls), that user is costing you $47.25 in COGS against $49 in revenue, a margin of about 3.6%. You are nearly at breakeven serving your most engaged customers.
The real problem is not average gross margin. It is the margin distribution across your user base. Flat pricing that looks fine at average usage systematically subsidizes your heaviest users. The customers who get the most value from your product are the ones your pricing model is least equipped to handle.
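To make the distribution concrete, the sketch below recomputes the flat-price margin at several call volumes and finds the volume at which a $49 seat breaks even. It reuses the $0.01575 per-call cost from the worked example; the constant and function names are illustrative.

```python
COST_PER_CALL = 0.01575  # from the worked example above
PRICE = 49.0             # flat monthly price per user


def margin_at(calls: int) -> float:
    """Gross margin for one user at a given monthly call volume."""
    return (PRICE - calls * COST_PER_CALL) / PRICE


def breakeven_calls() -> float:
    """Monthly calls at which COGS equals the flat price."""
    return PRICE / COST_PER_CALL


for calls in (150, 1_500, 3_000):
    print(f"{calls:>5} calls: margin {margin_at(calls):.1%}")
# 150 -> 95.2%, 1,500 -> 51.8%, 3,000 -> 3.6%

print(f"break-even at about {breakeven_calls():.0f} calls per month")  # about 3111
```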
The Levers That Control Your Margin
There are three levers that meaningfully move gross margin in an AI-native product: model selection, prompt caching, and output length.
Model selection. The cost difference between model tiers is not marginal; it is substantial. Claude Opus 4.7 costs $5.00 per million input tokens and $25.00 per million output tokens. Claude Haiku 4.5 costs $1.00 per million input and $5.00 per million output. That is a 5x cost difference. On the OpenAI side, GPT-5.5 costs $5.00 per million input and $30.00 per million output; GPT-5.4 Nano costs $0.20 per million input and $1.25 per million output, a 25x ratio on input and 24x on output.
If you are defaulting to your most capable model because you started the project with it and it worked, you may be leaving an 80 to 95% cost reduction on the table. The right approach is to run objective evals and evaluate cost-adjusted performance, not to pick the cheapest model that passes a subjective quality check. Define a measurable quality metric for your use case: task completion rate, accuracy on a held-out test set, or a consistent human-rating rubric. Score each model tier against it and record the cost per call alongside. A model that scores 20% higher but costs 5x more is not the better choice: the performance gain does not justify the margin hit. The bar to clear is that the percentage improvement in quality must exceed the percentage increase in cost. Many teams find that 60 to 80% of their requests do not clear this bar for the highest-tier model, which means routing simpler requests to a cheaper tier and reserving the premium model for tasks where the eval gap is genuinely material.
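The bar described above can be expressed as a small comparison function. This is a sketch of that rule only: the scores and per-call costs below are hypothetical, and `clears_bar` is an illustrative name rather than part of any eval framework.

```python
def clears_bar(base_score: float, base_cost: float,
               cand_score: float, cand_cost: float) -> bool:
    """True if the candidate's quality gain (%) exceeds its cost increase (%)."""
    quality_gain = (cand_score - base_score) / base_score
    cost_increase = (cand_cost - base_cost) / base_cost
    return quality_gain > cost_increase


# Hypothetical eval: the premium tier scores 20% higher but costs 5x more per call
print(clears_bar(base_score=0.70, base_cost=0.001,
                 cand_score=0.84, cand_cost=0.005))   # False

# Hypothetical eval: 30% higher score for 10% more cost
print(clears_bar(base_score=0.70, base_cost=0.0010,
                 cand_score=0.91, cand_cost=0.0011))  # True
```

Running this per request type, rather than once for the whole product, is what makes routing decisions possible.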
Prompt caching. Prompt caching is the highest-leverage cost reduction mechanism available for most applications. When your system prompt or context is stable across requests, the provider caches the processed representation and charges a substantially reduced rate when subsequent requests match the cached prefix. Both OpenAI and Anthropic now charge 10% of the standard input rate for cached tokens, a 90% discount on the cached portion.
For an application with a 2,000-token system prompt running 50,000 calls per month on Claude Sonnet 4.6 without caching, the input cost from that system prompt alone is:
50,000 × 2,000 / 1,000,000 × $3.00 = $300/month
With a 70% cache hit rate and the 90% cache discount:
Cache misses: 50,000 × 0.30 × 2,000 / 1,000,000 × $3.00 = $90/month
Cache hits: 50,000 × 0.70 × 2,000 / 1,000,000 × $0.30 = $21/month
Total: $111/month
That is a $189 per month saving on system prompt tokens alone, a 63% reduction. At scale, prompt caching is frequently worth more margin improvement than a model downgrade or a price increase.
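The calculation above generalizes to any hit rate. The sketch below assumes the 90% discount on cached tokens described earlier and, as a simplification, ignores any surcharge a provider may apply when writing to the cache; the function name is illustrative.

```python
def cached_input_cost(tokens: int, calls: int, price_per_million: float,
                      hit_rate: float, cached_discount: float = 0.90) -> float:
    """Monthly input cost for a stable prefix at a given cache hit rate.

    Assumption: no cache-write surcharge is modeled.
    """
    full_cost = tokens * calls / 1_000_000 * price_per_million
    miss_cost = full_cost * (1 - hit_rate)
    hit_cost = full_cost * hit_rate * (1 - cached_discount)
    return miss_cost + hit_cost


uncached = cached_input_cost(2_000, 50_000, 3.00, hit_rate=0.0)
cached = cached_input_cost(2_000, 50_000, 3.00, hit_rate=0.70)
print(f"uncached: ${uncached:.2f}")                     # $300.00
print(f"cached:   ${cached:.2f}")                       # $111.00
print(f"saving:   {(uncached - cached) / uncached:.0%}")  # 63%
```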
Output length control. Output tokens cost more than input tokens and, unlike input tokens, output length is something you can influence. Responses that are constrained to be concise cost less than responses that are allowed to run long. If your application allows open-ended generation without length guidance, you are leaving cost optimization on the table. Structuring prompts to request specific output formats, limiting response length where appropriate, and using streaming to allow users to stop generation early are all techniques that reduce output token costs without degrading perceived quality.
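To get a feel for the size of this lever, the sketch below estimates the per-call saving from shortening the average response. It uses the GPT-5.4 prices and the 1,500-token input from the earlier worked example; halving the output to 400 tokens is a hypothetical target, not a recommendation.

```python
def per_call_cost(input_tokens: int, output_tokens: int,
                  input_price: float = 2.50, output_price: float = 15.00) -> float:
    """Dollar cost of one call; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def savings_from_shorter_output(input_tokens: int, old_output: int,
                                new_output: int) -> float:
    """Fractional reduction in per-call cost when average output shrinks."""
    before = per_call_cost(input_tokens, old_output)
    after = per_call_cost(input_tokens, new_output)
    return (before - after) / before


# Halving average output from 800 to 400 tokens with 1,500 input tokens
print(f"{savings_from_shorter_output(1_500, 800, 400):.1%}")  # 38.1%
```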
What Good Looks Like
The target gross margin for a sustainable AI-native SaaS product at scale is 70% or above. This is lower than the 80 to 85% that traditional SaaS commands, but it reflects the structural reality of compute COGS.
The AI companies that get to 70%+ gross margin share a few practices.
They segment pricing by usage intensity. Flat per-seat pricing at $49 per month subsidizes power users. A pricing model with usage tiers, consumption floors, or per-unit charges aligns revenue to COGS more accurately. When a customer’s usage doubles, their bill increases accordingly, and your margin stays stable instead of compressing. This is not just a revenue optimization; it is a margin stability mechanism.
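A sketch of how a usage tier changes the picture: the plan below keeps the $49 base price but includes a fixed number of calls and charges for overage. The 500 included calls and the $0.05 overage rate are hypothetical numbers chosen for illustration, and the per-call cost is the $0.01575 from the earlier example.

```python
COST_PER_CALL = 0.01575   # from the earlier worked example
BASE_PRICE = 49.0
INCLUDED_CALLS = 500      # hypothetical allowance
OVERAGE_PER_CALL = 0.05   # hypothetical overage rate


def tiered_bill(calls: int) -> float:
    """Monthly bill: base price plus overage beyond the included calls."""
    return BASE_PRICE + max(0, calls - INCLUDED_CALLS) * OVERAGE_PER_CALL


def margin(revenue: float, calls: int) -> float:
    """Gross margin for one user at a given revenue and call volume."""
    return (revenue - calls * COST_PER_CALL) / revenue


for calls in (150, 1_500, 3_000):
    flat = margin(BASE_PRICE, calls)
    tiered = margin(tiered_bill(calls), calls)
    print(f"{calls:>5} calls: flat {flat:.1%}, tiered {tiered:.1%}")
# 150 -> flat 95.2%, tiered 95.2%
# 1,500 -> flat 51.8%, tiered 76.1%
# 3,000 -> flat 3.6%, tiered 72.8%
```

Under these assumptions the heavy user's margin settles near the ratio implied by the overage rate instead of collapsing toward zero.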
They treat model selection as a continuous optimization problem, evaluated on cost-adjusted performance rather than raw capability. Model prices fall and new tiers launch regularly. Teams that lock in a model at launch and never revisit it are paying yesterday’s prices on today’s capabilities. The discipline is running objective evals on a defined quality metric, comparing the performance-to-cost ratio across model tiers, and routing each request type to the model that clears the quality bar at the lowest cost. A model that scores higher is not automatically the right choice: it earns that position only when the performance gain percentage exceeds the cost increase percentage. This is a recurring analysis, not a one-time launch decision.
They watch cost per API call as a primary metric. For traditional SaaS, COGS per user barely moves. For AI products, cost per API call is a live operational metric that belongs on the same dashboard as revenue and churn. A spike in cost per call can signal a prompt regression, an unexpected change in user behavior, or a pricing model that is being exploited in ways you didn’t anticipate.
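A minimal sketch of what watching this metric could look like: compare the latest daily cost per call against a trailing average and flag a jump. The 7-day window and 1.25x threshold are arbitrary assumptions, and `cost_per_call_spike` is an illustrative name; a production alert would likely need something more robust.

```python
from statistics import fmean


def cost_per_call_spike(daily_cost_per_call: list[float],
                        window: int = 7, threshold: float = 1.25) -> bool:
    """True if the latest value exceeds the trailing-window mean by `threshold`x."""
    if len(daily_cost_per_call) <= window:
        return False  # not enough history to form a baseline
    *history, latest = daily_cost_per_call
    baseline = fmean(history[-window:])
    return latest > baseline * threshold


# Hypothetical series: a steady week, then a jump after a prompt change
steady_then_jump = [0.0157] * 7 + [0.0240]
print(cost_per_call_spike(steady_then_jump))  # True
print(cost_per_call_spike([0.0157] * 8))      # False
```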
They design prompt caching as a feature. Cache hit rate is a measurable, improvable metric. Teams with high cache hit rates design their prompts and request structure deliberately to maximize prefix reuse. This is an engineering and product discipline that pays off directly in the margin line.
Related Reading
- Why Usage-Based Pricing Is Driving Revenue Growth
- Usage-Based Billing Implementation Bottlenecks
- Prepaid Wallet Burn Rate Calculator
- How Enso’s usage metering works
Frequently Asked Questions
What is a good gross margin for an AI-native SaaS product? Most investors and CFOs target 70% or above for AI-native SaaS at scale. Early-stage companies often run at 40 to 60% while they optimize model selection and caching strategies, then push toward 70%+ as volume grows and the cost structure matures. Below 40% is a signal that either the pricing model needs to change or the underlying model selection is wrong for the use case.
How does prompt caching reduce LLM costs? Prompt caching lets providers reuse the KV cache from a previous request when the prefix of a new request matches. This primarily reduces input token costs because the provider does not need to process the cached portion again. Both OpenAI and Anthropic charge 10% of the standard input rate for cached tokens, a 90% discount. The savings compound quickly when your prompts have long, stable system instructions that appear in every request.
How do I choose the right LLM model to protect gross margin? The right question is not which model is best, but which delivers the best cost-adjusted performance for your workload. Run objective evals on your actual tasks: define a measurable quality metric, score each model tier against it, and record the cost per call alongside the score. A model that scores 15% higher but costs 3x more has failed the ROI test. The threshold is a ratio: a more expensive model only clears the bar if the performance gain percentage exceeds the cost increase percentage. Anything less is margin given away for no net benefit. The cost ratios between tiers (5x to 25x across current OpenAI and Anthropic lineups) make this one of the highest-leverage margin decisions an AI product team makes. Revisit it regularly as model prices fall and new tiers emerge.

