The Cost Is in the Context
Why unlimited AI usage is shifting toward token-based billing, and why locally operated AI is becoming more important for companies
May 04, 2026 · 10 minutes reading time

Photo: @moneyphotos on Unsplash, edited.
When AI is no longer billed at a flat rate, the cost driver often stays out of sight: the more context a system processes, the faster the budget is consumed.
Flat-Rate AI Subscriptions Come Under Pressure
A short prompt and a multi-step analysis across internal documents can look similar on the surface: someone asks a question, and the system responds. Technically, they are worlds apart. In the first case, a language model processes little text and returns a short answer. In the second, documents are searched, code sections are read, intermediate results are generated, tools are called, and answers are built up over several steps. This is precisely where flat-rate AI subscriptions start to strain.
This development is already visible with Copilot. GitHub is moving the service to usage-based billing from 1 June 2026. Instead of the previous Premium Requests, GitHub AI Credits will be consumed. According to GitHub, input tokens, output tokens, and cached tokens count toward this, as does the AI model used. The reasoning is matter-of-fact: a short chat question incurs different inference costs than a longer agentic coding session.
This makes visible a development that goes beyond a single provider. AI products are moving away from simple usage promises and closer to billing models that reflect actual resource consumption. This is not merely a price change. It changes how AI has to be planned, limited, measured, and governed in companies.
“Unlimited” Often Applies Only to Parts of the Product
For many software subscriptions, the calculation used to be straightforward. A user account cost a fixed amount per month. Whether the application was used intensively or only occasionally made little difference to billing. Generative AI fits this pattern only to a limited extent, because two requests made through the same interface can create very different compute loads.
GitHub’s Copilot documentation on usage-based billing shows this split particularly clearly. Features such as Copilot Chat, CLI, Cloud Agent, Spaces, Spark, and third-party agents consume AI Credits. Code completions and Next Edit Suggestions, by contrast, remain unlimited in paid plans. “Unlimited” therefore no longer automatically means that the entire product can be used without any consumption logic. It can also mean that individual features remain included at a flat rate, while more compute-intensive features are controlled through credits, tokens, or budgets.
Cursor showed in 2025 how much explanation such usage promises now require. In a clarification of its pricing model, the company wrote that the Pro plan includes unlimited use of Tab and AI models in Auto mode, while frontier models are billed through a monthly allowance and then at API cost. Cursor acknowledged that its communication around “unlimited usage” had not been clear enough. What is interesting here is less the individual case than the pattern: AI subscriptions now have to be read more carefully than traditional flat rates.
The relevant question is shifting. It is no longer enough to compare the number of licences and the monthly price. What matters is which features are actually included at a flat rate, which usage consumes a budget, how additional usage is billed, and whether costly features can be controlled organisationally.
When a Request Is More Than a Request
Tokens are the units in which language models process and generate text. More context means more input. Longer answers mean more output. Repeated text components may become cheaper through caching, depending on the provider, but they do not disappear from the technical calculation. This basic logic alone is enough to explain why AI costs can fluctuate significantly in operation.
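This basic logic can be made concrete with a small calculation. The sketch below estimates the cost of a single model call from its token counts; all prices are invented placeholders, not any provider's real list prices, and the cached-input discount is an assumption for illustration.

```python
# Minimal sketch: estimating the cost of one model call from token counts.
# All prices are illustrative placeholders, not real list prices.

PRICES_PER_MILLION = {      # hypothetical USD per 1M tokens
    "input": 3.00,
    "cached_input": 0.30,   # cached tokens are often billed at a discount
    "output": 15.00,
}

def call_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost of one call: fresh input + discounted cached input + output."""
    fresh = max(input_tokens - cached_tokens, 0)
    cost = (
        fresh * PRICES_PER_MILLION["input"]
        + cached_tokens * PRICES_PER_MILLION["cached_input"]
        + output_tokens * PRICES_PER_MILLION["output"]
    ) / 1_000_000
    return round(cost, 6)

# A short chat question vs. a context-heavy, partly cached document analysis:
print(call_cost(500, 0, 300))            # 0.006
print(call_cost(60_000, 20_000, 2_000))  # 0.156
```

Even with identical prices, the two calls differ by a factor of more than twenty, which is exactly the fluctuation the flat-rate model hides.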
In typical enterprise scenarios, usage rarely remains a single isolated chat question. A document analysis loads draft contracts, policies, minutes, or technical specifications into the context. A RAG system supplements a request with excerpts from internal knowledge repositories. A coding assistant includes files, tests, error messages, and dependencies from the project. An agent performs intermediate steps, discards results, calls tools, and tries alternatives. A single user request then becomes a sequence of model calls.
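The fan-out from one request into several calls can be sketched as a simple tally. The step names and token counts below are invented for illustration; real agent traces vary widely.

```python
# Hypothetical sketch: one user request fanning out into several model calls.
# Step names and token counts are invented for illustration.

steps = [
    # (step, input_tokens, output_tokens)
    ("retrieve and rank documents", 8_000,    200),
    ("read matched code files",     25_000,   500),
    ("draft an answer",             30_000, 1_500),
    ("self-check and revise",       32_000, 1_200),
]

total_in = sum(s[1] for s in steps)
total_out = sum(s[2] for s in steps)
print(f"{len(steps)} model calls, {total_in:,} input / {total_out:,} output tokens")
```

What looks like one question on the surface consumes the input budget of dozens of short chat interactions, because earlier context is re-sent at each step.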
The cost difference becomes especially tangible with agents. Cursor has changed agent usage in the Teams plan from fixed request costs to variable costs, because a simple syntax question creates less effort than an agent that is supposed to implement an entire pull request. In the explanation of Teams and Auto, pricing therefore follows not the visible surface of the product, but the work that happens in the background.
The same logic is also present in the public API pricing lists of AI providers. OpenAI distinguishes between input, cached input, and output on its pricing page; Google’s Gemini API likewise separates input, output, context caching, and additional functions such as Grounding. The list price of an AI model therefore says little on its own about the cost of a specific workflow. What matters is how much context is processed, how much output is produced, and how often a system repeats the same task internally.
The very use cases that are interesting for production use therefore create cost uncertainty: internal knowledge search, document review, code analysis, pre-processing support cases, research, summaries of large data sets, and multi-step assistant processes. They are not expensive because AI is inherently expensive. They are hard to calculate because their consumption depends on the specific workflow.
Why AI Budgets Alone Are Not Enough
When costs depend on tokens, model choice, and depth of use, a monthly licence overview is too coarse. What is needed are budgets, role-based permissions, approved models, logs, alert thresholds, and allocation to teams or projects. This control layer has to be defined before broad rollout, not only after the first unexpectedly high bill.
For Copilot, GitHub provides budgets at several levels, including enterprise, organisation, cost center, and individual user. Once credits are exhausted, additional usage can be billed or access can be blocked until the next billing period. This helps with limitation, but it does not replace a subject-matter assessment. A hard stop protects the budget, but it can also interrupt a productive task in the middle of execution.
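The control decision described above can be reduced to a small policy function. This is a sketch of the general pattern, not GitHub's actual API; the names and thresholds are assumptions.

```python
# Sketch of the budget decision: when credits are exhausted, either bill
# further usage or block it until the next period. Names and thresholds
# are illustrative, not any provider's actual API.

from dataclasses import dataclass

@dataclass
class Budget:
    limit_credits: float
    used_credits: float
    allow_overage: bool      # bill beyond the limit, or hard-stop?

def check_usage(budget: Budget, requested: float) -> str:
    if budget.used_credits + requested <= budget.limit_credits:
        return "allow"
    if budget.allow_overage:
        return "allow_and_bill_overage"
    return "block_until_next_period"

team = Budget(limit_credits=1_000, used_credits=990, allow_overage=False)
print(check_usage(team, 5))   # "allow"
print(check_usage(team, 50))  # "block_until_next_period"
```

The `allow_overage` flag is exactly the trade-off in the text: with it, the budget is soft and costs can grow; without it, a productive task can be cut off mid-run.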
Similar control decisions appear with Claude. In paid Claude plans, once included usage has been reached, users can switch to Extra Usage at pay-as-you-go rates. With Claude Code, the use of API credits can be deliberately prevented so that the tool works only within the plan allowance. Such options are useful, but they also show that cost control does not arise automatically from buying a subscription. It has to be configured and understood organisationally.
In smaller and medium-sized organisations, this task often does not fall to specialised FinOps or platform teams, but to IT management, business units, and executive leadership. Clarity then matters. Anyone approving AI use should know which AI models are allowed for which tasks, which user groups may use agentic features, and at what point additional consumption will be blocked. Without this clarification, billing becomes a downstream control instrument. By then, it is too late.
Cost Flows Are Also Data Flows
Token-based billing initially sounds like a matter for procurement and controlling. In AI operation, however, it directly affects data control. A long context is not just a larger compute unit. It is a larger data package. It may contain source code, customer data, contracts, internal policies, tickets, minutes, or technical documentation.
When an AI assistant automatically loads suitable documents into the prompt, two questions arise at the same time: what does this processing cost, and where does the data flow? Relevant factors include storage location, logging, access to logs, retention and deletion periods, role-based access models, and the ability to trace which content was processed in which context. A reference to being “enterprise-ready” or “GDPR-compliant” does not answer these questions. Such a statement becomes robust only when the technical implementation is transparent.
Measures for limiting costs and measures for confidentiality are therefore often closely related. Less unnecessary context not only reduces consumption, but also reduces the amount of sensitive information flowing into AI calls. More precise document selection, limited chat histories, clean retrieval configurations, and role-based access improve both calculability and data minimisation.
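The overlap between cost control and data minimisation can be sketched in code: a context builder that caps retrieved chunks and chat history cuts both token usage and the amount of internal data sent out. The chunk scoring and the characters-per-token heuristic below are simplified assumptions.

```python
# Sketch: cap how much retrieved context and chat history enters a prompt.
# The token heuristic and chunk scoring are simplified assumptions.

def build_context(chunks: list[tuple[float, str]],
                  history: list[str],
                  max_tokens: int = 4_000,
                  max_history_turns: int = 4) -> str:
    """Keep only top-scored chunks and recent history within a token cap."""
    def rough_tokens(text: str) -> int:
        return len(text) // 4            # crude chars-per-token heuristic

    parts, used = [], 0
    for _, text in sorted(chunks, reverse=True):   # best relevance score first
        if used + rough_tokens(text) > max_tokens:
            break                        # stop before exceeding the budget
        parts.append(text)
        used += rough_tokens(text)
    parts.extend(history[-max_history_turns:])     # recent turns only
    return "\n\n".join(parts)
```

Every chunk the cap excludes is both a token not paid for and a piece of internal content not transmitted, which is the double effect the paragraph describes.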
Conversely, poorly configured systems can become problematic in two ways at once. They attach too many documents to every request, generate high token usage, and process more internal information than is necessary for the task. The cost is then indeed in the context, but the context is not only a cost item. It is part of the security and data-protection architecture.
Locally Operated AI Makes Costs and Data Flows More Predictable
Token-based billing makes visible that AI usage depends not only on the AI model selected, but on the entire usage pattern. The more often a company analyses internal documents, searches knowledge repositories, prepares support cases, or evaluates code bases, the less this resembles occasional individual requests. Recurring workload profiles emerge, and they have to be planned for permanently.
In its paper on cost estimation of AI workloads, the FinOps Foundation treats such AI workloads not as a mere billing item, but as a planning task across development, piloting, and production use. This is exactly where locally operated AI becomes relevant. Not because it makes AI free, or because it would be the better answer for every use case. Its advantage lies in a different operating logic: capacities, data flows, access, and technical limits can be planned more strongly within the company’s own or otherwise controlled infrastructure.
The costs shift from ongoing individual consumption to infrastructure, capacity, utilisation, maintenance, monitoring, and responsibility. That is not a small difference. An external consumption model bills extensive processing by tokens, credits, or usage classes. Locally operated AI, by contrast, requires an upstream decision about which workloads occur regularly enough to justify providing capacity for them.
This question becomes particularly relevant when high usage and confidential data coincide. In enterprise applications, long contexts rarely contain only non-critical text. They often involve draft contracts, customer cases, technical documentation, tickets, policies, or source code. In its guidance on RAG systems with enterprise or public-sector knowledge sources, the German Datenschutzkonferenz (DSK) — the conference of Germany’s independent federal and state data protection supervisory authorities — therefore addresses not only model performance and answer quality, but also purpose limitation, access, transparency, and data minimisation.
Locally operated AI is therefore not a blanket answer to rising or hard-to-plan cloud costs. It does, however, become a regular architecture question when companies combine recurring AI usage with sensitive information. The potential advantage then lies not only in a different cost structure, but in greater control over which data is processed, who can access it, and which technical limits a system has to observe. Such requirements do not belong only in ongoing operation; they have to be clarified from both the business and technical sides before rollout.
The Decision Is Made Before Rollout
Many cost problems arise before the first bill arrives. A chatbot that answers short questions has different requirements from a system that searches internal documents for every request. A coding assistant with simple completions has to be assessed differently from an agent that independently works through tasks across several files and tests. A department that occasionally summarises texts creates a different consumption pattern from a support team that pre-processes hundreds of cases with AI every day.
Before a rollout, therefore, the starting point should not be only a plan comparison. A technical usage description is more useful: Which data enters the context? How long are prompts and answers likely to be in practice? Are documents processed multiple times? Are there agent loops? Which models are approved for which tasks? Which usage may take place automatically, and where are limits needed?
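Such a usage description can be turned into a first monthly token estimate. All workflow names and figures below are assumptions to be replaced with measured values from a pilot.

```python
# Sketch: turning a pre-rollout usage description into a monthly token
# estimate. All workflow names and figures are illustrative assumptions.

workflows = {
    # name: (requests/month, avg input tokens, avg output tokens, model calls per request)
    "support triage":  (4_000,   6_000,   400, 3),
    "document review": (  300,  40_000, 2_000, 5),
    "code completion": (50_000,  1_500,    50, 1),
}

def monthly_tokens(profile: dict) -> tuple[int, int]:
    total_in = total_out = 0
    for reqs, tin, tout, calls in profile.values():
        total_in += reqs * tin * calls
        total_out += reqs * tout * calls
    return total_in, total_out

tin, tout = monthly_tokens(workflows)
print(f"{tin / 1e6:.0f}M input / {tout / 1e6:.0f}M output tokens per month")
```

Even this rough model makes the relevant comparison possible: against a provider's price list it yields an expected monthly spend, and against hardware capacity it indicates whether the workload is steady enough to justify local provisioning.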
This analysis does not lead to a one-size-fits-all architecture decision. Some use cases can be controlled sufficiently with a well-configured SaaS or API model, provided that budgets, logging, approved models, and permissions are set cleanly. Others point toward hybrid approaches, where confidential or frequent tasks run locally while external AI models are used for selected cases. Still other organisations will prefer locally operated AI because they want to control data processing, access, and capacities more tightly.
AI becomes viable only when the cost model, data flow, and operating model fit together. Anyone who only buys user licences overlooks the technical cost logic. Anyone who relies only on local infrastructure without realistically assessing utilisation, operation, and model maintenance replaces variable API costs with other forms of effort. Both can be right. Both can fail if the context is not understood.
AI Costs Are an Architecture Question
The shift from unlimited or seemingly flat-rate AI usage to token-, credit-, and limit-based models is not a marginal issue of provider billing. It follows from the fact that AI systems use longer contexts, more powerful AI models, and agentic ways of working. As a result, the visible monthly price says less about what production use actually costs.
For companies, AI is therefore becoming more of an infrastructure question. Where context is processed, costs arise. Where context is processed, data flows. Both require transparency: AI models used, tokens, budgets, roles, storage locations, logs, and technical limits.
In this environment, locally operated AI is not automatically the better solution. It does, however, become more important when high usage, confidential information, and the need for predictable control come together. The right architecture does not follow from a pricing model alone, but from the sober question of which data should be processed, with what effort, and under whose control.