Anthropic says Claude Opus 4.7 uses an updated tokenizer, and that the same input can come out as 1.0–1.35× as many tokens as it did on Opus 4.6. That is an odd move if you think tokenizers exist only to compress text as hard as possible.
A lot of people immediately jumped to the cynical explanation: Anthropic made the model more expensive while hiding behind the same per-token sticker price. The economics probably are part of the story. But the better explanation, based on the public docs, reverse-engineering work, and recent tokenizer papers, is that Anthropic likely traded some raw compression for cleaner segmentation and better model behavior, especially on English and code.
My strongest version of the claim is this:
Opus 4.7 probably gave up some tokenizer efficiency on purpose in exchange for better literalness, cleaner code behavior, and fewer brittle or under-trained merge tokens.
The one caveat worth stating early: Anthropic’s tokenizer is private. I cannot prove the exact vocabulary size, and I cannot prove that the dictionary is literally smaller in 4.7 than 4.6. Anthropic has not published that. What I can say is that the public evidence fits a story where the tokenizer became less aggressively compressive and probably cleaner.
What we actually know
The hard facts are not many, but they are enough to be interesting.
Anthropic says three relevant things in its launch post and Opus 4.7 docs:
- Opus 4.7 has an updated tokenizer.
- The same text can become up to ~35% more tokens.
- Opus 4.7 is more literal in instruction following and better on long-running coding and agentic work.
That combination matters. If Anthropic had only said “token counts changed,” this could just be a new vocabulary learned on a different corpus. But they tied the tokenizer change to a model that is also explicitly being sold as more exact, more disciplined, and better at hard coding work.
Independent measurement lines up with that story. Claude Code Camp ran the public count_tokens endpoint on real content and found that technical docs, shell scripts, Markdown, and code often land near the top of Anthropic’s stated range or above it:
- technical docs: ~1.47×
- shell scripts: ~1.39×
- TypeScript: ~1.36×
- Python: ~1.29×
- ordinary English prose: ~1.20×
- Chinese and Japanese: almost flat, around 1.01×
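The methodology behind numbers like these reduces to per-category ratios of token counts for identical inputs. A minimal sketch in Python, using made-up counts that merely mirror the reported ratios; a real harness would pull per-model counts from Anthropic's public count_tokens endpoint instead:

```python
# Sketch of the measurement methodology: compare token counts for the same
# inputs across two model versions and compute per-category inflation ratios.
# The counts below are illustrative placeholders, not real API output.

SAMPLE_COUNTS = {
    # category: (tokens on Opus 4.6, tokens on Opus 4.7) -- invented numbers
    "technical_docs": (1000, 1470),
    "shell_scripts": (1000, 1390),
    "typescript": (1000, 1360),
    "python": (1000, 1290),
    "english_prose": (1000, 1200),
    "cjk": (1000, 1010),
}

def inflation_ratios(counts):
    """Return {category: new_tokens / old_tokens}, rounded for display."""
    return {cat: round(new / old, 2) for cat, (old, new) in counts.items()}

ratios = inflation_ratios(SAMPLE_COUNTS)
print(ratios)
```

The only non-obvious methodological point is that both counts must come from identical byte-for-byte inputs; any normalization between runs would contaminate the ratio.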
That is not what a universal tokenizer rewrite looks like. It looks more like Anthropic changed the part of the tokenizer that matters most for Claude’s bread-and-butter workload: English and code.
There is also useful background from Rohan Gupta’s reverse engineering of Claude’s token counter, which strongly suggests Claude 4.x tokenization behaves like a BPE-family tokenizer with a large learned vocabulary. That post predates Opus 4.7, so it is not evidence about the 4.7 vocabulary itself, but it does make the overall shape of the problem more concrete: this is probably still a subword tokenizer world, not some radically new tokenizer-free architecture.
Two more public breadcrumbs are worth keeping in view:
- Anthropic’s own user-facing docs tell developers to re-baseline max_tokens and compaction triggers rather than treat 4.7 as a drop-in budget match for 4.6.
- Non-English users are already complaining about tokenization fairness in public, for example in Anthropic’s own Claude Code issue tracker.
So yes: the costs are real. But the interesting question is why a model vendor would accept that cost in the first place.
The strongest clue is not the token count. It’s the behavior change.
Anthropic’s own 4.7 materials repeatedly emphasize behavior that is best described as more exact:
- more literal instruction following
- fewer tool errors and better self-verification
- better coding, especially on longer and harder workflows
That is exactly the kind of model behavior you would expect to improve if the tokenizer exposed text in smaller, cleaner, more reusable pieces rather than packing too many common strings into long brittle merges.
If a tokenizer aggressively compresses English and code, it can end up with very convenient long tokens for common phrases, keywords, separators, or formatting patterns. That sounds good until you notice the tradeoff: the model now sees fewer boundaries and has to recover more local structure from inside each token embedding. In some tasks that is fine. In tasks that depend on exact spelling, exact delimiters, exact argument structure, exact symbol placement, or exact obedience to wording, it can be a liability.
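A toy greedy longest-match segmenter makes the tradeoff concrete. Both vocabularies below are invented for illustration: the "aggressive" one packs a whole function header into a single token, while the "conservative" one keeps keywords and delimiters as their own pieces.

```python
# Toy illustration of the compression/boundary tradeoff: greedy longest-match
# tokenization over two hypothetical vocabularies.

def tokenize(text, vocab, max_len=12):
    """Greedy longest-match segmentation; unknown chars become 1-char tokens."""
    tokens, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + L]
            if L == 1 or piece in vocab:
                tokens.append(piece)
                i += L
                break
    return tokens

aggressive = {"def main():", "return ", "None"}   # long convenience merges
conservative = {"def", "main", "(", ")", ":", "return", "None"}

src = "def main(): return None"
agg = tokenize(src, aggressive)
con = tokenize(src, conservative)
print(len(agg), agg)   # few tokens, boundaries hidden inside long pieces
print(len(con), con)   # more tokens, every delimiter exposed
```

The aggressive vocabulary produces 4 tokens to the conservative one's 10. The model paying the 10-token bill sees every parenthesis and colon as an explicit boundary; the model paying the 4-token bill has to recover those boundaries from inside a single embedding.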
The tokenizer literature has been moving away from “best compression wins”
This is the part where recent papers are actually more interesting than the discourse on X or Hacker News.
1. Compression is not the same thing as tokenizer quality
The best high-level paper for this discussion is Beyond Text Compression: Evaluating Tokenizers Across Scales by Lotz et al. The headline result is that text compression alone is a poor way to judge tokenizer quality. Smaller models can predict tokenizer quality differences, multilingual effects are consistent, and intrinsic metrics can matter more than raw compression when you care about downstream behavior.
That sounds almost tailor-made for Opus 4.7. If Anthropic found a tokenizer that was somewhat worse at compression but better for coding or instruction following, this paper says that would not be surprising. It would be expected.
A companion paper, Tokenization is Sensitive to Language Variation by Wegmann, Nguyen, and Jurgens, pushes on a related point: pre-tokenization and segmentation choices can matter more than vocabulary size itself. The best tokenizer changes depending on whether you want robustness to variation or sensitivity to form.
That matters because a lot of internet discussion collapses everything into “bigger vocab good” versus “smaller vocab good.” The papers are saying: not so fast. The segmentation policy matters at least as much as the dictionary size.
2. Long tokens can hide useful character-level and positional information
The paper that best captures this intuition is CharBench: Evaluating the Role of Tokenization in Character-Level Tasks. It finds that tokenization properties are not equally important for all character-level tasks, but for intra-word positional tasks, longer tokens can obscure the information the model needs.
That is not just about parlor tricks like counting letters in “strawberry.” It matters for work where exact local structure is the whole game:
- code edits
- identifier matching
- punctuation-sensitive formats
- JSON/XML compliance
- locating exactly which item an instruction refers to
If Anthropic wanted Opus 4.7 to be more literal and less sloppy in code, a tokenizer that exposes more local structure — even at the cost of more tokens — is a plausible route.
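One way to see why this matters for code edits is to map a character-level edit position onto the token that contains it. The tokenizations below are hypothetical, but the arithmetic is the point: the coarser the tokens, the more text a one-character rename invalidates.

```python
# Sketch: why token granularity affects exact edits. Given a tokenization,
# find the token that a character-level edit position falls inside.
# With coarse tokens, a one-character change invalidates a larger span.

def token_spans(tokens):
    """Return (start, end) character offsets for each token."""
    spans, pos = [], 0
    for t in tokens:
        spans.append((pos, pos + len(t)))
        pos += len(t)
    return spans

def affected_token(tokens, char_index):
    """Index of the token covering char_index."""
    for i, (s, e) in enumerate(token_spans(tokens)):
        if s <= char_index < e:
            return i
    raise IndexError(char_index)

coarse = ["getUserName", "(", ")"]        # identifier as one long token
fine = ["get", "User", "Name", "(", ")"]  # identifier split into subwords

# Renaming the "User" part means editing character 3.
i = affected_token(coarse, 3)
j = affected_token(fine, 3)
print(len(coarse[i]), len(fine[j]))  # characters the edit touches: 11 vs 4
```

Under the fine segmentation, the edit lands cleanly on a 4-character subword; under the coarse one, the same edit falls in the middle of an 11-character token that must be regenerated wholesale.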
3. BPE vocabularies accumulate junk
This is the part most people miss.
LiteToken studies what the authors call intermediate merge residues: tokens that were useful during BPE vocabulary construction, survived into the final vocabulary, but are rarely actually emitted at inference time. These residues waste vocabulary capacity and can make the tokenizer less robust, especially on noisy or atypical text.
That is a very clean argument for a smaller but cleaner vocabulary. If part of your vocabulary is dead weight, removing it can improve things even if token counts go up slightly.
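The residue idea is easy to demonstrate with a toy vocabulary and the same greedy longest-match scheme; the vocabulary and corpus here are invented, not taken from any real tokenizer.

```python
# Sketch of the "merge residue" idea: tokens that exist in the vocabulary
# but are never emitted on real text, because longer merges built on top
# of them always win during segmentation.

def tokenize(text, vocab, max_len=12):
    """Greedy longest-match; unknown chars become 1-char tokens."""
    tokens, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if L == 1 or text[i:i + L] in vocab:
                tokens.append(text[i:i + L])
                i += L
                break
    return tokens

# "str" and "ing" were intermediate merges; "string" subsumes them here.
vocab = {"str", "ing", "string", " ", "a", "is"}
corpus = ["a string", "string is a string"]

emitted = set()
for doc in corpus:
    emitted.update(tokenize(doc, vocab))

residues = vocab - emitted
print(sorted(residues))  # intermediate merges never used on this corpus
```

On this corpus, "str" and "ing" occupy vocabulary slots and embedding rows without ever appearing in output, which is exactly the capacity waste the LiteToken argument targets.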
A very recent and even more directly relevant paper is From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution. The authors show that code tokenizers are prone to learning unused, under-trained, source-specific tokens because large code corpora are full of repetitive repository-local junk. Their proposed regularization methods reduce under-trained tokens while keeping the same inference procedure.
If your goal is a better coding model, this paper is hard to ignore. A tokenizer can overfit to the quirks of its corpus. That overfitting may help compression and hurt generalization.
4. Fixed tokenization is a compromise, not a law of nature
Retrofitting Large Language Models with Dynamic Tokenization is a useful reminder that the whole static-tokenizer setup is an engineering compromise. The paper shows that dynamically adjusting token boundaries can reduce sequence length substantially while preserving most downstream performance.
That does not mean Anthropic is using dynamic tokenization in Opus 4.7. There is no evidence of that. What it does mean is that there is no single best static tokenizer. If a vendor shifts the compromise point from “max compression” toward “better control and better generalization,” that is a coherent design choice.
The code angle is probably central
If you read Anthropic’s Opus 4.7 materials closely, the model is being positioned above all as:
- a better coding model
- a better long-running agent
- a model that follows instructions more literally
- a model that verifies its own work more reliably
That makes me think the relevant comparison class is not “general-purpose English chatbot tokenizer.” It is coding-agent tokenizer.
In that context, more tokens can be a feature rather than a bug if those tokens expose structure more cleanly:
- smaller chunks inside identifiers
- less over-merging of common code phrases
- fewer weird repo-specific subwords
- better treatment of separators, whitespace runs, punctuation, and formatting
- cleaner alignment between tokens and meaningful edit units
The Claude Code Camp measurements point in exactly this direction: code and technical docs get hit much harder than Chinese and Japanese. If Anthropic mainly wanted a tokenizer that made Opus 4.7 a better coding coworker, that pattern makes sense.
The Caylent write-up is also useful here, not for original research but for framing the operational consequences: 4.7 is more literal, more disciplined, more self-contained, and more dependent on effort/task-budget control. That is exactly the sort of model behavior where tokenizer tradeoffs become part of the product design, not just a low-level implementation detail.
The counterargument: bigger vocabularies often help
This is not one of those topics where the papers all sing in chorus.
Two of the most relevant papers point the other way:
- Large Vocabulary Size Improves Large Language Models
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
Both argue, in different ways, that larger vocabularies can help language modeling performance. The second paper in particular finds a log-linear relationship between input vocabulary size and training loss when input and output vocabularies are decoupled.
That is why I do not think the right summary is “smaller vocabulary is better.” The literature absolutely does not support that blanket claim.
What it supports is something more nuanced and more annoying:
- larger vocabularies can improve compression and sometimes overall performance
- cleaner vocabularies can improve robustness, exactness, and efficiency of representation
- segmentation policy may matter more than the raw number of tokens in the dictionary
- the right choice depends on what failure mode you care about most
For Anthropic’s stated priorities in Opus 4.7 — coding, agentic persistence, literalness, fewer tool mistakes — I can see why they might choose a tokenizer that spends more tokens but reduces a class of expensive behavioral failures.
Why I don’t think this was mainly a multilingual fairness move
One could imagine Anthropic rebuilding the tokenizer to reduce non-English penalties. There is a whole body of work pointing at that problem, including:
- Reducing Tokenization Premiums for Low-Resource Languages
- M-IFEval: Multilingual Instruction-Following Evaluation
And the public pressure is obvious in things like Anthropic’s own issue tracker.
But the independent measurements do not look like a tokenizer primarily optimized for multilingual fairness. They look like a tokenizer that changed English and code much more than CJK. That makes me think the main target was not “make the tokenizer globally fairer.” It was something like “make Claude better at the work Anthropic most wants to win on.”
That is a more cynical interpretation in product terms, but it is also the one most consistent with the observed ratios.
The neatest explanation is: Anthropic traded compression for control
Put the pieces together:
- Anthropic openly says the tokenizer changed and token counts rose.
- Anthropic also openly says the model is more literal and better at hard coding work.
- Independent measurement suggests the biggest token-count shift is in English and code.
- Recent tokenizer papers repeatedly show that better compression is not the same thing as better downstream behavior.
- Other papers show that vocabularies accumulate dead merges, under-trained residues, and source-specific code junk.
That yields a pretty coherent story.
If I had to compress the whole thing into one sentence, it would be this:
Claude Opus 4.7 probably uses more tokens because Anthropic decided that for its target workloads, cleaner boundaries and better-behaved subwords were worth more than maximum compression.
That would also explain why users feel the cost increase immediately while Anthropic’s partner quotes focus on lower per-task friction, fewer tool errors, and better reliability. Those are exactly the kinds of gains you would hope to buy by giving the model a representation that is less compressed and more explicit.
What I would test next
If Anthropic ever publishes more tokenizer detail, great. Until then, the cleanest way to test this hypothesis is behavioral.
I would want to run A/B evaluations between Opus 4.6 and 4.7 on:
- IFEval for strict instruction compliance
- M-IFEval for multilingual instruction following
- typo robustness and noisy-code prompts inspired by LiteToken
- identifier-level edit tasks motivated by CharBench
- long code-context tasks where source-specific junk tokens should hurt most, following From Where Words Come
A particularly revealing experiment would be to compare:
- strict JSON and XML conformance
- minimally edited code diffs
- typo-heavy shell logs
- multilingual prompts with equivalent semantics
- prompts with repeated similar identifiers where over-merged tokenization would be a liability
If Opus 4.7 wins disproportionately on those tasks while spending more tokens, the thesis gets stronger.
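The strict-conformance checks are also the easiest part to implement today. A minimal sketch of the JSON case, with invented sample strings standing in for model completions:

```python
# Sketch of a strict JSON-conformance check for the A/B experiment above:
# an output passes only if it parses as bare JSON (no surrounding prose)
# and its keys match an expected set exactly.

import json

def strictly_conformant(output: str, required_keys: set) -> bool:
    """True iff output is a bare JSON object with exactly the required keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(obj) == required_keys

keys = {"name", "count"}
samples = [
    '{"name": "a", "count": 1}',        # conformant
    'Sure! {"name": "a", "count": 1}',  # prose wrapper -> fail
    '{"name": "a"}',                    # missing key -> fail
    "{'name': 'a', 'count': 1}",        # single quotes -> fail
]
results = [strictly_conformant(s, keys) for s in samples]
print(results)  # [True, False, False, False]
```

Scoring both models on the pass rate of checks like this, while also logging token counts per request, would put the "more tokens, fewer behavioral failures" tradeoff on one chart.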
Internet trail, if you want the whole rabbit hole
The best public references I found while digging through this:
- Anthropic: Introducing Claude Opus 4.7
- Anthropic docs: What’s new in Claude Opus 4.7
- Anthropic docs: Migration guide
- Claude Code Camp: I Measured Claude 4.7’s New Tokenizer. Here’s What It Costs You.
- Rohan Gupta: Reverse Engineering Claude’s Token Counter
- Hacker News discussion: Opus 4.7 uses an updated tokenizer that improves how the model processes text…
- Caylent: Claude Opus 4.7 Deep Dive
- Anthropic Claude Code issue: Non-English users face structural disadvantage due to tokenization inefficiency
- LLM Stats: Claude Opus 4.7 vs Opus 4.6
- Labellerr: Claude Opus 4.7 vs Opus 4.6: What Actually Changed?
Papers and research references
All of the papers below were either central to the argument above or part of the research trail that shaped it.
Most relevant
- Beyond Text Compression: Evaluating Tokenizers Across Scales
- Tokenization is Sensitive to Language Variation
- CharBench: Evaluating the Role of Tokenization in Character-Level Tasks
- LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers
- From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution
- Retrofitting Large Language Models with Dynamic Tokenization
Important counterpoints and adjacent work
- Large Vocabulary Size Improves Large Language Models
- Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
- Reducing Tokenization Premiums for Low-Resource Languages
- Instruction-Following Evaluation for Large Language Models
- M-IFEval: Multilingual Instruction-Following Evaluation
- You Can Learn Tokenization End-to-End with Reinforcement Learning
Extra papers from the broader search trail
- Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay
- How Do Language Models Acquire Character-Level Information?
- A Family of LLMs Liberated from Static Vocabularies
- Toucan: Token-Aware Character Level Language Modeling
I would not claim all of these support the same conclusion. They do not. That is part of the point. The literature does not say “small vocabulary good” or “big vocabulary good.” It says tokenization is a tradeoff surface with more axes than most product discussions acknowledge.
My bet is that Anthropic moved along that surface toward control, compositionality, and coding robustness — and accepted the token bill that came with it.