Cadmus: a multilingual tokenizer for European languages

A tokenizer is the first thing a language model sees and the last thing anyone optimises. For a model built to serve regulated European industries such as finance, insurance and legal services, that is exactly the wrong way round. We built Cadmus, a byte-pair encoding (BPE) tokenizer covering 60 natural languages, and benchmarked it against the tokenizers of the leading open-source models. On European languages, it leads every one of them.

+26% more European text per token than the best tokenizer we tested

27 / 27 European languages on which Cadmus leads every comparator

60 natural languages in a single vocabulary

Why build our own tokenizer

Three reasons, all of them specific to a European production model rather than a research demo.

European coverage is a first priority, not a side effect. Every official EU language is in the training mix with an explicit byte budget. Most frontier tokenizers are optimised for English and Chinese; their European coverage is whatever falls out of a web crawl. That shows up directly as wasted tokens, and wasted tokens are wasted context, latency and cost on every European request.

Reasoning happens in English; production happens in 27 languages. A modern reasoning model produces its chain of thought in English regardless of the user's language. Cadmus is deliberately built around that asymmetry: English is oversampled to make reasoning tokens cheap, while the European languages keep the coverage that customer-facing output demands.

Numbers and documents are not an afterthought. Cadmus uses a place-value-invariant digit scheme: digits are grouped in threes from the right, so the ones digit of a number always lands in the same position within its token, regardless of the number's length. This structural property matters when a model reads IBANs, ISINs and monetary amounts in free text. The vocabulary also reserves dedicated slots for image, video, audio and OCR markers, so a later multimodal model never needs a disruptive vocabulary extension after the fact.

Tokens needed to encode a 12-digit number, and whether the ones digit keeps a fixed position regardless of the number's length.
Tokenizer	Digit grouping	Tokens	Fixed position
Cadmus (Ablatic)	right to left, in threes	4	✓
DeepSeek V4	left to right, in threes	4	✗
Kimi K2.6	left to right, in threes	4	✗
GLM 5.1	irregular	7	✗
Qwen 3.5	one per digit	12	✓
Gemma 4	one per digit	12	✓
MiMo V2.5 Pro	one per digit	12	✓
Mistral Medium 3.5	one per digit	12	✓
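
The difference between the two "in threes" schemes only shows up when the digit count is not a multiple of three. The sketch below illustrates the grouping idea only; the function names are hypothetical and it is not Cadmus's actual pre-tokenization rule, which this post does not specify.

```python
def group_right_to_left(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into chunks of `size`, aligned to the right end."""
    head = len(digits) % size  # leftover digits form a short leading chunk
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


def group_left_to_right(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into chunks of `size`, aligned to the left end."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def ones_digit_offset(chunks: list[str]) -> int:
    """Index of the ones digit inside the final chunk."""
    return len(chunks[-1]) - 1


if __name__ == "__main__":
    for number in ("1234", "12345", "123456789012"):
        rtl = group_right_to_left(number)
        ltr = group_left_to_right(number)
        print(number, rtl, ones_digit_offset(rtl), ltr, ones_digit_offset(ltr))
```

With right-to-left grouping the ones digit is always the third digit of the last chunk once a number has at least three digits; with left-to-right grouping its offset drifts with the number's length. For a 12-digit number both schemes happen to produce the same four chunks, which is why the token counts in the table match.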

European languages: Cadmus vs. the field

The headline metric is bytes per token (BPT): how many bytes of real text each token carries. Higher is better, because more text per token means more context, fewer tokens to generate and lower cost. The table below shows the mean BPT across the 27 European languages on the FLORES-200 devtest.

Mean bytes per token across 27 European languages (FLORES-200 devtest; the 24 official EU languages plus Catalan, Icelandic and Norwegian). Higher is better.
Tokenizer	Mean BPT	vs. Cadmus
Cadmus (Ablatic)	4.649	·
Gemma 4	3.699	+25.7%
Qwen 3.5	3.635	+27.9%
Mistral Medium 3.5	3.588	+29.6%
GLM 5.1	3.330	+39.6%
DeepSeek V4	3.270	+42.2%
MiMo V2.5 Pro	3.034	+53.2%
Kimi K2.6	3.022	+53.9%

“vs. Cadmus” is the additional European text per token Cadmus carries over each comparator, that is, Cadmus's mean BPT divided by the comparator's, minus one. On a per-language paired test, Cadmus wins on all 27 European languages against every tokenizer above.
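
For readers who want to reproduce the arithmetic, here is a minimal sketch of the two quantities in the table. It assumes that bytes are counted on the UTF-8 encoding and that the relative column is the ratio of the two BPT values minus one; the function names and the two-token example are hypothetical.

```python
def bytes_per_token(text: str, token_count: int) -> float:
    """Bytes of UTF-8 text carried by each token, on average (assumed definition)."""
    return len(text.encode("utf-8")) / token_count


def relative_gain(reference_bpt: float, comparator_bpt: float) -> float:
    """Extra text per token of the reference over the comparator, in percent."""
    return (reference_bpt / comparator_bpt - 1.0) * 100.0


def tokens_needed(num_bytes: int, bpt: float) -> float:
    """Approximate token count for a text of `num_bytes` at a given BPT."""
    return num_bytes / bpt


if __name__ == "__main__":
    # "Grüße" is 7 bytes in UTF-8; a hypothetical 2-token encoding gives 3.5 BPT.
    print(bytes_per_token("Grüße", 2))
    # Mean European BPT from the table: Cadmus 4.649, Gemma 4 3.699.
    print(f"{relative_gain(4.649, 3.699):.1f}%")
    # Tokens for a 100 kB document at each of the two BPT values.
    print(round(tokens_needed(100_000, 4.649)), round(tokens_needed(100_000, 3.699)))
```

The published means are rounded to three decimals, so recomputing a percentage from them can differ from the table in the last digit.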

And it still leads on English

A European focus usually costs you English efficiency. It does not here. Cadmus is also the most efficient tokenizer on English in this comparison.

English bytes per token (FLORES-200 devtest). Higher is better.
Tokenizer	English BPT	vs. Cadmus
Cadmus (Ablatic)	5.316	·
DeepSeek V4	4.885	+8.8%
Gemma 4	4.872	+9.1%
Kimi K2.6	4.868	+9.2%
GLM 5.1	4.859	+9.4%
Qwen 3.5	4.798	+10.8%
MiMo V2.5 Pro	4.782	+11.2%
Mistral Medium 3.5	4.747	+12.0%

The lead holds on out-of-distribution text too: on a Wikipedia 2025 set published after every tokenizer's training cutoff, Cadmus keeps the top spot on English and on the European mean.

And across the rest of the world

The pattern holds beyond Europe. Across 33 non-European languages, Cadmus has the highest mean bytes per token in the comparison, because it spreads vocabulary evenly across the long tail of mid-resource languages and scripts (African, South East Asian, Caucasus and Cyrillic) that English- and Chinese-centric tokenizers cover only thinly. The Chinese-built tokenizers are strong on Chinese itself but compress most of the rest of this set poorly.

Mean bytes per token across 33 non-European languages (FLORES-200 devtest). Higher is better.
Tokenizer	Mean BPT	vs. Cadmus
Cadmus (Ablatic)	6.249	·
Gemma 4	5.793	+7.9%
Qwen 3.5	4.642	+34.6%
Mistral Medium 3.5	4.596	+36.0%
DeepSeek V4	3.879	+61.1%
Kimi K2.6	3.584	+74.4%
MiMo V2.5 Pro	3.040	+105.5%
GLM 5.1	2.996	+108.6%

Mean across all 33 non-European languages. Cadmus has the highest mean and the most per-language wins; single-language specialists still lead on the languages they target (for example, the Chinese tokenizers on Chinese).

How it's built

Cadmus is a two-phase byte-pair encoder. A standard BPE phase learns the bulk of the vocabulary; a second, SuperBPE phase adds merges across word boundaries for a further compression gain. A dedicated curriculum reserves merge budget for underrepresented scripts such as Indic, CJK, Arabic and Hebrew, so smaller languages are not crowded out by English and code. English is oversampled 2× relative to the other 59 languages to match the way modern reasoning models actually reason, and a split minimum-length filter keeps both long-form prose and short code fragments well tokenised. The tokenizer and its reproduction pipeline will be released under the Apache 2.0 licence.

More to come

This is the short version. The full methodology, including the ablation chain, the digit tokenization study, the decontamination analysis and the complete per-language results across all 60 languages, is written up in the accompanying paper, which we will publish later this year. More on that then.