Building a Production-Ready Tokenizer for Garhwali (GBM): Design, Training, and Benchmarks
TL;DR: I built a 128K-vocabulary unigram tokenizer for Garhwali (ISO 639-3: GBM) using SentencePiece, end to end—optimized for Devanagari script and mixed-language text. It powers the pahadi.ai platform for Garhwali LLM downstream tasks. The tokenizer achieves ~2.2M tokens/sec encoding speed, 2.66× compression, and 98.5% round-trip accuracy—competitive with or better than several general-purpose tokenizers on my evaluation set. This post explains the SentencePiece unigram model with mathematics, then walks through the design, training pipeline (from the gbm-tokenizer repo), evaluation methodology, and how to use it. Try it here: gbm-tokenizer.vercel.app.
Why a dedicated tokenizer for Garhwali?
Large language models are only as good as their tokenizers. Subword tokenizers (BPE, unigram, etc.) are trained on corpora; when that corpus is mostly English or a mix of high-resource languages, low-resource languages get the short end of the stick. Garhwali (GBM) is a Central Pahari language spoken in Uttarakhand, India, written primarily in Devanagari. Although it shares the Devanagari script with Hindi, it is a distinct language with its own phonology, grammar, and vocabulary, not a dialect of Hindi. For example, Garhwali has a retroflex lateral (ळ /ɭ/) and vowel distinctions (e.g. vowel length, different inventories) that differ from standard Hindi; its allophony and assimilation patterns also diverge. So tokenizers trained only on Hindi, or on "Indic" corpora dominated by Hindi, learn subword units that don't align well with Garhwali's natural morphemes and character sequences. General-purpose tokenizers (GPT-4, Llama, BERT) are even less suited: they are not trained on Garhwali at all. The result is that these tokenizers:
- Over-segment Devanagari text (more tokens per word → higher cost, weaker semantics)
- Under-segment or mis-segment mixed Garhwali–English content
- Fail round-trip on rare characters or script mixes
A tokenizer trained on a Garhwali-focused corpus can learn stable subword units for the script and language while still handling English, numbers, and symbols—making it suitable for Garhwali NLP, mixed-language apps, and future Garhwali LLMs. I trained this tokenizer for the pahadi.ai platform, which targets Garhwali LLM downstream tasks.
SentencePiece unigram in a nutshell
I used SentencePiece with unigram (not BPE). This section gives a compact mathematical view of how unigram tokenization works.
Vocabulary and segmentations
Let $V$ denote the vocabulary: a finite set of subword pieces (e.g. characters, substrings). Each piece $x \in V$ has an associated probability $p(x)$, with $\sum_{x \in V} p(x) = 1$.
A segmentation of a string $s$ is a sequence of pieces $\mathbf{x} = (x_1, \dots, x_n)$ with each $x_i \in V$, such that concatenation gives back $s$:

$$x_1 x_2 \cdots x_n = s$$

I denote by $S(s)$ the set of all segmentations of $s$ that use only pieces from $V$.
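To make $S(s)$ concrete, here is a brute-force enumeration of all segmentations of a string over a toy vocabulary. The pieces below are hypothetical, chosen only to illustrate the definition:

```python
def segmentations(s, vocab):
    """Enumerate all segmentations of s using only pieces from vocab."""
    if s == "":
        return [[]]
    results = []
    for end in range(1, len(s) + 1):
        piece = s[:end]
        if piece in vocab:
            # Prepend this piece to every segmentation of the remainder
            for rest in segmentations(s[end:], vocab):
                results.append([piece] + rest)
    return results

# Toy vocabulary of hypothetical pieces, for illustration only
V = {"g", "a", "r", "h", "w", "l", "i", "gar", "garh", "hwali", "wal"}
segs = segmentations("garhwali", V)
print(len(segs))  # number of distinct segmentations in S("garhwali")
```

Real tokenizers never enumerate $S(s)$ explicitly (it grows exponentially); the Viterbi dynamic program below finds the best segmentation without listing them.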
Unigram language model
The unigram model assumes that the probability of a segmentation is the product of the probabilities of its pieces (independence assumption):

$$P(\mathbf{x}) = \prod_{i=1}^{n} p(x_i)$$

So for a given string $s$, the likelihood of a segmentation $\mathbf{x} = (x_1, \dots, x_n) \in S(s)$ is

$$P(\mathbf{x} \mid s) = \prod_{i=1}^{n} p(x_i)$$
Decoding: maximum likelihood segmentation
Encoding (tokenization) chooses the segmentation that maximizes this likelihood:

$$\mathbf{x}^* = \arg\max_{\mathbf{x} \in S(s)} \prod_{i=1}^{n} p(x_i)$$

Equivalently, one maximizes the log-likelihood (sum of log-probabilities):

$$\mathbf{x}^* = \arg\max_{\mathbf{x} \in S(s)} \sum_{i=1}^{n} \log p(x_i)$$

SentencePiece computes this with the Viterbi algorithm: dynamic programming over positions in $s$, so decoding is $O(|s| \cdot L)$, where $L$ is the maximum piece length.
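The Viterbi decode can be sketched in a few lines. This is a minimal re-implementation of the idea, not SentencePiece's actual code, and the log-probabilities below are made-up values for illustration:

```python
import math

def viterbi_segment(s, logp):
    """Maximum-likelihood segmentation under a unigram model (Viterbi).

    logp maps piece -> log p(piece). The toy values below are illustrative,
    not probabilities from a trained model.
    """
    L = max(len(p) for p in logp)            # maximum piece length
    best = [-math.inf] * (len(s) + 1)        # best[i]: best log-prob of s[:i]
    back = [None] * (len(s) + 1)             # back[i]: last piece on best path
    best[0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - L), i):
            piece = s[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i], back[i] = best[j] + logp[piece], piece
    pieces, i = [], len(s)
    while i > 0:                             # walk backpointers to recover pieces
        pieces.append(back[i])
        i -= len(back[i])
    return pieces[::-1]

# Hypothetical vocabulary: multi-character pieces plus character fallbacks
logp = {"token": math.log(0.30), "izer": math.log(0.20)}
logp.update({c: math.log(0.05) for c in "tokenizer"})
print(viterbi_segment("tokenizer", logp))  # ['token', 'izer']
```

The two large pieces win because $\log 0.30 + \log 0.20$ beats the sum of nine character-level log-probabilities; this is exactly the $O(|s| \cdot L)$ dynamic program described above.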
Training: vocabulary and probabilities
Training takes a raw corpus and (1) builds a vocabulary $V$ of a target size $k$ (e.g. $k = 128{,}000$), and (2) assigns probabilities $p(x)$ to each $x \in V$. SentencePiece does this with an Expectation–Maximization (EM) style procedure:
- E-step: For each sentence, compute (or approximate) the expected counts of each piece over segmentations, given the current $p$.
- M-step: Update counts and re-estimate $p(x)$ (and optionally prune low-probability pieces to keep $|V|$ near $k$).
After training, one has a fixed $V$ and $p$; at inference the argmax segmentation (Viterbi) is used as above.
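The loop structure can be sketched with a simplified "hard-EM" variant: instead of full expected counts from a forward–backward pass over the segmentation lattice (which SentencePiece actually uses, together with seeding and pruning), this sketch counts pieces from the single Viterbi segmentation. It shows the E-step/M-step alternation, nothing more:

```python
import math
from collections import Counter

def viterbi(s, logp, max_len=8):
    """Best segmentation of s under current piece log-probabilities."""
    best = [-math.inf] * (len(s) + 1)
    back = [None] * (len(s) + 1)
    best[0] = 0.0
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            piece = s[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i], back[i] = best[j] + logp[piece], piece
    out, i = [], len(s)
    while i > 0:
        out.append(back[i])
        i -= len(back[i])
    return out[::-1]

def hard_em(corpus, vocab, iters=5):
    """Hard-EM sketch: Viterbi counts stand in for full expected counts."""
    logp = {p: -math.log(len(vocab)) for p in vocab}   # uniform initialization
    for _ in range(iters):
        counts = Counter()                             # E-step (hard)
        for s in corpus:
            counts.update(viterbi(s, logp))
        total = sum(counts.values())                   # M-step (smoothed)
        for p in vocab:
            logp[p] = math.log((counts[p] + 1e-3) / (total + 1e-3 * len(vocab)))
    return logp

# Toy corpus and vocabulary (characters plus a few hand-picked pieces)
corpus = ["unigram model", "unigram training"]
vocab = set("unigram model training") | {"unigram", "model", "train", "ing"}
logp = hard_em(corpus, vocab)
print(viterbi("unigram training", logp))
```

In the real trainer the E-step also handles segmentation ambiguity probabilistically, and pruning repeatedly shrinks an over-complete seed vocabulary down to $k$.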
Why unigram (vs BPE)?
- Unigram: Probabilistic; each piece has a probability $p(x)$, and encoding is the argmax over products of piece probabilities. Good for mixed scripts and for pruning rare pieces during training.
- BPE: Greedy merge rules with no explicit probabilities; encoding is deterministic given the merge order. Unigram often gives better compression and round-trip behavior when a fixed vocabulary size and byte fallback are needed.
I set $k = 128{,}000$ and used byte fallback so any Unicode character can be represented: rare characters are encoded as byte sequences, preserving round-trip.
Design choices
Algorithm: Unigram (SentencePiece)
As above: I used a unigram language model over a vocabulary of size $k = 128{,}000$, with encoding as the $\arg\max$ of $\prod_i p(x_i)$ (equivalently $\sum_i \log p(x_i)$), implemented via Viterbi in SentencePiece.
- Language-agnostic tokenization (no whitespace dependency), which fits Devanagari and mixed text.
- Configurable vocab size, character coverage, and normalization (e.g. NFKC).
- Byte fallback so any Unicode character can be represented and round-trip is possible.
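Byte fallback is easy to see concretely. SentencePiece writes byte pieces as `<0xNN>` tokens; for a character missing from the learned vocabulary, the tokenizer falls back to its UTF-8 bytes. For the Garhwali retroflex lateral ळ (U+0933):

```python
# The UTF-8 encoding of ळ (U+0933) is the three bytes E0 A4 B3, so the
# byte-fallback pieces for it would be <0xE0> <0xA4> <0xB3>.
ch = "ळ"
byte_pieces = [f"<0x{b:02X}>" for b in ch.encode("utf-8")]
print(byte_pieces)  # ['<0xE0>', '<0xA4>', '<0xB3>']
```

Decoding simply reassembles the bytes and re-decodes them as UTF-8, which is why round-trip survives characters the model has never seen.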
Script and language
- Primary script: Devanagari (Garhwali, Hindi, Sanskrit).
- Secondary: Latin (English), digits, punctuation, symbols.
- Goal: One tokenizer for Garhwali, English, and mixed content (e.g. code-switched sentences, math, dates).
So the design needs high character coverage, stable round-trip, and no script bias in segmentation.
Training pipeline
The training methodology follows the gbm-tokenizer repository: a single script train.py that calls SentencePiece’s trainer with a fixed set of hyperparameters.
Corpus
- Single concatenated corpus (e.g. one text file).
- Scale: 621K+ lines of Garhwali-centric and mixed text (actual size depends on the data).
- Recommendation: Deduplicate, normalize line endings, and keep a mix of Garhwali, Hindi, English, and symbols so the tokenizer sees real usage patterns.
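A minimal sketch of that corpus cleanup, assuming NFKC as the normalization (matching the tokenizer config below); the repo's actual preprocessing may differ:

```python
import unicodedata

def preprocess(lines):
    """NFKC-normalize, strip line endings, drop blanks, and deduplicate
    while preserving order."""
    seen, out = set(), []
    for line in lines:
        line = unicodedata.normalize("NFKC", line.rstrip("\r\n"))
        if line and line not in seen:
            seen.add(line)
            out.append(line)
    return out

raw = ["गढ़वाळि भाषा\r\n", "गढ़वाळि भाषा\n", "Hello from Uttarakhand\n", "\n"]
clean = preprocess(raw)
print(len(clean))                # deduplicated, non-blank line count
print(len(set("".join(clean)))) # unique characters (useful when choosing vocab size)
```

Logging the unique character count here is cheap and directly informs the vocabulary-size rule of thumb discussed later.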
Configuration (high level)
In train.py I called SentencePiece with:
- `model_type=unigram`, `vocab_size=128000`
- `character_coverage=1.0` — use all Unicode characters seen (no dropping of rare scripts).
- `byte_fallback=true` — encode any character via byte sequences so $\mathrm{decode}(\mathrm{encode}(s)) = s$ (up to normalization).
- `normalization_rule_name=nmt_nfkc` — standard Unicode NFKC normalization.
- `remove_extra_whitespaces=false` — preserve spaces/tabs/newlines.
- `split_by_unicode_script=true` — lets the model learn different segmentations per script (Devanagari vs Latin).
- Special token ids: `pad_id=0`, `unk_id=1`, `bos_id=2`, `eos_id=3`.
Rule of thumb: choosing the vocabulary size relative to the number of unique characters in the corpus often gives a good tradeoff between fertility and coverage. I logged the unique character count before training to inform this.
Evaluation methodology
I evaluated using the same metrics and test harness as in the gbm-tokenizer repo (eval.py). Denote by $T$ a test set of strings and by $\mathrm{enc}$, $\mathrm{dec}$ the tokenizer's encode/decode maps.
- Round-trip accuracy: for each $s \in T$, check whether $\mathrm{dec}(\mathrm{enc}(s))$ equals $s$ (or its normalized form):

$$\mathrm{RT} = \frac{1}{|T|} \sum_{s \in T} \mathbf{1}\left[\mathrm{dec}(\mathrm{enc}(s)) = \tilde{s}\right]$$

where $\tilde{s}$ is $s$ after the same normalization applied by the tokenizer (e.g. NFKC).

- Compression ratio: characters per token (higher = fewer tokens for the same text):

$$\mathrm{CR} = \frac{\sum_{s \in T} |s|}{\sum_{s \in T} |\mathrm{enc}(s)|}$$

- Fertility: tokens per word (lower = fewer tokens per word, i.e. more semantic content per token). If $w(s)$ is the word count of $s$ (e.g. whitespace-split):

$$\mathrm{F} = \frac{\sum_{s \in T} |\mathrm{enc}(s)|}{\sum_{s \in T} w(s)}$$

- Speed: tokens per second when encoding $T$:

$$\mathrm{Speed} = \frac{\sum_{s \in T} |\mathrm{enc}(s)|}{t}$$

where $t$ is the wall-clock time for encoding all of $T$.
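These four formulas translate directly into code. A minimal harness, assuming only that a tokenizer exposes encode/decode functions (a toy whitespace tokenizer stands in for the real model; the repo's `eval.py` may differ in details):

```python
import time

def evaluate(test_set, enc, dec, normalize=lambda s: s):
    """Round-trip accuracy, compression, fertility, and speed for any
    tokenizer exposing enc/dec maps."""
    t0 = time.perf_counter()
    encoded = [enc(s) for s in test_set]     # timed region: encoding only
    elapsed = time.perf_counter() - t0
    tokens = chars = words = hits = 0
    for s, toks in zip(test_set, encoded):
        hits += dec(toks) == normalize(s)    # round-trip check
        tokens += len(toks)
        chars += len(s)
        words += len(s.split())
    return {
        "round_trip": hits / len(test_set),
        "compression": chars / tokens,
        "fertility": tokens / words,
        "tokens_per_sec": tokens / max(elapsed, 1e-9),
    }

# Toy whitespace tokenizer standing in for the real SentencePiece model
enc = lambda s: s.split(" ")
dec = lambda toks: " ".join(toks)
m = evaluate(["one two three", "गढ़वाळि भाषा"], enc, dec)
print(m["round_trip"], m["compression"], m["fertility"])
```

Swapping in a real tokenizer only means replacing `enc`/`dec` (and passing the tokenizer's normalizer as `normalize`).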
Test set
- Size: ~238 lines after stripping comments/blanks (from `eval.txt` in the repo).
- Contents: English, Devanagari, Garhwali, mixed sentences, numbers, punctuation, Unicode symbols, whitespace and edge cases.
Comparison tokenizers
I compared against a character-level baseline, GPT-2, GPT-4/Claude, GPT-4o, BERT, Llama 3, Gemma 3, and Sarvam-1. Each is wrapped in a small adapter so I could run the same metrics (encode → decode, compression, fertility, speed) on the same test lines.
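The adapter idea is just a uniform encode/decode surface. A sketch, with illustrative field names (the repo's adapters may be shaped differently), showing the character-level baseline as one concrete instance:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TokenizerAdapter:
    """Uniform encode/decode surface so one benchmark loop can drive
    every tokenizer being compared."""
    name: str
    encode: Callable[[str], List[int]]
    decode: Callable[[List[int]], str]

# The character-level baseline as one concrete adapter:
# one token per Unicode codepoint, so round-trip is trivially exact.
char_adapter = TokenizerAdapter(
    name="character-level",
    encode=lambda s: [ord(c) for c in s],
    decode=lambda ids: "".join(chr(i) for i in ids),
)

sample = "गढ़वाळि 123"
ids = char_adapter.encode(sample)
print(len(ids), char_adapter.decode(ids) == sample)
```

Wrapping tiktoken, Hugging Face tokenizers, or SentencePiece the same way lets the metric harness treat them all identically.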
Results (summary)
Evaluated on 152 test cases from the evaluation set, single-threaded on Apple Silicon (M-series), Python 3.11:
| Tokenizer | Vocab Size | Compression | Round-trip Acc. | Fertility (tokens/word) | Speed |
|---|---|---|---|---|---|
| GBM Tokenizer | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s |
| GPT-4o (o200k) | 199,998 | 2.93× | 100.0% | 1.92 | ~1.2M t/s |
| Gemma 3 | 262,144 | 3.06× | 100.0% | 1.84 | ~0.5M t/s |
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s |
| GPT-4/Claude | 100,256 | 1.77× | 100.0% | 3.18 | ~1.6M t/s |
| Sarvam-1 | 68,096 | 2.31× | 100.0% | 2.44 | ~0.6M t/s |
| BERT | 30,522 | 2.27× | 18.2% | 2.48 | ~0.4M t/s |
| GPT-2 | 50,257 | 1.31× | 100.0% | 4.30 | ~1.8M t/s |
| Character-level | — | 1.00× | 100.0% | 5.63 | ~53M t/s |
Takeaways:
- GBM has the highest encoding speed among subword tokenizers in this benchmark (~2.2M t/s), and better compression than Llama 3 (same vocab size) and much better than GPT-2/GPT-4.
- Round-trip is 98.5%; the few failures are typically normalization or rare Unicode edge cases. BERT's 18.2% reflects its WordPiece vocabulary, which was not designed for arbitrary Unicode or Devanagari.
- Fertility (2.11) is between GPT-4o (1.92) and Llama 3 (2.24), so it’s in the same ballpark as modern LLM tokenizers.
Note: Exact numbers depend on hardware and the exact eval set.
How to use it
Web app
The tokenizer can be accessed at gbm-tokenizer.vercel.app. Encoding and decoding run server-side; the model is never sent to the client.
Hugging Face
The model is also available on Hugging Face at somu9/gbm-tokenizer. You can load it with the Hugging Face ecosystem and use it with SentencePiece.
Lessons and recommendations
- Byte fallback and normalization are critical for 100% coverage and stable round-trip on mixed scripts and symbols.
- Split-by-Unicode-script helps the unigram model learn different segmentations for Devanagari vs Latin instead of forcing one strategy everywhere.
- Preserving whitespace (`remove_extra_whitespaces=false`) matters for code, dates, and structured data in the same tokenizer.
- Vocab size in the 100K–128K range is a good default for a single language plus English/symbols; tune it using the unique character count and fertility on your dev set.
- Evaluation should include round-trip, compression, fertility, and speed on a diverse test set (multiple scripts, numbers, edge cases), as in the gbm-tokenizer `eval.py` pipeline.
Conclusion
I built a Garhwali-oriented, production-friendly tokenizer with SentencePiece (unigram), trained end to end for pahadi.ai and Garhwali LLM downstream tasks. This post expressed the model in terms of a vocabulary $V$, piece probabilities $p(x)$, and maximum-likelihood (Viterbi) segmentation, and described training (EM-style) and evaluation (round-trip, compression, fertility, speed) with explicit formulas. The implementation and methodology live in the gbm-tokenizer repository. The tokenizer performs well on mixed Devanagari–English content and is fast enough for real-time or batch use.
If you’re working on Garhwali NLP, multilingual tokenization, or tokenizer design for Indic scripts, I hope this post and the repository give you a clear blueprint to adapt and extend. I plan to write another blog about the pahadi.ai platform and Garhwali LLM work sometime soon.
References
- GBM Tokenizer (web app) — try the tokenizer in the browser
- SentencePiece — algorithm and implementation
- ISO 639-3: GBM (Garhwali) — more about the Garhwali language
- Hugging Face: somu9/gbm-tokenizer — pre-trained tokenizer