Building a Production-Ready Tokenizer for Garhwali (GBM): Design, Training, and Benchmarks

TL;DR: I built a 128K-vocabulary unigram tokenizer for Garhwali (ISO 639-3: GBM) using SentencePiece, end to end—optimized for Devanagari script and mixed-language text. It powers the pahadi.ai platform for Garhwali LLM downstream tasks. The tokenizer achieves ~2.2M tokens/sec encoding speed, 2.66× compression, and 98.5% round-trip accuracy—competitive with or better than several general-purpose tokenizers on my evaluation set. This post explains the SentencePiece unigram model with mathematics, then walks through the design, training pipeline (from the gbm-tokenizer repo), evaluation methodology, and how to use it. Try it here: gbm-tokenizer.vercel.app.


Why a dedicated tokenizer for Garhwali?

Large language models are only as good as their tokenizers. Subword tokenizers (BPE, unigram, etc.) are trained on corpora; when that corpus is mostly English or a mix of high-resource languages, low-resource languages get the short end of the stick. Garhwali (GBM) is a Central Pahari language spoken in Uttarakhand, India, written primarily in Devanagari. Although it shares the Devanagari script with Hindi, it is a distinct language with its own phonology, grammar, and vocabulary—not a dialect of Hindi. For example, Garhwali has a retroflex lateral (ळ /ɭ/) and vowel distinctions (e.g. vowel length and inventory) that differ from standard Hindi; its allophony and assimilation patterns also diverge. So tokenizers trained only on Hindi, or on "Indic" corpora dominated by Hindi, learn subword units that don't align well with Garhwali's natural morphemes and character sequences. General-purpose tokenizers (GPT-4, Llama, BERT) are even less suited: they are not trained on Garhwali at all. In practice, these tokenizers:

  • Over-segment Devanagari text (more tokens per word → higher cost, weaker semantics)
  • Under-segment or mis-segment mixed Garhwali–English content
  • Fail round-trip on rare characters or script mixes

A tokenizer trained on a Garhwali-focused corpus can learn stable subword units for the script and language while still handling English, numbers, and symbols—making it suitable for Garhwali NLP, mixed-language apps, and future Garhwali LLMs. I trained this tokenizer for the pahadi.ai platform, which targets Garhwali LLM downstream tasks.


SentencePiece unigram in a nutshell

I used SentencePiece with unigram (not BPE). This section gives a compact mathematical view of how unigram tokenization works.

Vocabulary and segmentations

Let $\mathcal{V}$ denote the vocabulary: a finite set of subword pieces (e.g. characters, substrings). Each piece $x \in \mathcal{V}$ has an associated probability $p(x) \geq 0$, with $\sum_{x \in \mathcal{V}} p(x) = 1$.

A segmentation of a string $X$ is a sequence of pieces $(x_1, x_2, \ldots, x_M)$ such that concatenation gives back $X$:

$$X = x_1 \cdot x_2 \cdots x_M.$$

I denote by $\mathcal{S}(X; \mathcal{V})$ the set of all segmentations of $X$ that use only pieces from $\mathcal{V}$.

Unigram language model

The unigram model assumes that the probability of a segmentation is the product of the probabilities of its pieces (independence assumption):

$$P(x_1, \ldots, x_M) = \prod_{i=1}^{M} p(x_i).$$

So for a given string $X$, the likelihood of a segmentation $(x_1, \ldots, x_M) \in \mathcal{S}(X; \mathcal{V})$ is

$$P(X \mid \text{segmentation}) = \prod_{i=1}^{M} p(x_i).$$

Decoding: maximum likelihood segmentation

Encoding (tokenization) chooses the segmentation that maximizes this likelihood:

$$(x_1^*, \ldots, x_M^*) = \underset{(x_1, \ldots, x_M) \in \mathcal{S}(X; \mathcal{V})}{\arg\max} \; \prod_{i=1}^{M} p(x_i).$$

Equivalently, one maximizes the log-likelihood (sum of log-probabilities):

$$(x_1^*, \ldots, x_M^*) = \underset{(x_1, \ldots, x_M) \in \mathcal{S}(X; \mathcal{V})}{\arg\max} \; \sum_{i=1}^{M} \log p(x_i).$$

SentencePiece computes this with the Viterbi algorithm: dynamic programming over positions in $X$, so decoding is $O(|X| \cdot L)$, where $L$ is the maximum piece length.
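To make the decoding step concrete, here is a minimal pure-Python sketch of Viterbi segmentation under a unigram model. The vocabulary and probabilities below are invented for illustration; SentencePiece's production implementation is optimized C++:

```python
import math

def viterbi_segment(text, log_probs):
    """Return the maximum-likelihood segmentation of `text` under a unigram
    model, via dynamic programming over end positions (O(|X| * L))."""
    max_len = max(len(p) for p in log_probs)      # L: longest piece
    best = [0.0] + [-math.inf] * len(text)        # best[i]: best score of text[:i]
    back = [0] * (len(text) + 1)                  # back[i]: start of last piece
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in log_probs and best[j] + log_probs[piece] > best[i]:
                best[i], back[i] = best[j] + log_probs[piece], j
    pieces, i = [], len(text)
    while i > 0:                                  # follow back-pointers
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy vocabulary with invented probabilities; real pieces come from training.
vocab = {"गढ़": 0.3, "वाळि": 0.15, "वा": 0.2, "ळि": 0.2, "ग": 0.1, "ढ़": 0.05}
lp = {piece: math.log(p) for piece, p in vocab.items()}
print(viterbi_segment("गढ़वाळि", lp))  # the two-piece split wins: ['गढ़', 'वाळि']
```

Note how the larger piece "गढ़" beats the character-by-character path: $0.3 \times 0.15$ is far larger than the product of four small character probabilities.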

Training: vocabulary and probabilities

Training takes a raw corpus and (1) builds a vocabulary $\mathcal{V}$ of size $K$ (e.g. $K = 128\,000$), and (2) assigns probabilities $p(x)$ to each $x \in \mathcal{V}$. SentencePiece does this with an Expectation–Maximization (EM) style procedure:

  • E-step: For each sentence, compute (or approximate) segmentations given the current $p(x)$.
  • M-step: Update counts and re-estimate $p(x)$ (and optionally prune low-probability pieces to keep $|\mathcal{V}| = K$).

After training, one has a fixed $\mathcal{V}$ and $p(x)$; at inference, the argmax segmentation (Viterbi) is used as above.
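SentencePiece's real E-step computes expected piece counts over the full segmentation lattice; as a simplified illustration, the sketch below uses a "hard-EM" variant that counts only the Viterbi-best segmentation per line. The corpus, vocabulary, and smoothing constant are toy values of my own:

```python
import math
from collections import Counter

def viterbi(text, log_probs, max_len):
    """Max-likelihood segmentation of `text` under the current unigram model."""
    best = [0.0] + [-math.inf] * len(text)
    back = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in log_probs and best[j] + log_probs[piece] > best[i]:
                best[i], back[i] = best[j] + log_probs[piece], j
    pieces, i = [], len(text)
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

def hard_em(corpus, vocab, iters=3, eps=1e-6):
    """Alternate a hard E-step (Viterbi segmentation of every line) with an
    M-step (renormalize piece counts, lightly smoothed so no piece dies)."""
    max_len = max(len(p) for p in vocab)
    log_probs = {p: -math.log(len(vocab)) for p in vocab}     # uniform init
    for _ in range(iters):
        counts = Counter()
        for line in corpus:
            counts.update(viterbi(line, log_probs, max_len))  # hard E-step
        total = sum(counts.values())
        log_probs = {p: math.log((counts[p] + eps) / (total + eps * len(vocab)))
                     for p in vocab}                          # M-step
    return log_probs
```

After a few iterations on a corpus full of "abab"-style lines, the piece "ab" absorbs most of the probability mass, which is exactly the behavior that lets frequent morphemes become single tokens.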

Why unigram (vs BPE)?

  • Unigram: Probabilistic; each piece has a probability $p(x)$, and encoding is the argmax over segmentations of the product of piece probabilities. It handles mixed scripts well, and rare pieces can be pruned during training.
  • BPE: A greedy merge rule with no explicit probabilities; encoding is deterministic given the merge order. Unigram often gives better compression and round-trip behavior when a fixed $K$ and byte fallback are needed.

I set $K = 128\,000$ and used byte fallback so any Unicode character can be represented: rare characters are encoded as byte sequences, preserving round-trip.
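A toy illustration of how byte fallback preserves round-trip. The `<0xNN>` piece naming mirrors SentencePiece's byte pieces, but the character-level vocabulary here is invented:

```python
def encode_with_byte_fallback(text, vocab):
    """Emit vocabulary pieces where possible; otherwise fall back to one
    pseudo-piece per UTF-8 byte, so no character is ever lost."""
    out = []
    for ch in text:
        if ch in vocab:
            out.append(ch)
        else:
            out.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return out

def decode_with_byte_fallback(pieces):
    """Inverse map: collect <0xNN> runs back into bytes, then decode them."""
    buf, out = bytearray(), []
    for p in pieces:
        if p.startswith("<0x") and p.endswith(">"):
            buf.append(int(p[3:-1], 16))       # accumulate raw bytes
        else:
            if buf:
                out.append(buf.decode("utf-8"))
                buf = bytearray()
            out.append(p)
    if buf:
        out.append(buf.decode("utf-8"))
    return "".join(out)

# ळ (U+0933) is missing from the toy vocab, so it becomes three byte pieces,
# yet decoding still reproduces the input exactly.
tokens = encode_with_byte_fallback("aळ", {"a"})
print(tokens)  # ['a', '<0xE0>', '<0xA4>', '<0xB3>']
```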


Design choices

Algorithm: Unigram (SentencePiece)

As above: I used a unigram language model over a vocabulary $\mathcal{V}$ of size $K = 128\,000$, with encoding as the $\arg\max$ of $\prod_i p(x_i)$ (equivalently, $\sum_i \log p(x_i)$), implemented via Viterbi in SentencePiece.

  • Language-agnostic tokenization (no whitespace dependency), which fits Devanagari and mixed text.
  • Configurable vocab size, character coverage, and normalization (e.g. NFKC).
  • Byte fallback so any Unicode character can be represented and round-trip is possible.

Script and language

  • Primary script: Devanagari (Garhwali, Hindi, Sanskrit).
  • Secondary: Latin (English), digits, punctuation, symbols.
  • Goal: One tokenizer for Garhwali, English, and mixed content (e.g. code-switched sentences, math, dates).

So the design needs high character coverage, stable round-trip, and no script bias in segmentation.


Training pipeline

The training methodology follows the gbm-tokenizer repository: a single script train.py that calls SentencePiece’s trainer with a fixed set of hyperparameters.

Corpus

  • Single concatenated corpus (e.g. one text file).
  • Scale: 621K+ lines of Garhwali-centric and mixed text (actual size depends on the data).
  • Recommendation: Deduplicate, normalize line endings, and keep a mix of Garhwali, Hindi, English, and symbols so the tokenizer sees real usage patterns.
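The deduplication and normalization recommendation can be sketched with the standard library alone. The NFKC step here mirrors the trainer's normalization rule; the sample lines are illustrative:

```python
import unicodedata

def clean_corpus(lines):
    """Strip line endings, apply NFKC (matching the trainer's nmt_nfkc rule),
    and drop blank lines and exact duplicates, keeping first-seen order."""
    seen, out = set(), []
    for line in lines:
        line = unicodedata.normalize("NFKC", line.rstrip("\r\n"))
        if line and line not in seen:
            seen.add(line)
            out.append(line)
    return out

raw = ["गढ़वाळि भाषा\r\n", "गढ़वाळि भाषा\n", "Hello from Uttarakhand\n", "\n"]
print(clean_corpus(raw))  # two unique, non-blank lines survive
```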

Configuration (high level)

In train.py I called SentencePiece with:

  • model_type=unigram, vocab_size=128000
  • character_coverage=1.0 — use all Unicode characters seen (no dropping of rare scripts).
  • byte_fallback=true — encode any character via byte sequences so $\mathrm{decode}(\mathrm{encode}(X)) = X$ (up to normalization).
  • normalization_rule_name=nmt_nfkc — standard Unicode NFKC normalization.
  • remove_extra_whitespaces=false — preserve spaces/tabs/newlines.
  • split_by_unicode_script=true — lets the model learn different segmentations per script (Devanagari vs Latin).
  • Special token ids: pad_id=0, unk_id=1, bos_id=2, eos_id=3.
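With these settings, the trainer call looks roughly like the following config sketch (the input and model_prefix paths are illustrative, not the repo's actual values; training requires the sentencepiece package and a real corpus file):

```python
import sentencepiece as spm

# Hyperparameters mirror the list above; paths are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                 # concatenated training corpus
    model_prefix="gbm",                 # writes gbm.model and gbm.vocab
    model_type="unigram",
    vocab_size=128000,
    character_coverage=1.0,             # keep every script seen in the corpus
    byte_fallback=True,                 # rare characters become byte pieces
    normalization_rule_name="nmt_nfkc",
    remove_extra_whitespaces=False,     # preserve spaces/tabs/newlines
    split_by_unicode_script=True,       # Devanagari vs Latin handled separately
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
)
```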

Rule of thumb: $K \approx 5\times$–$10\times$ the number of unique characters in the corpus often gives a good tradeoff between fertility and coverage. I logged the unique character count before training to inform this.
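The rule of thumb takes only a few lines to check (the sample corpus here is illustrative):

```python
def suggested_vocab_range(corpus_lines):
    """Return (unique character count, (5x, 10x) suggested vocab-size range)."""
    chars = set()
    for line in corpus_lines:
        chars.update(line)
    n = len(chars)
    return n, (5 * n, 10 * n)

n, (low, high) = suggested_vocab_range(["गढ़वाळि", "hello 123"])
print(f"{n} unique characters -> rule-of-thumb K between {low} and {high}")
```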


Evaluation methodology

I evaluated using the same metrics and test harness as in the gbm-tokenizer repo (eval.py). Denote by $\mathcal{T}$ a test set of strings and by $\mathrm{encode}(\cdot)$, $\mathrm{decode}(\cdot)$ the tokenizer’s encode/decode maps.

  1. Round-trip accuracy
    For each $X \in \mathcal{T}$, check whether $\mathrm{decode}(\mathrm{encode}(X))$ equals $X$ (or its normalized form).
    $$\text{accuracy} = \frac{1}{|\mathcal{T}|} \sum_{X \in \mathcal{T}} \mathbf{1}\bigl[\mathrm{decode}(\mathrm{encode}(X)) = X'\bigr],$$
    where $X'$ is $X$ after the same normalization applied by the tokenizer (e.g. NFKC).

  2. Compression ratio
    Characters per token (higher = fewer tokens for the same text):
    $$\text{compression} = \frac{\sum_{X \in \mathcal{T}} |X|}{\sum_{X \in \mathcal{T}} |\mathrm{encode}(X)|}.$$

  3. Fertility
    Tokens per word (lower is better: each token covers more of a word). If $w(X)$ is the word count (e.g. whitespace-split):
    $$\text{fertility} = \frac{\sum_{X \in \mathcal{T}} |\mathrm{encode}(X)|}{\sum_{X \in \mathcal{T}} w(X)}.$$

  4. Speed
    Tokens per second when encoding $\mathcal{T}$:
    $$\text{speed} = \frac{\sum_{X \in \mathcal{T}} |\mathrm{encode}(X)|}{\Delta t},$$
    where $\Delta t$ is the wall-clock time for encoding all $X \in \mathcal{T}$.
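All four metrics fit in one small harness. This is a sketch of my own, not the repo's eval.py, demonstrated here with a character-level tokenizer as the encode/decode pair:

```python
import time

def evaluate(test_set, encode, decode, normalize=lambda s: s):
    """Round-trip accuracy, compression (chars/token), fertility (tokens/word),
    and encoding speed (tokens/sec) for any encode/decode pair."""
    t0 = time.perf_counter()
    encodings = [encode(x) for x in test_set]   # timed region: encoding only
    dt = time.perf_counter() - t0
    chars = sum(len(x) for x in test_set)
    tokens = sum(len(t) for t in encodings)
    words = sum(len(x.split()) for x in test_set)
    exact = sum(decode(t) == normalize(x) for x, t in zip(test_set, encodings))
    return {
        "round_trip": exact / len(test_set),
        "compression": chars / tokens,
        "fertility": tokens / words,
        "speed_tokens_per_sec": tokens / dt if dt > 0 else float("inf"),
    }

# Character-level baseline: encode = list, decode = "".join.
print(evaluate(["गढ़वाळि भाषा", "hello world"], list, "".join))
```

For the character-level baseline, compression is exactly 1.0 and round-trip is 1.0 by construction, matching the last row of the results table below.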

Test set

  • Size: ~238 lines after stripping comments/blanks (from eval.txt in the repo).
  • Contents: English, Devanagari, Garhwali, mixed sentences, numbers, punctuation, Unicode symbols, whitespace and edge cases.

Comparison tokenizers

I compared against a character-level baseline, GPT-2, GPT-4/Claude, GPT-4o, BERT, Llama 3, Gemma 3, and Sarvam-1. Each is wrapped in a small adapter so that the same metrics (encode → decode, compression, fertility, speed) run on the same test lines.
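The adapter idea can be sketched as a minimal common interface. The class and function names below are my own, not the repo's:

```python
from typing import Protocol

class Tokenizer(Protocol):
    """Common interface every wrapped tokenizer must satisfy."""
    def encode(self, text: str) -> list[str]: ...
    def decode(self, tokens: list[str]) -> str: ...

class CharLevelAdapter:
    """Character-level baseline wrapped in the common interface."""
    def encode(self, text: str) -> list[str]:
        return list(text)
    def decode(self, tokens: list[str]) -> str:
        return "".join(tokens)

def round_trip_ok(tok: Tokenizer, text: str) -> bool:
    """Any object with encode/decode can be benchmarked the same way."""
    return tok.decode(tok.encode(text)) == text

print(round_trip_ok(CharLevelAdapter(), "गढ़वाळि + English"))  # True
```

Adapters for GPT-style tokenizers would implement the same two methods over their respective libraries, so the evaluation loop never needs to know which tokenizer it is measuring.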


Results (summary)

Evaluated on 152 test cases from the evaluation set, single-threaded on Apple Silicon (M-series), Python 3.11:

| Tokenizer | Vocab Size | Compression | Round-trip Acc. | Fertility (tokens/word) | Speed |
|---|---|---|---|---|---|
| GBM Tokenizer | 128,000 | 2.66× | 98.5% | 2.11 | ~2.2M t/s |
| GPT-4o (o200k) | 199,998 | 2.93× | 100.0% | 1.92 | ~1.2M t/s |
| Gemma 3 | 262,144 | 3.06× | 100.0% | 1.84 | ~0.5M t/s |
| Llama 3 | 128,000 | 2.51× | 99.5% | 2.24 | ~0.4M t/s |
| GPT-4/Claude | 100,256 | 1.77× | 100.0% | 3.18 | ~1.6M t/s |
| Sarvam-1 | 68,096 | 2.31× | 100.0% | 2.44 | ~0.6M t/s |
| BERT | 30,522 | 2.27× | 18.2% | 2.48 | ~0.4M t/s |
| GPT-2 | 50,257 | 1.31× | 100.0% | 4.30 | ~1.8M t/s |
| Character-level | — | 1.00× | 100.0% | 5.63 | ~53M t/s |

Takeaways:

  • GBM has the highest encoding speed among subword tokenizers in this benchmark (~2.2M t/s), with better compression than Llama 3 (same vocab size) and much better than GPT-2 and GPT-4/Claude.
  • Round-trip is 98.5%; the few failures are typically normalization or rare-Unicode cases. BERT’s 18.2% reflects its WordPiece design, which was not built for arbitrary Unicode/Devanagari text.
  • Fertility (2.11) sits between GPT-4o (1.92) and Llama 3 (2.24), so it’s in the same ballpark as modern LLM tokenizers.

Note: Exact numbers depend on hardware and the exact eval set.


How to use it

Web app

The tokenizer can be accessed at gbm-tokenizer.vercel.app. Encoding and decoding run server-side; the model is never sent to the client.

Hugging Face

The model is also available on Hugging Face at somu9/gbm-tokenizer. You can load it with the Hugging Face ecosystem and use it with SentencePiece.


Lessons and recommendations

  1. Byte fallback and normalization are critical for 100% coverage and stable round-trip on mixed scripts and symbols.
  2. Split-by-Unicode-script helps the unigram model learn different segmentations for Devanagari vs Latin instead of forcing one strategy everywhere.
  3. Preserving whitespace (remove_extra_whitespaces=false) matters for code, dates, and structured data in the same tokenizer.
  4. Vocab size in the 100K–128K range is a good default for a single language plus English/symbols; tune it using the unique character count and fertility on your dev set.
  5. Evaluation should include round-trip, compression, fertility, and speed on a diverse test set (multiple scripts, numbers, edge cases)—as in the gbm-tokenizer eval.py pipeline.

Conclusion

I built a Garhwali-oriented, production-friendly tokenizer with SentencePiece (unigram), trained end to end for pahadi.ai and Garhwali LLM downstream tasks. This post expressed the model in terms of vocabulary V\mathcal{V}, piece probabilities p(x)p(x), and maximum-likelihood segmentation (Viterbi), and described training (EM-style) and evaluation (round-trip, compression, fertility, speed) with explicit formulas. The implementation and methodology live in the gbm-tokenizer repository. The tokenizer performs well on mixed Devanagari–English content and is fast enough for real-time or batch use.

If you’re working on Garhwali NLP, multilingual tokenization, or tokenizer design for Indic scripts, I hope this post and the repository give you a clear blueprint to adapt and extend. I plan to write another blog about the pahadi.ai platform and Garhwali LLM work sometime soon.

