SentencePiece Tokenizer

Definition

SentencePiece is a language-independent subword tokenization library that implements byte-pair encoding (BPE) and unigram language model tokenization. It trains and tokenizes directly on raw text, treating whitespace as an ordinary symbol, so it needs no language-specific pre-tokenization. It is used by T5, LLaMA, and other models that need consistent tokenization across multilingual training data.

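A trained SentencePiece model encodes text as a sequence of subword units drawn from a fixed-size vocabulary. With BPE, the vocabulary is built by repeatedly merging the most frequent adjacent symbol pairs; with the unigram model, the tokenizer picks the segmentation with the highest probability under a unigram language model. Shorter sequences are a common result of both, not the objective itself.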
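
To make the unigram idea concrete, here is a minimal pure-Python sketch of highest-probability segmentation over raw text. It is not SentencePiece's implementation: the vocabulary `TOY_VOCAB`, its log-probabilities, and the function names `unigram_segment` and `detokenize` are all made up for illustration. The sketch only assumes the two ideas from the definition above: whitespace is kept as an ordinary symbol (written here as "▁"), and the chosen segmentation is the one with the highest total log-probability.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability (values are invented).
TOY_VOCAB = {
    "▁": -4.0,
    "▁hello": -2.0,
    "▁world": -2.5,
    "▁he": -3.0,
    "llo": -3.5,
    "h": -6.0, "e": -6.0, "l": -6.0, "o": -6.0,
    "w": -6.0, "r": -6.0, "d": -6.0,
}


def unigram_segment(text, vocab=TOY_VOCAB):
    """Return the highest-scoring segmentation of `text` under `vocab`."""
    # Keep whitespace as a normal symbol instead of splitting on it.
    s = "▁" + text.replace(" ", "▁")
    n = len(s)
    max_len = max(len(piece) for piece in vocab)

    # best[i] = best total log-probability of any segmentation of s[:i]
    # back[i] = start index of the last piece in that segmentation
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = s[start:end]
            if piece in vocab and best[start] + vocab[piece] > best[end]:
                best[end] = best[start] + vocab[piece]
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text contains symbols outside the toy vocabulary")

    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(s[start:end])
        end = start
    return pieces[::-1]


def detokenize(pieces):
    """Invert the segmentation: join pieces and turn "▁" back into spaces."""
    return "".join(pieces).replace("▁", " ").lstrip(" ")


print(unigram_segment("hello world"))  # ['▁hello', '▁world']
print(unigram_segment("hell"))         # ['▁he', 'l', 'l']
print(detokenize(unigram_segment("hello world")))  # hello world
```

Because the space is part of the pieces, joining them and mapping "▁" back to a space recovers the input text, which is what lets this style of tokenizer work on raw text without a separate pre-tokenizer.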

