BPE Tokenizer
Train Byte Pair Encoding on your own text and watch subword tokens emerge. Visualize merge steps, vocabulary growth, and token IDs of the kind used by GPT and other modern LLMs.
94 chars · 16 words
Tokenized Output
[10, 21, 10, 21, 10, 21, 10, 21, 10, 21, 9, 2, 16, 0, 21, 9, 2, 16, 0, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 20, 1, 5, 21, 20, 1, 5, 21, 20, 1, 5]
Merge Operations
| Step | Pair | Merged Token | Pair Frequency | Vocab Size |
|---|---|---|---|---|
| #1 | "e"+"s" | "es" | 9 | 12 |
| #2 | "es"+"t" | "est" | 9 | 13 |
| #3 | "est"+"␣" | "est␣" | 9 | 14 |
| #4 | "l"+"o" | "lo" | 7 | 15 |
| #5 | "lo"+"w" | "low" | 7 | 16 |
| #6 | "n"+"e" | "ne" | 6 | 17 |
| #7 | "ne"+"w" | "new" | 6 | 18 |
| #8 | "new"+"est␣" | "newest␣" | 6 | 19 |
| #9 | "low"+"␣" | "low␣" | 5 | 20 |
| #10 | "w"+"i" | "wi" | 3 | 21 |
Initial Vocabulary (11)
"␣" "d" "e" "i" "l" "n" "o" "r" "s" "t" "w"
Final Vocabulary (21)
"␣" "d" "e" "es" "est" "est␣" "i" "l" "lo" "low" "low␣" "n" "ne" "new" "newest␣" "o" "r" "s" "t" "w" "wi"
Export (JSON)
{
  "vocab": ["</w>", "d", "e", "es", "est", "est</w>", "i", "l", "lo", "low", "low</w>", "n", "ne", "new", "newest</w>", "o", "r", "s", "t", "w", "wi"],
  "merges": [
    ["e", "s"],
    ["es", "t"],
    ["est", "</w>"],
    ["l", "o"],
    ["lo", "w"],
    ["n", "e"],
    ["ne", "w"],
    ["new", "est</w>"],
    ["low", "</w>"],
    ["w", "i"]
  ],
  "tokens": [
    "low</w>", " ", "low</w>", " ", "low</w>", " ", "low</w>", " ", "low</w>", " ",
    "low", "e", "r", "</w>", " ", "low", "e", "r", "</w>", " ",
    "newest</w>", " ", "newest</w>", " ", "newest</w>", " ", "newest</w>", " ", "newest</w>", " ", "newest</w>", " ",
    "wi", "d", "est</w>", " ", "wi", "d", "est</w>", " ", "wi", "d", "est</w>"
  ],
  "tokenIds": [
    10, 21, 10, 21, 10, 21, 10, 21, 10, 21,
    9, 2, 16, 0, 21, 9, 2, 16, 0, 21,
    14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21,
    20, 1, 5, 21, 20, 1, 5, 21, 20, 1, 5
  ]
}
How BPE Works
Byte Pair Encoding (BPE) is a subword tokenization algorithm used by many modern language models (GPT, RoBERTa, etc.). It starts with a vocabulary of single characters and iteratively merges the most frequent adjacent pair into a new token.
- Pre-tokenize text into words; represent each word as a sequence of characters with an end-of-word marker </w> (shown as ␣ above).
- Count pairs: for every adjacent symbol pair across the corpus, count occurrences.
- Merge the most frequent pair into a single new token, updating every occurrence in the corpus.
- Repeat for N iterations. Each merge expands the vocabulary by exactly one token.
- Encode new text by greedily applying learned merges in the order they were learned.
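The training steps in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the function names (`count_pairs`, `merge_pair`, `train_bpe`) are hypothetical, pre-tokenization is assumed to be a plain whitespace split, and ties between equally frequent pairs are assumed to go to the pair encountered first. The sample text is reconstructed from the `tokens` array in the export.

```python
from collections import Counter

END = "</w>"  # end-of-word marker, as in the export above


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Return a list of ((left, right), frequency) in the order merges were learned."""
    # Assumption: whitespace pre-tokenization; each word becomes characters + END.
    words = Counter(tuple(w) + (END,) for w in text.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        # Assumption: ties go to the pair seen first (dict insertion order).
        best = max(pairs, key=pairs.get)
        merges.append((best, pairs[best]))
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


# Sample text reconstructed from the "tokens" array in the export.
text = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
for step, (pair, freq) in enumerate(train_bpe(text, 10), start=1):
    print(step, pair, freq)
```

Under these tie-breaking assumptions the sketch should learn the same ten pairs, with the same frequencies, as the Merge Operations table. A different tie-breaking rule could reorder steps that share a frequency, such as the first three.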
This tool trains BPE on the text you enter. Try the "Classic example" preset (Sennrich et al. 2016) with 10 merges to see common subwords such as "low", "est", and "new" emerge.
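Once merges are learned, encoding a new word means replaying them in order. The sketch below hard-codes the ten merges and the 21-entry vocabulary from the export above. The helper names (`encode_word`, `to_ids`) are hypothetical, and token IDs are assumed to be positions in the `vocab` array, which is consistent with the exported `tokenIds` for word tokens (for example `low</w>` is 10 and `newest</w>` is 14). It assumes every character of the input already exists in the vocabulary.

```python
# Merges and vocabulary copied from the JSON export above.
MERGES = [
    ("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w"),
    ("n", "e"), ("ne", "w"), ("new", "est</w>"), ("low", "</w>"), ("w", "i"),
]
VOCAB = [
    "</w>", "d", "e", "es", "est", "est</w>", "i", "l", "lo", "low", "low</w>",
    "n", "ne", "new", "newest</w>", "o", "r", "s", "t", "w", "wi",
]
# Assumption: a token's ID is its index in the vocab array.
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode_word(word, merges=MERGES):
    """Split a word into characters plus </w>, then apply each merge in learned order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


def to_ids(tokens):
    """Map tokens to IDs; raises KeyError for a token outside the vocabulary."""
    return [TOKEN_TO_ID[t] for t in tokens]


# "lowest" never appears in the training text, but it splits into known subwords.
tokens = encode_word("lowest")
print(tokens, to_ids(tokens))  # ['low', 'est</w>'] [9, 5]
```

Words seen during training encode the same way as in the export: "lower" becomes `low`, `e`, `r`, `</w>`, and "widest" becomes `wi`, `d`, `est</w>`.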