BPE Tokenizer
Train Byte Pair Encoding on your own text and watch subword tokens emerge — visualize merge steps, vocabulary growth, and token IDs like GPT and modern LLMs use
94 chars · 16 words
Tokenized Output
[10, 21, 10, 21, 10, 21, 10, 21, 10, 21, 9, 2, 16, 0, 21, 9, 2, 16, 0, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 20, 1, 5, 21, 20, 1, 5, 21, 20, 1, 5]Merge Operations
| Step | Pair | Merged Token | Pair Frequency | Vocab Size |
|---|---|---|---|---|
| #1 | "e"+"s" | "es" | 9 | 12 |
| #2 | "es"+"t" | "est" | 9 | 13 |
| #3 | "est"+"␣" | "est␣" | 9 | 14 |
| #4 | "l"+"o" | "lo" | 7 | 15 |
| #5 | "lo"+"w" | "low" | 7 | 16 |
| #6 | "n"+"e" | "ne" | 6 | 17 |
| #7 | "ne"+"w" | "new" | 6 | 18 |
| #8 | "new"+"est␣" | "newest␣" | 6 | 19 |
| #9 | "low"+"␣" | "low␣" | 5 | 20 |
| #10 | "w"+"i" | "wi" | 3 | 21 |
Initial Vocabulary (11)
"␣""d""e""i""l""n""o""r""s""t""w"Final Vocabulary (21)
"␣""d""e""es""est""est␣""i""l""lo""low""low␣""n""ne""new""newest␣""o""r""s""t""w""wi"Export (JSON)
{
"vocab": [
"</w>",
"d",
"e",
"es",
"est",
"est</w>",
"i",
"l",
"lo",
"low",
"low</w>",
"n",
"ne",
"new",
"newest</w>",
"o",
"r",
"s",
"t",
"w",
"wi"
],
"merges": [
[
"e",
"s"
],
[
"es",
"t"
],
[
"est",
"</w>"
],
[
"l",
"o"
],
[
"lo",
"w"
],
[
"n",
"e"
],
[
"ne",
"w"
],
[
"new",
"est</w>"
],
[
"low",
"</w>"
],
[
"w",
"i"
]
],
"tokens": [
"low</w>",
" ",
"low</w>",
" ",
"low</w>",
" ",
"low</w>",
" ",
"low</w>",
" ",
"low",
"e",
"r",
"</w>",
" ",
"low",
"e",
"r",
"</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"wi",
"d",
"est</w>",
" ",
"wi",
"d",
"est</w>",
" ",
"wi",
"d",
"est</w>"
],
"tokenIds": [
10,
21,
10,
21,
10,
21,
10,
21,
10,
21,
9,
2,
16,
0,
21,
9,
2,
16,
0,
21,
14,
21,
14,
21,
14,
21,
14,
21,
14,
21,
14,
21,
20,
1,
5,
21,
20,
1,
5,
21,
20,
1,
5
]
}How BPE Works
Byte Pair Encoding (BPE) is a subword tokenization algorithm used by many modern language models (GPT, RoBERTa, etc.). It starts with a vocabulary of single characters and iteratively merges the most frequent adjacent pair into a new token.
- Pre-tokenize text into words; represent each word as a sequence of characters with an end-of-word marker
</w>(shown as ␣ above). - Count pairs: for every adjacent symbol pair across the corpus, count occurrences.
- Merge the most frequent pair into a single new token, updating every occurrence in the corpus.
- Repeat for N iterations. Each merge expands the vocabulary by exactly one token.
- Encode new text by greedily applying learned merges in the order they were learned.
This implementation trains BPE on the text you enter — try the "Classic example" preset (Sennrich et al. 2016) and 10 merges to see common subwords like "low", "est", "new" emerge.
继续探索
您可能喜欢的其他 文本工具…
密码工具
使用凯撒、维吉尼亚、ROT13 和阿特巴什经典密码加密和解密文本 — 带实时字母表预览
查找与替换
支持正则表达式、大小写敏感、全词匹配和实时预览的文本查找替换工具
去除HTML标签
从文本中去除所有HTML标签 — 可选保留链接、解码实体、保持换行、移除script/style块
Unicode字符检查器
检查文本中的每个字符 — 查看码点、UTF-8/UTF-16编码、HTML实体、分类,检测不可见字符和零宽空格
可读性评分
计算任意文本的 Flesch 易读性、Flesch-Kincaid 年级、Gunning Fog、SMOG 和 ARI 分数
大小写转换器
在大写、小写、标题格式、驼峰命名、下划线命名等之间转换文本
Lorem Ipsum生成器
生成段落、句子或单词形式的Lorem Ipsum占位文本
文本差异
查找两段文本之间的差异,提供字符和行级别高亮