
BPE Tokenizer

Train Byte Pair Encoding (BPE) on your own text and watch subword tokens emerge. Visualize merge steps, vocabulary growth, and token IDs of the kind used by GPT and other modern LLMs.


Training Corpus

94 chars · 16 words

Number of Merges

10 (slider range: 0 to 100)


Initial Vocabulary

11 (unique chars + </w>)

Final Vocabulary

21 (+10 merges)

Tokens Produced

43 (2.19 chars/token)

Compression Ratio

54.3% vs char tokens
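
The two summary figures can be checked with simple arithmetic. The sketch below assumes that chars/token is the corpus length divided by the token count, and that the compression ratio is one minus tokens divided by characters. Both formulas are inferred from the numbers on this page, not taken from the tool's source. The corpus string is reconstructed from the tokenized output, and the variable names are illustrative.

```python
# Corpus reconstructed from the tokenized output (an assumption):
# 5x "low", 2x "lower", 6x "newest", 3x "widest", separated by single spaces.
corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)

num_chars = len(corpus)   # characters in the corpus, spaces included
num_tokens = 43           # length of the token ID list shown on the page

# Assumed definitions of the two displayed statistics.
chars_per_token = num_chars / num_tokens
compression = 1 - num_tokens / num_chars

print(num_chars)                   # 94
print(f"{chars_per_token:.2f}")    # 2.19
print(f"{compression:.1%}")        # 54.3%
```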

Tokenized Output

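"low␣"#10, " "#21, "low␣"#10, " "#21, "low␣"#10, " "#21, "low␣"#10, " "#21, "low␣"#10, " "#21, "low"#9, "e"#2, "r"#16, "␣"#0, " "#21, "low"#9, "e"#2, "r"#16, "␣"#0, " "#21, "newest␣"#14, " "#21, "newest␣"#14, " "#21, "newest␣"#14, " "#21, "newest␣"#14, " "#21, "newest␣"#14, " "#21, "newest␣"#14, " "#21, "wi"#20, "d"#1, "est␣"#5, " "#21, "wi"#20, "d"#1, "est␣"#5, " "#21, "wi"#20, "d"#1, "est␣"#5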

Token IDs

[10, 21, 10, 21, 10, 21, 10, 21, 10, 21, 9, 2, 16, 0, 21, 9, 2, 16, 0, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 20, 1, 5, 21, 20, 1, 5, 21, 20, 1, 5]
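
The IDs above are consistent with each token being assigned its index in the alphabetically sorted vocabulary from the JSON export, with the space between words receiving the next free ID (21). That mapping is inferred from the numbers on this page and may not match the tool's internals. The names `token_to_id` and `SPACE_ID` are hypothetical.

```python
# The 21 vocabulary entries as they appear in the JSON export.
vocab = sorted([
    "</w>", "d", "e", "es", "est", "est</w>", "i", "l", "lo", "low",
    "low</w>", "n", "ne", "new", "newest</w>", "o", "r", "s", "t", "w", "wi",
])

# Assumption: an ID is the token's position in the sorted vocabulary.
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# Assumption: the word separator is not in the vocabulary and gets the next ID.
SPACE_ID = len(vocab)

def to_ids(tokens):
    """Map token strings to IDs, sending the space separator to SPACE_ID."""
    return [SPACE_ID if tok == " " else token_to_id[tok] for tok in tokens]

print(to_ids(["low</w>", " ", "low", "e", "r", "</w>"]))  # [10, 21, 9, 2, 16, 0]
print(to_ids(["wi", "d", "est</w>"]))                     # [20, 1, 5]
```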

Merge Operations

Step	Pair	Merged Token	Pair Frequency	Vocab Size
#1	`"e"`+`"s"`	`"es"`	9	12
#2	`"es"`+`"t"`	`"est"`	9	13
#3	`"est"`+`"␣"`	`"est␣"`	9	14
#4	`"l"`+`"o"`	`"lo"`	7	15
#5	`"lo"`+`"w"`	`"low"`	7	16
#6	`"n"`+`"e"`	`"ne"`	6	17
#7	`"ne"`+`"w"`	`"new"`	6	18
#8	`"new"`+`"est␣"`	`"newest␣"`	6	19
#9	`"low"`+`"␣"`	`"low␣"`	5	20
#10	`"w"`+`"i"`	`"wi"`	3	21

Initial Vocabulary (11)

"␣", "d", "e", "i", "l", "n", "o", "r", "s", "t", "w"

Final Vocabulary (21)

"␣", "d", "e", "es", "est", "est␣", "i", "l", "lo", "low", "low␣", "n", "ne", "new", "newest␣", "o", "r", "s", "t", "w", "wi"

Export (JSON)

{
  "vocab": [
    "</w>",
    "d",
    "e",
    "es",
    "est",
    "est</w>",
    "i",
    "l",
    "lo",
    "low",
    "low</w>",
    "n",
    "ne",
    "new",
    "newest</w>",
    "o",
    "r",
    "s",
    "t",
    "w",
    "wi"
  ],
  "merges": [
    [
      "e",
      "s"
    ],
    [
      "es",
      "t"
    ],
    [
      "est",
      "</w>"
    ],
    [
      "l",
      "o"
    ],
    [
      "lo",
      "w"
    ],
    [
      "n",
      "e"
    ],
    [
      "ne",
      "w"
    ],
    [
      "new",
      "est</w>"
    ],
    [
      "low",
      "</w>"
    ],
    [
      "w",
      "i"
    ]
  ],
  "tokens": [
    "low</w>",
    " ",
    "low</w>",
    " ",
    "low</w>",
    " ",
    "low</w>",
    " ",
    "low</w>",
    " ",
    "low",
    "e",
    "r",
    "</w>",
    " ",
    "low",
    "e",
    "r",
    "</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "wi",
    "d",
    "est</w>",
    " ",
    "wi",
    "d",
    "est</w>",
    " ",
    "wi",
    "d",
    "est</w>"
  ],
  "tokenIds": [
    10,
    21,
    10,
    21,
    10,
    21,
    10,
    21,
    10,
    21,
    9,
    2,
    16,
    0,
    21,
    9,
    2,
    16,
    0,
    21,
    14,
    21,
    14,
    21,
    14,
    21,
    14,
    21,
    14,
    21,
    14,
    21,
    20,
    1,
    5,
    21,
    20,
    1,
    5,
    21,
    20,
    1,
    5
  ]
}

How BPE Works

Byte Pair Encoding (BPE) is a subword tokenization algorithm used by many modern language models (GPT, RoBERTa, etc.). It starts with a vocabulary of single characters and iteratively merges the most frequent adjacent pair into a new token.

1. Pre-tokenize text into words; represent each word as a sequence of characters with an end-of-word marker </w> (shown as ␣ above).
2. Count pairs: for every adjacent symbol pair across the corpus, count occurrences.
3. Merge the most frequent pair into a single new token, updating every occurrence in the corpus.
4. Repeat for N iterations. Each merge expands the vocabulary by exactly one token.
5. Encode new text by greedily applying learned merges in the order they were learned.
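
The training steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names `train_bpe` and `merge_pair` are hypothetical, and ties between equally frequent pairs are broken in favor of the pair seen first, which is an assumption that happens to reproduce the merge order shown in the Merge Operations table.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(text, num_merges):
    """Return a list of ((left, right), frequency) merges learned from `text`."""
    # Pre-tokenize on whitespace; each word is a tuple of characters plus </w>.
    words = Counter(tuple(w) + ("</w>",) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Assumption: on ties, the first pair encountered wins.
        best = max(pairs, key=pairs.get)
        merges.append((best, pairs[best]))
        # Apply the merge to every word in the corpus.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges

corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
merges = train_bpe(corpus, 10)
for step, (pair, freq) in enumerate(merges, start=1):
    print(step, pair, freq)
```

Running it on the corpus reconstructed from the tokenized output prints ten merges, starting with `('e', 's')` at frequency 9 and ending with `('w', 'i')` at frequency 3. A production tokenizer would usually update pair counts incrementally instead of recounting on every iteration.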

This implementation trains BPE on the text you enter. Try the "Classic example" preset (Sennrich et al. 2016) with 10 merges to see common subwords such as "low", "est", and "new" emerge.
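
Encoding a word with learned merges can be sketched as follows. The `MERGES` list is copied from the Merge Operations table on this page; the function name `encode_word` is hypothetical, and the loop is a simple illustration of applying merges in learned order, not the tool's own implementation.

```python
# The ten merges from the Merge Operations table, in the order they were learned.
MERGES = [
    ("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w"),
    ("n", "e"), ("ne", "w"), ("new", "est</w>"), ("low", "</w>"), ("w", "i"),
]

def encode_word(word, merges=MERGES):
    """Split `word` into characters plus </w>, then apply each merge in order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(encode_word("lower"))   # ['low', 'e', 'r', '</w>']
print(encode_word("widest"))  # ['wi', 'd', 'est</w>']
print(encode_word("lowest"))  # ['low', 'est</w>']
```

The first two results match the tokenized output above. The third word, "lowest", never appears in the training corpus, yet it splits cleanly into two learned subwords, which is the practical benefit of subword tokenization.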
