BPE Tokenizer

Train a Byte Pair Encoding tokenizer on your own text and watch subword tokens emerge: visualize the merge steps, vocabulary growth, and token IDs, just as GPT and other modern LLMs use them.

94 chars · 16 words

Initial Vocabulary: 11 (unique chars + </w>)
Final Vocabulary: 21 (+10 merges)
Tokens Produced: 43 (2.19 chars/token)
Compression Ratio: 54.3% (vs. character tokens)
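
The two derived statistics are simple ratios over the counts above. A quick sketch, with the values hard-coded from this run:

```python
# Deriving the summary statistics shown above (values from this run).
text_chars = 94   # characters in the input text
num_tokens = 43   # BPE tokens produced

chars_per_token = text_chars / num_tokens   # ~2.19 chars/token
compression = 1 - num_tokens / text_chars   # ~54.3% fewer tokens than
                                            # one-token-per-character
print(f"{chars_per_token:.2f} chars/token, {compression:.1%} compression")
```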

Tokenized Output

low␣ | ␠ | low␣ | ␠ | low␣ | ␠ | low␣ | ␠ | low␣ | ␠ | low | e | r | ␣ | ␠ | low | e | r | ␣ | ␠ | newest␣ | ␠ | newest␣ | ␠ | newest␣ | ␠ | newest␣ | ␠ | newest␣ | ␠ | newest␣ | ␠ | wi | d | est␣ | ␠ | wi | d | est␣ | ␠ | wi | d | est␣

(␣ = end-of-word marker </w>; ␠ = the literal space token; the #n badges in the interactive view are the token IDs listed below)
Token IDs
[10, 21, 10, 21, 10, 21, 10, 21, 10, 21, 9, 2, 16, 0, 21, 9, 2, 16, 0, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 20, 1, 5, 21, 20, 1, 5, 21, 20, 1, 5]

Merge Operations

Step   Pair             Merged Token   Pair Frequency   Vocab Size
#1     "e" + "s"        "es"           9                12
#2     "es" + "t"       "est"          9                13
#3     "est" + "␣"      "est␣"         9                14
#4     "l" + "o"        "lo"           7                15
#5     "lo" + "w"       "low"          7                16
#6     "n" + "e"        "ne"           6                17
#7     "ne" + "w"       "new"          6                18
#8     "new" + "est␣"   "newest␣"      6                19
#9     "low" + "␣"      "low␣"         5                20
#10    "w" + "i"        "wi"           3                21

Initial Vocabulary (11)

"␣", "d", "e", "i", "l", "n", "o", "r", "s", "t", "w"

Final Vocabulary (21)

"␣", "d", "e", "es", "est", "est␣", "i", "l", "lo", "low", "low␣", "n", "ne", "new", "newest␣", "o", "r", "s", "t", "w", "wi"

Export (JSON)

{
  "vocab": [
    "</w>",
    "d",
    "e",
    "es",
    "est",
    "est</w>",
    "i",
    "l",
    "lo",
    "low",
    "low</w>",
    "n",
    "ne",
    "new",
    "newest</w>",
    "o",
    "r",
    "s",
    "t",
    "w",
    "wi"
  ],
  "merges": [
    [
      "e",
      "s"
    ],
    [
      "es",
      "t"
    ],
    [
      "est",
      "</w>"
    ],
    [
      "l",
      "o"
    ],
    [
      "lo",
      "w"
    ],
    [
      "n",
      "e"
    ],
    [
      "ne",
      "w"
    ],
    [
      "new",
      "est</w>"
    ],
    [
      "low",
      "</w>"
    ],
    [
      "w",
      "i"
    ]
  ],
  "tokens": [
    "low</w>",
    " ",
    "low</w>",
    " ",
    "low</w>",
    " ",
    "low</w>",
    " ",
    "low</w>",
    " ",
    "low",
    "e",
    "r",
    "</w>",
    " ",
    "low",
    "e",
    "r",
    "</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "newest</w>",
    " ",
    "wi",
    "d",
    "est</w>",
    " ",
    "wi",
    "d",
    "est</w>",
    " ",
    "wi",
    "d",
    "est</w>"
  ],
  "tokenIds": [
    10,
    21,
    10,
    21,
    10,
    21,
    10,
    21,
    10,
    21,
    9,
    2,
    16,
    0,
    21,
    9,
    2,
    16,
    0,
    21,
    14,
    21,
    14,
    21,
    14,
    21,
    14,
    21,
    14,
    21,
    14,
    21,
    20,
    1,
    5,
    21,
    20,
    1,
    5,
    21,
    20,
    1,
    5
  ]
}
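
A sketch of consuming this export is below; `export` is a trimmed copy of the blob above (in practice you would `json.load()` the saved file). Note that token ID 21, the literal space between words, is one past the end of the 21-entry vocab, so it has to be special-cased:

```python
import json

# Trimmed copy of the "Export (JSON)" blob above; a real consumer would
# json.load() the saved file instead of embedding the string.
export = json.loads("""
{
  "vocab": ["</w>", "d", "e", "es", "est", "est</w>", "i", "l", "lo", "low",
            "low</w>", "n", "ne", "new", "newest</w>", "o", "r", "s", "t",
            "w", "wi"],
  "tokenIds": [10, 21, 10, 21, 9, 2, 16, 0, 21, 14]
}
""")

vocab = export["vocab"]

def decode(token_id):
    # ID 21 is the out-of-vocab space token separating words.
    return " " if token_id == len(vocab) else vocab[token_id]

# Map IDs back to tokens, then strip the end-of-word markers.
text = "".join(decode(i) for i in export["tokenIds"]).replace("</w>", "")
print(text)  # -> "low low lower newest"
```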

How BPE Works

Byte Pair Encoding (BPE) is a subword tokenization algorithm used by many modern language models (GPT, RoBERTa, etc.). It starts with a vocabulary of single characters and iteratively merges the most frequent adjacent pair into a new token.

  1. Pre-tokenize text into words; represent each word as a sequence of characters with an end-of-word marker </w> (shown as ␣ above).
  2. Count pairs: for every adjacent symbol pair across the corpus, count occurrences.
  3. Merge the most frequent pair into a single new token, updating every occurrence in the corpus.
  4. Repeat for N iterations. Each merge expands the vocabulary by exactly one token.
  5. Encode new text by greedily applying learned merges in the order they were learned.
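
The training loop (steps 1–4) can be sketched in a few lines of Python. This toy version operates on a word-frequency dict and reproduces the merge table above; it is a sketch of the algorithm, not the tool's actual implementation:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer. `words` maps each word to its frequency; every
    word is split into characters plus the end-of-word marker </w>."""
    corpus = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count every adjacent symbol pair, weighted by frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair everywhere in the corpus.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

# The classic corpus (Sennrich et al. 2016):
words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(words, 10)
print(merges[:3])  # -> [('e', 's'), ('es', 't'), ('est', '</w>')]
```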

This implementation trains BPE on the text you enter — try the "Classic example" preset (Sennrich et al. 2016) and 10 merges to see common subwords like "low", "est", "new" emerge.
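
Encoding unseen text (step 5) replays the learned merges in order. A minimal sketch, using the merge list from the table above:

```python
# Merge list copied from the "Merge Operations" table above.
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w"),
          ("n", "e"), ("ne", "w"), ("new", "est</w>"), ("low", "</w>"),
          ("w", "i")]

def encode(word, merges):
    """Tokenize one word by applying learned merges in training order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# A word never seen during training still decomposes into known subwords:
print(encode("lowest", merges))  # -> ['low', 'est</w>']
```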


BPE Tokenizer — Free Online Tool | FreeTool24