BPE Tokenizer
Train Byte Pair Encoding (BPE) on your own text and watch subword tokens emerge. Visualize merge steps, vocabulary growth, and token IDs of the kind GPT and other modern LLMs use.
94 chars · 16 words
Tokenized Output
[10, 21, 10, 21, 10, 21, 10, 21, 10, 21, 9, 2, 16, 0, 21, 9, 2, 16, 0, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 20, 1, 5, 21, 20, 1, 5, 21, 20, 1, 5]
Merge Operations
| Step | Pair | Merged Token | Pair Frequency | Vocab Size |
|---|---|---|---|---|
| #1 | "e"+"s" | "es" | 9 | 12 |
| #2 | "es"+"t" | "est" | 9 | 13 |
| #3 | "est"+"␣" | "est␣" | 9 | 14 |
| #4 | "l"+"o" | "lo" | 7 | 15 |
| #5 | "lo"+"w" | "low" | 7 | 16 |
| #6 | "n"+"e" | "ne" | 6 | 17 |
| #7 | "ne"+"w" | "new" | 6 | 18 |
| #8 | "new"+"est␣" | "newest␣" | 6 | 19 |
| #9 | "low"+"␣" | "low␣" | 5 | 20 |
| #10 | "w"+"i" | "wi" | 3 | 21 |
Initial Vocabulary (11)
"␣" "d" "e" "i" "l" "n" "o" "r" "s" "t" "w"
Final Vocabulary (21)
"␣" "d" "e" "es" "est" "est␣" "i" "l" "lo" "low" "low␣" "n" "ne" "new" "newest␣" "o" "r" "s" "t" "w" "wi"
Export (JSON)
{
"vocab": [
"</w>",
"d",
"e",
"es",
"est",
"est</w>",
"i",
"l",
"lo",
"low",
"low</w>",
"n",
"ne",
"new",
"newest</w>",
"o",
"r",
"s",
"t",
"w",
"wi"
],
"merges": [
[
"e",
"s"
],
[
"es",
"t"
],
[
"est",
"</w>"
],
[
"l",
"o"
],
[
"lo",
"w"
],
[
"n",
"e"
],
[
"ne",
"w"
],
[
"new",
"est</w>"
],
[
"low",
"</w>"
],
[
"w",
"i"
]
],
"tokens": [
"low</w>",
" ",
"low</w>",
" ",
"low</w>",
" ",
"low</w>",
" ",
"low</w>",
" ",
"low",
"e",
"r",
"</w>",
" ",
"low",
"e",
"r",
"</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"wi",
"d",
"est</w>",
" ",
"wi",
"d",
"est</w>",
" ",
"wi",
"d",
"est</w>"
],
"tokenIds": [
10,
21,
10,
21,
10,
21,
10,
21,
10,
21,
9,
2,
16,
0,
21,
9,
2,
16,
0,
21,
14,
21,
14,
21,
14,
21,
14,
21,
14,
21,
14,
21,
20,
1,
5,
21,
20,
1,
5,
21,
20,
1,
5
]
}
How BPE Works
Byte Pair Encoding (BPE) is a subword tokenization algorithm used by many modern language models (GPT, RoBERTa, etc.). It starts with a vocabulary of single characters and iteratively merges the most frequent adjacent pair into a new token.
- Pre-tokenize text into words; represent each word as a sequence of characters with an end-of-word marker </w> (shown as ␣ above).
- Count pairs: for every adjacent symbol pair across the corpus, count occurrences.
- Merge the most frequent pair into a single new token, updating every occurrence in the corpus.
- Repeat for N iterations. Each merge expands the vocabulary by exactly one token.
- Encode new text by greedily applying learned merges in the order they were learned.
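The training steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the names `train_bpe` and `merge_pair` are hypothetical, whitespace splitting stands in for pre-tokenization, and ties between equally frequent pairs are broken by taking the pair seen first, which is an assumption about how the tool behaves.

```python
from collections import Counter

END = "</w>"  # end-of-word marker (displayed as ␣ in the tables above)


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one joined token."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Return (merges, vocab); each merge is ((left, right), pair_frequency)."""
    # Pre-tokenize on whitespace; each word becomes a tuple of characters plus END.
    words = Counter(tuple(w) + (END,) for w in text.split())
    vocab = {symbol for word in words for symbol in word}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by how often the word occurs.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token
        # Assumption: on ties, the pair encountered first wins.
        best = max(pairs, key=pairs.get)
        merges.append((best, pairs[best]))
        vocab.add(best[0] + best[1])  # each merge adds exactly one token
        # Rewrite the corpus with the new token before the next iteration.
        merged = Counter()
        for word, freq in words.items():
            merged[merge_pair(word, best)] += freq
        words = merged
    return merges, sorted(vocab)
```

Calling `train_bpe` with the 16-word text shown in the export (five "low", two "lower", six "newest", three "widest") and `num_merges=10` should yield the same ten merges and pair frequencies as the Merge Operations table, provided the tie-breaking assumption holds.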
This implementation trains BPE on the text you enter. Try the "Classic example" preset (Sennrich et al. 2016) with 10 merges to see common subwords such as "low", "est", and "new" emerge.
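To see how learned merges are applied to unseen text, the sketch below hard-codes the ten merges from the table above and replays them in order on each word. The function names `encode_word` and `encode` are hypothetical. Token IDs are taken as positions in the sorted vocabulary, and the space between words is given ID 21 (one past the last vocabulary index); both choices are inferred from the exported `tokenIds` rather than confirmed by the tool.

```python
END = "</w>"  # end-of-word marker (displayed as ␣ in the tables above)

# The ten merges from the Merge Operations table, in the order they were learned.
MERGES = [
    ("e", "s"), ("es", "t"), ("est", END), ("l", "o"), ("lo", "w"),
    ("n", "e"), ("ne", "w"), ("new", "est" + END), ("low", END), ("w", "i"),
]

# Vocabulary = initial characters plus one token per merge, sorted as in the export.
VOCAB = sorted(set("delinorstw") | {END} | {a + b for a, b in MERGES})
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}
SPACE_ID = len(VOCAB)  # assumption: the word separator gets the next free ID (21)


def encode_word(word):
    """Split a word into characters plus END, then replay each merge in order."""
    symbols = list(word) + [END]
    for a, b in MERGES:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


def encode(text):
    """Encode whitespace-separated text into token IDs, with SPACE_ID between words."""
    ids = []
    for n, word in enumerate(text.split()):
        if n:
            ids.append(SPACE_ID)
        # Characters outside the training vocabulary raise KeyError in this sketch.
        ids.extend(TOKEN_TO_ID[s] for s in encode_word(word))
    return ids
```

For example, `encode_word("lowest")` gives `["low", "est</w>"]`: the word never appeared in the training text, yet it decomposes into two subwords learned from "low" and "newest"/"widest".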