BPE Tokenizer
Train Byte Pair Encoding on your own text and watch subword tokens emerge — visualize merge steps, vocabulary growth, and the token IDs that GPT and other modern LLMs use
94 chars · 16 words
Tokenized Output
[10, 21, 10, 21, 10, 21, 10, 21, 10, 21, 9, 2, 16, 0, 21, 9, 2, 16, 0, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 14, 21, 20, 1, 5, 21, 20, 1, 5, 21, 20, 1, 5]

Merge Operations
| Step | Pair | Merged Token | Pair Frequency | Vocab Size |
|---|---|---|---|---|
| #1 | "e"+"s" | "es" | 9 | 12 |
| #2 | "es"+"t" | "est" | 9 | 13 |
| #3 | "est"+"␣" | "est␣" | 9 | 14 |
| #4 | "l"+"o" | "lo" | 7 | 15 |
| #5 | "lo"+"w" | "low" | 7 | 16 |
| #6 | "n"+"e" | "ne" | 6 | 17 |
| #7 | "ne"+"w" | "new" | 6 | 18 |
| #8 | "new"+"est␣" | "newest␣" | 6 | 19 |
| #9 | "low"+"␣" | "low␣" | 5 | 20 |
| #10 | "w"+"i" | "wi" | 3 | 21 |
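The merge sequence above can be reproduced with a minimal training sketch. This is not the tool's actual implementation, just an illustration: it counts adjacent symbol pairs weighted by word frequency and merges the most frequent pair each step. Ties (e.g. "es", "st", and "t"+"␣" all occur 9 times at step #1) are broken by first-seen order, which happens to match the table on this corpus.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Train BPE on a {word: count} corpus. Ties in pair frequency are
    broken by first-seen order (dict insertion order)."""
    # Represent each word as a tuple of symbols: its characters plus </w>.
    corpus = {tuple(word) + ("</w>",): count for word, count in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first-seen wins ties
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the pair with the merged symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# The demo corpus: 5x "low", 2x "lower", 6x "newest", 3x "widest".
merges, corpus = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
print(merges[0])  # ('e', 's'), pair frequency 9 -- matches merge #1 above
```

Each merge grows the vocabulary by one token, which is why the vocab-size column climbs from 12 to 21 over ten steps.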
Initial Vocabulary (11)
"␣" "d" "e" "i" "l" "n" "o" "r" "s" "t" "w"

Final Vocabulary (21)
"␣" "d" "e" "es" "est" "est␣" "i" "l" "lo" "low" "low␣" "n" "ne" "new" "newest␣" "o" "r" "s" "t" "w" "wi"

Export (JSON)
{
"vocab": [
"</w>",
"d",
"e",
"es",
"est",
"est</w>",
"i",
"l",
"lo",
"low",
"low</w>",
"n",
"ne",
"new",
"newest</w>",
"o",
"r",
"s",
"t",
"w",
"wi"
],
"merges": [
[
"e",
"s"
],
[
"es",
"t"
],
[
"est",
"</w>"
],
[
"l",
"o"
],
[
"lo",
"w"
],
[
"n",
"e"
],
[
"ne",
"w"
],
[
"new",
"est</w>"
],
[
"low",
"</w>"
],
[
"w",
"i"
]
],
"tokens": [
"low</w>",
" ",
"low</w>",
" ",
"low</w>",
" ",
"low</w>",
" ",
"low</w>",
" ",
"low",
"e",
"r",
"</w>",
" ",
"low",
"e",
"r",
"</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"newest</w>",
" ",
"wi",
"d",
"est</w>",
" ",
"wi",
"d",
"est</w>",
" ",
"wi",
"d",
"est</w>"
],
"tokenIds": [
10,
21,
10,
21,
10,
21,
10,
21,
10,
21,
9,
2,
16,
0,
21,
9,
2,
16,
0,
21,
14,
21,
14,
21,
14,
21,
14,
21,
14,
21,
14,
21,
20,
1,
5,
21,
20,
1,
5,
21,
20,
1,
5
]
}

How BPE Works
Byte Pair Encoding (BPE) is a subword tokenization algorithm used by many modern language models (GPT, RoBERTa, etc.). It starts with a vocabulary of single characters and iteratively merges the most frequent adjacent pair into a new token.
- Pre-tokenize text into words; represent each word as a sequence of characters with an end-of-word marker </w> (shown as ␣ above).
- Count pairs: for every adjacent symbol pair across the corpus, count occurrences.
- Merge the most frequent pair into a single new token, updating every occurrence in the corpus.
- Repeat for N iterations. Each merge expands the vocabulary by exactly one token.
- Encode new text by greedily applying learned merges in the order they were learned.
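The encoding step can be sketched as a replay of the learned merges. This is an illustrative function (not the tool's source): it splits a word into characters plus the end-of-word marker, then applies each merge in training order.

```python
def encode_word(word, merges):
    """Encode one word by replaying learned merges in training order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # this pair was merged during training
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# The ten merges from the merge-operations table above:
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w"),
          ("n", "e"), ("ne", "w"), ("new", "est</w>"), ("low", "</w>"), ("w", "i")]
print(encode_word("lowest", merges))  # ['low', 'est</w>']
```

Note that "lowest" never appears in the training corpus, yet it still encodes into two known subwords — the point of subword tokenization.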
This implementation trains BPE on the text you enter — try the "Classic example" preset (Sennrich et al., 2016) with 10 merges to see common subwords like "low", "est", and "new" emerge.
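The token IDs in the JSON export appear to be indices into the final vocabulary in its exported (sorted) order, with the space separator — which is not a vocabulary entry — given the next free ID, 21. That ID scheme is an inference from the tokenIds array above, not documented behavior; this sketch reproduces it:

```python
# Final vocabulary from the JSON export above, in its exported order.
vocab = ["</w>", "d", "e", "es", "est", "est</w>", "i", "l", "lo", "low",
         "low</w>", "n", "ne", "new", "newest</w>", "o", "r", "s", "t", "w", "wi"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
# Assumption: the space separator gets the next free ID (21), as suggested
# by the tokenIds array in the export.
token_to_id[" "] = len(vocab)

# One word of each kind from the tokenized output:
tokens = ["low</w>", " ", "low", "e", "r", "</w>", " ",
          "newest</w>", " ", "wi", "d", "est</w>"]
print([token_to_id[t] for t in tokens])  # [10, 21, 9, 2, 16, 0, 21, 14, 21, 20, 1, 5]
```

These IDs match the corresponding entries of the exported tokenIds array.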