What AI Actually Reads
Take the word "indescribable." To you, it's one word. To GPT-4, it's three tokens: "ind," "escrib," and "able." Another model might split it differently: "in," "describ," and "able." Now take a sentence in Japanese. The same meaning might become 15 tokens in one model and 4 in another. This isn't a minor technical detail. Token count determines how much text fits in a conversation, how much an API call costs, and even how well the model handles your input. Languages written in non-Latin scripts often get broken into far more tokens, which makes AI more expensive and less effective for billions of people. AI reads text in a way that is fundamentally alien to how humans read, and understanding that gap is the first step to mastering it.
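To make the idea concrete, here is a minimal sketch of how a vocabulary shapes the split. It is a toy greedy longest-match tokenizer, not the byte-pair encoding that GPT-4 or any real model actually uses, and the vocabularies `VOCAB_A` and `VOCAB_B`, the helper `tokenize`, and the 1,000-token window are all hypothetical, chosen only to reproduce the two splits described above.

```python
def tokenize(text, vocab):
    """Greedy longest-match split (a toy stand-in for a real tokenizer).

    At each position, take the longest substring found in vocab.
    If nothing matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabularies, invented to mirror the splits in the text.
VOCAB_A = {"ind", "escrib", "able"}
VOCAB_B = {"in", "describ", "able"}
VOCAB_C = set()  # knows no subwords, so every character is its own token

word = "indescribable"
tokens_a = tokenize(word, VOCAB_A)
tokens_b = tokenize(word, VOCAB_B)
tokens_c = tokenize(word, VOCAB_C)

# Assumed context window, used only to show how token count limits text.
WINDOW = 1000
copies_a = WINDOW // len(tokens_a)
copies_c = WINDOW // len(tokens_c)

print(tokens_a, len(tokens_a))
print(tokens_b, len(tokens_b))
print(tokens_c, len(tokens_c))
print(copies_a, copies_c)
```

The same 13 letters cost 3 tokens under the first two vocabularies and 13 under the third, so the same window holds roughly four times fewer copies of the word. That is the effect a poorly matched vocabulary has on real text, only exaggerated.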
The same sentence, tokenized 3 different ways, and why it matters.