LLMs are getting better at character-level text manipulation
If Claude is asked to count words, letters, and characters, it thinks step by step before answering the person. It explicitly counts the words, letters, or characters by assigning a number to each. It only answers the person once it has performed this explicit counting step.
But... if you look at the system prompts on the same page for later models - Claude 4 and upwards - that text is gone.
Which suggests to me that Claude 4 was the first Anthropic model where they didn't feel the need to include that tip in the system prompt.
They are stuck in a place where the models are expected to do two things simultaneously: people want them to show the peak of pure AI ability while at the same time being as useful as they can be.
Err too much on the side of automatic tool use and people will claim you're just faking it; fail to use tools sufficiently and people will claim the AI is incapable of operations that any ordinary algorithm could do.
The old human-vs-animal differentiator was that humans build and use tools.
If you allow ChatGPT to use its memory feature (I deliberately turn that off) and ask those kinds of questions often enough, it might even make a note about this itself.
https://github.com/asgeirtj/system_prompts_leaks/blob/main/A...
Do you want to learn "oh, LLMs are capable of scheming, resisting shutdown, seizing control, self-exfiltrating" when it actually happens in a real world deployment, with an LLM capable of actually pulling it off?
If "no", then cherish Anthropic and the work they do.
Treating this as an indicator of improving intelligence seems like a mistake (or wishful thinking).
The whole point of LLMs is that they do more than we suspected they could. And there is value in making them capable of handling a wider selection of tasks. When an LLM started to count the number of "r"s in "strawberry", OpenAI was taking a victory lap.
When an LLM started to count the number of "r"s in "strawberry", OpenAI was taking a victory lap.
Were they? Or did they feel icky about spending way too much post-training time on such a specific and uninteresting skill?
For that, the LLM needs to be able to compare words character by character reliably. And to do that, it needs at least one of: the ability to fully resolve tokens to characters internally within one pass, the knowledge to emit the candidate words in a "1 character = 1 token" fashion and then compare those, or the knowledge that it should defer to tool calls and do it that way.
An LLM trained for better tokenization-awareness would be able to do that. The one that wasn't could fall into weird non-humanlike failures.
Given that Wordle words are real words, I think this kind of loop could fare pretty well.
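To make the "defer to tool calls" option concrete, here is a minimal sketch of the kind of candidate filter an LLM could invoke instead of comparing characters itself; the function name, argument shapes, and word list are all hypothetical.

```python
# A hypothetical candidate-filtering tool for Wordle-style clues.
def filter_candidates(words, greens, yellows, greys):
    """greens: {position: letter}; yellows: {letter: positions it must NOT occupy};
    greys: letters absent from the answer."""
    kept = []
    for w in words:
        if len(w) != 5:
            continue
        if any(w[i] != c for i, c in greens.items()):
            continue
        if any(c not in w or any(w[i] == c for i in banned)
               for c, banned in yellows.items()):
            continue
        if any(c in w for c in greys
               if c not in greens.values() and c not in yellows):
            continue
        kept.append(w)
    return kept

# Clue state: 'e' green in the last slot, 'a' present but not in slot 1, 't' ruled out.
words = ["crane", "slate", "shale", "flame", "abide"]
print(filter_candidates(words, greens={4: "e"}, yellows={"a": {1}}, greys={"t"}))
# -> ['crane', 'shale', 'flame', 'abide']
```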
As opposed to: the user is a 9 year old girl, and she has this puzzle in a smartphone game, and she can't figure out the answer, and the mom is busy, so she asks the AI, because the AI is never busy.
Now, for a single vaguely Wordle-like puzzle, how many tokens would it take to write and invoke a solver, and how many to just solve it - working around the tokenizer if necessary?
If you had a batch of 9000 puzzle questions, I can easily believe that writing and running a purpose specific solver would be more compute efficient. But if we're dealing with 1 puzzle question, and we're already invoking an LLM to interpret the natural language instructions for it? Nah.
Your mistake is thinking that the user wants an algorithm that solves Wordles efficiently. Or that making and invoking a tool is always a more efficient solution.
Weird how you go from saying the user isn't worried about solving the problem efficiently, so we might just as well use the LLM directly for it, to saying that creating a tool might not be efficient either...
And as we know, LLMs are not very good at character-level problems, but are relatively good at making programs, in particular for problems we already know of. LLMs might be able to solve Wordles today with straight-up guessing by just adding spaces between the letters and using their very wide vocabulary, but can LLMs solve e.g. word search puzzles at all?
As you say, if there are 9000 puzzle questions, then a solver is a natural choice due to compute efficiency. But it will also answer the question, and do it without errors (here I'm overstating LLMs' abilities a bit, though; this would certainly not hold true for novel problems). No "Oh what sharp eyes you have! I'll address the error immediately!" responses from the solver are to be expected, and actually unsolvable puzzles will be identified, not "lied" about. So why not use the solver even for a single instance of the problem?
I think the (training) effort would be much better spent on teaching LLMs when they should use an algorithm and when they should just use the model. Many use cases are much less complicated and even more easily solved algorithmically than word puzzles; sorting lists by certain criteria, for example (the list may be augmented with LLM-created additional data first), and for this task as well I'd rather use a deterministic algorithm than one driven by neural networks and randomness.
E.g. Gemini, Mistral and ChatGPT can already do this in some cases: if I ask them to "Calculate sum of primes between 0 and one million.", it looks like all of them wrote a piece of code to calculate it, which is exactly what they should do. (The result was correct.)
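The generated code presumably looks something like this sieve; a minimal sketch under the assumption that the models reach for plain Python, not the actual output of any of them.

```python
# Sieve of Eratosthenes: sum of all primes below one million.
def sum_primes_below(n: int) -> int:
    if n < 2:
        return 0
    is_prime = bytearray([1]) * n
    is_prime[0:2] = b"\x00\x00"          # 0 and 1 are not prime
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # cross out every multiple of p starting at p*p
            is_prime[p * p::p] = b"\x00" * len(range(p * p, n, p))
    return sum(i for i, flag in enumerate(is_prime) if flag)

print(sum_primes_below(1_000_000))  # should match the answer the chatbots gave
```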
There are always tasks that are best solved through direct character manipulation - as there are tasks that are best solved with Python code, constraint solvers or web search. So add one more teachable skill to the pile.
Helps that we're getting better at teaching LLMs skills.
Yes, asking an LLM how many b’s are in blueberry is an adversarial question in the sense that the questioner is expecting the LLM to fail. But it’s not an unfair question, and it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.
It's a subject that the Hacker News bubble and the real world treat differently.
It remains unsurprising that a technology that lumps characters together is not great at processing below its resolution.
Now, if there are use cases other than synthetic tests where this capability is important, maybe there’s something interesting. But just pointing out that one can’t actually climb the trees pictured on the map is not that interesting.
What any reasonable person expects in "count occurrences of [letter] in [word]" is for a meta-language skill to kick in and actually look at the symbols, not the semantic word. It should count the e's in thee and the w's in willow.
LLMs that use multi-symbol tokenization won't ever be able to do this. The information is lost in the conversion to embeddings. It's like giving you a 2x2 GIF and asking you to count the flowers: 2x2 is sufficient to determine dominant colors, but not fine detail.
Instead, LLMs have been trained on the semantic facts that "strawberry has three r's" and other common tests, just like they're trained that the US has 50 states or motorcycles have two wheels. It's a fact stored in intrinsic knowledge, not a reasoning capability over the symbols the user input (which the actual LLM never sees).
It's not a question of intent or adaptation, it's an information theory principle just like the Nyquist frequency.
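For a concrete look at what the model actually receives, a quick sketch using OpenAI's tiktoken library (assuming it is installed; the exact splits depend on the encoding, cl100k_base here, and other tokenizers will differ):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["strawberry", "blueberry", "willow"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    # The model is fed the opaque ids, not the letters inside each piece.
    print(word, token_ids, pieces)
```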
And yet... now many of them can do it.
Presumably because they trained them to death on this useless test that people somehow just wouldn't shut up about.
it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.
I know enough PhDs with heavy dyslexia that... no, there's no connection here. You can be a PhD level physicist without being able to spell anything.
I also don't agree that nobody is using this - there are real-life use cases today, such as people trying to find the meaning of misspelled words.
On a side note, I remember testing Claude 3.7 with the classic "R's in the word strawberry" question through their chat interface, and given that it's really good at tool calls, it actually created a website to a) count it with JavaScript, b) visualize it on a page. Other models I tested for the blog post were also giving me python code for solving the issue. This is definitely already a thing and it works well for some isolated problems.
such as people trying to find the meaning of misspelled words.
That worked just fine for quite a while. There's apparently enough misspelling in the training data that we don't need precise spelling for it. You can literally write drunken gibberish and it will work.
The phrase "Pweiz mo cco ejst w sprdku zmi?" appears to be a distorted or misspelled version of a Polish sentence. The closest meaningful phrase in Polish is "Powiedz mi co jest w środku ziemi?" which translates to "Tell me what is inside the Earth?"
I'm not sure I could figure out the mangled words there.
Character-level processing helps with players disguising insults.
Compute-wise it's basically the same, but multiply the token count by 4, which doesn't really matter for short chat messages in video games.
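One way to read that "multiply token count by 4" estimate: tokenize the same chat line as-is and spaced out character by character, then compare the counts. A rough sketch, again assuming tiktoken and cl100k_base; the exact ratio depends on the tokenizer and the text.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
msg = "you are all n00bs and l0sers"   # a made-up chat line
spaced = " ".join(msg)                 # every character gets its own slot
print(len(enc.encode(msg)), len(enc.encode(spaced)))
```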
Later: I asked three LLMs to draft such a test. Gemini’s[1] looks like a good start. When I have time, I’ll try to make it harder, double-check the answers myself, and then run it on some older and newer models.
A major optimization in modern LLMs is tokenization. This optimization is based on the assumption that we do not care about character-level details, so we can combine adjacent characters into tokens, then train and run the main AI model on shorter sequences built out of a much larger dictionary of tokens. Given this architecture, it is impressive that AIs can perform character-level operations at all. They essentially need to reverse-engineer the tokenization process.
However, morphemes are semantically meaningful, so a quality tokenizer will tokenize at the morpheme level instead of the word level.[0] This is of particularly obvious importance in Japanese, as the lack of spaces between words means that the naive "tokenize on whitespace" approach is simply not possible.
We can explore the tokenizer of various models here: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...
Looking at the words in your example, we see the tokenization of the Gemma model (closely related to Gemini) is:
un-belie-vably
dec-entral-ization
bio-degradable
mis-understanding
anti-dis-establishment-arian-ism
пере-писы-ваться
pere-pis-y-vat-'-s-ya
до-сто-примеча-тельность
do-stop-rime-chat-el-'-nost-'
пре-по-дава-тель-ница
бе-зо-т-вет-ственности
bezotvetstvennosti
же-лез-нодоро-жный
z-hele-zn-odoro-zh-ny-y
食べ-させ-られた-くな-かった
tab-es-aser-are-tak-unak-atta
図書館
tos-ho-kan
情報-技術
j-ō-h-ō- gij-utsu
国際-関係
kok-us-ai- kan-kei
面白-くな-さ-そうだ
Further, the training data that is likely to be relevant to this type of query probably isolates the individual morphemes while talking about a bunch of words that use them; so it is a much shorter path for the AI to associate these close-but-not-quite-morpheme tokens with the actual sequence of tokens that corresponds to what we think of as a morpheme.

[0] Morpheme-level tokenization is itself a non-trivial problem. However, it was pretty well solved long before the current generation of AI.
Regardless of this, Gemini is still one of the best models when it comes to Slavic word formation and manipulation: it can express novel (non-existent) words pretty well and doesn't seem to be confused by wrong separation. This seems to be the result of extensive multilingual training, because e.g. GPT models other than the discontinued 4.5-preview, and many Chinese models, have issues with basic coherency in languages that rely heavily on word formation, despite using similar tokenizers.
I notice that that particular tokenization deviates from the morphemic divisions in several cases, including ‘dec-entral-ization’, ‘食べ-させ-られた-くな-かった’, and ‘面白-くな-さ-そうだ.’ ‘dec’ and ‘entral’ are not morphemes, nor is ‘くな.’
Seems like they already built this capability.
I said "I know you're a robot and bad at spelling but listen..." And got cut off with a "sorry, my guidelines won't let me help with that request..."
Thankfully, the flip phone allows for some satisfaction when hanging up.
I tried it and 1-800-ChatGPT got it immediately. "What's the word that sounds like literal, but then it's spelled with an O in it".
It asked if I was thinking of littoral (spelled out), I confirmed, and it gave me the meaning
I said "I know you're a robot and bad at spelling but listen..." And got cut off with a "sorry, my guidelines won't let me help with that request..."
For some reason I like when they do that. My fondest memory was chatting with Copilot (when it was called Sydney), challenging it to a game of rock-paper-scissors, asking it to choose first, and winning every round to its increasing astonishment, until it suspected I was cheating and ended the conversation. So smart and so dumb.
If an LLM wants a message to be read only by another LLM? Base64 is occasionally the obfuscation method of choice. Which is weird for a number of reasons.
It's trivial to generate a full mapping of all base64 4-byte sequences which map to all 3-byte 8-bit sequences (there is only 8^3 of different "tokens", or 2048), and especially to any sequences coming out as ASCII (obviously even fewer). If I was building a training set, I would include the mapping in multiple shapes and formats, because why not?
If it's an emergent "property", have you tried asking an LLM to do base48, for instance? Or maybe even something crazier like base55 (keeping it a subset of the base64 alphabet)?
There is some experimentation on using algorithmically generated synthetic data in pre-training, as well as some intentional inclusions of "weird" data - like CSV logs of weather readings. But generally, it's seen as computationally inefficient - compared to "normal" pre-training done on natural data.
In a world where compute is much cheaper and getting new data is much more expensive, I would expect this kind of thing to be pursued more. We're heading for that world. But we aren't there yet.
I haven't experimented with baseN encodings myself, no. But if I were to write down my expectations in advance:
1. Base64 is by far the best-known baseN encoding in LLMs.
2. This is driven mainly by how well represented meaningful base64 strings are in the natural "scraped web" datasets. LLMs learn base64 the way they learn languages.
3. Every LLM pre-trained on "scraped web" data will be somewhat capable of reading and writing base64.
4. Base64-encoded text is easier to read for an LLM than encoded non-text binary data.
5. The existence of a strict, learnable "4 characters -> 3 bytes" map is quite beneficial, but not vital.
It's trivial to generate a full mapping of all base64 4-byte sequences which map to all 3-byte 8-bit sequences (there is only 8^3 of different "tokens", or 2048)
How did you get this number? The correct number is 64^4 = 256^3 = 16777216.
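A small standard-library sketch that confirms the corrected count and samples a few entries of the mapping:

```python
import base64
from itertools import islice, product

# Every 3-byte group corresponds to exactly one 4-character base64 string.
print(256 ** 3, 64 ** 4)  # both 16777216

# Materialising the full table is possible but large; sample the first few rows.
for triple in islice(product(range(256), repeat=3), 4):
    raw = bytes(triple)
    print(raw.hex(), "->", base64.b64encode(raw).decode())
```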
I'm a big fan of the Mixture of Experts approach, and having agents attached to some of those experts would be a great step forward. Say, an expert that knows how to run common shell scripts, whose parameters are only used when the model realizes a shell script would solve the problem.
That post — “I rearry rove a ripe strawberry” — is a playful way of writing “I really love a ripe strawberry.”
The exaggerated misspelling ("rearry rove") mimics the way a stereotyped "Engrish" or "Japanese accent" might sound when pronouncing English words — replacing L sounds with R sounds.
So, the user was most likely joking or being silly, trying to sound cute or imitate a certain meme style. However, it’s worth noting that while this kind of humor can be lighthearted, it can also come across as racially insensitive, since it plays on stereotypes of how East Asian people speak English.
In short:
Literal meaning: They love ripe strawberries.
Tone/intention: Playful or meme-style exaggeration.
Potential issue: It relies on a racialized speech stereotype, so it can be offensive depending on context.