
LLMs are getting better at character-level text manipulation

curioussquirrel 138 points blog.burkert.me
simonw
If you take a look at the system prompt for Claude 3.7 Sonnet on this page you'll see: https://docs.claude.com/en/release-notes/system-prompts#clau...

If Claude is asked to count words, letters, and characters, it thinks step by step before answering the person. It explicitly counts the words, letters, or characters by assigning a number to each. It only answers the person once it has performed this explicit counting step.

But... if you look at the system prompts on the same page for later models - Claude 4 and upwards - that text is gone.

Which suggests to me that Claude 4 was the first Anthropic model where they didn't feel the need to include that tip in the system prompt.

hansmayer
Not trying to be cynical here, but I am genuinely interested: is there a reason why these LLMs don't/can't/won't apply some deterministic algorithm? I mean, counting characters and such are problems we solved ages ago.
simonw
They can. ChatGPT has been able to count characters/words etc flawlessly for a couple of years now if you tell it to "use your Python tool".
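
For reference, the Python behind such a tool call is trivial; a minimal sketch (the example word and counts are just an illustration, not ChatGPT's actual tool output):

  word = "strawberry"
  print(word.count("r"))    # 3
  print(len(word))          # 10 characters
  print(len(word.split()))  # 1 whitespace-separated word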
hansmayer
Fair enough. But why do I have to tell them that? Should they not be able to figure it out themselves? If I show a 5-year-old kid once how to use coloured pencils, I won't have to show them each time they want to make a drawing. This is the core weakness of the LLMs: you have to micromanage them so much that it runs counter to the core promise that has been pushed for 3+ years now.
Lerc
Specifically for simple character level questions, if LLMs did that automatically, we would be inundated with stories about "AI model caught cheating"

They are stuck in a place where the models are expected to do two things simultaneously. People want them to show the peak of pure AI ability while at the same time being the most useful they can be.

Err too much on the side of automatic tool use and people will claim you're just faking it; fail to use tools sufficiently and people will claim that the AI is incapable of operations that any regular algorithm could do.

hansmayer
Are you sure? Isn't one aspect of intelligence being able to use, apply and develop tools? Isn't that the core feature that got humanity ahead of other mammals? As an early adopter, I couldn't have cared less whether the AI was cheating in strictly academic terms. I care about results. Let's say we're working on something together and I ask you what 123921 multiplied by 1212 is. As the most natural thing, you will pull out your calculator and give me the result. Do I care how you reached it? No, as long as the result is correct, reliable, repeatable and quick - AND - I did not specifically ask you to perform the calculation by hand or only with your mental faculties. So this is what's missing from those tools, and because we have to remember to tell them for each and every use case HOW to do it, they are not intelligent.
jgalt212
Truth.

The old human vs animal differentiator was that humans build and use tools.

Lerc
Now it's effectively a lower bound on intelligence.
simonw
If you care enough about this you can stick a note in your own custom instructions about it.

If you allow ChatGPT to use its memory feature (I deliberately turn that off) and ask those kinds of questions enough it might even make a note about this itself.

hansmayer
Yeah, that sounds obvious, but unfortunately my experience does not align with this (and I've heard similar from others). I am not using ChatGPT, but another tool within an IDE. I was excited about custom or "default" instructions, until it turned out they work maybe 50% of the time. So you end up repeating "make sure to include .github/custom.md", which is effectively the same crap. So we got ourselves a tool which adds to our cognitive load, great :)
simonw
Which tool and which model? Those make a significant difference here.
hansmayer
Well, for such a trivial feature, e.g. loading user settings, it actually should not matter, as this too is a problem we solved decades ago in many deterministic ways. But if it does, then we have an extremely fragile technology being promised as a solution to all of humanity's problems. The tool we use is GitHub Copilot with the entire model offering, out of which we mostly use Claude Sonnet 4. Since they started enshittifying it over the last several months though, as you are probably aware, we reverted from agent mode to mainly using it as an annoying and verbose replacement for the enshittified Google search.
scrollaway
If I ask you to count the r’s in strawberry, do you whip out your Python tool?
hansmayer
That depends on the context, obviously. If you had asked me to count them in every "strawberry" in a text file, then I might whip out my Python or some combination of bash, awk and sed. If you asked me in a conversation, I might close my eyes, visualise the string and use my visual cortex tool to count them in memory. If you gave me a piece of paper with the word on it, I might use my 'eye' or 'finger' tool to count them. There are numerous approaches, based on the problem setting, as you see, but with one thing in common: you don't need to specifically tell me what tool to use. I will infer it myself, based on the context. Something an LLM almost never does.
curioussquirrel
This is a very good answer and I'm commenting only to bring more attention to it apart from voting up. Well put!
dan-robertson
I think the intuition is that they don’t ‘know’ that they are bad at counting characters and such, so they answer the same way they answer most questions.
hansmayer
Well, they can be made to use custom tools for writing to files and such, so I am not sure if that is the real reason? I have a feeling it is more because of trying to make this an "everything technology".
kingkongjaffa
I suppose the codewriting tools could also just write code to do this job if prompted
ivape
Or they’d rather use that context window space for more useful instructions for a variety of other topics.
astrange
Claude's system prompt is still incredibly long and probably hurting its performance.

https://github.com/asgeirtj/system_prompts_leaks/blob/main/A...

jazzyjackson
They ain't called guard rails for nothing! There's a whole world "off-road" but the big names are afraid of letting their superintelligence off the leash. A real shame we're letting brand safety get in the way of performance and creativity, but I guess the first New York Times article about a pervert or terrorist chat bot would doom any big name partnerships.
astrange
Anthropic's entire reason for being is publishing safety papers along the lines of "we told it to say something scary and it said it", so of course they care about this.
ACCount37
I can't stand this myopic thinking.

Do you want to learn "oh, LLMs are capable of scheming, resisting shutdown, seizing control, self-exfiltrating" when it actually happens in a real world deployment, with an LLM capable of actually pulling it off?

If "no", then cherish Anthropic and the work they do.

littlestymaar
You do not appear to understand what an LLM is, I'm afraid.
kristianp
Does that mean they've managed to post-train the thinking steps required to get these types of questions correct?
therealpygon
IMO, it’s just a small scale example of “training to the tests” because “count the ‘r’s in strawberry” became such a popular test that would make the news when a powerful model couldn’t answer such a simple question correctly while being advertised as the smartest model ever.

Assigning this as an indicator of improved intelligence seems like a mistake (or wishful thinking).

jononor
If done at scale, they are kinda crowd-sourcing the test set from the entire internet and the personal and business world. It will be harder and harder to pinpoint weaknesses, at least for the general public. It probably has little to do with intelligence (at least fluid intelligence as defined by Chollet et al), but I guess it is a sound tactic if the strategy is "fake it till you make it". And we might be surprised as to how far that can go...
simonw
That's my best guess, yeah.
curioussquirrel
Thanks, Simon! I saw the same approach (numbering the individual characters) in GPT 4.1's answer, but not anymore in GPT 5's. It would be an interesting convergence if the models from Anthropic and OpenAI learned to do this at a similar time, especially given they're (reportedly) very different architecturally.
viraptor
Why bother testing, though? I was hoping this topic had finally died down recently, but no. Someone's still interested in testing LLMs on something they're explicitly not designed for and that nobody uses them for in practice. I really hope one day OpenAI will just add a "when asked about character-level changes, insights and encodings, generate and run a program to answer it" to their system prompt so we can never hear about it again...
MountDoom
I remember people making the exact same argument about asking LLMs math questions back when they couldn't figure out the answer to 18 times 7. "They are text token predictors, they don't understand numbers, can we put this nonsense to rest."

The whole point of LLMs is that they do more than we suspected they could. And there is value in making them capable of handling a wider selection of tasks. When an LLM started to count the number of "r"s in "strawberry", OpenAI was taking a victory lap.

vanviegen
When an LLM started to count the number of "r"s in "strawberry", OpenAI was taking a victory lap.

Were they? Or did they feel icky about spending way too much post-training time on such a specific and uninteresting skill?

ACCount37
It's not as specific of a skill as you would think. Being both aware of tokenizer limitations and capable of working around them is occasionally useful for real tasks.
_flux
What tasks would those be, that wouldn't be better served by using e.g. a Python script as a tool, possibly just as component of the complete solution?
ACCount37
Off the top of my head: the user wants the LLM to help them solve a word puzzle. Think something a bit like Wordle, but less represented in its dataset.

For that, the LLM needs to be able to compare words character by character reliably. And to do that, it needs at least one of: be able to fully resolve the tokens to characters internally within one pass, know to emit the candidate words in a "1 character = 1 token" fashion and then compare that, or know that it should defer to tool calls and do that.

An LLM trained for better tokenization-awareness would be able to do that. The one that wasn't could fall into weird non-humanlike failures.

_flux
Surely there are algorithms to solve Wordles, and many other word puzzles, more effectively than LLMs? LLMs could still be in the loop for generating words: the LLM proposes words, a deterministic algorithm tells the score according to the rules of the puzzle, or even augments the list by searching the adjacent word space; then at some point the LLM submits the guess.

Given wordle words are real words, I think this kind of loop could fare pretty well.
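
A minimal sketch of what the deterministic scoring step in such a loop could look like, assuming standard Wordle-style rules (the function and test words are illustrative):

  from collections import Counter

  def score(guess: str, answer: str) -> str:
      # "G" = right letter, right spot; "Y" = right letter, wrong spot; "-" = absent
      result = ["-"] * len(guess)
      remaining = Counter(answer)
      for i, (g, a) in enumerate(zip(guess, answer)):
          if g == a:                      # exact matches consume letters first
              result[i] = "G"
              remaining[g] -= 1
      for i, g in enumerate(guess):
          if result[i] == "-" and remaining[g] > 0:
              result[i] = "Y"             # present elsewhere in the answer
              remaining[g] -= 1
      return "".join(result)

  print(score("crane", "smart"))  # -YG--

The LLM only has to propose candidate words; the scorer keeps it honest.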

ACCount37
Your mistake is thinking that the user wants an algorithm that solves Wordles efficiently. Or that making and invoking a tool is always a more efficient solution.

As opposed to: the user is a 9-year-old girl, and she has this puzzle in a smartphone game, and she can't figure out the answer, and her mom is busy, so she asks the AI, because the AI is never busy.

Now, for a single vaguely Wordle-like puzzle, how many tokens would it take to write and invoke a solver, and how many to just solve it - working around the tokenizer if necessary?

If you had a batch of 9000 puzzle questions, I can easily believe that writing and running a purpose specific solver would be more compute efficient. But if we're dealing with 1 puzzle question, and we're already invoking an LLM to interpret the natural language instructions for it? Nah.

_flux
Your mistake is thinking that the user wants an algorithm that solves Wordles efficiently. Or that making and invoking a tool is always a more efficient solution.

Weird how you say the user is not worried about solving the problem efficiently, so we might just as well use the LLM directly for it, and then go on to say that creating a tool might not be efficient either...

And as we know, LLMs are only now getting good at character-level problems, but are already relatively good at making programs; in particular ones for problems we already know of. LLMs might be able to solve Wordles today with straight-up guessing, by just adding spaces between the letters and using their very wide vocabulary, but can LLMs solve e.g. word search puzzles at all?

As you say, if there are 9000 puzzle questions, then a solver is a natural choice due to compute efficiency. But it will also answer the question, and do it without errors (here I'm overstating LLMs' abilities a bit, though; this would certainly not hold true for novel problems). No "Oh what sharp eyes you have! I'll address the error immediately!" responses from the solver are to be expected, and actually unsolvable puzzles will be identified, not "lied" about. So why not use the solver even for a single instance of the problem?

I think the (training) effort would be much better spent on teaching LLMs when they should use an algorithm and when they should just use the model. Many use cases are much less complicated, and even more easily solved algorithmically, than word puzzle solvers; they might be e.g. sorting lists by certain criteria (the list may be augmented with LLM-created additional data first), and for this task as well I'd rather use a deterministic algorithm than one driven by neural networks and randomness.

E.g. Gemini, Mistral and ChatGPT can do this already in some cases: if I ask them to "Calculate sum of primes between 0 and one million.", it looks like all of them created a piece of code to calculate it. Which is exactly what they should do. (The result was correct.)
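
For illustration, the program the models generate for that prompt presumably looks roughly like this (a minimal sketch using a Sieve of Eratosthenes; the actual code each model wrote will differ):

  def sum_primes_below(limit: int) -> int:
      # Sieve of Eratosthenes: mark composites, then sum the surviving primes.
      sieve = [True] * limit
      sieve[0] = sieve[1] = False
      for n in range(2, int(limit ** 0.5) + 1):
          if sieve[n]:
              sieve[n * n::n] = [False] * len(sieve[n * n::n])
      return sum(i for i, is_prime in enumerate(sieve) if is_prime)

  print(sum_primes_below(1_000_000))  # 37550402023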

ACCount37
What LLMs are "good at" is kind of up to us. No fundamental reason why they can't be trained for better character manipulation capabilities, among many other things.

There are always tasks that are best solved through direct character manipulation - as there are tasks that are best solved with Python code, constraint solvers or web search. So add one more teachable skill to the pile.

Helps that we're getting better at teaching LLMs skills.

viraptor
They're better at maths now, but you still shouldn't ask them maths questions. Same as spelling: whether they improve or not doesn't matter if you want a specific, precise answer. It's the wrong tool, and the better it does, the bigger the trap when it fails unexpectedly.
minimaxir
I made a response to this counterpoint in a blog post I wrote about a similar question posed to LLMs (how many b's are in blueberry): https://news.ycombinator.com/item?id=44878290

Yes, asking an LLM how many b’s are in blueberry is an adversarial question in the sense that the questioner is expecting the LLM to fail. But it’s not an unfair question, and it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.

It's a subject that the Hacker News bubble and the real world treat differently.

brookst
It’s like defending a test showing hammers are terrible at driving screws by saying many people are unclear on how to use tools.

It remains unsurprising that a technology that lumps characters together is not great at processing below its resolution.

Now, if there are use cases other than synthetic tests where this capability is important, maybe there’s something interesting. But just pointing out that one can’t actually climb the trees pictured on the map is not that interesting.

achierius
And yet... now many of them can do it. I think it's premature to say "this technology is for X" when what it was originally invented for was translation, and every capability it has developed since then has been an immense surprise.
brookst
No, this is not what happened.

What any reasonable person expects in "count occurrences of [letter] in [word]" is for a meta-language skill to kick in and actually look at the symbols, not the semantic word. It should count the e's in thee and the w's in willow.

LLMs that use multi-symbol tokenization won't ever be able to do this. The information is lost in the conversion to embeddings. It's like giving you a 2x2 GIF and asking you to count the flowers: 2x2 is sufficient to determine dominant colors, but not fine detail.

Instead, LLMs have been trained on the semantic facts that "strawberry has three r's" and other common tests, just like they're trained that the US has 50 states or motorcycles have two wheels. It's a fact stored in intrinsic knowledge, not a reasoning capability over the symbols the user input (which the actual LLM never sees).

It's not a question of intent or adaptation, it's an information theory principle just like the Nyquist frequency.

vanviegen
And yet... now many of them can do it.

Presumably because they trained them to death on this useless test that people somehow just wouldn't shut up about.

minimaxir
Which is why in the linked post, I test models against both the "r's in strawberries" and the "b's in blueberries" to see if that is the case.

tl;dr: the first case had near-perfect accuracy, as expected if the LLMs were indeed trained on it. The second case did not.

viraptor
it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.

I know enough PhDs with heavy dyslexia that... no, there's no connection here. You can be a PhD level physicist without being able to spell anything.

curioussquirrel
Why test for something? I find it fascinating when something starts being good at a task it is "explicitly not designed for" (which I don't necessarily agree with - it's more of a side effect of their architecture).

I also don't agree that nobody is using them for this - there are real-life use cases today, such as people trying to find the meaning of misspelled words.

On a side note, I remember testing Claude 3.7 with the classic "R's in the word strawberry" question through their chat interface, and given that it's really good at tool calls, it actually created a website to a) count it with JavaScript, b) visualize it on a page. Other models I tested for the blog post were also giving me python code for solving the issue. This is definitely already a thing and it works well for some isolated problems.

viraptor
such as people trying to find the meaning of misspelled words.

That worked just fine for quite a while. There's apparently enough misspelling in the training data that we don't need precise spelling for it. You can literally write drunken gibberish and it will work.

curioussquirrel
True. But does that scale to less common words? Or to other languages than English?
viraptor
The phrase "Pweiz mo cco ejst w sprdku zmi?" appears to be a distorted or misspelled version of a Polish sentence. The closest meaningful phrase in Polish is "Powiedz mi co jest w środku ziemi?" which translates to "Tell me what is inside the Earth?"

I'm not sure I could figure out the mangled words there.

IncreasePosts
Wouldn't an LLM that just tokenized by character be good at it?
curioussquirrel
Yes, but it would hurt its contextual understanding and effectively reduce the context window several times.
viraptor
Only in the current most popular architectures. Mamba and RWKV style LLMs may suffer a bit but don't get a reduced context in the same sense.
curioussquirrel
You're right. There was also an experiment at Meta which tokenized bytes directly, and it didn't hurt performance much in very small models.
typpilol
I asked this in another thread, and the answer was that it would only be better with unlimited compute and memory.

Because without those, the LLM has to encode way more parameters and ends up with way smaller context windows.

In a theoretical world, it would be better, but might not be much better.

redox99
Character level LLMs are used for detecting insults and toxic chat in video games and the like.
jazzyjackson
I figure an LLM would be way better at classifying insults than regexing against a bad word list. Why would character level be desirable?
vanviegen
I'd imagine for simplicity - just skip the tokenizer and feed bytes.
duskwuff
Might a character-level LLM be better at recognizing poorly spelled (or deliberately misspelled) profanity?
minimaxir
Can you give an example of a video game explicitly using character-level LLMs? There were prototypes of char-rnns back in the day for chat moderation but it has significant compute overhead.
redox99
It's something I heard through the grapevine. But there are only a few big enough competitive games where toxicity is such a big deal, so it's not hard to guess.

Character level helps with players disguising insults.

Compute-wise it's basically the same, but multiply the token count by 4. Which doesn't really matter for short chat in video games.

viraptor
Yes, for small messages and relatively small scope dictionary, character level will work. But that's very different from what's tested here.
tkgally
One reason for testing this is that it might indicate how accurately models can explain natural language grammar, especially for agglutinative and fusional languages, which form words by stringing morphemes together. When I tested ChatGPT a couple of years ago, it sometimes made mistakes identifying the components of specific Russian and Japanese words. I haven’t run similar tests lately, but it would be nice to know how much language learners can depend on LLM explanations about the word-level grammars of the languages they are studying.

Later: I asked three LLMs to draft such a test. Gemini’s[1] looks like a good start. When I have time, I’ll try to make it harder, double-check the answers myself, and then run it on some older and newer models.

[1] https://g.co/gemini/share/5eefc9aed193

gizmo686
What you are testing for is fundamentally different than character level text manipulation.

A major optimization in modern LLMs is tokenization. This optimization is based on the assumption that we do not care about character level details, so we can combine adjacent characters into tokens, then train and run the main AI model on smaller strings built out of a much larger dictionary of tokens. Given this architecture, it is impressive that AIs can perform character level operations at all. They essentially need to reverse engineer the tokenization process.

However, morphemes are semantically meaningful, so a quality tokenizer will tokenize at the morpheme level, instead of the word level.[0] This is of particularly obvious importance in Japanese, as the lack of spaces between words means that the naive "tokenize on whitespace" approach is simply not possible.

We can explore the tokenizer of various models here: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...

Looking at the words in your example, we see the tokenization of the Gemma model (closely related to Gemini) is:

  un-belie-vably
  dec-entral-ization
  bio-degradable
  mis-understanding
  anti-dis-establishment-arian-ism
  пере-писы-ваться
  pere-pis-y-vat-'-s-ya
  до-сто-примеча-тельность
  do-stop-rime-chat-el-'-nost-'
  пре-по-дава-тель-ница
  бе-зо-т-вет-ственности
  bezotvetstvennosti
  же-лез-нодоро-жный
  z-hele-zn-odoro-zh-ny-y
  食べ-させ-られた-くな-かった
  tab-es-aser-are-tak-unak-atta
  図書館
  tos-ho-kan
  情報-技術
  j-ō-h-ō- gij-utsu
  国際-関係
  kok-us-ai- kan-kei
  面白-くな-さ-そうだ
Further, the training data that is likely to be relevant to this type of query probably isolates the individual morphemes while talking about a bunch of words that use them; so it is a much shorter path for the AI to associate these close-but-not-quite morpheme tokens with the actual sequence of tokens that corresponds to what we think of as a morpheme.
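
If you want to inspect tokenizations programmatically rather than through the playground linked above, here is a minimal sketch using the Hugging Face transformers library (the Gemma checkpoint name is illustrative and the model is gated, so substitute any tokenizer you have access to):

  from transformers import AutoTokenizer

  # Load a tokenizer and show how it splits a few of the words from this thread.
  tok = AutoTokenizer.from_pretrained("google/gemma-2b")
  for word in ["unbelievably", "decentralization", "преподавательница", "図書館"]:
      print(word, "->", tok.tokenize(word))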

[0] Morpheme-level tokenization is itself a non-trivial problem. However, it has been pretty well solved since long before the current generation of AI.

orbital-decay
Tokenizers are typically optimized for efficiency, not morpheme separation. Even in the examples above it's not morphemes - proper morpheme separation would be un-believ-ably and дост-о-при-меч-а-тельн-ость.

Regardless of this, Gemini is still one of the best models when it comes to Slavic word formation and manipulation; it can express novel (non-existent) words pretty well and doesn't seem to be confused by wrong separation. This seems to be the result of extensive multilingual training, because e.g. GPT models other than the discontinued 4.5-preview, and many Chinese models, have issues with basic coherency in languages that heavily rely on word formation, despite using similar tokenizers.

curioussquirrel
Thanks for the explanation and for the tokenizer playground link!
tkgally
Thanks for the explanation. Very interesting.

I notice that that particular tokenization deviates from the morphemic divisions in several cases, including ‘dec-entral-ization’, ‘食べ-させ-られた-くな-かった’, and ‘面白-くな-さ-そうだ.’ ‘dec’ and ‘entral’ are not morphemes, nor is ‘くな.’

DonHopkins
inf-ucking-credible
neerajsi
https://www.anthropic.com/news/analysis-tool

Seems like they already built this capability.

jazzyjackson
That's good. 1-800-ChatGPT really let me down today. I like calling it to explain acronyms and define words since I travel with a flip phone without Google. Today I saw the word "littoral" and tried over and over to spell it out, but the model could only give me the definition for "literal" (admittedly a homophone, but hence spelling it out, Lima indigo tango tango oscar Romeo alpha Lima, to no avail).

I said "I know you're a robot and bad at spelling but listen..." And got cut off with a "sorry, my guidelines won't let me help with that request..."

Thankfully, the flip phone allows for some satisfaction when hanging up.

xwolfi
I know this word, it's French and it means coastline, coastal, something at the edge of the land and sea ! We use it in French a lot to describe positively a long coastline. I'm surprised it's used in an English context, but all French words can be used in English I guess if you're a bit "confiant" about it !
tokai
It is a Latin word.
yeasku
It's also a Spanish word used today.

There is not much Latin written around the world nowadays.

tokai
The point is that English didn't get it from French. Both languages got it from Latin. Please concentrate.
kgwgk
A very quick search suggests that the word entered English before French. (I could be wrong, I just found it interesting).
BoorishBears
Did you try "literal but with an o"?
ASalazarMX
Even search engines have trouble with that, they assume you're looking for the literal (letter) named "O".
BoorishBears
Sure a search engine might, but this is what LLMs excel at

I tried it and 1-800-ChatGPT got it immediately. "What's the word that sounds like literal, but then it's spelled with an O in it".

It asked if I was thinking of littoral (spelled out), I confirmed, and it gave me the meaning

yeasku
The trouble:

Did you mean litoral?

ASalazarMX
I said "I know you're a robot and bad at spelling but listen..." And got cut off with a "sorry, my guidelines won't let me help with that request..."

For some reason I like when they do that. My fondest memory was chatting with Copilot (when it was called Sydney), challenging it to a game of rock-paper-scissors, asking it to choose first, and winning every round to its increasing astonishment, until it suspected I was cheating and ended the conversation. So smart and so dumb.

necovek
I think the base64 decoding is interesting: in a sense, the model training set likely had lots of base64-encoded data (imagine MIME data in emails, JSON, HTML...), but for it to decode successfully, it had to learn decode sequences for every 4 base64 characters (which turn into 3 bytes). This could have been generated as training set data easily, and I only wonder if each and every one of them was found enough times to end up in the weights?
curioussquirrel
Even GPT 3.5 is okay (but far from great) at Base64, especially shorter sequences of English or JSON data. Newer models might be post-trained on Base64-specific data, but I don't believe it was the case for 3.5. My guess is that as you say, given the abundance of examples on the internet, it became one of the emergent capabilities, in spite of its design.
ACCount37
No one does RL for better base64 performance. LLMs are just superhuman at base64, as a natural capability.

If an LLM wants a message to be read only by another LLM? Base64 is occasionally chosen as an obfuscation method of choice. Which is weird for a number of reasons.

necovek
Why are you so confident about this? I am honestly interested if you were part of any of the LLM training data collection teams, because that's the only way to be so certain.

It's trivial to generate a full mapping of all base64 4-byte sequences which map to all 3-byte 8-bit sequences (there is only 8^3 of different "tokens", or 2048), and especially to any sequences coming out as ASCII (obviously even fewer). If I was building a training set, I would include the mapping in multiple shapes and formats, because why not?

If it's an emergent "property", have you tried asking an LLM to do a base48 for instance? Or maybe even something crazier like base55 (keeping it a subset of base64 set).

ACCount37
The conventional wisdom is that real world text is the most valuable pre-training data.

There is some experimentation on using algorithmically generated synthetic data in pre-training, as well as some intentional inclusions of "weird" data - like CSV logs of weather readings. But generally, it's seen as computationally inefficient - compared to "normal" pre-training done on natural data.

In a world where compute is much cheaper and getting new data is much more expensive, I would expect this kind of thing to be pursued more. We're heading for that world. But we aren't there yet.

I haven't experimented with baseN encodings myself, no. But if I were to put my expectations down in advance:

1. Base64 is by far the best-known baseN encoding in LLMs.

2. This is driven mainly by how well represented meaningful base64 strings are in the natural "scraped web" datasets. LLMs learn base64 the way they learn languages.

3. Every LLM pre-trained on "scraped web" data will be somewhat capable of reading and writing base64.

4. Base64-encoded text is easier to read for an LLM than encoded non-text binary data.

5. The existence of a strict, learnable "4 characters -> 3 bytes" map is quite beneficial, but not vital.

Timwi
It's trivial to generate a full mapping of all base64 4-byte sequences which map to all 3-byte 8-bit sequences (there is only 8^3 of different "tokens", or 2048)

How did you get this number? The correct number is 64^4 = 256^3 = 16777216.
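
A quick sanity check of that arithmetic with Python's standard library (the example bytes are arbitrary):

  import base64

  # 4 base64 characters encode 3 bytes, so the full lookup table
  # has 64**4 == 256**3 == 16,777,216 entries.
  print(64 ** 4, 256 ** 3)         # 16777216 16777216
  print(base64.b64encode(b"abc"))  # b'YWJj': one 3-byte group -> 4 characters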

necovek
My bad, just a total brainfart on my part: you are completely right! :)
necovek
For kicks, I've tried this out with ChatGPT5: it nicely explained how it will use A-Za-z0123 as the alphabet for base55, and then duly went and produced a string with a 4 in it. It's not even base64, so it's all sorts of messy :)
flowerthoughts
I would much rather that the model learns natural languages and knows how to run base64(1) to do decoding. Parameters should be considered precious so we can get the model sizes down from these absurd levels.

I'm a big fan of the Mixture of Experts approach, and having agents attached to some of those experts would be a great step forward. Say an expert that knows how to run common shell scripts, and its parameters are only used when it realizes a shell script would solve the problem.

malshe
I play Quartiles in Apple News app daily (https://support.apple.com/guide/iphone/solve-quartiles-puzzl...). Occasionally when I get stuck, I use ChatGPT to find a word that uses four word fragments or tiles. It never worked before GPT 5. And with GPT 5 it works only with reasoning enabled. Even then, there is no guarantee it will find the correct word and may end up hallucinating badly.
curioussquirrel
Yep, there is still room for improvement, but my point is that the LLMs are getting better at something they're "not supposed to be able to do".

Quartiles sounds like an especially brutal game for an LLM, though! Thanks for sharing.

atleastoptimal
I rearry rove a ripe strawberry
NitpickLawyer
Well, not surprising, but the latest LLMs really do get the gist of your joke attempt. Here's a plain, unauthenticated chatgpt reply:

That post — “I rearry rove a ripe strawberry” — is a playful way of writing “I really love a ripe strawberry.”

The exaggerated misspelling (“rearry rove”) mimics the way a stereotyped “Engrish” or “Japanese accent” might sound when pronouncing English words — replacing L sounds with R sounds.

So, the user was most likely joking or being silly, trying to sound cute or imitate a certain meme style. However, it’s worth noting that while this kind of humor can be lighthearted, it can also come across as racially insensitive, since it plays on stereotypes of how East Asian people speak English.

In short:

Literal meaning: They love ripe strawberries.

Tone/intention: Playful or meme-style exaggeration.

Potential issue: It relies on a racialized speech stereotype, so it can be offensive depending on context.

atleastoptimal
I was surprised that was the example they used lol
zamalek
It seems like they don't realize the relevance of "strawberry." LLMs were famously incapable of counting the Rs in strawberry not too long ago.
hansonkd
ChatGPT 5 is still pathetically bad at Roman numerals. I asked it to find the longest Roman numeral in a range. Its first guess was the highest number in the range, despite that being a short numeral. Its second guess, after help, was a longer numeral but outside the range. Its last guess was the correct longest numeral, but it miscounted how many characters it contained.
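
For comparison, the deterministic version of that task is tiny; a minimal sketch (the range below is illustrative, not the one from the original prompt):

  def to_roman(n: int) -> str:
      # Greedy conversion using the standard subtractive pairs.
      vals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
              (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
              (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
      out = []
      for value, symbol in vals:
          while n >= value:
              out.append(symbol)
              n -= value
      return "".join(out)

  longest = max(range(1, 100), key=lambda n: len(to_roman(n)))
  print(longest, to_roman(longest))  # 88 LXXXVIII (8 characters)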
zeroq

  - How many letters R are in the word `strawberry`?
  - There are seven letters R in the word `strawberry`.
    Would you like me to rearrange them?
throw-10-13
AI is getting better at search and replace, something that every text editor has been able to do for 40 years.