Beliefs that are true for regular software but false when applied to AI
I think Apple execs genuinely underestimated how difficult it would be to get LLMs to perform up to Apple's typical standards of polish and control.
to perform up to Apple's typical standards of polish and control.
I no longer believe they have kept to those standards in general. The UX/UI used to be a top priority, but the quality control has certainly gone down over the years[1]. The company is now driven more by supply-chain and business-minded optimizations than by what to give to the end user.
At the same time, what one can do with AI correlates strongly with what one does with their devices in the first place. A Windows Recall-like feature for iPadOS might have been interesting (if not equally controversial), but not that useful, because even to this day iPadOS remains quite restrictive for most atypical tasks.
[1] https://www.macobserver.com/news/macos-tahoe-upside-down-ui-...
to perform up to Apple's typical standards of polish and control.
I no longer believe they have kept to those standards in general.
I 100% agree with this, if I compare AI's ability to speed up the baseline for me in terms of programming Golang (hard/tricky tasks clearly still require human input - watch out for I/O ops) with Apple's inability to integrate it in even the simplest of ways... things are just peculiar on the Apple front. A bit similar to how MS seems to be gradually losing the ability to produce a version of Windows that people want to run, due to organisational infighting.
I’ve seen a lot of things that look like they’re working for a demo, but shortly after starting to use it? Trash. Not every time (and it’s getting a little better), but often enough that personally I’ve found them a net drain on productivity.
And I literally work in this space.
Personally, I find Apple's hesitation here a breath of fresh air, because I’ve come to absolutely hate Windows - and everybody doing vibe code messes that end up being my problem.
I've now changed to asking where things are in the code base and how they work then making changes myself.
Personally, I find Apple's hesitation here a breath of fresh air
It does not appear to me as hesitation, but rather as an example of how they were recently unable to deliver on their marketing promises.
Calling a suite of incomplete features "Apple Intelligence" means that they had much higher expectations internally, similar to how they refined products as second movers in other instances. They have a similar situation with XR now.
Apple had two separate teams competing against each other on this topic
That is a sign of very bad management. Overlapping responsibilities kill motivation as winning the infighting becomes more important than creating a good product. Low morale, and a blaming culture is the result of such "internal competition". Instead, leadership should do their work and align goals, set clear priorities and make sure that everybody rows in the same direction.
In other words, should he shrink the Mac, which would be an epic feat of engineering, or enlarge the iPod? Jobs preferred the former option, since he would then have a mobile operating system he could customize for the many gizmos then on Apple’s drawing board. Rather than pick an approach right away, however, Jobs pitted the teams against each other in a bake-off.
If you have say 16GB of GPU RAM and around 64GB of RAM and a reasonable CPU then you can make decent use of LLMs. I'm not an Apple jockey but I think you normally have something like that available, so you will have a good time, provided you curb your expectations.
I'm not an expert but it seems that the jump from 16 to 32GB of GPU RAM is large in terms of what you can run and the sheer cost of the GPU!
If you have 32GB of local GPU RAM and gobs of RAM you can run some pretty large models locally, or lots of small ones for differing tasks.
I'm not too sure about your privacy/risk model but owning a modern phone is a really bad starter for 10! You have to decide what that means for you and that's your thing and yours alone.
https://www.theinformation.com/articles/apple-fumbled-siris-...
Distrust between the two groups got so bad that earlier this year one of Giannandrea’s deputies asked engineers to extensively document the development of a joint project so that if it failed, Federighi’s group couldn’t scapegoat the AI team.

It didn’t help the relations between the groups when Federighi began amassing his own team of hundreds of machine-learning engineers that goes by the name Intelligent Systems and is run by one of Federighi’s top deputies, Sebastien Marineau-Mes.
This is a pretty good article, and worth reading if you aren't aware that Apple has seemingly mostly abandoned the vision of on-device AI (I wasn't aware of this)
However, when I stopped driving and looked at the picture the AI generated description was pretty poor - it wasn't completely wrong but it really wasn't what I was expecting given the description.
What really kills me is “a screenshot of a social media post” come on it’s simple OCR read the damn post to me you stupid robot! Don’t tell me you can’t, OCR was good enough in the 90s!
get LLMs to perform up to Apple's typical standards of polish and control.
I reject this spin (which is the Apple PR explanation for their failure). LLMs already do far better than Apple’s 2025 standards of polish. Contrast things built outside Apple. The only thing holding Siri back is Apple’s refusal to build a simple implementation where they expose the APIs to “do phone things” or “do home things” as a tool call to a plain old LLM (or heck, build MCP so LLM can control your device). It would be straightforward for Apple to negotiate with a real AI company to guarantee no training on the data, etc. the same way that business accounts on OpenAI etc. offer. It might cost Apple a bunch of money, but fortunately they have like 1000 bunches of money.
minor tools for making emojis, summarizing notifications, and proof reading.
The notification / email summaries are so unbelievably useless too: it’s hardly more work to skim the notification / email that I do anyway.
So while Apple's AI summaries may have been poorly executed, I can certainly understand the appeal and motivation behind such a feature.
Why use 10 words when you could do 1000. Why use headings or lists, when the whole story could be written in a single paragraph spanning 3 pages.
If it's to succinctly communicate key facts, then you write it quickly.
- Discovered that Bilbo's old ring is, in fact, the One Ring of Power.
- Took it on a journey southward to Mordor.
- Experienced a bunch of hardship along the way, and nearly failed at the end, but with Sméagol's contribution, successfully destroyed the Ring and defeated Sauron forever.
....And if it's to tell a story, then you write The Lord of the Rings.
"When's dinner?" "Well, I was at the store earlier, and... (paragraphs elided) ... and so, 7pm."
There's probably a sci-fi story about it; if not, it should be written.
Eg sometimes the writer is outright antagonistic, because they have some obligation to tell you something, but don't actually want you to know.
Those kinds of emails are so uncommon they’re absolutely not worth wasting this level of effort on. And if you’re in a sorry enough situation where that’s not the case, what you really need is the outside context the model doesn’t know. The model doesn’t know your office politics.
Why do I think this? ...in the early 2000's my employer had a company wide license for a document summarizer tool that was rather accurate and easy to use, but nobody ever used it.
There are some good parts to Apple Intelligence though. I find the priority notifications feature works pretty well, and the photo cleanup tool works pretty well for small things like removing your finger from the corner of a photo, though it's not going to work on huge tasks like removing a whole person from a photo.
I want to open WhatsApp and open the message and have it clear the notif. Or at least click the notif from the normal notif center and have it clear there. It kills me
it's not going to work on huge tasks like removing a whole person from a photo.
I use it for removing people who wander into the frame quite often. It probably won't work for someone close up, but it's great for removing a tourist who spends ten minutes taking selfies in front of a monument.
"A bunch of people right outside your house!!!"
because it aggregates multiple single person walking by notifications that way...
Anyway, I get wanting to see who's ringing your doorbell in e.g. apartment buildings, and that extending to a house, especially if you have a bigger one. But is there a reason those cameras need to be on all the time?
so ramping up the rhetoric doesn't really hurt them...
I mean, I could imagine a person with no common sense almost making the same mistake: "I have a list of 5 notifications of a person standing on the porch, and no notifications about leaving, so there must be a 5 person group still standing outside right now. Whadya mean, 'look at the times'?"
A biologist, a physicist and a mathematician were sitting in a street cafe watching the crowd. Across the street they saw a man and a woman entering a building. Ten minutes later they reappeared together with a third person.

- They have multiplied, said the biologist.
- Oh no, an error in measurement, the physicist sighed.
- If exactly one person enters the building now, it will be empty again, the mathematician concluded.
I think Apple execs genuinely underestimated how difficult it would be to get LLMs to perform up to Apple's typical standards of polish and control
Not only Apple, this is happening across the industry. Executives' expectations of what AI can deliver are massively inflated by Amodei et al. essentially promising human-level cognition with every release.
The reality is that aside from coding assistants and chatbot interfaces (a la chatgpt) we've yet to see AI truly transform polished ecosystems like smartphones and OSes, for a reason.
The reality is that if they hadn’t announced these tools and joined the make-believe AI bubble, their stock price would have crashed. It’s okay to spend $400 million on a project, as long as you don’t lose $50 billion in market value in an afternoon.
Why not take the easy wins? Like let me change phone settings with Siri or something, but nope.
A lot of AI seems to be mismanaging it into doing things AI (LLMs) suck at... while leaving obvious quick wins on the table.
https://9to5mac.com/2025/09/22/macos-tahoe-26-1-beta-1-mcp-i...
I hear that. Then I try to use AI for a simple code task, writing unit tests for a class, very similar to other unit tests. It fails miserably. Forgets to add an annotation and enters a death loop of bullshit code generation. Generates test classes that test failed test classes that test failed test classes and so on. Fascinating to watch. I wonder how much CO2 it generated while frying some Nvidia GPU in an overpriced data center.
AI singularity may happen, but the Mother Brain will be a complete moron anyway.
I don't know how long that exponential will continue for, and I have my suspicions that it stops before week-long tasks, but that's the trend-line we're on.
The cases I'm thinking about are things that could be solved in a few minutes by someone who knows what the issue is and how to use the tools involved. I spent around two days trying to debug one recent issue. A coworker who was a bit more familiar with the library involved figured it out in an hour or two. But in parallel with that, we also asked the library's author, who immediately identified the issue.
I'm not sure how to fit a problem like that into this "duration of human time needed to complete a task" framework.
Or watch the Computerphile video summary/author interview, if you prefer: https://m.youtube.com/watch?v=evSFeqTZdqs
It's well worth looking at https://progress.openai.com/, here's a snippet:
human: Are you actually conscious under anesthesia?

GPT-1 (2018): i did n't . " you 're awake .
GPT-3 (2021): There is no single answer to this question since anesthesia can be administered [...]
Given that AI couldn't even speak English 6 years ago, do you really think it's going to struggle with unit tests for the next 20 years?
Yes.
LLM is a very interesting technology for machines to understand and generate natural language. It is a difficult problem that it sort of solves.
It does not understand things beyond that. Developing software is not simply a natural language problem.
[1] https://fortune.com/article/jamie-dimon-jpmorgan-chase-ceo-a...
(Has anyone tried an LLM on an in-basket test?[1] That's a basic test for managers.)
On the other hand, trying to do something "new" is lots of headaches, so emotions are not always a plus. I could make a parallel to doctors: you don't want a doctor to start crying in the middle of an operation because he feels bad for you, but you can't let doctors do everything they want - there need to be some checks on them.
Perhaps because most of the smartest people I know are regularly irrational or impulsive :)
Just as human navigators can find the smallest islands out in the open ocean, human curators can find the best information sources without getting overwhelmed by generated trash. Of course, fully manual curation is always going to struggle to deal with the volumes of information out there. However, I think there is a middle ground for assisted or augmented curation which exploits the idea that a high quality site tends to link to other high quality sites.
One thing I'd love is to be able to easily search all the sites in a folder full of bookmarks I've made. I've looked into it and it's a pretty dire situation. I'm not interested in uploading my bookmarks to a service. Why can't my own computer crawl those sites and index them for me? It's not exactly a huge list.
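For what it's worth, the "crawl my bookmarks and index them locally" part really is only a small script away. A rough sketch (the bookmark list and file names are placeholders, there's no politeness or error handling, and it assumes your Python's SQLite ships with FTS5, which most do):

    import sqlite3, urllib.request
    from html.parser import HTMLParser

    class TextOnly(HTMLParser):
        # crude HTML -> text: just collect the text nodes
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    db = sqlite3.connect("bookmarks.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)")

    bookmarks = ["https://example.com/some-article"]  # placeholder: your exported bookmark URLs

    for url in bookmarks:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        parser = TextOnly()
        parser.feed(html)
        db.execute("INSERT INTO pages VALUES (?, ?)", (url, " ".join(parser.chunks)))
    db.commit()

    # later: full-text search, entirely on your own machine
    for (url,) in db.execute("SELECT url FROM pages WHERE pages MATCH ?", ("search terms",)):
        print(url)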
Now most of the photos online are just AI generated.
Our best technology at present requires teams of people to operate and entire legions to maintain. This leads to a sort of balance: one single person can never go too far down any path on their own unless they convince others to join/follow them. That doesn't make this a perfect guard, we've seen it go horribly wrong in the past, but, at least in theory, this provides a dampening factor. It requires a relatively large group to go far along any path, towards good or evil.
AI reduces this. How greatly it reduces this, if it reduces it to only a handful, to a single person, or even to 0 people (putting itself in charge), seems to not change the danger of this reduction.
But don't count on it.
I mean, apart from anything else, that's still a bad outcome.
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.
"Linda is a bank teller" is strictly more likely than "Linda is a bank teller and is active in the feminist movement" — all you have is P(a)>P(a&b), not what the probability of either statement is.power resides where men believe it resides
And also where people believe that others believe it resides. Etc...
If we can find new ways to collectively renegotiate where we think power should reside we can break the cycle.
But we only have time to do this until people aren't a significant power factor anymore. But that's still quite some time away.
here are some example ideas that are perfectly true when applied to regular software
Hm, I'm listening, let's see.
Software vulnerabilities are caused by mistakes in the code
That's not exactly true. In regular software, the code can be fine and you can still end up with vulnerabilities. The platform on which the code is deployed could be vulnerable, or the way it is installed could make it vulnerable, and so on.
Bugs in the code can be found by carefully analysing the code
Once again, not exactly true. Have you ever tried understanding concurrent code just by reading it? Some bugs in regular software hide in places that human minds cannot probe.
Once a bug is fixed, it won’t come back again
Ok, I'm starting to feel this is a troll post. This guy can't be serious.
If you give specifications beforehand, you can get software that meets those specifications
Have you read The Mythical Man-Month?
He is trying to relax the general public's perception of AI's shortcomings. He's giving AI a break, at the expense of regular developers.
This is wrong on two fronts:
First, because many people foresaw the AI shortcomings and warned about them. This "we can't fix a bug like in regular software" theatre hides the fact that we can design better benchmarks, or accountability frameworks. Again, lots of people foresaw this, and they were ignored.
Second, because it puts the strain on non-AI developers. It tarnishes the whole industry, putting AI and non-AI in the same bucket, as if AI companies stumbled on this new thing and were not prepared for its problems, when the reality is that many people were anxious about AI companies' practices not being up to standard.
I think it's a disgraceful take, that only serves to sweep things under a carpet.
The fact is, we kind of know how to prevent problems in AI systems:
- Good benchmarks. People said several times that LLMs display erratic behavior that could be prevented. Instead of adjusting the benchmarks (which would slow down development), they ignored the issues.
- Accountability frameworks. Who is responsible when an AI fails? How is the company responsible for the model going to make up for it? That was a demand from the very beginning. There are no such accountability systems in place. It's a clown fiesta.
- Slowing down. If you have a buggy product, you don't scale it. First, you try to understand the problem. This was the opposite of what happened, and at the time, they lied that scaling would solve the issues (when in fact many people knew for a fact that scaling wouldn't solve shit).
Yes, it's kind of different. But it's a difference we already know. Stop pushing this idea that this stuff is completely new.
But it's a difference we already know
'we' is the operative word here. 'We', meaning technical people who have followed this stuff for years. The target audience of this article are not part of this 'we' and this stuff IS completely new _for them_. The target audience are people who, when confronted with a problem with an LLM, think it is perfectly reasonable to just tell someone to 'look at the code' and 'fix the bug'. You are not the target audience and you are arguing something entirely different.
What should I say now? "AI works in mysterious ways"? Doesn't sound very useful.
Also, should I start parroting inaccurate, outdated generalizations about regular software?
The post doesn't teach anything useful for a beginner audience. It's bamboozling them. I am amazed that you used the audience perspective as a defense of some kind. It only made it worse.
Please, please, take a moment to digest my critique properly. Think about what you just said and what that implies. Re-read the thread if needed.
He is trying to relax the general public's perception of AI's shortcomings
This is not at all what I'm trying to do. This same essay is cross-posted on LessWrong[1] because I think ASI is the most dangerous problem of our time.
This "we can't fix a bug like in regular software" theatre hides the fact that we can design better benchmarks, or accountability frameworks
I'm not sure how I can say "your intuitions are wrong and you should be careful" and have that be misinterpreted as "ignore the problems around AI"
[1] https://www.lesswrong.com/posts/ZFsMtjsa6GjeE22zX/why-your-b...
There are people (definitely not me) who buy 100% of anything AI they read. To that beginner enthusiastic audience, your text looks like "regular software is old and unintuitive, AI is better because you can just vibe", and the text reinforces that sentiment.
I’m also going to be making some sweeping statements about “how software works”, these claims mostly hold, but they break down when applied to distributed systems, parallel code, or complex interactions between software systems and human processes.
I'd argue that this describes most software written since, uh, I hesitate to even commit to a decade here.
If you include analog computers, then there are some WWII targeting computers that definitely qualify (e.g., on aircraft carriers).
these claims mostly hold, but they break down when applied to distributed systems, parallel code, or complex interactions between software systems and human processes
The claims the GP quoted DON’T mostly hold, they’re just plain wrong. At least the last two, anyway.
Ok, I'm starting to feel this is a troll post. This guy can't be serious.
Did you read the footnote about writing regression tests to catch bugs before they come back in production?
https://news.ycombinator.com/item?id=45583970
Thought I might just skip the repetition. You can continue the conversation within that thread.
Because eventually we’ll iron out all the bugs so the AIs will get more reliable over time
Honestly this feels like a true statement to me. It's obviously a new technology, but so much of the "non-deterministic === unusable" HN sentiment seems to ignore the last two years where LLMs have become 10x as reliable as the initial models.
kind of logarithmic
Of course LLMs aren't people, but an AGI might behave like a person.
LLMs don't learn from a project. At best, you learn how to better use the LLM.
They do have other benefits, of course, i.e. once you have trained one generation of Claude, you have as many instances as you need, something that isn't true with human beings. Whether that makes up for the lack of quality is an open question, which presumably depends on the projects.
LLMs don't learn from a project.
How long do you think that will remain true? I've bootstrapped some workflows with Claude Code where it writes a markdown file at the end of each session for its own reference in later sessions. It worked pretty well. I assume other people are developing similar memory systems that will be more useful and robust than anything I could hack together.
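The pattern itself is tiny. Roughly something like this (the file name and fields are just illustrative, not Claude Code's own convention): the agent appends a dated summary when a session ends, and the tail of the file gets pasted into the prompt at the start of the next one.

    from datetime import date
    from pathlib import Path

    MEMORY = Path("PROJECT_NOTES.md")  # hypothetical memory file the agent reads and writes

    def append_session_notes(summary: str, decisions: list[str]) -> None:
        # called at the end of a session, e.g. via a "before you stop, summarise" instruction
        entry = [f"\n## Session {date.today().isoformat()}", summary, ""]
        entry += [f"- {d}" for d in decisions]
        with MEMORY.open("a") as f:
            f.write("\n".join(entry) + "\n")

    def memory_for_prompt(max_chars: int = 4000) -> str:
        # keep only the most recent notes so the context stays small
        return MEMORY.read_text()[-max_chars:] if MEMORY.exists() else ""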
Many of the inventors of LLMs have moved on to (what they believe are) better models that would handle such learnings much better. I guess we'll see in 10-20 years if they have succeeded.
But NNs are fundamentally continuous, I don't think it even makes sense to "count" bugs. You can have a list of prompts to which the model gives unwanted output, but it's a completely different ball game compared to regular software.
While it’s possible to demonstrate the safety of an AI for a specific test suite or a known threat, it’s impossible for AI creators to definitively say their AI will never act maliciously or dangerously for any prompt it could be given.
This possibility is compounded exponentially when MCP[0] is used.

Granted this is not super common in these tools, but it is essentially unheard of in junior devs.
I have never met a human that would straight up lie about something
This doesn't match my experience. Consider high profile things like the VW emissions scandal, where the control system was intentionally programmed to only engage during the emissions test. Dictators. People are prone to lie when it's in their self interest, especially for self preservation. We have entire structures of government, courts, that try to resolve fact in the face of lying.
If we consider true-but-misleading, then politics, marketing, etc. come sharply into view.
I think the challenge is that we don't know when an LLM will generate untrue output, but we expect people to lie in certain circumstances. LLMs don't have clear self-interests, or self awareness to lie with intent. It's just useful noise.
Granted this is not super common in these tools, but it is essentially unheard of in junior devs.
I wonder if it's unheard of in junior devs because they're all saints, or because they're not talented enough to get away with it?
it’s impossible for AI creators to definitively say their AI will never act maliciously or dangerously for any prompt it could be given
This is false, AI doesn't "act" at all unless you, the developer, use it for actions. In which case it is you, the developer, taking the action.
Anthropomorphizing AI with terms like "malicious" when they can literally be implemented with a spreadsheet—first-order functional programming—and the world's dumbest while-loop to append the next token and restart the computation—should be enough to tell you there's nothing going on here beyond next token prediction.
Saying an LLM can be "malicious" is not even wrong, it's just nonsense.
AI doesn't "act" at all unless you, the developer, use it for actions
This seems like a pointless definition of "act"? someone else could use the AI for actions which affect me, in which case I'm very much worried about those actions being dangerous, regardless of precisely how you're defining the word "act".
when they can literally be implemented with a spreadsheet
The financial system that led to 2008 basically was one big spreadsheet, and yet it would have been correct to be worried about it. "Malicious" maybe is a bit evocative, I'll grant you that, but if I'm about to be eaten by a lion, I'm less concerned about not mistakenly athropomorphizing the lion, and more about ensuring I don't get eaten. It _doesn't matter_ whether the AI has agency or is just a big spreadsheet or wants to do us harm or is just sitting there. If it can do harm, it's dangerous.
Nevertheless, in the book, the AI managed to convince people, using the light signal, to free it. Furthermore, it seems difficult to sandbox any AI that is allowed to access dependencies or external resources (i.e. the internet). It would require (e.g.) dumping the whole Internet as data into the Sandbox. Taking away such external resources, on the other hand, reduces its usability.
Presumably it's a phrase you might hear from a boss who sees AI as similar to (and as benign/known/deterministic as) most other software, per TFA
(I think something along these lines was actually in the Terminator 3 movie, the one where Skynet goes live for the first time).
Agreed though, no relation to the actual post.
McKittrick: General, the machine has locked us out. It's sending random numbers to the silos.
Pat Healy: Codes. To launch the missiles.
General Beringer: Just unplug the goddamn thing! Jesus Christ!
McKittrick: That won't work, General. It would interpret a shutdown as the destruction of NORAD. The computers in the silos would carry out their last instructions. They'd launch.
Can you imagine the chaos of completely turning off GPS or Gmail today? Now imagine pulling the plug on something in the near future that controls all electric power distribution, banking communications, and Internet routing.
>Presumably it's a phrase you might hear from a boss who sees AI as similar to (and as benign/known/deterministic as) most other software, per TFA
Yeah I get that, but I think that given the content of the article, "can't you just fix the code?" or the like would have been a better fit.
Your boss (or more likely, your bosses’ bosses’s boss) is the one deeply worried about it. Though mostly worried about being left behind by their competitors and how their company’s use of AI (or lack thereof) looks to shareholders.
AIs will get more reliable over time, like old software is more reliable than new software.
:)
Was that a human Freudian slip, or an artificial one?
Yes, old software is often more reliable than new.
If you think modern software is unreliable, let me introduce you to our friend, Rational Rose.
Or debuggers that would take out the entire OS.
Or a bad driver crashing everything multiple times a week.
Or a misbehaving process not handing control back to the OS.
I grew up in the era of 8 and 16 bit micros and early PCs; they were hilariously less stable than modern machines while doing far less. There wasn’t some halcyon age of near-perfect software, it’s always been a case of things being good enough to be good enough, but at least operating systems did improve.
The fact you continued to have BSOD issues after a full reinstall is pretty strong evidence you probably had some kind of hardware failure.
It's why I don't play the new trackmania.
Windows is only stabilizing because it's basically dead. All the activity is in the higher layers, where they are racking their brains on how to enshittify the experience, and extract value out of the remaining users.
There were plenty of other issues, including the fact that you had to adjust the right IRQ and DMA for your Sound Blaster manually, both physically and in each game, or that you needed to "optimize" memory usage, enable XMS or EMS or whatever it was at the time, or that you spent hours looking at the nice defrag/diskopt playing with your files, etc.
More generally, as you hint to, desktop operating systems were crap, but the software on top of it was much more comprehensively debugged. This was presumably a combination of two factors: you couldn't ship patches, so you had a strong incentive to debug it if you wanted to sell it, and software had way fewer features.
Come to think about it, early browsers kept crashing and taking down the entire OS, so maybe I'm looking at it with rosy glasses.
As I mess around with these old machines for fun in my free time, I encounter these kinds of crashes pretty dang often. It's hard to tell whether it's just that the old hardware is broken in odd ways, so I can't fully say it's the old software, but things are definitely pretty unreliable on old desktop Windows running old desktop Windows apps.
Last year I assembled a retro PC (Pentium 2, Riva TNT 2 Ultra, Sound Blaster AWE64 Gold) running Windows 98 to relive my childhood, and it is more stable than what I remembered, but still way worse than modern systems. There are plenty of games that will refuse to work for whatever reason, or that will crash the whole OS, especially when exiting, and require a hard reboot.
Oh and at least in the '90s you could already ship patches, we used to get them with the floppies and later CDs provided by magazines.
Remember when debuggers were young?
Remember when OSes were young?
Remember when multi-tasking CPUs were young?
Etc...
Old software is typically more reliable, not because the developers were better or the software engineering targeted a higher reliability metric, but because it's been tested in the real world for years. Even more so if you consider a known bug to be "reliable" behavior: "Sure, it crashes when you enter an apostrophe in the name field, but everyone knows that, there's a sticky note taped to the receptionist's monitor so the new girl doesn't forget."
Maybe the new software has a more comprehensive automated testing framework - maybe it simply has tests, where the old software had none - but regardless of how accurate you make your mock objects, decades of end-to-end testing in the real world is hard to replace.
As an industrial controls engineer, when I walk up to a machine that's 30 years old but isn't working anymore, I'm looking for failed mechanical components. Some switch is worn out, a cable got crushed, a bearing is failing...it's not the code's fault. It's not even the CMOS battery failing and dropping memory this time, because we've had that problem 4 times already, we recognize it and have a procedure to prevent it happening again. The code didn't change spontaneously, it's solved the business problem for decades... Conversely, when I walk up to a newly commissioned machine that's only been on the floor for a month, the problem is probably something that hasn't ever been tried before and was missed in the test procedure.
And more often than not the issue is a local configuration issue, bad test data, a misunderstanding of what the code is supposed to do, not being aware of some alternate execution path or other pre/post processing that is running, some known issue that we've decided not to fix for some reason, etc. (And of course sometimes we do actually discover a completely new bug, but it's rare).
To be clear, there are certainly code quality issues present that make modifications to the code costly and risky. But the code itself is quite reliable, as most bugs have been found and fixed over the years. And a lot of the messy bits in the code are actually important usability enhancements that get bolted on after the fact in response to real-world user feedback.
The reality is that management is often misaligned with proper software engineering craftsmanship. That was true at every org I've worked at except one, and that was because the top director who oversaw all of us was also a developer, and he let our team lead direct us whichever way he wanted.
I much prefer the alternative where it's written in a manner where you can almost prove it's bug free by comprehensively unit testing the parts.
The author is talking about the maturity of a project. Likewise, as AI technologies become more mature we will have more tools to use them in a safer and more reliable way.
New code is the source of new bugs. Whether that's an entirely new product, a new feature on an existing project, or refactoring.
Human thought is analog. It is based on chemical reactions, time, and unpredictable, (effectively) random physical characteristics. AI is an attempt to turn that which is purely digital into a rational analog-thought equivalent.
No amount of effort, money, power, or rare-mineral-eating TPUs will ever produce true analog data.
ISTR someone else round here observing how much more effective it is to ask these things to write short scripts that perform a task than doing the task themselves, and this is my experience as well.
If/when AI actually gets much better it will be the boss that has the problem. This is one of the things that baffles me about the managerial globalists - they don't seem to appreciate that a suitably advanced AI will point the finger at them for inefficiency much more so than at the plebs, for which it will have a use for quite a while.
that baffles me about the managerial globalists
It's no different from those on HN that yell loudly that unions for programmers are the worst idea ever... "it will never be me" is all they can think, then they are protesting in the streets when it is them, but only after the hypocrisy of mocking those in the street protesting today.
Unionized software engineers would solve a lot of the "we always work 80 hour weeks for 2 months at the end of a release cycle" problems, the "you're too old, you're fired" issues, the "new hires seems to always make more than the 5/10+ year veterans", etc. Sure, you wouldn't have a few getting super rich, but it would also make it a lot easier for "unionized" action against companies like Meta, Google, Oracle, etc. Right now, the employers hold like 100x the power of the employees in tech. Just look at how much any kind of resistance to fascism has dwindled after FAANG had another round of layoffs..
It’s entirely possible that some dangerous capability is hidden in ChatGPT, but nobody’s figured out the right prompt just yet.
This sounds a little dramatic. The capabilities of ChatGPT are known. It generates text and images. The qualities of the content of the generated text and images is not fully known.
The capabilities of ChatGPT are known. It generates text and images
There's a big difference between generating text which does someone's homework and text which changes people's opinions about the world (e.g. the unauthorized r/changemyview experiment run by University of Zurich researchers, in which their AI was better than almost all humans (it was 99th percentile) at changing people's views, and not a single user was able to spot it as being AI[1])
If you're disagreeing with the precise wording of "capabilities" vs "qualities of the content", then sure, use whatever words make sense to you. But I don't think that's an interesting discussion to have.
I stand by my original statement.
[1] https://www.reddit.com/r/changemyview/comments/1k8b2hj/meta_...
Likewise, knowing what to ask for in order to make some sort of horrific toxic chemical, nuclear bomb, or similar isn't much good if you cannot recognize the answer, and dangerous capability depends heavily on what you have available to you. Any idiot can be dangerous with C4 and a detonator, or bleach and ammonia. Even if ChatGPT could give entirely accurate instructions on how to build an atomic bomb, it wouldn't do much good, because you wouldn't be able to source the tools and materials without setting off red flags.
bugs are usually caused by problems in the data used to train an AI
I think a fundamental problem is that many people assume that an LLM's failure to correctly perform a task is a bug that can be fixed somehow. Often times, the reason for that failure is simply a property of the AI systems we have at the moment.
When you accidentally drop a glass and it breaks, you don't say that it's a bug in gravity. Instead, you accept that it's a part of the system you're working with. The same applies to many categories of failures in AI systems: we can try to reduce them, but unless the nature of the system fundamentally changes (and we don't know if or when that will happen), we won't be able to get rid of them.
"Bug" carries an implication of "fixable" and that doesn't necessarily apply to AI systems.
The reason we can't fix them is because we have no idea how they work; and the reason we have no idea how they work is this:
1. The "normal" computer program, which we do understand, implement a neural network
2. This neural network is essentially a different kind of processor. The "actual" computer program for modern deep learning systems is the weights. That is, weights : neural net :: machine language : normal cpu
3. We don't program these weights; we literally summon them out of the mathematical aether by the magic of back-propagation and gradient descent.
This summoning is possible because the "processor" (the neural network architecture) has been designed to be differentiable: for every node we can calculate the slope of the curve with respect to the result we wanted, so we know "The final output for this particular bit was 0.7, but we wanted it to be 1. If this weight in the middle of the network were just a little bit lower, then that particular output would have been a little bit higher, so we'll bump it down a bit."
And that's fundamentally why we can't verify their properties or "fix" them the way we can fix normal computer programs: Because what we program is the neural network; the real program, which runs on top of that network, is summoned and not written.
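To make the "summoning" concrete, here is a toy numpy sketch of that loop (purely illustrative): nobody writes the weights, they just get nudged down the loss gradient over and over until the network happens to compute the right thing, in this case XOR.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

    W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)          # weights nobody ever "writes"
    W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    for step in range(5000):
        # forward pass
        h = np.tanh(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backward pass: the slope of the squared error w.r.t. every weight
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * (1 - h ** 2)
        # nudge each weight a little way down its slope
        W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
        W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

    print(out.round(2).ravel())  # usually ends up close to [0, 1, 1, 0]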
Both the weights and the formula are known. But the weights are meaningless in a human fashion. This is unlike traditional software, where everything from the encoding (the meaning of the bits) to the state machine (the CPU) was codified by humans.
The only ways to fix it (somewhat) are to come up with better training data (hopeless), a better formula, or to tack something on top to smooth out the worst errors (kinda hopeless).
The same inputs should produce the same outputs.
And that assumption is important because dependability is the strength of an automated process.
1. You can change the training data.
2. You can change the objective function.
3. You can change the network topology.
4. You can change various hyperparameters (learning rate, etc.).
From there, I think it is better to look at the process as one of scientific discovery rather than a software debugging task. You form hypotheses and you try to work out how to test them by mutating things in one of the four categories above. The experiments are expensive and the results are noisy, since the training process is highly randomized. A lot of times the effect sizes are so small it is hard to tell if they are real. The universe of potential hypotheses is large, and if you test a lot of them, you have to correct for the chance that some will look significant just by luck. But if you can add up enough small, incremental improvements, they can produce a total effect that is large.
The good news is that science has a pretty good track record of improving things over time. The bad news is that it can take a lot of time, and there is no guarantee of success in any one area.
Those mechanisms only explain next word prediction, not LLM reasoning.
That's an emergent property that no person, as far as I understand it, can explain past hand waving.
Happy to be corrected here.
hey it's got an irrational preference for naming its variables after famous viking warriors, let's change that!
But worse, it's not that you can't change it, you just don't know! All you can do is test it and guess its biases.
Is it racist, is it homophobic, is it misogynistic? There was an article here the other day about AI in recruitment and the hidden biases. And there was a recruitment AI that only picked men for a role. The job spec was entirely gender neutral. And they hadn't noticed until a researcher looked at it.
It's a black box. So if it does something incorrectly, all they can do is retrain and hope.
Again, this is my present understanding of how it all works right now.
But overall, in my opinion, if devs are able to rebuild it from scratch with a predefined outcome, and even know how to change the system to improve certain aspects of it, then we do understand how it works.
With mixture-of-experts systems we’re introducing dedicated subsystems into the LLM responsible for specific aspects of the LLM
Common misconception: MoEs do have different "experts", but the model learns when to send input to different experts, and the model does not cleanly send coding tasks to the coding expert, physics tasks to the physics expert, etc. It's quite messy, and not nearly as interpretable as we'd want it to be.
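For anyone who hasn't looked inside one, the routing itself is mundane. A toy numpy sketch of a top-k MoE layer (illustrative only, not any particular model's implementation): the "experts" are just parallel weight matrices, and a learned router decides which ones see each token. Nothing in the router labels an expert as "the coding one"; any such structure is emergent and, as noted above, messy.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_experts, k = 16, 8, 2

    W_router = rng.normal(size=(d, n_experts))                      # learned gating weights
    experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # toy per-expert weights

    def moe_layer(x):
        # x: one token's hidden state, shape (d,)
        logits = x @ W_router
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        top = np.argsort(probs)[-k:]            # route to the k highest-scoring experts
        # weighted mix of the chosen experts' outputs (real models renormalize the gates)
        return sum(probs[i] * (x @ experts[i]) for i in top)

    y = moe_layer(rng.normal(size=d))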
Regarding the car, if you know how to build a car, you understand how a car works. A driver is more like someone using and llm, not a developer able to create an llm.
But with LLMs is there really more to understand?
Yes! loads! (: I want to be able to say statements like "this model will never ask the user to kill themselves" and be confident, but I can't do that today, and we don't know how. Note that we do know how to prove similar statements for regular software.
In a non-linear system the former is often easier than the latter. For example we know how planets “work” from the laws of motion. But planetary orbits involving > 2 bodies are non-linear, and predicting their motion far into the future is surprisingly difficult.
Neural networks are the same. They’re actually quite simple, it’s all undergraduate maths and statistics. But because they’re non-linear systems, predicting their behaviour is practically impossible.
The study of LLMs is much closer to biology than engineering.
I don't want to paste in the whole giant thing, but if you're curious:[0]
[0] https://drive.google.com/file/d/1D5yICywmkp24YajboKHdYFcBej0...
This article describes how Belgian supermarkets are replacing music played in stores by AI music to save costs, but you can easily imagine that the ai could also generate music to play to the emotions of customers to maybe influence their buying behavior: https://www.nu.nl/economie/6372535/veel-belgische-supermarkt...
What is it that we don’t understand?
And inspecting each part is not enough to understand how, together, they achieve what they achieve. We would need to understand the entire system in a much more abstract way, and currently we have nothing more than ideas of how it _might_ work.
Normally, with software, we do not have this problem, as we start on the abstract level with a fully understood design and construct the concrete parts thereafter. Obviously we have a much better understanding of how the entire system of concrete parts works together to perform some complex task.
With AI, we took the other way: concrete parts were assembled with vague ideas on the abstract level of how they might do some cool stuff when put together. From there it was basically trial-and-error, iteration to the current state, but always with nothing more than vague ideas of how all of the parts work together on the abstract level. And even if we just stopped the development now and tried to gain a full, thorough understanding of the abstract level of a current LLM, we would fail, as they already reached a complexity that no human can understand anymore, even when devoting their entire lifetime to it.
However, while this is a clear difference to most other software (though one has to get careful when it comes to the biggest projects like Chromium, Windows, Linux, ... since even though these were constructed abstract-first, they have been in development for such a long time and have gained so many moving parts in the meantime that someone trying to understand them fully on the abstract level will probably start to face the difficulty of limited lifetime as well), it is not an uncommon thing per se: we also do not "really" understand how economy works, how money works, how capitalism works. Very much like with LLMs, humanity has somehow developed these systems through interaction of billions of humans over a long time, there was never an architect designing them on an abstract level from scratch, and they have shown emergent capabilities and behaviors that we don't fully understand. Still, we obviously try to use them to our advantage every day, and nobody would say that modern economies are useless or should be abandoned because they're not fully understood.
At any frame we can pause, examine the state, then step forward, examine the state, and observe what changes have occurred
This example from software doesn't meaningfully hold for neural networks. It's a bit like trying to watch an individual COVID virus duplicate and then attempting to predict the pandemic. It's incredibly complicated and we haven't yet built the tools to help us understand
The ML field has a good understanding of the algorithms that produce these floating point numbers and lots of techniques that seem to produce “better” numbers in experiments. However, there is little to no understanding of what the numbers represent or how they do the things they do.
AI sits at a weird place where it can't be analyzed as software, and it can't be managed as a person.
My current mental model is that AGI can only be achieved when a machine experiences pleasure, pain, and "bodily functions". Otherwise there's no way to manage it.
Similarly, two adult humans know what to do to start the process that makes another human, and we know a few of the very low-level details about what happens, but that is a far cry from knowing how adult humans do what they do.
[1] https://www.reddit.com/r/slatestarcodex/comments/1o6n5ne/why...
I guessed the URL based on the Quartz docs. It seems to work but only has a few items from https://boydkane.com/essays/
Then it says the shop sign looks like a “Latin alphabet business name rather than Spanish or Portuguese”. Uhhh… what? Spanish and Portuguese use the Latin alphabet.
The answer is 24! See the ASCII values of '1' is 49, '2' is 50, and '+' is 43. Adding all that together we get 3. Now since we are doing this on a computer with a 8-bit infrastructure we multiply by 3 and so the answer is 24.
Cool! I didn't understand any of that but it was correct and you sound smart. I will put this thing in charge of critical parts of my business.
I mean, we know they work, and they work unreasonably well, but no one knows how, no one even knows why they work!
That's a weird situation, LLMs are language models, the very core of NLP, and yet the field tends to be overlooked. And by the way, she doesn't like the term "LLM": a language model that is large? what kind of model? what is "large"?
most AI companies will slightly change the way their AIs respond, so that they say slightly different things to the same prompt. This helps their AIs seem less robotic and more natural.
To my understanding this is managed by the temperature of the next token prediction which is picked more or less randomly based on this value. This temperature plays a role in the variability of the output.
I wasn't under the impression that it was to give the user a feeling of "realism", but rather that it produced better results with a slightly random prediction.
To my understanding this is managed by the temperature
This is true, but sampling also plays a fairly large role. The model will produce probabilities for the next token, temperature will modify these probabilities somewhat, but different sampling techniques (top-K, top-P, beam search, others) will also change these probabilities.
I wasn't under the impression that it was to give the user a feeling of "realism", but rather that it produced better results with a slightly random prediction.
My understanding is that it's a bit of both. If the AI responded exactly the same way to every "hi can you help me" prompt, I think users would call it more robotic. I also think that slightly varying the token prediction helps prevent repetitive text.
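For the curious, both knobs fit in a few lines. A toy sketch of temperature plus top-k/top-p sampling (illustrative; real inference stacks differ in the details):

    import numpy as np

    def sample_next(logits, temperature=0.8, top_k=50, top_p=0.95, rng=np.random.default_rng()):
        # temperature: <1 sharpens the distribution, >1 flattens it
        logits = np.asarray(logits, dtype=float) / max(temperature, 1e-6)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()

        # top-k: zero out everything but the k most probable tokens
        if top_k and top_k < len(probs):
            cutoff = np.sort(probs)[-top_k]
            probs = np.where(probs >= cutoff, probs, 0.0)

        # top-p (nucleus): keep the smallest set of tokens whose mass reaches p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order]) / probs.sum()
        keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]

        mask /= mask.sum()
        return rng.choice(len(mask), p=mask)  # still a random draw, just from a reshaped distribution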
When a CEO sees their customer chatbot call a customer a slur, they don't see "oh my chatbot runs on a stochastic model of human language and OpenAI can't guarantee that it will behave in an acceptable way 100% of the time", they see "ChatGPT called my customer a slur, why did you program it to do that?"
[1] https://arxiv.org/abs/1712.02779
Edit: typo
bad behaviour isn’t caused by any single bad piece of data, but by the combined effects of significant fractions of the dataset
Related opposing data point to this statement: https://news.ycombinator.com/item?id=45529587
With AI systems, almost all bad behaviour originates from the data that’s used to train them
Careful with this - even with perfect data (and training), models will still get stuff wrong.
How do you define "perfect" data and training? I'd argue that if you trained a small NN to play tic-tac-toe perfectly, it'd quickly memorise all the possible scenarios, and since the world state is small, you could exhaustively prove that it's correct for every possible input. So at the very least, there's a counter example showing that with perfect data and training, models will not get stuff wrong.
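To make that concrete: tic-tac-toe is small enough that "prove it for every possible input" is literally a loop over the whole game tree. A rough sketch, where `policy` stands in for whatever move your trained network picks (hypothetical, plug in your own):

    from functools import lru_cache

    LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

    def winner(b):
        for i, j, k in LINES:
            if b[i] != ' ' and b[i] == b[j] == b[k]:
                return b[i]
        return None

    @lru_cache(maxsize=None)
    def value(b, player):
        # minimax value of board b (a 9-char string) with `player` to move; +1 is good for X
        w = winner(b)
        if w:
            return 1 if w == 'X' else -1
        if ' ' not in b:
            return 0
        nxt = 'O' if player == 'X' else 'X'
        scores = [value(b[:m] + player + b[m+1:], nxt) for m in range(9) if b[m] == ' ']
        return max(scores) if player == 'X' else min(scores)

    def optimal_moves(b, player):
        nxt = 'O' if player == 'X' else 'X'
        scores = {m: value(b[:m] + player + b[m+1:], nxt) for m in range(9) if b[m] == ' '}
        best = max(scores.values()) if player == 'X' else min(scores.values())
        return {m for m, s in scores.items() if s == best}

    def verify(policy):
        # walk every reachable position and check the policy's move is minimax-optimal
        failures, seen = [], set()
        def walk(b, player):
            if (b, player) in seen or winner(b) or ' ' not in b:
                return
            seen.add((b, player))
            if policy(b, player) not in optimal_moves(b, player):
                failures.append((b, player))
            nxt = 'O' if player == 'X' else 'X'
            for m in range(9):
                if b[m] == ' ':
                    walk(b[:m] + player + b[m+1:], nxt)
        walk(' ' * 9, 'X')
        return failures  # empty list == correct on every reachable state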
Likewise a person you hire "could" take over the country and start a genocide, but it's rightfully low on your priority list because it's so unlikely that it's effectively impossible. Now an AI being rude or very unhelpul/harmful to your customer is a more pressing concern. And you don't have that confidence with most people either which is why we go through hiring processes.
The statistics here are key, and AI companies are geniuses at lying with statistics. I could shuffle a dictionary and output a random word each time and "answer" any hard problem. The entire point of AI is that you can do MUCH better than "random". Can anyone tell me which algorithm (this or ChatGPT) has a higher likelihood of producing a proof of the RH after n tokens? No, they can't. But ChatGPT can generate things on a human timescale that look more like proofs than my brute-force approach, so people (investors) give it the benefit of the doubt, even if it's not earned and it could well be LESS capable than brute force, as strange as that sounds.
[1] https://www.economist.com/leaders/2025/09/25/how-to-stop-ais...
"The worst effects of this flaw are reserved for those who create what is known as the “lethal trifecta”. If a company, eager to offer a powerful AI assistant to its employees, gives an LLM access to un-trusted data, the ability to read valuable secrets and the ability to communicate with the outside world at the same time, then trouble is sure to follow. And avoiding this is not just a matter for AI engineers. Ordinary users, too, need to learn how to use AI safely, because installing the wrong combination of apps can generate the trifecta accidentally."
I think that means savvy customers will want details or control over testing, and savvy providers will focus on solutions they can validate, or where testing is included in the workflow (e.g., code), or where precision doesn't matter (text and meme generation). Knowing that in depth is gold for AI advocates.
Otherwise, I don't think people really know or care about bugs or specifications or how AI breaks prior programmer models.
But people will become very hostile and demand regulatory frenzies if AI screws things up (e.g., influencing elections or putting people out of work). Then no amount of sympathy or understanding will help the industry, which has steadily been growing its capability for evading regulation via liability disclaimers, statutory exceptions, arbitration clauses, pitting local/regional/national governments against each other, etc.
To me that's the biggest risk: we won't get the benefits and generational investments will be lost in cleaning up after a few (even accidental) bad actors at scale.
To make this more concrete, here are some example ideas that are perfectly true when applied to regular software but become harmfully false when applied to modern AIs: ...
Hmm, I don't think any of these were true with non-AI software. Commonly held beliefs, sure.
If anything, I am glad AI is helping us revisit these assumptions.
- Software vulnerabilities are caused by mistakes in the code
Setting aside social engineering, mistake implies these were knowable in advance. Was the lack of TLS in the initial HTTP spec a mistake?
- Bugs in the code can be found by carefully analysing the code
If this was the case, why do people reach for rewriting buggy code they don't understand?
- Once a bug is fixed, it won’t come back again
Too many counter examples to this one in my lived experience.
- Every time you run the code, the same thing happens
Setting aside seeding PRNGs, there's the issue of running the code on different hardware. Or failing hardware.
- If you give specifications beforehand, you can get software that meets those specifications
I have never seen this work without needing to revise the specification during implementation.
nobody knows precisely what to do to ensure an AI writes formal emails correctly or summarises text accurately.
This is a bit of hyperbole; a lot of the recent approaches rely on MoEs, which are specialized. This makes it much more usable for simple use cases.
In regular software, vulnerabilities are caused by mistakes in the lines of code that make up the software
in modern AI systems, vulnerabilities or bugs are usually caused by problems in the data used to train an AI
In regular software, vulnerabilities are caused by lack of experience, and therefore by lack of proper training materials.
vulnerabilities are caused by lack of experience
In the limit, everyone has a lack of experience (compared to their future selves). This sentence proves too much (https://slatestarcodex.com/2013/04/13/proving-too-much/)
"Oh my goodness, it worked, it's amazing it's finally been updated," she tells the BBC. "This is a great step forward."
She thinks someone noticed the bug about not being able to show one-armed people, figured out why it wasn't working and wrote a fix.
I don’t think anyone is advocating for web apps to take the form of an LLM prompt with the app getting created on the fly every time someone goes to the url.
One popular dataset, FineWeb, is about 11.25 trillion words long[3], which, if you were reading at about 250 words per minute, would take you over 85 thousand years to read. It’s just not possible for any single human (or even a team of humans) to have read everything that an LLM has read during training.
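(The arithmetic there holds up, for what it's worth:

    words = 11.25e12            # FineWeb, roughly
    wpm = 250                   # reading speed from the quote
    years = words / wpm / 60 / 24 / 365
    print(f"{years:,.0f}")      # ~85,616 years of non-stop reading

so "over 85 thousand years" checks out.)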
Do you have to read everything in a dataset with your own eyes to make sense of it? This would make any attempt to address bias in the dataset impossible, and I think it's not, so there should be other ways to make sense of the dataset distribution without having to read it yourself.
Do you have to read everything in a dataset with your own eyes to make sense of it?
I mean, if you don't read it yourself, you're going to have to rely on _something/someone_ to filter/summarise the output, and at that point you might as well just accept that you'll never truly understand the entire thing?
I'll agree that we can do meaningful work (like reducing bias) without reading the entire dataset ourselves, but that doesn't change the fact that we cannot read everything that's going into these machines.