Why the push for Agentic when models can barely follow a simple instruction?

fork-bomber 327 points forum.cursor.com

−

varun_chopra

Marketing is being done really well in 2025, with brands injecting themselves into conversations on Reddit, LinkedIn, and every other public forum.^[1]

CEOs, AI "thought leaders," and VCs are advertising LLMs as magic, and tools like v0 and Lovable as the next big thing. Every response from leaders is some variation of https://www.youtube.com/watch?v=w61d-NBqafM

On the ground, we know that creating CLAUDE.md or cursorrules basically does nothing. It’s up to the LLM to follow instructions, and it does so based on RNG as far as I can tell. I have very simple, basic rules set up that are never followed. This leads me to believe everyone posting on that thread on Cursor is an amateur.

Beyond this, if you’re working on novel code, LLMs are absolutely horrible at doing anything. A lot of assumptions are made, non-existent libraries are used, and agents are just great at using tokens to generate no tangible result whatsoever.

I’m at a stage where I use LLMs the same way I would use speech-to-text (code) - telling the LLM exactly what I want, what files it should consider, and it adds _some_ value by thinking of edge cases I might’ve missed, best practices I’m unaware of, and writing better grammar than I do.

Edit:

[1] To add to this, any time you use search or Perplexity or what have you, the results come from all this marketing garbage being pumped into the internet by marketing teams.

−

joshvince

if you’re working on novel code, LLMs are absolutely horrible

This is spot on. Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).

What they are genuinely impressive at is parsing through large amounts of information to find something (eg: in a codebase, or in stack traces, or in logs). But this hype machine of 'agents creating entire codebases' is surely just smoke and mirrors - at least for now.

−

ehnto

at least for now.

I know I could be eating my words, but there is basically no evidence to suggest it ever becomes as exceptional as the kingmakers are hoping.

Yes it advanced extremely quickly, but that is not a confirmation of anything. It could just be the technology quickly meeting us at either our limit of compute, or it's limit of capability.

My thinking here is that we already had the technologies of the LLMs and the compute, but we hadn't yet had the reason and capital to deploy it at this scale.

So the surprising innovation of transformers did not give us the boost in capability itself, it still needed scale. The marketing that enabled the capital, that enables that scale was what caused the insane growth, and capital can't grow forever, it needs returns.

Scale has been exponential, and we are hitting an insane amount of capital deployment for this one technology that, has yet to prove commercially viable at the scale of a paradigm shift.

Are businesses that are not AI based, actually seeing ROI on AI spend? That is really the only question that matters, because if that is false, the money and drive for the technology vanishes and the scale that enables it disappears too.

−

NitpickLawyer

I know I could be eating my words, but there is basically no evidence to suggest it ever becomes as exceptional as the kingmakers are hoping.

??? It has already become exceptional. In 2.5 years (since chatgpt launched) we went from "oh, look how cute this is, it writes poems and the code almost looks like python" to "hey, this thing basically wrote a full programming language^[1] with genz keywords, and it mostly works, still has some bugs".

I think the goalpost moving is at play here, and we quickly forget how 1 year makes a huge difference (last year you needed tons of glue and handwritten harnesses to do anything - see aider) and today you can give them a spec and get a mostly working project (albeit with some bugs), 50$ later.

[1]https://github.com/ghuntley/cursed

−

ehnto

I don't disagree with you on the technology, but mostly my comment is about what the market is expecting. With such a huge capex expenditure it is expecting a huge returns. Given AI has not proven consistent ROI generally for other enterprises (as far as I know), they are hoping for something better than what is right now and they are hoping for it to happen before the money runs out.

I am not saying it's impossible, but there is no evidence that the leap in technology to reach wild profitability (replacing general labour) such investment desires is just around the corner either.

−

CuriouslyC

It doesn't really matter what the market is expecting at this point, the president views AI supremacy as non-negotiable. AI is too big to fail.

−

luhsprwhkv2

It’s true, but not just the presidency. The whole political class is convinced that this is the path out of all their problems.

−

danaris

...Is it the whole political class?

Or is it the whole political party?

−

ehnto

I am not from the US, but your administration could still fumble the AI bust even if it wants to avoid it. Who knows maybe they are hoping to short it.

−

baxtr

After 3 years, I would like to see pathways.

Let say we found a company that already realized 5-10% of savings in the first step. Now, based on this we might be able to map out the path to 25-30% savings in 5% steps for example.

I personally haven’t seen this, but I might have missed it as well.

−

Izkata

To phrase this another way, using old terms: We seem to be approaching the uncanny valley for LLMs, at which point the market overall will probably hit the trough of disillusionment.

−

Balinares

I feel like the invention of MCP was a lot more instrumental to that than model upgrades proper. But look at it as a good thing, if you will: it shows that even if models are plateauing, there's a lot of value to unlock through the tooling.

−

NitpickLawyer

it shows that even if models are plateauing,

The models aren't plateauing (see below).

invention of MCP was a lot more instrumental [...] than model upgrades proper

Not clear. The folks at hf showed that a minimal "agentic loop" in 100 LoC^[1] that gives the agent "just bash access" still got very close to SotA with all the bells and whistles (and surpassed last year models w/ handcrafted harnesses).

[1]https://github.com/SWE-agent/mini-swe-agent

−

StilesCrisis

I mean, that's still proving the point that tooling matters. I don't think his point was "MCP as a technology is extraordinary" because it's not.

−

theshrike79

Small focused (local) model + tooling is the future, not online LLMs with monthly costs. Your coding model doesn't need all of the information in the world built in, it needs to know code and have tools available to get any information it needs to complete its tasks. We have treesitter, MCPs, LSPs, etc - use them.

The problem is that all the billions (trillions?) of VC money go to the online models because they're printing money at this point.

There's no money to be made in creating models people can run locally for free.

−

kordlessagain

MCP is a marketing ploy, not an “invention”.

−

dragonwriter

It is an actual invention that has concrete function, whether or not it was part of a marketing push.

−

hitarpetar

I didn't realize generating the gen-z programming language was a goalpost in the first place

−

wkat4242

Yes it advanced extremely quickly,

It did but it's kinda stagnated now especially on the LLM front. The time when ever week a groundbreaking model came out is over for now. Later revisions of existing models, like GPT5 and llama4 have been underwhelming.

−

CuriouslyC

GPT5 may have been underwhelming to _you_. Understand that they're heavily RLing to raise the floor on these models, so they might not be magically smarter across the board, there are a LOT of areas where they're a lot better that you've probably missed because they're not your use case.

−

bangaroo

every time i say "the tech seems to be stagnating" or "this model seems worse" based on my observations i get this response. "well, it's better for other use cases." i have even heard people say "this is worse for the things i use it for, but i know it's better for things i don't use it for."

i have yet to hear anyone seriously explain to me a single real-world thing that GPT5 is better at with any sort of evidence (or even anecdote!) i've seen benchmarks! but i cannot point to a single person who seems to think that they are accomplishing real-world tasks with GPT5 better than they were with GPT4.

the few cases i have heard that venture near that ask may be moderately intriguing, but don't seem to justify the overall cost of building and running the model, even if there have been marginal or perhaps even impressive leaps in very narrow use cases. one of the core features of LLMs is they are allegedly general-purpose. i don't know that i really believe a company is worth billions if they take their flagship product that can write sentences, generate a plan, follow instructions and do math and they are constantly making it moderately better at writing sentences, or following instructions, or coming up with a plan and it consequently forgets how to do math, or becomes belligerent, or sycophantic, or what have you.

to me, as a user with a broad range of use cases (internet search, text manipulation, deep research, writing code) i haven't seen many meaningful increases in quality of task execution in a very, very long time. this tracks with my understanding of transformer models, as they don't work in a way that suggests to me that they COULD be good at executing tasks. this is why i'm always so skeptical of people saying "the big breakthrough is coming." transformer models seem self-limiting by merit of how they are designed. there are features of thought they simply lack, and while i accept there's probably nobody who fully understands how they work, i also think at this point we can safely say there is no superintelligence in there to eke out and we're at the margins of their performance.

the entire pitch behind GPT and OpenAI in general is that these are broadly applicable, dare-i-say near-AGI models that can be used by every human as an assistant to solve all their problems and can be prompted with simple, natural language english. if they can only be good at a few things at a time and require extensive prompt engineering to bully into consistent behavior, we've just created a non-deterministic programming language, a thing precisely nobody wants.

−

48terry

The simple explanation for all this, along with the milquetoast replies kasey_junk gave you, is that to its acolytes, AI and LLMs cannot fail, only be failed.

If it doesn't seem to work very well, it's because you're obviously prompting it wrong.

If it doesn't boost your productivity, either you're the problem yourself, or, again, you're obviously using it wrong.

If progress in LLMs seems to be stagnating, you're obviously not part of the use cases where progress is booming.

When you have presupposed that LLMs and this particular AI boom is definitely the future, all comments to the contrary are by definition incorrect. If you treat it as a given that this AI boom will succeed (by some vague metric of "success") and conquer the world, skepticism is basically a moral failing and anti-progress.

The exciting part about this belief system is how little you actually have to point to hard numbers and, indeed, rely on faith. You can just entirely vibe it. It FEELS better and more powerful to you, your spins on the LLM slot machine FEEL smarter and more usable, it FEELS like you're getting more done. It doesn't matter if those things are actually true over the long run, it's about the feels. If someone isn't sharing your vibes about the LLM slot machine, that's entirely their fault and problem.

−

mwigdahl

And on the other side, to detractors, AI and LLMs cannot ever succeed. There's always another goalpost to shift.

If it seems to work well, it's because it's copying training data. Or it sometimes gets something wrong, so it's unreliable.

If they say it boosts their productivity, they're obviously deluded as to where they're _really_ spending time, or what they were doing was trivial.

If they point to improvements in benchmarks, it's because model vendors are training to the tests, or the benchmarks don't really measure real-world performance.

If the improvements are in complex operations where there aren't benchmarks, their reports are too vague and anecdotal.

The exciting part about this belief system is how little you have to investigate the actual products, and indeed, you can simply rely on a small set of canned responses. You can just entirely dismiss reports of success and progress; that's completely due to the reporter's incompetence and self-delusion.

−

bangaroo

wouldn't call myself a detractor. i wouldn't call it a belief system i hold (i am an engineer 20 years into my career and would love to automate away the tedious parts of my job i've done a thousand times) as it is a position i hold based on the evidence i've seen in front of me.

i constantly hear that companies are running with "50% of their code written by AI!" but i've yet to meet an engineer who says they've personally seen this. i've met a few who say they see it through internal reporting, though it's not the case on their team. this is me personally! i'm not saying these people don't exist. i've heard it much more from senior leadership types i've met in the field - directors, vps, c-suite, so on.

i constantly hear that AI can do x, y, or z, but no matter how many people i talk to or how much i or my team works towards those goals, it doesn't really materialize. i can accept that i may be too stupid (though i'd argue that if that's the problem, the AI isn't as good as claimed) but i work with some brilliant people and if they can't see results, that means something to me.

i see people deploying the tool at my workplace, and recently had to deal with a situation where leadership was wondering why one of our top performers had slowed down substantially and gotten worse, only to find that the timeline exactly aligned with them switching to cursor as their IDE.

i read papers - lots of papers - and articles about both positive and negative assertions about LLMs and their applicability in the field. i don't feel like i've seen compelling evidence in research not done by the foundation model companies that supports the theory this is working well. i've seen lots of very valid and concerning discoveries reported by the foundation model companies, themselves!

there are many places in the world i am a hardliner on no generative AI and i'll be open about that - i don't want it in entertainment, certainly not in music, and god help me if i pick up the phone and call a company and an agent picks up.

for my job? i'm very open to it. i know the value i provide above what the technology could theoretically provide, i've written enough boilerplate and the same algorithms and approaches for years to prove to myself i can do it. if i can be as productive with less work, or more productive with the same work? bring it on. i am not worried about it taking my job. i would love it to fulfill its promise.

i will say, however, that it is starting to feel telling that when i lay out any sort of reasoned thought on the issue that (hopefully) exposes my assumptions, biases, and experiences, i largely get vague, vibes-based answers, unsourced statistics, and responses that heavily carry the implication that i'm unwilling to be convinced or being dogmatic. i very rarely get thoughtful responses, or actual engagement with the issues, concerns, or patterns i write about. oftentimes refutations of my concerns or issues with the tech are framed as an attack on my willingness to use or accept it, rather than a discussion of the technology on its merits.

while that isn't everything, i think it says something about the current state of discussion around the technology.

−

wkat4242

I work in a company that's "all in on AI" and there's so much BS being blown up just because they can't have it fail because all the top dogs will have mud on their faces. They're literally just faking it. Just making up numbers, using biased surveys, making sure employees know it's being "appreciated" if they choose option A "Yes AI makes me so much more productive" etc.

This is definitely something that biases me against AI, sure. Seeing how the sausage is made doesn't help. Because it's really a lot of offal right now especially where I work.

I'm a very anti-corporate non-teamplayer kinda person so I tend to be highly critical, I'll never just go along with PR if it's actually false. I won't support my 'team' if it's just wrong. Which often rubs people the wrong way at work. Like when I emphasised in a training that AI results must be double checked. Or when I answered in an "anonymous" survey that I'd rather have a free lunch than "copilot" and rated it a 2 out of 5 in terms of added value (I mean, at the time it didn't even work in some apps)

But I'm kinda done with soul-killing corporatism anyway. Just waiting for some good redundancy packages when the AI bubble collapses :)

−

hadlock

If they say it boosts their productivity, they're obviously deluded as to where they're _really_ spending time, or what they were doing was trivial.

A pretty substantial number of developers are doing trivial edits to business applications all over the globe, pretty much continuously. At least in the low to mid double digits %

−

48terry

You really thought you had a post with this one huh. I have second-hand embarrassment for you.

−

kasey_junk

Claude Sonnet 4.5 is _way_ better than previous sonnets and as good as Opus for the coding and research tasks I do daily.

I rarely use Google search anymore, both because llms got that ability embedded and the chatbots are good at looking through the swill search results have become.

−

bangaroo

"it's better at coding" is not useful information, sorry. i'd love to hear tangible ways it's actually better. does it still succumb to coding itself in circles, taking multiple dependencies to accomplish the same task, applying inconsistent, outdated, or non-idiomatic patterns for your codebase? has compliance with claude.md files and the like actually improved? what is the round trip time like on these improvements - do you have to have a long conversation to arrive at a simple result? does it still talk itself into loops where it keeps solving and unsolving the same problems? when you ask it to work through a complex refactor, does it still just randomly give up somewhere in the middle and decide there's nothing left to do? does it still sometimes attempt to run processes that aren't self-terminating to monitor their output and hang for upwards of ten minutes?

my experience with claude and its ilk are that they are insanely impressive in greenfield projects and collapse in legacy codebases quickly. they can be a force multiplier in the hands of someone who actually knows what they're doing, i think, but the evidence of that even is pretty shaky: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

the pitch that "if i describe the task perfectly in absolute detail it will accomplish it correctly 80% of the time" doesn't appeal to me as a particularly compelling justification for the level of investment we're seeing. actually writing the code is the simplest part of my job. if i've done all the thinking already, i can just write the code. there's very little need for me to then filter that through a computer with an overly-verbose description of what i want.

as for your search results issue: i don't entirely disagree that google is unusable, but having switched to kagi... again, i'm not sure the order of magnitude of complexity of searching via an LLM is justified? maybe i'm just old, but i like a list of documents presented without much editorializing. google has been a user-hostile product for a long time, and its particularly recent quality collapse has been well-documented, but this seems a lot more a story of "a tool we relied on has gotten measurably worse" and not a story of "this tool is meaningfully better at accomplishing the same task." i'll hand it to chatgpt/claude that they are about as effective as google was at directing me to the right thing circa a decade ago, when it was still a functional product - but that brings me back to the point that "man, this is a lot of investment and expense to arrive at the same result way more indirectly."

−

kasey_junk

You asked for a single anecdote of llms getting better at daily tasks. I provided two. You dismissed them as not valuable _to you_.

It’s fine that your preferences aren’t aligned such that you don’t value the model or improvements that we’ve seen. It’s troubling that you use that to suggest there haven’t been improvements.

−

bangaroo

you didn't provide an anecdote. you just said "it's better." an anecdote would be "claude 4 failed in x way, and claude 4.5 succeeds consistently." "it is better" is a statement of fact with literally no support.

the entire thrust of my statement was "i only hear nonspecific, vague vibes that it's better with literally no information to support that concept" and you replied with two nonspecific, vague vibes. sorry i don't find that compelling.

"troubling" is a wild word to use in this scenario.

−

mwigdahl

Here's one. I have a head to head "benchmark" involving generating a React web app to display a Gantt chart, add tasks, layer overlaps, read and write to files, etc. I compared implementing this application using both Claude Code with Opus 4.1 / Sonnet 4 (scenario 1) and Claude Code 2 with Sonnet 4.5 (scenario 2) head to head.

The scenario 1 setup could complete the application but it had about 3 major and 3 minor implementation problems. Four of those were easily fixed by pointing them out, but two required significant back and forth with the model to resolve.

The scenario 2 setup completed the application and there were four minor issues, all of which were resolved with one or two corrective prompts.

Toy program, single run through, common cases, stochastic parrot, yadda yadda, but the difference was noticeable in this direct comparison and in other work I've done with the model I see a similar improvement.

Take from that what you will.

−

bangaroo

so to clarify your case, you are having it generate a new application, from scratch, and then benchmarking the quality of the output and how fast it got to the solution you were seeking?

i will concede that in this arena, there does seem to be meaningful improvement.

i said this in one of my comments in this thread, but the place i routinely see the most improvement in output from LLMs (and find they perform best) for code generation is in green field projects, particularly ones whose development starts with an agent. some facts that make me side-eye this result (not yours in particular, just any benchmark that follows this model):

- the codebase, as long as a single agent and model are working on it, is probably suited to that model's biases and thus implicitly easier for it to work in and "understand."

- the codebase is likely relatively contained and simple.

- the codebase probably doesn't cross domains or require specialized knowledge of services or APIs that aren't already well-documented on the internet or weren't built by the tool.

these are definitely assumptions, but i'm fairly confident in their accuracy.

one of the key issues i've had approaching these agents is that all my "start with an LLM and continue" projects actually start incredibly impressively! i was pretty astounded even on the first version of claude code - i had claude building a service, web management interface AND react native app, in concert, to build an entire end to end application. it was great! early iteration was fast, particularly in the "mess around and find out what happens" phase of development.

where it collapsed, however, was when the codebase got really big, and when i started getting very opinionated about outcomes. my claude.md file grew and grew and seemed to enforce less and less behavior, and claude became less and less likely to successfully refactor or reuse code. this also tracks with my general understanding of what an LLM may be good or bad at - it can only hold so much context, and only as textual examples, not very effectively as concepts or mental models. this ultimately limits its ability to reason about complex architecture. it rapidly became faster for me to just make the changes i envisioned, and then claude became more of a refactoring tool that i very narrowly applied when i was too lazy to do the text wrangling myself.

i do believe that for rapid prototyping - particularly the case of "product manager trying to experiment and figure out some UX" - these tools will likely be invaluable, if they can remain cost effective.

the idea that i can use this, regularly, in the world of "things i do in my day-to-day job" seems a lot more far fetched, and i don't feel like the models have gotten meaningfully better at accomplishing those tasks. there's one notable exception of "explaining focused areas of the code", or as a turbo-charged grep that finds the area in the codebase where a given thing happens. i'd say that the roughly 60-70% success rate i see in those tasks is still a massive time savings to me because it focuses me on the right thing and my brain can fill in the rest of the gaps by reading the code. still, i wouldn't say its track record is phenomenal, nor do i feel like the progress has been particularly quick. it's been small, incremental improvements over a long period of time.

i don't doubt you've seen an improvement in this case (which is, as you admit, a benchmark) but it seems like LLMs keep performing better on benchmarks but that result isn't, as far as i can see, translating into improved performance on the day-to-day of building things or accomplishing real-world tasks. specifically in the case of GPT5, where this started, i have heard very little if any feedback on what it's better at that doesn't amount to "some things that i don't do." it is perfectly reasonable to respond to me that GPT5 is a unique flop, and other model iterations aren't as bad, in that case. i accept this is one specific product from one specific company - but i personally don't feel like i'm seeing meaningful evidence to support that assertion.

−

mwigdahl

Thank you for the thoughtful response. I really appreciate the willingness to discuss what you've seen in your experience. I think your observations are pretty much exactly correct in terms of where agents do best. I'd qualify in just a couple areas:

1. In my experience, Claude Code (I've used several other models and tools, but CC performs the best for me so that's my go-to) can do well with APIs and services that are proprietary as long as there's some sort of documentation for them it can get to (internal, Swagger, etc.), and you ensure that the model has that documentation prominently in context.

2. CC can also do well with brownfield development, but the scope _has_ to be constrained, either to a small standalone program or a defined slice of a larger application where you can draw real boundaries.

The best illustration I've seen of this is in a project that is going through final testing prior to release. The original "application" (I use the term loosely) was a C# DLL used to generate data-driven prescription monitoring program reporting.

It's not ultra-complicated but there's a two step process where you retrieve the report configuration data, then use that data to drive retrieval and assembly of the data elements needed for the final report. Formatting can differ based on state, on data available (reports with no data need special formatting), and on whether you're outputting in the context of transmission or for user review.

The original DLL was written in a very simplistic way, with no testing and no way to exercise the program without invoking it from its link points embedded in our main application. Fixing bugs and testing those fixes were both very painful as for production release we had to test all 50 states on a range of different data conditions, and do so by automating the parent application.

I used Claude Code to refactor this module, add DI and testing, and add a CLI that could easily exercise the logic in all different supported configurations. It took probably $50 worth of tokens (this was before I had a Max account, so it was full price) over the course of a few hours, most of which time I was in other meetings.

The final result did exhibit some classic LLM problems -- some of the tests were overspecific, it restructured without always fully cleaning up the existing functions, and it messed up a couple of paths through the business logic that I needed to debug and fix. But it easily saved me a couple days of wrestling with it myself, as I'm not super strong with C#. Our development teams are fully committed, and if I hadn't used CC for this it wouldn't have gotten done at all. Being able to run this on the side and get a 90% result I could then take to the finish line has real value for us, as the improved testing alone will see an immediate payback with future releases.

This isn't a huge application by any means, but it it's one example of where I've seen real value that is hitting production, and seems representative of a decently large category of line-of-business modules. I don't think there's any reason this wouldn't replicate on similarly-scoped products.

−

kasey_junk

My one shot rate for unattended prompts (triggered GitHub actions) has gone from about 2 in 3 to about 4 in 5 with my upgrade to 4.5 in the codebase I program in the most (one built largely pre-ai). These are highly biased to tasks I expect ai to do well.

Since the upgrade I don’t use opus at all for planning and design tasks. Anecdotally, I get the same level of performance on those because I can choose the model and I don’t choose opus. Sonnet is dramatically cheaper.

What’s troubling is that you made a big deal about not hearing any stories of improvements as if your bar was very low for said stories, then immediately raised the bar when given them. It means that one doesn’t know what level of data you actually want.

−

Retric

Requesting concrete examples isn’t a high bar. Autopilot got better tells me effectively nothing. Autopilot can now handle stoplights does.

−

kasey_junk

but i cannot point to a single person who seems to think that they are accomplishing real-world tasks with GPT5 better than they were with GPT4.

I don’t use OpenAI stuff but I seem to think Claude is getting better for accomplishing the real world tasks I ask of it.

−

Retric

Specifics are worth talking about. I just felt it unfair to complain about raising the bar when you didn’t initially reach it.

In your own worlds: “You asked for a single anecdote of llms getting better at daily tasks.”

Which is already less specific than their request: “i'd love to hear tangible ways it's actually better.”

Saying “is getting better for accomplishing the real world tasks I ask of it” brings nothing to a discussion and was the kind of vague statement that they were initially complaining about. If LLM’s are really improving it’s not a major hurdle to say something meaningful about what specific is getting better. /tilting at windmills

−

theshrike79

The biggest issue with Sonnet 4.5 is that it's chatty as fuuuck. It just won't shut up, it keeps producing massive markdown "reports" and "summaries" of every single minor change, wasting precious context.

With Sonnet 4 I rarely ran out of quota unexpectedly, but 4.5 chews through whatever little Anthropic gives us weekly.

−

orwin

Gpt5 isn't an improvement to me, but Claude sonnet4.5, handle terragrunt way, way better than the previous version did. It also go search AWS documentation by itself, and parse external documents way better. That's not LLM improvement, to be clear (except the terragrunt thing), I think it's improvement in data acquisition and a better inference engine. On react project it seems way, way less messy also, I have to use it more but the inference engine seems clearer. At least less prone to circular code, where it's stuck in a loop. It seems to be exiting the loop faster, even when the output isn't satisfactory (which isn't an issue to me, most of my prompt have more or less 'only write functions template, do not write the inside logic if it has to contain more than a loop', I fill the blanks myself)

−

maddmann

I’m curious what you are expecting when you say progress has stagnated?

−

delusional

Yes it advanced extremely quickly, but that is not a confirmation of anything. It could just be the technology quickly meeting us at either our limit of compute, or it's limit of capability.

To comment om this, because its the most common counter argument. Most technology has worked in steps. We take a step forward, then iterate on essentially the same thing. It's very rare we see order of magnitude improvement on the same fundamental "step".

Cars were quite a step forward from donkeys, but modern cars are not that far off from the first ones. Planes were an amazing invention, but the next model of plane is basically the same thing as the first one.

−

ehnto

I agree, I think we are in the latter phase already. LLMs were a huge leap in machine learning, but everything after has been steps on top + scale.

I think we would need another leap to actually meet the markets expectations on AI. The market is expecting AGI, but I think we are probably just going to do incremental improvements for language and multi modal models from here, and not meet those expectations.

I think the market is relying on something that doesn't currently exist to become true, and that is a bit irrational.

−

zamalek

Transformers aren't it, though. We need a new fundamental architecture and, just like every step forward in AI that came before, when that happens is a completely random event. Some researcher needs to wake up with a brilliant idea.

The explosion of compute and investment could mean that we have more researchers available for that event to happen, but at the same time transformers are sucking up all the air in the room.

−

alganet

Several people hinted at the limits this technology was about to face, including training data and compute. It was obvious it had serious limits.

Despite the warnings, companies insisted on marketing superintelligence nonsense and magic automatic developers. They convinced the market with disingenous demonstrations, which, again, were called out as bullshit by many people. They are still doing it. It's the same thing.

−

bccdee

Yes it advanced extremely quickly

The things that impress me about gpt-5 are basically the same ones that impressed me about gpt-3. For all the talk about exponential growth, I feel like we experienced one big technical leap forward and have spent the past 5 years fine-tuning the result—as if fiddling with it long enough will turn it into something it is not.

−

timmytokyo

When building their LLMs, the model makers consumed the entire internet. This allowed the models to improve exponentially fast. But there's no more internet to consume. Yes, new data is being generated, but not at anywhere near the rate the models were growing in capability just a year ago. That's why we're seeing diminishing returns when comparing, say, GPT-5 to GPT-4.

The AI marketers, accelerationists and doomers may seem to be different from one another, but the one thing they have in common is their adherence to an extrapolationist fallacy. They've been treating the explosion of LLM capabilities as a promise of future growth and capability, when in fact it's all an illusion. Nothing achieves indefinite exponential growth. Everything hits a wall.

−

walleeee

The question in your last paragraph is not the only one that matters. Funding the technology at a material loss will not be off the table. Think about why.

−

FromTheFirstIn

Just tell us why you think funding at a loss at this scale is viable, don’t smugly assign homework

−

walleeee

Apologies, not meant to be smug

−

48terry

...But you did fully intend to assign homework? Why are you even commenting, what are you adding?

−

ludicrousdispla

The marketing that enabled the capital, that enables that scale was what caused the insane growth, and capital can't grow forever,

Striking parallels between AI and food delivery (uber eats, deliveroo, lieferando, etc.) ... burn capital for market share/penetration but only deliver someone else's product with no investment to understand the core market for the purpose of developing a better product.

−

wongarsu

I have had LLMs write entire codebases for me, so it's not like the hype is completely wrong. It's just that this only works if what you want is "boring", limited in scope and on a well-trodden path. You can have an LLM create a CRUD application in one go, or if you want to sort training data for image recognition you can have it generte a one-off image viewer with shortcuts tailored to your needs for this task. Those are powerful things and worthy of some hype. For anything more complex you very quickly run into limits and the time and effort to do it with an LLM quickly approaches the time and effort required to do it by hand.

−

physicsguy

They're powerful, but my feeling is that largely you could do this pre-LLM by searching on Stack Overflow or copying and pasting from the browser and adapting those examples, if you knew what you were looking for. Where it adds power is adapting it to your particular use case + putting it in the IDE. It's a big leap but not as enormous a leap as some people are making out.

Of course, if you don't know what you are looking for, it can make that process much easier. I think this is why people at the junior end find it is making them (a claimed) 10x more productive. But people who have been around for a long time are more skeptical.

−

disgruntledphd2

Where it adds power is adapting it to your particular use case + putting it in the IDE. It's a big leap but not as enormous a leap as some people are making out.

To be fair, this is super, super helpful.

I do find LLMs helpful for search and providing a bunch of different approaches for a new problem/area though. Like, nothing that couldn't be done before but a definite time saver.

Finally, they are pretty good at debugging, they've helped me think through a bunch of problems (this is mostly an extension of my point above).

Hilariously enough, they are really poor at building MCP like stuff, as this is too new for them to have many examples in the training data. Makes total sense, but still endlessly amusing to me.

−

marcosdumay

Of course, if you don't know what you are looking for, it can make that process much easier.

Yes. My experience is that LLMs are really, really good at understanding what you are trying to say and bringing up the relevant basic information. That's a task we call "search", but it is different from the focused search people do most of the time.

Anyway, by the nature of the problem, that's something that people should do only a few times for each subject. There is not a huge market opportunity there.

−

Izkata

Why bother searching yourself? This is pre-LLM: https://github.com/drathier/stack-overflow-import

−

alwahi

i have seen so many people say that, but the app stores/package managers aren't being flooded with thousands of vibe coded apps, meanwhile facebook is basically ai slop. can you share your github? or a gist of some of these "codebases"

−

vallavaraiyan

What is novel code?

  1. LLM's would suck at coming up with new algorithms. 
  2. I wouldn't let an LLM decide how to structure my code. Interfaces, module boundaries etc

Other than that, given the right context (the sdk doc for a unique hardware for eg) and a well organised codebase explained using CLAUDE.Md they work pretty well in filling out implementations. Just need to resist the temptation to prompt while the actual typing would take seconds.

−

IX-103

Yep, LLMs are basically at the "really smart intern" level. Give them anything complex or that requires experience and they crash and burn. Give them a small, well-specified task with limited scope and they do reasonably well. And like an intern they require constant check-ins to make sure they're on track.

Of course with real interns you end up at the end with trained developers ready for more complicated tasks. This is useful because interns aren't really that productive if you consider the amount of time they take from experienced developers, so the main benefit is producing skilled employees. But LLMs will always be interns, since they don't grow with the experience.

−

motorest

Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).

Yes, kind of. What you downplay as "extremely well-known opinionated patterns" actually means standard design patterns that are well established and tried-and-true. You know, what competent engineers do.

There's even a basic technique which consists of prompting agents to refactor code to clean it up to comply with best practices, as this helps agents evaluate your project as it lines them up with known patterns.

What they are genuinely impressive at is parsing through large amounts of information to find something (eg: in a codebase, or in stack traces, or in logs).

Yes, they are. It helps if a project is well structured, clean, and follow best practices. Messy projects that are inconsistent and evolve as big balls of mud can and do judge LLMs to output garbage based on the garbage that was inputted. Once, while working on a particularly bad project, I noticed GPT4.1 wasn't even managing to put together consistent variable names for domain models.

But this hype machine of 'agents creating entire codebases' is surely just smoke and mirrors - at least for now.

This really depends on what are your expectations. A glass half full perspective clearly points you to the fact that yes agents can and do create entire codebases. I know this to be a fact because I did it already just for shits and giggles. A glass half empty perspective however will lead people to nitpick their way into asserting agents are useless at creating code because they once prompted something to create a Twitter code and it failed to set the right shade of blue. YMMV and what you get out is proportional to the effort you put in.

−

0xAFFFF

Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).

Which makes sense, considering the absolutely massive amount of tutorials and basic HOWTOs that were present in the training data, as they are the easiest kind of programming content to produce.

−

amiga386

The purpose of an LLM is not to do your job, it's to do enough to convince your boss to sack you and pay the LLM company some portion of your salary.

To that end, it doesn't matter if it works or not, it just has to demo well.

−

vidarh

My experience is opposite to yours. I have had Claude Code fix issues in a compiler over the last week with very little guidance. Occasionally it gets frustrating, but most of the time Claude Code just churns through issue after issue, fixing subtle code generation and parser bugs with very little intervention. In fact, most of my intervention is tool weaknesses in terms of managing compaction to avoid running out of context at inopportune moments.

It's implemented methods I'd have to look up in books to even know about, and shown that it can get them working. It may not do much truly "novel" work, but very little code is novel.

They follow instructions very well if structured right, but you can't just throw random stuff in CLAUDE.md or similar. The biggest issue I've run into recently is that they need significant guidance on process. My instructions tends to focus on three separate areas: 1) debugging guidance for a given project (for my compiler project, that means things like "here's how to get an AST dumped from the compiler" and "use gdb to debug crashes" (it sometimes did that without being told, but not consistently; with the instructions it usually does tht), 2) acceptance criteria - this does need reiteration, 3) telling it to run tests frequently, make small, testable changes, and to frequently update a detailed file outlining the approach to be taken, progress towards it, and any outcomes of investigation during the work.

My experience is that with those three things in place, I can have Claude run for hours with --dangerously-skip-permissions and only step in to say "continue" or do a /compact in the middle of long runs, with only the most superficial checks.

It doesn't always provide perfect code every step. But neither do I. It does however usually move in the right direction every step, and has consistently produced progress over time with far less effort on my behalf.

I wouldn't have it start from scratch without at least some scaffolding that is architecturally sound yet, but it can often do that too, though that needs review before it "locks in" a bad choice.

I'm at a stage where I'm considering harnesses to let Claude work on a problem over the course of days without human intervention instead of just tens of minutes to hours.

−

nosianu

My experience is opposite to yours.

But that is exactly the problem, no?

It is like, when you need some prediction (e.g. about market behavior), knowing that somewhere out there there is a person who will make the perfect one. However, instead of your problem being to make the prediction, now it is how to find and identify that expert. Is that type of problem that you converted yours into any less hard though?

I too had some great minor successes, the current products are definitely a great step forward. However, every time I start anything more complex I never know in advance if I end up with utterly unusable code, even after corrections (with the "AI" always confidently claiming that now it definitely fixed the problem), or something usable.

All those examples such as yours suffer from one big problem: They are selected afterwards.

To be useful, you would have to make predictions in advance and then run the "AI" and have your prediction (about its usefulness) verified.

Selecting positive examples after the work is done is not very helpful. All it does is prove that at least sometimes somebody gets something useful out of using an LLM for a complex problem. Okay? I think most people understand that by now.

PS/Edit: Also, success stories we only hear about but cannot follow and reproduce may have been somewhat useful initially, but by now most people are beyond that, willing to give it a try, and would like to have a link to the working and reproducible example. I understand that work can rarely be shared, but then those examples are not very useful any more at this point. What would add real value for readers of these discussions now is when people who say they were successful posted the full, working, reproducible example.

EDIT 2: Another thing: I see comments from people who say they did tweak CLAUDE.md and got it to work. But the point is predictability and consistency! If you have that one project where you twiddled around with the file and added random sentences that you thought could get the LLM to do what you need, that's not very useful. We already know that trying out many things sometimes yields results. But we need predictability and consistency.

We are used to being able to try stuff, and when we get it working we could almost always confidently say that we found the solution, and share it. But LLMs are not that consistent.

−

baq

replace 'AI|LLM' with 'new hire' in your post for a funny outcome.

−

marcosdumay

New hires perform consistently. Even if you can't predict beforehand how well they'll work, after a short observation time you can predict very well how they will continue to work.

−

hitarpetar

this is the first time I've ever seen this joke, well done!

−

svieira

Replace 'new hire' with 'AI|LLM' in the updated post for a very sad outcome.

−

vidarh

My point is that these are not minor successes, and not occasional. Not every attempt is equally successful, but a significant majority of my attempts are. Otherwise I wouldn't be letting it run for longer and longer without intervention.

For me this isn't one project where I've "twiddled around with the file and added random sentences". It's an increasingly systematic approach to giving it an approach to making changes, giving it regression tests, and making it make small, testable changes.

I do that because I can predict with a high rate of success that it will achieve progress for me at this point.

There are failures, but they are few, and they're usually fixed simply by starting it over again from after the last succesful change when it takes too long without passing more tests. Occasionally it requires me to turn off --dangerously-skip-permissions and guide it through a tricky part. But that is getting rarer and rarer.

No, I haven't formally documented it, so it's reasonable to be skeptical (I have however started packaging up the hooks and agents and instructions that consistently work for me on multiple projects. For now, just for a specific client, but I might do a writeup of it at some point) but at the same time, it's equally warranted to wonder whether the vast difference in reported results is down to what you suggest, or down to something you're doing differently with respect to how you're using these tools.

−

kordlessagain

You are using the wrong tools if you are getting crappy results. It’s like editing a photo with notepad, it’s possible but likely to fail.

−

alwahi

if claude generates the tests, runs those tests, applies the fixes without any oversight, it is a very "who watches the watchmen" situation.

−

vidarh

That is true, so don't give it entirely free reign with that. I let Claude generate as many additional tests as it'd like, but I either produce high level tests, or review a set generated by Claude first, before I let it fill in the blanks, and it's instructed very firmly to see a specific set of test cases as critical, and then increasingly "boxed in" with more validated test cases as we go along.

E.g. for my compiler, I had it build scaffolding to make it possible to run rubyspecs. Then I've had it systematically attack the crashes and failures mostly by itself once the test suite ran.

−

fragmede

Gemini?

−

gmb_uk

Good lord, that would be like the blind leading the daft.

−

ErikBjare

If you generate the tests, run those tests, apply fixes without any oversight, it is the very same situation. In reality, we have PR reviews.

−

skydhash

Is it? Stuff like ripgrep, msmpt,… are very much one-man project. And most packages on distro are maintained by only one person. Expertise is a thing and getting reliable results is what differentiates expert from amateurs.

−

fragmede

I had a highly repetitive task (/subagents is great to know about), but I didn't get more advanced than a script that sent "continue\n" into the terminal where CC was running every X minutes. What was frustrating is CC was inconsistent with how long it would run. Needing to compact was a bit of a curveball.

−

vidarh

The compaction is annoying, especially when it sometimes will then fail to compact with an error, forcing rewinding. They do need to tighten that up so it doesn't need so much manual intervention...

−

iLoveOncall

brands injecting themselves into conversations on Reddit, LinkedIn, and every other public forum.

Don't forget HackerNews.

Every single new release from OpenAI and other big AI firms attracts a lot of new accounts posting surface-level comments like "This is awesome" and then a few older accounts that have exclusively posted on previous OpenAI-related news to defend them.

It's glaringly obvious, and I wouldn't be surprised if at least a third of the comments on AI-related news is astroturfing.

−

jrflowers

I personally always love the “I wrote an entire codebase with claud” posts where the response to “Can we see it?” is either the original poster disappearing into the mist until the next AI thread or “no I am under an NDA. My AI-generated code is so incredible and precious that my high-paying job would be at risk for disclosing it”

−

fragmede

If anyone actually believed those requests to see code were sincere, or if they at least generated interesting discussion, people might actually respond. But the couple of times I've linked to a blog post someone wrote about their vibe-coding experience in the comments, someone invariably responds with an uninteresting shallow dismissal shitting all over the work. It didn't generate any interesting discussion, so I stopped bothering.

https://mitchellh.com/writing/non-trivial-vibing went round here recently, so clearly LLMs are working in some cases.

−

skydhash

And I think, in this blog post, the author stated that he does heavy editing of what’s generated. So I don’t know how much time is saved actually. You can get the same kind of inspiration from docs, books, or some SO answer.

−

habinero

Haters gonna hate, but the haters aren't always wrong. If you just want people to agree with you, that's not a discussion.

−

hkt

Honestly I've generated some big ISH codebases with AI and have said so and then backed off when asked.. because a) I still want to try to establish more confidence in the codebase and b) my employment contract gleefully states everything I write belongs to my employer. Both of those things make me nervous.

That said, I have no doubt there are also bots setting out to generate FOMO

−

ponector

Everything you wrote belongs to them. But it's not you, it's Claude is the author.

−

isodev

NDA on AI generated code is funny since model outputs are technically someone else’s code. It’s amazing how we’re infusing all kinds of systems with potential license violations

−

benibela

Someone posted these single file examples: https://github.com/joaopauloschuler/ai-coding-examples/tree/...

−

ponector

And they are usually 10x more productive as well!

−

fransje26

"This is awesome"

Or the "I created 30 different .md instruction files and AI model refactored/wrote from scratch/fixed all my bugs" trope.

a third of the comments on AI-related news is astroturfing.

I wouldn't be surprised if it's even more than that.. And, ironically, probably aided in their astroturfing, by the capability of said models to spew out text..

−

latexr

Sam Altman would agree with you that those posts are bots and lament it, but would simultaneously remain (pretend to be?) absurdly oblivious about his own fault in creating that situation.

https://techcrunch.com/2025/09/08/sam-altman-says-that-bots-...

−

loveparade

Beyond this, if you’re working on novel code, LLMs are absolutely horrible at doing anything. A lot of assumptions are made, non-existent libraries are used, and agents are just great at using tokens to generate no tangible result whatsoever.

Not my experience. I've used LLMs to write highly specific scientific/niche code and they did great, but obviously I had to feed them the right context (compiled from various websites and books convered to markdown in my case) to understand the problem well enough. That adds additional work on my part, but the net productivity is still very much positive because it's one-time setup cost.

Telling LLMs which files they should look at was indeed necessary 1-2 years ago in early models, but I have not done that for the last half year or so, and I'm working on codebases with millions of lines of code. I've also never had modern LLMs use nonexistent libraries. Sometimes they try to use outdated libraries, but it fails very quickly once they try to compile and they quickly catch the error and follow up with a web search (I use a custom web search provider) to find the most appropriate library.

I'm convinced that anybody who says that LLMs don't work for them just doesn't have a good mental model of HOW LLMs work, and thus can't use them effectively. Or their experience is just outdated.

That being said, the original issue that they don't always follow instructions from CLAUDE/AGENT.md files is quite true and can be somewhat annoying.

−

fnord123

Not my experience. I've used LLMs to write highly specific scientific/niche code and they did great, but obviously I had to feed them the right context (compiled from various websites and books convered to markdown in my case) to understand the problem well enough. That adds additional work on my part, but the net productivity is still very much positive because it's one-time setup cost.

Which language are you using?

−

loveparade

Rust, Python, and a bit of C++. Around 80% Rust probably

−

CuriouslyC

I've been genuinely surprised how well GPT5 does with rust! I've done some hairy stuff with Tokio/Arena/SIMD that I thought I would have to hand hold it through, and it got it.

−

loveparade

Yeah, it has been really good in my experience. I've done some niche WASM stuff with custom memory layouts and parallelism and it did great there too, probably better than I could've done without spending several hours reading up on stuff.

−

fnord123

Rust has been an outlier in my experience as well. I have a pet theory that it is due to rust code that's been pushed to github generally compiles. And if it compiles it generally works.

−

IX-103

It's pretty good at Rust, but it doesn't understand locking. When I tried it. It just put a lock on everything and then didn't take care to make sure the locks were released as soon as possible. This severely limited the scalability of the system it produced.

But I guess it passed the tests it wrote so win? Though it didn't seem to understand why the test it wrote where the client used TLS and the server didn't wouldn't pass and required a lot of hand holding along the way.

−

loveparade

I've experienced similar things, but my conclusion has usually been that the model is not receiving enough context in such cases. I don't know your specific example, but in general it may not be incorrect to put an Arc/Lock on many things at once (or using Arc isntead of Rc, etc) if your future plans are parallelize several parts of your codebase. The model just doesn't know what your future plans are, and in errs on the side of "overengineering" solutions for all kinds of future possibilities. I found that this is a bias that these models tend to have, many times their code is overengineered for features I will never need and I need to tell them to simplify - but that's expected. How would the model know what I do and don't need in the future without me giving all the right context?

The same thing is true for tests. I found their tests to be massively overengineered, but that's easily fixed by telling them to adopt the testing style from the rest of the codebase.

−

dvfjsdhgfv

we know that creating CLAUDE.md or cursorrules basically does nothing

While I agree, the only cases where I actually created something barely resembling useful (while still of subpar quality) was only after putting in CLAUDE.md lines like:

YOUR AIM IS NOT TO DELIVER A PROJECT. YOU AIM IS TO DO DEEP, REPETITIVE E2E TESTING. ONLY E2E TESTS MATTER. BE EXTREMELY PESSIMISTIC. NEVER ASSUME ANYTHING WORKS. ALWAYS CHECK EVERY FEATURE IN AT LEAST THREE DIFFERENT WAYS. USE ONLY E2E TESTS, NEVER USE OTHER TYPES OF TEST. BE EXTREMELY PESSIMISTIC. NEVER TRUST ANY CODE UNLESS YOU DEEPLY TEST IT E2E

REMEMBER, QUICK DELIVERY IS MEANINGLESS, IT'S NOT YOUR AIM. WORK VERY SLOWLY, STEP BY STEP. TAKE YOUR TIME AND RE-VERIFY EACH STEP. BE EXTREMELY PESSIMISTIC

With this kind of setup, it kind attempts to work in a slightly different way than it normally does and is able to build some very basic stuff although frankly I'd do it much better so not sure about the economics here. Maybe for people who don't care or won't be maintaining this code it doesn't matter but personally I'd never use it in my workplace.

−

habinero

My cynical working theory is this kind of thing basically never works but sometimes it just happens to coincide with useful code.

−

pandemic_region

omg imagine giving these instructions to a junior developer to accompany his task.

−

benibela

BE EXTREMELY PESSIMISTIC. NEVER ASSUME ANYTHING WORKS.

search for compiler bugs

−

moconnor

Genuinely interesting how divergent people's experiences of working with these models is.

I've been 5x more productive using codex-cli for weeks. I have no trouble getting it to convert a combination of unusually-structured source code and internal SVGs of execution traces to a custom internal JSON graph format - very clearly out-of-domain tasks compared to their training data. Or mining a large mixed python/C++ codebase including low-level kernels for our RISCV accelerators for ever-more accurate docs, to the level of documenting bugs as known issues that the team ran into the same day.

We are seeing wildly different outcomes from the same tools and I'm really curious about why.

−

hitarpetar

how did you measure your 5x productivity gain? how did you measure the accuracy of your docs?

−

vachina

You are asking it to do what it already knows, by feeding it in the prompt.

−

pancsta

Translation is not creation.

−

sorcercode

but genuinely. how many people are "creating", like truly novel stuff that someone hasn't thought out before?

I'd wager a majority of software engineers today are using techniques that are well established... that most models are trained on.

most current creation (IMHO) comes from wielding existing techniques in different combinations. which i wager is very much possible with LLMs

−

makingstuffs

and it adds _some_ value by thinking of edge cases I might’ve missed, best practices I’m unaware of, and writing better grammar than I do.

This is my most consistent experience. It is great at catching the little silly things we do as humans. As such I have found them to be most useful as PR reviewers which you take with a pinch of salt

−

cube00

It is great at catching the little silly things we do as humans.

It's great, some of the time, the great draw of computing was that it would always catch the silly things we do as humans.

If it didn't we'd change the change code and the next time (and forever onward) it would catch that case too.

Now we're playing wack-a-mole and pleading with words like "CRITICAL" and bold text to our in .cursorrules to try and make the LLM pay attention, maybe it works today, might not work tomorrow.

Meanwhile the C-suite pushing these tools onto us still happily blame the developers when there's a problem.

−

skydhash

It's great, some of the time, the great draw of computing was that it would always catch the silly things we do as humans.

People are saying that you should write a thesis-length file of rules, and they’re the same people balking at programming language syntax and formalism. Tools like linters, test runners, compilers are reliable in a sense that you know exactly where the guardrails are and where to focus mentally to solve an issue.

−

cube00

This repo^[1] is a brilliant illustration of the copium going into this.

Third line of the Claude prompt^[2]:

IMPORTANT: You must NEVER generate or guess URLs for the user - Who knew solving LLM hallucinations was just that easy?

IMPORTANT: DO NOT ADD ***ANY*** COMMENTS unless asked - Guess we need triple bold to make it pay attention now?

It gets even more ludicrous when you see the recommendation that you should use a LLM to write this slop of a .cursorrules file for you.

[1] https://github.com/x1xhlol/system-prompts-and-models-of-ai-t...

[2] https://github.com/x1xhlol/system-prompts-and-models-of-ai-t...

−

isodev

Coding with Claude feels like playing a slot machine. Sometimes you get more or less what you asked, sometimes totally not. I don’t think it’s wise or sane to leave them unattended.

−

WickyNilliams

Yes, and I think a lot of people are addicted to gambling. The dizzying highs when you win cloud out the losses. Even when you're down overall.

−

wkat4242

I found that using opus helps a lot. It's eyewateringly expensive though so I generally avoid it. I pay through the API calls because I don't tend to code much.

−

micoti

You are absolutely right!

−

isodev

That was a very robust and comprehensive comment

−

motorest

Beyond this, if you’re working on novel code, LLMs are absolutely horrible at doing anything. A lot of assumptions are made, non-existent libraries are used, and agents are just great at using tokens to generate no tangible result whatsoever.

That's not my experience at all. A basic prompt file is all it takes to cover each and any assumption you leave out from your prompts. Nowadays the likes of Copilot even provide support out of the box for instruction files, and you can create them with a LLM prompt too.

Sometimes I wonder what is the first-hand experience of the most vocal LLM haters out here. They seem to talk an awful lot about issues that feel artificial and not grounded in reality. It's like we are discussing that riding a bicycle is great, and these guys start ranting on how the biking industry is in a bubble because they don't even manage to stay up with side wheels on. I mean, have you bothered to work on the basics?

−

fho

On the ground, we know that creating CLAUDE.md or cursorrules basically does nothing.

I don't agree with this. LLMs will go out of their way to follow any instruction they find in their context.

(E.g. i have "I love napkin math" in my kagi Agent Context, and every LLM will try to shoehorn some kind of napkin math into every answer.)

Cursor and Co do not follow these instructions because they:

(a) never make it into the context in the first place, or (b) fall out of the context window.

−

peab

There's more money to be made right now in selling courses than actually using the LLM well. So these guys pretend that they found all these ways to make agents, and they market it and people buy the course

−

saltysalt

Nailed it. The other side of the marketing hype cycle will be saner, when the market forces sort the wheat from the chaff.

−

nunez

I'm shocked that this isn't talked about more. The pro-AI astroturfing done everywhere (well, HN and Reddit anyway) is out of this world.

−

apples_oranges

Too much money was invested, it needs to be sold.

−

nvarsj

My experience is kind of the opposite of what you describe (working in big tech). Like, I'm easily hitting 10x levels of output nowadays, and it's purely enabled by agentic coding. I don't really have an answer for why everyone's experience is so different - but we should be careful to not paint in broad strokes our personal experience with AI: "everyone knows AI is bad" - nope!

What I suspect is it _heavily_ depends on the quality of the existing codebase and how easy the language is to parse. Languages like C++ really hurts the agent's ability to do anything, unless you're using a very constrained version of it. Similarly, spaghetti codebases which do stupid stuff like asserting true / false in tests with poor error messages, and that kind of thing, also cause the agents to struggle.

Basically - the simpler your PL and codebase, the better the error and debugging messages, the easier it is to be productive with the AI agents.

−

thor-rodrigues

I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”

The simplest explanation would be “You’re using it wrong…”, but I have the impression that this is not the primary reason. (Although, as an AI systems developer myself, you would be surprised by the number of users who simply write “fix this” or “generate the report” and then expect an LLM to correctly produce the complex thing they have in mind.)

It is true that there is an “upper management” hype of trying to push AI into everything as a magic solution for all problems. There is certainly an economic incentive from a business valuation or stock price perspective to do so, and I would say that the general, non-developer public is mostly convinced that AI is actually artificial intelligence, rather than a very sophisticated next-word predictor.

While claiming that an LLM cannot follow a simple instruction sounds, at best, very unlikely, it remains true that these models cannot reliably deliver complex work.

−

sussmannbaka

What I want to see at this point are more screencasts, write-ups, anything really, that depict the entire process of how someone expertly wrangles these products to produce non-trivial features. There's AI influencers who make very impressive (and entertaining!) content about building uhhh more AI tooling, hello worlds and CRUD. There's experienced devs presenting code bases supposedly almost entirely generated by AI, who when pressed will admit they basically throw away all code the AI generates and are merely inspired by it. Single-shot prompt to full app (what's being advertised) rapidly turns to "well, it's useful to get motivated when starting from a blank slate" (ok, so is my oblique strategies deck but that one doesn't cost 200 quid a month).

This is just what I observe on HN, I don't doubt there's actual devs (rather than the larping evangelist AI maxis) out there who actually get use out of these things but they are pretty much invisible. If you are enthusiastic about your AI use, please share how the sausage gets made!

−

structural

These things are amazing for maintenance programming on very large codebases (think, 50-100million lines of code or more, the people who wrote the code no longer work there, it's not open source so "just google it or check stack overflow" isn't even an option at all.)

A huge amount of effort goes into just searching for what relevant APIs are meant to be used without reinventing things that already exist in other parts of the codebase. I can send ten different instantiations of an agent off to go find me patterns already in use in code that should be applied to this spot but aren't yet. It can also search through a bug database quite well and look for the exact kinds of mistakes that the last ten years of people just like me made solving problems just like the one I'm currently working on. And it finds a lot.

Is this better than having the engineer who wrote the code and knows it very well? Hell no. But you don't always have that. And at the largest scale you really can't, because it's too large to fit in any one person's memory. So it certainly does devolve to searching and reading and summarizing for a lot of the time.

−

Zababa

https://simonwillison.net/2025/Oct/8/claude-datasette-plugin...

−

sussmannbaka

this is definitely closer to what I had in mind but it's still rather useless because it just shows what winning the lottery is like. what I am really looking for is neither the "Claude oneshot this" nor the "I gave up and wrote everything by hand" case but a realistic, "dirty" day-to-day work example. I wouldn't even mind if it was a long video (though some commentary would be nice in that case).

−

Zababa

I don't think you should consider this as "winning the lottery", the author has been using these tools for a while.

The sibling comment with the writeup by the creator of Ghostty shows stuff in more detail and has a few cases of the agent breaking, though it also involves more "coding by hand".

−

nutjob2

I think the point is that you want to see typical results or process. How does it run when you use it 10 times, or 100 times, what results can you expect generally?

There's a lot of wishful thinking going around in this space and something more informative than cherrypicking is desperately needed.

Not least because lots of capable/smart people have no idea which way to jump when it comes to this stuff. They've trained themselves not to blindly hack solutions through trial and error but this essentially requires that approach to work.

−

Zababa

Yeah that's a good point and the sibling comment seems to be pointing in the same direction. You could take a look at Steve Yegge's beads (https://steve-yegge.medium.com/introducing-beads-a-coding-ag..., https://github.com/steveyegge/beads) but the writeup is not super detailed.

I think your last point is pretty important, that all that we see is done by experienced people, and that today we don't have a good way to teaching "how to effectively use AI agents" other than saying to people "use them a lot, apply software engineering best practices like testing". That is a big issue, compounded because that stuff is new, there are lots of different tools, and they evolve all the time. I don't have a better answer here than "many programmers that I respect have tried using those tools and are sticking with it rather than going back" (with exceptions, like Karpathy's nanochat), and "the best way to learn today is to use them, a lot".

As for "what are they really capable of", I can't give a clear answer. They do make easy stuff easier, especially outside of your comfort zone, and seem to make hard stuff come up more often and earlier (I think because you do stuff outside your comfort zone/core experience zone ; or because you know have to think more carefully about design over a shorter period of time than before with less direct experience with the code, kind of like in Steve Yegge's case ; or because when hard stuff comes up it's stuff they are less good at handling so that means you can't use them).

The lower bound seems to be "small CLI tool", the higher bound seems to be "language learning app with paid users (sottaku I think? the dev talks on twitter. Lots of domain knowledge in japanese here to check the app itself) ; implementing a model on pytorch by someone that didn't know how to code before (00000005 seconds or something like this on twitter, has used all these models and tools a lot); reporting security issues that were missed in cURL", middle bound "very experienced dev shipping a feature faster and while doing other things on a semi mature codebase (Ghostty)", middle bound too is "useful code reviews". That's about the best I can give you I think.

−

sussmannbaka

I'm not sure if you just didn't understand what I'm looking for. If I'm searching for a good rails screencast to get a feeling for how it's used, a blogpost consisting of "rails new" is useless to me. I know that these tools can oneshot tasks, but this doesn't help me when they can't.

−

fragmede

https://mitchellh.com/writing/non-trivial-vibing (not me)

−

skydhash

From the article

  Important: there is a lot of human coding, too. I almost always go in after an AI does work and iterate myself for awhile, too.

Some people like to think for a while (and read docs) and just write it right at the first go. Some people like to build slowly and get a sense of where to go at each steps. But in all of those steps, there’s an heavy factor of expertise needed from the person doing the work. And this expertise does not comes for free.

I can use agentic workflow fine and generate code like any other. But the process is not enjoyable and there’s no actual gain. Especially in an entreprise settings where you’re going to use the same stack for years.

−

Another theory: you have some spec in your mind, write down most of it and expect the LLM to implement it according to the spec. The result will be objectively a deviation from the spec.

Some developers will either retrospectively change the spec in their head or are basically fine with the slight deviation. Other developers will be disappointed, because the LLM didn't deliver on the spec they clearly hold in their head.

It's a bit like a psychological false memory effect where you misremember and/or some people are more flexibel in their expectations and accept "close enough" while others won't accept this.

At least, I noticed both behaviors in myself.

−

benashford

This is true. But, it's also true of assigning tasks to junior developers. You'll get back something which is a bit like what you asked for, but not done exactly how you would have done it.

Both situations need an iterative process to fix and polish before the task is done.

The notable thing for me was, we crossed a line about six months ago where I'd need to spend less time polishing the LLM output than I used to have to spend working with junior developers. (Disclaimer: at my current place-of-work we don't have any junior developers, so I'm not comparing like-with-like on the same task, so may have some false memories there too.)

But I think this is why some developers have good experiences with LLM-based tools. They're not asking "can this replace me?" they're asking "can this replace those other people?"

−

hitarpetar

They're not asking "can this replace me?" they're asking "can this replace those other people?"

In other words, this whole thing is a misanthropic fever dream

−

tyg13

Yeah, I see quite a lot of misanthropy in the rhetoric people sometimes use to advance AI. I'll say something like "most people are able to learn from their mistakes, whereas an LLM won't" and then some smartass will reply "you think too highly of most people" -- as if this simple capability is just beyond a mere mortal's abilities.

−

collingreen

misanthropic

I see what you did there

−

danaris

But a junior developer can learn and improve based on the specific feedback you give them.

GPT5 will, at least to a first approximation, always be exactly as good or as bad as it is today.

−

Jensson

They're not asking "can this replace me?" they're asking "can this replace those other people?"

People in general underestimate other people, so this is the wrong way to think about this. If it can't replace you then it can't replace other people typically.

−

scuff3d

This is a really short sighted way to look at things. Juniors become seniors. LLMs just keep hallucinating.

−

scuff3d

This implies that it executes the spec correctly, just not in a way that's expected. But if you actually look at how these things operate, that's flat out not true.

Mitchell Hashimito just did a write up about his process for shipping a new feature for Ghostty using AI. He clearly knows what he's doing and follows all the AI "best practices" as far as I could tell. And while he very clearly enjoyed the process and thinks it made him more productive, the post is also a laundry list of this thing just shitting the bed. It gets confused, can't complete tasks, and architects the code in ways that don't make sense. He clearly had to watch it closely, step in regularly, and in some cases throw the code out entirely and write it himself.

The amount of work I've seen people describe to get "decent" results is absurd, and a lot of people just aren't going to do that. For my money it's far better as a research assistant and something to bounce ideas off of. Or if it is going to write something it needs to be highly structured input with highly structured output and a very narrow scope.

−

lm28469

The simplest explanation would be...

The simplest explanation is that most of us are code monkeys reinventing the same CRUD wheel over and over again, gluing things together until they kind of work and calling it a day.

"developers" is such a broad term that it basically is meaningless in this discussion

−

mexicocitinluez

or, and get this, software development is an enormous field with 100s of different kinds of variations and priorities and use cases.

lol.

another option is trying to convince yourself that you have any idea what the other 2,000,000 software devs are doing and think you can make grand, sweeping statements about it.

there is no stronger mark of a junior than the sentiment you're expressing

−

lm28469

Well I know for a fact there are more code monkeys than rocket scientists working on advanced technologies. Just look at job offers really...

Anyone with any kind of experience in the industry should be able to tell that so idk where you're going with your "junior" comment. Technically I'm a senior in my company and I'm including myself in the code monkey category, I'm not working on anything revolutionary, as most devs are, just gluing things together, probably things that have been made dozens of times before and will be done dozens of time later... there is no shame in that, it's just the reality of software development. Just like most mechanics don't work on ferraris, even if mechanics working on ferraris do exist.

From my friends, working in small startups and large megacorps, no one is working on anything other than gluing existing packages together, a bit of es, a bit of postgres, a bit of crud, most of them worked on more technical things while getting their degrees 15 years ago than they are right now... while being in the top 5% of earners in the country. 50% of their job consist of bullshitting the n+1 to get a raise and some other variant of office politics

−

mexicocitinluez

From my friends, working in small startups and large megacorps, no one is working on anything other than gluing existing packages together,

And all my friends aren't doing that. So there's some anecdotal evidence to contradict yours.

And I think you're missing the point.

The point is the field is way bigger than either of us could imagine. You could have decades of experience and still only touch a small subset of the different technologies and problems.

Well I know for a fact there are more code monkeys than rocket scientists working on advanced technologies

I don't know what this means as it doesn't disprove that fact that the field is enormous. Of course not everyone is working on rockets. But that is irrelevant.

50% of their job consist of bullshitting the n+1 to get a raise and some other variant of office politics

Again, this doesn't mean we aren't working on different things.

I actually totally agree with this point made in your previous post:

"developers" is such a broad term that it basically is meaningless in this discussion

But your follow-up feels antogonistic to that point.

−

skydhash

It’s kinda like this when they think the software they use is mainstream and everything else is niche.

−

theshrike79

For every coder doing some cutting-edge Computer SCIENCE there are 99 people creating one more CRUD API Glue application or microservice.

I've been doing this for 25 years and everything I do can be boiled down to API Glue.

Stuff comes in, code processes stuff, stuff goes out. Either to another system or to a database. I'm not breaking new ground or inventing new algorithms here.

The basic stuff has been the same for two decades now.

Maybe 5% of the code I write is actually hard, like when the stuff comes in REAL fast and you need to do the processing within a time limit. Or you need to get fancy with PostgreSQL queries to minimise traffic from the app layer to the database.

With LLM assistance I can have it do the boring 95% of scaffolding one more FoobarController.cs , write the models and the entity framework definitions while I browse Hacker News or grab a coffee and chat a bit. Then I have more time to focus on the 5% as well as more time to spend improving my skills and helping others.

Yes. I read the code the LLM produces. I've been here for a long time, I've read way more code than I've written, I'm pretty good at it.

−

mexicocitinluez

I've been doing this for 25 years and everything I do can be boiled down to API Glue.

Oooof, and you still haven't learned how big this field is? Give me the ego of a software developer who thinks they've seen it all in a field that changes almost daily. Lol.

The basic stuff has been the same for two decades now.

hwut?

Maybe 5% of the code I write is actually hard, like when the stuff comes in REAL fast and you need to do the processing within a time limit

God, the irony in saying something like this and not having the self-awareness to realize it's actually a dig at yourself. hahahahaha

Congratulations on being the most lame software developer on this planet who has only found himself in situations that can be solved by building strictly-CRUD software. Here's to hoping you keep pumping out those Wordpress plugins and ecommerce sites.

I have 2 questions for you to ruminate on:

1. How many programming jobs have you had? 2. How many programming jobs exist in the entire world at this moment?

It's gotta be what, a million job difference? lol. But you've seen it all right? hahahazha

−

theshrike79

I didn't say that there aren't people doing cutting edge stuff.

But even John Romero did the boring stuff along with the cool stuff. Andrej Karpathy wrote a ton of boilerplate Python to get his stuff up and running^[0].

Or are you claiming that every single line of the nanochat^[0] project is peak computer science algorithms no LLM can replicate today?

Take the initial commit tasks/ directory for example^[1]. Dude is easily in the top 5 AI scientists in the world and he still spends a good time writing pretty basic string wrangling in Python.

My basic point here is that LLMs automate generating the boilerplate to a crazy degree, letting us spend more time in the bits that aren't boring and are actually challenging and interesting.

[0] https://github.com/karpathy/nanochat [1] https://github.com/karpathy/nanochat/tree/master/tasks

−

Ianjit

expect an LLM to correctly produce the complex thing they have in mind

My guess is that for some types of work people don't know what the complex thing they have in mind is ex ante. The idea forms and is clarified through the process of doing the work. For those types of task there is no efficiency gain in using AI to do the work.

−

danielbln

Why not? Just start iterating in chunks alongside the LLM and change gears/plan/spec as you learn more. You don't have to one-shot everything.

−

Ianjit

"Just start iterating in chunks alongside the LLM".

For those types of tasks it probably takes the same amount of time to form the idea without AI as with AI, this is what Metr found in its study of developer productivity.

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... https://arxiv.org/abs/2507.09089

−

danielbln

That study design has some issues. But let's say it takes me the same amount of time, the agentic flow is still beneficial to me. It provides useful structure, helps with breaking down the problem. I can rubber duck, send off web research tasks, come back to answer questions, etc., all within a single interface. That's useful to me, and especially so if you have to jump around different projects a lot (consultancy). YMMV.

−

Ianjit

"That study design has some issues. " This is a study that tries to be scientific, unlike developer self reports and CEO promises of 10x.

Can you point to a better study on the impact of AI on developer productivity? The only other one I can think of finds a 20% uplift in productivity.

https://www.youtube.com/watch?v=tbDDYKRFjhk

−

KronisLV

I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”

Some possible reasons:

  * different models used by different folks, free vs paid ones, various reasoning effort, quantizations under the hood and other parameters (e.g. samplers and temperature)
  * different tools used, like in my case I've found Continue.dev to be surprisingly bad, Cline to be pretty decent but also RooCode to be really good; also had good experiences with JetBrains Junie, GitHub Copilot is *okay*, but yeah, lots of different options and settings out there
  * different system prompts, various tool use cases (e.g. let the model run the code tests and fix them itself), as well as everything ranging from simple and straightforward codebases that are dime a dozen out there (and in the training data), vs something genuinely new that would trip up both your average junior dev, as well as the LLMs
  * open ended vs well specified tasks, feeding in the proper context, starting new conversations/tasks when things go badly, offering examples so the model has more to go off of (it can predict something closer to what you actually want), most of my prompts at this point are usually multiple sentences, up to a dozen, alongside code/data examples, alongside prompting the model to ask me questions about what I want before doing the actual implementation
  * also sometimes individual models produce output for specific use cases badly, I generally rotate between Sonnet 4.5, Gemini Pro 2.5, GPT-5 and also use Qwen 3 Coder 480B running on Cerebras for the tasks I need done quickly and that are more simple

With all of that, my success rate is pretty great and the statement about the tech not being able to "...barely follow a simple instruction" holds untrue. Then again, most of my projects are webdev adjacent in mostly mainstream stacks, YMMV.

−

surgical_fire

Then again, most of my projects are webdev adjacent in mostly mainstream stacks

This is probably the most significant part of your answer. You are asking it to do things for which there are a ton of examples of in the training data. You described narrowing the scope of your requests too, which tends to be better.

−

lnrd

I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”

My hypothesis is that developers work on different things and while these models might work very well for some domains (react components?) they will fail quickly in others (embedded?). So one one side we have developers working on X (LLM good at it) claiming that it will revolutionize development forever and the other side we have developers working on Y (LLM bad at it) claiming that it's just a fad.

−

h33t-l4x0r

I think this is right on, and the things that LLM excels at (react components was your example) are really the things that there's just such a ridiculous amount of training data for. This is why LLMs are not likely to get much better at code. They're still useful, don't get me wrong, but they 5x expectations needs to get reined in.

−

danielbln

A breadth and depth of training data is important, but modern models are excellent at in-context learning. Throw them documentation and outline the context for what they're supposed to do and they will be able to handle some out-of-distribution things just fine.

I would love to see some detailed failure cases of people who used agentic LLMs and didn't make it work. Everyone is asking for positive examples, but I want to see the other side.

−

wiether

Also the variation is the focus of each person.

Based on my own personal experience:

- on some topics, I get the x100 productivity that is pushed by some devs; for instance this Saturday I was able to make two features that I was reschudeling for years because, for lack of knowledge, it would have taken me many days to make them, but a few back and forth with an LLM and everything was working as expected; amazing!

- on other topics, no matter how I expose the issue to an LLM, at best it tells me that it's not solvable, at worst they try to push an answer that doesn't make any sense and push an even worst one when I point it out...

And when people ask me what I think about LLM, I say : "that's nice and quite impressive, but still it can't be blindly trusted and needs a lot of overhead, so I suggest caution".

I guess it's the classic half empty or half full glass.

−

Zababa

I've known lots of people that don't know how to properly use Google, and Google has been around for decades. "You're using it wrong" is partially true, I'd say more something like "it is a new tool that changes very quickly, you have to invest a lot of time to learn how to properly use it, most people using it well have been using it a lot over the last two years, you won't catch up in an afternoon. Even after all that time, it may not be the best tool for every job" (proof on the last point being Karpathy saying he wrote nanochat mostly by hand).

It is getting easier and easier to get good results out of them, partially by the models themselves improving, partially by the scaffolding.

non-developer public is mostly convinced that AI is actually artificial intelligence, rather than a very sophisticated next-word predictor

This is a false dichotomy that assumes we know way more about intelligence than we actually do, and also assumes than what you need to ship lots of high quality software is "intelligence".

While claiming that an LLM cannot follow a simple instruction sounds, at best, very unlikely, it remains true that these models cannot reliably deliver complex work.

"reliably" is doing a lot of work here. If it means "without human guidance" it is true (for now), if it means "without scaffolding" it is true (also for now), if it means "at all" it is not true, if it means it can't increase dev productivity so that they ship more at the same level of quality, assuming a learning period, it is not true.

I think those conversations would benefit a lot from being more precise and more focused, but I also realize that it's hard to do so because people have vastly different needs, levels of experience, expectations ; there are lots of tools, some similar, some completely different, etc.

To answer your question directly, ie “Why do LLM experiences vary so much among developers?”: because "developer" is a very very very wide category already (MISRA C on a car, web frontend, infra automation, medical software, industry automation are all "developers"), with lots of different domains (both "business domains" as in finance, marketing, education and technical domains like networking, web, mobile, databases, etc), filled with people with very different life paths, very different ways of working, very different knowledge of AIs, very different requirements (some employers forbid everything except a few tools), very different tools that have to be used differently.

−

ehnto

Well we are all doing different tasks on different codebases too. It's very often not discussed, even though it's an incredibly important detail.

But the other thing is that, your expectations normalise, and you will hit its limits more often if you are relying on it more. You will inevitably be unimpressed by it, the longer you use it.

If I use it here and there, I am usually impressed. If I try to use it for my whole day, I am thoroughly unimpressed by the end, having had to re-do countless things it "should" have been capable of based on my own past experience with it.

−

mexicocitinluez

Well we are all doing different tasks on different codebases too. It's very often not discussed, even though it's an incredibly important detail.

Absolutely nuts I had to scroll down this far to find the answer.Totally agree.

Maybe it's the fact that every software development job has different priorities, stakeholders, features, time constraints, programming models, languages, etc. Just a guess lol

−

yoyohello13

I'm convinced it's not the results that are different, it's the expectations.

The SVP of IT for my company is 100% in on AI. He talks about how great it is all the time. I just recently worked on a legacy project in PHP he build years ago, and now I know his bar for what quality code looks like is extremely low...

I use LLMs daily to help with my work, but I tweak the output all the time because it doesn't quite get it right.

Bottom line, if your code is below average AI code will look great.

−

nasmorn

That seems to be the pitch too. I now get google ads where they advertise you can ask your phone things about what it sees. All the examples are so trivial, I can’t believe it. How to make a double espresso. What are these clouds called?

That’s being not a complete idiot as a service.

If it was at least how do I start the decalcification process on this machine so it actually realizes it and turns the service light off.

−

tovej

I would say they can't reliably deliver simple work. They often can, but reliability, to me, means I can expect it to work every time. Or at least as much as any other software tool, with failure rates somewhere in the vicinity of 1 in 10^5, 1 in 10^6. LLMs fail on the order of 1 in 10 times for simple work. And rarely succeed for complex work.

That is not reliable, that's the opposite of reliable.

−

krisoft

One has to look at the alternatives. What would i do if not use the LLM to generate the code? The two answers are “coding myself”, “asking an other dev to code it”. And neither of those approach anywhere a 10^5 failure rate. Not even close.

−

logicchains

I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”

Two of the key skills needed for effective use of LLMs are writing clear specifications (written communication), and management, skills that vary widely among developers.

−

skydhash

There’s no clearer specifications than code, and I can manage my toolset just fines (lines of config, alias, and what not to make my job easier). That allowed me to deliver good results fine and fast without worrying if it’s right this time

−

netdevphoenix

Why do LLM experiences vary so much among developers?

The question assumes that all developers do the same work. The kind of work done by an embedded dev is very different from the work of a front-end dev which is very different from the kind of work a dev at Jane Street does. And even then, devs work on different types of projects: greenfield, brownfield and legacy. Different kind of setups: monorepo, multiple repos. Language diversity: single language, multiple languages, etc.

Devs are not some kind of monolith army working like robots in a factory.

We need to look at these factors before we even consider any sort of ML.

−

nunez

This is very similar to Tesla's FSD adoption in my mind.

For some (me), it's amazing because I use the technology often despite its inaccuracies. Put another way, it's valuable enough to mitigate its flaws.

For many others, it's on a spectrum between "use it sometimes but disengage any time it does something I wouldn't do" and "never use it" depending on how much control they want over their car.

In my case, I'm totally fine handing driving off to AI (more like ML + computer vision) most times but am not okay handing off my brain to AI (LLMs) because it makes too many mistakes and the work I'd need to do to spot-check them is about the same as I'd need to put in to do the thing myself.

−

impossiblefork

It's true though, they can't. It really depends on what they have to work with.

In the fixed world of mathematics, everything could in principle be great. In software, it can in principle be okay even though contexts might be longer. When dealing with new contexts in something like real life, but different-- such as a story where nobody can communicate with the main characters because they speak a different language, then the models simply can't deal with it, always returning to the context they're familiar with.

When you give them contexts that are different enough from the kind of texts they've seen, they do indeed fail to follow basic instructions, even though they can follow seemingly much more difficult instructions in other contexts.

−

ninetyninenine

It’s because people are using different tiers of AI and different models. And many people don’t stick with it long enough to get a more nuanced outlook of AI.

Take Joe. Joe sticks with AI and uses it to build an entire project. Hundreds of prompts. Versus your average HNer who thinks he’s the greatest programmer in the company and thinks he doesn’t need AI but tries it anyway. Then AI fails and fulfills his confirmation bias and he never tries it again.

−

XenophileJKO

At some point people won't care to convince you and you will be left to adapt or fade away.

−

danielbln

That's where I stand now. I use LLMs in some agentic coding way 10h/day to great avail. If someone doesn't see or realize the value, then that's their loss.

−

gf000

Probably a good chunk of the differences in experience is this: https://news.ycombinator.com/item?id=45573521

[..] possibly the repo is too far off the data distribution.

(Karpathy's quote)

−

izacus

My experience is that people clamming they use AI exclusively are usually trying to sell an AI product.

−

smt88

I sometimes meet devs who are "using it wrong" with under-baked prompts.

But mostly my experience is that people who regularly get good output from AI coding tools fall into these buckets:

A) Very limited scope (e.g. single, simple method with defined input/output in context)

B) Aren't experienced enough in the target domain to see the problems with the AI's output (let's call this "slop blindness")

C) Use AI to force multiple iterations of the same prompt to "shake out the bugs" automatically instead of using the dev's time

I don't see many cases outside of this.

−

JamesSwift

D) Understand and use context creatively. They know when to start new conversations, and how to use the filesystem as context storage.

−

smt88

Yeah except I do that with Claude Code, and my output is still shit most of the time. It might save me a little time or typing, but it definitely needs hand-editing. The people who say Claude is a junior dev (at best) are right.

That's why I think a lot of people who think it's a miracle probably aren't experienced enough to see the bugs.

−

bsder

B) Aren't experienced enough in the target domain to see the problems with the AI's output (let's call this "slop blindness")

Oh, boy, this. For example, I often use whatever AI I have to adjust my Nix files because the documentation for Nix is so horrible. Sure, it's slop, but it gets me working again and back to what I'm supposed to be doing instead of farting with Nix.

I would also argue:

D) The fact that an AI can do the task indicates that something with the task is broken.

If an AI can do the task well, there is something fundamentally wrong. Either the abstractions are broken, the documentation is horrible, the task is pure boilerplate, etc.

−

theshrike79

It's a personality thing.

I know Car People who refuse to use even lane keeping assist, because it doesn't fit their driving style EXACTLY and it grates them immensely.

I on the other hand DGAF, I love how I don't need to mess with micro adjustments of the steering wheel on long stretches of the road, the car does that for me. I can spend my brainpower checking if that Gray VW is going to merge without an indicator or not.

Same with LLM, some people have a very specific style of code they want to produce and anything that's not exactly their style is "wrong" and "useless". Even if it does exactly what it should.

−

sorcercode

when we are yet to confidently have any model complete a single simple instruction???

i understand the author might be a little frustrated and employing hyperbole here, but are most folks genuinely having similar problems?

at this point I have found LLMs to more often than not follow my instructions. It requires diligent pruning of instructions, effective prompting and planning. but once you get a sense of how to do those three things, it's possible to fly with these coding agents.

it does get it wrong occasionally but anecdotally this is like 1/10 in my experience. and interrupting and course-correcting quickly gets me right back on track.

I'm just surprised at the skepticism at the usefulness of these tools from the HN comments. there's plenty of reasons to be worried and upset (cost, job transformation and displacement etc) but the effectiveness of coding agents being a common theme, in the comments here as well, is surprising to me.

−

bradfa

My experience has been that if you take the time to explain what the current state is, what your desired state should be, and to give information on how you want the agent to proceed, that then you can work with the agent to craft a plan, refine the plan, and finally execute the plan. In this mode of operation, the current state of the art is quite impressive.

You can't just give it a single sentence and expect it to do something complex correctly. It takes real effort and human time, just like if you were trying to get a smart and capable intern who has no real world experience to do something technical correctly. Just the AI agents work significantly faster than a human intern.

−

qazxcvbnmlp

My experience has been that if you take the time to explain what the current state is, what your desired state should be, and to give information on how you want the agent to proceed,

I have a pet theory. 1. This skill requires a strong theory of mind^[1]. 2. Theory of mind is more difficult in those with autism 3. The same autism that makes people really good at coding, and gives them the time to post on online forms like hn, makes it hard to understand how to work with LLMs and how others work with llms.

To provide good context to the llm you need to have a good understanding of (1) what it will and will not know, (2) what you know and take for granted(ie a theory of your own mind) (3) what your expectations are. None of this you need to do when you are coding on your own, but are critical on getting a good response from the LLM.

See also the black and white thinking that is common in the responses on articles like this.^[2]

[1]https://en.wikipedia.org/wiki/Theory_of_mind [2]https://www.simplypsychology.org/black-and-white-thinking-in...

−

saulpw

An LLM has no mind! What is your strong theory of mind for an LLM? That it knows the whole internet and can regurgitate it like a mindless zombie?

−

kelseyfrog

Source?

It sounds like an unprovable metaphysical statement than something that is supported by scientific evidence.

−

fao_

The burden of proof is on people stating that an AI has a theory of mind, not on the reverse. Until recently it was highly debated on if dogs have theory of mind, and it took decades of evidence to come to the conclusion that yes, they do.

−

kelseyfrog

The burden of proof is on the person making the claim. It doesn't matter whether the claim is positive or negative. The default position is "We don't know if AI has a ToM."

−

sirtaj

Am I incorrect in thinking this is as much true of the linux kernel or emacs as it is of an LLM?

−

judahmeek

GGP didn't say that AI has a theory of mind. GGP said that using AI productively requires a theory of mind, a.k.a. being able to build a mental model of the LLM's context.

−

rmwaite

If you read carefully you will see that they never said AI has a theory of mind.

−

xg15

Whether or not it has a mind is irrelevant to the problem. I think the point is, if you pretend it had a mind and write your prompt accordingly, you will get the best results.

−

bigchillin

That simply psych article is a psyop

−

imtringued

That HN username is also bad news. Meanwhile yours is pretty cool. I really enjoyed the social credit memes with John Cena.

How exactly do you come up with a pet theory out of nowhere, randomly diagnose people on the internet with autism based on how they use LLMs and then start linking to a most likely AI generated blog post (there was simply too much repetition) that ascribes a lot of negative attributes to them with a username that is meant to be unrecognisable.

The post is basically a Kafka trap or engagement bait.

−

wild_egg

This actually makes a disturbing amount of sense and I think I'm going to need to chew on it for a while. Thanks for sharing!

−

bigstrat2003

My experience has been that if you take the time to explain what the current state is, what your desired state should be, and to give information on how you want the agent to proceed, that then you can work with the agent to craft a plan, refine the plan, and finally execute the plan.

People say this a lot, and I'm not even saying you're wrong. But that isn't useful to me. In the time it takes me to do all that, I can just solve the problem myself. If I have to hold its hand through finding a solution, then it is a time suck, not a time saver.

−

drcxd

Agreed. I also have to check if it has implemented the idea correctly.

If my workflow is;

1. Write documentation so that the problem and even the solution to the problem is well explained. 2. Instruct coding agents to work as the document described. 3. Check its if its implementation is correct, and improve its implementation if necessary.

I feel the experience is not as good as me implementing the solution myself, and it may even take more time.

−

bradfa

There are definitely times when it’s not faster to use the tool to do the full job. But sometimes just using the tool to plan the job helps to clarify the task so a human can do it better/faster.

But then there’s also tasks where using the tool is a HUGE speed up.

−

credit_guy

Look at it differently: getting to use AI so it is a productivity multiplier and not a productivity sink is hard. It takes a lot of work. Or rather, a lot of experimentation. You will generate some AI slop, you will annoy some people, you will embarrass yourself, you will retrace your steps, etc, etc. It is hard. Ok, let me say that again. It is hard. It is not easy. It is a tool with a steep learning curve.

It is ok to decide not to use such a tool. But you will be left behind.

−

andoando

You can literally write "Add a feature on the UI where we get a live update of new posts using a websocket connection in the backend server at /app/backend."

And it will it will integrate websockets in your UI, backend and create the models, service logic, etc in under 20 seconds. Can you really do that?

−

mcv

That's boilerplate stuff. That's what it's best at. But the moment I want something slightly off the beaten path (and I always do), it struggles and makes mistakes.

−

satvikpendem

that then you can work with the agent to craft a plan, refine the plan, and finally execute the plan

And Cursor just introduced a separate plan mode themselves, so it gets even better.

−

theshrike79

People don't understand that LLMs aren't humans. There's a lot of implicit context when humans are communicating. LLMs don't do that.

They do have biases, like if you tell them to do something with data, they'll pretty likely grab Python as the tool.

And different models have different biases and styles, you can try to guide them to your specific style with prompts, but it doesn't always work - depending on how esoteric your personal style is.

−

bradfa

Imagining the tool is like a college intern helps me. It has no idea how the real world works. It blindly follows things it previously found online. It’s great at very common boilerplate coding tasks. But it’s super naive and will need hand holding or you to provide a huge amount of context for it so it can operate on its own.

I’m still very much learning how to give it good instructions to accomplish tasks. Different tasks require different types and methods of instruction. It’s extremely interesting to me. I’m far from an expert.

−

theshrike79

I imagine LLMs as an endless stream of consultants, each can work only one day (context).

Every day you need to bring them up to speed (prompt, accessible documentation) and give them the task of the day. If it looks like they can't finish the task (context runs out), you need to tell them to write down where they left (store context to a memory, markdown file is fine) and kick them out the door.

Then GOTO 10, get the next one in.

−

stpedgwdgfhgdd

The person in the Cursor platform is raising a different question and a valid one. We have tons of these frameworks out there, openspec, amplifier, etc. The ultimate dream is to have these subagents work in the background autonomously.

However reality tells us that you constantly have to keep Claude on the right track. Nudge here, nudge there. Close code reviews. Test, test more. Very interactive. Superpowers to the engineer.

It is this contradiction that also makes me believe that it will take another year for agents to work on enterprise codebases autonomously. Maybe more, look at autonomous self driving, surprisingly hard to get to the last 10%.

−

mfdupuis

I think this is the challenge and the dissonance. For something to truly run autonomously you need to provide it some many constraints that it almost loses its usefulness. I've tried using AI, or at least looked into what I could use AI for to automate marketing tasks and I just don't think I can seriously set up a workflow in n8n or AgentKit that would produce sufficiently good results without me jumping in. That said, AI is incredibly helpful in this semi-autonomous mode with the right parameters, to the point of the parent comment.

−

sharts

Moreover they’re not even that great as a search tool. Often just giving incorrect or outdated synthesized results. Marginally better than a raw google search because I can skip all the sponsored/SEO hack results with garbage info.

−

mcv

More often than not is not good enough for me. But worse than not following instructions, is simply being wrong, and then doubling down on it.

This is something that happens to me a lot. When I ask it to do something moderately complex, or to analyse an error I get, quite often it leaps to the wrong conclusion and keeps doubling down on it when it doesn't work, until I tell it what the actual problem is.

It's great with simple boilerplate code, and I have actually used it to implement a new library successfully with a little bit of feedback from me, but it gets stuff wrong so often that I really don't need it doing anything beyond that. Although when I'm stuck, I still use it to spew ideas. Even if they're wrong, they can help me get going.

−

sorcercode

i'm genuinely curious - what do your prompts look like? do you have agent instructions for your repo that you've spent a little time pruning or maintaining? when you execute tasks are you planning these out or one-shotting them?

i ask because your comment is sounding like more often than not, you're actually getting *worse* results? (if i'm reading that right). and i want to understand is it a perception problem (like do you just have a higher bar for what you expect from a coding agent); or is it actually producing bad results, in which case i want to understand what's different in the ways we're using these agents.

(also can you provide a concrete example of how it got something wrong? like what were you asking in that moment and what did it do, and how did it double down.)

−

mcv

A few weeks ago, I was testing the difference between two libraries to draw graphs: cytoscape and NVL. I started with NVL, told it to implement a very basic case: just some simple mock data, and draw the thing on the screen. (Those weren't my prompts; I didn't save those.) I also took the advice of using a context prompt file to lay down coding standards.

It didn't get it right in one go, but after 7 attempts of analysing the error, it suddenly worked, and I was quite amazed. Step by step adding more features was terrible, however; it kept changing the names of functions and properties, and it turned out that was because I was asking for features the library didn't support, or at least didn't document. I ended up rejecting NVL because it was very immature, poorly documented, and there weren't many code examples available. I suspect that's what crippled the AI's ability to use it. But at no point did it tell me I was asking the impossible; it happily kept trying stuff and inventing nonexistent APIs until I was the one to discover it was impossible.

I'm currently using it to connect a neo4j backend to a cytoscape frontend; both of those are well established and well documented, but it still struggles to get details right. It can whip up something that almost works really quickly, but then I spend a lot of time hunting down little errors it made, and they often turn out to be based on a lack of understanding of how these systems work. Just yesterday it offered 4 different approaches to solve a problem, one of which blatantly didn't work, one used a library that didn't exist, one used a library that didn't do what I needed, and only one was a step in the right direction but still not quite correct in its handling of certain node properties.

I'm using Claude 3.7 thinking and Claude 4 for this.

−

hattmall

Could you, or someone, make a video showing a real life coding scenario where an agentic system or just LLM provided significant value? That, to me, would be beyond the value one would get or could have previously gotten from a Google search or stack overflow post?

−

pas

I asked it to generate a simple EffectTS script and it did. (It was my first piece of code using Effect, but I know a bit of Scala ZIO, so it was helpful to get started.)

Effect's docs are quite sparse/terse.

And this is usually what I found, it's good when I have a blank page. (Adding new functionality to a legacy system for example.)

But I get upset in no time when I can't really give feedback to it, as the only options are accepting/rejecting the diff or getting stuck in editing manually, but it's in some strange limbo, so usually IDE tools don't work well, and you can't chat with it about the diff, only about what it already thinks are in the files.

(And while this seems like "just" a tooling issue, fundamentally the whole frenzy soured me, I tried Cursor/Windsurf/ClaudeCode and now I'm out of fucks to give ... even though chatting about solutions and writing Markdown docs would be great for everyone, for the projects I work on, etc.)

−

geldedus

no I won't. The more people would think agentic-AI-assisted programming doesn't work the more I will have a huge advantage over them

−

theshrike79

I just literally sat down on my computer after walking the dogs.

During my walk Codex (gpt-5-high) rewrote a Python application to Go in one shot. I sat down, tested and it works. Except now I can just distribute a single binary instead of a virtualenv mess =)

Maybe the instruction was SUPER simple? Dunno.

−

bn-l

Maybe the code and task was super simple. Maybe the whole rewrite was already in its training data (enough) in some way

−

theshrike79

Naturally it wasn't breaking new ground on computer science :)

It was a "script" (application?, the line is vague) that reads my Obsidian movie/anime/tv watchlist markdown files from my vault, grabs the title and searches for that title in Themoviedb

If there are multiple matches, it displays a dialog to pick the correct one.

Then it fills the relevant front matter, grabs a cover for Obsidian Bases and adds some extra info on the page for that item.

The Python version worked just fine, but I wanted to share the tool and I just can't be arsed to figure that out with Python. With Go it's a single binary.

Without LLM assistance I could've easily spent a few nights doing this, and most likely would've just not done it.

Now it was the effort of me giving the LLM a git worktree to safely go wild in (I specifically said that it can freely delete anything to clean up old crap) and that's what it did.

And this isn't the first Python -> Go transition I've done, I did the same for a bunch of small utility scripts when GPT3/3.5 was the new hotness. Wasn't as smooth then (many a library and API was hallucinated), but still markedly faster than doing it by hand.

−

hattmall

Now, that's actually a very reasonable usage, because it isn't coding, it's translation. The counter to that being beneficial is that many code translation programs already exists and it's certainly something that can be built and done so in a way to be certain of security, proper practices etc, which you can't guarantee with LLM and using a specialized program is orders of magnitude less resource intensive.

Of course it's nice as a hobbyist end user to do exactly what you did for a simple script and that's to the credit of the LLM. The over-arching issue is that extremely inefficient process is only possible thanks to subsidization from Venture capital.

−

theshrike79

I personally prefer Go with LLMs because it has a relatively large amount of analyzers and other tooling to statically check that there are no major issues with the code.

Also the compiler being a stickler for unused code etc. keeps the Agentic models in check, they can't YOLO stuff as hard like in, say, Python.

−

YouAreWRONGtoo

Millions of compilers have been written and those languages have simple grammars. As such, those are trivial problems.

−

theshrike79

Trivial with an LLM yes, but in reality nobody will bother with a full rewrite in another language just for fun if the original program works, but is a bit annoying to run.

This way it's pretty close to zero effort.

−

nasmorn

Rewriting something in another language is super simple. The original code has all the context. Even simple for a human but takes time. I literally did the same thing recently and had Claude rewrite a simple Python lib to elixir so I don’t need to run Python somewhere. Wasn’t perfect actually but very easy to fix the issues.

−

Yondle

I've had the exact same reaction and figured that devs like to say that AI is useless are either: trying to make themselves feel smarter by trying to say that what they are working on is beyond anything that AI has been trained on, OR, aren't able to effectively breakdown the problems they have into smaller digestible chunks.

−

frogperson

That just sounds like probablistic programming to me. Call me old fashioned, but i still prefer my code to be deterministic.

−

thfuran

"More often than not" is zero 9s.

−

true_religion

I'm not sure what people are considering instructions but it talks about the topics that I tell it to talk about, and when parsing prose it will take specific instruction as to word choice, or tone.

This is true across Gemini, ChatGPT, and Qwen.

−

Julien_r2

I actually hope to find better answers here than on cursor forum where people seems to be basically saying "it's you fault" instead of answering the actual question which is about trust, process, and real world use of agents..

So far it's just reinforcing my feeling that none of this is actually used at scale.. We use AI as relatively dumb companions, let them go wilder on side projects which have loser constraints, and Agent are pure hype (or for very niche use cases)

−

zwnow

Exactly, the actual business value is way smaller people think and its honestly frustrating. Yes they can write boilerplate, yes they sometimes do better than humans in well understood areas. But its negligible considering all the huge issues that come with them. Big tech vendorlocks, data poisoning, unverifiable information, death of authenticity, death of creativity, ignorance of LLM evangelists, power hungriness in a time where humanity should look at how to decrease emissions, theft of original human work, theft of data big tech gets away with since way too long. Its puzzling to me how people actually think this is a net benefit to humanity.

−

Zababa

Most of the issues you listed are moral and not technical. Especially "power hungriness in a time where humanity should look at how to decrease emissions", this may be what you think humanity should do but that is just that, what you think.

I derive a lot of business value from them, many of my colleagues do too. Many programmers that were good at writing code by hand are having lots of success with them, for example Thorsten Ball, Simon Willison, Mitchell Hashimoto. A recent example from Mitchell Hashimoto: https://mitchellh.com/writing/non-trivial-vibing.

Its puzzling to me how people actually think this is a net benefit to humanity.

I've used them personally to quickly spin up a microblog where I could post my travel pictures and thoughts. The idea of making the interface like twitter (since that's what I use and know) was from me, not wanting to expose my family and friends to any specific predatory platform like twitter, instagram, etc was also from me, supabase as the backend was from a colleague (helped a lot!), the code was all Claude. The result is that they were able to enjoy my website, including my grandparents that just had to paste an URL on the website. I like to think of it a a perhaps very small but net benefit for a very small part of humanity.

−

danaris

Is it a moral judgement to say that when the stove is on fire, we shouldn't be pouring more grease on it?

Is it a moral judgement to say that you shouldn't pick up a bear cub with its mother nearby?

If neither of these are moral judgements, then why would it be a moral judgement to say that humanity should be seeking to reduce its emissions? Just because you personally don't like it, and want to keep doing whatever you like?

−

Zababa

Pouring grease on a fire will make it worse. Picking up a bear cub when the mother is nearby will increase (by how much I don't know) the risks of getting attacked by a bear. Both of those sentences to me sound like genuine description of the reality we live in. Climate change is real and caused by human emissions is another one of those, though a bit less precise as "climate change" is less precise. Saying we should or shouldn't do something is something different.

Also, you can increase power capacity by a lot while reducing emissions, with stuff like solar panels or nuclear power.

−

hansmayer

I've used them personally to quickly spin up a microblog where I could post my travel pictures and thoughts.

Sorry but this a perfect example of the typical demographic that currently boosts the usage of these non-tools: trivial, almost unnecessary use-cases, of service to no-one but self (maybe friends and family too). You could have also spun up a simple microblog on one of the many blogging platforms, with trivial UI complexity, low costs and much smaller environmental impact.

−

hitarpetar

if climate change doesn't matter why the hell should anyone care about your vibe coded personal twitter clone?

−

theshrike79

What if I'm vibe engineering a solution to global warming? Does it cancel itself out?

−

hitarpetar

−

zwnow

So moral issues are not relevant? Typical tech enthusiast mindset unfortunately...

−

Huppie

The best I can tell you (from working with LLM's) is that... it's complicated.

There are moments where spending 10 min on a good prompt saves me 2hrs of typing and it finishes that in the time it takes me to go make myself a cup of coffee (~10 min) Those are the good moments.

Then there are moments where it's more like 30 min savings for 10 min of prompting. Those are still pretty good.

Then there are plenty of moments where spending 10 mins on a prompt saves me about 15mins of work. But I have to wait 5 mins for the result, so it ends up being a wash except it has a downside that I didn't really write it myself so the actual details of the solution aren't fully internalized.

There's also plenty of moments where the result at first glance looks like a good / great result but once I start reviewing and fixing things it still ends up being a wash.

I find it actually quite difficult to determine the result quality because at first glance it always looks pretty decent, and then sometimes once you start reviewing it's indeed the case and other times I'm like "well it needs some tweaking" and subsequently spend an hour tweaking.

Now I think the problem is that the response is akin to gambling / conditioning in a sense. Every prompt has a smallish chance to trigger a great result, and since the average result is still about 25% faster (my gut feeling based on what I've 'written' the last few months working with Claude Code) it's just very tempting to pull that slot machine lever even in tasks that I know I will most likely type faster than I can prompt.

I did find a place where (to me, at least) it almost certainly adds value: I find it difficult to think about code during meetings (I really need my attention in the meetings I do) but I can send a few quick prompts for small stuff during meetings and don't really have to context switch. This alone is a decent productivity booster. Refactorings that would've been a 'maybe, one day' can now just be triggered. Best case I spend 10 minutes reviewing and accept it. Worst case I just throw it away.

−

sdoering

What specific improvements are you hoping for? Without them (in the original forum post) giving concrete examples, prompts, or methodology – just stating "I write good prompts" – it's hard to evaluate or even help them.

They came in primed against agentic work flow. That is fine. But they also came in without providing anything that might have given other people the chance to show that their initial assumptions was flawed.

I've been working with agents daily for several months. Still learning what fails and what works reliably.

Key insights from my experience: - You need a framework (like agent-os or similar) to orchestrate agents effectively - Balance between guidance and autonomy matters - Planning is crucial, especially for legacy codebases

Recent example: Hit a wall with a legacy system where I kept maxing out the context window with essential background info. After compaction, the agent would lose critical knowledge and repeat previous mistakes.

Solution that worked: - Structured the problem properly - Documented each learning/discovery systematically - Created specialized sub-agents for specific tasks (keeps context windows manageable)

Only then could the agent actually help navigate that mess of legacy code.

−

rafaelmn

So at what point are you doing more work on the agent than working on the code directly ? And what are you losing in the process of shifting from the code author to LLM manager ?

My experience is that once I switch to this mode when something blows up I'm basically stuck with a bunch of code that I sort of know, even tough I reviewed it. I just don't have the same insight as I would if I wrote the code, no ownership, even if it was committed in my name. Like any misconceptions I've had about how things work I will still have because I never had to work through the solution, even if I got the final working solution.

−

cruffle_duffle

My thoughts exactly. They generate a pile of sludge if left to their own devices. Sludge that you will take incredible amounts of time to understand.

The amount of tech debt these things accumulate unchecked is massive.

In some places it doesn’t matter, in some places it matters a lot.

−

tossandthrow

With all that additional work, would you assess you have been more cost effective as just doing these tasks yourself with an Ai companion?

−

hitarpetar

sounds like a huge waste of time

−

wrsh07

I would love to give a quick primer on how I'm using agents:

I'll usually have a main line of work I'm focused on. I'll describe the current behavior and desired changes (need to plumb this var through these functions to use here). "Gpt 5 thinking high" is pretty precise, so if you clearly indicate what you want it usually does exactly what I request. (If this isn't happening for you, make sure you don't have other context in your codebase that confuses it)

While it's working, I'll often be able to prompt another line of work, usually requesting explicitly it not make changes but not switching to ask mode. It will do most of the work to figure out what changes would need to be made and it summarizes them helpfully which allows me to correct it if it's wrong. You can repeat this for as long as the existing models are busy

Types of prompts that work well:

Questions: "what's the function or component for doing X", where else do we do this pattern?

Bug prompts (anything that would take you <2h to fix should be promptable in a single prompt, note you'll get slightly different responses even with the same prompt, so if at first you don't succeed you might explain what went wrong, ask it to improve your prompt, and then try again from scratch. People don't reset context often enough)

Larger scale architecture / plans - this I would recommend switching to plan mode and spending some time going back and forth. Often it will get confused so take your progress (ideally as an .md file) and bring it to a new conversation to keep iterating.

You can even have it suggest jira tickets etc

Understanding different models is important: Claude 4.5 (and most Claude models since 3.5) really want to do stuff. And if you leave them unchecked they'll usually do way more than you asked. And if they perceive themselves to be blocked on a failing test they might delete it or change it to be useless. That said, they're really extraordinary models when you want a quick prototype fleshed out where you don't make all of the decisions. Gpt 5 thinking high is my personal favorite (codex 5 thinking high is also very good in the codex plugin in vscode). Create new context often.

−

wrsh07

Best things about Claude: it will often figure out a good feedback loop where it can build + test and get quick feedback about whether the thing is working. This works best in Claude code but can be effective in cursor too

Best things about gpt: the precision. I don't even care that they're slow, it just let's me queue up more work

Best things about codex: it's a little smarter at handling very hard or very easy tasks. It might spend less time on easy tasks and even more time on hard ones

Best things about grok: speed plus leetcode style ability

All of them tend to benefit from a feedback loop if you can give them great tests or good static analysis etc, but they will cheat if you let them (any in ts)

−

theshrike79

I've used this analogy many times:

Codex + GPT-5-high is an offshore consultant. You give it the spec and it'll do the work and come back with something.

Claude is built like a pair programmer, it chats while it works and you can easily interrupt it without breaking the flow.

Codex is clearly more thorough, it's _excellent_ at picking apart Sonnet 4.5 code and finding the subtle gotchas it leaves behind when it just plows to a result.

And like you said, Claude is results first. It'll get where you want it to go, even if it has to mock the whole application to get the tests to pass. =)

−

laborcontract

The reason why OP is getting terrible results is because he's using Cursor, and Cursor is designed to ruthlessly prune context to curtail costs.

Unlike the model providers, Cursor has to pay the retail price for LLM usage. They're fighting an ugly marginal price war. If you're paying more for inference than your competitors, you have to choose to either 1) deliver equal performance as other models at a loss or 2) economize by way of feeding smaller contexts to the model providers.

Cursor is not transparent on how it handles context. From my experience, it's clear that they use aggressive strategies to prune conversations to the extent that it's not uncommon that cursor has to reference the same file multiple times in the same conversation just to know what's going on.

My advice to anyone using Cursor is to just stop wasting your time. The code it generates creates so much debt. I've moved on to Codex and Claude and I couldn't be happier.

−

addandsubtract

What deal is GitHub Copilot getting then? They also offer all SOTA models. Or is the performance of those models also worse there?

−

rhim

Or is the performance of those models also worse there?

The context and output limit is heavily shrunk down on github copilot^[0]. That's the reason why for example Sonnet 4.5 performs noticeably worse under copilot than in claude code.

[0] https://models.dev/?search=sonnet+4.5

−

laborcontract

Github Copilot is likely running models at or close to cost, given that Azure serves all those models. I haven't used Copilot in several months so I can't speak to its performance. My perception back then was that its underperformance relative to peers was because Microsoft was relatively late to the agentic coding game.

−

subjectivationx

You are spot on and summed it up perfectly.

I am using language models as much as anyone and they work but they don't work the way the marketing and popular delusion behind them is pretending they work.

The best book on LLMs and agentic AI is Extraordinary Popular Delusions and the Madness of Crowds by Charles Mackay.

−

lxgr

I've had agents find several production bugs that slipped me (as I couldn't dedicate enough time to chase down relatively obscure and isolated bug reports).

Of course there are many more bugs they'll currently not find, but when this strategy costs next to nothing (compared to a SWE spending an hour spelunking) and still works sometimes, the trade-off looks pretty good to me.

−

alwahi

from a cursory (heh) reading of the cursor forum, it is clear that the participants in the chat are treating ai like the adeptus mechanicus treats the omnissiah.... the machine spirits aren't cooperating with them though.

−

jpalomaki

That kind of comments would be more meaningful and get better responses, if they came with a practical example. Some reasonable real-world problem and how the author tried to solve it using LLM but failed.

−

hansmayer

The answer is really trivial and really embarrassingly simple, once you remove the engineering/functional/world improvement goggles. The answer is: because the rich folks invested a ton of money and they need it to work. Or at least to make most of the white collar work dependent on it, quality be damned. Hence the ever increasing pushing, nudging, advertising, offering to use the crap-tech everywhere. It seems now it will not win over the engineers. Unfortunately it seems to work with most of the general population. Every lazy recruiter out there is now using chatgpt to generate job summaries and "evaluate" candidates. Every "office worker" of the general type deadweight you meet at every company is happy to use it to produce more powerpoints, slides and documents for you drown in. And I won't even mention the "content" business model of the influencers.

−

nunez

They're using it, yes, but it's still heavily subsidized by VC, and it reminds to be seen whether it will remain as popular as prices percolate upwards.

Either way, layoffs all the same if this "doesn't work".

−

hansmayer

I agree it is heavily subsidized. The problem is, the general population got used to cheap or free software over the past decades and most of them don't really understand how much their polite exchanges with the chatgpt actually cost :) I'd be happy to see the price of the tooling jump up to reflect the real costs.

−

thaumasiotes

This idea is getting a lot of attention right now.

e.g. https://www.noahpinion.blog/p/americas-future-could-hinge-on...

−

rokkamokka

I find it funny that the page subheader is "If the economy's single pillar goes down, Trump's presidency will be seen as a disaster".

Is it not a disaster already? The fast slide towards autocracy should certainly be viewed as a disaster if nothing else.

−

yoyohello13

The fact that sending the US military to occupy our own cities is not seen as a failure is... something.

−

wkat4242

At our place we have two types of users. One that is a deep evangelist, says it revolutionised their office work and has no idea that it might have accuracy problems. I guess those are the people that just create a lot of hot air.

The others tried it and ran into the obvious Achilles heels and are now pretty cautious. But use it for a thing or two.

−

__MatrixMan__

Once you figure out how to get your model to go find the context it needs (for me this usually comes down to really good error messages that feel a bit like a prompt injection attack) and you figure out how to keep the tasks small and uniform-ish such that a passing test for a previous (supervised) task becomes a reason that that output can now be used as context for how to complete the next (unsupervised) task, agents can be pretty darn reliable.

Maybe 50% of the problems we solve are repetitive enough for this to make sense, and 50% of those are unpredictable enough that a model in a loop isn't overkill compared to traditional automation, and 50% of those are too small to be worth investing in the necessary scaffolding. But if you're looking at a problem that's in that magical 12.5%, a properly constrained agent is absolutely the way to go.

−

alganet

You should do a video on that, like a live coding session. Not a tutorial, an organic recording of you being a badass context engineer. That would make it easier to get the message across.

−

__MatrixMan__

Good suggestion, thanks.

I'll likely have to make a "what even is this" video for my coworkers, so maybe the video you're proposing would make a good Part II to that.

Might be tricky to convince my company to bless its release but perhaps with some careful editing...

−

alganet

You could do it on spare time, not using your company's hardware, and work on a public open source repository. This way there's no conflict with potential contracts.

Also, LLMs have been around for a while. Maybe you can just search for someone that did it and share one video that you would endorse as representative of what you believe to be good context engineering. It seems that there should be a lot of those around, lots of people are using this tech, aren't they?

−

__MatrixMan__

I designed the project around what I can lean on LLMs for, so its not like I can just drop into an existing codebase and make it look like a good idea to let your agent crank while you get coffee. Perhaps not so badass a context engineer after all.

Still the pattern is generalizable enough, so yeah I'll think on ways to bring it to a wider audience.

−

alganet

its not like I can just drop into an existing codebase

The devil is always in the details. For example, in enterprise teams the time to onboard a new project or a new programmer is often dismissed but actually very important. Serious companies take this matter very seriously. More inexperienced teams often come up with an excuse for a slow onboarding time without realizing it's a proxy metric for quality. If you can't jump in quickly or jump someone else quickly into the productive train, something is wrong.

I think recording yourself doing it might be a worthwhile endeavour, even if you don't release it to a wider audience.

The act of recording yourself and watching it might reveal blind spots that you were not aware while doing it. Perhaps things you think are fast but actually took more time than you imagined.

Also, by giving this advice I can take my personal judgement out of the equation. Now it's you recording and only you watching yourself and judging yourself (which can be very soul crushing, I can tell from experience).

−

kykat

The replies are all a variation of: "You're using it wrong"

−

motorest

The replies are all a variation of: "You're using it wrong"

I don't know what you are trying to say with your post. I mean, if two persons feed their prompts to an agent and while one is able to reach their goals the other fails to achieve anything, would it be outlandish to suggest one of them is using it right whereas the other is using it wrong? Or do you expect the output to not reflect the input at all?

−

kykat

Of course the output reflects the input, that's why it's a bad idea to let the LLM run in a loop without constraints, it's simple maths, if something is 99% accurate, after 5 times is 95% accurate, after 10 steps it's about 90% accurate, after 100 times it's about 36% accurate.

For LLMs to be effective, you (or something else) needs to constantly find the errors and fix it.

−

ninetyninenine

I’ve seen LLM catch and fix their own mistakes and literally tell me they were wrong and that they are fixing their self made wrong mistake. This analogy is therefore not accurate as error rate can actually decrease over time.

−

kykat

If we assume that each action has 99% success rate, and when it fails, it has 20% chance of recovery, and if the math here by gemini 2.5 pro is correct, that means the system will tend towards 95% chance of success.

===

In equilibrium, the probability of leaving the Success state must equal the probability of entering it.

    (Probability of being in S) * (Chance of leaving S) = (Probability of being in F) * (Chance of leaving F)

Let P(S) be the probability of being in Success and P(F) be the probability of being in Failure.

    P(S) * 0.01 = P(F) * 0.20

Since P(S) + P(F) = 1, we can say P(F) = 1 - P(S). Substituting that in:

    P(S) * 0.01 = (1 - P(S)) * 0.20
    0.01 * P(S) = 0.20 - 0.20 * P(S)
    0.21 * P(S) = 0.20
    P(S) = 0.20 / 0.21 ≈ 0.95238

−

ninetyninenine

That’s math based off of arbitrary initial assumptions. There are numbers that work.

All this math is useless. Use your brain. The entire point I’m communicating is that it’s not a given that it must become less accurate. There are multiple open possibilities here and scenarios that can occur.

Doing random math here as if you’re dropping the mic is just pointless. It doesn’t do anything. It’s like making up a cosmological constant and saying the universe is collapsing look at my math.

−

darkwater

I saw them too. And after that, slip in another mistake.

−

magicalhippo

I've had good experience getting a different LLM perform a technical review, then feed that back to the primary LLM but tell it to evaluate the feedback rather than just blindly accepting it.

You still have to have a hand on the wheel, but it helps a fair bit.

−

Alex_L_Wood

And yours is also "you are using it wrong" in the spirit.

Are they doing the same thing? Are they trying to achieve the same goals, but fail because one is lacking some skill?

One person may be someone who needs a very basic thing like creating a script to batch-rename his files, another one may be trying to do a massive refactoring.

And while the former succeeds, the latter fails. Is it only because someone doesn't know how to use agentic AI, or because agentic AI is simply lacking?

−

throwawayb2025

I had both good and bad experience.

Bad with regex or things involving recursion.

Also bad at integration between modules. I have not tried to solve this yet, by giving documentation of both the modules.

Also model used impacted. To understand java code it was great.

I first ask it to generate the detailed prompt by giving a high level prompt. Then use the detailed prompt to execute the task.

Java code size Upto 20k loc is fine. Other wise context becomes big. So you have to do module by module.

I believe to have a discussion maybe someone has to take an open source code example and then say it doesn't work. Other people can then discuss and decide.

Overall happy with gpt5 and claude code.

Edit:updated bad at integration

−

motorest

Also bad at integration between modules. I have not tried to solve this yet, by giving documentation of both the modules.

You should first draft the interface and roll out coverage with automated tests, and then prompt your way into filling in the implementation. If you just post a vague prompt on how you want multiple modules workinh together, odds are the output might not met implicit constraints.

−

berkes

And some more variations that, in my anecdotal experience make or break the agentic experience:

* strictness of the result - a personal blog entry vs a complex migration to reform a production database of a large, critical system

* team constraints - style guides, peer review, linting, test requirements, TDD, etc

* language, frameworks - quick node-js app vs a java monolyth e.g.

* legacy - a 12+ year Django app vs a greenfield rust microservice

* context - complex, historical, nonsensical business constraints and flows vs a simple crud action

* example body - a simple crud TODO in PHP or JS, done a million times vs a event-sourced, hexagonal architecrtured, cryptographical signing system for govt data.

−

leptons

In my experience it depends on which way the wind is blowing, random chance, and a lot of luck.

For example, I was working on the same kind of change across a few dozen files. The prompt input didn't change, the work didn't change, but the "AI" got it wrong as often as it got it right. So was I "using it wrong" or was the "AI" doing it wrong half the time? I tried several "AI" offerings and they all had similar results. Ultimately, the "AI" wasted as much time as it saved me.

−

l1ng0

[dead]

−

ares623

I expect the $500 billion magic machine to be magic. Especially after all the explicit threats to me and my friends livelihoods.

−

motorest

I expect the $500 billion magic machine to be magic. Especially after all the explicit threats to me and my friends livelihoods.

That's a problem you are creating for yourself by believing in magical nonsense.

Meanwhile, the rest of the world is gradually learning how to use the tool to simplify their work, being it helping onboard onto projects, doing ad-hoc code reviews, serving as sparring partners, helping with design work, and yes even creating complete projects from scratch.

−

theshrike79

It's giving "I paid $200k for this RV, the cruise control should keep the car on the road while I go in the back to make coffee".

−

gwd

Exactly one of two things is true:

1. The tool is capable of doing more than OP has been able to make it do

2. The tool is not capable of doing more than OP has been able to make it do.

If #1 is true, then... he must be using it wrong. OP specifically said:

Please pour in your responses please. I really want to see how many people believe in agentic and are using it successfully

So, he's specifically asking people to tell him how to use it "right".

−

timschmidt

I've certainly gotten a lot of value from adapting my development practices to play to LLM's strengths and investing my effort where they have weaknesses.

"You're using it wrong" and "It could work better than it does now" can be true at the same time, sometimes for the same reason.

−

ZeWaka

I find it quite funny that one of the users actually posted a fully AI-generated reply (dramatically different grammar and structure than their other posts).

−

thundoe

Which is true. Like launching a Ferrari at 200mph without steering doesn’t take anyone anywhere, it’s just a very painful waste of money

−

geldedus

Yes. Because it is the correct answer.

−

jimkri

The push for Agentic models is because people aren't working or they are failing to complete what they are contractually obligated to.

I'm working with a client who has APEX code with 0% test coverage. The last consultant company deployed APEX without any test classes. That is a failure on the Salesforce partner, since you need test classes if you want to update any APEX code, and it leaves the client with more work and more costs.

I used AI to write 100% test coverage in less than 1 hour. I had to give it direction because the first implementation was not to the Salesforce Dev Test Standards. So I downloaded the SFDC Developer PDF guides on APEX and gave it to AI Studio, and was able to write the test classes correctly.

Agents will do what they are programmed and prompted to do, and this is really why corporations are moving in that direction.

−

yeasku

You worked 1 hour on the testing and is finished?

how big is the project and what programming languaje it uses?

Has anybody reviewed all the code created for the test?

−

jimkri

1 hour on writing the code, and then 30 minutes on testing. There were 3 tests in the class, the project isn't large, ~200 lines of code, and the point is that the team that deployed it failed to provide this.

It's not complicated; it's a failure in process and work that they are required to provide based on their Salesforce Partner agreements, and to meet minimum standards.

I already stated the language; it's written in Salesforce APEX. Yes, the code was reviewed and passed all tests within Salesforce as well.

Salesforce partners often upload "Dummy" test methods or fail to provide the test classes, as this requires additional time. It's more a reflection on the partner than the individual programmer, since I've been in the industry, I know that devs are thrown onto multiple projects at once.

However, with Agents, you can reduce that time. Nevertheless, companies are still failing to do even that. So that is why I think corporations are pushing for more agents; the basics are being skipped because the teams working on it don't have the time.

−

mnky9800n

As a scientist there is a ton of boiler plate code that is just slightly different enough for every data set I need to write it myself each time. So coding agents solve a lot of that. At least until you are halfway through something and you realize Claude didn’t listen when you wrote 5 times in capital letters NEVER MAKE UP DATA YOU ARE NOT ALLOWED TO USE np.random IN PLACE OF ACTUAL DATA. It’s all kind of wild because when it works it’s great and when it doesnt there’s no failure state. So if I put on my llm marketing hat I guess the solution is to have an agent that comes behind the coding agent that checks to see if it does its job. We can call it the Performance Improvement Plan Agent (PIPA). PIPAs allow real time monitoring of coding agents to make sure they are working and not slacking off allowing for HR departments and management teams to have full control over their AI employees. Together we will move into the future.

−

theshrike79

Quick, don't think of an elephant in a pink tutu! You did, didn't you?

As a scientist, you should know that LLMs are pretty bad at understanding negatives because they work on tokens, not words.

"NO ELEPHANTS" roughly becomes NO + ELEPHANT. Now "elephant" is in the context and it's going to be "thinking" about it and steering everything towards it.

You need to use positive instructions.

−

ghthor

PIPA scary

−

Netcob

My guess is that the reason why AI works bad for some people is the same reason why a lot of people make bad managers / product owners / team leads. Also the same reason why onboarding is atrocious in a lot of companies ("Here's your login, here's a link to the wiki that hasn't been updated since 2019, if you have any questions ask one of your very busy co-workers, they are happy to help").

You have to very good at writing tasks while being fully aware of what the one executing it knows and doesn't know. What agents can infer about a project themselves is even more limited than their context, so it's up to you to provide it. Most of them will have no or very limited "long-term" memory.

I've had good experiences with small projects using the latest models. But letting them sift through a company repo that has been worked on by multiple developers for years and has some arcane structures and sparse documentation - good luck with that. There aren't many simple instructions to be made there. The AI can still save you an hour or two of writing unit tests if they are easy to set up and really only need very few source files as context.

But just talking to some people makes it clear how difficult the concept of implicit context is. Sometimes it's like listening to a 4 year old telling you about their day. AI may actually be better at comprehending that sort of thing than I am.

One criticism I do have of AI in its current state is that it still doesn't ask questions often enough. One time I forgot to fill out the description of a task - but instead of seeing that as a mistake it just inferred what I wanted from the title and some other files and implemented it anyway. Correctly, too. In that sense it was the exact opposite of what OP was complaining about, but personally I'd rather have the AI assume that I'm fallible instead of confidently plowing ahead.

−

weitendorf

I fully agree with this take and think a lot of people at this point are really just being uncharitable to those using AI productively + unwilling to admit their own faults when they fail to see this.

How can anybody who has managed or worked with inexperienced engineers, or StackOverflow developers, not see how helpful AI is for delegating the kinds of tasks with that particular flavor of content and scope? And how can anybody who is currently working with those kinds of developers not see how much it's helping them improve the quality of their work? (and yes, it's extremely frustrating to see AI used poorly or for people to submit code for review that they did not even review or even understand themselves. But the fact that that's even possible, that it often times still works, really tells you something... And given the right feedback, most offenders do eventually understand why they ought not to do this, I think)

Even for more experienced engineers, for the kind of "unimportant / low priority, uninteresting" work that requires a lot of context and knowledge to get done but isn't really a good use of experienced engineers' time, AI can really lower the barrier to starting and completing those tasks. Let's say my codebase doesn't have any docstrings or unit tests - I can feed it into an LLM and immediately get mediocre versions of all of that and just edit it into being good enough to merge. Or let's say I have an annoying unicode parsing bug, a problem with my regex, or something like that which I can reproduce in tests or a dev environment: a lot of the time I can just give the LLM the part of the code I suspect the bug resides within, tell it what the bug symptoms are and ask it to fix it, and validate the fix.

To be honest and charitable to those who do struggle to use AI this way, since it's most likely just a theory of mind issue (they don't understand what the AI does and doesn't know, and what context it needs to understand them and give them what they want), it could very well be influenced by being somewhere on the autism spectrum or just difficulty with social skills. Since AI is essentially a fresh wipe of the same stranger every time you start a conversation with it (unless you use "memory" features designed for consumer chat rather than coding), it never really gets to know you or understand your quirks like most people that regularly interact with those with social difficulties. So I suppose to a certain extent it requires them to "mask" or interact in a way they're unfamiliar with when dealing with computer tools.

A lot of people for whatever reason seem also to have decided to become emotionally/personally invested in "AI stupid" to the point that they will just flat out refuse to believe there is value in being able to type some little compiler error or stacktrace into a textbox and 80% of the time get a custom fix in 10% of the time it would have taken to do the same thing on google search+stackoverflow.

−

micoti

I have the exact same question, what is hype all about when models can't do simple things. You prompt the model with generate one unit test for function and it somehow always generate more then one. (Just to start with most simple instruction)

I just feel that models are currently not up to speed with experienced engineers where it takes less time to develop something then to instruct model to do it. It is only usefull for boring work.

This is not to say that these tools didn't created oportunities to create new stuff, it is just that the hype overestimates the usefullnes of the tools so they can sell them better just like all other things.

−

ianpri11

completing boring work is still very useful when a large proportion of peoples day jobs are managing CRUD apps

−

micoti

i agree, these tools are usefull. i only oppose agresive marketing that llm is solution for everything. it is just a tool which has its use case, but to me it seems that it is not optimal for use cases that it is advertised.

i work on agentic systems and they can be good if agent has a bit-sized chuck of work it needs to do. problme with the coding agents is that for every more complex thing you will need to write a big prompt which is sometimes counter productive and it seems to me that user in cursor thread is pointing in that direction.

−

JCM9

Because the hype cycle on the original AI wave was fading so folks needed something new to buzz about to keep the hype momentum going. Seriously, that’s the reason.

−

berkes

Do you have any reasoning or anything else to back this up? Edit: honest question, not a diss or a dismissal.

It's an interesting take, one that I believe could be true, but it sounds more like an opinion than a thesis or even fact.

−

JCM9

Folks aren’t seeing measurable returns on AI. Lots written about this. When the bean counters show up, the easiest way to get out of jail is to say “Oh X? Yeah that was last year, don’t worry about it… we’re now focused on Y which is where the impact will come from.”

Every hype cycle goes through some variation of this evolution. As much as folks try to say AI is different it’s following the same very predictable hype cycle curve.

−

Msurrow

The Gartner Hype Cycle^[1]. I wonder were AI should be put on the graph, here in the 2025 fall. Just past the peak? Or are we not there yet.

[1] https://en.wikipedia.org/wiki/Gartner_hype_cycle

−

eschaton

Why do they need reasoning to back it up when the LLMs being promoted don’t actually do any reasoning?

−

Havoc

Same with “context engineering”

−

fabian2k

For me, a big issue is that the performance of the AI tools varies enormously for different tasks. And it's not that predictable when it will fail, which does lead to quite a bit of wasted time. And while having more experience prompting a particular tool is likely to help here, it's still frustrating.

There is a bit of overlap for the stuff you use agents and the stuff that AI is good at. Like generating a bunch of boilerplate for a new thing from scratch. That makes the agent mode more convenient for me to interact with AI for the stuff it's useful in my case. But my experience with these tools is still quite limited.

−

ehnto

When it works well you both normalise your expectations, and expand your usage, meaning you will hit its limits, and be even more disappointed when it fails at something you've seen it do well before.

−

jeswin

There are widely divergent views here. It'd be hard to have a good discussion unless people mention what tasks they're attempting and failing at. And we'll also have to ask if those tasks (or categories) are representative of mainstream developer effort.

Without mentioning what the LLMs are failing or succeeding at, it's all noise.

−

danielbln

We'd need:

- language/framework

- problem space/domain

- SRE experience level

- LLM (model/version)

- agentic harness (claude code, codex, copilot, etc.)

- observed failure modes or win states

- experience wrangling these systems ("I touched ChatGPT once" vs "I spend 12h/day in Claude Code")

And there's more, is the engineer working on a single codebase for 10 years or do they jump around various projects all the time. Is it more greenfield, or legacy maintenance. Is it some frontier never-before-seen research project or CRUD? And so on.

−

throw-10-13

Youre holding it wrong

−

gwd

FWIW all my coding with LLMs is very hands-on. What I've ended up doing with LLMs is something like the following:

1. New conversation. Describe at a high level what change I want made. Point out the relevant files for the LLM to have context. Discuss the overall design with the LLM. At the end of that conversation, ask it to write out a summary (including relevant files to read for context next time) in an "epic" document in llm/epics/. This will almost always have several steps, listed in the document.

Then I review this and make sure it's in line with what I want.

2. New conversation. We're working on @llm/epics/that_epic.md. Please read the relevant files for context. We're going to start work on step N. Let me know if you have any questions; when you're ready, sketch out a detailed plan of implementation.

I may need to answer some questions or help it find more context; then it writes a plan. I review this plan and make sure it's in line with what I want.

3. New conversation. We're working on @llm/epics/that_epic.md. We're going to start implementing step N. Let me know if you have any questions; when you're ready, go ahead and start coding.

Monitor it to make sure it doesn't get stuck. Any time it starts to do something stupid or against the pattern of what I'd like -- from style, to hallucinating (or forgetting) a feature of some sub-package -- add something to the context files.

Repeat until the epic is done.

If this sounds like a lot of work, it is. As xkcd's "Uncomfortable Truths Well" said, "You will never find a programming language that frees you from the burden of clarifying your ideas." LLMs don't fundamentally change that dynamic. But they do often come up with clever solutions to problems; their "stupid questions" often helps me realize how unclear my thinking is; they type a lot faster, and they look up documentation a lot faster too.

Sure, they make a bunch of frustrating mistakes when they're new to the project; but if every time they make a patterned mistake, you add that to your context somehow, eventually these will become fewer and fewer.

−

tejtm

Nigh thirty years ago when dabbling in AI I read a quote I will paraphrase as:

when you hear 'intelligent agent'; think 'trainable ant'

−

jstummbillig

Say more?

−

drittich

Looks like it comes from Scientific American: https://spaf.cerias.purdue.edu/~spaf/Yucks/V5/msg00004.html

−

tejtm

great job digging up a reference! you must be a bot! :)

I read it in a book on AI, unfortunately that aisle in my library is inaccessible due to piles of obsolete crap (I wish I were kidding).

But hope springs eternal and if I get back and find it I will return here and add its deets to see if it was published before or after the SI article.

got to help future agents from going astray ...

−

mihaaly

What I recently experienced on asking for a string manipulation routine that follows very arbitrary logic (for a long existing file format) that it forgots things like UTF string handling (in general, but also its subtle details requiring second round), its own code replacing special characters with escape sequence can be cut in half in limited width fileds (being an input for the function), considers some aspects of the specification document while omiting the others. Needs heavy supervision in the details and constant adjustments.

Yet, it makes the bulk of the work. Saves brain energy, that goes into the edge cases then. The overall time is the same, it is just the result could become more robust in the end. Only with good supervision! (which has better chance when we are not worn out with the tedious heavy lifting part)

But the one undebatable benefit is that the user can feel the smartest person in the whole wide world having so 'excellent questions', and 'knowing the topic like a pro', or being 'fantastic to spot such subtle details'. Anyone feel inadequate should use an agentic AI to boost self morale! (well, only if the person does not get nauseous from that thick flattering)

−

mihau

It feels to me that the OP on the forum expects this to work: "read this existing function, then read my mind and do stuff" (probably followed by "do better").

It still takes a lot of practice to get good at prompting, though.

−

anal_reactor

Literally my manager

−

stanac

The problem in this case is that LLMs are bad with golang, I don't write go, I am guessing from my experience with kotlin. I mainly use kotlin (rest apis) and LLMs are often bad at writing it. They e.g. confuse mockk and mockito functions and then agent spiral into a never ending loop of guessing what's wrong and trying to fix it in 5 different ways. Instead I use only chat, validate every output and point out errors they introduce.

On the other hand colleagues working with react and next have better experience with agents.

−

weitendorf

Because golang is so verbose LLMs are still extremely useful at using it without feeling like you're doing data entry or working as a typist. Converting JSON to a properly typed Go struct or doing the 3-5 things required to create+specify+send+deserialize/parse an HTTP request (with explicit error handling) is about 20x faster when I make an LLM do it, and it makes it so I don't dread or avoid those kinds of tasks.

−

Nifty3929

This reads as someone who has not learned how, when and why to use a new tool, uses it poorly if at all, and then blames the tool.

All the while their peers who have taken the time to understand the new tool and when/how/why to use it are in fact using it very successfully.

Instead of thinking "it doesn't work," think "when and how can this work for me?"

I had an experience recently where I was working on a python notebook that I didn't create, and which was a bit old. I ran into an error that I was unfamiliar with. There had been added an AI-based "explain this error" button to my notebook interface. I figured it would suck, but I gave it a try anyway. It correctly identified the problem (old library conflicting with newer one), suggested the change (switch from old library to newer fork) and offered to implement it. I clicked "yes" and it made all of the (relatively minor) changes required, including a few spots where some method returns had changed and needed to be unpacked. I scrutinized the changes to ensure I knew what was going on, but it all just worked regardless. Probably 8 or so 1-2 line changes throughout 200 lines of code.

This likely saved me roughly an hour, or maybe two, of fiddling around, googling, reading docs, etc. Because I had an open mind and gave it a try, and had reasonable expectations, which were in fact exceeded.

I did not start with "refactor my codebase please," and then get mad and give up when it doesn't help much.

−

DarkNova6

What is the code quality of the average developer? What is the code quality of coding agents?

There is your answer why many find AI coding productive while others do not.

−

yeasku

If you have the money you can get very good developers.

If you want to pay 250 a month. You get what you get.

−

jMyles

A simple, interesting, and too-often-unspoken answer is that LLMs, especially with 100k+ context heaps, seem to be better at tasks that have contextual significance and scenery than they are at following one-off instructions.

Or at least, that's very much the experience of my team.

Like everyone else, I'm working on an MCP server right now. Mine though, is designed to populate a new context window with memories of Billy Strings shows and other bluegrass phenomena. Why? Well, in addition to that content being interesting to me and my wanting to see what LLMs have to say about it, I also notice that having this context seems to make them much better at solving bluegrass-adjacent problems in software, which is what I'm working on.

−

falconinthesun

"You're using it wrong" if a user cannot use a tool intuitively, the tool is not fit for purpose.

−

gwd

The most powerful tools are usually renowned to have the most arcane user interfaces.

Xkcd's "Uncomfortable Truths Well" said, "You will never find a programming language that frees you from the burden of clarifying your ideas." LLMs don't fundamentally change that dynamic.

[1] https://xkcd.com/568/

−

geldedus

There are hundreds of tools that you must learn how to use to get any result. Totally fit for the purpose. Learn how to properly use that tool and you'll get results.

−

coolfox

VC mumbo jumbo; you can apply this same logic to literally all of programming

−

interleave

I'll bite.

We're a classic XP shop. To build new features in our brown-field app, we defined about 8 sub-agents such as "red-test-writer", "minimal-green-implementer" and "refactorer".

Now all I do in Claude Code is: "Build this feature X using our TDD process and the agents." 30 minutes later the feature is complete, looks better and works better than what I would have built in 30 minutes, is 90% tested and is ready for acceptance testing.

Granted it took us years of working XP, pairing, TDD etc. but I keep feeling confused about posts like this.

We've been shipping production-grade code written 95% by AI for over a year now. Non-trivial, complex features.

There is no secret sauce even, in how we do this. It works. Really, really well for us.

−

ai-christianson

People want predictability from LLMs, but these things are inherently stochastic, not deterministic compilers. What’s working right now isn’t "prompting better," it’s building systems that keep the LLM on track over time: logging, retrying, verifying outputs, giving it context windows that evolve with the repo, etc.

That’s why we’ve been investing so much in multi-agent supervision and reproducibility loops at gobii.ai. You can’t just "trust" the model; you need an environment where it’s continuously evaluated, self-corrects, and coordinates with other agents (and humans) around shared state. Once you do that, it stops feeling like RNG and starts looking like an actual engineering workflow, distributed between humans and LLMs.

−

mschuster91

Because there is a lot of money tied up in AI now, in a way that doesn't just reek like a bubble waiting to implode but even more stinks like a bunch of what used to be called "wash trading"^[1]. And that's just the money side.

The "social kool-aid" side is even worse. A lot of very rich and very influential people have bet their career on AI - especially large companies who just outright fired staff to be replaced both by actual AI and "Actually Indians"^[2] and are now putting insane pressure on their underlings and vendors to make something that at least looks on the surface like the promised AI dreams of getting rid of humans.

Both in combination explains why there is so much half-baked barely tested garbage (or to use the term du jour: slop) being pushed out and force fed to end users, despite clearly not being ready for prime time. And on top of that, the Pareto principle also works for AI - most of what's being pushed is now "good enough" for 80%, and everyone is trying to claim and sell that the missing 20% (that would require a lot of work and probably a fundamentally new architecture other than RNG-based LLMs) don't matter.

[1] https://www.bbc.com/news/articles/cz69qy760weo

[2] https://www.osnews.com/story/142488/ai-coding-chatbot-funded...

−

YouAreWRONGtoo

I think that unless you work on a 100000-1000000 GPU research cluster, you don't know what's currently possible.

I wish I could know the kinds of queries that could be answered when there are no economic constraints on existing infrastructure. That would suggest whether they have already hit a scientific wall (and that's the difference between it being a $10T+ industry or a 500B industry). On consumer LLMs, it's still easy to get the LLM to admit queries are beyond its abilities, although many of those questions are also beyond 99.9999% of humanity, to be fair (in that the things I ask don't exist yet anywhere and possibly will never due to their non-trivial engineering nature).

−

buzzin__

In machine learning, boosting is a way to combine weak learners into a strong one. Perhaps something similar can be done with language models?

−

internet_points

look up Mixture of Experts, e.g. Mixtral

−

j45

The bringing of "agentic" into the mainstream this year was a little odd, crewAI was discovered by a broader group, but many people had quietly been playing with it for a while.

The fundamentals of how LLMs are different from what came before it are still being understood.

People keep thinking it's like the world they've always known with software and discovering it's not. That could be an issue with LLMs, or how/what they are used for.

Seeing different ways takes time.

I feel like we're still missing some essential building blocks in widespread use to not only creating agents but making sure they stay stable and consistent across model changes.

−

vigouroustester

While I agree with the sentiment of not just letting it run free on the whole codebase and do what it wants, I still have good experience with letting it do small tasks one at a time, guided by me. Coding ability of models has really improved over the last few months itself and I seem to be clearing less and less AI-generated code mess than I was 5 months ago.

It's got a lot to do with problem framing and prompt imo.

−

timcobb

what prompted this post? well just tried to work with gpt5 and gemini pro

That's the problem. GPT5, at least in Cursor, doesn't work for coding. It burns tokens and does actually nothing my in experience. Claude 3.5, 4 and 4.5, on the other hand, are pretty solid and make lots of forward progress with minimal instruction. It takes iteration, some skill, some critical thinking, and some hand coding! Yes, LLMs forget things and do random things sometimes, but for me it's a big boost.

−

petetnt

I love how the proposed solution is to essentially gaslighting the model to think that it's an expert programmer and then specify and re-specify the prompt until the solution is essentially inefficient pseudocode. Now we are in a world where amateur coders still cannot code or can't learn from their mistakes while experts are essentially JIRA ticket outsourcing specialists.

−

dearilos

This is pretty much what I’m solving. LLMs need a “linter” of sorts that can guide them to write good code.

−

righthand

An agentic chat agent is a worker you can just immediately fire instead of training to be better. Even if you train it better, it’s best to just fire them once the project is done. Executives have invented firing virtual coworkers as a work game.

−

smartbench1

With fine tuning becoming cheaper, agents will become very powerful in the next stage. That is an agent with multiple models as intelligence units in their areas of expertise.

−

taherchhabra

I don't think the models are dumb anymore, codex with gpt5 and claude code can design and build complex systems. The only thing is these models work great on greenfield projects. Legacy projects design evolves over a number of years and LLMs have hard time understanding those unwritten project design decisions

−

chrisjj

When I asked Claude "AI" to count the number of text file lines missing a given initial sub-string, it gave an improbably exaggerated result. When I challenged this, it replied "You are right! Let me try again this time without splitting long lines."

AI = Absent Intelligence.

−

etothet

I recommend you check out Andrej Karpathy’s 2 YouTube videos on how LLMs work (they are easy to find, but be forewarned they are long!). Once one digs in deeper it becomes clear why a model today might fail at the task you described.

Generally speaking, one of the behaviors I see in my day to day work leading engineers is that they often attempt to apply agentic coding tools to problema that don’t really benefit from them.

−

chrisjj

I recommend you check out Andrej Karpathy’s 2 YouTube videos on how LLMs work

I'll reccommend Claude "AI" do that, so it knows to tell the user when it /doesn't/ work.

−

rootlocus

Here's how Claude Code does this:

Find all java files with more than 100 lines.

● I'll search for all Java files with more than 100 lines in the codebase.

● Bash(find . -name "*.java" -type f -exec wc -l {} + | awk '$1 > 100' | sort -nr)

...

● I found 5 Java files with more than 100 lines:

  1. File1.java - 315 lines
  2. File2.java - 156 lines
  3. File3.java - 154 lines
  4. File4.java - 130 lines
  5. File5.java - 117 lines

  The largest file is File1.java with 315 lines.

Or, if you want to count lines:

How many lines don't start with `import` in File1.java?

● Bash(grep -cv "^import " ./File1.java) ⎿ 287

● There are 287 lines that don't start with import in File1.java.

−

theshrike79

Don't ask a language model to do math, they're not very good at it.

Next time ask it to write a script or a program to do it, it'll most likely one-shot it.

−

chrisjj

Don't ask a language model to do math, they're not very good at it.

... and not intelligent enough to know that, apparently.

−

theshrike79

It's not about intelligence, it's about the fact that they don't have an environment to run the program in.

In theory the "count the 'r's in strawberry" type of things could just be run in the browser, which is a semi-safe environment.

−

chrisjj

I wasn't commenting about any program.

The fact it doesn't know the method it does use is defective is defintely about intelligence.

As is the fact it doesn't learn. If I give it the same task tomorrow, it will repeat the same mistake it admitted yesterday.

−

throw_m239339

Management thinks a crutch can effectively replace people massively in sensitive knowledge work. When that crutch starts making errors that cost those businesses millions, or billions, well, hopefully management who implemented all that will get fired...

Yes, LLM are useful, but they are even less trustworthy than real humans, and one needs actual people to verify their output, so when agents write 100K lines of code, they'll make mistakes, extremely subtle ones, and not the kind of mistake any human operator would make.

−

apriljo

I use git so hallucinogenic AI decisions are easy to revert. Why wouldn't I ask it to clean up years of tech debt while I work on something novel?

−

AHTERIX5000

2025 was the year when my fear of being replaced by an AI changed to fear of a big economic disaster caused by AI bubble

−

ehnto

It's been a rollercoaster, and it's still not clear what's on the other side of the loop.

−

hashstring

The push seems to work marketing-wise for a lot of folks.

I meet a lot of people who interfaced with LLMs for about topics that they are not an expert in, and they now believe that AI is taking over.

What a bore.

−

the__alchemist

Like OP in the link, I'm confused too. And I use LLMs for coding every day! With precise prompts, function signatures provided, only using it for problems I know are solved [by others] etc.

−

geldedus

It's been months agentic AI writes thousands of lines of code that go to production. Use better tools.

−

nurettin

After so many months, Gemini pro still shits the bed after failing to update a file several times. I'd expect more from the culmination of human knowledge.

−

rel0gic

i just got an aneurysm from reading the comments over there. are people having a stroke?

−

thdhhghgbhy

First answer: "you're prompting it wrong." I've heard that a few times now about demented autocomplete.

−

yeasku

You are using the autocomplete wrong is kind of funny sentence.

−

dankobgd

because they invested billions and now they have to justify it

−

pjmlp

Got get some of that oil/diamants/...., that's why.

−

throw-10-13

shades of “Self driving is coming next year”

−

0xbadcafebee

LLM is valuable as a research tool. For finding information, answering questions, doing comparisons, etc, it's great.

Otoh, an LLM writing code is a dog painting. If you spend enough billions on it, a dog can churn out a Picasso. Doesn't mean it's a good idea.

−

intended

I am going to try and make it a habit to post this request on all LLM Coding questions -

Can we please make it a point to share the following information when we talk about experiences with code bots?

1) Language - gives us an idea if the language has a large corpus of examples or not

2) Project - what were you using it for?

3) Level of experience - neophyte coder? Dunning Krueger uncertainty? Experience in managing other coders? Understand project implementation best practices ?

From what I can tell/suspect, these 3 features are the likely sources of variation in outcomes.

I suspect level of experience is doing significant heavy lifting, because more experienced devs approach projects in a manner that avoids pitfalls from the get go.