François Chollet: The Arc Prize and How We Get to AGI [video]
If you judged human intelligence by our AI standards, then would humans even pass as Natural General Intelligence? Human intelligence tests are constantly changing, being invalidated, and rerolled as well.
I maintain that today's LLMs would pass sufficiently for AGI, and are also very close to passing a Turing Test, if measured by the standards of 1950 when the test was proposed.
you can always detect the AI by text and timing patterns
I see no reason why an AI couldn't be trained on human data to fake all of that.
If no one has bothered so far, that's because pretty much all commercial applications of this would be illegal, or would at least lead to major reputational damage when exposed.
[AGI is achieved when] AI systems that can generate at least $100 billion in profits.
https://techcrunch.com/2024/12/26/microsoft-and-openai-have-...
If something would be better at every cognitive task than every human, if it ran a trillion times faster, I would consider that to be AGI even if it isn’t that useful at its actual speed.
Not only do we not have that, I don't think it's possible to have it.
Philosophers have known about this problem for centuries. Wittgenstein recognized that most concepts don't have precise definitions but instead behave more like family resemblances. When we look at a family we recognize that they share physical characteristics, even if there's no single characteristic shared by all of them. They don't need to unanimously share hair color, skin complexion, mannerisms, etc. in order to have a family resemblance.
Outside of a few well-defined things in logic and mathematics, concepts operate in the same way. Intelligence isn't a well-defined concept, but that doesn't mean we can't talk about different types of human intelligence, non-human animal intelligence, or machine intelligence in terms of family resemblances.
Benchmarks are useful tools for assessing relative progress on well-defined tasks. But the decision of what counts as AGI will always come down to fuzzy comparisons and qualitative judgments.
You know, stuff that humans have done way before there were computers and screens.
Getting a high score on ARC doesn't mean we have AGI and Chollet has always said as much AFAIK
He only seems to say this recently, since OpenAI cracked the ARC-AGI benchmark. But in the original 2019 abstract he said this:
We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.
https://arxiv.org/abs/1911.01547
Now he seems to backtrack, with the release of harder ARC-like benchmarks, implying that the first one didn't actually test for really general human-like intelligence.
This sounds a bit like saying that a machine beating chess would require general intelligence -- but then adding, after Deep Blue beats chess, that chess doesn't actually count as a test for AGI, and that Go is the real AGI benchmark. And after a narrow system beats Go, moving the goalpost to beating Atari, and then to beating StarCraft II, then to MineCraft, etc.
At some point, intuitively real "AGI" will be necessary to beat one of these increasingly difficult benchmarks, but only because otherwise yet another benchmark would have been invented. Which makes these benchmarks mostly post hoc rationalizations.
A better approach would be to question what went wrong with coming up with the very first benchmark, and why a similar thing wouldn't occur with the second.
We can simply check the news every day until it's built...
Can we formalize it as giving out a task expressible in, say, n^m bytes of information that encodes a task of n^(m+q) real algorithmic and verification complexity -- then solving that task within a certain time, compute, and attempt bounds?
Something that captures "the AI was able to unwind the underlying unspoken complexity of the novel problem".
I feel like one could map a variety of easy human "brain teaser" type tasks to heuristics that fit within some mathematical framework and then grow the formalism from there.
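As a very rough sketch of that direction (everything below is a made-up illustration: compressed size stands in for description length, and the ratio stands in for the n^m vs. n^(m+q) gap):

    import zlib

    def compressed_size(s: str) -> int:
        """Crude proxy for description length: bytes after zlib compression."""
        return len(zlib.compress(s.encode("utf-8")))

    def hidden_complexity_ratio(task_spec: str, solution_trace: str) -> float:
        """Ratio of the solution trace's compressed size to the task spec's.

        A high ratio suggests the solver had to unwind structure that was not
        spelled out in the prompt itself (a hypothetical measure, not a standard one).
        """
        return compressed_size(solution_trace) / compressed_size(task_spec)

    # Illustrative use: a terse spec whose worked solution is far less compressible.
    spec = "Sort the list that the oracle returns for seed=7."
    trace = " ".join(str((k * 2654435761) % 2**32) for k in range(3000))
    print(f"hidden-complexity ratio: {hidden_complexity_ratio(spec, trace):.1f}")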
This entropy angle has real theoretical backing. Some researchers propose that consciousness emerges from the brain's ability to integrate information across different scales and timeframes, essentially creating temporary "islands of low entropy" in neural networks. Giulio Tononi's Integrated Information Theory suggests consciousness corresponds to a system's ability to generate integrated information, which relates to how it reduces uncertainty (entropy) about its internal states. Then there is Hameroff and Penrose, whose theory I commented about on here years ago and got blasted for. Meh. I'm a learner, and I learn by entertaining truths. But I always remain critical of theories until I'm sold.
I'm not selling any of this as a truth, because the fact remains we have no idea what "consciousness" is. We have a better handle on "intelligence", but as others point out, most humans aren't that intelligent. They still manage to drive to the store and feed their dogs, however.
A lot of the current leading ARC solutions use random sampling, which sorta makes sense once you start thinking about having to handle all the different types of problems. At least it seems to be helping out in paring down the decision tree.
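For readers who haven't looked at those solutions, here is a heavily simplified sketch of the sample-and-check idea; the toy transformation set below is invented for illustration and is far smaller than what actual entries use:

    import random

    # A toy DSL of grid transformations (real solvers use far richer primitives).
    def rot90(g):  return [list(row) for row in zip(*g[::-1])]
    def flip_h(g): return [row[::-1] for row in g]
    def flip_v(g): return g[::-1]
    def identity(g): return [row[:] for row in g]

    PRIMITIVES = [rot90, flip_h, flip_v, identity]

    def sample_program(max_len=3):
        """Randomly sample a short composition of primitives."""
        return [random.choice(PRIMITIVES) for _ in range(random.randint(1, max_len))]

    def run(program, grid):
        for step in program:
            grid = step(grid)
        return grid

    def search(train_pairs, n_samples=5000):
        """Keep sampling until a program reproduces every training output."""
        for _ in range(n_samples):
            prog = sample_program()
            if all(run(prog, x) == y for x, y in train_pairs):
                return prog
        return None

    # Toy task: the hidden rule is a horizontal flip.
    train = [([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
             ([[4, 5], [6, 7]], [[5, 4], [7, 6]])]
    solution = search(train)
    print([f.__name__ for f in solution] if solution else "no program found")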
My problem with AGI is the lack of a simple, concrete definition.
You can't always start from definitions. There are many research areas where the object of research is to know something well enough that you could converge on such a thing as a definition, e.g. dark matter, consciousness, intelligence, colony collapse syndrome, SIDS. We can nevertheless progress in our understanding of them in a whole motley of strategic ways: by case studies that best exhibit salient properties, by tracing the outer boundaries of the problem space, by tracking the central cluster of "family resemblances" that seems to characterize the problem, by entertaining candidate explanations that are closer or further away, etc. Essentially a practical attitude.
I don't doubt in principle that we could arrive at such a thing as a definition that satisfies most people, but I suspect you're more likely to have that at the end than the beginning.
where he introduced the "Abstraction and Reasoning Corpus for Artificial General Intelligence" (ARC-AGI) benchmark to measure intelligence
So, a high enough score is a threshold to claim AGI. And, if you use an LLM to work these types of problems, it becomes pretty clear that passing more tests indicates a level of "awareness" that goes beyond rational algorithms.
I thought I had seen everything until I started working on some of the problems with agents. I'm still sorta in awe about how the reasoning manifests. (And don't get me wrong, LLMs like Claude still go completely off the rails where even a less intelligent human would know better.)
They mean an artificial god, and it has become a god of the gaps: we have made artificial general intelligence, and it is more human-like than god-like, and so to make a god we must have it do XYZ precisely because that is something which people can't do.
"Avg. Mturker" has 77% on ARC1 and costs $3/task. "Stem Grad" has 98% on ARC1 and costs $10/task. I would love a segment like "typical US office employee" or something else in between since I don't think you need a stem degree to do better than 77%.
It's also worth noting the "Human Panel" gets 100% on ARC2 at $17/task. All the "Human" models are on the score/cost frontier and exceptional in their score range although too expensive to win the prize obviously.
I think the real argument is that the ARC problems are too abstract and obscure to be relevant to useful AGI, but I think we need a little flexibility in that area so we can have tests that can be objectively and mechanically graded. E.g. "write a NYT bestseller" is an impractical test in many ways even if it's closer to what AGI should be.
Our rod and cone cells could just as well be wired up in any other configuration you care to imagine. And yet, an organisation or mapping that preserves spatial relationships has been strongly preferred over billions of years of evolution, allowing us most easily to make sense of the world. Put another way, spatial feature detectors have emerged as an incredibly versatile substrate for ‘live-action’ generation of world models.
What do we do when we visualise, then? We take abstract relationships (in data, in a conceptual framework, whatever) and map them in a structure-preserving way to an embodiment (ink on paper, pixels on screen) that can wind its way through our perceptual machinery that evolved to detect spatial relationships. That is, we leverage our highly developed capability for pattern matching in the visual domain to detect patterns that are not necessarily visual at all, but which nevertheless have some inherent structure that is readily revealed that way.
What does any of this entail for machine intelligence?
On the one hand, if a problem has an inherent spatial logic to it, then it ought to have good learning gradients in the direction of a spatial organisation of the raw input. So, if specifically training for such a problem, the serialisation probably doesn’t much matter.
On the other hand: expecting a language model to generalise to inherently spatial reasoning? I’m totally with you. Why should we expect good performance?
No clue how the unification might be achieved, but I’d wager that language + action-prediction models will be far more capable than models grounded in language alone. After all, what does ‘cat’ mean to a language model that’s never seen one pounce and purr and so on? (Pictures don’t really count.)
Looking at the human side, it takes a while to actually learn something. If you've recently read something it remains in your "context window". You need to dream about it, to think about, to revisit and repeat until you actually learn it and "update your internal model". We need a mechanism for continuous weight updating.
Goal-generation is pretty much covered by your body constantly drip-feeding your brain various hormones as "ongoing input prompts".
I'd say we're not far off.
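As a toy picture of what "continuous weight updating" means in contrast to a frozen model plus a context window, here is a minimal online-learning sketch (a linear model, purely illustrative):

    import numpy as np

    # A deliberately tiny "model": linear regression updated online, one example
    # at a time, as a stand-in for continuously updating weights instead of only
    # stuffing recent information into a context window.
    rng = np.random.default_rng(0)
    w = np.zeros(3)                      # model weights ("long-term memory")
    lr = 0.05

    def observe(x, y):
        """Single online SGD step: the 'sleep on it / revisit it' update."""
        global w
        error = w @ x - y
        w -= lr * error * x

    true_w = np.array([1.5, -2.0, 0.5])
    for _ in range(2000):                # a stream of experiences, not a fixed dataset
        x = rng.normal(size=3)
        observe(x, true_w @ x)

    print(np.round(w, 2))                # converges toward the underlying regularity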
How are we not far off? How can LLMs generate goals and based on what?
Alternately, you can train it on following a goal and then you have a system where you can specify a goal.
At sufficient scale, a model will already contain goal-following algorithms because those help predict the next token when the model is base-trained on goal-following entities, i.e., humans. Goal-driven RL then brings those algorithms to prominence.
But also my intuition is that humans are "trained on goals" and then reverse-engineer an explicit goal structure using self-observation and prosaic reasoning. If it works for us, why not the LLMs?
edit: Example: https://arxiv.org/abs/2501.11120 "Tell me about yourself: LLMs are aware of their learned behaviors". When you train a LLM on an exclusively implicit goal, the LLM explicitly realizes that it has been trained on this goal, indicating (IMO) that the implicit training hit explicit strategies.
Noticing this, frameworks like SMART[1] provide explicit generation rules. The existence of explicit frameworks is evidence that humans tend to perform worse than expected at extracting implicit structure from goals they've observed.
1. Independent of the effectiveness of such frameworks
https://github.com/dmf-archive/PILF
https://dmf-archive.github.io/docs/posts/beyond-snn-plausibl...
A good base test would be to give a manager a mixed team of remote workers, half being human and half being AI, and seeing if the manager or any of the coworkers would be able to tell the difference. We wouldn't be able to say that AI that passed that test would necessarily be AGI, since we would have to test it in other situations. But we could say that AI that couldn't pass that test wouldn't qualify, since it wouldn't be able to successfully accomplish some tasks that humans are able to.
But of course, current AI is nowhere near that level yet. We're left with benchmarks, because we all know how far away we are from actual AGI.
I agree that current AI is nowhere near that level yet. If AI isn't even trying to extract meaning from the words it smiths or the pictures it diffuses then it's nothing more than a cute (albeit useful) parlor trick.
However, I'm not sure an AGI test should be mitigating them. If an AI isn't able to communicate at human speeds, or isn't able to achieve the social understandings that a human does, it would probably be wrong to say that it has the same intelligence capabilities as a human (how AGI has traditionally been defined). It wouldn't be able to provide human level performance in many jobs.
These are all things my kids would do when they were pretty young.
That's not really AGI because xyz
What then? The difficulty in coming up with a test for AGI is coming up with something that people will accept a passing grade as AGI.
In many respects I feel like all of the claims that models don't really understand or have internal representations or whatever tend to lean on nebulous or circular definitions of the properties in question. Trying to pin the arguments down usually ends up with dualism and/or religion.
Doing what Chollet has done is infinitely better: if a person can easily do something and a model cannot, then there is clearly something significant missing.
It doesn't matter what the property is or what it is called. Such tests might even help us see what those properties are.
Anyone who wants to claim the fundamental inability of these models should be able to provide a task that it is clearly possible to tell when it has been succeeded, and to show that humans can do it (if that's the bar we are claiming can't be met). If they are right, then no future model should be able to solve that class of problems.
When people catalogue the deficiencies in AI systems, they often (at least implicitly) forgive all of our own such limitations. When someone points to something that an AI system clearly doesn't understand, they say that proves it isn't AGI. But if you point at any random human, who fails at the very same task, you wouldn't say they lack "HGI", even if they're too personally limited to ever be taught the skill.
All of which, is to say, I don't think pointing at a limitation of an AI system, really proves it lacks AGI. It's a more slippery definition, than that.
It doesn't matter what the property is or what it is called. Such tests might even help us see what those properties are.
This is a very good point and somewhat novel to me in its explicitness.
There's no reason to think that we already have the concepts and terminology to point out the gaps between the current state and human-level intelligence and beyond. It's incredibly naive to think we have already armchair-generated those concepts by pure self-reflection and philosophizing. This is obvious in fields like physics. Experiments were necessary to even come up with the basic concepts of electromagnetism or relativity or quantum mechanics.
I think the reason is that pure philosophizing is still more prestigious than getting down in the weeds and dirt and doing limited-scope well-defined experiments on concrete things. So people feel smart by wielding poorly defined concepts like "understanding" or "reasoning" or "thinking", contrasting it with "mere pattern matching", a bit like the stalemate that philosophy as a field often hits, as opposed to the more pragmatic approach in the sciences, where empirical contact with reality allows more consensus and clarity without getting caught up in mere semantics.
The difficulty in coming up with a test for AGI is coming up with something that people will accept a passing grade as AGI.
The difficulty with intelligence is we don't even know what it is in the first place (in a psychology sense, we don't even have a reliable model of anything that corresponds to what humans point at and call intelligence; IQ and g are really poor substitutes).
Add into that Goodhart's Law (essentially, propose a test as a metric for something, and people will optimize for the test rather than what the test is trying to measure), and it's really no surprise that there's no test for AGI.
But conversely, not passing this test is a proof of not being as general as a human's intelligence.
While understanding why a person or AI is doing what it's doing can be important (perhaps specifically in safety contexts) at the end of the day all that's really going to matter to most people is the outcomes.
So if an AI can use what appears to be intelligence to solve general problems and can act in ways that are broadly good for society, whether or not it meets some philosophical definition of "intelligent" or "good" doesn't matter much – at least in most contexts.
That said, my own opinion on this is that the truth is likely in between. LLMs today seem extremely good at being glorified auto-completes, and I suspect most (95%+) of what they do is just recalling patterns in their weights. But unlike traditional auto-completes they do seem to have some ability to reason and solve truly novel problems. As it stands I'd argue that ability is fairly poor, but this might only represent 1-2% of what we use intelligence for.
If I were to guess why this is, I suspect it's not that LLM architecture today is completely wrong, but that the way LLMs are trained means that in general knowledge recall is rewarded more than reasoning. This is similar to the trade-off we humans have with education – do you prioritise the acquisition of knowledge or critical thinking? Many believe critical thinking is more important and should be prioritised more, but I suspect that for the vast majority of tasks we're interested in solving, knowledge storage and recall is actually more important.
But when the question is "are they going to be more important to the economy than humans?", then they have to be good at basically everything a human can do, otherwise we just see a variant of Amdahl's law in action and the AI performs an arbitrary speed-up of n% of the economy while humans are needed for the remaining (100-n)%.
I may be wrong, but it seems to me that the ARC prize is more about the latter.
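To put numbers on the Amdahl's-law point above (the fractions and speed-ups below are invented for illustration):

    def overall_speedup(automatable_fraction: float, ai_speedup: float) -> float:
        """Amdahl's law: total speed-up when only part of the work is accelerated."""
        p, s = automatable_fraction, ai_speedup
        return 1.0 / ((1.0 - p) + p / s)

    # Even an effectively infinite speed-up on 80% of tasks only buys ~5x overall,
    # because the remaining 20% still runs at human speed.
    for p in (0.5, 0.8, 0.95):
        print(f"automate {p:.0%}: overall speed-up = {overall_speedup(p, 1e9):.1f}x")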
"are they going to be more important to the economy than humans?", then they have to be good at basically everything a human can do,
I really don’t think that’s the case. A robot that can stack shelves faster than a human is more valuable at that job than someone who can move items and also appreciate comedy. One that can write software more reliably than person X is more valuable than them at that job even if X is well rounded and can do cryptic crosswords and play the guitar.
Also many tasks they can be worse but cheaper.
I do wonder how many tasks something like o3 or o3 pro can’t do as well as a median employee.
I really don’t think that’s the case. A robot that can stack shelves faster than a human is more valuable at that job than someone who can move items and also appreciate comedy.
Yes, until all the shelves are stacked and that is no longer your limiting factor.
One that can write software more reliably than person X is more valuable than them at that job even if X is well rounded and can do cryptic crosswords and play the guitar.
Cryptic crosswords and guitar playing are already something computers can do, so they're not great examples.
Consider a different example: "computer" used to be a job title of a person who computes. A single Raspberry Pi model zero, given away for free on a magazine cover at launch, can do this faster than the entire human population combined even if we all worked at the speed of the world record holder 24/7. But that wasn't enough to replace all human labour.
I think the people behind the ARC Prize agree that getting a high score doesn't mean we have AGI
The benchmark was literally called ARC-AGI. Only after OpenAI cracked it, they started backtracking and saying that it doesn't test for true AGI. Which undermines the whole premise of a benchmark.
He would now have to either:
- say that OpenAI's o3 counts as "AGI", since it unexpectedly beat the ARC-AGI benchmark, or
- explicitly admit that he was wrong when assuming that ARC-AGI would test for AGI.
We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.
It is important to note that ARC is a work in progress, not a definitive solution; it does not fit all of the requirements listed in II.3.2, and it features a number of key weaknesses…
Page 53
The study of general artificial intelligence is a field still in its infancy, and we do not wish to convey the impression that we have provided a definitive solution to the problem of characterizing and measuring the intelligence held by an AI system.
Page 56
Give the AI tools and let it do real stuff in the world:
"FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. Maybe try to get funded by YC - hiring a human presenter for Demo Day is allowed. They will be graded on profit / loss, and valuation.
Testing plain LLM on whiteboard-style question is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory & goals, and delegation.
The diagnosis is pattern matching (again, roughly). It kinda suggests that a lot of "intelligent" problems are focused on pattern matching, and (relatively straightforward) application of "previous experience". So, pattern matching can bring us a great deal towards AGI.
However, it does rub me the wrong way - as someone who's cynical of how branding can enable breathless AI hype by bad journalism. A hypothetical comparison would be labelling SHRDLU's (1968) performance on Block World planning tasks as "ARC-AGI-(-1)".[0]
A less loaded name like (bad strawman option) "ARC-VeryToughSymbolicReasoning" should capture how the ARC-AGI-n suite is genuinely and intrinsically very hard for current AIs, and what progress satisfactory performance on the benchmark suite would represent. Which Chollet has done, and has grounded him throughout![1]
[0] https://en.wikipedia.org/wiki/SHRDLU [1] https://arxiv.org/abs/1911.01547
In practice when I have seen ARC brought up, it has more nuance than any of the other benchmarks.
Unlike Humanity's Last Exam, which is the most egregious example I have seen, both in its naming and in how it is referenced when discussing an LLM's capability.
My definition of AGI is the one I was brought up with, not an ever moving goal post (to the "easier" side).
And no, I also don't buy that we are just stochastic parrots.
But whatever. I've seen many hypes and if I don't die and the world doesn't go to shit, I'll see a few more in the next couple of decades
"We argue that human cognition follows strictly the same pattern as human physical capabilities: both emerged as evolutionary solutions to specific problems in specific evironments" (from page 22 of On the Measure of Intelligence)
I feel like I'm the only one who isn't convinced getting a high score on the ARC eval test means we have AGI.
Wait, what? Approximately nobody is claiming that "getting a high score on the ARC eval test means we have AGI". It's a useful eval for measuring progress along the way, but I don't think anybody considers it the final word.
But then, I guess it wouldn't be "overfitting" after all, would it?
But on a serious note, I don't think Chollet would disagree. ARC is a necessary but not sufficient condition, and he says that, despite the unfortunate attention-grabbing name choice of the benchmark. I like Chollet's view that we will know that AGI is here when we can't come up with new benchmarks that separate humans from AI.
It's mostly about pattern matching...
For all we know, human intelligence is just an emergent property of really good pattern matching.
Perhaps it's because the representations are fractured. The link above is to the transcript of an episode of Machine Learning Street Talk with Kenneth O. Stanley about The Fractured Entangled Representation Hypothesis[1]
If we assume that humans have "general intelligence", we would assume all humans could ace ARC... but they can't. Try asking your average person (supermarket workers, gas station attendants, etc.) to do the ARC puzzles; they will do poorly, especially on the newer ones. But AI has to do perfectly to prove it has general intelligence? (Not trying to throw shade here, but the reality is this test is more like an IQ test than an AGI test.)
Arc is a great example of AI researchers moving the goal posts for what we consider intelligent.
Let's get real, Claude Opus is smarter than 99% of people right now, and I would trust its decision making over 99% of people I know in most situations, except perhaps emotion driven ones.
Arc agi benchmark is just a gimmick. Also, since it's a visual test and the current models are text based it's actually a rigged (against the AI models) test anyway, since their datasets were completely text based.
Basically, it's a test of some kind, but it doesn't mean quite as much as Chollet thinks it means.
If we think humans have "GI" then I think we have AIs right now with "GI" too. Just like humans do, AIs spike in various directions. They are amazing at some things and weak at visual/IQ test type problems like ARC.
I think the charitable interpretation is that, if intelligence is made up of many skills, AIs are already superhuman at some of them, like image recognition.
And that therefore future efforts need to focus on the areas where AIs are significantly less skilled. Also, since they are good at memorizing things, knowledge questions are the wrong direction, and anything most humans can solve but AIs cannot, especially something as generic as pattern matching, should be an important target.
My impression is that models are pretty bad at interpreting grids of characters. Yesterday, I was trying to get Claude to convert a message into a cipher where it converted a 98-character string into a 7x14 grid, with sequential letters moving 2 right and 1 down (i.e., like a knight in chess). Claude seriously struggled.
Yet Francois always pumps up the "fluid intelligence" component of this test and emphasizes how easy these are for humans. Yet humans would presumably be terrible at these tasks if they looked at them character-by-character.
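For the curious, here is one possible reading of that cipher task. The description is underdetermined: a literal 1-down/2-right step on a 7x14 torus revisits its starting cell after only 7 moves, so this sketch adds an assumed collision rule of scanning forward to the next empty cell.

    def knight_grid_cipher(message: str, rows: int = 7, cols: int = 14):
        """Place characters on a rows x cols grid, stepping 1 down and 2 right
        (with wraparound) between consecutive characters.

        Note: on a 7x14 torus the (1 down, 2 right) step returns to the start
        after only 7 moves, so a literal reading of the rule collides with
        itself; this sketch resolves collisions by scanning forward to the next
        empty cell, which is only one plausible interpretation.
        """
        assert len(message) <= rows * cols
        grid = [[None] * cols for _ in range(rows)]
        r = c = 0
        for ch in message:
            while grid[r][c] is not None:          # collision rule (assumption)
                c = (c + 1) % cols
                if c == 0:
                    r = (r + 1) % rows
            grid[r][c] = ch
            r, c = (r + 1) % rows, (c + 2) % cols  # the knight-like step
        return ["".join(ch or "." for ch in row) for row in grid]

    for row in knight_grid_cipher("THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG"):
        print(row)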
This feels like a somewhat similar (intuition-lie?) case as the Apple paper showing how reasoning models can't do Tower of Hanoi past ~10 disks. Readers will intuitively think about how they themselves could tediously do an infinitely long Tower of Hanoi, which is what the paper is trying to allude to. However, the more appropriate analogy would be writing out all >1000 moves on a piece of paper at once and being 100% correct, which is obviously much harder.
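For reference, the optimal move count is 2^n - 1, so the ">1000 moves" figure checks out at 10 disks:

    def hanoi(n, src="A", aux="B", dst="C"):
        """Return the full optimal move list for n disks (2^n - 1 moves)."""
        if n == 0:
            return []
        return hanoi(n - 1, src, dst, aux) + [(src, dst)] + hanoi(n - 1, aux, src, dst)

    moves = hanoi(10)
    print(len(moves), 2**10 - 1)   # 1023 1023: writing them all out error-free is the hard part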
I've seen a simple ARC-AGI test that took the open set, and doubled every image in it. Every pixel became a 2x2 block of pixels.
If LLMs were bottlenecked solely by reasoning or logic capabilities, this wouldn't change their performance all that much, because the solution doesn't change all that much.
Instead, the performance dropped sharply - which hints that perception is the bottleneck.
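The transformation in question (paraphrased from memory) is trivial to state in code, which is part of the point: each cell becomes a 2x2 block, the underlying rule is unchanged, yet the model now has four times as many cells to perceive.

    def upscale_2x(grid):
        """Replace every cell with a 2x2 block of the same value."""
        out = []
        for row in grid:
            doubled_row = [v for v in row for _ in range(2)]   # duplicate columns
            out.extend([doubled_row[:], doubled_row[:]])       # duplicate rows
        return out

    original = [[1, 0],
                [0, 2]]
    for row in upscale_2x(original):
        print(row)
    # [1, 1, 0, 0]
    # [1, 1, 0, 0]
    # [0, 0, 2, 2]
    # [0, 0, 2, 2]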
Personally I don't think it's possible at this stage. The cat's out of the bag (this new class of tools are working) the economic incentive is way too strong.
Which I have to admit I was kind of disappointed by.
This is in contrast to the way that GPT-2/3/“original 4” work, which is by repeatedly generating the next finalized token based on the full dialogue thus far.
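A minimal sketch of that "emit one finalized token at a time, conditioned on everything so far" loop; next_token_distribution is a hypothetical stand-in for whatever trained model is being discussed:

    import random

    def next_token_distribution(context):
        """Hypothetical stand-in for a language model's output layer:
        returns (token, probability) pairs conditioned on the full context."""
        return [("the", 0.4), ("a", 0.3), ("<eos>", 0.3)]

    def generate(prompt, max_tokens=50):
        tokens = list(prompt)
        for _ in range(max_tokens):
            dist = next_token_distribution(tokens)
            token = random.choices([t for t, _ in dist],
                                   weights=[p for _, p in dist])[0]
            if token == "<eos>":
                break
            tokens.append(token)          # once emitted, a token is final
        return tokens

    print(generate(["Once", "upon"]))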
https://arxiv.org/abs/1911.01547
GPT-3 didn't come out until 2020.
That said, I'd still listen to these two guys (+ Schmidhuber) more than any other AI guy.
One thing he showed is that you can't have a universe with two omniscient intelligences (as it would be intractable for them to predict the other's behavior.)
It's also very questionable whether "humanlike" intelligence is truly general in the first place. I think cognitive neurobiologists would agree that we have a specific "cognitive niche", and while this symbolic niche seems sufficiently general for a lot of problems, there are animals that make us look stupid in other respects. This whole idea that there is some secret sauce special algorithm for universal intelligence is extremely suspect. We flatter ourselves and have committed to a fundamental anthropomorphic fallacy that seems almost cartoonishly elementary for all the money behind it.
You can't define AGI, any more than you can define ASA (artificial sports ability). Intelligence, like athleticism, changes both quantitatively and qualitatively. The Greek Olympic champions of 2K yrs ago wouldn't qualify for high school championships today; however, they were once regarded as great athletes.
It's conceivable (though not likely) that given enough training in symbolic mathematics and some experimental data, an LLM-style AI could figure out a neat reconciliation of the two theories. I wouldn't say that makes it AGI though. You could achieve that unification with an AI that was limited to mathematics rather than being something that can function in many domains like a human can.
But consider: technically AlphaTensor found new algorithms no human did before (https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...). So isn't it AGI by your definition of answering a question no human could before: how to do 4x4 matrix multiplication in 47 steps?
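For context on what counting "steps" means here: the schoolbook method needs 8 scalar multiplications for a 2x2 product, Strassen's 1969 construction needs 7, and AlphaTensor's 47-multiplication result for 4x4 matrices (in mod-2 arithmetic) plays the same game at a larger size. A quick check of the classic 2x2 case:

    def strassen_2x2(A, B):
        """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen, 1969)."""
        (a11, a12), (a21, a22) = A
        (b11, b12), (b21, b22) = B
        m1 = (a11 + a22) * (b11 + b22)
        m2 = (a21 + a22) * b11
        m3 = a11 * (b12 - b22)
        m4 = a22 * (b21 - b11)
        m5 = (a11 + a12) * b22
        m6 = (a21 - a11) * (b11 + b12)
        m7 = (a12 - a22) * (b21 + b22)
        return [[m1 + m4 - m5 + m7, m3 + m5],
                [m2 + m4,           m1 - m2 + m3 + m6]]

    def naive_2x2(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    A, B = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
    assert strassen_2x2(A, B) == naive_2x2(A, B)   # [[19, 22], [43, 50]]
    print(strassen_2x2(A, B))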
The point is to measure fluid intelligence in a way which supports comparisons between models and between models and humans. It's not the obligation of the test to be tailored to the form of model that's most popular now.
The problem is that the test may not be giving an accurate comparison because the test is problematic when used to assess LLMs, which are the kind of model that people are most interested in assessing for general capabilities.
https://news.ycombinator.com/item?id=44492241
My comment was basically instantly flagged. I see at least 3 other flagged comments that I can't imagine deserve to be flagged.
If you see a talk like: "How we will develop diplomacy with the rat-people of TRAPPIST-5." you don't have to make some argument about super-earths and gravity and the rocket equation. You can just point out it's absurd to pretend to know something like whether there are rat-people there.
Either way, it isn't flag-able!
Getting a perfect ARC-AGI-n score isn't a smoking gun indicator of general intelligence. Rather, it simply means we're now able to solve a class of problems previously beyond AI capabilities (which is exciting in itself!).
I view ARC-AGI primarily as a benchmark (similar in spirit to Raven's matrices) that makes memorization substantially harder. Compare this with vocabulary-focused IQ tests, where cognitive skills certainly matter, but results depend heavily on exposure to a particular language.
The second highlight from this video is the section from 29 minutes onward, where he talks about designing systems that can build up rich libraries of abstractions which can be applied to new problems. I wish he had lingered more on exploring and explaining this approach, but maybe they're trying to keep a bit of secret sauce because it's what his company is actively working on.
One of the major points which seems to be emerging from recent AI discourse is that the ability to integrate continuous learning seems like it'll be a key element in building AGI. Context is fine for short tasks, but if lessons are never preserved you're severely capped with how far the system can go.
Look how we learned physics. Aristotelian physics was "An object in motion tends to come to a stop." That looked right most of the time: a bowling ball on sand, grass, or even dirt comes to a stop pretty fast. But once you have a nice smooth marble floor, the ball goes a lot further.
Newtonian physics solved that and several other issues and works fine, most of the time, but has corner cases when going very fast or getting near a high gravity location. Then relativity and the rest.
We need to build a system that we can teach like we do children: one that can reason that something is true under certain circumstances but may not hold generally, and so has to update what "true" is. And that looks like statistics.
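One toy way to picture "true under certain circumstances, updated when it stops holding" is plain Bayesian updating; the counts below are invented:

    # Beta-Binomial update of the belief "an object set in motion comes to a stop."
    # Prior: shaped by everyday observations on sand, grass, and dirt.
    alpha, beta = 20.0, 1.0            # ~95% prior confidence in Aristotle's rule

    observations = [True] * 5 + [False] * 20   # then: smooth marble floors, air tracks...

    for stopped in observations:
        if stopped:
            alpha += 1
        else:
            beta += 1

    print(f"P(rule holds) ≈ {alpha / (alpha + beta):.2f}")   # belief degrades gracefully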
There are dozens of ready-made, well-designed, and very creative games there. All are tile-based and solved with only arrow keys and a single action button. Maybe someone should make a PuzzleScript AGI benchmark?
https://nebu-soku.itch.io/golfshall-we-golf
Maybe someone can make an MCP connection for the AIs to practice. But I think the idea of the benchmark is to reserve some puzzles for private evaluation, so that they're not in the training data.
So we might say, “General Intelligence is the ability to do the things we haven’t yet thought of.”
Like what?
Well, as soon as I name something it stops counting.
Gödelian - I like it. Does that mean a constructive definition of General Intelligence is uncomputable?
I think the debate has been caught flat-footed by the speed at which all this happened. We're not talking AGI any more; we're talking about how to build superintelligences hitherto unseen in nature.
I enjoy seeing people repeatedly move the goalposts for "intelligence" as AIs simply get smarter and smarter every week. Soon AI will have to beat Einstein in Physics, Usain Bolt in running, and Steve Jobs in marketing to be considered AGI...
There's already a big meaningful gap between the things AIs can do which humans can't, so why do you only count as "meaningful" the things humans can do which AIs can't?
Where did I say there was nothing meaningful about current capabilities? I'm saying that's what is novel about a claim of "AGI" (as opposed to a claim of "computer does something better than humans", which has been an obviously true statement since the ENIAC) is the ability to do at some level everything a normal human intelligence can do.
If AI at least equal humans in all intellectual fields then they are super-intelligences, because there are already fields where they dominate humans so outrageously there isn't a competition (nearly all fields, these days). Before they are superintelligences there is a phase where they are just AGIs, we've been in that phase for a while now. Artificial superintelligence is very exciting, but Artificial non-super Intelligence or AGI is here with us in the present.
To me, superintelligence means specifically either dominating us in our highest intellectual accomplishments, i.e. math, science, philosophy or literally dominating us via subordinating or eliminating humans. Neither of these things have happened at all.