A brief take on benchmarks
On the theory-laden nature of benchmarks
Francois Chollet published the Abstraction and Reasoning Corpus (ARC) in 2019 as part of his paper "On the Measure of Intelligence". He argues that intelligence should not be understood as achieving high performance on a particular task or benchmark, such as those for computer vision or language understanding. Rather, intelligence should be understood as the ability to acquire skills to solve problems.

ARC has proved to be quite a difficult benchmark over the years. A plethora of models, many of them based on LLMs, failed to reach human performance on ARC (i.e. 85% accuracy), despite achieving state-of-the-art results on benchmarks spanning language understanding, reasoning, math, and coding, among others.
That changed at the end of 2024, when OpenAI announced its o3 model, which surpassed human performance via test-time search.
Is ARC solved, and intelligence thus understood? Well, not even Francois Chollet believes that o3 has solved intelligence or that it is the basis for AGI.
ARC, like all benchmarks in deep learning, tests models on problems that humans anticipated in advance, which underscores its theory-laden nature. However, intelligence, as argued by Francois Chollet, is the ability to acquire skills to solve problems.
The environment, or real world, is a source of infinitely many problems. Thus, a more comprehensive benchmark of intelligence must not only simulate the diversity of problems that prompts a deep learning model to acquire new skills, but also the infiniteness of problems that can drive the acquisition of an unbounded number of skills.
Novelty in benchmarks
Reflecting on this, we need a way to build a dynamic benchmark which can:
- become incrementally more difficult, to constantly drive the acquisition of new skills
- require minimal human input, automating the growth of challenging problems and thus driving novel solutions
Given the above desiderata, there is perhaps a way to train a bi-agentic framework built around a feedback loop between a problem-solver and a problem-generator.

In this loop, we aim to convey how:
- as a starting point, a human designs some initial problems
- then, the problem-solver generates solutions to the existing problems
- the initial, simple solutions inspire more complicated problems, which are compositions of simpler problems
- the more complicated problems drive novel solutions, which are compositions of simpler solutions
- and so on and so forth…
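To make the loop concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration and not a proposed benchmark: problems are hidden integer functions shown only through input/output examples, solutions are sequences of two hypothetical primitives (`inc` and `double`), the solver is a brute-force search over compositions of known skills, and the names `solve`, `generate` and `co_evolve` are made up. Because the generator here builds problems from the solver's own skills, every generated problem is solvable by construction, which a real problem-generator would not guarantee.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Solution = Tuple[str, ...]              # primitive names, applied left to right
Problem = Tuple[Tuple[int, int], ...]   # (input, expected output) examples

# Hypothetical primitive skills the solver starts with.
PRIMITIVES: Dict[str, Callable[[int], int]] = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
}

def run(solution: Solution, x: int) -> int:
    """Apply each primitive of the solution in order."""
    for name in solution:
        x = PRIMITIVES[name](x)
    return x

def as_problem(solution: Solution, inputs: Tuple[int, ...] = (0, 1, 2, 3)) -> Problem:
    """Hide a solution behind input/output examples."""
    return tuple((x, run(solution, x)) for x in inputs)

def solve(problem: Problem, skills: List[Solution]) -> Optional[Solution]:
    """Problem-solver: try known skills first, then compositions of two skills."""
    candidates = skills + [a + b for a, b in product(skills, repeat=2)]
    for candidate in candidates:
        if all(run(candidate, x) == y for x, y in problem):
            return candidate
    return None

def generate(skills: List[Solution], limit: int) -> List[Problem]:
    """Problem-generator: compose pairs of known skills into unseen problems."""
    known = {as_problem(s) for s in skills}
    new_problems: List[Problem] = []
    for a, b in product(skills, repeat=2):
        problem = as_problem(a + b)
        if problem not in known:
            known.add(problem)
            new_problems.append(problem)
        if len(new_problems) == limit:
            break
    return new_problems

def co_evolve(seed_problems: List[Problem], rounds: int, limit: int = 4) -> List[Solution]:
    """Alternate solving and generating; return every skill acquired."""
    skills: List[Solution] = [(name,) for name in PRIMITIVES]
    problems = seed_problems             # round one: human-designed problems
    for _ in range(rounds):
        for problem in problems:
            solution = solve(problem, skills)
            if solution is not None and solution not in skills:
                skills.append(solution)  # a newly acquired skill
        problems = generate(skills, limit)  # harder problems for the next round
    return skills

if __name__ == "__main__":
    seeds = [as_problem(("inc",)), as_problem(("double",))]
    for skill in co_evolve(seeds, rounds=3):
        print(" -> ".join(skill))
```

The human contributes only the seed problems. After that, each round's problems are built from the previous round's solutions, so the longest solution grows by one primitive per round in this toy setting. The `limit` parameter is an arbitrary cap that keeps the number of generated problems per round small.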
Any comments? Feedback?