On the theory-laden nature of benchmarks

Francois Chollet published the Abstraction and Reasoning Corpus (ARC) in 2019 as part of his paper "On the Measure of Intelligence". In it, he argues that intelligence should not be understood as achieving high performance on a specific task, such as a computer vision or language understanding benchmark. Rather, intelligence should be understood as the ability to acquire skills to solve problems.

[Figure: A sample of an ARC problem]

ARC proved to be quite a difficult benchmark over the years. A plethora of models, many of them based on LLMs, failed to reach human performance on ARC (i.e. 85% accuracy), despite achieving state-of-the-art results across benchmarks spanning language understanding, reasoning, math, and coding, among others.

That held until the end of 2024, when OpenAI announced its o3 model, which surpassed human performance on ARC via test-time search.

Is ARC solved, and thus intelligence understood? Well, not even Francois Chollet believes o3 solved intelligence or is the basis for AGI.

ARC, like all benchmarks in deep learning, tests models on problems that humans anticipated in advance, which underscores its theory-laden nature. However, intelligence, as argued by Francois Chollet, is the ability to acquire skills to solve problems.

The environment, or real world, is a source of infinitely many problems. Thus, building a more comprehensive benchmark of intelligence requires not only simulating the diversity of problems that prompt a deep learning model to acquire new skills, but also simulating the unbounded supply of problems that can drive the acquisition of an unbounded number of skills.

Novelty in benchmarks

Reflecting on this, we need a way to build a dynamic benchmark which can:

become incrementally more difficult, to constantly drive the acquisition of new skills
require minimal human input, automating the growth of challenging problems and thus driving novel solutions

Given the above desiderata, there is perhaps a way to train a bi-agentic framework built around a harmonious feedback loop between a problem-solver and a problem-generator.

[Figure: Illustration of the feedback loop between problem-solver and problem-generator]

In this loop, we aim to convey how:

as a starting point, a human designs some initial problems
then, the problem-solver generates solutions to the existing problems
the initial, simple solutions inspire more complicated problems, which are compositions of simpler problems
the more complicated problems drive novel solutions, which are compositions of simpler solutions
and so on and so forth…
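
To make the loop above concrete, here is a minimal toy sketch in Python. It is only an illustration under strong assumptions, not a proposed implementation: "problems" are hidden compositions of a few hand-written transformations on tuples of integers (a crude stand-in for ARC grids), the problem-generator simply appends one more transformation to the last solved problem, and the problem-solver searches over its known skills and pairwise compositions of them. All names (PRIMITIVES, run, solve, generate, feedback_loop) are hypothetical and chosen for this example.

```python
from itertools import product

# Hypothetical human-designed primitives (the "initial problems" a human seeds).
PRIMITIVES = {
    "reverse": lambda g: g[::-1],
    "increment": lambda g: tuple(x + 1 for x in g),
    "double": lambda g: tuple(x * 2 for x in g),
}


def run(steps, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in steps:
        grid = PRIMITIVES[name](grid)
    return grid


def solve(examples, skills):
    """Problem-solver: look for a known skill, or a composition of two
    known skills, that reproduces every input/output example."""
    candidates = list(skills) + [a + b for a, b in product(skills, repeat=2)]
    for steps in candidates:
        if all(run(steps, x) == y for x, y in examples):
            return steps
    return None


def generate(solved_problem, round_index):
    """Problem-generator: make a harder problem by composing the problem
    that was just solved with one more primitive (chosen deterministically)."""
    names = sorted(PRIMITIVES)
    return solved_problem + (names[round_index % len(names)],)


def feedback_loop(rounds, inputs=((1, 2, 3), (0, 5, 2))):
    """Alternate between solving and generating for a fixed number of rounds."""
    skills = [(name,) for name in PRIMITIVES]  # the solver's skill library
    problem = ("reverse",)                     # human-designed seed problem
    history = []
    for round_index in range(rounds):
        # The solver only sees input/output pairs, never the hidden steps.
        examples = [(x, run(problem, x)) for x in inputs]
        solution = solve(examples, skills)
        if solution is None:
            break  # the solver failed; the loop stalls here
        if solution not in skills:
            skills.append(solution)  # a newly acquired skill
        history.append((problem, solution))
        problem = generate(problem, round_index)
    return history


if __name__ == "__main__":
    for problem, solution in feedback_loop(rounds=4):
        print("problem:", problem, "| solution:", solution)
```

In this toy, each new problem is exactly one primitive longer than the last solved one, so the solver can always reach it by composing its newest skill with a primitive. A real system would replace both the generator and the solver with learned models, and nothing here addresses how to keep generated problems meaningful rather than merely longer.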

Any comments? Feedback?
