...Their agent is called "Agentica ARC-AGI-3 agent for Opus 4.6 (120k) High".
Yes, it's unfair to compare results for the 25 (easier) public games against scores for the 55 semi-private games (scores for which are taken from https://arcprize.org/leaderboard).
But you're wrong to say that a custom harness invalidates the result. Yes, the official "ARC verified" scoreboard for frontier LLMs requires (https://arcprize.org/policy):
> using extremely generic and minimal LLM testing prompts, no client-side "harnesses", no hand-crafted tools, and no tailored model configuration
but these limitations exist to compare LLMs from frontier labs on an equal footing; they don't apply to submissions in general. It's not as if a solution to ARC-AGI-3 must involve training a custom LLM! This Agentica harness is a completely legitimate approach to ARC-AGI-3, similar to J. Berman's for ARC-AGI-1/2, for example.
I’m not saying it invalidates the result. I am saying that they knew the headline and comparison were not correct and they still decided to roll with it. It’s an incorrect representation of what happened, designed to get eyeballs and possibly VC dollars.