Notes · model evaluation

Do open-weight coding models live up to their benchmarks?

I ran a batch of open-weight models against fresh, uncontaminated bug-fix tasks to see how they actually stack up. The benchmark headline turned out to be the least useful number — and then I found something that mattered more than any of the models.

Jason · June 2026 · updated

Open-weight model releases have been relentless lately, and every one of them ships with a chart showing it perched near the top of SWE-bench Verified. But a benchmark that's been public for a year is easy to over-fit and hard to trust. So I set up a small, deliberately boring test: take a handful of capable open-weight models, point them at real GitHub bug-fix tasks they almost certainly haven't been tuned on, and see how they actually do, and what each solve costs. The first round told me one thing. Then I noticed I'd been measuring the wrong variable.

What others have already found

I'm not the first to be suspicious here, and the published work is worth setting alongside a small experiment like this, partly because it agrees with the broad strokes, and partly because it points straight at the thing I missed the first time.

Benchmarks leak. Earlier this year OpenAI stopped reporting SWE-bench Verified after a manual audit found that most of its models' "failures" were broken tests, not model mistakes. Academic audits go further: SWE-bench+ found that roughly a third of "solved" patches were really the solution leaking through the issue text, and another third passed on tests too weak to confirm a fix; filtering those collapsed one agent's score from 12.5% to about 4%. The newer SWE-bench Pro, built on copyleft and private repos, shows models that cleared ~70% on Verified landing nearer 23%. The fix the field has converged on is simply fresher tasks, and that is the route I took: my tasks come from SWE-rebench (Nebius), a continuously-refreshed benchmark that mines new GitHub issues and tracks contamination against each model's release date, so every task is genuinely new to the models under test. My own finding, that the leaderboard rank barely transferred, is the small-sample version of what this line of work has documented at scale. On this, we agree.

The interface is part of the model's performance. The original SWE-agent paper (NeurIPS 2024) made exactly this its headline: a purpose-built "agent-computer interface" (simple view/edit/search commands with guardrails) more than tripled GPT-4's resolve rate over the same model driving a plain shell, and their ablation pinned about ten points of that on the interface alone. Their conclusion was that interface design matters as much as model capability. I backed into the corollary the hard way: if a good interface adds ten points, a quietly bad one subtracts them, and nothing in your benchmark number tells you it happened.

More thinking isn't more solving. A run of 2025–26 papers ("When More Thinking Hurts", the "Mirage of Test-Time Scaling") find that reasoning tokens carry diminishing and sometimes negative returns: past a point, models talk themselves out of correct answers. That matches what I saw on cost and on the reasoning dial.

So the agreements are clear: benchmarks are contaminated, scaffolds matter, and reasoning has a ceiling. Where I'd add a wrinkle, and gently push back on how the scaffold result is usually framed, is where the harness effect actually shows up. I'll come back to that at the end.

How I set it up

I wanted the tasks to be representative of real agentic coding work and impossible to have memorized, so I kept the rules strict and identical for every model:

Real bug-fix tasks from SWE-rebench (the decontaminated, continuously-refreshed benchmark above): recent GitHub issues filtered against each model's training cutoff, so there is no overlap with anything a model is likely to have trained on.
Blind. The model never sees the failing test. It has to read the issue, localize the bug, and write a fix on its own. I then grade by running the hidden tests in a fresh sandbox.
One shot. Single attempt (pass@1), temperature held constant, a fixed step budget.
Same harness for everyone. A plain agent loop (run shell commands, read and edit files, run tests, submit) with no model-specific scaffolding. A level playing field, not each vendor's bespoke agent framework.

For every run I recorded how many tasks it solved, the cost per solved task at public API pricing, the reasoning tokens it burned, and how many steps it took. The first round of contenders: MiniMax M3, DeepSeek-V4-Pro, Kimi K2.6, Kimi K2.7-Code, GLM-5.1, and Nex-N2-Pro.
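A minimal sketch of how those per-run numbers could roll up into a cost per solved task. It assumes the metric is total API spend across all attempts divided by the number of solves, so failed attempts are charged to the solves. The record fields, prices, and token counts are hypothetical placeholders, not figures from these runs.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    solved: bool          # hidden tests passed
    input_tokens: int     # prompt tokens billed across the whole trajectory
    output_tokens: int    # completion tokens (assumed to include reasoning tokens)
    steps: int            # tool calls taken


def cost_per_solve(runs, usd_per_m_input, usd_per_m_output):
    """Total spend over all runs divided by the number of solves (None if no solves)."""
    total = sum(
        r.input_tokens / 1e6 * usd_per_m_input
        + r.output_tokens / 1e6 * usd_per_m_output
        for r in runs
    )
    solved = sum(r.solved for r in runs)
    return total / solved if solved else None


# Hypothetical example: one solve, one failure, made-up prices per million tokens.
runs = [
    TaskRun(solved=True, input_tokens=400_000, output_tokens=10_000, steps=25),
    TaskRun(solved=False, input_tokens=600_000, output_tokens=30_000, steps=40),
]
example_cost = cost_per_solve(runs, usd_per_m_input=1.0, usd_per_m_output=4.0)
```

Charging failures to the solves is a design choice: it makes a model that burns tokens on tasks it never finishes look as expensive as it is in practice.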

The first results

The headline chart is cost versus solves. Each bubble is a model; bigger bubbles burned more reasoning. The interesting region is down and to the left: cheap and effective.

Chart legend: the most efficient model is marked separately from the others; bubble size = reasoning tokens per task.

Cost per solved task (log scale) vs tasks solved, 10 blind tasks. Down and to the left is better.

And the full table:

Model	Solved	$ / solve	Reasoning / task	Note
MiniMax M3	5 / 10	$0.05	~200 tok	most efficient
GLM-5.1	5 / 10	$0.20	up to ~7k tok	verbose
Kimi K2.6	4 / 10	$0.49	up to ~11k tok	costly
Kimi K2.7-Code	3 / 10	$0.31	~210 tok	lean
DeepSeek-V4-Pro	4 / 10	$0.79	~9,400 tok	over-thinks
Nex-N2-Pro	4 / 10	free via OR	up to ~27k tok	bloated

The leaderboard rank didn't transfer. Models that sit near the top of SWE-bench Verified (DeepSeek-V4 and the Kimi models all advertise around 80%) didn't pull ahead on these fresh tasks. Everything landed between three and five solves out of ten. With only 10 tasks that's noisy, but the absence of separation is the point: a two-point gap on a year-old leaderboard told me almost nothing about how a model would do on a repo it had never seen.
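To put a rough size on "noisy": the sketch below computes a 95% Wilson score interval for a solve rate measured on 10 tasks. It treats each task as an independent pass/fail trial, which is a simplification (the models ran the same tasks, so a paired comparison would be tighter), and the interval method is my choice for illustration, not something the runs above used.

```python
import math


def wilson_interval(solved, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = solved / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low3, high3 = wilson_interval(3, 10)   # weakest first-round result
low5, high5 = wilson_interval(5, 10)   # strongest first-round result
print(f"3/10: {low3:.2f} to {high3:.2f}")
print(f"5/10: {low5:.2f} to {high5:.2f}")
```

The two intervals come out at roughly 11% to 60% and 24% to 76%, so they overlap heavily: on ten tasks, three solves and five solves are not distinguishable with any confidence.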

The variable I wasn't controlling: my own harness

Here's the part that humbled me. I'd been reading these as model results. But when I stopped staring at the pass/fail column and actually watched how the models worked, I saw something else: they weren't failing because they didn't know the fix. They were flailing, and a lot of the flailing was my harness's fault.

Two bugs, both subtle, both mine:

The working directory reset between every command. Each shell call ran in a fresh process, so when a model cd'd into the repo, the next command was back at the root. Any model with the entirely reasonable habit of cd path && do-thing tripped over this constantly, then wasted turns trying to figure out why its commands kept missing.
The file editor demanded a whitespace-perfect match of the text to replace. One stray space between the model's memory of a line and the file on disk, and the edit bounced, sending the model into a cat/sed recovery spiral to re-read and retry.
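The first failure mode is easy to reproduce. This is a sketch of the behaviour, not my harness's actual code: it assumes a POSIX shell, and the function names are made up for illustration. Each `subprocess.run` call starts a new shell, so a `cd` issued in one call is gone by the next; passing `cwd=` is one way to pin where every command starts.

```python
import os
import shlex
import subprocess
import tempfile


def run_fresh(command):
    """Run in a brand-new shell; nothing (including cwd) carries over between calls."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.strip()


def run_pinned(command, repo_root):
    """Same call, but the shell always starts in repo_root."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, cwd=repo_root
    )
    return result.stdout.strip()


repo_root = os.path.realpath(tempfile.mkdtemp())  # stand-in for a checked-out repo
start_dir = os.path.realpath(os.getcwd())

run_fresh(f"cd {shlex.quote(repo_root)}")            # the model "enters" the repo
after_cd = os.path.realpath(run_fresh("pwd -P"))     # next call is back where it started
pinned = os.path.realpath(run_pinned("pwd -P", repo_root))

print("after cd:", after_cd)
print("pinned:  ", pinned)
```

Here `after_cd` is the harness's own starting directory rather than the repo, which is exactly the surprise the models kept hitting.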

Across a run, more than a third of all tool results were errors, and most of those were these two failure modes, not the model being wrong about the code. So I fixed both: pin every command to the repo root and tell the model exactly where it is; make the editor tolerant of whitespace drift (match on content, re-indent to the file). Then I took one capable model and re-ran it on 150 fresh tasks, twice: once on the old harness, once on the fixed one.
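A minimal sketch of the "match on content, re-indent to the file" idea, under my own assumptions rather than as the exact editor I shipped: lines are compared with leading and trailing whitespace stripped, the edit is applied only if exactly one location matches, and the replacement is shifted to the indentation of the first matched line. The function name and the unique-match rule are illustrative choices.

```python
def _indent_of(line):
    """Leading whitespace of a line."""
    return line[: len(line) - len(line.lstrip())]


def tolerant_replace(source, old, new):
    """Replace `old` with `new` in `source`, ignoring per-line whitespace drift.

    Raises ValueError unless `old` matches exactly one location.
    """
    src_lines = source.splitlines()
    old_lines = [ln.strip() for ln in old.strip().splitlines()]
    if not old_lines:
        raise ValueError("nothing to match")

    width = len(old_lines)
    hits = [
        i
        for i in range(len(src_lines) - width + 1)
        if [ln.strip() for ln in src_lines[i : i + width]] == old_lines
    ]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one match, found {len(hits)}")
    start = hits[0]

    # Re-indent: the replacement's first line takes the file's indentation,
    # and later lines keep their indentation relative to that first line.
    new_lines = new.strip("\n").splitlines()
    file_indent = _indent_of(src_lines[start])
    base = len(_indent_of(new_lines[0])) if new_lines else 0
    reindented = []
    for ln in new_lines:
        if not ln.strip():
            reindented.append("")
            continue
        cut = min(len(_indent_of(ln)), base)
        reindented.append(file_indent + ln[cut:])

    result = src_lines[:start] + reindented + src_lines[start + width :]
    return "\n".join(result) + ("\n" if source.endswith("\n") else "")


source = "def area(w, h):\n    total = w + h\n    return total\n"
# The model remembers the line without its indent and with trailing spaces.
edited = tolerant_replace(source, "total = w + h  ", "total = w * h")
print(edited)
```

Refusing ambiguous matches matters as much as tolerating drift: a loose matcher that silently edits the wrong one of two similar lines would trade a visible error for an invisible one.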

The number of tasks solved barely moved. But the share of clean solves (solved efficiently, with a minimal diff and no thrashing) transformed:

old harness: ~10%

fixed harness: 56%

Share of solved tasks that were clean solves: same model, same 150 tasks, only the harness changed. Heavy thrashing fell from a majority of solves to roughly one in ten.

Same model. Same tasks. A roughly 6× swing in solution quality, from two plumbing fixes. A lot of "this open model is bad at agentic coding" is really "my harness is fighting the model."

Re-testing the top of the field, fairly

With a harness that wasn't sabotaging anyone, I re-ran the strongest contenders head-to-head on a fresh set of tasks, and added a model that wasn't in the first round: Qwen3.7-Max. The first-round pattern held, with efficiency still separating the field more than accuracy did, but now I could trust it.

Model (fixed harness)	Solved	Steps / task	Tool errors	Note
Qwen3.7-Max	5 / 11	~29	lowest	cleanest
MiniMax M3	4 / 11	~39	higher	wordier path

Qwen3.7-Max solved the most and got there in the fewest steps with the fewest dead-end commands; M3 stayed close on accuracy but took about 35% more steps to land the same fixes. On a head-to-head of the tasks both solved, it was nearly a coin flip on quality: Qwen won on getting more of the fix right the first time, M3 won on a couple of tidier diffs. Close enough that I'd call it a real contest rather than a blowout.

What I took away from it

1. Before you blame the model, check the harness. This is the one that changed how I work. The scaffold around an open-weight model (how it runs commands, how it edits files, how it recovers from a mistake) swung real-world success more than the choice of model did. If you're evaluating models on a harness that resets the working directory or rejects a near-miss edit, you're mostly benchmarking your own plumbing.

2. Efficiency, not accuracy, is where they actually diverge. Solve counts cluster; cost and verbosity don't. In the first round DeepSeek-V4-Pro burned roughly 9,400 reasoning tokens per task, about 47× the ~200 that MiniMax M3 used, and solved one task fewer (4 vs 5). On a task it failed, it spent over 20,000 reasoning tokens exploring and never committed a fix. For real agentic work, where every step is a billable round-trip, that compounds fast.

A lot of what gets sold as "more reasoning" turns out to be motion, not progress. The models that thrashed the longest weren't the ones that solved the most.

3. Cranking the reasoning dial mostly does nothing. Several of these models expose a reasoning-effort setting with a "high" and a "max" mode. Where they did, I tested both. On Qwen3.7-Max, turning it to max produced fewer reasoning tokens, not more (about 5.2k vs 7.5k), and solved one fewer task; the knob did nothing, or slightly hurt. Most models just think however much they think, regardless of what you ask for; only a couple respond cleanly to the dial.

4. Passing the tests isn't the same as solving the problem. Once I separated "did the hidden tests pass" from "did it solve this cleanly," the picture sharpened. On the broken harness, only about one solved task in ten was actually a clean solve; the rest stumbled into a green test after a mess of failed commands. If you only look at pass/fail, two models can look identical while one is calmly making a two-line fix and the other is setting fires and putting them out. Score the path, not just the outcome.
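One way to score the path as well as the outcome, as a sketch: the record fields and every threshold below are hypothetical stand-ins, since a "clean solve" can be defined in more than one reasonable way. The idea is that a solve counts as clean only if the run passed, kept its tool-error rate low, stayed within a step limit, and produced a small diff.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    passed: bool       # hidden tests green
    steps: int         # tool calls taken
    tool_errors: int   # tool calls that returned an error
    diff_lines: int    # lines changed in the final patch


def is_clean(t, max_error_rate=0.1, max_steps=30, max_diff_lines=20):
    """Hypothetical rubric: passed, few tool errors, few steps, small diff."""
    if not t.passed or t.steps == 0:
        return False
    return (
        t.tool_errors / t.steps <= max_error_rate
        and t.steps <= max_steps
        and t.diff_lines <= max_diff_lines
    )


def summarize(trajectories):
    solved = [t for t in trajectories if t.passed]
    clean = [t for t in solved if is_clean(t)]
    return {
        "pass_at_1": len(solved) / len(trajectories),
        "clean_share": len(clean) / len(solved) if solved else 0.0,
    }


# Two made-up runs with identical pass@1 but very different paths.
calm = [Trajectory(True, 12, 0, 2), Trajectory(False, 30, 4, 0)]
thrashy = [Trajectory(True, 58, 21, 2), Trajectory(False, 60, 25, 0)]
print(summarize(calm))
print(summarize(thrashy))
```

Both runs report a pass@1 of 0.5, while the clean share is 1.0 for the first and 0.0 for the second, which is the gap a pass/fail column hides.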

The honest caveats

Small, noisy samples for the model comparisons (ten to fifteen tasks per cut), so treat this as a directional signal, not a ranking. Even my "fixed" harness is still one harness; a different scaffold would shift the absolute numbers again (which is rather the point). This is my own selection of tasks, not a standardized suite, so your mileage on different repos and languages will differ. And the before/after harness comparison is one capable model run twice, not all seven; I'd want a wider sweep before treating the exact 6× as anything but illustrative. I'd happily be wrong on any single model; the patterns across the set are what stuck.

What this adds

The prior work already establishes two things: a good scaffold helps, and benchmarks leak. Set against that, this small experiment makes two claims of its own.

First, it isolates the harness as a cause, not a correlate. SWE-agent compared different interfaces across different runs. Here the model and the exact task set are held fixed and the only thing that changes is the harness, so the swing is attributable to the harness alone. And the swing is large: roughly a 6× change in clean-solve rate from two plumbing fixes.

Second, and more useful: the harness's effect is mostly invisible to the standard metric. The number of tasks solved barely moved; what moved was solution quality. So pass@1, the number every leaderboard reports, is largely blind to the harness, because a model that flails its way to a green test and one that makes a calm two-line fix score identically. The consequence is uncomfortable: every public pass@1 comparison is silently confounded by the scaffold it ran on. If you care how a model behaves in a real agentic loop, the leaderboard number alone will mislead you; you have to score the path, not just the outcome.

The takeaway I'm keeping: when you're choosing an open-weight model for real coding work, the benchmark headline is the least useful number on the page; cost-per-solved-task and reasoning efficiency tell you far more. But the bigger lesson was quieter. Before you decide a model is good or bad at this, make sure the harness underneath it isn't the thing you're actually measuring.

Models accessed via their public APIs in June 2026; pricing and behavior change over time.