Input Systems

Notes · model evaluation

Do open-weight coding models live up to their benchmarks?

With every new open-weight release claiming near-SOTA on SWE-bench, I ran six of them against a small set of fresh, uncontaminated coding tasks, measuring not just whether they solve them, but how efficiently.

Jason · June 2026

Open-weight model releases have been relentless lately, and every one of them ships with a chart showing it perched near the top of SWE-bench Verified. But a benchmark that's been public for a year is easy to over-fit and hard to trust. So I set up a small, deliberately boring test: take a handful of capable open-weight models, point them at real GitHub bug-fix tasks they almost certainly haven't been tuned on, and see how they actually do, and what each solve costs.

How I set it up

I wanted the tasks to be representative of real agentic coding work and impossible to have memorized, so I kept the rules strict and identical for every model:

For every run I recorded four numbers: how many tasks it solved, the cost per solved task (at public API pricing), the reasoning tokens it burned, and how many steps it took. The contenders: MiniMax M3, DeepSeek-V4-Pro, Kimi K2.6, Kimi K2.7-Code, GLM-5.1, and Nex-N2-Pro.
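One way these four numbers could be tallied is sketched below. The `TaskRun` record and its field names are hypothetical, and dividing total spend (failed attempts included) by the number of solves is an assumed definition of cost per solved task, not a confirmed detail of the harness used here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskRun:
    """One model attempt at one task (hypothetical record layout)."""
    solved: bool
    cost_usd: float        # API spend for this attempt
    reasoning_tokens: int  # reasoning tokens reported for this attempt
    steps: int             # agent round-trips


def summarize(runs: list[TaskRun]) -> dict:
    """Collapse one model's runs into the four headline numbers."""
    solved = sum(r.solved for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "solved": solved,
        # Assumed definition: all spend, failures included, per solve.
        "cost_per_solve": total_cost / solved if solved else None,
        "reasoning_per_task": mean(r.reasoning_tokens for r in runs),
        "steps_per_task": mean(r.steps for r in runs),
    }
```

Charging failed attempts to the solves is a deliberate choice in this sketch: a model that burns money on tasks it never fixes should look more expensive per solve, not less.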

The results

The headline chart is cost versus solves. Each bubble is a model; bigger bubbles burned more reasoning. The interesting region is down and to the left: cheap and effective. One model sits there alone.

[Figure: bubble chart of cost per solved task (log scale) versus tasks solved, 10 blind tasks. Bubble size = reasoning tokens per task; the most efficient model is highlighted. Down and to the left is better.]
Plotted values: MiniMax M3: $0.05/solve, 5/10. GLM-5.1: $0.20, 5/10. Kimi K2.6: $0.49, 4/10. Kimi K2.7-Code: $0.31, 3/10. DeepSeek-V4-Pro: $0.79, 4/10.

And the full table:

Model             Solved    $ / solve      Reasoning / task    Note
MiniMax M3        5 / 10    $0.05          ~200 tok            most efficient
GLM-5.1           5 / 10    $0.20          up to ~7k tok       verbose
Kimi K2.6         4 / 10    $0.49          up to ~11k tok      costly
Kimi K2.7-Code    3 / 10    $0.31          ~210 tok            lean
DeepSeek-V4-Pro   4 / 10    $0.79          ~9,400 tok          over-thinks
Nex-N2-Pro        4 / 10    free via OR    up to ~27k tok      bloated

What I took away from it

1. The leaderboard rank didn't transfer. Models that sit near the top of SWE-bench Verified (DeepSeek-V4 and the Kimi models all advertise around 80%) didn't pull ahead on these fresh tasks. Everything landed between three and five solves out of ten. With only 10 tasks that's noisy, and I wouldn't read a one-solve difference as meaningful. But the absence of separation is the point: a two-point gap on a year-old leaderboard told me almost nothing about how a model would do on a repo it had never seen.
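To put a rough number on how noisy ten tasks is, here is a minimal sketch that computes a 95% Wilson score interval for the lowest and highest solve counts in the table. The choice of the Wilson interval is an assumption made for illustration; it is a standard way to bound a proportion from a small sample, not something the original runs relied on.

```python
import math


def wilson_interval(solved: int, tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a solve rate."""
    p = solved / tasks
    denom = 1 + z**2 / tasks
    center = (p + z**2 / (2 * tasks)) / denom
    half = z * math.sqrt(p * (1 - p) / tasks + z**2 / (4 * tasks**2)) / denom
    return center - half, center + half


# Lowest and highest solve counts in the results table.
for solved in (3, 5):
    low, high = wilson_interval(solved, 10)
    print(f"{solved}/10 solved: plausible solve rate {low:.0%} to {high:.0%}")
```

The two intervals come out at roughly 11% to 60% and 24% to 76%, which overlap heavily. That is consistent with not reading a one- or two-solve gap as a real difference at this sample size.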

2. Efficiency, not accuracy, was where they actually diverged. The solve counts clustered; the cost and verbosity did not. DeepSeek-V4-Pro burned roughly 9,400 reasoning tokens per task, about 45× what MiniMax M3 used, to land one fewer solve (4 versus 5). That's a 10×-plus difference in cost for no measurable gain. On a task it failed, it spent over 20,000 reasoning tokens exploring and still never committed a fix.
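The ratios can be checked directly against the results table. This is just the arithmetic, using the table's rounded per-task figures, so the outputs are approximate.

```python
# Figures copied from the results table (rounded, so ratios are approximate).
reasoning_tokens = {"MiniMax M3": 200, "DeepSeek-V4-Pro": 9_400}
cost_per_solve = {"MiniMax M3": 0.05, "DeepSeek-V4-Pro": 0.79}

token_ratio = reasoning_tokens["DeepSeek-V4-Pro"] / reasoning_tokens["MiniMax M3"]
cost_ratio = cost_per_solve["DeepSeek-V4-Pro"] / cost_per_solve["MiniMax M3"]

print(f"reasoning: {token_ratio:.0f}x, cost per solve: {cost_ratio:.1f}x")
# prints "reasoning: 47x, cost per solve: 15.8x"
```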

A lot of what gets sold as "more reasoning" turns out to be motion, not progress. The models that thrashed the longest weren't the ones that solved the most.

3. Cranking the reasoning dial mostly did nothing. Several of these expose a reasoning-effort setting: a "high" and an "extra-high"/max mode. Where they did, I tested both. On most, turning it up changed nothing I could measure; the run-to-run variance was bigger than the effect. Only M3 showed a clean, consistent response to the dial. The rest just think however much they think, regardless of what you ask for.
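A crude way to decide whether a dial "did anything" is to compare the shift in a metric between settings against its run-to-run spread. The sketch below is an assumed heuristic, not a formal significance test and not necessarily the comparison used for these runs; the function names are hypothetical, and each list is expected to hold the same metric (for example, reasoning tokens per task) from at least two repeated runs at one setting.

```python
from statistics import mean, stdev


def dial_effect_vs_noise(high: list[float], maximum: list[float]) -> tuple[float, float]:
    """Return (shift in mean, run-to-run spread) for two effort settings."""
    shift = mean(maximum) - mean(high)
    # Use the larger of the two standard deviations as the noise floor.
    noise = max(stdev(high), stdev(maximum))
    return shift, noise


def dial_matters(high: list[float], maximum: list[float]) -> bool:
    """True only if the shift between settings exceeds the noise floor."""
    shift, noise = dial_effect_vs_noise(high, maximum)
    return abs(shift) > noise
```

With only a handful of repeats per setting this can only flag large, consistent effects, which is the kind of response the paragraph above describes for M3.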

4. The standout was the cheap one. MiniMax M3 matched or beat the field on solves while being the cheapest and most concise: on the order of a nickel per solved task against thirty to eighty cents for the frontier-priced models, with a fraction of the reasoning overhead. For real agentic work, where every step is a billable round-trip, that compounds fast.

The honest caveats

Ten tasks is a small, noisy sample; treat this as a directional signal, not a ranking. I used one generic harness for everyone, so this measures efficiency on a level playing field, not peak capability with a vendor's own optimized agent stack. And these are my tasks, not a standardized suite, so your mileage on different repos and languages will differ. I'd happily be wrong on any single model; the pattern across all six is what stuck with me.

The takeaway I'm keeping: when you're choosing an open-weight model for real coding work, the benchmark headline is the least useful number on the page. Cost-per-solved-task and reasoning efficiency told me far more. And on tasks it had never seen, the cheapest, leanest paid option also tied for the most solves.

Models accessed via their public APIs in June 2026; pricing and behavior change over time.