Hacker News — vinext + Cloudflare Workers

▲Why Weibo's tiny VibeThinker-3B has the AI world arguing over benchmarks again (venturebeat.com)

19 points by gmays 16 hours ago | 3 comments

embedding-shape 14 hours ago [-]

> The model, called VibeThinker-3B, scored 94.3 on AIME 2026 — the American Invitational Mathematics Examination, one of the most demanding standardized math competitions in the world. That figure places it alongside DeepSeek V3.2, a model with 671 billion parameters

Overfitting, no need to argue about anything I think?

The rest of the article seems to be echoing people's misunderstanding of pretty elementary stuff.

crote 11 hours ago [-]

That's the obvious answer, yes. But if they are doing it, why should anyone assume the competition isn't doing it?

If it is possible to cheat on the benchmarks used to judge AI performance, how can the general population be certain that any of the AI "innovation" is genuine? Is there true development here worth the many-billion-dollar investments, or are we seeing an industry-wide case of them doing a Theranos by faking the results and hoping they can do real innovation before anyone finds out?

embedding-shape 6 hours ago [-]

> If it is possible to cheat on the benchmarks used to judge AI performance

You make this sound like it's a new thing, but it's been a thing since benchmarks started being a thing in ML, way before LLMs or even before attention was all you needed.

The general population, and even less so the developer community, shouldn't blindly follow benchmark scores; they don't really tell you much about how the model is to use in practice anyway. If you want something that does tell you that, you need your own private benchmark with test cases based on what you use LLMs for, and you shouldn't share that suite publicly.
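
To make "your own private benchmark" concrete, here is a minimal sketch of the idea in Python. Everything in it is an assumption for illustration: `Case`, `run_benchmark`, and `fake_model` are hypothetical names, and `fake_model` is a stand-in for whatever function actually calls the LLM you are evaluating. The two test cases are placeholders; a real suite would use prompts taken from your own work.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One private test case: a prompt plus a check on the model's answer."""
    prompt: str
    check: Callable[[str], bool]  # returns True if the answer is acceptable


def run_benchmark(model: Callable[[str], str], cases: List[Case]) -> float:
    """Return the fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


# Hypothetical stand-in for a real LLM call (assumption, not a real API).
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "I don't know"


# Placeholder cases; keep the real ones out of public repos so they
# cannot end up in anyone's training data.
CASES = [
    Case("What is 2 + 2? Answer with a number only.",
         lambda out: out.strip() == "4"),
    Case("Name the capital of France.",
         lambda out: "paris" in out.lower()),
]

if __name__ == "__main__":
    print(f"pass rate: {run_benchmark(fake_model, CASES):.2f}")  # pass rate: 0.50
```

Because `run_benchmark` takes any `str -> str` callable, the same private cases can be run against several models and the pass rates compared directly.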
