embedding-shape 14 hours ago [-]
> The model, called VibeThinker-3B, scored 94.3 on AIME 2026 — the American Invitational Mathematics Examination, one of the most demanding standardized math competitions in the world. That figure places it alongside DeepSeek V3.2, a model with 671 billion parameters
Overfitting, no need to argue about anything I think?
The rest of the article seems to be echoing people's misunderstanding of pretty elementary stuff.
crote 11 hours ago [-]
That's the obvious answer, yes. But if they are doing it, why should anyone assume the competition isn't doing it?
If it is possible to cheat on the benchmarks used to judge AI performance, how can the general population be certain that any of the AI "innovation" is genuine? Is there true development here worth the many-billion-dollar investments, or are we seeing an industry-wide case of them doing a Theranos by faking the results and hoping they can do real innovation before anyone finds out?
embedding-shape 6 hours ago [-]
> If it is possible to cheat on the benchmarks used to judge AI performance
You make this sound like it's a new thing, but it's been a thing since benchmarks started being a thing in ML, way before LLMs or even before attention was all you needed.
The general population, and even less so the developer community, shouldn't blindly follow benchmark scores; they don't really tell you much about how the model is to use in practice anyway. If you want something that gives you that idea, you need your own private benchmark with test cases based on what you use LLMs for, and you shouldn't share this suite publicly.
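A rough sketch of what such a private benchmark could look like, in Python. Everything here is hypothetical: the `CASES` list, the `run_private_benchmark` helper, and the `fake_model` stand-in are made up for illustration, and the substring check is just the simplest scoring rule I could think of. You'd swap `fake_model` for a call to whatever model you actually use.

```python
from typing import Callable

# Hypothetical test cases; replace them with prompts from your own workload
# and keep the file out of public repos so it can't leak into training data.
CASES = [
    {"prompt": "What is 17 * 23? Answer with the number only.", "expected": "391"},
    {"prompt": "Reverse the string 'benchmark'. Answer with the result only.",
     "expected": "kramhcneb"},
]


def run_private_benchmark(ask_model: Callable[[str], str], cases=CASES) -> float:
    """Return the fraction of cases whose answer contains the expected string."""
    passed = 0
    for case in cases:
        answer = ask_model(case["prompt"])
        # Simplest possible scoring rule (an assumption): substring match.
        if case["expected"] in answer:
            passed += 1
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; only knows one of the two answers."""
    return "391" if "17 * 23" in prompt else "I don't know"


score = run_private_benchmark(fake_model)
print(f"pass rate: {score:.0%}")
```

The point is less the scoring rule and more that the cases come from your own usage and never get published, so a model can't have been tuned against them.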