Hacker News — vinext + Cloudflare Workers

▲GLM 5.2 beats Claude in our benchmarks (semgrep.dev)

147 points by jms703 4 hours ago | 43 comments

bArray 17 minutes ago [-]

Apparently GLM 5.2 is 753B parameters [1], what kind of hardware are people using to run this locally?

[1] https://huggingface.co/zai-org/GLM-5.2

crocowhile 15 minutes ago [-]

follow antirez - https://x.com/antirez/status/2071173841175363905?s=20

JamesSwift 2 minutes ago [-]

That's quantized
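
For the hardware question in this subthread, a rough back-of-the-envelope sketch of weight memory at different precisions may help. It assumes the 753B parameter count quoted above, counts only the weights (no KV cache, activations, or runtime overhead), and uses decimal gigabytes. The function name `weight_memory_gb` and the chosen bit widths are illustrative and not taken from any GLM tooling.

```python
# Rough estimate of the memory needed just to hold model weights.
# Assumption: 753B parameters (the figure quoted in the thread, not verified here).
PARAMS = 753_000_000_000


def weight_memory_gb(params: int, bits_per_param: int) -> float:
    """Return weight-only memory in decimal GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Illustrative precisions: 16-bit, 8-bit, and 4-bit weights.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{weight_memory_gb(PARAMS, bits):,.1f} GB")
```

Under these assumptions, 16-bit weights come to about 1.5 TB and 4-bit weights to roughly 377 GB, which is why quantization comes up for local runs.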

WithinReason 38 minutes ago [-]

> [...] beating Claude Code (32%) at roughly $0.17 per vulnerability found

Claude Code is an agent harness, not an LLM.

Claude is a brand (or group of LLMs), not an LLM.

raincole 5 minutes ago [-]

Yes, and the article author is fully aware of that. Thank you for pointing out this small mistake though.

Onavo 23 minutes ago [-]

Claude Code is the only way to get access to the actual amortized cost of running a Claude-scale model. The consumer non-enterprise API is extremely expensive (with increasing marginal costs for the user and fat profit margins for Anthropic). If you want to approximate a state-level attacker's cost, where they can run the model on their own hardware, Claude Code is probably the best guess at the amortized cost.

himata4113 1 hours ago [-]

These numbers seem pretty low compared to what I was able to achieve, specifically around the Windows kernel (win32k<->win32u, to be exact). It honestly wouldn't surprise me anymore if China started surpassing the models that the US makes public, at least in specific categories such as cyber.

GLM 5.2 is already capable enough to assist in self-training, which is similar to what we saw happen with frontier models, and they appear to be getting there at a significantly lower cost than OpenAI/Anthropic.

solenoid0937 2 hours ago [-]

GLM export controls incoming? I predict Commerce will force OpenRouter and HuggingFace to take some open models down within the next few months.

Not that it would make any sense.

rgbrenner 1 hours ago [-]

If that happens it'll be an absolute disaster. Imagine a scenario where Anthropic and OpenAI prohibit most US companies from using their latest models because of safety, and meanwhile attackers use equivalent open source models to attack US companies.

Any prohibition on open source models will do nothing to fix the problem, since attackers will never feel bound by the law. All advanced models must be available for defensive purposes.

andy99 50 minutes ago [-]

Right, but is there any evidence of intelligence behind any of these (government) decisions? It’s just regulatory capture + marketing (plus some people living out an imaginary fantasy that they’re in Neuromancer or something), absolutely no reason to think they won’t try and target open models as part of this.

popalchemist 35 minutes ago [-]

There's at least one reason: it's much harder to make a profit policing non-American companies and open-source models without huge (or even any) MRR.

If the real motive is profit, then open source models are likely simply not a viable means to that end.

solenoid0937 33 minutes ago [-]

> since attackers will never feel bound to the law.

But that's the whole point.

Companies blessed by the government will be allowed to defend themselves and use the best models.

Companies out of favor with it will be forbidden from it, and fall prey to the attackers and behind their competitors.

Now every company will think twice about not donating to the inaugural fund, or saying something the administration disagrees with.

As an aside: HN is very Silicon Valley tech-brained. So it's a foreign, shocking concept to the tech crowd but obvious to the DC crowd and the rest: Anthropic/OpenAI are not the main characters in this story. The executive branch and its propensity to wield the full extent of its power is. The AI companies are a sideshow - they will make the decisions the administration lets them make.

aussiegreenie 5 minutes ago [-]

The Americans may ban the use of the Chinese models in America. But like the Chinese car ban, everyone else will use them.

djeastm 6 minutes ago [-]

I think state-of-the-art AI is going to be defense industry only from now on. We can have our toy drones but not the Predators and Reapers.

gruez 2 hours ago [-]

>GLM export controls incoming?

US imposing export restrictions on a model from China?

mcintyre1994 1 hours ago [-]

It’d be restrictions on Americans and American companies, and probably also pressure on America’s allies.

manquer 2 hours ago [-]

While unlikely, it is not without precedent: there are restrictions on ASML, a Dutch company, selling EUV machines.

throwup238 6 minutes ago [-]

That’s because the Department of Energy originally funded and contributed IP to the EUV Corp joint venture between several semiconductor companies (including ASML and Intel). Their ability to export control EUV was part of that original agreement that the entire technology is built on.

verdverm 1 hours ago [-]

ASML complies as an ally, why would China comply?

The weights are already available and downloaded, is it going to be a crime to have them, run them, make them available? Constitutional rights still exist (I hope)

solenoid0937 1 hours ago [-]

> is it going to be a crime to have them, run them, make them available?

Now you're getting it! Commerce will call it a munition and those harboring it as harboring illegal/foreign munitions.

No business will take the hit, so they will quickly deplatform the models.

No end user has the GPU capacity to use GLM 5.2 or similar models at full precision so the government will call the problem "mostly solved." But they might choose to "make examples" out of a few people using p2p software to download the weights if they choose to.

verdverm 1 hours ago [-]

Or we use the models to work on fixing vulns and stop over-blowing the doom scenarios. Gotta save the kids and kill the terrorists though!

I'm for making software better instead of banning it based on what the rich and powerful claim.

I suspect the real fear is that open weight models undermine the financials and token prices they thought were going to pay off their ludicrous spending because they have all raced and raised hardware prices.

hadlock 1 minutes ago [-]

> making software better instead of banning it

We're still in the middle of the Cambrian explosion.

If Anthropic was capable of developing Opus 4.49-4.5 in 2H 2025, then any company with a research team capable of reading all the papers and press releases will be capable of producing Opus 4.8 by the end of 2027, either in raw model competency or in a harness like Claude Code (or better, with both). I guess what I am trying to say is that Opus 4.5 does not represent the edge of agentic capability, merely somewhere in the thick meaty layer of "functional and achievable".

We can draw the line at Sonnet 4.6 in the US but much like encryption export restrictions in the 1980s, the line drawn will be laughably low within a few years and simply unthinkable in a decade.

solenoid0937 58 minutes ago [-]

> making software better instead of banning it

That would be the rational thing to do.

> financials and token prices

I do not think the government thinks this deeply. Market manipulation might be a rational, if unethical reason to ban open source models.

But this admin banned Anthropic models to "own the libs." They will continue to ban what they want for whatever reason they want. I don't think those reasons will be particularly coherent.

verdverm 26 minutes ago [-]

Yeah, the current admin is reactionary; they appear to put little thought in, or at least disregard input they dislike. I don't think Ant's ban was about "owning the libs" as much as it was asserting dominance over someone who spoke up counter to the admin's aims and claims. They do listen to money, which is where I see Big AI paying for executive orders (because the admin forgot what it means to compromise as part of legislating for all Americans).

matheusmoreira 40 minutes ago [-]

> is it going to be a crime to have them, run them, make them available?

Yeah. Illegal numbers.

fph 41 minutes ago [-]

How would that even work for an open-weight model?

theteapot 17 minutes ago [-]

> Constant: the IDOR dataset (the same real, open-source applications we've used in prior research) ...

What were they? Also, wouldn't one expect a more recently released coding agent (with a more recent knowledge cutoff) to perform better, because it has access to more knowledge about vulns in these OSS projects, and possibly even knowledge of your own "prior research"?

kordlessagain 4 hours ago [-]

You can launch GLM-5.2 in Opencode using Nemesis8: https://github.com/DeepBlueDynamics/nemesis8#nemesis-8

After installing, do a `n8 build` to build the image, then `n8 --danger --provider opencode interactive` to launch it in a container.

Sign up for GLM-5.2 here: https://z.ai

sanid 17 minutes ago [-]

One can also try https://neuralwatt.com using it in opencode.

I think they give $5 trial credits to test with any of the open weight models.

veselin 2 hours ago [-]

Here, it appears they compare a single prompt "find IDOR", against a multi-agent system. However, one can also start far more sophisticated skills that spin up subagents and mostly do the same in Claude Code, Codex, OpenCode, Pi, etc.

Which I guess makes what semgrep sells obsolete. Unless they have built a pareto-optimal point in terms of capabilities and token usage maybe?

blazespin 2 hours ago [-]

I think the point is less "how can we throw shade on the OP" and more "a harness can enable a lot of models to do very serious cybersec, glm 5.2 is one of them"

s3p 1 hours ago [-]

Are you replying to a response to the original comment? I looked but I didn't see anyone saying he's throwing shade.

admax88qqq 1 hours ago [-]

> beats Claude in our Cyber Benchmarks

Beats which model in Claude? Whenever a "benchmark" doesn't put precise model numbers in their headlines I am immediately skeptical. Either they don't know the difference (bad) or they are benchmarking against weaker models (misleading, also bad).

It's like when studies say "AI is bad at X" and they used GPT-3.5 in current year.

InsideOutSanta 1 hours ago [-]

They say "Claude Opus 4.8" in the first paragraph.

ls612 1 hours ago [-]

Opus 4.8 according to TFA. Whether or not the safety guardrails were responsible for the difference is an open question but for a dev who wants to secure their software who doesn’t work at one of the blessed Glasswing companies it doesn’t really matter why, it matters what the best tool you actually have is.

danslo 2 hours ago [-]

It reads like an ad.

Secondly these are "just" IDORs, arguably the easiest class of vulnerabilities.

Thirdly it compares to GPT 5.5 and Opus 4.8.

No, we don't have Mythos at home.

vlian2088 1 hours ago [-]

>Thirdly it compares to GPT 5.5

mythos is <10% ahead of gpt 5.5 on all benchmarks, which it gains by being several times the size of opus. had it been economical to provide, it would've been released to the public on day one instead of the marketing circus those effective altruism clowns had exhibited. admitting that it costs >1000% to run inference on a <10% better model would've been very damning.

InsideOutSanta 1 hours ago [-]

In my experience, GLM 5.2 is extremely good at finding vulnerabilities, and more importantly, unlike Opus, I've never seen it refuse a command. It genuinely is a very strong model for finding and fixing vulnerabilities.

NitpickLawyer 29 minutes ago [-]

> Thirdly it compares to GPT 5.5 and Opus 4.8.

> No, we don't have Mythos at home.

That's still useful. To paraphrase the kids these days, GLM5.2 is in the room with us, today. Mythos is not. And for us in the EU, it's even more complicated, as Mythos might be with us in the room one day, and go poof the next day, on the whims of political entities that we have 0 control over.

Knowing where open, accessible, local models are is important. We know they're behind. But there comes a time when "good enough" is useful. Even if they're "just IDORs" today, and even if they're behind SotA today.

As someone else said above, GLM5.2 (and other models in the same tier like kimi, dsv4, etc) are slowly becoming "good enough" to assist in automated repo preparation work (download, install, test, edit, re-test, etc). And that translates into RL traces ready to be trained into the next generations. That might be more important than being x% behind on benchmarks.

sanid 29 minutes ago [-]

Technically we don't have Mythos at all? You guys have access. This tells me we have Opus at home (open weights).

jimbob45 31 minutes ago [-]

Yeah, they straight up say that their criteria are narrow and primarily important for their specific use case. Never let rationality cause your pitchfork to be cast away though!

rode1974 32 minutes ago [-]

Hopefully I get a MacBook Pro soon enough to run some small or medium-sized LLMs

paperterminal 28 minutes ago [-]

Same, but so much $$
