da-x 13 hours ago [-]
> Elevator achieves performance on par with or better than QEMU's user-mode JIT emulation.
I am not sure what QEMU's JIT is doing (in its userspace wrapper), but I think it has a lot of room for improvement.
In 2013 I wrote an x86-64 to aarch64 JIT engine that was able to run what were then Fedora beta aarch64 binaries and rebuild almost the entire aarch64 port of Fedora on an x86-64 Linux host. I also made a reverse aarch64 to x86-64 JIT that worked in the same way, and for fun I also showed the two JITs managing to run each other in loopback fashion: x86-64 -> aarch64 -> x86-64 in the same process.
The JIT I devised did a 1-to-many instruction and CPU state mapping, with overhead somewhere around 2x to 5x slower than what would be expected of natively recompiled code. I later compared this with QEMU's JIT, which seemed more in the range of 10x to 50x slower.
Unfortunately this was not developed under an open source license, so no code release to prove it... :(
pm215 12 hours ago [-]
Yes, QEMU's JIT is a fairly easy target to beat. Notably if you are happy to specialize the design to "only x86 to aarch64" and "only usermode" there's quite a lot of gain to be made. QEMU's usermode support is a kind of "this happens to work" appendix to its system emulation support, and the overall JIT architecture is a "guest to intermediate representation to host" one that is great for supporting a dozen guest architectures and multiple host architectures, but means you can't really take advantage of properties of a specific guest/host pair like "x86 has fewer integer registers so we can hard allocate them" or "we know the fiddly floating point semantics always match if you put the aarch64 CPU into the right mode". Plus there's just more time put into "emulate new architecture feature X" in QEMU development than into "look at optimization opportunities to make it faster", because that's what the people who pay for development work care more about.
himata4113 9 hours ago [-]
QEMU uses TCG, a generic code generator, not a direct translator. It's designed to work with N architectures, which has limits.
mevinbuilds 9 hours ago [-]
[dead]
linkregister 12 hours ago [-]
A 50x increase in the size of the .text section is enormous, but seems to be a reasonable price to pay for a fully-deterministic translation. The performance difference over emulation will outweigh the inconvenience of the size increase in many cases.
It's exciting to see that multithreading and exception handling are not impossible to support; they're just out of scope of this particular project.
I wonder if the next step is to then use heuristics to prune the possibility space and reduce the size of the binary (thus breaking the guarantees of the translation, but making portability of the binary practical).
wmf 43 minutes ago [-]
> The performance difference over emulation will outweigh...
Nope, this translator is much slower than Box64 or FEX. It's just worse unless you can't use JIT for some reason.
codedokode 5 hours ago [-]
I've always been curious: how do translators handle indirect jumps? When analyzing a binary we can discover only the segments of code connected by direct jumps, where the destination address is known. So whenever an indirect jump happens, we need to find the target function, optionally translate it, and then jump into the translated code. Isn't that slow? Are there faster methods? Can we make translated function addresses match the original functions? Or do we place jumps to the translated code at the original addresses?
evmar 5 hours ago [-]
The translator I made is only hobbyist quality, but I just have a big table that says “if you indirect jmp to address X then the associated block is at location Y”.
This is slower than a direct jmp (which doesn’t use the table) but also indirect jumps were slower in the original program to begin with and typically don’t occur in performance-critical loops.
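Roughly, the table approach looks like this (a toy, self-contained sketch; the names, addresses, and the binary search are mine, not anything from the paper):

    /* Toy sketch of the lookup-table approach -- all names hypothetical:
       guest addresses map to translated blocks, sorted for binary search;
       untranslated targets hit a canonical trap instead of wild execution. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef void (*host_block_fn)(void);

    static void block_a(void) { puts("translated block for 0x401000"); }
    static void block_b(void) { puts("translated block for 0x401017"); }
    static void trap(void) { fprintf(stderr, "untranslated target\n"); exit(1); }

    static const struct { uint64_t guest; host_block_fn host; } table[] = {
        { 0x401000, block_a },  /* sorted by guest address */
        { 0x401017, block_b },
    };
    static const size_t n = sizeof table / sizeof table[0];

    static host_block_fn resolve(uint64_t target) {
        size_t lo = 0, hi = n;
        while (lo < hi) {  /* binary search over the sorted table */
            size_t mid = lo + (hi - lo) / 2;
            if (table[mid].guest < target) lo = mid + 1; else hi = mid;
        }
        return (lo < n && table[lo].guest == target) ? table[lo].host : trap;
    }

    int main(void) {
        resolve(0x401017)();    /* dispatches to block_b */
        resolve(0xdeadbeef)();  /* canonical trap */
    }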
jcranmer 5 hours ago [-]
> also indirect jumps were slower in the original program to begin with and typically don’t occur in performance-critical loops.
The main use-case in performance-critical loops is generally something like a core interpreter loop, where you're dispatching on an opcode.
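E.g. the classic switch-dispatch loop, where the switch compiles down to an indirect jump through a jump table, taken once per executed opcode (toy bytecode, just to illustrate):

    /* Classic opcode-dispatch interpreter loop: the switch typically
       becomes an indirect jump through a jump table, every iteration.
       Toy bytecode, not any real VM. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_HALT };

    static long run(const unsigned char *code) {
        long stack[64], *sp = stack;
        for (const unsigned char *pc = code;;) {
            switch (*pc++) {            /* indirect jump, per opcode */
            case OP_PUSH: *sp++ = *pc++; break;
            case OP_ADD:  --sp; sp[-1] += sp[0]; break;
            case OP_HALT: return sp[-1];
            }
        }
    }

    int main(void) {
        const unsigned char prog[] = {OP_PUSH, 40, OP_PUSH, 2, OP_ADD, OP_HALT};
        printf("%ld\n", run(prog));  /* prints 42 */
    }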
fizza_pizza 12 hours ago [-]
The certification angle is the most interesting part to me. Regulated industries (aviation, medical devices) often can't use JIT for exactly this reason: the code that runs has to be the code that was certified. Static translation that produces a signable binary is a real unlock there, code bloat notwithstanding.
camillomiller 12 hours ago [-]
I wonder: how relevant is this portion of the software industry? Because I'm guessing there is also no way they can apply LLMs at scale, which is never discussed in the larger AI-at-work narrative.
topspin 10 hours ago [-]
I work in an industry that requires reproducible binaries from source, and cryptographic hashes filed with a regulator.
It's also not aviation or medical. So perhaps it's more common than you imagine.
camillomiller 9 hours ago [-]
I think my comment conveyed the wrong sentiment, my bad.
I’m suggesting exactly this: there are extremely common cases in which deterministic software outcomes are needed/mandatory/regulated.
Way more often than we think, often in boring and solved but critical environments.
Yet the entire AI industry acts as if that is an afterthought or an unimportant edge case.
topspin 2 hours ago [-]
> Yet the entire AI industry acts as if that is an afterthought or an unimportant edge case
Certainly it's not on the "AI industry's" list of priorities. Perhaps, however, it's not supposed to be. I use AI tools for the use case I mentioned. The source code, build system, binary artifacts and hashes are still regulated in the way I described. The fact that the AI industry was involved in that chain simply isn't relevant.
Other use cases involving real-time agents and whatnot are another story. I'm not dealing with that problem. I suspect the AI industry doesn't really care about such attestation at this point because everyone is still in the frothy world of "new!" and the bureaucrats simply haven't caught up yet, and the adopters are taking advantage while they can. That pattern has recurred throughout the history of communication and computers.
I don't really object to that. There will be plenty of time for security theater after whatever limits are eventually found and exploited, and in the meantime there is free oxygen available.
jy14898 12 hours ago [-]
LLMs aren't relevant to aviation and medical devices
naasking 4 hours ago [-]
Of course they are. LLMs are routinely used to generate Lean proofs, so of course they can also be used to generate verified software. They probably aren't being used that way, yet, but they will be:
* A Case Study on the Effectiveness of LLMs in Verification with Proof Assistants, https://arxiv.org/abs/2508.18587v1
* CoqPilot, a plugin for LLM-based generation of proofs, https://dl.acm.org/doi/10.1145/3691620.3695357
Exactly! And yet they're touted as a catch-all business case!!!
rvz 12 hours ago [-]
It is completely relevant, if you want reliable software that you use daily to continue running without a massive rewrite.
Before suggesting LLMs to completely rewrite this sort of software, consider that there is a reason compilers need to be certified to operate in safety-critical environments. Not everything needs LLMs as the solution to a problem.
I would go as far as to say that using an LLM in this context is the wrong solution and is irrelevant to critical systems. Maybe some here see everything as tokens and must solve everything in the form of LLMs.
Rewriting a toy web app using LLMs from Javascript to Typescript is great, but isn't good for safety critical systems.
adrianN 11 hours ago [-]
Safety critical software is mostly a compliance dance that incidentally produces artifacts with lower defect rates than usual. LLMs can help with safety critical code as long as a human signs their name that they are responsible for its behavior.
junon 10 hours ago [-]
When I'm sitting in the plane that has CAS firmware, I'd like to think it wasn't written by an LLM and that my death in the case of a CAS failure isn't chalked up to "some engineer somewhere gets in trouble".
adrianN 8 hours ago [-]
There probably already is generated code in there, only it was generated from UML. I don’t think that LLM generated code will be treated differently from the point of view of the relevant regulations.
junon 8 hours ago [-]
UML conversion is deterministic.
adrianN 3 hours ago [-]
That doesn’t matter. Once the code is generated it doesn’t change. The reviewed artifact in a safety critical codebase is the last abstraction layer before a fully certified compilation pipeline. So usually it’s not the UML but the generated code.
camillomiller 11 hours ago [-]
I agree with you.
The question is: how the hell is this never discussed when assessing the economic potential of AI-driven disruption?
I ask because I have the impression that all the really relevant industries are resistant to the current narrative.
That said, we had Claude helping bomb a school full of kids; you would guess the military would know better, but no :/
gabriela_c 6 hours ago [-]
I really like the superset CFG idea, but the following are noteworthy for anybody planning to read the article:
- ~4.75x runtime speed increase (significantly slower than Box64, faster than QEMU), 7x executed instruction count increase, 50x binary size increase
- emulates the x86 ABI until it calls out to external code
- has to emulate a large part of x86 CPU state like EFLAGS, compute complex movs individually, etc.
- only supports single-threaded binaries
- no exception handling/unwinding
- doesn't support the full ISA
jonhohle 13 hours ago [-]
This is neat. I haven't looked into it, but I would think relative offsets could still be an issue; then again, there must be some translation layer/MMU since the codegen will be different sizes anyway. This would impact jump tables and internal branches, primarily.
I mostly work on stuff from the 90s. Disassemblers make a lot of assumptions about where code starts and ends, but occasionally a binary blob is not discoverable unless you have some prior knowledge (a pointer at a fixed location to an entry point).
I would think after a few passes you could refine the binary into areas that are definitely code.
gblargg 11 hours ago [-]
> Elevator considers all possible interpretations of every byte and produces a separate translation for each feasible one ahead of time [...] pruning only those leading to abnormal termination.
So any real program with the possibility to crash is pruned?
dzaima 10 hours ago [-]
Presumably just set to a canonical crash in the lookup table of address-to-code; which'd still get you a crash, just not that of the directly-run invalid code.
JoheyDev888 11 hours ago [-]
50x isn't reasonable, it's a cache disaster. Any perf win from avoiding JIT gets eaten alive.
wmf 3 hours ago [-]
This is a great case for link-time code reordering. You can put all the hot code together so the unused code will never be loaded.
dzaima 10 hours ago [-]
Only if it is all actually used at runtime; and presumably the vast majority of possible decoding starting points won't be.
po1nt 7 hours ago [-]
I wouldn't jump to conclusions. Instructions aren't that big anyway, and they get optimized just-in-time by the CPU.
Panzerschrek 13 hours ago [-]
Can it handle self-modifying code?
Why only x86_64? It would make more sense to convert 32-bit programs, like many old games.
whizzter 11 hours ago [-]
I think self-modifying code outside of JIT runtimes is a pretty rare thing these days compared to the 80s or 90s; .text sections are mostly RO now, and security requirements aren't going to relax that.
oinkt 13 hours ago [-]
Consider reading the linked article, where this is explicitly addressed:
> Self Modifying and JIT-Compiled Code. Elevator, like all fully static binary rewriters, does not support self modifying or just-in-time-compiled code.
Animats 11 hours ago [-]
So they don't have to handle the really hard case.
In x86 land, it's hard to find the instruction boundaries statically because, for historical reasons going back to the 8-bit era, x86 instructions don't have alignment restrictions. This is what makes translation ambiguous.
If you start at the program entry point and examine reachable instructions, you can find the instruction boundaries. Debuggers and disassemblers do this. Most of the time it works, but you may have to recognize things such as C++ vtables. Debug info helps there. There may be ambiguity. This seems to be about generating all the possible code options to resolve that ambiguity by brute-force case analysis.
x86 doesn't have explicit code/data separation, which some architectures do. So they have to try instruction decoding on all data built into the executable. They cull obvious mistranslations. Yet they still have a 50x space expansion, someone mentioned. Most of those will be unreachable mistranslated code.
You can't look at a static executable which uses pointers to functions and say "that data cannot possibly be code", without constraining what those pointers point to. That involves predicting run-time behavior, which may not be possible.
linkregister 12 hours ago [-]
Why doesn't it clean my garage also? I've got some leaves to rake as well.
perching_aix 11 hours ago [-]
> Can it handle self-modifying code
If it did, it wouldn't be "fully static" anymore. It's fundamentally contradictory.
burnt-resistor 7 hours ago [-]
On the greenfield x86 development side: self-modifying code, while possible, is generally terrible because it obliterates cache lines and pipeline/branch-prediction performance too. It also violates W^X, so it generally has to be used in JIT-compatible memory pages. So avoid it almost always. It was kind of a thing in the 486 and P5 days, like using code immediates as inner-loop variables, but not so much now.
There's a lot of x86 crufty edge-cases to handle to achieve perfect(ish) emulation or translation.
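To illustrate the W^X point: the usual JIT-compatible-pages pattern on Linux is write-then-flip, roughly like this (a sketch with error handling trimmed; the bytes and names are just for illustration):

    /* Sketch of the W^X-respecting JIT pattern on Linux: the page is
       writable OR executable, never both at once. (On aarch64 you would
       also need an icache flush, e.g. __builtin___clear_cache.) */
    #include <sys/mman.h>
    #include <string.h>
    #include <stdint.h>
    #include <stddef.h>

    typedef int (*jit_fn)(void);

    static int run_jit(const uint8_t *code, size_t len) {
        void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) return -1;
        memcpy(page, code, len);                          /* W, not X */
        if (mprotect(page, 4096, PROT_READ | PROT_EXEC))  /* X, not W */
            return -1;
        int r = ((jit_fn)page)();
        munmap(page, 4096);
        return r;
    }

    int main(void) {
        /* x86-64 bytes for "mov eax, 42; ret" -- host-specific, demo only */
        static const uint8_t code[] = {0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3};
        return run_jit(code, sizeof code);  /* exit status 42 */
    }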
Animats 7 hours ago [-]
> It was kind of a thing in 486 and P5...
After those machines, at the Pentium Pro, with look-ahead instruction decoding, it became a major lose to store into code. Superscalar x86 CPUs have the hardware to detect and handle stores into code, but it requires bringing the CPU to a clean halt, almost like an exception interrupt, discarding pipelined work that's already been done, and then restarting the pipeline, reloading the instructions ahead. All the performance gains of superscalar hardware are lost for a while.
There are RISC architectures where self-modifying code isn't supported, and code pages must be read-only. Then the CPU doesn't need the machinery for detecting and aborting look ahead on a store into code. MacOS has enforced that rule since the PowerPC era.
codedokode 5 hours ago [-]
I think that you can execute JIT code using an interpreter?
gobdovan 13 hours ago [-]
[dead]
ok123456 5 hours ago [-]
Would this blow up space-wise if you tried to run something that was "protected" with vmprot?
Asraelite 8 hours ago [-]
Where is the source code?
mgaunard 11 hours ago [-]
On par with QEMU, but still far behind Rosetta...
whizzter 11 hours ago [-]
Isn't part of that due to Rosetta relying on Apple extensions to ARM to mimic x86-64 memory semantics?
MrBuddyCasino 11 hours ago [-]
The x86 memory model (TSO) is not Apple's invention, it's a standard ARM extension.
m132 9 hours ago [-]
The memory model by itself isn't; however, Apple implemented it before Arm released an (incompatible) set of extensions that approach the problem at the instruction level instead of adding an Apple-style global TSO on/off switch in an IMPDEF register [0].
[0] https://lkml.org/lkml/2024/4/10/1531
I thought Apple Silicon also has some extra hardware support for handling x86 flags emulation for Rosetta. But perhaps I’m remembering that incorrectly.
trueno 5 hours ago [-]
blows my mind apple plans to sunset rosetta, it's like a core part of computing for me now that frees me from needing a second device
IshKebab 11 hours ago [-]
And Box64, but I think the point is that this is closer to being guaranteed to work.
pcblues 9 hours ago [-]
Sounds like they tweaked an AI to get a minimal subset of accurate outcomes and started waving their hands for anything more complicated, realistic and ultimately generally useful. The larger problem-space is still an NP-complete problem. I guess if the data-centres become infinitely large, this problem can be worked around.
/s /jk
fguerraz 13 hours ago [-]
Does it mean I can finally run Slack on Asahi?
gobdovan 13 hours ago [-]
From the paper, Elevator currently supports only single-threaded binaries, does not support binaries using exception handling, has unsupported x64 extensions, and does not support self-modifying or JIT-compiled code. Slack is Electron-based, so it embeds Chromium and Node and depends on V8.
Maybe try an emulator? There's also this project I found: https://github.com/andirsun/Slacky
static translation is only possible when you assume no adversarial code AND mostly assume compiler-produced binaries. hand-rolled asm gets hard, and adversarial code is provably unsolvable in all cases.
still, pretty cool for cooperative binaries
fsmv 13 hours ago [-]
I only read the abstract, but I got the impression that their solution to this is they have both: they translate all the data as if it were code, and if it gets called into they use the translation, whereas if it gets read as memory they use the original.
Edit: I found this in the paper
> Elevator sidesteps the code-versus-data determination altogether through an application of superset disassembly [6]: we simultaneously interpret every executable byte offset in the original binary as (i) data and (ii) the start of a potential instruction sequence beginning at that offset, and we build the superset control flow graph from every one of the resulting candidate decodes. Every potential target of indirect jumps, callbacks, or other runtime dispatch mechanisms that cannot be statically analyzed therefore has a corresponding landing point in the rewritten binary. These targets are resolved at runtime through a lookup table from original instruction addresses to translated code addresses that we embed in the final binary.
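A toy way to picture superset disassembly (a fake two-length ISA, not x86; my sketch, not the paper's code): decode a candidate instruction starting at every byte offset, so overlapping interpretations each get their own landing point.

    /* Toy superset disassembly: decode a candidate at EVERY byte offset;
       overlapping decodes all become candidates, and infeasible ones are
       pruned later. Fake ISA: 0x01 starts a 2-byte instruction,
       everything else is 1 byte. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    int main(void) {
        const uint8_t text[] = {0x01, 0x02, 0x03};
        for (size_t off = 0; off < sizeof text; off++) {  /* every offset */
            size_t rem = sizeof text - off;
            if (text[off] == 0x01 && rem >= 2)
                printf("offset %zu: TWO(%02x %02x)\n", off, text[off], text[off + 1]);
            else
                printf("offset %zu: ONE(%02x)\n", off, text[off]);
        }
        /* Offsets 0 and 1 decode to overlapping instructions; a superset
           translator keeps both and lets the runtime lookup table pick. */
    }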
tlb 13 hours ago [-]
But in fact no modern processor/OS executes this either. Pages are marked as executable or not, and static data is loaded as non-executable pages.
dmitrygr 13 hours ago [-]
that is why it was not "static const char buf[]" ;) it was not an accident
executable stacks are still common (incl on windows with some settings), and sometimes they are required (eg for gcc nested functions)
diamondlovesyou 13 hours ago [-]
That won't be located on the stack either. The underlying buffer will be a TU-local, i.e. static and not RX.
lisper 13 hours ago [-]
Good grief, what a useless argument. Isn't it obvious that this could trivially be converted to a non-static array if that's really what was needed?
rowanG077 9 hours ago [-]
If you are going to be pedantic, at least be fully pedantic.
userbinator 13 hours ago [-]
I read those bytes and immediately thought "mov eax, 42; ret".
genxy 13 hours ago [-]
It looks like their system would just generate return 42;
IshKebab 13 hours ago [-]
No, based on the abstract it can handle that code. What it can't handle is runtime code generation.