A big part of the problem is all the tooling around git (like the default github UI) which hides diffs for binary files like these pseudo-"test" files. Makes them an ideal place to hide exploit data since comparatively few people would bother opening a hex editor manually.
How many people read autoconf scripts, though? I think those filters are a symptom of the larger problem that many popular C/C++ codebases have these gigantic build files which even experts try to avoid dealing with. I know why we have them, but it does seem like something worth reconsidering now that the toolchain is considerably more stable than it was in the 80s and 90s.
The alternatives are _better_ but still not great. build.rs is much easier to read and audit, for example, but people still probably skim past it. I know the Rust community has been working on things like build sandboxing, and I'd expect those efforts to be a lot easier there than in a mess of m4/sh where everyone is afraid to break 4 decades of prior usage.
build.rs is easier to read, but it's the tip of the iceberg when it comes to auditing.
If I were to sneak in some underhanded code, I'd do it through either a dependency that is used by build.rs (not unlike what was done for xz) or a crate purporting to implement a very useful procedural macro...
I mean, autoconf is basically a set of template programs for sniffing out whether a system has symbol X available to the linker. Any replacement for it would end up morphing into it over time.
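For anyone who hasn't looked inside one: the core of an autoconf check is tiny. Here's a standalone sketch of the link-test idea (a hypothetical `check_symbol` helper, not real autoconf output, which is generated from m4 macros):

```shell
# check_symbol NAME: can the toolchain link a reference to NAME?
# This mirrors autoconf's trick: declare the symbol with a bogus
# prototype and see whether the link step succeeds.
check_symbol() {
  cat > conftest.c <<EOF
char $1();
int main(void) { $1(); return 0; }
EOF
  cc conftest.c -o conftest 2>/dev/null
  status=$?
  rm -f conftest conftest.c
  return $status
}

check_symbol printf && echo "HAVE_PRINTF=1"
```

Everything else in a configure script is variations on this, plus caching and portability shims — which is exactly why a replacement tends to grow back into the same shape.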
We have much better tools now and much simpler support matrices, though. When this stuff was created, you had more processor architectures, compilers, operating systems, etc. and they were all much worse in terms of features and compatibility. Any C codebase in the 90s was half #ifdef blocks with comments like “DGUX lies about supporting X” or “SCO implemented Y but without option Z so we use Q instead”.
Even in binary you can see patterns. Showing binary diffs isn't perfect (though it's better than showing nothing), but even my slow mammalian brain can spot obvious human-readable characters in various binary encoding formats. If I see a few in a row that don't make sense, why wouldn't I poke at it?
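That instinct is exactly what `strings(1)` automates: it scans a blob for runs of printable characters. Tiny illustration with a made-up blob:

```shell
# Plant a readable run inside otherwise non-printable bytes,
# then let strings(1) surface it.
printf '\x00\x01\x02/bin/sh\x03\x04' > blob.bin
strings -n 4 blob.bin    # prints runs of 4+ printable characters: /bin/sh
```

Of course this only catches payloads that were left readable, which is the attacker's incentive to encode or corrupt them first.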
This particular file was described as an archive file with corrupted data somewhere in the middle. Assuming you wanted to scroll that far through a hexdump of it, there could be pretty much any data in there without being suspicious.
You're right - the two exploit files are lzma-compressed and then deliberately corrupted using `tr`, so a hex dump wouldn't show anything immediately suspicious to a reviewer.
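For readers who haven't seen the trick: `tr` can act as a trivially reversible substitution cipher, so "corrupted" data can be restored exactly at extraction time. A toy illustration (not the actual xz commands):

```shell
# Hide a payload by byte-substitution, then recover it with the inverse mapping.
printf 'echo pwned' > payload
tr 'a-z' 'n-za-m' < payload > corrupted   # ROT13-style byte swap: looks like junk
tr 'n-za-m' 'a-z' < corrupted > restored  # inverse mapping recovers it exactly
cmp -s payload restored && echo "round-trip OK"
```

So the file on disk never contains recognizable strings, and a hex dump of it genuinely looks like corrupted compressed data.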
Is this lzma compressed? Hard to tell because of the lack of formatting, but this looks like amd64 shellcode to me.
But that's not really important to the point - I'm not looking at a diff of every committed favicon.ico, ttf font, or binary test file to make sure it doesn't contain shellcode.
testdata should not be on the same machine where the build is done. testdata (and tests generally) aren't as well audited, and therefore shouldn't be allowed to leak into the finished product.
Sure - you want to test stuff, but that can be done with a special "test build" in its own VM.
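One cheap way to approximate that separation (a sketch with a hypothetical layout; any mechanism that exports only the build inputs would do):

```shell
# Toy project: sources plus a test fixture that must never reach the release build.
mkdir -p project/src project/tests
echo 'int main(void){return 0;}' > project/src/main.c
echo 'not-audited fixture'       > project/tests/fixture.bin

# Export only the build inputs into a clean tree; do the release build there
# (ideally on a different machine/VM), and run tests from the full tree elsewhere.
mkdir -p release-src
(cd project && tar cf - --exclude='./tests' .) | (cd release-src && tar xf -)
ls release-src    # src only; tests/ never reaches the build environment
```

The point is that the xz payload lived in test fixtures, so a build environment that physically lacks them can't be poisoned by them.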
In this case the backdoor was hidden in a nesting doll of compressed data manipulated with head/tail and tr, even replacing byte ranges in between. It would've been impossible to find if you were just looking at the test fixtures.
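To make the head/tail part concrete: carving a byte range out of the middle of a file is a one-liner, so a payload can sit at an arbitrary offset inside a "corrupted archive". Toy example (made-up offsets, not the real ones):

```shell
# Build a cover file with a payload buried at bytes 11-16.
printf 'AAAAAAAAAA' >  cover   # bytes 1-10: filler
printf 'SECRET'     >> cover   # bytes 11-16: hidden payload
printf 'BBBBBBBBBB' >> cover   # bytes 17-26: filler
# Carve it back out: skip the first 10 bytes, keep the next 6.
tail -c +11 cover | head -c 6  # prints SECRET
```

Chain a few of these with tr-style substitution and a decompression step, and the extraction logic is short enough to hide in a build script while the payload itself stays invisible to any fixture review.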