
This kind of application of LLMs is most interesting to me, since it's possible to evaluate correctness and performance quantitatively.


It seems like a poor fit to me precisely because correctness is boolean, difficult to measure and getting it wrong is very bad. I do think there's a place for AI here but it's probably not LLMs in their current form.


They are not using the LLM to directly produce the result code, but as a tool that lists which optimisations should be done and in which order, which is a fairly complex problem to solve. But if the optimisation passes are implemented correctly (which is required for a functioning optimising compiler anyway), it cannot produce incorrect code, only code that may be suboptimal compared to the default heuristics.
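A minimal sketch of why that's safe, with made-up pass names: the model only chooses an order over known-correct passes, and anything unrecognised is dropped, so a bad suggestion degrades to the default pipeline, never to wrong code.

```python
# Hypothetical sketch: the model only orders known-safe passes; the compiler
# applies them, so correctness never depends on the model's output.
SAFE_PASSES = ["dce", "inline", "licm", "gvn", "simplifycfg"]

def choose_pass_order(llm_suggestion):
    """Keep only recognised passes, in the suggested order; else use default."""
    order = [p for p in llm_suggestion if p in SAFE_PASSES]
    return order or SAFE_PASSES  # fall back if the suggestion is unusable

# Even a garbage suggestion degrades to the default, never to wrong code.
print(choose_pass_order(["gvn", "dce", "frobnicate"]))  # ['gvn', 'dce']
```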


If there's a list of known optimizations that preserve correctness then it becomes an optimization problem based on output length (as a proxy for cycle count). So is the idea that an LLM is more efficient than a search or direct optimization?
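To illustrate that search framing (toy numbers, not real passes): treat each pass as a transform on a code-size estimate and brute-force the orderings. The point is that real pipelines are far too large to enumerate, which is where a learned policy could beat exhaustive or random search.

```python
from itertools import permutations

# Toy model of pass-ordering search. Each "pass" is a made-up transform on a
# code-size estimate; we try every ordering and keep the smallest result.
PASSES = {
    "dce": lambda n: n - 4,             # dead-code elimination: shrinks
    "inline": lambda n: n + 6,          # inlining: grows, enables later shrinking
    "simplify": lambda n: int(n * 0.9), # simplification: proportional shrink
}

def best_order(size):
    best = (size, ())
    for order in permutations(PASSES):
        s = size
        for p in order:
            s = PASSES[p](s)
        if s < best[0]:
            best = (s, order)
    return best

print(best_order(100))  # (91, ('dce', 'inline', 'simplify'))
```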


There tend to be a lot of heuristics involved when deciding which optimizations to apply and in which order, so there's plenty of room to apply some machine learning.

Whether LLMs are the right approach is a separate question.

In SQL optimization, the problem is a bit trickier (IMO) because compilation is in the query path. One successful approach I know of is Bao: https://arxiv.org/abs/2004.03814


That's neither the only metric of relevance, nor is that a good proxy for modern superscalar vector processor architectures.


Cycle count is not a proper benchmark for performance on “modern” processors.


This is scooching into AlphaGo (or whatever the generic implementation is called) territory: mixing predictive AI with traditional optimization algorithms.


One example where a LLM might be better is which functions to inline.

Current compilers use a complex set of heuristics; a more holistic approach of the kind neural networks take might outperform them.
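For a sense of what those heuristics look like, here's a toy threshold-style inlining decision. The feature names and weights are invented for illustration; a learned model would replace the hand-tuned weights with something trained on whole-program outcomes, which is the "holistic" angle.

```python
# Toy version of a hand-written inlining heuristic: a linear threshold over
# a few features. Weights and features are illustrative, not from any real
# compiler.
def should_inline(callee_size, call_count, in_loop):
    benefit = call_count * (10 if in_loop else 1)  # hot call sites weigh more
    cost = callee_size                             # proxy for code-size growth
    return benefit > cost

print(should_inline(callee_size=40, call_count=5, in_loop=True))   # True
print(should_inline(callee_size=40, call_count=5, in_loop=False))  # False
```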


For function inlining specifically, I'm not sure LLMs are necessarily the right choice. The original MLGO paper [1] demonstrated a big code-size improvement with a ML model for making inlining decisions (7-20% code size wins), but they used tens of engineered features. Maybe a LLM could squeeze some additional size wins out, but maybe not [2].

Additionally, there are other factors to consider when productionizing these systems. Compile time is important (which LLMs will almost certainly explode), and anyone concerned about code size will probably be doing (Thin)LTO which would require feeding a lot more context into a LLM making inlining decisions.

1. https://arxiv.org/abs/2101.04808 2. https://dl.acm.org/doi/10.1145/3503222.3507744


I'm probably being very naive with this but, could mechanistic interpretability play a role here? Specifically, to experimentally short-list the LLM's most effective optimizations, then try to peek inside the LLM and perhaps fish out some novel optimization algorithm that could be efficiently implemented without the LLM?


I did a bit of work on this last summer on (much) smaller models [1] and it was briefly discussed towards the end of last year's MLGO panel [2]. For heuristic replacements specifically, you might be able to glean some things (or just use interpretable models like decision trees), but something like a neural network works fundamentally differently than the existing heuristics, so you probably wouldn't see most of the performance gains. For just tuning heuristics, the usual practice is to make most of the parameters configurable and then use something like bayesian optimization to try and find an optimal set, and this is sometimes done as a baseline in pieces of ML-in-compiler research.
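The tuning-baseline loop described there looks roughly like this, with random search standing in for Bayesian optimisation and a made-up objective function (pretend code size is minimised near inline_threshold=225, unroll_factor=4):

```python
import random

# Heuristic-tuning sketch: treat the compiler's tunable thresholds as a
# black box and search for a good setting. The objective is hypothetical.
def compile_and_measure(inline_threshold, unroll_factor):
    return (inline_threshold - 225) ** 2 + 50 * (unroll_factor - 4) ** 2

random.seed(0)
best_params, best_size = None, float("inf")
for _ in range(200):
    t = random.randrange(0, 500)   # candidate inline threshold
    u = random.randrange(1, 9)     # candidate unroll factor
    size = compile_and_measure(t, u)
    if size < best_size:
        best_params, best_size = (t, u), size

print(best_params, best_size)
```

In practice the objective would be an actual compile-and-measure run, which is why sample-efficient methods like Bayesian optimization are preferred over plain random search.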

1. https://github.com/google/ml-compiler-opt/pull/109 2. https://youtu.be/0uUKDQyn1Z4?si=PHrx9RICJIiA3E6C


Ah interesting. Hadn't seen that presentation, thanks for sharing!


I thought so as well but then in the article it says that only like 91% of the code even compiles.


Quantification can be done by measuring at least two dimensions: (1) the size of the synthesised code, and (2) how closely the generated code matches the input (roughly: on what fraction of inputs the two programs give different outputs). We have set up a challenge that seeks to entice the community to look into this problem domain more, and we've simplified the assumptions to make it more tractable:

- Challenge: https://codalab.lisn.upsaclay.fr/competitions/15096

- Paper describing the challenge: https://arxiv.org/abs/2308.07899

(I am one of the authors, AMA)


How well does (2) really measure accuracy? It seems like a single output that doesn't match the input code could indicate a fundamental flaw in the optimized code, so it's essentially 100% wrong even though it gets the correct answer almost all the time.

Good luck on the challenge though, this seems like an interesting and valuable area of research.


Of course (2) is not a perfect measure of accuracy, since it does not quantify how far wrong an output is; e.g. if 111111111111 is the correct output, then both 111111111110 and 829382934783 count as equally faulty. The main advantage of (2) is that it is natural, easy to understand, and easy to measure and compare. We have to start somewhere. I imagine that, in the future, it can be refined (e.g. by taking the Hamming distance between desired and actual output). I expect more refined quantification to emerge as the community better understands exactly what is hard in the synthesis of programs.
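Both metrics are easy to state in code. This is an illustrative sketch, not the challenge's actual evaluation harness:

```python
# Metric (2): fraction of inputs on which the candidate program's output
# differs from the reference program's, plus the refined bit-level variant.
def mismatch_fraction(ref, cand, inputs):
    return sum(ref(x) != cand(x) for x in inputs) / len(inputs)

def hamming(a, b, bits=32):
    """Number of differing bits between two fixed-width outputs."""
    return bin((a ^ b) & ((1 << bits) - 1)).count("1")

ref = lambda x: x * 2
cand = lambda x: x << 1 if x >= 0 else x * 2 + 1  # buggy on negative inputs
print(mismatch_fraction(ref, cand, list(range(-2, 3))))  # 0.4
print(hamming(0b111111111110, 0b111111111111))           # 1 bit off
```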

Feel free to submit something! A simple submission is probably just a few lines of code.


The approach seems to focus on selecting which optimizations LLVM should apply (e.g., imagine deciding whether a given call should be inlined), so the worst case is that the code is poorly optimized; you can't get a wrong result just by selecting which optimizations to apply.

I agree that what you say is true of compilation as a whole, but that doesn't seem to be the focus here (rather, the compiler is used as a sort of crutch to help the LLM learn).


The trick is finding a way to ensure that the LLM produces something which is always correct. In this case, the LLM only changes compiler optimizations, not the assembly itself, so no matter what it outputs the code is correct; it just may be larger. Other possibilities: an LLM which applies semantics-preserving program transformations, or an LLM combined with a proof assistant to verify the output (more generally, and for any domain, an LLM as an NP oracle).
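That pattern, untrusted proposer plus trusted verifier, can be sketched like this. All names are hypothetical, and a toy exhaustive check over small integers stands in for a real equivalence proof:

```python
# The LLM is an untrusted proposer; an independent checker is the trusted
# verifier. A failed check falls back to the known-correct original, so the
# overall result is always correct regardless of what the model emits.
def optimise_with_llm(program, propose, equivalent):
    candidate = propose(program)        # untrusted model output
    if equivalent(program, candidate):  # trusted verifier (tests, SMT, ...)
        return candidate
    return program                      # safe fallback

orig = lambda x: x + x
good = lambda x: 2 * x   # correct rewrite
bad = lambda x: x * x    # incorrect rewrite
equiv = lambda p, q: all(p(i) == q(i) for i in range(-100, 101))

assert optimise_with_llm(orig, lambda _: good, equiv) is good
assert optimise_with_llm(orig, lambda _: bad, equiv) is orig
```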

But I agree, as of now I haven't seen good uses where LLMs produce reliable output. Not only do you need a guarantee that the output always yields a correct program, you also need a task where an LLM is considerably better than a simple or random algorithm, and you need a lot of training data (severely restricting how creative you can be with the output).



