bcyn's comments | Hacker News

Which models perform anywhere close to Opus 4.5? In my experience none of the local models are even in the same ballpark.

This week: look at Qwen3 Coder Next and GLM 4.7, but it's changing fast.

I wrote this for the scenario where you've run out of quota for the day or week but want a backup plan to keep going; it gives some options with obvious speed and quality trade-offs. There's also always the option to upgrade if your project and use case need Opus 4.5.


I don't think the point about 401k stagnation is true. At most, the fee structure and the selection of available funds change. How did that cost you $500k, exactly?


Why? How do you draw the line between people who deserve to be "surveilled" (if you can even call it that in this case...) vs. people who don't?

You're entitled to your opinion, of course, but it just seems extremely arbitrary.


I don't have a good, rational answer.

I think the idea is vaguely that the upper-upper class statistically must've done something wrong, or has the power to cause extreme harm, so it's okay to snitch on them but not on your regular Joe.

I'm just espousing the standard American middle class views about freedom here. Not trying to argue they are sound or rational.


Great read, thanks! Could you dive a little deeper into example 2 & pre-registration? Conceptually I understand how the probability of false positives increases with the number of variants.

But how does a simple act such as "pre-registration" change anything? It's not as if observing another metric that already existed changes anything about what you experimented with.


If you have many metrics that could possibly be construed as "this was what we were trying to improve", that's many different possibilities for random variation to give you a false positive. If you're explicit at the start of an experiment that you're considering only a single metric a success, it turns any other results you get into "hmm, this is an interesting pattern that merits further exploration" and not "this is a significant result that confirms whatever I thought at the beginning."

It's basically a variation on the multiple comparisons problem, but sneakier: it's easy to spend an hour going through data and, over that time, test dozens of different hypotheses. At that point, whatever p-value you'd compute for a single comparison isn't relevant, because after that many comparisons you'd expect at least one to come in below an uncorrected p = 0.05 by random chance.
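
A quick way to see the inflation (a toy simulation, not tied to any real experiment): give an A/B test 20 metrics with no true effect on any of them, and count how often at least one comes out "significant" at p < 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_users, n_metrics, n_sims = 1000, 20, 2000

    false_alarms = 0
    for _ in range(n_sims):
        # Both groups drawn from the same distribution: no true effect on any metric.
        a = rng.normal(size=(n_users, n_metrics))
        b = rng.normal(size=(n_users, n_metrics))
        pvals = stats.ttest_ind(a, b).pvalue  # one p-value per metric
        false_alarms += (pvals < 0.05).any()

    print(false_alarms / n_sims)  # ~0.64, vs. ~0.05 if only one pre-registered metric counted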


There are many resources that will explain this rigorously if you search for the term “p-hacking”.

The TLDR as I understand it is:

All data has patterns. If you look hard enough, you will find something.

How do you tell the difference between random variance and an actual pattern?

It’s simple and rigorously correct to test the data against only a single pre-specified metric; other methods exist, e.g. the Bonferroni correction (divide the significance threshold alpha by the number of comparisons k), but they are controversial [1] (sketched below).

Basically, are you a statistician? If not, sticking to the best practices in experimentation means your results are going to be meaningful.

If you see a pattern in another metric, run another experiment.
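
To make the Bonferroni point concrete (a minimal sketch; the p-values here are made up for illustration):

    # Bonferroni: compare each p-value against alpha / k (equivalently, multiply each p by k).
    alpha = 0.05
    pvals = [0.004, 0.03, 0.20, 0.41]  # hypothetical p-values from k = 4 metrics
    k = len(pvals)

    significant = [p for p in pvals if p < alpha / k]
    print(significant)  # only 0.004 survives the corrected threshold of 0.0125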

[1] - https://pmc.ncbi.nlm.nih.gov/articles/PMC1112991/


This doesn't make as much sense as you think it does. If you could predictably trade a flip from bearish to bullish (for example; there are of course other trend-based signals), you would not share that signal, because others would overcrowd your trade (by buying/shorting and moving the price toward the trending direction faster than you can).

A potential argument is that these signals are only applicable to a certain bracket of portfolio sizes (e.g. larger AUM funds would not be able to trade this strategy) -- but you are sharing this with folks presumably in your range of portfolio size.


Overcrowding an entry on highly liquid assets is very far from the reality of our service.


The more liquid an asset, the more efficiently it's priced and the fewer trading opportunities remain after accounting for transaction costs. In something like the S&P 500, everything is already priced in.

Meme stocks and shitcoins being manipulated by whales are not efficient and also not as liquid.

The larger point remains that none of the above considerations are discussed on this product's page.


I'm realizing there's a lot of confusion about what trend-based models actually are. I assumed the concept was more widely understood, but clearly we need to explain it better.

To be clear, there's nothing new or innovative about a trend-based model. It's one of the most commonly used investment strategies among institutions, and it's been widely used for far longer than I've been alive.


There's no confusion about the type of edge. Just pointing out that if you are selling an edge rather than trading it yourself, you're either grifting or naive.


It read to me as a (very) small discount for choosing to be billed annually vs. monthly.


The discount is applied to $10/month and is said to make it effectively $8.50 per month, but it's actually $8.25 per month, since they claim to charge $99 per year.


Yeah, it's a $20 discount billed annually.


Then that would be $100 per year, and none of your numbers for the annual plan add up. Your website claims $99 per year and claims that is equivalent to $8.50 per month. None of these three numbers is consistent with the others.
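
Spelling out the arithmetic with the three numbers quoted above:

    print(12 * 10 - 20)  # 100: a flat $20 off $10/month billed annually
    print(99 / 12)       # 8.25: the advertised $99/year works out to this per month
    print(8.50 * 12)     # 102.0: the advertised $8.50/month works out to this per year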


You're correct! I made the fix.


You're correct about valuation, but the parent post was meant to address how much in liquid dollars you should expect to receive vs. the 409a. You are likely to receive less in most cases (read: unless there's a wildly successful public liquidity event) due to liquidation preferences.
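
A stylized example of how a 1x non-participating preference can leave common holders with less than the headline price implies (all numbers made up for illustration; real waterfalls add participation, multiple series, option pools, etc.):

    # Toy liquidation-preference waterfall (hypothetical figures).
    invested = 50_000_000       # capital raised with a 1x liquidation preference
    preferred_pct = 0.40        # preferred's ownership on an as-converted basis
    sale_price = 80_000_000     # acquisition price

    # Preferred takes the better of its preference or its as-converted share.
    preferred_payout = max(invested, preferred_pct * sale_price)
    common_payout = sale_price - preferred_payout

    print(common_payout / (1 - preferred_pct))  # 50M: valuation common effectively realizes
    print(sale_price)                           # 80M: headline price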


Plenty of (non-VC-backed) startups raise some money and then sell privately; in those cases, preference often does not push the common stock value below the most recent 409a.

(In my experience, the 409a comes in on the order of 20% of the most recent raise price, and preference is no more than 50%. And obviously you hope to sell for more than the last raise!)


Any reasonable 409a will be fully aware of those preference terms and will have factored them in.


Very interested to see what the next steps are in evolving the "retrieval" model - I strongly believe this is where we'll see the next step-change improvement in coding models.

Just thinking about how a human engineer approaches a problem. You don't just ingest entire relevant source files into your head's "context" -- well, maybe if your code is broken into very granular files, but often files contain a lot of irrelevant context.

Between architecture diagrams, class relationship diagrams, ASTs, and tracing codepaths through a codebase, there should intuitively be some model of "all relevant context needed to make a code change" - exciting that you all are searching for it.
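
As a toy illustration of that "compact context" idea (a sketch only; real systems would be far more involved), Python's ast module can reduce a file to its class/function skeleton rather than feeding the whole thing into context:

    import ast

    def skeleton(source: str) -> list[str]:
        """Return just the class and function signatures from Python source."""
        out = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                out.append(f"def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                out.append(f"class {node.name}")
        return out

    # e.g. skeleton(open("some_module.py").read()) -> ["class Foo", "def bar(self, x)", ...]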


I have a different POV on retrieval. It's a hard problem to solve in a generalizable way with embeddings. I believe it can be solved at the model level, where the model itself learns to find what it needs to fix an issue. With the model providers (OpenAI, Anthropic) going full stack, there's a real possibility they solve it at the reinforcement-learning level.

E.g., when you teach a model to solve issues in a codebase, the first step is literally getting the right files. Here, basic search (with grep) works very well: with enough training, the model develops an instinct for what to search for given a problem, similar to how an experienced dev has an instinct about a given issue. (This might be what tools like Cursor are also looking at.) (Nothing against anyone, just sharing a POV; I might be wrong.)
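
A minimal sketch of the kind of grep-style tool such an agent might call (assuming ripgrep is installed; the function name and interface are made up for illustration, not anyone's actual tool API):

    import subprocess

    def search_codebase(pattern: str, repo_dir: str, max_hits: int = 50) -> list[str]:
        """Return path:line:text matches for the agent to decide what to open next."""
        result = subprocess.run(
            ["rg", "--line-number", "--max-count", "5", pattern, repo_dir],
            capture_output=True, text=True,
        )
        return result.stdout.splitlines()[:max_hits]

    # The model proposes a pattern ("retry backoff", "class PaymentService", ...),
    # reads the hits, and decides which files to pull fully into context.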

However, the fast apply model is a thing of beauty. Aider uses it and it's just super accurate and very fast.


Definitely agree with you that it's a problem that will be hard to generalize a solution for, and that the eventual solution is likely not embeddings (at least not alone).


Relevant interview extract from the Claude Code team: https://x.com/pashmerepat/status/1926717705660375463

> Boris from the Claude Code team explains why they ditched RAG for agentic discovery.

> "It outperformed everything. By a lot."


This is very cool. They explained the solution better than I did. If I'd known about it, I would have just linked this :)


Adding extra structural information about the codebase is an avenue we're actively exploring. Agentic exploration is a structure-aware system where you're using a frontier model (Claude 4 Sonnet or equivalent) that gives you an implicit binary relevance score based on whatever you're putting into context -- filenames, graph structures, etc.

If a file is "relevant", the agent looks at it and decides whether to keep it in context. This process repeats until there's satisfactory context to make changes to the codebase.
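
In pseudocode-ish Python, that loop looks roughly like this (a sketch of the idea, not our actual implementation; the llm callable is a stand-in for whichever frontier model does the judging):

    def gather_context(task: str, candidate_files: list[str], llm) -> dict[str, str]:
        """Ask the model, file by file, whether each candidate is worth keeping in context."""
        context: dict[str, str] = {}
        for path in candidate_files:
            content = open(path, encoding="utf-8").read()
            verdict = llm(
                f"Task: {task}\nFile: {path}\n{content}\n"
                "Answer KEEP if this file is needed to make the change, else DISCARD."
            )
            if verdict.strip().upper().startswith("KEEP"):
                context[path] = content
            # Stop early once the model judges the gathered context sufficient.
            done = llm(f"Task: {task}\nFiles so far: {sorted(context)}\nEnough context? YES/NO")
            if done.strip().upper().startswith("YES"):
                break
        return context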

The question is whether we actually need a 200b+ parameter model to do this, or if we can distill the functionality onto a much smaller, more economical model. A lot of people are already choosing to do this exploration with Gemini (due to the 1M context window) and then write the code with Claude 4 Sonnet.

Ideally, we want to be able to run this process cheaply in parallel to get really fast generations. That's the ultimate goal we're aiming towards.


Many SaaS products are tools. I'm sure when tractors were first invented, people felt that they didn't "control" them compared to directly holding shovels and doing the same work manually.

Not to say that LLMs are at the same level of reliability as tractors vs. manual labor, but I just think your classification of what is and isn't a tool isn't a fair argument.


I think the OP comment re: AI's value as a tool comes down to this:

Does what it says: When you swing a hammer and make contact, it provides greater and more focused force than your body alone at that same velocity. People who sell hammers make this claim, and sometimes show you that the hammer can even pull out nails really well. The claims about what AI can do are noisy, incorrect, and proffered by people who - I imagine OP thinks, and I'd agree - know better. Essentially they are saying "Hammers are amazing. Swing them around everywhere."

Right to repair: Means an opportunity to understand the guts of a thing and fix it to do what you want. You cannot really do this with AI. You can prompt differently, but it can be unclear why you're not getting what you want.


Intentionally or not, the tractor analogy is a rich commentary on this, but it might not make the point you intend. Look into all the lawsuits and shit like that with John Deere and the DRM lockouts, where farmers are losing whole crops because of remote-shutdown cryptography that can't be removed for less than the cost, or in less than the time, of a new tractor.

People on HN love to bring up farm subsidies, and it's a real issue, but big agriculture has special deals and whatnot. They have redundancy and leverage.

The only time this stuff kicks in is when the person with the little plot needs the next harvest to get solvent, and the only outcome it ever achieves is to push one more family farm on the brink into receivership and directly into the hands of a conglomerate.

Software engineers commanded salaries that The Right People found an affront to the order of things, long after they had gotten doctors and lawyers and other high-skill trades largely brought to heel via joint licensing and pick-a-number tuition debt loads. This isn't easy to pull off in software, for a variety of reasons, but roughly because the history of computer science in academia is a unique one: it's research-oriented in universities (mostly; there are programs with an applied tilt), yet almost everyone signs up, graduates, and heads to industry without a second thought. So back when the other skilled trades were getting organized into the class system, it was kind of an oddity, regarded as almost an eccentric pursuit by deans and shit.

So while CS fundamentals are critical to good SWEs, schools as a rule don't teach them well, any more than a physics undergrad is going to be an asset at CERN: it's prep for theory research most never do. Applied CS is just as serious a topic, but you mostly learn it via serious self-study or from coworkers at companies with chops. Even CS graduates who are legends almost always emphasize that if you're serious about hacking, undergrad CS is remedial by the time you run into it (Coders at Work is full of this sentiment).

So to bring this back to tractors and AI, this is about a stubborn nail in what remains of the upwardly mobile skilled middle class that multiple illegal wage fixing schemes have yet to pound flat.

This one will fail too, but that's another mini blog post.


> LLMs are a codification of internet and written content

Only true for pre-trained foundational models without any domain-specific augmentations. A good AI tool in this space would be fine-tuned or have other mechanisms that overshadow the pre-training from internet content.

