Hacker News | lmeyerov's comments

Once my code exists and passes tests, I generally move on to having it iteratively hunt for bugs, security issues, and DRY code reduction opportunities until it stops finding worthwhile ones.

This doesn't always work as well as I'd like, but largely does enough. Conversely, doing this as I go has been a waste of time.


The phenomenon you're describing is why Cobol programmers still exist, and simultaneously, why it's increasingly irrelevant to most programmers.

The killer feature is ecosystem: Easily and reliably reusing other libraries and tools that work out-of-the-box with other Python code written in the last few years. There are individually neato features motivating the efforts involved in upgrading a widely-used language & engine as well, but that kind of thinking misses the forest for the trees unfortunately.

It's a bit surprising to me, in the age of AI coding, for this to be a problem. Most features seem friendly to bootstrapping with automation (ex: f-strings that support ' not just "), and it's interesting if any don't fall in that camp. The main discussion seems to still be framed by the 2024 comments, before Claude Code etc became widespread: https://github.com/orgs/pypy/discussions/5145 .


The alternative is when you run a script that you last used a few years ago and now need it again for some reason (very common in research) and you might end up spending way too much time making it work with your now upgraded stack.

Sure, you can say you should have pinned dependencies, but that's a lot of overhead for a random script...


Most programmers aren't writing scientific software, which you can tell by claims that nicer f-strings are a pressing concern.

We can play that game - items like GIL-free interpreters and memory views are pretty relevant to folks on the more demanding side of scientific computing. But my point is this is a head-in-sand game when the community vastly outweighs any individual feature. My experience with the scientific computing community is that the non-pypy portion of it is much bigger.

I'm not a pypy maintainer, so my only horse in this race is believing cpython folks benefit from seeing the pypy community prove Things Can Be Better. Part of that means I'd rather pypy live on by avoiding unforced errors.


I liked that they did this work + its sister paper, but disliked how it was positioned as basically the opposite of the truth.

The good: It shows on one kind of benchmark, some flavors of agentically-generated docs don't help on that task. So naively generating these, for one kind of task, doesn't work. Thank you, useful to know!

The bad: Some people assume this means in general these don't work, or automation can't generate useful ones.

The truth: Instruction files help measurably, and just a bit of engineering enables you to guarantee high scores for the typical cases. As soon as you have an objective function, you can flip it into an eval, and set an AI coder to editing these files until they work.

Ex: We recently released https://github.com/graphistry/graphistry-skills for more easily using graphistry via AI coding, and by having our authoring AI loop a bit with our evals, we jumped the scores from 30-50% success rate to 90%+. As we encounter more scenarios (and mine them from our chats etc), it's pretty straightforward to flip them into evals and ask Claude/Codex to loop until those work well too.

We do these kinds of eval-driven AI coding loops all the time, and IMO how to engineer these should be the message, not that they don't work on average. Deeper example near the middle/end of the talk here: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...
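A minimal sketch of the eval-driven loop described above, with every name hypothetical: `run_evals` scores an instruction file against a list of checks, and `propose_edit` stands in for asking an AI coder (Claude/Codex) to revise the file given the failing cases. This is not the graphistry-skills tooling; the toy evals and toy editor exist only to make the loop runnable.

```python
from typing import Callable, List, Tuple

# Hypothetical: an eval is any check that takes the instruction text and passes or fails.
Eval = Callable[[str], bool]

def run_evals(instructions: str, evals: List[Eval]) -> float:
    """Return the fraction of evals the current instruction file passes."""
    if not evals:
        return 1.0
    return sum(1 for check in evals if check(instructions)) / len(evals)

def eval_driven_loop(
    instructions: str,
    evals: List[Eval],
    propose_edit: Callable[[str, List[int]], str],
    target: float = 0.9,
    max_iters: int = 10,
) -> Tuple[str, float]:
    """Keep a proposed edit only if it raises the score; stop at target or budget."""
    score = run_evals(instructions, evals)
    for _ in range(max_iters):
        if score >= target:
            break
        failing = [i for i, check in enumerate(evals) if not check(instructions)]
        candidate = propose_edit(instructions, failing)
        candidate_score = run_evals(candidate, evals)
        if candidate_score > score:  # reject edits that do not help
            instructions, score = candidate, candidate_score
    return instructions, score

# Toy stand-ins (assumptions): each eval just looks for a hint in the file,
# and the "AI coder" appends the hint for the first failing eval.
REQUIRED_HINTS = ["authenticate first", "return a dataframe", "cite the query used"]
toy_evals = [lambda text, h=h: h in text for h in REQUIRED_HINTS]

def toy_propose_edit(text: str, failing: List[int]) -> str:
    return text + "\n- " + REQUIRED_HINTS[failing[0]]

final_text, final_score = eval_driven_loop("# Skill notes", toy_evals, toy_propose_edit)
print(final_score)
```

In a real setup the evals would run the agent on recorded scenarios rather than grep the file, but the accept-only-if-better structure is the same.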


We split our work:

* Specification extraction. We have security.md and policy.md, often per module. Threat model, mechanisms, etc. This is collaborative and gets checked in for ourselves and the AI. Policy is often tricky & malleable product/business/ux decision stuff, while security is technical layers more independent of that or broader threat model.

* Bug mining. It is driven by the above. It is iterative, where we keep running it to surface findings, adversarially analyze them, and prioritize them. We keep repeating until diminishing returns wrt priority levels. Likely leads to policy & security spec refinements. We use this pattern not just for security, but general bugs and other iterative quality & performance improvement flows - it's just a simple skill file with tweaks like parallel subagents to make it fast and reliable.

This lets the AI drive itself more easily and in ways you explicitly care about vs noise.
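One way to read "repeat until diminishing returns wrt priority levels", as a rough sketch: keep running mining rounds and stop once a round surfaces nothing new at or above a chosen priority. `Finding`, `mine_until_diminishing`, and the scripted rounds are all hypothetical; a real round would be an AI pass driven by security.md / policy.md, and the adversarial review step is left out here.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass(frozen=True)
class Finding:
    title: str
    priority: int  # 1 = most severe; larger numbers matter less

def mine_until_diminishing(
    mine_round: Callable[[int, Set[Finding]], List[Finding]],
    min_priority: int = 2,
    max_rounds: int = 10,
) -> List[Finding]:
    """Run rounds until one adds no new finding with priority <= min_priority."""
    seen: Set[Finding] = set()
    for round_no in range(max_rounds):
        new = [f for f in mine_round(round_no, seen) if f not in seen]
        seen.update(new)
        if not any(f.priority <= min_priority for f in new):
            break  # diminishing returns: nothing important surfaced this round
    return sorted(seen, key=lambda f: (f.priority, f.title))

# Toy stand-in (assumption) for an AI bug-mining pass: findings are scripted per round.
SCRIPTED_ROUNDS = [
    [Finding("SQL injection in search", 1), Finding("Verbose error page", 3)],
    [Finding("Missing rate limit on login", 2)],
    [Finding("Typo in log message", 4)],
    [Finding("Never reached", 1)],
]

def toy_round(round_no: int, seen: Set[Finding]) -> List[Finding]:
    return SCRIPTED_ROUNDS[round_no] if round_no < len(SCRIPTED_ROUNDS) else []

findings = mine_until_diminishing(toy_round)
for f in findings:
    print(f.priority, f.title)
```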


In our evals for answering cybersecurity incident investigation questions and even autonomously doing the full investigation, gpt-5.2-codex with low reasoning was the clear winner over non-codex or higher reasoning. 2X+ faster, higher completion rates, etc.

It was generally smarter than pre-5.2 so strategically better, and codex likewise wrote better database queries than non-codex, and as it needs to iteratively hunt down the answer, didn't run out the clock by drowning in reasoning.

Video: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...

We'll be updating numbers on 5.3 and claude, but basically same thing there. Early, but we were surprised to see codex outperform opus here.


I find it confusing in most directions.

Ex: For the above statement, if they're truly dishonest brokers and openly ignore the rules that are inconvenient, they would have zero problems agreeing to Anthropic's terms and then violating them. So what you say may be quite true, but there would still need to be more to the story for it to make sense.

Ex: DoW officials are stating that they were shocked that their vendor checked in on whether signed contractual safety terms were violated: They require a vendor who won't do such a check. But that opens up other confusing oversight questions, eg, instead of a backchannel check, would they have preferred straight to the IG? Or the IG more aggressively checking these things unasked so vendors don't? It's hard to imagine such an important and publicly visible negotiation being driven by internal regulatory politicking.

I wonder if there's a straighter line for all these things. Irrespective of whether folks like or dislike the administration, they love hardball negotiations and to make money. So as with most things in business and government, follow the money...


I have no idea what exactly Anthropic was offering the DoD, but if it were an LLM product, it's possible that the existing guardrails prevented the model from executing on the DoD vision.

"Find all of the terrorists in this photo", "Which targets should I bomb first?"

Even if the DoD wanted to ignore the legal terms, the model itself would not cooperate. DoD required a specially trained product without limitations.


New funding round investors generally get seniority over old

But new money may allow buyouts of existing at that time so early team or investors can cash out a bit early

And common doesn't cash out till IPO or private market equivalent, or yes, gets screwed


> About 200,000 people put money into the scheme, which offered a stake in the company, discounts and perks.

Hopefully they took advantage of the discounted beer.


My intuition is it comes down to error-correcting codes. We're dealing with lossy systems that get off track, so including parity bits helps.

Ex: <message>...</message> helps keep track. Even better? <message78>...</message78>. That's ugly xml, but great for LLMs. Likewise, using standard ontologies for identifiers (ex: we'll do OCSF, ATT&CK, & CIM for splunk/kusto in louie.ai), even if they're not formally XML.

For all these things... these intuitions need backing by evals in practice, which is part of why I begrudgingly flipped from JSON to XML
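A small sketch of the numbered-tag idea, treating the repeated number in the closing tag as the "parity bit". The helper names `wrap_numbered` and `check_numbered` are made up for illustration, and whether this actually helps a given model still needs evals.

```python
import re
from typing import List

def wrap_numbered(messages: List[str], tag: str = "message") -> str:
    """Wrap each message in a uniquely numbered open/close tag pair."""
    return "\n".join(
        f"<{tag}{i}>{text}</{tag}{i}>" for i, text in enumerate(messages, start=1)
    )

def check_numbered(block: str, tag: str = "message") -> List[int]:
    """Return ids whose closing tag repeats the same number as the opening tag."""
    name = re.escape(tag)
    # \1 forces the close tag to carry the same number as the open tag.
    pattern = re.compile(rf"<{name}(\d+)>(.*?)</{name}\1>", re.DOTALL)
    return [int(m.group(1)) for m in pattern.finditer(block)]

prompt = wrap_numbered(["alpha", "beta"])
print(prompt)

# Simulate a lossy echo where one closing tag drifted to the wrong number.
corrupted = prompt.replace("</message1>", "</message2>")
print(check_numbered(prompt), check_numbered(corrupted))
```

With plain `<message>` tags the corrupted case would be undetectable; with numbered tags, message 1 simply fails to validate.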


This feels right in theory and wrong in practice

When measuring speed running blue team CTFs ("Breaking BOTS" talk at Chaos Congress), I saw a ~2x difference in speed (~= tokens) for database usage between curl (~skills) vs mcp (~python). In theory you can rewrite the mcp into the skill as .md/.py, but at that point ...

Also I think some people are talking past one another in these discussions. The skill format is a folder that supports dropping in code files, so much of what MCP does can be copy-pasted into that. However, many people discussing skills mean markdown-only and letting the LLM do the rest, which would require a fancy bootstrapping period to make as smooth as the code version. I'd agree that skills, when they are a folder that comes with code, do feel like they largely obviate MCPs for solo use cases, until you consider remote MCPs & OAuth, which seem unaddressed and core in practice for wider use.


We do a fun variant of this for louie.ai when working with database and especially log systems -- think incident response, SRE, devops, outage investigations: instead of returning DB query results to the LLM, we create dataframes (think in-memory parquet). These directly go into responses with token-optimized summary views, including hints like "... + 1M rows", so the LLM doesn't have to drown in logs and can instead decide to drill back into the dataframe more intelligently. Less iterative query pressure on operational systems, faster & cheaper agentic reasoning iterations, and you get a nice notebook back with the interactive data views.
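A rough sketch of what a token-optimized summary view could look like. This is an assumption about the shape, not louie.ai's actual format: a list of dicts stands in for the dataframe so the example needs only the standard library, and `summary_view` plus the `df_1` handle are hypothetical names.

```python
from typing import Dict, List

def summary_view(rows: List[Dict[str, object]], handle: str, max_rows: int = 3) -> str:
    """Render a short preview plus a remaining-row hint instead of the full result."""
    if not rows:
        return f"{handle}: 0 rows"
    columns = list(rows[0].keys())
    lines = [
        f"{handle}: {len(rows):,} rows x {len(columns)} cols",
        " | ".join(columns),
    ]
    for row in rows[:max_rows]:
        lines.append(" | ".join(str(row[c]) for c in columns))
    remaining = len(rows) - max_rows
    if remaining > 0:
        lines.append(f"... + {remaining:,} rows")  # hint that more data is available
    return "\n".join(lines)

# Fake log table: every 97th request is a 500.
logs = [
    {"ts": i, "host": f"web-{i % 4}", "status": 500 if i % 97 == 0 else 200}
    for i in range(10_000)
]
view = summary_view(logs, handle="df_1")
print(view)

# The agent can then drill into the stored rows rather than re-querying the database.
errors = [r for r in logs if r["status"] == 500]
errors_view = summary_view(errors, handle="df_1_errors")
print(errors_view.splitlines()[0])
```

The LLM only ever sees the few preview lines and the row-count hint; the full rows stay on the harness side under the handle.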

A curious thing about the MCP protocol is it in theory supports alternative content types like binary ones. That has made me curious about shifting much of the data side of the MCP universe from text/json to Apache Arrow, and making agentic harnesses smarter about these just as we're doing in louie.

