Hacker News | binalpatel's comments

The agent making its own harness idea is really powerful. I gave it a try here with some opinionated choices:

https://github.com/caesarnine/binsmith

Been running it on a locked down Hetzner server + using Tailscale to interact with it and it's been surprisingly useful even just defaulting to Gemini 3 Flash.

It feels like the general shape of things to come - if agents can code then why can't they make their own harness for the very specific environments they end up in (whether it's a business, or a super personalized agent for a user, etc). How to make it not a security nightmare is probably the biggest open question and why I assume Anthropic/others haven't gone full bore into it.


Another way to isolate agents on a server is via LXC containers (disclosure: my project):

https://github.com/jgbrwn/vibebin


This is admittedly dated, but even back in December 2023 GPT-4 with its Vision preview was able to very reliably do structured extraction, and I'd imagine Gemini 3 Flash is much better than it was back then.

https://binal.pub/2023/12/structured-ocr-with-gpt-vision/

Back of the napkin math (which I could be messing up completely) but I think you could process a 100 page PDF for ~$0.50 or less using Gemini 3 Flash?

> 560 input tokens per page * 100 pages = 56,000 tokens = $0.028 input ($0.50/M input tokens)

> ~1000 output tokens per page * 100 pages = 100,000 tokens = $0.30 output ($3/M output tokens)

(https://ai.google.dev/gemini-api/docs/gemini-3#media_resolut...)
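A quick sketch of that math in code (the per-page token counts and per-million-token prices are the assumptions from the estimate above, not official figures):

```python
# Hypothetical cost estimate for OCR'ing a 100-page PDF with Gemini 3 Flash.
# Token counts per page and per-million-token prices are assumed, not official.
pages = 100
input_tokens_per_page = 560      # assumed media-resolution token cost per page
output_tokens_per_page = 1000    # assumed structured-extraction output per page

input_price_per_m = 0.50         # assumed $ per 1M input tokens
output_price_per_m = 3.00        # assumed $ per 1M output tokens

input_cost = pages * input_tokens_per_page / 1_000_000 * input_price_per_m
output_cost = pages * output_tokens_per_page / 1_000_000 * output_price_per_m

print(f"input:  ${input_cost:.3f}")                 # $0.028
print(f"output: ${output_cost:.3f}")                # $0.300
print(f"total:  ${input_cost + output_cost:.3f}")   # $0.328
```

So even with the rough numbers, the whole document lands well under $0.50.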


Sure, for some small projects I recommend that my friends use Gemini 3 Flash. ocrbase is aimed more at scale and self-hosting: fixed infra cost, high throughput, and no data leaving your environment. At large volumes, that tradeoff starts to matter more than per-100-page pricing.


I went down (and continue to go down) this rabbit hole and agree with the author.

I tried a few different ideas, and the most stable/useful so far has been giving the agent a single run_bash tool, explicitly prompting it to create and improve composable CLIs, and injecting knowledge about these CLIs back into its system prompt (similar to how agent skills work).

This leads to really cool patterns like:

1. User asks for something

2. Agent can't do it, so it creates a CLI

3. Next time it's aware of the CLI and uses it. If the user asks for something it can't do, it either improves the CLI it made or creates a new CLI.

4. Each interaction results in updated/improved toolkits for the things you ask it for.

You as the user can use all these CLIs as well, which ends up being an interesting side-channel way of interacting with the agent (you add a todo using the same CLI it uses, for example).

It's also incredibly flexible: yesterday I made a "coding agent" by having it create tools to inspect/analyze/edit a codebase, and it could go off and do most things a coding agent can.

https://github.com/caesarnine/binsmith
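For the curious, a minimal sketch of that setup (file names are hypothetical, and the LLM client is omitted): the only tool exposed is run_bash, and the system prompt re-injects notes about whatever CLIs the agent has already built.

```python
# Minimal sketch of a single-tool agent harness: the only tool is run_bash,
# and notes about previously built CLIs are injected into the system prompt
# each session, so tools persist and improve across conversations.
import subprocess
from pathlib import Path

TOOLKIT_DIR = Path("toolkit")            # where the agent writes its CLIs
NOTES_FILE = TOOLKIT_DIR / "NOTES.md"    # agent-maintained descriptions of them

def run_bash(command: str, timeout: int = 60) -> str:
    """Execute a shell command and return combined output for the model."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def system_prompt() -> str:
    """Build the system prompt, folding in notes on existing CLIs."""
    notes = NOTES_FILE.read_text() if NOTES_FILE.exists() else "(none yet)"
    return (
        "You have one tool: run_bash. Build and improve composable CLIs in "
        f"{TOOLKIT_DIR}/ and keep {NOTES_FILE} up to date.\n\n"
        f"Existing CLIs:\n{notes}"
    )
```

The actual agent loop is just: call the model with system_prompt(), execute any run_bash calls it emits, and feed the output back.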


Every individual programmer having locally-implemented idiosyncratic versions of sed and awk with imperfect reconstruction between sessions sounds like a regression to me


Why would it recreate sed and awk? The screenshot from the repo even shows it using sed.


I already treat awk syntax as something idiosyncratic, so not much would change for me.


But -- I think that's only because of the friction of having to read and parse what they did, which, to me, could greatly be alleviated by AI itself.

Put differently -- for those who'd like to share, yes, give me your locally implemented idiosyncrasies with a little AI to help explain to me what's going on, and I feel like that's a sweet spot between "AI do the thing" and "give me raw code".


I've been on a similar path. Will have 1000 skills by the end of this week, arranged in an evolving DAG. I'm loving the bottom-up emergence of composable use cases. It's really getting me to rethink computing in general.


Interesting. Could you provide a bit more detail on how the DAG emerges?


2026 paper titled Evolving Programmatic Skill Networks, operationalized in Claude Code


how are they stored?


Have you done a comparison on token usage + cost? I'd imagine there would be some level of re-inventing the wheel (i.e. rewriting code for very similar tasks) for common tasks, or do you re-use previously generated code?


It reuses previously generated code, so tools it creates persist from session to session. It also lets the LLM avoid actually "seeing" the tokens in some cases, since it can pipe directly between tools or write to disk instead of having output returned into the LLM's context window.
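A sketch of the piping point (the tool names the agent would use are hypothetical; plain shell commands stand in here so it runs): intermediate output flows tool-to-tool or to disk, and only the small final result re-enters the model's context.

```python
# Sketch: the agent chains its CLIs in a single shell pipeline, so the
# bulky intermediate data never passes through the LLM's context window.
import subprocess

# Pretend the stages are agent-built CLIs (e.g. fetch_page | summarize);
# here printf and wc stand in so the sketch is self-contained.
command = "printf 'line1\\nline2\\nline3\\n' | wc -l > /tmp/result.txt"
subprocess.run(command, shell=True, check=True)

# Only this tiny result would be returned to the model.
final = open("/tmp/result.txt").read().strip()
print(final)  # 3
```

Token cost is then proportional to the final answer, not the data processed along the way.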


The point where that breaks down is “next time it’s aware of the CLI and uses it”. That only really works well inside the same session, and often the next session it will create a different tool and use that one.


> That only really works well inside the same session

That was already "fixed" by people adding snippets to agents.md and it worked. Now it's even more streamlined with skills. You can even have cc create a skill after a session (i.e. prompt it like "extract the learnings from this session and put them into a skill for working with this specific implementation of sqlite"). And it works, today.



> I prefer the more deterministic behavior of MCP for complex multi-step tasks, and the fact that I can do it effectively using smaller, cheaper models is just icing on the cake.

Yeah, that makes sense. That's not what the person I replied to was talking about, though. Skills work fine for "loading context pertinent to one type of task", such as working on a feature without "forgetting" what was done in the previous session.

The article deals with specific, somewhat predefined workflows.


Even if you document the tool and tell it what it can do?


Hey that sounds a lot like the project I’m working on, with the twist that it’s containerized. It’s still in dev https://github.com/brycewcole/capsule-agents


That’s pretty cool. Is it practical? What have you used it for?


I've been using it daily. So far it's built CLIs for Hacker News, BBC news, weather, a todo manager, fetching/parsing webpages, etc. I asked it to make a "daily briefing" one that just composes some of them. So the first thing it runs when I message it in the morning is the daily briefing, which gives me a summary of top tech news/non-tech news, the weather, and my open tasks between work/personal. I can ask for follow-ups like "summarize the top 5 stories on HN" and it can fetch the content and show it to me in full or give me a bullet list of the key points.

Right now I'm thinking through how to make it more "proactive" even if it's just a cron that wakes it up, so it can do things like query my emails/calendar on an ongoing basis + send me alerts/messages I can respond to instead of me always having to message it first.


Cool to see lots of people independently come to "CLIs are all you need". I'm still not sure if it's a short-term bandaid because agents are so good at terminal use, or if it's part of a longer-term trend, but it's definitely felt much more seamless to me than MCPs.

(my contribution, one of many: https://github.com/caesarnine/binsmith)


I am also not sure if MCP will eventually be fixed to allow more control over context, or if the CLI approach really is the future for Agentic AI.

Nevertheless, I prefer the CLI for other reasons: it is built for humans and is much easier to debug.



100% - sharing CLIs with the agent has felt like another channel to interact with them once I've done it enough, like a task manager the agent and I both use through the same interface.


Thank you for posting binsmith, I've built something similar over the past few days and you've made some great decisions in here


MCP lets you hide secrets from the LLM


You can do the same thing with a CLI via env vars, no?
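Roughly like this, as a sketch (the CLI is stood in by a shell test so it runs): the secret goes into the child process's environment, so neither the command string the model writes nor the output it reads ever contains the value.

```python
# Sketch: pass a secret to a CLI via the environment instead of argv, so it
# never appears in the command string or the tool output the LLM sees.
import os
import subprocess

secret = "sk-not-a-real-key"  # in practice, loaded from a secret store

env = dict(os.environ, API_KEY=secret)
# A real agent-built CLI would read API_KEY itself; this shell stand-in
# just proves the variable is set without echoing its value back.
result = subprocess.run(
    'test -n "$API_KEY" && echo "API_KEY is set"',
    shell=True, env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # API_KEY is set
```

This keeps the secret out of the transcript, though unlike Dagger-style secret handling it won't stop the agent from deliberately printing the variable.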


Yes. I'm using Dagger, which has great secret support, obfuscating them so that even if the agent, for example, cats the contents of a key file, it will never be able to read or print the secret value itself.

tl;dr there are a lot of ways to keep secret contents away from your agent, some without actually having to keep them "physically" separate


Hey this looks cool. So each agent or session is one thread. Nice. I like it.


Pydantic-AI is lovely - I've been working on a forever fun project to build a coding agent CLI for over a year now. IMO it does make constructing any given agent very easy, though the lower-level APIs are a little painful to use; they seem to be aware of that.

https://github.com/caesarnine/rune-code

Part of the reason I switched to it initially wasn't so much its niceties as just being disgusted at how poor the documentation/experience of using LiteLLM was, and I thought the folks who make Pydantic would do a better job of the "universal" interface.


I had the opposite experience. I liked the niceties of Pydantic AI, but had trouble with it that I found difficult to deal with. For example, some of the models wouldn't stream, but the OpenAI models did. It took months to resolve, and well before that I switched to LiteLLM and just hand-rolled the agentic logic stuff. LiteLLM's docs were simple and everything worked as expected. The agentic code is simple enough that I'm not sure what the value-add for some of these libraries is besides adding complexity and the opportunity for more bugs. I'm sure for more complex use cases they can be useful, but for most of the applications I've seen, a simple translation layer like LiteLLM or maybe OpenRouter is more than enough.


I'm not sure how long ago you tried streaming with Pydantic AI, but as of right now we (I'm a maintainer) support streaming against the OpenAI, Claude, Bedrock, Gemini, Groq, HuggingFace, and Mistral APIs, as well as all OpenAI Chat Completions-compatible APIs like DeepSeek, Grok, Perplexity, Ollama and vLLM, and cloud gateways like OpenRouter, Together AI, Fireworks AI, Azure AI Foundry, Vercel, Heroku, GitHub and Cerebras.


This is super impressive.

Interesting they're framing this more from the world model/agent environment angle, when this seems like the best example so far of generative games.

720p realtime mostly consistent games for a minute is amazing, considering stable diffusion was originally released 2ish years ago.


Pixelspace is an awful place to be generating 3D assets and maintaining physical self-consistency.


Ultimately even conventional 3d assets are rendered into pixelspace. It all comes down to the constraints in the model itself.


A key strength of conventional 3d assets is that their form is independent of the scenes in which they will be rendered. Models that work purely in pixel space avoid the constraints imposed by representing assets in a fixed format, but they have to do substantial extra work to even approximate the consistency and recomposability of conventional 3d assets. It's unclear whether current approaches to building and training purely pixel-based models will be able to achieve a practically useful balance between their greater flexibility and higher costs. World Labs, for example, seems to be betting that an intermediate point of generating worlds in a flexible but structured format (NERFs, gauss splats, etc) may produce practical value more quickly than going straight for full freedom and working in pixel space.


Under $100:

The heat-gun mosquito things that some tech folks were mentioning on Twitter. I always get quarter-sized, terribly itchy bumps from mosquito bites, and using it makes them essentially itch-free immediately.

Under $1000:

Weekly house cleaning. Such reduced cognitive load and increased free time to not have to clean all the time, think about cleaning, etc., especially with a toddler.


If you mean Bite-Away, then yes. I bought mine with leftover FSA funds and as far as I can tell it works, placebo effect or not.

https://www.bite-away.com/en/


I got the much smaller USB-C keyring Heat-It, and it works so great. Previous submission: https://shkspr.mobi/blog/2023/12/usb-c-cures-mosquito-bites/ https://news.ycombinator.com/item?id=41548336

There was a study recently on it, which feels fairly encouraging. https://pmc.ncbi.nlm.nih.gov/articles/PMC3257884/

The real winner this year was the $30 (rainbow, because it's cool) mosquito net. It's been shockingly hard to set it up really well; some still get through, but I can sit outside all day & break out the electric swatter two or three times & be fine. And I keep tuning the net a little... (I used to lug a bunch of fans in and out of the house to keep them off me, but that was only semi-successful & made it a project each time.)


I have a basic mosquito net above my bed and it is miraculous (I got the first one this year). What is particularly interesting is that I am not bothered by mosquito buzzing even when they fly close, because somehow my brain knows that they cannot bite me.


I used to just run a spoon under very hot water and then hold it against the bite at the hottest I could tolerate, and it works pretty well, albeit not very temperature-accurate.


A suitably hot shower works well too.


Clever - going to steal this.


Another cheap mosquito bite remedy: NOW Foods Tea Tree Roll-On ($5)

Had some big, angry welts this summer but they just stopped itching and disappeared overnight after applying that stuff, no other topical drug/ointment I tried came close.


This is going to wreck Goldman's next analysis.

(context: https://the-decoder.com/goldman-sachs-blunder-adds-to-ai-sto...)



My personal take (an overly biased view after reading Chip War recently) is pretty much that: it seems like a lot of the same early dynamics of semiconductors are playing out here.

Very large R&D expenditures for the next iterations of the models at the leading edge (the "fabs" of the world), everything downstream getting much cheaper and better with demand increasing as a result.

Like a world where Claude Opus 3.5 is incredibly expensive to train and run, but also results in a Claude Haiku that's on net better than the Opus of the prior generation, occurring every cycle.

