JoshPurtell's comments | Hacker News

As a demonstration, I've built Crafter in Lean and hosted it on the web using Lithe:

https://lean-crafter-production.up.railway.app/ https://github.com/JoshuaPurtell/lean-crafter


At some point in the future (if not already), Claude will install malware less often than humans do, on average - just like Waymos crash less frequently than human drivers.

Once you accept that installation will be automated, standardized formats make a lot of sense. The big question is whether this particular format, which seems solid, will get adopted - probably mostly a timing question.


lmao


I had a high opinion of PS before this comment.

Now I have a higher opinion of PS


Looks really good!


Thanks!


34% are between 18 and 65


Being between 18 and 65 does not mean you're in good health, sadly.


The median age for MAID recipients is over 77.


Both can be true, I think.


50% is not a vast majority, so that's a red herring


50% can be a massive majority if there is a plethora of opinions


What does that have to do with the post you replied to? Is your implication that young people can’t enter hospice or have terminal illness? Or are we just quoting random statistics without any regard to relevance whatsoever?


Something I'd consider a game-changer would be making it really easy to kick off multiple Claude instances to tackle a large research task, and then to view the results and collect them into a final research document.

IME, no matter how well I prompt, a single Claude/Codex will never get a successful implementation of a significant feature single-shot. However, what does work is having 5 Claudes try it, reading the code, and cherry-picking the diff segments I like into one franken-spec that I give to a final Claude instance with essentially just "please implement something like this".

It's super manual and annoying with git worktrees for me, but it sounds like your setup could make it slick.
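For anyone who wants to try the manual loop, here's a rough sketch of how I think about it (this assumes the claude CLI has a non-interactive -p/print mode; the branch names, paths, and prompt are just placeholders, not my actual setup):

  # Rough sketch: fan the same prompt out to N git worktrees, one run each.
  # Assumes a `claude` CLI with a non-interactive -p (print) mode; branch
  # names, paths, and the prompt are placeholders.
  import subprocess
  from pathlib import Path

  PROMPT = "Implement feature X as described in SPEC.md"
  N = 5

  for i in range(N):
      wt = Path(f"../attempt-{i}")
      subprocess.run(["git", "worktree", "add", "-b", f"attempt-{i}", str(wt)], check=True)
      with open(wt / "run.log", "w") as log:
          subprocess.run(["claude", "-p", PROMPT], cwd=wt, stdout=log, check=False)

  # Then: diff each attempt against main, cherry-pick the hunks you like into
  # one franken-spec, and hand that spec to a final instance.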


Interesting. So, do you just start multiple instances of Claude Code and ask the same prompt on all of them? Manually cherry picking from 5 different worktrees sounds complicated. Will see what I can do :)


Yeah, exactly, same prompt.

I agree, it's more complex. But I feel like the potential of a Claude Code wrapper is precisely in enabling workflows that are a pain to self-implement but are nonetheless incredibly powerful.


This is a really great idea!

It sounds like your first approach is to verify that events met an agreed-upon threshold.

Have you looked into Mechanism Design and getting customers to e.g. pay more for great outcomes, and a little for ok outcomes?


Thank you - yes! Would love to learn more about mechanism design. Mind diving deeper?


The basic idea is this. The customer has some "curve" that represents how much he values different outcomes. Maybe he values good outcomes at $1, and great outcomes at $100. The supplier also has a cost curve - by definition, it will cost him more to supply a great outcome than a good outcome (otherwise he'd just always supply the great outcome).

Setting a fixed price is a simple way to help these two parties transact. But hypothetically, it may be more efficient - e.g. you will let more mutually beneficial events happen - to ask both parties what their number is for a given event, and have them transact when the numbers are far enough apart (cost is $10, value is $100).

The problem is, you can't directly ask the parties, because they don't want to reveal how high/low they're willing to go for no reason. So, you should essentially structure your questions into a pre-defined algorithm so that everyone is incentivized to reveal at least the ballpark of where their cost/value is. The study of how to structure those questions is a subset of mechanism design / information design, which is a branch of Econ related to game theory
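To make that a bit more concrete, here's a toy sketch (illustrative only, all numbers made up): both parties report a number, a trade happens only when the reported value exceeds the reported cost, and the price splits the difference. Real mechanisms (e.g. VCG, double auctions) are carefully designed so that roughly truthful reporting is an equilibrium; this naive split-the-difference rule is not, which is exactly why the question-structuring part matters.

  # Toy sealed-bid trade rule: each side reports a number; trade only if the
  # reported value exceeds the reported cost, at the midpoint price.
  def propose_trade(reported_cost: float, reported_value: float):
      """Return (trade_happens, price)."""
      if reported_value <= reported_cost:
          return False, None              # no mutually beneficial trade
      return True, (reported_cost + reported_value) / 2

  # Supplier reports the great outcome costs $10, customer reports it's worth $100:
  ok, price = propose_trade(10.0, 100.0)
  print(ok, price)  # True 55.0 -- both sides do better than not trading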


FWIW, if this sounds like arcane academic musing ... applied mechanism design was, for a while, essentially just the study of Google ad auctions, and Google invested very, very heavily in researchers to figure out how to do this for them.


Very very interesting. It makes a lot of sense. I appreciate you sharing.


Definitely digging deeper into this now. I think it becomes more and more important as models improve.


RL is not about delayed reward. Multi-armed bandit problems have no credit-assignment component, yet they're often the first RL problems taught.

In its most general form, RL is about learning a policy (a state -> action mapping), which often requires inferring value, etc.

But copying a strong reference policy ... is still learning a policy. Whether by SFT or not
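To make the bandit point concrete, here's a minimal epsilon-greedy sketch (reward probabilities and hyperparameters are made up): there's no delayed reward or credit assignment, but what gets learned is still a policy - here, a mapping from the single state to an arm.

  # Minimal epsilon-greedy bandit: one state, K arms, immediate reward only.
  # No credit assignment across time, but the learned value estimates still
  # define a policy (state -> action), which is the general RL framing.
  import random

  K = 3
  true_means = [0.2, 0.5, 0.8]   # hidden reward probabilities
  estimates = [0.0] * K
  counts = [0] * K
  epsilon = 0.1

  for step in range(10_000):
      if random.random() < epsilon:
          arm = random.randrange(K)                        # explore
      else:
          arm = max(range(K), key=lambda a: estimates[a])  # exploit
      reward = 1.0 if random.random() < true_means[arm] else 0.0
      counts[arm] += 1
      estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

  print(estimates)  # should approach [0.2, 0.5, 0.8]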


Hey Simon, I'm happy to oblige here. What would be the most exciting, definitive demonstration?

Do you have a dataset or task in mind?


The three things I'd be most interested in seeing are:

1. A fine-tuned model for structured data extraction. Get something that's REALLY good at outputting in a specific JSON format, then show it running against a wide range of weird inputs.

2. A fine-tuned vision LLM that gains a new ability that the underlying model did not have, such as identifying different breeds of common California garden birds

3. Text to SQL. Text to SQL is always a great demo for this stuff, a fine-tuned model that's demonstrably "better" at text to SQL for a specific complex database schema would be a really great example.


FWIW, here is a case study from Shopify covering a project of theirs that used fine-tuning on a bimodal model to extract product features. I get that this is not the situation you care about -- they are running at such scale that they need inference to be cheap.

https://www.llama.com/static-resource/llama-case-study-shopi...


Awesome! I have one eval in mind that I think might demonstrate each of these capabilities, at least to a fair extent


Open request to skeptics or curious minds: do you have a task that's at least somewhat less difficult for me to set up than SWE-bench?

I'd be happy to create a base agent and a fine-tuned agent for you, and open-source the traces for you to look at directly.

And if it's really compelling, visualize them in a hosted frontend :-)


A really simple blog post for any task that you think is worthwhile would be enough to move the field forward. The blog post should include:

1) the training configuration and code

2) the data used to fine tune

3) a set of input/output comparisons between the tuned bot and the original bot that show it's learned something interesting

For something really compelling, it would host the created models in a repo that I could download and use. The gold standard would be to host them and provide a browser interface, but that could be expensive in GPU costs.

This blog post currently doesn't exist, or if it does, I haven't been able to find it in the sea of Medium articles detailing an outdated Hugging Face API.

