JoshPurtell's comments | Hacker News

As a demonstration, I've built Crafter in Lean and hosted it on the web using Lithe:

https://lean-crafter-production.up.railway.app/ https://github.com/JoshuaPurtell/lean-crafter


At some point in the future (if not already), Claude will install malware less often than humans do, on average - just like Waymos crash less frequently than human drivers.

Once you accept that installation will be automated, standardized formats make a lot of sense. The big question is whether this particular format, which seems solid, will get adopted - probably mostly a timing question.


lmao


I had a high opinion of PS before this comment.

Now I have a higher opinion of PS


Looks really good!


Thanks!


34% are between 18 and 65


Being between 18 and 65 does not mean you're in good health, sadly.


The median age for MAID recipients is over 77.


Both can be true, I think.


50% is not a vast majority, so that's a red herring


50% can be a massive majority if there is a plethora of opinions


What does that have to do with the post you replied to? Is your implication that young people can’t enter hospice or have terminal illness? Or are we just quoting random statistics without any regard to relevance whatsoever?


Something I'd consider a game-changer would be making it really easy to kick off multiple Claude instances to tackle a large research task, and then to view the results and collect them into a final research document.

IME, no matter how well I prompt, a single Claude/Codex will never get a successful implementation of a significant feature single-shot. However, what does work is having 5 Claudes try it, reading the code, and cherry-picking the diff segments I like into one franken-spec that I give to a final Claude instance with essentially just "please implement something like this".

It's super manual and annoying with git worktrees for me, but it sounds like your setup could make it slick.
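For anyone who wants to try the manual loop, here's a rough sketch of how I think about it (this assumes the claude CLI has a non-interactive -p/print mode; the branch names, paths, and prompt are just placeholders, not my actual setup):

  # Rough sketch: fan the same prompt out to N git worktrees, one run each.
  # Assumes a `claude` CLI with a non-interactive -p (print) mode; branch
  # names, paths, and the prompt are placeholders.
  import subprocess
  from pathlib import Path

  PROMPT = "Implement feature X as described in SPEC.md"
  N = 5

  for i in range(N):
      wt = Path(f"../attempt-{i}")
      subprocess.run(["git", "worktree", "add", "-b", f"attempt-{i}", str(wt)], check=True)
      with open(wt / "run.log", "w") as log:
          subprocess.run(["claude", "-p", PROMPT], cwd=wt, stdout=log, check=False)

  # Then: diff each attempt against main, cherry-pick the hunks you like into
  # one franken-spec, and hand that spec to a final instance.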


Interesting. So, do you just start multiple instances of Claude Code and ask the same prompt on all of them? Manually cherry picking from 5 different worktrees sounds complicated. Will see what I can do :)


Yeah, exactly, same prompt.

I agree, it's more complex. But I feel like the potential of a Claude Code wrapper is precisely in enabling workflows that are a pain to self-implement but are nonetheless incredibly powerful.


This is a really great idea!

It sounds like your first approach is to verify that events met an agreed-upon threshold.

Have you looked into Mechanism Design and getting customers to e.g. pay more for great outcomes, and a little for ok outcomes?


Thank you - yes! Would love to learn more about mechanism design. Mind diving deeper?


The basic idea is this. The customer has some "curve" that represents how much he values different outcomes. Maybe he values good outcomes at $1, and great outcomes at $100. The supplier also has a cost curve - by definition, it will cost him more to supply a great outcome than a good outcome (otherwise he'd just always supply the great outcome).

Setting a fixed price is a simple way to help these two parties transact. But hypothetically, it may be more efficient - e.g. you will let more mutually beneficial events happen - to ask both parties what their number is for a given event, and have them transact when the numbers are far enough apart (cost is $10, value is $100).

The problem is, you can't directly ask the parties, because they don't want to reveal how high/low they're willing to go for no reason. So, you should essentially structure your questions into a pre-defined algorithm so that everyone is incentivized to reveal at least the ballpark of where their cost/value is. The study of how to structure those questions is a subset of mechanism design / information design, which is a branch of Econ related to game theory
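To make that a bit more concrete, here's a toy sketch (illustrative only, all numbers made up): both parties report a number, a trade happens only when the reported value exceeds the reported cost, and the price splits the difference. Real mechanisms (e.g. VCG, double auctions) are carefully designed so that roughly truthful reporting is an equilibrium; this naive split-the-difference rule is not, which is exactly why the question-structuring part matters.

  # Toy sealed-bid trade rule: each side reports a number; trade only if the
  # reported value exceeds the reported cost, at the midpoint price.
  def propose_trade(reported_cost: float, reported_value: float):
      """Return (trade_happens, price)."""
      if reported_value <= reported_cost:
          return False, None              # no mutually beneficial trade
      return True, (reported_cost + reported_value) / 2

  # Supplier reports the great outcome costs $10, customer reports it's worth $100:
  ok, price = propose_trade(10.0, 100.0)
  print(ok, price)  # True 55.0 -- both sides do better than not trading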


FWIW, if this sounds like arcane academic musing ... applied mechanism design was, for a while, essentially just the study of Google ad auctions, and Google invested very, very heavily in researchers to figure out how to do this for them.


Very very interesting. It makes a lot of sense. I appreciate you sharing.


Definitely digging deeper into this now. I think it becomes more and more important as models improve.


RL is not about delayed reward. Multi-armed bandit problems have no credit-assignment component, yet they're often the first RL problems taught.

In its most general form, RL is about learning a policy (a state -> action mapping), which often requires inferring value, etc.

But copying a strong reference policy ... is still learning a policy. Whether by SFT or not
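To make the bandit point concrete, here's a minimal epsilon-greedy sketch (reward probabilities and hyperparameters are made up): there's no delayed reward or credit assignment, but what gets learned is still a policy - here, a mapping from the single state to an arm.

  # Minimal epsilon-greedy bandit: one state, K arms, immediate reward only.
  # No credit assignment across time, but the learned value estimates still
  # define a policy (state -> action), which is the general RL framing.
  import random

  K = 3
  true_means = [0.2, 0.5, 0.8]   # hidden reward probabilities
  estimates = [0.0] * K
  counts = [0] * K
  epsilon = 0.1

  for step in range(10_000):
      if random.random() < epsilon:
          arm = random.randrange(K)                        # explore
      else:
          arm = max(range(K), key=lambda a: estimates[a])  # exploit
      reward = 1.0 if random.random() < true_means[arm] else 0.0
      counts[arm] += 1
      estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

  print(estimates)  # should approach [0.2, 0.5, 0.8]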


Hey Simon, I'm happy to oblige here. What would be the most exciting, definitive demonstration?

Do you have a dataset or task in mind?


The three things I'd be most interested in seeing are:

1. A fine-tuned model for structured data extraction. Get something that's REALLY good at outputting in a specific JSON format, then show it running against a wide range of weird inputs.

2. A fine-tuned vision LLM that gains a new ability that the underlying model did not have, such as identifying different breeds of common California garden birds

3. Text to SQL. Text to SQL is always a great demo for this stuff, a fine-tuned model that's demonstrably "better" at text to SQL for a specific complex database schema would be a really great example.


FWIW, here is a case study from Shopify covering a project of theirs that used fine-tuning on a bimodal model to extract product features. I get that this is not the situation you care about -- they are running at such scale that they need inference to be cheap.

https://www.llama.com/static-resource/llama-case-study-shopi...


Awesome! I have one eval in mind that I think might demonstrate each of these capabilities, at least to a fair extent


Open request to skeptics or curious minds: do you have a task that's at least somewhat less difficult for me to set up than SWE-bench?

I'd be happy to create a base agent and a fine-tuned agent for you, and open-source the traces for you to look at directly.

And if it's really compelling, visualize them in a hosted frontend :-)


A really simple blog post for any task that you think is worthwhile would be enough to move the field forward. The blog post should include:

1) the training configuration and code

2) the data used to fine tune

3) a set of input/output comparisons between the tuned bot and the original bot that show it's learned something interesting

For something really compelling, it would host the created models in a repo that I could download and use. The gold standard would be to host them and provide a browser interface, but that could be expensive in GPU costs.

This blog post currently doesn't exist, or if it does, I haven't been able to find it in the sea of Medium articles detailing an outdated Hugging Face API.

