
The benchmarks for the different models focus on math and coding accuracy. I have a use case where those two functions are completely irrelevant and I’m only interested in writing (chat, stories, etc.). I guess you can’t really benchmark ‘concepts’ as easily as logic.

With distillation, can a model be made that strips out most of the math and coding stuff?



> completely irrelevant and I’m only interested in writing (chat, stories, etc)

There's a person keeping track of a few writing prompts and how the quality of the generated text evolves with each new shiny model. They shared this link somewhere; I can't find the original thread, but I had it bookmarked for further reading. Have a look and see if it's something you'd like.

https://eqbench.com/results/creative-writing-v2/deepseek-ai_...


Here's a better link: https://eqbench.com/creative_writing.html

The R1 sample reads way better than anything else on the leaderboard to me. Quite a jump.


Why is the main character named Rhys in most (?) of them? The Llama[1], Claude[3], Mistral[4], and DeepSeek-R1[5] samples all name the main character Rhys, even though that's nowhere specified in the prompt. GPT-4o gives the character a different name[6], and Gemini[2] names the bookshop person Rhys instead! Am I missing something really obvious that's right in front of me?

[1] https://eqbench.com/results/creative-writing-v2/meta-llama__...
[2] https://eqbench.com/results/creative-writing-v2/gemini-1.5-f...
[3] https://eqbench.com/results/creative-writing-v2/claude-3-opu...
[4] https://eqbench.com/results/creative-writing-v2/mistralai__M...
[5] https://eqbench.com/results/creative-writing-v2/deepseek-ai_...
[6] https://eqbench.com/results/creative-writing-v2/gpt-4o-2024-...


Completely agree.

The only measurable flaw I could find was the errant use of an opening quote (‘) in

> He huffed a laugh. "Lucky you." His gaze drifted to the stained-glass window, where rain blurred the world into watercolors. "I bombed my first audition. Hamlet, uni production. Forgot ‘to be or not to be,' panicked, and quoted Toy Story."

It's pretty amazing I can find no fault with the actual text. No grammar errors, I like the writing, it competes with the quality and engagingness of a large swath of written fiction (yikes), I wanna read the next chapter.


> It's pretty amazing I can find no fault with the actual text. No grammar errors, I like the writing, it competes with the quality and engagingness of a large swath of written fiction (yikes), I wanna read the next chapter.

The lack of "gpt-isms" is really impressive IMO.


Those outputs are really good and come from deepseek-R1 (I assume the full version, not a distilled version).

R1 is quite large (685B params). I’m wondering if you can make a distilled R1 without the coding and math content. 7B works well for me locally. When I go up to 32B I seem to get worse results - I assume it’s just timing out in its think mode… I haven’t had time to really investigate though.
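
If it really is the think block getting cut off, one thing worth trying is simply raising the generation budget. Here's a rough sketch with transformers, assuming the Hugging Face distill checkpoint; the model ID and sampling settings are my guesses, not a verified recipe:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed checkpoint ID for the 32B distill; substitute whatever you run locally.
    model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    messages = [{"role": "user", "content": "Write the opening scene of a story set in a rainy bookshop."}]
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

    # R1-style models emit a long <think>...</think> section before the answer,
    # so a small token budget can truncate the output before the story even starts.
    out = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

Whatever local runner you use instead, the equivalent knob is the maximum output length.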


Yes, you can create a writing-focused model through distillation, but it's tricky. *Complete removal* of math/coding abilities is challenging because language models' knowledge is interconnected - the logical thinking that helps solve equations also helps structure coherent stories.
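
For a feel of the mechanics, here's a toy logit-distillation loop over a writing-only corpus. This is not DeepSeek's actual recipe (their published distills were reportedly fine-tuned on R1-generated samples rather than logit-matched); the model names and corpus are placeholders, and it assumes the teacher and student share a tokenizer:

    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder model names, purely for illustration.
    teacher = AutoModelForCausalLM.from_pretrained("placeholder/teacher-model")
    student = AutoModelForCausalLM.from_pretrained("placeholder/student-model")
    tok = AutoTokenizer.from_pretrained("placeholder/teacher-model")

    # Chat/story data only -- no math or code examples.
    writing_only_corpus = [
        "Rain blurred the bookshop window into watercolors...",
    ]

    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
    T = 2.0  # softening temperature

    for text in writing_only_corpus:
        batch = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            t_logits = teacher(**batch).logits
        s_logits = student(**batch).logits

        # Pull the student's next-token distribution toward the teacher's.
        loss = F.kl_div(
            F.log_softmax(s_logits / T, dim=-1),
            F.softmax(t_logits / T, dim=-1),
            reduction="batchmean",
        ) * T**2

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Even with a corpus like this, the student still inherits whatever math/coding ability the teacher's general language distribution carries, which is why "complete removal" doesn't really fall out of the process.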


I understood that at least some of these big models (Llama?) are basically bootstrapped with code. Is there truth to that?


Yes, code is a key training component. Open-Llama explicitly used programming data as one of seven training components. However, newer models like Llama 3.1 405B have shifted to using synthetic data instead. Code helps develop structured reasoning patterns but isn't the sole foundation - models combine it with general web text, books, etc.
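
As a toy illustration of what "one component among several" means in practice, pretraining batches are usually drawn from a weighted mixture of sources. The source names and weights below are invented for illustration, not any model's published recipe:

    import random

    # Made-up sources and weights, purely illustrative.
    mixture = {
        "web_text":  0.55,
        "books":     0.15,
        "code":      0.15,
        "synthetic": 0.10,
        "other":     0.05,
    }

    def sample_sources(n):
        """Draw n data sources with probability proportional to their mixture weight."""
        return random.choices(list(mixture), weights=list(mixture.values()), k=n)

    # Each pretraining batch would pull documents roughly according to these weights.
    print(sample_sources(8))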



