
They just recently released the R1-0528 model, which was a massive upgrade over the original R1 and is roughly on par with the current best proprietary western models. Let them take their time on R2.


At this point the only models I use are o3/o3-pro and R1-0528. The OpenAI model is better at handling data and drawing inferences, whereas the DeepSeek model is better at handling text as a thing in itself -- i.e. for all writing and editing tasks.

With this combo, I have no reason to use Claude/Gemini for anything.

People don't realize how good the new DeepSeek model is.


My experience with R1-0528 for Python code generation was awful. But I was using a context length of 100k tokens, so that might be why. It scores decently on the LMArena code leaderboard, where contexts are short.


Would love to see the system/user prompts involved, if possible.

Personally I get it to write the same code I'd produce, which obviously I think is OK code, but it seems others' experience differs a lot from my own, so I'm curious to understand why. I've iterated a lot on my system prompt, so it could be as simple as that.


The biggest reason I use Gemini is because it can still get stuff done at 100k context. The other models start wearing out at 30k and are done by 50k.


The biggest reason I avoid Gemini (and all of Google's models I've tried) is because I cannot get them to produce the same code I'd produce myself, while with OpenAI's models it's fairly trivial.

There is something deeper in the model that seemingly can't be steered/programmed with the system/user prompts; it still produces kind of shitty code for some reason. Or maybe I just haven't found the right way of prompting Google's stuff, but the same approach seems to work for OpenAI, Anthropic and others, so I'm not sure what to make of it.


I'm having the same issue with Gemini as soon as the context length exceeds 50k-ish. At that point it starts to blurt out random code of terrible quality, even with clear instructions, and it will often mix up various APIs. I've spent a lot of time instructing it not to write such code, with plenty of few-shot examples, but it doesn't seem to work. It's like it gets "confused".
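
By few-shot examples I mean prior turns baked into the request, roughly like this (a generic, model-agnostic sketch; the task strings are made up):

    # Generic few-shot structure: "question -> good answer" pairs go in as
    # prior turns so the model imitates the style; the real request goes last.
    actual_task = "Read a config file inside an async handler."  # placeholder
    messages = [
        {"role": "system", "content": "Never mix blocking I/O into asyncio code."},
        {"role": "user", "content": "Example: read a file in an async function."},
        {"role": "assistant", "content": "Use aiofiles: async with aiofiles.open(path) as f: ..."},
        {"role": "user", "content": actual_task},
    ]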

The large context length is a huge advantage, but it doesn't seem to be able to use it effectively. Would you say that OpenAI models don't suffer from this problem?


New to me: is more context worse? Is there an ideal context length that maps to a bell curve or something?


> New to me: is more context worse?

Yes, definitely. For every model I've used and/or tested, the more context there is, the worse the output, even within the context limits.

When I use chat UIs (which is admittedly less and less), I never let the chat go beyond one message from me and one response from the LLM. If something is wrong with the response, I figure out what I need to change in my prompt, then start a new chat or edit the first message and retry until it works. Any time I've tried "No, what I meant was ..." or "Great, now change ...", the response quality drops sharply.
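
Concretely, the workflow is a single user message per request, with nothing accumulated. A minimal sketch, assuming the OpenAI Python SDK (any OpenAI-compatible client looks the same); the model id is a placeholder:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def one_shot(prompt: str, model: str = "o3") -> str:
        # One user message, no history: every retry starts from a clean context.
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # Instead of replying "No, what I meant was ...", edit the prompt and re-run:
    answer = one_shot("Rewrite this function to use pathlib: ...")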


Do you use the DeepSeek hosted R1, or a custom one?

The published model has a note strongly recommending that you should not use system prompts at all, and that all instructions should be sent as user messages, so I'm just curious about whether you use system prompts and what your experience with them is.

Maybe the hosted service rewrites them into user ones transparently ...
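
If it does, the rewrite could be as simple as prepending the system text to the first user message. Pure speculation on my part, but roughly:

    def fold_system_into_user(messages):
        # Speculative sketch: merge system messages into the first user turn,
        # for models whose docs say to send everything as user messages.
        system_text = "\n".join(m["content"] for m in messages if m["role"] == "system")
        rest = [m for m in messages if m["role"] != "system"]
        if system_text and rest and rest[0]["role"] == "user":
            rest[0] = {"role": "user",
                       "content": system_text + "\n\n" + rest[0]["content"]}
        return rest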


> Do you use the DeepSeek hosted R1, or a custom one?

Mainly the hosted one.

> The published model has a note strongly recommending that you should not use system prompts at all

I think that's outdated; the new release (deepseek-ai/DeepSeek-R1-0528) has the following in the README:

> Compared to previous versions of DeepSeek-R1, the usage recommendations for DeepSeek-R1-0528 have the following changes: System prompt is supported now.

The previous ones, while they said to put everything in user prompts, still seemed steerable/programmable via the system prompt regardless, but maybe it wasn't as effective as it is for other models.

But yeah, outside of that, heavy use of system (and obviously user) prompts.
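
For reference, this is roughly how I call it with a system prompt. It assumes DeepSeek's OpenAI-compatible endpoint and the deepseek-reasoner model id from their docs; double-check both before copying:

    from openai import OpenAI

    # DeepSeek's hosted API is OpenAI-compatible (per their docs).
    client = OpenAI(api_key="...", base_url="https://api.deepseek.com")

    resp = client.chat.completions.create(
        model="deepseek-reasoner",  # the id R1 is served under, if I recall right
        messages=[
            {"role": "system", "content": "You are a careful Python code reviewer."},
            {"role": "user", "content": "Review this function for bugs: ..."},
        ],
    )
    print(resp.choices[0].message.content)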


A lemon would be on par with the best western models for the majority of use cases, because those use cases don't require "state of the art" intelligence to solve or respond to the user's query. That is what the benchmarks show.

For anything that requires "AI level of intelligence", the difference is vast.



