I think LLaMA-7B in general might just not be very good. I've been playing around and running the full non-quantized LLaMA-30B and LLaMA-7B in a bunch of experiments, and the quality of output is much, much better from LLaMA-30B.
Have you done any comparison testing between 30B q4/q8/q16? I've only been running the 30B q4 version (on a GV100) and it's very impressive, pretty good for coding; it has successfully done code modifications to simple programs based on English instructions.
I have not, but I want to in the near future because I'm really curious myself too. I've been following the Rust community, which now has a llama.cpp port (and also my OpenCL thing), and one discussion item has been running a verification and common benchmark across the implementations. https://github.com/setzer22/llama-rs/issues/4
I've mostly heard that, at least for the larger models, quantization has barely any noticeable effect. Would be nice to witness it myself.
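To make the quantization discussion concrete, here's a toy round-trip sketch of what 4-bit block quantization roughly does: each block of weights shares one float scale, and each weight is stored as a 4-bit signed integer. This is loosely modeled on llama.cpp's q4_0 idea, but the names, block size, and details here are illustrative assumptions, not the real on-disk format:

```js
// Toy 4-bit block quantization: one fp scale per block, 4-bit signed ints
// per weight. Illustrative only, not llama.cpp's actual q4_0 layout.
function quantize4bit(block) {
  const amax = Math.max(...block.map(Math.abs));
  const scale = amax / 7; // 4-bit signed range is -8..7
  const qs = block.map(x =>
    Math.max(-8, Math.min(7, Math.round(x / (scale || 1)))));
  return { scale, qs };
}

function dequantize4bit({ scale, qs }) {
  return qs.map(q => q * scale);
}

const weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02];
const restored = dequantize4bit(quantize4bit(weights));
const maxErr = Math.max(...weights.map((w, i) => Math.abs(w - restored[i])));
```

The per-weight error is bounded by half the block's scale, which is why one outlier weight in a block hurts everything else in it; the interesting empirical question is how much of that rounding noise actually shows up in generation quality.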
The example I gave was using this as the backend for a chat bot in a private server, and I'm not comfortable sharing the prompt. However, if you look up the leaked Bing prompt, that might give you some ideas for how to prompt an LLM into being a chatbot that can answer coding questions. I've had pretty good results using it as a bot (with some glue code that does fairly vanilla regex-based prompt cleaning, but not too much; it's mostly the prompt).
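For a sense of what that kind of regex-based cleaning looks like, here's a hypothetical sketch (the function and its behavior are my guesses at typical glue code, not the commenter's actual implementation): truncate the raw continuation where the model starts writing the next speaker's turn, and strip any echoed name prefix.

```js
// Hypothetical prompt-cleaning glue: cut the model's raw continuation
// at the first simulated turn marker (e.g. "\nUser:"), then strip a
// leading "<botName>:" echo. Names here are illustrative assumptions.
function cleanReply(raw, botName) {
  const stop = raw.search(/\n[A-Za-z0-9_]+:/);
  let reply = stop >= 0 ? raw.slice(0, stop) : raw;
  reply = reply.replace(new RegExp(`^\\s*${botName}:\\s*`), "");
  return reply.trim();
}

const raw = "Assistant: Sure, here's a fix.\nUser: thanks";
const reply = cleanReply(raw, "Assistant");
```

Without something like this, a plain autocomplete model will happily keep role-playing both sides of the conversation.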
If you're not trying to get it to be a chatbot, it's much easier. Here's a prompt that worked for me on the first try in the default mode with 13Bq4 on a 1080 Ti (note the prompt ends with an opening code fence, which primes the model to write code):

Here is a short, clear, well-written example of a program that lists the first 10 numbers of the fibonacci sequence, written in javascript:

and when given that, it finished it with:

```js
function Fib(n) {
  if (n == 0 || n == 1) return 1;
  else return Fib(n-1)+Fib(n-2);
}
var i = 0;
while (i < 10) {
  console.log("The number " + i + " is: " + Fib(i));
  i++;
}
```
(I don't work at OpenAI, so take this with a grain of salt.) Yes and no; they are similar. It's basically just a fancy autocomplete like LLaMA, but I believe it has been specifically trained on chat content, or at least fine-tuned on it, and it probably uses a more chat-focused labeling scheme on the training data as well, to help it perform well on that specific task and be conversational.
I ran it on a machine with 128 GB of RAM and a Ryzen 5950X. It's not fast, about 4 seconds per token, but it just about fits without swapping. https://github.com/Noeda/rllama/
I am running fp16 LLaMA 30B (via vanilla-llama) on six AMD MI25s. Computer has 384 GB of RAM but the model fits in the VRAM. It takes up about 87 GB of VRAM out of the 96 GB available on the six cards. Performance is about 1.6 words per second in an IRC chat log continuation task and it pulls about 400W additional when "thinking."
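That 87 GB figure passes a rough sanity check: LLaMA-30B actually has about 32.5B parameters, so the fp16 weights alone come to roughly 60 GiB, and activations, the KV cache, and per-card overhead across six GPUs plausibly account for the rest. A quick back-of-the-envelope version (the parameter count is the published one; the rest is just arithmetic):

```js
// Back-of-the-envelope VRAM estimate for fp16 LLaMA-30B weights.
const params = 32.5e9;   // LLaMA-30B's actual parameter count
const bytesPerParam = 2; // fp16
const weightGiB = params * bytesPerParam / 2 ** 30; // ~60.5 GiB
```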
I noticed there are a couple of open issues on llama.cpp investigating quality problems. It's interesting that a wrong implementation can still generate plausible-looking output. It sounds like an objective quality metric would help track down such issues.
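The standard objective metric for this kind of cross-implementation check is perplexity on a held-out text: exponentiate the mean negative log-likelihood the model assigns to each token, and compare implementations on the same text. A minimal sketch (the log-probabilities here are made-up stand-ins for real per-token values from a model under test):

```js
// Perplexity = exp(mean negative log-likelihood) over a token sequence.
// Lower is better; two correct implementations of the same model should
// produce nearly identical values on the same text.
function perplexity(logProbs) {
  const meanNll = -logProbs.reduce((a, b) => a + b, 0) / logProbs.length;
  return Math.exp(meanNll);
}

const logProbs = [-1.2, -0.3, -2.0, -0.7]; // hypothetical token log-probs
const ppl = perplexity(logProbs);
```

A subtly wrong implementation can still sample fluent text while scoring measurably worse perplexity than a correct one, which is exactly why it's useful for debugging.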