I think LLaMA-7B in general might just not be very good. I've been playing around and running the full non-quantized LLaMA-30B and LLaMA-7B in a bunch of experiments, and the quality of output is much, much better from LLaMA-30B.
Have you done any comparison testing between 30B q4/q8/q16? I've only been running the 30B q4 version (on a GV100) and it's very impressive, pretty good for coding; it has successfully done code modifications to simple programs based on English instructions.
I have not, but I want to in the near future because I'm really curious myself too. I've been following the Rust community, which now has a llama.cpp port (and also my OpenCL thing), and one discussion item has been running a verification and common benchmark across the implementations. https://github.com/setzer22/llama-rs/issues/4
I've mostly heard that, at least for the larger models, quantization has barely any noticeable effect. Would be nice to witness it myself.
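To make the quantization discussion concrete, here's a toy round-trip sketch of what 4-bit block quantization roughly does: each block of weights shares one float scale, and each weight is stored as a 4-bit signed integer. This is loosely modeled on llama.cpp's q4_0 idea, but the names, block size, and details here are illustrative assumptions, not the real on-disk format:

```js
// Toy 4-bit block quantization: one fp scale per block, 4-bit signed ints
// per weight. Illustrative only, not llama.cpp's actual q4_0 layout.
function quantize4bit(block) {
  const amax = Math.max(...block.map(Math.abs));
  const scale = amax / 7; // 4-bit signed range is -8..7
  const qs = block.map(x =>
    Math.max(-8, Math.min(7, Math.round(x / (scale || 1)))));
  return { scale, qs };
}

function dequantize4bit({ scale, qs }) {
  return qs.map(q => q * scale);
}

const weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02];
const restored = dequantize4bit(quantize4bit(weights));
const maxErr = Math.max(...weights.map((w, i) => Math.abs(w - restored[i])));
```

The per-weight error is bounded by half the block's scale, which is why one outlier weight in a block hurts everything else in it; the interesting empirical question is how much of that rounding noise actually shows up in generation quality.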
The example I gave was using this as the backend for a chat bot in a private server, and I'm not comfortable sharing the prompt. However, if you look up the leaked Bing prompt, that might give you some ideas for how to prompt an LLM into being a chatbot that can answer coding questions. I've had pretty good results using it as a bot (with some glue code that does fairly vanilla regex-based prompt cleaning, but not too much; it's mostly the prompt).
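For a sense of what that kind of regex-based cleaning looks like, here's a hypothetical sketch (the function and its behavior are my guesses at typical glue code, not the commenter's actual implementation): truncate the raw continuation where the model starts writing the next speaker's turn, and strip any echoed name prefix.

```js
// Hypothetical prompt-cleaning glue: cut the model's raw continuation
// at the first simulated turn marker (e.g. "\nUser:"), then strip a
// leading "<botName>:" echo. Names here are illustrative assumptions.
function cleanReply(raw, botName) {
  const stop = raw.search(/\n[A-Za-z0-9_]+:/);
  let reply = stop >= 0 ? raw.slice(0, stop) : raw;
  reply = reply.replace(new RegExp(`^\\s*${botName}:\\s*`), "");
  return reply.trim();
}

const raw = "Assistant: Sure, here's a fix.\nUser: thanks";
const reply = cleanReply(raw, "Assistant");
```

Without something like this, a plain autocomplete model will happily keep role-playing both sides of the conversation.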
If you're not trying to get it to be a chatbot, it's much easier. Here's a prompt that worked for me on the first try in the default mode with 13Bq4 on a 1080 Ti (note the prompt ends with an opening code fence, which primes the model to write code):

Here is a short, clear, well-written example of a program that lists the first 10 numbers of the fibonacci sequence, written in javascript:

and when given that, it finished it with:

```js
function Fib(n) {
  if (n == 0 || n == 1) return 1;
  else return Fib(n-1)+Fib(n-2);
}
var i = 0;
while (i < 10) {
  console.log("The number " + i + " is: " + Fib(i));
  i++;
}
```
(I don't work at OpenAI, so take this with a grain of salt.) Yes and no; they are similar. It's basically just a fancy autocomplete like LLaMA, but I believe it has been specifically trained on chat content, or at least fine-tuned on it, and it probably uses a more chat-focused labeling scheme on the training data as well, to help it perform well on that specific task and be conversational.
I ran it on a machine with 128 GB of RAM and a Ryzen 5950X. It's not fast, about 4 seconds per token, but it just about fits without swapping. https://github.com/Noeda/rllama/
I am running fp16 LLaMA 30B (via vanilla-llama) on six AMD MI25s. Computer has 384 GB of RAM but the model fits in the VRAM. It takes up about 87 GB of VRAM out of the 96 GB available on the six cards. Performance is about 1.6 words per second in an IRC chat log continuation task and it pulls about 400W additional when "thinking."
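That 87 GB figure passes a rough sanity check: LLaMA-30B actually has about 32.5B parameters, so the fp16 weights alone come to roughly 60 GiB, and activations, the KV cache, and per-card overhead across six GPUs plausibly account for the rest. A quick back-of-the-envelope version (the parameter count is the published one; the rest is just arithmetic):

```js
// Back-of-the-envelope VRAM estimate for fp16 LLaMA-30B weights.
const params = 32.5e9;   // LLaMA-30B's actual parameter count
const bytesPerParam = 2; // fp16
const weightGiB = params * bytesPerParam / 2 ** 30; // ~60.5 GiB
```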
I noticed there are a couple of open issues on llama.cpp investigating quality problems. It's interesting that a wrong implementation can still generate plausible-looking output. It sounds like an objective quality metric would help track down such issues.
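The standard objective metric for this kind of cross-implementation check is perplexity on a held-out text: exponentiate the mean negative log-likelihood the model assigns to each token, and compare implementations on the same text. A minimal sketch (the log-probabilities here are made-up stand-ins for real per-token values from a model under test):

```js
// Perplexity = exp(mean negative log-likelihood) over a token sequence.
// Lower is better; two correct implementations of the same model should
// produce nearly identical values on the same text.
function perplexity(logProbs) {
  const meanNll = -logProbs.reduce((a, b) => a + b, 0) / logProbs.length;
  return Math.exp(meanNll);
}

const logProbs = [-1.2, -0.3, -2.0, -0.7]; // hypothetical token log-probs
const ppl = perplexity(logProbs);
```

A subtly wrong implementation can still sample fluent text while scoring measurably worse perplexity than a correct one, which is exactly why it's useful for debugging.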