this post was submitted on 11 Jan 2025
LocalLLaMA
Community to discuss LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
you are viewing a single comment's thread
You'll want to look up how to offload GPU layers in ollama. A lower-quant GGUF should work great with offloading.
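For example, a minimal Modelfile sketch (the model tag and layer count are just placeholders; num_gpu is ollama's parameter for how many layers get sent to the GPU):

FROM mistral-nemo:12b-instruct-2407-q4_K_M
# number of layers to offload to the GPU; lower it if you run out of VRAM
PARAMETER num_gpu 25
# context window; bigger contexts also eat VRAM
PARAMETER num_ctx 8192

Then ollama create nemo-offload -f Modelfile and ollama run nemo-offload. You can also tweak it on the fly from the interactive prompt with /set parameter num_gpu 25.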
Most people use kobold.cpp now; ollama and llama.cpp kind of fell behind. kobold.cpp is a bleeding-edge fork of llama.cpp with all the latest and greatest features, and its GPU offloading is so damn easy: if you have an Nvidia card use CuBLAS, if you have an AMD card use Vulkan.
Is there a particular reason you're trying to run a mixture-of-experts model for an RP/storytelling-purpose LLM? Usually MoE is better suited to logical reasoning and critical analysis of a complex problem. If you're a newbie just starting out, you may be better off with an RP finetune of a Mistral AI model like ArliAI RPMax, based on NeMo 12B.
There's always a tradeoff with finetunes: typically a model that's finetuned for RP/storytelling sacrifices capability in other important areas like reasoning, encyclopedic knowledge, and mathematical/coding ability.
Here's an example starting command for offloading. I have an Nvidia 1070 Ti 8 GB and can get 25-35 layers offloaded onto it, depending on context size:
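Something along these lines (a rough sketch rather than my exact command; the model filename is a placeholder, and the flags are koboldcpp's standard options):

python koboldcpp.py --model ./mistral-nemo-12b-q4_k_m.gguf --usecublas --gpulayers 30 --contextsize 8192 --threads 8

# --usecublas   Nvidia backend; swap for --usevulkan on an AMD card
# --gpulayers   how many layers to push onto the 8 GB card
# --contextsize bigger contexts eat VRAM, so fewer layers will fit

Start high on --gpulayers and back it off until it stops running out of VRAM.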
I think I may try it this way if Kobold uses Vulkan instead of ROCm; it's most likely going to be way less of a headache.
As for the model, it's just what came out of a random search on Reddit for a decent small model. No reason in particular; thanks for the suggestion.