this post was submitted on 11 Jan 2025
LocalLLaMA
Community to discuss LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
you are viewing a single comment's thread
You'll want to look up how to offload GPU layers in ollama. A lower-quant GGUF should work great with offloading.
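For example, a minimal Modelfile sketch (the model tag and layer count are just placeholders; num_gpu is ollama's parameter for how many layers get sent to the GPU):

FROM mistral-nemo:12b-instruct-2407-q4_K_M
# number of layers to offload to the GPU; lower it if you run out of VRAM
PARAMETER num_gpu 25
# context window; bigger contexts also eat VRAM
PARAMETER num_ctx 8192

Then ollama create nemo-offload -f Modelfile and ollama run nemo-offload. You can also tweak it on the fly from the interactive prompt with /set parameter num_gpu 25.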
Most people use kobold.cpp now; ollama and llama.cpp kind of fell behind. kobold.cpp is a bleeding-edge fork of llama.cpp with all the latest and greatest features, and its GPU offloading is so damn easy: if you have an Nvidia card use CuBLAS, if you have an AMD card use Vulkan.
Is there a particular reason you're trying to run a mixture-of-experts model for an RP/storytelling-purpose LLM? Usually MoE is better suited to logical reasoning and critical analysis of a complex problem. If you're a newbie just starting out, you may be better off with an RP finetune of a Mistral AI model like ArliAI RPMax, based on NeMo 12B.
There's always a tradeoff with finetunes: typically a model that's finetuned for RP/storytelling sacrifices capability in other important areas like reasoning, encyclopedic knowledge, and mathematical/coding ability.
Here's an example starting command for offloading. I have an Nvidia 1070 Ti 8 GB and can get 25-35 layers offloaded onto it, depending on context size:
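Something along these lines (a rough sketch rather than my exact command; the model filename is a placeholder, and the flags are koboldcpp's standard options):

python koboldcpp.py --model ./mistral-nemo-12b-q4_k_m.gguf --usecublas --gpulayers 30 --contextsize 8192 --threads 8

# --usecublas   Nvidia backend; swap for --usevulkan on an AMD card
# --gpulayers   how many layers to push onto the 8 GB card
# --contextsize bigger contexts eat VRAM, so fewer layers will fit

Start high on --gpulayers and back it off until it stops running out of VRAM.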
I think I may try it this way if Kobold uses Vulkan instead of ROCm; it's most likely going to be way less of a headache.
As for the model, it's just what came out of a random search on Reddit for a decent small model. No reason in particular; thanks for the suggestion.