this post was submitted on 28 Feb 2024
LocalLLaMA
Community to discuss LLaMA, the large language model created by Meta AI.
This is intended to be a replacement for r/LocalLLaMA on Reddit.
This is sick. Would this lead to better offline LLMs on mobile?
I think we're already getting there. Lots of newer phones include AI accelerators, and all the companies advertise AI. I don't think those accelerators are made to run LLMs, but anyway: llama.cpp already runs on phones, and the limiting factor seems to be RAM. I've tried Microsoft's "phi-2", quantized and on slow hardware, and it's surprisingly capable for such a small model. Something like a ternary model would significantly cut down the amount of RAM needed, which would allow loading larger models while also making inference faster everywhere. So I'd say yes. And it would also allow me to load a more intelligent model on my PC.
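To put rough numbers on the RAM point, here is a back-of-the-envelope sketch. It only counts the weights and assumes an ideal encoding of log2(3) ≈ 1.58 bits per ternary weight; the function name is made up for illustration, and real runtimes add overhead (KV cache, activations, quantization scales, less-than-ideal packing), so treat the results as lower bounds rather than measurements.

```python
import math

def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB (hypothetical helper)."""
    return n_params * bits_per_weight / 8 / 2**30

# Assumption: an ideal encoding of one value out of {-1, 0, +1}.
TERNARY_BITS = math.log2(3)

# Parameter counts are nominal: phi-2 is a 2.7B model, "7B" is a common size.
for name, n_params in [("2.7B (phi-2 sized)", 2.7e9), ("7B", 7e9)]:
    for label, bits in [("fp16", 16), ("4-bit", 4), ("ternary", TERNARY_BITS)]:
        gib = weight_memory_gib(n_params, bits)
        print(f"{name:>20} {label:>8}: {gib:6.2f} GiB")
```

The takeaway is just the ratio: at the same parameter count, ternary weights need well under half the memory of 4-bit weights, which is where the "load a larger model in the same RAM" argument comes from.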
I think doing away with matrix multiplications is also a big deal, but it has little consequence as of today. You'd first need to redesign the chips to take advantage of that. And local inference is typically limited by memory bandwidth, not multiplication speed. At least as far as I understand.
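A small sketch of both points, for illustration only. The first function shows why ternary weights remove the multiplications: with every weight in {-1, 0, +1}, a matrix-vector product is just additions and subtractions. The second is a crude bandwidth-bound estimate that assumes every weight is read once per generated token (batch size 1). All names are hypothetical, and this is not how any particular paper or runtime implements it.

```python
def ternary_matvec(weights, x):
    """Compute W @ x for W with entries in {-1, 0, +1} using only add/subtract."""
    y = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing, so it is skipped entirely
        y.append(acc)
    return y

def dense_matvec(weights, x):
    """Reference implementation with ordinary multiplications."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def max_tokens_per_second(bandwidth_bytes_per_s, weight_bytes):
    """Upper bound if each token requires streaming all weights from memory once."""
    return bandwidth_bytes_per_s / weight_bytes

W = [[1, 0, -1], [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
assert ternary_matvec(W, x) == dense_matvec(W, x)
```

On today's hardware the multiply units are already there, so the first function mostly matters for future chip designs. The second one explains why smaller weights help right now: halving the bytes per weight roughly doubles the bandwidth-bound ceiling on tokens per second.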
I'd say if this is true, it allows for a big improvement in parameter count for all kinds of use cases. But I've also come to the conclusion that there might be a caveat. Maybe the training is prohibitively expensive. I don't really know; at this point there is too much speculation going on and I'm not really an expert.
Yeah, I knew about the AI chips being more common, but this is a really good write-up, thanks!
ollama already lets you run many 7B LLMs on Android with 4-bit quantization.