Presumably you will advance along with humanity though, or failing that, just figure out the transcendence thing yourself with so much time?
I don’t think anyone would choose to stay ‘meatbag human’ for trillions of years.
Almost all of Qwen 2.5 is Apache 2.0, SOTA for the size, and frankly obsoletes many bigger API models.
These days, there are amazing “middle sized” models like Qwen 14B, InternLM 20B, and Mistral/Codestral 22B that are a massive step up from the 7B-9B ones you can kinda run on CPU. And there are even 7Bs that support really long context now.
IMO it's worth reaching for >6GB of VRAM if running LLMs is a consideration at all.
I am not a fan of CPU offloading because I like long context, 32K+. And that absolutely chugs if you even offload a layer or two.
For local LLM hosting, basically you want exllama, llama.cpp (and derivatives) and vllm, and rocm support for all of them is just fine. It’s absolutely worth having a 24GB AMD card over a 16GB Nvidia one, if that’s the choice.
The big sticking point I’m not sure about is flash attention for exllama/vllm, but I believe the triton branch of flash attention works fine with AMD GPUs now.
Basically the only thing that matters for LLM hosting is VRAM capacity. Hence AMD GPUs can be OK for LLM running, especially if a used 3090/P40 isn’t an option for you. It works fine, and the 7900/6700 are like the only sanely priced 24GB/16GB cards out there.
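To put that in practice, here's a minimal sketch of serving a model on an AMD card with llama.cpp's llama-server (assuming a recent ROCm or Vulkan build; the model filename is just a placeholder):

    # offload every layer to the GPU (-ngl 99) and serve an OpenAI-compatible API
    ./llama-server -m ./models/some-model-Q4_K_M.gguf -ngl 99 -c 32768 --port 8080

Once the weights fit in VRAM, the vendor barely matters; the capacity is what decides which models and context sizes you can actually run.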
I have a 3090, and it's still a giant pain with Wayland, so much so that I use my AMD IGP for display output and Nvidia still somehow breaks things. Hence I just do all my gaming in Windows, TBH.
CPU doesn't matter for LLM running, so cheap out with a 12600K, 5600, 5700X3D or whatever. And the single-CCD X3D chips are still king for gaming AFAIK.
A Twitter screenshot of this was linked in Slack that evening.
The modern internet in a nutshell, lol.
Discord is even worse, as you need to find an invite to a specific Discord, and sometimes go through a lengthy sign-up process for each one.
Some won’t let you sign up without a phone #.
Matrix.
And… Lemmy.
It doesn't matter though; the problem is that the critical mass is migrating to Discord and shunting everything out of view. Honestly that's much worse than being on Reddit, even now.
I’m a bit salty this was apparently announced through Discord. Was it even posted anywhere else?
The future of social media is fragmented siloes, I guess.
8GB or 4GB?
Yeah, you should get kobold.cpp's ROCm fork working if you can manage it; otherwise use their Vulkan build.
Llama 8B is probably good for your machine, as it can fit entirely on an 8GB GPU at shorter context, or at least be partially offloaded if it's a 4GB one.
I wouldn't recommend DeepSeek for your machine; it's a better fit for older CPUs. It's not as smart as Llama 8B, and it's bigger than Llama 8B, but it runs super fast because it's an MoE.
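If it helps, here's roughly what launching Llama 8B looks like with kobold.cpp (a sketch assuming the Vulkan build; the GGUF filename and layer counts are placeholders to tune to your VRAM):

    # Llama 3.1 8B at Q4_K_M: all ~33 layers fit on an 8GB card at shorter context;
    # drop --gpulayers to around 15-20 if it's a 4GB one
    python koboldcpp.py --model Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --usevulkan --gpulayers 33 --contextsize 8192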
Oh I got you mixed up with the other commenter, apologies.
I'm not sure when Llama 8B starts to degrade at long context, but I wanna say it's well before 128K, which is where other “long context” models start to look much more attractive depending on the task. Right now I'm testing Amazon's Mistral finetune, and it seems to be much better than Nemo or Llama 3.1.
4 core i7, 16gb RAM and no GPU yet
Honestly as small as you can manage.
Again, you will get much better speeds out of “extreme” MoE models like deepseek chat lite: https://huggingface.co/YorkieOH10/DeepSeek-V2-Lite-Chat-Q4_K_M-GGUF/tree/main
Another thing I'd recommend is running kobold.cpp instead of ollama if you want to get into the nitty-gritty of LLMs. It's more customizable and (ultimately) faster on more hardware.
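As a rough sketch of what that looks like on a CPU-only box like yours (the exact .gguf filename inside that repo is an assumption on my part; check the link above):

    # grab the quant linked above, then run it CPU-only with kobold.cpp
    huggingface-cli download YorkieOH10/DeepSeek-V2-Lite-Chat-Q4_K_M-GGUF --include "*.gguf" --local-dir .
    python koboldcpp.py --model deepseek-v2-lite-chat-q4_k_m.gguf --threads 4 --contextsize 4096

Since only a few experts are active per token, an MoE like this generates way faster on CPU than a dense model of the same file size.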
Can you afford an Arc A770 or an old RTX 3060?
Used P100s are another good option. Even an RTX 2060 would help a ton.
27B is just really chunky on CPU, unfortunately. There’s no way around it. But you may have better luck with MoE models like deepseek chat or Mixtral.
Here's a tip: most software has the model's default context size set at 512, 2048, or 4096. Part of what makes Llama 3.1 so special is that it was trained with 128K context, so bump that up to 131072 in the settings so it isn't recalculating context every few minutes…
Some caveats: this massively increases memory usage (unless you quantize the KV cache with flash attention enabled), and it also massively slows down CPU generation once the context gets long.
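For what it's worth, here's a sketch of those settings with llama.cpp's llama-server (flag names may vary a bit by build; the model path is a placeholder, and kobold.cpp exposes the same knobs in its launcher):

    # full 128K context, flash attention on, KV cache quantized to q8_0 to keep memory in check
    ./llama-server -m Llama-3.1-8B-Instruct-Q4_K_M.gguf --ctx-size 131072 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --n-gpu-layers 99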
TBH you just need to not keep a long chat history unless you need it.
My level of worry hasn’t lowered in years…
But honestly? Low on the totem pole. Even with Trumpy governments.
Things like engagement-optimized social media warping people's minds for profit, the internet outside of apps dying before our eyes, Sam Altman/OpenAI trying to squelch open-source generative models so we're dependent on their Earth-burning plans, blatant, open collusion with the govt, everything turning into echo chambers… There are just too many disasters for me to even worry about the government spying on me.
If I lived in China or Russia, the story would be different. I know, I know. But even now, I'm confident I can give the U.S. president the middle finger in my country, whereas I'd really be more scared for my life in more authoritarian strongman regimes.