Seriously useful local non-cloud AI tool

There are plenty of centralized, cloud-connected AI programs like Copilot and Grok. Are people achieving seriously good results with truly open source AI that runs locally?

The goal is continued learning in programming.

I’m presuming that Ollama is the platform on which to host the model on Linux. Maybe using Distrobox or Toolbx.

The host machine has 384GB RAM, an AMD 5700 XT GPU, and 32 cores on an EPYC Rome server platform (PCIe 4.0, DDR4), so it can handle a reasonably beefy model.

Thanks!


I think Ollama is the best tool to run an LLM locally. It's as simple as ollama pull <model> and ollama run <model> once you have it installed.

AMD 5700 XT GPU - 8GB RAM?

This is going to be quite limiting, even though you have plenty of CPU and system RAM. I think you might be able to run deepseek-r1 or qwen3 models at 8b parameters. I have a total of 40GB of GPU memory and was able to run a 70b deepseek - it gave pretty good answers, noticeably better than the smaller versions, but was slow. In practice I would consider running a small model for speed, and then replaying through the bigger model whilst I make some tea, to get a more polished answer. You could also play with smaller models on your GPU, but outsource your LLM to an online service when you need better AI.
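In concrete terms, on your card that might look something like this (the model tags are assumptions on my part, so check what is currently available on the Ollama registry):

```sh
# Pull and chat with an ~8b parameter model, which quantized should fit in 8GB of VRAM
ollama pull qwen3:8b
ollama run qwen3:8b

# See what is installed locally and how big each download is
ollama list
```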

I have tried some open source stuff like this: Chat With Your Codebase: Build a Local LLM CLI Powered by Ollama + ChromaDB | by Rafał Kędziorski | Apr, 2025 | Medium

This introduces chromadb for the indexing; it is certainly not the only choice of open source vector database, but it seems reasonably easy to install and get started with.

The resulting system from the above article was pretty useless though, and not just for Elm. The main issue seems to be that the indexing just chunks the code like a plain text file, with no context sensitivity.
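To make the problem concrete, the whole pipeline from that article boils down to roughly the sketch below (not the article's actual code; the source directory, file glob, chunk size and collection name are my own placeholders). You can see there is nothing in it that respects function or module boundaries:

```python
# Minimal sketch of the "chunk it like a text file" indexing, using ChromaDB's
# built-in default embedding model. Directory, glob, chunk size and collection
# name are illustrative assumptions, not the article's actual code.
import pathlib

import chromadb

client = chromadb.Client()
collection = client.create_collection("codebase")

CHUNK_SIZE = 1000  # characters, with no regard for function or module boundaries

for path in pathlib.Path("src").rglob("*.elm"):
    text = path.read_text(encoding="utf-8")
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    if not chunks:
        continue
    collection.add(
        documents=chunks,
        ids=[f"{path}:{i}" for i in range(len(chunks))],
        metadatas=[{"file": str(path)} for _ in chunks],
    )

# Retrieval is just "nearest chunks to the question", which then get pasted
# into the LLM prompt as context.
results = collection.query(query_texts=["Where is the update function?"], n_results=5)
print(results["documents"][0])
```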

Next up, I looked at: 🔍 Ask Code Anything: GitHub Repo RAG MCP Server — Your AI-Powered Dev Assistant | by Pratiksworking | May, 2025 | Medium

This seems a more promising approach, as it chunks the code into functions/classes/whatever, so the indexing is already going to be better due to that. But there is no parser for Elm built in.
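To illustrate what function-level chunking means, here is a rough sketch using Python's own ast module (again, an illustration of the idea rather than the tool's actual code, and Elm would need its own parser to produce equivalent chunks):

```python
# Structure-aware chunking: one chunk per top-level function or class, using
# Python's standard ast module (Python 3.8+).
import ast


def chunk_python_source(source: str, filename: str):
    """Yield (chunk_id, code) pairs, one per top-level function or class."""
    tree = ast.parse(source, filename=filename)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            code = ast.get_source_segment(source, node)
            if code is not None:
                yield f"{filename}:{node.name}", code


example = '''
def update(msg, model):
    return model

class View:
    pass
'''

for chunk_id, code in chunk_python_source(example, "example.py"):
    # Each chunk has a meaningful identifier and stays within one definition,
    # so the embeddings carry far more context than arbitrary fixed-size windows.
    print(chunk_id, len(code))
```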

@jxxcarlson has recently been hacking on the Elm compiler to output the AST in the format used by this repo rag tool, very much a work in progress: elm-compiler/README.md at master · jxxcarlson/elm-compiler · GitHub

I am also interested in getting a decent open source AI search on a codebase running locally. For Elm it would be nice to experiment with. For my work, I have huge amounts of Java, TypeScript and Python that I need to find my way around quickly, but I also cannot put any of this code online - the repos, the AI, the index - because it is a proprietary codebase and I signed off on company regulations around shadow IT and an NDA.

So far I have not found an open source system that I can just install and get working easily, just pieces I can try and work with. I have not really gotten any value out of it yet either; the things I have tried so far have not worked well.


Yeah, the GPU has 8GB of RAM. I figured it would be a bottleneck. The CPU is from 2019 and maxes out at 3.35GHz. It is a machine built for running a lot of threads, virtual machines and containers. An AI could stretch out on it, but from what you’ve indicated the GPU is the heart of the compute.

It seems as if I should just find the best online generative AI service and use that for a while. I need something I can ask a lot of questions quickly - a personal trainer rather than a code monkey in the editor. So probably not GitHub Copilot.

Too bad about the need for significant compute for local AI. Thanks for your great response, it was very helpful.

You certainly can run AI in 8GB, and it may have quite high token throughput, which will make it pleasant to use. You just miss out a bit on the quality of the answers.

Yeah, thanks there is plenty here to consider. Great points and links :folded_hands:

To try models locally, LM Studio is the most convenient way IMHO, thanks to its user-friendly GUI.

I run Qwen2.5 Coder locally for autocompletion with great results on my potato laptop (4GB iGPU) with the VSCode llama.cpp extension.

After reading about the VSCode llama.cpp extension you linked, I learned that if you are using ollama you can get a similar setup with the combination of ollama + vscode + the continue.dev vscode plugin.

I have low expectations, but it seems easy enough to try since I already have vscode + ollama installed. I suspect that it might generate stuff using the current file as context, but search indexing of the whole project will be missing; let's see…
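For anyone else who wants to try the same thing, the continue.dev config pointing at local Ollama models looks roughly like the snippet below. The model tags are just what I would pick, and the schema has been changing (newer versions use a config.yaml), so treat this as a sketch and check their docs:

```json
{
  "models": [
    {
      "title": "Qwen2.5 Coder 7B (local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder 1.5B (local)",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  }
}
```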

If you are a vim user, GitHub - ggml-org/llama.vim: Vim plugin for LLM-assisted code/text completion is simple to integrate and get started with. It supports different models for different memory profiles.

We have found the auto complete very useful for some tedious and repetitive tasks.


For the level of assistance I'm after, which is as if I had an experienced colleague I could ask questions (in addition to all the mechanical coding enhancements), I think something like Copilot Pro might be the answer.

I’ll spin up an isolated dev environment so that the AI is sandboxed from the rest of the computer.
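Probably with Distrobox, as mentioned earlier; something like the commands below. The image and container names are just placeholders, and a Distrobox container shares a lot with the host by default, so it is more about keeping the tooling contained than a hard security boundary:

```sh
# Create a throwaway container with its own home directory for the AI tooling
distrobox create --name ai-dev --image fedora:40 --home ~/containers/ai-dev

# Enter it and install ollama, editors, language servers, etc. inside
distrobox enter ai-dev
```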

It seems like local AI is really only fast enough for providing mechanical coding assistance to remove mental repetitive strain injury :slightly_smiling_face:

There does seem to be more hardware acceleration coming out for local AI, so let’s see where that goes, although most desktop PCs don’t have the direct-to-CPU PCIe lanes to support both GPU and AI hardware in the one box. The EPYC CPU in my “workstation” (which is just a Supermicro server motherboard) has 128 PCIe lanes, like the AMD Threadripper HEDT and Pro platforms. If I wanted performance, I think the upgrade for this PC would be to max out the CPU with the latest one compatible with this motherboard platform, and look at AI acceleration hardware sometime in the future.

Desktop motherboards have fewer PCIe lanes than workstation or server class boards. They are really designed for one GPU card for gaming. Workstation and server class boards are designed to handle more than one GPU, or other PCIe hardware such as multiple NICs, storage controllers and so on.

All PCIe lanes are “direct to CPU”; every available lane is linked to the CPU, and the CPU is what provides the number of lanes, in conjunction with a supporting motherboard.

I also have an old server, mine is 2011-v3 socket Xeon and I first put it together in maybe 2014. More recently I maxed out the RAM to 256GB and installed the largest CPU available with 22 cores, since this tech is now far behind the cutting edge I was able to do so quite cheaply.

This lets me run lots of virtual machines (under Proxmox). Unfortunately this does nothing for the AI capabilities; it’s really all about the GPU.

GPU and AI hardware are for the most part the same thing: you run AI on your GPU. Unless we are talking about non-GPU hardware accelerators, or Apple silicon where there is an NPU built into the CPU and they use the same RAM. A great idea, because in theory you can add more RAM, except well… Apple does not like you adding more RAM.

The available RAM is a major restriction for local AI, since you need to buy a GPU with enough RAM for the models you want to run, and that quickly pushes you towards high end cards, which are very expensive. It’s a shame no one makes a GPU where you can just add more RAM like on a motherboard.

Moving on to more recent hardware, PCIe 5 can be a significant step up for GPUs for AI, because it allows direct expansion-device-to-expansion-device data flow independently of the CPU and also has massive bandwidth. nVidia stopped doing SLI to link two cards together and says that it is no longer needed anyway due to PCIe 5.


Interesting. One point of understanding that I have is that the Ryzen platform has fewer than 30 direct-to-CPU PCIe lanes, of which 4 or 8 link to a kind of South Bridge “hub” to which low end peripherals are connected, including secondary M.2 NVMe drives.

EPYC CPUs and motherboard chipsets have no South Bridge — everything is direct-to-CPU.

If a Ryzen motherboard has a second M.2 NVMe slot it will typically be connected to the South Bridge. On my EPYC PC all motherboard M.2 NVMe slots have full x4 bandwidth. Typically there are two M.2 slots onboard for drive mirroring.

I’ve installed a bifurcation card, from Asus, into an x16 slot. This gives four x4 M.2 NVMe drives each exclusive access to one of the four x4 pathways on an x16 direct-to-CPU slot. That’s four 7000+ MB/sec capable NVMe drives. If I want another four drives, another bifurcation card can be added.

Threadripper Pro has 192 lanes if I’m not mistaken. Ryzen needs to be increased to 64 lanes. The entire Ryzen platform is crippled until it is.

I took a look at the Nvidia compute GPUs with 40GB onboard, and the retail cost was well above $25,000 AUD for that single card. The new datacentres dedicated to AI, required for bureaucratic governance infrastructure (Digital ID, programmable currencies, social credit management and surveillance), require the building of multiple nuclear power generation facilities. This is the big push right now.

Makes me think my little local AI “entity” is really just a cute little toy garbage-in garbage-out processor :joy:

FWIW here is an interesting commentary about AI vs human consciousness:

One interesting point raised, maybe not in this article, is that LLMs like chatgpt require human guidance in the form of shaping the questions of enquiry and therefore the quality of output.

I bought 2 x nVidia A4500 second hand off ebay at approx £800 (1600 AUD) each. That gives me a total of 40GB and was the best bang-for-buck way to get that much on the Ada generation. If I run an LLM that needs to use the memory of both cards, ollama will split it between the cards, but each card runs in sequence: one is at 0% utilisation while the other is briefly at 100%, and then they swap over. LLM processing is quite sequential, but it would be no less sequential on a single GPU, so the speed is about the same as a single GPU.

I have not played enough with local AI but I did a bit of research and I would recommend machines with unified memory.

I have a Mac Studio with 32GB of RAM (I bought it before AI blew up and did not see the point of more RAM). The deepseek models I played with behaved admirably.

Another option is the AI Max+ 395 with up to 128GB of RAM. Again, it is unified, so available to the GPU.

Framework Desktop is still on pre-order but there might be other nice machines like the gmktec EVO-X2.

My view is that LLMs are very “left brain”. There is a lot of BS written about brain asymmetry, but at the risk of making untrue claims, roughly speaking…

The left brain is more precise, short-range, and tool oriented, and the right brain is more fuzzy, long-range, pattern matching and directly observant of the environment through our senses.

The main speech area and motor control for our hands are close to each other in the left hemisphere, and both are involved in how we manipulate our environment. But speech is not exclusive to the left hemisphere. If we speak in a carefully thought out way, it’s more left; if we use nonsense words, or write poems or songs, it can be more right hemisphere generated.

The left brain’s reward system is mostly driven by dopamine. Dopamine produces a hit, but one that fades the more of it we generate. So the left brain gets tired and stops working so well after it has done too many hits. The right brain uses hormones derived from adrenaline; I am not sure of the exact names. These do not have the fading-hits effect of dopamine, which is why the right brain can remain alert for long periods of time. If a rabbit is feeding and keeping an eye out for predators, it’s the right brain that is continually monitoring the environment for a “pattern match” that tells it something unusual is occurring in those bushes over there and it should run.

There is a fascinating book about this called “The Master and His Emissary” which I found very illuminating. It’s a tough read though; I think I only got about a third of the way through before giving up, as it has a very academic style.

One of the main claims in this book is that we go through a right → left → right cycle when interacting with the world. The right observes the environment, tasks the left with performing some more tool oriented job, like responding with speech, then returns control to the right to continue running the overall show. Right is Master, Left is Emissary - it’s an analogy to a story written by Nietzsche about a master and servant, where the servant tries to take over. That is what the book is really about: that the modern world and digital technology have elevated the status of the left brain to the point it is starting to take over our entire culture and way of being.

Anyway, what I realized from this is that AI as we currently have it is very left brain. It’s a bunch of tools that extend our left brain. If I code too long my left brain gets dopamined out. Wouldn’t it be nice if I could have a computer do the hard work for me?

When you talk to an LLM it does not say “Ohh hi Rupert! I was missing you, we haven’t talked in ages!”, because it’s not conscious for long periods of time and therefore not aware of the passing of time in the same way that our right brain is. It hasn’t just been sitting there twiddling its thumbs since our last conversation; it dumped that to disk and went to process someone else’s chat.

The left brain cannot do two conflicting things at once. The right brain can easily hold two conflicting ideas at the same time: one OR the other, AND one AND the other, all at the same time. So it’s possible to see how sequential token processing, based on data generated by human left brains, can be used to train LLMs and will converge to some approximation of the right answer. It is not so easy to understand how to build an AI that works more like the right brain; what would determine success here, and what data would be used to train it?

The left brain approach has economic value - computers that can do our mental tasks for us. The right brain angle is much harder to understand the economic value of.

Current direction of travel in AI and recent successes will not produce consciousness. That will require a very different approach and I think we are barely started on it.


+1 for “The Master and his Emissary”–one of my favorite world-view-changing books.

Very interesting what is going on with Modular Max right now. It is levelling the playing field for AMD vs nVidia. Also, when you consider AMD and unified RAM, it’s opening up the possibilities for an AI box where you can add more RAM as you need it.