The moment a developer changes a configuration line to `"api": "openai-completions"` and points it toward a local port, the nature of their workspace shifts. The dependency on external servers and the anxiety of API latency vanish, replaced by a private, independent AI workshop. This transition is becoming a priority for engineers who want absolute control over their hardware resources. This week, a growing number of developers are testing the limits of the M4 chipset, specifically the MacBook Pro with 24GB of unified memory, to see if the dream of a fully offline, high-performance coding assistant is finally viable.

The Hardware Ceiling and Model Selection

Running a Large Language Model locally is a constant battle between memory capacity and cognitive capability. On this specific machine, an M4 MacBook Pro with 24GB of RAM, the goal was to keep the laptop usable for everyday work: the AI had to coexist with memory-hungry Electron-based applications while still providing a context window of at least 128K tokens to handle large codebases.
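The arithmetic behind that squeeze is worth spelling out. A back-of-envelope sketch, with the layer count, KV-head count, and head size as illustrative assumptions rather than any model's published specs:

```python
# Rough memory budget for a ~9B model at Q4 with a 128K context window.
# The architecture numbers below are illustrative assumptions.

PARAMS = 9e9             # ~9 billion weights
BITS_PER_WEIGHT = 4.5    # Q4_K-style quantization averages roughly 4.5 bits
LAYERS = 36              # assumed transformer depth
KV_HEADS = 8             # assumed grouped-query attention KV heads
HEAD_DIM = 128           # assumed per-head dimension
KV_BYTES = 2             # fp16 cache entries
CONTEXT = 131_072        # the 128K window targeted here

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
# Each token stores K and V per layer: 2 * kv_heads * head_dim values.
kv_gb = 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES * CONTEXT / 1e9

print(f"weights:        ~{weights_gb:.1f} GB")  # ~5.1 GB
print(f"KV cache @128K: ~{kv_gb:.1f} GB")       # ~19.3 GB at fp16
```

Under these assumptions the full-precision KV cache alone would swallow most of the 24GB, which is why runtimes offer quantized caches (llama.cpp, for instance, can hold the KV cache at 8-bit) and why real sessions rarely fill the entire window.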

To achieve this, the setup utilized a stack consisting of Ollama for streamlined execution, llama.cpp for efficient inference, and LM Studio for a graphical interface. Model selection came down to rigorous testing of several candidates. Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B all failed to meet the requirements, either triggering out-of-memory errors or suffering performance degradation so severe that real-time work was impossible. On the other end of the spectrum, Gemma 4B ran smoothly but lacked the tool-use capabilities needed in an effective coding partner.
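All three tools expose the same OpenAI-compatible HTTP surface, which makes checking what is actually loaded trivial. A quick sketch using the standard openai Python client (the port assumes LM Studio's default of 1234; Ollama serves the same API on 11434):

```python
from openai import OpenAI

# Local servers ignore the API key, but the client requires a non-empty string.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")

# /v1/models lists whatever the runtime currently has available.
for model in client.models.list():
    print(model.id)
```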

The breakthrough came with Qwen 3.5 9B using Q4 quantization (the Q4_K_S build referenced in the configuration below). This configuration hit the ideal balance, delivering a blistering speed of 40 tokens per second. More importantly, it supported thinking mode and tool use without crashing the system, making it the only model in the test group that felt like a professional-grade tool rather than a technical curiosity.
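That 40 tokens-per-second figure is straightforward to verify. A rough benchmark sketch, streaming a completion and counting chunks as a proxy for tokens (the prompt is arbitrary, and the port and chunk-per-token approximation are assumptions; the model identifier is the one used in the configuration later on):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")

start = time.monotonic()
tokens = 0
# Stream the response; one chunk is roughly one token on most servers.
stream = client.chat.completions.create(
    model="qwen3.5-9b@q4_k_s",
    messages=[{"role": "user", "content": "Summarize what a GenServer does."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1

elapsed = time.monotonic() - start
print(f"~{tokens / elapsed:.1f} tokens/sec")
```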

The Shift from Configuration to Interaction

There was a time when deploying a local model required a grueling process of managing environment variables and resolving complex library dependencies. The current landscape has shifted toward OpenAI-compatible endpoints, where a few lines in a configuration file can bridge the gap between a local weight file and a sophisticated IDE. For those wanting to leverage the reasoning capabilities of Qwen 3.5 9B, thinking mode can be activated by adding a single Jinja directive to the LM Studio Prompt Template:

```jinja
{%- set enable_thinking = true %}
```
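Once the server is up, that same OpenAI-compatible endpoint is what every downstream tool talks to. A minimal sketch of a direct call, assuming LM Studio's default port of 1234 and the model identifier from the configuration below:

```python
from openai import OpenAI

# Any OpenAI-compatible client works; only the base_url changes.
# Port 1234 is LM Studio's default -- adjust for your runtime.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen3.5-9b@q4_k_s",  # identifier from the config below
    messages=[
        {"role": "user", "content": "Why might a linter flag length(list) == 0 in Elixir?"}
    ],
)

# With enable_thinking set in the prompt template, many servers return the
# reasoning trace inline (commonly wrapped in <think> tags) before the answer;
# exactly how it is surfaced varies by server.
print(resp.choices[0].message.content)
```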

When integrating this setup with local agent frameworks like Pi or AI-powered editors like OpenCode, getting the JSON configuration right becomes the deciding factor in performance. To expand the model's cognitive reach and ensure it can process large files, the context length must be declared explicitly. The following configuration was used to get the most out of the 9B model:

```json
{
  "models": {
    "qwen3.5-9b@q4_k_s": {
      "tools": true,
      "context_length": 131072,
      "max_tokens": 32768
    }
  }
}
```
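A note on the two limits: context_length sets the total window the runtime allocates (prompt and output combined), while max_tokens caps a single completion, so this configuration in effect reserves roughly three quarters of the 128K window for the code being read rather than the code being written.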

However, the real insight emerges when comparing this local experience to State-of-the-Art (SOTA) cloud models. A local 9B model cannot architect an entire application from scratch in a single prompt the way a massive cloud model can. The tension lies in the workflow: where a cloud model is a consultant, the local model is a pair programmer. It excels in an interactive, step-by-step workflow where the developer provides clear, incremental instructions.

In practice, this was evident during a task involving the Elixir language and its code quality tool, Credo. Qwen 3.5 9B proved highly effective at analyzing Credo warnings, suggesting specific list comparison methods, and performing parallel edits to the code. It functioned as a high-speed utility for tactical fixes. Yet the limitations were equally clear. When faced with a complex git rebase conflict involving Dependabot, the model struggled to interpret the git command sequence accurately and eventually stalled. The contrast reveals that while local models are becoming remarkably fast, they still lack the deep, multi-step reasoning that this kind of systemic conflict resolution demands.

Despite these gaps, the value proposition is undeniable. By trading a small amount of raw reasoning power for total privacy and zero subscription fees, developers gain a sustainable alternative to the cloud. The ability to maintain a high-functioning AI assistant in a completely offline environment transforms the laptop from a terminal into a self-sufficient intelligence hub.

The era of the local AI assistant has moved past the experimental phase and into the realm of practical, daily utility.