# Local models

Run free-claude-code with fully local models using LM Studio or llama.cpp. No API keys, no rate limits, complete privacy.
## Option A: LM Studio

LM Studio provides a graphical interface for running local models. Best for users who prefer GUI tools.
### Install LM Studio
- Download from lmstudio.ai
- Install for your platform (Windows, macOS, Linux)
- Launch the application
### Download a model

- Go to the “Search” tab
- Find a tool-capable GGUF model. Good options:
  - LiquidAI/LFM2-24B-A2B-GGUF
  - unsloth/MiniMax-M2.5-GGUF
  - unsloth/GLM-4.7-Flash-GGUF
  - unsloth/Qwen3.5-35B-A3B-GGUF
- Download a quantization (Q4_K_M for balance, Q8_0 for quality)
### Start the server

- Go to the “Developer” tab
- Load your downloaded model
- Click “Start Server”
- Note the server URL (default: http://localhost:1234)
### Configure free-claude-code

Edit `.env`:

```bash
MODEL="lmstudio/unsloth/GLM-4.7-Flash-GGUF"
LM_STUDIO_BASE_URL="http://localhost:1234/v1"
```

No API key is required. Start the proxy:

```bash
uv run uvicorn server:app --host 0.0.0.0 --port 8082
```
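If the proxy can't reach LM Studio, every request will fail immediately, so it's worth a quick reachability check first. A minimal sketch in Python (the `/v1/models` endpoint is part of the OpenAI-compatible API LM Studio exposes; the URL and timeout below assume the defaults from above):

```python
import json
import urllib.error
import urllib.request

def is_server_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an OpenAI-compatible server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            data = json.load(resp)
            # OpenAI-compatible servers wrap the model list in {"data": [...]}
            return "data" in data
    except (urllib.error.URLError, OSError, ValueError):
        return False

if __name__ == "__main__":
    # Assumes LM Studio's default port; adjust if you changed it
    print(is_server_up("http://localhost:1234/v1"))
```

Run this before starting the proxy; `False` means the server isn't listening or hasn't finished loading the model.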
## Option B: llama.cpp

llama.cpp is a lightweight, command-line inference engine. Best for headless servers or users comfortable with terminal workflows.
### Install llama.cpp

macOS (Homebrew):

```bash
brew install llama.cpp
```

Linux (build from source; llama.cpp now builds with CMake):

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

Windows: Download prebuilt binaries from the releases page.
### Download a GGUF model

Download a tool-capable model:

```bash
# Example: Qwen3.5 with tool use support
wget https://huggingface.co/unsloth/Qwen3.5-4B-GGUF/resolve/main/Qwen3.5-4B-Q4_K_M.gguf
```

See the Unsloth docs for detailed instructions on tool-capable models.
### Start llama-server

```bash
llama-server \
  -m Qwen3.5-4B-Q4_K_M.gguf \
  --port 8080 \
  -np 4 \
  -c 8192
```

Options explained:

- `-m`: Model file path
- `--port`: Server port (default 8080)
- `-np`: Number of parallel slots (concurrent requests)
- `-c`: Context size in tokens
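Note that `-c` is the total context shared across all slots, so `-np 4 -c 8192` leaves roughly 2048 tokens per concurrent request. A back-of-the-envelope sketch of that split and of the resulting KV-cache footprint (the layer count, KV-head count, and head size below are illustrative assumptions, not the dimensions of any specific model):

```python
def per_slot_context(total_ctx: int, slots: int) -> int:
    """llama-server splits the total context evenly across parallel slots."""
    return total_ctx // slots

def kv_cache_bytes(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Rough KV-cache size: keys + values for every layer, fp16 elements."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# -np 4 -c 8192 from the command above
print(per_slot_context(8192, 4))  # 2048 tokens per concurrent request

# Illustrative dimensions for a small ~4B model (assumed, not exact)
mib = kv_cache_bytes(ctx=8192, n_layers=36, n_kv_heads=8, head_dim=128) / 2**20
print(f"{mib:.0f} MiB")  # 1152 MiB
```

If requests get truncated, either raise `-c` or lower `-np`.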
### Configure free-claude-code

Edit `.env`:

```bash
MODEL="llamacpp/local-model"
LLAMACPP_BASE_URL="http://localhost:8080/v1"
```

No API key is required. The model name is arbitrary; llama-server ignores it when using /v1/messages.
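To illustrate why the name doesn't matter, here is what a request through the proxy looks like. This is a sketch assuming the standard Anthropic Messages payload shape and the default proxy port 8082 used earlier; llama-server answers with whatever model it has loaded:

```python
import json
import urllib.request

# Anthropic-style Messages payload; the proxy routes it to llama-server,
# which serves its loaded model regardless of the "model" field.
payload = {
    "model": "llamacpp/local-model",  # arbitrary name
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Write a haiku about GGUF."}],
}

req = urllib.request.Request(
    "http://localhost:8082/v1/messages",  # free-claude-code proxy from above
    data=json.dumps(payload).encode(),
    headers={"content-type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) sends it once the proxy is running
```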
## Choosing GGUF models

GGUF models come in different quantization levels. Fewer bits means a smaller, faster model with lower quality.
| Quantization | Relative size | Quality | Speed | Use Case |
|---|---|---|---|---|
| Q4_K_M | ~60% | Good | Fast | Daily driver |
| Q5_K_M | ~70% | Better | Fast | Balance |
| Q6_K | ~80% | Great | Medium | Quality first |
| Q8_0 | ~95% | Excellent | Medium | Best quality |
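File sizes can be estimated directly from bits per weight. A rough sketch (the bits-per-weight figures for the K-quants are approximations, and GGUF metadata overhead is ignored):

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size: parameter count x bits, ignoring metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common quantizations
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

for quant, bpw in BPW.items():
    print(f"7B at {quant}: ~{gguf_size_gb(7, bpw):.1f} GB")
```

This is a quick way to check whether a given quantization will fit in RAM or VRAM before downloading it.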
For free-claude-code, Q4_K_M is usually sufficient. Upgrade if you notice reasoning errors.
## Hardware requirements
| Model Size | RAM Required | GPU VRAM (optional) | Notes |
|---|---|---|---|
| 4B params | 4-6 GB | 4 GB | Runs on most laptops |
| 7B params | 8-10 GB | 6 GB | Good balance |
| 13B params | 16-20 GB | 12 GB | Desktop/Workstation |
| 35B+ params | 32+ GB | 24+ GB | High-end only |
GPU acceleration (CUDA, Metal, ROCm) dramatically improves speed. CPU-only is usable but slow for larger models.
## Performance tips

- Reduce context size: a lower `-c` value uses less memory
- Enable GPU layers: add `-ngl 999` to offload all layers to the GPU
- Use smaller models: 4B-7B models are surprisingly capable for coding tasks
- Batch requests: the proxy handles this automatically with `PROVIDER_MAX_CONCURRENCY`
## Troubleshooting

**“Connection refused” errors:** Verify the LM Studio server or llama-server is running and on the expected port.

**Slow responses:** Check CPU vs. GPU usage. CPU inference is 10-50x slower than GPU for larger models.

**Out of memory errors:** Use a smaller model, a lower quantization, or a reduced context size (`-c`).

**Tool calling fails:** Not all GGUF models support tool use. Verify your model explicitly advertises tool support.
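One way to verify tool support is to send a request containing a `tools` array and check whether the server answers with `tool_calls` rather than plain text or an error. A minimal probe against llama-server's OpenAI-compatible endpoint (`get_weather` is a made-up example tool; the port matches the llama-server command above):

```python
import json
import urllib.request

# Standard OpenAI-style tool definition; "get_weather" is hypothetical
payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

def probe(base_url: str = "http://localhost:8080/v1") -> None:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        msg = json.load(resp)["choices"][0]["message"]
        # A tool-capable model should emit tool_calls for this prompt
        print("tool support:", bool(msg.get("tool_calls")))
```

Call `probe()` while llama-server is running; it raises `URLError` if the server is down.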
## Privacy and security
With local models:
- No data leaves your machine
- No API keys to manage
- No rate limits or usage tracking
- Works offline entirely
This is ideal for sensitive codebases, proprietary work, or environments with strict data residency requirements.