# Local models

Run free-claude-code with fully local models using LM Studio or llama.cpp. No API keys, no rate limits, complete privacy.
## Option A: LM Studio

LM Studio provides a graphical interface for running local models. Best for users who prefer GUI tools.
### Install LM Studio
- Download from lmstudio.ai
- Install for your platform (Windows, macOS, Linux)
- Launch the application
### Download a model

- Go to the “Search” tab
- Find a tool-capable GGUF model. Good options:
  - LiquidAI/LFM2-24B-A2B-GGUF
  - unsloth/MiniMax-M2.5-GGUF
  - unsloth/GLM-4.7-Flash-GGUF
  - unsloth/Qwen3.5-35B-A3B-GGUF
- Download a quantization (Q4_K_M for balance, Q8_0 for quality)
### Start the server

- Go to the “Developer” tab
- Load your downloaded model
- Click “Start Server”
- Note the server URL (default: http://localhost:1234)
### Configure free-claude-code

Edit `.env`:

```bash
MODEL="lmstudio/unsloth/GLM-4.7-Flash-GGUF"
LM_STUDIO_BASE_URL="http://localhost:1234/v1"
```

No API key is required. Start the proxy:

```bash
uv run uvicorn server:app --host 0.0.0.0 --port 8082
```
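If the proxy can't reach LM Studio, every request will fail immediately, so it's worth a quick reachability check first. A minimal sketch in Python (the `/v1/models` endpoint is part of the OpenAI-compatible API LM Studio exposes; the URL and timeout below assume the defaults from above):

```python
import json
import urllib.error
import urllib.request

def is_server_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an OpenAI-compatible server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            data = json.load(resp)
            # OpenAI-compatible servers wrap the model list in {"data": [...]}
            return "data" in data
    except (urllib.error.URLError, OSError, ValueError):
        return False

if __name__ == "__main__":
    # Assumes LM Studio's default port; adjust if you changed it
    print(is_server_up("http://localhost:1234/v1"))
```

Run this before starting the proxy; `False` means the server isn't listening or hasn't finished loading the model.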
## Option B: llama.cpp

llama.cpp is a lightweight, command-line inference engine. Best for headless servers or users comfortable with terminal workflows.
### Install llama.cpp

macOS (Homebrew):

```bash
brew install llama.cpp
```

Linux (build from source; llama.cpp now builds with CMake):

```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

Windows: Download prebuilt binaries from the releases page.
### Download a GGUF model

Download a tool-capable model:

```bash
# Example: Qwen3.5 with tool use support
wget https://huggingface.co/unsloth/Qwen3.5-4B-GGUF/resolve/main/Qwen3.5-4B-Q4_K_M.gguf
```

See the Unsloth docs for detailed instructions on tool-capable models.
### Start llama-server

```bash
llama-server \
  -m Qwen3.5-4B-Q4_K_M.gguf \
  --port 8080 \
  -np 4 \
  -c 8192
```

Options explained:

- `-m`: Model file path
- `--port`: Server port (default 8080)
- `-np`: Number of parallel slots (concurrent requests)
- `-c`: Context size in tokens
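Note that `-c` is the total context shared across all slots, so `-np 4 -c 8192` leaves roughly 2048 tokens per concurrent request. A back-of-the-envelope sketch of that split and of the resulting KV-cache footprint (the layer count, KV-head count, and head size below are illustrative assumptions, not the dimensions of any specific model):

```python
def per_slot_context(total_ctx: int, slots: int) -> int:
    """llama-server splits the total context evenly across parallel slots."""
    return total_ctx // slots

def kv_cache_bytes(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Rough KV-cache size: keys + values for every layer, fp16 elements."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# -np 4 -c 8192 from the command above
print(per_slot_context(8192, 4))  # 2048 tokens per concurrent request

# Illustrative dimensions for a small ~4B model (assumed, not exact)
mib = kv_cache_bytes(ctx=8192, n_layers=36, n_kv_heads=8, head_dim=128) / 2**20
print(f"{mib:.0f} MiB")  # 1152 MiB
```

If requests get truncated, either raise `-c` or lower `-np`.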
### Configure free-claude-code

Edit `.env`:

```bash
MODEL="llamacpp/local-model"
LLAMACPP_BASE_URL="http://localhost:8080/v1"
```

No API key is required. The model name is arbitrary; llama-server ignores it when using /v1/messages.
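To illustrate why the name doesn't matter, here is what a request through the proxy looks like. This is a sketch assuming the standard Anthropic Messages payload shape and the default proxy port 8082 used earlier; llama-server answers with whatever model it has loaded:

```python
import json
import urllib.request

# Anthropic-style Messages payload; the proxy routes it to llama-server,
# which serves its loaded model regardless of the "model" field.
payload = {
    "model": "llamacpp/local-model",  # arbitrary name
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Write a haiku about GGUF."}],
}

req = urllib.request.Request(
    "http://localhost:8082/v1/messages",  # free-claude-code proxy from above
    data=json.dumps(payload).encode(),
    headers={"content-type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) sends it once the proxy is running
```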
## Choosing GGUF models

GGUF models come in different quantization levels. Fewer bits means a smaller, faster model with lower quality.
| Quantization | Relative size | Quality | Speed | Use Case |
|---|---|---|---|---|
| Q4_K_M | ~60% | Good | Fast | Daily driver |
| Q5_K_M | ~70% | Better | Fast | Balance |
| Q6_K | ~80% | Great | Medium | Quality first |
| Q8_0 | ~95% | Excellent | Medium | Best quality |
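File sizes can be estimated directly from bits per weight. A rough sketch (the bits-per-weight figures for the K-quants are approximations, and GGUF metadata overhead is ignored):

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size: parameter count x bits, ignoring metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common quantizations
BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0}

for quant, bpw in BPW.items():
    print(f"7B at {quant}: ~{gguf_size_gb(7, bpw):.1f} GB")
```

This is a quick way to check whether a given quantization will fit in RAM or VRAM before downloading it.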
For free-claude-code, Q4_K_M is usually sufficient. Upgrade if you notice reasoning errors.
## Hardware requirements
| Model Size | RAM Required | GPU VRAM (optional) | Notes |
|---|---|---|---|
| 4B params | 4-6 GB | 4 GB | Runs on most laptops |
| 7B params | 8-10 GB | 6 GB | Good balance |
| 13B params | 16-20 GB | 12 GB | Desktop/Workstation |
| 35B+ params | 32+ GB | 24+ GB | High-end only |
GPU acceleration (CUDA, Metal, ROCm) dramatically improves speed. CPU-only is usable but slow for larger models.
## Performance tips

- Reduce context size: a lower `-c` value uses less memory
- Enable GPU layers: add `-ngl 999` to offload all layers to the GPU
- Use smaller models: 4B-7B models are surprisingly capable for coding tasks
- Batch requests: the proxy handles this automatically with `PROVIDER_MAX_CONCURRENCY`
## Troubleshooting

**“Connection refused” errors:** Verify the LM Studio server or llama-server is running and on the expected port.

**Slow responses:** Check CPU vs. GPU usage. CPU inference is 10-50x slower than GPU for larger models.

**Out of memory errors:** Use a smaller model, a lower quantization, or a reduced context size (`-c`).

**Tool calling fails:** Not all GGUF models support tool use. Verify your model explicitly advertises tool support.
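One way to verify tool support is to send a request containing a `tools` array and check whether the server answers with `tool_calls` rather than plain text or an error. A minimal probe against llama-server's OpenAI-compatible endpoint (`get_weather` is a made-up example tool; the port matches the llama-server command above):

```python
import json
import urllib.request

# Standard OpenAI-style tool definition; "get_weather" is hypothetical
payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

def probe(base_url: str = "http://localhost:8080/v1") -> None:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        msg = json.load(resp)["choices"][0]["message"]
        # A tool-capable model should emit tool_calls for this prompt
        print("tool support:", bool(msg.get("tool_calls")))
```

Call `probe()` while llama-server is running; it raises `URLError` if the server is down.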
## Privacy and security
With local models:
- No data leaves your machine
- No API keys to manage
- No rate limits or usage tracking
- Works offline entirely
This is ideal for sensitive codebases, proprietary work, or environments with strict data residency requirements.