
Local models

Run free-claude-code with fully local models using LM Studio or llama.cpp. No API keys, no rate limits, no network latency, and complete data privacy.


Option A: LM Studio

LM Studio provides a graphical interface for running local models. Best for users who prefer GUI tools.

Install LM Studio

  1. Download from lmstudio.ai
  2. Install for your platform (Windows, macOS, Linux)
  3. Launch the application

Download a model

  1. Go to the “Search” tab
  2. Find a tool-capable GGUF model. Good options:
    • LiquidAI/LFM2-24B-A2B-GGUF
    • unsloth/MiniMax-M2.5-GGUF
    • unsloth/GLM-4.7-Flash-GGUF
    • unsloth/Qwen3.5-35B-A3B-GGUF
  3. Download a quantization (Q4_K_M for balance, Q8_0 for quality)

Start the server

  1. Go to the “Developer” tab
  2. Load your downloaded model
  3. Click “Start Server”
  4. Note the server URL (default: http://localhost:1234)
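To confirm the server is reachable before wiring up the proxy, you can query LM Studio's OpenAI-compatible model listing (assuming the default port above):

```shell
# A JSON list of loaded models confirms the server is up
curl http://localhost:1234/v1/models
```

If the request fails, re-check that "Start Server" is active and that the port matches.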

Configure free-claude-code

Edit .env:

MODEL="lmstudio/unsloth/GLM-4.7-Flash-GGUF"
LM_STUDIO_BASE_URL="http://localhost:1234/v1"

No API key required. Start the proxy:

uv run uvicorn server:app --host 0.0.0.0 --port 8082

Option B: llama.cpp

llama.cpp is a lightweight, command-line inference engine. Best for headless servers or users comfortable with terminal workflows.

Install llama.cpp

macOS (Homebrew):

brew install llama.cpp

Linux (build from source):

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
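Recent llama.cpp releases build with CMake rather than the legacy Makefile. If plain `make` fails, a CMake build (assuming cmake and a C/C++ toolchain are installed) looks like:

```shell
# Configure and build; binaries (including llama-server) land in build/bin/
cmake -B build
cmake --build build --config Release
```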

Windows: Download prebuilt binaries from the releases page.

Download a GGUF model

Download a tool-capable model:

# Example: Qwen3.5 with tool use support
wget https://huggingface.co/unsloth/Qwen3.5-4B-GGUF/resolve/main/Qwen3.5-4B-Q4_K_M.gguf

See Unsloth docs for detailed instructions on tool-capable models.

Start llama-server

llama-server \
  -m Qwen3.5-4B-Q4_K_M.gguf \
  --port 8080 \
  -np 4 \
  -c 8192

Options explained:

  • -m: Model file path
  • --port: Server port (default 8080)
  • -np: Number of parallel slots (concurrent requests)
  • -c: Context size in tokens
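Once llama-server is running, its built-in /health endpoint gives a quick liveness check (assuming the port above):

```shell
# Returns a small JSON status once the model is loaded and the server is ready
curl http://localhost:8080/health
```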

Configure free-claude-code

Edit .env:

MODEL="llamacpp/local-model"
LLAMACPP_BASE_URL="http://localhost:8080/v1"

No API key required. The model name after the llamacpp/ prefix is arbitrary: llama-server serves whichever model it was launched with and ignores the name in the request.

Choosing GGUF models

GGUF models come in different quantization levels: fewer bits per weight means smaller files and faster inference, at the cost of output quality.

Quantization   Size    Quality     Speed    Use case
Q4_K_M         ~60%    Good        Fast     Daily driver
Q5_K_M         ~70%    Better      Fast     Balance
Q6_K           ~80%    Great       Medium   Quality first
Q8_0           ~95%    Excellent   Medium   Best quality

For free-claude-code, Q4_K_M is usually sufficient. Upgrade if you notice reasoning errors.
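As a rough rule of thumb (an approximation only, since quantized GGUF files also store metadata and keep some tensors at higher precision), on-disk size is roughly parameter count times bits per weight divided by 8:

```shell
# Rough GGUF size estimate: params (billions) * bits per weight / 8 ≈ GB on disk
params_b=7     # 7B-parameter model
bits=4.5       # Q4_K_M averages roughly 4.5 bits/weight
awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f GB\n", p * b / 8 }'
# → 3.9 GB
```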

Hardware requirements

Model size    RAM required   GPU VRAM (optional)   Notes
4B params     4-6 GB         4 GB                  Runs on most laptops
7B params     8-10 GB        6 GB                  Good balance
13B params    16-20 GB       12 GB                 Desktop/workstation
35B+ params   32+ GB         24+ GB                High-end only

GPU acceleration (CUDA, Metal, ROCm) dramatically improves speed. CPU-only is usable but slow for larger models.

Performance tips

  • Reduce context size: Lower -c value uses less memory
  • Enable GPU layers: Add -ngl 999 to offload all layers to GPU
  • Use smaller models: 4B-7B models are surprisingly capable for coding tasks
  • Batch requests: The proxy handles this automatically with PROVIDER_MAX_CONCURRENCY
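Putting the tips together, a memory-lean, GPU-accelerated launch might look like this (flag values are illustrative; tune them for your hardware):

```shell
# Offload all layers to GPU (-ngl 999), halve the context (-c 4096)
# to cut memory use, and reduce parallel slots (-np 2) if RAM is tight.
llama-server \
  -m Qwen3.5-4B-Q4_K_M.gguf \
  --port 8080 \
  -np 2 \
  -c 4096 \
  -ngl 999
```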

Troubleshooting

“Connection refused” errors: Verify LM Studio server or llama-server is running and on the expected port.

Slow responses: Check CPU vs GPU usage. CPU inference is 10-50x slower than GPU for larger models.

Out of memory errors: Use a smaller model, lower quantization, or reduce context size (-c).

Tool calling fails: Not all GGUF models support tool use. Verify your model explicitly advertises tool support.

Privacy and security

With local models:

  • No data leaves your machine
  • No API keys to manage
  • No rate limits or usage tracking
  • Works entirely offline

This is ideal for sensitive codebases, proprietary work, or environments with strict data residency requirements.