The model layer

Ollama: the easiest path to a local LLM.

One binary, one pull command, and you have a chat-capable model serving on http://127.0.0.1:11434. The same API works for Hermes, OpenClaw, OpenCode, and Claude Code. No discrete GPU required if you have an M-series Mac: Ollama runs on the chip's built-in GPU and unified memory.

Install

Get Ollama running in 60 seconds

# macOS
brew install ollama
ollama serve

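# Linux (one-liner)
curl -fsSL https://ollama.com/install.sh | sh
# On systemd distros the installer usually registers and starts an ollama service,
# so run this only if nothing is listening on port 11434 yet
ollama serve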

# Windows: download from ollama.com/download

Leave ollama serve running in a terminal (or install it as a launchd service — see below). Default endpoint: http://127.0.0.1:11434. It exposes an OpenAI-compatible API at /v1, which is what every tool on this site uses.

Run as a background service on macOS: brew services start ollama makes it auto-start on login and survive terminal closes. Recommended.
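
Whichever way you start the server, it helps to confirm something is listening before you point a tool at it. The sketch below assumes Ollama's native /api/version endpoint on the default address and that curl is installed; the BASE_URL variable and the ollama_is_up helper are names made up for this example.

```shell
# Hypothetical helper: succeeds if an Ollama server answers at the given base URL.
# Assumes the native /api/version endpoint and that curl is installed.
ollama_is_up() {
  curl -fsS --max-time 2 "$1/api/version" >/dev/null 2>&1
}

# BASE_URL is an illustrative variable name; the default matches the endpoint above.
BASE_URL="${BASE_URL:-http://127.0.0.1:11434}"

if ollama_is_up "$BASE_URL"; then
  echo "Ollama is reachable at $BASE_URL"
else
  echo "Nothing answering at $BASE_URL (is ollama serve running?)"
fi
```

The two-second timeout keeps the check from hanging when nothing is listening.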

Model picks

What to pull (and why)

There are hundreds of models on ollama.com/library. These four cover ~90% of what you'll actually want:

General chat

llama3.1:8b

Meta's Llama 3.1 8B. The safe default. ~5GB RAM. Good at conversation, decent at reasoning, fast on M-series. Use this if you're not sure.

ollama pull llama3.1:8b

Coding

qwen2.5-coder:14b

Alibaba's Qwen 2.5 Coder. ~9GB RAM. One of the strongest open coding models under 32B at the time of writing. Use it for OpenCode and Claude Code.

ollama pull qwen2.5-coder:14b

Reasoning

deepseek-r1:8b

DeepSeek's distilled R1. ~5GB RAM. Surfaces its chain-of-thought, which is great for debugging why an agent made a choice.

ollama pull deepseek-r1:8b

Vision

llava:13b

Multimodal — takes images as input. ~8GB RAM. Useful if you want to drop a screenshot into a chat with your agent.

ollama pull llava:13b

Memory budget

How much RAM do I actually need?

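Rule of thumb for M-series Macs (unified memory): at the default 4-bit quantization, the model size in billions of parameters × 0.6 ≈ RAM usage in GB. So 14B ≈ 9GB. Add 2–4GB for the OS, plus room for context and your other apps, and 16GB is the bare minimum for a 14B model; 24GB is comfortable.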

16GB Mac

Stick to 7B–8B models. llama3.1:8b, qwen2.5-coder:7b, phi3:mini. Comfortable, fast.

24GB Mac

14B is the sweet spot. qwen2.5-coder:14b works, deepseek-r1:14b works. Still leaves headroom for Chrome.

32GB+ Mac / Linux

32B opens up: qwen2.5-coder:32b at 32GB, and llama3.1:70b (Q4) on the 64GB+ tier. Frontier-ish on a laptop.
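
The tiers above boil down to a simple lookup. As a minimal sketch, the hypothetical suggest_tier function below takes installed RAM in whole GB and echoes the matching tier. The thresholds come straight from the tiers above; anything under 16GB is reported as unlisted, because this guide gives no recommendation for it.

```shell
# Hypothetical helper: map installed RAM (whole GB) to the tiers listed above.
suggest_tier() {
  ram_gb=$1
  if [ "$ram_gb" -ge 64 ]; then
    echo "70B at Q4 (llama3.1:70b)"
  elif [ "$ram_gb" -ge 32 ]; then
    echo "32B (qwen2.5-coder:32b)"
  elif [ "$ram_gb" -ge 24 ]; then
    echo "14B (qwen2.5-coder:14b)"
  elif [ "$ram_gb" -ge 16 ]; then
    echo "7B-8B (llama3.1:8b)"
  else
    echo "below the tiers listed here"
  fi
}

suggest_tier 24
```

On macOS, sysctl -n hw.memsize reports installed memory in bytes if you'd rather not look it up.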

Quantization explained: most models in the Ollama library default to a 4-bit quantization such as Q4_K_M, which is roughly 4 bits per parameter. You trade a small amount of quality for roughly a 4× memory saving vs FP16 (16 bits per parameter). For most coding and chat tasks, the difference is hard to notice.
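
The 4× figure is just bits-per-parameter arithmetic: weight memory ≈ parameters × bits ÷ 8. A rough sketch, assuming exactly 4 and 16 bits per parameter and counting weights only (context and runtime overhead come on top); weights_gb is a name invented for this example.

```shell
# Hypothetical helper: estimate weight memory in GB.
# Usage: weights_gb <params_in_billions> <bits_per_param>
# Weights only; the KV cache and runtime overhead are not included.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 14 4    # 14B at 4 bits  -> 7.0
weights_gb 14 16   # 14B at FP16    -> 28.0
```

Real Q4_K_M files come in a little above 4 bits per parameter, and the runtime needs room for context, which fits the ~9GB quoted earlier for qwen2.5-coder:14b.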

Custom Modelfile

Tune the system prompt and parameters

When the default model isn't quite right, drop a Modelfile in your project and ollama create a custom variant:

FROM qwen2.5-coder:14b

SYSTEM """You are a senior pair programmer.
Default to TypeScript. Prefer functional, immutable code.
Always explain non-obvious choices. Never invent APIs."""

PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"

# Build the custom model
ollama create coder-mike -f Modelfile

# Try it from the CLI; in Hermes / OpenClaw / OpenCode, select coder-mike as the model name
ollama run coder-mike "refactor this to async/await"
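
If you end up wanting the same persona on several base models (say, a 7B variant for a 16GB machine), you can generate the Modelfile instead of hand-editing it. A minimal sketch: write_modelfile is a hypothetical helper, and Modelfile.7b and coder-mike-7b are illustrative names. It leaves out the stop parameter, since that token depends on the base model's chat template.

```shell
# Hypothetical helper: print a Modelfile for a given base model.
# Usage: write_modelfile <base_model> <temperature> <num_ctx>
write_modelfile() {
  cat <<EOF
FROM $1

SYSTEM """You are a senior pair programmer.
Default to TypeScript. Prefer functional, immutable code.
Always explain non-obvious choices. Never invent APIs."""

PARAMETER temperature $2
PARAMETER num_ctx $3
EOF
}

# Write a 7B variant next to the original Modelfile
write_modelfile qwen2.5-coder:7b 0.2 8192 > Modelfile.7b

# Then build it as before (needs a running Ollama and the base model pulled):
# ollama create coder-mike-7b -f Modelfile.7b
```

Keeping the system prompt in one place means every variant stays in sync when you tweak it.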

Smoke test

Verify the API is working

# Hit Ollama directly with curl (swap in any model you have pulled)
curl http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:14b",
    "messages": [{"role": "user", "content": "Say hello in one word."}]
  }'

If you get a JSON response with a choices array, every tool on this site will work against your Ollama server. Next, wire it up to Hermes or OpenClaw, or jump straight to OpenCode or Claude Code.
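
To turn that check into something scriptable, here is a rough sketch: send the same request and look for the choices key in the reply. has_choices and smoke_test are hypothetical helper names, the plain grep is a stand-in for real JSON parsing, and the 120-second timeout is an arbitrary allowance for the first model load.

```shell
# Hypothetical helper: succeeds if the JSON on stdin contains a "choices" key.
# A plain text match, not a JSON parser.
has_choices() {
  grep -q '"choices"'
}

# Hypothetical helper: send the smoke-test request for the given model.
smoke_test() {
  curl -sS --max-time 120 http://127.0.0.1:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$1\", \"messages\": [{\"role\": \"user\", \"content\": \"Say hello in one word.\"}]}" \
    2>/dev/null | has_choices
}

if smoke_test "qwen2.5-coder:14b"; then
  echo "PASS: got a choices array"
else
  echo "FAIL: no choices in the response (server down, or model not pulled?)"
fi
```

Pass a different model name to smoke_test if you pulled something other than qwen2.5-coder:14b.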