Running an AI agent locally on your laptop in 2026 is no longer an experiment. With Ollama packaging the runtime and Qwen 2.5 hitting near-GPT-4 quality on reasoning and code at sizes that fit consumer GPUs, the only remaining cost is electricity and the half-day it takes to wire the agent into your stack. This guide walks through the complete path: install Ollama on Windows, Linux or macOS, pick the right Qwen model for your VRAM, call it from Python over its HTTP API, integrate with LangChain when the workflow grows, and build a working email-reply assistant in 80 lines of code. All commands tested in May 2026 against Ollama 0.6 and Qwen 2.5 series.
Why local AI in 2026
Three forces make local inference the right default for a growing share of use cases in 2026. Privacy: customer data, internal documents and code repositories no longer leave your machine for a cloud provider. For regulated industries (healthcare, finance, legal) and for anyone processing PII at scale, that single property is the difference between yes and no. Cost: at moderate to heavy usage, the break-even versus paid APIs sits around 5 million tokens per month. Past that, the electricity-amortised hardware wins decisively. Latency: a 7B model on a recent GPU returns the first token in 50-80 ms, which beats every cloud API even on a fast connection.
The remaining tradeoff is quality: a local 7B or 14B model is not GPT-5 turbo or Claude Opus 4.7. For reasoning-heavy tasks (legal analysis, complex code refactor), a flagship API is still the right call. For classification, extraction, summarisation, routine email drafting and most agent workflows, local Qwen 2.5 14B is now indistinguishable from cloud flagships in blind tests with humans rating outputs.
Install Ollama (Windows, Linux, macOS)
Ollama is a single binary that bundles the model runtime, model storage, and a REST server on port 11434. The installer takes 60 seconds on any platform.
macOS and Linux
curl -fsSL https://ollama.com/install.sh | sh
ollama --version # confirm install
Windows (PowerShell)
Download the installer from ollama.com/download, run it once, then verify in PowerShell:
ollama --version
# Should print: ollama version is 0.6.x
Ollama installs as a background service that auto-starts on boot. Confirm the API responds:
curl http://localhost:11434/api/tags
# Returns JSON with the list of locally installed models (empty at first)
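The same check can be scripted. The sketch below uses only the standard library and assumes the server is on the default loopback address; the function names are illustrative, not part of any Ollama package.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # default address mentioned in the text


def model_names(tags_payload: dict) -> list:
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list:
    """Query the running server; raises URLError if Ollama is not reachable."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# With the service running:
#   print(list_local_models())
```

Keeping the parsing in `model_names` separate from the network call makes it easy to try the parser against a saved response.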
Ollama listens on 127.0.0.1:11434 by default, which is safe (loopback only). To expose it to other machines on your LAN, set the environment variable OLLAMA_HOST=0.0.0.0:11434 and add a firewall rule scoped to your LAN subnet only.
Choosing the right Qwen 2.5 model for your VRAM
Qwen 2.5 ships in seven sizes. The table below gives the minimum VRAM for the standard Q4_K_M quantisation (the default Ollama uses), the throughput you can expect on a recent consumer GPU, and what each tier is good for.
| Model | VRAM min | Throughput | Sweet spot |
|---|---|---|---|
| qwen2.5:0.5b | 1.5 GB | ~150 tok/s | Edge devices, autocomplete, classification |
| qwen2.5:1.5b | 2.5 GB | ~120 tok/s | Lightweight chat, extraction, on CPU-only laptops |
| qwen2.5:3b | 4 GB | ~90 tok/s | Summarisation, simple tool use, RAG |
| qwen2.5:7b | 6 GB | ~70 tok/s | General-purpose agent, code completion |
| qwen2.5:14b | 10 GB | ~45 tok/s | Complex reasoning, multi-step agents (best quality / cost on RTX 3060 12GB and up) |
| qwen2.5:32b | 22 GB | ~22 tok/s | Near-flagship quality, needs an RTX 3090 / 4090 / 5090 or Apple M3 Max+ |
| qwen2.5:72b | 48 GB | ~12 tok/s | Top-tier quality, requires dual GPU or a single A6000 / H100 |
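To turn the table into a quick decision, a small lookup is enough. This is a sketch: the VRAM figures are copied from the table, while `pick_model` and its `headroom_gb` margin are hypothetical helpers, not Ollama settings.

```python
from typing import Optional

# (tag, minimum VRAM in GB) copied from the table above, smallest first.
QWEN_TIERS = [
    ("qwen2.5:0.5b", 1.5),
    ("qwen2.5:1.5b", 2.5),
    ("qwen2.5:3b", 4.0),
    ("qwen2.5:7b", 6.0),
    ("qwen2.5:14b", 10.0),
    ("qwen2.5:32b", 22.0),
    ("qwen2.5:72b", 48.0),
]


def pick_model(vram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest tag whose minimum VRAM fits, or None if none fits.

    headroom_gb is an assumed safety margin for the context cache and
    other processes sharing the GPU.
    """
    usable = vram_gb - headroom_gb
    fitting = [tag for tag, need in QWEN_TIERS if need <= usable]
    return fitting[-1] if fitting else None


print(pick_model(12))  # a 12 GB card
```

Raise `headroom_gb` if you plan to run long contexts, since the cache grows with `num_ctx`.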
Pull the model you need (one-time download, models cached in ~/.ollama/models):
ollama pull qwen2.5:7b # general purpose default
ollama pull qwen2.5:14b # better reasoning if VRAM allows
ollama pull qwen2.5-coder:7b # code-specialised variant
Test the model directly in the terminal before wiring it into your code:
ollama run qwen2.5:7b
>>> Write a Python function that fetches a URL and returns the JSON.
# Streams a complete answer with code.
Ollama HTTP API basics
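Ollama exposes three core endpoints. /api/chat takes a list of role/content messages, the same shape as OpenAI chat messages, and is the right default. /api/generate takes a raw prompt and suits single-shot completions. /api/embed (which supersedes the older /api/embeddings) returns vector embeddings for RAG pipelines. For clients that speak the OpenAI wire format, Ollama also serves an OpenAI-compatible endpoint at /v1/chat/completions.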
# Chat (recommended)
curl http://localhost:11434/api/chat -d '{
"model": "qwen2.5:7b",
"messages": [
{"role": "system", "content": "You are a concise coding assistant."},
{"role": "user", "content": "Explain async/await in Python in one paragraph."}
],
"stream": false
}'
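The response is a single JSON object with message.content holding the model output. With stream: true (the default when the field is omitted) the server emits newline-delimited JSON objects, each carrying a fragment of the reply, and the last one has done: true. This differs from the OpenAI streaming format, which wraps its chunks in server-sent events.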
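Consuming the streamed form comes down to parsing one JSON object per line. The sketch below assumes the chunk shape described above (a `message.content` fragment plus a `done` flag) and runs against a canned sample, so it can be tried without a server; `iter_stream_text` is an illustrative name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited /api/chat stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # the final object marks the end of the reply


# Canned sample in the expected shape, so the parser can be tried offline.
SAMPLE = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print("".join(iter_stream_text(SAMPLE)))
```

Against a live server, the response object returned by `urllib.request.urlopen` can be passed in directly, since iterating over it yields the body line by line as bytes.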
Sampling and runtime parameters go under the options key: temperature, top_p, seed, num_ctx (context window), num_predict (max output tokens), stop (sequences that end generation). Several mirror fields from the OpenAI Chat Completions schema, but in the native API they belong under options rather than at the top level. Tool calling is supported in recent Ollama releases; use the tools array per the chat endpoint documentation.
Calling the model from Python
The minimum-viable Python client is a single requests call. For agents you will outgrow it within a day; jump straight to the official client.
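For reference, the single-call version looks roughly like this. The sketch swaps `requests` for the standard library's `urllib` so it has no dependencies; `build_chat_request` and `ask` are illustrative names.

```python
import json
from urllib.request import Request, urlopen


def build_chat_request(prompt: str, model: str = "qwen2.5:7b",
                       base_url: str = "http://localhost:11434") -> Request:
    """Build a non-streaming POST to /api/chat (same body as the curl example)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return message.content; needs a running server."""
    with urlopen(build_chat_request(prompt), timeout=120) as resp:
        return json.load(resp)["message"]["content"]


# With Ollama running and the model pulled:
#   print(ask("Explain async/await in Python in one paragraph."))
```

This is fine for a one-off script. Streaming, retries and tool calls are where the hand-rolled version starts to cost more than it saves.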
# pip install ollama
import ollama

client = ollama.Client(host="http://localhost:11434")
response = client.chat(
    model="qwen2.5:7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise the benefits of local AI in 3 bullets."},
    ],
    options={"temperature": 0.3, "num_ctx": 8192},
)
print(response.message.content)
The ollama Python package supports streaming (client.chat(..., stream=True) yields chunks as they are generated), embeddings (client.embed), and tool calling. Its dependencies are limited to httpx and pydantic, and it is the right default for scripts and small services.
LangChain integration for agents
For anything beyond a single-turn call (chains, RAG, tool routing, memory), LangChain plus its Ollama integration gives you a battle-tested toolkit. Install the packages:
pip install langchain langchain-community langchain-ollama
The minimum example wires Qwen to a prompt template and a parser:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="qwen2.5:14b", temperature=0.2)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Translate the user's message into French."),
    ("user", "{text}"),
])
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"text": "Hello, how are you today?"}))
For an agent that calls tools, swap ChatOllama into the standard LangChain agent constructor. The agent loop, the tool execution and the memory are all framework-managed.
Complete example: email-reply assistant
A practical 80-line agent: reads the last 10 unread emails from your IMAP inbox, drafts a context-aware reply using Qwen 2.5 14B, saves drafts to the IMAP Drafts folder for human review before sending. Replace the credentials and run in a sandboxed test mailbox first.
import email
import imaplib
import time
from email.message import EmailMessage

import ollama

IMAP_HOST = "imap.example.com"
IMAP_USER = "you@example.com"
IMAP_PASS = "app-specific-password"

SYSTEM_PROMPT = """You are a professional assistant drafting replies on
behalf of the user. Reply in the same language as the incoming email.
Be concise (3-6 sentences). Use a friendly but professional tone.
End with a signature line: 'Best regards,\\nThe Assistant (draft)'.
Never send the reply - it will be reviewed before sending."""


def get_unread_emails(limit=10):
    """Return the most recent unread messages as plain dicts."""
    m = imaplib.IMAP4_SSL(IMAP_HOST)
    m.login(IMAP_USER, IMAP_PASS)
    m.select("INBOX")
    _, data = m.search(None, "UNSEEN")
    # IDs come back oldest first; keep the newest `limit` of them.
    ids = data[0].split()[-limit:]
    messages = []
    for i in ids:
        _, raw = m.fetch(i, "(RFC822)")
        msg = email.message_from_bytes(raw[0][1])
        body = ""
        if msg.is_multipart():
            # Use the first text/plain part and ignore attachments and HTML.
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    body = (part.get_payload(decode=True) or b"").decode(errors="ignore")
                    break
        else:
            body = (msg.get_payload(decode=True) or b"").decode(errors="ignore")
        messages.append({
            "from": msg["From"], "subject": msg["Subject"],
            "body": body, "id": i
        })
    m.close()
    m.logout()
    return messages


def draft_reply(msg):
    """Ask the local model for a reply to one message."""
    client = ollama.Client()
    user_content = f"From: {msg['from']}\nSubject: {msg['subject']}\n\n{msg['body']}"
    response = client.chat(
        model="qwen2.5:14b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
        options={"temperature": 0.3, "num_ctx": 8192, "num_predict": 400}
    )
    return response.message.content


def save_draft(reply_text, original):
    """Store the reply in the Drafts folder; nothing is sent."""
    draft = EmailMessage()
    draft["From"] = IMAP_USER
    draft["To"] = original["from"]
    draft["Subject"] = "Re: " + (original["subject"] or "")
    draft.set_content(reply_text)
    m = imaplib.IMAP4_SSL(IMAP_HOST)
    m.login(IMAP_USER, IMAP_PASS)
    m.append("Drafts", "", imaplib.Time2Internaldate(time.time()), draft.as_bytes())
    m.logout()


if __name__ == "__main__":
    for msg in get_unread_emails(limit=10):
        print(f"Drafting reply to: {msg['subject']}")
        reply = draft_reply(msg)
        save_draft(reply, msg)
        print(f"  Saved to Drafts ({len(reply)} chars)")
The same skeleton extends to support tickets (CRM API instead of IMAP), calendar invites (CalDAV), or any other “read input, draft reply, queue for human review” workflow. The model is the cheap part; the integration is what you maintain.
Performance tuning and benchmarks
Three knobs move the needle most on local inference performance.
- Quantisation level. The default Q4_K_M keeps roughly 90% of full-precision quality at about a quarter of the size. Q5_K_M and Q6_K give better quality for the same parameter count at the cost of more VRAM. Q8_0 is essentially lossless but needs about twice the VRAM of Q4_K_M. The right starting point is Q4_K_M; upgrade only if you measure quality issues.
- Context window (num_ctx). Defaults to 2048 in Ollama. For agents that include long history or document chunks, set num_ctx to 8192 or 16384. Memory cost scales roughly linearly; check the VRAM headroom with nvidia-smi while running.
- GPU layers (num_gpu). Ollama auto-detects how many layers fit in VRAM, but on systems with both integrated and discrete GPUs the heuristic can pick the wrong one. Force the offload count with the num_gpu option: 99 for full GPU, 0 for CPU-only.
Indicative throughput on common 2026 hardware running qwen2.5:14b Q4_K_M:
| Hardware | Tokens/sec (output) |
|---|---|
| RTX 4090 (24 GB) | ~75 tok/s |
| RTX 4070 Ti Super (16 GB) | ~55 tok/s |
| Apple M3 Max 64 GB | ~38 tok/s |
| RTX 3060 (12 GB) | ~28 tok/s |
| Apple M2 16 GB (CPU+GPU) | ~14 tok/s |
| Intel i9-13900K CPU only (DDR5-6000) | ~6 tok/s |
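To compare your own hardware against these figures, the timing fields Ollama returns with each completed response are enough. A minimal sketch, assuming a non-streaming reply that includes `eval_count` and `eval_duration` (durations are reported in nanoseconds); the helper names are illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Output throughput from a completed /api/chat or /api/generate reply.

    eval_count is the number of generated tokens and eval_duration is the
    generation time in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def time_before_first_token_ms(response: dict) -> float:
    """Approximate server-side delay before generation: load + prompt eval."""
    ns = response.get("load_duration", 0) + response.get("prompt_eval_duration", 0)
    return ns / 1e6


# Example with made-up timing values in the expected shape:
sample = {"eval_count": 90, "eval_duration": 2_000_000_000,
          "load_duration": 10_000_000, "prompt_eval_duration": 40_000_000}
print(tokens_per_second(sample), time_before_first_token_ms(sample))
```

Run a few representative prompts and average the results; the first call after a model load is usually slower because `load_duration` is included.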
Estimate the API cost you would avoid
Our AI Cost Calculator compares your monthly token volume against OpenAI, Anthropic, Gemini and self-hosted to show the break-even point for local AI versus the cloud.
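The break-even itself is simple arithmetic: monthly local cost divided by the API price per token. The sketch below uses placeholder inputs chosen purely for illustration; none of them are measured prices, so substitute your own hardware cost, electricity bill and provider rates.

```python
def break_even_tokens_per_month(hardware_cost: float, amortise_months: int,
                                electricity_per_month: float,
                                api_price_per_million: float) -> float:
    """Monthly token volume at which local cost equals API cost."""
    local_monthly = hardware_cost / amortise_months + electricity_per_month
    return local_monthly / api_price_per_million * 1_000_000


# Placeholder inputs only: a GPU costing 600 amortised over 24 months,
# 5 per month of electricity, and a blended API price of 6 per million tokens.
print(break_even_tokens_per_month(600, 24, 5, 6))
```

Below the returned volume the API is cheaper; above it the local setup wins, and the gap widens as usage grows.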
FAQ
Why Qwen 2.5 and not Llama 4 or Mistral?
In May 2026, Qwen 2.5 leads independent benchmarks (MMLU-Pro, HumanEval, MT-Bench) in the 7B-14B size range, especially for multilingual and code tasks. Llama 4 405B is stronger overall but does not fit on consumer hardware; Llama 4 8B is competitive but trails Qwen 2.5 7B on code. Mistral Large 3 is excellent but not open-weight. Reassess every six months — the landscape moves fast.
Can I run Ollama on a laptop without a discrete GPU?
Yes. CPU-only inference works for Qwen 2.5 0.5B, 1.5B and 3B with acceptable throughput on a modern Intel or AMD CPU with at least 16 GB of RAM. Apple Silicon (M1 and later) accelerates inference through the unified memory and Metal GPU; even the base M2 runs 7B at usable speed. 14B on CPU is technically possible but slow (3-6 tokens per second on a top-tier desktop CPU).
How do I expose Ollama securely to the office network?
Three steps. Set OLLAMA_HOST=0.0.0.0:11434 to listen on all interfaces. Add a firewall rule on the host that allows port 11434 only from the office subnet (192.168.1.0/24 or your VPN range). Put a reverse proxy like Caddy or nginx in front to terminate TLS and add basic auth or an API token check. Never expose port 11434 directly to the internet without authentication; the API has no built-in auth.
What is the memory cost of long context?
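The cost comes from the key/value cache, which grows linearly with context and is independent of the weight quantisation. With the default 16-bit cache, each token stores 2 × layers × KV heads × head dimension × 2 bytes. For Qwen 2.5 7B (28 layers, 4 KV heads, head dimension 128) that is about 56 KiB per token, or roughly 57 MB per 1,000 tokens; for 14B (48 layers, 8 KV heads) it is about 192 KiB per token. A 32k context with 14B therefore costs about 6 GB of VRAM on top of the model weights. Check with nvidia-smi during a representative inference call before deciding the cap.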
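A rough estimate can be derived from the model architecture. The sketch below assumes a 16-bit key/value cache and uses layer and KV-head counts taken from the published Qwen 2.5 model configs; treat those figures as assumptions and check the config of the exact model you run.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache: one K and one V tensor per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed architecture figures (from the published model configs).
QWEN25 = {
    "7b": {"layers": 28, "kv_heads": 4},
    "14b": {"layers": 48, "kv_heads": 8},
}

for name, cfg in QWEN25.items():
    gib = kv_cache_bytes(32_768, **cfg) / 2**30
    print(f"qwen2.5:{name} at 32k context: {gib:.2f} GiB")
```

The result excludes the model weights and the runtime's compute buffers, so the measured footprint will be somewhat higher.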
How do I use function calling with Qwen 2.5?
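Qwen 2.5 supports tool calling natively. Pass a tools array in the chat request using the OpenAI function schema. Recent Ollama releases pass it through to the model, and the response includes message.tool_calls when the model decides to use a tool. Multi-turn loops (call tool, return result, call next) follow the same pattern as the OpenAI API, with one difference on the native endpoint: function arguments arrive as a JSON object rather than a JSON-encoded string.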
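The client side of that loop can be sketched without any framework. Everything below is illustrative: `get_weather` is a hypothetical tool, and the message shape assumes `tool_calls` entries of the form `{"function": {"name": ..., "arguments": ...}}`.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool used only for this illustration."""
    return f"Sunny in {city}"


TOOLS = {"get_weather": get_weather}

# Tool description in the OpenAI function schema, passed as the "tools" array.
TOOL_SPECS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def run_tool_calls(message: dict) -> list:
    """Execute each requested tool and return 'tool' messages to append."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = fn["arguments"]
        if isinstance(args, str):  # tolerate arguments sent as a JSON string
            args = json.loads(args)
        output = TOOLS[fn["name"]](**args)
        results.append({"role": "tool", "content": str(output)})
    return results
```

In a full loop you would append the assistant message and the returned tool messages to the history, call the chat endpoint again with the same tools array, and stop once a reply arrives with no `tool_calls`.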
Can I fine-tune Qwen 2.5 on my own data?
Yes, via LoRA or QLoRA training with unsloth or axolotl. Quantised fine-tuning of Qwen 2.5 7B takes about 4 hours on an RTX 4090 with a 5,000-example dataset. Save the adapter, merge with the base model, repackage as an Ollama model with a Modelfile. The full pipeline is a separate guide; for most use cases, prompt engineering and RAG cover the need without the training step.













