Build a Local AI Agent with Ollama + Qwen 2.5: Complete 2026 Guide

by People Are Geek
June 14, 2026

Complete guide Local AI · 16 min read · Updated May 2026

I run agents on my own boxes now. Not as a science project anymore, just as a thing that works. Ollama hides all the runtime pain, and Qwen 2.5 got close enough to GPT-4 on code and reasoning, at sizes that actually fit a consumer GPU, that the only bill left is the power meter. That plus the afternoon you’ll burn gluing it into your stack. So. Here’s the whole thing in the order I’d do it. Install Ollama on Windows, Linux or macOS, then pick the Qwen size your VRAM can stomach, hit it from Python over the HTTP API, bring in LangChain once the workflow stops being one call, and finish on a real email-reply assistant in 80 lines. I ran every command here in May 2026 against Ollama 0.6 and the Qwen 2.5 series. If something doesn’t match, it’s probably just drifted since.

Architecture of the local agent stack: user input flows into an 80-line Python agent loop using the ollama client or LangChain ChatOllama, which calls Ollama 0.6 on localhost port 11434 serving qwen2.5:14b at Q4_K_M, with tools for fetching unread email over IMAP, drafting replies and saving them to Drafts for human review.
The whole stack on one box: your script, Ollama on 11434, Qwen doing the thinking.

Contents

  1. Why local AI in 2026
  2. Install Ollama (Windows, Linux, macOS)
  3. Choosing the right Qwen 2.5 model for your VRAM
  4. Ollama HTTP API basics
  5. Calling the model from Python
  6. LangChain integration for agents
  7. Complete example: email-reply assistant
  8. Performance tuning and benchmarks
  9. FAQ

Why local AI in 2026

What keeps pulling me back to running this myself? Start with privacy. Customer data, prompts, internal docs, the code repos, none of it leaves the building for someone else’s cloud. If you’re in healthcare or finance or legal, or you just touch PII at any real volume, that one property is the entire yes-or-no. Money’s the other half of it. Once you lean on these models for real, the break-even against paid APIs lands somewhere near 5 million tokens a month, and past that line the hardware you’ve already amortised against the power bill just wins. Then latency, which people forget about. A 7B on a halfway recent GPU spits out the first token in 50-80 ms. No cloud API gets near that, doesn’t matter how fat your pipe is.

The catch (there’s always one) is quality. A 7B or 14B on your desk is not GPT-5 turbo, and it’s not Claude Opus 4.7. Don’t kid yourself there. When the task is genuinely hard, untangling a legal argument, some gnarly code refactor, I still reach for a flagship API and I don’t lose sleep over it. But classification. Extraction. Summarising, the endless routine email drafts, the bread-and-butter agent loops? Qwen 2.5 14B holds its own. I’ve watched people fail to pick it out of a lineup against the big cloud models, and honestly that tells you most of what you need.


Install Ollama (Windows, Linux, macOS)

Best thing about Ollama? One binary. It carries the runtime, the place your models live on disk, plus a little REST server on port 11434. All of it, bundled. Whatever your OS, the installer’s done in under a minute.

macOS and Linux

curl -fsSL https://ollama.com/install.sh | sh
ollama --version  # confirm install

Windows (PowerShell)

Grab the installer from ollama.com/download, run it the once, then check it landed from PowerShell:

ollama --version
# Should print: ollama version is 0.6.x

It sets itself up as a background service and comes back on its own every reboot, so you can mostly forget about it. Poke the API to make sure it’s actually awake:

curl http://localhost:11434/api/tags
# Returns JSON with the list of locally installed models (empty at first)
Firewall note. Out of the box Ollama binds 127.0.0.1:11434, loopback only, so nothing off the box can reach it. That’s the safe default. Leave it there unless you’ve got a reason. Want other machines on the LAN to talk to it? Set OLLAMA_HOST=0.0.0.0:11434, but in the same breath write a firewall rule that only lets your LAN subnet in. Don’t skip that part. Please.
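
The same /api/tags check is easy to script, which helps when a service should refuse to start without its model. Here is a minimal sketch using only the standard library; the helper names model_names and list_local_models are mine, and it assumes the default loopback bind described above.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://localhost:11434"  # assumption: default loopback bind


def model_names(tags_payload: dict) -> List[str]:
    """Extract the model tags from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> List[str]:
    """Ask a running Ollama server which models are installed.

    Raises OSError (URLError is a subclass) if nothing is listening.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

Calling list_local_models() on a fresh install should hand back an empty list; after a pull, the tag you pulled should show up in it.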

Choosing the right Qwen 2.5 model for your VRAM

Qwen 2.5 ships in a handful of sizes you’d actually use in anger. Bookmark the table. It gives you minimum VRAM at the standard Q4_K_M quant (the one Ollama picks for you) and rough tokens-per-second on a recent consumer card. The last column is what each tier is honestly good for. One thing to get straight: VRAM is the wall you hit first here, not compute. Read that column before you fall for a number.

Requirements card grid for running Qwen 2.5 locally: minimum VRAM per model size at Q4_K_M from 1.5 GB for 0.5b up to 48 GB for 72b, with cards for the one-binary install on Windows, Linux and macOS, the CPU-only path with 16 GB RAM or Apple Silicon, and the Ollama 0.6 plus Python software stack.
What it actually takes to run each Qwen size. Teal is where the guide lives.
Model | Min VRAM (Q4_K_M) | Throughput (approx.) | Sweet spot
qwen2.5:0.5b | 1.5 GB | ~150 tok/s | Edge devices, autocomplete, classification
qwen2.5:1.5b | 2.5 GB | ~120 tok/s | Lightweight chat, extraction, on CPU-only laptops
qwen2.5:3b | 4 GB | ~90 tok/s | Summarisation, simple tool use, RAG
qwen2.5:7b | 6 GB | ~70 tok/s | General-purpose agent, code completion
qwen2.5:14b | 10 GB | ~45 tok/s | Complex reasoning, multi-step agents (best quality / cost on RTX 3060 12GB and up)
qwen2.5:32b | 22 GB | ~22 tok/s | Near-flagship quality, needs an RTX 3090 / 4090 / 5090 or Apple M3 Max+
qwen2.5:72b | 48 GB | ~12 tok/s | Top-tier quality, requires dual GPU or a single A6000 / H100
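
If you would rather let a script read the table for you, here is a small sketch that returns the largest tag whose minimum VRAM fits a given card. The numbers are copied from the table; pick_model and its headroom_gb parameter are hypothetical names of mine, with headroom standing in for whatever you want to reserve for context and the desktop.

```python
from typing import Optional

# (model tag, minimum VRAM in GB at Q4_K_M), copied from the table above
QWEN_TIERS = [
    ("qwen2.5:0.5b", 1.5),
    ("qwen2.5:1.5b", 2.5),
    ("qwen2.5:3b", 4),
    ("qwen2.5:7b", 6),
    ("qwen2.5:14b", 10),
    ("qwen2.5:32b", 22),
    ("qwen2.5:72b", 48),
]


def pick_model(vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the largest Qwen 2.5 tag that fits, or None if none does."""
    usable = vram_gb - headroom_gb
    best = None
    for tag, min_vram in QWEN_TIERS:  # tiers are sorted smallest first
        if min_vram <= usable:
            best = tag
    return best
```

On a 12 GB card this lands on qwen2.5:14b; reserve 3 GB of headroom and it falls back to qwen2.5:7b.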

Pull whichever one you settled on. One-time download, and it lands in ~/.ollama/models for good (when Ollama runs as a Linux system service, look under /usr/share/ollama/.ollama/models instead):

ollama pull qwen2.5:7b       # general purpose default
ollama pull qwen2.5:14b      # better reasoning if VRAM allows
ollama pull qwen2.5-coder:7b # code-specialised variant

Before you write a line of code against it, talk to it in the terminal first. Just to be sure it’s sane:

ollama run qwen2.5:7b
>>> Write a Python function that fetches a URL and returns the JSON.
# Streams a complete answer with code.

Ollama HTTP API basics

You’ll really only touch three endpoints. /api/chat is the newer one. It takes the same role/content message list the OpenAI chat format uses, and it’s what I default to every single time. Then /api/generate, the old raw-prompt style, still hanging around, rarely what you actually want. And /api/embeddings hands you vectors back when you’re building out a RAG pipeline (newer releases also offer /api/embed, which supersedes it).

# Chat (recommended)
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Explain async/await in Python in one paragraph."}
  ],
  "stream": false
}'

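Back comes one JSON object, with the model’s text sitting in message.content. Flip stream to true instead and you get newline-delimited JSON: one small object per line, each carrying the next fragment in message.content, with done: true on the last one. That is not the server-sent-events framing OpenAI streams in, so OpenAI streaming code won’t parse it as-is. If that’s what you need, Ollama also serves an OpenAI-compatible endpoint at /v1/chat/completions.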
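
To make the streamed shape concrete, here is a minimal sketch that stitches the text back together from newline-delimited chunks. The join_stream name is mine; it assumes each line is a JSON object with an optional message.content fragment and a done flag, which is how the native /api/chat stream is laid out.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message.content fragments from an NDJSON chat stream."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final object closes the stream
    return "".join(parts)
```

An open urllib response iterates line by line, so it can be passed to join_stream directly; in a real UI you would print each fragment as it arrives instead of collecting them.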

Most knobs you know from the OpenAI Chat Completions schema have an equivalent here, but on the native API the sampling and runtime settings go under options rather than at the top level. You’ve got temperature, top_p, seed, then num_ctx for the context window, num_predict to cap output tokens, and stop for sequences that cut generation short. Tool calling sits at the top level instead, and it has been supported since the 0.3 releases, so 0.6 handles it fine. Pass a tools array the way the chat endpoint docs lay out.
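
Here is a sketch of how that request body can be assembled in Python, with the tuning knobs nested under options and tools kept at the top level. build_chat_payload and post_chat are hypothetical helpers of mine, not part of any Ollama library, and the defaults are just the values used elsewhere in this guide.

```python
import json
import urllib.request


def build_chat_payload(model, messages, *, temperature=0.3, num_ctx=8192,
                       num_predict=None, stop=None, seed=None, tools=None):
    """Build a non-streaming /api/chat request body."""
    options = {"temperature": temperature, "num_ctx": num_ctx}
    if num_predict is not None:
        options["num_predict"] = num_predict  # cap on generated tokens
    if stop:
        options["stop"] = list(stop)
    if seed is not None:
        options["seed"] = seed  # fixed seed for repeatable sampling
    payload = {"model": model, "messages": messages,
               "stream": False, "options": options}
    if tools:
        payload["tools"] = tools  # top-level, not inside options
    return payload


def post_chat(payload, base_url="http://localhost:11434"):
    """POST the payload to a local Ollama server and return the parsed reply."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Keeping the builder separate from the network call makes the payload easy to log or unit-test before a model is even loaded.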

Calling the model from Python

Sure, you can talk to it with a single requests call. But the second you’re building an agent you’ll have outgrown that by lunchtime. Do yourself a favour, start on the official client.

# pip install ollama

import ollama

client = ollama.Client(host="http://localhost:11434")

response = client.chat(
    model="qwen2.5:7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise the benefits of local AI in 3 bullets."}
    ],
    options={"temperature": 0.3, "num_ctx": 8192}
)

print(response.message.content)

That ollama package handles the stuff you’ll reach for. Streaming, with client.chat(..., stream=True) handing you incremental chunks. Embeddings through client.embeddings. Tool calling too. And its dependency footprint stays small (httpx for the HTTP layer, plus pydantic for the typed responses in recent releases), which is exactly why it stays my default for scripts and those small services nobody ever quite gets around to rewriting.

LangChain integration for agents

Once you need more than one turn (chains, RAG, routing to tools, keeping memory around), LangChain and its Ollama binding save you from reinventing a pile of plumbing you’d only get wrong anyway. Pull the packages in:

pip install langchain langchain-community langchain-ollama

The smallest thing that does anything useful looks like this. Qwen behind a prompt template, parser bolted on the end:

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="qwen2.5:14b", temperature=0.2)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Translate the user's message into French."),
    ("user", "{text}")
])

chain = prompt | llm | StrOutputParser()
print(chain.invoke({"text": "Hello, how are you today?"}))

Want the thing to actually call tools? Drop that same ChatOllama into LangChain’s normal agent constructor and you’re off. The loop, running the tools, holding the memory between turns, all the prompt plumbing, the framework carries it for you. Which is most of why you’d bother with it in the first place.

Complete example: email-reply assistant

Here’s something you’d actually run. 80 lines that pull your last 10 unread messages over IMAP, get Qwen 2.5 14B to draft a reply that’s clearly read the email, then drop each one into Drafts so a human eyeballs it before anything goes out. Swap in your own credentials. And for the love of everything, point it at a throwaway test mailbox the first time round. I let a half-baked mail script loose on a live inbox once. Just the once.

import email
import imaplib
import time
from email.message import EmailMessage

import ollama

IMAP_HOST = "imap.example.com"
IMAP_USER = "you@example.com"
IMAP_PASS = "app-specific-password"
DRAFTS_FOLDER = "Drafts"  # the folder name varies by mail provider

SYSTEM_PROMPT = """You are a professional assistant drafting replies on
behalf of the user. Reply in the same language as the incoming email.
Be concise (3-6 sentences). Use a friendly but professional tone.
End with a signature line: 'Best regards,\\nThe Assistant (draft)'.
Never send the reply - it will be reviewed before sending."""

def get_unread_emails(limit=10):
    """Return the newest `limit` unread INBOX messages as plain dicts."""
    m = imaplib.IMAP4_SSL(IMAP_HOST)
    m.login(IMAP_USER, IMAP_PASS)
    m.select("INBOX")
    _, data = m.search(None, "UNSEEN")
    # IDs usually come back oldest first, so slice from the end for the latest.
    ids = data[0].split()[-limit:]
    messages = []
    for i in ids:
        # Note: fetching RFC822 also marks the message as read on the server.
        _, raw = m.fetch(i, "(RFC822)")
        msg = email.message_from_bytes(raw[0][1])
        body = ""
        if msg.is_multipart():
            # Use the first text/plain part; skip HTML and attachments.
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    body = (part.get_payload(decode=True) or b"").decode(errors="ignore")
                    break
        else:
            body = (msg.get_payload(decode=True) or b"").decode(errors="ignore")
        messages.append({
            "from": msg["From"], "subject": msg["Subject"],
            "body": body, "id": i
        })
    m.close()
    m.logout()
    return messages

def draft_reply(msg):
    """Ask the local model for a reply to one message and return its text."""
    client = ollama.Client()
    user_content = f"From: {msg['from']}\nSubject: {msg['subject']}\n\n{msg['body']}"
    response = client.chat(
        model="qwen2.5:14b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
        options={"temperature": 0.3, "num_ctx": 8192, "num_predict": 400}
    )
    return response.message.content

def save_draft(reply_text, original):
    """Append the reply to the Drafts folder; nothing is ever sent."""
    draft = EmailMessage()
    draft["From"] = IMAP_USER
    draft["To"] = original["from"]
    draft["Subject"] = "Re: " + (original["subject"] or "")
    draft.set_content(reply_text)
    m = imaplib.IMAP4_SSL(IMAP_HOST)
    m.login(IMAP_USER, IMAP_PASS)
    m.append(DRAFTS_FOLDER, "", imaplib.Time2Internaldate(time.time()), draft.as_bytes())
    m.logout()

if __name__ == "__main__":
    for msg in get_unread_emails(limit=10):
        print(f"Drafting reply to: {msg['subject']}")
        reply = draft_reply(msg)
        save_draft(reply, msg)
        print(f"  Saved to Drafts ({len(reply)} chars)")

That same skeleton bends to whatever you’ve got. Point it at a CRM API instead of IMAP and it does support tickets. Calendar invites over CalDAV. Anything that boils down to read a thing, draft a reply, park it for a human. Nobody warns you about this part up front. The model is the cheap, boring bit. It’s the integration around it you’ll be babysitting for years, and I don’t think that ever really changes.

Performance tuning and benchmarks

When local inference drags, it’s usually one of three knobs doing it. Check these before you go blaming the model:

  • Quantisation level. Q4_K_M is the default for a reason. You keep around 90% of the quality and shave the size down roughly 3-4x versus FP16 weights. Climb to Q5_K_M or Q6_K and you buy back a little quality on the same model, but you pay in VRAM. Q8_0’s basically lossless and nearly doubles the footprint compared with Q4_K_M, which is a lot to spend. Start at Q4_K_M. Only climb once you’ve actually measured the quality biting you, not because a bigger number feels nicer.
  • Context window (num_ctx). Ollama starts you at 2048. That’s tiny the moment you’re feeding in real history or document chunks, so bump num_ctx to 8192 or 16384. The memory cost climbs more or less in step, which means keeping nvidia-smi open and watching your headroom while it runs. That’s where the VRAM quietly disappears to.
  • GPU layers (num_gpu). Ollama tries to figure out how many layers fit in VRAM on its own, and most days it nails it. But on a box with both an integrated and a discrete GPU? I’ve watched it back the wrong horse. When that happens, force the offload with the num_gpu option, either per request under options or as PARAMETER num_gpu in a Modelfile. A value of 99 throws everything at the GPU, 0 keeps the lot on the CPU.
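
To put rough numbers on the quantisation trade-off, here is a back-of-envelope sketch. The bits-per-weight figures are approximate values for GGUF quants and the parameter counts (7.6B and 14.7B) are taken from the Qwen 2.5 model cards; treat both as assumptions, and weights_gb as a hypothetical helper of mine.

```python
# Approximate bits per weight for common GGUF quants (rounded, assumed values)
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate the weight footprint in GB (weights only, no KV cache)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"14B at {quant}: {weights_gb(14.7, quant):.1f} GB")
```

The estimate covers weights only. Context memory and runtime overhead come on top, which is why the minimum VRAM figures earlier in this guide sit above it.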

Rough numbers I’d trust for qwen2.5:14b at Q4_K_M, on the hardware people actually own in 2026. Ballpark, not gospel:

Hardware | Tokens/sec (output)
RTX 4090 (24 GB) | ~75 tok/s
RTX 4070 Ti Super (16 GB) | ~55 tok/s
Apple M3 Max 64 GB | ~38 tok/s
RTX 3060 (12 GB) | ~28 tok/s
Apple M2 16 GB (CPU+GPU) | ~14 tok/s
Intel i9-13900K CPU only (DDR5-6000) | ~6 tok/s
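
You can produce the same kind of figure for your own box from the timing fields Ollama attaches to a non-streaming response. A minimal sketch, assuming the eval_count, eval_duration, prompt_eval_count and prompt_eval_duration fields of the native API (durations are in nanoseconds); the function names are mine.

```python
def tokens_per_second(response: dict) -> float:
    """Output (generation) throughput from an Ollama response."""
    duration_ns = response.get("eval_duration", 0)
    if not duration_ns:
        return 0.0
    return response.get("eval_count", 0) / (duration_ns / 1e9)


def prompt_tokens_per_second(response: dict) -> float:
    """Prompt-processing throughput from the same response."""
    duration_ns = response.get("prompt_eval_duration", 0)
    if not duration_ns:
        return 0.0
    return response.get("prompt_eval_count", 0) / (duration_ns / 1e9)
```

Run a few realistic prompts and average the results; the first call after a model load is usually an outlier.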


FAQ

Why Qwen 2.5 and not Llama 4 or Mistral?

As of May 2026, Qwen 2.5 sits just out in front on the independent benchmarks (MMLU-Pro, HumanEval, MT-Bench) once you’re in that 7B-14B bracket. It pulls further ahead on multilingual and on code. Llama 4 405B is the better model overall, no argument there, but it won’t fit on anything under your desk; the 8B’s right in the mix yet still trails Qwen 2.5 7B on code. Mistral Large 3 is lovely and not open-weight, which for me ends the conversation. Check back in six months though. I’ve been wrong about which model wins more times than I care to admit.

Can I run Ollama on a laptop without a discrete GPU?

You can, yeah. CPU-only runs fine for the 0.5B, 1.5B and 3B on a recent Intel or AMD chip with 16 GB of RAM or more. The throughput’s perfectly livable. Apple Silicon’s a treat here: from the M1 on, the unified memory and Metal GPU pick up the slack, and even a plain M2 runs 7B at a speed you won’t swear at. The 14B on CPU though? It’ll technically go. But you’re looking at 3-6 tokens a second on a top-end desktop, the kind of slow where you go make coffee and it’s still typing when you sit back down.

How do I expose Ollama securely to the office network?

Three moves, in order. First, set OLLAMA_HOST=0.0.0.0:11434 so it listens on every interface. Then lock the host firewall down so port 11434 only opens to the office subnet (192.168.1.0/24, or whatever your VPN range happens to be). Then stand a reverse proxy out front, Caddy if you ask me, nginx if that’s your shop, to terminate TLS and bolt on basic auth or an API-token check. And the one that matters most? Never, ever hang port 11434 straight off the internet bare. The API ships with zero auth. Anyone who finds it owns it.

What is the memory cost of long context?

Rough rule I carry in my head: context lives in the KV cache, and with Ollama’s default f16 cache the 7B (28 layers, 4 KV heads of width 128) costs about 56 KB per token, so call it 55-60 MB of VRAM per 1,000 tokens of context. The 14B (48 layers, 8 KV heads) is nearer 190-200 MB per 1,000 tokens. So a full 32k window with the 14B runs you roughly 6 GB on top of the weights themselves. Don’t take my arithmetic as final, though. Fire off a realistic call, watch nvidia-smi while it’s actually working, and set your cap off what you see.
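
For a first estimate before you measure, the KV cache size can be worked out from the model architecture. This sketch assumes an f16 cache (2 bytes per value) and the layer and KV-head counts from the published Qwen 2.5 configs; the table and function names are mine, and real usage adds runtime overhead on top.

```python
# Architecture figures assumed from the published Qwen 2.5 model configs
QWEN25 = {
    "7b": {"layers": 28, "kv_heads": 4, "head_dim": 128},
    "14b": {"layers": 48, "kv_heads": 8, "head_dim": 128},
}


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """One key and one value vector per KV head, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gib(model: str, context_tokens: int) -> float:
    """KV cache size in GiB for a completely filled context window."""
    return kv_cache_bytes_per_token(**QWEN25[model]) * context_tokens / 2**30


if __name__ == "__main__":
    for size in QWEN25:
        print(f"{size}: {kv_cache_gib(size, 32768):.2f} GiB at 32k context")
```

Because the cost is linear in context length, halving num_ctx halves this number, which is often the cheapest way to get a model to fit.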

How do I use function calling with Qwen 2.5?

Qwen 2.5 does tool calling out of the box, no tricks. Hand it a tools array in the chat request using the OpenAI function schema; Ollama 0.5 and up passes it through cleanly, and when the model decides it wants a tool you’ll find message.tool_calls waiting in the response. The whole multi-turn dance (call the tool, append the result as a role: tool message, let it call the next) runs much like it does against the OpenAI API. One difference on the native endpoint: the arguments come back as a parsed JSON object rather than a JSON string. If you’ve done it there, you mostly know this already.
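
Here is a minimal sketch of the dispatch step in that loop: take the assistant message, run whatever tools it asked for, and build the role: tool messages to append before the next chat call. The get_weather tool, the REGISTRY mapping and run_tool_calls are all hypothetical names for illustration.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Tool description in the OpenAI function schema, passed as `tools` in the request
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict, registry: dict) -> list:
    """Execute requested tools and return the messages to feed back."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        args = fn.get("arguments") or {}
        if isinstance(args, str):  # OpenAI-style replies carry JSON text
            args = json.loads(args)
        handler = registry.get(fn["name"])
        content = handler(**args) if handler else f"unknown tool: {fn['name']}"
        results.append({"role": "tool", "content": str(content)})
    return results
```

Append the assistant message and the returned tool messages to your history, call the chat endpoint again, and repeat until the reply comes back without tool_calls.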

Can I fine-tune Qwen 2.5 on my own data?

You can, with LoRA or QLoRA through unsloth or axolotl. Quantised fine-tuning of the 7B runs me about 4 hours on an RTX 4090 against a 5,000-example set. Then you save the adapter, merge it back into the base model, and wrap the result as an Ollama model with a Modelfile. That whole pipeline is its own article, honestly. Let me save you the trouble though. Most of the time, prompt engineering plus a bit of RAG gets you there without touching training at all. I’d exhaust that first before I’d reach for fine-tuning.

People Are Geek

I'm Stephane, a network and systems engineer with over 15 years of hands-on experience on production infrastructure, virtualization (ESXi, Proxmox), networking, and self-hosting. Earlier in my career I built and ran a Linux resource site that became a well-known reference for sysadmins. Today I focus on cybersecurity, and I also work as a technical trainer, teaching networking and security to people who do it for a living. Everything on People Are Geek comes from real-world practice, not theory. I build every tool on this site myself, and I write about what I've actually deployed, broken, and fixed. If it's here, I've used it.
