
Build a Local AI Agent with Ollama + Qwen 2.5: Complete 2026 Guide

by People Are Geek
May 27, 2026
in AI Tools

Complete guide Local AI · 16 min read · Updated May 2026

Running an AI agent locally on your laptop in 2026 is no longer an experiment. With Ollama packaging the runtime and Qwen 2.5 hitting near-GPT-4 quality on reasoning and code at sizes that fit consumer GPUs, the only remaining cost is electricity and the half-day it takes to wire the agent into your stack. This guide walks through the complete path: install Ollama on Windows, Linux or macOS, pick the right Qwen model for your VRAM, call it from Python over its HTTP API, integrate with LangChain when the workflow grows, and build a working email-reply assistant in 80 lines of code. All commands tested in May 2026 against Ollama 0.6 and Qwen 2.5 series.

Contents

  1. Why local AI in 2026
  2. Install Ollama (Windows, Linux, macOS)
  3. Choosing the right Qwen 2.5 model for your VRAM
  4. Ollama HTTP API basics
  5. Calling the model from Python
  6. LangChain integration for agents
  7. Complete example: email-reply assistant
  8. Performance tuning and benchmarks
  9. FAQ

Why local AI in 2026

Three forces make local inference the right default for a growing share of use cases in 2026. Privacy: customer data, internal documents and code repositories no longer leave your machine for a cloud provider. For regulated industries (healthcare, finance, legal) and for anyone processing PII at scale, that single property is the difference between yes and no. Cost: at moderate to heavy usage, the break-even versus paid APIs sits around 5 million tokens per month. Past that, the electricity-amortised hardware wins decisively. Latency: a 7B model on a recent GPU returns the first token in 50-80 ms, which beats every cloud API even on a fast connection.

The remaining tradeoff is quality: a local 7B or 14B model is not GPT-5 turbo or Claude Opus 4.7. For reasoning-heavy tasks (legal analysis, complex code refactor), a flagship API is still the right call. For classification, extraction, summarisation, routine email drafting and most agent workflows, local Qwen 2.5 14B is now indistinguishable from cloud flagships in blind tests with humans rating outputs.

Install Ollama (Windows, Linux, macOS)

Ollama is a single binary that bundles the model runtime, model storage, and a REST server on port 11434. The installer takes 60 seconds on any platform.

macOS and Linux

curl -fsSL https://ollama.com/install.sh | sh
ollama --version  # confirm install

Windows (PowerShell)

Download the installer from ollama.com/download, run it once, then verify in PowerShell:

ollama --version
# Should print: ollama version is 0.6.x

Ollama installs as a background service that auto-starts on boot. Confirm the API responds:

curl http://localhost:11434/api/tags
# Returns JSON with the list of locally installed models (empty at first)

Firewall note. Ollama listens on 127.0.0.1:11434 by default, which is safe (loopback only). To expose it to other machines on your LAN, set the environment variable OLLAMA_HOST=0.0.0.0:11434 and add a firewall rule scoped to your LAN subnet only.

Choosing the right Qwen 2.5 model for your VRAM

Qwen 2.5 ships in seven sizes, from 0.5B to 72B parameters. The table below gives the minimum VRAM for the standard Q4_K_M quantisation (the default Ollama uses), the throughput you can expect on a recent consumer GPU, and what each tier is good for.

Model         | VRAM min | Throughput | Sweet spot
------------- | -------- | ---------- | ----------
qwen2.5:0.5b  | 1.5 GB   | ~150 tok/s | Edge devices, autocomplete, classification
qwen2.5:1.5b  | 2.5 GB   | ~120 tok/s | Lightweight chat, extraction, on CPU-only laptops
qwen2.5:3b    | 4 GB     | ~90 tok/s  | Summarisation, simple tool use, RAG
qwen2.5:7b    | 6 GB     | ~70 tok/s  | General-purpose agent, code completion
qwen2.5:14b   | 10 GB    | ~45 tok/s  | Complex reasoning, multi-step agents (best quality / cost on RTX 3060 12GB and up)
qwen2.5:32b   | 22 GB    | ~22 tok/s  | Near-flagship quality, needs an RTX 3090 / 4090 / 5090 or Apple M3 Max+
qwen2.5:72b   | 48 GB    | ~12 tok/s  | Top-tier quality, requires dual GPU or a single A6000 / H100
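
If you want to script the choice, the table can be turned into a small lookup. The sketch below is only an illustration: the VRAM minimums are copied from the table, while the pick_model helper and its headroom_gb parameter (a safety margin for the context cache and other processes) are assumptions made for this example, not part of Ollama.

```python
# VRAM minimums copied from the table above (Q4_K_M quantisation), in GB.
QWEN25_MIN_VRAM_GB = {
    "qwen2.5:0.5b": 1.5,
    "qwen2.5:1.5b": 2.5,
    "qwen2.5:3b": 4,
    "qwen2.5:7b": 6,
    "qwen2.5:14b": 10,
    "qwen2.5:32b": 22,
    "qwen2.5:72b": 48,
}


def pick_model(vram_gb, headroom_gb=1.0):
    """Return the largest tag whose minimum fits in vram_gb minus headroom, or None."""
    budget = vram_gb - headroom_gb
    # Pair each tag with its requirement so max() selects the biggest model that fits.
    fitting = [(need, tag) for tag, need in QWEN25_MIN_VRAM_GB.items() if need <= budget]
    if not fitting:
        return None
    return max(fitting)[1]


if __name__ == "__main__":
    for vram in (4, 8, 12, 24):
        print(f"{vram} GB -> {pick_model(vram)}")
```

The headroom default of 1 GB is a guess; raise it if you plan to use a long context window.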

Pull the model you need (one-time download, models cached in ~/.ollama/models):

ollama pull qwen2.5:7b       # general purpose default
ollama pull qwen2.5:14b      # better reasoning if VRAM allows
ollama pull qwen2.5-coder:7b # code-specialised variant

Test the model directly in the terminal before wiring it into your code:

ollama run qwen2.5:7b
>>> Write a Python function that fetches a URL and returns the JSON.
# Streams a complete answer with code.

Ollama HTTP API basics

Ollama exposes three core endpoints. The newer /api/chat takes an OpenAI-style list of role/content messages and is the right default. The older /api/generate takes a raw prompt. The /api/embeddings endpoint returns vector embeddings for RAG pipelines (recent releases also offer /api/embed, which accepts batched input).

# Chat (recommended)
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Explain async/await in Python in one paragraph."}
  ],
  "stream": false
}'

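With stream set to false, the response is a single JSON object with message.content holding the model output. With stream: true (also the default when the field is omitted) the server emits newline-delimited JSON objects, each carrying a small fragment of the reply, followed by a final object with done: true. This differs from the OpenAI streaming format, which wraps each chunk in server-sent events.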
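
To make the streaming shape concrete, here is a minimal sketch that reassembles a reply from newline-delimited chunks. The sample lines are hand-written to match the documented chunk shape rather than captured from a server, and collect_stream is a hypothetical helper name.

```python
import json


def collect_stream(lines):
    """Join the message.content fragments of an NDJSON chat stream into one string."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final object carries done=true plus timing statistics
    return "".join(parts)


# Hand-written sample in the shape the chat endpoint streams (not captured output).
SAMPLE = [
    '{"model":"qwen2.5:7b","message":{"role":"assistant","content":"Hel"},"done":false}',
    '{"model":"qwen2.5:7b","message":{"role":"assistant","content":"lo"},"done":false}',
    '{"model":"qwen2.5:7b","message":{"role":"assistant","content":""},"done":true}',
]

print(collect_stream(SAMPLE))
```

In a real client the lines would come from iterating over the HTTP response body instead of a list.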

In the native API, sampling and runtime parameters go under the options key rather than at the top level: temperature, top_p, seed, num_ctx (context window), num_predict (max output tokens), stop (sequences that end generation). Several of them mirror OpenAI Chat Completions parameters; num_ctx and num_predict are Ollama-specific names. Tool calling is supported since Ollama 0.5; use the tools array per the chat endpoint documentation.
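
The same request can be issued from Python with nothing but the standard library. The sketch below assumes the default local endpoint used throughout this guide; build_chat_payload and chat are hypothetical helper names, and the default values are only examples.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint from this guide


def build_chat_payload(model, messages, temperature=0.3, num_ctx=8192, num_predict=None):
    """Build the JSON body for /api/chat; runtime parameters go under 'options'."""
    options = {"temperature": temperature, "num_ctx": num_ctx}
    if num_predict is not None:
        options["num_predict"] = num_predict
    return {"model": model, "messages": messages, "stream": False, "options": options}


def chat(payload, url=OLLAMA_URL, timeout=120):
    """POST the payload and return the assistant text. Needs a running Ollama server."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]


payload = build_chat_payload(
    "qwen2.5:7b",
    [{"role": "user", "content": "Explain async/await in Python in one paragraph."}],
    num_predict=200,
)
print(json.dumps(payload, indent=2))
# With the server running: print(chat(payload))
```

Keeping payload construction separate from the network call makes the request easy to inspect and test without a model loaded.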

Calling the model from Python

The minimum-viable Python client is a single requests call. For agents you will outgrow it within a day; jump straight to the official client.

# pip install ollama

import ollama

client = ollama.Client(host="http://localhost:11434")

response = client.chat(
    model="qwen2.5:7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise the benefits of local AI in 3 bullets."}
    ],
    options={"temperature": 0.3, "num_ctx": 8192}
)

print(response.message.content)

The ollama Python package supports streaming (client.chat(..., stream=True) yields chunks as they are generated), embeddings (client.embeddings), and tool calling. Its dependencies are light (httpx for HTTP, plus pydantic for typed responses in recent releases) and it is the right default for scripts and small services.

LangChain integration for agents

For anything beyond a single-turn call (chains, RAG, tool routing, memory), LangChain plus its Ollama integration gives you a battle-tested toolkit. Install the packages:

pip install langchain langchain-community langchain-ollama

The minimum example wires Qwen to a prompt template and a parser:

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="qwen2.5:14b", temperature=0.2)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Translate the user's message into French."),
    ("user", "{text}")
])

chain = prompt | llm | StrOutputParser()
print(chain.invoke({"text": "Hello, how are you today?"}))

For an agent that calls tools, swap ChatOllama into the standard LangChain agent constructor. The agent loop, the tool execution and the memory are all framework-managed.

Complete example: email-reply assistant

A practical 80-line agent: reads the last 10 unread emails from your IMAP inbox, drafts a context-aware reply using Qwen 2.5 14B, saves drafts to the IMAP Drafts folder for human review before sending. Replace the credentials and run in a sandboxed test mailbox first.

import email
import imaplib
import time
from email.message import EmailMessage

import ollama

IMAP_HOST = "imap.example.com"
IMAP_USER = "you@example.com"
IMAP_PASS = "app-specific-password"

SYSTEM_PROMPT = """You are a professional assistant drafting replies on
behalf of the user. Reply in the same language as the incoming email.
Be concise (3-6 sentences). Use a friendly but professional tone.
End with a signature line: 'Best regards,\\nThe Assistant (draft)'.
Never send the reply - it will be reviewed before sending."""

def decode_part(part):
    # get_payload(decode=True) returns None for empty or container parts
    payload = part.get_payload(decode=True)
    return payload.decode(errors="ignore") if payload else ""

def get_unread_emails(limit=10):
    m = imaplib.IMAP4_SSL(IMAP_HOST)
    m.login(IMAP_USER, IMAP_PASS)
    m.select("INBOX")
    _, data = m.search(None, "UNSEEN")
    # Sequence numbers ascend with arrival order, so the tail holds the newest mail
    ids = data[0].split()[-limit:]
    messages = []
    for i in ids:
        # Fetching RFC822 marks the message as read, so a rerun skips it
        _, raw = m.fetch(i, "(RFC822)")
        msg = email.message_from_bytes(raw[0][1])
        body = ""
        if msg.is_multipart():
            # Use the first text/plain part; HTML-only mail yields an empty body
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    body = decode_part(part)
                    break
        else:
            body = decode_part(msg)
        messages.append({
            "from": msg["From"], "subject": msg["Subject"],
            "body": body, "id": i
        })
    m.close()
    m.logout()
    return messages

def draft_reply(msg):
    client = ollama.Client()
    user_content = f"From: {msg['from']}\nSubject: {msg['subject']}\n\n{msg['body']}"
    response = client.chat(
        model="qwen2.5:14b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
        options={"temperature": 0.3, "num_ctx": 8192, "num_predict": 400}
    )
    return response.message.content

def save_draft(reply_text, original):
    draft = EmailMessage()
    draft["From"] = IMAP_USER
    draft["To"] = original["from"]
    draft["Subject"] = "Re: " + (original["subject"] or "")
    draft.set_content(reply_text)
    m = imaplib.IMAP4_SSL(IMAP_HOST)
    m.login(IMAP_USER, IMAP_PASS)
    # The folder name varies by provider (for example "[Gmail]/Drafts"); adjust as needed
    m.append("Drafts", "\\Draft", imaplib.Time2Internaldate(time.time()), draft.as_bytes())
    m.logout()

if __name__ == "__main__":
    for msg in get_unread_emails(limit=10):
        print(f"Drafting reply to: {msg['subject']}")
        reply = draft_reply(msg)
        save_draft(reply, msg)
        print(f"  Saved to Drafts ({len(reply)} chars)")

The same skeleton extends to support tickets (CRM API instead of IMAP), calendar invites (CalDAV), or any other “read input, draft reply, queue for human review” workflow. The model is the cheap part; the integration is what you maintain.
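
One way to keep that skeleton reusable is to pass the three stages in as functions. The sketch below is an illustrative assumption about how you might structure it, not code from the example above; run_pipeline and the stand-in ticket data are hypothetical.

```python
def run_pipeline(read_items, draft, queue_for_review):
    """Read inputs, draft a reply for each, and queue drafts for a human; never send."""
    queued = 0
    for item in read_items():
        reply = draft(item)
        if not reply.strip():
            continue  # skip empty model output instead of queueing a blank draft
        queue_for_review(item, reply)
        queued += 1
    return queued


# Stand-in callables. In the email example the three stages correspond to
# get_unread_emails, draft_reply and save_draft (which takes reply first, then original).
tickets = [{"id": 1, "body": "Printer offline"}, {"id": 2, "body": "VPN drops"}]
review_queue = []

count = run_pipeline(
    read_items=lambda: tickets,
    draft=lambda t: f"Thanks, we are looking into: {t['body']}",
    queue_for_review=lambda t, reply: review_queue.append((t["id"], reply)),
)
print(count, review_queue)
```

Swapping IMAP for a CRM or CalDAV source then only means replacing the reader and the queue function; the model call stays the same.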

Performance tuning and benchmarks

Three knobs move the needle most on local inference performance.

  • Quantisation level. The default Q4_K_M keeps roughly 90% of full-precision quality at about a quarter of the FP16 size. Q5_K_M and Q6_K give better quality for the same parameter count at the cost of more VRAM. Q8_0 is essentially lossless but roughly doubles the VRAM footprint compared with Q4_K_M. The right starting point is Q4_K_M; upgrade only if you measure quality issues.
  • Context window (num_ctx). Defaults to 2048 in Ollama. For agents that include long history or document chunks, set num_ctx to 8192 or 16384. Memory cost scales roughly linearly; check the VRAM headroom with nvidia-smi while running.
  • GPU layers (num_gpu). Ollama auto-detects how many layers fit in VRAM, but on systems with both integrated and discrete GPUs the heuristic can pick the wrong device. Force the offload count by setting num_gpu in the request options or as a PARAMETER line in a Modelfile: 99 for full GPU offload, 0 for CPU-only.

Indicative throughput on common 2026 hardware running qwen2.5:14b Q4_K_M:

Hardware                             | Tokens/sec (output)
------------------------------------ | -------------------
RTX 4090 (24 GB)                     | ~75 tok/s
RTX 4070 Ti Super (16 GB)            | ~55 tok/s
Apple M3 Max 64 GB                   | ~38 tok/s
RTX 3060 (12 GB)                     | ~28 tok/s
Apple M2 16 GB (CPU+GPU)             | ~14 tok/s
Intel i9-13900K CPU only (DDR5-6000) | ~6 tok/s
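
To compare your own hardware against these figures, you can compute output throughput from the statistics Ollama attaches to a non-streamed response (or the final streamed chunk): eval_count is the number of generated tokens and eval_duration is the generation time in nanoseconds. The sample values below are made up for illustration, and tokens_per_second is a hypothetical helper name.

```python
def tokens_per_second(response):
    """Output throughput from Ollama's eval_count and eval_duration (nanoseconds)."""
    count = response["eval_count"]
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return count / (duration_ns / 1e9)


# Made-up numbers for illustration: 280 tokens generated in 10 seconds.
sample = {"eval_count": 280, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

Run a few representative prompts and average the result, since the first call after a model load also pays the load time separately.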


FAQ

Why Qwen 2.5 and not Llama 4 or Mistral?

In May 2026, Qwen 2.5 leads independent benchmarks (MMLU-Pro, HumanEval, MT-Bench) in the 7B-14B size range, especially for multilingual and code tasks. Llama 4 405B is stronger overall but does not fit on consumer hardware; Llama 4 8B is competitive but trails Qwen 2.5 7B on code. Mistral Large 3 is excellent but not open-weight. Reassess every six months — the landscape moves fast.

Can I run Ollama on a laptop without a discrete GPU?

Yes. CPU-only inference works for Qwen 2.5 0.5B, 1.5B and 3B with acceptable throughput on a modern Intel or AMD CPU with at least 16 GB of RAM. Apple Silicon (M1 and later) accelerates inference through the unified memory and Metal GPU; even the base M2 runs 7B at usable speed. 14B on CPU is technically possible but slow (3-6 tokens per second on a top-tier desktop CPU).

How do I expose Ollama securely to the office network?

Three steps. Set OLLAMA_HOST=0.0.0.0:11434 to listen on all interfaces. Add a firewall rule on the host that allows port 11434 only from the office subnet (192.168.1.0/24 or your VPN range). Put a reverse proxy like Caddy or nginx in front to terminate TLS and add basic auth or an API token check. Never expose port 11434 directly to the internet without authentication; the API has no built-in auth.

What is the memory cost of long context?

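The context cache (KV cache) grows linearly with context length, and its size depends on the model's layer count and number of key-value heads rather than on the weight quantisation. As a rough guide with the default 16-bit cache, expect on the order of 57 MB of VRAM per 1,000 tokens for the 7B model and a little under 200 MB per 1,000 tokens for 14B, so a 32k context on 14B adds roughly 6 GB on top of the model weights. Check with nvidia-smi during a representative inference call before deciding the cap.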
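
To sanity-check a figure for your own model, the cache size can be estimated from the architecture: two tensors (keys and values) per layer, per key-value head, per token. The sketch below assumes a 16-bit cache and uses layer and head counts taken from the published Qwen 2.5 configurations as an assumption; verify them against the config.json of the exact model you run.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Keys + values for every layer, KV head and token (the leading 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Assumed architecture values; check config.json for the model you actually use.
QWEN25_7B = {"n_layers": 28, "n_kv_heads": 4, "head_dim": 128}
QWEN25_14B = {"n_layers": 48, "n_kv_heads": 8, "head_dim": 128}

for name, cfg, ctx in (("7b", QWEN25_7B, 8192), ("14b", QWEN25_14B, 32768)):
    gib = kv_cache_bytes(n_tokens=ctx, **cfg) / 2**30
    print(f"qwen2.5:{name} at {ctx} tokens: {gib:.2f} GiB")
```

Real usage will differ somewhat because the runtime also allocates compute buffers, so treat the result as a lower bound.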

How do I use function calling with Qwen 2.5?

Qwen 2.5 supports tool calling natively. Pass a tools array in the chat request with the OpenAI schema. Ollama 0.5+ proxies it correctly and the response includes message.tool_calls when the model decides to use a tool. Multi-turn loops (call tool, return result, call next) work identically to the OpenAI API.
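
A minimal sketch of the client side of that loop follows. The get_weather tool, the TOOLS registry and run_tool_calls are hypothetical names for illustration, and the assistant message is hand-written in the shape of a tool-call response rather than captured from a model.

```python
import json


# Hypothetical tool for illustration; swap in your own functions.
def get_weather(city):
    return {"city": city, "forecast": "sunny"}


TOOLS = {"get_weather": get_weather}

# OpenAI-style schema to pass as the "tools" array in the chat request.
TOOL_SCHEMAS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the forecast for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def run_tool_calls(message, registry=TOOLS):
    """Execute each entry in message['tool_calls']; return role='tool' messages."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        name, args = fn["name"], fn.get("arguments") or {}
        if isinstance(args, str):  # tolerate arguments delivered as a JSON string
            args = json.loads(args)
        if name not in registry:
            output = {"error": f"unknown tool: {name}"}
        else:
            output = registry[name](**args)
        results.append({"role": "tool", "content": json.dumps(output)})
    return results


# Hand-written assistant message in the tool-call shape (not captured output).
assistant_msg = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{"function": {"name": "get_weather", "arguments": {"city": "Paris"}}}],
}
print(run_tool_calls(assistant_msg))
```

Append the assistant message and the returned tool messages to the conversation, then call the chat endpoint again so the model can use the results.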

Can I fine-tune Qwen 2.5 on my own data?

Yes, via LoRA or QLoRA training with unsloth or axolotl. Quantised fine-tuning of Qwen 2.5 7B takes about 4 hours on an RTX 4090 with a 5,000-example dataset. Save the adapter, merge with the base model, repackage as an Ollama model with a Modelfile. The full pipeline is a separate guide; for most use cases, prompt engineering and RAG cover the need without the training step.
