# Qwen3 4B - Full Agent Brain with Thinking Mode

## Model Overview

| Property | Value |
|---|---|
| Model | Qwen3-4B (`Qwen/Qwen3-4B`) |
| Developer | Alibaba Cloud (Qwen Team) |
| Architecture | Qwen3 (decoder-only transformer, dense) |
| Parameters | 4.0B (4.02B actual) |
| Quantization | Q4_K_M |
| Download Size | 2.5 GB |
| VRAM Required | ~3.5-4 GB |
| RAM Required (CPU) | ~5-6 GB |
| Context Window | 262,144 tokens (256K) |
| Embedding Dimension | 2,560 |
| Training Data | 36 trillion tokens (Qwen3 family) |
| Knowledge Cutoff | ~Early 2025 |
| License | Apache 2.0 (full commercial use) |
| Release Date | April 2025 |

## Capabilities

| Capability | Supported |
|---|---|
| Text completion | Yes |
| Tool/function calling | Yes |
| Conversation/chat | Yes (multi-turn) |
| Thinking/reasoning | Yes (think/no-think toggle) |
| Summarization | Yes |
| Code generation | Yes |
| Math/logic | Yes |
| Multilingual | Yes (100+ languages) |
| Parallel tool calls | Yes |
| Image/vision | No (text-only) |

## Model Parameters (Defaults)

| Parameter | Value |
|---|---|
| `temperature` | 0.6 |
| `top_k` | 20 |
| `top_p` | 0.95 |
| `repeat_penalty` | 1.0 |
| `stop` tokens | `<\|im_start\|>`, `<\|im_end\|>` |
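
The defaults above can be overridden per request through Ollama's `options` field. A minimal sketch of building such a payload, assuming a local Ollama with the `qwen3:4b` tag pulled:

```python
# Sketch: build an Ollama /api/chat payload that sets the sampling
# defaults from the table above. Assumes the qwen3:4b tag is available.

def build_chat_payload(prompt: str, temperature: float = 0.6) -> dict:
    return {
        "model": "qwen3:4b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": temperature,   # 0.6 default per the table
            "top_k": 20,
            "top_p": 0.95,
            "repeat_penalty": 1.0,
            "stop": ["<|im_start|>", "<|im_end|>"],
        },
    }
```

Raising `temperature` loosens sampling for creative tasks; the table's defaults are tuned for tool calling.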

## Thinking Mode

Qwen3 4B supports hybrid thinking:

- **Thinking mode** (`/think`): the model reasons step-by-step before responding. Better for math, code, and complex tool selection.
- **Non-thinking mode** (`/no_think`): direct response without chain-of-thought. Faster and cheaper in tokens.
- Controlled via the system prompt or a user-message prefix.
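
The prefix-based toggle can be sketched as a one-liner; the `/think` and `/no_think` tags follow Qwen3's soft-switch convention, so verify the behavior against your Ollama version:

```python
# Sketch: prepend Qwen3's soft-switch tag to toggle chain-of-thought.
def with_thinking(prompt: str, think: bool = True) -> str:
    """Prefix a user message with /think or /no_think."""
    tag = "/think" if think else "/no_think"
    return f"{tag} {prompt}"
```

For example, `with_thinking("Solve 17*23", think=False)` produces a direct answer without the reasoning preamble.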

## What This Model Does

Qwen3 4B is a complete agent brain: it can reason about which tool to call, execute the call, understand the result, and explain it to the user in natural language. Per Alibaba's reported benchmarks, it rivals Qwen2.5-72B despite being roughly 18x smaller.

    User: "What's the weather in Tokyo and find Bob's contact?"
    Qwen3: <thinks about which tools to call>
           -> get_weather(city="Tokyo")
           -> search_contacts(name="Bob")
           -> "The weather in Tokyo is 22C and sunny. Bob's email is bob@example.com"

## Performance (Observed on RTX 3050 Ti 4GB)

| Metric | Value |
|---|---|
| Tokens/sec | ~39.5 |
| Avg response time (warm) | 4,000-6,000 ms |
| Multi-tool response time | ~10,000 ms |
| Prompt eval time | ~45-50 ms (warm) |
| Model load time | ~200 ms (warm), ~8,000 ms (cold) |
| Eval tokens per query | 170-400 |
| Prompt tokens per query | ~275 |

## Benchmarks (Qwen3 4B vs Competitors)

| Benchmark | Qwen3 4B | Notes |
|---|---|---|
| Performance class | Rivals Qwen2.5-72B-Instruct | Per Alibaba's claim |
| STEM/Coding | Strong | Outperforms larger Qwen2.5 models |
| Agent/Tool use | Leading among open-source | Precise tool integration |
| Multilingual | 100+ languages | Strong instruction following |

## Best Use Cases

- **Full agent workflows**: Plan -> tool call -> summarize -> respond
- **Multi-tool orchestration**: Handle complex queries needing multiple tools
- **Conversational AI with tools**: Chatbots that can take actions
- **Code generation + execution**: Write and reason about code
- **Complex reasoning tasks**: Math, logic, multi-step problem solving
- **Multilingual tool calling**: Users can query in any of 100+ languages

## When NOT to Use This Model

- When sub-second latency is required (use FunctionGemma or Qwen3 0.6B)
- CPU-only environments with strict performance requirements
- Simple intent classification (overkill; use a smaller model)
- When the 2.5 GB model size exceeds device storage constraints

## Cost Optimization Strategy

Use Qwen3 4B as Tier 2 in a tiered architecture:

    User Input -> FunctionGemma (Tier 1: simple routing, <1s)
      |-> Simple? Execute locally
      |-> Moderate? -> Qwen3 4B (Tier 2: reasoning + tools, ~5s)
      |-> Complex? -> Cloud LLM (Tier 3: Claude/GPT, $$)
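
A hedged sketch of the routing decision. The complexity heuristic (how many tools the query needs, whether it needs deep reasoning) is an assumption for illustration, not the repo's actual logic:

```python
# Sketch: route a query to a tier from a crude complexity estimate.
# The heuristic inputs are assumed; a real router might use a classifier.

def pick_tier(num_tools_needed: int, needs_deep_reasoning: bool) -> str:
    if needs_deep_reasoning:
        return "tier3-cloud"          # Claude/GPT, highest cost
    if num_tools_needed <= 1:
        return "tier1-functiongemma"  # simple routing, <1s
    return "tier2-qwen3-4b"           # multi-tool reasoning, ~5s
```

Keeping Tier 1 greedy (default to the cheapest tier, escalate only on clear signals) is what makes the tiering save money.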

## Infrastructure Requirements

| Environment | Minimum | Recommended |
|---|---|---|
| CPU-only | 4 cores, 6 GB RAM | 8 cores, 8 GB RAM |
| GPU | CUDA GPU, 4 GB VRAM | RTX 3050 Ti or better |
| Disk | 2.5 GB for model | 4 GB total with deps |
| OS | Linux, macOS, Windows | Any with Ollama support |

Note: runs comfortably on a 4 GB GPU such as the RTX 3050 Ti. CPU-only is possible but ~5-10x slower.

## Prerequisites

- Python 3.10+
- Ollama v0.6.0+
- `pip install -r requirements.txt`
- `ollama pull qwen3:4b`

## Running

    # Standalone demo with telemetry
    python main.py

    # FastAPI server (port 8001)
    python server.py

    # Client (requires server running)
    python client.py

## Project Files

| File | Purpose |
|---|---|
| `tools.py` | Tool definitions (single source of truth) |
| `main.py` | Standalone demo with tool calling + telemetry |
| `server.py` | FastAPI server with telemetry in every response |
| `client.py` | HTTP client with telemetry display + JSON storage |
| `requirements.txt` | Pinned Python dependencies |
| `responses.json` | Stored responses with telemetry from last client run |

## API Endpoints

| Method | Endpoint | Port | Description |
|---|---|---|---|
| POST | `/ask` | 8001 | Auto-route any question to the right tool |
| POST | `/weather` | 8001 | Weather queries |
| POST | `/calculate` | 8001 | Math/calculation queries |
| POST | `/contacts` | 8001 | Contact lookup queries |
| GET | `/health` | 8001 | Health check |
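
A minimal stdlib-only call to `/ask` might look like the sketch below. The request field name (`question`) is an assumption based on the client's role; check `server.py` for the real schema:

```python
# Sketch: POST a question to the /ask endpoint on port 8001.
# The "question" field name is assumed; verify against server.py.
import json
import urllib.request

def build_ask_body(question: str) -> bytes:
    """Serialize the request body for POST /ask."""
    return json.dumps({"question": question}).encode()

def ask(question: str, base_url: str = "http://localhost:8001") -> dict:
    req = urllib.request.Request(
        f"{base_url}/ask",
        data=build_ask_body(question),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

`client.py` in this repo plays the same role, with telemetry display and JSON storage on top.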

## Telemetry Fields

Every response includes:

| Field | Description | Use For |
|---|---|---|
| `prompt_tokens` | Input tokens consumed | Cost estimation |
| `eval_tokens` | Output tokens generated | Cost estimation |
| `total_tokens` | Combined input + output | Billing |
| `total_duration_ms` | End-to-end Ollama time | SLA monitoring |
| `load_duration_ms` | Model load time | Cold-start tracking |
| `prompt_eval_ms` | Input processing time | Prompt optimization |
| `eval_duration_ms` | Generation time | Performance tuning |
| `tokens_per_sec` | Generation throughput | Capacity planning |
| `wall_time_ms` | Actual wall-clock time | User experience |
| `model` | Model identifier | Multi-model routing |
| `done_reason` | Stop reason (`stop`/`length`) | Truncation detection |
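
Ollama reports raw durations in nanoseconds, so the millisecond and throughput fields above are derived. A sketch of the conversion (the raw field names follow Ollama's chat response; `wall_time_ms` is measured client-side and omitted here):

```python
# Sketch: derive the telemetry fields above from a raw Ollama response.
# Raw durations arrive in nanoseconds; convert to ms and tokens/sec.

def derive_telemetry(raw: dict) -> dict:
    eval_ms = raw["eval_duration"] / 1e6
    return {
        "prompt_tokens": raw["prompt_eval_count"],
        "eval_tokens": raw["eval_count"],
        "total_tokens": raw["prompt_eval_count"] + raw["eval_count"],
        "total_duration_ms": raw["total_duration"] / 1e6,
        "eval_duration_ms": eval_ms,
        "tokens_per_sec": raw["eval_count"] / (eval_ms / 1000),
    }
```

Guard against `eval_duration == 0` (e.g. fully cached responses) before dividing in production code.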

## Key Considerations for Production Teams

1. **256K context window** - Largest among the three models. Suitable for long documents.
2. **Thinking mode adds tokens** - Expect 2-3x more eval tokens when thinking is enabled. Budget accordingly.
3. **Apache 2.0 license** - Full commercial use with no restrictions.
4. **GPU recommended** - CPU inference is 5-10x slower. Budget for GPU instances.
5. **Warm vs cold** - First request takes ~8 s (model load). Subsequent requests take ~4-5 s. Keep the model loaded.
6. **Memory footprint** - ~3.5 GB VRAM. Cannot share a 4 GB GPU with other large models.
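
Point 5 can be addressed with Ollama's `keep_alive` request field; a sketch, assuming the field name matches your Ollama version's API (verify before relying on it):

```python
# Sketch: ask Ollama to keep qwen3:4b resident after the request,
# avoiding the ~8 s cold-start on the next call. The keep_alive field
# name is per Ollama's API; confirm against your installed version.

def warm_payload(prompt: str, keep_alive: str = "30m") -> dict:
    return {
        "model": "qwen3:4b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,  # how long the model stays loaded
    }
```

A periodic lightweight "ping" request with a long `keep_alive` is a common way to hold the warm state between real queries.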

## Deployment Options

| Platform | Tool |
|---|---|
| Local | Ollama, LM Studio, llama.cpp |
| Fine-tune | Hugging Face Transformers, Unsloth, Keras, NVIDIA NeMo |
| Serve | vLLM, Ollama, LiteRT-LM, MLX |
| Cloud | Vertex AI, any CUDA instance |

## Links