# Qwen3 4B - Full Agent Brain with Thinking Mode

## Model Overview

| Property | Value |
|---|---|
| Model | Qwen3-4B (`Qwen/Qwen3-4B`) |
| Developer | Alibaba Cloud (Qwen Team) |
| Architecture | Qwen3 (decoder-only transformer, dense) |
| Parameters | 4.0B (4.02B actual) |
| Quantization | Q4_K_M |
| Download Size | 2.5 GB |
| VRAM Required | ~3.5-4 GB |
| RAM Required (CPU) | ~5-6 GB |
| Context Window | 262,144 tokens (256K) |
| Embedding Dimension | 2,560 |
| Training Data | 36 trillion tokens (Qwen3 family) |
| Knowledge Cutoff | ~Early 2025 |
| License | Apache 2.0 (full commercial use) |
| Release Date | April 2025 |

## Capabilities

| Capability | Supported |
|---|---|
| Text completion | Yes |
| Tool/function calling | Yes |
| Conversation/chat | Yes (multi-turn) |
| Thinking/reasoning | Yes (think/no-think toggle) |
| Summarization | Yes |
| Code generation | Yes |
| Math/logic | Yes |
| Multilingual | Yes (100+ languages) |
| Parallel tool calls | Yes |
| Image/vision | No (text-only) |

## Model Parameters (Defaults)

| Parameter | Value |
|---|---|
| `temperature` | 0.6 |
| `top_k` | 20 |
| `top_p` | 0.95 |
| `repeat_penalty` | 1.0 |
| `stop` tokens | `<\|im_start\|>`, `<\|im_end\|>` |
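
The defaults above can be overridden per request through Ollama's `options` field. A minimal sketch of building such a payload, assuming a local Ollama with the `qwen3:4b` tag pulled:

```python
# Sketch: build an Ollama /api/chat payload that sets the sampling
# defaults from the table above. Assumes the qwen3:4b tag is available.

def build_chat_payload(prompt: str, temperature: float = 0.6) -> dict:
    return {
        "model": "qwen3:4b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": temperature,   # 0.6 default per the table
            "top_k": 20,
            "top_p": 0.95,
            "repeat_penalty": 1.0,
            "stop": ["<|im_start|>", "<|im_end|>"],
        },
    }
```

Raising `temperature` loosens sampling for creative tasks; the table's defaults are tuned for tool calling.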

## Thinking Mode

Qwen3 4B supports hybrid thinking:

- **Thinking mode** (`/think`): the model reasons step-by-step before responding. Better for math, code, and complex tool selection.
- **Non-thinking mode** (`/no_think`): direct response without chain-of-thought. Faster and cheaper in tokens.
- Controlled via the system prompt or a user-message prefix.
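
The prefix-based toggle can be sketched as a one-liner; the `/think` and `/no_think` tags follow Qwen3's soft-switch convention, so verify the behavior against your Ollama version:

```python
# Sketch: prepend Qwen3's soft-switch tag to toggle chain-of-thought.
def with_thinking(prompt: str, think: bool = True) -> str:
    """Prefix a user message with /think or /no_think."""
    tag = "/think" if think else "/no_think"
    return f"{tag} {prompt}"
```

For example, `with_thinking("Solve 17*23", think=False)` produces a direct answer without the reasoning preamble.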

## What This Model Does

Qwen3 4B is a complete agent brain: it can reason about which tool to call, execute the call, understand the result, and explain it to the user in natural language. Per Alibaba's reported benchmarks, it rivals Qwen2.5-72B despite being roughly 18x smaller.

    User: "What's the weather in Tokyo and find Bob's contact?"
    Qwen3: <thinks about which tools to call>
           -> get_weather(city="Tokyo")
           -> search_contacts(name="Bob")
           -> "The weather in Tokyo is 22C and sunny. Bob's email is bob@example.com"

## Performance (Observed on RTX 3050 Ti 4GB)

| Metric | Value |
|---|---|
| Tokens/sec | ~39.5 |
| Avg response time (warm) | 4,000-6,000 ms |
| Multi-tool response time | ~10,000 ms |
| Prompt eval time | ~45-50 ms (warm) |
| Model load time | ~200 ms (warm), ~8,000 ms (cold) |
| Eval tokens per query | 170-400 |
| Prompt tokens per query | ~275 |

## Benchmarks (Qwen3 4B vs Competitors)

| Benchmark | Qwen3 4B | Notes |
|---|---|---|
| Performance class | Rivals Qwen2.5-72B-Instruct | Per Alibaba's claim |
| STEM/Coding | Strong | Outperforms larger Qwen2.5 models |
| Agent/Tool use | Leading among open-source | Precise tool integration |
| Multilingual | 100+ languages | Strong instruction following |

## Best Use Cases

- **Full agent workflows**: Plan -> tool call -> summarize -> respond
- **Multi-tool orchestration**: Handle complex queries needing multiple tools
- **Conversational AI with tools**: Chatbots that can take actions
- **Code generation + execution**: Write and reason about code
- **Complex reasoning tasks**: Math, logic, multi-step problem solving
- **Multilingual tool calling**: Users can query in any of 100+ languages

## When NOT to Use This Model

- When sub-second latency is required (use FunctionGemma or Qwen3 0.6B)
- CPU-only environments with strict performance requirements
- Simple intent classification (overkill; use a smaller model)
- When the 2.5 GB model size exceeds device storage constraints

## Cost Optimization Strategy

Use Qwen3 4B as Tier 2 in a tiered architecture:

    User Input -> FunctionGemma (Tier 1: simple routing, <1s)
      |-> Simple? Execute locally
      |-> Moderate? -> Qwen3 4B (Tier 2: reasoning + tools, ~5s)
      |-> Complex? -> Cloud LLM (Tier 3: Claude/GPT, $$)
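
A hedged sketch of the routing decision. The complexity heuristic (how many tools the query needs, whether it needs deep reasoning) is an assumption for illustration, not the repo's actual logic:

```python
# Sketch: route a query to a tier from a crude complexity estimate.
# The heuristic inputs are assumed; a real router might use a classifier.

def pick_tier(num_tools_needed: int, needs_deep_reasoning: bool) -> str:
    if needs_deep_reasoning:
        return "tier3-cloud"          # Claude/GPT, highest cost
    if num_tools_needed <= 1:
        return "tier1-functiongemma"  # simple routing, <1s
    return "tier2-qwen3-4b"           # multi-tool reasoning, ~5s
```

Keeping Tier 1 greedy (default to the cheapest tier, escalate only on clear signals) is what makes the tiering save money.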

## Infrastructure Requirements

| Environment | Minimum | Recommended |
|---|---|---|
| CPU-only | 4 cores, 6 GB RAM | 8 cores, 8 GB RAM |
| GPU | CUDA GPU, 4 GB VRAM | RTX 3050 Ti or better |
| Disk | 2.5 GB for model | 4 GB total with deps |
| OS | Linux, macOS, Windows | Any with Ollama support |

Note: runs comfortably on a 4 GB GPU such as the RTX 3050 Ti. CPU-only is possible but ~5-10x slower.

## Prerequisites

- Python 3.10+
- Ollama v0.6.0+
- `pip install -r requirements.txt`
- `ollama pull qwen3:4b`

## Running

    # Standalone demo with telemetry
    python main.py

    # FastAPI server (port 8001)
    python server.py

    # Client (requires server running)
    python client.py

## Project Files

| File | Purpose |
|---|---|
| `tools.py` | Tool definitions (single source of truth) |
| `main.py` | Standalone demo with tool calling + telemetry |
| `server.py` | FastAPI server with telemetry in every response |
| `client.py` | HTTP client with telemetry display + JSON storage |
| `requirements.txt` | Pinned Python dependencies |
| `responses.json` | Stored responses with telemetry from last client run |

## API Endpoints

| Method | Endpoint | Port | Description |
|---|---|---|---|
| POST | `/ask` | 8001 | Auto-route any question to the right tool |
| POST | `/weather` | 8001 | Weather queries |
| POST | `/calculate` | 8001 | Math/calculation queries |
| POST | `/contacts` | 8001 | Contact lookup queries |
| GET | `/health` | 8001 | Health check |
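
A minimal stdlib-only call to `/ask` might look like the sketch below. The request field name (`question`) is an assumption based on the client's role; check `server.py` for the real schema:

```python
# Sketch: POST a question to the /ask endpoint on port 8001.
# The "question" field name is assumed; verify against server.py.
import json
import urllib.request

def build_ask_body(question: str) -> bytes:
    """Serialize the request body for POST /ask."""
    return json.dumps({"question": question}).encode()

def ask(question: str, base_url: str = "http://localhost:8001") -> dict:
    req = urllib.request.Request(
        f"{base_url}/ask",
        data=build_ask_body(question),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

`client.py` in this repo plays the same role, with telemetry display and JSON storage on top.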

## Telemetry Fields

Every response includes:

| Field | Description | Use For |
|---|---|---|
| `prompt_tokens` | Input tokens consumed | Cost estimation |
| `eval_tokens` | Output tokens generated | Cost estimation |
| `total_tokens` | Combined input + output | Billing |
| `total_duration_ms` | End-to-end Ollama time | SLA monitoring |
| `load_duration_ms` | Model load time | Cold-start tracking |
| `prompt_eval_ms` | Input processing time | Prompt optimization |
| `eval_duration_ms` | Generation time | Performance tuning |
| `tokens_per_sec` | Generation throughput | Capacity planning |
| `wall_time_ms` | Actual wall-clock time | User experience |
| `model` | Model identifier | Multi-model routing |
| `done_reason` | Stop reason (`stop`/`length`) | Truncation detection |
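
Ollama reports raw durations in nanoseconds, so the millisecond and throughput fields above are derived. A sketch of the conversion (the raw field names follow Ollama's chat response; `wall_time_ms` is measured client-side and omitted here):

```python
# Sketch: derive the telemetry fields above from a raw Ollama response.
# Raw durations arrive in nanoseconds; convert to ms and tokens/sec.

def derive_telemetry(raw: dict) -> dict:
    eval_ms = raw["eval_duration"] / 1e6
    return {
        "prompt_tokens": raw["prompt_eval_count"],
        "eval_tokens": raw["eval_count"],
        "total_tokens": raw["prompt_eval_count"] + raw["eval_count"],
        "total_duration_ms": raw["total_duration"] / 1e6,
        "eval_duration_ms": eval_ms,
        "tokens_per_sec": raw["eval_count"] / (eval_ms / 1000),
    }
```

Guard against `eval_duration == 0` (e.g. fully cached responses) before dividing in production code.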

## Key Considerations for Production Teams

1. **256K context window** - Largest among the three models. Suitable for long documents.
2. **Thinking mode adds tokens** - Expect 2-3x more eval tokens when thinking is enabled. Budget accordingly.
3. **Apache 2.0 license** - Full commercial use with no restrictions.
4. **GPU recommended** - CPU inference is 5-10x slower. Budget for GPU instances.
5. **Warm vs cold** - First request takes ~8 s (model load). Subsequent requests take ~4-5 s. Keep the model loaded.
6. **Memory footprint** - ~3.5 GB VRAM. Cannot share a 4 GB GPU with other large models.
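
Point 5 can be addressed with Ollama's `keep_alive` request field; a sketch, assuming the field name matches your Ollama version's API (verify before relying on it):

```python
# Sketch: ask Ollama to keep qwen3:4b resident after the request,
# avoiding the ~8 s cold-start on the next call. The keep_alive field
# name is per Ollama's API; confirm against your installed version.

def warm_payload(prompt: str, keep_alive: str = "30m") -> dict:
    return {
        "model": "qwen3:4b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,  # how long the model stays loaded
    }
```

A periodic lightweight "ping" request with a long `keep_alive` is a common way to hold the warm state between real queries.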

## Deployment Options

| Platform | Tool |
|---|---|
| Local | Ollama, LM Studio, llama.cpp |
| Fine-tune | Hugging Face Transformers, Unsloth, Keras, NVIDIA NeMo |
| Serve | vLLM, Ollama, LiteRT-LM, MLX |
| Cloud | Vertex AI, any CUDA instance |

## Links