- Thinking mode (/think): the model reasons step-by-step before responding. Better for math, code, and complex tool selection.
- Non-thinking mode (/no_think): direct response without chain-of-thought. Faster, fewer tokens.
- Controlled via the system prompt or a user message prefix.
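As a minimal sketch of the soft switch described above (the helper name `build_user_turn` is hypothetical; the `/think` and `/no_think` tokens are appended to the user turn as the source describes):

```python
def build_user_turn(message: str, thinking: bool = True) -> str:
    """Append Qwen3's soft-switch token to a user message.

    /think enables step-by-step reasoning; /no_think disables it.
    """
    switch = "/think" if thinking else "/no_think"
    return f"{message} {switch}"
```

The returned string is then sent as the user message; a system-prompt instruction can set the default mode the same way.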
What This Model Does
Qwen3 4B is a complete agent brain. It can reason about which tool to call, execute the call, understand the result, and explain it to the user in natural language. It rivals Qwen2.5-72B on many benchmarks despite being 18x smaller.
```
User: "What's the weather in Tokyo and find Bob's contact?"
Qwen3: <thinks about which tools to call>
  -> get_weather(city="Tokyo")
  -> search_contacts(name="Bob")
  -> "The weather in Tokyo is 22°C and sunny. Bob's email is bob@example.com"
```
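The dispatch side of the exchange above can be sketched as follows. The tool bodies are stubs and the `run_tool_calls` helper is an assumption; in practice the call list comes from the model's tool-call output:

```python
# Stub tool implementations standing in for real backends.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 22, "conditions": "sunny"}

def search_contacts(name: str) -> dict:
    return {"name": name, "email": "bob@example.com"}

# Registry mapping tool names (as the model emits them) to callables.
TOOLS = {"get_weather": get_weather, "search_contacts": search_contacts}

def run_tool_calls(calls: list[dict]) -> list[dict]:
    """Execute each {"name": ..., "arguments": {...}} call the model requested."""
    return [TOOLS[c["name"]](**c["arguments"]) for c in calls]
```

The results are then fed back to the model, which summarizes them in natural language as in the transcript above.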
Performance (Observed on RTX 3050 Ti 4GB)

| Metric | Value |
| --- | --- |
| Tokens/sec | ~39.5 |
| Avg response time (warm) | 4,000-6,000 ms |
| Multi-tool response time | ~10,000 ms |
| Prompt eval time | ~45-50 ms (warm) |
| Model load time | ~200 ms (warm), ~8,000 ms (cold) |
| Eval tokens per query | 170-400 |
| Prompt tokens per query | ~275 |
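The observed response times are consistent with the throughput figures: warm latency is roughly load time plus prompt eval plus generation at ~39.5 tokens/sec. A back-of-the-envelope estimator (the function and its defaults are an illustration built from the table above, not a measurement tool):

```python
def estimate_latency_ms(eval_tokens: int,
                        tokens_per_sec: float = 39.5,
                        prompt_eval_ms: float = 50,
                        load_ms: float = 200) -> float:
    """Rough warm-path latency: model load + prompt eval + generation."""
    generation_ms = eval_tokens / tokens_per_sec * 1000
    return load_ms + prompt_eval_ms + generation_ms
```

At 170 eval tokens this gives ~4.6 s and at 400 tokens ~10.4 s, matching the observed 4,000-6,000 ms single-tool and ~10,000 ms multi-tool ranges.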
Benchmarks (Qwen3 4B vs Competitors)

| Benchmark | Qwen3 4B | Notes |
| --- | --- | --- |
| Performance class | Rivals Qwen2.5-72B-Instruct | Per Alibaba's claim |
| STEM/Coding | Strong | Outperforms larger Qwen2.5 models |
| Agent/Tool use | Leading among open-source models | Precise tool integration |
| Multilingual | 100+ languages | Strong instruction following |
Best Use Cases
- Full agent workflows: plan -> tool call -> summarize -> respond