A specialized small language model that serves as an ML/DL concept engine -- designed to research, explain, and generate ideas around machine learning and deep learning. Those ideas can then be handed off to large models (Claude, GPT, Gemini) for code implementation. Think of it as the "brain" that understands the theory, paired with large models as the "hands" that write the code.
Base Model: Qwen2.5-3B-Instruct Method: QLoRA (4-bit quantized LoRA fine-tuning) Domain: ML/DL concept research, ideation, and explanation
Large language models like Claude, GPT, and Gemini are excellent at writing code -- but they work best when given precise, well-articulated ML/DL concepts as input. NeuralScholar fills that gap.
When building ML/DL systems, the bottleneck is rarely the code -- it's knowing what to build and why. Practitioners often spend hours reading papers, comparing architectures, and understanding the mathematical intuition before writing a single line of code. General-purpose LLMs can help, but they lack the deep, focused expertise of a domain specialist.
STAGE 1: THINK STAGE 2: BUILD
┌─────────────────────────┐ ┌─────────────────────────┐
│ NeuralScholar │ │ Claude / GPT / Gemini │
│ (3B, local) │ --> │ (Large model, API) │
│ │ │ │
│ "Explain how multi- │ │ "Implement multi-head │
│ head attention works, │ │ attention with the │
│ compare it with │ │ scaled dot-product │
│ cross-attention, and │ │ approach, using │
│ when to use each" │ │ PyTorch, with these │
│ │ │ design choices..." │
│ --> Detailed concept │ │ │
│ explanation with │ │ --> Production-ready │
│ mathematical │ │ implementation │
│ intuition │ │ │
└─────────────────────────┘ └─────────────────────────┘
- NeuralScholar (Stage 1): runs locally and free; private, no data leaks; fast concept iteration
- Claude / GPT / Gemini (Stage 2): API call with precise context; higher-quality output; less token waste
- Research ideas locally -- Explore ML/DL concepts on your machine without API costs or data privacy concerns. Iterate on ideas freely before committing to expensive API calls.
- Generate precise prompts for large models -- NeuralScholar's concept explanations serve as high-quality context for Claude/GPT/Gemini, resulting in better code output with fewer iterations.
- Bridge theory and implementation -- Ask NeuralScholar "What loss function should I use for imbalanced multi-label classification and why?" -- then feed its answer to a large model with "Implement this in PyTorch."
- Rapid literature review -- Trained on ~100K samples from ArXiv papers, StackExchange, Wikipedia, and curated datasets, the model provides grounded ML/DL knowledge drawn from recent (2018+) research.
- Offline-first ML assistant -- Runs on a laptop GPU (8GB VRAM) via GGUF quantization. No internet required for concept exploration.
You: "Compare batch normalization vs layer normalization.
When should I use each in a transformer?"
NeuralScholar: [Detailed explanation covering:
- Mathematical formulation of both
- Why LayerNorm is preferred in transformers (sequence length invariance)
- BatchNorm's dependency on batch statistics
- Pre-norm vs post-norm transformer variants
- Practical trade-offs]
You → Claude: "Based on this analysis: [paste NeuralScholar output]
Implement a transformer block with pre-LayerNorm in PyTorch,
with an option to swap in RMSNorm."
Claude: [Clean, production-ready PyTorch code informed by precise context]
DATA COLLECTION PROCESSING TRAINING DEPLOYMENT
ArXiv API --> --> --> llama.cpp
StackExchange --> Code Filter --> QLoRA --> Ollama
Wikipedia --> Deduplication (LSH) --> Fine-tuning --> Gradio UI
Distill.pub --> Quality Filter --> on Qwen2.5-3B --> Python API
HuggingFace --> Train/Val Split --> -->
NeuralScholar/
├── configs/
│ └── config.yaml # Pipeline configuration
│
├── collectors/ # Data collection modules
│ ├── arxiv_collector.py # ArXiv papers (cs.LG, cs.CV, cs.CL, cs.AI)
│ ├── stackexchange_collector.py # Stats & AI StackExchange Q&A
│ ├── wikipedia_collector.py # ML/DL Wikipedia articles
│ ├── distill_collector.py # Distill.pub research articles
│ └── huggingface_datasets_collector.py # Open-Platypus, SciQ, ARC, Flan
│
├── processors/ # Data processing pipeline
│ ├── code_filter.py # Remove code-heavy content
│ ├── deduplicator.py # MinHash LSH deduplication
│ ├── quality_filter.py # Language, length, coherence checks
│ └── processors.py # Instruction formatting & train/val split
│
├── utils/
│ ├── io_utils.py # JSONL streaming, config, progress tracking
│ └── text_utils.py # Text cleaning, LaTeX removal, chunking
│
├── data/
│ ├── raw/ # Collected data per source
│ ├── processed/ # Intermediate processing stages
│ └── final/ # Final train.jsonl & validation.jsonl
│
├── outputs/
│ ├── ml-slm-qwen3b/ # QLoRA checkpoints & final model
│ ├── ml-slm-merged/ # Merged LoRA + base (full HF model)
│ └── ml-slm-gguf/ # Quantized GGUF models
│
├── llama.cpp/ # Submodule for GGUF conversion
│
├── train_qlora.py # QLoRA fine-tuning script
├── inference.py # Inference (interactive, eval, compare)
├── export_gguf.py # GGUF export & quantization
├── webui.py # Gradio web interface
├── analyze_dataset.py # Dataset statistics & analysis
├── run_pipeline.py # Pipeline orchestration
├── requirements.txt # Data collection dependencies
└── requirements_training.txt # Training & inference dependencies
- Python 3.10+
- CUDA-capable GPU (8GB+ VRAM recommended)
- llama.cpp (for GGUF export only)
# Clone the repository
git clone https://github.com/<your-username>/NeuralScholar.git
cd NeuralScholar
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install data collection & processing dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm
# Install training & inference dependencies
pip install -r requirements_training.txt

# Run everything: collect, process, finalize
python run_pipeline.py --full
# Skip collection, only process & split
python run_pipeline.py --full --skip-collection

Collect from individual sources or all at once:
# All sources
python run_pipeline.py collect --all
# Individual sources
python run_pipeline.py collect --source arxiv
python run_pipeline.py collect --source stackexchange
python run_pipeline.py collect --source wikipedia
python run_pipeline.py collect --source distill
python run_pipeline.py collect --source huggingface

Sources and scale:
| Source | Content | Samples |
|---|---|---|
| ArXiv | Paper summaries & explanations (2018+) | ~95K |
| Wikipedia | 40+ ML/DL topic articles | ~4.5K |
| StackExchange | Stats & AI forum Q&A (score >= 5) | ~1K |
| HuggingFace | Open-Platypus, SciQ, ARC, Flan-CoT | ~2K |
| Distill.pub | Interactive research explanations | ~400 |
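The collectors themselves are not shown here; as an illustration of the first stage, a query against the public ArXiv API (`export.arxiv.org`) for the categories listed above can be built with only the standard library. `build_arxiv_query` is a hypothetical helper sketched for this README, not the repo's actual `arxiv_collector.py`:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def build_arxiv_query(categories, start=0, max_results=100):
    """Build an ArXiv API query URL for the given subject categories.

    The API returns an Atom feed; each entry's <summary> element is the
    paper abstract, which is the text a collector like this would keep.
    """
    search = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = build_arxiv_query(["cs.LG", "cs.CV", "cs.CL", "cs.AI"])
```

Paging with `start`/`max_results` (plus a polite delay between requests, per the API terms) is how a real collector would reach the ~95K-sample scale above.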
# Run processing pipeline: merge → code filter → dedup → quality → format → split
python run_pipeline.py process
python run_pipeline.py finalize
# Analyze the final dataset
python analyze_dataset.py

Processing stages:
| Stage | Method | Purpose |
|---|---|---|
| Code Filter | Pattern-based detection | Remove code snippets, preserve algorithm descriptions |
| Deduplication | MinHash LSH (threshold: 0.85) | Remove near-duplicate samples |
| Quality Filter | Language, length, coherence checks | Ensure minimum quality (50-2048 tokens, English only) |
| Formatting | Template standardization | Uniform instruction-output format |
| Finalization | Stratified shuffle split (95/5) | Train/validation split with seed 42 |
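The actual `deduplicator.py` uses MinHash LSH, where banded indexing avoids comparing every pair of documents. As a dependency-free sketch of the underlying idea only, plain MinHash signatures approximate Jaccard similarity between shingle sets, and samples above the 0.85 threshold are dropped (function names here are illustrative):

```python
import hashlib

def minhash_signature(text, num_perm=64):
    """Summarize a document as the smallest hash of its 5-char shingles
    under num_perm different seeded hash functions."""
    shingles = {text[i:i + 5] for i in range(max(1, len(text) - 4))}
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_perm)
    ]

def minhash_similarity(a, b):
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def deduplicate(texts, threshold=0.85):
    """Keep a text only if it is not a near-duplicate of one already kept."""
    kept, sigs = [], []
    for t in texts:
        sig = minhash_signature(t)
        if all(minhash_similarity(sig, s) < threshold for s in sigs):
            kept.append(t)
            sigs.append(sig)
    return kept
```

The all-pairs loop above is O(n²); LSH recovers near-linear time by hashing signature bands into buckets and only comparing bucket collisions.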
# Quick test run (100 samples, 1 epoch)
python train_qlora.py --test
# Full training
python train_qlora.py
# Resume from checkpoint
python train_qlora.py --resume
# Override hyperparameters
python train_qlora.py --epochs 2 --lr 1e-4 --output outputs/custom-run

# Interactive chat
python inference.py
# Single question
python inference.py --question "Explain the attention mechanism in transformers"
# Evaluate on benchmark questions
python inference.py --eval
# Compare fine-tuned vs base model
python inference.py --compare --question "Why does dropout help prevent overfitting?"

# Export with recommended quantization (q4_k_m)
python export_gguf.py
# Choose quantization format
python export_gguf.py --quant q5_k_m

Available quantization formats:
| Format | Size | Quality | Use Case |
|---|---|---|---|
| q4_k_m | ~2.5 GB | Good | Recommended for 8GB VRAM |
| q5_k_m | ~3.5 GB | Better | Higher quality, more VRAM |
| q6_k | ~4 GB | Very good | Large systems |
| q8_0 | ~5 GB | Excellent | Best quality quantized |
| f16 | ~6 GB | Best | Full precision |
python webui.py
# Opens Gradio interface at http://localhost:7860

ArXiv / StackExchange / Wikipedia / Distill / HuggingFace
|
v
data/raw/[source]/*.jsonl
|
merge_all_sources
|
v
data/processed/merged.jsonl
|
code_filter -> dedup -> quality -> format
|
v
data/processed/formatted.jsonl
|
shuffle + split (95/5)
|
+---------+---------+
| |
v v
data/final/train.jsonl data/final/validation.jsonl
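Every intermediate and final file in this flow is JSONL, one JSON object per line. A minimal stdlib loader and validator for the data format shown just below might look like this (`is_valid_sample` is an illustrative helper, not part of the pipeline):

```python
import json

def load_jsonl(path):
    """Stream samples from a JSONL file, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def is_valid_sample(sample):
    """Check the instruction-tuning fields used by the final dataset."""
    return (
        isinstance(sample.get("instruction"), str) and bool(sample["instruction"])
        and isinstance(sample.get("output"), str) and bool(sample["output"])
        and "input" in sample  # may legitimately be an empty string
    )
```

Streaming line by line (rather than loading the whole file) is what keeps a ~100K-sample pipeline memory-friendly.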
Data format (each line in JSONL):
{
"instruction": "Explain the concept of attention in transformers.",
"input": "",
"output": "Attention is a mechanism that allows a model to focus on...",
"source": "arxiv",
"category": "concept_explanation/architectures"
}

| Parameter | Value |
|---|---|
| Base model | Qwen2.5-3B-Instruct |
| Method | QLoRA (4-bit NF4, double quantization) |
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA target modules | q, k, v, o, gate, up, down projections |
| Trainable parameters | ~120M (6.58% of 3B) |
| Max sequence length | 1024 tokens |
| Epochs | 3 |
| Effective batch size | 16 (1 x 16 gradient accumulation) |
| Learning rate | 2e-4 (cosine schedule, 3% warmup) |
| Optimizer | Paged AdamW 8-bit |
| Precision | BFloat16 |
| VRAM requirement | ~7.5 GB |
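As a sanity check on the trainable-parameter row: a LoRA adapter on a `d_in x d_out` projection adds two low-rank factors totaling `r * (d_in + d_out)` weights. Assuming the usual Qwen2.5-3B shapes (hidden size 2048, GQA key/value dim 256, MLP intermediate size 11008, 36 layers -- these dims are assumptions; verify against the model's `config.json`), r = 64 over the seven target modules lands in the ~120M range reported above:

```python
def lora_params(d_in, d_out, r):
    """LoRA adds factors A (d_in x r) and B (r x d_out) to a linear layer."""
    return r * (d_in + d_out)

# Assumed Qwen2.5-3B shapes -- check config.json before relying on these
hidden, kv_dim, inter, layers, r = 2048, 256, 11008, 36, 64

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj (grouped-query attention)
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
    + lora_params(hidden, inter, r)   # gate_proj
    + lora_params(hidden, inter, r)   # up_proj
    + lora_params(inter, hidden, r)   # down_proj
)
total = per_layer * layers  # on the order of the ~120M quoted above
```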
The following system prompt is enforced at both training and inference time:
You are an expert ML/DL teaching assistant. Explain concepts clearly and accurately, use intuitive analogies, provide mathematical intuition, compare and contrast methods, and focus on conceptual understanding. Do NOT write code.
./llama-cli -m outputs/ml-slm-gguf/ml-slm-q4_k_m.gguf \
  -p "Explain the attention mechanism in transformers"

echo 'FROM ./outputs/ml-slm-gguf/ml-slm-q4_k_m.gguf' > Modelfile
ollama create neural-scholar -f Modelfile
ollama run neural-scholar

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-3B-Instruct",
quantization_config=BitsAndBytesConfig(load_in_4bit=True),
device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "outputs/ml-slm-qwen3b/final")
tokenizer = AutoTokenizer.from_pretrained("outputs/ml-slm-qwen3b/final")

# Step 1: Get concept explanation from NeuralScholar (local, free)
concept = neural_scholar.generate("Explain how LoRA works and why it's parameter-efficient")
# Step 2: Feed concept to Claude/GPT for implementation (API call with rich context)
prompt = f"""Based on this technical explanation:
{concept}
Implement a LoRA layer in PyTorch that can wrap any nn.Linear module.
Include forward pass, weight merging, and rank selection."""
code = claude.generate(prompt)  # One precise API call instead of many vague ones

This project is for educational and research purposes. The base model (Qwen2.5-3B-Instruct) is subject to its own license terms.