NeuralScholar: ML/DL Concept Engine for AI-Assisted Research

A specialized small language model that serves as an ML/DL concept engine -- designed to research, explain, and generate ideas around machine learning and deep learning, which can then be handed off to large models (Claude, GPT, Gemini) for code implementation. Think of it as the "brain" that understands the theory, paired with large models as the "hands" that write the code.

Base Model: Qwen2.5-3B-Instruct
Method: QLoRA (4-bit quantized LoRA fine-tuning)
Domain: ML/DL concept research, ideation, and explanation


Why NeuralScholar?

Large language models like Claude, GPT, and Gemini are excellent at writing code -- but they work best when given precise, well-articulated ML/DL concepts as input. NeuralScholar fills that gap.

The Problem

When building ML/DL systems, the bottleneck is rarely the code -- it's knowing what to build and why. Practitioners often spend hours reading papers, comparing architectures, and understanding the mathematical intuition before writing a single line of code. General-purpose LLMs can help, but they lack the deep, focused expertise of a domain specialist.

The Solution: A Two-Stage AI Workflow

                    STAGE 1: THINK                          STAGE 2: BUILD
            ┌─────────────────────────┐            ┌─────────────────────────┐
            │     NeuralScholar       │            │   Claude / GPT / Gemini │
            │     (3B, local)         │     -->    │   (Large model, API)    │
            │                         │            │                         │
            │  "Explain how multi-    │            │  "Implement multi-head  │
            │   head attention works, │            │   attention with the    │
            │   compare it with       │            │   scaled dot-product    │
            │   cross-attention, and  │            │   approach, using       │
            │   when to use each"     │            │   PyTorch, with these   │
            │                         │            │   design choices..."    │
            │   --> Detailed concept  │            │                         │
            │      explanation with   │            │   --> Production-ready  │
            │      mathematical       │            │      implementation    │
            │      intuition          │            │                         │
            └─────────────────────────┘            └─────────────────────────┘
                Runs locally, free                   API call with precise context
                Private, no data leaks               Higher quality output
                Fast concept iteration               Less token waste

What This Enables

  • Research ideas locally -- Explore ML/DL concepts on your machine without API costs or data privacy concerns. Iterate on ideas freely before committing to expensive API calls.
  • Generate precise prompts for large models -- NeuralScholar's concept explanations serve as high-quality context for Claude/GPT/Gemini, resulting in better code output with fewer iterations.
  • Bridge theory and implementation -- Ask NeuralScholar "What loss function should I use for imbalanced multi-label classification and why?" -- then feed its answer to a large model with "Implement this in PyTorch."
  • Rapid literature review -- Trained on ~100K samples from ArXiv papers, StackExchange, Wikipedia, and curated datasets, the model provides grounded, up-to-date ML/DL knowledge.
  • Offline-first ML assistant -- Runs on a laptop GPU (8GB VRAM) via GGUF quantization. No internet required for concept exploration.

Example Workflow

You:            "Compare batch normalization vs layer normalization.
                 When should I use each in a transformer?"

NeuralScholar:  [Detailed explanation covering:
                 - Mathematical formulation of both
                 - Why LayerNorm is preferred in transformers (sequence length invariance)
                 - BatchNorm's dependency on batch statistics
                 - Pre-norm vs post-norm transformer variants
                 - Practical trade-offs]

You → Claude:   "Based on this analysis: [paste NeuralScholar output]
                 Implement a transformer block with pre-LayerNorm in PyTorch,
                 with an option to swap in RMSNorm."

Claude:         [Clean, production-ready PyTorch code informed by precise context]


Architecture

DATA COLLECTION          PROCESSING              TRAINING            DEPLOYMENT

  ArXiv API        -->                      -->                -->  llama.cpp
  StackExchange    -->  Code Filter         -->  QLoRA         -->  Ollama
  Wikipedia        -->  Deduplication (LSH) -->  Fine-tuning   -->  Gradio UI
  Distill.pub      -->  Quality Filter      -->  on Qwen2.5-3B -->  Python API
  HuggingFace      -->  Train/Val Split     -->                -->

Project Structure

NeuralScholar/
├── configs/
│   └── config.yaml                  # Pipeline configuration
│
├── collectors/                      # Data collection modules
│   ├── arxiv_collector.py           # ArXiv papers (cs.LG, cs.CV, cs.CL, cs.AI)
│   ├── stackexchange_collector.py   # Stats & AI StackExchange Q&A
│   ├── wikipedia_collector.py       # ML/DL Wikipedia articles
│   ├── distill_collector.py         # Distill.pub research articles
│   └── huggingface_datasets_collector.py  # Open-Platypus, SciQ, ARC, Flan
│
├── processors/                      # Data processing pipeline
│   ├── code_filter.py               # Remove code-heavy content
│   ├── deduplicator.py              # MinHash LSH deduplication
│   ├── quality_filter.py            # Language, length, coherence checks
│   └── processors.py               # Instruction formatting & train/val split
│
├── utils/
│   ├── io_utils.py                  # JSONL streaming, config, progress tracking
│   └── text_utils.py               # Text cleaning, LaTeX removal, chunking
│
├── data/
│   ├── raw/                         # Collected data per source
│   ├── processed/                   # Intermediate processing stages
│   └── final/                       # Final train.jsonl & validation.jsonl
│
├── outputs/
│   ├── ml-slm-qwen3b/              # QLoRA checkpoints & final model
│   ├── ml-slm-merged/              # Merged LoRA + base (full HF model)
│   └── ml-slm-gguf/                # Quantized GGUF models
│
├── llama.cpp/                       # Submodule for GGUF conversion
│
├── train_qlora.py                   # QLoRA fine-tuning script
├── inference.py                     # Inference (interactive, eval, compare)
├── export_gguf.py                   # GGUF export & quantization
├── webui.py                         # Gradio web interface
├── analyze_dataset.py               # Dataset statistics & analysis
├── run_pipeline.py                  # Pipeline orchestration
├── requirements.txt                 # Data collection dependencies
└── requirements_training.txt        # Training & inference dependencies

Setup

Prerequisites

  • Python 3.10+
  • CUDA-capable GPU (8GB+ VRAM recommended)
  • llama.cpp (for GGUF export only)

Installation

# Clone the repository
git clone https://github.com/<your-username>/NeuralScholar.git
cd NeuralScholar

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install data collection & processing dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Install training & inference dependencies
pip install -r requirements_training.txt

Usage

Full Pipeline

# Run everything: collect, process, finalize
python run_pipeline.py --full

# Skip collection, only process & split
python run_pipeline.py --full --skip-collection

1. Data Collection

Collect from individual sources or all at once:

# All sources
python run_pipeline.py collect --all

# Individual sources
python run_pipeline.py collect --source arxiv
python run_pipeline.py collect --source stackexchange
python run_pipeline.py collect --source wikipedia
python run_pipeline.py collect --source distill
python run_pipeline.py collect --source huggingface

Sources and scale:

Source          Content                                     Samples
ArXiv           Paper summaries & explanations (2018+)      ~95K
Wikipedia       40+ ML/DL topic articles                    ~4.5K
StackExchange   Stats & AI forum Q&A (score >= 5)           ~1K
HuggingFace     Open-Platypus, SciQ, ARC, Flan-CoT          ~2K
Distill.pub     Interactive research explanations           ~400
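
As an illustration of how a collector might query the public ArXiv Atom API for the categories listed above, here is a minimal URL builder. The helper name is hypothetical and the real collector may construct its requests differently:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def build_arxiv_query(categories, start=0, max_results=100):
    """Build an ArXiv API query URL covering the given category list.

    Hypothetical helper for illustration only.
    """
    search = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = build_arxiv_query(["cs.LG", "cs.CV", "cs.CL", "cs.AI"], max_results=200)
print(url)
```

The Atom feed returned by this endpoint can then be parsed for titles, abstracts, and categories.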

2. Data Processing

# Run processing pipeline: merge → code filter → dedup → quality → format → split
python run_pipeline.py process
python run_pipeline.py finalize

# Analyze the final dataset
python analyze_dataset.py

Processing stages:

Stage            Method                               Purpose
Code Filter      Pattern-based detection              Remove code snippets, preserve algorithm descriptions
Deduplication    MinHash LSH (threshold: 0.85)        Remove near-duplicate samples
Quality Filter   Language, length, coherence checks   Ensure minimum quality (50-2048 tokens, English only)
Formatting       Template standardization             Uniform instruction-output format
Finalization     Stratified shuffle split (95/5)      Train/validation split with seed 42
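
The core idea behind the MinHash deduplication stage can be sketched in pure Python. A production pipeline would typically use a library such as datasketch for the LSH index; this simplified sketch only shows how MinHash signatures approximate Jaccard similarity between word shingles:

```python
import hashlib

def minhash_signature(text, num_perm=128, shingle_size=3):
    """Approximate a MinHash signature over word shingles of the text."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    sig = []
    for seed in range(num_perm):
        # One "hash permutation" per seed: keep the minimum hash over all shingles.
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("batch normalization stabilizes training of deep networks")
b = minhash_signature("batch normalization stabilizes training of deep neural networks")
c = minhash_signature("attention lets the model weigh tokens by relevance")

# Near-duplicates share many signature slots; unrelated text shares almost none.
print(estimated_jaccard(a, b), estimated_jaccard(a, c))
```

In the actual pipeline, an LSH index buckets signatures so that only likely matches (similarity above the 0.85 threshold) are compared, avoiding all-pairs comparison.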

3. Training

# Quick test run (100 samples, 1 epoch)
python train_qlora.py --test

# Full training
python train_qlora.py

# Resume from checkpoint
python train_qlora.py --resume

# Override hyperparameters
python train_qlora.py --epochs 2 --lr 1e-4 --output outputs/custom-run

4. Inference

# Interactive chat
python inference.py

# Single question
python inference.py --question "Explain the attention mechanism in transformers"

# Evaluate on benchmark questions
python inference.py --eval

# Compare fine-tuned vs base model
python inference.py --compare --question "Why does dropout help prevent overfitting?"

5. GGUF Export

# Export with recommended quantization (q4_k_m)
python export_gguf.py

# Choose quantization format
python export_gguf.py --quant q5_k_m

Available quantization formats:

Format   Size      Quality     Use Case
q4_k_m   ~2.5 GB   Good        Recommended for 8GB VRAM
q5_k_m   ~3.5 GB   Better      Higher quality, more VRAM
q6_k     ~4 GB     Very good   Larger systems
q8_0     ~5 GB     Excellent   Best quality among quantized formats
f16      ~6 GB     Best        Unquantized (half precision)

6. Web UI

python webui.py
# Opens Gradio interface at http://localhost:7860

Data Pipeline

ArXiv / StackExchange / Wikipedia / Distill / HuggingFace
                        |
                        v
              data/raw/[source]/*.jsonl
                        |
                  merge_all_sources
                        |
                        v
              data/processed/merged.jsonl
                        |
          code_filter -> dedup -> quality -> format
                        |
                        v
              data/processed/formatted.jsonl
                        |
                shuffle + split (95/5)
                        |
              +---------+---------+
              |                   |
              v                   v
    data/final/train.jsonl  data/final/validation.jsonl
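
The final step amounts to a seeded shuffle followed by a 95/5 cut. A simplified, non-stratified sketch (the pipeline's actual split is stratified by category; the function name here is illustrative):

```python
import random

def split_dataset(records, val_fraction=0.05, seed=42):
    """Shuffle records with a fixed seed and cut off a validation slice."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n_val = max(1, int(len(records) * val_fraction))
    return records[n_val:], records[:n_val]  # (train, validation)

# Illustrative usage with in-memory records; the real pipeline streams JSONL files.
records = [{"instruction": f"q{i}", "input": "", "output": f"a{i}"} for i in range(100)]
train, val = split_dataset(records)
print(len(train), len(val))  # 95 5
```

Fixing the seed (42, as in the table above) makes the split reproducible across runs.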

Data format (each line in JSONL):

{
  "instruction": "Explain the concept of attention in transformers.",
  "input": "",
  "output": "Attention is a mechanism that allows a model to focus on...",
  "source": "arxiv",
  "category": "concept_explanation/architectures"
}
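
Reading this format and rendering each record into a single training text might look like the following. The instruction/input/response template shown here is a generic one, not necessarily the exact template used in training:

```python
import io
import json

def iter_jsonl(fp):
    """Yield one parsed record per non-empty JSONL line."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

def render_prompt(record):
    """Render an instruction/input/output record into one prompt string."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("input"):  # the input field is often empty
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)

sample = ('{"instruction": "Explain attention.", "input": "", '
          '"output": "Attention weighs tokens...", "source": "arxiv", '
          '"category": "concept_explanation/architectures"}')
record = next(iter_jsonl(io.StringIO(sample)))
print(render_prompt(record))
```

The `source` and `category` fields are carried through for dataset analysis but would not appear in the rendered prompt.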

Model Details

Training Configuration

Parameter              Value
Base model             Qwen2.5-3B-Instruct
Method                 QLoRA (4-bit NF4, double quantization)
LoRA rank (r)          64
LoRA alpha             128
LoRA target modules    q, k, v, o, gate, up, down projections
Trainable parameters   ~120M (6.58% of 3B)
Max sequence length    1024 tokens
Epochs                 3
Effective batch size   16 (batch size 1 x 16 gradient accumulation steps)
Learning rate          2e-4 (cosine schedule, 3% warmup)
Optimizer              Paged AdamW 8-bit
Precision              BFloat16
VRAM requirement       ~7.5 GB
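
The table above maps onto transformers/peft configuration objects roughly as follows. This is a sketch under the stated hyperparameters, not the exact contents of train_qlora.py; in particular, lora_dropout and bias are assumptions not listed in the table:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 base weights with double quantization, bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on all attention and MLP projections (r=64, alpha=128).
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # assumed value, not taken from the table
    bias="none",
    task_type="CAUSAL_LM",
)
```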

System Prompt

The model is trained with the following system prompt enforced at both training and inference:

You are an expert ML/DL teaching assistant. Explain concepts clearly and accurately, use intuitive analogies, provide mathematical intuition, compare and contrast methods, and focus on conceptual understanding. Do NOT write code.
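
Enforcing the prompt at inference amounts to prepending it to every conversation before applying the tokenizer's chat template. A minimal sketch (the function name is illustrative):

```python
SYSTEM_PROMPT = (
    "You are an expert ML/DL teaching assistant. Explain concepts clearly and "
    "accurately, use intuitive analogies, provide mathematical intuition, compare "
    "and contrast methods, and focus on conceptual understanding. Do NOT write code."
)

def build_messages(question, history=None):
    """Prepend the fixed system prompt to every conversation."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history or [])
    messages.append({"role": "user", "content": question})
    return messages

msgs = build_messages("What is gradient clipping?")
print(msgs[0]["role"], msgs[-1]["role"])  # system user
```

The resulting list would then be passed to `tokenizer.apply_chat_template` before generation.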


Deployment

With llama.cpp

./llama-cli -m outputs/ml-slm-gguf/ml-slm-q4_k_m.gguf \
  -p "Explain the attention mechanism in transformers"

With Ollama

echo 'FROM ./outputs/ml-slm-gguf/ml-slm-q4_k_m.gguf' > Modelfile
ollama create neural-scholar -f Modelfile
ollama run neural-scholar

With Python (transformers + peft)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "outputs/ml-slm-qwen3b/final")
tokenizer = AutoTokenizer.from_pretrained("outputs/ml-slm-qwen3b/final")

Pairing with Large Models (Recommended Workflow)

# Step 1: Get concept explanation from NeuralScholar (local, free)
# (`neural_scholar` and `claude` below are illustrative client objects)
concept = neural_scholar.generate("Explain how LoRA works and why it's parameter-efficient")

# Step 2: Feed concept to Claude/GPT for implementation (API call with rich context)
prompt = f"""Based on this technical explanation:

{concept}

Implement a LoRA layer in PyTorch that can wrap any nn.Linear module.
Include forward pass, weight merging, and rank selection."""

code = claude.generate(prompt)  # One precise API call instead of many vague ones

License

This project is for educational and research purposes. The base model (Qwen2.5-3B-Instruct) is subject to its own license terms.