Stratified LLM Subsets: Pre-Training, Instruction-Following, and Reasoning SFT Data at 100K-1M Scale
Stratified LLM Subsets delivers diverse training data at the 100K-1M scale across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering promotes diversity across five high-quality open datasets, while square-root rebalancing prevents any single category from dominating.
This project provides diverse, representative subsets from large-scale training corpora across multiple domains using embedding-based k-means clustering rather than random sampling:
- Scales: 50k, 100k, 250k, 500k, and 1M samples
- Methodology: Deterministic k-means clustering on embeddings (Snowflake Arctic-embed-xs) with 100 iterations
- Balancing: Square-root transformation for imbalanced datasets to prevent category dominance
stratified-kmeans-diverse-pretraining-100K-1M
Combines FineWeb-Edu (educational web content) and Proof-Pile-2 (mathematical/scientific documents):
- FineWeb-Edu: six Common Crawl snapshots from 2025 (99M filtered rows)
- Proof-Pile-2: algebraic-stack, arxiv, open-web-math
stratified-kmeans-diverse-instruction-following-100K-1M
Combines Tulu-3 SFT Mixture and Orca AgentInstruct:
- Tulu-3: State-of-the-art post-training recipe (939K samples)
- Orca AgentInstruct: Agentic multi-step reasoning tasks (~1M samples)
stratified-kmeans-diverse-reasoning-100K-1M
Stratified subset of Llama-Nemotron Post-Training Dataset with square-root rebalancing:
- Original distribution: 80.52% STEM → rebalanced to 51.81% STEM
- Categories: math, code, science, chat, safety
Every subset is built with the same selection pipeline (a code sketch follows this list):
- Embedding Generation: Text is embedded using Snowflake Arctic-embed-xs
- K-Means Clustering: For M required samples, apply k-means with k = M clusters (100 iterations)
- Centroid Selection: Select the sample nearest each cluster centroid as its representative
- Square-Root Balancing (for imbalanced datasets):
  - Convert category counts to ratios
  - Apply the square-root transformation: `sqrt_ratio = sqrt(original_ratio)`
  - Renormalize: `balanced_ratio = sqrt_ratio / sum(sqrt_ratios)`
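A minimal sketch of the first three steps, assuming `sentence-transformers` and `scikit-learn`; the helper `select_diverse_subset` and its parameters are illustrative, not taken from the project's code:

```python
# Illustrative sketch of the selection pipeline, not the project's actual implementation.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def select_diverse_subset(texts, m, seed=42):
    """Pick m diverse representatives: one per k-means cluster."""
    model = SentenceTransformer("Snowflake/snowflake-arctic-embed-xs")
    embeddings = model.encode(texts, normalize_embeddings=True)
    # k = m clusters; fixed seed, single init, and a 100-iteration cap for determinism
    kmeans = KMeans(n_clusters=m, max_iter=100, n_init=1, random_state=seed).fit(embeddings)
    # A centroid is rarely an actual sample, so take the nearest real sample to each one
    nearest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
    return [texts[i] for i in np.unique(nearest)]  # unique() guards against shared nearest samples
```

Setting k equal to the target subset size means each selected sample stands in for one region of embedding space, which is what gives these subsets their diversity advantage over random sampling.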
The original Llama-Nemotron Post-Training Dataset distribution was heavily skewed:

| Category | Original | Rebalanced | Relative Change |
|---|---|---|---|
| Math | 66.96% | 52.03% | −22% |
| Code | 30.67% | 34.96% | +14% |
| Science | 2.15% | 9.26% | +330% |
| Chat | 0.12% | 2.15% | +1682% |
| Safety | 0.10% | 1.60% | +1580% |
The square-root transformation reduces math dominance while substantially increasing the representation of underrepresented categories.
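The transformation is easy to verify from the reported original ratios; the printed values differ slightly from the published percentages, which were presumably computed from unrounded counts:

```python
import math

# Reported original category ratios of the Llama-Nemotron post-training data
original = {"math": 0.6696, "code": 0.3067, "science": 0.0215,
            "chat": 0.0012, "safety": 0.0010}

# sqrt_ratio = sqrt(original_ratio), then renormalize so the ratios sum to 1
sqrt_ratios = {cat: math.sqrt(r) for cat, r in original.items()}
total = sum(sqrt_ratios.values())
balanced = {cat: s / total for cat, s in sqrt_ratios.items()}

for cat in original:
    print(f"{cat}: {original[cat]:.2%} -> {balanced[cat]:.2%}")
# e.g. math: 66.96% -> 51.63% (published: 52.03%)
```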
Load the subsets with the Hugging Face `datasets` library:

```python
from datasets import load_dataset
# Load pre-training data
pretraining = load_dataset(
"AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M",
split="100k"
)
# Load instruction-following data
instruction = load_dataset(
"AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M",
split="100k"
)
# Load reasoning data
reasoning = load_dataset(
"AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M",
split="100k"
)
```

The five source datasets and their licenses:

| Dataset | Task | License | Link |
|---|---|---|---|
| FineWeb-Edu | Pre-training | ODC-BY 1.0 | [HuggingFace](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) |
| Proof-Pile-2 | Pre-training | Mixed | [HuggingFace](https://huggingface.co/datasets/EleutherAI/proof-pile-2) |
| Tulu-3 SFT | Instruction | ODC-BY 1.0 | [HuggingFace](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) |
| Orca AgentInstruct | Instruction | CDLA-Permissive 2.0 | [HuggingFace](https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1) |
| Llama-Nemotron | Reasoning | CC BY 4.0 | [HuggingFace](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) |
Citation:

```bibtex
@misc{priyanshu2025stratifiedllm,
title={{Stratified LLM Subsets: Pre-Training, Instruction-Following, and Reasoning SFT Data at 100K-1M Scale}},
author={Priyanshu, Aman and Vijay, Supriti},
year={2025},
howpublished={\url{https://amanpriyanshu.github.io/Stratified-LLM-Subsets-100K-1M-Scale/}},
note={Available at \url{https://huggingface.co/datasets/AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M}}
}
```

Each subset inherits the license from its source datasets. Please refer to the individual dataset cards for complete licensing terms:
- FineWeb-Edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
- Proof-Pile-2: https://huggingface.co/datasets/EleutherAI/proof-pile-2
- Tulu-3 SFT Mixture: https://huggingface.co/datasets/allenai/tulu-3-sft-mixture
- Orca AgentInstruct: https://huggingface.co/datasets/microsoft/orca-agentinstruct-1M-v1
- Llama-Nemotron Post-Training Dataset: https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset
Project Website: amanpriyanshu.github.io/Stratified-LLM-Subsets-100K-1M-Scale