Prerequisite: 01_Technology_Selection.md, ../02_Scientist/02_Dataset/.
This document addresses the practical question: given a specific domain (e.g., infrastructure construction-operations), how do you plan and execute the data pipeline that feeds your LLM system? This is not about data preprocessing techniques (covered in 02_Scientist/02_Dataset) but about strategic decisions.
Before collecting a single document, answer these questions: what will the data be used for, how much do you need, and how clean must it be?
| Purpose | Data Type Needed | Volume Estimate | Quality Bar |
|---|---|---|---|
| RAG knowledge base | Raw documents, structured records | Hundreds to thousands of documents | Medium (retrieval tolerates noise) |
| Fine-tuning for domain adaptation | Domain text corpus | Millions of tokens | Medium-high (diverse, representative) |
| Fine-tuning for task-specific behavior | Input-output pairs (instruction format) | 1K-50K examples | High (must be accurate and consistent) |
| Evaluation | Gold-standard Q&A pairs with verified answers | 200-1000 examples | Very high (expert-validated) |
| Knowledge graph construction | Entity-relation triples | Thousands of triples | High (must be factually correct) |
The same raw source can serve multiple purposes, but the processing pipeline differs for each.
```
        /\
       /  \       Evaluation data (expert-curated, 200-1K)
      /----\
     /      \     Instruction data (structured pairs, 1K-50K)
    /--------\
   /          \   Domain corpus (cleaned text, millions of tokens)
  /------------\
 /              \ Raw sources (everything you can get, uncleaned)
/----------------\
```
Each layer is smaller but higher quality. You build from bottom to top.
For a domain like infrastructure construction-operations:
Publicly available:
- Academic papers (CNKI, Google Scholar, IEEE Xplore)
- Industry standards and regulations (national/international)
- Government publications, policy documents
- Open textbooks, technical manuals
- Patent databases
Semi-public (requires access/agreements):
- Industry association reports
- Conference proceedings
- Consulting firm white papers
- Trade publications
Private/proprietary (requires partnerships):
- Project documentation (plans, schedules, change orders)
- Operational logs, maintenance records
- Internal reports, lessons learned
- Expert interview transcripts
- Meeting minutes, decision records
Not all sources are equally valuable. Prioritize by:
- Relevance density: How much of the document is actually about your domain? A 200-page textbook chapter is better than 200 pages of tangentially related news articles.
- Knowledge uniqueness: Does this source contain knowledge the base model already has? General knowledge about "project management" is already in GPT-4. Domain-specific knowledge about "airport runway construction phasing" is not.
- Structural quality: Well-structured documents (with headings, tables, clear sections) are easier to process and produce better training data.
- Recency: For rapidly evolving domains, prioritize recent sources. For foundational knowledge, older authoritative texts are fine.
Also check legal and ethical constraints before ingesting anything:
- Copyright status of each source
- Data licensing terms (especially for commercial use)
- Personal data / PII in project documents (must be anonymized)
- Confidentiality agreements with data providers
- Institutional review board (IRB) requirements for interview data
```
Raw documents
  → Format conversion (PDF/DOCX → text)
  → Cleaning (noise removal)
  → Chunking (semantic boundaries)
  → Embedding (domain-aware or general)
  → Vector DB
```
Key decisions:
- Chunk strategy: Fixed-size (simple but breaks context) vs semantic (respects section boundaries) vs hierarchical (parent-child chunks for multi-granularity retrieval)
- Metadata preservation: Keep source, date, section title, document type as filterable metadata
- Embedding model: General-purpose (BGE, E5) vs domain-fine-tuned. Start general, fine-tune later if retrieval quality is insufficient.
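To make the chunking decision concrete, here is a minimal sketch of semantic (heading-aware) chunking with metadata preservation. The 1500-character budget and the markdown-heading splitting rule are assumptions to adapt to your corpus:

```python
import re

MAX_CHARS = 1500  # assumed budget per chunk; tune to your embedding model

def semantic_chunks(markdown_text: str, source: str):
    """Split on markdown headings so chunks respect section boundaries,
    then fall back to paragraph splits when a section exceeds the budget.
    Each chunk carries source and section title as filterable metadata."""
    sections = re.split(r"(?m)^(#{1,6} .*)$", markdown_text)
    chunks, heading = [], ""
    for part in sections:
        part = part.strip()
        if not part:
            continue
        if re.match(r"^#{1,6} ", part):
            heading = part.lstrip("# ").strip()
            continue
        buf = ""
        for para in part.split("\n\n"):
            if buf and len(buf) + len(para) > MAX_CHARS:
                chunks.append({"text": buf.strip(), "source": source, "section": heading})
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append({"text": buf.strip(), "source": source, "section": heading})
    return chunks
```

A hierarchical variant would additionally store a parent pointer per chunk so retrieval can expand a hit to its full section.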
```
Raw documents
  → Format conversion
  → Cleaning
  → Deduplication
  → Quality filtering (perplexity, language ID, content filters)
  → De-contamination
  → Tokenization
```
Industrial Cleaning Pipeline (the "BigScience" pattern):
- Rule-based filters:
  - Language identification (fastText).
  - Stop-word ratio (filters out gibberish).
  - Symbol-to-word ratio (filters out code- or math-heavy noise if not desired).
  - Boilerplate removal (headers, footers, navigation menus).
- Model-based filtering:
  - Use a lightweight classifier (e.g., fastText or a small BERT) trained on "high-quality" vs "low-quality" samples to score documents.
  - Perplexity filtering: use a small LLM (e.g., Qwen-0.5B) to compute perplexity; remove extreme outliers, both very high (gibberish) and very low (repetitive boilerplate).
- Fuzzy deduplication:
  - MinHash + LSH at the document level.
  - Semantic deduplication for instruction data (clustering embeddings).
- De-contamination:
  - CRITICAL: run an n-gram overlap check (typically 13-grams) between your training corpus and all public benchmarks (MMLU, CMMLU, C-Eval, GSM8K) plus your internal evaluation set. Remove any training sample that overlaps.
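A minimal sketch of the 13-gram overlap check. Whitespace tokenization is a simplification; production pipelines usually lowercase, strip punctuation, and normalize whitespace before hashing n-grams:

```python
def ngrams(text: str, n: int = 13):
    """Set of contiguous word n-grams after naive whitespace tokenization."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_samples, benchmark_texts, n: int = 13):
    """Drop any training sample that shares an n-gram with a benchmark
    or internal evaluation text."""
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return [s for s in train_samples if not (ngrams(s, n) & bench)]
```

On a real corpus you would hash the n-grams rather than keep raw strings, but the logic is the same.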
Instruction data is the "intelligence" layer. When domain data is scarce, we use Synthetic Data Engineering.
Synthetic Data Generation Pipelines:
| Method | Description | Best For |
|---|---|---|
| Self-Instruct | Using an LLM to generate new instructions from seed tasks. | Expanding task variety. |
| Evol-Instruct | Iteratively increasing instruction complexity (adding constraints, steps, or reasoning depth). | Improving model's reasoning capability. |
| Magpie | Prompting an aligned model with only its chat-template prefix so it generates its own instructions, then sampling responses. | High-volume, low-cost diversity. |
| Knowledge-to-Instruction (K2I) | Converting technical manuals/tables into "Q: [Question] A: [Answer based on Document]" format. | Injecting domain facts. |
| Back-translation | Given a document, generate a question that would be answered by that document. | Grounded RAG-style SFT. |
The "Agent-in-the-loop" Generation Pattern:
- Generator Agent: Creates candidates (e.g., using GPT-4o).
- Critic Agent: Reviews candidates for logical flaws or domain inaccuracies.
- Refiner Agent: Fixes identified issues.
- Expert Audit: Human-in-the-loop validation of a 5% random sample.
For chat-style fine-tuning, structure data as conversations:
```json
{
  "messages": [
    {"role": "system", "content": "You are a construction operations expert..."},
    {"role": "user", "content": "What are the key risks during the transition from construction to operations phase?"},
    {"role": "assistant", "content": "The construction-to-operations transition involves several critical risks: 1) ..."}
  ]
}
```

For different task types, vary the instruction format:
- Knowledge Q&A: Direct question → detailed answer
- Document analysis: "Given this report excerpt: [context]. Question: ..." → analysis
- Decision support: Scenario description → recommended actions with reasoning
- Report generation: Brief inputs → structured report output
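A small helper, assuming the messages structure shown above, that turns raw Q&A and document-analysis pairs into SFT records. The system prompt and the excerpt template are assumptions to adapt:

```python
SYSTEM = "You are a construction operations expert."  # assumed system prompt

def qa_record(question: str, answer: str) -> dict:
    """Direct question -> detailed answer, in chat format."""
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

def doc_analysis_record(excerpt: str, question: str, analysis: str) -> dict:
    """Document analysis: the excerpt is embedded in the user turn so the
    model learns to ground its answer in the provided context."""
    user = f"Given this report excerpt: {excerpt}\nQuestion: {question}"
    return qa_record(user, analysis)
```

Decision-support and report-generation variants follow the same pattern with different user-turn templates.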
Run automated quality checks on every corpus:
- Language detection (filter out wrong-language content)
- Encoding validation (detect and fix garbled text)
- Length filtering (too short = low information, too long = likely noise)
- Perplexity scoring (flag outliers for manual review)
- PII detection (names, phone numbers, addresses)
- Duplicate detection (exact and near-duplicate)
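Several of these checks fit in a few lines. The sketch below covers length bounds, garbled-encoding detection, and exact deduplication; the thresholds are assumptions to tune per corpus:

```python
import hashlib

def basic_quality_pass(docs, min_chars=200, max_chars=200_000):
    """Cheap automated checks: length bounds, replacement-character
    (mojibake) ratio, and exact deduplication via content hashing."""
    seen, kept = set(), []
    for d in docs:
        if not (min_chars <= len(d) <= max_chars):
            continue  # too short = low information, too long = likely noise
        if d.count("\ufffd") / max(len(d), 1) > 0.001:
            continue  # replacement characters signal a broken encoding
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h in seen:
            continue  # exact duplicate
        seen.add(h)
        kept.append(d)
    return kept
```

Near-duplicate detection (MinHash) and language ID (fastText) slot into the same loop as additional predicates.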
For instruction data, establish a review protocol:
- Factual accuracy: Is the answer correct? (Requires domain expert)
- Completeness: Does the answer address all aspects of the question?
- Consistency: Does this answer contradict other answers in the dataset?
- Tone and style: Does it match the desired output style?
- Harmful content: Any biased, misleading, or dangerous advice?
Target: Review at least 10-20% of LLM-generated instruction data. Review 100% of evaluation data.
Data quality is not a one-time effort. After initial model training:
- Test the model on held-out evaluation set
- Identify failure categories (wrong facts, missing knowledge, wrong format)
- Create targeted training data to address each failure category
- Retrain and re-evaluate
- Repeat
This "data flywheel" is often more effective than scaling up data volume blindly.
| Component | Minimum Viable | Recommended | Notes |
|---|---|---|---|
| RAG corpus | 100 documents | 1000+ documents | More is generally better for coverage |
| Fine-tuning corpus (continued pre-training) | 10M tokens | 100M+ tokens | Domain text for vocabulary/concept adaptation |
| Instruction data (SFT) | 1K examples | 5K-20K examples | Quality matters more than quantity |
| Evaluation set | 100 examples | 500+ examples | Must be expert-validated, never used for training |
These are rough guidelines. The right volume depends on domain complexity, task difficulty, and base model capability.
- Collecting everything, cleaning nothing: Raw data volume is meaningless. 10K clean examples beat 100K noisy ones.
- Ignoring domain balance: If 80% of your data is about "safety regulations" and 20% about "cost management," the model will be great at safety and terrible at cost.
- Training on evaluation data: Accidentally including test examples in training data. Use strict data splits and checksums.
- Assuming OCR output is clean: PDF extraction and OCR introduce significant noise. Always validate a sample.
- Neglecting metadata: Losing track of which data came from where makes debugging and updating impossible.
- One-shot data collection: Treating data as a one-time task rather than an ongoing process. Domain knowledge evolves; your data pipeline should too.
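One way to enforce the "strict data splits and checksums" point: fingerprint every example and assert that splits are disjoint. A sketch; any normalization beyond key ordering (whitespace, casing) is up to you:

```python
import hashlib
import json

def fingerprint(example: dict) -> str:
    """Stable content hash: key order is normalized so identical
    examples always produce the same digest."""
    blob = json.dumps(example, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def assert_disjoint(train, evaluation):
    """Fail loudly if any example appears in both splits."""
    overlap = {fingerprint(e) for e in train} & {fingerprint(e) for e in evaluation}
    if overlap:
        raise ValueError(f"{len(overlap)} examples appear in both splits")
```

Run this in CI every time either split changes, not just once at dataset creation.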
- Gururangan et al. (2020): Don't Stop Pretraining: Adapt Pretrained Language Models to Domains and Tasks.