Prerequisite: 01_Technology_Selection.md, ../02_Scientist/02_Dataset/.
This document addresses the practical question: given a specific domain (e.g., infrastructure construction-operations), how do you plan and execute the data pipeline that feeds your LLM system? This is not about data preprocessing techniques (covered in 02_Scientist/02_Dataset) but about strategic decisions.
Before collecting a single document, answer these questions: what will the data be used for, how much do you need, and how clean must it be?
| Purpose | Data Type Needed | Volume Estimate | Quality Bar |
|---|---|---|---|
| RAG knowledge base | Raw documents, structured records | Hundreds to thousands of documents | Medium (retrieval tolerates noise) |
| Fine-tuning for domain adaptation | Domain text corpus | Millions of tokens | Medium-high (diverse, representative) |
| Fine-tuning for task-specific behavior | Input-output pairs (instruction format) | 1K-50K examples | High (must be accurate and consistent) |
| Evaluation | Gold-standard Q&A pairs with verified answers | 200-1000 examples | Very high (expert-validated) |
| Knowledge graph construction | Entity-relation triples | Thousands of triples | High (must be factually correct) |
The same raw source can serve multiple purposes, but the processing pipeline differs for each.
```
        /\
       /  \       Evaluation data (expert-curated, 200-1K)
      /----\
     /      \     Instruction data (structured pairs, 1K-50K)
    /--------\
   /          \   Domain corpus (cleaned text, millions of tokens)
  /------------\
 /              \ Raw sources (everything you can get, uncleaned)
/----------------\
```
Each layer is smaller but higher quality. You build from bottom to top.
For a domain like infrastructure construction-operations:
Publicly available:
- Academic papers (CNKI, Google Scholar, IEEE Xplore)
- Industry standards and regulations (national/international)
- Government publications, policy documents
- Open textbooks, technical manuals
- Patent databases
Semi-public (requires access/agreements):
- Industry association reports
- Conference proceedings
- Consulting firm white papers
- Trade publications
Private/proprietary (requires partnerships):
- Project documentation (plans, schedules, change orders)
- Operational logs, maintenance records
- Internal reports, lessons learned
- Expert interview transcripts
- Meeting minutes, decision records
Not all sources are equally valuable. Prioritize by:
- Relevance density: How much of the document is actually about your domain? A 200-page textbook chapter is better than 200 pages of tangentially related news articles.
- Knowledge uniqueness: Does this source contain knowledge the base model already has? General knowledge about "project management" is already in GPT-4. Domain-specific knowledge about "airport runway construction phasing" is not.
- Structural quality: Well-structured documents (with headings, tables, clear sections) are easier to process and produce better training data.
- Recency: For rapidly evolving domains, prioritize recent sources. For foundational knowledge, older authoritative texts are fine.
Also check legal and ethical constraints before ingesting anything:
- Copyright status of each source
- Data licensing terms (especially for commercial use)
- Personal data / PII in project documents (must be anonymized)
- Confidentiality agreements with data providers
- Institutional review board (IRB) requirements for interview data
```
Raw documents
  → Format conversion (PDF/DOCX → text)
  → Cleaning (noise removal)
  → Chunking (semantic boundaries)
  → Embedding (domain-aware or general)
  → Vector DB
```
Key decisions:
- Chunk strategy: Fixed-size (simple but breaks context) vs semantic (respects section boundaries) vs hierarchical (parent-child chunks for multi-granularity retrieval)
- Metadata preservation: Keep source, date, section title, document type as filterable metadata
- Embedding model: General-purpose (BGE, E5) vs domain-fine-tuned. Start general, fine-tune later if retrieval quality is insufficient.
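To make the chunking decision concrete, here is a minimal sketch of semantic (heading-aware) chunking with metadata preservation. The 1500-character budget and the markdown-heading splitting rule are assumptions to adapt to your corpus:

```python
import re

MAX_CHARS = 1500  # assumed budget per chunk; tune to your embedding model

def semantic_chunks(markdown_text: str, source: str):
    """Split on markdown headings so chunks respect section boundaries,
    then fall back to paragraph splits when a section exceeds the budget.
    Each chunk carries source and section title as filterable metadata."""
    sections = re.split(r"(?m)^(#{1,6} .*)$", markdown_text)
    chunks, heading = [], ""
    for part in sections:
        part = part.strip()
        if not part:
            continue
        if re.match(r"^#{1,6} ", part):
            heading = part.lstrip("# ").strip()
            continue
        buf = ""
        for para in part.split("\n\n"):
            if buf and len(buf) + len(para) > MAX_CHARS:
                chunks.append({"text": buf.strip(), "source": source, "section": heading})
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append({"text": buf.strip(), "source": source, "section": heading})
    return chunks
```

A hierarchical variant would additionally store a parent pointer per chunk so retrieval can expand a hit to its full section.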
```
Raw documents
  → Format conversion
  → Cleaning
  → Deduplication
  → Quality filtering (perplexity, language ID, content filters)
  → De-contamination
  → Tokenization
```
Industrial Cleaning Pipeline (the "BigScience" pattern):
- Rule-based filters:
  - Language identification (fastText).
  - Stop-word ratio (filters out gibberish).
  - Symbol-to-word ratio (filters out code- or math-heavy noise if not desired).
  - Boilerplate removal (headers, footers, navigation menus).
- Model-based filtering:
  - Use a lightweight classifier (e.g., fastText or a small BERT) trained on "high-quality" vs "low-quality" samples to score documents.
  - Perplexity filtering: use a small LLM (e.g., Qwen-0.5B) to compute perplexity; remove extreme outliers, both very high (gibberish) and very low (repetitive boilerplate).
- Fuzzy deduplication:
  - MinHash + LSH at the document level.
  - Semantic deduplication for instruction data (clustering embeddings).
- De-contamination:
  - CRITICAL: run an n-gram overlap check (typically 13-grams) between your training corpus and all public benchmarks (MMLU, CMMLU, C-Eval, GSM8K) plus your internal evaluation set. Remove any training sample that overlaps.
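A minimal sketch of the 13-gram overlap check. Whitespace tokenization is a simplification; production pipelines usually lowercase, strip punctuation, and normalize whitespace before hashing n-grams:

```python
def ngrams(text: str, n: int = 13):
    """Set of contiguous word n-grams after naive whitespace tokenization."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_samples, benchmark_texts, n: int = 13):
    """Drop any training sample that shares an n-gram with a benchmark
    or internal evaluation text."""
    bench = set()
    for t in benchmark_texts:
        bench |= ngrams(t, n)
    return [s for s in train_samples if not (ngrams(s, n) & bench)]
```

On a real corpus you would hash the n-grams rather than keep raw strings, but the logic is the same.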
Instruction data is the "intelligence" layer. When domain data is scarce, we use Synthetic Data Engineering.
Synthetic Data Generation Pipelines:
| Method | Description | Best For |
|---|---|---|
| Self-Instruct | Using an LLM to generate new instructions from seed tasks. | Expanding task variety. |
| Evol-Instruct | Iteratively increasing instruction complexity (adding constraints, steps, or reasoning depth). | Improving model's reasoning capability. |
| Magpie | Prompting an aligned model with only its chat-template prefix so it generates its own instructions, then sampling responses. | High-volume, low-cost diversity. |
| Knowledge-to-Instruction (K2I) | Converting technical manuals/tables into "Q: [Question] A: [Answer based on Document]" format. | Injecting domain facts. |
| Back-translation | Given a document, generate a question that would be answered by that document. | Grounded RAG-style SFT. |
The "Agent-in-the-loop" Generation Pattern:
- Generator Agent: Creates candidates (e.g., using GPT-4o).
- Critic Agent: Reviews candidates for logical flaws or domain inaccuracies.
- Refiner Agent: Fixes identified issues.
- Expert Audit: Human-in-the-loop validation of a 5% random sample.
For chat-style fine-tuning, structure data as conversations:
```json
{
  "messages": [
    {"role": "system", "content": "You are a construction operations expert..."},
    {"role": "user", "content": "What are the key risks during the transition from construction to operations phase?"},
    {"role": "assistant", "content": "The construction-to-operations transition involves several critical risks: 1) ..."}
  ]
}
```

For different task types, vary the instruction format:
- Knowledge Q&A: Direct question → detailed answer
- Document analysis: "Given this report excerpt: [context]. Question: ..." → analysis
- Decision support: Scenario description → recommended actions with reasoning
- Report generation: Brief inputs → structured report output
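A small helper, assuming the messages structure shown above, that turns raw Q&A and document-analysis pairs into SFT records. The system prompt and the excerpt template are assumptions to adapt:

```python
SYSTEM = "You are a construction operations expert."  # assumed system prompt

def qa_record(question: str, answer: str) -> dict:
    """Direct question -> detailed answer, in chat format."""
    return {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

def doc_analysis_record(excerpt: str, question: str, analysis: str) -> dict:
    """Document analysis: the excerpt is embedded in the user turn so the
    model learns to ground its answer in the provided context."""
    user = f"Given this report excerpt: {excerpt}\nQuestion: {question}"
    return qa_record(user, analysis)
```

Decision-support and report-generation variants follow the same pattern with different user-turn templates.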
Run automated quality checks on every corpus:
- Language detection (filter out wrong-language content)
- Encoding validation (detect and fix garbled text)
- Length filtering (too short = low information, too long = likely noise)
- Perplexity scoring (flag outliers for manual review)
- PII detection (names, phone numbers, addresses)
- Duplicate detection (exact and near-duplicate)
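Several of these checks fit in a few lines. The sketch below covers length bounds, garbled-encoding detection, and exact deduplication; the thresholds are assumptions to tune per corpus:

```python
import hashlib

def basic_quality_pass(docs, min_chars=200, max_chars=200_000):
    """Cheap automated checks: length bounds, replacement-character
    (mojibake) ratio, and exact deduplication via content hashing."""
    seen, kept = set(), []
    for d in docs:
        if not (min_chars <= len(d) <= max_chars):
            continue  # too short = low information, too long = likely noise
        if d.count("\ufffd") / max(len(d), 1) > 0.001:
            continue  # replacement characters signal a broken encoding
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h in seen:
            continue  # exact duplicate
        seen.add(h)
        kept.append(d)
    return kept
```

Near-duplicate detection (MinHash) and language ID (fastText) slot into the same loop as additional predicates.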
For instruction data, establish a review protocol:
- Factual accuracy: Is the answer correct? (Requires domain expert)
- Completeness: Does the answer address all aspects of the question?
- Consistency: Does this answer contradict other answers in the dataset?
- Tone and style: Does it match the desired output style?
- Harmful content: Any biased, misleading, or dangerous advice?
Target: Review at least 10-20% of LLM-generated instruction data. Review 100% of evaluation data.
Data quality is not a one-time effort. After initial model training:
- Test the model on held-out evaluation set
- Identify failure categories (wrong facts, missing knowledge, wrong format)
- Create targeted training data to address each failure category
- Retrain and re-evaluate
- Repeat
This "data flywheel" is often more effective than scaling up data volume blindly.
| Component | Minimum Viable | Recommended | Notes |
|---|---|---|---|
| RAG corpus | 100 documents | 1000+ documents | More is generally better for coverage |
| Fine-tuning corpus (continued pre-training) | 10M tokens | 100M+ tokens | Domain text for vocabulary/concept adaptation |
| Instruction data (SFT) | 1K examples | 5K-20K examples | Quality matters more than quantity |
| Evaluation set | 100 examples | 500+ examples | Must be expert-validated, never used for training |
These are rough guidelines. The right volume depends on domain complexity, task difficulty, and base model capability.
- Collecting everything, cleaning nothing: Raw data volume is meaningless. 10K clean examples beat 100K noisy ones.
- Ignoring domain balance: If 80% of your data is about "safety regulations" and 20% about "cost management," the model will be great at safety and terrible at cost.
- Training on evaluation data: Accidentally including test examples in training data. Use strict data splits and checksums.
- Assuming OCR output is clean: PDF extraction and OCR introduce significant noise. Always validate a sample.
- Neglecting metadata: Losing track of which data came from where makes debugging and updating impossible.
- One-shot data collection: Treating data as a one-time task rather than an ongoing process. Domain knowledge evolves; your data pipeline should too.
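One way to enforce the "strict data splits and checksums" point: fingerprint every example and assert that splits are disjoint. A sketch; any normalization beyond key ordering (whitespace, casing) is up to you:

```python
import hashlib
import json

def fingerprint(example: dict) -> str:
    """Stable content hash: key order is normalized so identical
    examples always produce the same digest."""
    blob = json.dumps(example, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def assert_disjoint(train, evaluation):
    """Fail loudly if any example appears in both splits."""
    overlap = {fingerprint(e) for e in train} & {fingerprint(e) for e in evaluation}
    if overlap:
        raise ValueError(f"{len(overlap)} examples appear in both splits")
```

Run this in CI every time either split changes, not just once at dataset creation.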
- Gururangan et al. (2020): Don't Stop Pretraining: Adapt Pretrained Language Models to Domains and Tasks.