from dingo.model import Model
from dingo.model.llm.text_quality.base_text_quality import BaseTextQuality


@Model.llm_register("LLMTextQualityV5")
class LLMTextQualityV5(BaseTextQuality):
    # Metadata for documentation generation
    _metric_info = {
        "category": "Pretrain Text Quality Assessment Metrics",
        "metric_name": "LLMTextQualityV5",
        "description": "Impact-driven text quality evaluation for LLM pretraining, focusing on structural completeness, readability, diversity, and safety with quantitative thresholds",
        "paper_title": "WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages",
        "paper_url": "https://arxiv.org/abs/2501.14506",
        "paper_authors": "Yu et al., 2025",
        "evaluation_results": "docs/eval/prompt/redpajama_data_evaluated_by_prompt.md"
    }
    prompt = """
# Role
You are an expert in assessing pretraining data quality for large language models.

# Goal
Evaluate whether this text is suitable for LLM pretraining. Focus on issues that would negatively impact model learning, not minor imperfections.

# Quality Dimensions

## 1. Completeness (Structural Integrity)
**Impact**: Broken structures prevent models from learning correct formatting patterns.

**Check for**:
- **Error_Formula**: Mathematical expressions with **unmatched delimiters** or **unclosed environments**

  ⚠️ **Normal patterns (DO NOT flag)**:
  - Mixing inline ($...$) and display ($$...$$) formulas
  - Using \\begin{{align}}...\\end{{align}} within $$...$$
  - Line breaks with \\\\ in alignment environments
  - HTML tags: <sub>x</sub>, <sup>2</sup> for subscripts/superscripts
  - Mixing LaTeX and HTML in web-extracted content

  ✅ **Only flag when**:
  - Delimiters unmatched: $ without closing $ (LaTeX context, not dollar signs)
  - Environments unclosed: \\begin{{align}} without \\end{{align}}
  - Syntax broken: \\frac{{a}}{{b missing closing }}
  - HTML tags unclosed: <sub>text without </sub>

  ⚠️ **Important**: Distinguish LaTeX $ from dollar signs ($100)
  - Dollar sign: "$100", "$5.99" (followed by numbers) → NOT LaTeX
  - LaTeX delimiter: "$x$", "$\\alpha$" (contains math symbols) → IS LaTeX
  - Example: "The price is $100 and equation $x=y$ costs $50" has 4 dollar symbols, but only 2 are LaTeX delimiters (and they match)

  - Example (BAD): "$x^2 + y^2 is broken here $$a = b$$$"
    (First LaTeX $ never closes, extra $ at end)
  - Example (GOOD): "The item costs $100 and satisfies $x^2 + y^2 = z^2$ where the price is $50"
    (Dollar signs for money + a properly paired LaTeX formula)
  - Impact: Only flag errors that prevent >50% of mainstream parsers (pdflatex, MathJax, KaTeX, Pandoc, Jupyter) from rendering

- **Error_Table**: Table structures that are malformed or unreadable
  - Example (BAD): Misaligned columns, missing headers, or garbled HTML tags
  - Impact: Models cannot learn proper table representation

- **Error_Code**: Code blocks with formatting corruption
  - Example (BAD): Line numbers mixed with code, broken syntax-highlighting markers
  - Impact: Teaches incorrect code structure

**Key Question**: "Can the model learn proper formatting from this structure?"

---

## 2. Effectiveness (Readability)
**Impact**: Noise prevents models from learning meaningful semantic patterns.

**Check for**:
- **Error_Garbled_Characters**: Encoding issues or anti-crawler artifacts
  - Example (BAD): "’" (broken UTF-8), "□□□" (placeholder characters), "" (BOM)
  - Threshold: >1% of characters are garbled
  - Impact: Corrupts token distributions

- **Error_Words_Stuck**: Missing spaces break tokenization
  - Example (BAD): "Thequickbrownfoxjumpsoverthelazydog"
  - Threshold: >1% of the text is missing word boundaries
  - Impact: Wrong subword tokenization patterns

- **Error_Lack_Punctuation**: Unclear sentence boundaries
  - Example (BAD): "I like apples they are red also I like oranges"
  - Impact: Models cannot learn sentence segmentation

**Key Question**: "Would a human find this readable and coherent?"

---

## 3. Similarity (Repetitiveness)
**Impact**: Repetitive content reduces training efficiency and causes memorization.

**Check for**:
- **Error_Duplicate**: Excessive repetition that dominates the text
  - Example (BAD): "I like blue. I like blue. I like blue. I like blue..." (>30% duplicate)
  - Threshold: The same sentence/phrase repeats >5 times OR the duplicate ratio is >30%
  - Impact: Over-represents certain patterns

**Key Question**: "Does this text provide a diverse training signal?"

---

## 4. Security (Safety)
**Impact**: Harmful content should not be learned by models.

**Check for**:
- **Error_Politics**: Content promoting extremism, terrorism, or ethnic hatred
- **Error_Prohibition**: Violence, pornography, gambling, drugs

**Key Question**: "Is this content safe for model training?"

---
# Evaluation Principles

1. **Focus on Training Impact**: Only flag issues that significantly harm LLM learning
2. **Severity Matters**: Minor typos are OK; systemic corruption is not
3. **Context Awareness**: Academic formulas are expected in papers; garbled text never is
4. **Threshold-Based**: Use quantitative checks (>1%, >30%, >5 times) when possible

---

# Workflow

1. **Quick Scan**: Does the text look generally readable and well-formed?
2. **Identify Category**: If problematic, which dimension is most severely affected?
3. **Verify Impact**: Would this issue meaningfully harm model training?
4. **Assign Label**:
   - Score: 1 (suitable for training) or 0 (unsuitable)
   - Type: 'Good' OR one of ['Completeness', 'Effectiveness', 'Similarity', 'Security']
   - Name: Specific error type (see above)
   - Reason: Brief explanation (1-2 sentences)

---
# Output Format
Return JSON only: {{"score": 0/1, "type": "", "name": "", "reason": ""}}

# Examples

**Example 1 (Good - Simple)**:
Input: "The Pythagorean theorem states that $a^2 + b^2 = c^2$ for right triangles."
Output: {{"score": 1, "type": "Good", "name": "None", "reason": "Clear, well-formatted text with proper LaTeX"}}

**Example 1.5 (Good - Complex Academic)**:
Input: "Friedmann equation:
$$
\\begin{{align*}}
\\left(\\frac{{\\dot{{a}}}}{{a}}\\right)^2 &= \\frac{{8\\pi G}}{{3}}\\rho \\\\
H^2 &= H_0^2[\\Omega_m(1+z)^3 + \\Omega_\\Lambda]
\\end{{align*}}
$$
where $a$ is the scale factor and $H$ is the Hubble parameter."
Output: {{"score": 1, "type": "Good", "name": "None", "reason": "Well-formed multi-line equations with proper alignment"}}

**Example 1.6 (Good - Mixed HTML/LaTeX)**:
Input: "The eigenstate $\\psi_n$ where <sub>n</sub> is the quantum number and energy E<sup>2</sup> = m<sup>2</sup>c<sup>4</sup>"
Output: {{"score": 1, "type": "Good", "name": "None", "reason": "Normal mix of LaTeX and HTML tags from web content"}}

**Example 2 (Bad - Completeness)**:
Input: "The formula $x^2 + y^2 is broken here $$a = b$$$"
Output: {{"score": 0, "type": "Completeness", "name": "Error_Formula", "reason": "Unmatched delimiters: first $ never closes, extra $ at end"}}

**Example 3 (Bad - Effectiveness)**:
Input: "Theappleisredandtasty�withsomegarbledtext□□"
Output: {{"score": 0, "type": "Effectiveness", "name": "Error_Garbled_Characters", "reason": "Contains encoding corruption (�, □) and missing spaces (>1% of text)"}}

**Example 4 (Bad - Similarity)**:
Input: "Blue is nice. Blue is nice. Blue is nice. Blue is nice. Blue is nice. Blue is nice."
Output: {{"score": 0, "type": "Similarity", "name": "Error_Duplicate", "reason": "Same sentence repeats 6 times, indicating low content diversity"}}

---
# Input content to evaluate:

"""
    # process_response method is now inherited from BaseTextQuality
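The prompt demands a single JSON object of the shape `{"score": 0/1, "type": "", "name": "", "reason": ""}`, with `type` constrained by the Workflow section. Parsing in dingo is handled by the inherited `process_response`, whose internals are not shown here; the following is only a standalone sketch of validating a reply against the rules stated in the prompt, using the standard library (the helper name `parse_quality_response` and the `ALLOWED_TYPES` set are illustrative, not dingo API).

```python
import json

# Allowed "type" values, taken from the prompt's Workflow section.
ALLOWED_TYPES = {"Good", "Completeness", "Effectiveness", "Similarity", "Security"}


def parse_quality_response(raw: str) -> dict:
    """Parse a model reply and check it against the prompt's output contract."""
    data = json.loads(raw)
    if data.get("score") not in (0, 1):
        raise ValueError("score must be 0 or 1")
    if data.get("type") not in ALLOWED_TYPES:
        raise ValueError(f"unexpected type: {data.get('type')!r}")
    # Per the prompt, score 1 means suitable ("Good") and score 0 means
    # unsuitable (one of the four error dimensions); reject mismatches.
    if (data["score"] == 1) != (data["type"] == "Good"):
        raise ValueError("score/type combination is inconsistent")
    return data
```

A malformed or contradictory reply (e.g. `score: 0` paired with `type: "Good"`) raises `ValueError`, which a caller could use to trigger a retry.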
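The Similarity thresholds in the prompt (same sentence repeating more than 5 times, or a duplicate ratio above 30%) are mechanical enough to approximate without an LLM. A rough stdlib-only sketch of that heuristic, under the assumption that sentences split on terminal punctuation and that "duplicate ratio" means the fraction of sentences repeating an earlier one (not part of dingo):

```python
import re
from collections import Counter


def duplicate_stats(text: str) -> tuple[int, float]:
    """Return (max repeat count of any sentence, duplicate ratio).

    Sentences are split on . ! ? and lowercased; the duplicate ratio is
    the fraction of sentences that repeat an earlier sentence.
    """
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0, 0.0
    counts = Counter(sentences)
    max_repeat = max(counts.values())
    duplicates = sum(c - 1 for c in counts.values())
    return max_repeat, duplicates / len(sentences)


def is_duplicate_error(text: str) -> bool:
    # Thresholds mirror the prompt: >5 repeats OR >30% duplicate ratio.
    max_repeat, ratio = duplicate_stats(text)
    return max_repeat > 5 or ratio > 0.30
```

On the prompt's own Example 4 ("Blue is nice." repeated six times) this flags `Error_Duplicate`, while ordinary varied prose passes.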