Merge pull request #53 from e06084/main

e06084 · web-flow · commit 389fba9d18ae · 2025-10-30T17:11:31.000+08:00
docs: update readme
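
A quick sanity check on the leaderboard this commit adds to the README: the `overall (macro avg)` column appears to be the unweighted mean of the five per-metric columns. That reading is an assumption inferred from the numbers, not something the README states, and `LEADERBOARD`, `macro_average`, and `mismatched_rows` are illustrative names, not part of the webmainbench API. A minimal sketch of the check:

```python
# Rows copied from the leaderboard table in this diff.
# Each entry: reported overall score, then the five metric columns in order
# code_edit, formula_edit, table_TEDS, table_edit, text_edit.
LEADERBOARD = {
    "llm-webkit": (0.8256, [0.9093, 0.9399, 0.7388, 0.678, 0.8621]),
    "magic-html": (0.5141, [0.4117, 0.7204, 0.3984, 0.2611, 0.7791]),
    "trafilatura_md": (0.3858, [0.1305, 0.6242, 0.3203, 0.1653, 0.6887]),
    "trafilatura_txt": (0.2657, [0.0, 0.6162, 0.0, 0.0, 0.7126]),
    "resiliparse": (0.2954, [0.0641, 0.6747, 0.0, 0.0, 0.7381]),
}

# The table is rounded to four decimals, so allow a small rounding gap.
TOLERANCE = 1.5e-4


def macro_average(scores):
    """Unweighted mean of the per-metric scores (assumed definition)."""
    return sum(scores) / len(scores)


def mismatched_rows(leaderboard, tolerance=TOLERANCE):
    """Return extractor names whose reported overall differs from the mean."""
    return [
        name
        for name, (overall, scores) in leaderboard.items()
        if abs(macro_average(scores) - overall) > tolerance
    ]


if __name__ == "__main__":
    for name, (overall, scores) in LEADERBOARD.items():
        print(f"{name}: reported={overall:.4f} recomputed={macro_average(scores):.4f}")
    print("mismatches:", mismatched_rows(LEADERBOARD))
```

Under that assumption, every row agrees with its reported overall score to within the rounding of the published four-decimal values.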
diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@ WebMainBench 是一个专门用于端到端评测网页正文抽取质量的基
 ## Features
 
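 ### 🎯 **Core Features**
-- **Multi-extractor support**: supports LLM-WebKit, Jina AI, and other extraction tools
+- **Multi-extractor support**: supports trafilatura, resiliparse, and other extraction tools
 - **Comprehensive evaluation metrics**: multi-dimensional metrics including text edit distance, table structure similarity (TEDS), and formula extraction quality
 - **Human annotation support**: the evaluation dataset is 100% human-annotated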
 
@@ -56,7 +56,7 @@ from webmainbench import DataLoader, Evaluator, ExtractorFactory
 dataset = DataLoader.load_jsonl("your_dataset.jsonl")
 
 # 2. Create the extractor
-extractor = ExtractorFactory.create("llm-webkit")
+extractor = ExtractorFactory.create("trafilatura")
 
 # 3. Run the evaluation
 evaluator = Evaluator()
@@ -81,40 +81,37 @@ print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
   "meta": {
     "language": "en",  # language of the web page
     "style": "artical",  # writing style of the web page
-    "DOM_WIDTH": 176,
-    "DOM_DEPTH": 27,
-    "text_linktext_ratio": 0.12252270850536746,
-    "table_text_ratio": 0,
-    "table_dom_depth": -1,
-    "text_distribution_dispersion": 0.2663,
     "table": [],  # [], ["layout"], ["data"], ["layout", "data"]
     "equation": [],  # [], ["inline"], ["interline"], ["inline", "interline"]
     "code": [],  # [], ["inline"], ["interline"], ["inline", "interline"]
-    "table_complexity_score": 0,
-    "dom_complexity_score": 0.8442,
-    "text_dispersion_score": 0.2663,
-    "content_diversity_score": 0,
-    "link_complexity_score": 0.1225,
-    "overall_complexity_score": 0.3083,
     "level": "mid"  # simple, mid, hard
   }
 }
 ```
 
 ## Supported Extractors
 
-- **LLM-WebKit**: intelligent extraction based on large language models
-- **Jina AI**: Reader API service
+- **trafilatura**: trafilatura extractor
+- **resiliparse**: resiliparse extractor
 - **Custom extractors**: implemented by subclassing `BaseExtractor`
 
+## Leaderboard
+
+| extractor | extractor_version | dataset | total_samples | overall (macro avg) | code_edit | formula_edit | table_TEDS | table_edit | text_edit |
+|-----------|-------------------|---------|---------------|---------------------|-----------|--------------|------------|-----------|-----------|
+| llm-webkit | 4.1.1 | WebMainBench1.0 | 545 | 0.8256 | 0.9093 | 0.9399 | 0.7388 | 0.678 | 0.8621 |
+| magic-html | 0.1.5 | WebMainBench1.0 | 545 | 0.5141 | 0.4117 | 0.7204 | 0.3984 | 0.2611 | 0.7791 |
+| trafilatura_md | 2.0.0 | WebMainBench1.0 | 545 | 0.3858 | 0.1305 | 0.6242 | 0.3203 | 0.1653 | 0.6887 |
+| trafilatura_txt | 2.0.0 | WebMainBench1.0 | 545 | 0.2657 | 0 | 0.6162 | 0 | 0 | 0.7126 |
+| resiliparse | 0.14.5 | WebMainBench1.0 | 545 | 0.2954 | 0.0641 | 0.6747 | 0 | 0 | 0.7381 |
 
 ## Advanced Features
 
 ### Multi-Extractor Comparison
 
 ```python
 # Compare multiple extractors
-extractors = ["llm-webkit", "jina-ai"]
+extractors = ["trafilatura", "resiliparse"]
 results = evaluator.compare_extractors(dataset, extractors)
 
 for name, result in results.items():
@@ -131,7 +128,6 @@ python examples/multi_extractor_compare.py
 
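 1. **Load the test dataset**: use sample data covering multiple content types such as code, formulas, tables, and text
 2. **Create multiple extractors**:
-   - `llm-webkit`: an intelligent extractor that supports preprocessed HTML
    - `magic-html`: extractor based on the magic-html library
    - `trafilatura`: extractor based on the trafilatura library
    - `resiliparse`: extractor based on the resiliparse library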
@@ -152,7 +148,6 @@ python examples/multi_extractor_compare.py
 Example contents of `leaderboard.csv`:
 ```csv
 extractor,dataset,total_samples,success_rate,overall,code_edit,formula_edit,table_TEDS,table_edit,text_edit
-llm-webkit,sample_dataset,4,1.0,0.2196,0.5,0.0,0.0,0.0,0.5982
 magic-html,sample_dataset,4,1.0,0.1526,0.1007,0.0,0.0,0.0,0.6624
 resiliparse,sample_dataset,4,1.0,0.1379,0.0,0.0,0.0,0.0,0.6897
 trafilatura,sample_dataset,4,1.0,0.1151,0.1007,0.0,0.0,0.0,0.4746
@@ -204,194 +199,6 @@ class MyExtractor(BaseExtractor):
 ExtractorFactory.register("my-extractor", MyExtractor)
 ```
 
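-### Dataset Statistics Tool
-
-WebMainBench provides a dataset statistics tool, `scripts/statics.py`, that analyzes dataset characteristics and automatically generates complexity scores and difficulty levels.
-
-#### Features
-
-- **DOM structure analysis**: computes the depth and width of the page's DOM tree
-- **Text-to-link ratio analysis**: measures the ratio of text to link text
-- **Table complexity analysis**: assesses how complex the table content is
-- **Content type detection**: automatically identifies special content such as formulas, code, and tables
-- **Complexity scoring**: computes an overall complexity score from multi-dimensional metrics
-- **Dynamic difficulty classification**: automatically classifies samples as simple/mid/hard based on the data distribution
-
-#### Usage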
-
-```bash
-# Basic usage
-python scripts/statics.py data/input.jsonl --output data/output_with_stats.jsonl
-
-# Use the default dataset
-python scripts/statics.py
-```
-
-#### Parameters
-
-```bash
-# Show all available parameters
-python scripts/statics.py --help
-
-```
-
-#### Output
-
-The tool adds the following statistics to the `meta` field of each record:
-
-```json
-{
-  "meta": {
-    "DOM_DEPTH": 25,                    // DOM tree depth
-    "DOM_WIDTH": 1200,                  // DOM tree width
-    "text_linktext_ratio": 0.85,        // text-to-link-text ratio
-    "table_complexity_score": 0.3,      // table complexity score
-    "dom_complexity_score": 0.6,        // DOM complexity score
-    "text_dispersion_score": 0.4,       // text dispersion score
-    "content_diversity_score": 0.7,     // content diversity score
-    "link_complexity_score": 0.5,       // link complexity score
-    "overall_complexity_score": 0.52,   // overall complexity score
-    "level": "mid"                      // difficulty level (simple/mid/hard)
-  }
-}
-```
-
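-#### Complexity Scoring Algorithm
-
-The overall complexity score is a weighted combination of the following dimensions:
-
-- **DOM structure complexity (25%)**: based on DOM depth and width, with dynamic normalization
-- **Text distribution complexity (25%)**: based on how dispersed the text is across the DOM
-- **Content diversity (25%)**: based on the kinds of special content present, such as formulas, code, and tables
-- **Link complexity (25%)**: based on the ratio of text to link text
-
-#### Example Run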
-
-```bash
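-# Analyze the dataset and generate a statistics report
-python scripts/statics.py data/sample_dataset.jsonl --output data/analyzed_dataset.jsonl
-
-# Sample output:
-🔄 Stage 1: computing basic statistics and complexity scores...
-  📊 Processed 100 records...
-  📊 Processed 200 records...
-
-🔄 Stage 2: computing dynamic thresholds and difficulty classification...
-📊 Complexity distribution thresholds:
-   Total samples: 1,827
-   30th percentile (simple/mid boundary): 0.3245
-   70th percentile (mid/hard boundary): 0.6789
-   Complexity score range: 0.0944 - 1.0000
-
-📊 Difficulty classification results:
-   Simple: 548 (30.0%)
-   Mid:    731 (40.0%)
-   Hard:   548 (30.0%)
-
-📝 Writing data to: data/analyzed_dataset.jsonl
-✅ Successfully wrote 1,827 records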
-```
-
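-### Language Classification Tool
-
-WebMainBench provides a language classification tool, `scripts/language_classify.py`, that automatically adds ISO 639-1 language tags to the text content in a dataset.
-
-#### Key Features
-
-- **Multiple detection methods**: supports fast rule-based detection and high-accuracy LLM-based detection
-- **ISO 639-1 standard**: returns standard two-letter language codes (e.g. en, zh, es)
-- **Broad language support**: detects 80+ major languages
-- **Batch processing**: handles large datasets efficiently
-- **Smart fallback**: detects across multiple fields and handles missing data automatically
-
-#### Usage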
-
-```bash
-# Fast rule-based detection (recommended for large datasets)
-python scripts/language_classify.py data/input.jsonl --output data/output.jsonl
-
-# High-accuracy detection with an LLM
-python scripts/language_classify.py data/input.jsonl --output data/output.jsonl \
-    --use-llm --api-key YOUR_OPENAI_API_KEY
-
-# Custom batch size
-python scripts/language_classify.py data/input.jsonl --output data/output.jsonl \
-    --batch-size 50
-```
-
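-#### Prompt Design Guidelines
-
-If you use an LLM for language detection, the tool ships with an optimized prompt template:
-
-**Core design principles:**
-1. **Explicit output format**: return only the ISO 639-1 two-letter code
-2. **Edge-case handling**: empty text, multilingual text, symbols, and so on
-3. **Language mapping rules**: Chinese always maps to "zh"; unsupported languages map to the closest supported one
-4. **Text truncation**: only the first 2,000 characters are analyzed, for efficiency
-
-**Example prompt structure:**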
-```
-Please identify the primary language of the following text and return ONLY the ISO 639-1 two-letter language code.
-
-SUPPORTED LANGUAGES: en (English), zh (Chinese), es (Spanish), ...
-
-RULES:
-1. Return ONLY the two-letter code
-2. For mixed languages, return the DOMINANT language
-3. Empty text defaults to "en"
-4. Chinese variants all return "zh"
-
-TEXT TO ANALYZE: [your text here]
-
-LANGUAGE CODE:
-```
-
-#### Output
-
-The tool adds a language tag to the `meta.language` field of each record:
-
-```json
-{
-  "convert_main_content": "Hello, this is sample content.",
-  "meta": {
-    "language": "en"
-  }
-}
-```
-
-#### Example Run
-
-```bash
-# Example run
-python scripts/language_classify.py data/sample.jsonl --output data/sample_with_lang.jsonl
-
-# Output:
-🔄 Starting language classification...
-📄 Input file: data/sample.jsonl
-📄 Output file: data/sample_with_lang.jsonl
-🧠 Detection method: rule-based
-  📊 Processed 100 records...
-  📊 Processed 200 records...
-
-✅ Done!
-📊 Total processed: 1,000 records
-📊 Language distribution:
-   en (English): 650 (65.0%)
-   zh (Chinese): 200 (20.0%)
-   es (Spanish): 80 (8.0%)
-   fr (French): 40 (4.0%)
-   de (German): 30 (3.0%)
-```
-
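-#### Supported Languages
-
-The tool supports 80+ major languages, including:
-- **European languages**: English (en), Spanish (es), French (fr), German (de), Italian (it), and others
-- **Asian languages**: Chinese (zh), Japanese (ja), Korean (ko), Thai (th), Vietnamese (vi), and others
-- **Other languages**: Arabic (ar), Russian (ru), Portuguese (pt), Hindi (hi), and others
-
-For the full list, run: `python examples/language_classify_demo.py`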
-
 ## Project Architecture
 
 ```