Commit 931ea82

Merge pull request #74 from DataEval/dev

Release v1.6

2 parents 877a380 + 61d7bac commit 931ea82


60 files changed: +1528 −374 lines

.github/workflows/IntegrationTest.yml

Lines changed: 3 additions & 2 deletions

@@ -5,9 +5,9 @@ name: Python application
 
 on:
   push:
-    branches: [ "main", "dev" ]
+    branches: [ "*" ]
   pull_request:
-    branches: [ "main", "dev" ]
+    branches: [ "*" ]
   workflow_dispatch:
 
 
@@ -37,6 +37,7 @@ jobs:
       - name: Integration Test(local plaintext)
         run: |
           python -m dingo.run.cli --input_path test/data/test_local_plaintext.txt --dataset local -e default --data_format plaintext
+          python -m dingo.run.cli --input_path test/data/test_local_plaintext.txt --dataset local -e default --data_format plaintext --save_data
       - name: Integration Test(local json)
         run: |
           python -m dingo.run.cli --input_path test/data/test_local_json.json --dataset local -e default --data_format json --column_content prediction
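The workflow above exercises Dingo's CLI flags (`--input_path`, `--dataset`, `-e`, `--data_format`, `--column_content`, and the newly tested `--save_data`). As a rough, hypothetical sketch of that flag surface — not `dingo.run.cli`'s actual parser, whose names and defaults may differ — the interface could be modeled like this:

```python
import argparse

# Hypothetical sketch of the CLI flag surface exercised by the workflow above;
# the real dingo.run.cli parser may differ in names, defaults, and behavior.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="dingo.run.cli")
    parser.add_argument("--input_path", required=True, help="file or directory to evaluate")
    parser.add_argument("--dataset", default="local", help="dataset type, e.g. 'local' or 'hugging_face'")
    parser.add_argument("-e", "--eval_group", default="default", help="rule group to apply")
    parser.add_argument("--data_format", help="'plaintext', 'json', or 'jsonl'")
    parser.add_argument("--column_content", help="key holding the text to evaluate (JSON/JSONL)")
    parser.add_argument("--save_data", action="store_true", help="write detailed results to disk")
    return parser

args = build_parser().parse_args(
    "--input_path test/data/test_local_plaintext.txt --dataset local "
    "-e default --data_format plaintext --save_data".split()
)
print(args.save_data)  # → True (the flag this commit starts testing)
```

The second workflow step differs from the first only by `--save_data`, which is why the boolean `store_true` style fits the observed usage.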

README.md

Lines changed: 63 additions & 7 deletions

@@ -17,15 +17,19 @@
 
 </div>
 
-[English](README.md) | [简体中文](README_zh-CN.md)
+
+<div align="center">
+
+[English](README.md) · [简体中文](README_zh-CN.md)
+
+</div>
+
 
 <div align="center">
   <a href="https://discord.gg/Jhgb2eKWh8" style="text-decoration:none;">
     <img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="Discord" /></a>
-  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
   <a href="https://huggingface.co/spaces/DataEval/dingo" style="text-decoration:none;">
     <img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="3%" alt="Hugging Face" /></a>
-  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
 </div>
 
 
@@ -51,7 +55,43 @@ pip install dingo-python
 
 ## Example Use Cases
 
-### 1. Evaluate Local Text File (Plaintext)
+### 1. Using Evaluate Core
+
+```python
+from dingo.config.config import DynamicLLMConfig
+from dingo.io.input.MetaData import MetaData
+from dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase
+from dingo.model.rule.rule_common import RuleEnterAndSpace
+
+
+def llm():
+    data = MetaData(
+        data_id='123',
+        prompt="hello, introduce the world",
+        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
+    )
+
+    LLMTextQualityModelBase.dynamic_config = DynamicLLMConfig(
+        key='',
+        api_url='',
+        # model='',
+    )
+    res = LLMTextQualityModelBase.eval(data)
+    print(res)
+
+
+def rule():
+    data = MetaData(
+        data_id='123',
+        prompt="hello, introduce the world",
+        content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
+    )
+
+    res = RuleEnterAndSpace().eval(data)
+    print(res)
+```
+
+### 2. Evaluate Local Text File (Plaintext)
 
 ```python
 from dingo.io import InputArgs
@@ -72,7 +112,7 @@ result = executor.execute()
 print(result)
 ```
 
-### 2. Evaluate Hugging Face Dataset
+### 3. Evaluate Hugging Face Dataset
 
 ```python
 from dingo.io import InputArgs
@@ -92,7 +132,7 @@ result = executor.execute()
 print(result)
 ```
 
-### 3. Evaluate JSON/JSONL Format
+### 4. Evaluate JSON/JSONL Format
 
 ```python
 from dingo.io import InputArgs
@@ -114,7 +154,7 @@ result = executor.execute()
 print(result)
 ```
 
-### 4. Using LLM for Evaluation
+### 5. Using LLM for Evaluation
 
 ```python
 from dingo.io import InputArgs
@@ -229,6 +269,7 @@ Dingo provides several LLM-based assessment methods defined by prompts in the `d
 |-------------|--------|-------------|
 | `TEXT_QUALITY_KAOTI` | Exam question quality | Specialized assessment for evaluating the quality of exam questions, focusing on formula rendering, table formatting, paragraph structure, and answer formatting |
 | `Html_Abstract` | HTML extraction quality | Compares different methods of extracting Markdown from HTML, evaluating completeness, formatting accuracy, and semantic coherence |
+| `DATAMAN_ASSESSMENT` | Data Quality & Domain | Evaluates pre-training data quality using the DataMan methodology (14 standards, 15 domains). Assigns a score (0/1), domain type, quality status, and reason. |
 
 ### Classification Prompts
 
@@ -420,6 +461,21 @@ Example summary:
 }
 ```
 
+
+# MCP Server (Experimental)
+
+Dingo includes an experimental Model Context Protocol (MCP) server. For details on running the server and integrating it with clients like Cursor, please see the dedicated documentation:
+
+[**Dingo MCP Server Documentation (README_mcp.md)**](README_mcp.md)
+
+
+# Research & Publications
+
+- **"Comprehensive Data Quality Assessment for Multilingual WebData"**: [WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages](https://arxiv.org/pdf/2501.14506)
+- **"Pre-training data quality using the DataMan methodology"**: [DataMan: Data Manager for Pre-training Large Language Models](https://openreview.net/pdf?id=eNbA8Fqir4)
+
 # Future Plans
 
 - [ ] Richer graphic and text evaluation indicators

README_mcp.md

Lines changed: 165 additions & 0 deletions (new file)

@@ -0,0 +1,165 @@

# Dingo MCP Server

## Overview

The `mcp_server.py` script provides an experimental Model Context Protocol (MCP) server for Dingo, powered by [FastMCP](https://github.com/modelcontextprotocol/fastmcp). This allows MCP clients, such as Cursor, to interact with Dingo's data evaluation capabilities programmatically.

## Features

* Exposes Dingo's evaluation logic via MCP.
* Provides two primary tools:
  * `run_dingo_evaluation`: Executes rule-based or LLM-based evaluations on specified data.
  * `list_dingo_components`: Lists available rule groups and registered LLM models within Dingo.
* Enables interaction through MCP clients like Cursor.

## Installation

1. **Prerequisites**: Ensure you have Git and a Python environment (e.g., 3.8+) set up.
2. **Clone the Repository**: Clone this repository to your local machine.
   ```bash
   git clone https://github.com/DataEval/dingo.git
   cd dingo
   ```
3. **Install Dependencies**: Install the required dependencies, including FastMCP and other Dingo requirements. It's recommended to use the `requirements.txt` file.
   ```bash
   pip install -r requirements.txt
   # Alternatively, at minimum: pip install fastmcp
   ```
4. **Ensure Dingo is Importable**: Make sure your Python environment can find the `dingo` package within the cloned repository when you run the server script.

## Running the Server

Navigate to the directory containing `mcp_server.py` and run it using Python:

```bash
python mcp_server.py
```

By default, the server starts using the Server-Sent Events (SSE) transport protocol. You can customize its behavior using arguments within the script's `mcp.run()` call:

```python
# Example customization in mcp_server.py
mcp.run(
    transport="sse",   # Communication protocol (sse is default)
    host="127.0.0.1",  # Network interface to bind to (default: 0.0.0.0)
    port=8888,         # Port to listen on (default: 8000)
    log_level="debug"  # Logging verbosity (default: info)
)
```

**Important**: Note the `host` and `port` the server is running on, as you will need these to configure your MCP client.

## Integration with Cursor

### Configuration

To connect Cursor to your running Dingo MCP server, you need to edit Cursor's MCP configuration file (`mcp.json`). This file is typically located in Cursor's user configuration directory (e.g., `~/.cursor/` or `%USERPROFILE%\.cursor\`).

Add or modify the entry for your Dingo server within the `mcpServers` object. Use the `url` property to specify the address of your running server.

**Example `mcp.json` entry:**

```json
{
  "mcpServers": {
    // ... other servers ...
    "dingo_evaluator": {
      "url": "http://127.0.0.1:8888/sse" // <-- MUST match host, port, and transport of your running server
    }
    // ...
  }
}
```

* Ensure the `url` exactly matches the `host`, `port`, and `transport` (currently only `sse` is supported for the URL scheme) your `mcp_server.py` is configured to use. If you didn't customize `mcp.run`, the default URL is likely `http://127.0.0.1:8000/sse` or `http://0.0.0.0:8000/sse`.
* Restart Cursor after saving changes to `mcp.json`.

### Usage in Cursor

Once configured, you can invoke the Dingo tools within Cursor:

* **List Components**: "Use the dingo_evaluator tool to list available Dingo components."
* **Run Evaluation**: "Use the dingo_evaluator tool to run a rule evaluation..." or "Use the dingo_evaluator tool to run an LLM evaluation..."

Cursor will prompt you for the necessary arguments.

## Tool Reference

### `list_dingo_components()`

Lists available Dingo rule groups and registered LLM model identifiers.

* **Arguments**: None
* **Returns**: `Dict[str, List[str]]` - A dictionary containing `rule_groups` and `llm_models`.

**Example Cursor Usage**:
> Use the dingo_evaluator tool to list dingo components.
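The shape of the returned mapping can be pictured with a small sketch. The rule group names come from the validation notes elsewhere in this document ('default', 'sft', 'pretrain'); the LLM model identifier shown is merely an illustrative placeholder, not a guaranteed registry entry:

```python
# Illustrative shape of a list_dingo_components() result. The rule group names
# are the ones this document says the server validates; the model identifier
# is a hypothetical placeholder.
components = {
    "rule_groups": ["default", "sft", "pretrain"],
    "llm_models": ["LLMTextQualityModelBase"],
}

# A client can branch on what is available before requesting an evaluation.
evaluation_type = "rule" if "default" in components["rule_groups"] else "llm"
print(evaluation_type)  # → rule
```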
### `run_dingo_evaluation(...)`

Runs a Dingo evaluation (rule-based or LLM-based).

* **Arguments**:
  * `input_path` (str): Path to the input file or directory (relative to the project root or absolute).
  * `evaluation_type` (Literal["rule", "llm"]): Type of evaluation.
  * `eval_group_name` (str): Rule group name for `rule` type (default: `""`, which uses 'default'). Only 'default', 'sft', and 'pretrain' are validated by the server logic. Ignored for `llm` type.
  * `output_dir` (Optional[str]): Directory to save outputs. Defaults to a `dingo_output_*` subdirectory within the parent directory of `input_path`.
  * `task_name` (Optional[str]): Name for the task (used in output path generation). Defaults to `mcp_eval_<uuid>`.
  * `save_data` (bool): Whether to save detailed JSONL output (default: True).
  * `save_correct` (bool): Whether to save correct data (default: True).
  * `kwargs` (dict): Dictionary for additional `dingo.io.InputArgs`. Common uses:
    * `dataset` (str): Dataset type (e.g., 'local', 'hugging_face'). Defaults to 'local' if `input_path` is given.
    * `data_format` (str): Input data format (e.g., 'json', 'jsonl', 'plaintext'). Inferred from the `input_path` extension if possible.
    * `column_content` (str): **Required** for formats like JSON/JSONL - specifies the key containing the text to evaluate.
    * `column_id`, `column_prompt`, `column_image`: Other column mappings.
    * `custom_config` (str | dict): Path to a JSON config file, a JSON string, or a dictionary for LLM evaluation or custom rule settings. API keys for LLMs **must** be provided here.
    * `max_workers`, `batch_size`: Dingo execution parameters (default to 1 in MCP for stability).
* **Returns**: `str` - The absolute path to the primary output file (e.g., `summary.json`).
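The documented defaults (an output directory next to the input, a `mcp_eval_<uuid>` task name, format inference from the file extension, and the restricted rule-group set) can be sketched with the standard library. This is a hypothetical re-creation of the behavior described above, not the server's actual code:

```python
import uuid
from pathlib import Path
from typing import Optional

# Hypothetical helpers re-creating run_dingo_evaluation's documented defaults;
# the real mcp_server.py logic may differ.
VALID_RULE_GROUPS = {"default", "sft", "pretrain"}  # groups the server validates
EXTENSION_FORMATS = {".json": "json", ".jsonl": "jsonl", ".txt": "plaintext"}

def validate_group(eval_group_name: str) -> str:
    group = eval_group_name or "default"  # empty string falls back to 'default'
    if group not in VALID_RULE_GROUPS:
        raise ValueError(f"unknown rule group: {group}")
    return group

def resolve_defaults(input_path: str,
                     output_dir: Optional[str] = None,
                     task_name: Optional[str] = None) -> dict:
    path = Path(input_path)
    return {
        # outputs default to a dingo_output_* subdirectory next to the input
        "output_dir": output_dir or str(path.parent / f"dingo_output_{uuid.uuid4().hex[:8]}"),
        # task names default to mcp_eval_<uuid>
        "task_name": task_name or f"mcp_eval_{uuid.uuid4()}",
        # data_format is inferred from the file extension when possible
        "data_format": EXTENSION_FORMATS.get(path.suffix),
    }

defaults = resolve_defaults("test/data/test_local_jsonl.jsonl")
print(defaults["data_format"])  # → jsonl
```

Explicit `output_dir`, `task_name`, or `data_format` values always win over these inferred defaults, matching the argument descriptions above.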
**Example Cursor Usage (Rule-based):**

> Use the Dingo Evaluator tool to run the default rule evaluation on `test/data/test_local_jsonl.jsonl`. Make sure to use the 'content' column.

*(Cursor should propose a tool call like the one below.)*
```xml
<use_mcp_tool>
<server_name>dingo_evaluator</server_name>
<tool_name>run_dingo_evaluation</tool_name>
<arguments>
{
  "input_path": "test/data/test_local_jsonl.jsonl",
  "evaluation_type": "rule",
  "eval_group_name": "default",
  "kwargs": {
    "column_content": "content"
    // data_format="jsonl" and dataset="local" will be inferred
  }
}
</arguments>
</use_mcp_tool>
```

**Example Cursor Usage (LLM-based):**

> Use the Dingo Evaluator tool to perform an LLM evaluation on `test/data/test_local_jsonl.jsonl`. Use the 'content' column. Configure it using the file `examples/mcp/config_self_deployed_llm.json`.

*(Cursor should propose a tool call like the one below. Note that `eval_group_name` can be omitted or set when using `custom_config` for LLM evals.)*
```xml
<use_mcp_tool>
<server_name>dingo_evaluator</server_name>
<tool_name>run_dingo_evaluation</tool_name>
<arguments>
{
  "input_path": "test/data/test_local_jsonl.jsonl",
  "evaluation_type": "llm",
  "kwargs": {
    "column_content": "content",
    "custom_config": "examples/mcp/config_self_deployed_llm.json"
    // data_format="jsonl" and dataset="local" will be inferred
  }
}
</arguments>
</use_mcp_tool>
```

Refer to `examples/mcp/config_api_llm.json` (for API-based LLMs) and `examples/mcp/config_self_deployed_llm.json` (for self-hosted LLMs) for the structure of the `custom_config` file, including where to place API keys or URLs.
