113 changes: 113 additions & 0 deletions MMLongBench-Doc/README.md
@@ -0,0 +1,113 @@
<p align="center">
<h1 align="center">MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations</h1>
<p align="center">
<a href="https://mayubo2333.github.io"><strong>Yubo Ma</strong></a>
·
<a href="https://yuhangzang.github.io/"><strong>Yuhang Zang</strong></a>
·
<a href="https://cliangyu.com/"><strong>Liangyu Chen</strong></a>
·
<a href="https://chenmeiqii.github.io/"><strong>Meiqi Chen</strong></a>
·
<a href="https://yzjiao.github.io/"><strong>Yizhu Jiao</strong></a>
·
<strong>Xinze Li</strong>
·
<a href="https://xinyuanlu00.github.io"><strong>Xinyuan Lu</strong></a>
·
<a href="https://liuziyu77.github.io/"><strong>Ziyu Liu</strong></a>
·
<strong>Yan Ma</strong>
·
<a href="https://lightdxy.github.io/"><strong>Xiaoyi Dong</strong></a>
·
<a href="https://panzhang0212.github.io/"><strong>Pan Zhang</strong></a>
·
<a href="http://www.liangmingpan.com/"><strong>Liangming Pan</strong></a>
    ·
    <strong>Yu-Gang Jiang</strong>
    ·
    <a href="https://myownskyw7.github.io/"><strong>Jiaqi Wang</strong></a>
    ·
    <a href="https://sites.google.com/view/yixin-homepage"><strong>Yixin Cao</strong></a>
    ·
<a href="https://personal.ntu.edu.sg/axsun/"><strong>Aixin Sun</strong></a>
</p>
<!-- <h2 align="center">Submitted to arXiv</h2> -->
  📖<a href="https://arxiv.org/abs/2407.01523">Paper</a> | 🏠<a href="https://mayubo2333.github.io/MMLongBench-Doc/">Homepage</a> | 🤗<a href="https://huggingface.co/datasets/yubo2333/MMLongBench-Doc">Huggingface</a>
<div align="center"></div>
<p align="center">
<p>
    The automatic understanding of lengthy documents (long-context document understanding; DU) is a long-standing task of urgent practical need. Although many LVLMs now claim (and show promising cases of) long-context DU capabilities, a unified and quantitative evaluation of existing models has been lacking due to the absence of a suitable benchmark.<br>
    To bridge this gap, we construct <strong>MMLongBench-Doc</strong>, which comprises 135 documents and 1,091 questions, each accompanied by a short, deterministic reference answer and detailed meta information. The documents average 47.5 pages and 21,214 tokens, cover 7 diverse domains, and are PDF-formatted with rich layouts and multi-modal components. The questions are either curated from existing datasets or newly annotated by expert-level annotators. Towards a comprehensive evaluation, the questions cover different evidence sources (text, tables, charts, images, etc.) and different locations (page indices) within the documents. Notably, 33.0% of the questions are cross-page questions necessitating comprehension and reasoning over evidence across multiple pages, and 22.5% are designed to be unanswerable, reducing shortcuts in the benchmark and probing LVLMs' hallucinations.
</p>
<a href="">
<img src="asset/top_figure.png" alt="Logo" width="100%">
</a>
<br>

## 📢 News
- 🚀 [07/2024] We further refined and updated the questions in MMLongBench-Doc!
- 🚀 [07/2024] We integrated MMLongBench-Doc into the evaluation toolkit [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), providing a highly convenient testing solution!
- 🚀 [06/2024] We uploaded MMLongBench-Doc to Hugging Face.

## 💡 Highlights
- 🔥 **Multi-modality**: All selected documents are PDF-formatted with rich layouts and multi-modal components including text, tables, charts and images. We carefully annotate questions from this multi-modal evidence.
- 🔥 **Long-context**: Each document has an average of 47.5 pages and 21,214 tokens. Additionally, 33.0% of the questions are cross-page questions, which necessitate collecting and reasoning over information across multiple pages.
- 🔥 **Challenging**: Experiments on 14 LVLMs demonstrate that long-context document understanding greatly challenges current models. Even the best-performing LVLM, GPT-4o, achieves an overall F1 score of only 44.9%.

## Dataset
We save our benchmark, including both questions and documents, in `./data`.
* The questions are provided in JSON format and contain the following attributes:
```json
{
    "doc_id": "Independents-Report.pdf",
    "doc_type": "Research report / Introduction",
    "question": "What's the percentage of people who are democrats and voted in the last election compared to the entire population in 2018?",
    "answer": "18.29%",
    "evidence_pages": "[3, 5]",
    "evidence_sources": "['Pure-text (Plain-text)']",
    "answer_format": "Float"
}
```
* The documents are saved in `./data/documents` as PDF files.
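
For per-document analysis it is convenient to group the question records by `doc_id`. A minimal sketch (the helper name and the inline records are illustrative, not part of the release; only the field names follow the schema above):

```python
from collections import defaultdict

def group_questions_by_document(samples):
    """Group question records by the PDF document they reference."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["doc_id"]].append(sample)
    return dict(grouped)

# Illustrative records following the schema above.
samples = [
    {"doc_id": "Independents-Report.pdf", "answer_format": "Float"},
    {"doc_id": "Independents-Report.pdf", "answer_format": "Str"},
]
grouped = group_questions_by_document(samples)
print(len(grouped["Independents-Report.pdf"]))  # prints 2
```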

You can also load this dataset with the following snippet (make sure you have installed the Hugging Face `datasets` library):
```python
from datasets import load_dataset
samples = load_dataset("yubo2333/MMLongBench-Doc/data")["train"]
```
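
Note that the list-valued fields (`evidence_pages`, `evidence_sources`) are serialized as strings; `ast.literal_eval` is a safe way to turn them back into Python lists. A sketch with an illustrative record:

```python
import ast

record = {
    "evidence_pages": "[3, 5]",
    "evidence_sources": "['Pure-text (Plain-text)']",
}
pages = ast.literal_eval(record["evidence_pages"])
sources = ast.literal_eval(record["evidence_sources"])
print(pages)    # [3, 5]
print(sources)  # ['Pure-text (Plain-text)']
```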

## 🛠️ Usage
### Environment
```
python 3.9
torch 2.1.2+cu121
```
You can install other dependencies by `pip install -r requirements.txt`.


### Quick Use
```bash
MODEL_NAME=[gpt-4o|gpt-4-turbo|gemini-1.5-pro-latest|internvl|4khd|minicpm_llama3] bash run.sh
```
Note that
* `OPENAI_API_KEY` should be set regardless of which model you are evaluating, because we adopt a three-stage evaluation protocol, as detailed in Section 4.1 of [our paper](https://arxiv.org/abs/2407.01523). Converting a long-form response into a short-form prediction requires GPT-4o.
* We currently support various popular open-source and closed-source LVLMs, including **GPT-4o**, **GPT-4V**, **Gemini-Pro-1.5**, **InternLM-Xcomposer2-4KHD**, **Intern-VL-Chat-v1.5** and **MiniCPM-Llama3-V2.5**. More LVLMs will be supported in the near future (we are cleaning the related code).
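
As a reference for interpreting the reported scores, the overall accuracy and the generalized F1 can be sketched as follows, mirroring `eval_acc_and_f1` in `eval/eval_score.py` (which treats "Not answerable" as the negative class); the sample records here are illustrative:

```python
def acc_and_f1(samples):
    """Accuracy over all scored samples; F1 treats "Not answerable" as negative."""
    scored = [s for s in samples if "score" in s]
    if not scored:
        return 0.0, 0.0
    acc = sum(s["score"] for s in scored) / len(scored)
    answerable = [s for s in scored if s["answer"] != "Not answerable"]
    pred_answerable = [s for s in scored if s["pred"] != "Not answerable"]
    try:
        recall = sum(s["score"] for s in answerable) / len(answerable)
        precision = sum(s["score"] for s in answerable) / len(pred_answerable)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    except ZeroDivisionError:
        f1 = 0.0
    return acc, f1

samples = [
    {"answer": "A", "pred": "A", "score": 1.0},
    {"answer": "Not answerable", "pred": "Not answerable", "score": 1.0},
    {"answer": "B", "pred": "C", "score": 0.0},
]
acc, f1 = acc_and_f1(samples)
print(round(acc, 3), round(f1, 3))  # 0.667 0.5
```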

## ✒️Citation
```
@misc{ma2024mmlongbenchdocbenchmarkinglongcontextdocument,
title={MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations},
author={Yubo Ma and Yuhang Zang and Liangyu Chen and Meiqi Chen and Yizhu Jiao and Xinze Li and Xinyuan Lu and Ziyu Liu and Yan Ma and Xiaoyi Dong and Pan Zhang and Liangming Pan and Yu-Gang Jiang and Jiaqi Wang and Yixin Cao and Aixin Sun},
year={2024},
eprint={2407.01523},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.01523},
}
```

## 📄 License
![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg) ![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg) **Usage and License Notices**: The data and code are intended and licensed for research use only.
The data is licensed under Attribution-NonCommercial 4.0 International. Usage should also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use
Binary file added MMLongBench-Doc/asset/top_figure.png
260 changes: 260 additions & 0 deletions MMLongBench-Doc/eval/eval_score.py
@@ -0,0 +1,260 @@
import re

from collections import defaultdict
from math import isclose


def levenshtein_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1

distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2 + 1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]


def anls_compute(groundtruth, prediction, threshold=0.5):
dist = levenshtein_distance(groundtruth, prediction)
    length = max(len(groundtruth), len(prediction))
value = 0.0 if length == 0 else float(dist) / float(length)
anls = 1.0 - value
if anls <= threshold:
anls = 0.0
return anls


def is_float_equal(
    reference, prediction, include_percentage: bool = False, is_close: bool = False
) -> bool:
def get_precision(gt_ans: float) -> int:
precision = 3
if "." in str(gt_ans):
precision = len(str(gt_ans).split(".")[-1])
return precision

reference = float(str(reference).strip().rstrip("%").strip())
try:
prediction = float(str(prediction).strip().rstrip("%").strip())
    except (ValueError, TypeError):
return False

if include_percentage:
gt_result = [reference / 100, reference, reference * 100]
else:
gt_result = [reference]
for item in gt_result:
try:
if is_close:
if isclose(item, prediction, rel_tol=0.01):
return True
precision = max(min(get_precision(prediction), get_precision(item)), 2)
if round(prediction, precision) == round(item, precision):
return True
except Exception:
continue
return False


def get_clean_string(s):
    s = str(s).lower().strip()
    # strip unit suffixes (str.rstrip removes a character set, not a suffix,
    # so slice off the suffix explicitly and assign the result back)
    for suffix in ("miles", "mile", "million"):
        if s.endswith(suffix):
            s = s[: -len(suffix)].strip()
    # remove parenthesized content
    s = re.sub(r"\s*\([^)]*\)", "", s).strip()
    # remove surrounding quotes
    s = re.sub(r"^['\"]|['\"]$", "", s).strip()
    s = s.strip().lstrip("$").strip()
    s = s.strip().rstrip("%").strip()
    return s


def is_exact_match(s):
flag = False
# Website
if "https://" in s:
flag = True
# code file
if s.endswith(".py") or s.endswith("ipynb"):
flag = True
if s.startswith("page"):
flag = True
# telephone number
if re.fullmatch(r"\b\d+(-\d+|\s\d+)?\b", s):
flag = True
# time
if "a.m." in s or "p.m." in s:
flag = True
# YYYY-MM-DD
if re.fullmatch(r"\b\d{4}[-\s]\d{2}[-\s]\d{2}\b", s):
flag = True
# YYYY-MM
if re.fullmatch(r"\b\d{4}[-\s]\d{2}\b", s):
flag = True
# Email address
if re.fullmatch(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", s):
flag = True
return flag


def isfloat(num):
try:
float(num)
return True
except ValueError:
return False


def eval_score(gt, pred, answer_type):
if answer_type == "Int":
try:
gt, pred = int(gt), int(float(pred))
        except (ValueError, TypeError):
            pred = ""
score = gt == pred
elif answer_type == "Float":
try:
gt = float(get_clean_string(str(gt)))
pred = float(get_clean_string(str(pred)))
        except (ValueError, TypeError):
            pred = ""
score = is_float_equal(gt, pred, include_percentage=True, is_close=True)
elif answer_type in ["Str", "None"]:
gt = get_clean_string(gt)
pred = get_clean_string(pred)
if is_exact_match(gt):
score = gt == pred
else:
score = anls_compute(gt, pred)
else:
if isinstance(gt, str) and gt.startswith("["):
gt = eval(gt)
if not isinstance(gt, list):
gt = [gt]
if isinstance(pred, str) and pred.startswith("["):
pred = eval(pred)
if not isinstance(pred, list):
pred = [pred]
if len(gt) != len(pred):
score = 0.0
else:
gt = sorted([get_clean_string(a) for a in gt])
pred = sorted([get_clean_string(a) for a in pred])
if isfloat(gt[0]) or is_exact_match(gt[0]):
score = "-".join(gt) == "-".join(pred)
else:
                # note: zip(..., strict=...) requires Python 3.10+, but the stated
                # environment is Python 3.9; lengths are already checked equal above
                score = min(
                    anls_compute(gt_v, pred_v) for gt_v, pred_v in zip(gt, pred)
                )

return float(score)


def eval_acc_and_f1(samples):
evaluated_samples = [sample for sample in samples if "score" in sample]
if not evaluated_samples:
return 0.0, 0.0

acc = sum([sample["score"] for sample in evaluated_samples]) / len(evaluated_samples)
try:
recall = sum(
[
sample["score"]
for sample in evaluated_samples
if sample["answer"] != "Not answerable"
]
) / len([sample for sample in evaluated_samples if sample["answer"] != "Not answerable"])
precision = sum(
[
sample["score"]
for sample in evaluated_samples
if sample["answer"] != "Not answerable"
]
) / len([sample for sample in evaluated_samples if sample["pred"] != "Not answerable"])
f1 = 2 * recall * precision / (recall + precision) if (recall + precision) > 0.0 else 0.0
    except ZeroDivisionError:
        f1 = 0.0

return acc, f1


def show_results(samples, show_path=None):
for sample in samples:
sample["evidence_pages"] = eval(sample["evidence_pages"])
sample["evidence_sources"] = eval(sample["evidence_sources"])

with open(show_path, "w") as f:
acc, f1 = eval_acc_and_f1(samples)
f.write(f"Overall Acc: {acc} | Question Number: {len(samples)}\n")
f.write(f"Overall F1-score: {f1} | Question Number: {len(samples)}\n")
f.write("-----------------------\n")

#####################
acc_single_page, _ = eval_acc_and_f1(
[sample for sample in samples if len(sample["evidence_pages"]) == 1]
)
acc_multi_page, _ = eval_acc_and_f1(
[
sample
for sample in samples
if len(sample["evidence_pages"]) != 1 and sample["answer"] != "Not answerable"
]
)
acc_neg, _ = eval_acc_and_f1(
[sample for sample in samples if sample["answer"] == "Not answerable"]
)

f.write(
"Single-page | Accuracy: {} | Question Number: {}\n".format(
acc_single_page,
len([sample for sample in samples if len(sample["evidence_pages"]) == 1]),
)
)
f.write(
"Cross-page | Accuracy: {} | Question Number: {}\n".format(
acc_multi_page,
len(
[
sample
for sample in samples
if len(sample["evidence_pages"]) != 1
and sample["answer"] != "Not answerable"
]
),
)
)
f.write(
"Unanswerable | Accuracy: {} | Question Number: {}\n".format(
acc_neg, len([sample for sample in samples if sample["answer"] == "Not answerable"])
)
)
f.write("-----------------------\n")

#####################
source_sample_dict, document_type_dict = defaultdict(list), defaultdict(list)
for sample in samples:
for answer_source in sample["evidence_sources"]:
source_sample_dict[answer_source].append(sample)
document_type_dict[sample["doc_type"]].append(sample)
        for source, sub_samples in source_sample_dict.items():
            f.write(
                f"Evidence Sources: {source} | Accuracy: {eval_acc_and_f1(sub_samples)[0]} | Question Number: {len(sub_samples)}\n"
            )

f.write("-----------------------\n")
        for doc_type, sub_samples in document_type_dict.items():
            f.write(
                f"Document Type: {doc_type} | Accuracy: {eval_acc_and_f1(sub_samples)[0]} | Question Number: {len(sub_samples)}\n"
            )