113 changes: 113 additions & 0 deletions MMLongBench-Doc/README.md
@@ -0,0 +1,113 @@
<p align="center">
<h1 align="center">MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations</h1>
<p align="center">
<a href="https://mayubo2333.github.io"><strong>Yubo Ma</strong></a>
·
<a href="https://yuhangzang.github.io/"><strong>Yuhang Zang</strong></a>
·
<a href="https://cliangyu.com/"><strong>Liangyu Chen</strong></a>
·
<a href="https://chenmeiqii.github.io/"><strong>Meiqi Chen</strong></a>
·
<a href="https://yzjiao.github.io/"><strong>Yizhu Jiao</strong></a>
·
<strong>Xinze Li</strong>
·
<a href="https://xinyuanlu00.github.io"><strong>Xinyuan Lu</strong></a>
·
<a href="https://liuziyu77.github.io/"><strong>Ziyu Liu</strong></a>
·
<strong>Yan Ma</strong>
·
<a href="https://lightdxy.github.io/"><strong>Xiaoyi Dong</strong></a>
·
<a href="https://panzhang0212.github.io/"><strong>Pan Zhang</strong></a>
·
<a href="http://www.liangmingpan.com/"><strong>Liangming Pan</strong></a>
    ·
    <strong>Yu-Gang Jiang</strong>
    ·
    <a href="https://myownskyw7.github.io/"><strong>Jiaqi Wang</strong></a>
    ·
    <a href="https://sites.google.com/view/yixin-homepage"><strong>Yixin Cao</strong></a>
    ·
<a href="https://personal.ntu.edu.sg/axsun/"><strong>Aixin Sun</strong></a>
</p>
<!-- <h2 align="center">Submitted to arXiv</h2> -->
  📖<a href="https://arxiv.org/abs/2407.01523">Paper</a> | 🏠<a href="https://mayubo2333.github.io/MMLongBench-Doc/">Homepage</a> | 🤗<a href="https://huggingface.co/datasets/yubo2333/MMLongBench-Doc">Huggingface</a>
<div align="center"></div>
<p align="center">
<p>
    The automatic understanding of lengthy documents (long-context document understanding; DU) is a long-standing task of urgent practical need. Although many LVLMs now claim (and show promising cases of) long-context DU capabilities, a unified and quantitative evaluation of existing models has been lacking due to the absence of a suitable benchmark.<br>
    To bridge this gap, we construct <strong>MMLongBench-Doc</strong>, which comprises 135 documents and 1,091 questions, each accompanied by a short, deterministic reference answer and detailed meta information. The documents average 47.5 pages and 21,214 tokens, cover 7 diverse domains, and are PDF-formatted with rich layouts and multi-modal components. The questions are either curated from existing datasets or newly annotated by expert-level annotators. Towards a comprehensive evaluation, the questions cover different evidence sources (text, tables, charts, images, etc.) and different locations (page indices) within the documents. Notably, 33.0% of the questions are cross-page questions necessitating comprehension and reasoning over evidence across multiple pages, and 22.5% are designed to be unanswerable, reducing shortcuts in the benchmark and probing LVLMs' hallucinations.
</p>
<a href="">
<img src="asset/top_figure.png" alt="Logo" width="100%">
</a>
<br>

## 📢 News
- 🚀 [07/2024] We further refined and updated the questions in MMLongBench-Doc!
- 🚀 [07/2024] We integrated MMLongBench-Doc into the evaluation toolkit [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), providing a highly convenient testing solution!
- 🚀 [06/2024] We uploaded MMLongBench-Doc to Hugging Face.

## 💡 Highlights
- 🔥 **Multi-modality**: All selected documents are PDF-formatted with rich layouts and multi-modal components including text, tables, charts and images. We carefully annotate questions from this multi-modal evidence.
- 🔥 **Long-context**: Each document has an average of 47.5 pages and 21,214 tokens. Additionally, 33.0% of the questions are cross-page questions, which necessitate collecting and reasoning over information across multiple pages.
- 🔥 **Challenging**: Experiments on 14 LVLMs demonstrate that long-context document understanding greatly challenges current models. Even the best-performing LVLM, GPT-4o, achieves an overall F1 score of only 44.9%.

## Dataset
We save our benchmark, including both questions and documents, in `./data`.
* The questions are provided in JSON format and contain the following attributes:
```json
{
    "doc_id": "Independents-Report.pdf",
    "doc_type": "Research report / Introduction",
    "question": "What's the percentage of people who are democrats and voted in the last election compared to the entire population in 2018?",
    "answer": "18.29%",
    "evidence_pages": "[3, 5]",
    "evidence_sources": "['Pure-text (Plain-text)']",
    "answer_format": "Float"
}
```
* The documents are saved in `./data/documents` as PDF files.
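
For per-document analysis it is convenient to group the question records by `doc_id`. A minimal sketch (the helper name and the inline records are illustrative, not part of the release; only the field names follow the schema above):

```python
from collections import defaultdict

def group_questions_by_document(samples):
    """Group question records by the PDF document they reference."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["doc_id"]].append(sample)
    return dict(grouped)

# Illustrative records following the schema above.
samples = [
    {"doc_id": "Independents-Report.pdf", "answer_format": "Float"},
    {"doc_id": "Independents-Report.pdf", "answer_format": "Str"},
]
grouped = group_questions_by_document(samples)
print(len(grouped["Independents-Report.pdf"]))  # prints 2
```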

You can also load this dataset with the following snippet (make sure you have installed the Hugging Face `datasets` library):
```python
from datasets import load_dataset
samples = load_dataset("yubo2333/MMLongBench-Doc/data")["train"]
```
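
Note that the list-valued fields (`evidence_pages`, `evidence_sources`) are serialized as strings; `ast.literal_eval` is a safe way to turn them back into Python lists. A sketch with an illustrative record:

```python
import ast

record = {
    "evidence_pages": "[3, 5]",
    "evidence_sources": "['Pure-text (Plain-text)']",
}
pages = ast.literal_eval(record["evidence_pages"])
sources = ast.literal_eval(record["evidence_sources"])
print(pages)    # [3, 5]
print(sources)  # ['Pure-text (Plain-text)']
```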

## 🛠️ Usage
### Environment
```
python 3.9
torch 2.1.2+cu121
```
You can install other dependencies by `pip install -r requirements.txt`.


### Quick Use
```bash
MODEL_NAME=[gpt-4o|gpt-4-turbo|gemini-1.5-pro-latest|internvl|4khd|minicpm_llama3] bash run.sh
```
Note that
* `OPENAI_API_KEY` should be set regardless of which model you are evaluating, because we adopt a three-stage evaluation protocol, as detailed in Section 4.1 of [our paper](https://arxiv.org/abs/2407.01523). Converting a long-form response into a short-form prediction requires GPT-4o.
* We currently support various popular open-source and closed-source LVLMs, including **GPT-4o**, **GPT-4V**, **Gemini-Pro-1.5**, **InternLM-Xcomposer2-4KHD**, **Intern-VL-Chat-v1.5** and **MiniCPM-Llama3-V2.5**. More LVLMs will be supported in the near future (we are cleaning the related code).
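
As a reference for interpreting the reported scores, the overall accuracy and the generalized F1 can be sketched as follows, mirroring `eval_acc_and_f1` in `eval/eval_score.py` (which treats "Not answerable" as the negative class); the sample records here are illustrative:

```python
def acc_and_f1(samples):
    """Accuracy over all scored samples; F1 treats "Not answerable" as negative."""
    scored = [s for s in samples if "score" in s]
    if not scored:
        return 0.0, 0.0
    acc = sum(s["score"] for s in scored) / len(scored)
    answerable = [s for s in scored if s["answer"] != "Not answerable"]
    pred_answerable = [s for s in scored if s["pred"] != "Not answerable"]
    try:
        recall = sum(s["score"] for s in answerable) / len(answerable)
        precision = sum(s["score"] for s in answerable) / len(pred_answerable)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    except ZeroDivisionError:
        f1 = 0.0
    return acc, f1

samples = [
    {"answer": "A", "pred": "A", "score": 1.0},
    {"answer": "Not answerable", "pred": "Not answerable", "score": 1.0},
    {"answer": "B", "pred": "C", "score": 0.0},
]
acc, f1 = acc_and_f1(samples)
print(round(acc, 3), round(f1, 3))  # 0.667 0.5
```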

## ✒️Citation
```
@misc{ma2024mmlongbenchdocbenchmarkinglongcontextdocument,
title={MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations},
author={Yubo Ma and Yuhang Zang and Liangyu Chen and Meiqi Chen and Yizhu Jiao and Xinze Li and Xinyuan Lu and Ziyu Liu and Yan Ma and Xiaoyi Dong and Pan Zhang and Liangming Pan and Yu-Gang Jiang and Jiaqi Wang and Yixin Cao and Aixin Sun},
year={2024},
eprint={2407.01523},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.01523},
}
```

## 📄 License
![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg) ![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg) **Usage and License Notices**: The data and code are intended and licensed for research use only.
The data is licensed under Attribution-NonCommercial 4.0 International. Usage should also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use
Binary file added MMLongBench-Doc/asset/top_figure.png
260 changes: 260 additions & 0 deletions MMLongBench-Doc/eval/eval_score.py
@@ -0,0 +1,260 @@
import re

from collections import defaultdict
from math import isclose


def levenshtein_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1

distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2 + 1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]


def anls_compute(groundtruth, prediction, threshold=0.5):
dist = levenshtein_distance(groundtruth, prediction)
    length = max(len(groundtruth), len(prediction))
value = 0.0 if length == 0 else float(dist) / float(length)
anls = 1.0 - value
if anls <= threshold:
anls = 0.0
return anls


def is_float_equal(
    reference, prediction, include_percentage: bool = False, is_close: bool = False
) -> bool:
def get_precision(gt_ans: float) -> int:
precision = 3
if "." in str(gt_ans):
precision = len(str(gt_ans).split(".")[-1])
return precision

reference = float(str(reference).strip().rstrip("%").strip())
try:
prediction = float(str(prediction).strip().rstrip("%").strip())
    except (ValueError, TypeError):
return False

if include_percentage:
gt_result = [reference / 100, reference, reference * 100]
else:
gt_result = [reference]
for item in gt_result:
try:
if is_close:
if isclose(item, prediction, rel_tol=0.01):
return True
precision = max(min(get_precision(prediction), get_precision(item)), 2)
if round(prediction, precision) == round(item, precision):
return True
except Exception:
continue
return False


def get_clean_string(s):
    s = str(s).lower().strip()
    # strip unit suffixes (str.rstrip removes a character set, not a suffix,
    # so slice off the suffix explicitly and assign the result back)
    for suffix in ("miles", "mile", "million"):
        if s.endswith(suffix):
            s = s[: -len(suffix)].strip()
    # remove parenthesized content
    s = re.sub(r"\s*\([^)]*\)", "", s).strip()
    # remove surrounding quotes
    s = re.sub(r"^['\"]|['\"]$", "", s).strip()
    s = s.strip().lstrip("$").strip()
    s = s.strip().rstrip("%").strip()
    return s


def is_exact_match(s):
flag = False
# Website
if "https://" in s:
flag = True
# code file
if s.endswith(".py") or s.endswith("ipynb"):
flag = True
if s.startswith("page"):
flag = True
# telephone number
if re.fullmatch(r"\b\d+(-\d+|\s\d+)?\b", s):
flag = True
# time
if "a.m." in s or "p.m." in s:
flag = True
# YYYY-MM-DD
if re.fullmatch(r"\b\d{4}[-\s]\d{2}[-\s]\d{2}\b", s):
flag = True
# YYYY-MM
if re.fullmatch(r"\b\d{4}[-\s]\d{2}\b", s):
flag = True
# Email address
if re.fullmatch(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", s):
flag = True
return flag


def isfloat(num):
try:
float(num)
return True
except ValueError:
return False


def eval_score(gt, pred, answer_type):
if answer_type == "Int":
try:
gt, pred = int(gt), int(float(pred))
        except (ValueError, TypeError):
            pred = ""
score = gt == pred
elif answer_type == "Float":
try:
gt = float(get_clean_string(str(gt)))
pred = float(get_clean_string(str(pred)))
        except (ValueError, TypeError):
            pred = ""
score = is_float_equal(gt, pred, include_percentage=True, is_close=True)
elif answer_type in ["Str", "None"]:
gt = get_clean_string(gt)
pred = get_clean_string(pred)
if is_exact_match(gt):
score = gt == pred
else:
score = anls_compute(gt, pred)
else:
if isinstance(gt, str) and gt.startswith("["):
gt = eval(gt)
if not isinstance(gt, list):
gt = [gt]
if isinstance(pred, str) and pred.startswith("["):
pred = eval(pred)
if not isinstance(pred, list):
pred = [pred]
if len(gt) != len(pred):
score = 0.0
else:
gt = sorted([get_clean_string(a) for a in gt])
pred = sorted([get_clean_string(a) for a in pred])
if isfloat(gt[0]) or is_exact_match(gt[0]):
score = "-".join(gt) == "-".join(pred)
else:
                # note: zip(..., strict=...) requires Python 3.10+, but the stated
                # environment is Python 3.9; lengths are already checked equal above
                score = min(
                    anls_compute(gt_v, pred_v) for gt_v, pred_v in zip(gt, pred)
                )

return float(score)


def eval_acc_and_f1(samples):
evaluated_samples = [sample for sample in samples if "score" in sample]
if not evaluated_samples:
return 0.0, 0.0

acc = sum([sample["score"] for sample in evaluated_samples]) / len(evaluated_samples)
try:
recall = sum(
[
sample["score"]
for sample in evaluated_samples
if sample["answer"] != "Not answerable"
]
) / len([sample for sample in evaluated_samples if sample["answer"] != "Not answerable"])
precision = sum(
[
sample["score"]
for sample in evaluated_samples
if sample["answer"] != "Not answerable"
]
) / len([sample for sample in evaluated_samples if sample["pred"] != "Not answerable"])
f1 = 2 * recall * precision / (recall + precision) if (recall + precision) > 0.0 else 0.0
    except ZeroDivisionError:
        f1 = 0.0

return acc, f1


def show_results(samples, show_path=None):
for sample in samples:
sample["evidence_pages"] = eval(sample["evidence_pages"])
sample["evidence_sources"] = eval(sample["evidence_sources"])

with open(show_path, "w") as f:
acc, f1 = eval_acc_and_f1(samples)
f.write(f"Overall Acc: {acc} | Question Number: {len(samples)}\n")
f.write(f"Overall F1-score: {f1} | Question Number: {len(samples)}\n")
f.write("-----------------------\n")

#####################
acc_single_page, _ = eval_acc_and_f1(
[sample for sample in samples if len(sample["evidence_pages"]) == 1]
)
acc_multi_page, _ = eval_acc_and_f1(
[
sample
for sample in samples
if len(sample["evidence_pages"]) != 1 and sample["answer"] != "Not answerable"
]
)
acc_neg, _ = eval_acc_and_f1(
[sample for sample in samples if sample["answer"] == "Not answerable"]
)

f.write(
"Single-page | Accuracy: {} | Question Number: {}\n".format(
acc_single_page,
len([sample for sample in samples if len(sample["evidence_pages"]) == 1]),
)
)
f.write(
"Cross-page | Accuracy: {} | Question Number: {}\n".format(
acc_multi_page,
len(
[
sample
for sample in samples
if len(sample["evidence_pages"]) != 1
and sample["answer"] != "Not answerable"
]
),
)
)
f.write(
"Unanswerable | Accuracy: {} | Question Number: {}\n".format(
acc_neg, len([sample for sample in samples if sample["answer"] == "Not answerable"])
)
)
f.write("-----------------------\n")

#####################
source_sample_dict, document_type_dict = defaultdict(list), defaultdict(list)
for sample in samples:
for answer_source in sample["evidence_sources"]:
source_sample_dict[answer_source].append(sample)
document_type_dict[sample["doc_type"]].append(sample)
        for source, sub_samples in source_sample_dict.items():
            f.write(
                f"Evidence Sources: {source} | Accuracy: {eval_acc_and_f1(sub_samples)[0]} | Question Number: {len(sub_samples)}\n"
            )

f.write("-----------------------\n")
        for doc_type, sub_samples in document_type_dict.items():
            f.write(
                f"Document Type: {doc_type} | Accuracy: {eval_acc_and_f1(sub_samples)[0]} | Question Number: {len(sub_samples)}\n"
            )