Commit 21fb408

MTBench: Added separate leaderboards for different tasks and added figures and descriptions to the content
1 parent 6350ada commit 21fb408

7 files changed, +198 -11 lines changed


app/projects/mtbench/page.mdx

Lines changed: 37 additions & 7 deletions
@@ -1,6 +1,6 @@
 
 import { Authors, Badges} from '@/components/utils'
-import Table from '@/components/table'
+import { Table, Table1, Table2, Table3, Table4 } from '@/components/table'
 
 # MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering
 
@@ -11,8 +11,8 @@ import Table from '@/components/table'
 <Badges
   venue=""
   github="https://github.com/Graph-and-Geometric-Learning/MTBench"
-  arxiv=""
-  pdf=""
+  arxiv="https://arxiv.org/abs/2503.16858"
+  pdf="https://arxiv.org/pdf/2503.16858"
 />
 
 
@@ -22,11 +22,41 @@ News influences the world around us—from stock markets reacting to financial r
 
 To address this, we introduce **MTBench** (**M**ultimodal **T**ime Series **Bench**mark), a dataset designed to evaluate how well AI models understand the relationship between text and time-series data. MTBench pairs financial news with stock market movements and weather reports with historical temperature changes. Unlike existing benchmarks that focus on text or numbers separately, MTBench challenges models to analyze both together, helping to assess their ability to detect trends, interpret news, and make predictions.
 
-- **Finance**: 200K+ news articles with stock movements from 2021–2023.
-- **Weather**: Historical temperature trends covering nearly two decades with reports of extreme events.
+- **Finance**: Two datasets, each with 20K news articles paired with stock time-series data.
+- **Weather**: 2K news and time-series pairs from 50 weather stations across the U.S. (see Figure 1).
 
-We evaluate state-of-the-art large language models (LLMs) on MTBench to measure their ability to link news with data trends (see our **Leaderboard**). The results reveal key challenges: models struggle with long-term pattern recognition, cause-and-effect relationships, and seamlessly combining insights from text and numbers.
+![Figure 1. Geographical distribution of weather stations |scale=0.4](./assets/map.png)
+
+As shown in Figure 2, MTBench enables a range of complex reasoning tasks beyond simple forecasting, including semantic trend analysis, technical indicator prediction, and news-driven Q&A. These tasks challenge LLMs to integrate numerical patterns with contextual information.
+
+![Figure 2. An overview of tasks in MTBench |scale=0.4](./assets/diagram.png)
+
+The news-driven QA task includes two sub-tasks: correlation prediction and multi-choice QA. As shown in Figure 3, this task requires models to analyze both text and time-series data, understanding the news content while predicting its potential impact on future trends based on historical time-series.
+
+![Figure 3. An Example of Multi-choice QA and Correlation Prediction on Finance Dataset |scale=0.8](./assets/QA_sample.png)
+
+Various state-of-the-art large language models (LLMs) were evaluated on MTBench to measure their ability to link news with time-series trends (see **Leaderboard**). The results reveal key challenges: models struggle with long-term pattern recognition, cause-and-effect relationships, and seamlessly combining insights from text and numbers.
 
 ## Leaderboard
 
-<Table/>
+<Table/>
+
+<details>
+<summary>Leaderboard for Time-Series Forecasting</summary>
+<Table1/>
+</details>
+
+<details>
+<summary>Leaderboard for Trend Prediction</summary>
+<Table2/>
+</details>
+
+<details>
+<summary>Leaderboard for Technical Indicator Calculation</summary>
+<Table3/>
+</details>
+
+<details>
+<summary>Leaderboard for News-driven Question Answering</summary>
+<Table4/>
+</details>
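
The per-task leaderboards added in this commit report forecasting error as MAE, MAPE, and MSE (see the column labels in components/sortable-table.tsx). As a reminder of what those columns measure, here is a minimal sketch using the textbook definitions. The function names are illustrative and do not come from the MTBench repository, and the commit does not say whether MAPE is stored as a fraction or a percentage, so the percentage form below is an assumption.

```typescript
// Textbook error metrics. All names here are hypothetical helpers,
// not functions from the MTBench code base.

function checkInputs(actual: number[], predicted: number[]): void {
  if (actual.length === 0 || actual.length !== predicted.length) {
    throw new Error("actual and predicted must be non-empty and of equal length");
  }
}

// Mean absolute error: average of |actual - predicted|.
function mae(actual: number[], predicted: number[]): number {
  checkInputs(actual, predicted);
  const total = actual.reduce((sum, a, i) => sum + Math.abs(a - predicted[i]), 0);
  return total / actual.length;
}

// Mean squared error: average of (actual - predicted)^2.
function mse(actual: number[], predicted: number[]): number {
  checkInputs(actual, predicted);
  const total = actual.reduce((sum, a, i) => sum + (a - predicted[i]) ** 2, 0);
  return total / actual.length;
}

// Mean absolute percentage error, returned as a percentage (assumption).
// Undefined when an actual value is zero, so that case is rejected.
function mape(actual: number[], predicted: number[]): number {
  checkInputs(actual, predicted);
  const total = actual.reduce((sum, a, i) => {
    if (a === 0) {
      throw new Error("MAPE is undefined when an actual value is zero");
    }
    return sum + Math.abs((a - predicted[i]) / a);
  }, 0);
  return (total / actual.length) * 100;
}
```

MAE and MSE are in the units of the series (or their square), while MAPE is scale-free, which is presumably why the stock columns include it alongside MAE.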

components/sortable-table.tsx

Lines changed: 122 additions & 1 deletion
@@ -230,4 +230,125 @@ function SortableTable({ data }: { data: Data }) {
   );
 }
 
-export default SortableTable;
+// Generic function to create sortable tables with different headers
+function createSortableTable(headers: { key: SortKeys; label: string }[]) {
+  return function SortableTableComponent({ data }: { data: Data }) {
+    const [sortKey, setSortKey] = useState<SortKeys>(headers[0].key);
+    const [sortOrder, setSortOrder] = useState<SortOrder>("ascn");
+
+    const sortedData = useCallback(() => sortData({ tableData: data, sortKey, reverse: sortOrder === "desc" }), [data, sortKey, sortOrder]);
+
+    function changeSort(key: SortKeys) {
+      setSortOrder(sortOrder === "ascn" ? "desc" : "ascn");
+      setSortKey(key);
+    }
+
+    return (
+      <table>
+        <thead>
+          <tr>
+            {headers.map((row) => (
+              <th key={row.key}>
+                {row.label}{" "}
+                <SortButton columnKey={row.key} onClick={() => changeSort(row.key)} sortOrder={sortOrder} sortKey={sortKey} />
+              </th>
+            ))}
+          </tr>
+        </thead>
+        <tbody>
+          {sortedData().map((model) => (
+            <tr key={model.model_name}>
+              {/* <td className="headcol">{model.model_name}</td> */}
+              {headers.map((col) => (
+                <td key={col.key}>{model[col.key]}</td>
+              ))}
+            </tr>
+          ))}
+        </tbody>
+      </table>
+    );
+  };
+}
+
+// Define headers for each table
+const headers1: { key: SortKeys; label: string }[] = [
+  { key: "model_name", label: "Model" },
+  { key: "stock_price_forecast_7_day_mae_ts", label: "Stock price predict. \n for 7 days under TS (MAE)" },
+  { key: "stock_price_forecast_7_day_mae_ts_w_text", label: "Stock price predict. for 7 days under TS+Text (MAE)" },
+  { key: "stock_price_forecast_7_day_mape_ts", label: "Stock price predict. for 7 days under TS (MAPE)" },
+  { key: "stock_price_forecast_7_day_mape_ts_w_text", label: "Stock price predict. for 7 days under TS+Text (MAPE)" },
+  { key: "stock_price_forecast_30_day_mae_ts", label: "Stock price predict. for 30 days under TS (MAE)" },
+  { key: "stock_price_forecast_30_day_mae_ts_w_text", label: "Stock price predict. for 30 days under TS+Text (MAE)" },
+  { key: "stock_price_forecast_30_day_mape_ts", label: "Stock price predict. for 30 days under TS (MAPE)" },
+  { key: "stock_price_forecast_30_day_mape_ts_w_text", label: "Stock price predict. for 30 days under TS+Text (MAPE)" },
+  { key: "temp_forecast_7_day_mse_ts", label: "Temp. predict. for 7 days under TS (MSE)" },
+  { key: "temp_forecast_7_day_mse_ts_w_text", label: "Temp. predict. for 7 days under TS+Text (MSE)" },
+  { key: "temp_forecast_7_day_mae_ts", label: "Temp. predict. for 7 days under TS (MAE)" },
+  { key: "temp_forecast_7_day_mae_ts_w_text", label: "Temp. predict. for 7 days under TS+Text (MAE)" },
+  { key: "temp_forecast_14_day_mse_ts", label: "Temp. predict. for 14 days under TS (MSE)" },
+  { key: "temp_forecast_14_day_mse_ts_w_text", label: "Temp. predict. for 14 days under TS+Text (MSE)" },
+  { key: "temp_forecast_14_day_mae_ts", label: "Temp. predict. for 14 days under TS (MAE)" },
+  { key: "temp_forecast_14_day_mae_ts_w_text", label: "Temp. predict. for 14 days under TS+Text (MAE)" },
+];
+
+const headers2: { key: SortKeys; label: string }[] = [
+  { key: "model_name", label: "Model" },
+  { key: "stock_trend_predict_acc_7_day_3_way_ts", label: "Stock trend predict. for 7 days 3-way under TS (Acc)"},
+  { key: "stock_trend_predict_acc_7_day_3_way_ts_w_text", label: "Stock trend predict. for 7 days 3-way under TS+Text (Acc)"},
+  { key: "stock_trend_predict_acc_7_day_5_way_ts", label: "Stock trend predict. for 7 days 5-way under TS (Acc)"},
+  { key: "stock_trend_predict_acc_7_day_5_way_ts_w_text", label: "Stock trend predict. for 7 days 5-way under TS+Text (Acc)"},
+  { key: "stock_trend_predict_acc_30_day_3_way_ts", label: "Stock trend predict. for 30 days 3-way under TS (Acc)"},
+  { key: "stock_trend_predict_acc_30_day_3_way_ts_w_text", label: "Stock trend predict. for 30 days 3-way under TS+Text (Acc)"},
+  { key: "stock_trend_predict_acc_30_day_5_way_ts", label: "Stock trend predict. for 30 days 5-way under TS (Acc)"},
+  { key: "stock_trend_predict_acc_30_day_5_way_ts_w_text", label: "Stock trend predict. for 30 days 5-way under TS+Text (Acc)"},
+  { key: "temp_trend_predict_acc_past_ts", label: "Temp. trend predict. past under TS (Acc)"},
+  { key: "temp_trend_predict_acc_past_ts_w_text", label: "Temp. trend predict. past under TS+Text (Acc)"},
+  { key: "temp_trend_predict_acc_future_ts", label: "Temp. trend predict. future under TS (Acc)"},
+  { key: "temp_trend_predict_acc_future_ts_w_text", label: "Temp. trend predict. future under TS+Text (Acc)"},
+];
+
+const headers3: { key: SortKeys; label: string }[] = [
+  { key: "model_name", label: "Model" },
+  { key: "stock_indicator_predict_mse_7_day_macd_ts", label: "MACD predict. for 7 days under TS (MSE)"},
+  { key: "stock_indicator_predict_mse_7_day_macd_ts_w_text", label: "MACD predict. for 7 days under TS+Text (MSE)"},
+  { key: "stock_indicator_predict_mse_7_day_bb_ts", label: "Bollinger Bands predict. for 7 days under TS (MSE)"},
+  { key: "stock_indicator_predict_mse_7_day_bb_ts_w_text", label: "Bollinger Bands predict. for 7 days under TS+Text (MSE)"},
+  { key: "stock_indicator_predict_mse_30_day_macd_ts", label: "MACD predict. for 30 days under TS (MSE)"},
+  { key: "stock_indicator_predict_mse_30_day_macd_ts_w_text", label: "MACD predict. for 30 days under TS+Text (MSE)"},
+  { key: "stock_indicator_predict_mse_30_day_bb_ts", label: "Bollinger Bands predict. for 30 days under TS (MSE)"},
+  { key: "stock_indicator_predict_mse_30_day_bb_ts_w_text", label: "Bollinger Bands predict. for 30 days under TS+Text (MSE)"},
+  { key: "temp_predict_max_mse_ts", label: "Temp. predict. max under TS (MSE)"},
+  { key: "temp_predict_max_mse_ts_w_text", label: "Temp. predict. max under TS+Text (MSE)"},
+  { key: "temp_predict_max_mae_ts", label: "Temp. predict. max under TS (MAE)"},
+  { key: "temp_predict_max_mae_ts_w_text", label: "Temp. predict. max under TS+Text (MAE)"},
+  { key: "temp_predict_min_mse_ts", label: "Temp. predict. min under TS (MSE)"},
+  { key: "temp_predict_min_mse_ts_w_text", label: "Temp. predict. min under TS+Text (MSE)"},
+  { key: "temp_predict_min_mae_ts", label: "Temp. predict. min under TS (MAE)"},
+  { key: "temp_predict_min_mae_ts_w_text", label: "Temp. predict. min under TS+Text (MAE)"},
+  { key: "temp_predict_diff_mse_ts", label: "Temp. predict. diff. under TS (MSE)"},
+  { key: "temp_predict_diff_mse_ts_w_text", label: "Temp. predict. diff. under TS+Text (MSE)"},
+  { key: "temp_predict_diff_mae_ts", label: "Temp. predict. diff. under TS (MAE)"},
+  { key: "temp_predict_diff_mae_ts_w_text", label: "Temp. predict. diff. under TS+Text (MAE)"},
+];
+
+const headers4: { key: SortKeys; label: string }[] = [
+  { key: "model_name", label: "Model" },
+  { key: "news_stock_corr_acc_7_day_3_way", label: "News stock corr. for 7 days 3-way (Acc)"},
+  { key: "news_stock_corr_acc_7_day_5_way", label: "News stock corr. for 7 days 5-way (Acc)"},
+  { key: "news_stock_corr_acc_30_day_3_way", label: "News stock corr. for 30 days 3-way (Acc)"},
+  { key: "news_stock_corr_acc_30_day_5_way", label: "News stock corr. for 30 days 5-way (Acc)"},
+  { key: "news_driven_mcqa_acc_7_day_fin", label: "News driven MCQA for 7 days for Finance data (Acc)"},
+  { key: "news_driven_mcqa_acc_7_day_weather", label: "News driven MCQA for 7 days for Weather data (Acc)"},
+  { key: "news_driven_mcqa_acc_30_day_fin", label: "News driven MCQA for 30 days for Finance data (Acc)"},
+  { key: "news_driven_mcqa_acc_30_day_weather", label: "News driven MCQA for 30 days for Weather data (Acc)"}
+];
+
+// Create separate sortable tables
+const SortableTable1 = createSortableTable(headers1);
+const SortableTable2 = createSortableTable(headers2);
+const SortableTable3 = createSortableTable(headers3);
+const SortableTable4 = createSortableTable(headers4);
+
+// export default SortableTable;
+// Export all tables
+export { SortableTable, SortableTable1, SortableTable2, SortableTable3, SortableTable4 };
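
In the hunk above, `changeSort` flips `sortOrder` on every click, including a click on a different column, so the first click on a new column can land on descending order. The body of `sortData` is not part of this commit. The sketch below is therefore a stand-alone illustration with no React dependency: `nextSortState` and `sortRows` are hypothetical helpers, `Row` is a simplified stand-in for the real `Data` type, and only the `"ascn"` / `"desc"` literals are taken from the diff. It shows one possible alternative in which switching columns resets the order to ascending.

```typescript
// Simplified stand-ins for the types used in sortable-table.tsx (assumptions).
type SortOrder = "ascn" | "desc";
type Row = Record<string, string | number>;

interface SortState {
  sortKey: string;
  sortOrder: SortOrder;
}

// Hypothetical alternative to changeSort: toggle only when the same column
// is clicked again; a newly selected column always starts ascending.
function nextSortState(state: SortState, clickedKey: string): SortState {
  if (clickedKey !== state.sortKey) {
    return { sortKey: clickedKey, sortOrder: "ascn" };
  }
  return {
    sortKey: clickedKey,
    sortOrder: state.sortOrder === "ascn" ? "desc" : "ascn",
  };
}

// Hypothetical stand-in for sortData: returns a sorted copy and leaves the
// input untouched. Numbers compare numerically, everything else as strings.
function sortRows(rows: Row[], state: SortState): Row[] {
  const direction = state.sortOrder === "ascn" ? 1 : -1;
  return [...rows].sort((a, b) => {
    const x = a[state.sortKey];
    const y = b[state.sortKey];
    if (typeof x === "number" && typeof y === "number") {
      return (x - y) * direction;
    }
    return String(x).localeCompare(String(y)) * direction;
  });
}
```

Keeping the state transition in a pure function like this also makes it easy to unit-test without rendering a component.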

components/table.tsx

Lines changed: 38 additions & 2 deletions
@@ -1,5 +1,5 @@
 import { useState } from "react";
-import SortableTable from "./sortable-table";
+import { SortableTable, SortableTable1, SortableTable2, SortableTable3, SortableTable4 } from "./sortable-table";
 import data from "../app/projects/mtbench/data/data_leaderboard.json";
 import "../styles/table.css";
 
@@ -11,4 +11,40 @@ function Table() {
   );
 }
 
-export default Table;
+function Table1() {
+  return (
+    <div className="table-wrapper">
+      <SortableTable1 data={data} />
+    </div>
+  );
+}
+
+function Table2() {
+  return (
+    <div className="table-wrapper">
+      <SortableTable2 data={data} />
+    </div>
+  );
+}
+
+function Table3() {
+  return (
+    <div className="table-wrapper">
+      <SortableTable3 data={data} />
+    </div>
+  );
+}
+
+function Table4() {
+  return (
+    <div className="table-wrapper">
+      <SortableTable4 data={data} />
+    </div>
+  );
+}
+
+// Export the combined table and the four per-task tables
+export { Table, Table1, Table2, Table3, Table4 };
+
+// export default Table;
+
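
All four wrappers pass the same `data` object to their table, and each table then renders only the columns named in its header list. A minimal sketch of that column selection, independent of React, might look like the following. `projectColumns` is a hypothetical helper, `Row` is a simplified stand-in for the leaderboard row type, and rendering a missing key as an empty string is an assumption, not behavior shown in the commit.

```typescript
// Simplified stand-ins for the leaderboard row and header types (assumptions).
type Row = Record<string, string | number>;
type Header = { key: string; label: string };

// Hypothetical helper: for every row, pick the cells named by `headers`,
// in header order. Keys absent from a row become empty strings.
function projectColumns(rows: Row[], headers: Header[]): (string | number)[][] {
  return rows.map((row) => headers.map((header) => row[header.key] ?? ""));
}

// Usage with made-up values: one shared row, two different column subsets.
const exampleRows: Row[] = [
  { model_name: "model-a", forecast_mae: 1.5, trend_acc: 0.6 },
];
const forecastCells = projectColumns(exampleRows, [
  { key: "model_name", label: "Model" },
  { key: "forecast_mae", label: "Forecast (MAE)" },
]);
const trendCells = projectColumns(exampleRows, [
  { key: "model_name", label: "Model" },
  { key: "trend_acc", label: "Trend (Acc)" },
]);
```

Because only the header list differs between tables, a single JSON file can back every leaderboard, which matches how this commit reuses `data_leaderboard.json`.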

config/publications.ts

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ export const publications: Publication[] = [
     venue: "",
     page: "mtbench",
     code: "https://github.com/Graph-and-Geometric-Learning/MTBench",
-    paper: "",
+    paper: "https://arxiv.org/abs/2503.16858",
     abstract: "We introduce MTBench, a large-scale benchmark designed to evaluate large language models (LLMs) on time series and text understanding across financial and weather domains. MTBench comprises paired time-series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records.",
     impact: "We evaluate state-of-the-art LLMs on MTBench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns. Our findings reveal significant challenges in current models, including difficulties in capturing long-term dependencies, interpreting causality in financial and weather trends, and effectively fusing multimodal information.",
     tags: [Tag.Benchmark, Tag.MultiModalFoundationModel],
