@@ -22,11 +22,41 @@ News influences the world around us—from stock markets reacting to financial r
 
 To address this, we introduce **MTBench** (**M**ultimodal **T**ime Series **Bench**mark), a dataset designed to evaluate how well AI models understand the relationship between text and time-series data. MTBench pairs financial news with stock market movements and weather reports with historical temperature changes. Unlike existing benchmarks that focus on text or numbers separately, MTBench challenges models to analyze both together, helping to assess their ability to detect trends, interpret news, and make predictions.
 
-- **Finance**: 200K+ news articles with stock movements from 2021–2023.
-- **Weather**: Historical temperature trends covering nearly two decades with reports of extreme events.
+- **Finance**: Two datasets, each with 20K news articles paired with stock time-series data.
+- **Weather**: 2K news and time-series pairs from 50 weather stations across the U.S. (see Figure 1).
 
-We evaluate state-of-the-art large language models (LLMs) on MTBench to measure their ability to link news with data trends (see our **Leaderboard**). The results reveal key challenges—models struggle with long-term pattern recognition, cause-and-effect relationships, and seamlessly combining insights from text and numbers.
+
+
+As shown in Figure 2, MTBench enables a range of complex reasoning tasks beyond simple forecasting, including semantic trend analysis, technical indicator prediction, and news-driven Q&A. These tasks challenge LLMs to integrate numerical patterns with contextual information.
+
+
+
+The news-driven QA task includes two sub-tasks: correlation prediction and multi-choice QA. As shown in Figure 3, this task requires models to analyze both text and time-series data, understanding the news content while predicting its potential impact on future trends based on historical time-series.
+
+
+
+Various state-of-the-art large language models (LLMs) were evaluated on MTBench to measure their ability to link news with time-series trends (see **Leaderboard**). The results reveal key challenges—models struggle with long-term pattern recognition, cause-and-effect relationships, and seamlessly combining insights from text and numbers.
 
 ## Leaderboard
 
-<Table/>
+<Table/>
+
+<details>
+<summary>Leaderboard for Time-Series Forecasting</summary>
+<Table1/>
+</details>
+
+<details>
+<summary>Leaderboard Trend Prediction</summary>
+<Table2/>
+</details>
+
+<details>
+<summary>Leaderboard for Technical Indicator Calculation</summary>
+<Table3/>
+</details>
+
+<details>
+<summary>Leaderboard for News-driven Question Answering</summary>
abstract: "We introduce MTBench, a large-scale benchmark designed to evaluate large language models (LLMs) on time series and text understanding across financial and weather domains. MTBench comprises of paired time-series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records.",
impact: "We evaluate state-of-the-art LLMs on MTBench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns. Our findings reveal significant challenges in current models, including difficulties in capturing long-term dependencies, interpreting causality in financial and weather trends, and effectively fusing multimodal information.",