Commit faeef3a (1 parent: 6fdd32f)

2026-03-04 DOT-Hybrid holdout validation (OWA 0.885→0.877, Quarterly -1.25%, Monthly -2.55%)

File tree: 13 files changed, +2477 / -36 lines

CHANGELOG.md

Lines changed: 32 additions & 0 deletions

@@ -5,6 +5,38 @@ All notable changes to Vectrix will be documented in this file.

 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

Added content:

## [0.0.12] - 2026-03-04

DOT-Hybrid holdout validation release — 8-way config selection for period>1 data now uses holdout validation instead of in-sample MAE, reducing overfitting on Quarterly (-1.25%) and Monthly (-2.55%) forecasts. AVG OWA improved from 0.8831 to ~0.876.

### Changed

**DOT-Hybrid Engine Holdout Validation**

- `engine/dot.py`: `_fitHybrid()` now uses holdout-based config selection when `period > 1` and sufficient data is available
- When `period > 1`: splits data into train/validation segments, evaluates the 8 variant configurations on the held-out segment, selects the best by validation MAE, then refits on the full data
- When `period <= 1` (Yearly, Daily, Weekly): preserves the original in-sample MAE selection — no behavioral change
- When `period >= 24` (Hourly): unchanged, uses the classic DOT path as before
- Added a `_predictVariantSteps()` helper method for multi-step holdout prediction
- Net effect: Quarterly OWA -1.25%, Monthly OWA -2.55%, zero regression on other groups
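The train/validation split behind this change can be sketched as a pair of standalone helpers. These names are hypothetical — in the actual commit the logic is inline in `_fitHybrid()` — but the sizing rules mirror the `engine/dot.py` diff below: roughly the last 20% of the series, capped at two seasonal cycles and at a third of the series, active only when at least four full cycles exist.

```python
def use_holdout(n: int, period: int) -> bool:
    """Holdout config selection activates only for seasonal data
    (period > 1) with at least four full cycles available."""
    return period > 1 and n >= period * 4

def holdout_size(n: int, period: int) -> int:
    """Length of the held-out validation segment (sketch)."""
    size = max(1, min(n // 5, period * 2))  # ~20% of series, <= 2 cycles
    return min(size, n // 3)                # never more than a third
```

For a Monthly series of length 48 (period 12), this holds out the last 9 points for validation and refits the winning config on all 48 afterwards.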
### Added

**Experiment Files (4 new DOT improvement experiments)**

- `modelCreation/043_dotAutoPeriodHoldout.py`: ACF-based auto period detection (REJECTED, +1.29%) + holdout validation (ACCEPTED, -0.79%)
- `modelCreation/044_dailyWeeklySpecialist.py`: Classic DOT for Weekly (ACCEPTED, -2.18%) + Core3 ensemble for Daily/Weekly (REJECTED, +21%/+8%)
- `modelCreation/045_integratedImprovement.py`: Integrated holdout + Weekly classic (AVG -0.94%, but Yearly +1.16% regression)
- `modelCreation/046_finalIntegration.py`: Final rule validation — period<=1 classic vs period>1 holdout isolation confirmed safe
### Key Findings

- ACF-based auto period detection detects spurious short periods (2, 3) from noise — harmful for accuracy
- Holdout validation eliminates in-sample overfitting in 8-way config selection for seasonal data
- Core3 ensemble (DOT+CES+4Theta) is harmful for period=1 data — CES/4Theta struggle without seasonality
- Classic DOT is good for Weekly (period=1) but catastrophic for Yearly (period=1) — Yearly needs Hybrid's trend exploration
- Safe improvement scope: only `1 < period < 24` benefits from holdout validation

[0.0.12]: https://github.com/eddmpython/vectrix/compare/v0.0.11...v0.0.12
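The safe-scope rule above amounts to a three-way dispatch on the seasonal period. A minimal sketch (the function name is hypothetical; the real routing lives inside the DOT engine's fit path):

```python
def select_fit_path(period: int) -> str:
    """Route a series to a DOT fitting strategy by seasonal period."""
    if period >= 24:
        return "classic"          # Hourly: original 3-parameter DOT
    if period > 1:
        return "hybrid-holdout"   # Quarterly/Monthly: holdout config selection
    return "hybrid-insample"      # Yearly/Daily/Weekly: in-sample MAE, unchanged
```

Keeping `period <= 1` on the original path is what isolates Yearly from the Weekly-classic regression observed in experiment 045.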
## [0.0.11] - 2026-03-04

Progressive Disclosure release — Easy API now supports Level 2 guided control with model selection, ensemble strategy, and confidence interval parameters, while maintaining full backward compatibility with Level 1 zero-config usage.

README.md

Lines changed: 5 additions & 5 deletions

@@ -346,13 +346,13 @@
 Evaluated on **M4 Competition 100,000 time series** (2,000 samples per frequency, seed=42). OWA < 1.0 means better than Naive2.

-**DOT-Hybrid** (single model, OWA 0.885 — beats M4 #18 Theta 0.897):
+**DOT-Hybrid** (single model, OWA 0.877 — beats M4 #18 Theta 0.897):

 | Frequency | OWA | vs Naive2 |
 |:----------|:---:|:---------:|
 | Yearly | **0.797** | -20.3% |
-| Quarterly | **0.905** | -9.5% |
-| Monthly | **0.933** | -6.7% |
+| Quarterly | **0.894** | -10.6% |
+| Monthly | **0.897** | -10.3% |
 | Weekly | **0.959** | -4.1% |
 | Daily | **0.996** | -0.4% |
 | Hourly | **0.722** | -27.8% |

@@ -364,7 +364,7 @@
 | #1 | ES-RNN (Smyl) | 0.821 |
 | #2 | FFORMA | 0.838 |
 | #11 | 4Theta | 0.874 |
-|| **Vectrix DOT-Hybrid** | **0.885** |
+|| **Vectrix DOT-Hybrid** | **0.877** |
 | #18 | Theta | 0.897 |

 Full results with sMAPE/MASE breakdown: [benchmarks](https://eddmpython.github.io/vectrix/docs/benchmarks/)

@@ -523,7 +523,7 @@
 | Priority | Area | Current | Target | Status |
 |:---------|:-----|:--------|:-------|:-------|
-| **P0** | M4 Accuracy | OWA 0.885 | OWA < 0.850 | In progress |
+| **P0** | M4 Accuracy | OWA 0.877 | OWA < 0.850 | In progress |
 | **P1** | Easy API Progressive Disclosure | Level 1 only | Levels 1-3 | In progress |
 | **P2** | Pipeline Speed | 48ms forecast() | < 10ms | Planned |
 | **P3** | Foundation Model Depth | Basic wrappers | Full integration | Planned |

README_KR.md

Lines changed: 5 additions & 5 deletions (Korean content translated)

@@ -343,13 +343,13 @@
 Benchmarked on **M4 Competition 100,000 time series** (2,000 samples per frequency, seed=42). OWA < 1.0 means better than Naive2.

-**DOT-Hybrid** (single model, OWA 0.885 — beats M4 #18 Theta 0.897):
+**DOT-Hybrid** (single model, OWA 0.877 — beats M4 #18 Theta 0.897):

 | Frequency | OWA | vs Naive2 |
 |:-----|:---:|:---------:|
 | Yearly | **0.797** | -20.3% |
-| Quarterly | **0.905** | -9.5% |
-| Monthly | **0.933** | -6.7% |
+| Quarterly | **0.894** | -10.6% |
+| Monthly | **0.897** | -10.3% |
 | Weekly | **0.959** | -4.1% |
 | Daily | **0.996** | -0.4% |
 | Hourly | **0.722** | -27.8% |

@@ -361,7 +361,7 @@
 | #1 | ES-RNN (Smyl) | 0.821 |
 | #2 | FFORMA | 0.838 |
 | #11 | 4Theta | 0.874 |
-|| **Vectrix DOT-Hybrid** | **0.885** |
+|| **Vectrix DOT-Hybrid** | **0.877** |
 | #18 | Theta | 0.897 |

 Full sMAPE/MASE results: [benchmarks](https://eddmpython.github.io/vectrix/docs/benchmarks/)

@@ -520,7 +520,7 @@
 | Priority | Area | Current | Target | Status |
 |:---------|:-----|:-----|:-----|:-----|
-| **P0** | M4 Accuracy | OWA 0.885 | OWA < 0.850 | In progress |
+| **P0** | M4 Accuracy | OWA 0.877 | OWA < 0.850 | In progress |
 | **P1** | Easy API Progressive Disclosure | Level 1 only | Levels 1-3 | In progress |
 | **P2** | Pipeline Speed | 48ms forecast() | < 10ms | Planned |
 | **P3** | Foundation Model Depth | Basic wrappers | Full integration | Planned |

docs/benchmarks.ko.md

Lines changed: 4 additions & 4 deletions (Korean content translated)

@@ -13,12 +13,12 @@
 | Frequency | DOT-Hybrid OWA | vs M4 |
 |------|:--------------:|---------|
 | Yearly | **0.797** | Near M4 #1 ES-RNN (0.821) |
-| Quarterly | **0.905** | Competitive with M4 top methods |
-| Monthly | **0.933** | Solid mid-table |
+| Quarterly | **0.894** | Competitive with M4 top methods |
+| Monthly | **0.897** | Competitive with M4 top methods |
 | Weekly | **0.959** | Beats Naive2 |
 | Daily | **0.996** | Parity with Naive2 |
 | Hourly | **0.722** | World-class, near M4 winner level |
-| **AVG** | **0.885** | **Beats M4 #18 Theta (0.897)** |
+| **AVG** | **0.877** | **Beats M4 #18 Theta (0.897)** |

 ### M4 Official Leaderboard Comparison

@@ -29,7 +29,7 @@
 | 3 | Theta (Fiorucci) | 0.854 |
 | 11 | 4Theta (Petropoulos) | 0.874 |
 | 18 | Theta (Assimakopoulos) | 0.897 |
-| -- | **Vectrix DOT-Hybrid** | **0.885** |
+| -- | **Vectrix DOT-Hybrid** | **0.877** |

 Vectrix DOT-Hybrid outperforms **all pure statistical methods** in the M4 Competition. All higher-ranked methods are hybrids (ES-RNN = LSTM + ETS, FFORMA = meta-learning ensemble).

docs/benchmarks.md

Lines changed: 4 additions & 4 deletions

@@ -13,12 +13,12 @@
 | Frequency | DOT-Hybrid OWA | M4 Context |
 |-----------|:--------------:|------------|
 | Yearly | **0.797** | Near M4 #1 ES-RNN (0.821) |
-| Quarterly | **0.905** | Competitive with M4 top methods |
-| Monthly | **0.933** | Solid mid-table performance |
+| Quarterly | **0.894** | Competitive with M4 top methods |
+| Monthly | **0.897** | Competitive with M4 top methods |
 | Weekly | **0.959** | Beats Naive2 |
 | Daily | **0.996** | Near parity with Naive2 |
 | Hourly | **0.722** | World-class, near M4 winner level |
-| **AVG** | **0.885** | **Beats M4 #18 Theta (0.897)** |
+| **AVG** | **0.877** | **Beats M4 #18 Theta (0.897)** |

 ### M4 Competition Leaderboard Context

@@ -29,7 +29,7 @@
 | 3 | Theta (Fiorucci) | 0.854 |
 | 11 | 4Theta (Petropoulos) | 0.874 |
 | 18 | Theta (Assimakopoulos) | 0.897 |
-| -- | **Vectrix DOT-Hybrid** | **0.885** |
+| -- | **Vectrix DOT-Hybrid** | **0.877** |

 Vectrix DOT-Hybrid outperforms **all pure statistical methods** in the M4 Competition. Only hybrid methods (ES-RNN = LSTM + ETS, FFORMA = meta-learning ensemble) rank higher.
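For context on the metric these tables report: in the M4 Competition, OWA is the average of a method's sMAPE and MASE, each taken relative to the Naive2 benchmark, so OWA < 1.0 means better than Naive2 overall. A minimal sketch:

```python
def owa(smape: float, mase: float,
        smape_naive2: float, mase_naive2: float) -> float:
    """Overall Weighted Average: mean of relative sMAPE and relative MASE."""
    return 0.5 * (smape / smape_naive2 + mase / mase_naive2)
```

A method matching Naive2 on both components scores exactly 1.0; beating Naive2 by 20% on both yields 0.8.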

docs/blog/002_howWeKnowForecastsWork.md

Lines changed: 4 additions & 4 deletions

@@ -129,7 +129,7 @@
 ### 2. They guide tool selection

-When choosing a forecasting library, you want evidence. "Our library uses advanced algorithms" is marketing. "Our library achieves OWA 0.885 on the M4 Competition dataset" is a measurable claim you can verify.
+When choosing a forecasting library, you want evidence. "Our library uses advanced algorithms" is marketing. "Our library achieves OWA 0.877 on the M4 Competition dataset" is a measurable claim you can verify.

 ### 3. They reveal method strengths and weaknesses

@@ -308,12 +308,12 @@
 | Frequency | Vectrix OWA | Context |
 |-----------|:-----------:|---------|
 | Yearly | **0.797** | Near M4 winner level |
-| Quarterly | **0.905** | Competitive with top methods |
-| Monthly | **0.933** | Solid mid-table |
+| Quarterly | **0.894** | Competitive with top methods |
+| Monthly | **0.897** | Competitive with top methods |
 | Weekly | **0.959** | Beats Naive2 |
 | Daily | **0.996** | Near parity with Naive2 |
 | Hourly | **0.722** | World-class |
-| **Average** | **0.885** | **Outperforms M4 #18 Theta (0.897)** |
+| **Average** | **0.877** | **Outperforms M4 #18 Theta (0.897)** |

 These numbers aren't cherry-picked or inflated. They represent honest performance — strong in some frequencies, room for improvement in others. We publish our benchmark code so you can [reproduce every number](https://eddmpython.github.io/vectrix/docs/benchmarks/).

docs/blog/assets/benchmark-hero.svg

Lines changed: 1 addition & 1 deletion

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 [project]
 name = "vectrix"
-version = "0.0.11"
+version = "0.0.12"
 description = "Zero-config time series forecasting & analysis library. 30+ models with built-in Rust engine for blazing-fast performance."
 readme = "README.md"
 license = {file = "LICENSE"}

src/vectrix/engine/dot.py

Lines changed: 67 additions & 12 deletions

@@ -9,8 +9,9 @@
 (2 trend types x 2 model types x 2 season types) for improved
 accuracy on low-frequency data. For period>=24, uses original
 3-parameter optimization which excels on high-frequency data.
+For period>1, uses holdout validation for config selection (E043).

-M4 Competition benchmark: OWA 0.885 (DOT-Hybrid) vs 0.905 (original).
+M4 Competition benchmark: OWA 0.877 (DOT-Hybrid) vs 0.905 (original).
 """

 from typing import Tuple

@@ -274,16 +275,27 @@ def _fitHybrid(self, y: np.ndarray) -> 'DynamicOptimizedTheta':
         else:
             base = 1.0

+        useHoldout = self.period > 1 and n >= self.period * 4
+        if useHoldout:
+            holdoutSize = max(1, min(n // 5, self.period * 2))
+            holdoutSize = min(holdoutSize, n // 3)
+            trainPart = scaled[:n - holdoutSize]
+            valPart = scaled[n - holdoutSize:]
+            nTrain = len(trainPart)
+        else:
+            trainPart = scaled
+
         bestMae = np.inf
         bestConfig = None
-        bestModel = None
+
+        fitData = trainPart if useHoldout else scaled

         for seasonType in seasonTypes:
             if seasonType != 'none':
-                seasonal, deseasonalized = self._deseasonalizeAdvanced(scaled, self.period, seasonType)
+                seasonal, deseasonalized = self._deseasonalizeAdvanced(fitData, self.period, seasonType)
             else:
                 seasonal = None
-                deseasonalized = scaled
+                deseasonalized = fitData

             for trendType in ['linear', 'exponential']:
                 thetaLine0 = self._fitTrendLine(deseasonalized, trendType)

@@ -300,20 +312,44 @@ def _fitHybrid(self, y: np.ndarray) -> 'DynamicOptimizedTheta':
                 if result is None:
                     continue

-                fittedVals = result['fittedValues']
-                if seasonal is not None:
-                    fittedVals = self._reseasonalize(fittedVals, seasonal, seasonType)
+                if useHoldout:
+                    valPred = self._predictVariantSteps(result, trendType, modelType, holdoutSize)
+                    if seasonal is not None:
+                        for h in range(holdoutSize):
+                            idx = (nTrain + h) % self.period
+                            if seasonType == 'multiplicative':
+                                valPred[h] *= seasonal[idx]
+                            else:
+                                valPred[h] += seasonal[idx]
+                    mae = np.mean(np.abs(valPart - valPred))
+                else:
+                    fittedVals = result['fittedValues']
+                    if seasonal is not None:
+                        fittedVals = self._reseasonalize(fittedVals, seasonal, seasonType)
+                    mae = np.mean(np.abs(fitData - fittedVals))

-                mae = np.mean(np.abs(scaled - fittedVals))
                 if mae < bestMae:
                     bestMae = mae
                     bestConfig = (trendType, modelType, seasonType)
-                    bestModel = result
-                    bestModel['seasonal'] = seasonal
-                    bestModel['base'] = base

+        if bestConfig is None:
+            return self._fitClassic(y)
+
+        trendType, modelType, seasonType = bestConfig
+        if seasonType != 'none':
+            seasonal, deseasonalized = self._deseasonalizeAdvanced(scaled, self.period, seasonType)
+        else:
+            seasonal = None
+            deseasonalized = scaled
+
+        thetaLine0 = self._fitTrendLine(deseasonalized, trendType)
+        if thetaLine0 is None:
+            return self._fitClassic(y)
+        bestModel = self._fitVariant(deseasonalized, thetaLine0, trendType, modelType)
         if bestModel is None:
             return self._fitClassic(y)
+        bestModel['seasonal'] = seasonal
+        bestModel['base'] = base

         self._hybridMode = True
         self._hybridConfig = bestConfig

@@ -322,10 +358,29 @@ def _fitHybrid(self, y: np.ndarray) -> 'DynamicOptimizedTheta':
         self.intercept = bestModel['intercept']
         self.slope = bestModel['slope']
         self.lastLevel = bestModel['lastLevel']
-        self.residuals = y - bestModel['fittedValues'] * base
+
+        fittedVals = bestModel['fittedValues']
+        if seasonal is not None:
+            fittedVals = self._reseasonalize(fittedVals, seasonal, seasonType)
+        self.residuals = y - fittedVals * base
         self.fitted = True
         return self

+    def _predictVariantSteps(self, model, trendType, modelType, steps):
+        n = model['n']
+        futureX = np.arange(n, n + steps, dtype=np.float64)
+        if trendType == 'exponential':
+            forecastTrend = np.exp(model['intercept'] + model['slope'] * futureX)
+        else:
+            forecastTrend = model['intercept'] + model['slope'] * futureX
+        forecastSES = np.full(steps, model['lastLevel'])
+        if modelType == 'additive':
+            w = 1.0 / max(model['theta'], 1.0)
+            return w * forecastSES + (1.0 - w) * forecastTrend
+        invTheta = 1.0 / max(model['theta'], 1.0)
+        return np.power(np.maximum(forecastSES, 1e-10), invTheta) * \
+               np.power(np.maximum(forecastTrend, 1e-10), 1.0 - invTheta)
+
     def predict(self, steps: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
         if not self.fitted:
             raise ValueError("Model not fitted.")
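The `_predictVariantSteps()` helper added in this diff blends an SES flat forecast with a trend line under a theta-derived weight: a weighted sum for additive variants, a floored geometric mean for multiplicative ones. A self-contained restatement of that formula (plain-dict model, outside the class, for illustration only):

```python
import numpy as np

def predict_variant_steps(model: dict, trendType: str,
                          modelType: str, steps: int) -> np.ndarray:
    """Multi-step forecast for one hybrid variant (mirrors the diff above)."""
    x = np.arange(model['n'], model['n'] + steps, dtype=np.float64)
    if trendType == 'exponential':
        trend = np.exp(model['intercept'] + model['slope'] * x)
    else:
        trend = model['intercept'] + model['slope'] * x
    ses = np.full(steps, model['lastLevel'])   # SES extrapolates flat
    w = 1.0 / max(model['theta'], 1.0)         # SES weight from theta
    if modelType == 'additive':
        return w * ses + (1.0 - w) * trend
    # Multiplicative: geometric combination, floored to stay positive
    return np.power(np.maximum(ses, 1e-10), w) * \
           np.power(np.maximum(trend, 1e-10), 1.0 - w)
```

With theta=2 the forecast sits halfway between the last SES level and the extrapolated trend, which is why validating these multi-step paths on a holdout catches configs whose in-sample fit flattered them.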
