Commit 59e2977

Author: Tonny@Home

Fix prediction/backtest inconsistencies in ensemble scripts through improved data alignment and normalization logic, and update multilingual documentation.

1 parent 4deabf5 · commit 59e2977

7 files changed: +54 −11 lines

docs/02_BRUTE_FORCE_GUIDE.md

Lines changed: 8 additions & 0 deletions

@@ -51,6 +51,14 @@ python quantpits/scripts/brute_force_ensemble.py --use-groups --group-config con
 - **Weight Optimization**: compares Max Sharpe / Risk Parity optimization on the Top 10 single models
 - **Comprehensive Report**: automatically outputs the best combos and the MVP core models
 
+> [!NOTE]
+> **On metric discrepancies between single-model performance and ensemble backtests**
+>
+> When the fusion and brute-force scripts evaluate model performance, they apply strict **Z-Score Normalization** and **Data Alignment**. Together with TopK truncation, this means a single model's backtest results here may show small, expected differences from the raw-score backtest results seen during training via `run_analysis.py`:
+> 1. **Isolated normalization**: each model's prediction scores are first Z-Score normalized per day using only that model's own non-null predicted stock universe. This keeps scoring scales uniform across models, and one model's missing data cannot contaminate another model's distribution before normalization.
+> 2. **Deferred intersection alignment**: only when computing the mean or weighted score of a specific combo does the system intersect the models in that combo (i.e. run `dropna(how='any')`), so missing data in unrelated models cannot improperly shrink the combo's evaluation universe.
+> 3. **Aligned benchmark ranking**: all reference benchmark data (such as the single-model leaderboard backtest metrics) is dynamically sliced to the time window actually covered by the current evaluation matrix, giving you a consistent same-period comparison.
 
 ## Full Parameter List
 
 | Parameter | Default | Description |
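Point 1 of the note (isolated per-model normalization) can be sketched in a few lines of pandas. This is an illustrative sketch, not the repository's code: `zscore_norm` here is a hypothetical stand-in (assumed to be a per-day cross-sectional Z-score with population standard deviation), and the data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scripts' zscore_norm helper:
# per-day cross-sectional Z-score (ddof=0 is an assumption).
def zscore_norm(s: pd.Series) -> pd.Series:
    return s.groupby(level="datetime").transform(
        lambda x: (x - x.mean()) / x.std(ddof=0)
    )

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-01", "2020-01-02"]), ["A", "B", "C"]],
    names=["datetime", "instrument"],
)
preds = pd.DataFrame(
    {"m1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     "m2": [0.1, np.nan, 0.3, 0.2, np.nan, 0.6]},  # m2 has coverage gaps
    index=idx,
)

norm = pd.DataFrame(index=preds.index)
for col in preds.columns:
    # Each model is normalized on its own non-null universe only,
    # so m2's gaps never distort m1's distribution.
    norm[col] = zscore_norm(preds[col].dropna())

print(norm["m1"].round(3).tolist())
# → [-1.225, 0.0, 1.225, -1.225, 0.0, 1.225]
```

Because the Series is `dropna()`-ed before normalization, m2's Z-scores are computed over its two available names per day, and index alignment on assignment restores NaN where m2 had no prediction.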

docs/03_ENSEMBLE_FUSION_GUIDE.md

Lines changed: 8 additions & 0 deletions

@@ -193,6 +193,14 @@ output/
 > [!TIP]
 > The default combo additionally saves an `ensemble_{date}.csv` without the combo name, keeping backward compatibility with downstream scripts such as `order_gen.py`.
 
+> [!NOTE]
+> **On metric discrepancies between single-model performance and ensemble backtests**
+>
+> When the fusion and brute-force scripts evaluate model performance, they apply strict **Z-Score Normalization** and **Data Alignment**. Together with TopK truncation, this means a single model's backtest results here may show small, expected differences from the raw-score backtest results seen during training via `run_analysis.py`:
+> 1. **Isolated normalization**: each model's prediction scores are first Z-Score normalized per day using only that model's own non-null predicted stock universe. This keeps scoring scales uniform across models, and one model's missing data cannot contaminate another model's distribution before normalization.
+> 2. **Deferred intersection alignment**: only when computing the mean or weighted score of a specific combo does the system intersect the models in that combo (i.e. run `dropna(how='any')`), so missing data in unrelated models cannot improperly shrink the combo's evaluation universe.
+> 3. **Aligned benchmark ranking**: all reference benchmark data (such as the single-model leaderboard backtest metrics) is dynamically sliced to the time window actually covered by the current evaluation matrix, giving you a consistent same-period comparison.
 
 ## Typical Workflow
 
 ```bash

docs/en/02_BRUTE_FORCE_GUIDE.md

Lines changed: 8 additions & 0 deletions

@@ -51,6 +51,14 @@ python quantpits/scripts/brute_force_ensemble.py --use-groups --group-config con
 - **Weight Optimization**: Comparative trials on Top 10 single models simulating Max Sharpe / Risk Parity optimization mappings.
 - **Comprehensive Reporting**: Generates autonomous summaries of MVP models and superior fusions.
 
+> [!NOTE]
+> **Understanding Metric Discrepancies: Single Models vs. Ensemble Backtests**
+>
+> When evaluating model performance within fusion and brute-force architectures, strict **Z-Score Normalization** and **Data Alignment** processing govern the engine. Therefore, because of TopK position bounding, backtest results of a single model here may exhibit reasonable, micro-level disparities from the raw metrics evaluated naturally post-training (e.g. via `run_analysis.py`):
+> 1. **Isolated Normalization**: Each model calculates its daily cross-sectional Z-scores purely on its *own* non-null predicted universe. Scaling remains mathematically uniform, and a single model's signal scale cannot be skewed by other models' data coverage gaps prior to scoring.
+> 2. **Delayed Intersection**: Strict intersection dropping (`dropna(how='any')`) is executed only at the combo scoring phase and is limited to the subset of models within that specific combo iteration. This guarantees irrelevant sub-models don't unilaterally shrink the evaluated combination universe.
+> 3. **Benchmarking Alignment**: The sub-model evaluation leaderboard dynamically slices historical records to match the precise temporal boundaries established by the current ensemble matrix index, giving an "apples-to-apples" same-period comparison.
 
 ## Full Parameter List
 
 | Parameter | Default | Description |
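Point 2 of the note (deferred intersection) can be demonstrated with toy data. This is an illustrative sketch, not the repository's code; the model names and values are invented.

```python
import numpy as np
import pandas as pd

# Invented normalized-score matrix: m3 is a sparse, unrelated model.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-01"]), list("ABCDE")],
    names=["datetime", "instrument"],
)
norm_df = pd.DataFrame(
    {"m1": [1, 2, 3, 4, 5],
     "m2": [5, 4, 3, 2, 1],
     "m3": [np.nan] * 4 + [9.0]},
    index=idx, dtype=float,
)

# Global dropna (the pre-fix behavior): m3's gaps shrink everyone to 1 row.
global_rows = len(norm_df.dropna(how="any"))

# Deferred intersection: dropna only on the combo actually being scored.
combo = ["m1", "m2"]
combo_score = norm_df[combo].dropna(how="any").mean(axis=1)

print(global_rows, len(combo_score))
# → 1 5
```

The combo's evaluation universe keeps all five instruments, because the sparse `m3` is never part of the intersection when it is not in the combo.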

docs/en/03_ENSEMBLE_FUSION_GUIDE.md

Lines changed: 8 additions & 0 deletions

@@ -193,6 +193,14 @@ output/
 > [!TIP]
 > The `default` combo will additionally output a nameless `ensemble_{date}.csv` artifact, guaranteeing zero-modification compatibility for downstream utilities like `order_gen.py`.
 
+> [!NOTE]
+> **Understanding Metric Discrepancies: Single Models vs. Ensemble Backtests**
+>
+> When evaluating model performance within fusion and brute-force architectures, strict **Z-Score Normalization** and **Data Alignment** processing govern the engine. Therefore, because of TopK position bounding, backtest results of a single model here may exhibit reasonable, micro-level disparities from the raw metrics evaluated naturally post-training (e.g. via `run_analysis.py`):
+> 1. **Isolated Normalization**: Each model calculates its daily cross-sectional Z-scores purely on its *own* non-null predicted universe. Scaling remains mathematically uniform, and a single model's signal scale cannot be skewed by other models' data coverage gaps prior to scoring.
+> 2. **Delayed Intersection**: Strict intersection dropping (`dropna(how='any')`) is executed only at the combo scoring phase and is limited to the subset of models within that specific combo iteration. This guarantees irrelevant sub-models don't unilaterally shrink the evaluated combination universe.
+> 3. **Benchmarking Alignment**: The sub-model evaluation leaderboard dynamically slices historical records to match the precise temporal boundaries established by the current ensemble matrix index, giving an "apples-to-apples" same-period comparison.
 
 ## Typical Operations Sequence
 
 ```bash

quantpits/scripts/brute_force_ensemble.py

Lines changed: 6 additions & 5 deletions

@@ -193,13 +193,14 @@ def load_predictions(train_records):
     if not all_preds:
         raise ValueError("No prediction data was loaded!")
 
-    # Merge & Z-Score normalize
-    merged_df = pd.concat(all_preds, axis=1).dropna()
+    # Merge & Z-Score normalize (note: do not dropna here, so universe differences between models cannot shrink each other's samples)
+    merged_df = pd.concat(all_preds, axis=1)
     print(f"Merged data shape: {merged_df.shape}")
 
     norm_df = pd.DataFrame(index=merged_df.index)
     for col in merged_df.columns:
-        norm_df[col] = zscore_norm(merged_df[col])
+        # Z-Score each model independently, on its own non-null range
+        norm_df[col] = zscore_norm(merged_df[col].dropna())
 
     return norm_df, model_metrics
 
@@ -386,8 +387,8 @@ def run_single_backtest(
     if bt_config is None:
         bt_config = strategy.get_backtest_config(st_config)
 
-    # 1. Composite signal (equal-weight mean of normalized scores)
-    combo_score = norm_df[list(combo_models)].mean(axis=1)
+    # 1. Composite signal (equal-weight mean of normalized scores); intersect (dropna) on the current combo subset only
+    combo_score = norm_df[list(combo_models)].dropna(how='any').mean(axis=1)
 
     # 2. Prepare components
     # Note: Account must be created fresh each time, never reused (its state accumulates)

quantpits/scripts/ensemble_fusion.py

Lines changed: 12 additions & 5 deletions

@@ -229,13 +229,13 @@ def load_selected_predictions(train_records, selected_models):
     if not all_preds:
         raise ValueError("No prediction data was loaded!")
 
-    # Merge & Z-Score normalize
-    merged_df = pd.concat(all_preds, axis=1).dropna()
+    # Merge & Z-Score normalize (note: do not dropna early here, so one model's universe changes cannot affect the others)
+    merged_df = pd.concat(all_preds, axis=1)
     print(f"Merged data shape: {merged_df.shape}")
 
     norm_df = pd.DataFrame(index=merged_df.index)
     for col in merged_df.columns:
-        norm_df[col] = zscore_norm(merged_df[col])
+        norm_df[col] = zscore_norm(merged_df[col].dropna())
 
     return norm_df, model_metrics, loaded_models
 
@@ -1000,16 +1000,23 @@ def risk_analysis_and_leaderboard(report_df, norm_df, train_records,
     freq_suffix = '1week' if freq_val == 'week' else '1day'
     report_filename = f"portfolio_analysis/report_normal_{freq_suffix}.pkl"
 
+    # Take the evaluation window from the current combo_norm_df so sub-model evaluation is aligned
+    eval_start = str(norm_df.index.get_level_values('datetime').min().date())
+    eval_end = str(norm_df.index.get_level_values('datetime').max().date())
+
     for model_name in loaded_models:
         record_id = models.get(model_name)
         if not record_id:
             continue
         try:
             recorder = R.get_recorder(recorder_id=record_id, experiment_name=experiment_name)
             hist_report = recorder.load_object(report_filename)
+
+            # Clip the historical report to the current evaluation window
+            hist_report = hist_report[(hist_report.index >= pd.to_datetime(eval_start)) & (hist_report.index <= pd.to_datetime(eval_end))]
             all_reports[model_name] = hist_report
 
-            if 'return' in hist_report.columns:
+            if 'return' in hist_report.columns and not hist_report.empty:
                 # Up-sample sub-model report for consistent metric calculation
                 sub_da_df = pd.DataFrame(index=hist_report.index)
                 sub_da_df['收盘价值'] = hist_report['account']
@@ -1294,7 +1301,7 @@ def run_single_combo(combo_name, selected_models, method, manual_weights_str,
         print(f"Warning: combo {combo_name} has no valid models, skipping")
         return None
 
-    combo_norm_df = norm_df[combo_models]
+    combo_norm_df = norm_df[combo_models].dropna(how='any')
     combo_metrics = {m: model_metrics.get(m, 0) for m in combo_models}
 
     # ---- Stage 2: Correlation analysis ----
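The evaluation-window clipping added above can be shown standalone. This is a sketch with invented data: in the real script `hist_report` is loaded from a qlib recorder, and only the `get_level_values`/slicing pattern below mirrors the diff.

```python
import pandas as pd

# Invented normalized-score matrix covering two trading days.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-03-02", "2020-03-03"]), ["A"]],
    names=["datetime", "instrument"],
)
norm_df = pd.DataFrame({"m1": [0.1, 0.2]}, index=idx)

# Evaluation window = the span actually covered by norm_df.
eval_start = str(norm_df.index.get_level_values("datetime").min().date())
eval_end = str(norm_df.index.get_level_values("datetime").max().date())

# Invented sub-model history that extends beyond the window.
hist_report = pd.DataFrame(
    {"return": [0.01, 0.02, 0.03, 0.04],
     "account": [100.0, 101.0, 103.0, 106.0]},
    index=pd.date_range("2020-03-01", periods=4),
)

# Keep only rows inside [eval_start, eval_end] for a same-period comparison.
mask = (hist_report.index >= pd.to_datetime(eval_start)) & (
    hist_report.index <= pd.to_datetime(eval_end)
)
hist_report = hist_report[mask]
print(len(hist_report))
# → 2
```

Without this clipping, a sub-model's leaderboard metrics would be computed over a longer history than the ensemble's, making the comparison unfair.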

tests/quantpits/scripts/test_ensemble_fusion.py

Lines changed: 4 additions & 1 deletion

@@ -419,10 +419,13 @@ def test_risk_analysis_and_leaderboard(mock_R, mock_env, tmp_path):
     mock_recorder.load_object.return_value = report_df
     mock_R.get_recorder.return_value = mock_recorder
 
+    idx = pd.MultiIndex.from_tuples([(pd.Timestamp("2020-01-01"), "A"), (pd.Timestamp("2020-01-02"), "A")], names=["datetime", "instrument"])
+    norm_df = pd.DataFrame({"M1": [0.5, 0.6]}, index=idx)
+
     with patch('quantpits.scripts.ensemble_fusion.calculate_safe_risk') as mock_risk:
         mock_risk.return_value = {"annualized_return": 0.5}
         reports, lb = ef.risk_analysis_and_leaderboard(
-            report_df, None, train_records, ["M1"], "day", str(out_dir), "2020-01-01"
+            report_df, norm_df, train_records, ["M1"], "day", str(out_dir), "2020-01-01"
         )
 
     assert "Ensemble" in reports
