|
| 1 | + |
| 2 | +# Example 29: Block Bootstrap for Autocorrelation |
| 3 | + |
| 4 | +## The "Why": The Autocorrelation Problem |
| 5 | +Standard Mann-Kendall tests assume that data points are **independent** (like rolling a die). |
| 6 | +However, environmental data is often **autocorrelated**: |
| 7 | +* A hot day is likely to be followed by another hot day. |
| 8 | +* A drought year might influence groundwater levels for the next year. |
| 9 | + |
| 10 | +**The Danger:** Autocorrelation reduces the "effective" amount of information in your data. If you ignore it, the standard test underestimates the variance and often reports a **statistically significant trend when none exists (False Positive).** |
| 11 | + |
| 12 | +**The Solution:** Correcting for autocorrelation. We demonstrate three approaches: |
| 13 | +1. **Standard Test (Uncorrected):** The naive baseline. |
| 14 | +2. **Yue & Wang (2004) Correction:** Adjusts the sample size based on lag-1 autocorrelation. |
| 15 | +3. **Block Bootstrap:** A robust, non-parametric resampling method that preserves correlation structure. |
| 16 | + |
| 17 | +## The "How": Code Walkthrough |
| 18 | + |
| 19 | +We will simulate data with strong autocorrelation (AR1 process) but **no trend**, and compare the results. Note the **Confidence Intervals** in each case—uncorrected tests often have overly narrow intervals that exclude zero (indicating a trend), while corrected tests widen the interval. |
| 20 | + |
| 21 | +### Step 1: Python Code |
| 22 | +```python |
| 23 | +import numpy as np |
| 24 | +import pandas as pd |
| 25 | +import MannKS as mk |
| 26 | +import matplotlib.pyplot as plt |
| 27 | + |
| 28 | +# 1. Generate Autocorrelated Data (AR1 Process) |
| 29 | +# Autocorrelation (serial dependence) is common in environmental data. |
| 30 | +# A high value today often means a high value tomorrow, even without a long-term trend. |
| 31 | +# Standard Mann-Kendall assumes independence and often produces FALSE POSITIVES |
| 32 | +# (finding a 'trend' that is just persistence). |
| 33 | + |
| 34 | +np.random.seed(45) |
| 35 | +n = 100 |
| 36 | +t = pd.date_range('2000-01-01', periods=n, freq='ME') |
| 37 | + |
| 38 | +# Generate AR(1) noise: x[t] = 0.8 * x[t-1] + noise |
| 39 | +# High correlation (phi=0.8) makes the data "wander" and look like a trend. |
| 40 | +x = np.zeros(n) |
| 41 | +x[0] = np.random.normal(0, 1) |
| 42 | +for i in range(1, n): |
| 43 | + x[i] = 0.8 * x[i-1] + np.random.normal(0, 1) |
| 44 | + |
| 45 | +# Add a very weak trend (or no trend) |
| 46 | +trend_signal = 0.0 * np.arange(n) |
| 47 | +x = x + trend_signal |
| 48 | + |
| 49 | +print(f"Data generated: {n} points with AR(1) coefficient ~0.8 and NO trend.") |
| 50 | + |
| 51 | +# 2. Run Standard Trend Test (The "Naive" Approach) |
| 52 | +print("\n--- Test 1: Standard Mann-Kendall (Assuming Independence) ---") |
| 53 | +result_std = mk.trend_test( |
| 54 | + x, t, |
| 55 | + autocorr_method='none', |
| 56 | + slope_scaling='year', |
| 57 | + plot_path='plot_std.png' |
| 58 | +) |
| 59 | + |
| 60 | +print(f"Trend: {result_std.trend}") |
| 61 | +print(f"p-value: {result_std.p:.4f}") |
| 62 | +print(f"Significant? {result_std.h}") |
| 63 | +print(f"Slope: {result_std.slope:.4f} {result_std.slope_units}") |
| 64 | +print(f"Confidence Interval (95%): [{result_std.lower_ci:.4f}, {result_std.upper_ci:.4f}]") |
| 65 | +print("Note: If p < 0.05, this is a False Positive caused by autocorrelation.") |
| 66 | + |
| 67 | + |
| 68 | +# 3. Run Yue & Wang (2004) Correction |
| 69 | +# This method adjusts the variance of the Mann-Kendall statistic based on the |
| 70 | +# effective sample size (ESS), which is reduced by autocorrelation. |
| 71 | +# It is a parametric correction assuming an AR(1) process. |
| 72 | +print("\n--- Test 2: Yue & Wang (2004) Correction ---") |
| 73 | +result_yw = mk.trend_test( |
| 74 | + x, t, |
| 75 | + autocorr_method='yue_wang', |
| 76 | + slope_scaling='year' |
| 77 | +) |
| 78 | + |
| 79 | +print(f"Trend: {result_yw.trend}") |
| 80 | +print(f"p-value: {result_yw.p:.4f}") |
| 81 | +print(f"Significant? {result_yw.h}") |
| 82 | +print(f"Slope: {result_yw.slope:.4f} {result_yw.slope_units}") |
| 83 | +print(f"Confidence Interval (95%): [{result_yw.lower_ci:.4f}, {result_yw.upper_ci:.4f}]") |
| 84 | +print(f"Effective Sample Size: {result_yw.n_effective:.1f} (vs {n} original)") |
| 85 | + |
| 86 | + |
| 87 | +# 4. Run Block Bootstrap Test (The "Robust" Approach) |
| 88 | +# The block bootstrap method preserves the autocorrelation structure by resampling |
| 89 | +# blocks of data instead of individual points. This correctly estimates the |
| 90 | +# significance even when data is dependent. |
| 91 | +print("\n--- Test 3: Block Bootstrap Mann-Kendall ---") |
| 92 | +result_boot = mk.trend_test( |
| 93 | + x, t, |
| 94 | + autocorr_method='block_bootstrap', |
| 95 | + n_bootstrap=1000, |
| 96 | + slope_scaling='year', |
| 97 | + plot_path='plot_boot.png' |
| 98 | +) |
| 99 | + |
| 100 | +print(f"Trend: {result_boot.trend}") |
| 101 | +print(f"p-value: {result_boot.p:.4f}") |
| 102 | +print(f"Significant? {result_boot.h}") |
| 103 | +print(f"Slope: {result_boot.slope:.4f} {result_boot.slope_units}") |
| 104 | +print(f"Confidence Interval (95%): [{result_boot.lower_ci:.4f}, {result_boot.upper_ci:.4f}]") |
| 105 | +print(f"Autocorrelation (ACF1): {result_boot.acf1:.3f}") |
| 106 | +print(f"Block Size Used: {result_boot.block_size_used}") |
| 107 | + |
| 108 | + |
| 109 | +# 5. Seasonal Block Bootstrap |
| 110 | +# For seasonal data, we must bootstrap 'cycles' (e.g., years) to preserve seasonality. |
| 111 | +print("\n--- Test 4: Seasonal Block Bootstrap ---") |
| 112 | + |
| 113 | +# Generate seasonal data with autocorrelation between years |
| 114 | +n_years = 20 |
| 115 | +dates_seas = pd.date_range('2000-01-01', periods=n_years*12, freq='ME') |
| 116 | +seasonal_pattern = np.tile(np.sin(np.linspace(0, 2*np.pi, 12)), n_years) |
| 117 | + |
| 118 | +# Autocorrelated noise (inter-annual persistence) |
| 119 | +noise_seas = np.zeros(n_years*12) |
| 120 | +current_noise = 0 |
| 121 | +for i in range(n_years*12): |
| 122 | + current_noise = 0.7 * current_noise + np.random.normal(0, 1) |
| 123 | + noise_seas[i] = current_noise |
| 124 | + |
| 125 | +x_seas = seasonal_pattern + noise_seas # No trend |
| 126 | + |
| 127 | +result_seas = mk.seasonal_trend_test( |
| 128 | + x_seas, dates_seas, |
| 129 | + period=12, |
| 130 | + autocorr_method='block_bootstrap', |
| 131 | + slope_scaling='year', |
| 132 | + plot_path='plot_seas_boot.png' |
| 133 | +) |
| 134 | + |
| 135 | +print(f"Seasonal Trend: {result_seas.trend}") |
| 136 | +print(f"p-value: {result_seas.p:.4f}") |
| 137 | +print(f"Significant? {result_seas.h}") |
| 138 | +print(f"Slope: {result_seas.slope:.4f} {result_seas.slope_units}") |
| 139 | +print(f"Confidence Interval (95%): [{result_seas.lower_ci:.4f}, {result_seas.upper_ci:.4f}]") |
| 140 | +print(f"Block Size Used: {result_seas.block_size_used} (cycles)") |
| 141 | +``` |
| 142 | + |
| 143 | +### Step 2: Text Output |
| 144 | +```text |
| 145 | +Data generated: 100 points with AR(1) coefficient ~0.8 and NO trend. |
| 146 | +
|
| 147 | +--- Test 1: Standard Mann-Kendall (Assuming Independence) --- |
| 148 | +Trend: increasing |
| 149 | +p-value: 0.0013 |
| 150 | +Significant? True |
| 151 | +Slope: 0.2300 units per year |
| 152 | +Confidence Interval (95%): [0.0917, 0.3704] |
| 153 | +Note: If p < 0.05, this is a False Positive caused by autocorrelation. |
| 154 | +
|
| 155 | +--- Test 2: Yue & Wang (2004) Correction --- |
| 156 | +Trend: increasing |
| 157 | +p-value: 0.2447 |
| 158 | +Significant? False |
| 159 | +Slope: 0.2300 units per year |
| 160 | +Confidence Interval (95%): [-0.1672, 0.6631] |
| 161 | +Effective Sample Size: 13.1 (vs 100 original) |
| 162 | +
|
| 163 | +--- Test 3: Block Bootstrap Mann-Kendall --- |
| 164 | +Trend: increasing |
| 165 | +p-value: 0.1550 |
| 166 | +Significant? False |
| 167 | +Slope: 0.2300 units per year |
| 168 | +Confidence Interval (95%): [-0.0598, 0.8128] |
| 169 | +Autocorrelation (ACF1): 0.809 |
| 170 | +Block Size Used: 10 |
| 171 | +
|
| 172 | +--- Test 4: Seasonal Block Bootstrap --- |
| 173 | +Seasonal Trend: increasing |
| 174 | +p-value: 0.4400 |
| 175 | +Significant? False |
| 176 | +Slope: 0.0267 units per year |
| 177 | +Confidence Interval (95%): [-0.0339, 0.0828] |
| 178 | +Block Size Used: 3 (cycles) |
| 179 | +
|
| 180 | +``` |
| 181 | + |
| 182 | +## Interpreting the Results |
| 183 | + |
| 184 | +### 1. The False Positive (Standard Test) |
| 185 | +* In Test 1, the standard Mann-Kendall test likely reported a low p-value (e.g., $p < 0.05$) and claimed a significant trend. |
| 186 | +* The **Confidence Interval** for the slope likely excludes zero, falsely confirming a trend. |
| 187 | +* **Why?** The data "wandered" due to persistence, and the test mistook this for a deterministic trend. |
| 188 | + |
| 189 | +### 2. Yue & Wang Correction (Parametric) |
| 190 | +* In Test 2, `autocorr_method='yue_wang'` was used. |
| 191 | +* It reduced the **Effective Sample Size** (e.g., from 100 down to ~15-20), reflecting that highly correlated data contains less independent information. |
| 192 | +* The p-value increases, and the confidence interval widens, correctly identifying 'no trend'. |
| 193 | + |
| 194 | +### 3. Block Bootstrap (Non-Parametric) |
| 195 | +* In Test 3, `autocorr_method='block_bootstrap'` was used. |
| 196 | +* This method builds the null distribution by shuffling blocks of data. |
| 197 | +* The **Confidence Interval** is estimated directly from the bootstrap distribution. It should be wide enough to include zero, correctly indicating no significant trend. |
| 198 | + |
| 199 | +### 4. Seasonal Application |
| 200 | +* Test 4 shows how to apply this to seasonal data using `seasonal_trend_test`. |
| 201 | +* Here, the bootstrap resamples entire **cycles** (e.g., years) to preserve the seasonal pattern while accounting for inter-annual correlation. |
| 202 | + |
| 203 | +## Visual Results |
| 204 | + |
| 205 | +### Standard (Naive) Test |
| 206 | + |
| 207 | +*Notice how the red trend line might look convincing, but the data points are clustered above/below the line for long periods? That's autocorrelation.* |
| 208 | + |
| 209 | +### Bootstrap (Robust) Test |
| 210 | + |
| 211 | +*The trend line and data are the same, but the statistical conclusion (p-value) in the title/results is now correct.* |
| 212 | + |
| 213 | +## Key Takeaway |
| 214 | +If you suspect your data has serial correlation (common in daily or monthly environmental data), use `autocorr_method='block_bootstrap'` (or `'auto'`) to avoid false positives. |
0 commit comments