Commit 4e1f636
push v1 from dev to public
1 parent 4b90809 commit 4e1f636

364 files changed, +39925 -0 lines changed

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@

# Example 1: Getting Started - Inspecting Your Data

## The "Why": Verify Before You Analyze
In environmental data analysis, datasets are rarely perfect. They often contain:
* **Missing values (Gaps):** Sensors fail, samples get lost.
* **Censored data:** Concentrations fall below laboratory detection limits (e.g., `< 0.5 mg/L`).
* **Irregular sampling:** Samples might be taken daily in summer but monthly in winter.

Running a trend test blindly on such data can lead to misleading results. The `MannKS.inspect_trend_data` function is your "sanity check."
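Before the walkthrough, it helps to see what "parsing censored strings" means mechanically. The sketch below is our own minimal illustration, not the package's implementation (the `prepare_censored_data` function used later does this properly, including multiple detection limits and messier formats):

```python
import pandas as pd

def parse_censored(raw):
    """Illustrative only: split strings like '<0.5' into a numeric
    value (the detection limit) and a boolean censored flag."""
    s = pd.Series(raw, dtype="string").str.strip()
    censored = s.str.startswith("<")
    value = pd.to_numeric(s.str.lstrip("< "), errors="coerce")
    return pd.DataFrame({"value": value, "censored": censored})

df = parse_censored(["12.4", "< 0.5", "3.1"])
print(df)
```

The key idea is that one string column becomes two analysis-ready columns: the numeric value (for non-detects, the detection limit itself) and a flag saying whether that value is censored.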

## The "How": Code Walkthrough

In this example, we generate a synthetic "messy" dataset and inspect it. We use `return_summary=True` to get a programmatic report on data availability across different potential time increments (monthly, quarterly, etc.).

### Step 1: Python Code
```python
import numpy as np
import pandas as pd
import MannKS as mk

# 1. Generate Synthetic Data
# We create a 5-year monthly dataset (60 points) with some "messy" real-world features:
# - An underlying upward trend.
# - Some random noise.
# - Missing data (NaNs).
# - Censored data (values below detection limit).
np.random.seed(42)
n_years = 5
dates = pd.date_range(start='2020-01-01', periods=n_years*12, freq='ME')
t = np.arange(len(dates))
values = 0.1 * t + np.random.normal(0, 1, len(t)) + 10
censored_mask = values < 10.5
values_str = values.astype(str)
values_str[censored_mask] = '<' + np.round(values[censored_mask] + 0.5, 1).astype(str)
values_str[10:13] = np.nan
values_str[45] = np.nan

# 2. Pre-process the Data
# Raw environmental data often comes as strings (e.g., '< 0.5', '12.4').
# Standard statistical functions fail on these strings.
# The `prepare_censored_data` function is critical because it:
# 1. Parses the strings to identify censored values (detects '<').
# 2. Separates the data into a numeric 'value' column and a boolean 'censored' column.
# 3. Handles multiple detection limits automatically.
df = mk.prepare_censored_data(values_str)
df['date'] = dates

# 3. Inspect the Data
# Before running a trend test, we must verify the data is suitable.
# The `inspect_trend_data` function acts as a diagnostic tool.
# It checks for:
# - Data Availability: Do we have enough data points?
# - Time Structure: Is the data monthly? Quarterly? Irregular?
# - Gaps: Are there long periods with no data?
# - Censoring: What percentage of data is non-detect?
# We request `return_summary=True` to get the statistical table back.
print("Running Data Inspection...")
result = mk.inspect_trend_data(
    data=df,
    time_col='date',
    return_summary=True,
    plot=True,
    plot_path='inspection_plots.png'
)

# Print the availability summary
print("\nData Availability Summary:")
print(result.summary.to_markdown(index=False))
```

### Step 2: Text Output
The function returns a summary DataFrame, which we printed:

```text
Running Data Inspection...

Data Availability Summary:
| increment   |   n_obs |   n_year |   prop_year |   n_incr_year |   prop_incr_year | data_ok   |
|:------------|--------:|---------:|------------:|--------------:|-----------------:|:----------|
| daily       |      56 |        5 |           1 |            56 |        0.0306849 | False     |
| weekly      |      56 |        5 |           1 |            56 |         0.215385 | False     |
| monthly     |      56 |        5 |           1 |            56 |         0.933333 | True      |
| bi-monthly  |      56 |        5 |           1 |            29 |         0.966667 | True      |
| quarterly   |      56 |        5 |           1 |            20 |                1 | True      |
| bi-annually |      56 |        5 |           1 |            10 |                1 | True      |
| annually    |      56 |        5 |           1 |             5 |                1 | True      |

```
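Reading the numbers, `prop_incr_year` appears to be the count of distinct time periods containing data divided by the number of expected periods. That is our inference from the table, not a documented formula, but the arithmetic checks out for the 56 surviving observations over 5 years:

```python
# 56 observations remain after the 4 NaNs are dropped
n_obs = 56

# observed periods / expected periods over 5 years, per increment
print(round(n_obs / (365 * 5), 7))  # daily      -> 0.0306849
print(round(n_obs / (52 * 5), 6))   # weekly     -> 0.215385
print(round(n_obs / (12 * 5), 6))   # monthly    -> 0.933333
print(round(29 / 30, 6))            # bi-monthly -> 0.966667 (29 of 30 periods have data)
```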

## Interpreting the Results

### 1. Statistical Summary (Text Output)
The table above evaluates different time increments (monthly, quarterly, etc.) to see if they are suitable for analysis:
* **`increment`**: The time unit being tested.
* **`prop_year`**: Fraction of years that have at least one sample. High values (near 1.0) are good.
* **`prop_incr_year`**: Fraction of expected periods (e.g., 12 months/year) that have data.
* **`data_ok`**: A boolean flag suggesting if this increment is viable for seasonal analysis.
* For **monthly** analysis, we see high coverage, confirming our data is suitable despite the gaps.

### 2. Visual Diagnostics (Plots)
The function generated `inspection_plots.png`:

![Inspection Plots](inspection_plots.png)

* **Top-Left (Time Series):** Visualizes the trend, gaps, and censored values (red dots).
* **Top-Right (Value Matrix):** Heatmap of values (Row=Year, Col=Month). Useful for spotting seasonal blocks.
* **Bottom-Left (Censoring Matrix):** Heatmap of censored data locations.
* **Bottom-Right (Sample Count Matrix):** Heatmap of sampling frequency.

## Conclusion
We have confirmed our data is messy but sufficient for a **monthly** trend analysis. We are ready to proceed!
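Because the summary comes back as a DataFrame, the `data_ok` flag can also drive the choice of increment in code. A sketch, using a hand-built stand-in for `result.summary` (column names taken from the table above) rather than rerunning the inspection:

```python
import pandas as pd

# Stand-in for result.summary, using rows from the table above.
# Rows are ordered finest to coarsest increment.
summary = pd.DataFrame({
    "increment": ["daily", "weekly", "monthly", "bi-monthly", "quarterly"],
    "data_ok": [False, False, True, True, True],
})

# Pick the finest increment flagged as viable
viable = summary[summary["data_ok"]]
finest = viable.iloc[0]["increment"]
print(f"Finest viable increment: {finest}")  # -> monthly
```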
Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
1+
import os
2+
import io
3+
import contextlib
4+
import numpy as np
5+
import pandas as pd
6+
import MannKS as mk
7+
import matplotlib.pyplot as plt
8+
9+
# --- 1. Define the Example Code as a String ---
10+
example_code = """
11+
import numpy as np
12+
import pandas as pd
13+
import MannKS as mk
14+
15+
# 1. Generate Synthetic Data
16+
# We create a 5-year monthly dataset (60 points) with some "messy" real-world features:
17+
# - An underlying upward trend.
18+
# - Some random noise.
19+
# - Missing data (NaNs).
20+
# - Censored data (values below detection limit).
21+
np.random.seed(42)
22+
n_years = 5
23+
dates = pd.date_range(start='2020-01-01', periods=n_years*12, freq='ME')
24+
t = np.arange(len(dates))
25+
values = 0.1 * t + np.random.normal(0, 1, len(t)) + 10
26+
censored_mask = values < 10.5
27+
values_str = values.astype(str)
28+
values_str[censored_mask] = '<' + np.round(values[censored_mask] + 0.5, 1).astype(str)
29+
values_str[10:13] = np.nan
30+
values_str[45] = np.nan
31+
32+
# 2. Pre-process the Data
33+
# Raw environmental data often comes as strings (e.g., '< 0.5', '12.4').
34+
# Standard statistical functions fail on these strings.
35+
# The `prepare_censored_data` function is critical because it:
36+
# 1. Parses the strings to identify censored values (detects '<').
37+
# 2. Separates the data into a numeric 'value' column and a boolean 'censored' column.
38+
# 3. Handles multiple detection limits automatically.
39+
df = mk.prepare_censored_data(values_str)
40+
df['date'] = dates
41+
42+
# 3. Inspect the Data
43+
# Before running a trend test, we must verify the data is suitable.
44+
# The `inspect_trend_data` function acts as a diagnostic tool.
45+
# It checks for:
46+
# - Data Availability: Do we have enough data points?
47+
# - Time Structure: Is the data monthly? Quarterly? Irregular?
48+
# - Gaps: Are there long periods with no data?
49+
# - Censoring: What percentage of data is non-detect?
50+
# We request `return_summary=True` to get the statistical table back.
51+
print("Running Data Inspection...")
52+
result = mk.inspect_trend_data(
53+
data=df,
54+
time_col='date',
55+
return_summary=True,
56+
plot=True,
57+
plot_path='inspection_plots.png'
58+
)
59+
60+
# Print the availability summary
61+
print("\\nData Availability Summary:")
62+
print(result.summary.to_markdown(index=False))
63+
"""
64+
65+
# --- 2. Execute the Code and Capture Output ---
66+
output_buffer = io.StringIO()
67+
68+
with contextlib.redirect_stdout(output_buffer):
69+
local_scope = {}
70+
exec(example_code, globals(), local_scope)
71+
72+
captured_output = output_buffer.getvalue()
73+
74+
# --- 3. Generate the README.md ---
75+
readme_content = f"""
76+
# Example 1: Getting Started - Inspecting Your Data
77+
78+
## The "Why": Verify Before You Analyze
79+
In environmental data analysis, datasets are rarely perfect. They often contain:
80+
* **Missing values (Gaps):** Sensors fail, samples get lost.
81+
* **Censored data:** Concentrations fall below laboratory detection limits (e.g., `< 0.5 mg/L`).
82+
* **Irregular sampling:** Samples might be taken daily in summer but monthly in winter.
83+
84+
Running a trend test blindly on such data can lead to misleading results. The `MannKS.inspect_trend_data` function is your "sanity check."
85+
86+
## The "How": Code Walkthrough
87+
88+
In this example, we generate a synthetic "messy" dataset and inspect it. We use `return_summary=True` to get a programmatic report on data availability across different potential time increments (monthly, quarterly, etc.).
89+
90+
### Step 1: Python Code
91+
```python
92+
{example_code.strip()}
93+
```
94+
95+
### Step 2: Text Output
96+
The function returns a summary DataFrame, which we printed:
97+
98+
```text
99+
{captured_output}
100+
```
101+
102+
## Interpreting the Results
103+
104+
### 1. Statistical Summary (Text Output)
105+
The table above evaluates different time increments (monthly, quarterly, etc.) to see if they are suitable for analysis:
106+
* **`increment`**: The time unit being tested.
107+
* **`prop_year`**: Fraction of years that have at least one sample. High values (near 1.0) are good.
108+
* **`prop_incr_year`**: Fraction of expected periods (e.g., 12 months/year) that have data.
109+
* **`data_ok`**: A boolean flag suggesting if this increment is viable for seasonal analysis.
110+
* For **monthly** analysis, we see high coverage, confirming our data is suitable despite the gaps.
111+
112+
### 2. Visual Diagnostics (Plots)
113+
The function generated `inspection_plots.png`:
114+
115+
![Inspection Plots](inspection_plots.png)
116+
117+
* **Top-Left (Time Series):** Visualizes the trend, gaps, and censored values (red dots).
118+
* **Top-Right (Value Matrix):** Heatmap of values (Row=Year, Col=Month). Useful for spotting seasonal blocks.
119+
* **Bottom-Left (Censoring Matrix):** Heatmap of censored data locations.
120+
* **Bottom-Right (Sample Count Matrix):** Heatmap of sampling frequency.
121+
122+
## Conclusion
123+
We have confirmed our data is messy but sufficient for a **monthly** trend analysis. We are ready to proceed!
124+
"""
125+
126+
with open(os.path.join(os.path.dirname(__file__), 'README.md'), 'w') as f:
127+
f.write(readme_content)
128+
129+
print("Example 1 generated successfully.")
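The capture-and-render pattern this script is built on (execute a code string, harvest whatever it printed) reduces to a small standalone sketch:

```python
import io
import contextlib

code = 'print("hello from the example")'

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    # Run the snippet in a fresh namespace so it cannot clobber ours;
    # anything it prints lands in the buffer instead of the console.
    exec(code, {})

captured = buffer.getvalue()
print(repr(captured))
```

The captured string can then be dropped into any template, which is exactly what the script above does with its README f-string.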
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@

# Example 2: Basic Non-Seasonal Trend Test (Numeric Time)

## The "Why": The Fundamental Trend Test
This example demonstrates the core function of the package: `mk.trend_test`.
It answers the most basic question: **"Is there a statistically significant upward or downward trend in my data?"**

We use the **Mann-Kendall test** for significance and **Sen's Slope** for magnitude because they are "non-parametric." This means:
1. They don't assume your data follows a bell curve (normal distribution).
2. They are robust to outliers (one crazy high value won't ruin the result).
3. They handle missing data gracefully.

## The "How": Code Walkthrough

We analyze a simple dataset where time is represented by numeric years (integers).

### Step 1: Python Code
```python
import numpy as np
import pandas as pd
import MannKS as mk

# 1. Generate Synthetic Data
# We create a simple dataset with 11 yearly observations.
# The time vector 't' is just a sequence of integers (years).
# The value vector 'x' has a clear upward trend.
t = np.arange(2000, 2011)  # Years 2000 to 2010
x = np.array([5.1, 5.5, 5.9, 6.2, 6.8, 7.1, 7.5, 7.9, 8.2, 8.5, 9.0])

print(f"Time (t): {t}")
print(f"Values (x): {x}")

# 2. Run the Trend Test
# The `trend_test` function is the core of the package. It performs two key statistical tasks:
#   A. Mann-Kendall Test: Checks *if* there is a trend (Significance).
#      - It compares every pair of data points to see if they increase or decrease.
#   B. Sen's Slope Estimator: Calculates *how strong* the trend is (Magnitude).
#      - It finds the median of all pairwise slopes.
# We pass 'plot_path' to automatically generate a visualization of these results.
print("\nRunning Mann-Kendall Trend Test...")
result = mk.trend_test(x, t, plot_path='trend_plot.png')

# 3. Inspect the Results
# The function returns a namedtuple with all statistical metrics.
# Key fields include:
# - result.trend: A basic description ('increasing', 'decreasing', 'no trend').
# - result.classification: A more nuanced category (e.g., 'Likely Increasing') based on confidence.
# - result.p: The p-value (significance).
# - result.slope: The magnitude of change per unit time.
print("\n--- Trend Test Results ---")
print(f"Basic Trend: {result.trend} (Confidence: {result.C:.1%})")
print(f"Classification: {result.classification}")
print(f"Kendall's S: {result.s}")
print(f"p-value: {result.p:.4f}")
print(f"Sen's Slope: {result.slope:.4f}")
print(f"Confidence Interval: [{result.lower_ci:.4f}, {result.upper_ci:.4f}]")
```

### Step 2: Text Output
```text
Time (t): [2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010]
Values (x): [5.1 5.5 5.9 6.2 6.8 7.1 7.5 7.9 8.2 8.5 9. ]

Running Mann-Kendall Trend Test...

--- Trend Test Results ---
Basic Trend: increasing (Confidence: 100.0%)
Classification: Highly Likely Increasing
Kendall's S: 55.0
p-value: 0.0000
Sen's Slope: 0.3889
Confidence Interval: [0.3667, 0.4000]

```

## Interpreting the Results

### 1. Statistical Results
* **Basic Trend (Increasing)**: The test detected an upward trend.
* **Classification (Highly Likely Increasing)**: The package assigns a descriptive category based on the confidence level (`result.C`).
    * **Increasing/Decreasing**: High confidence (≥ 90% or 95%, depending on `alpha`).
    * **Likely Increasing/Decreasing**: Moderate confidence (e.g., 85-90%).
    * **Stable/No Trend**: Low confidence.
* **Confidence (`result.C` = 100.0%)**: Derived from the p-value (`1 - p/2` for increasing trends). We can be very confident this trend isn't just random noise.
* **Kendall's S (55.0)**: The raw score. Comparing all possible pairs of data points, 55 more pairs increased than decreased; with 11 points there are exactly 55 pairs, so here every single pair increased. A positive number indicates growth.
* **p-value (0.0000)**: The probability of seeing a trend this strong by random chance alone is virtually zero. Standard practice considers $p < 0.05$ significant.
* **Sen's Slope (0.3889)**: The median rate of change. Since our time unit is years, the value increases by roughly **0.39 units per year**.

### 2. Visual Results (`trend_plot.png`)
The function automatically generated this plot:

![Trend Plot](trend_plot.png)

* **Blue Dots**: The actual data points.
* **Solid Line**: The Sen's Slope trend line. Notice it passes through the "center of gravity" of the data but doesn't necessarily hit the mean.
* **Shaded Area**: The 90% confidence interval (default `alpha=0.1`). For a significant trend, this band excludes a flat (zero-slope) line.

## Key Takeaway
For simple numeric time series (years, index numbers), `mk.trend_test(x, t)` is all you need. It provides the "Yes/No" (significance), the "How Much" (slope), and a user-friendly classification.
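Both headline numbers are easy to reproduce with plain NumPy, which is a good way to internalise what the test actually computes. This is an independent sketch of the textbook definitions, not the package's code:

```python
import numpy as np

t = np.arange(2000, 2011)
x = np.array([5.1, 5.5, 5.9, 6.2, 6.8, 7.1, 7.5, 7.9, 8.2, 8.5, 9.0])

# All index pairs i < j (55 pairs for 11 points)
i, j = np.triu_indices(len(x), k=1)

# Kendall's S: +1 for each increasing pair, -1 for each decreasing pair
S = int(np.sign(x[j] - x[i]).sum())

# Sen's slope: the median of all pairwise slopes
slopes = (x[j] - x[i]) / (t[j] - t[i])
sen = np.median(slopes)

print(S)              # -> 55 (every pair increases)
print(round(sen, 4))  # -> 0.3889
```

Matching the package output (`S = 55`, slope `0.3889`) confirms the interpretation above: S counts pairwise increases minus decreases, and Sen's slope is the median pairwise rate of change.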
