| 1 | +@.ai-instructions/profiles/tier-a.md @.ai-instructions/modules/jax.md |
| 2 | +@.ai-instructions/modules/pandas.md |
| 3 | + |
| 4 | +# CLAUDE.md |
| 5 | + |
| 6 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in |
| 7 | +this repository. |
| 8 | + |
| 9 | +## Project Overview |
| 10 | + |
| 11 | +GETTSIM (German Taxes and Transfers SIMulator) is a Python microsimulation model for the |
| 12 | +German tax and transfer system. It enables research applications from dynamic |
| 13 | +programming models to detailed microsimulation studies. |
| 14 | + |
| 15 | +The core computation engine is provided by `ttsim-backend`. GETTSIM contains the policy |
| 16 | +definitions, parameters, and tests specific to Germany. |
| 17 | + |
| 18 | +Never work around limitations in `ttsim-backend` from within GETTSIM; fix them in `ttsim-backend` itself. |
| 19 | + |
| 20 | +## Common Commands |
| 21 | + |
| 22 | +```bash |
| 23 | +# Run all tests |
| 24 | +pixi run -e py314-jax tests -n 7 |
| 25 | + |
| 26 | +# Run a single test file |
| 27 | +pixi run -e py314-jax tests src/gettsim/tests_germany/test_policy_cases.py |
| 28 | + |
| 29 | +# Run tests for a specific policy area (by test ID pattern) |
| 30 | +pixi run -e py314-jax tests -k "kindergeld" |
| 31 | + |
| 32 | +# Type checking |
| 33 | +pixi run ty |
| 34 | +pixi run ty-jax |
| 35 | + |
| 36 | +# Quality checks (linting, formatting) |
| 37 | +pixi run prek run --all-files |
| 38 | + |
| 39 | +# Build documentation |
| 40 | +pixi run docs |
| 41 | +``` |
| 42 | + |
| 43 | +Before finishing any task that modifies code, always run these three verification steps |
| 44 | +in order: |
| 45 | + |
| 46 | +1. `pixi run ty` and `pixi run ty-jax` (type checker) |
| 47 | +1. `pixi run prek run --all-files` (quality checks: linting, formatting, yaml, etc.) |
| 48 | +1. `pixi run -e py314-jax tests -n 7` (full test suite) |
| 49 | + |
| 50 | +## Architecture |
| 51 | + |
| 52 | +### Source Layout |
| 53 | + |
| 54 | +- `src/gettsim/germany/` - Policy implementations organized by area (einkommensteuer, |
| 55 | + kindergeld, bürgergeld, etc.) |
| 56 | +- `src/gettsim/tests_germany/policy_cases/` - Test cases organized by policy area and |
| 57 | + date |
| 58 | +- `src/gettsim/tt/` - Re-exports from ttsim-backend (decorators, types) |
| 59 | + |
| 60 | +### Two-Level DAG System (GEP 4, 7) |
| 61 | + |
| 62 | +1. **Interface DAG**: High-level orchestration connecting inputs to outputs via |
| 63 | + `main()`. Key concepts: |
| 64 | + |
| 65 | + - `policy_date`: Date for which the policy environment is set up |
| 66 | + - `policy_environment`: Functions/parameters relevant at the policy date |
| 67 | + - `input_data`: User-provided data (via DataFrame + mapper or direct pytree) |
| 68 | + - `tt_targets`: Which outputs to compute |
| 69 | + - `results`: Final outputs in user-requested format |
| 70 | + |
| 71 | +1. **TT DAG**: The core computation layer. Contains policy functions that operate on |
| 72 | + data columns. |
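
To make the DAG idea concrete, here is a toy sketch that derives dependencies from argument names and evaluates functions in topological order. It uses only the standard library. `compute` and both example functions are hypothetical illustrations and say nothing about how `ttsim-backend` is actually implemented.

```python
import inspect
from graphlib import TopologicalSorter


# Hypothetical policy functions: argument names are the names of their inputs.
def kindergeld_betrag_m(anzahl_ansprüche, satz):
    return satz * anzahl_ansprüche


def kindergeld_betrag_y(kindergeld_betrag_m):
    return 12 * kindergeld_betrag_m


FUNCTIONS = {
    "kindergeld_betrag_m": kindergeld_betrag_m,
    "kindergeld_betrag_y": kindergeld_betrag_y,
}


def compute(functions, inputs, targets):
    """Evaluate `functions` in dependency order and return the requested targets."""
    # Map each function name to the set of names it depends on.
    graph = {
        name: set(inspect.signature(func).parameters)
        for name, func in functions.items()
    }
    results = dict(inputs)
    # static_order() yields every node after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        if name in functions and name not in results:
            kwargs = {arg: results[arg] for arg in graph[name]}
            results[name] = functions[name](**kwargs)
    # For simplicity this toy version evaluates every function, not only
    # those needed for the targets.
    return {target: results[target] for target in targets}


print(compute(FUNCTIONS, {"anzahl_ansprüche": 2, "satz": 250.0}, ["kindergeld_betrag_y"]))
```
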
| 73 | + |
| 74 | +### Entry Point (GEP 7) |
| 75 | + |
| 76 | +`gettsim.main()` is the single entry point. Users specify: |
| 77 | + |
| 78 | +- `main_target` or `main_targets`: What to compute (use `MainTarget` for autocompletion) |
| 79 | +- `policy_date_str`: Date for the policy environment (ISO format `YYYY-MM-DD`) |
| 80 | +- `input_data`: User data (via `InputData` helper classes) |
| 81 | +- `tt_targets`: Which tax/transfer outputs to compute (via `TTTargets`) |
| 82 | + |
| 83 | +```python |
| 84 | +from gettsim import InputData, MainTarget, TTTargets, main |
| 85 | + |
| 86 | +outputs_df = main( |
| 87 | + main_target=MainTarget.results.df_with_mapper, |
| 88 | + policy_date_str="2025-01-01", |
| 89 | + input_data=InputData.df_and_mapper(df=inputs_df, mapper=inputs_map), |
| 90 | + tt_targets=TTTargets(tree=targets_tree), |
| 91 | +) |
| 92 | +``` |
| 93 | + |
| 94 | +### Policy Functions (GEP 4, 6) |
| 95 | + |
| 96 | +Policy functions use decorators from `gettsim.tt`: |
| 97 | + |
| 98 | +```python |
| 99 | +@policy_function(start_date="2023-01-01", leaf_name="betrag_m") |
| 100 | +def betrag_ohne_staffelung_m(anzahl_ansprüche: int, satz: float) -> float: |
| 101 | + return satz * anzahl_ansprüche |
| 102 | +``` |
| 103 | + |
| 104 | +Key decorators: |
| 105 | + |
| 106 | +- `@policy_function` - Main policy calculation functions with date ranges (`start_date`, |
| 107 | + `end_date`, `leaf_name`) |
| 108 | +- `@policy_input` - Input column definitions (no implementation body) |
| 109 | +- `@param_function` - Functions that transform raw parameters into usable forms |
| 110 | +- `@agg_by_p_id_function` - Aggregation functions by person ID (e.g., summing children's |
| 111 | + claims to parent) |
| 112 | +- `@agg_by_group_function` - Aggregation functions by group (e.g., sum to household |
| 113 | + level) |
| 114 | +- `@group_creation_function` - Functions that create group IDs (e.g., fg_id, bg_id) |
| 115 | + |
| 116 | +Additional `@policy_function` parameters: |
| 117 | + |
| 118 | +- `vectorization_strategy="not_required"` - For functions that operate on full columns |
| 119 | + using `xnp` |
| 120 | +- `rounding_spec=RoundingSpec(...)` - Optional rounding (GEP 5) |
| 121 | +- `fail_msg_if_included="..."` - Error message if function is included in DAG (for |
| 122 | + unimplemented periods) |
| 123 | + |
| 124 | +### Automatic DAG Features (GEP 4) |
| 125 | + |
| 126 | +**Auto-aggregation**: If `my_col` exists and `my_col_hh` is requested, a sum aggregation |
| 127 | +is auto-generated. |
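
As a rough illustration of what such a sum aggregation does, the sketch below sums a person-level column within each household. The helper `sum_by_group` is hypothetical, and returning one value per person (the group total repeated for every member) is an assumption made for this example, not a statement about the backend's output format.

```python
from collections import defaultdict


def sum_by_group(values, group_ids):
    """Sum `values` within each group and repeat the total for every member."""
    totals = defaultdict(float)
    for value, group_id in zip(values, group_ids):
        totals[group_id] += value
    # Assumption: the aggregated column keeps the person-level length.
    return [totals[group_id] for group_id in group_ids]


hh_id = [0, 0, 1]
my_col = [100.0, 50.0, 30.0]
my_col_hh = sum_by_group(my_col, hh_id)
print(my_col_hh)
```
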
| 128 | + |
| 129 | +**Time conversion**: Automatic conversion between `_y`, `_q`, `_m`, `_w`, `_d` suffixes |
| 130 | +using these factors relative to year: 1, 4, 12, 365.25/7, 365.25. |
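
The arithmetic behind these factors can be sketched as follows, reading each factor as the number of periods per year. `PERIODS_PER_YEAR` and `convert_flow` are hypothetical names used only for this example.

```python
# Number of periods per year for each time-unit suffix, as listed above.
PERIODS_PER_YEAR = {"y": 1.0, "q": 4.0, "m": 12.0, "w": 365.25 / 7, "d": 365.25}


def convert_flow(amount, source, target):
    """Convert an amount per `source` period into an amount per `target` period."""
    # Scale to a yearly amount first, then down to the target period.
    return amount * PERIODS_PER_YEAR[source] / PERIODS_PER_YEAR[target]


print(convert_flow(1200.0, "y", "m"))  # yearly amount expressed per month
print(convert_flow(100.0, "m", "y"))   # monthly amount expressed per year
```
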
| 131 | + |
| 132 | +### Parameters (GEP 3) |
| 133 | + |
| 134 | +Policy parameters are in YAML files alongside the Python code. Each parameter has: |
| 135 | + |
| 136 | +- Date-keyed values (e.g., `2023-01-01:`) |
| 137 | +- Metadata: `name` (de/en), `description` (de/en), `unit`, `reference_period`, `type` |
| 138 | +- Legal references in each date entry |
| 139 | +- Schema: `docs/geps/params-schema.json` |
| 140 | + |
| 141 | +Parameter types: |
| 142 | + |
| 143 | +- `scalar` - Single value (accessed via `value` key) |
| 144 | +- `dict` - Homogeneous dictionary with string/int keys |
| 145 | +- `piecewise_constant`, `piecewise_linear`, `piecewise_quadratic`, `piecewise_cubic` - |
| 146 | + For `piecewise_polynomial` function |
| 147 | +- `birth_year_based_phase_inout`, `birth_month_based_phase_inout` - Age threshold |
| 148 | + lookups by birth cohort |
| 149 | +- `require_converter` - Complex structures needing a `@param_function` converter |
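
For readers unfamiliar with piecewise schedules, the following sketch evaluates a continuous piecewise-linear function from interval lower bounds and per-interval slopes. It is a generic illustration: the function name, its arguments, and the example schedule are hypothetical and do not mirror the signature of `piecewise_polynomial` or the YAML layout of these parameter types.

```python
from bisect import bisect_right


def piecewise_linear(x, thresholds, rates, intercept_at_first=0.0):
    """Evaluate a continuous piecewise-linear function at `x`.

    `thresholds` are the sorted lower bounds of the intervals and `rates`
    the slope that applies within each interval.
    """
    # Accumulate the function value at each lower bound so pieces connect.
    intercepts = [intercept_at_first]
    for i in range(1, len(thresholds)):
        width = thresholds[i] - thresholds[i - 1]
        intercepts.append(intercepts[-1] + rates[i - 1] * width)
    # Find the interval containing x; values below the first bound
    # are extrapolated with the first slope.
    i = max(bisect_right(thresholds, x) - 1, 0)
    return intercepts[i] + rates[i] * (x - thresholds[i])


# Hypothetical schedule: 0% up to 10,000, then 20%, then 40% above 50,000.
THRESHOLDS = [0.0, 10_000.0, 50_000.0]
RATES = [0.0, 0.2, 0.4]
print(piecewise_linear(20_000.0, THRESHOLDS, RATES))
```
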
| 150 | + |
| 151 | +### Rounding (GEP 5) |
| 152 | + |
| 153 | +```python |
| 154 | +@policy_function( |
| 155 | + rounding_spec=RoundingSpec( |
| 156 | + base=0.0001, |
| 157 | + direction="nearest", # or "up", "down" |
| 158 | + reference="§76g SGB VI Abs. 4 Nr. 4", |
| 159 | + ), |
| 160 | + start_date="2021-01-01", |
| 161 | +) |
| 162 | +def höchstbetrag_m(...) -> float: ... |
| 163 | +``` |
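
The effect of `base` and `direction` can be illustrated with plain Python. This is a minimal sketch of rounding to a multiple of a base, not the implementation behind `RoundingSpec`; `round_to_base` is a hypothetical helper, and tie-breaking for `"nearest"` here follows Python's built-in `round` (ties go to the even multiple), which may differ from the backend.

```python
import math


def round_to_base(value, base, direction="nearest"):
    """Round `value` to a multiple of `base` in the given direction."""
    scaled = value / base
    if direction == "up":
        steps = math.ceil(scaled)
    elif direction == "down":
        steps = math.floor(scaled)
    elif direction == "nearest":
        steps = round(scaled)
    else:
        raise ValueError(f"Unknown direction: {direction!r}")
    # With bases such as 0.0001 the result is subject to the usual
    # floating-point representation error.
    return steps * base


print(round_to_base(12.34, base=0.5, direction="up"))
print(round_to_base(12.34, base=0.5, direction="down"))
```
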
| 164 | + |
| 165 | +## Naming Conventions (GEP 1, 6) |
| 166 | + |
| 167 | +### Language |
| 168 | + |
| 169 | +- **German** for policy-specific code (law names: Kindergeld, Bürgergeld, |
| 170 | + Einkommensteuer) |
| 171 | +- **English** for infrastructure code |
| 172 | +- **UTF-8** characters allowed (ä, ö, ü, ß) |
| 173 | + |
| 174 | +### Namespaces and Qualified Names (GEP 6) |
| 175 | + |
| 176 | +- Directory structure defines namespaces (e.g., `germany/kindergeld/` → namespace |
| 177 | + `kindergeld`) |
| 178 | +- Within a namespace, use local names: `betrag_m`, `satz` |
| 179 | +- Cross-namespace references use qualified names with double underscores: |
| 180 | + `arbeitslosengeld_2__einkommen_m_bg` |
| 181 | +- `betrag` is the convention for monetary amounts of a tax/transfer |
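
The relationship between qualified names and nested namespaces can be sketched with a few string operations. The helpers below are hypothetical and written only for illustration; they are not part of `gettsim.tt`.

```python
def to_tree_path(qualified_name):
    """Split a qualified name into its namespace path."""
    return tuple(qualified_name.split("__"))


def to_qualified_name(path):
    """Join a namespace path into a qualified name."""
    return "__".join(path)


def flatten(tree, prefix=()):
    """Flatten a nested dict into a dict keyed by qualified names."""
    flat = {}
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[to_qualified_name(path)] = value
    return flat


print(to_tree_path("arbeitslosengeld_2__einkommen_m_bg"))
print(flatten({"kindergeld": {"betrag_m": [250, 0, 0]}, "alter": [35, 35, 12]}))
```
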
| 182 | + |
| 183 | +### Column/Function Name Suffixes (GEP 1) |
| 184 | + |
| 185 | +**Time units** (placed before the aggregation-level suffix): |
| 186 | + |
| 187 | +- `_y` (year), `_q` (quarter), `_m` (month), `_w` (week), `_d` (day) |
| 188 | + |
| 189 | +**Aggregation levels**: |
| 190 | + |
| 191 | +- `_sn` (Steuernummer - tax unit) |
| 192 | +- `_hh` (Haushalt - household) |
| 193 | +- `_fg` (Familiengemeinschaft) |
| 194 | +- `_bg` (Bedarfsgemeinschaft) |
| 195 | +- `_eg` (Einstandsgemeinschaft) |
| 196 | +- `_ehe` (Ehegemeinschaft) |
| 197 | + |
| 198 | +Example: `arbeitslosengeld_2__betrag_m_bg` = monthly ALG2 amount at Bedarfsgemeinschaft |
| 199 | +level |
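
The suffix order can be checked mechanically. The sketch below splits a name into base, time unit, and aggregation level; `split_suffixes` is a hypothetical helper and a purely lexical heuristic, so a name that ends in one of these tokens for another reason would be misread.

```python
TIME_UNITS = {"y", "q", "m", "w", "d"}
GROUP_LEVELS = {"sn", "hh", "fg", "bg", "eg", "ehe"}


def split_suffixes(name):
    """Return (base, time_unit, group_level) for a local or qualified name."""
    # Only the last segment of a qualified name carries the suffixes.
    parts = name.split("__")[-1].split("_")
    # The aggregation level comes last, the time unit directly before it.
    level = parts.pop() if len(parts) > 1 and parts[-1] in GROUP_LEVELS else None
    unit = parts.pop() if len(parts) > 1 and parts[-1] in TIME_UNITS else None
    return "_".join(parts), unit, level


print(split_suffixes("arbeitslosengeld_2__betrag_m_bg"))
```
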
| 200 | + |
| 201 | +### Special Column Types (GEP 2) |
| 202 | + |
| 203 | +- `p_id` - Primary person identifier (required) |
| 204 | +- `[x]_id` - Group identifiers (e.g., `hh_id`, `bg_id`) - same value for all group |
| 205 | + members |
| 206 | +- `p_id_[y]` - Person-to-person pointers (e.g., `p_id_elternteil_1`, `p_id_empfänger`). |
| 207 | + Value -1 = no link. |
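
A short sketch of how such pointer columns can be used: counting, for each person, how many other rows point to them, with `-1` skipped as "no link". `count_by_p_id` is a hypothetical plain-Python helper, not the `@agg_by_p_id_function` machinery.

```python
def count_by_p_id(pointers, p_id):
    """Count, for each person in `p_id`, how many entries of `pointers` target them."""
    counts = {person: 0 for person in p_id}
    for target in pointers:
        if target != -1:  # -1 means the row points to nobody
            counts[target] += 1
    return [counts[person] for person in p_id]


p_id = [0, 1, 2]
p_id_empfänger = [-1, -1, 0]  # person 2's claim is received by person 0
anzahl_ansprüche = count_by_p_id(p_id_empfänger, p_id)
print(anzahl_ansprüche)
```
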
| 208 | + |
| 209 | +## Test Cases |
| 210 | + |
| 211 | +Tests use YAML files in `tests_germany/policy_cases/{area}/{date}/`: |
| 212 | + |
| 213 | +```yaml |
| 214 | +inputs: |
| 215 | + provided: |
| 216 | + alter: [35, 35, 12] |
| 217 | + p_id: [0, 1, 2] |
| 218 | + hh_id: [0, 0, 0] |
| 219 | + # Nested paths use double underscore in code, but nested dicts in YAML |
| 220 | + kindergeld: |
| 221 | + in_ausbildung: [false, false, true] |
| 222 | + p_id_empfänger: [-1, -1, 0] |
| 223 | +outputs: |
| 224 | + kindergeld: |
| 225 | + betrag_m: [250, 0, 0] |
| 226 | +``` |
| 227 | + |
| 228 | +## Code Restrictions for Vectorization |
| 229 | + |
| 230 | +Functions must follow these rules for automatic vectorization: |
| 231 | + |
| 232 | +1. **If-else blocks**: Only one operation per branch, no return inside single if (must |
| 233 | + have else) |
| 234 | +1. **Function calls**: `sum`, `any`, `all` require iterable arguments; `min`, `max` take |
| 235 | + exactly 2 args or 1 iterable |
| 236 | +1. **No elif after else**: Use nested if-else instead |
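
The rules above can be illustrated with a scalar function written in the restricted style: every `if` has an `else`, each branch contains a single assignment, nested if-else replaces `elif`, and `max` receives exactly two arguments. The function, its thresholds, and its withdrawal rate are hypothetical, and the decorator is omitted so the sketch runs without GETTSIM installed.

```python
def betrag_m(alter: int, einkommen_m: float, satz: float) -> float:
    """Hypothetical transfer: full rate for minors, phased out above 1000."""
    if alter >= 18:
        out = 0.0
    else:
        # Nested if-else instead of `elif`.
        if einkommen_m > 1000.0:
            out = max(satz - 0.5 * (einkommen_m - 1000.0), 0.0)
        else:
            out = satz
    return out


print(betrag_m(alter=12, einkommen_m=1200.0, satz=250.0))
```
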
| 237 | + |
| 238 | +## Useful Imports from gettsim.tt |
| 239 | + |
| 240 | +```python |
| 241 | +from gettsim.tt import ( |
| 242 | + # Decorators |
| 243 | + policy_function, |
| 244 | + policy_input, |
| 245 | + param_function, |
| 246 | + agg_by_group_function, |
| 247 | + agg_by_p_id_function, |
| 248 | + group_creation_function, |
| 249 | + # Types |
| 250 | + AggType, # SUM, COUNT, MEAN, MAX, MIN, ANY, ALL |
| 251 | + RoundingSpec, |
| 252 | + ConsecutiveIntLookupTableParamValue, |
| 253 | + PiecewisePolynomialParamValue, |
| 254 | + # Functions |
| 255 | + piecewise_polynomial, |
| 256 | + join, # For person-to-person lookups |
| 257 | + get_consecutive_int_lookup_table_param_value, |
| 258 | + get_piecewise_parameters, |
| 259 | + intervals_to_thresholds, |
| 260 | + merge_piecewise_intervals, |
| 261 | + PiecewisePolynomialInterval, |
| 262 | +) |
| 263 | +``` |
| 264 | + |
| 265 | +## Relevant GEPs |
| 266 | + |
| 267 | +The [GETTSIM Enhancement Protocols](docs/geps/) define conventions: |
| 268 | + |
| 269 | +- **GEP 0**: Purpose and process for GEPs |
| 270 | +- **GEP 1**: Naming conventions (identifiers, German names, time/unit suffixes) |
| 271 | +- **GEP 2**: Internal data representation (1-d arrays, group identifiers, person |
| 272 | + pointers) |
| 273 | +- **GEP 3**: Parameters of the taxes and transfers system (YAML structure, types) |
| 274 | +- **GEP 4**: DAG-based computational backend (core architecture) |
| 275 | +- **GEP 5**: Optional rounding via `RoundingSpec` |
| 276 | +- **GEP 6**: Unified architecture (namespaces, qualified names, `start_date`/`end_date`) |
| 277 | +- **GEP 7**: User interface (`main()` function, `MainTarget`, input/output handling) |