Draft
32 commits
084713e
sampling zoo integration first stage
v1docq Mar 6, 2026
5a6c43e
add extension contract and mirrored tests scaffold
v1docq Mar 12, 2026
e1d1ea1
make remote pipeline config parsing safe and typed
v1docq Mar 12, 2026
4dca87d
extract typed repository query rules from operation repository
v1docq Mar 12, 2026
1d6617d
extract typed repository
v1docq Mar 12, 2026
2115b47
extract pure assumption and preset rules from api shell
v1docq Mar 12, 2026
8e4f55d
extract pure input data rules from api data adapter
v1docq Mar 12, 2026
8881357
extract pure recommendation rules from input analyser
v1docq Mar 12, 2026
d39e2f7
extract typed fit and composer planning rules
v1docq Mar 12, 2026
c56c7ec
extract typed api params validation and normalization rules
v1docq Mar 12, 2026
bf09df6
extract typed assumption handler rules and either-based fit result
v1docq Mar 12, 2026
5b5eabd
extract pure api params repository defaulting rules
v1docq Mar 12, 2026
4e2a1fe
extract pure api params
v1docq Mar 12, 2026
d0f18ba
extract cache and tuner setup rules from api composer
v1docq Mar 12, 2026
5645139
extract pure builder parameter merge rules
v1docq Mar 12, 2026
88f7d3b
extract preprocessing source and merge rule
v1docq Mar 12, 2026
703e4d2
integrate extension manifest discovery into operation queries
v1docq Mar 12, 2026
522eca4
extract pipeline operation split rules and fix fluent repository setup
v1docq Mar 12, 2026
caf961b
add typed extension parameter resolution and schema defaults
v1docq Mar 12, 2026
0ef2232
extract pipeline preprocess and postprocess rules
v1docq Mar 13, 2026
1e2a3c6
extract pipeline node parameter normalization rules
v1docq Mar 13, 2026
45edef5
extract operation parameter normalization and change tracking rules
v1docq Mar 13, 2026
fdab6c0
`Refactor OOP shells to typed pure-core rules and add first mirrored …
v1docq Mar 13, 2026
bdb1bad
chore: add setuptools pkg_resources libs
Lopa10ko Mar 19, 2026
fd4c5b8
fix: change repo kinds enum values to lowercase
Lopa10ko Mar 19, 2026
dbc1bd7
fix: change the order of using best preset name in presets parsing
Lopa10ko Mar 19, 2026
e9156d1
fix: add proper chained exception thread in assumptions fit stage
Lopa10ko Mar 19, 2026
4a0352e
fix: add inheritance for fake test pipeline from actual pipeline
Lopa10ko Mar 19, 2026
d3413c2
fix: add Right monad in extension strategy params build method
Lopa10ko Mar 19, 2026
03efc0b
fix: add OperationParameters support for FP extraction
Lopa10ko Mar 19, 2026
a7d3994
fix: update validation checks to use is_left, is_right methods for mo…
Lopa10ko Mar 19, 2026
04885bb
Automated autopep8 fixes
github-actions[bot] Mar 19, 2026
36 changes: 36 additions & 0 deletions docs/dev/fp_refactoring_plan.md
@@ -0,0 +1,36 @@
# OOP-first refactoring plan with preparation for an FP-informed architecture

## Summary

The first refactoring wave keeps the key OOP abstractions in `fedot/api` and `fedot/core` as the public, coordinating layer, but moves computational, validation, and selection logic into a pure core. The goal is not to "break" the existing `Facade/Builder/Composite/Strategy` design, but to make it thinner, more type-safe, and better suited to further FP integration.

## OOP boundaries to preserve

- In `fedot/api`, `Fedot`, `FedotBuilder`, `ApiDataProcessor`, `ApiComposer`, `PredefinedModel`, `ApiParamsRepository`, `ApiParams`, `InputAnalyser`, and the assumptions/preset/filter builders and handlers are kept as OOP coordinators and boundary objects.
- In `fedot/core`, `PipelineNode`, `Pipeline`, `PipelineBuilder`, `PipelineTemplate`, `PipelineAdapter`, the factory layer, the operation hierarchy, `EvaluationStrategy`, `Composer`, `ComposerBuilder`, and the objective/splitter abstractions are kept.
- Refactoring rule: classes own lifecycle and orchestration, while selection rules, validation, transformations, and filtering move into typed pure modules.
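
The refactoring rule can be illustrated with a minimal sketch. All names here (`TimeoutRequest`, `resolve_timeout`, `ApiShell`) are hypothetical and are not taken from the FEDOT code base; the sketch only shows a class that owns lifecycle while a typed pure function makes the decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimeoutRequest:
    """Typed input for the rule (hypothetical, for illustration only)."""
    timeout_minutes: Optional[float]


def resolve_timeout(request: TimeoutRequest, default_minutes: float = 5.0) -> float:
    """Pure rule: no I/O and no hidden state, so it is trivial to unit-test."""
    if request.timeout_minutes is None:
        return default_minutes
    if request.timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    return float(request.timeout_minutes)


class ApiShell:
    """OOP coordinator: owns lifecycle and delegates the decision to the rule."""

    def __init__(self, timeout_minutes: Optional[float] = None) -> None:
        self._timeout_minutes = timeout_minutes
        self.resolved_timeout: Optional[float] = None

    def fit(self) -> "ApiShell":
        request = TimeoutRequest(self._timeout_minutes)
        self.resolved_timeout = resolve_timeout(request)
        return self
```

The shell stays the public entry point, while the rule can be tested without constructing the shell at all.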

## First-wave implementation focus

1. Stabilize the OOP API layer through typed requests/results/specs without breaking `Facade/Builder`.
2. Extract assumptions/preset/filter rules into a separate pure core while keeping the current strategy/builder classes.
3. Extract the preprocessing plan/state and reduce implicit mutable state inside the preprocessor.
4. Separate repository IO from pure parsing/filtering/query logic.
5. Introduce a single extension contract for external models that does not require editing several internal configs.
6. Rewrite remote config parsing on a safe typed model without `eval` and the sentinel `'None'`.
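
Item 6 could look roughly like the sketch below. The config keys (`task`, `train_data`, `target`, `var_names`) and the `RemoteFitConfig` type are assumptions made for illustration, not the actual remote config schema. The point is that `ast.literal_eval` parses literals without executing code, and a missing value becomes a real `None` instead of the string `'None'`.

```python
import ast
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Tuple


@dataclass(frozen=True)
class RemoteFitConfig:
    """Hypothetical typed view of one remote config section."""
    task: str
    train_data: str
    target: Optional[str]
    var_names: Tuple[str, ...]


def parse_literal(raw: str) -> Any:
    """Parse a Python literal; unlike eval, this never executes code."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Not a literal (for example a bare path): keep it as a plain string.
        return raw


def parse_remote_config(section: Mapping[str, str]) -> RemoteFitConfig:
    target_raw = section.get("target", "").strip()
    var_names = parse_literal(section.get("var_names", "[]"))
    if not isinstance(var_names, (list, tuple)):
        raise ValueError("var_names must be a list literal")
    return RemoteFitConfig(
        task=section["task"],
        train_data=section["train_data"],
        # Empty or legacy 'None' values become a real None at the boundary.
        target=None if target_raw in ("", "None") else target_raw,
        var_names=tuple(str(name) for name in var_names),
    )
```

Downstream code then checks `config.target is None` rather than comparing against a string sentinel.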

## External model contract

- Canonical entities: `ExtensionManifest`, `ExternalModelSpec`, `ModelCapabilities`, `ModelFactory`, `ModelHyperparamsSchema`, `ExtensionError`.
- Canonical integration path:
`create manifest -> validate/register -> smoke test`.
- The new contract should be OOP-friendly for users and LLM-agent-friendly for automation.
- Legacy JSON repositories remain a supported boundary layer but are not the recommended primary extension mechanism.
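
The `create manifest -> validate/register -> smoke test` path might be sketched as follows. Only the entity names come from the plan; every field, the `Registry` class, and its methods are assumptions for illustration and may differ from what `fedot/extensions` actually defines.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple


class ExtensionError(Exception):
    """Raised when a manifest fails validation (sketch only)."""


@dataclass(frozen=True)
class ModelCapabilities:
    tasks: Tuple[str, ...]  # assumed field: task names the model supports


@dataclass(frozen=True)
class ExternalModelSpec:
    name: str
    capabilities: ModelCapabilities
    factory: Callable[..., Any]  # stands in for ModelFactory
    default_params: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class ExtensionManifest:
    extension_id: str
    models: Tuple[ExternalModelSpec, ...]


class Registry:
    """Hypothetical registry: validates a manifest before storing its models."""

    def __init__(self) -> None:
        self._models: Dict[str, ExternalModelSpec] = {}

    def register(self, manifest: ExtensionManifest) -> None:
        # Validate everything first so a bad manifest registers nothing.
        for spec in manifest.models:
            if not spec.capabilities.tasks:
                raise ExtensionError(f"{spec.name}: no supported tasks declared")
            if spec.name in self._models:
                raise ExtensionError(f"{spec.name}: already registered")
        for spec in manifest.models:
            self._models[spec.name] = spec

    def smoke_test(self, name: str) -> Any:
        """Build the model once with its defaults to catch wiring errors early."""
        spec = self._models[name]
        return spec.factory(**spec.default_params)
```

In this sketch a user touches only the manifest; no internal config file needs editing.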

## Test strategy

- New canonical test structure: `tests/`, mirroring `fedot/`.
- The test type is expressed through pytest markers rather than the directory name.
- Service/facade tests are mandatory for OOP coordinators.
- Unit/property tests are mandatory for pure collaborators.
- First mirrored clusters: `tests/extensions`, then `tests/api`, `tests/core`, `tests/preprocessing`, `tests/remote`.
53 changes: 53 additions & 0 deletions docs/dev/fp_refactoring_pr1_slice.md
@@ -0,0 +1,53 @@
# First PR: OOP Shell over Typed Pure Core

## What this PR is about

This PR delivers the first consistent vertical slice of the refactoring plan:
the public OOP API and the core objects are kept,
decision-making logic is moved into pure functions,
and so are parameter validation and normalization.

## What changed

- `fedot/extensions`
  - extension contract
  - registry
  - operation discovery bridge
  - runtime adapter
  - typed extension parameter resolution
- `fedot/remote`
  - safe typed pipeline config parsing without `eval`
- `fedot/api`
  - typed run/service planning rules
  - extracted params/defaulting/recommendation/preset/assumption rules
  - `Fedot` facade still preserved as OOP shell
- `fedot/preprocessing`
  - source, merge and optional-preprocessing planning rules
- `fedot/core/repository`
  - typed operation query and pipeline operation split rules
- `fedot/core/pipelines`
  - pipeline preprocess/postprocess rules
  - pipeline node parameter normalization rules
- `fedot/core/operations`
  - operation parameter normalization/change-tracking rules
- `tests/`
  - mirrored tree for `api`, `core`, `extensions`, `preprocessing`, `remote`

## Architectural effect

- The OOP areas of responsibility stay in place.
- Hidden branching and normalization logic has been moved into small pure functions.
- Expected failures at the new boundaries are represented more explicitly.
- External model integration no longer depends on editing several internal configs.
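
The commit messages mention an either-based fit result and `is_left`/`is_right` checks. Which library provides them is not shown here, so the sketch below defines its own minimal `Left`/`Right` pair purely for illustration; `validate_cv_folds` is a hypothetical rule. It shows how an expected failure becomes a returned value instead of an exception.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

E = TypeVar("E")
A = TypeVar("A")


@dataclass(frozen=True)
class Left(Generic[E]):
    """Failure branch: carries an error description."""
    error: E

    def is_left(self) -> bool:
        return True

    def is_right(self) -> bool:
        return False

    def map(self, fn: Callable) -> "Left":
        return self  # errors short-circuit: fn is never applied


@dataclass(frozen=True)
class Right(Generic[A]):
    """Success branch: carries the computed value."""
    value: A

    def is_left(self) -> bool:
        return False

    def is_right(self) -> bool:
        return True

    def map(self, fn: Callable) -> "Right":
        return Right(fn(self.value))


def validate_cv_folds(raw: object) -> Union[Left, Right]:
    """Hypothetical rule: the failure is part of the return type."""
    if isinstance(raw, bool) or not isinstance(raw, int):
        return Left("cv_folds must be an integer")
    if raw < 2:
        return Left("cv_folds must be at least 2")
    return Right(raw)
```

A caller branches on `result.is_left()` at the boundary and only there decides whether to raise, log, or fall back.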

## What was intentionally left out of this PR

- refactoring of industrial
- CI refactoring
- work on models and methods for features

## Suggested review order
1. extension contract and runtime bridge
2. remote config safety changes
3. api/core/preprocessing pure-rule extractions
4. mirrored tests structure and new `pytest` markers
266 changes: 266 additions & 0 deletions docs/files/sampling_stage_change_review_2026-03-06.md
@@ -0,0 +1,266 @@
# Sampling Stage Integration Review
Date: 2026-03-06
Scope: changes for `sampling_config` + pre-fit `sampling_stage` integration into `Fedot.fit`.

## Final Stage Status
- Attempted final test run:
  - `python -m pytest ...` -> `No module named pytest`
- Final pytest stage is skipped as not runnable in current environment.
- Fallback verification completed:
  - `python -m py_compile` passed for all changed/new Python files.

## Change Scope (Implemented)
- API/defaults: `sampling_config` added and validated.
- New subsystem: `fedot/api/sampling_stage/{config.py,providers.py,executor.py}`.
- Fit integration: sampling stage executed before composition, metadata exposed.
- Optional dependency: `fedot[sampling_zoo]` extra added.
- Tests: unit + integration tests for config, provider, executor, and fit behavior.
- Docs/examples: advanced guide, README section, classification example note.

## 1) Architecture Review

### A1. Provider Contract Is Heuristic and Version-Sensitive
Problem:
- `SamplingZooProvider` discovers indices through multiple fallback paths (`sample_indices`, attrs, `get_partitions`) without a strict external contract.

Why it matters:
- Changes in Sampling Zoo internals can break extraction logic, causing hard failures or incorrect sampling behavior.

Options:
1. Do nothing.
- Effort: Low
- Risk: Medium
- Payoff: Low
- Maintenance cost: Medium

2. Define and enforce a strict provider adapter contract for FEDOT integration.
- Effort: Medium
- Risk: Low
- Payoff: High
- Maintenance cost: Low

3. Maintain versioned adapters (`sampling_zoo_v1`, `sampling_zoo_v2`) with explicit compatibility checks.
- Effort: Medium/High
- Risk: Low
- Payoff: High
- Maintenance cost: Medium

Recommended:
- Option 2 for near term; move to Option 3 when multiple Sampling Zoo API generations must be supported.
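
Option 2 could be approached with a `typing.Protocol` plus explicit output checks, as in the sketch below. The method signature, `checked_sample_indices`, and `HeadProvider` are assumptions for illustration; the real `SamplingZooProvider` interface may differ.

```python
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class SamplingProvider(Protocol):
    """Assumed contract: a single, explicit way to obtain row positions."""

    def sample_indices(self, n_rows: int, sample_size: int, seed: int) -> Sequence[int]:
        ...


def checked_sample_indices(provider: object, n_rows: int,
                           sample_size: int, seed: int = 0) -> List[int]:
    """Call the provider and fail fast if the contract is violated."""
    if not isinstance(provider, SamplingProvider):
        raise TypeError("provider does not implement sample_indices")
    indices = list(provider.sample_indices(n_rows, sample_size, seed))
    if len(indices) > sample_size:
        raise ValueError("provider returned more indices than requested")
    if len(set(indices)) != len(indices):
        raise ValueError("provider returned duplicate indices")
    if any(not 0 <= i < n_rows for i in indices):
        raise ValueError("provider returned out-of-range indices")
    return indices


class HeadProvider:
    """Toy provider used only to exercise the contract."""

    def sample_indices(self, n_rows: int, sample_size: int, seed: int) -> Sequence[int]:
        return range(min(n_rows, sample_size))
```

With a single checked entry point, a change in the upstream library surfaces as a clear contract error rather than as silently wrong sampling.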

### A2. Subset Construction May Share Mutable Supplementary State
Problem:
- `SamplingStageExecutor._subset_by_positions` reuses `data.supplementary_data` reference when creating reduced `InputData`.

Why it matters:
- Mutable shared state can create non-obvious side effects across stages/pipelines.

Options:
1. Do nothing.
- Effort: Low
- Risk: Medium
- Payoff: Low
- Maintenance cost: Medium

2. Deep-copy `supplementary_data` for sampled dataset.
- Effort: Low
- Risk: Low
- Payoff: Medium
- Maintenance cost: Low

3. Introduce immutable or copy-on-write semantics for supplementary metadata.
- Effort: Medium/High
- Risk: Low
- Payoff: High
- Maintenance cost: Medium

Recommended:
- Option 2 now; Option 3 only if broader data mutability problems appear.
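
A minimal sketch of Option 2, using simplified stand-ins (`Dataset`, `SupplementaryData`) rather than the real `InputData` classes, which are assumed to behave similarly with respect to reference sharing:

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SupplementaryData:
    """Simplified stand-in for the supplementary metadata object."""
    flags: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Dataset:
    """Simplified stand-in for InputData."""
    features: List[List[float]]
    supplementary_data: SupplementaryData


def subset_by_positions(data: Dataset, positions: List[int]) -> Dataset:
    """Build a reduced dataset that shares no mutable state with the source."""
    return Dataset(
        features=[list(data.features[i]) for i in positions],
        # Deep copy so later mutations of the subset cannot leak back.
        supplementary_data=copy.deepcopy(data.supplementary_data),
    )
```

Mutating `subset.supplementary_data` after this call leaves the original dataset untouched, which is the isolation property the review asks for.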

## 2) Code Quality Review

### C1. Guard Validation Is Partially Type-Specific
Problem:
- Heavy-parameter guards in `validate_sampling_config` primarily check integer forms (`n_partitions`, `sample_size`, etc.), while some non-int shapes may bypass limits.

Why it matters:
- Invalid or heavy configs may slip through validation and produce expensive runtime behavior.

Options:
1. Do nothing.
- Effort: Low
- Risk: Medium
- Payoff: Low
- Maintenance cost: Medium

2. Normalize and validate all accepted numeric representations (int/float/list/tuple where relevant).
- Effort: Medium
- Risk: Low
- Payoff: High
- Maintenance cost: Low

3. Add provider-specific schema validation plugins.
- Effort: Medium/High
- Risk: Low
- Payoff: High
- Maintenance cost: Medium

Recommended:
- Option 2 in V1 hardening; Option 3 only when multiple providers are active.
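
Option 2 might be sketched as below. The helper names and the upper limits passed in are hypothetical; only the parameter names `n_partitions` and `sample_size` come from the review. Every accepted shape is normalized to a list of ints before the limit is checked, so a float or a tuple cannot bypass the guard.

```python
import math
from numbers import Real
from typing import Any, List


def as_int_values(value: Any) -> List[int]:
    """Normalize an int, an integral float, or a list/tuple of them to ints."""
    items = value if isinstance(value, (list, tuple)) else [value]
    result = []
    for item in items:
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(item, bool) or not isinstance(item, Real):
            raise TypeError(f"expected a number, got {item!r}")
        if not math.isfinite(item) or float(item) != int(item):
            raise ValueError(f"expected a finite integral value, got {item!r}")
        result.append(int(item))
    return result


def check_limit(name: str, value: Any, upper: int) -> List[int]:
    """Apply the same guard to every normalized value."""
    values = as_int_values(value)
    for number in values:
        if not 1 <= number <= upper:
            raise ValueError(f"{name}={number} is outside [1, {upper}]")
    return values
```

For example, `check_limit("n_partitions", 4.0, 10)` is accepted and normalized, while `check_limit("n_partitions", [5, 50], 10)` is rejected.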

### C2. Final Sampling Randomly Re-Selects from Extracted Indices
Problem:
- After extracting candidate indices from strategy output, provider may randomly choose a subset up to `sample_size`.

Why it matters:
- If strategy output is already ranked/structured, extra random reduction can weaken algorithm intent and reproducibility semantics.

Options:
1. Do nothing.
- Effort: Low
- Risk: Medium
- Payoff: Low
- Maintenance cost: Low

2. Prefer strategy-native final selection when available; fallback to random only if needed.
- Effort: Medium
- Risk: Low
- Payoff: High
- Maintenance cost: Low

3. Require strategy to return exactly final indices count and fail otherwise.
- Effort: Medium
- Risk: Medium
- Payoff: High
- Maintenance cost: Medium

Recommended:
- Option 2 for compatibility + quality balance.
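
One way to read Option 2 is sketched below. The function name and the `native_select` callback are assumptions for illustration; the actual provider may expose strategy-native selection differently.

```python
import random
from typing import Callable, List, Optional, Sequence

Selector = Callable[[Sequence[int], int], Sequence[int]]


def finalize_indices(candidates: Sequence[int], sample_size: int, seed: int,
                     native_select: Optional[Selector] = None) -> List[int]:
    """Reduce candidates to at most sample_size, preferring the strategy's own choice."""
    if len(candidates) <= sample_size:
        return list(candidates)  # nothing to reduce, keep the strategy's order
    if native_select is not None:
        return list(native_select(candidates, sample_size))
    # Fallback only: seeded random reduction keeps the result reproducible.
    rng = random.Random(seed)
    return sorted(rng.sample(list(candidates), sample_size))
```

A ranked strategy could pass `native_select=lambda c, k: c[:k]` to keep its top-ranked rows, so the random path is taken only when the strategy offers nothing better.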

## 3) Test Review

### T1. No Executed End-to-End Test with Real Sampling Zoo in This Environment
Problem:
- Tests were authored, but final execution is blocked by missing `pytest` package; no runtime E2E signal with installed optional dependency.

Why it matters:
- Integration defects can remain hidden until real environment execution.

Options:
1. Do nothing.
- Effort: Low
- Risk: High
- Payoff: Low
- Maintenance cost: Low

2. Add CI lane with `fedot[sampling_zoo]` and run dedicated markers.
- Effort: Medium
- Risk: Low
- Payoff: High
- Maintenance cost: Medium

3. Add nightly AMLB-style smoke benchmark for sampling stage.
- Effort: Medium/High
- Risk: Low
- Payoff: High
- Maintenance cost: Medium/High

Recommended:
- Option 2 immediately; Option 3 as performance/quality observability extension.

### T2. Missing Regression Cases for DataFrame Features and Metadata Isolation
Problem:
- Tests mostly use numpy-like datasets and mocked provider paths.

Why it matters:
- Potential regressions in DataFrame handling and shared supplementary metadata may not be detected.

Options:
1. Do nothing.
- Effort: Low
- Risk: Medium
- Payoff: Low
- Maintenance cost: Low

2. Add unit/integration tests for DataFrame features, categorical columns, and supplementary metadata isolation.
- Effort: Low/Medium
- Risk: Low
- Payoff: High
- Maintenance cost: Low

3. Add property-based tests for sampling indices and data consistency invariants.
- Effort: Medium
- Risk: Low
- Payoff: High
- Maintenance cost: Medium

Recommended:
- Option 2 now; Option 3 later if index-related bugs appear in production.

## 4) Performance Review

### P1. Repeated Feature Encoding for Each Candidate Ratio
Problem:
- The effective-size protocol rebuilds training matrices and model fits for each candidate.

Why it matters:
- Sampling overhead can consume a meaningful part of budget on medium/large tabular datasets.

Options:
1. Do nothing.
- Effort: Low
- Risk: Medium
- Payoff: Low
- Maintenance cost: Low

2. Cache transformed validation matrix and reusable feature engineering outputs.
- Effort: Medium
- Risk: Low
- Payoff: Medium/High
- Maintenance cost: Medium

3. Add adaptive candidate schedule with early elimination and dynamic stopping.
- Effort: Medium
- Risk: Low
- Payoff: High
- Maintenance cost: Medium

Recommended:
- Option 3 plus targeted caching from Option 2 for the largest workloads.
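
The dynamic-stopping part of Option 3 could be sketched as follows. This is a simplified early-stopping schedule, not the effective-size protocol itself; `select_ratio`, `min_gain`, and `patience` are hypothetical names, and `evaluate` stands in for whatever fits and scores the baseline at a given ratio.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def select_ratio(ratios: Sequence[float], evaluate: Callable[[float], float],
                 min_gain: float = 0.005, patience: int = 1) -> Tuple[float, List[float]]:
    """Try ratios in ascending order and stop once gains stay below min_gain."""
    best_ratio: Optional[float] = None
    best_score = float("-inf")
    stale = 0
    tried: List[float] = []
    for ratio in sorted(ratios):
        score = evaluate(ratio)  # the expensive step: one fit per candidate
        tried.append(ratio)
        if score > best_score + min_gain:
            best_ratio, best_score, stale = ratio, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # larger candidates are skipped entirely
    if best_ratio is None:
        raise ValueError("ratios must not be empty")
    return best_ratio, tried
```

Because the largest candidates are also the most expensive to fit, skipping them once the score plateaus saves the most budget.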

### P2. Fixed RF Baseline Complexity (`n_estimators=100`)
Problem:
- Baseline model cost is fixed and may be too expensive under tight time budgets.

Why it matters:
- High stage cost can reduce AutoML search time and offset sampling benefit.

Options:
1. Do nothing.
- Effort: Low
- Risk: Medium
- Payoff: Low
- Maintenance cost: Low

2. Add lightweight baseline config (`n_estimators`, depth, model family) in `sampling_config`.
- Effort: Medium
- Risk: Low
- Payoff: High
- Maintenance cost: Low

3. Auto-scale baseline complexity from dataset size and stage budget.
- Effort: Medium/High
- Risk: Medium
- Payoff: High
- Maintenance cost: Medium

Recommended:
- Option 2 first; Option 3 later when benchmark telemetry is available.
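
Option 2 might be expressed as a small typed config, sketched below. The `baseline` key inside `sampling_config` and all field names except `n_estimators` are assumptions; the default of 100 mirrors the value the review reports as currently fixed.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional

ALLOWED_KEYS = {"model_family", "n_estimators", "max_depth"}


@dataclass(frozen=True)
class BaselineConfig:
    model_family: str = "rf"
    n_estimators: int = 100  # the value currently fixed, per the review
    max_depth: Optional[int] = None

    def estimator_kwargs(self) -> Dict[str, Any]:
        """Keyword arguments to hand to whichever estimator is built."""
        kwargs: Dict[str, Any] = {"n_estimators": self.n_estimators}
        if self.max_depth is not None:
            kwargs["max_depth"] = self.max_depth
        return kwargs


def baseline_from_sampling_config(sampling_config: Mapping[str, Any]) -> BaselineConfig:
    """Read the optional (hypothetical) 'baseline' section with fail-fast checks."""
    raw = dict(sampling_config.get("baseline", {}))
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown baseline keys: {sorted(unknown)}")
    config = BaselineConfig(**raw)
    if config.n_estimators < 1:
        raise ValueError("n_estimators must be at least 1")
    return config
```

An empty config reproduces today's behavior, while `{"baseline": {"n_estimators": 20, "max_depth": 6}}` gives a cheaper baseline for tight budgets.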

## Consolidated Recommendation
- The current implementation is a solid V1 integration aligned with fail-fast and dynamic cap constraints.
- Main hardening targets before a broad production rollout:
  - enforce a stricter provider contract,
  - isolate mutable dataset metadata,
  - extend guard validation,
  - run CI with the real optional dependency.
1 change: 1 addition & 0 deletions docs/source/advanced/index.rst
@@ -8,6 +8,7 @@ Advanced usage
automated_pipelines_design
hyperparameters_tuning
data_preprocessing
sampling_stage
project_import_export
pipeline_import_export
cli_call