Sampling zoo integration #1423

Open
artemlunev2000 wants to merge 3 commits into codex/arch_refactoring from sampling_zoo_integration

@artemlunev2000
Collaborator

Summary

This PR continues the first stage of Sampling Zoo integration. Chunking and subset strategies are now explicitly separated by strategy_kind. Subset runs follow the standard single-dataset training path with sample selection, while chunking produces multiple InputData partitions and trains a PipelineEnsemble over them. For predict, the ensemble replaces the current pipeline and preserves existing API behavior where possible.

Context

@github-actions
Contributor

Code in this pull request contains PEP8 errors. Please write the /fix-pep8 command in the comments below to create a commit with automatic fixes.

Comment on lines +63 to +83
if strategy_kind == 'chunking':
    return self._sample_chunking(
        factory=factory,
        features=features,
        target=target,
        strategy=strategy,
        strategy_params=strategy_params,
        random_state=random_state
    )
elif strategy_kind == 'subset':
    return self._sample_subset(
        factory=factory,
        features=features,
        target=target,
        strategy=strategy,
        strategy_params=strategy_params,
        random_state=random_state,
        injectable_params=injectable_params
    )
else:
    raise ValueError(f'Unsupported sampling strategy kind: {strategy_kind}')

Collaborator

Rewrite all branches like this as membership checks against an enumerated type, or as a lookup in a mapping (dictionary), so the code stays extensible:

strategy_kind in available_strategies
or, in this case,
return available_sample_methods[strategy_kind]
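
The mapping-based dispatch suggested here could look roughly like the sketch below. It is only an illustration under assumptions: `StrategyKind`, `AVAILABLE_SAMPLE_METHODS`, `sample_chunking`, `sample_subset`, and `sample` are hypothetical names, and the handler bodies are placeholders rather than the PR's actual sampling logic.

```python
from enum import Enum
from typing import Any, Callable, Dict


class StrategyKind(str, Enum):
    """Hypothetical enumeration of the supported strategy kinds."""
    CHUNKING = 'chunking'
    SUBSET = 'subset'


def sample_chunking(**kwargs: Any) -> str:
    # Placeholder: the real handler would build the data partitions.
    return 'chunking'


def sample_subset(**kwargs: Any) -> str:
    # Placeholder: the real handler would select a subset of rows.
    return 'subset'


# One entry per strategy kind; the dispatch code itself never changes.
AVAILABLE_SAMPLE_METHODS: Dict[StrategyKind, Callable[..., Any]] = {
    StrategyKind.CHUNKING: sample_chunking,
    StrategyKind.SUBSET: sample_subset,
}


def sample(strategy_kind: str, **kwargs: Any) -> Any:
    """Validate the kind against the enum, then dispatch via the mapping."""
    try:
        kind = StrategyKind(strategy_kind)
    except ValueError:
        raise ValueError(
            f'Unsupported sampling strategy kind: {strategy_kind}'
        ) from None
    return AVAILABLE_SAMPLE_METHODS[kind](**kwargs)
```

With this shape, supporting a new strategy kind means adding one enum member and one mapping entry, and the `if`/`elif` chain disappears.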

Collaborator

With the approach currently implemented, adding new sampling strategies later will be inconvenient.

Comment on lines +85 to +93
def _sample_subset(self,
                   factory: Any,
                   features: np.ndarray,
                   target: np.ndarray,
                   strategy: str,
                   strategy_params: Dict[str, Any],
                   random_state: Optional[int],
                   injectable_params: Optional[Dict[str, Any]]) -> SamplingProviderResult:
    n_rows = int(features.shape[0])

Collaborator

Move this out into pure functions outside SamplingProvider.

Collaborator

decouple providers
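
As a rough illustration of what module-level pure functions for the subset path might look like, here is a minimal sketch. The names `draw_subset_indices` and `take_rows`, the `fraction` parameter, and the uniform random selection are all assumptions made for the example; in the PR the actual selection comes from the Sampling Zoo strategy.

```python
from typing import Optional, Tuple

import numpy as np


def draw_subset_indices(n_rows: int,
                        fraction: float,
                        random_state: Optional[int]) -> np.ndarray:
    """Return sorted, unique row indices for a uniform random subset.

    Pure function: the result depends only on the arguments.
    Uniform sampling is an assumption made for this sketch.
    """
    if n_rows <= 0:
        raise ValueError('n_rows must be positive')
    if not 0.0 < fraction <= 1.0:
        raise ValueError('fraction must be in (0, 1]')
    rng = np.random.default_rng(random_state)
    n_selected = max(1, int(round(n_rows * fraction)))
    return np.sort(rng.choice(n_rows, size=n_selected, replace=False))


def take_rows(features: np.ndarray,
              target: np.ndarray,
              indices: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Select the same rows from features and target."""
    return np.asarray(features)[indices], np.asarray(target)[indices]
```

Because neither function touches provider state, each can be unit-tested on its own and reused by any provider.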

Comment on lines +104 to +109
def _execute_chunking(self,
                      train_data: InputData,
                      started_at: float,
                      budget_seconds: float) -> SamplingStageOutput:
    self._raise_if_budget_exceeded(started_at, budget_seconds)
    remaining_budget = self._remaining_budget(started_at, budget_seconds)

Collaborator

Move the _execute_* methods out into pure functions; decouple executors.
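
One possible first step toward that is to make the time-budget helpers pure, as in the sketch below. The function names mirror `_remaining_budget` and `_raise_if_budget_exceeded` from the diff, but the bodies, the optional `now` argument, the use of `time.monotonic()`, and the `TimeoutError` exception type are assumptions for illustration, not the PR's implementation.

```python
import time
from typing import Optional


def remaining_budget(started_at: float,
                     budget_seconds: float,
                     now: Optional[float] = None) -> float:
    """Seconds left in the budget, never negative.

    `now` can be injected so the function is deterministic in tests;
    by default the monotonic clock is read (an assumption of this sketch).
    """
    current = time.monotonic() if now is None else now
    return max(0.0, budget_seconds - (current - started_at))


def raise_if_budget_exceeded(started_at: float,
                             budget_seconds: float,
                             now: Optional[float] = None) -> None:
    """Raise TimeoutError once the budget is used up."""
    if remaining_budget(started_at, budget_seconds, now) <= 0.0:
        raise TimeoutError(
            f'Sampling stage exceeded its budget of {budget_seconds} s'
        )
```

An executor written as a free function can then take `train_data`, `started_at`, and `budget_seconds` as plain arguments and call these helpers without needing `self`.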

    return np.asarray(target)[indices]

@staticmethod
def _partitions_to_input_data_list(partitions: Dict[str, Any],

Collaborator

The _partitions_to_input_data_list method is very overloaded. Look at @Romankkl03's work on TensorData: study the new data-flow protocol and adapt the work in this PR to it.

_SAMPLING_MODULE_CANDIDATES = (
    'sampling_zoo.core.api.api_main',
    'sampling_zoo.api.api_main',
    'core.api.api_main',

Collaborator

Drop the internal dependency right away.
