Contributor

@abhinavg4 abhinavg4 commented Jan 5, 2026

Description

Set the Xenna stage spec for URLGenerationStage and FilePartitioningStage to one worker per node. Because these stages start from an EmptyTask, they do not need multiple workers.

Usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR optimizes resource allocation for two fanout stages (FilePartitioningStage and URLGenerationStage) by configuring them to use exactly one worker per node in the Xenna executor.

Context: Both stages are "fanout" stages that start from _EmptyTask (a task with no actual data) and generate multiple output tasks. FilePartitioningStage scans file paths and creates FileGroupTask objects for parallel processing, while URLGenerationStage generates URLs and creates tasks for each URL. Since these stages perform lightweight orchestration rather than heavy data processing, they don't benefit from multiple workers per node.

Changes Made:

  • Added xenna_stage_spec() method to FilePartitioningStage that returns {"num_workers_per_node": 1}
  • Added xenna_stage_spec() method to URLGenerationStage that returns {"num_workers_per_node": 1}
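As a rough illustration of the two changes listed above, the override could look like the sketch below. The class names ending in `Sketch` are hypothetical stand-ins so the example runs without NeMo Curator installed, and the base class returning an empty dict by default is an assumption, not something this PR states.

```python
from typing import Any, Dict


class ProcessingStageSketch:
    """Hypothetical stand-in for the real stage base class."""

    def xenna_stage_spec(self) -> Dict[str, Any]:
        # Assumed default: no stage-specific Xenna overrides.
        return {}


class FilePartitioningStageSketch(ProcessingStageSketch):
    """Stand-in for FilePartitioningStage with the override from this PR."""

    def xenna_stage_spec(self) -> Dict[str, Any]:
        # The stage starts from an empty task, so one worker per node suffices.
        return {"num_workers_per_node": 1}


class URLGenerationStageSketch(ProcessingStageSketch):
    """Stand-in for URLGenerationStage with the same override."""

    def xenna_stage_spec(self) -> Dict[str, Any]:
        return {"num_workers_per_node": 1}
```

Both overrides return the same static dictionary, which keeps the change free of new logic or state.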

How it works: The Xenna executor calls stage.xenna_stage_spec() to get stage-specific configuration and passes the num_workers_per_node value to the underlying StageSpec (see nemo_curator/backends/xenna/executor.py, line 91). This configuration controls how many workers are allocated per node for that specific stage.
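The hand-off described above can be sketched roughly as follows. This is not the code in `executor.py`; `StageSpecSketch`, `build_stage_spec`, and the two stage classes are hypothetical names, and the assumption that a missing key leaves the value unset (`None`) is illustrative only.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class StageSpecSketch:
    """Hypothetical stand-in for the Xenna StageSpec; only one field is modeled."""

    stage_name: str
    num_workers_per_node: Optional[int] = None


class FanoutStageSketch:
    """Hypothetical stage that pins itself to one worker per node."""

    name = "fanout_stage"

    def xenna_stage_spec(self) -> Dict[str, Any]:
        return {"num_workers_per_node": 1}


class PlainStageSketch:
    """Hypothetical stage with no stage-specific configuration."""

    name = "plain_stage"

    def xenna_stage_spec(self) -> Dict[str, Any]:
        return {}


def build_stage_spec(stage: Any) -> StageSpecSketch:
    # Ask the stage for its Xenna-specific settings, then forward the
    # per-node worker count (or None when the stage does not set one).
    spec = stage.xenna_stage_spec()
    return StageSpecSketch(
        stage_name=stage.name,
        num_workers_per_node=spec.get("num_workers_per_node"),
    )
```

In this sketch, `build_stage_spec(FanoutStageSketch())` carries `num_workers_per_node=1`, while a stage that returns an empty dict leaves the field unset.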

Consistency: This change follows the existing pattern used in other stages like DocumentDownloadStage (which uses the downloader's num_workers_per_node() method) and is architecturally similar to ImageDuplicatesRemovalStage (which conditionally sets this value).
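For comparison, the two existing patterns mentioned above might look something like this sketch. Only the existence of a downloader `num_workers_per_node()` method and of a conditional setting comes from the review text; every class name, the `limit_workers` flag, and the return shapes are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


class DownloaderSketch:
    """Hypothetical downloader exposing a per-node worker count."""

    def __init__(self, workers_per_node: Optional[int] = None) -> None:
        self._workers_per_node = workers_per_node

    def num_workers_per_node(self) -> Optional[int]:
        return self._workers_per_node


class DownloadStageSketch:
    """Delegating pattern: the stage forwards the downloader's value."""

    def __init__(self, downloader: DownloaderSketch) -> None:
        self.downloader = downloader

    def xenna_stage_spec(self) -> Dict[str, Any]:
        return {"num_workers_per_node": self.downloader.num_workers_per_node()}


class ConditionalStageSketch:
    """Conditional pattern: the key is set only when a (hypothetical) flag is on."""

    def __init__(self, limit_workers: bool) -> None:
        self.limit_workers = limit_workers

    def xenna_stage_spec(self) -> Dict[str, Any]:
        spec: Dict[str, Any] = {}
        if self.limit_workers:
            spec["num_workers_per_node"] = 1
        return spec
```

The static override in this PR is the simplest of the three variants, since the value never depends on runtime configuration.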

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are minimal, well-scoped, and follow existing patterns in the codebase. Both modifications add the same simple method that returns a static configuration dictionary. The implementation matches the pattern used in DocumentDownloadStage and is architecturally sound. Since these are fanout stages starting from EmptyTask, limiting to one worker per node is the correct optimization. No logic changes, no edge cases introduced, and the code is straightforward.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| nemo_curator/stages/file_partitioning.py | 5/5 | Added xenna_stage_spec() method returning {"num_workers_per_node": 1} to optimize resource allocation for fanout stage starting from EmptyTask |
| nemo_curator/stages/text/download/base/url_generation.py | 5/5 | Added xenna_stage_spec() method returning {"num_workers_per_node": 1} to optimize resource allocation for URL generation fanout stage |

Sequence Diagram

sequenceDiagram
    participant Executor as XennaExecutor
    participant FPS as FilePartitioningStage
    participant UGS as URLGenerationStage
    participant Xenna as Xenna Pipeline

    Note over Executor: execute() called with stages

    Executor->>FPS: xenna_stage_spec()
    FPS-->>Executor: {"num_workers_per_node": 1}
    
    Executor->>Xenna: Create StageSpec with num_workers_per_node=1
    Note over Xenna: Allocates 1 worker per node for FPS

    Executor->>FPS: process(_EmptyTask)
    Note over FPS: Scans file paths<br/>Creates FileGroupTasks
    FPS-->>Xenna: [FileGroupTask_0, FileGroupTask_1, ...]
    Note over Xenna: Fanout: Multiple tasks generated<br/>from single EmptyTask

    Executor->>UGS: xenna_stage_spec()
    UGS-->>Executor: {"num_workers_per_node": 1}
    
    Executor->>Xenna: Create StageSpec with num_workers_per_node=1
    Note over Xenna: Allocates 1 worker per node for UGS

    Executor->>UGS: process(_EmptyTask)
    Note over UGS: Generates URLs<br/>Creates FileGroupTasks
    UGS-->>Xenna: [FileGroupTask_0, FileGroupTask_1, ...]
    Note over Xenna: Fanout: One task per URL

5 participants