torchdata or torchdata-contrib? #1471

@keunwoochoi


my team has been implementing quite a few utilities. some are close to core features, while others are more advanced or more utility-like. for example, their class names and docstrings look like:

class RoundRobinNode(BaseNode[T]):
    """A node that cycles through multiple datasets in a round-robin way."""

class FileListNode(BaseNode[Dict]):
    """Node that lists files from any supported filesystem (local, S3) matching specified patterns.

    Uses fsspec to provide universal file access capabilities for both local and remote files.

    Features:
    - Lists files from supported filesystems (local, S3)
    - Supports glob patterns for file matching
    - Maintains state for checkpointing and resumption
    """

class FileReaderNode(BaseNode[Dict]):
    """Universal node that reads file contents from any supported filesystem.

    Uses smart_open to support local files, S3, HTTP, and more file systems.
    """

class TextStreamDecodeNode(BaseNode[Dict]):
    """Node that streams text files line by line from any source.

    This node combines functionality of file reading and line-by-line processing,
    supporting both local and remote (S3, HTTP, etc.) files via smart_open.

    Features:
    - Streams files line-by-line (memory efficient)
    - Supports local files, S3, HTTP, and more
    - Handles compressed files (.gz, .bz2) transparently
    - Maintains state for checkpointing and resumption
    - Preserves metadata from source nodes
    """

class HuggingFaceDatasetStreamNode(BaseNode[dict]):
    """
    Node that streams examples from a HuggingFace dataset.

    Output format:
        {
            "data": {...},           # Original dataset item
            "metadata": {
                "dataset_name": "squad",
                "split": "train",
                "index": 42
            }
        }

    Input: None (configured with dataset name and split at initialization)
    Output: Dict containing example data and metadata
    """

class JsonlStreamNode(TextStreamDecodeNode):
    """Node that streams JSONL files and parses each line as JSON.

    This node extends TextStreamDecodeNode to add JSON parsing for each line.
    It maintains the same state management and streaming capabilities while adding
    JSONL-specific processing.
    """

and some more.
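
to make the round-robin and checkpointing ideas above concrete, here is a rough standalone sketch. it is only an illustration, not the actual implementation: `RoundRobinSketch` is a hypothetical name, it deliberately does not subclass torchdata's `BaseNode` so that it runs without torchdata installed, it only borrows the `reset` / `next` / `get_state` method shape, and skipping exhausted sources until all are empty is an assumed policy.

```python
from typing import Any, Dict, List, Optional, Sequence


class RoundRobinSketch:
    """Hypothetical stand-in that yields one item per source in turn.

    Assumption: exhausted sources are skipped, and iteration stops only
    when every source is exhausted.
    """

    def __init__(self, sources: Sequence[Sequence[Any]]) -> None:
        # Copy each source into a list so items can be addressed by offset.
        self.sources = [list(s) for s in sources]
        self.reset()

    def reset(self, initial_state: Optional[Dict[str, Any]] = None) -> None:
        state = initial_state or {}
        # Per-source read offsets, plus the index of the source served next.
        self._positions: List[int] = list(
            state.get("positions", [0] * len(self.sources))
        )
        self._turn: int = state.get("turn", 0)

    def next(self) -> Any:
        # Visit each source at most once, starting from the current turn.
        for _ in range(len(self.sources)):
            idx = self._turn
            self._turn = (self._turn + 1) % len(self.sources)
            pos = self._positions[idx]
            if pos < len(self.sources[idx]):
                self._positions[idx] = pos + 1
                return self.sources[idx][pos]
        raise StopIteration  # every source is exhausted

    def get_state(self) -> Dict[str, Any]:
        # Return copies so callers cannot mutate internal state.
        return {"positions": list(self._positions), "turn": self._turn}
```

the state is just the per-source offsets plus whose turn it is, so a fresh instance that is given the saved state via `reset(state)` continues from where the previous one stopped.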

conservatively, i'd say these can be part of, say, torchdata-contrib. but i'd like to hear from the maintainers. where would you suggest drawing the line? any other suggestions would be great, too.
