Open
Description
My team has been implementing quite a few utilities. Some are close to core features; others are more advanced, higher-level utilities. For example, their class names and docstrings look like this:
class RoundRobinNode(BaseNode[T]):
    """A node that cycles through multiple datasets in a round-robin way."""
class FileListNode(BaseNode[Dict]):
    """Node that lists files from any supported filesystem (local, S3) matching specified patterns.

    Uses fsspec to provide universal file access capabilities for both local and remote files.

    Features:
    - Lists files from supported filesystems (local, S3)
    - Supports glob patterns for file matching
    - Maintains state for checkpointing and resumption
    """
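To make the "glob patterns plus checkpointable state" idea concrete, here is a minimal sketch. It is not our actual implementation: it swaps fsspec for the standard-library `glob` module so it runs without extra dependencies (local paths only), and the class name `FileListSketch`, the `position` state key, and the `data`/`metadata` output keys are all hypothetical. The `reset` / `next` / `get_state` method names only mirror the general shape of a node.

```python
import glob
import os
from typing import Dict, Optional, Sequence


class FileListSketch:
    """Hypothetical stand-in for a file-listing node (local files only)."""

    def __init__(self, patterns: Sequence[str]):
        self.patterns = list(patterns)
        self.reset()

    def reset(self, initial_state: Optional[Dict] = None) -> None:
        matches = set()
        for pattern in self.patterns:
            matches.update(glob.glob(pattern, recursive=True))
        # Sort so that a saved position refers to the same file on every run.
        self.files = sorted(p for p in matches if os.path.isfile(p))
        # Resume from the saved position, or start from the beginning.
        self.position = 0 if initial_state is None else initial_state["position"]

    def next(self) -> Dict:
        if self.position >= len(self.files):
            raise StopIteration
        index = self.position
        self.position += 1
        return {"data": self.files[index], "metadata": {"index": index}}

    def get_state(self) -> Dict:
        # The state is just an offset into the sorted file list.
        return {"position": self.position}
```

One caveat with this design: resuming from a saved `position` only makes sense if the set of matching files has not changed between the checkpoint and the restart.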
class FileReaderNode(BaseNode[Dict]):
    """Universal node that reads file contents from any supported filesystem.

    Uses smart_open to support local files, S3, HTTP, and more file systems.
    """
class TextStreamDecodeNode(BaseNode[Dict]):
    """Node that streams text files line by line from any source.

    This node combines functionality of file reading and line-by-line processing,
    supporting both local and remote (S3, HTTP, etc.) files via smart_open.

    Features:
    - Streams files line-by-line (memory efficient)
    - Supports local files, S3, HTTP, and more
    - Handles compressed files (.gz, .bz2) transparently
    - Maintains state for checkpointing and resumption
    - Preserves metadata from source nodes
    """
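As a rough illustration of line-by-line streaming with transparent decompression, here is a sketch that uses only the standard library (`gzip`, `bz2`, built-in `open`) instead of smart_open, so it covers local files only. The function name `stream_lines`, the `start_line` resume parameter, and the `file_path` / `line_number` metadata keys are assumptions made for this example, not part of our actual node.

```python
import bz2
import gzip
from typing import Dict, Iterator

# Map file extensions to an opener; anything else falls back to plain open().
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def stream_lines(path: str, start_line: int = 0, encoding: str = "utf-8") -> Iterator[Dict]:
    """Yield one dict per line, skipping the first `start_line` lines."""
    opener = next((fn for ext, fn in _OPENERS.items() if path.endswith(ext)), open)
    with opener(path, "rt", encoding=encoding) as fh:
        # Iterating the handle reads one line at a time, so memory stays flat.
        for line_number, line in enumerate(fh):
            if line_number < start_line:
                continue  # already consumed before the checkpoint
            yield {
                "data": line.rstrip("\n"),
                "metadata": {"file_path": path, "line_number": line_number},
            }
```

Resuming by re-reading and skipping lines is the simplest option for compressed files, where seeking to a byte offset is not cheap; a checkpoint would then only need to store the last `line_number` plus one.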
class HuggingFaceDatasetStreamNode(BaseNode[dict]):
    """
    Node that streams examples from a HuggingFace dataset.

    Output format:
    {
        "data": {...},  # Original dataset item
        "metadata": {
            "dataset_name": "squad",
            "split": "train",
            "index": 42
        }
    }

    Input: None (configured with dataset name and split at initialization)
    Output: Dict containing example data and metadata
    """
class JsonlStreamNode(TextStreamDecodeNode):
    """Node that streams JSONL files and parses each line as JSON.

    This node extends TextStreamDecodeNode to add JSON parsing for each line.
    It maintains the same state management and streaming capabilities while adding
    JSONL-specific processing.
    """
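The JSONL step itself is small. A sketch of the per-line parsing, written as a plain generator over any iterable of strings rather than as a subclass, could look like the following. The name `iter_jsonl`, the `source` parameter, the choice to skip blank lines, and the choice to raise on malformed lines are all assumptions for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_jsonl(lines: Iterable[str], source: str = "<memory>") -> Iterator[Dict[str, Any]]:
    """Parse each non-blank line as JSON and attach its line number."""
    for line_number, raw in enumerate(lines):
        text = raw.strip()
        if not text:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            # Report where the bad record is instead of a bare decode error.
            raise ValueError(f"{source}:{line_number}: invalid JSON ({exc.msg})") from exc
        yield {"data": record, "metadata": {"source": source, "line_number": line_number}}
```

Because it accepts any iterable of strings, the same function works on an open file handle, a list in a unit test, or the output of a line-streaming step.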
...and a few more.
Conservatively, I'd say these could be part of something like torchdata-contrib, but I'd like to hear from the maintainers. Where would you suggest drawing the line? Any other suggestions would be great, too.