This document provides a complete reference of all blocks available in SDG Hub, their purposes, parameters, and usage examples.
Blocks are the fundamental processing units in SDG Hub. Each block performs a specific transformation on datasets, and blocks can be chained together in flows to create complex data processing pipelines.
### Block

- Registered Name: `Block`
- Purpose: Abstract base class providing common functionality for all blocks
- Key Features:
  - Template validation using Jinja2
  - Configuration file loading (YAML)
  - Input validation methods
- Parameters:
  - `block_name: str` - Name of the block instance
### LLMBlock

- Registered Name: `LLMBlock`
- Purpose: Core block for text generation using language models
- Key Features:
  - OpenAI-compatible API integration
  - Jinja2 prompt templating
  - Configurable output parsing
  - Batch processing support
  - Automatic server capability detection
- Parameters:
  - `block_name: str` - Name of the block
  - `config_path: str` - Path to configuration file
  - `client: openai.OpenAI` - OpenAI client instance
  - `output_cols: List[str]` - Output column names
  - `parser_kwargs: Dict[str, Any]` - Parser configuration
  - `model_prompt: str` - Template for the model prompt (default: `"{prompt}"`)
  - `model_id: Optional[str]` - Model ID to use
  - `**batch_kwargs` - Additional batch processing arguments
- Example Usage:

```yaml
- block_type: LLMBlock
  block_config:
    block_name: gen_knowledge
    config_path: configs/knowledge/simple_generate_qa.yaml
    model_id: mistralai/Mixtral-8x7B-Instruct-v0.1
    output_cols:
      - output
  gen_kwargs:
    temperature: 0.7
    max_tokens: 2048
```
### ConditionalLLMBlock

- Registered Name: `ConditionalLLMBlock`
- Purpose: LLM block that selects different prompt templates based on a selector column
- Key Features:
  - Multiple configuration file support
  - Conditional prompt selection
  - Inherits all LLMBlock functionality
- Parameters:
  - `block_name: str` - Name of the block
  - `config_paths: Dict[str, str]` - Mapping of selector values to config paths
  - `client: openai.OpenAI` - OpenAI client instance
  - `model_id: str` - Model ID to use
  - `output_cols: List[str]` - Output column names
  - `selector_column_name: str` - Column used for template selection
  - `model_prompt: str` - Template for the model prompt
  - `**batch_kwargs` - Additional batch processing arguments
- Example Usage:

```yaml
- block_type: ConditionalLLMBlock
  block_config:
    block_name: conditional_gen
    config_paths:
      "math": configs/skills/math.yaml
      "coding": configs/skills/coding.yaml
    selector_column_name: category
    output_cols: [response]
```
### FilterByValueBlock

- Registered Name: `FilterByValueBlock`
- Purpose: Filter datasets based on column values using various operations
- Key Features:
  - Multiple filter operations (eq, contains, ge, le, gt, lt, ne)
  - Optional data type conversion
  - Parallel processing support
- Parameters:
  - `block_name: str` - Name of the block
  - `filter_column: str` - Column to filter on
  - `filter_value: Union[Any, List[Any]]` - Value(s) to filter by
  - `operation: Callable[[Any, Any], bool]` - Filter operation
  - `convert_dtype: Optional[Union[Type[float], Type[int]]]` - Data type conversion
  - `**batch_kwargs` - Additional batch processing arguments
- Example Usage:

```yaml
- block_type: FilterByValueBlock
  block_config:
    block_name: filter_high_quality
    filter_column: quality_score
    filter_value: 0.8
    operation: operator.ge
    convert_dtype: float
```
### SamplePopulatorBlock

- Registered Name: `SamplePopulatorBlock`
- Purpose: Populate dataset samples with data from configuration files
- Key Features:
  - Multiple configuration file loading
  - Data mapping based on column values
  - Configuration file postfix support
- Parameters:
  - `block_name: str` - Name of the block
  - `config_paths: List[str]` - List of configuration file paths
  - `column_name: str` - Column used as the key for data mapping
  - `post_fix: str` - Suffix for configuration filenames
  - `**batch_kwargs` - Additional batch processing arguments
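The source provides no example for this block; a hypothetical flow entry might look like the following (the paths and column name here are invented for illustration):

```yaml
- block_type: SamplePopulatorBlock
  block_config:
    block_name: populate_context
    config_paths:
      - configs/domains/science.yaml
      - configs/domains/history.yaml
    column_name: domain
    post_fix: _v2
```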
### SelectorBlock

- Registered Name: `SelectorBlock`
- Purpose: Select and map values from one column to another based on a choice mapping
- Parameters:
  - `block_name: str` - Name of the block
  - `choice_map: Dict[str, str]` - Mapping of choice values to column names
  - `choice_col: str` - Column containing choice values
  - `output_col: str` - Column to store selected values
  - `**batch_kwargs` - Additional batch processing arguments
- Example Usage:

```yaml
- block_type: SelectorBlock
  block_config:
    block_name: select_best_response
    choice_map:
      "A": "response_a"
      "B": "response_b"
    choice_col: preferred_choice
    output_col: selected_response
```
### CombineColumnsBlock

- Registered Name: `CombineColumnsBlock`
- Purpose: Combine multiple columns into a single column with a separator
- Parameters:
  - `block_name: str` - Name of the block
  - `columns: List[str]` - List of column names to combine
  - `output_col: str` - Name of the output column
  - `separator: str` - Separator between combined values (default: `"\n\n"`)
  - `**batch_kwargs` - Additional batch processing arguments
- Example Usage:

```yaml
- block_type: CombineColumnsBlock
  block_config:
    block_name: combine_qa_pair
    columns: [question, answer]
    output_col: qa_text
    separator: "\n\nAnswer: "
```
### FlattenColumnsBlock

- Registered Name: `FlattenColumnsBlock`
- Purpose: Transform wide format to long format by melting columns into rows
- Parameters:
  - `block_name: str` - Name of the block
  - `var_cols: List[str]` - Columns to be melted into rows
  - `value_name: str` - Name of the new value column
  - `var_name: str` - Name of the new variable column
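The wide-to-long melt this block performs can be illustrated in plain Python. This is a sketch of the semantics only, not the block's actual implementation, and the column names are invented:

```python
# Sketch of the wide-to-long "melt" that FlattenColumnsBlock performs:
# each column in var_cols becomes its own row, with the column name stored
# under var_name and the cell value under value_name.
def melt(rows, var_cols, value_name, var_name):
    long_rows = []
    for row in rows:
        # Columns not being melted are carried along as identifiers.
        id_vals = {k: v for k, v in row.items() if k not in var_cols}
        for col in var_cols:
            long_rows.append({**id_vals, var_name: col, value_name: row[col]})
    return long_rows

wide = [{"id": 1, "question": "Q1", "answer": "A1"}]
melt(wide, ["question", "answer"], value_name="value", var_name="variable")
# Each wide row yields len(var_cols) long rows.
```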
### DuplicateColumns

- Registered Name: `DuplicateColumns`
- Purpose: Create copies of existing columns with new names
- Parameters:
  - `block_name: str` - Name of the block
  - `columns_map: Dict[str, str]` - Mapping of existing to new column names
- Example Usage:

```yaml
- block_type: DuplicateColumns
  block_config:
    block_name: backup_columns
    columns_map:
      original_text: backup_text
      processed_text: backup_processed
```
### RenameColumns

- Registered Name: `RenameColumns`
- Purpose: Rename columns in a dataset according to a mapping dictionary
- Parameters:
  - `block_name: str` - Name of the block
  - `columns_map: Dict[str, str]` - Mapping of old to new column names
- Example Usage:

```yaml
- block_type: RenameColumns
  block_config:
    block_name: standardize_names
    columns_map:
      input_text: prompt
      output_text: response
```
### SetToMajorityValue

- Registered Name: `SetToMajorityValue`
- Purpose: Set all values in a column to the most frequent (majority) value
- Parameters:
  - `block_name: str` - Name of the block
  - `col_name: str` - Name of the column to set to the majority value
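The behavior can be sketched in a few lines of Python. This illustrates the semantics only; the block's actual tie-breaking for equally frequent values may differ:

```python
from collections import Counter

# Sketch of SetToMajorityValue's behavior: every value in the column is
# replaced by the most frequent value. Counter.most_common resolves ties
# by first-seen order here, which may not match the block's behavior.
def set_to_majority(values):
    majority, _count = Counter(values).most_common(1)[0]
    return [majority] * len(values)

set_to_majority(["en", "fr", "en", "de", "en"])
# → ['en', 'en', 'en', 'en', 'en']
```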
### IterBlock

- Registered Name: `IterBlock`
- Purpose: Apply another block multiple times iteratively
- Parameters:
  - `block_name: str` - Name of the block
  - `num_iters: int` - Number of iterations
  - `block_type: Type[Block]` - Block class to instantiate
  - `block_kwargs: Dict[str, Any]` - Arguments for the block constructor
  - `**kwargs` - Additional arguments, including `gen_kwargs`
- Example Usage:

```yaml
- block_type: IterBlock
  block_config:
    block_name: iterative_improvement
    num_iters: 3
    block_type: LLMBlock
    block_kwargs:
      config_path: configs/improve_response.yaml
      output_cols: [improved_response]
```

## Custom Blocks

The following blocks are custom blocks implemented in the examples.
### AddStaticValue

- Registered Name: `AddStaticValue`
- Purpose: Add a static value to a specified column in a dataset
- Parameters:
  - `block_name: str` - Name of the block
  - `column_name: str` - Column to populate
  - `static_value: str` - Constant value to add
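No example is given in the source; a hypothetical flow entry (column and value invented) might look like:

```yaml
- block_type: AddStaticValue
  block_config:
    block_name: tag_source
    column_name: source
    static_value: "synthetic"
```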
### DoclingParsePDF

- Registered Name: `DoclingParsePDF`
- Purpose: Parse PDF documents into markdown format using Docling
- Parameters:
  - `block_name: str` - Name of the block
  - `pdf_path_column: str` - Column containing PDF file paths
  - `output_column: str` - Column to store the markdown output
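No example is given in the source; a hypothetical flow entry (column names invented) might look like:

```yaml
- block_type: DoclingParsePDF
  block_config:
    block_name: parse_pdfs
    pdf_path_column: pdf_path
    output_column: markdown_text
```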
### JSONFormat

- Registered Name: `JSONFormat`
- Purpose: Format and standardize JSON output from text analysis results
- Parameters:
  - `block_name: str` - Name of the block
  - `output_column: str` - Column to store the formatted JSON
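One plausible approach to this kind of cleanup is sketched below. This is illustrative only and not necessarily what the block does internally:

```python
import json
import re

# Illustrative sketch of JSON-output standardization: pull the first
# {...} span out of raw model text and re-serialize it with consistent
# formatting. Returns None if no valid JSON object is found.
def format_json_output(raw_text):
    match = re.search(r"\{.*\}", raw_text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.dumps(json.loads(match.group(0)), indent=2, sort_keys=True)
    except json.JSONDecodeError:
        return None

format_json_output('Here is the analysis: {"sentiment": "positive", "score": 0.9} Done.')
```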
### PostProcessThinkingBlock

- Registered Name: `PostProcessThinkingBlock`
- Purpose: Post-process thinking tokens from model outputs
- Parameters:
  - `block_name: str` - Name of the block
  - `column_name: str` - Column to process
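A sketch of what such post-processing might look like is shown below. It assumes thinking tokens are wrapped in `<think>...</think>`, which is one common convention; the actual delimiters depend on the model, and this is not the block's real implementation:

```python
import re

# Sketch only: strips an assumed <think>...</think> span from model output,
# keeping the final answer text that follows.
def strip_thinking(text):
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

strip_thinking("<think>reason step by step...</think>The answer is 42.")
```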
### RegexParserBlock

- Registered Name: `RegexParserBlock`
- Purpose: Parse text using regular expressions and extract structured data
- Parameters:
  - `block_name: str` - Name of the block
  - `column_name: str` - Column to parse
  - `parsing_pattern: str` - Regex pattern for parsing
  - `parser_cleanup_tags: List[str]` - Tags to clean up
  - `output_cols: List[str]` - Output columns
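The parameters suggest a flow like the sketch below: capture groups from `parsing_pattern` map onto `output_cols`, and `parser_cleanup_tags` are stripped from the results. The pattern and names here are invented for illustration; the block's actual logic may differ:

```python
import re

# Sketch of regex-based parsing: map capture groups onto output columns,
# then strip any cleanup tags from the captured values.
def regex_parse(text, parsing_pattern, output_cols, cleanup_tags=()):
    match = re.search(parsing_pattern, text, re.DOTALL)
    if match is None:
        return {}
    parsed = dict(zip(output_cols, match.groups()))
    for col, value in parsed.items():
        for tag in cleanup_tags:
            value = value.replace(tag, "")
        parsed[col] = value.strip()
    return parsed

sample = "Question: What is 2+2?\nAnswer: [ANS] 4"
regex_parse(
    sample,
    r"Question:\s*(.*?)\nAnswer:\s*(.*)",
    ["question", "answer"],
    ["[ANS]"],
)
```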
## Implementation Notes

- All blocks are registered with the `@BlockRegistry.register()` decorator, enabling dynamic discovery and instantiation.
- Blocks can load YAML configuration files containing prompts, templates, and other settings.
- Most blocks support multiprocessing via the `num_procs` parameter in `batch_kwargs`.
- Blocks validate inputs using Jinja2 templates to ensure required variables are provided.
- Blocks expose consistent input/output interfaces so they compose cleanly in data processing pipelines.
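The template-variable validation idea can be sketched as follows. The real implementation parses templates with Jinja2 (for example via `jinja2.meta.find_undeclared_variables`); this regex version only illustrates the concept for simple `{{ var }}` placeholders:

```python
import re

# Rough sketch of template-variable validation: find {{ var }} placeholders
# referenced by a template and report any that the dataset does not provide.
# Real validation should use Jinja2's own parser, not a regex.
def missing_template_vars(template, provided_columns):
    referenced = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template))
    return referenced - set(provided_columns)

missing_template_vars("Summarize {{ document }} in {{ language }}.", ["document"])
```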
## Creating a Custom Block

To create a custom block:

```python
from sdg_hub.blocks import Block
from sdg_hub.registry import BlockRegistry
from datasets import Dataset


@BlockRegistry.register("MyCustomBlock")
class MyCustomBlock(Block):
    def __init__(self, block_name: str, custom_param: str, **kwargs):
        super().__init__(block_name)
        self.custom_param = custom_param

    def generate(self, dataset: Dataset, **kwargs) -> Dataset:
        # Custom processing logic
        processed_dataset = dataset.map(
            lambda x: {"processed": f"{self.custom_param}: {x['input']}"}
        )
        return processed_dataset
```

Then use it in a flow:

```yaml
- block_type: MyCustomBlock
  block_config:
    block_name: my_processor
    custom_param: "Processed"
```

## Best Practices

- Descriptive Names: Use clear, descriptive block names for easier debugging
- Configuration Files: Store complex prompts and templates in separate YAML files
- Error Handling: Blocks should handle edge cases gracefully
- Documentation: Include docstrings describing block purpose and parameters
- Testing: Test blocks with various input formats and edge cases
- Performance: Use batch processing and parallel execution for large datasets