Skip to content

Add parquet row group size and sorting capabilities to writers#419

Merged
joocer merged 2 commits intomainfrom
copilot/fix-cea36f76-2dcb-45c1-8e0b-58274dce7c47
Oct 3, 2025
Merged

Add parquet row group size and sorting capabilities to writers#419
joocer merged 2 commits intomainfrom
copilot/fix-cea36f76-2dcb-45c1-8e0b-58274dce7c47

Conversation

Copy link
Contributor

Copilot AI commented Oct 3, 2025

Overview

This PR adds two new features to the mabel writer for better control over parquet file generation:

  1. Row Group Size Control: Set the parquet row group size (default: 5,000 rows)
  2. Pre-commit Sorting: Sort records before writing to parquet (default: append order)

Motivation

Large parquet files benefit from optimized row group sizes and sorted data:

  • Row groups control the granularity of data access - smaller groups enable more selective filtering, larger groups improve compression
  • Sorting improves query performance when filtering or sorting on the sorted column(s)

These features allow users to optimize parquet files for their specific use cases without post-processing.

Changes

Modified mabel/data/writers/internals/blob_writer.py

  • Added parquet_row_group_size: int = 5000 parameter to control row group size
  • Added sort_by: Optional[str] = None parameter to enable pre-commit sorting
  • Updated commit() method to:
    • Sort PyArrow table using pytable.sort_by() when sort_by is specified
    • Pass row_group_size to pyarrow.parquet.write_table()

Added tests/test_writer_parquet_features.py

Comprehensive test suite covering:

  • Custom row group size validation
  • Ascending sort functionality
  • Descending sort functionality
  • Default behavior (no sorting, 5000 row groups)

Usage

from mabel.data import BatchWriter
from mabel.adapters.disk import DiskWriter

# Control row group size
w = BatchWriter(
    inner_writer=DiskWriter,
    dataset="my_dataset",
    format="parquet",
    schema=[{"name": "id", "type": "INTEGER"}, {"name": "value", "type": "VARCHAR"}],
    parquet_row_group_size=1000,  # 1000 rows per group
)

# Sort records before writing
w = BatchWriter(
    inner_writer=DiskWriter,
    dataset="my_dataset",
    format="parquet",
    schema=[{"name": "timestamp", "type": "INTEGER"}, {"name": "event", "type": "VARCHAR"}],
    sort_by="timestamp",  # Sort by timestamp column
)

# Use both features together
w = BatchWriter(
    inner_writer=DiskWriter,
    dataset="my_dataset",
    format="parquet",
    schema=[{"name": "id", "type": "INTEGER"}],
    parquet_row_group_size=1000,
    sort_by="id",
)

Implementation Notes

  • Minimal changes: Only one file modified (blob_writer.py)
  • Backward compatible: Default values maintain existing behavior
  • Universal support: Works with Writer, BatchWriter, and StreamWriter via **kwargs propagation
  • Parquet-only: Features only affect parquet format, ignored for other formats

Testing

All tests pass successfully:

  • ✅ 4 new parquet feature tests
  • ✅ 7 existing writer tests
  • ✅ 10 existing batch writer tests
  • ✅ 2 existing schema error tests

Total: 23/23 tests passed

Integration testing validated:

  • Correct row group creation (500 records → 5 row groups at size=100)
  • Proper sorting of all records
  • Data integrity preservation
Original prompt

Add a feature to the writer that allows us to set the parquet row group size (default to 5,000 rows) and can order the records (default to the append order, but allow us to sort pre-commit)


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>
Copilot AI changed the title [WIP] Add a feature to the writer that allows us to set the parquet row group size (default to 5,000 rows) and can order the records (default to the append order, but allow us to sort pre-commit) Add parquet row group size and sorting capabilities to writers Oct 3, 2025
Copilot AI requested a review from joocer October 3, 2025 13:23
@joocer joocer marked this pull request as ready for review October 3, 2025 15:32
@joocer joocer merged commit d493add into main Oct 3, 2025
3 of 16 checks passed
@joocer joocer deleted the copilot/fix-cea36f76-2dcb-45c1-8e0b-58274dce7c47 branch October 3, 2025 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants