Decimal unscale fails with empty column #2263

@berg2043

Description

Apache Iceberg version

0.9.1 (latest release)

Please describe the bug 🐞

After applying the fix from #1983 for decimal conversion, "conversion from NoneType to Decimal is not supported" is thrown when a decimal column contains only nulls. Here's a snippet of code to reproduce it:

from decimal import Decimal

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.io.pyarrow import pyarrow_to_schema
from pyiceberg.table.name_mapping import MappedField, NameMapping


warehouse_path = '/tmp'

catalog = load_catalog(
    "default",
    type="sql",
    uri=f"sqlite:///{warehouse_path}/test",
    warehouse=f"file://{warehouse_path}",
)

catalog.create_namespace_if_not_exists(
  'test',
  {'location': f'file://{warehouse_path}'}
)

decimal8 = pa.array([Decimal("123.45"), Decimal("678.91")], pa.decimal128(8, 2))
decimal16 = pa.array([Decimal("12345679.123456"), Decimal("67891234.678912")], pa.decimal128(16, 6))
decimal19 = pa.array([Decimal("1234567890123.123456"), Decimal("9876543210703.654321")], pa.decimal128(19, 6))
empty_decimal8 = pa.array([None, None], pa.decimal128(8,2))
empty_decimal16 = pa.array([None, None], pa.decimal128(16, 6))
empty_decimal19 = pa.array([None, None], pa.decimal128(19, 6))

table = pa.Table.from_pydict(
    {
        "decimal8": decimal8,
        "decimal16": decimal16,
        "decimal19": decimal19,
        "empty_decimal8": empty_decimal8,
        "empty_decimal16": empty_decimal16,
        "empty_decimal19": empty_decimal19,
    },
)

pa_schema = table.schema

name_mapping = NameMapping([
  MappedField(**{'field-id': i+1, 'names': [name]})
  for i, name
  in enumerate(pa_schema.names)
])

schema = pyarrow_to_schema(
  pa_schema,
  name_mapping
)

pyiceberg_table = catalog.create_table(
  'test.decimals',
  schema=schema,
)

pyiceberg_table.append(table)

My current fix to data_file_statistics_from_parquet_metadata is as follows, but I'm unsure what the unintended consequences would be.

                    if isinstance(stats_col.iceberg_type, DecimalType) and statistics.physical_type != "FIXED_LEN_BYTE_ARRAY":
                        scale = stats_col.iceberg_type.scale
                        if statistics.min_raw:
                            col_aggs[field_id].update_min(unscaled_to_decimal(statistics.min_raw, scale))
                        if statistics.max_raw:
                            col_aggs[field_id].update_max(unscaled_to_decimal(statistics.max_raw, scale))
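As a standalone illustration of the guard (using a stand-in for pyiceberg's unscaled_to_decimal, which turns a raw unscaled integer into a Decimal at the given scale; the real implementation may differ):

```python
from decimal import Decimal
from typing import Optional

def unscaled_to_decimal(unscaled: int, scale: int) -> Decimal:
    # Stand-in for pyiceberg.utils.decimal.unscaled_to_decimal:
    # shift the unscaled integer right by `scale` decimal digits.
    return Decimal(unscaled).scaleb(-scale)

def bound_or_none(raw: Optional[int], scale: int) -> Optional[Decimal]:
    # An all-null column has no min/max statistics, so raw is None;
    # skip the conversion instead of handing None to Decimal().
    return unscaled_to_decimal(raw, scale) if raw is not None else None

print(bound_or_none(12345, 2))  # 123.45
print(bound_or_none(None, 2))   # None
```

One design note: the proposed fix above truth-tests min_raw/max_raw, which would also skip a legitimate unscaled value of 0; comparing against None, as in this sketch, avoids that edge case.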

I could not get the nightly build to install, so I'm unsure whether this still exists on main. I tested with 0.9.0 and did not run into this issue.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
