
Commit d073964

al13n321 authored and mkmkme committed
Disable datalake prewhere, allow missing field_id
1 parent 8b8535b commit d073964

File tree: 2 files changed (+26, -3 lines)


src/Processors/Formats/Impl/Parquet/SchemaConverter.cpp

Lines changed: 6 additions & 1 deletion
@@ -133,7 +133,12 @@ std::string_view SchemaConverter::useColumnMapperIfNeeded(const parq::SchemaElem
         return element.name;
     const auto & map = column_mapper->getFieldIdToClickHouseName();
     if (!element.__isset.field_id)
-        throw Exception(ErrorCodes::ICEBERG_SPECIFICATION_VIOLATION, "Missing field_id for column {}", element.name);
+    {
+        /// Does iceberg require that parquet files have field ids?
+        /// Our iceberg writer currently doesn't write them.
+        //throw Exception(ErrorCodes::ICEBERG_SPECIFICATION_VIOLATION, "Missing field_id for column {}", element.name);
+        return element.name;
+    }
     auto it = map.find(element.field_id);
     if (it == map.end())
         throw Exception(ErrorCodes::ICEBERG_SPECIFICATION_VIOLATION, "Parquet file has column {} with field_id {} that is not in datalake metadata", element.name, element.field_id);
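
For reference, a standalone sketch of the name resolution the hunk above implements: when the parquet schema element carries no field_id, the reader now falls back to the column name from the file instead of throwing; otherwise the field_id is looked up in the ColumnMapper-provided map. resolveColumnName and the plain std:: types are illustrative stand-ins, not the actual SchemaConverter code.

#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>

/// Illustrative only: mirrors the behaviour of useColumnMapperIfNeeded after this commit.
std::string resolveColumnName(
    const std::string & parquet_name,
    std::optional<int64_t> field_id,
    const std::unordered_map<int64_t, std::string> & field_id_to_clickhouse_name)
{
    /// No field_id in the file (e.g. written by a writer that doesn't emit them):
    /// fall back to the parquet column name rather than failing.
    if (!field_id)
        return parquet_name;
    auto it = field_id_to_clickhouse_name.find(*field_id);
    if (it == field_id_to_clickhouse_name.end())
        throw std::runtime_error("column " + parquet_name + " has a field_id that is not in datalake metadata");
    return it->second;
}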

src/Storages/ObjectStorage/StorageObjectStorage.cpp

Lines changed: 20 additions & 2 deletions
@@ -203,7 +203,26 @@ StorageObjectStorage::StorageObjectStorage(
             sample_path);
     }
 
-    supports_prewhere = FormatFactory::instance().checkIfFormatSupportsPrewhere(configuration->getFormat(), context, format_settings);
+    /// TODO: Known problems with datalake prewhere:
+    ///  * If the iceberg table went through schema evolution, columns read from file may need to
+    ///    be renamed or typecast before applying prewhere. There's already a mechanism for
+    ///    telling parquet reader to rename columns: ColumnMapper. And parquet reader already
+    ///    automatically does type casts to requested types. But weirdly the iceberg reader uses
+    ///    those mechanisms to request the *old* name and type of the column, then has additional
+    ///    code to do the renaming and casting as a separate step outside parquet reader.
+    ///    We should probably change this and delete that additional code?
+    ///  * Delta Lake can have "partition columns", which are columns with constant value specified
+    ///    in the metadata, not present in parquet file. Like hive partitioning, but in metadata
+    ///    files instead of path. Currently these columns are added to the block outside parquet
+    ///    reader. If they appear in prewhere expression, parquet reader gets a "no column in block"
+    ///    error. Unlike hive partitioning, we can't (?) just return these columns from
+    ///    supportedPrewhereColumns() because at the time of the call the delta lake metadata hasn't
+    ///    been read yet. So we should probably pass these columns to the parquet reader instead of
+    ///    adding them outside.
+    ///  * There's a bug in StorageObjectStorageSource::createReader: it makes a copy of
+    ///    FormatFilterInfo, but for some reason unsets prewhere_info and row_level_filter_info.
+    ///    There's probably no reason for this, and it should just copy those fields like the others.
+    supports_prewhere = !configuration->isDataLakeConfiguration() && FormatFactory::instance().checkIfFormatSupportsPrewhere(configuration->getFormat(), context, format_settings);
 
     StorageInMemoryMetadata metadata;
     metadata.setColumns(columns);
@@ -707,4 +726,3 @@ void StorageObjectStorage::checkAlterIsPossible(const AlterCommands & commands,
 }
 
 }
-
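
The second TODO bullet above concerns Delta Lake partition columns that exist only in metadata and are appended to the block outside the parquet reader. Below is a minimal sketch of that step, assuming ClickHouse's Block and IDataType interfaces; appendConstantPartitionColumn is a hypothetical helper, not code from this commit. Because the column is inserted after the reader runs, a PREWHERE that references it never reaches the reader and fails with "no column in block".

#include <Core/Block.h>
#include <Core/Field.h>
#include <DataTypes/DataTypeString.h>

/// Illustrative only: materialize a metadata-defined partition column as a
/// constant and append it to a block the parquet reader already produced.
static void appendConstantPartitionColumn(DB::Block & block, const std::string & name, const std::string & value)
{
    auto type = std::make_shared<DB::DataTypeString>();
    /// One constant value repeated for every row in the block.
    auto column = type->createColumnConst(block.rows(), DB::Field(value));
    block.insert(DB::ColumnWithTypeAndName{column, type, name});
}

Passing these columns into the parquet reader instead, as the TODO suggests, would let prewhere expressions over them be evaluated inside the reader; until then, prewhere is simply disabled for data lake configurations by the gated assignment above.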