Conversation
Reviewer's GuideAdds latin-1 fallback support for non-UTF-8 ClickHouse table schemas during backup/restore and introduces integration tests (including a raw-bytes ClickHouse client method) to verify behavior with both valid UTF-8 and invalid-UTF-8 metadata. Sequence diagram for fallback decoding when loading table metadatasequenceDiagram
participant BackupManager
participant TableLogic as TableLogic_table_module
participant FileSystem
BackupManager->>TableLogic: _load_create_statement_from_disk(table)
TableLogic->>FileSystem: read_text(utf-8)
alt utf-8 readable
FileSystem-->>TableLogic: create_statement (utf-8)
TableLogic-->>BackupManager: create_statement
else UnicodeDecodeError
TableLogic->>TableLogic: log warning (non UTF-8 bytes)
TableLogic->>FileSystem: read_text(latin-1)
FileSystem-->>TableLogic: create_statement (latin-1)
TableLogic-->>BackupManager: create_statement
end
Sequence diagram for fallback decoding in storage loader download_datasequenceDiagram
participant RestoreProcess
participant StorageLoader
participant PhysicalLoader as PhysicalLoader_impl
RestoreProcess->>StorageLoader: download_data(remote_path, encoding)
StorageLoader->>PhysicalLoader: download_data(remote_path, is_async, encryption)
PhysicalLoader-->>StorageLoader: data (bytes)
alt encoding is falsy
StorageLoader-->>RestoreProcess: data (bytes)
else encoding is truthy
StorageLoader->>StorageLoader: data.decode(utf-8)
alt utf-8 succeeds
StorageLoader-->>RestoreProcess: text (utf-8)
else UnicodeDecodeError
StorageLoader->>StorageLoader: data.decode(latin-1)
StorageLoader-->>RestoreProcess: text (latin-1)
end
end
Updated class diagram for storage loader and ClickHouse test clientclassDiagram
class StorageLoader {
+download_data(remote_path, encoding, is_async, encryption) any
+download_file(remote_path, local_path, is_async, encryption) void
}
class PhysicalLoader {
+download_data(remote_path, is_async, encryption) bytes
}
StorageLoader --> PhysicalLoader : uses
class TableModule {
+_load_create_statement_from_disk(table) str
}
class Table {
+database str
+name str
+metadata_path str
}
TableModule --> Table : reads_metadata
class ClickHouseTestClient {
+query(sql) str
+query_raw_bytes(sql) bytes
}
ClickHouseTestClient ..> StorageLoader : used_in_integration_tests
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In
StorageLoader.download_data, theencodingargument is now ignored and UTF-8/latin-1 is hardcoded; either honor the passedencoding(including supporting non-UTF-8 encodings) or remove the parameter to avoid a misleading API. - Consider adding a warning log when
download_datafalls back from UTF-8 to latin-1 decoding, similar to_load_create_statement_from_disk, to make non-UTF-8 data easier to detect in production.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `StorageLoader.download_data`, the `encoding` argument is now ignored and UTF-8/latin-1 is hardcoded; either honor the passed `encoding` (including supporting non-UTF-8 encodings) or remove the parameter to avoid a misleading API.
- Consider adding a warning log when `download_data` falls back from UTF-8 to latin-1 decoding, similar to `_load_create_statement_from_disk`, to make non-UTF-8 data easier to detect in production.
## Individual Comments
### Comment 1
<location> `ch_backup/storage/loader.py:142-147` </location>
<code_context>
remote_path, is_async=is_async, encryption=encryption
)
- return data.decode(encoding) if encoding else data
+ if not encoding:
+ return data
+ try:
+ return data.decode("utf-8")
+ except UnicodeDecodeError:
+ return data.decode("latin-1")
def download_file(
</code_context>
<issue_to_address>
**issue (bug_risk):** The `encoding` parameter is now ignored when truthy, which is a behavior change and likely a bug.
With the new logic, callers that pass `encoding="ascii"`, `"utf-16"`, or rely on strict decoding will now always get a UTF-8→latin-1 fallback instead, which can change behavior or hide errors.
If you want a UTF-8→latin-1 fallback while still respecting the caller’s choice, you could do something like:
```python
if not encoding:
return data
try:
return data.decode(encoding)
except UnicodeDecodeError:
if encoding.lower() == "utf-8":
return data.decode("latin-1")
raise
```
so that the fallback only applies when UTF-8 was actually requested.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Instead of separate execute_raw
def execute(self, query: Union[str, bytes]) -> None:
if isinstance(query, bytes):
...
| @@ -0,0 +1,63 @@ | |||
| Feature: Non-UTF-8 schema encoding support | |||
There was a problem hiding this comment.
Do this tests fail without changes ?
| ch_client.execute(table_schema) | ||
|
|
||
|
|
||
| @given("we have created non-UTF-8 test table on {node}") |
There was a problem hiding this comment.
Let's make we execute query on {node:w} with encoding {encoding:w} step instead of this.
It seems more versatile.
There was a problem hiding this comment.
We can't parameterize the encoding in the step since Gherkin doesn't support non-UTF-8 text
There was a problem hiding this comment.
How about escaping like this ("Привет" в cp1251). Most likely the steps and/or client will need to be refined
"""
CREATE TABLE test_db_01.table_rus (
EventDate DateTime,
CounterID UInt32,
`\xcf\xf0\xe8\xe2\xe5\xf2` UInt32
)
ENGINE = MergeTree()
PARTITION BY CounterID % 10
ORDER BY (CounterID, EventDate)
"""
| def download_data( | ||
| self, remote_path, is_async=False, encryption=False, encoding="utf-8" | ||
| ): | ||
| def download_data(self, remote_path, is_async=False, encryption=False): |
There was a problem hiding this comment.
Let's just replace encoding="latin-1"
There was a problem hiding this comment.
Can there be a problem (corrupted text) ?
ch-backup/ch_backup/clickhouse/client.py
Line 85 in 8802592
If we read data as latin-1 but send it to ClickHouse as utf-8 encoded ?
There was a problem hiding this comment.
Let's just replace
encoding="latin-1"
this change breaks tests for valid UTF-8 data:
ASSERT FAILED:
Expected: <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_试` String COMMENT '试') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_😈` String COMMENT '😈') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>
but: was <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_è¯\x95` String COMMENT 'è¯\x95') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_ð\x9f\x98\x88` String COMMENT 'ð\x9f\x98\x88') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>
the issue is that bytes.decode() returns a str (unicode string), not bytes. When we decode utf-8 bytes as latin-1, each byte becomes a separate character
Summary by Sourcery
Add support for handling non-UTF-8 ClickHouse schemas and validate compatibility via integration tests.
Bug Fixes:
Enhancements:
Tests: