-
-
Notifications
You must be signed in to change notification settings - Fork 62
feat(replication): Add support for relation changes #377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(replication): Add support for relation changes #377
Conversation
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting
// We process all the changes that we want to dispatch to the destination. | ||
let mut changes = vec![]; | ||
for column_schema in self.old_table_schema.column_schemas.iter() { | ||
let column_schema = IndexedColumnSchema(column_schema.clone()); | ||
let latest_column_schema = new_indexed_column_schemas.take(&column_schema); | ||
match latest_column_schema { | ||
Some(latest_column_schema) => { | ||
let column_schema = column_schema.into_inner(); | ||
let latest_column_schema = latest_column_schema.into_inner(); | ||
|
||
if column_schema.name != latest_column_schema.name { | ||
// If we find a column with the same name but different fields, we assume it was changed. The only changes | ||
// that we detect are changes to the column but with preserved name. | ||
changes.push(RelationChange::AlterColumn( | ||
column_schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Altered columns never detected
In RelationEvent::build_changes
the comparison inside the Some
branch uses if column_schema.name != latest_column_schema.name
. Because the HashSet
lookup is keyed by column name, the retrieved column always has the same name, so this condition is never true even when type, nullability or other attributes changed. As a result no RelationChange::AlterColumn
events are ever produced and destinations will miss in-place schema changes, leaving replicated tables out of sync.
Useful? React with 👍 / 👎.
// Preserve per-table ordering. | ||
table_id_to_table_rows | ||
.entry(table_id) | ||
.or_default() | ||
.push(table_row); | ||
|
||
// Ensure a single schema version per batch. | ||
// | ||
// We need to do this since we don't make any assumptions on relation events being there | ||
// so we use the schema version of the first element that we find. | ||
// | ||
// The invariant that must be upheld is that for all events in a batch, they must all have | ||
// the same schema version. | ||
match batch_schema_version { | ||
Some(batch_schema_version) => { | ||
if schema_version != *batch_schema_version { | ||
bail!( | ||
ErrorKind::InvalidState, | ||
"Multiple schema versions in the same batch", | ||
"Multiple schema versions in the same batch were found while processing events for BigQuery" | ||
) | ||
} | ||
} | ||
None => { | ||
*batch_schema_version = Some(schema_version); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BigQuery batching assumes global schema version
The new batching logic records a single batch_schema_version
for the entire batch and bails if any subsequent DML event carries a different version. Schema versions are scoped per table, so it’s normal for a batch to include events for two tables with different versions (e.g., one table had a schema change while another did not). Under such circumstances the code emits “Multiple schema versions in the same batch” and the destination fails to apply further events. The version tracking should be per table rather than a global constraint on the batch.
Useful? React with 👍 / 👎.
This PR adds support for the
Relation
message, which can now emitRelationChange
events to describe what changed in a table. In addition, it defaults allColumnSchema
s tonullable
since we are unable to infer nullability fromRelation
messages because the field is omitted; thus, it's better if we just make everything nullable since it gives more flexibility.The PR also adds schema change support for the BigQuery destination.
The main limitation of this approach is that we do not recognize column renames because of a limitation of the replication protocol, so when a rename is performed, it is emitted as
DROP
+ADD
.