We need to handle the solr_query column for any solr-enabled table by default to avoid multiple issues #383

@paayers

When migrating a Solr-enabled table, we may need to reindex the Solr core on either the origin or the target cluster.
A reindex at that point can lead to both data corruption and performance problems.
If you run a Migrate job after a reindex, the solr_query column values get changed.
If you run a DiffData job after a reindex, it reports a mismatch for every record, for example:

25/08/22 14:48:26 ERROR DiffJobSession: Mismatch row found for key: [AHE7S5351Q40G7UHIH9QQMY71ALSV0F0ZN9H4P9WNZHERT3IYAOOMVL3VU52BQUC %% AHE7S5351Q40G7UHIH9QQMY71ALSV0F0ZN9H4P9WNZHERT3IYAOOMVL3VU52BQUC %% AHE7S5351Q40G7UHIH9QQMY71ALSV0F0ZN9H4P9WNZHERT3IYAOOMVL3VU52BQUC %% 463131761 %% 1970-01-06 %% AHE7S5351Q40G7UHIH9QQMY7] Mismatch: Target column:solr_query-origin[AHE7S5351Q40G7UHIH9QQMY7]-target[];
25/08/22 14:48:26 ERROR DiffJobSession: Corrected mismatch row in target: [7PCQSU2I2FWCAHM1NYWSN2WJWR0TQ05R6WRSYWZC7ZAOZX7JQI3L14FQTBL89FI5 %% 7PCQSU2I2FWCAHM1NYWSN2WJWR0TQ05R6WRSYWZC7ZAOZX7JQI3L14FQTBL89FI5 %% 7PCQSU2I2FWCAHM1NYWSN2WJWR0TQ05R6WRSYWZC7ZAOZX7JQI3L14FQTBL89FI5 %% 561874193 %% 1970-01-07 %% 7PCQSU2I2FWCAHM1NYWSN2WJ]
25/08/22 14:48:26 ERROR DiffJobSession: Corrected mismatch row in target: [D0ZKZZOW5HESXKRKDKLZZYQ2RJO7LKSFOAR1O38P47GAWOH059MDQ6WXQBTBXRI9 %% D0ZKZZOW5HESXKRKDKLZZYQ2RJO7LKSFOAR1O38P47GAWOH059MDQ6WXQBTBXRI9 %% D0ZKZZOW5HESXKRKDKLZZYQ2RJO7LKSFOAR1O38P47GAWOH059MDQ6WXQBTBXRI9 %% 1259940133 %% 1970-01-15 %% D0ZKZZOW5HESXKRKDKLZZYQ2]
25/08/22 14:48:26 ERROR DiffJobSession: Mismatch row found for key: [XPPNQQC1YY30N4CHIUDTBCRKQMOWJFFJ7YRHM3C0OOVMTTOMR6279MP7OEUTPADZ %% XPPNQQC1YY30N4CHIUDTBCRKQMOWJFFJ7YRHM3C0OOVMTTOMR6279MP7OEUTPADZ %% XPPNQQC1YY30N4CHIUDTBCRKQMOWJFFJ7YRHM3C0OOVMTTOMR6279MP7OEUTPADZ %% 1141464253 %% 1970-01-13 %% XPPNQQC1YY30N4CHIUDTBCRK] Mismatch: Target column:solr_query-origin[XPPNQQC1YY30N4CHIUDTBCRK]-target[];
25/08/22 14:48:26 ERROR DiffJobSession: Mismatch row found for key: [5N5HM62MAK66ENLIR7Q7MRFD6SXA3P9BIT4TERK7Y4T5EH94TSSRP0I3CAQAFBPN %% 5N5HM62MAK66ENLIR7Q7MRFD6SXA3P9BIT4TERK7Y4T5EH94TSSRP0I3CAQAFBPN %% 5N5HM62MAK66ENLIR7Q7MRFD6SXA3P9BIT4TERK7Y4T5EH94TSSRP0I3CAQAFBPN %% 986325790 %% 1970-01-12 %% 5N5HM62MAK66ENLIR7Q7MRFD] Mismatch: Target column:solr_query-origin[5N5HM62MAK66ENLIR7Q7MRFD]-target[];
25/08/22 14:48:26 ERROR DiffJobSession: Corrected mismatch row in target: [Q3FRE934AQ7TTWJ62F37QC5XERFMMB30UYKE2MAQS6W0BXP6QVXB8OX2W8WNV4FD %% Q3FRE934AQ7TTWJ62F37QC5XERFMMB30UYKE2MAQS6W0BXP6QVXB8OX2W8WNV4FD %% Q3FRE934AQ7TTWJ62F37QC5XERFMMB30UYKE2MAQS6W0BXP6QVXB8OX2W8WNV4FD %% 758013392 %% 1970-01-09 %% Q3FRE934AQ7TTWJ62F37QC5X]

If you don't spot what is going on, this can make a large job take days to run, and with autocorrect turned on it also overwrites the values in the target incorrectly.
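One way to spot this pattern early is to check whether every column-level mismatch in the DiffData log points at solr_query. The snippet below is only a rough sketch for triage, not part of CDM: the function names are made up for illustration, and the regular expression assumes the "Target column:<name>-origin[...]" detail format seen in the excerpt above. How the detail looks when several columns differ on one row is not shown here, so treat the counts as approximate.

```python
import re
from collections import Counter

# Matches the column detail appended to a mismatch line, for example
# "Target column:solr_query-origin[ABC]-target[];" (format assumed from the log excerpt).
COLUMN_DETAIL = re.compile(r"Target column:(\w+)-origin\[")


def mismatched_columns(log_text: str) -> Counter:
    """Count how often each column name appears in mismatch details."""
    return Counter(COLUMN_DETAIL.findall(log_text))


def only_solr_query(log_text: str) -> bool:
    """True when at least one mismatch was logged and all of them name solr_query."""
    counts = mismatched_columns(log_text)
    return bool(counts) and set(counts) == {"solr_query"}
```

If only_solr_query returns True for a sample of the log, the mismatches are most likely the reindex side effect described here rather than real data differences.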

We can work around this issue by setting the following for any Solr-enabled table:

spark.cdm.schema.origin.column.skip solr_query

I'm not sure how best to handle this. I can see wanting to transfer the values on an initial migration, but I think the downside outweighs the benefit of moving the solr_query data.
Perhaps we should auto-detect the column, ignore it by default, and insist that the customer runs a reindex after migrating any Solr table. That seems safer than what happened to us at USBank.
We could leave it configurable so that they can turn it on if they want the column moved. I'm fine with whatever we decide is best.
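To make the proposal concrete, here is a rough sketch of the decision logic in Python. It is not CDM code: columns_to_skip and the migrate_solr_query opt-in flag are hypothetical names, and detecting a Solr-enabled table by the presence of a solr_query column is an assumption.

```python
SOLR_COLUMN = "solr_query"


def columns_to_skip(table_columns, user_skip=(), migrate_solr_query=False):
    """Return the sorted column names to skip for one table.

    table_columns: column names read from the origin table schema.
    user_skip: columns the user already listed in
        spark.cdm.schema.origin.column.skip.
    migrate_solr_query: hypothetical opt-in for users who really want
        the solr_query values copied.
    """
    skip = set(user_skip)
    # Default to skipping solr_query whenever the table has it, unless opted in.
    if SOLR_COLUMN in table_columns and not migrate_solr_query:
        skip.add(SOLR_COLUMN)
    return sorted(skip)
```

With this default, a Solr-enabled table is migrated and diffed without solr_query, and the customer reindexes on the target afterwards. An explicit user_skip entry still wins over the opt-in, which seems like the safer reading.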
