Description
When migrating a Solr-enabled table, we may need to reindex the Solr core on either the origin or the target cluster.
A reindex during a migration can cause both data corruption and performance problems.
If you run a Migrate job after a reindex, the solr_query column values are changed.
If you run a DiffData job after a reindex, it reports a mismatch for every record, for example:
25/08/22 14:48:26 ERROR DiffJobSession: Mismatch row found for key: [AHE7S5351Q40G7UHIH9QQMY71ALSV0F0ZN9H4P9WNZHERT3IYAOOMVL3VU52BQUC %% AHE7S5351Q40G7UHIH9QQMY71ALSV0F0ZN9H4P9WNZHERT3IYAOOMVL3VU52BQUC %% AHE7S5351Q40G7UHIH9QQMY71ALSV0F0ZN9H4P9WNZHERT3IYAOOMVL3VU52BQUC %% 463131761 %% 1970-01-06 %% AHE7S5351Q40G7UHIH9QQMY7] Mismatch: Target column:solr_query-origin[AHE7S5351Q40G7UHIH9QQMY7]-target[];
25/08/22 14:48:26 ERROR DiffJobSession: Corrected mismatch row in target: [7PCQSU2I2FWCAHM1NYWSN2WJWR0TQ05R6WRSYWZC7ZAOZX7JQI3L14FQTBL89FI5 %% 7PCQSU2I2FWCAHM1NYWSN2WJWR0TQ05R6WRSYWZC7ZAOZX7JQI3L14FQTBL89FI5 %% 7PCQSU2I2FWCAHM1NYWSN2WJWR0TQ05R6WRSYWZC7ZAOZX7JQI3L14FQTBL89FI5 %% 561874193 %% 1970-01-07 %% 7PCQSU2I2FWCAHM1NYWSN2WJ]
25/08/22 14:48:26 ERROR DiffJobSession: Corrected mismatch row in target: [D0ZKZZOW5HESXKRKDKLZZYQ2RJO7LKSFOAR1O38P47GAWOH059MDQ6WXQBTBXRI9 %% D0ZKZZOW5HESXKRKDKLZZYQ2RJO7LKSFOAR1O38P47GAWOH059MDQ6WXQBTBXRI9 %% D0ZKZZOW5HESXKRKDKLZZYQ2RJO7LKSFOAR1O38P47GAWOH059MDQ6WXQBTBXRI9 %% 1259940133 %% 1970-01-15 %% D0ZKZZOW5HESXKRKDKLZZYQ2]
25/08/22 14:48:26 ERROR DiffJobSession: Mismatch row found for key: [XPPNQQC1YY30N4CHIUDTBCRKQMOWJFFJ7YRHM3C0OOVMTTOMR6279MP7OEUTPADZ %% XPPNQQC1YY30N4CHIUDTBCRKQMOWJFFJ7YRHM3C0OOVMTTOMR6279MP7OEUTPADZ %% XPPNQQC1YY30N4CHIUDTBCRKQMOWJFFJ7YRHM3C0OOVMTTOMR6279MP7OEUTPADZ %% 1141464253 %% 1970-01-13 %% XPPNQQC1YY30N4CHIUDTBCRK] Mismatch: Target column:solr_query-origin[XPPNQQC1YY30N4CHIUDTBCRK]-target[];
25/08/22 14:48:26 ERROR DiffJobSession: Mismatch row found for key: [5N5HM62MAK66ENLIR7Q7MRFD6SXA3P9BIT4TERK7Y4T5EH94TSSRP0I3CAQAFBPN %% 5N5HM62MAK66ENLIR7Q7MRFD6SXA3P9BIT4TERK7Y4T5EH94TSSRP0I3CAQAFBPN %% 5N5HM62MAK66ENLIR7Q7MRFD6SXA3P9BIT4TERK7Y4T5EH94TSSRP0I3CAQAFBPN %% 986325790 %% 1970-01-12 %% 5N5HM62MAK66ENLIR7Q7MRFD] Mismatch: Target column:solr_query-origin[5N5HM62MAK66ENLIR7Q7MRFD]-target[];
25/08/22 14:48:26 ERROR DiffJobSession: Corrected mismatch row in target: [Q3FRE934AQ7TTWJ62F37QC5XERFMMB30UYKE2MAQS6W0BXP6QVXB8OX2W8WNV4FD %% Q3FRE934AQ7TTWJ62F37QC5XERFMMB30UYKE2MAQS6W0BXP6QVXB8OX2W8WNV4FD %% Q3FRE934AQ7TTWJ62F37QC5XERFMMB30UYKE2MAQS6W0BXP6QVXB8OX2W8WNV4FD %% 758013392 %% 1970-01-09 %% Q3FRE934AQ7TTWJ62F37QC5X]
This can make a large job take days to run if you don't spot what is going on, and with autocorrect turned on it also overwrites the target values incorrectly.
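One way to spot this early might be to tally which columns the mismatches are reported on. The sketch below is not part of CDM; it is a standalone Python helper whose regular expression is inferred only from the sample log lines in this issue, and the function names `mismatched_columns` and `only_solr_query` are made up for illustration.

```python
import re
from collections import Counter

# Matches the "Target column:<name>-origin[" fragment seen in the sample
# DiffJobSession lines. The pattern is an assumption based on that sample only.
MISMATCH_COLUMN = re.compile(r"Target column:(\w+)-origin\[")


def mismatched_columns(log_text: str) -> Counter:
    """Count mismatch reports per column name in DiffData log output."""
    return Counter(MISMATCH_COLUMN.findall(log_text))


def only_solr_query(log_text: str) -> bool:
    """Return True when every reported column mismatch is on solr_query."""
    counts = mismatched_columns(log_text)
    return bool(counts) and set(counts) == {"solr_query"}
```

If `only_solr_query` returns True for a sample of the job log, the mismatches are likely an effect of the reindex rather than real data drift, and the job is a candidate for the workaround described here.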
We can work around this issue by setting the following for any Solr-enabled table:
spark.cdm.schema.origin.column.skip solr_query
I'm not sure how best to handle this. I can see that on an initial migration we might want to transfer the values over; however, I think the downside outweighs the benefit of moving the solr_query data.
Perhaps we should auto-detect the column, default to ignoring it, and insist that a customer runs a reindex after migrating any Solr table. That seems safer than what happened to us at USBank.
We could leave it configurable so that they can turn it on if they want the column moved, whatever we decide is best.
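To make the proposal concrete, the default could look roughly like the sketch below. It is written in Python purely to illustrate the logic and is not CDM code; `effective_skip_columns` and the `migrate_solr_query` opt-in flag are hypothetical names, not existing CDM options.

```python
SOLR_COLUMN = "solr_query"


def effective_skip_columns(table_columns, configured_skip, migrate_solr_query=False):
    """Return the columns to skip, adding solr_query by default when present.

    table_columns:      column names found on the origin table
    configured_skip:    columns the user already listed to skip
    migrate_solr_query: hypothetical opt-in to copy solr_query anyway
    """
    skip = list(configured_skip)
    auto_skip = SOLR_COLUMN in table_columns and not migrate_solr_query
    if auto_skip and SOLR_COLUMN not in skip:
        skip.append(SOLR_COLUMN)
    return skip
```

With this default, a table that has a solr_query column is migrated without it unless the user opts in, and a table without the column is unaffected.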