Skip to content

Fix msdf_merge util function.#627

Merged
gouttegd merged 1 commit intomapping-commons:masterfrom
gouttegd:fix-merge_msdf
Oct 8, 2025
Merged

Fix msdf_merge util function.#627
gouttegd merged 1 commit intomapping-commons:masterfrom
gouttegd:fix-merge_msdf

Conversation

@gouttegd
Copy link
Contributor

@gouttegd gouttegd commented Oct 8, 2025

The msdf_merge function attempts to "propagate" slots before merging the sets, but is doing so without any regard for which slots should actually be propagated.

In addition, it attempts to inject in every individual mapping a mapping_set_source slot pointing to the ID of the original set that contained the mapping, but this is invalid as there is no mapping_set_source slot on indivdual mapping records – the slot intended to capture the set from which a record came from is mapping_source.

Lastly, the function also attempts to drop duplicates after the sets have been merged, but the detection of duplicates is prevented by (1) the incorrect propagation of non-propagatable slots (which can cause two otherwise identical records in two different sets to appear different, if the metadata of the sets contain different wrongly propagated slots), and (2) the injection of the mapping_set_source slot.

This PR fixes all those issues by deleting the bogus inject_metadata_into_df function and replacing it by a call to msdf.propagate(), which implements propagation correctly. It then manually injects the correct mapping_source slot if possible, and if so ignores the injected slot when attempting to drop duplicates.

closes #626

@gouttegd gouttegd self-assigned this Oct 8, 2025
merged_msdf1 = merge_msdf(self.msdf1, msdf3)

self.assertEqual(152, len(merged_msdf1.df))
self.assertEqual(149, len(merged_msdf1.df))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This difference is expected, and the new behaviour is the correct one.

The merge of msdf1 (the basic3.tsv file) and msdf3 (the basic.tsv file) contains three records that are identical, but the previous version of merge_msdf failed to consider them as duplicates because of the incorrect propagation of the creator_id slot (basic3.tsv and basic.tsv have different values for creator_id, so as part of the merge operation all records in the msdf1 got one creator_id value, and all records in the msdf3 set got another creator_id value, resulting in all records in the merged set being different).

"""Test merging of multiple msdfs."""
merged_msdf = merge_msdf(*self.msdfs)
self.assertEqual(275, len(merged_msdf.df))
self.assertEqual(200, len(merged_msdf.df))
Copy link
Contributor Author

@gouttegd gouttegd Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same explanation as above: the previous version of merge_msdf did not drop all duplicates because of the incorrect propagation of slots that should not have been propagated.

@gouttegd gouttegd marked this pull request as draft October 8, 2025 19:40
The `msdf_merge` function attempts to "propagate" slots before merging
the sets, but is doing so without any regard for which slots should
actually be propagated.

In addition, it attempts to inject in every individual mapping a
`mapping_set_source` slot pointing to the ID of the original set that
contained the mapping, but this is invalid as there is _no_
`mapping_set_source` slot on indivdual mapping records -- the slot
intended to capture the set from which a record came from is
`mapping_source`.

Lastly, the function also attempts to drop duplicates after the sets
have been merged, but the detection of duplicates is prevented by (1)
the incorrect propagation of non-propagatable slots (which can cause two
otherwise identical records in two different sets to appear different,
if the metadata of the sets contain different wrongly propagated slots),
and (2) the injection of the `mapping_set_source` slot.

This commit fixes all those issues by deleting the bogus
`inject_metadata_into_df` function and replacing by a call to
`msdf.propagate()`, which implements propagation correctly. It then
manually inject the correct `mapping_source` slot if possible, and if so
ignore the injected slot when attempting to drop duplicates.
@gouttegd gouttegd marked this pull request as ready for review October 8, 2025 19:51
@gouttegd gouttegd requested a review from matentzn October 8, 2025 19:51
Copy link
Collaborator

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!!

@gouttegd gouttegd merged commit 07f05cd into mapping-commons:master Oct 8, 2025
6 checks passed
@gouttegd gouttegd deleted the fix-merge_msdf branch October 8, 2025 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

inject_metadata_into_df function is dubious and violates the spec

2 participants