Fix msdf_merge util function. #627
  merged_msdf1 = merge_msdf(self.msdf1, msdf3)

- self.assertEqual(152, len(merged_msdf1.df))
+ self.assertEqual(149, len(merged_msdf1.df))
This difference is expected, and the new behaviour is the correct one.
Merging msdf1 (the basic3.tsv file) with msdf3 (the basic.tsv file) yields three identical records. The previous version of merge_msdf failed to treat them as duplicates because it incorrectly propagated the creator_id slot. basic3.tsv and basic.tsv have different values for creator_id, so during the merge every record from msdf1 received one creator_id value and every record from msdf3 received another. As a result, all records in the merged set were different.
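The effect described above can be reproduced in a minimal sketch. It uses plain dictionaries rather than the DataFrame that merge_msdf operates on, and the records, the ORCID values, and the helper names `count_distinct` and `with_slot` are all hypothetical, chosen only for illustration.

```python
# Hypothetical records; only three core slots are shown.
set_a = [
    {"subject_id": "A:1", "predicate_id": "skos:exactMatch", "object_id": "B:1"},
    {"subject_id": "A:2", "predicate_id": "skos:exactMatch", "object_id": "B:2"},
]
set_b = [
    {"subject_id": "A:1", "predicate_id": "skos:exactMatch", "object_id": "B:1"},
    {"subject_id": "A:3", "predicate_id": "skos:exactMatch", "object_id": "B:3"},
]


def count_distinct(records):
    """Count records that differ from each other in at least one slot."""
    return len({tuple(sorted(r.items())) for r in records})


def with_slot(records, slot, value):
    """Return copies of the records with one extra slot set on each."""
    return [{**r, slot: value} for r in records]


# No set-level metadata copied in: the shared record collapses into one.
clean_count = count_distinct(set_a + set_b)

# A set-level creator_id copied into every record: nothing collapses.
polluted_count = count_distinct(
    with_slot(set_a, "creator_id", "orcid:0000-0000-0000-0001")
    + with_slot(set_b, "creator_id", "orcid:0000-0000-0000-0002")
)

print(clean_count, polluted_count)  # 3 4
```

The record shared by both sets is counted once in the first case and twice in the second, which is the same kind of gap as the 152 versus 149 in the test above.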
  """Test merging of multiple msdfs."""
  merged_msdf = merge_msdf(*self.msdfs)
- self.assertEqual(275, len(merged_msdf.df))
+ self.assertEqual(200, len(merged_msdf.df))
Same explanation as above: the previous version of merge_msdf did not drop all duplicates because of the incorrect propagation of slots that should not have been propagated.
The `msdf_merge` function attempts to "propagate" slots before merging the sets, but does so without any regard for which slots should actually be propagated.

In addition, it attempts to inject into every individual mapping a `mapping_set_source` slot pointing to the ID of the original set that contained the mapping, but this is invalid as there is _no_ `mapping_set_source` slot on individual mapping records -- the slot intended to capture the set from which a record came is `mapping_source`.

Lastly, the function also attempts to drop duplicates after the sets have been merged, but the detection of duplicates is prevented by (1) the incorrect propagation of non-propagatable slots (which can cause two otherwise identical records in two different sets to appear different, if the metadata of the sets contain different wrongly propagated slots), and (2) the injection of the `mapping_set_source` slot.

This commit fixes all those issues by deleting the bogus `inject_metadata_into_df` function and replacing it with a call to `msdf.propagate()`, which implements propagation correctly. It then manually injects the correct `mapping_source` slot if possible, and if so ignores the injected slot when attempting to drop duplicates.
The `msdf_merge` function attempts to "propagate" slots before merging the sets, but does so without any regard for which slots should actually be propagated.

In addition, it attempts to inject into every individual mapping a `mapping_set_source` slot pointing to the ID of the original set that contained the mapping, but this is invalid as there is no `mapping_set_source` slot on individual mapping records; the slot intended to capture the set from which a record came is `mapping_source`.

Lastly, the function also attempts to drop duplicates after the sets have been merged, but the detection of duplicates is prevented by (1) the incorrect propagation of non-propagatable slots (which can cause two otherwise identical records in two different sets to appear different, if the metadata of the sets contain different wrongly propagated slots), and (2) the injection of the `mapping_set_source` slot.

This PR fixes all those issues by deleting the bogus `inject_metadata_into_df` function and replacing it with a call to `msdf.propagate()`, which implements propagation correctly. It then manually injects the correct `mapping_source` slot if possible, and if so ignores the injected slot when attempting to drop duplicates.

closes #626
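The last step of the description (tag each record with its set of origin, but compare records without that tag) can be sketched as follows. This is not the code from the PR: it works on plain dictionaries instead of DataFrames, `merge_records` is a hypothetical helper, the set IDs are made up, and it assumes every set has an ID and that no record already carries its own `mapping_source`.

```python
def merge_records(sets):
    """Merge mapping sets, tagging each kept record with its set of origin.

    `sets` maps a mapping set ID to that set's list of records (dicts).
    Assumes no input record already has a mapping_source slot.
    """
    merged = []
    seen = set()
    for set_id, records in sets.items():
        for record in records:
            # Build the comparison key before injecting mapping_source,
            # so the injected slot cannot hide a duplicate.
            key = tuple(sorted(record.items()))
            if key in seen:
                continue
            seen.add(key)
            merged.append({**record, "mapping_source": set_id})
    return merged


# Hypothetical input: one record appears in both sets.
shared = {"subject_id": "A:1", "predicate_id": "skos:exactMatch", "object_id": "B:1"}
sets = {
    "https://example.org/sets/a": [
        shared,
        {"subject_id": "A:2", "predicate_id": "skos:exactMatch", "object_id": "B:2"},
    ],
    "https://example.org/sets/b": [
        dict(shared),
        {"subject_id": "A:3", "predicate_id": "skos:exactMatch", "object_id": "B:3"},
    ],
}

merged = merge_records(sets)
print(len(merged))  # 3
```

In this sketch the first occurrence of a duplicated record wins, so the shared record keeps the ID of the first set as its `mapping_source`. Whether the real function keeps the first or a later occurrence is not stated in the description.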