-
Notifications
You must be signed in to change notification settings - Fork 2k
Return candidates from all data sources on id search #6184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Return candidates from all data sources on id search #6184
Conversation
|
Thank you for the PR! The changelog has not been updated, so here is a friendly reminder to check if you need to add an entry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- The
Candidatestype alias is defined asdict[Info.Identifier, AnyMatch]but then used asCandidates[AlbumMatch]/Candidates[TrackMatch], which isn’t a parametrizable generic; consider either makingCandidatesaTypeAliaswith two type parameters (key/value) or annotating the dicts directly to avoid confusing/misleading typing. - Moving the
album_matchedevent emission intoAlbumMatch.__post_init__makes constructingAlbumMatchobjects have side effects everywhere; consider using a factory/helper (or an explicit method) to emit the event so that simple instantiation stays side-effect-free and easier to reason about. - In
_add_candidate, the duplicate check mixesinfo.album_idandinfo.identifierwhile thecandidatesdict is keyed byidentifier; simplifying this to only useidentifierfor both the truthiness check and the lookup would make the intent clearer and avoid relying onalbum_idbeing non-empty.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `Candidates` type alias is defined as `dict[Info.Identifier, AnyMatch]` but then used as `Candidates[AlbumMatch]`/`Candidates[TrackMatch]`, which isn’t a parametrizable generic; consider either making `Candidates` a `TypeAlias` with two type parameters (key/value) or annotating the dicts directly to avoid confusing/misleading typing.
- Moving the `album_matched` event emission into `AlbumMatch.__post_init__` makes constructing `AlbumMatch` objects have side effects everywhere; consider using a factory/helper (or an explicit method) to emit the event so that simple instantiation stays side-effect-free and easier to reason about.
- In `_add_candidate`, the duplicate check mixes `info.album_id` and `info.identifier` while the `candidates` dict is keyed by `identifier`; simplifying this to only use `identifier` for both the truthiness check and the lookup would make the intent clearer and avoid relying on `album_id` being non-empty.
## Individual Comments
### Comment 1
<location> `beets/autotag/match.py:203-204` </location>
<code_context>
return
# Prevent duplicates.
- if info.album_id and info.album_id in results:
+ if info.album_id and info.identifier in results:
log.debug("Duplicate.")
return
</code_context>
<issue_to_address>
**issue (bug_risk):** Duplicate-prevention now checks album_id but keys are identifier tuples, so it will never filter duplicates.
Since results is keyed by info.identifier (data_source, id), this condition should be based solely on identifier. The album_id guard is now misleading and may skip intended deduping. Consider removing the album_id check and using only `if info.identifier in results:` (or otherwise aligning the condition with how keys are stored).
</issue_to_address>
### Comment 2
<location> `beets/metadata_plugins.py:58-62` </location>
<code_context>
- A single ID can yield just a single track, so we return the first match.
- """
+@notify_info_yielded("trackinfo_received")
+def tracks_for_ids(_id: str) -> Iterable[TrackInfo]:
+ """Return matching albums from all metadata sources for the given ID."""
for plugin in find_metadata_source_plugins():
- if info := plugin.track_for_id(_id):
</code_context>
<issue_to_address>
**nitpick (typo):** Docstring for tracks_for_ids mentions albums instead of tracks.
The description looks copied from `albums_for_ids` and should say "tracks" instead of "albums" to match the function’s purpose and avoid confusing metadata source plugin implementors.
```suggestion
@notify_info_yielded("trackinfo_received")
def tracks_for_ids(_id: str) -> Iterable[TrackInfo]:
"""Return matching tracks from all metadata sources for the given ID."""
for plugin in find_metadata_source_plugins():
yield from plugin.tracks_for_ids([_id])
```
</issue_to_address>
### Comment 3
<location> `beets/autotag/match.py:284-294` </location>
<code_context>
if candidates and not config["import"]["timid"]:
# If we have a very good MBID match, return immediately.
# Otherwise, this match will compete against metadata-based
# matches.
if rec == Recommendation.strong:
log.debug("ID match.")
return (
cur_artist,
cur_album,
Proposal(list(candidates.values()), rec),
)
</code_context>
<issue_to_address>
**suggestion (code-quality):** Merge nested if conditions ([`merge-nested-ifs`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/merge-nested-ifs))
```suggestion
if candidates and not config["import"]["timid"] and rec == Recommendation.strong:
log.debug("ID match.")
return (
cur_artist,
cur_album,
Proposal(list(candidates.values()), rec),
)
```
<br/><details><summary>Explanation</summary>Too much nesting can make code difficult to understand, and this is especially
true in Python, where there are no brackets to help out with the delineation of
different nesting levels.
Reading deeply nested code is confusing, since you have to keep track of which
conditions relate to which levels. We therefore strive to reduce nesting where
possible, and the situation where two `if` conditions can be combined using
`and` is an easy win.
</details>
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the autotag matching system to support returning candidates from multiple metadata sources when searching by ID, and fixes an issue where candidates with duplicate IDs from different sources would overwrite each other.
- Changes metadata plugin API from
album_for_id/track_for_id(returning single results) toalbums_for_ids/tracks_for_ids(yielding multiple results from all sources) - Uses composite
Info.identifier(tuple of data_source and id) as candidate dictionary keys to prevent cross-source ID collisions - Converts
AlbumMatchandTrackMatchfrom NamedTuples to dataclasses and movesalbum_matchedevent emission toAlbumMatch.__post_init__to deduplicate event firing
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| test/test_autotag.py | Removes assignment tests (moved to new test file) and unused import |
| test/autotag/test_match.py | New test file containing moved assignment tests plus new tests for multi-source ID matching scenarios |
| beets/metadata_plugins.py | Replaces single-result album_for_id/track_for_id functions with multi-result albums_for_ids/tracks_for_ids generators; updates base class method signatures to properly filter None values |
| beets/autotag/match.py | Simplifies match_by_id, updates candidate dictionary to use composite identifiers, removes manual album_matched event calls (now in dataclass), removes unused plugins import |
| beets/autotag/hooks.py | Adds Info.identifier property, converts Match classes to dataclasses with __post_init__ for event emission |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
95fecc5 to
c8c62b3
Compare
| extra_items: list[Item] | ||
| extra_tracks: list[TrackInfo] | ||
|
|
||
| def __post_init__(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move the event trigger out of the class constructor. It is possible that an Match object is constructed independent of a beets pipeline. I do not want to couple this that strongly if not necessary.
E.g. we have some serialization logic in beets-flask and I don't want to trigger this whenever I load an Match entry for cold storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the concern about this side effect but moving this to __post_init__ was intentional:
- The
send("album_matched")call was duplicated across multiple sites in the autotagger - always right after creatingAlbumMatch - This guarantess that this event is sent on every
AlbumMatchcreation. This is especially relevant given that I'm currently refactoring this functionality extensively.
This is actually a textbook use of __post_init__ - PEP 557 explicitly recommends it for side effects that must always happen during initialization. The alternative (keeping manual emissions everywhere) was provably bug-prone.
Re: beets-flask serialization: AlbumMatch/TrackMatch are internal to beets' autotagger, not public plugin API. For deserialization, you could bypass __post_init__ with:
match = object.__new__(AlbumMatch)
match.__dict__.update(serialized_data)Or even better, serialize just the match data rather than the objects themselves. If there's broader need for match serialization, we could discuss adding a proper public API for it.
I want to avoid blocking necessary refactoring of beets' internals based on downstream usage of internal classes. Does the deserialization workaround work for your use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think the event should not be sent on every initialization. To me, the event is more closely coupled with the tagging logic than with the match object itself, although I do agree that the current approach makes the code a bit cleaner.
AlbumMatch/TrackMatch are internal to beets' autotagger, not public plugin API.
How does one identify public api in this case? We do not use __all__ and there is no underscore in the naming. As there is no internal use indicator, AlbumMatch/TrackMatch are public api according to pep-8.
Historically beets has treated only the plugin API as public, yes, but the project is also used as a library, and I think that use case deserves consideration as well. Without explicit boundaries, users reasonably assume that importable classes are fair game.
Or even better, serialize just the match data rather than the objects themselves. If there's broader need for match serialization, we could discuss adding a proper public API for it.
Just to clarify: I meant only deserialization. This was what the "load from cold storage" comment was referencing.
I want to avoid blocking necessary refactoring of beets' internals based on downstream usage of internal classes. Does the deserialization workaround work for your use case?
We routinely do block changes, or at least adjust them, because of potential downstream usage. Wanting to avoid that concern here feels a bit inconsistent.
The workaround would work, but it shifts the burden entirely onto downstream users and breaks existing programs.
I’m not opposed to the change in principle, and I don’t think it needs to block progress, but by any reasonable definition, this is a breaking change. In my view that implies either a major-version bump or, alternatively, introducing a minimal public deserialization format so maintainers can refactor freely without silently breaking consumers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair points - you're right about the API ambiguity and that beets is used as a library.
How about adding a from_dict() classmethod that skips the event?
@classmethod
def from_dict(cls, data: dict) -> AlbumMatch:
"""Reconstruct from serialized data without emitting events."""
obj = object.__new__(cls)
obj.__dict__.update(data)
return objThis gives library users a stable deserialization path, keeps the __post_init__ enforcement for the autotagger, and avoids the major version bump debate. Would that work for beets-flask?
|
|
||
| Identifier = tuple[str | None, str | None] | ||
|
|
||
| @property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like that we raise an NotImplementedError here. We should make the Info class abstract or a protocol if want to define a contract for the inheritance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered using ABC + @abstractmethod here, but opted against it for these reasons:
-
Limited scope:
Infoonly has 2 concrete subclasses - it's not a public plugin interface where we need strict enforcement at instantiation time. -
Template pattern, not an interface: The base class provides real shared functionality (
identifierproperty,__repr__, common__init__parameters). Theidandnameproperties are just internal adapters mapping to different field names in subclasses (e.g.,album_idvstrack_id). -
Testing overhead: Making it an ABC would require either creating stub implementations or monkeypatching
__abstractmethods__in tests. Since we're not exposingInfofor external extension, the ceremony doesn't add value.
The NotImplementedError approach clearly documents "subclasses must override this" without the ABC machinery. If we later expose Info for some plugin extendability, I'd absolutely convert it to ABC at that point.
Happy to reconsider if you feel strongly about it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Template pattern, not an interface: The base class provides real shared functionality (identifier property, repr, common init parameters). The id and name properties are just internal adapters mapping to different field names in subclasses (e.g., album_id vs track_id).
Abstract classes or protocols can also provide real shared functionality.
Testing overhead: Making it an ABC would require either creating stub implementations or monkeypatching abstractmethods in tests. Since we're not exposing Info for external extension, the ceremony doesn't add value.
We actually don’t construct raw Info instances in tests. Tests only ever instantiate AlbumInfo or TrackInfo directly. So converting Info to an ABC wouldn’t add test-work in practice.
The NotImplementedError approach clearly documents "subclasses must override this" without the ABC machinery. If we later expose Info for some plugin extendability, I'd absolutely convert it to ABC at that point
It looks like this usage has already begun showing up in plugins (mbpseudo.py is one example). Given that the system is already being extended, it might be safer to formalize the interface now rather than later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's some confusion here - mbpseudo.py doesn't subclass Info. It's a metadata source plugin that returns AlbumInfo/TrackInfo instances, but it doesn't create new Info subclasses.
On the other hand, discogs does - however, that approach is being reverted in #6179 since this subclass unintentionally introduced flexible attributes that ended up written into the database.
Given the above (that they actually should not be subclassed), I'd prefer to keep it simple with NotImplementedError for now. We can formalize it as an ABC if/when there's an actual need for plugin extensibility.
beets/metadata_plugins.py
Outdated
| A single ID can yield just a single album, so we return the first match. | ||
| """ | ||
| @notify_info_yielded("albuminfo_received") | ||
| def albums_for_ids(_id: str) -> Iterable[AlbumInfo]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be named albums_for_id if we keep the current call signature. Btw. this seems like a breaking change to me. Can we do this without a major version increment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I named it following the method names available on MetadataSourcePlugin for consistency. I'll update the parameter to accept a list of IDs!
Btw. this seems like a breaking change to me. Can we do this without a major version increment?
As far as I'm aware, these functions are only used internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the reasons why I tend towards using *args, **kwargs...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dug the removed functions back out from the grave - didn't realise that we do use them - and added data_source parameter to both.
a4109de to
7282ede
Compare
7282ede to
e89d97d
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6184 +/- ##
==========================================
+ Coverage 68.26% 68.36% +0.10%
==========================================
Files 138 138
Lines 18791 18809 +18
Branches 3167 3164 -3
==========================================
+ Hits 12827 12859 +32
+ Misses 5290 5280 -10
+ Partials 674 670 -4
🚀 New features to boost your workflow:
|
Restore album_for_id and track_for_id functions in metadata_plugins to support data source-specific lookups. These functions accept both an ID and data_source parameter, enabling plugins like mbsync and missing to retrieve metadata from the correct source. Update mbsync and missing plugins to use the restored functions with explicit data_source parameters. Add data_source validation to prevent lookups when the source is not specified. Add get_metadata_source helper function to retrieve plugins by their data_source name, cached for performance.
e89d97d to
079749c
Compare
Closes #6178 (multiple metadata source results per ID) and #6181 (duplicate/overwrite of candidates).
album_for_id/track_for_idwithalbums_for_ids/tracks_for_idsinmetadata_pluginsthat yield candidates from all metadata sourcesInfo.identifier((data_source, id)) as candidate keys to avoid cross-source ID collisions.test/autotag/test_match.py) for assignment logic and multi-source ID matchingmatch_by_idalbum_matchedevent emission by moving it toAlbumMatch.__post_init__(and convertAlbumMatch/TrackMatchto dataclasses)I am refactoring a couple of other things in
beets.autotag.matchmodule because this thing is a hot mess.