Return candidates from all data sources on id search #6184

snejus · 2025-11-23T14:24:01Z

Closes #6178 (multiple metadata source results per ID) and #6181 (duplicate/overwrite of candidates).

- Replace album_for_id / track_for_id in metadata_plugins with albums_for_ids / tracks_for_ids, which yield candidates from all metadata sources (fixes #6178, "Autotagger returns only the first candidate for an ID that is present in multiple sources").
- Use Info.identifier, a (data_source, id) tuple, as the candidate key to avoid cross-source ID collisions (fixes #6181, "Autotagger considers candidates with same album id from different data sources as duplicates").
- Add tests (test/autotag/test_match.py) for assignment logic and multi-source ID matching.
- Simplify match_by_id.
- Deduplicate album_matched event emission by moving it to AlbumMatch.__post_init__, and convert AlbumMatch / TrackMatch to dataclasses.
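
A minimal sketch of the collision that the composite key is meant to avoid. `FakeInfo` is a hypothetical stand-in for beets' `AlbumInfo`, not the real class; only the `(data_source, id)` shape of `identifier` is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakeInfo:
    """Hypothetical stand-in for AlbumInfo (illustration only)."""

    data_source: Optional[str]
    album_id: Optional[str]
    album: str

    @property
    def identifier(self) -> tuple[Optional[str], Optional[str]]:
        # Composite key: the same raw ID from two sources stays distinct.
        return (self.data_source, self.album_id)


infos = [
    FakeInfo("MusicBrainz", "123", "Album (MusicBrainz)"),
    FakeInfo("Discogs", "123", "Album (Discogs)"),
]

# Keyed by the raw ID, the second candidate overwrites the first.
by_id = {info.album_id: info for info in infos}

# Keyed by (data_source, id), both candidates survive.
by_identifier = {info.identifier: info for info in infos}

print(len(by_id), len(by_identifier))
```

The raw-ID dictionary ends up with one entry and the composite-key dictionary with two, which is the behaviour described for #6181.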

I am also refactoring a couple of other things in the beets.autotag.match module because it is a hot mess.

github-actions · 2025-11-23T14:24:19Z

Thank you for the PR! The changelog has not been updated, so here is a friendly reminder to check if you need to add an entry.

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

- The Candidates type alias is defined as dict[Info.Identifier, AnyMatch] but then used as Candidates[AlbumMatch]/Candidates[TrackMatch], which isn’t a parametrizable generic; consider either making Candidates a TypeAlias with two type parameters (key/value) or annotating the dicts directly to avoid confusing/misleading typing.
- Moving the album_matched event emission into AlbumMatch.__post_init__ makes constructing AlbumMatch objects have side effects everywhere; consider using a factory/helper (or an explicit method) to emit the event so that simple instantiation stays side-effect-free and easier to reason about.
- In _add_candidate, the duplicate check mixes info.album_id and info.identifier while the candidates dict is keyed by identifier; simplifying this to only use identifier for both the truthiness check and the lookup would make the intent clearer and avoid relying on album_id being non-empty.
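
Whether `Candidates[AlbumMatch]` is valid depends on what `AnyMatch` is, which this thread does not show. The sketch below assumes it is a `TypeVar`; in that case the alias is itself generic and can be subscripted. `AlbumMatch` and `TrackMatch` here are empty placeholders, not the beets classes.

```python
from typing import Dict, Optional, Tuple, TypeVar


class AlbumMatch:
    """Placeholder for the real AlbumMatch (illustration only)."""


class TrackMatch:
    """Placeholder for the real TrackMatch (illustration only)."""


Identifier = Tuple[Optional[str], Optional[str]]

# ASSUMPTION: AnyMatch is a TypeVar. If it were a plain Union instead,
# the alias below would have no type parameter and could not be subscripted.
AnyMatch = TypeVar("AnyMatch", AlbumMatch, TrackMatch)

# An alias that mentions a TypeVar is a generic alias.
Candidates = Dict[Identifier, AnyMatch]

album_candidates: Candidates[AlbumMatch] = {}
track_candidates: Candidates[TrackMatch] = {}

album_candidates[("MusicBrainz", "123")] = AlbumMatch()
```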

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `beets/autotag/match.py:203-204` </location>
<code_context>
         return

     # Prevent duplicates.
-    if info.album_id and info.album_id in results:
+    if info.album_id and info.identifier in results:
         log.debug("Duplicate.")
         return
</code_context>

<issue_to_address>
**issue (bug_risk):** Duplicate-prevention now checks album_id but keys are identifier tuples, so it will never filter duplicates.

Since results is keyed by info.identifier (data_source, id), this condition should be based solely on identifier. The album_id guard is now misleading and may skip intended deduping. Consider removing the album_id check and using only `if info.identifier in results:` (or otherwise aligning the condition with how keys are stored).
</issue_to_address>
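
A small sketch of how the two variants of the check differ. It assumes candidates are stored under `(data_source, id)` keys as described in this PR; `FakeInfo` and `is_duplicate` are hypothetical names, not beets code. The only case where the `album_id` guard changes the outcome is a candidate without an ID.

```python
from typing import NamedTuple, Optional


class FakeInfo(NamedTuple):
    """Hypothetical stand-in for AlbumInfo (illustration only)."""

    data_source: Optional[str]
    album_id: Optional[str]

    @property
    def identifier(self):
        return (self.data_source, self.album_id)


def is_duplicate(info, results, require_id=True):
    """Return True when ``info`` is already present in ``results``."""
    if require_id and not info.album_id:
        # Guarded variant: a candidate without an ID is never a duplicate.
        return False
    return info.identifier in results


# Already-seen candidates, keyed by (data_source, id).
seen = {("Discogs", "42"): "first", ("Discogs", None): "id-less"}

same = FakeInfo("Discogs", "42")
other_source = FakeInfo("MusicBrainz", "42")
no_id = FakeInfo("Discogs", None)
```

With the guard, `no_id` is kept; with an identifier-only check (`require_id=False`), it would be dropped because `("Discogs", None)` is already a key.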

### Comment 2
<location> `beets/metadata_plugins.py:58-62` </location>
<code_context>
-    A single ID can yield just a single track, so we return the first match.
-    """
+@notify_info_yielded("trackinfo_received")
+def tracks_for_ids(_id: str) -> Iterable[TrackInfo]:
+    """Return matching albums from all metadata sources for the given ID."""
     for plugin in find_metadata_source_plugins():
-        if info := plugin.track_for_id(_id):
</code_context>

<issue_to_address>
**nitpick (typo):** Docstring for tracks_for_ids mentions albums instead of tracks.

The description looks copied from `albums_for_ids` and should say "tracks" instead of "albums" to match the function’s purpose and avoid confusing metadata source plugin implementors.

```suggestion
@notify_info_yielded("trackinfo_received")
def tracks_for_ids(_id: str) -> Iterable[TrackInfo]:
    """Return matching tracks from all metadata sources for the given ID."""
    for plugin in find_metadata_source_plugins():
        yield from plugin.tracks_for_ids([_id])
```
</issue_to_address>

### Comment 3
<location> `beets/autotag/match.py:284-294` </location>
<code_context>
        if candidates and not config["import"]["timid"]:
            # If we have a very good MBID match, return immediately.
            # Otherwise, this match will compete against metadata-based
            # matches.
            if rec == Recommendation.strong:
                log.debug("ID match.")
                return (
                    cur_artist,
                    cur_album,
                    Proposal(list(candidates.values()), rec),
                )

</code_context>

<issue_to_address>
**suggestion (code-quality):** Merge nested if conditions ([`merge-nested-ifs`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/merge-nested-ifs))

```suggestion
        if candidates and not config["import"]["timid"] and rec == Recommendation.strong:
            log.debug("ID match.")
            return (
                cur_artist,
                cur_album,
                Proposal(list(candidates.values()), rec),
            )

```

<br/><details><summary>Explanation</summary>Too much nesting can make code difficult to understand, and this is especially
true in Python, where there are no brackets to help out with the delineation of
different nesting levels.

Reading deeply nested code is confusing, since you have to keep track of which
conditions relate to which levels. We therefore strive to reduce nesting where
possible, and the situation where two `if` conditions can be combined using
`and` is an easy win.
</details>
</issue_to_address>

Copilot

Pull request overview

This PR refactors the autotag matching system to support returning candidates from multiple metadata sources when searching by ID, and fixes an issue where candidates with duplicate IDs from different sources would overwrite each other.

- Changes metadata plugin API from album_for_id/track_for_id (returning single results) to albums_for_ids/tracks_for_ids (yielding multiple results from all sources)
- Uses composite Info.identifier (tuple of data_source and id) as candidate dictionary keys to prevent cross-source ID collisions
- Converts AlbumMatch and TrackMatch from NamedTuples to dataclasses and moves album_matched event emission to AlbumMatch.__post_init__ to deduplicate event firing

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| test/test_autotag.py | Removes assignment tests (moved to new test file) and unused import |
| test/autotag/test_match.py | New test file containing moved assignment tests plus new tests for multi-source ID matching scenarios |
| beets/metadata_plugins.py | Replaces single-result `album_for_id`/`track_for_id` functions with multi-result `albums_for_ids`/`tracks_for_ids` generators; updates base class method signatures to properly filter None values |
| beets/autotag/match.py | Simplifies `match_by_id`, updates candidate dictionary to use composite identifiers, removes manual `album_matched` event calls (now in dataclass), removes unused `plugins` import |
| beets/autotag/hooks.py | Adds `Info.identifier` property, converts Match classes to dataclasses with `__post_init__` for event emission |

semohr · 2025-11-23T14:27:46Z

beets/autotag/hooks.py

    extra_items: list[Item]
    extra_tracks: list[TrackInfo]

+    def __post_init__(self) -> None:


I would move the event trigger out of the class constructor. A Match object may be constructed independently of a beets pipeline, and I do not want to couple the two this strongly unless necessary.

E.g. we have some serialization logic in beets-flask, and I don't want to trigger this event whenever I load a Match entry from cold storage.
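
One way to read this suggestion is a factory that emits the event, leaving plain construction free of side effects. The sketch below is only an illustration: `EVENTS`, `send`, `Match` and `matched` are hypothetical names, and `send` stands in for beets' plugin event dispatch.

```python
from dataclasses import dataclass

# Stand-in for the plugin event bus (illustration only).
EVENTS: list[str] = []


def send(event: str) -> None:
    EVENTS.append(event)


@dataclass
class Match:
    """Plain data holder: constructing it emits nothing."""

    album: str


def matched(album: str) -> Match:
    """Factory for the tagging pipeline: build the match, then notify."""
    match = Match(album)
    send("album_matched")
    return match
```

The pipeline would call `matched(...)`, while code that only loads stored matches would call `Match(...)` directly. The trade-off raised later in the thread still applies: every call site in the autotagger must remember to use the factory.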

I get the concern about this side effect but moving this to __post_init__ was intentional:

The send("album_matched") call was duplicated across multiple sites in the autotagger - always right after creating AlbumMatch

This guarantees that the event is sent on every AlbumMatch creation. This is especially relevant given that I'm currently refactoring this functionality extensively.

This is actually a textbook use of __post_init__ - PEP 557 explicitly recommends it for side effects that must always happen during initialization. The alternative (keeping manual emissions everywhere) was provably bug-prone.

Re: beets-flask serialization: AlbumMatch/TrackMatch are internal to beets' autotagger, not public plugin API. For deserialization, you could bypass __post_init__ with:

    match = object.__new__(AlbumMatch)
    match.__dict__.update(serialized_data)

Or even better, serialize just the match data rather than the objects themselves. If there's broader need for match serialization, we could discuss adding a proper public API for it.

I want to avoid blocking necessary refactoring of beets' internals based on downstream usage of internal classes. Does the deserialization workaround work for your use case?

I still think the event should not be sent on every initialization. To me, the event is more closely coupled with the tagging logic than with the match object itself, although I do agree that the current approach makes the code a bit cleaner.

> AlbumMatch/TrackMatch are internal to beets' autotagger, not public plugin API.

How does one identify the public API in this case? We do not use __all__, and the names have no leading underscore. Since nothing marks them as internal, AlbumMatch/TrackMatch are public API according to PEP 8.

Historically beets has treated only the plugin API as public, yes, but the project is also used as a library, and I think that use case deserves consideration as well. Without explicit boundaries, users reasonably assume that importable classes are fair game.

> Or even better, serialize just the match data rather than the objects themselves. If there's broader need for match serialization, we could discuss adding a proper public API for it.

Just to clarify: I meant only deserialization. This was what the "load from cold storage" comment was referencing.

> I want to avoid blocking necessary refactoring of beets' internals based on downstream usage of internal classes. Does the deserialization workaround work for your use case?

We routinely do block changes, or at least adjust them, because of potential downstream usage. Wanting to avoid that concern here feels a bit inconsistent.

The workaround would work, but it shifts the burden entirely onto downstream users and breaks existing programs.

I’m not opposed to the change in principle, and I don’t think it needs to block progress, but by any reasonable definition, this is a breaking change. In my view that implies either a major-version bump or, alternatively, introducing a minimal public deserialization format so maintainers can refactor freely without silently breaking consumers.

Fair points - you're right about the API ambiguity and that beets is used as a library.

How about adding a from_dict() classmethod that skips the event?

    @classmethod
    def from_dict(cls, data: dict) -> AlbumMatch:
        """Reconstruct from serialized data without emitting events."""
        obj = object.__new__(cls)
        obj.__dict__.update(data)
        return obj

This gives library users a stable deserialization path, keeps the __post_init__ enforcement for the autotagger, and avoids the major version bump debate. Would that work for beets-flask?
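
A self-contained sketch of this proposal, to make the behaviour concrete. `AlbumMatchSketch` and the module-level `EVENTS` list are hypothetical; the real `AlbumMatch` has different fields and sends the event through the plugin system.

```python
from dataclasses import dataclass
from typing import Any

# Stand-in for the plugin event bus (illustration only).
EVENTS: list[str] = []


@dataclass
class AlbumMatchSketch:
    """Hypothetical, simplified stand-in for AlbumMatch."""

    distance: float
    album: str

    def __post_init__(self) -> None:
        # Runs on every normal construction.
        EVENTS.append("album_matched")

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "AlbumMatchSketch":
        """Rebuild an instance without running __init__/__post_init__."""
        obj = object.__new__(cls)
        obj.__dict__.update(data)
        return obj
```

Note that this bypass performs no validation of `data`, and it relies on instances having a `__dict__`, so it would not work unchanged for a dataclass declared with `slots=True`.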

semohr · 2025-11-23T14:43:35Z

beets/autotag/hooks.py


+    Identifier = tuple[str | None, str | None]
+
+    @property


I do not like that we raise a NotImplementedError here. We should make the Info class abstract, or a protocol, if we want to define a contract for inheritance.

I considered using ABC + @abstractmethod here, but opted against it for these reasons:

Limited scope: Info only has 2 concrete subclasses - it's not a public plugin interface where we need strict enforcement at instantiation time.

Template pattern, not an interface: The base class provides real shared functionality (identifier property, __repr__, common __init__ parameters). The id and name properties are just internal adapters mapping to different field names in subclasses (e.g., album_id vs track_id).

Testing overhead: Making it an ABC would require either creating stub implementations or monkeypatching __abstractmethods__ in tests. Since we're not exposing Info for external extension, the ceremony doesn't add value.

The NotImplementedError approach clearly documents "subclasses must override this" without the ABC machinery. If we later expose Info for some plugin extendability, I'd absolutely convert it to ABC at that point.

Happy to reconsider if you feel strongly about it though.

> Template pattern, not an interface: The base class provides real shared functionality (identifier property, __repr__, common __init__ parameters). The id and name properties are just internal adapters mapping to different field names in subclasses (e.g., album_id vs track_id).

Abstract classes or protocols can also provide real shared functionality.

> Testing overhead: Making it an ABC would require either creating stub implementations or monkeypatching __abstractmethods__ in tests. Since we're not exposing Info for external extension, the ceremony doesn't add value.

We actually don’t construct raw Info instances in tests. Tests only ever instantiate AlbumInfo or TrackInfo directly. So converting Info to an ABC wouldn’t add test-work in practice.

> The NotImplementedError approach clearly documents "subclasses must override this" without the ABC machinery. If we later expose Info for some plugin extendability, I'd absolutely convert it to ABC at that point.

It looks like this usage has already begun showing up in plugins (mbpseudo.py is one example). Given that the system is already being extended, it might be safer to formalize the interface now rather than later.

I think there's some confusion here - mbpseudo.py doesn't subclass Info. It's a metadata source plugin that returns AlbumInfo/TrackInfo instances, but it doesn't create new Info subclasses.

On the other hand, discogs does - however, that approach is being reverted in #6179 since this subclass unintentionally introduced flexible attributes that ended up written into the database.

Given the above (that they actually should not be subclassed), I'd prefer to keep it simple with NotImplementedError for now. We can formalize it as an ABC if/when there's an actual need for plugin extensibility.
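
For reference, the practical difference between the two approaches discussed here is when the error surfaces. The classes below are hypothetical, simplified stand-ins for `Info` and `AlbumInfo`, written only to show that difference.

```python
from abc import ABC, abstractmethod


class InfoPlain:
    """NotImplementedError variant: fails when the property is read."""

    data_source = "source"

    @property
    def id(self):
        raise NotImplementedError

    @property
    def identifier(self):
        return (self.data_source, self.id)


class InfoAbstract(ABC):
    """ABC variant: fails when the base class is instantiated."""

    data_source = "source"

    @property
    @abstractmethod
    def id(self):
        ...

    @property
    def identifier(self):
        # Shared behaviour still lives on the abstract base.
        return (self.data_source, self.id)


class AlbumInfoSketch(InfoAbstract):
    """Concrete subclass mapping ``id`` onto its own field name."""

    def __init__(self, album_id):
        self.album_id = album_id

    @property
    def id(self):
        return self.album_id
```

`InfoPlain()` can be created and only raises once `identifier` is read, whereas `InfoAbstract()` raises `TypeError` immediately. Both variants keep the shared `identifier` logic on the base class.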

semohr · 2025-11-23T14:54:19Z

beets/metadata_plugins.py

-    A single ID can yield just a single album, so we return the first match.
-    """
+@notify_info_yielded("albuminfo_received")
+def albums_for_ids(_id: str) -> Iterable[AlbumInfo]:


Should be named albums_for_id if we keep the current call signature. Btw. this seems like a breaking change to me. Can we do this without a major version increment?

I named it following the method names available on MetadataSourcePlugin for consistency. I'll update the parameter to accept a list of IDs!

> Btw. this seems like a breaking change to me. Can we do this without a major version increment?

As far as I'm aware, these functions are only used internally.

This is one of the reasons why I tend towards using *args, **kwargs...

Dug the removed functions back out from the grave - didn't realise that we do use them - and added a data_source parameter to both.
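
A rough sketch of the list-accepting variant mentioned above, which yields results from every source rather than stopping at the first hit. `FakeSource` and `SOURCES` are hypothetical; the real function iterates over the loaded metadata source plugins and yields `AlbumInfo` objects.

```python
from typing import Iterable, Iterator


class FakeSource:
    """Hypothetical metadata source backed by an in-memory catalogue."""

    def __init__(self, data_source: str, catalogue: dict[str, str]) -> None:
        self.data_source = data_source
        self.catalogue = catalogue

    def albums_for_ids(self, ids: Iterable[str]) -> Iterator[tuple[str, str]]:
        for _id in ids:
            if _id in self.catalogue:
                yield (self.data_source, self.catalogue[_id])


SOURCES = [
    FakeSource("MusicBrainz", {"1": "Album A"}),
    FakeSource("Discogs", {"1": "Album B", "2": "Album C"}),
]


def albums_for_ids(ids: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield matches for the given IDs from all sources."""
    ids = list(ids)  # materialise once so every source sees all IDs
    for source in SOURCES:
        yield from source.albums_for_ids(ids)
```

Calling `list(albums_for_ids(["1"]))` returns one entry per source that knows the ID, instead of only the first match.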

codecov · 2025-12-05T08:43:10Z

Codecov Report

❌ Patch coverage is 74.69880% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.36%. Comparing base (a62f4fb) to head (079749c).
⚠️ Report is 56 commits behind head on master.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| beets/metadata_plugins.py | 60.00% | 10 Missing ⚠️ |
| beets/autotag/match.py | 75.75% | 5 Missing and 3 partials ⚠️ |
| beetsplug/missing.py | 0.00% | 2 Missing ⚠️ |
| beetsplug/mbsync.py | 87.50% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6184      +/-   ##
==========================================
+ Coverage   68.26%   68.36%   +0.10%     
==========================================
  Files         138      138              
  Lines       18791    18809      +18     
  Branches     3167     3164       -3     
==========================================
+ Hits        12827    12859      +32     
+ Misses       5290     5280      -10     
+ Partials      674      670       -4

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| beets/autotag/hooks.py | `100.00% <100.00%> (ø)` | |
| beetsplug/mbsync.py | `75.90% <87.50%> (-5.92%)` | ⬇️ |
| beetsplug/missing.py | `33.33% <0.00%> (ø)` | |
| beets/autotag/match.py | `84.50% <75.75%> (+7.58%)` | ⬆️ |
| beets/metadata_plugins.py | `80.95% <60.00%> (+4.43%)` | ⬆️ |

... and 1 file with indirect coverage changes

Restore album_for_id and track_for_id functions in metadata_plugins to support data source-specific lookups. These functions accept both an ID and data_source parameter, enabling plugins like mbsync and missing to retrieve metadata from the correct source. Update mbsync and missing plugins to use the restored functions with explicit data_source parameters. Add data_source validation to prevent lookups when the source is not specified. Add get_metadata_source helper function to retrieve plugins by their data_source name, cached for performance.
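
The lookup described in this commit message could look roughly like the following. This is a sketch under assumptions: `FakeSource` and `PLUGINS` are hypothetical, results are plain strings rather than `AlbumInfo`, and the use of `functools.lru_cache` is just one way to get the caching the message mentions.

```python
from functools import lru_cache
from typing import Optional


class FakeSource:
    """Hypothetical metadata source backed by an in-memory catalogue."""

    def __init__(self, data_source: str, catalogue: dict[str, str]) -> None:
        self.data_source = data_source
        self.catalogue = catalogue

    def album_for_id(self, _id: str) -> Optional[str]:
        return self.catalogue.get(_id)


PLUGINS = [
    FakeSource("MusicBrainz", {"1": "Album A"}),
    FakeSource("Discogs", {"1": "Album B"}),
]


@lru_cache(maxsize=None)
def get_metadata_source(name: str) -> Optional[FakeSource]:
    """Return the plugin whose data_source equals ``name``, if any."""
    return next((p for p in PLUGINS if p.data_source == name), None)


def album_for_id(_id: str, data_source: str) -> Optional[str]:
    """Look up ``_id`` in one specific source only."""
    if not data_source:
        # Without a source we cannot tell which plugin owns the ID.
        return None
    plugin = get_metadata_source(data_source)
    return plugin.album_for_id(_id) if plugin else None
```

Because the same raw ID can exist in several sources, passing `data_source` explicitly is what lets callers such as mbsync and missing fetch the record that an item was originally tagged from.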

Copilot AI review requested due to automatic review settings November 23, 2025 14:24

snejus requested review from a team and semohr as code owners November 23, 2025 14:24

snejus requested a review from henry-oberholtzer November 23, 2025 14:24

snejus linked an issue Nov 23, 2025 that may be closed by this pull request

Autotagger considers candidates with same album id from different data sources as duplicates #6181

Open

Copilot started reviewing on behalf of snejus November 23, 2025 14:24

sourcery-ai bot reviewed Nov 23, 2025

Copilot finished reviewing on behalf of snejus November 23, 2025 14:26

Copilot AI reviewed Nov 23, 2025

snejus force-pushed the return-candidates-from-all-data-sources-on-id-search branch from 95fecc5 to c8c62b3 Compare November 23, 2025 14:28

semohr reviewed Nov 23, 2025

snejus force-pushed the return-candidates-from-all-data-sources-on-id-search branch 6 times, most recently from a4109de to 7282ede Compare December 3, 2025 02:17

snejus mentioned this pull request Dec 4, 2025

Empty metadata support for autotagger plugins #6065

Open

snejus force-pushed the return-candidates-from-all-data-sources-on-id-search branch from 7282ede to e89d97d Compare December 5, 2025 08:37

snejus mentioned this pull request Dec 19, 2025

Support multiple pseudo-releases and reimport in mbpseudo #6163

Closed

3 tasks

snejus added 7 commits December 26, 2025 18:53

Move assignment tests to test/autotag/test_match.py

9be0dc0

Add a test to reproduce the issue

0077dbe

Return album candidates from multiple sources when matching by IDs

f36956d

Take data source into account when deciding duplicate candidates

fb6f046

Refactor match_by_id

fcd9cca

Invoke album_matched hook from AlbumMatch.__post_init__

71d6ebd

Search for multiple album/track ids

e9dab62

snejus force-pushed the return-candidates-from-all-data-sources-on-id-search branch from e89d97d to 079749c Compare December 26, 2025 19:24

3 participants
