Conversation

@snejus
Member

@snejus snejus commented Nov 23, 2025

Closes #6178 (multiple metadata source results per ID) and #6181 (duplicate/overwrite of candidates).

I am also refactoring a couple of other things in the beets.autotag.match module because it is a hot mess.

Copilot AI review requested due to automatic review settings November 23, 2025 14:24
@snejus snejus requested review from a team and semohr as code owners November 23, 2025 14:24
@github-actions

Thank you for the PR! The changelog has not been updated, so here is a friendly reminder to check if you need to add an entry.

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • The Candidates type alias is defined as dict[Info.Identifier, AnyMatch] but then used as Candidates[AlbumMatch]/Candidates[TrackMatch], which isn’t a parametrizable generic; consider either making Candidates a TypeAlias with two type parameters (key/value) or annotating the dicts directly to avoid confusing/misleading typing.
  • Moving the album_matched event emission into AlbumMatch.__post_init__ makes constructing AlbumMatch objects have side effects everywhere; consider using a factory/helper (or an explicit method) to emit the event so that simple instantiation stays side-effect-free and easier to reason about.
  • In _add_candidate, the duplicate check mixes info.album_id and info.identifier while the candidates dict is keyed by identifier; simplifying this to only use identifier for both the truthiness check and the lookup would make the intent clearer and avoid relying on album_id being non-empty.
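
To illustrate the first point, here is a minimal sketch of a subscriptable alias. It assumes `Identifier` is the `(data_source, id)` tuple described in this PR; `AlbumMatch` and `TrackMatch` below are hypothetical stand-ins, not the beets classes. An alias can only be written as `Candidates[AlbumMatch]` when it contains a type variable:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, TypeVar

# Assumed shape of the identifier: (data_source, id).
Identifier = Tuple[Optional[str], Optional[str]]


@dataclass
class AlbumMatch:  # stand-in, not the beets class
    distance: float


@dataclass
class TrackMatch:  # stand-in, not the beets class
    distance: float


# The TypeVar is what makes the alias generic over its value type.
AnyMatch = TypeVar("AnyMatch", AlbumMatch, TrackMatch)
Candidates = Dict[Identifier, AnyMatch]

# Subscripting the alias substitutes AlbumMatch for the type variable.
album_candidates: Candidates[AlbumMatch] = {
    ("musicbrainz", "abc"): AlbumMatch(distance=0.1),
}
```

If `AnyMatch` were a plain union instead of a `TypeVar`, the alias would have no free parameter and subscripting it would be an error.
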
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `beets/autotag/match.py:203-204` </location>
<code_context>
         return

     # Prevent duplicates.
-    if info.album_id and info.album_id in results:
+    if info.album_id and info.identifier in results:
         log.debug("Duplicate.")
         return
</code_context>

<issue_to_address>
**issue (bug_risk):** Duplicate-prevention now checks album_id but keys are identifier tuples, so it will never filter duplicates.

Since results is keyed by info.identifier (data_source, id), this condition should be based solely on identifier. The album_id guard is now misleading and may skip intended deduping. Consider removing the album_id check and using only `if info.identifier in results:` (or otherwise aligning the condition with how keys are stored).
</issue_to_address>
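
A minimal sketch of the identifier-only check suggested above, assuming `identifier` is the `(data_source, id)` tuple. The `Info` class and `add_candidate` helper are simplified stand-ins for illustration, not the beets implementation:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Info:  # simplified stand-in for AlbumInfo
    data_source: str | None
    album_id: str | None

    @property
    def identifier(self) -> tuple[str | None, str | None]:
        return (self.data_source, self.album_id)


def add_candidate(results: dict, info: Info) -> bool:
    """Store info unless the same (data_source, id) pair is already present."""
    if info.identifier in results:
        return False  # duplicate from the same source
    results[info.identifier] = info
    return True


results: dict = {}
add_candidate(results, Info("musicbrainz", "1"))  # stored
add_candidate(results, Info("discogs", "1"))      # stored: other source
add_candidate(results, Info("musicbrainz", "1"))  # rejected: same pair
```

One consequence worth checking before dropping the `album_id` guard: in this sketch, two candidates from the same source that both lack an ID share the key `(source, None)`, so the second one is rejected.
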

### Comment 2
<location> `beets/metadata_plugins.py:58-62` </location>
<code_context>
-    A single ID can yield just a single track, so we return the first match.
-    """
+@notify_info_yielded("trackinfo_received")
+def tracks_for_ids(_id: str) -> Iterable[TrackInfo]:
+    """Return matching albums from all metadata sources for the given ID."""
     for plugin in find_metadata_source_plugins():
-        if info := plugin.track_for_id(_id):
</code_context>

<issue_to_address>
**nitpick (typo):** Docstring for tracks_for_ids mentions albums instead of tracks.

The description looks copied from `albums_for_ids` and should say "tracks" instead of "albums" to match the function’s purpose and avoid confusing metadata source plugin implementors.

```suggestion
@notify_info_yielded("trackinfo_received")
def tracks_for_ids(_id: str) -> Iterable[TrackInfo]:
    """Return matching tracks from all metadata sources for the given ID."""
    for plugin in find_metadata_source_plugins():
        yield from plugin.tracks_for_ids([_id])
```
</issue_to_address>

### Comment 3
<location> `beets/autotag/match.py:284-294` </location>
<code_context>
        if candidates and not config["import"]["timid"]:
            # If we have a very good MBID match, return immediately.
            # Otherwise, this match will compete against metadata-based
            # matches.
            if rec == Recommendation.strong:
                log.debug("ID match.")
                return (
                    cur_artist,
                    cur_album,
                    Proposal(list(candidates.values()), rec),
                )

</code_context>

<issue_to_address>
**suggestion (code-quality):** Merge nested if conditions ([`merge-nested-ifs`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/merge-nested-ifs))

```suggestion
        if candidates and not config["import"]["timid"] and rec == Recommendation.strong:
            log.debug("ID match.")
            return (
                cur_artist,
                cur_album,
                Proposal(list(candidates.values()), rec),
            )

```

<br/><details><summary>Explanation</summary>Too much nesting can make code difficult to understand, and this is especially
true in Python, where there are no brackets to help out with the delineation of
different nesting levels.

Reading deeply nested code is confusing, since you have to keep track of which
conditions relate to which levels. We therefore strive to reduce nesting where
possible, and the situation where two `if` conditions can be combined using
`and` is an easy win.
</details>
</issue_to_address>

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the autotag matching system to support returning candidates from multiple metadata sources when searching by ID, and fixes an issue where candidates with duplicate IDs from different sources would overwrite each other.

  • Changes metadata plugin API from album_for_id/track_for_id (returning single results) to albums_for_ids/tracks_for_ids (yielding multiple results from all sources)
  • Uses composite Info.identifier (tuple of data_source and id) as candidate dictionary keys to prevent cross-source ID collisions
  • Converts AlbumMatch and TrackMatch from NamedTuples to dataclasses and moves album_matched event emission to AlbumMatch.__post_init__ to deduplicate event firing
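
The first two points can be sketched together. The example below is an illustration only: `FakeSource` and the tuple-shaped results are hypothetical, and the module-level `albums_for_ids` takes the sources as an argument purely to keep the sketch self-contained.

```python
from __future__ import annotations

from typing import Iterable, Iterator

Result = tuple  # (data_source, id, title) in this sketch


class FakeSource:
    """Hypothetical stand-in for a metadata source plugin."""

    def __init__(self, data_source: str, known: dict[str, str]) -> None:
        self.data_source = data_source
        self.known = known  # maps ID -> album title

    def albums_for_ids(self, ids: Iterable[str]) -> Iterator[Result]:
        # Yield one result for every ID this source recognises.
        for _id in ids:
            if _id in self.known:
                yield (self.data_source, _id, self.known[_id])


def albums_for_ids(sources: list[FakeSource], _id: str) -> Iterator[Result]:
    """Yield matches from every source instead of stopping at the first hit."""
    for source in sources:
        yield from source.albums_for_ids([_id])


sources = [
    FakeSource("musicbrainz", {"123": "Album A"}),
    FakeSource("discogs", {"123": "Album B"}),
]

# Keyed by (data_source, id), the two results for "123" do not collide.
candidates = {
    (src, _id): title for src, _id, title in albums_for_ids(sources, "123")
}
```

Keyed by the bare ID, the second result would have overwritten the first.
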

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
| --- | --- |
| test/test_autotag.py | Removes assignment tests (moved to new test file) and unused import |
| test/autotag/test_match.py | New test file containing moved assignment tests plus new tests for multi-source ID matching scenarios |
| beets/metadata_plugins.py | Replaces single-result album_for_id/track_for_id functions with multi-result albums_for_ids/tracks_for_ids generators; updates base class method signatures to properly filter None values |
| beets/autotag/match.py | Simplifies match_by_id, updates candidate dictionary to use composite identifiers, removes manual album_matched event calls (now in dataclass), removes unused plugins import |
| beets/autotag/hooks.py | Adds Info.identifier property, converts Match classes to dataclasses with __post_init__ for event emission |

@snejus snejus force-pushed the return-candidates-from-all-data-sources-on-id-search branch from 95fecc5 to c8c62b3 Compare November 23, 2025 14:28
extra_items: list[Item]
extra_tracks: list[TrackInfo]

def __post_init__(self) -> None:
Contributor

I would move the event trigger out of the class constructor. A Match object may be constructed independently of a beets pipeline, and I do not want to couple the two this strongly unless necessary.

E.g. we have some serialization logic in beets-flask, and I don't want to trigger this whenever I load a Match entry from cold storage.
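
A rough sketch of that alternative, for illustration only: `send`, `sent_events`, and `make_album_match` are hypothetical names, and `AlbumMatch` is reduced to a single field. Construction stays free of side effects, and only the helper used by the tagging code emits the event.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical event hook standing in for the plugin event dispatch.
sent_events: list[str] = []


def send(event: str) -> None:
    sent_events.append(event)


@dataclass
class AlbumMatch:  # simplified stand-in
    distance: float


def make_album_match(distance: float) -> AlbumMatch:
    """Build a match and emit the event; the tagging code would call this."""
    match = AlbumMatch(distance)
    send("album_matched")
    return match


loaded = AlbumMatch(0.2)        # e.g. restored from storage: no event
tagged = make_album_match(0.1)  # created during tagging: one event
```

The trade-off is the one raised in this thread: every call site in the tagger has to go through the helper for the event to fire.
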

Member Author

@snejus snejus Nov 24, 2025

I get the concern about this side effect but moving this to __post_init__ was intentional:

  1. The send("album_matched") call was duplicated across multiple sites in the autotagger - always right after creating AlbumMatch
  2. This guarantees that the event is sent on every AlbumMatch creation. This is especially relevant given that I'm currently refactoring this functionality extensively.

This is a deliberate use of __post_init__: PEP 557 provides it as the hook that runs after the generated __init__, so the event fires on every construction. The alternative (keeping manual emissions everywhere) was demonstrably bug-prone.

Re: beets-flask serialization: AlbumMatch/TrackMatch are internal to beets' autotagger, not public plugin API. For deserialization, you could bypass __post_init__ with:

match = object.__new__(AlbumMatch)
match.__dict__.update(serialized_data)

Or even better, serialize just the match data rather than the objects themselves. If there's broader need for match serialization, we could discuss adding a proper public API for it.

I want to avoid blocking necessary refactoring of beets' internals based on downstream usage of internal classes. Does the deserialization workaround work for your use case?
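
For reference, a small sketch of both behaviours described in this comment. It uses a simplified one-field `AlbumMatch` and a hypothetical `sent_events` list in place of the real event dispatch:

```python
from __future__ import annotations

from dataclasses import dataclass

sent_events: list[str] = []  # hypothetical stand-in for the event bus


@dataclass
class AlbumMatch:  # simplified stand-in
    distance: float

    def __post_init__(self) -> None:
        # Runs after the generated __init__, i.e. on every normal construction.
        sent_events.append("album_matched")


normal = AlbumMatch(0.1)  # emits the event

# Bypass: allocate without calling __init__, then fill the instance dict.
restored = object.__new__(AlbumMatch)
restored.__dict__.update({"distance": 0.2})
```

The bypass relies on the dataclass having an instance `__dict__` (it would not work with `slots=True`), and it skips any other logic that `__init__` or `__post_init__` performs.
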

Contributor

I still think the event should not be sent on every initialization. To me, the event is more closely coupled with the tagging logic than with the match object itself, although I do agree that the current approach makes the code a bit cleaner.

> AlbumMatch/TrackMatch are internal to beets' autotagger, not public plugin API.

How does one identify public API in this case? We do not use __all__ and there is no underscore in the naming. As there is no internal-use indicator, AlbumMatch/TrackMatch are public API according to PEP 8.

Historically beets has treated only the plugin API as public, yes, but the project is also used as a library, and I think that use case deserves consideration as well. Without explicit boundaries, users reasonably assume that importable classes are fair game.

> Or even better, serialize just the match data rather than the objects themselves. If there's broader need for match serialization, we could discuss adding a proper public API for it.

Just to clarify: I meant only deserialization. This was what the "load from cold storage" comment was referencing.

> I want to avoid blocking necessary refactoring of beets' internals based on downstream usage of internal classes. Does the deserialization workaround work for your use case?

We routinely do block changes, or at least adjust them, because of potential downstream usage. Wanting to avoid that concern here feels a bit inconsistent.

The workaround would work, but it shifts the burden entirely onto downstream users and breaks existing programs.


I’m not opposed to the change in principle, and I don’t think it needs to block progress, but by any reasonable definition, this is a breaking change. In my view that implies either a major-version bump or, alternatively, introducing a minimal public deserialization format so maintainers can refactor freely without silently breaking consumers.

Member Author

Fair points - you're right about the API ambiguity and that beets is used as a library.

How about adding a from_dict() classmethod that skips the event?

@classmethod
def from_dict(cls, data: dict) -> AlbumMatch:
    """Reconstruct from serialized data without emitting events."""
    obj = object.__new__(cls)
    obj.__dict__.update(data)
    return obj

This gives library users a stable deserialization path, keeps the __post_init__ enforcement for the autotagger, and avoids the major version bump debate. Would that work for beets-flask?


Identifier = tuple[str | None, str | None]

@property
Contributor

I do not like that we raise a NotImplementedError here. We should make the Info class abstract or a protocol if we want to define a contract for inheritance.

Member Author

I considered using ABC + @abstractmethod here, but opted against it for these reasons:

  1. Limited scope: Info only has 2 concrete subclasses - it's not a public plugin interface where we need strict enforcement at instantiation time.

  2. Template pattern, not an interface: The base class provides real shared functionality (identifier property, __repr__, common __init__ parameters). The id and name properties are just internal adapters mapping to different field names in subclasses (e.g., album_id vs track_id).

  3. Testing overhead: Making it an ABC would require either creating stub implementations or monkeypatching __abstractmethods__ in tests. Since we're not exposing Info for external extension, the ceremony doesn't add value.

The NotImplementedError approach clearly documents "subclasses must override this" without the ABC machinery. If we later expose Info for some plugin extendability, I'd absolutely convert it to ABC at that point.

Happy to reconsider if you feel strongly about it though.
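
A condensed sketch of the pattern described here, with simplified constructors that are assumptions for illustration rather than the beets signatures:

```python
from __future__ import annotations


class Info:  # simplified stand-in
    def __init__(self, data_source: str | None = None) -> None:
        self.data_source = data_source

    @property
    def id(self) -> str | None:
        raise NotImplementedError  # subclasses map this to their own field

    @property
    def identifier(self) -> tuple[str | None, str | None]:
        # Shared logic lives in the base class.
        return (self.data_source, self.id)


class AlbumInfo(Info):
    def __init__(
        self, album_id: str | None, data_source: str | None = None
    ) -> None:
        super().__init__(data_source)
        self.album_id = album_id

    @property
    def id(self) -> str | None:
        return self.album_id


album = AlbumInfo("42", data_source="discogs")
base = Info("discogs")  # allowed: the error only appears on attribute access
```

With this approach a bare `Info` can be instantiated; the missing override is reported only when `id` (or anything built on it, such as `identifier`) is accessed.
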

Contributor

> Template pattern, not an interface: The base class provides real shared functionality (identifier property, __repr__, common __init__ parameters). The id and name properties are just internal adapters mapping to different field names in subclasses (e.g., album_id vs track_id).

Abstract classes or protocols can also provide real shared functionality.
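
As a sketch of that point, assuming simplified constructors that are not the beets signatures: an ABC can carry the shared `identifier` logic while still declaring `id` as required.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Info(ABC):  # simplified stand-in
    def __init__(self, data_source: str | None = None) -> None:
        self.data_source = data_source

    @property
    @abstractmethod
    def id(self) -> str | None:
        """Subclasses map this to their own field."""

    @property
    def identifier(self) -> tuple[str | None, str | None]:
        # Concrete, shared behaviour defined on the abstract base.
        return (self.data_source, self.id)


class TrackInfo(Info):
    def __init__(
        self, track_id: str | None, data_source: str | None = None
    ) -> None:
        super().__init__(data_source)
        self.track_id = track_id

    @property
    def id(self) -> str | None:
        return self.track_id


track = TrackInfo("7", data_source="spotify")
```

Here instantiating `Info` directly raises `TypeError`, so a missing override is caught at construction rather than on first access.
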

> Testing overhead: Making it an ABC would require either creating stub implementations or monkeypatching __abstractmethods__ in tests. Since we're not exposing Info for external extension, the ceremony doesn't add value.

We actually don’t construct raw Info instances in tests. Tests only ever instantiate AlbumInfo or TrackInfo directly. So converting Info to an ABC wouldn’t add test-work in practice.

> The NotImplementedError approach clearly documents "subclasses must override this" without the ABC machinery. If we later expose Info for some plugin extendability, I'd absolutely convert it to ABC at that point.

It looks like this usage has already begun showing up in plugins (mbpseudo.py is one example). Given that the system is already being extended, it might be safer to formalize the interface now rather than later.

Member Author

I think there's some confusion here - mbpseudo.py doesn't subclass Info. It's a metadata source plugin that returns AlbumInfo/TrackInfo instances, but it doesn't create new Info subclasses.

On the other hand, discogs does - however, that approach is being reverted in #6179 since this subclass unintentionally introduced flexible attributes that ended up written into the database.

Given the above (that they actually should not be subclassed), I'd prefer to keep it simple with NotImplementedError for now. We can formalize it as an ABC if/when there's an actual need for plugin extensibility.

A single ID can yield just a single album, so we return the first match.
"""
@notify_info_yielded("albuminfo_received")
def albums_for_ids(_id: str) -> Iterable[AlbumInfo]:
Contributor

Should be named albums_for_id if we keep the current call signature. Btw. this seems like a breaking change to me. Can we do this without a major version increment?

Member Author

@snejus snejus Nov 29, 2025

I named it following the method names available on MetadataSourcePlugin for consistency. I'll update the parameter to accept a list of IDs!

> Btw. this seems like a breaking change to me. Can we do this without a major version increment?

As far as I'm aware, these functions are only used internally.

Member Author

This is one of the reasons why I tend towards using *args, **kwargs...

Member Author

Dug the removed functions back out from the grave - didn't realise that we do use them - and added data_source parameter to both.

@snejus snejus force-pushed the return-candidates-from-all-data-sources-on-id-search branch 6 times, most recently from a4109de to 7282ede Compare December 3, 2025 02:17
@snejus snejus force-pushed the return-candidates-from-all-data-sources-on-id-search branch from 7282ede to e89d97d Compare December 5, 2025 08:37
@codecov

codecov bot commented Dec 5, 2025

Codecov Report

❌ Patch coverage is 74.69880% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.36%. Comparing base (a62f4fb) to head (079749c).
⚠️ Report is 56 commits behind head on master.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| beets/metadata_plugins.py | 60.00% | 10 Missing ⚠️ |
| beets/autotag/match.py | 75.75% | 5 Missing and 3 partials ⚠️ |
| beetsplug/missing.py | 0.00% | 2 Missing ⚠️ |
| beetsplug/mbsync.py | 87.50% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6184      +/-   ##
==========================================
+ Coverage   68.26%   68.36%   +0.10%     
==========================================
  Files         138      138              
  Lines       18791    18809      +18     
  Branches     3167     3164       -3     
==========================================
+ Hits        12827    12859      +32     
+ Misses       5290     5280      -10     
+ Partials      674      670       -4     
| Files with missing lines | Coverage Δ |
| --- | --- |
| beets/autotag/hooks.py | 100.00% <100.00%> (ø) |
| beetsplug/mbsync.py | 75.90% <87.50%> (-5.92%) ⬇️ |
| beetsplug/missing.py | 33.33% <0.00%> (ø) |
| beets/autotag/match.py | 84.50% <75.75%> (+7.58%) ⬆️ |
| beets/metadata_plugins.py | 80.95% <60.00%> (+4.43%) ⬆️ |

... and 1 file with indirect coverage changes

Restore album_for_id and track_for_id functions in metadata_plugins to
support data source-specific lookups. These functions accept both an ID
and data_source parameter, enabling plugins like mbsync and missing to
retrieve metadata from the correct source.

Update mbsync and missing plugins to use the restored functions with
explicit data_source parameters. Add data_source validation to prevent
lookups when the source is not specified.

Add get_metadata_source helper function to retrieve plugins by their
data_source name, cached for performance.
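
A rough sketch of how such a lookup could fit together. Everything here is an assumption for illustration: `FakeSource`, the `SOURCES` registry, and the exact signatures are hypothetical, and `functools.lru_cache` is just one way to cache the lookup.

```python
from __future__ import annotations

from functools import lru_cache


class FakeSource:
    """Hypothetical stand-in for a metadata source plugin."""

    def __init__(self, data_source: str, albums: dict[str, str]) -> None:
        self.data_source = data_source
        self.albums = albums  # maps ID -> album title

    def album_for_id(self, _id: str) -> str | None:
        return self.albums.get(_id)


# Assumed registry; in beets this would come from the loaded plugins.
SOURCES = [
    FakeSource("MusicBrainz", {"1": "Album A"}),
    FakeSource("Discogs", {"1": "Album B"}),
]


@lru_cache(maxsize=None)
def get_metadata_source(name: str) -> FakeSource | None:
    """Look a source up by its data_source name; cached after the first call."""
    return next((s for s in SOURCES if s.data_source == name), None)


def album_for_id(_id: str, data_source: str) -> str | None:
    """Return the album from the named source only, or None if unavailable."""
    if not data_source:
        return None  # no lookup without an explicit source
    source = get_metadata_source(data_source)
    return source.album_for_id(_id) if source else None
```

With the source named explicitly, the same ID resolves to a different album depending on which source the item originally came from.
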
@snejus snejus force-pushed the return-candidates-from-all-data-sources-on-id-search branch from e89d97d to 079749c Compare December 26, 2025 19:24

3 participants