Add stepwise intermediate reward for RL #526
Shaobo-Zhou wants to merge 110 commits into munich-quantum-toolkit:main from
Conversation
Update action space and feature space
Update actions
Update action space
Fix: resolve pre-commit issues and add missing annotations
Remove example_test.py
Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>
Fix Windows runtime warning issue
Actionable comments posted: 2
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
src/mqt/predictor/rl/cost_model.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-09T13:20:11.483Z
Learnt from: DRovara
Repo: munich-quantum-toolkit/core PR: 1108
File: mlir/test/Dialect/MQTOpt/Transforms/lift-measurements.mlir:269-288
Timestamp: 2025-10-09T13:20:11.483Z
Learning: In the MQT MLIR dialect, the `rz` gate should not be included in the `DIAGONAL_GATES` set for the `ReplaceBasisStateControlsWithIfPattern` because its operator matrix does not have the required shape | 1 0 | / | 0 x | for the targets-as-controls optimization. It is only included in `LiftMeasurementsAboveGatesPatterns` where the matrix structure requirement differs.
Applied to files:
src/mqt/predictor/rl/cost_model.py
📚 Learning: 2025-11-01T15:57:31.153Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1283
File: src/qir/runtime/QIR.cpp:196-201
Timestamp: 2025-11-01T15:57:31.153Z
Learning: In the QIR runtime (src/qir/runtime/QIR.cpp), the PRX gate (__quantum__qis__prx__body) is an alias for the R gate (Phased X-Rotation) and should call runtime.apply<qc::R>(theta, phi, qubit), not runtime.apply<qc::RX>() which is a single-parameter rotation gate.
Applied to files:
src/mqt/predictor/rl/cost_model.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
- GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
- GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
🔇 Additional comments (5)
src/mqt/predictor/rl/cost_model.py (5)
1-36: LGTM: Clean module structure and imports. The module-level organization is solid: clear documentation, appropriate use of TYPE_CHECKING for forward references, and a logger for diagnostics. The CanonicalCostTable type alias using Mapping correctly conveys immutability.
42-148: LGTM: Device-specific cost tables are well-documented. The canonical cost tables correctly reflect different native gate sets across devices (CZ for Torino, iSWAP for Ankaa3, CZ for Emerald). The decomposition comments (e.g., CX → H•CZ•H ⇒ 6×1q + 1×2q) provide clear rationale. Gate coverage differences are intentional and align with each device's architecture.
151-169: LGTM: Safe fallback with appropriate warnings. The unknown-device fallback to Torino is well-communicated via both warnings.warn and logger. The dual notification ensures visibility in different deployment contexts.
172-185: LGTM: Simple and correct delegation. The function correctly wraps get_cost_table and provides the documented (0, 0) fallback for unknown gates.
223-249: LGTM: Correct fidelity approximation with defensive bounds. The multiplicative fidelity model F = (1−p₁)^n₁ · (1−p₂)^n₂ is standard for independent gate errors. Defensive clamping and max() calls ensure numerical robustness.
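For reference, a minimal sketch of the multiplicative model and clamping the comment describes (the function name and parameters are illustrative, not the reviewed code itself):

```python
def approx_fidelity(n_1q: int, n_2q: int, p1: float, p2: float) -> float:
    """F = (1 - p1)^n_1q * (1 - p2)^n_2q for independent gate errors."""
    fidelity = (1.0 - p1) ** n_1q * (1.0 - p2) ** n_2q
    return max(0.0, min(1.0, fidelity))  # defensive clamp to [0, 1]
```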
src/mqt/predictor/rl/cost_model.py (Outdated)

```python
def estimate_counts(
    qc: QuantumCircuit,
    *,
    cost_table: CanonicalCostTable,
) -> tuple[int, int]:
    """Estimate canonical (n_1q, n_2q) counts for a circuit.

    Uses the provided ``cost_table`` where available and a simple, conservative
    fallback otherwise (3*1q for unknown 1q gates, 1*2q + 4*1q for unknown 2q gates).
    """
    n_1q = 0
    n_2q = 0

    for circuit_instr in qc.data:
        name = circuit_instr.operation.name
        qargs = circuit_instr.qubits

        # Ignore non-unitary / timing-only ops for this count
        if name in ("barrier", "delay", "measure"):
            continue

        cost = cost_table.get(name)
        if cost is None:
            # Conservative fallback by arity (only used for gates missing in the table)
            if len(qargs) == 1:
                n_1q += 3
            elif len(qargs) == 2:
                n_2q += 1
                n_1q += 4
        else:
            n_1q += cost[0]
            n_2q += cost[1]
    return n_1q, n_2q
```
🧹 Nitpick | 🔵 Trivial
Consider logging unknown gates.
The conservative fallback for gates not in the cost table is reasonable, but silently falling back (lines 210-216) could hide unexpected gates or make debugging difficult. Consider adding a debug or warning log when the fallback is triggered.
🔎 Suggested addition:

```diff
 cost = cost_table.get(name)
 if cost is None:
+    logger.debug(
+        f"Gate '{name}' not in cost table; using conservative fallback: "
+        f"{'(3, 0)' if len(qargs) == 1 else '(4, 1)'}"
+    )
     # Conservative fallback by arity (only used for gates missing in the table)
     if len(qargs) == 1:
         n_1q += 3
```

🤖 Prompt for AI Agents
In src/mqt/predictor/rl/cost_model.py around lines 188 to 220, the fallback path
for gates missing from the cost_table silently applies conservative counts;
modify this to emit a log message when the fallback is triggered that includes
the gate name and its arity (len(qargs)). Add a module logger
(logging.getLogger(__name__)) if not present, and call logger.debug or
logger.warning inside the if cost is None branch (before incrementing n_1q/n_2q)
so developers can see which unknown gates caused the fallback; keep message
concise and avoid sensitive data.
@Shaobo-Zhou just did a very quick review here in order to not hold this up for too long.
I think this will probably need one more round of iteration until it is ready to be merged. At the moment it feels like it's cutting a few corners too many.
Edit: There were also quite some CodeRabbit comments left, which should ideally be resolved.
```python
    RuntimeError,
    match=re.escape("Device target does not expose the required Target API for approximate reward computation."),
):
    broken_predictor.env._ensure_device_averages_cached()  # noqa: SLF001
```
SLF is enabled in the ruff rules of the project.
```python
if self.reward_function not in {"expected_fidelity", "estimated_success_probability"}:
    if self.reward_function == "critical_depth":
```
this does not handle the two Hellinger fidelity metrics explicitly. Thus, they unnecessarily raise a warning in the code.
src/mqt/predictor/rl/predictorenv.py (Outdated)

```python
if not p1 and not p2:
    msg = "No valid 1q/2q calibration data found in Target; cannot compute approximate reward."
    raise RuntimeError(msg)
```
@Shaobo-Zhou the suggestion from @coderabbitai here seems to make sense to me.
Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@CHANGELOG.md`:
- Around line 55-58: The CHANGELOG.md contains a link definition for [`#449`] that
is not referenced; either remove the unused link definition "[`#449`]:
https://github.com/munich-quantum-toolkit/predictor/pull/449" or add a
corresponding entry in the Unreleased changelog that references [`#449`] so
markdownlint no longer flags it as unused—update the Unreleased section to
include a line like "PR [`#449`] - <short description>" or delete the [`#449`] link
definition entirely.
In `@src/mqt/predictor/rl/cost_model.py`:
- Around line 171-172: The current code uses QuantumCircuit.size() and writes
cost_table[...] = (qc_trans.size(), 0), which incorrectly treats all transpiled
basis gates as 1-qubit; instead iterate qc_trans.data (or
qc_trans.count_ops()/instructions) to compute separate 1-qubit and 2-qubit
counts and store them as the tuple (n_1q, n_2q) in cost_table. Locate the
transpile call and the assignment to cost_table (qc_trans and cost_table
entries) and replace the single size() value with explicit counting by
instruction arity (e.g., count len(instr.qargs) == 1 vs == 2) so downstream
fidelity/ESP uses both 1q and 2q contributions.
- Around line 150-167: The add_gate_to_cost_table function is double-prefixing
gates like "crx"/"cu3"/"ccx" causing invalid method names; fix by detecting when
the supplied gate string already encodes control prefixes and avoid prepending
"c"*controls in that case (e.g., if gate.startswith("c") then use gate as-is or
strip/interpret its inherent control count rather than adding extra "c"s), and
for the multi-control group (entries like "ccx","c3x","c4x") either replace them
with their base gate names plus the correct controls (e.g., "x" with
controls=2/3/4) or list each gate with the accurate controls count so
add_gate_to_cost_table uses the correct method name (refer to
add_gate_to_cost_table, the gate list entries and the logic building gate_name).
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 312-321: calculate_reward currently falls through to the fallback
for the "estimated_hellinger_distance" mode because there is no explicit branch;
add a branch inside the same block (where crit_depth(qc) is handled) that checks
if self.reward_function == "estimated_hellinger_distance" and returns the
computed Hellinger reward and an appropriate tag (e.g.,
(estimated_hellinger_distance(qc), "exact") or call the existing helper used
elsewhere if named differently), so the estimator is used instead of returning
0.0 and logging a warning.
- Around line 484-489: The routing branch currently assumes pm_property_set
contains "final_layout" and directly assigns self.layout.final_layout =
pm_property_set["final_layout"], which can raise KeyError because
pm_property_set may be an empty dict; update the guard to verify the key exists
(e.g., use pm_property_set.get("final_layout") or an explicit 'if "final_layout"
in pm_property_set') before assigning, and only set self.layout.final_layout
when a non-None value is returned; target the branch that checks action_index in
self.actions_routing_indices and uses self.layout and pm_property_set.
- Around line 222-230: The current termination check "if not altered_qc" treats
empty QuantumCircuit objects as falsy and prematurely ends episodes; change the
check to explicitly test for None (i.e., "if altered_qc is None") so only
intentional termination from apply_action triggers episode end, leaving empty
but valid circuits to continue; update the branch that returns
create_feature_dict(self.state), 0.0, True, False, {} to run only when
altered_qc is None and keep using apply_action and create_feature_dict as
before.
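For the cost-table counting item above, a minimal sketch of counting by instruction arity (variable names are taken from the prompt; `gate_name` stands for whatever key is being populated):

```python
n_1q = 0
n_2q = 0
for circuit_instr in qc_trans.data:  # qc_trans: the transpiled circuit
    if circuit_instr.operation.name in ("barrier", "delay", "measure"):
        continue
    if len(circuit_instr.qubits) == 1:
        n_1q += 1
    elif len(circuit_instr.qubits) == 2:
        n_2q += 1
cost_table[gate_name] = (n_1q, n_2q)  # instead of (qc_trans.size(), 0)
```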
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/reward.py (1)

207-212: ⚠️ Potential issue | 🟠 Major

Address incorrect exponential decay calculation when device.dt is unavailable.

The fallback device.dt or 1.0 is semantically incorrect. In Qiskit, instruction.duration is measured in time steps (units of dt), and device.dt converts those steps to seconds. Using 1.0 assumes each time step equals 1 second, but typical device.dt values are ~2.22e-10 seconds (sub-nanosecond). This causes the decay calculation to be off by ~10 orders of magnitude when dt is unavailable.

Qiskit's Target.dt is legitimately None for some backends (not a rare edge case). Either:
- Skip the exponential decay term for backends without calibrated dt
- Log a warning and use a more conservative estimate
- Propagate None and handle it at a higher level
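One way to make the missing-dt case explicit, following the first option above (a sketch; the helper name and the t_effective parameter are assumptions, not the file's actual API):

```python
import math

def decay_factor(duration_steps: float, dt: float | None, t_effective: float) -> float:
    """Exponential decay term; skipped entirely when no calibrated dt exists."""
    if dt is None:
        return 1.0  # no calibrated dt: do not pretend one time step equals one second
    duration_s = duration_steps * dt  # instruction.duration is in units of dt
    return math.exp(-duration_s / t_effective)
```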
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/cost_model.py`:
- Line 119: The logging call in logger.warning currently uses an f-string which
forces formatting even when the warning level may be disabled; update the call
in the cost generation code (reference: the logger.warning line in
src/mqt/predictor/rl/cost_model.py that mentions "No basis for device") to use
lazy %-style or parameterized logging (e.g., logger.warning("No basis for device
'%s', using minimal universal basis for cost generation.", device_id)) so the
string is only formatted when the message will actually be emitted.
- Around line 167-168: Replace the bare "except Exception:" in
src/mqt/predictor/rl/cost_model.py with a specific exception catch for missing
gate methods (e.g., "except AttributeError:"), or a small explicit tuple like
"except (AttributeError, NotImplementedError):" so you only silence expected
missing-method errors while allowing other bugs (TypeError, ValueError, etc.) to
surface; update the except block that currently returns on error to use the
narrowed exception types.
- Around line 105-107: The if-check that tests "if transpile is None or
QuantumCircuit is None" is unreachable because those symbols are imported at
module level; remove that conditional and the accompanying ImportError so the
function no longer checks for None, or if you intended lazy importing, instead
move the Qiskit imports (transpile, QuantumCircuit) into the function and wrap
them in a try/except ImportError that raises the same message—update the code
around the function that references transpile/QuantumCircuit accordingly
(referenced symbols: transpile, QuantumCircuit in cost_model.py).
- Around line 170-171: The cost_table currently stores a flat total with a zero
second element; update generate_cost_table to compute and store the tuple (n_1q,
n_2q) by using the transpiled circuit qc_trans and the existing
estimate_basis_gate_counts utility and TWO_Q_GATES set: call
estimate_basis_gate_counts(qc_trans, basis_gates) (or the appropriate signature)
to get per-basis counts, sum counts for gates in TWO_Q_GATES to produce n_2q and
sum the remaining counts for n_1q, then assign cost_table[gate if controls == 0
else ("c" * controls) + gate] = (n_1q, n_2q) instead of (qc_trans.size(), 0).
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 621-622: The getattr call checking "_dev_avgs_cached" is redundant
because __init__ already sets self._dev_avgs_cached = False; replace
getattr(self, "_dev_avgs_cached", False) with a direct attribute access
(self._dev_avgs_cached) in the code path shown, so the condition becomes if
self._dev_avgs_cached: return; keep the existing __init__ initialization as the
source of truth and remove the defensive default.
- Line 250: Replace the eager f-string logging call logger.info(f"Final reward
({self.prev_reward_kind}): {self.prev_reward}") with lazy logging formatting to
avoid unnecessary string interpolation; use logger.info("Final reward (%s): %s",
self.prev_reward_kind, self.prev_reward) so that self.prev_reward_kind and
self.prev_reward are passed as parameters and only formatted if the log level is
enabled.
- Line 221: Replace the f-string logging call with lazy logging formatting to
avoid eager interpolation: in the code that currently calls
logger.info(f"Applying {self.action_set[action].name}") (within PredictorEnv /
predictorenv.py, using the logger, self.action_set and action variables), call
logger.info with a format string and pass self.action_set[action].name as an
argument (e.g., "Applying %s", self.action_set[action].name) so interpolation is
deferred until the log is emitted.
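The three lazy-logging items above reduce to the same before/after pattern, shown here on the step log as an example:

```python
# Eager: the f-string is built even when INFO logging is disabled
logger.info(f"Applying {self.action_set[action].name}")

# Lazy: interpolation is deferred until the record is actually emitted
logger.info("Applying %s", self.action_set[action].name)
```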
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/utils.py (1)
49-55: ⚠️ Potential issue | 🟡 Minor

Update docstring: RuntimeWarning is no longer raised.

The Windows branch now logs and proceeds without raising; the Raises section is out of sync and should be corrected.

✏️ Suggested docstring update

```diff
-Raises:
-    RuntimeWarning: If the timeout is not supported on the current platform (e.g., Windows).
-    TimeoutExceptionError: If the function call exceeds the timeout limit.
+Raises:
+    TimeoutExceptionError: If the function call exceeds the timeout limit (non-Windows).
```
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/cost_model.py`:
- Around line 33-58: Define the missing TORINO_CANONICAL_COSTS constant and
implement a canonical_cost function and unknown-device fallback used by
get_cost_table: add a TORINO_CANONICAL_COSTS mapping containing the canonical
cost table for the Torino device, implement canonical_cost(device_name: str) ->
CanonicalCostTable to return a known mapping for recognized names and otherwise
log a warning and return TORINO_CANONICAL_COSTS as the default; modify
get_cost_table to call canonical_cost when the device is not in
DEVICE_COST_CACHE/KNOWN_DEVICE_BASES instead of generating a new random table,
and ensure TORINO_CANONICAL_COSTS, canonical_cost, and get_cost_table are
exported/defined in the module so tests can import them.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/approx_reward.py`:
- Around line 62-63: The code repeatedly calls estimate_basis_gate_counts (which
internally transpiles) for the same circuit state; modify calculate_reward
(and/or predictorenv.py's step) to memoize the basis-gate counts keyed by a
canonical representation of the quantum circuit (e.g., serialized QASM or a hash
of qc.data + device target), reuse the cached result for subsequent calls, and
fall back to calling get_basis_gates_from_target and estimate_basis_gate_counts
only on cache misses; ensure the cache key and lookup are used wherever basis =
get_basis_gates_from_target(device) and counts = estimate_basis_gate_counts(qc,
basis_gates=basis) are invoked so duplicate transpilation is avoided.
- Around line 30-40: The loop in estimate_basis_gate_counts uses deprecated
tuple unpacking of qc_t.data; replace it by iterating over CircuitInstruction
objects and access attributes directly (operation/qubits/clbits). Concretely, in
estimate_basis_gate_counts change the loop to iterate like "for circ_instr in
qc_t.data:", extract the gate with "instr = circ_instr.operation" (and use
circ_instr.qubits / circ_instr.clbits if needed), then use instr.name and the
existing BLACKLIST/counts logic.
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 259-261: The code only suppresses delta_reward when transitioning
from "approx"→"exact"; change the logic in the reward computation (referencing
prev_reward_kind, new_kind, and delta_reward in predictorenv.py) to suppress the
delta for any regime change (i.e., if prev_reward_kind != new_kind then set
delta_reward = 0.0) and ensure prev_reward_kind is still updated after this
check so later steps see the new regime; adjust the conditional that currently
checks (self.prev_reward_kind == "approx" and new_kind == "exact") to a generic
regime-change check and keep existing comments about metrics comparability.
- Around line 623-624: Replace the local duplicate gate_blacklist in
_ensure_device_averages_cached with the shared BLACKLIST constant from
approx_reward to avoid divergence; import BLACKLIST from the approx_reward
module and use it when building basis_ops (i.e., change basis_ops = [name for
name in op_names if name not in gate_blacklist] to use BLACKLIST), removing the
local {"measure","reset","delay","barrier"} definition.
- Around line 263-268: The two branches computing reward_val in predictorenv.py
are identical; replace the if/elif block that compares delta_reward with a
single assignment using the existing symbols: set reward_val = self.reward_scale
* delta_reward (preserving delta_reward == 0.0 result) inside the same function
where delta_reward and reward_val are used so you remove the redundant
conditional logic.
- Around line 218-229: Calculate and assign prev_reward and prev_reward_kind
only when the chosen action is non-terminal: move the call to calculate_reward
from before apply_action into the branch that handles non-terminal actions
(i.e., the branch that proceeds after altered_qc is not None and not the
terminate action), so terminal/terminate actions skip the expensive pre-action
evaluation; ensure apply_action, calculate_reward, and the used_actions logic
remain unchanged. Also initialize prev_reward and prev_reward_kind in reset()
(or add a guard before use) so the first step has defined values when no prior
reward exists.
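A sketch of the generic regime-change check suggested above, which also absorbs the redundant-conditional item (attribute names are taken from the comments, not verified against the file):

```python
new_reward, new_kind = self.calculate_reward(qc=self.state)
delta_reward = new_reward - self.prev_reward
if self.prev_reward_kind != new_kind:
    # Rewards from the "approx" and "exact" regimes are not on a comparable
    # scale, so suppress the delta on any regime change, in either direction.
    delta_reward = 0.0
self.prev_reward, self.prev_reward_kind = new_reward, new_kind
reward_val = self.reward_scale * delta_reward  # replaces the duplicated if/elif
```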
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 268-352: Update the calculate_reward signature to narrow the mode
parameter from plain str to a Literal type (Literal["auto", "exact", "approx"])
to catch invalid modes at type-check time; add the appropriate import for
Literal (from typing or typing_extensions depending on project compatibility)
and update any affected type hints/exports so static checkers (mypy/pyright)
pick it up—no logic changes needed inside calculate_reward or its use of
self._is_native_and_mapped, expected_fidelity,
approx_estimated_success_probability, etc.
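A sketch of the proposed narrowing (the class body is trimmed to the signature, and the alias name is illustrative):

```python
from typing import Literal

RewardMode = Literal["auto", "exact", "approx"]

class PredictorEnv:  # sketch only; the real class has many more members
    def calculate_reward(self, qc=None, mode: RewardMode = "auto") -> tuple[float, str]:
        """Invalid mode strings now fail at type-check time rather than at runtime."""
        ...
```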
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)
104-104: 🧹 Nitpick | 🔵 Trivial

Use lazy logging formatting for consistency.

String concatenation in the log call is inconsistent with the %s-style formatting used elsewhere in this file (e.g., lines 217, 271). Ruff rule G003 discourages eager string building in logging calls.

♻️ Suggested fix

```diff
-logger.info("Init env: " + reward_function)
+logger.info("Init env: %s", reward_function)
```
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 244-245: The call to calculate_reward at the start of each step is
redundant because self.prev_reward and self.prev_reward_kind already hold the
post-action reward from the previous step; remove the redundant invocation in
the step logic (the line calling calculate_reward that sets
self.prev_reward/self.prev_reward_kind) and instead initialize/seed
self.prev_reward and self.prev_reward_kind during reset() by calling
calculate_reward() once there, ensuring the end-of-step code that updates
prev_reward/prev_reward_kind (lines around where post-action reward is set)
continues to overwrite for the next step; adjust any references in step() and
reset() accordingly to reuse the carried-over prev_reward/prev_reward_kind
without recomputing on each new step.
- Around line 726-735: _determine_valid_actions_for_state and
_is_native_and_mapped duplicate expensive GatesInBasis/CheckMap work; cache the
pass instances and share computed results to avoid repeated checks. Create
cached pass instances on self (e.g., self._gates_in_basis and self._check_map)
initialized once and reused in both determine_valid_actions_for_state and
_is_native_and_mapped, and have determine_valid_actions_for_state store the last
computed flags (e.g., self._last_only_nat_gates and self._last_mapped) after
running the checks so _is_native_and_mapped can return those cached booleans if
available instead of re-running the passes; fall back to running the cached pass
instances if no cached results exist and update the cache.
- Around line 218-242: The estimated_hellinger_distance branch duplicates the
apply→decompose→update-state→check-valid-actions logic found later; extract that
shared sequence into a helper (e.g. _apply_and_update or
_apply_action_and_update_state) which calls apply_action, performs the gate-type
decompose loop, sets self.state and self.state._layout, increments
self.num_steps, updates self.valid_actions via determine_valid_actions_for_state
and raises RuntimeError if empty; then replace the duplicated block in the
reward_function == "estimated_hellinger_distance" branch to call the helper and
only keep the branch-specific reward logic (calculate_reward(mode="exact") vs
no_effect_penalty) before returning create_feature_dict(self.state), reward,
done, False, {}.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/compilation/test_predictor_rl.py`:
- Around line 149-204: Tests test_approx_reward_ef and test_approx_reward_esp
duplicate setup and assertions; replace them with a single parametrized test
using pytest.mark.parametrize over figure_of_merit values ("expected_fidelity"
and "estimated_success_probability") that creates Predictor(figure_of_merit=...)
and monkeypatches predictor.env._is_native_and_mapped the same way, runs val,
kind = predictor.env.calculate_reward(...), asserts common conditions (kind ==
"approx", 0<=val<=1, predictor.env._dev_avgs_cached, isinstance checks and
len(predictor.env._err_by_gate)>0, len(predictor.env._dur_by_gate)>0), and then
include the ESP-specific assertions (assert len(..._dur_by_gate)>0 and assert
predictor.env._tbar is None or predictor.env._tbar > 0.0) only when the
parameter == "estimated_success_probability"; keep references to Predictor,
predictor.env.calculate_reward, _is_native_and_mapped, _err_by_gate,
_dur_by_gate, and _tbar to locate the code.
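A sketch of the merged, parametrized test (it assumes the module's existing Predictor import and circuit setup, and already applies the later review suggestion to force mode="approx" rather than monkeypatching):

```python
import pytest

@pytest.mark.parametrize("fom", ["expected_fidelity", "estimated_success_probability"])
def test_approx_reward(fom: str) -> None:
    predictor = Predictor(figure_of_merit=fom)
    # qc is assumed to come from the test module's usual circuit setup
    val, kind = predictor.env.calculate_reward(qc=qc, mode="approx")
    assert kind == "approx"
    assert 0.0 <= val <= 1.0
    assert predictor.env._dev_avgs_cached
    assert len(predictor.env._err_by_gate) > 0
    if fom == "estimated_success_probability":
        assert len(predictor.env._dur_by_gate) > 0
        assert predictor.env._tbar is None or predictor.env._tbar > 0.0
```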
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)
562-576: 🧹 Nitpick | 🔵 Trivial

determine_valid_actions_for_state creates fresh pass instances while _native_and_mapped reuses cached ones.

Both methods perform the same GatesInBasis + CheckMap checks. determine_valid_actions_for_state still instantiates new objects each call (lines 564, 574), while _native_and_mapped uses the cached instances from __init__. Consider reusing the cached instances here too, or calling _native_and_mapped internally.
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 705-717: The cache in _native_and_mapped uses id(qc) which can be
recycled and cause stale hits; either remove the cache entirely and always run
_gates_in_basis_check and _check_map, or replace the id-based key with a safe
version counter: add a _state_version integer on the class (initialize in
__init__), increment _state_version whenever self.state is reassigned, store and
compare _last_state_version (instead of _last_qc_id) alongside
_last_native_mapped in _native_and_mapped, and update that version when you set
self.state so the cached (only_native, mapped) is only reused for the current
state.
In `@tests/compilation/test_predictor_rl.py`:
- Around line 175-178: The condition is checking the imported type alias
figure_of_merit instead of the test parameter fom, so the ESP-specific
assertions (predictor.env._dur_by_gate and predictor.env._tbar) never run;
update the conditional to use the test parameter (if fom ==
"estimated_success_probability") so the assertions execute for the ESP reward
path and validate duration/tbar caching for predictor.env._dur_by_gate and
predictor.env._tbar.
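A sketch of the version-counter cache described in the predictorenv item above (`_run_checks` is a hypothetical stand-in for running the cached GatesInBasis/CheckMap passes):

```python
class PredictorEnv:  # sketch of the caching mechanics only
    def __init__(self) -> None:
        self._state_version = 0          # bumped whenever self.state is reassigned
        self._last_state_version = -1    # version the cached result belongs to
        self._last_native_mapped: tuple[bool, bool] | None = None

    def _set_state(self, qc) -> None:
        self.state = qc
        self._state_version += 1         # invalidates the cached check result

    def _native_and_mapped(self) -> tuple[bool, bool]:
        if self._last_native_mapped is not None and self._last_state_version == self._state_version:
            return self._last_native_mapped
        result = self._run_checks()      # hypothetical: runs the cached passes
        self._last_state_version = self._state_version
        self._last_native_mapped = result
        return result
```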
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 636-639: The suppress(KeyError, AttributeError, TypeError) around
the basis_ops loop is too broad and can hide legitimate TypeError bugs when
calling target.operation_from_name or reading op.num_qubits; change the
exception list to (KeyError, AttributeError) (i.e., remove TypeError) and
optionally add a debug/log line inside the except path that records the gate
name and the caught exception so skipped gates are visible; locate the loop
using basis_ops, target.operation_from_name, arity_by_name and op.num_qubits to
make the change.
- Around line 238-243: The code path for reward_function ==
"estimated_hellinger_distance" only gives the episode terminal step a real
reward and assigns no_effect_penalty for every non-terminal step, which leaves
the agent without shaping signal; update the branch in predictorenv.py (the
block checking self.reward_function == "estimated_hellinger_distance") to either
implement a non-terminal shaping heuristic or, if this is intentional, add a
concise explanatory comment above the block referencing that choice and the
involved symbols (self.reward_function, "estimated_hellinger_distance",
self.no_effect_penalty, self.calculate_reward, and self.action_terminate_index)
so readers know why calculate_reward is only called at done and why non-terminal
steps receive no signal.
In `@tests/compilation/test_predictor_rl.py`:
- Around line 165-170: The monkeypatch targets a non-existent method
`_is_native_and_mapped` and has no effect; update the test to force the
approximate path by calling calculate_reward with mode="approx" instead of
relying on device mapping heuristics. Locate the call to
predictor.env.calculate_reward(qc=qc, mode="auto") in the test and change the
mode argument to "approx" (remove the monkeypatch line that sets
`_is_native_and_mapped`), ensuring the test exercises the approximate reward
branch in calculate_reward rather than depending on GatesInBasis/CheckMap
behavior.
```python
done = action == self.action_terminate_index

if self.reward_function == "estimated_hellinger_distance":
    reward_val = self.calculate_reward(mode="exact")[0] if done else self.no_effect_penalty
    self.state._layout = self.layout  # noqa: SLF001
    return create_feature_dict(self.state), reward_val, done, False, {}
```
🧹 Nitpick | 🔵 Trivial
Hellinger path provides no shaping signal — non-terminal steps always get a flat penalty.
This is a design choice, but it means the RL agent receives zero useful gradient signal during an episode for estimated_hellinger_distance — reward information only comes at termination. If this is intentional (e.g., because there's no approximate Hellinger model), a brief comment explaining the rationale would help future readers.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 242-242: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
🤖 Prompt for AI Agents
In `@src/mqt/predictor/rl/predictorenv.py` around lines 238 - 243, The code path
for reward_function == "estimated_hellinger_distance" only gives the episode
terminal step a real reward and assigns no_effect_penalty for every non-terminal
step, which leaves the agent without shaping signal; update the branch in
predictorenv.py (the block checking self.reward_function ==
"estimated_hellinger_distance") to either implement a non-terminal shaping
heuristic or, if this is intentional, add a concise explanatory comment above
the block referencing that choice and the involved symbols
(self.reward_function, "estimated_hellinger_distance", self.no_effect_penalty,
self.calculate_reward, and self.action_terminate_index) so readers know why
calculate_reward is only called at done and why non-terminal steps receive no
signal.
```python
for name in basis_ops:
    with suppress(KeyError, AttributeError, TypeError):
        op = target.operation_from_name(name)
        arity_by_name[name] = int(op.num_qubits)
```
🧹 Nitpick | 🔵 Trivial
suppress(KeyError, AttributeError, TypeError) silently swallows errors during arity detection.
While this is cleaner than a bare except Exception, TypeError is quite broad. If operation_from_name returns an unexpected type or num_qubits is somehow non-integer, the error would be silently swallowed and the gate would be skipped. Consider narrowing to (KeyError, AttributeError) or adding debug logging.
🤖 Prompt for AI Agents
In `@src/mqt/predictor/rl/predictorenv.py` around lines 636 - 639, The
suppress(KeyError, AttributeError, TypeError) around the basis_ops loop is too
broad and can hide legitimate TypeError bugs when calling
target.operation_from_name or reading op.num_qubits; change the exception list
to (KeyError, AttributeError) (i.e., remove TypeError) and optionally add a
debug/log line inside the except path that records the gate name and the caught
exception so skipped gates are visible; locate the loop using basis_ops,
target.operation_from_name, arity_by_name and op.num_qubits to make the change.
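Concretely, the narrowed handler could look like this (a sketch reusing the names from the hunk above):

```python
for name in basis_ops:
    try:
        op = target.operation_from_name(name)
        arity_by_name[name] = int(op.num_qubits)
    except (KeyError, AttributeError) as exc:  # TypeError removed so real bugs surface
        logger.debug("Skipping gate '%s' during arity detection: %s", name, exc)
```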
Description
This PR introduces a shaped, step-wise reward signal for the RL-based compiler.
For the figures of merit expected_fidelity and estimated_success_probability, the reward is computed in two regimes:
Exact regime (native + mapped circuits)
If the circuit consists only of device-native gates and respects the device’s coupling map, the step reward is based on the change in the exact calibration-aware metric between successive steps.
Approximate regime (non-native / unmapped circuits)
If the circuit still contains non-native gates or violates the device topology, a conservative canonical cost model is used to approximate the expected fidelity and ESP. The intermediate reward is then derived from the change in this approximate metric.
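In condensed form, the step-wise reward can be sketched as follows (exact_metric and approx_metric are placeholders for the calibration-aware and canonical-cost computations; attribute names are illustrative):

```python
def step_reward(self, qc) -> float:
    if self._native_and_mapped(qc):
        value, kind = exact_metric(qc), "exact"    # calibration-aware EF/ESP
    else:
        value, kind = approx_metric(qc), "approx"  # conservative canonical cost model
    delta = value - self.prev_reward
    if kind != self.prev_reward_kind:
        delta = 0.0  # values from different regimes are not comparable
    self.prev_reward, self.prev_reward_kind = value, kind
    return self.reward_scale * delta
```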
Checklist: