Add stepwise intermediate reward for RL #526
Shaobo-Zhou wants to merge 110 commits into munich-quantum-toolkit:main from
Conversation
Update action space and feature space
Update actions
Update action space
Fix: resolve pre-commit issues and add missing annotations
Remove example_test.py
Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>
Fix Windows runtime warning issue
Actionable comments posted: 2
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
src/mqt/predictor/rl/cost_model.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-09T13:20:11.483Z
Learnt from: DRovara
Repo: munich-quantum-toolkit/core PR: 1108
File: mlir/test/Dialect/MQTOpt/Transforms/lift-measurements.mlir:269-288
Timestamp: 2025-10-09T13:20:11.483Z
Learning: In the MQT MLIR dialect, the `rz` gate should not be included in the `DIAGONAL_GATES` set for the `ReplaceBasisStateControlsWithIfPattern` because its operator matrix does not have the required shape | 1 0 | / | 0 x | for the targets-as-controls optimization. It is only included in `LiftMeasurementsAboveGatesPatterns` where the matrix structure requirement differs.
Applied to files:
src/mqt/predictor/rl/cost_model.py
📚 Learning: 2025-11-01T15:57:31.153Z
Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1283
File: src/qir/runtime/QIR.cpp:196-201
Timestamp: 2025-11-01T15:57:31.153Z
Learning: In the QIR runtime (src/qir/runtime/QIR.cpp), the PRX gate (__quantum__qis__prx__body) is an alias for the R gate (Phased X-Rotation) and should call runtime.apply<qc::R>(theta, phi, qubit), not runtime.apply<qc::RX>() which is a single-parameter rotation gate.
Applied to files:
src/mqt/predictor/rl/cost_model.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
- GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
- GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
🔇 Additional comments (5)
src/mqt/predictor/rl/cost_model.py (5)
1-36: LGTM: Clean module structure and imports. The module-level organization is solid: clear documentation, appropriate use of TYPE_CHECKING for forward references, and a logger for diagnostics. The CanonicalCostTable type alias using Mapping correctly conveys immutability.
42-148: LGTM: Device-specific cost tables are well-documented. The canonical cost tables correctly reflect different native gate sets across devices (CZ for Torino, iSWAP for Ankaa3, CZ for Emerald). The decomposition comments (e.g., CX → H•CZ•H ⇒ 6×1q + 1×2q) provide clear rationale. Gate coverage differences are intentional and align with each device's architecture.
151-169: LGTM: Safe fallback with appropriate warnings. The unknown-device fallback to Torino is well-communicated via both warnings.warn and logger. The dual notification ensures visibility in different deployment contexts.
172-185: LGTM: Simple and correct delegation. The function correctly wraps get_cost_table and provides the documented (0, 0) fallback for unknown gates.
223-249: LGTM: Correct fidelity approximation with defensive bounds. The multiplicative fidelity model F = (1−p₁)^n₁ · (1−p₂)^n₂ is standard for independent gate errors. Defensive clamping and max() calls ensure numerical robustness.
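For reference, a minimal sketch of the multiplicative model and clamping the comment describes (the function name and parameters are illustrative, not the reviewed code itself):

```python
def approx_fidelity(n_1q: int, n_2q: int, p1: float, p2: float) -> float:
    """F = (1 - p1)^n_1q * (1 - p2)^n_2q for independent gate errors."""
    fidelity = (1.0 - p1) ** n_1q * (1.0 - p2) ** n_2q
    return max(0.0, min(1.0, fidelity))  # defensive clamp to [0, 1]
```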
src/mqt/predictor/rl/cost_model.py (Outdated)

```python
def estimate_counts(
    qc: QuantumCircuit,
    *,
    cost_table: CanonicalCostTable,
) -> tuple[int, int]:
    """Estimate canonical (n_1q, n_2q) counts for a circuit.

    Uses the provided ``cost_table`` where available and a simple, conservative
    fallback otherwise (3*1q for unknown 1q gates, 1*2q + 4*1q for unknown 2q gates).
    """
    n_1q = 0
    n_2q = 0

    for circuit_instr in qc.data:
        name = circuit_instr.operation.name
        qargs = circuit_instr.qubits

        # Ignore non-unitary / timing-only ops for this count
        if name in ("barrier", "delay", "measure"):
            continue

        cost = cost_table.get(name)
        if cost is None:
            # Conservative fallback by arity (only used for gates missing in the table)
            if len(qargs) == 1:
                n_1q += 3
            elif len(qargs) == 2:
                n_2q += 1
                n_1q += 4
        else:
            n_1q += cost[0]
            n_2q += cost[1]
    return n_1q, n_2q
```
🧹 Nitpick | 🔵 Trivial
Consider logging unknown gates.
The conservative fallback for gates not in the cost table is reasonable, but silently falling back (lines 210-216) could hide unexpected gates or make debugging difficult. Consider adding a debug or warning log when the fallback is triggered.
🔎 Suggested addition:

```diff
 cost = cost_table.get(name)
 if cost is None:
+    logger.debug(
+        f"Gate '{name}' not in cost table; using conservative fallback: "
+        f"{'(3, 0)' if len(qargs) == 1 else '(4, 1)'}"
+    )
     # Conservative fallback by arity (only used for gates missing in the table)
     if len(qargs) == 1:
         n_1q += 3
```

🤖 Prompt for AI Agents
In src/mqt/predictor/rl/cost_model.py around lines 188 to 220, the fallback path
for gates missing from the cost_table silently applies conservative counts;
modify this to emit a log message when the fallback is triggered that includes
the gate name and its arity (len(qargs)). Add a module logger
(logging.getLogger(__name__)) if not present, and call logger.debug or
logger.warning inside the if cost is None branch (before incrementing n_1q/n_2q)
so developers can see which unknown gates caused the fallback; keep message
concise and avoid sensitive data.
@Shaobo-Zhou just did a very quick review here in order to not hold this up for too long.
I think this will probably need one more round of iteration until it is ready to be merged. At the moment it feels like it's cutting a few corners too many.
Edit: There were also quite some CodeRabbit comments left, which should ideally be resolved.
```python
    RuntimeError,
    match=re.escape("Device target does not expose the required Target API for approximate reward computation."),
):
    broken_predictor.env._ensure_device_averages_cached()  # noqa: SLF001
```
SLF is enabled in the ruff rules of the project.
```python
if self.reward_function not in {"expected_fidelity", "estimated_success_probability"}:
    if self.reward_function == "critical_depth":
```
this does not handle the two Hellinger fidelity metrics explicitly. Thus, they unnecessarily raise a warning in the code.
src/mqt/predictor/rl/predictorenv.py (Outdated)

```python
if not p1 and not p2:
    msg = "No valid 1q/2q calibration data found in Target; cannot compute approximate reward."
    raise RuntimeError(msg)
```
@Shaobo-Zhou the suggestion from @coderabbitai here seems to make sense to me.
Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@CHANGELOG.md`:
- Around line 55-58: The CHANGELOG.md contains a link definition for [`#449`] that
is not referenced; either remove the unused link definition "[`#449`]:
https://github.com/munich-quantum-toolkit/predictor/pull/449" or add a
corresponding entry in the Unreleased changelog that references [`#449`] so
markdownlint no longer flags it as unused—update the Unreleased section to
include a line like "PR [`#449`] - <short description>" or delete the [`#449`] link
definition entirely.
In `@src/mqt/predictor/rl/cost_model.py`:
- Around line 171-172: The current code uses QuantumCircuit.size() and writes
cost_table[...] = (qc_trans.size(), 0), which incorrectly treats all transpiled
basis gates as 1-qubit; instead iterate qc_trans.data (or
qc_trans.count_ops()/instructions) to compute separate 1-qubit and 2-qubit
counts and store them as the tuple (n_1q, n_2q) in cost_table. Locate the
transpile call and the assignment to cost_table (qc_trans and cost_table
entries) and replace the single size() value with explicit counting by
instruction arity (e.g., count len(instr.qargs) == 1 vs == 2) so downstream
fidelity/ESP uses both 1q and 2q contributions.
- Around line 150-167: The add_gate_to_cost_table function is double-prefixing
gates like "crx"/"cu3"/"ccx" causing invalid method names; fix by detecting when
the supplied gate string already encodes control prefixes and avoid prepending
"c"*controls in that case (e.g., if gate.startswith("c") then use gate as-is or
strip/interpret its inherent control count rather than adding extra "c"s), and
for the multi-control group (entries like "ccx","c3x","c4x") either replace them
with their base gate names plus the correct controls (e.g., "x" with
controls=2/3/4) or list each gate with the accurate controls count so
add_gate_to_cost_table uses the correct method name (refer to
add_gate_to_cost_table, the gate list entries and the logic building gate_name).
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 312-321: calculate_reward currently falls through to the fallback
for the "estimated_hellinger_distance" mode because there is no explicit branch;
add a branch inside the same block (where crit_depth(qc) is handled) that checks
if self.reward_function == "estimated_hellinger_distance" and returns the
computed Hellinger reward and an appropriate tag (e.g.,
(estimated_hellinger_distance(qc), "exact") or call the existing helper used
elsewhere if named differently), so the estimator is used instead of returning
0.0 and logging a warning.
- Around line 484-489: The routing branch currently assumes pm_property_set
contains "final_layout" and directly assigns self.layout.final_layout =
pm_property_set["final_layout"], which can raise KeyError because
pm_property_set may be an empty dict; update the guard to verify the key exists
(e.g., use pm_property_set.get("final_layout") or an explicit 'if "final_layout"
in pm_property_set') before assigning, and only set self.layout.final_layout
when a non-None value is returned; target the branch that checks action_index in
self.actions_routing_indices and uses self.layout and pm_property_set.
- Around line 222-230: The current termination check "if not altered_qc" treats
empty QuantumCircuit objects as falsy and prematurely ends episodes; change the
check to explicitly test for None (i.e., "if altered_qc is None") so only
intentional termination from apply_action triggers episode end, leaving empty
but valid circuits to continue; update the branch that returns
create_feature_dict(self.state), 0.0, True, False, {} to run only when
altered_qc is None and keep using apply_action and create_feature_dict as
before.
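For the cost-table counting item above, a minimal sketch of counting by instruction arity (variable names are taken from the prompt; `gate_name` stands for whatever key is being populated):

```python
n_1q = 0
n_2q = 0
for circuit_instr in qc_trans.data:  # qc_trans: the transpiled circuit
    if circuit_instr.operation.name in ("barrier", "delay", "measure"):
        continue
    if len(circuit_instr.qubits) == 1:
        n_1q += 1
    elif len(circuit_instr.qubits) == 2:
        n_2q += 1
cost_table[gate_name] = (n_1q, n_2q)  # instead of (qc_trans.size(), 0)
```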
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/reward.py (1)

207-212: ⚠️ Potential issue | 🟠 Major

Address incorrect exponential decay calculation when device.dt is unavailable.

The fallback device.dt or 1.0 is semantically incorrect. In Qiskit, instruction.duration is measured in time steps (units of dt), and device.dt converts those steps to seconds. Using 1.0 assumes each time step equals 1 second, but typical device.dt values are ~2.22e-10 seconds (sub-nanosecond). This causes the decay calculation to be off by ~10 orders of magnitude when dt is unavailable.

Qiskit's Target.dt is legitimately None for some backends (not a rare edge case). Either:
- Skip the exponential decay term for backends without calibrated dt
- Log a warning and use a more conservative estimate
- Propagate None and handle it at a higher level
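One way to make the missing-dt case explicit, following the first option above (a sketch; the helper name and the t_effective parameter are assumptions, not the file's actual API):

```python
import math

def decay_factor(duration_steps: float, dt: float | None, t_effective: float) -> float:
    """Exponential decay term; skipped entirely when no calibrated dt exists."""
    if dt is None:
        return 1.0  # no calibrated dt: do not pretend one time step equals one second
    duration_s = duration_steps * dt  # instruction.duration is in units of dt
    return math.exp(-duration_s / t_effective)
```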
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/cost_model.py`:
- Line 119: The logging call in logger.warning currently uses an f-string which
forces formatting even when the warning level may be disabled; update the call
in the cost generation code (reference: the logger.warning line in
src/mqt/predictor/rl/cost_model.py that mentions "No basis for device") to use
lazy %-style or parameterized logging (e.g., logger.warning("No basis for device
'%s', using minimal universal basis for cost generation.", device_id)) so the
string is only formatted when the message will actually be emitted.
- Around line 167-168: Replace the bare "except Exception:" in
src/mqt/predictor/rl/cost_model.py with a specific exception catch for missing
gate methods (e.g., "except AttributeError:"), or a small explicit tuple like
"except (AttributeError, NotImplementedError):" so you only silence expected
missing-method errors while allowing other bugs (TypeError, ValueError, etc.) to
surface; update the except block that currently returns on error to use the
narrowed exception types.
- Around line 105-107: The if-check that tests "if transpile is None or
QuantumCircuit is None" is unreachable because those symbols are imported at
module level; remove that conditional and the accompanying ImportError so the
function no longer checks for None, or if you intended lazy importing, instead
move the Qiskit imports (transpile, QuantumCircuit) into the function and wrap
them in a try/except ImportError that raises the same message—update the code
around the function that references transpile/QuantumCircuit accordingly
(referenced symbols: transpile, QuantumCircuit in cost_model.py).
- Around line 170-171: The cost_table currently stores a flat total with a zero
second element; update generate_cost_table to compute and store the tuple (n_1q,
n_2q) by using the transpiled circuit qc_trans and the existing
estimate_basis_gate_counts utility and TWO_Q_GATES set: call
estimate_basis_gate_counts(qc_trans, basis_gates) (or the appropriate signature)
to get per-basis counts, sum counts for gates in TWO_Q_GATES to produce n_2q and
sum the remaining counts for n_1q, then assign cost_table[gate if controls == 0
else ("c" * controls) + gate] = (n_1q, n_2q) instead of (qc_trans.size(), 0).
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 621-622: The getattr call checking "_dev_avgs_cached" is redundant
because __init__ already sets self._dev_avgs_cached = False; replace
getattr(self, "_dev_avgs_cached", False) with a direct attribute access
(self._dev_avgs_cached) in the code path shown, so the condition becomes if
self._dev_avgs_cached: return; keep the existing __init__ initialization as the
source of truth and remove the defensive default.
- Line 250: Replace the eager f-string logging call logger.info(f"Final reward
({self.prev_reward_kind}): {self.prev_reward}") with lazy logging formatting to
avoid unnecessary string interpolation; use logger.info("Final reward (%s): %s",
self.prev_reward_kind, self.prev_reward) so that self.prev_reward_kind and
self.prev_reward are passed as parameters and only formatted if the log level is
enabled.
- Line 221: Replace the f-string logging call with lazy logging formatting to
avoid eager interpolation: in the code that currently calls
logger.info(f"Applying {self.action_set[action].name}") (within PredictorEnv /
predictorenv.py, using the logger, self.action_set and action variables), call
logger.info with a format string and pass self.action_set[action].name as an
argument (e.g., "Applying %s", self.action_set[action].name) so interpolation is
deferred until the log is emitted.
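The three lazy-logging items above reduce to the same before/after pattern, shown here on the step log as an example:

```python
# Eager: the f-string is built even when INFO logging is disabled
logger.info(f"Applying {self.action_set[action].name}")

# Lazy: interpolation is deferred until the record is actually emitted
logger.info("Applying %s", self.action_set[action].name)
```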
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/utils.py (1)
49-55: ⚠️ Potential issue | 🟡 Minor

Update docstring: RuntimeWarning is no longer raised.

The Windows branch now logs and proceeds without raising; the Raises section is out of sync and should be corrected.

✏️ Suggested docstring update

```diff
-Raises:
-    RuntimeWarning: If the timeout is not supported on the current platform (e.g., Windows).
-    TimeoutExceptionError: If the function call exceeds the timeout limit.
+Raises:
+    TimeoutExceptionError: If the function call exceeds the timeout limit (non-Windows).
```
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/cost_model.py`:
- Around line 33-58: Define the missing TORINO_CANONICAL_COSTS constant and
implement a canonical_cost function and unknown-device fallback used by
get_cost_table: add a TORINO_CANONICAL_COSTS mapping containing the canonical
cost table for the Torino device, implement canonical_cost(device_name: str) ->
CanonicalCostTable to return a known mapping for recognized names and otherwise
log a warning and return TORINO_CANONICAL_COSTS as the default; modify
get_cost_table to call canonical_cost when the device is not in
DEVICE_COST_CACHE/KNOWN_DEVICE_BASES instead of generating a new random table,
and ensure TORINO_CANONICAL_COSTS, canonical_cost, and get_cost_table are
exported/defined in the module so tests can import them.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/approx_reward.py`:
- Around line 62-63: The code repeatedly calls estimate_basis_gate_counts (which
internally transpiles) for the same circuit state; modify calculate_reward
(and/or predictorenv.py's step) to memoize the basis-gate counts keyed by a
canonical representation of the quantum circuit (e.g., serialized QASM or a hash
of qc.data + device target), reuse the cached result for subsequent calls, and
fall back to calling get_basis_gates_from_target and estimate_basis_gate_counts
only on cache misses; ensure the cache key and lookup are used wherever basis =
get_basis_gates_from_target(device) and counts = estimate_basis_gate_counts(qc,
basis_gates=basis) are invoked so duplicate transpilation is avoided.
- Around line 30-40: The loop in estimate_basis_gate_counts uses deprecated
tuple unpacking of qc_t.data; replace it by iterating over CircuitInstruction
objects and access attributes directly (operation/qubits/clbits). Concretely, in
estimate_basis_gate_counts change the loop to iterate like "for circ_instr in
qc_t.data:", extract the gate with "instr = circ_instr.operation" (and use
circ_instr.qubits / circ_instr.clbits if needed), then use instr.name and the
existing BLACKLIST/counts logic.
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 259-261: The code only suppresses delta_reward when transitioning
from "approx"→"exact"; change the logic in the reward computation (referencing
prev_reward_kind, new_kind, and delta_reward in predictorenv.py) to suppress the
delta for any regime change (i.e., if prev_reward_kind != new_kind then set
delta_reward = 0.0) and ensure prev_reward_kind is still updated after this
check so later steps see the new regime; adjust the conditional that currently
checks (self.prev_reward_kind == "approx" and new_kind == "exact") to a generic
regime-change check and keep existing comments about metrics comparability.
- Around line 623-624: Replace the local duplicate gate_blacklist in
_ensure_device_averages_cached with the shared BLACKLIST constant from
approx_reward to avoid divergence; import BLACKLIST from the approx_reward
module and use it when building basis_ops (i.e., change basis_ops = [name for
name in op_names if name not in gate_blacklist] to use BLACKLIST), removing the
local {"measure","reset","delay","barrier"} definition.
- Around line 263-268: The two branches computing reward_val in predictorenv.py
are identical; replace the if/elif block that compares delta_reward with a
single assignment using the existing symbols: set reward_val = self.reward_scale
* delta_reward (preserving delta_reward == 0.0 result) inside the same function
where delta_reward and reward_val are used so you remove the redundant
conditional logic.
- Around line 218-229: Calculate and assign prev_reward and prev_reward_kind
only when the chosen action is non-terminal: move the call to calculate_reward
from before apply_action into the branch that handles non-terminal actions
(i.e., the branch that proceeds after altered_qc is not None and not the
terminate action), so terminal/terminate actions skip the expensive pre-action
evaluation; ensure apply_action, calculate_reward, and the used_actions logic
remain unchanged. Also initialize prev_reward and prev_reward_kind in reset()
(or add a guard before use) so the first step has defined values when no prior
reward exists.
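A sketch of the generic regime-change check suggested above, which also absorbs the redundant-conditional item (attribute names are taken from the comments, not verified against the file):

```python
new_reward, new_kind = self.calculate_reward(qc=self.state)
delta_reward = new_reward - self.prev_reward
if self.prev_reward_kind != new_kind:
    # Rewards from the "approx" and "exact" regimes are not on a comparable
    # scale, so suppress the delta on any regime change, in either direction.
    delta_reward = 0.0
self.prev_reward, self.prev_reward_kind = new_reward, new_kind
reward_val = self.reward_scale * delta_reward  # replaces the duplicated if/elif
```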
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 268-352: Update the calculate_reward signature to narrow the mode
parameter from plain str to a Literal type (Literal["auto", "exact", "approx"])
to catch invalid modes at type-check time; add the appropriate import for
Literal (from typing or typing_extensions depending on project compatibility)
and update any affected type hints/exports so static checkers (mypy/pyright)
pick it up—no logic changes needed inside calculate_reward or its use of
self._is_native_and_mapped, expected_fidelity,
approx_estimated_success_probability, etc.
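A sketch of the proposed narrowing (the class body is trimmed to the signature, and the alias name is illustrative):

```python
from typing import Literal

RewardMode = Literal["auto", "exact", "approx"]

class PredictorEnv:  # sketch only; the real class has many more members
    def calculate_reward(self, qc=None, mode: RewardMode = "auto") -> tuple[float, str]:
        """Invalid mode strings now fail at type-check time rather than at runtime."""
        ...
```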
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)
104-104: 🧹 Nitpick | 🔵 Trivial

Use lazy logging formatting for consistency.

String concatenation in the log call is inconsistent with the %s-style formatting used elsewhere in this file (e.g., lines 217, 271). Ruff rule G003 discourages eager string building in logging calls.

♻️ Suggested fix

```diff
-logger.info("Init env: " + reward_function)
+logger.info("Init env: %s", reward_function)
```
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 244-245: The call to calculate_reward at the start of each step is
redundant because self.prev_reward and self.prev_reward_kind already hold the
post-action reward from the previous step; remove the redundant invocation in
the step logic (the line calling calculate_reward that sets
self.prev_reward/self.prev_reward_kind) and instead initialize/seed
self.prev_reward and self.prev_reward_kind during reset() by calling
calculate_reward() once there, ensuring the end-of-step code that updates
prev_reward/prev_reward_kind (lines around where post-action reward is set)
continues to overwrite for the next step; adjust any references in step() and
reset() accordingly to reuse the carried-over prev_reward/prev_reward_kind
without recomputing on each new step.
- Around line 726-735: _determine_valid_actions_for_state and
_is_native_and_mapped duplicate expensive GatesInBasis/CheckMap work; cache the
pass instances and share computed results to avoid repeated checks. Create
cached pass instances on self (e.g., self._gates_in_basis and self._check_map)
initialized once and reused in both determine_valid_actions_for_state and
_is_native_and_mapped, and have determine_valid_actions_for_state store the last
computed flags (e.g., self._last_only_nat_gates and self._last_mapped) after
running the checks so _is_native_and_mapped can return those cached booleans if
available instead of re-running the passes; fall back to running the cached pass
instances if no cached results exist and update the cache.
- Around line 218-242: The estimated_hellinger_distance branch duplicates the
apply→decompose→update-state→check-valid-actions logic found later; extract that
shared sequence into a helper (e.g. _apply_and_update or
_apply_action_and_update_state) which calls apply_action, performs the gate-type
decompose loop, sets self.state and self.state._layout, increments
self.num_steps, updates self.valid_actions via determine_valid_actions_for_state
and raises RuntimeError if empty; then replace the duplicated block in the
reward_function == "estimated_hellinger_distance" branch to call the helper and
only keep the branch-specific reward logic (calculate_reward(mode="exact") vs
no_effect_penalty) before returning create_feature_dict(self.state), reward,
done, False, {}.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/compilation/test_predictor_rl.py`:
- Around line 149-204: Tests test_approx_reward_ef and test_approx_reward_esp
duplicate setup and assertions; replace them with a single parametrized test
using pytest.mark.parametrize over figure_of_merit values ("expected_fidelity"
and "estimated_success_probability") that creates Predictor(figure_of_merit=...)
and monkeypatches predictor.env._is_native_and_mapped the same way, runs val,
kind = predictor.env.calculate_reward(...), asserts common conditions (kind ==
"approx", 0<=val<=1, predictor.env._dev_avgs_cached, isinstance checks and
len(predictor.env._err_by_gate)>0, len(predictor.env._dur_by_gate)>0), and then
include the ESP-specific assertions (assert len(..._dur_by_gate)>0 and assert
predictor.env._tbar is None or predictor.env._tbar > 0.0) only when the
parameter == "estimated_success_probability"; keep references to Predictor,
predictor.env.calculate_reward, _is_native_and_mapped, _err_by_gate,
_dur_by_gate, and _tbar to locate the code.
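A sketch of the merged, parametrized test (it assumes the module's existing Predictor import and circuit setup, and already applies the later review suggestion to force mode="approx" rather than monkeypatching):

```python
import pytest

@pytest.mark.parametrize("fom", ["expected_fidelity", "estimated_success_probability"])
def test_approx_reward(fom: str) -> None:
    predictor = Predictor(figure_of_merit=fom)
    # qc is assumed to come from the test module's usual circuit setup
    val, kind = predictor.env.calculate_reward(qc=qc, mode="approx")
    assert kind == "approx"
    assert 0.0 <= val <= 1.0
    assert predictor.env._dev_avgs_cached
    assert len(predictor.env._err_by_gate) > 0
    if fom == "estimated_success_probability":
        assert len(predictor.env._dur_by_gate) > 0
        assert predictor.env._tbar is None or predictor.env._tbar > 0.0
```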
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)
562-576: 🧹 Nitpick | 🔵 Trivial

determine_valid_actions_for_state creates fresh pass instances while _native_and_mapped reuses cached ones.

Both methods perform the same GatesInBasis + CheckMap checks. determine_valid_actions_for_state still instantiates new objects each call (lines 564, 574), while _native_and_mapped uses the cached instances from __init__. Consider reusing the cached instances here too, or calling _native_and_mapped internally.
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 705-717: The cache in _native_and_mapped uses id(qc) which can be
recycled and cause stale hits; either remove the cache entirely and always run
_gates_in_basis_check and _check_map, or replace the id-based key with a safe
version counter: add a _state_version integer on the class (initialize in
__init__), increment _state_version whenever self.state is reassigned, store and
compare _last_state_version (instead of _last_qc_id) alongside
_last_native_mapped in _native_and_mapped, and update that version when you set
self.state so the cached (only_native, mapped) is only reused for the current
state.
In `@tests/compilation/test_predictor_rl.py`:
- Around line 175-178: The condition is checking the imported type alias
figure_of_merit instead of the test parameter fom, so the ESP-specific
assertions (predictor.env._dur_by_gate and predictor.env._tbar) never run;
update the conditional to use the test parameter (if fom ==
"estimated_success_probability") so the assertions execute for the ESP reward
path and validate duration/tbar caching for predictor.env._dur_by_gate and
predictor.env._tbar.
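A sketch of the version-counter cache described in the predictorenv item above (`_run_checks` is a hypothetical stand-in for running the cached GatesInBasis/CheckMap passes):

```python
class PredictorEnv:  # sketch of the caching mechanics only
    def __init__(self) -> None:
        self._state_version = 0          # bumped whenever self.state is reassigned
        self._last_state_version = -1    # version the cached result belongs to
        self._last_native_mapped: tuple[bool, bool] | None = None

    def _set_state(self, qc) -> None:
        self.state = qc
        self._state_version += 1         # invalidates the cached check result

    def _native_and_mapped(self) -> tuple[bool, bool]:
        if self._last_native_mapped is not None and self._last_state_version == self._state_version:
            return self._last_native_mapped
        result = self._run_checks()      # hypothetical: runs the cached passes
        self._last_state_version = self._state_version
        self._last_native_mapped = result
        return result
```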
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 636-639: The suppress(KeyError, AttributeError, TypeError) around
the basis_ops loop is too broad and can hide legitimate TypeError bugs when
calling target.operation_from_name or reading op.num_qubits; change the
exception list to (KeyError, AttributeError) (i.e., remove TypeError) and
optionally add a debug/log line inside the except path that records the gate
name and the caught exception so skipped gates are visible; locate the loop
using basis_ops, target.operation_from_name, arity_by_name and op.num_qubits to
make the change.
- Around line 238-243: The code path for reward_function ==
"estimated_hellinger_distance" only gives the episode terminal step a real
reward and assigns no_effect_penalty for every non-terminal step, which leaves
the agent without shaping signal; update the branch in predictorenv.py (the
block checking self.reward_function == "estimated_hellinger_distance") to either
implement a non-terminal shaping heuristic or, if this is intentional, add a
concise explanatory comment above the block referencing that choice and the
involved symbols (self.reward_function, "estimated_hellinger_distance",
self.no_effect_penalty, self.calculate_reward, and self.action_terminate_index)
so readers know why calculate_reward is only called at done and why non-terminal
steps receive no signal.
In `@tests/compilation/test_predictor_rl.py`:
- Around line 165-170: The monkeypatch targets a non-existent method
`_is_native_and_mapped` and has no effect; update the test to force the
approximate path by calling calculate_reward with mode="approx" instead of
relying on device mapping heuristics. Locate the call to
predictor.env.calculate_reward(qc=qc, mode="auto") in the test and change the
mode argument to "approx" (remove the monkeypatch line that sets
`_is_native_and_mapped`), ensuring the test exercises the approximate reward
branch in calculate_reward rather than depending on GatesInBasis/CheckMap
behavior.
```python
done = action == self.action_terminate_index

if self.reward_function == "estimated_hellinger_distance":
    reward_val = self.calculate_reward(mode="exact")[0] if done else self.no_effect_penalty
    self.state._layout = self.layout  # noqa: SLF001
    return create_feature_dict(self.state), reward_val, done, False, {}
```
🧹 Nitpick | 🔵 Trivial
Hellinger path provides no shaping signal — non-terminal steps always get a flat penalty.
This is a design choice, but it means the RL agent receives zero useful gradient signal during an episode for estimated_hellinger_distance — reward information only comes at termination. If this is intentional (e.g., because there's no approximate Hellinger model), a brief comment explaining the rationale would help future readers.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 242-242: Unused noqa directive (non-enabled: SLF001)
Remove unused noqa directive
(RUF100)
🤖 Prompt for AI Agents
In `@src/mqt/predictor/rl/predictorenv.py` around lines 238 - 243, The code path
for reward_function == "estimated_hellinger_distance" only gives the episode
terminal step a real reward and assigns no_effect_penalty for every non-terminal
step, which leaves the agent without shaping signal; update the branch in
predictorenv.py (the block checking self.reward_function ==
"estimated_hellinger_distance") to either implement a non-terminal shaping
heuristic or, if this is intentional, add a concise explanatory comment above
the block referencing that choice and the involved symbols
(self.reward_function, "estimated_hellinger_distance", self.no_effect_penalty,
self.calculate_reward, and self.action_terminate_index) so readers know why
calculate_reward is only called at done and why non-terminal steps receive no
signal.
```python
for name in basis_ops:
    with suppress(KeyError, AttributeError, TypeError):
        op = target.operation_from_name(name)
        arity_by_name[name] = int(op.num_qubits)
```
🧹 Nitpick | 🔵 Trivial
suppress(KeyError, AttributeError, TypeError) silently swallows errors during arity detection.
While this is cleaner than a bare except Exception, TypeError is quite broad. If operation_from_name returns an unexpected type or num_qubits is somehow non-integer, the error would be silently swallowed and the gate would be skipped. Consider narrowing to (KeyError, AttributeError) or adding debug logging.
🤖 Prompt for AI Agents
In `@src/mqt/predictor/rl/predictorenv.py` around lines 636 - 639, The
suppress(KeyError, AttributeError, TypeError) around the basis_ops loop is too
broad and can hide legitimate TypeError bugs when calling
target.operation_from_name or reading op.num_qubits; change the exception list
to (KeyError, AttributeError) (i.e., remove TypeError) and optionally add a
debug/log line inside the except path that records the gate name and the caught
exception so skipped gates are visible; locate the loop using basis_ops,
target.operation_from_name, arity_by_name and op.num_qubits to make the change.
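Concretely, the narrowed handler could look like this (a sketch reusing the names from the hunk above):

```python
for name in basis_ops:
    try:
        op = target.operation_from_name(name)
        arity_by_name[name] = int(op.num_qubits)
    except (KeyError, AttributeError) as exc:  # TypeError removed so real bugs surface
        logger.debug("Skipping gate '%s' during arity detection: %s", name, exc)
```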
Description
This PR introduces a shaped, step-wise reward signal for the RL-based compiler.
For the figures of merit expected_fidelity and estimated_success_probability, the reward is computed in two regimes:
Exact regime (native + mapped circuits)
If the circuit consists only of device-native gates and respects the device’s coupling map, the step reward is based on the change in the exact calibration-aware metric between successive steps.
Approximate regime (non-native / unmapped circuits)
If the circuit still contains non-native gates or violates the device topology, a conservative canonical cost model is used to approximate the expected fidelity and ESP. The intermediate reward is then derived from the change in this approximate metric.
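In condensed form, the step-wise reward can be sketched as follows (exact_metric and approx_metric are placeholders for the calibration-aware and canonical-cost computations; attribute names are illustrative):

```python
def step_reward(self, qc) -> float:
    if self._native_and_mapped(qc):
        value, kind = exact_metric(qc), "exact"    # calibration-aware EF/ESP
    else:
        value, kind = approx_metric(qc), "approx"  # conservative canonical cost model
    delta = value - self.prev_reward
    if kind != self.prev_reward_kind:
        delta = 0.0  # values from different regimes are not comparable
    self.prev_reward, self.prev_reward_kind = value, kind
    return self.reward_scale * delta
```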
Checklist: