Features needed for routing env in qiskit-gym#40
Conversation
victor-villar
commented
Feb 24, 2026
- Added factorized action support (factorized_bernoulli) in Python + Rust policies, with MultiBinary auto-detection.
- Updated rollout collection in PPO to a proper step-then-reward flow.
- Expanded difficulty progression controls (threshold/hysteresis, step size, max difficulty, warmup).
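The difficulty progression controls listed above could be sketched roughly as follows. All names and defaults here are hypothetical illustrations of the threshold/hysteresis/step/max/warmup idea, not the PR's actual implementation:

```python
# Hypothetical sketch of a difficulty scheduler with the controls named in
# the PR summary: success-rate threshold, hysteresis band to avoid
# oscillation, step size, max difficulty cap, and a warmup period.
class DifficultyScheduler:
    def __init__(self, threshold=0.8, hysteresis=0.1, step=1,
                 max_difficulty=10, warmup_episodes=100):
        self.threshold = threshold
        self.hysteresis = hysteresis
        self.step = step
        self.max_difficulty = max_difficulty
        self.warmup_episodes = warmup_episodes
        self.difficulty = 0
        self.episodes = 0

    def update(self, success_rate):
        self.episodes += 1
        if self.episodes < self.warmup_episodes:
            return self.difficulty  # hold difficulty during warmup
        if success_rate >= self.threshold:
            # Agent is succeeding: raise difficulty, capped at the max.
            self.difficulty = min(self.difficulty + self.step,
                                  self.max_difficulty)
        elif success_rate < self.threshold - self.hysteresis:
            # Only lower difficulty once success drops below the
            # hysteresis band, so small fluctuations don't oscillate it.
            self.difficulty = max(self.difficulty - self.step, 0)
        return self.difficulty
```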
Summary of Changes

Hello @victor-villar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces significant enhancements to the reinforcement learning framework, primarily focusing on improving the training process for routing environments. It integrates a new factorized action representation for more efficient policy learning, refines the PPO data collection mechanism for correctness, and provides advanced controls for dynamic difficulty adjustment. These changes aim to make the system more robust, flexible, and performant for complex environments like those found in qiskit-gym.

Highlights
Changelog
Code Review
The pull request introduces significant features for routing environment support in qiskit-gym, including factorized action support in both Python and Rust policies, updated rollout collection in PPO for a proper step-then-reward flow, and expanded difficulty progression controls. The changes are well-structured and address the stated objectives. Several improvements have been made to ensure robustness and correct behavior, such as bounds checking and backward compatibility for policy configurations. However, a few minor issues related to function definition order and edge-case handling in action factor inference were identified.
```python
int(num_action_factors)
if num_action_factors is not None
else _infer_num_action_factors(self.num_actions)
```
The helper `_infer_num_action_factors` is called here in `BasicPolicy.__init__` but defined later in the file (lines 325-333), as is `_build_action_index_bits`. Because Python resolves module-level names at call time, this works as long as the module has fully loaded before any `BasicPolicy` is instantiated, but it hurts readability and would raise a `NameError` if the expression were ever evaluated during module import (e.g. in a class attribute or a default argument). Please move the definitions of `_infer_num_action_factors` and `_build_action_index_bits` above the `BasicPolicy` class definition, or at least before their first use.
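The recommended ordering can be sketched as follows. These are toy stand-ins for the actual helpers, and the bit-length inference rule shown is illustrative only, not the PR's real logic:

```python
# Hypothetical stand-in for _infer_num_action_factors, defined *before*
# the class that uses it so the reading order matches the call order.
def _infer_num_action_factors(num_actions):
    # Illustrative rule: number of bits needed to index the actions.
    return max(1, (num_actions - 1).bit_length())


class BasicPolicy:
    def __init__(self, num_actions, num_action_factors=None):
        self.num_actions = num_actions
        self.num_action_factors = (
            int(num_action_factors)
            if num_action_factors is not None
            else _infer_num_action_factors(self.num_actions)
        )
```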
```rust
if num_factors == 0 || factor_logits.len() < num_factors {
    return factor_logits.to_vec();
}
let num_actions = self.effective_num_actions();
if num_actions == 0 {
    return factor_logits.to_vec();
}
```
In expand_factorized_logits, if num_factors is non-zero but factor_logits.len() is less than num_factors, or if effective_num_actions() returns 0, the function currently returns factor_logits.to_vec(). This behavior might mask an underlying configuration error or an invalid state. It would be more robust to explicitly handle these as error conditions, perhaps by logging a warning or raising an error, to prevent silent misbehavior in the reinforcement learning process.
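A Python analogue of the fail-fast behaviour suggested here (the actual code is Rust; the names follow the snippet above, and the expansion rule of summing the logits of each action's set bits is an illustrative assumption, not necessarily the PR's):

```python
# Sketch: raise on inconsistent configuration instead of silently
# echoing the factor logits back to the caller.
def expand_factorized_logits(factor_logits, num_factors, num_actions):
    if num_factors == 0 or len(factor_logits) < num_factors:
        raise ValueError(
            f"expected at least {num_factors} factor logits, "
            f"got {len(factor_logits)}"
        )
    if num_actions == 0:
        raise ValueError("effective_num_actions() returned 0")
    # Illustrative expansion: score each action by summing the logits of
    # the factor bits set in its index.
    expanded = []
    for action in range(num_actions):
        score = sum(
            factor_logits[bit]
            for bit in range(num_factors)
            if (action >> bit) & 1
        )
        expanded.append(score)
    return expanded
```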
```rust
if let Some(act_perm) = self.act_perms.get(pi) {
    if act_perm.len() == action_logits.len() {
        action_logits = act_perm.iter().map(|&v| action_logits[v]).collect();
    }
}
```
The condition if act_perm.len() == action_logits.len() is a good check. However, if the lengths do not match, the permutation is silently skipped. This could lead to unexpected behavior if the permutation was intended to be applied but couldn't due to a mismatch. Consider logging a warning or raising an error in this else branch to make such inconsistencies explicit, aiding in debugging and preventing silent failures.
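A Python analogue of the suggested warning path (the real code is Rust; names follow the snippet above, and the behaviour shown is a sketch of the suggestion, not the PR's implementation):

```python
import warnings

def apply_action_perm(action_logits, act_perm):
    # Apply the permutation when lengths match; warn loudly on a
    # mismatch instead of silently skipping it.
    if act_perm is None:
        return action_logits
    if len(act_perm) != len(action_logits):
        warnings.warn(
            f"action permutation of length {len(act_perm)} does not "
            f"match {len(action_logits)} logits; skipping permutation"
        )
        return action_logits
    return [action_logits[i] for i in act_perm]
```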
```python
n_bits = getattr(action_space, "n", None)
if n_bits is None:
    shape = getattr(action_space, "shape", None)
    n_bits = int(np.prod(shape)) if shape is not None else 0
```
In the auto-factorization logic, when inferring `n_bits` from `action_space.shape`, `np.prod(shape)` is used. If `shape` is an empty tuple `()`, `np.prod(())` evaluates to `1`. This would incorrectly set `n_bits` to 1 for an empty shape, which should likely be 0 for a MultiBinary space with no bits. Consider modifying the condition to explicitly check for an empty shape, e.g. `n_bits = int(np.prod(shape)) if shape and len(shape) > 0 else 0`.
```diff
- n_bits = int(np.prod(shape)) if shape is not None else 0
+ n_bits = int(np.prod(shape)) if shape and len(shape) > 0 else 0
```
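The edge case and the fix can be demonstrated with a small sketch. To keep it dependency-free, `math.prod` stands in for `np.prod` (both return 1 for an empty shape, since the empty product is 1); `infer_n_bits` and the `SimpleNamespace` stand-ins for action spaces are illustrative names, not the PR's:

```python
import math

# The empty product is 1: math.prod(()) == 1, and np.prod(()) likewise
# evaluates to 1.0, so a space with shape () would be inferred as 1 bit
# under the original condition.
assert math.prod(()) == 1

def infer_n_bits(action_space):
    # Fixed inference following the suggested change above.
    n_bits = getattr(action_space, "n", None)
    if n_bits is None:
        shape = getattr(action_space, "shape", None)
        # Falsy/empty shapes (None or ()) now yield 0 bits, not 1.
        n_bits = int(math.prod(shape)) if shape and len(shape) > 0 else 0
    return n_bits
```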