Features needed for routing env in qiskit-gym#40

Open
victor-villar wants to merge 3 commits into main from vv-routing

Conversation

@victor-villar
Collaborator

  • Added factorized action support (factorized_bernoulli) to both the Python and Rust policies, with auto-detection of MultiBinary action spaces.
  • Updated PPO rollout collection to a proper step-then-reward flow.
  • Expanded difficulty-progression controls (threshold/hysteresis, step size, max difficulty, warmup).
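
As a rough illustration of the factorized_bernoulli idea, k per-bit logits can be expanded into 2^k joint action logits via a bit matrix. This is a hedged sketch under assumed conventions (the helper names and the exact projection are illustrative, not the PR's actual code; the PR registers a similar bit matrix as the `_action_index_bits` buffer):

```python
import numpy as np

def build_action_index_bits(num_factors):
    """Bit matrix of shape (2**num_factors, num_factors):
    row a holds the binary expansion of action index a."""
    actions = np.arange(2 ** num_factors)
    return ((actions[:, None] >> np.arange(num_factors)) & 1).astype(np.float64)

def expand_factorized_logits(factor_logits):
    """Map k per-bit Bernoulli logits to 2**k joint categorical logits.

    For independent Bernoulli factors, log P(a) equals the sum of the
    logits of the set bits plus a per-row constant; since softmax is
    shift-invariant, the matrix product below yields the same distribution.
    """
    k = len(factor_logits)
    bits = build_action_index_bits(k)
    return bits @ np.asarray(factor_logits, dtype=np.float64)

logits = expand_factorized_logits([1.0, -2.0, 0.5])
# 8 joint logits, one per 3-bit action pattern; logits[0] corresponds to
# the all-zeros action and is 0.0 (no bits set).
```

The payoff is that the network only needs k output units instead of 2^k, while downstream categorical machinery (sampling, masking, permutations) can keep operating on the expanded logits.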

@gemini-code-assist

Summary of Changes

Hello @victor-villar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant enhancements to the reinforcement learning framework, primarily focusing on improving the training process for routing environments. It integrates a new factorized action representation for more efficient policy learning, refines the PPO data collection mechanism for correctness, and provides advanced controls for dynamic difficulty adjustment. These changes aim to make the system more robust, flexible, and performant for complex environments like those found in qiskit-gym.

Highlights

  • Factorized Action Support: Implemented factorized_bernoulli action mode in both Python and Rust policy implementations. This mode allows for a more efficient representation of actions, especially for environments with MultiBinary action spaces, by treating actions as combinations of binary factors. The system now includes auto-detection for MultiBinary environments to automatically configure policies for this mode.
  • PPO Rollout Collection Update: Refactored the PPO collector in Rust to ensure a 'step-then-reward' flow. This change aligns the data collection process with standard reinforcement learning semantics, where the reward is observed after an action has been taken and the environment has transitioned to a new state.
  • Enhanced Difficulty Progression Controls: Expanded the configuration options for difficulty progression during learning. New controls include threshold_min for hysteresis, diff_step for customizable step increments, diff_max for an upper bound on difficulty, and a warmup period for initial gradual increases. This provides more granular control over how the environment's difficulty adapts to agent performance.
  • Robust Sparse Observation Handling: Improved the handling of sparse observations in Python's az.py and ppo.py data processing. The update now correctly accounts for multiplicities (duplicate indices) in observations, ensuring accurate representation when converting environment observations to PyTorch tensors.
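
The multiplicity-aware sparse handling described in the last highlight can be sketched as follows (assumed shapes and names; not the PR's exact code):

```python
import numpy as np

def sparse_obs_to_dense(indices, size):
    """Densify a sparse observation given as a list of active indices,
    accumulating duplicate indices instead of overwriting them."""
    dense = np.zeros(size, dtype=np.float32)
    # dense[indices] = 1.0 would record a duplicated index only once;
    # np.add.at performs unbuffered in-place accumulation, so
    # multiplicities survive the conversion.
    np.add.at(dense, indices, 1.0)
    return dense

v = sparse_obs_to_dense([0, 2, 2, 5], 8)
# v[2] == 2.0 because index 2 appears twice; v[0] == v[5] == 1.0
```

This is the standard NumPy idiom for scatter-add with repeated indices; plain fancy-indexing assignment silently drops the duplicates.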
Changelog
  • rust/src/collector/ppo.rs
    • Renamed get_step_data to get_action_data and adjusted its return signature.
    • Modified single_collect to implement a 'step-then-reward' sequence for accurate rollout collection.
    • Added a check for empty collected data in single_collect to prevent errors.
    • Updated test assertion to reflect changes in collected observation length.
  • rust/src/nn/policy.rs
    • Introduced ActionMode enum for categorical and factorized Bernoulli actions.
    • Extended Policy struct with action_mode, num_action_factors, and num_actions fields.
    • Added new_with_action_mode constructor to support new action mode parameters.
    • Implemented effective_num_actions and expand_factorized_logits for handling factorized actions.
    • Modified predict and forward_with_perm to process logits based on the configured action_mode.
    • Improved permutation logic in forward_with_perm for safer access.
  • rust/src/python_interface/policy.rs
    • Updated PyPolicy::new to accept action_mode, num_action_factors, and num_actions with default values.
    • Integrated the new Policy::new_with_action_mode constructor.
  • src/twisterl/defaults.py
    • Expanded LEARNING_CONFIG with new parameters for difficulty progression: threshold_min, diff_max, diff_step, warmup, and final_diff_is_none.
  • src/twisterl/nn/policy.py
    • Modified Policy.__init__ to support action_mode and num_action_factors.
    • Added logic to infer num_action_factors for MultiBinary action spaces.
    • Implemented validation for factorized_bernoulli mode, ensuring num_actions is a power of two.
    • Adjusted action_out_size based on the selected action_mode.
    • Registered _action_index_bits as a buffer for factorized action expansion.
    • Updated _forward_core to project factorized logits using _action_index_bits.
    • Modified to_rust method to pass new action mode parameters to the Rust policy.
    • Updated ConvPolicy to propagate action mode parameters.
    • Added helper functions _infer_num_action_factors and _build_action_index_bits.
  • src/twisterl/rl/algorithm.py
    • Updated learn method to retrieve and utilize new difficulty progression parameters from configuration.
    • Implemented a more sophisticated difficulty increase mechanism incorporating hysteresis, step size, warmup, and maximum difficulty.
  • src/twisterl/rl/az.py
    • Enhanced data_to_torch to robustly handle sparse observations, including multiplicities, using np.add.at.
  • src/twisterl/rl/ppo.py
    • Improved data_to_torch to correctly process sparse observations with multiplicities.
  • src/twisterl/utils.py
    • Imported inspect and numpy for new functionalities.
    • Modified prepare_algorithm to auto-detect factorized_bernoulli mode for MultiBinary action spaces.
    • Added backward compatibility logic to filter unsupported policy kwargs for older policy classes.
    • Ensured new policy kwargs are correctly passed to the policy constructor.
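
The difficulty-progression controls added to LEARNING_CONFIG could combine roughly as below. This is a hedged sketch of hysteresis-plus-warmup scheduling; the parameter names mirror the new config keys, but the actual update rule in algorithm.py may differ (for example, it may never decrease difficulty):

```python
def update_difficulty(diff, success_rate, step,
                      threshold=0.9, threshold_min=0.8,
                      diff_step=1, diff_max=20, warmup=10):
    """Advance difficulty when the agent clears `threshold`, back off when it
    drops below `threshold_min` (the gap between the two provides hysteresis
    against oscillation), and cap increments to 1 during the first `warmup`
    steps so early difficulty grows gradually."""
    if success_rate >= threshold:
        inc = 1 if step < warmup else diff_step
        return min(diff + inc, diff_max)
    if success_rate < threshold_min:
        return max(diff - diff_step, 0)
    return diff  # inside the hysteresis band: hold steady
```

Without the `threshold_min` band, a success rate hovering near a single threshold would make the difficulty flap up and down every few evaluations.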

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces significant features for routing environment support in qiskit-gym, including factorized action support in both Python and Rust policies, updated rollout collection in PPO for a proper step-then-reward flow, and expanded difficulty progression controls. The changes are well-structured and address the stated objectives. Several improvements have been made to ensure robustness and correct behavior, such as bounds checking and backward compatibility for policy configurations. However, a few minor issues related to function definition order and edge-case handling in action factor inference were identified.

Comment on lines +55 to +57
```python
int(num_action_factors)
if num_action_factors is not None
else _infer_num_action_factors(self.num_actions)
```

critical

The function _infer_num_action_factors is called here within BasicPolicy.__init__ but is defined later in the file (lines 325-333). Python resolves module-level names at call time, so this works once the module has finished loading, but it raises a NameError if the class is instantiated while the module is still being imported, and it hurts readability in any case. Please move the definitions of _infer_num_action_factors (and _build_action_index_bits, which is also referenced before its definition) above the BasicPolicy class, or at least above their first use.

Comment on lines +109 to +115
```rust
if num_factors == 0 || factor_logits.len() < num_factors {
    return factor_logits.to_vec();
}
let num_actions = self.effective_num_actions();
if num_actions == 0 {
    return factor_logits.to_vec();
}
```

medium

In expand_factorized_logits, if num_factors is non-zero but factor_logits.len() is less than num_factors, or if effective_num_actions() returns 0, the function currently returns factor_logits.to_vec(). This behavior might mask an underlying configuration error or an invalid state. It would be more robust to explicitly handle these as error conditions, perhaps by logging a warning or raising an error, to prevent silent misbehavior in the reinforcement learning process.

Comment on lines +195 to +199
```rust
if let Some(act_perm) = self.act_perms.get(pi) {
    if act_perm.len() == action_logits.len() {
        action_logits = act_perm.iter().map(|&v| action_logits[v]).collect();
    }
}
```

medium

The condition if act_perm.len() == action_logits.len() is a good check. However, if the lengths do not match, the permutation is silently skipped. This could lead to unexpected behavior if the permutation was intended to be applied but couldn't due to a mismatch. Consider logging a warning or raising an error in this else branch to make such inconsistencies explicit, aiding in debugging and preventing silent failures.

```python
n_bits = getattr(action_space, "n", None)
if n_bits is None:
    shape = getattr(action_space, "shape", None)
    n_bits = int(np.prod(shape)) if shape is not None else 0
```

medium

In the auto-factorization logic, when inferring n_bits from action_space.shape, np.prod(shape) is used. If shape is an empty tuple (), np.prod(()) evaluates to 1. This would incorrectly set n_bits to 1 for an empty shape, which should likely be 0 for a MultiBinary space with no bits. Consider modifying the condition to explicitly check for an empty shape, e.g., if shape and len(shape) > 0: n_bits = int(np.prod(shape)) else: n_bits = 0.

Suggested change

```diff
-n_bits = int(np.prod(shape)) if shape is not None else 0
+n_bits = int(np.prod(shape)) if shape and len(shape) > 0 else 0
```
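
The edge case is easy to verify in isolation: NumPy defines the product of an empty sequence as 1, so an empty shape yields n_bits == 1 under the original expression (a quick standalone check, not code from the PR):

```python
import numpy as np

shape = ()  # degenerate case: an action space exposing an empty shape
empty_product = np.prod(shape)  # the empty product is defined as 1.0
n_bits_buggy = int(np.prod(shape)) if shape is not None else 0         # -> 1
n_bits_fixed = int(np.prod(shape)) if shape and len(shape) > 0 else 0  # -> 0
```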
