Features needed for routing env in qiskit-gym#40

Open
victor-villar wants to merge 3 commits into main from vv-routing

Conversation

@victor-villar
Collaborator

  • Added factorized action support (factorized_bernoulli) to both the Python and Rust policies, with auto-detection of MultiBinary action spaces.
  • Updated PPO rollout collection to a proper step-then-reward flow.
  • Expanded difficulty-progression controls (threshold/hysteresis, step size, max difficulty, warmup).
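
As a rough illustration of the factorized_bernoulli idea, k per-bit logits can be expanded into 2^k joint action logits via a bit matrix. This is a hedged sketch under assumed conventions (the helper names and the exact projection are illustrative, not the PR's actual code; the PR registers a similar bit matrix as the `_action_index_bits` buffer):

```python
import numpy as np

def build_action_index_bits(num_factors):
    """Bit matrix of shape (2**num_factors, num_factors):
    row a holds the binary expansion of action index a."""
    actions = np.arange(2 ** num_factors)
    return ((actions[:, None] >> np.arange(num_factors)) & 1).astype(np.float64)

def expand_factorized_logits(factor_logits):
    """Map k per-bit Bernoulli logits to 2**k joint categorical logits.

    For independent Bernoulli factors, log P(a) equals the sum of the
    logits of the set bits plus a per-row constant; since softmax is
    shift-invariant, the matrix product below yields the same distribution.
    """
    k = len(factor_logits)
    bits = build_action_index_bits(k)
    return bits @ np.asarray(factor_logits, dtype=np.float64)

logits = expand_factorized_logits([1.0, -2.0, 0.5])
# 8 joint logits, one per 3-bit action pattern; logits[0] corresponds to
# the all-zeros action and is 0.0 (no bits set).
```

The payoff is that the network only needs k output units instead of 2^k, while downstream categorical machinery (sampling, masking, permutations) can keep operating on the expanded logits.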

@gemini-code-assist

Summary of Changes

Hello @victor-villar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant enhancements to the reinforcement learning framework, primarily focusing on improving the training process for routing environments. It integrates a new factorized action representation for more efficient policy learning, refines the PPO data collection mechanism for correctness, and provides advanced controls for dynamic difficulty adjustment. These changes aim to make the system more robust, flexible, and performant for complex environments like those found in qiskit-gym.

Highlights

  • Factorized Action Support: Implemented factorized_bernoulli action mode in both Python and Rust policy implementations. This mode allows for a more efficient representation of actions, especially for environments with MultiBinary action spaces, by treating actions as combinations of binary factors. The system now includes auto-detection for MultiBinary environments to automatically configure policies for this mode.
  • PPO Rollout Collection Update: Refactored the PPO collector in Rust to ensure a 'step-then-reward' flow. This change aligns the data collection process with standard reinforcement learning semantics, where the reward is observed after an action has been taken and the environment has transitioned to a new state.
  • Enhanced Difficulty Progression Controls: Expanded the configuration options for difficulty progression during learning. New controls include threshold_min for hysteresis, diff_step for customizable step increments, diff_max for an upper bound on difficulty, and a warmup period for initial gradual increases. This provides more granular control over how the environment's difficulty adapts to agent performance.
  • Robust Sparse Observation Handling: Improved the handling of sparse observations in Python's az.py and ppo.py data processing. The update now correctly accounts for multiplicities (duplicate indices) in observations, ensuring accurate representation when converting environment observations to PyTorch tensors.
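
The multiplicity-aware sparse handling described in the last highlight can be sketched as follows (assumed shapes and names; not the PR's exact code):

```python
import numpy as np

def sparse_obs_to_dense(indices, size):
    """Densify a sparse observation given as a list of active indices,
    accumulating duplicate indices instead of overwriting them."""
    dense = np.zeros(size, dtype=np.float32)
    # dense[indices] = 1.0 would record a duplicated index only once;
    # np.add.at performs unbuffered in-place accumulation, so
    # multiplicities survive the conversion.
    np.add.at(dense, indices, 1.0)
    return dense

v = sparse_obs_to_dense([0, 2, 2, 5], 8)
# v[2] == 2.0 because index 2 appears twice; v[0] == v[5] == 1.0
```

This is the standard NumPy idiom for scatter-add with repeated indices; plain fancy-indexing assignment silently drops the duplicates.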
Changelog
  • rust/src/collector/ppo.rs
    • Renamed get_step_data to get_action_data and adjusted its return signature.
    • Modified single_collect to implement a 'step-then-reward' sequence for accurate rollout collection.
    • Added a check for empty collected data in single_collect to prevent errors.
    • Updated test assertion to reflect changes in collected observation length.
  • rust/src/nn/policy.rs
    • Introduced ActionMode enum for categorical and factorized Bernoulli actions.
    • Extended Policy struct with action_mode, num_action_factors, and num_actions fields.
    • Added new_with_action_mode constructor to support new action mode parameters.
    • Implemented effective_num_actions and expand_factorized_logits for handling factorized actions.
    • Modified predict and forward_with_perm to process logits based on the configured action_mode.
    • Improved permutation logic in forward_with_perm for safer access.
  • rust/src/python_interface/policy.rs
    • Updated PyPolicy::new to accept action_mode, num_action_factors, and num_actions with default values.
    • Integrated the new Policy::new_with_action_mode constructor.
  • src/twisterl/defaults.py
    • Expanded LEARNING_CONFIG with new parameters for difficulty progression: threshold_min, diff_max, diff_step, warmup, and final_diff_is_none.
  • src/twisterl/nn/policy.py
    • Modified Policy.__init__ to support action_mode and num_action_factors.
    • Added logic to infer num_action_factors for MultiBinary action spaces.
    • Implemented validation for factorized_bernoulli mode, ensuring num_actions is a power of two.
    • Adjusted action_out_size based on the selected action_mode.
    • Registered _action_index_bits as a buffer for factorized action expansion.
    • Updated _forward_core to project factorized logits using _action_index_bits.
    • Modified to_rust method to pass new action mode parameters to the Rust policy.
    • Updated ConvPolicy to propagate action mode parameters.
    • Added helper functions _infer_num_action_factors and _build_action_index_bits.
  • src/twisterl/rl/algorithm.py
    • Updated learn method to retrieve and utilize new difficulty progression parameters from configuration.
    • Implemented a more sophisticated difficulty increase mechanism incorporating hysteresis, step size, warmup, and maximum difficulty.
  • src/twisterl/rl/az.py
    • Enhanced data_to_torch to robustly handle sparse observations, including multiplicities, using np.add.at.
  • src/twisterl/rl/ppo.py
    • Improved data_to_torch to correctly process sparse observations with multiplicities.
  • src/twisterl/utils.py
    • Imported inspect and numpy for new functionalities.
    • Modified prepare_algorithm to auto-detect factorized_bernoulli mode for MultiBinary action spaces.
    • Added backward compatibility logic to filter unsupported policy kwargs for older policy classes.
    • Ensured new policy kwargs are correctly passed to the policy constructor.
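
The difficulty-progression controls added to LEARNING_CONFIG could combine roughly as below. This is a hedged sketch of hysteresis-plus-warmup scheduling; the parameter names mirror the new config keys, but the actual update rule in algorithm.py may differ (for example, it may never decrease difficulty):

```python
def update_difficulty(diff, success_rate, step,
                      threshold=0.9, threshold_min=0.8,
                      diff_step=1, diff_max=20, warmup=10):
    """Advance difficulty when the agent clears `threshold`, back off when it
    drops below `threshold_min` (the gap between the two provides hysteresis
    against oscillation), and cap increments to 1 during the first `warmup`
    steps so early difficulty grows gradually."""
    if success_rate >= threshold:
        inc = 1 if step < warmup else diff_step
        return min(diff + inc, diff_max)
    if success_rate < threshold_min:
        return max(diff - diff_step, 0)
    return diff  # inside the hysteresis band: hold steady
```

Without the `threshold_min` band, a success rate hovering near a single threshold would make the difficulty flap up and down every few evaluations.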

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces significant features for routing environment support in qiskit-gym, including factorized action support in both Python and Rust policies, updated rollout collection in PPO for a proper step-then-reward flow, and expanded difficulty progression controls. The changes are well-structured and address the stated objectives. Several improvements have been made to ensure robustness and correct behavior, such as bounds checking and backward compatibility for policy configurations. However, a few minor issues related to function definition order and edge-case handling in action factor inference were identified.

Comment on lines +55 to +57
```python
int(num_action_factors)
if num_action_factors is not None
else _infer_num_action_factors(self.num_actions)
```

critical

The function _infer_num_action_factors is called here within BasicPolicy.__init__ but is defined later in the file (lines 325-333). Python resolves module-level names at call time, so this works once the module has finished loading, but it raises a NameError if the class is instantiated while the module is still being imported, and it hurts readability in any case. Please move the definitions of _infer_num_action_factors (and _build_action_index_bits, which is also referenced before its definition) above the BasicPolicy class, or at least above their first use.

Comment on lines +109 to +115
```rust
if num_factors == 0 || factor_logits.len() < num_factors {
    return factor_logits.to_vec();
}
let num_actions = self.effective_num_actions();
if num_actions == 0 {
    return factor_logits.to_vec();
}
```

medium

In expand_factorized_logits, if num_factors is non-zero but factor_logits.len() is less than num_factors, or if effective_num_actions() returns 0, the function currently returns factor_logits.to_vec(). This behavior might mask an underlying configuration error or an invalid state. It would be more robust to explicitly handle these as error conditions, perhaps by logging a warning or raising an error, to prevent silent misbehavior in the reinforcement learning process.

Comment on lines +195 to +199
```rust
if let Some(act_perm) = self.act_perms.get(pi) {
    if act_perm.len() == action_logits.len() {
        action_logits = act_perm.iter().map(|&v| action_logits[v]).collect();
    }
}
```

medium

The condition if act_perm.len() == action_logits.len() is a good check. However, if the lengths do not match, the permutation is silently skipped. This could lead to unexpected behavior if the permutation was intended to be applied but couldn't due to a mismatch. Consider logging a warning or raising an error in this else branch to make such inconsistencies explicit, aiding in debugging and preventing silent failures.

```python
n_bits = getattr(action_space, "n", None)
if n_bits is None:
    shape = getattr(action_space, "shape", None)
    n_bits = int(np.prod(shape)) if shape is not None else 0
```

medium

In the auto-factorization logic, when inferring n_bits from action_space.shape, np.prod(shape) is used. If shape is an empty tuple (), np.prod(()) evaluates to 1. This would incorrectly set n_bits to 1 for an empty shape, which should likely be 0 for a MultiBinary space with no bits. Consider modifying the condition to explicitly check for an empty shape, e.g., if shape and len(shape) > 0: n_bits = int(np.prod(shape)) else: n_bits = 0.

Suggested change

```diff
-n_bits = int(np.prod(shape)) if shape is not None else 0
+n_bits = int(np.prod(shape)) if shape and len(shape) > 0 else 0
```
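
The edge case is easy to verify in isolation: NumPy defines the product of an empty sequence as 1, so an empty shape yields n_bits == 1 under the original expression (a quick standalone check, not code from the PR):

```python
import numpy as np

shape = ()  # degenerate case: an action space exposing an empty shape
empty_product = np.prod(shape)  # the empty product is defined as 1.0
n_bits_buggy = int(np.prod(shape)) if shape is not None else 0         # -> 1
n_bits_fixed = int(np.prod(shape)) if shape and len(shape) > 0 else 0  # -> 0
```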
