
Commit 3637aa9

Add permutations as twists in training (#27)
* Add permutations as twists in training
* Fix lint
* Add documentation on twists
* some fixes
1 parent 88714e3 commit 3637aa9

File tree

13 files changed (+335, -55 lines)


README.md

Lines changed: 6 additions & 1 deletion
```diff
@@ -100,11 +100,16 @@ The `examples/grid_world` custom environment example [here](examples/grid_world)
 
 Refer to [grid_world](examples/grid_world) for a complete working example.
 
+## Documentation
+
+- [Permutation twists in environments](docs/twists.md)
+
 ## 🚀 Key Features
 - **High-Performance Core**: RL episode loop implemented in Rust for faster training and inference
 - **Inference-Ready**: Easy compilation and bundling of models with environments into portable binaries for inference
 - **Modular Design**: Support for multiple algorithms (PPO, AlphaZero) with interchangeable training and inference
 - **Language Interoperability**: Core in Rust with Python interface
+- **Symmetry-Aware Training via Twists**: Environments can expose observation/action permutations (“twists”) so policies automatically exploit device or puzzle symmetries for faster learning.
 
 
 ## 🏗️ Current State (PoC)
@@ -165,4 +170,4 @@ This project is currently in PoC stage. While functional, it's under active deve
 
 ## 📜 License
 
-Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
```

docs/twists.md

Lines changed: 66 additions & 0 deletions
# Twists (Permutation Symmetries) in twisteRL
Twists are twisteRL's way to describe the exact permutation symmetries that exist inside an environment. Instead of training a policy to rediscover that these symmetries exist (for example, that swapping qubits across a symmetric coupling map produces an equivalent observation/action space), the environment hands the policy explicit permutations that it can use for data augmentation or symmetry-aware heads. Repeatedly applying and undoing these permutations also reduces the chance of deadlocks and provides a lightweight form of regularization, because the agent sees equivalent states under many orderings.

## Where Twists Are Used
- Every environment implements the `twisterl::rl::env::Env` trait. The trait includes a `twists` method that returns `(Vec<Vec<usize>>, Vec<Vec<usize>>)`, representing valid permutations on the flattened observation array and matching permutations on the discrete action space (`rust/src/rl/env.rs:33`).
- When an environment is instantiated from Python via `prepare_algorithm`, twisteRL immediately calls `env.twists()` and forwards the returned permutations to the policy constructor (`src/twisterl/utils.py:120`). The policy can then symmetrize logits, average values, or augment rollouts without extra environment queries.

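Once the policy holds the permutation lists, one thing it can do with them is average its outputs over the whole orbit of a state. The sketch below is a hypothetical helper, not twisteRL API, and the map-old-index-to-new-index convention it uses is an assumption:

```python
def symmetrized_logits(raw_logits_fn, obs, obs_perms, act_perms):
    """Average action logits over all twists.

    Assumed convention: a permutation maps old index -> new index,
    i.e. twisted[perm[k]] = original[k]; act_perms[i] then relates
    canonical actions to actions in the twisted frame.
    """
    num_actions = len(act_perms[0])
    acc = [0.0] * num_actions
    for obs_perm, act_perm in zip(obs_perms, act_perms):
        twisted = [0] * len(obs)
        for k, j in enumerate(obs_perm):
            twisted[j] = obs[k]              # apply the observation twist
        logits = raw_logits_fn(twisted)      # logits in the twisted frame
        for k, j in enumerate(act_perm):
            acc[k] += logits[j]              # map each logit back and accumulate
    return [v / len(obs_perms) for v in acc]
```

For an equivariant `raw_logits_fn`, the averaged result equals the un-symmetrized one; for an imperfectly trained network, averaging enforces the symmetry at inference time.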
## Data Contract
1. **Observation permutations (`obs_perms`)** are expressed in the same flattened index space produced by the environment's `observe()` method. Each permutation covers every index exactly once.
2. **Action permutations (`act_perms`)** must use the same ordering as `obs_perms`: twisteRL assumes `act_perms[i]` describes how to remap actions when `obs_perms[i]` is applied.
3. The two lists must have the same length (`len(obs_perms) == len(act_perms)`), and the first permutation should usually be the identity so policies have a canonical ordering to fall back on.

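As an illustration of the contract, consider a toy 2×2 grid with a left-right flip. Everything here (the env, one action per cell, and the map-old-index-to-new-index convention) is an assumption made for the example, not twisteRL API:

```python
# Flattened indices:  0 1        after the flip:  1 0
#                     2 3                         3 2
obs_perms = [
    [0, 1, 2, 3],  # identity first, as recommended above
    [1, 0, 3, 2],  # horizontal flip (maps old index -> new index)
]
act_perms = [
    [0, 1, 2, 3],  # act_perms[i] pairs with obs_perms[i]
    [1, 0, 3, 2],  # one action per cell in this toy example
]

def apply_twist(obs, action, i):
    """Return the equivalent (obs, action) pair under twist i."""
    twisted = [0] * len(obs)
    for k, j in enumerate(obs_perms[i]):
        twisted[j] = obs[k]
    return twisted, act_perms[i][action]
```

Because the flip is its own inverse, applying `apply_twist` twice returns the original pair, which is an easy property to unit-test.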
## Implementing Twists in Rust Environments
1. **Compute permutations once**, when the environment is constructed, and store the resulting vectors on the struct so you can reuse them without recomputing at each step.
2. **Return cached permutations** from the `twists` method by cloning or otherwise referencing the stored vectors. This keeps the call cheap even when policies request twists frequently.
3. **Gate toggles through config.** Consider exposing a `use_perms` or `add_perms` flag so users can disable symmetries when they want to benchmark raw performance or compare against non-symmetric runs.

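The compute-once/cache pattern is the same in either language; here it is sketched in Python with a hypothetical environment (the Rust version stores the vectors on the struct in the constructor the same way):

```python
class FlipGridEnv:
    """Hypothetical environment that computes its twists once, up front."""

    def __init__(self, use_perms: bool = True):
        identity = list(range(4))
        flip = [1, 0, 3, 2]
        # Computed once at construction; `twists()` never recomputes.
        if use_perms:
            self._obs_perms = [identity, flip]
            self._act_perms = [identity, flip]
        else:
            # Config gate: identity-only, for benchmarking non-symmetric runs.
            self._obs_perms = [identity]
            self._act_perms = [identity]

    def twists(self):
        # Cheap: just hand back the cached lists.
        return self._obs_perms, self._act_perms
```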
### Tips for new envs
- If your observation is multi-dimensional, decide on a consistent flattening order and reuse it in `observe()`, `obs_shape()`, and the permutation computation.
- Keep the permutation list short: only add a symmetry when it actually preserves the transition dynamics; incorrect permutations can break training stability.
- Store permutations on the struct instead of recomputing them on each `twists()` call, to avoid extra allocations during training.

## Implementing Twists in Python Environments
Python environments exposed through `PyEnv` can mirror the same pattern:

1. **Detect graph/device symmetries** using domain-specific tooling, capturing any permutation that leaves the transition structure unchanged.
2. **Sample a permutation for every observation** if you want trajectories to naturally explore each orbit; this mimics the way many structured environments randomize qubit or tile order.
3. **Expose action permutations** through the PyO3 wrapper so the policy receives matching permutations. When porting a Python env to Rust, copy the action/observation permutation lists into the Rust struct and return them from `twists()`.

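Point 2 above (sampling a permutation per observation) can be sketched as follows; the helper name and the map-old-index-to-new-index convention are assumptions for the example:

```python
import random

def observe_with_random_twist(obs, obs_perms):
    """Return the observation under a uniformly sampled twist, plus its index."""
    i = random.randrange(len(obs_perms))
    twisted = [0] * len(obs)
    for k, j in enumerate(obs_perms[i]):
        twisted[j] = obs[k]
    return twisted, i
```

Whatever twist is drawn, the result is a rearrangement of the original observation, so downstream code only needs the index `i` to undo it.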
## Verifying Your Twists
1. Call `env.twists()` from Python and check that each permutation is a rearrangement of `range(len(observe()))` and `range(num_actions())`.
2. Run a short training job with and without permutations enabled. If the permutations are correct you should see either faster convergence or identical performance; a regression usually means the action and observation permutations are misaligned.
3. For debugging, temporarily limit the permutation list to `[identity]` and re-enable the additional symmetries one at a time.

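Check 1 can be automated with a few assertions; this is a hypothetical helper that just mirrors the checks described above:

```python
def check_twists(obs_perms, act_perms, obs_len, num_actions):
    """Validate the twists data contract for a batch of permutations."""
    assert len(obs_perms) == len(act_perms), "perm lists must have equal length"
    for obs_perm, act_perm in zip(obs_perms, act_perms):
        # Each permutation must be a rearrangement covering every index once.
        assert sorted(obs_perm) == list(range(obs_len)), f"bad obs perm: {obs_perm}"
        assert sorted(act_perm) == list(range(num_actions)), f"bad act perm: {act_perm}"
```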
By explicitly documenting and exposing twists, twisteRL policies gain symmetry awareness for free, leading to higher data efficiency on structured problems such as puzzle solvers and quantum circuit optimization.

rust/src/collector/az.rs

Lines changed: 3 additions & 2 deletions
```diff
@@ -61,7 +61,6 @@ impl AZCollector {
         let mut probs: Vec<Vec<f32>> = vec![];
         let mut vals: Vec<f32> = vec![];
         let mut total_vals: Vec<f32> = vec![];
-
         let mut total_val = 0.0;
 
         // Loop until a final state
@@ -93,9 +92,12 @@ impl AZCollector {
         // Post process rewards
         let remaining_vals: Vec<f32> = total_vals.iter().map(|&v| total_val - v).collect();
 
+        let perms: Vec<Option<usize>> = vec![None; obs.len()];
+
         let mut data = CollectedData::new(
             obs,
             probs,
+            perms,
             vec![],
             vec![],
             vec![],
@@ -182,4 +184,3 @@ mod tests {
         assert!(data.additional_data.contains_key("remaining_values"));
     }
 }
-
```

rust/src/collector/collector.rs

Lines changed: 7 additions & 0 deletions
```diff
@@ -24,6 +24,8 @@ pub struct CollectedData {
     pub obs: Vec<Vec<usize>>,
     /// Logits (action probabilities) at each timestep
     pub logits: Vec<Vec<f32>>,
+    /// Optional permutation index used at each timestep (-1 if none)
+    pub perms: Vec<Option<usize>>,
     /// Value estimates at each timestep
     pub values: Vec<f32>,
     /// Rewards received at each timestep
@@ -48,13 +50,15 @@ impl CollectedData {
     pub fn new(
         obs: Vec<Vec<usize>>,
         logits: Vec<Vec<f32>>,
+        perms: Vec<Option<usize>>,
         values: Vec<f32>,
         rewards: Vec<f32>,
         actions: Vec<usize>,
     ) -> Self {
         CollectedData {
             obs,
             logits,
+            perms,
             values,
             rewards,
             actions,
@@ -67,6 +71,7 @@ impl CollectedData {
         // Append observations and logits (2D vectors)
         self.obs.extend(other.obs.iter().cloned());
         self.logits.extend(other.logits.iter().cloned());
+        self.perms.extend(other.perms.iter().cloned());
 
         // Append 1D vectors
         self.values.extend(&other.values);
@@ -98,6 +103,7 @@ mod tests {
         let d1 = CollectedData::new(
             vec![vec![0]],
             vec![vec![0.1]],
+            vec![Some(0)],
             vec![0.2],
             vec![0.3],
             vec![1],
@@ -106,6 +112,7 @@ mod tests {
         let d2 = CollectedData::new(
             vec![vec![1]],
             vec![vec![0.4]],
+            vec![None],
             vec![0.5],
             vec![0.6],
             vec![0],
```

rust/src/collector/ppo.rs

Lines changed: 7 additions & 5 deletions
```diff
@@ -42,13 +42,13 @@ impl PPOCollector {
         &self,
         env: &dyn Env,
         policy: &Policy,
-    ) -> (Vec<usize>, Vec<f32>, usize, f32, f32) {
+    ) -> (Vec<usize>, Vec<f32>, usize, f32, f32, Option<usize>) {
         let obs = env.observe(); // Vec<f32> or whatever your Env returns
         let masks = env.masks();
         let reward = env.reward();
-        let (logits, value) = policy.forward(obs.clone(), masks);
+        let (logits, value, perm_idx) = policy.forward_with_perm(obs.clone(), masks);
         let action = sample_from_logits(&logits);
-        (obs, logits, action, value, reward)
+        (obs, logits, action, value, reward, perm_idx)
     }
 
     fn single_collect(
@@ -64,14 +64,16 @@ impl PPOCollector {
         let mut vals = Vec::new();
         let mut rews = Vec::new();
         let mut acts = Vec::new();
+        let mut perms = Vec::new();
 
         loop {
-            let (obs, log_prob, act, val, rew) = self.get_step_data(&*env, policy);
+            let (obs, log_prob, act, val, rew, perm_idx) = self.get_step_data(&*env, policy);
             obss.push(obs);
             log_probs.push(log_prob);
             vals.push(val);
             rews.push(rew);
             acts.push(act);
+            perms.push(perm_idx);
 
             if env.is_final() { break; }
             env.step(act);
@@ -92,6 +94,7 @@ impl PPOCollector {
         let mut data = CollectedData::new(
             obss,
             log_probs,
+            perms,
             vals,
             rews,
             acts,
@@ -177,4 +180,3 @@ mod tests {
         assert!(data.additional_data.contains_key("rets"));
     }
 }
-
```
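The new collection loop can be mirrored in Python with duck-typed `env`/`policy` stand-ins. This is an illustration of the control flow (names follow the Rust above; greedy action choice stands in for sampling), not the twisteRL Python API:

```python
def single_collect(env, policy):
    """One episode; stores the permutation index used at each timestep."""
    obss, log_probs, vals, rews, acts, perms = [], [], [], [], [], []
    while True:
        obs, masks = env.observe(), env.masks()
        logits, value, perm_idx = policy.forward_with_perm(obs, masks)
        act = max(range(len(logits)), key=logits.__getitem__)  # greedy stand-in
        obss.append(obs)
        log_probs.append(logits)
        vals.append(value)
        rews.append(env.reward())
        acts.append(act)
        perms.append(perm_idx)
        if env.is_final():
            break
        env.step(act)
    return obss, log_probs, vals, rews, acts, perms
```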

rust/src/nn/policy.rs

Lines changed: 14 additions & 9 deletions
```diff
@@ -32,31 +32,36 @@ impl Policy {
     }
 
     pub fn predict(&self, obs: Vec<usize>, masks: Vec<bool>) -> (Vec<f32>, f32) {
-        // Forward of the action net
-        let (action_logits, value) = self._raw_predict(obs, self.get_perm_id());
+        let (exp_masked_probs, value, _) = self.predict_with_perm(obs, masks);
+        (exp_masked_probs, value)
+    }
+
+    pub fn predict_with_perm(&self, obs: Vec<usize>, masks: Vec<bool>) -> (Vec<f32>, f32, Option<usize>) {
+        let (action_logits, value, perm_idx) = self.forward_with_perm(obs, masks.clone());
 
         // Apply masks to the actions
         let mut exp_masked_probs: Vec<f32> = action_logits.iter().zip(masks.iter()).map(|(&a, &m)| if m {a.exp()} else {0.0}).collect();
 
-        // TODO: apply noise to the actions
-
         // Normalize actions
         let action_probs_sum: f32 = exp_masked_probs.iter().sum();
         exp_masked_probs = exp_masked_probs.iter().map(|&v| v / (action_probs_sum + 0.000001)).collect();
-        (exp_masked_probs, value)
+        (exp_masked_probs, value, perm_idx)
     }
 
-
     pub fn forward(&self, obs: Vec<usize>, masks: Vec<bool>) -> (Vec<f32>, f32) {
-        // Similar to predict but outputs unnormalized logits instead of probabilities
+        let (masked_logits, value, _) = self.forward_with_perm(obs, masks);
+        (masked_logits, value)
+    }
 
+    pub fn forward_with_perm(&self, obs: Vec<usize>, masks: Vec<bool>) -> (Vec<f32>, f32, Option<usize>) {
         // Forward of the action net
-        let (action_logits, value) = self._raw_predict(obs, self.get_perm_id());
+        let perm_idx = self.get_perm_id();
+        let (action_logits, value) = self._raw_predict(obs, perm_idx);
 
         // Apply masks to the actions
         let masked_logits: Vec<f32> = action_logits.iter().zip(masks.iter()).map(|(&a, &m)| if m {a} else {-1e10}).collect();
 
-        (masked_logits, value)
+        (masked_logits, value, perm_idx)
     }
 
     fn get_perm_id(&self) -> Option<usize> {
```
rust/src/python_interface/collector.rs

Lines changed: 16 additions & 1 deletion
```diff
@@ -37,9 +37,11 @@ impl PyCollectedData {
         values: Vec<f32>,
         rewards: Vec<f32>,
         actions: Vec<usize>,
+        perms: Option<Vec<Option<usize>>>,
     ) -> Self {
+        let perms = perms.unwrap_or_else(|| vec![None; obs.len()]);
         PyCollectedData {
-            inner: CollectedData::new(obs, logits, values, rewards, actions),
+            inner: CollectedData::new(obs, logits, perms, values, rewards, actions),
         }
     }
 
@@ -69,6 +71,19 @@ impl PyCollectedData {
     fn set_logits(&mut self, logits: Vec<Vec<f32>>) {
         self.inner.logits = logits;
     }
+
+    #[getter]
+    fn get_perms(&self) -> Vec<i64> {
+        self.inner.perms.iter().map(|opt| opt.map(|v| v as i64).unwrap_or(-1)).collect()
+    }
+
+    #[setter]
+    fn set_perms(&mut self, perms: Vec<i64>) {
+        self.inner.perms = perms
+            .into_iter()
+            .map(|v| if v < 0 { None } else { Some(v as usize) })
+            .collect();
+    }
 
     #[getter]
     fn get_values(&self) -> Vec<f32> {
```
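On the Python side, `perms` round-trips through a `-1` sentinel: the getter maps `None` to `-1` and the setter maps any negative value back to `None`. The conversion, sketched in Python (helper names are illustrative, not part of the bindings):

```python
def perms_to_py(perms):
    """Mirror of the Rust getter: Option<usize> -> i64, None becomes -1."""
    return [-1 if p is None else int(p) for p in perms]

def perms_from_py(values):
    """Mirror of the Rust setter: negative values become None."""
    return [None if v < 0 else v for v in values]
```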
