Description
Required prerequisites
- I have read the documentation https://omnisafe.readthedocs.io.
- I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- Consider asking first in a Discussion.
Questions
Thanks for your great work!
I managed to integrate my environment following your example file train_from_custom_env.py. However, I failed to use a vectorized environment to speed up data collection.
First, I tried changing the parameter vector_env_nums directly in the config files, but it reported:
Processing rollout for epoch: 0... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
Traceback (most recent call last):
File "/home/southriver/omnisafe/examples/train_from_custom_env.py", line 160, in <module>
agent.learn()
File "/home/southriver/omnisafe/omnisafe/algorithms/algo_wrapper.py", line 180, in learn
ep_ret, ep_cost, ep_len = self.agent.learn()
File "/home/southriver/omnisafe/omnisafe/algorithms/on_policy/base/policy_gradient.py", line 259, in learn
self._env.rollout(
File "/home/southriver/omnisafe/omnisafe/adapter/onpolicy_adapter.py", line 94, in rollout
buffer.store(
File "/home/southriver/omnisafe/omnisafe/common/buffer/vector_onpolicy_buffer.py", line 99, in store
buffer.store(**{k: v[i] for k, v in data.items()})
File "/home/southriver/omnisafe/omnisafe/common/buffer/vector_onpolicy_buffer.py", line 99, in <dictcomp>
buffer.store(**{k: v[i] for k, v in data.items()})
IndexError: index 1 is out of bounds for dimension 0 with size 1
In this scenario, my step function was implemented to run only a single environment's forward simulation, so it returns obs of shape [obs_space_size], reward of shape [1], and cost of shape [1].
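If I read the buffer code correctly, the vectorized buffer indexes one slot per environment along dimension 0, so single-env return shapes trigger exactly this error as soon as vector_env_nums > 1. A minimal sketch of the mismatch (shapes taken from the traceback above; vector_env_nums=2 is an assumed example value):

```python
import torch

# A single-env step returns reward of shape [1]: no environment dimension.
reward = torch.tensor([0.5])

# The vector buffer does roughly `v[i]` for each env index i:
for i in range(2):  # pretend vector_env_nums == 2
    try:
        slot = reward[i]
        print(f"env {i}: stored {slot.item()}")
    except IndexError as exc:
        print(f"env {i}: {exc}")  # env 1 is out of bounds for a size-1 dim
```

So a size-1 leading dimension can only feed one buffer slot; the second env index has nothing to read.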
Then I tried to implement step to collect batched forward simulations, making the return elements satisfy obs[num_env, obs_space_size], reward[num_env, 1], and cost[num_env, 1], but this also failed:
Processing rollout for epoch: 0... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
Traceback (most recent call last):
File "/home/southriver/omnisafe/examples/train_from_custom_env.py", line 182, in <module>
agent.learn()
File "/home/southriver/omnisafe/omnisafe/algorithms/algo_wrapper.py", line 180, in learn
ep_ret, ep_cost, ep_len = self.agent.learn()
File "/home/southriver/omnisafe/omnisafe/algorithms/on_policy/base/policy_gradient.py", line 259, in learn
self._env.rollout(
File "/home/southriver/omnisafe/omnisafe/adapter/onpolicy_adapter.py", line 88, in rollout
self._log_value(reward=reward, cost=cost, info=info)
File "/home/southriver/omnisafe/omnisafe/adapter/onpolicy_adapter.py", line 155, in _log_value
self._ep_ret += info.get('original_reward', reward).cpu()
RuntimeError: output with shape [1] doesn't match the broadcast shape [1, 4]
I also tried to reshape the reward and cost into [1], but that also failed:
Traceback (most recent call last):
File "/home/southriver/omnisafe/examples/train_from_custom_env.py", line 186, in <module>
agent.learn()
File "/home/southriver/omnisafe/omnisafe/algorithms/algo_wrapper.py", line 180, in learn
ep_ret, ep_cost, ep_len = self.agent.learn()
File "/home/southriver/omnisafe/omnisafe/algorithms/on_policy/base/policy_gradient.py", line 259, in learn
self._env.rollout(
File "/home/southriver/omnisafe/omnisafe/adapter/onpolicy_adapter.py", line 94, in rollout
buffer.store(
File "/home/southriver/omnisafe/omnisafe/common/buffer/vector_onpolicy_buffer.py", line 99, in store
buffer.store(**{k: v[i] for k, v in data.items()})
File "/home/southriver/omnisafe/omnisafe/common/buffer/onpolicy_buffer.py", line 145, in store
self.data[key][self.ptr] = value
RuntimeError: expand(torch.cuda.FloatTensor{[4, 185]}, size=[185]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)
This is my original implementation of the forward simulation:
def step(
    self,
    action: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, dict]:
    self._count += 1
    # obs = torch.as_tensor(self._observation_space.sample())
    # reward = 2 * torch.as_tensor(random.random())  # noqa
    # cost = 2 * torch.as_tensor(random.random())  # noqa
    # terminated = torch.as_tensor(random.random() > 0.9)  # noqa
    # prepare action; a_in is the real action input to the simulator
    a_in = [
        (action[0] + 1) / 4 * 3,
        action[1],
    ]
    # forward simulation
    latest_scan, distance, cos, sin, collision, goal, a, reward, cost = self.sim.step(
        lin_velocity=a_in[0].item(), ang_velocity=a_in[1].item()
    )
    # prepare observation
    latest_scan = np.array(latest_scan)
    inf_mask = np.isinf(latest_scan)
    latest_scan[inf_mask] = 7.0  # max range
    max_bins = 180
    bin_size = int(np.ceil(len(latest_scan) / max_bins))
    min_values = []
    for i in range(0, len(latest_scan), bin_size):
        # get the current bin
        bin = latest_scan[i : i + min(bin_size, len(latest_scan) - i)]
        # find the minimum value in the current bin and append it to min_values
        min_values.append(min(bin) / 7)
    distance /= 100  # normalize to stay within [0, 1]
    lin_vel = (action[0] + 1) / 2  # action is in [-1, 1]; map to [0, 1]
    ang_vel = (action[1] + 1) / 2
    state = min_values + [distance, cos, sin, lin_vel, ang_vel]
    # process data types
    obs = torch.as_tensor(state, dtype=torch.float32).to(self.device)
    reward = torch.as_tensor(reward, dtype=torch.float32).to(self.device)
    cost = torch.as_tensor(cost, dtype=torch.float32).to(self.device)
    terminated = torch.as_tensor(goal, dtype=torch.float32).to(self.device)
    truncated = torch.as_tensor(self._count > self.max_episode_steps, dtype=torch.float32).to(self.device)
    return obs, reward, cost, terminated, truncated, {'final_observation': obs, 'cost': cost}
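As a side note, the per-bin-minimum loop above can be vectorized with NumPy, which should help once the step function is batched. A sketch that is intended to be behavior-equivalent for the shapes in my code (7.0 max range and 180 bins as above; the function name bin_scan is mine):

```python
import numpy as np

def bin_scan(latest_scan, max_bins: int = 180, max_range: float = 7.0) -> np.ndarray:
    """Per-bin minimum of a 1-D lidar scan, normalized by max_range."""
    scan = np.asarray(latest_scan, dtype=np.float64).copy()
    scan[np.isinf(scan)] = max_range  # clamp out-of-range readings
    bin_size = int(np.ceil(len(scan) / max_bins))
    # pad the tail with +inf so padding never wins a per-bin minimum
    pad = (-len(scan)) % bin_size
    padded = np.pad(scan, (0, pad), constant_values=np.inf)
    return padded.reshape(-1, bin_size).min(axis=1) / max_range

scan = np.full(360, 3.5)
print(bin_scan(scan).shape)  # (360 readings, bin_size 2) -> 180 bins
```

For a batched step this extends naturally by reshaping to (num_env, -1, bin_size) and taking the minimum over the last axis.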