
How to yield better and more accurate results? #225

@stxai

Description


I’ve been experimenting with the single-control inference examples from cosmos-transfer1, using both the Edge and Vis control branches.
I’m getting good visual results, but I'd like to understand how to target specific elements of the scene and achieve consistent outcomes.

Steps taken

I ran both official examples from the documentation:

Example 1: Single Control (Edge)

PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example1_single_control_edge \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
    --offload_text_encoder_model \
    --offload_guardrail_models \
    --num_gpus $NUM_GPU

Example 2: Single Control (Vis)

PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example1_single_control_vis \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_vis.json \
    --offload_text_encoder_model \
    --offload_guardrail_models \
    --num_gpus $NUM_GPU

Guardrail: disabled for shorter prompts
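For anyone reproducing this, the per-control behavior is driven by the `--controlnet_specs` JSON. Below is a minimal sketch of how I would build and tune such a spec programmatically; the field names (`prompt`, `input_video_path`, `edge`, `control_weight`) and the semantics of `control_weight` are assumptions inferred from the asset filenames above, not verified against the cosmos-transfer1 schema:

```python
import json

# Hypothetical single-control (Edge) spec. Field names and the
# control_weight semantics are assumptions, not verified against
# the cosmos-transfer1 schema -- check the repo's asset JSONs.
spec = {
    "prompt": "Two robotic arms wearing brown leather baseball gloves ...",
    "input_video_path": "assets/example_input.mp4",
    "edge": {
        # Assumed semantics: lower weight lets the text prompt dominate
        # (better color adherence), higher weight preserves input structure.
        "control_weight": 0.5,
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Sweeping `control_weight` over a few values and comparing outputs is probably the cheapest way to see how much freedom the prompt actually has.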

What I Tried

  1. Used both a Canny-edge-only video and a regular RGB video as input.
  2. Tried the distilled model.
  3. Tested with long descriptive prompts and short minimal prompts.
  4. Had to disable the guardrail to run shorter prompts.
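On item 1: when I pre-compute the Canny-edge-only input myself, the usual route is `cv2.Canny` per frame. As a dependency-free illustration of what such an input looks like, here is a gradient-magnitude threshold that produces a comparable binary edge map (real Canny adds Gaussian smoothing and hysteresis; this is only a stand-in):

```python
import numpy as np

def edge_map(gray: np.ndarray, thresh: float = 0.25) -> np.ndarray:
    """Binary edge map from normalized gradient magnitude.
    A simplified stand-in for cv2.Canny (no smoothing/hysteresis)."""
    g = gray.astype(np.float32) / 255.0
    gy, gx = np.gradient(g)          # derivatives along rows, cols
    mag = np.hypot(gx, gy)           # gradient magnitude
    return (mag > thresh).astype(np.uint8) * 255

# Synthetic frame with a vertical step edge: left half dark, right half bright.
frame = np.zeros((8, 8), dtype=np.uint8)
frame[:, 4:] = 255
edges = edge_map(frame)
print(edges[0])  # nonzero only in the columns around the step
```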

Prompts Used

Long prompt:

“The video is set in a modern, well-lit office environment with a sleek, minimalist design. The background features several people working at desks, indicating a busy workplace atmosphere. The main focus is on a robotic interaction at a counter. Two robotic arms, each wearing genuine brown leather baseball gloves with deep pockets and visible lacing, are seen handling a red and white patterned coffee cup with a black lid. The baseball gloves are oversized and padded, covering the robotic grippers completely with tan leather material and traditional webbing between the thumb and fingers. The arms are positioned in front of a woman who is standing on the opposite side of the counter...”

Short prompts:

“A robotic arm with a hand wearing a bulky Tuscan red leather glove”

“Two robotic arms wearing brown leather baseball gloves hand a coffee cup to a woman at an office counter”

Some of the outputs

Most noticeable is that the gloves do not take on the prompted color, either brown or Tuscan red.

(Three output images attached.)
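To make the color mismatch measurable rather than eyeballed, I sample the glove region of an output frame and compare its mean RGB against the prompted color. This is my own diagnostic sketch; the `TUSCAN_RED` sRGB values are an approximation I picked, not a standard:

```python
import numpy as np

# Approximate sRGB for "Tuscan red" -- my own assumption, not a standard value.
TUSCAN_RED = np.array([124, 48, 48], dtype=np.float32)

def mean_color_distance(frame: np.ndarray, mask: np.ndarray,
                        target: np.ndarray) -> float:
    """Euclidean distance between the mean RGB of the masked region
    and the target color (0.0 = exact match)."""
    region = frame[mask.astype(bool)].astype(np.float32)
    return float(np.linalg.norm(region.mean(axis=0) - target))

# Synthetic check: a frame whose masked patch is exactly the target color.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
frame[1:3, 1:3] = TUSCAN_RED.astype(np.uint8)
print(mean_color_distance(frame, mask, TUSCAN_RED))  # -> 0.0
```

Running this over frames from each `control_weight` / prompt variant gives a single number to compare, instead of judging the gloves by eye.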

Goal

At the moment I am not really interested in the background. I want to apply a style to a dataset of hands that have keypoint annotations, so I can test and train for cases where the hands are covered with, e.g., soft robotic gloves.
So far I have not been able to alter the style of the hands in the sample videos.
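Since my dataset already has keypoint annotations, one workaround I am considering (my own post-processing idea, not a documented cosmos-transfer1 feature) is to restrict the style change to the hands after inference: rasterize a mask from the hand keypoints and composite the stylized frame over the original only inside that mask, leaving the background untouched:

```python
import numpy as np

def keypoint_mask(shape, keypoints, radius=3):
    """Binary mask that is 1 within `radius` pixels of any (x, y) keypoint."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=np.uint8)
    for x, y in keypoints:
        mask |= ((xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2).astype(np.uint8)
    return mask

def composite(original, stylized, mask):
    """Keep stylized pixels only inside the hand mask; original elsewhere."""
    m = mask[..., None].astype(bool)
    return np.where(m, stylized, original)

# Toy example: one keypoint at (5, 5), uniform "stylized" frame.
orig = np.zeros((10, 10, 3), dtype=np.uint8)
styl = np.full((10, 10, 3), 200, dtype=np.uint8)
mask = keypoint_mask((10, 10), [(5, 5)], radius=2)
out = composite(orig, styl, mask)
print(out[5, 5], out[0, 0])  # stylized at the keypoint, original elsewhere
```

A real hand mask would need larger radii or a convex hull over the keypoints, but this keeps the background fixed regardless of what the model does to it.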
