
How to yield better and more accurate results? #225

@stxai

Description


I’ve been experimenting with the single-control inference examples from cosmos-transfer1, using both the Edge and Vis control branches.
I’m getting good visual results, but I'd like to understand how to target specific elements of the scene and achieve consistent outcomes.

Steps taken

I ran both official examples from the documentation:

Example 1: Single Control (Edge)

PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example1_single_control_edge \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
    --offload_text_encoder_model \
    --offload_guardrail_models \
    --num_gpus $NUM_GPU

Example 2: Single Control (Vis)

PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/example1_single_control_vis \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_vis.json \
    --offload_text_encoder_model \
    --offload_guardrail_models \
    --num_gpus $NUM_GPU

Guardrail: disabled for shorter prompts
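For anyone reproducing this, the per-control behavior is driven by the `--controlnet_specs` JSON. Below is a minimal sketch of how I would build and tune such a spec programmatically; the field names (`prompt`, `input_video_path`, `edge`, `control_weight`) and the semantics of `control_weight` are assumptions inferred from the asset filenames above, not verified against the cosmos-transfer1 schema:

```python
import json

# Hypothetical single-control (Edge) spec. Field names and the
# control_weight semantics are assumptions, not verified against
# the cosmos-transfer1 schema -- check the repo's asset JSONs.
spec = {
    "prompt": "Two robotic arms wearing brown leather baseball gloves ...",
    "input_video_path": "assets/example_input.mp4",
    "edge": {
        # Assumed semantics: lower weight lets the text prompt dominate
        # (better color adherence), higher weight preserves input structure.
        "control_weight": 0.5,
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Sweeping `control_weight` over a few values and comparing outputs is probably the cheapest way to see how much freedom the prompt actually has.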

What I Tried

  1. Used both a Canny-edge-only video and a regular RGB video as input.
  2. Tried the distilled model.
  3. Tested with long descriptive prompts and short minimal prompts.
  4. Had to disable the guardrail to run shorter prompts.
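On item 1: when I pre-compute the Canny-edge-only input myself, the usual route is `cv2.Canny` per frame. As a dependency-free illustration of what such an input looks like, here is a gradient-magnitude threshold that produces a comparable binary edge map (real Canny adds Gaussian smoothing and hysteresis; this is only a stand-in):

```python
import numpy as np

def edge_map(gray: np.ndarray, thresh: float = 0.25) -> np.ndarray:
    """Binary edge map from normalized gradient magnitude.
    A simplified stand-in for cv2.Canny (no smoothing/hysteresis)."""
    g = gray.astype(np.float32) / 255.0
    gy, gx = np.gradient(g)          # derivatives along rows, cols
    mag = np.hypot(gx, gy)           # gradient magnitude
    return (mag > thresh).astype(np.uint8) * 255

# Synthetic frame with a vertical step edge: left half dark, right half bright.
frame = np.zeros((8, 8), dtype=np.uint8)
frame[:, 4:] = 255
edges = edge_map(frame)
print(edges[0])  # nonzero only in the columns around the step
```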

Prompts Used

Long prompt:

“The video is set in a modern, well-lit office environment with a sleek, minimalist design. The background features several people working at desks, indicating a busy workplace atmosphere. The main focus is on a robotic interaction at a counter. Two robotic arms, each wearing genuine brown leather baseball gloves with deep pockets and visible lacing, are seen handling a red and white patterned coffee cup with a black lid. The baseball gloves are oversized and padded, covering the robotic grippers completely with tan leather material and traditional webbing between the thumb and fingers. The arms are positioned in front of a woman who is standing on the opposite side of the counter...”

Short prompts:

“A robotic arm with a hand wearing a bulky Tuscan red leather glove”

“Two robotic arms wearing brown leather baseball gloves hand a coffee cup to a woman at an office counter”

Some of the outputs

Most noticeable is that the gloves do not take on the prompted color, either brown or Tuscan red.

(Three output images attached.)
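To make the color mismatch measurable rather than eyeballed, I sample the glove region of an output frame and compare its mean RGB against the prompted color. This is my own diagnostic sketch; the `TUSCAN_RED` sRGB values are an approximation I picked, not a standard:

```python
import numpy as np

# Approximate sRGB for "Tuscan red" -- my own assumption, not a standard value.
TUSCAN_RED = np.array([124, 48, 48], dtype=np.float32)

def mean_color_distance(frame: np.ndarray, mask: np.ndarray,
                        target: np.ndarray) -> float:
    """Euclidean distance between the mean RGB of the masked region
    and the target color (0.0 = exact match)."""
    region = frame[mask.astype(bool)].astype(np.float32)
    return float(np.linalg.norm(region.mean(axis=0) - target))

# Synthetic check: a frame whose masked patch is exactly the target color.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
frame[1:3, 1:3] = TUSCAN_RED.astype(np.uint8)
print(mean_color_distance(frame, mask, TUSCAN_RED))  # -> 0.0
```

Running this over frames from each `control_weight` / prompt variant gives a single number to compare, instead of judging the gloves by eye.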

Goal

At the moment I am not really interested in the background. I want to apply a style to a dataset of hands that have keypoint annotations, so I can test and train for cases where the hands are covered with, e.g., soft robotic gloves.
So far I have not been able to alter the style of the hands in the sample videos.
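Since my dataset already has keypoint annotations, one workaround I am considering (my own post-processing idea, not a documented cosmos-transfer1 feature) is to restrict the style change to the hands after inference: rasterize a mask from the hand keypoints and composite the stylized frame over the original only inside that mask, leaving the background untouched:

```python
import numpy as np

def keypoint_mask(shape, keypoints, radius=3):
    """Binary mask that is 1 within `radius` pixels of any (x, y) keypoint."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=np.uint8)
    for x, y in keypoints:
        mask |= ((xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2).astype(np.uint8)
    return mask

def composite(original, stylized, mask):
    """Keep stylized pixels only inside the hand mask; original elsewhere."""
    m = mask[..., None].astype(bool)
    return np.where(m, stylized, original)

# Toy example: one keypoint at (5, 5), uniform "stylized" frame.
orig = np.zeros((10, 10, 3), dtype=np.uint8)
styl = np.full((10, 10, 3), 200, dtype=np.uint8)
mask = keypoint_mask((10, 10), [(5, 5)], radius=2)
out = composite(orig, styl, mask)
print(out[5, 5], out[0, 0])  # stylized at the keypoint, original elsewhere
```

A real hand mask would need larger radii or a convex hull over the keypoints, but this keeps the background fixed regardless of what the model does to it.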
