How to debug training getting stuck? #354

@howard-yen

Hi, I'm running RL training with a custom environment that's based on the search tool environment. My current logs look something like this:

====== End Trajectory Group ======
tinker_cookbook.utils.misc_utils:20 [INFO] Starting assemble_training_data
tinker_cookbook.utils.misc_utils:23 [INFO] assemble_training_data took 1.07 seconds
tinker_cookbook.utils.misc_utils:20 [INFO] Starting train

The trajectories are sampled, but the run is stuck at the "Starting train" step: it has been there for hours, even though there are relatively few trajectories (<100). From what I understand, the optimization step runs on the Tinker servers, but I'm not sure what went wrong or how to debug this. Please let me know the best way to fix this! Happy to provide more information.
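
One generic way to see where the client process is blocked while it waits is Python's standard `faulthandler` module, which can print the stack of every thread once a call has run longer than a given time. The snippet below is only a minimal sketch and is not Tinker-specific: `run_with_hang_dump` and `slow_step` are hypothetical names, and `slow_step` merely stands in for whatever blocking call follows the "Starting train" log line.

```python
import faulthandler
import sys
import time


def run_with_hang_dump(fn, *args, dump_after_s=600.0, file=None, **kwargs):
    """Call fn(*args, **kwargs). While it runs, dump the stack of every thread
    to `file` each time `dump_after_s` seconds elapse, so a hang shows where
    the process is waiting."""
    out = file if file is not None else sys.stderr  # must expose fileno()
    faulthandler.dump_traceback_later(dump_after_s, repeat=True, file=out)
    try:
        return fn(*args, **kwargs)
    finally:
        # Stop the watchdog once the call returns or raises.
        faulthandler.cancel_dump_traceback_later()


def slow_step(seconds):
    # Hypothetical stand-in for the blocking training call.
    time.sleep(seconds)
    return "done"


# Example usage (assumption: the training call can be wrapped like this):
# run_with_hang_dump(slow_step, 3600, dump_after_s=600)
```

The dump only shows client-side stacks, so it can tell whether the process is waiting on a network response or stuck in local code, not what the server is doing. In async code the dump shows the event loop rather than the coroutine that is awaiting.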
