Conversation

@mina-parham
Contributor

No description provided.

@codecov

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 1.78571% with 55 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| transformerlab/routers/remote.py | 1.78% | 55 Missing ⚠️ |


```python
# Get the parent job
parent_job = job_service.job_get(parent_job_id)
if not parent_job:
    return {"status": "error", "message": f"Parent job {parent_job_id} not found"}
```
Member

Maybe we just ignore if the parent job id is invalid?

Contributor Author

I'm afraid that might cause errors in the future, since all the parameters we need come from the parent job when training from a specific checkpoint. If the parent job isn't valid, we won't have access to any of them.
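To illustrate the point above, here is a minimal sketch of why an invalid parent job id can't simply be ignored: the resume flow derives its launch parameters from the parent job's stored data. The function `get_resume_params` and the dict shapes are hypothetical, not the project's actual API; only `job_data` key names mirror the snippets in this thread.

```python
# Illustrative sketch (hypothetical helper, not the real API): resuming
# relies on the parent job's stored data, so a missing parent job leaves
# no parameters to build the new job from.

def get_resume_params(parent_job, checkpoint):
    if not parent_job:
        # Without the parent record we have no command, cluster name, or
        # checkpoint context, so failing loudly is safer than ignoring.
        return {"status": "error", "message": "Parent job not found"}
    data = parent_job.get("job_data", {})
    return {
        "status": "ok",
        "command": data.get("command"),
        "cluster_name": data.get("cluster_name"),
        "checkpoint": checkpoint,
    }

print(get_resume_params(None, "checkpoint-100")["status"])  # error
print(get_resume_params(
    {"job_data": {"command": "train.py", "cluster_name": "c1"}},
    "checkpoint-100",
)["command"])  # train.py
```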

```python
    return {"status": "error", "message": "Command contains null bytes"}

# Create a simple, meaningful task name for the resumed training
task_name = f"resume_training_{parent_job_id}"
```
Member

Not sure if this is necessary? I thought we were taking task names from the user? But I'm not aware of the flow, as I haven't tried it yet.

Contributor Author

So you're saying when we resume from a checkpoint we keep the same task name? Yeah, we can do that; I wasn't sure what to do at that step. We could either keep the same task_name, or append something to indicate it's a resumed training run, like f"{task_name}_something"?
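The suffix idea above can be sketched as a tiny naming helper. This is purely illustrative: `resumed_task_name` and the suffix format are assumptions, not what was merged.

```python
# Hypothetical naming helper for the discussion above: reuse the parent
# task's name but append a marker so resumed runs stay distinguishable.

def resumed_task_name(original_name: str, parent_job_id: str) -> str:
    # Appending a suffix (rather than inventing a fresh name) keeps the
    # resumed task grouped with its original in listings.
    return f"{original_name}_resumed_from_{parent_job_id}"

print(resumed_task_name("llama-finetune", "42"))
# llama-finetune_resumed_from_42
```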

```python
task_name = f"resume_training_{parent_job_id}"

# Use ALL parameters from parent job for resume (user just presses button)
cluster_name = parent_job_data.get("cluster_name")
```
Member

The way gpu-orchestration works, it won't let you provide the same name as the older cluster; it will add a number or something at the end. Maybe ask the user for a cluster name when they click restart? This might seem irrelevant, but it will affect other processes checking the health of the cluster.

Contributor Author

I think asking the user for the cluster name might make the UX worse? I can add a timestamp to the name, but since the orchestrator already handles the cluster name automatically, I'm not sure what the issue might be...
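The timestamp idea mentioned above could look roughly like this. A minimal sketch under stated assumptions: `resumed_cluster_name` and the name format are hypothetical, and the real orchestrator may apply its own deduplication on top.

```python
# Sketch of deriving a fresh cluster name from the parent's, so the
# orchestrator never sees an exact duplicate. Illustrative only.
from datetime import datetime, timezone

def resumed_cluster_name(parent_cluster_name: str) -> str:
    # A UTC timestamp suffix makes collisions with the old cluster name
    # (and with other resumes of the same job) very unlikely.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{parent_cluster_name}-resume-{stamp}"

name = resumed_cluster_name("train-cluster")
print(name.startswith("train-cluster-resume-"))  # True
```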

```python
# Create a new REMOTE job
job_data = {"task_name": task_name, "command": command, "cluster_name": cluster_name}

# Add optional parameters if provided
```
Member

Maybe we just merge both instead of having two of the same thing?

```python
# Add checkpoint metadata for resume training (will be set as env vars in orchestrator)
if checkpoint and parent_job_id:
    request_data["tlab_parent_job_id"] = parent_job_id
    request_data["tlab_checkpoint_name"] = checkpoint
```
Member

Another thing, which is quite stupid but might be a good solution, would be to just store these in the job data. Since the lab facade can access that, it would be easier than going the env-var route. Sorry I asked you to do the env stuff, but I realize this might be easier, if you agree.

Member

Just to be clearer, I meant adding both of these in the job data of the newly created job instead

Contributor Author

No worries, yes, that would be simpler; added the changes.
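The agreed approach (storing the resume metadata in the new job's job data rather than in environment variables) can be sketched as below. The `tlab_parent_job_id` / `tlab_checkpoint_name` keys mirror the snippet earlier in the thread; the builder function itself is a hypothetical illustration, not the merged implementation.

```python
# Minimal sketch: put resume metadata in the new job's job_data so
# anything that can read the job record (e.g. the lab facade) sees it,
# with no env-var plumbing through the orchestrator.

def build_resume_job_data(task_name, command, cluster_name,
                          parent_job_id=None, checkpoint=None):
    job_data = {
        "task_name": task_name,
        "command": command,
        "cluster_name": cluster_name,
    }
    # Only mark the job as a resume when both pieces are present.
    if checkpoint and parent_job_id:
        job_data["tlab_parent_job_id"] = parent_job_id
        job_data["tlab_checkpoint_name"] = checkpoint
    return job_data

data = build_resume_job_data("t", "python train.py", "c1",
                             parent_job_id="7", checkpoint="checkpoint-100")
print(data["tlab_checkpoint_name"])  # checkpoint-100
```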

```python
# Build the logs endpoint URL
logs_url = f"{gpu_orchestrator_url}:{gpu_orchestrator_port}/api/v1/instances/requests/{request_id}/logs"

async def stream_logs():
```
Member

Not sure why there's a diff here?

Member

Could you restore whatever is on main?

Contributor Author

done

@mina-parham mina-parham closed this Nov 3, 2025
@mina-parham force-pushed the add/resume-training-checkpoint branch from e5d3dff to ae4970e on November 3, 2025 at 21:42
@mina-parham mina-parham reopened this Nov 3, 2025
```python
    Launch a remote instance via Lattice orchestrator. If job_id is provided, use existing job, otherwise create new one.
    If checkpoint and parent_job_id are provided, resume training from the specified checkpoint.
    """
    # If job_id is provided, use existing job, otherwise create a new one
```
Member

Why are we removing this logic?

Contributor Author
@mina-parham Nov 12, 2025

I didn't remove that part, only changed the way we handle it; it's still the same logic. As I remember, I had to do it this way to address one of the comments here, which required changing the implementation.

```python
    return {"status": "error", "message": "Original command not found in parent job data"}

# Validate command doesn't have problematic characters that could break shell execution
if '\x00' in command:
```
Member

Was there a specific issue that caused this? I used to run into carriage returns, but nothing else.

Contributor Author

removed it

```python
# Add resume metadata if resuming from checkpoint
if checkpoint and parent_job_id:
    data["resumed_from_checkpoint"] = checkpoint
    data["checkpoint_path"] = checkpoint_path
```
Member

Does this include the entire path, or just the ending part? On Lattice it would be present at a different path, so it might help to store just the trailing part, like "checkpoint-100".

Contributor Author

It's only the ending path; in the SDK we reconstruct the entire path from the job id and the ending path.
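The path convention described above can be sketched as follows. The base-directory layout and the `resolve_checkpoint_path` helper are assumptions for illustration; the real SDK may compose the path differently.

```python
# Sketch: only the trailing checkpoint directory name (e.g.
# "checkpoint-100") is stored, and the full path is rebuilt from the
# job id on whichever machine the job directory lives. posixpath is
# used because remote (Lattice) paths are POSIX-style.
import posixpath

def resolve_checkpoint_path(base_dir: str, job_id: str, checkpoint_name: str) -> str:
    # Storing only the ending part keeps the record portable across
    # machines where the job directory sits at a different root.
    return posixpath.join(base_dir, job_id, checkpoint_name)

print(resolve_checkpoint_path("/workspace/jobs", "42", "checkpoint-100"))
# /workspace/jobs/42/checkpoint-100
```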

Member
@deep1401 left a comment

Just minor nitpicks we can discuss and get this merged quickly. Sorry, I don't mean to block this!

@deep1401
Member

Checking in, in case you forgot these comments :)
I'll re-review whenever this is ready.

Member
@deep1401 left a comment

This worked for me perfectly; maybe just verify with @dadmobile once, in case he wants to try it before merging.

@mina-parham mina-parham merged commit 3f67d9c into main Nov 12, 2025
7 of 8 checks passed