Add first draft of resuming training from latest checkpoint #638
Conversation
Codecov Report: ❌ Patch coverage is …
…ransformerlab/transformerlab-api into add/resume-training-checkpoint
# Get the parent job
parent_job = job_service.job_get(parent_job_id)
if not parent_job:
    return {"status": "error", "message": f"Parent job {parent_job_id} not found"}
Maybe we should just ignore it if the parent job id is invalid?
I'm afraid that might cause errors down the line, since all the parameters we need come from the parent job when training from a specific checkpoint. If the parent job isn't valid, we won't have access to them.
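For illustration, a rough sketch of the fail-fast check being described, assuming `job_service.job_get` returns `None` for unknown ids; the keys read from the parent job's data below are placeholders, not necessarily the exact fields used in the PR:

```python
# Sketch only: resuming needs the parent job, so bail out early if it can't be loaded.
# The keys read from parent_job_data are illustrative placeholders.
def get_resume_parameters(job_service, parent_job_id):
    parent_job = job_service.job_get(parent_job_id)
    if not parent_job:
        # Without the parent job we have no command, cluster name, or task name to reuse.
        return None, {"status": "error", "message": f"Parent job {parent_job_id} not found"}

    parent_job_data = parent_job.get("job_data", {})
    params = {
        "command": parent_job_data.get("command"),
        "cluster_name": parent_job_data.get("cluster_name"),
        "task_name": parent_job_data.get("task_name"),
    }
    return params, None
```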
| return {"status": "error", "message": "Command contains null bytes"} | ||
|
|
||
| # Create a simple, meaningful task name for the resumed training | ||
| task_name = f"resume_training_{parent_job_id}" |
Not sure if this is necessary? I thought we were taking task names from the user? But I'm not aware of the flow, as I haven't tried it yet.
So you're saying when we resume from a checkpoint we keep the same task name? Yeah, we can do that; I wasn't sure what to do at that step. We could either keep the same task_name or append something to indicate it's a resumed run, like f"{task_name}_something"?
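A small sketch of the suffix idea, purely illustrative since the naming scheme is still open here:

```python
# Illustrative: keep the parent's task name but mark the run as resumed,
# so the lineage stays visible in the task list.
parent_task_name = parent_job_data.get("task_name") or f"job_{parent_job_id}"
task_name = f"{parent_task_name}_resumed_{checkpoint}"  # e.g. "my_finetune_resumed_checkpoint-100"
```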
task_name = f"resume_training_{parent_job_id}"

# Use ALL parameters from parent job for resume (user just presses button)
cluster_name = parent_job_data.get("cluster_name")
The way gpu-orchestration works, it won't let you provide the same name as the older cluster; it will add a number or something at the end. Maybe ask the user for a cluster name when they click restart? This might seem irrelevant, but it will affect other processes checking the health of the cluster.
I think asking the user for the cluster name might make the UX worse? I can add a timestamp to the name, but since the orchestrator already handles the cluster name automatically, I'm not sure what the issue would be...
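A sketch of the timestamp option being suggested, assuming the parent's cluster name is available in `parent_job_data`:

```python
import time

# Sketch: reuse the parent's cluster name but append a timestamp so the new
# cluster name is unique without asking the user for anything.
base_cluster_name = parent_job_data.get("cluster_name") or "cluster"
cluster_name = f"{base_cluster_name}-resume-{int(time.time())}"
```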
transformerlab/routers/remote.py
Outdated
# Create a new REMOTE job
job_data = {"task_name": task_name, "command": command, "cluster_name": cluster_name}

# Add optional parameters if provided
Maybe we could just merge both instead of having two copies of the same thing?
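Roughly what the merge could look like; the optional keys below are placeholders, not the endpoint's real parameter names:

```python
# Sketch: build a single payload instead of two near-identical dicts.
# "accelerators" and "disk_size" are placeholder keys for whatever optional
# launch parameters the endpoint actually accepts.
optional = {
    "accelerators": parent_job_data.get("accelerators"),
    "disk_size": parent_job_data.get("disk_size"),
}
job_data = {
    "task_name": task_name,
    "command": command,
    "cluster_name": cluster_name,
    **{k: v for k, v in optional.items() if v is not None},  # drop unset values
}
```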
transformerlab/routers/remote.py
Outdated
# Add checkpoint metadata for resume training (will be set as env vars in orchestrator)
if checkpoint and parent_job_id:
    request_data["tlab_parent_job_id"] = parent_job_id
    request_data["tlab_checkpoint_name"] = checkpoint
Another thing, which might sound silly but could be a good solution, would be to just store these in the job data. Since the lab facade can access that, it would be easier than going the env-var route. Sorry I asked you to do the env stuff, but I realize this might be easier, if you agree.
Just to be clearer, I meant adding both of these to the job data of the newly created job instead.
No worries, yes, that would be simpler; added the changes.
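For reference, a sketch of the job-data approach agreed on here; how the dict is actually persisted depends on the job service API and isn't shown:

```python
# Sketch: attach the resume metadata to the newly created job's data so the
# lab facade / SDK can read it back later, instead of passing env vars.
new_job_data = {
    "task_name": task_name,
    "command": command,
    "cluster_name": cluster_name,
    "tlab_parent_job_id": parent_job_id,
    "tlab_checkpoint_name": checkpoint,
}
# Persisting new_job_data is left to whatever job-creation call the router uses.
```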
# Build the logs endpoint URL
logs_url = f"{gpu_orchestrator_url}:{gpu_orchestrator_port}/api/v1/instances/requests/{request_id}/logs"

async def stream_logs():
Not sure why there's a diff here?
Could you restore whatever is on main?
Done.
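For context, such a log-streaming proxy usually looks roughly like the sketch below (generic httpx + StreamingResponse pattern, not the exact code on main):

```python
import httpx
from fastapi.responses import StreamingResponse

# Generic sketch of proxying the orchestrator's log endpoint as a stream.
async def stream_logs():
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", logs_url) as response:
            async for chunk in response.aiter_bytes():
                yield chunk

# The route would then return something like:
# return StreamingResponse(stream_logs(), media_type="text/plain")
```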
Force-pushed from e5d3dff to ae4970e
Launch a remote instance via Lattice orchestrator. If job_id is provided, use existing job, otherwise create new one.
If checkpoint and parent_job_id are provided, resume training from the specified checkpoint.
"""
# If job_id is provided, use existing job, otherwise create a new one
Why are we removing this logic?
I didn't remove that part, I only needed to change the way we handle it; it's still the same logic. As I remember, I had to do it this way to address one of the comments here, which required changing the implementation.
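The branching being discussed is essentially the following; the function names and signatures here are illustrative stand-ins, not the router's actual API:

```python
# Sketch: reuse the job when a job_id is supplied, otherwise create a new one.
# job_service calls below are illustrative, not the repo's exact helpers.
if job_id is not None:
    job = job_service.job_get(job_id)
    if not job:
        return {"status": "error", "message": f"Job {job_id} not found"}
else:
    job = job_service.job_create(job_data)  # hypothetical creation helper
```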
transformerlab/routers/remote.py
Outdated
| return {"status": "error", "message": "Original command not found in parent job data"} | ||
|
|
||
| # Validate command doesn't have problematic characters that could break shell execution | ||
| if '\x00' in command: |
Was there a specific issue that caused this? I used to have carriage returns but nothing else.
Removed it.
# Add resume metadata if resuming from checkpoint
if checkpoint and parent_job_id:
    data["resumed_from_checkpoint"] = checkpoint
    data["checkpoint_path"] = checkpoint_path
Does this include the entire path or just the final segment? On Lattice it would live at a different path, so it might help to store just something like "checkpoint-100".
It's only the final segment; in the SDK we reconstruct the full path from the job id and that segment.
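A sketch of that reconstruction on the SDK side; the base directory is a placeholder, not the real Lattice layout:

```python
import os

# Sketch: rebuild the full checkpoint path from the parent job id and the
# stored final segment (e.g. "checkpoint-100"). jobs_root is a placeholder.
def resolve_checkpoint_path(jobs_root: str, parent_job_id: str, checkpoint_name: str) -> str:
    return os.path.join(jobs_root, str(parent_job_id), checkpoint_name)

# resolve_checkpoint_path("/workspace/jobs", "1234", "checkpoint-100")
# -> "/workspace/jobs/1234/checkpoint-100"
```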
deep1401
left a comment
Just minor nitpicks we can discuss and get this merged quickly. Sorry, I don't mean to block this!
Checking in in case you forgot these comments :)
deep1401
left a comment
This worked perfectly for me; maybe just verify with @dadmobile once in case he wants to try it before merging.