Add first draft of resuming training from latest checkpoint #638
Conversation
Codecov Report: ❌ Patch coverage is …
…ransformerlab/transformerlab-api into add/resume-training-checkpoint
# Get the parent job
parent_job = job_service.job_get(parent_job_id)
if not parent_job:
    return {"status": "error", "message": f"Parent job {parent_job_id} not found"}
Maybe we should just ignore it if the parent job id is invalid?
I'm afraid that might cause errors down the line, since all the parameters we need come from the parent job when training from a specific checkpoint. If the parent job isn't valid, we won't have access to them.
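For illustration, a rough sketch of the fail-fast check being described, assuming `job_service.job_get` returns `None` for unknown ids; the keys read from the parent job's data below are placeholders, not necessarily the exact fields used in the PR:

```python
# Sketch only: resuming needs the parent job, so bail out early if it can't be loaded.
# The keys read from parent_job_data are illustrative placeholders.
def get_resume_parameters(job_service, parent_job_id):
    parent_job = job_service.job_get(parent_job_id)
    if not parent_job:
        # Without the parent job we have no command, cluster name, or task name to reuse.
        return None, {"status": "error", "message": f"Parent job {parent_job_id} not found"}

    parent_job_data = parent_job.get("job_data", {})
    params = {
        "command": parent_job_data.get("command"),
        "cluster_name": parent_job_data.get("cluster_name"),
        "task_name": parent_job_data.get("task_name"),
    }
    return params, None
```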
| return {"status": "error", "message": "Command contains null bytes"} | ||
|
|
||
| # Create a simple, meaningful task name for the resumed training | ||
| task_name = f"resume_training_{parent_job_id}" |
Not sure if this is necessary? I thought we were taking task names from the user? But I'm not aware of the flow, as I haven't tried it yet.
So you're saying when we resume from a checkpoint we keep the same task name? Yeah, we can do that; I wasn't sure what to do at that step. We could either keep the same task_name or append something to indicate it's a resumed run, like f"{task_name}_something"?
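A small sketch of the suffix idea, purely illustrative since the naming scheme is still open here:

```python
# Illustrative: keep the parent's task name but mark the run as resumed,
# so the lineage stays visible in the task list.
parent_task_name = parent_job_data.get("task_name") or f"job_{parent_job_id}"
task_name = f"{parent_task_name}_resumed_{checkpoint}"  # e.g. "my_finetune_resumed_checkpoint-100"
```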
task_name = f"resume_training_{parent_job_id}"

# Use ALL parameters from parent job for resume (user just presses button)
cluster_name = parent_job_data.get("cluster_name")
The way gpu-orchestration works, it won't let you provide the same name as the older cluster; it will add a number or something at the end. Maybe ask the user for a cluster name when they click restart? This might seem irrelevant, but it will affect other processes checking the health of the cluster.
I think asking the user for the cluster name might make the UX worse? I can add a timestamp to the name, but since the orchestrator already handles the cluster name automatically, I'm not sure what the issue would be...
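A sketch of the timestamp option being suggested, assuming the parent's cluster name is available in `parent_job_data`:

```python
import time

# Sketch: reuse the parent's cluster name but append a timestamp so the new
# cluster name is unique without asking the user for anything.
base_cluster_name = parent_job_data.get("cluster_name") or "cluster"
cluster_name = f"{base_cluster_name}-resume-{int(time.time())}"
```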
transformerlab/routers/remote.py
Outdated
# Create a new REMOTE job
job_data = {"task_name": task_name, "command": command, "cluster_name": cluster_name}

# Add optional parameters if provided
Maybe we could just merge both instead of having two copies of the same thing?
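Roughly what the merge could look like; the optional keys below are placeholders, not the endpoint's real parameter names:

```python
# Sketch: build a single payload instead of two near-identical dicts.
# "accelerators" and "disk_size" are placeholder keys for whatever optional
# launch parameters the endpoint actually accepts.
optional = {
    "accelerators": parent_job_data.get("accelerators"),
    "disk_size": parent_job_data.get("disk_size"),
}
job_data = {
    "task_name": task_name,
    "command": command,
    "cluster_name": cluster_name,
    **{k: v for k, v in optional.items() if v is not None},  # drop unset values
}
```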
transformerlab/routers/remote.py
Outdated
# Add checkpoint metadata for resume training (will be set as env vars in orchestrator)
if checkpoint and parent_job_id:
    request_data["tlab_parent_job_id"] = parent_job_id
    request_data["tlab_checkpoint_name"] = checkpoint
Another thing, which might sound silly but could be a good solution, would be to just store these in the job data. Since the lab facade can access that, it would be easier than going the env-var route. Sorry I asked you to do the env stuff, but I realize this might be easier, if you agree.
Just to be clearer, I meant adding both of these to the job data of the newly created job instead.
No worries, yes, that would be simpler; added the changes.
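For reference, a sketch of the job-data approach agreed on here; how the dict is actually persisted depends on the job service API and isn't shown:

```python
# Sketch: attach the resume metadata to the newly created job's data so the
# lab facade / SDK can read it back later, instead of passing env vars.
new_job_data = {
    "task_name": task_name,
    "command": command,
    "cluster_name": cluster_name,
    "tlab_parent_job_id": parent_job_id,
    "tlab_checkpoint_name": checkpoint,
}
# Persisting new_job_data is left to whatever job-creation call the router uses.
```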
# Build the logs endpoint URL
logs_url = f"{gpu_orchestrator_url}:{gpu_orchestrator_port}/api/v1/instances/requests/{request_id}/logs"

async def stream_logs():
Not sure why there's a diff here?
Could you restore whatever is on main?
Done.
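For context, such a log-streaming proxy usually looks roughly like the sketch below (generic httpx + StreamingResponse pattern, not the exact code on main):

```python
import httpx
from fastapi.responses import StreamingResponse

# Generic sketch of proxying the orchestrator's log endpoint as a stream.
async def stream_logs():
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", logs_url) as response:
            async for chunk in response.aiter_bytes():
                yield chunk

# The route would then return something like:
# return StreamingResponse(stream_logs(), media_type="text/plain")
```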
Force-pushed from e5d3dff to ae4970e
Launch a remote instance via Lattice orchestrator. If job_id is provided, use existing job, otherwise create new one.
If checkpoint and parent_job_id are provided, resume training from the specified checkpoint.
"""
# If job_id is provided, use existing job, otherwise create a new one
Why are we removing this logic?
I didn't remove that part, I only needed to change the way we handle it; it's still the same logic. As I remember, I had to do it this way to address one of the comments here, which required changing the implementation.
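The branching being discussed is essentially the following; the function names and signatures here are illustrative stand-ins, not the router's actual API:

```python
# Sketch: reuse the job when a job_id is supplied, otherwise create a new one.
# job_service calls below are illustrative, not the repo's exact helpers.
if job_id is not None:
    job = job_service.job_get(job_id)
    if not job:
        return {"status": "error", "message": f"Job {job_id} not found"}
else:
    job = job_service.job_create(job_data)  # hypothetical creation helper
```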
transformerlab/routers/remote.py
Outdated
| return {"status": "error", "message": "Original command not found in parent job data"} | ||
|
|
||
| # Validate command doesn't have problematic characters that could break shell execution | ||
| if '\x00' in command: |
Was there a specific issue that caused this? I used to have carriage returns but nothing else.
Removed it.
# Add resume metadata if resuming from checkpoint
if checkpoint and parent_job_id:
    data["resumed_from_checkpoint"] = checkpoint
    data["checkpoint_path"] = checkpoint_path
Does this include the entire path or just the final segment? On Lattice it would live at a different path, so it might help to store just something like "checkpoint-100".
It's only the final segment; in the SDK we reconstruct the full path from the job id and that segment.
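A sketch of that reconstruction on the SDK side; the base directory is a placeholder, not the real Lattice layout:

```python
import os

# Sketch: rebuild the full checkpoint path from the parent job id and the
# stored final segment (e.g. "checkpoint-100"). jobs_root is a placeholder.
def resolve_checkpoint_path(jobs_root: str, parent_job_id: str, checkpoint_name: str) -> str:
    return os.path.join(jobs_root, str(parent_job_id), checkpoint_name)

# resolve_checkpoint_path("/workspace/jobs", "1234", "checkpoint-100")
# -> "/workspace/jobs/1234/checkpoint-100"
```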
deep1401
left a comment
Just minor nitpicks we can discuss and get this merged quickly. Sorry, I don't mean to block this!
Checking in in case you forgot these comments :)
deep1401
left a comment
This worked perfectly for me; maybe just verify with @dadmobile once in case he wants to try it before merging.