feat: add scanner tracker #422

aluu317 · 2024-12-17T20:36:41Z

Description of the change

Following a suggestion in a prior PR, this adds the HFResourceScanner TrainerCallback as a tracker.

To use it, the user needs to install HFResourceScanner in the environment and pass the following training args to enable it:

"trackers": ["hf_resource_scanner"]
"scanner_output_filename": "scanner_output.json" // optional, if not passed, the default value is used

See the test written in test_launch_script.py
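
As a rough illustration of the two arguments above, here is a minimal sketch that writes them into a JSON config file. Only the keys `trackers` and `scanner_output_filename` come from this PR; the helper name `write_scanner_config` and the file name `config.json` are hypothetical, and a real training config would carry many more fields.

```python
import json
from pathlib import Path


def write_scanner_config(path, output_filename=None):
    """Write a JSON config that enables the scanner tracker.

    Hypothetical helper for illustration only; it is not part of fms-hf-tuning.
    """
    # "trackers" and "scanner_output_filename" are the keys named in this PR.
    config = {"trackers": ["hf_resource_scanner"]}
    # The output filename is optional; when omitted, the default value is used.
    if output_filename is not None:
        config["scanner_output_filename"] = output_filename
    Path(path).write_text(json.dumps(config, indent=2))
    return config


if __name__ == "__main__":
    # "config.json" is an assumed file name for this sketch.
    print(write_scanner_config("config.json", "scanner_output.json"))
```

Leaving out `output_filename` produces a config with only the `trackers` key, which matches the "optional" note in the description.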

Related issue number

How to verify the PR

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

github-actions · 2024-12-17T20:36:53Z

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

kmehant · 2024-12-18T05:13:48Z

@ChanderG FYA

ashokponkumar · 2024-12-18T18:09:04Z

@dushyantbehl @ChanderG PTAL. Do we also support pushing scanner data to aim/wandb etc., or is that out of scope?

aluu317 · 2024-12-18T22:41:31Z

Also @ashokponkumar @ChanderG I tested this with our internal test, and even though the scanner output file was created, I ran into this error while the scanner was in use. I didn't see this problem with the approach in the prior PR, which used a flag rather than a tracker, but I do see it when writing to a file this way. The scanner output file exists but is empty. For now I just get the results from stdout until I can debug this. But if you know why, let me know~

ChanderG · 2024-12-19T06:45:24Z

@aluu317 That's weird. I think the file is created, but the write is failing? I don't think the tracker framework should be causing this. I should have clearly printed out the exception in Scanner, my bad.

That said, I am unable to reproduce the error. I tried running the tests and also manually ran the cli command with the output json file in different places (curr dir, /tmp dir etc.), and it's working in all cases. Tests pass, and inspecting the generated json files in the other cases shows output in the expected format.

ChanderG · 2025-01-09T05:05:33Z

@aluu317 Could you retry? It worked fine for me the last time I tried.

aluu317 · 2025-01-09T23:32:32Z

@ChanderG When you tried, did you use a single GPU and call the sft_trainer train function directly? In the small subset of models I tried, it works ok with 1 GPU. But with accelerate using an fsdp config for multi-GPU training, I have to use a txt file extension instead of json for the scanner output. I'm unsure what difference would cause the writing issue. The file with the json extension is created, but its content is empty.

ChanderG · 2025-01-14T09:38:46Z

@aluu317 You are right - I was able to repro the problem with accelerate/multi-gpu.

This was a bug in Scanner that I have now fixed in a new release v0.1.2. Can you update Scanner and retry?

aluu317 · 2025-01-14T18:48:38Z

@ChanderG Ahh how interesting! Thanks for the fix. I will test with the newer version. But I think this proves that the tracker code for this PR works though, independently of the json issue. Let's wrap up this PR if you're ok with reviewing/merging? It'd be nice to include this in our next fms-hf-tuning release (being worked on this week).

aluu317 · 2025-01-16T18:59:50Z

@ChanderG Verified with HFResourceScanner 0.1.2: the json file is written with content! Thank you

Signed-off-by: Angel Luu <[email protected]>

aluu317 · 2025-01-21T17:44:57Z

@ashokponkumar @kmehant @ChanderG Please review

kmehant · 2025-01-22T06:04:54Z

@aluu317 Should we include the hf resource scanner unit tests in our CI/CD runs? Currently they are being skipped. WDYT?

aluu317 · 2025-01-22T22:00:42Z

@kmehant They are being skipped because HFResourceScanner needs to be installed to run the tests. It's the same behavior as with the ML Flow tracker and aim stack tracker unit tests. Did you mean some other tests?
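
A minimal sketch of the skip behavior described above, assuming the package is importable under the module name `HFResourceScanner` (an assumption, as are the helper `is_installed` and the test class shown here; the actual tests in this repository may be structured differently):

```python
import importlib
import importlib.util
import unittest

# Assumed import name of the optional dependency.
SCANNER_MODULE = "HFResourceScanner"


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


class ScannerTrackerTest(unittest.TestCase):
    """Illustrative test case; it is skipped when the scanner is missing."""

    @unittest.skipUnless(
        is_installed(SCANNER_MODULE), "HFResourceScanner is not installed"
    )
    def test_scanner_importable(self):
        # Runs only in environments where the optional package is present.
        module = importlib.import_module(SCANNER_MODULE)
        self.assertIsNotNone(module)


if __name__ == "__main__":
    unittest.main()
```

With this pattern, installing the optional package in the test environment is enough for the test to run instead of being reported as skipped.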

kmehant

@aluu317 Do we plan to install the HFResourceScanner package and let the unit tests run? I know aim and ml flow need a bit of setup to run their unit tests, so those could be skipped. We can look at this in a separate PR as well. Thanks.

aluu317 · 2025-01-23T16:27:58Z

@kmehant We could install HFResourceScanner and have it turned on by default with our library. Is that the behavior we want? I can make a separate PR to always have it installed by default.

kmehant · 2025-02-05T04:20:07Z

@aluu317

#422 (comment)

While running the unit tests, we could install it so that the HFResourceScanner-based unit tests would run.

We can change tox.ini to accommodate this here:

fms-hf-tuning/tox.ini

Lines 4 to 9 in 5c03aa8

    
    [testenv]
    description = run unit tests
    deps =
        pytest>=7
    commands =
        pytest {posargs:tests}

For example, by adding the field extras = scanner-dev? More docs, if you are interested: https://tox.wiki/en/latest/config.html#python-run

aluu317 requested review from Ssukriti, anhuong, fabianlim and kmehant as code owners December 17, 2024 20:36

github-actions bot added the feat label Dec 17, 2024

aluu317 force-pushed the scanner_tracker branch 2 times, most recently from 7991726 to 2218219 Compare December 17, 2024 23:29

aluu317 force-pushed the scanner_tracker branch 2 times, most recently from 4615177 to 81000bb Compare December 18, 2024 22:36

aluu317 force-pushed the scanner_tracker branch from 81000bb to 7930625 Compare January 14, 2025 18:48

aluu317 mentioned this pull request Jan 16, 2025

chore(release): merge set of changes for v2.4.0 #439

Closed

aluu317 added 6 commits January 21, 2025 10:44

feat: add scanner tracker

7b21b53

Signed-off-by: Angel Luu <[email protected]>

Add installation for HFResourceScanner if enabled in Dockerfile

3ea9368

Signed-off-by: Angel Luu <[email protected]>

fix: remove extra }

39d070a

Signed-off-by: Angel Luu <[email protected]>

test: make the test more explicit

481ef10

Signed-off-by: Angel Luu <[email protected]>

chore: Run fmt

7841a9d

Signed-off-by: Angel Luu <[email protected]>

fix: fix unit tests

ee46cd2

Signed-off-by: Angel Luu <[email protected]>

aluu317 force-pushed the scanner_tracker branch from ad5d0ab to ee46cd2 Compare January 21, 2025 17:44

kmehant approved these changes Jan 23, 2025

aluu317 merged commit c0362ad into foundation-model-stack:main Jan 23, 2025
8 checks passed

aluu317 deleted the scanner_tracker branch January 23, 2025 16:28

dushyantbehl mentioned this pull request Feb 5, 2025

feat: add a new arg for HFResourceScanner callback #397

Closed

2 tasks
