Skip to content

Conversation

@aluu317
Copy link
Collaborator

@aluu317 aluu317 commented Dec 17, 2024

Description of the change

Extracted from this suggestion in a prior PR, adding HFResourceScanner TrainerCallback as a tracker.

In order to use this, user would need to install HFResourceScanner in the environment, and pass in training args to enable:

"trackers": ["hf_resource_scanner"]
"scanner_output_filename": "scanner_output.json" // optional, if not passed, the default value is used

See the test written in test_launch_script.py

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions
Copy link

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Dec 17, 2024
@aluu317 aluu317 force-pushed the scanner_tracker branch 2 times, most recently from 7991726 to 2218219 Compare December 17, 2024 23:29
@kmehant
Copy link
Collaborator

kmehant commented Dec 18, 2024

@ChanderG FYA

@ashokponkumar
Copy link
Collaborator

@dushyantbehl @ChanderG PTAL. Do we support pushing pushing scanner data also to aim/wandb etc? or is it not in scope?

@aluu317 aluu317 force-pushed the scanner_tracker branch 2 times, most recently from 4615177 to 81000bb Compare December 18, 2024 22:36
@aluu317
Copy link
Collaborator Author

aluu317 commented Dec 18, 2024

Also @ashokponkumar @ChanderG I tested this with our internal test and even though the scanner output file was created, I ran into this error while scanner was in use. I'm not sure how the other way in the prior PR of using a flag and not tracker, I didn't see this problem, but have this problem writing it to file this way. The scanner output file exists but is empty. I just get the results from stdout for now until I can debug this. But if you know why, let me know~

@ChanderG
Copy link
Contributor

ChanderG commented Dec 19, 2024

@aluu317 That's weird. I think the file is created, but write is failing? I don't think the tracker framework should be causing this. I should have clearly printed out the exception in Scanner, my bad.

That said, I am unable to reproduce the error. I tried running the tests and also manually ran the cli command with output json file different places (curr dir, /tmp dir etc) - and it's working in all cases. Tests pass and inspecting the generated json files in other cases shows output in expected formats.

@ChanderG
Copy link
Contributor

ChanderG commented Jan 9, 2025

@aluu317 Could you re-try? It seems to work fine for me - last I tried.

@aluu317
Copy link
Collaborator Author

aluu317 commented Jan 9, 2025

@ChanderG When you try, did you try with single GPU and calling sft_trainer train function directly? I noticed in a small subset of models I tried that when it's 1 GPU, it works ok. But it seems with accelerate using fsdp config with multiple GPUs training, I have to use txt file extension and not the json format for scanner output. Unsure what's the difference that would cause the writing issue. The file with extension json is created, but the content is empty

@ChanderG
Copy link
Contributor

@aluu317 You are right - I was able to repro the problem with accelerate/multi-gpu.

This was a bug in Scanner that I have now fixed in a new release v0.1.2. Can you update Scanner and retry?

@aluu317
Copy link
Collaborator Author

aluu317 commented Jan 14, 2025

@ChanderG Ahh how interesting! Thanks for the fix. I will test with the newer version. But I think this proves that the tracker code for this PR works though, independently of the json issue. Let's wrap up this PR if you're ok with reviewing/merging? It'd be nice to include this in our next fms-hf-tuning release (being worked on this week).

@aluu317
Copy link
Collaborator Author

aluu317 commented Jan 16, 2025

@ChanderG Verfied with 0.1.2 HFResourceScanner, json file is written with content! Thank you

@aluu317
Copy link
Collaborator Author

aluu317 commented Jan 21, 2025

@ashokponkumar @kmehant @ChanderG Please review

@kmehant
Copy link
Collaborator

kmehant commented Jan 22, 2025

@aluu317 Should we include hf resource scanner unit tests to run in our CI/CD, currently they are being skipped, WDYT?

@aluu317
Copy link
Collaborator Author

aluu317 commented Jan 22, 2025

@kmehant They are being skipped because we need HFResourceScanner installed to run the tests. It's the same behavior with ML Flow tracker and aim stack tracker unit tests. Did you mean some other tests?

Copy link
Collaborator

@kmehant kmehant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aluu317 Do we plan to install HFResourceScanner package and let unit tests run? I know for aim and ml flow need bit of a set up to run unit tests so could be skipped. We can look at this in a separate PR as well. Thanks.

@aluu317 aluu317 merged commit c0362ad into foundation-model-stack:main Jan 23, 2025
8 checks passed
@aluu317
Copy link
Collaborator Author

aluu317 commented Jan 23, 2025

@kmehant We could install HFResourceScanner and have that turned on by default with our library. Is that the behavior we want? I can make a separate PR to always have it installed by default

@aluu317 aluu317 deleted the scanner_tracker branch January 23, 2025 16:28
@kmehant
Copy link
Collaborator

kmehant commented Feb 5, 2025

@aluu317

#422 (comment)

While running the unit tests, we could possibly install it so that HFResourceScanner based unit tests would run.

We can change tox.ini to accommodate this here -

fms-hf-tuning/tox.ini

Lines 4 to 9 in 5c03aa8

[testenv]
description = run unit tests
deps =
pytest>=7
commands =
pytest {posargs:tests}
adding the field extras = scanner-dev? More docs if you are interested - https://tox.wiki/en/latest/config.html#python-run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants