- Automatically log git commit, branch, and dirty state for both main repo and simplexity
- Track full commit hash and short hash for easy reference
- Add log_storage_info() method to track model storage location (S3 or local)
- Helps with experiment reproducibility and debugging
- Non-breaking change: all tracking happens automatically

The logger now captures:
- git.main.* tags for the repository where training runs
- git.simplexity.* tags for the simplexity library version
- storage.* tags when log_storage_info() is called with a persister

This enables:
- Full reproducibility of experiments
- Easy debugging of version-related issues
- Filtering/searching runs by git version in MLflow UI
- Tracking exact model storage locations

- Add safety check for simplexity.__file__ being None
- Prevents TypeError when simplexity is installed in certain ways
- Ensures git tracking continues to work for main repository

- Use multiple methods to find simplexity installation path
- Try __file__, __path__, and importlib.util.find_spec
- Handles editable installs and regular pip installs
- Now successfully tracks simplexity git info in all cases

- Run ruff format to comply with project style guidelines
- Fix CI formatting check

- Replace direct attribute access with getattr for __file__ and __path__
- Fixes pyright reportAttributeAccessIssue errors
- Maintains full functionality while being type-safe
- Move git tracking methods from MLFlowLogger to base Logger class
- Make log_git_info a public method that needs manual invocation
- Use functools.partial for cleaner subprocess.run calls
- Replace cwd parameter with git -C flag in all git commands
- Use git diff-index for checking dirty state
- Use git branch --show-current for getting branch name
- Use git remote get-url origin for remote URL
- Remove log_storage_info method per reviewer feedback
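The git commands named in this commit can be combined roughly as follows. This is a minimal sketch and not the code from the PR: the function name `collect_git_info`, the returned keys, and the 10-second timeout are assumptions made for illustration. Only the git subcommands (`-C`, `diff-index`, `branch --show-current`, `remote get-url origin`) and the use of `functools.partial` come from the commit message.

```python
from __future__ import annotations

import subprocess
from functools import partial


def collect_git_info(repo_path: str) -> dict[str, str]:
    """Return commit, branch, dirty state, and remote for the repo at repo_path.

    Hypothetical helper; returns an empty dict when git is unavailable or
    repo_path is not inside a git repository.
    """
    # Bind the shared subprocess arguments once instead of repeating them.
    run = partial(subprocess.run, capture_output=True, text=True, timeout=10)

    def git(*args: str) -> subprocess.CompletedProcess:
        # `git -C <path>` takes the place of subprocess's cwd parameter.
        return run(["git", "-C", repo_path, *args])

    try:
        head = git("rev-parse", "HEAD")
        if head.returncode != 0:
            return {}  # not a repository, or no commits yet
        info = {
            "commit_full": head.stdout.strip(),
            "commit": git("rev-parse", "--short", "HEAD").stdout.strip(),
            # diff-index --quiet exits non-zero when tracked files differ
            # from HEAD; untracked files are not counted.
            "dirty": str(git("diff-index", "--quiet", "HEAD", "--").returncode != 0),
            # Empty string when HEAD is detached.
            "branch": git("branch", "--show-current").stdout.strip(),
        }
        remote = git("remote", "get-url", "origin")
        if remote.returncode == 0:  # a repo may have no "origin" remote
            info["remote"] = remote.stdout.strip()
    except (OSError, subprocess.TimeoutExpired):
        return {}  # git binary missing, or the command hung
    return info
```

Calling `collect_git_info(".")` from inside a checkout would return a dict with these keys. Note that `git branch --show-current` needs git 2.22 or newer.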
- Add _sanitize_remote() to remove credentials from git remote URLs
- Add _find_git_root() for cleaner git repository detection
- Replace complex simplexity path finding with simpler approach using __file__
- Apply ruff formatting
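To illustrate what these helpers might do, here is a hedged sketch. The bodies are assumptions based only on the descriptions in this commit (strip credentials, walk up to the repository root, locate a package through `__file__`). The names drop the leading underscore to make clear they are not the PR's actual methods, and `package_git_root` is a hypothetical helper.

```python
from __future__ import annotations

import importlib
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def sanitize_remote(url: str) -> str:
    """Strip a user:password@ prefix from an HTTP(S) remote URL."""
    parts = urlsplit(url)
    if parts.scheme in ("http", "https") and "@" in parts.netloc:
        host = parts.netloc.rsplit("@", 1)[1]
        return urlunsplit(parts._replace(netloc=host))
    # scp-style remotes such as git@github.com:org/repo.git are left alone.
    return url


def find_git_root(start: Path) -> Path | None:
    """Walk upward from start until a directory containing .git is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        # .git is a directory in normal clones and a file in worktrees.
        if (candidate / ".git").exists():
            return candidate
    return None


def package_git_root(module_name: str) -> Path | None:
    """Find the git checkout (if any) that an importable package lives in."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # getattr avoids an attribute error; __file__ is None for namespace packages.
    module_file = getattr(module, "__file__", None)
    if module_file is None:
        return None
    return find_git_root(Path(module_file).parent)
```

For a package installed from a wheel into site-packages, `package_git_root` would normally return `None`, since no `.git` directory sits above the installed files. An editable install from a clone resolves to that clone's root.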


Summary
This PR adds automatic git tracking and model storage information to MLflowLogger for improved experiment reproducibility.
Motivation
When debugging experiments or trying to reproduce results, it's critical to know:
This PR makes MLflowLogger automatically capture this information without any code changes required from users.
Changes
- log_storage_info() method to track where models are saved (S3 or local)

What gets tracked automatically
Git information (logged as MLflow tags):
- git.main.commit: Short commit hash of the main repository
- git.main.commit_full: Full commit hash
- git.main.branch: Current branch name
- git.main.dirty: Whether there are uncommitted changes
- git.main.remote: Git remote URL
- git.simplexity.*: Same information for the simplexity library

Storage information (when log_storage_info() is called):
- storage.type: "s3" or "local"
- storage.location: Full path to model storage (e.g., s3://bucket/prefix or /absolute/path)

Testing
Tested with the basic_mess3 example. Here are actual tags from a test run:
git.main.branch: example-new
git.main.commit: d419ed74
git.main.commit_full: d419ed745a0a960917ad76131f50468ee4feccb9
git.main.dirty: True
git.main.remote: https://github.com/Astera-org/simplex-research.git
git.simplexity.branch: feature/mlflow-git-tracking
git.simplexity.commit: 60f83fe
git.simplexity.commit_full: 60f83fe
git.simplexity.dirty: True
git.simplexity.remote: https://github.com/Astera-org/simplexity.git
storage.location: /Users/adamimos/Documents/GitHub/simplex-research/basic_mess3/data/models
storage.type: local
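Tag names like the ones in this run could be assembled along these lines. This is an illustrative sketch only: `prefixed_tags` and `storage_tags` are hypothetical helpers, and `storage_tags` takes a plain location string, whereas the PR's log_storage_info() is described as taking a persister. The resulting dict is the kind of mapping that could be handed to MLflow as run tags.

```python
from __future__ import annotations

import os


def prefixed_tags(prefix: str, info: dict[str, str]) -> dict[str, str]:
    """Namespace raw git fields, e.g. 'commit' -> 'git.main.commit'."""
    return {f"{prefix}.{key}": value for key, value in info.items()}


def storage_tags(location: str) -> dict[str, str]:
    """Classify a storage location as 's3' or 'local' (assumed rule)."""
    if location.startswith("s3://"):
        return {"storage.type": "s3", "storage.location": location}
    # Local paths are made absolute so the tag is unambiguous across machines.
    return {"storage.type": "local", "storage.location": os.path.abspath(location)}


tags = {
    **prefixed_tags("git.main", {"commit": "d419ed74", "branch": "example-new"}),
    **storage_tags("s3://bucket/prefix"),
}
```

Keeping the repository name in the prefix lets the same field names be reused for `git.main.*` and `git.simplexity.*` without collisions.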
Benefits
Implementation notes
__init__, no user action required

Example usage