Skip to content

Feature/mlflow git tracking#65

Merged
ealt merged 10 commits intomainfrom
feature/mlflow-git-tracking
Sep 8, 2025
Merged

Feature/mlflow git tracking#65
ealt merged 10 commits intomainfrom
feature/mlflow-git-tracking

Conversation

@adamimos
Copy link
Collaborator

@adamimos adamimos commented Sep 4, 2025

Summary

This PR adds automatic git tracking and model storage information to MLflowLogger for improved experiment reproducibility.

Motivation

When debugging experiments or trying to reproduce results, it's critical to know:

  • What exact code version was used (including uncommitted changes)
  • What version of simplexity was used
  • Where the model checkpoints are stored

This PR makes MLflowLogger automatically capture this information without any code changes required from users.

Changes

  • Automatic git tracking: MLflowLogger now automatically captures git information when initialized
  • Storage tracking: New log_storage_info() method to track where models are saved (S3 or local)
  • Robust path detection: Multiple fallback methods to find simplexity installation path
  • Non-breaking: All existing code continues to work without modification

What gets tracked automatically

Git information (logged as MLflow tags):

  • git.main.commit: Short commit hash of the main repository
  • git.main.commit_full: Full commit hash
  • git.main.branch: Current branch name
  • git.main.dirty: Whether there are uncommitted changes
  • git.main.remote: Git remote URL
  • git.simplexity.*: Same information for the simplexity library

Storage information (when log_storage_info() is called):

  • storage.type: "s3" or "local"
  • storage.location: Full path to model storage (e.g., s3://bucket/prefix or /absolute/path)

Testing

Tested with the basic_mess3 example. Here are actual tags from a test run:

git.main.branch: example-new
git.main.commit: d419ed74
git.main.commit_full: d419ed745a0a960917ad76131f50468ee4feccb9
git.main.dirty: True
git.main.remote: https://github.com/Astera-org/simplex-research.git
git.simplexity.branch: feature/mlflow-git-tracking
git.simplexity.commit: 60f83fe
git.simplexity.commit_full: 60f83fe
git.simplexity.dirty: True
git.simplexity.remote: https://github.com/Astera-org/simplexity.git
storage.location: /Users/adamimos/Documents/GitHub/simplex-research/basic_mess3/data/models
storage.type: local

Benefits

  • Full reproducibility: Know exact code versions that produced each model
  • Easy debugging: Quickly identify version-related issues ("works on my machine")
  • MLflow filtering: Search and filter runs by git commit/branch in the UI
  • Model tracking: Know exactly where models are stored for retrieval

Implementation notes

  • Uses subprocess to call git commands with timeout protection
  • Gracefully handles non-git directories (returns empty dict)
  • Multiple fallback methods for finding simplexity path (handles editable installs, pip installs, etc.)
  • All git tracking happens automatically in __init__, no user action required

Example usage

# Git tracking happens automatically when logger is created
logger = MLFlowLogger(experiment_name="my_experiment")

# Optional: Log storage info after creating persister
if hasattr(logger, 'log_storage_info'):
    logger.log_storage_info(persister)

Future improvements (not in this PR)

- Could add Python environment info (pip freeze)
- Could add system info (OS, Python version, GPU)
- Could add option to disable git tracking if needed

- Automatically log git commit, branch, and dirty state for both main repo and simplexity
- Track full commit hash and short hash for easy reference
- Add log_storage_info() method to track model storage location (S3 or local)
- Helps with experiment reproducibility and debugging
- Non-breaking change - all tracking happens automatically

The logger now captures:
- git.main.* tags for the repository where training runs
- git.simplexity.* tags for the simplexity library version
- storage.* tags when log_storage_info() is called with a persister

This enables:
- Full reproducibility of experiments
- Easy debugging of version-related issues
- Filtering/searching runs by git version in MLflow UI
- Tracking exact model storage locations
- Add safety check for simplexity.__file__ being None
- Prevents TypeError when simplexity is installed in certain ways
- Ensures git tracking continues to work for main repository
- Use multiple methods to find simplexity installation path
- Try __file__, __path__, and importlib.util.find_spec
- Handles editable installs and regular pip installs
- Now successfully tracks simplexity git info in all cases
- Run ruff format to comply with project style guidelines
- Fix CI formatting check
- Replace direct attribute access with getattr for __file__ and __path__
- Fixes pyright reportAttributeAccessIssue errors
- Maintains full functionality while being type-safe
@adamimos adamimos requested a review from ealt September 4, 2025 21:50
- Move git tracking methods from MLFlowLogger to base Logger class
- Make log_git_info a public method that needs manual invocation
- Use functools.partial for cleaner subprocess.run calls
- Replace cwd parameter with git -C flag in all git commands
- Use git diff-index for checking dirty state
- Use git branch --show-current for getting branch name
- Use git remote get-url origin for remote URL
- Remove log_storage_info method per reviewer feedback
- Add _sanitize_remote() to remove credentials from git remote URLs
- Add _find_git_root() for cleaner git repository detection
- Replace complex simplexity path finding with simpler approach using __file__
- Apply ruff formatting
@adamimos
Copy link
Collaborator Author

adamimos commented Sep 6, 2025

@claude can you go one by one through the comments that @ealt made and double and triple check if they've been addressed?

@claude
Copy link

claude bot commented Sep 6, 2025

Claude finished @adamimos's task —— View job


Review Comment Verification

Going through each of @ealt's review comments to verify they've been addressed:

Review Comments to Verify:

  • Storage method concern: ✅ ADDRESSED - The log_storage_info() method has been completely removed from the code
  • Base logger placement: ✅ ADDRESSED - Git tracking functionality has been moved to the base Logger class in simplexity/logging/logger.py
  • Manual invocation: ✅ ADDRESSED - Git logging is now a public method log_git_info() that must be called explicitly (lines 141-163)
  • functools.partial optimization: ✅ ADDRESSED - Using partial(subprocess.run, capture_output=True, text=True, timeout=2.0) on line 110
  • Git command structure: ✅ ADDRESSED - All commands use -C str(repo_path) format (lines 113, 118, 122, 126)
  • Dirty check optimization: ✅ ADDRESSED - Simplified to result.returncode != 0 on line 119
  • Branch command simplification: ✅ ADDRESSED - Uses ["git", "-C", str(repo_path), "branch", "--show-current"] on line 122
  • Remote URL simplification: ✅ ADDRESSED - Uses ["git", "-C", str(repo_path), "remote", "get-url", "origin"] on line 126
  • Remote URL sanitization: ✅ ADDRESSED - Implements the suggested _sanitize_remote() method (lines 41-72) and applies it on line 128
  • Path detection improvement: ✅ ADDRESSED - Implements the exact _find_git_root() method suggested by @ealt (lines 74-97)

Detailed Analysis

✅ All Major Review Comments Addressed

  1. Storage Method Removal: The log_storage_info() method that @ealt questioned has been completely removed, addressing the concern about redundancy with persister config logging.

  2. Architecture Improvement: The functionality has been properly moved from MLFlowLogger to the base Logger class, making it available to all logger implementations.

  3. API Design Fix: Git logging is now explicit via the public log_git_info() method rather than happening automatically on instantiation.

  4. Code Quality Improvements:

    • Uses functools.partial for cleaner subprocess calls
    • Implements proper -C flag usage instead of cwd
    • Includes security sanitization for git remote URLs
    • Uses the exact path detection logic suggested by @ealt

Code Quality Assessment: The implementation follows all of @ealt's suggestions precisely and maintains good error handling with timeouts and fallbacks.

Status: ✅ All review comments have been fully addressed - this PR is ready for re-review.

@adamimos
Copy link
Collaborator Author

adamimos commented Sep 6, 2025

In a test run the git info is tagged well in mlflow

image

@ealt ealt merged commit 418189c into main Sep 8, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants