
Conversation


@chen2021673 chen2021673 commented Jan 5, 2026

Add automatic precision checking for the backward pass via the INFINI_PRECISION_CHECK environment variable. Hooks are registered on gradient functions to detect NaN/Inf and monitor gradient statistics during backpropagation.
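
For illustration, a minimal sketch of the kind of NaN/Inf and statistics check such a backward hook could run over a gradient buffer. The names `GradStats` and `CheckGradient` are hypothetical and not the actual API added by this PR:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative sketch only: shows the kind of check a backward hook could
// perform on a gradient buffer; not the actual hook code in this PR.
struct GradStats {
    bool has_nan = false;
    bool has_inf = false;
    float min = 0.0f;
    float max = 0.0f;
};

GradStats CheckGradient(const std::vector<float>& grad) {
    GradStats stats;
    if (grad.empty()) return stats;
    stats.min = stats.max = grad[0];
    for (float v : grad) {
        if (std::isnan(v)) stats.has_nan = true;
        if (std::isinf(v)) stats.has_inf = true;
        stats.min = std::min(stats.min, v);
        stats.max = std::max(stats.max, v);
    }
    return stats;
}
```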

@chen2021673 chen2021673 changed the title from "feat: add automatic precision checking for backward pass" to "[WIP] feat: add automatic precision checking for backward pass" on Jan 5, 2026
chen2021673 and others added 2 commits January 7, 2026 08:56
Add PrecisionCheckLevel enum (NONE/FUNCTION/MODULE) to GlobalEnv for fine-grained control of precision checking. Register backward hooks for both Function and Module levels to check gradients during backward pass.

Key changes:
- Add PrecisionCheckLevel to GlobalEnv with env var support (INFINI_PRECISION_CHECK=module/function)
- Register backward pre/post hooks in Module::operator() on output tensor grad_fn
- Register precision check hooks in Function::Apply based on precision level
- Add HookHandle base class definition
- Update Module::Forward call syntax from Forward() to operator()
- Simplify CheckTensors output format
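
As an illustration of the first item in this list, a minimal sketch of what the enum and env-var parsing could look like. The NONE/FUNCTION/MODULE levels mirror the commit message, but `ParsePrecisionCheckLevel` is a hypothetical helper and the real GlobalEnv code may differ:

```cpp
#include <cstdlib>
#include <string>

// Illustrative sketch only: parse INFINI_PRECISION_CHECK into a level enum.
enum class PrecisionCheckLevel { kNone, kFunction, kModule };

PrecisionCheckLevel ParsePrecisionCheckLevel() {
    const char* value = std::getenv("INFINI_PRECISION_CHECK");
    if (value == nullptr) return PrecisionCheckLevel::kNone;
    const std::string level(value);
    if (level == "function") return PrecisionCheckLevel::kFunction;
    if (level == "module") return PrecisionCheckLevel::kModule;
    return PrecisionCheckLevel::kNone;
}
```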

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
…te hook types

- Replace environment variables (INFINI_PRECISION_CHECK, INFINI_PRECISION_CHECK_ALL_RANKS) with command-line flags (--precision_check, --precision_check_all_ranks)
- Move hook type definitions from global scope into Function class to eliminate duplication between function.h and function_hook.h
- Update GlobalEnv to accept precision check parameters and propagate through InitAllEnv
- Update precision_checker to use GlobalEnv instead of getenv()
- Add gflags definitions to gpt2 and llama3 examples
- Fix module operator() calls to use correct syntax
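
As an illustration of the gflags item in this list, a minimal sketch of what the flag definitions in the gpt2/llama3 examples could look like. The flag names follow the commit message, but the defaults, help strings, and the choice of a string-valued --precision_check (taking none/function/module) are assumptions:

```cpp
#include <gflags/gflags.h>

// Illustrative sketch only: not the actual flag definitions in this PR.
DEFINE_string(precision_check, "none",
              "Backward-pass precision check level: none, function, or module");
DEFINE_bool(precision_check_all_ranks, false,
            "Run precision checks on all ranks rather than a single rank");

int main(int argc, char* argv[]) {
    gflags::ParseCommandLineFlags(&argc, &argv, /*remove_flags=*/true);
    // The parsed flags would then be passed into GlobalEnv (e.g. via InitAllEnv)
    // so the rest of the training code reads the precision check level from there.
    return 0;
}
```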

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>