[Feature] Enhance dataset preprocessing memory management and fix hash failure#1621
Open
Conversation
…ntrol Signed-off-by: Xin He <xin3.he@intel.com>
Contributor
Pull request overview
This PR improves AutoRound’s calibration dataset preprocessing by reducing peak RAM via subprocess-based preprocessing and adding a persistent on-disk cache keyed by tokenizer/dataset parameters, with configuration exposed through environment variables.
Changes:
- Added subprocess preprocessing mode for calibration dataset generation with in-process fallback.
- Implemented a disk cache for preprocessed calibration datasets using a SHA-256–derived key and a completion marker.
- Integrated new environment variables into `envs.py` and documented them in `docs/environments.md`.
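As a sketch of how such a cache could be keyed, the snippet below derives a SHA-256 key from every parameter that affects preprocessing and treats an entry as valid only when its completion marker exists. The helper names, the JSON payload layout, and the `.complete` marker filename are illustrative assumptions, not the PR's actual identifiers:

```python
import hashlib
import json
from pathlib import Path


def cache_key(tokenizer_name: str, dataset_name: str, seqlen: int, nsamples: int, seed: int) -> str:
    """Derive a stable SHA-256 key from all parameters that affect preprocessing."""
    payload = json.dumps(
        {
            "tokenizer": tokenizer_name,
            "dataset": dataset_name,
            "seqlen": seqlen,
            "nsamples": nsamples,
            "seed": seed,
        },
        sort_keys=True,  # deterministic serialization -> deterministic key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_is_complete(cache_root: Path, key: str) -> bool:
    """A cache entry counts only if its marker file exists, guarding against
    partially written entries left behind by a crashed preprocessing run."""
    return (cache_root / key / ".complete").is_file()
```

Writing the marker last makes the cache crash-safe: a reader either sees a fully written entry or falls back to regenerating it.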
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| docs/environments.md | Documents new dataset preprocessing/caching environment variables. |
| auto_round/envs.py | Adds unified env accessors for disabling subprocess mode and selecting cache directory. |
| auto_round/calib_dataset.py | Introduces subprocess-based preprocessing and persistent disk cache for calibration datasets. |
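A unified env accessor of the kind described for `auto_round/envs.py` might look like the sketch below. The function names and accepted truthy values are assumptions for illustration; only the variable name `AR_DISABLE_DATASET_SUBPROCESS` comes from the PR:

```python
import os


def _get_bool_env(name: str, default: bool = False) -> bool:
    """Read a boolean environment variable, treating '1'/'true'/'yes'/'on'
    (case-insensitive) as enabled and anything else as disabled."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes", "on")


def disable_dataset_subprocess() -> bool:
    # When set, forces in-process preprocessing instead of the forked subprocess.
    return _get_bool_env("AR_DISABLE_DATASET_SUBPROCESS", default=False)
```

Centralizing the parsing in one accessor keeps every call site consistent about what counts as "enabled".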
Signed-off-by: Xin He <xin3.he@intel.com>
Contributor
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
auto_round/calib_dataset.py:708
- The `_get_dataset_impl` docstring lists parameters like `split` and `apply_chat_template` that are not present in the function signature. This makes the internal API misleading and harder to maintain; please update the docstring to match the actual parameters and behavior.
```python
def _get_dataset_impl(tokenizer, seqlen, dataset_name="NeelNanda/pile-10k", seed=42, nsamples=512):
    """Internal implementation: generate a dataset for calibration.

    Args:
        tokenizer (Tokenizer): The tokenizer to use for tokenization.
        seqlen (int): The exact sequence length. Samples shorter than seqlen will be dropped;
            samples longer than seqlen will be truncated.
        dataset_name (str, optional): The name of the dataset or datasets separated by commas.
            Defaults to "NeelNanda/pile-10k".
        split (str, optional): The data split to use. Defaults to None.
        seed (int, optional): The random seed for reproducibility. Defaults to 42.
        nsamples (int, optional): The total number of samples to include. Defaults to 512.
        apply_chat_template: Whether to apply chat template in tokenization.
    """
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Agent-Logs-Url: https://github.com/intel/auto-round/sessions/0c14c972-2687-4283-aee6-3017898d7e0e
Co-authored-by: xin3he <83260933+xin3he@users.noreply.github.com>
Contributor
This PR should not target this release, as it introduces a new feature, right?
Contributor
Author
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Description
Reduced peak RAM from 9 GB to 2.5 GB for Qwen/Qwen3-0.6B.
Details:
Subprocess Preprocessing:
Since operations like `datasets.map` create large temporary objects that `gc.collect()` cannot fully reclaim, running preprocessing in a forked subprocess ensures all memory is returned to the OS when the subprocess exits, preventing memory growth during quantization.
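The subprocess pattern described above can be sketched as follows. This is a minimal illustration under assumed names, not the PR's implementation: the worker here just fabricates placeholder samples where the real code would run tokenization and `datasets.map`, then hands the result back via a file plus a queue:

```python
import multiprocessing as mp


def _preprocess_worker(out_path, queue):
    # Heavy preprocessing (e.g. datasets.map) would happen here; every
    # temporary allocation is freed back to the OS when this process exits.
    samples = [f"sample-{i}" for i in range(4)]  # placeholder for real work
    with open(out_path, "w") as f:
        f.write("\n".join(samples))
    queue.put(len(samples))  # signal completion to the parent


def preprocess_in_subprocess(out_path):
    # "fork" keeps startup cheap on Linux; it is unavailable on Windows,
    # which is one reason an in-process fallback is needed.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=_preprocess_worker, args=(out_path, queue))
    proc.start()
    n = queue.get()  # block until the child reports completion
    proc.join()
    if proc.exitcode != 0:
        raise RuntimeError("subprocess preprocessing failed; fall back to in-process")
    return n
```

Passing results through a file rather than the queue keeps the inter-process payload tiny, which matters when the preprocessed dataset is large.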
Added `_make_map_fingerprint` to fix the warning below.
Documentation:
Added `docs/environments_CN.md` as a Chinese translation of `docs/environments.md`, covering all environment variables including `AR_DISABLE_DATASET_SUBPROCESS`.
Type of Change
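For context on the fingerprint fix mentioned in the details above: `datasets.map` emits a hashing warning when it cannot pickle the mapped function to compute a cache fingerprint, and supplying an explicit `new_fingerprint` avoids that hashing entirely. The helper below is a hypothetical reconstruction of that idea, not the PR's actual `_make_map_fingerprint`:

```python
import hashlib


def make_map_fingerprint(*parts) -> str:
    """Build a deterministic hex fingerprint from every value that affects
    the map()'s output, suitable for datasets' new_fingerprint argument."""
    h = hashlib.sha256()
    for part in parts:
        h.update(repr(part).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
    return h.hexdigest()
```

A call would then look like `ds.map(tokenize_fn, new_fingerprint=make_map_fingerprint(tokenizer.name_or_path, seqlen))`, keeping dataset caching deterministic even when `tokenize_fn` itself is not hashable.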
Related Issues
Checklist Before Submitting