
[Feature] Enhance dataset preprocessing memory management and fix hash failure #1621

Open
xin3he wants to merge 10 commits into main from xinhe/3-26a

Conversation

@xin3he (Contributor) commented Mar 26, 2026

Description

Reduced peak RAM usage from 9 GB to 2.5 GB for Qwen/Qwen3-0.6B

Details:

  1. Subprocess Preprocessing:

    • Introduced a subprocess mode for dataset preprocessing.
      Since operations like datasets.map create large temporary objects that gc.collect() cannot fully reclaim, running preprocessing in a forked subprocess ensures all memory is returned to the OS when the child exits, preventing memory growth during quantization.
  2. Added _make_map_fingerprint to fix the warning below:

    Parameter 'function'=<function get_tokenizer_function.<locals>.default_tokenizer_function at 0x752e9b7b36a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only shown once. Subsequent hashing failures won't be shown.
    
  3. Documentation:

    • Added docs/environments_CN.md as a Chinese translation of docs/environments.md, covering all environment variables including AR_DISABLE_DATASET_SUBPROCESS.
    • Added language toggle links between the English and Chinese documentation files.
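The subprocess approach in item 1 can be sketched as follows. This is a minimal illustration, not the PR's actual code: `_preprocess` is a hypothetical stand-in for the real tokenization work, and the `fork` start method assumes a POSIX platform (the PR also keeps an in-process fallback, controllable via AR_DISABLE_DATASET_SUBPROCESS).

```python
import multiprocessing as mp

def _preprocess(conn, dataset_name, nsamples):
    # Hypothetical heavy preprocessing; stands in for datasets.map etc.
    data = [{"ids": list(range(10))} for _ in range(nsamples)]
    conn.send(data)  # only the final result crosses the process boundary
    conn.close()

def preprocess_in_subprocess(dataset_name="NeelNanda/pile-10k", nsamples=4):
    ctx = mp.get_context("fork")  # child inherits tokenizer state cheaply
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    p = ctx.Process(target=_preprocess, args=(child_conn, dataset_name, nsamples))
    p.start()
    child_conn.close()
    result = parent_conn.recv()  # receive before join to avoid pipe deadlock
    p.join()
    # All temporary allocations made during preprocessing are returned to
    # the OS when the child exits, unlike an in-process gc.collect().
    return result
```

The key point is that the operating system reclaims the child's entire address space on exit, so fragmentation and allocator caches in the preprocessing step cannot inflate the quantization process's peak RSS.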
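Item 2's fix can be approximated like this: rather than letting `datasets` try to pickle the tokenizer closure (which fails and triggers the random-hash warning), derive a deterministic fingerprint from the parameters that define the transform and pass it to `Dataset.map(..., new_fingerprint=...)`. The function name and field set below are illustrative, not the PR's exact implementation.

```python
import hashlib
import json

def make_map_fingerprint(tokenizer_name, seqlen, dataset_name, seed, nsamples):
    """Illustrative stand-in for _make_map_fingerprint.

    Hashes the parameters that determine the map() output, so datasets can
    cache the result even though the closure itself is not picklable.
    """
    payload = json.dumps(
        {
            "tokenizer": tokenizer_name,
            "seqlen": seqlen,
            "dataset": dataset_name,
            "seed": seed,
            "nsamples": nsamples,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Supplying this stable hash via `new_fingerprint` suppresses the warning and makes the `datasets` cache reproducible across runs, since identical parameters always yield the same fingerprint.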

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.


…ntrol

Signed-off-by: Xin He <xin3.he@intel.com>
Copilot AI review requested due to automatic review settings March 26, 2026 07:09
Copilot AI (Contributor) left a comment

Pull request overview

This PR improves AutoRound’s calibration dataset preprocessing by reducing peak RAM via subprocess-based preprocessing and adding a persistent on-disk cache keyed by tokenizer/dataset parameters, with configuration exposed through environment variables.

Changes:

  • Added subprocess preprocessing mode for calibration dataset generation with in-process fallback.
  • Implemented a disk cache for preprocessed calibration datasets using a SHA-256–derived key and a completion marker.
  • Integrated new environment variables into envs.py and documented them in docs/environments.md.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| docs/environments.md | Documents the new dataset preprocessing/caching environment variables. |
| auto_round/envs.py | Adds unified env accessors for disabling subprocess mode and selecting the cache directory. |
| auto_round/calib_dataset.py | Introduces subprocess-based preprocessing and a persistent disk cache for calibration datasets. |

@xin3he xin3he marked this pull request as draft March 26, 2026 07:15
xin3he added 2 commits March 26, 2026 15:21
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he xin3he marked this pull request as ready for review March 26, 2026 07:22
@xin3he xin3he requested a review from Copilot March 26, 2026 07:23
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he xin3he changed the title [Feature] Enhance dataset preprocessing memory management and introduce persistent caching [Feature] Enhance dataset preprocessing memory management and fix hash failure Mar 26, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

auto_round/calib_dataset.py:708

  • The _get_dataset_impl docstring lists parameters like split and apply_chat_template that are not present in the function signature. This makes the internal API misleading and harder to maintain; please update the docstring to match the actual parameters and behavior.
def _get_dataset_impl(tokenizer, seqlen, dataset_name="NeelNanda/pile-10k", seed=42, nsamples=512):
    """Internal implementation: generate a dataset for calibration.

    Args:
        tokenizer (Tokenizer): The tokenizer to use for tokenization.
        seqlen (int): The exact sequence length. samples < seqlen will be dropped,
                      samples longer than seqlen will be truncated
        dataset_name (str, optional): The name of the dataset or datasets separated by commas.
                                     Defaults to "NeelNanda/pile-10k".
        split (str, optional): The data split to use. Defaults to None.
        seed (int, optional): The random seed for reproducibility. Defaults to 42.
        nsamples (int, optional): The total number of samples to include. Defaults to 512.
        apply_chat_template: Whether to apply chat template in tokenization.

xin3he and others added 2 commits March 27, 2026 11:19
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@wenhuach21
Contributor

This PR should not target this release, as it introduces a new feature, right?

xin3he added 2 commits March 27, 2026 11:27
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he
Contributor Author

xin3he commented Mar 27, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

4 participants