
[Feature] Enhance dataset preprocessing memory management and fix hash failure #1621

Open
xin3he wants to merge 10 commits into main from xinhe/3-26a

Conversation

@xin3he (Contributor) commented Mar 26, 2026

Description

Reduced peak RAM usage from 9 GB to 2.5 GB for Qwen/Qwen3-0.6B

Details:

  1. Subprocess Preprocessing:

    • Introduced a subprocess mode for dataset preprocessing.
      Since operations like datasets.map create large temporary objects that gc.collect() cannot fully reclaim, running preprocessing in a forked subprocess ensures all memory is returned to the OS when the child exits, preventing memory growth during quantization.
  2. Added _make_map_fingerprint to fix the warning below:

    Parameter 'function'=<function get_tokenizer_function.<locals>.default_tokenizer_function at 0x752e9b7b36a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only shown once. Subsequent hashing failures won't be shown.
    
  3. Documentation:

    • Added docs/environments_CN.md as a Chinese translation of docs/environments.md, covering all environment variables including AR_DISABLE_DATASET_SUBPROCESS.
    • Added language toggle links between the English and Chinese documentation files.
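The subprocess approach in item 1 can be sketched as follows. This is a minimal illustration, not the PR's actual code: `_preprocess` is a hypothetical stand-in for the real tokenization work, and the `fork` start method assumes a POSIX platform (the PR also keeps an in-process fallback, controllable via AR_DISABLE_DATASET_SUBPROCESS).

```python
import multiprocessing as mp

def _preprocess(conn, dataset_name, nsamples):
    # Hypothetical heavy preprocessing; stands in for datasets.map etc.
    data = [{"ids": list(range(10))} for _ in range(nsamples)]
    conn.send(data)  # only the final result crosses the process boundary
    conn.close()

def preprocess_in_subprocess(dataset_name="NeelNanda/pile-10k", nsamples=4):
    ctx = mp.get_context("fork")  # child inherits tokenizer state cheaply
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    p = ctx.Process(target=_preprocess, args=(child_conn, dataset_name, nsamples))
    p.start()
    child_conn.close()
    result = parent_conn.recv()  # receive before join to avoid pipe deadlock
    p.join()
    # All temporary allocations made during preprocessing are returned to
    # the OS when the child exits, unlike an in-process gc.collect().
    return result
```

The key point is that the operating system reclaims the child's entire address space on exit, so fragmentation and allocator caches in the preprocessing step cannot inflate the quantization process's peak RSS.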
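Item 2's fix can be approximated like this: rather than letting `datasets` try to pickle the tokenizer closure (which fails and triggers the random-hash warning), derive a deterministic fingerprint from the parameters that define the transform and pass it to `Dataset.map(..., new_fingerprint=...)`. The function name and field set below are illustrative, not the PR's exact implementation.

```python
import hashlib
import json

def make_map_fingerprint(tokenizer_name, seqlen, dataset_name, seed, nsamples):
    """Illustrative stand-in for _make_map_fingerprint.

    Hashes the parameters that determine the map() output, so datasets can
    cache the result even though the closure itself is not picklable.
    """
    payload = json.dumps(
        {
            "tokenizer": tokenizer_name,
            "seqlen": seqlen,
            "dataset": dataset_name,
            "seed": seed,
            "nsamples": nsamples,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Supplying this stable hash via `new_fingerprint` suppresses the warning and makes the `datasets` cache reproducible across runs, since identical parameters always yield the same fingerprint.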

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.


…ntrol

Signed-off-by: Xin He <xin3.he@intel.com>
Copilot AI review requested due to automatic review settings March 26, 2026 07:09
Copilot AI (Contributor) left a comment

Pull request overview

This PR improves AutoRound’s calibration dataset preprocessing by reducing peak RAM via subprocess-based preprocessing and adding a persistent on-disk cache keyed by tokenizer/dataset parameters, with configuration exposed through environment variables.

Changes:

  • Added subprocess preprocessing mode for calibration dataset generation with in-process fallback.
  • Implemented a disk cache for preprocessed calibration datasets using a SHA-256–derived key and a completion marker.
  • Integrated new environment variables into envs.py and documented them in docs/environments.md.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| docs/environments.md | Documents the new dataset preprocessing/caching environment variables. |
| auto_round/envs.py | Adds unified env accessors for disabling subprocess mode and selecting the cache directory. |
| auto_round/calib_dataset.py | Introduces subprocess-based preprocessing and a persistent disk cache for calibration datasets. |

@xin3he xin3he marked this pull request as draft March 26, 2026 07:15
xin3he added 2 commits March 26, 2026 15:21
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he xin3he marked this pull request as ready for review March 26, 2026 07:22
@xin3he xin3he requested a review from Copilot March 26, 2026 07:23
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he xin3he changed the title [Feature] Enhance dataset preprocessing memory management and introduce persistent caching [Feature] Enhance dataset preprocessing memory management and fix hash failure Mar 26, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

auto_round/calib_dataset.py:708

  • The _get_dataset_impl docstring lists parameters like split and apply_chat_template that are not present in the function signature. This makes the internal API misleading and harder to maintain; please update the docstring to match the actual parameters and behavior.
def _get_dataset_impl(tokenizer, seqlen, dataset_name="NeelNanda/pile-10k", seed=42, nsamples=512):
    """Internal implementation: generate a dataset for calibration.

    Args:
        tokenizer (Tokenizer): The tokenizer to use for tokenization.
        seqlen (int): The exact sequence length. samples < seqlen will be dropped,
                      samples longer than seqlen will be truncated
        dataset_name (str, optional): The name of the dataset or datasets separated by commas.
                                     Defaults to "NeelNanda/pile-10k".
        split (str, optional): The data split to use. Defaults to None.
        seed (int, optional): The random seed for reproducibility. Defaults to 42.
        nsamples (int, optional): The total number of samples to include. Defaults to 512.
        apply_chat_template: Whether to apply chat template in tokenization.

xin3he and others added 2 commits March 27, 2026 11:19
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@wenhuach21
Contributor

This PR should not target this release, as it introduces a new feature, right?

xin3he added 2 commits March 27, 2026 11:27
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he
Contributor Author

xin3he commented Mar 27, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

4 participants