Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion llmc/data/dataset/base_dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ def __init__(self, tokenizer, calib_cfg, batch_process=None):
self.load_from_txt = calib_cfg.get('load_from_txt', False)
self.calib_dataset_path = calib_cfg.get('path', None)
self.apply_chat_template = calib_cfg.get('apply_chat_template', False)
self.n_samples = calib_cfg.get('seq_len', None)
self.n_samples = calib_cfg.get('n_samples', None)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

While changing the key from 'seq_len' to 'n_samples' is a good correction for semantic accuracy, there's a potential issue if 'n_samples' is missing from calib_cfg.

Currently, if 'n_samples' is not found, self.n_samples defaults to None. This None value, when subsequently used by preprocessing functions (e.g., in llmc/data/dataset/specified_preproc.py), is likely to cause a TypeError. For instance:

  • Functions like wikitext2_gptq use n_samples in range(n_samples), and range(None) is an error.
  • Functions like pileval_awq use n_samples in a loop break condition (if n_run == n_samples:). If n_samples is None, the loop might process the entire dataset, leading to unexpected behavior or performance issues.
  • Functions like img_general use n_samples in comparisons (len(calib_dataset) > n_samples), which will also error with None.

To make this more robust and prevent runtime crashes due to incomplete configurations, consider providing a default integer value for n_samples. A value like 128 is used in several example configurations (e.g., configs/quantization/methods/Awq/awq_w_only.yml).

Alternatively, if n_samples is a strictly required parameter, accessing it directly via calib_cfg['n_samples'] would raise a KeyError immediately if it's missing, which is a more explicit way to handle mandatory configurations. However, providing a sensible default is often a good balance.

Suggested change
self.n_samples = calib_cfg.get('n_samples', None)
self.n_samples = calib_cfg.get('n_samples', 128) # Default to 128 to prevent errors if not in config

self.calib_bs = calib_cfg['bs']
if self.calib_dataset_name in ['t2v', 'i2v']:
assert self.calib_bs == 1
Expand Down