Update base_dataset.py #385
Conversation
Hello @gushiqiao, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here, providing a summary of this pull request by @gushiqiao. The PR title is "Update base_dataset.py" and there is no description provided. Based on the single change in the patch, the primary intent appears to be a minor correction or update related to how configuration is read for the dataset.
The core change involves modifying the key used to retrieve the number of samples (n_samples) from the calibration configuration (calib_cfg) within the __init__ method of the BaseDataset class (or a similar class in base_dataset.py). The key is being changed from 'seq_len' to 'n_samples'. This suggests a renaming of the expected configuration parameter.
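As a quick illustration of why the key name matters, the sketch below shows how reading the wrong key silently picks up the wrong value (the config values here are illustrative, not taken from the actual project config):

```python
# Illustrative calibration config; values are made up for demonstration.
calib_cfg = {'n_samples': 128, 'seq_len': 512}

# Before the change: the code read 'seq_len', so self.n_samples
# silently picked up the sequence length instead of the sample count.
n_samples_old = calib_cfg.get('seq_len', None)    # -> 512 (wrong semantics)

# After the change: the intended key is read.
n_samples_new = calib_cfg.get('n_samples', None)  # -> 128
```

Note that `dict.get` never raises for a missing key, which is why the old code ran without error despite reading the wrong field.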
Highlights
- **Configuration Key Update**: The configuration key used to fetch the number of samples (`n_samples`) from the calibration configuration (`calib_cfg`) has been updated from `'seq_len'` to `'n_samples'` in the `__init__` method.

Changelog
- `llmc/data/dataset/base_dataset.py`
  - Updated the configuration key for retrieving `n_samples` from `'seq_len'` to `'n_samples'` (line 28).
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A key name changed,
From 'seq_len' to 'samples',
Code now aligns well.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request updates the key used to fetch the number of calibration samples in BaseDataset from 'seq_len' to 'n_samples'. This is a positive change as 'n_samples' is the semantically correct key, aligning with its usage in example configuration files (e.g., awq_w_only.yml), where n_samples defines the number of calibration samples and seq_len typically refers to sequence length.
I've identified one high-severity issue related to how a missing 'n_samples' key in the configuration is handled, which could lead to runtime errors. A suggestion for improvement is provided below.
To improve clarity for future reviewers and maintainers, could you please add a brief description to the pull request explaining the motivation and impact of this change?
Summary of Findings
- **Potential `TypeError` if `'n_samples'` is missing in calibration config**: The `self.n_samples` attribute in `BaseDataset` defaults to `None` if the `'n_samples'` key is not found in `calib_cfg`. This `None` value is likely to cause `TypeError` exceptions or unexpected behavior in downstream preprocessing functions, as they generally expect an integer for the number of samples. A default integer value (e.g., 128, as seen in example configs) or making the key mandatory would improve robustness.
- **Pull Request Description**: The pull request lacks a description. Adding a brief explanation of the change's purpose and impact would be beneficial for reviewers and future reference.
Merge Readiness
The primary change in this pull request correctly updates the configuration key for n_samples, which is an improvement. However, a high-severity issue has been identified regarding the handling of a missing 'n_samples' key in the calibration configuration, which could lead to runtime errors.
I recommend addressing this issue by providing a default value or ensuring the key is mandatory before merging. As an AI, I am not authorized to approve pull requests; please ensure this change is further reviewed and approved by team members after the suggested modifications are considered.
```diff
  self.calib_dataset_path = calib_cfg.get('path', None)
  self.apply_chat_template = calib_cfg.get('apply_chat_template', False)
- self.n_samples = calib_cfg.get('seq_len', None)
+ self.n_samples = calib_cfg.get('n_samples', None)
```
While changing the key from 'seq_len' to 'n_samples' is a good correction for semantic accuracy, there's a potential issue if 'n_samples' is missing from calib_cfg.
Currently, if 'n_samples' is not found, self.n_samples defaults to None. This None value, when subsequently used by preprocessing functions (e.g., in llmc/data/dataset/specified_preproc.py), is likely to cause a TypeError. For instance:
- Functions like `wikitext2_gptq` use `n_samples` in `range(n_samples)`, and `range(None)` is an error.
- Functions like `pileval_awq` use `n_samples` in a loop break condition (`if n_run == n_samples:`). If `n_samples` is `None`, the loop might process the entire dataset, leading to unexpected behavior or performance issues.
- Functions like `img_general` use `n_samples` in comparisons (`len(calib_dataset) > n_samples`), which will also error with `None`.
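The failure modes above can be demonstrated in isolation; this is a minimal sketch mirroring those patterns, not the llmc code itself:

```python
# Minimal demonstration of how a None n_samples breaks the patterns above.
n_samples = None

# range(None) raises TypeError, as in the wikitext2_gptq-style loop.
try:
    for i in range(n_samples):
        pass
except TypeError as exc:
    print('range:', exc)

# Comparing an int against None raises TypeError, as in the
# img_general-style check len(calib_dataset) > n_samples.
try:
    100 > n_samples
except TypeError as exc:
    print('comparison:', exc)

# An equality break condition (pileval_awq-style) does NOT raise:
# n_run == None is simply always False, so the loop never stops early.
print(0 == n_samples)  # -> False
```

The third case is the most insidious one, since it fails silently rather than crashing.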
To make this more robust and prevent runtime crashes due to incomplete configurations, consider providing a default integer value for n_samples. A value like 128 is used in several example configurations (e.g., configs/quantization/methods/Awq/awq_w_only.yml).
Alternatively, if n_samples is a strictly required parameter, accessing it directly via calib_cfg['n_samples'] would raise a KeyError immediately if it's missing, which is a more explicit way to handle mandatory configurations. However, providing a sensible default is often a good balance.
```diff
- self.n_samples = calib_cfg.get('n_samples', None)
+ self.n_samples = calib_cfg.get('n_samples', 128)  # Default to 128 to prevent errors if not in config
```
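The trade-off between the two handling strategies can be sketched as follows (the empty dict is a stand-in for a `calib_cfg` missing the key; the 128 default mirrors the example configs):

```python
# Sketch of the two strategies for a missing 'n_samples' key.
calib_cfg = {}  # stand-in for a config without 'n_samples'

# Option A: fall back to a sensible default (128 matches the example configs).
n_samples = calib_cfg.get('n_samples', 128)
print(n_samples)  # -> 128

# Option B: treat the key as mandatory and fail fast with a KeyError.
try:
    n_samples = calib_cfg['n_samples']
except KeyError:
    print("config is missing required key 'n_samples'")
```

Option A keeps existing configs working; Option B surfaces incomplete configs immediately instead of masking them with a default.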
No description provided.