Conversation

@gushiqiao (Contributor)

No description provided.

@gemini-code-assist bot left a comment

Hello @gushiqiao, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary of this pull request by @gushiqiao. The PR title is "Update base_dataset.py" and there is no description provided. Based on the single change in the patch, the primary intent appears to be a minor correction or update related to how configuration is read for the dataset.

The core change involves modifying the key used to retrieve the number of samples (n_samples) from the calibration configuration (calib_cfg) within the __init__ method of the BaseDataset class (or a similar class in base_dataset.py). The key is being changed from 'seq_len' to 'n_samples'. This suggests a renaming of the expected configuration parameter.

Highlights

  • Configuration Key Update: The configuration key used to fetch the number of samples (n_samples) from the calibration configuration (calib_cfg) has been updated from 'seq_len' to 'n_samples' in the __init__ method.

Changelog

  • llmc/data/dataset/base_dataset.py
    • Updated the configuration key for retrieving n_samples from 'seq_len' to 'n_samples' (line 28).
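To make the change concrete, here is a minimal sketch of the touched `__init__` logic. The class shape and surrounding keys are assumptions reconstructed from the snippet quoted in the review, not the full `BaseDataset` implementation:

```python
class BaseDataset:
    """Sketch of the calibration-config parsing touched by this PR (assumed shape)."""

    def __init__(self, calib_cfg):
        self.calib_dataset_path = calib_cfg.get('path', None)
        self.apply_chat_template = calib_cfg.get('apply_chat_template', False)
        # The PR's change: read the sample count from 'n_samples'
        # instead of the previously used (and semantically wrong) 'seq_len'.
        self.n_samples = calib_cfg.get('n_samples', None)


# With both keys present, the sample count now comes from 'n_samples',
# not from the sequence length.
ds = BaseDataset({'path': '/data/calib', 'n_samples': 128, 'seq_len': 512})
print(ds.n_samples)  # 128
```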
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
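A sketch of what such a repository setup might look like. The file name `styleguide.md` and its contents are illustrative assumptions; consult the linked documentation for the exact names Gemini Code Assist expects:

```shell
# Hypothetical .gemini/ layout -- file names are assumptions, not authoritative.
mkdir -p .gemini
cat > .gemini/styleguide.md <<'EOF'
# Review style guide
- Follow PEP-8 for Python files.
- Prefer explicit configuration keys over overloaded ones.
EOF
ls .gemini
```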

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


A key name changed,
From 'seq_len' to 'n_samples',
Code now aligns well.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gushiqiao merged commit 4d7cc78 into ModelTC:main on May 26, 2025
2 checks passed
@gemini-code-assist bot left a comment

Code Review

This pull request updates the key used to fetch the number of calibration samples in BaseDataset from 'seq_len' to 'n_samples'. This is a positive change as 'n_samples' is the semantically correct key, aligning with its usage in example configuration files (e.g., awq_w_only.yml), where n_samples defines the number of calibration samples and seq_len typically refers to sequence length.

I've identified one high-severity issue related to how a missing 'n_samples' key in the configuration is handled, which could lead to runtime errors. A suggestion for improvement is provided below.

To improve clarity for future reviewers and maintainers, could you please add a brief description to the pull request explaining the motivation and impact of this change?

Summary of Findings

  • Potential TypeError if 'n_samples' is missing in calibration config: The self.n_samples attribute in BaseDataset defaults to None if the 'n_samples' key is not found in calib_cfg. This None value is likely to cause TypeError exceptions or unexpected behavior in downstream preprocessing functions, as they generally expect an integer for the number of samples. A default integer value (e.g., 128, as seen in example configs) or making the key mandatory would improve robustness.
  • Pull Request Description: The pull request lacks a description. Adding a brief explanation of the change's purpose and impact would be beneficial for reviewers and future reference.

Merge Readiness

The primary change in this pull request correctly updates the configuration key for n_samples, which is an improvement. However, a high-severity issue has been identified regarding the handling of a missing 'n_samples' key in the calibration configuration, which could lead to runtime errors.

I recommend addressing this issue by providing a default value or ensuring the key is mandatory before merging. As an AI, I am not authorized to approve pull requests; please ensure this change is further reviewed and approved by team members after the suggested modifications are considered.

  self.calib_dataset_path = calib_cfg.get('path', None)
  self.apply_chat_template = calib_cfg.get('apply_chat_template', False)
- self.n_samples = calib_cfg.get('seq_len', None)
+ self.n_samples = calib_cfg.get('n_samples', None)

Severity: high

While changing the key from 'seq_len' to 'n_samples' is a good correction for semantic accuracy, there's a potential issue if 'n_samples' is missing from calib_cfg.

Currently, if 'n_samples' is not found, self.n_samples defaults to None. This None value, when subsequently used by preprocessing functions (e.g., in llmc/data/dataset/specified_preproc.py), is likely to cause a TypeError. For instance:

  • Functions like wikitext2_gptq use n_samples in range(n_samples), and range(None) is an error.
  • Functions like pileval_awq use n_samples in a loop break condition (if n_run == n_samples:). If n_samples is None, the loop might process the entire dataset, leading to unexpected behavior or performance issues.
  • Functions like img_general use n_samples in comparisons (len(calib_dataset) > n_samples), which will also raise a TypeError when n_samples is None.

To make this more robust and prevent runtime crashes due to incomplete configurations, consider providing a default integer value for n_samples. A value like 128 is used in several example configurations (e.g., configs/quantization/methods/Awq/awq_w_only.yml).

Alternatively, if n_samples is a strictly required parameter, accessing it directly via calib_cfg['n_samples'] would raise a KeyError immediately if it's missing, which is a more explicit way to handle mandatory configurations. However, providing a sensible default is often a good balance.
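The failure modes above can be reproduced in isolation. This is a standalone illustration of how a `None` sample count behaves in each of the three usage patterns the review names (the loop is a stand-in for the dataset iteration, not the actual llmc code):

```python
# What happens downstream when n_samples silently defaults to None.
n_samples = None

# wikitext2_gptq-style usage: range() rejects None outright.
try:
    list(range(n_samples))
except TypeError as e:
    print(f"range: {e}")

# img_general-style comparison: ordering against None also raises.
try:
    _ = 10 > n_samples
except TypeError as e:
    print(f"comparison: {e}")

# pileval_awq-style loop: `if n_run == n_samples` never fires with None,
# so instead of crashing, the loop silently consumes the whole dataset.
processed = 0
for n_run in range(5):  # stand-in for iterating a calibration dataset
    if n_run == n_samples:
        break
    processed += 1
print(processed)  # 5 -- every item processed, no early stop
```

The third case is arguably the worst of the three: it does not crash, it just quietly changes behavior.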

Suggested change
- self.n_samples = calib_cfg.get('n_samples', None)
+ self.n_samples = calib_cfg.get('n_samples', 128)  # Default to 128 to prevent errors if not in config
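A quick sketch of how the two handling strategies from the review behave when the key is absent. The config dict here is a made-up stand-in for `calib_cfg`:

```python
# A calibration config that omits 'n_samples' entirely.
calib_cfg = {'path': '/data/calib'}

# Option 1 (suggested fix): fall back to a sensible default.
n_samples = calib_cfg.get('n_samples', 128)
print(n_samples)                    # 128
print(list(range(n_samples))[:3])   # [0, 1, 2] -- downstream range() now works

# Option 2 (strict alternative): treat the key as mandatory and fail fast.
try:
    _ = calib_cfg['n_samples']
except KeyError:
    print("KeyError: config must define 'n_samples'")
```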

