Update base_dataset.py #385
Conversation
Hello @gushiqiao, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here, providing a summary of this pull request by @gushiqiao. The PR title is "Update base_dataset.py" and there is no description provided. Based on the single change in the patch, the primary intent appears to be a minor correction or update related to how configuration is read for the dataset.
The core change involves modifying the key used to retrieve the number of samples (n_samples) from the calibration configuration (calib_cfg) within the __init__ method of the BaseDataset class (or a similar class in base_dataset.py). The key is being changed from 'seq_len' to 'n_samples'. This suggests a renaming of the expected configuration parameter.
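As a quick illustration of why the key name matters, the sketch below shows how reading the wrong key silently picks up the wrong value (the config values here are illustrative, not taken from the actual project config):

```python
# Illustrative calibration config; values are made up for demonstration.
calib_cfg = {'n_samples': 128, 'seq_len': 512}

# Before the change: the code read 'seq_len', so self.n_samples
# silently picked up the sequence length instead of the sample count.
n_samples_old = calib_cfg.get('seq_len', None)    # -> 512 (wrong semantics)

# After the change: the intended key is read.
n_samples_new = calib_cfg.get('n_samples', None)  # -> 128
```

Note that `dict.get` never raises for a missing key, which is why the old code ran without error despite reading the wrong field.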
Highlights
- **Configuration Key Update**: The configuration key used to fetch the number of samples (`n_samples`) from the calibration configuration (`calib_cfg`) has been updated from `'seq_len'` to `'n_samples'` in the `__init__` method.

Changelog
- `llmc/data/dataset/base_dataset.py`
  - Updated the configuration key for retrieving `n_samples` from `'seq_len'` to `'n_samples'` (line 28).
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
A key name changed,
From 'seq_len' to 'samples',
Code now aligns well.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request updates the key used to fetch the number of calibration samples in BaseDataset from 'seq_len' to 'n_samples'. This is a positive change as 'n_samples' is the semantically correct key, aligning with its usage in example configuration files (e.g., awq_w_only.yml), where n_samples defines the number of calibration samples and seq_len typically refers to sequence length.
I've identified one high-severity issue related to how a missing 'n_samples' key in the configuration is handled, which could lead to runtime errors. A suggestion for improvement is provided below.
To improve clarity for future reviewers and maintainers, could you please add a brief description to the pull request explaining the motivation and impact of this change?
Summary of Findings
- **Potential `TypeError` if `'n_samples'` is missing in calibration config**: The `self.n_samples` attribute in `BaseDataset` defaults to `None` if the `'n_samples'` key is not found in `calib_cfg`. This `None` value is likely to cause `TypeError` exceptions or unexpected behavior in downstream preprocessing functions, as they generally expect an integer for the number of samples. A default integer value (e.g., 128, as seen in example configs) or making the key mandatory would improve robustness.
- **Pull Request Description**: The pull request lacks a description. Adding a brief explanation of the change's purpose and impact would be beneficial for reviewers and future reference.
Merge Readiness
The primary change in this pull request correctly updates the configuration key for n_samples, which is an improvement. However, a high-severity issue has been identified regarding the handling of a missing 'n_samples' key in the calibration configuration, which could lead to runtime errors.
I recommend addressing this issue by providing a default value or ensuring the key is mandatory before merging. As an AI, I am not authorized to approve pull requests; please ensure this change is further reviewed and approved by team members after the suggested modifications are considered.
```diff
  self.calib_dataset_path = calib_cfg.get('path', None)
  self.apply_chat_template = calib_cfg.get('apply_chat_template', False)
- self.n_samples = calib_cfg.get('seq_len', None)
+ self.n_samples = calib_cfg.get('n_samples', None)
```
While changing the key from 'seq_len' to 'n_samples' is a good correction for semantic accuracy, there's a potential issue if 'n_samples' is missing from calib_cfg.
Currently, if 'n_samples' is not found, self.n_samples defaults to None. This None value, when subsequently used by preprocessing functions (e.g., in llmc/data/dataset/specified_preproc.py), is likely to cause a TypeError. For instance:
- Functions like `wikitext2_gptq` use `n_samples` in `range(n_samples)`, and `range(None)` is an error.
- Functions like `pileval_awq` use `n_samples` in a loop break condition (`if n_run == n_samples:`). If `n_samples` is `None`, the loop might process the entire dataset, leading to unexpected behavior or performance issues.
- Functions like `img_general` use `n_samples` in comparisons (`len(calib_dataset) > n_samples`), which will also error with `None`.
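The failure modes above can be demonstrated in isolation; this is a minimal sketch mirroring those patterns, not the llmc code itself:

```python
# Minimal demonstration of how a None n_samples breaks the patterns above.
n_samples = None

# range(None) raises TypeError, as in the wikitext2_gptq-style loop.
try:
    for i in range(n_samples):
        pass
except TypeError as exc:
    print('range:', exc)

# Comparing an int against None raises TypeError, as in the
# img_general-style check len(calib_dataset) > n_samples.
try:
    100 > n_samples
except TypeError as exc:
    print('comparison:', exc)

# An equality break condition (pileval_awq-style) does NOT raise:
# n_run == None is simply always False, so the loop never stops early.
print(0 == n_samples)  # -> False
```

The third case is the most insidious one, since it fails silently rather than crashing.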
To make this more robust and prevent runtime crashes due to incomplete configurations, consider providing a default integer value for n_samples. A value like 128 is used in several example configurations (e.g., configs/quantization/methods/Awq/awq_w_only.yml).
Alternatively, if n_samples is a strictly required parameter, accessing it directly via calib_cfg['n_samples'] would raise a KeyError immediately if it's missing, which is a more explicit way to handle mandatory configurations. However, providing a sensible default is often a good balance.
```diff
- self.n_samples = calib_cfg.get('n_samples', None)
+ self.n_samples = calib_cfg.get('n_samples', 128)  # Default to 128 to prevent errors if not in config
```
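The trade-off between the two handling strategies can be sketched as follows (the empty dict is a stand-in for a `calib_cfg` missing the key; the 128 default mirrors the example configs):

```python
# Sketch of the two strategies for a missing 'n_samples' key.
calib_cfg = {}  # stand-in for a config without 'n_samples'

# Option A: fall back to a sensible default (128 matches the example configs).
n_samples = calib_cfg.get('n_samples', 128)
print(n_samples)  # -> 128

# Option B: treat the key as mandatory and fail fast with a KeyError.
try:
    n_samples = calib_cfg['n_samples']
except KeyError:
    print("config is missing required key 'n_samples'")
```

Option A keeps existing configs working; Option B surfaces incomplete configs immediately instead of masking them with a default.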
No description provided.