feat: support csv #310
Summary of Changes
Hello @shijinpjlab, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the data ingestion capabilities of the Dingo framework by introducing robust and flexible support for CSV files. It provides a dedicated configuration for CSV datasets, allowing users to specify parameters like headers, encoding, and delimiters, and implements a streaming mechanism for efficient processing of large files. This feature, along with updated label collection logic in the executor, streamlines the process of preparing diverse datasets for evaluation.
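To make the summary concrete, here is a minimal sketch of what configurable, streaming CSV reading can look like in plain Python. The names `CsvOptions`, `stream_rows`, and `demo` are hypothetical and are not taken from the Dingo code base; only the standard `csv` module is used.

```python
import csv
import os
import tempfile
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class CsvOptions:
    """Hypothetical option bundle; field names are illustrative, not Dingo's."""
    has_header: bool = True  # consumed by a later row-to-dict step, not here
    encoding: str = "utf-8"
    delimiter: str = ","


def stream_rows(path: str, options: CsvOptions) -> Iterator[List[str]]:
    """Yield one parsed row at a time so the whole file never sits in memory."""
    # newline="" lets the csv module handle newlines inside quoted fields.
    with open(path, "r", encoding=options.encoding, newline="") as handle:
        yield from csv.reader(handle, delimiter=options.delimiter)


def demo() -> List[List[str]]:
    """Write a small semicolon-delimited file and read it back lazily."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
    ) as tmp:
        tmp.write("id;content\n1;hello\n2;world\n")
        path = tmp.name
    try:
        return list(stream_rows(path, CsvOptions(delimiter=";")))
    finally:
        os.remove(path)
```

Because `stream_rows` is a generator, a caller can stop early or process rows one by one without loading the full file.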
Code Review
This pull request adds support for reading CSV files, including new configuration options, a data converter, and a data loading implementation. It also includes comprehensive documentation, examples, and tests for the new functionality. My review focuses on improving code maintainability and fixing some issues in the new documentation files. Overall, this is a great addition with thorough test coverage.
def __init__(self):
    super().__init__()
if has_header:
    # Use the first row as the column names
    headers = [str(h).strip() if h else f'column_{i}' for i, h in enumerate(first_row)]
else:
    # No header row: use column_x style names
    headers = [f'column_{i}' for i in range(len(first_row))]
    first_row_data = first_row  # Keep the first row of data to process later

# If the first row is data (has_header=False), process it first
if first_row_data is not None:
    row_dict = {}
    for i, (header, value) in enumerate(zip(headers, first_row_data)):
        row_dict[header] = value.strip() if value else ""
    yield json.dumps(row_dict, ensure_ascii=False) + '\n'

# Read row by row and convert to JSON
for row in csv_reader:
    # Skip empty rows
    if not row or all(not cell.strip() for cell in row):
        continue

    # Combine the row data with the headers into a dictionary
    row_dict = {}
    for i, header in enumerate(headers):
        # If the row has fewer columns than headers, pad with empty strings
        if i < len(row):
            row_dict[header] = row[i].strip() if row[i] else ""
        else:
            row_dict[header] = ""

    # Convert to a JSON string and yield it
    yield json.dumps(row_dict, ensure_ascii=False) + '\n'
This block of code for handling headers and processing rows has some duplication. The logic for processing the first row (when has_header=False) is repeated in the main loop. This can be refactored to use a single processing loop by creating a unified iterator for data rows. This will make the code more concise and easier to maintain.
I suggest using itertools.chain to combine the first row with the rest of the CSV reader when there's no header. This avoids special handling for the first data row. You'll need to add from itertools import chain to your imports.
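As a standalone check of this idea, the sketch below wraps the single-loop approach in a hypothetical helper, `csv_to_jsonl`, that takes CSV text directly so it can be run without any files. The helper name, its signature, and the list-padding shortcut for short rows are assumptions made for illustration and are not part of the pull request.

```python
import csv
import io
import json
from itertools import chain
from typing import Iterator


def csv_to_jsonl(text: str, has_header: bool = True) -> Iterator[str]:
    """Hypothetical helper: convert CSV text to JSON lines with one row loop."""
    reader = csv.reader(io.StringIO(text))
    first_row = next(reader, None)
    if first_row is None:
        return  # empty input yields nothing
    if has_header:
        headers = [h.strip() if h else f"column_{i}" for i, h in enumerate(first_row)]
        data_rows = reader
    else:
        headers = [f"column_{i}" for i in range(len(first_row))]
        # Put the already-consumed first row back in front of the reader.
        data_rows = chain([first_row], reader)
    for row in data_rows:
        # Skip rows that are empty or contain only whitespace cells.
        if not row or all(not cell.strip() for cell in row):
            continue
        # Pad short rows with "" so every record has the same keys.
        padded = row + [""] * (len(headers) - len(row))
        yield json.dumps(
            {h: v.strip() for h, v in zip(headers, padded)}, ensure_ascii=False
        )
```

With `has_header=False` the first row comes back through `chain`, so it goes through exactly the same skip and padding logic as every other row.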
if has_header:
    # The first row is the header
    headers = [str(h).strip() if h else f'column_{i}' for i, h in enumerate(first_row)]
    data_rows = csv_reader
else:
    # Generate headers and treat the first row as data
    from itertools import chain
    headers = [f'column_{i}' for i in range(len(first_row))]
    data_rows = chain([first_row], csv_reader)
# Process all data rows in a single loop
for row in data_rows:
    # Skip empty rows
    if not row or all(not cell.strip() for cell in row):
        continue
    # Combine row data with headers into a dictionary, handling rows with fewer columns
    row_dict = {
        header: (row[i].strip() if row[i] else "") if i < len(row) else ""
        for i, header in enumerate(headers)
    }
    # Yield the JSON string
    yield json.dumps(row_dict, ensure_ascii=False) + '\n'
| 参数 | 类型 | 默认值 |
|------|-----|--------|
The table header for the Excel file example seems to be a copy-paste error. It shows | 参数 | 类型 | 默认值 | (Parameter | Type | Default), which belongs to a parameter description table, but the table body contains example data. The header should name the data columns, such as | id | content | label |, to match the JSON output example that follows.
- | 参数 | 类型 | 默认值 |
- |------|-----|--------|
+ | id | content | label |
+ |------|---------|--------|
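One way to keep a documented example table and its JSON output in sync is to derive the table header from the record keys instead of typing it by hand. The sketch below is only an illustration: `markdown_table` is a hypothetical helper, and the sample record in the usage note is invented, not taken from the pull request's documentation.

```python
from typing import Dict, List


def markdown_table(records: List[Dict[str, str]]) -> str:
    """Render records as a Markdown table whose header is the records' own keys."""
    if not records:
        return ""
    columns = list(records[0].keys())
    lines = [
        "| " + " | ".join(columns) + " |",
        "|" + "|".join("---" for _ in columns) + "|",
    ]
    for record in records:
        # Missing keys render as empty cells rather than raising.
        lines.append("| " + " | ".join(record.get(c, "") for c in columns) + " |")
    return "\n".join(lines)
```

For example, `markdown_table([{"id": "1", "content": "hello", "label": "good"}])` produces a header row of `| id | content | label |`, which cannot drift away from the data columns.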
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
No description provided.