feat: support csv #310

shijinpjlab · 2025-12-23T06:12:36Z

No description provided.


gemini-code-assist · 2025-12-23T06:13:02Z

Summary of Changes

Hello @shijinpjlab, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data ingestion capabilities of the Dingo framework by introducing robust and flexible support for CSV files. It provides a dedicated configuration for CSV datasets, allowing users to specify parameters like headers, encoding, and delimiters, and implements a streaming mechanism for efficient processing of large files. This feature, along with updated label collection logic in the executor, streamlines the process of preparing diverse datasets for evaluation.

Highlights

CSV Dataset Support: Introduced a new DatasetCsvArgs configuration and CsvConverter to enable reading and processing of CSV files.
Flexible CSV Parsing: The new implementation supports various CSV formats, including configurable headers, encoding (e.g., UTF-8, GBK), dialects (e.g., excel, excel-tab), and custom delimiters.
Streaming File Loading: Implemented efficient streaming for CSV files, allowing processing of large datasets without loading the entire file into memory.
Refined Label Collection: Updated the LocalExecutor to correctly collect and count unique labels for evaluation statistics.
Comprehensive Documentation & Examples: Added detailed documentation (docs/dataset/csv.md) and an example script (examples/dataset/example_csv.py) to guide users on how to utilize the new CSV features.
Excel Documentation: Added comprehensive documentation for Excel dataset reading (docs/dataset/excel.md).
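
The parsing options in these highlights (header handling, dialects, custom delimiters, streaming) correspond closely to Python's standard `csv` module. The following is a minimal sketch under that assumption only. It does not reproduce Dingo's `DatasetCsvArgs` or `CsvConverter`, whose fields are not shown in this thread; the function name `stream_csv_rows` and its parameters are hypothetical.

```python
import csv
import io
from typing import Iterator, Optional, TextIO


def stream_csv_rows(
    handle: TextIO,
    has_header: bool = True,
    dialect: str = "excel",
    delimiter: Optional[str] = None,
) -> Iterator[dict]:
    """Yield one dict per CSV row, reading lazily (hypothetical helper)."""
    # A custom delimiter, when given, overrides the one implied by the dialect.
    fmt = {"delimiter": delimiter} if delimiter else {}
    reader = csv.reader(handle, dialect=dialect, **fmt)
    first = next(reader, None)
    if first is None:
        return  # empty input: nothing to yield
    if has_header:
        headers = [h.strip() or f"column_{i}" for i, h in enumerate(first)]
    else:
        # No header row: synthesize names and treat the first row as data.
        headers = [f"column_{i}" for i in range(len(first))]
        yield dict(zip(headers, first))
    for row in reader:
        if row:  # csv.reader returns [] for blank lines
            yield dict(zip(headers, row))


# Tab-separated input read through the built-in "excel-tab" dialect.
tsv = io.StringIO("id\tcontent\n1\thello\n")
rows = list(stream_csv_rows(tsv, dialect="excel-tab"))
```

For a file on disk, the handle would typically come from `open(path, newline="", encoding="gbk")` (or another encoding), which is where the encoding option applies. Because the function is a generator, rows are produced one at a time rather than loaded into memory together.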

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request adds support for reading CSV files, including new configuration options, a data converter, and a data loading implementation. It also includes comprehensive documentation, examples, and tests for the new functionality. My review focuses on improving code maintainability and fixing some issues in the new documentation files. Overall, this is a great addition with thorough test coverage.

gemini-code-assist · 2025-12-23T06:15:58Z

dingo/data/converter/base.py

+    def __init__(self):
+        super().__init__()


The __init__ method only calls the parent's __init__. When a subclass does not define __init__, Python uses the inherited one, so this method is redundant and can be removed for cleaner code.
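
To illustrate the point, here is a small sketch with hypothetical class names (the actual classes in dingo/data/converter/base.py are not shown in this thread). A subclass that omits `__init__` resolves to the parent's method, so both variants initialize the instance identically.

```python
class BaseConverter:
    """Hypothetical parent class, used only for illustration."""

    def __init__(self):
        self.initialized = True


class WithRedundantInit(BaseConverter):
    # Only forwards to the parent, so it adds nothing.
    def __init__(self):
        super().__init__()


class WithoutInit(BaseConverter):
    # No __init__ defined: attribute lookup finds BaseConverter.__init__.
    pass


redundant = WithRedundantInit()
plain = WithoutInit()
same_init = WithoutInit.__init__ is BaseConverter.__init__
```

Both `redundant` and `plain` end up with `initialized` set, and `same_init` confirms that the subclass without its own `__init__` uses the parent's function object directly.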

gemini-code-assist · 2025-12-23T06:15:58Z

dingo/data/datasource/local.py

+                if has_header:
+                    # Use the first row as the column names
+                    headers = [str(h).strip() if h else f'column_{i}' for i, h in enumerate(first_row)]
+                else:
+                    # No header row: use column_x names
+                    headers = [f'column_{i}' for i in range(len(first_row))]
+                    first_row_data = first_row  # Keep the first row's data to process below
+
+                # If the first row is data (has_header=False), process it first
+                if first_row_data is not None:
+                    row_dict = {}
+                    for i, (header, value) in enumerate(zip(headers, first_row_data)):
+                        row_dict[header] = value.strip() if value else ""
+                    yield json.dumps(row_dict, ensure_ascii=False) + '\n'
+
+                # Read row by row and convert to JSON
+                for row in csv_reader:
+                    # Skip empty rows
+                    if not row or all(not cell.strip() for cell in row):
+                        continue
+
+                    # Combine the row data with the headers into a dict
+                    row_dict = {}
+                    for i, header in enumerate(headers):
+                        # If the row has fewer columns than headers, pad with empty strings
+                        if i < len(row):
+                            row_dict[header] = row[i].strip() if row[i] else ""
+                        else:
+                            row_dict[header] = ""
+
+                    # Convert to a JSON string and yield
+                    yield json.dumps(row_dict, ensure_ascii=False) + '\n'


This block of code for handling headers and processing rows has some duplication. The logic for processing the first row (when has_header=False) is repeated in the main loop. This can be refactored to use a single processing loop by creating a unified iterator for data rows. This will make the code more concise and easier to maintain.

I suggest using itertools.chain to combine the first row with the rest of the CSV reader when there's no header. This avoids special handling for the first data row. You'll need to add from itertools import chain to your imports.

    if has_header:
        # The first row is the header
        headers = [str(h).strip() if h else f'column_{i}' for i, h in enumerate(first_row)]
        data_rows = csv_reader
    else:
        # Generate headers and treat the first row as data
        from itertools import chain
        headers = [f'column_{i}' for i in range(len(first_row))]
        data_rows = chain([first_row], csv_reader)

    # Process all data rows in a single loop
    for row in data_rows:
        # Skip empty rows
        if not row or all(not cell.strip() for cell in row):
            continue

        # Combine row data with headers into a dictionary, handling rows with fewer columns
        row_dict = {
            header: (row[i].strip() if row[i] else "") if i < len(row) else ""
            for i, header in enumerate(headers)
        }

        # Yield the JSON string
        yield json.dumps(row_dict, ensure_ascii=False) + '\n'
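
For reference, the `itertools.chain` approach can be exercised on its own. This is a self-contained sketch, not the code in dingo/data/datasource/local.py: the wrapper function `csv_to_json_lines` and the in-memory input are assumptions added so the snippet runs outside the project.

```python
import csv
import io
import json
from itertools import chain
from typing import Iterator, TextIO


def csv_to_json_lines(handle: TextIO, has_header: bool = True) -> Iterator[str]:
    """Yield one JSON line per non-empty CSV row (hypothetical wrapper)."""
    csv_reader = csv.reader(handle)
    first_row = next(csv_reader, None)
    if first_row is None:
        return  # empty input
    if has_header:
        headers = [str(h).strip() if h else f"column_{i}" for i, h in enumerate(first_row)]
        data_rows = csv_reader
    else:
        headers = [f"column_{i}" for i in range(len(first_row))]
        # Put the already-consumed first row back in front of the reader.
        data_rows = chain([first_row], csv_reader)
    for row in data_rows:
        # Skip blank rows and rows containing only whitespace cells.
        if not row or all(not cell.strip() for cell in row):
            continue
        # Pad rows that are shorter than the header list with empty strings.
        row_dict = {
            header: (row[i].strip() if i < len(row) else "")
            for i, header in enumerate(headers)
        }
        yield json.dumps(row_dict, ensure_ascii=False) + "\n"


# Headerless input with a blank line and a short row.
lines = list(csv_to_json_lines(io.StringIO("a,b\n\n c \n"), has_header=False))
```

With `has_header=False`, the first row flows through the same loop as every other row, so the padding and stripping logic exists in only one place.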

docs/dataset/csv.md

gemini-code-assist · 2025-12-23T06:15:58Z

docs/dataset/excel.md

+| 参数 | 类型 | 默认值 |
+|------|-----|--------|


The table header for the Excel file example seems to be a copy-paste error. It shows | 参数 | 类型 | 默认值 | (Parameter | Type | Default value), which belongs to a parameter description table, but the table body contains example data. The header should reflect the data columns, such as | id | content | label |, to match the JSON output example that follows.

Suggested change

-| 参数 | 类型 | 默认值 |
-|------|-----|--------|
+| id | content | label |
+|------|---------|--------|

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

shijinpjlab added 7 commits December 22, 2025 14:58

feat: update gradio image in readme, add limit of label in gradio

0c3c421

feat: support csv

caa6be9

Merge branch 'dev_1222' into dev_csv

16ca843

feat: fix bug. circular import and conflict name

8b2501c

feat: excel md

08adce8

feat: fix bug. label repeat

f4e7996

feat: lint

d060e63

gemini-code-assist bot reviewed Dec 23, 2025


shijinpjlab force-pushed the dev_csv branch from b8df4db to d060e63 · December 23, 2025 06:19

shijinpjlab and others added 4 commits December 23, 2025 14:27

feat: file loc

bc76267

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

feat: update by gemini assist

3bf0567

Merge remote-tracking branch 'origin/dev_csv' into dev_csv

1e2d1aa

feat: lint

5a01ae8

shijinpjlab merged commit 1de03d7 into MigoXLab:dev Dec 23, 2025
2 checks passed

shijinpjlab deleted the dev_csv branch December 31, 2025 08:10
