
[dataviewer] Support preview json/csv large file and optimize thread num#268

Merged: HaiHui886 merged 2 commits into main from os-hhwang on Feb 12, 2025

@HaiHui886 (Collaborator) commented Feb 12, 2025

What is this feature?

Merge the code that adds support for previewing large JSON/CSV files.

MR Summary:

The summary is added by @codegpt.

The Merge Request introduces support for previewing large JSON/CSV files and optimizes the number of threads used for exporting data. Key updates include:

  1. Increased maximum thread number for exports from 4 to 8 and introduced a conversion limit size of 5GB for large file handling.
  2. Enhanced file handling in dataviewer/common/types.go to include total JSON and CSV file sizes, improving file management and processing.
  3. Updated dataset_viewer.go to support efficient data retrieval and catalog generation for large datasets, including handling for parquet files.
  4. Implemented new workflow activities in activity.go for scanning repository files, determining card data, and managing file conversions and uploads, with added support for JSON and CSV file types.
  5. Added unit tests and utility functions to support the new features and ensure robustness.

This MR optimizes data processing capabilities, particularly for large datasets, and improves the system's scalability and efficiency.
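
The two limits in item 1 can be pictured with a small sketch. This is not the MR's code: the names `maxThreadNum`, `convertLimitSize`, `countWithinLimit`, and `runBounded` are hypothetical, "5GB" is assumed to mean 5 GiB, and a plain goroutine semaphore stands in for whatever scheduling the export workflow actually uses.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical names; the MR summary only states the values (8 threads, 5GB).
const (
	maxThreadNum     = 8
	convertLimitSize = int64(5) << 30 // 5 GiB, assuming "5GB" means binary gigabytes
)

// countWithinLimit returns how many leading files fit under limit bytes in total.
func countWithinLimit(sizes []int64, limit int64) int {
	var total int64
	for i, s := range sizes {
		if total+s > limit {
			return i
		}
		total += s
	}
	return len(sizes)
}

// runBounded runs every job with at most maxWorkers in flight at once and
// returns the peak concurrency it observed. maxWorkers must be at least 1.
func runBounded(jobs []func(), maxWorkers int) int {
	sem := make(chan struct{}, maxWorkers)
	var wg sync.WaitGroup
	var mu sync.Mutex
	running, peak := 0, 0
	for _, job := range jobs {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxWorkers jobs are already running
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the counter is decremented
			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()
			job()
			mu.Lock()
			running--
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return peak
}

func main() {
	sizes := []int64{2 << 30, 2 << 30, 2 << 30} // three 2 GiB files
	n := countWithinLimit(sizes, convertLimitSize)
	jobs := make([]func(), n)
	for i := range jobs {
		i := i
		jobs[i] = func() { fmt.Println("exporting file", i) }
	}
	peak := runBounded(jobs, maxThreadNum)
	fmt.Println("files selected:", n, "peak concurrency:", peak)
}
```

Acquiring the semaphore slot before starting the goroutine keeps the number of live goroutines, and not only the number of running jobs, at or below the cap.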

@starship-github

Linter Issue Report

During the code review, a list of issues was found. These issues could affect code quality, maintainability, and consistency. The detailed linter issue report follows:

dataviewer/workflows/activity.go

Lint Issue: undefined: minio

  • Location: Line 779, Column 96
  • Code Context:
    uploadInfo, err := dva.s3Client.PutObject(ctx, dva.cfg.S3.Bucket, objectKey, f, pointer.Size, minio.PutObjectOptions{})
  • Actionable Suggestion: Ensure that the minio package is correctly imported at the top of your file. If it's already imported, check for any typos in the package name or in your usage of the PutObjectOptions struct. If the minio package is not imported, you should add import "github.com/minio/minio-go/v7" to the import block at the beginning of your file.

Please make the suggested changes to improve the code quality.

@starship-github

Possible Issues And Suggestions:

  • Line 54 in dataviewer/workflows/activity.go

    • Comments:
      • The addition of lfsMetaStore to dataViewerActivityImpl struct and its initialization in NewDataViewerActivity function without its usage in any method might indicate dead code or incomplete implementation.
  • dataviewer/component/dataset_viewer.go

    • Comments:
      • The method genDefaultCatalog uses getFilesRowCount to calculate the total number of examples for each split. Ensure that getFilesRowCount efficiently handles large datasets to avoid performance issues.
  • dataviewer/workflows/utils.go

    • Comments:
      • The function CopyJsonArray does not handle the case where the JSON array is empty, which could lead to incorrect behavior or errors when trying to encode an empty object.
    • Suggestions:
      if count == 0 {
          _, err = writer.Write([]byte("[]"))
          if err != nil {
              return 0, fmt.Errorf("write empty json array error: %w", err)
          }
      }
      
  • Line 158 in dataviewer/workflows/utils.go

    • Comments:
      • The CopyFileContext function uses bufio.NewScanner which has a default max token size of 64*1024 bytes. Large lines in files might cause bufio.ErrTooLong errors.
    • Suggestions:
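      // bufio.MaxScanTokenSize is itself 64*1024, so the max argument must be larger to raise the limit (10 MiB here is only an example).
      scanner.Buffer(make([]byte, 64*1024), 10*1024*1024)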
      
  • dataviewer/common/types.go

    • Comments:
      • The RepoFilesClass struct includes TotalJsonSize and TotalCsvSize but does not include a similar field for Parquet files, which might be needed for consistency.
    • Suggestions:
      TotalParquetSize int64
      
  • Line 107 in dataviewer/workflows/utils_test.go

    • Comments:
      • The 'CopyFileContext' and 'CopyJsonArray' functions seem to have a hardcoded limit for copying data which might not be flexible for different file sizes.
  • dataviewer/workflows/workflow_test.go

    • Comments:
      • Changing 'MaxFileSize' to 'ConvertLimitSize' without adjusting logic might cause issues if the new limit is not properly validated against the expected file processing capabilities.
  • Line 27 in dataviewer/workflows/repo_files_test.go

    • Comments:
      • The test 'TestRepoFiles_appendFile' now expects 'JsonlFiles' to have one file after appending, which might not match the intended logic if 'appendFile' should filter by size.
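
The bufio.Scanner comment in the list above can be checked with a short standalone sketch. `countLines` is a hypothetical helper, not the MR's CopyFileContext; it only shows how the second argument of `Scanner.Buffer` controls the longest line that can be scanned. Note that `bufio.MaxScanTokenSize` is itself 64*1024, so the maximum has to be larger than that for the limit to move.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countLines scans input line by line and returns the number of lines read.
// maxLine is the upper bound passed to Scanner.Buffer; a line that fills a
// buffer of that size without a newline makes Scan stop with bufio.ErrTooLong.
func countLines(input string, maxLine int) (int, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	// Start with a 64 KiB buffer; the scanner may grow it up to maxLine.
	scanner.Buffer(make([]byte, 0, 64*1024), maxLine)
	n := 0
	for scanner.Scan() {
		n++
	}
	return n, scanner.Err()
}

func main() {
	// One 100,000-byte line followed by a short line.
	input := strings.Repeat("x", 100000) + "\nshort\n"

	// With the default-sized limit the long line cannot be tokenized.
	_, err := countLines(input, bufio.MaxScanTokenSize)
	fmt.Println("64 KiB limit error:", err)

	// With a 1 MiB limit both lines are read.
	n, err := countLines(input, 1<<20)
	fmt.Println("1 MiB limit lines:", n, "error:", err)
}
```

For inputs where no sensible upper bound on line length exists, `bufio.Reader.ReadString` or `ReadBytes` avoids the token limit altogether, at the cost of holding a whole line in memory.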

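The empty-array concern raised for CopyJsonArray can also be handled structurally. The sketch below is an assumption-laden illustration, not the function from the MR: `copyJSONArray` and `copyJSONArrayString` are hypothetical names, and the item-count `limit` parameter is invented for the example. Because the brackets are written unconditionally, an empty input array comes out as `[]` without a special case.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// copyJSONArray streams up to limit elements of the JSON array in r to w
// and returns how many elements were copied.
func copyJSONArray(r io.Reader, w io.Writer, limit int) (int, error) {
	dec := json.NewDecoder(r)
	tok, err := dec.Token()
	if err != nil {
		return 0, fmt.Errorf("read opening token: %w", err)
	}
	if d, ok := tok.(json.Delim); !ok || d != '[' {
		return 0, fmt.Errorf("expected a JSON array, got %v", tok)
	}
	if _, err := io.WriteString(w, "["); err != nil {
		return 0, fmt.Errorf("write array start: %w", err)
	}
	count := 0
	for count < limit && dec.More() {
		var item json.RawMessage // keeps the element's bytes without re-encoding
		if err := dec.Decode(&item); err != nil {
			return count, fmt.Errorf("decode element %d: %w", count, err)
		}
		if count > 0 {
			if _, err := io.WriteString(w, ","); err != nil {
				return count, fmt.Errorf("write separator: %w", err)
			}
		}
		if _, err := w.Write(item); err != nil {
			return count, fmt.Errorf("write element %d: %w", count, err)
		}
		count++
	}
	// The closing bracket is always written, so an empty array yields "[]".
	if _, err := io.WriteString(w, "]"); err != nil {
		return count, fmt.Errorf("write array end: %w", err)
	}
	return count, nil
}

// copyJSONArrayString is a convenience wrapper for strings.
func copyJSONArrayString(in string, limit int) (string, int, error) {
	var sb strings.Builder
	n, err := copyJSONArray(strings.NewReader(in), &sb, limit)
	return sb.String(), n, err
}

func main() {
	out, n, err := copyJSONArrayString(`[{"a":1},{"a":2},3]`, 2)
	fmt.Println(out, n, err)
	out, n, err = copyJSONArrayString(`[]`, 2)
	fmt.Println(out, n, err)
}
```

Decoding each element into `json.RawMessage` keeps memory use proportional to the largest element instead of the whole file, which is the property that matters for previewing large inputs.
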
MR Evaluation:

This feature is still under test; evaluations are given by AI and might be inaccurate.

After evaluation, the code changes in the Merge Request get score: 100.

Tips

CodeReview Commands (invoked as MR or PR comments)

  • @codegpt /review to trigger a code review.
  • @codegpt /evaluate to trigger code evaluation process.
  • @codegpt /describe to regenerate the summary of the MR.
  • @codegpt /secscan to scan security vulnerabilities for the MR or the Repository.
  • @codegpt /help to get help.

CodeReview Discussion Chat

There are 2 ways to chat with Starship CodeReview:

  • Review comments: Directly reply to a review comment made by StarShip.
    Example:
    • @codegpt How to fix this bug?
  • Files and specific lines of code (under the "Files changed" tab):
    Tag @codegpt in a new review comment at the desired location with your query.
    Examples:
    • @codegpt generate unit testing code for this code snippet.

Note: Be mindful of the bot's finite context window.
It's strongly recommended to break down tasks such as reading entire modules into smaller chunks.
For a focused discussion, use review comments to chat about specific files and their changes, instead of using the MR/PR comments.

CodeReview Documentation and Community

  • Visit our Documentation
    for detailed information on how to use Starship CodeReview.

About Us:

Visit the OpenCSG StarShip website for the Dashboard and detailed information on CodeReview, CodeGen, and other StarShip modules.

Yiling-J previously approved these changes Feb 12, 2025
@Yiling-J Yiling-J self-requested a review February 12, 2025 03:33
@HaiHui886 HaiHui886 merged commit f08c508 into main Feb 12, 2025
6 checks passed
@HaiHui886 HaiHui886 deleted the os-hhwang branch February 12, 2025 03:34
@starship-github

The StarShip CodeReviewer was triggered but terminated because it encountered an issue: The MR state is not opened.

3 similar comments
