Dataset viewer workflow refactor and bug fix#270

Merged
Yiling-J merged 1 commit into main from oss/dataset_viewer_scan_files
Feb 12, 2025

@Yiling-J Yiling-J commented Feb 12, 2025

What this PR includes:

Enabling Tracing for Temporal Workflow
[Screenshot: screenshot-20250211-162605]

Refactor ScanRepoFiles in Dataset Viewer Workflows

The ScanRepoFiles function runs on every git push to a dataset repository. For repositories with many files, it noticeably hurts performance and generates a large volume of Gitaly logs. The function now retrieves files through the new Gitaly Tree API. The effect is similar to that of the tree API optimization PR: for repositories with many files, the number of Gitaly requests drops from 100+ to 1, and the time taken drops from seconds or minutes to microseconds.
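The request-count difference can be illustrated with a small sketch. Everything here is hypothetical: `fakeRepo`, `listDir`, `listTree`, `scanPerDir`, and `scanTree` are illustrative stand-ins, not the Gitaly client or the actual ScanRepoFiles code. The sketch only assumes that the old approach listed one directory per request while the tree API returns the whole tree in a single request.

```go
package main

import (
	"fmt"
	"strings"
)

// File is a simplified, hypothetical tree entry; the real Gitaly types differ.
type File struct {
	Path string
	Type string // "file" or "dir"
}

// fakeRepo is an in-memory stand-in for a Git server that counts requests.
type fakeRepo struct {
	paths    []string // every file path in the repository
	requests int      // number of simulated server requests
}

// listDir returns the direct children of dir; each call costs one request.
func (r *fakeRepo) listDir(dir string) []File {
	r.requests++
	prefix := ""
	if dir != "" {
		prefix = dir + "/"
	}
	seen := map[string]bool{}
	var out []File
	for _, p := range r.paths {
		if !strings.HasPrefix(p, prefix) {
			continue
		}
		name, _, nested := strings.Cut(strings.TrimPrefix(p, prefix), "/")
		full := prefix + name
		if seen[full] {
			continue
		}
		seen[full] = true
		typ := "file"
		if nested {
			typ = "dir"
		}
		out = append(out, File{Path: full, Type: typ})
	}
	return out
}

// listTree returns every entry (directories and files) in a single request.
func (r *fakeRepo) listTree() []File {
	r.requests++
	seen := map[string]bool{}
	var out []File
	for _, p := range r.paths {
		parts := strings.Split(p, "/")
		for i := 1; i < len(parts); i++ {
			dir := strings.Join(parts[:i], "/")
			if !seen[dir] {
				seen[dir] = true
				out = append(out, File{Path: dir, Type: "dir"})
			}
		}
		out = append(out, File{Path: p, Type: "file"})
	}
	return out
}

// scanPerDir walks the repository one directory (and one request) at a time.
func scanPerDir(r *fakeRepo, dir string) []string {
	var files []string
	for _, f := range r.listDir(dir) {
		if f.Type == "dir" {
			files = append(files, scanPerDir(r, f.Path)...)
			continue
		}
		files = append(files, f.Path)
	}
	return files
}

// scanTree fetches the whole tree once and skips directory entries.
func scanTree(r *fakeRepo) []string {
	var files []string
	for _, f := range r.listTree() {
		if f.Type == "dir" {
			continue
		}
		files = append(files, f.Path)
	}
	return files
}

func main() {
	paths := []string{"README.md", "data/train/a.csv", "data/test/c.csv"}

	perDir := &fakeRepo{paths: paths}
	fmt.Println(len(scanPerDir(perDir, "")), "files,", perDir.requests, "requests")

	tree := &fakeRepo{paths: paths}
	fmt.Println(len(scanTree(tree)), "files,", tree.requests, "requests")
}
```

In this toy repository the per-directory walk issues one request per directory (four in total), while the tree scan issues one regardless of how deep the layout is.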

Fix README Dataset Config Parsing

Hugging Face dataset cards allow the definition of dataset configurations in the README using YAML:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.csv"
  - split: test
    path: "holdout/*.csv"
---

Here the path is a wildcard (glob) pattern. Manual testing shows that the ** syntax is also supported, for example data/**/*.csv.

Issue: The current code uses regex for matching instead of the wildcard format, which is completely incompatible with the Hugging Face syntax.

Fix: The code now uses doublestar for matching, since Go's standard path/filepath package does not support the ** syntax. Relevant tests have also been added.
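A minimal sketch of why the two interpretations disagree, using only the Go standard library so it runs without dependencies. It assumes the old code passed the pattern to `regexp` unchanged, which may not match the original implementation exactly; `regexMatch` and `globMatch` are hypothetical helper names. The last case shows the limitation of the standard glob matcher with `**`, which is the gap doublestar is meant to cover.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// regexMatch interprets the pattern as a regular expression
// (assumed to resemble the old, incorrect behaviour).
func regexMatch(pattern, name string) bool {
	ok, err := regexp.MatchString(pattern, name)
	return err == nil && ok
}

// globMatch interprets the pattern as a shell-style wildcard
// using the standard library's path.Match.
func globMatch(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	// As a regex, "/*" means "zero or more slashes" and "." means "any character",
	// so the intended file is rejected and an unrelated name is accepted.
	fmt.Println(regexMatch("data/*.csv", "data/a.csv")) // false
	fmt.Println(regexMatch("data/*.csv", "dataXcsv"))   // true

	// As a glob, the same pattern behaves as the dataset card intends.
	fmt.Println(globMatch("data/*.csv", "data/a.csv")) // true

	// path.Match never lets "*" cross a "/", so "**" cannot span
	// several directory levels.
	fmt.Println(globMatch("data/**/*.csv", "data/x/y/a.csv")) // false
}
```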

Dataset viewer workflow refactor and bug fix

See merge request product/starhub/starhub-server!873
@starship-github

Linter Issue Report

During the code review, a list of issues was found. These issues could affect code quality, maintainability, and consistency. Below is the detailed linter issue report:

dataviewer/workflows/activity.go

Lint Issue: undefined: appendFile

  • Location: Line 156, Column 3
  • Code Context:
    for _, file := range resp.Files {
        if file.Type == "dir" {
            continue
        }
        appendFile(file, &fileClass, scanParam.ConvertLimitSize)
    }
  • Actionable Suggestion: It appears that the function appendFile is being called but it is not defined within the scope of your code or imported from any packages. Ensure that you have defined the appendFile function within your codebase or imported the correct package where appendFile is defined. If appendFile is intended to be a custom function, you should implement it. For example:
    func appendFile(file FileType, fileClass *FileClassType, limitSize int) {
        // Implementation of appendFile
    }

Please make the suggested changes to improve the code quality.

@starship-github

Review Comments And Suggestions:

  • dataviewer/workflows/activity.go

    • Comments:
      • The use of math.MaxInt as a limit might cause an overflow on 32-bit systems.
    • Suggestions:
      +        Limit:     math.MaxInt64,
      
  • dataviewer/workflows/utils.go

    • Comments:
      • The error from doublestar.PathMatch is not handled properly. It should stop the loop or handle the error.
    • Suggestions:
      +            if err != nil {
      +                slog.Error("file pattern match", "error", err)
      +                return nil, err
      +            }
      
  • dataviewer/workflows/utils_test.go

    • Comments:
      • The test case 'foobar/a.csv' is expected not to match, but this is expressed only as an empty expected result, which might be misleading.
  • dataviewer/workflows/activity_test.go

    • Comments:
      • The use of math.MaxInt as a limit might lead to integer overflow or unexpected behavior on 32-bit systems.
    • Suggestions:
      +        Limit:     math.MaxInt64,
      
  • dataviewer/component/callback.go

    • Comments:
      • Replacing ctx with context.Background() in workflow execution ignores the original context's deadline or cancellation.
    • Suggestions:
      +        ctx,
      
  • cmd/csghub-server/cmd/dataviewer/launch.go

    • Comments:
      • The error message "unable to create workflow client" could be more descriptive by including the serviceName.
    • Suggestions:
      +            return fmt.Errorf("unable to create workflow client for service '%s', error: %w", serviceName, err)
      
  • api/workflow/worker_ce.go

    • Comments:
      • Redundant error check after temporal.NewClient. The error check seems to be mistakenly copied.
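The comment on dataviewer/component/callback.go about context.Background() can be made concrete with a small sketch. `startWorkflow`, `runWithCallerContext`, and `runWithDetachedContext` are hypothetical stand-ins, not the Temporal client API or the code in this PR; the sketch only shows how a caller's deadline is honoured or ignored depending on which context is passed.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startWorkflow is a hypothetical stand-in for a blocking call that
// finishes after `work` unless its context is cancelled first.
func startWorkflow(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runWithCallerContext passes the caller's context through, so the
// 10ms deadline cuts the 200ms call short.
func runWithCallerContext() error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	return startWorkflow(ctx, 200*time.Millisecond)
}

// runWithDetachedContext ignores the caller's context and uses
// context.Background(), so the call runs to completion even though
// the caller's deadline has already passed.
func runWithDetachedContext() error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	_ = ctx // the caller's deadline is deliberately not propagated
	return startWorkflow(context.Background(), 50*time.Millisecond)
}

func main() {
	fmt.Println(runWithCallerContext())   // context deadline exceeded
	fmt.Println(runWithDetachedContext()) // <nil>
}
```

Detaching can be a deliberate choice, for example when a callback handler returns before the work it started has finished. In that case, Go 1.21's context.WithoutCancel keeps the values of the original context while dropping its cancellation.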

MR Evaluation:

This feature is still under test; evaluations are given by AI and might be inaccurate.

After evaluation, the code changes in the Merge Request received a score of 100.

Tips

CodeReview Commands (invoked as MR or PR comments)

  • @codegpt /review to trigger a code review.
  • @codegpt /evaluate to trigger code evaluation process.
  • @codegpt /describe to regenerate the summary of the MR.
  • @codegpt /secscan to scan security vulnerabilities for the MR or the Repository.
  • @codegpt /help to get help.

CodeReview Discussion Chat

There are 2 ways to chat with Starship CodeReview:

  • Review comments: Directly reply to a review comment made by StarShip.
    Example:
    • @codegpt How to fix this bug?
  • Files and specific lines of code (under the "Files changed" tab):
    Tag @codegpt in a new review comment at the desired location with your query.
    Examples:
    • @codegpt generate unit testing code for this code snippet.

Note: Be mindful of the bot's finite context window.
It's strongly recommended to break down tasks such as reading entire modules into smaller chunks.
For a focused discussion, use review comments to chat about specific files and their changes, instead of using the MR/PR comments.

CodeReview Documentation and Community

  • Visit our Documentation
    for detailed information on how to use Starship CodeReview.

About Us:

Visit the OpenCSG StarShip website for the Dashboard and detailed information on CodeReview, CodeGen, and other StarShip modules.

@Yiling-J Yiling-J requested a review from HaiHui886 February 12, 2025 04:07

@HaiHui886 HaiHui886 left a comment

lgtm

@Yiling-J Yiling-J merged commit c080dcd into main Feb 12, 2025
6 checks passed
@Yiling-J Yiling-J deleted the oss/dataset_viewer_scan_files branch February 12, 2025 05:08
@starship-github

The StarShip CodeReviewer was triggered but terminated because it encountered an issue: The MR state is not opened.