
Conversation

@krrish175-byte commented Jan 29, 2026

What type of PR is this?

/kind documentation
/kind proposal

What this PR does / why we need it:

This PR adds a comprehensive technical proposal for implementing parallel processing of test cases in Ianvs core, as requested in issue #8 and discussed in PR #308.

The proposal addresses the need to reduce benchmarking time when testing multiple parameter configurations or algorithms. Currently, Ianvs executes test cases serially, which can lead to excessive execution times (e.g., 5 test cases × 2 hours each = 10 hours total). This feature will enable concurrent execution across multiple CPU cores, significantly reducing total benchmarking time.

Proposal Contents:

  • Motivation & Problem Statement: Why parallel processing is essential for all Ianvs developers
  • Architecture Design: Detailed technical design showing integration with Ianvs core components
  • Backward Compatibility Analysis: Comprehensive demonstration that all existing examples continue to work without modification
  • Impact Assessment: Analysis of how this affects current running examples across all scenarios
  • Testing & Validation Strategy: Plan for validating all existing examples in both serial and parallel modes
  • Implementation Roadmap: Phased 4-week approach with clear milestones
  • Risk Assessment: Identified risks and mitigation strategies
  • Performance Analysis: Expected speedup metrics (3-6x for typical workloads)

Key Design Principles:

  1. Backward Compatible: Parallel execution is opt-in; default behavior remains serial
  2. Zero Breaking Changes: All existing examples and workflows continue to function unchanged
  3. Robust Error Handling: Failures in one test case don't crash the entire benchmarking job
  4. User Control: Flexible configuration via CLI flags and YAML configuration (a sketch follows this list)
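
To make the opt-in behavior concrete, here is a minimal sketch of what the configuration could look like. The parallel_execution and num_workers keys and the -p/--parallel and -w/--workers flags are the ones named later in this thread; the surrounding file layout and exact key placement are assumptions for illustration only, not the approved design.

# benchmarkingjob.yaml (hypothetical placement of the proposed keys)
benchmarkingjob:
  name: "example-job"
  workspace: "./workspace"
  parallel_execution: true   # opt-in; omitted or false keeps today's serial behavior
  num_workers: 4             # optional; size of the worker pool

# equivalent invocation via the proposed CLI flags
ianvs -f benchmarkingjob.yaml -p -w 4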

Which issue(s) this PR fixes:

Related to #8
Related to #308

Special notes for reviewers:

This proposal is intended for community review and architectural discussion before proceeding with implementation, as requested by @MooreZheng in PR #308.

The proposal will be presented at the next KubeEdge SIG AI meeting for formal architectural review. Feedback on the following aspects is particularly welcome:

  • Overall architecture and design approach
  • Impact analysis completeness
  • Testing and validation strategy
  • Implementation timeline and phases
  • Any concerns about backward compatibility or existing examples

Next Steps:

  1. Community review of this proposal
  2. Presentation at KubeEdge SIG AI meeting (pending schedule)
  3. Revisions based on community feedback
  4. Approval for implementation
  5. Update PR #308 (docs: Add parallel processing feature proposal for issue #8) with the implementation based on the approved design

Note: This is a documentation-only PR. The actual implementation will be submitted separately after proposal approval.

/cc @MooreZheng @hsj576 @Poorunga
/assign @MooreZheng

@kubeedge-bot added the do-not-merge/invalid-commit-message and kind/feature labels on Jan 29, 2026
@kubeedge-bot (Collaborator) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: krrish175-byte
To complete the pull request process, please assign moorezheng after the PR has been reviewed.
You can assign the PR to them by writing /assign @moorezheng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist commented

Summary of Changes

Hello @krrish175-byte, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Ianvs benchmarking system by introducing parallel processing capabilities for test case execution. Previously, test cases ran serially, which could lead to excessive execution times for complex or numerous tests. The changes enable concurrent execution across multiple CPU cores, drastically reducing overall benchmarking duration. This is achieved through new command-line interface options, corresponding configuration file parameters, and a refactored internal mechanism that leverages Python's multiprocessing to distribute workloads efficiently and robustly.

Highlights

  • Parallel Execution: Implemented parallel processing for test case execution using concurrent.futures.ProcessPoolExecutor, allowing multiple test cases to run concurrently (a sketch of this pattern follows the list).
  • CLI Arguments: Introduced new command-line flags: -p / --parallel to enable parallel execution and -w / --workers to specify the number of worker processes.
  • Configuration Support: Added support for parallel_execution (boolean) and num_workers (integer) parameters within the benchmarking job configuration files.
  • Code Refactoring: Extracted the core test case execution logic into a new top-level, picklable function run_testcase_func to ensure compatibility with Python's multiprocessing module.
  • Robust Error Handling: Enhanced error handling during parallel execution, ensuring that a failure in one test case does not halt the entire benchmarking job, with errors being logged and other cases continuing to process.
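
To ground these highlights, below is a self-contained sketch of the dispatch pattern they describe, built around the names that appear in this thread (run_testcase_func, future_to_testcase). It illustrates the approach only; the run_all helper and its signature are invented for the example and are not the actual Ianvs code.

import os
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_testcase_func(testcase, workspace):
    """Top-level (hence picklable) worker: run one test case and report its outcome."""
    try:
        res = testcase.run(workspace)
        return {"status": "success", "config": testcase.algorithm.name, "results": res}
    except Exception as e:  # a failure in one case must not abort the whole job
        return {"status": "failed", "config": testcase.algorithm.name, "error": str(e)}

def run_all(test_cases, workspace, workers=None):  # hypothetical helper, not Ianvs API
    workers = workers or max(1, (os.cpu_count() or 2) - 1)
    with ProcessPoolExecutor(max_workers=workers) as executor:
        future_to_testcase = {
            executor.submit(run_testcase_func, tc, workspace): tc for tc in test_cases
        }
        return [future.result() for future in as_completed(future_to_testcase)]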


@kubeedge-bot added the size/L label on Jan 29, 2026
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces parallel processing for test case execution, which is a great feature for improving performance on multi-core systems. The implementation using concurrent.futures.ProcessPoolExecutor is solid. I've made a few suggestions to improve robustness, efficiency, and maintainability. Specifically, I've pointed out a potential crash when determining the number of worker processes, suggested an optimization to reduce data transfer from worker processes, and recommended using the project's standard logger for consistency. Overall, this is a valuable addition.

if parallel:
    # Determine number of workers
    if workers is None:
        workers = max(1, os.cpu_count() - 1)


Severity: high

os.cpu_count() can return None on some platforms, which would cause a TypeError when you try to subtract 1 from it. You should handle this case to prevent the program from crashing.

Suggested change:
-        workers = max(1, os.cpu_count() - 1)
+        workers = max(1, (os.cpu_count() or 2) - 1)
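
Worth noting: with this change, when os.cpu_count() returns None the expression evaluates to max(1, (None or 2) - 1) = 1, so the job degrades gracefully to a single worker instead of raising a TypeError.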

Comment on lines 124 to 127
if k == "parallel_execution":
    self.parallel = v
if k == "num_workers":
    self.workers = v


Severity: medium

This logic for parsing parallel execution parameters is a bit disconnected from the main configuration parsing structure. For better readability and maintainability, consider integrating it into the if/elif/else block above. This would make the flow of control clearer.

# Run training with config parameters
# Evaluate and collect metrics
res = testcase.run(workspace)
return {"status": "success", "config": testcase.algorithm.name, "results": res, "testcase": testcase}


Severity: medium

Returning the full testcase object from the worker process is unnecessary and can be inefficient. The testcase object is already available in the main process through the future_to_testcase map. Removing it from the returned dictionary will reduce serialization/deserialization overhead.

Suggested change:
-    return {"status": "success", "config": testcase.algorithm.name, "results": res, "testcase": testcase}
+    return {"status": "success", "config": testcase.algorithm.name, "results": res}
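
For context, the main-process side already has each test case at hand through the map built at submission time; a sketch, assuming the future_to_testcase dict mentioned above:

for future in as_completed(future_to_testcase):
    testcase = future_to_testcase[future]  # recovered locally, never shipped back from the worker
    result = future.result()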

    res = testcase.run(workspace)
    return {"status": "success", "config": testcase.algorithm.name, "results": res, "testcase": testcase}
except Exception as e:
    return {"status": "failed", "config": testcase.algorithm.name, "error": str(e), "testcase": testcase}


Severity: medium

Returning the full testcase object from the worker process is unnecessary and can be inefficient. The testcase object is already available in the main process through the future_to_testcase map. Removing it from the returned dictionary will reduce serialization/deserialization overhead.

Suggested change:
-    return {"status": "failed", "config": testcase.algorithm.name, "error": str(e), "testcase": testcase}
+    return {"status": "failed", "config": testcase.algorithm.name, "error": str(e)}

if workers is None:
    workers = max(1, os.cpu_count() - 1)

print(f"Running {len(self.test_cases)} test cases on {workers} workers")


Severity: medium

This uses print for logging, but the project seems to have a configured LOGGER (e.g., in core/cmd/benchmarking.py). For consistent and manageable logging, it's better to use the logger instance. This also applies to the print statements on lines 82 and 84. Consider using LOGGER.info, LOGGER.warning, or LOGGER.error as appropriate. You will need to import LOGGER from core.common.log.

Suggested change:
-print(f"Running {len(self.test_cases)} test cases on {workers} workers")
+LOGGER.info(f"Running {len(self.test_cases)} test cases on {workers} workers")
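
For completeness, the corresponding import, as stated above (assuming the project's existing core.common.log module):

from core.common.log import LOGGER

LOGGER.info(f"Running {len(self.test_cases)} test cases on {workers} workers")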

@kubeedge-bot added the size/M label and removed the do-not-merge/invalid-commit-message and size/L labels on Jan 29, 2026
@krrish175-byte (Author) commented

Hi @MooreZheng, can you please provide an update on this PR?

@MooreZheng requested review from hsj576 and removed the request for Poorunga on February 2, 2026 01:21
@MooreZheng (Collaborator) commented Feb 2, 2026

> Hi @MooreZheng, can you please provide an update on this PR?

Welcome, Krrish. The work will be appreciated.

This is related to issue #8. Please note that parallel processing is an important feature in the ianvs core code that will impact all ianvs developers' work, past, present, and future. For such an important feature, you need to provide a proposal for community reviewers showing how it would affect all currently running examples. Then a formal presentation is needed to launch a review of the architecture design in the KubeEdge SIG AI before getting to any implementation.

See a proposal example in https://github.com/kubeedge/ianvs/blob/main/docs/proposals/scenarios/GovDoc2Poster/GovDoc2Poster.md

@kubeedge-bot added the size/XXL label and removed the size/M label on Feb 2, 2026
@krrish175-byte changed the title from "feat: Add parallel processing for multiple test case execution" to "feat: Proposal for parallel processing in Ianvs" on Feb 2, 2026
@krrish175-byte changed the title from "feat: Proposal for parallel processing in Ianvs" to "docs: Add parallel processing feature proposal for issue #8" on Feb 2, 2026