
Migrate stressng and uperf workloads from benchmark-operator/snafu to native resource creation #1186

Open
arpsharm wants to merge 1 commit into main from migrate-native-workloads


@arpsharm
Collaborator

@arpsharm arpsharm commented Mar 13, 2026

What changed

  • Remove benchmark-operator and snafu/benchmark-wrapper dependencies
  • Pod workloads use native Kubernetes Jobs
  • VM workloads use native KubeVirt VirtualMachines with cloud-init
  • VM results extracted via qemu-guest-agent
  • Added helper methods to oc.py for pod/VM introspection and guest-agent operations
  • Replaced all subprocess.run calls with oc.py methods
  • Added @typechecked annotations, moved initializations to __init__
  • New templates for native Job and VirtualMachine creation
  • Fixed stressng_timeout variable collision with general timeout env var
  • ES upload handled by benchmark-runner (pod workloads) and cloud-init curl (stressng pod)
  • 95th percentile latency with numpy-equivalent linear interpolation
  • All ES fields match OG schema for Grafana compatibility
  • Prometheus metrics populated
  • 380 golden files auto-regenerated from template changes
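
The "95th percentile latency with numpy-equivalent linear interpolation" item can be illustrated with a minimal sketch. The function name `percentile_linear` is hypothetical and is not taken from this PR; the sketch assumes the goal is to match the default linear method of `numpy.percentile` without importing numpy.

```python
import math
from typing import Sequence


def percentile_linear(values: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0 <= q <= 100) of values.

    Interpolates linearly between the two closest ranks, which is
    what numpy.percentile does by default.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    # Fractional 0-based rank: 0 maps to the minimum, n - 1 to the maximum.
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    # Weight the gap between the two neighbouring samples by the fractional part.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
```

For four samples `[1, 2, 3, 4]` the 95th percentile rank is 2.85, so the result lies 85% of the way from 3 to 4.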

@openshift-ci

openshift-ci bot commented Mar 13, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@arpsharm arpsharm force-pushed the migrate-native-workloads branch 2 times, most recently from 594eece to 950a41c Compare March 13, 2026 05:21
@arpsharm arpsharm requested a review from ebattat March 13, 2026 05:28
@arpsharm
Collaborator Author

/test all

@ebattat
Member

ebattat commented Mar 15, 2026

@arpsharm,
let's have a meeting regarding it

@arpsharm arpsharm changed the title from "Migrate stressng and uperf workloads to native Kubernetes" to "Migrate stressng and uperf workloads from benchmark-operator to direct resource creation" Mar 16, 2026
@arpsharm arpsharm force-pushed the migrate-native-workloads branch 2 times, most recently from 98c26bb to fe01110 Compare March 17, 2026 12:18
@arpsharm
Collaborator Author

/test all

@arpsharm arpsharm force-pushed the migrate-native-workloads branch from fe01110 to 360e410 Compare March 17, 2026 13:32
@arpsharm
Collaborator Author

/test all

@arpsharm arpsharm force-pushed the migrate-native-workloads branch 8 times, most recently from fb7a63b to 8012f77 Compare March 24, 2026 06:04
@openshift-ci

openshift-ci bot commented Mar 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: arpsharm
Once this PR has been reviewed and has the lgtm label, please assign robertkrawitz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@arpsharm arpsharm force-pushed the migrate-native-workloads branch from 8012f77 to 66fb387 Compare March 24, 2026 08:10
@arpsharm arpsharm marked this pull request as ready for review March 24, 2026 11:26
@openshift-ci openshift-ci bot requested a review from RobertKrawitz March 24, 2026 11:26
@arpsharm arpsharm marked this pull request as draft March 24, 2026 11:26
@arpsharm arpsharm marked this pull request as ready for review March 24, 2026 11:26
@arpsharm arpsharm marked this pull request as draft March 24, 2026 11:27
'node_range': self._environment_variables_dict.get('node_range', ''),
'pod_id': '',
'hostnetwork': self._environment_variables_dict.get('hostnetwork', 'False')
}
Member

Try to create uperf_data.yaml with default values

self._environment_variables_dict['test_user'] = os.environ.get('TEST_USER', 'ripsaw')
self._environment_variables_dict['port'] = os.environ.get('PORT', '30000')
self._environment_variables_dict['run_id'] = os.environ.get('RUN_ID', 'NA')

Member

Add this to __init__ and put the values inside the YAML.
Please check in Elasticsearch whether we need all of these fields, and remove any that are not needed, for example:
self._environment_variables_dict['test_user'] = os.environ.get('TEST_USER', 'ripsaw')
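
One way to follow this suggestion is to declare the defaults once and resolve them in a single place. The sketch below is an assumption, not code from this PR: `UPERF_DEFAULTS` and `resolve_uperf_settings` are hypothetical names, and the plain mapping stands in for whatever `uperf_data.yaml` would contain. Only the variable names and default values come from the snippet under review.

```python
import os
from typing import Dict, Mapping, Optional, Tuple

# Environment variable name -> (dictionary key, default value).
# The three entries mirror the lines quoted in the review.
UPERF_DEFAULTS: Dict[str, Tuple[str, str]] = {
    "TEST_USER": ("test_user", "ripsaw"),
    "PORT": ("port", "30000"),
    "RUN_ID": ("run_id", "NA"),
}


def resolve_uperf_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the settings, letting the environment override each default.

    Intended to be called once (for example from __init__) so the
    lookups are not repeated in every method.
    """
    env = os.environ if environ is None else environ
    return {key: env.get(name, default) for name, (key, default) in UPERF_DEFAULTS.items()}
```

Passing the environment as a parameter keeps the function easy to test; fields that turn out to be unused in Elasticsearch can then be dropped from the mapping in one place.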

for _ in range(30):
    if not self._oc.vm_exists(vm_name=self.__client_vm_name):
        break
    time.sleep(1)
Member

Please use the existing method delete_vm_sync.
Go over every place that uses subprocess.run, check whether the oc class already has a method for it, and always use the sync method.
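
For readers unfamiliar with the pattern, a "sync" method issues the operation and then blocks until the cluster reflects it, so callers need no ad-hoc polling loops. The sketch below is illustrative only: `wait_until` and `delete_sync` are hypothetical helpers, the `delete` and `exists` callables stand in for real oc calls, and nothing here reproduces the actual `delete_vm_sync` implementation.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float, interval: float = 1.0) -> bool:
    """Poll condition() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def delete_sync(delete: Callable[[], None], exists: Callable[[], bool],
                timeout: float, interval: float = 1.0) -> bool:
    """Issue the delete, then block until the resource is gone.

    Returns True if the resource disappeared within the timeout.
    """
    delete()
    return wait_until(lambda: not exists(), timeout=timeout, interval=interval)
```

Using a monotonic deadline instead of a fixed iteration count keeps the wait accurate even when each check is slow.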

yaml_path = os.path.join(f'{self._run_artifacts_path}', f'{self.__name}.yaml')
apply_cmd = f"oc apply -f {yaml_path}"
result = subprocess.run(apply_cmd, shell=True, capture_output=True, text=True)
if result.returncode != 0:
Member

Use create_vm_sync.


# Wait for client workload to complete by polling for signal file via guest agent
logger.info("Waiting for uperf client workload to complete...")
max_wait = 600 # 10 minutes timeout
Member

Take the timeout from an environment variable or from the oc class.

self._environment_variables_dict['clustername'] = cluster_name
self._environment_variables_dict['test_user'] = os.environ.get('TEST_USER', 'ripsaw')
self._environment_variables_dict['port'] = os.environ.get('PORT', '30000')
self._environment_variables_dict['run_id'] = os.environ.get('RUN_ID', 'NA')
Member

Please try to put this in __init__.

time.sleep(5)

# Re-generate client YAML with server IP (template needs it)
from benchmark_runner.common.template_operations.template_operations import TemplateOperations
Member

Move this import to the beginning of the file.

# Re-generate client YAML with server IP (template needs it)
from benchmark_runner.common.template_operations.template_operations import TemplateOperations
template_ops = TemplateOperations(workload=self._workload)
template_ops.set_environment_variables(self._environment_variables_dict)
Member

Add it in __init__.

logger.info(f"Client IP: {client_ip}")

# Get pod logs using oc command
logs_cmd = f"oc logs -n {self._environment_variables_dict['namespace']} {client_pod}"
Member

This already exists in the oc file: save_pod_log.

self.__server_job_name = ''
self.__client_job_name = ''

def _parse_uperf_pod_logs(self, pod_logs, server_ip, server_node, client_node, pod_id, client_ip):
Member

Please add type annotations for pod_logs, server_ip, server_node, client_node, pod_id and client_ip, and enforce them with the decorators:
@typechecked
@logger_time_stamp

@arpsharm arpsharm force-pushed the migrate-native-workloads branch from 66fb387 to ddb6980 Compare March 25, 2026 11:36
@arpsharm arpsharm changed the title from "Migrate stressng and uperf workloads from benchmark-operator to direct resource creation" to "Migrate stressng and uperf workloads from benchmark-operator/snafu to native resource creation" Mar 25, 2026
@arpsharm arpsharm force-pushed the migrate-native-workloads branch 2 times, most recently from 7922313 to 15cb446 Compare March 26, 2026 13:57
@arpsharm arpsharm marked this pull request as ready for review March 26, 2026 14:15
@openshift-ci openshift-ci bot requested a review from ebattat March 26, 2026 14:15
@arpsharm arpsharm force-pushed the migrate-native-workloads branch from 15cb446 to c86e5fd Compare March 31, 2026 13:33
logger.info("Server VM is ready, getting server IP")

# Get server VMI IP - retry until IP is assigned
namespace = self._environment_variables_dict['namespace']
Member

This should be in __init__.

logger.warning(f"virtctl ssh error: {e}")
return None

def wait_for_virtctl_ssh(self, vm_name: str, namespace: str = '', key_path: str = '', username: str = 'fedora', timeout: int = 180) -> bool:
Member

username: str = 'fedora' => make this an environment variable

self.__server_vm_name = f'uperf-server-{self._trunc_uuid}'
self.__client_vm_name = f'uperf-client-{self._trunc_uuid}'
self.__template_ops = TemplateOperations(workload=self._workload)
self.__ssh_key_path = self._environment_variables_dict.get('ssh_key_path', '/tmp/benchmark-runner-ssh-key')
Member

We need a dynamic key that is generated for every workload.


# Wait for SSH to be ready on client VM
logger.info("Waiting for SSH on client VM...")
self._oc.wait_for_virtctl_ssh(vm_name=self.__client_vm_name, namespace=namespace, key_path=self.__ssh_key_path, username='fedora', timeout=180)
Member

username should be an environment variable.

workload_complete = False

for elapsed in range(0, max_wait, poll_interval):
    check_result = self._oc.virtctl_ssh(vm_name=self.__client_vm_name, command='test -f /opt/uperf/workload_complete.signal && echo done', namespace=namespace, key_path=self.__ssh_key_path, username='fedora')
Member

Virtctl class

def wait_for_vm_workload_completed(file_path, local_path) => should be in the virtctl directory

  1. def ssh ready
  2. def wait for file created
  3. def scp the file to local
     ** do not use a hard-coded pem secret

uperf_vm.py

def parse uperf vm result

workload_operation.py => put the log parser here if it is the same for uperf and stressng
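
The three steps proposed above could be composed roughly as follows. This is a sketch under stated assumptions: `VirtctlWorkflow` is a hypothetical name, and the `run_ssh` and `copy_file` callables are injected stand-ins for the real virtctl ssh and scp calls, whose exact arguments are not shown in this conversation.

```python
import shlex
import time
from typing import Callable, Optional


class VirtctlWorkflow:
    """Hypothetical outline: wait for SSH, wait for a file, copy it locally."""

    def __init__(self, run_ssh: Callable[[str], Optional[str]],
                 copy_file: Callable[[str, str], None],
                 timeout: float, interval: float = 5.0) -> None:
        self._run_ssh = run_ssh      # runs a command in the VM; stdout, or None on failure
        self._copy_file = copy_file  # copies remote_path to local_path
        self._timeout = timeout
        self._interval = interval

    def _poll(self, check: Callable[[], bool]) -> bool:
        """Repeat check() until it succeeds or the timeout expires."""
        deadline = time.monotonic() + self._timeout
        while not check():
            if time.monotonic() >= deadline:
                return False
            time.sleep(self._interval)
        return True

    def wait_for_ssh_ready(self) -> bool:
        # Step 1: SSH is ready once a trivial command succeeds.
        return self._poll(lambda: self._run_ssh("true") is not None)

    def wait_for_file_created(self, file_path: str) -> bool:
        # Step 2: the workload signals completion by creating file_path.
        command = f"test -f {shlex.quote(file_path)} && echo done"
        return self._poll(lambda: (self._run_ssh(command) or "").strip() == "done")

    def wait_for_vm_workload_completed(self, file_path: str, local_path: str) -> bool:
        if not self.wait_for_ssh_ready():
            return False
        if not self.wait_for_file_created(file_path):
            return False
        # Step 3: copy the result file to the local machine.
        self._copy_file(file_path, local_path)
        return True
```

Injecting the transport keeps the key handling out of this class, which fits the request not to hard-code a pem secret, and lets the workload-specific parsing stay in uperf_vm.py.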

@@ -0,0 +1,43 @@
apiVersion: v1
Member

uperf_vm_secret_template.yaml

check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
with open(f'{ssh_key_path}.pub', 'r') as f:
self._environment_variables_dict['ssh_public_key'] = f.read().strip()
self._environment_variables_dict['ssh_key_path'] = ssh_key_path
Member

self._ssh_key_path = generate_ssh_key()
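
A possible shape for such a helper, combined with the earlier request for a key that is generated for every workload, is sketched below. The function names, the `directory` and `workload` parameters, and the choice of an ed25519 key are assumptions for illustration; only the `ssh-keygen` invocation style and the `.pub` suffix follow the snippet under review.

```python
import os
import subprocess
import uuid
from typing import List


def build_key_path(directory: str, workload: str) -> str:
    """Return a key path that is unique per call, so runs never share a key."""
    return os.path.join(directory, f"{workload}-ssh-key-{uuid.uuid4().hex}")


def build_keygen_command(key_path: str) -> List[str]:
    """ssh-keygen arguments: ed25519 key, empty passphrase, quiet, written to key_path."""
    return ["ssh-keygen", "-t", "ed25519", "-N", "", "-q", "-f", key_path]


def generate_ssh_key(directory: str, workload: str) -> str:
    """Create a fresh key pair and return the private key path.

    The public key is written next to it as key_path + '.pub'.
    """
    key_path = build_key_path(directory, workload)
    subprocess.run(build_keygen_command(key_path), check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return key_path
```

Passing the command as a list avoids `shell=True`, and returning the path lets the caller store it once, for example as `self._ssh_key_path`.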

- mkdir -p /opt/uperf && chmod 777 /opt/uperf
- systemctl enable --now qemu-guest-agent
- for nic in $(ls /sys/class/net/ | grep -v lo); do ethtool -L $nic combined $(nproc) 2>/dev/null; done || true
- uperf -s -P 30000 > /opt/uperf/server.log 2>&1 &
Member

  • export HOME=/root
  • export TMP=/tmp
  • export TEMP=/tmp
  • /tmp/uperf.log
  • python3 uperf_parser.py /tmp/uperf.log => should generate /tmp/uperf.json (create a ConfigMap containing uperf_parser.py in the same YAML as the cloud-init)

uperf_vm.py
-- so we need to wait for /tmp/uperf.json
-- copy /tmp/uperf.json to local

@arpsharm arpsharm force-pushed the migrate-native-workloads branch from c86e5fd to c6ea527 Compare April 2, 2026 07:54
@arpsharm
Collaborator Author

arpsharm commented Apr 2, 2026

/test all

@arpsharm arpsharm force-pushed the migrate-native-workloads branch from c6ea527 to cbde4b2 Compare April 2, 2026 09:58

@typechecked
def wait_for_file_created(self, vm_name: str, file_path: str, namespace: str = '', key_path: str = '', username: str = '', timeout: int = 3600) -> bool:
"""
Member

timeout: int = 3600 => please take the timeout from the environment variable, because some workloads run for more than an hour
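
A minimal sketch of resolving the timeout from the environment, with the hard-coded value kept only as a fallback. The function name `resolve_timeout` and the variable name `TIMEOUT` are assumptions; the PR's actual environment variable name is not shown in this conversation.

```python
import os
from typing import Mapping, Optional


def resolve_timeout(environ: Optional[Mapping[str, str]] = None,
                    name: str = "TIMEOUT", default: int = 3600) -> int:
    """Return the timeout in seconds from the environment, or default if unset."""
    env = os.environ if environ is None else environ
    raw = env.get(name, "").strip()
    if not raw:
        # Variable missing or empty: fall back to the default.
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{name} must be a positive number of seconds, got {value}")
    return value
```

A method such as `wait_for_file_created` could then accept the resolved value from its caller rather than carrying its own one-hour default.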

@arpsharm arpsharm force-pushed the migrate-native-workloads branch from cbde4b2 to acc8202 Compare April 3, 2026 04:58