
Conversation

@vcandapp
Contributor

@vcandapp vcandapp commented Jun 3, 2025

With nmstate-provider set as default for os-net-config, default cleanup during greenfield/update/adoption is not desired.

Relates-to: os-net-config/os-net-config#217

Jira(s): https://issues.redhat.com/browse/OSPRH-16537
https://issues.redhat.com/browse/OSPRH-17036
https://issues.redhat.com/browse/OSPRH-16760

@rabi
Contributor

rabi commented Jun 3, 2025

Why is this a problem, as it should clean only the devices not in the config [1]? To me it looks like the implementation is incorrect for the nmstate provider, i.e. it cleans up all available interfaces [2].

[1] https://github.com/os-net-config/os-net-config/blob/master/os_net_config/impl_nmstate.py#L2356
[2] https://github.com/os-net-config/os-net-config/blob/master/os_net_config/impl_nmstate.py#L716
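The distinction rabi draws here (clean only devices absent from the config, versus treating every available interface as a cleanup candidate) can be sketched as follows. This is an illustrative Python sketch, not the actual os-net-config code; both helper names are hypothetical.

```python
# Illustrative sketch of the two cleanup behaviours discussed above.
# These helpers are hypothetical; they are not os-net-config APIs.

def interfaces_to_clean(system_interfaces, configured_interfaces):
    """Cleanup as intended: remove only devices present on the system
    but absent from the desired configuration."""
    configured = set(configured_interfaces)
    return [iface for iface in system_interfaces if iface not in configured]

def cleanup_all(system_interfaces):
    """The behaviour being criticised for the nmstate provider:
    every available interface is considered for cleanup."""
    return list(system_interfaces)

# With only eth0 configured, eth1 and vlan20 are the cleanup candidates:
print(interfaces_to_clean(["eth0", "eth1", "vlan20"], ["eth0"]))
# → ['eth1', 'vlan20']
```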

@vcandapp
Contributor Author

vcandapp commented Jun 3, 2025

> Why is this a problem, as it should clean only the devices not in the config [1]? To me it looks like the implementation is incorrect for the nmstate provider, i.e. it cleans up all available interfaces [2].
>
> [1] https://github.com/os-net-config/os-net-config/blob/master/os_net_config/impl_nmstate.py#L2356 [2] https://github.com/os-net-config/os-net-config/blob/master/os_net_config/impl_nmstate.py#L716

Cleanup might be required only for certain un-provision use cases. It is also not required for the network_update use case. Hence we are setting the default to false; the user can enable cleanup only when necessary.
cc: @karthiksundaravel

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/23629d0511aa45048e3745c6afb27c02

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 01m 47s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 20m 39s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 33m 38s
✔️ noop SUCCESS in 0s
✔️ edpm-ansible-tempest-multinode SUCCESS in 1h 46m 06s
❌ adoption-standalone-to-crc-ceph-provider FAILURE in 1h 40m 41s

@vcandapp
Contributor Author

vcandapp commented Jun 3, 2025

recheck

@rabi
Contributor

rabi commented Jun 4, 2025

> Why is this a problem, as it should clean only the devices not in the config [1]? To me it looks like the implementation is incorrect for the nmstate provider, i.e. it cleans up all available interfaces [2].
>
> [1] https://github.com/os-net-config/os-net-config/blob/master/os_net_config/impl_nmstate.py#L2356 [2] https://github.com/os-net-config/os-net-config/blob/master/os_net_config/impl_nmstate.py#L716

> Cleanup might be required only for certain un-provision use cases. It is also not required for the network_update use case. Hence we are setting the default to false; the user can enable cleanup only when necessary. cc: @karthiksundaravel

If we only clean up the interfaces not in the config (like we filter for the ifconfig provider [1]), there should not be any issue. Also, in the case of "edpm_network_config_update: true", where the config has changed in the next deploy, one has to clean up the old configured devices not in the config. You should fix how the cleanup flag is used with the nmstate provider rather than changing this.

@karthiksundaravel

Yes, I understand there is a gap in both the ifcfg cleanup and the nmstate cleanup; we will address that soon. There are certain restrictions during parameter update, and it is not possible to clean up every configuration that is not defined in the new config.yaml. Cleanup could be an optional parameter in greenfield, and the fact that cleanup has not been required all these years during network update is good enough evidence that it is redundant in most use cases. Hence requesting the default value of false.

@rabi
Contributor

rabi commented Jun 4, 2025

> cleanup is not something required all these years during network update

That's incorrect, we have had multiple cases where we suggested it to customers, though the default was false. Before we change this, we should test the multi-namespace use case https://issues.redhat.com/browse/OSPRH-14156 using a VLAN-tagged ctlplane, which required this change.

@karthiksundaravel

Sorry, the EPIC doesn't provide all the details about why cleanup is required. And if cleanup is required, it could be optionally enabled, couldn't it? A default value of true is going to silently affect many deployments during updates/upgrades, and where cleanup is required we always have a means to override the default. IMHO this reduces the complexity for most configurations and use cases.

@Jaganathancse
Contributor

> > cleanup is not something required all these years during network update
>
> That's incorrect, we have had multiple cases where we suggested it to customers, though the default was false. Before we change this, we should test the multi-namespace use case https://issues.redhat.com/browse/OSPRH-14156 using a VLAN-tagged ctlplane, which required this change.

@rabi Looks like we need this cleanup in only a few scenarios, not most:

  1. Provision with a physical interface using the ifcfg or nmstate provider (cleanup not required)
  2. Pre-provisioned nodes with a physical interface using the ifcfg or nmstate provider (cleanup not required)
  3. Provision with a VLAN-tagged interface using the ifcfg or nmstate provider (cleanup required)
  4. Pre-provisioned nodes with a VLAN-tagged interface using the ifcfg or nmstate provider (cleanup required)
  5. Network update and minor update scenarios (cleanup not required)
  6. Dataplane adoption (cleanup not required)
  7. Multiple dataplanes with separate namespaces (I think it depends on whether the controlplane interface is VLAN-tagged)

Only 3, 4, and maybe 7 partially; this cleanup is mainly required for a VLAN-tagged controlplane only.
I recommend defaulting this variable to 'false' and setting it to 'true' in the nodeset for the VLAN-tagged scenario, as well as any additional scenario that requires cleanup.
@karthiksundaravel @vcandapp ^

Defaulting this variable to 'true' will be a problem for the network update scenario in a post-deployment cluster.
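The opt-in approach recommended above (default 'false', enable per nodeset) could be expressed roughly as the following nodeset fragment. Both the variable name and the surrounding fields are illustrative assumptions based on this thread's wording; check the edpm_network_config role defaults for the actual variable name.

```yaml
# Hypothetical nodeset override -- variable name is illustrative only
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneNodeSet
metadata:
  name: edpm-vlan-ctlplane
spec:
  ansible:
    ansibleVars:
      # Enable cleanup only for the VLAN-tagged ctlplane scenarios (3 and 4)
      edpm_network_config_nonconfigured_cleanup: true
```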

@rabi
Contributor

rabi commented Jun 4, 2025

Have you checked the commit that introduced the change, d6df441? It has the details, and it links to the TP tracker Jira https://issues.redhat.com/browse/OSPRH-10124 for the same use case I mentioned above. I remember we also discussed this in detail when I added the related changes to os-net-config.

> The default value of true is going to silently affect many deployments during updates/upgrades and if cleanup is required.

Why would it impact anything when you don't touch the devices in the config? Anyway, the default was flipped for a reason, and we can't revert it without testing that use case.

@Jaganathancse
Contributor

Jaganathancse commented Jun 4, 2025

> Have you checked the commit that introduced the change, d6df441? It has the details, and it links to the TP tracker Jira https://issues.redhat.com/browse/OSPRH-10124 for the same use case I mentioned above. I remember we also discussed this in detail when I added the related changes to os-net-config.
>
> > The default value of true is going to silently affect many deployments during updates/upgrades and if cleanup is required.
>
> Why would it impact anything when you don't touch the devices in the config? Anyway, the default was flipped for a reason, and we can't revert it without testing that use case.

@rabi I think this cleanup variable is not working as expected and is performing cleanup in the update scenario as well.
I am seeing a gap related to this variable in os-net-config for the ifcfg provider with the VLAN-tagged scenario, and incorrect cleanup behaviour in other scenarios too.
Can we make the default value 'false' now, and decide on defaulting this variable to 'true' later, once complete testing is done for all scenarios?
Also, for pre-provisioned nodes we cannot default the non-configured cleanup to 'true'; it depends on the customer use cases.
Since the non-configured cleanup variable currently defaults to 'true', all existing connections are cleaned up, which is a problem if a customer wants to keep a few manually created connections.
More testing and issue fixes are needed around defaulting this variable to 'true', and there are documentation gaps as well.

@karthiksundaravel

@rabi As stated already, please understand that there is an outstanding issue in os-net-config cleanup: it is not going to work for all scenarios now, and the recommendation from the os-net-config team is not to enable --cleanup by default, without knowledge of its effects. Wherever cleanup is required, please override the default for the test cases you have listed. If we can do that, what sort of additional testing is required? Is this clear?

@dsneddon

dsneddon commented Jun 5, 2025

I'm not sure this is an appropriate use case for --cleanup in os-net-config. We really created the --cleanup parameter for CI and test environments, to make sure that previous (different) environments were cleaned up. If additional functionality is required to support e.g. multiple OSP data planes using different namespaces, then we should treat that as a feature request and figure out whether it makes more sense to add the functions to os-net-config or to Ansible.
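For context, the CI/test usage dsneddon describes maps to the os-net-config --cleanup CLI option, which removes interfaces left over from a previous (different) environment. The invocation below is an illustrative sketch; flags other than --cleanup may vary by version.

```console
# Apply the new config and remove interfaces not present in it
# (illustrative invocation for CI/test environments)
$ os-net-config -c /etc/os-net-config/config.yaml --cleanup
```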

@vcandapp
Contributor Author

vcandapp commented Jun 6, 2025

recheck

1 similar comment
@ekuris-redhat

recheck

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/1a4ec4d4796d4152b303d56c9f6a3349

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 29m 17s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 18m 02s
❌ cifmw-crc-podified-edpm-baremetal FAILURE in 1h 32m 58s
✔️ noop SUCCESS in 0s
✔️ edpm-ansible-tempest-multinode SUCCESS in 1h 41m 21s
✔️ adoption-standalone-to-crc-ceph-provider SUCCESS in 3h 12m 52s

@vcandapp
Contributor Author

recheck

@karthiksundaravel karthiksundaravel left a comment

lgtm

@openshift-ci
Contributor

openshift-ci bot commented Jun 23, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: karthiksundaravel, vcandapp
Once this PR has been reviewed and has the lgtm label, please assign bogdando for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vcandapp vcandapp requested a review from rabi June 23, 2025 08:33
@vcandapp
Contributor Author

@rabi / @karelyatin, the customer and QE have done some validation tests with the default cleanup set to false (both fresh deployment and network_update). Can you please review and provide more use cases to get this change tested?
cc: @karthiksundaravel

@rabi
Contributor

rabi commented Jun 23, 2025

> @rabi / @karelyatin, the customer and QE have done some validation tests with the default cleanup set to false (both fresh deployment and network_update). Can you please review and provide more use cases to get this change tested? cc: @karthiksundaravel

I thought we had agreed in the last call to create scenarios (e.g. VLAN-tagged control plane with both the ifcfg and nmstate providers, OVS bridge, etc.), test those scenarios, and then document where the flag has to be set true vs. false, right?

With nmstate-provider set as default for os-net-config,
default cleanup during update/adoption is not desired.

Updated description for the proper usage of this flag

Signed-off-by: <[email protected]>
Jira: https://issues.redhat.com/browse/OSPRH-16537
      https://issues.redhat.com/browse/OSPRH-17036
Signed-off-by: vcandapp <[email protected]>
@openshift-ci
Contributor

openshift-ci bot commented Jul 15, 2025

New changes are detected. LGTM label has been removed.

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.

@vcandapp
Contributor Author

recheck


7 participants