@bogdando
Contributor

@bogdando bogdando commented Jul 3, 2024

Split EDPM nodes into compute cells by mapping each cell 1:1 to a
dataplane nodeset.

Use the edpm_nodes var to describe the computes for each cell,
instead of static host and IP vars that only worked for
single-cell standalone or multi-node single-cell cases.
Also explain EDPM net config requirements in vars.sample, when
it is used outside of ci-framework (local deployments).

Remove edpm_computes vars that are no longer used now that stopping the
control-plane tripleo services has moved into edpm-ansible.

Simplify ENV headers management by collecting them in a single place.

Provide a variable to define the source cloud Ironic topology,
for any cells with Ironic services.

Align nova/libvirt and related services ordering in the
lists of services defined in multiple places, with those
specified in VA.

Align the names in the tests with the documented steps
to make the corresponding code easily discoverable.

Adjust storage/storageRequests values to better fit
multi-cell test scenarios. Also provide the values in the docs and
add a comment to adjust them as needed.

Stop OVN services only if they are present and active (they are
missing on the cell controllers, for example).

Retain EDPM host IPs on internalapi network. Without that, edpm-ansible's os-net-config
changes IPs on internalapi, and also breaks connectivity to EDPM hosts for ansible
(which restores after a node reboot).

Add edpmRoleServiceName value for tlsCerts.

Jira: #OSPRH-6548

@bogdando bogdando changed the title from "Multi-cell adoption" to "[WIP] Multi-cell adoption" Jul 3, 2024
@bogdando
Contributor Author

bogdando commented Jul 8, 2024

The recent revision gives an overview of the approach taken, PTAL.
As long as we need to maintain the docs-as-code here, I'm afraid there is no cleaner solution than this.
For the ci-framework and rdo-jobs side of things, which should template all that in, I have a WIP as well...
@jistr @SeanMooney

@softwarefactory-project-zuul

This change depends on a change that failed to merge.

Change openstack-k8s-operators/install_yamls#826 is needed.

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,18b084c576712d289411bfab3a4bfee4b60a3fbf

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,4687df731d7a30007950c91ac21ee931ebfebf8c

@bogdando
Contributor Author

Based on feedback from @SeanMooney, we should not shift cell names as I proposed here. We want it like this instead:

  • Single-cell adoption (only the default cell exists): rename default to cell1.
  • Multi-cell (default, cell1, etc. exist): skip importing the default cell, as compute hosts are not supported there in a multi-cell OSP, hence there is nothing to adopt from it.
  • Or, multi-cell (default, cell1, etc. exist): do not rename the default cell, and import it as is.
  • Or, multi-cell (default, cell1, etc. exist): rename the default cell to the highest cell number + 1:
default -> cell4
cell1 -> cell1
cell2 -> cell2
cell3 -> cell3
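
For illustration, the last option (together with the single-cell case) can be sketched in plain POSIX shell. This is only a sketch under stated assumptions: `map_cells` is a hypothetical helper, not code from this PR, and it assumes the source cell names are either `default` or `cell<N>`.

```sh
#!/bin/sh
# Hypothetical helper (not part of this PR): print "old -> new" for each source cell.
# "default" is renamed to cell<N+1>, where N is the highest existing cell number;
# with no other cells N is 0, so a lone "default" becomes cell1.
# The body runs in a subshell so its variables do not leak into the caller.
map_cells() (
  max=0
  for cell in "$@"; do
    n=${cell#cell}
    case $n in
      "$cell" | '' | *[!0-9]*) continue ;;   # not of the form cell<digits>
    esac
    if [ "$n" -gt "$max" ]; then max=$n; fi
  done
  for cell in "$@"; do
    if [ "$cell" = default ]; then
      echo "default -> cell$((max + 1))"
    else
      echo "$cell -> $cell"
    fi
  done
)

map_cells default cell1 cell2 cell3
```

Called with `default cell1 cell2 cell3` it prints the mapping listed above; called with only `default` it prints `default -> cell1`, which matches the single-cell case.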

Implementing any of these is quite challenging given the local requirement to keep the test code in the same form as it is documented (meaning shell commands). This sophisticated logic would bring even more loops and array handling into the already overcomplicated code proposed in this PR draft.

@jistr @gibizer looking for your ideas on that

@gibizer
Contributor

gibizer commented Jul 11, 2024

As nova-operator allows a cell to be named "default", the simplest solution would be your second proposal: just import the cells as is. This also has the benefit that it works even if a given customer wrongly attached computes to the default cell.
After GA, nova-operator will get the ability to delete cells, so that feature can be used later to delete the "default" cell and thereby make the deployment structurally the same as a greenfield 18 deployment.

@bogdando
Contributor Author

bogdando commented Jul 11, 2024

I now lean toward implementing the last option: for multi-cell (default, cell1, etc. exist), rename the default cell to the highest cell number + 1. This keeps it consistent for single-cell and multi-cell...

/update: See the combined option, which allows either renaming or importing as is

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,9541ce7f013b9b35b2cbd681cb30259da1a85157

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,54d110489b8215e014580b8b77b05ce107fd1e04

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,9954245ae2addd169cc80deab137024b7046f30e

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Unable to update github.com/openstack-k8s-operators/install_yamls

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,b946bca930ed67ffe94465e36e742abd9ba55d95

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/64b2ab79e55847e3b2622e10febae11e

✔️ noop SUCCESS in 0s
adoption-standalone-to-crc-ceph FAILURE in 1h 34m 47s
✔️ adoption-standalone-to-crc-no-ceph SUCCESS in 3h 09m 52s
✔️ adoption-docs-preview SUCCESS in 1m 25s

@bogdando
Contributor Author

recheck adoption-standalone-to-crc-ceph

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/10d929291834486995d25e11bddf7bf7

✔️ noop SUCCESS in 0s
adoption-standalone-to-crc-ceph FAILURE in 2h 12m 14s
✔️ adoption-standalone-to-crc-no-ceph SUCCESS in 3h 15m 22s
✔️ adoption-docs-preview SUCCESS in 1m 19s

@bogdando
Contributor Author

recheck adoption-standalone-to-crc-ceph

Contributor

@jistr jistr left a comment

/lgtm

Some follow up suggestions inline, i'm mainly concerned about 2 things right now:

  • Hardcoding multi-DB and multi-MQ which increases HW requirements for all jobs.
  • This $CONTROLLER1_SSH if sudo systemctl is-active tripleo_ovn_cluster_northd.service ';' then sudo systemctl stop tripleo_ovn_cluster_northd.service ';' fi which is likely just lack of knowledge on my part and i need to do some experimentation how SSH behaves in such cases.
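
As background for the second point, ssh joins its command arguments with single spaces and hands the resulting string to a shell on the remote side. The quoted `';'` therefore arrives there as a plain `;`, and the `if ... then ... fi` is parsed on the remote host rather than by the local shell. The sketch below imitates that locally, without any SSH connection; `fake_ssh` is a hypothetical stand-in for whatever `$CONTROLLER1_SSH` expands to (assumed here to be an ssh command line), and `true`/`false` stand in for `systemctl is-active`, which exits non-zero when the unit is not active.

```sh
#!/bin/sh
# fake_ssh is a hypothetical stand-in for "$CONTROLLER1_SSH": like ssh, it joins
# its arguments with spaces ("$*") and lets a second shell parse the result.
fake_ssh() {
  sh -c "$*"
}

# The calling shell sees if/then/fi and ';' as ordinary arguments; only the
# second shell interprets them as an if statement.
fake_ssh if true ';' then echo stopped ';' fi     # condition succeeds: prints "stopped"
fake_ssh if false ';' then echo stopped ';' fi    # condition fails: prints nothing
```

The second call also exits with status 0, because an `if` whose condition fails and that has no `else` branch returns success. That is what makes this form convenient for "stop only if active" steps.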

Contributor

Just like we use the $ prefix for lines that begin a command, we use the > prefix for the continuation lines when a command is multi-line (for loops etc.). I was a bit sceptical about that, but today I learned that the copy button in the downstream docs actually works well with those -- the $ and > prefixes are not copied to the clipboard.

Just nitpicking though, a thing like this should be done in a follow-up given the size and priority of this PR.

Contributor Author

Yes, I proposed the $ and > prefixes earlier for the extracted DB adoption of multi-cell setups, and we agreed to follow that convention.
However, we also agreed to address the docs review in a follow-up.

@openshift-ci openshift-ci bot added the lgtm label Mar 25, 2025
@jistr jistr mentioned this pull request Mar 25, 2025
1 task
@SeanMooney
Contributor

/lgtm

Some follow up suggestions inline, i'm mainly concerned about 2 things right now:

* Hardcoding multi-DB and multi-MQ which increases HW requirements for all jobs.

* This `$CONTROLLER1_SSH if sudo systemctl is-active tripleo_ovn_cluster_northd.service ';' then sudo systemctl stop tripleo_ovn_cluster_northd.service ';' fi` which is likely just lack of knowledge on my part and i need to do some experimentation how SSH behaves in such cases.

If you have more than one cell, nova requires either that rabbit is configured to use vhosts, which our rabbit can't do, or that you have a separate conductor per cell.

That is why we use separate message queues: it is a limitation of our rabbit operator.

Each cell can use a different schema on the same DB server. For testing that is valid, but cells exist purely to scale nova horizontally and overcome message queue and DB bottlenecks, so it normally does not make sense to share the DB between cells, since DB performance used to be one of the bottlenecks that cells were designed to overcome.

SSDs removed most of the DB bottleneck, so it is really the rabbitmq throughput that is the limiting factor now.

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/7f04d9e4fa00467cad5f990d38908b7a

✔️ noop SUCCESS in 0s
adoption-standalone-to-crc-ceph FAILURE in 2h 03m 13s
adoption-standalone-to-crc-no-ceph FAILURE in 2h 00m 37s
✔️ adoption-docs-preview SUCCESS in 1m 24s

@bogdando
Contributor Author

recheck

Without retaining the EDPM host IPs on the internalapi network, edpm-ansible's os-net-config
changes the IPs on internalapi, which also breaks connectivity to the EDPM hosts for ansible
(it is restored after a node reboot, though).

Signed-off-by: Bohdan Dobrelia <[email protected]>
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a221b52ae8854ec08b129ff743855344

✔️ noop SUCCESS in 0s
adoption-standalone-to-crc-ceph FAILURE in 1h 37m 52s
adoption-standalone-to-crc-no-ceph FAILURE in 2h 06m 44s
✔️ adoption-docs-preview SUCCESS in 1m 28s

@bogdando
Contributor Author

recheck

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/36bb1aafc4f5477995134332a02d8253

✔️ noop SUCCESS in 0s
adoption-standalone-to-crc-ceph FAILURE in 2h 06m 54s
adoption-standalone-to-crc-no-ceph FAILURE in 2h 05m 46s
✔️ adoption-docs-preview SUCCESS in 1m 15s

@jistr
Contributor

jistr commented Mar 27, 2025

I think the octavia bug should now be fixed by #874

@jistr
Contributor

jistr commented Mar 27, 2025

recheck

@jistr
Contributor

jistr commented Mar 27, 2025

This PR has been on review for more than half a year and received peer review earlier, and we recently agreed to postpone merging this to let #855 land, which has happened. Let's take the opportunity to merge this PR and address any tweaks in follow-ups, we already have other work waiting for this one to land. Re-adding my LGTM and approving.

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm label Mar 27, 2025
@openshift-ci

openshift-ci bot commented Mar 27, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jistr

@openshift-merge-bot openshift-merge-bot bot merged commit 80c021f into openstack-k8s-operators:main Mar 27, 2025
6 checks passed
@karelyatin
Contributor

This broke multinode adoption jobs:
TASK [dataplane_adoption : create a OpenStackDataPlaneDeployment CR that runs only the validation] ***
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "set -euxo pipefail\n\nset -euxo pipefail\n\n\nCELLS=\"default\"\nDEFAULT_CELL_NAME=cell1\nRENAMED_CELLS=\"cell1\"\n\n\nNODESETS=\"\"\nfor CELL in $(echo $RENAMED_CELLS); do\n oc get Openstackdataplanenodeset openstack-${CELL} || continue\n NODESETS=\"'openstack-${CELL}', $NODESETS\"\ndone\nNODESETS=\"[${NODESETS%,*}]\"\n\n\n\nNODESETS=\"${NODESETS%]*},openstack-networker]\"\n\n\noc apply -f - <<EOF\napiVersion: dataplane.openstack.org/v1beta1\nkind: OpenStackDataPlaneDeployment\nmetadata:\n name: openstack-pre-adoption\nspec:\n nodeSets: $NODESETS\n servicesOverride:\n - pre-adoption-validation\n backoffLimit: 1\nEOF\n", "delta": "0:00:00.233125", "end": "2025-03-28 02:46:53.220531", "msg": "non-zero return code", "rc": 1, "start": "2025-03-28 02:46:52.987406", "stderr": "+ set -euxo pipefail\n+ CELLS=default\n+ DEFAULT_CELL_NAME=cell1\n+ RENAMED_CELLS=cell1\n+ NODESETS=\n++ echo cell1\n+ for CELL in $(echo $RENAMED_CELLS)\n+ oc get Openstackdataplanenodeset openstack-cell1\nError from server (NotFound): openstackdataplanenodesets.dataplane.openstack.org \"openstack-cell1\" not found\n+ continue\n+ NODESETS='[]'\n+ NODESETS='[,openstack-networker]'\n+ oc apply -f -\nerror: error parsing STDIN: error converting YAML to JSON: yaml: line 5: did not find expected node content", "stderr_lines": ["+ set -euxo pipefail", "+ CELLS=default", "+ DEFAULT_CELL_NAME=cell1", "+ RENAMED_CELLS=cell1", "+ NODESETS=", "++ echo cell1", "+ for CELL in $(echo $RENAMED_CELLS)", "+ oc get Openstackdataplanenodeset openstack-cell1", "Error from server (NotFound): openstackdataplanenodesets.dataplane.openstack.org \"openstack-cell1\" not found", "+ continue", "+ NODESETS='[]'", "+ NODESETS='[,openstack-networker]'", "+ oc apply -f -", "error: error parsing STDIN: error converting YAML to JSON: yaml: line 5: did not find expected node content"], "stdout": "", "stdout_lines": []}
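
In the trace, no cell nodeset is found, so NODESETS is `[]`, and appending `,openstack-networker]` then yields `[,openstack-networker]`, which is not valid YAML. One way to avoid the stray comma is to collect the names first and join them only at the end. The sketch below is a minimal illustration, not the fix that was applied: `yaml_list` is a hypothetical helper, and the names are passed as arguments instead of being discovered with `oc get`.

```sh
#!/bin/sh
# Hypothetical helper (not from this PR): turn nodeset names into a YAML flow sequence.
# A separator is added only between items, so an empty input yields "[]" and a
# leading or trailing comma cannot appear. The subshell body keeps variables local.
yaml_list() (
  list=""
  for name in "$@"; do
    list="${list:+$list, }'$name'"
  done
  printf '[%s]\n' "$list"
)

# In the failing script the names would come from the `oc get ... || continue`
# loop, with openstack-networker appended afterwards.
yaml_list                                       # no nodeset found at all
yaml_list openstack-networker                   # the case from the trace above
yaml_list openstack-cell1 openstack-networker
```

This only addresses the malformed list. Whether the `openstack-cell1` nodeset is expected to exist in that job is a separate question that the trace alone does not answer.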

@karelyatin
Contributor

Proposed revert #878 while this is being checked; the uni adoption jobs are broken as well.
