MicroCeph fails to join cluster networked using OpenFabric #648

@VaticanUK

Issue report

What version of MicroCeph are you using ?

(squid/stable) 19.2.1+snap74c0060321

What are the steps to reproduce this issue ?

  1. Set up 3 machines with a full mesh network (basic idea here: https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server), i.e. node 1 has a direct link to nodes 2 and 3, node 2 has a direct link to nodes 1 and 3, and node 3 has a direct link to nodes 1 and 2
  2. Install FRR
  3. Set up fabricd within FRR:
    enable the fabricd daemon and configure each node similarly to the example below

frr.conf:

frr defaults traditional
hostname node1
log syslog warning
ip forwarding
no ipv6 forwarding
service integrated-vtysh-config
!
interface lo
 ip address 10.15.15.51/32
 ip router openfabric 1
 openfabric passive
!
interface enp2s0f0np0
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
interface enp2s0f1np1
 ip router openfabric 1
 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
!
line vty
!
router openfabric 1
 net 49.0001.1111.1111.1111.00
 lsp-gen-interval 1
 max-lsp-lifetime 600
 lsp-refresh-interval 180

netplan:

network:
  version: 2
  ethernets:
    enp88s0:
      addresses:
      - "192.168.8.2/24"
      nameservers:
        addresses:
        - 192.168.8.1
        search: []
      routes:
      - to: "default"
        via: "192.168.8.1"
    enp2s0f0np0:
      mtu: 9000
    enp2s0f1np1:
      mtu: 9000
  4. You'll now be able to ping each node from every other node, with redundancy against any single link failure. However (and I think this is related to the problem), the IP address used by OpenFabric for each node isn't listed by ifconfig. For example, the following commands and output are from node 2:
$ ifconfig
enp2s0f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet6 fe80::5a47:caff:fe7a:c1da  prefixlen 64  scopeid 0x20<link>
        ether 58:47:ca:7a:c1:da  txqueuelen 1000  (Ethernet)
        RX packets 3960444  bytes 10763025277 (10.7 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4257559  bytes 1789028060 (1.7 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp2s0f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet6 fe80::5a47:caff:fe7a:c1db  prefixlen 64  scopeid 0x20<link>
        ether 58:47:ca:7a:c1:db  txqueuelen 1000  (Ethernet)
        RX packets 776652  bytes 1995475489 (1.9 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 486683  bytes 1360628932 (1.3 GB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp88s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.8.3  netmask 255.255.255.0  broadcast 192.168.8.255
        inet6 fe80::5a47:caff:fe7a:c1dd  prefixlen 64  scopeid 0x20<link>
        ether 58:47:ca:7a:c1:dd  txqueuelen 1000  (Ethernet)
        RX packets 5171181  bytes 5922053009 (5.9 GB)
        RX errors 0  dropped 5518  overruns 0  frame 0
        TX packets 1402604  bytes 161658482 (161.6 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0x6c500000-6c5fffff

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1647425  bytes 486132167 (486.1 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1647425  bytes 486132167 (486.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
$ ping 10.15.15.51
PING 10.15.15.51 (10.15.15.51) 56(84) bytes of data.
64 bytes from 10.15.15.51: icmp_seq=1 ttl=64 time=0.443 ms
64 bytes from 10.15.15.51: icmp_seq=2 ttl=64 time=0.462 ms
^C
--- 10.15.15.51 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1046ms
rtt min/avg/max/mdev = 0.443/0.452/0.462/0.009 ms
  5. On node 1, get a join key: sudo microceph cluster add node2
  6. On node 2, attempt to join: sudo microceph cluster join <key>
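
Regarding step 4: ifconfig generally prints only one IPv4 address per interface, so an address added to lo alongside 127.0.0.1 may be present without showing up in that output. A minimal sketch, assuming the iproute2 `ip` tool is installed and supports JSON output (`ip -j addr show`), that lists every address on every interface. The helper names `read_ip_json` and `list_addresses` are hypothetical:

```python
import json
import subprocess


def read_ip_json() -> str:
    """Run `ip -j addr show` and return its JSON output (assumes iproute2)."""
    result = subprocess.run(
        ["ip", "-j", "addr", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def list_addresses(ip_json: str) -> list:
    """Return (interface, "address/prefixlen") pairs from `ip -j addr show` output."""
    pairs = []
    for iface in json.loads(ip_json):
        for info in iface.get("addr_info", []):
            local = info.get("local")
            if local is not None:
                pairs.append((iface["ifname"], f"{local}/{info['prefixlen']}"))
    return pairs
```

Calling `list_addresses(read_ip_json())` on node 2 should show whether the OpenFabric loopback address is actually configured on lo.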

What happens (observed behaviour) ?

Error: failed to generate the configuration: failed to locate IP on public network 10.15.15.51/32: no IP belongs to provided subnet 10.15.15.51/32
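
The wording of the error suggests a check for a local address that falls inside the given public network. The sketch below is not MicroCeph's code, only an illustration of that kind of membership test using Python's `ipaddress` module. It shows why a /32 built from node 1's address can only ever match node 1. The address 10.15.15.52 for node 2's loopback is an assumption and does not appear in this report:

```python
import ipaddress


def find_ip_in_subnet(local_addrs, subnet):
    """Return the first local address that lies inside `subnet`, or None."""
    net = ipaddress.ip_network(subnet, strict=False)
    for addr in local_addrs:
        if ipaddress.ip_address(addr) in net:
            return addr
    return None


# Addresses node 2 shows in ifconfig, plus a hypothetical OpenFabric
# loopback address (10.15.15.52 is an assumption, not from the report).
node2_addrs = ["127.0.0.1", "192.168.8.3", "10.15.15.52"]

print(find_ip_in_subnet(node2_addrs, "10.15.15.51/32"))  # None
print(find_ip_in_subnet(node2_addrs, "10.15.15.0/24"))   # 10.15.15.52
```

Under these assumptions, no address on node 2 can belong to 10.15.15.51/32 even when routing between the nodes works, whereas a wider prefix covering all the loopback addresses would match.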

What were you expecting to happen ?

The machines can clearly communicate: ping works, and in my case all nodes are already part of a MicroK8s cluster that communicates without issue over the same IPs I'm trying to use for MicroCeph. I therefore expected the node to join the cluster.

Relevant logs, error output, etc.

Node 1:

microceph (squid/stable) 19.2.1+snap74c0060321 from Canonical✓ installed
richard@node1:~$ sudo microceph cluster bootstrap
richard@node1:~$ sudo microceph disk add /dev/nvme0n1p4

+----------------+---------+
|      PATH      | STATUS  |
+----------------+---------+
| /dev/nvme0n1p4 | Success |
+----------------+---------+
richard@node1:~$ sudo snap refresh --hold microceph
General refreshes of "microceph" held indefinitely
richard@node1:~$ sudo microceph cluster add node2
eyJzZWNyZXQiOiJkMmJkYTk0NTkxYmZjN2ZlNzNkZWQzNGQzNDZjOTc0MThjODI0YTZjZDc0Y2VjNzA3YTJiYmU2OTRkY2Q1NGY1IiwiZmluZ2VycHJpbnQiOiIwNWRkZjZjOTEyMjdhOTA5YmVkOTU4Njg1Y2Q1YzgxNjBjM2M2NDUxZTYxNjMxZGJmYzk4NGM3MjU3ODJiYmVmIiwiam9pbl9hZGRyZXNzZXMiOlsiMTAuMTUuMTUuNTE6NzQ0MyJdfQ==
richard@node1:~$ sudo microceph cluster add node3
eyJzZWNyZXQiOiJlYWQ4ZDM3N2JmMDRiYzVkMzMwYzc2NTA5Mjk3YTFmZGQ3MjY0YTllNTc0MmExMzM0NGE2NmViY2MwY2Y0MGVjIiwiZmluZ2VycHJpbnQiOiIwNWRkZjZjOTEyMjdhOTA5YmVkOTU4Njg1Y2Q1YzgxNjBjM2M2NDUxZTYxNjMxZGJmYzk4NGM3MjU3ODJiYmVmIiwiam9pbl9hZGRyZXNzZXMiOlsiMTAuMTUuMTUuNTE6NzQ0MyJdfQ==

Node 2:

richard@node2:~$ sudo snap install microceph --channel=squid/stable
[sudo] password for richard:
microceph (squid/stable) 19.2.1+snap74c0060321 from Canonical✓ installed
richard@node2:~$ sudo snap refresh --hold microceph
General refreshes of "microceph" held indefinitely
richard@node2:~$ sudo microceph cluster join eyJzZWNyZXQiOiJkMmJkYTk0NTkxYmZjN2ZlNzNkZWQzNGQzNDZjOTc0MThjODI0YTZjZDc0Y2VjNzA3YTJiYmU2OTRkY2Q1NGY1IiwiZmluZ2VycHJpbnQiOiIwNWRkZjZjOTEyMjdhOTA5YmVkOTU4Njg1Y2Q1YzgxNjBjM2M2NDUxZTYxNjMxZGJmYzk4NGM3MjU3ODJiYmVmIiwiam9pbl9hZGRyZXNzZXMiOlsiMTAuMTUuMTUuNTE6NzQ0MyJdfQ==
Error: failed to generate the configuration: failed to locate IP on public network 10.15.15.51/32: no IP belongs to provided subnet 10.15.15.51/32
richard@node2:~$ ping 10.15.15.51
PING 10.15.15.51 (10.15.15.51) 56(84) bytes of data.
64 bytes from 10.15.15.51: icmp_seq=1 ttl=64 time=0.443 ms
64 bytes from 10.15.15.51: icmp_seq=2 ttl=64 time=0.462 ms
^C
--- 10.15.15.51 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1046ms
rtt min/avg/max/mdev = 0.443/0.452/0.462/0.009 ms
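
The join tokens in the logs above appear to be base64-encoded JSON. Decoding them gives the fields `secret`, `fingerprint` and `join_addresses`, with `join_addresses` set to `["10.15.15.51:7443"]`. This format is inferred from the tokens in this report, not from MicroCeph documentation. A minimal sketch of the decoding, using a stand-in token with placeholder values; the helper name `decode_join_token` is hypothetical:

```python
import base64
import json


def decode_join_token(token: str) -> dict:
    """Decode a base64-encoded JSON join token into a dict."""
    return json.loads(base64.b64decode(token))


# Stand-in token with placeholder values, mirroring the fields seen in the logs.
sample_token = base64.b64encode(json.dumps({
    "secret": "<placeholder>",
    "fingerprint": "<placeholder>",
    "join_addresses": ["10.15.15.51:7443"],
}).encode()).decode()

print(decode_join_token(sample_token)["join_addresses"])  # ['10.15.15.51:7443']
```

This confirms which address node 2 is told to contact when joining, which matches the address in the error message.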

Additional comments.

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working
