Commit 5ae4b2e

Add physical network and EVPN fabric documentation
1 parent c6980a7 commit 5ae4b2e

File tree

7 files changed

+372
-0
lines changed

.github/workflows/deployment_yaml/enable-ironic.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,3 +5,4 @@ ceph_managed: false
55
ironic: true
66
ironic_automated_cleaning: true
77
kayobe_manages_physical_network: true
8+
physical_network_evpn: true

source/_static/spine-leaf.png

263 KB

source/data/deployment.yml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,9 @@ ironic_automated_cleaning: true
2020
# Whether Kayobe manages physical network devices.
2121
kayobe_manages_physical_network: true
2222

23+
# Whether the physical network is an EVPN fabric.
24+
physical_network_evpn: false
25+
2326
# Whether the deployment includes Wazuh.
2427
wazuh: true
2528

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
EVPN VXLAN Fabric details
2+
=========================
3+
4+
.. Specific details about the client's EVPN fabric should be included here.
Lines changed: 344 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,344 @@
1+
EVPN VXLAN overview
2+
===================
3+
4+
Ethernet Virtual Private Network (EVPN) is an increasingly popular
standard technology for scaling layer 2 networks. It combines a layer 3
Equal Cost Multipath (ECMP) "underlay" fabric with VXLAN "overlays" that
stretch VLANs between switches.
9+
10+
EVPN is typically used with a spine/leaf network architecture:
11+
12+
.. figure:: _static/spine-leaf.png
13+
:alt: Spine/leaf network architecture
14+
:class: no-scaled-link
15+
16+
This type of multipath network is resilient to failures of
17+
individual links or devices. A standard layer 2 (Ethernet) network
18+
cannot achieve this without introducing forwarding loops or disabling
19+
links and/or switches (e.g. using Spanning Tree Protocol (STP)).
20+
21+
Often the leaf switches will be paired, with MLAG or similar technology
22+
used to connect servers to both switches in a pair.
23+
24+
Border Gateway Protocol (BGP) is used in the control plane of both the
underlay and the overlay to exchange connectivity information between the
switches. BGP is a proven, widely used protocol that underpins exchange of
routing information on the Internet. The underlay uses standard BGP, while
the overlay uses Multiprotocol BGP (MP-BGP) to exchange MAC addresses and
other information between VXLAN Tunnel Endpoints (VTEPs).
30+
31+
This network architecture is undoubtedly more complex than the standard
32+
layer 2 networks we have generally used in the past. It's easiest
33+
to build up the picture in layers.
34+
35+
Underlay IP links
36+
-----------------
37+
38+
Each leaf switch has a layer 3 (IP) /31 point-to-point connection to
39+
each spine switch. Typically we would divide up a supernet (e.g. /24)
40+
into multiple /31 subnets. Doing this puts the spine-leaf links into
41+
layer 3 mode.
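
As a rough sketch of the addressing scheme described above, the following Python snippet carves a supernet into /31 point-to-point links using the standard library ``ipaddress`` module. The ``172.0.0.0/24`` supernet and the ``fabric_links`` helper are illustrative assumptions, not part of any real deployment or tool.

```python
import ipaddress


def fabric_links(supernet: str) -> list[tuple[str, str]]:
    """Split a supernet into /31 point-to-point links.

    Returns (first_address, second_address) pairs, one per /31 subnet.
    """
    network = ipaddress.ip_network(supernet)
    links = []
    for subnet in network.subnets(new_prefix=31):
        # A /31 holds exactly two addresses, one for each end of the link.
        links.append((str(subnet[0]), str(subnet[1])))
    return links


# Hypothetical supernet, chosen to match the example addresses in this page.
links = fabric_links("172.0.0.0/24")
print(len(links), "links available")
print("first link:", links[0])
```

Which end of each link goes to the leaf and which to the spine is a local convention. It may help to record the choice alongside the allocation.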
42+
43+
Example leaf interface config on Dell OS10:
44+
45+
::
46+
47+
!
48+
interface ethernet1/1/1
49+
no shutdown
50+
no switchport
51+
ip address 172.0.0.0/31
52+
53+
Example spine interface config on Dell OS10:
54+
55+
::
56+
57+
!
58+
interface ethernet1/1/1
59+
no shutdown
60+
no switchport
61+
ip address 172.0.0.1/31
62+
63+
One of the implications of this is that each switch and the hosts
64+
attached to it have become an isolated layer 2 (Ethernet) network.
65+
66+
Another implication is that each switch only has layer 3 (IP)
67+
connectivity to other neighbouring switches.
68+
69+
On Dell leaf switches with Virtual Link Trunking (VLT, aka MLAG), the
70+
inter-switch link between leaf switch pairs is also configured with an
71+
IP point-to-point link. This ends up getting used more than you might
72+
expect.
73+
74+
BGP underlay
75+
------------
76+
77+
In order to stitch together these individual point-to-point IP fabric
78+
links, we use a BGP control plane to exchange routing information
79+
between the switches. This allows each switch to reach (via L3) not only
80+
its immediate neighbours, but any of their neighbours (and so on). Wait,
81+
we could build an Internet out of this...
82+
83+
For the BGP underlay, each switch establishes a BGP session with each of
84+
its immediate neighbours.
85+
86+
Example on Dell OS10 from a leaf:
87+
88+
::
89+
90+
# show ip bgp summary
91+
BGP router identifier 172.1.0.0 local AS number 65001
92+
Neighbor AS MsgRcvd MsgSent Up/Down State/Pfx
93+
172.0.0.1 65101 22834 22842 1w:6d:18:55:29 34
94+
95+
The neighbour's IP address (172.0.0.1) is the fabric link partner's IP.
96+
97+
The BGP router identifier (172.1.0.0) is unique to each switch, and
98+
should be a separate IP range from the fabric links. On Dell OS10
99+
switches this is assigned to a loopback device as a /32 IP address.
100+
101+
The Autonomous System (AS) number (spine: 65101, leaf: 65001) may be
102+
assigned to multiple devices. Dell provides various different reference
103+
configurations which use a single shared AS, or multiple AS. It's not
104+
clear to me why you would choose one or another approach.
105+
106+
One thing that may not be immediately obvious is that BGP within an AS
107+
is internal BGP (iBGP), whereas between different AS it is external BGP
108+
(eBGP).
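
The iBGP/eBGP distinction comes down to comparing the two AS numbers of a session. A minimal sketch, using a hypothetical ``bgp_session_type`` helper and the example AS numbers from this page:

```python
def bgp_session_type(local_as: int, remote_as: int) -> str:
    """Classify a BGP session by comparing the two AS numbers."""
    # Same AS on both ends means internal BGP, otherwise external BGP.
    return "iBGP" if local_as == remote_as else "eBGP"


# Leaf (65001) peering with a spine (65101), as in the example above.
print(bgp_session_type(65001, 65101))
# Two switches sharing a single AS.
print(bgp_session_type(65001, 65001))
```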
109+
110+
BGP neighbour info on a Dell OS10 system:
111+
112+
::
113+
114+
# show ip bgp neighbors
115+
BGP neighbor is 172.0.0.1, remote AS 65101, local AS 65001 external link
116+
117+
BGP version 4, remote router ID 172.1.0.1
118+
BGP state ESTABLISHED, in this state for 1 weeks 6 days 19:04:49
119+
Last read 01:18:01 seconds
120+
Hold time is 180, keepalive interval is 60 seconds
121+
Configured hold time is 180, keepalive interval is 60 seconds
122+
Fall-over disabled
123+
124+
Received 22845 messages
125+
1 opens, 0 notifications, 15 updates
126+
22829 keepalives, 0 route refresh requests
127+
Sent 22854 messages
128+
1 opens, 0 notifications, 17 updates
129+
22836 keepalives, 0 route refresh requests
130+
Minimum time between advertisement runs is 30 seconds
131+
Minimum time before advertisements start is 0 seconds
132+
133+
Capabilities received from neighbor for IPv4 Unicast:
134+
MULTIPROTO_EXT(1)
135+
ROUTE_REFRESH(2)
136+
CISCO_ROUTE_REFRESH(128)
137+
4_OCTET_AS(65)
138+
Capabilities advertised to neighbor for IPv4 Unicast:
139+
MULTIPROTO_EXT(1)
140+
ROUTE_REFRESH(2)
141+
CISCO_ROUTE_REFRESH(128)
142+
4_OCTET_AS(65)
143+
Prefixes accepted 34, Prefixes advertised 36
144+
Connections established 1; dropped 0
145+
Last reset never
146+
For address family: IPv4 Unicast
147+
Allow local AS number 0 times in AS-PATH attribute
148+
Prefixes ignored due to:
149+
Martian address 0, Our own AS in AS-PATH 0
150+
Invalid Nexthop 0, Invalid AS-PATH length 0
151+
Wellknown community 0, Locally originated 0
152+
153+
Local host: 172.0.0.0, Local port: 179
154+
Foreign host: 172.0.0.1, Foreign port: 44058
155+
156+
We're looking for a BGP state of ESTABLISHED.
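
When checking many neighbours, the state check can be scripted. The following is a sketch only: it assumes the text layout of ``show ip bgp neighbors`` shown above, and the ``neighbor_states`` and ``down_neighbors`` helpers are hypothetical names, not part of any Dell or Kayobe tooling.

```python
import re

# Trimmed sample in the same layout as the output shown above.
SAMPLE = """\
BGP neighbor is 172.0.0.1, remote AS 65101, local AS 65001 external link

  BGP version 4, remote router ID 172.1.0.1
  BGP state ESTABLISHED, in this state for 1 weeks 6 days 19:04:49
"""


def neighbor_states(output: str) -> dict[str, str]:
    """Map each neighbour IP to its reported BGP state."""
    states = {}
    current = None
    for line in output.splitlines():
        match = re.match(r"\s*BGP neighbor is (\S+),", line)
        if match:
            current = match.group(1)
            continue
        match = re.match(r"\s*BGP state (\w+)", line)
        if match and current is not None:
            states[current] = match.group(1)
    return states


def down_neighbors(output: str) -> list[str]:
    """Return neighbours whose state is anything other than ESTABLISHED."""
    return sorted(
        ip for ip, state in neighbor_states(output).items()
        if state != "ESTABLISHED"
    )


print(neighbor_states(SAMPLE))
print(down_neighbors(SAMPLE))
```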
157+
158+
Here is a route table on Dell OS10:
159+
160+
::
161+
162+
# show ip bgp
163+
BGP local RIB : Routes to be Added , Replaced , Withdrawn
164+
BGP local router ID is 172.1.0.0
165+
Status codes: s suppressed, S stale, d dampened, h history, * valid, > best
166+
Path source: I - internal, a - aggregate, c - confed-external,
167+
r - redistributed/network, S - stale
168+
Origin codes: i - IGP, e - EGP, ? - incomplete
169+
Network Next Hop Metric LocPrf Weight Path
170+
* 172.0.0.0/31 172.0.0.1 0 100 0 65001 ?
171+
* 172.0.0.0/31 172.2.0.5 0 100 0 65001 ?
172+
*>r 172.0.0.0/31 0.0.0.0 0 100 32768 ?
173+
174+
At this point it should be possible to ping the fabric IP address of any
175+
switch in the network.
176+
177+
On Dell OS10, the BGP underlay is configured as follows:
178+
179+
::
180+
181+
router bgp 65001
182+
router-id 172.1.0.0
183+
!
184+
address-family ipv4 unicast
185+
redistribute connected
186+
!
187+
neighbor 172.0.0.1
188+
remote-as 65101
189+
no shutdown
190+
!
191+
address-family ipv4 unicast
192+
no sender-side-loop-detection
193+
!
194+
195+
BGP-EVPN overlay
196+
----------------
197+
198+
The MP-BGP overlay is used to share VXLAN connectivity information
199+
between switches.
200+
201+
On Dell OS10 (from a leaf):
202+
203+
::
204+
205+
# show ip bgp l2vpn evpn summary
206+
BGP router identifier 172.1.0.0 local AS number 65001
207+
Neighbor AS MsgRcvd MsgSent Up/Down State/Pfx
208+
172.1.0.1 65101 29100 33582 1w:6d:19:15:50 295
209+
210+
This may appear similar to the underlay BGP summary. However, here the
211+
neighbours are using the per-switch BGP router ID. This IP is now
212+
reachable across the IP fabric. Again, each switch establishes a session
213+
with its immediate neighbours.
214+
215+
BGP neighbour info on a Dell OS10 system:
216+
217+
::
218+
219+
# show ip bgp l2vpn evpn neighbors
220+
BGP neighbor is 172.1.0.1, remote AS 65101, local AS 65001 external link
221+
222+
BGP version 4, remote router ID 172.1.0.1
223+
BGP state ESTABLISHED, in this state for 1 weeks 6 days 21:59:35
224+
Last read 00:11:56 seconds
225+
Hold time is 180, keepalive interval is 60 seconds
226+
Configured hold time is 180, keepalive interval is 60 seconds
227+
Fall-over disabled
228+
EBGP multihop enabled, multihop TTL set to 4
229+
230+
Received 39322 messages
231+
2 opens, 2 notifications, 21181 updates
232+
18137 keepalives, 0 route refresh requests
233+
Sent 32041 messages
234+
5 opens, 0 notifications, 12303 updates
235+
19733 keepalives, 0 route refresh requests
236+
Minimum time between advertisement runs is 30 seconds
237+
Minimum time before advertisements start is 0 seconds
238+
239+
Prefixes accepted 270, Prefixes advertised 163
240+
Connections established 2; dropped 2
241+
Closed by neighbor sent 1 weeks 6 days 21:59:50 ago
242+
Local host: 172.1.0.0, Local port: 41483
243+
Foreign host: 172.1.0.1, Foreign port: 179
244+
245+
Again, we're looking for a state of ESTABLISHED. At Habrok we saw the
BGP session reaching ESTABLISHED, then sometimes flapping after 3
minutes. This matches the default hold time (180 seconds). The flapping
happened when a large BGP update occurred, and was caused by an MTU
blackhole on the network path (the inter-switch link).
250+
251+
So far we have not configured any VXLANs to share information about.
252+
Let's fix that.
253+
254+
VXLANs
255+
------
256+
257+
If we return to our mental model of each switch as an isolated layer 2
258+
Ethernet network, consider connecting up those isolated networks with a
259+
series of overlay networks, such that a host in VLAN A on switch 1 again
260+
has direct connectivity to a host in VLAN A on switch 2. We can do this
261+
using VXLANs. These overlays, or tunnels, encapsulate a layer 2
Ethernet frame within a VXLAN UDP packet. This allows the frame to
traverse a network with only layer 3 connectivity, such as our underlay
fabric.
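
To make the encapsulation more concrete, here is a small sketch that builds and decodes the 8-byte VXLAN header defined in RFC 7348, and adds up the per-packet overhead. It assumes an IPv4 underlay with no outer VLAN tag. The helper names and the example VNI are illustrative only.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348

# Bytes added in front of the original frame, assuming an IPv4 underlay
# and no outer VLAN tag: outer Ethernet + IPv4 + UDP + VXLAN headers.
VXLAN_OVERHEAD_BYTES = 14 + 20 + 8 + 8


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Layout: 8 bits flags, 24 bits reserved, 24 bits VNI, 8 bits reserved.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def vxlan_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header."""
    _flags_word, vni_word = struct.unpack("!II", header[:8])
    return vni_word >> 8


header = vxlan_header(10016)  # example VNI, chosen for illustration
print(header.hex())
print(vxlan_vni(header))
print("overhead bytes:", VXLAN_OVERHEAD_BYTES)
```

The overhead is why the underlay links generally need a larger MTU than the overlay networks they carry. Otherwise large encapsulated packets may be dropped.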
265+
266+
We must create a VXLAN network on each switch that maps to a VLAN.
267+
268+
On a Dell OS10 system, here is one such VXLAN network:
269+
270+
::
271+
272+
# show virtual-network 10016
273+
Codes: DP - MAC-learn Dataplane, CP - MAC-learn Controlplane, UUD - Unknown-Unicast-Drop
274+
Virtual Network: 10016
275+
Members:
276+
VLAN 16: port-channel1000
277+
VxLAN Virtual Network Identifier: 10016
278+
Source Interface: loopback0(172.2.0.0)
279+
Remote-VTEPs (flood-list):
280+
281+
In this case we have VXLAN VNI 10016, which maps to VLAN 16. The source
282+
interface is loopback0, which we have configured with a /32 IP address
283+
for the VTEP. In an MLAG scenario, this IP address is shared between
284+
each leaf switch pair. This IP address is used as the source and
285+
destination for the outer VXLAN UDP packet.
286+
287+
Currently, there are no remote VTEPs.
288+
289+
EVIs
290+
----
291+
292+
EVPN Instances (EVIs) are the missing link between the EVPN BGP control
293+
plane and the VXLAN networks - they define which VXLAN networks will be
294+
shared via EVPN BGP, and with which switches.
295+
296+
::
297+
298+
# show evpn evi 10016
299+
300+
EVI : 10016, State : up
301+
Bridge-Domain : Virtual-Network 10016, VNI 10016
302+
Route-Distinguisher : 1:172.2.0.0:10016
303+
Route-Targets : 0:65001:10016 both, 0:65101:10016 import
304+
Inclusive Multicast : 172.2.0.1
305+
IRB : Disabled
306+
307+
On Dell OS10 switches there is an "auto evi" mode, which automatically
308+
adds an EVI for each VXLAN. However this doesn't work with the multiple
309+
AS topology used at Habrok.
310+
311+
The Route Distinguisher (RD) is an identifier for routes advertised by
this switch. The Route Targets (RTs) determine which routes are exported
to and imported from other switches; here they are built from the AS
numbers of the switches and the VNI. Routes can be exported, imported, or
both. Inclusive multicast defines the list of VTEPs to be included in a
multicast group for Broadcast, Unknown unicast and Multicast (BUM)
traffic. IRB is Integrated Routing and Bridging, which we'll get onto.
316+
317+
Now that we have an EVI configured for our VXLAN, we see EVPN
318+
"routes" for MAC addresses:
319+
320+
::
321+
322+
* Route distinguisher: 172.23.62.133:10016 VNI:10016
323+
[2]:[0]:[48]:[16:7f:06:fb:02:47]:[0]:[0.0.0.0]/280 172.23.62.133 0 100 0 65103 65005 ?
324+
325+
The most common type of route is type 2, and this defines MAC address
326+
routes. Each EVI shares MAC addresses in its local MAC table for the
327+
VLAN with other EVPN switches, avoiding the "flood and learn" behaviour
328+
of a static VXLAN configuration. This means that a MAC address lookup on
329+
switch A will now potentially include remote VTEPs, as well as local
330+
interfaces.
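
The bracketed fields of a type 2 route can be pulled apart with a short script. This is a sketch under an assumption: the field order (route type, Ethernet tag, MAC length, MAC, IP length, IP) follows the layout commonly used for EVPN type 2 prefixes and has not been checked against Dell OS10 documentation. ``parse_type2_route`` is a hypothetical helper.

```python
import re


def parse_type2_route(prefix: str) -> dict:
    """Split an EVPN type 2 prefix string into named fields.

    The field names are an assumption based on the common
    [type]:[eth-tag]:[mac-len]:[mac]:[ip-len]:[ip] layout.
    """
    fields = re.findall(r"\[([^\]]*)\]", prefix)
    if len(fields) != 6 or fields[0] != "2":
        raise ValueError(f"not a type 2 route: {prefix!r}")
    return {
        "route_type": 2,
        "ethernet_tag": int(fields[1]),
        "mac_length": int(fields[2]),
        "mac": fields[3],
        "ip_length": int(fields[4]),
        "ip": fields[5],
    }


route = parse_type2_route("[2]:[0]:[48]:[16:7f:06:fb:02:47]:[0]:[0.0.0.0]/280")
print(route["mac"], route["ip"])
```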
331+
332+
Integrated Routing and Bridging (IRB)
333+
-------------------------------------
334+
335+
IRB can be used to perform routing in a distributed manner, across the
336+
fabric. Typically, each leaf switch is configured as a router, and will
337+
route on ingress to the destination VXLAN.
338+
339+
Resources
340+
---------
341+
342+
This explainer series by nullzero is very helpful in building up the
343+
details in the picture. Here's the first part:
344+
https://www.nullzero.co.uk/aruba-aos-cx-evpn-vxlan/

source/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@ Contents
2020
introduction
2121
working_with_openstack
2222
working_with_kayobe
23+
physical_network
2324
hardware_inventory_management
2425
ceph_storage
2526
managing_users_and_projects

source/physical_network.rst

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
.. include:: vars.rst
2+
3+
================
4+
Physical network
5+
================
6+
7+
.. ifconfig:: deployment['kayobe_manages_physical_network']
8+
9+
The |project_name| deployment uses Kayobe to manage the physical network.
10+
11+
.. ifconfig:: deployment['physical_network_evpn']
12+
13+
.. include:: include/evpn_fabric_overview.rst
14+
15+
.. include:: include/evpn_fabric_details.rst
16+
17+
.. ifconfig:: not deployment['kayobe_manages_physical_network']
18+
19+
The |project_name| deployment does not use Kayobe to manage the physical network.
