| 1 | +EVPN VXLAN overview |
| 2 | +=================== |
| 3 | + |
| 4 | +Ethernet Virtual Private Network (EVPN) is an increasingly popular |
| 5 | +standard technology for scaling layer 2 networks. It combines a layer 3 |
| 6 | +Equal-Cost Multi-Path (ECMP) "underlay" fabric with Virtual Extensible |
| 7 | +LAN (VXLAN) "overlays" that stretch VLANs between |
| 8 | +switches. |
| 9 | + |
| 10 | +EVPN is typically used with a spine/leaf network architecture: |
| 11 | + |
| 12 | +.. figure:: _static/spine-leaf.png |
| 13 | + :alt: Spine/leaf network architecture |
| 14 | + :class: no-scaled-link |
| 15 | + |
| 16 | +This type of multipath network is resilient to failures of |
| 17 | +individual links or devices. A standard layer 2 (Ethernet) network |
| 18 | +cannot offer redundant paths without either introducing forwarding loops |
| 19 | +or disabling redundant links (e.g. with Spanning Tree Protocol (STP)). |
| 20 | + |
| 21 | +Often the leaf switches will be paired, with MLAG or similar technology |
| 22 | +used to connect servers to both switches in a pair. |
| 23 | + |
| 24 | +Border Gateway Protocol (BGP) is used in the control plane of both the |
| 25 | +underlay and the overlay to exchange connectivity information between |
| 26 | +the switches. BGP is a proven, widely used protocol that underpins the exchange of routing |
| 27 | +information on the Internet. The underlay uses standard BGP, while the |
| 28 | +overlay uses Multiprotocol BGP (MP-BGP) to exchange MAC addresses and |
| 29 | +other information between VXLAN Tunnel Endpoints (VTEPs). |
| 30 | + |
| 31 | +This network architecture is undoubtedly more complex than the standard |
| 32 | +layer 2 networks we have generally used in the past. It's easiest |
| 33 | +to build up the picture in layers. |
| 34 | + |
| 35 | +Underlay IP links |
| 36 | +----------------- |
| 37 | + |
| 38 | +Each leaf switch has a layer 3 (IP) /31 point-to-point connection to |
| 39 | +each spine switch. Typically we would divide up a supernet (e.g. /24) |
| 40 | +into multiple /31 subnets. Doing this puts the spine-leaf links into |
| 41 | +layer 3 mode. |
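The subnet carving described above can be sketched with Python's standard ``ipaddress`` module. This is only an illustration: the ``172.0.0.0/24`` supernet is assumed from the addresses used in the examples here, and ``fabric_links`` is a hypothetical helper name. It follows the convention of the examples, where the leaf takes the lower address of each /31 and the spine the higher one.

```python
import ipaddress

# Assumed supernet reserved for fabric links (matches the example addresses).
supernet = ipaddress.ip_network("172.0.0.0/24")


def fabric_links(supernet, count):
    """Return (leaf_ip, spine_ip) pairs, one per /31 point-to-point link."""
    links = []
    for i, subnet in enumerate(supernet.subnets(new_prefix=31)):
        if i >= count:
            break
        # A /31 contains exactly two addresses, one for each end of the link.
        leaf_ip, spine_ip = subnet[0], subnet[1]
        links.append((f"{leaf_ip}/31", f"{spine_ip}/31"))
    return links


for leaf, spine in fabric_links(supernet, 3):
    print(f"leaf {leaf} <-> spine {spine}")
```

A /24 yields 128 such /31 links, which gives a rough upper bound on the number of fabric links one supernet can serve.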
| 42 | + |
| 43 | +Example leaf interface config on Dell OS10: |
| 44 | + |
| 45 | +:: |
| 46 | + |
| 47 | + ! |
| 48 | + interface ethernet1/1/1 |
| 49 | + no shutdown |
| 50 | + no switchport |
| 51 | + ip address 172.0.0.0/31 |
| 52 | + |
| 53 | +Example spine interface config on Dell OS10: |
| 54 | + |
| 55 | +:: |
| 56 | + |
| 57 | + ! |
| 58 | + interface ethernet1/1/1 |
| 59 | + no shutdown |
| 60 | + no switchport |
| 61 | + ip address 172.0.0.1/31 |
| 62 | + |
| 63 | +One of the implications of this is that each switch and the hosts |
| 64 | +attached to it have become an isolated layer 2 (Ethernet) network. |
| 65 | + |
| 66 | +Another implication is that each switch only has layer 3 (IP) |
| 67 | +connectivity to other neighbouring switches. |
| 68 | + |
| 69 | +On Dell leaf switches with Virtual Link Trunking (VLT, aka MLAG), the |
| 70 | +inter-switch link between leaf switch pairs is also configured with an |
| 71 | +IP point-to-point link. This link ends up getting used more than you might |
| 72 | +expect. |
| 73 | + |
| 74 | +BGP underlay |
| 75 | +------------ |
| 76 | + |
| 77 | +In order to stitch together these individual point to point IP fabric |
| 78 | +links, we use a BGP control plane to exchange routing information |
| 79 | +between the switches. This allows each switch to reach (via L3) not only |
| 80 | +its immediate neighbours, but any of their neighbours (and so on). Wait, |
| 81 | +we could build an Internet out of this... |
| 82 | + |
| 83 | +For the BGP underlay, each switch establishes a BGP session with each of |
| 84 | +its immediate neighbours. |
| 85 | + |
| 86 | +Example on Dell OS10 from a leaf: |
| 87 | + |
| 88 | +:: |
| 89 | + |
| 90 | + # show ip bgp summary |
| 91 | + BGP router identifier 172.1.0.0 local AS number 65001 |
| 92 | + Neighbor AS MsgRcvd MsgSent Up/Down State/Pfx |
| 93 | + 172.0.0.1 65101 22834 22842 1w:6d:18:55:29 34 |
| 94 | + |
| 95 | +The neighbour's IP address (172.0.0.1) is the fabric link partner's IP. |
| 96 | + |
| 97 | +The BGP router identifier (172.1.0.0) is unique to each switch, and |
| 98 | +should come from a separate IP range from the fabric links. On Dell OS10 |
| 99 | +switches this is assigned to a loopback device as a /32 IP address. |
| 100 | + |
| 101 | +The Autonomous System (AS) number (spine: 65101, leaf: 65001) may be |
| 102 | +assigned to multiple devices. Dell provides various reference |
| 103 | +configurations which use either a single shared AS or multiple ASes. It's not |
| 104 | +clear to me why you would choose one approach over the other. |
| 105 | + |
| 106 | +One thing that may not be immediately obvious is that BGP within an AS |
| 107 | +is internal BGP (iBGP), whereas between different ASes it is external BGP |
| 108 | +(eBGP). |
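As a minimal sketch of that distinction, the session type follows directly from comparing the two AS numbers. ``bgp_session_type`` is a hypothetical helper written for illustration, not part of any switch tooling.

```python
def bgp_session_type(local_as: int, remote_as: int) -> str:
    """Classify a BGP session from the AS numbers at each end."""
    # Same AS on both ends means internal BGP; otherwise external BGP.
    return "iBGP" if local_as == remote_as else "eBGP"


# Leaf (65001) peering with a spine (65101), as in the examples here.
print(bgp_session_type(65001, 65101))
```

With the leaf and spine AS numbers used in this document the sessions are eBGP. A design with a single shared AS would make them iBGP instead.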
| 109 | + |
| 110 | +BGP neighbour info on a Dell OS10 system: |
| 111 | + |
| 112 | +:: |
| 113 | + |
| 114 | + # show ip bgp neighbors |
| 115 | + BGP neighbor is 172.0.0.1, remote AS 65101, local AS 65001 external link |
| 116 | + |
| 117 | + BGP version 4, remote router ID 172.1.0.1 |
| 118 | + BGP state ESTABLISHED, in this state for 1 weeks 6 days 19:04:49 |
| 119 | + Last read 01:18:01 seconds |
| 120 | + Hold time is 180, keepalive interval is 60 seconds |
| 121 | + Configured hold time is 180, keepalive interval is 60 seconds |
| 122 | + Fall-over disabled |
| 123 | + |
| 124 | + Received 22845 messages |
| 125 | + 1 opens, 0 notifications, 15 updates |
| 126 | + 22829 keepalives, 0 route refresh requests |
| 127 | + Sent 22854 messages |
| 128 | + 1 opens, 0 notifications, 17 updates |
| 129 | + 22836 keepalives, 0 route refresh requests |
| 130 | + Minimum time between advertisement runs is 30 seconds |
| 131 | + Minimum time before advertisements start is 0 seconds |
| 132 | + |
| 133 | + Capabilities received from neighbor for IPv4 Unicast: |
| 134 | + MULTIPROTO_EXT(1) |
| 135 | + ROUTE_REFRESH(2) |
| 136 | + CISCO_ROUTE_REFRESH(128) |
| 137 | + 4_OCTET_AS(65) |
| 138 | + Capabilities advertised to neighbor for IPv4 Unicast: |
| 139 | + MULTIPROTO_EXT(1) |
| 140 | + ROUTE_REFRESH(2) |
| 141 | + CISCO_ROUTE_REFRESH(128) |
| 142 | + 4_OCTET_AS(65) |
| 143 | + Prefixes accepted 34, Prefixes advertised 36 |
| 144 | + Connections established 1; dropped 0 |
| 145 | + Last reset never |
| 146 | + For address family: IPv4 Unicast |
| 147 | + Allow local AS number 0 times in AS-PATH attribute |
| 148 | + Prefixes ignored due to: |
| 149 | + Martian address 0, Our own AS in AS-PATH 0 |
| 150 | + Invalid Nexthop 0, Invalid AS-PATH length 0 |
| 151 | + Wellknown community 0, Locally originated 0 |
| 152 | + |
| 153 | + Local host: 172.0.0.0, Local port: 179 |
| 154 | + Foreign host: 172.0.0.1, Foreign port: 44058 |
| 155 | + |
| 156 | +We're looking for a BGP state of ESTABLISHED. |
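When there are many neighbours, this check could be scripted. The following is a rough sketch that scans text in the shape of the neighbour output above. It assumes the ``BGP neighbor is ...`` and ``BGP state ...`` lines look exactly as in that sample, and ``bgp_states`` is a hypothetical helper name.

```python
import re


def bgp_states(neighbors_output: str) -> dict:
    """Map neighbour IP -> BGP state from 'show ip bgp neighbors' style text."""
    states = {}
    current = None
    for line in neighbors_output.splitlines():
        m = re.search(r"BGP neighbor is (\S+?),", line)
        if m:
            current = m.group(1)
            continue
        m = re.search(r"BGP state (\w+)", line)
        if m and current:
            states[current] = m.group(1)
    return states


sample = """\
BGP neighbor is 172.0.0.1, remote AS 65101, local AS 65001 external link
  BGP version 4, remote router ID 172.1.0.1
  BGP state ESTABLISHED, in this state for 1 weeks 6 days 19:04:49
"""

# Any neighbour listed here needs attention.
down = [ip for ip, state in bgp_states(sample).items() if state != "ESTABLISHED"]
print(down)
```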
| 157 | + |
| 158 | +Here is the BGP route table on Dell OS10: |
| 159 | + |
| 160 | +:: |
| 161 | + |
| 162 | + # show ip bgp |
| 163 | + BGP local RIB : Routes to be Added , Replaced , Withdrawn |
| 164 | + BGP local router ID is 172.1.0.0 |
| 165 | + Status codes: s suppressed, S stale, d dampened, h history, * valid, > best |
| 166 | + Path source: I - internal, a - aggregate, c - confed-external, |
| 167 | + r - redistributed/network, S - stale |
| 168 | + Origin codes: i - IGP, e - EGP, ? - incomplete |
| 169 | + Network Next Hop Metric LocPrf Weight Path |
| 170 | + * 172.0.0.0/31 172.0.0.1 0 100 0 65001 ? |
| 171 | + * 172.0.0.0/31 172.2.0.5 0 100 0 65001 ? |
| 172 | + *>r 172.0.0.0/31 0.0.0.0 0 100 32768 ? |
| 173 | + |
| 174 | +At this point it should be possible to ping the fabric IP address of any |
| 175 | +switch in the network. |
| 176 | + |
| 177 | +On Dell OS10 the BGP underlay is configured as follows: |
| 178 | + |
| 179 | +:: |
| 180 | + |
| 181 | + router bgp 65001 |
| 182 | + router-id 172.1.0.0 |
| 183 | + ! |
| 184 | + address-family ipv4 unicast |
| 185 | + redistribute connected |
| 186 | + ! |
| 187 | + neighbor 172.0.0.1 |
| 188 | + remote-as 65101 |
| 189 | + no shutdown |
| 190 | + ! |
| 191 | + address-family ipv4 unicast |
| 192 | + no sender-side-loop-detection |
| 193 | + ! |
| 194 | + |
| 195 | +BGP-EVPN overlay |
| 196 | +---------------- |
| 197 | + |
| 198 | +The MP-BGP overlay is used to share VXLAN connectivity information |
| 199 | +between switches. |
| 200 | + |
| 201 | +On Dell OS10 (from a leaf): |
| 202 | + |
| 203 | +:: |
| 204 | + |
| 205 | + # show ip bgp l2vpn evpn summary |
| 206 | + BGP router identifier 172.1.0.0 local AS number 65001 |
| 207 | + Neighbor AS MsgRcvd MsgSent Up/Down State/Pfx |
| 208 | + 172.1.0.1 65101 29100 33582 1w:6d:19:15:50 295 |
| 209 | + |
| 210 | +This may appear similar to the underlay BGP summary, however here the |
| 211 | +neighbours are using the per-switch BGP router ID. This IP is now |
| 212 | +reachable across the IP fabric. Again, each switch establishes a session |
| 213 | +with its immediate neighbours. |
| 214 | + |
| 215 | +BGP neighbour info on a Dell OS10 system: |
| 216 | + |
| 217 | +:: |
| 218 | + |
| 219 | + # show ip bgp l2vpn evpn neighbors |
| 220 | + BGP neighbor is 172.1.0.1, remote AS 65101, local AS 65001 external link |
| 221 | + |
| 222 | + BGP version 4, remote router ID 172.1.0.1 |
| 223 | + BGP state ESTABLISHED, in this state for 1 weeks 6 days 21:59:35 |
| 224 | + Last read 00:11:56 seconds |
| 225 | + Hold time is 180, keepalive interval is 60 seconds |
| 226 | + Configured hold time is 180, keepalive interval is 60 seconds |
| 227 | + Fall-over disabled |
| 228 | + EBGP multihop enabled, multihop TTL set to 4 |
| 229 | + |
| 230 | + Received 39322 messages |
| 231 | + 2 opens, 2 notifications, 21181 updates |
| 232 | + 18137 keepalives, 0 route refresh requests |
| 233 | + Sent 32041 messages |
| 234 | + 5 opens, 0 notifications, 12303 updates |
| 235 | + 19733 keepalives, 0 route refresh requests |
| 236 | + Minimum time between advertisement runs is 30 seconds |
| 237 | + Minimum time before advertisements start is 0 seconds |
| 238 | + |
| 239 | + Prefixes accepted 270, Prefixes advertised 163 |
| 240 | + Connections established 2; dropped 2 |
| 241 | + Closed by neighbor sent 1 weeks 6 days 21:59:50 ago |
| 242 | + Local host: 172.1.0.0, Local port: 41483 |
| 243 | + Foreign host: 172.1.0.1, Foreign port: 179 |
| 244 | + |
| 245 | +Again, we're looking for a state of ESTABLISHED. At Habrok we saw the |
| 246 | +BGP session reach ESTABLISHED, then sometimes flap after 3 minutes. |
| 247 | +This is the default hold time (180 seconds). It would happen when a large |
| 248 | +BGP update was dropped by an MTU black hole on the network path (the |
| 249 | +inter-switch link), so the session stalled until the hold timer expired. |
| 250 | + |
| 251 | +So far we have not configured any VXLANs to share information about. |
| 252 | +Let's fix that. |
| 253 | + |
| 254 | +VXLANs |
| 255 | +------ |
| 256 | + |
| 257 | +If we return to our mental model of each switch as an isolated layer 2 |
| 258 | +Ethernet network, consider connecting up those isolated networks with a |
| 259 | +series of overlay networks, such that a host in VLAN A on switch 1 again |
| 260 | +has direct connectivity to a host in VLAN A on switch 2. We can do this |
| 261 | +using VXLANs. These overlays, or tunnels, encapsulate a |
| 262 | +layer 2 Ethernet frame within a VXLAN UDP packet. This allows the frame to |
| 263 | +traverse a network with only layer 3 connectivity, such as our underlay |
| 264 | +fabric. |
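To make the encapsulation concrete, here is a small sketch of the 8-byte VXLAN header defined in RFC 7348, built with Python's ``struct`` module. The helper names are hypothetical. The overhead figure assumes an IPv4 underlay with no outer VLAN tag.

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port for VXLAN


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08  # "I" flag: the VNI field is valid
    # Word 1: flags byte + 24 reserved bits. Word 2: 24-bit VNI + 8 reserved bits.
    return struct.pack("!II", flags << 24, vni << 8)


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Return the UDP payload: VXLAN header followed by the original frame."""
    return vxlan_header(vni) + inner_frame


# Outer Ethernet (14) + IPv4 (20) + UDP (8) + VXLAN (8) bytes.
OVERHEAD = 14 + 20 + 8 + 8

print(vxlan_header(10016).hex(), OVERHEAD)
```

The extra 50 bytes per frame is why the underlay links generally need a larger MTU than the hosts use.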
| 265 | + |
| 266 | +We must create a VXLAN network on each switch that maps to a VLAN. |
| 267 | + |
| 268 | +On a Dell OS10 system, here is one such VXLAN network: |
| 269 | + |
| 270 | +:: |
| 271 | + |
| 272 | + # show virtual-network 10016 |
| 273 | + Codes: DP - MAC-learn Dataplane, CP - MAC-learn Controlplane, UUD - Unknown-Unicast-Drop |
| 274 | + Virtual Network: 10016 |
| 275 | + Members: |
| 276 | + VLAN 16: port-channel1000 |
| 277 | + VxLAN Virtual Network Identifier: 10016 |
| 278 | + Source Interface: loopback0(172.2.0.0) |
| 279 | + Remote-VTEPs (flood-list): |
| 280 | + |
| 281 | +In this case we have VXLAN Network Identifier (VNI) 10016, which maps to VLAN 16. The source |
| 282 | +interface is loopback0, which we have configured with a /32 IP address |
| 283 | +for the VTEP. In an MLAG scenario, this IP address is shared between |
| 284 | +the two switches in a leaf pair. VTEP IP addresses are used as the source and |
| 285 | +destination of the outer VXLAN UDP packet. |
| 286 | + |
| 287 | +Currently, there are no remote VTEPs. |
| 288 | + |
| 289 | +EVIs |
| 290 | +---- |
| 291 | + |
| 292 | +EVPN Instances (EVIs) are the missing link between the EVPN BGP control |
| 293 | +plane and the VXLAN networks - they define which VXLAN networks will be |
| 294 | +shared via EVPN BGP, and with which switches. |
| 295 | + |
| 296 | +:: |
| 297 | + |
| 298 | + # show evpn evi 10016 |
| 299 | + |
| 300 | + EVI : 10016, State : up |
| 301 | + Bridge-Domain : Virtual-Network 10016, VNI 10016 |
| 302 | + Route-Distinguisher : 1:172.2.0.0:10016 |
| 303 | + Route-Targets : 0:65001:10016 both, 0:65101:10016 import |
| 304 | + Inclusive Multicast : 172.2.0.1 |
| 305 | + IRB : Disabled |
| 306 | + |
| 307 | +On Dell OS10 switches there is an "auto evi" mode, which automatically |
| 308 | +adds an EVI for each VXLAN. However, this doesn't work with the multiple |
| 309 | +AS topology used at Habrok. |
| 310 | + |
| 311 | +The route distinguisher (RD) is an ID that keeps the routes shared by this switch unique. |
| 312 | +The Route Targets (RTs) tag routes for export and import; here they are built from |
| 313 | +the AS numbers of the switches and the VNI. Routes can be exported, imported, or both. |
| 314 | +Inclusive multicast defines the list of VTEPs to be included in a multicast group for |
| 315 | +broadcast, unknown unicast and multicast (BUM) traffic. IRB is Integrated Routing and Bridging, which we'll get onto. |
| 316 | + |
| 317 | +Now that we have an EVI configured for our VXLAN, we see EVPN |
| 318 | +"routes" for MAC addresses: |
| 319 | + |
| 320 | +:: |
| 321 | + |
| 322 | + * Route distinguisher: 172.23.62.133:10016 VNI:10016 |
| 323 | + [2]:[0]:[48]:[16:7f:06:fb:02:47]:[0]:[0.0.0.0]/280 172.23.62.133 0 100 0 65103 65005 ? |
| 324 | + |
| 325 | +The most common type of route is type 2, and this defines MAC address |
| 326 | +routes. Each EVI shares MAC addresses in its local MAC table for the |
| 327 | +VLAN with other EVPN switches, avoiding the "flood and learn" behaviour |
| 328 | +of a static VXLAN configuration. This means that a MAC address lookup on |
| 329 | +switch A will now potentially include remote VTEPs, as well as local |
| 330 | +interfaces. |
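The bracketed route string shown above can be picked apart with a regular expression. This sketch assumes the fields are, in order, route type, Ethernet tag, MAC length, MAC address, IP length and IP address. That layout is inferred from the sample line and from how similar BGP implementations print type 2 routes, so treat the field names as assumptions rather than Dell's definitions.

```python
import re

ROUTE_RE = re.compile(
    r"\[(?P<type>\d+)\]:\[(?P<eth_tag>\d+)\]:\[(?P<mac_len>\d+)\]:"
    r"\[(?P<mac>[0-9a-fA-F:]{17})\]:\[(?P<ip_len>\d+)\]:\[(?P<ip>[^\]]+)\]"
)


def parse_type2(route: str) -> dict:
    """Extract the MAC (and IP, if present) from a type 2 route string."""
    m = ROUTE_RE.match(route)
    if m is None or m["type"] != "2":
        raise ValueError(f"not a type 2 route: {route!r}")
    return {
        "eth_tag": int(m["eth_tag"]),
        "mac": m["mac"],
        # An IP length of 0 means the route carries a MAC address only.
        "ip": m["ip"] if int(m["ip_len"]) else None,
    }


print(parse_type2("[2]:[0]:[48]:[16:7f:06:fb:02:47]:[0]:[0.0.0.0]/280"))
```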
| 331 | + |
| 332 | +Integrated Routing and Bridging (IRB) |
| 333 | +------------------------------------- |
| 334 | + |
| 335 | +IRB can be used to perform routing in a distributed manner, across the |
| 336 | +fabric. Typically, each leaf switch is configured as a router, and will |
| 337 | +route on ingress to the destination VXLAN. |
| 338 | + |
| 339 | +Resources |
| 340 | +--------- |
| 341 | + |
| 342 | +This explainer series by nullzero is very helpful in building up the |
| 343 | +details in the picture. Here's the first part: |
| 344 | +https://www.nullzero.co.uk/aruba-aos-cx-evpn-vxlan/ |