Add 'debug at startup' capability by ipspace · Pull Request #2544 · ipspace/netlab

ipspace · 2025-07-16T13:17:05Z

This is a proof-of-concept (for Cisco IOS) of a capability that enables debugging at the very beginning of initial device configuration to ensure no relevant events are lost.

It introduces a new node attribute (debug), an extra flag for the 'must_be_list' function that splits a multi-line string into lines when netlab expects a list of values, and a sample initial config template.
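A minimal sketch of the line-splitting idea described above. It is not netlab's actual 'must_be_list' code: the helper name `as_list`, the `split_lines` parameter, the sample debug flags, and the `debug ` prefix added in the loop are all assumptions made for illustration.

```python
from typing import Any, List


def as_list(value: Any, split_lines: bool = False) -> List[Any]:
    """Return value as a list (hypothetical helper, not netlab's API).

    With split_lines=True, a multi-line string becomes one list item
    per non-empty line; otherwise a scalar is wrapped in a list.
    """
    if value is None:
        return []
    if isinstance(value, list):
        return value
    if isinstance(value, str) and split_lines:
        # Drop blank lines and surrounding whitespace
        return [line.strip() for line in value.splitlines() if line.strip()]
    return [value]


# Illustrative node data: the 'debug' attribute given as a multi-line string
node = {"name": "r1", "debug": "ip bgp updates\nip bgp events\n"}

debug_flags = as_list(node["debug"], split_lines=True)

# Assumed rendering step: a small loop that prefixes each flag with 'debug'
commands = [f"debug {flag}" for flag in debug_flags]
print(commands)
```

Accepting either a list or a multi-line string keeps the topology file readable when many flags are needed, while the template only ever has to loop over a list.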

Also, I added a whole document explaining how one could do debugging based on my lovely recent experience with Cisco IOS and Aruba CX (more about those coming soon from the usual soapbox).

ipspace · 2025-07-16T13:17:49Z

Based on my recent fights with BGP IPv6 AF on Cisco IOS. Would love to hear your feedback @ssasso @DanPartelly @jbemmel

ssasso · 2025-07-16T13:23:11Z

netsim/ansible/templates/initial/ios.j2

I'd rather prefer having the debug "code" in a separate file (or even ansible include task - similar to what we do for device readiness check), to be called before initial configuration in case the debug hostvar is present.

I'd rather prefer having the debug "code" in a separate file (or even ansible include task - similar to what we do for device readiness check)

Any particular reason for that? I'm already annoyed by the amount of noise Ansible produces 🤷‍♂️

I would suggest a Jinja include file - same Ansible task, but keeping the debug logic separate from the rest.

I could imagine that the debugging logic could become quite extensive (dozens of flags to be set)

I'd rather prefer having the debug "code" in a separate file (or even ansible include task - similar to what we do for device readiness check)

Any particular reason for that? I'm already annoyed by the amount of noise Ansible produces 🤷‍♂️

Just logical separation, maybe easier to follow for my mind.

I do not think that modularization and device abstractions are worth it here. While they would be cool to have, just imagine tens of debug flags, abstracted over N modules and M devices. So many lookup tables, so many git diffs, and then you end up in the situation where you forgot to abstract Y's favorite BGP debugging flag, then you have another pull request, and so on and so forth.

Who will do all this work? By contrast, the current method "just works" even if it's not abstracted or modular in nature.

I frankly like it as it is; it's really pragmatic.

DanPartelly · 2025-07-16T16:30:14Z

I love this one. As it is. Very easy to set. Not very keen to see more Ansible tasks/roles/whatever.

jbemmel

This should work, but would it be an idea to have a 'debug' flag per module (bgp.debug, ospf.debug, etc.) and then have the debug flags to be enabled defined in the device-specific YAML files, under the features for that module (features.bgp.debug)?

Users could customize these in their topology if they need to

The current implementation feels a bit like a quick hack (and I think I know what I'm talking about ;) - it lacks the Netlab signature device abstraction. When debugging, it makes most sense to enable the same debugging flags on all devices of a particular kind - as opposed to on each node individually

Imagine the following: The user would set

bgp.debug: True

and BGP debugging would be automatically enabled on all devices in the lab, handling vendor specific nuances
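To make the proposal above concrete, here is a rough sketch of the lookup it implies: a per-module debug switch on the node, resolved against per-device flag lists. Nothing here exists in netlab; `DEVICE_FEATURES`, `debug_commands`, and the flag lists are hypothetical and only illustrate the shape of the idea.

```python
from typing import Dict, List

# Hypothetical per-device data, standing in for features.<module>.debug
# in the device-specific YAML files. The flag lists are illustrative only.
DEVICE_FEATURES: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "ios": {"bgp": {"debug": ["ip bgp updates", "ip bgp events"]}},
    "frr": {"bgp": {"debug": ["bgp updates", "bgp neighbor-events"]}},
}


def debug_commands(device: str, modules: Dict[str, dict]) -> List[str]:
    """Expand '<module>.debug: True' into device-specific debug commands."""
    cmds: List[str] = []
    for mod_name, mod_data in modules.items():
        if not mod_data.get("debug"):
            continue  # debugging not requested for this module
        flags = DEVICE_FEATURES.get(device, {}).get(mod_name, {}).get("debug", [])
        cmds.extend(f"debug {flag}" for flag in flags)
    return cmds


# A node running BGP (debug requested) and OSPF (no debug)
node_modules = {"bgp": {"debug": True}, "ospf": {}}
print(debug_commands("ios", node_modules))
print(debug_commands("frr", node_modules))
```

The sketch also shows where the cost sits: every device needs a curated flag list per module, and a device with no entry silently produces no commands.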

DanPartelly

For me, pragmatism here beats abstraction and modularization.

ipspace · 2025-07-17T07:45:01Z

Thanks a million for the feedback. Going through it in random order:

The current implementation feels a bit like a quick hack (and I think I know what I'm talking about ;) - it lacks the Netlab signature device abstraction.

Correct. This is a hack that addresses a very specific need: enabling debugging early enough to capture all events, but immediately after the device boots in case some events might be uptime-specific. For anything else, we already have solutions (documented in the .md file that's part of this PR).

This should work, but would it be an idea to have a 'debug' flag per module (bgp.debug, ospf.debug, etc.) and then have the debug flags to be enabled defined in the device-specific YAML files, under the features for that module (features.bgp.debug)?

While that sounds great, I'm not ready to invest time into making it happen (ignoring for the moment that testing the correctness of the implementations would be... interesting). Nobody asked for it, and the debugging capabilities heavily depend on the implementations. Also, "debugging BGP" sounds great, but do go through the debug bgp CLI on any reasonable device and you'll see how many options there are. The options you want to use depend on what you're trying to troubleshoot. Enable too little and you won't see a thing. Enable too much and you'll be swamped. Oh, and BGP also has something called "address families" ;)

Just logical separation, maybe easier to follow for my mind.

In theory, that's correct. In practice, it's a tiny for loop, and I don't expect it to be anything more any time soon (see above).

In theory, we could make this part of the "normalize" (pre-initial) phase, but that's just moving the problem around (RFC 1925 rule 6). We could also add a "debug" config module to ansible/tasks/initial-config.yml, but that would just add more Ansible noise (and the debugging commands wouldn't appear in files created by the netlab initial -o command).

To wrap up: think of this as custom configuration templates, but executed very early on in the process. I don't want to have anything more than that at this stage, and I don't see (at the moment) the need for any other custom configuration executed early in the initial configuration process. However, if we get to the point where we have good reasons to have other pre-initial custom configuration templates, then this could be easily merged into that logic.

Obviously, we could also drop the whole thing (I think the problem I was trying to solve was not uptime-specific after all 😜) and just keep the "debugging network devices" documentation.

ssasso · 2025-07-17T08:31:55Z

Obviously, we could also drop the whole thing (I think the problem I was trying to solve was not uptime-specific after all 😜) and just keep the "debugging network devices" documentation.

let's keep as it is.

ipspace · 2025-07-21T13:50:59Z

This is harder than I expected. For example, EOS won't allow you to enable debugging for things that are not configured, and I would expect NX-OS to behave in a similar way (due to their use of features).

Back to the drawing board. For the moment, it looks like I'll implement this as device-specific features (similar to eos.serialnumber). However, as IOS debugging applies to a whole range of devices, I have to add another tweak first. I'll be back ;)

DanPartelly · 2025-07-22T17:56:05Z

@ipspace Leaving debug flags aside for a second, this issue is somewhat related. Today I tried to build a lab where the IS-IS overload bit is set at startup with wait-for-bgp, and I failed. My approach was to deploy custom configs for IS-IS plus ACLs that block BGP neighbor formation, so that I would have time to observe what is going on. The BGP neighbors were never formed, so at least the ACL part escaped the race.

This will require further investigation and the use of other images besides IOL, such as XRd or CSR. Anyway, one issue is that custom configs are always applied last. For some items, to eliminate races, they should really come first (after initial). This is why I chose to post this here instead of in a new issue. Order might matter. Your thoughts?

ipspace · 2025-07-23T04:59:02Z

Anyway, one issue is that custom configs are always applied last. For some items, to eliminate races, they should really come first (after initial).

You can solve that with a sequence of commands:

netlab up --no-config
netlab initial -i
netlab config template
netlab initial -m
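The same four commands can be wrapped in a small script so that the order is reproducible. This is only a sketch: `STARTUP_SEQUENCE` and `run_sequence` are made-up names, `template` is kept as the placeholder from the commands above and would be replaced with a real custom config template name, and the default is a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess
from typing import List

# The sequence from the comment above; 'template' is a placeholder name
STARTUP_SEQUENCE = [
    "netlab up --no-config",
    "netlab initial -i",
    "netlab config template",
    "netlab initial -m",
]


def run_sequence(commands: List[str], dry_run: bool = True) -> List[List[str]]:
    """Run the commands in order, stopping at the first failure.

    With dry_run=True (the default) the commands are only printed.
    Returns the parsed argument lists so the order can be inspected.
    """
    argv_list = [shlex.split(cmd) for cmd in commands]
    for argv in argv_list:
        if dry_run:
            print("would run:", " ".join(argv))
        else:
            # check=True raises CalledProcessError on a non-zero exit code
            subprocess.run(argv, check=True)
    return argv_list


if __name__ == "__main__":
    run_sequence(STARTUP_SEQUENCE)
```

Calling `run_sequence(STARTUP_SEQUENCE, dry_run=False)` would execute the steps for real; stopping on the first failure avoids deploying a custom template onto a lab that never came up.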

This is why I chose to post this here instead of in a new issue.

Not a good idea ;) Someone might have a similar issue, and now the discussion will be buried in some unrelated stuff (not to mention we're bloating this PR).

Order might matter. Your thoughts?

I don't want to open that can of worms. I need something before initial. You need something between initial and other modules. Someone will need something between IS-IS and BGP... The only sane way to solve these edge requests is to use a more complex lab startup sequence (bash FTW!).

jbemmel · 2025-07-23T13:44:05Z

I don't want to open that can of worms. I need something before initial. You need something between initial and other modules. Someone will need something between IS-IS and BGP... The only sane way to solve these edge requests is to use a more complex lab startup sequence (bash FTW!).

node:
  config:
    type: list
    _subtype:
      file: str
      before: list
      after: list
      _alt_types: [ str ]

not the "only" sane way
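For what it's worth, here is one way the before/after attributes in the schema above could be turned into a deployment order: a topological sort. This is a sketch under assumptions, not anything netlab implements. `order_configs` is a made-up name, reading `_alt_types: [ str ]` as "an entry may be a bare file name" is a guess, and the entries are invented.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Union

ConfigEntry = Union[str, dict]


def order_configs(entries: List[ConfigEntry]) -> List[str]:
    """Order config templates so that all before/after constraints hold.

    Raises graphlib.CycleError if the constraints contradict each other.
    """
    # Maps each item to the set of items that must be deployed before it
    graph: Dict[str, Set[str]] = {}
    for entry in entries:
        if isinstance(entry, str):
            entry = {"file": entry}  # assumed meaning of _alt_types: [ str ]
        name = entry["file"]
        graph.setdefault(name, set())
        for other in entry.get("after", []):
            graph[name].add(other)  # name is deployed after other
        for other in entry.get("before", []):
            graph.setdefault(other, set()).add(name)  # other comes after name
    return list(TopologicalSorter(graph).static_order())


# Invented example: debugging before initial, ACLs between initial and BGP
entries = [
    {"file": "debug", "before": ["initial"]},
    {"file": "acl", "after": ["initial"], "before": ["bgp"]},
]
print(order_configs(entries))
```

Sorting is the easy part. Mapping names such as `initial` or `bgp` onto the actual configuration phases, and deciding what to do with unconstrained templates, is where the rest of the work would be.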

ipspace · 2025-07-23T14:37:35Z

not the "only" sane way

Congratulations, you successfully defined the data schema. Now go and solve the remaining 95% of the problem, but do it somewhere else, not in an unrelated PR.

ipspace · 2025-07-25T08:01:48Z

Thanks again for all the feedback. Will replace this PR with a more focused one targeting IOS and FRR.

ipspace requested review from DanPartelly, jbemmel and ssasso July 16, 2025 13:17

ssasso reviewed Jul 16, 2025

jbemmel reviewed Jul 16, 2025

DanPartelly approved these changes Jul 16, 2025

ipspace added 2 commits July 21, 2025 15:47

Tweaks, FRR support

e4ef5a9

ipspace marked this pull request as draft July 21, 2025 13:48

ipspace force-pushed the debug branch from b1cc2da to e4ef5a9 Compare July 21, 2025 13:48

DanPartelly mentioned this pull request Jul 23, 2025

is-is set-overload-bit on-startup wait-for-bgp solution #2549

Closed

ipspace closed this Jul 25, 2025

ipspace deleted the debug branch July 25, 2025 08:02

ipspace mentioned this pull request Jul 25, 2025

Add 'debug at startup' capability for IOS and FRR #2550

Merged
