Add 'debug at startup' capability #2544

Closed
ipspace wants to merge 2 commits into dev from debug

ipspace (Owner) commented Jul 16, 2025

This is a proof-of-concept (for Cisco IOS) of a capability that enables debugging at the very beginning of initial device configuration to ensure no relevant events are lost.

It introduces a new node attribute (debug), an extra flag for the 'must_be_list' function that splits a multi-line string value into separate lines when netlab expects a list of values, and a sample initial config template.
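
To illustrate the line-splitting behavior described above, here is a minimal sketch. The helper name `as_list` and its `split_lines` parameter are assumptions made for this example and are not the actual `must_be_list` signature in netlab; the debug strings are placeholders.

```python
from typing import Any, List


def as_list(value: Any, split_lines: bool = False) -> List[Any]:
    """Hypothetical helper: turn a scalar into a list, optionally splitting lines."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    if isinstance(value, str) and split_lines:
        # One list item per non-empty line, surrounding whitespace removed
        return [line.strip() for line in value.splitlines() if line.strip()]
    # Any other scalar becomes a single-item list
    return [value]


# A node whose 'debug' attribute was written as a multi-line string
node = {"debug": "ip bgp updates\nip bgp events\n"}
print(as_list(node["debug"], split_lines=True))
# prints ['ip bgp updates', 'ip bgp events']
```

Without the flag, the same string would stay a single list item, which is presumably why the extra flag is needed for attributes written as multi-line text.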

Also, I added a whole document explaining how one could do debugging based on my lovely recent experience with Cisco IOS and Aruba CX (more about those coming soon from the usual soapbox).

ipspace (Owner, Author) commented Jul 16, 2025

Based on my recent fights with BGP IPv6 AF on Cisco IOS. Would love to hear your feedback @ssasso @DanPartelly @jbemmel

ssasso (Collaborator) commented Jul 16, 2025

I'd prefer having the debug "code" in a separate file (or even an Ansible include task, similar to what we do for the device readiness check), called before the initial configuration when the debug hostvar is present.

Owner Author

> I'd prefer having the debug "code" in a separate file (or even an Ansible include task, similar to what we do for the device readiness check)

Any particular reason for that? I'm already annoyed by the amount of noise Ansible produces 🤷‍♂️

Collaborator

I would suggest a Jinja include file - same Ansible task, but keeping the debug logic separate from the rest.

I could imagine that the debugging logic could become quite extensive (dozens of flags to be set)

Collaborator

> > I'd prefer having the debug "code" in a separate file (or even an Ansible include task, similar to what we do for the device readiness check)
>
> Any particular reason for that? I'm already annoyed by the amount of noise Ansible produces 🤷‍♂️

Just logical separation, maybe easier to follow for my mind.

DanPartelly (Collaborator) commented Jul 16, 2025

I do not think that modularization and device abstractions are worth it here. While they would be cool to have, just imagine tens of debug flags, abstracted over N modules and M devices. So many lookup tables, so many git diffs, and then you end up in a situation where you forgot to abstract Y's favorite BGP debugging flag, then you have another pull request, and so on and so forth.

Who will do all this work? By contrast, the current method "just works" even if it's not abstracted or modular in nature.

I frankly like it as it is; it's really pragmatic.

DanPartelly (Collaborator) commented Jul 16, 2025

I love this one. As it is. Very easy to set. Not very keen to see more Ansible tasks/roles/whatever.

jbemmel (Collaborator) left a comment

This should work, but would it be an idea to have a 'debug' flag per module (bgp.debug, ospf.debug, etc.) and then define the debug flags to be enabled in the device-specific YAML files, under the features for that module (features.bgp.debug)?

Users could customize these in their topology if they need to.

The current implementation feels a bit like a quick hack (and I think I know what I'm talking about ;) - it lacks the Netlab signature device abstraction. When debugging, it makes most sense to enable the same debugging flags on all devices of a particular kind - as opposed to on each node individually

Imagine the following: The user would set

bgp.debug: True

and BGP debugging would be automatically enabled on all devices in the lab, handling vendor specific nuances
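
As a rough illustration of this proposal (a sketch only, nothing like this exists in the PR), the lookup could work as follows. The `DEVICE_FEATURES` table, the `debug_commands` function, and the flag strings are all hypothetical; in the proposal, the per-device flag lists would live in the device-specific YAML files under features.bgp.debug.

```python
from typing import Any, Dict, List

# Illustrative values only; real per-device flag lists would come from
# the device-specific YAML files (features.<module>.debug)
DEVICE_FEATURES: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "ios": {"bgp": {"debug": ["ip bgp updates"]}},
    "frr": {"bgp": {"debug": ["bgp updates"]}},
}


def debug_commands(device: str, modules: Dict[str, Dict[str, Any]]) -> List[str]:
    """Collect debug commands for every module that has 'debug' set to True."""
    commands: List[str] = []
    for module, settings in modules.items():
        if not settings.get("debug"):
            continue
        # Devices without a flag list for this module contribute nothing
        flags = DEVICE_FEATURES.get(device, {}).get(module, {}).get("debug", [])
        commands.extend(f"debug {flag}" for flag in flags)
    return commands


# The user sets bgp.debug: True once; each device maps it to its own commands
lab_modules = {"bgp": {"debug": True}, "ospf": {}}
for device in ("ios", "frr"):
    print(device, debug_commands(device, lab_modules))
```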

DanPartelly (Collaborator) left a comment

For me, pragmatism here beats abstraction and modularization.

ipspace (Owner, Author) commented Jul 17, 2025

Thanks a million for the feedback. Going through it in random order:

> The current implementation feels a bit like a quick hack (and I think I know what I'm talking about ;) - it lacks the Netlab signature device abstraction.

Correct. This is a hack that addresses a very specific need: enabling debugging early enough to capture all events, but immediately after the device boots in case some events might be uptime-specific. For anything else, we already have solutions (documented in the .md file that's part of this PR).

> This should work, but would it be an idea to have a 'debug' flag per module (bgp.debug, ospf.debug, etc.) and then define the debug flags to be enabled in the device-specific YAML files, under the features for that module (features.bgp.debug)?

While that sounds great, I'm not ready to invest time into making it happen (ignoring for the moment that testing the correctness of the implementations would be... interesting). Nobody asked for it, and the debugging capabilities heavily depend on the implementations. Also, "debugging BGP" sounds great, but do go through the debug bgp CLI on any reasonable device and you'll see how many options there are. The options you want to use depend on what you're trying to troubleshoot. Enable too little and you won't see a thing. Enable too much and you'll be swamped. Oh, and BGP also has something called "address families" ;)

> Just logical separation, maybe easier to follow for my mind.

In theory, that's correct. In practice, it's a tiny for loop, and I don't expect it to be anything more any time soon (see above).
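
For readers who have not opened the diff, the "tiny for loop" can be approximated in Python as below. This is a sketch and not the template from the PR: the function name is hypothetical, and it assumes each entry in the node's debug list is whatever follows the `debug` keyword on the device CLI.

```python
from typing import Any, Dict, List


def render_debug_commands(node: Dict[str, Any]) -> List[str]:
    """Return one CLI line per entry in the node's 'debug' list."""
    commands: List[str] = []
    for flag in node.get("debug", []):
        # Assumption: the entry is the argument of the 'debug' command
        commands.append(f"debug {flag}")
    return commands


print(render_debug_commands({"debug": ["ip bgp updates", "ip bgp events"]}))
# prints ['debug ip bgp updates', 'debug ip bgp events']
```

A node without the debug attribute yields an empty list, so nothing extra ends up in the initial configuration.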

In theory, we could make this part of the "normalize" (pre-initial) phase, but that's just moving the problem around (RFC 1925 rule 6). We could also add a "debug" config module to ansible/tasks/initial-config.yml, but that would just add more Ansible noise (and the debugging commands wouldn't appear in files created by the netlab initial -o command).

To wrap up: think of this as custom configuration templates, but executed very early on in the process. I don't want to have anything more than that at this stage, and I don't see (at the moment) the need for any other custom configuration executed early in the initial configuration process. However, if we get to the point where we have good reasons to have other pre-initial custom configuration templates, then this could be easily merged into that logic.

Obviously, we could also drop the whole thing (I think the problem I was trying to solve was not uptime-specific after all 😜) and just keep the "debugging network devices" documentation.

ssasso (Collaborator) commented Jul 17, 2025

> Obviously, we could also drop the whole thing (I think the problem I was trying to solve was not uptime-specific after all 😜) and just keep the "debugging network devices" documentation.

Let's keep it as it is.

ipspace added 2 commits July 21, 2025 15:47
ipspace marked this pull request as draft July 21, 2025 13:48
ipspace (Owner, Author) commented Jul 21, 2025

This is harder than I expected. For example, EOS won't allow you to enable debugging for things that are not configured, and I would expect NX-OS to behave in a similar way (due to their use of features).

Back to the drawing board. For the moment, it looks like I'll implement this as device-specific features (similar to eos.serialnumber). However, as IOS debugging applies to a whole range of devices, I have to add another tweak first. I'll be back ;)

DanPartelly (Collaborator) commented Jul 22, 2025

@ipspace Leaving aside debug flags for a second, here is an issue that is somewhat related. Today I tried to build a lab where the IS-IS overload bit is set at startup with wait-for-bgp, and I failed. My approach was to deploy custom configs for IS-IS and ACLs which block BGP neighbor formation, so I would have time to observe what is going on. The BGP neighbors were never formed, so at least the ACL part escaped the race.

This will require further investigation and the use of other images besides iol, like Xrd or CSR. Anyway, one issue is that custom configs are always applied last. For some items, to eliminate races, they should really come first (after initial). This is why I chose to post this here instead of opening a new issue. Order might matter. Your thoughts?

ipspace (Owner, Author) commented Jul 23, 2025

> Anyway, one issue is that custom configs are always applied last. For some items, to eliminate races, they should really come first (after initial).

You can solve that with a sequence of commands:

  • netlab up --no-config
  • netlab initial -i
  • netlab config template
  • netlab initial -m
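
The four commands above can be wrapped in a small script. The sketch below uses Python rather than bash; the function names are hypothetical, and `template` stands for the name of your custom configuration template, exactly as in the list.

```python
import subprocess
from typing import Callable, List


def startup_sequence(template: str) -> List[List[str]]:
    """The four commands listed above, with 'template' as the custom config name."""
    return [
        ["netlab", "up", "--no-config"],   # start the lab without configuring it
        ["netlab", "initial", "-i"],       # initial device configuration only
        ["netlab", "config", template],    # custom config before the modules
        ["netlab", "initial", "-m"],       # configuration modules
    ]


def run_sequence(template: str, runner: Callable[..., object] = subprocess.run) -> None:
    """Run the commands in order; 'runner' is injectable to allow dry runs."""
    for command in startup_sequence(template):
        # check=True stops the sequence at the first failing step
        runner(command, check=True)
```

Calling `run_sequence("my-acl")` from the lab directory would execute the steps one after another, where `my-acl` is a made-up template name.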

> This is why I chose to post this here instead of opening a new issue.

Not a good idea ;) Someone might have a similar issue, and now the discussion will be buried in some unrelated stuff (not to mention we're bloating this PR).

> Order might matter. Your thoughts?

I don't want to open that can of worms. I need something before initial. You need something between initial and other modules. Someone will need something between IS-IS and BGP... The only sane way to solve these edge requests is to use a more complex lab startup sequence (bash FTW!).

jbemmel (Collaborator) commented Jul 23, 2025

> I don't want to open that can of worms. I need something before initial. You need something between initial and other modules. Someone will need something between IS-IS and BGP... The only sane way to solve these edge requests is to use a more complex lab startup sequence (bash FTW!).

node:
      config:
        type: list
        _subtype:
          file: str
          before: list
          after: list
          _alt_types: [ str ]

not the "only" sane way
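
One possible reading of that schema, sketched in Python: treat before and after as lists of module names and sort everything topologically. This is an assumption about the intended semantics, not an existing netlab feature, and `order_steps` is a hypothetical name. Plain string entries (the _alt_types case) are placed after the last module, mirroring the current behavior where custom configs are applied last.

```python
from graphlib import TopologicalSorter
from typing import Any, Dict, List, Union


def order_steps(modules: List[str], configs: List[Union[str, Dict[str, Any]]]) -> List[str]:
    """Order built-in modules and custom config files using before/after hints."""
    sorter: TopologicalSorter = TopologicalSorter()
    for module in modules:
        sorter.add(module)
    # Keep the built-in modules in their given order
    for earlier, later in zip(modules, modules[1:]):
        sorter.add(later, earlier)
    for item in configs:
        if isinstance(item, str):
            # Plain string: no hints, so apply it after the last module
            item = {"file": item, "after": modules[-1:]}
        name = item["file"]
        sorter.add(name)
        for target in item.get("before", []):
            sorter.add(target, name)   # name must run before target
        for target in item.get("after", []):
            sorter.add(name, target)   # name must run after target
    # static_order() raises graphlib.CycleError on contradictory hints
    return list(sorter.static_order())


print(order_steps(
    ["initial", "isis", "bgp"],
    [{"file": "acl", "after": ["initial"], "before": ["isis"]}],
))
# prints ['initial', 'acl', 'isis', 'bgp']
```

Computing the order is the easy part; the sketch says nothing about how the deployment tasks would consume it.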

ipspace (Owner, Author) commented Jul 23, 2025

> not the "only" sane way

Congratulations, you successfully defined the data schema. Now go and solve the remaining 95% of the problem, but do it somewhere else, not in an unrelated PR.

ipspace (Owner, Author) commented Jul 25, 2025

Thanks again for all the feedback. Will replace this PR with a more focused one targeting IOS and FRR.

4 participants