151811: rfcs: tiniest spelling fix r=bghal a=bghal
TSIA
Epic: none
Release note: None
151850: roachtest: extract Fatal-level log messages to facilitate triage r=srosenberg,rickystewart,herkolategan a=williamchoe3
Fixes: #147360
### Motivation
Currently, when triaging an issue that originates from a Monitor watching a node, you get a message that will most likely require you to download the CI logs, then find and unzip the artifact. As mentioned in the linked issue, a simple grep on the node's logs can help identify the issue quickly, and in some cases the roachtest failure can be categorized as an infra-related flake (e.g. clock sync).
This enhanced logging can also help with older issues whose artifacts have been wiped after the retention period expires.
### Changes
For every failure, after artifact collection, we call a new function `inspectArtifacts()`, which runs a grep on the node logs to look for fatal-level logs. If any are found, we save those logs and append them to the `message` string we pass to the `GithubPoster` interface, which eventually passes the message to `issues.Body`.
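As a rough illustration of the filtering step, here is a minimal in-process sketch. It is not the code in this PR, which shells out to grep. The `extractFatalLines` helper, the `fatalLine` pattern, and the sample log lines are all hypothetical; the pattern assumes a fatal entry starts with the severity letter `F` followed by a six-digit date. The cap of 10 matches the limit noted under the design decisions.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// fatalLine is an assumed pattern, not the one used by the PR: a leading
// severity letter F followed by a six-digit date and a space.
var fatalLine = regexp.MustCompile(`^F\d{6} `)

// maxFatalLines mirrors the 10-line cap mentioned in the description.
const maxFatalLines = 10

// extractFatalLines returns at most maxFatalLines lines from logs that
// match fatalLine, in the order they appear.
func extractFatalLines(logs string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() && len(out) < maxFatalLines {
		if line := sc.Text(); fatalLine.MatchString(line) {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Made-up log lines for illustration only.
	logs := "I250815 12:00:00.000000 1 server/node.go:1  starting\n" +
		"F250815 12:00:01.000000 2 sql/exec.go:2  oops\n"
	for _, l := range extractFatalLines(logs) {
		fmt.Println(l)
	}
}
```

Capping the number of collected lines keeps the resulting GitHub issue body short even when a node logs many fatal entries.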
In `issues.Body`, we call a new `TemplateData.CondensedMessage` message formatter method `FatalNodeRoachtest` which is similar to the existing `FatalOrPanic` & `RSGCrash` in order to better format the github issue message (see below for an example).
* Note: I attempted to use the existing `CondensedMessage.FatalOrPanic`, but since we're only passing in a subset of the logs and because that method seems to expect a "go test like" message string, I opted to create a new method with its own regex pattern to match this new message.
### Verification
Added 2 new manual roachtests to cover the `registry.TestSpec.Monitor = True` case, and another roachtest to cover when we're not setting the test level node monitor and using a test case defined monitor on a specific node.
Used an internal SQL statement `SELECT crdb_internal.force_log_fatal('oops');` to mock fatal node behavior
* https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/sem/builtins/builtins.go#L6061
* https://docs.google.com/presentation/d/153LwR070a-BW1LGTv3SFLyB96aEVQQUvyKKWmzyO8jw/edit?slide=id.p#slide=id.p
Manually verified local single node cluster, local multi node cluster, remote single node cluster, remote multi node cluster.
For GitHub markdown rendering, added a data-driven test to `pkg/cmd/roachtest/github_test.go`. Decided not to add a case to `pkg/cmd/bazci/githubpost/issues/issues_test.go` because it would be the same test case, which seemed redundant. However, I did add a new formatter to `pkg/cmd/bazci/githubpost/issues/formatter_unit.go`, so I can see the argument for also including the test case in the `issues` package alongside the one in `roachtest`.
### Misc / Design decisions
The current grep is limited to at most 10 lines. I chose that arbitrarily and am open to changing it.
Technically, I don't think I needed concurrency control for `githubMessage`, because I only write to it during test teardown / cleanup, but I added it in case we ever append to that string when we're not running serially.
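For readers unfamiliar with the pattern, a mutex-guarded message buffer could look roughly like the sketch below. The `guardedMessage` type and its methods are hypothetical names for illustration and are not the actual `githubMessage` implementation in this PR.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// guardedMessage is a hypothetical stand-in for a message string that
// may be appended to from more than one goroutine.
type guardedMessage struct {
	mu  sync.Mutex
	buf strings.Builder
}

// Append adds s to the message while holding the lock.
func (g *guardedMessage) Append(s string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.buf.WriteString(s)
}

// String returns the accumulated message while holding the lock.
func (g *guardedMessage) String() string {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.buf.String()
}

func main() {
	var msg guardedMessage
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			msg.Append("x")
		}()
	}
	wg.Wait()
	fmt.Println(len(msg.String())) // prints 4
}
```

The lock costs little when writes only happen during serial teardown, and it keeps the field safe if a concurrent writer is added later.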
Initially I wanted to run grep on each node via `Cluster.RunE()` and return the results to the test runner, but by the time we are in the monitor defer block, the context cancellation signal has already been sent, so `Cluster.RunE()` is unable to run.
Originally I was wrapping errors thrown by the monitor with a new Monitor-specific error type, but after [this thread discussion](#151850 (comment)), in order to capture unmonitored node fatals / panics, we decided to call `inspectArtifacts` on every failure, not just monitor-specific failures. This adds an additional grep command to every failure, but it should only take a few seconds, and the tradeoff for better logging was prioritized.
### E.g. Github Issue with Fatal Logs
#152540
<img width="1347" height="690" alt="image" src="https://github.com/user-attachments/assets/f28365b1-5c04-469f-aa8a-abf2085a5474" />
152855: stmtdiagnostics: Add support for transaction diagnostics r=kyle-a-wong a=kyle-a-wong
Adds a new TxnRegistry and other supporting structs to support
the collection of transaction diagnostic bundles. The TxnRegistry
adds functionality to:
- Register a TxnRequest
  - defines the criteria for collecting a transaction
    diagnostic bundle
- Start collecting a transaction bundle
  - This is done by checking that a statement fingerprint id
    matches the first statement fingerprint id in a TxnRequest
- Save a transaction diagnostic bundle upon completion to be
  downloaded in the future
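
The register-then-match flow described above can be sketched as follows. This is a simplified illustration under assumptions, not the `TxnRegistry` added by this commit: the `txnRequest` and `txnRegistry` types, their fields, and the `ShouldStartCollecting` method are hypothetical, and bundle collection and saving are omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// txnRequest is a hypothetical, simplified request: the ordered statement
// fingerprint IDs that identify the transaction to collect.
type txnRequest struct {
	stmtFingerprintIDs []uint64
}

// txnRegistry is a hypothetical in-memory registry of pending requests.
type txnRegistry struct {
	mu       sync.Mutex
	requests []txnRequest
}

// Register stores req and returns its index as a request ID.
func (r *txnRegistry) Register(req txnRequest) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.requests = append(r.requests, req)
	return len(r.requests) - 1
}

// ShouldStartCollecting reports the first registered request whose first
// statement fingerprint ID equals stmtFingerprintID.
func (r *txnRegistry) ShouldStartCollecting(stmtFingerprintID uint64) (int, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, req := range r.requests {
		if len(req.stmtFingerprintIDs) > 0 && req.stmtFingerprintIDs[0] == stmtFingerprintID {
			return id, true
		}
	}
	return 0, false
}

func main() {
	var reg txnRegistry
	id := reg.Register(txnRequest{stmtFingerprintIDs: []uint64{42, 7}})
	got, ok := reg.ShouldStartCollecting(42)
	fmt.Println(got == id, ok) // prints "true true"
	_, ok = reg.ShouldStartCollecting(7)
	fmt.Println(ok) // prints "false": 7 is not the first fingerprint
}
```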
Since the system tables to persist transaction diagnostics and
transaction diagnostics requests don't exist yet, this commit
only registers requests in the local registry. A future
commit will add request and diagnostic persistence, as well
as add polling logic to register requests created in other
gateway nodes.
Part of: [CRDB-5342](https://cockroachlabs.atlassian.net/browse/CRDB-5342)
Epic: [CRDB-53541](https://cockroachlabs.atlassian.net/browse/CRDB-53541)
Release note: None
Co-authored-by: Brendan Gerrity <[email protected]>
Co-authored-by: William Choe <[email protected]>
Co-authored-by: Kyle Wong <[email protected]>
[This test on roachdash](https://roachdash.crdb.dev/?filter=status:open%20t:.*github_test.*&sort=title+created&display=lastcommented+project) | [Improve this report!](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/bazci/githubpost/issues)