Add anomaly detection transform stage to flowlogs-pipeline #1143
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: … The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing …
Hi @vatankh. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with … Once the patch is verified, the new status will be reflected by … I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```go
func (a *Anomaly) Transform(entry config.GenericMap) (config.GenericMap, bool) {
	value, err := utils.ConvertToFloat64(entry[a.config.ValueField])
	if err != nil {
		anomalyLog.Errorf("unable to convert %s to float: %v", a.config.ValueField, err)
```
to avoid flooding logs with errors in the data path, we tend to use an error metric rather than logs, like you can see here: https://github.com/netobserv/flowlogs-pipeline/blob/main/pkg/pipeline/encode/metrics_common.go#L192
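The pattern the reviewer points to reports conversion failures through a metric rather than a per-record log line. A minimal sketch of the idea, using a plain atomic counter as a stand-in for the pipeline's actual Prometheus error metric (the names `conversionErrors` and `toFloat64` are hypothetical, not the repo's API):

```go
package main

import (
	"fmt"
	"strconv"
	"sync/atomic"
)

// conversionErrors stands in for an error counter metric, similar in spirit
// to the one in pkg/pipeline/encode/metrics_common.go (hypothetical here).
var conversionErrors atomic.Uint64

// toFloat64 converts a raw field value; on failure it increments the
// error metric instead of emitting a log line on the hot path.
func toFloat64(raw any) (float64, bool) {
	switch v := raw.(type) {
	case float64:
		return v, true
	case int:
		return float64(v), true
	case string:
		f, err := strconv.ParseFloat(v, 64)
		if err != nil {
			conversionErrors.Add(1) // metric, not a log
			return 0, false
		}
		return f, true
	default:
		conversionErrors.Add(1)
		return 0, false
	}
}

func main() {
	v, ok := toFloat64("12.5")
	fmt.Println(v, ok, conversionErrors.Load()) // 12.5 true 0
	_, ok = toFloat64("not-a-number")
	fmt.Println(ok, conversionErrors.Load()) // false 1
}
```

The counter can then be scraped or alerted on, so a burst of malformed records shows up as a rate change instead of flooding the logs.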
```go
parts := make([]string, 0, len(a.config.KeyFields))
for _, key := range a.config.KeyFields {
	if val, ok := entry[key]; ok {
		parts = append(parts, fmt.Sprint(val))
```
we use utils.ConvertToString for this kind of conversion - it should be more performant than fmt-package conversions
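The reason a type switch beats `fmt.Sprint` is that it avoids reflection-based formatting on the hot path. A hedged sketch of the idea (the stand-in `convertToString` here is illustrative; the repo's `utils.ConvertToString` may cover different types):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// convertToString is a hypothetical stand-in for utils.ConvertToString:
// a type switch handles the common flow-field types directly, falling
// back to fmt only for rare types.
func convertToString(v any) string {
	switch t := v.(type) {
	case string:
		return t
	case int:
		return strconv.Itoa(t)
	case int64:
		return strconv.FormatInt(t, 10)
	case uint64:
		return strconv.FormatUint(t, 10)
	case float64:
		return strconv.FormatFloat(t, 'f', -1, 64)
	case bool:
		return strconv.FormatBool(t)
	default:
		return fmt.Sprint(t) // fallback for uncommon types
	}
}

// buildKey joins key-field values the way the PR's loop does.
func buildKey(entry map[string]any, keyFields []string) string {
	parts := make([]string, 0, len(keyFields))
	for _, key := range keyFields {
		if val, ok := entry[key]; ok {
			parts = append(parts, convertToString(val))
		}
	}
	return strings.Join(parts, "|")
}

func main() {
	entry := map[string]any{"SrcAddr": "10.0.0.1", "DstAddr": "10.0.0.2", "Proto": 6}
	fmt.Println(buildKey(entry, []string{"SrcAddr", "DstAddr", "Proto"}))
	// prints "10.0.0.1|10.0.0.2|6"
}
```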
|
Thanks @vatankh! This is looking pretty good already. I have a few more comments; let's start with the nitpicking one :-) : could you remove …

Then a comment on the API design: as it is, it doesn't allow running several anomaly detections (e.g. on several valueFields, or with different keys). A single Anomaly stage runs for a single value field, and if several stages are defined, they would conflict when writing to the same output fields. A simple way to fix this would be to add a … Another approach would be to allow multiple value fields in a single stage.
```go
	stddev = math.Max(math.Abs(state.baseline)*1e-6, 1e-9)
}
score := math.Abs(deviation) / stddev
state.baseline = state.baseline + a.alpha*(value-state.baseline)
```
Suggested change:

```diff
-state.baseline = state.baseline + a.alpha*(value-state.baseline)
+state.baseline += a.alpha * (value - state.baseline)
```
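To make the EWMA update concrete, here is a small self-contained sketch of the recurrence in the diff above (the `ewmaScore` helper and its inputs are made up for illustration; it simplifies the PR by always applying the stddev floor, whereas the PR only applies it when the spread is zero):

```go
package main

import (
	"fmt"
	"math"
)

// ewmaScore applies the EWMA recurrence baseline += alpha*(value-baseline)
// and returns a deviation score relative to a floored spread.
func ewmaScore(baseline *float64, value, alpha float64) float64 {
	deviation := value - *baseline
	// floor the spread to avoid division by zero, as in the PR
	stddev := math.Max(math.Abs(*baseline)*1e-6, 1e-9)
	score := math.Abs(deviation) / stddev
	*baseline += alpha * deviation
	return score
}

func main() {
	baseline := 100.0
	// steady traffic keeps the baseline on the signal...
	for i := 0; i < 5; i++ {
		ewmaScore(&baseline, 100, 0.3)
	}
	fmt.Printf("baseline after steady input: %.1f\n", baseline) // 100.0
	// ...while a sudden spike produces a very large score
	spike := ewmaScore(&baseline, 1000, 0.3)
	fmt.Println("spike scored high:", spike > 1) // true
}
```

With `alpha = 0.3` the baseline then moves 30% of the way toward each new value, so a one-off spike decays out of the baseline within a few samples.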
```go
anomalyType := "normal"
if score >= a.sensitivity {
	if value > mean {
		anomalyType = "zscore_high"
	} else {
		anomalyType = "zscore_low"
	}
}
```
Should we create an enum for that? You could then use it as the return type instead of string.
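A sketch of what the suggested typed enum could look like (the type name `AnomalyType`, the constants, and the `classify` helper are hypothetical; the string values match the PR's output fields):

```go
package main

import "fmt"

// AnomalyType is a typed alternative to the raw strings in the diff above.
type AnomalyType string

const (
	TypeNormal     AnomalyType = "normal"
	TypeZScoreHigh AnomalyType = "zscore_high"
	TypeZScoreLow  AnomalyType = "zscore_low"
)

// classify returns the enum instead of a bare string, so callers can
// switch exhaustively on the result and typos are caught at compile time.
func classify(score, sensitivity, value, mean float64) AnomalyType {
	if score < sensitivity {
		return TypeNormal
	}
	if value > mean {
		return TypeZScoreHigh
	}
	return TypeZScoreLow
}

func main() {
	fmt.Println(classify(0.5, 3, 10, 8)) // normal
	fmt.Println(classify(4, 3, 10, 8))   // zscore_high
	fmt.Println(classify(4, 3, 2, 8))    // zscore_low
}
```

Since the underlying type is still `string`, writing the value into the output `GenericMap` stays a no-op conversion.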
```yaml
  - name: write
    write:
      type: stdout
```
That's good enough as an example, but could you explain what the final goal is for your usage? Do you want to expose that in a Prometheus metric, or somewhere else?
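If the goal were Prometheus, the new output fields could in principle feed FLP's `prom` encode stage. A rough, unverified config sketch (metric names, labels, and the exact schema are assumptions to be checked against the FLP metrics documentation, not a tested configuration):

```yaml
  - name: encode
    encode:
      type: prom
      prom:
        prefix: netobserv_
        metrics:
          - name: anomaly_score
            type: gauge
            valueKey: anomaly_score
            labels:
              - SrcAddr
              - DstAddr
              - anomaly_type
```

One caveat with this route is label cardinality: per-address labels on a gauge can produce a very large number of series, so the key fields exposed as labels would need to be chosen carefully.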
Description
This PR introduces a new `anomaly` transform stage to `flowlogs-pipeline` as a first step toward anomaly detection for Kubernetes network flows (see issue #).

Key points:

- A new transform `type: anomaly` that computes streaming anomaly scores per key.
- Two algorithms:
  - `zscore`: rolling z-score over a sliding window.
  - `ewma`: exponentially weighted moving average baseline.
- Configuration:
  - `algorithm` (`ewma` | `zscore`)
  - `valueField` (numeric field, e.g. `Bytes`)
  - `keyFields` (used to group flows per entity, e.g. `[SrcAddr, DstAddr, Proto]`)
  - `windowSize`, `baselineWindow`, `sensitivity`, `ewmaAlpha`
- Output fields:
  - `anomaly_score`
  - `anomaly_type` (e.g. `warming_up`, `normal`, `zscore_high`, `zscore_low`, `ewma_high`, `ewma_low`)
  - `baseline_window` (current number of samples in the baseline window)
- Example pipeline config (`hack/examples/pipeline-anomaly.yaml`).

This is intentionally a local, per-instance anomaly stage that works on the existing pipeline input only; it does not consume Loki/Kafka yet, as discussed in the issue conversation.
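For reference, the rolling z-score idea behind the `zscore` algorithm can be sketched in a few lines (this is an illustration of the statistic, not the PR's actual implementation; the `zScore` helper and its window handling are made up):

```go
package main

import (
	"fmt"
	"math"
)

// zScore computes the z-score of value against a sliding window of
// recent samples: (value - mean) / stddev.
func zScore(window []float64, value float64) float64 {
	var sum, sumSq float64
	for _, v := range window {
		sum += v
		sumSq += v * v
	}
	n := float64(len(window))
	mean := sum / n
	variance := sumSq/n - mean*mean
	stddev := math.Sqrt(math.Max(variance, 0))
	if stddev == 0 {
		stddev = 1e-9 // avoid division by zero on a flat window
	}
	return (value - mean) / stddev
}

func main() {
	window := []float64{100, 102, 98, 101, 99}
	fmt.Printf("steady value: %.2f\n", zScore(window, 100))
	fmt.Println("spike exceeds sensitivity 3:", zScore(window, 1000) > 3)
}
```

In the PR's terms, `window` would hold the last `windowSize` samples for one key built from `keyFields`, and `sensitivity` would be the threshold the score is compared against.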
Dependencies
n/a
Testing
- `go test ./pkg/pipeline/transform -run TestTransformAnomaly`
- `go test ./...`
- `go build ./cmd/flowlogs-pipeline`
- `./flowlogs-pipeline --log-level debug --config hack/examples/pipeline-anomaly.yaml`

Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.
- [x] Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
No, this change only adds a new optional transform in flowlogs-pipeline and is not yet wired into the operator.
- Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
- Does this PR require product documentation?
- Does this PR require a product release notes entry?
- Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
QE requirements (check 1 from the list):
To run a perfscale test, comment with:
/test flp-node-density-heavy-25nodes