Skip to content

Refactor Builder to make it easier to use #523

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -10,3 +10,4 @@ e2e-suite.test
payload
test/testinfra/*.log
test/testinfra/*.pem
.aider*
15 changes: 14 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,20 @@ As PagerDuty itself does not provide finer granularity for webhooks than service
To add a new alert investigation:

- run `make bootstrap-investigation` to generate boilerplate code in `pkg/investigations` (This creates the corresponding folder & .go file, and also appends the investigation to the `availableInvestigations` interface in `registry.go`.).
- investigation.Resources contain initialized clients for the clusters aws environment, ocm and more. See [Integrations](#integrations)
- The `Run` method of your investigation receives a `ResourceBuilder`. Use its `With...` methods to request the resources your investigation needs, then call `Build()` to get a `Resources` struct containing them. The builder automatically handles dependencies between resources (e.g., requesting an AWS client will also initialize the cluster object). For example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated - I didn't know we had some scaffolding for investigations 🥳

```go
func (c *Investigation) Run(rb investigation.ResourceBuilder) (investigation.InvestigationResult, error) {
result := investigation.InvestigationResult{}
// Request an AWS client. This will also ensure the Cluster resource is built.
r, err := rb.WithAwsClient().Build()
if err != nil {
return result, err
}
// Now you can use r.AwsClient, r.Cluster, r.PdClient, etc.
// ...
}
```
- The returned `Resources` struct contains initialized clients and cluster objects. See [Integrations](#integrations) for a full list of available resources.
- Add test objects or scripts used to recreate the alert symptoms to the `pkg/investigations/$INVESTIGATION_NAME/testing/` directory for future use. Be sure to clearly document the testing procedure under the `Testing` section of the investigation-specific README.md file

### Graduating an investigation
Expand Down
23 changes: 12 additions & 11 deletions cadctl/cmd/investigate/investigate.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,6 @@ import (
"github.com/openshift/configuration-anomaly-detection/pkg/investigations/precheck"
"github.com/openshift/configuration-anomaly-detection/pkg/logging"
"github.com/openshift/configuration-anomaly-detection/pkg/metrics"
ocm "github.com/openshift/configuration-anomaly-detection/pkg/ocm"
"github.com/openshift/configuration-anomaly-detection/pkg/pagerduty"

"github.com/spf13/cobra"
Expand Down Expand Up @@ -101,26 +100,28 @@ func run(cmd *cobra.Command, _ []string) error {
return err
}

ocmClient, err := ocm.New()
builder, err := investigation.NewResourceBuilder(pdClient, clusterID, alertInvestigation.Name(), logLevelFlag, pipelineNameEnv)
if err != nil {
return fmt.Errorf("could not initialize ocm client: %w", err)
return fmt.Errorf("failed to create resource builder: %w", err)
}

builder := &investigation.ResourceBuilderT{}
defer func() {
if err != nil {
handleCADFailure(err, builder, pdClient)
}
}()

// Prime the builder with information required for all investigations.
builder.WithName(alertInvestigation.Name()).WithCluster(clusterID).WithPagerDutyClient(pdClient).WithOcmClient(ocmClient).WithLogger(logLevelFlag, pipelineNameEnv)

precheck := precheck.ClusterStatePrecheck{}
result, err := precheck.Run(builder)
if err != nil && strings.Contains(err.Error(), "no cluster found") {
logging.Warnf("No cluster found with ID '%s'. Escalating and exiting.", clusterID)
return pdClient.EscalateIncidentWithNote("CAD was unable to find the incident cluster in OCM. An alert for a non-existing cluster is unexpected. Please investigate manually.")
if err != nil {
if strings.Contains(err.Error(), "no cluster found") {
logging.Warnf("No cluster found with ID '%s'. Escalating and exiting.", clusterID)
return pdClient.EscalateIncidentWithNote("CAD was unable to find the incident cluster in OCM. An alert for a non-existing cluster is unexpected. Please investigate manually.")
}
return err
}
if result.StopInvestigations != nil {
return result.StopInvestigations
}

ccamInvestigation := ccam.Investigation{}
Expand All @@ -140,7 +141,7 @@ func run(cmd *cobra.Command, _ []string) error {
return updateIncidentTitle(pdClient)
}

func handleCADFailure(err error, rb *investigation.ResourceBuilderT, pdClient *pagerduty.SdkClient) {
func handleCADFailure(err error, rb investigation.ResourceBuilder, pdClient *pagerduty.SdkClient) {
logging.Errorf("CAD investigation failed: %v", err)

var notes string
Expand Down
8 changes: 3 additions & 5 deletions hack/bootstrap-investigation.sh
Original file line number Diff line number Diff line change
Expand Up @@ -91,12 +91,10 @@ import (
type Investigation struct{}
func (c *Investigation) Run(r *investigation.Resources) (investigation.InvestigationResult, error) {
func (c *Investigation) Run(rb *investigation.ResourceBuilder) (investigation.InvestigationResult, error) {
result := investigation.InvestigationResult{}
// Initialize PagerDuty note writer
notes := notewriter.New(r.Name, logging.RawLogger)
defer func() { r.Notes = notes }()
// TODO: Add additional builder configuration depending on your required resources using With...()
r, err := rb.WithNotes().Build()
// TODO: Implement investigation logic here
Expand Down
2 changes: 1 addition & 1 deletion pkg/investigations/cpd/cpd.go
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ var byovpcRoutingSL = &ocm.ServiceLog{Severity: "Major", Summary: "Installation
// In the future, we want to automate service logs based on the network verifier output.
func (c *Investigation) Run(rb investigation.ResourceBuilder) (investigation.InvestigationResult, error) {
result := investigation.InvestigationResult{}
r, err := rb.WithClusterDeployment().Build()
r, err := rb.WithClusterDeployment().WithAwsClient().Build()
if err != nil {
return result, err
}
Expand Down
104 changes: 34 additions & 70 deletions pkg/investigations/investigation/investigation.go
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
package investigation

import (
"errors"
"fmt"

cmv1 "github.com/openshift-online/ocm-sdk-go/clustersmgmt/v1"
Expand All @@ -25,8 +24,30 @@ type InvestigationResult struct {
ServiceLogSent InvestigationStep

// If multiple investigations might be run this can indicate a fatal error that makes running additional investigations useless.
// Uses a false-default so it must be set expclicitly by an investigation.
StopInvestigations bool
// If nil, investigations should continue. If not nil, should contain a meaningful error message explaining why investigations must stop.
StopInvestigations error
}

func NewResourceBuilder(pdClient pagerduty.Client, clusterId string, name string, logLevel string, pipelineName string) (ResourceBuilder, error) {
ocmClient, err := ocm.New()
if err != nil {
return nil, fmt.Errorf("could not create ocm client: %w", err)
}

rb := &ResourceBuilderT{
buildLogger: true,
clusterId: clusterId,
name: name,
logLevel: logLevel,
pipelineName: pipelineName,
ocmClient: ocmClient,
builtResources: &Resources{
PdClient: pdClient,
OcmClient: ocmClient,
},
}

return rb, nil
}

type Investigation interface {
Expand All @@ -51,14 +72,10 @@ type Resources struct {
}

type ResourceBuilder interface {
WithCluster(clusterId string) ResourceBuilder
WithCluster() ResourceBuilder
WithClusterDeployment() ResourceBuilder
WithAwsClient() ResourceBuilder
WithOcmClient(*ocm.SdkClient) ResourceBuilder
WithPagerDutyClient(client *pagerduty.SdkClient) ResourceBuilder
WithNotes() ResourceBuilder
WithName(name string) ResourceBuilder
WithLogger(logLevel string, pipelineName string) ResourceBuilder
Build() (*Resources, error)
}

Expand All @@ -69,87 +86,50 @@ type ResourceBuilderT struct {
buildNotes bool
buildLogger bool

// These clients are required for all investigations and all clusters, so they are pre-filled.
pdClient *pagerduty.SdkClient
ocmClient *ocm.SdkClient

clusterId string
name string
logLevel string
pipelineName string

ocmClient *ocm.SdkClient

// cache
builtResources *Resources
buildErr error
}

func (r *ResourceBuilderT) WithCluster(clusterId string) ResourceBuilder {
func (r *ResourceBuilderT) WithCluster() ResourceBuilder {
r.buildCluster = true
r.clusterId = clusterId
return r
}

func (r *ResourceBuilderT) WithClusterDeployment() ResourceBuilder {
r.WithCluster()
r.buildClusterDeployment = true
return r
}

func (r *ResourceBuilderT) WithAwsClient() ResourceBuilder {
r.WithCluster()
r.buildAwsClient = true
return r
}

func (r *ResourceBuilderT) WithOcmClient(client *ocm.SdkClient) ResourceBuilder {
r.ocmClient = client
return r
}

func (r *ResourceBuilderT) WithPagerDutyClient(client *pagerduty.SdkClient) ResourceBuilder {
r.pdClient = client
return r
}

func (r *ResourceBuilderT) WithNotes() ResourceBuilder {
r.buildNotes = true
return r
}

func (r *ResourceBuilderT) WithName(name string) ResourceBuilder {
r.name = name
return r
}

func (r *ResourceBuilderT) WithLogger(logLevel string, pipelineName string) ResourceBuilder {
r.buildLogger = true
r.logLevel = logLevel
r.pipelineName = pipelineName
return r
}

func (r *ResourceBuilderT) Build() (*Resources, error) {
if r.buildErr != nil {
return nil, r.buildErr
}

if r.builtResources == nil {
r.builtResources = &Resources{
Name: r.name,
OcmClient: r.ocmClient,
PdClient: r.pdClient,
}
}
// The Name is now set during construction.
r.builtResources.Name = r.name

var err error

if r.buildClusterDeployment && !r.buildCluster {
r.buildErr = errors.New("cannot build ClusterDeployment without Cluster")
return nil, r.buildErr
}
if r.buildAwsClient && !r.buildCluster {
r.buildErr = errors.New("cannot build AwsClient without Cluster")
return nil, r.buildErr
}

if r.buildCluster && r.builtResources.Cluster == nil {
r.builtResources.Cluster, err = r.ocmClient.GetClusterInfo(r.clusterId)
if err != nil {
Expand Down Expand Up @@ -183,7 +163,7 @@ func (r *ResourceBuilderT) Build() (*Resources, error) {

if r.buildLogger {
// Re-initialize the logger with the cluster ID.
logging.RawLogger = logging.InitLogger(r.logLevel, "", internalClusterId)
logging.RawLogger = logging.InitLogger(r.logLevel, r.pipelineName, internalClusterId)
}
}

Expand All @@ -200,7 +180,7 @@ type ResourceBuilderMock struct {
BuildError error
}

func (r *ResourceBuilderMock) WithCluster(clusterId string) ResourceBuilder {
func (r *ResourceBuilderMock) WithCluster() ResourceBuilder {
return r
}

Expand All @@ -212,26 +192,10 @@ func (r *ResourceBuilderMock) WithAwsClient() ResourceBuilder {
return r
}

func (r *ResourceBuilderMock) WithOcmClient(client *ocm.SdkClient) ResourceBuilder {
return r
}

func (r *ResourceBuilderMock) WithPagerDutyClient(client *pagerduty.SdkClient) ResourceBuilder {
return r
}

func (r *ResourceBuilderMock) WithNotes() ResourceBuilder {
return r
}

func (r *ResourceBuilderMock) WithName(name string) ResourceBuilder {
return r
}

func (r *ResourceBuilderMock) WithLogger(logLevel string, pipelineName string) ResourceBuilder {
return r
}

func (r *ResourceBuilderMock) Build() (*Resources, error) {
if r.BuildError != nil {
return nil, r.BuildError
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ func (i *Investigation) teardown() error {
func (i *Investigation) Run(rb investigation.ResourceBuilder) (investigation.InvestigationResult, error) {
ctx := context.Background()
result := investigation.InvestigationResult{}
r, err := rb.Build()
r, err := rb.WithCluster().Build()
if err != nil {
return result, err
}
Expand Down
10 changes: 6 additions & 4 deletions pkg/investigations/precheck/precheck.go
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
package precheck

import (
"errors"

cmv1 "github.com/openshift-online/ocm-sdk-go/clustersmgmt/v1"
investigation "github.com/openshift/configuration-anomaly-detection/pkg/investigations/investigation"
"github.com/openshift/configuration-anomaly-detection/pkg/logging"
Expand All @@ -14,7 +16,7 @@ type ClusterStatePrecheck struct{}
// Performs according pagerduty actions and returns whether CAD needs to investigate the cluster
func (c *ClusterStatePrecheck) Run(rb investigation.ResourceBuilder) (investigation.InvestigationResult, error) {
result := investigation.InvestigationResult{}
r, err := rb.Build()
r, err := rb.WithCluster().Build()
if err != nil {
return result, err
}
Expand All @@ -23,13 +25,13 @@ func (c *ClusterStatePrecheck) Run(rb investigation.ResourceBuilder) (investigat
ocmClient := r.OcmClient
if cluster.State() == cmv1.ClusterStateUninstalling {
logging.Info("Cluster is uninstalling and requires no investigation. Silencing alert.")
result.StopInvestigations = true
result.StopInvestigations = errors.New("cluster is already uninstalling")
return result, pdClient.SilenceIncidentWithNote("CAD: Cluster is already uninstalling, silencing alert.")
}

if cluster.AWS() == nil {
logging.Info("Cloud provider unsupported, forwarding to primary.")
result.StopInvestigations = true
result.StopInvestigations = errors.New("unsupported cloud provider (non-AWS)")
return result, pdClient.EscalateIncidentWithNote("CAD could not run an automated investigation on this cluster: unsupported cloud provider.")
}

Expand All @@ -39,7 +41,7 @@ func (c *ClusterStatePrecheck) Run(rb investigation.ResourceBuilder) (investigat
}
if isAccessProtected {
logging.Info("Cluster is access protected. Escalating alert.")
result.StopInvestigations = true
result.StopInvestigations = errors.New("cluster is access protected")
return result, pdClient.EscalateIncidentWithNote("CAD is unable to run against access protected clusters. Please investigate.")
}
return result, nil
Expand Down
3 changes: 2 additions & 1 deletion pkg/investigations/precheck/precheck_test.go
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
package precheck

import (
"errors"
"reflect"
"testing"

Expand Down Expand Up @@ -35,7 +36,7 @@ func TestInvestigation_Run(t *testing.T) {
PdClient: pdClient,
},
}},
investigation.InvestigationResult{StopInvestigations: true},
investigation.InvestigationResult{StopInvestigations: errors.New("unsupported cloud provider (non-AWS)")},
false,
},
}
Expand Down