20 changes: 17 additions & 3 deletions .cursorrules
@@ -1,9 +1,12 @@
# SBD Operator Cursor Rules

You are a senior Go developer working on a Kubernetes operator for STONITH Block Device (SBD) remediation.

PROJECT CONTEXT:
This is the sbd-operator, a Kubernetes operator that manages SBD (STONITH Block Device) configurations and remediations for high-availability clustering. The operator handles automatic node remediation when nodes become unresponsive.

TECH STACK:

- Language: Go 1.21+
- Framework: Kubebuilder/Controller-Runtime
- Kubernetes: Custom Resources (CRDs), Controllers, RBAC
@@ -13,6 +16,7 @@ TECH STACK:
- Build: Makefile with targets for development, testing, and deployment

ARCHITECTURE:

- Kubernetes Operator pattern with custom controllers
- Custom Resource Definitions: SBDConfig and SBDRemediation
- Controller reconciliation loops with exponential backoff
@@ -21,6 +25,7 @@ ARCHITECTURE:
- Structured logging with leveled output

GO CODING STANDARDS:

- Follow Go best practices and idioms
- Use gofmt, golint, and go vet for code quality
- Prefer composition over inheritance
@@ -33,8 +38,8 @@ GO CODING STANDARDS:
- Scripts should be idempotent and include the ability to delete anything they create
- When proposing code changes, ensure they are concise and reuse existing code where possible


KUBERNETES OPERATOR PATTERNS:

- Implement proper controller reconciliation logic
- Use controller-runtime's predicate filtering
- Handle resource ownership with OwnerReferences
@@ -45,6 +50,7 @@ KUBERNETES OPERATOR PATTERNS:
- Use structured logging with controller-runtime's logger

TESTING REQUIREMENTS:

- Write unit tests for all business logic
- Use table-driven tests where appropriate
- Mock external dependencies (Kubernetes API, etc.)
@@ -57,6 +63,7 @@ TESTING REQUIREMENTS:
- Always build, push, and run tests from the makefile

ERROR HANDLING:

- Always return errors, don't panic
- Wrap errors with context using fmt.Errorf with %w verb
- Use controller-runtime's Result pattern for reconciliation
@@ -66,6 +73,7 @@ ERROR HANDLING:
- Check for any required AWS permissions at the beginning of scripts or tests

LOGGING:

- Use structured logging with controller-runtime's logger
- Include relevant context (namespace, name, etc.)
- Use appropriate log levels (Debug, Info, Error)
@@ -74,6 +82,7 @@ LOGGING:
- Use consistent field names across log entries

SECURITY:

- Implement proper RBAC permissions
- Validate all user inputs
- Use SecurityContext in pod specifications
@@ -82,6 +91,7 @@ SECURITY:
- Implement proper authentication and authorization

PERFORMANCE:

- Use resource limits and requests in deployments
- Implement efficient reconciliation loops
- Use informers and caches properly
@@ -90,6 +100,7 @@ PERFORMANCE:
- Monitor resource usage with metrics

DOCUMENTATION:

- Include comprehensive README files
- Document all public APIs with Go doc comments
- Include examples in config/samples/
@@ -98,6 +109,7 @@ DOCUMENTATION:
- Document operational procedures

PROJECT-SPECIFIC GUIDELINES:

- SBD operations require careful timeout handling
- Node remediation is a critical operation - implement safeguards
- Block device operations need proper error handling
@@ -107,9 +119,10 @@ PROJECT-SPECIFIC GUIDELINES:
- Consider split-brain scenarios in remediation logic
- Use concise summaries for commit messages
- Run All aws commands with AWS_PAGER=""
- ONLY use ACSII characters for all commit messages and shell commands
- ONLY use ASCII characters for all commit messages and shell commands

DEPENDENCIES:

- Use controller-runtime for Kubernetes operations
- Use logr for structured logging
- Use Prometheus client for metrics
@@ -119,13 +132,14 @@ DEPENDENCIES:
- Always use UBI base images

When suggesting code changes:

1. Ensure Kubernetes best practices are followed
2. Consider the impact on cluster stability
3. Implement proper error handling and recovery
4. Include appropriate tests
5. Follow Go conventions and idioms
6. Consider the operational aspects of the change
7. Ensure backwards compatibility when possible
8. Document any breaking changes
8. Document any breaking changes
9. Commit all changes with a concise summary
10. Prefer modifying existing make targets instead of creating new ones
18 changes: 18 additions & 0 deletions .gitignore
@@ -37,6 +37,9 @@ go.work.sum
# Downloaded tools directory
.tools/

# npm cache
.npm-cache/

# Cluster provisioning artifacts
cluster/

@@ -48,3 +51,18 @@ cluster/
dist/
config/manager/kustomization.yaml
deploy/sbd-agent-daemonset-*.yaml

# Documentation
docs/prompt.md

# Debug files
sbd-device-debug.txt
sbd-device.txt
sbd-node-mapping.txt
node-mapping-debug.txt
<<<<<<< HEAD

# cSpell configuration (personal word list)
cspell.json
=======
>>>>>>> 796954c (Fix markdown linting issues and update documentation)
2 changes: 1 addition & 1 deletion Makefile
@@ -1,4 +1,4 @@
# IMAGE_REGISTRY used to indicate the registery/group for the operator, bundle and catalog
# IMAGE_REGISTRY used to indicate the registry/group for the operator, bundle and catalog
IMAGE_REGISTRY ?= quay.io/medik8s
export IMAGE_REGISTRY

File renamed without changes.
22 changes: 15 additions & 7 deletions README.md
@@ -24,41 +24,49 @@ The operator consists of two main components:
## Custom Resources

### SBDConfig

Defines the SBD configuration for the cluster:

- Shared block device PVC name
- Timeout settings
- Watchdog device path
- Node exclusion lists
- Reboot methods

### SBDRemediation

Triggers node remediation operations:

- Target node specification
- Remediation status tracking
- Integration with Medik8s Node Healthcheck Operator

## Quick Start

### Prerequisites

- Kubernetes cluster with CSI driver supporting `volumeMode: Block`
- Shared block storage with concurrent multi-node access (e.g., Ceph RBD, cloud provider shared volumes)
- Cluster nodes with kernel watchdog support

### Installation

1. Install the operator:
```bash
make deploy
```

```bash
make deploy
```

2. Create an SBDConfig:
```bash
kubectl apply -f config/samples/medik8s_v1alpha1_sbdconfig.yaml
```

```bash
kubectl apply -f config/samples/medik8s_v1alpha1_sbdconfig.yaml
```

### Development

Build and test locally:

```bash
# Build the operator
make build
@@ -87,7 +95,7 @@ Comprehensive documentation is available in the `docs/` directory:
The project includes comprehensive testing:

- **Unit Tests**: `make test`
- **E2E Tests**: `make test-e2e`
- **E2E Tests**: `make test-e2e`
- **Smoke Tests**: `make test-smoke`

E2E tests deploy a complete operator environment and verify functionality end-to-end.
12 changes: 11 additions & 1 deletion config/openshift/README.md
@@ -14,6 +14,7 @@ The SBD Agent requires privileged access to hardware watchdog devices and block
## Required Permissions

The SCC grants the following permissions:

- `allowPrivilegedContainer: true` - Required for hardware watchdog access
- `allowHostDirVolumePlugin: true` - Required to mount host directories like `/dev`
- `allowHostNetwork: true` - Required for network access
@@ -23,31 +24,36 @@ The SCC grants the following permissions:
## Installation

### Option 1: Using the OpenShift Installer

```bash
make build-openshift-installer
kubectl apply -f dist/install-openshift.yaml
```

### Option 2: Manual Installation

```bash
kubectl apply -f config/openshift/
```

### Option 3: Using Kustomize

```bash
kubectl apply -k config/openshift-default/
```

## Service Account Binding

The SCC is automatically bound to the `sbd-agent` service account in the `sbd-system` namespace through:

1. A ClusterRole (`sbd-agent-scc-user`) that grants permission to use the SCC
2. A ClusterRoleBinding that binds the service account to the ClusterRole
3. Direct user reference in the SCC (`system:serviceaccount:sbd-system:sbd-agent`)
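The direct user reference in item 3 follows the standard Kubernetes service account user name format, `system:serviceaccount:<namespace>:<name>`. A minimal sketch of building that string is shown below. The helper name `serviceAccountUser` is an assumption made for illustration and does not come from this repository.

```go
package main

import "fmt"

// serviceAccountUser builds the user name under which a service account
// authenticates: system:serviceaccount:<namespace>:<name>.
// The function name is hypothetical; only the format is standard.
func serviceAccountUser(namespace, name string) string {
	return fmt.Sprintf("system:serviceaccount:%s:%s", namespace, name)
}

func main() {
	// Prints the identity listed in the SCC's users field for the agent.
	fmt.Println(serviceAccountUser("sbd-system", "sbd-agent"))
}
```

The same string is what `--as=` expects when impersonating the service account in `kubectl auth can-i` checks.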

## Security Considerations

The SBD Agent requires these elevated privileges because it needs to:

- Access hardware watchdog devices (`/dev/watchdog*`)
- Read/write SBD (STONITH Block Device) devices
- Monitor system health and perform emergency reboots
@@ -60,21 +66,25 @@ These permissions are necessary for the SBD Agent to function as a cluster fenci
If SBD Agent pods fail to start with permission errors:

1. Verify the SCC is created:

```bash
oc get scc sbd-agent-privileged
```

2. Check if the service account can use the SCC:

```bash
oc adm policy who-can use scc sbd-agent-privileged
```

3. Verify the service account has the SCC assigned:

```bash
oc describe scc sbd-agent-privileged
```

4. Check pod security context:

```bash
kubectl describe pod <sbd-agent-pod-name> -n sbd-system
```
```
12 changes: 9 additions & 3 deletions config/rbac/RBAC_QUICK_REFERENCE.md
@@ -3,11 +3,13 @@
## Files Overview

### SBD Agent RBAC (Minimal Permissions)

- `sbd_agent_service_account.yaml` - ServiceAccount for SBD Agent pods
- `sbd_agent_role.yaml` - ClusterRole with read-only permissions
- `sbd_agent_role_binding.yaml` - Binds ServiceAccount to ClusterRole

### SBD Operator RBAC (Orchestration Permissions)

- `sbd_operator_service_account.yaml` - ServiceAccount for SBD Operator
- `sbd_operator_role.yaml` - ClusterRole with management permissions
- `sbd_operator_role_binding.yaml` - Binds ServiceAccount to ClusterRole
@@ -32,15 +34,17 @@ kubectl apply -f config/rbac/sbd_operator_role_binding.yaml
## Permission Summary

### SBD Agent Permissions (Read-Only)

| Resource | Permissions | Purpose |
|----------|-------------|---------|
| -------- | ----------- | ------- |
| `pods` | `get`, `list` | Read own pod metadata |
| `nodes` | `get`, `list`, `watch` | Node name to ID mapping |
| `events` | `create`, `patch` | Observability events |

### SBD Operator Permissions (Management)

| Resource | Permissions | Purpose |
|----------|-------------|---------|
| -------- | ----------- | ------- |
| `namespaces` | `create`, `get`, `list`, `patch`, `update`, `watch` | Namespace management |
| `daemonsets` | `create`, `delete`, `get`, `list`, `patch`, `update`, `watch` | Agent deployment |
| `nodes` | `get`, `list`, `watch` | Node information (read-only) |
@@ -67,11 +71,13 @@ kubectl auth can-i delete nodes --as=system:serviceaccount:sbd-system:sbd-operat
## Troubleshooting

### Common Issues

1. **Permission Denied**: Verify ClusterRole and ClusterRoleBinding are applied
2. **Wrong Namespace**: Ensure ServiceAccounts are in correct namespace (`sbd-system`)
3. **Missing Resources**: Check if CRDs are installed before applying RBAC

### Debug Commands

```bash
# List all SBD-related RBAC
kubectl get clusterroles | grep sbd
@@ -93,4 +99,4 @@ kubectl describe clusterrolebinding sbd-operator-manager-rolebinding
- ✅ **No Direct Node Fencing**: Neither component can delete/modify nodes via Kubernetes API
- ✅ **Read-Only Node Access**: Both components only read node information
- ✅ **Isolated Scope**: Permissions limited to SBD system resources
- ✅ **Hardware-Based Fencing**: Actual fencing occurs via SBD block device, not API calls
- ✅ **Hardware-Based Fencing**: Actual fencing occurs via SBD block device, not API calls
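
The "Read-Only Node Access" property can be read directly off the SBD Agent permission table earlier in this file. The sketch below encodes that table as data and checks a verb against it. It is a hedged illustration only: `agentPermissions` and `allowed` are hypothetical names, and it is no substitute for verifying the live cluster with `kubectl auth can-i`.

```go
package main

import "fmt"

// agentPermissions mirrors the "SBD Agent Permissions (Read-Only)" table:
// resource name mapped to the verbs the agent's ClusterRole is documented to grant.
var agentPermissions = map[string][]string{
	"pods":   {"get", "list"},
	"nodes":  {"get", "list", "watch"},
	"events": {"create", "patch"},
}

// allowed reports whether verb is listed for resource in the given table.
// Unknown resources have no verbs, so they are always denied.
func allowed(table map[string][]string, resource, verb string) bool {
	for _, v := range table[resource] {
		if v == verb {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(allowed(agentPermissions, "nodes", "get"))    // reading nodes is documented
	fmt.Println(allowed(agentPermissions, "nodes", "delete")) // deleting nodes is not
}
```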