Commit cb536b6

jmccormick2001 committed
final failover logic added plus docs
1 parent d224c90 commit cb536b6

6 files changed: +166 -54 lines changed

docs/operator-docs.asciidoc

Lines changed: 41 additions & 38 deletions
@@ -492,44 +492,6 @@ Or if you have DNS configured on your client host:
492492
export CO_APISERVER_URL=https://postgres-operator.demo.svc.cluster.local:8443
493493
....
494494

495-
496-
== Performing a Smoke Test
497-
498-
A simple *smoke test* of the postgres operator includes testing
499-
the following:
500-
501-
* get version information (*pgo version*)
502-
* create a cluster (*pgo create cluster testcluster*)
503-
* scale a cluster (*pgo scale testcluster *)
504-
* show a cluster (*pgo show cluster testcluster*)
505-
* show all clusters (*pgo show cluster all*)
506-
* backup a cluster (*pgo backup testcluster*)
507-
* show backup of cluster (*pgo show backup testcluster*)
508-
* show backup pvc of cluster (*pgo show pvc testcluster-backup-pvc*)
509-
* restore a cluster (*pgo create cluster restoredcluster --backup-pvc=testcluster-backup-pvc --backup-path=testcluster-backups/2017-01-01-01-01-01 --secret-from=testcluster*)
510-
* test a cluster (*pgo test restoredcluster*)
511-
* minor upgrade a cluster (*pgo upgrade testcluster*)
512-
* major upgrade a cluster (*pgo upgrade testcluster --upgrade-type=major*)
513-
* delete a cluster (*pgo delete cluster testcluster --delete-data --delete-backups*)
514-
* create a policy from local file (*pgo create policy policy1 --in-file=./examples/policy/policy1.sql*)
515-
* create a policy from git repo (*pgo create policy gitpolicy --url=https://github.com/CrunchyData/postgres-operator/blob/master/examples/policy/gitpolicy.sql*)
516-
* repeat testing using emptydir storage type
517-
* repeat testing using create storage type
518-
* repeat testing using existing storage type
519-
* create a series of clusters (*pgo create cluster myseries --series=2*)
520-
* apply labels at cluster creation (*pgo create cluster xraydb --series=2 --labels=project=xray*)
521-
* apply a label to an existing set of clusters (*pgo label --label=env=research --selector=project=xray*)
522-
* create a user for a given cluster (*pgo create user user0 --valid-days=30 --managed --db=userdb --selector=name=xraydb0*)
523-
* load a csv file into a cluster (*pgo load --load-config=./sample-load-config.json --selector=project=xray*)
524-
* extend a user's password allowed age (*pgo user --change-password=user0 --valid-days=10 --selector=name=xraydb1*)
525-
* drop user access (*pgo user --delete-user=user2 --selector=project=xray*)
526-
* check password age (*pgo user --expired=10 --selector=project=xray*)
527-
* backup an entire project (*pgo backup --selector=project=xray*)
528-
* delete an entire project (*pgo delete cluster --selector=project=xray*)
529-
* create a cluster with a crunchy-collect sidecar(*pgo create cluster testcluster --metrics*)
530-
531-
More detailed explanations of the commands can be found below <<pgo Commands>>.
532-
533495
== Makefile Targets
534496

535497
The following table describes the Makefile targets:
@@ -1374,6 +1336,47 @@ The load configuration file has the following YAML attributes:
13741336
|SecurityContext| either fsGroup or SupplementalGroup values
13751337
|======================
13761338

1339+
=== pgo failover
1340+
1341+
Starting with Release 2.6, there is a manual failover command which
1342+
can be used to promote a replica to a primary role in a PostgreSQL
1343+
cluster.
1344+
1345+
This process includes the following actions:
1346+
* pick a target replica to become the new primary
1347+
* delete the current primary deployment to avoid user requests from
1348+
going to multiple primary databases (split brain)
1349+
* promote the targeted replica using *pg_ctl promote*, which
1350+
causes PostgreSQL to go into read-write mode
1351+
* re-label the targeted replica with the primary labels so that
1352+
it matches the primary service selector and new requests
1353+
to the primary are routed to the new primary (targeted replica)
1354+
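The ordering of the actions above can be sketched in Go. This is only an illustration under stated assumptions, not the operator's implementation: the `failoverOps` interface, the `recorder` type, and all method names are hypothetical, and the sketch stops at the first error, which is a design choice made here rather than something the operator is documented to do.

```go
package main

import (
	"errors"
	"fmt"
)

// failoverOps is a hypothetical interface naming the actions listed above.
type failoverOps interface {
	DeletePrimary(cluster string) error
	Promote(target string) error
	Relabel(cluster, target string) error
}

// runFailover performs the steps in the documented order:
// delete the old primary first (to avoid split brain), then promote
// the chosen replica, then re-label it so the primary service selects it.
func runFailover(ops failoverOps, cluster, target string) error {
	if target == "" {
		return errors.New("a failover target is required")
	}
	if err := ops.DeletePrimary(cluster); err != nil {
		return fmt.Errorf("deleting primary of %s: %w", cluster, err)
	}
	if err := ops.Promote(target); err != nil {
		return fmt.Errorf("promoting %s: %w", target, err)
	}
	if err := ops.Relabel(cluster, target); err != nil {
		return fmt.Errorf("re-labeling %s: %w", target, err)
	}
	return nil
}

// recorder is an in-memory stand-in that only records the order of calls.
type recorder struct{ calls []string }

func (r *recorder) DeletePrimary(cluster string) error {
	r.calls = append(r.calls, "delete-primary "+cluster)
	return nil
}

func (r *recorder) Promote(target string) error {
	r.calls = append(r.calls, "promote "+target)
	return nil
}

func (r *recorder) Relabel(cluster, target string) error {
	r.calls = append(r.calls, "relabel "+target+" as "+cluster)
	return nil
}

func main() {
	r := &recorder{}
	if err := runFailover(r, "mycluster", "mycluster-abxq"); err != nil {
		fmt.Println("failover failed:", err)
		return
	}
	for _, c := range r.calls {
		fmt.Println(c)
	}
}
```

Deleting the primary before promoting is what keeps two writable databases from being reachable at the same time; the recorder makes that ordering easy to check without a cluster.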
1355+
The command works like this:
1356+
....
1357+
pgo failover mycluster --query
1358+
....
1359+
1360+
That command will show you a list of replica targets you can choose
1361+
to fail over to. Select one of those for the following
1362+
command:
1363+
....
1364+
pgo failover mycluster --target=mycluster-abxq
1365+
....
1366+
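The two-step usage (query first, then fail over to a chosen target) implies a simple rule for the flags: `--query` needs no target, while an actual failover does. A minimal Go sketch of that rule follows; the `failoverMode` type and `resolveMode` function are hypothetical names used only for illustration, and the confirmation prompt controlled by `--no-prompt` is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// failoverMode describes what a pgo failover invocation asks for.
type failoverMode int

const (
	modeQuery    failoverMode = iota // list candidate replicas
	modeFailover                     // fail over to the given target
)

// resolveMode applies the flag rule: --query needs no target,
// otherwise --target must be set.
func resolveMode(query bool, target string) (failoverMode, error) {
	if query {
		return modeQuery, nil
	}
	if target == "" {
		return modeQuery, errors.New("--target is required for failover")
	}
	return modeFailover, nil
}

func main() {
	// Equivalent of: pgo failover mycluster --query
	if mode, err := resolveMode(true, ""); err == nil && mode == modeQuery {
		fmt.Println("query: list replica targets")
	}

	// Equivalent of: pgo failover mycluster (no --target)
	if _, err := resolveMode(false, ""); err != nil {
		fmt.Println("error:", err)
	}

	// Equivalent of: pgo failover mycluster --target=mycluster-abxq
	if mode, err := resolveMode(false, "mycluster-abxq"); err == nil && mode == modeFailover {
		fmt.Println("failover: promote mycluster-abxq")
	}
}
```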
1367+
There is a CRD called *pgtask* that holds the failover request
1368+
and the status of that request. You can check the status
1369+
by inspecting the pgtask:
1370+
....
1371+
kubectl get pgtasks mycluster-failover -o yaml
1372+
....
1373+
1374+
Once completed, you will see that a new replica has been started to replace
1375+
the promoted replica. This happens automatically because of the re-label:
1376+
the Deployment recreates its pod as a result. The failover typically
1377+
takes only a few seconds; however, the creation of the replacement
1378+
replica can take longer depending on how much data is being replicated.
1379+
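Why the re-label both routes traffic to the new primary and leads to a replacement replica can be illustrated with a small label-matching sketch in Go. It assumes Kubernetes equality-based selector semantics (every key/value pair in the selector must be present in the object's labels). The selectors and label values shown (`name=mycluster` for the primary service, `name=mycluster-abxq` for the replica Deployment) are assumptions chosen for illustration and may differ from what the operator actually sets.

```go
package main

import "fmt"

// matches reports whether labels satisfy an equality-based selector:
// every key/value pair in the selector must appear in labels.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		got, ok := labels[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Assumed selectors, for illustration only.
	primaryService := map[string]string{"name": "mycluster"}
	replicaDeployment := map[string]string{"name": "mycluster-abxq"}

	// Labels of the replica pod before the failover.
	pod := map[string]string{
		"name":       "mycluster-abxq",
		"pg-cluster": "mycluster",
		"replica":    "true",
	}
	fmt.Println(matches(primaryService, pod), matches(replicaDeployment, pod)) // false true

	// Re-label the pod with the primary labels.
	pod["name"] = "mycluster"
	pod["replica"] = "false"
	fmt.Println(matches(primaryService, pod), matches(replicaDeployment, pod)) // true false
}
```

Under these assumptions, the re-labeled pod is selected by the primary service, and because it no longer matches the replica Deployment's selector, that Deployment starts a new pod, which is the replacement replica described above.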
13771380
== bash Completion
13781381

13791382
There is a bash completion file that is included for users to try, this

docs/operator-testing.asciidoc

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
= PostgreSQL Operator Testing
2+
:toc:
3+
v2.6, {docdate}
4+
5+
== Performing a Smoke Test
6+
7+
A simple *smoke test* of the postgres operator includes testing
8+
the following:
9+
10+
* get version information (*pgo version*)
11+
* create a cluster (*pgo create cluster testcluster*)
12+
* scale a cluster (*pgo scale testcluster*)
13+
* show a cluster (*pgo show cluster testcluster*)
14+
* show all clusters (*pgo show cluster all*)
15+
* backup a cluster (*pgo backup testcluster*)
16+
* show backup of cluster (*pgo show backup testcluster*)
17+
* show backup pvc of cluster (*pgo show pvc testcluster-backup-pvc*)
18+
* restore a cluster (*pgo create cluster restoredcluster --backup-pvc=testcluster-backup-pvc --backup-path=testcluster-backups/2017-01-01-01-01-01 --secret-from=testcluster*)
19+
* test a cluster (*pgo test restoredcluster*)
20+
* minor upgrade a cluster (*pgo upgrade testcluster*)
21+
* major upgrade a cluster (*pgo upgrade testcluster --upgrade-type=major*)
22+
* delete a cluster (*pgo delete cluster testcluster --delete-data --delete-backups*)
23+
* create a policy from local file (*pgo create policy policy1 --in-file=./examples/policy/policy1.sql*)
24+
* create a policy from git repo (*pgo create policy gitpolicy --url=https://github.com/CrunchyData/postgres-operator/blob/master/examples/policy/gitpolicy.sql*)
25+
* repeat testing using emptydir storage type
26+
* repeat testing using create storage type
27+
* repeat testing using existing storage type
28+
* create a series of clusters (*pgo create cluster myseries --series=2*)
29+
* apply labels at cluster creation (*pgo create cluster xraydb --series=2 --labels=project=xray*)
30+
* apply a label to an existing set of clusters (*pgo label --label=env=research --selector=project=xray*)
31+
* create a user for a given cluster (*pgo create user user0 --valid-days=30 --managed --db=userdb --selector=name=xraydb0*)
32+
* load a csv file into a cluster (*pgo load --load-config=./sample-load-config.json --selector=project=xray*)
33+
* extend a user's password allowed age (*pgo user --change-password=user0 --valid-days=10 --selector=name=xraydb1*)
34+
* drop user access (*pgo user --delete-user=user2 --selector=project=xray*)
35+
* check password age (*pgo user --expired=10 --selector=project=xray*)
36+
* backup an entire project (*pgo backup --selector=project=xray*)
37+
* delete an entire project (*pgo delete cluster --selector=project=xray*)
38+
* create a cluster with a crunchy-collect sidecar (*pgo create cluster testcluster --metrics*)
39+
* perform a failover (*pgo failover mycluster --target=mycluster-abxq*)
40+

examples/pgo-bash-completion

Lines changed: 31 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -304,8 +304,8 @@ _pgo_create_cluster()
304304
flags+=("--metrics")
305305
flags+=("-m")
306306
local_nonpersistent_flags+=("--metrics")
307-
flags+=("--node-name=")
308-
local_nonpersistent_flags+=("--node-name=")
307+
flags+=("--node-label=")
308+
local_nonpersistent_flags+=("--node-label=")
309309
flags+=("--password=")
310310
two_word_flags+=("-w")
311311
local_nonpersistent_flags+=("--password=")
@@ -595,6 +595,32 @@ _pgo_delete()
595595
noun_aliases=()
596596
}
597597

598+
_pgo_failover()
599+
{
600+
last_command="pgo_failover"
601+
commands=()
602+
603+
flags=()
604+
two_word_flags=()
605+
local_nonpersistent_flags=()
606+
flags_with_completion=()
607+
flags_completion=()
608+
609+
flags+=("--no-prompt")
610+
flags+=("-n")
611+
local_nonpersistent_flags+=("--no-prompt")
612+
flags+=("--query")
613+
local_nonpersistent_flags+=("--query")
614+
flags+=("--target=")
615+
local_nonpersistent_flags+=("--target=")
616+
flags+=("--apiserver-url=")
617+
flags+=("--debug")
618+
619+
must_have_one_flag=()
620+
must_have_one_noun=()
621+
noun_aliases=()
622+
}
623+
598624
_pgo_label()
599625
{
600626
last_command="pgo_label"
@@ -665,6 +691,8 @@ _pgo_scale()
665691
flags_with_completion=()
666692
flags_completion=()
667693

694+
flags+=("--node-label=")
695+
local_nonpersistent_flags+=("--node-label=")
668696
flags+=("--replica-count=")
669697
two_word_flags+=("-r")
670698
local_nonpersistent_flags+=("--replica-count=")
@@ -954,6 +982,7 @@ _pgo()
954982
commands+=("backup")
955983
commands+=("create")
956984
commands+=("delete")
985+
commands+=("failover")
957986
commands+=("label")
958987
commands+=("load")
959988
commands+=("scale")

operator/cluster/failover_strategy_1.go

Lines changed: 14 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -50,12 +50,12 @@ func (r Strategy1) Failover(clientset *kubernetes.Clientset, client *rest.RESTCl
5050

5151
log.Info("strategy 1 Failover called on " + clusterName + " target is " + target)
5252

53-
if target == "" {
54-
log.Debug("failover target not set, will use best estimate")
55-
pod, err = util.GetBestTarget(clientset, clusterName, namespace)
56-
} else {
57-
pod, err = util.GetPod(clientset, target, namespace)
58-
}
53+
//if target == "" {
54+
// log.Debug("failover target not set, will use best estimate")
55+
// pod, target, err = util.GetBestTarget(clientset, clusterName, namespace)
56+
//} else {
57+
pod, err = util.GetPod(clientset, target, namespace)
58+
//}
5959
if err != nil {
6060
log.Error(err)
6161
return err
@@ -71,20 +71,20 @@ func (r Strategy1) Failover(clientset *kubernetes.Clientset, client *rest.RESTCl
7171
updateFailoverStatus(client, task, namespace, clusterName, "deleting primary deployment "+clusterName)
7272

7373
//trigger the failover on the replica
74-
err = promote(pod, clientset, client, namespace, target, restconfig)
74+
err = promote(pod, clientset, client, namespace, restconfig)
7575
//if err != nil {
7676
//log.Error(err)
7777
//return err
7878
//}
79-
updateFailoverStatus(client, task, namespace, clusterName, "promoting replica"+target)
79+
updateFailoverStatus(client, task, namespace, clusterName, "promoting pod "+pod.Name+" target "+target)
8080

8181
//relabel the deployment with primary labels
8282
err = relabel(pod, clientset, namespace, clusterName, target)
8383
//if err != nil {
8484
//log.Error(err)
8585
////return err
8686
//}
87-
updateFailoverStatus(client, task, namespace, clusterName, "re-labeling replica")
87+
updateFailoverStatus(client, task, namespace, clusterName, "re-labeling deployment...pod "+pod.Name+" was the failover target...failover completed")
8888

8989
return err
9090

@@ -144,7 +144,10 @@ func deletePrimary(clientset *kubernetes.Clientset, namespace, clusterName strin
144144
return err
145145
}
146146

147-
func promote(pod *v1.Pod, clientset *kubernetes.Clientset, client *rest.RESTClient, namespace, target string, restconfig *rest.Config) error {
147+
func promote(
148+
pod *v1.Pod,
149+
clientset *kubernetes.Clientset,
150+
client *rest.RESTClient, namespace string, restconfig *rest.Config) error {
148151
var err error
149152

150153
//get the target pod that matches the replica-name=target
@@ -175,14 +178,14 @@ func relabel(pod *v1.Pod, clientset *kubernetes.Clientset, namespace, clusterNam
175178
//set replica=false on the deployment
176179
//set name=clustername on the deployment
177180
newLabels := make(map[string]string)
178-
newLabels["replica"] = "false"
179181
newLabels["name"] = clusterName
180182

181183
err = updateLabels(namespace, clientset, targetDeployment, target, newLabels)
182184
if err != nil {
183185
log.Error(err)
184186
}
185187

188+
newLabels["replica"] = "false"
186189
err = updatePodLabels(namespace, clientset, pod, target, newLabels)
187190
if err != nil {
188191
log.Error(err)

pgo/cmd/failover.go

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,10 @@ var failoverCmd = &cobra.Command{
4141
if Query {
4242
createFailover(args)
4343
} else if util.AskForConfirmation(NoPrompt) {
44+
if Target == "" {
45+
fmt.Println(`--target is required for failover.`)
46+
return
47+
}
4448
createFailover(args)
4549
} else {
4650
fmt.Println("Aborting...")

util/failover.go

Lines changed: 36 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ import (
2121
"k8s.io/api/core/v1"
2222
//"k8s.io/apimachinery/pkg/api/errors"
2323
"errors"
24+
"k8s.io/api/extensions/v1beta1"
2425
meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
2526

2627
"k8s.io/client-go/kubernetes"
@@ -31,11 +32,43 @@ import (
3132
)
3233

3334
// GetBestTarget
34-
func GetBestTarget(clientset *kubernetes.Clientset, clusterName, namespace string) (*v1.Pod, error) {
35+
func GetBestTarget(clientset *kubernetes.Clientset, clusterName, namespace string) (*v1.Pod, *v1beta1.Deployment, error) {
3536

3637
var err error
37-
var pod *v1.Pod
38-
return pod, err
38+
39+
//get all the replica deployment pods for this cluster
40+
var pod v1.Pod
41+
var deployment v1beta1.Deployment
42+
43+
//get all the deployments that are replicas for this clustername
44+
45+
//selector=replica=true,pg-cluster=clusterName
46+
var pods *v1.PodList
47+
lo := meta_v1.ListOptions{LabelSelector: "pg-cluster=" + clusterName + ",replica=true"}
48+
pods, err = clientset.CoreV1().Pods(namespace).List(lo)
49+
if err != nil {
50+
log.Error(err)
51+
return &pod, &deployment, err
52+
}
53+
if len(pods.Items) == 0 {
54+
return &pod, &deployment, errors.New("no replica pods found for cluster " + clusterName)
55+
}
56+
57+
for _, p := range pods.Items {
58+
pod = p
59+
log.Debug("pod found for replica " + pod.Name)
60+
if len(pods.Items) == 1 {
61+
log.Debug("only 1 pod found for failover best match..using it by default")
62+
return &pod, &deployment, err
63+
}
64+
65+
for _, c := range pod.Spec.Containers {
66+
log.Debug("container " + c.Name + " found in pod")
67+
}
68+
69+
}
70+
71+
return &pod, &deployment, err
3972
}
4073

4174
// GetPodName from a deployment name
