---
date: 2025-11-17
title: Introducing RamenDR Starter Kit
summary: A new pattern to illustrate regional DR for virtualization workloads running on OpenShift Data Foundation
author: Martin Jackson
blog_tags:
- patterns
- announce
---
:toc:
:imagesdir: /images

We are excited to announce that the **validatedpatterns-sandbox/ramendr-starter-kit** repository is now available and
has reached the Sandbox tier of Validated Patterns.

== The Pattern

TBD

== On the use of AI to generate scripts

This pattern is also noteworthy in that all of the major shell scripts in the pattern were written by
Cursor. This was a major learning experience, both in the capabilities of modern AI coding tools and in some
of their limitations.

=== The Good

* Error handling and visual output are better than the shell scripts (or Ansible code) I would have written if
I had written all of this from scratch.
* The "inner loop" of development felt a lot faster using the generated code than if I had written it all from
scratch. The value in this pattern is in the use of the components together, not in finding new and novel
ways to retrieve certificate material from a running OpenShift cluster.

=== The Bad

* Even when the context "knew" it was working on OpenShift and Hive, it used different mechanisms to retrieve
kubeconfig files for managed clusters. I had to remind it to use a known-good mechanism, which had worked for
downloading kubeconfigs to the user workstation.
* Several of these scripts are bash scripts wrapped in Kubernetes Jobs or CronJobs. The generator had some problems
with using local variables in places where it could not, and with using shell here documents in ways that YAML
does not allow. Eventually I set the context that we were better off using `.Files.Get` calls and externalizing the scripts
from the jobs altogether.
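
For reference, the known-good mechanism on Hive is to read the admin kubeconfig out of the Secret that the ClusterDeployment references. A minimal sketch, assuming a logged-in `oc` session on the hub; the cluster name and namespace are illustrative:

```shell
# Sketch only: "my-cluster" and its namespace are made-up names.
CLUSTER=my-cluster
NS=my-cluster

# Hive's ClusterDeployment records which Secret holds the admin kubeconfig
SECRET=$(oc -n "$NS" get clusterdeployment "$CLUSTER" \
  -o jsonpath='{.spec.clusterMetadata.adminKubeconfigSecretRef.name}')

# Extract the kubeconfig key from that Secret to a local file
oc -n "$NS" extract "secret/$SECRET" --keys=kubeconfig --to=/tmp

# Talk to the managed cluster
export KUBECONFIG=/tmp/kubeconfig
oc get nodes
```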
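Externalizing a script from a Job looks roughly like this in a Helm chart. This is an illustrative sketch, not the pattern's actual chart: the file names and image are made up, and `.Files.Get` is Helm's API for reading a file shipped alongside the templates, so the Job manifest never embeds a here document:

```yaml
# templates/backup-job.yaml (illustrative): the bash lives in files/backup.sh
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-script
data:
  backup.sh: |
{{ .Files.Get "files/backup.sh" | indent 4 }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: backup
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backup
          image: registry.example.com/tools:latest  # illustrative image
          command: ["/bin/bash", "/scripts/backup.sh"]
          volumeMounts:
            - name: script
              mountPath: /scripts
      volumes:
        - name: script
          configMap:
            name: backup-script
            defaultMode: 0755
```

Keeping the script in its own file also means shellcheck and editors treat it as bash rather than as a string inside YAML.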
=== The Ugly

* I am uncomfortable with the level of duplication in the code. Time will tell whether some of these scripts become
problematic to maintain. A more rigorous analysis might find several opportunities to refactor code.
* The sheer volume of code makes it a bit daunting to look at. All of the major scripts in the pattern are over 150
lines long, and the longest (as of this publication) is over 1,300 lines.
* Some of the choices of technique and dependency loading were a bit too generic. We have images for Validated
Patterns that provide things like a Python interpreter with access to the YAML module, the AWS CLI, and other tools
that turned out to be useful. I left in the Cursor-generated frameworks for downloading things like the AWS CLI, because they
correctly detect that those dependencies are already installed, and they may prove beneficial if we move to different
images.

== DR Terminology - What are we talking about?

**High Availability (“HA”)** includes all characteristics, qualities and workflows of a system that prevent
unavailability events for workloads. This is a very broad category, and includes things like redundancy built into
individual disks, such that failure of a single drive does not result in an outage for the workload. Load balancing,
redundant power supplies, and running a workload across multiple fault domains are some of the techniques that belong to
HA, because they keep the workload from becoming unavailable in the first place. HA is usually completely automatic, in that it does not require a real-time human in the loop.

**Disaster Recovery (“DR”)** includes the characteristics, qualities and workflows of a system to recover from an
outage event when there has been data loss. DR events often include things that are recognized as major
environmental disasters (weather events such as hurricanes, tornadoes, and fires) or other large-scale problems that
cause widespread devastation or disruption to a location where workloads run. In such events critical personnel might also
be affected (i.e. unavailable because they are dead or disabled), so questions of how decisions will be made without
key decision makers are also considered. (This is often included under the heading of “Business Continuity,” which is
closely related to DR.)

There are two critical differences between HA and DR: the first is the expectation of human
decision-making in the loop, and the other is the data loss aspect. That is, in a DR event we know we have lost data;
we are working out how much is acceptable to lose and how quickly we can restore workloads. This is what makes it
fundamentally different from HA; but some organizations do not really see or enforce this distinction, and that
leads to a lot of confusion. Some vendors also do not strongly make this distinction, which does not discourage that
confusion.

DR policies can be driven by external regulatory or legal requirements, or by an organization’s internal understanding of
what such external legal and regulatory requirements mean. That is to say, the law may not specifically require a
particular level of DR, but the organization interprets the law to mean that is what it needs to do to be compliant
with the law or regulation. The Sarbanes-Oxley Act (“SOX”) in the US was adopted after the Enron and WorldCom financial
scandals of the early 2000s, and includes a number of requirements for accurate financial reporting, which many
organizations have used to justify and fund substantial BC/DR programs.

**Business Continuity (“BC”, but usually used together with DR as “BCDR” or “BC/DR”)** refers primarily to the people
side of recovery from disasters. Large organizations will have teams that focus on BC/DR and use that term in the team
title or name. Such teams will be responsible for making sure that engineering and application groups are compliant with
the organization’s BC/DR policies. This can involve scheduling and running BC/DR “drills” and actual live testing of
BC/DR technologies.

**Recovery Time Objective (“RTO”)** is the amount of time it takes to restore a failed workload to service. This is
NOT the amount of data that is tolerable to lose - that is defined by the companion RPO.

**Recovery Point Objective (“RPO”)** is the amount of data a workload can stand to lose. One confusing aspect of RPO is that it can be defined as a time interval (as opposed to, say, a number of transactions). An RPO of “5 minutes”
should be read as “we want to lose no more than 5 minutes’ worth of data.”

Lots of people want a 0/0 RPO/RTO, often without understanding what it takes to implement that. It can be
fantastically expensive, even for the world’s largest and best-funded organizations.

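To make the two terms concrete, here is a back-of-the-envelope sketch with made-up numbers (not from any real deployment): asynchronous replication bounds the RPO, while detection, decision and failover time add up to the RTO.

```shell
# Hypothetical numbers for illustration only.
# Async replication ships changes every 5 minutes, so in the worst case
# the surviving copy is 5 minutes stale: that bounds the RPO.
replication_interval_min=5
worst_case_rpo_min=$replication_interval_min

# RTO is wall-clock time to restore service: detect, decide, fail over.
detection_min=10
decision_min=15
failover_min=20
rto_min=$((detection_min + decision_min + failover_min))

echo "worst-case RPO: ${worst_case_rpo_min} minutes"   # 5
echo "RTO: ${rto_min} minutes"                         # 45
```

Driving either number toward zero means synchronous replication and fully automated failover, which is where the expense comes from.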
== Why Does it Matter?

In a perfect world, every application would have its own knowledge of where it is available and would shard and
replicate its own data. But many applications were built without these concepts in mind, and even if a company
wanted to and could afford to re-write every application, it could not re-write them and deploy them all at once.

Thus, users benefit from being able to rely on technology products and solutions to enable a regional disaster
recovery capability when the application does not support it natively.

The ability to recover a workload in the event of a regional disaster is considered a requirement in several
industries for applications that the user deems critical enough to require DR support, but which cannot provide
it natively in the application.
