manageability-and-operations/operations-advisory/operating-model/sre-function.md (14 additions & 14 deletions)
@@ -8,24 +8,24 @@
# SRE Function in Cloud Operating Model
When defining the Cloud Operating Model, the Site Reliability Engineering (SRE) function embodies the core of Cloud Operations.
-SRE team will have a size of a minimun of 8 engineers for operations and on-call duties. There are several theories around the ideal ratio of SREs vs Developers. The truth is the magic number will change as the organization, and workloads, evolve and mature.
-The more automation and AiOps are leveraged the less repetitive tasks and manual intervetion will be needed, allowing SRE team members to focus on the real egineering part.
+The SRE team will have a size of a minimum of 8 engineers for operations and on-call duties. There are several theories around the ideal ratio of SREs vs Developers. The truth is the magic number will change as the organization, and workloads, evolve and mature.
+The more automation and AIOps are leveraged the less repetitive tasks and manual intervention will be needed, allowing SRE team members to focus on the real engineering part.

# SRE Role in Day-2 Operations

-**SRE**funtion encompasses reliability concepts into DevOps, focusing on designing and implementing highly scalable and resilient systems, addressing automatically potential and in-progress issues. In other words, each service can run an repair itself, extending the concept of 'autonomous' to virtually any service.
+**SRE**function encompasses reliability concepts in DevOps, focusing on designing and implementing highly scalable and resilient systems, addressing automatically potential and in-progress issues. In other words, each service can run and repair itself, extending the concept of 'autonomous' to virtually any service.
[SRE on Git](https://github.com/dastergon/awesome-sre?tab=readme-ov-file#sre-tools).
**DevOps** is more a philosophy, than a function, focusing on streamlining development and deployment processes, increasing the speed at which new features are delivered. Tasks in development - Dev- and operations - Ops - are part of a continuous loop that includes building, deploying, testing, and monitoring applications and services.
-To achieve this, [DevOps](https://docs.oracle.com/en-us/iaas/Content/GSG/Reference/getting-started-as-devops.htm) relies on methodologies, such as CI/CD, Agile Development and automation.
+To achieve this, [DevOps](https://docs.oracle.com/en-us/iaas/Content/GSG/Reference/getting-started-as-devops.htm) relies on methodologies, such as CI/CD, Agile Development, and automation.
-provides a powerful end-to-end platform for your DevOps practice, including private Git repositories as well as connection capability to GitHub, GitLab and other external repos.
+provides a powerful end-to-end platform for your DevOps practice, including private Git repositories as well as connection capability to GitHub, GitLab, and other external repositories.

1. Adopt a version control system in the form of a single repository.

-2. Automate building, testing and deployment.
+2. Automate building, testing, and deployment.

3. Exploit IaC.
@@ -35,29 +35,29 @@ provides a powerful end-to-end platform for your DevOps practice, including priv
# SRE Best Practises and OCI Support for them

-1. Define Service Level Objectives: these should be identified based on relevance to the business. Each organization will need to think it through, and most likely will define SLOs based on 'internal' SLOs, like resource utilization and response time, and 'end-users' SLOs like availabilty and end-user experience. They could also be device-dependent (like a Mobile Apps adoption or availability).
+1. Define Service Level Objectives: these should be identified based on relevance to the business. Each organization will need to think it through, and most likely will define SLOs based on 'internal' SLOs, like resource utilization and response time, and 'end-user' SLOs like availability and end-user experience. They could also be device-dependent (like a Mobile Apps adoption or availability).

-2. Unify the Observability platform, possibly with native SLOs features. OCI provides the capability to define thresholds and [custom metrics](https://docs.oracle.com/en-us/iaas/Content/Monitoring/Tasks/publishingcustommetrics.htm) to achieve this. Besides, available plug-ins and APIs, can expose the same metrics available to external tools such as [Grafana](https://grafana.com/grafana/plugins/oci-metrics-datasource/).
+2. Unify the Observability platform, possibly with native SLO features. OCI provides the capability to define thresholds and [custom metrics](https://docs.oracle.com/en-us/iaas/Content/Monitoring/Tasks/publishingcustommetrics.htm) to achieve this. Besides, available plug-ins and APIs can expose the same metrics available to external tools such as [Grafana](https://grafana.com/grafana/plugins/oci-metrics-datasource/).

-3. Define granularity and frequency (resolution) of metrics collection based on architecture, usefulness and related effort/cost per metric. Review these parameters as your architecture evolves over time.
+3. Define granularity and frequency (resolution) of metrics collection based on architecture, usefulness, and related effort/cost per metric. Review these parameters as your architecture evolves over time.

4. Implement Alerting tools for quick detection of potential issues. With [OCI Notifications](https://docs.oracle.com/en-us/iaas/Content/Notification/Concepts/notificationoverview.htm), you can easily detect and be notified in human-readable format, when something happens in OCI. Keep Alerts definition and triggering as simple as possible.

-5. Leverage Automation. EaC -Everything as a Code- and Ansible support SRE work throught the entire lifecycle management, from provisioning to configuration changes. Ansible playbooks promote consistency and idempotency, for repetitive tasks as well as rollback when needed.
+5. Leverage Automation. EaC -Everything as a Code- and Ansible support SRE work through the entire lifecycle management, from provisioning to configuration changes. Ansible playbooks promote consistency and idempotency, for repetitive tasks as well as rollback when needed.

-6. Use 'canary deployments' approach to minimize effects on a limited number of users and for early detection of defects. Select metrics, canary population and duration depending on the
+6. Use the 'canary deployments' approach to minimize effects on a limited number of users and for early detection of defects. Select metrics, canary population, and duration depending on the

7. Automate remediation mechanisms. Once Notifications are implemented, automation can be easily achieved via [Functions](https://docs.oracle.com/en-us/iaas/Content/Notification/Concepts/notificationoverview.htm#automation). Examples may be from filing Jira Tickets to resizing VMs and many more.
8. Unify the ticketing platform: OCI gives the chance to integrate MyOracleSupport with your ticketing system via [Support Management APIs](https://docs.oracle.com/en-us/iaas/api/#/en/incidentmanagement/20181231/).

-9. Define After Action Review Process (AAR) and post-mortem analysis.
+9. Define the After Action Review Process (AAR) and post-mortem analysis.

10. Plan for Capacity. OCI offers a powerful tool to help you with forecasting your capacity needs via [Operations Insight](https://docs.oracle.com/en-us/iaas/operations-insights/doc/capacity-planning.html#GUID-B2A3E104-494B-46A5-9F3E-8E3977C9328F).

-11. Avoid proliferation of tools and maximize integrations among those used.
+11. Avoid the proliferation of tools and maximize integrations among those used.

-12. Document standards, processes and tools.
+12. Document standards, processes, and tools.

13. Evolve your SRE ecosystem along your environment lifecycle.
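
As a rough illustration of item 1 in the list above, an availability SLO can be turned into an error budget that shows how many failed requests a service can still tolerate in a given window. The following is a minimal Python sketch and is not part of the changed file; the function name `error_budget`, the 99.9% target, and the request counts are all hypothetical.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for an availability SLO.

    slo_target is a fraction in (0, 1), e.g. 0.999 for "99.9% of requests succeed".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # Number of failures the SLO tolerates over this window.
    allowed_failures = (1 - slo_target) * total_requests
    # Observed fraction of successful requests.
    availability = 1 - failed_requests / total_requests
    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        # Fraction of the budget already spent; values above 1.0 mean it is exhausted.
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": availability >= slo_target,
    }


# Hypothetical 30-day window: 1,000,000 requests, 400 failures, 99.9% target.
report = error_budget(0.999, 1_000_000, 400)
print(report)
```

A team might feed a figure like `budget_consumed` into its alerting rules (item 4), but the thresholds to use depend on the workload and are left open here.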