src/content/docs/support/cloudflare-status.mdx
2 additions & 0 deletions
@@ -1,6 +1,8 @@
 ---
 pcx_content_type: concept
 title: Cloudflare Status
+sidebar:
+  order: 5
 ---

 Cloudflare provides updates on the status of our services and network at https://www.cloudflarestatus.com/, which you should check if you notice unexpected behavior with Cloudflare.
@@ -11,30 +12,30 @@ Cloudflare believes that openness and transparency are intrinsic to the delivery

 This Standard Operating Procedure (SOP) defines how Cloudflare deals with all incidents and problems impacting its production environment and the ways in which Cloudflare communicates the nature and impact of these incidents to Enterprise customers, both planned and unplanned, regardless of severity. This procedure specifies how these efforts are uniformly followed in order to

-* maximize environment uptime,
-* minimize client impact,
-* reduce the time to repair, and
-* share information with our customers and the Internet community.
+- maximize environment uptime,
+- minimize client impact,
+- reduce the time to repair, and
+- share information with our customers and the Internet community.

-***
+---

 ## Scope

 This SOP applies to Cloudflare customers and customer services as consumed by customers. The SOP is applicable to all customer production environments at Cloudflare including:

-* Cloudflare’s public website ([www.cloudflare.com](http://www.cloudflare.com/))
+- Network infrastructure owned or managed by Cloudflare for production services
+- Vendor software, hardware and services that affect any part of Cloudflare production

-***
+---

 ## Background

 Cloudflare wants to build a better Internet. In order to deliver an improved experience to millions of Internet users, Cloudflare’s internal operations must follow excellent service delivery processes and procedures. Cloudflare’s procedures therefore follow many industry-standard best practices, some of which specifically follow patterns of the Information Technology Infrastructure Library (ITIL). This SOP follows the best practices of the ITIL Problem Management methodology.

-***
+---

 ## Definitions

@@ -120,7 +121,7 @@ The primary tool which Cloudflare uses to publicly share information about its s

 The Status Page is hosted by a Third Party ([Statuspage.io](http://statuspage.io/)) which is not dependent on Cloudflare’s services for operation.

-***
+---

 ## Roles and responsibilities

@@ -150,32 +151,32 @@ The overall Systems Reliability Engineering team who support the efforts of the

 Support the Incident Manager during problem resolution. Join bridge calls, if requested. Ensure documentation is captured while diagnosing and correcting issues and proper escalation to other responsible groups is executed. Participate in Post Mortem reviews of some Incident Reports, as requested by Cloudflare Management.

-***
+---

 ## Standard Operating Procedure

 This section details the procedures for incident and problem management. At a high level, these processes relate as follows:

-* Incident Management: The overall process for observing and responding to alerts, including: assessing the potential impact and severity of an Incident, classifying the Incident as a Problem, assigning a priority to the Problem, or dismissing the Incident as a non-impacting event if a problem condition cannot be identified.
+- Incident Management: The overall process for observing and responding to alerts, including: assessing the potential impact and severity of an Incident, classifying the Incident as a Problem, assigning a priority to the Problem, or dismissing the Incident as a non-impacting event if a problem condition cannot be identified.

-* Problem Management: The process of identifying the scope and extent of a Problem, assigning an appropriate severity level (P0, P1, P2, P3), the actions to resolve the Problem and restore the optimal state for production services, and the communication of the Problem to appropriate parties.
+- Problem Management: The process of identifying the scope and extent of a Problem, assigning an appropriate severity level (P0, P1, P2, P3), the actions to resolve the Problem and restore the optimal state for production services, and the communication of the Problem to appropriate parties.

-* Resolution Management: The process of investigating the causes and conditions which lead to a problem condition, reporting on the overall manner by which a problem was managed and resolved, and any subsequent analysis of how the conditions and causes of a problem may be prevented in the future.
+- Resolution Management: The process of investigating the causes and conditions which lead to a problem condition, reporting on the overall manner by which a problem was managed and resolved, and any subsequent analysis of how the conditions and causes of a problem may be prevented in the future.

-***
+---

 The primary goal of Incident Management is to identify and react to potential problems as quickly as possible, and thereby minimize impact to production services and provide the best possible levels of service quality and availability. The best possible levels of service quality and availability would be that all services operated exactly as designed 100% of the time, and were available and accessible 100% of the time.

 Because we accept that a combination of forces within our control, and forces beyond our control, will eventually impact service health, we define Service Level Objectives (SLOs), and Service Level Agreements (SLAs), to describe what degradations in service health are acceptable for various services within Cloudflare’s network. SLAs and SLOs are expressed as percentages of periods of time (monthly and annually).
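
As a rough illustration of what a percentage-of-time objective implies, the sketch below converts an availability target into the downtime it would permit over a period. The 99.9% target, the 30-day month, the 365-day year, and the function name `allowed_downtime_minutes` are assumptions chosen for illustration; they are not Cloudflare's actual SLOs or SLAs.

```python
def allowed_downtime_minutes(target_percent: float, period_days: float) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The permitted downtime is the share of the period not covered by the target.
    return total_minutes * (100 - target_percent) / 100


# Hypothetical 99.9% target over a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.9, 30)
annual = allowed_downtime_minutes(99.9, 365)
print(f"monthly budget: {monthly:.1f} min, annual budget: {annual:.1f} min")
```

Measuring the same target over a longer period gives a larger absolute budget, which is one reason to state both the percentage and the period together.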

 The level of information given about an incident may vary, but the following information must be collected before an incident is classified and prioritized:

-* Submitter Source (monitoring alert or alternate source)
-* Customer(s) (if applicable)
-* System or application (and hostname, if applicable)
-* Time of alert
-* Scope of impact: estimated number of systems, users, or regions impacted
-* Type of impact: general scope of service impairment (e.g., loss of all access, degraded performance, dependent applications impacted, observed customer impact)
+- Submitter Source (monitoring alert or alternate source)
+- Customer(s) (if applicable)
+- System or application (and hostname, if applicable)
+- Time of alert
+- Scope of impact: estimated number of systems, users, or regions impacted
+- Type of impact: general scope of service impairment (e.g., loss of all access, degraded performance, dependent applications impacted, observed customer impact)

 All Incidents which are classified as Problems, regardless of source, which have a priority of P0 or P1, will be logged within the Cloudflare ticketing system, JIRA. Some alerts will indicate conditions which may not be immediately impacting to service levels, and as necessary, will be categorized as Problems with a P2 or P3 priority.

@@ -191,31 +192,31 @@ All tickets will be categorized according to the following 4 levels of priority.

 **P0**

-* Complete loss of access to the Cloudflare application or API.
-* Degraded access to the Cloudflare application or API (⪯ 98% as measured worldwide or from any major region).
-* Complete loss of access to, or major performance degradation to, a Tier-1 Data Center.
-* Degraded performance of any Tier-1 global transit provider (⪰ 20% packet loss worldwide or 30% packet loss from any major region).
-* Degraded access to or performance of any critical system.
+- Complete loss of access to the Cloudflare application or API.
+- Degraded access to the Cloudflare application or API (⪯ 98% as measured worldwide or from any major region).
+- Complete loss of access to, or major performance degradation to, a Tier-1 Data Center.
+- Degraded performance of any Tier-1 global transit provider (⪰ 20% packet loss worldwide or 30% packet loss from any major region).
+- Degraded access to or performance of any critical system.

 **P1**

-* Intermittent or degraded Site-wide performance degradation.
-* Loss of an important function such as reporting.
-* Loss of access to the Cloudflare application from one of the social media or external CloudFlare websites
-* Outage to important outbound third-party interface.
-* Inoperability of the site for one of the enterprise clients or distribution partners.
-* Corruption or loss of customer data.
+- Intermittent or degraded site-wide performance.
+- Loss of an important function such as reporting.
+- Loss of access to the Cloudflare application from one of the social media or external Cloudflare websites.
+- Outage to important outbound third-party interface.
+- Inoperability of the site for one of the enterprise clients or distribution partners.
+- Corruption or loss of customer data.

 **P2**

-* Sporadic or localized performance issue.
-* System issues with no noticeable client impact yet (e.g. high CPU).
-* Single client outage/degradation.
+- Sporadic or localized performance issue.
+- System issues with no noticeable client impact yet (e.g. high CPU).
+- Single client outage/degradation.

 **P3**

-* Operational issues, procedural problems or service requests that have little or no effect on end-users and can be handled on an as-available basis.
-* The default severity assigned to all tickets that have not yet been reviewed or assigned a severity level.
+- Operational issues, procedural problems or service requests that have little or no effect on end-users and can be handled on an as-available basis.
+- The default severity assigned to all tickets that have not yet been reviewed or assigned a severity level.
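
Only the numeric P0 criteria above lend themselves to an automated check, so the following is a minimal sketch of how they might be encoded, not a description of Cloudflare's tooling. It assumes the 98% figure is an availability percentage and that the packet-loss figures are percentages; the `Measurements` class, its field names, and both function names are hypothetical. Anything that does not meet a numeric threshold falls back to P3, the default for tickets that have not yet been reviewed, and would still need human classification against the qualitative criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurements:
    """Hypothetical inputs; the field names are illustrative, not a Cloudflare schema."""
    api_availability_percent: Optional[float] = None
    worldwide_packet_loss_percent: Optional[float] = None
    regional_packet_loss_percent: Optional[float] = None


def meets_numeric_p0(m: Measurements) -> bool:
    """Check only the numeric P0 thresholds: availability <= 98%, loss >= 20% / 30%."""
    if m.api_availability_percent is not None and m.api_availability_percent <= 98:
        return True
    if m.worldwide_packet_loss_percent is not None and m.worldwide_packet_loss_percent >= 20:
        return True
    if m.regional_packet_loss_percent is not None and m.regional_packet_loss_percent >= 30:
        return True
    return False


def initial_priority(m: Optional[Measurements]) -> str:
    """Return "P0" when a numeric threshold is met, otherwise the unreviewed default "P3"."""
    if m is not None and meets_numeric_p0(m):
        return "P0"
    return "P3"


print(initial_priority(Measurements(api_availability_percent=97.5)))
print(initial_priority(Measurements(regional_packet_loss_percent=12.0)))
```

Treating missing measurements as "no threshold met" keeps the sketch conservative: it never assigns P0 without data, and leaves P1 and P2 to a reviewer.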

 ### Category

@@ -235,36 +236,36 @@ P0 and P1 incidents obviously have more impact to the business and therefore, ha

 For all P0 and P1 issues, the on-duty Incident Manager should be contacted immediately. A schedule of incident managers will be posted to ensure that SRE knows who to contact at any given time. The incident manager is a critical resource responsible for the following:

-* Validation of the severity of an issue
-* Tracking of the issue from submission to resolution
-* Representation of clients’ best interest
-* Logging of all actions and times
-* Direction of personnel toward the fastest possible resolution
-* Ensuring that clients and internal management are notified of status according to pre-determined time periods (or upon change in status)
-* Performing client, internal or third-party escalations when time limits are being exceeded or appropriate progress is not being made
-* Ensuring that a meaningful explanation is applied to the ticket upon resolution
-* Making certain that the initial submitter agrees that the issue is resolved before the ticket is closed
+- Validation of the severity of an issue
+- Tracking of the issue from submission to resolution
+- Representation of clients’ best interest
+- Logging of all actions and times
+- Direction of personnel toward the fastest possible resolution
+- Ensuring that clients and internal management are notified of status according to pre-determined time periods (or upon change in status)
+- Performing client, internal or third-party escalations when time limits are being exceeded or appropriate progress is not being made
+- Ensuring that a meaningful explanation is applied to the ticket upon resolution
+- Making certain that the initial submitter agrees that the issue is resolved before the ticket is closed

-***
+---

 ## Incident Communications

 External communications during an incident are critical for:

-* Notifying the stakeholders that Cloudflare is aware of the issue and is pursuing resolution
-* Reassuring clients that the matter is under review and that Cloudflare is looking out for their best interests
-* Issues do not drag on unnecessarily and appropriate escalations are being made
-* Informing key internal stakeholders of important incidents
+- Notifying the stakeholders that Cloudflare is aware of the issue and is pursuing resolution
+- Reassuring clients that the matter is under review and that Cloudflare is looking out for their best interests
+- Ensuring that issues do not drag on unnecessarily and that appropriate escalations are being made
+- Informing key internal stakeholders of important incidents

 Major types of communications during an incident include:
 A Status Page entry will be created from templates by the on-call CSUP team member as soon as an incident is identified.

-***
+---

 ## Post-Mortem reviews

@@ -288,7 +289,7 @@ The Incident Report (“IR”) is the primary method of communication to the cli

 The person writing the report will vary depending on the severity of the issue and the responsible area. Upon completion of the draft report, it is critical to ensure that the report is reviewed by Cloudflare management for content, commitments and professional presentation. Once the report is approved it may be published to the client.

-***
+---

 ## Problem review

@@ -298,10 +299,10 @@ The above sections have detailed the handling of the incident and the root cause

 The ticket criteria that need to be reported for both open and closed tickets include the following:

-* Severity
-* Category/Sub-category
-* Responsible Group
-* Age/Days Open
+- Severity
+- Category/Sub-category
+- Responsible Group
+- Age/Days Open
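
The four criteria above can be tallied from a list of tickets in a few lines. This is only a sketch under stated assumptions: the `Ticket` class, its field names, the `summarize` function, and the sample tickets are all made up for illustration and do not reflect Cloudflare's ticketing schema or data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Ticket:
    """Hypothetical ticket record; the field names are illustrative only."""
    severity: str                  # "P0" through "P3"
    category: str                  # category/sub-category label
    responsible_group: str
    opened: date
    closed: Optional[date] = None  # None while the ticket is still open


def age_days(ticket: Ticket, today: date) -> int:
    """Days the ticket was open (if closed) or has been open so far."""
    end = ticket.closed if ticket.closed is not None else today
    return (end - ticket.opened).days


def summarize(tickets: List[Ticket], today: date) -> Dict[str, object]:
    """Count tickets per criterion and report the age of the oldest open ticket."""
    open_ages = [age_days(t, today) for t in tickets if t.closed is None]
    return {
        "by_severity": Counter(t.severity for t in tickets),
        "by_category": Counter(t.category for t in tickets),
        "by_group": Counter(t.responsible_group for t in tickets),
        "oldest_open_days": max(open_ages, default=0),
    }


# Made-up sample data.
tickets = [
    Ticket("P1", "network/transit", "SRE", date(2024, 1, 1)),
    Ticket("P3", "process/request", "Support", date(2024, 1, 5), date(2024, 1, 8)),
]
report = summarize(tickets, today=date(2024, 1, 11))
print(report["by_severity"], report["oldest_open_days"])
```

Counts per criterion like these are the kind of input a trend chart would be built from.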

 Wherever possible, this data should be reported graphically to show visible trends. These reports should be published to internal Cloudflare managers and area owners.

@@ -313,6 +314,6 @@ Each area owner for tickets will be responsible for not only ensuring that their

 As part of all departmental staff meetings, group managers should be reviewing the ticket open and trending reports with the following objectives:

-* Discussion of areas of success or concern
-* Review of opportunities for improvement by the area owners
-* Agreement on areas that warrant a new Problem ticket to be opened for remediation tracking
+- Discussion of areas of success or concern
+- Review of opportunities for improvement by the area owners
+- Agreement on areas that warrant a new Problem ticket to be opened for remediation tracking