Splunk Observability Cloud Unavailable

Incident Report for Splunk Observability Cloud EU0

Resolved

This incident has been resolved.

Posted Mar 23, 2023 - 22:15 UTC

Update

All systems are fully operational except for the charting and alerting system, which continues to experience issues with reflecting new and updated time series in running detectors and charts.

Posted Mar 23, 2023 - 21:58 UTC

Update

Data ingest and alerting are operational (some alerts for new time series or recently updated properties are delayed).

UI and API systems are functional, as is some charting functionality.

We are continuing to monitor charting recovery as well as the burndown for the time series creation and update backlog.

Posted Mar 23, 2023 - 21:43 UTC

Update

As the metadata systems continue to recover, we expect a delay of approximately 40 minutes before property updates and newly created time series are reflected in the alerting and charting systems.

Posted Mar 23, 2023 - 21:26 UTC

Update

Efforts to improve the recovery time of alerting were successful, and alerting systems are now operational. Customers may have received a large burst of backlogged alerts at 2:10 PM PT.

Access to the APIs and Web Interface is now restored; however, charting continues to experience issues.

Posted Mar 23, 2023 - 21:24 UTC

Update

Data ingest continues to be healthy and operational. We are seeing slow but steady recovery in the alerting system and are implementing further changes to accelerate the recovery. We expect UI/API functionality to be restored within the next 20 minutes.

Posted Mar 23, 2023 - 21:05 UTC

Update

Data ingest is now operational, and we are no longer dropping telemetry data. We are continuing to monitor the recovery of non-ingest APIs, UI, and Alerting functionality.

Posted Mar 23, 2023 - 20:38 UTC

Monitoring

We are seeing recovery of data ingest for all telemetry and continuing to monitor until full recovery is achieved. We are continuing to perform operations to restore APIs, UI, and Alerting functionality.

Posted Mar 23, 2023 - 20:23 UTC

Update

We've verified that cluster functionality is back and are now working on resolving client connection issues to the impacted cluster. We expect some functionality to be restored in the next 30 minutes.

Posted Mar 23, 2023 - 20:09 UTC

Update

We have restored the impacted cluster and are now re-enabling data processing to start working through backlogs.

Posted Mar 23, 2023 - 19:47 UTC

Update

A key data store was unexpectedly lost during planned operations for an unrelated system. We are working to bring the data store cluster back up, after which we expect the recovery of some ingest functionality. Following system recovery, we anticipate that data recovery and repair will need to occur for full functionality to be restored.

Posted Mar 23, 2023 - 19:22 UTC

Update

This incident has been updated to acknowledge that logs, APM, RUM, and Incident Intelligence ingest are also experiencing a major outage.

Posted Mar 23, 2023 - 18:49 UTC

Update

We are continuing to work on mitigations to recover the system and expect resolution to take at least 3 hours at this time. We will continue to provide regular updates as we make progress on the resolution or expected timeline.

Posted Mar 23, 2023 - 18:41 UTC

Identified

We've identified the source of the issue and are working on mitigations. All major functionality and data ingest are impacted.

Posted Mar 23, 2023 - 18:31 UTC

Investigating

We are investigating a major outage of Splunk Observability Cloud starting at 10:47am PT. We will provide an update as soon as possible.

Posted Mar 23, 2023 - 18:04 UTC

This incident affected: Splunk APM (Splunk APM Ingest), Splunk Log Observer and Log Observer Connect (Splunk Log Observer Ingest), Datapoint Ingest, Splunk Observability Cloud Web Interface, Alerting, Splunk Incident Intelligence (Splunk Incident Intelligence Ingest), and Splunk RUM (Splunk RUM Ingest).