This incident has been resolved.
Mar 23, 15:15 PDT
All systems are fully operational except for the charting and alerting system, which continues to experience issues with reflecting new and updated time series in running detectors and charts.
Mar 23, 14:58 PDT
Data ingest and alerting are operational (some alerts for new time series or recently updated properties are delayed).
UI and API systems are functional, as well as some charting functionality.
We are continuing to monitor charting recovery as well as the burndown for the time series creation and update backlog.
Mar 23, 14:43 PDT
As the metadata systems continue to recover, we expect a delay of approximately 40 minutes before property updates and new time series creations are reflected in the alerting and charting systems.
Mar 23, 14:26 PDT
Efforts to improve the recovery time of alerting were successful, and alerting systems are now operational. Customers may have received a large burst of backlogged alerts at 2:10 PT.
Access to the APIs and web interface is now restored; however, charting continues to experience issues.
Mar 23, 14:24 PDT
Data ingest continues to be healthy and operational. We are seeing slow but steady recovery in the alerting system and are implementing further changes to accelerate the recovery. We expect UI/API functionality to be restored within the next 20 minutes.
Mar 23, 14:05 PDT
Data ingest is now operational, and we are no longer dropping telemetry data. We are continuing to monitor the recovery of non-ingest APIs, UI, and alerting functionality.
Mar 23, 13:38 PDT
We are seeing recovery of data ingest for all telemetry and continuing to monitor until full recovery is achieved. We are continuing to perform operations to restore APIs, UI, and Alerting functionality.
Mar 23, 13:23 PDT
We've verified that cluster functionality is restored and are now resolving client connection issues to the impacted cluster. We expect some functionality to be restored within the next 30 minutes.
Mar 23, 13:09 PDT
We have restored the impacted cluster and are now re-enabling data processing to begin working through the backlogs.
Mar 23, 12:47 PDT
A key data store was unexpectedly lost during planned operations for an unrelated system. We are working to bring the data store cluster back up, after which we expect the recovery of some ingest functionality. Following system recovery, we anticipate that data recovery and repair will need to occur for full functionality to be restored.
Mar 23, 12:22 PDT
This incident has been updated to acknowledge that Logs, APM, RUM, and Incident Intelligence ingest are also experiencing a major outage.
Mar 23, 11:49 PDT
We are continuing to work on mitigations to recover the system and expect resolution to take at least 3 hours at this time. We will continue to provide regular updates as we make progress on the resolution or expected timeline.
Mar 23, 11:41 PDT
We've identified the source of the issue and are working on mitigations. All major functionality and data ingest are impacted.
Mar 23, 11:31 PDT
We are investigating a major outage of Splunk Observability Cloud starting at 10:47am PT. We will provide an update as soon as possible.
Mar 23, 11:04 PDT