Incident Report: Low Number of WID Auctions

Written by Admin | Sep 12, 2025 3:10:21 PM

Summary

Title: Low number of WID auctions

Date & Time: Monday 30.06.2025, 16:19h (CEST) - Wednesday 02.07.2025, 10:00h (CEST)
Affected Services: Auction start of within-day auctions
Status: resolved

Description

Overview

On Monday 30.06.2025 at approximately 16:19h (CEST), a critical error occurred during the execution of PRISMA's within-day capacity auctions. As a result, a share of the within-day auctions was not created and therefore not started.

Technical Details

The issue was caused by a database running out of storage. Because no storage was available, an (intentional) lock on the database was not released, which caused the process responsible for creating within-day auctions to fail.

Scope of Impact

The expected number of within-day auctions for a given hour is approx. 215 (the exact number fluctuates with the availability of capacity on the TSO side). During the incident, the number of running auctions was 112, which means that approx. 48% of the expected auctions were impacted.
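The impact figure can be reproduced with a few lines of arithmetic. This is only a sketch using the two numbers stated in this report; the variable names are illustrative and not taken from any PRISMA system.

```python
# Figures as stated in the report; names are illustrative assumptions.
expected_auctions = 215  # approximate number of auctions expected per hour
running_auctions = 112   # auctions actually running during the incident

# Auctions that were not created, and their share of the expected total.
missing_auctions = expected_auctions - running_auctions
impacted_share = missing_auctions / expected_auctions

print(f"{missing_auctions} auctions missing ({impacted_share:.0%} of expected)")
# prints "103 auctions missing (48% of expected)"
```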

Duration of Impact

The total duration of the incident from detection on 30.06.2025, 16:19h (CEST) to full restoration of the auction functionality on the same day at 18:00h (CEST) was 1 hour and 41 minutes.

According to PRISMA’s business continuity measures, the UMM announcing the resolution of the issue is published once all auctions for the impacted transportation period have been conducted. The respective UMM was published on 01.07.2025, 07:45h (CEST), which is an additional 13 hours and 45 minutes after the restoration of the auction functionality.
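The two durations above follow directly from the timestamps in this report. The sketch below recomputes them; it assumes all timestamps are in CEST, so naive datetimes are sufficient, and the variable names are illustrative.

```python
from datetime import datetime

# Timestamps as stated in the report, all in CEST (assumed, so no time zone handling).
detected = datetime(2025, 6, 30, 16, 19)      # low auction count identified
restored = datetime(2025, 6, 30, 18, 0)       # auction functionality restored
umm_resolved = datetime(2025, 7, 1, 7, 45)    # UMM updated to "resolved"

outage = restored - detected          # duration of the incident itself
umm_delay = umm_resolved - restored   # additional time until the UMM update

print(outage)     # prints "1:41:00"
print(umm_delay)  # prints "13:45:00"
```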

Timeline

Date	Time (CEST)	Responsible	Description
30 Jun 2025	16:19	PRISMA internal	Alerted by internal auction monitoring, PRISMA staff identified a low number of within-day auctions.
30 Jun 2025	17:23	PRISMA Emergency Guard	As part of PRISMA’s business continuity measures the Emergency Guard posted a UMM to inform the market about issues with the auction start.
30 Jun 2025	18:00	PRISMA internal	After the processes that caused the database performance issues had been identified and manually ended, the database recovered and all remaining sessions were unblocked.
30 Jun 2025	18:00 - 20:00	PRISMA internal	PRISMA engineers closely monitored the state of within-day auctions and database performance to ensure that the fix was persistent.
1 Jul 2025	07:43	PRISMA Emergency Guard	As part of PRISMA’s business continuity measures the Emergency Guard informed the TSO emergency contacts via email that the incident is resolved.
1 Jul 2025	07:45	PRISMA Emergency Guard	As part of PRISMA’s business continuity measures the Emergency Guard updated the UMM with the information that the incident is resolved.
2 Jul 2025	10:00	PRISMA internal	Emergency Guard, Customer Success and involved engineers conducted a post mortem.

Root Cause Analysis (RCA)

Assessment: The incident was caused by expensive queries issued through the Shipper API. The existing fail-safe of limiting the query and the existing rate limiting were not enough to prevent the incident.

Detection: The problem was identified by PRISMA internal monitoring and alarming.

Resolution & Recovery

Intermediate resolution:
During the incident, the expensive processes and queries were identified and manually ended.

Long-term resolution:
Introduction of improved query handling (e.g. queueing) and improvement of the existing rate limiting for the auction endpoint of the Shipper API. In addition, the query execution for the Shipper API can be moved to a database replica.

Restoration of service:
The full restoration of the auction functionality was reached after the processes that caused the issues were manually ended.

Preventive Actions:

Immediately rate limit the shipper calling the auction endpoint of the Shipper API to prevent further usage in an unintended manner. In addition, identify and initiate direct communication with the shipper to change their usage behaviour of the Shipper API.

Introduction of improved query handling (e.g. queueing) and improvement of existing rate limiting for this specific endpoint.

Create dedicated database views to provide real-time visibility into critical metrics such as current locks, memory usage, and disk utilisation. This will streamline future incident analysis by enabling engineers to quickly access and interpret the most relevant information, thereby reducing diagnostic time and accelerating resolution.

Create an internal incident response scheme for database-level issues, defining which steps may be taken and which must be taken during an incident analysis.
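To illustrate the rate limiting and queueing measures named above, the following is a minimal sketch of a per-client token bucket combined with a bounded query queue. It does not describe PRISMA's actual implementation: the class names, parameters and return values are hypothetical, and a production setup would typically sit in an API gateway or the service layer.

```python
import time
from collections import deque


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class QueryGate:
    """Hypothetical gate: rate-limit each client, then queue accepted queries."""

    def __init__(self, rate, capacity, max_queue, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.max_queue = max_queue
        self.clock = clock
        self.buckets = {}     # one token bucket per client
        self.queue = deque()  # accepted queries waiting for execution

    def submit(self, client_id, query):
        if client_id not in self.buckets:
            self.buckets[client_id] = TokenBucket(self.rate, self.capacity, self.clock)
        if not self.buckets[client_id].try_acquire():
            return "rate_limited"  # e.g. answered with HTTP 429
        if len(self.queue) >= self.max_queue:
            return "queue_full"    # shed load instead of piling work onto the database
        self.queue.append((client_id, query))
        return "queued"

    def run_next(self, execute):
        """Run the oldest queued query with the caller-supplied `execute` function."""
        if not self.queue:
            return None
        _, query = self.queue.popleft()
        return execute(query)
```

Limiting each client separately keeps one heavy caller from exhausting the budget of others, while the bounded queue caps how much expensive work can reach the database at once. The `execute` callable could, as an assumption, point at a read replica rather than the primary database.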
