
Incident Report: Low Number of WID Auctions

Written by Admin | Sep 12, 2025 3:10:21 PM

Summary 

 

  • Title: Low number of WID auctions 
  • Date & Time: Monday 30.06.2025, 16:19h (CEST) - Wednesday 02.07.2025, 10:00h (CEST) 
  • Affected Services: Auction start of within-day auctions 
  • Status: resolved 

Description 

 

Overview 

On Monday 30.06.2025 at approximately 16:19h (CEST), a critical error occurred during the execution of PRISMA's within-day capacity auctions. As a result, some within-day auctions were not created and therefore not started.

Technical Details 

The issue was caused by a database running out of storage. Because no storage was available, an (intentional) lock on the database was not released, which caused the process responsible for creating within-day auctions to fail.

Scope of Impact 

The expected number of within-day auctions for a given hour is approx. 215 (the exact number fluctuates with the availability of capacity on the TSO side). During the issue, 112 auctions were running, so approx. 103 of the expected auctions (approx. 48%) were impacted.
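The impact figure can be reproduced with a short calculation. The sketch below only uses the two numbers stated in this report; the function name `impacted_share` is illustrative and not part of any PRISMA system.

```python
def impacted_share(expected: int, running: int) -> float:
    """Return the share of expected auctions that did not run, in percent."""
    if expected <= 0:
        raise ValueError("expected must be a positive number of auctions")
    missing = expected - running
    return 100.0 * missing / expected


# Figures from this report: approx. 215 auctions expected, 112 running.
share = impacted_share(expected=215, running=112)
print(f"{215 - 112} auctions missing, {share:.1f}% of expected")
# prints "103 auctions missing, 47.9% of expected"
```

Rounded to whole percent, this gives the approx. 48% stated above. Because the expected number fluctuates from hour to hour, the result is an estimate.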

Duration of Impact 

The total duration of the incident from detection on 30.06.2025, 16:19h (CEST) to full restoration of the auction functionality on the same day at 18:00h (CEST) was 1 hour and 41 minutes. 

According to PRISMA’s business continuity measures, the UMM announcing the resolution is published once all auctions for the impacted transportation period have been conducted. The respective UMM was published on 01.07.2025, 07:45h (CEST), which is 13 hours and 45 minutes after the restoration of the auction functionality.
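The two durations can be checked from the timestamps given in this report. This is a minimal sketch: all timestamps are in CEST and no clock change falls within the period, so naive datetimes are sufficient here. The helper names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%d.%m.%Y %H:%M"


def parse(timestamp: str) -> datetime:
    """Parse a timestamp in the report's DD.MM.YYYY HH:MM notation."""
    return datetime.strptime(timestamp, FMT)


def hours_minutes(delta: timedelta) -> tuple[int, int]:
    """Split a duration into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


detected = parse("30.06.2025 16:19")
restored = parse("30.06.2025 18:00")
umm_published = parse("01.07.2025 07:45")

print(hours_minutes(restored - detected))       # prints (1, 41)
print(hours_minutes(umm_published - restored))  # prints (13, 45)
```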

Timeline

| Date | Time (CEST) | Responsible | Description |
|---|---|---|---|
| 30 Jun 2025 | 16:19 | PRISMA internal | PRISMA internal resources identify a low number of within-day auctions, triggered by internal auction monitoring. |
| 30 Jun 2025 | 17:23 | PRISMA Emergency Guard | As part of PRISMA’s business continuity measures, the Emergency Guard posted a UMM to inform the market about issues with the auction start. |
| 30 Jun 2025 | 18:00 | PRISMA internal | After identifying the processes that caused the database performance issues and actively ending these processes, the database recovered and all remaining sessions were unblocked. |
| 30 Jun 2025 | 18:00 - 20:00 | PRISMA internal | Close monitoring of the state of within-day auctions and database performance by PRISMA engineers, to ensure that the fix is persistent. |
| 1 Jul 2025 | 07:43 | PRISMA Emergency Guard | As part of PRISMA’s business continuity measures, the Emergency Guard informed the TSO emergency contacts via email that the incident is resolved. |
| 1 Jul 2025 | 07:45 | PRISMA Emergency Guard | As part of PRISMA’s business continuity measures, the Emergency Guard updated the UMM with the information that the incident is resolved. |
| 2 Jul 2025 | 10:00 | PRISMA internal | Emergency Guard, Customer Success and involved engineers conducted a post mortem. |

Root Cause Analysis (RCA) 

Assessment: The incident was caused by expensive queries issued through the Shipper API. The existing fail-safes, limiting the query and rate limiting, were not enough to prevent the incident.

Detection: The problem was identified by PRISMA internal monitoring and alarming. 

 

Resolution & Recovery 

Intermediate resolution:  
During the incident, the expensive processes / queries were identified and manually ended.

Long-term resolution:  
Introduction of improved query handling (e.g. queueing) and improvement of the existing rate limiting for this specific endpoint. In addition, query execution for the Shipper API can be moved to a database replica.
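To illustrate what rate limiting for a single endpoint can look like, the following is a minimal token-bucket sketch. It is an assumption for illustration only and does not describe PRISMA's actual implementation; the class name `TokenBucket` and its parameters are hypothetical. The clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)   # start with a full bucket
        self.last_refill = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it must be rejected."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(float(self.capacity), self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical usage: one bucket per API client, keyed by a client identifier.
buckets: dict[str, TokenBucket] = {}


def is_allowed(client_id: str) -> bool:
    bucket = buckets.setdefault(client_id, TokenBucket(rate_per_second=1.0, capacity=5))
    return bucket.allow()
```

Keeping one bucket per client means a single heavy caller exhausts only its own budget. Rejected requests could alternatively be placed in a queue instead of being refused, which corresponds to the queueing option mentioned above.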

Restoration of service:  
The full restoration of the auction functionality was reached after the processes that caused the issues were manually ended. 

Preventive Actions:

  • Immediately rate limit the shipper calling the auction endpoint of the Shipper API to prevent further unintended usage. In addition, identify and initiate direct communication with the shipper to change how they use the Shipper API.
  • Introduction of improved query handling (e.g. queueing) and improvement of existing rate limiting for this specific endpoint. 
  • Create dedicated database views to provide real-time visibility into critical metrics such as current locks, memory usage, and disk utilisation. This will streamline future incident analysis by enabling engineers to quickly access and interpret the most relevant information, thereby reducing diagnostic time and accelerating resolution. 
  • Create an internal incident response scheme for database-level issues that defines which steps may be taken and which must be taken during incident analysis.
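Since the incident started with a database running out of storage, a simple threshold check on disk utilisation is one way to surface the problem earlier. The sketch below is an assumption for illustration: the thresholds of 80% and 90% and all function names are hypothetical, and it checks a local file system path rather than any specific database product.

```python
import shutil


def utilisation_percent(total_bytes: int, used_bytes: int) -> float:
    """Return used storage as a percentage of total storage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def storage_status(total_bytes: int, used_bytes: int,
                   warn_at: float = 80.0, critical_at: float = 90.0) -> str:
    """Classify utilisation as 'ok', 'warning' or 'critical' (thresholds assumed)."""
    percent = utilisation_percent(total_bytes, used_bytes)
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


def check_path(path: str = ".") -> str:
    """Check the file system that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return storage_status(usage.total, usage.used)


print(check_path("."))
```

A check like this would typically run on a schedule and feed the existing alarming, so that engineers are notified before storage is exhausted.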