On Monday 07.07.2025 at approximately 12:34h (CEST), two critical errors occurred during the execution of PRISMA's yearly capacity auctions, leading to the following issues:
| Auction Processing Issues | Bidding Issues |
|---|---|
| The issue occurred because two application servers processed the same task in parallel. The first server was delayed in execution, leading the platform to assume it had failed and to reassign the task to a second server. Reassigning a task to a second server in case of failure is desired behaviour as part of platform redundancy; in this case, however, the platform did not correctly detect that the first server was merely delayed, not failed. | Due to legacy code, shippers received an error message indicating they were not entitled to edit an existing bid or place a new bid. The legacy code in question incorrectly handled bid ownership checks when multiple bids of the same person, or of multiple persons in the same organisation, were involved. |
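The delay-versus-failure confusion described above is a classic pitfall of lease-based task reassignment. The sketch below (hypothetical, not PRISMA's actual code) shows the mechanism: a worker claims a task under a time-limited lease, a slow worker's lease expires and the task is legitimately reassigned, and a fencing token ensures the delayed worker's late result is rejected instead of being processed twice.

```python
# Minimal sketch of lease-based task reassignment with fencing tokens.
# All names are illustrative assumptions, not the platform's real API.

class TaskLease:
    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires_at = 0.0
        self.token = 0  # incremented on every successful claim

    def claim(self, worker, now):
        # Reassignment is only allowed once the current lease has expired,
        # which is how a long delay gets mistaken for a failure.
        if self.holder is None or now >= self.expires_at:
            self.holder = worker
            self.expires_at = now + self.lease_seconds
            self.token += 1
            return self.token
        return None

    def commit(self, token):
        # A commit carrying a stale token is rejected, so a delayed worker
        # that "wakes up" after reassignment cannot produce duplicate results.
        return token == self.token

lease = TaskLease(lease_seconds=5)
t1 = lease.claim("server-1", now=0)   # server-1 starts processing
t2 = lease.claim("server-2", now=6)   # lease expired: server-2 takes over
assert lease.commit(t1) is False      # delayed server-1 is fenced off
assert lease.commit(t2) is True       # only server-2's result is accepted
```

Without the fencing step, both servers' commits would be accepted, which matches the duplicate-results behaviour the incident describes.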
| Auction Processing Issues | Bidding Issues |
|---|---|
| The majority of TSOs and shippers participating in the yearly capacity auctions were impacted by the double processing of auction results. | 21 auctions out of a total of 1,574 published yearly auctions were impacted by the bidding issues in the second round of the auctions. |
The total duration of the incident, from detection on 07.07.2025 at 12:34h (CEST) to deployment of the fixes on PROD on 08.07.2025 at 13:07h and 16:35h (CEST), was 1 day, 4 hours and 1 minute.
Full resolution, including all activities for re-running the auctions and cleaning up data on the platform and TSO side, took an additional 2 days, 18 hours and 5 minutes, ending on 11.07.2025 at 10:40h (CEST).
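The stated durations can be verified with a short calculation (all times CEST):

```python
from datetime import datetime

detected = datetime(2025, 7, 7, 12, 34)   # incident detected
fixed    = datetime(2025, 7, 8, 16, 35)   # second fix deployed to PROD
resolved = datetime(2025, 7, 11, 10, 40)  # data clean-up completed

incident = fixed - detected
cleanup  = resolved - fixed

print(incident)  # 1 day, 4:01:00
print(cleanup)   # 2 days, 18:05:00
```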
| Date | Time (CEST) | Responsible | Action Type | Description |
|---|---|---|---|---|
| Jul 7, 2025 | 12:34 | PRISMA internal | Send MS Teams message | Customer Success was informed about a possible incident: shippers reported seeing their booking results twice. A first analysis showed no double bookings in the internal support tool, leading to the assumption that only cosmetic clean-up would be necessary after the auctions. |
| Jul 7, 2025 | 13:05 | PRISMA internal | Create ISR | Customer Success created an ISR (Incident & Service Request), formally starting the internal incident process for the auction processing issues. |
| Jul 7, 2025 | 13:13 | PRISMA Emergency Guard | Publish UMM | As part of PRISMA's business continuity measures, the Emergency Guard posted a UMM to inform the market about issues with the auction processing. |
| Jul 7, 2025 | 13:19 | PRISMA internal | Send MS Teams message | Customer Success was informed that shippers were unable to place bids in round 2 of the remaining 21 auctions. A first analysis showed this was not an isolated issue but affected all 21 auctions; as a result, it was also treated as a formal incident. |
| Jul 7, 2025 | 14:13 | PRISMA Emergency Guard | Publish UMM | As part of PRISMA's business continuity measures, the Emergency Guard posted a UMM to inform the market about issues with bidding in round 2. |
| Jul 7, 2025 | 15:25 | PRISMA Emergency Guard | Align internally | The Emergency Guard aligned internally with PRISMA's management about the next steps regarding cancellation or continuation of the auctions. Decision: to prevent market distortion, PRISMA recommends cancellation of the remaining 21 auctions. |
| Jul 7, 2025 | 16:19 | PRISMA Emergency Guard | Send Email | The Emergency Guard sent an email to the TSO emergency contacts summarising the call and the decision taken. |
| Jul 7, 2025 | 16:28 | PRISMA Emergency Guard | Publish UMM | As part of PRISMA's business continuity measures, the Emergency Guard posted a UMM to inform the market about the cancellation of the auctions. |
| Jul 8, 2025 | 10:07 | Customer Success | Dismiss UMM | Customer Success (in alignment with the Emergency Guard) dismissed the UMM regarding the bidding issues, since the auctions had been cancelled. |
| Jul 8, 2025 | 13:07 | PRISMA internal | Deploy fix | After review and testing, the fix for the bidding issues was deployed to the production system. |
| Jul 8, 2025 | 13:46 | PRISMA Emergency Guard | Update UMM | The Emergency Guard updated the existing UMM about the auction processing issues to reflect the new marketing time-frame for the re-run of the auctions. |
| Jul 8, 2025 | 14:04 | PRISMA Emergency Guard | Update UMM | The Emergency Guard updated the existing UMM about the auction processing issues to also include the cases of missing booking confirmations. |
| Jul 8, 2025 | 16:28 | Customer Success | Update UMM | Customer Success updated the existing UMM about the auction processing issues to reflect the new auction publishing time for the auction re-run. |
| Jul 8, 2025 | 16:35 | PRISMA internal | Deploy fix | After review and testing, the fix for the auction processing was deployed to the production system. The fix had been available since 13:49h, but to avoid interference with the day-ahead auctions, deployment was scheduled to take place afterwards. |
| Jul 11, 2025 | 10:40 | PRISMA | Execute steps for data clean-up | The necessary steps for data clean-up were executed successfully. |
| Auction Processing Issues | Bidding Issues |
|---|---|
| Cause: The application server processing the auction evaluation lost its connection to the database. A second application server started processing the same auctions as part of the redundancy implementation in the platform infrastructure. Assessment: The incident was caused by unanticipated high load. Detection: The problem was identified through a user report; shippers contacted PRISMA's Customer Success after noticing duplicate results. | Cause: A piece of legacy code in the backend returned incorrect permissions to the frontend. Shippers with more than one bid per company were unable to edit bids in the second bidding round if the bids had been placed manually rather than via a bidding plan. Assessment: The incident was caused by an isolated bug. Detection: The problem was identified through a user report; shippers contacted PRISMA's Customer Success after experiencing issues with placing bids. |
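To illustrate the class of bug described for the bidding issues, the sketch below (purely hypothetical; the real legacy code differs) contrasts a permission check that implicitly assumes one bid per user with a corrected check based on organisation membership, which keeps working when several bids from the same organisation exist:

```python
# Illustrative data: two manually placed bids from the same organisation.
bids = [
    {"id": 1, "auction": "A1", "user": "alice", "org": "ShipperCo"},
    {"id": 2, "auction": "A1", "user": "bob",   "org": "ShipperCo"},
]

def may_edit_buggy(bid, user, org):
    # Buggy: ties edit rights to the individual user who placed the bid,
    # so colleagues in the same organisation are wrongly denied.
    return bid["user"] == user

def may_edit_fixed(bid, user, org):
    # Fixed: any user of the organisation that owns the bid may edit it,
    # regardless of how many bids the organisation has placed.
    return bid["org"] == org

assert may_edit_buggy(bids[1], "alice", "ShipperCo") is False  # wrongly denied
assert may_edit_fixed(bids[1], "alice", "ShipperCo") is True   # correctly allowed
```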
| Auction Processing Issues | Bidding Issues |
|---|---|
| Intermediate resolution: Enlargement of the database connection pool per application server, deployed as of 08.07.2025, 13:07h (CEST). During the run of the yearly interruptible auctions two weeks later, PRISMA closely monitored the platform systems for possible DoS attacks; no signs of attempted attacks were found. Long-term resolution: Segmentation of the application servers to improve task allocation and system stability. One set of servers is dedicated to scheduled activities, such as auction processing and report generation; a different set is dedicated to processing requests originating from (public) endpoints of the platform. This allows PRISMA to reduce the connection pool size per server back to the original number of 100. Restoration of service: Following the intermediate fix, the necessary actions for data clean-up (invalidation of double bookings) were aligned and executed with the TSOs. | Intermediate resolution: A targeted fix enhanced the query logic in the backend so that it correctly handles and accepts multiple bids, deployed as of 08.07.2025, 16:35h (CEST). Long-term resolution: Refactoring the legacy platform into a modern, independent service-based architecture will greatly reduce the likelihood of such issues. This transformation project is already in progress and will be driven forward with the highest priority. Restoration of service: Following the intermediate fix, all cancelled auctions were successfully republished using a custom calendar. On 09.07.2025 at 09:00h (CEST) the affected auctions were successfully re-run, and at 13:45h (CEST) the incident was declared closed. |
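The intermediate resolution for the processing issues, enlarging the connection pool, addresses exhaustion under load. The toy sketch below (stdlib only; real deployments would configure their database driver's pool instead) shows why an undersized pool fails under concurrent demand and how released connections become reusable:

```python
import queue

class ConnectionPool:
    """Toy bounded connection pool; connections are plain placeholder strings."""

    def __init__(self, size):
        self._pool = queue.Queue(maxsize=size)
        for i in range(size):
            self._pool.put(f"conn-{i}")

    def acquire(self, timeout=0.01):
        try:
            return self._pool.get(timeout=timeout)
        except queue.Empty:
            # This is the failure mode that unanticipated high load triggers.
            raise RuntimeError("pool exhausted")

    def release(self, conn):
        self._pool.put(conn)

small = ConnectionPool(size=2)
a, b = small.acquire(), small.acquire()
try:
    small.acquire()              # a third caller under load
except RuntimeError as e:
    print(e)                     # pool exhausted
small.release(a)
assert small.acquire() == a      # released connections are available again
```

Enlarging the pool raises the number of concurrent callers served before exhaustion, while the long-term server segmentation reduces how many callers compete for the same pool in the first place.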
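A complementary safeguard against the double bookings that had to be invalidated during clean-up is to make result processing idempotent, so a reassigned task that runs twice publishes only once. A minimal sketch, with all names (`process_auction`, the auction IDs) purely illustrative:

```python
# In production this guard would typically be a unique database constraint
# on the auction ID, not an in-memory set.
processed = set()

def process_auction(auction_id, published):
    if auction_id in processed:
        return "skipped"       # a second server picking up the task is a no-op
    processed.add(auction_id)
    published.append(auction_id)
    return "processed"

published = []
assert process_auction("Y-2025-001", published) == "processed"
assert process_auction("Y-2025-001", published) == "skipped"  # retry is safe
assert published == ["Y-2025-001"]
```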
Preventive Actions: