During the night of 12 June, an incident affected the travelcom IBE API services and limited their availability.
Based on the current analysis, the incident originated from an unstable container state within a single service. For such scenarios, an automated recovery mechanism is in place: when a container is reported as unhealthy or unstable, it is removed from operation and replaced by a new instance. This process is part of the regular operating model and is designed to handle individual technical fault states automatically, without manual intervention and without noticeable impact on ongoing operations.
The automated replacement process was triggered as expected. In this specific case, however, the new instance could not be started because the designated container image contained an inconsistency. This image is normally used both to rebuild containers and to scale additional capacity. Since no functional new instance of the affected service could be created from this image, the automated recovery process could not complete.
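The failure mode described above can be illustrated with a minimal sketch. It is not the actual orchestration logic; the names `replace_unhealthy`, `start_from_image`, and `StartError` are hypothetical and chosen only for illustration. The sketch assumes that every replacement is started from the same image, so an image that cannot start leaves the removed container without a successor.

```python
from typing import Callable, Dict, List


class StartError(Exception):
    """Raised by the hypothetical start function when an instance cannot start."""


def replace_unhealthy(
    containers: Dict[str, bool],          # container id -> is healthy
    start_from_image: Callable[[], str],  # returns the id of a new healthy instance
) -> List[str]:
    """Remove unhealthy containers and start one replacement for each.

    Returns the ids of containers that were removed but could not be replaced.
    """
    unreplaced: List[str] = []
    unhealthy = [cid for cid, healthy in containers.items() if not healthy]
    for cid in unhealthy:
        del containers[cid]  # take the faulty instance out of operation
        try:
            new_id = start_from_image()
        except StartError:
            # The image itself is faulty: recovery cannot complete.
            unreplaced.append(cid)
            continue
        containers[new_id] = True
    return unreplaced
```

With a working image the function returns an empty list and capacity is unchanged. With a faulty image every removed container ends up in the returned list, which corresponds to the reduced availability seen in this incident.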
The incident was detected by automated health checks. A critical alarm was not triggered immediately, as the automated recovery process had started as intended. The incident was escalated as critical only after repeated attempts to rebuild the service had failed.
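The escalation behaviour described above can be sketched as a counter of consecutive failed recovery attempts. This is an illustrative sketch only: the class name `RecoveryTracker`, the severity labels, and the threshold of three attempts are assumptions and do not reflect the configured values.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTracker:
    # Assumed threshold; the real value is not stated in this report.
    max_failed_attempts: int = 3
    failed_attempts: int = 0

    def record_attempt(self, succeeded: bool) -> str:
        """Record one recovery attempt and return the resulting severity."""
        if succeeded:
            self.failed_attempts = 0  # recovery worked, reset the counter
            return "ok"
        self.failed_attempts += 1
        if self.failed_attempts >= self.max_failed_attempts:
            return "critical"  # repeated failures: escalate
        return "warning"       # recovery still in progress
```

Lowering `max_failed_attempts` raises the critical alarm sooner, at the cost of more alarms for faults that would have resolved on the next attempt.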
To restore the service, a new functional image was built and deployed. Restoring the affected service took approximately 70 minutes from the time of escalation.
As an immediate follow-up measure, the existing image policy is being reviewed and extended. The objective is to introduce additional validation and safeguarding mechanisms to ensure that faulty or inconsistent images can no longer be used as the basis for automated recovery or scaling processes. In addition, the alerting logic is being reviewed to determine whether repeated failed recovery attempts should be classified as a critical condition at an earlier stage.
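One possible form of such a safeguard is a promotion gate: a new image becomes the basis for recovery and scaling only after it passes a start-up check, and the last approved image stays in use otherwise. The sketch below is an assumption about how this could look, not the planned implementation; `ImagePolicy`, `promote`, and `smoke_test` are hypothetical names, and the image tags in the usage note are placeholders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ImagePolicy:
    approved_image: str  # image currently used for recovery and scaling

    def promote(self, candidate: str, smoke_test: Callable[[str], bool]) -> bool:
        """Approve the candidate only if the smoke test passes.

        On failure, or if the test itself raises, the previously
        approved image remains in use.
        """
        try:
            passed = bool(smoke_test(candidate))
        except Exception:
            passed = False
        if passed:
            self.approved_image = candidate
        return passed
```

For example, `ImagePolicy("service:1").promote("service:2", check)` leaves `approved_image` at `"service:1"` whenever `check` returns `False` or raises, so automated recovery never picks up an image that has not started successfully at least once.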