Kubernetes failed storage upgrade

Resolved
Partial outage
Started 3 months ago
Lasted about 18 hours

Affected

UT HPC webservices
registry.hpc.ut.ee
Updates
  • Resolved
    With help from the storage system developers, Kubernetes is fully functional again. All workloads are healthy, no data was lost, and nothing had to be restored from backups. The root cause puzzled even the developers, and we will keep working with them to make sure it does not happen again.
  • Identified
    Due to a failed storage upgrade, the Kubernetes cluster is in a severely degraded state. Workloads that are already running mostly continue to work, but changes cannot currently be made. The registry, which is hosted on Kubernetes, is one of the stateful applications that is not working properly. We are working on a fix.