UTHPC - Kubernetes infrastructure issues – Incident details

Kubernetes infrastructure issues

Resolved
Major outage
Started about 1 year ago
Lasted about 1 hour

Affected

minu.etais.ee

Major outage from 7:54 PM to 8:45 PM

UT HPC webservices

Major outage from 7:54 PM to 8:45 PM

hpc.ut.ee

Major outage from 7:54 PM to 8:45 PM

docs.hpc.ut.ee

Major outage from 7:54 PM to 8:45 PM

Waldur portals

Major outage from 7:54 PM to 8:45 PM

puhuri-portal.neic.no

Major outage from 7:54 PM to 8:45 PM

Updates
  • Resolved

    The issues have now been resolved. All workloads and storage have healed, with no damage to data resiliency.

  • Monitoring

    The issue has been found and remedied. The workloads are recovering.

  • Investigating

    There is a problem with Kubernetes that is also affecting all dependent services. We are working to find a resolution.