UTHPC - Unavailable cluster nodes and job queue delays – Incident details

Unavailable cluster nodes and job queue delays

Resolved
Degraded performance
Started 8 days ago
Lasted 7 days

Affected

rocket.hpc.ut.ee

Degraded performance from 10:11 AM to 6:35 AM

Updates
  • Resolved
    This incident has been resolved.
  • Monitoring
    A workaround is currently being tested and deployed.

  • Update
    The issue has been traced to a recent bug in the Red Hat kernel. Our team is working on a workaround.

  • Identified
    We are experiencing severe delays in the job queue, and a significant number of nodes are down. The technical team is actively working to resolve this incident.