UTHPC - Network issue causing timeouts in HPC and Cloud – Incident details

Resolved
Partial outage
Started 8 months ago
Lasted about 2 hours

Affected

rocket.hpc.ut.ee

Partial outage from 2:55 PM to 3:38 PM, Operational from 3:38 PM to 5:20 PM

UT HPC webservices

Partial outage from 2:55 PM to 3:38 PM, Operational from 3:38 PM to 5:20 PM

support.hpc.ut.ee

Partial outage from 2:55 PM to 3:38 PM, Operational from 3:38 PM to 5:20 PM

Services

Partial outage from 2:55 PM to 3:38 PM, Operational from 3:38 PM to 5:20 PM

Jupyter

Partial outage from 2:55 PM to 3:38 PM, Operational from 3:38 PM to 5:20 PM

Galaxy

Partial outage from 2:55 PM to 3:38 PM, Operational from 3:38 PM to 5:20 PM

Updates
  • Resolved
    Marking this incident as resolved.

  • Identified
    A hardware issue has been found with one of the InfiniBand switches, and systems seem to be returning to normal now. There might be some slowness while operations that timed out earlier work their way through. We will continue to monitor the situation.
  • Investigating
    There's an InfiniBand network issue causing distributed filesystem operations to stall, which in turn is causing cloud VMs and HPC services not to respond properly.