Network issue causing timeouts in HPC and Cloud

Resolved
Partial outage
Started 12 days ago
Lasted about 2 hours

Affected

rocket.hpc.ut.ee
UT HPC webservices
support.hpc.ut.ee
Services
Jupyter
Galaxy
Updates
  • Resolved
    Marking this incident as resolved.

  • Identified
    A hardware issue has been found with one of the InfiniBand switches. Systems seem to be returning to normal now, though there may be some residual slowness while earlier timeouts play themselves out. We will continue to monitor the situation.
  • Investigating
    An InfiniBand network issue is causing distributed filesystem operations to stall, which in turn is causing cloud VMs and HPC systems to stop responding properly.