Troubleshooting Hardware Errors

This guide explains how to identify, understand, and resolve two of the most common low-level hardware error types affecting virtual machines (VMs) in Hyperstack: ECC errors and XID errors. These issues typically indicate memory or GPU-level faults and may result in VM instability, crashes, or degraded performance.


ECC Errors

ECC (Error-Correcting Code) memory is designed to detect and correct single-bit errors in RAM, and to detect multi-bit errors that it cannot correct. Hyperstack environments that use ECC memory log these errors when they occur.

Types of ECC Errors

  • Correctable ECC Errors:
    • These are single-bit memory faults that are automatically corrected by the system.
    • They don't immediately affect VM performance but may indicate hardware degradation if frequent.
  • Uncorrectable ECC Errors:
    • Multi-bit memory faults that cannot be corrected.
    • These may lead to VM crashes, corruption of data, or unexpected behavior.

Symptoms

  • VM entering an ERROR or SHUTOFF state unexpectedly.
  • Data corruption or application-level faults within the VM.

Recommended Actions

  • For correctable errors: Monitor their frequency. If it is increasing, consider replacing the physical host or migrating workloads.
  • For uncorrectable errors: Contact Hyperstack Support. The host system may require maintenance or replacement.
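
As a rough illustration of "monitor frequency", the sketch below reads ECC counters and maps them to the actions above. It assumes the counters of interest are the GPU memory counters exposed by `nvidia-smi`; host RAM counters may not be visible from inside a VM. The function names (`parse_ecc_csv`, `recommend`, `read_ecc_counters`) and the returned labels are hypothetical and only serve the example.

```python
import csv
import io
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`. "volatile"
# counters cover the period since the driver was last loaded.
QUERY_FIELDS = [
    "index",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
]


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        index, corrected, uncorrected = (field.strip() for field in record)
        rows.append({
            "gpu": int(index),
            # Non-numeric values (for example "[N/A]") mean ECC is
            # disabled or unsupported on that GPU.
            "corrected": int(corrected) if corrected.isdigit() else None,
            "uncorrected": int(uncorrected) if uncorrected.isdigit() else None,
        })
    return rows


def recommend(row, previous_corrected=None):
    """Map one GPU's counters to an action (labels are illustrative)."""
    if row["uncorrected"]:
        return "contact support"
    if row["corrected"] is None:
        return "ecc unavailable"
    if previous_corrected is not None and row["corrected"] > previous_corrected:
        return "monitor: corrected errors increasing"
    return "ok"


def read_ecc_counters():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_csv(result.stdout)
```

Calling `read_ecc_counters()` periodically and passing the previous corrected count to `recommend` gives a simple trend check; how often to sample and what rate counts as "increasing" are left open here.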

XID Errors

XID errors are GPU-related error codes that the NVIDIA driver reports in the system's kernel log. They can indicate a wide range of issues, including ECC memory faults within the GPU, kernel panics, or GPU process faults.

Common XID Errors

| XID Code | Description | Impact |
|---|---|---|
| 31 | MMU fault (memory management issue) | Possible kernel crash or task failure |
| 43 | Reset channel verification error | GPU driver/hardware issue |
| 48 | Double-bit ECC error on GPU memory | GPU unresponsive; VM may crash |
| 63 | Hardware error detected during GPU operation | Application or driver crash |
| 79 | GPU has fallen off the bus | Loss of GPU; requires reboot |
| 94/95 | Contained/Uncontained ECC errors | GPU instability or uncorrectable error |
| 119/120 | GSP (GPU System Processor) RPC timeout or failure | GPU communication issues |
| 140 | Unrecoverable ECC error | Immediate action required |

Refer to the official NVIDIA documentation for a complete list of XID error codes.

Symptoms

  • Training or inference processes fail repeatedly.
  • VM becomes unresponsive or GPU is no longer detected by the OS.

Recommended Actions

  • Restart the VM to reset GPU state.
  • Check GPU status with nvidia-smi, and review dmesg output or driver logs for XID messages.
  • Disable GSP if necessary (refer to NVIDIA vGPU Guide).
  • Contact Hyperstack Support with logs and timestamps if the issue persists.
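
When collecting logs and timestamps for a support request, it can help to pull the XID lines out of the kernel log first. The sketch below is one possible approach. It assumes XID messages look like `NVRM: Xid (PCI:0000:3b:00): 79, ...`, which varies between driver versions. The names `XID_LINE`, `find_xid_events`, and `read_kernel_log` are hypothetical, and the descriptions are copied from the table in this guide, not from NVIDIA's full list.

```python
import re
import subprocess
from collections import Counter

# Descriptions taken from the "Common XID Errors" table in this guide.
XID_DESCRIPTIONS = {
    31: "MMU fault (memory management issue)",
    43: "Reset channel verification error",
    48: "Double-bit ECC error on GPU memory",
    63: "Hardware error detected during GPU operation",
    79: "GPU has fallen off the bus",
    94: "Contained ECC error",
    95: "Uncontained ECC error",
    119: "GSP RPC timeout or failure",
    120: "GSP RPC timeout or failure",
    140: "Unrecoverable ECC error",
}

# Assumed message shape: "NVRM: Xid (PCI:<bus id>): <code>, ...".
# The "PCI:" prefix is treated as optional.
XID_LINE = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<device>[0-9A-Fa-f:.]+)\): (?P<code>\d+)"
)


def find_xid_events(log_text):
    """Return (device, code, description) for each XID line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_LINE.search(line)
        if match is None:
            continue
        code = int(match.group("code"))
        description = XID_DESCRIPTIONS.get(code, "not listed in this guide")
        events.append((match.group("device"), code, description))
    return events


def count_by_code(events):
    """Count how many times each XID code appears."""
    return Counter(code for _, code, _ in events)


def read_kernel_log():
    """Return dmesg output; may need elevated privileges on some systems."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, `count_by_code(find_xid_events(read_kernel_log()))` gives a per-code tally that can be attached to a support request alongside the raw log lines.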