Troubleshooting Hardware Errors
This guide explains how to identify, understand, and resolve two of the most common low-level hardware error types affecting virtual machines (VMs) in Hyperstack: ECC errors and XID errors. These issues typically indicate memory or GPU-level faults and may result in VM instability, crashes, or degraded performance.
ECC Errors
ECC (Error-Correcting Code) memory detects and corrects single-bit errors in RAM, and detects multi-bit errors that it cannot correct. Hyperstack environments that use ECC memory log these events when they occur.
Types of ECC Errors
- Correctable ECC Errors:
  - These are single-bit memory faults that are automatically corrected by the system.
  - They don't immediately affect VM performance but may indicate hardware degradation if frequent.
- Uncorrectable ECC Errors:
  - These are multi-bit memory faults that cannot be corrected.
  - They may lead to VM crashes, data corruption, or unexpected behavior.
Symptoms
- VM entering an `ERROR` or `SHUTOFF` state unexpectedly.
- Data corruption or application-level faults within the VM.
Recommended Actions
- For correctable errors: Monitor frequency. If increasing, consider replacing the physical host or migrating workloads.
- For uncorrectable errors: Contact Hyperstack Support. The host system may require maintenance or replacement.
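As one way to monitor error frequency, the following minimal sketch parses ECC counters from `nvidia-smi`. It rests on several assumptions: it covers GPU memory only (host RAM ECC counters are not visible through this tool), it assumes the NVIDIA driver is installed in the VM, and the function names and the `corrected_threshold` value are hypothetical choices for illustration, not Hyperstack or NVIDIA limits.

```python
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu`.
QUERY = "index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total"


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader,nounits` output into per-GPU counters.

    Non-numeric values (for example "[N/A]" when ECC is unsupported or
    disabled) are stored as None.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, corrected, uncorrected = [f.strip() for f in line.split(",")]
        rows.append({
            "gpu": int(index),
            "corrected": int(corrected) if corrected.isdigit() else None,
            "uncorrected": int(uncorrected) if uncorrected.isdigit() else None,
        })
    return rows


def read_ecc_counters():
    """Run nvidia-smi and return the parsed counters (requires the driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_csv(result.stdout)


def needs_attention(row, corrected_threshold=10):
    """Classify one GPU row. The threshold is an arbitrary illustrative value."""
    if row["uncorrected"]:
        return "uncorrectable"
    if row["corrected"] is not None and row["corrected"] >= corrected_threshold:
        return "frequent-correctable"
    return None
```

Calling `read_ecc_counters()` at regular intervals and comparing the results over time shows whether correctable errors are increasing. Any row that `needs_attention` labels `"uncorrectable"` corresponds to the case where contacting Hyperstack Support is recommended.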
XID Errors
XID errors are GPU-related error codes that the NVIDIA driver reports in the operating system's kernel log. They can indicate a wide range of issues, including ECC memory faults within the GPU, kernel panics, or GPU process faults.
Common XID Errors
XID Code | Description | Impact |
---|---|---|
31 | MMU fault (memory management issue) | Possible kernel crash or task failure |
43 | Reset channel verification error | GPU driver/hardware issue |
48 | Double-bit ECC error on GPU memory | GPU unresponsive; VM may crash |
63 | Hardware error detected during GPU operation | Application or driver crash |
79 | GPU has fallen off the bus | Loss of GPU; requires reboot |
94/95 | Contained/Uncontained ECC errors | GPU instability or uncorrectable error |
119/120 | GSP (GPU System Processor) RPC timeout or failure | GPU communication issues |
140 | Unrecoverable ECC error | Immediate action required |
Refer to the official NVIDIA documentation for a complete list of XID error codes.
Symptoms
- Training or inference processes fail repeatedly.
- VM becomes unresponsive or GPU is no longer detected by the OS.
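To check the "GPU is no longer detected" symptom, a small sketch like the one below can compare the GPUs listed by `nvidia-smi -L` against the number the VM is expected to have. The function names and the `expected_count` parameter are hypothetical; the expected number depends on the VM flavor and is not something this sketch can discover on its own.

```python
import re

# `nvidia-smi -L` prints one line per detected GPU, for example:
#   GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-00000000-0000-0000-0000-000000000000)
GPU_LINE = re.compile(r"^GPU (?P<index>\d+): (?P<name>.+?) \(UUID: (?P<uuid>[^)]+)\)")


def parse_gpu_list(text):
    """Return one dict per GPU line found in `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append({
                "index": int(match.group("index")),
                "name": match.group("name"),
                "uuid": match.group("uuid"),
            })
    return gpus


def gpu_shortfall(text, expected_count):
    """Number of GPUs that were expected but are not listed (never negative)."""
    return max(0, expected_count - len(parse_gpu_list(text)))
```

The text passed to these functions would typically be the standard output of `nvidia-smi -L`. A shortfall greater than zero suggests that a GPU is no longer visible to the OS.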
Recommended Actions
- Restart the VM to reset GPU state.
- Check system logs using `nvidia-smi`, `dmesg`, or driver logs.
- Disable GSP if necessary (refer to NVIDIA vGPU Guide).
- Contact Hyperstack Support with logs and timestamps if the issue persists.
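The log check in the actions above can be sketched as a short script that scans kernel log text for XID lines and labels them with the descriptions from the table in this guide. This is an illustrative sketch, not an official tool: the function name is hypothetical, and the regular expression assumes the common `NVRM: Xid (PCI:...): <code>, ...` line format, which may vary between driver versions.

```python
import re

# Descriptions taken from the table in this guide; other codes are not labelled.
XID_DESCRIPTIONS = {
    31: "MMU fault (memory management issue)",
    43: "Reset channel verification error",
    48: "Double-bit ECC error on GPU memory",
    63: "Hardware error detected during GPU operation",
    79: "GPU has fallen off the bus",
    94: "Contained ECC error",
    95: "Uncontained ECC error",
    119: "GSP RPC timeout or failure",
    120: "GSP RPC timeout or failure",
    140: "Unrecoverable ECC error",
}

# Assumed line shape, for example:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_LINE = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_text):
    """Return one dict per XID line found in the given kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_LINE.search(line)
        if not match:
            continue
        code = int(match.group("code"))
        events.append({
            "device": match.group("device"),
            "code": code,
            "description": XID_DESCRIPTIONS.get(code, "not listed in this guide"),
            "line": line.strip(),  # keep the raw line, including any timestamp
        })
    return events
```

The input can be the output of `dmesg` (which may require root privileges) saved to a file or captured from a subprocess. The raw `line` values, which include kernel timestamps when present, are the kind of detail worth attaching when contacting Hyperstack Support.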