Troubleshooting Hardware Errors

This guide explains how to identify, understand, and resolve two of the most common low-level hardware error types affecting virtual machines (VMs) in Hyperstack: ECC errors and XID errors. These issues typically indicate memory or GPU-level faults and may result in VM instability, crashes, or degraded performance.

ECC Errors

ECC (Error-Correcting Code) memory detects and corrects single-bit errors, and detects (but cannot correct) multi-bit errors. Hyperstack environments that use ECC memory log these events when they occur.

Types of ECC Errors

Correctable ECC Errors:
- These are single-bit memory faults that are automatically corrected by the system.
- They don't immediately affect VM performance but may indicate hardware degradation if frequent.
Uncorrectable ECC Errors:
- Multi-bit memory faults that cannot be corrected.
- These may lead to VM crashes, corruption of data, or unexpected behavior.

Symptoms

- The VM enters an ERROR or SHUTOFF state unexpectedly.
- Data corruption or application-level faults occur within the VM.

Recommended Actions

- For correctable errors: Monitor how often they occur. If the rate is increasing, consider migrating workloads or having the physical host replaced.
- For uncorrectable errors: Contact Hyperstack Support. The host system may require maintenance or replacement.
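
For GPU memory, one way to follow the "monitor frequency" advice is to poll the ECC counters that nvidia-smi exposes and compare successive readings. The sketch below is illustrative only and is not Hyperstack tooling: it assumes the ecc.errors.corrected.volatile.total and ecc.errors.uncorrected.volatile.total query fields are available with your driver (nvidia-smi --help-query-gpu lists the supported fields), and the function names and verdict labels are hypothetical. It covers GPU memory only, not host RAM.

```python
import csv
import io
import subprocess

# Query fields assumed to be supported by the installed driver.
QUERY_FIELDS = [
    "index",
    "ecc.errors.corrected.volatile.total",
    "ecc.errors.uncorrected.volatile.total",
]


def read_ecc_csv():
    """Run nvidia-smi and return its raw CSV output (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_ecc_csv(text):
    """Map GPU index -> (corrected, uncorrected); None where ECC is not reported."""
    def to_int(value):
        # Non-numeric values such as "[N/A]" mean the counter is unavailable.
        return int(value) if value.isdigit() else None

    counts = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, corrected, uncorrected = (cell.strip() for cell in row)
        counts[int(index)] = (to_int(corrected), to_int(uncorrected))
    return counts


def assess(previous, current):
    """Compare two readings and return a hypothetical verdict per GPU."""
    verdicts = {}
    for gpu, (corrected, uncorrected) in current.items():
        if corrected is None or uncorrected is None:
            verdicts[gpu] = "unsupported"
        elif uncorrected > 0:
            verdicts[gpu] = "uncorrected"        # contact Support
        else:
            prev_corrected = (previous.get(gpu) or (0, 0))[0] or 0
            rising = corrected > prev_corrected
            verdicts[gpu] = "corrected-rising" if rising else "ok"
    return verdicts
```

A periodic job could call assess(previous, parse_ecc_csv(read_ecc_csv())) and keep the latest reading for the next comparison. Volatile counters restart from zero when the driver is reloaded, so a drop between readings is not by itself a sign of recovery.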

XID Errors

XID errors are error codes that the NVIDIA driver reports to the operating system's kernel log. They can indicate a wide range of issues, including ECC memory faults within the GPU, kernel panics, or GPU process faults.

Common XID Errors

XID Code	Description	Impact
31	MMU fault (memory management issue)	Possible kernel crash or task failure
43	Reset channel verification error	GPU driver/hardware issue
48	Double-bit ECC error on GPU memory	GPU unresponsive; VM may crash
63	ECC page retirement or row remapping recording event	Faulty GPU memory marked for retirement or remapping; a GPU reset may be needed
79	GPU has fallen off the bus	Loss of GPU; requires reboot
94/95	Contained/Uncontained ECC errors	GPU instability or uncorrectable error
119/120	GSP (GPU System Processor) RPC timeout or failure	GPU communication issues
140	Unrecoverable ECC error	Immediate action required

Refer to the official NVIDIA XID Error Documentation for a complete list of XID error codes.

Symptoms

- Training or inference processes fail repeatedly.
- The VM becomes unresponsive, or the GPU is no longer detected by the OS.

Recommended Actions

- Restart the VM to reset the GPU state.
- Check the GPU state with nvidia-smi, and review dmesg output and the driver logs for XID messages.
- Disable GSP if necessary (refer to the NVIDIA vGPU Guide).
- Contact Hyperstack Support with logs and timestamps if the issue persists.
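
When gathering logs and timestamps for Support, it can help to extract only the XID lines from the kernel log. The following is a minimal sketch under stated assumptions: it expects kernel log lines shaped like "[timestamp] NVRM: Xid (PCI:...): code, details", which is how the NVIDIA driver commonly reports XIDs but may vary between driver versions. The names XidEvent and find_xid_events are hypothetical, and the sample log lines are made up for illustration.

```python
import re
from typing import List, NamedTuple

# Assumed line shape: "[<timestamp>] NVRM: Xid (<PCI id>): <code>, <details>"
XID_LINE = re.compile(
    r"\[\s*(?P<ts>[^\]]+)\]\s*NVRM: Xid \((?P<pci>[^)]+)\): "
    r"(?P<code>\d+)(?:,\s*(?P<detail>.*))?"
)


class XidEvent(NamedTuple):
    timestamp: str  # as printed by dmesg (seconds since boot, or a date with -T)
    pci: str        # PCI address of the GPU that raised the error
    code: int       # XID code, e.g. 79
    detail: str     # remainder of the message, if any


def find_xid_events(log_text: str) -> List[XidEvent]:
    """Return every XID message found in the given kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            events.append(
                XidEvent(
                    timestamp=match["ts"].strip(),
                    pci=match["pci"],
                    code=int(match["code"]),
                    detail=(match["detail"] or "").strip(),
                )
            )
    return events


# Made-up sample lines, for illustration only.
SAMPLE_LOG = """\
[  101.000001] usb 1-1: new high-speed USB device number 2
[ 5231.482913] NVRM: Xid (PCI:0000:3b:00): 79, pid=4121, GPU has fallen off the bus.
[ 5240.100200] NVRM: Xid (PCI:0000:3b:00): 48, pid=4121, DBE (double bit error) ECC error
"""

if __name__ == "__main__":
    for event in find_xid_events(SAMPLE_LOG):
        print(f"{event.timestamp}  {event.pci}  XID {event.code}  {event.detail}")
```

In practice the input would be the saved output of dmesg (reading it may require elevated privileges) rather than the sample string, and the extracted codes can be looked up in the table of common XID errors.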
