Hardware Events

Overview

HUATUO monitors Linux kernel hardware error events with zero instrumentation overhead and minimal runtime cost. Structured fault records are persisted to storage and exposed as Prometheus counters for use by alerting and visualization systems.

Use Cases

General-Purpose Computing

In large-scale server clusters, memory ECC correctable errors (CE) are common low-severity fault signals. A single CE is automatically corrected by hardware. If the CE rate on a given DIMM rises persistently, however, it indicates impending memory failure. HUATUO detects such events in real time via EDAC/MCE tracepoints, enabling operations teams to perform preventive replacements before complete memory failure and unplanned downtime occur.
AI Computing

AI training workloads require high hardware reliability. A single faulty PCIe device can cause an entire training job to fail. HUATUO supports PCIe AER event monitoring and reports link-layer errors on GPUs, NVLink bridges, and RDMA NICs (such as InfiniBand HCAs) — including Data Link Protocol Errors and ECRC Errors — in real time. This data provides hardware health status to AI cluster schedulers, supporting rapid fault node isolation and workload migration.
Storage Services

Storage servers typically host large numbers of PCIe NVMe SSDs and HBA cards. PCIe AER errors such as Completion Timeout and Malformed TLP are early indicators of storage device performance degradation or drive dropout. HUATUO monitoring data can be correlated with storage I/O latency metrics to support root cause analysis.
Security and Compliance

Industries with strict compliance requirements — such as finance and government — must maintain a complete history of all hardware faults. Structured event records (including timestamps, device identifiers, error types, and raw register values) can serve directly as compliance evidence for hardware health logs.

How It Works

HUATUO observes the kernel’s MCE, EDAC, ACPI GHES, and PCIe AER subsystems via eBPF. When an eBPF tracepoint fires, the raw event is written to a BPF Perf Event Buffer. A user-space process reads the event, parses the struct fields, generates a structured record, and persists it locally or to a remote store. The overall architecture is shown below:

RAS Architecture

The Linux kernel’s RAS framework consists of several loosely coupled subsystems. Together, they cover the full hardware fault spectrum — from CPU internal errors to PCIe link errors.

graph TB
    subgraph HW["Hardware Layer"]
        CPU["CPU\nx86 / x86-64"]
        MEM["Memory\nDDR4/DDR5 DIMM ECC"]
        Platform["Platform Hardware\nSoC / PCH"]
        PCIeDev["PCIe Devices\nGPU / NVMe / HCA / FPGA"]
    end

    subgraph FW["Firmware Layer"]
        BIOS["BIOS / UEFI\nCPER Buffer (APEI)"]
    end

    subgraph Kernel["Linux Kernel RAS Subsystems"]
        MCE["MCE Subsystem\narch/x86/kernel/cpu/mce"]
        EDAC["EDAC Subsystem\ndrivers/edac"]
        GHES["ACPI GHES Subsystem\ndrivers/acpi/apei"]
        AER["PCIe AER Subsystem\ndrivers/pci/pcie/aer"]
    end

    subgraph TP["Kernel Tracepoints"]
        TP1["tracepoint/mce/mce_record"]
        TP2["tracepoint/ras/mc_event"]
        TP3["tracepoint/ras/non_standard_event"]
        TP4["tracepoint/ras/aer_event"]
    end

    CPU -->|"MCE Exception (#MC) + THR Interrupt"| MCE
    MEM -->|ECC Error| EDAC
    Platform -->|APEI Error Record| BIOS
    BIOS -->|CPER Buffer| GHES
    PCIeDev -->|AER Interrupt| AER

    MCE --> TP1
    EDAC --> TP2
    GHES --> TP3
    AER --> TP4

MCE

MCE (Machine Check Architecture) is a hardware fault-tolerance mechanism built into the processor, defined by Intel and AMD in their respective architecture specifications. The processor contains a set of Machine Check Banks, each corresponding to a class of hardware resource (e.g., L1 cache, L2 cache, memory controller, TLB). When a hardware error is detected, the MSRs of the corresponding bank (MCi_STATUS, MCi_ADDR, MCi_MISC) are populated with error information, and an MCE exception is raised.
MCE THR

MCE supports a threshold interrupt mechanism. When the count of a given class of correctable errors exceeds a configured threshold, a dedicated APIC interrupt (THR) is triggered instead of escalating to a full MCE exception. This allows the operating system to issue an early alert when the error rate rises abnormally, rather than waiting until the error becomes uncorrectable.
EDAC

EDAC (Error Detection And Correction) is the Linux kernel subsystem dedicated to handling memory and hardware ECC errors. Its stated goal is “to detect and report errors occurring in the computer hardware running under Linux.” EDAC drivers communicate directly with the memory controller and parse the physical location of ECC errors — including memory controller index, channel, slot, and row/column address.
ACPI GHES

ACPI GHES (Generic Hardware Error Source) is a platform-agnostic hardware error reporting mechanism defined by the BIOS/UEFI through the APEI (ACPI Platform Error Interface) specification. The BIOS firmware writes hardware errors that cannot be handled by a specific driver — such as SoC-internal errors or platform-specific memory errors — into CPER (Common Platform Error Record) buffers described in the GHES descriptor. The Linux kernel reads these CPER records and reports the “non-standard” error sections that cannot be parsed by a standard subsystem.
PCIe AER

PCIe AER (Advanced Error Reporting) is an error reporting mechanism defined in the PCIe specification. It enables PCIe devices to report link-layer and transaction-layer errors to the operating system with precision.

Metrics Reference

RAS Metrics

# HELP huatuo_bamai_ras_hw_total total RAS hardware error events by source type
# TYPE huatuo_bamai_ras_hw_total counter
huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="acpi"} 0
huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="aer"} 0
huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="edac"} 0
huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="mce"} 0
huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="thr"} 0

NIC Packet Drop

huatuo_bamai_netdev_hw_rx_dropped_total{host="hostname",region="dev",device="eth0",driver="ixgbe"} 0

RDMA PFC

# HELP huatuo_bamai_netdev_dcb_pfc_received_total count of the received pfc frames
# TYPE huatuo_bamai_netdev_dcb_pfc_received_total counter
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="0",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="1",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="2",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="3",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="4",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="5",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="6",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="7",region="dev"} 0
# HELP huatuo_bamai_netdev_dcb_pfc_send_total count of the sent pfc frames
# TYPE huatuo_bamai_netdev_dcb_pfc_send_total counter
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="0",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="1",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="2",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="3",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="4",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="5",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="6",region="dev"} 0
huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="7",region="dev"} 0

Storage

Every hardware error event is persisted in structured form — either to the local huatuo-local directory or to a remote store such as Elasticsearch or OpenSearch. All records share the following common fields:

{
    "hostname": "hostname",
    "region": "dev",
    "uploaded_time": "2026-03-05T18:28:39.153438921+08:00",
    "time": "2026-03-05 18:28:39.153 +0800",
    "tracer_name": "netdev_event",
    "tracer_time": "2026-03-05 18:28:39.153 +0800",
    "tracer_type": "auto",
    "tracer_data": {
        "ifname": "eth0",
        "index": 2,
        "linkstatus": "linkstatus_admindown",
        "mac": "5c:6f:11:11:11:11",
        "start": false
    }
}

The linkstatus field takes the following values:

linkstatus_adminup — brought up by an administrator, e.g., ip link set dev eth0 up
linkstatus_admindown — brought down by an administrator, e.g., ip link set dev eth0 down
linkstatus_carrierup — physical link restored
linkstatus_carrierdown — physical link failure

{
    "hostname": "localhost",
    "region": "xxx",
    "uploaded_time": "2026-05-11T16:58:47.328548319+08:00",
    "time": "2026-05-11 16:58:47.328 +0800",
    "tracer_name": "ras",
    "tracer_time": "2026-05-11 16:58:47.328 +0800",
    "tracer_type": "auto",
    "tracer_data": {
        "dev": "MEM",
        "event": "EDAC",
        "type": "Corrected",
        "timestamp": 537792166031,
        "info": "{\"err_count\":0,\"err_type\":\"Corrected\",\"err_msg\":\"memory read error\",\"label\":\"CPU_SrcID#0_Ha#0_Chan#0_DIMM#0\",\"mc_index\":0,\"top_layer\":0,\"mid_layer\":0,\"low_layer\":-1,\"addr\":7860269056,\"grain\":128,\"syndrome\":0,\"driver\":\" area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0\"}"
    }
}

Field	Description
`Device`	Identifier of the hardware component where the error occurred (e.g., `CPU/MEM`, `MEM`, `ACPI`, `PCIe 0000:01:00.0`)
`Event`	Event subtype (`MCE`, `EDAC`, `APIC`, `AER`)
`ErrType`	Error severity level (see table below)
`Timestamp`	Timestamp
`Info`	Detailed fields for the specific event

Error Type	Description	Typical Sources
`Corrected`	Automatically corrected by hardware; transparent to the OS	MCE CE, EDAC CE, ACPI Sev=1, AER Severity=2
`UncorrectedRecoverable`	Not corrected by hardware, but recoverable by system software	MCE UE, EDAC UE, ACPI Sev=2, AER Severity=0
`UncorrectedDeferred`	Not corrected by hardware; requires deferred handling	MCE MCI_STATUS_DEFERRED, EDAC HW_EVENT_ERR_DEFERRED
`UncorrectedFatal`	Fatal hardware error; requires immediate reboot	EDAC FATAL, ACPI Sev=3, AER Severity=0
`Info`	Error type for which the system is expected to log informational records	EDAC HW_EVENT_ERR_INFO, ACPI Sev=0

Field Reference

MCE

Monitored components: CPU cores, L1/L2/L3 cache, TLB, memory controller (IMC), and interconnect buses (QPI/UPI/Infinity Fabric).

Field	MSR Source	Description
`mcg_cpu_cap`	`MCG_CAP`	Machine Check Global Capability Register. The lower 8 bits (`Count`) indicate the number of MC Banks in the system.
`mcg_msr_status`	`MCG_STATUS`	Machine Check Global Status Register.
`banks_msr_status`	`MCi_STATUS`	Bank Status Register (primary field). The lower 16 bits contain the MCA error code, classifying the error type (e.g., memory hierarchy error, bus error). The upper bits include control flags: `UC` (uncorrectable), `EN` (enabled), `MISCV` (MISC valid), `ADDRV` (ADDR valid), and `PCC` (processor context corrupt).
`banks_msr_addr`	`MCi_ADDR`	Physical memory address where the error occurred (valid only when `MCi_STATUS.ADDRV=1`). Used to identify the faulty DIMM or cache line.
`banks_msr_misc`	`MCi_MISC`	Supplementary information register (valid only when `MCi_STATUS.MISCV=1`).
`mca_synd_msr`	`MCA_SYND`	Syndrome register (AMD-specific).
`mca_ipid_msr`	`MCA_IPID`	Instance ID register (AMD-specific).
`instr_pointer`	RIP register	Instruction pointer at the time of the MCE (reliable only when `MCG_STATUS.EIPV=1`).
`tsc_timestamp`	TSC	CPU timestamp counter value at the time of the error (can be converted to absolute time using the kernel clock).
`walltime`	Kernel time	Unix timestamp (in seconds) at the time of the error.
`cpu`	—	Logical CPU number where the MCE occurred.
`cpuid`	CPUID	CPUID value of the CPU where the MCE occurred (includes Family, Model, and Stepping).
`apicid`	APIC ID	APIC ID of the CPU where the MCE occurred (can be mapped to a physical core or hyperthread).
`socketid`	—	CPU socket number (Socket ID). Used to identify physical CPUs in multi-socket servers.
`code_seg`	CS register	Code segment register value at the time of the MCE (used to determine privilege level).
`bank`	—	Bank number (typically: Bank 0 = L1I, Bank 1 = L1D, Bank 2 = L2, Bank 4+ = memory controller; numbering varies by platform).
`cpuvendor`	—	CPU vendor identifier: `0` = Intel, `1` = Unknown, `2` = AMD.

EDAC

Monitored components: memory ECC errors.

Field	Description
`err_count`	Cumulative error count for this event.
`err_type`	Error severity level.
`err_msg`	Human-readable error description string (e.g., `"CE memory read error on CPU#0Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:8 syndrome:0x0)"`).
`label`	Physical DIMM location label (e.g., `"CPU_SrcID#0_Ha#0_Chan#0_DIMM#0"`). Generated by the EDAC driver based on DIMM topology; maps directly to a physical memory slot in the system.
`mc_index`	Memory controller index (0-based). Distinguishes between IMCs on servers with multiple memory controllers.
`top_layer`	Top-layer index in the memory hierarchy (typically the channel number; -1 indicates invalid).
`mid_layer`	Middle-layer index in the memory hierarchy (typically the slot or rank number; -1 indicates invalid).
`low_layer`	Bottom-layer index in the memory hierarchy (typically the bank or row number; -1 indicates invalid).
`addr`	Physical memory address where the error occurred (64-bit unsigned integer; 0 indicates an invalid address).
`grain`	Error granularity (grain size, in bytes). Represents the smallest memory unit that may be affected. Computed as `1 << GrainBits`. For example, `grain=8` means the error is localized to an 8-byte unit (a cache line sub-block).
`syndrome`	ECC syndrome value.
`driver`	EDAC driver name (e.g., `"amd64_edac"`, `"sb_edac"`).

ACPI GHES

Monitored components: platform-specific hardware errors.

Field	Description
`severity`	Raw ACPI/CPER error severity value.
`sec_type`	Error section type GUID (16 bytes, hexadecimal string). Defined by the UEFI specification and hardware vendors. Identifies the hardware category of the error record (e.g., memory error section, PCIe error section, ARM processor error section).
`fru_id`	FRU (Field Replaceable Unit) identifier GUID (16 bytes, hexadecimal string). Uniquely identifies the replaceable hardware component where the error occurred (e.g., a specific DIMM or PCIe card).
`fru_text`	Human-readable FRU description string (e.g., `"CPU0_DIMM_A1"`).
`data_len`	Raw error data payload length (in bytes).
`raw_data`	Hexadecimal dump of raw error data (space-separated bytes). Used for in-depth diagnostics; must be interpreted with the relevant hardware vendor documentation.

PCIe AER

Monitored devices include GPUs, NVMe SSDs, RDMA NICs/HCAs, FPGA accelerator cards, and PCIe switches.

Field	Description
`dev_name`	PCIe device name (BDF format), e.g., `"0000:03:00.0"` (Domain:Bus:Device.Function).
`err_type`	Error severity level (`Corrected` / `Uncorrected` / `Fatal`).
`err_reason`	Error reason description string. Decoded from the bits of the AER status register (see the tables below).
`tlp_header`	TLP (Transaction Layer Packet) header quad-word that triggered the error (format: `{dword0, dword1, dword2, dword3}`, hexadecimal). The TLP header contains the transaction type, address, and requester ID — key data for root cause analysis. Displays `"not available"` when `TlpHeaderValid=0`.

PCIe Correctable Error Types

Bitmask	Description
`0x00000001`	Receiver Error. The physical layer received a data symbol that does not conform to the specification. Typically caused by signal integrity issues such as excessive cable length or impedance mismatch.
`0x00000040`	Bad TLP. The LCRC (link-layer CRC) check on a TLP failed, indicating bit flips during transmission. The PCIe link layer automatically retransmits the TLP.
`0x00000080`	Bad DLLP. A link-layer control packet (such as ACK/NAK or flow control update) failed its CRC check.
`0x00000100`	Replay Number Rollover. The `REPLAY_NUM` field tracks retransmit count. This error indicates too many retransmissions since the last ACK, typically signaling sustained poor link quality.
`0x00001000`	Replay Timer Timeout. The sender did not receive an ACK within the allowed time, triggering TLP retransmission. Persistent occurrence indicates abnormal link latency or insufficient receiver processing capacity.
`0x00002000`	Advisory Non-Fatal Error. An uncorrectable error that software has downgraded to correctable (requires the ANFE feature in the AER capability). Commonly seen when an Unsupported Request Completion is received.
`0x00004000`	Corrected Internal Error. An internal ECC or parity error that the device corrected autonomously.
`0x00008000`	Header Log Overflow. The AER header log register is full. TLP headers for subsequent errors cannot be recorded, though errors are still counted.

PCIe Uncorrectable Error Types

Bitmask	Description
`0x00000001`	Undefined. A reserved bit was set, typically indicating non-compliant firmware or hardware behavior.
`0x00000010`	Data Link Protocol Error. A packet that violates the DLLP protocol specification was received. This is a severe link-layer fault.
`0x00000020`	Surprise Down Error. The physical link disconnected without a Hot-Plug notification (e.g., due to unexpected power loss or poor contact). This is a high-severity error in hot-plug environments.
`0x00001000`	Poisoned TLP. A TLP was received with the Error Poisoning (EP) bit set to 1, indicating that the upstream sender was aware of data corruption. This mechanism propagates and isolates errors to prevent silent data corruption.
`0x00002000`	Flow Control Protocol Error. A packet that violates PCIe flow control credit rules was received. This is a severe protocol violation.
`0x00004000`	Completion Timeout. The requester sent a non-posted transaction (e.g., Memory Read) but did not receive a Completion within the required timeout. Commonly caused by NVMe firmware issues, RDMA NIC driver bugs, or PCIe link interruptions.
`0x00008000`	Completer Abort. The completer returned an explicit CA (Completer Abort) status, indicating that the request was rejected.
`0x00010000`	Unexpected Completion. A Completion was received that could not be matched to any outstanding request (tag mismatch). Typically caused by device firmware bugs or data path errors.
`0x00020000`	Receiver Overflow. The receiver’s flow control credits indicated available buffer space, but an overflow occurred. This is a severe flow control violation.
`0x00040000`	Malformed TLP. The packet header contains fields that violate the specification (e.g., illegal length, reserved bits set, invalid address range). Typically indicates a severe firmware defect.
`0x00080000`	ECRC Error. The ECRC check on the TLP trailer failed (requires ECRC support on both endpoints). Indicates data corruption across the entire transmission path, including internal PCIe switch fabric. A key metric in high-reliability environments.
`0x00100000`	Unsupported Request. The completer returned a UR (Unsupported Request) status, indicating that the transaction type or address range is not supported by the device.
`0x00200000`	ACS Violation. PCIe ACS (Access Control Services) prevents peer-to-peer DMA between PCIe devices from bypassing the IOMMU. This error indicates a data access that violates the ACS policy. Requires attention in virtualization security environments.
`0x00400000`	Uncorrectable Internal Error. An internal ECC or parity error occurred that the device could not self-correct (e.g., SRAM double-bit error). Typically indicates hardware damage.
`0x00800000`	MC Blocked TLP. A PCIe Multicast TLP was blocked by ACS or the Multicast control mechanism.
`0x01000000`	AtomicOp Egress Blocked. An AtomicOp request (FetchAdd, Swap, or CAS) was blocked from egressing by ACS. Commonly seen in RDMA or GPU direct-connect configurations.
`0x02000000`	TLP Prefix Blocked. A packet with an End-End TLP Prefix was blocked from forwarding by ACS or another mechanism.

Summary

Deploy HUATUO in production to enable hardware error monitoring and proactive operations.

Feedback

Was this page helpful?

Thanks !

Sorry, please submit issues.