1 - Kernel Panoramic Observation

Metrics supported in the current version:

CPU Subsystem

Scheduling Latency

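The following metrics expose process scheduling latency, that is, the time between a process becoming runnable (being placed on the run queue) and it actually starting to execute on a CPU.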

# HELP huatuo_bamai_runqlat_container_latency cpu run queue latency for the containers
# TYPE huatuo_bamai_runqlat_container_latency gauge
huatuo_bamai_runqlat_container_latency{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev",zone="0"} 226
huatuo_bamai_runqlat_container_latency{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev",zone="1"} 0
huatuo_bamai_runqlat_container_latency{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev",zone="2"} 0
huatuo_bamai_runqlat_container_latency{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev",zone="3"} 0

# HELP huatuo_bamai_runqlat_latency cpu run queue latency for the host
# TYPE huatuo_bamai_runqlat_latency gauge
huatuo_bamai_runqlat_latency{host="hostname",region="dev",zone="0"} 35100
huatuo_bamai_runqlat_latency{host="hostname",region="dev",zone="1"} 0
huatuo_bamai_runqlat_latency{host="hostname",region="dev",zone="2"} 0
huatuo_bamai_runqlat_latency{host="hostname",region="dev",zone="3"} 0
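
| Metric | Meaning | Unit | Scope | Source | Labels |
| --- | --- | --- | --- | --- | --- |
| runqlat_container_latency | Process scheduling latency count per zone: zone0, 0-10ms; zone1, 10-20ms; zone2, 20-50ms; zone3, 50+ms | count | container | eBPF | container_host, container_hostnamespace, container_level, container_name, container_type, host, region, zone |
| runqlat_latency | Process scheduling latency count per zone: zone0, 0-10ms; zone1, 10-20ms; zone2, 20-50ms; zone3, 50+ms | count | host | eBPF | host, region, zone |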
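
To make the zone buckets concrete, the sketch below turns one scrape of the four `zone` counters into per-bucket shares. It is only an illustration, not part of HUATUO: the helper name `zone_shares` is hypothetical, and the input values are copied from the host sample above.

```python
# Bucket boundaries as documented in the table above.
RUNQLAT_ZONES = {"0": "0-10ms", "1": "10-20ms", "2": "20-50ms", "3": "50+ms"}


def zone_shares(counts):
    """Return the fraction of scheduling events that fell into each zone.

    counts maps the zone label value ("0".."3") to its counter value.
    """
    total = sum(counts.values())
    if total == 0:
        # No events recorded: every share is zero rather than undefined.
        return {RUNQLAT_ZONES[z]: 0.0 for z in counts}
    return {RUNQLAT_ZONES[z]: c / total for z, c in counts.items()}


# Values from the huatuo_bamai_runqlat_latency host sample above.
host_sample = {"0": 35100, "1": 0, "2": 0, "3": 0}
shares = zone_shares(host_sample)
```

A growing share in zone2 or zone3 would suggest that runnable tasks are waiting longer for a CPU.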

Interrupt Latency

Response latency of each type of softirq on each CPU (currently only NET_RX and NET_TX are collected).

# HELP huatuo_bamai_softirq_latency softirq latency
# TYPE huatuo_bamai_softirq_latency gauge
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_RX",zone="0"} 125
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_RX",zone="1"} 2
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_RX",zone="2"} 0
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_RX",zone="3"} 0
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_TX",zone="0"} 0
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_TX",zone="1"} 0
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_TX",zone="2"} 0
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_TX",zone="3"} 0
huatuo_bamai_softirq_latency{cpuid="1",host="hostname",region="dev",type="NET_RX",zone="0"} 110
huatuo_bamai_softirq_latency{cpuid="1",host="hostname",region="dev",type="NET_RX",zone="1"} 0
huatuo_bamai_softirq_latency{cpuid="1",host="hostname",region="dev",type="NET_RX",zone="2"} 1
huatuo_bamai_softirq_latency{cpuid="1",host="hostname",region="dev",type="NET_RX",zone="3"} 0
huatuo_bamai_softirq_latency{cpuid="1",host="hostname",region="dev",type="NET_TX",zone="0"} 0
huatuo_bamai_softirq_latency{cpuid="1",host="hostname",region="dev",type="NET_TX",zone="1"} 0
huatuo_bamai_softirq_latency{cpuid="1",host="hostname",region="dev",type="NET_TX",zone="2"} 0
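
| Metric | Meaning | Unit | Scope | Source | Labels |
| --- | --- | --- | --- | --- | --- |
| softirq_latency | Softirq response latency count per zone: zone0, 0-10us; zone1, 10-100us; zone2, 100-1000us; zone3, 1+ms | count | host | eBPF | cpuid, host, region, type, zone |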
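
Because `softirq_latency` is reported per CPU, it is often useful to sum it across CPUs. The following sketch parses exposition lines shaped like the sample above and aggregates them by `(type, zone)`. The regular-expression parser and the function names are assumptions made for illustration; it handles only the simple `name{labels} value` shape shown in this document.

```python
import re
from collections import defaultdict

SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each sample line.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if m:
            name, labels, value = m.groups()
            yield name, dict(LABEL_RE.findall(labels)), float(value)


def softirq_by_type_and_zone(text):
    """Sum softirq_latency counts over all CPUs, keyed by (type, zone)."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "huatuo_bamai_softirq_latency":
            totals[(labels["type"], labels["zone"])] += value
    return dict(totals)


# A few lines copied from the sample above.
scrape = '''
# TYPE huatuo_bamai_softirq_latency gauge
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_RX",zone="0"} 125
huatuo_bamai_softirq_latency{cpuid="0",host="hostname",region="dev",type="NET_RX",zone="1"} 2
huatuo_bamai_softirq_latency{cpuid="1",host="hostname",region="dev",type="NET_RX",zone="0"} 110
huatuo_bamai_softirq_latency{cpuid="1",host="hostname",region="dev",type="NET_RX",zone="2"} 1
'''
totals = softirq_by_type_and_zone(scrape)
```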

Resource Utilization

The following metrics show the CPU usage of hosts and containers, in Prometheus exposition format:

# HELP huatuo_bamai_cpu_util_sys cpu sys for the host
# TYPE huatuo_bamai_cpu_util_sys gauge
huatuo_bamai_cpu_util_sys{host="hostname",region="dev"} 6.268857848549965e-06
# HELP huatuo_bamai_cpu_util_total cpu total for the host
# TYPE huatuo_bamai_cpu_util_total gauge
huatuo_bamai_cpu_util_total{host="hostname",region="dev"} 1.7736934944144352e-05
# HELP huatuo_bamai_cpu_util_usr cpu usr for the host
# TYPE huatuo_bamai_cpu_util_usr gauge
huatuo_bamai_cpu_util_usr{host="hostname",region="dev"} 1.1468077095594387e-05

# HELP huatuo_bamai_cpu_util_container_sys cpu sys for the containers
# TYPE huatuo_bamai_cpu_util_container_sys gauge
huatuo_bamai_cpu_util_container_sys{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1.6708593420881415e-07
# HELP huatuo_bamai_cpu_util_container_total cpu total for the containers
# TYPE huatuo_bamai_cpu_util_container_total gauge
huatuo_bamai_cpu_util_container_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 3.379584661890774e-07
# HELP huatuo_bamai_cpu_util_container_usr cpu usr for the containers
# TYPE huatuo_bamai_cpu_util_container_usr gauge
huatuo_bamai_cpu_util_container_usr{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1.7087253017325962e-07

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| cpu_util_sys | CPU kernel-mode utilization | % | host | host, region |
| cpu_util_usr | CPU user-mode utilization | % | host | host, region |
| cpu_util_total | Total CPU utilization | % | host | host, region |
| cpu_util_container_sys | CPU kernel-mode utilization | % | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| cpu_util_container_usr | CPU user-mode utilization | % | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| cpu_util_container_total | Total CPU utilization | % | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
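
In the host sample above, `cpu_util_total` equals the sum of `cpu_util_usr` and `cpu_util_sys`. The sketch below checks that relationship with a floating-point tolerance. That the total is always the sum of the two parts is an assumption drawn from this one sample, not a documented guarantee, and the function name is hypothetical.

```python
import math


def total_matches_parts(usr, sys_, total, rel_tol=1e-6):
    """Return True when total is, within tolerance, the sum of usr and sys."""
    return math.isclose(usr + sys_, total, rel_tol=rel_tol)


# Values from the host sample above.
usr = 1.1468077095594387e-05
sys_ = 6.268857848549965e-06
total = 1.7736934944144352e-05
consistent = total_matches_parts(usr, sys_, total)
```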

Resource Configuration

The following metric shows the CPU resource configuration of containers, in Prometheus exposition format:

# HELP huatuo_bamai_cpu_util_container_cores cpu core number for the containers
# TYPE huatuo_bamai_cpu_util_container_cores gauge
huatuo_bamai_cpu_util_container_cores{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="Burstable",container_name="coredns",container_type="Normal",host="hostname",region="dev"} 6

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| cpu_util_container_cores | Number of CPU cores | | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |

Resource Contention

These metrics reflect container contention and throttling, in Prometheus exposition format:

# HELP huatuo_bamai_cpu_stat_container_nr_throttled throttle nr for the containers
# TYPE huatuo_bamai_cpu_stat_container_nr_throttled gauge
huatuo_bamai_cpu_stat_container_nr_throttled{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_cpu_stat_container_throttled_time throttle time for the containers
# TYPE huatuo_bamai_cpu_stat_container_throttled_time gauge
huatuo_bamai_cpu_stat_container_throttled_time{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| cpu_stat_container_nr_throttled | Number of times the cgroup has been throttled | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| cpu_stat_container_throttled_time | Total time the cgroup has been throttled | nanoseconds | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
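
One way to read the two throttle metrics together is to compare consecutive scrapes. The sketch below assumes both values are cumulative, as the corresponding fields in cgroup `cpu.stat` are. The function name and the input numbers are hypothetical and do not come from the sample above.

```python
def throttle_delta(prev, curr):
    """Compare two scrapes of the throttle metrics for one container.

    prev and curr are dicts with keys "nr_throttled" and "throttled_time"
    (nanoseconds). Returns (new_throttle_events, mean_ns_per_event).
    """
    events = curr["nr_throttled"] - prev["nr_throttled"]
    time_ns = curr["throttled_time"] - prev["throttled_time"]
    if events <= 0:
        # No new throttling, or the values were reset (e.g. container restart).
        return 0, 0.0
    return events, time_ns / events


# Hypothetical values for two consecutive scrapes.
before = {"nr_throttled": 10, "throttled_time": 50_000_000}
after = {"nr_throttled": 14, "throttled_time": 90_000_000}
events, mean_ns = throttle_delta(before, after)
```

Here four new throttle events account for 40 ms of throttled time, or 10 ms per event on average.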

In addition, the DiDi kernel supports the following contention metrics, which will be opened up in the future:

# HELP huatuo_bamai_cpu_stat_container_wait_rate wait rate for the containers
# TYPE huatuo_bamai_cpu_stat_container_wait_rate gauge
huatuo_bamai_cpu_stat_container_wait_rate{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_cpu_stat_container_throttle_wait_rate throttle wait rate for the containers
# TYPE huatuo_bamai_cpu_stat_container_throttle_wait_rate gauge
huatuo_bamai_cpu_stat_container_throttle_wait_rate{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_cpu_stat_container_inner_wait_rate inner wait rate for the containers
# TYPE huatuo_bamai_cpu_stat_container_inner_wait_rate gauge
huatuo_bamai_cpu_stat_container_inner_wait_rate{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_cpu_stat_container_exter_wait_rate exter wait rate for the containers
# TYPE huatuo_bamai_cpu_stat_container_exter_wait_rate gauge
huatuo_bamai_cpu_stat_container_exter_wait_rate{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0

Resource Burst

The following metrics indicate burst resource usage by a container:

# HELP huatuo_bamai_cpu_stat_container_nr_bursts burst nr for the containers
# TYPE huatuo_bamai_cpu_stat_container_nr_bursts gauge
huatuo_bamai_cpu_stat_container_nr_bursts{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_cpu_stat_container_burst_time burst time for the containers
# TYPE huatuo_bamai_cpu_stat_container_burst_time gauge
huatuo_bamai_cpu_stat_container_burst_time{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
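
| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| cpu_stat_container_burst_time | Cumulative wall-clock time used above quota, summed over all periods | nanoseconds | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| cpu_stat_container_nr_bursts | Number of periods in which usage exceeded quota | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |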

Resource Load

These metrics reflect the load of hosts and containers.

# HELP huatuo_bamai_loadavg_load1 system load average, 1 minute
# TYPE huatuo_bamai_loadavg_load1 gauge
huatuo_bamai_loadavg_load1{host="hostname",region="dev"} 0.3
# HELP huatuo_bamai_loadavg_load15 system load average, 15 minutes
# TYPE huatuo_bamai_loadavg_load15 gauge
huatuo_bamai_loadavg_load15{host="hostname",region="dev"} 0.22
# HELP huatuo_bamai_loadavg_load5 system load average, 5 minutes
# TYPE huatuo_bamai_loadavg_load5 gauge
huatuo_bamai_loadavg_load5{host="hostname",region="dev"} 0.2
# HELP huatuo_bamai_loadavg_container_nr_running nr_running of container
# TYPE huatuo_bamai_loadavg_container_nr_running gauge
huatuo_bamai_loadavg_container_nr_running{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1
# HELP huatuo_bamai_loadavg_container_nr_uninterruptible nr_uninterruptible of container
# TYPE huatuo_bamai_loadavg_container_nr_uninterruptible gauge
huatuo_bamai_loadavg_container_nr_uninterruptible{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
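
| Metric | Meaning | Unit | Scope | Labels | Notes |
| --- | --- | --- | --- | --- | --- |
| loadavg_load1 | System load average over the past 1 minute | count | host | host, region | |
| loadavg_load5 | System load average over the past 5 minutes | count | host | host, region | |
| loadavg_load15 | System load average over the past 15 minutes | count | host | host, region | |
| loadavg_container_nr_running | Number of running tasks in the container | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region | cgroup v1 only |
| loadavg_container_nr_uninterruptible | Number of uninterruptible tasks in the container | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region | cgroup v1 only |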

Memory Subsystem

Resource Reclaim

System memory reclaim can block processes. These metrics help you understand the memory state of the system.

# HELP huatuo_bamai_memory_free_allocpages_stall time stalled in alloc pages
# TYPE huatuo_bamai_memory_free_allocpages_stall gauge
huatuo_bamai_memory_free_allocpages_stall{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_free_compaction_stall time stalled in memory compaction
# TYPE huatuo_bamai_memory_free_compaction_stall gauge
huatuo_bamai_memory_free_compaction_stall{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_reclaim_container_directstall counter of cgroup reclaim when try_charge
# TYPE huatuo_bamai_memory_reclaim_container_directstall gauge
huatuo_bamai_memory_reclaim_container_directstall{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
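
| Metric | Meaning | Unit | Scope | Source | Labels |
| --- | --- | --- | --- | --- | --- |
| memory_free_allocpages_stall | Time the system stalled while allocating memory pages | nanoseconds | host | eBPF | host, region |
| memory_free_compaction_stall | Time the system stalled while compacting memory pages | nanoseconds | host | eBPF | host, region |
| memory_reclaim_container_directstall | Number of direct memory reclaim events in the container | count | container | eBPF | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |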

Resource State

The following metrics describe the memory state of the whole system and of containers.

# HELP huatuo_bamai_memory_vmstat_container_active_anon cgroup memory.stat active_anon
# TYPE huatuo_bamai_memory_vmstat_container_active_anon gauge
huatuo_bamai_memory_vmstat_container_active_anon{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1.47456e+07
# HELP huatuo_bamai_memory_vmstat_container_active_file cgroup memory.stat active_file
# TYPE huatuo_bamai_memory_vmstat_container_active_file gauge
huatuo_bamai_memory_vmstat_container_active_file{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 2.3617536e+07
# HELP huatuo_bamai_memory_vmstat_container_file_dirty cgroup memory.stat file_dirty
# TYPE huatuo_bamai_memory_vmstat_container_file_dirty gauge
huatuo_bamai_memory_vmstat_container_file_dirty{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_file_writeback cgroup memory.stat file_writeback
# TYPE huatuo_bamai_memory_vmstat_container_file_writeback gauge
huatuo_bamai_memory_vmstat_container_file_writeback{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_inactive_anon cgroup memory.stat inactive_anon
# TYPE huatuo_bamai_memory_vmstat_container_inactive_anon gauge
huatuo_bamai_memory_vmstat_container_inactive_anon{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_inactive_file cgroup memory.stat inactive_file
# TYPE huatuo_bamai_memory_vmstat_container_inactive_file gauge
huatuo_bamai_memory_vmstat_container_inactive_file{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 65536
# HELP huatuo_bamai_memory_vmstat_container_pgdeactivate cgroup memory.stat pgdeactivate
# TYPE huatuo_bamai_memory_vmstat_container_pgdeactivate gauge
huatuo_bamai_memory_vmstat_container_pgdeactivate{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_pgrefill cgroup memory.stat pgrefill
# TYPE huatuo_bamai_memory_vmstat_container_pgrefill gauge
huatuo_bamai_memory_vmstat_container_pgrefill{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_pgscan_direct cgroup memory.stat pgscan_direct
# TYPE huatuo_bamai_memory_vmstat_container_pgscan_direct gauge
huatuo_bamai_memory_vmstat_container_pgscan_direct{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_pgscan_kswapd cgroup memory.stat pgscan_kswapd
# TYPE huatuo_bamai_memory_vmstat_container_pgscan_kswapd gauge
huatuo_bamai_memory_vmstat_container_pgscan_kswapd{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_pgsteal_direct cgroup memory.stat pgsteal_direct
# TYPE huatuo_bamai_memory_vmstat_container_pgsteal_direct gauge
huatuo_bamai_memory_vmstat_container_pgsteal_direct{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_pgsteal_kswapd cgroup memory.stat pgsteal_kswapd
# TYPE huatuo_bamai_memory_vmstat_container_pgsteal_kswapd gauge
huatuo_bamai_memory_vmstat_container_pgsteal_kswapd{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_shmem cgroup memory.stat shmem
# TYPE huatuo_bamai_memory_vmstat_container_shmem gauge
huatuo_bamai_memory_vmstat_container_shmem{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_shmem_thp cgroup memory.stat shmem_thp
# TYPE huatuo_bamai_memory_vmstat_container_shmem_thp gauge
huatuo_bamai_memory_vmstat_container_shmem_thp{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_container_unevictable cgroup memory.stat unevictable
# TYPE huatuo_bamai_memory_vmstat_container_unevictable gauge
huatuo_bamai_memory_vmstat_container_unevictable{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| memory_vmstat_container_active_file | Active file-backed memory | bytes | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_active_anon | Active anonymous memory | bytes | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_inactive_file | Inactive file-backed memory | bytes | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_inactive_anon | Inactive anonymous memory | bytes | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_file_dirty | File-backed memory that has been modified but not yet written to disk | bytes | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_file_writeback | File-backed memory that has been modified and is waiting to be written to disk | bytes | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_dirty | Memory that has been modified but not yet written to disk | bytes | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_writeback | File-backed and anonymous memory that has been modified and is waiting to be written to disk | bytes | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_pgdeactivate | Pages moved from the active LRU to the inactive LRU | pages | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_pgrefill | Total pages scanned on the active LRU list | pages | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_pgscan_direct | Total pages scanned on the inactive LRU during direct reclaim | pages | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_pgscan_kswapd | Total pages scanned on the inactive LRU list by kswapd | pages | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_pgsteal_direct | Total pages successfully reclaimed from the inactive LRU during direct reclaim | pages | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_pgsteal_kswapd | Total pages successfully reclaimed from the inactive LRU by kswapd | pages | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_vmstat_container_unevictable | Unevictable memory | bytes | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
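
The `pgscan_*` and `pgsteal_*` pairs are commonly combined into a reclaim-efficiency ratio: the fraction of scanned pages that were actually reclaimed. The sketch below shows that arithmetic. The function name and the input numbers are hypothetical; the sample above reports 0 for all four metrics.

```python
def reclaim_efficiency(pgscan, pgsteal):
    """Fraction of scanned pages that were reclaimed (0.0 to 1.0).

    Returns None when nothing was scanned, since the ratio is undefined.
    """
    if pgscan == 0:
        return None
    return pgsteal / pgscan


# Hypothetical values, not taken from the sample above.
direct = reclaim_efficiency(pgscan=2000, pgsteal=500)
kswapd = reclaim_efficiency(pgscan=0, pgsteal=0)
```

A low ratio means reclaim is scanning many pages to free few, which usually points to memory pressure inside the container.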

Host memory resource metrics:

# HELP huatuo_bamai_memory_vmstat_allocstall_device /proc/vmstat allocstall_device
# TYPE huatuo_bamai_memory_vmstat_allocstall_device gauge
huatuo_bamai_memory_vmstat_allocstall_device{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_allocstall_dma /proc/vmstat allocstall_dma
# TYPE huatuo_bamai_memory_vmstat_allocstall_dma gauge
huatuo_bamai_memory_vmstat_allocstall_dma{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_allocstall_dma32 /proc/vmstat allocstall_dma32
# TYPE huatuo_bamai_memory_vmstat_allocstall_dma32 gauge
huatuo_bamai_memory_vmstat_allocstall_dma32{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_allocstall_movable /proc/vmstat allocstall_movable
# TYPE huatuo_bamai_memory_vmstat_allocstall_movable gauge
huatuo_bamai_memory_vmstat_allocstall_movable{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_allocstall_normal /proc/vmstat allocstall_normal
# TYPE huatuo_bamai_memory_vmstat_allocstall_normal gauge
huatuo_bamai_memory_vmstat_allocstall_normal{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_nr_active_anon /proc/vmstat nr_active_anon
# TYPE huatuo_bamai_memory_vmstat_nr_active_anon gauge
huatuo_bamai_memory_vmstat_nr_active_anon{host="hostname",region="dev"} 155449
# HELP huatuo_bamai_memory_vmstat_nr_active_file /proc/vmstat nr_active_file
# TYPE huatuo_bamai_memory_vmstat_nr_active_file gauge
huatuo_bamai_memory_vmstat_nr_active_file{host="hostname",region="dev"} 212425
# HELP huatuo_bamai_memory_vmstat_nr_dirty /proc/vmstat nr_dirty
# TYPE huatuo_bamai_memory_vmstat_nr_dirty gauge
huatuo_bamai_memory_vmstat_nr_dirty{host="hostname",region="dev"} 19047
# HELP huatuo_bamai_memory_vmstat_nr_dirty_background_threshold /proc/vmstat nr_dirty_background_threshold
# TYPE huatuo_bamai_memory_vmstat_nr_dirty_background_threshold gauge
huatuo_bamai_memory_vmstat_nr_dirty_background_threshold{host="hostname",region="dev"} 379858
# HELP huatuo_bamai_memory_vmstat_nr_dirty_threshold /proc/vmstat nr_dirty_threshold
# TYPE huatuo_bamai_memory_vmstat_nr_dirty_threshold gauge
huatuo_bamai_memory_vmstat_nr_dirty_threshold{host="hostname",region="dev"} 760646
# HELP huatuo_bamai_memory_vmstat_nr_free_pages /proc/vmstat nr_free_pages
# TYPE huatuo_bamai_memory_vmstat_nr_free_pages gauge
huatuo_bamai_memory_vmstat_nr_free_pages{host="hostname",region="dev"} 3.20535e+06
# HELP huatuo_bamai_memory_vmstat_nr_inactive_anon /proc/vmstat nr_inactive_anon
# TYPE huatuo_bamai_memory_vmstat_nr_inactive_anon gauge
huatuo_bamai_memory_vmstat_nr_inactive_anon{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_nr_inactive_file /proc/vmstat nr_inactive_file
# TYPE huatuo_bamai_memory_vmstat_nr_inactive_file gauge
huatuo_bamai_memory_vmstat_nr_inactive_file{host="hostname",region="dev"} 428518
# HELP huatuo_bamai_memory_vmstat_nr_mlock /proc/vmstat nr_mlock
# TYPE huatuo_bamai_memory_vmstat_nr_mlock gauge
huatuo_bamai_memory_vmstat_nr_mlock{host="hostname",region="dev"} 6821
# HELP huatuo_bamai_memory_vmstat_nr_shmem /proc/vmstat nr_shmem
# TYPE huatuo_bamai_memory_vmstat_nr_shmem gauge
huatuo_bamai_memory_vmstat_nr_shmem{host="hostname",region="dev"} 541
# HELP huatuo_bamai_memory_vmstat_nr_shmem_hugepages /proc/vmstat nr_shmem_hugepages
# TYPE huatuo_bamai_memory_vmstat_nr_shmem_hugepages gauge
huatuo_bamai_memory_vmstat_nr_shmem_hugepages{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_nr_shmem_pmdmapped /proc/vmstat nr_shmem_pmdmapped
# TYPE huatuo_bamai_memory_vmstat_nr_shmem_pmdmapped gauge
huatuo_bamai_memory_vmstat_nr_shmem_pmdmapped{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_nr_slab_reclaimable /proc/vmstat nr_slab_reclaimable
# TYPE huatuo_bamai_memory_vmstat_nr_slab_reclaimable gauge
huatuo_bamai_memory_vmstat_nr_slab_reclaimable{host="hostname",region="dev"} 22322
# HELP huatuo_bamai_memory_vmstat_nr_slab_unreclaimable /proc/vmstat nr_slab_unreclaimable
# TYPE huatuo_bamai_memory_vmstat_nr_slab_unreclaimable gauge
huatuo_bamai_memory_vmstat_nr_slab_unreclaimable{host="hostname",region="dev"} 24168
# HELP huatuo_bamai_memory_vmstat_nr_unevictable /proc/vmstat nr_unevictable
# TYPE huatuo_bamai_memory_vmstat_nr_unevictable gauge
huatuo_bamai_memory_vmstat_nr_unevictable{host="hostname",region="dev"} 6839
# HELP huatuo_bamai_memory_vmstat_nr_writeback /proc/vmstat nr_writeback
# TYPE huatuo_bamai_memory_vmstat_nr_writeback gauge
huatuo_bamai_memory_vmstat_nr_writeback{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_nr_writeback_temp /proc/vmstat nr_writeback_temp
# TYPE huatuo_bamai_memory_vmstat_nr_writeback_temp gauge
huatuo_bamai_memory_vmstat_nr_writeback_temp{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_numa_pages_migrated /proc/vmstat numa_pages_migrated
# TYPE huatuo_bamai_memory_vmstat_numa_pages_migrated gauge
huatuo_bamai_memory_vmstat_numa_pages_migrated{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_pgdeactivate /proc/vmstat pgdeactivate
# TYPE huatuo_bamai_memory_vmstat_pgdeactivate gauge
huatuo_bamai_memory_vmstat_pgdeactivate{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_pgrefill /proc/vmstat pgrefill
# TYPE huatuo_bamai_memory_vmstat_pgrefill gauge
huatuo_bamai_memory_vmstat_pgrefill{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_pgscan_direct /proc/vmstat pgscan_direct
# TYPE huatuo_bamai_memory_vmstat_pgscan_direct gauge
huatuo_bamai_memory_vmstat_pgscan_direct{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_pgscan_direct_throttle /proc/vmstat pgscan_direct_throttle
# TYPE huatuo_bamai_memory_vmstat_pgscan_direct_throttle gauge
huatuo_bamai_memory_vmstat_pgscan_direct_throttle{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_pgscan_kswapd /proc/vmstat pgscan_kswapd
# TYPE huatuo_bamai_memory_vmstat_pgscan_kswapd gauge
huatuo_bamai_memory_vmstat_pgscan_kswapd{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_pgsteal_direct /proc/vmstat pgsteal_direct
# TYPE huatuo_bamai_memory_vmstat_pgsteal_direct gauge
huatuo_bamai_memory_vmstat_pgsteal_direct{host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_vmstat_pgsteal_kswapd /proc/vmstat pgsteal_kswapd
# TYPE huatuo_bamai_memory_vmstat_pgsteal_kswapd gauge
huatuo_bamai_memory_vmstat_pgsteal_kswapd{host="hostname",region="dev"} 0
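
  • Page state & LRU distribution

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| nr_free_pages | Total free pages (directly allocatable by the buddy system) | pages | host | host, region |
| nr_inactive_anon | Inactive anonymous pages | pages | host | host, region |
| nr_inactive_file | Inactive file pages | pages | host | host, region |
| nr_active_anon | Active anonymous pages | pages | host | host, region |
| nr_active_file | Active file pages | pages | host | host, region |
| nr_unevictable | Unevictable pages (mlocked, hugetlbfs, etc.) | pages | host | host, region |
| nr_mlock | Pages locked by mlock() | pages | host | host, region |
| nr_shmem | Pages used by tmpfs / shmem | pages | host | host, region |
| nr_slab_reclaimable | Reclaimable slab cache objects | pages | host | host, region |
| nr_slab_unreclaimable | Unreclaimable slab cache objects | pages | host | host, region |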
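
  • Dirty & writeback thresholds

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| nr_dirty | Current number of dirty pages | pages | host | host, region |
| nr_writeback | Pages currently being written back | pages | host | host, region |
| nr_dirty_threshold | Dirty-page count at which forced writeback starts (derived from dirty_ratio) | pages | host | host, region |
| nr_dirty_background_threshold | Dirty-page count at which background writeback starts (derived from dirty_background_ratio) | pages | host | host, region |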
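
  • Page fault & swapping

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| pgfault | Total number of page faults | count | host | host, region |
| pgmajfault | Number of major page faults | count | host | host, region |
| pgpgin | Pages read in from block devices | pages | host | host, region |
| pgpgout | Pages written out to block devices | pages | host | host, region |
| pswpin/pswpout | Pages swapped in / swapped out (swap) | pages | host | host, region |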
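
  • Reclaim & scanning

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| pgscan_kswapd/direct/khugepaged | Pages scanned by kswapd / direct reclaim / khugepaged | pages | host | host, region |
| pgsteal_kswapd/direct/khugepaged | Pages successfully reclaimed | pages | host | host, region |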
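
  • Transparent huge pages (THP)

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| thp_fault_alloc | Number of times a THP was successfully allocated on page fault | count | host | host, region |
| thp_fault_fallback | Number of times THP allocation failed on page fault and fell back to regular pages | count | host | host, region |
| thp_collapse_alloc | Number of successful khugepaged collapses into a THP | count | host | host, region |
| thp_collapse_alloc_failed | Number of failed khugepaged THP collapses | count | host | host, region |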
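
  • NUMA balancing & allocation

| Metric | Meaning | Unit | Scope | Labels |
| --- | --- | --- | --- | --- |
| numa_hit | Total pages for which a process wanted memory from a given node and was successfully allocated on that node | count | host | host, region |
| numa_miss | Pages for which a process wanted memory from another node but, because the target node was short of memory or for similar reasons, was allocated on this node | count | host | host, region |
| numa_foreign | Pages for which a process wanted memory from this node but was allocated on another node | count | host | host, region |
| numa_local | Total pages a process successfully allocated on its local node | count | host | host, region |
| numa_other | Total pages a process allocated on a remote node | count | host | host, region |
| numa_pages_migrated | Total pages successfully migrated by automatic NUMA balancing | count | host | host, region |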
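
Most `nr_*` metrics above are reported in pages, so comparing them with byte-based metrics needs a conversion. The sketch below assumes a 4096-byte page, which is common but not universal; on the monitored host the real value can be read with `os.sysconf("SC_PAGE_SIZE")`. The function name is hypothetical.

```python
def pages_to_bytes(pages, page_size=4096):
    """Convert a page count to bytes.

    page_size defaults to 4096, an assumption that holds on many but not
    all systems; pass the value of the monitored host when it differs.
    """
    return int(pages) * page_size


# nr_free_pages from the host sample above (3.20535e+06 pages).
free_bytes = pages_to_bytes(3.20535e+06)
```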

Resource Events

Container-level memory event metrics.

# HELP huatuo_bamai_memory_events_container_high memory events high
# TYPE huatuo_bamai_memory_events_container_high gauge
huatuo_bamai_memory_events_container_high{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_events_container_low memory events low
# TYPE huatuo_bamai_memory_events_container_low gauge
huatuo_bamai_memory_events_container_low{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_events_container_max memory events max
# TYPE huatuo_bamai_memory_events_container_max gauge
huatuo_bamai_memory_events_container_max{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_events_container_oom memory events oom
# TYPE huatuo_bamai_memory_events_container_oom gauge
huatuo_bamai_memory_events_container_oom{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_events_container_oom_group_kill memory events oom_group_kill
# TYPE huatuo_bamai_memory_events_container_oom_group_kill gauge
huatuo_bamai_memory_events_container_oom_group_kill{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_memory_events_container_oom_kill memory events oom_kill
# TYPE huatuo_bamai_memory_events_container_oom_kill gauge
huatuo_bamai_memory_events_container_oom_kill{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
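| Metric | Meaning | Unit | Object | Labels |
| --- | --- | --- | --- | --- |
| memory_events_container_low | Number of times the cgroup was reclaimed because of high system memory pressure even though its usage was below memory.low. This usually indicates that memory.low is over-committed. | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_events_container_high | Number of times processes in the cgroup were throttled and forced into direct reclaim because memory usage exceeded memory.high (the soft limit). | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_events_container_max | Number of times memory usage was about to exceed memory.max (the hard limit). If direct reclaim cannot bring usage down, the cgroup enters the OOM state. | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_events_container_oom | Number of times memory usage reached the memory.max limit and an allocation was about to fail, entering the OOM path. | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_events_container_oom_kill | Number of processes in the cgroup killed by the OOM killer after the memory limit was reached. | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| memory_events_container_oom_group_kill | Number of times the whole cgroup was killed by the OOM killer. | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |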
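
The counters above carry the same names as the keys of the cgroup v2 `memory.events` file. As a minimal sketch of how such a file can be read (not necessarily how the collector does it), the Python below parses its `key value` lines. The function name `parse_memory_events` and the sample content are illustrative assumptions.

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts
        events[key] = int(value)
    return events


# Illustrative sample content, not captured from a real system.
SAMPLE = """low 0
high 3
max 12
oom 1
oom_kill 1
oom_group_kill 0
"""

if __name__ == "__main__":
    # Print the values under the metric names used in the table above.
    for key, value in parse_memory_events(SAMPLE).items():
        print(f"memory_events_container_{key} {value}")
```

On a real host the text would come from a container's cgroup directory. That path depends on the cgroup driver and runtime, so it is left out here.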

Buddyinfo

Shows the distribution of free memory blocks held by the buddy allocator (the core algorithm of the kernel page allocator) for each NUMA node and each memory zone.

# HELP huatuo_bamai_memory_buddyinfo_blocks buddy info
# TYPE huatuo_bamai_memory_buddyinfo_blocks gauge
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="0",region="dev",zone="DMA"} 0
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="0",region="dev",zone="DMA32"} 3
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="0",region="dev",zone="Normal"} 7
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="1",region="dev",zone="DMA"} 0
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="1",region="dev",zone="DMA32"} 1
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="1",region="dev",zone="Normal"} 36
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="10",region="dev",zone="DMA"} 2
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="10",region="dev",zone="DMA32"} 743
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="10",region="dev",zone="Normal"} 2265
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="2",region="dev",zone="DMA"} 0
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="2",region="dev",zone="DMA32"} 3
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="2",region="dev",zone="Normal"} 10
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="3",region="dev",zone="DMA"} 0
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="3",region="dev",zone="DMA32"} 2
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="3",region="dev",zone="Normal"} 224
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="4",region="dev",zone="DMA"} 0
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="4",region="dev",zone="DMA32"} 1
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="4",region="dev",zone="Normal"} 376
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="5",region="dev",zone="DMA"} 0
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="5",region="dev",zone="DMA32"} 1
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="5",region="dev",zone="Normal"} 165
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="6",region="dev",zone="DMA"} 0
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="6",region="dev",zone="DMA32"} 3
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="6",region="dev",zone="Normal"} 118
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="7",region="dev",zone="DMA"} 0
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="7",region="dev",zone="DMA32"} 4
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="7",region="dev",zone="Normal"} 172
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="8",region="dev",zone="DMA"} 1
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="8",region="dev",zone="DMA32"} 4
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="8",region="dev",zone="Normal"} 35
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="9",region="dev",zone="DMA"} 2
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="9",region="dev",zone="DMA32"} 4
huatuo_bamai_memory_buddyinfo_blocks{host="hostname",node="0",order="9",region="dev",zone="Normal"} 25
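| Metric | Meaning | Unit | Object | Source | Labels |
| --- | --- | --- | --- | --- | --- |
| memory_buddyinfo_blocks | Number of free blocks of 2^order contiguous pages held by the buddy allocator. | blocks | host | procfs | host, node, order, region, zone |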
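
Each sample counts free blocks of a given order, and a block of order `n` spans `2^n` contiguous pages. The sketch below parses text in the `/proc/buddyinfo` layout and converts the counts into free pages. The helper names are hypothetical, and the sample line reuses the `DMA` values from the output above.

```python
def parse_buddyinfo(text: str) -> list[tuple[str, str, int, int]]:
    """Return (node, zone, order, free_blocks) tuples from /proc/buddyinfo text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: Node <id>, zone <name> <count order 0> <count order 1> ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = fields[1].rstrip(",")
        zone = fields[3]
        for order, count in enumerate(fields[4:]):
            rows.append((node, zone, order, int(count)))
    return rows


def free_pages(rows: list[tuple[str, str, int, int]]) -> int:
    """Total free pages: each block of order n contributes 2**n pages."""
    return sum(count << order for _, _, order, count in rows)


# One line in /proc/buddyinfo layout, using the DMA values shown above.
SAMPLE = "Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      2      2"

if __name__ == "__main__":
    rows = parse_buddyinfo(SAMPLE)
    print("free pages in sample:", free_pages(rows))
```

Many free pages in low orders and few in high orders is a common sign of fragmentation, which is why the metric keeps the `order` label instead of reporting one total.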

Network subsystem

TCP memory

The following metrics describe how much system memory the TCP protocol stack is using.

# HELP huatuo_bamai_tcp_memory_limit_pages tcp memory pages limit
# TYPE huatuo_bamai_tcp_memory_limit_pages gauge
huatuo_bamai_tcp_memory_limit_pages{host="hostname",region="dev"} 380526
# HELP huatuo_bamai_tcp_memory_usage_bytes tcp memory bytes usage
# TYPE huatuo_bamai_tcp_memory_usage_bytes gauge
huatuo_bamai_tcp_memory_usage_bytes{host="hostname",region="dev"} 0
# HELP huatuo_bamai_tcp_memory_usage_pages tcp memory pages usage
# TYPE huatuo_bamai_tcp_memory_usage_pages gauge
huatuo_bamai_tcp_memory_usage_pages{host="hostname",region="dev"} 0
# HELP huatuo_bamai_tcp_memory_usage_percent tcp memory usage percent
# TYPE huatuo_bamai_tcp_memory_usage_percent gauge
huatuo_bamai_tcp_memory_usage_percent{host="hostname",region="dev"} 0
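| Metric | Meaning | Unit | Object | Labels |
| --- | --- | --- | --- | --- |
| tcp_memory_limit_pages | Total amount of memory the system allows TCP to use | pages | host | host, region |
| tcp_memory_usage_bytes | TCP memory currently in use | bytes | host | host, region |
| tcp_memory_usage_pages | TCP memory currently in use | pages | host | host, region |
| tcp_memory_usage_percent | Percentage of TCP memory in use, relative to the total TCP memory limit | % | host | host, region |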
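
The four metrics are related by simple arithmetic. The sketch below shows that relationship under stated assumptions: a 4096-byte page and a percentage on a 0 to 100 scale. The function name is hypothetical, and the collector's actual rounding and scale may differ.

```python
def tcp_memory_usage(usage_pages: int, limit_pages: int,
                     page_size: int = 4096) -> tuple[int, float]:
    """Derive (usage_bytes, usage_percent) from page counts.

    page_size defaults to 4096, which is an assumption; on a real host
    the value can be obtained with os.sysconf("SC_PAGE_SIZE").
    """
    usage_bytes = usage_pages * page_size
    if limit_pages <= 0:
        return usage_bytes, 0.0  # avoid dividing by zero when no limit is known
    return usage_bytes, 100.0 * usage_pages / limit_pages


if __name__ == "__main__":
    # 380526 is the limit from the sample output above; 1000 pages is made up.
    used_bytes, percent = tcp_memory_usage(1000, 380526)
    print(f"{used_bytes} bytes in use, {percent:.3f}% of the limit")
```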

Neighbor entries

The following metrics describe the usage of neighbor (ARP) entries.

# HELP huatuo_bamai_arp_container_entries arp entries in container netns
# TYPE huatuo_bamai_arp_container_entries gauge
huatuo_bamai_arp_container_entries{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1
# HELP huatuo_bamai_arp_entries host init namespace
# TYPE huatuo_bamai_arp_entries gauge
huatuo_bamai_arp_entries{host="hostname",region="dev"} 5
# HELP huatuo_bamai_arp_total all entries in arp_cache for containers and host netns
# TYPE huatuo_bamai_arp_total gauge
huatuo_bamai_arp_total{host="hostname",region="dev"} 12
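| Metric | Meaning | Unit | Object | Labels |
| --- | --- | --- | --- | --- |
| arp_entries | Number of ARP entries in the host network namespace | count | host namespace | host, region |
| arp_total | Sum of ARP entries across all network namespaces on the host | count | host | host, region |
| arp_container_entries | Number of ARP entries in the container network namespace | count | container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |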
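
For a quick manual cross-check of these counts, the ARP table of a network namespace can be read from `/proc/net/arp`. The sketch below counts the entries in text of that layout. It assumes the first line is the column header and that every other non-empty line is one entry. Whether the collector uses this file is not stated here, and the sample addresses are made up.

```python
def count_arp_entries(text: str) -> int:
    """Count entries in /proc/net/arp style text (first line is the header)."""
    lines = [line for line in text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


# Illustrative sample, not captured from a real system.
SAMPLE = """IP address       HW type     Flags       HW address            Mask     Device
10.0.0.1         0x1         0x2         52:54:00:12:35:02     *        ens2
10.0.0.3         0x1         0x2         52:54:00:12:35:03     *        ens2
"""

if __name__ == "__main__":
    print("arp entries:", count_arp_entries(SAMPLE))
```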

Qdisc

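Qdisc (queueing discipline) is an important module of the kernel network subsystem. Observing it gives a clear view of how packets are being queued, delayed, and dropped on the transmit path.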

# HELP huatuo_bamai_netdev_qdisc_backlog Number of bytes currently in queue to be sent.
# TYPE huatuo_bamai_netdev_qdisc_backlog gauge
huatuo_bamai_netdev_qdisc_backlog{device="ens2",host="hostname",kind="fq_codel",region="dev"} 0
# HELP huatuo_bamai_netdev_qdisc_bytes_total Number of bytes sent.
# TYPE huatuo_bamai_netdev_qdisc_bytes_total counter
huatuo_bamai_netdev_qdisc_bytes_total{device="ens2",host="hostname",kind="fq_codel",region="dev"} 2.578235443e+09
# HELP huatuo_bamai_netdev_qdisc_current_queue_length Number of packets currently in queue to be sent.
# TYPE huatuo_bamai_netdev_qdisc_current_queue_length gauge
huatuo_bamai_netdev_qdisc_current_queue_length{device="ens2",host="hostname",kind="fq_codel",region="dev"} 0
# HELP huatuo_bamai_netdev_qdisc_drops_total Number of packet drops.
# TYPE huatuo_bamai_netdev_qdisc_drops_total counter
huatuo_bamai_netdev_qdisc_drops_total{device="ens2",host="hostname",kind="fq_codel",region="dev"} 0
# HELP huatuo_bamai_netdev_qdisc_overlimits_total Number of packet overlimits.
# TYPE huatuo_bamai_netdev_qdisc_overlimits_total counter
huatuo_bamai_netdev_qdisc_overlimits_total{device="ens2",host="hostname",kind="fq_codel",region="dev"} 0
# HELP huatuo_bamai_netdev_qdisc_packets_total Number of packets sent.
# TYPE huatuo_bamai_netdev_qdisc_packets_total counter
huatuo_bamai_netdev_qdisc_packets_total{device="ens2",host="hostname",kind="fq_codel",region="dev"} 6.867714e+06
# HELP huatuo_bamai_netdev_qdisc_requeues_total Number of packets dequeued, not transmitted, and requeued.
# TYPE huatuo_bamai_netdev_qdisc_requeues_total counter
huatuo_bamai_netdev_qdisc_requeues_total{device="ens2",host="hostname",kind="fq_codel",region="dev"} 0
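| Metric | Meaning | Unit | Object | Labels |
| --- | --- | --- | --- | --- |
| netdev_qdisc_backlog | Bytes currently queued and waiting to be sent | bytes | host | device, host, kind, region |
| netdev_qdisc_current_queue_length | Packets currently queued and waiting to be sent | count | host | device, host, kind, region |
| netdev_qdisc_overlimits_total | Number of times the queue went over its limit | count | host | device, host, kind, region |
| netdev_qdisc_requeues_total | Packets dequeued but requeued because the NIC or driver was temporarily unable to transmit | count | host | device, host, kind, region |
| netdev_qdisc_drops_total | Packets actively dropped (queue full, rate-limiting policy, and so on) | count | host | device, host, kind, region |
| netdev_qdisc_bytes_total | Bytes sent | bytes | host | device, host, kind, region |
| netdev_qdisc_packets_total | Packets sent | count | host | device, host, kind, region |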
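
The `_total` metrics are monotonically increasing counters, so they are usually read as a rate between two scrapes. The sketch below computes a per-second rate from two samples. Treating a decrease as a counter reset is a common convention and an assumption here, as is the function name.

```python
def counter_rate(prev: float, curr: float, interval_s: float) -> float:
    """Per-second increase of a counter between two samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr - prev
    if delta < 0:
        # The counter went backwards, so assume it was reset to zero
        # and count everything seen since the reset.
        delta = curr
    return delta / interval_s


if __name__ == "__main__":
    # Two made-up samples of netdev_qdisc_drops_total taken 30 seconds apart.
    print("drops per second:", counter_rate(100, 160, 30))
```

A sustained non-zero rate for `netdev_qdisc_drops_total` or `netdev_qdisc_requeues_total` is usually more informative than the absolute counter value.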

Hardware packet drops

Number of packets dropped by the network device hardware in the receive direction.

# HELP huatuo_bamai_netdev_hw_rx_dropped count of packets dropped at hardware level
# TYPE huatuo_bamai_netdev_hw_rx_dropped gauge
huatuo_bamai_netdev_hw_rx_dropped{device="eth0",driver="mlx5_core",host="hostname",region="dev"} 0
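| Metric | Meaning | Unit | Object | Source | Labels |
| --- | --- | --- | --- | --- | --- |
| netdev_hw_rx_dropped | Packets dropped by the NIC hardware in the receive direction | count | host | eBPF | device, driver, host, region |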

Network devices

# HELP huatuo_bamai_netdev_container_receive_bytes_total Network device statistic receive_bytes.
# TYPE huatuo_bamai_netdev_container_receive_bytes_total counter
huatuo_bamai_netdev_container_receive_bytes_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 6.4400018e+07
# HELP huatuo_bamai_netdev_container_receive_compressed_total Network device statistic receive_compressed.
# TYPE huatuo_bamai_netdev_container_receive_compressed_total counter
huatuo_bamai_netdev_container_receive_compressed_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_receive_dropped_total Network device statistic receive_dropped.
# TYPE huatuo_bamai_netdev_container_receive_dropped_total counter
huatuo_bamai_netdev_container_receive_dropped_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_receive_errors_total Network device statistic receive_errors.
# TYPE huatuo_bamai_netdev_container_receive_errors_total counter
huatuo_bamai_netdev_container_receive_errors_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_receive_fifo_total Network device statistic receive_fifo.
# TYPE huatuo_bamai_netdev_container_receive_fifo_total counter
huatuo_bamai_netdev_container_receive_fifo_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_receive_frame_total Network device statistic receive_frame.
# TYPE huatuo_bamai_netdev_container_receive_frame_total counter
huatuo_bamai_netdev_container_receive_frame_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_receive_multicast_total Network device statistic receive_multicast.
# TYPE huatuo_bamai_netdev_container_receive_multicast_total counter
huatuo_bamai_netdev_container_receive_multicast_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_receive_packets_total Network device statistic receive_packets.
# TYPE huatuo_bamai_netdev_container_receive_packets_total counter
huatuo_bamai_netdev_container_receive_packets_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 693155
# HELP huatuo_bamai_netdev_container_transmit_bytes_total Network device statistic transmit_bytes.
# TYPE huatuo_bamai_netdev_container_transmit_bytes_total counter
huatuo_bamai_netdev_container_transmit_bytes_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 6.2347911e+07
# HELP huatuo_bamai_netdev_container_transmit_carrier_total Network device statistic transmit_carrier.
# TYPE huatuo_bamai_netdev_container_transmit_carrier_total counter
huatuo_bamai_netdev_container_transmit_carrier_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_transmit_colls_total Network device statistic transmit_colls.
# TYPE huatuo_bamai_netdev_container_transmit_colls_total counter
huatuo_bamai_netdev_container_transmit_colls_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_transmit_compressed_total Network device statistic transmit_compressed.
# TYPE huatuo_bamai_netdev_container_transmit_compressed_total counter
huatuo_bamai_netdev_container_transmit_compressed_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_transmit_dropped_total Network device statistic transmit_dropped.
# TYPE huatuo_bamai_netdev_container_transmit_dropped_total counter
huatuo_bamai_netdev_container_transmit_dropped_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_transmit_errors_total Network device statistic transmit_errors.
# TYPE huatuo_bamai_netdev_container_transmit_errors_total counter
huatuo_bamai_netdev_container_transmit_errors_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_transmit_fifo_total Network device statistic transmit_fifo.
# TYPE huatuo_bamai_netdev_container_transmit_fifo_total counter
huatuo_bamai_netdev_container_transmit_fifo_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netdev_container_transmit_packets_total Network device statistic transmit_packets.
# TYPE huatuo_bamai_netdev_container_transmit_packets_total counter
huatuo_bamai_netdev_container_transmit_packets_total{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",device="eth0",host="hostname",region="dev"} 660218
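| Metric | Meaning | Unit | Object | Labels |
| --- | --- | --- | --- | --- |
| netdev_receive_bytes_total | Total bytes received successfully | bytes | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_receive_packets_total | Total packets received successfully | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_receive_compressed_total | Compressed packets received | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_receive_frame_total | Receive frame errors | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_receive_errors_total | Total receive errors | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_receive_dropped_total | Received packets dropped by the kernel or driver for any reason | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_receive_fifo_total | Receive FIFO / ring buffer overflow errors | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_receive_multicast_total | Multicast packets received | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_transmit_bytes_total | Total bytes sent successfully | bytes | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_transmit_packets_total | Total packets sent successfully | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_transmit_errors_total | Total transmit errors | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_transmit_dropped_total | Packets dropped during transmission (queue full, policy drop, and so on) | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_transmit_fifo_total | Transmit FIFO / ring buffer errors | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_transmit_colls_total | Collisions detected while transmitting | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_transmit_carrier_total | Carrier errors | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |
| netdev_transmit_compressed_total | Compressed packets sent | count | host or container | container_host, container_hostnamespace, container_level, container_name, container_type, device, host, region |

Container-scoped samples use the `netdev_container_` prefix, as in the output above.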
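
The field names in these metrics match the columns of `/proc/net/dev`, which has eight receive and eight transmit counters per interface. The sketch below parses text in that layout into per-device dictionaries. It assumes the standard two header lines, the helper name is hypothetical, and the sample row reuses byte and packet values from the output above. The mapping from procfs column names such as `errs` and `drop` to the metric names `errors` and `dropped` is left out.

```python
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]


def parse_net_dev(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/net/dev style text into {device: {counter_name: value}}."""
    stats: dict[str, dict[str, int]] = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        values = [int(v) for v in data.split()]
        if len(values) != len(RX_FIELDS) + len(TX_FIELDS):
            continue  # unexpected layout, skip this line
        dev = {f"receive_{k}": v for k, v in zip(RX_FIELDS, values[:8])}
        dev.update({f"transmit_{k}": v for k, v in zip(TX_FIELDS, values[8:])})
        stats[name.strip()] = dev
    return stats


SAMPLE = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 64400018  693155    0    0    0     0          0         0 62347911  660218    0    0    0     0       0          0
"""

if __name__ == "__main__":
    for device, counters in parse_net_dev(SAMPLE).items():
        print(device, counters["receive_bytes"], counters["transmit_bytes"])
```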

TCP

# HELP huatuo_bamai_netstat_container_TcpExt_ArpFilter statistic TcpExtArpFilter.
# TYPE huatuo_bamai_netstat_container_TcpExt_ArpFilter gauge
huatuo_bamai_netstat_container_TcpExt_ArpFilter{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_BusyPollRxPackets statistic TcpExtBusyPollRxPackets.
# TYPE huatuo_bamai_netstat_container_TcpExt_BusyPollRxPackets gauge
huatuo_bamai_netstat_container_TcpExt_BusyPollRxPackets{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_DelayedACKLocked statistic TcpExtDelayedACKLocked.
# TYPE huatuo_bamai_netstat_container_TcpExt_DelayedACKLocked gauge
huatuo_bamai_netstat_container_TcpExt_DelayedACKLocked{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_DelayedACKLost statistic TcpExtDelayedACKLost.
# TYPE huatuo_bamai_netstat_container_TcpExt_DelayedACKLost gauge
huatuo_bamai_netstat_container_TcpExt_DelayedACKLost{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_DelayedACKs statistic TcpExtDelayedACKs.
# TYPE huatuo_bamai_netstat_container_TcpExt_DelayedACKs gauge
huatuo_bamai_netstat_container_TcpExt_DelayedACKs{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 4650
# HELP huatuo_bamai_netstat_container_TcpExt_EmbryonicRsts statistic TcpExtEmbryonicRsts.
# TYPE huatuo_bamai_netstat_container_TcpExt_EmbryonicRsts gauge
huatuo_bamai_netstat_container_TcpExt_EmbryonicRsts{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_IPReversePathFilter statistic TcpExtIPReversePathFilter.
# TYPE huatuo_bamai_netstat_container_TcpExt_IPReversePathFilter gauge
huatuo_bamai_netstat_container_TcpExt_IPReversePathFilter{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_ListenDrops statistic TcpExtListenDrops.
# TYPE huatuo_bamai_netstat_container_TcpExt_ListenDrops gauge
huatuo_bamai_netstat_container_TcpExt_ListenDrops{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_ListenOverflows statistic TcpExtListenOverflows.
# TYPE huatuo_bamai_netstat_container_TcpExt_ListenOverflows gauge
huatuo_bamai_netstat_container_TcpExt_ListenOverflows{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_LockDroppedIcmps statistic TcpExtLockDroppedIcmps.
# TYPE huatuo_bamai_netstat_container_TcpExt_LockDroppedIcmps gauge
huatuo_bamai_netstat_container_TcpExt_LockDroppedIcmps{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_OfoPruned statistic TcpExtOfoPruned.
# TYPE huatuo_bamai_netstat_container_TcpExt_OfoPruned gauge
huatuo_bamai_netstat_container_TcpExt_OfoPruned{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_OutOfWindowIcmps statistic TcpExtOutOfWindowIcmps.
# TYPE huatuo_bamai_netstat_container_TcpExt_OutOfWindowIcmps gauge
huatuo_bamai_netstat_container_TcpExt_OutOfWindowIcmps{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_PAWSActive statistic TcpExtPAWSActive.
# TYPE huatuo_bamai_netstat_container_TcpExt_PAWSActive gauge
huatuo_bamai_netstat_container_TcpExt_PAWSActive{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_PAWSEstab statistic TcpExtPAWSEstab.
# TYPE huatuo_bamai_netstat_container_TcpExt_PAWSEstab gauge
huatuo_bamai_netstat_container_TcpExt_PAWSEstab{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_PFMemallocDrop statistic TcpExtPFMemallocDrop.
# TYPE huatuo_bamai_netstat_container_TcpExt_PFMemallocDrop gauge
huatuo_bamai_netstat_container_TcpExt_PFMemallocDrop{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_PruneCalled statistic TcpExtPruneCalled.
# TYPE huatuo_bamai_netstat_container_TcpExt_PruneCalled gauge
huatuo_bamai_netstat_container_TcpExt_PruneCalled{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_RcvPruned statistic TcpExtRcvPruned.
# TYPE huatuo_bamai_netstat_container_TcpExt_RcvPruned gauge
huatuo_bamai_netstat_container_TcpExt_RcvPruned{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_SyncookiesFailed statistic TcpExtSyncookiesFailed.
# TYPE huatuo_bamai_netstat_container_TcpExt_SyncookiesFailed gauge
huatuo_bamai_netstat_container_TcpExt_SyncookiesFailed{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_SyncookiesRecv statistic TcpExtSyncookiesRecv.
# TYPE huatuo_bamai_netstat_container_TcpExt_SyncookiesRecv gauge
huatuo_bamai_netstat_container_TcpExt_SyncookiesRecv{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_SyncookiesSent statistic TcpExtSyncookiesSent.
# TYPE huatuo_bamai_netstat_container_TcpExt_SyncookiesSent gauge
huatuo_bamai_netstat_container_TcpExt_SyncookiesSent{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedChallenge statistic TcpExtTCPACKSkippedChallenge.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedChallenge gauge
huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedChallenge{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedFinWait2 statistic TcpExtTCPACKSkippedFinWait2.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedFinWait2 gauge
huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedFinWait2{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedPAWS statistic TcpExtTCPACKSkippedPAWS.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedPAWS gauge
huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedPAWS{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedSeq statistic TcpExtTCPACKSkippedSeq.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedSeq gauge
huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedSeq{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedSynRecv statistic TcpExtTCPACKSkippedSynRecv.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedSynRecv gauge
huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedSynRecv{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedTimeWait statistic TcpExtTCPACKSkippedTimeWait.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedTimeWait gauge
huatuo_bamai_netstat_container_TcpExt_TCPACKSkippedTimeWait{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAOBad statistic TcpExtTCPAOBad.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAOBad gauge
huatuo_bamai_netstat_container_TcpExt_TCPAOBad{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAODroppedIcmps statistic TcpExtTCPAODroppedIcmps.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAODroppedIcmps gauge
huatuo_bamai_netstat_container_TcpExt_TCPAODroppedIcmps{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAOGood statistic TcpExtTCPAOGood.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAOGood gauge
huatuo_bamai_netstat_container_TcpExt_TCPAOGood{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAOKeyNotFound statistic TcpExtTCPAOKeyNotFound.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAOKeyNotFound gauge
huatuo_bamai_netstat_container_TcpExt_TCPAOKeyNotFound{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAORequired statistic TcpExtTCPAORequired.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAORequired gauge
huatuo_bamai_netstat_container_TcpExt_TCPAORequired{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAbortFailed statistic TcpExtTCPAbortFailed.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAbortFailed gauge
huatuo_bamai_netstat_container_TcpExt_TCPAbortFailed{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAbortOnClose statistic TcpExtTCPAbortOnClose.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAbortOnClose gauge
huatuo_bamai_netstat_container_TcpExt_TCPAbortOnClose{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAbortOnData statistic TcpExtTCPAbortOnData.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAbortOnData gauge
huatuo_bamai_netstat_container_TcpExt_TCPAbortOnData{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAbortOnLinger statistic TcpExtTCPAbortOnLinger.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAbortOnLinger gauge
huatuo_bamai_netstat_container_TcpExt_TCPAbortOnLinger{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAbortOnMemory statistic TcpExtTCPAbortOnMemory.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAbortOnMemory gauge
huatuo_bamai_netstat_container_TcpExt_TCPAbortOnMemory{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAbortOnTimeout statistic TcpExtTCPAbortOnTimeout.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAbortOnTimeout gauge
huatuo_bamai_netstat_container_TcpExt_TCPAbortOnTimeout{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAckCompressed statistic TcpExtTCPAckCompressed.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAckCompressed gauge
huatuo_bamai_netstat_container_TcpExt_TCPAckCompressed{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPAutoCorking statistic TcpExtTCPAutoCorking.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPAutoCorking gauge
huatuo_bamai_netstat_container_TcpExt_TCPAutoCorking{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPBacklogCoalesce statistic TcpExtTCPBacklogCoalesce.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPBacklogCoalesce gauge
huatuo_bamai_netstat_container_TcpExt_TCPBacklogCoalesce{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 3
# HELP huatuo_bamai_netstat_container_TcpExt_TCPBacklogDrop statistic TcpExtTCPBacklogDrop.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPBacklogDrop gauge
huatuo_bamai_netstat_container_TcpExt_TCPBacklogDrop{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPChallengeACK statistic TcpExtTCPChallengeACK.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPChallengeACK gauge
huatuo_bamai_netstat_container_TcpExt_TCPChallengeACK{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDSACKIgnoredDubious statistic TcpExtTCPDSACKIgnoredDubious.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDSACKIgnoredDubious gauge
huatuo_bamai_netstat_container_TcpExt_TCPDSACKIgnoredDubious{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDSACKIgnoredNoUndo statistic TcpExtTCPDSACKIgnoredNoUndo.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDSACKIgnoredNoUndo gauge
huatuo_bamai_netstat_container_TcpExt_TCPDSACKIgnoredNoUndo{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDSACKIgnoredOld statistic TcpExtTCPDSACKIgnoredOld.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDSACKIgnoredOld gauge
huatuo_bamai_netstat_container_TcpExt_TCPDSACKIgnoredOld{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDSACKOfoRecv statistic TcpExtTCPDSACKOfoRecv.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDSACKOfoRecv gauge
huatuo_bamai_netstat_container_TcpExt_TCPDSACKOfoRecv{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDSACKOfoSent statistic TcpExtTCPDSACKOfoSent.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDSACKOfoSent gauge
huatuo_bamai_netstat_container_TcpExt_TCPDSACKOfoSent{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDSACKOldSent statistic TcpExtTCPDSACKOldSent.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDSACKOldSent gauge
huatuo_bamai_netstat_container_TcpExt_TCPDSACKOldSent{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDSACKRecv statistic TcpExtTCPDSACKRecv.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDSACKRecv gauge
huatuo_bamai_netstat_container_TcpExt_TCPDSACKRecv{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDSACKRecvSegs statistic TcpExtTCPDSACKRecvSegs.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDSACKRecvSegs gauge
huatuo_bamai_netstat_container_TcpExt_TCPDSACKRecvSegs{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDSACKUndo statistic TcpExtTCPDSACKUndo.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDSACKUndo gauge
huatuo_bamai_netstat_container_TcpExt_TCPDSACKUndo{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDeferAcceptDrop statistic TcpExtTCPDeferAcceptDrop.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDeferAcceptDrop gauge
huatuo_bamai_netstat_container_TcpExt_TCPDeferAcceptDrop{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDelivered statistic TcpExtTCPDelivered.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDelivered gauge
huatuo_bamai_netstat_container_TcpExt_TCPDelivered{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 3.28098e+06
# HELP huatuo_bamai_netstat_container_TcpExt_TCPDeliveredCE statistic TcpExtTCPDeliveredCE.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPDeliveredCE gauge
huatuo_bamai_netstat_container_TcpExt_TCPDeliveredCE{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFastOpenActive statistic TcpExtTCPFastOpenActive.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFastOpenActive gauge
huatuo_bamai_netstat_container_TcpExt_TCPFastOpenActive{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFastOpenActiveFail statistic TcpExtTCPFastOpenActiveFail.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFastOpenActiveFail gauge
huatuo_bamai_netstat_container_TcpExt_TCPFastOpenActiveFail{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFastOpenBlackhole statistic TcpExtTCPFastOpenBlackhole.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFastOpenBlackhole gauge
huatuo_bamai_netstat_container_TcpExt_TCPFastOpenBlackhole{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFastOpenCookieReqd statistic TcpExtTCPFastOpenCookieReqd.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFastOpenCookieReqd gauge
huatuo_bamai_netstat_container_TcpExt_TCPFastOpenCookieReqd{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFastOpenListenOverflow statistic TcpExtTCPFastOpenListenOverflow.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFastOpenListenOverflow gauge
huatuo_bamai_netstat_container_TcpExt_TCPFastOpenListenOverflow{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFastOpenPassive statistic TcpExtTCPFastOpenPassive.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFastOpenPassive gauge
huatuo_bamai_netstat_container_TcpExt_TCPFastOpenPassive{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFastOpenPassiveAltKey statistic TcpExtTCPFastOpenPassiveAltKey.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFastOpenPassiveAltKey gauge
huatuo_bamai_netstat_container_TcpExt_TCPFastOpenPassiveAltKey{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFastOpenPassiveFail statistic TcpExtTCPFastOpenPassiveFail.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFastOpenPassiveFail gauge
huatuo_bamai_netstat_container_TcpExt_TCPFastOpenPassiveFail{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFastRetrans statistic TcpExtTCPFastRetrans.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFastRetrans gauge
huatuo_bamai_netstat_container_TcpExt_TCPFastRetrans{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFromZeroWindowAdv statistic TcpExtTCPFromZeroWindowAdv.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFromZeroWindowAdv gauge
huatuo_bamai_netstat_container_TcpExt_TCPFromZeroWindowAdv{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPFullUndo statistic TcpExtTCPFullUndo.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPFullUndo gauge
huatuo_bamai_netstat_container_TcpExt_TCPFullUndo{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPHPAcks statistic TcpExtTCPHPAcks.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPHPAcks gauge
huatuo_bamai_netstat_container_TcpExt_TCPHPAcks{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 616667
# HELP huatuo_bamai_netstat_container_TcpExt_TCPHPHits statistic TcpExtTCPHPHits.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPHPHits gauge
huatuo_bamai_netstat_container_TcpExt_TCPHPHits{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 9913
# HELP huatuo_bamai_netstat_container_TcpExt_TCPHystartDelayCwnd statistic TcpExtTCPHystartDelayCwnd.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPHystartDelayCwnd gauge
huatuo_bamai_netstat_container_TcpExt_TCPHystartDelayCwnd{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPHystartDelayDetect statistic TcpExtTCPHystartDelayDetect.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPHystartDelayDetect gauge
huatuo_bamai_netstat_container_TcpExt_TCPHystartDelayDetect{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPHystartTrainCwnd statistic TcpExtTCPHystartTrainCwnd.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPHystartTrainCwnd gauge
huatuo_bamai_netstat_container_TcpExt_TCPHystartTrainCwnd{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPHystartTrainDetect statistic TcpExtTCPHystartTrainDetect.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPHystartTrainDetect gauge
huatuo_bamai_netstat_container_TcpExt_TCPHystartTrainDetect{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPKeepAlive statistic TcpExtTCPKeepAlive.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPKeepAlive gauge
huatuo_bamai_netstat_container_TcpExt_TCPKeepAlive{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 20
# HELP huatuo_bamai_netstat_container_TcpExt_TCPLossFailures statistic TcpExtTCPLossFailures.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPLossFailures gauge
huatuo_bamai_netstat_container_TcpExt_TCPLossFailures{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPLossProbeRecovery statistic TcpExtTCPLossProbeRecovery.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPLossProbeRecovery gauge
huatuo_bamai_netstat_container_TcpExt_TCPLossProbeRecovery{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPLossProbes statistic TcpExtTCPLossProbes.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPLossProbes gauge
huatuo_bamai_netstat_container_TcpExt_TCPLossProbes{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1
# HELP huatuo_bamai_netstat_container_TcpExt_TCPLossUndo statistic TcpExtTCPLossUndo.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPLossUndo gauge
huatuo_bamai_netstat_container_TcpExt_TCPLossUndo{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPLostRetransmit statistic TcpExtTCPLostRetransmit.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPLostRetransmit gauge
huatuo_bamai_netstat_container_TcpExt_TCPLostRetransmit{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMD5Failure statistic TcpExtTCPMD5Failure.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMD5Failure gauge
huatuo_bamai_netstat_container_TcpExt_TCPMD5Failure{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMD5NotFound statistic TcpExtTCPMD5NotFound.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMD5NotFound gauge
huatuo_bamai_netstat_container_TcpExt_TCPMD5NotFound{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMD5Unexpected statistic TcpExtTCPMD5Unexpected.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMD5Unexpected gauge
huatuo_bamai_netstat_container_TcpExt_TCPMD5Unexpected{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMTUPFail statistic TcpExtTCPMTUPFail.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMTUPFail gauge
huatuo_bamai_netstat_container_TcpExt_TCPMTUPFail{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMTUPSuccess statistic TcpExtTCPMTUPSuccess.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMTUPSuccess gauge
huatuo_bamai_netstat_container_TcpExt_TCPMTUPSuccess{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMemoryPressures statistic TcpExtTCPMemoryPressures.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMemoryPressures gauge
huatuo_bamai_netstat_container_TcpExt_TCPMemoryPressures{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMemoryPressuresChrono statistic TcpExtTCPMemoryPressuresChrono.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMemoryPressuresChrono gauge
huatuo_bamai_netstat_container_TcpExt_TCPMemoryPressuresChrono{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMigrateReqFailure statistic TcpExtTCPMigrateReqFailure.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMigrateReqFailure gauge
huatuo_bamai_netstat_container_TcpExt_TCPMigrateReqFailure{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMigrateReqSuccess statistic TcpExtTCPMigrateReqSuccess.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMigrateReqSuccess gauge
huatuo_bamai_netstat_container_TcpExt_TCPMigrateReqSuccess{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPMinTTLDrop statistic TcpExtTCPMinTTLDrop.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPMinTTLDrop gauge
huatuo_bamai_netstat_container_TcpExt_TCPMinTTLDrop{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPOFODrop statistic TcpExtTCPOFODrop.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPOFODrop gauge
huatuo_bamai_netstat_container_TcpExt_TCPOFODrop{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPOFOMerge statistic TcpExtTCPOFOMerge.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPOFOMerge gauge
huatuo_bamai_netstat_container_TcpExt_TCPOFOMerge{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPOFOQueue statistic TcpExtTCPOFOQueue.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPOFOQueue gauge
huatuo_bamai_netstat_container_TcpExt_TCPOFOQueue{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPOrigDataSent statistic TcpExtTCPOrigDataSent.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPOrigDataSent gauge
huatuo_bamai_netstat_container_TcpExt_TCPOrigDataSent{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 2.675557e+06
# HELP huatuo_bamai_netstat_container_TcpExt_TCPPLBRehash statistic TcpExtTCPPLBRehash.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPPLBRehash gauge
huatuo_bamai_netstat_container_TcpExt_TCPPLBRehash{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPPartialUndo statistic TcpExtTCPPartialUndo.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPPartialUndo gauge
huatuo_bamai_netstat_container_TcpExt_TCPPartialUndo{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPPureAcks statistic TcpExtTCPPureAcks.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPPureAcks gauge
huatuo_bamai_netstat_container_TcpExt_TCPPureAcks{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 2.095262e+06
# HELP huatuo_bamai_netstat_container_TcpExt_TCPRcvCoalesce statistic TcpExtTCPRcvCoalesce.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPRcvCoalesce gauge
huatuo_bamai_netstat_container_TcpExt_TCPRcvCoalesce{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 3
# HELP huatuo_bamai_netstat_container_TcpExt_TCPRcvCollapsed statistic TcpExtTCPRcvCollapsed.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPRcvCollapsed gauge
huatuo_bamai_netstat_container_TcpExt_TCPRcvCollapsed{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPRcvQDrop statistic TcpExtTCPRcvQDrop.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPRcvQDrop gauge
huatuo_bamai_netstat_container_TcpExt_TCPRcvQDrop{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPRenoFailures statistic TcpExtTCPRenoFailures.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPRenoFailures gauge
huatuo_bamai_netstat_container_TcpExt_TCPRenoFailures{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPRenoRecovery statistic TcpExtTCPRenoRecovery.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPRenoRecovery gauge
huatuo_bamai_netstat_container_TcpExt_TCPRenoRecovery{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPRenoRecoveryFail statistic TcpExtTCPRenoRecoveryFail.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPRenoRecoveryFail gauge
huatuo_bamai_netstat_container_TcpExt_TCPRenoRecoveryFail{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPRenoReorder statistic TcpExtTCPRenoReorder.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPRenoReorder gauge
huatuo_bamai_netstat_container_TcpExt_TCPRenoReorder{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPReqQFullDoCookies statistic TcpExtTCPReqQFullDoCookies.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPReqQFullDoCookies gauge
huatuo_bamai_netstat_container_TcpExt_TCPReqQFullDoCookies{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPReqQFullDrop statistic TcpExtTCPReqQFullDrop.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPReqQFullDrop gauge
huatuo_bamai_netstat_container_TcpExt_TCPReqQFullDrop{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPRetransFail statistic TcpExtTCPRetransFail.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPRetransFail gauge
huatuo_bamai_netstat_container_TcpExt_TCPRetransFail{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSACKDiscard statistic TcpExtTCPSACKDiscard.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSACKDiscard gauge
huatuo_bamai_netstat_container_TcpExt_TCPSACKDiscard{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSACKReneging statistic TcpExtTCPSACKReneging.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSACKReneging gauge
huatuo_bamai_netstat_container_TcpExt_TCPSACKReneging{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSACKReorder statistic TcpExtTCPSACKReorder.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSACKReorder gauge
huatuo_bamai_netstat_container_TcpExt_TCPSACKReorder{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSYNChallenge statistic TcpExtTCPSYNChallenge.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSYNChallenge gauge
huatuo_bamai_netstat_container_TcpExt_TCPSYNChallenge{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSackFailures statistic TcpExtTCPSackFailures.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSackFailures gauge
huatuo_bamai_netstat_container_TcpExt_TCPSackFailures{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSackMerged statistic TcpExtTCPSackMerged.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSackMerged gauge
huatuo_bamai_netstat_container_TcpExt_TCPSackMerged{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSackRecovery statistic TcpExtTCPSackRecovery.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSackRecovery gauge
huatuo_bamai_netstat_container_TcpExt_TCPSackRecovery{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSackRecoveryFail statistic TcpExtTCPSackRecoveryFail.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSackRecoveryFail gauge
huatuo_bamai_netstat_container_TcpExt_TCPSackRecoveryFail{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSackShiftFallback statistic TcpExtTCPSackShiftFallback.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSackShiftFallback gauge
huatuo_bamai_netstat_container_TcpExt_TCPSackShiftFallback{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSackShifted statistic TcpExtTCPSackShifted.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSackShifted gauge
huatuo_bamai_netstat_container_TcpExt_TCPSackShifted{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSlowStartRetrans statistic TcpExtTCPSlowStartRetrans.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSlowStartRetrans gauge
huatuo_bamai_netstat_container_TcpExt_TCPSlowStartRetrans{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSpuriousRTOs statistic TcpExtTCPSpuriousRTOs.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSpuriousRTOs gauge
huatuo_bamai_netstat_container_TcpExt_TCPSpuriousRTOs{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSpuriousRtxHostQueues statistic TcpExtTCPSpuriousRtxHostQueues.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSpuriousRtxHostQueues gauge
huatuo_bamai_netstat_container_TcpExt_TCPSpuriousRtxHostQueues{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPSynRetrans statistic TcpExtTCPSynRetrans.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPSynRetrans gauge
huatuo_bamai_netstat_container_TcpExt_TCPSynRetrans{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPTSReorder statistic TcpExtTCPTSReorder.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPTSReorder gauge
huatuo_bamai_netstat_container_TcpExt_TCPTSReorder{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPTimeWaitOverflow statistic TcpExtTCPTimeWaitOverflow.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPTimeWaitOverflow gauge
huatuo_bamai_netstat_container_TcpExt_TCPTimeWaitOverflow{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPTimeouts statistic TcpExtTCPTimeouts.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPTimeouts gauge
huatuo_bamai_netstat_container_TcpExt_TCPTimeouts{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPToZeroWindowAdv statistic TcpExtTCPToZeroWindowAdv.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPToZeroWindowAdv gauge
huatuo_bamai_netstat_container_TcpExt_TCPToZeroWindowAdv{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPWantZeroWindowAdv statistic TcpExtTCPWantZeroWindowAdv.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPWantZeroWindowAdv gauge
huatuo_bamai_netstat_container_TcpExt_TCPWantZeroWindowAdv{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPWinProbe statistic TcpExtTCPWinProbe.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPWinProbe gauge
huatuo_bamai_netstat_container_TcpExt_TCPWinProbe{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPWqueueTooBig statistic TcpExtTCPWqueueTooBig.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPWqueueTooBig gauge
huatuo_bamai_netstat_container_TcpExt_TCPWqueueTooBig{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TCPZeroWindowDrop statistic TcpExtTCPZeroWindowDrop.
# TYPE huatuo_bamai_netstat_container_TcpExt_TCPZeroWindowDrop gauge
huatuo_bamai_netstat_container_TcpExt_TCPZeroWindowDrop{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TW statistic TcpExtTW.
# TYPE huatuo_bamai_netstat_container_TcpExt_TW gauge
huatuo_bamai_netstat_container_TcpExt_TW{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 720624
# HELP huatuo_bamai_netstat_container_TcpExt_TWKilled statistic TcpExtTWKilled.
# TYPE huatuo_bamai_netstat_container_TcpExt_TWKilled gauge
huatuo_bamai_netstat_container_TcpExt_TWKilled{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TWRecycled statistic TcpExtTWRecycled.
# TYPE huatuo_bamai_netstat_container_TcpExt_TWRecycled gauge
huatuo_bamai_netstat_container_TcpExt_TWRecycled{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 2461
# HELP huatuo_bamai_netstat_container_TcpExt_TcpDuplicateDataRehash statistic TcpExtTcpDuplicateDataRehash.
# TYPE huatuo_bamai_netstat_container_TcpExt_TcpDuplicateDataRehash gauge
huatuo_bamai_netstat_container_TcpExt_TcpDuplicateDataRehash{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_netstat_container_TcpExt_TcpTimeoutRehash statistic TcpExtTcpTimeoutRehash.
# TYPE huatuo_bamai_netstat_container_TcpExt_TcpTimeoutRehash gauge
huatuo_bamai_netstat_container_TcpExt_TcpTimeoutRehash{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
| Metric | Meaning | Unit | Scope | Labels |
|---|---|---|---|---|
| netstat_TcpExt_ArpFilter | Number of packets dropped due to ARP filter rules | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_BusyPollRxPackets | Number of packets received through the busy polling mechanism | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_DelayedACKLocked | Number of times a delayed ACK could not be sent because a user-space process held the socket lock | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_DelayedACKLost | Number of retransmissions caused by lost delayed ACKs | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_DelayedACKs | Number of attempts to send a delayed ACK, including unsuccessful attempts | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_EmbryonicRsts | Number of packets with the RST/SYN flag received in SYN_RECV state | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_ListenDrops | Total number of connections dropped because the accept queue was full (includes ListenOverflows) | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_ListenOverflows | Number of overflows of the TCP listen queue | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_OfoPruned | Number of times the out-of-order queue was pruned due to insufficient memory | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_OutOfWindowIcmps | Number of ICMP error messages received that are unrelated to the current TCP window | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_PruneCalled | Number of times cache pruning was triggered due to insufficient memory | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_RcvPruned | Number of times the receive queue was pruned (packets dropped) due to insufficient memory | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_SyncookiesFailed | Number of SYN cookies that failed validation | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_SyncookiesRecv | Number of SYN cookies received | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_SyncookiesSent | Number of SYN cookies sent | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPACKSkippedChallenge | Number of other ACKs skipped while handling challenge ACKs | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPACKSkippedFinWait2 | Number of ACKs skipped in FIN-WAIT-2 state | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPACKSkippedPAWS | Number of ACKs skipped because the PAWS check failed | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPACKSkippedSeq | Number of ACKs skipped because of the sequence number check | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPACKSkippedTimeWait | Number of ACKs skipped in TIME-WAIT state | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPAbortOnClose | Number of times a user-space program closed a connection while data remained in the buffer | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPAbortOnData | Number of connections closed after receiving unexpected data | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPAbortOnLinger | Number of connections aborted after the wait in LINGER state timed out | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPAbortOnMemory | Number of connections closed due to memory problems | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPAbortOnTimeout | Number of connections closed because the retransmission count of a timer exceeded its limit | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPLossFailures | Number of failed recoveries from packet loss | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPLossProbeRecovery | Number of times a detected lost packet was recovered | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPLossProbes | Number of lost packets detected by TCP; usually used to detect network congestion or packet loss | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPLossUndo | Number of undo operations after loss was detected during recovery | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |
| netstat_TcpExt_TCPLostRetransmit | Number of retransmissions of lost packets | Count | Host, container | container_host, container_hostnamespace, container_level, container_name, container_type, host, region |

Note: TcpExt exposes a very large number of extended metrics; consult the official documentation as needed.

Ref:

Socket

# HELP huatuo_bamai_sockstat_container_FRAG_inuse Number of FRAG sockets in state inuse.
# TYPE huatuo_bamai_sockstat_container_FRAG_inuse gauge
huatuo_bamai_sockstat_container_FRAG_inuse{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_sockstat_container_FRAG_memory Number of FRAG sockets in state memory.
# TYPE huatuo_bamai_sockstat_container_FRAG_memory gauge
huatuo_bamai_sockstat_container_FRAG_memory{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_sockstat_container_RAW_inuse Number of RAW sockets in state inuse.
# TYPE huatuo_bamai_sockstat_container_RAW_inuse gauge
huatuo_bamai_sockstat_container_RAW_inuse{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_sockstat_container_TCP_alloc Number of TCP sockets in state alloc.
# TYPE huatuo_bamai_sockstat_container_TCP_alloc gauge
huatuo_bamai_sockstat_container_TCP_alloc{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 171
# HELP huatuo_bamai_sockstat_container_TCP_inuse Number of TCP sockets in state inuse.
# TYPE huatuo_bamai_sockstat_container_TCP_inuse gauge
huatuo_bamai_sockstat_container_TCP_inuse{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 1
# HELP huatuo_bamai_sockstat_container_TCP_orphan Number of TCP sockets in state orphan.
# TYPE huatuo_bamai_sockstat_container_TCP_orphan gauge
huatuo_bamai_sockstat_container_TCP_orphan{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_sockstat_container_TCP_tw Number of TCP sockets in state tw.
# TYPE huatuo_bamai_sockstat_container_TCP_tw gauge
huatuo_bamai_sockstat_container_TCP_tw{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 75
# HELP huatuo_bamai_sockstat_container_UDPLITE_inuse Number of UDPLITE sockets in state inuse.
# TYPE huatuo_bamai_sockstat_container_UDPLITE_inuse gauge
huatuo_bamai_sockstat_container_UDPLITE_inuse{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_sockstat_container_UDP_inuse Number of UDP sockets in state inuse.
# TYPE huatuo_bamai_sockstat_container_UDP_inuse gauge
huatuo_bamai_sockstat_container_UDP_inuse{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 0
# HELP huatuo_bamai_sockstat_container_sockets_used Number of IPv4 sockets in use.
# TYPE huatuo_bamai_sockstat_container_sockets_used gauge
huatuo_bamai_sockstat_container_sockets_used{container_host="coredns-855c4dd65d-8v5kg",container_hostnamespace="kube-system",container_level="burstable",container_name="coredns",container_type="normal",host="hostname",region="dev"} 7
# HELP huatuo_bamai_sockstat_sockets_used Number of IPv4 sockets in use.
# TYPE huatuo_bamai_sockstat_sockets_used gauge
huatuo_bamai_sockstat_sockets_used{host="hostname",region="dev"} 409
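| Metric | Meaning | Unit | Scope | Labels |
|---|---|---|---|---|
| sockstat_sockets_used | Total number of socket descriptors currently in use system-wide | Count | System | |
| sockstat_TCP_inuse | Number of sockets currently in a TCP connection state (such as ESTABLISHED or LISTEN, excluding TIME_WAIT) | Count | Host, container | |
| sockstat_TCP_orphan | Number of orphaned TCP sockets; this usually means the application has closed the socket but the TCP connection has not yet ended | Count | Host, container | |
| sockstat_TCP_tw | Number of TCP sockets currently in TIME_WAIT state | Count | Host, container | |
| sockstat_TCP_alloc | Total number of TCP socket objects currently allocated | Count | Host, container | |
| sockstat_TCP_mem | Number of kernel memory pages currently used by TCP sockets | Pages | System | |
| sockstat_UDP_inuse | Number of UDP sockets currently bound to a local port | Count | Host, container | |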
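
To consume samples like the ones above outside Prometheus, a line of the text exposition format can be split into metric name, labels, and value. The following is a minimal sketch, assuming label values contain no escaped quotes (true for the samples shown here); `parse_sample` is a hypothetical helper and not part of HUATUO.

```python
import re

# Matches: name{key="value",...} number   (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one exposition line into (name, labels dict, float value)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


name, labels, value = parse_sample(
    'huatuo_bamai_sockstat_sockets_used{host="hostname",region="dev"} 409'
)
```

Lines starting with `#` (HELP and TYPE) are not samples and should be skipped before calling the helper.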

IO

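iolatency reports the distribution of disk I/O latency. Think of it as splitting one disk request into several stages and measuring how long each stage takes.

  • q2c: from the request entering the queue to completion; reflects the latency of the whole I/O lifecycle
  • d2c: from dispatch by the driver layer to completion; closer to the time spent in the disk and the driver themselves
  • freeze: number of disk freeze events

Queue

All of these metrics automatically carry the common labels host and region. Container-level metrics additionally always carry the container_host, container_name, container_type, container_level, and container_hostnamespace labels.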

# HELP huatuo_bamai_iolatency_blkdisk_d2c the disk d2c latency
# TYPE huatuo_bamai_iolatency_blkdisk_d2c gauge
huatuo_bamai_iolatency_blkdisk_d2c{disk="253:1",host="hostname",region="dev",zone="0"} 3
# HELP huatuo_bamai_iolatency_blkdisk_q2c the disk q2c latency
# TYPE huatuo_bamai_iolatency_blkdisk_q2c gauge
huatuo_bamai_iolatency_blkdisk_q2c{disk="253:1",host="hostname",region="dev",zone="0"} 3
# HELP huatuo_bamai_iolatency_container_blkdisk_d2c container blkio d2c latency
# TYPE huatuo_bamai_iolatency_container_blkdisk_d2c gauge
huatuo_bamai_iolatency_container_blkdisk_d2c{container_host="etcd-hostname",container_hostnamespace="kube-system",container_level="burstable",container_name="etcd",container_type="normal",disk="253:1",host="hostname",region="dev",zone="5"} 2
# HELP huatuo_bamai_iolatency_container_blkdisk_q2c container blkio q2c latency
# TYPE huatuo_bamai_iolatency_container_blkdisk_q2c gauge
huatuo_bamai_iolatency_container_blkdisk_q2c{container_host="etcd-hostname",container_hostnamespace="kube-system",container_level="burstable",container_name="etcd",container_type="normal",disk="253:1",host="hostname",region="dev",zone="5"} 2
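| Metric | Meaning | Unit | Scope | Labels |
|---|---|---|---|---|
| iolatency_blkdisk_q2c | Latency statistics for the whole I/O lifecycle of a host disk, from enqueue to completion. Buckets: zone0 20-30ms, zone1 30-50ms, zone2 50-100ms, zone3 100-200ms, zone4 200-400ms, zone5 400ms+ | Count | Host | host, region, disk, zone |
| iolatency_blkdisk_d2c | Latency statistics for the driver-to-completion stage of a host disk, closer to the device processing time. Buckets: zone0 20-30ms, zone1 30-50ms, zone2 50-100ms, zone3 100-200ms, zone4 200-400ms, zone5 400ms+ | Count | Host | host, region, disk, zone |
| iolatency_container_blkdisk_q2c | Latency statistics for the whole I/O lifecycle of requests issued by a container, from enqueue to completion. Buckets: zone0 20-30ms, zone1 30-50ms, zone2 50-100ms, zone3 100-200ms, zone4 200-400ms, zone5 400ms+ | Count | Container | host, region, container_host, container_name, container_type, container_level, container_hostnamespace, zone |
| iolatency_container_blkdisk_d2c | Latency statistics for the driver-to-completion stage of requests issued by a container. Buckets: zone0 20-30ms, zone1 30-50ms, zone2 50-100ms, zone3 100-200ms, zone4 200-400ms, zone5 400ms+ | Count | Container | host, region, container_host, container_name, container_type, container_level, container_hostnamespace, zone |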
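
Because the zone label is only an index, dashboards and scripts need the bucket boundaries to interpret it. The sketch below copies the boundaries from the table above; `describe_zone` and `slow_io_count` are hypothetical helper names, and the input is assumed to be a mapping from zone label to count that has already been scraped.

```python
# Bucket boundaries in milliseconds, copied from the iolatency table.
# An upper bound of None means the bucket is open-ended.
IOLATENCY_ZONES_MS = {
    "0": (20, 30),
    "1": (30, 50),
    "2": (50, 100),
    "3": (100, 200),
    "4": (200, 400),
    "5": (400, None),
}


def describe_zone(zone):
    """Return a human-readable range for a zone label value."""
    low, high = IOLATENCY_ZONES_MS[zone]
    return f"{low}ms+" if high is None else f"{low}-{high}ms"


def slow_io_count(samples, min_zone=3):
    """Sum the counts of all zones at or above min_zone (100ms and up by default)."""
    return sum(count for zone, count in samples.items() if int(zone) >= min_zone)
```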

Hardware

# HELP huatuo_bamai_iolatency_blkdisk_freeze the disk freeze event count
# TYPE huatuo_bamai_iolatency_blkdisk_freeze gauge
huatuo_bamai_iolatency_blkdisk_freeze{disk="253:1",host="hostname",region="dev"} 0
| Metric | Meaning | Unit | Scope | Labels |
|---|---|---|---|---|
| iolatency_blkdisk_freeze | Number of freeze events on a host disk | Count | Host | host, region, disk |

General System

Soft Lockup

# HELP huatuo_bamai_softlockup_total softlockup counter
# TYPE huatuo_bamai_softlockup_total counter
huatuo_bamai_softlockup_total{host="hostname",region="dev"} 0
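| Metric | Meaning | Unit | Scope | Source | Labels |
|---|---|---|---|---|---|
| softlockup_total | Count of system softlockup events | Count | Physical host | BPF | |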

HungTask

# HELP huatuo_bamai_hungtask_total hungtask counter
# TYPE huatuo_bamai_hungtask_total counter
huatuo_bamai_hungtask_total{host="hostname",region="dev"} 0
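| Metric | Meaning | Unit | Scope | Source | Labels |
|---|---|---|---|---|---|
| hungtask_total | Count of system hungtask events | Count | Physical host | BPF | |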

GPU

GPU platforms supported in the current version:

  • MetaX

| Metric | Description | Unit | Dimensions | Source |
|---|---|---|---|---|
| metax_gpu_sdk_info | GPU SDK information | - | version | sml.GetSDKVersion |
| metax_gpu_driver_info | GPU driver information | - | version | sml.GetGPUVersion with driver unit |
| metax_gpu_info | Basic GPU information | - | gpu | |
| metax_gpu_board_power_watts | GPU board power consumption | Watts (W) | gpu | sml.ListGPUBoardWayElectricInfos |
| metax_gpu_pcie_link_speed_gt_per_second | Current GPU PCIe link speed | GT/s | gpu | sml.GetGPUPcieLinkInfo |
| metax_gpu_pcie_link_width_lanes | Current GPU PCIe link width | Link width (lanes) | gpu | sml.GetGPUPcieLinkInfo |
| metax_gpu_pcie_receive_bytes_per_second | GPU PCIe receive throughput | Bps | gpu | sml.GetGPUPcieThroughputInfo |
| metax_gpu_pcie_transmit_bytes_per_second | GPU PCIe transmit throughput | Bps | gpu | sml.GetGPUPcieThroughputInfo |
| metax_gpu_metaxlink_link_speed_gt_per_second | Current GPU MetaXLink link speed | GT/s | gpu, metaxlink | sml.ListGPUMetaXLinkLinkInfos |
| metax_gpu_metaxlink_link_width_lanes | Current GPU MetaXLink link width | Link width (lanes) | gpu, metaxlink | sml.ListGPUMetaXLinkLinkInfos |
| metax_gpu_metaxlink_receive_bytes_per_second | GPU MetaXLink receive throughput | Bps | gpu, metaxlink | sml.ListGPUMetaXLinkThroughputInfos |
| metax_gpu_metaxlink_transmit_bytes_per_second | GPU MetaXLink transmit throughput | Bps | gpu, metaxlink | sml.ListGPUMetaXLinkThroughputInfos |
| metax_gpu_metaxlink_receive_bytes_total | Total data received over GPU MetaXLink | Bytes | gpu, metaxlink | sml.ListGPUMetaXLinkTrafficStatInfos |
| metax_gpu_metaxlink_transmit_bytes_total | Total data transmitted over GPU MetaXLink | Bytes | gpu, metaxlink | sml.ListGPUMetaXLinkTrafficStatInfos |
| metax_gpu_metaxlink_aer_errors_total | GPU MetaXLink AER error count | Count | gpu, metaxlink, error_type | sml.ListGPUMetaXLinkAerErrorsInfos |
| metax_gpu_status | GPU status | - | gpu, die | sml.GetDieStatus |
| metax_gpu_temperature_celsius | GPU temperature | Degrees Celsius | gpu, die | sml.GetDieTemperature |
| metax_gpu_utilization_percent | GPU utilization (0–100) | % | gpu, die, ip | sml.GetDieUtilization |
| metax_gpu_memory_total_bytes | Total GPU memory capacity | Bytes | gpu, die | sml.GetDieMemoryInfo |
| metax_gpu_memory_used_bytes | Used GPU memory capacity | Bytes | gpu, die | sml.GetDieMemoryInfo |
| metax_gpu_clock_mhz | GPU clock frequency | Megahertz (MHz) | gpu, die, ip | sml.ListDieClocks |
| metax_gpu_clocks_throttling | GPU clock throttling reason | - | gpu, die, reason | sml.GetDieClocksThrottleStatus |
| metax_gpu_dpm_performance_level | GPU DPM performance level | - | gpu, die, ip | sml.GetDieDPMPerformanceLevel |
| metax_gpu_ecc_memory_errors_total | GPU ECC memory error count | Count | gpu, die, memory_type, error_type | sml.GetDieECCMemoryInfo |
| metax_gpu_ecc_memory_retired_pages_total | Number of GPU memory pages retired by ECC | Count | gpu, die | sml.GetDieECCMemoryInfo |

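2 - Abnormal Event Diagnosis

📖 Overview

Built on eBPF, HUATUO observes abnormal events in real time across core Linux kernel subsystems: CPU scheduling, the memory subsystem, the network protocol stack, and hardware errors. When the kernel enters an abnormal state such as a softlockup, an OOM, or a hardware MCE, eBPF programs hooked to kernel functions (kprobe) or kernel tracepoints capture on-site data (process information, kernel call stacks, network context) the moment the event occurs. The data is passed through the perf event ring buffer to a user-space handler and finally persisted to Elasticsearch or to a file on local disk.

Compared with traditional collection based on kernel logs (dmesg/syslog), eBPF event observation carries a lower risk of data loss, because key events are not lost when the kernel log buffer overflows. It can also capture transient anomalies that are never written to the kernel log (such as softirqs being disabled for too long), and it provides container-level event correlation for precise troubleshooting in cloud-native environments.

Eleven event types are currently observed continuously, covering CPU scheduling health (softirq_tracing, softlockup, hungtask), memory pressure (oom, memory_reclaim_events), the network protocol stack (dropwatch, net_rx_latency, netdev_events, netdev_bonding_lacp, netdev_txqueue_timeout), and hardware reliability (ras).

🎯 Scenarios

Locating Kubernetes container memory failures: when containers restart frequently because of OOM, the oom event records the memcg cgroup pointer and container ID of both the process killed by the OOM killer (victim) and the process that triggered the OOM (trigger). Combined with time-series data, this quickly identifies the container at the root of the memory contention and reduces the time spent manually combing through container logs.

Hardware fault awareness in AI training clusters: on GPU training servers, the ras event continuously collects MCE (Machine Check Exception) errors, EDAC memory controller errors, and PCIe AER (Advanced Error Reporting) errors, and grades them by severity (Corrected / UncorrectedRecoverable / UncorrectedFatal). This surfaces hardware aging or single points of failure before a training job is interrupted, reducing training losses caused by hardware faults.

Network performance glitch analysis: dropwatch observes packet drops in the TCP protocol stack (including types such as syn_flood and listen_overflow), and net_rx_latency measures the latency of an individual packet along the full receive path from the NIC driver to user space. Thresholds are set per stage (NIC to kernel, kernel to TCP, TCP to user space; defaults 5ms / 10ms / 115ms), which pinpoints the network layer responsible for a business timeout and speeds up root cause analysis.

Host scheduling health observation: three events jointly cover abnormal states on the CPU scheduling path: softirq_tracing (softirq disabled time, default threshold 10ms), softlockup (CPU unable to schedule, about 1 second), and hungtask (tasks hung in D state). When the system stalls or responses time out, diagnostic information such as kernel call stacks is retained automatically, which supports offline analysis after the fault has disappeared.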

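🚀 Usage

Configuration Parameters

Each event can be tuned with the following parameters. All parameters have default values, so no configuration is required to run:

| Parameter | Default | Description |
|---|---|---|
| softirq.disabled_threshold | 10000000 (10ms, in nanoseconds) | Trigger threshold for softirq disabled time |
| memory_reclaim.blocked_threshold | 900000000 (900ms, in nanoseconds) | Trigger threshold for direct memory reclaim time |
| net_rx_latency.driver2net_rx | 5 (ms) | Latency threshold from the NIC driver to __netif_receive_skb |
| net_rx_latency.driver2tcp | 10 (ms) | Latency threshold from the NIC driver to tcp_v4_rcv |
| net_rx_latency.driver2userspace | 115 (ms) | Latency threshold from the NIC driver to the copy to user space (skb_copy_datagram_iovec) |
| net_rx_latency.excluded_host_netnamespace | true | Whether to filter out the host network namespace (by default only containers are observed) |
| net_rx_latency.excluded_container_qos | [] | List of container QoS levels to exclude |
| dropwatch.excluded_neigh_invalidate | true | Whether to filter out neighbor-table drop noise caused by neigh_invalidate |
| netdev.device_list | [] | List of NIC device names whose link status is monitored |
| ras.mce_thr_backoff | 1800 (seconds) | Reporting cooldown for MCE threshold interrupt (THR) events, to prevent interrupt storms |
| issues_list | [] | List of known-issue filter rules (used by net_rx_latency) |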

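Event List

| Event name (tracer_name) | Probe type | Trigger condition | Typical scenario |
|---|---|---|---|
| softirq_tracing | kprobe | Softirq disabled time > threshold (default 10ms) | System stalls, network latency, scheduling latency |
| softlockup | kprobe | CPU unable to schedule for a long time (about 1 second) | Soft lockup, abnormal responsiveness |
| hungtask | kprobe | Tasks hung in D state | Sudden bursts of D-state processes, I/O blocking |
| oom | kprobe | OOM killer triggered | Container or host memory exhaustion |
| memory_reclaim_events | kprobe | Direct reclaim time of a container process > threshold (default 900ms) | Business stalls caused by memory pressure |
| ras | tracepoint | CPU/MEM/PCIe hardware error | Hardware fault awareness |
| dropwatch | kprobe | Packet drop in the TCP protocol stack | Business glitches caused by protocol stack drops |
| net_rx_latency | kprobe | Protocol stack receive latency exceeds a per-stage threshold | Business timeouts caused by receive latency |
| netdev_events | netlink | NIC link status change | NIC physical link failure |
| netdev_bonding_lacp | kprobe | LACP protocol state change (802.3ad mode environments only) | Delimiting the fault boundary between the physical host and the switch |
| netdev_txqueue_timeout | kprobe | NIC transmit queue timeout | NIC transmit queue hardware failure |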

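Common Fields

All event data contains the following common fields:

  • hostname: hostname of the physical host
  • region: availability zone of the physical host
  • uploaded_time: time the data was uploaded
  • container_id: container ID, recorded if the event is associated with a container
  • container_hostname: container hostname, recorded if the event is associated with a container
  • container_host_namespace: Kubernetes namespace of the container, recorded if the event is associated with a container
  • container_type: container type, for example normal for a regular container or sidecar for a sidecar container
  • container_qos: container QoS level
  • tracer_name: event name (for example softirq_tracing or oom)
  • tracer_id: ID of this tracing run
  • tracer_time: time the tracing was triggered
  • tracer_type: trigger type (manual or automatic)
  • tracer_data: event-specific private data (see the description of each event)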
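
Since every event shares these fields, a consumer can summarize any record without knowing its event type. The sketch below assumes each record is a single JSON object, as in the samples in this section; `summarize_event` is a hypothetical helper and the hostname `node-1` is an illustrative value.

```python
import json


def summarize_event(raw):
    """Return a one-line summary built only from the common fields."""
    event = json.loads(raw)
    # container_id is only present when the event is tied to a container.
    scope = event.get("container_id") or "host"
    return (
        f'{event["tracer_time"]} {event["hostname"]} '
        f'{event["tracer_name"]} ({event["tracer_type"]}) scope={scope}'
    )


raw = (
    '{"hostname": "node-1", "tracer_name": "oom", "tracer_type": "auto", '
    '"tracer_time": "2025-06-11 16:05:16.251 +0800", "tracer_data": {}}'
)
summary = summarize_event(raw)
```

The event-specific payload stays in tracer_data and can be handed to a per-event decoder selected by tracer_name.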

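1. softirq_tracing Softirq Disabled

Description: Triggers when the kernel keeps softirqs disabled for too long. It records key data such as the kernel call stack during the disabled period and the current process information, which helps analyze interrupt-related latency. A filter automatically excludes noise events produced by the ksoftirqd and swapper processes.

Data storage: Event data is stored automatically in Elasticsearch or in a file on the physical host's disk.

Sample data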

{
    "uploaded_time": "2025-06-11T16:05:16.251152703+08:00",
    "hostname": "***",
    "tracer_data": {
        "offtime": 237328905,
        "threshold": 10000000,
        "comm": "***-agent",
        "pid": 688073,
        "cpu": 1,
        "now": 5532940660025295,
        "stack": "scheduler_tick/..."
    },
    "tracer_time": "2025-06-11 16:05:16.251 +0800",
    "tracer_type": "auto",
    "time": "2025-06-11 16:05:16.251 +0800",
    "region": "***",
    "tracer_name": "softirq_tracing"
}

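Field descriptions

  • comm: name of the process that triggered the event
  • stack: kernel call stack while softirqs were disabled
  • now: monotonic clock timestamp at the time of the event (nanoseconds)
  • offtime: duration for which softirqs were disabled (nanoseconds)
  • cpu: number of the CPU on which the event occurred
  • threshold: trigger threshold (nanoseconds); an event is recorded when it is exceeded
  • pid: ID of the process that triggered the event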
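
Both offtime and threshold are in nanoseconds, so it is often convenient to convert them before alerting. A minimal sketch using the values from the sample event above; `softirq_overshoot` is a hypothetical helper name.

```python
def softirq_overshoot(tracer_data):
    """Return (disabled time in ms, multiple of the threshold) for one event."""
    off_ms = tracer_data["offtime"] / 1_000_000  # nanoseconds -> milliseconds
    ratio = tracer_data["offtime"] / tracer_data["threshold"]
    return off_ms, ratio


# Values taken from the sample event: about 237ms against a 10ms threshold.
off_ms, ratio = softirq_overshoot({"offtime": 237328905, "threshold": 10000000})
```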

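2. dropwatch Protocol Stack Packet Drops

Description: Detects packet drops in the kernel network protocol stack and outputs the kernel call stack, network five-tuple, TCP state, and other information at the time of the drop. Four drop types are recognized: common_drop (generic drop), syn_flood (SYN flood), listen_overflow_handshake1 (half-open connection queue overflow), and listen_overflow_handshake3 (accept queue overflow). By default, a filter excludes known noise such as neighbor-table expiry drops from neigh_invalidate and transmit-side drops from the bnxt driver.

Data storage: Stored automatically in Elasticsearch or in a file on the physical host's disk.

Sample data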

{
    "tracer_data": {
        "type": "common_drop",
        "comm": "kubelet",
        "pid": 1687046,
        "saddr": "10.79.68.62",
        "daddr": "10.134.72.4",
        "sport": 8080,
        "dport": 49000,
        "src_hostname": "<nil>",
        "dest_hostname": "<nil>",
        "max_ack_backlog": 128,
        "seq": 1009085774,
        "ack_seq": 689410995,
        "pkt_len": 1460,
        "sk_state": "ESTABLISHED",
        "stack": "kfree_skb/...",
        "netdev_queue_mapping": 3,
        "netdev_linkstatus": ["linkStatusUp"],
        "netdev_name": "eth0",
        "netdev_ifindex": 2,
        "net_cookie": 123456789
    }
}

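Field descriptions

  • type: drop type (common_drop / syn_flood / listen_overflow_handshake1 / listen_overflow_handshake3)
  • comm: name of the process that triggered the drop
  • pid: process ID
  • saddr / daddr: source IP / destination IP address
  • sport / dport: source port / destination port
  • src_hostname / dest_hostname: reverse DNS resolution result for the source / destination IP
  • max_ack_backlog: maximum accept queue length of the socket
  • seq / ack_seq: TCP sequence number / acknowledgment number
  • pkt_len: packet length (bytes)
  • sk_state: TCP connection state at the time of the drop
  • stack: kernel call stack at the time of the drop
  • netdev_queue_mapping: NIC queue index
  • netdev_linkstatus: list of NIC link status flags
  • netdev_name: NIC device name
  • netdev_ifindex: NIC interface index
  • net_cookie: network namespace identifier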
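
When scanning many drop events, a compact one-line rendering of the flow and device fields is easier to read than the raw JSON. The sketch below formats a tracer_data dictionary using only fields listed above; `describe_drop` is a hypothetical helper name.

```python
def describe_drop(d):
    """Format a dropwatch tracer_data dict as 'type flow state dev queue'."""
    flow = f'{d["saddr"]}:{d["sport"]} -> {d["daddr"]}:{d["dport"]}'
    return (
        f'{d["type"]} {flow} {d["sk_state"]} '
        f'dev={d["netdev_name"]} queue={d["netdev_queue_mapping"]}'
    )


# Field values taken from the sample event above.
line = describe_drop({
    "type": "common_drop", "saddr": "10.79.68.62", "daddr": "10.134.72.4",
    "sport": 8080, "dport": 49000, "sk_state": "ESTABLISHED",
    "netdev_name": "eth0", "netdev_queue_mapping": 3,
})
```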

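3. net_rx_latency Protocol Stack Latency

Description: Detects per-stage latency events in the receive direction of the protocol stack (NIC driver → kernel protocol stack → active receive by user space). Three observation points are placed on the receive path, and an event triggers when the latency at any stage exceeds its threshold (defaults: NIC to kernel 5ms, kernel to TCP 10ms, TCP to user space 115ms). The event records the network five-tuple, TCP sequence numbers, the location of the delay, and the latency. By default the host network namespace is filtered out and only container networks are observed.

Data storage: Stored automatically in Elasticsearch or in a file on the physical host's disk.

Sample data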

{
    "tracer_data": {
        "comm": "nginx",
        "pid": 2921092,
        "where": "TO_USER_COPY",
        "latency_ms": 95973,
        "state": "ESTABLISHED",
        "saddr": "10.156.248.76",
        "daddr": "10.134.72.4",
        "sport": 9213,
        "dport": 49000,
        "seq": 1009085774,
        "ack_seq": 689410995,
        "pkt_len": 26064
    }
}

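Field descriptions

  • comm: name of the process that triggered the event
  • pid: ID of the process that triggered the event
  • saddr / daddr: source IP / destination IP address
  • sport / dport: source port / destination port
  • seq / ack_seq: TCP sequence number / acknowledgment number
  • state: TCP connection state (for example ESTABLISHED)
  • pkt_len: packet length (bytes)
  • where: stage at which the delay occurred (TO_NETIF_RCV NIC to kernel / TO_TCPV4_RCV kernel to TCP / TO_USER_COPY TCP to user space)
  • latency_ms: actual latency (milliseconds)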
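
To judge how severe an event is, latency_ms can be compared against the threshold of the stage reported in where. The sketch below uses the default thresholds from the configuration table; pairing each where value with driver2net_rx, driver2tcp, and driver2userspace is an assumption based on the stage descriptions, and `excess_latency_ms` is a hypothetical helper name.

```python
# Default thresholds in milliseconds from the configuration table.
# The mapping from `where` values to parameters is an assumption.
DEFAULT_THRESHOLD_MS = {
    "TO_NETIF_RCV": 5,    # net_rx_latency.driver2net_rx
    "TO_TCPV4_RCV": 10,   # net_rx_latency.driver2tcp
    "TO_USER_COPY": 115,  # net_rx_latency.driver2userspace
}


def excess_latency_ms(tracer_data, thresholds=DEFAULT_THRESHOLD_MS):
    """Return how far latency_ms exceeds the threshold of its stage."""
    return tracer_data["latency_ms"] - thresholds[tracer_data["where"]]


# Values taken from the sample event above.
excess = excess_latency_ms({"where": "TO_USER_COPY", "latency_ms": 95973})
```

If the thresholds have been changed in the configuration, pass the effective values through the `thresholds` argument instead of relying on the defaults.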

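4. oom Out of Memory

Description: Detects OOM (Out of Memory) events on the host or inside containers. It records the process killed by the OOM killer (victim) and the process that triggered the OOM (trigger), together with details of the corresponding containers and memory cgroups, providing a complete snapshot of the failure. It also maintains OOM count metrics for the host and for each container.

Data storage: Stored automatically in Elasticsearch or in a file on the physical host's disk.

Sample data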

{
    "tracer_data": {
        "trigger_memcg_css": "0xff4b8d8be3818000",
        "trigger_container_id": "***",
        "trigger_container_hostname": "***.docker",
        "trigger_pid": 3218804,
        "trigger_process_name": "java",
        "victim_memcg_css": "0xff4b8d8be3818000",
        "victim_container_id": "***",
        "victim_container_hostname": "***.docker",
        "victim_pid": 3218745,
        "victim_process_name": "java"
    }
}

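Field descriptions

  • victim_process_name / victim_pid: name and PID of the process killed by the OOM killer
  • victim_container_hostname / victim_container_id: hostname and ID of the container that the killed process belongs to
  • victim_memcg_css: memory cgroup pointer of the killed process (hexadecimal)
  • trigger_process_name / trigger_pid: name and PID of the process that triggered the OOM
  • trigger_container_hostname / trigger_container_id: hostname and ID of the container that the triggering process belongs to
  • trigger_memcg_css: memory cgroup pointer of the triggering process (hexadecimal)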
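
A first triage step is to check whether the trigger and the victim belong to the same memory cgroup. The sketch below compares the two css pointers; treating equal pointers as "same memcg" is an interpretation of the fields above rather than documented behavior, and `oom_relation` is a hypothetical helper name.

```python
def oom_relation(d):
    """Classify an oom event by comparing trigger and victim fields."""
    same_memcg = d["trigger_memcg_css"] == d["victim_memcg_css"]
    same_process = d["trigger_pid"] == d["victim_pid"]
    if same_process:
        return "self"
    return "same-memcg" if same_memcg else "cross-memcg"


# Values taken from the sample event above.
relation = oom_relation({
    "trigger_memcg_css": "0xff4b8d8be3818000", "trigger_pid": 3218804,
    "victim_memcg_css": "0xff4b8d8be3818000", "victim_pid": 3218745,
})
```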

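5. softlockup Soft Lockup

Description: Detects system softlockup events (a CPU that cannot be scheduled for a long time, about 1 second) and provides information about the target process that caused the lockup, the CPU it ran on, and the NMI backtraces of all CPUs. A backoff strategy is used: within one event storm, the reporting interval grows from 10 minutes up to a maximum of 3 hours to prevent repeated reports. It also maintains a count metric of softlockup occurrences.

Data storage: Stored automatically in Elasticsearch or in a file on the physical host's disk.

Sample data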

{
    "tracer_data": {
        "cpu": 15,
        "pid": 12345,
        "comm": "kworker/15:0",
        "cpus_stack": "2025-06-10 14:30:22 sysrq: Show backtrace of all active CPUs\nNMI backtrace for cpu 15\n..."
    }
}

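Field descriptions

  • cpu: number of the CPU on which the softlockup occurred
  • pid: PID of the process that triggered the softlockup
  • comm: name of the process that triggered the softlockup
  • cpus_stack: NMI backtraces of all CPUs (multi-line text containing timestamps and call stacks)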
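
The backoff described for this event (reporting interval growing from 10 minutes to at most 3 hours) can be illustrated with a small generator. The text gives only the start and the cap, so the doubling factor below is an assumption, and `backoff_intervals` is a hypothetical name.

```python
from itertools import islice


def backoff_intervals(start_s=600, cap_s=3 * 3600, factor=2):
    """Yield report intervals in seconds, growing by `factor` until capped."""
    interval = start_s
    while True:
        yield interval
        interval = min(interval * factor, cap_s)


# First seven intervals under the assumed doubling policy.
first = list(islice(backoff_intervals(), 7))
```

Under this assumption the interval reaches the 3 hour cap (10800 seconds) on the sixth report and stays there.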

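6. hungtask Hung Task

Description: Detects system hungtask events and captures the kernel stacks of all processes currently in D state (uninterruptible sleep) together with the backtraces of all CPUs, preserving the failure scene. A backoff strategy is used: within one event storm, the reporting interval grows from 10 minutes up to a maximum of 3 hours. It also maintains a count metric of hungtask occurrences. Note: some Linux distributions (such as Fedora 42) disable hungtask detection by default, in which case this tracer does not start.

Data storage: Stored automatically in Elasticsearch or in a file on the physical host's disk.

Sample data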

{
    "tracer_data": {
        "pid": 2567042,
        "comm": "kworker/u48:2",
        "cpus_stack": "2025-06-10 09:57:14 sysrq: Show backtrace of all active CPUs\nNMI backtrace for cpu 33\n...",
        "blocked_processes_stack": "task:java            state:D stack:    0 pid: 12345 ..."
    }
}

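Field descriptions

  • pid: PID of the process that triggered hungtask detection
  • comm: name of the process that triggered hungtask detection
  • cpus_stack: NMI backtraces of all CPUs (multi-line text containing timestamps and call stacks)
  • blocked_processes_stack: kernel stacks of the D-state processes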

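7. memory_reclaim_events Memory Reclaim

Description: Detects direct memory reclaim (direct reclaim) events in container processes. An event triggers when the direct reclaim time of a single process within 1 second exceeds the threshold (default 900ms), and it records the reclaim duration plus process and container information. Note: this tracer records memory reclaim events only for container processes; events from host processes are filtered out.

Data storage: Stored automatically in Elasticsearch or in a file on the physical host's disk.

Sample data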

{
    "tracer_data": {
        "pid": 1896137,
        "comm": "java",
        "deltatime": 1412702917
    }
}

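Field descriptions

  • comm: name of the process that triggered direct memory reclaim
  • pid: PID of the triggering process
  • deltatime: time spent in direct reclaim (nanoseconds)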

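8. ras Hardware Errors

Description: Detects CPU, memory, PCIe, and other hardware errors through kernel tracepoints. Five hardware error sources are supported: MCE (Machine Check Exception), EDAC (memory controller), ACPI/GHES (non-standard hardware errors), PCIe AER (Advanced Error Reporting), and MCE threshold interrupts (THR). Errors are graded by severity, including Corrected, UncorrectedRecoverable (uncorrected but recoverable), and UncorrectedFatal (uncorrected and fatal). MCE threshold interrupt events use a cooldown (default 30 minutes) so that an interrupt storm does not produce a flood of duplicate reports.

Data storage: Stored automatically in Elasticsearch or in a file on the physical host's disk.

MCE sample data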

{
    "tracer_data": {
        "dev": "CPU/MEM",
        "event": "MCE",
        "type": "UncorrectedRecoverable",
        "timestamp": 1749600000000000000,
        "info": "{\"mcg_cpu_cap\":4096,\"banks_msr_status\":9295429630892703744,\"cpu\":2,\"socketid\":0,\"bank\":5}"
    }
}

PCIe AER sample data

{
    "tracer_data": {
        "dev": "PCIe 0000:3b:00.0",
        "event": "AER",
        "type": "UncorrectedRecoverable",
        "timestamp": 1749600000000000000,
        "info": "{\"dev_name\":\"0000:3b:00.0\",\"err_type\":\"UncorrectedRecoverable\",\"err_reason\":\"Completion Timeout\",\"tlp_header\":\"not available\"}"
    }
}

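Field descriptions

  • dev: the hardware device where the error occurred (e.g. CPU/MEM, PCIe 0000:3b:00.0)
  • event: the error source (MCE / EDAC / NON_STANDARD / AER / MCE_THRESHOLD)
  • type: the error severity (Corrected / UncorrectedRecoverable / UncorrectedDeferred / UncorrectedFatal / Info)
  • timestamp: the timestamp at which the hardware error occurred
  • info: detailed error information as a JSON string; its content depends on the event type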
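
The info field is itself a JSON string, so it has to be parsed a second time. The sketch below does this for the MCE sample above and decodes a few architectural MCi_STATUS flags from banks_msr_status. The bit positions (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57) follow the x86 machine-check architecture; the function name decode_mce_info is hypothetical.

```python
import json

# Architectural flag positions in MCi_STATUS (x86 machine-check architecture).
STATUS_FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}


def decode_mce_info(tracer_data):
    """Parse the nested `info` JSON of an MCE event and decode MCi_STATUS."""
    info = json.loads(tracer_data["info"])  # second JSON layer
    status = info["banks_msr_status"]
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    return {
        "cpu": info["cpu"],
        "bank": info["bank"],
        "mca_error_code": status & 0xFFFF,  # low 16 bits
        "flags": flags,
    }


sample = {
    "dev": "CPU/MEM",
    "event": "MCE",
    "info": '{"mcg_cpu_cap":4096,"banks_msr_status":9295429630892703744,'
            '"cpu":2,"socketid":0,"bank":5}',
}
decoded = decode_mce_info(sample)
```

Python integers are arbitrary precision, so the 64-bit register value survives json.loads unchanged; in languages that parse JSON numbers as doubles, the value would need to be read as a string or a 64-bit integer.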

9. netdev_events 网络设备

功能描述 通过订阅内核 netlink RTM_NEWLINK 消息,检测网卡链路状态变化事件(down/up、MTU 变更、AdminDown、CarrierDown 等),输出接口名称、链路状态、MAC 地址及驱动信息。观测器启动时会扫描 device_list 中配置的网卡当前状态作为基线,后续仅上报状态变化事件。

数据存储 自动存储至 Elasticsearch 或物理机磁盘文件。

示例数据

{
    "tracer_data": {
        "ifname": "eth1",
        "index": 3,
        "linkstatus": "linkStatusAdminDown, linkStatusCarrierDown",
        "mac": "5c:6f:69:34:dc:72",
        "start": false,
        "driver": "ixgbe",
        "driver_version": "5.1.0-k",
        "firmware_version": "3.25 0x80000421 1.2163.0"
    }
}

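Field descriptions

  • ifname: network interface name (e.g. eth1)
  • index: interface index
  • linkstatus: description of the link state change (may contain several states)
  • mac: MAC address of the NIC
  • start: whether this is a baseline event from the startup scan (true: startup scan, false: live change event)
  • driver: NIC driver name
  • driver_version: NIC driver version
  • firmware_version: NIC firmware version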
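
Since linkstatus can carry several comma-separated states and start distinguishes baseline records from live changes, a consumer typically splits the former and filters on the latter. A small sketch based on the sample record above; the helper names are hypothetical.

```python
def link_states(tracer_data):
    """Split the comma-separated linkstatus string into individual states."""
    return [s.strip() for s in tracer_data["linkstatus"].split(",") if s.strip()]


def is_live_change(tracer_data):
    """True for real-time change events, False for startup baseline records."""
    return not tracer_data["start"]


sample = {
    "ifname": "eth1",
    "linkstatus": "linkStatusAdminDown, linkStatusCarrierDown",
    "start": False,
}
```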

10. netdev_bonding_lacp LACP 协议

功能描述 检测 bonding 模式下 LACP(Link Aggregation Control Protocol,IEEE 802.3ad)协议的状态变化,读取并记录 /proc/net/bonding/ 下所有 bonding 接口的完整状态信息,包含模式、MII 状态、Actor/Partner 协商参数、Slave 链路状态等。仅在系统存在 IEEE 802.3ad bonding 模式接口时自动启用。

数据存储 自动存储至 Elasticsearch 或物理机磁盘文件。

示例数据

{
    "tracer_data": {
        "content": "/proc/net/bonding/bond0\nEthernet Channel Bonding Driver: v4.18.0...\nBonding Mode: IEEE 802.3ad Dynamic link aggregation\nMII Status: down\n..."
    }
}

字段含义解释

  • content:完整的 bonding 接口状态信息(多行文本,包含所有 Slave 的 LACP 协商细节,等同于 /proc/net/bonding/bondX 文件内容)

11. netdev_txqueue_timeout 发送队列超时

功能描述 检测网卡发送队列超时(TX queue timeout)事件,记录发生超时的队列索引、设备名称和驱动名称,用于定位网卡发送方向的硬件故障。

数据存储 自动存储至 Elasticsearch 或物理机磁盘文件。

示例数据

{
    "tracer_data": {
        "queue_index": 3,
        "device_name": "eth0",
        "driver_name": "ixgbe"
    }
}

字段含义解释

  • queue_index:发生超时的发送队列索引
  • device_name:网卡设备名称
  • driver_name:网卡驱动名称

⚙️ 原理

整体架构

HUATUO 的异常事件观测基于 eBPF 技术,在内核态以极低的性能损耗采集异常事件现场数据,并通过用户态守护进程完成格式化、过滤、容器信息关联和持久化存储。

graph TB
    subgraph "Linux Kernel"
        direction TB
        K1["kprobe 挂钩\n(softirq_tracing / softlockup / hungtask\n oom / memory_reclaim_events / dropwatch\n net_rx_latency / netdev_txqueue_timeout)"]
        K2["tracepoint 挂钩\n(ras: MCE / EDAC / AER / ACPI)"]
        K3["netlink 订阅\n(netdev_events: RTM_NEWLINK)"]
        K4["kprobe 挂钩\n(netdev_bonding_lacp: 802.3ad)"]
        PEB["Perf Event 环形缓冲区\n(8192 页)"]
    end

    subgraph "HUATUO 用户态"
        direction TB
        EH["Go 事件处理 goroutine\n(每类事件独立运行)"]
        CF["过滤器\n(阈值判断 / 降噪 / 已知问题过滤)"]
        CM["容器信息关联\n(CSS → ContainerID\n NetNS → ContainerID)"]
    end

    subgraph "存储"
        ES["Elasticsearch"]
        DISK["本地磁盘文件"]
    end

    K1 --> PEB
    K2 --> PEB
    K4 --> PEB
    PEB --> EH
    K3 --> EH
    EH --> CF
    CF --> CM
    CM --> ES
    CM --> DISK

事件处理流程

sequenceDiagram
    participant K as Linux Kernel
    participant B as eBPF Program
    participant P as Perf Event Buffer
    participant H as Go 事件处理器
    participant F as 过滤器
    participant S as 存储

    K->>B: 触发 kprobe / tracepoint
    B->>B: 采集现场数据<br/>(进程信息 / 内核栈 / 网络上下文)
    B->>P: 写入 perf event 环形缓冲区
    H->>P: 读取事件数据(阻塞等待)
    H->>F: 格式化并执行过滤<br/>(阈值 / 降噪 / 已知问题)
    F->>H: 通过过滤的事件
    H->>H: 关联容器信息<br/>(CSS / NetNS 映射)
    H->>S: 持久化存储<br/>(Elasticsearch / 本地文件)
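
The user-space half of the flow above (read, filter, associate container information, store) can be sketched as a small pipeline. This is an illustration in Python rather than HUATUO's Go code; process_events, the css key and the mapping table are hypothetical names.

```python
def process_events(events, keep, container_of, store):
    """Filter, enrich and store events; return how many were stored."""
    stored = 0
    for event in events:
        if not keep(event):                 # threshold / noise / known-issue filter
            continue
        container_id = container_of(event)  # e.g. CSS -> ContainerID lookup
        if container_id is not None:
            event = {**event, "container_id": container_id}
        store(event)                        # Elasticsearch or local file in HUATUO
        stored += 1
    return stored


# Hypothetical inputs for illustration only.
css_to_container = {"css-1": "container-a"}
events = [
    {"css": "css-1", "latency_ms": 120},
    {"css": "css-1", "latency_ms": 3},
    {"css": "css-host", "latency_ms": 500},
]
sink = []
count = process_events(
    events,
    keep=lambda e: e["latency_ms"] >= 100,
    container_of=lambda e: css_to_container.get(e["css"]),
    store=sink.append,
)
```

Filtering before the container lookup keeps the more expensive association step off the hot path for events that are dropped anyway, which matches the order shown in the diagram.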

3 - 全自动化追踪

📖 概述

HUATUO AutoTracing(全自动化追踪)是一种事件驱动的自动诊断机制。当物理机或容器出现 CPU 突增、D 状态进程堆积、磁盘 IO 打满、内存突发分配等性能异常时,系统依据预设阈值自动触发现场数据采集,无需人工介入即可保留完整的诊断快照。

采集内容包括 eBPF 火焰图(perf 工具系统级或容器级 CPU 调用栈采样)、D 状态进程内核调用栈、磁盘 IO 调用栈、进程内存使用排行等。为避免持续触发导致的数据冗余,各事件均内置冷却策略(默认 30 分钟),确保在事件风暴期间仅保留关键快照。

当前支持 5 类事件:cpusys(物理机 CPU sys 突增)、cpuidle(容器 CPU 使用率突增)、dload(容器 D 状态负载突增)、iotracing(磁盘 IO 异常)、memburst(内存突发分配)。

🎯 场景

AI 训练任务 CPU 热点定位:在 GPU 训练集群中,训练任务偶发性卡顿往往由内核态 CPU 占用率突增(cpusys)引起。AutoTracing 在 sys 占用率超过阈值的瞬间自动触发系统级 perf 火焰图采集,将内核调用栈热点以火焰图数据结构(flamedata)持久化,支持在故障消失后进行离线分析,避免人工复现困难。

Kubernetes 容器 CPU 性能毛刺分析:在微服务架构中,容器 CPU 使用率(cpuidle)的短暂突增可能导致响应延迟超时,但问题往往在告警响应前已恢复。AutoTracing 在容器 CPU 超阈值时自动触发容器级 perf 采样,生成精确到容器 cgroup 范围的火焰图,快速定位热点函数,降低依赖日志排查的时间成本。

云原生环境 D 状态进程堆积排查:在高 IO 负载或存储抖动时,容器内可能出现大量 D 状态(不可中断睡眠)进程,导致系统卡顿。dload 事件通过对容器负载均值进行指数加权移动平均(EMA)计算,在 D 状态进程负载超过阈值时自动抓取容器内及宿主机上相关进程的内核调用栈,精准定位阻塞根因。

磁盘 IO 瓶颈根因定位:在大数据或日志密集型业务中,磁盘 IO 利用率或写入带宽打满会导致应用请求堆积。iotracing 持续轮询 /proc/diskstats,在磁盘 IO 指标连续两次超过阈值时触发,采集高 IO 进程列表(含各进程读写字节数与打开文件详情)及正在等待 IO 调度的进程内核调用栈,快速缩小磁盘 IO 高消耗的进程范围。

🚀 使用

配置参数

各事件可通过以下参数进行调优,参数均提供默认值,无需配置即可运行:

参数 默认值 说明
cpuidle.user_threshold 75(%) 容器 CPU user 占用率触发阈值
cpuidle.sys_threshold 45(%) 容器 CPU sys 占用率触发阈值
cpuidle.usage_threshold 90(%) 容器 CPU 总占用率触发阈值
cpuidle.delta_user_threshold 45(%) 容器 CPU user 占用率增量触发阈值
cpuidle.delta_sys_threshold 20(%) 容器 CPU sys 占用率增量触发阈值
cpuidle.delta_usage_threshold 55(%) 容器 CPU 总占用率增量触发阈值
cpuidle.interval 10(秒) 检测间隔
cpuidle.interval_tracing 1800(秒) 同一容器触发冷却时间
cpuidle.run_tracing_tool_timeout 10(秒) perf 火焰图采集超时
cpusys.sys_threshold 45(%) 物理机 CPU sys 占用率触发阈值
cpusys.delta_sys_threshold 20(%) 物理机 CPU sys 占用率增量触发阈值
cpusys.interval 10(秒) 检测间隔
cpusys.run_tracing_tool_timeout 10(秒) perf 火焰图采集超时
dload.threshold_load 5 容器不可中断进程负载 EMA 触发阈值
dload.interval 10(秒) 检测间隔
dload.interval_tracing 1800(秒) 同一容器触发冷却时间
iotracing.rbps_threshold 2000(MB/s) 磁盘读吞吐率触发阈值
iotracing.wbps_threshold 1500(MB/s) 磁盘写吞吐率触发阈值
iotracing.util_threshold 90(%) 磁盘 IO 利用率触发阈值
iotracing.await_threshold 100(ms) 磁盘 IO 平均等待时间触发阈值
iotracing.run_tracing_tool_timeout 10(秒) IO 调用栈采集超时
iotracing.max_proc_dump 10 最多采集的高 IO 进程数
iotracing.max_files_per_proc_dump 5 每个进程最多采集的打开文件数
memburst.delta_memory_burst 100(%) 匿名内存相对滑动窗口最早采样的增长率阈值(100% 即 ≥ 2 倍时触发)
memburst.delta_anon_threshold 70(%) 匿名内存占物理机总内存的比例阈值
memburst.interval 10(秒) 检测间隔
memburst.interval_tracing 1800(秒) 触发冷却时间
memburst.sliding_window_length 60 滑动窗口采样数(对应 600 秒历史数据)
memburst.dump_process_max_num 10 最多采集的内存消耗进程数

事件列表

事件名称(tracer_name) 观测对象 触发条件 典型场景
cpusys 物理机 sys > 45% 或 delta_sys > 20% 内核态 CPU 突增、系统调用热点
cpuidle 容器 (user>75% 且 delta_user>45%) 或 (sys>45% 且 delta_sys>20%) 或 (total>90% 且 delta_total>55%) 容器 CPU 使用率突增、热点函数分析
dload 容器 不可中断进程负载 EMA > 5 D 状态进程堆积、IO 阻塞
iotracing 物理机 磁盘 IO 指标连续两次超阈值 磁盘 IO 打满、IO 等待高延迟
memburst 物理机 匿名内存 ≥ 窗口最早值 2 倍且占总内存 ≥ 70% 内存突发分配、OOM 前兆

通用字段说明

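All event records contain the following common fields:

  • hostname: hostname of the physical machine
  • region: availability zone of the physical machine
  • uploaded_time: time the data was uploaded
  • container_id: container ID, recorded when the event is associated with a container
  • container_hostname: container hostname, recorded when the event is associated with a container
  • container_host_namespace: Kubernetes namespace of the container, recorded when the event is associated with a container
  • container_type: container type
  • container_qos: container QoS level
  • tracer_name: event name (e.g. cpusys, memburst)
  • tracer_id: ID of this tracing run
  • tracer_time: time the tracing was triggered
  • tracer_type: trigger type (manual or automatic)
  • tracer_data: event-specific data (see the description of each event)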

1. cpusys

功能描述 周期性读取 /proc/stat,计算物理机 CPU sys 占用率及相邻两次采样的增量。当 sys 占用率超过阈值(默认 45%)或增量超过阈值(默认 20%)时,触发系统级 perf 采样,生成全机 CPU 火焰图数据。

数据存储 事件数据自动存储至 Elasticsearch 或物理机磁盘文件。

示例数据

{
    "tracer_name": "cpusys",
    "tracer_data": {
        "now_sys": 52,
        "sys_threshold": 45,
        "deltasys": 25,
        "deltasys_threshold": 20,
        "flamedata": [
            {"level": 0, "value": 1000, "self": 0, "label": "all"},
            {"level": 1, "value": 350, "self": 350, "label": "do_syscall_64"}
        ]
    }
}

字段含义解释

  • now_sys:触发时物理机 CPU sys 占用率(%)
  • sys_threshold:sys 占用率触发阈值(%)
  • deltasys:相邻两次采样的 sys 占用率增量(%)
  • deltasys_threshold:sys 增量触发阈值(%)
  • flamedata:perf 采样生成的火焰图帧数据列表,每帧包含:
    • level:调用栈层级深度
    • value:该帧(含子帧)的采样计数
    • self:该帧自身(不含子帧)的采样计数
    • label:函数或进程名称标签
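
The cpusys check can be illustrated with two samples of the aggregate cpu line in /proc/stat. This is a sketch under stated assumptions: sys is taken as the system column only (whether HUATUO also counts irq/softirq time is not stated), and the function names are hypothetical.

```python
def parse_cpu_line(line):
    """Return the numeric columns of the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def sys_percent(prev_line, curr_line):
    """CPU sys share (%) between two samples of the 'cpu' line."""
    prev, curr = parse_cpu_line(prev_line), parse_cpu_line(curr_line)
    # Columns: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already included in user/nice, so only the first 8 are summed.
    total = sum(curr[:8]) - sum(prev[:8])
    if total <= 0:
        return 0.0
    return 100.0 * (curr[2] - prev[2]) / total


def cpusys_triggered(now_sys, prev_sys, sys_threshold=45, delta_threshold=20):
    """Trigger when sys or its increase since the last sample exceeds a threshold."""
    return now_sys > sys_threshold or (now_sys - prev_sys) > delta_threshold


prev = "cpu  100 0 100 800 0 0 0 0 0 0"
curr = "cpu  140 0 152 808 0 0 0 0 0 0"
```

In this synthetic example 52 of the 100 elapsed ticks are system time, so sys_percent(prev, curr) is 52.0 and the default 45% threshold is exceeded.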

2. cpuidle

功能描述 周期性读取容器 cgroup CPU 统计,计算容器 CPU user、sys、总占用率及各指标的相邻增量。当任意一组阈值条件成立时(user>75% 且 delta_user>45%,或 sys>45% 且 delta_sys>20%,或 total>90% 且 delta_total>55%),触发容器级 perf 采样生成火焰图。同一容器默认 30 分钟冷却,避免重复触发。支持通过容器过滤器(filter)排除特定容器。

数据存储 事件数据自动存储至 Elasticsearch 或物理机磁盘文件。

示例数据

{
    "tracer_name": "cpuidle",
    "tracer_data": {
        "user": 80,
        "user_threshold": 75,
        "deltauser": 48,
        "deltauser_threshold": 45,
        "sys": 12,
        "sys_threshold": 45,
        "deltasys": 5,
        "deltasys_threshold": 20,
        "usage": 92,
        "usage_threshold": 90,
        "deltausage": 53,
        "deltausage_threshold": 55,
        "flamedata": [
            {"level": 0, "value": 1000, "self": 0, "label": "all"},
            {"level": 1, "value": 800, "self": 800, "label": "java/com.example.App.main"}
        ]
    }
}

字段含义解释

  • user / user_threshold:触发时容器 CPU user 占用率(%)及其阈值
  • deltauser / deltauser_threshold:user 占用率增量(%)及其阈值
  • sys / sys_threshold:触发时容器 CPU sys 占用率(%)及其阈值
  • deltasys / deltasys_threshold:sys 占用率增量(%)及其阈值
  • usage / usage_threshold:触发时容器 CPU 总占用率(%)及其阈值
  • deltausage / deltausage_threshold:总占用率增量(%)及其阈值
  • flamedata:容器级 perf 采样火焰图帧数据,字段含义同 cpusys
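
The three cpuidle conditions pair each usage value with its increase, and any one pair is sufficient. The sketch below evaluates them against the tracer_data keys of the sample above; the function name matched_conditions is hypothetical.

```python
PAIRS = (("user", "deltauser"), ("sys", "deltasys"), ("usage", "deltausage"))


def matched_conditions(data):
    """Return the names of the value/delta pairs whose thresholds are both exceeded."""
    return [
        value
        for value, delta in PAIRS
        if data[value] > data[value + "_threshold"]
        and data[delta] > data[delta + "_threshold"]
    ]


sample = {
    "user": 80, "user_threshold": 75, "deltauser": 48, "deltauser_threshold": 45,
    "sys": 12, "sys_threshold": 45, "deltasys": 5, "deltasys_threshold": 20,
    "usage": 92, "usage_threshold": 90, "deltausage": 53, "deltausage_threshold": 55,
}
```

For the sample, only the user pair matches: total usage is above 90%, but its increase of 53 stays below the 55 threshold, so that pair alone would not have triggered.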

3. dload

功能描述 通过 netlink 及 cgroup 读取容器内进程状态,对不可中断(D 状态)进程的负载贡献进行指数加权移动平均(EMA)计算。当容器 D 状态负载 EMA 超过阈值(默认 5)时,采集容器内及宿主机中所有 D 状态进程的内核调用栈,支持已知问题过滤(issues_list)降低误报率。同一容器默认 30 分钟冷却。

数据存储 事件数据自动存储至 Elasticsearch 或物理机磁盘文件。

示例数据

{
    "tracer_name": "dload",
    "tracer_data": {
        "threshold": 5,
        "nr_sleeping": 120,
        "nr_running": 4,
        "nr_stopped": 0,
        "nr_uninterruptible": 8,
        "nr_iowait": 3,
        "load_avg": 7.23,
        "dload_avg": 6.81,
        "known_issue": "",
        "stack": "task:java            state:D stack:    0 pid: 12345 tgid: 12345 ...\n  io_schedule+0x18/0x40\n  ext4_file_write_iter+0x..."
    }
}

字段含义解释

  • threshold:D 状态负载 EMA 触发阈值
  • nr_sleeping:容器内睡眠状态进程数
  • nr_running:容器内运行状态进程数
  • nr_stopped:容器内停止状态进程数
  • nr_uninterruptible:容器内不可中断(D 状态)进程数
  • nr_iowait:容器内 IO 等待状态进程数
  • load_avg:触发时容器负载均值
  • dload_avg:触发时容器 D 状态负载 EMA 值
  • known_issue:命中的已知问题描述(为空表示未命中)
  • stack:D 状态进程的内核调用栈(多进程多行文本)
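
The exponentially weighted moving average behind dload_avg can be sketched in the style of the kernel load average. The decay constant is an assumption: a 60-second window sampled every 10 seconds is used here purely for illustration, and ema_update is a hypothetical name.

```python
import math


def ema_update(prev, sample, interval_s=10.0, window_s=60.0):
    """One EMA step: old value decays, the new sample fills the remainder."""
    decay = math.exp(-interval_s / window_s)  # assumption: 60 s window
    return prev * decay + sample * (1.0 - decay)


def samples_until_trigger(nr_uninterruptible, threshold=5.0, limit=1000):
    """Count samples needed for a constant D-state count to push the EMA over threshold."""
    value = 0.0
    for n in range(1, limit + 1):
        value = ema_update(value, nr_uninterruptible)
        if value > threshold:
            return n
    return None
```

Under these assumed constants, 8 processes stuck in D state push the average above the default threshold of 5 after six samples (about one minute), while a brief spike of one or two samples does not.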

4. iotracing

功能描述 以 5 秒间隔轮询 /proc/diskstats,计算各磁盘设备的读写吞吐率、IO 利用率及 IO 等待时间。当任一指标连续两次采样均超过对应阈值时触发(自动忽略 md 设备),采集高 IO 进程列表(含各进程的读写字节数及打开文件统计)以及正在等待 IO 调度的进程内核调用栈。

数据存储 事件数据自动存储至 Elasticsearch 或物理机磁盘文件。

示例数据

{
    "tracer_name": "iotracing",
    "tracer_data": {
        "reason_snapshot": {
            "type": "ioutil",
            "device": "sda",
            "iostatus": {
                "read_bps": 120,
                "read_iops": 450,
                "read_await": 12,
                "write_bps": 2100,
                "write_iops": 890,
                "write_await": 145,
                "io_util": 95,
                "queue_size": 32
            }
        },
        "process_io_data": [
            {
                "pid": 12345,
                "comm": "java",
                "container_hostname": "app-pod-xxx",
                "fs_read": 0,
                "fs_write": 52428800,
                "disk_read": 0,
                "disk_write": 49152000,
                "file_stat": ["/data/logs/app.log"],
                "file_count": 1
            }
        ],
        "timeout_io_stack": [
            {
                "pid": 12345,
                "comm": "java",
                "container_hostname": "app-pod-xxx",
                "latency_us": 250000,
                "stack": {
                    "back_trace": [
                        "io_schedule+0x18/0x40",
                        "ext4_file_write_iter+0x2a0/0x4c0"
                    ]
                }
            }
        ]
    }
}

字段含义解释

  • reason_snapshot:触发 IO 采集的原因快照
    • type:触发类型(ioutil IO 利用率 / read_bps 读吞吐率 / write_bps 写吞吐率 / read_await 读等待时间 / write_await 写等待时间)
    • device:触发阈值的磁盘设备名称
    • iostatus:触发时各磁盘 IO 指标快照(read_bps/write_bps 单位 MB/s,read_await/write_await 单位 ms,io_util 单位 %,queue_size 为队列深度)
  • process_io_data:高 IO 进程列表,每条记录包含:
    • pid / comm:进程 PID 与进程名
    • container_hostname:进程所在容器 hostname(宿主机进程为空)
    • fs_read / fs_write:进程文件系统层面的读写字节数
    • disk_read / disk_write:进程磁盘层面的实际读写字节数
    • file_stat:进程当前打开的文件路径列表
    • file_count:进程打开的文件总数
  • timeout_io_stack:等待 IO 调度的进程调用栈列表,每条记录包含:
    • pid / comm:进程 PID 与进程名
    • container_hostname:进程所在容器 hostname
    • latency_us:IO 等待时长(微秒)
    • stack.back_trace:内核调用栈帧列表
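
The "two consecutive samples" rule can be sketched as a per-device counter that resets whenever a sample is back under all thresholds. Thresholds are the defaults from the parameter table; applying the single await threshold to both read and write await, the check order, and counting per device rather than per metric are assumptions.

```python
# reason -> (iostatus key, default threshold)
THRESHOLDS = {
    "ioutil": ("io_util", 90),        # %
    "read_bps": ("read_bps", 2000),   # MB/s
    "write_bps": ("write_bps", 1500), # MB/s
    "read_await": ("read_await", 100),   # ms, assumption: shared await threshold
    "write_await": ("write_await", 100), # ms, assumption: shared await threshold
}


def first_breach(iostatus):
    """Return the first reason whose metric exceeds its threshold, else None."""
    for reason, (key, limit) in THRESHOLDS.items():
        if iostatus[key] > limit:
            return reason
    return None


class ConsecutiveBreach:
    """Fire only when a device breaches a threshold on `required` samples in a row."""

    def __init__(self, required=2):
        self.required = required
        self.counts = {}

    def observe(self, device, iostatus):
        reason = first_breach(iostatus)
        if reason is None:
            self.counts[device] = 0
            return None
        self.counts[device] = self.counts.get(device, 0) + 1
        return reason if self.counts[device] >= self.required else None


busy = {"read_bps": 120, "read_iops": 450, "read_await": 12, "write_bps": 2100,
        "write_iops": 890, "write_await": 145, "io_util": 95, "queue_size": 32}
calm = {**busy, "write_bps": 50, "write_await": 5, "io_util": 20}
```

Requiring two breaches in a row filters out single-sample spikes at the cost of one extra polling interval before collection starts.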

5. memburst

功能描述 周期性采样物理机匿名内存(anonymous memory)使用量,维护长度为 60 个采样点(对应 600 秒)的滑动窗口。当当前匿名内存 ≥ 窗口最早采样值的 2 倍,且匿名内存占物理机总内存 ≥ 70% 时触发,采集内存消耗最多的前 N 个进程(默认 10 个)的 PID、进程名和 RSS 内存值。默认 30 分钟冷却。

数据存储 事件数据自动存储至 Elasticsearch 或物理机磁盘文件。

示例数据

{
    "tracer_name": "memburst",
    "tracer_data": {
        "top_memory_usage": [
            {
                "pid": 3456,
                "process_name": "java",
                "memory_size": 8589934592
            },
            {
                "pid": 3789,
                "process_name": "python3",
                "memory_size": 2147483648
            }
        ]
    }
}

字段含义解释

  • top_memory_usage:内存消耗最多的进程列表(按 RSS 降序排列),每条记录包含:
    • pid:进程 PID
    • process_name:进程名称
    • memory_size:进程 RSS 内存占用(字节)

⚙️ 原理

整体架构

HUATUO AutoTracing 以周期性轮询为基础,结合 eBPF 调用栈采集与 perf 火焰图生成,在内核层实现低开销的异常诊断数据采集。

graph TB
    subgraph "数据来源"
        P1["/proc/stat\n(物理机 CPU 占用率)"]
        P2["cgroup CPU 统计\n(容器 CPU 占用率)"]
        P3["netlink / cgroup\n(容器进程状态 / 负载均值)"]
        P4["/proc/diskstats\n(磁盘 IO 指标)"]
        P5["/proc/meminfo\n+ cgroup 内存统计"]
    end

    subgraph "HUATUO AutoTracing"
        DT["阈值检测\n(滑动窗口 / EMA / 连续两次超阈值)"]
        BO["冷却策略\n(30 分钟 backoff)"]
        PERF["perf 火焰图采集\n(系统级 / 容器级)"]
        BPF["eBPF kprobe\n(IO 调度延迟追踪)"]
        CM["容器信息关联\n(cgroup → ContainerID)"]
    end

    subgraph "存储"
        ES["Elasticsearch"]
        DISK["本地磁盘文件"]
    end

    P1 --> DT
    P2 --> DT
    P3 --> DT
    P4 --> DT
    P5 --> DT
    DT --> BO
    BO --> PERF
    BO --> BPF
    PERF --> CM
    BPF --> CM
    CM --> ES
    CM --> DISK

事件处理流程

sequenceDiagram
    participant M as 周期性指标采集
    participant D as 阈值检测器
    participant B as 冷却策略(backoff)
    participant C as 现场数据采集器
    participant S as 存储

    M->>D: 推送指标(每 10 秒)
    D->>D: 阈值判断(滑动窗口 / EMA / 连续两次)
    alt 超过阈值
        D->>B: 检查冷却状态
        alt 允许触发
            B->>C: 触发现场采集<br/>(perf 火焰图 / D 状态进程栈 / IO 进程列表)
            C->>C: 关联容器信息(cgroup → ContainerID)
            C->>S: 持久化存储(Elasticsearch / 本地文件)
        else 冷却期内
            B-->>D: 跳过本次触发
        end
    end
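
The cooldown gate in the flow above can be sketched as a per-key timestamp table, keyed for example by container ID. The 1800-second default matches the interval_tracing parameters; the class name and the injected clock value are illustrative.

```python
class Cooldown:
    """Allow at most one trigger per key within `seconds`."""

    def __init__(self, seconds=1800):
        self.seconds = seconds
        self.last_trigger = {}

    def allow(self, key, now):
        """Return True and start a new cooldown if `key` is not cooling down."""
        last = self.last_trigger.get(key)
        if last is not None and now - last < self.seconds:
            return False  # still in cooldown: skip this trigger
        self.last_trigger[key] = now
        return True
```

Passing the current time in as an argument, rather than reading the clock inside allow, keeps the gate easy to test.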

4 - 硬件故障诊断

概述

HUATUO 华佗以零侵入、低开销的方式持续监听 Linux 内核上报的硬件错误事件,将结构化的故障记录持久化存储,并以 Prometheus 指标形式对外暴露汇总计数器,供告警与可视化系统使用。

应用场景

  • 通用计算

    大规模服务器集群中,内存 ECC 可纠正错误(CE)是常见的低级别故障信号。单次 CE 可由硬件自动修复,但若同一 DIMM 上 CE 频率持续升高,则预示着内存条即将失效。华佗通过 EDAC/MCE tracepoint 实时感知此类事件,使工程团队能够在内存彻底失效前完成预防性换件,避免意外宕机。

  • AI 计算

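    AI training jobs place very high demands on hardware reliability: a single faulty PCIe device can cause an entire training job to fail. HUATUO monitors PCIe AER events and reports link-layer and transaction-layer errors (Data Link Protocol Error, ECRC Error, and others) from GPUs, NVLink bridges and RDMA NICs (such as InfiniBand HCAs) in real time. This gives the AI cluster scheduler hardware health data to support fast isolation of faulty nodes and job migration.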

  • 存储服务

    存储服务器通常配备大量 PCIe NVMe SSD 和 HBA 卡。PCIe AER 中的 Completion Timeout、Malformed TLP 等错误是存储设备性能抖动或掉线的先兆。华佗监控数据可与存储 IO 延迟指标联动,支撑根因分析。

  • 安全合规

    金融、政务等对合规有严格要求的行业,需要完整记录所有硬件故障历史。结构化事件存储(含时间戳、设备标识、错误类型、原始寄存器值)可直接作为硬件健康日志的合规存证。

监控原理

HUATUO 华佗通过 eBPF 技术观测内核的 MCE / EDAC / ACPI GHES / PCIe AER 子系统,当 eBPF tracepoint 被触发时,将原始事件写入 BPF Perf Event Buffer。用户态程序读取事件,解析结构体字段,生成结构化记录,并存储至本地或远端。总体架构如下:

RAS 原理

Linux 内核的 RAS 体系由多个相对独立的子系统协同构成,共同覆盖从 CPU 内部错误到 PCIe 链路错误的完整硬件故障谱系。

graph TB
    subgraph HW["硬件层"]
        CPU["CPU\nx86 / x86-64"]
        MEM["内存\nDDR4/DDR5 DIMM ECC"]
        Platform["平台硬件\nSoC / PCH"]
        PCIeDev["PCIe 设备\nGPU / NVMe / HCA / FPGA"]
    end

    subgraph FW["固件层"]
        BIOS["BIOS / UEFI\nCPER 缓冲区(APEI)"]
    end

    subgraph Kernel["Linux 内核 RAS 子系统"]
        MCE["MCE 子系统\narch/x86/kernel/cpu/mce"]
        EDAC["EDAC 子系统\ndrivers/edac"]
        GHES["ACPI GHES 子系统\ndrivers/acpi/apei"]
        AER["PCIe AER 子系统\ndrivers/pci/pcie/aer"]
    end

    subgraph TP["内核 Tracepoint"]
        TP1["tracepoint/mce/mce_record"]
        TP2["tracepoint/ras/mc_event"]
        TP3["tracepoint/ras/non_standard_event"]
        TP4["tracepoint/ras/aer_event"]
    end

    CPU -->|"MCE 异常(#MC)+ THR 中断"| MCE
    MEM -->|ECC 错误| EDAC
    Platform -->|APEI 错误记录| BIOS
    BIOS -->|CPER 缓冲区| GHES
    PCIeDev -->|AER 中断| AER

    MCE --> TP1
    EDAC --> TP2
    GHES --> TP3
    AER --> TP4

  • MCE

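    MCE (Machine Check Exception) is the exception raised by MCA (Machine Check Architecture), the hardware fault-tolerance mechanism built into the processor and defined by Intel and AMD in their respective architecture specifications. The processor contains a number of banks (Machine Check Banks), each corresponding to one class of hardware resource (such as L1 cache, L2 cache, memory controller or TLB). When a hardware error is detected, the MSRs of the corresponding bank (MCi_STATUS, MCi_ADDR, MCi_MISC) are filled with error information and an MCE is raised.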

  • MCE THR

    MCE 支持阈值中断机制。当某类可纠正错误的计数超过预设阈值时,触发专用 APIC 中断(THR),而不升级为完整的 MCE 异常。此机制允许操作系统在错误频率异常升高时提前告警,而非等到错误完全不可纠正时才介入。

  • EDAC

    EDAC(Error Detection And Correction)是 Linux 内核中专门处理内存和硬件 ECC 错误的子系统,其目标是"检测并报告运行在 Linux 下的计算机系统中发生的硬件错误"。EDAC 驱动直接与内存控制器通信,解析 ECC 错误的物理位置(内存控制器编号、Channel、Slot、行列地址)。

  • ACPI GHES

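    ACPI GHES (Generic Hardware Error Source) is a platform-independent hardware error reporting mechanism. It is defined by APEI (ACPI Platform Error Interface) in the ACPI specification and implemented by the BIOS/UEFI firmware. The firmware writes hardware errors that no specific driver can handle (such as SoC-internal errors or platform-specific memory errors) into the CPER (Common Platform Error Record) buffer referenced by the GHES descriptor. The Linux kernel reads the CPER records and reports the "non-standard" sections that the standard subsystems cannot parse.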

  • PCIe AER

    PCIe AER(Advanced Error Reporting)是 PCIe 规范定义的错误上报机制,允许 PCIe 设备向操作系统精确报告链路层和事务层的错误类型。

指标总览

  • RAS 指标

    # HELP huatuo_bamai_ras_hw_total total RAS hardware error events by source type
    # TYPE huatuo_bamai_ras_hw_total counter
    huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="acpi"} 0
    huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="aer"} 0
    huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="edac"} 0
    huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="mce"} 0
    huatuo_bamai_ras_hw_total{host="hostname",region="dev",type="thr"} 0
    
  • 网卡丢包

    huatuo_bamai_netdev_hw_rx_dropped_total{host="hostname",region="dev",device="eth0",driver="ixgbe"} 0
    
  • RDMA PFC

    # HELP huatuo_bamai_netdev_dcb_pfc_received_total count of the received pfc frames
    # TYPE huatuo_bamai_netdev_dcb_pfc_received_total counter
    huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="0",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="1",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="2",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="3",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="4",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="5",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="6",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_received_total{device="enp6s0f0np0",host="hostname",prio="7",region="dev"} 0
    # HELP huatuo_bamai_netdev_dcb_pfc_send_total count of the sent pfc frames
    # TYPE huatuo_bamai_netdev_dcb_pfc_send_total counter
    huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="0",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="1",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="2",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="3",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="4",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="5",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="6",region="dev"} 0
    huatuo_bamai_netdev_dcb_pfc_send_total{device="enp6s0f0np0",host="hostname",prio="7",region="dev"} 0
    
  • 结构化存储

    此外,每个硬件错误事件均以结构化形式持久化(存储于本地 huatuo-local 目录或远端 ES/OS 存储等),包含以下公共字段:

    {
        "hostname": "hostname",
        "region": "dev",
        "uploaded_time": "2026-03-05T18:28:39.153438921+08:00",
        "time": "2026-03-05 18:28:39.153 +0800",
        "tracer_name": "netdev_event",
        "tracer_time": "2026-03-05 18:28:39.153 +0800",
        "tracer_type": "auto",
        "tracer_data": {
            "ifname": "eth0",
            "index": 2,
            "linkstatus": "linkstatus_admindown",
            "mac": "5c:6f:11:11:11:11",
            "start": false
        }
    }
    

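    Possible values of the linkstatus field:

    • linkstatus_adminup: the administrator brought the NIC up, e.g. ip link set dev eth0 up
    • linkstatus_admindown: the administrator brought the NIC down, e.g. ip link set dev eth0 down
    • linkstatus_carrierup: the physical link recovered
    • linkstatus_carrierdown: the physical link failed

    A RAS event is stored in the same envelope, for example:
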
    {
        "hostname": "localhost",
        "region": "xxx",
        "uploaded_time": "2026-05-11T16:58:47.328548319+08:00",
        "time": "2026-05-11 16:58:47.328 +0800",
        "tracer_name": "ras",
        "tracer_time": "2026-05-11 16:58:47.328 +0800",
        "tracer_type": "auto",
        "tracer_data": {
            "dev": "MEM",
            "event": "EDAC",
            "type": "Corrected",
            "timestamp": 537792166031,
            "info": "{\"err_count\":0,\"err_type\":\"Corrected\",\"err_msg\":\"memory read error\",\"label\":\"CPU_SrcID#0_Ha#0_Chan#0_DIMM#0\",\"mc_index\":0,\"top_layer\":0,\"mid_layer\":0,\"low_layer\":-1,\"addr\":7860269056,\"grain\":128,\"syndrome\":0,\"driver\":\" area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0\"}"
        }
    }
    
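    Field | Meaning
    Device | Identifier of the failing hardware component (e.g. CPU/MEM, MEM, ACPI, PCIe 0000:01:00.0)
    Event | Event subtype (MCE, EDAC, APIC, AER)
    ErrType | Error severity (see the next table)
    Timestamp | Timestamp
    Info | Detailed fields of the specific event

    Error type | Meaning | Typical source
    Corrected | Corrected automatically by hardware; not noticeable to the system | MCE CE, EDAC CE, ACPI Sev=1, AER Severity=2
    UncorrectedRecoverable | Not correctable by hardware, but recoverable by system software | MCE UE, EDAC UE, ACPI Sev=2, AER Severity=0
    UncorrectedDeferred | Not correctable by hardware; handling is deferred | MCE MCI_STATUS_DEFERRED, EDAC HW_EVENT_ERR_DEFERRED
    UncorrectedFatal | Fatal error that hardware cannot correct; an immediate reboot is required | EDAC FATAL, ACPI Sev=3, AER Severity=1
    Info | Error type that the system is only expected to log | EDAC HW_EVENT_ERR_INFO, ACPI Sev=0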

详细说明

  • MCE

    监控部件:CPU 核心、L1/L2/L3 Cache、TLB、内存控制器(IMC)、互连总线(QPI/UPI/Infinity Fabric)。

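    Field | MSR source | Meaning
    mcg_cpu_cap | MCG_CAP | Machine-check global capability register. The low 8 bits (Count) give the number of MC banks.
    mcg_msr_status | MCG_STATUS | Machine-check global status register.
    banks_msr_status | MCi_STATUS | Bank status register (the most important field). The low 16 bits hold the MCA error code, which classifies the error (memory hierarchy error, bus error, and so on); the high bits carry control flags such as UC (uncorrected), EN (enabled), MISCV (MISC valid), ADDRV (ADDR valid) and PCC (processor context corrupt).
    banks_msr_addr | MCi_ADDR | Physical memory address of the error (valid only when MCi_STATUS.ADDRV=1). Can be used to locate the faulty DIMM or cache line.
    banks_msr_misc | MCi_MISC | Supplementary information register (valid only when MCi_STATUS.MISCV=1).
    mca_synd_msr | MCA_SYND | Syndrome register (AMD only).
    mca_ipid_msr | MCA_IPID | Instance ID register (AMD only).
    instr_pointer | RIP register | Instruction pointer at the time of the MCE (reliable only when MCG_STATUS.EIPV=1).
    tsc_timestamp | TSC | CPU time stamp counter value when the error occurred (can be converted to absolute time against the kernel clock).
    walltime | Kernel time | Unix timestamp (seconds) when the error occurred.
    cpu | - | Logical CPU number on which the MCE occurred.
    cpuid | CPUID | CPUID value of the CPU on which the MCE occurred (contains Family/Model/Stepping).
    apicid | APIC ID | APIC ID of the CPU on which the MCE occurred (can be mapped to a physical core or hyperthread).
    socketid | - | CPU socket number (Socket ID). Distinguishes physical CPUs on multi-socket servers.
    code_seg | CS register | Code segment register value at the time of the MCE (used to determine the privilege level).
    bank | - | Bank number (typically Bank 0=L1I, Bank 1=L1D, Bank 2=L2, Bank 4+=memory controller, but the numbering varies by platform).
    cpuvendor | - | CPU vendor identifier: 0=Intel, 1=unknown, 2=AMD.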
  • EDAC

    监控部件:内存 ECC 错误。

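    Field | Meaning
    err_count | Number of errors accumulated in this event.
    err_type | Error severity.
    err_msg | Human-readable error description string (e.g. "CE memory read error on CPU#0Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:8 syndrome:0x0)").
    label | Physical location label of the DIMM (e.g. "CPU_SrcID#0_Ha#0_Chan#0_DIMM#0"), generated by the EDAC driver from the DIMM topology; it maps directly to the memory slot inside the machine.
    mc_index | Memory controller index (0-based). Distinguishes the IMCs on servers with several memory controllers.
    top_layer | Top-layer index of the memory hierarchy (usually the channel number; -1 means invalid).
    mid_layer | Middle-layer index of the memory hierarchy (usually the slot/rank number; -1 means invalid).
    low_layer | Bottom-layer index of the memory hierarchy (usually the bank/row number; -1 means invalid).
    addr | Physical memory address of the error (64-bit unsigned integer; 0 means the address is invalid).
    grain | Error granularity (grain size, in bytes): the smallest memory unit that may be affected.
    syndrome | ECC syndrome value.
    driver | EDAC driver name (e.g. "amd64_edac", "sb_edac").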
  • ACPI GHES

    监控部件:平台特定硬件错误。

    字段名 含义
    severity ACPI/CPER 错误严重级别原始值。
    sec_type 错误部分类型 GUID(16 字节,十六进制字符串)。由 UEFI 规范和各硬件厂商定义,标识错误记录所属的硬件类别(如内存错误部分、PCIe 错误部分、ARM 处理器错误部分等)。
    fru_id FRU(Field Replaceable Unit,现场可替换单元)标识符 GUID(16 字节,十六进制字符串)。唯一标识发生错误的可更换硬件组件(如某块内存条、某个 PCIe 卡)。
    fru_text FRU 人类可读描述字符串(如 "CPU0_DIMM_A1")。
    data_len 原始错误数据载荷长度(字节数)。
    raw_data 原始错误数据的十六进制转储(空格分隔字节)。用于深度诊断,需结合具体硬件厂商文档解析。
  • PCIe AER

    监控设备包括 GPU、NVMe SSD、RDMA 网卡/HCA、FPGA 加速卡、PCIe Switch 等。

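    Field | Meaning
    dev_name | PCIe device name in BDF format, e.g. "0000:03:00.0", which reads as Domain:Bus:Device.Function.
    err_type | Error severity (Corrected / Uncorrected / Fatal).
    err_reason | Description of the specific error cause, decoded from the bits of the AER status register (see the two tables that follow).
    tlp_header | Four-dword header of the TLP (Transaction Layer Packet) that triggered the error (format: {dword0, dword1, dword2, dword3}, hexadecimal). The TLP header contains the transaction type, address, requester ID and similar information, and is key data for finding the root cause. If TlpHeaderValid=0, "not available" is shown.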
  • PCIe 可纠正错误类型

    位掩码 含义
    0x00000001 接收端错误。物理层接收到不符合规范的数据符号,通常由信号完整性问题引起(如过长连线、阻抗不匹配)。
    0x00000040 TLP(事务层数据包)错误。数据包的 LCRC(链路层 CRC)校验失败,表明事务层数据在传输中发生翻转,PCIe 链路层会自动重传该 TLP。
    0x00000080 DLLP(数据链路层数据包)错误。链路层控制包(如 ACK/NAK、流控更新)CRC 校验失败。
    0x00000100 重传序列号溢出。REPLAY_NUM 字段用于追踪重传次数,该错误表明自上次 ACK 以来已发生过多次重传,通常意味着链路质量持续较差。
    0x00001000 重传计时器超时。发送方在规定时间内未收到 ACK,触发 TLP 重传。持续出现表明链路延迟异常或接收端处理能力不足。
    0x00002000 顾问性非致命错误。本质上是一个不可纠正但被软件降级为可纠正处理的错误(需启用 AER capability 中的 ANFE 功能),常见于接收到 Unsupported Request Completion 的场景。
    0x00004000 已纠正内部错误。设备内部 ECC 或奇偶校验错误,已由设备自主纠正。
    0x00008000 头部日志溢出。AER 头部日志寄存器已满,后续错误的 TLP 头部无法被记录(但错误本身仍被计数)。
  • PCIe 不可纠正错误类型

    位掩码 含义
    0x00000001 未定义错误。保留位被置位,通常表明固件或硬件存在不合规行为。
    0x00000010 数据链路协议错误。收到了违反 DLLP 协议规范的数据包,属于严重的链路层故障。
    0x00000020 意外下线错误。物理链路在未经 Hot-Plug 通知的情况下突然断开(如设备意外掉电或接触不良),为热插拔场景下的高危错误。
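    0x00001000 Poisoned TLP. A TLP was received whose EP (Error Poisoned) bit is set to 1, which means the upstream sender already knows the data is corrupted. This mechanism is used for error propagation and isolation and prevents silent data corruption.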
    0x00002000 流控协议错误。接收到违反 PCIe 流控信用(Credit)规则的数据包,属于严重的协议违规。
    0x00004000 完成超时。请求方(Requester)发出非 Posted 事务(如 Memory Read)后,在规定超时时间内未收到完成包(Completion)。常见于 NVMe 盘固件异常、RDMA 网卡驱动 Bug 或 PCIe 链路中断。
    0x00008000 完成方中止。接收端显式返回 CA(Completer Abort)状态的 Completion,表示请求被完成方拒绝。
    0x00010000 意外完成包。收到了无法与任何已发出的请求匹配的 Completion(Tag 不匹配),通常由设备固件 Bug 或数据路径错误引起。
    0x00020000 接收缓冲区溢出。接收端流控信用信息显示其缓冲区未满,但实际发生了溢出,属于严重的流控违规。
    0x00040000 格式错误的 TLP。数据包头部字段违反规范(如非法长度、保留字段被置位、不合法的地址范围),通常表明设备固件存在严重缺陷。
    0x00080000 端到端 CRC 错误。TLP 尾部的 ECRC 校验失败(需双端设备均支持 ECRC 功能),表明数据在整个传输链路(含 PCIe Switch 内部交换)中发生损坏,是高可靠性场景中的关键指标。
    0x00100000 不支持的请求错误。接收端返回 UR(Unsupported Request)状态,表明请求的事务类型或地址范围不被该设备支持。
    0x00200000 ACS(Access Control Services)违规。PCIe ACS 机制用于防止 PCIe 设备之间的对等(Peer-to-Peer)DMA 绕过 IOMMU,此错误表明发生了违反 ACS 策略的数据访问,在虚拟化安全场景中需重点关注。
    0x00400000 不可纠正的内部错误。设备内部发生无法自行纠正的 ECC 或奇偶校验错误(如 SRAM 双比特错误),通常意味着设备硬件损坏。
    0x00800000 多播(MC)TLP 被阻断。PCIe 多播(Multicast)TLP 被 ACS 或 MC 控制机制阻止。
    0x01000000 原子操作出口被阻断。AtomicOp(原子操作请求,如 FetchAdd、Swap、CAS)因 ACS 控制被阻止出站,常见于 RDMA/GPU 直连场景。
    0x02000000 TLP 前缀被阻断。带有 End-End TLP Prefix 的数据包被 ACS 或其他机制阻止转发。
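
The two bitmask tables can be turned into a small decoder that maps an AER status value to the names of the errors it contains. The masks are the ones listed above; the English names follow the PCIe specification's terms, and the function name decode_aer_status is hypothetical.

```python
CORRECTABLE = {
    0x00000001: "Receiver Error",
    0x00000040: "Bad TLP",
    0x00000080: "Bad DLLP",
    0x00000100: "REPLAY_NUM Rollover",
    0x00001000: "Replay Timer Timeout",
    0x00002000: "Advisory Non-Fatal Error",
    0x00004000: "Corrected Internal Error",
    0x00008000: "Header Log Overflow",
}

UNCORRECTABLE = {
    0x00000001: "Undefined",
    0x00000010: "Data Link Protocol Error",
    0x00000020: "Surprise Down Error",
    0x00001000: "Poisoned TLP",
    0x00002000: "Flow Control Protocol Error",
    0x00004000: "Completion Timeout",
    0x00008000: "Completer Abort",
    0x00010000: "Unexpected Completion",
    0x00020000: "Receiver Overflow",
    0x00040000: "Malformed TLP",
    0x00080000: "ECRC Error",
    0x00100000: "Unsupported Request Error",
    0x00200000: "ACS Violation",
    0x00400000: "Uncorrectable Internal Error",
    0x00800000: "MC Blocked TLP",
    0x01000000: "AtomicOp Egress Blocked",
    0x02000000: "TLP Prefix Blocked",
}


def decode_aer_status(status, table):
    """Return the names of all error bits set in `status`, lowest bit first."""
    return [name for mask, name in table.items() if status & mask]
```

A status word can have several bits set at once, so the decoder returns a list; for example, an uncorrectable status of 0x4000 decodes to Completion Timeout.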

总结

Deploying HUATUO in production is recommended for comprehensive hardware error monitoring and proactive operations.