AutoTracing
Overview
AutoTracing is an event-driven automatic diagnosis mechanism. When a host or container shows performance anomalies (such as CPU spikes, accumulation of D-state processes, saturated disk IO, or sudden memory allocation), the system triggers on-site data collection automatically based on preset thresholds, with no manual intervention required.
Collected artifacts include eBPF flame graphs (system-wide or container-scoped CPU call stack samples via perf), D-state process kernel call stacks, disk IO call stacks, and process memory usage rankings. Each event type has a built-in cooldown period (30 minutes by default) to prevent redundant data from continuous triggers.
Five event types are supported: cpusys (host CPU sys spike), cpuidle (container CPU usage spike), dload (container D-state load spike), iotracing (disk IO anomaly), and memburst (memory burst allocation).
Use Cases
CPU Hotspot Analysis for AI Training Jobs: In GPU training clusters, intermittent training stalls are often caused by sudden increases in kernel-mode CPU usage (cpusys). When sys utilization exceeds the threshold, AutoTracing immediately triggers a system-wide perf flame graph collection, persisting kernel call stack hotspots as structured flame graph data (flamedata) for offline analysis after the anomaly has passed.
Container CPU Jitter Analysis in Kubernetes: In microservice architectures, brief container CPU spikes (cpuidle) may cause response timeouts, but the issue often recovers before alert responders can act. When container CPU exceeds the threshold, AutoTracing triggers container-scoped perf sampling and generates a flame graph scoped to the container’s cgroup, identifying hotspot functions and reducing time spent on log-based investigation.
D-State Process Accumulation in Cloud-Native Environments: Under high IO load or storage jitter, containers may accumulate large numbers of D-state (uninterruptible sleep) processes, causing system stalls. The dload event applies an exponentially weighted moving average (EMA) to the container’s uninterruptible process load. When the EMA exceeds the threshold, kernel call stacks are collected for all D-state processes inside the container and on the host, which helps pinpoint the blocking root cause.
Disk IO Bottleneck Root Cause Analysis: In data-intensive or log-heavy workloads, saturated disk IO utilization or write bandwidth causes application request backlog. iotracing continuously polls /proc/diskstats and triggers when any IO metric exceeds its threshold for two consecutive samples. It then collects a list of high-IO processes (with per-process read/write byte counts and open file details) and kernel call stacks of processes waiting in IO scheduling, narrowing down the processes responsible for high disk IO consumption.
Usage
Configuration
All events provide default values and work without configuration:
| Parameter | Default | Description |
|---|---|---|
| `cpuidle.user_threshold` | 75 (%) | Container CPU user utilization trigger threshold |
| `cpuidle.sys_threshold` | 45 (%) | Container CPU sys utilization trigger threshold |
| `cpuidle.usage_threshold` | 90 (%) | Container total CPU utilization trigger threshold |
| `cpuidle.delta_user_threshold` | 45 (%) | Container CPU user utilization delta trigger threshold |
| `cpuidle.delta_sys_threshold` | 20 (%) | Container CPU sys utilization delta trigger threshold |
| `cpuidle.delta_usage_threshold` | 55 (%) | Container total CPU utilization delta trigger threshold |
| `cpuidle.interval` | 10 (s) | Detection interval |
| `cpuidle.interval_tracing` | 1800 (s) | Per-container cooldown period between triggers |
| `cpuidle.run_tracing_tool_timeout` | 10 (s) | perf flame graph collection timeout |
| `cpusys.sys_threshold` | 45 (%) | Host CPU sys utilization trigger threshold |
| `cpusys.delta_sys_threshold` | 20 (%) | Host CPU sys utilization delta trigger threshold |
| `cpusys.interval` | 10 (s) | Detection interval |
| `cpusys.run_tracing_tool_timeout` | 10 (s) | perf flame graph collection timeout |
| `dload.threshold_load` | 5 | Container D-state process load EMA trigger threshold |
| `dload.interval` | 10 (s) | Detection interval |
| `dload.interval_tracing` | 1800 (s) | Per-container cooldown period between triggers |
| `iotracing.rbps_threshold` | 2000 (MB/s) | Disk read throughput trigger threshold |
| `iotracing.wbps_threshold` | 1500 (MB/s) | Disk write throughput trigger threshold |
| `iotracing.util_threshold` | 90 (%) | Disk IO utilization trigger threshold |
| `iotracing.await_threshold` | 100 (ms) | Disk IO average wait time trigger threshold |
| `iotracing.run_tracing_tool_timeout` | 10 (s) | IO call stack collection timeout |
| `iotracing.max_proc_dump` | 10 | Maximum number of high-IO processes to collect |
| `iotracing.max_files_per_proc_dump` | 5 | Maximum open files to collect per process |
| `memburst.delta_memory_burst` | 100 (%) | Anonymous memory growth rate threshold relative to the oldest sample in the sliding window (100% means a trigger at ≥ 2× the oldest sample) |
| `memburst.delta_anon_threshold` | 70 (%) | Threshold for anonymous memory as a percentage of total host memory |
| `memburst.interval` | 10 (s) | Detection interval |
| `memburst.interval_tracing` | 1800 (s) | Cooldown period between triggers |
| `memburst.sliding_window_length` | 60 | Sliding window sample count (corresponding to 600 seconds of history) |
| `memburst.dump_process_max_num` | 10 | Maximum number of top memory-consuming processes to collect |
Event List
| Event Name (tracer_name) | Target | Trigger Condition | Typical Scenario |
|---|---|---|---|
| `cpusys` | Host | sys > 45% or delta_sys > 20% | Kernel-mode CPU spike, syscall hotspot |
| `cpuidle` | Container | (user > 75% and delta_user > 45%) or (sys > 45% and delta_sys > 20%) or (total > 90% and delta_total > 55%) | Container CPU spike, hotspot function analysis |
| `dload` | Container | D-state process load EMA > 5 | D-state process accumulation, IO blocking |
| `iotracing` | Host | Any IO metric exceeds threshold for two consecutive samples | Saturated disk IO, high IO wait latency |
| `memburst` | Host | Anonymous memory ≥ 2× oldest window sample and ≥ 70% of total memory | Memory burst allocation, OOM precursor |
Fields
All event records include the following common fields:
- hostname: Physical host hostname
- region: Availability zone of the physical host
- uploaded_time: Data upload timestamp
- container_id: Container ID if the event is associated with a container
- container_hostname: Container hostname if the event is associated with a container
- container_host_namespace: Kubernetes namespace of the container
- container_type: Container type (e.g., `normal`, `sidecar`)
- container_qos: Container QoS level
- tracer_name: Event name (e.g., `cpusys`, `memburst`)
- tracer_id: Tracing session ID
- tracer_time: Time when the tracing was triggered
- tracer_type: Trigger type (manual or automatic)
- tracer_data: Event-specific private data (see individual event descriptions below)
1. cpusys
Description: Periodically reads /proc/stat to calculate host CPU sys utilization and the delta between consecutive samples. When sys utilization exceeds the threshold (default 45%) or the delta exceeds its threshold (default 20%), a system-wide perf sampling run is triggered to generate a full-host CPU flame graph.
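The check described above can be sketched as follows. This is a minimal illustration rather than HUATUO's implementation: the helper names (`parse_cpu_line`, `sys_percent`, `should_trigger`) are hypothetical, the utilization formula is one common way to derive sys % from two `/proc/stat` snapshots, and treating the delta as a signed rise is an assumption. Only the two default thresholds come from the configuration table.

```python
from typing import List

SYS_THRESHOLD = 45.0        # cpusys.sys_threshold default (%)
DELTA_SYS_THRESHOLD = 20.0  # cpusys.delta_sys_threshold default (%)


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into jiffy counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def sys_percent(prev: List[int], curr: List[int]) -> float:
    """Share of elapsed CPU time spent in kernel mode between two snapshots.

    Assumption: only the 'system' column (index 2) counts as sys; irq and
    softirq are left out here, and the real tool may account differently.
    Only the first 8 columns are summed because guest time is already
    included in user/nice.
    """
    total = sum(curr[:8]) - sum(prev[:8])
    if total <= 0:
        return 0.0
    return 100.0 * (curr[2] - prev[2]) / total


def should_trigger(prev_sys: float, now_sys: float) -> bool:
    """Fire when sys exceeds its threshold or rises sharply between samples."""
    return now_sys > SYS_THRESHOLD or (now_sys - prev_sys) > DELTA_SYS_THRESHOLD


# Three hypothetical snapshots taken one detection interval apart.
s0 = parse_cpu_line("cpu 1000 0 100 8900 0 0 0 0 0 0")
s1 = parse_cpu_line("cpu 1500 0 300 9200 0 0 0 0 0 0")
s2 = parse_cpu_line("cpu 1700 0 820 9480 0 0 0 0 0 0")
prev_sys, now_sys = sys_percent(s0, s1), sys_percent(s1, s2)
print(prev_sys, now_sys, should_trigger(prev_sys, now_sys))
```

Three snapshots are needed because the delta compares two utilization values, and each utilization value itself needs two counter readings.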
Storage: Event data is automatically stored in Elasticsearch or a local disk file.
Sample Data
{
"tracer_name": "cpusys",
"tracer_data": {
"now_sys": 52,
"sys_threshold": 45,
"deltasys": 25,
"deltasys_threshold": 20,
"flamedata": [
{"level": 0, "value": 1000, "self": 0, "label": "all"},
{"level": 1, "value": 350, "self": 350, "label": "do_syscall_64"}
]
}
}
Field Descriptions
- now_sys: Host CPU sys utilization at trigger time (%)
- sys_threshold: sys utilization trigger threshold (%)
- deltasys: sys utilization delta between consecutive samples (%)
- deltasys_threshold: sys delta trigger threshold (%)
- flamedata: Flame graph frame data from perf sampling. Each frame contains:
  - level: Call stack depth level
  - value: Sample count for this frame including descendant frames
  - self: Sample count for this frame excluding descendant frames
  - label: Function or process name label
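For offline analysis, the frame fields above are enough to rank hotspots and to rebuild folded stacks. The sketch below is illustrative only: `top_self_frames` and `folded_stacks` are hypothetical helper names, and `folded_stacks` assumes the frames are listed depth-first (parent before its children), which the sample data suggests but this document does not state.

```python
from typing import Any, Dict, List, Tuple

Frame = Dict[str, Any]


def top_self_frames(flamedata: List[Frame], n: int = 5) -> List[Tuple[str, float]]:
    """Return up to n (label, percent) pairs ranked by self sample count."""
    total = next((f["value"] for f in flamedata if f["level"] == 0), 0)
    if not total:
        return []
    ranked = sorted(flamedata, key=lambda f: f["self"], reverse=True)
    return [(f["label"], 100.0 * f["self"] / total) for f in ranked[:n] if f["self"] > 0]


def folded_stacks(flamedata: List[Frame]) -> List[str]:
    """Rebuild 'root;child;leaf count' lines, assuming depth-first frame order."""
    path: List[str] = []
    lines: List[str] = []
    for f in flamedata:
        del path[f["level"]:]   # drop frames at this depth or deeper
        path.append(f["label"])
        if f["self"] > 0:
            lines.append(f"{';'.join(path)} {f['self']}")
    return lines


# The two frames from the sample record above.
sample = [
    {"level": 0, "value": 1000, "self": 0, "label": "all"},
    {"level": 1, "value": 350, "self": 350, "label": "do_syscall_64"},
]
print(top_self_frames(sample))
print(folded_stacks(sample))
```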
2. cpuidle
Description: Periodically reads container cgroup CPU statistics to calculate container CPU user, sys, and total utilization along with their inter-sample deltas. A trigger fires if any of the following conditions holds: (user > 75% and delta_user > 45%), or (sys > 45% and delta_sys > 20%), or (total > 90% and delta_total > 55%). Container-scoped perf sampling is then run to generate a flame graph. A 30-minute per-container cooldown prevents repeated triggers. Specific containers can be excluded via the filter configuration.
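The three-way trigger rule can be written compactly. The following is a sketch under stated assumptions: the `Rule` type and `breached_rules` function are hypothetical names, and the defaults are copied from the configuration table. Each metric needs both its level and its delta above threshold, and any one metric satisfying that is enough to trigger.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    level: float  # absolute utilization threshold (%)
    delta: float  # inter-sample delta threshold (%)


DEFAULT_RULES: Dict[str, Rule] = {
    "user": Rule(level=75, delta=45),
    "sys": Rule(level=45, delta=20),
    "usage": Rule(level=90, delta=55),
}


def breached_rules(now: Dict[str, float], delta: Dict[str, float],
                   rules: Dict[str, Rule] = DEFAULT_RULES) -> List[str]:
    """Return metrics whose level AND delta both exceed their thresholds."""
    return [name for name, rule in rules.items()
            if now[name] > rule.level and delta[name] > rule.delta]


# Values taken from the cpuidle sample record in this section.
now = {"user": 80, "sys": 12, "usage": 92}
delta = {"user": 48, "sys": 5, "usage": 53}
print(breached_rules(now, delta))
```

With the sample values only the user rule fires: total usage is above 90% but its delta of 53% stays below the 55% delta threshold.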
Storage: Event data is automatically stored in Elasticsearch or a local disk file.
Sample Data
{
"tracer_name": "cpuidle",
"tracer_data": {
"user": 80,
"user_threshold": 75,
"deltauser": 48,
"deltauser_threshold": 45,
"sys": 12,
"sys_threshold": 45,
"deltasys": 5,
"deltasys_threshold": 20,
"usage": 92,
"usage_threshold": 90,
"deltausage": 53,
"deltausage_threshold": 55,
"flamedata": [
{"level": 0, "value": 1000, "self": 0, "label": "all"},
{"level": 1, "value": 800, "self": 800, "label": "java/com.example.App.main"}
]
}
}
Field Descriptions
- user / user_threshold: Container CPU user utilization at trigger time (%) and its threshold
- deltauser / deltauser_threshold: User utilization inter-sample delta (%) and its threshold
- sys / sys_threshold: Container CPU sys utilization at trigger time (%) and its threshold
- deltasys / deltasys_threshold: Sys utilization inter-sample delta (%) and its threshold
- usage / usage_threshold: Container total CPU utilization at trigger time (%) and its threshold
- deltausage / deltausage_threshold: Total utilization inter-sample delta (%) and its threshold
- flamedata: Container-scoped perf flame graph frame data; the fields have the same meaning as in `cpusys`
3. dload
Description: Reads container process states via netlink and cgroup, then computes an exponentially weighted moving average (EMA) of the load contribution from uninterruptible (D-state) processes per container. When the EMA exceeds the threshold (default 5), kernel call stacks are collected for all D-state processes inside the container and on the host. Known-issue filtering (issues_list) reduces false positives. A 30-minute per-container cooldown applies.
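The smoothing step can be illustrated with a small sketch. The exact EMA formula used by dload is not given in this document, so the decay factor below is an assumption modeled on the Linux load-average style `exp(-interval / window)`; the 60-second window and the `DLoadEMA` class name are hypothetical. The 10-second interval and the threshold of 5 are the documented defaults.

```python
import math


class DLoadEMA:
    """Smooths the per-container count of D-state processes over time."""

    def __init__(self, interval_s: float = 10.0, window_s: float = 60.0,
                 threshold: float = 5.0):
        # Assumed decay: weight of the previous average after one interval.
        self.decay = math.exp(-interval_s / window_s)
        self.threshold = threshold
        self.value = 0.0

    def update(self, nr_uninterruptible: int) -> bool:
        """Fold in one sample and report whether the EMA exceeds the threshold."""
        self.value = (self.value * self.decay
                      + nr_uninterruptible * (1.0 - self.decay))
        return self.value > self.threshold


ema = DLoadEMA()
# Eight D-state processes sustained across consecutive samples.
fired = [ema.update(8) for _ in range(8)]
print(fired)
```

Under these assumed parameters a single sample with many D-state processes does not trigger on its own; the condition has to persist for several intervals, which is the point of using an EMA instead of the raw count.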
Storage: Event data is automatically stored in Elasticsearch or a local disk file.
Sample Data
{
"tracer_name": "dload",
"tracer_data": {
"threshold": 5,
"nr_sleeping": 120,
"nr_running": 4,
"nr_stopped": 0,
"nr_uninterruptible": 8,
"nr_iowait": 3,
"load_avg": 7.23,
"dload_avg": 6.81,
"known_issue": "",
"stack": "task:java state:D stack: 0 pid: 12345 tgid: 12345 ...\n io_schedule+0x18/0x40\n ext4_file_write_iter+0x..."
}
}
Field Descriptions
- threshold: D-state load EMA trigger threshold
- nr_sleeping: Number of sleeping processes in the container
- nr_running: Number of running processes in the container
- nr_stopped: Number of stopped processes in the container
- nr_uninterruptible: Number of uninterruptible (D-state) processes in the container
- nr_iowait: Number of IO-waiting processes in the container
- load_avg: Container load average at trigger time
- dload_avg: Container D-state load EMA value at trigger time
- known_issue: Matched known issue description (empty if none matched)
- stack: Kernel call stacks of D-state processes (multi-process, multi-line text)
4. iotracing
Description: Polls /proc/diskstats at 5-second intervals to calculate per-disk read/write throughput, IO utilization, and IO wait time. md devices are excluded automatically. A trigger fires when any metric exceeds its threshold for two consecutive samples. On trigger, the system collects a list of high-IO processes (with per-process read/write byte counts and open file details) and kernel call stacks of processes waiting in IO scheduling.
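The "two consecutive samples" rule can be sketched as a per-device streak counter. This is an illustration, not the tool's code: `ConsecutiveBreachDetector` is a hypothetical name, the sketch assumes the same metric must breach twice in a row on the same device, and it assumes the single `await_threshold` applies to both read and write wait time. Excluding md devices is left out for brevity.

```python
from collections import defaultdict
from typing import Dict, List

# Defaults from the configuration table; metric names follow the iostatus fields.
THRESHOLDS = {
    "read_bps": 2000,    # MB/s
    "write_bps": 1500,   # MB/s
    "io_util": 90,       # %
    "read_await": 100,   # ms (assumed to share await_threshold)
    "write_await": 100,  # ms (assumed to share await_threshold)
}


class ConsecutiveBreachDetector:
    def __init__(self, thresholds: Dict[str, float], required: int = 2):
        self.thresholds = dict(thresholds)
        self.required = required
        self.streaks = defaultdict(int)  # (device, metric) -> breaches in a row

    def observe(self, device: str, sample: Dict[str, float]) -> List[str]:
        """Return the metrics that have now breached `required` times in a row."""
        fired = []
        for metric, limit in self.thresholds.items():
            key = (device, metric)
            if sample.get(metric, 0) > limit:
                self.streaks[key] += 1
                if self.streaks[key] >= self.required:
                    fired.append(metric)
            else:
                self.streaks[key] = 0  # any calm sample resets the streak
        return fired


det = ConsecutiveBreachDetector(THRESHOLDS)
busy = {"read_bps": 120, "read_await": 12, "write_bps": 2100,
        "write_await": 145, "io_util": 95}
first = det.observe("sda", busy)
second = det.observe("sda", busy)
print(first, second)
```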
Storage: Event data is automatically stored in Elasticsearch or a local disk file.
Sample Data
{
"tracer_name": "iotracing",
"tracer_data": {
"reason_snapshot": {
"type": "ioutil",
"device": "sda",
"iostatus": {
"read_bps": 120,
"read_iops": 450,
"read_await": 12,
"write_bps": 2100,
"write_iops": 890,
"write_await": 145,
"io_util": 95,
"queue_size": 32
}
},
"process_io_data": [
{
"pid": 12345,
"comm": "java",
"container_hostname": "app-pod-xxx",
"fs_read": 0,
"fs_write": 52428800,
"disk_read": 0,
"disk_write": 49152000,
"file_stat": ["/data/logs/app.log"],
"file_count": 1
}
],
"timeout_io_stack": [
{
"pid": 12345,
"comm": "java",
"container_hostname": "app-pod-xxx",
"latency_us": 250000,
"stack": {
"back_trace": [
"io_schedule+0x18/0x40",
"ext4_file_write_iter+0x2a0/0x4c0"
]
}
}
]
}
}
Field Descriptions
- reason_snapshot: Snapshot of the condition that triggered IO collection
  - type: Trigger type (`ioutil` IO utilization / `read_bps` read throughput / `write_bps` write throughput / `read_await` read wait time / `write_await` write wait time)
  - device: Name of the disk device that exceeded the threshold
  - iostatus: Disk IO metric snapshot at trigger time (`read_bps` / `write_bps` in MB/s, `read_await` / `write_await` in ms, `io_util` in %, `queue_size` is queue depth)
- process_io_data: List of high-IO processes. Each record contains:
  - pid / comm: Process PID and name
  - container_hostname: Container hostname of the process (empty for host processes)
  - fs_read / fs_write: Bytes read/written at the filesystem layer
  - disk_read / disk_write: Bytes actually read/written at the disk layer
  - file_stat: List of file paths currently open by the process
  - file_count: Total number of files open by the process
- timeout_io_stack: Call stacks of processes waiting in IO scheduling. Each record contains:
  - pid / comm: Process PID and name
  - container_hostname: Container hostname of the process
  - latency_us: IO wait duration (microseconds)
  - stack.back_trace: List of kernel call stack frames
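When reading a stored record, a common first step is to rank the entries in `process_io_data` by the bytes that actually reached the disk. The helper below is a hypothetical convenience function (`top_disk_io` is not part of the tool), and the second process in the example is invented purely for illustration.

```python
from typing import Any, Dict, List, Tuple


def top_disk_io(tracer_data: Dict[str, Any], n: int = 3) -> List[Tuple[int, str, int]]:
    """Rank process_io_data records by disk_read + disk_write (bytes)."""
    procs = tracer_data.get("process_io_data", [])
    ranked = sorted(procs, key=lambda p: p["disk_read"] + p["disk_write"],
                    reverse=True)
    return [(p["pid"], p["comm"], p["disk_read"] + p["disk_write"])
            for p in ranked[:n]]


tracer_data = {
    "process_io_data": [
        # Hypothetical extra process, added only to show the ordering.
        {"pid": 222, "comm": "rsync", "disk_read": 1048576, "disk_write": 0},
        # Values from the sample record above.
        {"pid": 12345, "comm": "java", "disk_read": 0, "disk_write": 49152000},
    ]
}
print(top_disk_io(tracer_data))
```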
5. memburst
Description: Periodically samples host anonymous memory usage and maintains a sliding window of 60 samples (corresponding to 600 seconds). A trigger fires when current anonymous memory is ≥ 2× the oldest sample in the window and anonymous memory accounts for ≥ 70% of total host memory. On trigger, the top N processes by memory consumption (default 10) are collected, recording their PID, process name, and RSS memory size. A 30-minute cooldown applies.
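The sliding-window condition can be sketched with a bounded deque. This is a minimal illustration under assumptions: `MemBurstDetector` is a hypothetical name, the current sample is counted as part of the window, and the comparison runs even before the window is full. The defaults mirror `memburst.sliding_window_length`, `memburst.delta_memory_burst`, and `memburst.delta_anon_threshold`.

```python
from collections import deque


class MemBurstDetector:
    def __init__(self, window: int = 60, burst_pct: float = 100.0,
                 anon_pct: float = 70.0):
        self.samples = deque(maxlen=window)  # oldest sample sits at index 0
        self.burst_pct = burst_pct
        self.anon_pct = anon_pct

    def observe(self, anon_bytes: int, total_bytes: int) -> bool:
        """Record one sample; True when both burst conditions hold."""
        self.samples.append(anon_bytes)
        oldest = self.samples[0]
        if oldest <= 0 or total_bytes <= 0:
            return False
        growth_pct = 100.0 * (anon_bytes - oldest) / oldest  # 100% means 2x
        share_pct = 100.0 * anon_bytes / total_bytes
        return growth_pct >= self.burst_pct and share_pct >= self.anon_pct


GiB = 1 << 30
det = MemBurstDetector()
results = [det.observe(anon * GiB, 100 * GiB) for anon in (30, 50, 72)]
print(results)
```

Requiring both conditions keeps a small host from triggering on a doubling that is still a minor share of memory, and keeps a steadily full host from triggering without a burst.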
Storage: Event data is automatically stored in Elasticsearch or a local disk file.
Sample Data
{
"tracer_name": "memburst",
"tracer_data": {
"top_memory_usage": [
{
"pid": 3456,
"process_name": "java",
"memory_size": 8589934592
},
{
"pid": 3789,
"process_name": "python3",
"memory_size": 2147483648
}
]
}
}
Field Descriptions
- top_memory_usage: List of top memory-consuming processes sorted by RSS in descending order. Each record contains:
  - pid: Process PID
  - process_name: Process name
  - memory_size: Process RSS memory usage (bytes)
Principle
Architecture
AutoTracing is built on periodic polling, combined with eBPF call stack collection and perf flame graph generation, to collect anomaly diagnostic data at the kernel level with low overhead.
graph TB
subgraph "Data Sources"
P1["/proc/stat\n(Host CPU utilization)"]
P2["cgroup CPU stats\n(Container CPU utilization)"]
P3["netlink / cgroup\n(Container process states / load average)"]
P4["/proc/diskstats\n(Disk IO metrics)"]
P5["/proc/meminfo\n+ cgroup memory stats"]
end
subgraph "HUATUO AutoTracing"
DT["Threshold Detection\n(sliding window / EMA / two consecutive breaches)"]
BO["Cooldown\n(30-minute backoff)"]
PERF["perf Flame Graph\n(system-wide / container-scoped)"]
BPF["eBPF kprobe\n(IO scheduling latency tracing)"]
CM["Container Correlation\n(cgroup → ContainerID)"]
end
subgraph "Storage"
ES["Elasticsearch"]
DISK["Local Disk File"]
end
P1 --> DT
P2 --> DT
P3 --> DT
P4 --> DT
P5 --> DT
DT --> BO
BO --> PERF
BO --> BPF
PERF --> CM
BPF --> CM
CM --> ES
CM --> DISK
Event Processing Flow
sequenceDiagram
participant M as Periodic Metric Collection
participant D as Threshold Detector
participant B as Cooldown (backoff)
participant C as On-site Data Collector
participant S as Storage
M->>D: Push metrics (every 10s)
D->>D: Evaluate threshold (sliding window / EMA / consecutive)
alt Threshold exceeded
D->>B: Check cooldown state
alt Trigger allowed
B->>C: Trigger collection<br/>(perf flame graph / D-state stacks / IO process list)
C->>C: Correlate container info (cgroup → ContainerID)
C->>S: Persist data (Elasticsearch / local file)
else In cooldown
B-->>D: Skip this trigger
end
end
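The cooldown step in the flow above can be sketched as a small gate keyed by event and target. This is an assumption-laden illustration: the `Cooldown` class and its `allow` method are hypothetical, the key shape `(tracer_name, container_id)` is one plausible way to get per-container behavior, and the injectable clock exists only to make the example testable. The 1800-second period is the documented `interval_tracing` default.

```python
import time
from typing import Callable, Dict, Hashable


class Cooldown:
    """Allows at most one trigger per key within `period_s` seconds."""

    def __init__(self, period_s: float = 1800.0,
                 clock: Callable[[], float] = time.monotonic):
        self.period_s = period_s
        self.clock = clock
        self.last_fired: Dict[Hashable, float] = {}

    def allow(self, key: Hashable) -> bool:
        now = self.clock()
        last = self.last_fired.get(key)
        if last is not None and now - last < self.period_s:
            return False           # still cooling down: skip this trigger
        self.last_fired[key] = now  # record the trigger time
        return True


# A fake clock stands in for real time in this example.
t = [0.0]
gate = Cooldown(clock=lambda: t[0])
a = gate.allow(("cpuidle", "container-1"))  # first trigger is allowed
t[0] = 600.0
b = gate.allow(("cpuidle", "container-1"))  # 10 minutes later: suppressed
c = gate.allow(("cpuidle", "container-2"))  # other container: independent
t[0] = 1800.0
d = gate.allow(("cpuidle", "container-1"))  # cooldown elapsed: allowed again
print(a, b, c, d)
```

Checking the gate only after a threshold breach, as the sequence diagram shows, means detection keeps running during cooldown and only the expensive collection step is skipped.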